2
0
mirror of https://github.com/openvswitch/ovs synced 2025-08-22 01:51:26 +00:00

2883 Commits

Author SHA1 Message Date
Eli Britstein
270de5dfb8 ofproto: Move group-modify to mod_start instead of mod_finish.
Upon modifying a group, the following steps occur:
1. ofproto_group_mod_start()->modify_group_start():
   Find an old group object, create a new one.
2. ofproto_bump_tables_version()
3. ofproto_group_mod_finish():
   Modify the new group object with buckets etc.

At step #3, the new group object is already in use by revalidators,
that may read incorrect data while being modified.

Instead, move the group modification of the new object to step #1.

Fixes: 0a8f6beb54ab ("ofproto-dpif: Fix dp_hash mapping after select group modification.")
Acked-by: Gaetan Rivet <gaetanr@nvidia.com>
Acked-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Eli Britstein <elibr@nvidia.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2025-07-22 20:06:34 +02:00
Ilya Maximets
0d9dc8e9ca dpif-netlink: Provide original upcall pid in 'execute' commands.
When a packet enters kernel datapath and there is no flow to handle it,
packet goes to userspace through a MISS upcall.  With per-CPU upcall
dispatch mechanism, we're using the current CPU id to select the
Netlink PID on which to send this packet.  This allows us to send
packets from the same traffic flow through the same handler.

The handler will process the packet, install required flow into the
kernel and re-inject the original packet via OVS_PACKET_CMD_EXECUTE.

While handling OVS_PACKET_CMD_EXECUTE, however, we may hit a
recirculation action that will pass the (likely modified) packet
through the flow lookup again.  And if the flow is not found, the
packet will be sent to userspace again through another MISS upcall.

However, the handler thread in userspace is likely running on a
different CPU core, and the OVS_PACKET_CMD_EXECUTE request is handled
in the syscall context of that thread.  So, when the time comes to
send the packet through another upcall, the per-CPU dispatch will
choose a different Netlink PID, and this packet will end up processed
by a different handler thread on a different CPU.

The process continues as long as there are new recirculations, each
time the packet goes to a different handler thread before it is sent
out of the OVS datapath to the destination port.  In real setups the
number of recirculations can go up to 4 or 5, sometimes more.

There is always a chance to re-order packets while processing upcalls,
because userspace will first install the flow and then re-inject the
original packet.  So, there is a race window when the flow is already
installed and the second packet can match it inside the kernel and be
forwarded to the destination before the first packet is re-injected.
But the fact that packets are going through multiple upcalls handled
by different userspace threads makes the reordering noticeably more
likely, because we not only have a race between the kernel and a
userspace handler (which is hard to avoid), but also between multiple
userspace handlers.

For example, let's assume that 10 packets got enqueued through a MISS
upcall for handler-1, it will start processing them, will install the
flow into the kernel and start re-injecting packets back, from where
they will go through another MISS to handler-2.  Handler-2 will install
the flow into the kernel and start re-injecting the packets, while
handler-1 continues to re-inject the last of the 10 packets, they will
hit the flow installed by handler-2 and be forwarded without going to
the handler-2, while handler-2 still re-injects the first of these 10
packets.  Given multiple recirculations and misses, these 10 packets
may end up completely mixed up on the output from the datapath.

Let's provide the original upcall PID via the new netlink attribute
OVS_PACKET_ATTR_UPCALL_PID.  This way the upcall triggered during the
execution will go to the same handler.  Packets will be enqueued to
the same socket and re-injected in the same order.  This doesn't
eliminate re-ordering as stated above, since we still have a race
between the kernel and the handler thread, but it allows to eliminate
races between multiple handlers.

The openvswitch kernel module ignores unknown attributes for the
OVS_PACKET_CMD_EXECUTE, so it's safe to provide it even on older
kernels.

Reported-at: https://issues.redhat.com/browse/FDP-1479
Link: https://lore.kernel.org/netdev/20250702155043.2331772-1-i.maximets@ovn.org/
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2025-07-10 12:20:54 +02:00
Eelco Chaudron
d1bd62dae5 ofproto-dpif-upcall: Check odp_tun_key_from_attr() return value.
In the IPFIX and flow sample upcall handling, check the validity
of the tunnel key returned by odp_tun_key_from_attr(). If the
tunnel key is invalid, return an error.

This was reported by Coverity, but the change also improves
robustness and avoids undefined behavior in the case of malformed
tunnel attributes.

Fixes: 8b7ea2d48033 ("Extend OVS IPFIX exporter to export tunnel headers")
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Aaron Conole <aconole@redhat.com>
2025-06-10 17:04:09 +02:00
Eelco Chaudron
88737f02ed ofproto-dpif-xlate: Fix memory leak in xlate_generic_encap_action().
This is not a real issue, as the initializer function,
rewrite_flow_push_nsh(), ensures it returns NULL on error.
However, cleaning this up improves code clarity and resolves
a Coverity warning about a potential memory leak.

Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
2025-06-10 17:03:39 +02:00
Eli Britstein
1015b13f05 ofproto-dpif-xlate: Add a drop action for native tunnel failure.
Upon tunnel output failure, due to routing failure for example, add an
explicit drop action, so it will appear in the dp-flows for better
visibility for that case.
For those, add additional drop reasons.

Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Eli Britstein <elibr@nvidia.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2025-04-30 22:41:19 +02:00
Vladislav Odintsov
1724a293d0 ofproto: Log bond rebalancing stats once in rebalance run.
Rebalancing stats are printed once in a run with INFO log level.
Old detailed per-hash rebalancing stats now printed with DBG log level.

Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2025-March/422028.html
Suggested-by: Ilya Maximets <i.maximets@ovn.org>
Signed-off-by: Vladislav Odintsov <vlodintsov@k2.cloud>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2025-04-02 13:46:49 +02:00
Eli Britstein
ce77927bfe ofproto-dpif-xlate: Embed support check in put_drop_action.
All the calls to put_drop_action checks first if it is supported.
Instead, embed the check inside.

Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Eli Britstein <elibr@nvidia.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2025-04-02 12:01:54 +02:00
Eelco Chaudron
ba675897e4 ofproto-dpif: Fix spelling in comments and the support field macro.
Acked-by: Simon Horman <horms@ovn.org>
Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Aaron Conole <aconole@redhat.com>
2025-03-28 10:42:39 -04:00
Ilya Maximets
0b686a29b7 ofproto-dpif: Fix dp_hash mapping after select group modification.
When a new bucket is inserted to an existing group, OVS creates a new
group with just that one bucket and then copies old buckets into it.
The problem is that dp_hash mappings remain initialized for that one
bucket and no traffic can be sent to any of the old buckets.
If a bucket is removed, then OVS creates an empty new group and then
copies old buckets into it, except for the removed one.  Mappings are
also not updated in this case and the group behaves as if it had no
buckets at all.

We need to re-hash all the buckets after the copy of the old buckets.
ofproto-provider API already has a callback for this case, but it just
wasn't implemented for ofproto-dpif.  It wasn't necessary in the past,
but it became necessary when the hash map construction was moved to
the ofproto-dpif layer.

No locking in this function is required, because the new group must
not be visible during modification.  It becomes visible only after the
GROUP_MOD processing is finished.

Fixes: 2e3fd24c7c44 ("ofproto-dpif: Improve dp_hash selection method for select groups")
Reported-at: https://github.com/openvswitch/ovs-issues/issues/357
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Signed-off-by: Aaron Conole <aconole@redhat.com>
2025-03-21 13:22:26 -04:00
Dima Chumak
1898112c8c ofproto: Add JSON output for 'fdb/show' command.
The 'fdb/show' command now supports machine-readable JSON output in
addition to the plain-text output for humans. An example would be:

  ovs-appctl --format=json --pretty fdb/show br-phy
  [
    {
      "static": true,
      "mac": "e4:8c:07:08:00:02",
      "port": 2,
      "vlan": 0
    },
    {
      "age": 14,
      "mac": "e4:8c:07:08:00:03",
      "port": 3,
      "vlan": 0
    }
  ]

Acked-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Dima Chumak <dchumak@nvidia.com>
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
2025-03-12 15:23:52 +01:00
Ilya Maximets
19b8941620 tunnels: Remove support for deprecated STT and LISP.
STT and LISP tunnel types were deprecated and marked for removal in
the following commits in the OVS 3.5 release:

  3b37a6154a59 ("netdev-vport: Deprecate STT tunnel port type.")
  8d7ac031c03d ("netdev-vport: Deprecate LISP tunnel port type.")

Main reasons were that STT was rejected in upstream kernel and the
LISP was never upstreamed as well and doesn't really have a supported
implementation.  Both protocols also appear to have lost their former
relevance.

Removing both now.  While at it, also fixing some small documentation
issues and comments.

Acked-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Alin Serdean <aserdean@ovn.org>
Acked-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2025-02-28 17:19:41 +01:00
Eelco Chaudron
410e0f519f ofproto-dpif-xlate: Fix source IP lookup for non-bridge ports.
The commit "userspace: Enable non-bridge port as tunnel endpoint"
did not account for the ovs_router_get_netdev_source_address()
call. This patch ensures that the fix from "ofproto-dpif-xlate:
Fix netdev native tunnel neighbor discovery" remains functional
by properly calling the function with the output devices.

Fixes: dc0bd12f5b04 ("userspace: Enable non-bridge port as tunnel endpoint.")
Acked-by: Simon Horman <horms@ovn.org>
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
2025-02-26 13:17:55 +01:00
Eelco Chaudron
ce20ca0631 ofproto-dpif-xlate: Fix memory leak in xlate_generic_encap_action().
Ensure that xlate_generic_encap_action() does not return a buffer
on error.

Acked-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
2025-02-10 09:27:58 +01:00
Eelco Chaudron
f7c85a730b ofproto: Fix potential NULL pointer dereference in ofproto_type_xx().
Ensure a valid class is found before calling it's function pointer.
If it's not found, call ovs_assert().

Acked-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
2025-02-10 09:27:43 +01:00
Eelco Chaudron
9e784ed6c2 ofproto: Fix potential null-ptr dereference in meter_insert_rule().
Add an ovs_assert() in  this case, as a valid meter should exists.

Acked-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
2025-02-10 09:27:23 +01:00
Eelco Chaudron
5f6b8facad ofproto-dpif-ipfix: Add NULL check to dpif_ipfix_set_options().
Coverity complains that we could potentially dereference a
NULL pointer. To prevent this warning, add an ovs_assert().

Acked-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
2025-02-10 09:27:07 +01:00
Ales Musil
b272282abb ofproto: Fix default pmd_id for ofproto/detrace.
The system and netdev datapath have different default pmd_id, which
resulted in empty output of ofproto/detrace with kernel datapath.
Also indicate that the UFID or cache wasn't available.

Make sure we use the correct default pmd_id when it's not specified
as an argument. At the same time move slightly adjusted test into
system tests so it is tested with both datapaths.

Fixes: 600125b2c380 ("ofproto: Add ofproto/detrace command to map UFIDs to OpenFlow.")
Signed-off-by: Ales Musil <amusil@redhat.com>
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
2024-12-12 10:57:39 +01:00
Ilya Maximets
464bc6f9c4 ofproto-dpif-upcall: Fix use of uninitialized missed dumps counter.
The first time revalidator checks the value - it is not initialized, so
we may end up marking valid flows for deletion.

 WARNING: MemorySanitizer: use-of-uninitialized-value
  0 0x6ee9e9 in revalidator_sweep__ ofproto/ofproto-dpif-upcall.c:3003:25
  1 0x6ed671 in revalidator_purge ofproto/ofproto-dpif-upcall.c:3056:5
  2 0x6e997d in udpif_stop_threads ofproto/ofproto-dpif-upcall.c:566:17
  3 0x6ecf05 in udpif_flush ofproto/ofproto-dpif-upcall.c:756:5
  4 0x60323e in flush ofproto/ofproto-dpif.c:2020:9
  5 0x56b10e in ofproto_flush__ ofproto/ofproto.c:1669:9
  6 0x56a67b in ofproto_destroy ofproto/ofproto.c:1821:5
  7 0x4c9012 in bridge_destroy vswitchd/bridge.c:3644:9
  8 0x4c7c13 in bridge_exit vswitchd/bridge.c:556:9
  9 0x5261a8 in main vswitchd/ovs-vswitchd.c:147:5
 10 0x7fa0bb in __libc_start_call_main
 11 0x7fa0bb in __libc_start_main@GLIBC_2.2.5
 12 0x432b24 in _start (vswitchd/ovs-vswitchd+0x432b24)

Fixes: 180ab2fd635e ("ofproto-dpif-upcall: Avoid stale ukeys leaks.")
Acked-by: Mike Pattrick <mkp@redhat.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2024-11-29 18:32:52 +01:00
Ilya Maximets
2af7cef264 ofproto: Enable address prefix tracking for IPv6 by default.
It is common to have flow tables with OpenFlow rules for both IPv4 and
IPv6 traffic at the same time.  One major example is OVN.

Today only IPv4 addresses and L4 ports are prefix matched by default.
Since recently, it's possible to turn on the optimization for all
4 fields at the same time (nw_dst, nw_src, ipv6_dst and ipv6_src),
but it is an inconvenience for users.  IPv6 configurations become more
and more common in the real world, so having IPv6 tables not optimized
by default is not good.

Enable prefix tree lookups for IPv6 addresses by default in addition
to IPv4.

This change increases memory consumption slightly as well as takes a
few extra cycles per flow to create and later iterate over additional
prefix trees.  However, IPv4 and IPv6 matches are mutually exclusive,
so performance overhead should be minimal, especially in comparison
with the benefits this configuration brings to IPv6 setups reducing
the number of datapath flows by default.

A new test added to check that defaults are working as expected.

Acked-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2024-11-27 19:20:04 +01:00
Mike Ovsiannikov
9fa73098c4 ofproto/ofproto: Initialize learn add rule flag.
Initialize learn_adds_rule in order to prevent
crash in ofproto_flow_mod_learn_finish() due to
being invoked with uninitialized rules
collections.

Here is the stack trace of the failure / segmentation fault:

orig_ofproto=orig_ofproto@entry=0x1d39220) at ofproto/ofproto-provider.h:509
xcache=<optimized out>, stats=0x7fff8e76eb10) at ofproto/ofproto-dpif.c:4966
ofproto/ofproto-dpif.c:5811
oh=<optimized out>) at ofproto/ofproto.c:3851
oh=<optimized out>, ofconn=0x20689a0) at ofproto/ofproto.c:8823
<handle_openflow>, ofconn=0x20689a0) at ofproto/connmgr.c:1329
handle_openflow=handle_openflow@entry=0x7f33b423e560 <handle_openflow>) at
ofproto/connmgr.c:356
vswitchd/ovs-vswitchd.c:132

The learn.ofm structure appears to be valid at this point, except that learn_adds_rule and
old_rules new_rules fields does not appear to be initialized:

(gdb) p entry->learn.ofm[0]
$3 = {
 temp_rule = 0x15fb520,
 criteria = {
   table_id = 1 '\001',
   cr = {
     node = {
       prev = 0x15bd570,
       next = {
         p = 0x15bd570
       }
     },
     priority = 31,
     cls_match = {
       p = 0x0
     },
     match = {
       {
         {
           flow = 0x21b77a0,
           mask = 0x21b77d0
         },
         flows = {0x21b77a0, 0x21b77d0}
       },
       tun_md = 0x0
     }
   },
   version = 18446744073709551614,
   cookie = 0,
   cookie_mask = 0,
   out_port = 65535,
   out_group = 4294967295,
   include_hidden = false,
   include_readonly = false
 },
 conjs = 0x0,
 n_conjs = 0,
 command = 2,
 modify_cookie = true,
 modify_may_add_flow = true,
 modify_keep_counts = true,
 event = 492535614,
 table_id = 1 '\001',
 version = 492535619,
 learn_adds_rule = 2,
 old_rules = {
   collection = {
     objs = 0x1,
     n = 492535617,
     capacity = 5,
     stub = {0x1d5b7f41, 0x30, 0x0, 0x0, 0x0}
   }
 },
 new_rules = {
   collection = {
     objs = 0x0,
     n = 0,
     capacity = 0,
     stub = {0x0, 0x0, 0x0, 0x600000000, 0x0}
   }
 }
}

Signed-off-by: Mike Ovsiannikov <mike.ovsiannikov@nutanix.com>
Signed-off-by: Aaron Conole <aconole@redhat.com>
2024-11-11 13:42:28 -05:00
Mike Pattrick
b3e08faf9a bond: Always revalidate unbalanced bonds when active member changes.
Currently a bond will not always revalidate when an active member
changes. This can result in counter-intuitive behaviors like the fact
that using ovs-appctl bond/set-active-member will cause the bond to
revalidate but changing other_config:bond-primary will not trigger a
revalidate in the bond.

When revalidation is not set but the active member changes in an
unbalanced bond, OVS may send traffic out of previously active member
instead of the new active member.

This change will always mark unbalanced bonds for revalidation if the
active member changes.

Reported-at: https://issues.redhat.com/browse/FDP-845
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2024-10-30 13:09:34 +01:00
Mike Pattrick
49a249fb28 ofproto-dpif-upcall: Fix redundant mirror on metadata modification.
Previously a commit attempted to reset the mirror context when packets
were modified. However, this commit erroneously also reset the mirror
context when only a packet's metadata was modified. An intermediate
commit corrected this for tunnel metadata, but now that correction is
extended to other forms of metadata as well.

Fixes: feed7f677505 ("ofproto-dpif-upcall: Mirror packets that are modified.")
Reported-at: https://issues.redhat.com/browse/FDP-699
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Simon Horman <horms@ovn.org>
Signed-off-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2024-10-30 12:58:40 +01:00
Ilya Maximets
cb64234786 ofproto-dpif: Improve load balancing in dp_hash select groups.
ovn-kubernetes users observe uneven distribution of traffic among
service backends.  It gets noticeably worse when 9+ backends are
configured.  For example:

    Server        connections
    server-6nbg7     634
    server-72br8     326  <-----
    server-7b8rm     605
    server-9tnz7     655
    server-c84hw     596
    server-mjskl     622
    server-ptkcc     336  <-----
    server-rl7pk     592
    server-xd2cb     634

Here we can see that out of 5000 connections total, servers '72br8'
and 'ptkcc' received about 2 times less connections than any other
server.  With 10 backends, 4 servers will receive 2x less, etc.
The situation is common, because kubernetes users frequently scale
their applications with something like kubectl scale --replicas=10.

Services are implemented as OVN load-balancers, which are implemented
as OpenFlow select groups with a dp_hash selection method.  Balancing
is implemented by allocating a hash space (number of hashes is a
power of 2) and distributing hashes between the group buckets using
Webster's method taking into account bucket weights.

This works well enough when buckets have different weights, because
OVS needs to allocate larger hash space in order to accommodate the
weight difference.  However, in case of OVN load-balancers, all the
buckets (backends) have the same weight.  This leads to allocation
of a smaller hash space just enough to fit all the backends (closest
power of 2).  For example, if we have 10 backends (buckets), then
16 hashes will be allocated.  After hashes are distributed, we end
up with 6 buckets having 2 hashes each and 4 buckets having 1 hash.
So, each of the 6 buckets will receive on average 2x more traffic
than each of the remaining 4.

The problem can be mitigated by expanding the hash space and this is
already done for cases with small number of buckets by limiting the
hash space to have at least 16 hashes.  However, it is just not large
enough for 9 or 10 buckets to provide a good distribution.  At the
same time, blindly increasing the limit is also not a good solution
as we do not want too large of a hash space for small number of
buckets, simply because each hash is a separate datapath flow and
having too many of them may cause performance issues.

Approach taken in this change is to ensure that the hash space is
at least 4 times larger than the number of buckets, but not larger
than the maximum allowed (256).  This provides a better distribution
while not unnecessarily exploding number of datapath flows for
services with not that many backends.

Here is some data to demonstrate why the 4 was chosen as a coefficient:

  coeff.  :   1         2         3         4         5        100
  -------------------------------------------------------------------
  AvgDiff : 43.1 %    27.1 %    18.3 %    15.1 %    13.6 %    10.9 %
  MaxDiff : 50.0 %    33.3 %    25.0 %    20.0 %    20.0 %    20.0 %
  AvgDev  : 24.9 %    13.4 %     8.5 %     6.9 %     6.1 %     4.8 %
  MaxDev  : 35.4 %    20.4 %    14.4 %    11.2 %    11.2 %    11.2 %
  --------+----------------------------------------------------------
    16    :   1         1         1         1         1         -
    32    :  17         9         6         5         4         -
    64    :  33        17        11         9         7         -
   128    :  65        33        22        17        13         1
   256    : 129        65        43        33        26         2
  --------+----------------------------------------------------------
           current                       proposed

Table shows average and maximum load difference (Diff) between backends
across groups with 1 to 64 equally weighted backends.  And it shows
average and maximum standard deviation (Dev) of load distribution for
the same.  For example, with a coefficient 2, the maximum difference
between two backends will be 33% and the maximum standard deviation
will be 20.4%.  With the current logic (coefficient of 1) we have
maximum difference as high as 50%, as shown with the example at the
beginning, with the standard deviation of 35.4%.

The bottom half of the table shows from how many backends we start to
use a particular number of buckets.  For example, with a coeff. 3
we will have 16 hashes for 1 to 5 buckets, 32 hashes for 6-10, 64
buckets for 11-21 and so on.

According to the table, the number 4 is about where we achieve a good
enough standard deviation for the load (11.2%) while still not creating
too many hashes for cases with low number of backends.  The standard
deviation also doesn't go down that much with higher coefficient.

The data is aggregated for up to 64 backends because with higher
numbers we're getting close to the cap of 256 hashes, so deviation
increases.  However, the load difference with so many backends should
have lower impact on a cluster in general.  The change improves the
cases with 64+ backends, but to get very good numbers we'll need to
consider moving the upper limit for the hash space, which is a separate
change to think about.

Added test checks that standard deviation of the load distribution
between buckets is below 12.5% for all cases up to 64 buckets.  It's
just a pretty number that I've chosen that should be IMO good enough.

Note: when number of buckets is a power of 2, then we have an ideal
distribution with zero deviation, because hashes can be evenly mapped
to buckets.

Note 2: The change doesn't affect distribution for 4 or less backends,
since there are still minimum 16 hashes for those groups.

Reported-at: https://issues.redhat.com/browse/FDP-826
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2024-10-07 21:17:06 +02:00
Jonathan Davies
5cb19214e9 ofproto/bond: Preserve active bond member over restarts.
The OVS DB has a bond_active_slave field. This gets read by
port_configure_bond() into struct bond_settings' active_member_mac
field. See commit 3e5aeeb5 ("bridge: Keep bond active slave selection
across OVS restart") for the original rationale for preserving the
active bond member.

But since commit 30353934 ("ofproto/bond: Validate active-slave mac.")
the bond_settings' active_member_mac field is ignored by bond_create(),
which set bond->active_member_mac to eth_addr_zero.

Instead, set it to the value of the bond_settings' active_member_mac so
that the selection is preserved across OVS restarts.

Also add a test that checks this behaviour.

Fixes: 303539348848 ("ofproto/bond: Validate active-slave mac.")
Acked-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Jonathan Davies <jonathan.davies@nutanix.com>
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
2024-09-11 15:36:02 +02:00
Eelco Chaudron
252ee0f182 dpif: Fix flow put debug message match content.
The odp_flow_format() function applies a wildcard mask when a
mask for a given key was not present. However, in the context of
installing flows in the datapath, the absence of a mask actually
indicates that the key should be ignored, meaning it should not
be masked at all.

To address this inconsistency, odp_flow_format() now includes an
option to skip formatting keys that are missing a mask.

This was found during a debug session of the ‘datapath - ping between two
ports on cvlan’ test case. The log message was showing the following:

  put[create] ufid:XX recirc_id(0),dp_hash(0/0),skb_priority(0/0),in_port(3),
    skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),
    eth(src=12:f6:8b:52:f9:75,dst=6e:48:c8:77:d3:8c),eth_type(0x88a8),
    vlan(vid=4094,pcp=0/0x0),encap(eth_type(0x8100),
    vlan(vid=100/0x0,pcp=0/0x0),encap(eth_type(0x0800),
    ipv4(src=10.2.2.2,dst=10.2.2.1,proto=1,tos=0,ttl=64,frag=no),
    icmp(type=0,code=0))), actions:2

Where it should have shown the below:

  put[create] ufid:XX recirc_id(0),dp_hash(0/0),skb_priority(0/0),in_port(3),
    skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),
    eth(src=12:f6:8b:52:f9:75,dst=6e:48:c8:77:d3:8c),eth_type(0x88a8),
    vlan(vid=4094,pcp=0/0x0),encap(eth_type(0x8100)), actions:2

Acked-by: Simon Horman <horms@ovn.org>
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
2024-09-09 10:22:16 +02:00
Eelco Chaudron
180ab2fd63 ofproto-dpif-upcall: Avoid stale ukeys leaks.
It is observed in some environments that there are much more ukeys than
actual DP flows. For example:

$ ovs-appctl upcall/show
system@ovs-system:
flows : (current 7) (avg 6) (max 117) (limit 2125)
offloaded flows : 525
dump duration : 1063ms
ufid enabled : true

23: (keys 3612)
24: (keys 3625)
25: (keys 3485)

The revalidator threads are busy revalidating the stale ukeys leading to
high CPU and long dump duration.

This patch tracks the number of consecutive missed dumps. If four dumps
are missed in a row, it is assumed that the datapath flow no longer
exists, and the ukey can be deleted.

Reported-by: Roi Dayan <roid@nvidia.com>
Co-authored-by: Han Zhou <hzhou@ovn.org>
Co-authored-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Aaron Conole <aconole@redhat.com>
2024-08-30 07:56:58 -04:00
Mike Pattrick
70bc3baaad ofproto-dpif-xlate: Add a recursion limit to tunnel address lookup.
In deployments with multiple tunnels, it can be possible to enter an
infinite loop where traffic generates an arp/nd lookup which is
forwarded to a different tunnel, generating a new arp/nd packet that is
send back to the first tunnel. Now the depth counter is incremented for
new arp/nd packets, just as happens in xlate_table_action().

Signed-off-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2024-08-09 22:19:51 +02:00
Ilya Maximets
f9078407a9 ofproto-dpif-xlate: Initialize observe_offset for sample actions.
For some reason gcc 14.1.1 from Fedora 41 thinks that the variable
may end up not initialized:

 ofproto/ofproto-dpif-xlate.c: In function 'compose_sample_action':
 ofproto/ofproto-dpif-xlate.c:3465:40:
          error: 'observe_offset' may be used uninitialized
  3465 |         ctx->xout->last_observe_offset = observe_offset;
       |         ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~
 ofproto/ofproto-dpif-xlate.c:3418:12:
          note: 'observe_offset' was declared here
  3418 |     size_t observe_offset;
       |            ^~~~~~~~~~~~~~

We have an assertion in the code to ensure that at least one of
the actions is present (userspace or psample), so the variable
should actually be always initialized.

Initialize explicitly just to silence the warning.

Fixes: 516569d31fbf ("ofproto: xlate: Make sampled drops explicit.")
Acked-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2024-07-17 20:32:45 +02:00
Ilya Maximets
f973d9543e ofproto-dpif-xlate: Remove misleading wc NULL check in packet mirror.
'wc' can't be NULL there and if it can we'd already crash a few lines
before setting up vlan flags.

The check is misleading as it makes people to assume that wc can be
NULL.  And it makes Coverity think the same:

  CID 1596572: (#1 of 1): Dereference after null check (FORWARD_NULL)
  25. var_deref_op: Dereferencing null pointer ctx->wc.

  14. var_compare_op: Comparing ctx->wc to null implies that ctx->wc
      might be null

Remove the check.

Fixes: 3b1882261c8b ("ofproto-dpif-mirror: Add support for pre-selection filter.")
Acked-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2024-07-17 16:29:37 +02:00
Mike Pattrick
3b1882261c ofproto-dpif-mirror: Add support for pre-selection filter.
Currently a bridge mirror will collect all packets and tools like
ovs-tcpdump can apply additional filters after they have already been
duplicated by vswitchd. This can result in inefficient collection.

This patch adds support to apply pre-selection to bridge mirrors, which
can limit which packets are mirrored based on flow metadata. This
significantly improves overall vswitchd performance during mirroring if
only a subset of traffic is required.

Signed-off-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2024-07-16 00:28:23 +02:00
Mike Pattrick
04c090c61e ofproto-dpif-mirror: Reduce number of function parameters.
Previously the mirror_set() and mirror_get() functions took a large
number of parameters, which was inefficient and difficult to read and
extend. This patch moves most of the parameters into a struct.

Acked-by: Simon Horman <horms@ovn.org>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2024-07-15 23:34:59 +02:00
Dumitru Ceara
600125b2c3 ofproto: Add ofproto/detrace command to map UFIDs to OpenFlow.
It improves the debugging experience if we can easily get a list of
OpenFlow rules and groups that contribute to the creation of a datapath
flow.

The suggested workflow is:
a. dump datapath flows (along with UUIDs), this also prints the core IDs
(PMD IDs) when applicable.

  $ ovs-appctl dpctl/dump-flows -m
  flow-dump from pmd on cpu core: 7
  ufid:7460db8f..., recirc_id(0), ....

b. dump related OpenFlow rules and groups:
  $ ovs-appctl ofproto/detrace ufid:7460db8f... pmd=7
  cookie=0x12345678, table=0 priority=100,ip,in_port=2,nw_dst=10.0.0.2,actions=resubmit(,1)
  cookie=0x0, table=1 priority=200,actions=group:1
  group_id=1,bucket=bucket_id:0,actions=ct(commit,table=2,nat(dst=20.0.0.2))
  cookie=0x0, table=2 actions=output:1

The new command only shows rules and groups attached to ukeys that are
in states UKEY_VISIBLE or UKEY_OPERATIONAL.  That should be fine as all
other ukeys should not be relevant for the use case presented above.

This commit tries to mimic the output format of the ovs-ofctl
dump-flows/dump-groups commands.

Signed-off-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2024-07-15 19:12:52 +02:00
Adrian Moreno
1aa9e137fe ofp-actions: Load data from fields in sample action.
When sample action gets used as a way of sampling traffic with
controller-generated metadata (i.e: obs_domain_id and obs_point_id),
the controller will have to increase the number of flows to ensure each
part of the pipeline contains the right metadata.

As an example, if the controller decides to sample stateful traffic, it
could store the computed metadata for each connection in the conntrack
label. However, for established connections, a flow must be created for
each different ct_label value with a sample action that contains a
different hardcoded obs_domain and obs_point id.

This patch adds a new version of the NXAST_RAW_SAMPLE* action (number 4)
that supports specifying the observation point and domain using an
OpenFlow field reference, so now the controller can express:

 sample(...
        obs_domain_id=NXM_NX_CT_LABEL[0..31],
        obs_point_id=NXM_NX_CT_LABEL[32..63]
        ...
       )

Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2024-07-15 11:31:09 +02:00
Adrian Moreno
c2e6836460 ofproto-dpif-xlate: Avoid allocating mf_subfield.
"enum mf_subfield" (a 128byte object) is dynamically allocated a few
times just to set it to an all-ones mask.

Avoid dynamically allocating them by creating a static all-ones mask
similar to what was done with "exact_match_mask".

Acked-by: Eelco Chaudron <echaudro@redhat.com>
Suggested-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2024-07-15 11:31:09 +02:00
Adrian Moreno
516569d31f ofproto: xlate: Make sampled drops explicit.
When the flow translation results in a datapath action list whose last
action is an "observational" action, i.e: one generated for IPFIX,
sFlow or local sampling applications, the packet is actually going to be
dropped (and observed).

In that case, add an explicit drop action so that drop statistics remain
accurate. This behavior is controlled by a configurable boolean knob
called "explicit_sampled_drops"

Combine the "optimizations" and other odp_actions "tweaks" into a single
function.

Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2024-07-15 11:31:09 +02:00
Adrian Moreno
c10dbcec75 ofproto-dpif-xlate: Use psample for local sample.
Use the newly added psample action to implement OpenFlow sample() actions
with local sampling configuration if possible.

A bit of refactoring in compose_sample_actions arguments helps make it a
bit more readable.

Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2024-07-15 11:31:09 +02:00
Adrian Moreno
5b99ebc268 ofproto: Add ofproto-dpif-lsample.
Add a new resource in ofproto-dpif and the corresponding API in
ofproto_provider.h to represent and local sampling configuration.

Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2024-07-15 11:15:30 +02:00
Adrian Moreno
d0afbf0944 ofproto_dpif: Check for psample support.
Only kernel datapath supports this action so add a function in dpif.c
that checks for that.

Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2024-07-15 11:15:30 +02:00
Adrian Moreno
1a3bd96b4f odp-util: Add support OVS_ACTION_ATTR_PSAMPLE.
Add support for parsing and formatting the new action.

Also, flag OVS_ACTION_ATTR_SAMPLE as requiring datapath assistance if it
contains a nested OVS_ACTION_ATTR_PSAMPLE. The reason is that the
sampling rate from the parent "sample" is made available to the nested
"psample" by the kernel.

Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2024-07-14 17:19:52 +02:00
Adrian Moreno
d9de6b01c2 ofproto-dpif: Allow forcing dp features.
Datapath features can be set with dpif/set-dp-features unixctl command.
This command is not documented and therefore not supported in
production but only useful for unit tests.

A limitation was put in place originally to avoid enabling features at
runtime that were disabled at boot time to avoid breaking the datapath
in unexpected ways.

But, considering users should not use this command and it should only be
used for testing, we can assume whoever runs it knows what they are
doing. Therefore, the limitation should be bypass-able.

This patch adds a "--force" flag to the unixctl command to allow
bypassing the mentioned limitation.

Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2024-07-14 17:19:52 +02:00
Mike Pattrick
d7e77143fb tunnel: Allow UDP zero checksum with IPv6 tunnels.
This patch adopts the proposed RFC 6935 by allowing null UDP checksums
even if the tunnel protocol is IPv6. This is already supported by Linux
through the udp6zerocsumtx tunnel option. It is disabled by default and
IPv6 tunnels are flagged as requiring a checksum, but this patch enables
the user to set csum=false on IPv6 tunnels.

Acked-by: Simon Horman <horms@ovn.org>
Signed-off-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2024-07-12 11:54:45 +02:00
Jakob Meng
4935e89325 ofproto: Add JSON output for 'dpif/show' command.
The 'dpif/show' command now supports machine-readable JSON output in
addition to the plain-text output for humans. An example would be:

  ovs-appctl --format json dpif/show

Reported-at: https://bugzilla.redhat.com/1824861
Signed-off-by: Jakob Meng <code@jakobmeng.de>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2024-07-09 13:49:44 +02:00
Eelco Chaudron
361d7bce0f ofproto-dpif: Define age as time_t in ofproto_unixctl_fdb_add().
Fix the warning from Coverity about potential truncation of the
time_t value when copying to a local variable by changing the
local variable's type to time_t.

Fixes: ccc24fc88d59 ("ofproto-dpif: APIs and CLI option to add/delete static fdb entry.")
Acked-by: Mike Pattrick <mkp@redhat.com>
Acked-by: Paolo Valerio <pvalerio@redhat.com>
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
2024-06-03 13:23:41 +02:00
Eelco Chaudron
f673d0cd5f sflow: Fix check for disabled receive time.
Changed sFlowRcvrTimeout to a uint32_t to avoid time_t warnings
reported by Coverity. A uint32_t is more than large enough as
this is a (seconds) tick counter and OVS is not even using this.

Fixes: c72e245a0e2c ("Add InMon's sFlow Agent library to the build system.")
Acked-by: Ilya Maximets <i.maximets@ovn.org>
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
2024-06-03 13:23:41 +02:00
Eelco Chaudron
0c8e626401 utilities: Correct deletion reason in flow_reval_monitor.py.
The flow_reval_monitor.py script incorrectly reported the reasons for
FDR_PURGE and FDR_TOO_EXPENSIVE, as their descriptions were swapped.
This patch rectifies the order using a dictionary to avoid similar
problems in the future.

In addition this patch also syncs the delete reason output of the
script, with the comments in the code.

Fixes: 86b9e653ef22 ("revalidator: Add a USDT probe during flow deletion with purge reason.")
Acked-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
2024-05-21 13:31:02 +02:00
Ilya Maximets
b91f6788c4 ofproto-dpif-trace: Fix access to an out-of-scope stack memory.
While tracing NAT actions, pointer to the action may be stored in the
recirculation node for future reference.  However, while translating
actions for the group bucket in xlate_group_bucket, the action list is
allocated temporarily on stack.  So, in case the group translation
leads to NAT, the stack pointer can be stored in the recirculation node
and accessed later by the tracing mechanism when this stack memory is
long gone:

 ==396230==ERROR: AddressSanitizer: stack-use-after-return on address
 0x191844 at pc 0x64222a bp 0xa5da10 sp 0xa5da08
 READ of size 1 at 0x191844 thread T0
  0 0x642229 in ofproto_trace_recirc_node ofproto/ofproto-dpif-trace.c:704:49
  1 0x642229 in ofproto_trace ofproto/ofproto-dpif-trace.c:867:9
  2 0x6434c1 in ofproto_unixctl_trace ofproto/ofproto-dpif-trace.c:489:9
  3 0xc1e491 in process_command lib/unixctl.c:310:13
  4 0xc1e491 in run_connection lib/unixctl.c:344:17
  5 0xc1e491 in unixctl_server_run lib/unixctl.c:395:21
  6 0x53eedf in main ovs/vswitchd/ovs-vswitchd.c:131:9
  7 0x2be087 in __libc_start_call_main
  8 0x2be14a in __libc_start_main@GLIBC_2.2.5
  9 0x42dee4 in _start (vswitchd/ovs-vswitchd+0x42dee4)

 Address 0x191844 is located in stack of thread T0 at offset 68 in frame
  0 0x6d391f in xlate_group_bucket ofproto/ofproto-dpif-xlate.c:4751

  This frame has 3 object(s):
    [32, 1056) 'action_list_stub' (line 4760) <== Memory access at
                                                  offset 68 is inside
                                                  this variable
    [1184, 1248) 'action_list' (line 4761)
    [1280, 1344) 'action_set' (line 4762)

 SUMMARY: AddressSanitizer: stack-use-after-return
   ofproto/ofproto-dpif-trace.c:704:49 in ofproto_trace_recirc_node

Fix that by copying the action.

Fixes: d072d2de011b ("ofproto-dpif-trace: Improve NAT tracing.")
Reported-by: Ales Musil <amusil@redhat.com>
Reviewed-by: Adrian Moreno <amorenoz@redhat.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2024-05-03 14:16:20 +02:00
Lin Huang
120140f891 ofproto: Fix Coverity false positive.
Coverity reports a false positive below:
Ofproto_class_find__() may return NULL, and dereference it to cause segfault.

This patch is made just to avoid false-positive Coverity report.

Tested-by: Zhang YuHuang <zhangyuhuang@ruijie.com.cn>
Signed-off-by: Lin Huang <linhuang@ruijie.com.cn>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2024-04-09 19:36:09 +02:00
Ilya Maximets
ed379a810a ofproto-dpif-upcall: Fix ukey installation failure logs and counters.
ukey_install() returns boolean signaling if the ukey was installed
or not.  Installation may fail for a few reasons:

 1. Conflicting ukey.
 2. Mutex contention while trying to replace existing ukey.
 3. The same ukey already exists and active.

Only the first case here signals an actual problem.  Third one is
a little odd for userspace datapath, but harmless.  Second is the
most common one that can easily happen during normal operation
since other threads like revalidators may be currently working on
this ukey preventing an immediate access.

Since only the first case is actually worth logging and it already
has its own log message, removing the 'upcall installation fails'
warning from the upcall_cb().  This should fix most of the random
failures of userspace system tests in CI.

While at it, also fixing coverage counters.  Mutex contention was
mistakenly counted as a duplicate upcall.  ukey contention for
revalidators was counted only in one of two places.

New counter added for the ukey contention on replace.  We should
not re-use existing upcall_ukey_contention counter for this, since
it may lead to double counting.

Fixes: 67f08985d769 ("upcall: Replace ukeys for deleted flows.")
Fixes: 9cec8274ed9a ("ofproto-dpif-upcall: Add VLOG_WARN_RL logs for upcall_cb() error.")
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2024-04-06 00:09:53 +02:00
Eric Garver
3c8d069b9b dpif: Probe support for OVS_ACTION_ATTR_DROP.
Kernel support has been added for this action. As such, we need to probe
the datapath for support.

Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Eric Garver <eric@garver.life>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2024-04-05 23:10:17 +02:00
Eric Garver
54d94f8f4d dpif: Support atomic_bool field type.
The next commit will convert a dp feature from bool to atomic_bool. As
such we have to add support to the macros and functions. We must pass by
reference instead of pass by value because all the atomic operations
require a reference.

Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Eric Garver <eric@garver.life>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2024-04-05 23:10:17 +02:00