Upon modifying a group, the following steps occur:
1. ofproto_group_mod_start()->modify_group_start():
Find an old group object, create a new one.
2. ofproto_bump_tables_version()
3. ofproto_group_mod_finish():
Modify the new group object with buckets etc.
At step #3, the new group object is already in use by revalidators,
which may read inconsistent data while the group is being modified.
Instead, move the modification of the new group object to step #1.
Fixes: 0a8f6beb54ab ("ofproto-dpif: Fix dp_hash mapping after select group modification.")
Acked-by: Gaetan Rivet <gaetanr@nvidia.com>
Acked-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Eli Britstein <elibr@nvidia.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
When a packet enters the kernel datapath and there is no flow to
handle it, the packet goes to userspace through a MISS upcall. With
the per-CPU upcall dispatch mechanism, we use the current CPU id to
select the Netlink PID on which to send this packet. This allows us to send
packets from the same traffic flow through the same handler.
The handler will process the packet, install required flow into the
kernel and re-inject the original packet via OVS_PACKET_CMD_EXECUTE.
While handling OVS_PACKET_CMD_EXECUTE, however, we may hit a
recirculation action that will pass the (likely modified) packet
through the flow lookup again. And if the flow is not found, the
packet will be sent to userspace again through another MISS upcall.
However, the handler thread in userspace is likely running on a
different CPU core, and the OVS_PACKET_CMD_EXECUTE request is handled
in the syscall context of that thread. So, when the time comes to
send the packet through another upcall, the per-CPU dispatch will
choose a different Netlink PID, and this packet will end up processed
by a different handler thread on a different CPU.
The process continues as long as there are new recirculations, each
time the packet goes to a different handler thread before it is sent
out of the OVS datapath to the destination port. In real setups the
number of recirculations can go up to 4 or 5, sometimes more.
There is always a chance of packet re-ordering while processing
upcalls, because userspace will first install the flow and then
re-inject the original packet. So, there is a race window when the
flow is already installed and a second packet can match it inside the
kernel and be forwarded to the destination before the first packet is
re-injected.
But the fact that packets are going through multiple upcalls handled
by different userspace threads makes the reordering noticeably more
likely, because we not only have a race between the kernel and a
userspace handler (which is hard to avoid), but also between multiple
userspace handlers.
For example, let's assume that 10 packets got enqueued through a MISS
upcall for handler-1. It will start processing them, install the
flow into the kernel and start re-injecting packets back, from where
they will go through another MISS to handler-2. Handler-2 will install
the flow into the kernel and start re-injecting the packets.
Meanwhile, handler-1 continues to re-inject the last of the 10
packets; those will hit the flow installed by handler-2 and be
forwarded without going to handler-2 at all, while handler-2 is still
re-injecting the first of these 10 packets. Given multiple
recirculations and misses, these 10 packets may end up completely
mixed up on the output from the datapath.
Let's provide the original upcall PID via the new netlink attribute
OVS_PACKET_ATTR_UPCALL_PID. This way the upcall triggered during the
execution will go to the same handler. Packets will be enqueued to
the same socket and re-injected in the same order. As stated above,
this doesn't fully eliminate re-ordering, since we still have a race
between the kernel and the handler thread, but it does eliminate
races between multiple handlers.
The openvswitch kernel module ignores unknown attributes for
OVS_PACKET_CMD_EXECUTE, so it's safe to provide the new attribute
even on older kernels.
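As a rough illustration, the difference between per-CPU dispatch and pinning the original upcall PID can be simulated in a few lines of Python (a toy model; the handler PIDs, the CPU-hopping rule, and the function names are illustrative, not OVS or kernel code):

```python
N_HANDLERS = 4
HANDLER_PIDS = [1000 + i for i in range(N_HANDLERS)]  # hypothetical Netlink PIDs

def pid_for_cpu(cpu):
    # Per-CPU dispatch: the Netlink PID is derived from the current CPU id.
    return HANDLER_PIDS[cpu % N_HANDLERS]

def upcall_chain(first_cpu, n_recircs, pin_pid=False):
    """Return the PID that receives each MISS upcall of one packet."""
    pids = []
    cpu = first_cpu
    original_pid = None
    for _ in range(n_recircs + 1):
        if pin_pid and original_pid is not None:
            # OVS_PACKET_ATTR_UPCALL_PID: reuse the original upcall PID.
            pid = original_pid
        else:
            pid = pid_for_cpu(cpu)
        pids.append(pid)
        if original_pid is None:
            original_pid = pid
        # OVS_PACKET_CMD_EXECUTE runs in the syscall context of the handler
        # thread, which likely sits on a different CPU core.
        cpu = (HANDLER_PIDS.index(pid) + 1) % N_HANDLERS
    return pids

print(upcall_chain(0, 3))                # each recirculation hits a new PID
print(upcall_chain(0, 3, pin_pid=True))  # every upcall stays on one PID
```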
Reported-at: https://issues.redhat.com/browse/FDP-1479
Link: https://lore.kernel.org/netdev/20250702155043.2331772-1-i.maximets@ovn.org/
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
In the IPFIX and flow sample upcall handling, check the validity
of the tunnel key returned by odp_tun_key_from_attr(). If the
tunnel key is invalid, return an error.
This was reported by Coverity, but the change also improves
robustness and avoids undefined behavior in the case of malformed
tunnel attributes.
Fixes: 8b7ea2d48033 ("Extend OVS IPFIX exporter to export tunnel headers")
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Aaron Conole <aconole@redhat.com>
This is not a real issue, as the initializer function,
rewrite_flow_push_nsh(), ensures it returns NULL on error.
However, cleaning this up improves code clarity and resolves
a Coverity warning about a potential memory leak.
Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Upon tunnel output failure (due to a routing failure, for example),
add an explicit drop action so that it appears in the datapath flows,
providing better visibility into that case.
Add additional drop reasons for these failures.
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Eli Britstein <elibr@nvidia.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Rebalancing stats are now printed once per run at INFO log level.
The old detailed per-hash rebalancing stats are now printed at DBG
log level.
Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2025-March/422028.html
Suggested-by: Ilya Maximets <i.maximets@ovn.org>
Signed-off-by: Vladislav Odintsov <vlodintsov@k2.cloud>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
All the callers of put_drop_action() first check whether the drop
action is supported. Instead, embed the check inside the function
itself.
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Eli Britstein <elibr@nvidia.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
When a new bucket is inserted into an existing group, OVS creates a
new group with just that one bucket and then copies the old buckets
into it. The problem is that the dp_hash mappings remain initialized
for only that one bucket, so no traffic can be sent to any of the old
buckets.
If a bucket is removed, then OVS creates an empty new group and then
copies old buckets into it, except for the removed one. Mappings are
also not updated in this case and the group behaves as if it had no
buckets at all.
We need to re-hash all the buckets after copying over the old ones.
The ofproto-provider API already has a callback for this case, but it
just wasn't implemented for ofproto-dpif. That wasn't necessary in
the past,
but it became necessary when the hash map construction was moved to
the ofproto-dpif layer.
No locking in this function is required, because the new group must
not be visible during modification. It becomes visible only after the
GROUP_MOD processing is finished.
Fixes: 2e3fd24c7c44 ("ofproto-dpif: Improve dp_hash selection method for select groups")
Reported-at: https://github.com/openvswitch/ovs-issues/issues/357
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Signed-off-by: Aaron Conole <aconole@redhat.com>
STT and LISP tunnel types were deprecated and marked for removal in
the following commits in the OVS 3.5 release:
3b37a6154a59 ("netdev-vport: Deprecate STT tunnel port type.")
8d7ac031c03d ("netdev-vport: Deprecate LISP tunnel port type.")
The main reasons were that STT was rejected in the upstream kernel
and LISP was never upstreamed either, leaving it without a really
supported implementation. Both protocols also appear to have lost
their former relevance.
Remove both now. While at it, also fix some small documentation
issues and comments.
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Alin Serdean <aserdean@ovn.org>
Acked-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The commit "userspace: Enable non-bridge port as tunnel endpoint"
did not account for the ovs_router_get_netdev_source_address()
call. This patch ensures that the fix from "ofproto-dpif-xlate:
Fix netdev native tunnel neighbor discovery" remains functional
by properly calling the function with the output devices.
Fixes: dc0bd12f5b04 ("userspace: Enable non-bridge port as tunnel endpoint.")
Acked-by: Simon Horman <horms@ovn.org>
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Ensure that xlate_generic_encap_action() does not return a buffer
on error.
Acked-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Ensure a valid class is found before calling its function pointer.
If none is found, trigger an ovs_assert().
Acked-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Add an ovs_assert() in this case, as a valid meter should exist.
Acked-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Coverity complains that we could potentially dereference a
NULL pointer. To prevent this warning, add an ovs_assert().
Acked-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
The system and netdev datapaths have different default pmd_ids, which
resulted in empty output of ofproto/detrace with the kernel datapath.
Also indicate when the UFID or cache wasn't available.
Make sure we use the correct default pmd_id when it's not specified
as an argument. At the same time, move a slightly adjusted test into
the system tests so the command is tested with both datapaths.
Fixes: 600125b2c380 ("ofproto: Add ofproto/detrace command to map UFIDs to OpenFlow.")
Signed-off-by: Ales Musil <amusil@redhat.com>
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
The first time the revalidator checks the value, it is not
initialized, so we may end up marking valid flows for deletion.
WARNING: MemorySanitizer: use-of-uninitialized-value
0 0x6ee9e9 in revalidator_sweep__ ofproto/ofproto-dpif-upcall.c:3003:25
1 0x6ed671 in revalidator_purge ofproto/ofproto-dpif-upcall.c:3056:5
2 0x6e997d in udpif_stop_threads ofproto/ofproto-dpif-upcall.c:566:17
3 0x6ecf05 in udpif_flush ofproto/ofproto-dpif-upcall.c:756:5
4 0x60323e in flush ofproto/ofproto-dpif.c:2020:9
5 0x56b10e in ofproto_flush__ ofproto/ofproto.c:1669:9
6 0x56a67b in ofproto_destroy ofproto/ofproto.c:1821:5
7 0x4c9012 in bridge_destroy vswitchd/bridge.c:3644:9
8 0x4c7c13 in bridge_exit vswitchd/bridge.c:556:9
9 0x5261a8 in main vswitchd/ovs-vswitchd.c:147:5
10 0x7fa0bb in __libc_start_call_main
11 0x7fa0bb in __libc_start_main@GLIBC_2.2.5
12 0x432b24 in _start (vswitchd/ovs-vswitchd+0x432b24)
Fixes: 180ab2fd635e ("ofproto-dpif-upcall: Avoid stale ukeys leaks.")
Acked-by: Mike Pattrick <mkp@redhat.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
It is common to have flow tables with OpenFlow rules for both IPv4 and
IPv6 traffic at the same time. One major example is OVN.
Today, only IPv4 addresses and L4 ports are prefix-matched by
default. It recently became possible to turn on the optimization for
all 4 fields at the same time (nw_dst, nw_src, ipv6_dst and
ipv6_src), but doing so manually is an inconvenience for users. IPv6
configurations are becoming more and more common in the real world,
so having IPv6 tables not optimized by default is not good.
Enable prefix tree lookups for IPv6 addresses by default in addition
to IPv4.
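To illustrate what these prefix trees provide, here is a toy binary trie over address bits with longest-prefix lookup (a drastic simplification for illustration only; the class and function names are invented and this is not the OVS classifier implementation):

```python
import ipaddress

class TrieNode:
    def __init__(self):
        self.children = {}
        self.plen = None           # set when a prefix terminates here

def insert(root, network):
    """Insert a prefix such as '2001:db8::/32' into the trie."""
    net = ipaddress.ip_network(network)
    bits = format(int(net.network_address), "0%db" % net.max_prefixlen)
    node = root
    for b in bits[: net.prefixlen]:
        node = node.children.setdefault(b, TrieNode())
    node.plen = net.prefixlen

def longest_prefix(root, address):
    """Return the longest matching prefix length, or None."""
    addr = ipaddress.ip_address(address)
    bits = format(int(addr), "0%db" % addr.max_prefixlen)
    node, best = root, None
    for b in bits:
        if node.plen is not None:
            best = node.plen
        if b not in node.children:
            break
        node = node.children[b]
    else:
        if node.plen is not None:
            best = node.plen
    return best

root = TrieNode()
insert(root, "2001:db8::/32")
insert(root, "2001:db8:1::/48")
print(longest_prefix(root, "2001:db8:1::5"))  # 48
print(longest_prefix(root, "2001:db9::1"))    # None
```

A single walk down the trie tells the classifier how many address bits actually need to be unwildcarded, which is what keeps the number of datapath flows down.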
This change increases memory consumption slightly and takes a few
extra cycles per flow to create and later iterate over the additional
prefix trees. However, IPv4 and IPv6 matches are mutually exclusive,
so the performance overhead should be minimal, especially in
comparison with the benefits this configuration brings to IPv6 setups
by reducing the number of datapath flows by default.
A new test is added to check that the defaults work as expected.
Acked-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Currently a bond will not always revalidate when an active member
changes. This can result in counter-intuitive behavior: using
'ovs-appctl bond/set-active-member' will cause the bond to
revalidate, but changing other_config:bond-primary will not trigger
a revalidation of the bond.
When revalidation is not triggered but the active member changes in
an unbalanced bond, OVS may send traffic out of the previously active
member instead of the new active member.
This change will always mark unbalanced bonds for revalidation if the
active member changes.
Reported-at: https://issues.redhat.com/browse/FDP-845
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Previously a commit attempted to reset the mirror context when packets
were modified. However, this commit erroneously also reset the mirror
context when only a packet's metadata was modified. An intermediate
commit corrected this for tunnel metadata, but now that correction is
extended to other forms of metadata as well.
Fixes: feed7f677505 ("ofproto-dpif-upcall: Mirror packets that are modified.")
Reported-at: https://issues.redhat.com/browse/FDP-699
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Simon Horman <horms@ovn.org>
Signed-off-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
ovn-kubernetes users observe uneven distribution of traffic among
service backends. It gets noticeably worse when 9+ backends are
configured. For example:
Server connections
server-6nbg7 634
server-72br8 326 <-----
server-7b8rm 605
server-9tnz7 655
server-c84hw 596
server-mjskl 622
server-ptkcc 336 <-----
server-rl7pk 592
server-xd2cb 634
Here we can see that out of 5000 connections total, servers '72br8'
and 'ptkcc' received about 2 times fewer connections than any other
server. With 10 backends, 4 servers will receive 2x fewer, etc.
The situation is common, because Kubernetes users frequently scale
their applications with something like 'kubectl scale --replicas=10'.
Services are implemented as OVN load balancers, which are in turn
implemented as OpenFlow select groups with the dp_hash selection
method. Balancing is implemented by allocating a hash space (the
number of hashes is a power of 2) and distributing hashes between the
group buckets using Webster's method, taking bucket weights into
account.
This works well enough when buckets have different weights, because
OVS needs to allocate a larger hash space in order to accommodate the
weight difference. However, in the case of OVN load balancers, all
the buckets (backends) have the same weight. This leads to allocation
of a smaller hash space, just enough to fit all the backends (the
closest power of 2). For example, if we have 10 backends (buckets),
then 16 hashes will be allocated. After the hashes are distributed,
we end up with 6 buckets having 2 hashes each and 4 buckets having
1 hash. So, each of the 6 buckets will receive on average 2x more
traffic than each of the remaining 4.
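The equal-weight case above can be sketched in a few lines of Python (a simplification: with equal weights, Webster's method degenerates into plain round-robin assignment; this is an illustration, not the OVS code):

```python
def distribute(n_hashes, n_buckets):
    """Assign hashes to equally weighted buckets, round-robin style."""
    counts = [0] * n_buckets
    for h in range(n_hashes):
        counts[h % n_buckets] += 1
    return counts

# 10 backends, 16 hashes: six buckets end up with 2 hashes, four with 1,
# so six backends receive roughly twice the traffic of the other four.
counts = distribute(16, 10)
print(counts)  # [2, 2, 2, 2, 2, 2, 1, 1, 1, 1]
```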
The problem can be mitigated by expanding the hash space, and this is
already done for cases with a small number of buckets by limiting the
hash space to have at least 16 hashes. However, that is just not
large enough for 9 or 10 buckets to provide a good distribution. At
the same time, blindly increasing the limit is also not a good
solution, as we do not want too large a hash space for a small number
of buckets, simply because each hash is a separate datapath flow and
having too many of them may cause performance issues.
The approach taken in this change is to ensure that the hash space is
at least 4 times larger than the number of buckets, but not larger
than the maximum allowed (256). This provides a better distribution
while not unnecessarily exploding the number of datapath flows for
services with not that many backends.
Here is some data to demonstrate why the 4 was chosen as a coefficient:
coeff. : 1 2 3 4 5 100
-------------------------------------------------------------------
AvgDiff : 43.1 % 27.1 % 18.3 % 15.1 % 13.6 % 10.9 %
MaxDiff : 50.0 % 33.3 % 25.0 % 20.0 % 20.0 % 20.0 %
AvgDev : 24.9 % 13.4 % 8.5 % 6.9 % 6.1 % 4.8 %
MaxDev : 35.4 % 20.4 % 14.4 % 11.2 % 11.2 % 11.2 %
--------+----------------------------------------------------------
16 : 1 1 1 1 1 -
32 : 17 9 6 5 4 -
64 : 33 17 11 9 7 -
128 : 65 33 22 17 13 1
256 : 129 65 43 33 26 2
--------+----------------------------------------------------------
(Coeff. 1 is the current behavior; coeff. 4 is the one proposed here.)
The table shows the average and maximum load difference (Diff)
between backends across groups with 1 to 64 equally weighted
backends, as well as the average and maximum standard deviation (Dev)
of the load distribution for the same. For example, with a
coefficient of 2, the maximum difference between two backends will be
33% and the maximum standard deviation will be 20.4%. With the
current logic (coefficient of 1) we have a maximum difference as high
as 50%, as shown with the example at the beginning, with a standard
deviation of 35.4%.
The bottom half of the table shows at how many backends we start to
use a particular number of hashes. For example, with a coeff. of 3
we will have 16 hashes for 1 to 5 buckets, 32 hashes for 6-10, 64
hashes for 11-21, and so on.
According to the table, 4 is about where we achieve a good enough
standard deviation for the load (11.2%) while still not creating too
many hashes for cases with a low number of backends. The standard
deviation also doesn't go down much with higher coefficients.
The data is aggregated for up to 64 backends because with higher
numbers we're getting close to the cap of 256 hashes, so the
deviation increases. However, the load difference with so many
backends should have a lower impact on a cluster in general. The
change improves the cases with 64+ backends, but to get very good
numbers we'll need to consider raising the upper limit of the hash
space, which is a separate change to think about.
The added test checks that the standard deviation of the load
distribution between buckets is below 12.5% for all cases up to 64
buckets. It's just a pretty number that I've chosen and that should
be, IMO, good enough.
Note: when the number of buckets is a power of 2, we get an ideal
distribution with zero deviation, because hashes can be evenly mapped
to buckets.
Note 2: The change doesn't affect the distribution for 4 or fewer
backends, since those groups still get the minimum of 16 hashes.
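The sizing rule described above can be sketched as follows (the function name and structure are illustrative, not the actual OVS symbols; it picks the smallest power of two at least 4x the bucket count, clamped to the [16, 256] range):

```python
def dp_hash_space(n_buckets, coeff=4, min_hashes=16, max_hashes=256):
    """Smallest power of two >= coeff * n_buckets, clamped to [16, 256]."""
    n = min_hashes
    while n < min(coeff * n_buckets, max_hashes):
        n *= 2
    return n

# Matches the coeff-4 column of the table: 32 hashes from 5 buckets,
# 64 from 9, 128 from 17, 256 from 33, capped at 256 after that.
for n_buckets in (4, 5, 10, 33, 64):
    print(n_buckets, dp_hash_space(n_buckets))
```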
Reported-at: https://issues.redhat.com/browse/FDP-826
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The OVS DB has a bond_active_slave field. This gets read by
port_configure_bond() into struct bond_settings' active_member_mac
field. See commit 3e5aeeb5 ("bridge: Keep bond active slave selection
across OVS restart") for the original rationale for preserving the
active bond member.
But since commit 30353934 ("ofproto/bond: Validate active-slave mac.")
the bond_settings' active_member_mac field has been ignored by
bond_create(), which sets bond->active_member_mac to eth_addr_zero.
Instead, set it to the value of the bond_settings' active_member_mac so
that the selection is preserved across OVS restarts.
Also add a test that checks this behaviour.
Fixes: 303539348848 ("ofproto/bond: Validate active-slave mac.")
Acked-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Jonathan Davies <jonathan.davies@nutanix.com>
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
The odp_flow_format() function applies a wildcard mask when a mask
for a given key is not present. However, in the context of installing
flows in the datapath, the absence of a mask actually indicates that
the key should be ignored, meaning it should not be masked at all.
To address this inconsistency, odp_flow_format() now includes an
option to skip formatting keys that are missing a mask.
This was found during a debug session of the ‘datapath - ping between two
ports on cvlan’ test case. The log message was showing the following:
put[create] ufid:XX recirc_id(0),dp_hash(0/0),skb_priority(0/0),in_port(3),
skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),
eth(src=12:f6:8b:52:f9:75,dst=6e:48:c8:77:d3:8c),eth_type(0x88a8),
vlan(vid=4094,pcp=0/0x0),encap(eth_type(0x8100),
vlan(vid=100/0x0,pcp=0/0x0),encap(eth_type(0x0800),
ipv4(src=10.2.2.2,dst=10.2.2.1,proto=1,tos=0,ttl=64,frag=no),
icmp(type=0,code=0))), actions:2
Where it should have shown the below:
put[create] ufid:XX recirc_id(0),dp_hash(0/0),skb_priority(0/0),in_port(3),
skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),
eth(src=12:f6:8b:52:f9:75,dst=6e:48:c8:77:d3:8c),eth_type(0x88a8),
vlan(vid=4094,pcp=0/0x0),encap(eth_type(0x8100)), actions:2
Acked-by: Simon Horman <horms@ovn.org>
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
It is observed in some environments that there are many more ukeys
than actual datapath flows. For example:
$ ovs-appctl upcall/show
system@ovs-system:
flows : (current 7) (avg 6) (max 117) (limit 2125)
offloaded flows : 525
dump duration : 1063ms
ufid enabled : true
23: (keys 3612)
24: (keys 3625)
25: (keys 3485)
The revalidator threads are busy revalidating the stale ukeys,
leading to high CPU usage and long dump durations.
This patch tracks the number of consecutive missed dumps. If four dumps
are missed in a row, it is assumed that the datapath flow no longer
exists, and the ukey can be deleted.
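A minimal sketch of that tracking logic (the names and structure are illustrative, not the actual revalidator code):

```python
MAX_MISSED_DUMPS = 4  # dumps missed in a row before a ukey is dropped

class Ukey:
    def __init__(self):
        self.missed_dumps = 0

def after_dump(ukeys, dumped):
    """Called once per dump round; returns the surviving ukeys."""
    alive = {}
    for ufid, ukey in ukeys.items():
        if ufid in dumped:
            ukey.missed_dumps = 0        # flow still exists in the DP
        else:
            ukey.missed_dumps += 1
        if ukey.missed_dumps < MAX_MISSED_DUMPS:
            alive[ufid] = ukey           # keep; otherwise assume stale
    return alive

ukeys = {"ufid-1": Ukey(), "ufid-2": Ukey()}
for _ in range(4):                       # 'ufid-2' never shows up in dumps
    ukeys = after_dump(ukeys, {"ufid-1"})
print(sorted(ukeys))                     # only 'ufid-1' survives
```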
Reported-by: Roi Dayan <roid@nvidia.com>
Co-authored-by: Han Zhou <hzhou@ovn.org>
Co-authored-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Aaron Conole <aconole@redhat.com>
In deployments with multiple tunnels, it is possible to enter an
infinite loop where traffic generates an ARP/ND lookup that is
forwarded to a different tunnel, generating a new ARP/ND packet that
is sent back to the first tunnel. Now the depth counter is
incremented for new ARP/ND packets, just as happens in
xlate_table_action().
Signed-off-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
For some reason gcc 14.1.1 from Fedora 41 thinks that the variable
may end up not initialized:
ofproto/ofproto-dpif-xlate.c: In function 'compose_sample_action':
ofproto/ofproto-dpif-xlate.c:3465:40:
error: 'observe_offset' may be used uninitialized
3465 | ctx->xout->last_observe_offset = observe_offset;
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~
ofproto/ofproto-dpif-xlate.c:3418:12:
note: 'observe_offset' was declared here
3418 | size_t observe_offset;
| ^~~~~~~~~~~~~~
We have an assertion in the code to ensure that at least one of
the actions is present (userspace or psample), so the variable
should actually be always initialized.
Initialize explicitly just to silence the warning.
Fixes: 516569d31fbf ("ofproto: xlate: Make sampled drops explicit.")
Acked-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
'wc' can't be NULL there, and if it could be, we'd have already
crashed a few lines earlier while setting up the VLAN flags.
The check is misleading, as it makes people assume that wc can be
NULL. And it makes Coverity think the same:
CID 1596572: (#1 of 1): Dereference after null check (FORWARD_NULL)
25. var_deref_op: Dereferencing null pointer ctx->wc.
14. var_compare_op: Comparing ctx->wc to null implies that ctx->wc
might be null
Remove the check.
Fixes: 3b1882261c8b ("ofproto-dpif-mirror: Add support for pre-selection filter.")
Acked-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Currently a bridge mirror will collect all packets and tools like
ovs-tcpdump can apply additional filters after they have already been
duplicated by vswitchd. This can result in inefficient collection.
This patch adds support to apply pre-selection to bridge mirrors, which
can limit which packets are mirrored based on flow metadata. This
significantly improves overall vswitchd performance during mirroring if
only a subset of traffic is required.
Signed-off-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Previously the mirror_set() and mirror_get() functions took a large
number of parameters, which was inefficient and difficult to read and
extend. This patch moves most of the parameters into a struct.
Acked-by: Simon Horman <horms@ovn.org>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
It improves the debugging experience if we can easily get a list of
OpenFlow rules and groups that contribute to the creation of a datapath
flow.
The suggested workflow is:
a. dump datapath flows (along with UUIDs), this also prints the core IDs
(PMD IDs) when applicable.
$ ovs-appctl dpctl/dump-flows -m
flow-dump from pmd on cpu core: 7
ufid:7460db8f..., recirc_id(0), ....
b. dump related OpenFlow rules and groups:
$ ovs-appctl ofproto/detrace ufid:7460db8f... pmd=7
cookie=0x12345678, table=0 priority=100,ip,in_port=2,nw_dst=10.0.0.2,actions=resubmit(,1)
cookie=0x0, table=1 priority=200,actions=group:1
group_id=1,bucket=bucket_id:0,actions=ct(commit,table=2,nat(dst=20.0.0.2))
cookie=0x0, table=2 actions=output:1
The new command only shows rules and groups attached to ukeys that are
in states UKEY_VISIBLE or UKEY_OPERATIONAL. That should be fine as all
other ukeys should not be relevant for the use case presented above.
This commit tries to mimic the output format of the ovs-ofctl
dump-flows/dump-groups commands.
Signed-off-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
When the sample action is used as a way of sampling traffic with
controller-generated metadata (i.e., obs_domain_id and obs_point_id),
the controller has to increase the number of flows to ensure each
part of the pipeline contains the right metadata.
As an example, if the controller decides to sample stateful traffic, it
could store the computed metadata for each connection in the conntrack
label. However, for established connections, a flow must be created for
each different ct_label value with a sample action that contains a
different hardcoded obs_domain and obs_point id.
This patch adds a new version of the NXAST_RAW_SAMPLE* action (number 4)
that supports specifying the observation point and domain using an
OpenFlow field reference, so now the controller can express:
sample(...
obs_domain_id=NXM_NX_CT_LABEL[0..31],
obs_point_id=NXM_NX_CT_LABEL[32..63]
...
)
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
"enum mf_subfield" (a 128byte object) is dynamically allocated a few
times just to set it to an all-ones mask.
Avoid dynamically allocating them by creating a static all-ones mask
similar to what was done with "exact_match_mask".
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Suggested-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
When the flow translation results in a datapath action list whose last
action is an "observational" action, i.e: one generated for IPFIX,
sFlow or local sampling applications, the packet is actually going to be
dropped (and observed).
In that case, add an explicit drop action so that drop statistics
remain accurate. This behavior is controlled by a configurable
boolean knob called "explicit_sampled_drops".
Combine the "optimizations" and other odp_actions "tweaks" into a single
function.
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Use the newly added psample action to implement OpenFlow sample()
actions with a local sampling configuration when possible.
A bit of refactoring of the compose_sample_actions() arguments helps
make it more readable.
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Add a new resource in ofproto-dpif and the corresponding API in
ofproto-provider.h to represent a local sampling configuration.
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Only the kernel datapath supports this action, so add a function in
dpif.c that checks for that.
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Add support for parsing and formatting the new action.
Also, flag OVS_ACTION_ATTR_SAMPLE as requiring datapath assistance if it
contains a nested OVS_ACTION_ATTR_PSAMPLE. The reason is that the
sampling rate from the parent "sample" is made available to the nested
"psample" by the kernel.
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Datapath features can be set with the dpif/set-dp-features unixctl
command. This command is not documented and therefore not supported
in production; it is only useful for unit tests.
A limitation was originally put in place to avoid enabling features
at runtime that were disabled at boot time, to avoid breaking the
datapath in unexpected ways.
But, considering that users should not use this command and that it
should only be used for testing, we can assume whoever runs it knows
what they are doing. Therefore, the limitation should be bypassable.
This patch adds a "--force" flag to the unixctl command to allow
bypassing the mentioned limitation.
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
This patch adopts RFC 6935 by allowing zero UDP checksums even if
the tunnel protocol is IPv6. This is already supported by Linux
through the udp6zerocsumtx tunnel option. Zero checksums remain
disabled by default and IPv6 tunnels are still flagged as requiring
a checksum, but this patch enables the user to set csum=false on
IPv6 tunnels.
Acked-by: Simon Horman <horms@ovn.org>
Signed-off-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The 'dpif/show' command now supports machine-readable JSON output in
addition to the plain-text output for humans. An example would be:
ovs-appctl --format json dpif/show
Reported-at: https://bugzilla.redhat.com/1824861
Signed-off-by: Jakob Meng <code@jakobmeng.de>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Fix the warning from Coverity about potential truncation of the
time_t value when copying to a local variable by changing the
local variable's type to time_t.
Fixes: ccc24fc88d59 ("ofproto-dpif: APIs and CLI option to add/delete static fdb entry.")
Acked-by: Mike Pattrick <mkp@redhat.com>
Acked-by: Paolo Valerio <pvalerio@redhat.com>
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Changed sFlowRcvrTimeout to a uint32_t to avoid time_t warnings
reported by Coverity. A uint32_t is more than large enough, as this
is a tick counter (in seconds) and OVS is not even using it.
Fixes: c72e245a0e2c ("Add InMon's sFlow Agent library to the build system.")
Acked-by: Ilya Maximets <i.maximets@ovn.org>
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
The flow_reval_monitor.py script incorrectly reported the reasons for
FDR_PURGE and FDR_TOO_EXPENSIVE, as their descriptions were swapped.
This patch rectifies the order using a dictionary to avoid similar
problems in the future.
In addition, this patch syncs the delete reason output of the script
with the comments in the code.
Fixes: 86b9e653ef22 ("revalidator: Add a USDT probe during flow deletion with purge reason.")
Acked-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
While tracing NAT actions, a pointer to the action may be stored in the
recirculation node for future reference. However, while translating
actions for the group bucket in xlate_group_bucket(), the action list
is temporarily allocated on the stack. So, in case the group
translation leads to NAT, the stack pointer can be stored in the
recirculation node and accessed later by the tracing mechanism when
this stack memory is long gone:
==396230==ERROR: AddressSanitizer: stack-use-after-return on address
0x191844 at pc 0x64222a bp 0xa5da10 sp 0xa5da08
READ of size 1 at 0x191844 thread T0
0 0x642229 in ofproto_trace_recirc_node ofproto/ofproto-dpif-trace.c:704:49
1 0x642229 in ofproto_trace ofproto/ofproto-dpif-trace.c:867:9
2 0x6434c1 in ofproto_unixctl_trace ofproto/ofproto-dpif-trace.c:489:9
3 0xc1e491 in process_command lib/unixctl.c:310:13
4 0xc1e491 in run_connection lib/unixctl.c:344:17
5 0xc1e491 in unixctl_server_run lib/unixctl.c:395:21
6 0x53eedf in main ovs/vswitchd/ovs-vswitchd.c:131:9
7 0x2be087 in __libc_start_call_main
8 0x2be14a in __libc_start_main@GLIBC_2.2.5
9 0x42dee4 in _start (vswitchd/ovs-vswitchd+0x42dee4)
Address 0x191844 is located in stack of thread T0 at offset 68 in frame
0 0x6d391f in xlate_group_bucket ofproto/ofproto-dpif-xlate.c:4751
This frame has 3 object(s):
[32, 1056) 'action_list_stub' (line 4760) <== Memory access at
offset 68 is inside
this variable
[1184, 1248) 'action_list' (line 4761)
[1280, 1344) 'action_set' (line 4762)
SUMMARY: AddressSanitizer: stack-use-after-return
ofproto/ofproto-dpif-trace.c:704:49 in ofproto_trace_recirc_node
Fix that by copying the action.
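The failure mode can be modeled abstractly (a Python sketch, not the OVS code): the node records a reference to a caller's temporary buffer, which goes stale once the caller returns; storing a copy keeps the data valid. All names here are illustrative.

```python
# Abstract model of the bug: the recirculation node keeps a reference to
# an action living in the caller's temporary buffer. Clearing the buffer
# models the stack frame being reused after xlate_group_bucket() returns.
class RecircNode:
    def record_ref(self, action):
        self.action = action        # buggy: aliases the temporary buffer

    def record_copy(self, action):
        self.action = dict(action)  # fixed: store a private copy

buf = {"type": "nat", "addr": "10.0.0.1"}
buggy, fixed = RecircNode(), RecircNode()
buggy.record_ref(buf)
fixed.record_copy(buf)
buf.clear()                         # models the stack frame being reused
```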
Fixes: d072d2de011b ("ofproto-dpif-trace: Improve NAT tracing.")
Reported-by: Ales Musil <amusil@redhat.com>
Reviewed-by: Adrian Moreno <amorenoz@redhat.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Coverity reports a false positive below:
ofproto_class_find__() may return NULL, and dereferencing it would
cause a segfault.
This patch is made just to avoid the false-positive Coverity report.
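The defensive pattern amounts to checking the lookup result before dereferencing it, along these lines (an illustrative Python sketch; the real code is C and the names and error handling differ):

```python
# Illustrative model: bail out when the class lookup fails instead of
# dereferencing the result. This silences the static-analysis report even
# though the caller never passes an unknown type in practice.
def ofproto_class_find(classes, datapath_type):
    return classes.get(datapath_type)

def ofproto_create(classes, datapath_type):
    cls = ofproto_class_find(classes, datapath_type)
    if cls is None:                 # the added defensive NULL check
        return "EAFNOSUPPORT"
    return cls()
```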
Tested-by: Zhang YuHuang <zhangyuhuang@ruijie.com.cn>
Signed-off-by: Lin Huang <linhuang@ruijie.com.cn>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
ukey_install() returns a boolean signaling whether the ukey was
installed or not. Installation may fail for a few reasons:
1. Conflicting ukey.
2. Mutex contention while trying to replace an existing ukey.
3. The same ukey already exists and is active.
Only the first case here signals an actual problem. The third one is
a little odd for the userspace datapath, but harmless. The second is
the most common one and can easily happen during normal operation,
since other threads, like revalidators, may be currently working on
this ukey, preventing immediate access.
Since only the first case is actually worth logging, and it already
has its own log message, remove the 'upcall installation fails'
warning from upcall_cb(). This should fix most of the random
failures of userspace system tests in CI.
While at it, also fix the coverage counters. Mutex contention was
mistakenly counted as a duplicate upcall, and ukey contention for
revalidators was counted in only one of two places.
A new counter is added for ukey contention on replace. We should not
re-use the existing upcall_ukey_contention counter for this, since
that may lead to double counting.
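The logging and counting policy described above can be sketched as follows (the names are hypothetical, not the OVS API): only a conflicting ukey warrants a warning, while contention bumps its own dedicated counter and an already-active duplicate is ignored.

```python
# Model of the three ukey_install() failure modes and how each is
# handled: warn only on conflict, count contention separately, and treat
# an existing active ukey as harmless. Names are hypothetical.
from enum import Enum

class Install(Enum):
    OK = 0
    CONFLICT = 1    # actual problem: covered by its own log message
    CONTENTION = 2  # another thread holds the ukey; normal operation
    EXISTS = 3      # same ukey already installed and active; harmless

counters = {"ukey_contention_on_replace": 0}

def on_install_result(result):
    if result is Install.CONTENTION:
        counters["ukey_contention_on_replace"] += 1  # dedicated counter
    return result is Install.CONFLICT                # True => warn
```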
Fixes: 67f08985d769 ("upcall: Replace ukeys for deleted flows.")
Fixes: 9cec8274ed9a ("ofproto-dpif-upcall: Add VLOG_WARN_RL logs for upcall_cb() error.")
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Kernel support has been added for this action. As such, we need to probe
the datapath for support.
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Eric Garver <eric@garver.life>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The next commit will convert a dp feature from bool to atomic_bool. As
such, we have to add support for it to the macros and functions. We
must pass by reference instead of by value because all the atomic
operations require a reference.
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Eric Garver <eric@garver.life>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>