mir/ovs - ovs - Mike's Git repositories

mir/ovs

mirror of https://github.com/openvswitch/ovs synced 2025-08-29 05:18:13 +00:00

Author	SHA1	Message	Date
Kevin Traynor	ac32bbe2c7	dpif-netdev: Fix Auto Load Balance debug log. In the case where there is a NUMA node that has a zero variance improvement, the log will report it's variance improvement as value for a previous NUMA node with a non-zero variance improvement. For example in an artificial case: \|dpif_netdev\|DBG\|Numa node 1. Current variance 1000 Estimated variance 0. Variance improvement 100%. ^^^ correct value \|dpif_netdev\|DBG\|Numa node 0. Current variance 0 Estimated variance 0. Variance improvement 100%. ^^^ incorrect value for Numa 0, value from Numa 1 This is caused by not resetting the improvement between loops. This is a debug log reporting issue only, non-zero variance improvement will still trigger rebalance where appropriate. Move improvement and other variables into the loop code block to fix logs. Fixes: 46e04ec31bb2 ("dpif-netdev: Calculate per numa variance.") Reported-at: https://issues.redhat.com/browse/FDP-1145 Signed-off-by: Kevin Traynor <ktraynor@redhat.com> Acked-by: Simon Horman <horms@ovn.org> Reviewed-by: David Marchand <david.marchand@redhat.com>	2025-02-14 13:27:05 +00:00
David Marchand	c771758249	dpif-netdev: Preserve inner offloads on recirculation. Rather than drop all pending Tx offloads on recirculation, preserve inner offloads (and mark packet with outer Tx offloads) when parsing the packet again. Fixes: c6538b443984 ("dpif-netdev: Fix crash due to tunnel offloading on recirculation.") Fixes: 084c8087292c ("userspace: Support VXLAN and GENEVE TSO.") Reported-at: https://issues.redhat.com/browse/FDP-1144 Acked-by: Mike Pattrick <mkp@redhat.com> Signed-off-by: David Marchand <david.marchand@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2025-02-13 21:32:15 +01:00
Mike Pattrick	2276c3a2c6	userspace: Support GRE TSO. This patch extends the userspace datapaths support of tunnel tso from only supporting VxLAN and Geneve to also supporting GRE tunnels. There is also a software fallback for cases where the egress netdev does not support this feature. Reviewed-by: David Marchand <david.marchand@redhat.com> Signed-off-by: Mike Pattrick <mkp@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2025-01-17 00:20:48 +01:00
Paolo Valerio	8ff40f3358	conntrack: Do not use atomics to report zones info. Atomics are not needed when reporting zone limits. Remove the restriction by defining a non-atomic common structure to report such data. The change also access atomics using the related operations to retrieve atomics reporting only the fields required by the requesting level instead of relying of struct copy. Signed-off-by: Paolo Valerio <pvalerio@redhat.com> Signed-off-by: Aaron Conole <aconole@redhat.com>	2024-10-09 10:55:18 -04:00
Mike Pattrick	a67db28fd9	dpif-netdev: Remove undefined integer division. Clang analyzer will complain about floating point operations conducted with integer types as rounding is undefined. In pmd_info_show_rxq() a percentage was calculated inside uint64 integers instead of a floating pointer variable for a user visible message. This issue can be resolved simply by casting to double while dividing. Acked-by: Simon Horman <horms@ovn.org> Signed-off-by: Mike Pattrick <mkp@redhat.com> Signed-off-by: Eelco Chaudron <echaudro@redhat.com>	2024-09-11 15:37:56 +02:00
Eelco Chaudron	252ee0f182	dpif: Fix flow put debug message match content. The odp_flow_format() function applies a wildcard mask when a mask for a given key was not present. However, in the context of installing flows in the datapath, the absence of a mask actually indicates that the key should be ignored, meaning it should not be masked at all. To address this inconsistency, odp_flow_format() now includes an option to skip formatting keys that are missing a mask. This was found during a debug session of the ‘datapath - ping between two ports on cvlan’ test case. The log message was showing the following: put[create] ufid:XX recirc_id(0),dp_hash(0/0),skb_priority(0/0),in_port(3), skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0), eth(src=12:f6:8b:52:f9:75,dst=6e:48:c8:77:d3:8c),eth_type(0x88a8), vlan(vid=4094,pcp=0/0x0),encap(eth_type(0x8100), vlan(vid=100/0x0,pcp=0/0x0),encap(eth_type(0x0800), ipv4(src=10.2.2.2,dst=10.2.2.1,proto=1,tos=0,ttl=64,frag=no), icmp(type=0,code=0))), actions:2 Where it should have shown the below: put[create] ufid:XX recirc_id(0),dp_hash(0/0),skb_priority(0/0),in_port(3), skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0), eth(src=12:f6:8b:52:f9:75,dst=6e:48:c8:77:d3:8c),eth_type(0x88a8), vlan(vid=4094,pcp=0/0x0),encap(eth_type(0x8100)), actions:2 Acked-by: Simon Horman <horms@ovn.org> Signed-off-by: Eelco Chaudron <echaudro@redhat.com>	2024-09-09 10:22:16 +02:00
Adrian Moreno	1a3bd96b4f	odp-util: Add support OVS_ACTION_ATTR_PSAMPLE. Add support for parsing and formatting the new action. Also, flag OVS_ACTION_ATTR_SAMPLE as requiring datapath assistance if it contains a nested OVS_ACTION_ATTR_PSAMPLE. The reason is that the sampling rate from the parent "sample" is made available to the nested "psample" by the kernel. Acked-by: Eelco Chaudron <echaudro@redhat.com> Signed-off-by: Adrian Moreno <amorenoz@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2024-07-14 17:19:52 +02:00
Eric Garver	8bb065961e	dpif: Stub out unimplemented action OVS_ACTION_ATTR_DEC_TTL. This is prep for adding a different OVS_ACTION_ATTR_ enum value. This action, OVS_ACTION_ATTR_DEC_TTL, is not actually implemented. However, to make -Werror happy we must add a case to all existing switches. Acked-by: Eelco Chaudron <echaudro@redhat.com> Signed-off-by: Eric Garver <eric@garver.life> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2024-04-05 23:10:17 +02:00
Ilya Maximets	c6538b4439	dpif-netdev: Fix crash due to tunnel offloading on recirculation. Recirculation involves re-parsing the packet from scratch and that process is not aware of multiple header levels nor the inner/outer offsets. So, it overwrites offsets with new ones from the outermost headers and sets offloading flags that change their meaning when the packet is marked for tunnel offloading. For example: 1. TCP packet enters OVS. 2. TCP packet gets encapsulated into UDP tunnel. 3. Recirculation happens. 4. Packet is re-parsed after recirculation with miniflow_extract() or similar function. 5. Packet is marked for UDP checksumming because we parse the outermost set of headers. But since it is tunneled, it means inner UDP checksumming. And that makes no sense, because the inner packet is TCP. This is causing packet drops due to malformed packets or even assertions and crashes in the code that is trying to fixup checksums for packets using incorrect metadata: SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior lib/packets.c:2061:15: runtime error: member access within null pointer of type 'struct udp_header' 0 0xbe5221 in packet_udp_complete_csum lib/packets.c:2061:15 1 0x7e5662 in dp_packet_ol_send_prepare lib/dp-packet.c:638:9 2 0x96ef89 in netdev_send lib/netdev.c:940:9 3 0x818e94 in dp_netdev_pmd_flush_output_on_port lib/dpif-netdev.c:5577:9 4 0x817606 in dp_netdev_pmd_flush_output_packets lib/dpif-netdev.c:5618:27 5 0x81cfa5 in dp_netdev_process_rxq_port lib/dpif-netdev.c:5677:9 6 0x7eefe4 in dpif_netdev_run lib/dpif-netdev.c:7001:25 7 0x610e87 in type_run ofproto/ofproto-dpif.c:367:9 8 0x5b9e80 in ofproto_type_run ofproto/ofproto.c:1879:31 9 0x55bbb4 in bridge_run__ vswitchd/bridge.c:3281:9 10 0x558b6b in bridge_run vswitchd/bridge.c:3346:5 11 0x591dc5 in main vswitchd/ovs-vswitchd.c:130:9 12 0x172b89 in __libc_start_call_main (/lib64/libc.so.6+0x27b89) 13 0x172c4a in __libc_start_main@GLIBC_2.2.5 (/lib64/libc.so.6+0x27c4a) 14 0x47eff4 in _start (vswitchd/ovs-vswitchd+0x47eff4) Tests added for both IPv4 and IPv6 cases. Though IPv6 test doesn't trigger the issue it's better to have a symmetric test. Fixes: 084c8087292c ("userspace: Support VXLAN and GENEVE TSO.") Reported-at: https://mail.openvswitch.org/pipermail/ovs-discuss/2024-March/053014.html Acked-by: Mike Pattrick <mkp@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2024-03-22 20:45:37 +01:00
Eelco Chaudron	f0d1beca6c	dpif-netdev: Do not create handler threads. Avoid unnecessary thread creation as no upcalls are generated, resulting in idle threads waiting for process termination. This optimization significantly reduces memory usage, cutting it by half on a 128 CPU/thread system during testing, with the number of threads reduced from 95 to 0. Acked-by: Mike Pattrick <mkp@redhat.com> Signed-off-by: Eelco Chaudron <echaudro@redhat.com>	2024-02-29 12:24:59 +01:00
Paolo Valerio	afdc1171a8	conntrack: Handle persistent selection for IP addresses. The patch, when 'persistent' flag is specified, makes the IP selection in a range persistent across reboots. Signed-off-by: Paolo Valerio <pvalerio@redhat.com> Acked-by: Aaron Conole <aconole@redhat.com> Signed-off-by: Simon Horman <horms@ovn.org>	2024-02-21 10:12:42 +00:00
Paolo Valerio	99413ec261	conntrack: Handle random selection for port ranges. The userspace conntrack only supported hash for port selection. With the patch, both userspace and kernel datapath support the random flag. The default behavior remains the same, that is, if no flags are specified, hash is selected. Signed-off-by: Paolo Valerio <pvalerio@redhat.com> Acked-by: Aaron Conole <aconole@redhat.com> Signed-off-by: Simon Horman <horms@ovn.org>	2024-02-21 10:12:04 +00:00
Jakob Meng	5df46a44e8	dpif-netdev: Increase MAX_RECIRC_DEPTH to 8. In a scenario where OVN does load balancing and then SNAT with a OVS userspace datapath [0], the recirc_depth may be greater than 6. In that case, ovs-vswitchd might drop packets and raise warnings: dpif_netdev\|WARN\|Packet dropped. Max recirculation depth exceeded. Increasing MAX_RECIRC_DEPTH to 8 solves this issue. [0] `dd5cd73e3d/tests/system-ovn-kmod.at (L740)` Reported-at: https://issues.redhat.com/browse/FDP-251 Acked-by: Simon Horman <horms@ovn.org> Signed-off-by: Jakob Meng <code@jakobmeng.de> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2024-02-15 22:24:43 +01:00
Dexia Li	084c808729	userspace: Support VXLAN and GENEVE TSO. For userspace datapath, this patch provides vxlan and geneve tunnel tso. Only support userspace vxlan or geneve tunnel, meanwhile support tunnel outter and inner csum offload. If netdev do not support offload features, there is a software fallback.If netdev do not support vxlan and geneve tso,packets will drop. Front-end devices can close offload features by ethtool also. Acked-by: Simon Horman <horms@ovn.org> Signed-off-by: Dexia Li <dexia.li@jaguarmicro.com> Co-authored-by: Mike Pattrick <mkp@redhat.com> Signed-off-by: Mike Pattrick <mkp@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2024-01-17 22:06:45 +01:00
Viacheslav Galaktionov	14ef8b451f	lib/conntrack: Only use given packet in protocol detection. The current protocol detection logic relies on two pieces of metadata passed as arguments: tp_src and tp_dst, which represent the L4 source and destination port numbers from the flow that triggered the current flow rule first, and was responsible for creating the current DP flow. Since multiple network flows of many different kinds, potentially using different protocols on all layers, can be processed by one flow rule, using the metadata of some unrelated flow might lead to unexpected results. For example, ICMP type and code can be interpreted as TCP source and destination ports. This can confuse the code responsible for the helper selection, leading to errors in traffic handling and incorrect detection of related flows. One of the easiest ways to fix this problem is to simply remove the tp_src and tp_dst parameters from the picture. The current code base has no good use for them. The helper selection logic was based on these values and therefore needs to be changed. Ensure that the helper specified in a flow rule is used, given it is compatible with the L4 protocol of the packet. When a flow rule does not specify a helper, one can still be picked using the given packet's metadata like TCP/UDP ports. Signed-off-by: Viacheslav Galaktionov <viacheslav.galaktionov@arknetworks.am> Signed-off-by: Aaron Conole <aconole@redhat.com>	2024-01-10 16:16:08 -05:00
Kevin Traynor	4cbbf56e6c	dpif-netdev: Add per PMD sleep config. Extend 'pmd-sleep-max' so that individual PMD thread cores may have a specified max sleep request value. Existing behaviour is maintained. Any PMD thread core without a value will use the global value if set or default no sleep. To set PMD thread cores 8 and 9 to never request a load based sleep and all other PMD thread cores to be able to request a max sleep of 50 usecs: $ ovs-vsctl set open_vswitch . other_config:pmd-sleep-max=50,8:0,9:0 To set PMD thread cores 10 and 11 to request a max sleep of 100 usecs and all other PMD thread cores to never request a sleep: $ ovs-vsctl set open_vswitch . other_config:pmd-sleep-max=10:100,11:100 'pmd-sleep-show' is updated to show the max sleep value for each PMD thread. Signed-off-by: Kevin Traynor <ktraynor@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2023-12-16 01:07:59 +01:00
Ales Musil	4b9eb061b1	ct-dpif: Handle default zone limit the same way as other limits. Internally handle default CT zone limit as other limits that can be passed via the list with special value -1. Currently, the -1 is treated by both datapaths as default, add static asserts to make sure that this remains the case in the future. This allows us to easily delete the default zone limit. Signed-off-by: Ales Musil <amusil@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2023-12-05 20:42:04 +01:00
Eelco Chaudron	08212d755e	netdev-offload: Fix Clang's static analyzer 'Division by zero' warnings. When enabling DPDK with the configure the below, ovs-vswitchd will crash. ovs-vsctl set Open_vSwitch . other_config:n-offload-threads=0 ovs-vsctl set Open_vSwitch . other_config:hw-offload=true This issue arises because setting the 'n-offload-threads' value to zero is not a supported configuration. This fix addresses this by implementing a check to ensure a valid 'n-offload-threads' value, both during configuration and statistics gathering. Fixes: 62c2d8a67543 ("netdev-offload: Add multi-thread API.") Signed-off-by: Eelco Chaudron <echaudro@redhat.com> Acked-by: Ilya Maximets <i.maximets@ovn.org> Signed-off-by: Simon Horman <horms@ovn.org>	2023-10-31 15:00:17 +00:00
Zhiqi Chen	785e22f876	dpif-netdev: Fix length calculation of netdet_flow_key. The 'len' of a netdev_flow_key initialized by netdev_flow_key_init() is always zero, which may cause errors when cloning a netdev_flow_key by netdev_flow_key_clone(). Currently the 'len' member of a netdev_flow_key initialized by netdev_flow_key_init() is not used, so this error will not cause any bad behavior for now. Fixes: c82f496c3b69 ("dpif-netdev: Use unmasked key when adding datapath flows.") Acked-by: Eelco Chaudron <echaudro@redhat.com> Signed-off-by: Zhiqi Chen <chenzhiqi.123@bytedance.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2023-08-25 22:02:50 +02:00
Peng He	21410ff800	dpif-netdev: Fix dpif_netdev_flow_put. OVS allows overlapping megaflows, as long as the actions of these megaflows are equal. However, the current implementation of action modification relies on flow_lookup instead of UFID, this could result in looking up a wrong megaflow and make the ukeys and megaflows inconsistent. Just like the test case in the patch, at first we have a rule with the prefix: 10.1.2.0/24 And we will get a megaflow with prefixes 10.1.2.2/24 when a packet with IP 10.1.2.2 is received. Then suppose we change the rule into 10.1.0.0/16. OVS prefers to keep the 10.1.2.2/24 megaflow and just changes its action instead of extending the prefix into 10.1.2.2/16. Then suppose we have a 10.1.0.2 packet, since it misses the megaflow, this time, we will have an overlapping megaflow with the right prefix: 10.1.0.2/16 Now we have two megaflows: 10.1.2.2/24 10.1.0.2/16 Last, suppose we have changed the ruleset again. The revalidator this time still decides to change the actions of both megaflows instead of deleting them. The dpif_netdev_flow_put will search the megaflow to modify with unmasked keys, however it might lookup the wrong megaflow as the key 10.1.2.2 matches both 10.1.2.2/24 and 10.1.0.2/16! This patch changes the megaflow lookup code in modification path into relying the UFID to find the correct megaflow instead of key lookup. Falling back to a classifier lookup in case where UFID was not provided in order to support cases where UFID was not generated from the flow data during the flow addition. Fixes: beb75a40fdc2 ("userspace: Switching of L3 packets in L2 pipeline") Signed-off-by: Peng He <hepeng.0320@bytedance.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2023-08-14 16:48:56 +02:00
Kevin Traynor	bc6a6f82e5	dpif-netdev: Add pmd-sleep-show command. Max requested sleep time and status for a PMD thread is logged at start up or when changed, but it can be convenient to have a command to dump this information explicitly. It is envisaged that this will be expanded for individual pmds in the future, hence adding to dpif_netdev_pmd_info(). Reviewed-by: David Marchand <david.marchand@redhat.com> Signed-off-by: Kevin Traynor <ktraynor@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2023-07-15 00:17:05 +02:00
Kevin Traynor	023dcdc7a1	dpif-netdev: Rename pmd-maxsleep config option. other_config:pmd-maxsleep is a config option to allow PMD thread cores to sleep under low or no load conditions. Rename it to 'pmd-sleep-max' to allow a more structured name and so that additional options or command can follow the 'pmd-sleep-xyz' pattern. Use of other_config:pmd-maxsleep is deprecated to be removed in a future release and will result in a warning. Reviewed-by: David Marchand <david.marchand@redhat.com> Signed-off-by: Kevin Traynor <ktraynor@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2023-07-15 00:11:21 +02:00
Ilya Maximets	c2433bdfc0	dpif-netdev: Lockless meters. Current implementation of meters in the userspace datapath takes the meter lock for every packet batch. If more than one thread hits the flow with the same meter, they will lock each other. Replace the critical section with atomic operations to avoid interlocking. Meters themselves are RCU-protected, so it's safe to access them without holding a lock. Implementation does the following: 1. Tries to advance the 'used' timer of the meter with atomic compare+exchange if it's smaller than 'now'. 2. If the timer change succeeds, atomically update band buckets. 3. Atomically update packet statistics for a meter. 4. Go over buckets and try to atomically subtract the amount of packets or bytes, recording the highest exceeded band. 5. Atomically update band statistics and drop packets. Bucket manipulations are implemented with atomic compare+exchange operations with extra checks, because bucket size should never exceed the maximum and it should never go below zero. Packet statistics may be momentarily inconsistent, i.e., number of packets and the number of bytes may reflect different sets of packets. But it should be eventually consistent. And the difference at any given time should be in just few packets. For the sake of reduced code complexity PKTPS meter tries to push packets through the band one by one, even though they all have the same weight. This is also more fair if more than one thread is passing packets through the same band at the same time. Trying to predict the number of packets that can pass may also cause extra atomic operations reducing the performance. This implementation shows similar performance to the previous one, but should scale better with more threads hitting the same meter. Reviewed-by: Simon Horman <simon.horman@corigine.com> Tested-by: Lin Huang <linhuang@ruijie.com.cn> Tested-by: Zhang YuHuang <zhangyuhuang@ruijie.com.cn> Acked-by: Eelco Chaudron <echaudro@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2023-07-01 00:35:18 +02:00
Paolo Valerio	9b4d2ad8e8	conntrack: Allow to dump userspace conntrack expectations. The patch introduces a new commands ovs-appctl dpctl/dump-conntrack-exp that allows to dump the existing expectations for the userspace ct. Signed-off-by: Paolo Valerio <pvalerio@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2023-06-29 22:20:43 +02:00
Mike Pattrick	5d11c47d3e	userspace: Enable IP checksum offloading by default. The netdev receiving packets is supposed to provide the flags indicating if the IP checksum was verified and it is GOOD or BAD, otherwise the stack will check when appropriate by software. If the packet comes with good checksum, then postpone the checksum calculation to the egress device if needed. When encapsulate a packet with that flag, set the checksum of the inner IP header since that is not yet supported. Calculate the IP checksum when the packet is going to be sent over a device that doesn't support the feature. Linux devices don't support IP checksum offload alone, so the support is not enabled. Signed-off-by: Flavio Leitner <fbl@sysclose.org> Co-authored-by: Flavio Leitner <fbl@sysclose.org> Signed-off-by: Mike Pattrick <mkp@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2023-06-15 23:49:51 +02:00
Paolo Valerio	9fa612959c	ovs-dpctl: Add new command dpctl/ct-[sg]et-sweep-interval. Since 3d9c1b855a5f ("conntrack: Replace timeout based expiration lists with rculists.") the sweep interval changed as well as the constraints related to the sweeper. Being able to change the default reschedule time may be convenient in some conditions, like debugging. This patch introduces new commands allowing to get and set the sweep interval in ms. Signed-off-by: Paolo Valerio <pvalerio@redhat.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2023-04-06 22:59:25 +02:00
Eelco Chaudron	4d69c19000	ofproto-dpif-upcall: Reset ukey's last stats value if the datapath changed. When the ukey's action set changes, it could cause the flow to use a different datapath, for example, when it moves from tc to kernel. This will cause the the cached previous datapath statistics to be used. This change will reset the cached statistics when a change in datapath is discovered. Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Eelco Chaudron <echaudro@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2023-03-03 22:27:37 +01:00
Kevin Traynor	948767a18d	dpif-netdev: Set PMD load based sleep start/inc to 1 us. Now that the timer slack for the PMD threads is reduced we can also reduce the start/increment for PMD load based sleeping to match it. This will further reduce initial sleep times making it more resilient to interfaces that might be sensitive to large sleep times. Signed-off-by: Kevin Traynor <ktraynor@redhat.com> Reviewed-by: David Marchand <david.marchand@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2023-01-23 17:23:28 +01:00
David Marchand	f62629a558	dpif-netdev: Set timer slack for PMD threads. The default Linux timer slack groups timer expires into 50 uS intervals. With some traffic patterns this can mean that returning to process packets after a sleep takes too long and packets are dropped. Add a helper to util.c and set use it to reduce the timer slack for PMD threads, so that sleeps with smaller resolutions can be done to prevent sleeping for too long. Fixes: de3bbdc479a9 ("dpif-netdev: Add PMD load based sleeping.") Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2023-January/401121.html Reported-by: Ilya Maximets <i.maximets@ovn.org> Signed-off-by: David Marchand <david.marchand@redhat.com> Co-authored-by: Kevin Traynor <ktraynor@redhat.com> Signed-off-by: Kevin Traynor <ktraynor@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2023-01-23 17:23:20 +01:00
Kevin Traynor	de3bbdc479	dpif-netdev: Add PMD load based sleeping. Sleep for an incremental amount of time if none of the Rx queues assigned to a PMD have at least half a batch of packets (i.e. 16 pkts) on an polling iteration of the PMD. Upon detecting the threshold of >= 16 pkts on an Rxq, reset the sleep time to zero (i.e. no sleep). Sleep time will be increased on each iteration where the low load conditions remain up to a total of the max sleep time which is set by the user e.g: ovs-vsctl set Open_vSwitch . other_config:pmd-maxsleep=500 The default pmd-maxsleep value is 0, which means that no sleeps will occur and the default behaviour is unchanged from previously. Also add new stats to pmd-perf-show to get visibility of operation e.g. ... - sleep iterations: 153994 ( 76.8 % of iterations) Sleep time (us): 9159399 ( 59 us/iteration avg.) ... Reviewed-by: Robin Jarry <rjarry@redhat.com> Reviewed-by: David Marchand <david.marchand@redhat.com> Signed-off-by: Kevin Traynor <ktraynor@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2023-01-12 18:56:05 +01:00
Cheng Li	46e04ec31b	dpif-netdev: Calculate per numa variance. Currently, pmd_rebalance_dry_run() calculate overall variance of all pmds regardless of their numa location. The overall result may hide un-balance in an individual numa. Considering the following case. Numa0 is free because VMs on numa0 are not sending pkts, while numa1 is busy. Within numa1, pmds workloads are not balanced. Obviously, moving 500 kpps workloads from pmd 126 to pmd 62 will make numa1 much more balance. For numa1 the variance improvement will be almost 100%, because after rebalance each pmd in numa1 holds same workload(variance ~= 0). But the overall variance improvement is only about 20%, which may not trigger auto_lb. ``` numa_id core_id kpps 0 30 0 0 31 0 0 94 0 0 95 0 1 126 1500 1 127 1000 1 63 1000 1 62 500 ``` As auto_lb doesn't balance workload across numa nodes. So it makes more sense to calculate variance improvement per numa node. Signed-off-by: Cheng Li <lic121@chinatelecom.cn> Signed-off-by: Kevin Traynor <ktraynor@redhat.com> Co-authored-by: Kevin Traynor <ktraynor@redhat.com> Acked-by: Kevin Traynor <ktraynor@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2022-12-21 22:15:47 +01:00
Kevin Traynor	ad6e506fcb	dpif-netdev: Rename pmd_info_show_rxq variables. There are some similar readings taken for pmds and Rx queues in this function and a few of the variable names are ambiguous. Improve the readability of the code by updating some variables names to indicate that they are readings related to the pmd. Reviewed-by: David Marchand <david.marchand@redhat.com> Signed-off-by: Kevin Traynor <ktraynor@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2022-12-21 20:58:30 +01:00
Kevin Traynor	526230bfab	dpif-netdev: Make pmd-rxq-show time configurable. pmd-rxq-show shows the Rx queue to pmd assignments as well as the pmd usage of each Rx queue. Up until now a tail length of 60 seconds pmd usage was shown for each Rx queue, as this is the value used during rebalance to avoid any spike effects. When debugging or tuning, it is also convenient to display the pmd usage of an Rx queue over a shorter time frame, so any changes config or traffic that impact pmd usage can be evaluated more quickly. A parameter is added that allows pmd-rxq-show stats pmd usage to be shown for a shorter time frame. Values are rounded up to the nearest 5 seconds as that is the measurement granularity and the value used is displayed. e.g. $ ovs-appctl dpif-netdev/pmd-rxq-show -secs 5 Displaying last 5 seconds pmd usage % pmd thread numa_id 0 core_id 4: isolated : false port: dpdk0 queue-id: 0 (enabled) pmd usage: 95 % overhead: 4 % The default time frame has not changed and the maximum value is limited to the maximum stored tail length (60 seconds). Reviewed-by: David Marchand <david.marchand@redhat.com> Signed-off-by: Kevin Traynor <ktraynor@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2022-12-21 20:57:29 +01:00
Eelco Chaudron	c82f496c3b	dpif-netdev: Use unmasked key when adding datapath flows. The datapath supports installing wider flows, and OVS relies on this behavior. For example if ipv4(src=1.1.1.1/192.0.0.0, dst=1.1.1.2/192.0.0.0) exists, a wider flow (smaller mask) of ipv4(src=192.1.1.1/128.0.0.0,dst=192.1.1.2/128.0.0.0) is allowed to be added. However, if we try to add a wildcard rule, the installation fails: # ovs-appctl dpctl/add-flow system@myDP "in_port(1),eth_type(0x0800), \ ipv4(src=1.1.1.1/192.0.0.0,dst=1.1.1.2/192.0.0.0,frag=no)" 2 # ovs-appctl dpctl/add-flow system@myDP "in_port(1),eth_type(0x0800), \ ipv4(src=192.1.1.1/0.0.0.0,dst=49.1.1.2/0.0.0.0,frag=no)" 2 ovs-vswitchd: updating flow table (File exists) The reason is that the key used to determine if the flow is already present in the system uses the original key ANDed with the mask. This results in the IP address not being part of the (miniflow) key, i.e., being substituted with an all-zero value. When doing the actual lookup, this results in the key wrongfully matching the first flow, and therefore the flow does not get installed. The solution is to use the unmasked key for the existence check, the same way this is handled in the "slow" dpif_flow_put() case. OVS relies on the fact that overlapping flows can exist if one is a superset of the other. Note that this is only true when the same set of actions is applied. This is due to how the revalidator process works. During revalidation, OVS removes too generic flows from the datapath to avoid incorrect matches but allows too narrow flows to stay in the datapath to avoid the data plane disruption and also to avoid constant flow deletions if the datapath ignores wildcards on certain fields/bits. See flow_wildcards_has_extra() check in the revalidate_ukey__() function. The problem here is that we have a too narrow flow installed, and now OpenFlow rules got changed, so the actual flow should be more generic. Revalidators will not remove the narrow flow, and we will eventually get an upcall on the packet that doesn't match the narrow flow, but we will not be able to install a more generic flow because after masking with the new wider mask, the key matches on the narrow flow, so we get EEXIST. Fixes: beb75a40fdc2 ("userspace: Switching of L3 packets in L2 pipeline") Signed-off-by: Eelco Chaudron <echaudro@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2022-12-20 13:07:17 +01:00
Eli Britstein	76ab364ea8	netdev-offload: Set 'miss_api_supported' to be under netdev. Cited commit introduced a flag in dpif-netdev level, to optimize performance and avoid hw_miss_packet_recover() for devices with no such support. However, there is a race condition between traffic processing and assigning a 'flow_api' object to the netdev. In such case, EOPNOTSUPP is returned by netdev_hw_miss_packet_recover() in netdev-offload.c layer because 'flow_api' is not yet initialized. As a result, the flag is falsely disabled, and subsequent packets won't be recovered, though they should. In order to fix it, move the flag to be in netdev-offload layer, to avoid that race. Fixes: 6e50c1651869 ("dpif-netdev: Avoid hw_miss_packet_recover() for devices with no support.") Signed-off-by: Eli Britstein <elibr@nvidia.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2022-10-25 21:35:51 +02:00
Gaetan Rivet	6edc278c85	conntrack: Use a cmap to store zone limits. Change the data structure from hmap to cmap for zone limits. As they are shared amongst multiple conntrack users, multiple readers want to check the current zone limit state before progressing in their processing. Using a CMAP allows doing lookups without taking the global 'ct_lock', thus reducing contention. Signed-off-by: Gaetan Rivet <grive@u256.net> Reviewed-by: Eli Britstein <elibr@nvidia.com> Signed-off-by: Paolo Valerio <pvalerio@redhat.com> Acked-by: Aaron Conole <aconole@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2022-07-12 20:44:46 +02:00
Cian Ferriter	dfff8b67b2	dpif-netdev: Refactor simple match lookup functions. Make the simple match functions used during lookup non-static to allow reuse of these functions in the AVX512 DPIF. Signed-off-by: Cian Ferriter <cian.ferriter@intel.com> Tested-by: Harry van Haaren <harry.van.haaren@intel.com> Acked-by: Sunil Pai G <sunil.pai.g@intel.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2022-07-12 13:31:24 +01:00
Ilya Maximets	603bc853fb	dpif-netdev: Fix leak of AVX512 DPIF scratch pad. dp_netdev_input_outer_avx512 allocates a 16KB scratch pad per PMD thread, but it's never freed. This may cause significant memory drain in dynamic environments. ==4068109==ERROR: LeakSanitizer: detected memory leaks Direct leak of 38656 byte(s) in 2 object(s) allocated from: 0 0xf069fd in posix_memalign (vswitchd/ovs-vswitchd+0xf069fd) 1 0x1d7ed14 in xmalloc_size_align lib/util.c:254:13 2 0x1d7ed14 in xmalloc_pagealign lib/util.c:352:12 3 0x2098254 in dp_netdev_input_outer_avx512 lib/dpif-netdev-avx512.c:69:17 4 0x191591a in dp_netdev_process_rxq_port lib/dpif-netdev.c:5332:19 5 0x190a961 in pmd_thread_main lib/dpif-netdev.c:6963:17 6 0x1c4b69a in ovsthread_wrapper lib/ovs-thread.c:422:12 7 0x7fd5ea6f1179 in start_thread pthread_create.c SUMMARY: AddressSanitizer: 38656 byte(s) leaked in 2 allocation(s). Fixes: 9ac84a1a3698 ("dpif-avx512: Add ISA implementation of dpif.") Reviewed-by: David Marchand <david.marchand@redhat.com> Acked-by: Eelco Chaudron <echaudro@redhat.com> Acked-by: Kumar Amber <kumar.amber@intel.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2022-06-29 23:28:44 +02:00
Harry van Haaren	751d05b474	dpcls: Add unlisted alias for subtable lookup command. This patch adds the old name "subtable-lookup-prio-get" as an unlisted command, to restore a consistency between OVS releases for testing scripts. Fixes: 738c76a503f4 ("dpcls: Change info-get function to fetch dpcls usage stats.") Suggested-by: Eelco Chaudron <echaudro@redhat.com> Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com> Acked-by: Eelco Chaudron <echaudro@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2022-06-28 13:32:50 +02:00
Lin Huang	ba462b3589	dpif-netdev: Fix ALB 'rebalance_intvl' max hard limit. Currently the pmd-auto-lb-rebal-interval's value was not been checked properly. It maybe a negative, or too big value (>2 weeks between rebalances), which will be lead to a big unsigned value. So reset it to default if the value exceeds the max permitted as described in vswitchd.xml. Fixes: 5bf84282482a ("Adding support for PMD auto load balancing") Signed-off-by: Lin Huang <linhuang@ruijie.com.cn> Acked-by: Kevin Traynor <ktraynor@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2022-05-30 23:28:22 +02:00
Lin Huang	83c0a36472	dpif-netdev: Fix ALB parameters type mismatch. The ALB parameters should never be negative. So it's to use unsigned smap_get versions to get it properly, and update VLOG formatting. Fixes: 5bf84282482a ("Adding support for PMD auto load balancing") Signed-off-by: Lin Huang <linhuang@ruijie.com.cn> Acked-by: Kevin Traynor <ktraynor@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2022-05-30 23:28:21 +02:00
Kevin Traynor	3ecfaf1361	dpif-netdev: Restructure rxq schedule logging. Previously logging about rxq scheduling was done in a code branch with the selection of the PMD thread core after checking that a numa was selected. By splitting out the logging from the PMD thread core selection, it can simplify the code complexity and make it more extendable for future additions. Also, minor updates to a couple of variables to improve readability and fix a log indent while working on this code block. There is no user visible change in behaviour or logs. Signed-off-by: Kevin Traynor <ktraynor@redhat.com> Acked-by: David Marchand <david.marchand@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2022-05-30 22:45:51 +02:00
Kevin Traynor	37ccbd9c9d	dpif-netdev: Split function to find lowest loaded PMD thread core. This splits up the looping through each PMD thread core on a numa node with the check to compare cycles or rxqs. This is done so in future the compare could be reused with any group of PMD thread cores. There is no user visible change in behaviour. Signed-off-by: Kevin Traynor <ktraynor@redhat.com> Acked-by: David Marchand <david.marchand@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2022-05-30 22:45:51 +02:00
Kumar Amber	738c76a503	dpcls: Change info-get function to fetch dpcls usage stats. Modified the dplcs info-get command output to include the count for different dpcls implementations. $ovs-appctl dpif-netdev/subtable-lookup-info-get Available dpcls implementations: autovalidator (Use count: 1, Priority: 5) generic (Use count: 0, Priority: 1) avx512_gather (Use count: 0, Priority: 3) Test case to verify changes: 1061: PMD - dpcls configuration ok Signed-off-by: Kumar Amber <kumar.amber@intel.com> Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com> Signed-off-by: Eelco Chaudron <echaudro@redhat.com> Co-authored-by: Harry van Haaren <harry.van.haaren@intel.com> Co-authored-by: Eelco Chaudron <echaudro@redhat.com> Acked-by: Eelco Chaudron <echaudro@redhat.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2022-05-24 09:53:18 +01:00
Cian Ferriter	5ec5473304	dpif-netdev: Only hash port number when necessary. The hash of the port number is only needed when a DPCLS needs to be created. Move the hash calculation inside the if to accomplish this. Signed-off-by: Cian Ferriter <cian.ferriter@intel.com> Acked-by: Mike Pattrick <mkp@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2022-05-17 23:10:41 +02:00
Rosemarie O'Riorden	7e7083cc46	dpif-netdev: Replace loop iterating over packet batch with macro. The function dp_netdev_pmd_flush_output_on_port() iterates over the p->output_pkts batch directly, when it should be using the special iterator macro, DP_PACKET_BATCH_FOR_EACH. However, this wasn't possible because the macro could not accept &p->output_pkts. The addition of parentheses when BATCH is dereferenced allows the macro to expand properly. Parenthesizing arguments in macros is good practice to be able to handle whichever expressions are passed in. Signed-off-by: Rosemarie O'Riorden <roriorden@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2022-05-04 21:18:08 +02:00
Eelco Chaudron	9a67d883dc	dpif-netdev: Fix dp_netdev_get_pmd() function getting correct core_id. The dp_netdev_get_pmd() function is using only the hash of the core_id to get the pmd structure. So in case of hash collisions, the wrong pmd is returned. This patch is fixing this by checking for the correct core_id using the CMAP_FOR_EACH_WITH_HASH macro. Fixes: 65f13b50c5aa ("dpif-netdev: Create multiple pmd threads by default.") Signed-off-by: Eelco Chaudron <echaudro@redhat.com> Reviewed-by: David Marchand <david.marchand@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2022-04-04 22:52:12 +02:00
Kevin Traynor	c591827ec0	dpif-netdev: Fix PMD auto load balance with pmd-rxq-isolate. There are currently some checks for cross-numa polling cases to ensure that they won't effect the accuracy of the PMD ALB. If an rxq is pinned to a PMD thread core by the user it will not be reassigned by OVS, so even if it is non-local numa polled it will not impact PMD ALB accuracy. To establish this, a check was made on whether the PMD thread core was isolated or not. However, since other_config:pmd-rxq-isolate was introduced, rxqs may be pinned but the PMD thread core not isolated. It means that by setting pmd-rxq-isolate=false and doing non-local numa pinning, PMD ALB may not run where it should. If the PMD thread core is isolated we can skip individual rxq checks but if not, we should check the individual rxqs for pinning before we disallow PMD ALB. Also, update function comments to make it's operation clearer. Fixes: 6193e03267c1 ("dpif-netdev: Allow pin rxq and non-isolate PMD.") Signed-off-by: Kevin Traynor <ktraynor@redhat.com> Acked-by: Mike Pattrick <mkp@redhat.com> Acked-by: David Marchand <david.marchand@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2022-04-04 22:52:12 +02:00
Kevin Traynor	da6ce41d80	dpif-netdev: Fix non-local numa selection for more than two numas. This issue only occurs when there are more than 2 numa nodes and no local numa PMD thread cores available for an interface rxq. In the event of no PMD thread cores available on the local numa for an rxq to be assigned to, a PMD thread core from a non-local numa is selected. If there are more than one non-local numas with PMD thread cores they are RR through and checked if they have non-isolated PMD thread cores. When successfully finding a non-local numa with available PMD thread cores for an rxq, that numa was not being stored. It meant if a similar situation occurred for a subsequent rxq, the same numa would be selected again. Store the last numa used when successfully finding a non-local numa with available PMD thread cores, so the numa RR state is kept for subsequent rxqs. Fixes: f577c2d046b2 ("dpif-netdev: Rework rxq scheduling code.") Signed-off-by: Kevin Traynor <ktraynor@redhat.com> Acked-by: Mike Pattrick <mkp@redhat.com> Acked-by: David Marchand <david.marchand@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2022-04-04 22:52:12 +02:00
Kevin Traynor	4b5c3b66aa	dpif-netdev: Fix typo in function name. Rename pmd_reblance_dry_run_needed() to pmd_rebalance_dry_run_needed(). Fixes: a83a406096e9 ("dpif-netdev: Sync PMD ALB state with user commands.") Signed-off-by: Kevin Traynor <ktraynor@redhat.com> Acked-by: Mike Pattrick <mkp@redhat.com> Acked-by: David Marchand <david.marchand@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2022-04-04 22:52:12 +02:00

1 2 3 4 5 ...

829 Commits