When DPDK is enabled with the configuration below, ovs-vswitchd will crash.
ovs-vsctl set Open_vSwitch . other_config:n-offload-threads=0
ovs-vsctl set Open_vSwitch . other_config:hw-offload=true
This issue arises because setting 'n-offload-threads' to zero is not a
supported configuration. Fix this by adding a check that ensures a
valid 'n-offload-threads' value, both during configuration and during
statistics gathering.
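A minimal sketch of the kind of validation implied (names illustrative,
not the actual patch):
```
/* Illustrative only: fall back to the default instead of accepting
 * a zero thread count. */
#define DEFAULT_OFFLOAD_THREAD_NB 1

static unsigned int
validated_offload_thread_nb(unsigned int requested)
{
    if (requested == 0) {
        /* Zero offload threads is not a supported configuration. */
        return DEFAULT_OFFLOAD_THREAD_NB;
    }
    return requested;
}
```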
Fixes: 62c2d8a67543 ("netdev-offload: Add multi-thread API.")
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Ilya Maximets <i.maximets@ovn.org>
Signed-off-by: Simon Horman <horms@ovn.org>
The 'len' of a netdev_flow_key initialized by netdev_flow_key_init()
is always zero, which may cause errors when cloning a netdev_flow_key
by netdev_flow_key_clone().
Currently the 'len' member initialized by netdev_flow_key_init() is not
used, so this error does not cause any bad behavior for now.
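A sketch of the kind of fix implied, assuming the netdev_flow_key_size()
and miniflow_n_values() helpers used elsewhere in dpif-netdev.c:
```
/* Sketch: also fill in 'len' during initialization so that
 * netdev_flow_key_clone() copies the correct number of bytes. */
key->len = netdev_flow_key_size(miniflow_n_values(&key->mf));
```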
Fixes: c82f496c3b69 ("dpif-netdev: Use unmasked key when adding datapath flows.")
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Zhiqi Chen <chenzhiqi.123@bytedance.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
OVS allows overlapping megaflows, as long as the actions of these
megaflows are equal. However, the current implementation of action
modification relies on flow_lookup instead of UFID, this could result
in looking up a wrong megaflow and make the ukeys and megaflows
inconsistent.
Just like the test case in the patch, at first we have a rule with the
prefix:
10.1.2.0/24
And we will get a megaflow with the prefix 10.1.2.2/24 when a packet with
IP 10.1.2.2 is received.
Then suppose we change the rule into 10.1.0.0/16. OVS prefers to keep
the 10.1.2.2/24 megaflow and just changes its action instead of
extending the prefix into 10.1.2.2/16.
Then suppose a 10.1.0.2 packet arrives. Since it misses the megaflow,
this time we will get an overlapping megaflow with the right prefix:
10.1.0.2/16
Now we have two megaflows:
10.1.2.2/24
10.1.0.2/16
Last, suppose we have changed the ruleset again. The revalidator this
time still decides to change the actions of both megaflows instead of
deleting them.
dpif_netdev_flow_put() will search for the megaflow to modify using
unmasked keys; however, it might look up the wrong megaflow, as the key
10.1.2.2 matches both 10.1.2.2/24 and 10.1.0.2/16!
This patch changes the megaflow lookup code in the modification path to
rely on the UFID to find the correct megaflow instead of a key lookup.
It falls back to a classifier lookup when no UFID was provided, in
order to support cases where the UFID was not generated from the flow
data during flow addition.
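An illustrative sketch of the resulting lookup order (simplified;
dp_netdev_pmd_find_flow_by_ufid() is a hypothetical name, while
dp_netdev_pmd_lookup_flow() is the existing classifier lookup):
```
static struct dp_netdev_flow *
lookup_flow_for_modify(struct dp_netdev_pmd_thread *pmd,
                       const ovs_u128 *ufid,
                       const struct netdev_flow_key *key)
{
    if (ufid) {
        /* Exact match on the unique flow ID; immune to overlapping
         * megaflows. */
        return dp_netdev_pmd_find_flow_by_ufid(pmd, ufid);
    }
    /* No UFID provided: fall back to a classifier lookup. */
    return dp_netdev_pmd_lookup_flow(pmd, key, NULL);
}
```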
Fixes: beb75a40fdc2 ("userspace: Switching of L3 packets in L2 pipeline")
Signed-off-by: Peng He <hepeng.0320@bytedance.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The max requested sleep time and status for a PMD thread are logged at
startup or when changed, but it can be convenient to have a command
to dump this information explicitly.
It is envisaged that this will be expanded for individual
pmds in the future, hence adding to dpif_netdev_pmd_info().
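For example, assuming the new command follows the existing dpif-netdev
naming pattern:
$ ovs-appctl dpif-netdev/pmd-sleep-show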
Reviewed-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
other_config:pmd-maxsleep is a config option to allow
PMD thread cores to sleep under low or no load conditions.
Rename it to 'pmd-sleep-max' to allow a more structured name, so that
additional options or commands can follow the 'pmd-sleep-xyz' pattern.
Use of other_config:pmd-maxsleep is deprecated, will be removed in a
future release, and now results in a warning.
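For example, the renamed option is set the same way as before:
ovs-vsctl set Open_vSwitch . other_config:pmd-sleep-max=500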
Reviewed-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Current implementation of meters in the userspace datapath takes
the meter lock for every packet batch. If more than one thread
hits the flow with the same meter, they will lock each other.
Replace the critical section with atomic operations to avoid
interlocking. Meters themselves are RCU-protected, so it's safe
to access them without holding a lock.
Implementation does the following:
1. Tries to advance the 'used' timer of the meter with atomic
compare+exchange if it's smaller than 'now'.
2. If the timer change succeeds, atomically update band buckets.
3. Atomically update packet statistics for a meter.
4. Go over buckets and try to atomically subtract the amount of
packets or bytes, recording the highest exceeded band.
5. Atomically update band statistics and drop packets.
Bucket manipulations are implemented with atomic compare+exchange
operations with extra checks, because bucket size should never
exceed the maximum and it should never go below zero.
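A sketch of one such bucket update using the OVS atomics API
(simplified; not the exact meter code):
```
static bool
bucket_try_consume(atomic_uint64_t *bucket, uint64_t cost)
{
    uint64_t current;

    atomic_read_relaxed(bucket, &current);
    do {
        if (current < cost) {
            return false;   /* Band exceeded; never go below zero. */
        }
        /* On failure 'current' is reloaded and the check repeats. */
    } while (!atomic_compare_exchange_weak(bucket, &current,
                                           current - cost));
    return true;
}
```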
Packet statistics may be momentarily inconsistent, i.e., number
of packets and the number of bytes may reflect different sets
of packets, but they should be eventually consistent, and the
difference at any given time should be just a few packets.
For the sake of reduced code complexity, the PKTPS meter tries to push
packets through the band one by one, even though they all have
the same weight. This is also more fair if more than one thread
is passing packets through the same band at the same time.
Trying to predict the number of packets that can pass may also
cause extra atomic operations reducing the performance.
This implementation shows similar performance to the previous one,
but should scale better with more threads hitting the same meter.
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Tested-by: Lin Huang <linhuang@ruijie.com.cn>
Tested-by: Zhang YuHuang <zhangyuhuang@ruijie.com.cn>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
This patch introduces a new command, 'ovs-appctl
dpctl/dump-conntrack-exp', that dumps the existing expectations for the
userspace connection tracker.
Signed-off-by: Paolo Valerio <pvalerio@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The netdev receiving packets is supposed to provide flags indicating
whether the IP checksum was verified and whether it is GOOD or BAD;
otherwise the stack will verify it in software when appropriate.
If the packet comes with a good checksum, postpone the checksum
calculation to the egress device if needed.
When encapsulating a packet with that flag, set the checksum of the
inner IP header, since offloading that case is not yet supported.
Calculate the IP checksum when the packet is going to be sent over
a device that doesn't support the feature.
Linux devices don't support IP checksum offload alone, so the
support is not enabled.
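An illustrative sketch of the egress decision described above (helper
names are hypothetical, not the actual dp-packet API):
```
static void
ip_csum_resolve_on_egress(struct dp_packet *pkt, bool dev_can_offload)
{
    if (ip_csum_already_good(pkt)) {
        /* Verified on ingress; nothing left to do. */
        return;
    }
    if (!dev_can_offload) {
        /* The egress device lacks IP checksum offload: compute the
         * checksum in software before transmit. */
        ip_csum_fill_in_software(pkt);
    }
}
```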
Signed-off-by: Flavio Leitner <fbl@sysclose.org>
Co-authored-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Since 3d9c1b855a5f ("conntrack: Replace timeout based expiration lists
with rculists.") the sweep interval changed as well as the constraints
related to the sweeper.
Being able to change the default reschedule time may be convenient in
some conditions, like debugging.
This patch introduces new commands to get and set the sweep interval
in ms.
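Assuming the commands follow the existing dpctl/ct-* naming, usage
would look like:
$ ovs-appctl dpctl/ct-set-sweep-interval 5000
$ ovs-appctl dpctl/ct-get-sweep-interval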
Signed-off-by: Paolo Valerio <pvalerio@redhat.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
When the ukey's action set changes, it could cause the flow to use a
different datapath, for example, when it moves from tc to kernel.
This would cause the previously cached datapath statistics to be used.
This change resets the cached statistics when a change in datapath is
detected.
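An illustrative sketch of the reset (field names are hypothetical):
```
/* Hypothetical fields: drop cached stats when the flow moved to a
 * different datapath implementation. */
if (strcmp(ukey->dp_layer, flow->attrs.dp_layer)) {
    memset(&ukey->stats, 0, sizeof ukey->stats);
    ukey->dp_layer = flow->attrs.dp_layer;
}
```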
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Now that the timer slack for the PMD threads is reduced we can also
reduce the start/increment for PMD load based sleeping to match it.
This will further reduce initial sleep times making it more resilient
to interfaces that might be sensitive to large sleep times.
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Reviewed-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The default Linux timer slack groups timer expirations into 50 us intervals.
With some traffic patterns this can mean that returning to process
packets after a sleep takes too long and packets are dropped.
Add a helper to util.c and use it to reduce the timer slack for PMD
threads, so that sleeps with finer resolution can be done to prevent
sleeping for too long.
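A minimal sketch of such a helper on Linux; PR_SET_TIMERSLACK applies
to the calling thread only, so each PMD thread must call it (the
actual helper name in util.c may differ):
```
#include <sys/prctl.h>

static void
set_timer_slack_ns(unsigned long slack_ns)
{
    if (prctl(PR_SET_TIMERSLACK, slack_ns, 0, 0, 0) == -1) {
        /* Non-fatal: the kernel default (50 us) stays in effect. */
    }
}
```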
Fixes: de3bbdc479a9 ("dpif-netdev: Add PMD load based sleeping.")
Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2023-January/401121.html
Reported-by: Ilya Maximets <i.maximets@ovn.org>
Signed-off-by: David Marchand <david.marchand@redhat.com>
Co-authored-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Sleep for an incremental amount of time if none of the Rx queues
assigned to a PMD have at least half a batch of packets (i.e. 16 pkts)
on a polling iteration of the PMD.
Upon detecting the threshold of >= 16 pkts on an Rxq, reset the
sleep time to zero (i.e. no sleep).
Sleep time will be increased on each iteration where the low-load
conditions remain, up to the max sleep time set by the user, e.g.:
ovs-vsctl set Open_vSwitch . other_config:pmd-maxsleep=500
The default pmd-maxsleep value is 0, which means that no sleeps
will occur and the default behaviour is unchanged from previously.
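An illustrative sketch of the per-iteration logic (variable and
constant names are not the actual dpif-netdev ones):
```
if (rx_pkts >= NETDEV_MAX_BURST / 2) {
    cur_sleep_us = 0;                    /* Load detected: no sleep. */
} else if (max_sleep_us && cur_sleep_us < max_sleep_us) {
    cur_sleep_us += SLEEP_INCREMENT_US;  /* Back off a little more. */
}
if (cur_sleep_us) {
    xnanosleep(cur_sleep_us * 1000);     /* Sleep, then poll again. */
}
```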
Also add new stats to pmd-perf-show to give visibility of the
operation, e.g.:
...
- sleep iterations: 153994 ( 76.8 % of iterations)
Sleep time (us): 9159399 ( 59 us/iteration avg.)
...
Reviewed-by: Robin Jarry <rjarry@redhat.com>
Reviewed-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Currently, pmd_rebalance_dry_run() calculates the overall variance of
all pmds regardless of their numa location. The overall result may hide
imbalance on an individual numa node.
Consider the following case. Numa0 is free because VMs on numa0 are not
sending pkts, while numa1 is busy. Within numa1, pmd workloads are not
balanced. Obviously, moving 500 kpps of workload from pmd 126 to pmd 62
will make numa1 much more balanced. For numa1 the variance improvement
will be almost 100%, because after rebalance each pmd in numa1 holds
the same workload (variance ~= 0). But the overall variance improvement
is only about 20%, which may not trigger auto_lb.
```
numa_id core_id kpps
0 30 0
0 31 0
0 94 0
0 95 0
1 126 1500
1 127 1000
1 63 1000
1 62 500
```
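For reference, a rough per-numa variance calculation on the kpps
figures above (the real code uses pmd cycles, so the exact overall
percentage differs):
```
numa1 before: {1500, 1000, 1000, 500}, mean = 1000
              variance = (500^2 + 0 + 0 + 500^2) / 4 = 125000
numa1 after:  {1000, 1000, 1000, 1000}, mean = 1000
              variance = 0  ->  per-numa improvement ~= 100%
```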
As auto_lb doesn't balance workloads across numa nodes, it makes more
sense to calculate the variance improvement per numa node.
Signed-off-by: Cheng Li <lic121@chinatelecom.cn>
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Co-authored-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
There are some similar readings taken for pmds and Rx queues
in this function and a few of the variable names are ambiguous.
Improve the readability of the code by updating some variable names to
indicate that they are readings related to the pmd.
Reviewed-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
pmd-rxq-show shows the Rx queue to pmd assignments as well as the
pmd usage of each Rx queue.
Up until now, pmd usage over a 60-second tail was shown for each Rx
queue, as this is the value used during rebalance to avoid any spike
effects.
When debugging or tuning, it is also convenient to display the pmd
usage of an Rx queue over a shorter time frame, so any config or
traffic changes that impact pmd usage can be evaluated more quickly.
A parameter is added that allows pmd-rxq-show to display pmd usage over
a shorter time frame. Values are rounded up to the nearest 5 seconds,
as that is the measurement granularity, and the value used is
displayed, e.g.:
$ ovs-appctl dpif-netdev/pmd-rxq-show -secs 5
Displaying last 5 seconds pmd usage %
pmd thread numa_id 0 core_id 4:
isolated : false
port: dpdk0 queue-id: 0 (enabled) pmd usage: 95 %
overhead: 4 %
The default time frame has not changed and the maximum value
is limited to the maximum stored tail length (60 seconds).
Reviewed-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The datapath supports installing wider flows, and OVS relies on
this behavior. For example if ipv4(src=1.1.1.1/192.0.0.0,
dst=1.1.1.2/192.0.0.0) exists, a wider flow (smaller mask) of
ipv4(src=192.1.1.1/128.0.0.0,dst=192.1.1.2/128.0.0.0) is allowed
to be added.
However, if we try to add a wildcard rule, the installation fails:
# ovs-appctl dpctl/add-flow system@myDP "in_port(1),eth_type(0x0800), \
ipv4(src=1.1.1.1/192.0.0.0,dst=1.1.1.2/192.0.0.0,frag=no)" 2
# ovs-appctl dpctl/add-flow system@myDP "in_port(1),eth_type(0x0800), \
ipv4(src=192.1.1.1/0.0.0.0,dst=49.1.1.2/0.0.0.0,frag=no)" 2
ovs-vswitchd: updating flow table (File exists)
The reason is that the key used to determine if the flow is already
present in the system uses the original key ANDed with the mask.
This results in the IP address not being part of the (miniflow) key,
i.e., being substituted with an all-zero value. When doing the actual
lookup, this results in the key wrongfully matching the first flow,
and therefore the flow does not get installed. The solution is to use
the unmasked key for the existence check, the same way this is handled
in the "slow" dpif_flow_put() case.
OVS relies on the fact that overlapping flows can exist if one is a
superset of the other. Note that this is only true when the same set
of actions is applied. This is due to how the revalidator process
works. During revalidation, OVS removes too generic flows from the
datapath to avoid incorrect matches but allows too narrow flows to
stay in the datapath to avoid the data plane disruption and also to
avoid constant flow deletions if the datapath ignores wildcards on
certain fields/bits. See flow_wildcards_has_extra() check in the
revalidate_ukey__() function.
The problem here is that we have a too narrow flow installed, and now
OpenFlow rules got changed, so the actual flow should be more generic.
Revalidators will not remove the narrow flow, and we will eventually get
an upcall on the packet that doesn't match the narrow flow, but we will
not be able to install a more generic flow because after masking with
the new wider mask, the key matches on the narrow flow, so we get EEXIST.
Fixes: beb75a40fdc2 ("userspace: Switching of L3 packets in L2 pipeline")
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The cited commit introduced a flag at the dpif-netdev level to optimize
performance and avoid hw_miss_packet_recover() for devices with no such
support.
However, there is a race condition between traffic processing and
assigning a 'flow_api' object to the netdev. In that case, EOPNOTSUPP
is returned by netdev_hw_miss_packet_recover() in the netdev-offload.c
layer because 'flow_api' is not yet initialized. As a result, the flag
is falsely disabled, and subsequent packets won't be recovered, though
they should be.
To fix this, move the flag into the netdev-offload layer, avoiding
that race.
Fixes: 6e50c1651869 ("dpif-netdev: Avoid hw_miss_packet_recover() for devices with no support.")
Signed-off-by: Eli Britstein <elibr@nvidia.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Change the data structure for zone limits from hmap to cmap. Zone
limits are shared amongst multiple conntrack users, and multiple
readers want to check the current zone limit state before progressing
in their processing. Using a cmap allows doing lookups without taking
the global 'ct_lock', thus reducing contention.
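An illustrative sketch of the resulting read path (structure and field
names are hypothetical; the cmap API is the real one):
```
static struct zone_limit *
zone_limit_lookup(const struct cmap *zone_limits, uint16_t zone)
{
    struct zone_limit *zl;
    uint32_t hash = hash_int(zone, 0);

    /* RCU-protected: no 'ct_lock' needed on this path. */
    CMAP_FOR_EACH_WITH_HASH (zl, node, hash, zone_limits) {
        if (zl->zone == zone) {
            return zl;
        }
    }
    return NULL;
}
```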
Signed-off-by: Gaetan Rivet <grive@u256.net>
Reviewed-by: Eli Britstein <elibr@nvidia.com>
Signed-off-by: Paolo Valerio <pvalerio@redhat.com>
Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Make the simple match functions used during lookup non-static to allow
reuse of these functions in the AVX512 DPIF.
Signed-off-by: Cian Ferriter <cian.ferriter@intel.com>
Tested-by: Harry van Haaren <harry.van.haaren@intel.com>
Acked-by: Sunil Pai G <sunil.pai.g@intel.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
dp_netdev_input_outer_avx512 allocates a 16KB scratch pad per PMD
thread, but it's never freed. This may cause significant memory
drain in dynamic environments.
==4068109==ERROR: LeakSanitizer: detected memory leaks
Direct leak of 38656 byte(s) in 2 object(s) allocated from:
0 0xf069fd in posix_memalign (vswitchd/ovs-vswitchd+0xf069fd)
1 0x1d7ed14 in xmalloc_size_align lib/util.c:254:13
2 0x1d7ed14 in xmalloc_pagealign lib/util.c:352:12
3 0x2098254 in dp_netdev_input_outer_avx512 lib/dpif-netdev-avx512.c:69:17
4 0x191591a in dp_netdev_process_rxq_port lib/dpif-netdev.c:5332:19
5 0x190a961 in pmd_thread_main lib/dpif-netdev.c:6963:17
6 0x1c4b69a in ovsthread_wrapper lib/ovs-thread.c:422:12
7 0x7fd5ea6f1179 in start_thread pthread_create.c
SUMMARY: AddressSanitizer: 38656 byte(s) leaked in 2 allocation(s).
Fixes: 9ac84a1a3698 ("dpif-avx512: Add ISA implementation of dpif.")
Reviewed-by: David Marchand <david.marchand@redhat.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Kumar Amber <kumar.amber@intel.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
This patch adds the old name "subtable-lookup-prio-get" as an unlisted
command, to restore consistency between OVS releases for testing
scripts.
Fixes: 738c76a503f4 ("dpcls: Change info-get function to fetch dpcls usage stats.")
Suggested-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Currently, the pmd-auto-lb-rebal-interval value is not checked
properly. It may be negative, or too big a value (>2 weeks between
rebalances), which leads to a big unsigned value. Reset it to the
default if the value exceeds the maximum permitted, as described in
vswitchd.xml.
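An illustrative sketch of the validation, assuming an unsigned smap
getter such as smap_get_uint() and constant names that may differ from
the actual code:
```
rebal_interval = smap_get_uint(other_config,
                               "pmd-auto-lb-rebal-interval",
                               ALB_REBAL_INTERVAL_DEFAULT);
if (rebal_interval > ALB_REBAL_INTERVAL_MAX) {
    /* Out of the documented range: fall back to the default. */
    rebal_interval = ALB_REBAL_INTERVAL_DEFAULT;
}
```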
Fixes: 5bf84282482a ("Adding support for PMD auto load balancing")
Signed-off-by: Lin Huang <linhuang@ruijie.com.cn>
Acked-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The ALB parameters should never be negative, so use the unsigned
smap_get versions to read them properly, and update the VLOG
formatting.
Fixes: 5bf84282482a ("Adding support for PMD auto load balancing")
Signed-off-by: Lin Huang <linhuang@ruijie.com.cn>
Acked-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Previously, logging about rxq scheduling was done in the same code
branch as the selection of the PMD thread core, after checking that a
numa node was selected.
By splitting out the logging from the PMD thread core selection, it can
simplify the code complexity and make it more extendable for future
additions.
Also, minor updates to a couple of variables to improve readability and
fix a log indent while working on this code block.
There is no user visible change in behaviour or logs.
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
This splits up the looping through each PMD thread core on a numa node
with the check to compare cycles or rxqs.
This is done so in future the compare could be reused with any group
of PMD thread cores.
There is no user visible change in behaviour.
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Modified the dpcls info-get command output to include the use count for
the different dpcls implementations.
$ ovs-appctl dpif-netdev/subtable-lookup-info-get
Available dpcls implementations:
autovalidator (Use count: 1, Priority: 5)
generic (Use count: 0, Priority: 1)
avx512_gather (Use count: 0, Priority: 3)
Test case to verify changes:
1061: PMD - dpcls configuration ok
Signed-off-by: Kumar Amber <kumar.amber@intel.com>
Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Co-authored-by: Harry van Haaren <harry.van.haaren@intel.com>
Co-authored-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
The hash of the port number is only needed when a DPCLS needs to be
created. Move the hash calculation inside the if to accomplish this.
Signed-off-by: Cian Ferriter <cian.ferriter@intel.com>
Acked-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The function dp_netdev_pmd_flush_output_on_port() iterates over the
p->output_pkts batch directly, when it should be using the special
iterator macro, DP_PACKET_BATCH_FOR_EACH.
However, this wasn't possible because the macro could not accept
&p->output_pkts.
The addition of parentheses when BATCH is dereferenced allows the macro
to expand properly. Parenthesizing macro arguments is good practice, so
the macro can handle whatever expressions are passed in.
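A simplified illustration of the fix; without the parentheses, passing
'&p->output_pkts' as BATCH would expand to '&p->output_pkts->count',
which does not compile:
```
#define DP_PACKET_BATCH_FOR_EACH(IDX, PACKET, BATCH)        \
    for (size_t IDX = 0; IDX < (BATCH)->count; IDX++)       \
        if (PACKET = (BATCH)->packets[IDX], true)
```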
Signed-off-by: Rosemarie O'Riorden <roriorden@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The dp_netdev_get_pmd() function uses only the hash of the core_id to
get the pmd structure, so in case of a hash collision, the wrong pmd is
returned.
This patch fixes this by checking for the correct core_id using the
CMAP_FOR_EACH_WITH_HASH macro.
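A simplified sketch of the fixed lookup, verifying core_id instead of
trusting the hash alone:
```
static struct dp_netdev_pmd_thread *
dp_netdev_get_pmd(struct dp_netdev *dp, unsigned core_id)
{
    struct dp_netdev_pmd_thread *pmd;

    CMAP_FOR_EACH_WITH_HASH (pmd, node, hash_int(core_id, 0),
                             &dp->poll_threads) {
        if (pmd->core_id == core_id) {
            /* Hash matched *and* core_id matched. */
            return dp_netdev_pmd_try_ref(pmd) ? pmd : NULL;
        }
    }
    return NULL;
}
```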
Fixes: 65f13b50c5aa ("dpif-netdev: Create multiple pmd threads by default.")
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Reviewed-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
There are currently some checks for cross-numa polling cases to ensure
that they won't affect the accuracy of the PMD ALB.
If an rxq is pinned to a PMD thread core by the user it will not
be reassigned by OVS, so even if it is non-local numa polled it
will not impact PMD ALB accuracy.
To establish this, a check was made on whether the PMD thread core
was isolated or not. However, since other_config:pmd-rxq-isolate was
introduced, rxqs may be pinned but the PMD thread core not isolated.
It means that by setting pmd-rxq-isolate=false and doing non-local
numa pinning, PMD ALB may not run where it should.
If the PMD thread core is isolated we can skip individual rxq checks
but if not, we should check the individual rxqs for pinning before we
disallow PMD ALB.
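Illustrative pseudologic of that check (structure and helper names are
hypothetical, not the actual dpif-netdev code):
```
static bool
pmd_blocks_alb(const struct pmd_info *pmd)
{
    const struct rxq_info *rxq;

    if (pmd->isolated) {
        return false;       /* Only pinned rxqs here: ALB is safe. */
    }
    LIST_FOR_EACH (rxq, node, &pmd->rxqs) {
        if (!rxq->pinned && rxq->numa_id != pmd->numa_id) {
            return true;    /* Unpinned cross-numa rxq: block ALB. */
        }
    }
    return false;
}
```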
Also, update function comments to make its operation clearer.
Fixes: 6193e03267c1 ("dpif-netdev: Allow pin rxq and non-isolate PMD.")
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: Mike Pattrick <mkp@redhat.com>
Acked-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
This issue only occurs when there are more than 2 numa nodes
and no local numa PMD thread cores available for an interface
rxq.
In the event of no PMD thread cores available on the local numa
for an rxq to be assigned to, a PMD thread core from a non-local
numa is selected.
If there is more than one non-local numa with PMD thread cores, they
are round-robined through and checked for non-isolated PMD thread
cores.
When successfully finding a non-local numa with available PMD
thread cores for an rxq, that numa was not being stored. It meant
if a similar situation occurred for a subsequent rxq, the same numa
would be selected again.
Store the last numa used when successfully finding a non-local numa
with available PMD thread cores, so the numa RR state is kept for subsequent
rxqs.
Fixes: f577c2d046b2 ("dpif-netdev: Rework rxq scheduling code.")
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: Mike Pattrick <mkp@redhat.com>
Acked-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Rename pmd_reblance_dry_run_needed() to
pmd_rebalance_dry_run_needed().
Fixes: a83a406096e9 ("dpif-netdev: Sync PMD ALB state with user commands.")
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: Mike Pattrick <mkp@redhat.com>
Acked-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Using the SHORT version of the *_SAFE loops makes the code cleaner and
less error-prone. So, use the SHORT version and remove the extra
variable when possible for hmap and all its derived types.
In order to be able to use both long and short versions without changing
the name of the macro for all the clients, overload the existing name
and select the appropriate version depending on the number of arguments.
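A simplified illustration of the argument-count overload (the real
macros live in lib/hmap.h and friends):
```
#define SAFE_MACRO_SELECT(_1, _2, _3, _4, NAME, ...) NAME
#define HMAP_FOR_EACH_SAFE(...)                                     \
    SAFE_MACRO_SELECT(__VA_ARGS__,                                  \
                      HMAP_FOR_EACH_SAFE_LONG,   /* 4 arguments. */ \
                      HMAP_FOR_EACH_SAFE_SHORT)  /* 3 arguments. */ \
                     (__VA_ARGS__)
```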
Acked-by: Dumitru Ceara <dceara@redhat.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
A flow may be modified after its initial offload failed. In this case,
according to [1], the modification is handled as a flow add.
For a vport flow "add", the orig_in_port should be provided.
Keep that field in the flow struct, so it can be provided in the flow
modification use case.
[1] 0d25621e4d9f ("dpif-netdev: Fix flow modification after failure.")
Fixes: b5e6f6f6bfbe ("dpif-netdev: Provide orig_in_port in metadata for tunneled packets.")
Signed-off-by: Eli Britstein <elibr@nvidia.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The same pattern for atomic stores and initialization was used for the
DPIF and MFEX function pointers declared in struct dp_netdev_pmd_thread.
Simplify this pattern for all stores to 'miniflow_extract_opt' and
'netdev_input_func'.
Also replace the first store to 'miniflow_extract_opt', which was an
atomic_store_relaxed(), with atomic_init().
Signed-off-by: Cian Ferriter <cian.ferriter@intel.com>
Acked-by: Kumar Amber <kumar.amber@intel.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The PMD reference taken is not actually used; it is only needed to
reach the linked dp_netdev. Additionally, taking the PMD reference does
not protect against the disappearance of the dp_netdev, so it is
misleading.
The dp reference is protected by the way the ports are being deleted
during datapath deletion. No further offload request should be found
past a flush, so it is safe to keep this reference in the offload item.
Signed-off-by: Gaetan Rivet <grive@u256.net>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
In dp_netdev_pmd_remove_flow() we schedule the deletion of an
offloaded flow, if a mark has been assigned to the flow. But if
this occurs in the window in which the offload thread completes
offloading the flow and assigns a mark to the flow, then we miss
deleting the flow. This problem has been observed while adding
and deleting flows in a loop. To fix this, always enqueue flow
deletion regardless of the flow->mark being set.
Fixes: 241bad15d99a ("dpif-netdev: associate flow with a mark id")
Co-authored-by: Gaetan Rivet <grive@u256.net>
Signed-off-by: Gaetan Rivet <grive@u256.net>
Signed-off-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Port flush and offload uninit should be moved after the datapath
has been reconfigured. That way, no other thread such as PMDs will
find this port to poll and enqueue further offload requests.
After a flush, almost no further offload request for this port should
be found in the queue.
There will still be some issued by revalidators, but they
will be caught when the offload thread fails to take a netdev ref.
This change fixes the issue of datapath reference being improperly
accessed by offload threads while it is being destroyed.
Fixes: 5b0aa55776cb ("dpif-netdev: Execute flush from offload thread.")
Fixes: 62d1c28e9ce0 ("dpif-netdev: Flush offload rules upon port deletion.")
Suggested-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
Signed-off-by: Gaetan Rivet <grive@u256.net>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The subtable search function can be used at any time by a PMD thread.
Setting the subtable search function should be done atomically to
prevent garbage data from being read.
A dpcls_subtable_lookup_reprobe() call can happen at the same time that
DPCLS subtables are being sorted. This could lead to modifications done
by the reprobe() to be lost. Prevent this from happening by locking on
pmd->flow_mutex. After this change both the reprobe function and a
subtable sort will share the flow_mutex preventing modifications by
either one from being lost.
Also remove the pvector_publish() call. The pvector is not being changed
in dpcls_subtable_lookup_reprobe(), only the data pointed to by pointers
in the vector are being changed.
Fixes: 3d018c3ea79d ("dpif-netdev: add subtable lookup prio set command.")
Reported-by: Ilya Maximets <i.maximets@ovn.org>
Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2022-January/390757.html
Signed-off-by: Cian Ferriter <cian.ferriter@intel.com>
Acked-by: Ilya Maximets <i.maximets@ovn.org>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
It's actually undefined behavior to pass NULL to standard library
functions that manipulate arrays (e.g., qsort, memcpy, memcmp), even if
the passed number of items is 0.
UB Sanitizer reports:
ovsdb/monitor.c:408:9: runtime error: null pointer passed as argument 1,
which is declared to never be null
#0 0x406ae1 in ovsdb_monitor_columns_sort ovsdb/monitor.c:408
#1 0x406ae1 in ovsdb_monitor_add ovsdb/monitor.c:1683
[...]
lib/ovsdb-data.c:1970:5: runtime error: null pointer passed as argument 2,
which is declared to never be null
#0 0x4071c8 in ovsdb_datum_push_unsafe lib/ovsdb-data.c:1970
#1 0x471cd0 in ovsdb_datum_apply_diff_in_place lib/ovsdb-data.c:2345
[...]
ofproto/ofproto-dpif-rid.c:159:17:
runtime error: null pointer passed as argument 1,
which is declared to never be null
#0 0x4df5d8 in frozen_state_equal ofproto/ofproto-dpif-rid.c:159
#1 0x4dfd27 in recirc_find_equal ofproto/ofproto-dpif-rid.c:179
[...]
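A minimal sketch of the kind of guard applied (the comparator name is
illustrative):
```
/* Skip the libc call entirely when the array is empty. */
if (n_columns) {
    qsort(columns, n_columns, sizeof *columns, compare_columns);
}
```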
Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Dumitru Ceara <dceara@redhat.com>
Acked-by: Paolo Valerio <pvalerio@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Read the user configuration in the netdev-offload module to modify the
number of threads used to manage hardware offload requests.
This allows processing insertion, deletion and modification
concurrently.
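For example, using the same knob shown at the top of this series:
ovs-vsctl set Open_vSwitch . other_config:n-offload-threads=3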
The offload thread structure was modified to contain all needed
elements. This structure is replicated per requested thread, and each
instance is used separately.
Signed-off-by: Gaetan Rivet <grive@u256.net>
Reviewed-by: Eli Britstein <elibr@nvidia.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The port mutex protects the netdev mapping, which can be changed by
port addition or deletion. HW offload operations can be considered read
operations on the port mapping itself. Use a rwlock to differentiate
between read and write operations, allowing concurrent queries and
offload insertions.
Because offload queries, deletion, and reconfigure_datapath() calls are
all done under rdlock, the deadlock fixed by [1] is still avoided, as
the rdlock side is recursive, as prescribed by the POSIX standard.
Executing 'reconfigure_datapath()' only requires holding a rdlock, but
it is sometimes executed in contexts where the wrlock is taken
('do_add_port()' and 'do_del_port()').
This means that the deadlock described in [2] is still valid and should
be mitigated. The rdlock is taken using 'tryrdlock()' during offload query,
keeping the current behavior.
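An illustrative sketch of the query side (the ovs_rwlock_tryrdlock()
wrapper is assumed; the query helper is hypothetical):
```
if (!ovs_rwlock_tryrdlock(&dp->port_rwlock)) {
    /* Read-side critical section: ports cannot change under us. */
    error = query_port_offloads(dp, &stats);
    ovs_rwlock_unlock(&dp->port_rwlock);
} else {
    error = EAGAIN;    /* Keep the current non-blocking behavior. */
}
```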
[1]: 81e89d5c2645 ("dpif-netdev: Make datapath port mutex recursive.")
[2]: 12d0edd75eba ("dpif-netdev: Avoid deadlock with offloading during PMD
thread deletion.").
Signed-off-by: Gaetan Rivet <grive@u256.net>
Reviewed-by: Eli Britstein <elibr@nvidia.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
In later commits hardware offloads are managed in several threads.
Each offload is managed by a thread determined by its flow's 'mega_ufid'.
As megaflow-to-mark and mark-to-flow mappings are 1:1 and 1:N
respectively, a single mark exists for a single 'mega_ufid', and
multiple flows use the same 'mega_ufid'. Because the managing thread is
chosen using the 'mega_ufid', each mapping does not need to be shared
with other offload threads.
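An illustrative sketch of that thread selection (names hypothetical;
hash_bytes() is the real OVS helper):
```
static unsigned int
offload_thread_for(const ovs_u128 *mega_ufid, unsigned int n_threads)
{
    /* Same mega_ufid -> same thread, so per-thread maps suffice. */
    return hash_bytes(mega_ufid, sizeof *mega_ufid, 0) % n_threads;
}
```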
The mappings are kept as cmaps, as upcalls will sometimes query them
before enqueuing orders to the offload threads.
To prepare this change, move the mappings within the offload thread
structure.
Signed-off-by: Gaetan Rivet <grive@u256.net>
Reviewed-by: Eli Britstein <elibr@nvidia.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The dataplane threads (PMDs) send offloading commands to a dedicated
offload management thread. The current implementation uses a lock
and benchmarks show a high contention on the queue in some cases.
With high-contention, the mutex will more often lead to the locking
thread yielding in wait, using a syscall. This should be avoided in
a userland dataplane.
The mpsc-queue can be used instead. It uses less cycles and has
lower latency. Benchmarks show better behavior as multiple
revalidators and one or multiple PMDs writes to a single queue
while another thread polls it.
One trade-off with the new scheme however is to be forced to poll
the queue from the offload thread. Without mutex, a cond_wait
cannot be used for signaling. The offload thread is implementing
an exponential backoff and will sleep in short increments when no
data is available. This makes the thread yield, at the price of
some latency to manage offloads after an inactivity period.
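An illustrative sketch of that poll loop (constants and the handler are
hypothetical; mpsc_queue_pop() is the real queue API):
```
uint64_t backoff_us = MIN_BACKOFF_US;

for (;;) {
    struct mpsc_queue_node *node = mpsc_queue_pop(&offload_queue);

    if (node) {
        backoff_us = MIN_BACKOFF_US;         /* Work found: reset. */
        process_offload_item(node);
    } else {
        xnanosleep(backoff_us * 1000);       /* Yield the CPU. */
        backoff_us = MIN(backoff_us * 2, MAX_BACKOFF_US);
    }
}
```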
Signed-off-by: Gaetan Rivet <grive@u256.net>
Reviewed-by: Eli Britstein <elibr@nvidia.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Add a lock to access the ufid to rte_flow map. This will protect it
from concurrent write accesses when multiple threads attempt it.
At this point, the reason for taking the lock is no longer to fulfill
the needs of the DPDK offload implementation.
to reflect this change. The lock is still needed to protect against
changes to netdev port mapping.
Signed-off-by: Gaetan Rivet <grive@u256.net>
Reviewed-by: Eli Britstein <elibr@nvidia.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
When a port is deleted, its offloads must be flushed. The operation
runs in the thread that initiated it. Offload data is thus accessed
jointly by the port deletion thread(s) and the offload thread, which
complicates the data access model.
To simplify this model, as a pre-step toward introducing parallel
offloads, execute the flush operation in the offload thread.
Signed-off-by: Gaetan Rivet <grive@u256.net>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Offload requests currently only support flow offloads.
As a pre-step before supporting an offload flush request,
modify the layout of an offload request item, to become a tagged union.
Future offload types won't be forced to re-use the full flow offload
structure, which consumes a lot of memory.
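An illustrative layout of such a tagged union (member types are
hypothetical):
```
struct dp_offload_thread_item {
    enum {
        DP_OFFLOAD_FLOW,    /* Flow add/mod/del request. */
        DP_OFFLOAD_FLUSH,   /* Flush request, added later. */
    } type;
    union {
        struct dp_offload_flow_item flow;    /* Large. */
        struct dp_offload_flush_item flush;  /* Small. */
    } data;
};
```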
Signed-off-by: Gaetan Rivet <grive@u256.net>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Use the netdev-offload multithread API to allow multiple threads to
allocate marks concurrently.
Initialize the pool only once in a multithread context by using the
ovsthread_once type.
Use the id-fpool module for faster concurrent ID allocation.
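A sketch of the one-time initialization with ovsthread_once (the
id_fpool_create() parameters are illustrative):
```
static struct ovsthread_once init_once = OVSTHREAD_ONCE_INITIALIZER;

if (ovsthread_once_start(&init_once)) {
    mark_pool = id_fpool_create(n_offload_threads, 1, MAX_FLOW_MARK);
    ovsthread_once_done(&init_once);
}
```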
Signed-off-by: Gaetan Rivet <grive@u256.net>
Reviewed-by: Eli Britstein <elibr@nvidia.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>