When DPDK is enabled with the configuration below, ovs-vswitchd will crash.
ovs-vsctl set Open_vSwitch . other_config:n-offload-threads=0
ovs-vsctl set Open_vSwitch . other_config:hw-offload=true
This issue arises because setting 'n-offload-threads' to zero is not a
supported configuration. Fix this by adding a check that ensures a
valid 'n-offload-threads' value, both during configuration and during
statistics gathering.
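A minimal sketch of the kind of validation implied (names illustrative,
not the actual patch):
```
/* Illustrative only: fall back to the default instead of accepting
 * a zero thread count. */
#define DEFAULT_OFFLOAD_THREAD_NB 1

static unsigned int
validated_offload_thread_nb(unsigned int requested)
{
    if (requested == 0) {
        /* Zero offload threads is not a supported configuration. */
        return DEFAULT_OFFLOAD_THREAD_NB;
    }
    return requested;
}
```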
Fixes: 62c2d8a67543 ("netdev-offload: Add multi-thread API.")
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Ilya Maximets <i.maximets@ovn.org>
Signed-off-by: Simon Horman <horms@ovn.org>
The 'len' of a netdev_flow_key initialized by netdev_flow_key_init()
is always zero, which may cause errors when cloning a netdev_flow_key
by netdev_flow_key_clone().
Currently the 'len' member initialized by netdev_flow_key_init() is not
used, so this error does not cause any bad behavior for now.
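A sketch of the kind of fix implied, assuming the netdev_flow_key_size()
and miniflow_n_values() helpers used elsewhere in dpif-netdev.c:
```
/* Sketch: also fill in 'len' during initialization so that
 * netdev_flow_key_clone() copies the correct number of bytes. */
key->len = netdev_flow_key_size(miniflow_n_values(&key->mf));
```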
Fixes: c82f496c3b69 ("dpif-netdev: Use unmasked key when adding datapath flows.")
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Zhiqi Chen <chenzhiqi.123@bytedance.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
OVS allows overlapping megaflows, as long as the actions of these
megaflows are equal. However, the current implementation of action
modification relies on flow_lookup instead of UFID, this could result
in looking up a wrong megaflow and make the ukeys and megaflows
inconsistent.
Just like the test case in the patch, at first we have a rule with the
prefix:
10.1.2.0/24
And we will get a megaflow with the prefix 10.1.2.2/24 when a packet with
IP 10.1.2.2 is received.
Then suppose we change the rule into 10.1.0.0/16. OVS prefers to keep
the 10.1.2.2/24 megaflow and just changes its action instead of
extending the prefix into 10.1.2.2/16.
Then suppose a 10.1.0.2 packet arrives. Since it misses the megaflow,
this time we will get an overlapping megaflow with the right prefix:
10.1.0.2/16
Now we have two megaflows:
10.1.2.2/24
10.1.0.2/16
Last, suppose we have changed the ruleset again. The revalidator this
time still decides to change the actions of both megaflows instead of
deleting them.
dpif_netdev_flow_put() will search for the megaflow to modify using
unmasked keys; however, it might look up the wrong megaflow, as the key
10.1.2.2 matches both 10.1.2.2/24 and 10.1.0.2/16!
This patch changes the megaflow lookup code in the modification path to
rely on the UFID to find the correct megaflow instead of a key lookup.
It falls back to a classifier lookup when no UFID was provided, in
order to support cases where the UFID was not generated from the flow
data during flow addition.
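An illustrative sketch of the resulting lookup order (simplified;
dp_netdev_pmd_find_flow_by_ufid() is a hypothetical name, while
dp_netdev_pmd_lookup_flow() is the existing classifier lookup):
```
static struct dp_netdev_flow *
lookup_flow_for_modify(struct dp_netdev_pmd_thread *pmd,
                       const ovs_u128 *ufid,
                       const struct netdev_flow_key *key)
{
    if (ufid) {
        /* Exact match on the unique flow ID; immune to overlapping
         * megaflows. */
        return dp_netdev_pmd_find_flow_by_ufid(pmd, ufid);
    }
    /* No UFID provided: fall back to a classifier lookup. */
    return dp_netdev_pmd_lookup_flow(pmd, key, NULL);
}
```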
Fixes: beb75a40fdc2 ("userspace: Switching of L3 packets in L2 pipeline")
Signed-off-by: Peng He <hepeng.0320@bytedance.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The max requested sleep time and status for a PMD thread are logged at
startup or when changed, but it can be convenient to have a command
to dump this information explicitly.
It is envisaged that this will be expanded for individual
pmds in the future, hence adding to dpif_netdev_pmd_info().
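For example, assuming the new command follows the existing dpif-netdev
naming pattern:
$ ovs-appctl dpif-netdev/pmd-sleep-show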
Reviewed-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
other_config:pmd-maxsleep is a config option to allow
PMD thread cores to sleep under low or no load conditions.
Rename it to 'pmd-sleep-max' to allow a more structured name, so that
additional options or commands can follow the 'pmd-sleep-xyz' pattern.
Use of other_config:pmd-maxsleep is deprecated, will be removed in a
future release, and now results in a warning.
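For example, the renamed option is set the same way as before:
ovs-vsctl set Open_vSwitch . other_config:pmd-sleep-max=500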
Reviewed-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Current implementation of meters in the userspace datapath takes
the meter lock for every packet batch. If more than one thread
hits the flow with the same meter, they will lock each other.
Replace the critical section with atomic operations to avoid
interlocking. Meters themselves are RCU-protected, so it's safe
to access them without holding a lock.
Implementation does the following:
1. Tries to advance the 'used' timer of the meter with atomic
compare+exchange if it's smaller than 'now'.
2. If the timer change succeeds, atomically update band buckets.
3. Atomically update packet statistics for a meter.
4. Go over buckets and try to atomically subtract the amount of
packets or bytes, recording the highest exceeded band.
5. Atomically update band statistics and drop packets.
Bucket manipulations are implemented with atomic compare+exchange
operations with extra checks, because bucket size should never
exceed the maximum and it should never go below zero.
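A sketch of one such bucket update using the OVS atomics API
(simplified; not the exact meter code):
```
static bool
bucket_try_consume(atomic_uint64_t *bucket, uint64_t cost)
{
    uint64_t current;

    atomic_read_relaxed(bucket, &current);
    do {
        if (current < cost) {
            return false;   /* Band exceeded; never go below zero. */
        }
        /* On failure 'current' is reloaded and the check repeats. */
    } while (!atomic_compare_exchange_weak(bucket, &current,
                                           current - cost));
    return true;
}
```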
Packet statistics may be momentarily inconsistent, i.e., number
of packets and the number of bytes may reflect different sets
of packets, but they should be eventually consistent, and the
difference at any given time should be just a few packets.
For the sake of reduced code complexity, the PKTPS meter tries to push
packets through the band one by one, even though they all have
the same weight. This is also more fair if more than one thread
is passing packets through the same band at the same time.
Trying to predict the number of packets that can pass may also
cause extra atomic operations reducing the performance.
This implementation shows similar performance to the previous one,
but should scale better with more threads hitting the same meter.
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Tested-by: Lin Huang <linhuang@ruijie.com.cn>
Tested-by: Zhang YuHuang <zhangyuhuang@ruijie.com.cn>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
This patch introduces a new command, 'ovs-appctl
dpctl/dump-conntrack-exp', that dumps the existing expectations for the
userspace connection tracker.
Signed-off-by: Paolo Valerio <pvalerio@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The netdev receiving packets is supposed to provide flags indicating
whether the IP checksum was verified and whether it is GOOD or BAD;
otherwise the stack will verify it in software when appropriate.
If the packet comes with a good checksum, postpone the checksum
calculation to the egress device if needed.
When encapsulating a packet with that flag, set the checksum of the
inner IP header, since offloading that case is not yet supported.
Calculate the IP checksum when the packet is going to be sent over
a device that doesn't support the feature.
Linux devices don't support IP checksum offload alone, so the
support is not enabled.
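An illustrative sketch of the egress decision described above (helper
names are hypothetical, not the actual dp-packet API):
```
static void
ip_csum_resolve_on_egress(struct dp_packet *pkt, bool dev_can_offload)
{
    if (ip_csum_already_good(pkt)) {
        /* Verified on ingress; nothing left to do. */
        return;
    }
    if (!dev_can_offload) {
        /* The egress device lacks IP checksum offload: compute the
         * checksum in software before transmit. */
        ip_csum_fill_in_software(pkt);
    }
}
```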
Signed-off-by: Flavio Leitner <fbl@sysclose.org>
Co-authored-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Since 3d9c1b855a5f ("conntrack: Replace timeout based expiration lists
with rculists.") the sweep interval changed as well as the constraints
related to the sweeper.
Being able to change the default reschedule time may be convenient in
some conditions, like debugging.
This patch introduces new commands to get and set the sweep interval
in ms.
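Assuming the commands follow the existing dpctl/ct-* naming, usage
would look like:
$ ovs-appctl dpctl/ct-set-sweep-interval 5000
$ ovs-appctl dpctl/ct-get-sweep-interval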
Signed-off-by: Paolo Valerio <pvalerio@redhat.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
When the ukey's action set changes, it could cause the flow to use a
different datapath, for example, when it moves from tc to kernel.
This would cause the previously cached datapath statistics to be used.
This change resets the cached statistics when a change in datapath is
detected.
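An illustrative sketch of the reset (field names are hypothetical):
```
/* Hypothetical fields: drop cached stats when the flow moved to a
 * different datapath implementation. */
if (strcmp(ukey->dp_layer, flow->attrs.dp_layer)) {
    memset(&ukey->stats, 0, sizeof ukey->stats);
    ukey->dp_layer = flow->attrs.dp_layer;
}
```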
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Now that the timer slack for the PMD threads is reduced we can also
reduce the start/increment for PMD load based sleeping to match it.
This will further reduce initial sleep times making it more resilient
to interfaces that might be sensitive to large sleep times.
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Reviewed-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The default Linux timer slack groups timer expirations into 50 us intervals.
With some traffic patterns this can mean that returning to process
packets after a sleep takes too long and packets are dropped.
Add a helper to util.c and use it to reduce the timer slack for PMD
threads, so that sleeps with finer resolution can be done to prevent
sleeping for too long.
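A minimal sketch of such a helper on Linux; PR_SET_TIMERSLACK applies
to the calling thread only, so each PMD thread must call it (the
actual helper name in util.c may differ):
```
#include <sys/prctl.h>

static void
set_timer_slack_ns(unsigned long slack_ns)
{
    if (prctl(PR_SET_TIMERSLACK, slack_ns, 0, 0, 0) == -1) {
        /* Non-fatal: the kernel default (50 us) stays in effect. */
    }
}
```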
Fixes: de3bbdc479a9 ("dpif-netdev: Add PMD load based sleeping.")
Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2023-January/401121.html
Reported-by: Ilya Maximets <i.maximets@ovn.org>
Signed-off-by: David Marchand <david.marchand@redhat.com>
Co-authored-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Sleep for an incremental amount of time if none of the Rx queues
assigned to a PMD have at least half a batch of packets (i.e. 16 pkts)
on a polling iteration of the PMD.
Upon detecting the threshold of >= 16 pkts on an Rxq, reset the
sleep time to zero (i.e. no sleep).
Sleep time will be increased on each iteration where the low-load
conditions remain, up to the max sleep time set by the user, e.g.:
ovs-vsctl set Open_vSwitch . other_config:pmd-maxsleep=500
The default pmd-maxsleep value is 0, which means that no sleeps
will occur and the default behaviour is unchanged from previously.
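An illustrative sketch of the per-iteration logic (variable and
constant names are not the actual dpif-netdev ones):
```
if (rx_pkts >= NETDEV_MAX_BURST / 2) {
    cur_sleep_us = 0;                    /* Load detected: no sleep. */
} else if (max_sleep_us && cur_sleep_us < max_sleep_us) {
    cur_sleep_us += SLEEP_INCREMENT_US;  /* Back off a little more. */
}
if (cur_sleep_us) {
    xnanosleep(cur_sleep_us * 1000);     /* Sleep, then poll again. */
}
```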
Also add new stats to pmd-perf-show to give visibility of the
operation, e.g.:
...
- sleep iterations: 153994 ( 76.8 % of iterations)
Sleep time (us): 9159399 ( 59 us/iteration avg.)
...
Reviewed-by: Robin Jarry <rjarry@redhat.com>
Reviewed-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Currently, pmd_rebalance_dry_run() calculates the overall variance of
all pmds regardless of their numa location. The overall result may hide
imbalance on an individual numa node.
Consider the following case. Numa0 is free because VMs on numa0 are not
sending pkts, while numa1 is busy. Within numa1, pmd workloads are not
balanced. Obviously, moving 500 kpps of workload from pmd 126 to pmd 62
will make numa1 much more balanced. For numa1 the variance improvement
will be almost 100%, because after rebalance each pmd in numa1 holds
the same workload (variance ~= 0). But the overall variance improvement
is only about 20%, which may not trigger auto_lb.
```
numa_id core_id kpps
0 30 0
0 31 0
0 94 0
0 95 0
1 126 1500
1 127 1000
1 63 1000
1 62 500
```
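For reference, a rough per-numa variance calculation on the kpps
figures above (the real code uses pmd cycles, so the exact overall
percentage differs):
```
numa1 before: {1500, 1000, 1000, 500}, mean = 1000
              variance = (500^2 + 0 + 0 + 500^2) / 4 = 125000
numa1 after:  {1000, 1000, 1000, 1000}, mean = 1000
              variance = 0  ->  per-numa improvement ~= 100%
```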
As auto_lb doesn't balance workloads across numa nodes, it makes more
sense to calculate the variance improvement per numa node.
Signed-off-by: Cheng Li <lic121@chinatelecom.cn>
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Co-authored-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
There are some similar readings taken for pmds and Rx queues
in this function and a few of the variable names are ambiguous.
Improve the readability of the code by updating some variable names to
indicate that they are readings related to the pmd.
Reviewed-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
pmd-rxq-show shows the Rx queue to pmd assignments as well as the
pmd usage of each Rx queue.
Up until now, pmd usage over a 60-second tail was shown for each Rx
queue, as this is the value used during rebalance to avoid any spike
effects.
When debugging or tuning, it is also convenient to display the pmd
usage of an Rx queue over a shorter time frame, so any config or
traffic changes that impact pmd usage can be evaluated more quickly.
A parameter is added that allows pmd-rxq-show to display pmd usage over
a shorter time frame. Values are rounded up to the nearest 5 seconds,
as that is the measurement granularity, and the value used is
displayed, e.g.:
$ ovs-appctl dpif-netdev/pmd-rxq-show -secs 5
Displaying last 5 seconds pmd usage %
pmd thread numa_id 0 core_id 4:
isolated : false
port: dpdk0 queue-id: 0 (enabled) pmd usage: 95 %
overhead: 4 %
The default time frame has not changed and the maximum value
is limited to the maximum stored tail length (60 seconds).
Reviewed-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The datapath supports installing wider flows, and OVS relies on
this behavior. For example if ipv4(src=1.1.1.1/192.0.0.0,
dst=1.1.1.2/192.0.0.0) exists, a wider flow (smaller mask) of
ipv4(src=192.1.1.1/128.0.0.0,dst=192.1.1.2/128.0.0.0) is allowed
to be added.
However, if we try to add a wildcard rule, the installation fails:
# ovs-appctl dpctl/add-flow system@myDP "in_port(1),eth_type(0x0800), \
ipv4(src=1.1.1.1/192.0.0.0,dst=1.1.1.2/192.0.0.0,frag=no)" 2
# ovs-appctl dpctl/add-flow system@myDP "in_port(1),eth_type(0x0800), \
ipv4(src=192.1.1.1/0.0.0.0,dst=49.1.1.2/0.0.0.0,frag=no)" 2
ovs-vswitchd: updating flow table (File exists)
The reason is that the key used to determine if the flow is already
present in the system uses the original key ANDed with the mask.
This results in the IP address not being part of the (miniflow) key,
i.e., being substituted with an all-zero value. When doing the actual
lookup, this results in the key wrongfully matching the first flow,
and therefore the flow does not get installed. The solution is to use
the unmasked key for the existence check, the same way this is handled
in the "slow" dpif_flow_put() case.
OVS relies on the fact that overlapping flows can exist if one is a
superset of the other. Note that this is only true when the same set
of actions is applied. This is due to how the revalidator process
works. During revalidation, OVS removes too generic flows from the
datapath to avoid incorrect matches but allows too narrow flows to
stay in the datapath to avoid the data plane disruption and also to
avoid constant flow deletions if the datapath ignores wildcards on
certain fields/bits. See flow_wildcards_has_extra() check in the
revalidate_ukey__() function.
The problem here is that we have a too narrow flow installed, and now
OpenFlow rules got changed, so the actual flow should be more generic.
Revalidators will not remove the narrow flow, and we will eventually get
an upcall on the packet that doesn't match the narrow flow, but we will
not be able to install a more generic flow because after masking with
the new wider mask, the key matches on the narrow flow, so we get EEXIST.
Fixes: beb75a40fdc2 ("userspace: Switching of L3 packets in L2 pipeline")
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The cited commit introduced a flag at the dpif-netdev level to optimize
performance and avoid hw_miss_packet_recover() for devices with no such
support.
However, there is a race condition between traffic processing and
assigning a 'flow_api' object to the netdev. In that case, EOPNOTSUPP
is returned by netdev_hw_miss_packet_recover() in the netdev-offload.c
layer because 'flow_api' is not yet initialized. As a result, the flag
is falsely disabled, and subsequent packets won't be recovered, though
they should be.
To fix this, move the flag into the netdev-offload layer, avoiding
that race.
Fixes: 6e50c1651869 ("dpif-netdev: Avoid hw_miss_packet_recover() for devices with no support.")
Signed-off-by: Eli Britstein <elibr@nvidia.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Change the data structure for zone limits from hmap to cmap. Zone
limits are shared amongst multiple conntrack users, and multiple
readers want to check the current zone limit state before progressing
in their processing. Using a cmap allows doing lookups without taking
the global 'ct_lock', thus reducing contention.
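An illustrative sketch of the resulting read path (structure and field
names are hypothetical; the cmap API is the real one):
```
static struct zone_limit *
zone_limit_lookup(const struct cmap *zone_limits, uint16_t zone)
{
    struct zone_limit *zl;
    uint32_t hash = hash_int(zone, 0);

    /* RCU-protected: no 'ct_lock' needed on this path. */
    CMAP_FOR_EACH_WITH_HASH (zl, node, hash, zone_limits) {
        if (zl->zone == zone) {
            return zl;
        }
    }
    return NULL;
}
```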
Signed-off-by: Gaetan Rivet <grive@u256.net>
Reviewed-by: Eli Britstein <elibr@nvidia.com>
Signed-off-by: Paolo Valerio <pvalerio@redhat.com>
Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Make the simple match functions used during lookup non-static to allow
reuse of these functions in the AVX512 DPIF.
Signed-off-by: Cian Ferriter <cian.ferriter@intel.com>
Tested-by: Harry van Haaren <harry.van.haaren@intel.com>
Acked-by: Sunil Pai G <sunil.pai.g@intel.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
dp_netdev_input_outer_avx512 allocates a 16KB scratch pad per PMD
thread, but it's never freed. This may cause significant memory
drain in dynamic environments.
==4068109==ERROR: LeakSanitizer: detected memory leaks
Direct leak of 38656 byte(s) in 2 object(s) allocated from:
0 0xf069fd in posix_memalign (vswitchd/ovs-vswitchd+0xf069fd)
1 0x1d7ed14 in xmalloc_size_align lib/util.c:254:13
2 0x1d7ed14 in xmalloc_pagealign lib/util.c:352:12
3 0x2098254 in dp_netdev_input_outer_avx512 lib/dpif-netdev-avx512.c:69:17
4 0x191591a in dp_netdev_process_rxq_port lib/dpif-netdev.c:5332:19
5 0x190a961 in pmd_thread_main lib/dpif-netdev.c:6963:17
6 0x1c4b69a in ovsthread_wrapper lib/ovs-thread.c:422:12
7 0x7fd5ea6f1179 in start_thread pthread_create.c
SUMMARY: AddressSanitizer: 38656 byte(s) leaked in 2 allocation(s).
Fixes: 9ac84a1a3698 ("dpif-avx512: Add ISA implementation of dpif.")
Reviewed-by: David Marchand <david.marchand@redhat.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Kumar Amber <kumar.amber@intel.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
This patch adds the old name "subtable-lookup-prio-get" as an unlisted
command, to restore consistency between OVS releases for testing
scripts.
Fixes: 738c76a503f4 ("dpcls: Change info-get function to fetch dpcls usage stats.")
Suggested-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Currently, the pmd-auto-lb-rebal-interval value is not checked
properly. It may be negative, or too big a value (>2 weeks between
rebalances), which leads to a big unsigned value. Reset it to the
default if the value exceeds the maximum permitted, as described in
vswitchd.xml.
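An illustrative sketch of the validation, assuming an unsigned smap
getter such as smap_get_uint() and constant names that may differ from
the actual code:
```
rebal_interval = smap_get_uint(other_config,
                               "pmd-auto-lb-rebal-interval",
                               ALB_REBAL_INTERVAL_DEFAULT);
if (rebal_interval > ALB_REBAL_INTERVAL_MAX) {
    /* Out of the documented range: fall back to the default. */
    rebal_interval = ALB_REBAL_INTERVAL_DEFAULT;
}
```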
Fixes: 5bf84282482a ("Adding support for PMD auto load balancing")
Signed-off-by: Lin Huang <linhuang@ruijie.com.cn>
Acked-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The ALB parameters should never be negative, so use the unsigned
smap_get versions to read them properly, and update the VLOG
formatting.
Fixes: 5bf84282482a ("Adding support for PMD auto load balancing")
Signed-off-by: Lin Huang <linhuang@ruijie.com.cn>
Acked-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Previously, logging about rxq scheduling was done in the same code
branch as the selection of the PMD thread core, after checking that a
numa node was selected.
By splitting out the logging from the PMD thread core selection, it can
simplify the code complexity and make it more extendable for future
additions.
Also, minor updates to a couple of variables to improve readability and
fix a log indent while working on this code block.
There is no user visible change in behaviour or logs.
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
This splits up the looping through each PMD thread core on a numa node
with the check to compare cycles or rxqs.
This is done so in future the compare could be reused with any group
of PMD thread cores.
There is no user visible change in behaviour.
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Modified the dpcls info-get command output to include the use count for
the different dpcls implementations.
$ ovs-appctl dpif-netdev/subtable-lookup-info-get
Available dpcls implementations:
autovalidator (Use count: 1, Priority: 5)
generic (Use count: 0, Priority: 1)
avx512_gather (Use count: 0, Priority: 3)
Test case to verify changes:
1061: PMD - dpcls configuration ok
Signed-off-by: Kumar Amber <kumar.amber@intel.com>
Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Co-authored-by: Harry van Haaren <harry.van.haaren@intel.com>
Co-authored-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
The hash of the port number is only needed when a DPCLS needs to be
created. Move the hash calculation inside the if to accomplish this.
Signed-off-by: Cian Ferriter <cian.ferriter@intel.com>
Acked-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The function dp_netdev_pmd_flush_output_on_port() iterates over the
p->output_pkts batch directly, when it should be using the special
iterator macro, DP_PACKET_BATCH_FOR_EACH.
However, this wasn't possible because the macro could not accept
&p->output_pkts.
The addition of parentheses when BATCH is dereferenced allows the macro
to expand properly. Parenthesizing macro arguments is good practice, so
the macro can handle whatever expressions are passed in.
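A simplified illustration of the fix; without the parentheses, passing
'&p->output_pkts' as BATCH would expand to '&p->output_pkts->count',
which does not compile:
```
#define DP_PACKET_BATCH_FOR_EACH(IDX, PACKET, BATCH)        \
    for (size_t IDX = 0; IDX < (BATCH)->count; IDX++)       \
        if (PACKET = (BATCH)->packets[IDX], true)
```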
Signed-off-by: Rosemarie O'Riorden <roriorden@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The dp_netdev_get_pmd() function uses only the hash of the core_id to
get the pmd structure, so in case of a hash collision, the wrong pmd is
returned.
This patch fixes this by checking for the correct core_id using the
CMAP_FOR_EACH_WITH_HASH macro.
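A simplified sketch of the fixed lookup, verifying core_id instead of
trusting the hash alone:
```
static struct dp_netdev_pmd_thread *
dp_netdev_get_pmd(struct dp_netdev *dp, unsigned core_id)
{
    struct dp_netdev_pmd_thread *pmd;

    CMAP_FOR_EACH_WITH_HASH (pmd, node, hash_int(core_id, 0),
                             &dp->poll_threads) {
        if (pmd->core_id == core_id) {
            /* Hash matched *and* core_id matched. */
            return dp_netdev_pmd_try_ref(pmd) ? pmd : NULL;
        }
    }
    return NULL;
}
```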
Fixes: 65f13b50c5aa ("dpif-netdev: Create multiple pmd threads by default.")
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Reviewed-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
There are currently some checks for cross-numa polling cases to ensure
that they won't affect the accuracy of the PMD ALB.
If an rxq is pinned to a PMD thread core by the user it will not
be reassigned by OVS, so even if it is non-local numa polled it
will not impact PMD ALB accuracy.
To establish this, a check was made on whether the PMD thread core
was isolated or not. However, since other_config:pmd-rxq-isolate was
introduced, rxqs may be pinned but the PMD thread core not isolated.
It means that by setting pmd-rxq-isolate=false and doing non-local
numa pinning, PMD ALB may not run where it should.
If the PMD thread core is isolated we can skip individual rxq checks
but if not, we should check the individual rxqs for pinning before we
disallow PMD ALB.
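Illustrative pseudologic of that check (structure and helper names are
hypothetical, not the actual dpif-netdev code):
```
static bool
pmd_blocks_alb(const struct pmd_info *pmd)
{
    const struct rxq_info *rxq;

    if (pmd->isolated) {
        return false;       /* Only pinned rxqs here: ALB is safe. */
    }
    LIST_FOR_EACH (rxq, node, &pmd->rxqs) {
        if (!rxq->pinned && rxq->numa_id != pmd->numa_id) {
            return true;    /* Unpinned cross-numa rxq: block ALB. */
        }
    }
    return false;
}
```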
Also, update function comments to make its operation clearer.
Fixes: 6193e03267c1 ("dpif-netdev: Allow pin rxq and non-isolate PMD.")
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: Mike Pattrick <mkp@redhat.com>
Acked-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
This issue only occurs when there are more than 2 numa nodes
and no local numa PMD thread cores available for an interface
rxq.
In the event of no PMD thread cores available on the local numa
for an rxq to be assigned to, a PMD thread core from a non-local
numa is selected.
If there is more than one non-local numa with PMD thread cores, they
are round-robined through and checked for non-isolated PMD thread
cores.
When successfully finding a non-local numa with available PMD
thread cores for an rxq, that numa was not being stored. It meant
if a similar situation occurred for a subsequent rxq, the same numa
would be selected again.
Store the last numa used when successfully finding a non-local numa
with available PMD thread cores, so the numa RR state is kept for subsequent
rxqs.
Fixes: f577c2d046b2 ("dpif-netdev: Rework rxq scheduling code.")
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: Mike Pattrick <mkp@redhat.com>
Acked-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Rename pmd_reblance_dry_run_needed() to
pmd_rebalance_dry_run_needed().
Fixes: a83a406096e9 ("dpif-netdev: Sync PMD ALB state with user commands.")
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: Mike Pattrick <mkp@redhat.com>
Acked-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Using the SHORT version of the *_SAFE loops makes the code cleaner and
less error-prone. So, use the SHORT version and remove the extra
variable when possible for hmap and all its derived types.
In order to be able to use both long and short versions without changing
the name of the macro for all the clients, overload the existing name
and select the appropriate version depending on the number of arguments.
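A simplified illustration of the argument-count overload (the real
macros live in lib/hmap.h and friends):
```
#define SAFE_MACRO_SELECT(_1, _2, _3, _4, NAME, ...) NAME
#define HMAP_FOR_EACH_SAFE(...)                                     \
    SAFE_MACRO_SELECT(__VA_ARGS__,                                  \
                      HMAP_FOR_EACH_SAFE_LONG,   /* 4 arguments. */ \
                      HMAP_FOR_EACH_SAFE_SHORT)  /* 3 arguments. */ \
                     (__VA_ARGS__)
```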
Acked-by: Dumitru Ceara <dceara@redhat.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
A flow may be modified after its initial offload failed. In this case,
according to [1], the modification is handled as a flow add.
For a vport flow "add", the orig_in_port should be provided.
Keep that field in the flow struct, so it can be provided in the flow
modification use case.
[1] 0d25621e4d9f ("dpif-netdev: Fix flow modification after failure.")
Fixes: b5e6f6f6bfbe ("dpif-netdev: Provide orig_in_port in metadata for tunneled packets.")
Signed-off-by: Eli Britstein <elibr@nvidia.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The same pattern for atomic stores and initialization was used for the
DPIF and MFEX function pointers declared in struct dp_netdev_pmd_thread.
Simplify this pattern for all stores to 'miniflow_extract_opt' and
'netdev_input_func'.
Also replace the first store to 'miniflow_extract_opt', which was an
atomic_store_relaxed(), with atomic_init().
Signed-off-by: Cian Ferriter <cian.ferriter@intel.com>
Acked-by: Kumar Amber <kumar.amber@intel.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The PMD reference taken is not actually used; it is only needed to
reach the linked dp_netdev. Additionally, taking the PMD reference does
not protect against the disappearance of the dp_netdev, so it is
misleading.
The dp reference is protected by the way the ports are being deleted
during datapath deletion. No further offload request should be found
past a flush, so it is safe to keep this reference in the offload item.
Signed-off-by: Gaetan Rivet <grive@u256.net>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
In dp_netdev_pmd_remove_flow() we schedule the deletion of an
offloaded flow, if a mark has been assigned to the flow. But if
this occurs in the window in which the offload thread completes
offloading the flow and assigns a mark to the flow, then we miss
deleting the flow. This problem has been observed while adding
and deleting flows in a loop. To fix this, always enqueue flow
deletion regardless of the flow->mark being set.
Fixes: 241bad15d99a ("dpif-netdev: associate flow with a mark id")
Co-authored-by: Gaetan Rivet <grive@u256.net>
Signed-off-by: Gaetan Rivet <grive@u256.net>
Signed-off-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Port flush and offload uninit should be moved after the datapath
has been reconfigured. That way, no other thread such as PMDs will
find this port to poll and enqueue further offload requests.
After a flush, almost no further offload request for this port should
be found in the queue.
There will still be some issued by revalidators, but they
will be caught when the offload thread fails to take a netdev ref.
This change fixes the issue of datapath reference being improperly
accessed by offload threads while it is being destroyed.
Fixes: 5b0aa55776cb ("dpif-netdev: Execute flush from offload thread.")
Fixes: 62d1c28e9ce0 ("dpif-netdev: Flush offload rules upon port deletion.")
Suggested-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
Signed-off-by: Gaetan Rivet <grive@u256.net>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The subtable search function can be used at any time by a PMD thread.
Setting the subtable search function should be done atomically to
prevent garbage data from being read.
A dpcls_subtable_lookup_reprobe() call can happen at the same time that
DPCLS subtables are being sorted. This could lead to modifications done
by the reprobe() to be lost. Prevent this from happening by locking on
pmd->flow_mutex. After this change both the reprobe function and a
subtable sort will share the flow_mutex preventing modifications by
either one from being lost.
Also remove the pvector_publish() call. The pvector is not being changed
in dpcls_subtable_lookup_reprobe(), only the data pointed to by pointers
in the vector are being changed.
Fixes: 3d018c3ea79d ("dpif-netdev: add subtable lookup prio set command.")
Reported-by: Ilya Maximets <i.maximets@ovn.org>
Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2022-January/390757.html
Signed-off-by: Cian Ferriter <cian.ferriter@intel.com>
Acked-by: Ilya Maximets <i.maximets@ovn.org>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
It's actually undefined behavior to pass NULL to standard library
functions that manipulate arrays (e.g., qsort, memcpy, memcmp), even if
the passed number of items is 0.
UB Sanitizer reports:
ovsdb/monitor.c:408:9: runtime error: null pointer passed as argument 1,
which is declared to never be null
#0 0x406ae1 in ovsdb_monitor_columns_sort ovsdb/monitor.c:408
#1 0x406ae1 in ovsdb_monitor_add ovsdb/monitor.c:1683
[...]
lib/ovsdb-data.c:1970:5: runtime error: null pointer passed as argument 2,
which is declared to never be null
#0 0x4071c8 in ovsdb_datum_push_unsafe lib/ovsdb-data.c:1970
#1 0x471cd0 in ovsdb_datum_apply_diff_in_place lib/ovsdb-data.c:2345
[...]
ofproto/ofproto-dpif-rid.c:159:17:
runtime error: null pointer passed as argument 1,
which is declared to never be null
#0 0x4df5d8 in frozen_state_equal ofproto/ofproto-dpif-rid.c:159
#1 0x4dfd27 in recirc_find_equal ofproto/ofproto-dpif-rid.c:179
[...]
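A minimal sketch of the kind of guard applied (the comparator name is
illustrative):
```
/* Skip the libc call entirely when the array is empty. */
if (n_columns) {
    qsort(columns, n_columns, sizeof *columns, compare_columns);
}
```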
Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Dumitru Ceara <dceara@redhat.com>
Acked-by: Paolo Valerio <pvalerio@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Read the user configuration in the netdev-offload module to modify the
number of threads used to manage hardware offload requests.
This allows processing insertion, deletion and modification
concurrently.
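For example, using the same knob shown at the top of this series:
ovs-vsctl set Open_vSwitch . other_config:n-offload-threads=3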
The offload thread structure was modified to contain all needed
elements. This structure is replicated per requested thread, and each
instance is used separately.
Signed-off-by: Gaetan Rivet <grive@u256.net>
Reviewed-by: Eli Britstein <elibr@nvidia.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The port mutex protects the netdev mapping, which can be changed by
port addition or deletion. HW offload operations can be considered read
operations on the port mapping itself. Use a rwlock to differentiate
between read and write operations, allowing concurrent queries and
offload insertions.
Because offload queries, deletion, and reconfigure_datapath() calls are
all done under rdlock, the deadlock fixed by [1] is still avoided, as
the rdlock side is recursive, as prescribed by the POSIX standard.
Executing 'reconfigure_datapath()' only requires holding a rdlock, but
it is sometimes executed in contexts where the wrlock is taken
('do_add_port()' and 'do_del_port()').
This means that the deadlock described in [2] is still valid and should
be mitigated. The rdlock is taken using 'tryrdlock()' during offload query,
keeping the current behavior.
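An illustrative sketch of the query side (the ovs_rwlock_tryrdlock()
wrapper is assumed; the query helper is hypothetical):
```
if (!ovs_rwlock_tryrdlock(&dp->port_rwlock)) {
    /* Read-side critical section: ports cannot change under us. */
    error = query_port_offloads(dp, &stats);
    ovs_rwlock_unlock(&dp->port_rwlock);
} else {
    error = EAGAIN;    /* Keep the current non-blocking behavior. */
}
```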
[1]: 81e89d5c2645 ("dpif-netdev: Make datapath port mutex recursive.")
[2]: 12d0edd75eba ("dpif-netdev: Avoid deadlock with offloading during PMD
thread deletion.").
Signed-off-by: Gaetan Rivet <grive@u256.net>
Reviewed-by: Eli Britstein <elibr@nvidia.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
In later commits hardware offloads are managed in several threads.
Each offload is managed by a thread determined by its flow's 'mega_ufid'.
As megaflow-to-mark and mark-to-flow mappings are 1:1 and 1:N
respectively, a single mark exists for a single 'mega_ufid', and
multiple flows use the same 'mega_ufid'. Because the managing thread is
chosen using the 'mega_ufid', each mapping does not need to be shared
with other offload threads.
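An illustrative sketch of that thread selection (names hypothetical;
hash_bytes() is the real OVS helper):
```
static unsigned int
offload_thread_for(const ovs_u128 *mega_ufid, unsigned int n_threads)
{
    /* Same mega_ufid -> same thread, so per-thread maps suffice. */
    return hash_bytes(mega_ufid, sizeof *mega_ufid, 0) % n_threads;
}
```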
The mappings are kept as cmaps, as upcalls will sometimes query them
before enqueuing orders to the offload threads.
To prepare this change, move the mappings within the offload thread
structure.
Signed-off-by: Gaetan Rivet <grive@u256.net>
Reviewed-by: Eli Britstein <elibr@nvidia.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The dataplane threads (PMDs) send offloading commands to a dedicated
offload management thread. The current implementation uses a lock
and benchmarks show a high contention on the queue in some cases.
With high-contention, the mutex will more often lead to the locking
thread yielding in wait, using a syscall. This should be avoided in
a userland dataplane.
The mpsc-queue can be used instead. It uses less cycles and has
lower latency. Benchmarks show better behavior as multiple
revalidators and one or multiple PMDs writes to a single queue
while another thread polls it.
One trade-off with the new scheme however is to be forced to poll
the queue from the offload thread. Without mutex, a cond_wait
cannot be used for signaling. The offload thread is implementing
an exponential backoff and will sleep in short increments when no
data is available. This makes the thread yield, at the price of
some latency to manage offloads after an inactivity period.
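An illustrative sketch of that poll loop (constants and the handler are
hypothetical; mpsc_queue_pop() is the real queue API):
```
uint64_t backoff_us = MIN_BACKOFF_US;

for (;;) {
    struct mpsc_queue_node *node = mpsc_queue_pop(&offload_queue);

    if (node) {
        backoff_us = MIN_BACKOFF_US;         /* Work found: reset. */
        process_offload_item(node);
    } else {
        xnanosleep(backoff_us * 1000);       /* Yield the CPU. */
        backoff_us = MIN(backoff_us * 2, MAX_BACKOFF_US);
    }
}
```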
Signed-off-by: Gaetan Rivet <grive@u256.net>
Reviewed-by: Eli Britstein <elibr@nvidia.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Add a lock to access the ufid to rte_flow map. This will protect it
from concurrent write accesses when multiple threads attempt it.
At this point, the reason for taking the lock is no longer to fulfill
the needs of the DPDK offload implementation.
to reflect this change. The lock is still needed to protect against
changes to netdev port mapping.
Signed-off-by: Gaetan Rivet <grive@u256.net>
Reviewed-by: Eli Britstein <elibr@nvidia.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
When a port is deleted, its offloads must be flushed. The operation
runs in the thread that initiated it. Offload data is thus accessed
jointly by the port deletion thread(s) and the offload thread, which
complicates the data access model.
To simplify this model, as a pre-step toward introducing parallel
offloads, execute the flush operation in the offload thread.
Signed-off-by: Gaetan Rivet <grive@u256.net>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Offload requests currently only support flow offloads.
As a pre-step before supporting an offload flush request,
modify the layout of an offload request item, to become a tagged union.
Future offload types won't be forced to re-use the full flow offload
structure, which consumes a lot of memory.
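An illustrative layout of such a tagged union (member types are
hypothetical):
```
struct dp_offload_thread_item {
    enum {
        DP_OFFLOAD_FLOW,    /* Flow add/mod/del request. */
        DP_OFFLOAD_FLUSH,   /* Flush request, added later. */
    } type;
    union {
        struct dp_offload_flow_item flow;    /* Large. */
        struct dp_offload_flush_item flush;  /* Small. */
    } data;
};
```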
Signed-off-by: Gaetan Rivet <grive@u256.net>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Use the netdev-offload multithread API to allow multiple threads to
allocate marks concurrently.
Initialize the pool only once in a multithread context by using the
ovsthread_once type.
Use the id-fpool module for faster concurrent ID allocation.
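A sketch of the one-time initialization with ovsthread_once (the
id_fpool_create() parameters are illustrative):
```
static struct ovsthread_once init_once = OVSTHREAD_ONCE_INITIALIZER;

if (ovsthread_once_start(&init_once)) {
    mark_pool = id_fpool_create(n_offload_threads, 1, MAX_FLOW_MARK);
    ovsthread_once_done(&init_once);
}
```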
Signed-off-by: Gaetan Rivet <grive@u256.net>
Reviewed-by: Eli Britstein <elibr@nvidia.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>