In the case where there is a NUMA node that has a zero variance
improvement, the log will report it's variance improvement as value for
a previous NUMA node with a non-zero variance improvement.
For example in an artificial case:
|dpif_netdev|DBG|Numa node 1. Current variance 1000 Estimated variance 0.
Variance improvement 100%.
^^^ correct value
|dpif_netdev|DBG|Numa node 0. Current variance 0 Estimated variance 0.
Variance improvement 100%.
^^^ incorrect value for Numa 0, value from Numa 1
This is caused by not resetting the improvement between loops.
This is a debug log reporting issue only, non-zero variance improvement
will still trigger rebalance where appropriate.
Move improvement and other variables into the loop code block to fix
logs.
Fixes: 46e04ec31bb2 ("dpif-netdev: Calculate per numa variance.")
Reported-at: https://issues.redhat.com/browse/FDP-1145
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: Simon Horman <horms@ovn.org>
Reviewed-by: David Marchand <david.marchand@redhat.com>
Rather than drop all pending Tx offloads on recirculation,
preserve inner offloads (and mark packet with outer Tx offloads)
when parsing the packet again.
Fixes: c6538b443984 ("dpif-netdev: Fix crash due to tunnel offloading on recirculation.")
Fixes: 084c8087292c ("userspace: Support VXLAN and GENEVE TSO.")
Reported-at: https://issues.redhat.com/browse/FDP-1144
Acked-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
This patch extends the userspace datapaths support of tunnel tso from
only supporting VxLAN and Geneve to also supporting GRE tunnels. There
is also a software fallback for cases where the egress netdev does not
support this feature.
Reviewed-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Atomics are not needed when reporting zone limits.
Remove the restriction by defining a non-atomic common structure
to report such data.
The change also access atomics using the related operations to
retrieve atomics reporting only the fields required by the requesting
level instead of relying of struct copy.
Signed-off-by: Paolo Valerio <pvalerio@redhat.com>
Signed-off-by: Aaron Conole <aconole@redhat.com>
Clang analyzer will complain about floating point operations conducted
with integer types as rounding is undefined. In pmd_info_show_rxq() a
percentage was calculated inside uint64 integers instead of a floating
pointer variable for a user visible message. This issue can be resolved
simply by casting to double while dividing.
Acked-by: Simon Horman <horms@ovn.org>
Signed-off-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
The odp_flow_format() function applies a wildcard mask when a
mask for a given key was not present. However, in the context of
installing flows in the datapath, the absence of a mask actually
indicates that the key should be ignored, meaning it should not
be masked at all.
To address this inconsistency, odp_flow_format() now includes an
option to skip formatting keys that are missing a mask.
This was found during a debug session of the ‘datapath - ping between two
ports on cvlan’ test case. The log message was showing the following:
put[create] ufid:XX recirc_id(0),dp_hash(0/0),skb_priority(0/0),in_port(3),
skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),
eth(src=12:f6:8b:52:f9:75,dst=6e:48:c8:77:d3:8c),eth_type(0x88a8),
vlan(vid=4094,pcp=0/0x0),encap(eth_type(0x8100),
vlan(vid=100/0x0,pcp=0/0x0),encap(eth_type(0x0800),
ipv4(src=10.2.2.2,dst=10.2.2.1,proto=1,tos=0,ttl=64,frag=no),
icmp(type=0,code=0))), actions:2
Where it should have shown the below:
put[create] ufid:XX recirc_id(0),dp_hash(0/0),skb_priority(0/0),in_port(3),
skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),
eth(src=12:f6:8b:52:f9:75,dst=6e:48:c8:77:d3:8c),eth_type(0x88a8),
vlan(vid=4094,pcp=0/0x0),encap(eth_type(0x8100)), actions:2
Acked-by: Simon Horman <horms@ovn.org>
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Add support for parsing and formatting the new action.
Also, flag OVS_ACTION_ATTR_SAMPLE as requiring datapath assistance if it
contains a nested OVS_ACTION_ATTR_PSAMPLE. The reason is that the
sampling rate from the parent "sample" is made available to the nested
"psample" by the kernel.
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
This is prep for adding a different OVS_ACTION_ATTR_ enum value. This
action, OVS_ACTION_ATTR_DEC_TTL, is not actually implemented. However,
to make -Werror happy we must add a case to all existing switches.
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Eric Garver <eric@garver.life>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Recirculation involves re-parsing the packet from scratch and that
process is not aware of multiple header levels nor the inner/outer
offsets. So, it overwrites offsets with new ones from the outermost
headers and sets offloading flags that change their meaning when
the packet is marked for tunnel offloading.
For example:
1. TCP packet enters OVS.
2. TCP packet gets encapsulated into UDP tunnel.
3. Recirculation happens.
4. Packet is re-parsed after recirculation with miniflow_extract()
or similar function.
5. Packet is marked for UDP checksumming because we parse the
outermost set of headers. But since it is tunneled, it means
inner UDP checksumming. And that makes no sense, because the
inner packet is TCP.
This is causing packet drops due to malformed packets or even
assertions and crashes in the code that is trying to fixup checksums
for packets using incorrect metadata:
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior
lib/packets.c:2061:15: runtime error:
member access within null pointer of type 'struct udp_header'
0 0xbe5221 in packet_udp_complete_csum lib/packets.c:2061:15
1 0x7e5662 in dp_packet_ol_send_prepare lib/dp-packet.c:638:9
2 0x96ef89 in netdev_send lib/netdev.c:940:9
3 0x818e94 in dp_netdev_pmd_flush_output_on_port lib/dpif-netdev.c:5577:9
4 0x817606 in dp_netdev_pmd_flush_output_packets lib/dpif-netdev.c:5618:27
5 0x81cfa5 in dp_netdev_process_rxq_port lib/dpif-netdev.c:5677:9
6 0x7eefe4 in dpif_netdev_run lib/dpif-netdev.c:7001:25
7 0x610e87 in type_run ofproto/ofproto-dpif.c:367:9
8 0x5b9e80 in ofproto_type_run ofproto/ofproto.c:1879:31
9 0x55bbb4 in bridge_run__ vswitchd/bridge.c:3281:9
10 0x558b6b in bridge_run vswitchd/bridge.c:3346:5
11 0x591dc5 in main vswitchd/ovs-vswitchd.c:130:9
12 0x172b89 in __libc_start_call_main (/lib64/libc.so.6+0x27b89)
13 0x172c4a in __libc_start_main@GLIBC_2.2.5 (/lib64/libc.so.6+0x27c4a)
14 0x47eff4 in _start (vswitchd/ovs-vswitchd+0x47eff4)
Tests added for both IPv4 and IPv6 cases. Though IPv6 test doesn't
trigger the issue it's better to have a symmetric test.
Fixes: 084c8087292c ("userspace: Support VXLAN and GENEVE TSO.")
Reported-at: https://mail.openvswitch.org/pipermail/ovs-discuss/2024-March/053014.html
Acked-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Avoid unnecessary thread creation as no upcalls are generated,
resulting in idle threads waiting for process termination.
This optimization significantly reduces memory usage, cutting it
by half on a 128 CPU/thread system during testing, with the number
of threads reduced from 95 to 0.
Acked-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
The patch, when 'persistent' flag is specified, makes the IP selection
in a range persistent across reboots.
Signed-off-by: Paolo Valerio <pvalerio@redhat.com>
Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Simon Horman <horms@ovn.org>
The userspace conntrack only supported hash for port selection.
With the patch, both userspace and kernel datapath support the random
flag.
The default behavior remains the same, that is, if no flags are
specified, hash is selected.
Signed-off-by: Paolo Valerio <pvalerio@redhat.com>
Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Simon Horman <horms@ovn.org>
In a scenario where OVN does load balancing and then SNAT with a OVS
userspace datapath [0], the recirc_depth may be greater than 6. In
that case, ovs-vswitchd might drop packets and raise warnings:
dpif_netdev|WARN|Packet dropped. Max recirculation depth exceeded.
Increasing MAX_RECIRC_DEPTH to 8 solves this issue.
[0] dd5cd73e3d/tests/system-ovn-kmod.at (L740)
Reported-at: https://issues.redhat.com/browse/FDP-251
Acked-by: Simon Horman <horms@ovn.org>
Signed-off-by: Jakob Meng <code@jakobmeng.de>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
For userspace datapath, this patch provides vxlan and geneve tunnel tso.
Only support userspace vxlan or geneve tunnel, meanwhile support
tunnel outter and inner csum offload. If netdev do not support offload
features, there is a software fallback.If netdev do not support vxlan
and geneve tso,packets will drop. Front-end devices can close offload
features by ethtool also.
Acked-by: Simon Horman <horms@ovn.org>
Signed-off-by: Dexia Li <dexia.li@jaguarmicro.com>
Co-authored-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The current protocol detection logic relies on two pieces of metadata
passed as arguments: tp_src and tp_dst, which represent the L4 source
and destination port numbers from the flow that triggered the current
flow rule first, and was responsible for creating the current DP flow.
Since multiple network flows of many different kinds, potentially using
different protocols on all layers, can be processed by one flow rule,
using the metadata of some unrelated flow might lead to unexpected
results. For example, ICMP type and code can be interpreted as TCP
source and destination ports. This can confuse the code responsible for
the helper selection, leading to errors in traffic handling and
incorrect detection of related flows.
One of the easiest ways to fix this problem is to simply remove the
tp_src and tp_dst parameters from the picture. The current code base has
no good use for them.
The helper selection logic was based on these values and therefore needs
to be changed. Ensure that the helper specified in a flow rule is used,
given it is compatible with the L4 protocol of the packet. When a flow
rule does not specify a helper, one can still be picked using the given
packet's metadata like TCP/UDP ports.
Signed-off-by: Viacheslav Galaktionov <viacheslav.galaktionov@arknetworks.am>
Signed-off-by: Aaron Conole <aconole@redhat.com>
Extend 'pmd-sleep-max' so that individual PMD thread cores may have a
specified max sleep request value.
Existing behaviour is maintained.
Any PMD thread core without a value will use the global value if set
or default no sleep.
To set PMD thread cores 8 and 9 to never request a load based sleep
and all other PMD thread cores to be able to request a max sleep of
50 usecs:
$ ovs-vsctl set open_vswitch . other_config:pmd-sleep-max=50,8:0,9:0
To set PMD thread cores 10 and 11 to request a max sleep of 100 usecs
and all other PMD thread cores to never request a sleep:
$ ovs-vsctl set open_vswitch . other_config:pmd-sleep-max=10:100,11:100
'pmd-sleep-show' is updated to show the max sleep value for each PMD
thread.
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Internally handle default CT zone limit as other limits that
can be passed via the list with special value -1. Currently,
the -1 is treated by both datapaths as default, add static
asserts to make sure that this remains the case in the future.
This allows us to easily delete the default zone limit.
Signed-off-by: Ales Musil <amusil@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
When enabling DPDK with the configure the below, ovs-vswitchd will crash.
ovs-vsctl set Open_vSwitch . other_config:n-offload-threads=0
ovs-vsctl set Open_vSwitch . other_config:hw-offload=true
This issue arises because setting the 'n-offload-threads' value to zero
is not a supported configuration. This fix addresses this by implementing
a check to ensure a valid 'n-offload-threads' value, both during
configuration and statistics gathering.
Fixes: 62c2d8a67543 ("netdev-offload: Add multi-thread API.")
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Ilya Maximets <i.maximets@ovn.org>
Signed-off-by: Simon Horman <horms@ovn.org>
The 'len' of a netdev_flow_key initialized by netdev_flow_key_init()
is always zero, which may cause errors when cloning a netdev_flow_key
by netdev_flow_key_clone().
Currently the 'len' member of a netdev_flow_key initialized by
netdev_flow_key_init() is not used, so this error will not cause any
bad behavior for now.
Fixes: c82f496c3b69 ("dpif-netdev: Use unmasked key when adding datapath flows.")
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Zhiqi Chen <chenzhiqi.123@bytedance.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
OVS allows overlapping megaflows, as long as the actions of these
megaflows are equal. However, the current implementation of action
modification relies on flow_lookup instead of UFID, this could result
in looking up a wrong megaflow and make the ukeys and megaflows
inconsistent.
Just like the test case in the patch, at first we have a rule with the
prefix:
10.1.2.0/24
And we will get a megaflow with prefixes 10.1.2.2/24 when a packet with
IP 10.1.2.2 is received.
Then suppose we change the rule into 10.1.0.0/16. OVS prefers to keep
the 10.1.2.2/24 megaflow and just changes its action instead of
extending the prefix into 10.1.2.2/16.
Then suppose we have a 10.1.0.2 packet, since it misses the megaflow,
this time, we will have an overlapping megaflow with the right prefix:
10.1.0.2/16
Now we have two megaflows:
10.1.2.2/24
10.1.0.2/16
Last, suppose we have changed the ruleset again. The revalidator this
time still decides to change the actions of both megaflows instead of
deleting them.
The dpif_netdev_flow_put will search the megaflow to modify with
unmasked keys, however it might lookup the wrong megaflow as the key
10.1.2.2 matches both 10.1.2.2/24 and 10.1.0.2/16!
This patch changes the megaflow lookup code in modification path into
relying the UFID to find the correct megaflow instead of key lookup.
Falling back to a classifier lookup in case where UFID was not provided
in order to support cases where UFID was not generated from the flow
data during the flow addition.
Fixes: beb75a40fdc2 ("userspace: Switching of L3 packets in L2 pipeline")
Signed-off-by: Peng He <hepeng.0320@bytedance.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Max requested sleep time and status for a PMD thread
is logged at start up or when changed, but it can be
convenient to have a command to dump this information
explicitly.
It is envisaged that this will be expanded for individual
pmds in the future, hence adding to dpif_netdev_pmd_info().
Reviewed-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
other_config:pmd-maxsleep is a config option to allow
PMD thread cores to sleep under low or no load conditions.
Rename it to 'pmd-sleep-max' to allow a more structured
name and so that additional options or command can follow
the 'pmd-sleep-xyz' pattern.
Use of other_config:pmd-maxsleep is deprecated to be
removed in a future release and will result in a warning.
Reviewed-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Current implementation of meters in the userspace datapath takes
the meter lock for every packet batch. If more than one thread
hits the flow with the same meter, they will lock each other.
Replace the critical section with atomic operations to avoid
interlocking. Meters themselves are RCU-protected, so it's safe
to access them without holding a lock.
Implementation does the following:
1. Tries to advance the 'used' timer of the meter with atomic
compare+exchange if it's smaller than 'now'.
2. If the timer change succeeds, atomically update band buckets.
3. Atomically update packet statistics for a meter.
4. Go over buckets and try to atomically subtract the amount of
packets or bytes, recording the highest exceeded band.
5. Atomically update band statistics and drop packets.
Bucket manipulations are implemented with atomic compare+exchange
operations with extra checks, because bucket size should never
exceed the maximum and it should never go below zero.
Packet statistics may be momentarily inconsistent, i.e., number
of packets and the number of bytes may reflect different sets
of packets. But it should be eventually consistent. And the
difference at any given time should be in just few packets.
For the sake of reduced code complexity PKTPS meter tries to push
packets through the band one by one, even though they all have
the same weight. This is also more fair if more than one thread
is passing packets through the same band at the same time.
Trying to predict the number of packets that can pass may also
cause extra atomic operations reducing the performance.
This implementation shows similar performance to the previous one,
but should scale better with more threads hitting the same meter.
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Tested-by: Lin Huang <linhuang@ruijie.com.cn>
Tested-by: Zhang YuHuang <zhangyuhuang@ruijie.com.cn>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The patch introduces a new commands ovs-appctl dpctl/dump-conntrack-exp
that allows to dump the existing expectations for the userspace ct.
Signed-off-by: Paolo Valerio <pvalerio@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The netdev receiving packets is supposed to provide the flags
indicating if the IP checksum was verified and it is GOOD or BAD,
otherwise the stack will check when appropriate by software.
If the packet comes with good checksum, then postpone the
checksum calculation to the egress device if needed.
When encapsulate a packet with that flag, set the checksum
of the inner IP header since that is not yet supported.
Calculate the IP checksum when the packet is going to be sent over
a device that doesn't support the feature.
Linux devices don't support IP checksum offload alone, so the
support is not enabled.
Signed-off-by: Flavio Leitner <fbl@sysclose.org>
Co-authored-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Since 3d9c1b855a5f ("conntrack: Replace timeout based expiration lists
with rculists.") the sweep interval changed as well as the constraints
related to the sweeper.
Being able to change the default reschedule time may be convenient in
some conditions, like debugging.
This patch introduces new commands allowing to get and set the sweep
interval in ms.
Signed-off-by: Paolo Valerio <pvalerio@redhat.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
When the ukey's action set changes, it could cause the flow to use a
different datapath, for example, when it moves from tc to kernel.
This will cause the the cached previous datapath statistics to be used.
This change will reset the cached statistics when a change in
datapath is discovered.
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Now that the timer slack for the PMD threads is reduced we can also
reduce the start/increment for PMD load based sleeping to match it.
This will further reduce initial sleep times making it more resilient
to interfaces that might be sensitive to large sleep times.
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Reviewed-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The default Linux timer slack groups timer expires into 50 uS intervals.
With some traffic patterns this can mean that returning to process
packets after a sleep takes too long and packets are dropped.
Add a helper to util.c and set use it to reduce the timer slack
for PMD threads, so that sleeps with smaller resolutions can be done
to prevent sleeping for too long.
Fixes: de3bbdc479a9 ("dpif-netdev: Add PMD load based sleeping.")
Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2023-January/401121.html
Reported-by: Ilya Maximets <i.maximets@ovn.org>
Signed-off-by: David Marchand <david.marchand@redhat.com>
Co-authored-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Sleep for an incremental amount of time if none of the Rx queues
assigned to a PMD have at least half a batch of packets (i.e. 16 pkts)
on an polling iteration of the PMD.
Upon detecting the threshold of >= 16 pkts on an Rxq, reset the
sleep time to zero (i.e. no sleep).
Sleep time will be increased on each iteration where the low load
conditions remain up to a total of the max sleep time which is set
by the user e.g:
ovs-vsctl set Open_vSwitch . other_config:pmd-maxsleep=500
The default pmd-maxsleep value is 0, which means that no sleeps
will occur and the default behaviour is unchanged from previously.
Also add new stats to pmd-perf-show to get visibility of operation
e.g.
...
- sleep iterations: 153994 ( 76.8 % of iterations)
Sleep time (us): 9159399 ( 59 us/iteration avg.)
...
Reviewed-by: Robin Jarry <rjarry@redhat.com>
Reviewed-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Currently, pmd_rebalance_dry_run() calculate overall variance of
all pmds regardless of their numa location. The overall result may
hide un-balance in an individual numa.
Considering the following case. Numa0 is free because VMs on numa0
are not sending pkts, while numa1 is busy. Within numa1, pmds
workloads are not balanced. Obviously, moving 500 kpps workloads from
pmd 126 to pmd 62 will make numa1 much more balance. For numa1
the variance improvement will be almost 100%, because after rebalance
each pmd in numa1 holds same workload(variance ~= 0). But the overall
variance improvement is only about 20%, which may not trigger auto_lb.
```
numa_id core_id kpps
0 30 0
0 31 0
0 94 0
0 95 0
1 126 1500
1 127 1000
1 63 1000
1 62 500
```
As auto_lb doesn't balance workload across numa nodes. So it makes
more sense to calculate variance improvement per numa node.
Signed-off-by: Cheng Li <lic121@chinatelecom.cn>
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Co-authored-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
There are some similar readings taken for pmds and Rx queues
in this function and a few of the variable names are ambiguous.
Improve the readability of the code by updating some variables
names to indicate that they are readings related to the pmd.
Reviewed-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
pmd-rxq-show shows the Rx queue to pmd assignments as well as the
pmd usage of each Rx queue.
Up until now a tail length of 60 seconds pmd usage was shown
for each Rx queue, as this is the value used during rebalance
to avoid any spike effects.
When debugging or tuning, it is also convenient to display the
pmd usage of an Rx queue over a shorter time frame, so any changes
config or traffic that impact pmd usage can be evaluated more quickly.
A parameter is added that allows pmd-rxq-show stats pmd usage to
be shown for a shorter time frame. Values are rounded up to the
nearest 5 seconds as that is the measurement granularity and the value
used is displayed. e.g.
$ ovs-appctl dpif-netdev/pmd-rxq-show -secs 5
Displaying last 5 seconds pmd usage %
pmd thread numa_id 0 core_id 4:
isolated : false
port: dpdk0 queue-id: 0 (enabled) pmd usage: 95 %
overhead: 4 %
The default time frame has not changed and the maximum value
is limited to the maximum stored tail length (60 seconds).
Reviewed-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The datapath supports installing wider flows, and OVS relies on
this behavior. For example if ipv4(src=1.1.1.1/192.0.0.0,
dst=1.1.1.2/192.0.0.0) exists, a wider flow (smaller mask) of
ipv4(src=192.1.1.1/128.0.0.0,dst=192.1.1.2/128.0.0.0) is allowed
to be added.
However, if we try to add a wildcard rule, the installation fails:
# ovs-appctl dpctl/add-flow system@myDP "in_port(1),eth_type(0x0800), \
ipv4(src=1.1.1.1/192.0.0.0,dst=1.1.1.2/192.0.0.0,frag=no)" 2
# ovs-appctl dpctl/add-flow system@myDP "in_port(1),eth_type(0x0800), \
ipv4(src=192.1.1.1/0.0.0.0,dst=49.1.1.2/0.0.0.0,frag=no)" 2
ovs-vswitchd: updating flow table (File exists)
The reason is that the key used to determine if the flow is already
present in the system uses the original key ANDed with the mask.
This results in the IP address not being part of the (miniflow) key,
i.e., being substituted with an all-zero value. When doing the actual
lookup, this results in the key wrongfully matching the first flow,
and therefore the flow does not get installed. The solution is to use
the unmasked key for the existence check, the same way this is handled
in the "slow" dpif_flow_put() case.
OVS relies on the fact that overlapping flows can exist if one is a
superset of the other. Note that this is only true when the same set
of actions is applied. This is due to how the revalidator process
works. During revalidation, OVS removes too generic flows from the
datapath to avoid incorrect matches but allows too narrow flows to
stay in the datapath to avoid the data plane disruption and also to
avoid constant flow deletions if the datapath ignores wildcards on
certain fields/bits. See flow_wildcards_has_extra() check in the
revalidate_ukey__() function.
The problem here is that we have a too narrow flow installed, and now
OpenFlow rules got changed, so the actual flow should be more generic.
Revalidators will not remove the narrow flow, and we will eventually get
an upcall on the packet that doesn't match the narrow flow, but we will
not be able to install a more generic flow because after masking with
the new wider mask, the key matches on the narrow flow, so we get EEXIST.
Fixes: beb75a40fdc2 ("userspace: Switching of L3 packets in L2 pipeline")
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Cited commit introduced a flag in dpif-netdev level, to optimize
performance and avoid hw_miss_packet_recover() for devices with no such
support.
However, there is a race condition between traffic processing and
assigning a 'flow_api' object to the netdev. In such case, EOPNOTSUPP is
returned by netdev_hw_miss_packet_recover() in netdev-offload.c layer
because 'flow_api' is not yet initialized. As a result, the flag is
falsely disabled, and subsequent packets won't be recovered, though they
should.
In order to fix it, move the flag to be in netdev-offload layer, to
avoid that race.
Fixes: 6e50c1651869 ("dpif-netdev: Avoid hw_miss_packet_recover() for devices with no support.")
Signed-off-by: Eli Britstein <elibr@nvidia.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Change the data structure from hmap to cmap for zone limits.
As they are shared amongst multiple conntrack users, multiple
readers want to check the current zone limit state before progressing in
their processing. Using a CMAP allows doing lookups without taking the
global 'ct_lock', thus reducing contention.
Signed-off-by: Gaetan Rivet <grive@u256.net>
Reviewed-by: Eli Britstein <elibr@nvidia.com>
Signed-off-by: Paolo Valerio <pvalerio@redhat.com>
Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Make the simple match functions used during lookup non-static to allow
reuse of these functions in the AVX512 DPIF.
Signed-off-by: Cian Ferriter <cian.ferriter@intel.com>
Tested-by: Harry van Haaren <harry.van.haaren@intel.com>
Acked-by: Sunil Pai G <sunil.pai.g@intel.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
dp_netdev_input_outer_avx512 allocates a 16KB scratch pad per PMD
thread, but it's never freed. This may cause significant memory
drain in dynamic environments.
==4068109==ERROR: LeakSanitizer: detected memory leaks
Direct leak of 38656 byte(s) in 2 object(s) allocated from:
0 0xf069fd in posix_memalign (vswitchd/ovs-vswitchd+0xf069fd)
1 0x1d7ed14 in xmalloc_size_align lib/util.c:254:13
2 0x1d7ed14 in xmalloc_pagealign lib/util.c:352:12
3 0x2098254 in dp_netdev_input_outer_avx512 lib/dpif-netdev-avx512.c:69:17
4 0x191591a in dp_netdev_process_rxq_port lib/dpif-netdev.c:5332:19
5 0x190a961 in pmd_thread_main lib/dpif-netdev.c:6963:17
6 0x1c4b69a in ovsthread_wrapper lib/ovs-thread.c:422:12
7 0x7fd5ea6f1179 in start_thread pthread_create.c
SUMMARY: AddressSanitizer: 38656 byte(s) leaked in 2 allocation(s).
Fixes: 9ac84a1a3698 ("dpif-avx512: Add ISA implementation of dpif.")
Reviewed-by: David Marchand <david.marchand@redhat.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Kumar Amber <kumar.amber@intel.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
This patch adds the old name "subtable-lookup-prio-get" as an unlisted command,
to restore a consistency between OVS releases for testing scripts.
Fixes: 738c76a503f4 ("dpcls: Change info-get function to fetch dpcls usage stats.")
Suggested-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Currently the pmd-auto-lb-rebal-interval's value was not been
checked properly.
It maybe a negative, or too big value (>2 weeks between rebalances),
which will be lead to a big unsigned value. So reset it to default
if the value exceeds the max permitted as described in vswitchd.xml.
Fixes: 5bf84282482a ("Adding support for PMD auto load balancing")
Signed-off-by: Lin Huang <linhuang@ruijie.com.cn>
Acked-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The ALB parameters should never be negative.
So it's to use unsigned smap_get versions to get it properly, and
update VLOG formatting.
Fixes: 5bf84282482a ("Adding support for PMD auto load balancing")
Signed-off-by: Lin Huang <linhuang@ruijie.com.cn>
Acked-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Previously logging about rxq scheduling was done in a code branch with
the selection of the PMD thread core after checking that a numa was
selected.
By splitting out the logging from the PMD thread core selection, it can
simplify the code complexity and make it more extendable for future
additions.
Also, minor updates to a couple of variables to improve readability and
fix a log indent while working on this code block.
There is no user visible change in behaviour or logs.
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
This splits up the looping through each PMD thread core on a numa node
with the check to compare cycles or rxqs.
This is done so in future the compare could be reused with any group
of PMD thread cores.
There is no user visible change in behaviour.
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Modified the dplcs info-get command output to include
the count for different dpcls implementations.
$ovs-appctl dpif-netdev/subtable-lookup-info-get
Available dpcls implementations:
autovalidator (Use count: 1, Priority: 5)
generic (Use count: 0, Priority: 1)
avx512_gather (Use count: 0, Priority: 3)
Test case to verify changes:
1061: PMD - dpcls configuration ok
Signed-off-by: Kumar Amber <kumar.amber@intel.com>
Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Co-authored-by: Harry van Haaren <harry.van.haaren@intel.com>
Co-authored-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
The hash of the port number is only needed when a DPCLS needs to be
created. Move the hash calculation inside the if to accomplish this.
Signed-off-by: Cian Ferriter <cian.ferriter@intel.com>
Acked-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The function dp_netdev_pmd_flush_output_on_port() iterates over the
p->output_pkts batch directly, when it should be using the special
iterator macro, DP_PACKET_BATCH_FOR_EACH.
However, this wasn't possible because the macro could not accept
&p->output_pkts.
The addition of parentheses when BATCH is dereferenced allows the macro
to expand properly. Parenthesizing arguments in macros is good practice
to be able to handle whichever expressions are passed in.
Signed-off-by: Rosemarie O'Riorden <roriorden@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The dp_netdev_get_pmd() function is using only the hash of the core_id
to get the pmd structure. So in case of hash collisions, the wrong pmd
is returned.
This patch is fixing this by checking for the correct core_id using
the CMAP_FOR_EACH_WITH_HASH macro.
Fixes: 65f13b50c5aa ("dpif-netdev: Create multiple pmd threads by default.")
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Reviewed-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
There are currently some checks for cross-numa polling cases to
ensure that they won't effect the accuracy of the PMD ALB.
If an rxq is pinned to a PMD thread core by the user it will not
be reassigned by OVS, so even if it is non-local numa polled it
will not impact PMD ALB accuracy.
To establish this, a check was made on whether the PMD thread core
was isolated or not. However, since other_config:pmd-rxq-isolate was
introduced, rxqs may be pinned but the PMD thread core not isolated.
It means that by setting pmd-rxq-isolate=false and doing non-local
numa pinning, PMD ALB may not run where it should.
If the PMD thread core is isolated we can skip individual rxq checks
but if not, we should check the individual rxqs for pinning before we
disallow PMD ALB.
Also, update function comments to make it's operation clearer.
Fixes: 6193e03267c1 ("dpif-netdev: Allow pin rxq and non-isolate PMD.")
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: Mike Pattrick <mkp@redhat.com>
Acked-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
This issue only occurs when there are more than 2 numa nodes
and no local numa PMD thread cores available for an interface
rxq.
In the event of no PMD thread cores available on the local numa
for an rxq to be assigned to, a PMD thread core from a non-local
numa is selected.
If there are more than one non-local numas with PMD thread cores
they are RR through and checked if they have non-isolated PMD thread
cores.
When successfully finding a non-local numa with available PMD
thread cores for an rxq, that numa was not being stored. It meant
if a similar situation occurred for a subsequent rxq, the same numa
would be selected again.
Store the last numa used when successfully finding a non-local numa
with available PMD thread cores, so the numa RR state is kept for subsequent
rxqs.
Fixes: f577c2d046b2 ("dpif-netdev: Rework rxq scheduling code.")
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: Mike Pattrick <mkp@redhat.com>
Acked-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Rename pmd_reblance_dry_run_needed() to
pmd_rebalance_dry_run_needed().
Fixes: a83a406096e9 ("dpif-netdev: Sync PMD ALB state with user commands.")
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: Mike Pattrick <mkp@redhat.com>
Acked-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>