mirror of https://github.com/openvswitch/ovs synced 2025-08-29 05:18:13 +00:00

829 Commits

Author SHA1 Message Date
Kevin Traynor
ac32bbe2c7 dpif-netdev: Fix Auto Load Balance debug log.
In the case where there is a NUMA node that has a zero variance
improvement, the log will report its variance improvement as the value
from a previous NUMA node with a non-zero variance improvement.

For example in an artificial case:
|dpif_netdev|DBG|Numa node 1. Current variance 1000 Estimated variance 0.
Variance improvement 100%.
                     ^^^ correct value

|dpif_netdev|DBG|Numa node 0. Current variance 0 Estimated variance 0.
Variance improvement 100%.
                     ^^^ incorrect value for Numa 0, value from Numa 1

This is caused by not resetting the improvement between loops.

This is a debug log reporting issue only, non-zero variance improvement
will still trigger rebalance where appropriate.

Move improvement and other variables into the loop code block to fix
logs.
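
A minimal sketch of this class of bug, in Python rather than the C of
dpif-netdev. The function names and the percentage formula are
assumptions made for illustration; only the "variable not reset between
loop iterations" pattern is taken from the description above.

```python
def improvements_stale(nodes):
    """Buggy shape: 'improvement' is initialised once, outside the loop."""
    out = []
    improvement = 0
    for numa_id, current, estimated in nodes:
        if current > 0:
            improvement = (current - estimated) * 100 // current
        # When current == 0 the value from the previous node leaks through.
        out.append((numa_id, improvement))
    return out


def improvements_fixed(nodes):
    """Fixed shape: 'improvement' is scoped to the loop body."""
    out = []
    for numa_id, current, estimated in nodes:
        improvement = 0  # reset for every NUMA node
        if current > 0:
            improvement = (current - estimated) * 100 // current
        out.append((numa_id, improvement))
    return out


# The artificial case from the log excerpt: (numa_id, current, estimated).
nodes = [(1, 1000, 0), (0, 0, 0)]
stale = improvements_stale(nodes)
fixed = improvements_fixed(nodes)
```

With the stale variant, node 0 inherits the 100% computed for node 1;
with the fixed variant it reports 0%.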

Fixes: 46e04ec31bb2 ("dpif-netdev: Calculate per numa variance.")
Reported-at: https://issues.redhat.com/browse/FDP-1145
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: Simon Horman <horms@ovn.org>
Reviewed-by: David Marchand <david.marchand@redhat.com>
2025-02-14 13:27:05 +00:00
David Marchand
c771758249 dpif-netdev: Preserve inner offloads on recirculation.
Rather than drop all pending Tx offloads on recirculation,
preserve inner offloads (and mark packet with outer Tx offloads)
when parsing the packet again.

Fixes: c6538b443984 ("dpif-netdev: Fix crash due to tunnel offloading on recirculation.")
Fixes: 084c8087292c ("userspace: Support VXLAN and GENEVE TSO.")
Reported-at: https://issues.redhat.com/browse/FDP-1144
Acked-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2025-02-13 21:32:15 +01:00
Mike Pattrick
2276c3a2c6 userspace: Support GRE TSO.
This patch extends the userspace datapath's support for tunnel TSO from
only VXLAN and Geneve to also cover GRE tunnels. There is also a
software fallback for cases where the egress netdev does not support
this feature.

Reviewed-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2025-01-17 00:20:48 +01:00
Paolo Valerio
8ff40f3358 conntrack: Do not use atomics to report zones info.
Atomics are not needed when reporting zone limits.
Remove the restriction by defining a non-atomic common structure
to report such data.
The change also accesses atomics using the related operations,
reporting only the fields required by the requesting level instead of
relying on a struct copy.

Signed-off-by: Paolo Valerio <pvalerio@redhat.com>
Signed-off-by: Aaron Conole <aconole@redhat.com>
2024-10-09 10:55:18 -04:00
Mike Pattrick
a67db28fd9 dpif-netdev: Remove undefined integer division.
Clang analyzer will complain about floating point operations conducted
with integer types as rounding is undefined. In pmd_info_show_rxq() a
percentage was calculated using uint64 integers instead of a floating
point variable for a user visible message. This issue can be resolved
simply by casting to double while dividing.
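
The difference can be illustrated with a small Python sketch. The
helper names are hypothetical and this is not the pmd_info_show_rxq()
code; Python's `//` stands in for integer division and `/` for dividing
after a cast to double.

```python
def percent_integer(part, total):
    # Integer arithmetic: the fractional part of the result is lost.
    return part * 100 // total


def percent_float(part, total):
    # Floating point division keeps the fraction for display.
    return part * 100 / total


truncated = percent_integer(1, 3)
precise = percent_float(1, 3)
```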

Acked-by: Simon Horman <horms@ovn.org>
Signed-off-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
2024-09-11 15:37:56 +02:00
Eelco Chaudron
252ee0f182 dpif: Fix flow put debug message match content.
The odp_flow_format() function applies a wildcard mask when a
mask for a given key was not present. However, in the context of
installing flows in the datapath, the absence of a mask actually
indicates that the key should be ignored, meaning it should not
be masked at all.

To address this inconsistency, odp_flow_format() now includes an
option to skip formatting keys that are missing a mask.

This was found during a debug session of the ‘datapath - ping between two
ports on cvlan’ test case. The log message was showing the following:

  put[create] ufid:XX recirc_id(0),dp_hash(0/0),skb_priority(0/0),in_port(3),
    skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),
    eth(src=12:f6:8b:52:f9:75,dst=6e:48:c8:77:d3:8c),eth_type(0x88a8),
    vlan(vid=4094,pcp=0/0x0),encap(eth_type(0x8100),
    vlan(vid=100/0x0,pcp=0/0x0),encap(eth_type(0x0800),
    ipv4(src=10.2.2.2,dst=10.2.2.1,proto=1,tos=0,ttl=64,frag=no),
    icmp(type=0,code=0))), actions:2

Where it should have shown the below:

  put[create] ufid:XX recirc_id(0),dp_hash(0/0),skb_priority(0/0),in_port(3),
    skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),
    eth(src=12:f6:8b:52:f9:75,dst=6e:48:c8:77:d3:8c),eth_type(0x88a8),
    vlan(vid=4094,pcp=0/0x0),encap(eth_type(0x8100)), actions:2

Acked-by: Simon Horman <horms@ovn.org>
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
2024-09-09 10:22:16 +02:00
Adrian Moreno
1a3bd96b4f odp-util: Add support OVS_ACTION_ATTR_PSAMPLE.
Add support for parsing and formatting the new action.

Also, flag OVS_ACTION_ATTR_SAMPLE as requiring datapath assistance if it
contains a nested OVS_ACTION_ATTR_PSAMPLE. The reason is that the
sampling rate from the parent "sample" is made available to the nested
"psample" by the kernel.

Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2024-07-14 17:19:52 +02:00
Eric Garver
8bb065961e dpif: Stub out unimplemented action OVS_ACTION_ATTR_DEC_TTL.
This is prep for adding a different OVS_ACTION_ATTR_ enum value. This
action, OVS_ACTION_ATTR_DEC_TTL, is not actually implemented. However,
to make -Werror happy we must add a case to all existing switches.

Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Eric Garver <eric@garver.life>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2024-04-05 23:10:17 +02:00
Ilya Maximets
c6538b4439 dpif-netdev: Fix crash due to tunnel offloading on recirculation.
Recirculation involves re-parsing the packet from scratch and that
process is not aware of multiple header levels nor the inner/outer
offsets.  So, it overwrites offsets with new ones from the outermost
headers and sets offloading flags that change their meaning when
the packet is marked for tunnel offloading.

For example:

 1. TCP packet enters OVS.
 2. TCP packet gets encapsulated into UDP tunnel.
 3. Recirculation happens.
 4. Packet is re-parsed after recirculation with miniflow_extract()
    or similar function.
 5. Packet is marked for UDP checksumming because we parse the
    outermost set of headers.  But since it is tunneled, it means
    inner UDP checksumming.  And that makes no sense, because the
    inner packet is TCP.

This is causing packet drops due to malformed packets or even
assertions and crashes in the code that is trying to fixup checksums
for packets using incorrect metadata:

 SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior

 lib/packets.c:2061:15: runtime error:
        member access within null pointer of type 'struct udp_header'

  0 0xbe5221 in packet_udp_complete_csum lib/packets.c:2061:15
  1 0x7e5662 in dp_packet_ol_send_prepare lib/dp-packet.c:638:9
  2 0x96ef89 in netdev_send lib/netdev.c:940:9
  3 0x818e94 in dp_netdev_pmd_flush_output_on_port lib/dpif-netdev.c:5577:9
  4 0x817606 in dp_netdev_pmd_flush_output_packets lib/dpif-netdev.c:5618:27
  5 0x81cfa5 in dp_netdev_process_rxq_port lib/dpif-netdev.c:5677:9
  6 0x7eefe4 in dpif_netdev_run lib/dpif-netdev.c:7001:25
  7 0x610e87 in type_run ofproto/ofproto-dpif.c:367:9
  8 0x5b9e80 in ofproto_type_run ofproto/ofproto.c:1879:31
  9 0x55bbb4 in bridge_run__ vswitchd/bridge.c:3281:9
 10 0x558b6b in bridge_run vswitchd/bridge.c:3346:5
 11 0x591dc5 in main vswitchd/ovs-vswitchd.c:130:9
 12 0x172b89 in __libc_start_call_main (/lib64/libc.so.6+0x27b89)
 13 0x172c4a in __libc_start_main@GLIBC_2.2.5 (/lib64/libc.so.6+0x27c4a)
 14 0x47eff4 in _start (vswitchd/ovs-vswitchd+0x47eff4)

Tests added for both IPv4 and IPv6 cases.  Though IPv6 test doesn't
trigger the issue it's better to have a symmetric test.

Fixes: 084c8087292c ("userspace: Support VXLAN and GENEVE TSO.")
Reported-at: https://mail.openvswitch.org/pipermail/ovs-discuss/2024-March/053014.html
Acked-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2024-03-22 20:45:37 +01:00
Eelco Chaudron
f0d1beca6c dpif-netdev: Do not create handler threads.
Avoid unnecessary thread creation as no upcalls are generated,
resulting in idle threads waiting for process termination.

This optimization significantly reduces memory usage, cutting it
by half on a 128 CPU/thread system during testing, with the number
of threads reduced from 95 to 0.

Acked-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
2024-02-29 12:24:59 +01:00
Paolo Valerio
afdc1171a8 conntrack: Handle persistent selection for IP addresses.
The patch, when 'persistent' flag is specified, makes the IP selection
in a range persistent across reboots.

Signed-off-by: Paolo Valerio <pvalerio@redhat.com>
Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Simon Horman <horms@ovn.org>
2024-02-21 10:12:42 +00:00
Paolo Valerio
99413ec261 conntrack: Handle random selection for port ranges.
The userspace conntrack only supported hash for port selection.
With the patch, both userspace and kernel datapath support the random
flag.

The default behavior remains the same, that is, if no flags are
specified, hash is selected.

Signed-off-by: Paolo Valerio <pvalerio@redhat.com>
Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Simon Horman <horms@ovn.org>
2024-02-21 10:12:04 +00:00
Jakob Meng
5df46a44e8 dpif-netdev: Increase MAX_RECIRC_DEPTH to 8.
In a scenario where OVN does load balancing and then SNAT with an OVS
userspace datapath [0], the recirc_depth may be greater than 6. In
that case, ovs-vswitchd might drop packets and raise warnings:

  dpif_netdev|WARN|Packet dropped. Max recirculation depth exceeded.

Increasing MAX_RECIRC_DEPTH to 8 solves this issue.

[0] dd5cd73e3d/tests/system-ovn-kmod.at (L740)

Reported-at: https://issues.redhat.com/browse/FDP-251
Acked-by: Simon Horman <horms@ovn.org>
Signed-off-by: Jakob Meng <code@jakobmeng.de>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2024-02-15 22:24:43 +01:00
Dexia Li
084c808729 userspace: Support VXLAN and GENEVE TSO.
For the userspace datapath, this patch provides VXLAN and Geneve tunnel
TSO. Only userspace VXLAN and Geneve tunnels are supported; tunnel outer
and inner checksum offload is supported as well. If the netdev does not
support the offload features, there is a software fallback. If the
netdev does not support VXLAN and Geneve TSO, packets will be dropped.
Front-end devices can also disable offload features with ethtool.

Acked-by: Simon Horman <horms@ovn.org>
Signed-off-by: Dexia Li <dexia.li@jaguarmicro.com>
Co-authored-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2024-01-17 22:06:45 +01:00
Viacheslav Galaktionov
14ef8b451f lib/conntrack: Only use given packet in protocol detection.
The current protocol detection logic relies on two pieces of metadata
passed as arguments: tp_src and tp_dst, which represent the L4 source
and destination port numbers from the flow that triggered the current
flow rule first, and was responsible for creating the current DP flow.

Since multiple network flows of many different kinds, potentially using
different protocols on all layers, can be processed by one flow rule,
using the metadata of some unrelated flow might lead to unexpected
results. For example, ICMP type and code can be interpreted as TCP
source and destination ports. This can confuse the code responsible for
the helper selection, leading to errors in traffic handling and
incorrect detection of related flows.

One of the easiest ways to fix this problem is to simply remove the
tp_src and tp_dst parameters from the picture. The current code base has
no good use for them.

The helper selection logic was based on these values and therefore needs
to be changed. Ensure that the helper specified in a flow rule is used,
given it is compatible with the L4 protocol of the packet. When a flow
rule does not specify a helper, one can still be picked using the given
packet's metadata like TCP/UDP ports.

Signed-off-by: Viacheslav Galaktionov <viacheslav.galaktionov@arknetworks.am>
Signed-off-by: Aaron Conole <aconole@redhat.com>
2024-01-10 16:16:08 -05:00
Kevin Traynor
4cbbf56e6c dpif-netdev: Add per PMD sleep config.
Extend 'pmd-sleep-max' so that individual PMD thread cores may have a
specified max sleep request value.

Existing behaviour is maintained.

Any PMD thread core without a value will use the global value if set
or default no sleep.

To set PMD thread cores 8 and 9 to never request a load based sleep
and all other PMD thread cores to be able to request a max sleep of
50 usecs:

 $ ovs-vsctl set open_vswitch . other_config:pmd-sleep-max=50,8:0,9:0

To set PMD thread cores 10 and 11 to request a max sleep of 100 usecs
and all other PMD thread cores to never request a sleep:

 $ ovs-vsctl set open_vswitch . other_config:pmd-sleep-max=10:100,11:100

'pmd-sleep-show' is updated to show the max sleep value for each PMD
thread.
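
A rough sketch of how a value such as "50,8:0,9:0" could be
interpreted, based only on the two examples above. The function names
are hypothetical, and the real parser in dpif-netdev may validate and
bound values differently.

```python
def parse_pmd_sleep_max(value):
    """Split a pmd-sleep-max string into (global_default, per_core).

    A bare number is taken as the global max sleep in usecs; a
    "core:usecs" pair sets the max sleep for a single PMD thread core.
    With no bare number, the global default stays 0 (no sleep).
    """
    default = 0
    per_core = {}
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        if ":" in item:
            core, usecs = item.split(":", 1)
            per_core[int(core)] = int(usecs)
        else:
            default = int(item)
    return default, per_core


def max_sleep_for(core_id, default, per_core):
    # A core without its own value falls back to the global value.
    return per_core.get(core_id, default)


default, per_core = parse_pmd_sleep_max("50,8:0,9:0")
core8 = max_sleep_for(8, default, per_core)
core3 = max_sleep_for(3, default, per_core)
```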

Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2023-12-16 01:07:59 +01:00
Ales Musil
4b9eb061b1 ct-dpif: Handle default zone limit the same way as other limits.
Internally handle default CT zone limit as other limits that
can be passed via the list with special value -1. Currently,
the -1 is treated by both datapaths as default, add static
asserts to make sure that this remains the case in the future.
This allows us to easily delete the default zone limit.

Signed-off-by: Ales Musil <amusil@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2023-12-05 20:42:04 +01:00
Eelco Chaudron
08212d755e netdev-offload: Fix Clang's static analyzer 'Division by zero' warnings.
When DPDK is enabled with the configuration below, ovs-vswitchd will crash.

  ovs-vsctl set Open_vSwitch . other_config:n-offload-threads=0
  ovs-vsctl set Open_vSwitch . other_config:hw-offload=true

This issue arises because setting the 'n-offload-threads' value to zero
is not a supported configuration. This fix addresses this by implementing
a check to ensure a valid 'n-offload-threads' value, both during
configuration and statistics gathering.

Fixes: 62c2d8a67543 ("netdev-offload: Add multi-thread API.")
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Ilya Maximets <i.maximets@ovn.org>
Signed-off-by: Simon Horman <horms@ovn.org>
2023-10-31 15:00:17 +00:00
Zhiqi Chen
785e22f876 dpif-netdev: Fix length calculation of netdev_flow_key.
The 'len' of a netdev_flow_key initialized by netdev_flow_key_init()
is always zero, which may cause errors when cloning a netdev_flow_key
by netdev_flow_key_clone().

Currently the 'len' member of a netdev_flow_key initialized by
netdev_flow_key_init() is not used, so this error will not cause any
bad behavior for now.

Fixes: c82f496c3b69 ("dpif-netdev: Use unmasked key when adding datapath flows.")
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Zhiqi Chen <chenzhiqi.123@bytedance.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2023-08-25 22:02:50 +02:00
Peng He
21410ff800 dpif-netdev: Fix dpif_netdev_flow_put.
OVS allows overlapping megaflows, as long as the actions of these
megaflows are equal.  However, the current implementation of action
modification relies on flow_lookup instead of UFID, this could result
in looking up a wrong megaflow and make the ukeys and megaflows
inconsistent.

Just like the test case in the patch, at first we have a rule with the
prefix:

10.1.2.0/24

And we will get a megaflow with prefixes 10.1.2.2/24 when a packet with
IP 10.1.2.2 is received.

Then suppose we change the rule into 10.1.0.0/16.  OVS prefers to keep
the 10.1.2.2/24 megaflow and just changes its action instead of
extending the prefix into 10.1.2.2/16.

Then suppose we have a 10.1.0.2 packet, since it misses the megaflow,
this time, we will have an overlapping megaflow with the right prefix:
10.1.0.2/16

Now we have two megaflows:
10.1.2.2/24
10.1.0.2/16

Last, suppose we have changed the ruleset again.  The revalidator this
time still decides to change the actions of both megaflows instead of
deleting them.

The dpif_netdev_flow_put will search for the megaflow to modify with
unmasked keys, however it might look up the wrong megaflow as the key
10.1.2.2 matches both 10.1.2.2/24 and 10.1.0.2/16!

This patch changes the megaflow lookup code in the modification path to
rely on the UFID to find the correct megaflow instead of a key lookup.

It falls back to a classifier lookup when the UFID was not provided,
in order to support cases where the UFID was not generated from the flow
data during the flow addition.
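
The ambiguity can be reproduced with a small Python model. The UFID
strings and helper names below are made up for illustration, and the
dict stands in for the datapath flow table; this is not the
dpif-netdev lookup code.

```python
import ipaddress


def matches(key, megaflow):
    """True if the unmasked IP 'key' falls inside the megaflow's prefix."""
    network = ipaddress.ip_network(megaflow, strict=False)
    return ipaddress.ip_address(key) in network


def lookup_by_key(megaflows, key):
    """Key based lookup: returns whichever matching megaflow comes first."""
    for ufid, prefix in megaflows.items():
        if matches(key, prefix):
            return ufid
    return None


def lookup_by_ufid(megaflows, ufid):
    """UFID based lookup: identifies exactly one megaflow."""
    return ufid if ufid in megaflows else None


# The two overlapping megaflows from the example above.
megaflows = {"ufid-24": "10.1.2.2/24", "ufid-16": "10.1.0.2/16"}
candidates = [u for u, p in megaflows.items() if matches("10.1.2.2", p)]
```

The key 10.1.2.2 matches both entries, so a key based lookup cannot
tell which of the two the caller meant to modify, while the UFID can.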

Fixes: beb75a40fdc2 ("userspace: Switching of L3 packets in L2 pipeline")
Signed-off-by: Peng He <hepeng.0320@bytedance.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2023-08-14 16:48:56 +02:00
Kevin Traynor
bc6a6f82e5 dpif-netdev: Add pmd-sleep-show command.
Max requested sleep time and status for a PMD thread
is logged at start up or when changed, but it can be
convenient to have a command to dump this information
explicitly.

It is envisaged that this will be expanded for individual
pmds in the future, hence adding to dpif_netdev_pmd_info().

Reviewed-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2023-07-15 00:17:05 +02:00
Kevin Traynor
023dcdc7a1 dpif-netdev: Rename pmd-maxsleep config option.
other_config:pmd-maxsleep is a config option to allow
PMD thread cores to sleep under low or no load conditions.

Rename it to 'pmd-sleep-max' to allow a more structured
name and so that additional options or command can follow
the 'pmd-sleep-xyz' pattern.

Use of other_config:pmd-maxsleep is deprecated to be
removed in a future release and will result in a warning.

Reviewed-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2023-07-15 00:11:21 +02:00
Ilya Maximets
c2433bdfc0 dpif-netdev: Lockless meters.
Current implementation of meters in the userspace datapath takes
the meter lock for every packet batch.  If more than one thread
hits the flow with the same meter, they will lock each other.

Replace the critical section with atomic operations to avoid
interlocking.  Meters themselves are RCU-protected, so it's safe
to access them without holding a lock.

Implementation does the following:

 1. Tries to advance the 'used' timer of the meter with atomic
    compare+exchange if it's smaller than 'now'.
 2. If the timer change succeeds, atomically update band buckets.
 3. Atomically update packet statistics for a meter.
 4. Go over buckets and try to atomically subtract the amount of
    packets or bytes, recording the highest exceeded band.
 5. Atomically update band statistics and drop packets.

Bucket manipulations are implemented with atomic compare+exchange
operations with extra checks, because bucket size should never
exceed the maximum and it should never go below zero.

Packet statistics may be momentarily inconsistent, i.e., number
of packets and the number of bytes may reflect different sets
of packets.  But it should be eventually consistent.  And the
difference at any given time should be just a few packets.

For the sake of reduced code complexity PKTPS meter tries to push
packets through the band one by one, even though they all have
the same weight.  This is also more fair if more than one thread
is passing packets through the same band at the same time.
Trying to predict the number of packets that can pass may also
cause extra atomic operations reducing the performance.

This implementation shows similar performance to the previous one,
but should scale better with more threads hitting the same meter.
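
The bucket handling described above can be sketched as follows. Python
has no native atomics, so the AtomicInt class below emulates
compare+exchange with a lock; it and the two bucket helpers are
assumptions for illustration only, not the dpif-netdev meter code.

```python
import threading


class AtomicInt:
    """Stand-in for an atomic integer; a lock emulates compare+exchange."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def load(self):
        with self._lock:
            return self._value

    def compare_exchange(self, expected, desired):
        """Store 'desired' only if the current value equals 'expected'."""
        with self._lock:
            if self._value == expected:
                self._value = desired
                return True
            return False


def bucket_refill(bucket, amount, maximum):
    """Add tokens, retrying on contention; never exceed 'maximum'."""
    while True:
        current = bucket.load()
        updated = min(current + amount, maximum)
        if bucket.compare_exchange(current, updated):
            return updated


def bucket_try_consume(bucket, amount):
    """Subtract tokens, retrying on contention; never go below zero."""
    while True:
        current = bucket.load()
        if current < amount:
            return False  # band exceeded for this packet
        if bucket.compare_exchange(current, current - amount):
            return True
```

The retry loops are what replace the critical section: a thread that
loses the race simply re-reads the bucket and tries again.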

Reviewed-by: Simon Horman <simon.horman@corigine.com>
Tested-by: Lin Huang <linhuang@ruijie.com.cn>
Tested-by: Zhang YuHuang <zhangyuhuang@ruijie.com.cn>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2023-07-01 00:35:18 +02:00
Paolo Valerio
9b4d2ad8e8 conntrack: Allow to dump userspace conntrack expectations.
The patch introduces a new command, ovs-appctl dpctl/dump-conntrack-exp,
that allows dumping the existing expectations for the userspace ct.

Signed-off-by: Paolo Valerio <pvalerio@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2023-06-29 22:20:43 +02:00
Mike Pattrick
5d11c47d3e userspace: Enable IP checksum offloading by default.
The netdev receiving packets is supposed to provide the flags
indicating if the IP checksum was verified and it is GOOD or BAD,
otherwise the stack will check when appropriate by software.

If the packet comes with good checksum, then postpone the
checksum calculation to the egress device if needed.

When encapsulating a packet with that flag, set the checksum
of the inner IP header since that is not yet supported.

Calculate the IP checksum when the packet is going to be sent over
a device that doesn't support the feature.

Linux devices don't support IP checksum offload alone, so the
support is not enabled.

Signed-off-by: Flavio Leitner <fbl@sysclose.org>
Co-authored-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2023-06-15 23:49:51 +02:00
Paolo Valerio
9fa612959c ovs-dpctl: Add new command dpctl/ct-[sg]et-sweep-interval.
Since 3d9c1b855a5f ("conntrack: Replace timeout based expiration lists
with rculists.") the sweep interval changed as well as the constraints
related to the sweeper.
Being able to change the default reschedule time may be convenient in
some conditions, like debugging.
This patch introduces new commands allowing to get and set the sweep
interval in ms.

Signed-off-by: Paolo Valerio <pvalerio@redhat.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2023-04-06 22:59:25 +02:00
Eelco Chaudron
4d69c19000 ofproto-dpif-upcall: Reset ukey's last stats value if the datapath changed.
When the ukey's action set changes, it could cause the flow to use a
different datapath, for example, when it moves from tc to kernel.
This will cause the cached previous datapath statistics to be used.

This change will reset the cached statistics when a change in
datapath is discovered.

Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2023-03-03 22:27:37 +01:00
Kevin Traynor
948767a18d dpif-netdev: Set PMD load based sleep start/inc to 1 us.
Now that the timer slack for the PMD threads is reduced we can also
reduce the start/increment for PMD load based sleeping to match it.

This will further reduce initial sleep times making it more resilient
to interfaces that might be sensitive to large sleep times.

Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Reviewed-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2023-01-23 17:23:28 +01:00
David Marchand
f62629a558 dpif-netdev: Set timer slack for PMD threads.
The default Linux timer slack groups timer expirations into 50 us intervals.

With some traffic patterns this can mean that returning to process
packets after a sleep takes too long and packets are dropped.

Add a helper to util.c and use it to reduce the timer slack
for PMD threads, so that sleeps with smaller resolutions can be done
to prevent sleeping for too long.

Fixes: de3bbdc479a9 ("dpif-netdev: Add PMD load based sleeping.")
Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2023-January/401121.html
Reported-by: Ilya Maximets <i.maximets@ovn.org>
Signed-off-by: David Marchand <david.marchand@redhat.com>
Co-authored-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2023-01-23 17:23:20 +01:00
Kevin Traynor
de3bbdc479 dpif-netdev: Add PMD load based sleeping.
Sleep for an incremental amount of time if none of the Rx queues
assigned to a PMD have at least half a batch of packets (i.e. 16 pkts)
on a polling iteration of the PMD.

Upon detecting the threshold of >= 16 pkts on an Rxq, reset the
sleep time to zero (i.e. no sleep).

Sleep time will be increased on each iteration where the low load
conditions remain up to a total of the max sleep time which is set
by the user e.g:
ovs-vsctl set Open_vSwitch . other_config:pmd-maxsleep=500

The default pmd-maxsleep value is 0, which means that no sleeps
will occur and the default behaviour is unchanged from previously.
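
The sleep policy described above can be sketched in Python as below.
The function name and the increment_us parameter are assumptions; the
commit text does not state the increment, so it is left configurable
here. Only the 16 packet threshold and the reset/increase/cap rules
come from the description.

```python
BATCH_THRESHOLD = 16  # half a batch of packets, per the description


def next_sleep_us(current_sleep_us, max_rxq_pkts, max_sleep_us,
                  increment_us=1):
    """Return the sleep time to request after one polling iteration.

    max_rxq_pkts is the largest number of packets received from any
    single Rx queue of the PMD during the iteration.
    """
    if max_rxq_pkts >= BATCH_THRESHOLD:
        return 0  # load detected: reset to no sleep
    # Low load persists: grow the sleep, capped at the configured maximum.
    return min(current_sleep_us + increment_us, max_sleep_us)


# With the default maximum of 0 the result is always 0 (no sleeps).
sleep_default = next_sleep_us(0, 0, 0)
# With pmd-maxsleep=500 the sleep grows under low load and then saturates.
sleep_growing = next_sleep_us(0, 3, 500, increment_us=10)
sleep_capped = next_sleep_us(495, 0, 500, increment_us=10)
```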

Also add new stats to pmd-perf-show to get visibility of operation
e.g.
...
   - sleep iterations:       153994  ( 76.8 % of iterations)
   Sleep time (us):         9159399  ( 59 us/iteration avg.)
...

Reviewed-by: Robin Jarry <rjarry@redhat.com>
Reviewed-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2023-01-12 18:56:05 +01:00
Cheng Li
46e04ec31b dpif-netdev: Calculate per numa variance.
Currently, pmd_rebalance_dry_run() calculates the overall variance of
all pmds regardless of their NUMA location. The overall result may
hide an imbalance within an individual NUMA node.

Consider the following case. Numa0 is idle because VMs on numa0
are not sending pkts, while numa1 is busy. Within numa1, pmd
workloads are not balanced. Moving 500 kpps of workload from
pmd 126 to pmd 62 will make numa1 much more balanced. For numa1
the variance improvement will be almost 100%, because after rebalance
each pmd in numa1 holds the same workload (variance ~= 0). But the
overall variance improvement is only about 20%, which may not trigger
auto_lb.

```
numa_id   core_id      kpps
      0        30         0
      0        31         0
      0        94         0
      0        95         0
      1       126      1500
      1       127      1000
      1        63      1000
      1        62       500
```

auto_lb doesn't balance workload across NUMA nodes, so it makes
more sense to calculate variance improvement per NUMA node.
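
The figures quoted above can be checked with a short Python
calculation over the kpps table. It assumes population variance and
an improvement defined as (current - estimated) / current; the exact
formula used by pmd_rebalance_dry_run() is not shown in this message.

```python
from statistics import pvariance


def improvement_pct(before, after):
    """Percentage reduction in variance between two load distributions."""
    current = pvariance(before)
    estimated = pvariance(after)
    if current == 0:
        return 0.0
    return (current - estimated) * 100 / current


# kpps per pmd, taken from the table above.
numa0 = [0, 0, 0, 0]
numa1_before = [1500, 1000, 1000, 500]
# After moving 500 kpps from pmd 126 to pmd 62.
numa1_after = [1000, 1000, 1000, 1000]

per_numa = improvement_pct(numa1_before, numa1_after)
overall = improvement_pct(numa0 + numa1_before, numa0 + numa1_after)
```

Under these assumptions the per-NUMA improvement for numa1 is 100%,
while the overall improvement across both nodes is 20%.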

Signed-off-by: Cheng Li <lic121@chinatelecom.cn>
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Co-authored-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-12-21 22:15:47 +01:00
Kevin Traynor
ad6e506fcb dpif-netdev: Rename pmd_info_show_rxq variables.
There are some similar readings taken for pmds and Rx queues
in this function and a few of the variable names are ambiguous.

Improve the readability of the code by updating some variables
names to indicate that they are readings related to the pmd.

Reviewed-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-12-21 20:58:30 +01:00
Kevin Traynor
526230bfab dpif-netdev: Make pmd-rxq-show time configurable.
pmd-rxq-show shows the Rx queue to pmd assignments as well as the
pmd usage of each Rx queue.

Up until now a tail length of 60 seconds pmd usage was shown
for each Rx queue, as this is the value used during rebalance
to avoid any spike effects.

When debugging or tuning, it is also convenient to display the
pmd usage of an Rx queue over a shorter time frame, so any changes
to config or traffic that impact pmd usage can be evaluated more quickly.

A parameter is added that allows the pmd-rxq-show pmd usage stats to
be shown for a shorter time frame. Values are rounded up to the
nearest 5 seconds as that is the measurement granularity and the value
used is displayed. e.g.

$ ovs-appctl dpif-netdev/pmd-rxq-show -secs 5
 Displaying last 5 seconds pmd usage %
 pmd thread numa_id 0 core_id 4:
   isolated : false
   port: dpdk0            queue-id:  0 (enabled)   pmd usage: 95 %
   overhead:  4 %

The default time frame has not changed and the maximum value
is limited to the maximum stored tail length (60 seconds).
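
The rounding and limit described above can be sketched as below. The
function name is hypothetical, and treating values below one interval
as a single 5 second interval is an assumption; the message only
specifies rounding up to 5 second steps and the 60 second maximum.

```python
INTERVAL_SECS = 5   # measurement granularity
MAX_SECS = 60       # maximum stored tail length


def effective_secs(requested):
    """Round up to a multiple of INTERVAL_SECS and cap at MAX_SECS."""
    rounded = -(-requested // INTERVAL_SECS) * INTERVAL_SECS  # ceiling
    return max(INTERVAL_SECS, min(rounded, MAX_SECS))


shown_for_5 = effective_secs(5)
shown_for_7 = effective_secs(7)
shown_for_90 = effective_secs(90)
```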

Reviewed-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-12-21 20:57:29 +01:00
Eelco Chaudron
c82f496c3b dpif-netdev: Use unmasked key when adding datapath flows.
The datapath supports installing wider flows, and OVS relies on
this behavior. For example if ipv4(src=1.1.1.1/192.0.0.0,
dst=1.1.1.2/192.0.0.0) exists, a wider flow (smaller mask) of
ipv4(src=192.1.1.1/128.0.0.0,dst=192.1.1.2/128.0.0.0) is allowed
to be added.

However, if we try to add a wildcard rule, the installation fails:

# ovs-appctl dpctl/add-flow system@myDP "in_port(1),eth_type(0x0800), \
  ipv4(src=1.1.1.1/192.0.0.0,dst=1.1.1.2/192.0.0.0,frag=no)" 2
# ovs-appctl dpctl/add-flow system@myDP "in_port(1),eth_type(0x0800), \
  ipv4(src=192.1.1.1/0.0.0.0,dst=49.1.1.2/0.0.0.0,frag=no)" 2
ovs-vswitchd: updating flow table (File exists)

The reason is that the key used to determine if the flow is already
present in the system uses the original key ANDed with the mask.
This results in the IP address not being part of the (miniflow) key,
i.e., being substituted with an all-zero value. When doing the actual
lookup, this results in the key wrongfully matching the first flow,
and therefore the flow does not get installed. The solution is to use
the unmasked key for the existence check, the same way this is handled
in the "slow" dpif_flow_put() case.
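
The false match can be reproduced with plain bit arithmetic. The
sketch below models only the IPv4 source field from the add-flow
example above; the helper names are made up and this is not the
miniflow code.

```python
import ipaddress


def ip(text):
    """IPv4 dotted quad to integer."""
    return int(ipaddress.IPv4Address(text))


def flow_matches(flow_key, flow_mask, lookup_key):
    """True if lookup_key hits a flow installed with flow_key/flow_mask."""
    mask = ip(flow_mask)
    return (ip(lookup_key) & mask) == (ip(flow_key) & mask)


# Existing flow: ipv4(src=1.1.1.1/192.0.0.0).
existing_key, existing_mask = "1.1.1.1", "192.0.0.0"

# New wildcard flow: ipv4(src=192.1.1.1/0.0.0.0).
new_key, new_mask = "192.1.1.1", "0.0.0.0"

# Old behaviour: the existence check uses the key ANDed with its own
# mask, which collapses the address to all zeros.
masked_key = str(ipaddress.IPv4Address(ip(new_key) & ip(new_mask)))

false_hit = flow_matches(existing_key, existing_mask, masked_key)
correct_miss = flow_matches(existing_key, existing_mask, new_key)
```

The masked key 0.0.0.0 matches the existing flow, which corresponds
to the "File exists" failure, while the unmasked key 192.1.1.1 does
not match it.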

OVS relies on the fact that overlapping flows can exist if one is a
superset of the other. Note that this is only true when the same set
of actions is applied. This is due to how the revalidator process
works. During revalidation, OVS removes too generic flows from the
datapath to avoid incorrect matches but allows too narrow flows to
stay in the datapath to avoid the data plane disruption and also to
avoid constant flow deletions if the datapath ignores wildcards on
certain fields/bits.  See flow_wildcards_has_extra() check in the
revalidate_ukey__() function.

The problem here is that we have a too narrow flow installed, and now
OpenFlow rules got changed, so the actual flow should be more generic.
Revalidators will not remove the narrow flow, and we will eventually get
an upcall on the packet that doesn't match the narrow flow, but we will
not be able to install a more generic flow because after masking with
the new wider mask, the key matches on the narrow flow, so we get EEXIST.

Fixes: beb75a40fdc2 ("userspace: Switching of L3 packets in L2 pipeline")
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-12-20 13:07:17 +01:00
Eli Britstein
76ab364ea8 netdev-offload: Set 'miss_api_supported' to be under netdev.
Cited commit introduced a flag in dpif-netdev level, to optimize
performance and avoid hw_miss_packet_recover() for devices with no such
support.
However, there is a race condition between traffic processing and
assigning a 'flow_api' object to the netdev. In such case, EOPNOTSUPP is
returned by netdev_hw_miss_packet_recover() in netdev-offload.c layer
because 'flow_api' is not yet initialized. As a result, the flag is
falsely disabled, and subsequent packets won't be recovered, though they
should.

In order to fix it, move the flag to be in netdev-offload layer, to
avoid that race.

Fixes: 6e50c1651869 ("dpif-netdev: Avoid hw_miss_packet_recover() for devices with no support.")
Signed-off-by: Eli Britstein <elibr@nvidia.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-10-25 21:35:51 +02:00
Gaetan Rivet
6edc278c85 conntrack: Use a cmap to store zone limits.
Change the data structure from hmap to cmap for zone limits.
As they are shared amongst multiple conntrack users, multiple
readers want to check the current zone limit state before progressing in
their processing. Using a CMAP allows doing lookups without taking the
global 'ct_lock', thus reducing contention.

Signed-off-by: Gaetan Rivet <grive@u256.net>
Reviewed-by: Eli Britstein <elibr@nvidia.com>
Signed-off-by: Paolo Valerio <pvalerio@redhat.com>
Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-07-12 20:44:46 +02:00
Cian Ferriter
dfff8b67b2 dpif-netdev: Refactor simple match lookup functions.
Make the simple match functions used during lookup non-static to allow
reuse of these functions in the AVX512 DPIF.

Signed-off-by: Cian Ferriter <cian.ferriter@intel.com>
Tested-by: Harry van Haaren <harry.van.haaren@intel.com>
Acked-by: Sunil Pai G <sunil.pai.g@intel.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2022-07-12 13:31:24 +01:00
Ilya Maximets
603bc853fb dpif-netdev: Fix leak of AVX512 DPIF scratch pad.
dp_netdev_input_outer_avx512 allocates a 16KB scratch pad per PMD
thread, but it's never freed.  This may cause significant memory
drain in dynamic environments.

  ==4068109==ERROR: LeakSanitizer: detected memory leaks

  Direct leak of 38656 byte(s) in 2 object(s) allocated from:
    #0 0xf069fd in posix_memalign (vswitchd/ovs-vswitchd+0xf069fd)
    #1 0x1d7ed14 in xmalloc_size_align lib/util.c:254:13
    #2 0x1d7ed14 in xmalloc_pagealign lib/util.c:352:12
    #3 0x2098254 in dp_netdev_input_outer_avx512 lib/dpif-netdev-avx512.c:69:17
    #4 0x191591a in dp_netdev_process_rxq_port lib/dpif-netdev.c:5332:19
    #5 0x190a961 in pmd_thread_main lib/dpif-netdev.c:6963:17
    #6 0x1c4b69a in ovsthread_wrapper lib/ovs-thread.c:422:12
    #7 0x7fd5ea6f1179 in start_thread pthread_create.c

  SUMMARY: AddressSanitizer: 38656 byte(s) leaked in 2 allocation(s).

Fixes: 9ac84a1a3698 ("dpif-avx512: Add ISA implementation of dpif.")
Reviewed-by: David Marchand <david.marchand@redhat.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Kumar Amber <kumar.amber@intel.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-06-29 23:28:44 +02:00
Harry van Haaren
751d05b474 dpcls: Add unlisted alias for subtable lookup command.
This patch adds the old name "subtable-lookup-prio-get" as an unlisted command,
to restore consistency between OVS releases for testing scripts.

Fixes: 738c76a503f4 ("dpcls: Change info-get function to fetch dpcls usage stats.")
Suggested-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-06-28 13:32:50 +02:00
Lin Huang
ba462b3589 dpif-netdev: Fix ALB 'rebalance_intvl' max hard limit.
Currently the pmd-auto-lb-rebal-interval value is not checked
properly.

It may be negative or too big (>2 weeks between rebalances), either of
which leads to a big unsigned value. So reset it to the default
if the value exceeds the max permitted as described in vswitchd.xml.
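
A minimal sketch of that check follows. The function name and the two
constants are hypothetical, and the numeric default and maximum are
assumptions made for illustration; the authoritative limits are the ones
documented in vswitchd.xml. It shows why a single upper-bound comparison
on the unsigned value catches both negative and oversized inputs.

```c
#include <stdint.h>

/* Assumed values, for illustration only. */
#define TOY_REBAL_INTERVAL_DEFAULT 1      /* Minutes. */
#define TOY_REBAL_INTERVAL_MAX     20000  /* Minutes, roughly two weeks. */

/* Converts a signed configuration value to the unsigned interval that is
 * stored, falling back to the default when it is out of range.  A negative
 * input wraps around to a very large unsigned value, so the comparison
 * against the maximum rejects it as well. */
static uint64_t
toy_sanitize_rebal_interval(int requested)
{
    uint64_t interval = (uint64_t) requested;

    if (interval > TOY_REBAL_INTERVAL_MAX) {
        interval = TOY_REBAL_INTERVAL_DEFAULT;
    }
    return interval;
}
```

In this sketch both toy_sanitize_rebal_interval(-1) and
toy_sanitize_rebal_interval(20001) fall back to the assumed default, while
values within the range are returned unchanged.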

Fixes: 5bf84282482a ("Adding support for PMD auto load balancing")
Signed-off-by: Lin Huang <linhuang@ruijie.com.cn>
Acked-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-05-30 23:28:22 +02:00
Lin Huang
83c0a36472 dpif-netdev: Fix ALB parameters type mismatch.
The ALB parameters should never be negative.
So use the unsigned smap_get versions to read them properly, and
update the VLOG formatting.

Fixes: 5bf84282482a ("Adding support for PMD auto load balancing")
Signed-off-by: Lin Huang <linhuang@ruijie.com.cn>
Acked-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-05-30 23:28:21 +02:00
Kevin Traynor
3ecfaf1361 dpif-netdev: Restructure rxq schedule logging.
Previously, logging about rxq scheduling was done in a code branch with
the selection of the PMD thread core after checking that a numa was
selected.

Splitting the logging out from the PMD thread core selection simplifies
the code and makes it easier to extend with future additions.

Also, minor updates to a couple of variables to improve readability and
fix a log indent while working on this code block.

There is no user visible change in behaviour or logs.

Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-05-30 22:45:51 +02:00
Kevin Traynor
37ccbd9c9d dpif-netdev: Split function to find lowest loaded PMD thread core.
This splits up the looping through each PMD thread core on a numa node
with the check to compare cycles or rxqs.

This is done so in future the compare could be reused with any group
of PMD thread cores.

There is no user visible change in behaviour.

Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-05-30 22:45:51 +02:00
Kumar Amber
738c76a503 dpcls: Change info-get function to fetch dpcls usage stats.
Modified the dpcls info-get command output to include
the count for different dpcls implementations.

$ ovs-appctl dpif-netdev/subtable-lookup-info-get

Available dpcls implementations:
  autovalidator (Use count: 1, Priority: 5)
  generic (Use count: 0, Priority: 1)
  avx512_gather (Use count: 0, Priority: 3)

Test case to verify changes:
        1061: PMD - dpcls configuration     ok

Signed-off-by: Kumar Amber <kumar.amber@intel.com>
Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Co-authored-by: Harry van Haaren <harry.van.haaren@intel.com>
Co-authored-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2022-05-24 09:53:18 +01:00
Cian Ferriter
5ec5473304 dpif-netdev: Only hash port number when necessary.
The hash of the port number is only needed when a DPCLS needs to be
created. Move the hash calculation inside the if to accomplish this.

Signed-off-by: Cian Ferriter <cian.ferriter@intel.com>
Acked-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-05-17 23:10:41 +02:00
Rosemarie O'Riorden
7e7083cc46 dpif-netdev: Replace loop iterating over packet batch with macro.
The function dp_netdev_pmd_flush_output_on_port() iterates over the
p->output_pkts batch directly, when it should be using the special
iterator macro, DP_PACKET_BATCH_FOR_EACH.

However, this wasn't possible because the macro could not accept
&p->output_pkts.

The addition of parentheses when BATCH is dereferenced allows the macro
to expand properly. Parenthesizing arguments in macros is good practice
to be able to handle whichever expressions are passed in.
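
The effect of the parentheses can be seen in a small sketch. The types
and the macro below (toy_batch, toy_port, TOY_BATCH_FOR_EACH) are
hypothetical stand-ins and are not the real DP_PACKET_BATCH_FOR_EACH
definition; they only assume a batch that is embedded in a port structure
rather than referenced by pointer.

```c
#include <stddef.h>

#define TOY_BATCH_MAX 32

struct toy_batch {
    size_t count;
    int items[TOY_BATCH_MAX];
};

struct toy_port {
    struct toy_batch output_pkts;   /* Embedded struct, not a pointer. */
};

/* Iterates over every item of BATCH.  Without the parentheses around
 * BATCH, passing '&p->output_pkts' would expand to
 * '&p->output_pkts->count', which parses as '&(p->output_pkts->count)'
 * and fails to compile because 'output_pkts' is not a pointer. */
#define TOY_BATCH_FOR_EACH(IDX, ITEM, BATCH)                            \
    for (size_t IDX = 0;                                                \
         IDX < (BATCH)->count && ((ITEM) = (BATCH)->items[IDX], 1);     \
         IDX++)

/* Sums all items queued on the port, using the iterator macro with an
 * address-of expression as the batch argument. */
static int
toy_port_sum(struct toy_port *p)
{
    int item;
    int sum = 0;

    TOY_BATCH_FOR_EACH (i, item, &p->output_pkts) {
        sum += item;
    }
    return sum;
}
```

With the parenthesized form, the macro accepts a plain pointer variable
and an expression such as '&p->output_pkts' alike.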

Signed-off-by: Rosemarie O'Riorden <roriorden@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-05-04 21:18:08 +02:00
Eelco Chaudron
9a67d883dc dpif-netdev: Fix dp_netdev_get_pmd() function getting correct core_id.
The dp_netdev_get_pmd() function is using only the hash of the core_id
to get the pmd structure. So in case of hash collisions, the wrong pmd
is returned.

This patch is fixing this by checking for the correct core_id using
the CMAP_FOR_EACH_WITH_HASH macro.
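
The idea can be sketched with a toy chained hash map. Everything below
(toy_pmd, toy_map, toy_hash, toy_get_pmd) is hypothetical and much
simpler than the real cmap; the hash is deliberately weak so that
collisions are easy to produce. The point is only that a lookup has to
compare the stored core_id, not just land in the right bucket.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_N_BUCKETS 4

struct toy_pmd {
    unsigned core_id;
    struct toy_pmd *next;           /* Next entry with the same hash. */
};

struct toy_map {
    struct toy_pmd *buckets[TOY_N_BUCKETS];
};

/* Deliberately weak hash: core ids 1, 5, 9, ... all collide. */
static uint32_t
toy_hash(unsigned core_id)
{
    return core_id % TOY_N_BUCKETS;
}

/* Inserts 'pmd' at the head of its bucket. */
static void
toy_map_insert(struct toy_map *map, struct toy_pmd *pmd)
{
    uint32_t hash = toy_hash(pmd->core_id);

    pmd->next = map->buckets[hash];
    map->buckets[hash] = pmd;
}

/* Walks every entry with a matching hash and returns only the one whose
 * core_id is equal to the requested one, or NULL if there is none.
 * Returning the first entry of the bucket instead would hand back the
 * wrong pmd whenever two core ids collide. */
static struct toy_pmd *
toy_get_pmd(const struct toy_map *map, unsigned core_id)
{
    struct toy_pmd *pmd;

    for (pmd = map->buckets[toy_hash(core_id)]; pmd; pmd = pmd->next) {
        if (pmd->core_id == core_id) {
            return pmd;
        }
    }
    return NULL;
}
```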

Fixes: 65f13b50c5aa ("dpif-netdev: Create multiple pmd threads by default.")
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Reviewed-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-04-04 22:52:12 +02:00
Kevin Traynor
c591827ec0 dpif-netdev: Fix PMD auto load balance with pmd-rxq-isolate.
There are currently some checks for cross-numa polling cases to
ensure that they won't affect the accuracy of the PMD ALB.

If an rxq is pinned to a PMD thread core by the user it will not
be reassigned by OVS, so even if it is non-local numa polled it
will not impact PMD ALB accuracy.

To establish this, a check was made on whether the PMD thread core
was isolated or not. However, since other_config:pmd-rxq-isolate was
introduced, rxqs may be pinned but the PMD thread core not isolated.

It means that by setting pmd-rxq-isolate=false and doing non-local
numa pinning, PMD ALB may not run where it should.

If the PMD thread core is isolated we can skip individual rxq checks
but if not, we should check the individual rxqs for pinning before we
disallow PMD ALB.

Also, update function comments to make its operation clearer.

Fixes: 6193e03267c1 ("dpif-netdev: Allow pin rxq and non-isolate PMD.")
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: Mike Pattrick <mkp@redhat.com>
Acked-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-04-04 22:52:12 +02:00
Kevin Traynor
da6ce41d80 dpif-netdev: Fix non-local numa selection for more than two numas.
This issue only occurs when there are more than 2 numa nodes
and no local numa PMD thread cores available for an interface
rxq.

In the event of no PMD thread cores available on the local numa
for an rxq to be assigned to, a PMD thread core from a non-local
numa is selected.

If there is more than one non-local numa with PMD thread cores,
they are iterated round-robin (RR) and checked for non-isolated PMD
thread cores.

When successfully finding a non-local numa with available PMD
thread cores for an rxq, that numa was not being stored. It meant
if a similar situation occurred for a subsequent rxq, the same numa
would be selected again.

Store the last numa used when successfully finding a non-local numa
with available PMD thread cores, so the numa RR state is kept for subsequent
rxqs.
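
A minimal sketch of keeping the round-robin state follows. The
structures and the function (toy_numa, toy_numa_list,
toy_next_non_local_numa) are hypothetical and assume the list holds only
the non-local numas; the real scheduling code is more involved.

```c
#include <stdbool.h>
#include <stddef.h>

struct toy_numa {
    int id;
    bool has_free_pmd;      /* Has a non-isolated PMD thread core. */
};

struct toy_numa_list {
    struct toy_numa *numas; /* Non-local numas only, in this sketch. */
    size_t n;
    size_t last_used;       /* Index of the numa chosen last time. */
};

/* Returns the next numa after the last used one that has an available
 * PMD thread core, or NULL if there is none.  Storing the chosen index
 * is what keeps the round-robin moving for subsequent rxqs; without it
 * every call would start from the same place and pick the same numa. */
static struct toy_numa *
toy_next_non_local_numa(struct toy_numa_list *list)
{
    for (size_t step = 1; step <= list->n; step++) {
        size_t idx = (list->last_used + step) % list->n;

        if (list->numas[idx].has_free_pmd) {
            list->last_used = idx;
            return &list->numas[idx];
        }
    }
    return NULL;
}
```

With three numas where only the first and third have available cores,
successive calls alternate between those two instead of returning the
same one each time.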

Fixes: f577c2d046b2 ("dpif-netdev: Rework rxq scheduling code.")
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: Mike Pattrick <mkp@redhat.com>
Acked-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-04-04 22:52:12 +02:00
Kevin Traynor
4b5c3b66aa dpif-netdev: Fix typo in function name.
Rename pmd_reblance_dry_run_needed() to
pmd_rebalance_dry_run_needed().

Fixes: a83a406096e9 ("dpif-netdev: Sync PMD ALB state with user commands.")
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: Mike Pattrick <mkp@redhat.com>
Acked-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-04-04 22:52:12 +02:00