mirror of https://github.com/openvswitch/ovs synced 2025-08-22 18:07:40 +00:00

440 Commits

Author SHA1 Message Date
David Marchand
dd443c1a7a netdev-dpdk: Stop relying on vhost-user Tx flags.
vhost-user legacy behavior has been to mark mbuf with Tx offload flags
based on what the virtio-net header contained (but provide no
Rx information, like IP checksum or L4 checksum validity).

Changing to the non-legacy mode means that no code outside of OVS should
set any RTE_MBUF_F_TX_* flag. Add a check accordingly.

Link: https://git.dpdk.org/dpdk/commit/?id=ca7036b4af3a
Reported-at: https://issues.redhat.com/browse/FDP-1147
Signed-off-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2025-06-19 21:03:15 +02:00
David Marchand
cf7b86db1f dp-packet: Rework TCP segmentation.
Rather than marking packets with both an offload flag and a segmentation size,
simply rely on the netdev implementation which sets a segmentation size
when appropriate.

Signed-off-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2025-06-19 21:03:09 +02:00
David Marchand
2956a61265 dp-packet: Rework L4 checksum offloads.
The DPDK mbuf API specifies 4 status when it comes to L4 checksums:
- RTE_MBUF_F_RX_L4_CKSUM_UNKNOWN: no information about the RX L4 checksum
- RTE_MBUF_F_RX_L4_CKSUM_BAD: the L4 checksum in the packet is wrong
- RTE_MBUF_F_RX_L4_CKSUM_GOOD: the L4 checksum in the packet is valid
- RTE_MBUF_F_RX_L4_CKSUM_NONE: the L4 checksum is not correct in the packet
  data, but the integrity of the L4 data is verified.

Similarly to the IP checksum offloads API, revise OVS L4 offloads API.

No information about the L4 protocol is provided by any netdev-*
implementation, so OVS needs to mark this L4 protocol during flow
extraction.

Rename current API for consistency with dp_packet_(inner_)?l4_checksum_.

Signed-off-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2025-06-19 21:02:56 +02:00
David Marchand
3daf04a4c5 dp-packet: Rework IP checksum offloads.
As the packet traverses through OVS, offloading Tx flags must be carefully
evaluated and updated which results in a bit of complexity because of a
separate "outer" Tx offloading flag coming from DPDK API,
and a "normal"/"inner" Tx offloading flag.

On the other hand, the DPDK mbuf API specifies 4 status when it comes to
IP checksums:
- RTE_MBUF_F_RX_IP_CKSUM_UNKNOWN: no information about the RX IP checksum
- RTE_MBUF_F_RX_IP_CKSUM_BAD: the IP checksum in the packet is wrong
- RTE_MBUF_F_RX_IP_CKSUM_GOOD: the IP checksum in the packet is valid
- RTE_MBUF_F_RX_IP_CKSUM_NONE: the IP checksum is not correct in the
  packet data, but the integrity of the IP header is verified.

This patch changes the OVS API so that OVS code only tracks the status of
the checksum of the "current" L3 header and leaves the Tx flags aspect to
the netdev-* implementations.

With this API, the flow extraction can be cleaned up.

During packet processing, OVS can simply look for the IP checksum validity
(either good, or partial) before changing some IP header, and then mark
the checksum as partial.

In the conntrack case, when natting packets, the checksum status of the
inner part (ICMP error case) must be forced temporarily as unknown
to force checksum resolution.

When tunneling comes into play, IP checksums status is bit-shifted for
future considerations in the processing if, for example, the tunnel
header gets decapsulated again, or in the netdev-* implementations that
support tunnel offloading.

Finally, netdev-* implementations only need to care about packets in
partial status: a good checksum does not need touching, a bad checksum
has been updated but kept as bad by OVS, and an unknown checksum belongs
either to an IPv6 packet or to an IPv4 packet whose checksum OVS updated
too (keeping it good or bad accordingly).

Rename current API for consistency with dp_packet_(inner_)?ip_checksum_.

Signed-off-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2025-06-19 21:00:54 +02:00
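The four-state model described in this commit can be sketched as a small state transition. This is an illustration only: the enum and function names are hypothetical, not the OVS or DPDK identifiers, and "partial" stands for the DPDK "NONE" status (checksum not correct in packet data, header integrity verified).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names; they only mirror the four statuses listed above. */
enum ex_csum_status {
    EX_CSUM_UNKNOWN,   /* No information about the checksum. */
    EX_CSUM_BAD,       /* Checksum in the packet data is wrong. */
    EX_CSUM_GOOD,      /* Checksum in the packet data is valid. */
    EX_CSUM_PARTIAL    /* Not correct in packet data, integrity verified. */
};

/* Status of the IP checksum after OVS rewrites a field of the IP header.
 * Good and partial become partial (to be resolved at Tx time).  Bad and
 * unknown keep their status; in those cases the checksum bytes would be
 * updated in software alongside the header change. */
static enum ex_csum_status
ex_csum_after_header_change(enum ex_csum_status status)
{
    switch (status) {
    case EX_CSUM_GOOD:
    case EX_CSUM_PARTIAL:
        return EX_CSUM_PARTIAL;
    default:
        assert(status == EX_CSUM_BAD || status == EX_CSUM_UNKNOWN);
        return status;
    }
}

/* A netdev implementation only has to act on packets in partial status. */
static bool
ex_netdev_must_fill_csum(enum ex_csum_status status)
{
    return status == EX_CSUM_PARTIAL;
}
```

With this shape, the Tx path needs a single comparison per packet instead of reasoning about separate "outer" and "inner" Tx flags.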
David Marchand
67abd51540 dp-packet: Rework tunnel offloads.
Rather than set bits in the mbuf ol_flags field, that only makes sense
for netdev-dpdk ports, mark packet for tunnel offload in OVS offloads
API.

While at it, since there is nothing really "hardware" related, rename
current API for consistency with dp_packet_tunnel_ prefix.

Signed-off-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2025-06-19 21:00:48 +02:00
David Marchand
d29ba0abdc dp-packet: Add OVS offloading API.
As a preparation for tracking inner checksums, separate Rx checksum
status from the DPDK ol_flags field.
To minimize the cost of translating from DPDK API to OVS API, simply map
OVS flags to DPDK Rx mbuf flags.

Signed-off-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2025-06-19 21:00:34 +02:00
David Marchand
19ef1b1f0f dp-packet: Remove DPDK specific IP version.
Flagging packets with IP version is only needed at the netdev-dpdk level.

In most cases, OVS is already inspecting the IP header in packet data,
so maintaining such IP version metadata won't save much cycles
(given the cost of additional branches necessary for handling
outer/inner flags).

Cleanup OVS shared code and only set these flags in netdev-dpdk.c.

Signed-off-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2025-06-19 20:59:22 +02:00
Roi Dayan
b42f9fde4a netdev-dpdk: Fix possible memory leak in vhost stats.
On an error condition, the allocated structs need to be released.

Reported by Coverity.

Fixes: 3b29286db1c5 ("netdev-dpdk: Add per virtqueue statistics.")
Signed-off-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
2025-05-30 14:22:23 +01:00
Jay Ding
6f33ac6321 netdev-dpdk: Fix device info return value check.
rte_eth_dev_info_get() could fail due to device reset, etc.

The return value should be checked before the device info
pointer is dereferenced.

Fixes: 2f196c80e716 ("netdev-dpdk: Use LSC interrupt mode.")
Signed-off-by: Jay Ding <jay.ding@broadcom.com>
Co-Authored-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: Mike Pattrick <mkp@redhat.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
2025-04-14 13:38:56 +01:00
Mike Pattrick
2276c3a2c6 userspace: Support GRE TSO.
This patch extends the userspace datapath's support for tunnel TSO from
only supporting VxLAN and Geneve to also supporting GRE tunnels. There
is also a software fallback for cases where the egress netdev does not
support this feature.

Reviewed-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2025-01-17 00:20:48 +01:00
Maxime Coquelin
a24413cd3e netdev-dpdk: Set vhost port maximum number of queue pairs.
This patch uses the new rte_vhost_driver_set_max_queue_num
API to set the maximum number of queue pairs supported by
the vhost-user port.

This is required for VDUSE which needs to specify the
maximum number of queue pairs at creation time. Without it
128 queue pairs metadata would be allocated.

To configure it, a new 'vhost-max-queue-pairs' option is
introduced.

Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
2025-01-10 15:38:28 +00:00
David Marchand
af292d273f netdev-dpdk: Restore outer UDP checksum for Intel nics.
Fixes for Intel drivers are included in DPDK v23.11.2.

Link: https://git.dpdk.org/dpdk-stable/commit/?id=e8c2cccfbdef
Link: https://git.dpdk.org/dpdk-stable/commit/?id=1970a0ca45f1
Link: https://git.dpdk.org/dpdk-stable/commit/?id=80c5c9789b73
Fixes: 0256ee64ed39 ("dpdk: Use DPDK 23.11.2 release.")
Signed-off-by: David Marchand <david.marchand@redhat.com>
Acked-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
2024-12-12 17:02:07 +00:00
David Marchand
ba5a1536cd netdev-dpdk: Check error for device info and link status queries.
Since DPDK v19.11, a couple of ethdev API have been reporting errors in
case of invalid port id or other error conditions in drivers.
So far, OVS did not check for those error cases.

Starting v24.11 future release, the ethdev API warns for unchecked
returned values, so let's prepare for this.

Link: https://git.dpdk.org/dpdk/commit/?id=4f25d7d2252f
Link: https://git.dpdk.org/dpdk/commit/?id=4633c3b2ebf2
Link: https://git.dpdk.org/dpdk/commit/?id=1ff8b9a6ef24
Signed-off-by: David Marchand <david.marchand@redhat.com>
Acked-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
2024-11-27 15:55:53 +00:00
David Marchand
7383f0e1bf netdev-dpdk: Cache representor flag at init.
No need to query device info during the life of a port for checking if
this port is a representor.
This capacity is decided at the ethdev port creation in DPDK and
OVS can simply store this info during dpdk_eth_dev_init().

Signed-off-by: David Marchand <david.marchand@redhat.com>
Acked-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
2024-11-27 15:55:46 +00:00
David Marchand
6204d3837c netdev-dpdk: Cache device info during port configuration.
No need to query device info twice while configuring a port.
Simply pass the rte_eth_dev_info object.

Signed-off-by: David Marchand <david.marchand@redhat.com>
Acked-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
2024-11-27 10:54:21 +00:00
David Marchand
d4b222bb66 netdev-dpdk: Stop configuring after device init failure.
Caught by code review.

If dpdk_eth_dev_init() fails, no need to continue and try to initialise
other features for this port.
Plus, err may get overwritten later (like if some rss steering is
configured) which could result in non consistent error codes.

Signed-off-by: David Marchand <david.marchand@redhat.com>
Acked-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
2024-11-27 10:54:21 +00:00
Jun Wang
2bf609f70b netdev-dpdk: Disable outer udp checksum offload for txgbe driver.
Fixing the issue of incorrect outer UDP checksum in packets sent by
the wangxun network card (driver is txgbe), we disabled
RTE_ETH_TX_OFFLOAD_OUTER_UDP_CKSUM.

Fixes: 084c8087292c ("userspace: Support VXLAN and GENEVE TSO.")
Reported-by: Jun Wang <junwang01@cestc.cn>

Acked-by: David Marchand <david.marchand@redhat.com>
Acked-by: Simon Horman <horms@ovn.org>
Signed-off-by: Jun Wang <junwang01@cestc.cn>
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
2024-09-20 12:01:05 +02:00
Mike Pattrick
bd48ff8f7d netdev-dpdk: Re-enable VXLAN/Geneve offload for Intel cards.
Previously, support for UDP-tunneled TCP traffic with UDP
checksum offloading did not work well in cases where the sending network
card didn't also support these features.

Some of the code had been written to assume that if a card supported
VXLAN/Geneve offloading, then it also supported outer UDP checksum
offloading. However, this was not the case for some Intel network cards.

A previous change disabled the VXLAN/Geneve offload flags for these
cards as a temporary fix. However, with "Userspace: Software fallback
for UDP encapsulated TCP segmentation.", the logic related to software
fallback for checksum offloading now anticipates this configuration.

The modification to the outer UDP offload flag is still required. This
feature does not work as expected in the current DPDK release.

Suggested-by: David Marchand <david.marchand@redhat.com>
Reviewed-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
2024-09-11 15:37:17 +02:00
Mike Pattrick
9f0c6e16e3 netdev-dpdk: Fix race condition in mempool information dump.
Currently it is possible to call netdev-dpdk/get-mempool-info before a
mempool has been created. This can happen because a device is added to
the netdev_shash before a mempool is allocated for it, which results in
a segmentation fault.

Now we check for a NULL value before attempting to dereference it.

Fixes: be4817331071 ("netdev-dpdk: Add debug appctl to get mempool information.")
Signed-off-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2024-08-08 21:55:43 +02:00
Ilya Maximets
56e315937e vswitchd: Only lock pages that are faulted in.
The main purpose of locking the memory is to ensure that OVS can keep
doing what it did before in case of increased memory pressure, e.g.,
during VM ingest / migration.  Fulfilling this requirement can be
achieved without locking all the allocated memory, but only the pages
already accessed in the past (faulted in).  Processing of the new
traffic involves new memory allocations.  Latency on these operations
can't be guaranteed by the locking.  The main difference would be
the pre-faulting of the stack memory.  However, in order to revalidate
or process upcalls on the same traffic, the same amount of stack is
likely needed, so all the necessary memory will already be faulted in.

Switch 'mlockall' to MCL_ONFAULT to avoid consuming unnecessarily
large amounts of RAM on systems with high core counts.  For example,
in a densely populated OVN cluster this saves about 650 MB of RAM per
node on a system with 64 cores.  This equates to 320 GB of allocated
but unused RAM in a 500 node cluster.

This also makes OVS better suited by default for small systems with
limited amount of memory.

The MCL_ONFAULT flag was introduced in Linux kernel 4.4 and wasn't
available at the time of '--mlockall' introduction, but we can use it
now.  Falling back to an old way of locking in case we're running on
an older kernel just in case.

Only locking the faulted in pages also makes locking compatible with
vhost post-copy live migration by default, because we'll no longer
pre-fault all the guest's memory.  Post-copy relies on userfaultfd
to work on shared huge pages, which is only available in 4.11+ kernels.
So, technically, it should not be possible for MCL_ONFAULT to fail and
the call without it to succeed.  But keeping the check just in case
for now.

Acked-by: Simon Horman <horms@ovn.org>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2024-06-28 23:44:53 +02:00
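The "try MCL_ONFAULT, then fall back" logic from this commit can be sketched as below. This is not the OVS code: the function names are hypothetical, the locking call is passed in as a function pointer so that the fallback path can be exercised without privileges, and the fallback definition of MCL_ONFAULT is an assumption for old headers.

```c
#include <assert.h>
#include <errno.h>
#include <sys/mman.h>

#ifndef MCL_ONFAULT
#define MCL_ONFAULT 4   /* Assumption: generic Linux value, for old headers. */
#endif

/* Same signature as mlockall(2), so the real call can be passed directly. */
typedef int (*ex_lock_fn)(int flags);

/* Returns 0 if memory was locked on-fault, 1 if the older "lock everything"
 * mode had to be used, and -1 if both attempts failed. */
static int
ex_lock_memory(ex_lock_fn lock)
{
    assert(lock);
    if (!lock(MCL_CURRENT | MCL_FUTURE | MCL_ONFAULT)) {
        return 0;
    }
    if (!lock(MCL_CURRENT | MCL_FUTURE)) {
        return 1;
    }
    return -1;
}

/* Stand-in for a kernel without MCL_ONFAULT: rejects the unknown flag. */
static int
ex_old_kernel_lock(int flags)
{
    if (flags & MCL_ONFAULT) {
        errno = EINVAL;
        return -1;
    }
    return 0;
}

/* Stand-in for a kernel that accepts every flag combination. */
static int
ex_new_kernel_lock(int flags)
{
    (void) flags;
    return 0;
}
```

In a real process the call would be `ex_lock_memory(mlockall)`; whether it succeeds depends on privileges and the locked-memory resource limit.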
Kevin Traynor
639fcf2005 netdev-dpdk: Check pending reset when adding device.
When a device reset interrupt event (RTE_ETH_EVENT_INTR_RESET)
is detected for a DPDK device added to OVS, a device reset is
performed.

If a device reset interrupt event is detected for a device before
it is added to OVS, device reset is not called.

If that device is later attempted to be added to OVS, it may fail
while being configured if it is still pending a reset as pending
reset is not checked when adding a device.

A simple way to force a reset event from the ice driver for an
iavf device is to set the mac address after binding iavf dev to
vfio but before adding to OVS. (note: should not be set like this
in normal case). e.g.

$ echo 2 > /sys/class/net/ens3f0/device/sriov_numvfs
$ ./devbind.py -b vfio-pci 0000:d8:01.1
$ ip link set ens3f0 vf 1 mac 26:ab:e6:6f:79:4d
$ ovs-vsctl add-port br0 dpdk0 -- set Interface dpdk0 type=dpdk \
      options:dpdk-devargs=0000:d8:01.1

|dpdk|ERR|Port1 dev_configure = -1
|netdev_dpdk|WARN|Interface dpdk0 eth_dev setup error
    Operation not permitted
|netdev_dpdk|ERR|Interface dpdk0(rxq:1 txq:5 lsc interrupt mode:false)
    configure error: Operation not permitted
|dpif_netdev|ERR|Failed to set interface dpdk0 new configuration

Add a check if there was any previous device reset interrupt events
when a device is added to OVS. If there was, perform the reset
before continuing with the rest of the configuration.

netdev_dpdk_pending_reset[] already tracks device reset interrupt
events for all devices, so it can be reused to check if there is a
reset needed during configuration of newly added devices. By extending
its usage, dev->reset_needed is no longer needed.

Fixes: 3eb91a8d1b9a ("netdev-dpdk: Trigger port reconfiguration in main thread for resets.")
Reviewed-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2024-06-28 21:55:34 +02:00
David Marchand
2f196c80e7 netdev-dpdk: Use LSC interrupt mode.
Querying link status may get delayed for an indeterminate (long) time
with mlx5 ports. This is a consequence of the mlx5 driver calling ethtool
kernel API and getting stuck on the kernel RTNL lock while some other
operation is in progress under this lock.

One impact for long link status query is that it is called under the bond
lock taken in write mode periodically in bond_run().
In parallel, datapath threads may block requesting to read bonding related
info (like for example in bond_check_admissibility()).

The LSC interrupt mode is available with many DPDK drivers and is used by
default with testpmd.

It seems safe enough to switch on this feature by default in OVS.
We keep the per interface option to disable this feature in case of an
unforeseen bug.

Signed-off-by: David Marchand <david.marchand@redhat.com>
Reviewed-by: Robin Jarry <rjarry@redhat.com>
Acked-by: Mike Pattrick <mkp@redhat.com>
Acked-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Acked-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: Aaron Conole <aconole@redhat.com>
2024-06-24 12:46:11 +01:00
David Marchand
c39a84c131 netdev-dpdk: Refactor tunnel checksum offloading.
All information required for checksum offloading can be deduced by
already tracked dp_packet l3_ofs, l4_ofs, inner_l3_ofs and inner_l4_ofs
fields.
Remove DPDK specific l[2-4]_len from generic OVS code.

netdev-dpdk code then fills mbuf specifics step by step:
- outer_l2_len and outer_l3_len are needed for tunneling (and below
  features),
- l2_len and l3_len are needed for IP and L4 checksum (and below features),
- l4_len and tso_segsz are needed when doing TSO,

Signed-off-by: David Marchand <david.marchand@redhat.com>
Acked-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
2024-06-06 17:10:29 +01:00
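The derivation of mbuf lengths from packet offsets can be sketched as follows. The struct and function names are hypothetical, and the sketch assumes the usual DPDK convention for tunneled packets, where l2_len covers everything between the outer L3 header and the inner L3 header (outer L4 header, tunnel header and inner Ethernet header).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mirror of the offsets tracked per packet. */
struct ex_offsets {
    uint16_t l3_ofs;         /* Start of the outer L3 header. */
    uint16_t l4_ofs;         /* Start of the outer L4 header. */
    uint16_t inner_l3_ofs;   /* Start of the inner L3 header. */
    uint16_t inner_l4_ofs;   /* Start of the inner L4 header. */
};

/* Hypothetical mirror of the mbuf length fields named above. */
struct ex_mbuf_lens {
    uint16_t outer_l2_len;
    uint16_t outer_l3_len;
    uint16_t l2_len;
    uint16_t l3_len;
};

static struct ex_mbuf_lens
ex_tunnel_lens(const struct ex_offsets *o)
{
    struct ex_mbuf_lens m;

    assert(o->l3_ofs <= o->l4_ofs);
    assert(o->l4_ofs <= o->inner_l3_ofs);
    assert(o->inner_l3_ofs <= o->inner_l4_ofs);

    m.outer_l2_len = o->l3_ofs;
    m.outer_l3_len = (uint16_t) (o->l4_ofs - o->l3_ofs);
    m.l2_len = (uint16_t) (o->inner_l3_ofs - o->l4_ofs);
    m.l3_len = (uint16_t) (o->inner_l4_ofs - o->inner_l3_ofs);
    return m;
}

/* Sample: Ethernet (14) / IPv4 (20) / UDP (8) / VXLAN (8) /
 * Ethernet (14) / IPv4 (20) / TCP. */
static struct ex_offsets
ex_vxlan_ipv4_sample(void)
{
    struct ex_offsets o = { 14, 34, 64, 84 };
    return o;
}
```

For the sample packet this gives outer_l2_len = 14, outer_l3_len = 20, l2_len = 30 (8 + 8 + 14) and l3_len = 20.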
David Marchand
844a7cfa6e netdev-dpdk: Use guest TSO segmentation size hint.
In a typical setup like:
guest A <-virtio-> OVS A <-vxlan-> OVS B <-virtio-> guest B

TSO packets from guest A are segmented against the OVS A physical port
mtu adjusted by the vxlan tunnel header size, regardless of guest A
interface mtu.

As an example, let's say guest A and guest B mtu are set to 1500 bytes.
OVS A and OVS B physical ports mtu are set to 1600 bytes.
Guest A will request TCP segmentation for 1448 bytes segments.
On the other hand, OVS A will request 1498 bytes segments to the HW.
This results in OVS B dropping packets because decapsulated packets
are larger than the vhost-user port (serving guest B) mtu.

2024-04-17T14:13:01.239Z|00002|netdev_dpdk(pmd-c03/id:7)|WARN|vhost0:
	Too big size 1564 max_packet_len 1518

vhost-user ports expose a guest mtu by filling mbuf->tso_segsz.
Use it as a hint.

This may result in segments (on the wire) slightly shorter than the
optimal size.

Reported-at: https://github.com/openvswitch/ovs-issues/issues/321
Signed-off-by: David Marchand <david.marchand@redhat.com>
Acked-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
2024-06-06 17:10:11 +01:00
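The numbers in this example can be reproduced with a short calculation. The header sizes are assumptions chosen to match the figures quoted: IPv4 without options (20 bytes), TCP with the 12-byte timestamp option (32 bytes), and 50 bytes of VXLAN overhead (outer IPv4, UDP, VXLAN, inner Ethernet). Taking the smaller of the guest hint and the MTU-derived size is one plausible way to use the hint; the actual OVS logic may differ.

```c
#include <assert.h>
#include <stdint.h>

enum {
    EX_IPV4_HDR = 20,        /* Assumption: no IP options. */
    EX_TCP_HDR_TS = 32,      /* Assumption: TCP header + timestamp option. */
    EX_VXLAN_OVERHEAD = 50   /* Outer IPv4 + UDP + VXLAN + inner Ethernet. */
};

/* Largest TCP payload per segment that fits in 'mtu' once the tunnel
 * overhead and the inner IP/TCP headers are subtracted. */
static uint16_t
ex_mtu_segsz(uint16_t mtu, uint16_t tnl_overhead)
{
    assert(mtu > tnl_overhead + EX_IPV4_HDR + EX_TCP_HDR_TS);
    return (uint16_t) (mtu - tnl_overhead - EX_IPV4_HDR - EX_TCP_HDR_TS);
}

/* Uses the guest-provided segment size when it is set and smaller than
 * what the egress port MTU would allow. */
static uint16_t
ex_pick_segsz(uint16_t guest_hint, uint16_t mtu_based)
{
    if (guest_hint && guest_hint < mtu_based) {
        return guest_hint;
    }
    return mtu_based;
}
```

Here `ex_mtu_segsz(1500, 0)` is the guest's 1448, `ex_mtu_segsz(1600, EX_VXLAN_OVERHEAD)` is the 1498 that OVS A would request from the hardware, and `ex_pick_segsz(1448, 1498)` keeps the guest's 1448 so that decapsulated packets still fit guest B's MTU.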
David Marchand
d618d09173 netdev-dpdk: Refactor TSO request code.
Every L3, L4 checksum offload or TSO requires a (outer) L3 length to be
provided.
This length is computed via dp_packet_l4(pkt) that is always set when
such offloads are requested in OVS.
Getting a th == NULL is a bug in OVS, so an assert() is more appropriate.

Besides, filling l4_len and tso_segsz only matters to TSO, so there is
no need to set it for other L4 checksum offloading requests.

Signed-off-by: David Marchand <david.marchand@redhat.com>
Acked-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
2024-06-06 17:10:05 +01:00
David Marchand
3d2c8223ab netdev-dpdk: Fix inner checksum when outer is not supported.
If outer checksum is not supported and OVS already set L3/L4 outer
checksums in the packet, no outer mark should be left in ol_flags
(as it confuses some driver, like net/ixgbe).

l2_len must be adjusted to account for the tunnel header.

Fixes: 084c8087292c ("userspace: Support VXLAN and GENEVE TSO.")
Signed-off-by: David Marchand <david.marchand@redhat.com>
Acked-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
2024-06-06 17:09:58 +01:00
David Marchand
29abd07e4f netdev-dpdk: Disable outer UDP checksum for net/iavf.
Same as the commit 6f93d8e62f13 ("netdev-dpdk: Disable outer UDP checksum
offload for ice/i40e driver."), disable outer UDP checksum and related
offloads for net/iavf.

Fixes: 084c8087292c ("userspace: Support VXLAN and GENEVE TSO.")
Signed-off-by: David Marchand <david.marchand@redhat.com>
Acked-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
2024-06-06 17:09:52 +01:00
David Marchand
041d6adeda netdev-dpdk: Fallback to non tunnel checksum offloading.
The outer checksum offloading API in DPDK is ambiguous and was
implemented by Intel folks in their drivers with the assumption that
any outer offloading always goes with an inner offloading request.

With net/i40e and net/ice drivers, in the case of encapsulating an ARP
packet in a vxlan tunnel (which results in requesting outer ip checksum
with a tunnel context but no inner offloading request), a Tx failure is
triggered, associated with a port MDD event.
2024-03-27T16:02:07.084Z|00018|dpdk|WARN|ice_interrupt_handler(): OICR:
	MDD event

To avoid this situation, if no checksum or segmentation offloading is
requested on the inner part of a packet, fallback to "normal" (non outer)
offloading request.

Reported-at: https://github.com/openvswitch/ovs-issues/issues/321
Fixes: 084c8087292c ("userspace: Support VXLAN and GENEVE TSO.")
Fixes: f81d782c1906 ("netdev-native-tnl: Mark all vxlan/geneve packets as tunneled.")
Signed-off-by: David Marchand <david.marchand@redhat.com>
Acked-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
2024-06-06 17:09:37 +01:00
Roi Dayan
fb46f5d29a netdev-dpdk: Improve error print to the user for flow control error.
When failing to get flow control parameters use VLOG_WARN_BUF()
to expose the error string in ovs-vsctl show.

Signed-off-by: Roi Dayan <roid@nvidia.com>
Suggested-by: Simon Horman <horms@ovn.org>
Acked-by: Eli Britstein <elibr@nvidia.com>
Signed-off-by: Simon Horman <horms@ovn.org>
2024-04-26 10:06:34 +01:00
Roi Dayan via dev
4f29804f24 netdev-dpdk: Fix possible memory leak configuring VF MAC address.
VLOG_WARN_BUF() is allocating memory for the error string and should
be used if the configuration cannot continue and error is being returned
so the caller has indication of releasing the pointer.
Change to VLOG_WARN() to keep the logic that error is not being
returned.

Fixes: f4336f504b17 ("netdev-dpdk: Add option to configure VF MAC address.")
Signed-off-by: Roi Dayan <roid@nvidia.com>
Acked-by: Gaetan Rivet <gaetanr@nvidia.com>
Acked-by: Eli Britstein <elibr@nvidia.com>
Signed-off-by: Simon Horman <horms@ovn.org>
2024-04-23 11:27:09 +01:00
Jun Wang
6f93d8e62f netdev-dpdk: Disable outer UDP checksum offload for ice/i40e driver.
Fixing the issue of incorrect outer UDP checksum in packets sent by
E810 or X710. We disable RTE_ETH_TX_OFFLOAD_OUTER_UDP_CKSUM, but also
disable all the dependent offloads like
RTE_ETH_TX_OFFLOAD_VXLAN_TNL_TSO and
RTE_ETH_TX_OFFLOAD_GENEVE_TNL_TSO.

Fixes: 084c8087292c ("userspace: Support VXLAN and GENEVE TSO.")
Reported-at: https://github.com/openvswitch/ovs-issues/issues/321
Signed-off-by: Jun Wang <junwang01@cestc.cn>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2024-03-22 20:36:50 +01:00
Ilya Maximets
0ce82ac45e netdev-dpdk: Fix tunnel type check during Tx offload preparation.
Tunnel types are not flags, but 4-bit fields, so checking them with
a simple binary 'and' is incorrect and may produce false-positive
matches.

While the current implementation is unlikely to cause any issues today,
since both RTE_MBUF_F_TX_TUNNEL_VXLAN and RTE_MBUF_F_TX_TUNNEL_GENEVE
only have 1 bit set, it is risky to have this code and it may lead
to problems if we add support for other tunnel types in the future.

Use proper field checks instead.  Also adding a warning for unexpected
tunnel types in case something goes wrong.

Fixes: 084c8087292c ("userspace: Support VXLAN and GENEVE TSO.")
Acked-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2024-03-16 02:06:20 +01:00
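The difference between testing a flag and testing a multi-bit field can be shown with a small sketch. The constants below are hypothetical stand-ins: they follow the general shape of a 4-bit tunnel type field in ol_flags, but the names, values and bit position are assumptions for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical 4-bit tunnel type field; values are illustrative. */
#define EX_TNL_SHIFT  45
#define EX_TNL_VXLAN  (UINT64_C(0x1) << EX_TNL_SHIFT)
#define EX_TNL_GRE    (UINT64_C(0x2) << EX_TNL_SHIFT)
#define EX_TNL_IPIP   (UINT64_C(0x3) << EX_TNL_SHIFT)
#define EX_TNL_GENEVE (UINT64_C(0x4) << EX_TNL_SHIFT)
#define EX_TNL_MASK   (UINT64_C(0xF) << EX_TNL_SHIFT)

/* Flag-style test: true for any type that shares a bit with VXLAN. */
static bool
ex_is_vxlan_flag_style(uint64_t ol_flags)
{
    return (ol_flags & EX_TNL_VXLAN) != 0;
}

/* Field-style test: isolate the whole field, then compare for equality. */
static bool
ex_is_vxlan(uint64_t ol_flags)
{
    return (ol_flags & EX_TNL_MASK) == EX_TNL_VXLAN;
}

static void
ex_tunnel_demo(void)
{
    /* 0x3 shares its low bit with 0x1, so the flag-style test misfires. */
    assert(ex_is_vxlan_flag_style(EX_TNL_IPIP));
    assert(!ex_is_vxlan(EX_TNL_IPIP));

    /* Unrelated bits outside the field do not affect the field test. */
    assert(ex_is_vxlan(EX_TNL_VXLAN | UINT64_C(1)));
    assert(!ex_is_vxlan(EX_TNL_GENEVE));
}
```

With only single-bit values in use, both tests agree, which is why the flag-style check caused no visible issue until more tunnel types came into play.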
Ilya Maximets
05e9f05d14 netdev-dpdk: Fix TCP check during Tx offload preparation.
RTE_MBUF_F_TX_TCP_CKSUM is not a flag, but a 2-bit field, so checking
it with a simple binary 'and' is incorrect.  For example, this check
will succeed for a packet with UDP checksum requested as well.

Fix the check to avoid wrongly initializing tso_segsz and potentially
accessing UDP header via TCP structure pointer.

The IPv4 checksum flag has to be set for any L4 checksum request,
regardless of the type, so moving this check out of the TCP condition.

Fixes: 8b5fe2dc6080 ("userspace: Add Generic Segmentation Offloading.")
Acked-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2024-03-16 02:06:20 +01:00
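A sketch of classifying the L4 checksum request from a 2-bit field, rather than testing it like a flag, is given below. The constant names, values and bit position are hypothetical; they only reproduce the property described in this commit, namely that the UDP value contains the TCP bit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical 2-bit L4 checksum request field; values are illustrative. */
#define EX_L4_SHIFT 52
#define EX_L4_TCP   (UINT64_C(1) << EX_L4_SHIFT)
#define EX_L4_SCTP  (UINT64_C(2) << EX_L4_SHIFT)
#define EX_L4_UDP   (UINT64_C(3) << EX_L4_SHIFT)
#define EX_L4_MASK  (UINT64_C(3) << EX_L4_SHIFT)

enum ex_l4_kind {
    EX_L4_KIND_NONE,
    EX_L4_KIND_TCP,
    EX_L4_KIND_SCTP,
    EX_L4_KIND_UDP
};

/* Isolates the field and dispatches on its exact value. */
static enum ex_l4_kind
ex_classify_l4(uint64_t ol_flags)
{
    switch (ol_flags & EX_L4_MASK) {
    case EX_L4_TCP:
        return EX_L4_KIND_TCP;
    case EX_L4_SCTP:
        return EX_L4_KIND_SCTP;
    case EX_L4_UDP:
        return EX_L4_KIND_UDP;
    default:
        return EX_L4_KIND_NONE;
    }
}

/* The flag-style test that the commit replaces. */
static bool
ex_is_tcp_flag_style(uint64_t ol_flags)
{
    return (ol_flags & EX_L4_TCP) != 0;
}

static void
ex_l4_demo(void)
{
    /* A UDP request passes the flag-style TCP test... */
    assert(ex_is_tcp_flag_style(EX_L4_UDP));
    /* ...but is classified correctly once the field is masked. */
    assert(ex_classify_l4(EX_L4_UDP) == EX_L4_KIND_UDP);
}
```

Only when the classification returns the TCP kind would it be safe to read the header through a TCP structure pointer and fill in TSO-specific fields.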
Ilya Maximets
f8809760fc netdev-dpdk: Clear inner packet marks if no inner offloads requested.
In some cases only outer offloads may be requested for a tunneled
packet.  In this case there is no need to mark the type of an
inner packet.  Clean these flags up to avoid potential confusion
of DPDK drivers.

Fixes: 084c8087292c ("userspace: Support VXLAN and GENEVE TSO.")
Acked-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2024-03-16 02:06:20 +01:00
Ilya Maximets
7df30c86ce netdev-dpdk: Clean up all marker flags if no offloads requested.
Some drivers (primarily, Intel ones) do not expect any marking flags
being set if no offloads are requested.  If these flags are present,
driver will fail Tx preparation or behave abnormally.

For example, ixgbe driver will refuse to process the packet with
only RTE_MBUF_F_TX_TUNNEL_GENEVE and RTE_MBUF_F_TX_OUTER_IPV4 set.
This pretty much breaks Geneve tunnels on these cards.

An extra check is added to make sure we don't have any unexpected
Tx offload flags set.

Fixes: 084c8087292c ("userspace: Support VXLAN and GENEVE TSO.")
Reported-at: https://github.com/openvswitch/ovs-issues/issues/321
Acked-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2024-03-13 16:29:25 +01:00
Ilya Maximets
2c4ffd2f8a netdev-dpdk: Dump packets that fail Tx preparation.
It's hard to debug situations where driver rejects packets for some
reason.  Dumping out the mbuf should help with that.

Sample output looks like this:

  |netdev_dpdk(pmd-c03/id:8)|DBG|ovs-p1: First invalid packet:
  dump mbuf at 0x1180bce140, iova=0x2cb7ce400, buf_len=2176
    pkt_len=64, ol_flags=0x2, nb_segs=1, port=65535, ptype=0
    segment at 0x1180bce140, data=0x1180bce580, len=90, off=384, refcnt=1
    Dump data at [0x1180bce580], len=64
  00000000: 33 33 00 00 00 16 AA 27 91 F9 4D 96 86 DD 60 00 | 33.....'..M...`.
  00000010: 00 00 00 24 00 01 00 00 00 00 00 00 00 00 00 00 | ...$............
  00000020: 00 00 00 00 00 00 FF 02 00 00 00 00 00 00 00 00 | ................
  00000030: 00 00 00 00 00 16 3A 00 05 02 00 00 01 00 8F 00 | ......:.........

Acked-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2024-03-08 20:19:54 +01:00
David Marchand
3eb91a8d1b netdev-dpdk: Trigger port reconfiguration in main thread for resets.
When OVS (main thread) configures a DPDK netdev, it holds a netdev_dpdk
mutex lock.
As part of this configure operation, the net/iavf driver (used with i40e
VF devices) triggers a queue count change. The PF entity (serviced by a
kernel PF driver for example) handles this change and requests back that
the VF driver resets the VF device. The driver then completes the VF reset
operation on its side and waits for completion of the iavf-event thread
responsible for handling various VF device events.

On the other hand, handling of the VF reset request in this iavf-event
thread results in notifying the application with a port reset request
(RTE_ETH_EVENT_INTR_RESET). The OVS reset callback tries to take a hold
of the same netdev_dpdk mutex and blocks the iavf-event thread.

As a result, the net/iavf driver (still running on OVS main thread) is
unable to complete as it is waiting for iavf-event to complete.

To break from this situation, the OVS reset callback now won't take a
netdev_dpdk mutex. Instead, the port reset request is stored in a simple
RTE_ETH_MAXPORTS array associated to a seq object.
This is enough to let the VF driver complete this port initialization.
The OVS main thread later handles the port reset request.

More details in the DPDK upstream bz as this issue appeared following a
change in DPDK.

Link: https://bugs.dpdk.org/show_bug.cgi?id=1337
Signed-off-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2024-01-19 13:52:57 +01:00
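The lock-free handoff described in this commit can be sketched with C11 atomics. This is a simplified illustration with hypothetical names: the port limit is a stand-in constant, and the wakeup of the main thread (the seq object mentioned above) is omitted.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define EX_MAX_PORTS 32   /* Stand-in for the DPDK port-count limit. */

/* One pending-reset marker per port; zero-initialized, i.e. all false. */
static atomic_bool ex_pending_reset[EX_MAX_PORTS];

/* Runs in the driver's event thread.  It only records the request and
 * never takes the per-device mutex, so it cannot block on the thread
 * that is configuring the port. */
static void
ex_reset_callback(unsigned int port_id)
{
    assert(port_id < EX_MAX_PORTS);
    atomic_store(&ex_pending_reset[port_id], true);
}

/* Runs later in the main thread.  Returns true if a reset was requested
 * for this port and clears the request in the same atomic step. */
static bool
ex_take_pending_reset(unsigned int port_id)
{
    assert(port_id < EX_MAX_PORTS);
    return atomic_exchange(&ex_pending_reset[port_id], false);
}
```

Because the callback returns immediately, the driver can finish its initialization on the main thread, and the reset is handled afterwards when the main thread polls the array.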
Dexia Li
084c808729 userspace: Support VXLAN and GENEVE TSO.
For the userspace datapath, this patch provides VXLAN and Geneve tunnel TSO.
Only userspace VXLAN and Geneve tunnels are supported, along with tunnel
outer and inner checksum offload. If a netdev does not support the offload
features, there is a software fallback. If a netdev does not support VXLAN
and Geneve TSO, packets will be dropped. Front-end devices can also disable
offload features with ethtool.

Acked-by: Simon Horman <horms@ovn.org>
Signed-off-by: Dexia Li <dexia.li@jaguarmicro.com>
Co-authored-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2024-01-17 22:06:45 +01:00
Flavio Leitner
8b5fe2dc60 userspace: Add Generic Segmentation Offloading.
This provides a software implementation in the case
the egress netdev doesn't support segmentation in hardware.

The challenge here is to guarantee packet ordering in the
original batch that may be full of TSO packets. Each TSO
packet can go up to ~64kB, so with segment size of 1440
that means about 44 packets for each TSO. Each batch has
32 packets, so the total batch amounts to 1408 normal
packets.

The segmentation estimates the total number of packets
and then the total number of batches. Then allocate
enough memory and finally do the work.

Finally each batch is sent in order to the netdev.

Signed-off-by: Flavio Leitner <fbl@sysclose.org>
Co-authored-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Mike Pattrick <mkp@redhat.com>
Acked-by: Simon Horman <horms@ovn.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2023-12-02 01:33:37 +01:00
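The estimation step (count the segments, then the batches) can be sketched with ceiling division. The names are hypothetical and the batch size of 32 is taken from the commit message; this is an illustration of the arithmetic, not the OVS implementation.

```c
#include <assert.h>
#include <stddef.h>

#define EX_BATCH_SIZE 32   /* Packets per batch, as stated above. */

static size_t
ex_div_round_up(size_t a, size_t b)
{
    assert(b > 0);
    return (a + b - 1) / b;
}

/* Number of segments needed for one TSO packet's payload. */
static size_t
ex_n_segments(size_t payload_len, size_t segsz)
{
    return ex_div_round_up(payload_len, segsz);
}

/* Number of batches needed to carry 'n_pkts' packets in order. */
static size_t
ex_n_batches(size_t n_pkts)
{
    return ex_div_round_up(n_pkts, EX_BATCH_SIZE);
}
```

For example, a payload of 63360 bytes (44 x 1440) yields 44 segments; a full batch of 32 such packets expands to 44 x 32 = 1408 packets, which in turn need 44 output batches.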
Flavio Leitner
e0056018c4 userspace: Respect tso/gso segment size.
Currently OVS will calculate the segment size based on the
MTU of the egress port. That usually happens to be correct
when the ports share the same MTU, but that is not always true.

Therefore, if the segment size is provided, then use that and
make sure the over sized packets are dropped.

Signed-off-by: Flavio Leitner <fbl@sysclose.org>
Co-authored-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Mike Pattrick <mkp@redhat.com>
Acked-by: Simon Horman <horms@ovn.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2023-12-02 00:56:36 +01:00
Jakob Meng
c19a5b48bf netdev-dpdk: Sync and clean {get, set}_config() callbacks.
For better usability, the function pairs get_config() and
set_config() for netdevs should be symmetric: Options which are
accepted by set_config() should be returned by get_config() and the
latter should output valid options for set_config() only.

This patch moves key-value pairs which are not valid options from
get_config() to the get_status() callback. For example, get_config()
in lib/netdev-dpdk.c returned {configured,requested}_{rx,tx}_queues
previously. For requested rx queues the proper option name is n_rxq,
so requested_rx_queues has been renamed accordingly. Tx queues
cannot be changed by the user, hence requested_tx_queues has been
dropped. Both configured_{rx,tx}_queues will be returned as
n_{r,t}xq in the get_status() callback.
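
The intended symmetry can be illustrated with a toy model. The class
below is hypothetical and only mimics the split between options and
status described here; it is not the netdev-dpdk implementation, and
the set of valid options is reduced to n_rxq for brevity:

```python
VALID_OPTIONS = {"n_rxq"}  # assumption: a reduced option set for the sketch


class ToyNetdev:
    def __init__(self):
        self.requested_rxq = 1   # what the user asked for
        self.configured_rxq = 1  # what the device ended up with
        self.configured_txq = 1  # not user-configurable

    def set_config(self, args):
        unknown = set(args) - VALID_OPTIONS
        if unknown:
            raise ValueError("invalid options: %s" % sorted(unknown))
        self.requested_rxq = int(args.get("n_rxq", 1))

    def get_config(self):
        # Only keys that set_config() accepts are returned.
        return {"n_rxq": str(self.requested_rxq)}

    def get_status(self):
        # Effective values are reported as status, not as options.
        return {"n_rxq": str(self.configured_rxq),
                "n_txq": str(self.configured_txq)}


dev = ToyNetdev()
dev.set_config({"n_rxq": "4"})
dev.set_config(dev.get_config())  # round trip must be accepted
```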

The netdev dpdk classes no longer share a common get_config() callback,
instead both the dpdk_class and the dpdk_vhost_client_class define
their own callbacks. The get_config() callback for dpdk_vhost_class has
been dropped because it does not have a set_config() callback.

The documentation in vswitchd/vswitch.xml for status columns as well
as tests have been updated accordingly.

Reported-at: https://bugzilla.redhat.com/1949855
Signed-off-by: Jakob Meng <code@jakobmeng.de>
Reviewed-by: Robin Jarry <rjarry@redhat.com>
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
2023-11-14 11:03:35 +00:00
David Marchand
bb61931dc5 netdev-dpdk: Disable net/tap Tx L4 checksum offloads.
As reported by Ales when doing some OVN integration tests with OVS 3.2,
net/tap has broken L4 checksum offloads.

Fixes are pending on the DPDK side.
Until they get in an LTS release used by OVS, disable those Tx offloads.

Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2023-08-30 20:36:24 +02:00
David Marchand
9b7e1a7537 netdev-dpdk: Clear IP packet type when no offload is requested.
OVS currently sets RTE_MBUF_F_TX_IPV[46] flags in early stages of the
packet reception and keeps track of the IP packet type as the packet
goes through OVS pipeline.
When a packet leaves OVS and hits a DPDK driver, OVS may not request IP
checksum offloading but leaves one of these packet type flags in ol_flags.

The DPDK API describes that RTE_MBUF_F_TX_IPV4 must be set when
requesting some Tx offloads (like RTE_MBUF_F_TX_IP_CKSUM,
RTE_MBUF_F_TX_TCP_CKSUM, .., RTE_MBUF_F_TX_TCP_SEG).
Even though setting RTE_MBUF_F_TX_IPV4 without requesting a Tx offload
is undefined, this can confuse some drivers (like net/iavf) which then
read zeroed l2_len and l3_len and end up dropping the packet.
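
A minimal sketch of the idea, using plain integers in place of
ol_flags. The bit positions are meant to mirror the DPDK mbuf flag
layout but should be treated as illustrative assumptions, and
prepare_tx_flags() is a hypothetical helper, not an OVS function:

```python
# Illustrative bit positions (assumed, check rte_mbuf_core.h before reuse).
TX_TCP_SEG = 1 << 50
TX_L4_MASK = 3 << 52   # covers the TCP/SCTP/UDP checksum requests
TX_IP_CKSUM = 1 << 54
TX_IPV4 = 1 << 55
TX_IPV6 = 1 << 56

TX_OFFLOAD_REQUESTS = TX_TCP_SEG | TX_L4_MASK | TX_IP_CKSUM
TX_IP_TYPE = TX_IPV4 | TX_IPV6


def prepare_tx_flags(ol_flags: int) -> int:
    """Drop the IP packet type flags when no Tx offload is requested."""
    if (ol_flags & TX_OFFLOAD_REQUESTS) == 0:
        return ol_flags & ~TX_IP_TYPE
    return ol_flags


# Packet type marked at reception, but nothing to offload on transmit.
print(hex(prepare_tx_flags(TX_IPV4)))  # 0x0
```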

Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=2231081
Fixes: 5d11c47d3ebe ("userspace: Enable IP checksum offloading by default.")
Acked-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2023-08-28 20:20:39 +02:00
Ivan Malov
d460c473eb netdev-dpdk: Negotiate delivery of per-packet Rx metadata.
This may be required by some PMDs in offload scenarios.

Fixes: e8a2b5bf92bb ("netdev-dpdk: implement flow offload with rte flow")
Signed-off-by: Ivan Malov <ivan.malov@arknetworks.am>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2023-07-25 14:42:41 +02:00
Adrian Moreno
6240c0b4c8 netdev: Add netdev_get_speed() to netdev API.
Currently, the netdev's speed is being calculated by taking the link's
feature bits (using netdev_get_features()) and transforming them into
bps.

This mechanism can be both inaccurate and difficult to maintain, mainly
because we currently use the feature bits supported by OpenFlow which
would have to be extended to support all new feature bits of all netdev
implementations while keeping the OpenFlow API intact.

In order to expose the link speed accurately for all current and future
hardware, add a new netdev API call that allows the implementations to
provide the current and maximum link speeds in Mbps.
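
The limitation of the feature-bit approach can be sketched as follows.
The list of speeds is an assumption about which rates have an OpenFlow
port feature bit, and both helpers are hypothetical stand-ins rather
than the real netdev functions:

```python
# Assumption: link speeds (in Mbps) that have a dedicated feature bit.
FEATURE_BIT_SPEEDS_MBPS = (10, 100, 1_000, 10_000, 40_000, 100_000, 1_000_000)


def bps_from_feature_bits(link_mbps):
    """Old approach: a speed without a feature bit cannot be expressed."""
    if link_mbps in FEATURE_BIT_SPEEDS_MBPS:
        return link_mbps * 1_000_000
    return None  # only an "other" bit is available, the rate is lost


def bps_from_get_speed(current_mbps):
    """New approach: the implementation reports Mbps directly."""
    return current_mbps * 1_000_000


for mbps in (10_000, 25_000):
    print(mbps, bps_from_feature_bits(mbps), bps_from_get_speed(mbps))
```

With these assumptions a 25 Gbps link has no representation in the old
scheme, while the direct report needs no per-speed table.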

Internally, the logic to get the maximum supported speed still relies on
feature bits so it might still get out of sync in the future. However,
the maximum configurable speed is not used as much as the current speed
and these feature bits are not exposed through the netdev interface so
it should be easier to add more.

Use this new function instead of netdev_get_features() where the link
speed is needed.

As a consequence of this patch, link speeds of cards are properly
reported (internally in OVSDB) even if not supported by OpenFlow.
A test verifies this behavior using a tap device.

Also, in order to avoid use of the old API, this patch adds a
checkpatch.py warning if the old API is used.

Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=2137567
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2023-07-17 20:03:32 +02:00
Robin Jarry
fc06ea9a18 netdev-dpdk: Add custom rx-steering configuration.
Some control protocols are used to maintain link status between
forwarding engines (e.g. LACP). When the system is not sized properly,
the PMD threads may not be able to process all incoming traffic from the
configured Rx queues. When a signaling packet of such protocols is
dropped, it can cause link flapping, worsening the situation.

Use the rte_flow API to redirect these protocols into a dedicated Rx
queue. The assumption is made that the ratio between control protocol
traffic and user data traffic is very low and thus this dedicated Rx
queue will never get full. Re-program the RSS redirection table to only
use the other Rx queues.

The additional Rx queue will be assigned a PMD core like any other Rx
queue. Polling that extra queue may introduce increased latency and
a slight performance penalty at the benefit of preventing link flapping.

This feature must be enabled per port on specific protocols via the
rx-steering option. This option takes "rss" followed by a "+" separated
list of protocol names. It is only supported on ethernet ports. This
feature is experimental.

If the user has already configured multiple Rx queues on the port, an
additional one will be allocated for control packets. If the hardware
cannot satisfy the number of requested Rx queues, the last Rx queue will
be assigned for control plane. If only one Rx queue is available, the
rx-steering feature will be disabled. If the hardware does not support
the rte_flow matchers/actions, the rx-steering feature will be
completely disabled on the port and regular rss will be performed
instead.

It cannot be enabled when other-config:hw-offload=true as it may
conflict with the offloaded flows. Similarly, if hw-offload is enabled,
custom rx-steering will be forcibly disabled on all ports and replaced
by regular rss.
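
The option format and the queue rules above can be summarized in a
short sketch. It only models the decisions described in this message;
the function names are hypothetical, rejecting unknown protocols with
an exception is an assumption, and the real code programs rte_flow
rules and the RSS redirection table instead of returning a dict:

```python
SUPPORTED_PROTOCOLS = {"lacp"}  # only LACP is supported as a starting point


def parse_rx_steering(value):
    """Parse "rss" followed by a "+" separated list of protocol names."""
    first, *protocols = value.split("+")
    if first != "rss":
        raise ValueError("rx-steering must start with 'rss'")
    unknown = [p for p in protocols if p not in SUPPORTED_PROTOCOLS]
    if unknown:
        raise ValueError("unsupported protocols: %s" % unknown)
    return set(protocols)


def plan_queues(requested_rxq, hw_max_rxq, protocols, hw_offload=False):
    """Return the data queues used by RSS and the optional control queue."""
    if not protocols or hw_offload:
        n = min(requested_rxq, hw_max_rxq)
        return {"data": list(range(n)), "control": None}
    # One extra queue for control packets, or the last one if the
    # hardware cannot provide an additional queue.
    total = min(requested_rxq + 1, hw_max_rxq)
    if total < 2:
        return {"data": [0], "control": None}  # feature disabled
    return {"data": list(range(total - 1)), "control": total - 1}


print(plan_queues(2, 8, parse_rx_steering("rss+lacp")))
```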

Example use:

 ovs-vsctl add-bond br-phy bond0 phy0 phy1 -- \
   set interface phy0 type=dpdk options:dpdk-devargs=0000:ca:00.0 -- \
   set interface phy0 options:rx-steering=rss+lacp -- \
   set interface phy1 type=dpdk options:dpdk-devargs=0000:ca:00.1 -- \
   set interface phy1 options:rx-steering=rss+lacp

As a starting point, only one protocol is supported: LACP. Other
protocols can be added in the future. NIC compatibility should be
checked.

To validate that this works as intended, I used a traffic generator to
generate random traffic slightly above the machine capacity at line rate
on a two ports bond interface. OVS is configured to receive traffic on
two VLANs and pop/push them in a br-int bridge based on tags set on
patch ports.

   +----------------------+
   |         DUT          |
   |+--------------------+|
   ||       br-int       || in_port=patch10,actions=mod_dl_src:$patch11,
   ||                    ||                         mod_dl_dst:$tgen1,
   ||                    ||                         output:patch11
   ||                    || in_port=patch11,actions=mod_dl_src:$patch10,
   ||                    ||                         mod_dl_dst:$tgen0,
   || patch10    patch11 ||                         output:patch10
   |+---|-----------|----+|
   |    |           |     |
   |+---|-----------|----+|
   || patch00    patch01 ||
   ||  tag:10    tag:20  ||
   ||                    ||
   ||       br-phy       || default flow, action=NORMAL
   ||                    ||
   ||       bond0        || balance-slb, lacp=passive, lacp-time=fast
   ||    phy0   phy1     ||
   |+------|-----|-------+|
   +-------|-----|--------+
           |     |
   +-------|-----|--------+
   |     port0  port1     | balance L3/L4, lacp=active, lacp-time=fast
   |         lag          | mode trunk VLANs 10, 20
   |                      |
   |        switch        |
   |                      |
   |  vlan 10    vlan 20  |  mode access
   |   port2      port3   |
   +-----|----------|-----+
         |          |
   +-----|----------|-----+
   |   tgen0      tgen1   |  Random traffic that is properly balanced
   |                      |  across the bond ports in both directions.
   |  traffic generator   |
   +----------------------+

Without rx-steering, the bond0 links are randomly switching to
"defaulted" when one of the LACP packets sent by the switch is dropped
because the RX queues are full and the PMD threads did not process them
fast enough. When that happens, all traffic must go through a single
link which causes above line rate traffic to be dropped.

 ~# ovs-appctl lacp/show-stats bond0
 ---- bond0 statistics ----
 member: phy0:
   TX PDUs: 347246
   RX PDUs: 14865
   RX Bad PDUs: 0
   RX Marker Request PDUs: 0
   Link Expired: 168
   Link Defaulted: 0
   Carrier Status Changed: 0
 member: phy1:
   TX PDUs: 347245
   RX PDUs: 14919
   RX Bad PDUs: 0
   RX Marker Request PDUs: 0
   Link Expired: 147
   Link Defaulted: 1
   Carrier Status Changed: 0

When rx-steering is enabled, no LACP packet is dropped and the bond
links remain enabled at all times, maximizing the throughput. Neither
the "Link Expired" nor the "Link Defaulted" counters are incremented
anymore.

This feature may be considered as "QoS". However, it does not work by
limiting the rate of traffic explicitly. It only guarantees that some
protocols have a lower chance of being dropped because the PMD cores
cannot keep up with regular traffic.

The choice of protocols is limited on purpose. This is not meant to be
configurable by users. Some limited configurability could be considered
in the future but it would expose users to more potential issues if they
accidentally redirect all traffic to the isolated queue.

Acked-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Robin Jarry <rjarry@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2023-07-10 15:49:44 +02:00
David Marchand
a5669fd51c netdev-dpdk: Drop TSO in case of conflicting virtio features.
At some point in OVS history, some virtio features were announced as
supported (ECN and UFO virtio features).

The userspace TSO code, which was added later, does not support
those features and tries to disable them.

This breaks OVS upgrades: if an existing VM already negotiated such
features, their lack on reconnection to an upgraded OVS triggers a
vhost socket disconnection by Qemu.
This results in an endless loop because Qemu then retries with the same
set of virtio features.

This patch proposes to try and detect those vhost socket disconnections
and fall back to restoring the old virtio features (and disabling TSO
for this vhost port).

Acked-by: Mike Pattrick <mkp@redhat.com>
Acked-by: Simon Horman <simon.horman@corigine.com>
Acked-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2023-07-07 18:05:48 +02:00
Mike Pattrick
3337e6d91c userspace: Enable L4 checksum offloading by default.
The netdev receiving packets is supposed to provide the flags
indicating if the L4 checksum was verified and it is OK or BAD,
otherwise the stack will check when appropriate by software.

If the packet comes with good checksum, then postpone the
checksum calculation to the egress device if needed.

When encapsulating a packet with that flag, set the checksum
of the inner L4 header since that is not yet supported.

Calculate the L4 checksum when the packet is going to be sent
over a device that doesn't support the feature.

Linux tap devices allow enabling L3 and L4 offload, so this
patch enables the feature. However, Linux socket interface
remains disabled because the API doesn't allow enabling
those two features without enabling TSO too.
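
The receive and transmit decisions described above can be sketched as
a small decision table. The enum, the function names and the "pending"
notion (a checksum whose calculation was postponed) are hypothetical
labels for this illustration, not identifiers from the OVS sources:

```python
from enum import Enum


class L4Checksum(Enum):
    UNKNOWN = "unknown"  # the netdev gave no information
    GOOD = "good"
    BAD = "bad"


def rx_needs_software_check(status):
    """Only packets without a netdev verdict are checked by the stack."""
    return status is L4Checksum.UNKNOWN


def tx_checksum_action(pending, egress_offloads_l4, encapsulating):
    """Decide who fills in a postponed L4 checksum."""
    if not pending:
        return "none"
    if encapsulating:
        # Inner L4 checksum offload is not supported yet.
        return "software"
    return "hardware" if egress_offloads_l4 else "software"


print(tx_checksum_action(True, egress_offloads_l4=False, encapsulating=False))
```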

Signed-off-by: Flavio Leitner <fbl@sysclose.org>
Co-authored-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2023-06-15 23:50:30 +02:00
Mike Pattrick
5d11c47d3e userspace: Enable IP checksum offloading by default.
The netdev receiving packets is supposed to provide the flags
indicating if the IP checksum was verified and it is GOOD or BAD,
otherwise the stack will check when appropriate by software.

If the packet comes with good checksum, then postpone the
checksum calculation to the egress device if needed.

When encapsulating a packet with that flag, set the checksum
of the inner IP header since that is not yet supported.

Calculate the IP checksum when the packet is going to be sent over
a device that doesn't support the feature.

Linux devices don't support IP checksum offload alone, so the
support is not enabled.
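
For reference, the software fallback amounts to the standard Internet
checksum over the IPv4 header. The sketch below is a plain Python
rendition of that algorithm with a hypothetical function name and a
made-up sample header; it is not the OVS implementation:

```python
import struct


def ipv4_header_checksum(header: bytes) -> int:
    """One's complement of the one's complement sum of 16-bit words."""
    if len(header) % 2:
        raise ValueError("header length must be even")
    total = sum(word for (word,) in struct.iter_unpack("!H", header))
    while total >> 16:  # fold the carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


# Sample 20-byte header with the checksum field (bytes 10-11) zeroed.
header = bytes.fromhex("45000073000040004011" "0000" "c0a80001c0a800c7")
csum = ipv4_header_checksum(header)
print(hex(csum))  # 0xb861

# A header carrying a correct checksum sums to zero when re-verified.
filled = header[:10] + struct.pack("!H", csum) + header[12:]
print(ipv4_header_checksum(filled))  # 0
```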

Signed-off-by: Flavio Leitner <fbl@sysclose.org>
Co-authored-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2023-06-15 23:49:51 +02:00
Mike Pattrick
4433cc6860 dpif-netdev: Show netdev offloading flags.
This patch modifies netdev_get_status to include information about
checksum offload status by port, allowing the user to gain insight into
where checksum offloading is active.

Signed-off-by: Flavio Leitner <fbl@sysclose.org>
Co-authored-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2023-06-15 15:44:57 +02:00