mir/ovs - ovs - Mike's Git repositories

mirror of https://github.com/openvswitch/ovs synced 2025-08-22 18:07:40 +00:00

Author	SHA1	Message	Date
David Marchand	dd443c1a7a	netdev-dpdk: Stop relying on vhost-user Tx flags. vhost-user legacy behavior has been to mark mbuf with Tx offload flags based on what the virtio-net header contained (but provide no Rx information, like IP checksum or L4 checksum validity). Changing to the non legacy mode means that no code out of OVS should set any RTE_MBUF_F_TX_* flag. Had a check accordingly. Link: https://git.dpdk.org/dpdk/commit/?id=ca7036b4af3a Reported-at: https://issues.redhat.com/browse/FDP-1147 Signed-off-by: David Marchand <david.marchand@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2025-06-19 21:03:15 +02:00
David Marchand	cf7b86db1f	dp-packet: Rework TCP segmentation. Rather than mark with a offload flags + mark with a segmentation size, simply rely on the netdev implementation which sets a segmentation size when appropriate. Signed-off-by: David Marchand <david.marchand@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2025-06-19 21:03:09 +02:00
David Marchand	2956a61265	dp-packet: Rework L4 checksum offloads. The DPDK mbuf API specifies 4 status when it comes to L4 checksums: - RTE_MBUF_F_RX_L4_CKSUM_UNKNOWN: no information about the RX L4 checksum - RTE_MBUF_F_RX_L4_CKSUM_BAD: the L4 checksum in the packet is wrong - RTE_MBUF_F_RX_L4_CKSUM_GOOD: the L4 checksum in the packet is valid - RTE_MBUF_F_RX_L4_CKSUM_NONE: the L4 checksum is not correct in the packet data, but the integrity of the L4 data is verified. Similarly to the IP checksum offloads API, revise OVS L4 offloads API. No information about the L4 protocol is provided by any netdev-* implementation, so OVS needs to mark this L4 protocol during flow extraction. Rename current API for consistency with dp_packet_(inner_)?l4_checksum_. Signed-off-by: David Marchand <david.marchand@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2025-06-19 21:02:56 +02:00
David Marchand	3daf04a4c5	dp-packet: Rework IP checksum offloads. As the packet traverses through OVS, offloading Tx flags must be carefully evaluated and updated which results in a bit of complexity because of a separate "outer" Tx offloading flag coming from DPDK API, and a "normal"/"inner" Tx offloading flag. On the other hand, the DPDK mbuf API specifies 4 status when it comes to IP checksums: - RTE_MBUF_F_RX_IP_CKSUM_UNKNOWN: no information about the RX IP checksum - RTE_MBUF_F_RX_IP_CKSUM_BAD: the IP checksum in the packet is wrong - RTE_MBUF_F_RX_IP_CKSUM_GOOD: the IP checksum in the packet is valid - RTE_MBUF_F_RX_IP_CKSUM_NONE: the IP checksum is not correct in the packet data, but the integrity of the IP header is verified. This patch changes OVS API so that OVS code only tracks the status of the checksum of the "current" L3 header and let the Tx flags aspect to the netdev-* implementations. With this API, the flow extraction can be cleaned up. During packet processing, OVS can simply look for the IP checksum validity (either good, or partial) before changing some IP header, and then mark the checksum as partial. In the conntrack case, when natting packets, the checksum status of the inner part (ICMP error case) must be forced temporarily as unknown to force checksum resolution. When tunneling comes into play, IP checksums status is bit-shifted for future considerations in the processing if, for example, the tunnel header gets decapsulated again, or in the netdev-* implementations that support tunnel offloading. Finally, netdev-* implementations only need to care about packets in partial status: a good checksum does not need touching, a bad checksum has been updated but kept as bad by OVS, an unknown checksum is either an IPv6 or if it was an IPv4, OVS updated it too (keeping it good or bad accordingly). Rename current API for consistency with dp_packet_(inner_)?ip_checksum_. Signed-off-by: David Marchand <david.marchand@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2025-06-19 21:00:54 +02:00
David Marchand	67abd51540	dp-packet: Rework tunnel offloads. Rather than set bits in the mbuf ol_flags field, that only makes sense for netdev-dpdk ports, mark packet for tunnel offload in OVS offloads API. While at it, since there is nothing really "hardware" related, rename current API for consistency with dp_packet_tunnel_ prefix. Signed-off-by: David Marchand <david.marchand@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2025-06-19 21:00:48 +02:00
David Marchand	d29ba0abdc	dp-packet: Add OVS offloading API. As a preparation for tracking inner checksums, separate Rx checksum status from the DPDK ol_flags field. To minimize the cost of translating from DPDK API to OVS API, simply map OVS flags to DPDK Rx mbuf flags. Signed-off-by: David Marchand <david.marchand@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2025-06-19 21:00:34 +02:00
David Marchand	19ef1b1f0f	dp-packet: Remove DPDK specific IP version. Flagging packets with IP version is only needed at the netdev-dpdk level. In most cases, OVS is already inspecting the IP header in packet data, so maintaining such IP version metadata won't save much cycles (given the cost of additional branches necessary for handling outer/inner flags). Cleanup OVS shared code and only set these flags in netdev-dpdk.c. Signed-off-by: David Marchand <david.marchand@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2025-06-19 20:59:22 +02:00
Roi Dayan	b42f9fde4a	netdev-dpdk: Fix possible memory leak in vhost stats. On error condition need to release the allocated structs. Reported by Coverity. Fixes: 3b29286db1c5 ("netdev-dpdk: Add per virtqueue statistics.") Signed-off-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Kevin Traynor <ktraynor@redhat.com>	2025-05-30 14:22:23 +01:00
Jay Ding	6f33ac6321	netdev-dpdk: Fix device info return value check. rte_eth_dev_info_get() could fail due to device reset, etc. The return value should be checked before the device info pointer is dereferenced. Fixes: 2f196c80e716 ("netdev-dpdk: Use LSC interrupt mode.") Signed-off-by: Jay Ding <jay.ding@broadcom.com> Co-Authored-by: Kevin Traynor <ktraynor@redhat.com> Signed-off-by: Kevin Traynor <ktraynor@redhat.com> Acked-by: Mike Pattrick <mkp@redhat.com> Acked-by: Eelco Chaudron <echaudro@redhat.com>	2025-04-14 13:38:56 +01:00
Mike Pattrick	2276c3a2c6	userspace: Support GRE TSO. This patch extends the userspace datapaths support of tunnel tso from only supporting VxLAN and Geneve to also supporting GRE tunnels. There is also a software fallback for cases where the egress netdev does not support this feature. Reviewed-by: David Marchand <david.marchand@redhat.com> Signed-off-by: Mike Pattrick <mkp@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2025-01-17 00:20:48 +01:00
Maxime Coquelin	a24413cd3e	netdev-dpdk: Set vhost port maximum number of queue pairs. This patch uses the new rte_vhost_driver_set_max_queue_num API to set the maximum number of queue pairs supported by the vhost-user port. This is required for VDUSE which needs to specify the maximum number of queue pairs at creation time. Without it 128 queue pairs metadata would be allocated. To configure it, a new 'vhost-max-queue-pairs' option is introduced. Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com> Acked-by: Eelco Chaudron <echaudro@redhat.com> Signed-off-by: Kevin Traynor <ktraynor@redhat.com>	2025-01-10 15:38:28 +00:00
David Marchand	af292d273f	netdev-dpdk: Restore outer UDP checksum for Intel nics. Fixes for Intel drivers are included in DPDK v23.11.2. Link: https://git.dpdk.org/dpdk-stable/commit/?id=e8c2cccfbdef Link: https://git.dpdk.org/dpdk-stable/commit/?id=1970a0ca45f1 Link: https://git.dpdk.org/dpdk-stable/commit/?id=80c5c9789b73 Fixes: 0256ee64ed39 ("dpdk: Use DPDK 23.11.2 release.") Signed-off-by: David Marchand <david.marchand@redhat.com> Acked-by: Mike Pattrick <mkp@redhat.com> Signed-off-by: Kevin Traynor <ktraynor@redhat.com>	2024-12-12 17:02:07 +00:00
David Marchand	ba5a1536cd	netdev-dpdk: Check error for device info and link status queries. Since DPDK v19.11, a couple of ethdev API have been reporting errors in case of invalid port id or other error conditions in drivers. So far, OVS did not check for those error cases. Starting v24.11 future release, the ethdev API warns for unchecked returned values, so let's prepare for this. Link: https://git.dpdk.org/dpdk/commit/?id=4f25d7d2252f Link: https://git.dpdk.org/dpdk/commit/?id=4633c3b2ebf2 Link: https://git.dpdk.org/dpdk/commit/?id=1ff8b9a6ef24 Signed-off-by: David Marchand <david.marchand@redhat.com> Acked-by: Kevin Traynor <ktraynor@redhat.com> Acked-by: Eelco Chaudron <echaudro@redhat.com>	2024-11-27 15:55:53 +00:00
David Marchand	7383f0e1bf	netdev-dpdk: Cache representor flag at init. No need to query device info during the life of a port for checking if this port is a representor. This capacity is decided at the ethdev port creation in DPDK and OVS can simply store this info during dpdk_eth_dev_init(). Signed-off-by: David Marchand <david.marchand@redhat.com> Acked-by: Kevin Traynor <ktraynor@redhat.com> Acked-by: Eelco Chaudron <echaudro@redhat.com>	2024-11-27 15:55:46 +00:00
David Marchand	6204d3837c	netdev-dpdk: Cache device info during port configuration. No need to query device info twice while configuring a port. Simply pass the rte_eth_dev_info object. Signed-off-by: David Marchand <david.marchand@redhat.com> Acked-by: Kevin Traynor <ktraynor@redhat.com> Acked-by: Eelco Chaudron <echaudro@redhat.com>	2024-11-27 10:54:21 +00:00
David Marchand	d4b222bb66	netdev-dpdk: Stop configuring after device init failure. Caught by code review. If dpdk_eth_dev_init() fails, no need to continue and try to initialise other features for this port. Plus, err may get overwritten later (like if some rss steering is configured) which could result in non consistent error codes. Signed-off-by: David Marchand <david.marchand@redhat.com> Acked-by: Kevin Traynor <ktraynor@redhat.com> Acked-by: Eelco Chaudron <echaudro@redhat.com>	2024-11-27 10:54:21 +00:00
Jun Wang	2bf609f70b	netdev-dpdk: Disable outer udp checksum offload for txgbe driver. Fixing the issue of incorrect outer UDP checksum in packets sent by the wangxun network card (driver is txgbe), we disabled RTE_ETH_TX_OFFLOAD_OUTER_UDP_CKSUM. Fixes: 084c8087292c ("userspace: Support VXLAN and GENEVE TSO.") Reported-by: Jun Wang <junwang01@cestc.cn> Acked-by: David Marchand <david.marchand@redhat.com> Acked-by: Simon Horman <horms@ovn.org> Signed-off-by: Jun Wang <junwang01@cestc.cn> Signed-off-by: Eelco Chaudron <echaudro@redhat.com>	2024-09-20 12:01:05 +02:00
Mike Pattrick	bd48ff8f7d	netdev-dpdk: Re-enable VXLAN/Geneve offload for Intel cards. Previously support for UDP tunneled traffic TCP traffic with UDP checksum offloading did not work well in cases where the sending network card didn't also support these features. Some of the code had been written to assume that if a card supported VXLAN/Geneve offloading, then it also supported outer UDP checksum offloading. However, this was not the case for some Intel network cards. A previous change disabled the VXLAN/Geneve offload flags for these cards as a temporary fix. However, with "Userspace: Software fallback for UDP encapsulated TCP segmentation.", the logic related to software fallback for checksum offloading now anticipates this configuration. The modification to the outer UDP offload flag is still required. This feature does not work as expected in the current DPDK release. Suggested-by: David Marchand <david.marchand@redhat.com> Reviewed-by: David Marchand <david.marchand@redhat.com> Signed-off-by: Mike Pattrick <mkp@redhat.com> Signed-off-by: Eelco Chaudron <echaudro@redhat.com>	2024-09-11 15:37:17 +02:00
Mike Pattrick	9f0c6e16e3	netdev-dpdk: Fix race condition in mempool information dump. Currently it is possible to call netdev-dpdk/get-mempool-info before a mempool has been created. This can happen because a device is added to the netdev_shash before a mempool is allocated for it, which results in a segmentation fault. Now we check for a NULL value before attempting to dereference it. Fixes: be4817331071 ("netdev-dpdk: Add debug appctl to get mempool information.") Signed-off-by: Mike Pattrick <mkp@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2024-08-08 21:55:43 +02:00
Ilya Maximets	56e315937e	vswitchd: Only lock pages that are faulted in. The main purpose of locking the memory is to ensure that OVS can keep doing what it did before in case of increased memory pressure, e.g., during VM ingest / migration. Fulfilling this requirement can be achieved without locking all the allocated memory, but only the pages already accessed in the past (faulted in). Processing of the new traffic involves new memory allocations. Latency on these operations can't be guaranteed by the locking. The main difference would be the pre-faulting of the stack memory. However, in order to revalidate or process upcalls on the same traffic, the same amount of stack is likely needed, so all the necessary memory will already be faulted in. Switch 'mlockall' to MCL_ONFAULT to avoid consuming unnecessarily large amounts of RAM on systems with high core counts. For example, in a densely populated OVN cluster this saves about 650 MB of RAM per node on a system with 64 cores. This equates to 320 GB of allocated but unused RAM in a 500 node cluster. This also makes OVS better suited by default for small systems with limited amount of memory. The MCL_ONFAULT flag was introduced in Linux kernel 4.4 and wasn't available at the time of '--mlockall' introduction, but we can use it now. Falling back to an old way of locking in case we're running on an older kernel just in case. Only locking the faulted in pages also makes locking compatible with vhost post-copy live migration by default, because we'll no longer pre-fault all the guest's memory. Post-copy relies on userfaultfd to work on shared huge pages, which is only available in 4.11+ kernels. So, technically, it should not be possible for MCL_ONFAULT to fail and the call without it to succeed. But keeping the check just in case for now. Acked-by: Simon Horman <horms@ovn.org> Acked-by: Eelco Chaudron <echaudro@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2024-06-28 23:44:53 +02:00
Kevin Traynor	639fcf2005	netdev-dpdk: Check pending reset when adding device. When a device reset interrupt event (RTE_ETH_EVENT_INTR_RESET) is detected for a DPDK device added to OVS, a device reset is performed. If a device reset interrupt event is detected for a device before it is added to OVS, device reset is not called. If that device is later attempted to be added to OVS, it may fail while being configured if it is still pending a reset as pending reset is not checked when adding a device. A simple way to force a reset event from the ice driver for an iavf device is to set the mac address after binding iavf dev to vfio but before adding to OVS. (note: should not be set like this in normal case). e.g. $ echo 2 > /sys/class/net/ens3f0/device/sriov_numvfs $ ./devbind.py -b vfio-pci 0000:d8:01.1 $ ip link set ens3f0 vf 1 mac 26:ab:e6:6f:79:4d $ ovs-vsctl add-port br0 dpdk0 -- set Interface dpdk0 type=dpdk \ options:dpdk-devargs=0000:d8:01.1 |dpdk|ERR|Port1 dev_configure = -1 |netdev_dpdk|WARN|Interface dpdk0 eth_dev setup error Operation not permitted |netdev_dpdk|ERR|Interface dpdk0(rxq:1 txq:5 lsc interrupt mode:false) configure error: Operation not permitted |dpif_netdev|ERR|Failed to set interface dpdk0 new configuration Add a check if there was any previous device reset interrupt events when a device is added to OVS. If there was, perform the reset before continuing with the rest of the configuration. netdev_dpdk_pending_reset[] already tracks device reset interrupt events for all devices, so it can be reused to check if there is a reset needed during configuration of newly added devices. By extending its usage, dev->reset_needed is no longer needed. Fixes: 3eb91a8d1b9a ("netdev-dpdk: Trigger port reconfiguration in main thread for resets.") Reviewed-by: David Marchand <david.marchand@redhat.com> Signed-off-by: Kevin Traynor <ktraynor@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2024-06-28 21:55:34 +02:00
David Marchand	2f196c80e7	netdev-dpdk: Use LSC interrupt mode. Querying link status may get delayed for an undeterministic (long) time with mlx5 ports. This is a consequence of the mlx5 driver calling ethtool kernel API and getting stuck on the kernel RTNL lock while some other operation is in progress under this lock. One impact for long link status query is that it is called under the bond lock taken in write mode periodically in bond_run(). In parallel, datapath threads may block requesting to read bonding related info (like for example in bond_check_admissibility()). The LSC interrupt mode is available with many DPDK drivers and is used by default with testpmd. It seems safe enough to switch on this feature by default in OVS. We keep the per interface option to disable this feature in case of an unforeseen bug. Signed-off-by: David Marchand <david.marchand@redhat.com> Reviewed-by: Robin Jarry <rjarry@redhat.com> Acked-by: Mike Pattrick <mkp@redhat.com> Acked-by: Maxime Coquelin <maxime.coquelin@redhat.com> Acked-by: Kevin Traynor <ktraynor@redhat.com> Acked-by: Aaron Conole <aconole@redhat.com>	2024-06-24 12:46:11 +01:00
David Marchand	c39a84c131	netdev-dpdk: Refactor tunnel checksum offloading. All information required for checksum offloading can be deduced by already tracked dp_packet l3_ofs, l4_ofs, inner_l3_ofs and inner_l4_ofs fields. Remove DPDK specific l[2-4]_len from generic OVS code. netdev-dpdk code then fills mbuf specifics step by step: - outer_l2_len and outer_l3_len are needed for tunneling (and below features), - l2_len and l3_len are needed for IP and L4 checksum (and below features), - l4_len and tso_segsz are needed when doing TSO, Signed-off-by: David Marchand <david.marchand@redhat.com> Acked-by: Kevin Traynor <ktraynor@redhat.com> Signed-off-by: Kevin Traynor <ktraynor@redhat.com>	2024-06-06 17:10:29 +01:00
David Marchand	844a7cfa6e	netdev-dpdk: Use guest TSO segmentation size hint. In a typical setup like: guest A <-virtio-> OVS A <-vxlan-> OVS B <-virtio-> guest B TSO packets from guest A are segmented against the OVS A physical port mtu adjusted by the vxlan tunnel header size, regardless of guest A interface mtu. As an example, let's say guest A and guest B mtu are set to 1500 bytes. OVS A and OVS B physical ports mtu are set to 1600 bytes. Guest A will request TCP segmentation for 1448 bytes segments. On the other hand, OVS A will request 1498 bytes segments to the HW. This results in OVS B dropping packets because decapsulated packets are larger than the vhost-user port (serving guest B) mtu. 2024-04-17T14:13:01.239Z|00002|netdev_dpdk(pmd-c03/id:7)|WARN|vhost0: Too big size 1564 max_packet_len 1518 vhost-user ports expose a guest mtu by filling mbuf->tso_segsz. Use it as a hint. This may result in segments (on the wire) slightly shorter than the optimal size. Reported-at: https://github.com/openvswitch/ovs-issues/issues/321 Signed-off-by: David Marchand <david.marchand@redhat.com> Acked-by: Kevin Traynor <ktraynor@redhat.com> Signed-off-by: Kevin Traynor <ktraynor@redhat.com>	2024-06-06 17:10:11 +01:00
David Marchand	d618d09173	netdev-dpdk: Refactor TSO request code. Every L3, L4 checksum offload or TSO requires a (outer) L3 length to be provided. This length is computed via dp_packet_l4(pkt) that is always set when such offloads are requested in OVS. Getting a th == NULL is a bug in OVS, so an assert() is more appropriate. Besides, filling l4_len and tso_segsz only matters to TSO, so there is no need to set it for other L4 checksum offloading requests. Signed-off-by: David Marchand <david.marchand@redhat.com> Acked-by: Kevin Traynor <ktraynor@redhat.com> Signed-off-by: Kevin Traynor <ktraynor@redhat.com>	2024-06-06 17:10:05 +01:00
David Marchand	3d2c8223ab	netdev-dpdk: Fix inner checksum when outer is not supported. If outer checksum is not supported and OVS already set L3/L4 outer checksums in the packet, no outer mark should be left in ol_flags (as it confuses some driver, like net/ixgbe). l2_len must be adjusted to account for the tunnel header. Fixes: 084c8087292c ("userspace: Support VXLAN and GENEVE TSO.") Signed-off-by: David Marchand <david.marchand@redhat.com> Acked-by: Kevin Traynor <ktraynor@redhat.com> Signed-off-by: Kevin Traynor <ktraynor@redhat.com>	2024-06-06 17:09:58 +01:00
David Marchand	29abd07e4f	netdev-dpdk: Disable outer UDP checksum for net/iavf. Same as the commit 6f93d8e62f13 ("netdev-dpdk: Disable outer UDP checksum offload for ice/i40e driver."), disable outer UDP checksum and related offloads for net/iavf. Fixes: 084c8087292c ("userspace: Support VXLAN and GENEVE TSO.") Signed-off-by: David Marchand <david.marchand@redhat.com> Acked-by: Kevin Traynor <ktraynor@redhat.com> Signed-off-by: Kevin Traynor <ktraynor@redhat.com>	2024-06-06 17:09:52 +01:00
David Marchand	041d6adeda	netdev-dpdk: Fallback to non tunnel checksum offloading. The outer checksum offloading API in DPDK is ambiguous and was implemented by Intel folks in their drivers with the assumption that any outer offloading always goes with an inner offloading request. With net/i40e and net/ice drivers, in the case of encapsulating a ARP packet in a vxlan tunnel (which results in requesting outer ip checksum with a tunnel context but no inner offloading request), a Tx failure is triggered, associated with a port MDD event. 2024-03-27T16:02:07.084Z|00018|dpdk|WARN|ice_interrupt_handler(): OICR: MDD event To avoid this situation, if no checksum or segmentation offloading is requested on the inner part of a packet, fallback to "normal" (non outer) offloading request. Reported-at: https://github.com/openvswitch/ovs-issues/issues/321 Fixes: 084c8087292c ("userspace: Support VXLAN and GENEVE TSO.") Fixes: f81d782c1906 ("netdev-native-tnl: Mark all vxlan/geneve packets as tunneled.") Signed-off-by: David Marchand <david.marchand@redhat.com> Acked-by: Kevin Traynor <ktraynor@redhat.com> Signed-off-by: Kevin Traynor <ktraynor@redhat.com>	2024-06-06 17:09:37 +01:00
Roi Dayan	fb46f5d29a	netdev-dpdk: Improve error print to the user for flow control error. When failing to get flow control parameters use VLOG_WARN_BUF() to expose the error string in ovs-vsctl show. Signed-off-by: Roi Dayan <roid@nvidia.com> Suggested-by: Simon Horman <horms@ovn.org> Acked-by: Eli Britstein <elibr@nvidia.com> Signed-off-by: Simon Horman <horms@ovn.org>	2024-04-26 10:06:34 +01:00
Roi Dayan via dev	4f29804f24	netdev-dpdk: Fix possible memory leak configuring VF MAC address. VLOG_WARN_BUF() is allocating memory for the error string and should be used if the configuration cannot continue and error is being returned so the caller has indication of releasing the pointer. Change to VLOG_WARN() to keep the logic that error is not being returned. Fixes: f4336f504b17 ("netdev-dpdk: Add option to configure VF MAC address.") Signed-off-by: Roi Dayan <roid@nvidia.com> Acked-by: Gaetan Rivet <gaetanr@nvidia.com> Acked-by: Eli Britstein <elibr@nvidia.com> Signed-off-by: Simon Horman <horms@ovn.org>	2024-04-23 11:27:09 +01:00
Jun Wang	6f93d8e62f	netdev-dpdk: Disable outer UDP checksum offload for ice/i40e driver. Fixing the issue of incorrect outer UDP checksum in packets sent by E810 or X710. We disable RTE_ETH_TX_OFFLOAD_OUTER_UDP_CKSUM,but also disable all the dependent offloads like RTE_ETH_TX_OFFLOAD_VXLAN_TNL_TSO and RTE_ETH_TX_OFFLOAD_GENEVE_TNL_TSO. Fixes: 084c8087292c ("userspace: Support VXLAN and GENEVE TSO.") Reported-at: https://github.com/openvswitch/ovs-issues/issues/321 Signed-off-by: Jun Wang <junwang01@cestc.cn> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2024-03-22 20:36:50 +01:00
Ilya Maximets	0ce82ac45e	netdev-dpdk: Fix tunnel type check during Tx offload preparation. Tunnel types are not flags, but 4-bit fields, so checking them with a simple binary 'and' is incorrect and may produce false-positive matches. While the current implementation is unlikely to cause any issues today, since both RTE_MBUF_F_TX_TUNNEL_VXLAN and RTE_MBUF_F_TX_TUNNEL_GENEVE only have 1 bit set, it is risky to have this code and it may lead to problems if we add support for other tunnel types in the future. Use proper field checks instead. Also adding a warning for unexpected tunnel types in case something goes wrong. Fixes: 084c8087292c ("userspace: Support VXLAN and GENEVE TSO.") Acked-by: Mike Pattrick <mkp@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2024-03-16 02:06:20 +01:00
Ilya Maximets	05e9f05d14	netdev-dpdk: Fix TCP check during Tx offload preparation. RTE_MBUF_F_TX_TCP_CKSUM is not a flag, but a 2-bit field, so checking it with a simple binary 'and' is incorrect. For example, this check will succeed for a packet with UDP checksum requested as well. Fix the check to avoid wrongly initializing tso_segz and potentially accessing UDP header via TCP structure pointer. The IPv4 checksum flag has to be set for any L4 checksum request, regardless of the type, so moving this check out of the TCP condition. Fixes: 8b5fe2dc6080 ("userspace: Add Generic Segmentation Offloading.") Acked-by: Mike Pattrick <mkp@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2024-03-16 02:06:20 +01:00
Ilya Maximets	f8809760fc	netdev-dpdk: Clear inner packet marks if no inner offloads requested. In some cases only outer offloads may be requested for a tunneled packet. In this case there is no need to mark the type of an inner packet. Clean these flags up to avoid potential confusion of DPDK drivers. Fixes: 084c8087292c ("userspace: Support VXLAN and GENEVE TSO.") Acked-by: Mike Pattrick <mkp@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2024-03-16 02:06:20 +01:00
Ilya Maximets	7df30c86ce	netdev-dpdk: Clean up all marker flags if no offloads requested. Some drivers (primarily, Intel ones) do not expect any marking flags being set if no offloads are requested. If these flags are present, driver will fail Tx preparation or behave abnormally. For example, ixgbe driver will refuse to process the packet with only RTE_MBUF_F_TX_TUNNEL_GENEVE and RTE_MBUF_F_TX_OUTER_IPV4 set. This pretty much breaks Geneve tunnels on these cards. An extra check is added to make sure we don't have any unexpected Tx offload flags set. Fixes: 084c8087292c ("userspace: Support VXLAN and GENEVE TSO.") Reported-at: https://github.com/openvswitch/ovs-issues/issues/321 Acked-by: Mike Pattrick <mkp@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2024-03-13 16:29:25 +01:00
Ilya Maximets	2c4ffd2f8a	netdev-dpdk: Dump packets that fail Tx preparation. It's hard to debug situations where driver rejects packets for some reason. Dumping out the mbuf should help with that. Sample output looks like this: |netdev_dpdk(pmd-c03/id:8)|DBG|ovs-p1: First invalid packet: dump mbuf at 0x1180bce140, iova=0x2cb7ce400, buf_len=2176 pkt_len=64, ol_flags=0x2, nb_segs=1, port=65535, ptype=0 segment at 0x1180bce140, data=0x1180bce580, len=90, off=384, refcnt=1 Dump data at [0x1180bce580], len=64 00000000: 33 33 00 00 00 16 AA 27 91 F9 4D 96 86 DD 60 00 | 33.....'..M...`. 00000010: 00 00 00 24 00 01 00 00 00 00 00 00 00 00 00 00 | ...$............ 00000020: 00 00 00 00 00 00 FF 02 00 00 00 00 00 00 00 00 | ................ 00000030: 00 00 00 00 00 16 3A 00 05 02 00 00 01 00 8F 00 | ......:......... Acked-by: Eelco Chaudron <echaudro@redhat.com> Acked-by: Kevin Traynor <ktraynor@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2024-03-08 20:19:54 +01:00
David Marchand	3eb91a8d1b	netdev-dpdk: Trigger port reconfiguration in main thread for resets. When OVS (main thread) configures a DPDK netdev, it holds a netdev_dpdk mutex lock. As part of this configure operation, the net/iavf driver (used with i40e VF devices) triggers a queue count change. The PF entity (serviced by a kernel PF driver for example) handles this change and requests back that the VF driver resets the VF device. The driver then completes the VF reset operation on its side and waits for completion of the iavf-event thread responsible for handling various VF device events. On the other hand, handling of the VF reset request in this iavf-event thread results in notifying the application with a port reset request (RTE_ETH_EVENT_INTR_RESET). The OVS reset callback tries to take a hold of the same netdev_dpdk mutex and blocks the iavf-event thread. As a result, the net/iavf driver (still running on OVS main thread) is unable to complete as it is waiting for iavf-event to complete. To break from this situation, the OVS reset callback now won't take a netdev_dpdk mutex. Instead, the port reset request is stored in a simple RTE_ETH_MAXPORTS array associated to a seq object. This is enough to let the VF driver complete this port initialization. The OVS main thread later handles the port reset request. More details in the DPDK upstream bz as this issue appeared following a change in DPDK. Link: https://bugs.dpdk.org/show_bug.cgi?id=1337 Signed-off-by: David Marchand <david.marchand@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2024-01-19 13:52:57 +01:00
Dexia Li	084c808729	userspace: Support VXLAN and GENEVE TSO. For the userspace datapath, this patch provides VXLAN and Geneve tunnel TSO. Only userspace VXLAN and Geneve tunnels are supported, along with outer and inner tunnel checksum offload. If the netdev does not support the offload features, there is a software fallback. If the netdev does not support VXLAN and Geneve TSO, packets will be dropped. Front-end devices can also disable offload features with ethtool. Acked-by: Simon Horman <horms@ovn.org> Signed-off-by: Dexia Li <dexia.li@jaguarmicro.com> Co-authored-by: Mike Pattrick <mkp@redhat.com> Signed-off-by: Mike Pattrick <mkp@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2024-01-17 22:06:45 +01:00
Flavio Leitner	8b5fe2dc60	userspace: Add Generic Segmentation Offloading. This provides a software implementation in the case the egress netdev doesn't support segmentation in hardware. The challenge here is to guarantee packet ordering in the original batch that may be full of TSO packets. Each TSO packet can go up to ~64kB, so with segment size of 1440 that means about 44 packets for each TSO. Each batch has 32 packets, so the total batch amounts to 1408 normal packets. The segmentation estimates the total number of packets and then the total number of batches. Then allocate enough memory and finally do the work. Finally each batch is sent in order to the netdev. Signed-off-by: Flavio Leitner <fbl@sysclose.org> Co-authored-by: Mike Pattrick <mkp@redhat.com> Signed-off-by: Mike Pattrick <mkp@redhat.com> Acked-by: Simon Horman <horms@ovn.org> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2023-12-02 01:33:37 +01:00
Flavio Leitner	e0056018c4	userspace: Respect tso/gso segment size. Currently OVS will calculate the segment size based on the MTU of the egress port. That usually happens to be correct when the ports share the same MTU, but that is not always true. Therefore, if the segment size is provided, then use that and make sure the over sized packets are dropped. Signed-off-by: Flavio Leitner <fbl@sysclose.org> Co-authored-by: Mike Pattrick <mkp@redhat.com> Signed-off-by: Mike Pattrick <mkp@redhat.com> Acked-by: Simon Horman <horms@ovn.org> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2023-12-02 00:56:36 +01:00
Jakob Meng	c19a5b48bf	netdev-dpdk: Sync and clean {get, set}_config() callbacks. For better usability, the function pairs get_config() and set_config() for netdevs should be symmetric: Options which are accepted by set_config() should be returned by get_config() and the latter should output valid options for set_config() only. This patch moves key-value pairs which are not valid options from get_config() to the get_status() callback. For example, get_config() in lib/netdev-dpdk.c returned {configured,requested}_{rx,tx}_queues previously. For requested rx queues the proper option name is n_rxq, so requested_rx_queues has been renamed respectively. Tx queues cannot be changed by the user, hence requested_tx_queues has been dropped. Both configured_{rx,tx}_queues will be returned as n_{r,t}xq in the get_status() callback. The netdev dpdk classes no longer share a common get_config() callback, instead both the dpdk_class and the dpdk_vhost_client_class define their own callbacks. The get_config() callback for dpdk_vhost_class has been dropped because it does not have a set_config() callback. The documentation in vswitchd/vswitch.xml for status columns as well as tests have been updated accordingly. Reported-at: https://bugzilla.redhat.com/1949855 Signed-off-by: Jakob Meng <code@jakobmeng.de> Reviewed-by: Robin Jarry <rjarry@redhat.com> Signed-off-by: Kevin Traynor <ktraynor@redhat.com>	2023-11-14 11:03:35 +00:00
David Marchand	bb61931dc5	netdev-dpdk: Disable net/tap Tx L4 checksum offloads. As reported by Ales when doing some OVN integration tests with OVS 3.2, net/tap has broken L4 checksum offloads. Fixes are pending on DPDK side. Until they get in an LTS release used by OVS, disable those Tx offloads. Acked-by: Eelco Chaudron <echaudro@redhat.com> Signed-off-by: David Marchand <david.marchand@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2023-08-30 20:36:24 +02:00
David Marchand	9b7e1a7537	netdev-dpdk: Clear IP packet type when no offload is requested. OVS currently sets RTE_MBUF_F_TX_IPV[46] flags in early stages of the packet reception and keeps track of the IP packet type as the packet goes through OVS pipeline. When a packet leaves OVS and hits a DPDK driver, OVS may not request IP checksum offloading but leaves one of these packet type flags in ol_flags. The DPDK API describes that RTE_MBUF_F_TX_IPV4 must be set when requesting some Tx offloads (like RTE_MBUF_F_TX_IP_CKSUM, RTE_MBUF_F_TX_TCP_CKSUM, .., RTE_MBUF_F_TX_TCP_SEG). Setting RTE_MBUF_F_TX_IPV4 without requesting a Tx offload is undefined, and it can confuse some drivers (like net/iavf), which then read zeroed l2_len and l3_len and end up dropping the packet. Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=2231081 Fixes: 5d11c47d3ebe ("userspace: Enable IP checksum offloading by default.") Acked-by: Mike Pattrick <mkp@redhat.com> Signed-off-by: David Marchand <david.marchand@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2023-08-28 20:20:39 +02:00
Ivan Malov	d460c473eb	netdev-dpdk: Negotiate delivery of per-packet Rx metadata. This may be required by some PMDs in offload scenarios. Fixes: e8a2b5bf92bb ("netdev-dpdk: implement flow offload with rte flow") Signed-off-by: Ivan Malov <ivan.malov@arknetworks.am> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2023-07-25 14:42:41 +02:00
Adrian Moreno	6240c0b4c8	netdev: Add netdev_get_speed() to netdev API. Currently, the netdev's speed is being calculated by taking the link's feature bits (using netdev_get_features()) and transforming them into bps. This mechanism can be both inaccurate and difficult to maintain, mainly because we currently use the feature bits supported by OpenFlow which would have to be extended to support all new feature bits of all netdev implementations while keeping the OpenFlow API intact. In order to expose the link speed accurately for all current and future hardware, add a new netdev API call that allows the implementations to provide the current and maximum link speeds in Mbps. Internally, the logic to get the maximum supported speed still relies on feature bits so it might still get out of sync in the future. However, the maximum configurable speed is not used as much as the current speed and these feature bits are not exposed through the netdev interface so it should be easier to add more. Use this new function instead of netdev_get_features() where the link speed is needed. As a consequence of this patch, the link speed of cards is properly reported (internally in OVSDB) even if not supported by OpenFlow. A test verifies this behavior using a tap device. Also, in order to avoid use of the old API, this patch adds a checkpatch.py warning if the old API is used. Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=2137567 Acked-by: Eelco Chaudron <echaudro@redhat.com> Signed-off-by: Adrian Moreno <amorenoz@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2023-07-17 20:03:32 +02:00
Robin Jarry	fc06ea9a18	netdev-dpdk: Add custom rx-steering configuration. Some control protocols are used to maintain link status between forwarding engines (e.g. LACP). When the system is not sized properly, the PMD threads may not be able to process all incoming traffic from the configured Rx queues. When a signaling packet of such protocols is dropped, it can cause link flapping, worsening the situation. Use the rte_flow API to redirect these protocols into a dedicated Rx queue. The assumption is made that the ratio between control protocol traffic and user data traffic is very low and thus this dedicated Rx queue will never get full. Re-program the RSS redirection table to only use the other Rx queues. The additional Rx queue will be assigned a PMD core like any other Rx queue. Polling that extra queue may introduce increased latency and a slight performance penalty at the benefit of preventing link flapping. This feature must be enabled per port on specific protocols via the rx-steering option. This option takes "rss" followed by a "+" separated list of protocol names. It is only supported on ethernet ports. This feature is experimental. If the user has already configured multiple Rx queues on the port, an additional one will be allocated for control packets. If the hardware cannot satisfy the number of requested Rx queues, the last Rx queue will be assigned for control plane. If only one Rx queue is available, the rx-steering feature will be disabled. If the hardware does not support the rte_flow matchers/actions, the rx-steering feature will be completely disabled on the port and regular rss will be performed instead. It cannot be enabled when other-config:hw-offload=true as it may conflict with the offloaded flows. Similarly, if hw-offload is enabled, custom rx-steering will be forcibly disabled on all ports and replaced by regular rss. 
Example use:

  ovs-vsctl add-bond br-phy bond0 phy0 phy1 -- \
    set interface phy0 type=dpdk options:dpdk-devargs=0000:ca:00.0 -- \
    set interface phy0 options:rx-steering=rss+lacp -- \
    set interface phy1 type=dpdk options:dpdk-devargs=0000:ca:00.1 -- \
    set interface phy1 options:rx-steering=rss+lacp

As a starting point, only one protocol is supported: LACP. Other protocols can be added in the future. NIC compatibility should be checked.

To validate that this works as intended, I used a traffic generator to generate random traffic slightly above the machine capacity at line rate on a two-port bond interface. OVS is configured to receive traffic on two VLANs and pop/push them in a br-int bridge based on tags set on patch ports.

Test topology:

  DUT
    br-int: ports patch10, patch11
      in_port=patch10,actions=mod_dl_src:$patch11,mod_dl_dst:$tgen0,output:patch10
      in_port=patch11,actions=mod_dl_src:$patch10,mod_dl_dst:$tgen0,output:patch10
    br-phy: ports patch00 (tag:10), patch01 (tag:20)
      default flow, action=NORMAL
      bond0 (phy0, phy1): balance-slb, lacp=passive, lacp-time=fast
  switch
    lag (port0, port1): balance L3/L4, lacp=active, lacp-time=fast, mode trunk VLANs 10, 20
    port2 (vlan 10), port3 (vlan 20): mode access
  traffic generator
    tgen0, tgen1: Random traffic that is properly balanced across the bond ports in both directions.
  Links: patch10-patch00, patch11-patch01, phy0-port0, phy1-port1, port2-tgen0, port3-tgen1

Without rx-steering, the bond0 links are randomly switching to "defaulted" when one of the LACP packets sent by the switch is dropped because the RX queues are full and the PMD threads did not process them fast enough. 
When that happens, all traffic must go through a single link, which causes above line rate traffic to be dropped.

  ~# ovs-appctl lacp/show-stats bond0
  ---- bond0 statistics ----
  member: phy0:
    TX PDUs: 347246
    RX PDUs: 14865
    RX Bad PDUs: 0
    RX Marker Request PDUs: 0
    Link Expired: 168
    Link Defaulted: 0
    Carrier Status Changed: 0
  member: phy1:
    TX PDUs: 347245
    RX PDUs: 14919
    RX Bad PDUs: 0
    RX Marker Request PDUs: 0
    Link Expired: 147
    Link Defaulted: 1
    Carrier Status Changed: 0

When rx-steering is enabled, no LACP packet is dropped and the bond links remain enabled at all times, maximizing the throughput. Neither the "Link Expired" nor the "Link Defaulted" counters are incremented anymore.

This feature may be considered as "QoS". However, it does not work by limiting the rate of traffic explicitly. It only guarantees that some protocols have a lower chance of being dropped because the PMD cores cannot keep up with regular traffic. The choice of protocols is limited on purpose. This is not meant to be configurable by users. Some limited configurability could be considered in the future, but it would expose users to more potential issues if they accidentally redirect all traffic into the isolated queue. Acked-by: Kevin Traynor <ktraynor@redhat.com> Acked-by: Aaron Conole <aconole@redhat.com> Signed-off-by: Robin Jarry <rjarry@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2023-07-10 15:49:44 +02:00
David Marchand	a5669fd51c	netdev-dpdk: Drop TSO in case of conflicting virtio features. At some point in OVS history, some virtio features were announced as supported (ECN and UFO virtio features). The userspace TSO code, which was added later, does not support those features and tries to disable them. This breaks OVS upgrades: if an existing VM already negotiated such features, their lack on reconnection to an upgraded OVS triggers a vhost socket disconnection by Qemu. This results in an endless loop because Qemu then retries with the same set of virtio features. This patch tries to detect those vhost socket disconnections and falls back to restoring the old virtio features (and disabling TSO for this vhost port). Acked-by: Mike Pattrick <mkp@redhat.com> Acked-by: Simon Horman <simon.horman@corigine.com> Acked-by: Maxime Coquelin <maxime.coquelin@redhat.com> Signed-off-by: David Marchand <david.marchand@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2023-07-07 18:05:48 +02:00
Mike Pattrick	3337e6d91c	userspace: Enable L4 checksum offloading by default. The netdev receiving packets is supposed to provide the flags indicating if the L4 checksum was verified and it is OK or BAD, otherwise the stack will check when appropriate by software. If the packet comes with good checksum, then postpone the checksum calculation to the egress device if needed. When encapsulating a packet with that flag, set the checksum of the inner L4 header since that is not yet supported. Calculate the L4 checksum when the packet is going to be sent over a device that doesn't support the feature. Linux tap devices allow enabling L3 and L4 offload, so this patch enables the feature. However, the Linux socket interface remains disabled because the API doesn't allow enabling those two features without enabling TSO too. Signed-off-by: Flavio Leitner <fbl@sysclose.org> Co-authored-by: Flavio Leitner <fbl@sysclose.org> Signed-off-by: Mike Pattrick <mkp@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2023-06-15 23:50:30 +02:00
Mike Pattrick	5d11c47d3e	userspace: Enable IP checksum offloading by default. The netdev receiving packets is supposed to provide the flags indicating if the IP checksum was verified and it is GOOD or BAD, otherwise the stack will check when appropriate by software. If the packet comes with good checksum, then postpone the checksum calculation to the egress device if needed. When encapsulating a packet with that flag, set the checksum of the inner IP header since that is not yet supported. Calculate the IP checksum when the packet is going to be sent over a device that doesn't support the feature. Linux devices don't support IP checksum offload alone, so the support is not enabled. Signed-off-by: Flavio Leitner <fbl@sysclose.org> Co-authored-by: Flavio Leitner <fbl@sysclose.org> Signed-off-by: Mike Pattrick <mkp@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2023-06-15 23:49:51 +02:00
Mike Pattrick	4433cc6860	dpif-netdev: Show netdev offloading flags. This patch modifies netdev_get_status to include information about checksum offload status by port, allowing the user to gain insight into where checksum offloading is active. Signed-off-by: Flavio Leitner <fbl@sysclose.org> Co-authored-by: Flavio Leitner <fbl@sysclose.org> Signed-off-by: Mike Pattrick <mkp@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2023-06-15 15:44:57 +02:00

440 Commits