Using the SHORT version of the *_SAFE loops makes the code cleaner and less
error prone. So, use the SHORT version and remove the extra variable
when possible for hmap and all its derived types.
In order to be able to use both long and short versions without changing
the name of the macro for all the clients, overload the existing name
and select the appropriate version depending on the number of arguments.
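The selection-by-argument-count trick can be sketched with toy macros (the names below are illustrative, not the actual OVS *_SAFE macros): an argument-counting helper picks the long or short implementation.

```c
#include <assert.h>

/* Overload a macro on argument count: the extra trailing names shift
 * the chosen implementation into the 4th position.  Toy example of the
 * technique used for the *_SAFE loop macros. */
#define ADD_2(a, b)       ((a) + (b))
#define ADD_3(a, b, c)    ((a) + (b) + (c))
#define GET_4TH_ARG(_1, _2, _3, NAME, ...) NAME
#define ADD(...) GET_4TH_ARG(__VA_ARGS__, ADD_3, ADD_2, dummy)(__VA_ARGS__)

static int add_two(void)   { return ADD(1, 2); }    /* expands to ADD_2 */
static int add_three(void) { return ADD(1, 2, 3); } /* expands to ADD_3 */
```

Clients keep calling the same name; the preprocessor routes two-argument uses to the short version and three-argument uses to the long one.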
Acked-by: Dumitru Ceara <dceara@redhat.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
A flow may be modified after its initial offload failed. In this case,
according to [1], the modification is handled as a flow add.
For a vport flow "add", the orig_in_port should be provided.
Keep that field in the flow struct, so it can be provided in the flow
modification use case.
[1] 0d25621e4d9f ("dpif-netdev: Fix flow modification after failure.")
Fixes: b5e6f6f6bfbe ("dpif-netdev: Provide orig_in_port in metadata for tunneled packets.")
Signed-off-by: Eli Britstein <elibr@nvidia.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The same pattern for atomic stores and initialization was used for the
DPIF and MFEX function pointers declared in struct dp_netdev_pmd_thread.
Simplify this pattern for all stores to 'miniflow_extract_opt' and
'netdev_input_func'.
Also replace the first store to 'miniflow_extract_opt', which was an
atomic_store_relaxed(), with atomic_init().
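The initialization-vs-store distinction can be sketched with C11 atomics; the struct and function names below are placeholders, not the actual 'dp_netdev_pmd_thread' fields.

```c
#include <assert.h>
#include <stdatomic.h>

typedef int (*pkt_input_func)(int);

static int input_v1(int x) { return x + 1; }
static int input_v2(int x) { return x * 2; }

struct pmd_sketch {
    _Atomic pkt_input_func input;   /* read concurrently by other threads */
};

static void pmd_init(struct pmd_sketch *pmd)
{
    /* Before the struct is visible to other threads, plain
     * initialization suffices: atomic_init() carries no ordering cost. */
    atomic_init(&pmd->input, input_v1);
}

static void pmd_set_input(struct pmd_sketch *pmd, pkt_input_func f)
{
    /* Once shared, use a relaxed atomic store so concurrent readers
     * never observe a torn pointer. */
    atomic_store_explicit(&pmd->input, f, memory_order_relaxed);
}

static int atomic_demo(void)
{
    struct pmd_sketch pmd;

    pmd_init(&pmd);
    pmd_set_input(&pmd, input_v2);
    pkt_input_func f = atomic_load_explicit(&pmd.input,
                                            memory_order_relaxed);
    return f(21);
}
```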
Signed-off-by: Cian Ferriter <cian.ferriter@intel.com>
Acked-by: Kumar Amber <kumar.amber@intel.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The PMD reference taken is not actually used; it is only needed to
get the dp_netdev linked. Additionally, taking the PMD reference
does not protect against the disappearance of the dp_netdev,
so it is misleading.
The dp reference is protected by the way the ports are being deleted
during datapath deletion. No further offload request should be found
past a flush, so it is safe to keep this reference in the offload item.
Signed-off-by: Gaetan Rivet <grive@u256.net>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
In dp_netdev_pmd_remove_flow() we schedule the deletion of an
offloaded flow, if a mark has been assigned to the flow. But if
this occurs in the window in which the offload thread completes
offloading the flow and assigns a mark to the flow, then we miss
deleting the flow. This problem has been observed while adding
and deleting flows in a loop. To fix this, always enqueue flow
deletion regardless of the flow->mark being set.
Fixes: 241bad15d99a ("dpif-netdev: associate flow with a mark id")
Co-authored-by: Gaetan Rivet <grive@u256.net>
Signed-off-by: Gaetan Rivet <grive@u256.net>
Signed-off-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Port flush and offload uninit should be moved after the datapath
has been reconfigured. That way, no other thread such as PMDs will
find this port to poll and enqueue further offload requests.
After a flush, almost no further offload request for this port should
be found in the queue.
There will still be some issued by revalidators, but they
will be caught when the offload thread fails to take a netdev ref.
This change fixes the issue of datapath reference being improperly
accessed by offload threads while it is being destroyed.
Fixes: 5b0aa55776cb ("dpif-netdev: Execute flush from offload thread.")
Fixes: 62d1c28e9ce0 ("dpif-netdev: Flush offload rules upon port deletion.")
Suggested-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
Signed-off-by: Gaetan Rivet <grive@u256.net>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The subtable search function can be used at any time by a PMD thread.
Setting the subtable search function should be done atomically to
prevent garbage data from being read.
A dpcls_subtable_lookup_reprobe() call can happen at the same time that
DPCLS subtables are being sorted. This could lead to modifications done
by the reprobe() to be lost. Prevent this from happening by locking on
pmd->flow_mutex. After this change both the reprobe function and a
subtable sort will share the flow_mutex preventing modifications by
either one from being lost.
Also remove the pvector_publish() call. The pvector is not being changed
in dpcls_subtable_lookup_reprobe(), only the data pointed to by pointers
in the vector are being changed.
Fixes: 3d018c3ea79d ("dpif-netdev: add subtable lookup prio set command.")
Reported-by: Ilya Maximets <i.maximets@ovn.org>
Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2022-January/390757.html
Signed-off-by: Cian Ferriter <cian.ferriter@intel.com>
Acked-by: Ilya Maximets <i.maximets@ovn.org>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
It's actually undefined behavior to pass NULL to standard library
functions that manipulate arrays (e.g., qsort, memcpy, memcmp), even if
the passed number of items is 0.
UB Sanitizer reports:
ovsdb/monitor.c:408:9: runtime error: null pointer passed as argument 1,
which is declared to never be null
#0 0x406ae1 in ovsdb_monitor_columns_sort ovsdb/monitor.c:408
#1 0x406ae1 in ovsdb_monitor_add ovsdb/monitor.c:1683
[...]
lib/ovsdb-data.c:1970:5: runtime error: null pointer passed as argument 2,
which is declared to never be null
#0 0x4071c8 in ovsdb_datum_push_unsafe lib/ovsdb-data.c:1970
#1 0x471cd0 in ovsdb_datum_apply_diff_in_place lib/ovsdb-data.c:2345
[...]
ofproto/ofproto-dpif-rid.c:159:17:
runtime error: null pointer passed as argument 1,
which is declared to never be null
#0 0x4df5d8 in frozen_state_equal ofproto/ofproto-dpif-rid.c:159
#1 0x4dfd27 in recirc_find_equal ofproto/ofproto-dpif-rid.c:179
[...]
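The fix pattern reduces to guarding the degenerate zero-length case before calling the libc functions. The helpers below are hypothetical sketches, not the OVS functions named in the reports.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *) a, y = *(const int *) b;

    return (x > y) - (x < y);
}

/* Passing NULL to qsort() is UB even when n == 0, so skip the call
 * entirely when there is no data. */
static void sort_ints(int *arr, size_t n)
{
    if (n) {
        qsort(arr, n, sizeof *arr, cmp_int);
    }
}

/* Same guard for memcmp(): zero items compare equal without touching
 * the (possibly NULL) pointers. */
static int equal_ints(const int *a, const int *b, size_t n)
{
    return !n || !memcmp(a, b, n * sizeof *a);
}

static int null_guard_demo(void)
{
    int v[3] = { 3, 1, 2 };

    sort_ints(v, 3);
    sort_ints(NULL, 0);               /* guarded: no UB */
    return v[0] == 1 && v[1] == 2 && v[2] == 3
           && equal_ints(NULL, NULL, 0);
}
```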
Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Dumitru Ceara <dceara@redhat.com>
Acked-by: Paolo Valerio <pvalerio@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Read the user configuration in the netdev-offload module to modify the
number of threads used to manage hardware offload requests.
This allows processing insertion, deletion and modification
concurrently.
The offload thread structure was modified to contain all needed
elements. This structure is multiplied by the number of requested
threads and used separately.
Signed-off-by: Gaetan Rivet <grive@u256.net>
Reviewed-by: Eli Britstein <elibr@nvidia.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The port mutex protects the netdev mapping, that can be changed by port
addition or port deletion. HW offloads operations can be considered read
operations on the port mapping itself. Use a rwlock to differentiate
between read and write operations, allowing concurrent queries and
offload insertions.
Because offload queries, deletion, and reconfigure_datapath() calls are
all rdlock, the deadlock fixed by [1] is still avoided, as the rdlock
side is recursive as prescribed by the POSIX standard. Executing
'reconfigure_datapath()' only requires an rdlock, but it is sometimes
executed in contexts where wrlock is taken ('do_add_port()' and
'do_del_port()').
This means that the deadlock described in [2] is still valid and should
be mitigated. The rdlock is taken using 'tryrdlock()' during offload query,
keeping the current behavior.
[1]: 81e89d5c2645 ("dpif-netdev: Make datapath port mutex recursive.")
[2]: 12d0edd75eba ("dpif-netdev: Avoid deadlock with offloading during PMD
thread deletion.").
Signed-off-by: Gaetan Rivet <grive@u256.net>
Reviewed-by: Eli Britstein <elibr@nvidia.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
In later commits hardware offloads are managed in several threads.
Each offload is managed by a thread determined by its flow's 'mega_ufid'.
As megaflow-to-mark and mark-to-flow mappings are 1:1 and 1:N
respectively, a single mark exists for a single 'mega_ufid', and
multiple flows use the same 'mega_ufid'. Because the managing thread
is chosen using the 'mega_ufid', each mapping does not need to be
shared with other offload threads.
The mappings are kept in cmaps, as upcalls will sometimes query them
before enqueuing orders to the offload threads.
To prepare this change, move the mappings within the offload thread
structure.
Signed-off-by: Gaetan Rivet <grive@u256.net>
Reviewed-by: Eli Britstein <elibr@nvidia.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The dataplane threads (PMDs) send offloading commands to a dedicated
offload management thread. The current implementation uses a lock
and benchmarks show a high contention on the queue in some cases.
With high contention, the mutex will more often lead to the locking
thread yielding in wait, using a syscall. This should be avoided in
a userland dataplane.
The mpsc-queue can be used instead. It uses less cycles and has
lower latency. Benchmarks show better behavior as multiple
revalidators and one or multiple PMDs write to a single queue
while another thread polls it.
One trade-off with the new scheme, however, is that the offload
thread is forced to poll the queue. Without a mutex, a cond_wait
cannot be used for signaling. The offload thread implements
an exponential backoff and will sleep in short increments when no
data is available. This makes the thread yield, at the price of
some latency to manage offloads after an inactivity period.
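The backoff policy described above can be sketched as a small schedule function; the constants and name are illustrative, not the actual implementation.

```c
#include <assert.h>
#include <stdint.h>

#define BACKOFF_START_US 1
#define BACKOFF_MAX_US   64

/* Exponential backoff for the polling offload thread: double the
 * sleep after each empty poll, reset it as soon as work is found.
 * The caller would usleep() for the returned duration between polls. */
static uint32_t next_backoff(uint32_t cur_us, int had_work)
{
    if (had_work) {
        return 0;                   /* busy: poll again immediately */
    }
    if (cur_us == 0) {
        return BACKOFF_START_US;    /* first empty poll: start small */
    }
    return cur_us * 2 > BACKOFF_MAX_US ? BACKOFF_MAX_US : cur_us * 2;
}
```

This yields the CPU during inactivity at the price of a bounded extra latency (at most the maximum backoff) before new offloads are picked up.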
Signed-off-by: Gaetan Rivet <grive@u256.net>
Reviewed-by: Eli Britstein <elibr@nvidia.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Add a lock to access the ufid to rte_flow map. This will protect it
from concurrent write accesses when multiple threads attempt it.
At this point, the reason for taking the lock is no longer to fulfill
the needs of the DPDK offload implementation. Rewrite the comments
to reflect this change. The lock is still needed to protect against
changes to netdev port mapping.
Signed-off-by: Gaetan Rivet <grive@u256.net>
Reviewed-by: Eli Britstein <elibr@nvidia.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
When a port is deleted, its offloads must be flushed. The operation
runs in the thread that initiated it. Offload data is thus accessed
jointly by the port deletion thread(s) and the offload thread, which
complicates the data access model.
To simplify this model, as a pre-step toward introducing parallel
offloads, execute the flush operation in the offload thread.
Signed-off-by: Gaetan Rivet <grive@u256.net>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Offload requests currently support only flow offloads.
As a pre-step before supporting an offload flush request,
modify the layout of an offload request item, to become a tagged union.
Future offload types won't be forced to re-use the full flow offload
structure, which consumes a lot of memory.
Signed-off-by: Gaetan Rivet <grive@u256.net>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Use the netdev-offload multithread API to allow multiple threads to
allocate marks concurrently.
Initialize the pool only once in a multithreaded context by using
the ovsthread_once type.
Use the id-fpool module for faster concurrent ID allocation.
Signed-off-by: Gaetan Rivet <grive@u256.net>
Reviewed-by: Eli Britstein <elibr@nvidia.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Profiling the HW offload thread, the flow offload freeing takes
approximately 25% of the time. Most of this time is spent waiting on
the futex used by the libc free(), as it triggers a syscall and
reschedules the thread.
Avoid the syscall and its expensive context switch. Batch the offload
messages freeing using the RCU.
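The batching idea can be sketched with a stand-in for the RCU deferral ('ovsrcu_postpone()' in OVS); the structure and names here are illustrative only.

```c
#include <assert.h>
#include <stdlib.h>

#define BATCH_MAX 32

/* Instead of one futex-prone free() per offload message, queue the
 * pointers and release the whole batch once a grace period elapses. */
struct free_batch {
    void *ptrs[BATCH_MAX];
    int n;
};

static void defer_free(struct free_batch *b, void *p)
{
    b->ptrs[b->n++] = p;            /* no syscall on the hot path */
}

static int flush_batch(struct free_batch *b) /* "grace period" ends */
{
    int released = b->n;

    for (int i = 0; i < b->n; i++) {
        free(b->ptrs[i]);
    }
    b->n = 0;
    return released;
}

static int rcu_free_demo(void)
{
    struct free_batch b = { .n = 0 };

    for (int i = 0; i < 4; i++) {
        defer_free(&b, malloc(8));
    }
    return flush_batch(&b);
}
```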
Signed-off-by: Gaetan Rivet <grive@u256.net>
Reviewed-by: Eli Britstein <elibr@nvidia.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Similar to what was done for the PMD threads [1], reduce the performance
impact of quiescing too often in the offload thread.
After each processed offload, the offload thread currently quiesces
and will sync with RCU. This synchronization can be lengthy and makes
the thread unnecessarily slow.
Instead, attempt to quiesce every 10 ms at most. While the queue is
empty, the offload thread remains quiescent.
[1]: 81ac8b3b194c ("dpif-netdev: Do RCU synchronization at fixed interval
in PMD main loop.")
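The interval policy reduces to a small predicate; the constant and names below are illustrative.

```c
#include <assert.h>
#include <stdint.h>

#define QUIESCE_INTERVAL_MS 10

/* Quiesce only when the queue is empty (nothing to delay) or when the
 * interval since the last quiescent period has elapsed, instead of
 * after every processed offload. */
static int should_quiesce(uint64_t now_ms, uint64_t last_quiesce_ms,
                          int queue_empty)
{
    return queue_empty || now_ms - last_quiesce_ms >= QUIESCE_INTERVAL_MS;
}
```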
Signed-off-by: Gaetan Rivet <grive@u256.net>
Reviewed-by: Eli Britstein <elibr@nvidia.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
In the netdev datapath, keep track of the enqueued offloads between
the PMDs and the offload thread. Additionally, query each netdev
for their hardware offload counters.
Signed-off-by: Gaetan Rivet <grive@u256.net>
Reviewed-by: Eli Britstein <elibr@nvidia.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The offload management in userspace is done through a separate thread.
The naming of the structure holding the objects used for synchronization
with the dataplane is generic and nondescript.
Clarify the object's purpose by renaming it.
Signed-off-by: Gaetan Rivet <grive@u256.net>
Reviewed-by: Eli Britstein <elibr@nvidia.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Expose a function to query datapath offload statistics.
This function is separate from the current one in netdev-offload
as it exposes more detailed statistics from the datapath, instead of
only from the netdev-offload provider.
Each datapath is meant to use the custom counters as it sees fit for its
handling of hardware offloads.
Call the new API from dpctl.
Signed-off-by: Gaetan Rivet <grive@u256.net>
Reviewed-by: Eli Britstein <elibr@nvidia.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
ovs_strlcpy silently fails to copy the thread name if it is too long.
Rename the flow offload thread to differentiate it from the main thread.
Fixes: 02bb2824e51d ("dpif-netdev: do hw flow offload in a thread")
Signed-off-by: Gaetan Rivet <grive@u256.net>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
This patch adds a new hash Tx steering mode that
distributes the traffic on all the Tx queues, regardless of
the number of PMD threads. It would be useful for guests
expecting traffic to be distributed on all the vCPUs.
The idea here is to re-use the 5-tuple hash of the packets,
already computed to build the flows batches (and so it
does not provide flexibility on which fields are part of
the hash).
There is also no user-configurable indirection table,
given that the feature is transparent to the guest. The queue
selection is just a modulo operation between the packet
hash and the number of Tx queues.
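The queue selection thus reduces to a single modulo over the already-computed packet hash; the function name is illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Hash Tx steering: reuse the 5-tuple hash already computed for flow
 * batching and pick a Tx queue by simple modulo. */
static uint32_t hash_txq_select(uint32_t pkt_hash, uint32_t n_txq)
{
    return pkt_hash % n_txq;
}
```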
There are no (at least intentional) functional changes
for the existing XPS and static modes. There should not be
noticeable performance changes for these modes (only one
more branch in the hot path).
For the hash mode, performance could be impacted due to
locking when multiple PMD threads are in use (same as
XPS mode) and also because of the second level of batching.
Regarding the batching, the existing Tx port output_pkts
is not modified. It means that at maximum, NETDEV_MAX_BURST
can be batched for all the Tx queues. A second level of
batching is done in dp_netdev_pmd_flush_output_on_port(),
only for this hash mode.
Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Reviewed-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
A boolean is currently used to differentiate between the
static and XPS Tx queue modes.
Since we are going to introduce a new steering mode, replace
this boolean with an enum.
This patch does not introduce functional changes.
Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Reviewed-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The encap & decap actions are extended to support the MPLS packet type.
Encap & decap actions add and remove an MPLS header at the start of
the packet.
The existing PUSH MPLS & POP MPLS actions insert & remove an MPLS
header between the ethernet header and the IP header. Though this
behaviour is fine for L3 VPN, where an IP packet is encapsulated inside
an MPLS tunnel, it does not satisfy the L2 VPN requirements. In L2 VPN
the ethernet packets must be encapsulated inside an MPLS tunnel.
In this change the encap & decap actions are extended to support the
MPLS packet type. The encap & decap actions add and remove the MPLS
header at the start of the packet as depicted below.
Encapsulation:
Actions - encap(mpls),encap(ethernet)
Incoming packet -> | ETH | IP | Payload |
1 Actions - encap(mpls) [Datapath action - ADD_MPLS:0x8847]
Outgoing packet -> | MPLS | ETH | Payload|
2 Actions - encap(ethernet) [ Datapath action - push_eth ]
Outgoing packet -> | ETH | MPLS | ETH | Payload|
Decapsulation:
Incoming packet -> | ETH | MPLS | ETH | IP | Payload |
Actions - decap(),decap(packet_type(ns=0,type=0))
1 Actions - decap() [Datapath action - pop_eth]
Outgoing packet -> | MPLS | ETH | IP | Payload|
2 Actions - decap(packet_type(ns=0,type=0)) [Datapath action - POP_MPLS:0x6558]
Outgoing packet -> | ETH | IP | Payload|
Signed-off-by: Martin Varghese <martin.varghese@nokia.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Previously in OVS, a PMD thread running on cpu X used lcore X.
This assumption limited OVS to run PMD threads on physical cpu <
RTE_MAX_LCORE.
DPDK 20.08 introduced a new API that associates a non-EAL thread to a free
lcore. This new API does not change the thread characteristics (like CPU
affinity) and let OVS run its PMD threads on any cpu regardless of
RTE_MAX_LCORE.
The DPDK multiprocess feature is not compatible with this new API and is
disabled.
DPDK still limits the number of lcores to RTE_MAX_LCORE (128 on x86_64)
which should be enough for OVS pmd threads (hopefully).
The DPDK lcore/OVS PMD thread mapping is logged when trying to
attach an OVS PMD thread, and when detaching.
A new command is added to help get the DPDK point of view of the
lcores at any time:
$ ovs-appctl dpdk/lcore-list
lcore 0, socket 0, role RTE, cpuset 0
lcore 1, socket 0, role NON_EAL, cpuset 1
lcore 2, socket 0, role NON_EAL, cpuset 15
Signed-off-by: David Marchand <david.marchand@redhat.com>
Acked-by: Kevin Traynor <ktraynor@redhat.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The hw_miss_packet_recover() API results in performance degradation, for
ports that are either not offload capable or do not support this specific
offload API.
For example, in the test configuration shown below, the vhost-user port
does not support offloads and the VF port doesn't support hw_miss offload
API. But because tunnel offload needs to be configured in other bridges
(br-vxlan and br-phy), OVS has been built with -DALLOW_EXPERIMENTAL_API.
br-vhost br-vxlan br-phy
vhost-user<-->VF VF-Rep<-->VxLAN uplink-port
For every packet between the VF and the vhost-user ports, hw_miss API is
called even though it is not supported by the ports involved. This leads
to significant performance drop (~3x in some cases; both cycles and pps).
Return EOPNOTSUPP when this API fails for a device that doesn't support it
and avoid this API on that port for subsequent packets.
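The avoidance logic can be sketched as a per-port flag set on the first EOPNOTSUPP; the struct and function names are hypothetical, and the stub stands in for the unsupported driver call.

```c
#include <assert.h>
#include <errno.h>

struct port_sketch {
    int hw_miss_unsupported;     /* set after the first EOPNOTSUPP */
};

static int hw_miss_recover_stub(void)
{
    return EOPNOTSUPP;           /* driver without the offload API */
}

/* Call the recover API only while the port is not known to lack it;
 * after one EOPNOTSUPP, skip the expensive call for all later packets. */
static int try_hw_miss_recover(struct port_sketch *port, int *calls)
{
    int err;

    if (port->hw_miss_unsupported) {
        return EOPNOTSUPP;
    }
    (*calls)++;
    err = hw_miss_recover_stub();
    if (err == EOPNOTSUPP) {
        port->hw_miss_unsupported = 1;
    }
    return err;
}

static int hw_miss_demo(void)
{
    struct port_sketch port = { 0 };
    int calls = 0;

    try_hw_miss_recover(&port, &calls);
    try_hw_miss_recover(&port, &calls);
    return calls;                /* only the first packet paid the cost */
}
```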
Signed-off-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
There are cases where users might want simple forwarding or drop rules
for all packets received from a specific port, e.g.:
"in_port=1,actions=2"
"in_port=2,actions=IN_PORT"
"in_port=3,vlan_tci=0x1234/0x1fff,actions=drop"
"in_port=4,actions=push_vlan:0x8100,set_field:4196->vlan_vid,output:3"
There are also cases where complex OpenFlow rules can be simplified
down to datapath flows with very simple match criteria.
In theory, for very simple forwarding, OVS doesn't need to parse
packets at all in order to follow these rules. "Simple match" lookup
optimization is intended to speed up packet forwarding in these cases.
Design:
Due to various implementation constraints userspace datapath has
following flow fields always in exact match (i.e. it's required to
match at least these fields of a packet even if the OF rule doesn't
need that):
- recirc_id
- in_port
- packet_type
- dl_type
- vlan_tci (CFI + VID) - in most cases
- nw_frag - for ip packets
Not all of these fields are related to the packet itself. We already
know the current 'recirc_id' and the 'in_port' before starting the
packet processing. It also seems safe to assume that we're working
with Ethernet packets. So, for the simple OF rule we need to match
only on 'dl_type', 'vlan_tci' and 'nw_frag'.
'in_port', 'dl_type', 'nw_frag' and 13 bits of 'vlan_tci' can be
combined into a single 64-bit integer (mark) that can be used as a
hash in a hash map. We are using only the VID and CFI from the
'vlan_tci'; flows that need to match on PCP will not qualify for the
optimization.
The workaround for matching on the non-existence of a vlan is updated
to match on CFI and VID only, in order to qualify for the optimization.
CFI is always set by OVS if a vlan is present in a packet, so there is
no need to match on PCP in this case. 'nw_frag' takes 2 bits of PCP
inside the simple match mark.
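The mark construction amounts to plain bit packing. The exact bit layout below is an assumption for illustration; only the set of packed fields (in_port, dl_type, 13 bits of vlan_tci, nw_frag in place of the PCP bits) comes from the description above.

```c
#include <assert.h>
#include <stdint.h>

/* Pack the "simple match" fields into one 64-bit mark usable as a
 * hash-map key.  Layout (illustrative): in_port in the high 32 bits,
 * dl_type in bits 16..31, nw_frag in the 2 bits where PCP would be,
 * and VID + CFI in the low 13 bits. */
static uint64_t simple_match_mark(uint32_t in_port, uint16_t dl_type,
                                  uint16_t vlan_tci, uint8_t nw_frag)
{
    uint64_t mark = 0;

    mark |= (uint64_t) in_port << 32;
    mark |= (uint64_t) dl_type << 16;
    mark |= vlan_tci & 0x1fff;                 /* VID + CFI; PCP masked */
    mark |= (uint64_t) (nw_frag & 0x3) << 13;  /* 2 bits where PCP was */
    return mark;
}
```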
A new per-PMD flow table, 'simple_match_table', is introduced to store
simple match flows only. 'dp_netdev_flow_add' adds a flow to the
usual 'flow_table' and to the 'simple_match_table' if the flow
meets the following constraints:
- 'recirc_id' in flow match is 0.
- 'packet_type' in flow match is Ethernet.
- Flow wildcards contain only the minimal set of non-wildcarded fields
(listed above).
If the number of flows for the current 'in_port' in the regular
'flow_table' equals the number of flows for the current 'in_port' in
the 'simple_match_table',
we may use simple match optimization, because all the flows we have
are simple match flows. This means that we only need to parse
'dl_type', 'vlan_tci' and 'nw_frag' to perform packet matching.
Now we make the unique flow mark from the 'in_port', 'dl_type',
'nw_frag' and 'vlan_tci' and look it up in the 'simple_match_table'.
On successful lookup we don't need to run full 'miniflow_extract()'.
An unsuccessful lookup technically means that we have no suitable
flow in the datapath and an upcall will be required. So, in this case
EMC and
SMC lookups are disabled. We may optimize this path in the future by
bypassing the dpcls lookup too.
The performance improvement of this solution on 'simple match' flows
should be comparable with partial HW offloading, because it parses the
same packet fields and uses a similar flow lookup scheme.
However, unlike partial HW offloading, it works for all port types
including virtual ones.
Performance results when compared to EMC:
Test setup:
virtio-user OVS virtio-user
Testpmd1 ------------> pmd1 ------------> Testpmd2
(txonly) x<------ pmd2 <------------ (mac swap)
Single stream of 64-byte packets. Actions:
in_port=vhost0,actions=vhost1
in_port=vhost1,actions=vhost0
Stats collected from pmd1 and pmd2, so there are 2 scenarios:
Virt-to-Virt : Testpmd1 ------> pmd1 ------> Testpmd2.
Virt-to-NoCopy : Testpmd2 ------> pmd2 --->x Testpmd1.
Here the packet sent from pmd2 to Testpmd1 is always dropped, because
the virtqueue is full since Testpmd1 is in txonly mode and doesn't
receive any packets. This should be closer to the performance of a
VM-to-Phy scenario.
Test performed on machine with Intel Xeon CPU E5-2690 v4 @ 2.60GHz.
Table below represents improvement in throughput when compared to EMC.
+----------------+------------------------+------------------------+
| | Default (-g -O2) | "-Ofast -march=native" |
| Scenario +------------+-----------+------------+-----------+
| | GCC | Clang | GCC | Clang |
+----------------+------------+-----------+------------+-----------+
| Virt-to-Virt | +18.9% | +25.5% | +10.8% | +16.7% |
| Virt-to-NoCopy | +24.3% | +33.7% | +14.9% | +22.0% |
+----------------+------------+-----------+------------+-----------+
For Phy-to-Phy case performance improvement should be even higher, but
it's not the main use-case for this functionality. Performance
difference for the non-simple flows is within a margin of error.
Acked-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The last RX queue, from which the packet was received, is already
stored in the PMD context. So, we can get the netdev from it without
the expensive hash map lookup.
In my V2V testing this patch improves performance in case HW offload
and experimental APIs are enabled by about 3%. That narrows down the
performance difference with the case with experimental API disabled
to about 0.5%, which is way within a margin of error for that setup.
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Eli Britstein <elibr@nvidia.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Previously, when a user enabled PMD auto load balancer with
pmd-auto-lb="true", some conditions such as number of PMDs/RxQs
that were required for a rebalance to take place were checked.
If the configuration meant that a rebalance would not take place
then PMD ALB was logged as 'disabled' and not run.
Later, if the PMD/RxQ configuration changed whereby a rebalance
could be effective, PMD ALB was logged as 'enabled' and would run at
the appropriate time.
This worked ok from a functional view but it is unintuitive for the
user reading the logs.
e.g. with one PMD (PMD ALB would not be effective)
User enables ALB, but logs say it is disabled because it won't run.
$ ovs-vsctl set open_vSwitch . other_config:pmd-auto-lb="true"
|dpif_netdev|INFO|PMD auto load balance is disabled
No dry run takes place.
Add more PMDs (PMD ALB may be effective).
$ ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=50
|dpif_netdev|INFO|PMD auto load balance is enabled ...
Dry run takes place.
|dpif_netdev|DBG|PMD auto load balance performing dry run.
A better approach is to simply reflect back the user enable/disable
state in the logs and deal with whether the rebalance will be effective
when needed. That is the approach taken in this patch.
To cut down on unnecessary work, some basic checks are also made before
starting a PMD ALB dry run and debug logs can indicate this to the user.
e.g. with one PMD (PMD ALB would not be effective)
User enables ALB, and logs confirm the user has enabled it.
$ ovs-vsctl set open_vSwitch . other_config:pmd-auto-lb="true"
|dpif_netdev|INFO|PMD auto load balance is enabled...
No dry run takes place.
|dpif_netdev|DBG|PMD auto load balance nothing to do, not enough non-isolated PMDs or RxQs.
Add more PMDs (PMD ALB may be effective).
$ ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=50
Dry run takes place.
|dpif_netdev|DBG|PMD auto load balance performing dry run.
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: Sunil Pai G <sunil.pai.g@intel.com>
Reviewed-by: David Marchand <david.marchand@redhat.com>
Acked-by: Gaetan Rivet <grive@u256.net>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
When a PMD reload occurs, some PMD cycle measurements are reset.
In order to preserve the full cycles history of an Rxq, the RxQ
cycle measurements were not reset.
These are both used together to display the % of a PMD that an
RxQ is using in the pmd-rxq-show stat.
Resetting one and not the other can lead to some unintuitive
looking stats while the stats settle for the new config. In
some cases, it may appear like the RxQs are using >100% of a PMD.
This resolves when the stats settle for the new config, but
seeing RxQs apparently using >100% of a PMD may confuse a user
and lead them to think there is a bug.
To avoid this, reset the RxQ cycle measurements on PMD reload.
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Acked-by: Gaetan Rivet <grive@u256.net>
Acked-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
This patch adds a general way of viewing/configuring datapath
cache sizes, with an implementation for the netlink interface.
The ovs-dpctl/ovs-appctl show commands will display the
current cache sizes configured:
$ ovs-dpctl show
system@ovs-system:
lookups: hit:25 missed:63 lost:0
flows: 0
masks: hit:282 total:0 hit/pkt:3.20
cache: hit:4 hit-rate:4.54%
caches:
masks-cache: size:256
port 0: ovs-system (internal)
port 1: br-int (internal)
port 2: genev_sys_6081 (geneve: packet_type=ptap)
port 3: br-ex (internal)
port 4: eth2
port 5: sw0p1 (internal)
port 6: sw0p3 (internal)
A specific cache can be configured as follows:
$ ovs-appctl dpctl/cache-set-size DP CACHE SIZE
$ ovs-dpctl cache-set-size DP CACHE SIZE
For example to disable the cache do:
$ ovs-dpctl cache-set-size system@ovs-system masks-cache 0
Setting cache size successful, new size 0.
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Paolo Valerio <pvalerio@redhat.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The IP fragmentation engine may not only steal the packet but also
add more. For example, after receiving the last fragment, it will
add all previous fragments to the batch. Unfortunately, it will also
free the original last fragment and replace it with a copy.
This invalidates the 'packet_clone' pointer in the dpif_netdev_execute()
leading to the use-after-free:
==3525086==ERROR: AddressSanitizer: heap-use-after-free on
address 0x61600020439c at pc 0x000000688a6d
READ of size 1 at 0x61600020439c thread T0
#0 0x688a6c in dp_packet_swap ./lib/dp-packet.h:265:5
#1 0x68781d in dpif_netdev_execute lib/dpif-netdev.c:4103:9
#2 0x6675db in dpif_netdev_operate lib/dpif-netdev.c:4129:25
#3 0x691e5e in dpif_operate lib/dpif.c:1367:13
#4 0x692909 in dpif_execute lib/dpif.c:1321:9
#5 0x5b19c6 in packet_execute ofproto/ofproto-dpif.c:4991:5
#6 0x5a2861 in ofproto_packet_out_finish ofproto/ofproto.c:3662:5
#7 0x5a65c6 in do_bundle_commit ofproto/ofproto.c:8270:13
#8 0x5a0cae in handle_bundle_control ofproto/ofproto.c:8309:17
#9 0x59a476 in handle_single_part_openflow ofproto/ofproto.c:8593:16
#10 0x5877ac in handle_openflow ofproto/ofproto.c:8674:21
#11 0x6296f1 in ofconn_run ofproto/connmgr.c:1329:13
#12 0x62925d in connmgr_run ofproto/connmgr.c:356:9
#13 0x586904 in ofproto_run ofproto/ofproto.c:1879:5
#14 0x55c830 in bridge_run__ vswitchd/bridge.c:3251:9
#15 0x55c015 in bridge_run vswitchd/bridge.c:3310:5
#16 0x575f31 in main vswitchd/ovs-vswitchd.c:127:9
#17 0x7f01099d3492 in __libc_start_main (/lib64/libc.so.6+0x23492)
#18 0x47d96d in _start (vswitchd/ovs-vswitchd+0x47d96d)
0x61600020439c is located 28 bytes inside of 560-byte region
freed by thread T0 here:
#0 0x5177a8 in free (vswitchd/ovs-vswitchd+0x5177a8)
#1 0x6b17b6 in dp_packet_delete ./lib/dp-packet.h:256:9
#2 0x6afeee in ipf_extract_frags_from_batch lib/ipf.c:947:17
#3 0x6afd63 in ipf_preprocess_conntrack lib/ipf.c:1232:9
#4 0x946b2c in conntrack_execute lib/conntrack.c:1446:5
#5 0x67e3ed in dp_execute_cb lib/dpif-netdev.c:8277:9
#6 0x7097d7 in odp_execute_actions lib/odp-execute.c:865:17
#7 0x66409e in dp_netdev_execute_actions lib/dpif-netdev.c:8322:5
#8 0x6877ad in dpif_netdev_execute lib/dpif-netdev.c:4090:5
#9 0x6675db in dpif_netdev_operate lib/dpif-netdev.c:4129:25
#10 0x691e5e in dpif_operate lib/dpif.c:1367:13
#11 0x692909 in dpif_execute lib/dpif.c:1321:9
#12 0x5b19c6 in packet_execute ofproto/ofproto-dpif.c:4991:5
#13 0x5a2861 in ofproto_packet_out_finish ofproto/ofproto.c:3662:5
#14 0x5a65c6 in do_bundle_commit ofproto/ofproto.c:8270:13
#15 0x5a0cae in handle_bundle_control ofproto/ofproto.c:8309:17
#16 0x59a476 in handle_single_part_openflow ofproto/ofproto.c:8593:16
#17 0x5877ac in handle_openflow ofproto/ofproto.c:8674:21
#18 0x6296f1 in ofconn_run ofproto/connmgr.c:1329:13
#19 0x62925d in connmgr_run ofproto/connmgr.c:356:9
#20 0x586904 in ofproto_run ofproto/ofproto.c:1879:5
#21 0x55c830 in bridge_run__ vswitchd/bridge.c:3251:9
#22 0x55c015 in bridge_run vswitchd/bridge.c:3310:5
#23 0x575f31 in main vswitchd/ovs-vswitchd.c:127:9
#24 0x7f01099d3492 in __libc_start_main (/lib64/libc.so.6+0x23492)
The issue can be reproduced with system-userspace testsuite on the
'conntrack - IPv4 fragmentation with fragments specified' test.
Previously, there was a leak inside the IP fragmentation module that
kept the original segment, so 'packet_clone' remained a valid pointer.
But commit 803ed12e31b0 ("ipf: release unhandled packets from the batch")
fixed the leak, leading to the use-after-free.
Use the packet from the batch instead of 'packet_clone' to swap packet
content, avoiding the issue.
While investigating this problem, more issues were uncovered. One of
them is that the IP fragmentation engine can add more packets to the
batch, but there is no way to return them to the caller. Add an extra
branch for this case with a 'FIXME' comment in order to highlight the
issue. Another one is that the IP fragmentation engine will keep only
32 fragments, dropping all other fragments while refilling a batch,
but that should be fixed separately.
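The swap fix can be sketched as follows. This is a minimal standalone
illustration with hypothetical mini types ('pkt', 'pkt_batch',
'restore_from_batch'), not the actual dp_packet API: the point is to
swap with the packet still owned by the batch, never with a stale
pointer that batch processing may have freed.

```c
#include <assert.h>
#include <string.h>

struct pkt {
    char data[64];
};

struct pkt_batch {
    struct pkt *pkts[4];
    int count;
};

/* Swap the contents of two packets. */
static void pkt_swap(struct pkt *a, struct pkt *b)
{
    struct pkt tmp = *a;
    *a = *b;
    *b = tmp;
}

/* Safe pattern: after processing (which may replace or free packets it
 * was handed), swap with the packet currently in the batch, not with a
 * clone pointer captured before processing ran. */
static void restore_from_batch(struct pkt *orig, struct pkt_batch *batch)
{
    if (batch->count > 0) {
        pkt_swap(orig, batch->pkts[0]);
    }
}
```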
Fixes: 7e6b41ac8d9d ("dpif-netdev: Fix crash when PACKET_OUT is metered.")
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Aaron Conole <aconole@redhat.com>
The dpif is used, so remove the OVS_UNUSED flag.
Fixes: a7f33fdbfb67 ("conntrack: Support zone limits.")
Signed-off-by: linhuang <linhuang@ruijie.com.cn>
Reviewed-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
When a PACKET_OUT has an output port of OFPP_TABLE, and the rule
table includes a meter that causes the packet to be deleted,
execute with a clone of the packet, restoring the original packet
if it is changed by the execution.
Add tests to verify the original issue is fixed, and that the fix
doesn't break tunnel processing.
Reported-by: Tony van der Peet <tony.vanderpeet@alliedtelesis.co.nz>
Signed-off-by: Tony van der Peet <tony.vanderpeet@alliedtelesis.co.nz>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
This patch fixes a memory leak in the commands for DPIF and MFEX
get and set. To operate, the commands require a pmd_list, which is
currently not freed after it has been used. This issue was
identified by a static analysis tool.
Fixes: 3d8f47bc ("dpif-netdev: Add command line and function pointer for miniflow extract")
Fixes: abb807e2 ("dpif-netdev: Add command to switch dpif implementation.")
Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
Acked-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
This patch fixes a memory leak when the command
"dpif-netdev/subtable-lookup-prio-set" is run: the pmd_list
required to iterate over the PMD threads was not being freed.
This issue was identified by a static analysis tool.
Fixes: 3d018c3e ("dpif-netdev: Add subtable lookup prio set command.")
Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
Acked-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
This commit removes dead code from the MFEX set command, as
highlighted by static analysis.
Fixes: a395b132b7 ("dpif-netdev: Add packet count and core id paramters for study")
Signed-off-by: kumar Amber <kumar.amber@intel.com>
Acked-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
Association of a mark to a flow is done as part of its offload handling,
in the offloading thread. However, the PMD thread specifies whether an
offload request is an "add" or "modify" by the association of a mark to
the flow.
This is exposed to a race condition. A flow might be created with
actions that cannot be fully offloaded, for example flooding (before MAC
learning), and later modified to have actions that can be fully
offloaded. If the two requests are queued before the offload thread
handling, they are both marked as "add". When the offload thread handles
them, the first request is partially offloaded, and the second one is
ignored as the flow is already considered as offloaded.
Fix it by specifying add/modify of an offload request by the actual flow
state change, without relying on the mark.
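The fix can be sketched as follows. Names here are illustrative, not
the actual lib/dpif-netdev.c code: the racy variant classifies the
request by mark association as seen by the offload thread, while the
fixed variant records the flow state transition at enqueue time.

```c
#include <assert.h>
#include <stdbool.h>

enum offload_op { OP_ADD, OP_MOD };

struct flow_state {
    bool has_mark;  /* set only once the offload thread installs a mark */
};

/* Racy variant: classify by mark association.  If two requests are
 * queued before the offload thread runs, both observe has_mark == false
 * and both become OP_ADD, so the second is ignored as a duplicate. */
static enum offload_op op_by_mark(const struct flow_state *f)
{
    return f->has_mark ? OP_MOD : OP_ADD;
}

/* Fixed variant: classify by the actual flow state change, recorded by
 * the PMD thread when the request is enqueued. */
static enum offload_op op_by_flow_state(bool flow_existed_before)
{
    return flow_existed_before ? OP_MOD : OP_ADD;
}
```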
Fixes: 3c7330ebf036 ("netdev-offload-dpdk: Support offload of output action.")
Signed-off-by: Eli Britstein <elibr@nvidia.com>
Reviewed-by: Gaetan Rivet <gaetanr@nvidia.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The dp_netdev_flow_offload_main thread is asynchronous since the cited
commit. There might be a case where several modification requests for
the same flow are submitted before being handled. Then, if the first
handling fails, the rule for the flow is deleted and the mark is freed.
The following request should then not be handled as a modification, but
rather as an "add".
Fixes: 02bb2824e51d ("dpif-netdev: do hw flow offload in a thread")
Signed-off-by: Eli Britstein <elibr@nvidia.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Before flushing offloads of a removed port was supported by [1], it was
necessary to flush the 'marks'. In doing so, all offloads of the PMD are
removed, including the ones that are not related to the removed port and
that are not modified following this removal. As a result, such flows
are evicted from being offloaded and won't resume offloading.
As the PMD offload flush is not necessary, avoid it.
[1] 62d1c28e9ce0 ("dpif-netdev: Flush offload rules upon port deletion.")
Signed-off-by: Eli Britstein <elibr@nvidia.com>
Reviewed-by: Gaetan Rivet <gaetanr@nvidia.com>
Reviewed-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Atomic variables must be read atomically. Since we already have
'smc_enable_db' in the PMD context, use it from there to avoid
reading the atomic twice.
Also, 'smc_enable_db' is a global configuration, so there is no need
to read it per-port or per-rxq.
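The pattern can be sketched with C11 atomics. The names below are
illustrative, not the OVS atomic macros: the global flag is loaded
atomically once into the per-thread context, and the fast path then
uses the plain cached copy.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct pmd_ctx {
    bool smc_enable_db;  /* cached copy, refreshed once per iteration */
};

/* Global configuration, written from the main thread. */
static atomic_bool smc_enable_db_cfg = false;

/* Refresh the cached value with a single relaxed atomic load, instead
 * of issuing an atomic load for every port or rxq processed. */
static void pmd_ctx_refresh(struct pmd_ctx *ctx)
{
    ctx->smc_enable_db = atomic_load_explicit(&smc_enable_db_cfg,
                                              memory_order_relaxed);
}
```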
Fixes: 9ac84a1a3698 ("dpif-avx512: Add ISA implementation of dpif.")
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Cian Ferriter <cian.ferriter@intel.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The Open vSwitch kernel module uses the upcall mechanism to send
packets from kernel space to user space when it misses in the kernel
space flow table. The upcall sends packets via a Netlink socket.
Currently, a Netlink socket is created for every vport. In this way,
there is a 1:1 mapping between a vport and a Netlink socket.
When a packet is received by a vport, if it needs to be sent to
user space, it is sent via the corresponding Netlink socket.
This mechanism, with various iterations of the corresponding user
space code, has seen some limitations and issues:
* On systems with a large number of vports, there is correspondingly
a large number of Netlink sockets which can limit scaling.
(https://bugzilla.redhat.com/show_bug.cgi?id=1526306)
* Packet reordering on upcalls.
(https://bugzilla.redhat.com/show_bug.cgi?id=1844576)
* A thundering herd issue.
(https://bugzilla.redhat.com/show_bug.cgi?id=1834444)
This patch introduces an alternative, feature-negotiated, upcall
mode using a per-cpu dispatch rather than a per-vport dispatch.
In this mode, the Netlink socket to be used for the upcall is
selected based on the CPU of the thread that is executing the upcall.
In this way, it resolves the issues above as:
a) The number of Netlink sockets scales with the number of CPUs
rather than the number of vports.
b) Ordering per-flow is maintained as packets are distributed to
CPUs based on mechanisms such as RSS and flows are distributed
to a single user space thread.
c) Packets from a flow can only wake up one user space thread.
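The per-CPU socket selection in (a) can be sketched as follows. This is
an illustrative helper, not the kernel implementation: the upcall
socket index is derived from the CPU id of the executing thread, modulo
the number of allocated sockets, so the socket count scales with CPUs
rather than with vports.

```c
#include <assert.h>
#include <stdint.h>

/* Pick the Netlink socket for an upcall from the CPU executing it.
 * 'n_socks' is the number of sockets user space registered; guard
 * against zero so a misconfigured table cannot divide by zero. */
static uint32_t upcall_sock_idx(uint32_t cpu_id, uint32_t n_socks)
{
    return n_socks ? cpu_id % n_socks : 0;
}
```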
Reported-at: https://bugzilla.redhat.com/1844576
Signed-off-by: Mark Gray <mark.d.gray@redhat.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Users complained that per-rxq pmd usage was confusing: summing those
values per pmd would never reach 100%, even when increasing the traffic
load beyond pmd capacity.
This is because the dpif-netdev/pmd-rxq-show command only reports "pure"
rxq cycles, while some cycles are used in the pmd mainloop and add up to
the total pmd load.
dpif-netdev/pmd-stats-show does report per pmd load usage.
This load is measured since the last dpif-netdev/pmd-stats-clear call.
On the other hand, the per-rxq pmd usage reflects the pmd load on a 10s
sliding window, which makes the two non-trivial to correlate.
Gather per pmd busy cycles with the same periodicity and report the
difference as overhead in dpif-netdev/pmd-rxq-show so that we have all
info in a single command.
Example:
$ ovs-appctl dpif-netdev/pmd-rxq-show
pmd thread numa_id 1 core_id 3:
isolated : true
port: dpdk0 queue-id: 0 (enabled) pmd usage: 90 %
overhead: 4 %
pmd thread numa_id 1 core_id 5:
isolated : false
port: vhost0 queue-id: 0 (enabled) pmd usage: 0 %
port: vhost1 queue-id: 0 (enabled) pmd usage: 93 %
port: vhost2 queue-id: 0 (enabled) pmd usage: 0 %
port: vhost6 queue-id: 0 (enabled) pmd usage: 0 %
overhead: 6 %
pmd thread numa_id 1 core_id 31:
isolated : true
port: dpdk1 queue-id: 0 (enabled) pmd usage: 86 %
overhead: 4 %
pmd thread numa_id 1 core_id 33:
isolated : false
port: vhost3 queue-id: 0 (enabled) pmd usage: 0 %
port: vhost4 queue-id: 0 (enabled) pmd usage: 0 %
port: vhost5 queue-id: 0 (enabled) pmd usage: 92 %
port: vhost7 queue-id: 0 (enabled) pmd usage: 0 %
overhead: 7 %
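The relationship between "pmd usage" and "overhead" above can be
sketched with simple arithmetic. This is an illustration of the idea,
not the actual OVS accounting code: overhead is the pmd busy cycles not
attributed to any rxq, expressed as a percentage of total pmd cycles
over the same interval.

```c
#include <assert.h>
#include <stdint.h>

/* Express 'part' as an integer percentage of 'total'. */
static unsigned pct(uint64_t part, uint64_t total)
{
    return total ? (unsigned) (part * 100 / total) : 0;
}

/* Overhead %: busy cycles minus the sum of per-rxq cycles, clamped at
 * zero in case of measurement skew between the two counters. */
static unsigned overhead_pct(uint64_t busy_cycles, uint64_t rxq_cycles_sum,
                             uint64_t total_cycles)
{
    uint64_t overhead = busy_cycles > rxq_cycles_sum
                        ? busy_cycles - rxq_cycles_sum : 0;
    return pct(overhead, total_cycles);
}
```

With these numbers, a pmd that is busy 94% of the time with an rxq
accounting for 90% would report "pmd usage: 90 %" and "overhead: 4 %",
matching the shape of the output above.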
Signed-off-by: David Marchand <david.marchand@redhat.com>
Acked-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
Pinning an rxq to a PMD with pmd-rxq-affinity may be done for
various reasons such as reserving a full PMD for an rxq, or to
ensure that multiple rxqs from a port are handled on different PMDs.
Previously, pmd-rxq-affinity always isolated the PMD so that no other
rxqs could be assigned to it by OVS. There may be cases where there are
unused cycles on those pmds and the user would like OVS to also be able
to assign other rxqs to them.
Add an option to pin the rxq without isolating the PMD. The default
behaviour is unchanged: pin and isolate the PMD.
In order to pin and non-isolate:
ovs-vsctl set Open_vSwitch . other_config:pmd-rxq-isolate=false
Note this is available only with group assignment type, as pinning
conflicts with the operation of the other rxq assignment algorithms.
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: Sunil Pai G <sunil.pai.g@intel.com>
Acked-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
Add an rxq scheduling option that allows rxqs to be grouped
on a pmd based purely on their load.
The current default 'cycles' assignment sorts rxqs by measured
processing load and then assigns them to a list of round robin PMDs.
This helps to keep the rxqs that require most processing on different
cores but as it selects the PMDs in round robin order, it equally
distributes rxqs to PMDs.
'cycles' assignment has the advantage that it keeps the most loaded
rxqs off the same core while spreading rxqs across a broad range of
PMDs, to mitigate against changes to the traffic pattern.
'cycles' assignment has the disadvantage that, in order to make the
trade-off between optimising for the current traffic load and
mitigating against future changes, it tries to assign an equal number
of rxqs per PMD in a round robin manner, and this can lead to a less
than optimal balance of the processing load.
Now that PMD auto load balance can help mitigate future changes in
traffic patterns, a 'group' assignment can be used to assign rxqs based
on their measured cycles and the estimated running total of the PMDs.
In this case, there is no restriction on keeping an equal number of
rxqs per PMD, as the assignment is purely load based.
This means that one PMD may have a group of low load rxqs assigned to it
while another PMD has one high load rxq assigned to it, as that is the
best balance of their measured loads across the PMDs.
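The load-based placement can be sketched as a greedy assignment. This
is a simplified illustration, not the actual lib/dpif-netdev.c code:
rxqs are sorted by measured cycles in descending order, and each is
given to the pmd with the lowest estimated running total, with no cap
on rxqs per pmd.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* qsort comparator: sort uint64_t values in descending order. */
static int cmp_desc(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *) a, y = *(const uint64_t *) b;
    return x < y ? 1 : x > y ? -1 : 0;
}

/* rxq_cycles[i] is the measured load of rxq i (sorted in place);
 * pmd_load[j] accumulates the estimated running total of pmd j. */
static void group_assign(uint64_t *rxq_cycles, size_t n_rxqs,
                         uint64_t *pmd_load, size_t n_pmds)
{
    qsort(rxq_cycles, n_rxqs, sizeof rxq_cycles[0], cmp_desc);
    for (size_t i = 0; i < n_rxqs; i++) {
        /* Find the pmd with the lowest estimated load so far. */
        size_t least = 0;
        for (size_t j = 1; j < n_pmds; j++) {
            if (pmd_load[j] < pmd_load[least]) {
                least = j;
            }
        }
        pmd_load[least] += rxq_cycles[i];
    }
}
```

With one high-load rxq (90) and three low-load rxqs (10 each) over two
pmds, this yields one pmd carrying the single heavy rxq and the other
carrying the group of light ones, as described above.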
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: Sunil Pai G <sunil.pai.g@intel.com>
Acked-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
Previously, if pmd-rxq-affinity was used to pin an rxq to
a core that was not in pmd-cpu-mask, the rxq was not polled
and the user received a warning. This meant that no traffic
would be received from that rxq.
Now that pinned and non-pinned rxqs are assigned to PMDs in
a common call to rxq scheduling, if an invalid core is
selected in pmd-rxq-affinity the rxq can be assigned an
available PMD (if any).
A warning will still be logged as the requested core could
not be used.
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: Sunil Pai G <sunil.pai.g@intel.com>
Acked-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>