Using the SHORT version of the *_SAFE loops makes the code cleaner and less
error prone. So, use the SHORT version and remove the extra variable
when possible for hmap and all its derived types.
In order to be able to use both long and short versions without changing
the name of the macro for all the clients, overload the existing name
and select the appropriate version depending on the number of arguments.
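The selection-by-argument-count trick can be sketched with toy macros (the names below are illustrative, not the actual OVS *_SAFE macros): an argument-counting helper picks the long or short implementation.

```c
#include <assert.h>

/* Overload a macro on argument count: the extra trailing names shift
 * the chosen implementation into the 4th position.  Toy example of the
 * technique used for the *_SAFE loop macros. */
#define ADD_2(a, b)       ((a) + (b))
#define ADD_3(a, b, c)    ((a) + (b) + (c))
#define GET_4TH_ARG(_1, _2, _3, NAME, ...) NAME
#define ADD(...) GET_4TH_ARG(__VA_ARGS__, ADD_3, ADD_2, dummy)(__VA_ARGS__)

static int add_two(void)   { return ADD(1, 2); }    /* expands to ADD_2 */
static int add_three(void) { return ADD(1, 2, 3); } /* expands to ADD_3 */
```

Clients keep calling the same name; the preprocessor routes two-argument uses to the short version and three-argument uses to the long one.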
Acked-by: Dumitru Ceara <dceara@redhat.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
A flow may be modified after its initial offload failed. In this case,
according to [1], the modification is handled as a flow add.
For a vport flow "add", the orig_in_port should be provided.
Keep that field in the flow struct, so it can be provided in the flow
modification use case.
[1] 0d25621e4d9f ("dpif-netdev: Fix flow modification after failure.")
Fixes: b5e6f6f6bfbe ("dpif-netdev: Provide orig_in_port in metadata for tunneled packets.")
Signed-off-by: Eli Britstein <elibr@nvidia.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The same pattern for atomic stores and initialization was used for the
DPIF and MFEX function pointers declared in struct dp_netdev_pmd_thread.
Simplify this pattern for all stores to 'miniflow_extract_opt' and
'netdev_input_func'.
Also replace the first store to 'miniflow_extract_opt', which was an
atomic_store_relaxed(), with atomic_init().
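The initialization-vs-store distinction can be sketched with C11 atomics; the struct and function names below are placeholders, not the actual 'dp_netdev_pmd_thread' fields.

```c
#include <assert.h>
#include <stdatomic.h>

typedef int (*pkt_input_func)(int);

static int input_v1(int x) { return x + 1; }
static int input_v2(int x) { return x * 2; }

struct pmd_sketch {
    _Atomic pkt_input_func input;   /* read concurrently by other threads */
};

static void pmd_init(struct pmd_sketch *pmd)
{
    /* Before the struct is visible to other threads, plain
     * initialization suffices: atomic_init() carries no ordering cost. */
    atomic_init(&pmd->input, input_v1);
}

static void pmd_set_input(struct pmd_sketch *pmd, pkt_input_func f)
{
    /* Once shared, use a relaxed atomic store so concurrent readers
     * never observe a torn pointer. */
    atomic_store_explicit(&pmd->input, f, memory_order_relaxed);
}

static int atomic_demo(void)
{
    struct pmd_sketch pmd;

    pmd_init(&pmd);
    pmd_set_input(&pmd, input_v2);
    pkt_input_func f = atomic_load_explicit(&pmd.input,
                                            memory_order_relaxed);
    return f(21);
}
```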
Signed-off-by: Cian Ferriter <cian.ferriter@intel.com>
Acked-by: Kumar Amber <kumar.amber@intel.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The PMD reference taken is not actually used; it is only needed to
get the dp_netdev linked. Additionally, taking the PMD reference
does not protect against the disappearance of the dp_netdev,
so it is misleading.
The dp reference is protected by the way the ports are being deleted
during datapath deletion. No further offload request should be found
past a flush, so it is safe to keep this reference in the offload item.
Signed-off-by: Gaetan Rivet <grive@u256.net>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
In dp_netdev_pmd_remove_flow() we schedule the deletion of an
offloaded flow, if a mark has been assigned to the flow. But if
this occurs in the window in which the offload thread completes
offloading the flow and assigns a mark to the flow, then we miss
deleting the flow. This problem has been observed while adding
and deleting flows in a loop. To fix this, always enqueue flow
deletion regardless of the flow->mark being set.
Fixes: 241bad15d99a ("dpif-netdev: associate flow with a mark id")
Co-authored-by: Gaetan Rivet <grive@u256.net>
Signed-off-by: Gaetan Rivet <grive@u256.net>
Signed-off-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Port flush and offload uninit should be moved after the datapath
has been reconfigured. That way, no other thread such as PMDs will
find this port to poll and enqueue further offload requests.
After a flush, almost no further offload request for this port should
be found in the queue.
There will still be some issued by revalidators, but they
will be caught when the offload thread fails to take a netdev ref.
This change fixes the issue of datapath reference being improperly
accessed by offload threads while it is being destroyed.
Fixes: 5b0aa55776cb ("dpif-netdev: Execute flush from offload thread.")
Fixes: 62d1c28e9ce0 ("dpif-netdev: Flush offload rules upon port deletion.")
Suggested-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
Signed-off-by: Gaetan Rivet <grive@u256.net>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The subtable search function can be used at any time by a PMD thread.
Setting the subtable search function should be done atomically to
prevent garbage data from being read.
A dpcls_subtable_lookup_reprobe() call can happen at the same time that
DPCLS subtables are being sorted. This could lead to modifications done
by the reprobe() to be lost. Prevent this from happening by locking on
pmd->flow_mutex. After this change both the reprobe function and a
subtable sort will share the flow_mutex preventing modifications by
either one from being lost.
Also remove the pvector_publish() call. The pvector is not being changed
in dpcls_subtable_lookup_reprobe(), only the data pointed to by pointers
in the vector are being changed.
Fixes: 3d018c3ea79d ("dpif-netdev: add subtable lookup prio set command.")
Reported-by: Ilya Maximets <i.maximets@ovn.org>
Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2022-January/390757.html
Signed-off-by: Cian Ferriter <cian.ferriter@intel.com>
Acked-by: Ilya Maximets <i.maximets@ovn.org>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
It's actually undefined behavior to pass NULL to standard library
functions that manipulate arrays (e.g., qsort, memcpy, memcmp), even if
the passed number of items is 0.
UB Sanitizer reports:
ovsdb/monitor.c:408:9: runtime error: null pointer passed as argument 1,
which is declared to never be null
#0 0x406ae1 in ovsdb_monitor_columns_sort ovsdb/monitor.c:408
#1 0x406ae1 in ovsdb_monitor_add ovsdb/monitor.c:1683
[...]
lib/ovsdb-data.c:1970:5: runtime error: null pointer passed as argument 2,
which is declared to never be null
#0 0x4071c8 in ovsdb_datum_push_unsafe lib/ovsdb-data.c:1970
#1 0x471cd0 in ovsdb_datum_apply_diff_in_place lib/ovsdb-data.c:2345
[...]
ofproto/ofproto-dpif-rid.c:159:17:
runtime error: null pointer passed as argument 1,
which is declared to never be null
#0 0x4df5d8 in frozen_state_equal ofproto/ofproto-dpif-rid.c:159
#1 0x4dfd27 in recirc_find_equal ofproto/ofproto-dpif-rid.c:179
[...]
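The fix pattern reduces to guarding the degenerate zero-length case before calling the libc functions. The helpers below are hypothetical sketches, not the OVS functions named in the reports.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *) a, y = *(const int *) b;

    return (x > y) - (x < y);
}

/* Passing NULL to qsort() is UB even when n == 0, so skip the call
 * entirely when there is no data. */
static void sort_ints(int *arr, size_t n)
{
    if (n) {
        qsort(arr, n, sizeof *arr, cmp_int);
    }
}

/* Same guard for memcmp(): zero items compare equal without touching
 * the (possibly NULL) pointers. */
static int equal_ints(const int *a, const int *b, size_t n)
{
    return !n || !memcmp(a, b, n * sizeof *a);
}

static int null_guard_demo(void)
{
    int v[3] = { 3, 1, 2 };

    sort_ints(v, 3);
    sort_ints(NULL, 0);               /* guarded: no UB */
    return v[0] == 1 && v[1] == 2 && v[2] == 3
           && equal_ints(NULL, NULL, 0);
}
```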
Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Dumitru Ceara <dceara@redhat.com>
Acked-by: Paolo Valerio <pvalerio@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Read the user configuration in the netdev-offload module to modify the
number of threads used to manage hardware offload requests.
This allows processing insertion, deletion and modification
concurrently.
The offload thread structure was modified to contain all needed
elements. This structure is multiplied by the number of requested
threads and used separately.
Signed-off-by: Gaetan Rivet <grive@u256.net>
Reviewed-by: Eli Britstein <elibr@nvidia.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The port mutex protects the netdev mapping, that can be changed by port
addition or port deletion. HW offloads operations can be considered read
operations on the port mapping itself. Use a rwlock to differentiate
between read and write operations, allowing concurrent queries and
offload insertions.
Because offload queries, deletion, and reconfigure_datapath() calls are
all rdlock, the deadlock fixed by [1] is still avoided, as the rdlock
side is recursive as prescribed by the POSIX standard. Executing
'reconfigure_datapath()' only requires an rdlock, but it is sometimes
executed in contexts where wrlock is taken ('do_add_port()' and
'do_del_port()').
This means that the deadlock described in [2] is still valid and should
be mitigated. The rdlock is taken using 'tryrdlock()' during offload query,
keeping the current behavior.
[1]: 81e89d5c2645 ("dpif-netdev: Make datapath port mutex recursive.")
[2]: 12d0edd75eba ("dpif-netdev: Avoid deadlock with offloading during PMD
thread deletion.").
Signed-off-by: Gaetan Rivet <grive@u256.net>
Reviewed-by: Eli Britstein <elibr@nvidia.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
In later commits hardware offloads are managed in several threads.
Each offload is managed by a thread determined by its flow's 'mega_ufid'.
As megaflow-to-mark and mark-to-flow mappings are 1:1 and 1:N
respectively, a single mark exists for a single 'mega_ufid', and
multiple flows use the same 'mega_ufid'. Because the managing thread
is chosen using the 'mega_ufid', each mapping does not need to be
shared with other offload threads.
The mappings are kept in cmaps, as upcalls will sometimes query them
before enqueuing orders to the offload threads.
To prepare this change, move the mappings within the offload thread
structure.
Signed-off-by: Gaetan Rivet <grive@u256.net>
Reviewed-by: Eli Britstein <elibr@nvidia.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The dataplane threads (PMDs) send offloading commands to a dedicated
offload management thread. The current implementation uses a lock
and benchmarks show a high contention on the queue in some cases.
With high contention, the mutex will more often lead to the locking
thread yielding in wait, using a syscall. This should be avoided in
a userland dataplane.
The mpsc-queue can be used instead. It uses less cycles and has
lower latency. Benchmarks show better behavior as multiple
revalidators and one or multiple PMDs write to a single queue
while another thread polls it.
One trade-off with the new scheme, however, is that the offload
thread is forced to poll the queue. Without a mutex, a cond_wait
cannot be used for signaling. The offload thread implements
an exponential backoff and will sleep in short increments when no
data is available. This makes the thread yield, at the price of
some latency to manage offloads after an inactivity period.
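The backoff policy described above can be sketched as a small schedule function; the constants and name are illustrative, not the actual implementation.

```c
#include <assert.h>
#include <stdint.h>

#define BACKOFF_START_US 1
#define BACKOFF_MAX_US   64

/* Exponential backoff for the polling offload thread: double the
 * sleep after each empty poll, reset it as soon as work is found.
 * The caller would usleep() for the returned duration between polls. */
static uint32_t next_backoff(uint32_t cur_us, int had_work)
{
    if (had_work) {
        return 0;                   /* busy: poll again immediately */
    }
    if (cur_us == 0) {
        return BACKOFF_START_US;    /* first empty poll: start small */
    }
    return cur_us * 2 > BACKOFF_MAX_US ? BACKOFF_MAX_US : cur_us * 2;
}
```

This yields the CPU during inactivity at the price of a bounded extra latency (at most the maximum backoff) before new offloads are picked up.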
Signed-off-by: Gaetan Rivet <grive@u256.net>
Reviewed-by: Eli Britstein <elibr@nvidia.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Add a lock to access the ufid to rte_flow map. This will protect it
from concurrent write accesses when multiple threads attempt it.
At this point, the reason for taking the lock is no longer to fulfill
the needs of the DPDK offload implementation. Rewrite the comments
to reflect this change. The lock is still needed to protect against
changes to netdev port mapping.
Signed-off-by: Gaetan Rivet <grive@u256.net>
Reviewed-by: Eli Britstein <elibr@nvidia.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
When a port is deleted, its offloads must be flushed. The operation
runs in the thread that initiated it. Offload data is thus accessed
jointly by the port deletion thread(s) and the offload thread, which
complicates the data access model.
To simplify this model, as a pre-step toward introducing parallel
offloads, execute the flush operation in the offload thread.
Signed-off-by: Gaetan Rivet <grive@u256.net>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Offload requests currently support only flow offloads.
As a pre-step before supporting an offload flush request,
modify the layout of an offload request item, to become a tagged union.
Future offload types won't be forced to re-use the full flow offload
structure, which consumes a lot of memory.
Signed-off-by: Gaetan Rivet <grive@u256.net>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Use the netdev-offload multithread API to allow multiple threads to
allocate marks concurrently.
Initialize the pool only once in a multithreaded context by using
the ovsthread_once type.
Use the id-fpool module for faster concurrent ID allocation.
Signed-off-by: Gaetan Rivet <grive@u256.net>
Reviewed-by: Eli Britstein <elibr@nvidia.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Profiling the HW offload thread, the flow offload freeing takes
approximately 25% of the time. Most of this time is spent waiting on
the futex used by the libc free(), as it triggers a syscall and
reschedules the thread.
Avoid the syscall and its expensive context switch. Batch the offload
messages freeing using the RCU.
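The batching idea can be sketched with a stand-in for the RCU deferral ('ovsrcu_postpone()' in OVS); the structure and names here are illustrative only.

```c
#include <assert.h>
#include <stdlib.h>

#define BATCH_MAX 32

/* Instead of one futex-prone free() per offload message, queue the
 * pointers and release the whole batch once a grace period elapses. */
struct free_batch {
    void *ptrs[BATCH_MAX];
    int n;
};

static void defer_free(struct free_batch *b, void *p)
{
    b->ptrs[b->n++] = p;            /* no syscall on the hot path */
}

static int flush_batch(struct free_batch *b) /* "grace period" ends */
{
    int released = b->n;

    for (int i = 0; i < b->n; i++) {
        free(b->ptrs[i]);
    }
    b->n = 0;
    return released;
}

static int rcu_free_demo(void)
{
    struct free_batch b = { .n = 0 };

    for (int i = 0; i < 4; i++) {
        defer_free(&b, malloc(8));
    }
    return flush_batch(&b);
}
```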
Signed-off-by: Gaetan Rivet <grive@u256.net>
Reviewed-by: Eli Britstein <elibr@nvidia.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Similar to what was done for the PMD threads [1], reduce the performance
impact of quiescing too often in the offload thread.
After each processed offload, the offload thread currently quiesces
and will sync with RCU. This synchronization can be lengthy and makes
the thread unnecessarily slow.
Instead, attempt to quiesce every 10 ms at most. While the queue is
empty, the offload thread remains quiescent.
[1]: 81ac8b3b194c ("dpif-netdev: Do RCU synchronization at fixed interval
in PMD main loop.")
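The interval policy reduces to a small predicate; the constant and names below are illustrative.

```c
#include <assert.h>
#include <stdint.h>

#define QUIESCE_INTERVAL_MS 10

/* Quiesce only when the queue is empty (nothing to delay) or when the
 * interval since the last quiescent period has elapsed, instead of
 * after every processed offload. */
static int should_quiesce(uint64_t now_ms, uint64_t last_quiesce_ms,
                          int queue_empty)
{
    return queue_empty || now_ms - last_quiesce_ms >= QUIESCE_INTERVAL_MS;
}
```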
Signed-off-by: Gaetan Rivet <grive@u256.net>
Reviewed-by: Eli Britstein <elibr@nvidia.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
In the netdev datapath, keep track of the enqueued offloads between
the PMDs and the offload thread. Additionally, query each netdev
for their hardware offload counters.
Signed-off-by: Gaetan Rivet <grive@u256.net>
Reviewed-by: Eli Britstein <elibr@nvidia.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The offload management in userspace is done through a separate thread.
The naming of the structure holding the objects used for synchronization
with the dataplane is generic and nondescript.
Clarify the object's purpose by renaming it.
Signed-off-by: Gaetan Rivet <grive@u256.net>
Reviewed-by: Eli Britstein <elibr@nvidia.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Expose a function to query datapath offload statistics.
This function is separate from the current one in netdev-offload
as it exposes more detailed statistics from the datapath, instead of
only from the netdev-offload provider.
Each datapath is meant to use the custom counters as it sees fit for its
handling of hardware offloads.
Call the new API from dpctl.
Signed-off-by: Gaetan Rivet <grive@u256.net>
Reviewed-by: Eli Britstein <elibr@nvidia.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
ovs_strlcpy silently fails to copy the thread name if it is too long.
Rename the flow offload thread to differentiate it from the main thread.
Fixes: 02bb2824e51d ("dpif-netdev: do hw flow offload in a thread")
Signed-off-by: Gaetan Rivet <grive@u256.net>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
This patch adds a new hash Tx steering mode that
distributes the traffic on all the Tx queues, regardless of
the number of PMD threads. It would be useful for guests
expecting traffic to be distributed on all the vCPUs.
The idea here is to re-use the 5-tuple hash of the packets,
already computed to build the flows batches (and so it
does not provide flexibility on which fields are part of
the hash).
There is also no user-configurable indirection table,
given that the feature is transparent to the guest. The queue
selection is just a modulo operation between the packet
hash and the number of Tx queues.
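The queue selection thus reduces to a single modulo over the already-computed packet hash; the function name is illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Hash Tx steering: reuse the 5-tuple hash already computed for flow
 * batching and pick a Tx queue by simple modulo. */
static uint32_t hash_txq_select(uint32_t pkt_hash, uint32_t n_txq)
{
    return pkt_hash % n_txq;
}
```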
There are no (at least intentional) functional changes
for the existing XPS and static modes. There should not be
noticeable performance changes for these modes (only one
more branch in the hot path).
For the hash mode, performance could be impacted due to
locking when multiple PMD threads are in use (same as
XPS mode) and also because of the second level of batching.
Regarding the batching, the existing Tx port output_pkts
is not modified. It means that at maximum, NETDEV_MAX_BURST
can be batched for all the Tx queues. A second level of
batching is done in dp_netdev_pmd_flush_output_on_port(),
only for this hash mode.
Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Reviewed-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
A boolean is currently used to differentiate between the
static and XPS Tx queue modes.
Since we are going to introduce a new steering mode, replace
this boolean with an enum.
This patch does not introduce functional changes.
Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Reviewed-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The encap & decap actions are extended to support the MPLS packet type.
Encap & decap actions add and remove an MPLS header at the start of
the packet.
The existing PUSH MPLS & POP MPLS actions insert & remove an MPLS
header between the ethernet header and the IP header. Though this
behaviour is fine for L3 VPN, where an IP packet is encapsulated inside
an MPLS tunnel, it does not satisfy the L2 VPN requirements. In L2 VPN
the ethernet packets must be encapsulated inside an MPLS tunnel.
In this change the encap & decap actions are extended to support the
MPLS packet type. The encap & decap actions add and remove the MPLS
header at the start of the packet as depicted below.
Encapsulation:
Actions - encap(mpls),encap(ethernet)
Incoming packet -> | ETH | IP | Payload |
1 Actions - encap(mpls) [Datapath action - ADD_MPLS:0x8847]
Outgoing packet -> | MPLS | ETH | Payload|
2 Actions - encap(ethernet) [ Datapath action - push_eth ]
Outgoing packet -> | ETH | MPLS | ETH | Payload|
Decapsulation:
Incoming packet -> | ETH | MPLS | ETH | IP | Payload |
Actions - decap(),decap(packet_type(ns=0,type=0))
1 Actions - decap() [Datapath action - pop_eth]
Outgoing packet -> | MPLS | ETH | IP | Payload|
2 Actions - decap(packet_type(ns=0,type=0)) [Datapath action - POP_MPLS:0x6558]
Outgoing packet -> | ETH | IP | Payload|
Signed-off-by: Martin Varghese <martin.varghese@nokia.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Previously in OVS, a PMD thread running on cpu X used lcore X.
This assumption limited OVS to run PMD threads on physical cpu <
RTE_MAX_LCORE.
DPDK 20.08 introduced a new API that associates a non-EAL thread to a free
lcore. This new API does not change the thread characteristics (like CPU
affinity) and let OVS run its PMD threads on any cpu regardless of
RTE_MAX_LCORE.
The DPDK multiprocess feature is not compatible with this new API and is
disabled.
DPDK still limits the number of lcores to RTE_MAX_LCORE (128 on x86_64)
which should be enough for OVS pmd threads (hopefully).
The DPDK lcore/OVS PMD thread mapping is logged when trying to
attach an OVS PMD thread, and when detaching.
A new command is added to help get the DPDK point of view of the
lcores at any time:
$ ovs-appctl dpdk/lcore-list
lcore 0, socket 0, role RTE, cpuset 0
lcore 1, socket 0, role NON_EAL, cpuset 1
lcore 2, socket 0, role NON_EAL, cpuset 15
Signed-off-by: David Marchand <david.marchand@redhat.com>
Acked-by: Kevin Traynor <ktraynor@redhat.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The hw_miss_packet_recover() API results in performance degradation, for
ports that are either not offload capable or do not support this specific
offload API.
For example, in the test configuration shown below, the vhost-user port
does not support offloads and the VF port doesn't support hw_miss offload
API. But because tunnel offload needs to be configured in other bridges
(br-vxlan and br-phy), OVS has been built with -DALLOW_EXPERIMENTAL_API.
br-vhost br-vxlan br-phy
vhost-user<-->VF VF-Rep<-->VxLAN uplink-port
For every packet between the VF and the vhost-user ports, hw_miss API is
called even though it is not supported by the ports involved. This leads
to significant performance drop (~3x in some cases; both cycles and pps).
Return EOPNOTSUPP when this API fails for a device that doesn't support it
and avoid this API on that port for subsequent packets.
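The avoidance logic can be sketched as a per-port flag set on the first EOPNOTSUPP; the struct and function names are hypothetical, and the stub stands in for the unsupported driver call.

```c
#include <assert.h>
#include <errno.h>

struct port_sketch {
    int hw_miss_unsupported;     /* set after the first EOPNOTSUPP */
};

static int hw_miss_recover_stub(void)
{
    return EOPNOTSUPP;           /* driver without the offload API */
}

/* Call the recover API only while the port is not known to lack it;
 * after one EOPNOTSUPP, skip the expensive call for all later packets. */
static int try_hw_miss_recover(struct port_sketch *port, int *calls)
{
    int err;

    if (port->hw_miss_unsupported) {
        return EOPNOTSUPP;
    }
    (*calls)++;
    err = hw_miss_recover_stub();
    if (err == EOPNOTSUPP) {
        port->hw_miss_unsupported = 1;
    }
    return err;
}

static int hw_miss_demo(void)
{
    struct port_sketch port = { 0 };
    int calls = 0;

    try_hw_miss_recover(&port, &calls);
    try_hw_miss_recover(&port, &calls);
    return calls;                /* only the first packet paid the cost */
}
```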
Signed-off-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
There are cases where users might want simple forwarding or drop rules
for all packets received from a specific port, e.g.:
"in_port=1,actions=2"
"in_port=2,actions=IN_PORT"
"in_port=3,vlan_tci=0x1234/0x1fff,actions=drop"
"in_port=4,actions=push_vlan:0x8100,set_field:4196->vlan_vid,output:3"
There are also cases where complex OpenFlow rules can be simplified
down to datapath flows with very simple match criteria.
In theory, for very simple forwarding, OVS doesn't need to parse
packets at all in order to follow these rules. "Simple match" lookup
optimization is intended to speed up packet forwarding in these cases.
Design:
Due to various implementation constraints userspace datapath has
following flow fields always in exact match (i.e. it's required to
match at least these fields of a packet even if the OF rule doesn't
need that):
- recirc_id
- in_port
- packet_type
- dl_type
- vlan_tci (CFI + VID) - in most cases
- nw_frag - for ip packets
Not all of these fields are related to the packet itself. We already
know the current 'recirc_id' and the 'in_port' before starting the
packet processing. It also seems safe to assume that we're working
with Ethernet packets. So, for the simple OF rule we need to match
only on 'dl_type', 'vlan_tci' and 'nw_frag'.
'in_port', 'dl_type', 'nw_frag' and 13 bits of 'vlan_tci' can be
combined into a single 64-bit integer (mark) that can be used as a
hash in a hash map. We are using only the VID and CFI from the
'vlan_tci'; flows that need to match on PCP will not qualify for the
optimization.
The workaround for matching on the non-existence of a vlan is updated
to match on CFI and VID only, in order to qualify for the optimization.
CFI is always set by OVS if a vlan is present in a packet, so there is
no need to match on PCP in this case. 'nw_frag' takes 2 bits of PCP
inside the simple match mark.
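The mark construction amounts to plain bit packing. The exact bit layout below is an assumption for illustration; only the set of packed fields (in_port, dl_type, 13 bits of vlan_tci, nw_frag in place of the PCP bits) comes from the description above.

```c
#include <assert.h>
#include <stdint.h>

/* Pack the "simple match" fields into one 64-bit mark usable as a
 * hash-map key.  Layout (illustrative): in_port in the high 32 bits,
 * dl_type in bits 16..31, nw_frag in the 2 bits where PCP would be,
 * and VID + CFI in the low 13 bits. */
static uint64_t simple_match_mark(uint32_t in_port, uint16_t dl_type,
                                  uint16_t vlan_tci, uint8_t nw_frag)
{
    uint64_t mark = 0;

    mark |= (uint64_t) in_port << 32;
    mark |= (uint64_t) dl_type << 16;
    mark |= vlan_tci & 0x1fff;                 /* VID + CFI; PCP masked */
    mark |= (uint64_t) (nw_frag & 0x3) << 13;  /* 2 bits where PCP was */
    return mark;
}
```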
A new per-PMD flow table, 'simple_match_table', is introduced to store
simple match flows only. 'dp_netdev_flow_add' adds a flow to the
usual 'flow_table' and to the 'simple_match_table' if the flow
meets the following constraints:
- 'recirc_id' in flow match is 0.
- 'packet_type' in flow match is Ethernet.
- Flow wildcards contain only the minimal set of non-wildcarded fields
(listed above).
If the number of flows for the current 'in_port' in the regular
'flow_table' equals the number of flows for the current 'in_port' in
the 'simple_match_table',
we may use simple match optimization, because all the flows we have
are simple match flows. This means that we only need to parse
'dl_type', 'vlan_tci' and 'nw_frag' to perform packet matching.
Now we make the unique flow mark from the 'in_port', 'dl_type',
'nw_frag' and 'vlan_tci' and look it up in the 'simple_match_table'.
On successful lookup we don't need to run full 'miniflow_extract()'.
An unsuccessful lookup technically means that we have no suitable
flow in the datapath and an upcall will be required. So, in this case
EMC and
SMC lookups are disabled. We may optimize this path in the future by
bypassing the dpcls lookup too.
The performance improvement of this solution on 'simple match' flows
should be comparable with partial HW offloading, because it parses the
same packet fields and uses a similar flow lookup scheme.
However, unlike partial HW offloading, it works for all port types
including virtual ones.
Performance results when compared to EMC:
Test setup:
virtio-user OVS virtio-user
Testpmd1 ------------> pmd1 ------------> Testpmd2
(txonly) x<------ pmd2 <------------ (mac swap)
Single stream of 64-byte packets. Actions:
in_port=vhost0,actions=vhost1
in_port=vhost1,actions=vhost0
Stats collected from pmd1 and pmd2, so there are 2 scenarios:
Virt-to-Virt : Testpmd1 ------> pmd1 ------> Testpmd2.
Virt-to-NoCopy : Testpmd2 ------> pmd2 --->x Testpmd1.
Here the packet sent from pmd2 to Testpmd1 is always dropped, because
the virtqueue is full since Testpmd1 is in txonly mode and doesn't
receive any packets. This should be closer to the performance of a
VM-to-Phy scenario.
Test performed on machine with Intel Xeon CPU E5-2690 v4 @ 2.60GHz.
Table below represents improvement in throughput when compared to EMC.
+----------------+------------------------+------------------------+
| | Default (-g -O2) | "-Ofast -march=native" |
| Scenario +------------+-----------+------------+-----------+
| | GCC | Clang | GCC | Clang |
+----------------+------------+-----------+------------+-----------+
| Virt-to-Virt | +18.9% | +25.5% | +10.8% | +16.7% |
| Virt-to-NoCopy | +24.3% | +33.7% | +14.9% | +22.0% |
+----------------+------------+-----------+------------+-----------+
For Phy-to-Phy case performance improvement should be even higher, but
it's not the main use-case for this functionality. Performance
difference for the non-simple flows is within a margin of error.
Acked-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The last RX queue, from which the packet was received, is already
stored in the PMD context. So, we can get the netdev from it without
the expensive hash map lookup.
In my V2V testing this patch improves performance in case HW offload
and experimental APIs are enabled by about 3%. That narrows down the
performance difference with the case with experimental API disabled
to about 0.5%, which is way within a margin of error for that setup.
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Eli Britstein <elibr@nvidia.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Previously, when a user enabled PMD auto load balancer with
pmd-auto-lb="true", some conditions such as number of PMDs/RxQs
that were required for a rebalance to take place were checked.
If the configuration meant that a rebalance would not take place
then PMD ALB was logged as 'disabled' and not run.
Later, if the PMD/RxQ configuration changed whereby a rebalance
could be effective, PMD ALB was logged as 'enabled' and would run at
the appropriate time.
This worked ok from a functional view but it is unintuitive for the
user reading the logs.
e.g. with one PMD (PMD ALB would not be effective)
User enables ALB, but logs say it is disabled because it won't run.
$ ovs-vsctl set open_vSwitch . other_config:pmd-auto-lb="true"
|dpif_netdev|INFO|PMD auto load balance is disabled
No dry run takes place.
Add more PMDs (PMD ALB may be effective).
$ ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=50
|dpif_netdev|INFO|PMD auto load balance is enabled ...
Dry run takes place.
|dpif_netdev|DBG|PMD auto load balance performing dry run.
A better approach is to simply reflect back the user enable/disable
state in the logs and deal with whether the rebalance will be effective
when needed. That is the approach taken in this patch.
To cut down on unnecessary work, some basic checks are also made before
starting a PMD ALB dry run and debug logs can indicate this to the user.
e.g. with one PMD (PMD ALB would not be effective)
User enables ALB, and logs confirm the user has enabled it.
$ ovs-vsctl set open_vSwitch . other_config:pmd-auto-lb="true"
|dpif_netdev|INFO|PMD auto load balance is enabled...
No dry run takes place.
|dpif_netdev|DBG|PMD auto load balance nothing to do, not enough non-isolated PMDs or RxQs.
Add more PMDs (PMD ALB may be effective).
$ ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=50
Dry run takes place.
|dpif_netdev|DBG|PMD auto load balance performing dry run.
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: Sunil Pai G <sunil.pai.g@intel.com>
Reviewed-by: David Marchand <david.marchand@redhat.com>
Acked-by: Gaetan Rivet <grive@u256.net>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
When a PMD reload occurs, some PMD cycle measurements are reset.
In order to preserve the full cycles history of an Rxq, the RxQ
cycle measurements were not reset.
These are both used together to display the % of a PMD that an
RxQ is using in the pmd-rxq-show stat.
Resetting one and not the other can lead to some unintuitive
looking stats while the stats settle for the new config. In
some cases, it may appear like the RxQs are using >100% of a PMD.
This resolves when the stats settle for the new config, but
seeing RxQs apparently using >100% of a PMD may confuse a user
and lead them to think there is a bug.
To avoid this, reset the RxQ cycle measurements on PMD reload.
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Acked-by: Gaetan Rivet <grive@u256.net>
Acked-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
This patch adds a general way of viewing/configuring datapath
cache sizes, with an implementation for the netlink interface.
The ovs-dpctl/ovs-appctl show commands will display the
current cache sizes configured:
$ ovs-dpctl show
system@ovs-system:
lookups: hit:25 missed:63 lost:0
flows: 0
masks: hit:282 total:0 hit/pkt:3.20
cache: hit:4 hit-rate:4.54%
caches:
masks-cache: size:256
port 0: ovs-system (internal)
port 1: br-int (internal)
port 2: genev_sys_6081 (geneve: packet_type=ptap)
port 3: br-ex (internal)
port 4: eth2
port 5: sw0p1 (internal)
port 6: sw0p3 (internal)
A specific cache can be configured as follows:
$ ovs-appctl dpctl/cache-set-size DP CACHE SIZE
$ ovs-dpctl cache-set-size DP CACHE SIZE
For example to disable the cache do:
$ ovs-dpctl cache-set-size system@ovs-system masks-cache 0
Setting cache size successful, new size 0.
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Paolo Valerio <pvalerio@redhat.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The IP fragmentation engine may not only steal the packet but also
add more. For example, after receiving the last fragment, it will
add all previous fragments to the batch. Unfortunately, it will also
free the original last fragment and replace it with a copy.
This invalidates the 'packet_clone' pointer in the dpif_netdev_execute()
leading to the use-after-free:
==3525086==ERROR: AddressSanitizer: heap-use-after-free on
address 0x61600020439c at pc 0x000000688a6d
READ of size 1 at 0x61600020439c thread T0
#0 0x688a6c in dp_packet_swap ./lib/dp-packet.h:265:5
#1 0x68781d in dpif_netdev_execute lib/dpif-netdev.c:4103:9
#2 0x6675db in dpif_netdev_operate lib/dpif-netdev.c:4129:25
#3 0x691e5e in dpif_operate lib/dpif.c:1367:13
#4 0x692909 in dpif_execute lib/dpif.c:1321:9
#5 0x5b19c6 in packet_execute ofproto/ofproto-dpif.c:4991:5
#6 0x5a2861 in ofproto_packet_out_finish ofproto/ofproto.c:3662:5
#7 0x5a65c6 in do_bundle_commit ofproto/ofproto.c:8270:13
#8 0x5a0cae in handle_bundle_control ofproto/ofproto.c:8309:17
#9 0x59a476 in handle_single_part_openflow ofproto/ofproto.c:8593:16
#10 0x5877ac in handle_openflow ofproto/ofproto.c:8674:21
#11 0x6296f1 in ofconn_run ofproto/connmgr.c:1329:13
#12 0x62925d in connmgr_run ofproto/connmgr.c:356:9
#13 0x586904 in ofproto_run ofproto/ofproto.c:1879:5
#14 0x55c830 in bridge_run__ vswitchd/bridge.c:3251:9
#15 0x55c015 in bridge_run vswitchd/bridge.c:3310:5
#16 0x575f31 in main vswitchd/ovs-vswitchd.c:127:9
#17 0x7f01099d3492 in __libc_start_main (/lib64/libc.so.6+0x23492)
#18 0x47d96d in _start (vswitchd/ovs-vswitchd+0x47d96d)
0x61600020439c is located 28 bytes inside of 560-byte region
freed by thread T0 here:
#0 0x5177a8 in free (vswitchd/ovs-vswitchd+0x5177a8)
#1 0x6b17b6 in dp_packet_delete ./lib/dp-packet.h:256:9
#2 0x6afeee in ipf_extract_frags_from_batch lib/ipf.c:947:17
#3 0x6afd63 in ipf_preprocess_conntrack lib/ipf.c:1232:9
#4 0x946b2c in conntrack_execute lib/conntrack.c:1446:5
#5 0x67e3ed in dp_execute_cb lib/dpif-netdev.c:8277:9
#6 0x7097d7 in odp_execute_actions lib/odp-execute.c:865:17
#7 0x66409e in dp_netdev_execute_actions lib/dpif-netdev.c:8322:5
#8 0x6877ad in dpif_netdev_execute lib/dpif-netdev.c:4090:5
#9 0x6675db in dpif_netdev_operate lib/dpif-netdev.c:4129:25
#10 0x691e5e in dpif_operate lib/dpif.c:1367:13
#11 0x692909 in dpif_execute lib/dpif.c:1321:9
#12 0x5b19c6 in packet_execute ofproto/ofproto-dpif.c:4991:5
#13 0x5a2861 in ofproto_packet_out_finish ofproto/ofproto.c:3662:5
#14 0x5a65c6 in do_bundle_commit ofproto/ofproto.c:8270:13
#15 0x5a0cae in handle_bundle_control ofproto/ofproto.c:8309:17
#16 0x59a476 in handle_single_part_openflow ofproto/ofproto.c:8593:16
#17 0x5877ac in handle_openflow ofproto/ofproto.c:8674:21
#18 0x6296f1 in ofconn_run ofproto/connmgr.c:1329:13
#19 0x62925d in connmgr_run ofproto/connmgr.c:356:9
#20 0x586904 in ofproto_run ofproto/ofproto.c:1879:5
#21 0x55c830 in bridge_run__ vswitchd/bridge.c:3251:9
#22 0x55c015 in bridge_run vswitchd/bridge.c:3310:5
#23 0x575f31 in main vswitchd/ovs-vswitchd.c:127:9
#24 0x7f01099d3492 in __libc_start_main (/lib64/libc.so.6+0x23492)
The issue can be reproduced with system-userspace testsuite on the
'conntrack - IPv4 fragmentation with fragments specified' test.
Previously, there was a leak inside the IP fragmentation module that
kept the original segment, so 'packet_clone' remained a valid pointer.
But commit 803ed12e31b0 ("ipf: release unhandled packets from the batch")
fixed the leak, leading to the use-after-free.
Use the packet from the batch instead of 'packet_clone' to swap packet
content, avoiding the issue.
While investigating this problem, more issues were uncovered. One of
them is that the IP fragmentation engine can add more packets to the
batch, but there is no way to return them to the caller. Add an extra
branch for this case with a 'FIXME' comment in order to highlight the
issue. Another one is that the IP fragmentation engine will keep only
32 fragments, dropping all other fragments while refilling a batch,
but that should be fixed separately.
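The swap fix can be sketched as follows. This is a minimal standalone
illustration with hypothetical mini types ('pkt', 'pkt_batch',
'restore_from_batch'), not the actual dp_packet API: the point is to
swap with the packet still owned by the batch, never with a stale
pointer that batch processing may have freed.

```c
#include <assert.h>
#include <string.h>

struct pkt {
    char data[64];
};

struct pkt_batch {
    struct pkt *pkts[4];
    int count;
};

/* Swap the contents of two packets. */
static void pkt_swap(struct pkt *a, struct pkt *b)
{
    struct pkt tmp = *a;
    *a = *b;
    *b = tmp;
}

/* Safe pattern: after processing (which may replace or free packets it
 * was handed), swap with the packet currently in the batch, not with a
 * clone pointer captured before processing ran. */
static void restore_from_batch(struct pkt *orig, struct pkt_batch *batch)
{
    if (batch->count > 0) {
        pkt_swap(orig, batch->pkts[0]);
    }
}
```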
Fixes: 7e6b41ac8d9d ("dpif-netdev: Fix crash when PACKET_OUT is metered.")
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Aaron Conole <aconole@redhat.com>
The dpif is used, so remove the OVS_UNUSED flag.
Fixes: a7f33fdbfb67 ("conntrack: Support zone limits.")
Signed-off-by: linhuang <linhuang@ruijie.com.cn>
Reviewed-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
When a PACKET_OUT has an output port of OFPP_TABLE, and the rule
table includes a meter that causes the packet to be deleted,
execute with a clone of the packet, restoring the original packet
if it is changed by the execution.
Add tests to verify the original issue is fixed, and that the fix
doesn't break tunnel processing.
Reported-by: Tony van der Peet <tony.vanderpeet@alliedtelesis.co.nz>
Signed-off-by: Tony van der Peet <tony.vanderpeet@alliedtelesis.co.nz>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
This patch fixes a memory leak in the commands for DPIF and MFEX
get and set. To operate, the commands require a pmd_list, which is
currently not freed after it has been used. This issue was
identified by a static analysis tool.
Fixes: 3d8f47bc ("dpif-netdev: Add command line and function pointer for miniflow extract")
Fixes: abb807e2 ("dpif-netdev: Add command to switch dpif implementation.")
Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
Acked-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
This patch fixes a memory leak when the command
"dpif-netdev/subtable-lookup-prio-set" is run: the pmd_list
required to iterate over the PMD threads was not being freed.
This issue was identified by a static analysis tool.
Fixes: 3d018c3e ("dpif-netdev: Add subtable lookup prio set command.")
Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
Acked-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
This commit removes dead code from the MFEX set command, as
highlighted by static analysis.
Fixes: a395b132b7 ("dpif-netdev: Add packet count and core id paramters for study")
Signed-off-by: kumar Amber <kumar.amber@intel.com>
Acked-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
Association of a mark to a flow is done as part of its offload handling,
in the offloading thread. However, the PMD thread specifies whether an
offload request is an "add" or "modify" by the association of a mark to
the flow.
This is exposed to a race condition. A flow might be created with
actions that cannot be fully offloaded, for example flooding (before MAC
learning), and later modified to have actions that can be fully
offloaded. If the two requests are queued before the offload thread
handling, they are both marked as "add". When the offload thread handles
them, the first request is partially offloaded, and the second one is
ignored as the flow is already considered as offloaded.
Fix it by specifying add/modify of an offload request by the actual flow
state change, without relying on the mark.
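The fix can be sketched as follows. Names here are illustrative, not
the actual lib/dpif-netdev.c code: the racy variant classifies the
request by mark association as seen by the offload thread, while the
fixed variant records the flow state transition at enqueue time.

```c
#include <assert.h>
#include <stdbool.h>

enum offload_op { OP_ADD, OP_MOD };

struct flow_state {
    bool has_mark;  /* set only once the offload thread installs a mark */
};

/* Racy variant: classify by mark association.  If two requests are
 * queued before the offload thread runs, both observe has_mark == false
 * and both become OP_ADD, so the second is ignored as a duplicate. */
static enum offload_op op_by_mark(const struct flow_state *f)
{
    return f->has_mark ? OP_MOD : OP_ADD;
}

/* Fixed variant: classify by the actual flow state change, recorded by
 * the PMD thread when the request is enqueued. */
static enum offload_op op_by_flow_state(bool flow_existed_before)
{
    return flow_existed_before ? OP_MOD : OP_ADD;
}
```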
Fixes: 3c7330ebf036 ("netdev-offload-dpdk: Support offload of output action.")
Signed-off-by: Eli Britstein <elibr@nvidia.com>
Reviewed-by: Gaetan Rivet <gaetanr@nvidia.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The dp_netdev_flow_offload_main thread is asynchronous since the cited
commit. There might be a case where several modification requests for
the same flow are submitted before being handled. Then, if the first
handling fails, the rule for the flow is deleted and the mark is freed.
The following request should then not be handled as a modification, but
rather as an "add".
Fixes: 02bb2824e51d ("dpif-netdev: do hw flow offload in a thread")
Signed-off-by: Eli Britstein <elibr@nvidia.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Before flushing offloads of a removed port was supported by [1], it was
necessary to flush the 'marks'. In doing so, all offloads of the PMD are
removed, including the ones that are not related to the removed port and
that are not modified following this removal. As a result, such flows
are evicted from being offloaded and won't resume offloading.
As the PMD offload flush is not necessary, avoid it.
[1] 62d1c28e9ce0 ("dpif-netdev: Flush offload rules upon port deletion.")
Signed-off-by: Eli Britstein <elibr@nvidia.com>
Reviewed-by: Gaetan Rivet <gaetanr@nvidia.com>
Reviewed-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Atomic variables must be read atomically. Since we already have
'smc_enable_db' in the PMD context, use it from there to avoid
reading the atomic twice.
Also, 'smc_enable_db' is a global configuration, so there is no need
to read it per-port or per-rxq.
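The pattern can be sketched with C11 atomics. The names below are
illustrative, not the OVS atomic macros: the global flag is loaded
atomically once into the per-thread context, and the fast path then
uses the plain cached copy.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct pmd_ctx {
    bool smc_enable_db;  /* cached copy, refreshed once per iteration */
};

/* Global configuration, written from the main thread. */
static atomic_bool smc_enable_db_cfg = false;

/* Refresh the cached value with a single relaxed atomic load, instead
 * of issuing an atomic load for every port or rxq processed. */
static void pmd_ctx_refresh(struct pmd_ctx *ctx)
{
    ctx->smc_enable_db = atomic_load_explicit(&smc_enable_db_cfg,
                                              memory_order_relaxed);
}
```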
Fixes: 9ac84a1a3698 ("dpif-avx512: Add ISA implementation of dpif.")
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Cian Ferriter <cian.ferriter@intel.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The Open vSwitch kernel module uses the upcall mechanism to send
packets from kernel space to user space when it misses in the kernel
space flow table. The upcall sends packets via a Netlink socket.
Currently, a Netlink socket is created for every vport. In this way,
there is a 1:1 mapping between a vport and a Netlink socket.
When a packet is received by a vport, if it needs to be sent to
user space, it is sent via the corresponding Netlink socket.
This mechanism, with various iterations of the corresponding user
space code, has seen some limitations and issues:
* On systems with a large number of vports, there is correspondingly
a large number of Netlink sockets which can limit scaling.
(https://bugzilla.redhat.com/show_bug.cgi?id=1526306)
* Packet reordering on upcalls.
(https://bugzilla.redhat.com/show_bug.cgi?id=1844576)
* A thundering herd issue.
(https://bugzilla.redhat.com/show_bug.cgi?id=1834444)
This patch introduces an alternative, feature-negotiated, upcall
mode using a per-cpu dispatch rather than a per-vport dispatch.
In this mode, the Netlink socket to be used for the upcall is
selected based on the CPU of the thread that is executing the upcall.
In this way, it resolves the issues above as:
a) The number of Netlink sockets scales with the number of CPUs
rather than the number of vports.
b) Ordering per-flow is maintained as packets are distributed to
CPUs based on mechanisms such as RSS and flows are distributed
to a single user space thread.
c) Packets from a flow can only wake up one user space thread.
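The per-CPU socket selection in (a) can be sketched as follows. This is
an illustrative helper, not the kernel implementation: the upcall
socket index is derived from the CPU id of the executing thread, modulo
the number of allocated sockets, so the socket count scales with CPUs
rather than with vports.

```c
#include <assert.h>
#include <stdint.h>

/* Pick the Netlink socket for an upcall from the CPU executing it.
 * 'n_socks' is the number of sockets user space registered; guard
 * against zero so a misconfigured table cannot divide by zero. */
static uint32_t upcall_sock_idx(uint32_t cpu_id, uint32_t n_socks)
{
    return n_socks ? cpu_id % n_socks : 0;
}
```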
Reported-at: https://bugzilla.redhat.com/1844576
Signed-off-by: Mark Gray <mark.d.gray@redhat.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Users complained that per-rxq pmd usage was confusing: summing those
values per pmd would never reach 100%, even when increasing the traffic
load beyond pmd capacity.
This is because the dpif-netdev/pmd-rxq-show command only reports "pure"
rxq cycles, while some cycles are used in the pmd mainloop and add up to
the total pmd load.
dpif-netdev/pmd-stats-show does report per pmd load usage.
This load is measured since the last dpif-netdev/pmd-stats-clear call.
On the other hand, the per-rxq pmd usage reflects the pmd load on a 10s
sliding window, which makes the two non-trivial to correlate.
Gather per pmd busy cycles with the same periodicity and report the
difference as overhead in dpif-netdev/pmd-rxq-show so that we have all
info in a single command.
Example:
$ ovs-appctl dpif-netdev/pmd-rxq-show
pmd thread numa_id 1 core_id 3:
isolated : true
port: dpdk0 queue-id: 0 (enabled) pmd usage: 90 %
overhead: 4 %
pmd thread numa_id 1 core_id 5:
isolated : false
port: vhost0 queue-id: 0 (enabled) pmd usage: 0 %
port: vhost1 queue-id: 0 (enabled) pmd usage: 93 %
port: vhost2 queue-id: 0 (enabled) pmd usage: 0 %
port: vhost6 queue-id: 0 (enabled) pmd usage: 0 %
overhead: 6 %
pmd thread numa_id 1 core_id 31:
isolated : true
port: dpdk1 queue-id: 0 (enabled) pmd usage: 86 %
overhead: 4 %
pmd thread numa_id 1 core_id 33:
isolated : false
port: vhost3 queue-id: 0 (enabled) pmd usage: 0 %
port: vhost4 queue-id: 0 (enabled) pmd usage: 0 %
port: vhost5 queue-id: 0 (enabled) pmd usage: 92 %
port: vhost7 queue-id: 0 (enabled) pmd usage: 0 %
overhead: 7 %
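The relationship between "pmd usage" and "overhead" above can be
sketched with simple arithmetic. This is an illustration of the idea,
not the actual OVS accounting code: overhead is the pmd busy cycles not
attributed to any rxq, expressed as a percentage of total pmd cycles
over the same interval.

```c
#include <assert.h>
#include <stdint.h>

/* Express 'part' as an integer percentage of 'total'. */
static unsigned pct(uint64_t part, uint64_t total)
{
    return total ? (unsigned) (part * 100 / total) : 0;
}

/* Overhead %: busy cycles minus the sum of per-rxq cycles, clamped at
 * zero in case of measurement skew between the two counters. */
static unsigned overhead_pct(uint64_t busy_cycles, uint64_t rxq_cycles_sum,
                             uint64_t total_cycles)
{
    uint64_t overhead = busy_cycles > rxq_cycles_sum
                        ? busy_cycles - rxq_cycles_sum : 0;
    return pct(overhead, total_cycles);
}
```

With these numbers, a pmd that is busy 94% of the time with an rxq
accounting for 90% would report "pmd usage: 90 %" and "overhead: 4 %",
matching the shape of the output above.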
Signed-off-by: David Marchand <david.marchand@redhat.com>
Acked-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
Pinning an rxq to a PMD with pmd-rxq-affinity may be done for
various reasons such as reserving a full PMD for an rxq, or to
ensure that multiple rxqs from a port are handled on different PMDs.
Previously, pmd-rxq-affinity always isolated the PMD so that no other
rxqs could be assigned to it by OVS. There may be cases where there are
unused cycles on those pmds and the user would like OVS to also be able
to assign other rxqs to them.
Add an option to pin the rxq without isolating the PMD. The default
behaviour is unchanged: pin and isolate the PMD.
In order to pin and non-isolate:
ovs-vsctl set Open_vSwitch . other_config:pmd-rxq-isolate=false
Note this is available only with group assignment type, as pinning
conflicts with the operation of the other rxq assignment algorithms.
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: Sunil Pai G <sunil.pai.g@intel.com>
Acked-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
Add an rxq scheduling option that allows rxqs to be grouped
on a pmd based purely on their load.
The current default 'cycles' assignment sorts rxqs by measured
processing load and then assigns them to a list of round robin PMDs.
This helps to keep the rxqs that require most processing on different
cores but as it selects the PMDs in round robin order, it equally
distributes rxqs to PMDs.
'cycles' assignment has the advantage that it keeps the most loaded
rxqs off the same core while spreading rxqs across a broad range of
PMDs, to mitigate against changes to the traffic pattern.
'cycles' assignment has the disadvantage that, in order to make the
trade-off between optimising for the current traffic load and
mitigating against future changes, it tries to assign an equal number
of rxqs per PMD in a round robin manner, and this can lead to a less
than optimal balance of the processing load.
Now that PMD auto load balance can help mitigate future changes in
traffic patterns, a 'group' assignment can be used to assign rxqs based
on their measured cycles and the estimated running total of the PMDs.
In this case, there is no restriction on keeping an equal number of
rxqs per PMD, as the assignment is purely load based.
This means that one PMD may have a group of low load rxqs assigned to it
while another PMD has one high load rxq assigned to it, as that is the
best balance of their measured loads across the PMDs.
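The load-based placement can be sketched as a greedy assignment. This
is a simplified illustration, not the actual lib/dpif-netdev.c code:
rxqs are sorted by measured cycles in descending order, and each is
given to the pmd with the lowest estimated running total, with no cap
on rxqs per pmd.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* qsort comparator: sort uint64_t values in descending order. */
static int cmp_desc(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *) a, y = *(const uint64_t *) b;
    return x < y ? 1 : x > y ? -1 : 0;
}

/* rxq_cycles[i] is the measured load of rxq i (sorted in place);
 * pmd_load[j] accumulates the estimated running total of pmd j. */
static void group_assign(uint64_t *rxq_cycles, size_t n_rxqs,
                         uint64_t *pmd_load, size_t n_pmds)
{
    qsort(rxq_cycles, n_rxqs, sizeof rxq_cycles[0], cmp_desc);
    for (size_t i = 0; i < n_rxqs; i++) {
        /* Find the pmd with the lowest estimated load so far. */
        size_t least = 0;
        for (size_t j = 1; j < n_pmds; j++) {
            if (pmd_load[j] < pmd_load[least]) {
                least = j;
            }
        }
        pmd_load[least] += rxq_cycles[i];
    }
}
```

With one high-load rxq (90) and three low-load rxqs (10 each) over two
pmds, this yields one pmd carrying the single heavy rxq and the other
carrying the group of light ones, as described above.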
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: Sunil Pai G <sunil.pai.g@intel.com>
Acked-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
Previously, if pmd-rxq-affinity was used to pin an rxq to
a core that was not in pmd-cpu-mask, the rxq was not polled
and the user received a warning. This meant that no traffic
would be received from that rxq.
Now that pinned and non-pinned rxqs are assigned to PMDs in
a common call to rxq scheduling, if an invalid core is
selected in pmd-rxq-affinity the rxq can be assigned an
available PMD (if any).
A warning will still be logged as the requested core could
not be used.
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: Sunil Pai G <sunil.pai.g@intel.com>
Acked-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>