mirror of https://github.com/openvswitch/ovs synced 2025-08-30 13:58:14 +00:00
Commit Graph

19352 Commits

Author SHA1 Message Date
Eelco Chaudron
4056ae4875 ofp-flow: Skip flow reply if it exceeds the maximum message size.
Currently, if a flow reply results in a message that exceeds
the maximum reply size, it triggers an assertion in OVS. This can
happen when OVN uses OpenFlow15 to add large flows and they are then
read using OpenFlow10 with ovs-ofctl.

This patch prevents this and adds a test case to make sure the
code behaves as expected.

Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-01-31 21:11:58 +01:00
Paolo Valerio
77967b53fe conntrack: Check TCP state while testing established connections pick up.
When testing whether an established connection is picked up, it is
useful to verify that the protocol state matches the expectation,
that is, that it moves to ESTABLISHED. Otherwise, a code change could
break the TCP conn_update() in a way that makes it return
CT_UPDATE_VALID without moving to the correct state, leading to a
false positive.

Signed-off-by: Paolo Valerio <pvalerio@redhat.com>
Acked-by: Mike Pattrick <mkp@redhat.com>
Acked-by: Gaetan Rivet <grive@u256.net>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-01-31 21:09:33 +01:00
Ilya Maximets
6e13565dd3 ovsdb: transaction: Keep one entry in the transaction history.
If a single transaction exceeds the size of the whole database (e.g.,
a lot of rows got removed and new ones added), the transaction history
will be drained.  This leads to sending UUID_ZERO to the clients as
the last transaction id in the next monitor update, because the
monitor doesn't know the actual last transaction id.  On re-connect,
that will cause re-downloading of the whole database, since the
client's last_id will be out of sync.

One solution would be to store the last transaction ID separately
from the actual transactions, but that would require careful
management in cases where the database gets reset and the history
needs to be cleared.  Keep the one last transaction instead to avoid
the problem.  That should not be a big concern in terms of memory
consumption, because this last transaction will be removed from the
history once the next transaction appears.  It is also not a concern
for a fast re-sync, because this last transaction will not be used
for the monitor reply: either the client already has it, so there is
no need to send it, or it is a history miss.

The test is updated to not check the number of atoms if there is only
one transaction in the history.

Fixes: 317b1bfd7d ("ovsdb: Don't let transaction history grow larger than the database.")
Reported-at: https://bugzilla.redhat.com/2044621
Acked-by: Mike Pattrick <mkp@redhat.com>
Acked-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-01-31 21:05:20 +01:00
Ilya Maximets
3a05c63702 ovsdb-cs: Fix ignoring of the last id from the initial monitor reply.
Current code doesn't use the last id received in the initial monitor reply.
That may result in re-downloading the database content if the
re-connection happened after receiving the initial monitor reply,
but before receiving any other database updates.

Fixes: 1c337c43ac ("ovsdb-idl: Break into two layers.")
Reported-at: https://bugzilla.redhat.com/2044624
Acked-by: Mike Pattrick <mkp@redhat.com>
Acked-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-01-28 23:44:45 +01:00
Eelco Chaudron
dadd8357f2 ofproto-dpif: Fix issue with non-reversible actions on patch ports.
For patch ports, the is_last_action value is not propagated and is
always set to true. This causes non-reversible actions to modify the
packet, and the original content is not preserved when processing
the remaining actions.

This patch propagates the is_last_action flag for patch port related
actions. In addition, it fixes the general propagation of the
last-action flag to the individual actions.

It also fixes check_pkt_larger as a last action: it is a valid case
for the drop action, so it should not be skipped.

Fixes: feee58b95 ("ofproto-dpif-xlate: Keep track of the last action")
Fixes: 5b34f8fc3 ("Add a new OVS action check_pkt_larger")
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-01-25 22:41:42 +01:00
David Marchand
0a395a52d6 NEWS: Fix some typos.
The 'experimantal' typo got copy/pasted a few times.

Fixes: be56e063d0 ("netdev-offload-dpdk: Support tunnel pop action.")
Fixes: e098c2f966 ("netdev-dpdk-offload: Add vxlan pattern matching function.")
Fixes: 7617d0583c ("netdev-offload-dpdk: Add support for matching on gre fields.")
Signed-off-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-01-21 20:54:55 +01:00
Antonin Bas
5b3bb16b84 ovs-monitor-ipsec: Fix generated strongSwan ipsec.conf for IPv6.
Setting the local address to 0.0.0.0 (v4 address) while setting the
remote address to a v6 address results in an invalid configuration.

See https://github.com/strongswan/strongswan/discussions/821

Signed-off-by: Antonin Bas <antonin.bas@gmail.com>
Acked-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-01-21 18:45:23 +01:00
David Marchand
8723063c3c system-dpdk: Fix MFEX logs check.
Some warning logs must be waived when using the net/pcap DPDK driver.
Those logs can also appear with other DPDK drivers (like mlx5), and
since the tests in system-dpdk are not testing MTU and Rx checksum, we
might as well ignore those warnings from OVS.

Fixes: d446dcb7e0 ("system-dpdk: Refactor common logs matching.")
Signed-off-by: David Marchand <david.marchand@redhat.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Cian Ferriter <cian.ferriter@intel.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2022-01-20 15:48:34 +00:00
Wilson Peng
0506efbd0a datapath-windows: Pickup Ct tuple as CT lookup key in function OvsCtSetupLookupCtx
CT marks loaded in a non-first commit are lost on ovs-windows.  On
Linux OVS, the same flows set the CT mark successfully.

Currently, ovs-windows creates a new CT entry with the flowKey
(extracted from the packet itself) even if the packet has already gone
through a DNAT action in the first round of flow processing, so the
ct-mark set on the previous conntrack entry is lost.  With this fix,
the CT tuple src/dst addresses stored in the flowKey are used if their
value is not zero and the zone in the flowKey is the same as the input
zone.

The fix also adjusts the function OvsProcessDeferredActions to make it
clearer.

// DNAT flow
cookie=0x1040000000000, duration=950.326s, table=EndpointDNAT, n_packets=0, n_bytes=0, priority=200,tcp,reg3=0xc0a8fa2b,reg4=0x20050/0x7ffff actions=ct(commit,table=AntreaPolicyEgressRule,zone=65520,nat(dst=192.168.250.43:80),exec(load:0x1->NXM_NX_CT_MARK[2])
// Append ct_mark flow
cookie=0x1000000000000, duration=11980.701s, table=SNATConntrackCommit, n_packets=6, n_bytes=396, priority=200,ct_state=+new+trk,ip,reg0=0x200/0x200,reg4=0/0xc00000 actions=load:0x3->NXM_NX_REG4[22..23],ct(commit,table=SNATConntrackCommit,zone=65520,exec(load:0x1->NXM_NX_CT_MARK[4],load:0x1->NXM_NX_CT_MARK[5]))
// SNAT flow
cookie=0x1000000000000, duration=11980.701s, table=SNATConntrackCommit, n_packets=6, n_bytes=396, priority=200,ct_state=+new+trk,ip,reg0=0x600/0x600,reg4=0xc00000/0xc00000 actions=ct(commit,table=L2Forwarding,zone=65521,nat(src=192.168.250.1),exec(load:0x1->NXM_NX_CT_MARK[2]))

Reported-at: https://github.com/openvswitch/ovs-issues/issues/237
Signed-off-by: Wilson Peng <pweisong@vmware.com>
Signed-off-by: Alin-Gabriel Serdean <aserdean@ovn.org>
2022-01-20 02:55:15 +02:00
Ilya Maximets
c6f0b623e5 Prepare for post-2.17.0 (2.17.90).
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Aaron Conole <aconole@redhat.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
2022-01-19 02:33:05 +01:00
Ilya Maximets
280d8de05f Prepare for 2.17.0.
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Aaron Conole <aconole@redhat.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
2022-01-19 02:33:05 +01:00
Gaetan Rivet
f20abde5a2 netdev-dpdk: Remove rte-flow API access locks.
The rte_flow DPDK API was made thread-safe [1] in release 20.11.
Now that the DPDK offload provider in OVS is thread safe, remove the
locks.

[1]: http://mails.dpdk.org/archives/dev/2020-October/184251.html

Signed-off-by: Gaetan Rivet <grive@u256.net>
Reviewed-by: Eli Britstein <elibr@nvidia.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-01-19 01:35:19 +01:00
Gaetan Rivet
b0b6b7b465 dpif-netdev: Use one or more offload threads.
Read the user configuration in the netdev-offload module to modify the
number of threads used to manage hardware offload requests.

This allows processing insertion, deletion and modification
concurrently.

The offload thread structure was modified to contain all needed
elements.  This structure is instantiated once per requested thread,
and each instance is used separately.

Signed-off-by: Gaetan Rivet <grive@u256.net>
Reviewed-by: Eli Britstein <elibr@nvidia.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-01-19 01:35:19 +01:00
Gaetan Rivet
7daa503468 dpif-netdev: Replace port mutex by rwlock.
The port mutex protects the netdev mapping, which can be changed by
port addition or port deletion. HW offload operations can be
considered read operations on the port mapping itself. Use a rwlock to
differentiate between read and write operations, allowing concurrent
queries and offload insertions.

Because offload queries, deletions, and reconfigure_datapath() calls
all take the rdlock, the deadlock fixed by [1] is still avoided, as
the rdlock side is recursive as prescribed by the POSIX standard.
Executing 'reconfigure_datapath()' only requires the rdlock, but it is
sometimes executed in contexts where the wrlock is taken
('do_add_port()' and 'do_del_port()').

This means that the deadlock described in [2] is still valid and should
be mitigated. The rdlock is taken using 'tryrdlock()' during offload
queries, keeping the current behavior.

[1]: 81e89d5c26 ("dpif-netdev: Make datapath port mutex recursive.")

[2]: 12d0edd75e ("dpif-netdev: Avoid deadlock with offloading during PMD
     thread deletion.").

Signed-off-by: Gaetan Rivet <grive@u256.net>
Reviewed-by: Eli Britstein <elibr@nvidia.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-01-19 01:35:19 +01:00
Gaetan Rivet
d85b9230ac dpif-netdev: Make megaflow and mark mappings thread objects.
In later commits hardware offloads are managed in several threads.
Each offload is managed by a thread determined by its flow's 'mega_ufid'.

As megaflow to mark and mark to flow mappings are 1:1 and 1:N
respectively, then a single mark exists for a single 'mega_ufid', and
multiple flows uses the same 'mega_ufid'. Because the managing thread will
be chosen using the 'mega_ufid', then each mapping does not need to be
shared with other offload threads.

The mappings are kept as cmap as upcalls will sometimes query them before
enqueuing orders to the offload threads.

To prepare this change, move the mappings within the offload thread
structure.

Signed-off-by: Gaetan Rivet <grive@u256.net>
Reviewed-by: Eli Britstein <elibr@nvidia.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-01-19 01:35:19 +01:00
Gaetan Rivet
ec4ac62588 dpif-netdev: Use lockless queue to manage offloads.
The dataplane threads (PMDs) send offloading commands to a dedicated
offload management thread. The current implementation uses a lock,
and benchmarks show high contention on the queue in some cases.

Under high contention, the mutex more often makes the locking
thread yield and wait, which requires a syscall. This should be
avoided in a userland dataplane.

The mpsc-queue can be used instead. It uses fewer cycles and has
lower latency. Benchmarks show better behavior as multiple
revalidators and one or multiple PMDs write to a single queue
while another thread polls it.

One trade-off with the new scheme, however, is that the queue must be
polled from the offload thread. Without a mutex, a cond_wait
cannot be used for signaling. The offload thread implements
an exponential backoff and sleeps in short increments when no
data is available. This makes the thread yield, at the price of
some latency to manage offloads after an inactivity period.
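The backoff described above can be sketched as follows; the constants
and function name here are illustrative, not the actual dpif-netdev
code:

```c
#include <stdint.h>

/* Sketch: compute the next sleep duration for an idle poll loop,
 * doubling from a floor up to a cap, and resetting to busy polling
 * as soon as work is found.  Bounds are assumptions for the sketch. */
#define BACKOFF_MIN_US 1
#define BACKOFF_MAX_US 1000

static uint32_t next_backoff_us(uint32_t cur_us, int had_work)
{
    if (had_work) {
        return 0;               /* Busy: poll again immediately. */
    }
    if (cur_us == 0) {
        return BACKOFF_MIN_US;  /* First idle iteration. */
    }
    cur_us *= 2;                /* Still idle: double the sleep. */
    return cur_us > BACKOFF_MAX_US ? BACKOFF_MAX_US : cur_us;
}
```

The cap bounds the latency penalty after an inactivity period, while
the doubling keeps the idle thread mostly asleep.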

Signed-off-by: Gaetan Rivet <grive@u256.net>
Reviewed-by: Eli Britstein <elibr@nvidia.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-01-19 01:35:19 +01:00
Gaetan Rivet
b3e029f7c1 netdev-offload-dpdk: Protect concurrent offload destroy/query.
The rte_flow API in DPDK is now thread-safe for insertion and deletion.
It is not, however, safe for concurrent queries while an offload is
being inserted or deleted.

Insertion is not an issue as the rte_flow handle will be published to
other threads only once it has been inserted in the hardware, so the
query will only be able to proceed once it is already available.

For the deletion path however, offload status queries can be made while
an offload is being destroyed. This would create race conditions and
use-after-free if not properly protected.

As a pre-step before removing the OVS-level locks on the rte_flow API,
mutually exclude offload query and deletion from concurrent execution.

Signed-off-by: Gaetan Rivet <grive@u256.net>
Reviewed-by: Eli Britstein <elibr@nvidia.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-01-19 01:35:19 +01:00
Gaetan Rivet
54dcf60e6f netdev-offload-dpdk: Lock rte_flow map access.
Add a lock to access the ufid to rte_flow map.  This will protect it
from concurrent write accesses when multiple threads attempt it.

At this point, the reason for taking the lock is no longer to fulfill
the needs of the DPDK offload implementation. Rewrite the comments
to reflect this change. The lock is still needed to protect against
changes to the netdev port mapping.

Signed-off-by: Gaetan Rivet <grive@u256.net>
Reviewed-by: Eli Britstein <elibr@nvidia.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-01-19 01:35:19 +01:00
Gaetan Rivet
7851e602c0 netdev-offload-dpdk: Use per-thread HW offload stats.
The implementation of hardware offload counters is currently meant to
be managed by a single thread. Use the offload thread pool API to
manage one counter per thread.

Signed-off-by: Gaetan Rivet <grive@u256.net>
Reviewed-by: Eli Britstein <elibr@nvidia.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-01-19 01:35:19 +01:00
Gaetan Rivet
5b0aa55776 dpif-netdev: Execute flush from offload thread.
When a port is deleted, its offloads must be flushed.  The operation
runs in the thread that initiated it.  Offload data is thus accessed
jointly by the port deletion thread(s) and the offload thread, which
complicates the data access model.

To simplify this model, as a pre-step toward introducing parallel
offloads, execute the flush operation in the offload thread.

Signed-off-by: Gaetan Rivet <grive@u256.net>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-01-19 01:35:19 +01:00
Gaetan Rivet
d68d2ed466 dpif-netdev: Introduce tagged union of offload requests.
Offload requests currently only support flow offloads.
As a pre-step before supporting an offload flush request,
modify the layout of an offload request item to become a tagged union.

Future offload types won't be forced to re-use the full flow offload
structure, which consumes a lot of memory.

Signed-off-by: Gaetan Rivet <grive@u256.net>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-01-19 01:35:19 +01:00
Gaetan Rivet
73ecf098d2 dpif-netdev: Use id-fpool for mark allocation.
Use the netdev-offload multithread API to allow multiple threads
to allocate marks concurrently.

Initialize the pool only once in a multithreaded context by using
the ovsthread_once type.

Use the id-fpool module for faster concurrent ID allocation.

Signed-off-by: Gaetan Rivet <grive@u256.net>
Reviewed-by: Eli Britstein <elibr@nvidia.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-01-19 01:35:19 +01:00
Gaetan Rivet
528a8ab627 dpif-netdev: Postpone flow offload item freeing.
Profiling the HW offload thread, the flow offload freeing takes
approximately 25% of the time. Most of this time is spent waiting on
the futex used by the libc free(), as it triggers a syscall and
reschedules the thread.

Avoid the syscall and its expensive context switch. Batch the freeing
of offload messages using RCU.

Signed-off-by: Gaetan Rivet <grive@u256.net>
Reviewed-by: Eli Britstein <elibr@nvidia.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-01-19 01:35:19 +01:00
Gaetan Rivet
55dc4ef176 dpif-netdev: Quiesce offload thread periodically.
Similar to what was done for the PMD threads [1], reduce the performance
impact of quiescing too often in the offload thread.

After each processed offload, the offload thread currently quiesces and
syncs with RCU. This synchronization can be lengthy and make the
thread unnecessarily slow.

Instead, attempt to quiesce at most every 10 ms. While the queue is
empty, the offload thread remains quiescent.
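The interval check can be sketched as below; the names are
illustrative, not the actual dpif-netdev code:

```c
#include <stdint.h>

/* Sketch: only report that a quiescent point is due when at least
 * QUIESCE_INTERVAL_MS have elapsed since the last one, instead of
 * quiescing after every processed offload. */
#define QUIESCE_INTERVAL_MS 10

static int should_quiesce(int64_t now_ms, int64_t *last_quiesce_ms)
{
    if (now_ms - *last_quiesce_ms < QUIESCE_INTERVAL_MS) {
        return 0;               /* Too soon: keep processing. */
    }
    *last_quiesce_ms = now_ms;  /* Record this quiescent point. */
    return 1;
}
```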

[1]: 81ac8b3b19 ("dpif-netdev: Do RCU synchronization at fixed interval
     in PMD main loop.")

Signed-off-by: Gaetan Rivet <grive@u256.net>
Reviewed-by: Eli Britstein <elibr@nvidia.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-01-19 01:35:19 +01:00
Gaetan Rivet
62c2d8a675 netdev-offload: Add multi-thread API.
Expose functions reporting user configuration of offloading threads, as
well as utility functions for multithreading.

This only exposes the configuration knob to the user; no datapath
implements the multi-thread request yet.

This will allow implementations to use this API for offload thread
management in relevant layers before enabling the actual dataplane
implementation.

The offload thread ID is lazily allocated and can as such be in a
different order than the offload thread start sequence.

The RCU thread will sometimes access hardware-offload objects from
a provider for reclamation purposes.  In that case, it will get
a default offload thread ID of 0. Care must be taken that using
this thread ID is safe concurrently with the offload threads.

Signed-off-by: Gaetan Rivet <grive@u256.net>
Reviewed-by: Eli Britstein <elibr@nvidia.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-01-19 01:35:19 +01:00
Gaetan Rivet
2eac33c6cc id-fpool: Module for fast ID generation.
The current id-pool module is slow to allocate the
next valid ID, and can be optimized when restricting
some properties of the pool.

Those restrictions are:

  * No ability to add a random ID to the pool.

  * A new ID is no longer the smallest possible ID.
    It is however guaranteed to be in the range of

       [floor, last_alloc + nb_user * cache_size + 1],

    where 'cache_size' is the number of IDs in each per-user
    cache.  It is defined by 'ID_FPOOL_CACHE_SIZE' as 64.

  * A user should never free an ID that is not allocated.
    No checks are done, and doing so will duplicate the spurious
    ID.  Refcounting or another memory management scheme should
    be used to ensure that an object and its ID are only freed once.

This allocator is designed to scale reasonably well in multithread
setup.  As it is aimed at being a faster replacement to the current
id-pool, a benchmark has been implemented alongside unit tests.

The benchmark is composed of 4 rounds: 'new', 'del', 'mix', and 'rnd'.
Respectively

  + 'new': only allocate IDs
  + 'del': only free IDs
  + 'mix': allocate, sequential free, then allocate ID.
  + 'rnd': allocate, random free, allocate ID.

Randomized freeing is done by swapping the latest allocated ID with any
ID from the range of currently allocated IDs, which is reminiscent of
the Fisher-Yates shuffle.  This evaluates freeing non-sequential IDs,
which is the more natural use-case.
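The swap-with-last scheme can be sketched as follows; the function
name is illustrative, and the (random) index selection is left to the
caller to keep the sketch deterministic:

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch of the benchmark's randomized-free order: remove the ID at
 * index 'i' and move the last live ID into its slot, keeping the
 * array of currently allocated IDs dense, as in a Fisher-Yates
 * shuffle. */
static uint32_t take_id_at(uint32_t *ids, size_t *n_ids, size_t i)
{
    uint32_t id = ids[i];

    ids[i] = ids[--*n_ids];  /* Swap with last, then shrink. */
    return id;
}
```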

For this specific round, the id-pool performance is such that a timeout
of 10 seconds is added to the benchmark:

   $ ./tests/ovstest test-id-fpool benchmark 10000 1
   Benchmarking n=10000 on 1 thread.
    type\thread:       1    Avg
   id-fpool new:       1      1 ms
   id-fpool del:       1      1 ms
   id-fpool mix:       2      2 ms
   id-fpool rnd:       2      2 ms
    id-pool new:       4      4 ms
    id-pool del:       2      2 ms
    id-pool mix:       6      6 ms
    id-pool rnd:     431    431 ms

   $ ./tests/ovstest test-id-fpool benchmark 100000 1
   Benchmarking n=100000 on 1 thread.
    type\thread:       1    Avg
   id-fpool new:       2      2 ms
   id-fpool del:       2      2 ms
   id-fpool mix:       3      3 ms
   id-fpool rnd:       4      4 ms
    id-pool new:      12     12 ms
    id-pool del:       5      5 ms
    id-pool mix:      16     16 ms
    id-pool rnd:  10000+     -1 ms

   $ ./tests/ovstest test-id-fpool benchmark 1000000 1
   Benchmarking n=1000000 on 1 thread.
    type\thread:       1    Avg
   id-fpool new:      15     15 ms
   id-fpool del:      12     12 ms
   id-fpool mix:      34     34 ms
   id-fpool rnd:      48     48 ms
    id-pool new:     276    276 ms
    id-pool del:     286    286 ms
    id-pool mix:     448    448 ms
    id-pool rnd:  10000+     -1 ms

Running only a performance test on the fast pool:

   $ ./tests/ovstest test-id-fpool perf 1000000 1
   Benchmarking n=1000000 on 1 thread.
    type\thread:       1    Avg
   id-fpool new:      15     15 ms
   id-fpool del:      12     12 ms
   id-fpool mix:      34     34 ms
   id-fpool rnd:      47     47 ms

   $ ./tests/ovstest test-id-fpool perf 1000000 2
   Benchmarking n=1000000 on 2 threads.
    type\thread:       1      2    Avg
   id-fpool new:      11     11     11 ms
   id-fpool del:      10     10     10 ms
   id-fpool mix:      24     24     24 ms
   id-fpool rnd:      30     30     30 ms

   $ ./tests/ovstest test-id-fpool perf 1000000 4
   Benchmarking n=1000000 on 4 threads.
    type\thread:       1      2      3      4    Avg
   id-fpool new:       9     11     11     10     10 ms
   id-fpool del:       5      6      6      5      5 ms
   id-fpool mix:      16     16     16     16     16 ms
   id-fpool rnd:      20     20     20     20     20 ms

Signed-off-by: Gaetan Rivet <grive@u256.net>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-01-18 19:30:17 +01:00
Gaetan Rivet
5396ba5b21 mpsc-queue: Module for lock-free message passing.
Add a lockless multi-producer/single-consumer (MPSC), linked-list based,
intrusive, unbounded queue that does not require deferred memory
management.

The queue is designed to improve this specific MPSC setup.  A benchmark
accompanies the unit tests to measure the difference in this configuration.
A single reader thread polls the queue while N writers enqueue elements
as fast as possible.  The mpsc-queue is compared against the regular ovs-list
as well as the guarded list.  The latter usually offers a slight improvement
by batching the element removal, but the mpsc-queue is faster.

The 'Avg' column is the average of the producer threads' times:

   $ ./tests/ovstest test-mpsc-queue benchmark 3000000 1
   Benchmarking n=3000000 on 1 + 1 threads.
    type\thread:  Reader      1    Avg
     mpsc-queue:     167    167    167 ms
     list(spin):      89     80     80 ms
    list(mutex):     745    745    745 ms
   guarded list:     788    788    788 ms

   $ ./tests/ovstest test-mpsc-queue benchmark 3000000 2
   Benchmarking n=3000000 on 1 + 2 threads.
    type\thread:  Reader      1      2    Avg
     mpsc-queue:      98     97     94     95 ms
     list(spin):     185    171    173    172 ms
    list(mutex):     203    199    203    201 ms
   guarded list:     269    269    188    228 ms

   $ ./tests/ovstest test-mpsc-queue benchmark 3000000 3
   Benchmarking n=3000000 on 1 + 3 threads.
    type\thread:  Reader      1      2      3    Avg
     mpsc-queue:      76     76     65     76     72 ms
     list(spin):     246    110    240    238    196 ms
    list(mutex):     542    541    541    539    540 ms
   guarded list:     535    535    507    511    517 ms

   $ ./tests/ovstest test-mpsc-queue benchmark 3000000 4
   Benchmarking n=3000000 on 1 + 4 threads.
    type\thread:  Reader      1      2      3      4    Avg
     mpsc-queue:      73     68     68     68     68     68 ms
     list(spin):     294    275    279    277    282    278 ms
    list(mutex):     346    309    287    345    302    310 ms
   guarded list:     378    319    334    378    351    345 ms

Signed-off-by: Gaetan Rivet <grive@u256.net>
Reviewed-by: Eli Britstein <elibr@nvidia.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-01-18 19:30:17 +01:00
Gaetan Rivet
5878b92522 ovs-atomic: Expose atomic exchange operation.
The atomic exchange operation is a useful primitive that should be
available as well.  Most compilers already expose or offer a way
to use it, but a single symbol needs to be defined.
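The C11 form of the primitive is shown below; OVS wraps this behind
its own ovs-atomic macros, so this is the standard-library equivalent,
not the OVS API:

```c
#include <stdatomic.h>

/* Atomically store a new value and return the previous one in a
 * single indivisible operation. */
static int exchange_int(_Atomic int *p, int new_value)
{
    return atomic_exchange(p, new_value);
}
```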

Signed-off-by: Gaetan Rivet <grive@u256.net>
Reviewed-by: Eli Britstein <elibr@nvidia.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-01-18 19:30:17 +01:00
Gaetan Rivet
83823ae328 dpif-netdev: Implement hardware offloads stats query.
In the netdev datapath, keep track of the enqueued offloads between
the PMDs and the offload thread.  Additionally, query each netdev
for their hardware offload counters.

Signed-off-by: Gaetan Rivet <grive@u256.net>
Reviewed-by: Eli Britstein <elibr@nvidia.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-01-18 15:12:01 +01:00
Gaetan Rivet
9ac3d951b4 mov-avg: Add a moving average helper structure.
Add a new module offering a helper to compute the Cumulative
Moving Average (CMA) and the Exponential Moving Average (EMA)
of a series of values.

Use the new helpers to add latency metrics in dpif-netdev.
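The two averages can be sketched with their textbook definitions; the
mov-avg module's actual naming and internal representation may differ:

```c
/* Cumulative moving average: incorporate sample 'x' into a CMA that
 * currently covers 'n_prev' samples. */
static double cma_update(double cma, double x, unsigned long long n_prev)
{
    return cma + (x - cma) / (double) (n_prev + 1);
}

/* Exponential moving average with smoothing factor alpha in (0, 1]:
 * recent samples weigh more as alpha grows. */
static double ema_update(double ema, double x, double alpha)
{
    return alpha * x + (1.0 - alpha) * ema;
}
```

The CMA weighs all samples equally, while the EMA decays old samples
geometrically, which suits latency metrics that should track recent
behavior.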

Signed-off-by: Gaetan Rivet <grive@u256.net>
Reviewed-by: Eli Britstein <elibr@nvidia.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-01-18 15:12:01 +01:00
Gaetan Rivet
e4543c7b17 dpif-netdev: Rename offload thread structure.
The offload management in userspace is done through a separate thread.
The naming of the structure holding the objects used for synchronization
with the dataplane is generic and nondescript.

Clarify the structure's purpose by renaming it.

Signed-off-by: Gaetan Rivet <grive@u256.net>
Reviewed-by: Eli Britstein <elibr@nvidia.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-01-18 15:12:01 +01:00
Gaetan Rivet
9ab104718b dpctl: Add function to read hardware offload statistics.
Expose a function to query datapath offload statistics.
This function is separate from the current one in netdev-offload
as it exposes more detailed statistics from the datapath, instead of
only from the netdev-offload provider.

Each datapath is meant to use the custom counters as it sees fit for its
handling of hardware offloads.

Call the new API from dpctl.

Signed-off-by: Gaetan Rivet <grive@u256.net>
Reviewed-by: Eli Britstein <elibr@nvidia.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-01-18 15:12:01 +01:00
Gaetan Rivet
0e6366c239 netdev-offload-dpdk: Implement hw-offload statistics read.
In the DPDK offload provider, keep track of inserted rte_flow objects
and report the count when queried.

Signed-off-by: Gaetan Rivet <grive@u256.net>
Reviewed-by: Eli Britstein <elibr@nvidia.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-01-18 15:12:01 +01:00
Gaetan Rivet
adbd4301a2 netdev-offload-dpdk: Use per-netdev offload metadata.
Add a per-netdev offload data field as part of netdev hw_info structure.
Use this field in netdev-offload-dpdk to map offload metadata (ufid to
rte_flow). Use flow API deinit ops to destroy the per-netdev metadata
when deallocating a netdev. Use RCU primitives to ensure coherency
during port deletion.

Signed-off-by: Gaetan Rivet <grive@u256.net>
Reviewed-by: Eli Britstein <elibr@nvidia.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-01-18 15:12:01 +01:00
Gaetan Rivet
1088f4e7fb netdev: Add flow API uninit function.
Add a new operation allowing flow API providers to uninitialize
themselves when the API is disassociated from a netdev.

Signed-off-by: Gaetan Rivet <grive@u256.net>
Reviewed-by: Eli Britstein <elibr@nvidia.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-01-18 15:12:01 +01:00
Gaetan Rivet
aec1081c7d tests: Add ovs-barrier unit test.
No unit test currently exists for the ovs-barrier type.
It is however a crucial building block and should be verified to work
as expected.

Create a simple test verifying the basic function of ovs-barrier.
Integrate the test as part of the test suite.

Signed-off-by: Gaetan Rivet <grive@u256.net>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-01-18 15:12:01 +01:00
Gaetan Rivet
59b8f9f8f4 dpif-netdev: Rename flow offload thread.
ovs_strlcpy silently fails to copy the thread name if it is too long.
Rename the flow offload thread to differentiate it from the main thread.

Fixes: 02bb2824e5 ("dpif-netdev: do hw flow offload in a thread")
Signed-off-by: Gaetan Rivet <grive@u256.net>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-01-18 15:12:01 +01:00
Gaetan Rivet
6207205e58 ovs-thread: Fix barrier use-after-free.
When a thread is blocked on a barrier, there is no guarantee
regarding the moment it will resume, only that it will at some point in
the future.

One thread can resume first and proceed to destroy the barrier while
another thread has not yet awoken. When it finally does, the second
thread will attempt a seq_read() on the barrier seq that the first
thread has already destroyed, triggering a use-after-free.

Introduce an additional indirection layer within the barrier.
An internal barrier implementation holds all the elements necessary
for a thread to safely block and destroy. Whenever a barrier is
destroyed, the internal implementation is left available to any
still-blocked threads if necessary. A reference counter is used to
track threads still using the implementation.

Note that current uses of ovs-barrier are not affected: RCU and
revalidators will not destroy their barrier immediately after blocking
on it.
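The refcounted indirection can be sketched as below; this is a
simplified illustration with assumed names, not the ovs-thread code:

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Sketch: the internal implementation is refcounted, so whichever
 * thread drops the last reference reclaims it, even if the barrier
 * itself was already destroyed. */
struct barrier_impl {
    atomic_uint ref_cnt;
    /* seq, size, count, ... elided. */
};

static struct barrier_impl *impl_get(struct barrier_impl *impl)
{
    atomic_fetch_add(&impl->ref_cnt, 1);
    return impl;
}

/* Returns 1 if this call freed the implementation. */
static int impl_put(struct barrier_impl *impl)
{
    if (atomic_fetch_sub(&impl->ref_cnt, 1) == 1) {
        free(impl);  /* Last user reclaims the implementation. */
        return 1;
    }
    return 0;
}
```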

Fixes: d8043da718 ("ovs-thread: Implement OVS specific barrier.")
Signed-off-by: Gaetan Rivet <grive@u256.net>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-01-18 15:12:01 +01:00
Kevin Traynor
1b9fd884f5 Documentation: Remove experimental tag for PMD ALB.
PMD Auto Load Balance was introduced as an experimental feature in OVS
2.11. It detects when the Rx queue to PMD assignments are no longer
balanced and a reassignment would improve the balance.

It is disabled by default, and can be enabled with:
$ ovs-vsctl set open_vswitch . other_config:pmd-auto-lb="true"

Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: Sunil Pai G <sunil.pai.g@intel.com>
Acked-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-01-18 13:00:09 +01:00
Kevin Traynor
09192a815e Documentation: Update PMD Auto Load Balance section.
Updates to the PMD Auto Load Balance section to make it more readable.

No change to the core content.

Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: Sunil Pai G <sunil.pai.g@intel.com>
Reviewed-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-01-18 13:00:09 +01:00
Kevin Traynor
5cc0524351 Documentation: Update PMD thread statistics.
'pmd-perf-show' gives some extra information and has nicer
formatting than 'pmd-stats-show'.

Let the user know they can use that as well to get PMD stats.

Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: Sunil Pai G <sunil.pai.g@intel.com>
Reviewed-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-01-18 12:26:36 +01:00
Kevin Traynor
f0adea3fce Documentation: Minor spelling and grammar fixes.
Some minor spelling and grammar fixes in pmd.rst.

Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: Sunil Pai G <sunil.pai.g@intel.com>
Reviewed-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-01-18 12:26:36 +01:00
Kevin Traynor
4da71121da Documentation: Fix Rx/Tx queue configuration section.
ovs-vsctl is used to configure physical Rx queues, not ovs-appctl.

The number of Tx queues is configured differently depending on whether
the port is physical or virtual; the present documentation does not
distinguish between them.

Fixes: 31d0dae22a ("doc: Add "PMD" topic document")
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: Sunil Pai G <sunil.pai.g@intel.com>
Reviewed-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-01-18 12:26:36 +01:00
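For example, Rx queues on a physical DPDK port are set with ovs-vsctl (the interface name dpdk0 here is only an assumption for illustration):

```
$ ovs-vsctl set Interface dpdk0 options:n_rxq=4
```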
Eelco Chaudron
85d3785e6a utilities: Add netlink flow operation USDT probes and upcall_cost script.
This patch adds a series of NetLink flow operation USDT probes.
These probes are in turn used in the upcall_cost Python script,
which, in addition to some kernel tracepoints, gives insight into
the time spent processing upcalls.

Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Paolo Valerio <pvalerio@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-01-18 00:46:30 +01:00
Eelco Chaudron
51ec98635e utilities: Add upcall USDT probe and associated script.
Added the dpif_recv:recv_upcall USDT probe, which is used by the
included upcall_monitor.py script. This script receives all upcall
packets sent by the kernel to ovs-vswitchd. By default, it shows
all upcall events, which look something like this:

 TIME               CPU  COMM      PID      DPIF_NAME          TYPE PKT_LEN FLOW_KEY_LEN
 5952147.003848809  2    handler4  1381158  system@ovs-system  0    98      132
 5952147.003879643  2    handler4  1381158  system@ovs-system  0    70      160
 5952147.003914924  2    handler4  1381158  system@ovs-system  0    98      152

It can also dump the packet and NetLink content, and if required,
the packets can also be written to a pcap file.

Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Paolo Valerio <pvalerio@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-01-18 00:46:30 +01:00
Eelco Chaudron
ff4c712d45 Documentation: Add USDT documentation and bpftrace example.
Add the USDT documentation and a bpftrace example using the
bridge run USDT probes.

Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Paolo Valerio <pvalerio@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-01-18 00:46:30 +01:00
Eelco Chaudron
512fab8f21 openvswitch: Define the OVS_STATIC_TRACE() macro.
This patch defines the OVS_STATIC_TRACE() macro and, as an
example, adds two of these probes to the bridge run loop.

Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Paolo Valerio <pvalerio@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-01-18 00:46:30 +01:00
Eelco Chaudron
191013cae9 configure: Add --enable-usdt-probes option to enable USDT probes.
Allow inclusion of User Statically Defined Trace (USDT) probes
in the OVS binaries using the --enable-usdt-probes option to the
./configure script.

Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Paolo Valerio <pvalerio@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-01-18 00:46:07 +01:00
Maxime Coquelin
844f141814 dpif-netdev.at: Add test for Tx packet steering.
This patch introduces a new test for Tx packet steering modes.
The first part validates the static mode by checking that all
packets are transmitted on a single queue (single PMD thread);
it then repeats the test with hash-based packet steering enabled,
ensuring packets are transmitted on both queues.

Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Reviewed-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-01-17 18:07:00 +01:00
Maxime Coquelin
c18e707b2f dpif-netdev: Introduce hash-based Tx packet steering mode.
This patch adds a new hash Tx steering mode that distributes
traffic across all the Tx queues, regardless of the number of
PMD threads. It is useful for guests expecting traffic to be
distributed across all of their vCPUs.

The idea here is to re-use the 5-tuple hash of the packets,
already computed to build the flow batches (so it does not
provide flexibility in which fields are part of the hash).

There is also no user-configurable indirection table, given that
the feature is transparent to the guest. The queue selection is
just a modulo operation between the packet hash and the number
of Tx queues.

There are no (at least intentional) functional changes for the
existing XPS and static modes, and there should be no noticeable
performance change for them (only one more branch in the hot
path).

For the hash mode, performance could be impacted by locking when
multiple PMD threads are in use (as in XPS mode) and by the
second level of batching.

Regarding the batching, the existing Tx port output_pkts is not
modified, meaning that at most NETDEV_MAX_BURST packets can be
batched across all the Tx queues. A second level of batching is
done in dp_netdev_pmd_flush_output_on_port(), only for this hash
mode.

Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Reviewed-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-01-17 18:07:00 +01:00