2
0
mirror of https://github.com/openvswitch/ovs synced 2025-08-31 06:15:47 +00:00
Commit Graph

360 Commits

Author SHA1 Message Date
Frode Nordahl
36bad31829 json: Add yielding json create/destroy functions.
Creating and destroying JSON objects may be time consuming.

Add json_serialized_object_create_with_yield() and
json_destroy_with_yield() functions that make use of the
cooperative multitasking module to yield during processing,
allowing time sensitive tasks in other parts of the program
to be completed during processing.

We keep these new functions private to OVS by adding a new
lib/json.h header file.

The include guard in the public include/openvswitch/json.h is
updated to contain the OPENVSWITCH prefix to be in line with the
other public header files, allowing us to use the non-prefixed
version in our private lib/json.h.

Signed-off-by: Frode Nordahl <frode.nordahl@canonical.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2024-01-17 14:41:18 +01:00
Frode Nordahl
3c8a4e942d lib: Introduce cooperative multitasking module.
One of the goals of Open vSwitch is to be as resource efficient as
possible.  Core parts of the program has been implemented as
asynchronous state machines, and when absolutely necessary
additional threads are used.

Introduce cooperative multitasking module which allow us to
interleave important processing with long running tasks while
avoiding the additional resource consumption of threads and
complexity of asynchronous state machines.

We will use this module to ensure long running processing in the
OVSDB server does not interfere with stable maintenance of the
RAFT cluster in subsequent patches.

Suggested-by: Ilya Maximets <i.maximets@ovn.org>
Signed-off-by: Frode Nordahl <frode.nordahl@canonical.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2024-01-17 14:41:18 +01:00
Flavio Leitner
8b5fe2dc60 userspace: Add Generic Segmentation Offloading.
This provides a software implementation in the case
the egress netdev doesn't support segmentation in hardware.

The challenge here is to guarantee packet ordering in the
original batch that may be full of TSO packets. Each TSO
packet can go up to ~64kB, so with segment size of 1440
that means about 44 packets for each TSO. Each batch has
32 packets, so the total batch amounts to 1408 normal
packets.

The segmentation estimates the total number of packets
and then the total number of batches. Then allocate
enough memory and finally do the work.

Finally each batch is sent in order to the netdev.

Signed-off-by: Flavio Leitner <fbl@sysclose.org>
Co-authored-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Mike Pattrick <mkp@redhat.com>
Acked-by: Simon Horman <horms@ovn.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2023-12-02 01:33:37 +01:00
Ilya Maximets
723cd4c9be automake: Move build-aux EXTRA_DIST updates to their own file.
Otherwise it's hard to keep track of all the scripts we have.

Acked-by: Eelco Chaudron <echaudro@redhat.com>
Reviewed-By: Ihar Hrachyshka <ihar@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2023-10-31 19:34:44 +01:00
Ilya Maximets
1776aa17a9 sflow: Always enable _BSD_SOURCE.
sFlow library is using BSD-style types like u_char that require
_BSD_SOURCE to be defined.

Also adding _DEFAULT_SOURCE, because _BSD_SOURCE cannot be used
without it with glibc > 2.19:

  error: "_BSD_SOURCE and _SVID_SOURCE are deprecated,
          use _DEFAULT_SOURCE"

Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2023-08-23 13:46:14 +02:00
Ales Musil
a9ae73b916 ofp, dpif: Allow CT flush based on partial match.
Currently, the CT can be flushed by dpctl only by specifying
the whole 5-tuple.  This is not very convenient when there are
only some fields known to the user of CT flush.  Add new struct
ofp_ct_match which represents the generic filtering that can
be done for CT flush. The match is done only on fields that are
non-zero with exception to the icmp fields.

This allows the filtering just within dpctl, however it is a
preparation for OpenFlow extension.

Reported-at: https://bugzilla.redhat.com/2120546
Signed-off-by: Ales Musil <amusil@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2023-01-16 19:55:21 +01:00
Ilya Maximets
1dcc490d44 netdev-afxdp: Allow building with libxdp and newer libbpf.
AF_XDP functions was deprecated in libbpf 0.7 and moved to libxdp.
Functions bpf_get/set_link_xdp_id() was deprecated in libbpf 0.8
and replaced with bpf_xdp_query_id() and bpf_xdp_attach/detach().

Updating configuration and source code to accommodate above changes
and allow building OVS with AF_XDP support on newer systems:

 - Checking the version of libbpf by detecting availability
   of bpf_xdp_detach.

 - Checking availability of the libxdp in a system by looking
   for a library providing libxdp_strerror(), if libbpf is
   newer than 0.6.  And checking for xsk.h header provided by
   libxdp-dev[el].

 - Use xsk.h from libbpf if it is older than 0.7 and not linking
   with libxdp in this case as there are known incompatible
   versions of libxdp in distributions.

 - Check for the NEED_WAKEUP feature replaced with direct checking
   in the source code if XDP_USE_NEED_WAKEUP is defined.

 - Checking availability of bpf_xdp_query_id and bpf_xdp_detach
   and using them instead of deprecated APIs.  Fall back to old
   functions if not found.

 - Dropped LIBBPF_LDADD variable as it makes library and function
   detection much harder without providing any actual benefits.
   AC_SEARCH_LIBS is used instead and it allows use of AC_CHECK_FUNCS.

 - Header includes moved around to files where they are actually used.

 - Removed libelf dependency as it is not really used.

With these changes it should be possible to build OVS with either:

 - libbpf built from the kernel sources (5.19 or older).
 - libbpf < 0.7 provided in distributions.
 - libxdp and libbpf >= 0.7 provided in newer distributions.

While it is technically possible to build with libbpf 0.7+ without
libxdp at the moment we're not allowing that for a few reasons.
First, required functions in libbpf are deprecated and can be removed
in future releases.  Second, support for all these combinations makes
the detection code fairly complex.
AFAIK, most of the distributions packaging libbpf 0.7+ do package
libxdp as well.

libxdp added as a build dependency for Fedora build since all
supported versions of Fedora are packaging this library.

Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2023-01-03 16:06:21 +01:00
Dumitru Ceara
5a686267d3 lib: Add support for sets of UUIDs.
Part of the uuidset implementation is taken from the OVN codebase where
it was added via commit 0e77b3bcbfe2 ("ovn-northd-ddlog: New
implementation of ovn-northd based on ddlog.").

We now extend that, adding a few helpers and tests.

Co-authored-by: Leonid Ryzhyk <lryzhyk@vmware.com>
Signed-off-by: Leonid Ryzhyk <lryzhyk@vmware.com>
Co-authored-by: Justin Pettit <jpettit@ovn.org>
Signed-off-by: Justin Pettit <jpettit@ovn.org>
Co-authored-by: Ben Pfaff <blp@ovn.org>
Signed-off-by: Ben Pfaff <blp@ovn.org>
Signed-off-by: Dumitru Ceara <dceara@redhat.com>
Reviewed-by: Ales Musil <amusil@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-09-27 01:17:56 +02:00
Emma Finn
398f80fffc odp-execute: Add ISA implementation of pop_vlan action.
This commit adds the AVX512 implementation of the
pop_vlan action.

Signed-off-by: Emma Finn <emma.finn@intel.com>
Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
Co-authored-by: Harry van Haaren <harry.van.haaren@intel.com>
Acked-by: Sunil Pai G <sunil.pai.g@intel.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2022-07-15 11:40:37 +01:00
Emma Finn
1713fc0116 odp-execute: Add command to switch action implementation.
This commit adds a new command to allow the user to switch
the active action implementation at runtime.

Usage:
  $ ovs-appctl odp-execute/action-impl-set scalar

This commit also adds a new command to retrieve the list of available
action implementations. This can be used by to check what implementations
of actions are available and what implementation is active during runtime.

Usage:
   $ ovs-appctl odp-execute/action-impl-show

Added separate test-case for ovs-actions show/set commands:
odp-execute - actions implementation

Signed-off-by: Emma Finn <emma.finn@intel.com>
Signed-off-by: Kumar Amber <kumar.amber@intel.com>
Signed-off-by: Sunil Pai G <sunil.pai.g@intel.com>
Co-authored-by: Kumar Amber <kumar.amber@intel.com>
Co-authored-by: Sunil Pai G <sunil.pai.g@intel.com>
Acked-by: Harry van Haaren <harry.van.haaren@intel.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2022-07-15 11:39:20 +01:00
Emma Finn
95e4a35b0a odp-execute: Add function pointers to odp-execute for different action implementations.
This commit introduces the initial infrastructure required to allow
different implementations for OvS actions. The patch introduces action
function pointers which allows user to switch between different action
implementations available. This will allow for more performance and flexibility
so the user can choose the action implementation to best suite their use case.

Signed-off-by: Emma Finn <emma.finn@intel.com>
Acked-by: Harry van Haaren <harry.van.haaren@intel.com>
Acked-by: Sunil Pai G <sunil.pai.g@intel.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2022-07-15 11:38:16 +01:00
Kumar Amber
8cab30a9d2 dpif-netdev/mfex: Add AVX512 ipv6 traffic profiles.
Add AVX512 Ipv6 optimized profile for vlan/IPv6/UDP and
vlan/IPv6/TCP, IPv6/UDP and IPv6/TCP.

MFEX autovalidaton test-case already has the IPv6 support for
validating against the scalar mfex.

Signed-off-by: Kumar Amber <kumar.amber@intel.com>
Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
Co-authored-by: Harry van Haaren <harry.van.haaren@intel.com>
Acked-by: Cian Ferriter <cian.ferriter@intel.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2022-07-05 15:41:27 +01:00
David Marchand
fe171e4f10 dpif-netdev: Refactor AVX512 runtime checks.
As described in the bugzilla below, cpu_has_isa code may be compiled
with some AVX512 instructions in it, because cpu.c is built as part of
the libopenvswitchavx512.
This is a problem when this function (supposed to probe for AVX512
instructions availability) is invoked from generic OVS code, on older
CPUs that don't support them.

For the same reason, dpcls_subtable_avx512_gather_probe,
dp_netdev_input_outer_avx512_probe, mfex_avx512_probe and
mfex_avx512_vbmi_probe are potential runtime bombs and can't either be
built as part of libopenvswitchavx512.

Move cpu.c to be part of the "normal" libopenvswitch.
And move other helpers in generic OVS code.

Note:
- dpcls_subtable_avx512_gather_probe is split in two, because it also
  needs to do its own magic,
- while moving those helpers, prefer direct calls to cpu_has_isa and
  avoid cast to intermediate integer variables when a simple boolean
  is enough,

Fixes: 352b6c7116 ("dpif-lookup: add avx512 gather implementation.")
Fixes: abb807e27d ("dpif-netdev: Add command to switch dpif implementation.")
Fixes: 250ceddcc2 ("dpif-netdev/mfex: Add AVX512 based optimized miniflow extract")
Fixes: b366fa2f49 ("dpif-netdev: Call cpuid for x86 isa availability.")
Reported-at: https://bugzilla.redhat.com/2100393
Reported-by: Ales Musil <amusil@redhat.com>
Co-authored-by: Ales Musil <amusil@redhat.com>
Signed-off-by: Ales Musil <amusil@redhat.com>
Signed-off-by: David Marchand <david.marchand@redhat.com>
Acked-by: Sunil Pai G <sunil.pai.g@intel.com>
Acked-by: Ales Musil <amusil@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-06-29 11:27:09 +02:00
Cian Ferriter
cb1c640077 acinclude: Add seperate checks for AVX512 ISA.
Checking for each of the required AVX512 ISA separately will allow the
compiler to generate some AVX512 code where there is some support in the
compiler rather than only generating all AVX512 code when all of it is
supported or no AVX512 code at all.

For example, in GCC 4.9 where there is just support for AVX512F, this
patch will allow building the AVX512 DPIF.

Another example, in GCC 5 and 6, most AVX512 code can be generated, just
without AVX512VPOPCNTDQ support.

Signed-off-by: Cian Ferriter <cian.ferriter@intel.com>
Acked-by: Sunil Pai G <sunil.pai.g@intel.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-05-30 23:12:51 +02:00
Cian Ferriter
fb85ae4340 automake.mk: Remove -mavx512dq CFLAG from AVX512 library.
No instructions from the AVX512DQ ISA are used anywhere in OVS. Remove
this unnecessary CFLAG.

Signed-off-by: Cian Ferriter <cian.ferriter@intel.com>
Acked-by: Sunil Pai G <sunil.pai.g@intel.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-05-30 23:12:51 +02:00
Gaetan Rivet
2eac33c6cc id-fpool: Module for fast ID generation.
The current id-pool module is slow to allocate the
next valid ID, and can be optimized when restricting
some properties of the pool.

Those restrictions are:

  * No ability to add a random ID to the pool.

  * A new ID is no more the smallest possible ID.
    It is however guaranteed to be in the range of

       [floor, last_alloc + nb_user * cache_size + 1].

    where 'cache_size' is the number of ID in each per-user
    cache.  It is defined as 'ID_FPOOL_CACHE_SIZE' to 64.

  * A user should never free an ID that is not allocated.
    No checks are done and doing so will duplicate the spurious
    ID.  Refcounting or other memory management scheme should
    be used to ensure an object and its ID are only freed once.

This allocator is designed to scale reasonably well in multithread
setup.  As it is aimed at being a faster replacement to the current
id-pool, a benchmark has been implemented alongside unit tests.

The benchmark is composed of 4 rounds: 'new', 'del', 'mix', and 'rnd'.
Respectively

  + 'new': only allocate IDs
  + 'del': only free IDs
  + 'mix': allocate, sequential free, then allocate ID.
  + 'rnd': allocate, random free, allocate ID.

Randomized freeing is done by swapping the latest allocated ID with any
from the range of currently allocated ID, which is reminiscent of the
Fisher-Yates shuffle.  This evaluates freeing non-sequential IDs,
which is the more natural use-case.

For this specific round, the id-pool performance is such that a timeout
of 10 seconds is added to the benchmark:

   $ ./tests/ovstest test-id-fpool benchmark 10000 1
   Benchmarking n=10000 on 1 thread.
    type\thread:       1    Avg
   id-fpool new:       1      1 ms
   id-fpool del:       1      1 ms
   id-fpool mix:       2      2 ms
   id-fpool rnd:       2      2 ms
    id-pool new:       4      4 ms
    id-pool del:       2      2 ms
    id-pool mix:       6      6 ms
    id-pool rnd:     431    431 ms

   $ ./tests/ovstest test-id-fpool benchmark 100000 1
   Benchmarking n=100000 on 1 thread.
    type\thread:       1    Avg
   id-fpool new:       2      2 ms
   id-fpool del:       2      2 ms
   id-fpool mix:       3      3 ms
   id-fpool rnd:       4      4 ms
    id-pool new:      12     12 ms
    id-pool del:       5      5 ms
    id-pool mix:      16     16 ms
    id-pool rnd:  10000+     -1 ms

   $ ./tests/ovstest test-id-fpool benchmark 1000000 1
   Benchmarking n=1000000 on 1 thread.
    type\thread:       1    Avg
   id-fpool new:      15     15 ms
   id-fpool del:      12     12 ms
   id-fpool mix:      34     34 ms
   id-fpool rnd:      48     48 ms
    id-pool new:     276    276 ms
    id-pool del:     286    286 ms
    id-pool mix:     448    448 ms
    id-pool rnd:  10000+     -1 ms

Running only a performance test on the fast pool:

   $ ./tests/ovstest test-id-fpool perf 1000000 1
   Benchmarking n=1000000 on 1 thread.
    type\thread:       1    Avg
   id-fpool new:      15     15 ms
   id-fpool del:      12     12 ms
   id-fpool mix:      34     34 ms
   id-fpool rnd:      47     47 ms

   $ ./tests/ovstest test-id-fpool perf 1000000 2
   Benchmarking n=1000000 on 2 threads.
    type\thread:       1      2    Avg
   id-fpool new:      11     11     11 ms
   id-fpool del:      10     10     10 ms
   id-fpool mix:      24     24     24 ms
   id-fpool rnd:      30     30     30 ms

   $ ./tests/ovstest test-id-fpool perf 1000000 4
   Benchmarking n=1000000 on 4 threads.
    type\thread:       1      2      3      4    Avg
   id-fpool new:       9     11     11     10     10 ms
   id-fpool del:       5      6      6      5      5 ms
   id-fpool mix:      16     16     16     16     16 ms
   id-fpool rnd:      20     20     20     20     20 ms

Signed-off-by: Gaetan Rivet <grive@u256.net>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-01-18 19:30:17 +01:00
Gaetan Rivet
5396ba5b21 mpsc-queue: Module for lock-free message passing.
Add a lockless multi-producer/single-consumer (MPSC), linked-list based,
intrusive, unbounded queue that does not require deferred memory
management.

The queue is designed to improve the specific MPSC setup.  A benchmark
accompanies the unit tests to measure the difference in this configuration.
A single reader thread polls the queue while N writers enqueue elements
as fast as possible.  The mpsc-queue is compared against the regular ovs-list
as well as the guarded list.  The latter usually offers a slight improvement
by batching the element removal, however the mpsc-queue is faster.

The average is of each producer threads time:

   $ ./tests/ovstest test-mpsc-queue benchmark 3000000 1
   Benchmarking n=3000000 on 1 + 1 threads.
    type\thread:  Reader      1    Avg
     mpsc-queue:     167    167    167 ms
     list(spin):      89     80     80 ms
    list(mutex):     745    745    745 ms
   guarded list:     788    788    788 ms

   $ ./tests/ovstest test-mpsc-queue benchmark 3000000 2
   Benchmarking n=3000000 on 1 + 2 threads.
    type\thread:  Reader      1      2    Avg
     mpsc-queue:      98     97     94     95 ms
     list(spin):     185    171    173    172 ms
    list(mutex):     203    199    203    201 ms
   guarded list:     269    269    188    228 ms

   $ ./tests/ovstest test-mpsc-queue benchmark 3000000 3
   Benchmarking n=3000000 on 1 + 3 threads.
    type\thread:  Reader      1      2      3    Avg
     mpsc-queue:      76     76     65     76     72 ms
     list(spin):     246    110    240    238    196 ms
    list(mutex):     542    541    541    539    540 ms
   guarded list:     535    535    507    511    517 ms

   $ ./tests/ovstest test-mpsc-queue benchmark 3000000 4
   Benchmarking n=3000000 on 1 + 4 threads.
    type\thread:  Reader      1      2      3      4    Avg
     mpsc-queue:      73     68     68     68     68     68 ms
     list(spin):     294    275    279    277    282    278 ms
    list(mutex):     346    309    287    345    302    310 ms
   guarded list:     378    319    334    378    351    345 ms

Signed-off-by: Gaetan Rivet <grive@u256.net>
Reviewed-by: Eli Britstein <elibr@nvidia.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-01-18 19:30:17 +01:00
Gaetan Rivet
9ac3d951b4 mov-avg: Add a moving average helper structure.
Add a new module offering a helper to compute the Cumulative
Moving Average (CMA) and the Exponential Moving Average (EMA)
of a series of values.

Use the new helpers to add latency metrics in dpif-netdev.

Signed-off-by: Gaetan Rivet <grive@u256.net>
Reviewed-by: Eli Britstein <elibr@nvidia.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-01-18 15:12:01 +01:00
David Marchand
b366fa2f49 dpif-netdev: Call cpuid for x86 isa availability.
DPIF AVX512 optimizations currently rely on DPDK availability while
they can be used without DPDK.
Besides, checking for availability of some isa only has to be done once
and won't change while a OVS process runs.

Resolve isa availability in constructors by using a simplified query
based on cpuid API that comes from the compiler.

Note: this also fixes the check on BMI2 availability: DPDK had a bug
for this isa, see https://git.dpdk.org/dpdk/commit/?id=aae3037ab1e0.

Suggested-by: Ilya Maximets <i.maximets@ovn.org>
Signed-off-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-01-03 18:45:40 +01:00
Ilya Maximets
c2fb5bdae6 ovs-actions: Convert man page from xml to rST.
This way it's easier to show it on a website as it will be updated
automatically along with the rest of the documentation.

Sphinx doesn't render everything perfectly, but it looks good enough
in both man and html versions.  rST is a bit easier to read and it
takes less space.

Conversion performed manually since I didn't found any good tool
that can actually make the process any faster.

Along the way I replaced versions like x.y.90 with x.y+1, because
it doesn't seem correct to me to refer non-released versions of OVS
in the docs.  Fixed a couple of small mistakes like duplicated
paragraph and reference to a different section by incorrect name.
Also removed bits of xml->nroff conversion code that is not needed
anymore.

Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Roi Dayan <roid@nvidia.com>
2021-08-31 20:56:47 +02:00
Mark Gray
b1e517bd2f dpif-netlink: Introduce per-cpu upcall dispatch.
The Open vSwitch kernel module uses the upcall mechanism to send
packets from kernel space to user space when it misses in the kernel
space flow table. The upcall sends packets via a Netlink socket.
Currently, a Netlink socket is created for every vport. In this way,
there is a 1:1 mapping between a vport and a Netlink socket.
When a packet is received by a vport, if it needs to be sent to
user space, it is sent via the corresponding Netlink socket.

This mechanism, with various iterations of the corresponding user
space code, has seen some limitations and issues:

* On systems with a large number of vports, there is correspondingly
a large number of Netlink sockets which can limit scaling.
(https://bugzilla.redhat.com/show_bug.cgi?id=1526306)
* Packet reordering on upcalls.
(https://bugzilla.redhat.com/show_bug.cgi?id=1844576)
* A thundering herd issue.
(https://bugzilla.redhat.com/show_bug.cgi?id=1834444)

This patch introduces an alternative, feature-negotiated, upcall
mode using a per-cpu dispatch rather than a per-vport dispatch.

In this mode, the Netlink socket to be used for the upcall is
selected based on the CPU of the thread that is executing the upcall.
In this way, it resolves the issues above as:

a) The number of Netlink sockets scales with the number of CPUs
rather than the number of vports.
b) Ordering per-flow is maintained as packets are distributed to
CPUs based on mechanisms such as RSS and flows are distributed
to a single user space thread.
c) Packets from a flow can only wake up one user space thread.

Reported-at: https://bugzilla.redhat.com/1844576
Signed-off-by: Mark Gray <mark.d.gray@redhat.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2021-07-16 20:05:03 +02:00
Harry van Haaren
250ceddcc2 dpif-netdev/mfex: Add AVX512 based optimized miniflow extract
This commit adds AVX512 implementations of miniflow extract.
By using the 64 bytes available in an AVX512 register, it is
possible to convert a packet to a miniflow data-structure in
a small quantity instructions.

The implementation here probes for Ether()/IP()/UDP() traffic,
and builds the appropriate miniflow data-structure for packets
that match the probe.

The implementation here is auto-validated by the miniflow
extract autovalidator, hence its correctness can be easily
tested and verified.

Note that this commit is designed to easily allow addition of new
traffic profiles in a scalable way, without code duplication for
each traffic profile.

Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2021-07-16 11:31:42 +01:00
Kumar Amber
72dd22a0df dpif-netdev: Add study function to select the best mfex function
The study function runs all the available implementations
of miniflow_extract and makes a choice whose hitmask has
maximum hits and sets the mfex to that function.

Study can be run at runtime using the following command:

$ ovs-appctl dpif-netdev/miniflow-parser-set study

Signed-off-by: Kumar Amber <kumar.amber@intel.com>
Co-authored-by: Harry van Haaren <harry.van.haaren@intel.com>
Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2021-07-16 10:29:52 +01:00
Kumar Amber
3d8f47bc04 dpif-netdev: Add command line and function pointer for miniflow extract
This patch introduces the MFEX function pointers which allows
the user to switch between different miniflow extract implementations
which are provided by the OVS based on optimized ISA CPU.

The user can query for the available minflow extract variants available
for that CPU by following commands:

$ovs-appctl dpif-netdev/miniflow-parser-get

Similarly an user can set the miniflow implementation by the following
command :

$ ovs-appctl dpif-netdev/miniflow-parser-set name

This allows for more performance and flexibility to the user to choose
the miniflow implementation according to the needs.

Signed-off-by: Kumar Amber <kumar.amber@intel.com>
Co-authored-by: Harry van Haaren <harry.van.haaren@intel.com>
Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2021-07-16 09:56:58 +01:00
Harry van Haaren
abb807e27d dpif-netdev: Add command to switch dpif implementation.
This commit adds a new command to allow the user to switch
the active DPIF implementation at runtime. A probe function
is executed before switching the DPIF implementation, to ensure
the CPU is capable of running the ISA required. For example, the
below code will switch to the AVX512 enabled DPIF assuming
that the runtime CPU is capable of running AVX512 instructions:

 $ ovs-appctl dpif-netdev/dpif-impl-set dpif_avx512

A new configuration flag is added to allow selection of the
default DPIF. This is useful for running the unit-tests against
the available DPIF implementations, without modifying each unit test.

The design of the testing & validation for ISA optimized DPIF
implementations is based around the work already upstream for DPCLS.
Note however that a DPCLS lookup has no state or side-effects, allowing
the auto-validator implementation to perform multiple lookups and
provide consistent statistic counters.

The DPIF component does have state, so running two implementations in
parallel and comparing output is not a valid testing method, as there
are changes in DPIF statistic counters (side effects). As a result, the
DPIF is tested directly against the unit-tests.

Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
Co-authored-by: Cian Ferriter <cian.ferriter@intel.com>
Signed-off-by: Cian Ferriter <cian.ferriter@intel.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2021-07-09 17:13:24 +01:00
Harry van Haaren
9ac84a1a36 dpif-avx512: Add ISA implementation of dpif.
This commit adds the AVX512 implementation of DPIF functionality,
specifically the dp_netdev_input_outer_avx512 function. This function
only handles outer (no re-circulations), and is optimized to use the
AVX512 ISA for packet batching and other DPIF work.

Sparse is not able to handle the AVX512 intrinsics, causing compile
time failures, so it is disabled for this file.

Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
Co-authored-by: Cian Ferriter <cian.ferriter@intel.com>
Signed-off-by: Cian Ferriter <cian.ferriter@intel.com>
Co-authored-by: Kumar Amber <kumar.amber@intel.com>
Signed-off-by: Kumar Amber <kumar.amber@intel.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2021-07-09 17:13:12 +01:00
Harry van Haaren
5930dfeeb2 dpif-netdev: Refactor to multiple header files.
Split the very large file dpif-netdev.c and the datastructures
it contains into multiple header files. Each header file is
responsible for the datastructures of that component.

This logical split allows better reuse and modularity of the code,
and reduces the very large file dpif-netdev.c to be more managable.

Due to dependencies between components, it is not possible to
move component in smaller granularities than this patch.

To explain the dependencies better, eg:

DPCLS has no deps (from dpif-netdev.c file)
FLOW depends on DPCLS (struct dpcls_rule)
DFC depends on DPCLS (netdev_flow_key) and FLOW (netdev_flow_key)
THREAD depends on DFC (struct dfc_cache)

DFC_PROC depends on THREAD (struct pmd_thread)

DPCLS lookup.h/c require only DPCLS
DPCLS implementations require only dpif-netdev-lookup.h.
- This change was made in 2.12 release with function pointers
- This commit only refactors the name to "private-dpcls.h"

netdev_flow_key_equal_mf() is renamed to emc_flow_key_equal_mf().

Rename functions specific to dpcls from netdev_* namespace to the
dpcls_* namespace, as they are only used by dpcls code.

'inline' is added to the dp_netdev_flow_hash() when it is moved
definition to fix a compiler error.

One valid checkpatch issue with the use of the
EMC_FOR_EACH_POS_WITH_HASH() macro was fixed.

Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
Co-authored-by: Cian Ferriter <cian.ferriter@intel.com>
Signed-off-by: Cian Ferriter <cian.ferriter@intel.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2021-07-09 17:12:45 +01:00
Jaime Caamaño Ruiz
0c09952382 stream-ssl: Remove unsafe 1024 bit dh params
Using 1024 bit params for DH is considered unsafe [1]. Additionally,
from [2]:

"Modern servers that do not support export ciphersuites are advised to
either use SSL_CTX_set_tmp_dh() or alternatively, use the callback but
ignore keylength and is_export and simply supply at least 2048-bit
parameters in the callback."

Additionally, using 1024 bit dh params may block clients running on
recent openssl version from connecting given the stricter default
security requirements of those new openssl versions. The error message
for these clients looks like:

error:141A318A:SSL routines:tls_process_ske_dhe:dh key too small:ssl/statem/statem_clnt.c:2150

As a workaround, this error can be suppressed tweaking the cipher list
(--ssl-ciphers) to either 'HIGH:!aNULL:!MD5:@SECLEVEL=1' to reduce
security requirements or 'HIGH:!aNULL:!MD5:!DH' to avoid using fixed
param DH based ciphers. The first option is recommended though as it
likely a fixed param DH cipher is the best possible option in that
situation.

[1] https://weakdh.org/
[2] https://www.openssl.org/docs/man1.1.1/man3/SSL_CTX_set_tmp_dh_callback.html

Signed-off-by: Jaime Caamaño Ruiz <jcaamano@redhat.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
2021-07-07 13:25:32 -07:00
Ilya Maximets
fae1ae0434 stream: Add record/replay functionality.
For debugging purposes it is useful to be able to record all the
incoming transactions and commands and replay them locally under
debugger or with additional logging enabled.  This patch introduces
ability to record all the incoming stream data and replay it via new
stream provider named 'stream-replay'.  During the record phase all
the incoming stream data written to special replay_* files in the
application rundir.  On replay phase instead of opening real streams
application will open replay_* files and read all the incoming data
directly from them.

If enabled for ovsdb-server, for example, this allows to record all
the connections and transactions from the big setup and replay them
locally afterwards to debug the behaviour or test performance.

To start application in recording mode there is a --record cmdline
option. --replay is to replay previously recorded streams.

Current version doesn't work well with time-based stream events like
inactivity probes or any other events generated internally.  This is
a point for further improvement.

Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Dumitru Ceara <dceara@redhat.com>
2021-06-07 21:03:16 +02:00
Ilya Maximets
610ac1e82c ovs-replay: New library to create and manage replay files.
This library provides interfaces to open replay files and
read/write records.  Will be used later for stream record/replay
functionality, i.e. to record all the incoming connections and
data and replay it later for debugging and performance analysis
purposes.

Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Dumitru Ceara <dceara@redhat.com>
2021-06-07 20:59:34 +02:00
Aidan Shribman
23f9ec9eb0 make: don't prompt during build
When running repeated builds using `make build` you get prompts in cases
the `mv` command is about to overwrite a file which is write-protect.
This patch forced the `mv` w/o prompting for approval.

Signed-off-by: Aidan Shribman <aidan.shribman@gmail.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
2021-04-08 08:57:37 -07:00
Ben Pfaff
a5c067a8b9 ovsdb-cs: New module that factors out code from ovsdb-idl.
This new module has a single direct user now.  In the future, it
will also be used by OVN.

Signed-off-by: Ben Pfaff <blp@ovn.org>
Acked-by: Ilya Maximets <i.maximets@ovn.org>
2021-01-21 15:09:05 -08:00
Mark Gray
d409f50062 python: Update build system to ensure dirs.py is created.
Update build system to ensure dirs.py is created when it is a
dependency for a build target. Also, update setup.py to
check for that dependency.

Fixes: 943c4a3250 ("python: set ovs.dirs variables with build system values")
Signed-off-by: Mark Gray <mark.d.gray@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2020-11-26 12:18:57 +01:00
Mark Gray
943c4a3250 python: set ovs.dirs variables with build system values
ovs/dirs.py should be auto-generated using the template
ovs/dirs.py.template at build time. This will set the
ovs.dirs python variables with a value specified by the
environment or, if the environment variable is not set, from
the build system.

Signed-off-by: Mark Gray <mark.d.gray@redhat.com>
Acked-By: Timothy Redaelli <tredaelli@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2020-11-16 15:47:43 +00:00
Harry van Haaren
ba5e311782 dpif-netdev/avx512: add -fPIC flag to enable shared builds
In certain scenarios with OVS built with --enable-shared and
DPDK enabled as shared build too, Position Independant Code
is required to link the avx512.a file into the relocatable .so
that it must be linked into.

Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2020-08-05 15:57:34 +01:00
Harry van Haaren
4ed57c5028 dpif-netdev/avx512: avoid compiling avx512 code if binutils check fails
This commit avoids compiling and linking of avx512 code into the
vswitch_la library if the binutils check fails. This avoids compiling
code into OVS that will not be executed due to binutils issue.

Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2020-08-05 15:57:24 +01:00
David Marchand
9af9dbcec9 dpdk: Add commands to configure log levels.
Enabling debug logs in dpdk can be a challenge to be sure of what is
actually enabled, add commands to list and change those log levels.
However, these commands do not help when tracking issues in dpdk init
itself: dump log levels right after init.

Example:
$ ovs-appctl dpdk/log-list
global log level is debug
id 0: lib.eal, level is info
id 1: lib.malloc, level is info
id 2: lib.ring, level is info
id 3: lib.mempool, level is info
id 4: lib.timer, level is info
id 5: pmd, level is info
[...]
id 37: pmd.net.bnxt.driver, level is notice
id 38: pmd.net.e1000.init, level is notice
id 39: pmd.net.e1000.driver, level is notice
id 40: pmd.net.enic, level is info
[...]

$ ovs-appctl dpdk/log-set debug pmd.*:notice
$ ovs-appctl dpdk/log-list
global log level is debug
id 0: lib.eal, level is debug
id 1: lib.malloc, level is debug
id 2: lib.ring, level is debug
id 3: lib.mempool, level is debug
id 4: lib.timer, level is debug
id 5: pmd, level is debug
[...]
id 37: pmd.net.bnxt.driver, level is notice
id 38: pmd.net.e1000.init, level is notice
id 39: pmd.net.e1000.driver, level is notice
id 40: pmd.net.enic, level is notice
[...]

Signed-off-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2020-07-17 01:56:56 +02:00
Harry van Haaren
352b6c7116 dpif-lookup: add avx512 gather implementation.
This commit adds an AVX-512 dpcls lookup implementation.
It uses the AVX-512 SIMD ISA to perform multiple miniflow
operations in parallel.

To run this implementation, the "avx512f" and "bmi2" ISAs are
required. These ISA checks are performed at runtime while
probing the subtable implementation. If a CPU does not provide
both "avx512f" and "bmi2", then this code does not execute.

The avx512 code is built as a separate static library, with added
CFLAGS to enable the required ISA features. By building only this
static library with avx512 enabled, it is ensured that the main OVS
core library is *not* using avx512, and that OVS continues to run
as before on CPUs that do not support avx512.

The approach taken in this implementation is to use the
gather instruction to access the packet miniflow, allowing
any miniflow blocks to be loaded into an AVX-512 register.
This maximizes the usefulness of the register, and hence this
implementation handles any subtable with up to miniflow 8 bits.

Note that specialization of these avx512 lookup routines
still provides performance value, as the hashing of the
resulting data is performed in scalar code, and compile-time
loop unrolling occurs when specialized to miniflow bits.

This commit checks at configure time if the assembling in use
has a known bug in assembling AVX512 code. If this bug is present,
all AVX512 code is disabled. Checking the version string of the binutils
or assembler is not a good method to detect the issue, as back ported
fixes would not be reflected.

Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
Acked-by: William Tu <u9012063@gmail.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2020-07-13 14:55:25 +01:00
Harry van Haaren
e90e115a01 dpif-netdev: implement subtable lookup validation.
This commit refactors the existing dpif subtable function pointer
infrastructure, and implements an autovalidator component.

The refactoring of the existing dpcls subtable lookup function
handling, making it more generic, and cleaning up how to enable
more implementations in future.

In order to ensure all implementations provide identical results,
the autovalidator is added. The autovalidator itself implements
the subtable lookup function prototype, but internally iterates
over all other available implementations. The end result is that
testing of each implementation becomes automatic, when the auto-
validator implementation is selected.

Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
Acked-by: William Tu <u9012063@gmail.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2020-07-13 14:54:08 +01:00
William Tu
2078901a4c userspace: Add conntrack timeout policy support.
Commit 1f16131837 ("ct-dpif, dpif-netlink: Add conntrack timeout
policy support") adds conntrack timeout policy for kernel datapath.
This patch enables support for the userspace datapath.  I tested
using the 'make check-system-userspace' which checks the timeout
policies for ICMP and UDP cases.

Signed-off-by: William Tu <u9012063@gmail.com>
Acked-by: Yi-Hung Wei <yihung.wei@gmail.com>
2020-05-01 08:22:45 -07:00
Flavio Leitner
29cf9c1b3b userspace: Add TCP Segmentation Offload support
Abbreviated as TSO, TCP Segmentation Offload is a feature which enables
the network stack to delegate the TCP segmentation to the NIC reducing
the per packet CPU overhead.

A guest using vhostuser interface with TSO enabled can send TCP packets
much bigger than the MTU, which saves CPU cycles normally used to break
the packets down to MTU size and to calculate checksums.

It also saves CPU cycles used to parse multiple packets/headers during
the packet processing inside virtual switch.

If the destination of the packet is another guest in the same host, then
the same big packet can be sent through a vhostuser interface skipping
the segmentation completely. However, if the destination is not local,
the NIC hardware is instructed to do the TCP segmentation and checksum
calculation.

It is recommended to check if NIC hardware supports TSO before enabling
the feature, which is off by default. For additional information please
check the tso.rst document.

Signed-off-by: Flavio Leitner <fbl@sysclose.org>
Tested-by: Ciara Loftus <ciara.loftus.intel.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2020-01-17 22:27:25 +00:00
Ilya Maximets
db54e96720 bridge: Allow manual notifications about interfaces' updates.
Sometimes interface updates could happen in a way ifnotifier is not
able to catch.  For example some heavy operations (device reset) in
netdev-dpdk could require re-applying of the bridge configuration.

For this purpose new manual notifier introduced. Its function
'if_notifier_manual_report()' could be called directly by the code
that aware about changes.  This new notifier is thread-safe.

Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
2019-12-18 01:23:50 +01:00
William Tu
0de1b42596 netdev-afxdp: add new netdev type for AF_XDP.
The patch introduces experimental AF_XDP support for OVS netdev.
AF_XDP, the Address Family of the eXpress Data Path, is a new Linux socket
type built upon the eBPF and XDP technology.  It is aims to have comparable
performance to DPDK but cooperate better with existing kernel's networking
stack.  An AF_XDP socket receives and sends packets from an eBPF/XDP program
attached to the netdev, by-passing a couple of Linux kernel's subsystems
As a result, AF_XDP socket shows much better performance than AF_PACKET
For more details about AF_XDP, please see linux kernel's
Documentation/networking/af_xdp.rst. Note that by default, this feature is
not compiled in.

Signed-off-by: William Tu <u9012063@gmail.com>
Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
2019-07-19 17:42:06 +03:00
Harry van Haaren
92c7c870d6 dpif-netdev: Split out generic lookup function
This commit splits the generic hash-lookup-verify function to
its own file, for cleaner seperation between optimized versions.

Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
Tested-by: Malvika Gupta <malvika.gupta@arm.com>
Acked-by: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2019-07-19 12:22:01 +01:00
Harry van Haaren
f5ace7cd8a dpif-netdev: Move dpcls lookup structures to .h
This commit moves some data-structures to be available
in the dpif-netdev-private.h header. This allows specific
implementations of the subtable lookup function to include
just that header file, and not require that the code exists
in dpif-netdev.c

Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
Tested-by: Malvika Gupta <malvika.gupta@arm.com>
Acked-by: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2019-07-19 12:21:37 +01:00
Ilya Maximets
4f746d526d netdev-offload: Rename offload providers.
Flow API providers renamed to be consistent with parent module
'netdev-offload' and look more like each other.

'_rte_' replaced with more convenient '_dpdk_'.

We'll have following structure:

  Common code:
    lib/netdev-offload-provider.h
    lib/netdev-offload.c
    lib/netdev-offload.h

  Providers:
    lib/netdev-offload-tc.c
    lib/netdev-offload-dpdk.c

'netdev-offload-dummy' still resides inside netdev-dummy, but it
makes no much sence to move it out of there.

Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Acked-by: Ben Pfaff <blp@ovn.org>
Acked-by: Roi Dayan <roid@mellanox.com>
2019-06-11 09:39:36 +03:00
Ilya Maximets
b6cabb8f8f netdev: Split up netdev offloading to separate module.
New module 'netdev-offload' created to manage different flow API
implementations. All the generic and provider independent code moved
there from the 'netdev' module.

Flow API providers further encapsulated.

The only function that was changed is 'netdev_any_oor'.
Now it uses offloading related hmap instead of common 'netdev_shash'.

Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Acked-by: Ben Pfaff <blp@ovn.org>
Acked-by: Roi Dayan <roid@mellanox.com>
2019-06-11 09:39:36 +03:00
Ilya Maximets
5fc5c50f3d netdev: Dynamic per-port Flow API.
Current issues with Flow API:

* OVS calls offloading functions regardless of successful
  flow API initialization. (ex. on init_flow_api failure)
* Static initilaization of Flow API for a netdev_class forbids
  having different offloading types for different instances
  of netdev with the same netdev_class. (ex. different vports in
  'system' and 'netdev' datapaths at the same time)

Solution:

* Move Flow API from the netdev_class to netdev instance.
* Make Flow API dynamic, i.e. probe the APIs and choose the
  suitable one.

Side effects:

* Flow API providers localized as possible in their modules.
* Now we have an ability to make runtime checks. For example,
  we could check if particular device supports features we
  need, like if dpdk device supports RSS+MARK action.

Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Acked-by: Roi Dayan <roid@mellanox.com>
2019-06-11 09:39:36 +03:00
Justin Pettit
e6fc4e2ce9 Remove ESX references.
Signed-off-by: Justin Pettit <jpettit@ovn.org>
Acked-by: Ben Pfaff <blp@ovn.org>
2019-06-01 10:12:02 -07:00
Roni Bar Yanai
3d67b2d28e netdev-dpdk: Move offloading code to a new file
Hardware offloading code is moved to a new file called
netdev-rte-offloads.c. The original offloading code is copied
from the netdev-dpdk.c file to the new file, where future
offloading code should be added as well.
The copied code was refactored based on coding style.
The netdev-dpdk.c file will remain unchanged as new offloading
code is added.

Co-authored-by: Ophir Munk <ophirmu@mellanox.com>
Reviewed-by: Asaf Penso <asafp@mellanox.com>
Signed-off-by: Roni Bar Yanai <roniba@mellanox.com>
Signed-off-by: Ophir Munk <ophirmu@mellanox.com>
Acked-by: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2019-03-19 14:33:27 +00:00