Creating and destroying JSON objects may be time consuming.
Add json_serialized_object_create_with_yield() and
json_destroy_with_yield() functions that make use of the
cooperative multitasking module to yield during processing,
allowing time sensitive tasks in other parts of the program
to be completed during processing.
We keep these new functions private to OVS by adding a new
lib/json.h header file.
The include guard in the public include/openvswitch/json.h is
updated to contain the OPENVSWITCH prefix to be in line with the
other public header files, allowing us to use the non-prefixed
version in our private lib/json.h.
Signed-off-by: Frode Nordahl <frode.nordahl@canonical.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
One of the goals of Open vSwitch is to be as resource efficient as
possible. Core parts of the program have been implemented as
asynchronous state machines, and when absolutely necessary
additional threads are used.
Introduce a cooperative multitasking module which allows us to
interleave important processing with long running tasks while
avoiding the additional resource consumption of threads and
complexity of asynchronous state machines.
We will use this module to ensure long running processing in the
OVSDB server does not interfere with stable maintenance of the
RAFT cluster in subsequent patches.
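The idea of yielding from a long running task can be sketched as follows. This is a minimal illustration in Python, not the OVS module itself: the class name, the method names and the interval-based policy are all assumptions made for the example.

```python
import time


class CoopScheduler:
    """Hypothetical registry of time-sensitive callbacks (not the OVS API)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._tasks = []  # each entry: [callback, interval, last_run]

    def register(self, callback, interval):
        self._tasks.append([callback, interval, self._clock()])

    def yield_now(self):
        """Run every callback whose interval has elapsed, then return."""
        now = self._clock()
        for task in self._tasks:
            callback, interval, last_run = task
            if now - last_run >= interval:
                task[2] = now
                callback()


def long_running_job(items, sched):
    """Process items one by one, yielding to the scheduler between items."""
    total = 0
    for item in items:
        total += item      # stand-in for expensive work
        sched.yield_now()  # give overdue callbacks a chance to run
    return total
```

No extra thread is involved: the important callback runs on the caller's stack whenever the long running task chooses to yield.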
Suggested-by: Ilya Maximets <i.maximets@ovn.org>
Signed-off-by: Frode Nordahl <frode.nordahl@canonical.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
This provides a software implementation in the case
the egress netdev doesn't support segmentation in hardware.
The challenge here is to guarantee packet ordering in the
original batch that may be full of TSO packets. Each TSO
packet can go up to ~64kB, so with segment size of 1440
that means about 44 packets for each TSO. Each batch has
32 packets, so the total batch amounts to 1408 normal
packets.
The segmentation first estimates the total number of packets
and from that the total number of batches. It then allocates
enough memory and finally does the work.
Finally each batch is sent in order to the netdev.
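The estimate described above can be reproduced with a few lines of arithmetic. This is only a sketch: the function names are hypothetical, and the payload of exactly 44 * 1440 bytes per TSO packet is an assumption chosen to match the "about 44 packets" figure.

```python
def estimate_segments(payload_len, seg_size):
    """Number of segments needed for one packet, rounded up."""
    return -(-payload_len // seg_size)


def estimate_batches(payload_lens, seg_size, batch_size=32):
    """Return (total segments, batches needed to carry them in order)."""
    total = sum(estimate_segments(n, seg_size) for n in payload_lens)
    return total, -(-total // batch_size)


# A full batch of 32 TSO packets, each carrying 44 * 1440 bytes of payload
# (assumed value, see above).
total_packets, total_batches = estimate_batches([44 * 1440] * 32, 1440)
```

With these inputs the full batch expands to 1408 normal packets, which fill 44 output batches of 32 packets.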
Signed-off-by: Flavio Leitner <fbl@sysclose.org>
Co-authored-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Mike Pattrick <mkp@redhat.com>
Acked-by: Simon Horman <horms@ovn.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The sFlow library uses BSD-style types like u_char that require
_BSD_SOURCE to be defined.
Also adding _DEFAULT_SOURCE, because _BSD_SOURCE cannot be used
without it with glibc > 2.19:
error: "_BSD_SOURCE and _SVID_SOURCE are deprecated,
use _DEFAULT_SOURCE"
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Currently, the CT can be flushed by dpctl only by specifying
the whole 5-tuple. This is not very convenient when there are
only some fields known to the user of CT flush. Add new struct
ofp_ct_match which represents the generic filtering that can
be done for CT flush. The match is done only on fields that are
non-zero, with the exception of the ICMP fields.
For now this allows filtering only within dpctl, but it also
prepares for an OpenFlow extension.
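The "match only on non-zero fields" rule can be illustrated with a small sketch. The Tuple5 type and the function names are hypothetical and do not mirror struct ofp_ct_match, and the special handling of the ICMP fields is deliberately not modeled.

```python
from dataclasses import dataclass


@dataclass
class Tuple5:
    """Hypothetical connection tuple; zero or empty means 'not specified'."""
    src: str = ""
    dst: str = ""
    proto: int = 0
    sport: int = 0
    dport: int = 0


def matches(entry, flt):
    """True if every specified filter field equals the entry's field."""
    for name in ("src", "dst", "proto", "sport", "dport"):
        want = getattr(flt, name)
        if want and want != getattr(entry, name):
            return False
    return True


def flush(table, flt):
    """Return the entries that survive a flush with the given filter."""
    return [entry for entry in table if not matches(entry, flt)]


table = [
    Tuple5("10.0.0.1", "10.0.0.2", 6, 1234, 80),
    Tuple5("10.0.0.3", "10.0.0.2", 17, 5353, 53),
    Tuple5("10.0.0.1", "10.0.0.9", 6, 4321, 443),
]
# Only the destination address is known to the user.
remaining = flush(table, Tuple5(dst="10.0.0.2"))
```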
Reported-at: https://bugzilla.redhat.com/2120546
Signed-off-by: Ales Musil <amusil@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
AF_XDP functions were deprecated in libbpf 0.7 and moved to libxdp.
Functions bpf_get/set_link_xdp_id() were deprecated in libbpf 0.8
and replaced with bpf_xdp_query_id() and bpf_xdp_attach/detach().
Updating configuration and source code to accommodate the above changes
and allow building OVS with AF_XDP support on newer systems:
- Checking the version of libbpf by detecting availability
of bpf_xdp_detach.
- Checking availability of the libxdp in a system by looking
for a library providing libxdp_strerror(), if libbpf is
newer than 0.6. And checking for xsk.h header provided by
libxdp-dev[el].
- Use xsk.h from libbpf if it is older than 0.7 and not linking
with libxdp in this case as there are known incompatible
versions of libxdp in distributions.
- Check for the NEED_WAKEUP feature replaced with direct checking
in the source code if XDP_USE_NEED_WAKEUP is defined.
- Checking availability of bpf_xdp_query_id and bpf_xdp_detach
and using them instead of deprecated APIs. Fall back to old
functions if not found.
- Dropped LIBBPF_LDADD variable as it makes library and function
detection much harder without providing any actual benefits.
AC_SEARCH_LIBS is used instead and it allows use of AC_CHECK_FUNCS.
- Header includes moved around to files where they are actually used.
- Removed libelf dependency as it is not really used.
With these changes it should be possible to build OVS with either:
- libbpf built from the kernel sources (5.19 or older).
- libbpf < 0.7 provided in distributions.
- libxdp and libbpf >= 0.7 provided in newer distributions.
While it is technically possible to build with libbpf 0.7+ without
libxdp at the moment we're not allowing that for a few reasons.
First, required functions in libbpf are deprecated and can be removed
in future releases. Second, support for all these combinations makes
the detection code fairly complex.
AFAIK, most of the distributions packaging libbpf 0.7+ do package
libxdp as well.
libxdp added as a build dependency for Fedora build since all
supported versions of Fedora are packaging this library.
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Part of the uuidset implementation is taken from the OVN codebase where
it was added via commit 0e77b3bcbfe2 ("ovn-northd-ddlog: New
implementation of ovn-northd based on ddlog.").
We now extend that, adding a few helpers and tests.
Co-authored-by: Leonid Ryzhyk <lryzhyk@vmware.com>
Signed-off-by: Leonid Ryzhyk <lryzhyk@vmware.com>
Co-authored-by: Justin Pettit <jpettit@ovn.org>
Signed-off-by: Justin Pettit <jpettit@ovn.org>
Co-authored-by: Ben Pfaff <blp@ovn.org>
Signed-off-by: Ben Pfaff <blp@ovn.org>
Signed-off-by: Dumitru Ceara <dceara@redhat.com>
Reviewed-by: Ales Musil <amusil@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
This commit adds a new command to allow the user to switch
the active action implementation at runtime.
Usage:
$ ovs-appctl odp-execute/action-impl-set scalar
This commit also adds a new command to retrieve the list of available
action implementations. This can be used to check what implementations
of actions are available and what implementation is active during runtime.
Usage:
$ ovs-appctl odp-execute/action-impl-show
Added separate test-case for ovs-actions show/set commands:
odp-execute - actions implementation
Signed-off-by: Emma Finn <emma.finn@intel.com>
Signed-off-by: Kumar Amber <kumar.amber@intel.com>
Signed-off-by: Sunil Pai G <sunil.pai.g@intel.com>
Co-authored-by: Kumar Amber <kumar.amber@intel.com>
Co-authored-by: Sunil Pai G <sunil.pai.g@intel.com>
Acked-by: Harry van Haaren <harry.van.haaren@intel.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
This commit introduces the initial infrastructure required to allow
different implementations for OvS actions. The patch introduces action
function pointers which allow the user to switch between the different
action implementations available. This will allow for more performance and
flexibility so the user can choose the action implementation that best suits
their use case.
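A table of function pointers with a switchable active entry can be sketched like this. The registry class, the "fast" implementation name and the packet representation are hypothetical and only illustrate the dispatch pattern.

```python
class ActionRegistry:
    """Hypothetical registry mapping implementation names to action tables."""

    def __init__(self):
        self._impls = {}      # name -> {action name -> function}
        self._active = None

    def register(self, name, table):
        self._impls[name] = table
        if self._active is None:
            self._active = name  # first registered implementation is default

    def set_active(self, name):
        if name not in self._impls:
            raise KeyError(name)
        self._active = name

    def show(self):
        """List (name, is_active) for every registered implementation."""
        return [(name, name == self._active) for name in self._impls]

    def execute(self, action, packet):
        """Dispatch through the currently active function table."""
        return self._impls[self._active][action](packet)


registry = ActionRegistry()
registry.register("scalar", {"pop_vlan": lambda pkt: pkt[1:]})
registry.register("fast", {"pop_vlan": lambda pkt: pkt[1:]})
registry.set_active("fast")
```

Callers only ever use execute(), so switching the implementation at runtime does not require touching them.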
Signed-off-by: Emma Finn <emma.finn@intel.com>
Acked-by: Harry van Haaren <harry.van.haaren@intel.com>
Acked-by: Sunil Pai G <sunil.pai.g@intel.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
Add AVX512 IPv6 optimized profiles for vlan/IPv6/UDP,
vlan/IPv6/TCP, IPv6/UDP and IPv6/TCP.
The MFEX autovalidation test-case already has IPv6 support for
validating against the scalar mfex.
Signed-off-by: Kumar Amber <kumar.amber@intel.com>
Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
Co-authored-by: Harry van Haaren <harry.van.haaren@intel.com>
Acked-by: Cian Ferriter <cian.ferriter@intel.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
As described in the bugzilla below, cpu_has_isa code may be compiled
with some AVX512 instructions in it, because cpu.c is built as part of
the libopenvswitchavx512.
This is a problem when this function (supposed to probe for AVX512
instructions availability) is invoked from generic OVS code, on older
CPUs that don't support them.
For the same reason, dpcls_subtable_avx512_gather_probe,
dp_netdev_input_outer_avx512_probe, mfex_avx512_probe and
mfex_avx512_vbmi_probe are potential runtime bombs and can't be
built as part of libopenvswitchavx512 either.
Move cpu.c to be part of the "normal" libopenvswitch.
And move the other helpers into generic OVS code.
Note:
- dpcls_subtable_avx512_gather_probe is split in two, because it also
needs to do its own magic,
- while moving those helpers, prefer direct calls to cpu_has_isa and
avoid cast to intermediate integer variables when a simple boolean
is enough,
Fixes: 352b6c7116 ("dpif-lookup: add avx512 gather implementation.")
Fixes: abb807e27d ("dpif-netdev: Add command to switch dpif implementation.")
Fixes: 250ceddcc2 ("dpif-netdev/mfex: Add AVX512 based optimized miniflow extract")
Fixes: b366fa2f49 ("dpif-netdev: Call cpuid for x86 isa availability.")
Reported-at: https://bugzilla.redhat.com/2100393
Reported-by: Ales Musil <amusil@redhat.com>
Co-authored-by: Ales Musil <amusil@redhat.com>
Signed-off-by: Ales Musil <amusil@redhat.com>
Signed-off-by: David Marchand <david.marchand@redhat.com>
Acked-by: Sunil Pai G <sunil.pai.g@intel.com>
Acked-by: Ales Musil <amusil@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Checking for each of the required AVX512 ISA separately will allow the
compiler to generate some AVX512 code where there is some support in the
compiler rather than only generating all AVX512 code when all of it is
supported or no AVX512 code at all.
For example, in GCC 4.9 where there is just support for AVX512F, this
patch will allow building the AVX512 DPIF.
Another example, in GCC 5 and 6, most AVX512 code can be generated, just
without AVX512VPOPCNTDQ support.
Signed-off-by: Cian Ferriter <cian.ferriter@intel.com>
Acked-by: Sunil Pai G <sunil.pai.g@intel.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The current id-pool module is slow to allocate the
next valid ID, and can be optimized when restricting
some properties of the pool.
Those restrictions are:
* No ability to add a random ID to the pool.
* A new ID is no longer the smallest possible ID.
It is however guaranteed to be in the range of
[floor, last_alloc + nb_user * cache_size + 1],
where 'cache_size' is the number of IDs in each per-user
cache. It is defined as 'ID_FPOOL_CACHE_SIZE', set to 64.
* A user should never free an ID that is not allocated.
No checks are done and doing so will duplicate the spurious
ID. Refcounting or other memory management scheme should
be used to ensure an object and its ID are only freed once.
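The restrictions above follow naturally from a per-user cache design. The sketch below is a deliberately simplified, single-threaded model and not the id-fpool code: the class and method names are hypothetical, and freed IDs simply return to the freeing user's cache.

```python
CACHE_SIZE = 64  # mirrors ID_FPOOL_CACHE_SIZE from the text


class FastPool:
    """Simplified model: each user allocates from its own cache of IDs."""

    def __init__(self, nb_user, floor):
        self.next_id = floor
        self.caches = [[] for _ in range(nb_user)]

    def new_id(self, user):
        cache = self.caches[user]
        if not cache:
            # Refill: reserve a whole run of consecutive IDs for this user.
            start = self.next_id
            self.next_id += CACHE_SIZE
            cache.extend(range(start + CACHE_SIZE - 1, start - 1, -1))
        return cache.pop()

    def free_id(self, user, id_):
        # No check that id_ was allocated: freeing twice duplicates it.
        self.caches[user].append(id_)


pool = FastPool(nb_user=2, floor=10)
a = pool.new_id(0)       # first ID of user 0's cache
b = pool.new_id(1)       # user 1 gets its own run, not the smallest free ID
c = pool.new_id(0)
pool.free_id(0, a)
pool.free_id(0, a)       # erroneous double free
dup1, dup2 = pool.new_id(0), pool.new_id(0)
```

The last two allocations return the same ID, which is the duplication the third restriction warns about.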
This allocator is designed to scale reasonably well in a multithreaded
setup. As it is aimed at being a faster replacement for the current
id-pool, a benchmark has been implemented alongside unit tests.
The benchmark is composed of 4 rounds: 'new', 'del', 'mix', and 'rnd'.
Respectively
+ 'new': only allocate IDs
+ 'del': only free IDs
+ 'mix': allocate, sequential free, then allocate ID.
+ 'rnd': allocate, random free, allocate ID.
Randomized freeing is done by swapping the latest allocated ID with any
from the range of currently allocated ID, which is reminiscent of the
Fisher-Yates shuffle. This evaluates freeing non-sequential IDs,
which is the more natural use-case.
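The randomized freeing order can be sketched as follows. This is an illustration of the swap-with-last idea only; the function name is hypothetical and the benchmark's actual code may differ.

```python
import random


def random_free_order(ids, rng):
    """Yield each allocated ID exactly once, in random order."""
    live = list(ids)
    while live:
        i = rng.randrange(len(live))
        # Swap a random live ID with the latest allocated one, then free it.
        live[i], live[-1] = live[-1], live[i]
        yield live.pop()


order = list(random_free_order(range(100), random.Random(0)))
```

Each step is O(1), so the shuffle itself does not dominate the time being measured.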
For this specific round, the id-pool performance is such that a timeout
of 10 seconds is added to the benchmark:
    $ ./tests/ovstest test-id-fpool benchmark 10000 1
    Benchmarking n=10000 on 1 thread.
     type\thread:      1    Avg
    id-fpool new:      1      1 ms
    id-fpool del:      1      1 ms
    id-fpool mix:      2      2 ms
    id-fpool rnd:      2      2 ms
     id-pool new:      4      4 ms
     id-pool del:      2      2 ms
     id-pool mix:      6      6 ms
     id-pool rnd:    431    431 ms

    $ ./tests/ovstest test-id-fpool benchmark 100000 1
    Benchmarking n=100000 on 1 thread.
     type\thread:      1    Avg
    id-fpool new:      2      2 ms
    id-fpool del:      2      2 ms
    id-fpool mix:      3      3 ms
    id-fpool rnd:      4      4 ms
     id-pool new:     12     12 ms
     id-pool del:      5      5 ms
     id-pool mix:     16     16 ms
     id-pool rnd: 10000+     -1 ms

    $ ./tests/ovstest test-id-fpool benchmark 1000000 1
    Benchmarking n=1000000 on 1 thread.
     type\thread:      1    Avg
    id-fpool new:     15     15 ms
    id-fpool del:     12     12 ms
    id-fpool mix:     34     34 ms
    id-fpool rnd:     48     48 ms
     id-pool new:    276    276 ms
     id-pool del:    286    286 ms
     id-pool mix:    448    448 ms
     id-pool rnd: 10000+     -1 ms
Running only a performance test on the fast pool:
    $ ./tests/ovstest test-id-fpool perf 1000000 1
    Benchmarking n=1000000 on 1 thread.
     type\thread:      1    Avg
    id-fpool new:     15     15 ms
    id-fpool del:     12     12 ms
    id-fpool mix:     34     34 ms
    id-fpool rnd:     47     47 ms

    $ ./tests/ovstest test-id-fpool perf 1000000 2
    Benchmarking n=1000000 on 2 threads.
     type\thread:      1      2    Avg
    id-fpool new:     11     11     11 ms
    id-fpool del:     10     10     10 ms
    id-fpool mix:     24     24     24 ms
    id-fpool rnd:     30     30     30 ms

    $ ./tests/ovstest test-id-fpool perf 1000000 4
    Benchmarking n=1000000 on 4 threads.
     type\thread:      1      2      3      4    Avg
    id-fpool new:      9     11     11     10     10 ms
    id-fpool del:      5      6      6      5      5 ms
    id-fpool mix:     16     16     16     16     16 ms
    id-fpool rnd:     20     20     20     20     20 ms
Signed-off-by: Gaetan Rivet <grive@u256.net>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Add a lockless multi-producer/single-consumer (MPSC), linked-list based,
intrusive, unbounded queue that does not require deferred memory
management.
The queue is designed to improve the specific MPSC setup. A benchmark
accompanies the unit tests to measure the difference in this configuration.
A single reader thread polls the queue while N writers enqueue elements
as fast as possible. The mpsc-queue is compared against the regular ovs-list
as well as the guarded list. The latter usually offers a slight improvement
by batching the element removal, however the mpsc-queue is faster.
The average is taken over the producer threads' times:
    $ ./tests/ovstest test-mpsc-queue benchmark 3000000 1
    Benchmarking n=3000000 on 1 + 1 threads.
     type\thread: Reader      1    Avg
      mpsc-queue:    167    167    167 ms
      list(spin):     89     80     80 ms
     list(mutex):    745    745    745 ms
    guarded list:    788    788    788 ms

    $ ./tests/ovstest test-mpsc-queue benchmark 3000000 2
    Benchmarking n=3000000 on 1 + 2 threads.
     type\thread: Reader      1      2    Avg
      mpsc-queue:     98     97     94     95 ms
      list(spin):    185    171    173    172 ms
     list(mutex):    203    199    203    201 ms
    guarded list:    269    269    188    228 ms

    $ ./tests/ovstest test-mpsc-queue benchmark 3000000 3
    Benchmarking n=3000000 on 1 + 3 threads.
     type\thread: Reader      1      2      3    Avg
      mpsc-queue:     76     76     65     76     72 ms
      list(spin):    246    110    240    238    196 ms
     list(mutex):    542    541    541    539    540 ms
    guarded list:    535    535    507    511    517 ms

    $ ./tests/ovstest test-mpsc-queue benchmark 3000000 4
    Benchmarking n=3000000 on 1 + 4 threads.
     type\thread: Reader      1      2      3      4    Avg
      mpsc-queue:     73     68     68     68     68     68 ms
      list(spin):    294    275    279    277    282    278 ms
     list(mutex):    346    309    287    345    302    310 ms
    guarded list:    378    319    334    378    351    345 ms
Signed-off-by: Gaetan Rivet <grive@u256.net>
Reviewed-by: Eli Britstein <elibr@nvidia.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Add a new module offering a helper to compute the Cumulative
Moving Average (CMA) and the Exponential Moving Average (EMA)
of a series of values.
Use the new helpers to add latency metrics in dpif-netdev.
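For reference, both averages can be updated incrementally without storing the series. The sketch below uses the textbook update rules; the class names and the 'alpha' smoothing parameter are assumptions and not the interface of the new module.

```python
class CMA:
    """Cumulative moving average: mean of all samples seen so far."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def update(self, x):
        self.n += 1
        self.mean += (x - self.mean) / self.n
        return self.mean


class EMA:
    """Exponential moving average with smoothing factor alpha in (0, 1]."""

    def __init__(self, alpha):
        self.alpha = alpha
        self.mean = None

    def update(self, x):
        if self.mean is None:
            self.mean = float(x)  # the first sample seeds the average
        else:
            self.mean += self.alpha * (x - self.mean)
        return self.mean


cma = CMA()
cma_values = [cma.update(x) for x in (1, 2, 3, 4)]
ema = EMA(alpha=0.5)
ema_values = [ema.update(x) for x in (10, 20, 30)]
```

The CMA weighs every sample equally, while the EMA gives recent samples more weight, which makes it react faster to a change in latency.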
Signed-off-by: Gaetan Rivet <grive@u256.net>
Reviewed-by: Eli Britstein <elibr@nvidia.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
DPIF AVX512 optimizations currently rely on DPDK availability while
they can be used without DPDK.
Besides, checking for availability of some isa only has to be done once
and won't change while an OVS process runs.
Resolve isa availability in constructors by using a simplified query
based on cpuid API that comes from the compiler.
Note: this also fixes the check on BMI2 availability: DPDK had a bug
for this isa, see https://git.dpdk.org/dpdk/commit/?id=aae3037ab1e0.
Suggested-by: Ilya Maximets <i.maximets@ovn.org>
Signed-off-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
This way it's easier to show it on a website as it will be updated
automatically along with the rest of the documentation.
Sphinx doesn't render everything perfectly, but it looks good enough
in both man and html versions. rST is a bit easier to read and it
takes less space.
Conversion was performed manually since I didn't find any good tool
that can actually make the process any faster.
Along the way I replaced versions like x.y.90 with x.y+1, because
it doesn't seem correct to me to refer to non-released versions of OVS
in the docs. Fixed a couple of small mistakes like a duplicated
paragraph and a reference to a different section by an incorrect name.
Also removed bits of xml->nroff conversion code that are not needed
anymore.
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Roi Dayan <roid@nvidia.com>
The Open vSwitch kernel module uses the upcall mechanism to send
packets from kernel space to user space when it misses in the kernel
space flow table. The upcall sends packets via a Netlink socket.
Currently, a Netlink socket is created for every vport. In this way,
there is a 1:1 mapping between a vport and a Netlink socket.
When a packet is received by a vport, if it needs to be sent to
user space, it is sent via the corresponding Netlink socket.
This mechanism, with various iterations of the corresponding user
space code, has seen some limitations and issues:
* On systems with a large number of vports, there is correspondingly
a large number of Netlink sockets which can limit scaling.
(https://bugzilla.redhat.com/show_bug.cgi?id=1526306)
* Packet reordering on upcalls.
(https://bugzilla.redhat.com/show_bug.cgi?id=1844576)
* A thundering herd issue.
(https://bugzilla.redhat.com/show_bug.cgi?id=1834444)
This patch introduces an alternative, feature-negotiated, upcall
mode using a per-cpu dispatch rather than a per-vport dispatch.
In this mode, the Netlink socket to be used for the upcall is
selected based on the CPU of the thread that is executing the upcall.
In this way, it resolves the issues above as:
a) The number of Netlink sockets scales with the number of CPUs
rather than the number of vports.
b) Ordering per-flow is maintained as packets are distributed to
CPUs based on mechanisms such as RSS and flows are distributed
to a single user space thread.
c) Packets from a flow can only wake up one user space thread.
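The difference between the two dispatch modes can be illustrated with a toy model. Everything here is hypothetical: sockets are represented by plain tuples, and the CRC-based rss_cpu() is only a stand-in for whatever steers a flow to a CPU.

```python
import zlib


def rss_cpu(flow, n_cpus):
    """Stand-in for RSS: steer a flow to a CPU by hashing its 5-tuple."""
    return zlib.crc32(repr(flow).encode()) % n_cpus


def per_vport_socket(vport, cpu):
    """Per-vport dispatch: one socket for every vport."""
    return ("vport", vport)


def per_cpu_socket(vport, cpu):
    """Per-CPU dispatch: the socket depends only on the executing CPU."""
    return ("cpu", cpu)


n_vports, n_cpus = 1000, 8
pairs = [(v, c) for v in range(n_vports) for c in range(n_cpus)]
sockets_per_vport = {per_vport_socket(v, c) for v, c in pairs}
sockets_per_cpu = {per_cpu_socket(v, c) for v, c in pairs}

# All packets of one flow are handled on one CPU, hence on one socket.
flow = ("10.0.0.1", "10.0.0.2", 6, 1234, 80)
flow_sockets = {per_cpu_socket(7, rss_cpu(flow, n_cpus)) for _ in range(100)}
```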
Reported-at: https://bugzilla.redhat.com/1844576
Signed-off-by: Mark Gray <mark.d.gray@redhat.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
This commit adds AVX512 implementations of miniflow extract.
By using the 64 bytes available in an AVX512 register, it is
possible to convert a packet to a miniflow data-structure in
a small number of instructions.
The implementation here probes for Ether()/IP()/UDP() traffic,
and builds the appropriate miniflow data-structure for packets
that match the probe.
The implementation here is auto-validated by the miniflow
extract autovalidator, hence its correctness can be easily
tested and verified.
Note that this commit is designed to easily allow addition of new
traffic profiles in a scalable way, without code duplication for
each traffic profile.
Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
The study function runs all the available implementations
of miniflow_extract, picks the one whose hitmask has the
most hits, and sets the mfex to that function.
Study can be run at runtime using the following command:
$ ovs-appctl dpif-netdev/miniflow-parser-set study
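The selection step can be sketched as counting hits per implementation over a sample of packets and keeping the best one. The function name, the profile names and the dictionary-based packets are hypothetical and only illustrate the idea.

```python
def study(implementations, packets):
    """Return (best implementation name, hit counts per implementation)."""
    hits = {name: 0 for name in implementations}
    for pkt in packets:
        for name, extract in implementations.items():
            # An implementation "hits" when it can parse the packet.
            if extract(pkt) is not None:
                hits[name] += 1
    return max(hits, key=hits.get), hits


# Hypothetical traffic profiles: each parses only its own protocol.
implementations = {
    "ip_udp": lambda pkt: pkt if pkt["proto"] == "udp" else None,
    "ip_tcp": lambda pkt: pkt if pkt["proto"] == "tcp" else None,
}
sample = [{"proto": "udp"}] * 3 + [{"proto": "tcp"}]
best, hit_counts = study(implementations, sample)
```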
Signed-off-by: Kumar Amber <kumar.amber@intel.com>
Co-authored-by: Harry van Haaren <harry.van.haaren@intel.com>
Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
This patch introduces the MFEX function pointers, which allow
the user to switch between the different miniflow extract implementations
provided by OVS, based on the optimized ISA supported by the CPU.
The user can query the miniflow extract variants available
for that CPU with the following command:
$ ovs-appctl dpif-netdev/miniflow-parser-get
Similarly, a user can set the miniflow implementation with the following
command:
$ ovs-appctl dpif-netdev/miniflow-parser-set name
This gives the user more performance and flexibility by letting them choose
the miniflow implementation according to their needs.
Signed-off-by: Kumar Amber <kumar.amber@intel.com>
Co-authored-by: Harry van Haaren <harry.van.haaren@intel.com>
Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
This commit adds a new command to allow the user to switch
the active DPIF implementation at runtime. A probe function
is executed before switching the DPIF implementation, to ensure
the CPU is capable of running the ISA required. For example, the
below code will switch to the AVX512 enabled DPIF assuming
that the runtime CPU is capable of running AVX512 instructions:
$ ovs-appctl dpif-netdev/dpif-impl-set dpif_avx512
A new configuration flag is added to allow selection of the
default DPIF. This is useful for running the unit-tests against
the available DPIF implementations, without modifying each unit test.
The design of the testing & validation for ISA optimized DPIF
implementations is based around the work already upstream for DPCLS.
Note however that a DPCLS lookup has no state or side-effects, allowing
the auto-validator implementation to perform multiple lookups and
provide consistent statistic counters.
The DPIF component does have state, so running two implementations in
parallel and comparing output is not a valid testing method, as there
are changes in DPIF statistic counters (side effects). As a result, the
DPIF is tested directly against the unit-tests.
Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
Co-authored-by: Cian Ferriter <cian.ferriter@intel.com>
Signed-off-by: Cian Ferriter <cian.ferriter@intel.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
This commit adds the AVX512 implementation of DPIF functionality,
specifically the dp_netdev_input_outer_avx512 function. This function
only handles outer (no re-circulations), and is optimized to use the
AVX512 ISA for packet batching and other DPIF work.
Sparse is not able to handle the AVX512 intrinsics, causing compile
time failures, so it is disabled for this file.
Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
Co-authored-by: Cian Ferriter <cian.ferriter@intel.com>
Signed-off-by: Cian Ferriter <cian.ferriter@intel.com>
Co-authored-by: Kumar Amber <kumar.amber@intel.com>
Signed-off-by: Kumar Amber <kumar.amber@intel.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
Split the very large file dpif-netdev.c and the data structures
it contains into multiple header files. Each header file is
responsible for the data structures of that component.
This logical split allows better reuse and modularity of the code,
and reduces the very large file dpif-netdev.c to be more manageable.
Due to dependencies between components, it is not possible to
move components at a finer granularity than this patch does.
To explain the dependencies better, e.g.:
DPCLS has no deps (from dpif-netdev.c file)
FLOW depends on DPCLS (struct dpcls_rule)
DFC depends on DPCLS (netdev_flow_key) and FLOW (netdev_flow_key)
THREAD depends on DFC (struct dfc_cache)
DFC_PROC depends on THREAD (struct pmd_thread)
DPCLS lookup.h/c require only DPCLS
DPCLS implementations require only dpif-netdev-lookup.h.
- This change was made in 2.12 release with function pointers
- This commit only refactors the name to "private-dpcls.h"
netdev_flow_key_equal_mf() is renamed to emc_flow_key_equal_mf().
Rename functions specific to dpcls from netdev_* namespace to the
dpcls_* namespace, as they are only used by dpcls code.
'inline' is added to the definition of dp_netdev_flow_hash() when it
is moved, to fix a compiler error.
One valid checkpatch issue with the use of the
EMC_FOR_EACH_POS_WITH_HASH() macro was fixed.
Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
Co-authored-by: Cian Ferriter <cian.ferriter@intel.com>
Signed-off-by: Cian Ferriter <cian.ferriter@intel.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
Using 1024 bit params for DH is considered unsafe [1]. Additionally,
from [2]:
"Modern servers that do not support export ciphersuites are advised to
either use SSL_CTX_set_tmp_dh() or alternatively, use the callback but
ignore keylength and is_export and simply supply at least 2048-bit
parameters in the callback."
Additionally, using 1024 bit DH params may block clients running
recent OpenSSL versions from connecting, given the stricter default
security requirements of those new OpenSSL versions. The error message
for these clients looks like:
error:141A318A:SSL routines:tls_process_ske_dhe:dh key too small:ssl/statem/statem_clnt.c:2150
As a workaround, this error can be suppressed by tweaking the cipher list
(--ssl-ciphers) to either 'HIGH:!aNULL:!MD5:@SECLEVEL=1' to reduce
security requirements or 'HIGH:!aNULL:!MD5:!DH' to avoid using fixed
param DH based ciphers. The first option is recommended though, as it is
likely that a fixed param DH cipher is the best possible option in that
situation.
[1] https://weakdh.org/
[2] https://www.openssl.org/docs/man1.1.1/man3/SSL_CTX_set_tmp_dh_callback.html
Signed-off-by: Jaime Caamaño Ruiz <jcaamano@redhat.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
For debugging purposes it is useful to be able to record all the
incoming transactions and commands and replay them locally under a
debugger or with additional logging enabled. This patch introduces the
ability to record all the incoming stream data and replay it via a new
stream provider named 'stream-replay'. During the record phase all
the incoming stream data is written to special replay_* files in the
application rundir. During the replay phase, instead of opening real
streams, the application will open replay_* files and read all the
incoming data directly from them.
If enabled for ovsdb-server, for example, this allows recording all
the connections and transactions from a big setup and replaying them
locally afterwards to debug the behaviour or test performance.
To start an application in recording mode there is a --record cmdline
option. --replay is used to replay previously recorded streams.
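The record and replay phases can be sketched with a length-prefixed log. This is an illustration only: the record layout (4-byte big-endian length followed by the data) and all function names are assumptions, not the format of the replay_* files.

```python
import io
import struct


def write_record(f, data):
    """Append one length-prefixed record (4-byte big-endian length)."""
    f.write(struct.pack("!I", len(data)))
    f.write(data)


def read_record(f):
    """Return the next record, or None at end of file."""
    header = f.read(4)
    if len(header) < 4:
        return None
    (length,) = struct.unpack("!I", header)
    return f.read(length)


def record(recv, log):
    """Wrap a recv() callable so that everything it returns is logged."""
    def recording_recv():
        data = recv()
        write_record(log, data)
        return data
    return recording_recv


def replay(log):
    """Return a recv() callable that reads from the log, not a stream."""
    return lambda: read_record(log)


# Record phase: a fake "network" delivers two chunks.
incoming = iter([b'{"method":"echo"}', b'{"method":"transact"}'])
log = io.BytesIO()
recv = record(lambda: next(incoming), log)
first, second = recv(), recv()

# Replay phase: the same chunks come back without any real stream.
log.seek(0)
replayed_recv = replay(log)
replayed = [replayed_recv(), replayed_recv(), replayed_recv()]
```

Because only incoming data is logged, anything the application generates from its own timers is not reproduced, which matches the limitation noted for time-based events.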
Current version doesn't work well with time-based stream events like
inactivity probes or any other events generated internally. This is
a point for further improvement.
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Dumitru Ceara <dceara@redhat.com>
This library provides interfaces to open replay files and
read/write records. It will be used later for stream record/replay
functionality, i.e. to record all the incoming connections and
data and replay it later for debugging and performance analysis
purposes.
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Dumitru Ceara <dceara@redhat.com>
When running repeated builds using `make build`, you get prompts whenever
the `mv` command is about to overwrite a file which is write-protected.
This patch forces `mv` to overwrite without prompting for approval.
Signed-off-by: Aidan Shribman <aidan.shribman@gmail.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
This new module has a single direct user now. In the future, it
will also be used by OVN.
Signed-off-by: Ben Pfaff <blp@ovn.org>
Acked-by: Ilya Maximets <i.maximets@ovn.org>
Update build system to ensure dirs.py is created when it is a
dependency for a build target. Also, update setup.py to
check for that dependency.
Fixes: 943c4a3250 ("python: set ovs.dirs variables with build system values")
Signed-off-by: Mark Gray <mark.d.gray@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
ovs/dirs.py should be auto-generated using the template
ovs/dirs.py.template at build time. This will set the
ovs.dirs python variables with a value specified by the
environment or, if the environment variable is not set, from
the build system.
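The lookup order described above amounts to an environment lookup with a build-time fallback. In this sketch the helper name, the OVS_RUNDIR variable name and the default path are assumptions used for illustration; in the generated file the default would be substituted by the build system.

```python
import os


def resolve_dir(env_name, build_default, environ=os.environ):
    """Prefer the environment variable, else the build system's value."""
    return environ.get(env_name, build_default)


# Assumed names: OVS_RUNDIR and the default path are placeholders.
RUNDIR = resolve_dir("OVS_RUNDIR", "/var/run/openvswitch")
```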
Signed-off-by: Mark Gray <mark.d.gray@redhat.com>
Acked-By: Timothy Redaelli <tredaelli@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
In certain scenarios with OVS built with --enable-shared and
DPDK enabled as shared build too, Position Independent Code
is required to link the avx512.a file into the relocatable .so
that it must be linked into.
Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
This commit avoids compiling and linking of avx512 code into the
vswitch_la library if the binutils check fails. This avoids compiling
code into OVS that will not be executed due to binutils issue.
Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
When enabling debug logs in dpdk, it can be a challenge to be sure of
what is actually enabled. Add commands to list and change those log levels.
However, these commands do not help when tracking issues in dpdk init
itself, so dump log levels right after init.
Example:
$ ovs-appctl dpdk/log-list
global log level is debug
id 0: lib.eal, level is info
id 1: lib.malloc, level is info
id 2: lib.ring, level is info
id 3: lib.mempool, level is info
id 4: lib.timer, level is info
id 5: pmd, level is info
[...]
id 37: pmd.net.bnxt.driver, level is notice
id 38: pmd.net.e1000.init, level is notice
id 39: pmd.net.e1000.driver, level is notice
id 40: pmd.net.enic, level is info
[...]
$ ovs-appctl dpdk/log-set debug pmd.*:notice
$ ovs-appctl dpdk/log-list
global log level is debug
id 0: lib.eal, level is debug
id 1: lib.malloc, level is debug
id 2: lib.ring, level is debug
id 3: lib.mempool, level is debug
id 4: lib.timer, level is debug
id 5: pmd, level is debug
[...]
id 37: pmd.net.bnxt.driver, level is notice
id 38: pmd.net.e1000.init, level is notice
id 39: pmd.net.e1000.driver, level is notice
id 40: pmd.net.enic, level is notice
[...]
Signed-off-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
This commit adds an AVX-512 dpcls lookup implementation.
It uses the AVX-512 SIMD ISA to perform multiple miniflow
operations in parallel.
To run this implementation, the "avx512f" and "bmi2" ISAs are
required. These ISA checks are performed at runtime while
probing the subtable implementation. If a CPU does not provide
both "avx512f" and "bmi2", then this code does not execute.
The avx512 code is built as a separate static library, with added
CFLAGS to enable the required ISA features. By building only this
static library with avx512 enabled, it is ensured that the main OVS
core library is *not* using avx512, and that OVS continues to run
as before on CPUs that do not support avx512.
The approach taken in this implementation is to use the
gather instruction to access the packet miniflow, allowing
any miniflow blocks to be loaded into an AVX-512 register.
This maximizes the usefulness of the register, and hence this
implementation handles any subtable with up to 8 miniflow bits.
Note that specialization of these avx512 lookup routines
still provides performance value, as the hashing of the
resulting data is performed in scalar code, and compile-time
loop unrolling occurs when specialized to miniflow bits.
This commit checks at configure time if the assembler in use
has a known bug in assembling AVX512 code. If this bug is present,
all AVX512 code is disabled. Checking the version string of the binutils
or assembler is not a good method to detect the issue, as back ported
fixes would not be reflected.
Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
Acked-by: William Tu <u9012063@gmail.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
This commit refactors the existing dpif subtable function pointer
infrastructure, and implements an autovalidator component.
The refactoring makes the existing dpcls subtable lookup function
handling more generic, and cleans up how more implementations can
be enabled in future.
In order to ensure all implementations provide identical results,
the autovalidator is added. The autovalidator itself implements
the subtable lookup function prototype, but internally iterates
over all other available implementations. The end result is that
testing of each implementation becomes automatic when the
autovalidator implementation is selected.
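The idea can be sketched as follows. This is an illustration under
simplified assumptions, not the dpcls code: the prototype, the two toy
implementations, and the mismatch counter are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical prototype: returns a bitmask with bit i set when
 * keys[i] hits.  In this toy example a key "hits" when it is even. */
typedef uint32_t (*lookup_func)(const uint32_t *keys, size_t n);

/* Reference implementation, one key at a time. */
static uint32_t
lookup_scalar(const uint32_t *keys, size_t n)
{
    uint32_t hits = 0;

    for (size_t i = 0; i < n; i++) {
        if ((keys[i] & 1) == 0) {
            hits |= UINT32_C(1) << i;
        }
    }
    return hits;
}

/* Alternative implementation, two keys per iteration. */
static uint32_t
lookup_unrolled(const uint32_t *keys, size_t n)
{
    uint32_t hits = 0;
    size_t i = 0;

    for (; i + 1 < n; i += 2) {
        hits |= (uint32_t) !(keys[i] & 1) << i;
        hits |= (uint32_t) !(keys[i + 1] & 1) << (i + 1);
    }
    if (i < n) {
        hits |= (uint32_t) !(keys[i] & 1) << i;
    }
    return hits;
}

static const lookup_func impls[] = { lookup_scalar, lookup_unrolled };
static size_t autovalidator_mismatches;

/* Has the same prototype as the implementations it checks: runs the
 * reference first, then compares every other implementation to it. */
static uint32_t
lookup_autovalidator(const uint32_t *keys, size_t n)
{
    uint32_t expected;

    assert(n <= 32);
    expected = impls[0](keys, n);
    for (size_t i = 1; i < sizeof impls / sizeof impls[0]; i++) {
        if (impls[i](keys, n) != expected) {
            autovalidator_mismatches++;
        }
    }
    return expected;
}
```

Because the validator shares the prototype, callers can select it like
any other implementation and every lookup then doubles as a test.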
Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
Acked-by: William Tu <u9012063@gmail.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
Commit 1f16131837 ("ct-dpif, dpif-netlink: Add conntrack timeout
policy support") adds conntrack timeout policy for kernel datapath.
This patch enables support for the userspace datapath. I tested it
using 'make check-system-userspace', which checks the timeout
policies for the ICMP and UDP cases.
Signed-off-by: William Tu <u9012063@gmail.com>
Acked-by: Yi-Hung Wei <yihung.wei@gmail.com>
Abbreviated as TSO, TCP Segmentation Offload is a feature which enables
the network stack to delegate the TCP segmentation to the NIC reducing
the per packet CPU overhead.
A guest using vhostuser interface with TSO enabled can send TCP packets
much bigger than the MTU, which saves CPU cycles normally used to break
the packets down to MTU size and to calculate checksums.
It also saves CPU cycles used to parse multiple packets/headers during
the packet processing inside virtual switch.
If the destination of the packet is another guest in the same host, then
the same big packet can be sent through a vhostuser interface skipping
the segmentation completely. However, if the destination is not local,
the NIC hardware is instructed to do the TCP segmentation and checksum
calculation.
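To give a feel for the work being delegated, the arithmetic of
splitting one large TCP payload into MSS-sized segments can be
sketched as below. The function names are hypothetical, and the
1460-byte segment size used in the note afterwards is only an
assumed example value:

```c
#include <assert.h>
#include <stddef.h>

/* Number of segments needed to carry 'payload_len' bytes of TCP
 * payload when each segment holds at most 'mss' bytes. */
static size_t
tso_segment_count(size_t payload_len, size_t mss)
{
    assert(mss > 0);
    return (payload_len + mss - 1) / mss;
}

/* Payload bytes carried by segment 'idx' (0-based).  Every segment is
 * full except possibly the last one. */
static size_t
tso_segment_len(size_t payload_len, size_t mss, size_t idx)
{
    size_t offset = idx * mss;

    assert(mss > 0 && offset < payload_len);
    return payload_len - offset < mss ? payload_len - offset : mss;
}
```

For a 64000-byte payload and a 1460-byte segment size this yields 44
segments, each of which needs its own headers and checksum.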
It is recommended to check if NIC hardware supports TSO before enabling
the feature, which is off by default. For additional information please
check the tso.rst document.
Signed-off-by: Flavio Leitner <fbl@sysclose.org>
Tested-by: Ciara Loftus <ciara.loftus.intel.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
Sometimes interface updates happen in a way that ifnotifier is not
able to catch. For example, some heavy operations (device reset) in
netdev-dpdk could require re-applying the bridge configuration.
For this purpose a new manual notifier is introduced. Its function
'if_notifier_manual_report()' can be called directly by the code
that is aware of the changes. This new notifier is thread-safe.
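One way such a notifier can be made safe to call from any thread is
sketched below with C11 atomics. This is an assumption-laden
illustration, not the code added by this patch; all names here are
hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

typedef void notify_func(void *aux);

/* Registered callback; NULL until a consumer sets it. */
static notify_func *_Atomic manual_cb;

/* Counter bumped by the example callback below. */
static atomic_uint change_seq;

/* Called once by the consumer (for example, the bridge code). */
static void
manual_notifier_set_cb(notify_func *cb)
{
    atomic_store(&manual_cb, cb);
}

/* May be called from any thread that knows something changed. */
static void
manual_notifier_report(void)
{
    notify_func *cb = atomic_load(&manual_cb);

    if (cb) {
        cb(NULL);
    }
}

/* Example callback: only records that a reconfiguration is needed. */
static void
on_interfaces_changed(void *aux)
{
    (void) aux;
    atomic_fetch_add(&change_seq, 1);
}
```

Keeping the callback down to a single atomic increment means the
reporting thread never touches the consumer's state directly.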
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
The patch introduces experimental AF_XDP support for OVS netdev.
AF_XDP, the Address Family of the eXpress Data Path, is a new Linux socket
type built upon the eBPF and XDP technology. It aims to have comparable
performance to DPDK but to cooperate better with the existing kernel
networking stack. An AF_XDP socket receives and sends packets from an
eBPF/XDP program attached to the netdev, bypassing a couple of the Linux
kernel's subsystems. As a result, an AF_XDP socket shows much better
performance than AF_PACKET.
For more details about AF_XDP, please see linux kernel's
Documentation/networking/af_xdp.rst. Note that by default, this feature is
not compiled in.
Signed-off-by: William Tu <u9012063@gmail.com>
Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
This commit moves some data-structures to be available
in the dpif-netdev-private.h header. This allows specific
implementations of the subtable lookup function to include
just that header file, and not require that the code exists
in dpif-netdev.c.
Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
Tested-by: Malvika Gupta <malvika.gupta@arm.com>
Acked-by: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
Flow API providers are renamed to be consistent with the parent module
'netdev-offload' and to look more like each other.
'_rte_' is replaced with the more convenient '_dpdk_'.
We'll have the following structure:
  Common code:
    lib/netdev-offload-provider.h
    lib/netdev-offload.c
    lib/netdev-offload.h
  Providers:
    lib/netdev-offload-tc.c
    lib/netdev-offload-dpdk.c
'netdev-offload-dummy' still resides inside netdev-dummy, but it
does not make much sense to move it out of there.
Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Acked-by: Ben Pfaff <blp@ovn.org>
Acked-by: Roi Dayan <roid@mellanox.com>
New module 'netdev-offload' created to manage different flow API
implementations. All the generic and provider independent code moved
there from the 'netdev' module.
Flow API providers further encapsulated.
The only function that was changed is 'netdev_any_oor'.
Now it uses offloading related hmap instead of common 'netdev_shash'.
Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Acked-by: Ben Pfaff <blp@ovn.org>
Acked-by: Roi Dayan <roid@mellanox.com>
Current issues with Flow API:
* OVS calls offloading functions regardless of successful
flow API initialization. (ex. on init_flow_api failure)
* Static initialization of Flow API for a netdev_class forbids
having different offloading types for different instances
of netdev with the same netdev_class. (ex. different vports in
'system' and 'netdev' datapaths at the same time)
Solution:
* Move Flow API from the netdev_class to netdev instance.
* Make Flow API dynamic, i.e. probe the APIs and choose the
suitable one.
Side effects:
* Flow API providers are localized as much as possible in their modules.
* Now we have the ability to make runtime checks. For example,
we could check whether a particular device supports the features we
need, such as whether a dpdk device supports the RSS+MARK action.
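The solution above can be sketched as follows: each netdev instance
carries its own Flow API pointer, filled in by probing a list of
providers. All type, function, and provider names in this sketch are
hypothetical and simplified; they are not the structures introduced by
this patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

struct netdev_sketch;

struct flow_api_sketch {
    const char *name;
    int (*init)(struct netdev_sketch *);     /* 0 if usable on device. */
    int (*flow_put)(struct netdev_sketch *);
};

struct netdev_sketch {
    const char *type;
    const struct flow_api_sketch *flow_api;  /* Per instance, or NULL. */
};

static int
tc_like_init(struct netdev_sketch *dev)
{
    return strcmp(dev->type, "system") ? EOPNOTSUPP : 0;
}

static int
dpdk_like_init(struct netdev_sketch *dev)
{
    return strcmp(dev->type, "dpdk") ? EOPNOTSUPP : 0;
}

static int
put_ok(struct netdev_sketch *dev)
{
    (void) dev;
    return 0;
}

static const struct flow_api_sketch providers[] = {
    { "tc-like", tc_like_init, put_ok },
    { "dpdk-like", dpdk_like_init, put_ok },
};

/* Probes the providers in order and keeps the first one that accepts
 * this particular device instance. */
static int
assign_flow_api(struct netdev_sketch *dev)
{
    assert(dev && dev->type);
    for (size_t i = 0; i < sizeof providers / sizeof providers[0]; i++) {
        if (providers[i].init(dev) == 0) {
            dev->flow_api = &providers[i];
            return 0;
        }
    }
    dev->flow_api = NULL;
    return EOPNOTSUPP;
}

/* Offload calls are refused when no provider was initialized. */
static int
flow_put(struct netdev_sketch *dev)
{
    return dev->flow_api ? dev->flow_api->flow_put(dev) : EOPNOTSUPP;
}
```

Because the pointer lives in the instance, two devices of the same
class can end up with different providers, and a failed probe simply
leaves offloading disabled for that device.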
Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Acked-by: Roi Dayan <roid@mellanox.com>
Hardware offloading code is moved to a new file called
netdev-rte-offloads.c. The original offloading code is copied
from the netdev-dpdk.c file to the new file, where future
offloading code should be added as well.
The copied code was refactored based on coding style.
The netdev-dpdk.c file will remain unchanged as new offloading
code is added.
Co-authored-by: Ophir Munk <ophirmu@mellanox.com>
Reviewed-by: Asaf Penso <asafp@mellanox.com>
Signed-off-by: Roni Bar Yanai <roniba@mellanox.com>
Signed-off-by: Ophir Munk <ophirmu@mellanox.com>
Acked-by: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>