A boolean is currently used to differentiate between the
static and XPS Tx queue modes.
Since we are going to introduce a new steering mode, replace
this boolean with an enum.
This patch does not introduce functional changes.
Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Reviewed-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
This patch adds Rx and Tx per-queue statistics. It will be
used to test hash-based Tx packet steering. Only "bytes"
and "packets" per-queue custom statistics are added, as
there are no global "errors" counters in netdev-dummy.
Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Reviewed-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The encap & decap actions are extended to support the MPLS packet type.
Encap & decap actions add and remove an MPLS header at the start of the
packet.
The existing PUSH MPLS & POP MPLS actions insert & remove an MPLS
header between the Ethernet header and the IP header. Though this behaviour
is fine for L3 VPN, where an IP packet is encapsulated inside an MPLS
tunnel, it does not satisfy the L2 VPN requirements. In L2 VPN the
Ethernet packets must be encapsulated inside an MPLS tunnel.
In this change the encap & decap actions are extended to support the MPLS
packet type. The encap & decap actions add and remove an MPLS header at the
start of the packet as depicted below.
Encapsulation:
Actions - encap(mpls),encap(ethernet)
Incoming packet -> | ETH | IP | Payload |
1 Actions - encap(mpls) [Datapath action - ADD_MPLS:0x8847]
Outgoing packet -> | MPLS | ETH | Payload|
2 Actions - encap(ethernet) [ Datapath action - push_eth ]
Outgoing packet -> | ETH | MPLS | ETH | Payload|
Decapsulation:
Incoming packet -> | ETH | MPLS | ETH | IP | Payload |
Actions - decap(),decap(packet_type(ns=0,type=0))
1 Actions - decap() [Datapath action - pop_eth]
Outgoing packet -> | MPLS | ETH | IP | Payload|
2 Actions - decap(packet_type(ns=0,type=0)) [Datapath action - POP_MPLS:0x6558]
Outgoing packet -> | ETH | IP | Payload|
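The header ordering depicted above can be modelled with a short Python
sketch. It is illustrative only: a packet is a plain list of header names,
and the helpers encap() and decap() are hypothetical stand-ins, not OVS
functions or datapath actions.

```python
# Illustrative model only: a packet is a list of header names, outermost
# first. encap()/decap() are hypothetical helpers, not OVS APIs.

def encap(packet, header):
    """Add a header at the start of the packet."""
    return [header] + packet


def decap(packet):
    """Remove the outermost header from the packet."""
    return packet[1:]


incoming = ["ETH", "IP", "Payload"]

# encap(mpls), then encap(ethernet): the original frame ends up inside MPLS.
tunnelled = encap(encap(incoming, "MPLS"), "ETH")
print(tunnelled)

# decap() removes the outer Ethernet header, the second decap removes MPLS.
restored = decap(decap(tunnelled))
print(restored)
```

In this model the inner Ethernet header survives the round trip, which is
the property L2 VPN needs.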
Signed-off-by: Martin Varghese <martin.varghese@nokia.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
During a netlink transaction, in case of replies of type NLMSG_ERROR,
the current behavior includes the translation of the error number
received into a string that describes the error code.
Netlink replies may carry a more descriptive error message, and
although it is possible to read those messages using the existing perf
tracepoint, it is more convenient to retrieve them directly from ovs.
This patch extends nl_msg_nlmsgerr() so that it retrieves the message
which, if present, is later used by nl_sock_transact_multiple__()
in place of the generic description of the error number. This is
particularly useful with tc, which makes use of this kind of mechanism.
As an example, with this patch applied, the following generic message:
ovs|00239|netlink_socket|DBG|received NAK error=0 (Operation not supported)
becomes:
ovs|00239|netlink_socket|DBG|received NAK error=0 - Conntrack isn't enabled
The layout has been slightly modified to avoid nested parentheses.
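The selection between the two log layouts can be sketched in Python as
follows. This is not the OVS implementation: format_nak() is a hypothetical
helper, and the only behaviour taken from the text above is that a
descriptive message, when present, replaces the generic error string and
that the parentheses are dropped in that case.

```python
import errno
import os


def format_nak(error, ext_msg=None):
    """Build the log text for a NAK reply (hypothetical helper)."""
    if ext_msg:
        # A descriptive message from the reply replaces the generic text.
        return "received NAK error=%d - %s" % (error, ext_msg)
    # Fall back to the generic description of the error number.
    return "received NAK error=%d (%s)" % (error, os.strerror(error))


print(format_nak(errno.EOPNOTSUPP))
print(format_nak(errno.EOPNOTSUPP, "Conntrack isn't enabled"))
```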
Suggested-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
Reviewed-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
Signed-off-by: Paolo Valerio <pvalerio@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Currently ingress policing uses the basic classifier to apply traffic
control filters unless hardware offload is enabled, in which case it
uses matchall. This change makes ingress policing always use matchall,
and fall back to basic if the kernel is built without matchall
support.
The system tests are modified to allow either basic or matchall
classification on the ingress filter, and to allow either 10000 or
10240 packets for the packet burst filter. 10000 is accurate for kernel
5.14 and the most recent iproute2; however, 10240 is left for
compatibility with older kernels.
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
This commit improves handling of packets where the allocated memory
is less than 64 bytes. For packets received from DPDK ports this
never matters, as an mbuf always pre-allocates enough space. However,
it can occur when a packet is received from a kernel interface
or injected by an OpenFlow controller. The fix is required to
ensure OVS doesn't overread the allocated memory, e.g.:
==49944==ERROR: AddressSanitizer: heap-buffer-overflow on address
0x6060000d8181 at pc 0x000001cb9d24 bp 0x7ffce3b385d0 sp 0x7ffce3b385c8
READ of size 64 at 0x6060000d8181 thread T0
#0 0x1cb9d23 in mfex_avx512_process lib/dpif-netdev-extract-avx512.c:491:26
#1 0x1cb9d23 in mfex_avx512_ip_udp lib/dpif-netdev-extract-avx512.c:625:1
#2 0x18786a1 in dpif_miniflow_extract_autovalidator lib/dpif-netdev-private-extract.c:277:29
#3 0x1cbca5c in dp_netdev_input_outer_avx512 lib/dpif-netdev-avx512.c:159:19
#4 0x1853048 in dp_netdev_process_rxq_port lib/dpif-netdev.c:4900:19
#5 0x1837c76 in dpif_netdev_run lib/dpif-netdev.c:6197:25
#6 0x1727a02 in type_run ofproto/ofproto-dpif.c:370:9
#7 0x16f6e07 in ofproto_type_run ofproto/ofproto.c:1778:31
#8 0x16c1a8b in bridge_run__ vswitchd/bridge.c:3245:9
#9 0x16bd2fd in bridge_run vswitchd/bridge.c:3310:5
#10 0x16db8fe in main vswitchd/ovs-vswitchd.c:127:9
#11 0x7fbc0c5b61a2 in __libc_start_main (/lib64/libc.so.6+0x271a2)
#12 0xedabbd in _start (vswitchd/ovs-vswitchd+0xedabbd)
0x6060000d8181 is located 9 bytes to the right of 56-byte
region [0x6060000d8140,0x6060000d8178)
allocated by thread T0 here:
#0 0xf7b09f in malloc (vswitchd/ovs-vswitchd+0xf7b09f)
#1 0x1aff3b9 in xmalloc__ lib/util.c:137:15
#2 0x1aff3b9 in xmalloc lib/util.c:172:12
#3 0x1afe211 in process_command lib/unixctl.c:310:13
#4 0x1afe211 in run_connection lib/unixctl.c:344:17
#5 0x1afe211 in unixctl_server_run lib/unixctl.c:395:21
#6 0x16db918 in main vswitchd/ovs-vswitchd.c:128:9
#7 0x7fbc0c5b61a2 in __libc_start_main (/lib64/libc.so.6+0x271a2)
The implemented solution uses a mask-to-zero load if the available buffer
size is less than 64 bytes, and a branch selects which type of load is used.
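The idea of a mask-to-zero load can be illustrated with a small Python
model. This is only a sketch of the concept, not the AVX512 code: BLOCK and
load_block() are hypothetical names, and zero padding stands in for the
lanes that a masked load would leave at zero.

```python
BLOCK = 64  # bytes consumed by one full-width load in this sketch


def load_block(buf):
    """Return exactly BLOCK bytes without reading past the end of buf."""
    if len(buf) >= BLOCK:
        # Enough data is allocated: a plain full-width load is safe.
        return bytes(buf[:BLOCK])
    # Short buffer: read only the bytes that exist and zero the rest,
    # which is what a masked load with zeroing would produce.
    return bytes(buf) + bytes(BLOCK - len(buf))


short = load_block(b"\x01" * 56)
print(len(short), short[56:])
```

The 56-byte input mirrors the region size in the AddressSanitizer report
above; the model never touches bytes 56 to 63 of the source buffer.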
Fixes: 250ceddcc2 ("dpif-netdev/mfex: Add AVX512 based optimized miniflow extract")
Reported-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Currently, on failure to link with DPDK, the configure script provides
an error message to update the PKG_CONFIG_PATH even though the cause of
failure was missing dependencies. Improve the error message to include this
scenario.
Signed-off-by: Sunil Pai G <sunil.pai.g@intel.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
When troubleshooting multiqueue setups, per-queue statistics help to
check how packets are distributed across rx and tx queues.
Per-queue statistics are exported by most DPDK drivers (with capability
RTE_ETH_DEV_AUTOFILL_QUEUE_XSTATS).
OVS only filters DPDK statistics; there is nothing to request in the DPDK API.
So the only change is to extend the filter on xstats.
Querying statistics with
$ ovs-vsctl get interface dpdk0 statistics |
sed -e 's#[{}]##g' -e 's#, #\n#g'
and comparing gives:
@@ -13,7 +13,12 @@
rx_phy_crc_errors=0
rx_phy_in_range_len_errors=0
rx_phy_symbol_errors=0
+rx_q0_bytes=0
rx_q0_errors=0
+rx_q0_packets=0
+rx_q1_bytes=0
rx_q1_errors=0
+rx_q1_packets=0
rx_wqe_errors=0
tx_broadcast_packets=0
tx_bytes=0
@@ -27,3 +32,13 @@
tx_pp_rearm_queue_errors=0
tx_pp_timestamp_future_errors=0
tx_pp_timestamp_past_errors=0
+tx_q0_bytes=0
+tx_q0_packets=0
+tx_q1_bytes=0
+tx_q1_packets=0
+tx_q2_bytes=0
+tx_q2_packets=0
+tx_q3_bytes=0
+tx_q3_packets=0
+tx_q4_bytes=0
+tx_q4_packets=0
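For scripted comparisons, the same post-processing as the sed pipeline
above can be sketched in Python. The helper names and the sample string are
made up for illustration; the sample only imitates the shape of the
ovs-vsctl output and is not real data.

```python
import re


def split_stats(raw):
    """Mimic the sed pipeline: drop the braces, then split on ', '."""
    body = re.sub(r"[{}]", "", raw.strip())
    return body.split(", ") if body else []


def per_queue(stats):
    """Keep only per-queue counters such as rx_q0_bytes or tx_q3_packets."""
    pattern = re.compile(r"^[rt]x_q\d+_(bytes|packets|errors)=")
    return [s for s in stats if pattern.match(s)]


# Hypothetical sample, shaped like the statistics column.
sample = "{rx_bytes=0, rx_q0_bytes=0, rx_q0_errors=0, rx_q0_packets=0, tx_q0_bytes=0}"
for line in per_queue(split_stats(sample)):
    print(line)
```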
Signed-off-by: David Marchand <david.marchand@redhat.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Acked-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
When changing the number of Rx or Tx queues, per queue basic stats can be
renumbered in the DPDK ethdev layer [1].
OVS maintains an internal xstats IDs cache that was refreshed when a
cached id was not valid anymore (in netdev_dpdk_get_custom_stats) or if
a new DPDK port was created.
This did not handle changes of Rx/Tx queues count.
For example, with a mlx5 port:
$ ovs-vsctl set interface dpdk0 options:n_rxq=2
$ ovs-vsctl get interface dpdk0 statistics |
sed -e 's#[{}]##g' -e 's#, #\n#g' |
grep rx_q._errors
rx_q0_errors=0
Move the cache filling after reconfiguring and starting the port.
There is no need to flush the cache in netdev_dpdk_get_custom_stats.
While at it, the xstats code can be cleaned up:
- remove wrong or self-evident comments,
- don't check x*alloc return value,
- expect that consecutive calls to xstats API return the same number of
elements,
- only write to dev-> when all DPDK calls succeeded,
- add missing lock annotations to netdev_dpdk_clear_xstats and
netdev_dpdk_get_xstat_name.
1: https://git.dpdk.org/dpdk/tree/lib/librte_ethdev/rte_ethdev.c?h=v20.11#n2696
Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2021-November/389456.html
Signed-off-by: David Marchand <david.marchand@redhat.com>
Acked-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Previously in OVS, a PMD thread running on cpu X used lcore X.
This assumption limited OVS to run PMD threads on physical cpu <
RTE_MAX_LCORE.
DPDK 20.08 introduced a new API that associates a non-EAL thread to a free
lcore. This new API does not change the thread characteristics (like CPU
affinity) and lets OVS run its PMD threads on any cpu regardless of
RTE_MAX_LCORE.
The DPDK multiprocess feature is not compatible with this new API and is
disabled.
DPDK still limits the number of lcores to RTE_MAX_LCORE (128 on x86_64)
which should be enough for OVS pmd threads (hopefully).
DPDK lcore/OVS PMD thread mappings are logged when attaching an OVS PMD
thread and when detaching it.
A new command is added to show the DPDK point of view of the DPDK lcores
at any time:
$ ovs-appctl dpdk/lcore-list
lcore 0, socket 0, role RTE, cpuset 0
lcore 1, socket 0, role NON_EAL, cpuset 1
lcore 2, socket 0, role NON_EAL, cpuset 15
Signed-off-by: David Marchand <david.marchand@redhat.com>
Acked-by: Kevin Traynor <ktraynor@redhat.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
tests/oss-fuzz/flow_extract_target.c:59:53:
error: too few arguments to function call, expected 4, have 1
uint16_t tcp_flags = parse_tcp_flags(&packet);
~~~~~~~~~~~~~~~ ^
Fixes: e7e9973b80 ("dpif-netdev: Forwarding optimization for flows with a simple match.")
Reported-at: https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=43498
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Aaron Conole <aconole@redhat.com>
The fuzzing target times out if the action list is too big. And we
don't really need to fully parse all the actions just to say that they
are too big in the end. So, check early and exit.
This is a pure performance optimization, so not adding a unit test.
All other code paths during the parsing are using E2BIG and not EFBIG
for similar conditions, so using it here too.
Reported-at: https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=39670
Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The hw_miss_packet_recover() API results in performance degradation for
ports that are either not offload capable or do not support this specific
offload API.
For example, in the test configuration shown below, the vhost-user port
does not support offloads and the VF port doesn't support hw_miss offload
API. But because tunnel offload needs to be configured in other bridges
(br-vxlan and br-phy), OVS has been built with -DALLOW_EXPERIMENTAL_API.
br-vhost br-vxlan br-phy
vhost-user<-->VF VF-Rep<-->VxLAN uplink-port
For every packet between the VF and the vhost-user ports, the hw_miss API is
called even though it is not supported by the ports involved. This leads
to a significant performance drop (~3x in some cases; both cycles and pps).
Return EOPNOTSUPP when this API fails for a device that doesn't support it
and avoid this API on that port for subsequent packets.
Signed-off-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
There are cases where users might want simple forwarding or drop rules
for all packets received from a specific port, e.g.:
"in_port=1,actions=2"
"in_port=2,actions=IN_PORT"
"in_port=3,vlan_tci=0x1234/0x1fff,actions=drop"
"in_port=4,actions=push_vlan:0x8100,set_field:4196->vlan_vid,output:3"
There are also cases where complex OpenFlow rules can be simplified
down to datapath flows with very simple match criteria.
In theory, for very simple forwarding, OVS doesn't need to parse
packets at all in order to follow these rules. "Simple match" lookup
optimization is intended to speed up packet forwarding in these cases.
Design:
Due to various implementation constraints, the userspace datapath always
has the following flow fields in exact match (i.e. it's required to
match at least these fields of a packet even if the OF rule doesn't
need that):
- recirc_id
- in_port
- packet_type
- dl_type
- vlan_tci (CFI + VID) - in most cases
- nw_frag - for ip packets
Not all of these fields are related to the packet itself. We already
know the current 'recirc_id' and the 'in_port' before starting the
packet processing. It also seems safe to assume that we're working
with Ethernet packets. So, for the simple OF rule we need to match
only on 'dl_type', 'vlan_tci' and 'nw_frag'.
'in_port', 'dl_type', 'nw_frag' and 13 bits of 'vlan_tci' can be
combined in a single 64-bit integer (mark) that can be used as a
hash in a hash map. We are using only VID and CFI from the 'vlan_tci';
flows that need to match on PCP will not qualify for the optimization.
The workaround for matching on non-existence of vlan is updated to match on
CFI and VID only in order to qualify for the optimization. CFI is
always set by OVS if vlan is present in a packet, so there is no need
to match on PCP in this case. 'nw_frag' takes 2 bits of PCP inside
the simple match mark.
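One way to pack these fields into a 64-bit mark is sketched below in
Python. The exact bit layout is an assumption made for illustration
(in_port in the upper 32 bits, dl_type in the next 16, and the low 16 bits
shared by 'nw_frag', CFI and VID); byte order handling is ignored, and the
constant and function names are hypothetical.

```python
# Assumed layout for illustration, not necessarily the one OVS uses:
#   bits 63..32  in_port
#   bits 31..16  dl_type
#   bits 14..13  nw_frag (stored where PCP bits would otherwise sit)
#   bit  12      CFI
#   bits 11..0   VID
VLAN_VID_MASK = 0x0FFF
VLAN_CFI = 0x1000
VLAN_PCP_SHIFT = 13
NW_FRAG_MASK = 0x3


def simple_match_mark(in_port, dl_type, nw_frag, vlan_tci):
    """Combine the four fields into one 64-bit integer usable as a hash key."""
    return (((in_port & 0xFFFFFFFF) << 32)
            | ((dl_type & 0xFFFF) << 16)
            | ((nw_frag & NW_FRAG_MASK) << VLAN_PCP_SHIFT)
            | (vlan_tci & (VLAN_VID_MASK | VLAN_CFI)))


mark = simple_match_mark(in_port=3, dl_type=0x0800, nw_frag=0, vlan_tci=0x1234)
print(hex(mark))
```

Because only VID and CFI are kept from 'vlan_tci', two packets that differ
only in PCP produce the same mark in this sketch, which is why flows that
match on PCP cannot use the optimization.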
A new per-PMD flow table 'simple_match_table' is introduced to store
simple match flows only. 'dp_netdev_flow_add' adds a flow to the
usual 'flow_table' and to the 'simple_match_table' if the flow
meets the following constraints:
- 'recirc_id' in flow match is 0.
- 'packet_type' in flow match is Ethernet.
- Flow wildcards contain only the minimal set of non-wildcarded fields
(listed above).
If the number of flows for the current 'in_port' in the regular 'flow_table'
equals the number of flows for the current 'in_port' in the
'simple_match_table', we may use the simple match optimization, because all
the flows we have are simple match flows. This means that we only need to
parse 'dl_type', 'vlan_tci' and 'nw_frag' to perform packet matching.
We then make the unique flow mark from the 'in_port', 'dl_type',
'nw_frag' and 'vlan_tci' and look for it in the 'simple_match_table'.
On successful lookup we don't need to run full 'miniflow_extract()'.
Unsuccessful lookup technically means that we have no suitable flow
in the datapath and upcall will be required. So, in this case EMC and
SMC lookups are disabled. We may optimize this path in the future by
bypassing the dpcls lookup too.
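The per-port eligibility check and lookup described above can be modelled
with a toy Python class. It is a sketch under stated assumptions, not the
datapath code: the class, its attributes and the FULL_EXTRACT sentinel are
hypothetical, and flows are represented by plain strings.

```python
from collections import Counter

FULL_EXTRACT = "full-extract"  # sentinel: take the regular parsing path


class PmdFlowTables:
    """Toy model of the per-PMD tables (hypothetical, not OVS code)."""

    def __init__(self):
        self.flow_count = Counter()    # flows per in_port in 'flow_table'
        self.simple_count = Counter()  # flows per in_port in 'simple_match_table'
        self.simple_match_table = {}   # mark -> flow

    def add_flow(self, in_port, flow, mark=None):
        # Every flow is counted; only simple match flows carry a mark.
        self.flow_count[in_port] += 1
        if mark is not None:
            self.simple_count[in_port] += 1
            self.simple_match_table[mark] = flow

    def simple_match_enabled(self, in_port):
        # Usable only when all flows for this port are simple match flows.
        return self.flow_count[in_port] == self.simple_count[in_port]

    def lookup(self, in_port, mark):
        if not self.simple_match_enabled(in_port):
            return FULL_EXTRACT
        # None models a miss: no suitable flow, so the cache lookups are skipped.
        return self.simple_match_table.get(mark)


tables = PmdFlowTables()
tables.add_flow(1, "output:2", mark=0x10)
print(tables.lookup(1, 0x10))
```

Adding a single non-simple flow for a port makes the two counters differ,
so every packet from that port goes back to the regular path.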
Performance improvement of this solution on 'simple match' flows
should be comparable with partial HW offloading, because it parses the same
packet fields and uses a similar flow lookup scheme.
However, unlike partial HW offloading, it works for all port types
including virtual ones.
Performance results when compared to EMC:
Test setup:
virtio-user OVS virtio-user
Testpmd1 ------------> pmd1 ------------> Testpmd2
(txonly) x<------ pmd2 <------------ (mac swap)
Single stream of 64byte packets. Actions:
in_port=vhost0,actions=vhost1
in_port=vhost1,actions=vhost0
Stats collected from pmd1 and pmd2, so there are 2 scenarios:
Virt-to-Virt : Testpmd1 ------> pmd1 ------> Testpmd2.
Virt-to-NoCopy : Testpmd2 ------> pmd2 --->x Testpmd1.
Here the packet sent from pmd2 to Testpmd1 is always dropped, because
the virtqueue is full since Testpmd1 is in txonly mode and doesn't
receive any packets. This should be closer to the performance of a
VM-to-Phy scenario.
Test performed on machine with Intel Xeon CPU E5-2690 v4 @ 2.60GHz.
Table below represents improvement in throughput when compared to EMC.
+----------------+------------------------+------------------------+
| | Default (-g -O2) | "-Ofast -march=native" |
| Scenario +------------+-----------+------------+-----------+
| | GCC | Clang | GCC | Clang |
+----------------+------------+-----------+------------+-----------+
| Virt-to-Virt | +18.9% | +25.5% | +10.8% | +16.7% |
| Virt-to-NoCopy | +24.3% | +33.7% | +14.9% | +22.0% |
+----------------+------------+-----------+------------+-----------+
For Phy-to-Phy case performance improvement should be even higher, but
it's not the main use-case for this functionality. Performance
difference for the non-simple flows is within a margin of error.
Acked-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Add support for monitor_cond_since / update3 to python-ovs to
allow more efficient reconnections when connecting to clustered
OVSDB servers.
Signed-off-by: Terry Wilson <twilson@redhat.com>
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Recently there has been a lot of press about the "trojan source" attack,
where Unicode characters are used to obfuscate the true functionality of
code. This attack didn't affect OVS, but adding the check here will help
guard against it sneaking in later.
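A check of this kind can be sketched in Python as a scan for Unicode
bidirectional control characters. This is not the checkpatch code: the
function name is hypothetical, and the character set below is the list of
bidirectional embedding, override and isolate controls commonly associated
with the attack, taken here as an assumption.

```python
# Bidirectional formatting controls: LRE, RLE, PDF, LRO, RLO, LRI, RLI, FSI, PDI.
BIDI_CONTROLS = "\u202a\u202b\u202c\u202d\u202e\u2066\u2067\u2068\u2069"


def find_bidi_controls(text):
    """Return (line, column) pairs, both 1-based, for each control found."""
    hits = []
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col, char in enumerate(line, start=1):
            if char in BIDI_CONTROLS:
                hits.append((line_no, col))
    return hits


sample = "ok = 1\nx = 1 \u202e# hidden"
print(find_bidi_controls(sample))
```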
Signed-off-by: Mike Pattrick <mkp@redhat.com>
Acked-by: Gaetan Rivet <grive@u256.net>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
This commit adds a basic packet metadata macro to the already existing
macros in ovs_gdb.py. ovs_dump_packets prints out information about
one or more packets. It feeds packets into tcpdump, and the user can
pass in tcpdump options to modify how packets are parsed or even write
out packets to a pcap file.
Example usage:
(gdb) break fast_path_processing
(gdb) commands
ovs_dump_packets packets_
continue
end
(gdb) continue
Thread 1 "ovs-vswitchd" hit Breakpoint 2, fast_path_processing ...
12:01:05.962485 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has
10.1.1.1 tell 10.1.1.2, length 28
Thread 1 "ovs-vswitchd" hit Breakpoint 1, fast_path_processing ...
12:01:05.981214 ARP, Ethernet (len 6), IPv4 (len 4), Reply 10.1.1.1
is-at a6:0f:c3:f0:5f:bd (oui Unknown), length 28
Signed-off-by: Mike Pattrick <mkp@redhat.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The netlink policy unit test contains test fixture data that is
subject to endianness and currently fails on big endian systems.
Store the fixture data in a struct to ensure proper byte order for
the header data.
Also fix improper style for sizeof with expressions.
Fixes: bfee9f6c01 ("netlink: Add support for parsing link layer address.")
Signed-off-by: Frode Nordahl <frode.nordahl@canonical.com>
Acked-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Following [1]-[3] in DPDK, there are no more cast-align warnings from DPDK.
Stop ignoring them, so that they are reported if they occur.
GitHub actions:
v1: https://github.com/elibritstein/OVS/actions/runs/1540651133
[1] a3f8d0587188 ("net: avoid cast-align warning in VLAN insert function")
[2] da0333c8790b ("mbuf: avoid cast-align warning in data offset macro")
[3] 6de430b7079e ("eal/x86: avoid cast-align warning in memcpy functions")
Signed-off-by: Eli Britstein <elibr@nvidia.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
As part of some previous checkpatch work, we discovered that checkpatch
isn't always reporting correct line numbers. As it turns out, Python's
splitlines function considers several characters to be new lines which
common text editors do not typically consider to be new lines. One
example is the form feed character, which this code base uses to cluster
functionality.
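The difference is easy to reproduce in Python. The snippet below is a
minimal illustration, not the checkpatch change itself; line_number() is a
hypothetical helper that reports the 1-based position of a line.

```python
def line_number(lines, wanted):
    """Return the 1-based index of the first line equal to 'wanted'."""
    return lines.index(wanted) + 1


# A form feed (\x0c) on its own line, as used to separate code sections.
text = "int a;\n\x0c\nint b;\n"

as_splitlines = text.splitlines()  # treats the form feed as a line break
as_split = text.split("\n")        # breaks on '\n' only, like most editors

print(line_number(as_splitlines, "int b;"))
print(line_number(as_split, "int b;"))
```

With splitlines() the form feed produces an extra empty line, so every line
after it is reported one position too far down.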
Signed-off-by: Mike Pattrick <mkp@redhat.com>
Acked-by: Paolo Valerio <pvalerio@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Availability logs are not essential for a normal run. The same
information can be obtained via appctl at runtime. They also cannot
show whether a particular implementation will actually be used, hence
they are not useful for post-crash investigations. Moving them to DBG
level to avoid bulky unnecessary logging.
Additionally making them a bit more readable.
Acked-by: Kumar Amber <kumar.amber@intel.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
DPIF AVX512 optimizations currently rely on DPDK availability while
they can be used without DPDK.
Besides, checking for availability of an ISA only has to be done once
and the result won't change while an OVS process runs.
Resolve isa availability in constructors by using a simplified query
based on cpuid API that comes from the compiler.
Note: this also fixes the check on BMI2 availability: DPDK had a bug
for this isa, see https://git.dpdk.org/dpdk/commit/?id=aae3037ab1e0.
Suggested-by: Ilya Maximets <i.maximets@ovn.org>
Signed-off-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
netdev_set_dpif_type() can only be used with a normalized dpif type
as an argument, which is a constant static string derived from a type
of a dpif_class or a constant string "system". Using the same
constant string allows the netdev-offload module to compare types by
simply comparing pointers.
OTOH, 'br->ofproto->type' is a dynamic string that:
a. Can be NULL.
b. Even if not NULL and equal, can be a different dynamically
allocated string.
Both these qualities break assumptions made by all other modules
related to HW offload, breaking the functionality.
Fix that by moving netdev_set_dpif_type() to dpif.c and calling with
a correct constant string as an argument.
The call is moved from bridge.c to dpif.c because it needs access
to the dpif class, which bridge.c should not have.
Not trying to set the dpif_type inside the netdev_ports_insert(),
because it's used now outside the offloading context. So, it's
cleaner to move the netdev_set_dpif_type() call outside of the
netdev-offload module.
Additionally removed the redundant call from the netdev_ports_insert()
and refactored the function, since it doesn't need an extra argument
anymore.
Fixes: 4f19a78a61 ("netdev-vport: Fix userspace tunnel ioctl(SIOCGIFINDEX) info logs.")
Reported-by: Roi Dayan <roid@nvidia.com>
Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2021-December/390117.html
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Reviewed-by: Lin Huang <linhuang@ruijie.com.cn>
Acked-by: Roi Dayan <roid@nvidia.com>
In case of a native tunnel with BFD enabled, if the MAC address of the
remote end's interface changes (e.g. because it got rebooted, and the
MAC address is allocated dynamically), the BFD session will never be
re-established.
This happens because the local tunnel neigh entry doesn't get updated,
and the local end keeps sending BFD packets with the old destination
MAC address. This was not an issue until
b23ddcc57d ("tnl-neigh-cache: tighten arp and nd snooping.")
because ARP requests were snooped as well avoiding the problem.
Fix this by snooping the incoming packets in the slow path, and
updating the neigh cache accordingly.
Fixes: b23ddcc57d ("tnl-neigh-cache: tighten arp and nd snooping.")
Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=2002430
Signed-off-by: Paolo Valerio <pvalerio@redhat.com>
Acked-by: Gaetan Rivet <grive@u256.net>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
This is a minor issue but visible e.g. when you try to flush the neigh
cache while the ARP flow is still present in the datapath, triggering
the revalidation of the datapath flows which subsequently
refreshes/adds the entry in the cache.
Signed-off-by: Paolo Valerio <pvalerio@redhat.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Acked-by: Gaetan Rivet <grive@u256.net>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
With the command it is now possible to change the aging time of the
cache entries.
For the existing entries the aging time is updated only if the
current expiration is greater than the new one. In any case, the next
refresh will set it to the new value.
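The rule for existing entries can be written as a small Python sketch.
The function and parameter names are hypothetical and the timestamps are
plain numbers; only the comparison itself comes from the description above.

```python
def update_expiry(expires, now, new_aging):
    """Return the expiration to keep for an existing cache entry."""
    new_expires = now + new_aging
    # Shorten the entry's lifetime only if it currently outlives the new
    # aging time; otherwise leave it for the next refresh to reset.
    return new_expires if expires > new_expires else expires


print(update_expiry(expires=100, now=10, new_aging=30))
print(update_expiry(expires=35, now=10, new_aging=30))
```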
This is intended mostly for debugging purposes.
Signed-off-by: Paolo Valerio <pvalerio@redhat.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Acked-by: Gaetan Rivet <grive@u256.net>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Expires is modified in different threads (revalidator, pmd-rx, bfd-tx).
It's better to use atomics for such potentially parallel writes.
Signed-off-by: Paolo Valerio <pvalerio@redhat.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Acked-by: Gaetan Rivet <grive@u256.net>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
This commit tightens the requirements for processing TCP packets
in AVX512, ensuring that there are no TCP options by validating that
the "data offset" field of the TCP header is exactly equal to 5.
This ensures that the TCP header is not too short, and that it does
not contain extra options.
On the IP handling side, improve checks around total packet length.
Now the next protocol is included in the length checks, ensuring that
the IP header reported length is of appropriate size to contain the
next protocol (e.g. UDP requires 8 bytes, TCP requires 20). Note that
the inner protocol is always of a fixed size per profile, so it can be
set using the UDP_ and TCP_ HEADER_LEN defines.
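The two checks can be illustrated with a Python sketch that works on raw
header bytes. This is not the AVX512 implementation: the function names are
hypothetical, the IPv4 header is assumed to carry no options (20 bytes),
and the constants simply restate the sizes given above.

```python
TCP_HEADER_LEN = 20
UDP_HEADER_LEN = 8
IPV4_HEADER_LEN = 20  # assumption: IPv4 header without options


def tcp_data_offset(tcp_header):
    """The data offset is the upper 4 bits of byte 12, in 32-bit words."""
    return tcp_header[12] >> 4


def tcp_header_ok(tcp_header):
    """Accept only a TCP header of exactly 5 words, i.e. without options."""
    return len(tcp_header) >= TCP_HEADER_LEN and tcp_data_offset(tcp_header) == 5


def ip_total_len_ok(ip_total_len, l4_header_len, ip_header_len=IPV4_HEADER_LEN):
    """The reported IP length must be able to contain the next protocol header."""
    return ip_total_len >= ip_header_len + l4_header_len


header = bytearray(TCP_HEADER_LEN)
header[12] = 0x50  # data offset 5, reserved bits zero
print(tcp_header_ok(bytes(header)), ip_total_len_ok(40, TCP_HEADER_LEN))
```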
Fixes: 250ceddcc2 ("dpif-netdev/mfex: Add AVX512 based optimized miniflow extract")
Reported-by: Ilya Maximets <i.maximets@ovn.org>
Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
My primary focus has been in adjacent communities for some time now.
It seems appropriate that my status in the OVS maintainers file should
reflect the level of attention I am able to provide to the project.
Thanks to the other contributors past and present for the experiences
we've shared, and I'll see you around wherever our paths cross again :-)
Signed-off-by: Joe Stringer <joe@ovn.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Waiting only on the vhost user port to be ready is not enough since a
tap is also initialized by testpmd and is used to inject/receive packets
in/from the kernel.
Wait on the tap link status.
Fixes: 18db7ec5eb ("system-dpdk: Improve vhost-user ping tests reliability.")
Signed-off-by: David Marchand <david.marchand@redhat.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
A few problems with the current documentation:
1. bridge.rst is the high-level documentation for the end user.
Unit testing and complex implementation details are for developers,
hence should not be there. Testing instructions for developers
should be in testing.rst. Words in the doc should be understandable
for the user who doesn't know OVS internals.
2. Some paragraphs in the current documentation are repeating each
other almost to the word.
3. Some paragraphs are incorrectly formatted. That affects the
rendering.
4. There is no point describing every separate test of a system-dpdk
testsuite.
What is done:
1. All the testing related paragraphs are consolidated and moved
to the testing.rst.
2. Most abbreviations are replaced with words that are more readable
and understandable for the end user.
3. Several sentences whose meaning or purpose I failed to understand
are simply deleted.
4. Fixed formatting and a few typos along the way.
IMO, some parts of the doc still need some re-wording, but this change
provides at least a starting point for improvement by setting a better
structure for the document.
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Reviewed-by: David Marchand <david.marchand@redhat.com>
The autovalidator uses an incorrect block count while printing the
miniflow buffer from a tested implementation. This results in
not printing fields that were incorrectly added to the miniflow
or printing more of a buffer even if not needed.
Fix that by requesting and using the correct block count.
Also fixed the output formatting issues: extra spaces, characters,
unclear relations between names and numbers due to mixed up
delimiters, '%u' used for uint16_t, '\t' in the output, etc.
Fixes: dd3f5d86d9 ("dpif-netdev: Add auto validation function for miniflow extract")
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Kumar Amber <kumar.amber@intel.com>
Snapshots are scheduled every 10-20 minutes. It's a random value
in this interval for each server. Once the time is up, but the maximum
time (24 hours) is not reached yet, ovsdb will start checking if the log
grew a lot on every iteration. Once the growth is detected, compaction
is triggered.
OTOH, it's very common for an OVSDB cluster to not have the log growing
very fast. If the log didn't grow 2x in 20 minutes, the randomness of
the initial scheduled time is gone and all the servers are checking if
they need to create a snapshot on every iteration. And since all of them
are part of the same cluster, their logs are growing at the same
speed. Once the critical mass is reached, all the servers will start
creating snapshots at the same time. If the database is big enough,
that might leave the cluster unresponsive for an extended period of
time (e.g. 10-15 seconds for OVN_Southbound database in a larger scale
OVN deployment) until the compaction completes.
Fix that by re-scheduling a quick retry if the minimal time already
passed. Effectively, this will work as a randomized 1-2 min delay
between checks, so the servers will not synchronize.
The scheduling function is updated to not change the upper limit on quick
reschedules to avoid delaying the snapshot creation indefinitely.
Currently quick re-schedules are only used for the error cases, and
there is always a 'slow' re-schedule after the successful compaction.
So, the change of a scheduling function doesn't change the current
behavior much.
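The scheduling behaviour described above can be modelled with a short
Python sketch. It is an illustration under assumptions, not the ovsdb code:
the function name, the dictionary keys and the use of seconds are
hypothetical, while the 10-20 minute, 1-2 minute and 24 hour values come
from the text.

```python
import random

MINUTE = 60  # seconds


def schedule_snapshot(sched, now, quick, rng=random):
    """Pick the next check time; only a slow schedule moves the upper limit."""
    base = 1 * MINUTE if quick else 10 * MINUTE
    # Randomized delay in [base, 2 * base]: 1-2 min (quick) or 10-20 min.
    sched["next"] = now + base + rng.uniform(0, base)
    if not quick:
        # Hard deadline after which a snapshot is taken regardless of growth.
        sched["limit"] = now + 24 * 60 * MINUTE
    return sched


sched = schedule_snapshot({}, now=0, quick=False)
limit_before = sched["limit"]
sched = schedule_snapshot(sched, now=20 * MINUTE, quick=True)
print(sched["limit"] == limit_before)
```

Keeping the limit untouched on quick reschedules is what prevents repeated
short retries from postponing the snapshot indefinitely.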
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Han Zhou <hzhou@ovn.org>
Acked-by: Dumitru Ceara <dceara@redhat.com>
Commit 3c2d6274bc ("raft: Transfer leadership before creating
snapshots.") made it such that raft leaders transfer leadership before
snapshotting. However, there's still the case when the next leader to
be is in the process of snapshotting. To avoid delays in that case too,
we now explicitly allow snapshots only on followers. Cluster members
will have to wait until the current election is settled before
snapshotting.
Given the following logs taken from an OVN_Southbound 3-server cluster
during a scale test:
S1 (old leader):
19:07:51.226Z|raft|INFO|Transferring leadership to write a snapshot.
19:08:03.830Z|ovsdb|INFO|OVN_Southbound: Database compaction took 12601ms
19:08:03.940Z|raft|INFO|server 8b8d is leader for term 43
S2 (follower):
19:08:00.870Z|raft|INFO|server 8b8d is leader for term 43
S3 (new leader):
19:07:51.242Z|raft|INFO|received leadership transfer from f5c9 in term 42
19:07:51.244Z|raft|INFO|term 43: starting election
19:08:00.805Z|ovsdb|INFO|OVN_Southbound: Database compaction took 9559ms
19:08:00.869Z|raft|INFO|term 43: elected leader by 2+ of 3 servers
We see that the leader to be (S3) receives the leadership transfer,
initiates the election and immediately after starts a snapshot that
takes ~9.5 seconds. During this time, S2 votes for S3 electing it
as cluster leader but S3 doesn't effectively become leader until it
finishes snapshotting, essentially keeping the cluster without a
leader for up to ~9.5 seconds.
With the current change, S3 will delay compaction and snapshotting until
the election is finished.
The only exception is the case of single-node clusters for which we
allow the node to snapshot regardless of role.
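A minimal sketch of the resulting rule, assuming a hypothetical role
enum and helper name (neither is taken from the raft implementation):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical roles; the names are illustrative only. */
enum role_sketch { ROLE_FOLLOWER, ROLE_CANDIDATE, ROLE_LEADER };

/* Returns true if a server in 'role' may snapshot right now. */
static bool
may_snapshot(enum role_sketch role, size_t n_servers)
{
    assert(n_servers > 0);

    if (n_servers == 1) {
        /* Single-node cluster: there is nobody to hand leadership to. */
        return true;
    }
    /* Otherwise only followers snapshot; candidates and leaders wait
     * until the election is settled or leadership is transferred. */
    return role == ROLE_FOLLOWER;
}
```

In the log above, S3 would be a candidate at 19:07:51.244Z, so under
this rule it would not start compacting until it has a settled role.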
Acked-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The last RX queue, from which the packet was received, is already
stored in the PMD context. So, we can get the netdev from it without
the expensive hash map lookup.
In my V2V testing this patch improves performance by about 3% when HW
offload and experimental APIs are enabled. That narrows down the
performance difference with the case with experimental API disabled
to about 0.5%, which is well within the margin of error for that setup.
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Eli Britstein <elibr@nvidia.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
This commit adds support for DPDK v21.11. It includes the following
changes:
1. ci: Install python elftools for DPDK 21.02.
2. ci: Update meson requirement for DPDK 21.05.
3. netdev-dpdk: Fix build with 21.05.
4. ci: Compile DPDK in non developer mode.
http://patchwork.ozlabs.org/project/openvswitch/list/?series=242480&state=*
5. netdev-dpdk: Remove access to DPDK internals.
6. netdev-dpdk: Remove unused attribute from rte_flow rule.
7. netdev-dpdk: Fix mbuf macros namespace with 21.11-rc1.
8. netdev-dpdk: Fix vhost namespace with 21.11-rc2.
http://patchwork.ozlabs.org/project/openvswitch/list/?series=271159&state=*
In addition, documentation and DPDK unit tests were also updated in this
commit for use with DPDK v21.11.
For credit, all authors of the original commits to 'dpdk-latest' with the
above changes have been added as co-authors for this commit.
Signed-off-by: David Marchand <david.marchand@redhat.com>
Co-authored-by: David Marchand <david.marchand@redhat.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Tested-by: Emma Finn <emma.finn@intel.com>
Tested-by: Seamus Ryan <seamus.ryan@intel.com>
Acked-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
In patch [1] rpl_nf_conntrack_in was backported as a static inline
function without the do..while loop that handles the NF_REPEAT error.
In patch [2] the backported rpl_nf_conntrack_in function was removed
from compat/include/net/netfilter/nf_conntrack_core.h as unused.
As a result, the do..while loop around nf_conntrack_in was lost. On
old RHEL kernels this caused the loss of a TCP SYN on a connection
with the same 5-tuple as one that ran within the last
nf_conntrack_tcp_timeout_time_wait. The connection could be
initiated by a TCP SYN retry after one second.
1: 4fdec8986a
2: e9b33ad780
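For illustration, the shape of the lost loop is roughly the following.
This is a userspace sketch, not the kernel code: the verdict enum, the
callback type and fake_ct_in() are hypothetical stand-ins for the kernel
verdicts and for nf_conntrack_in().

```c
#include <assert.h>

/* Hypothetical verdicts standing in for the kernel's NF_* values. */
enum verdict_sketch { V_ACCEPT, V_REPEAT };

/* Stand-in for a conntrack hook that may ask the caller to retry. */
typedef enum verdict_sketch (*ct_in_fn)(void *pkt);

/* Calls 'ct_in' again for as long as it asks for a repeat. */
static enum verdict_sketch
conntrack_in_retry(ct_in_fn ct_in, void *pkt)
{
    enum verdict_sketch v;

    assert(ct_in);
    do {
        v = ct_in(pkt);
    } while (v == V_REPEAT);
    return v;
}

/* Fake hook: 'pkt' points to the number of repeats still to request,
 * mimicking a first pass that only discards the stale entry. */
static enum verdict_sketch
fake_ct_in(void *pkt)
{
    int *repeats_left = pkt;

    if (*repeats_left > 0) {
        --*repeats_left;
        return V_REPEAT;
    }
    return V_ACCEPT;
}
```

Without the loop, the caller sees the repeat verdict from the first
pass and the SYN is not processed against a fresh entry.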
Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2021-September/387623.html
Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2021-October/388424.html
Signed-off-by: Vladislav Odintsov <odivlad@gmail.com>
Reviewed-by: Greg Rose <gvrose8192@gmail.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Instead of waiting 10 seconds for testpmd to start, this
patch makes use of the OVS_WAIT_UNTIL() macro to wait for
the virtio device readiness notification in the ovs-vswitchd
logs.
Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
A userspace tunnel doesn't have a valid device in the kernel. So the
get_ifindex() function (ioctl) always gets an error while
adding a port, deleting a port or updating a port status.
The info log is:
"2021-08-29T09:17:39.830Z|00059|netdev_linux|INFO|ioctl(SIOCGIFINDEX)
on vxlan_sys_4789 device failed: No such device"
If there are a lot of userspace tunnel ports on a bridge, the
iface_refresh_netdev_status() function will spend a lot of time on
these failing calls.
So skip the ioctl(SIOCGIFINDEX) operation for userspace tunnel ports
and just return -ENODEV.
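The early return can be sketched as below. The structure, the helper
names and the call counter are hypothetical and only model the idea;
they are not the netdev-linux code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical port model, for illustration only. */
struct port_sketch {
    bool userspace_tunnel;   /* True if there is no kernel device. */
    int kernel_ifindex;      /* What the kernel query would report. */
};

static int n_kernel_queries;  /* Counts the simulated ioctl calls. */

/* Stand-in for the ioctl(SIOCGIFINDEX) round trip. */
static int
query_kernel_ifindex(const struct port_sketch *port)
{
    n_kernel_queries++;
    return port->kernel_ifindex;
}

/* Returns the ifindex, or a negative errno value on failure. */
static int
get_ifindex_sketch(const struct port_sketch *port)
{
    assert(port);
    if (port->userspace_tunnel) {
        /* No kernel device exists, so skip the query entirely. */
        return -ENODEV;
    }
    return query_kernel_ifindex(port);
}
```

Callers see the same -ENODEV result as before, but without the kernel
round trip and the log line for every tunnel port.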
Signed-off-by: Lin Huang <linhuang@ruijie.com.cn>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>