mirror of https://github.com/openvswitch/ovs synced 2025-08-31 06:15:47 +00:00
Commit Graph

19352 Commits

Author SHA1 Message Date
Maxime Coquelin
31e67c998b dpif-netdev: Introduce Tx queue mode.
A boolean is currently used to differentiate between the
static and XPS Tx queue modes.

Since we are going to introduce a new steering mode, replace
this boolean with an enum.

This patch does not introduce functional changes.
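
The boolean-to-enum refactoring described here can be sketched as follows. This is an illustration in Python rather than the C used in dpif-netdev, and the names `TxqMode`, `STATIC`, `XPS` and `txq_mode_from_flag` are assumptions, not identifiers taken from the patch.

```python
from enum import Enum, auto


class TxqMode(Enum):
    """Hypothetical Tx queue modes; the names are illustrative only."""
    STATIC = auto()  # what the old boolean expressed as False
    XPS = auto()     # what the old boolean expressed as True


def txq_mode_from_flag(dynamic_txqs: bool) -> TxqMode:
    """Map the old boolean onto the enum without changing behaviour."""
    return TxqMode.XPS if dynamic_txqs else TxqMode.STATIC
```

With an enum in place, a further steering mode becomes one more member instead of a second boolean.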

Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Reviewed-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-01-17 18:07:00 +01:00
Maxime Coquelin
e97112ce78 netdev-dummy: Introduce per rxq/txq statistics.
This patch adds Rx and Tx per-queue statistics. It will be
used to test hash-based Tx packet steering. Only "bytes",
and "packets" per-queue custom statistics are added, as
there are no global "errors" counters in netdev-dummy.

Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Reviewed-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-01-17 18:07:00 +01:00
Ilya Maximets
eff740b14e ofproto-dpif: Fix memory leak in dpif/show-dp-features appctl.
Fixes: a98b700db6 ("ofproto: Add appctl command to show Datapath features")
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-01-17 11:00:29 +01:00
Martin Varghese
1917ace893 Encap & Decap actions for MPLS packet type.
The encap & decap actions are extended to support the MPLS packet type.
They add and remove an MPLS header at the start of the packet.

The existing PUSH MPLS & POP MPLS actions insert & remove an MPLS
header between the Ethernet header and the IP header. Though this behaviour
is fine for L3 VPN where an IP packet is encapsulated inside an MPLS
tunnel, it does not satisfy the L2 VPN requirements. In L2 VPN the
Ethernet packets must be encapsulated inside an MPLS tunnel.

In this change the encap & decap actions are extended to support the MPLS
packet type. The encap & decap actions add and remove an MPLS header at the
start of the packet as depicted below.

Encapsulation:

Actions - encap(mpls),encap(ethernet)

Incoming packet -> | ETH | IP | Payload |

1 Actions -  encap(mpls) [Datapath action - ADD_MPLS:0x8847]

        Outgoing packet -> | MPLS | ETH | Payload|

2 Actions - encap(ethernet) [ Datapath action - push_eth ]

        Outgoing packet -> | ETH | MPLS | ETH | Payload|

Decapsulation:

Incoming packet -> | ETH | MPLS | ETH | IP | Payload |

Actions - decap(),decap(packet_type(ns=0,type=0))

1 Actions -  decap() [Datapath action - pop_eth]

        Outgoing packet -> | MPLS | ETH | IP | Payload|

2 Actions - decap(packet_type(ns=0,type=0)) [Datapath action - POP_MPLS:0x6558]

        Outgoing packet -> | ETH  | IP | Payload|
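
The header stacks in the diagrams above can be modelled as plain lists. This is only a toy model of the ordering, not datapath code; `encap` and `decap` here are hypothetical Python helpers, and the IP header is listed explicitly even where the diagram folds it into the payload.

```python
def encap(headers, new_header):
    """Add a header at the start of the packet."""
    return [new_header] + headers


def decap(headers):
    """Remove the header at the start of the packet."""
    return headers[1:]


packet = ["ETH", "IP", "Payload"]
step1 = encap(packet, "MPLS")    # encap(mpls)
step2 = encap(step1, "ETH")      # encap(ethernet)
restored = decap(decap(step2))   # decap() followed by the MPLS decap
```

Applying the two decap steps to the encapsulated packet yields the original header order again.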

Signed-off-by: Martin Varghese <martin.varghese@nokia.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-01-17 02:04:20 +01:00
Paolo Valerio
4a6a473462 netlink-socket: Log extack error messages in netlink transactions.
During a netlink transaction, in case of replies of type NLMSG_ERROR,
the current behavior includes the translation of the error number
received into a string that describes the error code.

Netlink replies may carry a more descriptive error message, and
although it is possible to read those messages using the existing perf
tracepoint, it is more convenient to retrieve them directly from ovs.

This patch extends nl_msg_nlmsgerr() so that it retrieves the message
that later, if present, will be used by nl_sock_transact_multiple__()
in place of the generic descriptive form of the error number.  This is
particularly useful with tc that makes use of such kind of mechanism.

As an example, with this patch applied, the following generic message:

ovs|00239|netlink_socket|DBG|received NAK error=0 (Operation not supported)

becomes:

ovs|00239|netlink_socket|DBG|received NAK error=0 - Conntrack isn't enabled

The layout has been slightly modified to avoid nested parentheses.
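
A minimal sketch of the message selection described above, assuming a hypothetical helper `format_nak` (the real logic lives in C in nl_sock_transact_multiple__() and is not reproduced here): the descriptive message is used when the reply carries one, otherwise the generic description of the error number.

```python
import os
from typing import Optional


def format_nak(error: int, ext_msg: Optional[str] = None) -> str:
    """Build the log text for an NLMSG_ERROR reply (illustrative only)."""
    if ext_msg:
        # Prefer the descriptive message carried in the reply.
        return "received NAK error=%d - %s" % (error, ext_msg)
    # Otherwise fall back to the generic description of the error number.
    return "received NAK error=%d (%s)" % (error, os.strerror(error))
```

Using " - " as the separator mirrors the layout change mentioned above, which avoids nesting parentheses when the kernel message itself contains them.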

Suggested-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
Reviewed-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
Signed-off-by: Paolo Valerio <pvalerio@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-01-16 22:16:16 +01:00
Mike Pattrick
eb1ab5357b netdev-linux: Use matchall classifier for ingress policing.
Currently, ingress policing uses the basic classifier to apply traffic
control filters unless hardware offload is enabled, in which case it
uses matchall. This change makes ingress policing always use matchall,
falling back to basic only if the kernel is built without matchall
support.

The system tests are modified to allow either basic or matchall
classification on the ingestion filter, and to allow either 10000 or
10240 packets for the packet burst filter. 10000 is accurate for kernel
5.14 and the most recent iproute2, however, 10240 is left for
compatibility with older kernels.

Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-01-12 15:23:29 +01:00
Harry van Haaren
3b489a3b1b dpif-netdev: Improve loading of packet data for undersized packets.
This commit improves handling of packets where the allocated memory
is less than 64 bytes.  For packets received from DPDK ports this
never matters, as an mbuf always pre-allocates enough space; however,
this can occur when a packet is received from a kernel interface
or injected by an OpenFlow controller.  The fix is required to
ensure OVS doesn't overread the allocated memory, e.g.:

 ==49944==ERROR: AddressSanitizer: heap-buffer-overflow on address
 0x6060000d8181 at pc 0x000001cb9d24 bp 0x7ffce3b385d0 sp 0x7ffce3b385c8
 READ of size 64 at 0x6060000d8181 thread T0
    #0 0x1cb9d23 in mfex_avx512_process lib/dpif-netdev-extract-avx512.c:491:26
    #1 0x1cb9d23 in mfex_avx512_ip_udp lib/dpif-netdev-extract-avx512.c:625:1
    #2 0x18786a1 in dpif_miniflow_extract_autovalidator lib/dpif-netdev-private-extract.c:277:29
    #3 0x1cbca5c in dp_netdev_input_outer_avx512 lib/dpif-netdev-avx512.c:159:19
    #4 0x1853048 in dp_netdev_process_rxq_port lib/dpif-netdev.c:4900:19
    #5 0x1837c76 in dpif_netdev_run lib/dpif-netdev.c:6197:25
    #6 0x1727a02 in type_run ofproto/ofproto-dpif.c:370:9
    #7 0x16f6e07 in ofproto_type_run ofproto/ofproto.c:1778:31
    #8 0x16c1a8b in bridge_run__ vswitchd/bridge.c:3245:9
    #9 0x16bd2fd in bridge_run vswitchd/bridge.c:3310:5
    #10 0x16db8fe in main vswitchd/ovs-vswitchd.c:127:9
    #11 0x7fbc0c5b61a2 in __libc_start_main (/lib64/libc.so.6+0x271a2)
    #12 0xedabbd in _start (vswitchd/ovs-vswitchd+0xedabbd)

 0x6060000d8181 is located 9 bytes to the right of 56-byte
                region [0x6060000d8140,0x6060000d8178)
 allocated by thread T0 here:
    #0 0xf7b09f in malloc (vswitchd/ovs-vswitchd+0xf7b09f)
    #1 0x1aff3b9 in xmalloc__ lib/util.c:137:15
    #2 0x1aff3b9 in xmalloc lib/util.c:172:12
    #3 0x1afe211 in process_command lib/unixctl.c:310:13
    #4 0x1afe211 in run_connection lib/unixctl.c:344:17
    #5 0x1afe211 in unixctl_server_run lib/unixctl.c:395:21
    #6 0x16db918 in main vswitchd/ovs-vswitchd.c:128:9
    #7 0x7fbc0c5b61a2 in __libc_start_main (/lib64/libc.so.6+0x271a2)

The solution implemented uses a mask-to-zero if the available buffer
size is less than 64 bytes, and a branch for which type of load is used.
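
The idea of a zero-masked load can be illustrated without AVX512. The sketch below is a byte-level analogy in Python, not the intrinsics used in the patch; `BLOCK_SIZE` and `load_block` are hypothetical names.

```python
BLOCK_SIZE = 64  # bytes read per load in this sketch


def load_block(buf: bytes) -> bytes:
    """Return exactly BLOCK_SIZE bytes without reading past the end of buf."""
    if len(buf) >= BLOCK_SIZE:
        # Enough data is available: plain full-width load.
        return buf[:BLOCK_SIZE]
    # Undersized buffer: copy only the valid bytes and zero the rest.
    return buf + bytes(BLOCK_SIZE - len(buf))
```

The branch corresponds to choosing which type of load to use; the zero fill corresponds to the lanes that the mask leaves untouched.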

Fixes: 250ceddcc2 ("dpif-netdev/mfex: Add AVX512 based optimized miniflow extract")
Reported-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-01-12 13:59:33 +01:00
Sunil Pai G
8bc135d2d5 acinclude: Provide better error info when linking fails with DPDK.
Currently, on failure to link with DPDK, the configure script provides
an error message to update the PKG_CONFIG_PATH even though the cause of
failure was missing dependencies. Improve the error message to include this
scenario.

Signed-off-by: Sunil Pai G <sunil.pai.g@intel.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-01-12 13:11:29 +01:00
David Marchand
1140c87e2e netdev-dpdk: Expose per rxq/txq basic statistics.
When troubleshooting multiqueue setups, per-queue statistics help
check how packets are distributed across Rx and Tx queues.

Per queue statistics are exported by most DPDK drivers (with capability
RTE_ETH_DEV_AUTOFILL_QUEUE_XSTATS).
OVS only filters DPDK statistics, there is nothing to request in DPDK API.
So the only change is to extend the filter on xstats.

Querying statistics with
$ ovs-vsctl get interface dpdk0 statistics |
  sed -e 's#[{}]##g' -e 's#, #\n#g'

and comparing gives:
@@ -13,7 +13,12 @@
 rx_phy_crc_errors=0
 rx_phy_in_range_len_errors=0
 rx_phy_symbol_errors=0
+rx_q0_bytes=0
 rx_q0_errors=0
+rx_q0_packets=0
+rx_q1_bytes=0
 rx_q1_errors=0
+rx_q1_packets=0
 rx_wqe_errors=0
 tx_broadcast_packets=0
 tx_bytes=0
@@ -27,3 +32,13 @@
 tx_pp_rearm_queue_errors=0
 tx_pp_timestamp_future_errors=0
 tx_pp_timestamp_past_errors=0
+tx_q0_bytes=0
+tx_q0_packets=0
+tx_q1_bytes=0
+tx_q1_packets=0
+tx_q2_bytes=0
+tx_q2_packets=0
+tx_q3_bytes=0
+tx_q3_packets=0
+tx_q4_bytes=0
+tx_q4_packets=0
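
The same comparison can be scripted. The sketch below assumes the statistics column is printed as `{name=value, name=value}` with unquoted names, as the sed expression above implies; `parse_statistics` and `new_counters` are hypothetical helper names.

```python
def parse_statistics(raw: str) -> dict:
    """Turn '{a=1, b=2}' into {'a': 1, 'b': 2}, like the sed pipeline above."""
    stats = {}
    for item in raw.strip().strip("{}").split(", "):
        if item:
            name, _, value = item.partition("=")
            stats[name] = int(value)
    return stats


def new_counters(before: str, after: str) -> list:
    """Names present only in the second snapshot, sorted."""
    return sorted(set(parse_statistics(after)) - set(parse_statistics(before)))
```

Feeding it the output captured before and after the change lists the added per-queue counters directly.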

Signed-off-by: David Marchand <david.marchand@redhat.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Acked-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-01-12 11:40:16 +01:00
David Marchand
f260db1efc netdev-dpdk: Fix statistics when changing Rx/Tx queues count.
When changing number of Rx or Tx queues, per queue basic stats can be
renumbered in DPDK ethdev layer [1].

OVS maintains an internal xstats IDs cache that was refreshed when a
cached id was not valid anymore (in netdev_dpdk_get_custom_stats) or if
a new DPDK port was created.
This did not handle changes of Rx/Tx queues count.

For example, with a mlx5 port:
$ ovs-vsctl set interface dpdk0 options:n_rxq=2
$ ovs-vsctl get interface dpdk0 statistics |
  sed -e 's#[{}]##g' -e 's#, #\n#g' |
  grep rx_q._errors
rx_q0_errors=0

Move the cache filling after reconfiguring and starting the port.
There is no need to flush the cache in netdev_dpdk_get_custom_stats.

While at it, the xstats code can be cleaned up:
- remove wrong or Lapalissade comments,
- don't check x*alloc return value,
- expect that consecutive calls to xstats API return the same number of
  elements,
- only write to dev-> when all DPDK calls succeeded,
- add missing lock annotations to netdev_dpdk_clear_xstats and
  netdev_dpdk_get_xstat_name,

1: https://git.dpdk.org/dpdk/tree/lib/librte_ethdev/rte_ethdev.c?h=v20.11#n2696

Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2021-November/389456.html
Signed-off-by: David Marchand <david.marchand@redhat.com>
Acked-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-01-12 11:40:16 +01:00
David Marchand
b84386fa9a dpdk: Support running PMD threads on any core.
Previously in OVS, a PMD thread running on cpu X used lcore X.
This assumption limited OVS to run PMD threads on physical cpu <
RTE_MAX_LCORE.

DPDK 20.08 introduced a new API that associates a non-EAL thread to a free
lcore. This new API does not change the thread characteristics (like CPU
affinity) and lets OVS run its PMD threads on any cpu regardless of
RTE_MAX_LCORE.

The DPDK multiprocess feature is not compatible with this new API and is
disabled.

DPDK still limits the number of lcores to RTE_MAX_LCORE (128 on x86_64)
which should be enough for OVS pmd threads (hopefully).

DPDK lcore/OVS PMD thread mappings are logged when attaching an OVS PMD
thread and when detaching it.
A new command is added to help get DPDK point of view of the DPDK lcores
at any time:

$ ovs-appctl dpdk/lcore-list
lcore 0, socket 0, role RTE, cpuset 0
lcore 1, socket 0, role NON_EAL, cpuset 1
lcore 2, socket 0, role NON_EAL, cpuset 15

Signed-off-by: David Marchand <david.marchand@redhat.com>
Acked-by: Kevin Traynor <ktraynor@redhat.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-01-11 21:34:56 +01:00
Ilya Maximets
356f362068 tests/oss-fuzz: Fix the arguments of parse_tcp_flags.
tests/oss-fuzz/flow_extract_target.c:59:53:
     error: too few arguments to function call, expected 4, have 1
         uint16_t tcp_flags = parse_tcp_flags(&packet);
                              ~~~~~~~~~~~~~~~        ^

Fixes: e7e9973b80 ("dpif-netdev: Forwarding optimization for flows with a simple match.")
Reported-at: https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=43498
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Aaron Conole <aconole@redhat.com>
2022-01-10 23:31:29 +01:00
Ilya Maximets
ddca1eb3ab odp-util: Stop action list parsing if already oversized.
The fuzzing target times out if the action list is too big.  And we
don't really need to fully parse all the actions just to say that they
are too big in the end.  So, check early and exit.

This is a pure performance optimization, so not adding a unit test.

All other code paths during the parsing are using E2BIG and not EFBIG
for similar conditions, so using it here too.
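
A minimal sketch of the early-exit check, assuming a hypothetical size limit and a hypothetical `parse_actions` helper (the real parser is C code in odp-util and works on text, not on byte strings):

```python
import errno

MAX_ACTIONS_SIZE = 64  # hypothetical limit in bytes, chosen for the example


def parse_actions(actions, max_size=MAX_ACTIONS_SIZE):
    """Return (0, parsed) on success or (errno.E2BIG, None) if oversized."""
    parsed = []
    size = 0
    for action in actions:
        if size > max_size:
            # Already too big: stop instead of parsing the rest.
            return errno.E2BIG, None
        parsed.append(action)
        size += len(action)
    if size > max_size:
        return errno.E2BIG, None
    return 0, parsed
```

The result is the same as checking only at the end; the loop just stops doing work once the outcome is known.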

Reported-at: https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=39670
Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-01-07 23:48:20 +01:00
Sriharsha Basavapatna
6e50c16518 dpif-netdev: Avoid hw_miss_packet_recover() for devices with no support.
The hw_miss_packet_recover() API results in performance degradation, for
ports that are either not offload capable or do not support this specific
offload API.

For example, in the test configuration shown below, the vhost-user port
does not support offloads and the VF port doesn't support hw_miss offload
API. But because tunnel offload needs to be configured in other bridges
(br-vxlan and br-phy), OVS has been built with -DALLOW_EXPERIMENTAL_API.

    br-vhost            br-vxlan            br-phy
vhost-user<-->VF    VF-Rep<-->VxLAN       uplink-port

For every packet between the VF and the vhost-user ports, hw_miss API is
called even though it is not supported by the ports involved. This leads
to significant performance drop (~3x in some cases; both cycles and pps).

Return EOPNOTSUPP when this API fails for a device that doesn't support it
and avoid this API on that port for subsequent packets.
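
The "remember that it is unsupported" pattern can be sketched like this. The `Port` class and its `recover` callback are hypothetical stand-ins for the netdev and the offload API, not the structures used in dpif-netdev.

```python
import errno


class Port:
    """Hypothetical port wrapper; `recover` stands in for the offload API."""

    def __init__(self, recover):
        self._recover = recover
        self.hw_miss_supported = True  # optimistic until proven otherwise
        self.calls = 0                 # how often the API was really called

    def hw_miss_packet_recover(self, packet):
        if not self.hw_miss_supported:
            return errno.EOPNOTSUPP  # skip the API call entirely
        self.calls += 1
        err = self._recover(packet)
        if err == errno.EOPNOTSUPP:
            # Remember the answer so later packets avoid the call.
            self.hw_miss_supported = False
        return err
```

After the first EOPNOTSUPP, subsequent packets on that port cost a single flag check.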

Signed-off-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-01-07 20:53:20 +01:00
Ilya Maximets
e7e9973b80 dpif-netdev: Forwarding optimization for flows with a simple match.
There are cases where users might want simple forwarding or drop rules
for all packets received from a specific port, e.g ::

  "in_port=1,actions=2"
  "in_port=2,actions=IN_PORT"
  "in_port=3,vlan_tci=0x1234/0x1fff,actions=drop"
  "in_port=4,actions=push_vlan:0x8100,set_field:4196->vlan_vid,output:3"

There are also cases where complex OpenFlow rules can be simplified
down to datapath flows with very simple match criteria.

In theory, for very simple forwarding, OVS doesn't need to parse
packets at all in order to follow these rules.  "Simple match" lookup
optimization is intended to speed up packet forwarding in these cases.

Design:

Due to various implementation constraints, the userspace datapath always
has the following flow fields in exact match (i.e. it's required to
match at least these fields of a packet even if the OF rule doesn't
need that):

  - recirc_id
  - in_port
  - packet_type
  - dl_type
  - vlan_tci (CFI + VID) - in most cases
  - nw_frag - for ip packets

Not all of these fields are related to packet itself.  We already
know the current 'recirc_id' and the 'in_port' before starting the
packet processing.  It also seems safe to assume that we're working
with Ethernet packets.  So, for the simple OF rule we need to match
only on 'dl_type', 'vlan_tci' and 'nw_frag'.

'in_port', 'dl_type', 'nw_frag' and 13 bits of 'vlan_tci' can be
combined in a single 64bit integer (mark) that can be used as a
hash in hash map.  We are using only VID and CFI from the 'vlan_tci',
flows that need to match on PCP will not qualify for the optimization.
Workaround for matching on non-existence of vlan updated to match on
CFI and VID only in order to qualify for the optimization.  CFI is
always set by OVS if vlan is present in a packet, so there is no need
to match on PCP in this case.  'nw_frag' takes 2 bits of PCP inside
the simple match mark.
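
One way to picture the mark is as bit packing. The layout below (in_port in the upper 32 bits, then dl_type, then 16 bits holding nw_frag, CFI and VID) is an assumption made for this sketch and is not necessarily the layout used in dpif-netdev; `simple_match_mark` is a hypothetical Python helper.

```python
VLAN_VID_MASK = 0x0FFF
VLAN_CFI = 0x1000


def simple_match_mark(in_port, dl_type, nw_frag, vlan_tci):
    """Pack the four fields into one 64-bit integer (layout is assumed)."""
    tci13 = vlan_tci & (VLAN_VID_MASK | VLAN_CFI)  # keep CFI + VID, drop PCP
    low16 = ((nw_frag & 0x3) << 13) | tci13        # nw_frag reuses 2 PCP bits
    return ((in_port & 0xFFFFFFFF) << 32) | ((dl_type & 0xFFFF) << 16) | low16
```

Because the PCP bits are masked out before packing, two packets that differ only in PCP produce the same mark, which is why flows matching on PCP cannot use the optimization.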

New per-PMD flow table 'simple_match_table' introduced to store
simple match flows only.  'dp_netdev_flow_add' adds flow to the
usual 'flow_table' and to the 'simple_match_table' if the flow
meets following constraints:

  - 'recirc_id' in flow match is 0.
  - 'packet_type' in flow match is Ethernet.
  - Flow wildcards contains only minimal set of non-wildcarded fields
    (listed above).

If the number of flows for current 'in_port' in a regular 'flow_table'
equals number of flows for current 'in_port' in a 'simple_match_table',
we may use simple match optimization, because all the flows we have
are simple match flows.  This means that we only need to parse
'dl_type', 'vlan_tci' and 'nw_frag' to perform packet matching.
Now we make the unique flow mark from the 'in_port', 'dl_type',
'nw_frag' and 'vlan_tci' and look for it in the 'simple_match_table'.
On successful lookup we don't need to run full 'miniflow_extract()'.
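
The per-port eligibility rule can be modelled with two counters. This is a toy model; the class `PmdTables` and its methods are hypothetical and only mirror the counting argument above.

```python
class PmdTables:
    """Toy model of the per-PMD tables; all names are hypothetical."""

    def __init__(self):
        self.flow_count = {}          # in_port -> flows in 'flow_table'
        self.simple_count = {}        # in_port -> flows in 'simple_match_table'
        self.simple_match_table = {}  # mark -> flow

    def add_flow(self, in_port, flow, mark=None):
        """Add a flow; pass a mark only if the flow is a simple match flow."""
        self.flow_count[in_port] = self.flow_count.get(in_port, 0) + 1
        if mark is not None:
            self.simple_count[in_port] = self.simple_count.get(in_port, 0) + 1
            self.simple_match_table[mark] = flow

    def simple_match_enabled(self, in_port):
        """True when every flow for in_port is also a simple match flow."""
        total = self.flow_count.get(in_port, 0)
        return total == self.simple_count.get(in_port, 0)
```

As soon as one non-simple flow is installed for a port, the counts differ and that port goes back to the regular lookup path.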

Unsuccessful lookup technically means that we have no suitable flow
in the datapath and upcall will be required.  So, in this case EMC and
SMC lookups are disabled.  We may optimize this path in the future by
bypassing the dpcls lookup too.

Performance improvement of this solution on 'simple match' flows
should be comparable with partial HW offloading, because it parses the same
packet fields and uses a similar flow lookup scheme.
However, unlike partial HW offloading, it works for all port types
including virtual ones.

Performance results when compared to EMC:

Test setup:

             virtio-user   OVS    virtio-user
  Testpmd1  ------------>  pmd1  ------------>  Testpmd2
  (txonly)       x<------  pmd2  <------------ (mac swap)

Single stream of 64byte packets.  Actions:
  in_port=vhost0,actions=vhost1
  in_port=vhost1,actions=vhost0

Stats collected from pmd1 and pmd2, so there are 2 scenarios:
Virt-to-Virt   :     Testpmd1 ------> pmd1 ------> Testpmd2.
Virt-to-NoCopy :     Testpmd2 ------> pmd2 --->x   Testpmd1.
Here the packet sent from pmd2 to Testpmd1 is always dropped, because
the virtqueue is full since Testpmd1 is in txonly mode and doesn't
receive any packets.  This should be closer to the performance of a
VM-to-Phy scenario.

Test performed on machine with Intel Xeon CPU E5-2690 v4 @ 2.60GHz.
Table below represents improvement in throughput when compared to EMC.

 +----------------+------------------------+------------------------+
 |                |    Default (-g -O2)    | "-Ofast -march=native" |
 |   Scenario     +------------+-----------+------------+-----------+
 |                |     GCC    |   Clang   |     GCC    |   Clang   |
 +----------------+------------+-----------+------------+-----------+
 | Virt-to-Virt   |    +18.9%  |   +25.5%  |    +10.8%  |   +16.7%  |
 | Virt-to-NoCopy |    +24.3%  |   +33.7%  |    +14.9%  |   +22.0%  |
 +----------------+------------+-----------+------------+-----------+

For Phy-to-Phy case performance improvement should be even higher, but
it's not the main use-case for this functionality.  Performance
difference for the non-simple flows is within a margin of error.

Acked-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-01-07 20:32:20 +01:00
Terry Wilson
46d44cf3be python: idl: Add monitor_cond_since support.
Add support for monitor_cond_since / update3 to python-ovs to
allow more efficient reconnections when connecting to clustered
OVSDB servers.

Signed-off-by: Terry Wilson <twilson@redhat.com>
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-01-06 16:45:56 +01:00
Mike Pattrick
0d1ffb7756 checkpatch: Detect "trojan source" attack.
Recently there has been a lot of press about the "trojan source" attack,
where Unicode characters are used to obfuscate the true functionality of
code. This attack didn't affect OVS, but adding the check here will help
guard against it sneaking in later.
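
A check of this kind can be sketched by scanning each line for Unicode bidirectional control characters. The helper below is a hypothetical illustration, not the code added to checkpatch.

```python
# Bidirectional control characters abused by the attack
# (U+202A..U+202E and U+2066..U+2069).
BIDI_CONTROLS = frozenset(
    [chr(c) for c in range(0x202A, 0x202F)]
    + [chr(c) for c in range(0x2066, 0x206A)]
)


def has_bidi_controls(line: str) -> bool:
    """Report whether a source line contains any bidi control character."""
    return any(ch in BIDI_CONTROLS for ch in line)
```

Flagging every such character is deliberately strict, since ordinary C source has little reason to contain them.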

Signed-off-by: Mike Pattrick <mkp@redhat.com>
Acked-by: Gaetan Rivet <grive@u256.net>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-01-04 19:14:11 +01:00
Mike Pattrick
428b11caa7 utilities: Add another GDB macro for ovs-vswitchd.
This commit adds a basic packet metadata macro to the already existing
macros in ovs_gdb.py, ovs_dump_packets will print out information about
one or more packets. It feeds packets into tcpdump, and the user can
pass in tcpdump options to modify how packets are parsed or even write
out packets to a pcap file.

Example usage:
(gdb) break fast_path_processing
(gdb) commands
 ovs_dump_packets packets_
 continue
 end
(gdb) continue

Thread 1 "ovs-vswitchd" hit Breakpoint 2, fast_path_processing ...
12:01:05.962485 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has
    10.1.1.1 tell 10.1.1.2, length 28

Thread 1 "ovs-vswitchd" hit Breakpoint 1, fast_path_processing ...
12:01:05.981214 ARP, Ethernet (len 6), IPv4 (len 4), Reply 10.1.1.1
    is-at a6:0f:c3:f0:5f:bd (oui Unknown), length 28

Signed-off-by: Mike Pattrick <mkp@redhat.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-01-04 19:14:11 +01:00
Frode Nordahl
2f2ae5b6bd tests: Fix endianness in netlink policy test fixtures.
The netlink policy unit test contains test fixture data that is
subject to endianness and currently fails on big endian systems.

Store the fixture data in a struct to ensure proper byte order for
the header data.
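
The underlying problem can be shown with a netlink attribute header, whose two 16-bit fields are in host byte order. This is a Python analogy using the `struct` module, not the C fixture from the test; `nlattr_header` and `LITTLE_ENDIAN_FIXTURE` are hypothetical names.

```python
import struct
import sys


def nlattr_header(nla_len: int, nla_type: int) -> bytes:
    """Pack a 4-byte netlink attribute header in host byte order."""
    return struct.pack("=HH", nla_len, nla_type)


# Hard-coded bytes like these are only correct on little-endian hosts.
LITTLE_ENDIAN_FIXTURE = b"\x08\x00\x01\x00"

matches_host = nlattr_header(8, 1) == LITTLE_ENDIAN_FIXTURE
```

Building the header from typed fields gives the right bytes on either kind of host, whereas the literal byte string matches only when `sys.byteorder` is "little".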

Also fix improper style for sizeof with expressions.

Fixes: bfee9f6c01 ("netlink: Add support for parsing link layer address.")
Signed-off-by: Frode Nordahl <frode.nordahl@canonical.com>
Acked-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-01-04 19:14:11 +01:00
Eli Britstein
0b6d2faace ci: Remove -Wno-cast-align from CI.
Following [1]-[3] in DPDK, there are no more such warnings from DPDK.
Remove ignoring them if they occur.

GitHub actions:
v1: https://github.com/elibritstein/OVS/actions/runs/1540651133

[1] a3f8d0587188 ("net: avoid cast-align warning in VLAN insert function")
[2] da0333c8790b ("mbuf: avoid cast-align warning in data offset macro")
[3] 6de430b7079e ("eal/x86: avoid cast-align warning in memcpy functions")

Signed-off-by: Eli Britstein <elibr@nvidia.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-01-04 19:14:11 +01:00
Mike Pattrick
d652fc6a5a checkpatch: Correct line count in error messages.
As part of some previous checkpatch work, we discovered that checkpatch
isn't always reporting correct line numbers. As it turns out, Python's
splitlines function considers several characters to be new lines which
common text editors do not typically consider to be new lines. For
example, form feed characters, which this code base uses to cluster
functionality.
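
The effect is easy to reproduce. In the sketch below, `line_number` is a hypothetical helper; the point is only that `str.splitlines()` treats a form feed as a line boundary while splitting on "\n" does not.

```python
text = "first\n\x0csecond\nthird\n"

# str.splitlines() also breaks on the form feed (\x0c), so counts drift.
via_splitlines = text.splitlines()

# Splitting on "\n" only matches what most editors display.
via_newline = text.split("\n")


def line_number(lines, needle):
    """1-based index of the first line containing needle."""
    for number, line in enumerate(lines, start=1):
        if needle in line:
            return number
    return None
```

Every form feed before a given line shifts its reported number by one when `splitlines()` is used.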

Signed-off-by: Mike Pattrick <mkp@redhat.com>
Acked-by: Paolo Valerio <pvalerio@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-01-04 17:23:30 +01:00
Ilya Maximets
28ef2535c1 dpif-netdev-extract: Change availability log level to DBG.
Availability logs are not essential for a normal run.  The same
information can be obtained via appctl in runtime.  They also can not
show if particular implementation will actually be used or not, hence
not useful for post-crash investigations.  Moving to DBG level to avoid
bulky unnecessary logging.

Additionally making them a bit more readable.

Acked-by: Kumar Amber <kumar.amber@intel.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-01-04 17:19:18 +01:00
Ilya Maximets
38c53dd17d AUTHORS: Add Nobuhiro MIKI.
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-01-04 17:17:28 +01:00
Nobuhiro MIKI
9a834205a4 docs: afxdp: Remove duplicated lines.
Signed-off-by: Nobuhiro MIKI <nmiki@yahoo-corp.jp>
Acked-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-01-04 17:16:03 +01:00
David Marchand
d446dcb7e0 system-dpdk: Refactor common logs matching.
Move EAL logs and commonly ignored logs to a common macro.
Remove/update obsolete ones (like i40e [1], timer [2], EAL [3][4] logs).
Set log level for DPDK drivers to error only: the rationale is that we are
not testing DPDK drivers in system-dpdk.
Extend regex on hugepage logs since a check on hugepages availability is
already present on OVS side, and as a consequence, we don't care about
the warnings on availability for certain hugepage size.
Add logs checks for MFEX tests that were missing them.

1: https://git.dpdk.org/dpdk/commit/?id=a075ce2b3e8c
2: https://git.dpdk.org/dpdk/commit/?id=c1077933d45b
3: https://git.dpdk.org/dpdk/commit/?id=e9b3d79b0696
4: https://git.dpdk.org/dpdk/commit/?id=c69150679891

Signed-off-by: David Marchand <david.marchand@redhat.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-01-03 19:47:43 +01:00
David Marchand
b366fa2f49 dpif-netdev: Call cpuid for x86 isa availability.
DPIF AVX512 optimizations currently rely on DPDK availability while
they can be used without DPDK.
Besides, checking for availability of some isa only has to be done once
and won't change while an OVS process runs.

Resolve isa availability in constructors by using a simplified query
based on cpuid API that comes from the compiler.

Note: this also fixes the check on BMI2 availability: DPDK had a bug
for this isa, see https://git.dpdk.org/dpdk/commit/?id=aae3037ab1e0.

Suggested-by: Ilya Maximets <i.maximets@ovn.org>
Signed-off-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-01-03 18:45:40 +01:00
Ilya Maximets
11441385c2 bridge: Fix incorrect configuration of netdev's dpif type.
netdev_set_dpif_type() can only be used with a normalized dpif type
as an argument, which is a constant static string derived from a type
of a dpif_class or a constant string "system".  Using the same
constant string allows the netdev-offload module to compare types by
simply comparing pointers.

OTOH, 'br->ofproto->type' is a dynamic string that:
a. Can be NULL.
b. Even if not NULL and equal, can be a different dynamically
   allocated string.

Both these qualities break assumptions made by all other modules
related to HW offload, breaking the functionality.

Fix that by moving netdev_set_dpif_type() to dpif.c and calling with
a correct constant string as an argument.

The call is moved from bridge.c to dpif.c because it needs access
to the dpif class, which bridge.c should not have.

Not trying to set the dpif_type inside the netdev_ports_insert(),
because it's used now outside the offloading context.  So, it's
cleaner to move the netdev_set_dpif_type() call outside of the
netdev-offload module.

Additionally removed the redundant call from the netdev_ports_insert()
and refactored the function, since it doesn't need an extra argument
anymore.

Fixes: 4f19a78a61 ("netdev-vport: Fix userspace tunnel ioctl(SIOCGIFINDEX) info logs.")
Reported-by: Roi Dayan <roid@nvidia.com>
Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2021-December/390117.html
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Reviewed-by: Lin Huang <linhuang@ruijie.com.cn>
Acked-by: Roi Dayan <roid@nvidia.com>
2021-12-17 21:31:55 +01:00
Paolo Valerio
ec2aa2ab46 ofproto-dpif-xlate: Snoop ingress packets and update neigh cache if needed.
In case of native tunnel with bfd enabled, if the MAC address of the
remote end's interface changes (e.g. because it got rebooted, and the
MAC address is allocated dynamically), the BFD session will never be
re-established.

This happens because the local tunnel neigh entry doesn't get updated,
and the local end keeps sending BFD packets with the old destination
MAC address. This was not an issue until
b23ddcc57d ("tnl-neigh-cache: tighten arp and nd snooping.")
because ARP requests were snooped as well avoiding the problem.

Fix this by snooping the incoming packets in the slow path, and
updating the neigh cache accordingly.

Fixes: b23ddcc57d ("tnl-neigh-cache: tighten arp and nd snooping.")
Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=2002430
Signed-off-by: Paolo Valerio <pvalerio@redhat.com>
Acked-by: Gaetan Rivet <grive@u256.net>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2021-12-17 20:32:13 +01:00
Paolo Valerio
b723b93200 tnl-neigh-cache: Do not refresh the entry while revalidating.
This is a minor issue but visible e.g. when you try to flush the neigh
cache while the ARP flow is still present in the datapath, triggering
the revalidation of the datapath flows which subsequently
refreshes/adds the entry in the cache.

Signed-off-by: Paolo Valerio <pvalerio@redhat.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Acked-by: Gaetan Rivet <grive@u256.net>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2021-12-17 20:32:05 +01:00
Paolo Valerio
02f95638a4 tnl-neigh-cache: Add tnl/neigh/aging command.
With this command it is now possible to change the aging time of the
cache entries.

For the existing entries the aging time is updated only if the
current expiration is greater than the new one. In any case, the next
refresh will set it to the new value.
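
The update rule for existing entries can be sketched as follows, assuming each entry stores an absolute expiration time. `apply_new_aging` and the dictionary layout are hypothetical, chosen only to illustrate the rule.

```python
def apply_new_aging(entries, new_aging, now):
    """Shorten expirations that lie beyond now + new_aging; leave others."""
    limit = now + new_aging
    for entry in entries:
        if entry["expires"] > limit:
            entry["expires"] = limit
    return entries
```

Entries that would expire sooner than the new limit keep their current expiration until their next refresh.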

This is intended mostly for debugging purposes.

Signed-off-by: Paolo Valerio <pvalerio@redhat.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Acked-by: Gaetan Rivet <grive@u256.net>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2021-12-17 20:31:57 +01:00
Paolo Valerio
f527aef147 tnl-neigh-cache: Read/write expires atomically.
Expires is modified in different threads (revalidator, pmd-rx, bfd-tx).
It's better to use atomics for such potentially parallel write.

Signed-off-by: Paolo Valerio <pvalerio@redhat.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Acked-by: Gaetan Rivet <grive@u256.net>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2021-12-17 20:31:14 +01:00
Harry van Haaren
f0266292b7 dpif-netdev: Improve handling of IP/TCP in avx512 mfex.
This commit tightens the requirements for processing TCP packets
in AVX512, ensuring that there are no TCP options by validating that
the "data offset" field of the TCP header is exactly equal to 5.
This ensures that the TCP header is not too short, and that it does
not contain extra options.

On the IP handling side, improve checks around total packet length.
Now the next protocol is included in the length checks, ensuring that
the IP header reported length is of appropriate size to contain the
next protocol (e.g. UDP requires 8 bytes, TCP requires 20). Note that
the inner protocol is always of a fixed size per profile, so it can be
set using the UDP_ and TCP_ HEADER_LEN defines.
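
The checks described above can be written out in scalar form. This is a sketch, not the AVX512 code: `accepts_packet` is a hypothetical helper, the constants are standard header sizes and protocol numbers, and accepting only IPv4 headers without IP options is an assumption of the sketch.

```python
import struct

IP_HEADER_LEN = 20
UDP_HEADER_LEN = 8
TCP_HEADER_LEN = 20
IPPROTO_TCP = 6
IPPROTO_UDP = 17


def accepts_packet(ip: bytes) -> bool:
    """Check a packet, starting at its IPv4 header, as described above."""
    if len(ip) < IP_HEADER_LEN or ip[0] != 0x45:
        return False  # this sketch only accepts IPv4 without IP options
    total_len = struct.unpack("!H", ip[2:4])[0]
    proto = ip[9]
    if proto == IPPROTO_UDP:
        return total_len >= IP_HEADER_LEN + UDP_HEADER_LEN
    if proto == IPPROTO_TCP:
        if total_len < IP_HEADER_LEN + TCP_HEADER_LEN:
            return False
        if len(ip) < IP_HEADER_LEN + TCP_HEADER_LEN:
            return False
        data_offset = ip[IP_HEADER_LEN + 12] >> 4
        return data_offset == 5  # exactly 20 bytes, so no TCP options
    return False
```

Packets rejected by checks like these are not lost; they are simply left to a more general extraction path.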

Fixes: 250ceddcc2 ("dpif-netdev/mfex: Add AVX512 based optimized miniflow extract")
Reported-by: Ilya Maximets <i.maximets@ovn.org>
Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2021-12-17 19:50:55 +01:00
Ilya Maximets
893693e808 AUTHORS: Add Nir Anteby.
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2021-12-16 22:48:01 +01:00
Nir Anteby
7617d0583c netdev-offload-dpdk: Add support for matching on gre fields.
Add parsing gre match fields.

Signed-off-by: Nir Anteby <nanteby@nvidia.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Tested-by: Emma Finn <emma.finn@intel.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2021-12-16 12:03:13 +01:00
Nir Anteby
5f60741dcf netdev-offload-dpdk: Support tnl_pop for gre tunnel.
Add support for tnl_pop action for gre vport.

Signed-off-by: Nir Anteby <nanteby@nvidia.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Tested-by: Emma Finn <emma.finn@intel.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2021-12-16 12:03:07 +01:00
Nir Anteby
a32cb78b5a netdev-dpdk: Add flow_api support for netdev gre vports.
Add the acceptance of GRE devices to netdev_dpdk_flow_api_supported() API,
to allow offloading of DPDK GRE devices.

Signed-off-by: Nir Anteby <nanteby@nvidia.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Tested-by: Emma Finn <emma.finn@intel.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2021-12-16 12:03:02 +01:00
Nir Anteby
8279041460 netdev-offload-dpdk: Refactor get_vport_netdev().
Refactor the function as a pre-step towards supporting more tunnel
types.

Signed-off-by: Nir Anteby <nanteby@nvidia.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Tested-by: Emma Finn <emma.finn@intel.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2021-12-16 12:02:15 +01:00
Alin-Gabriel Serdean
76527525e0 AUTHORS: Update email for Alin Serdean.
Signed-off-by: Alin-Gabriel Serdean <aserdean@ovn.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2021-12-16 11:48:12 +01:00
Joe Stringer
38b42aa93f MAINTAINERS: Move Joe to emeritus status.
My primary focus has been in adjacent communities for some time now.
It seems appropriate that my status in the OVS maintainers file should
reflect the level of attention I am able to provide to the project.
Thanks to the other contributors past and present for the experiences
we've shared, and I'll see you around wherever our paths cross again :-)

Signed-off-by: Joe Stringer <joe@ovn.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2021-12-15 20:39:30 +01:00
Rosemarie O'Riorden
269b927fd7 dpdk: Use --in-memory by default.
If anonymous memory mapping is supported by the kernel, it's better
to run OVS entirely in memory rather than creating shared data
structures. OVS doesn't work in multi-process mode, so there is no need
to litter a filesystem.

Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=1949849
Acked-by: David Marchand <david.marchand@redhat.com>
Acked-by: Ian Stokes <ian.stokes@intel.com>
Signed-off-by: Rosemarie O'Riorden <roriorde@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2021-12-15 18:22:51 +01:00
David Marchand
b5d2dbdbb5 system-dpdk: Fix race in vhost-user tests.
Waiting only on the vhost user port to be ready is not enough since a
tap is also initialized by testpmd and is used to inject/receive packets
in/from the kernel.
Wait on the tap link status.

Fixes: 18db7ec5eb ("system-dpdk: Improve vhost-user ping tests reliability.")
Signed-off-by: David Marchand <david.marchand@redhat.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2021-12-15 18:20:01 +01:00
Ilya Maximets
9827312fa4 docs: Re-work the documentation around CPU ISA optimizations.
A few problems with the current documentation:

1. bridge.rst is the high-level documentation for the end user.
   Unit testing and complex implementation details are for developers,
   hence should not be there.  Testing instructions for developers
   should be in testing.rst.  Words in the doc should be understandable
   for the user who doesn't know OVS internals.

2. Some paragraphs in the current documentation are repeating each
   other almost to the word.

3. Some paragraphs are incorrectly formatted.  That affects the
   rendering.

4. There is no point describing every separate test of a system-dpdk
   testsuite.

What is done:

1. All the testing related paragraphs are consolidated and moved
   to the testing.rst.

2. Most abbreviations are replaced with words that are more readable
   and understandable for the end user.

3. I failed to understand the meaning or purpose of several sentences,
   so they were simply deleted.

4. Fixed formatting and a few typos along the way.

IMO, some parts of the doc still need some re-wording, but this change
provides at least a starting point for improvement by setting a better
structure for the document.

Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Reviewed-by: David Marchand <david.marchand@redhat.com>
2021-12-15 10:41:37 +01:00
Ilya Maximets
ed9778e94f dpif-netdev: Fix the autovalidator output for the miniflow extract.
The autovalidator uses an incorrect block count while printing the
miniflow buffer from a tested implementation.  This results in
not printing fields that were incorrectly added to the miniflow
or printing more of a buffer even if not needed.

Fix that by requesting and using the correct block count.

Also fixed the output formatting issues: extra spaces, characters,
unclear relations between names and numbers due to mixed up
delimiters, '%u' used for uint16_t, '\t' in the output, etc.

Fixes: dd3f5d86d9 ("dpif-netdev: Add auto validation function for miniflow extract")
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Kumar Amber <kumar.amber@intel.com>
2021-12-15 10:20:42 +01:00
Ilya Maximets
339f97044e ovsdb: storage: Randomize should_snapshot checks when the minimum time passed.
Snapshots are scheduled for every 10-20 minutes.  It's a random value
in this interval for each server.  Once the time is up, but the maximum
time (24 hours) is not reached yet, ovsdb will start checking if the log
grew a lot on every iteration.  Once the growth is detected, compaction
is triggered.

OTOH, it's very common for an OVSDB cluster to not have the log growing
very fast.  If the log didn't grow 2x in 20 minutes, the randomness of
the initial scheduled time is gone and all the servers are checking if
they need to create a snapshot on every iteration.  And since all of them
are part of the same cluster, their logs are growing at the same
speed.  Once the critical mass is reached, all the servers will start
creating snapshots at the same time.  If the database is big enough,
that might leave the cluster unresponsive for an extended period of
time (e.g. 10-15 seconds for OVN_Southbound database in a larger scale
OVN deployment) until the compaction completes.

Fix that by re-scheduling a quick retry if the minimal time already
passed.  Effectively, this will work as a randomized 1-2 min delay
between checks, so the servers will not synchronize.

The scheduling function is updated to not change the upper limit on quick
reschedules to avoid delaying the snapshot creation indefinitely.
Currently quick re-schedules are only used for the error cases, and
there is always a 'slow' re-schedule after the successful compaction.
So, the change of a scheduling function doesn't change the current
behavior much.
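
The scheduling described above can be sketched as follows.  The struct,
function and parameter names are hypothetical, and the random value is
passed in by the caller to keep the example deterministic.  The intervals
(10-20 minutes, 1-2 minutes, 24 hours) come from the text above:

```c
#include <stdbool.h>

/* Hypothetical per-server snapshot schedule, all times in milliseconds. */
struct snapshot_sched {
    long long next_min_ms;   /* Earliest time to consider a snapshot. */
    long long next_max_ms;   /* Time after which a snapshot is forced. */
};

static void
schedule_next(struct snapshot_sched *s, long long now_ms, bool quick,
              unsigned int rnd)
{
    long long base = 10 * 60 * 1000;    /* 10 minutes. */
    long long range = 10 * 60 * 1000;   /* Plus up to 10 more minutes. */

    if (quick) {
        /* Quick retry: a randomized 1-2 minute delay. */
        base /= 10;
        range /= 10;
    }
    s->next_min_ms = now_ms + base + rnd % range;

    if (!quick) {
        /* Only a regular reschedule moves the upper limit, so repeated
         * quick retries cannot delay the snapshot indefinitely. */
        s->next_max_ms = now_ms + 24LL * 60 * 60 * 1000;
    }
}
```

Because each server draws its own random delay on every quick retry, the
checks on different servers are assumed to drift apart instead of lining
up on the same iteration.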

Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Han Zhou <hzhou@ovn.org>
Acked-by: Dumitru Ceara <dceara@redhat.com>
2021-12-13 21:54:45 +01:00
Dumitru Ceara
bf07cc9cdb raft: Only allow followers to snapshot.
Commit 3c2d6274bc ("raft: Transfer leadership before creating
snapshots.") made it such that raft leaders transfer leadership before
snapshotting.  However, there's still the case when the next leader to
be is in the process of snapshotting.  To avoid delays in that case too,
we now explicitly allow snapshots only on followers.  Cluster members
will have to wait until the current election is settled before
snapshotting.

Given the following logs taken from an OVN_Southbound 3-server cluster
during a scale test:

S1 (old leader):
  19:07:51.226Z|raft|INFO|Transferring leadership to write a snapshot.
  19:08:03.830Z|ovsdb|INFO|OVN_Southbound: Database compaction took 12601ms
  19:08:03.940Z|raft|INFO|server 8b8d is leader for term 43

S2 (follower):
  19:08:00.870Z|raft|INFO|server 8b8d is leader for term 43

S3 (new leader):
  19:07:51.242Z|raft|INFO|received leadership transfer from f5c9 in term 42
  19:07:51.244Z|raft|INFO|term 43: starting election
  19:08:00.805Z|ovsdb|INFO|OVN_Southbound: Database compaction took 9559ms
  19:08:00.869Z|raft|INFO|term 43: elected leader by 2+ of 3 servers

We see that the leader to be (S3) receives the leadership transfer,
initiates the election and immediately after starts a snapshot that
takes ~9.5 seconds.  During this time, S2 votes for S3 electing it
as cluster leader but S3 doesn't effectively become leader until it
finishes snapshotting, essentially keeping the cluster without a
leader for up to ~9.5 seconds.

With the current change, S3 will delay compaction and snapshotting until
the election is finished.

The only exception is the case of single-node clusters for which we
allow the node to snapshot regardless of role.
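
The resulting rule can be sketched as a small predicate.  The enum and
function names are hypothetical and only restate the condition described
above, not the actual raft code:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical role of a server within the cluster. */
enum server_role { ROLE_FOLLOWER, ROLE_CANDIDATE, ROLE_LEADER };

/* A server may snapshot only while it is a follower, unless it is the
 * only member of the cluster. */
static bool
may_snapshot(enum server_role role, size_t n_servers)
{
    return n_servers == 1 || role == ROLE_FOLLOWER;
}
```

In the logs above, S3 would be a candidate while the election for term 43
is in progress, so under this rule its compaction would be postponed.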

Acked-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2021-12-13 21:52:59 +01:00
Ilya Maximets
20a4f546f7 dpif-netdev: Use PMD context to get the port for HW miss recovery.
Last RX queue, from which the packet got received, is already stored
in the PMD context.  So, we can get the netdev from it without the
expensive hash map lookup.

In my V2V testing this patch improves performance in case HW offload
and experimental APIs are enabled by about 3%.  That narrows down the
performance difference with the case with experimental API disabled
to about 0.5%, which is way within a margin of error for that setup.

Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Eli Britstein <elibr@nvidia.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2021-12-09 22:46:47 +01:00
Ian Stokes
17346b3899 dpdk: Update to use DPDK v21.11.
This commit adds support for DPDK v21.11.  It includes the following
changes:

1. ci: Install python elftools for DPDK 21.02.
2. ci: Update meson requirement for DPDK 21.05.
3. netdev-dpdk: Fix build with 21.05.
4. ci: Compile DPDK in non developer mode.

   http://patchwork.ozlabs.org/project/openvswitch/list/?series=242480&state=*

5. netdev-dpdk: Remove access to DPDK internals.
6. netdev-dpdk: Remove unused attribute from rte_flow rule.
7. netdev-dpdk: Fix mbuf macros namespace with 21.11-rc1.
8. netdev-dpdk: Fix vhost namespace with 21.11-rc2.

   http://patchwork.ozlabs.org/project/openvswitch/list/?series=271159&state=*

In addition, documentation and DPDK unit tests were also updated in this
commit for use with DPDK v21.11.

For credit, all authors of the original commits to 'dpdk-latest' with the above
changes have been added as co-authors for this commit.

Signed-off-by: David Marchand <david.marchand@redhat.com>
Co-authored-by: David Marchand <david.marchand@redhat.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Tested-by: Emma Finn <emma.finn@intel.com>
Tested-by: Seamus Ryan <seamus.ryan@intel.com>
Acked-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2021-12-09 18:40:14 +00:00
Vladislav Odintsov
72745ab0cd compat: handle NF_REPEAT error on nf_conntrack_in.
In patch [1] rpl_nf_conntrack_in was backported as a static inline
function without the do..while loop that handles the NF_REPEAT error.
In patch [2] the backported rpl_nf_conntrack_in function was removed
from compat/include/net/netfilter/nf_conntrack_core.h as unused.

As a result, the do..while loop around nf_conntrack_in was lost.  This
caused problems on old RHEL kernels: a TCP SYN was lost on a connection
with the same 5-tuple as one that ran within the last
nf_conntrack_tcp_timeout_time_wait.  The connection could be
initiated on a TCP SYN retry after one second.
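
A userspace sketch of the retry pattern, under the assumption that the
wrapped call can ask to be run again.  The verdict constants, the callback
type and the stub are hypothetical stand-ins, not the kernel's
nf_conntrack_in() or its NF_* definitions:

```c
/* Illustrative verdict values for this sketch only. */
enum { VERDICT_ACCEPT = 1, VERDICT_REPEAT = 4 };

typedef int (*conntrack_fn)(void *ctx);

/* Calls 'fn' again for as long as it asks for a repeat, which is what
 * the lost do..while loop did around the conntrack call. */
static int
run_until_settled(conntrack_fn fn, void *ctx)
{
    int verdict;

    do {
        verdict = fn(ctx);
    } while (verdict == VERDICT_REPEAT);

    return verdict;
}

/* Stub that requests one repeat and then accepts; 'ctx' counts calls. */
static int
stub_repeat_once(void *ctx)
{
    int *calls = ctx;

    return (*calls)++ == 0 ? VERDICT_REPEAT : VERDICT_ACCEPT;
}
```

Without the loop, the first verdict would be returned to the caller as is,
so a packet that needed a second pass would not get one.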

1: 4fdec8986a
2: e9b33ad780

Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2021-September/387623.html
Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2021-October/388424.html
Signed-off-by: Vladislav Odintsov <odivlad@gmail.com>
Reviewed-by: Greg Rose <gvrose8192@gmail.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2021-12-09 15:18:23 +01:00
Maxime Coquelin
18db7ec5eb system-dpdk: Improve vhost-user ping tests reliability.
Instead of waiting 10 seconds for testpmd to start, this
patch makes use of OVS_WAIT_UNTIL() macro to wait for
the virtio device readiness notification in ovs-vswitchd
logs.

Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2021-12-09 14:36:37 +01:00
Lin Huang
4f19a78a61 netdev-vport: Fix userspace tunnel ioctl(SIOCGIFINDEX) info logs.
A userspace tunnel doesn't have a valid device in the kernel, so the
get_ifindex() function (ioctl) always gets an error while
adding a port, deleting a port or updating a port status.

The info log is
"2021-08-29T09:17:39.830Z|00059|netdev_linux|INFO|ioctl(SIOCGIFINDEX)
on vxlan_sys_4789 device failed: No such device"

If there are a lot of userspace tunnel ports on a bridge, the
iface_refresh_netdev_status() function will spend a lot of time.

So skip the ioctl(SIOCGIFINDEX) operation for userspace tunnel ports and
just return -ENODEV.

Signed-off-by: Lin Huang <linhuang@ruijie.com.cn>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2021-12-08 18:17:19 +01:00