mir/ovs - ovs - Mike's Git repositories


mirror of https://github.com/openvswitch/ovs synced 2025-08-31 06:15:47 +00:00

Author	SHA1	Message	Date
Maxime Coquelin	31e67c998b	dpif-netdev: Introduce Tx queue mode. A boolean is currently used to differentiate between the static and XPS Tx queue modes. Since we are going to introduce a new steering mode, replace this boolean with an enum. This patch does not introduce functional changes. Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com> Reviewed-by: David Marchand <david.marchand@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2022-01-17 18:07:00 +01:00
Maxime Coquelin	e97112ce78	netdev-dummy: Introduce per rxq/txq statistics. This patch adds Rx and Tx per-queue statistics. It will be used to test hash-based Tx packet steering. Only "bytes", and "packets" per-queue custom statistics are added, as there are no global "errors" counters in netdev-dummy. Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com> Reviewed-by: David Marchand <david.marchand@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2022-01-17 18:07:00 +01:00
Ilya Maximets	eff740b14e	ofproto-dpif: Fix memory leak in dpif/show-dp-features appctl. Fixes: `a98b700db6` ("ofproto: Add appctl command to show Datapath features") Acked-by: Eelco Chaudron <echaudro@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2022-01-17 11:00:29 +01:00
Martin Varghese	1917ace893	Encap & Decap actions for MPLS packet type. The encap & decap actions are extended to support MPLS packet type. Encap & decap actions adds and removes MPLS header at start of the packet. The existing PUSH MPLS & POP MPLS actions inserts & removes MPLS header between ethernet header and the IP header. Though this behaviour is fine for L3 VPN where an IP packet is encapsulated inside a MPLS tunnel, it does not suffice the L2 VPN requirements. In L2 VPN the ethernet packets must be encapsulated inside MPLS tunnel. In this change the encap & decap actions are extended to support MPLS packet type. The encap & decap adds and removes MPLS header at the start of packet as depicted below. Encapsulation: Actions - encap(mpls),encap(ethernet) Incoming packet -> \| ETH \| IP \| Payload \| 1 Actions - encap(mpls) [Datapath action - ADD_MPLS:0x8847] Outgoing packet -> \| MPLS \| ETH \| Payload\| 2 Actions - encap(ethernet) [ Datapath action - push_eth ] Outgoing packet -> \| ETH \| MPLS \| ETH \| Payload\| Decapsulation: Incoming packet -> \| ETH \| MPLS \| ETH \| IP \| Payload \| Actions - decap(),decap(packet_type(ns=0,type=0)) 1 Actions - decap() [Datapath action - pop_eth) Outgoing packet -> \| MPLS \| ETH \| IP \| Payload\| 2 Actions - decap(packet_type(ns=0,type=0)) [Datapath action - POP_MPLS:0x6558] Outgoing packet -> \| ETH \| IP \| Payload\| Signed-off-by: Martin Varghese <martin.varghese@nokia.com> Acked-by: Eelco Chaudron <echaudro@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2022-01-17 02:04:20 +01:00
Paolo Valerio	4a6a473462	netlink-socket: Log extack error messages in netlink transactions. During a netlink transaction, in case of replies of type NLMSG_ERROR, the current behavior includes the translation of the error number received into a string that describes the error code. Netlink replies may carry a more descriptive error message, and although it is possible to read those messages using the existing perf tracepoint, it is more convenient to retrieve them directly from ovs. This patch extends nl_msg_nlmsgerr() so that it retrieves the message that later, if present, will be used by nl_sock_transact_multiple__() in place of the generic descriptive form of the error number. This is particularly useful with tc that makes use of such kind of mechanism. As an example, with this patch applied, the following generic message: ovs\|00239\|netlink_socket\|DBG\|received NAK error=0 (Operation not supported) becomes: ovs\|00239\|netlink_socket\|DBG\|received NAK error=0 - Conntrack isn't enabled The layout has been slightly modified to avoid nested parentheses. Suggested-by: Marcelo Ricardo Leitner <mleitner@redhat.com> Reviewed-by: Marcelo Ricardo Leitner <mleitner@redhat.com> Signed-off-by: Paolo Valerio <pvalerio@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2022-01-16 22:16:16 +01:00
Mike Pattrick	eb1ab5357b	netdev-linux: Use matchall classifier for ingress policing. Currently ingress policing uses the basic classifier to apply traffic control filters if hardware offload is not enabled, in which case it uses matchall. This change changes the behavior to always use matchall, and fall back onto basic if the kernel is built without matchall support. The system tests are modified to allow either basic or matchall classification on the ingestion filter, and to allow either 10000 or 10240 packets for the packet burst filter. 10000 is accurate for kernel 5.14 and the most recent iproute2, however, 10240 is left for compatibility with older kernels. Acked-by: Eelco Chaudron <echaudro@redhat.com> Signed-off-by: Mike Pattrick <mkp@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2022-01-12 15:23:29 +01:00
Harry van Haaren	3b489a3b1b	dpif-netdev: Improve loading of packet data for undersized packets. This commit improves handling of packets where the allocated memory is less than 64 bytes. For packets received from DPDK ports this never matters, as an mbuf always pre-allocates enough space, however this can occur in cases where a packet is received from a kernel interface or injected by an OpenFlow controller. The fix is required to ensure OVS doesn't overread the allocated memory, e.g.: ==49944==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x6060000d8181 at pc 0x000001cb9d24 bp 0x7ffce3b385d0 sp 0x7ffce3b385c8 READ of size 64 at 0x6060000d8181 thread T0 #0 0x1cb9d23 in mfex_avx512_process lib/dpif-netdev-extract-avx512.c:491:26 #1 0x1cb9d23 in mfex_avx512_ip_udp lib/dpif-netdev-extract-avx512.c:625:1 #2 0x18786a1 in dpif_miniflow_extract_autovalidator lib/dpif-netdev-private-extract.c:277:29 #3 0x1cbca5c in dp_netdev_input_outer_avx512 lib/dpif-netdev-avx512.c:159:19 #4 0x1853048 in dp_netdev_process_rxq_port lib/dpif-netdev.c:4900:19 #5 0x1837c76 in dpif_netdev_run lib/dpif-netdev.c:6197:25 #6 0x1727a02 in type_run ofproto/ofproto-dpif.c:370:9 #7 0x16f6e07 in ofproto_type_run ofproto/ofproto.c:1778:31 #8 0x16c1a8b in bridge_run__ vswitchd/bridge.c:3245:9 #9 0x16bd2fd in bridge_run vswitchd/bridge.c:3310:5 #10 0x16db8fe in main vswitchd/ovs-vswitchd.c:127:9 #11 0x7fbc0c5b61a2 in __libc_start_main (/lib64/libc.so.6+0x271a2) #12 0xedabbd in _start (vswitchd/ovs-vswitchd+0xedabbd) 0x6060000d8181 is located 9 bytes to the right of 56-byte region [0x6060000d8140,0x6060000d8178) allocated by thread T0 here: #0 0xf7b09f in malloc (vswitchd/ovs-vswitchd+0xf7b09f) #1 0x1aff3b9 in xmalloc__ lib/util.c:137:15 #2 0x1aff3b9 in xmalloc lib/util.c:172:12 #3 0x1afe211 in process_command lib/unixctl.c:310:13 #4 0x1afe211 in run_connection lib/unixctl.c:344:17 #5 0x1afe211 in unixctl_server_run lib/unixctl.c:395:21 #6 0x16db918 in main vswitchd/ovs-vswitchd.c:128:9 #7 0x7fbc0c5b61a2 in __libc_start_main (/lib64/libc.so.6+0x271a2) The solution implemented uses a mask-to-zero if the available buffer size is less than 64 bytes, and a branch for which type of load is used. Fixes: `250ceddcc2` ("dpif-netdev/mfex: Add AVX512 based optimized miniflow extract") Reported-by: Ilya Maximets <i.maximets@ovn.org> Acked-by: Eelco Chaudron <echaudro@redhat.com> Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2022-01-12 13:59:33 +01:00
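The effect of the "mask-to-zero" load described in 3b489a3b1b can be pictured with a small Python sketch. This is only an illustration of the idea, not the AVX512 code from the commit: `load_block` and `BLOCK` are hypothetical names, and the zero-fill stands in for what a masked load with zeroing would produce.

```python
BLOCK = 64  # bytes loaded at once, per the commit message


def load_block(buf: bytes) -> bytes:
    """Return exactly BLOCK bytes without reading past the end of buf."""
    if len(buf) >= BLOCK:
        # Enough data is available: a plain full-width load is safe.
        return buf[:BLOCK]
    # Undersized buffer: take only the bytes that exist and zero the rest,
    # instead of reading beyond the allocation.
    return buf + bytes(BLOCK - len(buf))
```

With the 56-byte region from the AddressSanitizer report, `load_block` returns the 56 real bytes followed by 8 zero bytes.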
Sunil Pai G	8bc135d2d5	acinclude: Provide better error info when linking fails with DPDK. Currently, on failure to link with DPDK, the configure script provides an error message to update the PKG_CONFIG_PATH even though the cause of failure was missing dependencies. Improve the error message to include this scenario. Signed-off-by: Sunil Pai G <sunil.pai.g@intel.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2022-01-12 13:11:29 +01:00
David Marchand	1140c87e2e	netdev-dpdk: Expose per rxq/txq basic statistics. When troubleshooting multiqueue setups, having per queue statistics helps checking packets repartition in rx and tx queues. Per queue statistics are exported by most DPDK drivers (with capability RTE_ETH_DEV_AUTOFILL_QUEUE_XSTATS). OVS only filters DPDK statistics, there is nothing to request in DPDK API. So the only change is to extend the filter on xstats. Querying statistics with $ ovs-vsctl get interface dpdk0 statistics \| sed -e 's#[{}]##g' -e 's#, #\n#g' and comparing gives: @@ -13,7 +13,12 @@ rx_phy_crc_errors=0 rx_phy_in_range_len_errors=0 rx_phy_symbol_errors=0 +rx_q0_bytes=0 rx_q0_errors=0 +rx_q0_packets=0 +rx_q1_bytes=0 rx_q1_errors=0 +rx_q1_packets=0 rx_wqe_errors=0 tx_broadcast_packets=0 tx_bytes=0 @@ -27,3 +32,13 @@ tx_pp_rearm_queue_errors=0 tx_pp_timestamp_future_errors=0 tx_pp_timestamp_past_errors=0 +tx_q0_bytes=0 +tx_q0_packets=0 +tx_q1_bytes=0 +tx_q1_packets=0 +tx_q2_bytes=0 +tx_q2_packets=0 +tx_q3_bytes=0 +tx_q3_packets=0 +tx_q4_bytes=0 +tx_q4_packets=0 Signed-off-by: David Marchand <david.marchand@redhat.com> Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com> Acked-by: Kevin Traynor <ktraynor@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2022-01-12 11:40:16 +01:00
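As an alternative to the `sed` pipeline in 1140c87e2e, the per-queue counters can be pulled out of the `statistics` map with a few lines of Python. This is a sketch under the assumption that the map is printed as `{name=value, name=value, ...}`, which is what the `sed` expressions in the commit message imply; `per_queue_stats` is a hypothetical helper name.

```python
import re

# Matches names such as rx_q0_bytes or tx_q3_packets.
QUEUE_STAT = re.compile(r"(rx|tx)_q(\d+)_(\w+)")


def per_queue_stats(raw: str) -> dict:
    """Map (direction, queue index, counter name) to the counter value."""
    stats = {}
    for entry in raw.strip().strip("{}").split(", "):
        name, sep, value = entry.partition("=")
        match = QUEUE_STAT.fullmatch(name)
        if sep and match:
            direction, queue, counter = match.groups()
            stats[(direction, int(queue), counter)] = int(value)
    return stats
```

Global counters such as `tx_bytes` do not match the per-queue pattern and are skipped.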
David Marchand	f260db1efc	netdev-dpdk: Fix statistics when changing Rx/Tx queues count. When changing number of Rx or Tx queues, per queue basic stats can be renumbered in DPDK ethdev layer [1]. OVS maintains an internal xstats IDs cache that was refreshed when a cached id was not valid anymore (in netdev_dpdk_get_custom_stats) or if a new DPDK port was created. This did not handle changes of Rx/Tx queues count. For example, with a mlx5 port: $ ovs-vsctl set interface dpdk0 options:n_rxq=2 $ ovs-vsctl get interface dpdk0 statistics \| sed -e 's#[{}]##g' -e 's#, #\n#g' \| grep rx_q._errors rx_q0_errors=0 Move the cache filling after reconfiguring and starting the port. There is no need to flush the cache in netdev_dpdk_get_custom_stats. While at it, the xstats code can be cleaned up: - remove wrong or Lapalissade comments, - don't check x*alloc return value, - expect that consecutive calls to xstats API return the same number of elements, - only write to dev-> when all DPDK calls succeeded, - add missing lock annotations to netdev_dpdk_clear_xstats and netdev_dpdk_get_xstat_name, 1: https://git.dpdk.org/dpdk/tree/lib/librte_ethdev/rte_ethdev.c?h=v20.11#n2696 Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2021-November/389456.html Signed-off-by: David Marchand <david.marchand@redhat.com> Acked-by: Kevin Traynor <ktraynor@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2022-01-12 11:40:16 +01:00
David Marchand	b84386fa9a	dpdk: Support running PMD threads on any core. Previously in OVS, a PMD thread running on cpu X used lcore X. This assumption limited OVS to run PMD threads on physical cpu < RTE_MAX_LCORE. DPDK 20.08 introduced a new API that associates a non-EAL thread to a free lcore. This new API does not change the thread characteristics (like CPU affinity) and lets OVS run its PMD threads on any cpu regardless of RTE_MAX_LCORE. The DPDK multiprocess feature is not compatible with this new API and is disabled. DPDK still limits the number of lcores to RTE_MAX_LCORE (128 on x86_64) which should be enough for OVS pmd threads (hopefully). DPDK lcore/OVS PMD thread mappings are logged when trying to attach an OVS PMD thread, and when detaching. A new command is added to help get DPDK point of view of the DPDK lcores at any time: $ ovs-appctl dpdk/lcore-list lcore 0, socket 0, role RTE, cpuset 0 lcore 1, socket 0, role NON_EAL, cpuset 1 lcore 2, socket 0, role NON_EAL, cpuset 15 Signed-off-by: David Marchand <david.marchand@redhat.com> Acked-by: Kevin Traynor <ktraynor@redhat.com> Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2022-01-11 21:34:56 +01:00
Ilya Maximets	356f362068	tests/oss-fuzz: Fix the arguments of parse_tcp_flags. tests/oss-fuzz/flow_extract_target.c:59:53: error: too few arguments to function call, expected 4, have 1 uint16_t tcp_flags = parse_tcp_flags(&packet); ~~~~~~~~~~~~~~~ ^ Fixes: `e7e9973b80` ("dpif-netdev: Forwarding optimization for flows with a simple match.") Reported-at: https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=43498 Signed-off-by: Ilya Maximets <i.maximets@ovn.org> Acked-by: Aaron Conole <aconole@redhat.com>	2022-01-10 23:31:29 +01:00
Ilya Maximets	ddca1eb3ab	odp-util: Stop action list parsing if already oversized. The fuzzing target times out if the action list is too big. And we don't really need to fully parse all the actions just to say that they are too big in the end. So, check early and exit. This is a pure performance optimization, so not adding a unit test. All other code paths during the parsing are using E2BIG and not EFBIG for similar conditions, so using it here too. Reported-at: https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=39670 Acked-by: Aaron Conole <aconole@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2022-01-07 23:48:20 +01:00
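The early exit described in ddca1eb3ab can be sketched as follows. This is not the odp-util parser; `parse_actions`, `encode` and `limit` are hypothetical names used only to show the check-as-you-go pattern and the E2BIG return mentioned in the commit message.

```python
import errno


def parse_actions(tokens, encode, limit):
    """Encode tokens one by one, stopping as soon as the result is oversized.

    Returns (0, encoded_bytes) on success or (errno.E2BIG, None) when the
    accumulated size exceeds `limit`.
    """
    out = bytearray()
    for token in tokens:
        out += encode(token)
        if len(out) > limit:
            # No point parsing the remaining tokens: the list is already
            # too big, so report the error right away.
            return errno.E2BIG, None
    return 0, bytes(out)
```

The remaining tokens are never encoded once the limit is crossed, which is what keeps an oversized input from consuming parsing time.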
Sriharsha Basavapatna	6e50c16518	dpif-netdev: Avoid hw_miss_packet_recover() for devices with no support. The hw_miss_packet_recover() API results in performance degradation, for ports that are either not offload capable or do not support this specific offload API. For example, in the test configuration shown below, the vhost-user port does not support offloads and the VF port doesn't support hw_miss offload API. But because tunnel offload needs to be configured in other bridges (br-vxlan and br-phy), OVS has been built with -DALLOW_EXPERIMENTAL_API. br-vhost br-vxlan br-phy vhost-user<-->VF VF-Rep<-->VxLAN uplink-port For every packet between the VF and the vhost-user ports, hw_miss API is called even though it is not supported by the ports involved. This leads to significant performance drop (~3x in some cases; both cycles and pps). Return EOPNOTSUPP when this API fails for a device that doesn't support it and avoid this API on that port for subsequent packets. Signed-off-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2022-01-07 20:53:20 +01:00
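The behaviour described in 6e50c16518 amounts to remembering, per port, that the API is unsupported. A minimal sketch, assuming a per-port flag and a callback that returns an errno-style code (the `Port` class and its attribute names are hypothetical, not the OVS data structures):

```python
import errno


class Port:
    def __init__(self, recover):
        self._recover = recover        # hypothetical device-specific callback
        self.hw_miss_supported = True  # optimistic until the device says no

    def hw_miss_packet_recover(self, packet):
        if not self.hw_miss_supported:
            # Already known to be unsupported: skip the expensive call.
            return errno.EOPNOTSUPP
        err = self._recover(packet)
        if err == errno.EOPNOTSUPP:
            # Remember the answer so later packets avoid the call.
            self.hw_miss_supported = False
        return err
```

Only the first packet on an unsupported port pays for the call; every later packet takes the cheap flag check.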
Ilya Maximets	e7e9973b80	dpif-netdev: Forwarding optimization for flows with a simple match. There are cases where users might want simple forwarding or drop rules for all packets received from a specific port, e.g.: "in_port=1,actions=2" "in_port=2,actions=IN_PORT" "in_port=3,vlan_tci=0x1234/0x1fff,actions=drop" "in_port=4,actions=push_vlan:0x8100,set_field:4196->vlan_vid,output:3" There are also cases where complex OpenFlow rules can be simplified down to datapath flows with very simple match criteria. In theory, for very simple forwarding, OVS doesn't need to parse packets at all in order to follow these rules. "Simple match" lookup optimization is intended to speed up packet forwarding in these cases. Design: Due to various implementation constraints userspace datapath has following flow fields always in exact match (i.e. it's required to match at least these fields of a packet even if the OF rule doesn't need that): - recirc_id - in_port - packet_type - dl_type - vlan_tci (CFI + VID) - in most cases - nw_frag - for ip packets Not all of these fields are related to packet itself. We already know the current 'recirc_id' and the 'in_port' before starting the packet processing. It also seems safe to assume that we're working with Ethernet packets. So, for the simple OF rule we need to match only on 'dl_type', 'vlan_tci' and 'nw_frag'. 'in_port', 'dl_type', 'nw_frag' and 13 bits of 'vlan_tci' can be combined in a single 64bit integer (mark) that can be used as a hash in hash map. We are using only VID and CFI from the 'vlan_tci', flows that need to match on PCP will not qualify for the optimization. Workaround for matching on non-existence of vlan updated to match on CFI and VID only in order to qualify for the optimization. CFI is always set by OVS if vlan is present in a packet, so there is no need to match on PCP in this case. 'nw_frag' takes 2 bits of PCP inside the simple match mark. 
New per-PMD flow table 'simple_match_table' introduced to store simple match flows only. 'dp_netdev_flow_add' adds flow to the usual 'flow_table' and to the 'simple_match_table' if the flow meets following constraints: - 'recirc_id' in flow match is 0. - 'packet_type' in flow match is Ethernet. - Flow wildcards contains only minimal set of non-wildcarded fields (listed above). If the number of flows for current 'in_port' in a regular 'flow_table' equals number of flows for current 'in_port' in a 'simple_match_table', we may use simple match optimization, because all the flows we have are simple match flows. This means that we only need to parse 'dl_type', 'vlan_tci' and 'nw_frag' to perform packet matching. Now we make the unique flow mark from the 'in_port', 'dl_type', 'nw_frag' and 'vlan_tci' and looking for it in the 'simple_match_table'. On successful lookup we don't need to run full 'miniflow_extract()'. Unsuccessful lookup technically means that we have no suitable flow in the datapath and upcall will be required. So, in this case EMC and SMC lookups are disabled. We may optimize this path in the future by bypassing the dpcls lookup too. Performance improvement of this solution on a 'simple match' flows should be comparable with partial HW offloading, because it parses same packet fields and uses similar flow lookup scheme. However, unlike partial HW offloading, it works for all port types including virtual ones. Performance results when compared to EMC: Test setup: virtio-user OVS virtio-user Testpmd1 ------------> pmd1 ------------> Testpmd2 (txonly) x<------ pmd2 <------------ (mac swap) Single stream of 64byte packets. Actions: in_port=vhost0,actions=vhost1 in_port=vhost1,actions=vhost0 Stats collected from pmd1 and pmd2, so there are 2 scenarios: Virt-to-Virt : Testpmd1 ------> pmd1 ------> Testpmd2. Virt-to-NoCopy : Testpmd2 ------> pmd2 --->x Testpmd1. 
Here the packet sent from pmd2 to Testpmd1 is always dropped, because the virtqueue is full since Testpmd1 is in txonly mode and doesn't receive any packets. This should be closer to the performance of a VM-to-Phy scenario. Test performed on machine with Intel Xeon CPU E5-2690 v4 @ 2.60GHz. Table below represents improvement in throughput when compared to EMC. +----------------+------------------------+------------------------+ \| \| Default (-g -O2) \| "-Ofast -march=native" \| \| Scenario +------------+-----------+------------+-----------+ \| \| GCC \| Clang \| GCC \| Clang \| +----------------+------------+-----------+------------+-----------+ \| Virt-to-Virt \| +18.9% \| +25.5% \| +10.8% \| +16.7% \| \| Virt-to-NoCopy \| +24.3% \| +33.7% \| +14.9% \| +22.0% \| +----------------+------------+-----------+------------+-----------+ For Phy-to-Phy case performance improvement should be even higher, but it's not the main use-case for this functionality. Performance difference for the non-simple flows is within a margin of error. Acked-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2022-01-07 20:32:20 +01:00
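The 64-bit "simple match mark" described in e7e9973b80 can be illustrated with a short Python sketch. The exact bit layout below is an assumption made for illustration (in_port in the upper 32 bits, dl_type in the next 16, and the low 16 bits holding nw_frag in two of the PCP bit positions plus CFI and VID); byte order is ignored, and the constant and function names are hypothetical.

```python
VLAN_VID_MASK = 0x0FFF  # 12-bit VLAN ID
VLAN_CFI = 0x1000       # CFI bit, set when a VLAN tag is present
VLAN_PCP_SHIFT = 13     # PCP occupies the top 3 bits of the TCI
NW_FRAG_MASK = 0x3      # assumption: nw_frag fits in two bits


def simple_match_mark(in_port, dl_type, nw_frag, vlan_tci):
    """Pack the four fields into one 64-bit integer usable as a hash key."""
    low16 = ((nw_frag & NW_FRAG_MASK) << VLAN_PCP_SHIFT) \
            | (vlan_tci & (VLAN_VID_MASK | VLAN_CFI))
    return ((in_port & 0xFFFFFFFF) << 32) | ((dl_type & 0xFFFF) << 16) | low16


# A per-port table keyed by the mark; the value stands in for a flow.
simple_match_table = {
    simple_match_mark(3, 0x0800, 0, 0x1234): "drop",
}
```

Because only CFI and VID are kept from `vlan_tci`, two packets that differ only in PCP produce the same mark, which matches the statement that flows matching on PCP do not qualify for the optimization.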
Terry Wilson	46d44cf3be	python: idl: Add monitor_cond_since support. Add support for monitor_cond_since / update3 to python-ovs to allow more efficient reconnections when connecting to clustered OVSDB servers. Signed-off-by: Terry Wilson <twilson@redhat.com> Acked-by: Dumitru Ceara <dceara@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2022-01-06 16:45:56 +01:00
Mike Pattrick	0d1ffb7756	checkpatch: Detect "trojan source" attack. Recently there has been a lot of press about the "trojan source" attack, where Unicode characters are used to obfuscate the true functionality of code. This attack didn't affect OVS, but adding the check here will help guard against it sneaking in later. Signed-off-by: Mike Pattrick <mkp@redhat.com> Acked-by: Gaetan Rivet <grive@u256.net> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2022-01-04 19:14:11 +01:00
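A check of the kind described in 0d1ffb7756 can be sketched by scanning for Unicode bidirectional control characters, which are the characters the "trojan source" technique relies on. This is an illustration only: the set below and the `find_bidi_controls` helper are assumptions, and the list actually used by checkpatch may differ.

```python
# Unicode bidirectional embedding, override and isolate controls.
BIDI_CONTROLS = {
    "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",
    "\u2066", "\u2067", "\u2068", "\u2069",
}


def find_bidi_controls(text):
    """Return (line, column, code point) for every bidi control in text."""
    hits = []
    for lineno, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in BIDI_CONTROLS:
                hits.append((lineno, col, f"U+{ord(ch):04X}"))
    return hits
```

Lines are split on "\n" only, so the reported line numbers agree with what a text editor shows.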
Mike Pattrick	428b11caa7	utilities: Add another GDB macro for ovs-vswitchd. This commit adds a basic packet metadata macro to the already existing macros in ovs_gdb.py, ovs_dump_packets will print out information about one or more packets. It feeds packets into tcpdump, and the user can pass in tcpdump options to modify how packets are parsed or even write out packets to a pcap file. Example usage: (gdb) break fast_path_processing (gdb) commands ovs_dump_packets packets_ continue end (gdb) continue Thread 1 "ovs-vswitchd" hit Breakpoint 2, fast_path_processing ... 12:01:05.962485 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 10.1.1.1 tell 10.1.1.2, length 28 Thread 1 "ovs-vswitchd" hit Breakpoint 1, fast_path_processing ... 12:01:05.981214 ARP, Ethernet (len 6), IPv4 (len 4), Reply 10.1.1.1 is-at a6:0f:c3:f0:5f:bd (oui Unknown), length 28 Signed-off-by: Mike Pattrick <mkp@redhat.com> Acked-by: Eelco Chaudron <echaudro@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2022-01-04 19:14:11 +01:00
Frode Nordahl	2f2ae5b6bd	tests: Fix endianness in netlink policy test fixtures. The netlink policy unit test contains test fixture data that is subject to endianness and currently fails on big endian systems. Store the fixture data in a struct to ensure proper byte order for the header data. Also fix improper style for sizeof with expressions. Fixes: `bfee9f6c01` ("netlink: Add support for parsing link layer address.") Signed-off-by: Frode Nordahl <frode.nordahl@canonical.com> Acked-by: Mike Pattrick <mkp@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2022-01-04 19:14:11 +01:00
Eli Britstein	0b6d2faace	ci: Remove -Wno-cast-align from CI. Following [1]-[3] in DPDK, there are no more such warnings from DPDK. Remove ignoring them if they occur. GitHub actions: v1: https://github.com/elibritstein/OVS/actions/runs/1540651133 [1] a3f8d0587188 ("net: avoid cast-align warning in VLAN insert function") [2] da0333c8790b ("mbuf: avoid cast-align warning in data offset macro") [3] 6de430b7079e ("eal/x86: avoid cast-align warning in memcpy functions") Signed-off-by: Eli Britstein <elibr@nvidia.com> Acked-by: Eelco Chaudron <echaudro@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2022-01-04 19:14:11 +01:00
Mike Pattrick	d652fc6a5a	checkpatch: Correct line count in error messages. As part of some previous checkpatch work, we discovered that checkpatch isn't always reporting correct line numbers. As it turns out, Python's splitlines function considers several characters to be new lines which common text editors do not typically consider to be new lines. For example, form feed characters, which this code base uses to cluster functionality. Signed-off-by: Mike Pattrick <mkp@redhat.com> Acked-by: Paolo Valerio <pvalerio@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2022-01-04 17:23:30 +01:00
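The line-count discrepancy described in d652fc6a5a is easy to reproduce: `str.splitlines()` treats a form feed as a line boundary, while splitting on "\n" does not. The sketch below shows the difference; `lines_like_editor` is a hypothetical helper, not the function used in checkpatch.

```python
def lines_like_editor(text):
    """Split on "\\n" only, the way common text editors count lines."""
    lines = text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # a trailing newline does not start a new line
    return lines


# A form feed on its own line, as used in this code base to group functions.
sample = "int a;\n\x0c\nint b;\n"

# splitlines() also breaks at the form feed, so it reports an extra line.
extra = len(sample.splitlines()) - len(lines_like_editor(sample))
```

For `sample`, `splitlines()` yields four lines and `lines_like_editor` yields three, so every line after the form feed would be reported one line too late.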
Ilya Maximets	28ef2535c1	dpif-netdev-extract: Change availability log level to DBG. Availability logs are not essential for a normal run. The same information can be obtained via appctl in runtime. They also can not show if particular implementation will actually be used or not, hence not useful for post-crash investigations. Moving to DBG level to avoid bulky unnecessary logging. Additionally making them a bit more readable. Acked-by: Kumar Amber <kumar.amber@intel.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2022-01-04 17:19:18 +01:00
Ilya Maximets	38c53dd17d	AUTHORS: Add Nobuhiro MIKI. Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2022-01-04 17:17:28 +01:00
Nobuhiro MIKI	9a834205a4	docs: afxdp: Remove duplicated lines. Signed-off-by: Nobuhiro MIKI <nmiki@yahoo-corp.jp> Acked-by: Mike Pattrick <mkp@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2022-01-04 17:16:03 +01:00
David Marchand	d446dcb7e0	system-dpdk: Refactor common logs matching. Move EAL logs and commonly ignored logs to a common macro. Remove/update obsolete ones (like i40e [1], timer [2], EAL [3][4] logs). Set log level for DPDK drivers to error only: the rationale is that we are not testing DPDK drivers in system-dpdk. Extend regex on hugepage logs since a check on hugepages availability is already present on OVS side, and as a consequence, we don't care about the warnings on availability for certain hugepage size. Add logs checks for MFEX tests that were missing them. 1: https://git.dpdk.org/dpdk/commit/?id=a075ce2b3e8c 2: https://git.dpdk.org/dpdk/commit/?id=c1077933d45b 3: https://git.dpdk.org/dpdk/commit/?id=e9b3d79b0696 4: https://git.dpdk.org/dpdk/commit/?id=c69150679891 Signed-off-by: David Marchand <david.marchand@redhat.com> Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com> Acked-by: Eelco Chaudron <echaudro@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2022-01-03 19:47:43 +01:00
David Marchand	b366fa2f49	dpif-netdev: Call cpuid for x86 isa availability. DPIF AVX512 optimizations currently rely on DPDK availability while they can be used without DPDK. Besides, checking for availability of some isa only has to be done once and won't change while a OVS process runs. Resolve isa availability in constructors by using a simplified query based on cpuid API that comes from the compiler. Note: this also fixes the check on BMI2 availability: DPDK had a bug for this isa, see https://git.dpdk.org/dpdk/commit/?id=aae3037ab1e0. Suggested-by: Ilya Maximets <i.maximets@ovn.org> Signed-off-by: David Marchand <david.marchand@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2022-01-03 18:45:40 +01:00
Ilya Maximets	11441385c2	bridge: Fix incorrect configuration of netdev's dpif type. netdev_set_dpif_type() can only be used with a normalized dpif type as an argument, which is a constant static string derived from a type of a dpif_class or a constant string "system". Usage of a same constant string allows netdev-offload module to compare types by simply comparing pointers. OTOH, 'br->ofproto->type' is a dynamic string that: a. Can be NULL. b. Even if not NULL and equal, can be a different dynamically allocated string. Both these qualities breaks assumptions made by all other modules related to HW offload, breaking the functionality. Fix that by moving netdev_set_dpif_type() to dpif.c and calling with a correct constant string as an argument. The call moved from bridge.c to dpif.c, because we need to have access to the dpif class, but bridge.c should not. Not trying to set the dpif_type inside the netdev_ports_insert(), because it's used now outside the offloading context. So, it's cleaner to move the netdev_set_dpif_type() call outside of the netdev-offload module. Additionally removed the redundant call from the netdev_ports_insert() and refactored the function, since it doesn't need an extra argument anymore. Fixes: `4f19a78a61` ("netdev-vport: Fix userspace tunnel ioctl(SIOCGIFINDEX) info logs.") Reported-by: Roi Dayan <roid@nvidia.com> Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2021-December/390117.html Signed-off-by: Ilya Maximets <i.maximets@ovn.org> Reviewed-by: Lin Huang <linhuang@ruijie.com.cn> Acked-by: Roi Dayan <roid@nvidia.com>	2021-12-17 21:31:55 +01:00
Paolo Valerio	ec2aa2ab46	ofproto-dpif-xlate: Snoop ingress packets and update neigh cache if needed. In case of native tunnel with bfd enabled, if the MAC address of the remote end's interface changes (e.g. because it got rebooted, and the MAC address is allocated dynamically), the BFD session will never be re-established. This happens because the local tunnel neigh entry doesn't get updated, and the local end keeps sending BFD packets with the old destination MAC address. This was not an issue until `b23ddcc57d` ("tnl-neigh-cache: tighten arp and nd snooping.") because ARP requests were snooped as well avoiding the problem. Fix this by snooping the incoming packets in the slow path, and updating the neigh cache accordingly. Fixes: `b23ddcc57d` ("tnl-neigh-cache: tighten arp and nd snooping.") Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=2002430 Signed-off-by: Paolo Valerio <pvalerio@redhat.com> Acked-by: Gaetan Rivet <grive@u256.net> Acked-by: Flavio Leitner <fbl@sysclose.org> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2021-12-17 20:32:13 +01:00
Paolo Valerio	b723b93200	tnl-neigh-cache: Do not refresh the entry while revalidating. This is a minor issue but visible e.g. when you try to flush the neigh cache while the ARP flow is still present in the datapath, triggering the revalidation of the datapath flows which subsequently refreshes/adds the entry in the cache. Signed-off-by: Paolo Valerio <pvalerio@redhat.com> Acked-by: Flavio Leitner <fbl@sysclose.org> Acked-by: Gaetan Rivet <grive@u256.net> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2021-12-17 20:32:05 +01:00
Paolo Valerio	02f95638a4	tnl-neigh-cache: Add tnl/neigh/aging command. With this command it is now possible to change the aging time of the cache entries. For the existing entries the aging time is updated only if the current expiration is greater than the new one. In any case, the next refresh will set it to the new value. This is intended mostly for debugging purposes. Signed-off-by: Paolo Valerio <pvalerio@redhat.com> Acked-by: Flavio Leitner <fbl@sysclose.org> Acked-by: Gaetan Rivet <grive@u256.net> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2021-12-17 20:31:57 +01:00
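The update rule described in 02f95638a4 for existing entries can be sketched as below. This is an illustration with hypothetical names (`apply_new_aging`, a plain dict of expiration times), not the tnl-neigh-cache code.

```python
def apply_new_aging(entries, now, new_aging):
    """Shorten existing expirations that lie beyond the new aging time.

    `entries` maps a neighbor key to its absolute expiration time.
    """
    new_expiry = now + new_aging
    for key, expires in entries.items():
        if expires > new_expiry:
            # Entry would outlive the new aging time: pull it in.
            entries[key] = new_expiry
        # Entries that expire sooner are left alone; their next refresh
        # will pick up the new aging time anyway.
```

Lowering the aging time therefore takes effect immediately, while raising it only applies as entries are refreshed.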
Paolo Valerio	f527aef147	tnl-neigh-cache: Read/write expires atomically. Expires is modified in different threads (revalidator, pmd-rx, bfd-tx). It's better to use atomics for such potentially parallel write. Signed-off-by: Paolo Valerio <pvalerio@redhat.com> Acked-by: Flavio Leitner <fbl@sysclose.org> Acked-by: Gaetan Rivet <grive@u256.net> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2021-12-17 20:31:14 +01:00
Harry van Haaren	f0266292b7	dpif-netdev: Improve handling of IP/TCP in avx512 mfex. This commit tightens the requirements for processing TCP packets in AVX512, ensuring that there are no TCP options by validating that the "data offset" field of the TCP header is exactly equal to 5. This ensures that the TCP header is not too short, and that it does not contain extra options. On the IP handling side, improve checks around total packet length. Now the next protocol is included in the length checks, ensuring that the IP header reported length is of appropriate size to contain the next protocol (e.g. UDP requires 8 bytes, TCP requires 20). Note that the inner protocol is always of a fixed size per profile, so it can be set using the UDP_ and TCP_ HEADER_LEN defines. Fixes: `250ceddcc2` ("dpif-netdev/mfex: Add AVX512 based optimized miniflow extract") Reported-by: Ilya Maximets <i.maximets@ovn.org> Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2021-12-17 19:50:55 +01:00
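The two checks described in f0266292b7 can be written out in a few lines of Python. This is a sketch of the validation logic, not the AVX512 implementation; the function names are hypothetical, while the header sizes (UDP 8 bytes, TCP 20 bytes) come from the commit message.

```python
UDP_HEADER_LEN = 8
TCP_HEADER_LEN = 20


def tcp_has_no_options(tcp_header: bytes) -> bool:
    """True if the TCP "data offset" field is exactly 5 (20-byte header)."""
    if len(tcp_header) < TCP_HEADER_LEN:
        return False
    # Data offset is the upper 4 bits of byte 12, counted in 32-bit words.
    return (tcp_header[12] >> 4) == 5


def ip_total_len_ok(total_len: int, ihl_words: int, next_hdr_len: int) -> bool:
    """True if the IP total length can hold the IP header and next header."""
    return total_len >= ihl_words * 4 + next_hdr_len
```

A data offset below 5 means the header is too short and a value above 5 means options are present; both are rejected by the equality test.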
Ilya Maximets	893693e808	AUTHORS: Add Nir Anteby. Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2021-12-16 22:48:01 +01:00
Nir Anteby	7617d0583c	netdev-offload-dpdk: Add support for matching on gre fields. Add parsing gre match fields. Signed-off-by: Nir Anteby <nanteby@nvidia.com> Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com> Tested-by: Emma Finn <emma.finn@intel.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2021-12-16 12:03:13 +01:00
Nir Anteby	5f60741dcf	netdev-offload-dpdk: Support tnl_pop for gre tunnel. Add support for tnl_pop action for gre vport. Signed-off-by: Nir Anteby <nanteby@nvidia.com> Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com> Tested-by: Emma Finn <emma.finn@intel.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2021-12-16 12:03:07 +01:00
Nir Anteby	a32cb78b5a	netdev-dpdk: Add flow_api support for netdev gre vports. Add the acceptance of GRE devices to netdev_dpdk_flow_api_supported() API, to allow offloading of DPDK GRE devices. Signed-off-by: Nir Anteby <nanteby@nvidia.com> Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com> Tested-by: Emma Finn <emma.finn@intel.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2021-12-16 12:03:02 +01:00
Nir Anteby	8279041460	netdev-offload-dpdk: Refactor get_vport_netdev(). Refactor the function as a pre-step towards supporting more tunnel types. Signed-off-by: Nir Anteby <nanteby@nvidia.com> Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com> Tested-by: Emma Finn <emma.finn@intel.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2021-12-16 12:02:15 +01:00
Alin-Gabriel Serdean	76527525e0	AUTHORS: Update email for Alin Serdean. Signed-off-by: Alin-Gabriel Serdean <aserdean@ovn.org> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2021-12-16 11:48:12 +01:00
Joe Stringer	38b42aa93f	MAINTAINERS: Move Joe to emeritus status. My primary focus has been in adjacent communities for some time now. It seems appropriate that my status in the OVS maintainers file should reflect the level of attention I am able to provide to the project. Thanks to the other contributors past and present for the experiences we've shared, and I'll see you around wherever our paths cross again :-) Signed-off-by: Joe Stringer <joe@ovn.org> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2021-12-15 20:39:30 +01:00
Rosemarie O'Riorden	269b927fd7	dpdk: Use --in-memory by default. If anonymous memory mapping is supported by the kernel, it's better to run OVS entirely in memory rather than creating shared data structures. OVS doesn't work in multi-process mode, so there is no need to litter a filesystem. Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=1949849 Acked-by: David Marchand <david.marchand@redhat.com> Acked-by: Ian Stokes <ian.stokes@intel.com> Signed-off-by: Rosemarie O'Riorden <roriorde@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2021-12-15 18:22:51 +01:00
David Marchand	b5d2dbdbb5	system-dpdk: Fix race in vhost-user tests. Waiting only on the vhost user port to be ready is not enough since a tap is also initialized by testpmd and is used to inject/receive packets in/from the kernel. Wait on the tap link status. Fixes: `18db7ec5eb` ("system-dpdk: Improve vhost-user ping tests reliability.") Signed-off-by: David Marchand <david.marchand@redhat.com> Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com> Acked-by: Aaron Conole <aconole@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2021-12-15 18:20:01 +01:00
Ilya Maximets	9827312fa4	docs: Re-work the documentation around CPU ISA optimizations. A few problems with the current documentation: 1. bridge.rst is the high-level documentation for the end user. Unit testing and complex implementation details are for developers, hence should not be there. Testing instructions for developers should be in testing.rst. Words in the doc should be understandable for the user who doesn't know OVS internals. 2. Some paragraphs in the current documentation are repeating each other almost to the word. 3. Some paragraphs are incorrectly formatted. That affects the rendering. 4. There is no point describing every separate test of a system-dpdk testsuite. What is done: 1. All the testing related paragraphs are consolidated and moved to the testing.rst. 2. Most abbreviations are replaced with words that are more readable and understandable for the end user. 3. I failed to understand the meaning or the purpose of several sentences, so they are just deleted. 4. Fixed formatting and a few typos along the way. IMO, some parts of the doc still need some re-wording, but this change provides at least a starting point for improvement, setting a better structure for the document. Signed-off-by: Ilya Maximets <i.maximets@ovn.org> Reviewed-by: David Marchand <david.marchand@redhat.com>	2021-12-15 10:41:37 +01:00
Ilya Maximets	ed9778e94f	dpif-netdev: Fix the autovalidator output for the miniflow extract. The autovalidator uses an incorrect block count while printing the miniflow buffer from a tested implementation. This results in not printing fields that were incorrectly added to the miniflow or printing more of a buffer even if not needed. Fix that by requesting and using the correct block count. Also fixed the output formatting issues: extra spaces, characters, unclear relations between names and numbers due to mixed up delimiters, '%u' used for uint16_t, '\t' in the output, etc. Fixes: `dd3f5d86d9` ("dpif-netdev: Add auto validation function for miniflow extract") Signed-off-by: Ilya Maximets <i.maximets@ovn.org> Acked-by: Kumar Amber <kumar.amber@intel.com>	2021-12-15 10:20:42 +01:00
Ilya Maximets	339f97044e	ovsdb: storage: Randomize should_snapshot checks when the minimum time passed. Snapshots are scheduled for every 10-20 minutes. It's a random value in this interval for each server. Once the time is up, but the maximum time (24 hours) not reached yet, ovsdb will start checking if the log grew a lot on every iteration. Once the growth is detected, compaction is triggered. OTOH, it's very common for an OVSDB cluster to not have the log growing very fast. If the log didn't grow 2x in 20 minutes, the randomness of the initial scheduled time is gone and all the servers are checking if they need to create snapshot on every iteration. And since all of them are part of the same cluster, their logs are growing with the same speed. Once the critical mass is reached, all the servers will start creating snapshots at the same time. If the database is big enough, that might leave the cluster unresponsive for an extended period of time (e.g. 10-15 seconds for OVN_Southbound database in a larger scale OVN deployment) until the compaction completed. Fix that by re-scheduling a quick retry if the minimal time already passed. Effectively, this will work as a randomized 1-2 min delay between checks, so the servers will not synchronize. Scheduling function updated to not change the upper limit on quick reschedules to avoid delaying the snapshot creation indefinitely. Currently quick re-schedules are only used for the error cases, and there is always a 'slow' re-schedule after the successful compaction. So, the change of a scheduling function doesn't change the current behavior much. Signed-off-by: Ilya Maximets <i.maximets@ovn.org> Acked-by: Han Zhou <hzhou@ovn.org> Acked-by: Dumitru Ceara <dceara@redhat.com>	2021-12-13 21:54:45 +01:00
Dumitru Ceara	bf07cc9cdb	raft: Only allow followers to snapshot. Commit `3c2d6274bc` ("raft: Transfer leadership before creating snapshots.") made it such that raft leaders transfer leadership before snapshotting. However, there's still the case when the next leader to be is in the process of snapshotting. To avoid delays in that case too, we now explicitly allow snapshots only on followers. Cluster members will have to wait until the current election is settled before snapshotting. Given the following logs taken from an OVN_Southbound 3-server cluster during a scale test: S1 (old leader): 19:07:51.226Z\|raft\|INFO\|Transferring leadership to write a snapshot. 19:08:03.830Z\|ovsdb\|INFO\|OVN_Southbound: Database compaction took 12601ms 19:08:03.940Z\|raft\|INFO\|server 8b8d is leader for term 43 S2 (follower): 19:08:00.870Z\|raft\|INFO\|server 8b8d is leader for term 43 S3 (new leader): 19:07:51.242Z\|raft\|INFO\|received leadership transfer from f5c9 in term 42 19:07:51.244Z\|raft\|INFO\|term 43: starting election 19:08:00.805Z\|ovsdb\|INFO\|OVN_Southbound: Database compaction took 9559ms 19:08:00.869Z\|raft\|INFO\|term 43: elected leader by 2+ of 3 servers We see that the leader to be (S3) receives the leadership transfer, initiates the election and immediately after starts a snapshot that takes ~9.5 seconds. During this time, S2 votes for S3 electing it as cluster leader but S3 doesn't effectively become leader until it finishes snapshotting, essentially keeping the cluster without a leader for up to ~9.5 seconds. With the current change, S3 will delay compaction and snapshotting until the election is finished. The only exception is the case of single-node clusters for which we allow the node to snapshot regardless of role. Acked-by: Han Zhou <hzhou@ovn.org> Signed-off-by: Dumitru Ceara <dceara@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2021-12-13 21:52:59 +01:00
Ilya Maximets	20a4f546f7	dpif-netdev: Use PMD context to get the port for HW miss recovery. Last RX queue, from which the packet got received, is already stored in the PMD context. So, we can get the netdev from it without the expensive hash map lookup. In my V2V testing this patch improves performance in case HW offload and experimental APIs are enabled by about 3%. That narrows down the performance difference with the case with experimental API disabled to about 0.5%, which is way within a margin of error for that setup. Signed-off-by: Ilya Maximets <i.maximets@ovn.org> Acked-by: Eli Britstein <elibr@nvidia.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2021-12-09 22:46:47 +01:00
Ian Stokes	17346b3899	dpdk: Update to use DPDK v21.11. This commit adds support for DPDK v21.11, it includes the following changes. 1. ci: Install python elftools for DPDK 21.02. 2. ci: Update meson requirement for DPDK 21.05. 3. netdev-dpdk: Fix build with 21.05. 4. ci: Compile DPDK in non developer mode. http://patchwork.ozlabs.org/project/openvswitch/list/?series=242480&state=* 5. netdev-dpdk: Remove access to DPDK internals. 6. netdev-dpdk: Remove unused attribute from rte_flow rule. 7. netdev-dpdk: Fix mbuf macros namespace with 21.11-rc1. 8. netdev-dpdk: Fix vhost namespace with 21.11-rc2. http://patchwork.ozlabs.org/project/openvswitch/list/?series=271159&state=* In addition documentation and DPDK unit tests were also updated in this commit for use with DPDK v21.11. For credit all authors of the original commits to 'dpdk-latest' with the above changes have been added as co-authors for this commit. Signed-off-by: David Marchand <david.marchand@redhat.com> Co-authored-by: David Marchand <david.marchand@redhat.com> Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com> Tested-by: Emma Finn <emma.finn@intel.com> Tested-by: Seamus Ryan <seamus.ryan@intel.com> Acked-by: Kevin Traynor <ktraynor@redhat.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2021-12-09 18:40:14 +00:00
Vladislav Odintsov	72745ab0cd	compat: handle NF_REPEAT error on nf_conntrack_in. In patch [1] rpl_nf_conntrack_in was backported as static inline function without do..while loop handling NF_REPEAT error. In patch [2] rpl_nf_conntrack_in backported function was removed from compat/include/net/netfilter/nf_conntrack_core.h as an unused. As a result the do..while loop around nf_conntrack_in was lost and this caused problems on old RHEL kernels with the tcp SYN loss on a connection with same 5-tuple, which ran in last nf_conntrack_tcp_timeout_time_wait. The connection could be initiated on a tcp SYN retry after one second. 1: `4fdec8986a` 2: `e9b33ad780` Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2021-September/387623.html Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2021-October/388424.html Signed-off-by: Vladislav Odintsov <odivlad@gmail.com> Reviewed-by: Greg Rose <gvrose8192@gmail.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2021-12-09 15:18:23 +01:00
Maxime Coquelin	18db7ec5eb	system-dpdk: Improve vhost-user ping tests reliability. Instead of waiting 10 seconds for testpmd to start, this patch makes use of OVS_WAIT_UNTIL() macro to wait for the virtio device readiness notification in ovs-vswitchd logs. Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com> Acked-by: Aaron Conole <aconole@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2021-12-09 14:36:37 +01:00
Lin Huang	4f19a78a61	netdev-vport: Fix userspace tunnel ioctl(SIOCGIFINDEX) info logs. A userspace tunnel doesn't have a valid device in the kernel. So the get_ifindex() function (ioctl) always gets an error when adding a port, deleting a port or updating a port status. The info log is "2021-08-29T09:17:39.830Z\|00059\|netdev_linux\|INFO\|ioctl(SIOCGIFINDEX) on vxlan_sys_4789 device failed: No such device" If there are a lot of userspace tunnel ports on a bridge, the iface_refresh_netdev_status() function will spend a lot of time. So skip the ioctl(SIOCGIFINDEX) operation for userspace tunnel ports and just return -ENODEV. Signed-off-by: Lin Huang <linhuang@ruijie.com.cn> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2021-12-08 18:17:19 +01:00
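
The header checks described in commit f0266292b7 (TCP "data offset" exactly equal to 5, and an IP total length large enough to hold the next protocol header) can be illustrated with a minimal scalar sketch. This is not the AVX512 code from the commit: the helper names `tcp_data_offset_is_5()` and `ip_len_fits_l4()` are hypothetical, and the two length defines are given local values here (8 bytes for UDP, 20 for TCP, as stated in the commit message) only to keep the example self-contained.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Local values for illustration, matching the sizes quoted in the commit. */
#define UDP_HEADER_LEN 8
#define TCP_HEADER_LEN 20

/* Hypothetical helper: true if the TCP "data offset" field is exactly 5,
 * i.e. a 20-byte header that is neither too short nor carrying options.
 * 'tcp' must point to at least 13 readable bytes of the TCP header. */
static bool
tcp_data_offset_is_5(const uint8_t *tcp)
{
    assert(tcp != NULL);
    /* The data offset occupies the upper 4 bits of byte 12. */
    return (tcp[12] >> 4) == 5;
}

/* Hypothetical helper: true if the IPv4 total length (host byte order) can
 * contain the IP header ('ihl_words' 32-bit words) plus a next-protocol
 * header of 'l4_header_len' bytes. */
static bool
ip_len_fits_l4(uint16_t ip_total_len, uint8_t ihl_words, size_t l4_header_len)
{
    size_t ip_header_len = (size_t) ihl_words * 4;

    return ihl_words >= 5 && ip_total_len >= ip_header_len + l4_header_len;
}
```

For example, an IPv4 packet with a 20-byte IP header needs a total length of at least 40 bytes to pass the TCP check and at least 28 bytes to pass the UDP check.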

19352 Commits