mir/ovs - ovs - Mike's Git repositories

mirror of https://github.com/openvswitch/ovs synced 2025-08-29 05:18:13 +00:00

Author	SHA1	Message	Date
Ilya Maximets	acc5df0e3c	dpif-netdev: Fix time delta overflow in case of race for meter lock. There is a race window between getting the time and getting the meter lock. This could lead to a situation where the thread with the larger current time (this thread called time_{um}sec() later than others) will acquire the meter lock first and update meter->used to the large value. The next threads will try to calculate the time delta by subtracting the large meter->used from their lower time, getting a negative value which will be converted to a big unsigned delta. Fix that by assuming that all these threads received packets at the same time in this case, i.e. dropping the negative delta to 0. CC: Jarno Rajahalme <jarno@ovn.org> Fixes: 4b27db644a8c ("dpif-netdev: Simple DROP meter implementation.") Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2019-September/363126.html Signed-off-by: Ilya Maximets <i.maximets@ovn.org> Acked-by: William Tu <u9012063@gmail.com>	2019-10-28 13:38:37 +01:00
Ilya Maximets	18ae34ae1f	dpif-netdev: Do not mix recirculation depth into RSS hash itself. Mixing of RSS hash with recirculation depth is useful for flow lookup because same packet after recirculation should match with different datapath rule. Setting of the mixed value back to the packet is completely unnecessary because recirculation depth is different on each recirculation, i.e. we will have different packet hash for flow lookup anyway. This should fix the issue that packets from the same flow could be directed to different buckets based on a dp_hash or different ports of a balanced bonding in case they were recirculated different number of times (e.g. due to conntrack rules). With this change, the original RSS hash will remain the same making it possible to calculate equal dp_hash values for such packets. Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2019-September/363127.html Fixes: 048963aa8507 ("dpif-netdev: Reset RSS hash when recirculating.") Signed-off-by: Ilya Maximets <i.maximets@ovn.org> Acked-by: Jan Scheurich <jan.scheurich@ericsson.com>	2019-10-28 13:38:37 +01:00
Yi-Hung Wei	187bb41fbf	ofproto-dpif-xlate: Translate timeout policy in ct action This patch derives the timeout policy based on ct zone from the internal data structure that we maintain on dpif layer. It also adds a system traffic test to verify the zone-based conntrack timeout feature. The test uses ovs-vsctl commands to configure the customized ICMP and UDP timeout on zone 5 to a shorter period. It then injects ICMP and UDP traffic to conntrack, and checks if the corresponding conntrack entry expires after the predefined timeout. Signed-off-by: Yi-Hung Wei <yihung.wei@gmail.com> ofproto-dpif: Checks if datapath supports OVS_CT_ATTR_TIMEOUT This patch checks whether datapath supports OVS_CT_ATTR_TIMEOUT. With this check, ofproto-dpif-xlate can use this information to decide whether to translate the ct timeout policy. Signed-off-by: Yi-Hung Wei <yihung.wei@gmail.com> Signed-off-by: Justin Pettit <jpettit@ovn.org>	2019-09-26 13:51:04 -07:00
Yi-Hung Wei	ebe62ec1b9	datapath: Add support for conntrack timeout policy This patch adds support for specifying a timeout policy for a connection in connection tracking system in kernel datapath. The timeout policy will be attached to a connection when the connection is committed to conntrack. This patch introduces a new odp field OVS_CT_ATTR_TIMEOUT in the ct action that specifies the timeout policy in the datapath. In the following patch, during the upcall process, the vswitchd will use the ct_zone to look up the corresponding timeout policy and fill OVS_CT_ATTR_TIMEOUT if it is available. The datapath code is from the following two net-next upstream commits. Upstream commit: commit 06bd2bdf19d2f3d22731625e1a47fa1dff5ac407 Author: Yi-Hung Wei <yihung.wei@gmail.com> Date: Tue Mar 26 11:31:14 2019 -0700 openvswitch: Add timeout support to ct action Add support for fine-grain timeout support to conntrack action. The new OVS_CT_ATTR_TIMEOUT attribute of the conntrack action specifies a timeout to be associated with this connection. If no timeout is specified, it acts as is, that is the default timeout for the connection will be automatically applied. Example usage: $ nfct timeout add timeout_1 inet tcp syn_sent 100 established 200 $ ovs-ofctl add-flow br0 in_port=1,ip,tcp,action=ct(commit,timeout=timeout_1) CC: Pravin Shelar <pshelar@ovn.org> CC: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Yi-Hung Wei <yihung.wei@gmail.com> Acked-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net> commit 6d670497e01803b486aa72cc1a718401ab986896 Author: Dan Carpenter <dan.carpenter@oracle.com> Date: Tue Apr 2 09:53:14 2019 +0300 openvswitch: use after free in __ovs_ct_free_action() We free "ct_info->ct" and then use it on the next line when we pass it to nf_ct_destroy_timeout(). This patch swaps the order to avoid the use after free. 
Fixes: 06bd2bdf19d2 ("openvswitch: Add timeout support to ct action") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: Yi-Hung Wei <yihung.wei@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Yi-Hung Wei <yihung.wei@gmail.com> Signed-off-by: Justin Pettit <jpettit@ovn.org>	2019-09-26 13:50:17 -07:00
Yi-Hung Wei	1f16131837	ct-dpif, dpif-netlink: Add conntrack timeout policy support This patch first defines the dpif interface for a datapath to support adding, deleting, getting and dumping conntrack timeout policy. The timeout policy is identified by a 4-byte unsigned integer in the datapath, and it currently supports timeouts for the TCP, UDP, and ICMP protocols. Moreover, this patch provides the implementation for the Linux kernel datapath in dpif-netlink. In the Linux kernel, the timeout policy is maintained per L3/L4 protocol, and it is identified by a 32-byte null-terminated string. On the other hand, in vswitchd, the timeout policy is a generic one that consists of all the supported L4 protocols. Therefore, one of the main tasks in dpif-netlink is to break down the generic timeout policy into 6 sub policies (ipv4 tcp, udp, icmp, and ipv6 tcp, udp, icmp), and push down the configuration using the netlink API in netlink-conntrack.c. This patch also adds missing symbols in the windows datapath so that the build on windows can pass. Appveyor CI: * https://ci.appveyor.com/project/YiHungWei/ovs/builds/26387754 Signed-off-by: Yi-Hung Wei <yihung.wei@gmail.com> Acked-by: Alin Gabriel Serdean <aserdean@ovn.org> Signed-off-by: Justin Pettit <jpettit@ovn.org>	2019-09-26 13:50:17 -07:00
Paul Chaignon	940ac2ce88	treewide: Use packet batch APIs This patch replaces direct accesses to dp_packet_batch and dp_packet internal components by the appropriate API calls. It extends commit 1270b6e52 (treewide: Wider use of packet batch APIs). This patch was generated using the following semantic patch (cf. http://coccinelle.lip6.fr). // <smpl> @ dp_packet @ struct dp_packet_batch b1; struct dp_packet_batch b2; struct dp_packet p; expression e; @@ ( - b1->packets[b1->count++] = p; + dp_packet_batch_add(b1, p); \| - b2.packets[b2.count++] = p; + dp_packet_batch_add(&b2, p); \| - p->packet_type == htonl(PT_ETH) + dp_packet_is_eth(p) \| - p->packet_type != htonl(PT_ETH) + !dp_packet_is_eth(p) \| - b1->count == 0 + dp_packet_batch_is_empty(b1) \| - !b1->count + dp_packet_batch_is_empty(b1) \| b1->count = e; \| b1->count++ \| b2.count = e; \| b2.count++ \| - b1->count + dp_packet_batch_size(b1) \| - b2.count + dp_packet_batch_size(&b2) ) // </smpl> Signed-off-by: Paul Chaignon <paul.chaignon@orange.com> Signed-off-by: Ben Pfaff <blp@ovn.org>	2019-09-25 14:42:00 -07:00
Darrell Ball	64207120c8	conntrack: Add option to disable TCP sequence checking. This may be needed in some special cases, such as to support some hardware offload implementations. Note that disabling TCP sequence number verification is not an optimization in itself, but supporting some hardware offload implementations may offer better performance. TCP sequence number verification is enabled by default. This option is only available for the userspace datapath. Access to this option is presently provided via 'dpctl' commands as the need for this option is quite node specific, by virtue of which nics are in use on a given node. A test is added to verify this option. Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2019-May/359188.html Signed-off-by: Darrell Ball <dlu998@gmail.com> Signed-off-by: Ben Pfaff <blp@ovn.org>	2019-09-25 12:11:32 -07:00
Yifeng Sun	c98eedf9ef	dpif-netdev: Handle uninitialized value error for 'match.wc' Valgrind reported that match.wc was not initialized, as below: 1176: ofproto-dpif - fragment handling - actions ==21214== Conditional jump or move depends on uninitialised value(s) ==21214== at 0x4B77C1: odp_flow_key_from_flow__ (odp-util.c:6143) ==21214== by 0x46DB58: dp_netdev_upcall (dpif-netdev.c:6239) ==21214== by 0x4774A7: handle_packet_upcall (dpif-netdev.c:6608) ==21214== by 0x4774A7: fast_path_processing (dpif-netdev.c:6726) ==21214== by 0x47933C: dp_netdev_input__ (dpif-netdev.c:6814) ==21214== by 0x479AB8: dp_netdev_input (dpif-netdev.c:6852) ==21214== by 0x479AB8: dp_netdev_process_rxq_port (dpif-netdev.c:4287) ==21214== by 0x47A6A9: dpif_netdev_run (dpif-netdev.c:5264) ==21214== by 0x4324E7: type_run (ofproto-dpif.c:342) ==21214== by 0x41C5FE: ofproto_type_run (ofproto.c:1734) ==21214== by 0x40BAAC: bridge_run__ (bridge.c:2965) ==21214== by 0x410CF3: bridge_run (bridge.c:3029) ==21214== by 0x407614: main (ovs-vswitchd.c:127) ==21214== Uninitialised value was created by a stack allocation ==21214== at 0x4769C3: fast_path_processing (dpif-netdev.c:6672) 'match' is allocated on stack but its 'wc' is accessed in odp_flow_key_from_flow__ without proper initialization. This patch fixes it. Acked-by: William Tu <u9012063@gmail.com> Signed-off-by: Yifeng Sun <pkusunyifeng@gmail.com> Signed-off-by: Ben Pfaff <blp@ovn.org>	2019-09-19 09:23:41 -07:00
Ilya Maximets	8afbf2facc	dpif-netdev: Add core id in the PMD thread name. This makes it easy to see on which core a PMD is running by only looking at the thread name. The thread ID still allows distinguishing different threads running on the same core over time: |dpif_netdev(pmd-c10/id:53)|DBG|Creating 2. subtable <...> |dpif_netdev(pmd-c10/id:53)|DBG|flow_add: <...>, actions:2 |dpif_netdev(pmd-c09/id:70)|DBG|Core 9 processing port <..> In gdb, top or any other utility it's useful to quickly find the needed thread without parsing logs, memory or matching threads by the port names they're handling. Signed-off-by: Ilya Maximets <i.maximets@samsung.com> Reviewed-by: David Marchand <david.marchand@redhat.com> Acked-by: Eelco Chaudron <echaudro@redhat.com>	2019-09-06 11:45:39 +03:00
Ilya Maximets	1276e3db89	dpif-netdev-perf: Fix TSC frequency for non-DPDK case. Unlike 'rte_get_tsc_cycles()' which doesn't need any specific initialization, 'rte_get_tsc_hz()' can be used only after a successful call to 'rte_eal_init()'. 'rte_eal_init()' estimates the TSC frequency for later use by 'rte_get_tsc_hz()'. Strictly speaking, we're not allowed to use 'rte_get_tsc_cycles()' before initializing DPDK either, but it works this way for now and provides correct results. This patch provides TSC frequency estimation code that will be used in two cases: * DPDK is not compiled in, i.e. DPDK_NETDEV not defined. * DPDK compiled in but not initialized, i.e. other_config:dpdk-init=false This change is mostly useful for AF_XDP netdev support, i.e. it allows the use of the dpif-netdev/pmd-perf-show command and various PMD perf metrics. Signed-off-by: Ilya Maximets <i.maximets@samsung.com> Reviewed-by: David Marchand <david.marchand@redhat.com> Acked-by: William Tu <u9012063@gmail.com>	2019-09-06 11:45:39 +03:00
Ilya Maximets	3f51ea180b	dpif-netdev: Fail port addition if reconfiguration failed. If the port was destroyed during the initial reconfiguration, we should report an error to the upper layers. Otherwise, successful addition of the port will be logged and upper layers will continue to configure this port. For example, the 'dpif' layer will try to initialize flow API for this device. Fix that by checking for port existence after reconfiguration. We can't get the real error code here, so let's assume EINVAL. 'ovs-vsctl' will tell the user to check the logs for a real reason anyway. Fixes: e32971b8ddb4 ("dpif-netdev: Centralized threads and queues handling code.") Signed-off-by: Ilya Maximets <i.maximets@samsung.com> Acked-by: Ian Stokes <ian.stokes@intel.com>	2019-08-29 18:25:50 +03:00
Harry van Haaren	f54d8f004f	dpif-netdev: Add specialized generic scalar functions This commit adds a number of specialized functions, that handle common miniflow fingerprints. This enables compiler optimization, resulting in higher performance. Below a quick description of how this optimization actually works; "Specialized functions" are "instances" of the generic implementation, but the compiler is given extra context when compiling. In the case of iterating miniflow datastructures, the most interesting value to enable compile time optimizations is the loop trip count per unit. In order to create a specialized function, there is a generic implementation, which uses a for() loop without the compiler knowing the loop trip count at compile time. The loop trip count is passed in as an argument to the function: uint32_t miniflow_impl_generic(struct miniflow mf, uint32_t loop_count) { for(uint32_t i = 0; i < loop_count; i++) // do work } In order to "specialize" the function, we call the generic implementation with hard-coded numbers - these are compile time constants! uint32_t miniflow_impl_loop5(struct miniflow mf, uint32_t loop_count) { // use hard coded constant for compile-time constant-propagation return miniflow_impl_generic(mf, 5); } Given the compiler is aware of the loop trip count at compile time, it can perform an optimization known as "constant propagation". Combined with inlining of the miniflow_impl_generic() function, the compiler is now enabled to compile time unroll the loop 5x, and produce "flat" code. The last step to using the specialized functions is to utilize a function-pointer to choose the specialized (or generic) implementation. The selection of the function pointer is performed at subtable creation time, when miniflow fingerprint of the subtable is known. This technique is known as "multiple dispatch" in some literature, as it uses multiple items of information (miniflow bit counts) to select the dispatch function. 
By pointing the function pointer at the optimized implementation, OvS benefits from the compile time optimizations at runtime. Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com> Tested-by: Malvika Gupta <malvika.gupta@arm.com> Acked-by: Ilya Maximets <i.maximets@samsung.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2019-07-19 12:24:13 +01:00
Harry van Haaren	a0b36b3924	dpif-netdev: Refactor generic implementation This commit refactors the generic implementation. The goal of this refactor is to simplify the code to enable "specialization" of the functions at compile time. Given compile-time optimizations, the compiler is able to unroll loops, and create optimized code sequences due to compile time knowledge of loop-trip counts. In order to enable these compiler optimizations, we must refactor the code to pass the loop-trip counts to functions as compile time constants. This patch allows the number of miniflow-bits set per "unit" in the miniflow to be passed around as a function argument. Note that this patch does NOT yet take advantage of doing so, this is only a refactor to enable it in the next patches. Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com> Tested-by: Malvika Gupta <malvika.gupta@arm.com> Acked-by: Ilya Maximets <i.maximets@samsung.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2019-07-19 12:22:23 +01:00
Harry van Haaren	92c7c870d6	dpif-netdev: Split out generic lookup function This commit splits the generic hash-lookup-verify function to its own file, for cleaner separation between optimized versions. Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com> Tested-by: Malvika Gupta <malvika.gupta@arm.com> Acked-by: Ilya Maximets <i.maximets@samsung.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2019-07-19 12:22:01 +01:00
Harry van Haaren	f5ace7cd8a	dpif-netdev: Move dpcls lookup structures to .h This commit moves some data-structures to be available in the dpif-netdev-private.h header. This allows specific implementations of the subtable lookup function to include just that header file, and not require that the code exists in dpif-netdev.c Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com> Tested-by: Malvika Gupta <malvika.gupta@arm.com> Acked-by: Ilya Maximets <i.maximets@samsung.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2019-07-19 12:21:37 +01:00
Harry van Haaren	aadede3dda	dpif-netdev: Implement function pointers/subtable This allows plugging-in of different subtable hash-lookup-verify routines, and allows special casing of those functions based on known context (eg: # of bits set) of the specific subtable. Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com> Tested-by: Malvika Gupta <malvika.gupta@arm.com> Acked-by: Ilya Maximets <i.maximets@samsung.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2019-07-19 12:21:16 +01:00
Ilya Maximets	ec61d4707b	dpif-netdev: Clarify PMD reloading scheme. It became more complicated, hence needs to be documented. Signed-off-by: Ilya Maximets <i.maximets@samsung.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2019-07-10 13:15:19 +01:00
David Marchand	68a0625b78	dpif-netdev: Catch reloads faster. Looking at the reload flag only every 1024 loops can be a long time under load, since we might be handling 32 packets per rxq, per iteration, which means up to poll_cnt * 32 * 1024 packets. Look at the flag every loop, no major performance impact seen. Signed-off-by: David Marchand <david.marchand@redhat.com> Acked-by: Eelco Chaudron <echaudro@redhat.com> Acked-by: Ilya Maximets <i.maximets@samsung.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2019-07-10 11:51:09 +01:00
David Marchand	e2cafa8692	dpif-netdev: Only reload static tx qid when needed. pmd->static_tx_qid is allocated under a mutex by the different pmd threads. Unconditionally reallocating it will make those pmd threads sleep when contention occurs. During "normal" reloads like for rebalancing queues between pmd threads, this can make pmd threads waste time on this. Reallocating the tx qid is only needed when removing other pmd threads as it is the only situation when the qid pool can become non-contiguous. Add a flag to instruct the pmd to reload tx qid for this case which is Step 1 in current code. Signed-off-by: David Marchand <david.marchand@redhat.com> Acked-by: Eelco Chaudron <echaudro@redhat.com> Acked-by: Ilya Maximets <i.maximets@samsung.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2019-07-10 11:50:21 +01:00
David Marchand	6d9fead107	dpif-netdev: Do not sleep when swapping queues. When swapping queues from a pmd thread to another (q0 polled by pmd0/q1 polled by pmd1 -> q1 polled by pmd0/q0 polled by pmd1), the current "Step 5" puts both pmds to sleep waiting for the control thread to wake them up later. Prefer to make them spin in such a case to avoid sleeping for a nondeterministic amount of time. Signed-off-by: David Marchand <david.marchand@redhat.com> Acked-by: Eelco Chaudron <echaudro@redhat.com> Acked-by: Ilya Maximets <i.maximets@samsung.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2019-07-10 11:47:46 +01:00
David Marchand	8f077b31e9	dpif-netdev: Trigger parallel pmd reloads. pmd reloads are currently serialised in each step that calls reload_affected_pmds. Any pmd processing packets, waiting on a mutex etc... will make other pmd threads wait for a delay that can be nondeterministic when syscalls add up. Switch to a little busy loop on the control thread using the existing per-pmd reload boolean. The memory order on this atomic is rel-acq to have an explicit synchronisation between the pmd threads and the control thread. Signed-off-by: David Marchand <david.marchand@redhat.com> Acked-by: Eelco Chaudron <echaudro@redhat.com> Acked-by: Ilya Maximets <i.maximets@samsung.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2019-07-10 11:46:31 +01:00
David Marchand	299c8d611e	dpif-netdev: Convert exit latch to flag. No need for a latch here since we don't have to wait. A simple boolean flag is enough. The memory order on the reload flag is changed to rel-acq ordering to serve as a synchronisation point between the pmd threads and the control thread that asks for termination. Fixes: e4cfed38b159 ("dpif-netdev: Add poll-mode-device thread.") Signed-off-by: David Marchand <david.marchand@redhat.com> Acked-by: Eelco Chaudron <echaudro@redhat.com> Acked-by: Ian Stokes <ian.stokes@intel.com> Acked-by: Ilya Maximets <i.maximets@samsung.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2019-07-10 11:45:55 +01:00
Ilya Maximets	f87c135706	vswitchd: Always cleanup userspace datapath. 'netdev' datapath is implemented within ovs-vswitchd process and can not exist without it, so it should be gracefully terminated with a full cleanup of resources upon ovs-vswitchd exit. This change forces dpif cleanup for 'netdev' datapath regardless of passing '--cleanup' to 'ovs-appctl exit'. This solution avoids having to pass this additional option every time for userspace datapath installations and also avoids terminating the system datapath in setups where both datapaths run at the same time. The main part is that dpif_port_del() will lead to netdev_close() and subsequent netdev_class->destroy(dev) which will stop HW NICs and free their resources. For vhost-user interfaces it will invoke vhost driver unregistering with a properly closed vhost-user connection. For the upcoming AF_XDP netdev this will allow xdp sockets to be gracefully destroyed and xdp programs to be unloaded from linux interfaces. Another important thing is that port deletion will also trigger flushing of flows offloaded to HW NICs. An exception is made for 'internal' ports that could have user ip/route configuration. These ports will not be removed without '--cleanup'. This change fixes OVS disappearing from the DPDK point of view (keeping HW NICs improperly configured, sudden closing of vhost-user connections) and will help with linux devices clearing with upcoming AF_XDP netdev support. Signed-off-by: Ilya Maximets <i.maximets@samsung.com> Tested-by: William Tu <u9012063@gmail.com> Acked-by: Flavio Leitner <fbl@sysclose.org> Acked-by: Ben Pfaff <blp@ovn.org>	2019-07-02 12:24:47 +03:00
David Marchand	35c91567c8	dpif-netdev: Only poll enabled vhost queues. We currently poll all available queues based on the max queue count exchanged with the vhost peer and rely on the vhost library in DPDK to check the vring status beneath. This can lead to some overhead when we have a lot of unused queues. To enhance the situation, we can skip the disabled queues. On rxq notifications, we make use of the netdev's change_seq number so that the pmd thread main loop can cache the queue state periodically. $ ovs-appctl dpif-netdev/pmd-rxq-show pmd thread numa_id 0 core_id 1: isolated : true port: dpdk0 queue-id: 0 (enabled) pmd usage: 0 % pmd thread numa_id 0 core_id 2: isolated : true port: vhost1 queue-id: 0 (enabled) pmd usage: 0 % port: vhost3 queue-id: 0 (enabled) pmd usage: 0 % pmd thread numa_id 0 core_id 15: isolated : true port: dpdk1 queue-id: 0 (enabled) pmd usage: 0 % pmd thread numa_id 0 core_id 16: isolated : true port: vhost0 queue-id: 0 (enabled) pmd usage: 0 % port: vhost2 queue-id: 0 (enabled) pmd usage: 0 % $ while true; do ovs-appctl dpif-netdev/pmd-rxq-show | awk ' /port: / { tot++; if ($5 == "(enabled)") { en++; } } END { print "total: " tot ", enabled: " en }' sleep 1 done total: 6, enabled: 2 total: 6, enabled: 2 ... # Started vm, virtio devices are bound to kernel driver which enables # F_MQ + all queue pairs total: 6, enabled: 2 total: 66, enabled: 66 ... # Unbound vhost0 and vhost1 from the kernel driver total: 66, enabled: 66 total: 66, enabled: 34 ... # Configured kernel bound devices to use only 1 queue pair total: 66, enabled: 34 total: 66, enabled: 19 total: 66, enabled: 4 ... # While rebooting the vm total: 66, enabled: 4 total: 66, enabled: 2 ... total: 66, enabled: 66 ... # After shutting down the vm total: 66, enabled: 66 total: 66, enabled: 2 Signed-off-by: David Marchand <david.marchand@redhat.com> Acked-by: Ilya Maximets <i.maximets@samsung.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2019-06-26 18:43:39 +01:00
Ilya Maximets	b6cabb8f8f	netdev: Split up netdev offloading to separate module. New module 'netdev-offload' created to manage different flow API implementations. All the generic and provider independent code moved there from the 'netdev' module. Flow API providers further encapsulated. The only function that was changed is 'netdev_any_oor'. Now it uses offloading related hmap instead of common 'netdev_shash'. Signed-off-by: Ilya Maximets <i.maximets@samsung.com> Acked-by: Ben Pfaff <blp@ovn.org> Acked-by: Roi Dayan <roid@mellanox.com>	2019-06-11 09:39:36 +03:00
Ilya Maximets	0da667e345	dpif-netdev: Forbid vport offloading attempts. 'netdev_flow_put()' for vports could eventually succeed for userspace datapath in case there is a kernel datapath with a similar vport at the same time. The root cause is that vports like 'vxlan' use the same 'vxlan_sys_<port>' system interfaces for flow offloading and there is no way to distinguish system and userspace vports using only the 'netdev' structure. Let's forbid vport offloading from userspace datapath to avoid installing userspace flows to unrelated system devices. Future dynamic flow API management will allow vport offloading to be re-enabled using more flexible checks. Fixes: 241bad15d99a ("dpif-netdev: associate flow with a mark id") Reported-by: Ophir Munk <ophirmu@mellanox.com> Acked-By: Roni Bar Yanai <roniba@mellanox.com> Signed-off-by: Ilya Maximets <i.maximets@samsung.com>	2019-06-06 17:23:58 +03:00
Ilya Maximets	0a5cba6591	dpif-netdev: Fix flow mark leak on port lookup failure. Flow mark should be properly freed in all error cases. Fixes: 241bad15d99a ("dpif-netdev: associate flow with a mark id") Acked-By: Roni Bar Yanai <roniba@mellanox.com> Signed-off-by: Ilya Maximets <i.maximets@samsung.com>	2019-06-06 17:23:58 +03:00
Ilya Maximets	eef8538081	dpif-netdev: Fix unsafe access to pmd polling lists. All accesses to 'pmd->poll_list' should be guarded by 'pmd->port_mutex'. Additionally fixed inappropriate usage of 'HMAP_FOR_EACH_SAFE' (hmap doesn't change in a loop) and dropped not needed local variable 'proc_cycles'. CC: Nitin Katiyar <nitin.katiyar@ericsson.com> Fixes: 5bf84282482a ("Adding support for PMD auto load balancing") Signed-off-by: Ilya Maximets <i.maximets@samsung.com> Acked-by: Kevin Traynor <ktraynor@redhat.com> Acked-by: Aaron Conole <aconole@redhat.com>	2019-05-29 14:31:54 +03:00
Darrell Ball	57593fd243	conntrack: Stop exporting internal datastructures. Stop the exporting of the main internal conntrack datastructure. Signed-off-by: Darrell Ball <dlu998@gmail.com> Signed-off-by: Ben Pfaff <blp@ovn.org>	2019-05-03 09:46:22 -07:00
Zhantao Fu	0fcf0776c7	Double postponing to free subtables. Subtable destruction should be double postponed because readers could still obtain old values while iterating over pvector implementation before its new version published. Signed-off-by: Zhantao Fu <fuzhantao@huawei.com> Signed-off-by: Ben Pfaff <blp@ovn.org>	2019-04-23 09:08:51 -07:00
Numan Siddique	5b34f8fc3b	Add a new OVS action check_pkt_larger This patch adds a new action 'check_pkt_larger' which checks if the packet is larger than the given size and stores the result in the destination register. Usage: check_pkt_larger(len)->REGISTER Eg. match=...,actions=check_pkt_larger(1442)->NXM_NX_REG0[0],next; This patch makes use of the new datapath action - 'check_pkt_len' which was recently added in the commit [1]. At the start of ovs-vswitchd, datapath is probed for this action. If the datapath action is present, then 'check_pkt_larger' makes use of this datapath action. Datapath action 'check_pkt_len' takes these nlattrs * OVS_CHECK_PKT_LEN_ATTR_PKT_LEN - 'pkt_len' to check for * OVS_CHECK_PKT_LEN_ATTR_ACTIONS_IF_GREATER (optional) - Nested actions to apply if the packet length is greater than the specified 'pkt_len' * OVS_CHECK_PKT_LEN_ATTR_ACTIONS_IF_LESS_EQUAL (optional) - Nested actions to apply if the packet length is lesser or equal to the specified 'pkt_len'. Let's say we have these flows added to an OVS bridge br-int table=0, priority=100 in_port=1,ip,actions=check_pkt_larger:100->NXM_NX_REG0[0],resubmit(,1) table=1, priority=200,in_port=1,ip,reg0=0x1/0x1 actions=output:3 table=1, priority=100,in_port=1,ip,actions=output:4 Then the action 'check_pkt_larger' will be translated as - check_pkt_len(size=100,gt(3),le(4)) datapath will check the packet length and if the packet length is greater than 100, it will output to port 3, else it will output to port 4. In case, datapath doesn't support 'check_pkt_len' action, the OVS action 'check_pkt_larger' sets SLOW_ACTION so that datapath flow is not added. This OVS action is intended to be used by OVN to check the packet length and generate an ICMP packet with type 3, code 4 and next hop mtu in the logical router pipeline if the MTU of the physical interface is lesser than the packet length. 
More information can be found here [2] [1] - `4d5ec89fc8` [2] - https://mail.openvswitch.org/pipermail/ovs-discuss/2018-July/047039.html Reported-at: https://mail.openvswitch.org/pipermail/ovs-discuss/2018-July/047039.html Suggested-by: Ben Pfaff <blp@ovn.org> Signed-off-by: Numan Siddique <nusiddiq@redhat.com> CC: Ben Pfaff <blp@ovn.org> CC: Gregory Rose <gvrose8192@gmail.com> Acked-by: Mark Michelson <mmichels@redhat.com> Signed-off-by: Ben Pfaff <blp@ovn.org>	2019-04-22 12:56:50 -07:00
William Tu	42697ca775	dpif-netdev: fix meter at high packet rate. When testing packet rate around 1Mpps with meter enabled, the frequency of hitting meter action becomes much higher, around 30us each time. As a result, the meter's calculation of 'uint32_t delta_t' becomes always 0 and meter action has no effect. This is because the previous commit 05f9e707e194 divides the delta by 1000, in order to convert to msec granularity. The patch fixes it by updating the time when a millisecond boundary is crossed. Fixes: 05f9e707e194 ("dpif-netdev: Use microsecond granularity.") Acked-by: Yi-Hung Wei <yihung.wei@gmail.com> Acked-by: Ilya Maximets <i.maximets@samsung.com> Signed-off-by: William Tu <u9012063@gmail.com> Signed-off-by: Ben Pfaff <blp@ovn.org>	2019-04-22 09:51:28 -07:00
Ilya Maximets	af741ca346	dpif-netdev: Update comment about flow installation race. Userspace datapath uses per-PMD flow tables/classifiers for a long time. However, it was decided to keep this race window to not block revalidators. Comment should be updated to reflect the current state. Fixes: 1c1e46ed8457 ("dpif-netdev: Add per-pmd flow-table/classifier.") Signed-off-by: Ilya Maximets <i.maximets@samsung.com> Reviewed-by: Greg Rose <gvrose8192@gmail.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2019-04-18 08:51:46 +01:00
Ilya Maximets	b137383e86	dpif-netdev: Fix double parsing of packets when EMC disabled. This partially reverts commit bde94613e6276d48a6e0be7a592ebcf9836b4aaf. Commit bde94613e627 was aimed to slightly ( < 1%) increase performance in the case where EMC disabled, but it avoids RSS hash calculation and OVS has to calculate it while executing OVS_ACTION_ATTR_HASH in order to handle balanced-tcp bonding. At the time of executing that action there is no parsed flow, and OVS parses the packet for the second time to calculate the hash. This happens for all packets received from the virtual interfaces because they have no HW RSS. Here is the example of 'perf' output for VM --> (bonded PHY) traffic: Samples: 401K of event 'cycles', Event count (approx.): 50964771478 Overhead Shared Object Symbol 27.50% ovs-vswitchd [.] dpcls_lookup.370382 16.30% ovs-vswitchd [.] rte_vhost_dequeue_burst.9267 14.95% ovs-vswitchd [.] miniflow_extract 7.22% ovs-vswitchd [.] flow_extract 7.10% ovs-vswitchd [.] dp_netdev_input__.371002.4826 4.01% ovs-vswitchd [.] fast_path_processing.370987.4893 We can see that the packet is parsed twice. First time by 'miniflow_extract' right after receiving and the second time by 'flow_extract' while executing actions. In this particular case calculating RSS on receive saves > 7% of the total CPU processing time. It varies from ~7 to ~10 % depending on scenario/traffic types. It's better to calculate the hash each time because the performance gain from skipping it is negligible compared with the performance drop when sending packets to a bonded interface. Another solution could be to pass the parsed flow explicitly through the datapath, but this will require big code changes and will have additional overhead for metadata updating on packet changes. Also, this change should have small impact since SMC works well in most cases and will be enabled/recommended by default in the future. 
CC: Antonio Fischetti <antonio.fischetti@intel.com> Fixes: bde94613e627 ("dpif-netdev: Avoid reading RSS hash when EMC is disabled.") Signed-off-by: Ilya Maximets <i.maximets@samsung.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2019-04-18 08:45:16 +01:00
Ilya Maximets	5d1765d30e	dpif-netdev: Reduce log level for not found mark id. It's a normal case for 'find' function, especially because this happens for every first packet of flow that was not offloaded yet. Should not warn about this. Dropped to DBG to avoid log trashing in case of big number of new flows. CC: Yuanhan Liu <yliu@fridaylinux.org> Fixes: 241bad15d99a ("dpif-netdev: associate flow with a mark id") Acked-by: Flavio Leitner <fbl@sysclose.org> Signed-off-by: Ilya Maximets <i.maximets@samsung.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2019-02-27 22:19:03 +00:00
Ben Pfaff	d40533fc82	odp-util: Improve log messages and error reporting for Netlink parsing. As a side effect, this also reduces a lot of log messages' severities from ERR to WARN. They just didn't seem like messages that in general reported anything that would prevent functioning. Signed-off-by: Ben Pfaff <blp@ovn.org>	2019-02-25 15:38:25 -08:00
Darrell Ball	4ea96698f6	Userspace datapath: Add fragmentation handling. Fragmentation handling is added for supporting conntrack. Both v4 and v6 are supported. After discussion with several people, I decided to not store configuration state in the database to be more consistent with the kernel in future, similarity with other conntrack configuration which will not be in the database as well and overall simplicity. Accordingly, fragmentation handling is enabled by default. This patch enables fragmentation tests for the userspace datapath. Signed-off-by: Darrell Ball <dlu998@gmail.com> Signed-off-by: Ben Pfaff <blp@ovn.org>	2019-02-14 14:18:56 -08:00
Darrell Ball	9f17f104fe	dp-packet: Add 'do_not_steal' packet batch flag. This is needed in a subsequent patch and may otherwise be useful. Signed-off-by: Darrell Ball <dlu998@gmail.com> Signed-off-by: Ben Pfaff <blp@ovn.org>	2019-02-14 11:39:22 -08:00
Ilya Maximets	216abd2808	dpif-netdev: Add thread safety annotation to sorted_poll_list. 'sorted_poll_list()' uses the 'pmd->poll_list' that should be guarded by 'pmd->port_mutex'. Signed-off-by: Ilya Maximets <i.maximets@samsung.com> Acked-by: Kevin Traynor <ktraynor@redhat.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2019-02-12 13:23:09 +00:00
Ilya Maximets	2fbadeb665	dpif-netdev: Per-port configurable EMC. Conditional EMC insert helps a lot in scenarios with high numbers of parallel flows, but in the current implementation this option affects all the threads and ports at once. There are scenarios where we have different numbers of flows on different ports. For example, if one of the VMs encapsulates traffic using additional headers, it will receive a large number of flows but only a few flows will come out of this VM. In this scenario it's much faster to use EMC instead of the classifier for traffic from the VM, but it's better to disable EMC for the traffic which flows to the VM. To handle the above issue, an 'emc-enable' configuration option is introduced to enable/disable EMC on a per-port basis. Ex.: ovs-vsctl set interface dpdk0 other_config:emc-enable=false EMC probability is kept as is and it works for all the ports with 'emc-enable=true'. Signed-off-by: Ilya Maximets <i.maximets@samsung.com> Acked-by: Kevin Traynor <ktraynor@redhat.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2019-01-18 11:54:42 +00:00
Nitin Katiyar	5bf8428248	Adding support for PMD auto load balancing. Port rx queues that have not been statically assigned to PMDs are currently assigned based on periodically sampled load measurements. The assignment is performed at specific instances – port addition, port deletion, upon reassignment request via CLI etc. Due to change in traffic pattern over time it can cause uneven load among the PMDs and thus resulting in lower overall throughput. This patch enables the support of auto load balancing of PMDs based on measured load of RX queues. Each PMD measures the processing load for each of its associated queues every 10 seconds. If the aggregated PMD load reaches 95% for 6 consecutive intervals then PMD considers itself to be overloaded. If any PMD is overloaded, a dry-run of the PMD assignment algorithm is performed by OVS main thread. The dry-run does NOT change the existing queue to PMD assignments. If the resultant mapping of dry-run indicates an improved distribution of the load then the actual reassignment will be performed. The automatic rebalancing will be disabled by default and has to be enabled via configuration option. The interval (in minutes) between two consecutive rebalancing can also be configured via CLI, default is 1 min. Following example commands can be used to set the auto-lb params: ovs-vsctl set open_vswitch . other_config:pmd-auto-lb="true" ovs-vsctl set open_vswitch . other_config:pmd-auto-lb-rebalance-intvl="5" Co-authored-by: Rohith Basavaraja <rohith.basavaraja@gmail.com> Co-authored-by: Venkatesan Pradeep <venkatesan.pradeep@ericsson.com> Signed-off-by: Rohith Basavaraja <rohith.basavaraja@gmail.com> Signed-off-by: Venkatesan Pradeep <venkatesan.pradeep@ericsson.com> Signed-off-by: Nitin Katiyar <nitin.katiyar@ericsson.com> Acked-by: Kevin Traynor <ktraynor@redhat.com> Tested-by: Kevin Traynor <ktraynor@redhat.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2019-01-16 10:53:17 +00:00
Ilya Maximets	6c95dbf96b	dpif-netdev: End the quiescent state for flow offloading thread. Flow offloading thread uses concurrent hash maps which are based on rcu protected variables. It must use them while in active state. Working in a quiescent state could cause segmentation faults because of possible cmap internal structure changes. Fixes: 02bb2824e51d ("dpif-netdev: do hw flow offload in a thread") Acked-by: Flavio Leitner <fbl@sysclose.org> Signed-off-by: Ilya Maximets <i.maximets@samsung.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2018-11-02 15:17:19 +00:00
Ilya Maximets	5752eae485	dpif-netdev: Fix cmap node use after free on flow disassociation. Data pointed by cmap node must not be freed while iterating. ovsrcu_postpone should be used instead. CC: Finn Christensen <fc@napatech.com> Fixes: e8a2b5bf92bb ("netdev-dpdk: implement flow offload with rte flow") Signed-off-by: Ilya Maximets <i.maximets@samsung.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2018-11-02 15:13:54 +00:00
Sriharsha Basavapatna via dev	57924fc91c	revalidator: Rebalance offloaded flows based on the pps rate. This is the third patch in the patch-set to support dynamic rebalancing of offloaded flows. The dynamic rebalancing functionality is implemented in this patch. The ukeys that are not scheduled for deletion are obtained and passed as input to the rebalancing routine. The rebalancing is done in the context of revalidation leader thread, after all other revalidator threads are done with gathering rebalancing data for flows. For each netdev that is in OOR state, a list of flows - both offloaded and non-offloaded (pending) - is obtained using the ukeys. For each netdev that is in OOR state, the flows are grouped and sorted into offloaded and pending flows. The offloaded flows are sorted in descending order of pps-rate, while pending flows are sorted in ascending order of pps-rate. The rebalancing is done in two phases. In the first phase, we try to offload all pending flows and if that succeeds, the OOR state on the device is cleared. If some (or none) of the pending flows could not be offloaded, then we start replacing an offloaded flow that has a lower pps-rate than a pending flow, until there are no more pending flows with a higher rate than an offloaded flow. The flows that are replaced from the device are added into kernel datapath. A new OVS configuration parameter "offload-rebalance", is added to ovsdb. The default value of this is "false". To enable this feature, set the value of this parameter to "true", which provides packets-per-second rate based policy to dynamically offload and un-offload flows. Note: This option can be enabled only when 'hw-offload' policy is enabled. It also requires 'tc-policy' to be set to 'skip_sw'; otherwise, flow offload errors (specifically ENOSPC error this feature depends on) reported by an offloaded device are suppressed by TC-Flower kernel module. 
Signed-off-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com> Co-authored-by: Venkat Duvvuru <venkatkumar.duvvuru@broadcom.com> Signed-off-by: Venkat Duvvuru <venkatkumar.duvvuru@broadcom.com> Reviewed-by: Sathya Perla <sathya.perla@broadcom.com> Reviewed-by: Ben Pfaff <blp@ovn.org> Signed-off-by: Simon Horman <simon.horman@netronome.com>	2018-10-19 11:27:52 +02:00
Ilya Maximets	35fe9efb2f	dpif-netdev: Add vlan to mask for flow_put operation. Datapath flows in dpif-netdev classifier always have exact match mask set for vlan. We have to enable it for flow_put operation too in order to avoid flow modification failure due to classifier lookup with wrong hash. Found by OFtest. CC: Jan Scheurich <jan.scheurich@ericsson.com> Fixes: beb75a40fdc2 ("userspace: Switching of L3 packets in L2 pipeline") Reported-by: Ben Pfaff <blp@ovn.org> Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2018-September/352579.html Signed-off-by: Ilya Maximets <i.maximets@samsung.com> Signed-off-by: Ben Pfaff <blp@ovn.org>	2018-10-09 10:26:39 -07:00
Kevin Traynor	e77c97b9d6	dpif-netdev: Add round-robin based rxq to pmd assignment. Prior to OVS 2.9 automatic assignment of Rxqs to PMDs (i.e. CPUs) was done by round-robin. That was changed in OVS 2.9 to ordering the Rxqs based on their measured processing cycles. This was to assign the busiest Rxqs to different PMDs, improving aggregate throughput. For the most part the new scheme should be better, but there could be situations where a user prefers a simple round-robin scheme because Rxqs from a single port are more likely to be spread across multiple PMDs, and/or traffic is very bursty/unpredictable. Add 'pmd-rxq-assign' config to allow a user to select round-robin based assignment. Signed-off-by: Kevin Traynor <ktraynor@redhat.com> Acked-by: Eelco Chaudron <echaudro@redhat.com> Acked-by: Ilya Maximets <i.maximets@samsung.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2018-09-14 11:45:05 +01:00
Gavi Teitz	a692410af0	dpctl: Expand the flow dump type filter. Added new types to the flow dump filter, and allowed multiple filter types to be passed at once, as a comma separated list. The new types added are: * tc - specifies flows handled by the tc dp * non-offloaded - specifies flows not offloaded to the HW * all - specifies flows of all types The type list is now fully parsed by the dpctl, and a new struct was added to dpif which enables dpctl to define which types of dumps to provide, rather than passing the type string and having dpif parse it. Signed-off-by: Gavi Teitz <gavi@mellanox.com> Acked-by: Roi Dayan <roid@mellanox.com> Signed-off-by: Simon Horman <simon.horman@netronome.com>	2018-09-13 16:56:25 +02:00
Gavi Teitz	0d6b401cf6	dpif-netdev: Initialize dpif_flow attrs. In a previous commit, the dpif_flow struct was expanded, with the 'offloaded' field being moved into a new struct which also includes a field for the dp layer the flow is handled on. The initialization of these fields was only done in dpif-netlink. This completes that commit, by initializing the fields in dpif-netdev as well. As the 'offloaded' field was previously ignored by dpif-netdev, the attrs are initialized to the default values of 'false' for the offloaded state, and 'ovs' for the dp layer. Fixes: d63ca5329ff9 ("dpctl: Properly reflect a rule's offloaded to HW state") Signed-off-by: Gavi Teitz <gavi@mellanox.com> Acked-by: Roi Dayan <roid@mellanox.com> Signed-off-by: Simon Horman <simon.horman@netronome.com>	2018-09-13 16:56:25 +02:00
Justin Pettit	866bc7567a	dpif-netdev: Prevent unsafe access when retrieving meter stats. dpif_netdev_meter_get() retrieved a pointer to a meter entry without holding a lock. It's possible that another thread could have deleted that entry between retrieving the pointer and dereferencing the pointer. This makes the function hold the lock the entire time the meter entry is needed. Found by inspection. Signed-off-by: Justin Pettit <jpettit@ovn.org> Acked-by: Flavio Leitner <fbl@sysclose.org>	2018-09-04 13:36:37 -07:00
Justin Pettit	d0db81eac8	dpif-netdev: Don't check if xcalloc() failed when creating meter. xcalloc() can't return null. Signed-off-by: Justin Pettit <jpettit@ovn.org> Acked-by: Flavio Leitner <fbl@sysclose.org> Acked-by: Ben Pfaff <blp@ovn.org>	2018-09-04 13:36:37 -07:00
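
The PMD auto load balancing commit (5bf8428248) in the log above describes an overload rule: a PMD considers itself overloaded once its aggregated load reaches 95% for 6 consecutive measurement intervals. The following is only a minimal sketch of that rule in Python, not the OVS implementation (which is written in C). The class name `PmdLoadTracker` and the method `record_interval` are hypothetical names chosen for illustration; only the two thresholds come from the commit message.

```python
# Thresholds taken from the commit message; everything else is illustrative.
LOAD_THRESHOLD_PCT = 95   # aggregated PMD load that counts as "busy"
OVERLOAD_INTERVALS = 6    # consecutive busy intervals before "overloaded"


class PmdLoadTracker:
    """Hypothetical helper that tracks consecutive busy intervals of one PMD."""

    def __init__(self):
        self.consecutive_busy = 0

    def record_interval(self, load_pct):
        """Record one interval's aggregated load; return True if overloaded."""
        if load_pct >= LOAD_THRESHOLD_PCT:
            self.consecutive_busy += 1
        else:
            # Any interval below the threshold resets the streak.
            self.consecutive_busy = 0
        return self.consecutive_busy >= OVERLOAD_INTERVALS


tracker = PmdLoadTracker()
history = [tracker.record_interval(load) for load in [96, 97, 95, 99, 98, 96]]
```

In this sketch only the sixth consecutive busy interval reports an overload, and a single quieter interval resets the count. Per the commit message, an overload only triggers a dry run of the assignment algorithm; the actual reassignment happens only if the dry run shows an improved distribution.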

812 Commits
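
The offload rebalancing commit (57924fc91c) in the log above describes a two-phase procedure: first try to offload every pending flow, then replace offloaded flows with higher-rate pending flows until no pending flow has a higher pps rate than an offloaded one. The sketch below models that idea in Python under strong simplifying assumptions: the device is reduced to a fixed number of flow slots (`capacity`), whereas the commit message says the real trigger is an ENOSPC error from the device, and the function name `rebalance` and its arguments are hypothetical. It is an illustration of the described policy, not the revalidator code.

```python
def rebalance(offloaded, pending, capacity):
    """Sketch of the two-phase rebalancing for one device in OOR state.

    offloaded, pending: dicts mapping a flow id to its pps rate.
    capacity: assumed number of flow slots on the device (a simplification).
    Returns (offloaded, pending, still_oor).
    """
    by_rate = lambda item: item[1]
    # Sort orders as described in the commit message.
    off = sorted(offloaded.items(), key=by_rate, reverse=True)  # descending
    pend = sorted(pending.items(), key=by_rate)                 # ascending

    # Phase 1: try to offload all pending flows while the device has room.
    while pend and len(off) < capacity:
        off.append(pend.pop())
    if not pend:
        return dict(off), {}, False  # everything fit: OOR state is cleared

    # Phase 2: swap the lowest-rate offloaded flow with the highest-rate
    # pending flow for as long as the pending flow has the higher rate.
    off.sort(key=by_rate, reverse=True)
    while off and pend and pend[-1][1] > off[-1][1]:
        evicted, promoted = off.pop(), pend.pop()
        off.append(promoted)
        off.sort(key=by_rate, reverse=True)
        # In OVS the replaced flow goes to the kernel datapath; this model
        # simply files it under the non-offloaded flows.
        pend.append(evicted)
        pend.sort(key=by_rate)
    return dict(off), dict(pend), True


swapped = rebalance({"a": 100, "b": 10}, {"c": 50, "d": 5}, capacity=2)
roomy = rebalance({"a": 100, "b": 10}, {"c": 50, "d": 5}, capacity=4)
```

With two slots, flow `c` (50 pps) displaces flow `b` (10 pps) and the device stays in the OOR state; with four slots, phase 1 offloads everything and the state is cleared.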