mir/ovs - ovs - Mike's Git repositories

mirror of https://github.com/openvswitch/ovs synced 2025-08-29 05:18:13 +00:00

Author	SHA1	Message	Date
Harry van Haaren	f54d8f004f	dpif-netdev: Add specialized generic scalar functions This commit adds a number of specialized functions that handle common miniflow fingerprints. This enables compiler optimization, resulting in higher performance. Below is a quick description of how this optimization actually works; "Specialized functions" are "instances" of the generic implementation, but the compiler is given extra context when compiling. In the case of iterating miniflow datastructures, the most interesting value to enable compile time optimizations is the loop trip count per unit. In order to create a specialized function, there is a generic implementation, which uses a for() loop without the compiler knowing the loop trip count at compile time. The loop trip count is passed in as an argument to the function: uint32_t miniflow_impl_generic(struct miniflow mf, uint32_t loop_count) { for(uint32_t i = 0; i < loop_count; i++) /* do work */ } In order to "specialize" the function, we call the generic implementation with hard-coded numbers - these are compile time constants! uint32_t miniflow_impl_loop5(struct miniflow mf, uint32_t loop_count) { /* use hard coded constant for compile-time constant-propagation */ return miniflow_impl_generic(mf, 5); } Given the compiler is aware of the loop trip count at compile time, it can perform an optimization known as "constant propagation". Combined with inlining of the miniflow_impl_generic() function, the compiler is now enabled to compile time unroll the loop 5x, and produce "flat" code. The last step to using the specialized functions is to utilize a function-pointer to choose the specialized (or generic) implementation. The selection of the function pointer is performed at subtable creation time, when miniflow fingerprint of the subtable is known. This technique is known as "multiple dispatch" in some literature, as it uses multiple items of information (miniflow bit counts) to select the dispatch function.
By pointing the function pointer at the optimized implementation, OvS benefits from the compile time optimizations at runtime. Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com> Tested-by: Malvika Gupta <malvika.gupta@arm.com> Acked-by: Ilya Maximets <i.maximets@samsung.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2019-07-19 12:24:13 +01:00
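The specialization idea described in the commit above can be sketched in a few lines of C. This is only an illustration, not the code from the commit: `struct toy_miniflow`, `toy_lookup_generic()`, `toy_lookup_5()` and `toy_pick_lookup()` are hypothetical names, and the "work" in the loop is a placeholder hash.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a miniflow: a plain array of 64-bit blocks. */
struct toy_miniflow {
    const uint64_t *blocks;
};

/* Generic implementation: the loop trip count is a run-time argument. */
static inline uint32_t
toy_lookup_generic(const struct toy_miniflow *mf, uint32_t loop_count)
{
    assert(mf && mf->blocks);

    uint32_t hash = 0;
    for (uint32_t i = 0; i < loop_count; i++) {
        hash = hash * 31 + (uint32_t) mf->blocks[i];   /* placeholder work */
    }
    return hash;
}

/* Specialized instance: the constant 5 is visible to the compiler, so after
 * inlining it is free to unroll the loop. */
static uint32_t
toy_lookup_5(const struct toy_miniflow *mf, uint32_t loop_count)
{
    (void) loop_count;                 /* ignored, the constant is used */
    return toy_lookup_generic(mf, 5);
}

typedef uint32_t (*toy_lookup_func)(const struct toy_miniflow *, uint32_t);

/* Chosen once, when the number of bits set is known (assumed to be the
 * equivalent of "subtable creation time" in this sketch). */
static toy_lookup_func
toy_pick_lookup(uint32_t bits_set)
{
    return bits_set == 5 ? toy_lookup_5 : toy_lookup_generic;
}
```

A caller would store the result of `toy_pick_lookup()` next to the table it describes and call through the pointer on every lookup. Whether the compiler actually unrolls the loop depends on the optimization level.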
Harry van Haaren	a0b36b3924	dpif-netdev: Refactor generic implementation This commit refactors the generic implementation. The goal of this refactor is to simplify the code to enable "specialization" of the functions at compile time. Given compile-time optimizations, the compiler is able to unroll loops, and create optimized code sequences due to compile time knowledge of loop-trip counts. In order to enable these compiler optimizations, we must refactor the code to pass the loop-trip counts to functions as compile time constants. This patch allows the number of miniflow-bits set per "unit" in the miniflow to be passed around as a function argument. Note that this patch does NOT yet take advantage of doing so, this is only a refactor to enable it in the next patches. Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com> Tested-by: Malvika Gupta <malvika.gupta@arm.com> Acked-by: Ilya Maximets <i.maximets@samsung.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2019-07-19 12:22:23 +01:00
Harry van Haaren	92c7c870d6	dpif-netdev: Split out generic lookup function This commit splits the generic hash-lookup-verify function to its own file, for cleaner separation between optimized versions. Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com> Tested-by: Malvika Gupta <malvika.gupta@arm.com> Acked-by: Ilya Maximets <i.maximets@samsung.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2019-07-19 12:22:01 +01:00
Harry van Haaren	f5ace7cd8a	dpif-netdev: Move dpcls lookup structures to .h This commit moves some data-structures to be available in the dpif-netdev-private.h header. This allows specific implementations of the subtable lookup function to include just that header file, and not require that the code exists in dpif-netdev.c Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com> Tested-by: Malvika Gupta <malvika.gupta@arm.com> Acked-by: Ilya Maximets <i.maximets@samsung.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2019-07-19 12:21:37 +01:00
Harry van Haaren	aadede3dda	dpif-netdev: Implement function pointers/subtable This allows plugging-in of different subtable hash-lookup-verify routines, and allows special casing of those functions based on known context (eg: # of bits set) of the specific subtable. Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com> Tested-by: Malvika Gupta <malvika.gupta@arm.com> Acked-by: Ilya Maximets <i.maximets@samsung.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2019-07-19 12:21:16 +01:00
Ilya Maximets	ec61d4707b	dpif-netdev: Clarify PMD reloading scheme. It became more complicated, hence needs to be documented. Signed-off-by: Ilya Maximets <i.maximets@samsung.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2019-07-10 13:15:19 +01:00
David Marchand	68a0625b78	dpif-netdev: Catch reloads faster. Looking at the reload flag only every 1024 loops can be a long time under load, since we might be handling 32 packets per rxq, per iteration, which means up to poll_cnt * 32 * 1024 packets. Look at the flag every loop, no major performance impact seen. Signed-off-by: David Marchand <david.marchand@redhat.com> Acked-by: Eelco Chaudron <echaudro@redhat.com> Acked-by: Ilya Maximets <i.maximets@samsung.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2019-07-10 11:51:09 +01:00
David Marchand	e2cafa8692	dpif-netdev: Only reload static tx qid when needed. pmd->static_tx_qid is allocated under a mutex by the different pmd threads. Unconditionally reallocating it will make those pmd threads sleep when contention occurs. During "normal" reloads like for rebalancing queues between pmd threads, this can make pmd threads waste time on this. Reallocating the tx qid is only needed when removing other pmd threads as it is the only situation when the qid pool can become non-contiguous. Add a flag to instruct the pmd to reload tx qid for this case which is Step 1 in current code. Signed-off-by: David Marchand <david.marchand@redhat.com> Acked-by: Eelco Chaudron <echaudro@redhat.com> Acked-by: Ilya Maximets <i.maximets@samsung.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2019-07-10 11:50:21 +01:00
David Marchand	6d9fead107	dpif-netdev: Do not sleep when swapping queues. When swapping queues from one pmd thread to another (q0 polled by pmd0/q1 polled by pmd1 -> q1 polled by pmd0/q0 polled by pmd1), the current "Step 5" puts both pmds to sleep waiting for the control thread to wake them up later. Prefer to make them spin in such a case to avoid sleeping a nondeterministic amount of time. Signed-off-by: David Marchand <david.marchand@redhat.com> Acked-by: Eelco Chaudron <echaudro@redhat.com> Acked-by: Ilya Maximets <i.maximets@samsung.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2019-07-10 11:47:46 +01:00
David Marchand	8f077b31e9	dpif-netdev: Trigger parallel pmd reloads. pmd reloads are currently serialised in each step calling reload_affected_pmds. Any pmd processing packets, waiting on a mutex etc... will make other pmd threads wait for a delay that can be nondeterministic when syscalls add up. Switch to a little busy loop on the control thread using the existing per-pmd reload boolean. The memory order on this atomic is rel-acq to have an explicit synchronisation between the pmd threads and the control thread. Signed-off-by: David Marchand <david.marchand@redhat.com> Acked-by: Eelco Chaudron <echaudro@redhat.com> Acked-by: Ilya Maximets <i.maximets@samsung.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2019-07-10 11:46:31 +01:00
David Marchand	299c8d611e	dpif-netdev: Convert exit latch to flag. No need for a latch here since we don't have to wait. A simple boolean flag is enough. The memory order on the reload flag is changed to rel-acq ordering to serve as a synchronisation point between the pmd threads and the control thread that asks for termination. Fixes: e4cfed38b159 ("dpif-netdev: Add poll-mode-device thread.") Signed-off-by: David Marchand <david.marchand@redhat.com> Acked-by: Eelco Chaudron <echaudro@redhat.com> Acked-by: Ian Stokes <ian.stokes@intel.com> Acked-by: Ilya Maximets <i.maximets@samsung.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2019-07-10 11:45:55 +01:00
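The rel-acq ordering on the reload flag mentioned in the two commits above can be illustrated with C11 atomics. This is a minimal sketch under assumptions, not OVS code: `struct toy_pmd`, `toy_request_reload()` and `toy_poll_reload()` are hypothetical, and OVS uses its own atomic wrappers rather than `<stdatomic.h>` directly.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct toy_pmd {
    int new_config;        /* Plain data written by the control thread. */
    atomic_bool reload;    /* Flag that publishes the data above. */
};

/* Control thread: write the data, then set the flag with release order so
 * the write to 'new_config' cannot be reordered after the flag store. */
static void
toy_request_reload(struct toy_pmd *pmd, int config)
{
    pmd->new_config = config;
    atomic_store_explicit(&pmd->reload, true, memory_order_release);
}

/* pmd thread: called every loop iteration.  The acquire load pairs with the
 * release store above, so once the flag reads true, 'new_config' is visible.
 * Clearing the flag with release order lets a spinning control thread
 * observe that the reload has been picked up. */
static bool
toy_poll_reload(struct toy_pmd *pmd, int *config)
{
    assert(config);

    if (!atomic_load_explicit(&pmd->reload, memory_order_acquire)) {
        return false;
    }
    *config = pmd->new_config;
    atomic_store_explicit(&pmd->reload, false, memory_order_release);
    return true;
}
```

In a real setup the control thread would spin with an acquire load until the flag reads false again, which is the "little busy loop" the commit message refers to.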
Ilya Maximets	f87c135706	vswitchd: Always cleanup userspace datapath. 'netdev' datapath is implemented within ovs-vswitchd process and cannot exist without it, so it should be gracefully terminated with a full cleanup of resources upon ovs-vswitchd exit. This change forces dpif cleanup for 'netdev' datapath regardless of passing '--cleanup' to 'ovs-appctl exit'. This solution removes the need to pass this additional option every time for userspace datapath installations and also allows not terminating the system datapath in setups where both datapaths run at the same time. The main part is that dpif_port_del() will lead to netdev_close() and subsequent netdev_class->destroy(dev) which will stop HW NICs and free their resources. For vhost-user interfaces it will invoke vhost driver unregistering with a properly closed vhost-user connection. For upcoming AF_XDP netdev this will allow to gracefully destroy xdp sockets and unload xdp programs from linux interfaces. Another important thing is that port deletion will also trigger flushing of flows offloaded to HW NICs. Exception made for 'internal' ports that could have user ip/route configuration. These ports will not be removed without '--cleanup'. This change fixes OVS disappearing from the DPDK point of view (keeping HW NICs improperly configured, sudden closing of vhost-user connections) and will help with linux devices clearing with upcoming AF_XDP netdev support. Signed-off-by: Ilya Maximets <i.maximets@samsung.com> Tested-by: William Tu <u9012063@gmail.com> Acked-by: Flavio Leitner <fbl@sysclose.org> Acked-by: Ben Pfaff <blp@ovn.org>	2019-07-02 12:24:47 +03:00
David Marchand	35c91567c8	dpif-netdev: Only poll enabled vhost queues. We currently poll all available queues based on the max queue count exchanged with the vhost peer and rely on the vhost library in DPDK to check the vring status beneath. This can lead to some overhead when we have a lot of unused queues. To enhance the situation, we can skip the disabled queues. On rxq notifications, we make use of the netdev's change_seq number so that the pmd thread main loop can cache the queue state periodically. $ ovs-appctl dpif-netdev/pmd-rxq-show pmd thread numa_id 0 core_id 1: isolated : true port: dpdk0 queue-id: 0 (enabled) pmd usage: 0 % pmd thread numa_id 0 core_id 2: isolated : true port: vhost1 queue-id: 0 (enabled) pmd usage: 0 % port: vhost3 queue-id: 0 (enabled) pmd usage: 0 % pmd thread numa_id 0 core_id 15: isolated : true port: dpdk1 queue-id: 0 (enabled) pmd usage: 0 % pmd thread numa_id 0 core_id 16: isolated : true port: vhost0 queue-id: 0 (enabled) pmd usage: 0 % port: vhost2 queue-id: 0 (enabled) pmd usage: 0 % $ while true; do ovs-appctl dpif-netdev/pmd-rxq-show | awk ' /port: / { tot++; if ($5 == "(enabled)") { en++; } } END { print "total: " tot ", enabled: " en }'; sleep 1; done total: 6, enabled: 2 total: 6, enabled: 2 ... # Started vm, virtio devices are bound to kernel driver which enables # F_MQ + all queue pairs total: 6, enabled: 2 total: 66, enabled: 66 ... # Unbound vhost0 and vhost1 from the kernel driver total: 66, enabled: 66 total: 66, enabled: 34 ... # Configured kernel bound devices to use only 1 queue pair total: 66, enabled: 34 total: 66, enabled: 19 total: 66, enabled: 4 ... # While rebooting the vm total: 66, enabled: 4 total: 66, enabled: 2 ... total: 66, enabled: 66 ... # After shutting down the vm total: 66, enabled: 66 total: 66, enabled: 2 Signed-off-by: David Marchand <david.marchand@redhat.com> Acked-by: Ilya Maximets <i.maximets@samsung.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2019-06-26 18:43:39 +01:00
Ilya Maximets	b6cabb8f8f	netdev: Split up netdev offloading to separate module. New module 'netdev-offload' created to manage different flow API implementations. All the generic and provider independent code moved there from the 'netdev' module. Flow API providers further encapsulated. The only function that was changed is 'netdev_any_oor'. Now it uses offloading related hmap instead of common 'netdev_shash'. Signed-off-by: Ilya Maximets <i.maximets@samsung.com> Acked-by: Ben Pfaff <blp@ovn.org> Acked-by: Roi Dayan <roid@mellanox.com>	2019-06-11 09:39:36 +03:00
Ilya Maximets	0da667e345	dpif-netdev: Forbid vport offloading attempts. 'netdev_flow_put()' for vports could eventually succeed for userspace datapath in case there is a kernel datapath with similar vport at the same time. The root cause is that vports like 'vxlan' use the same 'vxlan_sys_<port>' system interfaces for flow offloading and there is no way to distinguish system and userspace vports using only 'netdev' structure. Let's forbid vport offloading from userspace datapath to avoid installing userspace flows to unrelated system devices. Future dynamic flow API management will allow re-enabling vport offloading using more flexible checks. Fixes: 241bad15d99a ("dpif-netdev: associate flow with a mark id") Reported-by: Ophir Munk <ophirmu@mellanox.com> Acked-By: Roni Bar Yanai <roniba@mellanox.com> Signed-off-by: Ilya Maximets <i.maximets@samsung.com>	2019-06-06 17:23:58 +03:00
Ilya Maximets	0a5cba6591	dpif-netdev: Fix flow mark leak on port lookup failure. Flow mark should be properly freed in all error cases. Fixes: 241bad15d99a ("dpif-netdev: associate flow with a mark id") Acked-By: Roni Bar Yanai <roniba@mellanox.com> Signed-off-by: Ilya Maximets <i.maximets@samsung.com>	2019-06-06 17:23:58 +03:00
Ilya Maximets	eef8538081	dpif-netdev: Fix unsafe access to pmd polling lists. All accesses to 'pmd->poll_list' should be guarded by 'pmd->port_mutex'. Additionally fixed inappropriate usage of 'HMAP_FOR_EACH_SAFE' (hmap doesn't change in a loop) and dropped not needed local variable 'proc_cycles'. CC: Nitin Katiyar <nitin.katiyar@ericsson.com> Fixes: 5bf84282482a ("Adding support for PMD auto load balancing") Signed-off-by: Ilya Maximets <i.maximets@samsung.com> Acked-by: Kevin Traynor <ktraynor@redhat.com> Acked-by: Aaron Conole <aconole@redhat.com>	2019-05-29 14:31:54 +03:00
Darrell Ball	57593fd243	conntrack: Stop exporting internal datastructures. Stop the exporting of the main internal conntrack datastructure. Signed-off-by: Darrell Ball <dlu998@gmail.com> Signed-off-by: Ben Pfaff <blp@ovn.org>	2019-05-03 09:46:22 -07:00
Zhantao Fu	0fcf0776c7	Double postponing to free subtables. Subtable destruction should be double postponed because readers could still obtain old values while iterating over pvector implementation before its new version published. Signed-off-by: Zhantao Fu <fuzhantao@huawei.com> Signed-off-by: Ben Pfaff <blp@ovn.org>	2019-04-23 09:08:51 -07:00
Numan Siddique	5b34f8fc3b	Add a new OVS action check_pkt_larger This patch adds a new action 'check_pkt_larger' which checks if the packet is larger than the given size and stores the result in the destination register. Usage: check_pkt_larger(len)->REGISTER Eg. match=...,actions=check_pkt_larger(1442)->NXM_NX_REG0[0],next; This patch makes use of the new datapath action - 'check_pkt_len' which was recently added in the commit [1]. At the start of ovs-vswitchd, datapath is probed for this action. If the datapath action is present, then 'check_pkt_larger' makes use of this datapath action. Datapath action 'check_pkt_len' takes these nlattrs * OVS_CHECK_PKT_LEN_ATTR_PKT_LEN - 'pkt_len' to check for * OVS_CHECK_PKT_LEN_ATTR_ACTIONS_IF_GREATER (optional) - Nested actions to apply if the packet length is greater than the specified 'pkt_len' * OVS_CHECK_PKT_LEN_ATTR_ACTIONS_IF_LESS_EQUAL (optional) - Nested actions to apply if the packet length is less than or equal to the specified 'pkt_len'. Let's say we have these flows added to an OVS bridge br-int table=0, priority=100 in_port=1,ip,actions=check_pkt_larger:100->NXM_NX_REG0[0],resubmit(,1) table=1, priority=200,in_port=1,ip,reg0=0x1/0x1 actions=output:3 table=1, priority=100,in_port=1,ip,actions=output:4 Then the action 'check_pkt_larger' will be translated as - check_pkt_len(size=100,gt(3),le(4)) datapath will check the packet length and if the packet length is greater than 100, it will output to port 3, else it will output to port 4. If the datapath doesn't support the 'check_pkt_len' action, the OVS action 'check_pkt_larger' sets SLOW_ACTION so that the datapath flow is not added. This OVS action is intended to be used by OVN to check the packet length and generate an ICMP packet with type 3, code 4 and next hop mtu in the logical router pipeline if the MTU of the physical interface is less than the packet length.
More information can be found here [2] [1] - `4d5ec89fc8` [2] - https://mail.openvswitch.org/pipermail/ovs-discuss/2018-July/047039.html Reported-at: https://mail.openvswitch.org/pipermail/ovs-discuss/2018-July/047039.html Suggested-by: Ben Pfaff <blp@ovn.org> Signed-off-by: Numan Siddique <nusiddiq@redhat.com> CC: Ben Pfaff <blp@ovn.org> CC: Gregory Rose <gvrose8192@gmail.com> Acked-by: Mark Michelson <mmichels@redhat.com> Signed-off-by: Ben Pfaff <blp@ovn.org>	2019-04-22 12:56:50 -07:00
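The semantics of the example flows in the commit above (set bit 0 of reg0 when the packet is longer than 100 bytes, then output to port 3 or 4) can be modelled in plain C. This is only a toy model of the behaviour described in the message; `toy_check_pkt_larger()` and `toy_table1_output()` are hypothetical helpers and do not correspond to functions in OVS.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of check_pkt_larger(limit)->REG[bit]: set the bit when the packet is
 * strictly larger than 'limit', clear it otherwise. */
static uint32_t
toy_check_pkt_larger(uint32_t reg, unsigned int bit, size_t pkt_len,
                     size_t limit)
{
    assert(bit < 32);

    uint32_t mask = UINT32_C(1) << bit;
    return pkt_len > limit ? (reg | mask) : (reg & ~mask);
}

/* Model of table 1 in the example: reg0=0x1/0x1 outputs to port 3, the
 * lower-priority catch-all flow outputs to port 4. */
static int
toy_table1_output(uint32_t reg0)
{
    return (reg0 & 0x1) ? 3 : 4;
}
```

With a limit of 100, a 101-byte packet goes to port 3 and a 100-byte packet goes to port 4, which matches the `gt(3),le(4)` translation quoted in the message.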
William Tu	42697ca775	dpif-netdev: fix meter at high packet rate. When testing a packet rate of around 1 Mpps with a meter enabled, the meter action is hit much more frequently, about once every 30 us. As a result, the meter's calculation of 'uint32_t delta_t' is always 0 and the meter action has no effect. This is because the previous commit 05f9e707e194 divides the delta by 1000 in order to convert it to msec granularity. The patch fixes this by updating the time when crossing a millisecond boundary. Fixes: 05f9e707e194 ("dpif-netdev: Use microsecond granularity.") Acked-by: Yi-Hung Wei <yihung.wei@gmail.com> Acked-by: Ilya Maximets <i.maximets@samsung.com> Signed-off-by: William Tu <u9012063@gmail.com> Signed-off-by: Ben Pfaff <blp@ovn.org>	2019-04-22 09:51:28 -07:00
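The arithmetic behind the meter problem above is easy to reproduce. The sketch below compares two ways of turning microsecond timestamps into a millisecond delta: subtracting first and then dividing (which truncates to 0 for gaps under 1000 us), and dividing each timestamp first (which yields 1 whenever a millisecond boundary is crossed). All names are hypothetical, and the second form is an assumption about how such a fix could look, not a quotation of the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Subtract, then truncate: any gap shorter than 1000 us becomes 0 ms. */
static uint32_t
toy_delta_ms_truncating(long long now_us, long long used_us)
{
    assert(now_us >= used_us);
    return (uint32_t) ((now_us - used_us) / 1000);
}

/* Truncate each timestamp, then subtract: the delta is non-zero exactly when
 * a millisecond boundary lies between the two timestamps. */
static uint32_t
toy_delta_ms_boundary(long long now_us, long long used_us)
{
    assert(now_us >= used_us);
    return (uint32_t) (now_us / 1000 - used_us / 1000);
}

/* Sum the per-hit deltas for 'n_hits' meter hits spaced 'gap_us' apart,
 * updating the "last used" timestamp on every hit. */
static uint32_t
toy_total_ms(uint32_t (*delta)(long long, long long), long long start_us,
             long long gap_us, int n_hits)
{
    long long used = start_us;
    uint32_t total = 0;

    for (int i = 1; i <= n_hits; i++) {
        long long now = start_us + i * gap_us;
        total += delta(now, used);
        used = now;
    }
    return total;
}
```

For 100 hits spaced 30 us apart (3000 us in total), the truncating form accumulates 0 ms while the boundary-aware form accumulates 3 ms, so a token bucket fed by the second form still refills.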
Ilya Maximets	af741ca346	dpif-netdev: Update comment about flow installation race. Userspace datapath uses per-PMD flow tables/classifiers for a long time. However, it was decided to keep this race window to not block revalidators. Comment should be updated to reflect the current state. Fixes: 1c1e46ed8457 ("dpif-netdev: Add per-pmd flow-table/classifier.") Signed-off-by: Ilya Maximets <i.maximets@samsung.com> Reviewed-by: Greg Rose <gvrose8192@gmail.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2019-04-18 08:51:46 +01:00
Ilya Maximets	b137383e86	dpif-netdev: Fix double parsing of packets when EMC disabled. This partially reverts commit bde94613e6276d48a6e0be7a592ebcf9836b4aaf. Commit bde94613e627 was aimed to slightly ( < 1%) increase performance in the case where EMC is disabled, but it avoids RSS hash calculation and OVS has to calculate it while executing OVS_ACTION_ATTR_HASH in order to handle balanced-tcp bonding. At the time of executing that action there is no parsed flow, and OVS parses the packet for the second time to calculate the hash. This happens for all packets received from the virtual interfaces because they have no HW RSS. Here is the example of 'perf' output for VM --> (bonded PHY) traffic: Samples: 401K of event 'cycles', Event count (approx.): 50964771478 Overhead Shared Object Symbol 27.50% ovs-vswitchd [.] dpcls_lookup.370382 16.30% ovs-vswitchd [.] rte_vhost_dequeue_burst.9267 14.95% ovs-vswitchd [.] miniflow_extract 7.22% ovs-vswitchd [.] flow_extract 7.10% ovs-vswitchd [.] dp_netdev_input__.371002.4826 4.01% ovs-vswitchd [.] fast_path_processing.370987.4893 We can see that the packet is parsed twice. First time by 'miniflow_extract' right after receiving and the second time by 'flow_extract' while executing actions. In this particular case calculating RSS on receive saves > 7% of the total CPU processing time. It varies from ~7 to ~10 % depending on scenario/traffic types. It's better to calculate the hash each time because the performance improvement from avoiding it is negligible compared with the performance drop when sending packets to a bonded interface. Another solution could be to pass the parsed flow explicitly through the datapath, but this will require big code changes and will have additional overhead for metadata updating on packet changes. Also, this change should have small impact since SMC works well in most cases and will be enabled/recommended by default in the future.
CC: Antonio Fischetti <antonio.fischetti@intel.com> Fixes: bde94613e627 ("dpif-netdev: Avoid reading RSS hash when EMC is disabled.") Signed-off-by: Ilya Maximets <i.maximets@samsung.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2019-04-18 08:45:16 +01:00
Ilya Maximets	5d1765d30e	dpif-netdev: Reduce log level for not found mark id. It's a normal case for 'find' function, especially because this happens for every first packet of flow that was not offloaded yet. Should not warn about this. Dropped to DBG to avoid log trashing in case of big number of new flows. CC: Yuanhan Liu <yliu@fridaylinux.org> Fixes: 241bad15d99a ("dpif-netdev: associate flow with a mark id") Acked-by: Flavio Leitner <fbl@sysclose.org> Signed-off-by: Ilya Maximets <i.maximets@samsung.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2019-02-27 22:19:03 +00:00
Ben Pfaff	d40533fc82	odp-util: Improve log messages and error reporting for Netlink parsing. As a side effect, this also reduces a lot of log messages' severities from ERR to WARN. They just didn't seem like messages that in general reported anything that would prevent functioning. Signed-off-by: Ben Pfaff <blp@ovn.org>	2019-02-25 15:38:25 -08:00
Darrell Ball	4ea96698f6	Userspace datapath: Add fragmentation handling. Fragmentation handling is added for supporting conntrack. Both v4 and v6 are supported. After discussion with several people, I decided to not store configuration state in the database to be more consistent with the kernel in future, similarity with other conntrack configuration which will not be in the database as well and overall simplicity. Accordingly, fragmentation handling is enabled by default. This patch enables fragmentation tests for the userspace datapath. Signed-off-by: Darrell Ball <dlu998@gmail.com> Signed-off-by: Ben Pfaff <blp@ovn.org>	2019-02-14 14:18:56 -08:00
Darrell Ball	9f17f104fe	dp-packet: Add 'do_not_steal' packet batch flag. This is needed in a subsequent patch and may otherwise be useful. Signed-off-by: Darrell Ball <dlu998@gmail.com> Signed-off-by: Ben Pfaff <blp@ovn.org>	2019-02-14 11:39:22 -08:00
Ilya Maximets	216abd2808	dpif-netdev: Add thread safety annotation to sorted_poll_list. 'sorted_poll_list()' uses the 'pmd->poll_list' that should be guarded by 'pmd->port_mutex'. Signed-off-by: Ilya Maximets <i.maximets@samsung.com> Acked-by: Kevin Traynor <ktraynor@redhat.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2019-02-12 13:23:09 +00:00
Ilya Maximets	2fbadeb665	dpif-netdev: Per-port configurable EMC. Conditional EMC insert helps a lot in scenarios with high numbers of parallel flows, but in current implementation this option affects all the threads and ports at once. There are scenarios where we have different number of flows on different ports. For example, if one of the VMs encapsulates traffic using additional headers, it will receive large number of flows but only few flows will come out of this VM. In this scenario it's much faster to use EMC instead of classifier for traffic from the VM, but it's better to disable EMC for the traffic which flows to VM. To handle above issue introduced 'emc-enable' configurable to enable/disable EMC on a per-port basis. Ex.: ovs-vsctl set interface dpdk0 other_config:emc-enable=false EMC probability kept as is and it works for all the ports with 'emc-enable=true'. Signed-off-by: Ilya Maximets <i.maximets@samsung.com> Acked-by: Kevin Traynor <ktraynor@redhat.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2019-01-18 11:54:42 +00:00
Nitin Katiyar	5bf8428248	Adding support for PMD auto load balancing Port rx queues that have not been statically assigned to PMDs are currently assigned based on periodically sampled load measurements. The assignment is performed at specific instances – port addition, port deletion, upon reassignment request via CLI etc. Changes in traffic pattern over time can cause uneven load among the PMDs, resulting in lower overall throughput. This patch enables the support of auto load balancing of PMDs based on measured load of RX queues. Each PMD measures the processing load for each of its associated queues every 10 seconds. If the aggregated PMD load reaches 95% for 6 consecutive intervals then PMD considers itself to be overloaded. If any PMD is overloaded, a dry-run of the PMD assignment algorithm is performed by OVS main thread. The dry-run does NOT change the existing queue to PMD assignments. If the resultant mapping of dry-run indicates an improved distribution of the load then the actual reassignment will be performed. The automatic rebalancing will be disabled by default and has to be enabled via configuration option. The interval (in minutes) between two consecutive rebalancing can also be configured via CLI, default is 1 min. Following example commands can be used to set the auto-lb params: ovs-vsctl set open_vswitch . other_config:pmd-auto-lb="true" ovs-vsctl set open_vswitch . other_config:pmd-auto-lb-rebalance-intvl="5" Co-authored-by: Rohith Basavaraja <rohith.basavaraja@gmail.com> Co-authored-by: Venkatesan Pradeep <venkatesan.pradeep@ericsson.com> Signed-off-by: Rohith Basavaraja <rohith.basavaraja@gmail.com> Signed-off-by: Venkatesan Pradeep <venkatesan.pradeep@ericsson.com> Signed-off-by: Nitin Katiyar <nitin.katiyar@ericsson.com> Acked-by: Kevin Traynor <ktraynor@redhat.com> Tested-by: Kevin Traynor <ktraynor@redhat.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2019-01-16 10:53:17 +00:00
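The overload rule stated in the commit above (aggregated load reaching 95% for 6 consecutive intervals) amounts to a small counter. The following is a sketch of that rule only, with hypothetical names; it assumes the counter resets as soon as one interval falls below the threshold, which is how "consecutive" is read here, and it does not model the dry-run or the reassignment.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_LOAD_THRESHOLD_PCT 95   /* Threshold from the commit message. */
#define TOY_OVERLOAD_INTERVALS 6    /* Consecutive intervals required. */

struct toy_pmd_load {
    int overloaded_intervals;       /* Current run of high-load intervals. */
};

/* Record the load measured for one interval and report whether the PMD
 * should now consider itself overloaded. */
static bool
toy_record_interval(struct toy_pmd_load *pmd, int load_pct)
{
    assert(load_pct >= 0 && load_pct <= 100);

    if (load_pct >= TOY_LOAD_THRESHOLD_PCT) {
        pmd->overloaded_intervals++;
    } else {
        pmd->overloaded_intervals = 0;   /* Assumed: the run must be unbroken. */
    }
    return pmd->overloaded_intervals >= TOY_OVERLOAD_INTERVALS;
}
```

With 10-second intervals, a PMD following this rule would report overload after one minute of sustained load at or above the threshold.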
Ilya Maximets	6c95dbf96b	dpif-netdev: End the quiescent state for flow offloading thread. Flow offloading thread uses concurrent hash maps which are based on rcu protected variables. It must use them while in active state. Working in a quiescent state could cause segmentation faults because of possible cmap internal structure changes. Fixes: 02bb2824e51d ("dpif-netdev: do hw flow offload in a thread") Acked-by: Flavio Leitner <fbl@sysclose.org> Signed-off-by: Ilya Maximets <i.maximets@samsung.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2018-11-02 15:17:19 +00:00
Ilya Maximets	5752eae485	dpif-netdev: Fix cmap node use after free on flow disassociation. Data pointed by cmap node must not be freed while iterating. ovsrcu_postpone should be used instead. CC: Finn Christensen <fc@napatech.com> Fixes: e8a2b5bf92bb ("netdev-dpdk: implement flow offload with rte flow") Signed-off-by: Ilya Maximets <i.maximets@samsung.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2018-11-02 15:13:54 +00:00
Sriharsha Basavapatna via dev	57924fc91c	revalidator: Rebalance offloaded flows based on the pps rate This is the third patch in the patch-set to support dynamic rebalancing of offloaded flows. The dynamic rebalancing functionality is implemented in this patch. The ukeys that are not scheduled for deletion are obtained and passed as input to the rebalancing routine. The rebalancing is done in the context of revalidation leader thread, after all other revalidator threads are done with gathering rebalancing data for flows. For each netdev that is in OOR state, a list of flows - both offloaded and non-offloaded (pending) - is obtained using the ukeys. For each netdev that is in OOR state, the flows are grouped and sorted into offloaded and pending flows. The offloaded flows are sorted in descending order of pps-rate, while pending flows are sorted in ascending order of pps-rate. The rebalancing is done in two phases. In the first phase, we try to offload all pending flows and if that succeeds, the OOR state on the device is cleared. If some (or none) of the pending flows could not be offloaded, then we start replacing an offloaded flow that has a lower pps-rate than a pending flow, until there are no more pending flows with a higher rate than an offloaded flow. The flows that are replaced from the device are added into kernel datapath. A new OVS configuration parameter "offload-rebalance", is added to ovsdb. The default value of this is "false". To enable this feature, set the value of this parameter to "true", which provides packets-per-second rate based policy to dynamically offload and un-offload flows. Note: This option can be enabled only when 'hw-offload' policy is enabled. It also requires 'tc-policy' to be set to 'skip_sw'; otherwise, flow offload errors (specifically ENOSPC error this feature depends on) reported by an offloaded device are suppressed by TC-Flower kernel module.
Signed-off-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com> Co-authored-by: Venkat Duvvuru <venkatkumar.duvvuru@broadcom.com> Signed-off-by: Venkat Duvvuru <venkatkumar.duvvuru@broadcom.com> Reviewed-by: Sathya Perla <sathya.perla@broadcom.com> Reviewed-by: Ben Pfaff <blp@ovn.org> Signed-off-by: Simon Horman <simon.horman@netronome.com>	2018-10-19 11:27:52 +02:00
Ilya Maximets	35fe9efb2f	dpif-netdev: Add vlan to mask for flow_put operation. Datapath flows in the dpif-netdev classifier always have an exact match mask set for vlan. We have to enable it for flow_put operation too in order to avoid flow modification failure due to classifier lookup with wrong hash. Found by OFtest. CC: Jan Scheurich <jan.scheurich@ericsson.com> Fixes: beb75a40fdc2 ("userspace: Switching of L3 packets in L2 pipeline") Reported-by: Ben Pfaff <blp@ovn.org> Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2018-September/352579.html Signed-off-by: Ilya Maximets <i.maximets@samsung.com> Signed-off-by: Ben Pfaff <blp@ovn.org>	2018-10-09 10:26:39 -07:00
Kevin Traynor	e77c97b9d6	dpif-netdev: Add round-robin based rxq to pmd assignment. Prior to OVS 2.9 automatic assignment of Rxqs to PMDs (i.e. CPUs) was done by round-robin. That was changed in OVS 2.9 to ordering the Rxqs based on their measured processing cycles. This was to assign the busiest Rxqs to different PMDs, improving aggregate throughput. For the most part the new scheme should be better, but there could be situations where a user prefers a simple round-robin scheme because Rxqs from a single port are more likely to be spread across multiple PMDs, and/or traffic is very bursty/unpredictable. Add 'pmd-rxq-assign' config to allow a user to select round-robin based assignment. Signed-off-by: Kevin Traynor <ktraynor@redhat.com> Acked-by: Eelco Chaudron <echaudro@redhat.com> Acked-by: Ilya Maximets <i.maximets@samsung.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2018-09-14 11:45:05 +01:00
Gavi Teitz	a692410af0	dpctl: Expand the flow dump type filter Added new types to the flow dump filter, and allowed multiple filter types to be passed at once, as a comma separated list. The new types added are: * tc - specifies flows handled by the tc dp * non-offloaded - specifies flows not offloaded to the HW * all - specifies flows of all types The type list is now fully parsed by the dpctl, and a new struct was added to dpif which enables dpctl to define which types of dumps to provide, rather than passing the type string and having dpif parse it. Signed-off-by: Gavi Teitz <gavi@mellanox.com> Acked-by: Roi Dayan <roid@mellanox.com> Signed-off-by: Simon Horman <simon.horman@netronome.com>	2018-09-13 16:56:25 +02:00
Gavi Teitz	0d6b401cf6	dpif-netdev: Initialize dpif_flow attrs In a previous commit, the dpif_flow struct was expanded, with the 'offloaded' field being moved into a new struct which also includes a field for the dp layer the flow is handled on. The initialization of these fields was only done in dpif-netlink. This completes that commit, by initializing the fields in dpif-netdev as well. As the 'offloaded' field was previously ignored by dpif-netdev, the attrs are initialized to the default values of 'false' for the offloaded state, and 'ovs' for the dp layer. Fixes: d63ca5329ff9 ("dpctl: Properly reflect a rule's offloaded to HW state") Signed-off-by: Gavi Teitz <gavi@mellanox.com> Acked-by: Roi Dayan <roid@mellanox.com> Signed-off-by: Simon Horman <simon.horman@netronome.com>	2018-09-13 16:56:25 +02:00
Justin Pettit	866bc7567a	dpif-netdev: Prevent unsafe access when retrieving meter stats. dpif_netdev_meter_get() retrieved a pointer to a meter entry without holding a lock. It's possible that another thread could have deleted that entry between retrieving the pointer and dereferencing the pointer. This makes the function hold the lock the entire time the meter entry is needed. Found by inspection. Signed-off-by: Justin Pettit <jpettit@ovn.org> Acked-by: Flavio Leitner <fbl@sysclose.org>	2018-09-04 13:36:37 -07:00
Justin Pettit	d0db81eac8	dpif-netdev: Don't check if xcalloc() failed when creating meter. xcalloc() can't return null. Signed-off-by: Justin Pettit <jpettit@ovn.org> Acked-by: Flavio Leitner <fbl@sysclose.org> Acked-by: Ben Pfaff <blp@ovn.org>	2018-09-04 13:36:37 -07:00
Vishal Deep Ajmera	9b4f08cdca	dpif-netdev: Avoid reordering of packets in a batch with same megaflow OVS reads packets in batches from a given port and packets in the batch are subjected to potentially 3 levels of lookups to identify the datapath megaflow entry (or flow) associated with the packet. Each megaflow entry has a dedicated buffer in which packets that match the flow classification criteria are collected. This buffer helps OVS perform batch processing for all packets associated with a given flow. Each packet in the received batch is first subjected to lookup in the Exact Match Cache (EMC). Each EMC entry will point to a flow. If the EMC lookup is successful, the packet is moved from the rx batch to the per-flow buffer. Packets that did not match any EMC entry are rearranged in the rx batch at the beginning and are now subjected to a lookup in the megaflow cache. Packets that match a megaflow cache entry are appended to the per-flow buffer. Packets that do not match any megaflow entry are subjected to slow-path processing through the upcall mechanism. This cannot change the order of packets as by definition upcall processing is only done for packets without matching megaflow entry. The EMC entry match fields encompass all potentially significant header fields, typically more than specified in the associated flow's match criteria. Hence, multiple EMC entries can point to the same flow. Given that per-flow batching happens at each lookup stage, packets belonging to the same megaflow can get re-ordered because some packets match EMC entries while others do not. The following example can illustrate the issue better. Consider following batch of packets (labelled P1 to P8) associated with a single TCP connection and associated with a single flow. Let us assume that packets with just the ACK bit set in TCP flags have been received in a prior batch also and a corresponding EMC entry exists. 1. P1 (TCP Flag: ACK) 2. P2 (TCP Flag: ACK) 3. P3 (TCP Flag: ACK) 4. P4 (TCP Flag: ACK, PSH) 5. P5 (TCP Flag: ACK) 6. P6 (TCP Flag: ACK) 7. P7 (TCP Flag: ACK) 8. P8 (TCP Flag: ACK) The megaflow classification criteria does not include TCP flags while the EMC match criteria does. Thus, all packets other than P4 match the existing EMC entry and are moved to the per-flow packet batch. Subsequently, packet P4 is moved to the same per-flow packet batch as a result of the megaflow lookup. Though the packets have all been correctly classified as being associated with the same flow, the packet order has not been preserved because of the per-flow batching performed during the EMC lookup stage. This packet re-ordering has performance implications for TCP applications. This patch preserves the packet ordering by performing the per-flow batching after both the EMC and megaflow lookups are complete. As an optimization, packets are flow-batched in emc processing till any packet in the batch has an EMC miss. A new flow map is maintained to keep the original order of packet along with flow information. Post fastpath processing, packets from flow map are appended to per-flow buffer. Signed-off-by: Vishal Deep Ajmera <vishal.deep.ajmera@ericsson.com> Co-authored-by: Venkatesan Pradeep <venkatesan.pradeep@ericsson.com> Signed-off-by: Venkatesan Pradeep <venkatesan.pradeep@ericsson.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2018-08-27 17:48:23 +01:00
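The P1 to P8 example in the commit above can be reproduced with a small model. This is only a sketch under simplifying assumptions: a single flow, a boolean standing in for the EMC lookup result, and hypothetical names (`struct pkt`, `batch_per_stage`, `batch_after_lookups`, `MAX_BATCH`) that do not come from dpif-netdev. It also omits the optimization of batching directly until the first EMC miss.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_BATCH 32   /* Assumed batch size limit for this sketch. */

/* Hypothetical packet model: an id plus the outcome of the EMC lookup. */
struct pkt {
    int id;
    bool emc_hit;
};

/* Old behaviour: packets are appended to the per-flow buffer at each
 * lookup stage, so EMC hits overtake packets resolved by the megaflow
 * lookup.  Returns the number of packets written to 'flow_buf'. */
static size_t
batch_per_stage(const struct pkt *rx, size_t n, int *flow_buf)
{
    size_t k = 0;

    for (size_t i = 0; i < n; i++) {        /* EMC stage. */
        if (rx[i].emc_hit) {
            flow_buf[k++] = rx[i].id;
        }
    }
    for (size_t i = 0; i < n; i++) {        /* Megaflow stage. */
        if (!rx[i].emc_hit) {
            flow_buf[k++] = rx[i].id;
        }
    }
    return k;
}

/* New behaviour: both stages only record their result in a flow map
 * slot indexed by the packet's rx position; the per-flow buffer is
 * filled from the map once both lookups are complete. */
static size_t
batch_after_lookups(const struct pkt *rx, size_t n, int *flow_buf)
{
    int flow_map[MAX_BATCH];

    if (n > MAX_BATCH) {
        n = MAX_BATCH;
    }
    for (size_t i = 0; i < n; i++) {        /* EMC stage. */
        if (rx[i].emc_hit) {
            flow_map[i] = rx[i].id;
        }
    }
    for (size_t i = 0; i < n; i++) {        /* Megaflow stage. */
        if (!rx[i].emc_hit) {
            flow_map[i] = rx[i].id;
        }
    }
    for (size_t i = 0; i < n; i++) {        /* Append in rx order. */
        flow_buf[i] = flow_map[i];
    }
    return n;
}
```

For a batch in which only P4 misses the EMC, the first function yields P1, P2, P3, P5, P6, P7, P8, P4, while the second keeps the original P1 to P8 order.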
Yi-Hung Wei	cd015a11c2	dpif: Support conntrack zone limit. This patch defines the dpif interface to support conntrack per zone limit. Basically, OVS users can use this interface to set, delete, and get the conntrack per zone limit for various dpif interfaces. The following patch will make use of the proposed interface to implement the feature. Signed-off-by: Yi-Hung Wei <yihung.wei@gmail.com> Signed-off-by: Justin Pettit <jpettit@ovn.org>	2018-08-17 09:30:55 -07:00
Justin Pettit	8101f03fcd	dpif: Don't pass in '*meter_id' to meter_set commands. The original intent of the API appears to be that the underlying DPIF implementation would choose a local meter id. However, neither of the existing datapath meter implementations (userspace or Linux) implemented that; they expected a valid meter id to be passed in, otherwise they returned an error. This commit follows the existing implementations and makes the API somewhat cleaner. Signed-off-by: Justin Pettit <jpettit@ovn.org> Acked-by: Ben Pfaff <blp@ovn.org>	2018-08-16 10:20:52 -07:00
Ilya Maximets	18e08953cf	dpif-netdev: Fix zero length keys insertion to EMC. 'key.len' should be calculated before inserting to EMC, otherwise resulting entry will match with any packet with the same hash. CC: Yipeng Wang <yipeng1.wang@intel.com> Fixes: 60d8ccae135f ("dpif-netdev: Add SMC cache after EMC cache") Signed-off-by: Ilya Maximets <i.maximets@samsung.com> Acked-by: Yipeng Wang <yipeng1.wang@intel.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2018-08-08 22:06:21 +01:00
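Why a zero 'key.len' makes an entry match any packet with the same hash can be seen in a simplified model of a length-bounded key comparison. The struct layout and helper names below are assumptions made for illustration, not the dpif-netdev definitions; the point is only that comparing zero bytes always succeeds.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical cache key: a hash, the number of valid bytes, and the bytes. */
struct cache_key {
    uint32_t hash;
    uint32_t len;
    uint8_t buf[16];
};

/* Builds a key from a short string; 'len' is set explicitly so the
 * "forgot to compute len" case can be reproduced. */
static struct cache_key
make_key(uint32_t hash, const char *bytes, uint32_t len)
{
    struct cache_key k;
    size_t n = strlen(bytes);

    memset(&k, 0, sizeof k);
    k.hash = hash;
    k.len = len;
    memcpy(k.buf, bytes, n < sizeof k.buf ? n : sizeof k.buf);
    return k;
}

/* Compares hashes, then the first 'entry->len' bytes.  With len == 0,
 * memcmp() compares nothing and reports equality. */
static bool
key_equal(const struct cache_key *entry, const struct cache_key *pkt)
{
    return entry->hash == pkt->hash
           && !memcmp(entry->buf, pkt->buf, entry->len);
}
```

An entry inserted with len 0 therefore matches a packet whose bytes differ, as long as the hashes collide; with the length computed first, the same comparison correctly fails.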
Justin Pettit	6508c845ad	dpif: Move common meter checks into the dpif layer. Another dpif provider will soon add support for meters, so move some of the common sanity checks up into the dpif layer so that each provider doesn't need to re-implement them. Signed-off-by: Justin Pettit <jpettit@ovn.org> Acked-by: Ben Pfaff <blp@ovn.org>	2018-07-30 13:00:49 -07:00
Justin Pettit	f603d7d262	Revert "dpif-netdev: Use compatible function type to fix broken build." Commit ab15e70eb587 ("dpctl: Expand the flow dump type filter") will be reverted, which this patch fixed, so it needs to be reverted as well. This reverts commit b10ac772218afd4f296db866f6b80258e1d1ca8a. CC: Gavi Teitz <gavi@mellanox.com> CC: Simon Horman <simon.horman@netronome.com> CC: Roi Dayan <roid@mellanox.com> CC: Aaron Conole <aconole@redhat.com> Signed-off-by: Justin Pettit <jpettit@ovn.org> Acked-by: Ben Pfaff <blp@ovn.org>	2018-07-25 14:17:33 -07:00
Aaron Conole	b10ac77221	dpif-netdev: Use compatible function type to fix broken build. The dpif_provder flow_dump_create function signature was changed, but the netdev dpif was not updated along with it. This generated a build error with the following warnings: libtool: compile: gcc -std=gnu99 -DHAVE_CONFIG_H -I. -I ./include -I ./include -I ./lib -I ./lib -Wstrict-prototypes -Wall -Wextra -Wno-sign-compare -Wpointer-arith -Wformat -Wformat-security -Wswitch-enum -Wunused-parameter -Wbad-function-cast -Wcast-align -Wstrict-prototypes -Wold-style-definition -Wmissing-prototypes -Wmissing-field-initializers -fno-strict-aliasing -Wshadow -Wno-null-pointer-arithmetic -Werror -Werror -g -O2 -MT lib/dpif-netdev.lo -MD -MP -MF lib/.deps/dpif-netdev.Tpo -c lib/dpif-netdev.c -o lib/dpif-netdev.o lib/dpif-netdev.c:6812:5: error: initialization from incompatible pointer type [-Werror] dpif_netdev_flow_dump_create, ^ lib/dpif-netdev.c:6812:5: error: (near initialization for 'dpif_netdev_class.flow_dump_create') [-Werror] Fixes: ab15e70eb587 ("dpctl: Expand the flow dump type filter") Cc: Gavi Teitz <gavi@mellanox.com> Cc: Roi Dayan <roid@mellanox.com> Cc: Simon Horman <simon.horman@netronome.com> Signed-off-by: Aaron Conole <aconole@redhat.com> Signed-off-by: Ben Pfaff <blp@ovn.org>	2018-07-25 11:35:23 -07:00
Yipeng Wang	60d8ccae13	dpif-netdev: Add SMC cache after EMC cache This patch adds a signature match cache (SMC) after the exact match cache (EMC). The difference between SMC and EMC is that SMC stores only a signature of a flow, so it is much more memory efficient. With the same memory space, EMC can store 8k flows while SMC can store 1M flows. It is generally beneficial to turn on SMC but turn off EMC when the traffic flow count is much larger than the EMC size. The SMC cache maps a signature to a dp_netdev_flow index in flow_table. Thus, we add two new APIs in cmap: lookup key by index and lookup index by key. For now, SMC is an experimental feature that is turned off by default. One can turn it on using ovsdb options. Signed-off-by: Yipeng Wang <yipeng1.wang@intel.com> Co-authored-by: Jan Scheurich <jan.scheurich@ericsson.com> Signed-off-by: Jan Scheurich <jan.scheurich@ericsson.com> Acked-by: Billy O'Mahony <billy.o.mahony@intel.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2018-07-24 17:01:03 +01:00
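The idea of caching only a signature plus an index into the flow table, as the commit above describes, might look roughly like the following. Everything here is an assumption made for illustration: the 16-bit signature width, the direct-mapped layout, the table size, and all names (`struct smc`, `smc_insert`, `smc_lookup`, `NO_FLOW`). The flow table is reduced to an array of 32-bit keys.

```c
#include <stdint.h>

#define SMC_ENTRIES 8          /* Assumed size; a real cache is far larger. */
#define NO_FLOW UINT16_MAX     /* Marks an empty slot. */

/* One cache slot: a short signature and an index into the flow table,
 * instead of a full copy of the flow key. */
struct smc_entry {
    uint16_t sig;
    uint16_t flow_idx;
};

struct smc {
    struct smc_entry e[SMC_ENTRIES];
};

static void
smc_init(struct smc *c)
{
    for (int i = 0; i < SMC_ENTRIES; i++) {
        c->e[i].sig = 0;
        c->e[i].flow_idx = NO_FLOW;
    }
}

/* Low hash bits pick the slot, high bits form the signature. */
static void
smc_insert(struct smc *c, uint32_t hash, uint16_t flow_idx)
{
    struct smc_entry *e = &c->e[hash % SMC_ENTRIES];

    e->sig = (uint16_t) (hash >> 16);
    e->flow_idx = flow_idx;
}

/* Returns the flow index on a hit, or -1 on a miss.  Because signatures
 * can collide, the candidate flow's full key is checked in the table. */
static int
smc_lookup(const struct smc *c, const uint32_t *flow_keys,
           uint32_t hash, uint32_t key)
{
    const struct smc_entry *e = &c->e[hash % SMC_ENTRIES];

    if (e->flow_idx != NO_FLOW
        && e->sig == (uint16_t) (hash >> 16)
        && flow_keys[e->flow_idx] == key) {
        return e->flow_idx;
    }
    return -1;
}
```

Each slot in this sketch takes 4 bytes regardless of how large a flow key is, which is the memory trade-off the commit message points to; the cost is the extra key verification against the flow table on every signature match.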
Justin Pettit	425a7b9eaf	dpif-netdev: Fix a couple of comments for dp_netdev_run_meter(). Signed-off-by: Justin Pettit <jpettit@ovn.org> Acked-by: Ben Pfaff <blp@ovn.org>	2018-07-06 14:23:27 -07:00
Yuanhan Liu	02bb2824e5	dpif-netdev: do hw flow offload in a thread Currently, the major trigger for hw flow offload is at upcall handling, which is actually in the datapath. Moreover, the hw offload installation and modification is not that lightweight. This means that if many flows are added or modified frequently, it could stall the datapath, which could result in packet loss. To mitigate that, all those flow operations will be recorded and appended to a list. A thread is then introduced to process this list (to do the real flow offloading put/del operations). This could leave the datapath as lightweight as possible. Signed-off-by: Yuanhan Liu <yliu@fridaylinux.org> Co-authored-by: Shahaf Shuler <shahafs@mellanox.com> Signed-off-by: Shahaf Shuler <shahafs@mellanox.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2018-07-06 10:32:52 +01:00
Yuanhan Liu	aab96ec4d8	dpif-netdev: retrieve flow directly from the flow mark This lets us skip some very costly CPU operations, including but not limited to miniflow_extract, emc lookup, dpcls lookup, etc. Thus, performance could be greatly improved. A PHY-PHY forwarding test with 1000 mega flows (udp,tp_src=1000-1999) and 1 million streams (tp_src=1000-1999, tp_dst=2000-2999) shows more than a 260% performance boost. Note that though the heavy miniflow_extract is skipped, we still have to do per-packet checking, because we have to check the tcp_flags. Co-authored-by: Finn Christensen <fc@napatech.com> Signed-off-by: Yuanhan Liu <yliu@fridaylinux.org> Signed-off-by: Finn Christensen <fc@napatech.com> Co-authored-by: Shahaf Shuler <shahafs@mellanox.com> Signed-off-by: Shahaf Shuler <shahafs@mellanox.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2018-07-06 10:32:52 +01:00

651 Commits