mir/ovs - ovs - Mike's Git repositories

mir/ovs

mirror of https://github.com/openvswitch/ovs synced 2025-08-29 13:27:59 +00:00

Author	SHA1	Message	Date
Ilya Maximets	b137383e86	dpif-netdev: Fix double parsing of packets when EMC disabled. This partially reverts commit bde94613e6276d48a6e0be7a592ebcf9836b4aaf. Commit bde94613e627 was aimed to slightly ( < 1%) increase performance in the case where EMC disabled, but it avoids RSS hash calculation and OVS has to calculate it while executing OVS_ACTION_ATTR_HASH in order to handle balanced-tcp bonding. At the time of executing that action there is no parsed flow, and OVS parses the packet for the second time to calculate the hash. This happens for all packets received from the virtual interfaces because they have no HW RSS. Here is the example of 'perf' output for VM --> (bonded PHY) traffic: Samples: 401K of event 'cycles', Event count (approx.): 50964771478 Overhead Shared Object Symbol 27.50% ovs-vswitchd [.] dpcls_lookup.370382 16.30% ovs-vswitchd [.] rte_vhost_dequeue_burst.9267 14.95% ovs-vswitchd [.] miniflow_extract 7.22% ovs-vswitchd [.] flow_extract 7.10% ovs-vswitchd [.] dp_netdev_input__.371002.4826 4.01% ovs-vswitchd [.] fast_path_processing.370987.4893 We can see that packet parsed twice. First time by 'miniflow_extract' right after receiving and the second time by 'flow_extract' while executing actions. In this particular case calculating RSS on receive saves > 7% of the total CPU processing time. It varies from ~7 to ~10 % depending on scenario/traffic types. It's better to calculate hash each time because performance improvements of avoiding are negligible in compare with performance drop in case of sending packets to bonded interface. Another solution could be to pass the parsed flow explicitly through the datapath, but this will require big code changes and will have additional overhead for metadata updating on packet changes. Also, this change should have small impact since SMC works well in most cases and will be enabled/recommended by default in the future. CC: Antonio Fischetti <antonio.fischetti@intel.com> Fixes: bde94613e627 ("dpif-netdev: Avoid reading RSS hash when EMC is disabled.") Signed-off-by: Ilya Maximets <i.maximets@samsung.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2019-04-18 08:45:16 +01:00
Ilya Maximets	5d1765d30e	dpif-netdev: Reduce log level for not found mark id. It's a normal case for 'find' function, especially because this happens for every first packet of flow that was not offloaded yet. Should not warn about this. Dropped to DBG to avoid log trashing in case of big number of new flows. CC: Yuanhan Liu <yliu@fridaylinux.org> Fixes: 241bad15d99a ("dpif-netdev: associate flow with a mark id") Acked-by: Flavio Leitner <fbl@sysclose.org> Signed-off-by: Ilya Maximets <i.maximets@samsung.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2019-02-27 22:19:03 +00:00
Ben Pfaff	d40533fc82	odp-util: Improve log messages and error reporting for Netlink parsing. As a side effect, this also reduces a lot of log messages' severities from ERR to WARN. They just didn't seem like messages that in general reported anything that would prevent functioning. Signed-off-by: Ben Pfaff <blp@ovn.org>	2019-02-25 15:38:25 -08:00
Darrell Ball	4ea96698f6	Userspace datapath: Add fragmentation handling. Fragmentation handling is added for supporting conntrack. Both v4 and v6 are supported. After discussion with several people, I decided to not store configuration state in the database to be more consistent with the kernel in future, similarity with other conntrack configuration which will not be in the database as well and overall simplicity. Accordingly, fragmentation handling is enabled by default. This patch enables fragmentation tests for the userspace datapath. Signed-off-by: Darrell Ball <dlu998@gmail.com> Signed-off-by: Ben Pfaff <blp@ovn.org>	2019-02-14 14:18:56 -08:00
Darrell Ball	9f17f104fe	dp-packet: Add 'do_not_steal' packet batch flag. This is needed in a subsequent patch and may otherwise be useful. Signed-off-by: Darrell Ball <dlu998@gmail.com> Signed-off-by: Ben Pfaff <blp@ovn.org>	2019-02-14 11:39:22 -08:00
Ilya Maximets	216abd2808	dpif-netdev: Add thread safety annotation to sorted_poll_list. 'sorted_poll_list()' uses the 'pmd->poll_list' that should be guarded by 'pmd->port_mutex'. Signed-off-by: Ilya Maximets <i.maximets@samsung.com> Acked-by: Kevin Traynor <ktraynor@redhat.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2019-02-12 13:23:09 +00:00
Ilya Maximets	2fbadeb665	dpif-netdev: Per-port configurable EMC. Conditional EMC insert helps a lot in scenarios with high numbers of parallel flows, but in current implementation this option affects all the threads and ports at once. There are scenarios where we have different number of flows on different ports. For example, if one of the VMs encapsulates traffic using additional headers, it will receive large number of flows but only few flows will come out of this VM. In this scenario it's much faster to use EMC instead of classifier for traffic from the VM, but it's better to disable EMC for the traffic which flows to VM. To handle above issue introduced 'emc-enable' configurable to enable/disable EMC on a per-port basis. Ex.: ovs-vsctl set interface dpdk0 other_config:emc-enable=false EMC probability kept as is and it works for all the ports with 'emc-enable=true'. Signed-off-by: Ilya Maximets <i.maximets@samsung.com> Acked-by: Kevin Traynor <ktraynor@redhat.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2019-01-18 11:54:42 +00:00
Nitin Katiyar	5bf8428248	Adding support for PMD auto load balancing Port rx queues that have not been statically assigned to PMDs are currently assigned based on periodically sampled load measurements. The assignment is performed at specific instances – port addition, port deletion, upon reassignment request via CLI etc. Due to change in traffic pattern over time it can cause uneven load among the PMDs and thus resulting in lower overall throughout. This patch enables the support of auto load balancing of PMDs based on measured load of RX queues. Each PMD measures the processing load for each of its associated queues every 10 seconds. If the aggregated PMD load reaches 95% for 6 consecutive intervals then PMD considers itself to be overloaded. If any PMD is overloaded, a dry-run of the PMD assignment algorithm is performed by OVS main thread. The dry-run does NOT change the existing queue to PMD assignments. If the resultant mapping of dry-run indicates an improved distribution of the load then the actual reassignment will be performed. The automatic rebalancing will be disabled by default and has to be enabled via configuration option. The interval (in minutes) between two consecutive rebalancing can also be configured via CLI, default is 1 min. Following example commands can be used to set the auto-lb params: ovs-vsctl set open_vswitch . other_config:pmd-auto-lb="true" ovs-vsctl set open_vswitch . other_config:pmd-auto-lb-rebalance-intvl="5" Co-authored-by: Rohith Basavaraja <rohith.basavaraja@gmail.com> Co-authored-by: Venkatesan Pradeep <venkatesan.pradeep@ericsson.com> Signed-off-by: Rohith Basavaraja <rohith.basavaraja@gmail.com> Signed-off-by: Venkatesan Pradeep <venkatesan.pradeep@ericsson.com> Signed-off-by: Nitin Katiyar <nitin.katiyar@ericsson.com> Acked-by: Kevin Traynor <ktraynor@redhat.com> Tested-by: Kevin Traynor <ktraynor@redhat.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2019-01-16 10:53:17 +00:00
Ilya Maximets	6c95dbf96b	dpif-netdev: End the quiescent state for flow offloading thread. Flow offloading thread uses concurrent hash maps which are based on rcu protected variables. It must use them while in active state. Working in a quiescent state could cause segmentation faults because of possible cmap internal structure changes. Fixes: 02bb2824e51d ("dpif-netdev: do hw flow offload in a thread") Acked-by: Flavio Leitner <fbl@sysclose.org> Signed-off-by: Ilya Maximets <i.maximets@samsung.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2018-11-02 15:17:19 +00:00
Ilya Maximets	5752eae485	dpif-netdev: Fix cmap node use after free on flow disassociation. Data pointed by cmap node must not be freed while iterating. ovsrcu_postpone should be used instead. CC: Finn Christensen <fc@napatech.com> Fixes: e8a2b5bf92bb ("netdev-dpdk: implement flow offload with rte flow") Signed-off-by: Ilya Maximets <i.maximets@samsung.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2018-11-02 15:13:54 +00:00
Sriharsha Basavapatna via dev	57924fc91c	revalidator: Rebalance offloaded flows based on the pps rate This is the third patch in the patch-set to support dynamic rebalancing of offloaded flows. The dynamic rebalancing functionality is implemented in this patch. The ukeys that are not scheduled for deletion are obtained and passed as input to the rebalancing routine. The rebalancing is done in the context of revalidation leader thread, after all other revalidator threads are done with gathering rebalancing data for flows. For each netdev that is in OOR state, a list of flows - both offloaded and non-offloaded (pending) - is obtained using the ukeys. For each netdev that is in OOR state, the flows are grouped and sorted into offloaded and pending flows. The offloaded flows are sorted in descending order of pps-rate, while pending flows are sorted in ascending order of pps-rate. The rebalancing is done in two phases. In the first phase, we try to offload all pending flows and if that succeeds, the OOR state on the device is cleared. If some (or none) of the pending flows could not be offloaded, then we start replacing an offloaded flow that has a lower pps-rate than a pending flow, until there are no more pending flows with a higher rate than an offloaded flow. The flows that are replaced from the device are added into kernel datapath. A new OVS configuration parameter "offload-rebalance", is added to ovsdb. The default value of this is "false". To enable this feature, set the value of this parameter to "true", which provides packets-per-second rate based policy to dynamically offload and un-offload flows. Note: This option can be enabled only when 'hw-offload' policy is enabled. It also requires 'tc-policy' to be set to 'skip_sw'; otherwise, flow offload errors (specifically ENOSPC error this feature depends on) reported by an offloaded device are supressed by TC-Flower kernel module. Signed-off-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com> Co-authored-by: Venkat Duvvuru <venkatkumar.duvvuru@broadcom.com> Signed-off-by: Venkat Duvvuru <venkatkumar.duvvuru@broadcom.com> Reviewed-by: Sathya Perla <sathya.perla@broadcom.com> Reviewed-by: Ben Pfaff <blp@ovn.org> Signed-off-by: Simon Horman <simon.horman@netronome.com>	2018-10-19 11:27:52 +02:00
Ilya Maximets	35fe9efb2f	dpif-netdev: Add vlan to mask for flow_put operation. Datapath flows in dpif-netdev classifier always has exact match mask set for vlan. We have to enable it for flow_put operation too in order to avoid flow modification failure due to classifier lookup with wrong hash. Found by OFtest. CC: Jan Scheurich <jan.scheurich@ericsson.com> Fixes: beb75a40fdc2 ("userspace: Switching of L3 packets in L2 pipeline") Reported-by: Ben Pfaff <blp@ovn.org> Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2018-September/352579.html Signed-off-by: Ilya Maximets <i.maximets@samsung.com> Signed-off-by: Ben Pfaff <blp@ovn.org>	2018-10-09 10:26:39 -07:00
Kevin Traynor	e77c97b9d6	dpif-netdev: Add round-robin based rxq to pmd assignment. Prior to OVS 2.9 automatic assignment of Rxqs to PMDs (i.e. CPUs) was done by round-robin. That was changed in OVS 2.9 to ordering the Rxqs based on their measured processing cycles. This was to assign the busiest Rxqs to different PMDs, improving aggregate throughput. For the most part the new scheme should be better, but there could be situations where a user prefers a simple round-robin scheme because Rxqs from a single port are more likely to be spread across multiple PMDs, and/or traffic is very bursty/unpredictable. Add 'pmd-rxq-assign' config to allow a user to select round-robin based assignment. Signed-off-by: Kevin Traynor <ktraynor@redhat.com> Acked-by: Eelco Chaudron <echaudro@redhat.com> Acked-by: Ilya Maximets <i.maximets@samsung.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2018-09-14 11:45:05 +01:00
Gavi Teitz	a692410af0	dpctl: Expand the flow dump type filter Added new types to the flow dump filter, and allowed multiple filter types to be passed at once, as a comma separated list. The new types added are: * tc - specifies flows handled by the tc dp * non-offloaded - specifies flows not offloaded to the HW * all - specifies flows of all types The type list is now fully parsed by the dpctl, and a new struct was added to dpif which enables dpctl to define which types of dumps to provide, rather than passing the type string and having dpif parse it. Signed-off-by: Gavi Teitz <gavi@mellanox.com> Acked-by: Roi Dayan <roid@mellanox.com> Signed-off-by: Simon Horman <simon.horman@netronome.com>	2018-09-13 16:56:25 +02:00
Gavi Teitz	0d6b401cf6	dpif-netdev: Initialize dpif_flow attrs In a previous commit, the dpif_flow struct was expanded, with the 'offloaded' field being moved into a new struct which also includes a field for the dp layer the flow is handled on. The initialization of these fields was only done in dpif-netlink. This completes that commit, by initializing the fields in dpif-netdev as well. As the 'offloaded' field was previously ignored by dpif-netdev, the attrs are initialized to the default values of 'false' for the offloaded state, and 'ovs' for the dp layer. Fixes: d63ca5329ff9 ("dpctl: Properly reflect a rule's offloaded to HW state") Signed-off-by: Gavi Teitz <gavi@mellanox.com> Acked-by: Roi Dayan <roid@mellanox.com> Signed-off-by: Simon Horman <simon.horman@netronome.com>	2018-09-13 16:56:25 +02:00
Justin Pettit	866bc7567a	dpif-netdev: Prevent unsafe access when retrieving meter stats. dpif_netdev_meter_get() retrieved a pointer to a meter entry without holding a lock. It's possible that another thread could have deleted that entry between retrieving the pointer and dereferencing the pointer. This makes the function hold the lock the entire time the meter entry is needed. Found by inspection. Signed-off-by: Justin Pettit <jpettit@ovn.org> Acked-by: Flavio Leitner <fbl@sysclose.org>	2018-09-04 13:36:37 -07:00
Justin Pettit	d0db81eac8	dpif-netdev: Don't check if xcalloc() failed when creating meter. xcalloc() can't return null. Signed-off-by: Justin Pettit <jpettit@ovn.org> Acked-by: Flavio Leitner <fbl@sysclose.org> Acked-by: Ben Pfaff <blp@ovn.org>	2018-09-04 13:36:37 -07:00
Vishal Deep Ajmera	9b4f08cdca	dpif-netdev: Avoid reordering of packets in a batch with same megaflow OVS reads packets in batches from a given port and packets in the batch are subjected to potentially 3 levels of lookups to identify the datapath megaflow entry (or flow) associated with the packet. Each megaflow entry has a dedicated buffer in which packets that match the flow classification criteria are collected. This buffer helps OVS perform batch processing for all packets associated with a given flow. Each packet in the received batch is first subjected to lookup in the Exact Match Cache (EMC). Each EMC entry will point to a flow. If the EMC lookup is successful, the packet is moved from the rx batch to the per-flow buffer. Packets that did not match any EMC entry are rearranged in the rx batch at the beginning and are now subjected to a lookup in the megaflow cache. Packets that match a megaflow cache entry are appended to the per-flow buffer. Packets that do not match any megaflow entry are subjected to slow-path processing through the upcall mechanism. This cannot change the order of packets as by definition upcall processing is only done for packets without matching megaflow entry. The EMC entry match fields encompass all potentially significant header fields, typically more than specified in the associated flow's match criteria. Hence, multiple EMC entries can point to the same flow. Given that per-flow batching happens at each lookup stage, packets belonging to the same megaflow can get re-ordered because some packets match EMC entries while others do not. The following example can illustrate the issue better. Consider following batch of packets (labelled P1 to P8) associated with a single TCP connection and associated with a single flow. Let us assume that packets with just the ACK bit set in TCP flags have been received in a prior batch also and a corresponding EMC entry exists. 1. P1 (TCP Flag: ACK) 2. P2 (TCP Flag: ACK) 3. P3 (TCP Flag: ACK) 4. P4 (TCP Flag: ACK, PSH) 5. P5 (TCP Flag: ACK) 6. P6 (TCP Flag: ACK) 7. P7 (TCP Flag: ACK) 8. P8 (TCP Flag: ACK) The megaflow classification criteria does not include TCP flags while the EMC match criteria does. Thus, all packets other than P4 match the existing EMC entry and are moved to the per-flow packet batch. Subsequently, packet P4 is moved to the same per-flow packet batch as a result of the megaflow lookup. Though the packets have all been correctly classified as being associated with the same flow, the packet order has not been preserved because of the per-flow batching performed during the EMC lookup stage. This packet re-ordering has performance implications for TCP applications. This patch preserves the packet ordering by performing the per-flow batching after both the EMC and megaflow lookups are complete. As an optimization, packets are flow-batched in emc processing till any packet in the batch has an EMC miss. A new flow map is maintained to keep the original order of packet along with flow information. Post fastpath processing, packets from flow map are appended to per-flow buffer. Signed-off-by: Vishal Deep Ajmera <vishal.deep.ajmera@ericsson.com> Co-authored-by: Venkatesan Pradeep <venkatesan.pradeep@ericsson.com> Signed-off-by: Venkatesan Pradeep <venkatesan.pradeep@ericsson.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2018-08-27 17:48:23 +01:00
Yi-Hung Wei	cd015a11c2	dpif: Support conntrack zone limit. This patch defines the dpif interface to support conntrack per zone limit. Basically, OVS users can use this interface to set, delete, and get the conntrack per zone limit for various dpif interfaces. The following patch will make use of the proposed interface to implement the feature. Signed-off-by: Yi-Hung Wei <yihung.wei@gmail.com> Signed-off-by: Justin Pettit <jpettit@ovn.org>	2018-08-17 09:30:55 -07:00
Justin Pettit	8101f03fcd	dpif: Don't pass in '*meter_id' to meter_set commands. The original intent of the API appears to be that the underlying DPIF implementaion would choose a local meter id. However, neither of the existing datapath meter implementations (userspace or Linux) implemented that; they expected a valid meter id to be passed in, otherwise they returned an error. This commit follows the existing implementations and makes the API somewhat cleaner. Signed-off-by: Justin Pettit <jpettit@ovn.org> Acked-by: Ben Pfaff <blp@ovn.org>	2018-08-16 10:20:52 -07:00
Ilya Maximets	18e08953cf	dpif-netdev: Fix zero length keys insertion to EMC. 'key.len' should be calculated before inserting to EMC, otherwise resulting entry will match with any packet with the same hash. CC: Yipeng Wang <yipeng1.wang@intel.com> Fixes: 60d8ccae135f ("dpif-netdev: Add SMC cache after EMC cache") Signed-off-by: Ilya Maximets <i.maximets@samsung.com> Acked-by: Yipeng Wang <yipeng1.wang@intel.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2018-08-08 22:06:21 +01:00
Justin Pettit	6508c845ad	dpif: Move common meter checks into the dpif layer. Another dpif provider will soon add support for meters, so move some of the common sanity checks up into the dpif layer so that each provider doesn't need to re-implement them. Signed-off-by: Justin Pettit <jpettit@ovn.org> Acked-by: Ben Pfaff <blp@ovn.org>	2018-07-30 13:00:49 -07:00
Justin Pettit	f603d7d262	Revert "dpif-netdev: Use compatible function type to fix broken build." Commit ab15e70eb587 ("dpctl: Expand the flow dump type filter") will be reverted, which this patch fixed, so it needs to be reverted as well. This reverts commit b10ac772218afd4f296db866f6b80258e1d1ca8a. CC: Gavi Teitz <gavi@mellanox.com> CC: Simon Horman <simon.horman@netronome.com> CC: Roi Dayan <roid@mellanox.com> CC: Aaron Conole <aconole@redhat.com> Signed-off-by: Justin Pettit <jpettit@ovn.org> Acked-by: Ben Pfaff <blp@ovn.org>	2018-07-25 14:17:33 -07:00
Aaron Conole	b10ac77221	dpif-netdev: Use compatible function type to fix broken build. The dpif_provder flow_dump_create function signature was changed, but the netdev dpif was not updated along with it. This generated a build error with the following warnings: libtool: compile: gcc -std=gnu99 -DHAVE_CONFIG_H -I. -I ./include -I ./include -I ./lib -I ./lib -Wstrict-prototypes -Wall -Wextra -Wno-sign-compare -Wpointer-arith -Wformat -Wformat-security -Wswitch-enum -Wunused-parameter -Wbad-function-cast -Wcast-align -Wstrict-prototypes -Wold-style-definition -Wmissing-prototypes -Wmissing-field-initializers -fno-strict-aliasing -Wshadow -Wno-null-pointer-arithmetic -Werror -Werror -g -O2 -MT lib/dpif-netdev.lo -MD -MP -MF lib/.deps/dpif-netdev.Tpo -c lib/dpif-netdev.c -o lib/dpif-netdev.o lib/dpif-netdev.c:6812:5: error: initialization from incompatible pointer type [-Werror] dpif_netdev_flow_dump_create, ^ lib/dpif-netdev.c:6812:5: error: (near initialization for 'dpif_netdev_class.flow_dump_create') [-Werror] Fixes: ab15e70eb587 ("dpctl: Expand the flow dump type filter") Cc: Gavi Teitz <gavi@mellanox.com> Cc: Roi Dayan <roid@mellanox.com> Cc: Simon Horman <simon.horman@netronome.com> Signed-off-by: Aaron Conole <aconole@redhat.com> Signed-off-by: Ben Pfaff <blp@ovn.org>	2018-07-25 11:35:23 -07:00
Yipeng Wang	60d8ccae13	dpif-netdev: Add SMC cache after EMC cache This patch adds a signature match cache (SMC) after exact match cache (EMC). The difference between SMC and EMC is SMC only stores a signature of a flow thus it is much more memory efficient. With same memory space, EMC can store 8k flows while SMC can store 1M flows. It is generally beneficial to turn on SMC but turn off EMC when traffic flow count is much larger than EMC size. SMC cache will map a signature to an dp_netdev_flow index in flow_table. Thus, we add two new APIs in cmap for lookup key by index and lookup index by key. For now, SMC is an experimental feature that it is turned off by default. One can turn it on using ovsdb options. Signed-off-by: Yipeng Wang <yipeng1.wang@intel.com> Co-authored-by: Jan Scheurich <jan.scheurich@ericsson.com> Signed-off-by: Jan Scheurich <jan.scheurich@ericsson.com> Acked-by: Billy O'Mahony <billy.o.mahony@intel.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2018-07-24 17:01:03 +01:00
Justin Pettit	425a7b9eaf	dpif-netdev: Fix a couple of comments for dp_netdev_run_meter(). Signed-off-by: Justin Pettit <jpettit@ovn.org> Acked-by: Ben Pfaff <blp@ovn.org>	2018-07-06 14:23:27 -07:00
Yuanhan Liu	02bb2824e5	dpif-netdev: do hw flow offload in a thread Currently, the major trigger for hw flow offload is at upcall handling, which is actually in the datapath. Moreover, the hw offload installation and modification is not that lightweight. Meaning, if there are so many flows being added or modified frequently, it could stall the datapath, which could result to packet loss. To diminish that, all those flow operations will be recorded and appended to a list. A thread is then introduced to process this list (to do the real flow offloading put/del operations). This could leave the datapath as lightweight as possible. Signed-off-by: Yuanhan Liu <yliu@fridaylinux.org> Co-authored-by: Shahaf Shuler <shahafs@mellanox.com> Signed-off-by: Shahaf Shuler <shahafs@mellanox.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2018-07-06 10:32:52 +01:00
Yuanhan Liu	aab96ec4d8	dpif-netdev: retrieve flow directly from the flow mark So that we could skip some very costly CPU operations, including but not limiting to miniflow_extract, emc lookup, dpcls lookup, etc. Thus, performance could be greatly improved. A PHY-PHY forwarding with 1000 mega flows (udp,tp_src=1000-1999) and 1 million streams (tp_src=1000-1999, tp_dst=2000-2999) show more that 260% performance boost. Note that though the heavy miniflow_extract is skipped, we still have to do per packet checking, due to we have to check the tcp_flags. Co-authored-by: Finn Christensen <fc@napatech.com> Signed-off-by: Yuanhan Liu <yliu@fridaylinux.org> Signed-off-by: Finn Christensen <fc@napatech.com> Co-authored-by: Shahaf Shuler <shahafs@mellanox.com> Signed-off-by: Shahaf Shuler <shahafs@mellanox.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2018-07-06 10:32:52 +01:00
Yuanhan Liu	241bad15d9	dpif-netdev: associate flow with a mark id Most modern NICs have the ability to bind a flow with a mark, so that every packet matches such flow will have that mark present in its descriptor. The basic idea of doing that is, when we receives packets later, we could directly get the flow from the mark. That could avoid some very costly CPU operations, including (but not limiting to) miniflow_extract, emc lookup, dpcls lookup, etc. Thus, performance could be greatly improved. Thus, the major work of this patch is to associate a flow with a mark id (an uint32_t number). The association in netdev datapath is done by CMAP, while in hardware it's done by the rte_flow MARK action. One tricky thing in OVS-DPDK is, the flow tables is per-PMD. For the case there is only one phys port but with 2 queues, there could be 2 PMDs. In other words, even for a single mega flow (i.e. udp,tp_src=1000), there could be 2 different dp_netdev flows, one for each PMD. That could results to the same mega flow being offloaded twice in the hardware, worse, we may get 2 different marks and only the last one will work. To avoid that, a megaflow_to_mark CMAP is created. An entry will be added for the first PMD that wants to offload a flow. For later PMDs, it will see such megaflow is already offloaded, then the flow will not be offloaded to HW twice. Meanwhile, the mark to flow mapping becomes to 1:N mapping. That is what the mark_to_flow CMAP is for. When the first PMD wants to offload a flow, it allocates a new mark and performs the flow offload by reusing the ->flow_put method. When it succeeds, a "mark to flow" entry will be added. For later PMDs, it will get the corresponding mark by above megaflow_to_mark CMAP. Then, another "mark to flow" entry will be added. Signed-off-by: Yuanhan Liu <yliu@fridaylinux.org> Co-authored-by: Finn Christensen <fc@napatech.com> Signed-off-by: Finn Christensen <fc@napatech.com> Co-authored-by: Shahaf Shuler <shahafs@mellanox.com> Signed-off-by: Shahaf Shuler <shahafs@mellanox.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2018-07-06 10:32:52 +01:00
Ben Pfaff	5a0e4aec1a	treewide: Convert leading tabs to spaces. It's always been OVS coding style to use spaces rather than tabs for indentation, but some tabs have snuck in over time. This commit converts them to spaces. Signed-off-by: Ben Pfaff <blp@ovn.org> Acked-by: Justin Pettit <jpettit@ovn.org>	2018-06-11 15:32:00 -07:00
Ben Pfaff	fa37affad3	Embrace anonymous unions. Several OVS structs contain embedded named unions, like this: struct { ... union { ... } u; }; C11 standardized a feature that many compilers already implemented anyway, where an embedded union may be unnamed, like this: struct { ... union { ... }; }; This is more convenient because it allows the programmer to omit "u." in many places. OVS already used this feature in several places. This commit embraces it in several others. Signed-off-by: Ben Pfaff <blp@ovn.org> Acked-by: Justin Pettit <jpettit@ovn.org> Tested-by: Alin Gabriel Serdean <aserdean@ovn.org> Acked-by: Alin Gabriel Serdean <aserdean@ovn.org>	2018-05-25 13:36:05 -07:00
Ilya Maximets	47e1b3b625	dpif-netdev: Free packets on TUNNEL_PUSH if should_steal. Unconditional return may cause packet leak in case of 'should_steal == true'. Additionally, removed redundant checking for depth level. CC: Sugesh Chandran <sugesh.chandran@intel.com> Fixes: 7c12dfc527a5 ("tunneling: Avoid datapath-recirc by combining recirc actions at xlate.") Signed-off-by: Ilya Maximets <i.maximets@samsung.com> Signed-off-by: Ian Stokess <ian.stokes@intel.com>	2018-05-25 09:09:50 +01:00
Eelco Chaudron	606f665072	netdev-dpdk: Don't use PMD driver if not configured successfully When initialization of the DPDK PMD driver fails (dpdk_eth_dev_init()), the reconfigure_datapath() function will remove the port from dp_netdev, and the port is not used. Now when bridge_reconfigure() is called again, no changes to the previous failing netdev configuration are detected and therefore the ports gets added to dp_netdev and used uninitialized. This is causing exceptions... The fix has two parts to it. First in netdev-dpdk.c we remember if the DPDK port was started or not, and when calling netdev_dpdk_reconfigure() we also try re-initialization if the port was not already active. The second part of the change is in dpif-netdev.c where it makes sure netdev_reconfigure() is called if the port needs reconfiguration, as netdev_is_reconf_required() is only true until netdev_reconfigure() is called (even if it fails). Signed-off-by: Eelco Chaudron <echaudro@redhat.com> Tested-by: Ciara Loftus <ciara.loftus@intel.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2018-05-25 09:09:50 +01:00
Darrell Ball	7d7ded7af7	odp-execute: Rename 'may_steal' to 'should_steal'. Signed-off-by: Darrell Ball <dlu998@gmail.com> Signed-off-by: Ben Pfaff <blp@ovn.org>	2018-05-23 11:36:47 -07:00
Jan Scheurich	7178fefbdf	dpif-netdev: Detection and logging of suspicious PMD iterations This patch enhances dpif-netdev-perf to detect iterations with suspicious statistics according to the following criteria: - iteration lasts longer than US_THR microseconds (default 250). This can be used to capture events where a PMD is blocked or interrupted for such a period of time that there is a risk for dropped packets on any of its Rx queues. - max vhost qlen exceeds a threshold Q_THR (default 128). This can be used to infer virtio queue overruns and dropped packets inside a VM, which are not visible in OVS otherwise. Such suspicious iterations can be logged together with their iteration statistics to be able to correlate them to packet drop or other events outside OVS. A new command is introduced to enable/disable logging at run-time and to adjust the above thresholds for suspicious iterations: ovs-appctl dpif-netdev/pmd-perf-log-set on \| off [-b before] [-a after] [-e\|-ne] [-us usec] [-q qlen] Turn logging on or off at run-time (on\|off). -b before: The number of iterations before the suspicious iteration to be logged (default 5). -a after: The number of iterations after the suspicious iteration to be logged (default 5). -e: Extend logging interval if another suspicious iteration is detected before logging occurs. -ne: Do not extend logging interval (default). -q qlen: Suspicious vhost queue fill level threshold. Increase this to 512 if the Qemu supports 1024 virtio queue length. (default 128). -us usec: change the duration threshold for a suspicious iteration (default 250 us). Note: Logging of suspicious iterations itself consumes a considerable amount of processing cycles of a PMD which may be visible in the iteration history. In the worst case this can lead OVS to detect another suspicious iteration caused by logging. If more than 100 iterations around a suspicious iteration have been logged once, OVS falls back to the safe default values (-b 5/-a 5/-ne) to avoid that logging itself causes continuos further logging. Signed-off-by: Jan Scheurich <jan.scheurich@ericsson.com> Acked-by: Billy O'Mahony <billy.o.mahony@intel.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2018-05-11 08:08:24 +01:00
Jan Scheurich	79f368756c	dpif-netdev: Detailed performance stats for PMDs This patch instruments the dpif-netdev datapath to record detailed statistics of what is happening in every iteration of a PMD thread. The collection of detailed statistics can be controlled by a new Open_vSwitch configuration parameter "other_config:pmd-perf-metrics". By default it is disabled. The run-time overhead, when enabled, is in the order of 1%. The covered metrics per iteration are: - cycles - packets - (rx) batches - packets/batch - max. vhostuser qlen - upcalls - cycles spent in upcalls This raw recorded data is used threefold: 1. In histograms for each of the following metrics: - cycles/iteration (log.) - packets/iteration (log.) - cycles/packet - packets/batch - max. vhostuser qlen (log.) - upcalls - cycles/upcall (log) The histograms bins are divided linear or logarithmic. 2. A cyclic history of the above statistics for 999 iterations 3. A cyclic history of the cummulative/average values per millisecond wall clock for the last 1000 milliseconds: - number of iterations - avg. cycles/iteration - packets (Kpps) - avg. packets/batch - avg. max vhost qlen - upcalls - avg. cycles/upcall The gathered performance metrics can be printed at any time with the new CLI command ovs-appctl dpif-netdev/pmd-perf-show [-nh] [-it iter_len] [-ms ms_len] [-pmd core] [dp] The options are -nh: Suppress the histograms -it iter_len: Display the last iter_len iteration stats -ms ms_len: Display the last ms_len millisecond stats -pmd core: Display only the specified PMD The performance statistics are reset with the existing dpif-netdev/pmd-stats-clear command. The output always contains the following global PMD statistics, similar to the pmd-stats-show command: Time: 15:24:55.270 Measurement duration: 1.008 s pmd thread numa_id 0 core_id 1: Cycles: 2419034712 (2.40 GHz) Iterations: 572817 (1.76 us/it) - idle: 486808 (15.9 % cycles) - busy: 86009 (84.1 % cycles) Rx packets: 2399607 (2381 Kpps, 848 cycles/pkt) Datapath passes: 3599415 (1.50 passes/pkt) - EMC hits: 336472 ( 9.3 %) - Megaflow hits: 3262943 (90.7 %, 1.00 subtbl lookups/hit) - Upcalls: 0 ( 0.0 %, 0.0 us/upcall) - Lost upcalls: 0 ( 0.0 %) Tx packets: 2399607 (2381 Kpps) Tx batches: 171400 (14.00 pkts/batch) Signed-off-by: Jan Scheurich <jan.scheurich@ericsson.com> Acked-by: Billy O'Mahony <billy.o.mahony@intel.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2018-05-11 08:08:24 +01:00
Jan Scheurich	8492adc270	netdev: Add optional qfill output parameter to rxq_recv() If the caller provides a non-NULL qfill pointer and the netdev implemementation supports reading the rx queue fill level, the rxq_recv() function returns the remaining number of packets in the rx queue after reception of the packet burst to the caller. If the implementation does not support this, it returns -ENOTSUP instead. Reading the remaining queue fill level should not substantilly slow down the recv() operation. A first implementation is provided for ethernet and vhostuser DPDK ports in netdev-dpdk.c. This output parameter will be used in the upcoming commit for PMD performance metrics to supervise the rx queue fill level for DPDK vhostuser ports. Signed-off-by: Jan Scheurich <jan.scheurich@ericsson.com> Acked-by: Billy O'Mahony <billy.o.mahony@intel.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2018-05-11 08:08:24 +01:00
Ben Pfaff	f825fdd4ff	flow: Improve type-safety of MINIFLOW_GET_TYPE. Until mow, this macro has blindly read the passed-in type's size, but that's unnecessarily risky. This commit changes it to verify that the passed-in type is the same size as the field and, on GCC and Clang, that the types are compatible. It also adds a version that does not check, for the one case where (currently) we deliberately read the wrong size, and updates a few uses to use more precise field names. Signed-off-by: Ben Pfaff <blp@ovn.org> Reviewed-by: Yifeng Sun <pkusunyifeng@gmail.com> Reviewed-by: Armando Migliaccio <armamig@gmail.com>	2018-03-31 11:31:51 -07:00
Justin Pettit	97bf8f478d	Don't shadow iterator values. Signed-off-by: Justin Pettit <jpettit@ovn.org> Acked-by: Ben Pfaff <blp@ovn.org>	2018-02-28 14:53:29 -08:00
Justin Pettit	e883448e3f	dp-packet: Add index to DP_PACKET_BATCH_FOR_EACH to prevent shadowing. Signed-off-by: Justin Pettit <jpettit@ovn.org> Acked-by: Ben Pfaff <blp@ovn.org>	2018-02-28 14:53:27 -08:00
Yi-Hung Wei	271e48a0e2	conntrack: Support conntrack flush by ct 5-tuple This patch adds support of flushing a conntrack entry specified by the conntrack 5-tuple in dpif-netdev. Signed-off-by: Yi-Hung Wei <yihung.wei@gmail.com> Signed-off-by: Ben Pfaff <blp@ovn.org> Acked-by: Darrell Ball <dlu998@gmail.com>	2018-02-14 13:59:09 -08:00
Ben Pfaff	0d71302e36	ofp-util, ofp-parse: Break up into many separate modules. ofp-util had been far too large and monolithic for a long time. This commit breaks it up into units that make some logical sense. It also moves the pieces of ofp-parse that were specific to each unit into the relevant unit. Most of this commit is just moving code around. Signed-off-by: Ben Pfaff <blp@ovn.org> Reviewed-by: Yifeng Sun <pkusunyifeng@gmail.com>	2018-02-13 10:43:13 -08:00
Eric Garver	1fe178d251	dpif: Add support for OVS_ACTION_ATTR_CT_CLEAR This supports using the ct_clear action in the kernel datapath. To preserve compatibility with current ct_clear behavior on old kernels, we only pass this action down to the datapath if a probe reveals the datapath actually supports it. Signed-off-by: Eric Garver <e@erig.me> Acked-by: William Tu <u9012063@gmail.com> Acked-by: Flavio Leitner <fbl@sysclose.org> Signed-off-by: Justin Pettit <jpettit@ovn.org>	2018-01-20 11:16:37 -08:00
Kevin Traynor	2a2c67b435	dpif-netdev: Add percentage of pmd/core used by each rxq. It is based on the length of history that is stored about an rxq (currently 1 min). $ ovs-appctl dpif-netdev/pmd-rxq-show pmd thread numa_id 0 core_id 4: isolated : false port: dpdkphy1 queue-id: 0 pmd usage: 70 % port: dpdkvhost0 queue-id: 0 pmd usage: 0 % pmd thread numa_id 0 core_id 6: isolated : false port: dpdkphy0 queue-id: 0 pmd usage: 64 % port: dpdkvhost1 queue-id: 0 pmd usage: 0 % These values are what would be used as part of rxq to pmd assignment due to a reconfiguration event e.g. adding pmds, adding rxqs or with the command: ovs-appctl dpif-netdev/pmd-rxq-rebalance Signed-off-by: Jan Scheurich <jan.scheurich@ericsson.com> Co-authored-by: Jan Scheurich <jan.scheurich@ericsson.com> Signed-off-by: Kevin Traynor <ktraynor@redhat.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2018-01-17 18:11:28 +00:00
Kevin Traynor	4f5d13e241	dpif-netdev: Reset the rxq current cycle counter on reload. An rxq may have processing cycles counted in the current counter when a reload happens. That could temporarily create a small skew on the stats for an rxq. Reset the counter after reload. Fixes: 4809891b2e01 ("dpif-netdev: Count the rxq processing cycles for an rxq.") Signed-off-by: Kevin Traynor <ktraynor@redhat.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2018-01-17 18:11:28 +00:00
Ilya Maximets	c71ea3c4a7	dpif-netdev: Time based output batching. This allows to collect packets from more than one RX burst and send them together with a configurable intervals. 'other_config:tx-flush-interval' can be used to configure time that a packet can wait in output batch for sending. 'tx-flush-interval' has microsecond resolution. Tested-by: Jan Scheurich <jan.scheurich@ericsson.com> Acked-by: Jan Scheurich <jan.scheurich@ericsson.com> Signed-off-by: Ilya Maximets <i.maximets@samsung.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2018-01-17 18:11:28 +00:00
Ilya Maximets	58ed6df048	dpif-netdev: Count cycles on per-rxq basis. Upcoming time-based output batching will allow to collect in a single output batch packets from different RX queues. Lets keep the list of RX queues for each output packet and collect cycles for them on send. Tested-by: Jan Scheurich <jan.scheurich@ericsson.com> Acked-by: Jan Scheurich <jan.scheurich@ericsson.com> Signed-off-by: Ilya Maximets <i.maximets@samsung.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2018-01-17 18:11:28 +00:00
Ilya Maximets	05f9e707e1	dpif-netdev: Use microsecond granularity. Upcoming time-based output batching will require microsecond granularity for it's flexible configuration. Acked-by: Jan Scheurich <jan.scheurich@ericsson.com> Acked-by: Ian Stokes <ian.stokes@intel.com> Signed-off-by: Ilya Maximets <i.maximets@samsung.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2018-01-17 18:11:28 +00:00
Jan Scheurich	a19896abe5	dpif-netdev: Refactor cycle counting Simplify the historically grown TSC cycle counting in PMD threads. Cycles are currently counted for the following purposes: 1. Measure PMD ustilization PMD utilization is defined as ratio of cycles spent in busy iterations (at least one packet received or sent) over the total number of cycles. This is already done in pmd_perf_start_iteration() and pmd_perf_end_iteration() based on a TSC timestamp saved in current iteration at start_iteration() and the actual TSC at end_iteration(). No dependency on intermediate cycle accounting. 2. Measure the processing load per RX queue This comprises cycles spend on polling and processing packets received from the rx queue and the cycles spent on delayed sending of these packets to tx queues (with time-based batching). The previous scheme using cycles_count_start(), cycles_count_intermediate() and cycles-count_end() originally introduced to simplify cycle counting and saving calls to rte_get_tsc_cycles() was rather obscuring things. Replace by a nestable cycle_timer with with start and stop functions to embrace a code segment to be timed. The timed code may contain arbitrary nested cycle_timers. The duration of nested timers is excluded from the outer timer. The caller must ensure that each call to cycle_timer_start() is followed by a call to cycle_timer_end(). Failure to do so will lead to assertion failure or a memory leak. The new cycle_timer is used to measure the processing cycles per rx queue. This is not yet strictly necessary but will be made use of in a subsequent commit. All cycle count functions and data are relocated to module dpif-netdev-perf. Signed-off-by: Jan Scheurich <jan.scheurich@ericsson.com> Acked-by: Ilya Maximets <i.maximets@samsung.com> Acked-by: Billy O'Mahony <billy.o.mahony@intel.com> Signed-off: Ian Stokes <ian.stokes@intel.com>	2018-01-17 18:11:28 +00:00
Jan Scheurich	82a48ead4e	dpif-netdev: Refactor PMD performance into dpif-netdev-perf Add module dpif-netdev-perf to host all PMD performance-related data structures and functions in dpif-netdev. Refactor the PMD stats handling in dpif-netdev and delegate whatever possible into the new module, using clean interfaces to shield dpif-netdev from the implementation details. Accordingly, the all PMD statistics members are moved from the main struct dp_netdev_pmd_thread into a dedicated member of type struct pmd_perf_stats. Include Darrel's prior refactoring of PMD stats contained in [PATCH v5,2/3] dpif-netdev: Refactor some pmd stats: 1. The cycles per packet counts are now based on packets received rather than packet passes through the datapath. 2. Packet counters are now kept for packets received and packets recirculated. These are kept as separate counters for maintainability reasons. The cost of incrementing these counters is negligible. These new counters are also displayed to the user. 3. A display statistic is added for the average number of datapath passes per packet. This should be useful for user debugging and understanding of packet processing. 4. The user visible 'miss' counter is used for successful upcalls, rather than the sum of sucessful and unsuccessful upcalls. Hence, this becomes what user historically understands by OVS 'miss upcall'. The user display is annotated to make this clear as well. 5. The user visible 'lost' counter remains as failed upcalls, but is annotated to make it clear what the meaning is. 6. The enum pmd_stat_type is annotated to make the usage of the stats counters clear. 7. The subtable lookup stats is renamed to make it clear that it relates to masked lookups. 8. The PMD stats test is updated to handle the new user stats of packets received, packets recirculated and average number of datapath passes per packet. On top of that introduce a "-pmd <core>" option to the PMD info commands to filter the output for a single PMD. Made the pmd-stats-show output a bit more readable by adding a blank between colon and value. Signed-off-by: Jan Scheurich <jan.scheurich@ericsson.com> Co-authored-by: Darrell Ball <dlu998@gmail.com> Signed-off-by: Darrell Ball <dlu998@gmail.com> Acked-by: Ilya Maximets <i.maximets@samsung.com> Acked-by: Billy O'Mahony <billy.o.mahony@intel.com> Signed-off: Ian Stokes <ian.stokes@intel.com>	2018-01-17 18:11:28 +00:00

... 2 3 4 5 6 ...

779 Commits