mir/ovs - ovs - Mike's Git repositories

mirror of https://github.com/openvswitch/ovs synced 2025-08-29 13:27:59 +00:00

Author	SHA1	Message	Date
Yuanhan Liu	241bad15d9	dpif-netdev: associate flow with a mark id Most modern NICs have the ability to bind a flow with a mark, so that every packet that matches such a flow will have that mark present in its descriptor. The basic idea is that, when we receive packets later, we can get the flow directly from the mark. That avoids some very costly CPU operations, including (but not limited to) miniflow_extract, emc lookup, dpcls lookup, etc. Thus, performance could be greatly improved. Thus, the major work of this patch is to associate a flow with a mark id (a uint32_t number). The association in the netdev datapath is done by CMAP, while in hardware it's done by the rte_flow MARK action. One tricky thing in OVS-DPDK is that the flow tables are per-PMD. For the case where there is only one phys port but with 2 queues, there could be 2 PMDs. In other words, even for a single mega flow (i.e. udp,tp_src=1000), there could be 2 different dp_netdev flows, one for each PMD. That could result in the same mega flow being offloaded twice in the hardware; worse, we may get 2 different marks and only the last one will work. To avoid that, a megaflow_to_mark CMAP is created. An entry will be added for the first PMD that wants to offload a flow. Later PMDs will see that such a megaflow is already offloaded, so the flow will not be offloaded to HW twice. Meanwhile, the mark to flow mapping becomes a 1:N mapping. That is what the mark_to_flow CMAP is for. When the first PMD wants to offload a flow, it allocates a new mark and performs the flow offload by reusing the ->flow_put method. When it succeeds, a "mark to flow" entry will be added. Later PMDs will get the corresponding mark from the above megaflow_to_mark CMAP. Then, another "mark to flow" entry will be added. 
Signed-off-by: Yuanhan Liu <yliu@fridaylinux.org> Co-authored-by: Finn Christensen <fc@napatech.com> Signed-off-by: Finn Christensen <fc@napatech.com> Co-authored-by: Shahaf Shuler <shahafs@mellanox.com> Signed-off-by: Shahaf Shuler <shahafs@mellanox.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2018-07-06 10:32:52 +01:00
Ben Pfaff	5a0e4aec1a	treewide: Convert leading tabs to spaces. It's always been OVS coding style to use spaces rather than tabs for indentation, but some tabs have snuck in over time. This commit converts them to spaces. Signed-off-by: Ben Pfaff <blp@ovn.org> Acked-by: Justin Pettit <jpettit@ovn.org>	2018-06-11 15:32:00 -07:00
Ben Pfaff	fa37affad3	Embrace anonymous unions. Several OVS structs contain embedded named unions, like this: struct { ... union { ... } u; }; C11 standardized a feature that many compilers already implemented anyway, where an embedded union may be unnamed, like this: struct { ... union { ... }; }; This is more convenient because it allows the programmer to omit "u." in many places. OVS already used this feature in several places. This commit embraces it in several others. Signed-off-by: Ben Pfaff <blp@ovn.org> Acked-by: Justin Pettit <jpettit@ovn.org> Tested-by: Alin Gabriel Serdean <aserdean@ovn.org> Acked-by: Alin Gabriel Serdean <aserdean@ovn.org>	2018-05-25 13:36:05 -07:00
Ilya Maximets	47e1b3b625	dpif-netdev: Free packets on TUNNEL_PUSH if should_steal. Unconditional return may cause packet leak in case of 'should_steal == true'. Additionally, removed redundant checking for depth level. CC: Sugesh Chandran <sugesh.chandran@intel.com> Fixes: 7c12dfc527a5 ("tunneling: Avoid datapath-recirc by combining recirc actions at xlate.") Signed-off-by: Ilya Maximets <i.maximets@samsung.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2018-05-25 09:09:50 +01:00
Eelco Chaudron	606f665072	netdev-dpdk: Don't use PMD driver if not configured successfully When initialization of the DPDK PMD driver fails (dpdk_eth_dev_init()), the reconfigure_datapath() function will remove the port from dp_netdev, and the port is not used. Now when bridge_reconfigure() is called again, no changes to the previous failing netdev configuration are detected and therefore the port gets added to dp_netdev and used uninitialized. This is causing exceptions... The fix has two parts to it. First in netdev-dpdk.c we remember if the DPDK port was started or not, and when calling netdev_dpdk_reconfigure() we also try re-initialization if the port was not already active. The second part of the change is in dpif-netdev.c where it makes sure netdev_reconfigure() is called if the port needs reconfiguration, as netdev_is_reconf_required() is only true until netdev_reconfigure() is called (even if it fails). Signed-off-by: Eelco Chaudron <echaudro@redhat.com> Tested-by: Ciara Loftus <ciara.loftus@intel.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2018-05-25 09:09:50 +01:00
Darrell Ball	7d7ded7af7	odp-execute: Rename 'may_steal' to 'should_steal'. Signed-off-by: Darrell Ball <dlu998@gmail.com> Signed-off-by: Ben Pfaff <blp@ovn.org>	2018-05-23 11:36:47 -07:00
Jan Scheurich	7178fefbdf	dpif-netdev: Detection and logging of suspicious PMD iterations This patch enhances dpif-netdev-perf to detect iterations with suspicious statistics according to the following criteria: - iteration lasts longer than US_THR microseconds (default 250). This can be used to capture events where a PMD is blocked or interrupted for such a period of time that there is a risk for dropped packets on any of its Rx queues. - max vhost qlen exceeds a threshold Q_THR (default 128). This can be used to infer virtio queue overruns and dropped packets inside a VM, which are not visible in OVS otherwise. Such suspicious iterations can be logged together with their iteration statistics to be able to correlate them to packet drop or other events outside OVS. A new command is introduced to enable/disable logging at run-time and to adjust the above thresholds for suspicious iterations: ovs-appctl dpif-netdev/pmd-perf-log-set on | off [-b before] [-a after] [-e|-ne] [-us usec] [-q qlen] Turn logging on or off at run-time (on|off). -b before: The number of iterations before the suspicious iteration to be logged (default 5). -a after: The number of iterations after the suspicious iteration to be logged (default 5). -e: Extend logging interval if another suspicious iteration is detected before logging occurs. -ne: Do not extend logging interval (default). -q qlen: Suspicious vhost queue fill level threshold. Increase this to 512 if the Qemu supports 1024 virtio queue length. (default 128). -us usec: change the duration threshold for a suspicious iteration (default 250 us). Note: Logging of suspicious iterations itself consumes a considerable amount of processing cycles of a PMD which may be visible in the iteration history. In the worst case this can lead OVS to detect another suspicious iteration caused by logging. 
If more than 100 iterations around a suspicious iteration have been logged once, OVS falls back to the safe default values (-b 5/-a 5/-ne) to prevent logging itself from causing continuous further logging. Signed-off-by: Jan Scheurich <jan.scheurich@ericsson.com> Acked-by: Billy O'Mahony <billy.o.mahony@intel.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2018-05-11 08:08:24 +01:00
Jan Scheurich	79f368756c	dpif-netdev: Detailed performance stats for PMDs This patch instruments the dpif-netdev datapath to record detailed statistics of what is happening in every iteration of a PMD thread. The collection of detailed statistics can be controlled by a new Open_vSwitch configuration parameter "other_config:pmd-perf-metrics". By default it is disabled. The run-time overhead, when enabled, is in the order of 1%. The covered metrics per iteration are: - cycles - packets - (rx) batches - packets/batch - max. vhostuser qlen - upcalls - cycles spent in upcalls This raw recorded data is used threefold: 1. In histograms for each of the following metrics: - cycles/iteration (log.) - packets/iteration (log.) - cycles/packet - packets/batch - max. vhostuser qlen (log.) - upcalls - cycles/upcall (log) The histogram bins are divided linearly or logarithmically. 2. A cyclic history of the above statistics for 999 iterations 3. A cyclic history of the cumulative/average values per millisecond wall clock for the last 1000 milliseconds: - number of iterations - avg. cycles/iteration - packets (Kpps) - avg. packets/batch - avg. max vhost qlen - upcalls - avg. cycles/upcall The gathered performance metrics can be printed at any time with the new CLI command ovs-appctl dpif-netdev/pmd-perf-show [-nh] [-it iter_len] [-ms ms_len] [-pmd core] [dp] The options are -nh: Suppress the histograms -it iter_len: Display the last iter_len iteration stats -ms ms_len: Display the last ms_len millisecond stats -pmd core: Display only the specified PMD The performance statistics are reset with the existing dpif-netdev/pmd-stats-clear command. 
The output always contains the following global PMD statistics, similar to the pmd-stats-show command: Time: 15:24:55.270 Measurement duration: 1.008 s pmd thread numa_id 0 core_id 1: Cycles: 2419034712 (2.40 GHz) Iterations: 572817 (1.76 us/it) - idle: 486808 (15.9 % cycles) - busy: 86009 (84.1 % cycles) Rx packets: 2399607 (2381 Kpps, 848 cycles/pkt) Datapath passes: 3599415 (1.50 passes/pkt) - EMC hits: 336472 ( 9.3 %) - Megaflow hits: 3262943 (90.7 %, 1.00 subtbl lookups/hit) - Upcalls: 0 ( 0.0 %, 0.0 us/upcall) - Lost upcalls: 0 ( 0.0 %) Tx packets: 2399607 (2381 Kpps) Tx batches: 171400 (14.00 pkts/batch) Signed-off-by: Jan Scheurich <jan.scheurich@ericsson.com> Acked-by: Billy O'Mahony <billy.o.mahony@intel.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2018-05-11 08:08:24 +01:00
Jan Scheurich	8492adc270	netdev: Add optional qfill output parameter to rxq_recv() If the caller provides a non-NULL qfill pointer and the netdev implementation supports reading the rx queue fill level, the rxq_recv() function returns the remaining number of packets in the rx queue after reception of the packet burst to the caller. If the implementation does not support this, it returns -ENOTSUP instead. Reading the remaining queue fill level should not substantially slow down the recv() operation. A first implementation is provided for ethernet and vhostuser DPDK ports in netdev-dpdk.c. This output parameter will be used in the upcoming commit for PMD performance metrics to supervise the rx queue fill level for DPDK vhostuser ports. Signed-off-by: Jan Scheurich <jan.scheurich@ericsson.com> Acked-by: Billy O'Mahony <billy.o.mahony@intel.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2018-05-11 08:08:24 +01:00
Ben Pfaff	f825fdd4ff	flow: Improve type-safety of MINIFLOW_GET_TYPE. Until now, this macro has blindly read the passed-in type's size, but that's unnecessarily risky. This commit changes it to verify that the passed-in type is the same size as the field and, on GCC and Clang, that the types are compatible. It also adds a version that does not check, for the one case where (currently) we deliberately read the wrong size, and updates a few uses to use more precise field names. Signed-off-by: Ben Pfaff <blp@ovn.org> Reviewed-by: Yifeng Sun <pkusunyifeng@gmail.com> Reviewed-by: Armando Migliaccio <armamig@gmail.com>	2018-03-31 11:31:51 -07:00
Justin Pettit	97bf8f478d	Don't shadow iterator values. Signed-off-by: Justin Pettit <jpettit@ovn.org> Acked-by: Ben Pfaff <blp@ovn.org>	2018-02-28 14:53:29 -08:00
Justin Pettit	e883448e3f	dp-packet: Add index to DP_PACKET_BATCH_FOR_EACH to prevent shadowing. Signed-off-by: Justin Pettit <jpettit@ovn.org> Acked-by: Ben Pfaff <blp@ovn.org>	2018-02-28 14:53:27 -08:00
Yi-Hung Wei	271e48a0e2	conntrack: Support conntrack flush by ct 5-tuple This patch adds support of flushing a conntrack entry specified by the conntrack 5-tuple in dpif-netdev. Signed-off-by: Yi-Hung Wei <yihung.wei@gmail.com> Signed-off-by: Ben Pfaff <blp@ovn.org> Acked-by: Darrell Ball <dlu998@gmail.com>	2018-02-14 13:59:09 -08:00
Ben Pfaff	0d71302e36	ofp-util, ofp-parse: Break up into many separate modules. ofp-util had been far too large and monolithic for a long time. This commit breaks it up into units that make some logical sense. It also moves the pieces of ofp-parse that were specific to each unit into the relevant unit. Most of this commit is just moving code around. Signed-off-by: Ben Pfaff <blp@ovn.org> Reviewed-by: Yifeng Sun <pkusunyifeng@gmail.com>	2018-02-13 10:43:13 -08:00
Eric Garver	1fe178d251	dpif: Add support for OVS_ACTION_ATTR_CT_CLEAR This supports using the ct_clear action in the kernel datapath. To preserve compatibility with current ct_clear behavior on old kernels, we only pass this action down to the datapath if a probe reveals the datapath actually supports it. Signed-off-by: Eric Garver <e@erig.me> Acked-by: William Tu <u9012063@gmail.com> Acked-by: Flavio Leitner <fbl@sysclose.org> Signed-off-by: Justin Pettit <jpettit@ovn.org>	2018-01-20 11:16:37 -08:00
Kevin Traynor	2a2c67b435	dpif-netdev: Add percentage of pmd/core used by each rxq. It is based on the length of history that is stored about an rxq (currently 1 min). $ ovs-appctl dpif-netdev/pmd-rxq-show pmd thread numa_id 0 core_id 4: isolated : false port: dpdkphy1 queue-id: 0 pmd usage: 70 % port: dpdkvhost0 queue-id: 0 pmd usage: 0 % pmd thread numa_id 0 core_id 6: isolated : false port: dpdkphy0 queue-id: 0 pmd usage: 64 % port: dpdkvhost1 queue-id: 0 pmd usage: 0 % These values are what would be used as part of rxq to pmd assignment due to a reconfiguration event e.g. adding pmds, adding rxqs or with the command: ovs-appctl dpif-netdev/pmd-rxq-rebalance Signed-off-by: Jan Scheurich <jan.scheurich@ericsson.com> Co-authored-by: Jan Scheurich <jan.scheurich@ericsson.com> Signed-off-by: Kevin Traynor <ktraynor@redhat.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2018-01-17 18:11:28 +00:00
Kevin Traynor	4f5d13e241	dpif-netdev: Reset the rxq current cycle counter on reload. An rxq may have processing cycles counted in the current counter when a reload happens. That could temporarily create a small skew on the stats for an rxq. Reset the counter after reload. Fixes: 4809891b2e01 ("dpif-netdev: Count the rxq processing cycles for an rxq.") Signed-off-by: Kevin Traynor <ktraynor@redhat.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2018-01-17 18:11:28 +00:00
Ilya Maximets	c71ea3c4a7	dpif-netdev: Time based output batching. This allows to collect packets from more than one RX burst and send them together with a configurable intervals. 'other_config:tx-flush-interval' can be used to configure time that a packet can wait in output batch for sending. 'tx-flush-interval' has microsecond resolution. Tested-by: Jan Scheurich <jan.scheurich@ericsson.com> Acked-by: Jan Scheurich <jan.scheurich@ericsson.com> Signed-off-by: Ilya Maximets <i.maximets@samsung.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2018-01-17 18:11:28 +00:00
Ilya Maximets	58ed6df048	dpif-netdev: Count cycles on per-rxq basis. Upcoming time-based output batching will allow to collect in a single output batch packets from different RX queues. Lets keep the list of RX queues for each output packet and collect cycles for them on send. Tested-by: Jan Scheurich <jan.scheurich@ericsson.com> Acked-by: Jan Scheurich <jan.scheurich@ericsson.com> Signed-off-by: Ilya Maximets <i.maximets@samsung.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2018-01-17 18:11:28 +00:00
Ilya Maximets	05f9e707e1	dpif-netdev: Use microsecond granularity. Upcoming time-based output batching will require microsecond granularity for its flexible configuration. Acked-by: Jan Scheurich <jan.scheurich@ericsson.com> Acked-by: Ian Stokes <ian.stokes@intel.com> Signed-off-by: Ilya Maximets <i.maximets@samsung.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2018-01-17 18:11:28 +00:00
Jan Scheurich	a19896abe5	dpif-netdev: Refactor cycle counting Simplify the historically grown TSC cycle counting in PMD threads. Cycles are currently counted for the following purposes: 1. Measure PMD utilization PMD utilization is defined as the ratio of cycles spent in busy iterations (at least one packet received or sent) over the total number of cycles. This is already done in pmd_perf_start_iteration() and pmd_perf_end_iteration() based on a TSC timestamp saved in current iteration at start_iteration() and the actual TSC at end_iteration(). No dependency on intermediate cycle accounting. 2. Measure the processing load per RX queue This comprises cycles spent on polling and processing packets received from the rx queue and the cycles spent on delayed sending of these packets to tx queues (with time-based batching). The previous scheme using cycles_count_start(), cycles_count_intermediate() and cycles_count_end() originally introduced to simplify cycle counting and saving calls to rte_get_tsc_cycles() was rather obscuring things. Replace by a nestable cycle_timer with start and stop functions to embrace a code segment to be timed. The timed code may contain arbitrary nested cycle_timers. The duration of nested timers is excluded from the outer timer. The caller must ensure that each call to cycle_timer_start() is followed by a call to cycle_timer_end(). Failure to do so will lead to assertion failure or a memory leak. The new cycle_timer is used to measure the processing cycles per rx queue. This is not yet strictly necessary but will be made use of in a subsequent commit. All cycle count functions and data are relocated to module dpif-netdev-perf. Signed-off-by: Jan Scheurich <jan.scheurich@ericsson.com> Acked-by: Ilya Maximets <i.maximets@samsung.com> Acked-by: Billy O'Mahony <billy.o.mahony@intel.com> Signed-off: Ian Stokes <ian.stokes@intel.com>	2018-01-17 18:11:28 +00:00
Jan Scheurich	82a48ead4e	dpif-netdev: Refactor PMD performance into dpif-netdev-perf Add module dpif-netdev-perf to host all PMD performance-related data structures and functions in dpif-netdev. Refactor the PMD stats handling in dpif-netdev and delegate whatever possible into the new module, using clean interfaces to shield dpif-netdev from the implementation details. Accordingly, all the PMD statistics members are moved from the main struct dp_netdev_pmd_thread into a dedicated member of type struct pmd_perf_stats. Include Darrell's prior refactoring of PMD stats contained in [PATCH v5,2/3] dpif-netdev: Refactor some pmd stats: 1. The cycles per packet counts are now based on packets received rather than packet passes through the datapath. 2. Packet counters are now kept for packets received and packets recirculated. These are kept as separate counters for maintainability reasons. The cost of incrementing these counters is negligible. These new counters are also displayed to the user. 3. A display statistic is added for the average number of datapath passes per packet. This should be useful for user debugging and understanding of packet processing. 4. The user visible 'miss' counter is used for successful upcalls, rather than the sum of successful and unsuccessful upcalls. Hence, this becomes what the user historically understands by OVS 'miss upcall'. The user display is annotated to make this clear as well. 5. The user visible 'lost' counter remains as failed upcalls, but is annotated to make it clear what the meaning is. 6. The enum pmd_stat_type is annotated to make the usage of the stats counters clear. 7. The subtable lookup stats is renamed to make it clear that it relates to masked lookups. 8. The PMD stats test is updated to handle the new user stats of packets received, packets recirculated and average number of datapath passes per packet. On top of that introduce a "-pmd <core>" option to the PMD info commands to filter the output for a single PMD. 
Made the pmd-stats-show output a bit more readable by adding a blank between colon and value. Signed-off-by: Jan Scheurich <jan.scheurich@ericsson.com> Co-authored-by: Darrell Ball <dlu998@gmail.com> Signed-off-by: Darrell Ball <dlu998@gmail.com> Acked-by: Ilya Maximets <i.maximets@samsung.com> Acked-by: Billy O'Mahony <billy.o.mahony@intel.com> Signed-off: Ian Stokes <ian.stokes@intel.com>	2018-01-17 18:11:28 +00:00
Darrell Ball	875075b362	dpctl conntrack: Add get number of connections. A get command is added for number of conntrack connections. This command is only supported in the userspace datapath at this time. Signed-off-by: Darrell Ball <dlu998@gmail.com> Signed-off-by: Antonio Fischetti <antonio.fischetti@intel.com> Co-authored-by: Antonio Fischetti <antonio.fischetti@intel.com> Signed-off-by: Ben Pfaff <blp@ovn.org>	2018-01-09 11:17:44 -08:00
Darrell Ball	c92339ad19	dpctl conntrack: Add get and set maxconns command. Get and set dpctl commands are added for conntrack maxconns. These commands are only supported in the userspace datapath at this time. Signed-off-by: Darrell Ball <dlu998@gmail.com> Signed-off-by: Antonio Fischetti <antonio.fischetti@intel.com> Co-authored-by: Antonio Fischetti <antonio.fischetti@intel.com> Signed-off-by: Ben Pfaff <blp@ovn.org>	2018-01-09 11:16:44 -08:00
Yi Yang	f59cb331c4	nsh: rework NSH netlink keys and actions This patch changes OVS_KEY_ATTR_NSH to nested attribute and adds three new NSH sub attribute keys: OVS_NSH_KEY_ATTR_BASE: for length-fixed NSH base header OVS_NSH_KEY_ATTR_MD1: for length-fixed MD type 1 context OVS_NSH_KEY_ATTR_MD2: for length-variable MD type 2 metadata Its intention is to align to NSH kernel implementation. NSH match fields, set and PUSH_NSH action all use the below nested attribute format: OVS_KEY_ATTR_NSH begin OVS_NSH_KEY_ATTR_BASE OVS_NSH_KEY_ATTR_MD1 OVS_KEY_ATTR_NSH end or OVS_KEY_ATTR_NSH begin OVS_NSH_KEY_ATTR_BASE OVS_NSH_KEY_ATTR_MD2 OVS_KEY_ATTR_NSH end In addition, NSH encap and decap actions are renamed as push_nsh and pop_nsh to meet action naming convention. Signed-off-by: Yi Yang <yi.y.yang@intel.com> Signed-off-by: Ben Pfaff <blp@ovn.org>	2018-01-08 13:19:14 -08:00
Ben Pfaff	34944e81f0	Merge branch 'dpdk_merge' of https://github.com/istokes/ovs into HEAD	2018-01-02 07:45:17 -08:00
Ben Pfaff	b2befd5bb2	sparse: Add guards to prevent FreeBSD-incompatible #include order. FreeBSD insists that <sys/types.h> be included before <netinet/in.h> and that <netinet/in.h> be included before <arpa/inet.h>. This adds guards to the "sparse" headers to yield a warning if this order is violated. This commit also adjusts the order of many #includes to suit this requirement. Signed-off-by: Ben Pfaff <blp@ovn.org> Acked-by: Justin Pettit <jpettit@ovn.org>	2017-12-22 12:58:02 -08:00
Ilya Maximets	cc4891f39d	dpif-netdev: Count sent packets and batches. New statistics for 'pmd-stats-show' command: average number of packets per output batch. Acked-by: Eelco Chaudron <echaudro@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@samsung.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2017-12-20 21:07:46 +00:00
Ilya Maximets	b30896c969	netdev: Remove unused may_steal. Not needed anymore because 'may_steal' already handled on dpif-netdev layer and always true. Acked-by: Eelco Chaudron <echaudro@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@samsung.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2017-12-20 21:07:46 +00:00
Ilya Maximets	009e0033dc	dpif-netdev: Output packet batching. While processing an incoming batch of packets, they are scattered across many per-flow batches and sent separately. This becomes an issue while using more than a few flows. For example if we have balanced-tcp OvS bonding with 2 ports there will be 256 datapath internal flows for each dp_hash pattern. This will lead to scattering of a single received batch across all 256 of those per-flow batches and invoking send for each packet separately. This behaviour greatly degrades overall performance of netdev_send because of the inability to take advantage of vectorized transmit functions. But half (if 2 ports in bonding) of the datapath flows will have the same output actions. This means that we can collect them back in a single place and send them at once using a single call to netdev_send. This patch introduces a per-port packet batch for output packets for that purpose. 'output_pkts' batch is thread local and located in send port cache. Acked-by: Eelco Chaudron <echaudro@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@samsung.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2017-12-20 21:07:46 +00:00
Ilya Maximets	b010be1760	dpif-netdev: Keep latest measured time for PMD thread. In current implementation 'now' variable updated once on each receive cycle and passed through the whole datapath via function arguments. It'll be better to keep this variable inside PMD thread structure to be able to get it at any time. Such solution will save the stack memory and simplify possible modifications in current logic. This patch introduces new structure 'dp_netdev_pmd_thread_ctx' contained by 'struct dp_netdev_pmd_thread' to store any processing context of this PMD thread. For now, only time and cycles moved to that structure. Can be extended in the future. Acked-by: Eelco Chaudron <echaudro@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@samsung.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2017-12-20 21:07:46 +00:00
Darrell Ball	bd7d93f8b4	conntrack: Allow specified alg port numbers. Algs can use variable control port numbers for servers. The main use case is a kind of feeble security measure; the thinking being by some is that it obscures the alg traffic. It is really not very effective, but the kernel has this capability. This patch mimics the capability. Signed-off-by: Darrell Ball <dlu998@gmail.com> Signed-off-by: Ben Pfaff <blp@ovn.org> Acked-by: Aaron Conole <aconole@redhat.com>	2017-12-11 14:14:11 -08:00
Ben Pfaff	f0aa3801f1	dpif-netdev: Avoid "sparse" warning. "sparse" warns when odp_port_t is used directly in an inequality comparison. This avoids the warning. CC: Kevin Traynor <ktraynor@redhat.com> Fixes: a130f1a89bd8 ("dpif-netdev: Add port/queue tiebreaker to rxq_cycle_sort.") Signed-off-by: Ben Pfaff <blp@ovn.org> Acked-by: Kevin Traynor <ktraynor@redhat.com> Acked-by: Ian Stokes <ian.stokes@intel.com>	2017-12-11 13:42:53 -08:00
Ilya Maximets	d9d73f84ea	Revert "dpif_netdev: Refactor dp_netdev_pmd_thread structure." This reverts commit a807c15796ddc43ba1ffb2a6b0bd2ad4e2b73941. Padding and aligning of dp_netdev_pmd_thread structure members is useless, broken in several ways and only greatly degrades maintainability and extensibility of the structure. Issues: 1. It's not working because all the instances of struct dp_netdev_pmd_thread allocated only by usual malloc. All the memory is not aligned to cachelines -> structure almost never starts at aligned memory address. This means that any further paddings and alignments inside the structure are completely useless. For example: Breakpoint 1, pmd_thread_main (gdb) p pmd $49 = (struct dp_netdev_pmd_thread *) 0x1b1af20 (gdb) p &pmd->cacheline1 $51 = (OVS_CACHE_LINE_MARKER *) 0x1b1af60 (gdb) p &pmd->cacheline0 $52 = (OVS_CACHE_LINE_MARKER *) 0x1b1af20 (gdb) p &pmd->flow_cache $53 = (struct emc_cache *) 0x1b1afe0 All of the above addresses shifted from cacheline start by 32B. Can we fix it properly? NO. OVS currently doesn't have appropriate API to allocate aligned memory. The best candidate is 'xmalloc_cacheline()' but it clearly states that "The memory returned will not be at the start of a cache line, though, so don't assume such alignment". And also, this function will never return aligned memory on Windows or MacOS. 2. CACHE_LINE_SIZE is not constant. Different architectures have different cache line sizes, but the code assumes that CACHE_LINE_SIZE is always equal to 64 bytes. All the structure members are grouped by 64 bytes and padded to CACHE_LINE_SIZE. This leads to huge holes in the structures if CACHE_LINE_SIZE differs from 64. This is opposite to portability. If I want good performance of cmap I need to have CACHE_LINE_SIZE equal to the real cache line size, but I will have huge holes in the structures. 
If you take a look at struct rte_mbuf from DPDK you'll see that it uses 2 defines: RTE_CACHE_LINE_SIZE and RTE_CACHE_LINE_MIN_SIZE to avoid holes in mbuf structure. 3. Sizes of system/libc defined types are not constant for all the systems. For example, sizeof(pthread_mutex_t) == 48 on my ARMv8 machine, but only 40 on x86. The difference could be much bigger on Windows or MacOS systems. But the code assumes that sizeof(struct ovs_mutex) is always 48 bytes. This may lead to broken alignment/big holes in case of padding/wrong comments about amount of free pad bytes. 4. Sizes of many fields in the structure depend on defines like DP_N_STATS, PMD_N_CYCLES, EM_FLOW_HASH_ENTRIES and so on. Any change in these defines or any change in any structure contained by thread should lead to the not so simple refactoring of the whole dp_netdev_pmd_thread structure. This greatly reduces maintainability and complicates development of new features. 5. There is no reason to align flow_cache member because it's too big and we usually access random entries by single thread only. So, the padding/alignment only creates some visibility of performance optimization but does nothing useful in reality. It only complicates maintenance and adds huge holes for non-x86 architectures and non-Linux systems. Performance improvement stated in the original commit message should be random and not valuable. I see no performance difference. Most of the above issues are also true for some other padded/aligned structures like 'struct netdev_dpdk'. They will be treated separately. CC: Bhanuprakash Bodireddy <bhanuprakash.bodireddy@intel.com> CC: Ben Pfaff <blp@ovn.org> Signed-off-by: Ilya Maximets <i.maximets@samsung.com> Acked-by: Jan Scheurich <jan.scheurich@ericsson.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2017-12-08 21:42:54 +00:00
Yifeng Sun	d1ce9c2033	dpif-netdev: Fix memory leak Valgrind complains in test 1019 (dpctl - add-if set-if del-if): 4,850,896 (4,850,240 direct, 656 indirect) bytes in 1 blocks are definitely lost in loss record 364 of 364 by 0x517062: xcalloc (util.c:103) by 0x46CBBC: dp_netdev_set_nonpmd (dpif-netdev.c:4498) by 0x46CBBC: create_dp_netdev (dpif-netdev.c:1299) by 0x46CBBC: dpif_netdev_open (dpif-netdev.c:1337) by 0x472CB0: do_open (dpif.c:350) by 0x472E6F: dpif_create (dpif.c:404) by 0x472E6F: dpif_create_and_open (dpif.c:417) by 0x430EBC: open_dpif_backer (ofproto-dpif.c:727) by 0x430EBC: construct (ofproto-dpif.c:1411) by 0x41B714: ofproto_create (ofproto.c:539) by 0x40C84E: bridge_reconfigure (bridge.c:647) by 0x4104C5: bridge_run (bridge.c:2998) by 0x406FA4: main (ovs-vswitchd.c:119) The reference count wasn't released at this earlier return. This fix passes the test 'make check'. Signed-off-by: Yifeng Sun <pkusunyifeng@gmail.com> Tested-by: Greg Rose <gvrose8192@gmail.com> Reviewed-by: Greg Rose <gvrose8192@gmail.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2017-12-08 21:42:54 +00:00
Kevin Traynor	8368866efb	dpif-netdev: Calculate rxq cycles prior to compare_rxq_cycles calls. compare_rxq_cycles sums the latest cycles from each queue for comparison with each other. While each comparison correctly gets the latest cycles, the cycles could change between calls to compare_rxq_cycle. In order to use consistent values through each call of compare_rxq_cycles, sum the cycles before qsort is called. Requested-by: Ilya Maximets <i.maximets@samsung.com> Signed-off-by: Kevin Traynor <ktraynor@redhat.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2017-12-08 21:42:54 +00:00
Kevin Traynor	cc131ac184	dpif-netdev: Rename rxq_cycle_sort to compare_rxq_cycles. This function is used for comparison between queues as part of the sort. It does not do the sort itself. As such, give it a more appropriate name. Suggested-by: Billy O'Mahony <billy.o.mahony@intel.com> Signed-off-by: Kevin Traynor <ktraynor@redhat.com> Acked-by: Billy O'Mahony Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2017-12-08 21:42:54 +00:00
Kevin Traynor	a130f1a89b	dpif-netdev: Add port/queue tiebreaker to rxq_cycle_sort. rxq_cycle_sort is used to compare rx queues by their measured number of cycles. In the event that they are equal, 0 could be returned. However, it is observed that returning 0 results in a different sort order on Windows/Linux. This is ok in practice but it causes a unit test failure for "1007: PMD - pmd-cpu-mask/distribution of rx queues" when running on different OS's. In order to have a consistent sort result across multiple OS's, introduce a tiebreaker of port/queue. Fixes: 655856ef39b9 ("dpif-netdev: Change rxq_scheduling to use rxq processing cycles.") Reported-by: Alin Gabriel Serdean <aserdean@ovn.org> Tested-by: Alin Gabriel Serdean <aserdean@ovn.org> Co-authored-by: Ilya Maximets <i.maximets@samsung.com> Signed-off-by: Ilya Maximets <i.maximets@samsung.com> Signed-off-by: Kevin Traynor <ktraynor@redhat.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2017-12-08 21:42:54 +00:00
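The two comparator changes described in the entries above (summing cycles once before qsort, and breaking ties by port and queue) can be sketched roughly as follows. This is a minimal illustration, not the OVS code: `struct rxq_sketch`, `compare_rxq_sketch`, `sort_rxqs_sketch`, the three-slot cycle history, and the use of a numeric port number for the tiebreak are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for an rx queue; not the OVS struct. */
struct rxq_sketch {
    int port_no;          /* Port the queue belongs to (assumed numeric). */
    int queue_id;         /* Queue index within the port. */
    uint64_t cycles[3];   /* Recent per-interval cycle measurements. */
    uint64_t total;       /* Sum of cycles[], filled in before sorting. */
};

/* Orders queues by descending total cycles.  Equal totals fall back to
 * ascending port number, then ascending queue id, so 0 is returned only
 * for entries with the same port and queue. */
static int
compare_rxq_sketch(const void *a_, const void *b_)
{
    const struct rxq_sketch *a = a_;
    const struct rxq_sketch *b = b_;

    if (a->total != b->total) {
        return a->total > b->total ? -1 : 1;
    }
    if (a->port_no != b->port_no) {
        return a->port_no < b->port_no ? -1 : 1;
    }
    if (a->queue_id != b->queue_id) {
        return a->queue_id < b->queue_id ? -1 : 1;
    }
    return 0;
}

/* Sums each queue's cycles once, then sorts.  The comparator only reads
 * the precomputed totals, so every comparison sees the same values. */
static void
sort_rxqs_sketch(struct rxq_sketch *rxqs, size_t n)
{
    assert(rxqs != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        rxqs[i].total = 0;
        for (size_t j = 0; j < 3; j++) {
            rxqs[i].total += rxqs[i].cycles[j];
        }
    }
    qsort(rxqs, n, sizeof rxqs[0], compare_rxq_sketch);
}
```

Because qsort makes no ordering guarantee for elements that compare equal, a comparator that can return 0 for distinct queues may produce different orders on different C libraries, which matches the Windows/Linux difference reported above.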
Yi-Hung Wei	817a76577f	ct-dpif,dpif-netlink: Support conntrack flush by ct 5-tuple This patch adds support of flushing a conntrack entry specified by the conntrack 5-tuple, and provides the implementation in dpif-netlink. The implementation of dpif-netlink in the linux datapath utilizes the NFNL_SUBSYS_CTNETLINK netlink subsystem to delete a conntrack entry in nf_conntrack. Future patches will add support for the userspace and Windows datapaths. VMWare-BZ: #1983178 Signed-off-by: Yi-Hung Wei <yihung.wei@gmail.com> Signed-off-by: Justin Pettit <jpettit@ovn.org>	2017-12-07 13:49:40 -08:00
Kevin Traynor	64bf452e68	dpif-netdev: Rename rxq_interval. rxq_interval was added before there were other #defines and code related to rxq intervals. Rename to rxq_next_cycles_store in order to make it more intuitive. Requested-by: Ilya Maximets <i.maximets@samsung.com> Signed-off-by: Kevin Traynor <ktraynor@redhat.com> Acked-by: Antonio Fischetti <antonio.fischetti@intel.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2017-11-16 16:24:11 +00:00
Kevin Traynor	d9f79b6a5c	dpif-netdev: Remove unnecessary resets on new rxqs. Commit 38259bd7eb21 (dpif-netdev: Initialize new rxqs in port_reconfigure().) added a memset for the dp_netdev_rxq of new rxq's to remove a valgrind warning for an index field in that struct. With the addition of that memset, it also means there are some existing resets on other fields in that struct that are no longer needed and gives the opportunity to simplify by removing them. Signed-off-by: Kevin Traynor <ktraynor@redhat.com> Signed-off-by: Ben Pfaff <blp@ovn.org>	2017-11-12 14:44:12 -08:00
Guoshuai Li	3f9d3836d6	dpif-netdev: Set MAX_RECIRC_DEPTH to 6. On an OVN gateway node with DPDK, the recirculation depth may exceed 5. Scenarios: a VM pings its own floating IP, or a VM pings the floating IP of another VM on the same network. This requires processing UNDNAT and SNAT in the LRouter egress and UNSNAT and DNAT in the LRouter ingress, and output to the Geneve tunnel also needs a recirculation. This triggers the warning: dpif_netdev(pmd36)|WARN|Packet dropped. Max recirculation depth exceeded. Signed-off-by: Guoshuai Li <ligs@dtdream.com> Signed-off-by: Ben Pfaff <blp@ovn.org>	2017-11-03 14:29:39 -07:00
Bhanuprakash Bodireddy	a807c15796	dpif_netdev: Refactor dp_netdev_pmd_thread structure. This commit introduces the following changes to the dp_netdev_pmd_thread structure. - Mark cachelines and in this process reorder a few members to avoid holes. - Align emc_cache to a cacheline. - Maintain the grouping of related member variables. - Add comments on the pad bytes wherever appropriate so that new member variables may be introduced to fill the holes in future. Below is how the structure looks with this commit. Member size OVS_CACHE_LINE_MARKER cacheline0; struct dp_netdev * dp; 8 struct cmap_node node; 8 pthread_cond_t cond; 48 OVS_CACHE_LINE_MARKER cacheline1; struct ovs_mutex cond_mutex; 48 pthread_t thread; 8 unsigned int core_id; 4 int numa_id; 4 OVS_CACHE_LINE_MARKER cacheline2; struct emc_cache flow_cache; 4849672 ###cachelineX: 64 bytes, 0 pad bytes#### struct cmap flow_table; 8 .... ###cachelineY: 59 bytes, 5 pad bytes#### struct dp_netdev_pmd_stats stats 40 .... ###cachelineZ: 48 bytes, 16 pad bytes### struct ovs_mutex port_mutex; 48 .... This change also improves the performance marginally. Signed-off-by: Bhanuprakash Bodireddy <bhanuprakash.bodireddy@intel.com> Signed-off-by: Ben Pfaff <blp@ovn.org>	2017-11-03 13:36:14 -07:00
Bhanuprakash Bodireddy	ee42dd70dc	dpif-netdev: Reorder elements in dp_netdev_rxq structure. By reordering elements in dp_netdev_rxq structure, pad bytes and a hole can be removed. Before: structure size: 104, sum holes: 1, sum padbytes:4, cachelines:2 After : structure size: 96, sum holes: 0, sum padbytes:0, cachelines:2 Signed-off-by: Bhanuprakash Bodireddy <bhanuprakash.bodireddy@intel.com> Signed-off-by: Ben Pfaff <blp@ovn.org>	2017-11-03 12:56:22 -07:00
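The effect of reordering members, as in the dp_netdev_rxq entry above, can be illustrated with a small sketch. The two structs below are hypothetical and are not the OVS definitions; the exact sizes depend on the platform ABI, so the code only relies on the reordered layout being no larger than the original.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout with small members interleaved between larger ones.
 * On common 64-bit ABIs the compiler inserts padding after 'enabled' and
 * after 'core' to satisfy alignment. */
struct rxq_before_sketch {
    uint8_t enabled;
    uint64_t cycles;
    uint8_t core;
    uint32_t index;
};

/* Same members, ordered from largest to smallest alignment, so the small
 * members share what used to be padding. */
struct rxq_after_sketch {
    uint64_t cycles;
    uint32_t index;
    uint8_t enabled;
    uint8_t core;
};

/* Returns the number of bytes saved by the reordering on this platform. */
static size_t
padding_saved_sketch(void)
{
    assert(sizeof(struct rxq_after_sketch)
           <= sizeof(struct rxq_before_sketch));
    return sizeof(struct rxq_before_sketch)
           - sizeof(struct rxq_after_sketch);
}
```

A tool such as pahole reports holes and pad bytes per structure in the same style as the before/after figures quoted in the commit message.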
Xiao Liang	fd016ae3fb	lib: Move lib/poll-loop.h to include/openvswitch Poll-loop is the core to implement main loop. It should be available in libopenvswitch. Signed-off-by: Xiao Liang <shaw.leon@gmail.com> Signed-off-by: Ben Pfaff <blp@ovn.org>	2017-11-03 10:47:55 -07:00
Ben Pfaff	38259bd7eb	dpif-netdev: Initialize new rxqs in port_reconfigure(). valgrind reported use of uninitialized data in port_reconfigure(), which was due to xrealloc() not initializing the newly added data, combined with dp_netdev_rxq_set_intrvl_cycles() reading 'intrvl_idx' from the added data. This avoids the warning. Signed-off-by: Ben Pfaff <blp@ovn.org> Acked-by: Kevin Traynor <ktraynor@redhat.com>	2017-10-27 10:01:33 -07:00
Andy Zhou	66a396d4ff	dpif-netdev: Use portable error code for zero rate meter band 'EBADRQC' is only defined on the Linux platform. Without this fix, the Travis macOS build fails. Switch to EDOM, which is more portable. Fixes: 2029ce9ac3a601 (dpif-netdev: Fix a zero-rate bug for meter) CC: Ali Volkan ATLI <volkan.atli@argela.com.tr> Signed-off-by: Andy Zhou <azhou@ovn.org> Acked-by: Joe Stringer <joe@ovn.org>	2017-09-29 12:35:59 -07:00
Ali Volkan ATLI	2029ce9ac3	dpif-netdev: Fix a zero-rate bug for meter The Open vSwitch daemon crashes (on receiving signal SIGFPE, arithmetic exception) when a controller tries to send a meter-mod message with zero rate. Signed-off-by: Ali Volkan ATLI <volkan.atli@argela.com.tr> Signed-off-by: Andy Zhou <azhou@ovn.org>	2017-09-27 10:35:28 -07:00
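The two meter entries above (rejecting a zero-rate band, and reporting it with the portable EDOM error code) can be sketched as follows. This is a minimal illustration under stated assumptions: `struct band_sketch`, `validate_bands_sketch`, and `burst_ms_sketch` are hypothetical names, and the division shown is only a plausible example of the kind of arithmetic that raises SIGFPE when the rate is zero, not the actual OVS computation.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical meter band; not the OVS or OpenFlow struct. */
struct band_sketch {
    uint32_t rate;        /* Configured rate; must be nonzero. */
    uint32_t burst_size;  /* Configured burst size. */
};

/* Returns 0 if every band has a nonzero rate, or EDOM otherwise.  EDOM is
 * defined by standard C, unlike the Linux-specific EBADRQC. */
static int
validate_bands_sketch(const struct band_sketch *bands, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (bands[i].rate == 0) {
            return EDOM;
        }
    }
    return 0;
}

/* Example of arithmetic that divides by the rate.  Integer division by
 * zero is undefined behavior and typically raises SIGFPE, so callers
 * must validate the band first. */
static uint64_t
burst_ms_sketch(const struct band_sketch *band)
{
    assert(band->rate != 0);
    return (uint64_t) band->burst_size * 1000 / band->rate;
}
```

Validating at configuration time means the packet-processing path never has to check the divisor.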
Bhanuprakash Bodireddy	899363ed03	dpif-netdev: Fix comments for pmd_load_cached_ports. Commit 57eebbb4c315 replaces thread local 'pmd->port_cache' with 'pmd->tnl_port_cache' and 'pmd->send_port_cache' maps. Update the comments accordingly. Fixes: 57eebbb4c315 ("Don't try to output on a device without txqs") Signed-off-by: Bhanuprakash Bodireddy <bhanuprakash.bodireddy@intel.com> Signed-off-by: Darrell Ball <dlu998@gmail.com>	2017-09-22 02:19:59 -07:00
Bhanuprakash Bodireddy	37eabc706e	dpif-netdev: Remove 'cnt' in dp_netdev_input__(). There is little use of 'cnt' variable in dp_netdev_input__(). Get rid of it and use dp_packet_batch_size() to initialize PKT_ARRAY_SIZE. Signed-off-by: Bhanuprakash Bodireddy <bhanuprakash.bodireddy@intel.com> Signed-off-by: Darrell Ball <dlu998@gmail.com>	2017-09-22 02:16:05 -07:00

751 Commits