mir/ovs - ovs - Mike's Git repositories

mirror of https://github.com/openvswitch/ovs synced 2025-08-29 05:18:13 +00:00

Author	SHA1	Message	Date
Jarno Rajahalme	a76a37efec	conntrack: Force commit. Userspace support for force commit. Signed-off-by: Jarno Rajahalme <jarno@ovn.org> Acked-by: Joe Stringer <joe@ovn.org>	2017-03-08 17:23:57 -08:00
Jarno Rajahalme	b80e259f8e	datapath: Add force commit. Upstream patch: commit dd41d33f0b033885211a5d6f3ee19e73238aa9ee Author: Jarno Rajahalme <jarno@ovn.org> Date: Thu Feb 9 11:22:00 2017 -0800 openvswitch: Add force commit. Stateful network admission policy may allow connections to one direction and reject connections initiated in the other direction. After policy change it is possible that for a new connection an overlapping conntrack entry already exists, where the original direction of the existing connection is opposed to the new connection's initial packet. Most importantly, conntrack state relating to the current packet gets the "reply" designation based on whether the original direction tuple or the reply direction tuple matched. If this "directionality" is wrong w.r.t. to the stateful network admission policy it may happen that packets in neither direction are correctly admitted. This patch adds a new "force commit" option to the OVS conntrack action that checks the original direction of an existing conntrack entry. If that direction is opposed to the current packet, the existing conntrack entry is deleted and a new one is subsequently created in the correct direction. Signed-off-by: Jarno Rajahalme <jarno@ovn.org> Acked-by: Pravin B Shelar <pshelar@ovn.org> Acked-by: Joe Stringer <joe@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Jarno Rajahalme <jarno@ovn.org> Acked-by: Joe Stringer <joe@ovn.org>	2017-03-08 17:23:46 -08:00
Jarno Rajahalme	4b27db644a	dpif-netdev: Simple DROP meter implementation. Meters may be used by any flow, so some kind of locking must be used. In this version we have an adaptive mutex for each meter, which may not be optimal for DPDK. However, this should serve as a basis for further improvement. A batch of packets is first tried as a whole, and only if some of the meter bands are hit, we need to process the packets individually. Signed-off-by: Jarno Rajahalme <jarno@ovn.org> Signed-off-by: Andy Zhou <azhou@ovn.org>	2017-03-08 13:09:44 -08:00
Jarno Rajahalme	5dddf96065	dpif: Meter framework. Add DPIF-level infrastructure for meters. Allow meter_set to modify the meter configuration (e.g. set the burst size if unspecified). Signed-off-by: Jarno Rajahalme <jarno@ovn.org> Signed-off-by: Andy Zhou <azhou@ovn.org>	2017-03-08 13:09:43 -08:00
Yang, Yi Y	6fcecb85ab	datapath: add Ethernet push and pop actions Upstream commit: commit 91820da6ae85904d95ed53bf3a83f9ec44a6b80a Author: Jiri Benc <jbenc@redhat.com> Date: Thu Nov 10 16:28:23 2016 +0100 openvswitch: add Ethernet push and pop actions It's not allowed to push Ethernet header in front of another Ethernet header. It's not allowed to pop Ethernet header if there's a vlan tag. This preserves the invariant that L3 packet never has a vlan tag. Based on previous versions by Lorand Jakab and Simon Horman. Signed-off-by: Lorand Jakab <lojakab@cisco.com> Signed-off-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: Jiri Benc <jbenc@redhat.com> Acked-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net> [Committer notes] Fix build with the upstream commit by folding in the required switch case enum handlers. Signed-off-by: Yi Yang <yi.y.yang@intel.com> Signed-off-by: Joe Stringer <joe@ovn.org>	2017-03-02 15:51:39 -08:00
Ciara Loftus	4c30b24602	dpif-netdev: Conditional EMC insert Unconditional insertion of EMC entries results in EMC thrashing at high numbers of parallel flows. When this occurs, the performance of the EMC often falls below that of the dpcls classifier, rendering the EMC practically useless. Instead of unconditionally inserting entries into the EMC when a miss occurs, use a 1% probability of insertion. This ensures that the most frequent flows have the highest chance of creating an entry in the EMC, and the probability of thrashing the EMC is also greatly reduced. The probability of insertion is configurable, via the other_config:emc-insert-inv-prob option. This value sets the average probability of insertion to 1/emc-insert-inv-prob. For example the following command changes the insertion probability to (on average) 1 in every 20 packets ie. 1/20 ie. 5%. ovs-vsctl set Open_vSwitch . other_config:emc-insert-inv-prob=20 Signed-off-by: Ciara Loftus <ciara.loftus@intel.com> Signed-off-by: Georg Schmuecking <georg.schmuecking@ericsson.com> Co-authored-by: Georg Schmuecking <georg.schmuecking@ericsson.com> Acked-by: Kevin Traynor <ktraynor@redhat.com> Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com>	2017-02-16 11:46:17 -08:00
Daniele Di Proietto	d4f6865c3f	dpif-netdev: Pass Openvswitch other_config smap to dpif. Currently we parse the 'other_config' column in Openvswitch table in bridge.c. We extract the values (just 'pmd-cpu-mask' for now) and we pass them down to the datapath, via different layers. If we want to pass other values to dpif-netdev.c (like we recently discussed) we would have to touch ofproto.c, ofproto-dpif.c and dpif.c. This patch sends the entire other_config column to dpif-netdev, so that dpif-netdev can extract the values it's interested in. No functional change. Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com> Acked-by: Ben Pfaff <blp@ovn.org>	2017-02-03 09:45:42 -08:00
Andy Zhou	72c84bc2db	dp-packet: Enhance packet batch APIs. One common use case of 'struct dp_packet_batch' is to process all packets in the batch in order. Add an iterator for this use case to simplify the logic of calling sites. Another common use case is to drop packets in the batch, by reading all packets, but writing back pointers of fewer packets. Add macros to support this use case. Signed-off-by: Andy Zhou <azhou@ovn.org> Acked-by: Jarno Rajahalme <jarno@ovn.org>	2017-01-26 17:35:29 -08:00
Andy Zhou	535e3acfa7	dpif-netdev: Add clone action Add support for userspace datapath clone action. The clone action provides an action envelope to enclose an action list. For example, with actions A, B, C and D, and an action list: A, clone(B, C), D The clone action will ensure that: - D will see the same packet, and any meta states, such as flow, as action B. - D will be executed regardless of whether B, or C drops a packet. They can only drop a clone. - When B drops a packet, clone will skip all remaining actions within the clone envelope. This feature is useful when we add meter action later: The meter action can be implemented as a simple action without its own envelope (unlike the sample action). When necessary, the flow translation layer can enclose a meter action in clone. The clone action is very similar to the OpenFlow clone action. This is by design to simplify vswitchd flow translation logic. Without datapath clone, vswitchd simulates the effect by inserting datapath actions to "undo" clone actions. The above flow will be translated into A, B, C, -C, -B, D. However, there are two issues: - The resulting datapath action list may be longer without using clone. - Some actions, such as NAT may not be possible to reverse. This patch implements clone() simply with packet copy. The performance can be improved with later patches, for example, to delay or avoid packet copy if possible. It seems datapath should have enough context to carry out such optimization without the userspace context. Signed-off-by: Andy Zhou <azhou@ovn.org> Acked-by: Jarno Rajahalme <jarno@ovn.org>	2017-01-23 22:58:34 -08:00
Andy Zhou	0526761391	dpif-netdev: Avoid sending probe packets When ofproto probes for datapath features, no packets should actually be sent to the network. This patch fixes the userspace by dropping probe packets before action execution. Signed-off-by: Andy Zhou <azhou@ovn.org> Acked-by: Jarno Rajahalme <jarno@ovn.org>	2017-01-23 22:58:07 -08:00
nickcooper-zhangtonghao	aeff7d9886	dpif-netdev: Avoids repeated addition of DP_STAT_LOST. CC: Daniele Di Proietto <diproiettod@vmware.com> Fixes: 8aaa125dab66 ("dpif-netdev: Share emc and fast path output batches.") Signed-off-by: nickcooper-zhangtonghao <nic@opencloud.tech> Acked-by: Ben Pfaff <blp@ovn.org> Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com>	2017-01-16 18:53:05 -08:00
Daniele Di Proietto	e32971b8dd	dpif-netdev: Centralized threads and queues handling code. Currently we have three different code paths that deal with pmd threads and queues, in response to different input 1. When a port is added 2. When a port is deleted 3. When the cpumask changes or a port must be reconfigured. 1. and 2. are carefully written to minimize disruption to the running datapath, while 3. brings down all the threads, reconfigures all the ports and restarts everything. This commit removes the three separate code paths by introducing the reconfigure_datapath() function, that takes care of adapting the pmd threads and queues to the current datapath configuration, no matter how we got there. This aims at simplifying maintenance and introduces a long overdue improvement: port reconfiguration (can happen quite frequently for dpdkvhost ports) is now done without shutting down the whole datapath, but just by temporarily removing the port that needs to be reconfigured (while the rest of the datapath is running). We now also recompute the rxq scheduling from scratch every time a port is added or deleted. This means that the queues will be more balanced, especially when dealing with explicit rxq-affinity from the user (without shutting down the threads and restarting them), but it also means that adding or deleting a port might cause existing queues to be moved between pmd threads. This negative effect can be avoided by taking into account the existing distribution when computing the new scheduling, but I considered code clarity and fast reconfiguration more important than optimizing port addition or removal (a port is added and removed only once, but can be reconfigured many times). Lastly, this commit moves the pmd threads state away from ovs-numa. Now the pmd threads state is kept only in dpif-netdev. Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com> Co-authored-by: Ilya Maximets <i.maximets@samsung.com> Signed-off-by: Ilya Maximets <i.maximets@samsung.com> Acked-by: Ilya Maximets <i.maximets@samsung.com>	2017-01-15 19:25:12 -08:00
Daniele Di Proietto	947dc56767	dpif-netdev: Use hmap for poll_list in pmd threads. A future commit will use this to determine if a queue is already contained in a pmd thread. To keep the behavior unaltered we now have to sort queues before printing them in pmd_info_show_rxq(). Also this commit introduces 'struct polled_queue' that will be used exclusively in the fast path, uses 'struct dp_netdev_rxq' from 'struct rxq_poll' and uses 'rx' for 'netdev_rxq' and 'rxq' for 'dp_netdev_rxq'. Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com> Acked-by: Ilya Maximets <i.maximets@samsung.com>	2017-01-15 19:25:12 -08:00
Daniele Di Proietto	f5d317a156	dpctl: Avoid making assumptions on pmd threads. Currently dpctl depends on ovs-numa module to delete and create flows on different pmd threads for pmd devices. The next commits will move away the pmd threads state from ovs-numa to dpif-netdev, so the ovs-numa interface will not be supported. Also, the assignment between ports and thread is an implementation detail of dpif-netdev, dpctl shouldn't know anything about it. This commit changes the dpif_flow_put() and dpif_flow_del() calls to iterate over all the pmd threads, if pmd_id is PMD_ID_NULL. A simple test is added. Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com> Acked-by: Ilya Maximets <i.maximets@samsung.com>	2017-01-15 19:25:12 -08:00
Daniele Di Proietto	82d765f6f8	dpif-netdev: Make 'static_tx_qid' const. Since previous commit, 'static_tx_qid' doesn't need to be atomic and is actually never touched (except for initialization), so it can be made const. Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com> Acked-by: Ilya Maximets <i.maximets@samsung.com>	2017-01-15 19:25:11 -08:00
Daniele Di Proietto	b9584f2122	dpif-netdev: Create pmd threads for every numa node. A lot of the complexity in the code that handles pmd threads and ports in dpif-netdev is due to the fact that we postpone the creation of pmd threads on a numa node until we have a port that needs to be polled on that particular node. Since the previous commit, a pmd thread with no ports will not consume any CPU, so it seems easier to create all the threads at once. This will also make future commits easier. Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com> Acked-by: Ilya Maximets <i.maximets@samsung.com>	2017-01-15 19:25:11 -08:00
Daniele Di Proietto	2788a1b138	dpif-netdev: Block pmd threads if there are no ports. There's no reason for a pmd thread to perform its main loop if there are no queues in its poll_list. This commit introduces a seq object on which the pmd thread can be blocked, if there are no queues. When the main thread wants to reload a pmd thread, it must now change the seq object (in case it's blocked) and set 'reload' to true. This is useful to avoid wasting CPU cycles and is also necessary for a future commit. Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com> Acked-by: Ilya Maximets <i.maximets@samsung.com>	2017-01-15 19:25:11 -08:00
Daniele Di Proietto	14e3e12ac3	dpif-netdev: Use a boolean instead of pmd->port_seq. There's no need for a sequence number, since the main thread has to wait for the pmd thread, so there's no chance that an update will be undetected. A seq object will be introduced for another purpose in the next commit, and changing this to boolean makes the code more readable. Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com> Acked-by: Ilya Maximets <i.maximets@samsung.com>	2017-01-15 19:25:11 -08:00
Daniele Di Proietto	57eebbb4c3	dpif-netdev: Don't try to output on a device without txqs. Tunnel devices have 0 txqs and don't support netdev_send(). While netdev_send() simply returns EOPNOTSUPP, the XPS logic is still executed on output, and that might be confused by devices with no txqs. It seems better to have different structures in the fast path for ports that support netdev_{push,pop}_header (tunnel devices), and ports that support netdev_send. With this we can also remove a branch in netdev_send(). This is also necessary for a future commit, which starts DPDK devices without txqs. Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com> Acked-by: Ilya Maximets <i.maximets@samsung.com>	2017-01-15 19:25:11 -08:00
Daniele Di Proietto	febf4a7a87	dpif-netdev: Take non_pmd_mutex to access tx cached ports. As documented in dp_netdev_pmd_thread, we must take non_pmd_mutex to access the tx port caches for the non pmd thread. Found by inspection. Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com> Acked-by: Ilya Maximets <i.maximets@samsung.com>	2017-01-15 19:25:11 -08:00
Daniele Di Proietto	7c26997257	dpif-netdev: Fix memory leak. We keep all the per-port classifiers around, since they can be reused, but when a pmd thread is destroyed we should free them. Found using valgrind. Fixes: 3453b4d62a98("dpif-netdev: dpcls per in_port with sorted subtables") Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com> Acked-by: Ilya Maximets <i.maximets@samsung.com> Acked-by: Ben Pfaff <blp@ovn.org>	2017-01-15 19:25:11 -08:00
Jarno Rajahalme	f4b835bb0f	dpcls: Avoid one 8-byte chunk in subtable mask. This patch allows skipping the 8-byte chunk comprising dp_hash and in_port in the subtable mask when dp_hash is wildcarded. This will slightly speed up the hash computation as one expensive function call to hash_add64() can be skipped. For each new netdev flow we wildcard in_port in the mask, so in the typical case where dp_hash is also wildcarded, the resulting 8-byte chunk will not be part of the subtable mask. This manipulation of the mask is possible as the datapath classifier is explicitly selected based on the in_port value, so that all the datapath flows in the selected classifier have an exact match on that in_port value. Given this, it is safe to ignore the in_port value when doing a lookup in the chosen classifier. Signed-off-by: Antonio Fischetti <antonio.fischetti@intel.com> Signed-off-by: Bhanuprakash Bodireddy <bhanuprakash.bodireddy@intel.com> Co-authored-by: Bhanuprakash Bodireddy <bhanuprakash.bodireddy@intel.com> Signed-off-by: Jarno Rajahalme <jarno@ovn.org> Co-authored-by: Jarno Rajahalme <jarno@ovn.org>	2017-01-10 14:11:02 -08:00
nickcooper-zhangtonghao	51c37a56d7	dpif-netdev: Uses the OVS_CORE_UNSPEC instead of magic numbers. This patch uses OVS_CORE_UNSPEC for the queue unpinned instead of "-1". More importantly, "-1" cast to unsigned int is equal to NON_PMD_CORE_ID. We make the distinction between them. Signed-off-by: nickcooper-zhangtonghao <nic@opencloud.tech> Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com>	2017-01-08 18:16:06 -08:00
Daniele Di Proietto	0f6a066f63	dpif: Return ENODEV from dpif_port_query_by_*() if there's no port. bridge_delete_or_reconfigure() deletes every interface that's not dumped by OFPROTO_PORT_FOR_EACH(). ofproto_dpif.c:port_dump_next(), used by OFPROTO_PORT_FOR_EACH, checks if the ofport is in the datapath by calling port_query_by_name(). If port_query_by_name() returns an error, the dump is interrupted. If port_query_by_name() returns ENODEV, the device doesn't exist and the dump can continue. port_query_by_name() for the userspace datapath returns ENOENT instead of ENODEV. This is expected by dpif_port_query_by_name(), but it's not handled correctly by port_dump_next(). dpif-netdev handles reconfiguration errors for an interface by deleting it from the datapath, so it's possible that a device is missing. When this happens we must make sure that port_dump_next() continues to dump other devices, otherwise they will be deleted and the two layers will have an inconsistent view. This commit fixes the problem by returning ENODEV from the userspace datapath if the port doesn't exist, and by documenting this clearly in the dpif interfaces. The problem was found while developing new code. Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com> Acked-by: Ben Pfaff <blp@ovn.org>	2017-01-06 15:12:44 -08:00
nickcooper-zhangtonghao	1dea14357f	ovs-vswitchd: Avoid segfault for "netdev" datapath. When the datapath, whose type is "netdev", processes packets in userspace action, it may cause a segmentation fault. In the dp_execute_userspace_action(), we pass the "wc" argument to dp_netdev_upcall() using NULL. In the dp_netdev_upcall() call tree, the "wc" will be used. For example, dp_netdev_upcall() uses the &wc->masks for debugging, and flow_wildcards_init_for_packet() uses the "wc" if we disable megaflow, which is described in more detail below. Segmentation fault in flow_wildcards_init_for_packet: #0 0x0000000000468fe8 flow_wildcards_init_for_packet lib/flow.c:1275 #1 0x0000000000436c0b upcall_cb ofproto/ofproto-dpif-upcall.c:1231 #2 0x000000000045bd96 dp_netdev_upcall lib/dpif-netdev.c:3857 #3 0x0000000000461bf3 dp_execute_userspace_action lib/dpif-netdev.c:4388 #4 dp_execute_cb lib/dpif-netdev.c:4521 #5 0x0000000000486ae2 odp_execute_actions lib/odp-execute.c:538 #6 0x00000000004607f9 dp_netdev_execute_actions lib/dpif-netdev.c:4627 #7 packet_batch_per_flow_execute lib/dpif-netdev.c:3927 #8 dp_netdev_input__ lib/dpif-netdev.c:4229 #9 0x0000000000460ba8 dp_netdev_input lib/dpif-netdev.c:4238 #10 dp_netdev_process_rxq_port lib/dpif-netdev.c:2873 #11 0x000000000046126e dpif_netdev_run lib/dpif-netdev.c:3000 #12 0x000000000042baf5 type_run ofproto/ofproto-dpif.c:504 #13 0x00000000004192bf ofproto_type_run ofproto/ofproto.c:1687 #14 0x0000000000409965 bridge_run__ vswitchd/bridge.c:2875 #15 0x000000000040f145 bridge_run vswitchd/bridge.c:2938 #16 0x00000000004062e5 main vswitchd/ovs-vswitchd.c:111 Signed-off-by: nickcooper-zhangtonghao <nic@opencloud.tech> Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com>	2016-12-09 10:43:27 -08:00
Joe Stringer	8611f9a468	lib: Use nl_attr_get_odp_port(). This helper is a little tidier than the alternative. Use it treewide. Signed-off-by: Joe Stringer <joe@ovn.org> Acked-by: Simon Horman <simon.horman@netronome.com>	2016-11-16 11:53:50 -08:00
Ilya Maximets	5dd57e80e6	dpif-netdev: Honor rxq affinity during pmd threads creation. Currently, if the user sets 'pmd-rxq-affinity' to cores on a different numa node, the rxqs may not be polled, because pmd threads will not be created there even if these cores are in 'pmd-cpu-mask'. Fix that by creating threads on all numa nodes that rxqs are assigned to. Fixes: 3eb67853c481 ("dpif-netdev: Introduce pmd-rxq-affinity.") Signed-off-by: Ilya Maximets <i.maximets@samsung.com> Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com>	2016-11-15 13:55:53 -08:00
Patrik Andersson	687a83254b	dpif-netdev: non-pmd thread static_tx_qid should be constant The non-pmd thread static_tx_qid is assumed to be equal to the highest core ID + 1. The function dp_netdev_del_pmds_on_numa() invalidates this assumption by re-distributing the static_tx_qid:s on all pmd and non-pmd threads of the "other" numa. There might be a number of unwanted effects due to the non-pmd thread static_tx_qid being changed. The actual fault, observed in OVS 2.5, was a crash due to the TX burst queues containing a NULL packet buffer pointer in the range of valid buffers, presumably caused by a race condition. In OVS 2.6 TX burst queues have been removed, nevertheless the current behavior is incorrect. The correction makes dp_netdev_del_pmds_on_numa() honor the constancy of the non-pmd static_tx_qid value by excluding all non-pmd threads from the deletion and from the re-ordering of the static_tx_qid. Signed-off-by: Patrik Andersson <patrik.r.andersson@ericsson.com> Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com>	2016-11-15 13:55:53 -08:00
Simon Horman	56a56874de	dpif-provider: Use ODPP_NONE in dp_netdev_flow_add() This appears to be the only place where ODPP_NONE is not used but could be. Signed-off-by: Simon Horman <simon.horman@netronome.com> Acked-by: Joe Stringer <joe@ovn.org>	2016-11-15 10:12:55 +01:00
Bhanuprakash Bodireddy	63906f18d7	dpcls: Use 32 packet batches for lookups. This patch increases the number of packets processed in a batch during a lookup from 16 to 32. Processing batches of 32 packets improves performance and also one of the internal loops can be avoided here. Signed-off-by: Bhanuprakash Bodireddy <bhanuprakash.bodireddy@intel.com> Co-authored-by: Antonio Fischetti <antonio.fischetti@intel.com> Signed-off-by: Antonio Fischetti <antonio.fischetti@intel.com> Acked-by: Jarno Rajahalme <jarno@ovn.org> Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com>	2016-11-14 14:25:25 -08:00
Daniele Di Proietto	8917677498	dpif-netdev: Fix windows build. OVS_ALIGNED_VAR(...) should be at the beginning of a definition, as the example in include/openvswitch/compiler.h shows. Fixes: 38ee0814978c ("dpif-netdev: Cache align netdev_flow_keys.") Reported-by: Joe Stringer <joe@ovn.org> Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com> Acked-by: Alin Gabriel Serdean <aserdean@cloudbasesolutions.com> Acked-by: Sairam Venugopal <vsairam@vmware.com>	2016-10-19 15:14:51 -07:00
Bhanuprakash Bodireddy	2e4450aa35	dpif-netdev: Reorder elements in dp_netdev_port structure. By reordering the data elements in dp_netdev_port structure, pad bytes can be reduced, thereby saving a cache line. Before: structure size:136, holes:3, sum padbytes:15, cachelines:3 After: structure size:128, holes:2, sum padbytes:7, cachelines:2 Signed-off-by: Bhanuprakash Bodireddy <bhanuprakash.bodireddy@intel.com> Co-authored-by: Antonio Fischetti <antonio.fischetti@intel.com> Signed-off-by: Antonio Fischetti <antonio.fischetti@intel.com> Acked-by: Daniele Di Proietto <diproiettod@vmware.com>	2016-10-17 18:35:03 -07:00
Bhanuprakash Bodireddy	38ee081497	dpif-netdev: Cache align netdev_flow_keys. Aligning the 'keys' array seems to have positive performance impact. Signed-off-by: Bhanuprakash Bodireddy <bhanuprakash.bodireddy@intel.com> Co-authored-by: Antonio Fischetti <antonio.fischetti@intel.com> Signed-off-by: Antonio Fischetti <antonio.fischetti@intel.com> Acked-by: Daniele Di Proietto <diproiettod@vmware.com>	2016-10-17 18:32:44 -07:00
Bhanuprakash Bodireddy	ad9f05812d	dpif-netdev: Add comments to dp_netdev_input__(). Add comments in dp_netdev_input__() to explain the reason behind clearing the flow batches before packet_batch_execute(). Signed-off-by: Bhanuprakash Bodireddy <bhanuprakash.bodireddy@intel.com> Co-authored-by: Antonio Fischetti <antonio.fischetti@intel.com> Signed-off-by: Antonio Fischetti <antonio.fischetti@intel.com> Acked-by: Daniele Di Proietto <diproiettod@vmware.com>	2016-10-17 18:32:44 -07:00
Daniele Di Proietto	01961bbdd3	dpdk: New module with some code from netdev-dpdk. There's a lot of code in netdev-dpdk which is not at all related to the netdev interface, mostly the library initialization code. This commit moves it to a new 'dpdk' module, to simplify 'netdev-dpdk'. Also a new module 'dpdk-stub' is introduced to implement some functions when DPDK is not available. This replaces the old 'netdev-nodpdk' module. Some redundant includes are removed or reorganized as a consequence. No functional change. CC: Aaron Conole <aconole@redhat.com> Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com> Acked-by: Aaron Conole <aconole@redhat.com> Tested-by: Aaron Conole <aconole@redhat.com>	2016-10-12 16:31:06 -07:00
Daniele Di Proietto	546e57d44c	dpif-netdev: Fix crash in dpif_netdev_execute(). dp_netdev_get_pmd() is allowed to return NULL (even if we call it with NON_PMD_CORE_ID) for different reasons: * Since we use RCU to protect pmd threads, it is possible that ovs_refcount_try_ref_rcu() has failed. * During reconfiguration we destroy every thread. This commit makes sure that we always handle the case when dp_netdev_get_pmd() returns NULL without crashing (the change in dpif_netdev_run() doesn't fix anything, because everything is happening in the main thread, but it's better to honor the interface in case we change our threading model). This actually fixes a pretty serious crash that happens if dpif_netdev_execute() is called from a non pmd thread while reconfiguration is happening. It can be triggered by enabling bfd (because it's handled by the monitor thread, which is a non pmd thread) on an interface and changing something that requires datapath reconfiguration (n_rxq, pmd-cpu-mask, mtu). A testcase that reproduces the race condition is included. This is a possible backtrace of the segfault: #0 0x000000000060c7f1 in dp_execute_cb (aux_=0x7f1dd2d2a320, packets_=0x7f1dd2d2a370, a=0x7f1dd2d2a658, may_steal=false) at ../lib/dpif-netdev.c:4357 #1 0x00000000006448b2 in odp_execute_actions (dp=0x7f1dd2d2a320, batch=0x7f1dd2d2a370, steal=false, actions=0x7f1dd2d2a658, actions_len=8, dp_execute_action=0x60c7a5 <dp_execute_cb>) at ../lib/odp-execute.c:538 #2 0x000000000060d00c in dp_netdev_execute_actions (pmd=0x0, packets=0x7f1dd2d2a370, may_steal=false, flow=0x7f1dd2d2ae70, actions=0x7f1dd2d2a658, actions_len=8, now=44965873) at ../lib/dpif-netdev.c:4577 #3 0x000000000060834a in dpif_netdev_execute (dpif=0x2b67b70, execute=0x7f1dd2d2a578) at ../lib/dpif-netdev.c:2624 #4 0x0000000000608441 in dpif_netdev_operate (dpif=0x2b67b70, ops=0x7f1dd2d2a5c8, n_ops=1) at ../lib/dpif-netdev.c:2654 #5 0x0000000000610a30 in dpif_operate (dpif=0x2b67b70, ops=0x7f1dd2d2a5c8, n_ops=1) at ../lib/dpif.c:1268 #6 0x000000000061098c in dpif_execute (dpif=0x2b67b70, execute=0x7f1dd2d2aa50) at ../lib/dpif.c:1233 #7 0x00000000005b9008 in ofproto_dpif_execute_actions__ (ofproto=0x2b69360, version=18446744073709551614, flow=0x7f1dd2d2ae70, rule=0x0, ofpacts=0x7f1dd2d2b100, ofpacts_len=16, indentation=0, depth=0, resubmits=0, packet=0x7f1dd2d2b5c0) at ../ofproto/ofproto-dpif.c:3806 #8 0x00000000005b907a in ofproto_dpif_execute_actions (ofproto=0x2b69360, version=18446744073709551614, flow=0x7f1dd2d2ae70, rule=0x0, ofpacts=0x7f1dd2d2b100, ofpacts_len=16, packet=0x7f1dd2d2b5c0) at ../ofproto/ofproto-dpif.c:3823 #9 0x00000000005dea9b in xlate_send_packet (ofport=0x2b98380, oam=false, packet=0x7f1dd2d2b5c0) at ../ofproto/ofproto-dpif-xlate.c:5792 #10 0x00000000005bab12 in ofproto_dpif_send_packet (ofport=0x2b98380, oam=false, packet=0x7f1dd2d2b5c0) at ../ofproto/ofproto-dpif.c:4628 #11 0x00000000005c3fc8 in monitor_mport_run (mport=0x2b8cd00, packet=0x7f1dd2d2b5c0) at ../ofproto/ofproto-dpif-monitor.c:287 #12 0x00000000005c3d9b in monitor_run () at ../ofproto/ofproto-dpif-monitor.c:227 #13 0x00000000005c3cab in monitor_main (args=0x0) at ../ofproto/ofproto-dpif-monitor.c:189 #14 0x00000000006a183a in ovsthread_wrapper (aux_=0x2b8afd0) at ../lib/ovs-thread.c:342 #15 0x00007f1dd75eb444 in start_thread (arg=0x7f1dd2d2c700) at pthread_create.c:333 #16 0x00007f1dd6e1d20d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109 Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com> Acked-by: Ben Pfaff <blp@ovn.org>	2016-10-12 14:51:02 -07:00
Jesse Gross	8d8ab6c2d5	tun-metadata: Manage tunnel TLV mapping table on a per-bridge basis. When using tunnel TLVs (at the moment, this means Geneve options), a controller must first map the class and type onto an appropriate OXM field so that it can be used in OVS flow operations. This table is managed using OpenFlow extensions. The original code that added support for TLVs made the mapping table global as a simplification. However, this is not really logically correct as the OpenFlow management commands are operating on a per-bridge basis. This removes the original limitation to make the table per-bridge. One nice result of this change is that it is generally clearer whether the tunnel metadata is in datapath or OpenFlow format. Rather than allowing ad-hoc format changes and trying to handle both formats in the tunnel metadata functions, the format is more clearly separated by function. Datapaths (both kernel and userspace) use datapath format and it is not changed during the upcall process. At the beginning of action translation, tunnel metadata is converted to OpenFlow format and flows and wildcards are translated back at the end of the process. As an additional benefit, this change improves performance in some flow setup situations by keeping the tunnel metadata in the original packet format in more cases. This helps when copies need to be made as the amount of data touched is only what is present in the packet rather than the maximum amount of metadata supported. Co-authored-by: Madhu Challa <challa@noironetworks.com> Signed-off-by: Madhu Challa <challa@noironetworks.com> Signed-off-by: Jesse Gross <jesse@kernel.org> Acked-by: Ben Pfaff <blp@ovn.org>	2016-09-19 09:52:22 -07:00
Daniele Di Proietto	e98d0cb3ac	netdev-dummy: Add dummy-internal class. "internal" netdevs are treated specially in OVS (e.g. for MTU), but the dummy datapath remaps both "system" and "internal" devices to the same "dummy" netdev class, so there's no way to discern those in tests. This commit adds a new "dummy-internal" netdev type, which will be used by the dummy datapath for internal ports, so that other parts of the code can understand which ports are internal just by looking at the netdev object. The alternative solution, using the original interface type ("internal") instead of the translated netdev type ("dummy"), is harder to implement, because in so many places only the netdev object is available. Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com> Acked-by: Ben Pfaff <blp@ovn.org>	2016-08-15 11:07:42 -07:00
Daniele Di Proietto	84dbfb2b69	dpif-netdev: Fix -Wformat warning on 32-bit build. Use the appropriate format specifier for size_t, otherwise the 32-bit build fails. Reported-at: https://travis-ci.org/openvswitch/ovs/jobs/151938383 Fixes: 3453b4d62a98("dpif-netdev: dpcls per in_port with sorted subtables") Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com> Acked-by: Joe Stringer <joe@ovn.org>	2016-08-12 17:56:43 -07:00
Jan Scheurich	3453b4d62a	dpif-netdev: dpcls per in_port with sorted subtables. The user-space datapath (dpif-netdev) consists of a first level "exact match cache" (EMC) matching on 5-tuples and the normal megaflow classifier. With many parallel packet flows (e.g. TCP connections) the EMC becomes inefficient and the OVS forwarding performance is determined by the megaflow classifier. The megaflow classifier (dpcls) consists of a variable number of hash tables (aka subtables), each containing megaflow entries with the same mask of packet header and metadata fields to match upon. A dpcls lookup matches a given packet against all subtables in sequence until it hits a match. As megaflow cache entries are by construction non-overlapping, the first match is the only match. Today the order of the subtables in the dpcls is essentially random so that on average a dpcls lookup has to visit N/2 subtables for a hit, where N is the total number of subtables. Even though every single hash-table lookup is fast, the performance of the current dpcls degrades when there are many subtables. How does the patch address this issue: In reality there is often a strong correlation between the ingress port and a small subset of subtables that have hits. The entire megaflow cache typically decomposes nicely into partitions that are hit only by packets entering from a range of similar ports (e.g. traffic from Phy -> VM vs. traffic from VM -> Phy). Therefore, maintaining a separate dpcls instance per ingress port with its subtable vector sorted by frequency of hits reduces the average number of subtable lookups in the dpcls to a minimum, even if the total number of subtables gets large. This is possible because megaflows always have an exact match on in_port, so every megaflow belongs to a unique dpcls instance. For thread safety, the PMD thread needs to block out revalidators during the periodic optimization. We use ovs_mutex_trylock() to avoid blocking the PMD. 
To monitor the effectiveness of the patch we have enhanced the ovs-appctl dpif-netdev/pmd-stats-show command with an extra line "avg. subtable lookups per hit" to report the average number of subtable lookups needed for a megaflow match. Ideally, this should be close to 1 and in almost all cases much smaller than N/2. The PMD tests have been adjusted to the additional line in pmd-stats-show. We have benchmarked an L3-VPN pipeline on top of a VXLAN overlay mesh. With pure L3 tenant traffic between VMs on different nodes the resulting netdev dpcls contains N=4 subtables. Each packet traversing the OVS datapath is subject to dpcls lookup twice due to the tunnel termination. Disabling the EMC, we have measured a baseline performance (in+out) of ~1.45 Mpps (64 bytes, 10K L4 packet flows). The average number of subtable lookups per dpcls match is 2.5. With the patch the average number of subtable lookups per dpcls match is reduced to 1 and the forwarding performance grows by ~50% to 2.13 Mpps. Even with EMC enabled, the patch improves the performance by 9% (for 1000 L4 flows) and 34% (for 50K+ L4 flows). As the actual number of subtables will often be higher in reality, we can assume that this is at the lower end of the speed-up one can expect from this optimization. Just running a parallel ping between the VXLAN tunnel endpoints increases the number of subtables and hence the average number of subtable lookups from 2.5 to 3.5 on master with a corresponding decrease of throughput to 1.2 Mpps. With the patch the parallel ping has no impact on the average number of subtable lookups and performance. The performance gain is then ~75%. Signed-off-by: Jan Scheurich <jan.scheurich@ericsson.com> Acked-by: Antonio Fischetti <antonio.fischetti@intel.com> Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com>	2016-08-12 14:38:15 -07:00
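The idea described in commit 3453b4d62a can be illustrated with a small model. This is only a sketch in Python, not the dpif-netdev code: the names `Subtable`, `PortClassifier`, `classify` and `optimize` are hypothetical, packets are plain dictionaries, and locking (the `ovs_mutex_trylock()` step) is left out. It keeps one classifier per ingress port, reorders that port's subtables by hit count, and reports the average number of subtables visited per hit.

```python
class Subtable:
    """All megaflows that share one mask (the set of fields matched on)."""

    def __init__(self, mask):
        self.mask = mask      # tuple of field names, in sorted order
        self.flows = {}       # tuple of field values -> actions
        self.hits = 0         # hits since the last optimize()

    def key_for(self, packet):
        return tuple(packet.get(field) for field in self.mask)


class PortClassifier:
    """Hypothetical per-in_port classifier with hit-sorted subtables."""

    def __init__(self):
        self.subtables = []
        self.hits = 0
        self.visited_on_hits = 0

    def insert(self, match, actions):
        mask = tuple(sorted(match))
        for subtable in self.subtables:
            if subtable.mask == mask:
                break
        else:
            subtable = Subtable(mask)
            self.subtables.append(subtable)
        subtable.flows[tuple(match[field] for field in mask)] = actions

    def lookup(self, packet):
        # Megaflows do not overlap, so the first match is the only match.
        for visited, subtable in enumerate(self.subtables, start=1):
            actions = subtable.flows.get(subtable.key_for(packet))
            if actions is not None:
                subtable.hits += 1
                self.hits += 1
                self.visited_on_hits += visited
                return actions
        return None

    def optimize(self):
        # Periodic step: most frequently hit subtables go first.
        self.subtables.sort(key=lambda s: s.hits, reverse=True)
        for subtable in self.subtables:
            subtable.hits = 0

    def avg_lookups_per_hit(self):
        return self.visited_on_hits / self.hits if self.hits else 0.0


classifiers = {}  # in_port -> PortClassifier


def classify(packet):
    dpcls = classifiers.get(packet["in_port"])
    return dpcls.lookup(packet) if dpcls else None


phy = classifiers.setdefault(1, PortClassifier())
phy.insert({"eth_type": 0x0806}, "flood")                            # rarely hit
phy.insert({"eth_type": 0x0800, "ip_dst": "10.0.0.2"}, "output:2")   # hot flow

hot = {"in_port": 1, "eth_type": 0x0800, "ip_dst": "10.0.0.2"}
for _ in range(10):
    classify(hot)
before = phy.avg_lookups_per_hit()

phy.optimize()
phy.hits = phy.visited_on_hits = 0   # restart the statistics window
for _ in range(10):
    classify(hot)
after = phy.avg_lookups_per_hit()
```

In this toy setup the hot flow sits in the second subtable at first, so every hit visits two subtables; after `optimize()` its subtable is first and every hit visits one.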
Jarno Rajahalme	da9cfca6e2	Revert "pvector: Expose non-concurrent priority vector." This reverts commit 8bdfe1313894047d44349fa4cf4402970865950f. I failed to see that lib/dpif-netdev.c actually needs the concurrency provided by pvector prior to this change. More specifically, when a subtable is removed, concurrent lookups may skip over another subtable swapped in to the place of the removed subtable in the vector. Since this was the only use of the non-concurrent pvector, it is cleaner to revert the whole patch. Reported-by: Jan Scheurich <jan.scheurich@ericsson.com> Signed-off-by: Jarno Rajahalme <jarno@ovn.org> Acked-by: Daniele Di Proietto <diproiettod@vmware.com>	2016-08-10 14:58:51 -07:00
Fischetti, Antonio	5b1c9c789d	dpcls_lookup: added comments. This patch adds some comments to the dpcls_lookup() function, which is one of the most important places where the userspace wildcard matching happens. The purpose is to give some more explanations on its design and also on how it works. Signed-off-by: Antonio Fischetti <antonio.fischetti@intel.com> Signed-off-by: Jarno Rajahalme <jarno@ovn.org> Acked-by: Jarno Rajahalme <jarno@ovn.org>	2016-08-05 13:48:38 -07:00
Ilya Maximets	9f7a3035d2	dpif-netdev: Fix xps revalidation. Revalidation should work in case of 'dynamic_txqs == true'. Fixes: 324c8374852a ("dpif-netdev: XPS (Transmit Packet Steering) implementation.") Signed-off-by: Ilya Maximets <i.maximets@samsung.com> Acked-by: Daniele Di Proietto <diproiettod@vmware.com>	2016-07-29 18:03:09 -07:00
Jarno Rajahalme	8bdfe13138	pvector: Expose non-concurrent priority vector. PMD threads use pvectors but do not need the overhead of the concurrent version. Expose the non-concurrent version for that use. Note that struct pvector is renamed as struct cpvector (for concurrent priority vector), and the former struct pvector_impl is now struct pvector. Signed-off-by: Jarno Rajahalme <jarno@ovn.org> Acked-by: Ben Pfaff <blp@ovn.org>	2016-07-29 11:12:08 -07:00
Daniele Di Proietto	66e4ad8aa4	conntrack: Add 'dl_type' parameter to conntrack_execute(). Now that dpif_execute has a 'flow' member, it's pretty easy to access the flow (or the matching megaflow) in dp_execute_cb(). This means that it's no longer necessary for the connection tracker to reextract 'dl_type' from the packet; it can be passed as a parameter. This change means that we have to slightly complicate test-conntrack to group the packets by dl_type before passing them to the connection tracker. Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com> Acked-by: Joe Stringer <joe@ovn.org>	2016-07-27 18:53:29 -07:00
Daniele Di Proietto	5d9cbb4cb8	dpif-netdev: Implement conntrack flush interface. New functions are implemented in the conntrack module to support this. Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com> Acked-by: Flavio Leitner <fbl@sysclose.org>	2016-07-27 18:52:13 -07:00
Daniele Di Proietto	4d4e68ed20	dpif-netdev: Implement conntrack dump functions. New functions are implemented in the conntrack module to support this. Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com> Acked-by: Flavio Leitner <fbl@sysclose.org>	2016-07-27 18:52:13 -07:00
Daniele Di Proietto	5cf3edb311	dpif-netdev: Execute conntrack action. This commit implements the OVS_ACTION_ATTR_CT action in dpif-netdev. To allow ofproto-dpif to detect the conntrack feature, flow_put will no longer discard flows with ct_* fields set. We still shouldn't allow flows with NAT bits set, since there is no support for NAT. Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com> Acked-by: Flavio Leitner <fbl@sysclose.org> Acked-by: Antonio Fischetti <antonio.fischetti@intel.com>	2016-07-27 18:52:13 -07:00
Thadeu Lima de Souza Cascardo	a3e8437a18	dpif-netdev: use the open_type when creating the local port. Instead of using the internal type, use the port_open_type when creating the local port. That makes sure that whenever dpif_port_query is used, the netdev open_type is returned instead of the "internal" type. For other ports, that is already the case, as the netdev type is used when creating the dp_netdev_port. That changes the output of dpctl when showing the local port, and also when trying to change its type. So, corresponding tests are fixed. Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@redhat.com> Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com>	2016-07-27 14:48:24 -07:00
Ilya Maximets	3eb67853c4	dpif-netdev: Introduce pmd-rxq-affinity. New 'other_config:pmd-rxq-affinity' field for Interface table to perform manual pinning of RX queues to desired cores. This functionality is required to achieve maximum performance because different kinds of ports have different costs of rx/tx operations and only the user can know the expected workload on different ports. Example: # ./bin/ovs-vsctl set interface dpdk0 options:n_rxq=4 \ other_config:pmd-rxq-affinity="0:3,1:7,3:8" Queue #0 pinned to core 3; Queue #1 pinned to core 7; Queue #2 not pinned; Queue #3 pinned to core 8. It's decided to automatically isolate cores that have rxq explicitly assigned to them because it's useful to keep constant polling rate on some performance critical ports while adding/deleting other ports without explicit pinning of all ports. Signed-off-by: Ilya Maximets <i.maximets@samsung.com> Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com>	2016-07-27 12:56:04 -07:00
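The `queue:core` format in the pmd-rxq-affinity example above can be read as in the following sketch. It is an illustration in Python, not the parser in OVS: the function name `parse_rxq_affinity` and its error handling are assumptions made here. It maps the string from the commit message onto the four queues requested with `n_rxq=4` and derives the set of cores that would be treated as isolated.

```python
def parse_rxq_affinity(spec, n_rxq):
    """Return a list of length n_rxq holding the pinned core id per queue,
    or None for queues that are not pinned (hypothetical helper)."""
    pinning = [None] * n_rxq
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        queue_str, core_str = item.split(":")
        queue, core = int(queue_str), int(core_str)
        if not 0 <= queue < n_rxq:
            raise ValueError(f"queue {queue} is outside 0..{n_rxq - 1}")
        pinning[queue] = core
    return pinning


pinning = parse_rxq_affinity("0:3,1:7,3:8", n_rxq=4)

# Cores with an explicitly assigned rxq are the ones the commit message
# describes as automatically isolated.
isolated_cores = sorted({core for core in pinning if core is not None})

for queue, core in enumerate(pinning):
    state = f"pinned to core {core}" if core is not None else "not pinned"
    print(f"Queue #{queue} {state}")
```

Queue 2 has no entry in the string, so it stays unpinned and is left for a non-isolated core to poll.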

552 Commits