mir/ovs - ovs - Mike's Git repositories

mir/ovs

mirror of https://github.com/openvswitch/ovs synced 2025-08-29 13:27:59 +00:00

Author	SHA1	Message	Date
Ilya Maximets	e7e9973b80	dpif-netdev: Forwarding optimization for flows with a simple match. There are cases where users might want simple forwarding or drop rules for all packets received from a specific port, e.g :: "in_port=1,actions=2" "in_port=2,actions=IN_PORT" "in_port=3,vlan_tci=0x1234/0x1fff,actions=drop" "in_port=4,actions=push_vlan:0x8100,set_field:4196->vlan_vid,output:3" There are also cases where complex OpenFlow rules can be simplified down to datapath flows with very simple match criteria. In theory, for very simple forwarding, OVS doesn't need to parse packets at all in order to follow these rules. "Simple match" lookup optimization is intended to speed up packet forwarding in these cases. Design: Due to various implementation constraints userspace datapath has following flow fields always in exact match (i.e. it's required to match at least these fields of a packet even if the OF rule doesn't need that): - recirc_id - in_port - packet_type - dl_type - vlan_tci (CFI + VID) - in most cases - nw_frag - for ip packets Not all of these fields are related to packet itself. We already know the current 'recirc_id' and the 'in_port' before starting the packet processing. It also seems safe to assume that we're working with Ethernet packets. So, for the simple OF rule we need to match only on 'dl_type', 'vlan_tci' and 'nw_frag'. 'in_port', 'dl_type', 'nw_frag' and 13 bits of 'vlan_tci' can be combined in a single 64bit integer (mark) that can be used as a hash in hash map. We are using only VID and CFI form the 'vlan_tci', flows that need to match on PCP will not qualify for the optimization. Workaround for matching on non-existence of vlan updated to match on CFI and VID only in order to qualify for the optimization. CFI is always set by OVS if vlan is present in a packet, so there is no need to match on PCP in this case. 'nw_frag' takes 2 bits of PCP inside the simple match mark. New per-PMD flow table 'simple_match_table' introduced to store simple match flows only. 'dp_netdev_flow_add' adds flow to the usual 'flow_table' and to the 'simple_match_table' if the flow meets following constraints: - 'recirc_id' in flow match is 0. - 'packet_type' in flow match is Ethernet. - Flow wildcards contains only minimal set of non-wildcarded fields (listed above). If the number of flows for current 'in_port' in a regular 'flow_table' equals number of flows for current 'in_port' in a 'simple_match_table', we may use simple match optimization, because all the flows we have are simple match flows. This means that we only need to parse 'dl_type', 'vlan_tci' and 'nw_frag' to perform packet matching. Now we make the unique flow mark from the 'in_port', 'dl_type', 'nw_frag' and 'vlan_tci' and looking for it in the 'simple_match_table'. On successful lookup we don't need to run full 'miniflow_extract()'. Unsuccessful lookup technically means that we have no suitable flow in the datapath and upcall will be required. So, in this case EMC and SMC lookups are disabled. We may optimize this path in the future by bypassing the dpcls lookup too. Performance improvement of this solution on a 'simple match' flows should be comparable with partial HW offloading, because it parses same packet fields and uses similar flow lookup scheme. However, unlike partial HW offloading, it works for all port types including virtual ones. Performance results when compared to EMC: Test setup: virtio-user OVS virtio-user Testpmd1 ------------> pmd1 ------------> Testpmd2 (txonly) x<------ pmd2 <------------ (mac swap) Single stream of 64byte packets. Actions: in_port=vhost0,actions=vhost1 in_port=vhost1,actions=vhost0 Stats collected from pmd1 and pmd2, so there are 2 scenarios: Virt-to-Virt : Testpmd1 ------> pmd1 ------> Testpmd2. Virt-to-NoCopy : Testpmd2 ------> pmd2 --->x Testpmd1. Here the packet sent from pmd2 to Testpmd1 is always dropped, because the virtqueue is full since Testpmd1 is in txonly mode and doesn't receive any packets. This should be closer to the performance of a VM-to-Phy scenario. Test performed on machine with Intel Xeon CPU E5-2690 v4 @ 2.60GHz. Table below represents improvement in throughput when compared to EMC. +----------------+------------------------+------------------------+ \| \| Default (-g -O2) \| "-Ofast -march=native" \| \| Scenario +------------+-----------+------------+-----------+ \| \| GCC \| Clang \| GCC \| Clang \| +----------------+------------+-----------+------------+-----------+ \| Virt-to-Virt \| +18.9% \| +25.5% \| +10.8% \| +16.7% \| \| Virt-to-NoCopy \| +24.3% \| +33.7% \| +14.9% \| +22.0% \| +----------------+------------+-----------+------------+-----------+ For Phy-to-Phy case performance improvement should be even higher, but it's not the main use-case for this functionality. Performance difference for the non-simple flows is within a margin of error. Acked-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2022-01-07 20:32:20 +01:00
Ilya Maximets	20a4f546f7	dpif-netdev: Use PMD context to get the port for HW miss recovery. Last RX queue, from which the packet got received, is already stored in the PMD context. So, we can get the netdev from it without the expensive hash map lookup. In my V2V testing this patch improves performance in case HW offload and experimental APIs are enabled by about 3%. That narrows down the performance difference with the case with experimental API disabled to about 0.5%, which is way within a margin of error for that setup. Signed-off-by: Ilya Maximets <i.maximets@ovn.org> Acked-by: Eli Britstein <elibr@nvidia.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2021-12-09 22:46:47 +01:00
Kevin Traynor	a83a406096	dpif-netdev: Sync PMD ALB state with user commands. Previously, when a user enabled PMD auto load balancer with pmd-auto-lb="true", some conditions such as number of PMDs/RxQs that were required for a rebalance to take place were checked. If the configuration meant that a rebalance would not take place then PMD ALB was logged as 'disabled' and not run. Later, if the PMD/RxQ configuration changed whereby a rebalance could be effective, PMD ALB was logged as 'enabled' and would run at the appropriate time. This worked ok from a functional view but it is unintuitive for the user reading the logs. e.g. with one PMD (PMD ALB would not be effective) User enables ALB, but logs say it is disabled because it won't run. $ ovs-vsctl set open_vSwitch . other_config:pmd-auto-lb="true" \|dpif_netdev\|INFO\|PMD auto load balance is disabled No dry run takes place. Add more PMDs (PMD ALB may be effective). $ ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=50 \|dpif_netdev\|INFO\|PMD auto load balance is enabled ... Dry run takes place. \|dpif_netdev\|DBG\|PMD auto load balance performing dry run. A better approach is to simply reflect back the user enable/disable state in the logs and deal with whether the rebalance will be effective when needed. That is the approach taken in this patch. To cut down on unneccessary work, some basic checks are also made before starting a PMD ALB dry run and debug logs can indicate this to the user. e.g. with one PMD (PMD ALB would not be effective) User enables ALB, and logs confirm the user has enabled it. $ ovs-vsctl set open_vSwitch . other_config:pmd-auto-lb="true" \|dpif_netdev\|INFO\|PMD auto load balance is enabled... No dry run takes place. \|dpif_netdev\|DBG\|PMD auto load balance nothing to do, not enough non-isolated PMDs or RxQs. Add more PMDs (PMD ALB may be effective). $ ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=50 Dry run takes place. \|dpif_netdev\|DBG\|PMD auto load balance performing dry run. Signed-off-by: Kevin Traynor <ktraynor@redhat.com> Acked-by: Sunil Pai G <sunil.pai.g@intel.com> Reviewed-by: David Marchand <david.marchand@redhat.com> Acked-by: Gaetan Rivet <grive@u256.net> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2021-11-17 21:19:11 +01:00
Kevin Traynor	23083672b5	dpif-netdev: Reset RxQ cycles history on PMD reload. When a PMD reload occurs, some PMD cycle measurements are reset. In order to preserve the full cycles history of an Rxq, the RxQ cycle measurements were not reset. These are both used together to display the % of a PMD that an RxQ is using in the pmd-rxq-show stat. Resetting one and not the other can lead to some unintuitive looking stats while the stats settle for the new config. In some cases, it may appear like the RxQs are using >100% of a PMD. This resolves when the stats settle for the new config, but seeing RxQs apparently using >100% of a PMD may confuse a user and lead them to think there is a bug. To avoid this, reset the RxQ cycle measurements on PMD reload. Signed-off-by: Kevin Traynor <ktraynor@redhat.com> Acked-by: Maxime Coquelin <maxime.coquelin@redhat.com> Acked-by: Gaetan Rivet <grive@u256.net> Acked-by: David Marchand <david.marchand@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2021-11-16 16:56:40 +01:00
Eelco Chaudron	9b20df73a6	dpctl: dpif: Allow viewing and configuring dp cache sizes. This patch adds a general way of viewing/configuring datapath cache sizes. With an implementation for the netlink interface. The ovs-dpctl/ovs-appctl show commands will display the current cache sizes configured: $ ovs-dpctl show system@ovs-system: lookups: hit:25 missed:63 lost:0 flows: 0 masks: hit:282 total:0 hit/pkt:3.20 cache: hit:4 hit-rate:4.54% caches: masks-cache: size:256 port 0: ovs-system (internal) port 1: br-int (internal) port 2: genev_sys_6081 (geneve: packet_type=ptap) port 3: br-ex (internal) port 4: eth2 port 5: sw0p1 (internal) port 6: sw0p3 (internal) A specific cache can be configured as follows: $ ovs-appctl dpctl/cache-set-size DP CACHE SIZE $ ovs-dpctl cache-set-size DP CACHE SIZE For example to disable the cache do: $ ovs-dpctl cache-set-size system@ovs-system masks-cache 0 Setting cache size successful, new size 0. Signed-off-by: Eelco Chaudron <echaudro@redhat.com> Acked-by: Paolo Valerio <pvalerio@redhat.com> Acked-by: Flavio Leitner <fbl@sysclose.org> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2021-11-08 21:48:05 +01:00
Eelco Chaudron	efd55eb34c	dpctl: dpif: Add kernel datapath cache hit output. This patch adds cache usage statistics to the output: $ ovs-dpctl show system@ovs-system: lookups: hit:24 missed:71 lost:0 flows: 0 masks: hit:334 total:0 hit/pkt:3.52 cache: hit:4 hit-rate:4.21% port 0: ovs-system (internal) port 1: genev_sys_6081 (geneve: packet_type=ptap) port 2: br-int (internal) port 3: br-ex (internal) port 4: eth2 port 5: sw1p1 (internal) port 6: sw0p4 (internal) Signed-off-by: Eelco Chaudron <echaudro@redhat.com> Acked-by: Flavio Leitner <fbl@sysclose.org> Acked-by: Paolo Valerio <pvalerio@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2021-11-08 21:48:05 +01:00
Ilya Maximets	02aebad3fb	dpif-netdev: Fix use-after-free on PACKET_OUT of IP fragments. IP fragmentation engine may not only steal the packet but also add more. For example, after receiving the last fragment, it will add all previous fragments to a batch. Unfortunately, it will also free the original last fragment and replace it with a copy. This invalidates the 'packet_clone' pointer in the dpif_netdev_execute() leading to the use-after-free: ==3525086==ERROR: AddressSanitizer: heap-use-after-free on address 0x61600020439c at pc 0x000000688a6d READ of size 1 at 0x61600020439c thread T0 #0 0x688a6c in dp_packet_swap ./lib/dp-packet.h:265:5 #1 0x68781d in dpif_netdev_execute lib/dpif-netdev.c:4103:9 #2 0x6675db in dpif_netdev_operate lib/dpif-netdev.c:4129:25 #3 0x691e5e in dpif_operate lib/dpif.c:1367:13 #4 0x692909 in dpif_execute lib/dpif.c:1321:9 #5 0x5b19c6 in packet_execute ofproto/ofproto-dpif.c:4991:5 #6 0x5a2861 in ofproto_packet_out_finish ofproto/ofproto.c:3662:5 #7 0x5a65c6 in do_bundle_commit ofproto/ofproto.c:8270:13 #8 0x5a0cae in handle_bundle_control ofproto/ofproto.c:8309:17 #9 0x59a476 in handle_single_part_openflow ofproto/ofproto.c:8593:16 #10 0x5877ac in handle_openflow ofproto/ofproto.c:8674:21 #11 0x6296f1 in ofconn_run ofproto/connmgr.c:1329:13 #12 0x62925d in connmgr_run ofproto/connmgr.c:356:9 #13 0x586904 in ofproto_run ofproto/ofproto.c:1879:5 #14 0x55c830 in bridge_run__ vswitchd/bridge.c:3251:9 #15 0x55c015 in bridge_run vswitchd/bridge.c:3310:5 #16 0x575f31 in main vswitchd/ovs-vswitchd.c:127:9 #17 0x7f01099d3492 in __libc_start_main (/lib64/libc.so.6+0x23492) #18 0x47d96d in _start (vswitchd/ovs-vswitchd+0x47d96d) 0x61600020439c is located 28 bytes inside of 560-byte region freed by thread T0 here: #0 0x5177a8 in free (vswitchd/ovs-vswitchd+0x5177a8) #1 0x6b17b6 in dp_packet_delete ./lib/dp-packet.h:256:9 #2 0x6afeee in ipf_extract_frags_from_batch lib/ipf.c:947:17 #3 0x6afd63 in ipf_preprocess_conntrack lib/ipf.c:1232:9 #4 0x946b2c in conntrack_execute lib/conntrack.c:1446:5 #5 0x67e3ed in dp_execute_cb lib/dpif-netdev.c:8277:9 #6 0x7097d7 in odp_execute_actions lib/odp-execute.c:865:17 #7 0x66409e in dp_netdev_execute_actions lib/dpif-netdev.c:8322:5 #8 0x6877ad in dpif_netdev_execute lib/dpif-netdev.c:4090:5 #9 0x6675db in dpif_netdev_operate lib/dpif-netdev.c:4129:25 #10 0x691e5e in dpif_operate lib/dpif.c:1367:13 #11 0x692909 in dpif_execute lib/dpif.c:1321:9 #12 0x5b19c6 in packet_execute ofproto/ofproto-dpif.c:4991:5 #13 0x5a2861 in ofproto_packet_out_finish ofproto/ofproto.c:3662:5 #14 0x5a65c6 in do_bundle_commit ofproto/ofproto.c:8270:13 #15 0x5a0cae in handle_bundle_control ofproto/ofproto.c:8309:17 #16 0x59a476 in handle_single_part_openflow ofproto/ofproto.c:8593:16 #17 0x5877ac in handle_openflow ofproto/ofproto.c:8674:21 #18 0x6296f1 in ofconn_run ofproto/connmgr.c:1329:13 #19 0x62925d in connmgr_run ofproto/connmgr.c:356:9 #20 0x586904 in ofproto_run ofproto/ofproto.c:1879:5 #21 0x55c830 in bridge_run__ vswitchd/bridge.c:3251:9 #22 0x55c015 in bridge_run vswitchd/bridge.c:3310:5 #23 0x575f31 in main vswitchd/ovs-vswitchd.c:127:9 #24 0x7f01099d3492 in __libc_start_main (/lib64/libc.so.6+0x23492) The issue can be reproduced with system-userspace testsuite on the 'conntrack - IPv4 fragmentation with fragments specified' test. Previously, there was a leak inside the IP fragmentation module that kept the original segment, so 'packet_clone' remained a valid pointer. But commit 803ed12e31b0 ("ipf: release unhandled packets from the batch") fixed the leak leading to use-after-free. Using the packet from a batch instead of 'packet_clone' to swap packet content to avoid the issue. While investigating this problem, more issues uncovered. One of them is that IP fragmentation engine can add more packets to the batch, but there is no way to get them to a caller. Adding an extra branch for this case with a 'FIXME' comment in order to highlight the issue. Another one is that IP fragmentation engine will keep only 32 fragments dropping all other fragments while refilling a batch, but that should be fixed separately. Fixes: 7e6b41ac8d9d ("dpif-netdev: Fix crash when PACKET_OUT is metered.") Signed-off-by: Ilya Maximets <i.maximets@ovn.org> Acked-by: Aaron Conole <aconole@redhat.com>	2021-10-13 13:36:30 +02:00
linhuang	bfc6e9735c	dpif-netdev: Remove OVS_UNUSED flag in functions for ct_zone limits. The dpif is used, so remove the OVS_UNUSED flag. Fixes: a7f33fdbfb67 ("conntrack: Support zone limits.") Signed-off-by: linhuang <linhuang@ruijie.com.cn> Reviewed-by: David Marchand <david.marchand@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2021-09-23 00:27:51 +02:00
Tony van der Peet	7e6b41ac8d	dpif-netdev: Fix crash when PACKET_OUT is metered. When a PACKET_OUT has output port of OFPP_TABLE, and the rule table includes a meter and this causes the packet to be deleted, execute with a clone of the packet, restoring the original packet if it is changed by the execution. Add tests to verify the original issue is fixed, and that the fix doesn't break tunnel processing. Reported-by: Tony van der Peet <tony.vanderpeet@alliedtelesis.co.nz> Signed-off-by: Tony van der Peet <tony.vanderpeet@alliedtelesis.co.nz> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2021-09-08 17:52:35 +02:00
Harry van Haaren	0b3a5d7add	dpif-netdev: fix memory leak in dpif and mfex commands This patch fixes a memory leak in the commands for DPIF and MFEX get and set. In order to operate the commands require a pmd_list, which is currently not freed after it has been used. This issue was identified by a static analysis tool. Fixes: 3d8f47bc ("dpif-netdev: Add command line and function pointer for miniflow extract") Fixes: abb807e2 ("dpif-netdev: Add command to switch dpif implementation.") Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com> Acked-by: Kevin Traynor <ktraynor@redhat.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2021-08-16 15:53:47 +01:00
Harry van Haaren	01cbe1ed41	dpif-netdev: fix memory leak in dpcls subtable set command This patch fixes a memory leak when the command "dpif-netdev/subtable-lookup-prio-set" is run, the pmd_list required to iterate the PMD threads was not being freed. This issue was identified by a static analysis tool. Fixes: 3d018c3e ("dpif-netdev: Add subtable lookup prio set command.") Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com> Acked-by: Kevin Traynor <ktraynor@redhat.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2021-08-16 15:52:55 +01:00
kumar Amber	d2ad305a64	dpif-netdev: Fix dead code in mfex command The commit removes the dead code from the MFEX set command as highlighted by static tool analysis. Fixes: a395b132b7 ("dpif-netdev: Add packet count and core id paramters for study") Signed-off-by: kumar Amber <kumar.amber@intel.com> Acked-by: Kevin Traynor <ktraynor@redhat.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2021-08-16 15:18:12 +01:00
Eli Britstein	e21e9dcec2	dpif-netdev: Log flow modification in debug level. Log flow modifications to help debugging. Signed-off-by: Eli Britstein <elibr@nvidia.com> Reviewed-by: Gaetan Rivet <gaetanr@nvidia.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2021-08-02 19:33:03 +02:00
Eli Britstein	6f69e0e30b	dpif-netdev: Fix offloads of modified flows. Association of a mark to a flow is done as part of its offload handling, in the offloading thread. However, the PMD thread specifies whether an offload request is an "add" or "modify" by the association of a mark to the flow. This is exposed to a race condition. A flow might be created with actions that cannot be fully offloaded, for example flooding (before MAC learning), and later modified to have actions that can be fully offloaded. If the two requests are queued before the offload thread handling, they are both marked as "add". When the offload thread handles them, the first request is partially offloaded, and the second one is ignored as the flow is already considered as offloaded. Fix it by specifying add/modify of an offload request by the actual flow state change, without relying on the mark. Fixes: 3c7330ebf036 ("netdev-offload-dpdk: Support offload of output action.") Signed-off-by: Eli Britstein <elibr@nvidia.com> Reviewed-by: Gaetan Rivet <gaetanr@nvidia.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2021-08-02 18:56:56 +02:00
Eli Britstein	0d25621e4d	dpif-netdev: Fix flow modification after failure. dp_netdev_flow_offload_main thread is asynchronous, by the cited commit. There might be a case where there are modification requests of the same flow submitted before handled. Then, if the first handling fails, the rule for the flow is deleted, and the mark is freed. Then, the following one should not be handled as a modification, but rather as an "add". Fixes: 02bb2824e51d ("dpif-netdev: do hw flow offload in a thread") Signed-off-by: Eli Britstein <elibr@nvidia.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2021-08-02 18:56:56 +02:00
Eli Britstein	8917010b05	dpif-netdev: Do not flush PMD offloads on reload. Before flushing offloads of a removed port was supported by [1], it was necessary to flush the 'marks'. In doing so, all offloads of the PMD are removed, include the ones that are not related to the removed port and that are not modified following this removal. As a result such flows are evicted from being offloaded, and won't resume offloading. As PMD offload flush is not necessary, avoid it. [1] 62d1c28e9ce0 ("dpif-netdev: Flush offload rules upon port deletion.") Signed-off-by: Eli Britstein <elibr@nvidia.com> Reviewed-by: Gaetan Rivet <gaetanr@nvidia.com> Reviewed-by: David Marchand <david.marchand@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2021-08-02 18:56:47 +02:00
Ilya Maximets	cd36a34f33	dpif-netdev: Fix non-atomic read of smc_enable_db. Atomic variables must be read atomically. And since we already have 'smc_enable_db' in the PMD context, we need to use it from there to avoid reading atomically twice. Also, 'smc_enable_db' is a global configuration, there is no need to read it per-port or per-rxq. Fixes: 9ac84a1a3698 ("dpif-avx512: Add ISA implementation of dpif.") Acked-by: Eelco Chaudron <echaudro@redhat.com> Acked-by: Cian Ferriter <cian.ferriter@intel.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2021-08-02 18:52:44 +02:00
Mark Gray	b1e517bd2f	dpif-netlink: Introduce per-cpu upcall dispatch. The Open vSwitch kernel module uses the upcall mechanism to send packets from kernel space to user space when it misses in the kernel space flow table. The upcall sends packets via a Netlink socket. Currently, a Netlink socket is created for every vport. In this way, there is a 1:1 mapping between a vport and a Netlink socket. When a packet is received by a vport, if it needs to be sent to user space, it is sent via the corresponding Netlink socket. This mechanism, with various iterations of the corresponding user space code, has seen some limitations and issues: * On systems with a large number of vports, there is correspondingly a large number of Netlink sockets which can limit scaling. (https://bugzilla.redhat.com/show_bug.cgi?id=1526306) * Packet reordering on upcalls. (https://bugzilla.redhat.com/show_bug.cgi?id=1844576) * A thundering herd issue. (https://bugzilla.redhat.com/show_bug.cgi?id=1834444) This patch introduces an alternative, feature-negotiated, upcall mode using a per-cpu dispatch rather than a per-vport dispatch. In this mode, the Netlink socket to be used for the upcall is selected based on the CPU of the thread that is executing the upcall. In this way, it resolves the issues above as: a) The number of Netlink sockets scales with the number of CPUs rather than the number of vports. b) Ordering per-flow is maintained as packets are distributed to CPUs based on mechanisms such as RSS and flows are distributed to a single user space thread. c) Packets from a flow can only wake up one user space thread. Reported-at: https://bugzilla.redhat.com/1844576 Signed-off-by: Mark Gray <mark.d.gray@redhat.com> Acked-by: Flavio Leitner <fbl@sysclose.org> Acked-by: Aaron Conole <aconole@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2021-07-16 20:05:03 +02:00
David Marchand	3222a89d9a	dpif-netdev: Report overhead busy cycles per pmd. Users complained that per rxq pmd usage was confusing: summing those values per pmd would never reach 100% even if increasing traffic load beyond pmd capacity. This is because the dpif-netdev/pmd-rxq-show command only reports "pure" rxq cycles while some cycles are used in the pmd mainloop and adds up to the total pmd load. dpif-netdev/pmd-stats-show does report per pmd load usage. This load is measured since the last dpif-netdev/pmd-stats-clear call. On the other hand, the per rxq pmd usage reflects the pmd load on a 10s sliding window which makes it non trivial to correlate. Gather per pmd busy cycles with the same periodicity and report the difference as overhead in dpif-netdev/pmd-rxq-show so that we have all info in a single command. Example: $ ovs-appctl dpif-netdev/pmd-rxq-show pmd thread numa_id 1 core_id 3: isolated : true port: dpdk0 queue-id: 0 (enabled) pmd usage: 90 % overhead: 4 % pmd thread numa_id 1 core_id 5: isolated : false port: vhost0 queue-id: 0 (enabled) pmd usage: 0 % port: vhost1 queue-id: 0 (enabled) pmd usage: 93 % port: vhost2 queue-id: 0 (enabled) pmd usage: 0 % port: vhost6 queue-id: 0 (enabled) pmd usage: 0 % overhead: 6 % pmd thread numa_id 1 core_id 31: isolated : true port: dpdk1 queue-id: 0 (enabled) pmd usage: 86 % overhead: 4 % pmd thread numa_id 1 core_id 33: isolated : false port: vhost3 queue-id: 0 (enabled) pmd usage: 0 % port: vhost4 queue-id: 0 (enabled) pmd usage: 0 % port: vhost5 queue-id: 0 (enabled) pmd usage: 92 % port: vhost7 queue-id: 0 (enabled) pmd usage: 0 % overhead: 7 % Signed-off-by: David Marchand <david.marchand@redhat.com> Acked-by: Kevin Traynor <ktraynor@redhat.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2021-07-16 17:43:42 +01:00
Kevin Traynor	6193e03267	dpif-netdev: Allow pin rxq and non-isolate PMD. Pinning an rxq to a PMD with pmd-rxq-affinity may be done for various reasons such as reserving a full PMD for an rxq, or to ensure that multiple rxqs from a port are handled on different PMDs. Previously pmd-rxq-affinity always isolated the PMD so no other rxqs could be assigned to it by OVS. There may be cases where there is unused cycles on those pmds and the user would like other rxqs to also be able to be assigned to it by OVS. Add an option to pin the rxq and non-isolate the PMD. The default behaviour is unchanged, which is pin and isolate the PMD. In order to pin and non-isolate: ovs-vsctl set Open_vSwitch . other_config:pmd-rxq-isolate=false Note this is available only with group assignment type, as pinning conflicts with the operation of the other rxq assignment algorithms. Signed-off-by: Kevin Traynor <ktraynor@redhat.com> Acked-by: Sunil Pai G <sunil.pai.g@intel.com> Acked-by: David Marchand <david.marchand@redhat.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2021-07-16 16:51:57 +01:00
Kevin Traynor	3dd050909a	dpif-netdev: Add group rxq scheduling assignment type. Add an rxq scheduling option that allows rxqs to be grouped on a pmd based purely on their load. The current default 'cycles' assignment sorts rxqs by measured processing load and then assigns them to a list of round robin PMDs. This helps to keep the rxqs that require most processing on different cores but as it selects the PMDs in round robin order, it equally distributes rxqs to PMDs. 'cycles' assignment has the advantage in that it separates the most loaded rxqs from being on the same core but maintains the rxqs being spread across a broad range of PMDs to mitigate against changes to traffic pattern. 'cycles' assignment has the disadvantage that in order to make the trade off between optimising for current traffic load and mitigating against future changes, it tries to assign and equal amount of rxqs per PMD in a round robin manner and this can lead to a less than optimal balance of the processing load. Now that PMD auto load balance can help mitigate with future changes in traffic patterns, a 'group' assignment can be used to assign rxqs based on their measured cycles and the estimated running total of the PMDs. In this case, there is no restriction about keeping equal number of rxqs per PMD as it is purely load based. This means that one PMD may have a group of low load rxqs assigned to it while another PMD has one high load rxq assigned to it, as that is the best balance of their measured loads across the PMDs. Signed-off-by: Kevin Traynor <ktraynor@redhat.com> Acked-by: Sunil Pai G <sunil.pai.g@intel.com> Acked-by: David Marchand <david.marchand@redhat.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2021-07-16 16:51:47 +01:00
Kevin Traynor	4fb54652e0	dpif-netdev: Assign PMD for failed pinned rxqs. Previously, if pmd-rxq-affinity was used to pin an rxq to a core that was not in pmd-cpu-mask the rxq was not polled for and the user received a warning. This meant that no traffic would be received from that rxq. Now that pinned and non-pinned rxqs are assigned to PMDs in a common call to rxq scheduling, if an invalid core is selected in pmd-rxq-affinity the rxq can be assigned an available PMD (if any). A warning will still be logged as the requested core could not be used. Signed-off-by: Kevin Traynor <ktraynor@redhat.com> Acked-by: Sunil Pai G <sunil.pai.g@intel.com> Acked-by: David Marchand <david.marchand@redhat.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2021-07-16 16:51:35 +01:00
Kevin Traynor	0efefc4f98	dpif-netdev: Sort PMD list by core id for rxq scheduling. The list of PMDs is round robined through for the selection when assigning an rxq to a PMD. The list is based on a hash map, so there is no defined order. It means the same set of PMDs may get assigned different rxqs on different runs for no reason other than how the PMDs are stored in the hash map. This can be easily changed by sorting the PMDs by core id after they are extracted, so the PMDs will be used in a consistent order. Signed-off-by: Kevin Traynor <ktraynor@redhat.com> Acked-by: Sunil Pai G <sunil.pai.g@intel.com> Acked-by: David Marchand <david.marchand@redhat.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2021-07-16 16:51:22 +01:00
Kevin Traynor	58fed7e8d8	dpif-netdev: Make PMD auto load balance use common rxq scheduling. PMD auto load balance had its own separate implementation of the rxq scheduling that it used for dry runs. This was done because previously the rxq scheduling was not made reusable for a dry run. Apart from the code duplication (which is a good enough reason to replace it alone) this meant that if any further rxq scheduling changes or assignment types were added they would also have to be duplicated in the auto load balance code too. This patch replaces the current PMD auto load balance rxq scheduling code to reuse the common rxq scheduling code. The behaviour does not change from a user perspective, except the logs are updated to be more consistent. As the dry run will compare the pmd load variances for current and estimated assignments, new functions are added to populate the current assignments and use the rxq scheduling data structs for variance calculations. Now that the new rxq scheduling data structures are being used in PMD auto load balance, the older rr_* data structs and associated functions can be removed. Signed-off-by: Kevin Traynor <ktraynor@redhat.com> Acked-by: Sunil Pai G <sunil.pai.g@intel.com> Acked-by: David Marchand <david.marchand@redhat.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2021-07-16 16:51:12 +01:00
Kevin Traynor	f577c2d046	dpif-netdev: Rework rxq scheduling code. This reworks the current rxq scheduling code to break it into more generic and reusable pieces. The behaviour does not change from a user perspective, except the logs are updated to be more consistent. From an implementation view, there are some changes with mind to extending functionality. The high level reusable functions added in this patch are: - Generate a list of current numas and pmds - Perform rxq scheduling assignments into that list - Effect the rxq scheduling assignments so they are used Signed-off-by: Kevin Traynor <ktraynor@redhat.com> Acked-by: Sunil Pai G <sunil.pai.g@intel.com> Acked-by: David Marchand <david.marchand@redhat.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2021-07-16 16:51:01 +01:00
Eli Britstein	7ab851e1b8	dpif-netdev: Do not execute packet recovery without experimental support. rte_flow_get_restore_info() API is under experimental attribute. Using it has a performance impact that can be avoided for non-experimental compilation. Do not call it without experimental support. Reported-by: Cian Ferriter <cian.ferriter@intel.com> Signed-off-by: Eli Britstein <elibr@nvidia.com> Reviewed-by: Gaetan Rivet <gaetanr@nvidia.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2021-07-16 13:50:28 +02:00
Harry van Haaren	dc39608d2a	dpif/stats: Add miniflow extract opt hits counter This commit adds a new counter to be displayed to the user when requesting datapath packet statistics. It counts the number of packets that are parsed and a miniflow built up from it by the optimized miniflow extract parsers. The ovs-appctl command "dpif-netdev/pmd-perf-show" now has an extra entry indicating if the optimized MFEX was hit: - MFEX Opt hits: 6786432 (100.0 %) Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com> Acked-by: Flavio Leitner <fbl@sysclose.org> Acked-by: Eelco Chaudron <echaudro@redhat.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2021-07-16 11:31:14 +01:00
Kumar Amber	a395b132b7	dpif-netdev: Add packet count and core id paramters for study This commit introduces additional command line paramter for mfex study function. If user provides additional packet out it is used in study to compare minimum packets which must be processed else a default value is choosen. Also introduces a third paramter for choosing a particular pmd core. $ ovs-appctl dpif-netdev/miniflow-parser-set study 500 3 Signed-off-by: Kumar Amber <kumar.amber@intel.com> Acked-by: Flavio Leitner <fbl@sysclose.org> Acked-by: Eelco Chaudron <echaudro@redhat.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2021-07-16 11:11:13 +01:00
Kumar Amber	dd3f5d86d9	dpif-netdev: Add auto validation function for miniflow extract This patch introduced the auto-validation function which allows users to compare the batch of packets obtained from different miniflow implementations against the linear miniflow extract and return a hitmask. The autovaidator function can be triggered at runtime using the following command: $ ovs-appctl dpif-netdev/miniflow-parser-set autovalidator Signed-off-by: Kumar Amber <kumar.amber@intel.com> Co-authored-by: Harry van Haaren <harry.van.haaren@intel.com> Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com> Acked-by: Eelco Chaudron <echaudro@redhat.com> Acked-by: Flavio Leitner <fbl@sysclose.org> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2021-07-16 09:57:19 +01:00
Kumar Amber	3d8f47bc04	dpif-netdev: Add command line and function pointer for miniflow extract This patch introduces the MFEX function pointers which allows the user to switch between different miniflow extract implementations which are provided by the OVS based on optimized ISA CPU. The user can query for the available minflow extract variants available for that CPU by following commands: $ovs-appctl dpif-netdev/miniflow-parser-get Similarly an user can set the miniflow implementation by the following command : $ ovs-appctl dpif-netdev/miniflow-parser-set name This allows for more performance and flexibility to the user to choose the miniflow implementation according to the needs. Signed-off-by: Kumar Amber <kumar.amber@intel.com> Co-authored-by: Harry van Haaren <harry.van.haaren@intel.com> Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com> Acked-by: Eelco Chaudron <echaudro@redhat.com> Acked-by: Flavio Leitner <fbl@sysclose.org> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2021-07-16 09:56:58 +01:00
Cian Ferriter	d76a719a7a	dpif-netdev: Add a partial HWOL PMD statistic. It is possible for packets traversing the userspace datapath to match a flow before hitting on EMC by using a mark ID provided by a NIC. Add a PMD statistic for this hit. Signed-off-by: Cian Ferriter <cian.ferriter@intel.com> Acked-by: Flavio Leitner <fbl@sysclose.org> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2021-07-09 17:13:55 +01:00
Harry van Haaren	3f86fdf5ce	dpif-netdev: Add command to get dpif implementations. This commit adds a new command to retrieve the list of available DPIF implementations. This can be used by to check what implementations of the DPIF are available in any given OVS binary. It also returns which implementations are in use by the OVS PMD threads. Usage: $ ovs-appctl dpif-netdev/dpif-impl-get Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com> Co-authored-by: Cian Ferriter <cian.ferriter@intel.com> Signed-off-by: Cian Ferriter <cian.ferriter@intel.com> Acked-by: Flavio Leitner <fbl@sysclose.org> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2021-07-09 17:13:38 +01:00
Harry van Haaren	abb807e27d	dpif-netdev: Add command to switch dpif implementation. This commit adds a new command to allow the user to switch the active DPIF implementation at runtime. A probe function is executed before switching the DPIF implementation, to ensure the CPU is capable of running the ISA required. For example, the below code will switch to the AVX512 enabled DPIF assuming that the runtime CPU is capable of running AVX512 instructions: $ ovs-appctl dpif-netdev/dpif-impl-set dpif_avx512 A new configuration flag is added to allow selection of the default DPIF. This is useful for running the unit-tests against the available DPIF implementations, without modifying each unit test. The design of the testing & validation for ISA optimized DPIF implementations is based around the work already upstream for DPCLS. Note however that a DPCLS lookup has no state or side-effects, allowing the auto-validator implementation to perform multiple lookups and provide consistent statistic counters. The DPIF component does have state, so running two implementations in parallel and comparing output is not a valid testing method, as there are changes in DPIF statistic counters (side effects). As a result, the DPIF is tested directly against the unit-tests. Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com> Co-authored-by: Cian Ferriter <cian.ferriter@intel.com> Signed-off-by: Cian Ferriter <cian.ferriter@intel.com> Acked-by: Flavio Leitner <fbl@sysclose.org> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2021-07-09 17:13:24 +01:00
Harry van Haaren	9ac84a1a36	dpif-avx512: Add ISA implementation of dpif. This commit adds the AVX512 implementation of DPIF functionality, specifically the dp_netdev_input_outer_avx512 function. This function only handles outer (no re-circulations), and is optimized to use the AVX512 ISA for packet batching and other DPIF work. Sparse is not able to handle the AVX512 intrinsics, causing compile time failures, so it is disabled for this file. Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com> Co-authored-by: Cian Ferriter <cian.ferriter@intel.com> Signed-off-by: Cian Ferriter <cian.ferriter@intel.com> Co-authored-by: Kumar Amber <kumar.amber@intel.com> Signed-off-by: Kumar Amber <kumar.amber@intel.com> Acked-by: Flavio Leitner <fbl@sysclose.org> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2021-07-09 17:13:12 +01:00
Harry van Haaren	e540499e4f	dpif-netdev: Add function pointer for netdev input. This commit adds a function pointer to the pmd thread data structure, giving the pmd thread flexibility in its dpif-input function choice. This allows choosing of the implementation based on ISA capabilities of the runtime CPU, leading to optimizations and higher performance. Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com> Co-authored-by: Cian Ferriter <cian.ferriter@intel.com> Signed-off-by: Cian Ferriter <cian.ferriter@intel.com> Acked-by: Flavio Leitner <fbl@sysclose.org> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2021-07-09 17:13:00 +01:00
Harry van Haaren	5930dfeeb2	dpif-netdev: Refactor to multiple header files. Split the very large file dpif-netdev.c and the datastructures it contains into multiple header files. Each header file is responsible for the datastructures of that component. This logical split allows better reuse and modularity of the code, and reduces the very large file dpif-netdev.c to be more managable. Due to dependencies between components, it is not possible to move component in smaller granularities than this patch. To explain the dependencies better, eg: DPCLS has no deps (from dpif-netdev.c file) FLOW depends on DPCLS (struct dpcls_rule) DFC depends on DPCLS (netdev_flow_key) and FLOW (netdev_flow_key) THREAD depends on DFC (struct dfc_cache) DFC_PROC depends on THREAD (struct pmd_thread) DPCLS lookup.h/c require only DPCLS DPCLS implementations require only dpif-netdev-lookup.h. - This change was made in 2.12 release with function pointers - This commit only refactors the name to "private-dpcls.h" netdev_flow_key_equal_mf() is renamed to emc_flow_key_equal_mf(). Rename functions specific to dpcls from netdev_* namespace to the dpcls_* namespace, as they are only used by dpcls code. 'inline' is added to the dp_netdev_flow_hash() when it is moved definition to fix a compiler error. One valid checkpatch issue with the use of the EMC_FOR_EACH_POS_WITH_HASH() macro was fixed. Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com> Co-authored-by: Cian Ferriter <cian.ferriter@intel.com> Signed-off-by: Cian Ferriter <cian.ferriter@intel.com> Acked-by: Flavio Leitner <fbl@sysclose.org> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2021-07-09 17:12:45 +01:00
Paolo Valerio	ba16a36f35	dpif-netdev: Add all-zero SNAT to the advertised features of ct. Signed-off-by: Paolo Valerio <pvalerio@redhat.com> Acked-by: Aaron Conole <aconole@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2021-07-08 23:49:34 +02:00
Balazs Nemeth	aa4359cb9f	dpif-netdev: Read recirc depth and flow api enabled once per batch. The call to recirc_depth_get involves accessing a TLS value. So read that once, and store it on the stack for re-use while processing the batch. The same goes for reading netdev_is_flow_api_enabled(), a non-inlined function. Signed-off-by: Balazs Nemeth <bnemeth@redhat.com> Acked-by: Gaetan Rivet <grive@u256.net> Acked-by: Paolo Valerio <pvalerio@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2021-07-08 22:35:19 +02:00
Eelco Chaudron	e6ad4d8d9c	conntrack: Document all-zero IP SNAT behavior and add a test case. Currently, conntrack in the kernel has an undocumented feature referred to as all-zero IP address SNAT. Basically, when a source port collision is detected during the commit, the source port will be translated to an ephemeral port. If there is no collision, no SNAT is performed. This patchset documents this behavior and adds a self-test to verify it's not changing. In addition, a datapath feature flag is added for the all-zero IP SNAT case. This will help applications on top of OVS, like OVN, to determine this feature can be used. Signed-off-by: Eelco Chaudron <echaudro@redhat.com> Acked-by: Aaron Conole <aconole@redhat.com> Acked-by: Dumitru Ceara <dceara@redhat.com> Acked-by: Alin-Gabriel Serdean <aserdean@ovn.org> Acked-by: Paolo Valerio <pvalerio@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2021-07-08 21:19:14 +02:00
Yunjian Wang	828d9cb8d4	ovs: fix wrong quote Remove the coma character by using ' and " character. Signed-off-by: Yunjian Wang <wangyunjian@huawei.com> Signed-off-by: Ben Pfaff <blp@ovn.org>	2021-07-06 15:40:04 -07:00
Timothy Redaelli	772a842fb5	dpif-netdev: Apply subtable-lookup-prio-set on any datapath. Currently, if you try to set subtable-lookup-prio-set when you don't have any datapath (for example if an user wants to set AVX512 before creating any bridge) it sets it globally (dpcls_subtable_set_prio), but it returns an error: please specify an existing datapath ovs-appctl: ovs-vswitchd: server returned an error and, in this case, the exit code of ovs-appctl is 2. This commit changes the behaviour by removing the [datapath] optional parameter of subtable-lookup-prio-set and by changing the priority level on any datapath and globally. This means if you don't have any datapath or if you have only one datapath, the behaviour is the same as now, but without the confusing error when you don't have any datapath. Fixes: 3d018c3ea79d ("dpif-netdev: add subtable lookup prio set command.") Signed-off-by: Timothy Redaelli <tredaelli@redhat.com> Acked-by: Harry van Haaren <harry.van.haaren@intel.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2021-07-01 16:31:55 +02:00
Sriharsha Basavapatna	b5e6f6f6bf	dpif-netdev: Provide orig_in_port in metadata for tunneled packets. When an encapsulated packet is recirculated through a TUNNEL_POP action, the metadata gets reinitialized and the originating physical port information is lost. When this flow gets processed by the vport and it needs to be offloaded, we can't figure out the physical port through which the tunneled packet was received. Add a new member to the metadata: 'orig_in_port'. This is passed to the next stage during recirculation and the offload layer can use it to offload the flow to this physical port. Signed-off-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com> Signed-off-by: Eli Britstein <elibr@nvidia.com> Reviewed-by: Gaetan Rivet <gaetanr@nvidia.com> Tested-by: Emma Finn <emma.finn@intel.com> Tested-by: Marko Kovacevic <marko.kovacevic@intel.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2021-06-24 22:22:08 +02:00
Ilya Maximets	a1ec428037	netdev-offload: Disallow offloading to unrelated tunneling vports. 'linux_tc' flow API suitable only for tunneling vports with backing linux interfaces. DPDK flow API is not suitable for such ports. With this change we could drop vport restriction from dpif-netdev. This is a prerequisite for enabling vport offloading in DPDK. Signed-off-by: Ilya Maximets <i.maximets@ovn.org> Signed-off-by: Eli Britstein <elibr@nvidia.com> Reviewed-by: Gaetan Rivet <gaetanr@nvidia.com> Acked-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com> Tested-by: Emma Finn <emma.finn@intel.com> Tested-by: Marko Kovacevic <marko.kovacevic@intel.com>	2021-06-24 22:22:08 +02:00
Eli Britstein	bc341440d3	dpif-netdev: Add HW miss packet state recover logic. Recover the packet if it was partially processed by the HW. Fallback to lookup flow by mark association. Signed-off-by: Eli Britstein <elibr@nvidia.com> Reviewed-by: Gaetan Rivet <gaetanr@nvidia.com> Acked-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com> Tested-by: Emma Finn <emma.finn@intel.com> Tested-by: Marko Kovacevic <marko.kovacevic@intel.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2021-06-24 22:22:08 +02:00
Tonghao Zhang	f3ad560d5f	dpif-netdev: Expand the meter capacity. For now, ovs-vswitchd use the array of the dp_meter struct to store meter's data, and at most, there are only 65536 (defined by MAX_METERS) meters that can be used. But in some case, for example, in the edge gateway, we should use 200,000+, at least, meters for IP address bandwidth limitation. Every one IP address will use two meters for its rx and tx path[1]. In other way, ovs-vswitchd should support meter-offload (rte_mtr_xxx api introduced by dpdk.), but there are more than 65536 meters in the hardware, such as Mellanox ConnectX-6. This patch use cmap to manage the meter, instead of the array. * Insertion performance, ovs-ofctl add-meter 1000+ meters, the cmap takes about 4000ms, as same as previous implementation. * Lookup performance in datapath, we add 1000+ meters which rate limit are 10Gbps (the NIC cards are 10Gbps, so netdev-datapath will not drop the packets.), and a flow which only forward packets from p0 to p1, with meter action[2]. On other machine, pktgen-dpdk will generate 64B packets to p0. The forwarding performance always is 1324 Kpps on my server which CPU is Intel E5-2650, 2.00GHz. [1]. $ in_port=p0,ip,ip_dst=1.1.1.x action=meter:n,output:p1 $ in_port=p1,ip,ip_src=1.1.1.x action=meter:m,output:p0 [2]. $ in_port=p0 action=meter:100,output:p1 Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2021-06-24 21:34:20 +02:00
Aaron Conole	640d4db788	ipf: Fix a use-after-free error, and remove the 'do_not_steal' flag. As reported by Wang Liang, the way packets are passed to the ipf module doesn't allow for use later on in reassembly. Such packets may be get released anyway, such as during cleanup of tx processing. Because the ipf module lacks a way of forcing the dp_packet to be retained, it will later reuse the packet. Instead, just clone the packet and let the ipf queue own the copy until the queue is destroyed. After this change, there are no more in-tree users of the batch 'do_not_steal' flag. Thus, we remove it as well. Fixes: 4ea96698f667 ("Userspace datapath: Add fragmentation handling.") Fixes: 0b3ff31d35f5 ("dp-packet: Add 'do_not_steal' packet batch flag.") Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2021-April/382098.html Reported-by: Wang Liang <wangliangrt@didiglobal.com> Signed-off-by: Aaron Conole <aconole@redhat.com> Co-authored-by: Wang Liang <wangliangrt@didiglobal.com> Signed-off-by: Wang Liang <wangliangrt@didiglobal.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2021-06-15 10:46:33 +02:00
Ilya Maximets	7731d26144	dpif-netdev: Remove meter rate from the bucket size calculation. Implementation of meters supposed to be a classic token bucket with 2 typical parameters: rate and burst size. Burst size in this schema is the maximum number of bytes/packets that could pass without being rate limited. Recent changes to userspace datapath made meter implementation to be in line with the kernel one, and this uncovered several issues. The main problem is that maximum bucket size for unknown reason accounts not only burst size, but also the numerical value of rate. This creates a lot of confusion around behavior of meters. For example, if rate is configured as 1000 pps and burst size set to 1, this should mean that meter will tolerate bursts of 1 packet at most, i.e. not a single packet above the rate should pass the meter. However, current implementation calculates maximum bucket size as (rate + burst size), so the effective bucket size will be 1001. This means that first 1000 packets will not be rate limited and average rate might be twice as high as the configured rate. This also makes it practically impossible to configure meter that will have burst size lower than the rate, which might be a desirable configuration if the rate is high. Inability to configure low values of a burst size and overall inability for a user to predict what will be a maximum and average rate from the configured parameters of a meter without looking at the OVS and kernel code might be also classified as a security issue, because drop meters are frequently used as a way of protection from DoS attacks. This change removes rate from the calculation of a bucket size, making it in line with the classic token bucket algorithm and essentially making the rate and burst tolerance being predictable from a users' perspective. Same change will be proposed for the kernel implementation. Unit tests changed back to their correct version and enhanced. Signed-off-by: Ilya Maximets <i.maximets@ovn.org> Acked-by: Eelco Chaudron <echaudro@redhat.com> Reviewed-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>	2021-05-18 22:11:14 +02:00
Tonghao Zhang	c3690ccbce	dpif-netdev: Refactor and fix the buckets calculation. The way that "burst_size" was used as total buckets is very strange. If user set the "burst_size" too smaller while "rate" larger, that may affect the meter normal work. This patch refactor the buckets calculation, and start with a full buckets. It also makes the calculation equal to what kernel datapath does. Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2021-03-30 21:16:27 +02:00
Tonghao Zhang	759aaa8513	dpif-netdev: Fix the meter buckets overflow. When setting the meter rate to 4.3+Gbps, there is an overflow, the meters don't work as expected. $ ovs-ofctl -O OpenFlow13 add-meter br-int "meter=1 kbps stats bands=type=drop rate=4294968" It was overflow when we set the rate to 4294968, because "burst_size" in the ofputil_meter_band is uint32_t type. This patch remove the "up" in the dp_meter_band struction, and introduce "rate", "burst_size" and "bucket" (uint64_t) to userspace datapath's meter band. This patch don't change the public API and structure. Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2021-03-30 20:17:54 +02:00
Kevin Traynor	ec68a877db	dpif-netdev: Allow PMD auto load balance with cross-numa. Previously auto load balance did not trigger a reassignment when there was any cross-numa polling as an rxq could be polled from a different numa after reassign and it could impact estimates. In the case where there is only one numa with pmds available, the same numa will always poll before and after reassignment, so estimates are valid. Allow PMD auto load balance to trigger a reassignment in this case. Acked-by: Eelco Chaudron <echaudro@redhat.com> Acked-by: David Marchand <david.marchand@redhat.com> Tested-by: Sunil Pai G <sunil.pai.g@intel.com> Acked-by: Flavio Leitner <fbl@sysclose.org> Signed-off-by: Kevin Traynor <ktraynor@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2021-03-22 12:28:49 +01:00

1 2 3 4 5 ...

751 Commits