There are cases where users might want simple forwarding or drop rules
for all packets received from a specific port, e.g ::
"in_port=1,actions=2"
"in_port=2,actions=IN_PORT"
"in_port=3,vlan_tci=0x1234/0x1fff,actions=drop"
"in_port=4,actions=push_vlan:0x8100,set_field:4196->vlan_vid,output:3"
There are also cases where complex OpenFlow rules can be simplified
down to datapath flows with very simple match criteria.
In theory, for very simple forwarding, OVS doesn't need to parse
packets at all in order to follow these rules. "Simple match" lookup
optimization is intended to speed up packet forwarding in these cases.
Design:
Due to various implementation constraints userspace datapath has
following flow fields always in exact match (i.e. it's required to
match at least these fields of a packet even if the OF rule doesn't
need that):
- recirc_id
- in_port
- packet_type
- dl_type
- vlan_tci (CFI + VID) - in most cases
- nw_frag - for ip packets
Not all of these fields are related to packet itself. We already
know the current 'recirc_id' and the 'in_port' before starting the
packet processing. It also seems safe to assume that we're working
with Ethernet packets. So, for the simple OF rule we need to match
only on 'dl_type', 'vlan_tci' and 'nw_frag'.
'in_port', 'dl_type', 'nw_frag' and 13 bits of 'vlan_tci' can be
combined in a single 64bit integer (mark) that can be used as a
hash in hash map. We are using only VID and CFI form the 'vlan_tci',
flows that need to match on PCP will not qualify for the optimization.
Workaround for matching on non-existence of vlan updated to match on
CFI and VID only in order to qualify for the optimization. CFI is
always set by OVS if vlan is present in a packet, so there is no need
to match on PCP in this case. 'nw_frag' takes 2 bits of PCP inside
the simple match mark.
New per-PMD flow table 'simple_match_table' introduced to store
simple match flows only. 'dp_netdev_flow_add' adds flow to the
usual 'flow_table' and to the 'simple_match_table' if the flow
meets following constraints:
- 'recirc_id' in flow match is 0.
- 'packet_type' in flow match is Ethernet.
- Flow wildcards contains only minimal set of non-wildcarded fields
(listed above).
If the number of flows for current 'in_port' in a regular 'flow_table'
equals number of flows for current 'in_port' in a 'simple_match_table',
we may use simple match optimization, because all the flows we have
are simple match flows. This means that we only need to parse
'dl_type', 'vlan_tci' and 'nw_frag' to perform packet matching.
Now we make the unique flow mark from the 'in_port', 'dl_type',
'nw_frag' and 'vlan_tci' and looking for it in the 'simple_match_table'.
On successful lookup we don't need to run full 'miniflow_extract()'.
Unsuccessful lookup technically means that we have no suitable flow
in the datapath and upcall will be required. So, in this case EMC and
SMC lookups are disabled. We may optimize this path in the future by
bypassing the dpcls lookup too.
Performance improvement of this solution on a 'simple match' flows
should be comparable with partial HW offloading, because it parses same
packet fields and uses similar flow lookup scheme.
However, unlike partial HW offloading, it works for all port types
including virtual ones.
Performance results when compared to EMC:
Test setup:
virtio-user OVS virtio-user
Testpmd1 ------------> pmd1 ------------> Testpmd2
(txonly) x<------ pmd2 <------------ (mac swap)
Single stream of 64byte packets. Actions:
in_port=vhost0,actions=vhost1
in_port=vhost1,actions=vhost0
Stats collected from pmd1 and pmd2, so there are 2 scenarios:
Virt-to-Virt : Testpmd1 ------> pmd1 ------> Testpmd2.
Virt-to-NoCopy : Testpmd2 ------> pmd2 --->x Testpmd1.
Here the packet sent from pmd2 to Testpmd1 is always dropped, because
the virtqueue is full since Testpmd1 is in txonly mode and doesn't
receive any packets. This should be closer to the performance of a
VM-to-Phy scenario.
Test performed on machine with Intel Xeon CPU E5-2690 v4 @ 2.60GHz.
Table below represents improvement in throughput when compared to EMC.
+----------------+------------------------+------------------------+
| | Default (-g -O2) | "-Ofast -march=native" |
| Scenario +------------+-----------+------------+-----------+
| | GCC | Clang | GCC | Clang |
+----------------+------------+-----------+------------+-----------+
| Virt-to-Virt | +18.9% | +25.5% | +10.8% | +16.7% |
| Virt-to-NoCopy | +24.3% | +33.7% | +14.9% | +22.0% |
+----------------+------------+-----------+------------+-----------+
For Phy-to-Phy case performance improvement should be even higher, but
it's not the main use-case for this functionality. Performance
difference for the non-simple flows is within a margin of error.
Acked-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Last RX queue, from which the packet got received, is already stored
in the PMD context. So, we can get the netdev from it without the
expensive hash map lookup.
In my V2V testing this patch improves performance in case HW offload
and experimental APIs are enabled by about 3%. That narrows down the
performance difference with the case with experimental API disabled
to about 0.5%, which is way within a margin of error for that setup.
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Eli Britstein <elibr@nvidia.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Previously, when a user enabled PMD auto load balancer with
pmd-auto-lb="true", some conditions such as number of PMDs/RxQs
that were required for a rebalance to take place were checked.
If the configuration meant that a rebalance would not take place
then PMD ALB was logged as 'disabled' and not run.
Later, if the PMD/RxQ configuration changed whereby a rebalance
could be effective, PMD ALB was logged as 'enabled' and would run at
the appropriate time.
This worked ok from a functional view but it is unintuitive for the
user reading the logs.
e.g. with one PMD (PMD ALB would not be effective)
User enables ALB, but logs say it is disabled because it won't run.
$ ovs-vsctl set open_vSwitch . other_config:pmd-auto-lb="true"
|dpif_netdev|INFO|PMD auto load balance is disabled
No dry run takes place.
Add more PMDs (PMD ALB may be effective).
$ ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=50
|dpif_netdev|INFO|PMD auto load balance is enabled ...
Dry run takes place.
|dpif_netdev|DBG|PMD auto load balance performing dry run.
A better approach is to simply reflect back the user enable/disable
state in the logs and deal with whether the rebalance will be effective
when needed. That is the approach taken in this patch.
To cut down on unneccessary work, some basic checks are also made before
starting a PMD ALB dry run and debug logs can indicate this to the user.
e.g. with one PMD (PMD ALB would not be effective)
User enables ALB, and logs confirm the user has enabled it.
$ ovs-vsctl set open_vSwitch . other_config:pmd-auto-lb="true"
|dpif_netdev|INFO|PMD auto load balance is enabled...
No dry run takes place.
|dpif_netdev|DBG|PMD auto load balance nothing to do, not enough non-isolated PMDs or RxQs.
Add more PMDs (PMD ALB may be effective).
$ ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=50
Dry run takes place.
|dpif_netdev|DBG|PMD auto load balance performing dry run.
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: Sunil Pai G <sunil.pai.g@intel.com>
Reviewed-by: David Marchand <david.marchand@redhat.com>
Acked-by: Gaetan Rivet <grive@u256.net>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
When a PMD reload occurs, some PMD cycle measurements are reset.
In order to preserve the full cycles history of an Rxq, the RxQ
cycle measurements were not reset.
These are both used together to display the % of a PMD that an
RxQ is using in the pmd-rxq-show stat.
Resetting one and not the other can lead to some unintuitive
looking stats while the stats settle for the new config. In
some cases, it may appear like the RxQs are using >100% of a PMD.
This resolves when the stats settle for the new config, but
seeing RxQs apparently using >100% of a PMD may confuse a user
and lead them to think there is a bug.
To avoid this, reset the RxQ cycle measurements on PMD reload.
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Acked-by: Gaetan Rivet <grive@u256.net>
Acked-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
This patch adds a general way of viewing/configuring datapath
cache sizes. With an implementation for the netlink interface.
The ovs-dpctl/ovs-appctl show commands will display the
current cache sizes configured:
$ ovs-dpctl show
system@ovs-system:
lookups: hit:25 missed:63 lost:0
flows: 0
masks: hit:282 total:0 hit/pkt:3.20
cache: hit:4 hit-rate:4.54%
caches:
masks-cache: size:256
port 0: ovs-system (internal)
port 1: br-int (internal)
port 2: genev_sys_6081 (geneve: packet_type=ptap)
port 3: br-ex (internal)
port 4: eth2
port 5: sw0p1 (internal)
port 6: sw0p3 (internal)
A specific cache can be configured as follows:
$ ovs-appctl dpctl/cache-set-size DP CACHE SIZE
$ ovs-dpctl cache-set-size DP CACHE SIZE
For example to disable the cache do:
$ ovs-dpctl cache-set-size system@ovs-system masks-cache 0
Setting cache size successful, new size 0.
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Paolo Valerio <pvalerio@redhat.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
IP fragmentation engine may not only steal the packet but also add
more. For example, after receiving the last fragment, it will
add all previous fragments to a batch. Unfortunately, it will also
free the original last fragment and replace it with a copy.
This invalidates the 'packet_clone' pointer in the dpif_netdev_execute()
leading to the use-after-free:
==3525086==ERROR: AddressSanitizer: heap-use-after-free on
address 0x61600020439c at pc 0x000000688a6d
READ of size 1 at 0x61600020439c thread T0
#0 0x688a6c in dp_packet_swap ./lib/dp-packet.h:265:5
#1 0x68781d in dpif_netdev_execute lib/dpif-netdev.c:4103:9
#2 0x6675db in dpif_netdev_operate lib/dpif-netdev.c:4129:25
#3 0x691e5e in dpif_operate lib/dpif.c:1367:13
#4 0x692909 in dpif_execute lib/dpif.c:1321:9
#5 0x5b19c6 in packet_execute ofproto/ofproto-dpif.c:4991:5
#6 0x5a2861 in ofproto_packet_out_finish ofproto/ofproto.c:3662:5
#7 0x5a65c6 in do_bundle_commit ofproto/ofproto.c:8270:13
#8 0x5a0cae in handle_bundle_control ofproto/ofproto.c:8309:17
#9 0x59a476 in handle_single_part_openflow ofproto/ofproto.c:8593:16
#10 0x5877ac in handle_openflow ofproto/ofproto.c:8674:21
#11 0x6296f1 in ofconn_run ofproto/connmgr.c:1329:13
#12 0x62925d in connmgr_run ofproto/connmgr.c:356:9
#13 0x586904 in ofproto_run ofproto/ofproto.c:1879:5
#14 0x55c830 in bridge_run__ vswitchd/bridge.c:3251:9
#15 0x55c015 in bridge_run vswitchd/bridge.c:3310:5
#16 0x575f31 in main vswitchd/ovs-vswitchd.c:127:9
#17 0x7f01099d3492 in __libc_start_main (/lib64/libc.so.6+0x23492)
#18 0x47d96d in _start (vswitchd/ovs-vswitchd+0x47d96d)
0x61600020439c is located 28 bytes inside of 560-byte region
freed by thread T0 here:
#0 0x5177a8 in free (vswitchd/ovs-vswitchd+0x5177a8)
#1 0x6b17b6 in dp_packet_delete ./lib/dp-packet.h:256:9
#2 0x6afeee in ipf_extract_frags_from_batch lib/ipf.c:947:17
#3 0x6afd63 in ipf_preprocess_conntrack lib/ipf.c:1232:9
#4 0x946b2c in conntrack_execute lib/conntrack.c:1446:5
#5 0x67e3ed in dp_execute_cb lib/dpif-netdev.c:8277:9
#6 0x7097d7 in odp_execute_actions lib/odp-execute.c:865:17
#7 0x66409e in dp_netdev_execute_actions lib/dpif-netdev.c:8322:5
#8 0x6877ad in dpif_netdev_execute lib/dpif-netdev.c:4090:5
#9 0x6675db in dpif_netdev_operate lib/dpif-netdev.c:4129:25
#10 0x691e5e in dpif_operate lib/dpif.c:1367:13
#11 0x692909 in dpif_execute lib/dpif.c:1321:9
#12 0x5b19c6 in packet_execute ofproto/ofproto-dpif.c:4991:5
#13 0x5a2861 in ofproto_packet_out_finish ofproto/ofproto.c:3662:5
#14 0x5a65c6 in do_bundle_commit ofproto/ofproto.c:8270:13
#15 0x5a0cae in handle_bundle_control ofproto/ofproto.c:8309:17
#16 0x59a476 in handle_single_part_openflow ofproto/ofproto.c:8593:16
#17 0x5877ac in handle_openflow ofproto/ofproto.c:8674:21
#18 0x6296f1 in ofconn_run ofproto/connmgr.c:1329:13
#19 0x62925d in connmgr_run ofproto/connmgr.c:356:9
#20 0x586904 in ofproto_run ofproto/ofproto.c:1879:5
#21 0x55c830 in bridge_run__ vswitchd/bridge.c:3251:9
#22 0x55c015 in bridge_run vswitchd/bridge.c:3310:5
#23 0x575f31 in main vswitchd/ovs-vswitchd.c:127:9
#24 0x7f01099d3492 in __libc_start_main (/lib64/libc.so.6+0x23492)
The issue can be reproduced with system-userspace testsuite on the
'conntrack - IPv4 fragmentation with fragments specified' test.
Previously, there was a leak inside the IP fragmentation module that
kept the original segment, so 'packet_clone' remained a valid pointer.
But commit 803ed12e31b0 ("ipf: release unhandled packets from the batch")
fixed the leak leading to use-after-free.
Using the packet from a batch instead of 'packet_clone' to swap packet
content to avoid the issue.
While investigating this problem, more issues uncovered. One of them
is that IP fragmentation engine can add more packets to the batch, but
there is no way to get them to a caller. Adding an extra branch for
this case with a 'FIXME' comment in order to highlight the issue.
Another one is that IP fragmentation engine will keep only 32 fragments
dropping all other fragments while refilling a batch, but that should
be fixed separately.
Fixes: 7e6b41ac8d9d ("dpif-netdev: Fix crash when PACKET_OUT is metered.")
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Aaron Conole <aconole@redhat.com>
The dpif is used, so remove the OVS_UNUSED flag.
Fixes: a7f33fdbfb67 ("conntrack: Support zone limits.")
Signed-off-by: linhuang <linhuang@ruijie.com.cn>
Reviewed-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
When a PACKET_OUT has output port of OFPP_TABLE, and the rule
table includes a meter and this causes the packet to be deleted,
execute with a clone of the packet, restoring the original packet
if it is changed by the execution.
Add tests to verify the original issue is fixed, and that the fix
doesn't break tunnel processing.
Reported-by: Tony van der Peet <tony.vanderpeet@alliedtelesis.co.nz>
Signed-off-by: Tony van der Peet <tony.vanderpeet@alliedtelesis.co.nz>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
This patch fixes a memory leak in the commands for DPIF and MFEX
get and set. In order to operate the commands require a pmd_list,
which is currently not freed after it has been used. This issue
was identified by a static analysis tool.
Fixes: 3d8f47bc ("dpif-netdev: Add command line and function pointer for miniflow extract")
Fixes: abb807e2 ("dpif-netdev: Add command to switch dpif implementation.")
Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
Acked-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
This patch fixes a memory leak when the command
"dpif-netdev/subtable-lookup-prio-set" is run, the pmd_list
required to iterate the PMD threads was not being freed.
This issue was identified by a static analysis tool.
Fixes: 3d018c3e ("dpif-netdev: Add subtable lookup prio set command.")
Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
Acked-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
The commit removes the dead code from the
MFEX set command as highlighted by static tool
analysis.
Fixes: a395b132b7 ("dpif-netdev: Add packet count and core id paramters for study")
Signed-off-by: kumar Amber <kumar.amber@intel.com>
Acked-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
Association of a mark to a flow is done as part of its offload handling,
in the offloading thread. However, the PMD thread specifies whether an
offload request is an "add" or "modify" by the association of a mark to
the flow.
This is exposed to a race condition. A flow might be created with
actions that cannot be fully offloaded, for example flooding (before MAC
learning), and later modified to have actions that can be fully
offloaded. If the two requests are queued before the offload thread
handling, they are both marked as "add". When the offload thread handles
them, the first request is partially offloaded, and the second one is
ignored as the flow is already considered as offloaded.
Fix it by specifying add/modify of an offload request by the actual flow
state change, without relying on the mark.
Fixes: 3c7330ebf036 ("netdev-offload-dpdk: Support offload of output action.")
Signed-off-by: Eli Britstein <elibr@nvidia.com>
Reviewed-by: Gaetan Rivet <gaetanr@nvidia.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
dp_netdev_flow_offload_main thread is asynchronous, by the cited commit.
There might be a case where there are modification requests of the same
flow submitted before handled. Then, if the first handling fails, the
rule for the flow is deleted, and the mark is freed. Then, the following
one should not be handled as a modification, but rather as an "add".
Fixes: 02bb2824e51d ("dpif-netdev: do hw flow offload in a thread")
Signed-off-by: Eli Britstein <elibr@nvidia.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Before flushing offloads of a removed port was supported by [1], it was
necessary to flush the 'marks'. In doing so, all offloads of the PMD are
removed, include the ones that are not related to the removed port and
that are not modified following this removal. As a result such flows are
evicted from being offloaded, and won't resume offloading.
As PMD offload flush is not necessary, avoid it.
[1] 62d1c28e9ce0 ("dpif-netdev: Flush offload rules upon port deletion.")
Signed-off-by: Eli Britstein <elibr@nvidia.com>
Reviewed-by: Gaetan Rivet <gaetanr@nvidia.com>
Reviewed-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Atomic variables must be read atomically. And since we already have
'smc_enable_db' in the PMD context, we need to use it from there to
avoid reading atomically twice.
Also, 'smc_enable_db' is a global configuration, there is no need
to read it per-port or per-rxq.
Fixes: 9ac84a1a3698 ("dpif-avx512: Add ISA implementation of dpif.")
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Cian Ferriter <cian.ferriter@intel.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The Open vSwitch kernel module uses the upcall mechanism to send
packets from kernel space to user space when it misses in the kernel
space flow table. The upcall sends packets via a Netlink socket.
Currently, a Netlink socket is created for every vport. In this way,
there is a 1:1 mapping between a vport and a Netlink socket.
When a packet is received by a vport, if it needs to be sent to
user space, it is sent via the corresponding Netlink socket.
This mechanism, with various iterations of the corresponding user
space code, has seen some limitations and issues:
* On systems with a large number of vports, there is correspondingly
a large number of Netlink sockets which can limit scaling.
(https://bugzilla.redhat.com/show_bug.cgi?id=1526306)
* Packet reordering on upcalls.
(https://bugzilla.redhat.com/show_bug.cgi?id=1844576)
* A thundering herd issue.
(https://bugzilla.redhat.com/show_bug.cgi?id=1834444)
This patch introduces an alternative, feature-negotiated, upcall
mode using a per-cpu dispatch rather than a per-vport dispatch.
In this mode, the Netlink socket to be used for the upcall is
selected based on the CPU of the thread that is executing the upcall.
In this way, it resolves the issues above as:
a) The number of Netlink sockets scales with the number of CPUs
rather than the number of vports.
b) Ordering per-flow is maintained as packets are distributed to
CPUs based on mechanisms such as RSS and flows are distributed
to a single user space thread.
c) Packets from a flow can only wake up one user space thread.
Reported-at: https://bugzilla.redhat.com/1844576
Signed-off-by: Mark Gray <mark.d.gray@redhat.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Users complained that per rxq pmd usage was confusing: summing those
values per pmd would never reach 100% even if increasing traffic load
beyond pmd capacity.
This is because the dpif-netdev/pmd-rxq-show command only reports "pure"
rxq cycles while some cycles are used in the pmd mainloop and adds up to
the total pmd load.
dpif-netdev/pmd-stats-show does report per pmd load usage.
This load is measured since the last dpif-netdev/pmd-stats-clear call.
On the other hand, the per rxq pmd usage reflects the pmd load on a 10s
sliding window which makes it non trivial to correlate.
Gather per pmd busy cycles with the same periodicity and report the
difference as overhead in dpif-netdev/pmd-rxq-show so that we have all
info in a single command.
Example:
$ ovs-appctl dpif-netdev/pmd-rxq-show
pmd thread numa_id 1 core_id 3:
isolated : true
port: dpdk0 queue-id: 0 (enabled) pmd usage: 90 %
overhead: 4 %
pmd thread numa_id 1 core_id 5:
isolated : false
port: vhost0 queue-id: 0 (enabled) pmd usage: 0 %
port: vhost1 queue-id: 0 (enabled) pmd usage: 93 %
port: vhost2 queue-id: 0 (enabled) pmd usage: 0 %
port: vhost6 queue-id: 0 (enabled) pmd usage: 0 %
overhead: 6 %
pmd thread numa_id 1 core_id 31:
isolated : true
port: dpdk1 queue-id: 0 (enabled) pmd usage: 86 %
overhead: 4 %
pmd thread numa_id 1 core_id 33:
isolated : false
port: vhost3 queue-id: 0 (enabled) pmd usage: 0 %
port: vhost4 queue-id: 0 (enabled) pmd usage: 0 %
port: vhost5 queue-id: 0 (enabled) pmd usage: 92 %
port: vhost7 queue-id: 0 (enabled) pmd usage: 0 %
overhead: 7 %
Signed-off-by: David Marchand <david.marchand@redhat.com>
Acked-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
Pinning an rxq to a PMD with pmd-rxq-affinity may be done for
various reasons such as reserving a full PMD for an rxq, or to
ensure that multiple rxqs from a port are handled on different PMDs.
Previously pmd-rxq-affinity always isolated the PMD so no other rxqs
could be assigned to it by OVS. There may be cases where there is
unused cycles on those pmds and the user would like other rxqs to
also be able to be assigned to it by OVS.
Add an option to pin the rxq and non-isolate the PMD. The default
behaviour is unchanged, which is pin and isolate the PMD.
In order to pin and non-isolate:
ovs-vsctl set Open_vSwitch . other_config:pmd-rxq-isolate=false
Note this is available only with group assignment type, as pinning
conflicts with the operation of the other rxq assignment algorithms.
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: Sunil Pai G <sunil.pai.g@intel.com>
Acked-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
Add an rxq scheduling option that allows rxqs to be grouped
on a pmd based purely on their load.
The current default 'cycles' assignment sorts rxqs by measured
processing load and then assigns them to a list of round robin PMDs.
This helps to keep the rxqs that require most processing on different
cores but as it selects the PMDs in round robin order, it equally
distributes rxqs to PMDs.
'cycles' assignment has the advantage in that it separates the most
loaded rxqs from being on the same core but maintains the rxqs being
spread across a broad range of PMDs to mitigate against changes to
traffic pattern.
'cycles' assignment has the disadvantage that in order to make the
trade off between optimising for current traffic load and mitigating
against future changes, it tries to assign and equal amount of rxqs
per PMD in a round robin manner and this can lead to a less than optimal
balance of the processing load.
Now that PMD auto load balance can help mitigate with future changes in
traffic patterns, a 'group' assignment can be used to assign rxqs based
on their measured cycles and the estimated running total of the PMDs.
In this case, there is no restriction about keeping equal number of
rxqs per PMD as it is purely load based.
This means that one PMD may have a group of low load rxqs assigned to it
while another PMD has one high load rxq assigned to it, as that is the
best balance of their measured loads across the PMDs.
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: Sunil Pai G <sunil.pai.g@intel.com>
Acked-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
Previously, if pmd-rxq-affinity was used to pin an rxq to
a core that was not in pmd-cpu-mask the rxq was not polled
for and the user received a warning. This meant that no traffic
would be received from that rxq.
Now that pinned and non-pinned rxqs are assigned to PMDs in
a common call to rxq scheduling, if an invalid core is
selected in pmd-rxq-affinity the rxq can be assigned an
available PMD (if any).
A warning will still be logged as the requested core could
not be used.
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: Sunil Pai G <sunil.pai.g@intel.com>
Acked-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
The list of PMDs is round robined through for the selection
when assigning an rxq to a PMD. The list is based on a
hash map, so there is no defined order.
It means the same set of PMDs may get assigned different rxqs
on different runs for no reason other than how the PMDs are stored
in the hash map.
This can be easily changed by sorting the PMDs by core id after
they are extracted, so the PMDs will be used in a consistent order.
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: Sunil Pai G <sunil.pai.g@intel.com>
Acked-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
PMD auto load balance had its own separate implementation of the
rxq scheduling that it used for dry runs. This was done because
previously the rxq scheduling was not made reusable for a dry run.
Apart from the code duplication (which is a good enough reason
to replace it alone) this meant that if any further rxq scheduling
changes or assignment types were added they would also have to be
duplicated in the auto load balance code too.
This patch replaces the current PMD auto load balance rxq scheduling
code to reuse the common rxq scheduling code.
The behaviour does not change from a user perspective, except the logs
are updated to be more consistent.
As the dry run will compare the pmd load variances for current and
estimated assignments, new functions are added to populate the current
assignments and use the rxq scheduling data structs for variance
calculations.
Now that the new rxq scheduling data structures are being used in
PMD auto load balance, the older rr_* data structs and associated
functions can be removed.
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: Sunil Pai G <sunil.pai.g@intel.com>
Acked-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
This reworks the current rxq scheduling code to break it into more
generic and reusable pieces.
The behaviour does not change from a user perspective, except the logs
are updated to be more consistent.
From an implementation view, there are some changes with mind to
extending functionality.
The high level reusable functions added in this patch are:
- Generate a list of current numas and pmds
- Perform rxq scheduling assignments into that list
- Effect the rxq scheduling assignments so they are used
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: Sunil Pai G <sunil.pai.g@intel.com>
Acked-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
rte_flow_get_restore_info() API is under experimental attribute. Using it
has a performance impact that can be avoided for non-experimental compilation.
Do not call it without experimental support.
Reported-by: Cian Ferriter <cian.ferriter@intel.com>
Signed-off-by: Eli Britstein <elibr@nvidia.com>
Reviewed-by: Gaetan Rivet <gaetanr@nvidia.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
This commit adds a new counter to be displayed to the user when
requesting datapath packet statistics. It counts the number of
packets that are parsed and a miniflow built up from it by the
optimized miniflow extract parsers.
The ovs-appctl command "dpif-netdev/pmd-perf-show" now has an
extra entry indicating if the optimized MFEX was hit:
- MFEX Opt hits: 6786432 (100.0 %)
Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
This commit introduces additional command line paramter
for mfex study function. If user provides additional packet out
it is used in study to compare minimum packets which must be processed
else a default value is choosen.
Also introduces a third paramter for choosing a particular pmd core.
$ ovs-appctl dpif-netdev/miniflow-parser-set study 500 3
Signed-off-by: Kumar Amber <kumar.amber@intel.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
This patch introduced the auto-validation function which
allows users to compare the batch of packets obtained from
different miniflow implementations against the linear
miniflow extract and return a hitmask.
The autovaidator function can be triggered at runtime using the
following command:
$ ovs-appctl dpif-netdev/miniflow-parser-set autovalidator
Signed-off-by: Kumar Amber <kumar.amber@intel.com>
Co-authored-by: Harry van Haaren <harry.van.haaren@intel.com>
Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
This patch introduces the MFEX function pointers which allows
the user to switch between different miniflow extract implementations
which are provided by the OVS based on optimized ISA CPU.
The user can query for the available minflow extract variants available
for that CPU by following commands:
$ovs-appctl dpif-netdev/miniflow-parser-get
Similarly an user can set the miniflow implementation by the following
command :
$ ovs-appctl dpif-netdev/miniflow-parser-set name
This allows for more performance and flexibility to the user to choose
the miniflow implementation according to the needs.
Signed-off-by: Kumar Amber <kumar.amber@intel.com>
Co-authored-by: Harry van Haaren <harry.van.haaren@intel.com>
Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
It is possible for packets traversing the userspace datapath to match a
flow before hitting on EMC by using a mark ID provided by a NIC. Add a
PMD statistic for this hit.
Signed-off-by: Cian Ferriter <cian.ferriter@intel.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
This commit adds a new command to retrieve the list of available
DPIF implementations. This can be used by to check what implementations
of the DPIF are available in any given OVS binary. It also returns which
implementations are in use by the OVS PMD threads.
Usage:
$ ovs-appctl dpif-netdev/dpif-impl-get
Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
Co-authored-by: Cian Ferriter <cian.ferriter@intel.com>
Signed-off-by: Cian Ferriter <cian.ferriter@intel.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
This commit adds a new command to allow the user to switch
the active DPIF implementation at runtime. A probe function
is executed before switching the DPIF implementation, to ensure
the CPU is capable of running the ISA required. For example, the
below code will switch to the AVX512 enabled DPIF assuming
that the runtime CPU is capable of running AVX512 instructions:
$ ovs-appctl dpif-netdev/dpif-impl-set dpif_avx512
A new configuration flag is added to allow selection of the
default DPIF. This is useful for running the unit-tests against
the available DPIF implementations, without modifying each unit test.
The design of the testing & validation for ISA optimized DPIF
implementations is based around the work already upstream for DPCLS.
Note however that a DPCLS lookup has no state or side-effects, allowing
the auto-validator implementation to perform multiple lookups and
provide consistent statistic counters.
The DPIF component does have state, so running two implementations in
parallel and comparing output is not a valid testing method, as there
are changes in DPIF statistic counters (side effects). As a result, the
DPIF is tested directly against the unit-tests.
Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
Co-authored-by: Cian Ferriter <cian.ferriter@intel.com>
Signed-off-by: Cian Ferriter <cian.ferriter@intel.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
This commit adds the AVX512 implementation of DPIF functionality,
specifically the dp_netdev_input_outer_avx512 function. This function
only handles outer (no re-circulations), and is optimized to use the
AVX512 ISA for packet batching and other DPIF work.
Sparse is not able to handle the AVX512 intrinsics, causing compile
time failures, so it is disabled for this file.
Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
Co-authored-by: Cian Ferriter <cian.ferriter@intel.com>
Signed-off-by: Cian Ferriter <cian.ferriter@intel.com>
Co-authored-by: Kumar Amber <kumar.amber@intel.com>
Signed-off-by: Kumar Amber <kumar.amber@intel.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
This commit adds a function pointer to the pmd thread data structure,
giving the pmd thread flexibility in its dpif-input function choice.
This allows choosing of the implementation based on ISA capabilities
of the runtime CPU, leading to optimizations and higher performance.
Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
Co-authored-by: Cian Ferriter <cian.ferriter@intel.com>
Signed-off-by: Cian Ferriter <cian.ferriter@intel.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
Split the very large file dpif-netdev.c and the datastructures
it contains into multiple header files. Each header file is
responsible for the datastructures of that component.
This logical split allows better reuse and modularity of the code,
and reduces the very large file dpif-netdev.c to be more managable.
Due to dependencies between components, it is not possible to
move component in smaller granularities than this patch.
To explain the dependencies better, eg:
DPCLS has no deps (from dpif-netdev.c file)
FLOW depends on DPCLS (struct dpcls_rule)
DFC depends on DPCLS (netdev_flow_key) and FLOW (netdev_flow_key)
THREAD depends on DFC (struct dfc_cache)
DFC_PROC depends on THREAD (struct pmd_thread)
DPCLS lookup.h/c require only DPCLS
DPCLS implementations require only dpif-netdev-lookup.h.
- This change was made in 2.12 release with function pointers
- This commit only refactors the name to "private-dpcls.h"
netdev_flow_key_equal_mf() is renamed to emc_flow_key_equal_mf().
Rename functions specific to dpcls from netdev_* namespace to the
dpcls_* namespace, as they are only used by dpcls code.
'inline' is added to the dp_netdev_flow_hash() when it is moved
definition to fix a compiler error.
One valid checkpatch issue with the use of the
EMC_FOR_EACH_POS_WITH_HASH() macro was fixed.
Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
Co-authored-by: Cian Ferriter <cian.ferriter@intel.com>
Signed-off-by: Cian Ferriter <cian.ferriter@intel.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
The call to recirc_depth_get involves accessing a TLS value. So read
that once, and store it on the stack for re-use while processing the
batch. The same goes for reading netdev_is_flow_api_enabled(), a
non-inlined function.
Signed-off-by: Balazs Nemeth <bnemeth@redhat.com>
Acked-by: Gaetan Rivet <grive@u256.net>
Acked-by: Paolo Valerio <pvalerio@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Currently, conntrack in the kernel has an undocumented feature referred
to as all-zero IP address SNAT. Basically, when a source port
collision is detected during the commit, the source port will be
translated to an ephemeral port. If there is no collision, no SNAT is
performed.
This patchset documents this behavior and adds a self-test to verify
it's not changing. In addition, a datapath feature flag is added for
the all-zero IP SNAT case. This will help applications on top of OVS,
like OVN, to determine this feature can be used.
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Aaron Conole <aconole@redhat.com>
Acked-by: Dumitru Ceara <dceara@redhat.com>
Acked-by: Alin-Gabriel Serdean <aserdean@ovn.org>
Acked-by: Paolo Valerio <pvalerio@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Currently, if you try to set subtable-lookup-prio-set when you don't have
any datapath (for example if an user wants to set AVX512 before creating
any bridge) it sets it globally (dpcls_subtable_set_prio),
but it returns an error:
please specify an existing datapath
ovs-appctl: ovs-vswitchd: server returned an error
and, in this case, the exit code of ovs-appctl is 2.
This commit changes the behaviour by removing the [datapath] optional
parameter of subtable-lookup-prio-set and by changing the priority
level on any datapath and globally. This means if you don't have any
datapath or if you have only one datapath, the behaviour is the same as
now, but without the confusing error when you don't have any datapath.
Fixes: 3d018c3ea79d ("dpif-netdev: add subtable lookup prio set command.")
Signed-off-by: Timothy Redaelli <tredaelli@redhat.com>
Acked-by: Harry van Haaren <harry.van.haaren@intel.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
When an encapsulated packet is recirculated through a TUNNEL_POP
action, the metadata gets reinitialized and the originating physical
port information is lost. When this flow gets processed by the vport
and it needs to be offloaded, we can't figure out the physical port
through which the tunneled packet was received.
Add a new member to the metadata: 'orig_in_port'. This is passed to
the next stage during recirculation and the offload layer can use it
to offload the flow to this physical port.
Signed-off-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
Signed-off-by: Eli Britstein <elibr@nvidia.com>
Reviewed-by: Gaetan Rivet <gaetanr@nvidia.com>
Tested-by: Emma Finn <emma.finn@intel.com>
Tested-by: Marko Kovacevic <marko.kovacevic@intel.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
'linux_tc' flow API suitable only for tunneling vports with backing
linux interfaces. DPDK flow API is not suitable for such ports.
With this change we could drop vport restriction from dpif-netdev.
This is a prerequisite for enabling vport offloading in DPDK.
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Signed-off-by: Eli Britstein <elibr@nvidia.com>
Reviewed-by: Gaetan Rivet <gaetanr@nvidia.com>
Acked-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
Tested-by: Emma Finn <emma.finn@intel.com>
Tested-by: Marko Kovacevic <marko.kovacevic@intel.com>
Recover the packet if it was partially processed by the HW. Fallback to
lookup flow by mark association.
Signed-off-by: Eli Britstein <elibr@nvidia.com>
Reviewed-by: Gaetan Rivet <gaetanr@nvidia.com>
Acked-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
Tested-by: Emma Finn <emma.finn@intel.com>
Tested-by: Marko Kovacevic <marko.kovacevic@intel.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
For now, ovs-vswitchd use the array of the dp_meter struct
to store meter's data, and at most, there are only 65536
(defined by MAX_METERS) meters that can be used. But in some
case, for example, in the edge gateway, we should use 200,000+,
at least, meters for IP address bandwidth limitation.
Every one IP address will use two meters for its rx and tx
path[1]. In other way, ovs-vswitchd should support meter-offload
(rte_mtr_xxx api introduced by dpdk.), but there are more than
65536 meters in the hardware, such as Mellanox ConnectX-6.
This patch use cmap to manage the meter, instead of the array.
* Insertion performance, ovs-ofctl add-meter 1000+ meters,
the cmap takes about 4000ms, as same as previous implementation.
* Lookup performance in datapath, we add 1000+ meters which rate limit
are 10Gbps (the NIC cards are 10Gbps, so netdev-datapath will not
drop the packets.), and a flow which only forward packets from p0
to p1, with meter action[2]. On other machine, pktgen-dpdk will
generate 64B packets to p0.
The forwarding performance always is 1324 Kpps on my server
which CPU is Intel E5-2650, 2.00GHz.
[1].
$ in_port=p0,ip,ip_dst=1.1.1.x action=meter:n,output:p1
$ in_port=p1,ip,ip_src=1.1.1.x action=meter:m,output:p0
[2].
$ in_port=p0 action=meter:100,output:p1
Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
As reported by Wang Liang, the way packets are passed to the ipf module
doesn't allow for use later on in reassembly. Such packets may be get
released anyway, such as during cleanup of tx processing. Because the
ipf module lacks a way of forcing the dp_packet to be retained, it
will later reuse the packet. Instead, just clone the packet and let the
ipf queue own the copy until the queue is destroyed.
After this change, there are no more in-tree users of the batch
'do_not_steal' flag. Thus, we remove it as well.
Fixes: 4ea96698f667 ("Userspace datapath: Add fragmentation handling.")
Fixes: 0b3ff31d35f5 ("dp-packet: Add 'do_not_steal' packet batch flag.")
Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2021-April/382098.html
Reported-by: Wang Liang <wangliangrt@didiglobal.com>
Signed-off-by: Aaron Conole <aconole@redhat.com>
Co-authored-by: Wang Liang <wangliangrt@didiglobal.com>
Signed-off-by: Wang Liang <wangliangrt@didiglobal.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Implementation of meters supposed to be a classic token bucket with 2
typical parameters: rate and burst size.
Burst size in this schema is the maximum number of bytes/packets that
could pass without being rate limited.
Recent changes to userspace datapath made meter implementation to be
in line with the kernel one, and this uncovered several issues.
The main problem is that maximum bucket size for unknown reason
accounts not only burst size, but also the numerical value of rate.
This creates a lot of confusion around behavior of meters.
For example, if rate is configured as 1000 pps and burst size set to 1,
this should mean that meter will tolerate bursts of 1 packet at most,
i.e. not a single packet above the rate should pass the meter.
However, current implementation calculates maximum bucket size as
(rate + burst size), so the effective bucket size will be 1001. This
means that first 1000 packets will not be rate limited and average
rate might be twice as high as the configured rate. This also makes
it practically impossible to configure meter that will have burst size
lower than the rate, which might be a desirable configuration if the
rate is high.
Inability to configure low values of a burst size and overall inability
for a user to predict what will be a maximum and average rate from the
configured parameters of a meter without looking at the OVS and kernel
code might be also classified as a security issue, because drop meters
are frequently used as a way of protection from DoS attacks.
This change removes rate from the calculation of a bucket size, making
it in line with the classic token bucket algorithm and essentially
making the rate and burst tolerance being predictable from a users'
perspective.
Same change will be proposed for the kernel implementation.
Unit tests changed back to their correct version and enhanced.
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Reviewed-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
The way that "burst_size" was used as total buckets is
very strange. If user set the "burst_size" too smaller while
"rate" larger, that may affect the meter normal work.
This patch refactor the buckets calculation, and start
with a full buckets. It also makes the calculation equal
to what kernel datapath does.
Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
When setting the meter rate to 4.3+Gbps, there is an overflow, the
meters don't work as expected.
$ ovs-ofctl -O OpenFlow13 add-meter br-int "meter=1 kbps stats bands=type=drop rate=4294968"
It was overflow when we set the rate to 4294968, because "burst_size" in
the ofputil_meter_band is uint32_t type. This patch remove the "up"
in the dp_meter_band struction, and introduce "rate", "burst_size" and
"bucket" (uint64_t) to userspace datapath's meter band. This patch don't
change the public API and structure.
Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Previously auto load balance did not trigger a reassignment when
there was any cross-numa polling as an rxq could be polled from a
different numa after reassign and it could impact estimates.
In the case where there is only one numa with pmds available, the
same numa will always poll before and after reassignment, so estimates
are valid. Allow PMD auto load balance to trigger a reassignment in
this case.
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: David Marchand <david.marchand@redhat.com>
Tested-by: Sunil Pai G <sunil.pai.g@intel.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>