Since DPDK 19.11 [1], it is not allowed to set any RX mq mode for virtio
driver.
[1] 13b3137f3b
Signed-off-by: Jaime Caamaño Ruiz <jcaamano@suse.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Dequeue zero-copy is no longer supported for vhost-user client mode
in DPDK due to commit [1].
In addition to this, zero-copy mode has been proposed to be marked
deprecated in [2] with removal in the next DPDK LTS release.
This commit deprecates support for vhost-user dequeue zero-copy in OVS
with its removal expected in the next OVS release.
[1] 715070ea10e6 ("vhost: prevent zero-copy with incompatible client
mode")
[2] http://mails.dpdk.org/archives/dev/2020-August/177236.html
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
Acked-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Acked-by: Ilya Maximets <i.maximets@ovn.org>
As of DPDK 19.11, in order to use dequeue-zero-copy in DPDK Vhost library,
the application has to disable the linear buffer option. Hence
dequeue-zero-copy is not supported for vhost application that requires
linear buffers.
An alternative DPDK based approach to disable the linear buffers within
the vhost library itself was proposed in [1], however the consensus was
that application should be responsible for disabling linear buffers.
As such this patch disables linear buffers when zero-copy is enabled.
[1] https://patches.dpdk.org/patch/67200/
Fixes: 127b6a6eea02 ("dpdk: Update to use DPDK 19.11.")
Signed-off-by: Sivaprasad Tummala <Sivaprasad.Tummala@intel.com>
Acked-by: Ilya Maximets <i.maximets@ovn.org>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
'dpdkr' ring ports was deprecated in 2.13 release and was not
actually used for a long time. Remove support now.
More details in
commit b4c5f00c339b ("netdev-dpdk: Deprecate ring ports.")
Acked-by: Aaron Conole <aconole@redhat.com>
Acked-by: David Marchand <david.marchand@redhat.com>
Acked-by: Ian Stokes <ian.stokes@intel.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Ideally SCTP checksum offload needs be advertised by the
NIC when userspace TSO is enabled. However, very few drivers
do that and it's not a widely used protocol. So, this patch
enables SCTP checksum offload if available, otherwise userspace
TSO can still be enabled but SCTP packets will be dropped on
NICs without support.
Fixes: 29cf9c1b3b9c ("userspace: Add TCP Segmentation Offload support")
Signed-off-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Virtio doesn't expose flags to control which protocols checksum
offload needs to be enabled or disabled. This patch checks if the
NIC supports UDP checksum offload and active it when TSO is enabled.
Reported-by: Ilya Maximets <i.maximets@ovn.org>
Fixes: 29cf9c1b3b9c ("userspace: Add TCP Segmentation Offload support")
Signed-off-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Disable ECN and UFO since this is not supported yet. Also, disable
all other features when userspace_tso is not enabled.
Fixes: 29cf9c1b3b9c ("userspace: Add TCP Segmentation Offload support")
Signed-off-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The check on TSO capability did not ensure ip checksum, tcp checksum and
TSO tx offloads were available which resulted in a port init failure
(example below with a ena device):
*2020-02-04T17:42:52.976Z|00084|dpdk|ERR|Ethdev port_id=0 requested Tx
offloads 0x2a doesn't match Tx offloads capabilities 0xe in
rte_eth_dev_configure()*
Fixes: 29cf9c1b3b9c ("userspace: Add TCP Segmentation Offload support")
Reported-by: Ravi Kerur <rkerur@gmail.com>
Signed-off-by: David Marchand <david.marchand@redhat.com>
Acked-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Abbreviated as TSO, TCP Segmentation Offload is a feature which enables
the network stack to delegate the TCP segmentation to the NIC reducing
the per packet CPU overhead.
A guest using vhostuser interface with TSO enabled can send TCP packets
much bigger than the MTU, which saves CPU cycles normally used to break
the packets down to MTU size and to calculate checksums.
It also saves CPU cycles used to parse multiple packets/headers during
the packet processing inside virtual switch.
If the destination of the packet is another guest in the same host, then
the same big packet can be sent through a vhostuser interface skipping
the segmentation completely. However, if the destination is not local,
the NIC hardware is instructed to do the TCP segmentation and checksum
calculation.
It is recommended to check if NIC hardware supports TSO before enabling
the feature, which is off by default. For additional information please
check the tso.rst document.
Signed-off-by: Flavio Leitner <fbl@sysclose.org>
Tested-by: Ciara Loftus <ciara.loftus.intel.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
There is no support for multi-segmented buffers, so flag
that to vhost library.
Signed-off-by: Flavio Leitner <fbl@sysclose.org>
Tested-by: Ciara Loftus <ciara.loftus.intel.com>
Acked-by: Ilya Maximets <i.maximets@ovn.org>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
Add a getter function for using the dpdk port id outside the scope of
netdev-dpdk.c to be used for HW offload.
Signed-off-by: Eli Britstein <elibr@mellanox.com>
Reviewed-by: Oz Shlomo <ozsh@mellanox.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Introduce a rte flow query function as a pre-step towards reading HW
statistics of fully offloaded flows.
Signed-off-by: Eli Britstein <elibr@mellanox.com>
Reviewed-by: Oz Shlomo <ozsh@mellanox.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
This patch adds a new policer to the DPDK datapath based on RFC 4115's
Two-Rate, Three-Color marker. It's a two-level hierarchical policer
which first does a color-blind marking of the traffic at the queue
level, followed by a color-aware marking at the port level. At the end
traffic marked as Green or Yellow is forwarded, Red is dropped. For
details on how traffic is marked, see RFC 4115.
This egress policer can be used to limit traffic at different rated
based on the queues the traffic is in. In addition, it can also be used
to prioritize certain traffic over others at a port level.
For example, the following configuration will limit the traffic rate at a
port level to a maximum of 2000 packets a second (64 bytes IPv4 packets).
100pps as CIR (Committed Information Rate) and 1000pps as EIR (Excess
Information Rate). High priority traffic is routed to queue 10, which marks
all traffic as CIR, i.e. Green. All low priority traffic, queue 20, is
marked as EIR, i.e. Yellow.
ovs-vsctl --timeout=5 set port dpdk1 qos=@myqos -- \
--id=@myqos create qos type=trtcm-policer \
other-config:cir=52000 other-config:cbs=2048 \
other-config:eir=52000 other-config:ebs=2048 \
queues:10=@dpdk1Q10 queues:20=@dpdk1Q20 -- \
--id=@dpdk1Q10 create queue \
other-config:cir=41600000 other-config:cbs=2048 \
other-config:eir=0 other-config:ebs=0 -- \
--id=@dpdk1Q20 create queue \
other-config:cir=0 other-config:cbs=0 \
other-config:eir=41600000 other-config:ebs=2048 \
This configuration accomplishes that the high priority traffic has a
guaranteed bandwidth egressing the ports at CIR (1000pps), but it can also
use the EIR, so a total of 2000pps at max. These additional 1000pps is
shared with the low priority traffic. The low priority traffic can use at
maximum 1000pps.
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
This patch adds support for multi-queue QoS to the DPDK datapath. Most of
the code is based on an earlier patch from a patchset sent out by
zhaozhanxu. The patch was titled "[ovs-dev, v2, 1/4] netdev-dpdk.c: Support
the multi-queue QoS configuration for dpdk datapath"
Signed-off-by: zhaozhanxu <zhaozhanxu@163.com>
Co-authored-by: zhaozhanxu <zhaozhanxu@163.com>
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
In "Use of library functions" in the C standard, the following statement
is written to apply to all library functions:
If an argument to a function has an invalid value (such as ... a
null pointer ... the behavior is undefined.
Later, under the "String handling" section, "Comparison functions" no
exception is listed for strcmp, which means NULL is invalid. It may
be possible for the smap_get to return NULL.
Given the above, we must check that new_devargs is not null. The check
against NULL for new_devargs later in the function is still valid.
Fixes: 55e075e65ef9 ("netdev-dpdk: Arbitrary 'dpdk' port naming")
Signed-off-by: Aaron Conole <aconole@redhat.com>
Acked-by: Ciara Loftus <ciara.loftus@intel.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
When the dpdk vhost library executes an eventfd_write() call,
i.e. waking up the guest, a new callback will be called.
This patch adds the callback to count the number of
interrupts sent to the VM to track the number of times
interrupts where generated.
This might be of interest to find out system-calls were
called in the DPDK fast path.
The coverage counter is called "vhost_notification" and
can be read with:
$ ovs-appctl coverage/read-counter vhost_notification
13238319
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Accessing the sw stats in the vhost datapath of a PVP test
can incur a performance drop of ~2%.
Most of the time these stats will just be getting zero added
to them. By checking if there is a non-zero update first, we
can avoid accessing them when they won't be updated and avoid
the performance drop.
Fixes: 2f862c712e52 ("netdev-dpdk: Detailed packet drop statistics.")
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Currently, OVS does not register and therefore not handle the
interface reset event from the DPDK framework. This would cause a
problem in cases where a VF is used as an interface, and its
configuration changes.
As an example in the following scenario the MAC change is not
detected/acted upon until OVS is restarted without the patch applied:
$ echo 1 > /sys/bus/pci/devices/0000:05:00.1/sriov_numvfs
$ ovs-vsctl add-port ovs_pvp_br0 dpdk0 -- \
set Interface dpdk0 type=dpdk -- \
set Interface dpdk0 options:dpdk-devargs=0000:05:0a.0
$ ip link set p5p2 vf 0 mac 52:54:00:92:d3:33
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
This commit adds support for DPDK v19.11, it includes the following
changes.
1. travis: Enable compilation and linkage with dpdk 19.11.
2. sparse: Remove dpdk network headers copies.
https://patchwork.ozlabs.org/patch/1185256/
3. dpdk: Migrate to new PDUMP API.
https://patchwork.ozlabs.org/patch/1192971/
4. netdev-dpdk: Prefix network structures with rte_.
https://patchwork.ozlabs.org/patch/1109733/
5. netdev-dpdk: Update by new color definitions.
https://patchwork.ozlabs.org/patch/1086089/
6. docs: Update docs to reference 19.11.
7. docs: Add note regarding hotplug and igb_uio requirements.
For credit all authors of the original commits to 'dpdk-latest' with the
above changes been added as co-authors for this commmit.
Signed-off-by: David Marchand <david.marchand@redhat.com>
Co-authored-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Co-authored-by: Ilya Maximets <i.maximets@ovn.org>
Signed-off-by: Ophir Munk <ophirmu@mellanox.com>
Co-authored-by: Ophir Munk <ophirmu@mellanox.com>
Reviewed-by: David Marchand <david.marchand@redhat.com>
Acked-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
'dpdkr' a.k.a. DPDK ring ports has really poor support in OVS and not
tested on a regular basis. These ports are intended to work via
shared memory with another DPDK secondary process, but there are lots
of limitations for using this functionality in practice. Most of them
connected with running secondary DPDK application and memory layout
issues. More details are available in DPDK guide:
https://doc.dpdk.org/guides-18.11/prog_guide/multi_proc_support.html#multi-process-limitations
Beside the functional limitations it's also hard to use this
functionality correctly. User must be sure that OVS and secondary DPDK
application are running on different CPU cores, which is hard because
non-PMD threads could float over available CPU cores. This or any
other misconfiguration will likely lead to crash of OVS.
Another problem is that the user must actually build the secondary
application with the same version of DPDK that was used for OVS build.
Above issues are same as we have while using DPDK pdump.
Beside that, current implementation in OVS is not able to free
allocated rings that could lead to memory exhausting.
Initially these ports was added to use with IVSHMEM for a fast
zero-copy HOST<-->VM communication. However, IVSHMEM is not used
anymore. IVSHMEM support was removed from DPDK in 16.11 release
(instructions for IVSHMEM were removed from the OVS docs almost 3 years
ago by commit 90ca71dd317f ("doc: Remove ivshmem instructions.")) and
the patch for QEMU for using regular files as a device backend is no
longer available. That makes DPDK ring ports barely useful in real
virtualization environment.
This patch adds a deprecation warnings for run-time port creation
and documentation. Claiming to completely remove this functionality
from OVS in one of the next releases.
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Acked-by: Ian Stokes <ian.stokes@intel.com>
Acked-by: Aaron Conole <aconole@redhat.com>
Acked-by: David Marchand <david.marchand@redhat.com>
Acked-by: Kevin Traynor <ktraynor@redhat.com>
OVS may be unable to transmit packets for multiple reasons on
the userspace datapath and today there is a single counter to
track packets dropped due to any of those reasons. This patch
adds custom software stats for the different reasons packets
may be dropped during tx/rx on the userspace datapath in OVS.
- MTU drops : drops that occur due to a too large packet size
- Qos drops : drops that occur due to egress/ingress QOS
- Tx failures: drops as returned by the DPDK PMD send function
Note that the reason for tx failures is not specified in OVS.
In practice for vhost ports it is most common that tx failures
are because there are not enough available descriptors,
which is usually caused by misconfiguration of the guest queues
and/or because the guest is not consuming packets fast enough
from the queues.
These counters are displayed along with other stats in
"ovs-vsctl get interface <iface> statistics" command and are
available for dpdk and vhostuser/vhostuserclient ports.
Also the existing "tx_retries" counter for vhost ports has been
renamed to "ovs_tx_retries", so that all the custom statistics
that OVS accumulates itself will have the prefix "ovs_". This
will prevent any custom stats names overlapping with
driver/HW stats.
Acked-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Sriram Vatala <sriram.v@altencalsoftlabs.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
This is yet another refactoring for upcoming detailed drop stats.
It allows to use single function for all the software calculated
statistics in netdev-dpdk for both vhost and ETH ports.
UINT64_MAX used as a marker for non-supported statistics in a
same way as it's done in bridge.c for common netdev stats.
Co-authored-by: Sriram Vatala <sriram.v@altencalsoftlabs.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Signed-off-by: Sriram Vatala <sriram.v@altencalsoftlabs.com>
Acked-by: Kevin Traynor <ktraynor@redhat.com>
Add a coverage counter to help diagnose contention on the vhost txqs.
This is seen as dropped packets on the physical ports for rates that
are usually handled fine by OVS.
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Currently OVS is unable to change flow control configuration in DPDK
because new settings are being overwritten by current settings with
rte_eth_dev_flow_ctrl_get(). The fix restores correct order of
operations and at the same time does not trigger error on devices
without flow control support when flow control not requested.
Fixes: 7e1de65e8dfb ("netdev-dpdk: Fix failure to configure flow control at netdev-init.")
Signed-off-by: Tomasz Konieczny <tomaszx.konieczny@intel.com>
Co-authored-by: Ilya Maximets <i.maximets@ovn.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
'tx_q' array is allocated for each DPDK netdev. 'struct dpdk_tx_queue'
is 8 bytes long, so 8 tx queues are sharing the same cache line in
case of 64B cacheline size. This causes 'false sharing' issue in
mutliqueue case because taking the spinlock implies write to memory
i.e. cache invalidation.
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
vHost interfaces currently has only one custom statistic, but there
might be others in the near future. This refactoring makes the code
work in the same way as it done for dpdk and afxdp stats to keep the
common style over the different code places and makes it easily
extensible for the new stats addition.
Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Reviewed-by: David Marchand <david.marchand@redhat.com>
Acked-by: Kevin Traynor <ktraynor@redhat.com>
There is a big code duplication issue with DPDK xstats that led to
missed "rx_oversize_errors" statistics. It's defined but not used.
Fix that by actually using this stat along with code refactoring that
will allow us to not make same mistakes in the future.
Macro definitions are perfectly suitable to automate code generation
in such cases and already used in a couple of places in OVS for similar
purposes.
Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Reviewed-by: David Marchand <david.marchand@redhat.com>
Acked-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: Ian Stokes <ian.stokes@intel.com>
vhost tx retries can provide some mitigation against
dropped packets due to a temporarily slow guest/limited queue
size for an interface, but on the other hand when a system
is fully loaded those extra cycles retrying could mean
packets are dropped elsewhere.
Up to now max vhost tx retries have been hardcoded, which meant
no tuning and no way to disable for debugging to see if extra
cycles spent retrying resulted in rx drops on some other
interface.
Add an option to change the max retries, with a value of
0 effectively disabling vhost tx retries.
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Acked-by: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
vhost tx retries may occur, and it can be a sign that
the guest is not optimally configured.
Add a custom stat so a user will know if vhost tx retries are
occurring and hence give a hint that guest config should be
examined.
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
Fix minor issue of one possible additional retry.
Fixes: c6ec9d176dbf ("netdev-dpdk: Fix vHost stats.")
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
Rather than poll all disabled queues and waste some memory for vms that
have been shutdown, we can reconfigure when receiving a destroy
connection notification from the vhost library.
$ while true; do
ovs-appctl dpif-netdev/pmd-rxq-show |awk '
/port: / {
tot++;
if ($5 == "(enabled)") {
en++;
}
}
END {
print "total: " tot ", enabled: " en
}'
sleep 1
done
total: 66, enabled: 66
total: 6, enabled: 2
This change requires a fix in the DPDK vhost library, so bump the minimal
required version to 18.11.2.
Co-authored-by: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
At the moment, a malicious guest might negotiate VIRTIO_NET_F_MQ and
!VIRTIO_NET_F_MQ in a loop which would be seen as qp_num going from 1 to
n and n to 1 continuously, triggering datapath reconfigurations at each
transition.
Limit this by only reconfiguring on increased qp_num.
The previous patch reduced the observed cost of polling disabled queues,
so the only cost is memory.
Co-authored-by: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: David Marchand <david.marchand@redhat.com>
Acked-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
We currently poll all available queues based on the max queue count
exchanged with the vhost peer and rely on the vhost library in DPDK to
check the vring status beneath.
This can lead to some overhead when we have a lot of unused queues.
To enhance the situation, we can skip the disabled queues.
On rxq notifications, we make use of the netdev's change_seq number so
that the pmd thread main loop can cache the queue state periodically.
$ ovs-appctl dpif-netdev/pmd-rxq-show
pmd thread numa_id 0 core_id 1:
isolated : true
port: dpdk0 queue-id: 0 (enabled) pmd usage: 0 %
pmd thread numa_id 0 core_id 2:
isolated : true
port: vhost1 queue-id: 0 (enabled) pmd usage: 0 %
port: vhost3 queue-id: 0 (enabled) pmd usage: 0 %
pmd thread numa_id 0 core_id 15:
isolated : true
port: dpdk1 queue-id: 0 (enabled) pmd usage: 0 %
pmd thread numa_id 0 core_id 16:
isolated : true
port: vhost0 queue-id: 0 (enabled) pmd usage: 0 %
port: vhost2 queue-id: 0 (enabled) pmd usage: 0 %
$ while true; do
ovs-appctl dpif-netdev/pmd-rxq-show |awk '
/port: / {
tot++;
if ($5 == "(enabled)") {
en++;
}
}
END {
print "total: " tot ", enabled: " en
}'
sleep 1
done
total: 6, enabled: 2
total: 6, enabled: 2
...
# Started vm, virtio devices are bound to kernel driver which enables
# F_MQ + all queue pairs
total: 6, enabled: 2
total: 66, enabled: 66
...
# Unbound vhost0 and vhost1 from the kernel driver
total: 66, enabled: 66
total: 66, enabled: 34
...
# Configured kernel bound devices to use only 1 queue pair
total: 66, enabled: 34
total: 66, enabled: 19
total: 66, enabled: 4
...
# While rebooting the vm
total: 66, enabled: 4
total: 66, enabled: 2
...
total: 66, enabled: 66
...
# After shutting down the vm
total: 66, enabled: 66
total: 66, enabled: 2
Signed-off-by: David Marchand <david.marchand@redhat.com>
Acked-by: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
New module 'netdev-offload' created to manage different flow API
implementations. All the generic and provider independent code moved
there from the 'netdev' module.
Flow API providers further encapsulated.
The only function that was changed is 'netdev_any_oor'.
Now it uses offloading related hmap instead of common 'netdev_shash'.
Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Acked-by: Ben Pfaff <blp@ovn.org>
Acked-by: Roi Dayan <roid@mellanox.com>
Current issues with Flow API:
* OVS calls offloading functions regardless of successful
flow API initialization. (ex. on init_flow_api failure)
* Static initilaization of Flow API for a netdev_class forbids
having different offloading types for different instances
of netdev with the same netdev_class. (ex. different vports in
'system' and 'netdev' datapaths at the same time)
Solution:
* Move Flow API from the netdev_class to netdev instance.
* Make Flow API dynamic, i.e. probe the APIs and choose the
suitable one.
Side effects:
* Flow API providers localized as possible in their modules.
* Now we have an ability to make runtime checks. For example,
we could check if particular device supports features we
need, like if dpdk device supports RSS+MARK action.
Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Acked-by: Roi Dayan <roid@mellanox.com>
Post-copy Live Migration for vHost supported since DPDK 18.11 and
QEMU 2.12. New global config option 'vhost-postcopy-support' added
to control this feature. Ex.:
ovs-vsctl set Open_vSwitch . other_config:vhost-postcopy-support=true
Changing this value requires restarting the daemon. It's safe to
enable this knob even if QEMU doesn't support post-copy LM.
Feature marked as experimental and disabled by default because it may
cause PMD thread hang on destination host on page fault for the time
of page downloading from the source.
Feature is not compatible with 'mlockall' and 'dequeue zero-copy'.
Support added only for vhost-user-client.
Signed-off-by: Liliia Butorina <l.butorina@partner.samsung.com>
Co-authored-by: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
'vhost_id' is an array of 'PATH_MAX' bytes in the middle of
'netdev_dpdk' structure. That is 4K bytes.
'vhost_id' never used on a hot path and there is no need to keep
it inside the structure memory. Dynamic allocation will allow to
decrease 'struct netdev_dpdk' significantly, saving 4KB per ETH
port (ETH ports doesn't use 'vhost_id') and almost same value per
vhost ports (real 'vhost_id's, in common case, are much shorter).
We could save the pointer space by making the union with 'devargs'
which is mutually exclusive with 'vhost_id'.
As we're just removing the single 'PADDED_MEMBER', the total
cacheline layout is not affected.
Stats for 'struct netdev_dpdk':
Before: /* size: 4992, cachelines: 78 */
After : /* size: 896, cachelines: 14 */
Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
In case of reconfiguration while 'vhost_id' is not set yet,
there will be the meaningless message like:
|netdev_dpdk|DBG|TX queue mapping for
|netdev_dpdk|DBG| 0 --> 0
It's better to print the name of the netdev which is always set.
Additionally fixed possible splitting by other log messages and
missing space in the queue state message.
Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
Hardware offloading code is moved to a new file called
netdev-rte-offloads.c. The original offloading code is copied
from the netdev-dpdk.c file to the new file, where future
offloading code should be added as well.
The copied code was refactored based on coding style.
The netdev-dpdk.c file will remain unchanged as new offloading
code is added.
Co-authored-by: Ophir Munk <ophirmu@mellanox.com>
Reviewed-by: Asaf Penso <asafp@mellanox.com>
Signed-off-by: Roni Bar Yanai <roniba@mellanox.com>
Signed-off-by: Ophir Munk <ophirmu@mellanox.com>
Acked-by: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
Before offloading code was added to the netdev-dpdk.c file (MARK and
RSS actions) the only DPDK RTE calls in use were rte_flow_create() and
rte_flow_destroy(). In preparation for splitting the offloading code
from the netdev-dpdk.c file to a separate file, it is required
to embed these RTE calls into a global netdev-dpdk-* API so that
they can be called from the new file. An example for this requirement
can be seen in the handling of dev->mutex, which should be encapsulated
inside netdev-dpdk class (netdev-dpdk.c file), and should be unknown
to the outside callers. This commit embeds the rte_flow_create() call
inside the netdev_dpdk_flow_create() API and the rte_flow_destroy()
call inside the netdev_dpdk_rte_flow_destroy() API.
Reviewed-by: Asaf Penso <asafp@mellanox.com>
Signed-off-by: Roni Bar Yanai <roniba@mellanox.com>
Signed-off-by: Ophir Munk <ophirmu@mellanox.com>
Co-authored-by: Ophir Munk <ophirmu@mellanox.com>
Acked-by: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
1. No reason to have mbuf related APIs in a generic code.
2. Not only RSS/checksums should be invalidated in case of tunnel
decapsulation or sending to 'ring' ports.
In order to fix two above issues, new function
'dp_packet_reset_offload' introduced. In order to clean up/unify
the code and simplify addition of new offloading features to non-DPDK
version of dp_packet, introduced 'ol_flags' bitmask. Additionally
reduced code complexity in 'dp_packet_clone_with_headroom' by using
already existent generic APIs.
Unfortunately, we still need to have a special case for mbuf
initialization inside 'dp_packet_init__()'.
'dp_packet_init_specific()' introduced for this purpose as a generic
API for initialization of the implementation-specific fields.
Acked-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
Having a single structure allows to simplify the code path and
clear all the items at once (probably faster). This does not
increase stack memory usage because all the L4 related items
grouped in a union.
Changes:
- Memsets combined.
- 'ipv4_next_proto_mask' dropped as we already know the address
and able to use 'mask.ipv4.hdr.next_proto_id' directly.
- Group of 'if' statements for L4 protocols turned to a 'switch'.
We can do that, because we don't have semi-local variables anymore.
- Eliminated 'end_proto_check' label. Not needed with 'switch'.
Additionally 'rte_memcpy' replaced with simple 'memcpy' as it makes no
sense to use 'rte_memcpy' for 6 bytes.
Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Acked-by: Asaf Penso <asafp@mellanox.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
* Dropped 'is_all_zero' function, which is equal to 'is_all_zeros'
from util.h .
* util.h added to includes. Includes re-sorted within their blocks.
(it's hard to figure out where to put new one if there is no order.)
Note: linux/if.h depends on sys/socket.h .
* 'ovs_u128_is_zero' used instead of manual checking of fields.
* Code simplified by using direct pointer to 'match->wc.masks'.
* 'sizeof's rewritten to be coding-style complient.
Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
In netdev_dpdk_add_rte_flow_offload function different rte_flow_item are
created as part of the pattern matching.
For most of them, there is a check whether the wildcards are not zero.
In case of zero, nothing is being done with the rte_flow_item.
Befor the wildcard check, and regardless of the result, the
rte_flow_item is being memset.
The patch moves the memset function only if the condition of the
wildcard is true, thus saving cycles of memset if not needed.
Signed-off-by: Asaf Penso <asafp@mellanox.com>
Acked-by: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
Dpdk port representors were introduced in dpdk versions 18.xx.
Prior to port representors there was a one-to-one relationship
between an rte device (e.g. PCI bus) and an eth device (referenced as
dpdk port id in OVS). With port representors the relationship becomes
one-to-many rte device to eth devices.
For example in [3] there are two devices (representors) using the same
PCI physical address 0000:08:00.0: "0000:08:00.0,representor=[3]" and
"0000:08:00.0,representor=[5]".
This commit handles the new one-to-many relationship. For example,
when one of the device port representors in [3] is closed - the PCI bus
cannot be detached until the other device port representor is closed as
well. OVS remains backward compatible by supporting dpdk legacy PCI
ports which do not include port representors.
Dpdk port representors related commits are listed in [1]. Dpdk port
representors documentation appears in [2]. A sample configuration
which uses two representors ports (the output of "ovs-vsctl show"
command) is shown in [3].
[1]
e0cb96204b71 ("net/i40e: add support for representor ports")
cf80ba6e2038 ("net/ixgbe: add support for representor ports")
26c08b979d26 ("net/mlx5: add port representor awareness")
[2]
https://doc.dpdk.org/guides-18.11/prog_guide/switch_representation.html
[3]
Bridge "ovs_br0"
Port "ovs_br0"
Interface "ovs_br0"
type: internal
Port "port-rep3"
Interface "port-rep3"
type: dpdk
options: {dpdk-devargs="0000:08:00.0,representor=[3]"}
Port "port-rep5"
Interface "port-rep5"
type: dpdk
options: {dpdk-devargs="0000:08:00.0,representor=[5]"}
ovs_version: "2.10.90"
Signed-off-by: Ophir Munk <ophirmu@mellanox.com>
Co-authored-by: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
This commit adds support for DPDK v18.11, it includes the following
changes.
1. Enable compilation and linkage with dpdk 18.11.0
The following dpdk commits which were introduced after dpdk 17.11.x
require OVS updates to accommodate to the dpdk changes.
- ce17edde ("ethdev: introduce Rx queue offloads API")
- ab3ce1e0 ("ethdev: remove old offload API")
- c06ddf96 ("meter: add configuration profile")
- e58638c3 ("ethdev: fix TPID handling in flow API")
- cd8c7c7c ("ethdev: replace bus specific struct with generic dev")
- ac8d22de ("ethdev: flatten RSS configuration in flow API")
2. Limit configured rss hash functions to only those supported
by the eth device.
3. Set default RSS key in struct action_rss_data, required by OVS
commit- e8a2b5bf ("netdev-dpdk: implement flow offload with rte flow")
when configured with "other_config:hw-offload=true".
4. DEV_RX_OFFLOAD_CRC_STRIP has been removed from DPDK 18.11.
DEV_RX_OFFLOAD_KEEP_CRC can now be used to keep the CRC.
Use the correct flag and check it is supported.
5. rte_eth_dev_attach/detach have been removed from DPDK 18.11.
Replace them with rte_dev_probe/remove.
6. Update docs and travis to use DPDK18.11.
This commit squashes the following commits present on the dpdk-latest
branch:
7f021f902bb3 ("netdev-dpdk: Upgrade to dpdk v18.08")
270d9216f1ed ("netdev-dpdk: Set scatter based on capabilities")
bef2cdc8f412 ("netdev-dpdk: Fix returning the field of malloced struct.")
73c1a65167fc ("redhat: change variable used for non-root user support")
eb485f60ce44 ("dpdk: Update to use DPDK 18.11.")
For credit all authors of the original commits above have been added as
co-authors for this commmit.
From: Ophir Munk <ophirmu@mellanox.com>
Signed-off-by: Ophir Munk <ophirmu@mellanox.com>
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Co-authored-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Co-authored-by: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: Timothy Redaelli <tredaelli@redhat.com>
Co-authored-by: Timothy Redaelli <tredaelli@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
Commit dfaf00e started using the result of dpdk_buf_size() to calculate
the available size on each mbuf, as opposed to using the previous
MBUF_SIZE macro. However, this was calculating the mbuf size by adding
up the MTU with RTE_PKTMBUF_HEADROOM and only then aligning to
NETDEV_DPDK_MBUF_ALIGN. Instead, the accounting for the
RTE_PKTMBUF_HEADROOM should only happen after alignment, as per below.
Before alignment:
ROUNDUP(MTU(1500) + RTE_PKTMBUF_HEADROOM(128), 1024) = 2048
After aligment:
ROUNDUP(MTU(1500), 1024) + 128 = 2176
This might seem insignificant, however, it might have performance
implications in DPDK, where each mbuf is expected to have 2k +
RTE_PKTMBUF_HEADROOM of available space. This is because not only some
NICs have course grained alignments of 1k, they will also take
RTE_PKTMBUF_HEADROOM bytes from the overall available space in an mbuf
when setting up their Rx requirements. Thus, only the "After alignment"
case above would guarantee a 2k of available room, as the "Before
alignment" would report only 1920B.
Some extra information can be found at:
https://mails.dpdk.org/archives/dev/2018-November/119219.html
Note: This has been found by Ian Stokes while going through some
af_packet checks.
Reported-by: Ian Stokes <ian.stokes@intel.com>
Fixes: dfaf00e ("netdev-dpdk: fix mbuf sizing")
Signed-off-by: Tiago Lam <tiago.lam@intel.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>