2
0
mirror of https://github.com/openvswitch/ovs synced 2025-08-22 18:07:40 +00:00

326 Commits

Author SHA1 Message Date
Kevin Traynor
080f080c3b netdev-dpdk: Enable tx-retries-max config.
vhost tx retries can provide some mitigation against
dropped packets due to a temporarily slow guest/limited queue
size for an interface, but on the other hand when a system
is fully loaded those extra cycles retrying could mean
packets are dropped elsewhere.

Up to now max vhost tx retries have been hardcoded, which meant
no tuning and no way to disable for debugging to see if extra
cycles spent retrying resulted in rx drops on some other
interface.

Add an option to change the max retries, with a value of
0 effectively disabling vhost tx retries.

Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Acked-by: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2019-07-08 12:03:28 +01:00
Kevin Traynor
c161357d5d netdev-dpdk: Add custom stat for vhost tx retries.
vhost tx retries may occur, and it can be a sign that
the guest is not optimally configured.

Add a custom stat so a user will know if vhost tx retries are
occurring and hence give a hint that guest config should be
examined.

Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2019-07-08 12:02:51 +01:00
Kevin Traynor
730b34859f netdev-dpdk: Fix additional vhost tx retry.
Fix minor issue of one possible additional retry.

Fixes: c6ec9d176dbf ("netdev-dpdk: Fix vHost stats.")
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2019-06-28 09:49:32 +01:00
David Marchand
61473a0eb2 netdev-dpdk: Reset queue number for vhost devices on vm shutdown.
Rather than poll all disabled queues and waste some memory for vms that
have been shutdown, we can reconfigure when receiving a destroy
connection notification from the vhost library.

$ while true; do
  ovs-appctl dpif-netdev/pmd-rxq-show |awk '
  /port: / {
    tot++;
    if ($5 == "(enabled)") {
      en++;
    }
  }
  END {
    print "total: " tot ", enabled: " en
  }'
  sleep 1
done

total: 66, enabled: 66
total: 6, enabled: 2

This change requires a fix in the DPDK vhost library, so bump the minimal
required version to 18.11.2.

Co-authored-by: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2019-06-27 15:28:04 +01:00
David Marchand
7235cd206e netdev-dpdk: Avoid reconfiguration on VIRTIO_NET_F_MQ changes.
At the moment, a malicious guest might negotiate VIRTIO_NET_F_MQ and
!VIRTIO_NET_F_MQ in a loop which would be seen as qp_num going from 1 to
n and n to 1 continuously, triggering datapath reconfigurations at each
transition.

Limit this by only reconfiguring on increased qp_num.
The previous patch reduced the observed cost of polling disabled queues,
so the only cost is memory.

Co-authored-by: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: David Marchand <david.marchand@redhat.com>
Acked-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2019-06-26 19:15:13 +01:00
David Marchand
35c91567c8 dpif-netdev: Only poll enabled vhost queues.
We currently poll all available queues based on the max queue count
exchanged with the vhost peer and rely on the vhost library in DPDK to
check the vring status beneath.
This can lead to some overhead when we have a lot of unused queues.

To enhance the situation, we can skip the disabled queues.
On rxq notifications, we make use of the netdev's change_seq number so
that the pmd thread main loop can cache the queue state periodically.

$ ovs-appctl dpif-netdev/pmd-rxq-show
pmd thread numa_id 0 core_id 1:
  isolated : true
  port: dpdk0             queue-id:  0 (enabled)   pmd usage:  0 %
pmd thread numa_id 0 core_id 2:
  isolated : true
  port: vhost1            queue-id:  0 (enabled)   pmd usage:  0 %
  port: vhost3            queue-id:  0 (enabled)   pmd usage:  0 %
pmd thread numa_id 0 core_id 15:
  isolated : true
  port: dpdk1             queue-id:  0 (enabled)   pmd usage:  0 %
pmd thread numa_id 0 core_id 16:
  isolated : true
  port: vhost0            queue-id:  0 (enabled)   pmd usage:  0 %
  port: vhost2            queue-id:  0 (enabled)   pmd usage:  0 %

$ while true; do
  ovs-appctl dpif-netdev/pmd-rxq-show |awk '
  /port: / {
    tot++;
    if ($5 == "(enabled)") {
      en++;
    }
  }
  END {
    print "total: " tot ", enabled: " en
  }'
  sleep 1
done

total: 6, enabled: 2
total: 6, enabled: 2
...

 # Started vm, virtio devices are bound to kernel driver which enables
 # F_MQ + all queue pairs
total: 6, enabled: 2
total: 66, enabled: 66
...

 # Unbound vhost0 and vhost1 from the kernel driver
total: 66, enabled: 66
total: 66, enabled: 34
...

 # Configured kernel bound devices to use only 1 queue pair
total: 66, enabled: 34
total: 66, enabled: 19
total: 66, enabled: 4
...

 # While rebooting the vm
total: 66, enabled: 4
total: 66, enabled: 2
...
total: 66, enabled: 66
...

 # After shutting down the vm
total: 66, enabled: 66
total: 66, enabled: 2

Signed-off-by: David Marchand <david.marchand@redhat.com>
Acked-by: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2019-06-26 18:43:39 +01:00
Ilya Maximets
b6cabb8f8f netdev: Split up netdev offloading to separate module.
New module 'netdev-offload' created to manage different flow API
implementations. All the generic and provider independent code moved
there from the 'netdev' module.

Flow API providers further encapsulated.

The only function that was changed is 'netdev_any_oor'.
Now it uses offloading related hmap instead of common 'netdev_shash'.

Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Acked-by: Ben Pfaff <blp@ovn.org>
Acked-by: Roi Dayan <roid@mellanox.com>
2019-06-11 09:39:36 +03:00
Ilya Maximets
5fc5c50f3d netdev: Dynamic per-port Flow API.
Current issues with Flow API:

* OVS calls offloading functions regardless of successful
  flow API initialization. (ex. on init_flow_api failure)
* Static initilaization of Flow API for a netdev_class forbids
  having different offloading types for different instances
  of netdev with the same netdev_class. (ex. different vports in
  'system' and 'netdev' datapaths at the same time)

Solution:

* Move Flow API from the netdev_class to netdev instance.
* Make Flow API dynamic, i.e. probe the APIs and choose the
  suitable one.

Side effects:

* Flow API providers localized as possible in their modules.
* Now we have an ability to make runtime checks. For example,
  we could check if particular device supports features we
  need, like if dpdk device supports RSS+MARK action.

Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Acked-by: Roi Dayan <roid@mellanox.com>
2019-06-11 09:39:36 +03:00
Ilya Maximets
595ce47cd6 netdev-dpdk: Print the reason of device detaching failure.
Useful for debugging.

Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Acked-by: Ian Stokes <ian.stokes@intel.com>
2019-05-30 11:11:41 +03:00
Liliia Butorina
30e834dcb5 netdev-dpdk: Post-copy Live Migration support for vhost-user-client.
Post-copy Live Migration for vHost supported since DPDK 18.11 and
QEMU 2.12. New global config option 'vhost-postcopy-support' added
to control this feature. Ex.:

  ovs-vsctl set Open_vSwitch . other_config:vhost-postcopy-support=true

Changing this value requires restarting the daemon. It's safe to
enable this knob even if QEMU doesn't support post-copy LM.

Feature marked as experimental and disabled by default because it may
cause PMD thread hang on destination host on page fault for the time
of page downloading from the source.

Feature is not compatible with 'mlockall' and 'dequeue zero-copy'.
Support added only for vhost-user-client.

Signed-off-by: Liliia Butorina <l.butorina@partner.samsung.com>
Co-authored-by: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
2019-05-24 15:31:28 +03:00
Ilya Maximets
bb9d262357 netdev-dpdk: Allocate vhost_id dynamically.
'vhost_id' is an array of 'PATH_MAX' bytes in the middle of
'netdev_dpdk' structure. That is 4K bytes.

'vhost_id' never used on a hot path and there is no need to keep
it inside the structure memory. Dynamic allocation will allow to
decrease 'struct netdev_dpdk' significantly, saving 4KB per ETH
port (ETH ports doesn't use 'vhost_id') and almost same value per
vhost ports (real 'vhost_id's, in common case, are much shorter).
We could save the pointer space by making the union with 'devargs'
which is mutually exclusive with 'vhost_id'.
As we're just removing the single 'PADDED_MEMBER', the total
cacheline layout is not affected.

Stats for 'struct netdev_dpdk':

    Before: /* size: 4992, cachelines: 78 */
    After : /* size:  896, cachelines: 14 */

Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2019-04-10 16:54:56 +01:00
Ilya Maximets
170ef7265a netdev-dpdk: Print netdev name for txq mapping.
In case of reconfiguration while 'vhost_id' is not set yet,
there will be the meaningless message like:

    |netdev_dpdk|DBG|TX queue mapping for
    |netdev_dpdk|DBG| 0 -->  0

It's better to print the name of the netdev which is always set.

Additionally fixed possible splitting by other log messages and
missing space in the queue state message.

Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2019-03-22 22:44:54 +00:00
Roni Bar Yanai
3d67b2d28e netdev-dpdk: Move offloading code to a new file
Hardware offloading code is moved to a new file called
netdev-rte-offloads.c. The original offloading code is copied
from the netdev-dpdk.c file to the new file, where future
offloading code should be added as well.
The copied code was refactored based on coding style.
The netdev-dpdk.c file will remain unchanged as new offloading
code is added.

Co-authored-by: Ophir Munk <ophirmu@mellanox.com>
Reviewed-by: Asaf Penso <asafp@mellanox.com>
Signed-off-by: Roni Bar Yanai <roniba@mellanox.com>
Signed-off-by: Ophir Munk <ophirmu@mellanox.com>
Acked-by: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2019-03-19 14:33:27 +00:00
Roni Bar Yanai
6775bdfc9b netdev-dpdk: Expose flow creation/destruction calls
Before offloading code was added to the netdev-dpdk.c file (MARK and
RSS actions) the only DPDK RTE calls in use were rte_flow_create() and
rte_flow_destroy(). In preparation for splitting the offloading code
from the netdev-dpdk.c file to a separate file, it is required
to embed these RTE calls into a global netdev-dpdk-* API so that
they can be called from the new file. An example for this requirement
can be seen in the handling of dev->mutex, which should be encapsulated
inside netdev-dpdk class (netdev-dpdk.c file), and should be unknown
to the outside callers. This commit embeds the rte_flow_create() call
inside the netdev_dpdk_flow_create() API and the rte_flow_destroy()
call inside the netdev_dpdk_rte_flow_destroy() API.

Reviewed-by: Asaf Penso <asafp@mellanox.com>
Signed-off-by: Roni Bar Yanai <roniba@mellanox.com>
Signed-off-by: Ophir Munk <ophirmu@mellanox.com>
Co-authored-by: Ophir Munk <ophirmu@mellanox.com>
Acked-by: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2019-03-19 14:12:21 +00:00
Ilya Maximets
a47e2db209 dp-packet: Refactor offloading API.
1. No reason to have mbuf related APIs in a generic code.
2. Not only RSS/checksums should be invalidated in case of tunnel
   decapsulation or sending to 'ring' ports.

In order to fix two above issues, new function
'dp_packet_reset_offload' introduced. In order to clean up/unify
the code and simplify addition of new offloading features to non-DPDK
version of dp_packet, introduced 'ol_flags' bitmask. Additionally
reduced code complexity in 'dp_packet_clone_with_headroom' by using
already existent generic APIs.

Unfortunately, we still need to have a special case for mbuf
initialization inside 'dp_packet_init__()'.
'dp_packet_init_specific()' introduced for this purpose as a generic
API for initialization of the implementation-specific fields.

Acked-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2019-03-13 09:51:30 +00:00
Ilya Maximets
324215c65c netdev-dpdk: Use single struct/union for flow offload items.
Having a single structure allows to simplify the code path and
clear all the items at once (probably faster). This does not
increase stack memory usage because all the L4 related items
grouped in a union.

Changes:
  - Memsets combined.
  - 'ipv4_next_proto_mask' dropped as we already know the address
    and able to use 'mask.ipv4.hdr.next_proto_id' directly.
  - Group of 'if' statements for L4 protocols turned to a 'switch'.
    We can do that, because we don't have semi-local variables anymore.
  - Eliminated 'end_proto_check' label. Not needed with 'switch'.

Additionally 'rte_memcpy' replaced with simple 'memcpy' as it makes no
sense to use 'rte_memcpy' for 6 bytes.

Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Acked-by: Asaf Penso <asafp@mellanox.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2019-02-27 22:01:18 +00:00
Ilya Maximets
6ebc4b09c2 netdev-dpdk: Flow validation refactoring.
* Dropped 'is_all_zero' function, which is equal to 'is_all_zeros'
  from util.h .
* util.h added to includes. Includes re-sorted within their blocks.
  (it's hard to figure out where to put new one if there is no order.)
  Note: linux/if.h depends on sys/socket.h .
* 'ovs_u128_is_zero' used instead of manual checking of fields.
* Code simplified by using direct pointer to 'match->wc.masks'.
* 'sizeof's rewritten to be coding-style complient.

Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2019-02-19 13:45:17 +00:00
Asaf Penso
fa2353af7e netdev-dpdk: Memset rte_flow_item on a need basis.
In netdev_dpdk_add_rte_flow_offload function different rte_flow_item are
created as part of the pattern matching.

For most of them, there is a check whether the wildcards are not zero.
In case of zero, nothing is being done with the rte_flow_item.

Befor the wildcard check, and regardless of the result, the
rte_flow_item is being memset.

The patch moves the memset function only if the condition of the
wildcard is true, thus saving cycles of memset if not needed.

Signed-off-by: Asaf Penso <asafp@mellanox.com>
Acked-by: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2019-02-06 11:41:21 +00:00
Ophir Munk
40e940e439 netdev-dpdk: support port representors
Dpdk port representors were introduced in dpdk versions 18.xx.
Prior to port representors there was a one-to-one relationship
between an rte device (e.g. PCI bus) and an eth device (referenced as
dpdk port id in OVS). With port representors the relationship becomes
one-to-many rte device to eth devices.
For example in [3] there are two devices (representors) using the same
PCI physical address 0000:08:00.0: "0000:08:00.0,representor=[3]" and
"0000:08:00.0,representor=[5]".
This commit handles the new one-to-many relationship. For example,
when one of the device port representors in [3] is closed - the PCI bus
cannot be detached until the other device port representor is closed as
well. OVS remains backward compatible by supporting dpdk legacy PCI
ports which do not include port representors.
Dpdk port representors related commits are listed in [1]. Dpdk port
representors documentation appears in [2]. A sample configuration
which uses two representors ports (the output of "ovs-vsctl show"
command) is shown in [3].

[1]
e0cb96204b71 ("net/i40e: add support for representor ports")
cf80ba6e2038 ("net/ixgbe: add support for representor ports")
26c08b979d26 ("net/mlx5: add port representor awareness")

[2]
https://doc.dpdk.org/guides-18.11/prog_guide/switch_representation.html

[3]
Bridge "ovs_br0"
    Port "ovs_br0"
        Interface "ovs_br0"
            type: internal
    Port "port-rep3"
        Interface "port-rep3"
            type: dpdk
            options: {dpdk-devargs="0000:08:00.0,representor=[3]"}
    Port "port-rep5"
        Interface "port-rep5"
            type: dpdk
            options: {dpdk-devargs="0000:08:00.0,representor=[5]"}
    ovs_version: "2.10.90"

Signed-off-by: Ophir Munk <ophirmu@mellanox.com>
Co-authored-by: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2019-01-17 23:33:36 +00:00
Ophir Munk
03f3f9c0fa dpdk: Update to use DPDK 18.11.
This commit adds support for DPDK v18.11, it includes the following
changes.

1. Enable compilation and linkage with dpdk 18.11.0
   The following dpdk commits which were introduced after dpdk 17.11.x
   require OVS updates to accommodate to the dpdk changes.
   - ce17edde ("ethdev: introduce Rx queue offloads API")
   - ab3ce1e0 ("ethdev: remove old offload API")
   - c06ddf96 ("meter: add configuration profile")
   - e58638c3 ("ethdev: fix TPID handling in flow API")
   - cd8c7c7c ("ethdev: replace bus specific struct with generic dev")
   - ac8d22de ("ethdev: flatten RSS configuration in flow API")

2. Limit configured rss hash functions to only those supported
   by the eth device.

3. Set default RSS key in struct action_rss_data, required by OVS
   commit- e8a2b5bf ("netdev-dpdk: implement flow offload with rte flow")
   when configured with "other_config:hw-offload=true".

4. DEV_RX_OFFLOAD_CRC_STRIP has been removed from DPDK 18.11.
   DEV_RX_OFFLOAD_KEEP_CRC can now be used to keep the CRC.
   Use the correct flag and check it is supported.

5. rte_eth_dev_attach/detach have been removed from DPDK 18.11.
   Replace them with rte_dev_probe/remove.

6. Update docs and travis to use DPDK18.11.

This commit squashes the following commits present on the dpdk-latest
branch:

7f021f902bb3 ("netdev-dpdk: Upgrade to dpdk v18.08")
270d9216f1ed ("netdev-dpdk: Set scatter based on capabilities")
bef2cdc8f412 ("netdev-dpdk: Fix returning the field of malloced struct.")
73c1a65167fc ("redhat: change variable used for non-root user support")
eb485f60ce44 ("dpdk: Update to use DPDK 18.11.")

For credit all authors of the original commits above have been added as
co-authors for this commmit.

From: Ophir Munk <ophirmu@mellanox.com>
Signed-off-by: Ophir Munk <ophirmu@mellanox.com>
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Co-authored-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Co-authored-by: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: Timothy Redaelli <tredaelli@redhat.com>
Co-authored-by: Timothy Redaelli <tredaelli@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2018-12-13 14:25:46 +00:00
Tiago Lam
a32bab26e5 netdev-dpdk: Add mbuf HEADROOM after alignment.
Commit dfaf00e started using the result of dpdk_buf_size() to calculate
the available size on each mbuf, as opposed to using the previous
MBUF_SIZE macro. However, this was calculating the mbuf size by adding
up the MTU with RTE_PKTMBUF_HEADROOM and only then aligning to
NETDEV_DPDK_MBUF_ALIGN. Instead, the accounting for the
RTE_PKTMBUF_HEADROOM should only happen after alignment, as per below.

Before alignment:
ROUNDUP(MTU(1500) + RTE_PKTMBUF_HEADROOM(128), 1024) = 2048

After aligment:
ROUNDUP(MTU(1500), 1024) + 128 = 2176

This might seem insignificant, however, it might have performance
implications in DPDK, where each mbuf is expected to have 2k +
RTE_PKTMBUF_HEADROOM of available space. This is because not only some
NICs have course grained alignments of 1k, they will also take
RTE_PKTMBUF_HEADROOM bytes from the overall available space in an mbuf
when setting up their Rx requirements. Thus, only the "After alignment"
case above would guarantee a 2k of available room, as the "Before
alignment" would report only 1920B.

Some extra information can be found at:
https://mails.dpdk.org/archives/dev/2018-November/119219.html

Note: This has been found by Ian Stokes while going through some
af_packet checks.

Reported-by: Ian Stokes <ian.stokes@intel.com>
Fixes: dfaf00e ("netdev-dpdk: fix mbuf sizing")
Signed-off-by: Tiago Lam <tiago.lam@intel.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2018-11-28 15:30:03 +00:00
Eelco Chaudron
2d37de73c1 netdev-dpdk: Bring link down when NETDEV_UP is not set
When the netdev link flags are changed, !NETDEV_UP, the DPDK ports are not
actually going down. This is causing problems for people trying to bring
down a bond member. The bond link is no longer being used to receive or
transmit traffic, however, the other end keeps sending data as the link
remains up.

With OVS 2.6 the link was brought down, and this was changed with commit
3b1fb0779. In this commit, it's explicitly mentioned that the link down/up
DPDK APIs are not called as not all PMD devices support it.

However, this patch does call the appropriate DPDK APIs and ignoring
errors due to the PMD not supporting it. PMDs not supporting this should
be fixed in DPDK upstream.

I verified this patch is working correctly using the
ovs-appctl netdev-dpdk/set-admin-state <port> {up|down} and
ovs-ofctl mod-port <bridge> <port> {up|down} commands on a XL710
and 82599ES.

Fixes: 3b1fb0779b87 ("netdev-dpdk: Don't call rte_dev_stop() in update_flags().")
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Acked-by: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2018-11-12 15:45:12 +00:00
Tiago Lam
3aaa620151 dp-packet: Fix allocated size on DPDK init.
When enabled with DPDK OvS deals with two types of packets, the ones
coming from the mempool and the ones locally created by OvS - which are
copied to mempool mbufs before output. In the latter, the space is
allocated from the system, while in the former the mbufs are allocated
from a mempool, which takes care of initialising them appropriately.

In the current implementation, during mempool's initialisation of mbufs,
dp_packet_set_allocated() is called from dp_packet_init_dpdk() without
considering that the allocated space, in the case of multi-segment
mbufs, might be greater than a single mbuf.  Furthermore, given that
dp_packet_init_dpdk() is on the code path that's called upon mempool's
initialisation, a call to dp_packet_set_allocated() is redundant, since
mempool takes care of initialising it.

To fix this, dp_packet_set_allocated() is no longer called after
initialisation of a mempool, only in dp_packet_init__(), which is still
called by OvS when initialising locally created packets.

Signed-off-by: Tiago Lam <tiago.lam@intel.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2018-11-02 16:29:14 +00:00
Mark Kavanagh
dfaf00e8c3 netdev-dpdk: fix mbuf sizing
There are numerous factors that must be considered when calculating
the size of an mbuf:
- the data portion of the mbuf must be sized in accordance With Rx
  buffer alignment (typically 1024B). So, for example, in order to
  successfully receive and capture a 1500B packet, mbufs with a
  data portion of size 2048B must be used.
- in OvS, the elements that comprise an mbuf are:
  * the dp packet, which includes a struct rte mbuf (704B)
  * RTE_PKTMBUF_HEADROOM (128B)
  * packet data (aligned to 1k, as previously described)
  * RTE_PKTMBUF_TAILROOM (typically 0)

Some PMDs require that the total mbuf size (i.e. the total sum of all
of the above-listed components' lengths) is cache-aligned. To satisfy
this requirement, it may be necessary to round up the total mbuf size
with respect to cacheline size. In doing so, it's possible that the
dp_packet's data portion is inadvertently increased in size, such that
it no longer adheres to Rx buffer alignment. Consequently, the
following property of the mbuf no longer holds true:

    mbuf.data_len == mbuf.buf_len - mbuf.data_off

This creates a problem in the case of multi-segment mbufs, where that
assumption is assumed to be true for all but the final segment in an
mbuf chain. Resolve this issue by adjusting the size of the mbuf's
private data portion, as opposed to the packet data portion when
aligning mbuf size to cachelines.

Co-authored-by: Tiago Lam <tiago.lam@intel.com>

Fixes: 4be4d22 ("netdev-dpdk: clean up mbuf initialization")
Fixes: 31b88c9 ("netdev-dpdk: round up mbuf_size to cache_line_size")
CC: Santosh Shukla <santosh.shukla@caviumnetworks.com>
Signed-off-by: Mark Kavanagh <mark.b.kavanagh@intel.com>
Signed-off-by: Tiago Lam <tiago.lam@intel.com>
Acked-by: Santosh Shukla <santosh.shukla@caviumnetworks.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2018-11-02 16:27:31 +00:00
Ian Stokes
31154f9523 netdev-dpdk: Add link speed to get_status().
Report the link speed of the device in netdev_dpdk_get_status()
function.

Link speed is already reported as part of the netdev_get_features()
function. However only link speeds defined in the OpenFlow specs are
supported so speeds such as 25 Gbps etc. are not shown. The link
speed for the device is available in Mbps in rte_eth_link.
This commit converts the link speed for a given dpdk device to an
easy to read string and reports it in get_status().

Suggested-by: Flavio Leitner <fbl@sysclose.org>
Acked-by: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2018-11-02 15:17:47 +00:00
Ian Stokes
dfcb5b8ad5 netdev-dpdk: Fix netdev_dpdk_get_features().
This commit fixes netdev_dpdk_get_features() by initializing a bitmap
that represents current features to zero and accounting for non defined
link speed values in the OpenFlow spec.

The current approach for retrieving netdev dpdk features uses a
pointer allocated in the stack without being initialized. As such there
is no guarantee that the bitmap will be accurate. Fix this by declaring
and initializing local variable 'feature' to be used when building the
bitmap, with its value then assigned to the pointer. Also account for
link speeds not defined in the OpenFlow spec by defaulting to
NETDEV_F_OTHER for undefined link speeds.

Fixes: 8a9562d21a40 ("dpif-netdev: Add DPDK netdev.")
Acked-by: Ilya Maximets <i.maximets@samsung.com>
Co-authored-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2018-11-02 15:17:35 +00:00
Ilya Maximets
9474073615 netdev-dpdk: Dump flow patterns only if debug enabled.
No need to waste time for fields checking in case DBG disabled.
Additionally sequence of prints replaced with single print
to avoid output interrupting by other log messages.

Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2018-11-02 15:16:14 +00:00
Ilya Maximets
faf71e4922 netdev-dpdk: Print port name in offload API messages.
This is useful for understanding which flows offloaded to
which ports.

Code refactored a bit to reduce number of casts.

Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2018-11-02 15:15:27 +00:00
Ilya Maximets
5752eae485 dpif-netdev: Fix cmap node use after free on flow disassociation.
Data pointed by cmap node must not be freed while iterating.
ovsrcu_postpone should be used instead.

CC: Finn Christensen <fc@napatech.com>
Fixes: e8a2b5bf92bb ("netdev-dpdk: implement flow offload with rte flow")
Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2018-11-02 15:13:54 +00:00
Ilya Maximets
95ca79d542 netdev-dpdk: Secure flow offload API.
rte API is not thread safe. We have to get netdev mutex
before uing it and also before using fields of netdev structure.

This is important because offload API used from the separate
thread and could be used at the same time with other netdev
functions called from the main thread.

CC: Finn Christensen <fc@napatech.com>
Fixes: e8a2b5bf92bb ("netdev-dpdk: implement flow offload with rte flow")
Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2018-11-02 15:13:40 +00:00
Ilya Maximets
c0af6425d7 netdev-dpdk: Drop offload API for vhost ports.
vhost ports are not DPDK eth ports and has no rte_flow API.
Stop calling this API with DPDK_ETH_PORT_ID_INVALID to
avoid time wasting and errors in log.

Additionally, DPDK_FLOW_OFFLOAD_API definition moved to .c
file, because there is no need to expose it in header.

CC: Finn Christensen <fc@napatech.com>
Fixes: e8a2b5bf92bb ("netdev-dpdk: implement flow offload with rte flow")
Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2018-11-02 15:13:19 +00:00
Ben Pfaff
89c09c1cd1 netdev: Clean up class initialization.
The macros are hard to read.  This makes it a little more readable.

Signed-off-by: Ben Pfaff <blp@ovn.org>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2018-08-27 17:48:23 +01:00
Xu Binbin
74cd69a479 netdev-dpdk: Support the link speed of XL710
In the scenario of XL710, the link speed which stored in the table
of Interface is not 40G. Because the implementation of query of link
speed only support to 10G, the parameter 'current' will be a random
value in the scenario of higher link speed. In this case, incorrect
link speed of XL710 nic will be stored in the database.

Signed-off-by: Xu Binbin <xu.binbin1@zte.com.cn>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2018-08-27 17:48:23 +01:00
Kevin Traynor
51c6a5a3c8 netdev-dpdk: Use hex for PCI vendor ID.
Match the prefix and formatting.

Fixes: 8a9562d21a40 ("dpif-netdev: Add DPDK netdev.")
Cc: pshelar@ovn.org

Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2018-08-08 22:06:21 +01:00
Sugesh Chandran
7e1de65e8d netdev-dpdk: Fix failure to configure flow control at netdev-init.
Configuring flow control at ixgbe netdev-init is throwing error in port
start.

For eg: without this fix, user cannot configure flow control on ixgbe dpdk
port as below,

"
    ovs-vsctl add-port br0 dpdk0 -- set Interface dpdk0 type=dpdk \
        options:dpdk-devargs=0000:05:00.1 options:rx-flow-ctrl=true
"

Instead,  it must be configured as two different commands,

"
    ovs-vsctl add-port br0 dpdk0 -- set Interface dpdk0 type=dpdk \
               options:dpdk-devargs=0000:05:00.1
    ovs-vsctl set Interface dpdk0 options:rx-flow-ctrl=true
"

The DPDK ixgbe driver is now validating all the 'rte_eth_fc_conf' fields before
trying to configuring the dpdk ethdev. Hence OVS can no longer set the
'dont care' fields to just '0' as before. This commit make sure all the
'rte_eth_fc_conf' fields are populated with default values before the dev
init.

Also to avoid read error on unsupported ports, the flow control parameters
are now read only when user is trying to configure/update it.

Signed-off-by: Sugesh Chandran <sugesh.chandran@intel.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2018-08-08 22:06:21 +01:00
Ben Pfaff
773c3cb40f netdev-dpdk: Use ETH_ADDR_BYTES_ARGS instead of open-coding it.
Signed-off-by: Ben Pfaff <blp@ovn.org>
Reviewed-by: Yifeng Sun <pkusunyifeng@gmail.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2018-07-24 22:36:38 +01:00
Ben Pfaff
31a033cb71 netdev-dpdk: Fix sparse complaints.
Neither of these is a real problem.

Signed-off-by: Ben Pfaff <blp@ovn.org>
Reviewed-by: Yifeng Sun <pkusunyifeng@gmail.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2018-07-24 22:36:29 +01:00
Ben Pfaff
2b7b5dbb07 netdev-dpdk: Fix incorrect byte order conversion in log message.
uint8_t values shouldn't be passed to ntohs().

Found by soarse.

Signed-off-by: Ben Pfaff <blp@ovn.org>
Reviewed-by: Yifeng Sun <pkusunyifeng@gmail.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2018-07-24 22:36:21 +01:00
Ian Stokes
43307ad0e2 dpdk: Support both shared and per port mempools.
This commit re-introduces the concept of shared mempools as the default
memory model for DPDK devices. Per port mempools are still available but
must be enabled explicitly by a user.

OVS previously used a shared mempool model for ports with the same MTU
and socket configuration. This was replaced by a per port mempool model
to address issues flagged by users such as:

https://mail.openvswitch.org/pipermail/ovs-discuss/2016-September/042560.html

However the per port model potentially requires an increase in memory
resource requirements to support the same number of ports and configuration
as the shared port model.

This is considered a blocking factor for current deployments of OVS
when upgrading to future OVS releases as a user may have to redimension
memory for the same deployment configuration. This may not be possible for
users.

This commit resolves the issue by re-introducing shared mempools as
the default memory behaviour in OVS DPDK but also refactors the memory
configuration code to allow for per port mempools.

This patch adds a new global config option, per-port-memory, that
controls the enablement of per port mempools for DPDK devices.

    ovs-vsctl set Open_vSwitch . other_config:per-port-memory=true

This value defaults to false; to enable per port memory support,
this field should be set to true when setting other global parameters
on init (such as "dpdk-socket-mem", for example). Changing the value at
runtime is not supported, and requires restarting the vswitch
daemon.

The mempool sweep functionality is also replaced with the
sweep functionality from OVS 2.9 found in commits

c77f692 (netdev-dpdk: Free mempool only when no in-use mbufs.)
a7fb0a4 (netdev-dpdk: Add mempool reuse/free debug.)

A new document to discuss the specifics of the memory models and example
memory requirement calculations is also added.

Signed-off-by: Ian Stokes <ian.stokes@intel.com>
Acked-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: Tiago Lam <tiago.lam@intel.com>
Tested-by: Tiago Lam <tiago.lam@intel.com>
2018-07-06 12:46:26 +01:00
Yuanhan Liu
daf90186e2 netdev-dpdk: add debug for rte flow patterns
For debug purpose.

Co-authored-by: Finn Christensen <fc@napatech.com>
Signed-off-by: Yuanhan Liu <yliu@fridaylinux.org>
Signed-off-by: Finn Christensen <fc@napatech.com>
Co-authored-by: Shahaf Shuler <shahafs@mellanox.com>
Signed-off-by: Shahaf Shuler <shahafs@mellanox.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2018-07-06 10:32:52 +01:00
Finn Christensen
e8a2b5bf92 netdev-dpdk: implement flow offload with rte flow
The basic yet the major part of this patch is to translate the "match"
to rte flow patterns. And then, we create a rte flow with MARK + RSS
actions. Afterwards, all packets match the flow will have the mark id in
the mbuf.

The reason RSS is needed is, for most NICs, a MARK only action is not
allowed. It has to be used together with some other actions, such as
QUEUE, RSS, etc. However, QUEUE action can specify one queue only, which
may break the rss. Likely, RSS action is currently the best we could
now. Thus, RSS action is choosen.

For any unsupported flows, such as MPLS, -1 is returned, meaning the
flow offload is failed and then skipped.

Co-authored-by: Yuanhan Liu <yliu@fridaylinux.org>
Signed-off-by: Finn Christensen <fc@napatech.com>
Signed-off-by: Yuanhan Liu <yliu@fridaylinux.org>
Co-authored-by: Shahaf Shuler <shahafs@mellanox.com>
Signed-off-by: Shahaf Shuler <shahafs@mellanox.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2018-07-06 10:32:52 +01:00
John Hurley
88dcf2aa82 netdev-provider: add class op to get block_id
Add a new class op for netdevs to get the block_id if one exists. The
block_id is used in offload ops to group multiple qdiscs together.

Stub calls are made to the new class op (implementation to follow in
further patches). The default block_id of 0 (no block) will be used in
these cases.

Signed-off-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
2018-06-29 14:51:47 +02:00
Aaron Conole
b9a3183d3a netdev-dpdk: Avoid warning for snprintf() call.
lib/netdev-dpdk.c: In function :
lib/netdev-dpdk.c:2865:49: warning:  output may be truncated before the last format character [-Wformat-truncation=]
        snprintf(vhost_vring, 16, "vring_%d_size", i);
        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Suggested-by: Ben Pfaff <blp@ovn.org>
Signed-off-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
Acked-by: Ilya Maximets <i.maximets@samsung.com>
2018-06-15 11:26:14 -07:00
Ian Stokes
4dd16ca0c3 netdev-dpdk: Handle ENOTSUP for rte_eth_dev_set_mtu.
The function rte_eth_dev_set_mtu is not supported for all DPDK drivers.
Currently if it is not supported we return an error in
dpdk_eth_dev_queue_setup. There are two issues with this.

(i) A device can still function even if rte_eth_dev_set_mtu is not
supported albeit with the default max rx packet length.

(ii) When ENOTSUP is returned it will not be caught in port_reconfigure()
at the dpif-netdev layer. Port_reconfigure() checks if a netdev_reconfigure()
function is supported for a given netdev and ignores EOPNOTSUPP errors as it
assumes errors of this value mean there is no reconfiguration function.
In this case the reconfiguration function is supported for netdev dpdk but
a function called as part of the reconfigure (rte_eth_dev_set_mtu) may
not be supported.

As this is a corner case, this commit warns a user when
rte_eth_dev_set_mtu is not supported and informs them of the default
max rx packet length that will be used instead.

Signed-off-by: Ian Stokes <ian.stokes@intel.com>
Co-author: Michal Weglicki <michalx.weglicki@intel.com>
Tested-By: Ciara Loftus <ciara.loftus@intel.com>
Acked-by: Cian Ferriter <cian.ferriter@intel.com>
Tested-by: Cian Ferriter <cian.ferriter@intel.com>
2018-06-08 17:27:56 +01:00
Michal Weglicki
e10ca8b921 netdev-dpdk: Enable HW_CRC_STRIP for virtual functions.
Virtual functions such as igb_vf and i40e_vf require HW_CRC_STRIP to be
explicitly enabled before configuration, otherwise device configuration
will fail.

This commit achieves this by adding NETDEV_RX_HW_CRC_STRIP to
dpdk_hw_ol_features. When a dpdk device is added, the driver for the
device is examined, if the device is a virtual function enable
HW_CRC_STRIP.

Signed-off-by: Michal Weglicki <michalx.weglicki@intel.com>
Co-Authored: Ian Stokes <ian.stokes@intel.com>
Acked-by: Cian Ferriter <cian.ferriter@intel.com>
Tested-by: Cian Ferriter <cian.ferriter@intel.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2018-06-08 17:27:56 +01:00
Timothy Redaelli
7bbc2e1def netdev-dpdk: fix check for "net_nfp" driver
Currently the check of "net_nfp" driver while enabling scatter compares
only the first 6 bytes, but "net_nfp" is 7 bytes long.

This change fixes the check by comparing the first 7 bytes.

CC: Pablo Cascón <pablo.cascon@netronome.com>
CC: Simon Horman <simon.horman@netronome.com>
Fixes: 65a87968f4cf ("netdev-dpdk: don't enable scatter for jumbo RX support for nfp")
Signed-off-by: Timothy Redaelli <tredaelli@redhat.com>
Acked-by: Pablo Cascón <pablo.cascon@netronome.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2018-05-25 09:09:50 +01:00
Eelco Chaudron
606f665072 netdev-dpdk: Don't use PMD driver if not configured successfully
When initialization of the DPDK PMD driver fails
(dpdk_eth_dev_init()), the reconfigure_datapath() function will remove
the port from dp_netdev, and the port is not used.

Now when bridge_reconfigure() is called again, no changes to the
previous failing netdev configuration are detected and therefore the
ports gets added to dp_netdev and used uninitialized. This is causing
exceptions...

The fix has two parts to it. First in netdev-dpdk.c we remember if the
DPDK port was started or not, and when calling
netdev_dpdk_reconfigure() we also try re-initialization if the port
was not already active. The second part of the change is in
dpif-netdev.c where it makes sure netdev_reconfigure() is called if
the port needs reconfiguration, as netdev_is_reconf_required() is only
true until netdev_reconfigure() is called (even if it fails).

Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Tested-by: Ciara Loftus <ciara.loftus@intel.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2018-05-25 09:09:50 +01:00
Kevin Traynor
1f84a2d5b5 netdev-dpdk: Remove use of rte_mempool_ops_get_count.
rte_mempool_ops_get_count is not exported by DPDK so it means it
cannot be used by OVS when using DPDK as a shared library.

Remove rte_mempool_ops_get_count but still use rte_mempool_full
and document it's behavior.

Fixes: 91fccdad72a2 ("netdev-dpdk: Free mempool only when no in-use mbufs.")
Reported-by: Timothy Redaelli <tredaelli@redhat.com>
Reported-by: Markos Chandras <mchandras@suse.de>
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2018-05-25 09:09:50 +01:00
Darrell Ball
7d7ded7af7 odp-execute: Rename 'may_steal' to 'should_steal'.
Signed-off-by: Darrell Ball <dlu998@gmail.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
2018-05-23 11:36:47 -07:00
Eelco Chaudron
eaa4358119 netdev-dpdk: Fixed netdev_dpdk structure alignment
Currently, the code tells us we have 4 pad bytes left in cacheline0
while actually we are 8 bytes short:

struct netdev_dpdk {
	union {
		OVS_CACHE_LINE_MARKER cacheline0;        /*           1 */
		struct {
			dpdk_port_t port_id;             /*     0     2 */
			_Bool      attached;             /*     2     1 */
			struct eth_addr hwaddr;          /*     4     6 */
			int        mtu;                  /*    12     4 */
			int        socket_id;            /*    16     4 */
			int        buf_size;             /*    20     4 */
			int        max_packet_len;       /*    24     4 */
			enum dpdk_dev_type type;         /*    28     4 */
			enum netdev_flags flags;         /*    32     4 */
			char *     devargs;              /*    40     8 */
			struct dpdk_tx_queue * tx_q;     /*    48     8 */
			struct rte_eth_link link;        /*    56     8 */
			int        link_reset_cnt;       /*    64     4 */
		};                                       /*          72 */
		uint8_t            pad9[128];            /*         128 */
	};                                               /*     0   128 */
	/* --- cacheline 2 boundary (128 bytes) --- */

Re-located one member, link_reset_cnt, and now it's one cache line:

struct netdev_dpdk {
	union {
		OVS_CACHE_LINE_MARKER cacheline0;        /*           1 */
		struct {
			dpdk_port_t port_id;             /*     0     2 */
			_Bool      attached;             /*     2     1 */
			struct eth_addr hwaddr;          /*     4     6 */
			int        mtu;                  /*    12     4 */
			int        socket_id;            /*    16     4 */
			int        buf_size;             /*    20     4 */
			int        max_packet_len;       /*    24     4 */
			enum dpdk_dev_type type;         /*    28     4 */
			enum netdev_flags flags;         /*    32     4 */
			int        link_reset_cnt;       /*    36     4 */
			char *     devargs;              /*    40     8 */
			struct dpdk_tx_queue * tx_q;     /*    48     8 */
			struct rte_eth_link link;        /*    56     8 */
		};                                       /*          64 */
		uint8_t            pad9[64];             /*          64 */
	};                                               /*     0    64 */
	/* --- cacheline 1 boundary (64 bytes) --- */

Fixes: 5e925ccc2a6f ("netdev-dpdk: DPDK v17.11 upgrade")
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Tiago Lam <tiago.lam@intel.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2018-05-11 08:08:24 +01:00