2
0
mirror of https://github.com/openvswitch/ovs synced 2025-08-29 05:18:13 +00:00

651 Commits

Author SHA1 Message Date
Harry van Haaren
f54d8f004f dpif-netdev: Add specialized generic scalar functions
This commit adds a number of specialized functions, that handle
common miniflow fingerprints. This enables compiler optimization,
resulting in higher performance. Below a quick description of
how this optimization actually works;

"Specialized functions" are "instances" of the generic implementation,
but the compiler is given extra context when compiling. In the case of
iterating miniflow datastructures, the most interesting value to enable
compile time optimizations is the loop trip count per unit.

In order to create a specialized function, there is a generic implementation,
which uses a for() loop without the compiler knowing the loop trip count at
compile time. The loop trip count is passed in as an argument to the function:

uint32_t miniflow_impl_generic(struct miniflow *mf, uint32_t loop_count)
{
    for(uint32_t i = 0; i < loop_count; i++)
        // do work
}

In order to "specialize" the function, we call the generic implementation
with hard-coded numbers - these are compile time constants!

uint32_t miniflow_impl_loop5(struct miniflow *mf, uint32_t loop_count)
{
    // use hard coded constant for compile-time constant-propogation
    return miniflow_impl_generic(mf, 5);
}

Given the compiler is aware of the loop trip count at compile time,
it can perform an optimization known as "constant propogation". Combined
with inlining of the miniflow_impl_generic() function, the compiler is
now enabled to *compile time* unroll the loop 5x, and produce "flat" code.

The last step to using the specialized functions is to utilize a
function-pointer to choose the specialized (or generic) implementation.
The selection of the function pointer is performed at subtable creation
time, when miniflow fingerprint of the subtable is known. This technique
is known as "multiple dispatch" in some literature, as it uses multiple
items of information (miniflow bit counts) to select the dispatch function.

By pointing the function pointer at the optimized implementation, OvS
benefits from the compile time optimizations at runtime.

Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
Tested-by: Malvika Gupta <malvika.gupta@arm.com>
Acked-by: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2019-07-19 12:24:13 +01:00
Harry van Haaren
a0b36b3924 dpif-netdev: Refactor generic implementation
This commit refactors the generic implementation. The
goal of this refactor is to simplify the code to enable
"specialization" of the functions at compile time.

Given compile-time optimizations, the compiler is able
to unroll loops, and create optimized code sequences due
to compile time knowledge of loop-trip counts.

In order to enable these compiler optimizations, we must
refactor the code to pass the loop-trip counts to functions
as compile time constants.

This patch allows the number of miniflow-bits set per "unit"
in the miniflow to be passed around as a function argument.

Note that this patch does NOT yet take advantage of doing so,
this is only a refactor to enable it in the next patches.

Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
Tested-by: Malvika Gupta <malvika.gupta@arm.com>
Acked-by: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2019-07-19 12:22:23 +01:00
Harry van Haaren
92c7c870d6 dpif-netdev: Split out generic lookup function
This commit splits the generic hash-lookup-verify function to
its own file, for cleaner seperation between optimized versions.

Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
Tested-by: Malvika Gupta <malvika.gupta@arm.com>
Acked-by: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2019-07-19 12:22:01 +01:00
Harry van Haaren
f5ace7cd8a dpif-netdev: Move dpcls lookup structures to .h
This commit moves some data-structures to be available
in the dpif-netdev-private.h header. This allows specific
implementations of the subtable lookup function to include
just that header file, and not require that the code exists
in dpif-netdev.c

Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
Tested-by: Malvika Gupta <malvika.gupta@arm.com>
Acked-by: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2019-07-19 12:21:37 +01:00
Harry van Haaren
aadede3dda dpif-netdev: Implement function pointers/subtable
This allows plugging-in of different subtable hash-lookup-verify
routines, and allows special casing of those functions based on
known context (eg: # of bits set) of the specific subtable.

Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
Tested-by: Malvika Gupta <malvika.gupta@arm.com>
Acked-by: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2019-07-19 12:21:16 +01:00
Ilya Maximets
ec61d4707b dpif-netdev: Clarify PMD reloading scheme.
It became more complicated, hence needs to be documented.

Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2019-07-10 13:15:19 +01:00
David Marchand
68a0625b78 dpif-netdev: Catch reloads faster.
Looking at the reload flag only every 1024 loops can be a long time
under load, since we might be handling 32 packets per rxq, per iteration,
which means up to poll_cnt * 32 * 1024 packets.
Look at the flag every loop, no major performance impact seen.

Signed-off-by: David Marchand <david.marchand@redhat.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2019-07-10 11:51:09 +01:00
David Marchand
e2cafa8692 dpif-netdev: Only reload static tx qid when needed.
pmd->static_tx_qid is allocated under a mutex by the different pmd
threads.
Unconditionally reallocating it will make those pmd threads sleep
when contention occurs.
During "normal" reloads like for rebalancing queues between pmd threads,
this can make pmd threads waste time on this.
Reallocating the tx qid is only needed when removing other pmd threads
as it is the only situation when the qid pool can become uncontiguous.

Add a flag to instruct the pmd to reload tx qid for this case which is
Step 1 in current code.

Signed-off-by: David Marchand <david.marchand@redhat.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2019-07-10 11:50:21 +01:00
David Marchand
6d9fead107 dpif-netdev: Do not sleep when swapping queues.
When swapping queues from a pmd thread to another (q0 polled by pmd0/q1
polled by pmd1 -> q1 polled by pmd0/q0 polled by pmd1), the current
"Step 5" puts both pmds to sleep waiting for the control thread to wake
them up later.

Prefer to make them spin in such a case to avoid sleeping an
undeterministic amount of time.

Signed-off-by: David Marchand <david.marchand@redhat.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2019-07-10 11:47:46 +01:00
David Marchand
8f077b31e9 dpif-netdev: Trigger parallel pmd reloads.
pmd reloads are currently serialised in each steps calling
reload_affected_pmds.
Any pmd processing packets, waiting on a mutex etc... will make other
pmd threads wait for a delay that can be undeterministic when syscalls
adds up.

Switch to a little busy loop on the control thread using the existing
per-pmd reload boolean.

The memory order on this atomic is rel-acq to have an explicit
synchronisation between the pmd threads and the control thread.

Signed-off-by: David Marchand <david.marchand@redhat.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2019-07-10 11:46:31 +01:00
David Marchand
299c8d611e dpif-netdev: Convert exit latch to flag.
No need for a latch here since we don't have to wait.
A simple boolean flag is enough.

The memory order on the reload flag is changed to rel-acq ordering to
serve as a synchronisation point between the pmd threads and the control
thread that asks for termination.

Fixes: e4cfed38b159 ("dpif-netdev: Add poll-mode-device thread.")
Signed-off-by: David Marchand <david.marchand@redhat.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Ian Stokes <ian.stokes@intel.com>
Acked-by: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2019-07-10 11:45:55 +01:00
Ilya Maximets
f87c135706 vswitchd: Always cleanup userspace datapath.
'netdev' datapath is implemented within ovs-vswitchd process and can
not exist without it, so it should be gracefully terminated with a
full cleanup of resources upon ovs-vswitchd exit.

This change forces dpif cleanup for 'netdev' datapath regardless of
passing '--cleanup' to 'ovs-appctl exit'. Such solution allowes to
not pass this additional option everytime for userspace datapath
installations and also allowes to not terminate system datapath in
setups where both datapaths runs at the same time.

The main part is that dpif_port_del() will lead to netdev_close()
and subsequent netdev_class->destroy(dev) which will stop HW NICs
and free their resources. For vhost-user interfaces it will invoke
vhost driver unregistering with a properly closed vhost-user
connection. For upcoming AF_XDP netdev this will allow to gracefully
destroy xdp sockets and unload xdp programs from linux interfaces.
Another important thing is that port deletion will also trigger
flushing of flows offloaded to HW NICs.

Exception made for 'internal' ports that could have user ip/route
configuration. These ports will not be removed without '--cleanup'.

This change fixes OVS disappearing from the DPDK point of view
(keeping HW NICs improperly configured, sudden closing of vhost-user
connections) and will help with linux devices clearing with upcoming
AF_XDP netdev support.

Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Tested-by: William Tu <u9012063@gmail.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Acked-by: Ben Pfaff <blp@ovn.org>
2019-07-02 12:24:47 +03:00
David Marchand
35c91567c8 dpif-netdev: Only poll enabled vhost queues.
We currently poll all available queues based on the max queue count
exchanged with the vhost peer and rely on the vhost library in DPDK to
check the vring status beneath.
This can lead to some overhead when we have a lot of unused queues.

To enhance the situation, we can skip the disabled queues.
On rxq notifications, we make use of the netdev's change_seq number so
that the pmd thread main loop can cache the queue state periodically.

$ ovs-appctl dpif-netdev/pmd-rxq-show
pmd thread numa_id 0 core_id 1:
  isolated : true
  port: dpdk0             queue-id:  0 (enabled)   pmd usage:  0 %
pmd thread numa_id 0 core_id 2:
  isolated : true
  port: vhost1            queue-id:  0 (enabled)   pmd usage:  0 %
  port: vhost3            queue-id:  0 (enabled)   pmd usage:  0 %
pmd thread numa_id 0 core_id 15:
  isolated : true
  port: dpdk1             queue-id:  0 (enabled)   pmd usage:  0 %
pmd thread numa_id 0 core_id 16:
  isolated : true
  port: vhost0            queue-id:  0 (enabled)   pmd usage:  0 %
  port: vhost2            queue-id:  0 (enabled)   pmd usage:  0 %

$ while true; do
  ovs-appctl dpif-netdev/pmd-rxq-show |awk '
  /port: / {
    tot++;
    if ($5 == "(enabled)") {
      en++;
    }
  }
  END {
    print "total: " tot ", enabled: " en
  }'
  sleep 1
done

total: 6, enabled: 2
total: 6, enabled: 2
...

 # Started vm, virtio devices are bound to kernel driver which enables
 # F_MQ + all queue pairs
total: 6, enabled: 2
total: 66, enabled: 66
...

 # Unbound vhost0 and vhost1 from the kernel driver
total: 66, enabled: 66
total: 66, enabled: 34
...

 # Configured kernel bound devices to use only 1 queue pair
total: 66, enabled: 34
total: 66, enabled: 19
total: 66, enabled: 4
...

 # While rebooting the vm
total: 66, enabled: 4
total: 66, enabled: 2
...
total: 66, enabled: 66
...

 # After shutting down the vm
total: 66, enabled: 66
total: 66, enabled: 2

Signed-off-by: David Marchand <david.marchand@redhat.com>
Acked-by: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2019-06-26 18:43:39 +01:00
Ilya Maximets
b6cabb8f8f netdev: Split up netdev offloading to separate module.
New module 'netdev-offload' created to manage different flow API
implementations. All the generic and provider independent code moved
there from the 'netdev' module.

Flow API providers further encapsulated.

The only function that was changed is 'netdev_any_oor'.
Now it uses offloading related hmap instead of common 'netdev_shash'.

Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Acked-by: Ben Pfaff <blp@ovn.org>
Acked-by: Roi Dayan <roid@mellanox.com>
2019-06-11 09:39:36 +03:00
Ilya Maximets
0da667e345 dpif-netdev: Forbid vport offloading attempts.
'netdev_flow_put()' for vports could eventually succeed for
userspace datapath in case there is a kernel datapath with
similar vport at the same time. The root cause is that vports
like 'vxlan' uses same 'vxlan_sys_<port>' system interfaces
for flow offloading and there is no way to distinguish system
and userspace vports using only 'netdev' structure.

Let's forbid vport offloading from userspace datapath to avoid
installing userspace flows to unrelated system devices.

Future dynamic flow API management will allow to enable vport
offloading back using more flexible checks.

Fixes: 241bad15d99a ("dpif-netdev: associate flow with a mark id")
Reported-by: Ophir Munk <ophirmu@mellanox.com>
Acked-By: Roni Bar Yanai <roniba@mellanox.com>
Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
2019-06-06 17:23:58 +03:00
Ilya Maximets
0a5cba6591 dpif-netdev: Fix flow mark leak on port lookup failure.
Flow mark should be properly freed in all error cases.

Fixes: 241bad15d99a ("dpif-netdev: associate flow with a mark id")
Acked-By: Roni Bar Yanai <roniba@mellanox.com>
Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
2019-06-06 17:23:58 +03:00
Ilya Maximets
eef8538081 dpif-netdev: Fix unsafe access to pmd polling lists.
All accesses to 'pmd->poll_list' should be guarded by
'pmd->port_mutex'. Additionally fixed inappropriate usage
of 'HMAP_FOR_EACH_SAFE' (hmap doesn't change in a loop)
and dropped not needed local variable 'proc_cycles'.

CC: Nitin Katiyar <nitin.katiyar@ericsson.com>
Fixes: 5bf84282482a ("Adding support for PMD auto load balancing")
Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Acked-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: Aaron Conole <aconole@redhat.com>
2019-05-29 14:31:54 +03:00
Darrell Ball
57593fd243 conntrack: Stop exporting internal datastructures.
Stop the exporting of the main internal conntrack datastructure.

Signed-off-by: Darrell Ball <dlu998@gmail.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
2019-05-03 09:46:22 -07:00
Zhantao Fu
0fcf0776c7 Double postponing to free subtables.
Subtable destruction should be double postponed because readers could still obtain old values while iterating over pvector implementation before its new version published.

Signed-off-by: Zhantao Fu <fuzhantao@huawei.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
2019-04-23 09:08:51 -07:00
Numan Siddique
5b34f8fc3b Add a new OVS action check_pkt_larger
This patch adds a new action 'check_pkt_larger' which checks if the
packet is larger than the given size and stores the result in the
destination register.

Usage: check_pkt_larger(len)->REGISTER
Eg. match=...,actions=check_pkt_larger(1442)->NXM_NX_REG0[0],next;

This patch makes use of the new datapath action - 'check_pkt_len'
which was recently added in the commit [1].
At the start of ovs-vswitchd, datapath is probed for this action.
If the datapath action is present, then 'check_pkt_larger'
makes use of this datapath action.

Datapath action 'check_pkt_len' takes these nlattrs
      * OVS_CHECK_PKT_LEN_ATTR_PKT_LEN - 'pkt_len' to check for
      * OVS_CHECK_PKT_LEN_ATTR_ACTIONS_IF_GREATER (optional) - Nested actions
        to apply if the packet length is greater than the specified 'pkt_len'
      * OVS_CHECK_PKT_LEN_ATTR_ACTIONS_IF_LESS_EQUAL (optional) - Nested
        actions to apply if the packet length is lesser or equal to the
        specified 'pkt_len'.

Let's say we have these flows added to an OVS bridge br-int

table=0, priority=100 in_port=1,ip,actions=check_pkt_larger:100->NXM_NX_REG0[0],resubmit(,1)
table=1, priority=200,in_port=1,ip,reg0=0x1/0x1 actions=output:3
table=1, priority=100,in_port=1,ip,actions=output:4

Then the action 'check_pkt_larger' will be translated as
  - check_pkt_len(size=100,gt(3),le(4))

datapath will check the packet length and if the packet length is greater than 100,
it will output to port 3, else it will output to port 4.

In case, datapath doesn't support 'check_pkt_len' action, the OVS action
'check_pkt_larger' sets SLOW_ACTION so that datapath flow is not added.

This OVS action is intended to be used by OVN to check the packet length
and generate an ICMP packet with type 3, code 4 and next hop mtu
in the logical router pipeline if the MTU of the physical interface
is lesser than the packet length. More information can be found here [2]

[1] - 4d5ec89fc8
[2] - https://mail.openvswitch.org/pipermail/ovs-discuss/2018-July/047039.html

Reported-at:
https://mail.openvswitch.org/pipermail/ovs-discuss/2018-July/047039.html
Suggested-by: Ben Pfaff <blp@ovn.org>
Signed-off-by: Numan Siddique <nusiddiq@redhat.com>
CC: Ben Pfaff <blp@ovn.org>
CC: Gregory Rose <gvrose8192@gmail.com>
Acked-by: Mark Michelson <mmichels@redhat.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
2019-04-22 12:56:50 -07:00
William Tu
42697ca775 dpif-netdev: fix meter at high packet rate.
When testing packet rate around 1Mpps with meter enabled, the frequency
of hitting meter action becomes much higher, around 30us each time.
As a result, the meter's calculation of 'uint32_t delta_t' becomes
always 0 and meter action has no effect.  This is due to the previous
commit 05f9e707e194 divides the delta by 1000, in order to convert to
msec granularity.  The patch fixes it updating the time when across
millisecond boundary.

Fixes: 05f9e707e194 ("dpif-netdev: Use microsecond granularity.")
Acked-by: Yi-Hung Wei <yihung.wei@gmail.com>
Acked-by: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: William Tu <u9012063@gmail.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
2019-04-22 09:51:28 -07:00
Ilya Maximets
af741ca346 dpif-netdev: Update comment about flow installation race.
Userspace datapath uses per-PMD flow tables/classifiers for a long
time. However, it was decided to keep this race window to not block
revalidators. Comment should be updated to reflect the current state.

Fixes: 1c1e46ed8457 ("dpif-netdev: Add per-pmd flow-table/classifier.")
Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Reviewed-by: Greg Rose <gvrose8192@gmail.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2019-04-18 08:51:46 +01:00
Ilya Maximets
b137383e86 dpif-netdev: Fix double parsing of packets when EMC disabled.
This partially reverts commit bde94613e6276d48a6e0be7a592ebcf9836b4aaf.

Commit bde94613e627 was aimed to slightly ( < 1%) increase performance
in the case where EMC disabled, but it avoids RSS hash calculation and
OVS has to calculate it while executing OVS_ACTION_ATTR_HASH in order
to handle balanced-tcp bonding. At the time of executing that action
there is no parsed flow, and OVS parses the packet for the second time
to calculate the hash. This happens for all packets received from the
virtual interfaces because they have no HW RSS.

Here is the example of 'perf' output for VM --> (bonded PHY) traffic:

  Samples: 401K of event 'cycles', Event count (approx.): 50964771478
  Overhead  Shared Object       Symbol
    27.50%  ovs-vswitchd        [.] dpcls_lookup.370382
    16.30%  ovs-vswitchd        [.] rte_vhost_dequeue_burst.9267
    14.95%  ovs-vswitchd        [.] miniflow_extract
     7.22%  ovs-vswitchd        [.] flow_extract
     7.10%  ovs-vswitchd        [.] dp_netdev_input__.371002.4826
     4.01%  ovs-vswitchd        [.] fast_path_processing.370987.4893

We can see that packet parsed twice. First time by 'miniflow_extract'
right after receiving and the second time by 'flow_extract' while
executing actions.

In this particular case calculating RSS on receive saves > 7% of the
total CPU processing time. It varies from ~7 to ~10 % depending on
scenario/traffic types.

It's better to calculate hash each time because performance
improvements of avoiding are negligible in compare with performance
drop in case of sending packets to bonded interface.

Another solution could be to pass the parsed flow explicitly through
the datapath, but this will require big code changes and will have
additional overhead for metadata updating on packet changes.

Also, this change should have small impact since SMC works well in most
cases and will be enabled/recommended by default in the future.

CC: Antonio Fischetti <antonio.fischetti@intel.com>
Fixes: bde94613e627 ("dpif-netdev: Avoid reading RSS hash when EMC is disabled.")
Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2019-04-18 08:45:16 +01:00
Ilya Maximets
5d1765d30e dpif-netdev: Reduce log level for not found mark id.
It's a normal case for 'find' function, especially because this
happens for every first packet of flow that was not offloaded yet.
Should not warn about this. Dropped to DBG to avoid log trashing in
case of big number of new flows.

CC: Yuanhan Liu <yliu@fridaylinux.org>
Fixes: 241bad15d99a ("dpif-netdev: associate flow with a mark id")
Acked-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2019-02-27 22:19:03 +00:00
Ben Pfaff
d40533fc82 odp-util: Improve log messages and error reporting for Netlink parsing.
As a side effect, this also reduces a lot of log messages' severities from
ERR to WARN.  They just didn't seem like messages that in general reported
anything that would prevent functioning.

Signed-off-by: Ben Pfaff <blp@ovn.org>
2019-02-25 15:38:25 -08:00
Darrell Ball
4ea96698f6 Userspace datapath: Add fragmentation handling.
Fragmentation handling is added for supporting conntrack.
Both v4 and v6 are supported.

After discussion with several people, I decided to not store
configuration state in the database to be more consistent with
the kernel in future, similarity with other conntrack configuration
which will not be in the database as well and overall simplicity.
Accordingly, fragmentation handling is enabled by default.

This patch enables fragmentation tests for the userspace datapath.

Signed-off-by: Darrell Ball <dlu998@gmail.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
2019-02-14 14:18:56 -08:00
Darrell Ball
9f17f104fe dp-packet: Add 'do_not_steal' packet batch flag.
This is needed in a subsequent patch and may otherwise be useful.

Signed-off-by: Darrell Ball <dlu998@gmail.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
2019-02-14 11:39:22 -08:00
Ilya Maximets
216abd2808 dpif-netdev: Add thread safety annotation to sorted_poll_list.
'sorted_poll_list()' uses the 'pmd->poll_list' that should be
guarded by 'pmd->port_mutex'.

Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Acked-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2019-02-12 13:23:09 +00:00
Ilya Maximets
2fbadeb665 dpif-netdev: Per-port configurable EMC.
Conditional EMC insert helps a lot in scenarios with high numbers
of parallel flows, but in current implementation this option affects
all the threads and ports at once. There are scenarios where we have
different number of flows on different ports. For example, if one
of the VMs encapsulates traffic using additional headers, it will
receive large number of flows but only few flows will come out of
this VM. In this scenario it's much faster to use EMC instead of
classifier for traffic from the VM, but it's better to disable EMC
for the traffic which flows to VM.

To handle above issue introduced 'emc-enable' configurable to
enable/disable EMC on a per-port basis. Ex.:

  ovs-vsctl set interface dpdk0 other_config:emc-enable=false

EMC probability kept as is and it works for all the ports with
'emc-enable=true'.

Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Acked-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2019-01-18 11:54:42 +00:00
Nitin Katiyar
5bf8428248 Adding support for PMD auto load balancing
Port rx queues that have not been statically assigned to PMDs are currently
assigned based on periodically sampled load measurements.
The assignment is performed at specific instances – port addition, port
deletion, upon reassignment request via CLI etc.

Due to change in traffic pattern over time it can cause uneven load among
the PMDs and thus resulting in lower overall throughout.

This patch enables the support of auto load balancing of PMDs based on
measured load of RX queues. Each PMD measures the processing load for each
of its associated queues every 10 seconds. If the aggregated PMD load reaches
95% for 6 consecutive intervals then PMD considers itself to be overloaded.

If any PMD is overloaded, a dry-run of the PMD assignment algorithm is
performed by OVS main thread. The dry-run does NOT change the existing
queue to PMD assignments.

If the resultant mapping of dry-run indicates an improved distribution
of the load then the actual reassignment will be performed.

The automatic rebalancing will be disabled by default and has to be
enabled via configuration option. The interval (in minutes) between
two consecutive rebalancing can also be configured via CLI, default
is 1 min.

Following example commands can be used to set the auto-lb params:
ovs-vsctl set open_vswitch . other_config:pmd-auto-lb="true"
ovs-vsctl set open_vswitch . other_config:pmd-auto-lb-rebalance-intvl="5"

Co-authored-by: Rohith Basavaraja <rohith.basavaraja@gmail.com>
Co-authored-by: Venkatesan Pradeep <venkatesan.pradeep@ericsson.com>
Signed-off-by: Rohith Basavaraja <rohith.basavaraja@gmail.com>
Signed-off-by: Venkatesan Pradeep <venkatesan.pradeep@ericsson.com>
Signed-off-by: Nitin Katiyar <nitin.katiyar@ericsson.com>
Acked-by: Kevin Traynor <ktraynor@redhat.com>
Tested-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2019-01-16 10:53:17 +00:00
Ilya Maximets
6c95dbf96b dpif-netdev: End the quiescent state for flow offloading thread.
Flow offloading thread uses concurrent hash maps which are
based on rcu protected variables. It must use them while in
active state. Working in a quiescent state could cause
segmentation faults because of possible cmap internal
structure changes.

Fixes: 02bb2824e51d ("dpif-netdev: do hw flow offload in a thread")
Acked-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2018-11-02 15:17:19 +00:00
Ilya Maximets
5752eae485 dpif-netdev: Fix cmap node use after free on flow disassociation.
Data pointed by cmap node must not be freed while iterating.
ovsrcu_postpone should be used instead.

CC: Finn Christensen <fc@napatech.com>
Fixes: e8a2b5bf92bb ("netdev-dpdk: implement flow offload with rte flow")
Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2018-11-02 15:13:54 +00:00
Sriharsha Basavapatna via dev
57924fc91c revalidator: Rebalance offloaded flows based on the pps rate
This is the third patch in the patch-set to support dynamic rebalancing
of offloaded flows.

The dynamic rebalancing functionality is implemented in this patch. The
ukeys that are not scheduled for deletion are obtained and passed as input
to the rebalancing routine. The rebalancing is done in the context of
revalidation leader thread, after all other revalidator threads are
done with gathering rebalancing data for flows.

For each netdev that is in OOR state, a list of flows - both offloaded
and non-offloaded (pending) - is obtained using the ukeys. For each netdev
that is in OOR state, the flows are grouped and sorted into offloaded and
pending flows.  The offloaded flows are sorted in descending order of
pps-rate, while pending flows are sorted in ascending order of pps-rate.

The rebalancing is done in two phases. In the first phase, we try to
offload all pending flows and if that succeeds, the OOR state on the device
is cleared. If some (or none) of the pending flows could not be offloaded,
then we start replacing an offloaded flow that has a lower pps-rate than
a pending flow, until there are no more pending flows with a higher rate
than an offloaded flow. The flows that are replaced from the device are
added into kernel datapath.

A new OVS configuration parameter "offload-rebalance", is added to ovsdb.
The default value of this is "false". To enable this feature, set the
value of this parameter to "true", which provides packets-per-second
rate based policy to dynamically offload and un-offload flows.

Note: This option can be enabled only when 'hw-offload' policy is enabled.
It also requires 'tc-policy' to be set to 'skip_sw'; otherwise, flow
offload errors (specifically ENOSPC error this feature depends on) reported
by an offloaded device are supressed by TC-Flower kernel module.

Signed-off-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
Co-authored-by: Venkat Duvvuru <venkatkumar.duvvuru@broadcom.com>
Signed-off-by: Venkat Duvvuru <venkatkumar.duvvuru@broadcom.com>
Reviewed-by: Sathya Perla <sathya.perla@broadcom.com>
Reviewed-by: Ben Pfaff <blp@ovn.org>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
2018-10-19 11:27:52 +02:00
Ilya Maximets
35fe9efb2f dpif-netdev: Add vlan to mask for flow_put operation.
Datapath flows in dpif-netdev classifier always has exact match
mask set for vlan. We have to enable it for flow_put operation
too in order to avoid flow modification failure due to
classifier lookup with wrong hash.

Found by OFtest.

CC: Jan Scheurich <jan.scheurich@ericsson.com>
Fixes: beb75a40fdc2 ("userspace: Switching of L3 packets in L2 pipeline")
Reported-by: Ben Pfaff <blp@ovn.org>
Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2018-September/352579.html
Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
2018-10-09 10:26:39 -07:00
Kevin Traynor
e77c97b9d6 dpif-netdev: Add round-robin based rxq to pmd assignment.
Prior to OVS 2.9 automatic assignment of Rxqs to PMDs
(i.e. CPUs) was done by round-robin.

That was changed in OVS 2.9 to ordering the Rxqs based on
their measured processing cycles. This was to assign the
busiest Rxqs to different PMDs, improving aggregate
throughput.

For the most part the new scheme should be better, but
there could be situations where a user prefers a simple
round-robin scheme because Rxqs from a single port are
more likely to be spread across multiple PMDs, and/or
traffic is very bursty/unpredictable.

Add 'pmd-rxq-assign' config to allow a user to select
round-robin based assignment.

Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2018-09-14 11:45:05 +01:00
Gavi Teitz
a692410af0 dpctl: Expand the flow dump type filter
Added new types to the flow dump filter, and allowed multiple filter
types to be passed at once, as a comma separated list. The new types
added are:
 * tc - specifies flows handled by the tc dp
 * non-offloaded - specifies flows not offloaded to the HW
 * all - specifies flows of all types

The type list is now fully parsed by the dpctl, and a new struct was
added to dpif which enables dpctl to define which types of dumps to
provide, rather than passing the type string and having dpif parse it.

Signed-off-by: Gavi Teitz <gavi@mellanox.com>
Acked-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
2018-09-13 16:56:25 +02:00
Gavi Teitz
0d6b401cf6 dpif-netdev: Initialize dpif_flow attrs
In a previous commit, the dpif_flow struct was expanded, with the
'offloaded' field being moved into a new struct which also includes a
field for the dp layer the flow is handled on. The initialization of
these fields was only done in dpif-netlink.

This completes that commit, by initializing the fields in dpif-netdev
as well. As the 'offloaded' field was previously ignored by
dpif-netdev, the attrs are initialized to the default values of
'false' for the offloaded state, and 'ovs' for the dp layer.

Fixes: d63ca5329ff9 ("dpctl: Properly reflect a rule's offloaded to HW state")
Signed-off-by: Gavi Teitz <gavi@mellanox.com>
Acked-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
2018-09-13 16:56:25 +02:00
Justin Pettit
866bc7567a dpif-netdev: Prevent unsafe access when retrieving meter stats.
dpif_netdev_meter_get() retrieved a pointer to a meter entry without
holding a lock.  It's possible that another thread could have deleted
that entry between retrieving the pointer and dereferencing the pointer.
This makes the function hold the lock the entire time the meter entry is
needed.

Found by inspection.

Signed-off-by: Justin Pettit <jpettit@ovn.org>
Acked-by: Flavio Leitner <fbl@sysclose.org>
2018-09-04 13:36:37 -07:00
Justin Pettit
d0db81eac8 dpif-netdev: Don't check if xcalloc() failed when creating meter.
xcalloc() can't return null.

Signed-off-by: Justin Pettit <jpettit@ovn.org>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Acked-by: Ben Pfaff <blp@ovn.org>
2018-09-04 13:36:37 -07:00
Vishal Deep Ajmera
9b4f08cdca dpif-netdev: Avoid reordering of packets in a batch with same megaflow
OVS reads packets in batches from a given port and packets in the
batch are subjected to potentially 3 levels of lookups to identify
the datapath megaflow entry (or flow) associated with the packet.
Each megaflow entry has a dedicated buffer in which packets that match
the flow classification criteria are collected. This buffer helps OVS
perform batch processing for all packets associated with a given flow.

Each packet in the received batch is first subjected to lookup in the
Exact Match Cache (EMC). Each EMC entry will point to a flow. If the
EMC lookup is successful, the packet is moved from the rx batch to the
per-flow buffer.

Packets that did not match any EMC entry are rearranged in the rx batch
at the beginning and are now subjected to a lookup in the megaflow cache.
Packets that match a megaflow cache entry are *appended* to the per-flow
buffer.

Packets that do not match any megaflow entry are subjected to slow-path
processing through the upcall mechanism. This cannot change the order of
packets as by definition upcall processing is only done for packets
without matching megaflow entry.

The EMC entry match fields encompass all potentially significant header
fields, typically more than specified in the associated flow's match
criteria. Hence, multiple EMC entries can point to the same flow. Given
that per-flow batching happens at each lookup stage, packets belonging
to the same megaflow can get re-ordered because some packets match EMC
entries while others do not.

The following example can illustrate the issue better. Consider
following batch of packets (labelled P1 to P8) associated with a single
TCP connection and associated with a single flow. Let us assume that
packets with just the ACK bit set in TCP flags have been received in a
prior batch also and a corresponding EMC entry exists.

1. P1 (TCP Flag: ACK)
2. P2 (TCP Flag: ACK)
3. P3 (TCP Flag: ACK)
4. P4 (TCP Flag: ACK, PSH)
5. P5 (TCP Flag: ACK)
6. P6 (TCP Flag: ACK)
7. P7 (TCP Flag: ACK)
8. P8 (TCP Flag: ACK)

The megaflow classification criteria does not include TCP flags while
the EMC match criteria does. Thus, all packets other than P4 match
the existing EMC entry and are moved to the per-flow packet batch.
Subsequently, packet P4 is moved to the same per-flow packet batch as
a result of the megaflow lookup. Though the packets have all been
correctly classified as being associated with the same flow, the
packet order has not been preserved because of the per-flow batching
performed during the EMC lookup stage. This packet re-ordering has
performance implications for TCP applications.

This patch preserves the packet ordering by performing the per-flow
batching after both the EMC and megaflow lookups are complete. As an
optimization, packets are flow-batched in emc processing till any
packet in the batch has an EMC miss.

A new flow map is maintained to keep the original order of packet
along with flow information. Post fastpath processing, packets from
flow map are *appended* to per-flow buffer.

Signed-off-by: Vishal Deep Ajmera <vishal.deep.ajmera@ericsson.com>
Co-authored-by: Venkatesan Pradeep <venkatesan.pradeep@ericsson.com>
Signed-off-by: Venkatesan Pradeep <venkatesan.pradeep@ericsson.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2018-08-27 17:48:23 +01:00
Yi-Hung Wei
cd015a11c2 dpif: Support conntrack zone limit.
This patch defines the dpif interface to support conntrack
per zone limit.  Basically, OVS users can use this interface
to set, delete, and get the conntrack per zone limit for various
dpif interfaces.  The following patch will make use of the proposed
interface to implement the feature.

Signed-off-by: Yi-Hung Wei <yihung.wei@gmail.com>
Signed-off-by: Justin Pettit <jpettit@ovn.org>
2018-08-17 09:30:55 -07:00
Justin Pettit
8101f03fcd dpif: Don't pass in '*meter_id' to meter_set commands.
The original intent of the API appears to be that the underlying DPIF
implementaion would choose a local meter id.  However, neither of the
existing datapath meter implementations (userspace or Linux) implemented
that; they expected a valid meter id to be passed in, otherwise they
returned an error.  This commit follows the existing implementations and
makes the API somewhat cleaner.

Signed-off-by: Justin Pettit <jpettit@ovn.org>
Acked-by: Ben Pfaff <blp@ovn.org>
2018-08-16 10:20:52 -07:00
Ilya Maximets
18e08953cf dpif-netdev: Fix zero length keys insertion to EMC.
'key.len' should be calculated before inserting to EMC, otherwise
resulting entry will match with any packet with the same hash.

CC: Yipeng Wang <yipeng1.wang@intel.com>
Fixes: 60d8ccae135f ("dpif-netdev: Add SMC cache after EMC cache")
Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Acked-by: Yipeng Wang <yipeng1.wang@intel.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2018-08-08 22:06:21 +01:00
Justin Pettit
6508c845ad dpif: Move common meter checks into the dpif layer.
Another dpif provider will soon add support for meters, so move
some of the common sanity checks up into the dpif layer so that each
provider doesn't need to re-implement them.

Signed-off-by: Justin Pettit <jpettit@ovn.org>
Acked-by: Ben Pfaff <blp@ovn.org>
2018-07-30 13:00:49 -07:00
Justin Pettit
f603d7d262 Revert "dpif-netdev: Use compatible function type to fix broken build."
Commit ab15e70eb587 ("dpctl: Expand the flow dump type filter") will be
reverted, which this patch fixed, so it needs to be reverted as well.

This reverts commit b10ac772218afd4f296db866f6b80258e1d1ca8a.

CC: Gavi Teitz <gavi@mellanox.com>
CC: Simon Horman <simon.horman@netronome.com>
CC: Roi Dayan <roid@mellanox.com>
CC: Aaron Conole <aconole@redhat.com>
Signed-off-by: Justin Pettit <jpettit@ovn.org>
Acked-by: Ben Pfaff <blp@ovn.org>
2018-07-25 14:17:33 -07:00
Aaron Conole
b10ac77221 dpif-netdev: Use compatible function type to fix broken build.
The dpif_provder flow_dump_create function signature was changed, but
the netdev dpif was not updated along with it.  This generated a build
error with the following warnings:

libtool: compile:  gcc -std=gnu99 -DHAVE_CONFIG_H -I. -I ./include -I ./include -I ./lib -I ./lib -Wstrict-prototypes -Wall -Wextra -Wno-sign-compare -Wpointer-arith -Wformat -Wformat-security -Wswitch-enum -Wunused-parameter -Wbad-function-cast -Wcast-align -Wstrict-prototypes -Wold-style-definition -Wmissing-prototypes -Wmissing-field-initializers -fno-strict-aliasing -Wshadow -Wno-null-pointer-arithmetic -Werror -Werror -g -O2 -MT lib/dpif-netdev.lo -MD -MP -MF lib/.deps/dpif-netdev.Tpo -c lib/dpif-netdev.c -o lib/dpif-netdev.o
lib/dpif-netdev.c:6812:5: error: initialization from incompatible pointer type [-Werror]
     dpif_netdev_flow_dump_create,
     ^
lib/dpif-netdev.c:6812:5: error: (near initialization for 'dpif_netdev_class.flow_dump_create') [-Werror]

Fixes: ab15e70eb587 ("dpctl: Expand the flow dump type filter")
Cc: Gavi Teitz <gavi@mellanox.com>
Cc: Roi Dayan <roid@mellanox.com>
Cc: Simon Horman <simon.horman@netronome.com>
Signed-off-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
2018-07-25 11:35:23 -07:00
Yipeng Wang
60d8ccae13 dpif-netdev: Add SMC cache after EMC cache
This patch adds a signature match cache (SMC) after exact match
cache (EMC). The difference between SMC and EMC is SMC only stores
a signature of a flow thus it is much more memory efficient. With
same memory space, EMC can store 8k flows while SMC can store 1M
flows. It is generally beneficial to turn on SMC but turn off EMC
when traffic flow count is much larger than EMC size.

SMC cache will map a signature to an dp_netdev_flow index in
flow_table. Thus, we add two new APIs in cmap for lookup key by
index and lookup index by key.

For now, SMC is an experimental feature that it is turned off by
default. One can turn it on using ovsdb options.

Signed-off-by: Yipeng Wang <yipeng1.wang@intel.com>
Co-authored-by: Jan Scheurich <jan.scheurich@ericsson.com>
Signed-off-by: Jan Scheurich <jan.scheurich@ericsson.com>
Acked-by: Billy O'Mahony <billy.o.mahony@intel.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2018-07-24 17:01:03 +01:00
Justin Pettit
425a7b9eaf dpif-netdev: Fix a couple of comments for dp_netdev_run_meter().
Signed-off-by: Justin Pettit <jpettit@ovn.org>
Acked-by: Ben Pfaff <blp@ovn.org>
2018-07-06 14:23:27 -07:00
Yuanhan Liu
02bb2824e5 dpif-netdev: do hw flow offload in a thread
Currently, the major trigger for hw flow offload is at upcall handling,
which is actually in the datapath. Moreover, the hw offload installation
and modification is not that lightweight. Meaning, if there are so many
flows being added or modified frequently, it could stall the datapath,
which could result to packet loss.

To diminish that, all those flow operations will be recorded and appended
to a list. A thread is then introduced to process this list (to do the
real flow offloading put/del operations). This could leave the datapath
as lightweight as possible.

Signed-off-by: Yuanhan Liu <yliu@fridaylinux.org>
Co-authored-by: Shahaf Shuler <shahafs@mellanox.com>
Signed-off-by: Shahaf Shuler <shahafs@mellanox.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2018-07-06 10:32:52 +01:00
Yuanhan Liu
aab96ec4d8 dpif-netdev: retrieve flow directly from the flow mark
So that we could skip some very costly CPU operations, including but
not limiting to miniflow_extract, emc lookup, dpcls lookup, etc. Thus,
performance could be greatly improved.

A PHY-PHY forwarding with 1000 mega flows (udp,tp_src=1000-1999) and
1 million streams (tp_src=1000-1999, tp_dst=2000-2999) show more that
260% performance boost.

Note that though the heavy miniflow_extract is skipped, we still have
to do per packet checking, due to we have to check the tcp_flags.

Co-authored-by: Finn Christensen <fc@napatech.com>
Signed-off-by: Yuanhan Liu <yliu@fridaylinux.org>
Signed-off-by: Finn Christensen <fc@napatech.com>
Co-authored-by: Shahaf Shuler <shahafs@mellanox.com>
Signed-off-by: Shahaf Shuler <shahafs@mellanox.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2018-07-06 10:32:52 +01:00