bridge_delete_or_reconfigure() deletes every interface that's not dumped
by OFPROTO_PORT_FOR_EACH(). ofproto-dpif.c:port_dump_next(), used by
OFPROTO_PORT_FOR_EACH, checks if the ofport is in the datapath by
calling port_query_by_name(). If port_query_by_name() returns an error,
the dump is interrupted. If port_query_by_name() returns ENODEV, the
device doesn't exist and the dump can continue.
port_query_by_name() for the userspace datapath returns ENOENT instead
of ENODEV. This is expected by dpif_port_query_by_name(), but it's not
handled correctly by port_dump_next().
dpif-netdev handles reconfiguration errors for an interface by deleting
it from the datapath, so it's possible that a device is missing. When this
happens we must make sure that port_dump_next() continues to dump other
devices, otherwise they will be deleted and the two layers will have an
inconsistent view.
This commit fixes the problem by returning ENODEV from the userspace
datapath if the port doesn't exist, and by documenting this clearly in
the dpif interfaces.
The problem was found while developing new code.
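A minimal sketch of the shape of the fix, using hypothetical helper names
rather than the actual dpif-netdev code: the userspace datapath's port
lookup reports a missing port as ENODEV, as now documented in the dpif
interface, so port_dump_next() can treat it as non-fatal:

    #include <errno.h>
    #include <stdbool.h>

    /* Stub standing in for the datapath's internal port lookup. */
    static bool
    port_exists_sketch(const char *devname)
    {
        (void) devname;
        return false;
    }

    static int
    port_query_by_name_sketch(const char *devname)
    {
        if (!port_exists_sketch(devname)) {
            /* Report ENODEV, as documented in the dpif interface, rather
             * than ENOENT, so port_dump_next() keeps dumping the other
             * ports instead of aborting the dump. */
            return ENODEV;
        }
        return 0;
    }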
Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com>
Acked-by: Ben Pfaff <blp@ovn.org>
When a datapath of type "netdev" processes packets through the
userspace action, it may cause a segmentation fault. In
dp_execute_userspace_action(), we pass NULL as the "wc" argument
to dp_netdev_upcall(). In the dp_netdev_upcall() call tree,
the "wc" will be used. For example, dp_netdev_upcall() uses
&wc->masks for debugging, and flow_wildcards_init_for_packet()
uses the "wc" if megaflows are disabled, as described in
more detail below.
Segmentation fault in flow_wildcards_init_for_packet:
#0 0x0000000000468fe8 flow_wildcards_init_for_packet lib/flow.c:1275
#1 0x0000000000436c0b upcall_cb ofproto/ofproto-dpif-upcall.c:1231
#2 0x000000000045bd96 dp_netdev_upcall lib/dpif-netdev.c:3857
#3 0x0000000000461bf3 dp_execute_userspace_action lib/dpif-netdev.c:4388
#4 dp_execute_cb lib/dpif-netdev.c:4521
#5 0x0000000000486ae2 odp_execute_actions lib/odp-execute.c:538
#6 0x00000000004607f9 dp_netdev_execute_actions lib/dpif-netdev.c:4627
#7 packet_batch_per_flow_execute lib/dpif-netdev.c:3927
#8 dp_netdev_input__ lib/dpif-netdev.c:4229
#9 0x0000000000460ba8 dp_netdev_input lib/dpif-netdev.c:4238
#10 dp_netdev_process_rxq_port lib/dpif-netdev.c:2873
#11 0x000000000046126e dpif_netdev_run lib/dpif-netdev.c:3000
#12 0x000000000042baf5 type_run ofproto/ofproto-dpif.c:504
#13 0x00000000004192bf ofproto_type_run ofproto/ofproto.c:1687
#14 0x0000000000409965 bridge_run__ vswitchd/bridge.c:2875
#15 0x000000000040f145 bridge_run vswitchd/bridge.c:2938
#16 0x00000000004062e5 main vswitchd/ovs-vswitchd.c:111
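To illustrate why the NULL argument is fatal, here is a minimal sketch
with stand-in types (not the actual OVS definitions): code in the style
of flow_wildcards_init_for_packet() writes through 'wc' unconditionally,
so the NULL passed by dp_execute_userspace_action() is dereferenced.
Presumably the fix passes a valid wildcards structure (or guards the
pointer) instead of NULL.

    #include <string.h>

    struct flow_sketch { unsigned int in_port; };                /* stand-in */
    struct flow_wildcards_sketch { struct flow_sketch masks; };  /* stand-in */

    static void
    wildcards_init_for_packet_sketch(struct flow_wildcards_sketch *wc)
    {
        /* Writes through 'wc' with no NULL check: passing NULL from the
         * caller crashes here, as in frame #0 of the backtrace above. */
        memset(&wc->masks, 0, sizeof wc->masks);
        wc->masks.in_port = ~0u;
    }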
Signed-off-by: nickcooper-zhangtonghao <nic@opencloud.tech>
Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com>
This helper is a little tidier than the alternative. Use it treewide.
Signed-off-by: Joe Stringer <joe@ovn.org>
Acked-by: Simon Horman <simon.horman@netronome.com>
Currently, if a user sets up 'pmd-rxq-affinity' to cores on a
different NUMA node, those queues may not be polled, because pmd threads
will not be created there even if these cores are in 'pmd-cpu-mask'.
Fix that by creating threads on all NUMA nodes that rxqs are assigned to.
Fixes: 3eb67853c481 ("dpif-netdev: Introduce pmd-rxq-affinity.")
Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com>
The non-pmd thread static_tx_qid is assumed to be equal to the highest
core ID + 1. The function dp_netdev_del_pmds_on_numa() invalidates
this assumption by re-distributing the static_tx_qids on all pmd and
non-pmd threads of the "other" NUMA node.
There might be a number of unwanted effects due to the non-pmd thread
static_tx_qid being changed. The actual fault, observed in OVS 2.5, was a
crash due to the TX burst queues containing a NULL packet buffer pointer
in the range of valid buffers, presumably caused by a race condition.
In OVS 2.6 the TX burst queues have been removed; nevertheless, the
current behavior is incorrect.
The correction makes dp_netdev_del_pmds_on_numa() keep the non-pmd
thread's static_tx_qid constant by excluding all non-pmd threads from
the deletion and from the re-ordering of the static_tx_qids.
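A sketch with stand-in types and names (not the actual OVS code) of the
guard described above: the non-pmd thread is skipped, so its
static_tx_qid is never re-assigned:

    #define NON_PMD_CORE_ID_SKETCH (~0u)   /* stand-in for NON_PMD_CORE_ID */

    struct pmd_sketch {
        unsigned core_id;
        int numa_id;
        unsigned static_tx_qid;
    };

    static void
    del_pmds_on_numa_sketch(struct pmd_sketch *pmds, int n, int numa_id)
    {
        for (int i = 0; i < n; i++) {
            struct pmd_sketch *pmd = &pmds[i];

            if (pmd->core_id == NON_PMD_CORE_ID_SKETCH) {
                continue;   /* Leave the non-pmd thread and its qid alone. */
            }
            if (pmd->numa_id == numa_id) {
                /* ... delete this pmd thread and later re-assign the
                 * static_tx_qids of the remaining pmd threads only ... */
            }
        }
    }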
Signed-off-by: Patrik Andersson <patrik.r.andersson@ericsson.com>
Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com>
This appears to be the only place where ODPP_NONE is not used but could be.
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Acked-by: Joe Stringer <joe@ovn.org>
This patch increases the number of packets processed in a batch during a
lookup from 16 to 32. Processing batches of 32 packets improves
performance and also allows one of the internal loops to be avoided.
Signed-off-by: Bhanuprakash Bodireddy <bhanuprakash.bodireddy@intel.com>
Co-authored-by: Antonio Fischetti <antonio.fischetti@intel.com>
Signed-off-by: Antonio Fischetti <antonio.fischetti@intel.com>
Acked-by: Jarno Rajahalme <jarno@ovn.org>
Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com>
OVS_ALIGNED_VAR(...) should be at the beginning of a definition, as
the example in include/openvswitch/compiler.h shows.
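A small illustration of the placement rule, using a locally defined macro
with the GCC-style expansion; on MSVC the macro presumably expands to
__declspec(align(...)), which is only accepted before the type, which is
why the position matters:

    #define ALIGNED_VAR(n) __attribute__((aligned(n)))  /* like OVS_ALIGNED_VAR */
    #define CACHE_LINE 64

    struct key { unsigned char data[60]; };

    /* Correct: the alignment macro at the beginning of the definition. */
    ALIGNED_VAR(CACHE_LINE) static struct key keys_ok[32];

    /* Discouraged: the macro at the end of the definition. */
    static struct key keys_wrong[32] ALIGNED_VAR(CACHE_LINE);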
Fixes: 38ee0814978c ("dpif-netdev: Cache align netdev_flow_keys.")
Reported-by: Joe Stringer <joe@ovn.org>
Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com>
Acked-by: Alin Gabriel Serdean <aserdean@cloudbasesolutions.com>
Acked-by: Sairam Venugopal <vsairam@vmware.com>
By reordering the data elements in the dp_netdev_port structure, the pad
bytes can be reduced, thereby saving a cache line.
Before: structure size:136, holes:3, sum padbytes:15, cachelines:3
After: structure size:128, holes:2, sum padbytes:7, cachelines:2
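A generic, self-contained illustration of the technique (the layouts and
numbers are illustrative, not the actual dp_netdev_port members):
grouping the smaller members together removes padding holes on a typical
LP64 ABI:

    #include <stdint.h>
    #include <stdio.h>

    struct before {        /* 8 + (4 + 4 pad) + 8 + (2 + 6 pad) + 8 = 40 B */
        void    *a;
        uint32_t b;
        void    *c;
        uint16_t d;
        void    *e;
    };

    struct after {         /* 8 + 8 + 8 + 4 + 2 + 2 pad = 32 B */
        void    *a;
        void    *c;
        void    *e;
        uint32_t b;
        uint16_t d;
    };

    int
    main(void)
    {
        printf("before=%zu after=%zu\n", sizeof(struct before),
               sizeof(struct after));
        return 0;
    }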
Signed-off-by: Bhanuprakash Bodireddy <bhanuprakash.bodireddy@intel.com>
Co-authored-by: Antonio Fischetti <antonio.fischetti@intel.com>
Signed-off-by: Antonio Fischetti <antonio.fischetti@intel.com>
Acked-by: Daniele Di Proietto <diproiettod@vmware.com>
Aligning the 'keys' array seems to have a positive performance impact.
Signed-off-by: Bhanuprakash Bodireddy <bhanuprakash.bodireddy@intel.com>
Co-authored-by: Antonio Fischetti <antonio.fischetti@intel.com>
Signed-off-by: Antonio Fischetti <antonio.fischetti@intel.com>
Acked-by: Daniele Di Proietto <diproiettod@vmware.com>
Add comments in dp_netdev_input__() to explain the reason behind
clearing the flow batches before packet_batch_execute().
Signed-off-by: Bhanuprakash Bodireddy <bhanuprakash.bodireddy@intel.com>
Co-authored-by: Antonio Fischetti <antonio.fischetti@intel.com>
Signed-off-by: Antonio Fischetti <antonio.fischetti@intel.com>
Acked-by: Daniele Di Proietto <diproiettod@vmware.com>
There's a lot of code in netdev-dpdk which is not at all related to the
netdev interface, mostly the library initialization code.
This commit moves it to a new 'dpdk' module, to simplify 'netdev-dpdk'.
Also a new module 'dpdk-stub' is introduced to implement some functions
when DPDK is not available. This replaces the old 'netdev-nodpdk'
module.
Some redundant includes are removed or reorganized as a consequence.
No functional change.
CC: Aaron Conole <aconole@redhat.com>
Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com>
Acked-by: Aaron Conole <aconole@redhat.com>
Tested-by: Aaron Conole <aconole@redhat.com>
dp_netdev_get_pmd() is allowed to return NULL (even if we call it with
NON_PMD_CORE_ID) for different reasons:
* Since we use RCU to protect pmd threads, it is possible that
ovs_refcount_try_ref_rcu() has failed.
* During reconfiguration we destroy every thread.
This commit makes sure that we always handle the case when
dp_netdev_get_pmd() returns NULL without crashing (the change in
dpif_netdev_run() doesn't fix anything, because everything is happening
in the main thread, but it's better to honor the interface in case we
change our threading model).
This actually fixes a pretty serious crash that happens if
dpif_netdev_execute() is called from a non-pmd thread while
reconfiguration is happening. It can be triggered by enabling BFD
(because it's handled by the monitor thread, which is a non-pmd thread)
on an interface and changing something that requires datapath
reconfiguration (n_rxq, pmd-cpu-mask, mtu).
A testcase that reproduces the race condition is included.
This is a possible backtrace of the segfault:
#0 0x000000000060c7f1 in dp_execute_cb (aux_=0x7f1dd2d2a320, packets_=0x7f1dd2d2a370, a=0x7f1dd2d2a658, may_steal=false) at ../lib/dpif-netdev.c:4357
#1 0x00000000006448b2 in odp_execute_actions (dp=0x7f1dd2d2a320, batch=0x7f1dd2d2a370, steal=false, actions=0x7f1dd2d2a658, actions_len=8, dp_execute_action=0x60c7a5 <dp_execute_cb>) at ../lib/odp-execute.c:538
#2 0x000000000060d00c in dp_netdev_execute_actions (pmd=0x0, packets=0x7f1dd2d2a370, may_steal=false, flow=0x7f1dd2d2ae70, actions=0x7f1dd2d2a658, actions_len=8, now=44965873) at ../lib/dpif-netdev.c:4577
#3 0x000000000060834a in dpif_netdev_execute (dpif=0x2b67b70, execute=0x7f1dd2d2a578) at ../lib/dpif-netdev.c:2624
#4 0x0000000000608441 in dpif_netdev_operate (dpif=0x2b67b70, ops=0x7f1dd2d2a5c8, n_ops=1) at ../lib/dpif-netdev.c:2654
#5 0x0000000000610a30 in dpif_operate (dpif=0x2b67b70, ops=0x7f1dd2d2a5c8, n_ops=1) at ../lib/dpif.c:1268
#6 0x000000000061098c in dpif_execute (dpif=0x2b67b70, execute=0x7f1dd2d2aa50) at ../lib/dpif.c:1233
#7 0x00000000005b9008 in ofproto_dpif_execute_actions__ (ofproto=0x2b69360, version=18446744073709551614, flow=0x7f1dd2d2ae70, rule=0x0, ofpacts=0x7f1dd2d2b100, ofpacts_len=16, indentation=0, depth=0, resubmits=0, packet=0x7f1dd2d2b5c0) at ../ofproto/ofproto-dpif.c:3806
#8 0x00000000005b907a in ofproto_dpif_execute_actions (ofproto=0x2b69360, version=18446744073709551614, flow=0x7f1dd2d2ae70, rule=0x0, ofpacts=0x7f1dd2d2b100, ofpacts_len=16, packet=0x7f1dd2d2b5c0) at ../ofproto/ofproto-dpif.c:3823
#9 0x00000000005dea9b in xlate_send_packet (ofport=0x2b98380, oam=false, packet=0x7f1dd2d2b5c0) at ../ofproto/ofproto-dpif-xlate.c:5792
#10 0x00000000005bab12 in ofproto_dpif_send_packet (ofport=0x2b98380, oam=false, packet=0x7f1dd2d2b5c0) at ../ofproto/ofproto-dpif.c:4628
#11 0x00000000005c3fc8 in monitor_mport_run (mport=0x2b8cd00, packet=0x7f1dd2d2b5c0) at ../ofproto/ofproto-dpif-monitor.c:287
#12 0x00000000005c3d9b in monitor_run () at ../ofproto/ofproto-dpif-monitor.c:227
#13 0x00000000005c3cab in monitor_main (args=0x0) at ../ofproto/ofproto-dpif-monitor.c:189
#14 0x00000000006a183a in ovsthread_wrapper (aux_=0x2b8afd0) at ../lib/ovs-thread.c:342
#15 0x00007f1dd75eb444 in start_thread (arg=0x7f1dd2d2c700) at pthread_create.c:333
#16 0x00007f1dd6e1d20d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
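A sketch of the defensive pattern with stand-in names (the EBUSY error
value is an assumption, not necessarily what the real change returns):

    #include <errno.h>
    #include <stddef.h>

    struct pmd_sketch;   /* opaque stand-in for the pmd thread structure */

    static struct pmd_sketch *
    get_pmd_sketch(unsigned core_id)
    {
        (void) core_id;
        return NULL;     /* stub: stands in for dp_netdev_get_pmd() */
    }

    static int
    execute_sketch(unsigned non_pmd_core_id)
    {
        struct pmd_sketch *pmd = get_pmd_sketch(non_pmd_core_id);

        if (!pmd) {
            /* The datapath is being reconfigured (or the refcount grab
             * failed): bail out instead of crashing. */
            return EBUSY;
        }
        /* ... execute the actions using 'pmd' ... */
        return 0;
    }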
Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com>
Acked-by: Ben Pfaff <blp@ovn.org>
When using tunnel TLVs (at the moment, this means Geneve options), a
controller must first map the class and type onto an appropriate OXM
field so that it can be used in OVS flow operations. This table is
managed using OpenFlow extensions.
The original code that added support for TLVs made the mapping table
global as a simplification. However, this is not really logically
correct as the OpenFlow management commands are operating on a per-bridge
basis. This removes the original limitation to make the table per-bridge.
One nice result of this change is that it is generally clearer whether
the tunnel metadata is in datapath or OpenFlow format. Rather than
allowing ad-hoc format changes and trying to handle both formats in the
tunnel metadata functions, the format is more clearly separated by function.
Datapaths (both kernel and userspace) use datapath format and it is not
changed during the upcall process. At the beginning of action translation,
tunnel metadata is converted to OpenFlow format and flows and wildcards
are translated back at the end of the process.
As an additional benefit, this change improves performance in some flow
setup situations by keeping the tunnel metadata in the original packet
format in more cases. This helps when copies need to be made as the amount
of data touched is only what is present in the packet rather than the
maximum amount of metadata supported.
Co-authored-by: Madhu Challa <challa@noironetworks.com>
Signed-off-by: Madhu Challa <challa@noironetworks.com>
Signed-off-by: Jesse Gross <jesse@kernel.org>
Acked-by: Ben Pfaff <blp@ovn.org>
"internal" netdevs are treated specially in OVS (e.g. for MTU), but
the dummy datapath remaps both "system" and "internal" devices to the
same "dummy" netdev class, so there's no way to discern those in tests.
This commit adds a new "dummy-internal" netdev type, which will be used
by the dummy datapath for internal ports, so that other parts of the
code can understand which ports are internal just by looking at the
netdev object.
The alternative solution, using the original interface type ("internal")
instead of the translated netdev type ("dummy"), is harder to implement,
because in many places only the netdev object is available.
Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com>
Acked-by: Ben Pfaff <blp@ovn.org>
Use the appropriate format specifier for size_t, otherwise the 32-bit
build fails.
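For reference, a minimal self-contained example of the portable pattern;
OVS wraps the specifier in its own PRIuSIZE-style macro, while plain C99
code can use "%zu" directly:

    #include <stdio.h>

    int
    main(void)
    {
        size_t n_subtables = 4;

        /* Correct for size_t on both 32-bit and 64-bit builds. */
        printf("%zu subtables\n", n_subtables);
        return 0;
    }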
Reported-at: https://travis-ci.org/openvswitch/ovs/jobs/151938383
Fixes: 3453b4d62a98 ("dpif-netdev: dpcls per in_port with sorted subtables")
Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com>
Acked-by: Joe Stringer <joe@ovn.org>
The user-space datapath (dpif-netdev) consists of a first level "exact match
cache" (EMC) matching on 5-tuples and the normal megaflow classifier. With
many parallel packet flows (e.g. TCP connections) the EMC becomes inefficient
and the OVS forwarding performance is determined by the megaflow classifier.
The megaflow classifier (dpcls) consists of a variable number of hash tables
(aka subtables), each containing megaflow entries with the same mask of
packet header and metadata fields to match upon. A dpcls lookup matches a
given packet against all subtables in sequence until it hits a match. As
megaflow cache entries are by construction non-overlapping, the first match
is the only match.
Today the order of the subtables in the dpcls is essentially random so that
on average a dpcls lookup has to visit N/2 subtables for a hit, where N is the
total number of subtables. Even though every single hash-table lookup is
fast, the performance of the current dpcls degrades when there are many
subtables.
How does the patch address this issue:
In reality there is often a strong correlation between the ingress port and a
small subset of subtables that have hits. The entire megaflow cache typically
decomposes nicely into partitions that are hit only by packets entering from
a range of similar ports (e.g. traffic from Phy -> VM vs. traffic from VM ->
Phy).
Therefore, maintaining a separate dpcls instance per ingress port with its
subtable vector sorted by frequency of hits reduces the average number of
subtables lookups in the dpcls to a minimum, even if the total number of
subtables gets large. This is possible because megaflows always have an exact
match on in_port, so every megaflow belongs to a unique dpcls instance.
For thread safety, the PMD thread needs to block out revalidators during the
periodic optimization. We use ovs_mutex_trylock() to avoid blocking the PMD.
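A compact, self-contained sketch of the idea (names and data layout are
illustrative, not the OVS code): each ingress port owns a classifier
whose subtable vector is kept sorted by hit count, so a lookup stops at
the first matching subtable after visiting as few of them as possible;
the sort is the step done under the trylock mentioned above:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdlib.h>

    struct subtable_sketch {
        uint64_t mask;   /* which header bits this subtable matches on */
        uint64_t hits;   /* hit counter used for periodic sorting */
        /* a hash table of megaflow entries would live here */
    };

    struct dpcls_sketch {
        struct subtable_sketch *subtables;
        size_t n;
    };

    /* Stand-in for a real hash-table probe. */
    static bool
    subtable_match(const struct subtable_sketch *s, uint64_t key)
    {
        return (key & s->mask) == s->mask;   /* placeholder match logic */
    }

    static bool
    dpcls_lookup_sketch(struct dpcls_sketch *cls, uint64_t key)
    {
        for (size_t i = 0; i < cls->n; i++) {
            if (subtable_match(&cls->subtables[i], key)) {
                cls->subtables[i].hits++;
                return true;   /* megaflows don't overlap: first match wins */
            }
        }
        return false;
    }

    static int
    cmp_hits_desc(const void *a_, const void *b_)
    {
        const struct subtable_sketch *a = a_, *b = b_;
        return (a->hits < b->hits) - (a->hits > b->hits);
    }

    /* Called periodically for each per-ingress-port classifier. */
    static void
    dpcls_sort_sketch(struct dpcls_sketch *cls)
    {
        qsort(cls->subtables, cls->n, sizeof *cls->subtables, cmp_hits_desc);
    }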
To monitor the effectiveness of the patch we have enhanced the ovs-appctl
dpif-netdev/pmd-stats-show command with an extra line "avg. subtable lookups
per hit" to report the average number of subtable lookup needed for a
megaflow match. Ideally, this should be close to 1 and almost all cases much
smaller than N/2.
The PMD tests have been adjusted to the additional line in pmd-stats-show.
We have benchmarked a L3-VPN pipeline on top of a VXLAN overlay mesh.
With pure L3 tenant traffic between VMs on different nodes the resulting
netdev dpcls contains N=4 subtables. Each packet traversing the OVS
datapath is subject to dpcls lookup twice due to the tunnel termination.
Disabling the EMC, we have measured a baseline performance (in+out) of ~1.45
Mpps (64 bytes, 10K L4 packet flows). The average number of subtable lookups
per dpcls match is 2.5. With the patch the average number of subtable lookups
per dpcls match is reduced to 1 and the forwarding performance grows by ~50%
to 2.13 Mpps.
Even with EMC enabled, the patch improves the performance by 9% (for 1000 L4
flows) and 34% (for 50K+ L4 flows).
As the actual number of subtables will often be higher in reality, we can
assume that this is at the lower end of the speed-up one can expect from this
optimization. Just running a parallel ping between the VXLAN tunnel endpoints
increases the number of subtables and hence the average number of subtable
lookups from 2.5 to 3.5 on master with a corresponding decrease of throughput
to 1.2 Mpps. With the patch the parallel ping has no impact on average number
of subtable lookups and performance. The performance gain is then ~75%.
Signed-off-by: Jan Scheurich <jan.scheurich@ericsson.com>
Acked-by: Antonio Fischetti <antonio.fischetti@intel.com>
Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com>
This reverts commit 8bdfe1313894047d44349fa4cf4402970865950f.
I failed to see that lib/dpif-netdev.c actually needs the concurrency
provided by pvector prior to this change. More specifically, when a
subtable is removed, concurrent lookups may skip over another subtable
swapped in to the place of the removed subtable in the vector.
Since this was the only use of the non-concurrent pvector, it is
cleaner to revert the whole patch.
Reported-by: Jan Scheurich <jan.scheurich@ericsson.com>
Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
Acked-by: Daniele Di Proietto <diproiettod@vmware.com>
This patch adds some comments to the dpcls_lookup() function,
which is one of the most important places where the userspace
wildcard matching happens.
The purpose is to give some more explanation of its design
and also of how it works.
Signed-off-by: Antonio Fischetti <antonio.fischetti@intel.com>
Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
Acked-by: Jarno Rajahalme <jarno@ovn.org>
Revalidation should work in case of 'dynamic_txqs == true'.
Fixes: 324c8374852a ("dpif-netdev: XPS (Transmit Packet Steering) implementation.")
Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Acked-by: Daniele Di Proietto <diproiettod@vmware.com>
PMD threads use pvectors but do not need the overhead of the
concurrent version. Expose the non-concurrent version for
that use.
Note that struct pvector is renamed as struct cpvector (for concurrent
priority vector), and the former struct pvector_impl is now struct
pvector.
Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
Acked-by: Ben Pfaff <blp@ovn.org>
Now that dpif_execute has a 'flow' member, it's pretty easy to access
the flow (or the matching megaflow) in dp_execute_cb().
This means that it's no longer necessary for the connection tracker to
re-extract 'dl_type' from the packet; it can be passed as a parameter.
This change means that we have to slightly complicate test-conntrack to
group the packets by dl_type before passing them to the connection
tracker.
Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com>
Acked-by: Joe Stringer <joe@ovn.org>
New functions are implemented in the conntrack module to support this.
Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
New functions are implemented in the conntrack module to support this.
Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
This commit implements the OVS_ACTION_ATTR_CT action in dpif-netdev.
To allow ofproto-dpif to detect the conntrack feature, flow_put will no
longer discard flows with ct_* fields set. We still shouldn't allow
flows with NAT bits set, since there is no support for NAT.
Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Acked-by: Antonio Fischetti <antonio.fischetti@intel.com>
Instead of using the internal type, use the port_open_type when creating the
local port. That makes sure that whenever dpif_port_query is used, the netdev
open_type is returned instead of the "internal" type.
For other ports, that is already the case, as the netdev type is used when
creating the dp_netdev_port.
That changes the output of dpctl when showing the local port, and also when
trying to change its type. So, corresponding tests are fixed.
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@redhat.com>
Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com>
New 'other_config:pmd-rxq-affinity' field for Interface table to
perform manual pinning of RX queues to desired cores.
This functionality is required to achieve maximum performance because
different kinds of ports have different costs of rx/tx operations and
only the user can know about the expected workload on different ports.
Example:
# ./bin/ovs-vsctl set interface dpdk0 options:n_rxq=4 \
other_config:pmd-rxq-affinity="0:3,1:7,3:8"
Queue #0 pinned to core 3;
Queue #1 pinned to core 7;
Queue #2 not pinned;
Queue #3 pinned to core 8.
It was decided to automatically isolate cores that have an rxq explicitly
assigned to them, because it's useful to keep a constant polling rate on
some performance-critical ports while adding/deleting other ports
without explicitly pinning all ports.
Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com>
The next patches will add new conditions under which reconfiguration is
required. It'll be simpler to have a common way to request reconfiguration.
Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Acked-by: Daniele Di Proietto <diproiettod@vmware.com>
This commit adds functionality to pass the value of the 'other_config'
column of the 'Interface' table to the datapath.
This may be used to pass options that are not directly connected with the
netdev and to configure the behaviour of the datapath for different ports.
For example: pinning of rx queues to polling threads in dpif-netdev.
Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Acked-by: Daniele Di Proietto <diproiettod@vmware.com>
If the number of CPUs in pmd-cpu-mask is not divisible by the number of
queues, and in a few more complex situations, there may be an unfair
distribution of TX queue-ids between PMD threads.
For example, if we have 2 ports with 4 queues and 6 CPUs in pmd-cpu-mask
such distribution is possible:
<------------------------------------------------------------------------>
pmd thread numa_id 0 core_id 13:
port: vhost-user1 queue-id: 1
port: dpdk0 queue-id: 3
pmd thread numa_id 0 core_id 14:
port: vhost-user1 queue-id: 2
pmd thread numa_id 0 core_id 16:
port: dpdk0 queue-id: 0
pmd thread numa_id 0 core_id 17:
port: dpdk0 queue-id: 1
pmd thread numa_id 0 core_id 12:
port: vhost-user1 queue-id: 0
port: dpdk0 queue-id: 2
pmd thread numa_id 0 core_id 15:
port: vhost-user1 queue-id: 3
<------------------------------------------------------------------------>
As we can see above, the dpdk0 port is polled by threads on cores
12, 13, 16 and 17.
By design of dpif-netdev, there is only one TX queue-id assigned to each
pmd thread. These queue-ids are sequential, similar to core-ids, and a
thread will send packets to the queue with exactly this queue-id
regardless of the port.
In the previous example:
pmd thread on core 12 will send packets to tx queue 0
pmd thread on core 13 will send packets to tx queue 1
...
pmd thread on core 17 will send packets to tx queue 5
So, for the dpdk0 port, after truncation in netdev-dpdk:
core 12 --> TX queue-id 0 % 4 == 0
core 13 --> TX queue-id 1 % 4 == 1
core 16 --> TX queue-id 4 % 4 == 0
core 17 --> TX queue-id 5 % 4 == 1
As a result, only 2 of the 4 queues are used.
To fix this issue, a form of XPS is implemented in the following way
(see the sketch after the list):
* TX queue-ids are allocated dynamically.
* When a PMD thread tries to send packets to a new port for the
first time, it allocates the least used TX queue for this port.
* PMD threads periodically revalidate the allocated TX queue-ids.
If a queue wasn't used in the last XPS_TIMEOUT_MS milliseconds,
it is freed during revalidation.
* XPS is not used if there are enough TX queues.
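A self-contained sketch of the scheme above, with illustrative names and
constants: a thread picks the least-used TX queue-id the first time it
sends to a port, and periodic revalidation frees queue-ids that have been
idle longer than the timeout:

    #include <stddef.h>

    #define XPS_TIMEOUT_MS_SKETCH 500      /* illustrative value */

    struct txq_sketch {
        unsigned refcount;       /* threads currently using this qid */
        long long last_used_ms;  /* last time a packet was sent on it */
    };

    /* Allocate the least-used TX queue-id of a port for the calling thread. */
    static int
    xps_alloc_qid_sketch(struct txq_sketch *txqs, size_t n_txq,
                         long long now_ms)
    {
        size_t best = 0;

        for (size_t i = 1; i < n_txq; i++) {
            if (txqs[i].refcount < txqs[best].refcount) {
                best = i;
            }
        }
        txqs[best].refcount++;
        txqs[best].last_used_ms = now_ms;
        return (int) best;
    }

    /* Periodic revalidation: drop this thread's qid if it has been idle. */
    static int
    xps_revalidate_sketch(struct txq_sketch *txqs, int qid, long long now_ms)
    {
        if (qid >= 0
            && now_ms - txqs[qid].last_used_ms > XPS_TIMEOUT_MS_SKETCH) {
            txqs[qid].refcount--;
            return -1;           /* qid released; reallocate on next send */
        }
        return qid;
    }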
Reported-by: Zhihong Wang <zhihong.wang@intel.com>
Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com>
The implementation of 'nullable_string_is_equal()' is moved to util.c and
reused inside dpif-netdev.
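For reference, the helper boils down to roughly the following (two
nullable strings are equal if both are NULL, or both non-NULL with equal
contents); this is a sketch of the semantics, not necessarily the exact
util.c code:

    #include <stdbool.h>
    #include <string.h>

    /* Equal if both are NULL, or both non-NULL and strcmp()-equal. */
    static bool
    nullable_string_is_equal(const char *a, const char *b)
    {
        return a ? b && !strcmp(a, b) : !b;
    }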
Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Acked-by: Daniele Di Proietto <diproiettod@vmware.com>
To easily allow both in- and out-of-tree building of the Python
wrapper for the OVS JSON parser (e.g. w/ pip), move json.h to
include/openvswitch. This also requires moving lib/{hmap,shash}.h.
Both hmap.h and shash.h were #include-ing "util.h" even though the
headers themselves did not use anything from there, but rather from
include/openvswitch/util.h. Fixing that required including util.h
in several C files mostly due to OVS_NOT_REACHED and things like
xmalloc.
Signed-off-by: Terry Wilson <twilson@redhat.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
The PMD thread needs to keep processing RX queues in order
to achieve maximum throughput. It also needs to sweep the EMC
cache and quiesce, both of which use seq_mutex. That mutex can
eventually block the PMD thread, causing latency spikes and
affecting the throughput.
Since there is no requirement to run those tasks at a
specific time, this patch extends the seq API to allow tentative
locking instead.
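A self-contained sketch of the tentative-locking pattern, using plain
pthreads instead of the seq/ovs-thread wrappers that the patch actually
extends (names here are illustrative):

    #include <pthread.h>
    #include <stdbool.h>

    static pthread_mutex_t housekeeping_mutex = PTHREAD_MUTEX_INITIALIZER;

    static void poll_rx_queues_sketch(void) { /* stub */ }
    static void sweep_emc_and_quiesce_sketch(void) { /* stub */ }

    static void
    pmd_loop_sketch(volatile bool *exiting)
    {
        while (!*exiting) {
            poll_rx_queues_sketch();

            /* Only do the housekeeping when the lock is free, so the PMD
             * thread is never blocked on the mutex. */
            if (pthread_mutex_trylock(&housekeeping_mutex) == 0) {
                sweep_emc_and_quiesce_sketch();
                pthread_mutex_unlock(&housekeeping_mutex);
            }
            /* Otherwise skip and retry on a later iteration. */
        }
    }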
Reported-by: Karl Rister <krister@redhat.com>
Co-authored-by: Karl Rister <krister@redhat.com>
Signed-off-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com>
Currently, there are a few inconsistencies in the ways to configure the
number of queues for a netdev device:
* dpif-netdev can't know the exact number of queues
allocated inside the netdev.
This leads to constant mapping of queue-ids to 'real' ones.
* We are able to configure 'n_rxq' for vhost-user devices, but
there is only one sane number of rx queues which must be used
and configured manually (the number of queues allocated
in QEMU).
This patch disables configuration of 'n_rxq' for DPDK vHost devices.
The configuration of rx and tx queues is now automatically applied from
the connected virtio device. The standard reconfiguration mechanism is
used to apply these changes.
Also, now 'n_txq' and 'n_rxq' are always the real numbers of queues
in the device.
Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com>
The patch adds a new action to support packet truncation. The new action
is formatted as 'output(port=n,max_len=m)', which outputs to port n with
the packet size being MIN(original_size, m).
One use case is to enable port mirroring to send smaller packets to the
destination port so that only useful packet information is mirrored/copied,
saving some of the performance overhead of copying the entire packet
payload. An example use case is below, as well as in the testcases:
- Output to port 1 with max_len 100 bytes.
- The output packet size on port 1 will be MIN(original_packet_size, 100).
# ovs-ofctl add-flow br0 'actions=output(port=1,max_len=100)'
- The scope of max_len is limited to the output action itself. The packet
sizes of the following output:1 and output:2 will be intact.
# ovs-ofctl add-flow br0 \
'actions=output(port=1,max_len=100),output:1,output:2'
- The Datapath actions show:
# Datapath actions: trunc(100),1,1,2
Tested-at: https://travis-ci.org/williamtu/ovs-travis/builds/140037134
Signed-off-by: William Tu <u9012063@gmail.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
When debug logging is enabled, dpif-netdev can print each flow as it is
installed, which it currently does using OpenFlow match formatting. Compared
to ODP formatting, there generally isn't too much difference since the
fields are largely the same, but it is inconsistent with other logging in
dpif-netdev as well as the analogous functions that deal with the kernel.
However, in some cases there is a difference between the two formats, such
as in the cases of the input port or tunnel metadata. For the input port, the
datapath format helped detect that the generated masks were incorrect. As for
tunnels, at the moment it's possible to convert between the two formats on
demand, as we have a global metadata table. In the future, though, this won't
be possible, as the metadata table becomes per-bridge and the datapath won't
have access to it.
Signed-off-by: Jesse Gross <jesse@kernel.org>
Acked-by: Daniele Di Proietto <diproiettod@vmware.com>
When calling odp_flow_key_from_flow (or _mask), the in_port included
as part of the flow is ignored and must be explicitly passed as a
separate parameter. This is because the assumption was that the flow's
version would often be in OFP format, rather than ODP.
However, at this point all flows that are ready for serialization in
netlink format already have their in_port properly set to ODP format.
As a result, every caller needs to explicitly initialize the extra
parameter to the value that is in the flow. This switches to just using
the value in the flow, to simplify things and avoid the possibility of
forgetting to initialize the extra parameter.
Signed-off-by: Jesse Gross <jesse@kernel.org>
Acked-by: Daniele Di Proietto <diproiettod@vmware.com>
This commit moves the code that sets the pmd threads affinity from
netdev-dpdk to ovs-numa. There's one small part left in netdev-dpdk, to
set the lcore_id.
Now dpif-netdev will call both modules (ovs-numa and netdev-dpdk) when
starting a pmd thread.
This change will allow having a dummy implementation of the set affinity
call, for testing purposes.
Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com>
Acked-by: Ilya Maximets <i.maximets@samsung.com>
Currently 'dpctl/flow-get' doesn't work for flows installed by
PMD threads.
Fix that by implementing a search across all PMD threads. The flow from
the first PMD thread with a match will be returned.
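A self-contained sketch of that search, with illustrative names and a
simplified flow identifier (not the actual dpif-netdev structures): probe
each PMD thread's flow table in turn and return the first match:

    #include <stddef.h>

    struct flow_entry_sketch { unsigned long long ufid; };

    struct pmd_flows_sketch {
        struct flow_entry_sketch *flows;
        size_t n_flows;
    };

    static const struct flow_entry_sketch *
    flow_get_sketch(const struct pmd_flows_sketch *pmds, size_t n_pmds,
                    unsigned long long ufid)
    {
        for (size_t i = 0; i < n_pmds; i++) {
            for (size_t j = 0; j < pmds[i].n_flows; j++) {
                if (pmds[i].flows[j].ufid == ufid) {
                    return &pmds[i].flows[j];  /* first PMD with a match */
                }
            }
        }
        return NULL;
    }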
Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com>
This introduces in dpif-netdev and netdev-dpdk the first use of the
newly introduced reconfigure netdev call.
When a request to change the number of queues comes, netdev-dpdk will
remember this and notify the upper layer via
netdev_request_reconfigure().
The datapath, instead of periodically calling netdev_set_multiq(), can
detect this and call reconfigure().
This mechanism can also be used to:
* Automatically match the number of rxq with the one provided by qemu
via the new_device callback.
* Provide a way to change the MTU of dpdk devices at runtime.
* Move a DPDK vhost device to the proper NUMA socket.
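A self-contained sketch of the request/detect flow described above; only
netdev_request_reconfigure() is named by the commit, and the remaining
names and the sequence-number mechanism are illustrative assumptions:

    #include <stdatomic.h>
    #include <stdbool.h>

    struct netdev_sketch {
        atomic_uint reconf_seq;   /* bumped when reconfiguration is needed */
        unsigned last_seen_seq;   /* last value handled by the datapath */
        int requested_n_rxq;
    };

    static void
    netdev_request_reconfigure_sketch(struct netdev_sketch *dev, int n_rxq)
    {
        dev->requested_n_rxq = n_rxq;
        atomic_fetch_add(&dev->reconf_seq, 1);   /* notify the upper layer */
    }

    static bool
    netdev_is_reconf_required_sketch(struct netdev_sketch *dev)
    {
        return atomic_load(&dev->reconf_seq) != dev->last_seen_seq;
    }

    /* Run from the datapath main loop instead of polling set_multiq(). */
    static void
    datapath_run_sketch(struct netdev_sketch *dev)
    {
        if (netdev_is_reconf_required_sketch(dev)) {
            dev->last_seen_seq = atomic_load(&dev->reconf_seq);
            /* ... stop rxq polling, apply requested_n_rxq, restart ... */
        }
    }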
Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com>
Tested-by: Ilya Maximets <i.maximets@samsung.com>
Acked-by: Ilya Maximets <i.maximets@samsung.com>
Errors returned by netdev_set_multiq() and netdev_rxq_open() weren't
handled properly in reconfigure_pmd_threads(). In case of error, we now
remove the port from the datapath.
Also, part of the code is moved into a new function, port_reconfigure().
Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com>
Tested-by: Ilya Maximets <i.maximets@samsung.com>
Acked-by: Ilya Maximets <i.maximets@samsung.com>
netdev objects are hard to use with RCU, because it's not possible to
split removal and reclamation. Postponing the removal means that the
port is not removed and cannot be readded immediately. Waiting for
reclamation means introducing a quiescent state, and that may introduce
subtle bugs, due to the RCU model we use in userspace.
This commit changes the port container from cmap to hmap. 'port_mutex'
must be held by readers and writers. This shouldn't have a performance
impact, as readers in the fast path use a thread-local cache.
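A sketch with stand-in types of the resulting locking discipline: the
port container is protected by a mutex that both readers and writers
must hold (fast-path readers go through a thread-local cache instead, so
they never take it):

    #include <pthread.h>

    struct port_sketch { int port_no; struct port_sketch *next; };

    struct dp_sketch {
        pthread_mutex_t port_mutex;   /* held by readers and writers */
        struct port_sketch *ports;    /* plain container, no RCU */
    };

    static struct port_sketch *
    port_lookup_sketch(struct dp_sketch *dp, int port_no)
    {
        struct port_sketch *p, *found = NULL;

        pthread_mutex_lock(&dp->port_mutex);
        for (p = dp->ports; p; p = p->next) {
            if (p->port_no == port_no) {
                found = p;
                break;
            }
        }
        pthread_mutex_unlock(&dp->port_mutex);
        return found;
    }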
Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com>
Tested-by: Ilya Maximets <i.maximets@samsung.com>
Acked-by: Ilya Maximets <i.maximets@samsung.com>
A future commit will stop using RCU for 'dp->ports' and use a mutex for
reading/writing them. To avoid taking a mutex in dp_execute_cb(), which
is called in the fast path, this commit introduces a pmd thread local
cache of ports.
The downside is that every port add/remove now needs to synchronize with
every pmd thread.
Among the advantages, keeping a per-thread port mapping could allow
greater control over the txq assignment.
Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com>
Tested-by: Ilya Maximets <i.maximets@samsung.com>
Acked-by: Ilya Maximets <i.maximets@samsung.com>
The pmd threads and the main thread are synchronized using a condition
variable. The main thread writes a new configuration, then it waits on
the condition variable. A pmd thread reads the new configuration, then
it calls signal() on the condition variable. To make sure that the pmds
and the main thread have a consistent view, each signal() should be
backed by a wait().
Currently the first signal() doesn't have a corresponding wait(). If
the pmd thread takes a long time to start and the signal() is received
by a later wait, the threads will have an inconsistent view.
The commit fixes the problem by removing the first signal() from the
pmd thread.
This is hardly a problem on current master, because the main thread
will call the first wait() a long time after the creation of a pmd
thread. It becomes a problem with the next commits.
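A self-contained sketch of the intended protocol, using plain pthreads
rather than the OVS wrappers (names are illustrative): the main thread
publishes a configuration and waits, and a pmd thread signals only after
consuming a configuration, so every signal() is paired with a wait():

    #include <pthread.h>

    static pthread_mutex_t cfg_mutex = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t cfg_cond = PTHREAD_COND_INITIALIZER;
    static int cfg_version;          /* bumped by the main thread */
    static int cfg_acked_version;    /* last version consumed by the pmd */

    static void
    main_thread_set_config_sketch(void)
    {
        pthread_mutex_lock(&cfg_mutex);
        cfg_version++;                          /* write new configuration */
        while (cfg_acked_version != cfg_version) {
            pthread_cond_wait(&cfg_cond, &cfg_mutex);   /* wait for the ack */
        }
        pthread_mutex_unlock(&cfg_mutex);
    }

    static void
    pmd_thread_poll_config_sketch(void)
    {
        pthread_mutex_lock(&cfg_mutex);
        if (cfg_acked_version != cfg_version) {
            /* ... apply the new configuration ... */
            cfg_acked_version = cfg_version;
            pthread_cond_signal(&cfg_cond);     /* ack: paired with a wait() */
        }
        pthread_mutex_unlock(&cfg_mutex);
    }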
Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com>
Tested-by: Ilya Maximets <i.maximets@samsung.com>
Acked-by: Ilya Maximets <i.maximets@samsung.com>
This commit introduces some functions to add/remove rxqs from pmd
threads without reloading them. They will be used by the next commits.
Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com>
Tested-by: Ilya Maximets <i.maximets@samsung.com>
Acked-by: Ilya Maximets <i.maximets@samsung.com>
Instead of performing every operation inside do_port_add() it seems
clearer to introduce port_create(), since we already have
port_destroy().
No functional change.
Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com>
Tested-by: Ilya Maximets <i.maximets@samsung.com>
Acked-by: Ilya Maximets <i.maximets@samsung.com>