This patch adds a new functions classifier_defer() and
classifier_publish(), which control when the classifier modifications
are made available to lookups. By default, all modifications are made
available to lookups immediately. Modifications made after a
classifier_defer() call MAY be 'deferred' for later 'publication'. A
call to classifier_publish() will both publish any deferred
modifications, and cause subsequent changes to to be published
immediately.
Currently any deferring is limited to the visibility of the subtable
vector changes. pvector now processes modifications mostly in a
working copy, which needs to be explicitly published with
pvector_publish(). pvector_publish() sorts the working copy and
removes gaps before publishing it.
This change helps avoiding O(n**2) memory behavior in corner cases,
where large number of rules with different masks are inserted or
deleted.
VMware-BZ: #1322017
Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com>
Acked-by: Ben Pfaff <blp@nicira.com>
Before this commit, when 'struct dp_netdev_port' is deleted from
'dpif-netdev' datapath, if there is pmd thread, the pmd thread
will release the last reference to the port and ovs-rcu postpone
the destroy. However, the delayed close of object like 'struct
netdev' could cause failure in immediate re-add or reconfigure of
the same device.
To fix the above issue, this commit uses condition variable and
makes the main thread wait for pmd thread to release the reference
when deleting port. Then, the main thread can directly destroy the
port.
Reported-by: Cian Ferriter <cian.ferriter@intel.com>
Signed-off-by: Alex Wang <alexw@nicira.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
There is a portion of the 'struct dp_netdev_port' initialization
that is placed after the reload of pmd threads. This means in
theory, there could be a race where pmd threads access half-
initialized struct. Although such race has not been seen, it
makes sense to fully initialize the struct before use.
Found by code inspection.
Signed-off-by: Alex Wang <alexw@nicira.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Following patch adds support for userspace tunneling. Tunneling
needs three more component first is routing table which is configured by
caching kernel routes and second is ARP cache which build automatically
by snooping arp. And third is tunnel protocol table which list all
listening protocols which is populated by vswitchd as tunnel ports
are added. GRE and VXLAN protocol support is added in this patch.
Tunneling works as follows:
On packet receive vswitchd check if this packet is targeted to tunnel
port. If it is then vswitchd inserts tunnel pop action which pops
header and sends packet to tunnel port.
On packet xmit rather than generating Set tunnel action it generate
tunnel push action which has tunnel header data. datapath can use
tunnel-push action data to generate header for each packet and
forward this packet to output port. Since tunnel-push action
contains most of packet header vswitchd needs to lookup routing
table and arp table to build this action.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Jarno Rajahalme <jrajahalme@nicira.com>
Acked-by: Thomas Graf <tgraf@noironetworks.com>
Acked-by: Ben Pfaff <blp@nicira.com>
OVS userspace are backward compatible with older Linux kernel modules.
However, not having the most up-to-date datapath kernel modules can
some times lead to user confusion. Storing the datapath version in
OVSDB allows management software to check and optionally provide
notifications to users.
Signed-off-by: Andy Zhou <azhou@nicira.com>
Acked-by: Ben Pfaff <blp@nicira.com>
DPDK rings don't need one queue per PMD thread and don't support multiple
queues (set_multiq function is undefined). To fix operation with DPDK rings,
this patch ignores EOPNOTSUPP error on netdev_set_multiq() and provides, for
DPDK rings, a netdev send() function that ignores the provided queue id
(= PMD thread core id).
Suggested-by: Maryam Tahhan <maryam.tahhan@intel.com>
Signed-off-by: David Verbeiren <david.verbeiren@intel.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Patch 0de8783a9 (lib/dpif-netdev: Integrate megaflow classifier.)
broke exact match cache lookup, but it went undetected since there are
no separate stats for EMC.
This patch fixes the problem by changing the struct netdev_flow_key
'len' member to cover only the 'mf' member, not the whole
netdev_flow_key, and ignoring the 'len' field in
netdev_flow_key_equal. Comparison is still accurate, as the miniflow
'map' field encodes the length in the number of 1-bits, and the map is
included in the comparison.
Reported-by: Alex Wang <alexw@nicira.com>
Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com>
Acked-by: Daniele Di Proietto <ddiproietto@vmware.com>
Megaflow inserts and removals are simplified:
- No need for classifier internal mutex, as dpif-netdev already has a
'flow_mutex'.
- Number of memory allocations/frees can be halved.
- Lookup code path can rely on netdev_flow_key always having inline data.
This will also be easier to simplify further when moving to per-thread
megaflow classifiers in the future.
Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com>
Acked-by: Alex Wang <alexw@nicira.com>
Today dpif-netdev has single metadat for given batch, since one
batch belongs to one port, but soon packets fro single tunnel ports
can belong to different ports, so we need to have per packet metadata.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Jarno Rajahalme <jrajahalme@nicira.com>
Batching the cmap find improves the memory behavior with large cmaps
and can make searches twice as fast:
$ tests/ovstest test-cmap benchmark 2000000 8 0.1 16
Benchmarking with n=2000000, 8 threads, 0.10% mutations, batch size 16:
cmap insert: 533 ms
cmap iterate: 57 ms
batch search: 146 ms
cmap destroy: 233 ms
cmap insert: 552 ms
cmap iterate: 56 ms
cmap search: 299 ms
cmap destroy: 229 ms
hmap insert: 222 ms
hmap iterate: 198 ms
hmap search: 2061 ms
hmap destroy: 209 ms
Batch size 1 has small performance penalty, but all other batch sizes
are faster than non-batched cmap_find(). The batch size 16 was
experimentally found better than 8 or 32, so now
classifier_lookup_miniflow_batch() performs the cmap find operations
in batches of 16.
Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com>
Acked-by: Ben Pfaff <blp@nicira.com>
This makes the following patch easier and cleans up the code.
Explicit "inline" keywords seem necessary to prevent performance
regression on cmap_find() with GCC 4.7 -O2.
Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com>
Acked-by: Ben Pfaff <blp@nicira.com>
dp_netdev_free() must free 'dp->upcall_rwlock', but when upcalls are
disabled (if the datapath is being freed upcalls should be disabled)
'dp->upcall_rwlock' is taken and freeing it causes an assertion to
fail.
This commit takes makes sure that the upcalls are disabled and
releases 'dp->upcall_rwlock' before freeing it. A simple testcase is
added to detect the failure.
Signed-off-by: Daniele Di Proietto <ddiproietto@vmware.com>
Acked-by: Jarno Rajahalme <jrajahalme@nicira.com>
If a packet didn't match a rule in the fast path classifier its memory was
never freed. The issue was particularly clear with DPDK devices because it was
not possible to process more than ~250000 DPDK mbufs in the slow path.
This commit fixes the problem by:
* calling dpif_packet_delete() if the upcalls are disabled
* passing may_steal==true to dp_netdev_execute_actions() during normal upcall
processing
Signed-off-by: Daniele Di Proietto <ddiproietto@vmware.com>
Acked-by: Alex Wang <alexw@nicira.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
This commits adds the multithreading functionality to OVS dpdk
module. Users are able to create multiple pmd threads and set
their cpu affinity via specifying the cpu mask string similar
to the EAL '-c COREMASK' option.
Also, the number of rx queues for each dpdk interface is made
configurable to help distribution of rx packets among multiple
pmd threads.
Signed-off-by: Alex Wang <alexw@nicira.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
This optimization is done to avoid calling count_1bits(), which, if
the popcnt istruction is not available might is slow. popcnt may not
be available because:
- We are running on old hardware
- (more likely) We're using a generic build (i.e. packaged OVS from a
distro), not tuned for the specific CPU
Signed-off-by: Daniele Di Proietto <ddiproietto@vmware.com>
Acked-by: Jarno Rajahalme <jrajahalme@nicira.com>
netdev_flow_key is a miniflow with the following constraints:
1) It is used only inside dpif-netdev.c.
2) It always has inline values.
3) It contains only miniflows created by miniflow_extract().
Therefore, by using these new functions instead of the miniflow_*
ones, we get the following (performance related) benefits:
- Because of (1) the functions can be inlined.
- Because of (2) and (3) the netdev_flow_key can be treated as POD.
Specifically, because of (3), we can do comparisons with memcmp,
since if the map is different the miniflow must be different.
Signed-off-by: Daniele Di Proietto <ddiproietto@vmware.com>
Acked-by: Jarno Rajahalme <jrajahalme@nicira.com>
With this commit, ovs by default will create one pmd thread
for each numa node and pin the pmd thread to available cpu
core on the numa node.
NON_PMD_CORE_ID (currently 0) is used to reserve a particular
cpu core for the I/O of all non-pmd threads. No pmd thread
can be pinned to this reserved core.
As side-effects of this commit:
- pmd thread will not be created, if there is no dpdk interface
from the corresponding numa node added to ovs.
- the exact-match cache for non-pmd threads is removed from
'struct dp_netdev'. Instead, all non-pmd threads will use
the exact-match cache defined in the 'struct dp_netdev_pmd_thread'
for NON_PMD_CORE_ID.
- the rx packet processing functions are refactored to use
'struct dp_netdev_pmd_thread' as input.
- the 'netdev_send()' function will be called with the proper
queue id.
- both pmd and non-pmd threads can call the dpif_netdev_execute().
so, use a per-thread key to help recognize the calling thread.
Signed-off-by: Alex Wang <alexw@nicira.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Before this commit, ovs creates one tx and one rx queue for
each dpdk interface and uses only one poll thread for handling
I/O of all dpdk interfaces. An upcoming patch will allow multiple
poll threads be created. As a preparation, this commit changes
the dpif-netdev to create multiple tx/rx queues when the dpdk
interface is added.
Specifically, the number of rx queues will still be one per-dpdk
interface for this commit. But upcoming work will allow user
create multiple rx queues. The number of tx queues will be the
number of cpu cores on the machine. Although not all the tx queues
will be used, each poll thread will have its own queue for
transmission on the dpdk interface.
Signed-off-by: Alex Wang <alexw@nicira.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
This commit adds new variable n_txq to 'struct netdev' for recording
the number of tx queues. Correspondingly, the send_*() functions are
extended to accept queue id as input argument.
All 'netdev-*' implementation will ignore the queue id since having
multiple tx queues is not supported. Upcomping patches will start
using it and create multiple tx queues for dpdk netdev.
Signed-off-by: Alex Wang <alexw@nicira.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
dpif_netdev_execute may be called while doing upcall processing.
Since the context of the input port is not tracked upto this point, we
use the shared dp->emc_cache for packet execution, where the emc_cache
is needed for recirculation.
While recursive mutexes can make thread safety analysis hard, for now
we change emc_mutex to be recursive. Forthcoming new unit tests will
fail with the current non-recursive mutex. Later improvements may
remove the need for this recursion.
Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com>
Acked-by: Daniele Di Proietto <ddiproietto@vmware.com>
Add a new action type OVS_ACTION_ATTR_SET_MASKED, and support for
parsing, printing, and committing them.
Masked set actions add a mask, immediately following the netlink
attribute data, within the netlink attribute itself. Thus the key
attribute size for a masked set action is exactly double of the
non-masked set action.
Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com>
Acked-by: Ben Pfaff <blp@nicira.com>
When pmd thread interates through all ports for queue loading,
the main thread may unreference and 'rcu-free' a port before
pmd thread take new reference of it. This could cause pmd
thread fail the reference and access freed memory later.
This commit fixes this race by introducing port_try_ref()
which uses ovs_refcount_try_ref_rcu(). And the pmd thread
will only load the port's queue, if port_try_ref() returns
true.
Found by inspection.
Signed-off-by: Alex Wang <alexw@nicira.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Since lookups in the classifier can be pretty expensive,
we introduce this (thread local) cache which simply
compares the miniflows of the packets
Signed-off-by: Daniele Di Proietto <ddiproietto@vmware.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
These function are used to stored the packet hash. 'netdev-dpdk'
automatically set this value to the RSS hash returned by the
NIC. Other 'netdev's set it to 0 (which is an invalid hash
value), so that callers can compute the hash on their own.
If DPDK support is enabled, struct dpif_packet's member
'dp_hash' is removed and 'pkt.hash.rss' from DPDK mbuf is used
This commit also configure DPDK devices to compute RSS hash
for UDP and IPv6 packets
Signed-off-by: Daniele Di Proietto <ddiproietto@vmware.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
It could be possible that the thread misses a signal when it reads the
change_seq again after reload. Also, the counter has no dependent
data, so the memory model for the atomic read can be relaxed.
Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
This patch avoids the relatively inefficient miss handling processes
dictated by the dpif process, by calling into ofproto-dpif directly
through a callback.
Signed-off-by: Ethan Jackson <ethan@nicira.com>
Acked-by: Ben Pfaff <blp@nicira.com>
This cleans up the dpif interface to make it more consistent with the
other dpif operations, and allows flows to be fetched in batches.
Signed-off-by: Joe Stringer <joestringer@nicira.com>
Acked-by: Ben Pfaff <blp@nicira.com>
struct dp_netdev_flow used to have a reference counter.
It has been replaced by RCU. Unfortunately RCU is not
enough if we plan to hold a reference to the dp_netdev_flow
for a long time. So this commit reintroduces reference
counting for struct dp_netdev_flow
Subsequent commits make use of it.
Signed-off-by: Daniele Di Proietto <ddiproietto@vmware.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
The dpif provider 'operate' call duplicates all of the features available
from the 'flow_put', 'flow_del', and 'execute' calls, yielding redundant
code in providers that support both mechanisms. This change drops the
latter calls in favor of making every dpif provider support 'operate'.
The result is code that is overall less duplicative.
It might make sense to do the same with flow_get but so far 'operate'
doesn't support flow_get.
Signed-off-by: Ben Pfaff <blp@nicira.com>
This patch gives dp_netdev_flow_add() a match with which it can
initialize the classifier rule. This prevents it from needing to copy
a flow and flow_wildcards into the match first.
Signed-off-by: Ethan Jackson <ethan@nicira.com>
Acked-by: Ben Pfaff <blp@nicira.com>
There isn't any significant downside to making cmap iteration "safe" all
the time, so this drops the _SAFE variant.
Similar changes to CMAP_CURSOR_FOR_EACH and CMAP_CURSOR_FOR_EACH_CONTINUE.
Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Jarno Rajahalme <jrajahalme@nicira.com>
Typically, kernel datapath threads send upcalls to userspace where
handler threads process the upcalls. For TAP and DPDK devices, the
datapath threads operate in userspace, so there is no need for
separate handler threads.
This patch allows userspace datapath threads to directly call the
ofproto upcall functions, eliminating the need for handler threads
for datapaths of type 'netdev'.
Signed-off-by: Ryan Wilson <wryan@nicira.com>
Signed-off-by: Ethan Jackson <ethan@nicira.com>
Acked-by: Ethan Jackson <ethan@nicira.com>
Commit db73f7166a (netdev-dpdk: Fix race condition with DPDK mempools in
non pmd threads) switched to a new way of setting up 'upcall->packet', but
only initialized two of the fields in the packet. This could cause
core dumps and other strange behavior. In particular it caused failures in
several unit tests on XenServer.
This commit fixes the problem by initializing the entire ofpbuf.
Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Justin Pettit <jpettit@nicira.com>
struct 'miniflow' already contains MINI_N_INLINE values, therefore
we can save few bytes in netdev_flow_key.
Signed-off-by: Daniele Di Proietto <ddiproietto@vmware.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
CMAP_FOR_EACH and CLS_FOR_EACH and their variants tried to use void ** as
a "pointer to any kind of pointer". That is a violation of the aliasing
rules in ISO C which technically yields undefined behavior. With GCC 4.1,
it causes both warnings and actual misbehavior. One option would to add
-fno-strict-aliasing to the compiler flags, but that would only help with
GCC; who knows whether this can be worked around with other compilers.
Instead, this commit rewrites the iterators to avoid disallowed pointer
aliasing.
VMware-BZ: #1287651
Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Jarno Rajahalme <jrajahalme@nicira.com>
DPDK mempools rely on rte_lcore_id() to implement a thread-local cache.
Our non pmd threads had rte_lcore_id() == 0. This allowed concurrent access to
the "thread-local" cache, causing crashes.
This commit resolves the issue with the following changes:
- Every non pmd thread has the same lcore_id (0, for management reasons), which
is not shared with any pmd thread (lcore_id for pmd threads now start from 1)
- DPDK mbufs must be allocated/freed in pmd threads. When there is the need to
use mempools in non pmd threads, like in dpdk_do_tx_copy(), a mutex must be
held.
- The previous change does not allow us anymore to pass DPDK mbufs to handler
threads: therefore this commit partially revert 143859ec63. Now packets
are copied for upcall processing. We can remove the extra memcpy by
processing upcalls in the pmd thread itself.
With the introduction of the extra locking, the packet throughput will be lower
in the following cases:
- When using internal (tap) devices with DPDK devices on the same datapath.
Anyway, to support internal devices efficiently, we needed DPDK KNI devices,
which will be proper pmd devices and will not need this locking.
- When packets are processed in the slow path by non pmd threads. This overhead
can be avoided by handling the upcalls directly in pmd threads (a change that
has already been proposed by Ryan Wilson)
Also, the following two fixes have been introduced:
- In dpdk_free_buf() use rte_pktmbuf_free_seg() instead of rte_mempool_put().
This allows OVS to run properly with CONFIG_RTE_LIBRTE_MBUF_DEBUG DPDK option
- Do not bulk free mbufs in a transmission queue. They may belong to different
mempools
Signed-off-by: Daniele Di Proietto <ddiproietto@vmware.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
The userspace datapath returns RCU-protected actions from flow_get() and
flow_dump_next(). This doesn't cause any trouble for current users of
these functions, but it imposes additional constraints on their use.
This patch makes the dpif documentation more explicit about how the
results of these functions can be used.
Signed-off-by: Joe Stringer <joestringer@nicira.com>
Acked-by: Ben Pfaff <blp@nicira.com>
Change the interface to allow implementations to pass back a buffer, and
allow callers to specify which of actions, mask, and stats they wish to
receive. This will be used in the next commit.
Signed-off-by: Joe Stringer <joestringer@nicira.com>
Acked-by: Ben Pfaff <blp@nicira.com>
Now that all the relevant classifier structures use RCU and internal
mutual exclusion for modifications, we can remove the fat-rwlock and
thus make the classifier lookups lockless.
As the readers are operating concurrently with the writers, a
concurrent reader may or may not see a new rule being added by a
writer, depending on how the concurrent events overlap with each
other. Overall, this is no different from the former locked behavior,
but there the visibility of the new rule only depended on the timing
of the locking functions.
A new rule is first added to the segment indices, so the readers may
find the rule in the indices before the rule is visible in the
subtables 'rules' map. This may result in us losing the opportunity
to quit lookups earlier, resulting in sub-optimal wildcarding. This
will be fixed by forthcoming revalidation always scheduled after flow
table changes.
Similar behavior may happen due to us removing the overlapping rule
(if any) from the indices only after the corresponding new rule has
been added.
The subtable's max priority is updated only after a rule is inserted
to the maps, so the concurrent readers may not see the rule, as the
updated priority ordered subtable list will only be visible after the
subtable's max priority is updated.
Similarly, the classifier's partitions are updated by the caller after
the rule is inserted to the maps, so the readers may keep skipping the
subtable until they see the updated partitions.
Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com>
Acked-by: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
After a quick analysis, in most cases the access to refcounted objects
is clearly protected either with an explicit lock/mutex, or RCU. there
are only a few places where I left a call to ovs_refcount_unref().
Upon closer analysis it may well be that those could also use the
relaxed form.
Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com>
Acked-by: Ben Pfaff <blp@nicira.com>
This requires less locking and makes introducing lockless classifier
lookups possible.
Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com>
Acked-by: Ben Pfaff <blp@nicira.com>