Currently the functions to set, clear, and iterate over bitmaps
only operate over 32 bit values. If we convert them to handle
64 bit bitmaps, they can be used in more places.
Suggested-by: Ben Pfaff <blp@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
Acked-by: Ben Pfaff <blp@nicira.com>
Serializing between userspace flows and netlink attributes currently
requires several additional parameters besides the flows themselves.
This will continue to grow in the future as well. This converts
the function arguments to a parameters struct, which makes the code
easier to read and allowing irrelevant arguments to be omitted.
Signed-off-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: Andy Zhou <azhou@nicira.com>
Until now there have been two variants for --enable-dummy:
* --enable-dummy: This adds support for "dummy" dpif and netdev.
* --enable-dummy=override: In addition, this replaces *every* existing
dpif and netdev by the dummy type.
The latter is useful for testing but it defeats the possibility of using
the userspace native tunneling implementation (because all the tunnel
netdevs get replaced by dummy netdevs). Thus, this commit adds a third
variant:
* --enable-dummy=system: This replaces the "system" dpif and netdev
by dummies but leaves the others untouched.
Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Alex Wang <alexw@nicira.com>
When --enable-dummy=system or --enable-dummy=override is in use, dpifs
other than "dummy" are actually dummy dpifs, so use a more reliable test.
Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Alex Wang <alexw@nicira.com>
It appears that miniflow_extract() in emc_processing() spends a lot of
cycles waiting for the packet's data to be read.
Prefetching the next packet's data while parsing removes this delay.
For a single flow pipeline the throughput improves by ~10%. With a
more realistic pipeline the change has a much smaller effect (~0.5%
improvement)
Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com>
Signed-off-by: Ethan Jackson <ethan@nicira.com>
Acked-by: Ethan Jackson <ethan@nicira.com>
This function doesn't need to be exported in the public OVS headers, and
it had an inconsistent name compared to uuid_equals(). Rename and move.
Signed-off-by: Joe Stringer <joestringer@nicira.com>
Acked-by: Ben Pfaff <blp@nicira.com>
Non pmd threads have a core_id == UINT32_MAX, while queue ids used by
netdevs range from 0 to the number of CPUs. Therefore core ids cannot
be used directly to select a queue.
This commit introduces a simple mapping to fix the problem: pmd threads
continue using queues 0 to N (where N is the number of CPUs in the
system), while non pmd threads use queue N+1.
Fixes: d5c199ea7ff7 ("netdev-dpdk: Properly support non pmd threads.")
Reported-by: 차은호 <eunho.cha@atto-research.com
Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com>
Signed-off-by: Mark D. Gray <mark.d.gray@intel.com>
Signed-off-by: Ethan Jackson <ethan@nicira.com>
Acked-by: Flavio Leitner <fbl@redhat.com>
Acked-by: Ethan Jackson <ethan@nicira.com>
We used to reserve DPDK lcore 0 for non pmd operations, making it
difficult to use core 0 for packet processing.
DPDK 2.0 properly support non EAL threads with lcore LCORE_ID_ANY.
Using non EAL threads for non pmd threads, we do not need to reserve
any core for non pmd operations
Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com>
Signed-off-by: Ethan Jackson <ethan@nicira.com>
Acked-by: Ethan Jackson <ethan@nicira.com>
DPDK lcore_id is unsigned. We need to support big values like
LCORE_ID_ANY (=UINT32_MAX). Therefore I am changing the type everywhere
in OVS.
Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com>
Signed-off-by: Ethan Jackson <ethan@nicira.com>
Acked-by: Ethan Jackson <ethan@nicira.com>
Having the same RSS hash after recirculation can cause unnecessary
collisions in the exact match cache. A simple solution is to rehash it
with the recirculation depth if it is non-zero.
Suggested-by: Ethan Jackson <ethan@nicira.com>
Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com>
Signed-off-by: Ethan Jackson <ethan@nicira.com>
Acked-by: Ethan Jackson <ethan@nicira.com>
When executing actions, it's possible a recirculation will occur
causing dp_netdev_input() to be called multiple times. If the batch
pointers embedded in dp_netdev_flow aren't cleared, it's possible
packets after the recirculation will be reinserted into a batch
associated with the original lookup. This could be very bad.
This patch fixes the problem by zeroing out flow batch pointers before
calling packet_batch_execute(). This probably has a slightly negative
performance impact, though I haven't tried it.
Signed-off-by: Ethan Jackson <ethan@nicira.com>
Acked-by: Daniele Di Proietto <diproiettod@vmware.com>
Prior to this commit, the number of possible entries in the Exact
Match Cache stood at 1024 per thread exacting to 0.18Mb. A typical
server system will have 2.5Mb cache per core meaning a larger EMC will
comfortably fit in. This patch increases the number of entries to 8192
per thread (1.4Mb) which in turn yields improved throughput when
processing multiple flows of traffic.
Signed-off-by: Ciara Loftus <ciara.loftus@intel.com>
Signed-off-by: Ethan Jackson <ethan@nicira.com>
Acked-by: Daniele Di Proietto <diproiettod@vmware.com>
Acked-by: Ethan Jackson <ethan@nicira.com>
The MAX_PKT_BURST and NETDEV_MAX_RX_BATCH macros had a confusing
relationship. They basically purport to do the same thing, making it
unclear which is the source of truth.
Furthermore, while NETDEV_MAX_RX_BATCH was 256, MAX_PKT_BURST was 32,
meaning we never process a batch larger than 32 packets further adding
to the confusion.
This patch resolves the issue by removing MAX_PKT_BURST completely,
and shrinking the new NETDEV_MAX_BURST macro to only 32. This should
have no change in the execution path except shrinking a couple of
structs and memory allocations (can't hurt).
Signed-off-by: Ethan Jackson <ethan@nicira.com>
Acked-by: Daniele Di Proietto <diproiettod@vmware.com>
Until now the exact match cache processing was able to handle only four
megaflows. The rest of the packets was passed to the megaflow
classifier.
The limit was arbitraly set to four also because the algorithm used to
group packets in output batches didn't perform well with a lot of
megaflows.
After changing the algorithm and after some performance testing it seems
much better just to share the same output batches between the exact
match cache and the megaflow classifier.
Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
The userspace datapath
1. receives a batch of packets.
2. finds a 'netdev_flow' (megaflow) for each packet.
3. groups the packets in output batches based on the 'netdev_flow'.
Until now the grouping (2) was done using a simple algorithm with a
O(N^2) runtime, where N is the number of distinct megaflows of the packets
in the incoming batch. This could quickly become a bottleneck, even with
a small number of megaflows.
With this commit the datapath simply stores in the 'netdev_flow' (the
megaflow) a pointer to the output batch, if one has been created for the
current input batch. The pointer will be cleared when the output batch
is sent.
In a simple phy2phy test with 128 megaflows the throughput is more than
doubled.
The reason that stopped us from doing this change was that the
'netdev_flow' memory was shared between multiple threads: this is no
longer the case with the per-thread classifier.
Also, this commit reorders struct dp_netdev_flow to group toghether the
members used in the fastpath.
Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Initializing a struct pkt_metadata for every packet can be surprisingly
expensive. It's much faster to keep a copy for each port and copying it
on each packet.
Suggested-by: Pravin Shelar <pshelar@nicira.com>
Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Now that we have per packet metadata, there's no need to split packet
batches when recirculating.
Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com>
Acked-by: Ethan Jackson <ethan@nicira.com>
We already have the 'dp_hash' embedded in the metadata. This caused
confusion in the code. With this commit it should be clear that
'rss_hash' is the packet hash used for internal purposes, while
'md.dp_hash' is part of the flow, computed during the execution of
certain actions.
Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com>
Acked-by: Ethan Jackson <ethan@nicira.com>
Calling time_msec() (which calls clock_gettime()) too often might be
expensive. With this commit OVS makes only one call per received
batch and caches the result.
Suggested-by: Ethan Jackson <ethan@nicira.com>
Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com>
Acked-by: Ethan Jackson <ethan@nicira.com>
As stated by the comment above the structure, the 'action' pointer does not
change during the 'dp_netdev_actions' lifetime: we might as well embed
the pointed memory into the structure.
The commit also updates the description of dp_netdev_actions_create().
Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com>
Acked-by: Ethan Jackson <ethan@nicira.com>
Otherwise it is at least very confusing.
Found during testing. An upcoming commit adds a test.
Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Daniele Di Proietto <diproiettod@vmware.com>
These commands can be used to get packets and cycles counters on a pmd
thread basis. They're useful to get a clearer picture about the
performance of the userspace datapath.
They export these pieces of information:
- A (per-thread) view of the caches hit rate. Hits in the exact match
cache are reported separately from hits in the masked classifier
- A rough cycles count. This will allow to estimate the load of OVS and
the polling overhead.
Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com>
Acked-by: Ethan Jackson <ethan@nicira.com>
This init function is called when the dpif class is registered. It will
be used by following commits
Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com>
Acked-by: Ethan Jackson <ethan@nicira.com>
The counters use x86 TSC if available (currently only with DPDK). They
will be exposed by subsequents commits
Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com>
Acked-by: Ethan Jackson <ethan@nicira.com>
We used to count exact match cache hits and masked classifier hits
together. This commit splits the DP_STAT_HIT counter into two.
This change will be used by future commits.
Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com>
Acked-by: Ethan Jackson <ethan@nicira.com>
A read operation from a non atomic shared value (without external
locking) can return incorrect values. Using the atomic semantics
prevents this from happening.
However:
* No memory barriers are used. We don't need that kind of consistency
for statistics (we use relaxed operations).
* The updates are not atomic, just the loads and stores. This is ok
because there's a single writer.
Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com>
Acked-by: Ethan Jackson <ethan@nicira.com>
Since statistics updates might require locking (in future commits)
grouping them will reduce the locking overhead.
Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com>
Acked-by: Ethan Jackson <ethan@nicira.com>
Since flow statistics are thread local and updated without any lock,
it is not correct to do a memset from another thread.
This commit simply removes the support for the flag. It is not needed
by ofproto-dpif, it is only exposed by dpctl commands.
Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com>
Acked-by: Ethan Jackson <ethan@nicira.com>
Packets for which an upcall has failed (lost packets) must be deleted.
We also need to count them as MISS and LOST.
Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com>
Acked-by: Ethan Jackson <ethan@nicira.com>
ofpbuf was complicated due to its wide usage across all
layers of OVS, Now we have introduced independent dp_packet
which can be used for datapath packet, we can simplify ofpbuf.
Following patch removes DPDK mbuf and access API of ofpbuf
members.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Jarno Rajahalme <jrajahalme@nicira.com>
Acked-by: Ben Pfaff <blp@nicira.com>
Currently dp-packet make use of ofpbuf for managing packet
buffers. That complicates ofpbuf, by making dp-packet
independent of ofpbuf both libraries can be optimized for
their own use case.
This avoids mapping operation between ofpbuf and dp_packet
in datapath upcalls.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Jarno Rajahalme <jrajahalme@nicira.com>
Acked-by: Ben Pfaff <blp@nicira.com>
dp_packet is short and better name for datapath packet
structure.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Jarno Rajahalme <jrajahalme@nicira.com>
Before this patch, dp_netdev_flow_add() inserted newly minted flows in
the "flow_table" cmap before inserting them into the per core "dpcls"
classifier. Since dpcls_insert() initializes 'flow->cr.mask', there's
a brief window where the flow is accessible from the cmap, but has a
bogus mask value.
In my testing, under rare instances (i.e. once every 20 minutes with a
very specific flow table and traffic pattern), revalidators core dump
when they call dpif_netdev_flow_dump_next(), which accesses this bogus
mask value from dp_netdev_flow_to_dpif_flow().
By inserting into the per core classifier before the cmap, all the
values are guaranteed to be initialized during flow dumps. With this
patch, I can no longer reproduce the crash.
Signed-off-by: Ethan Jackson <ethan@nicira.com>
Acked-by: Jarno Rajahalme <jrajahalme@nicira.com>
So far the compressed flow data in struct miniflow has been in 32-bit
words with a 63-bit map, allowing for a maximum size of struct flow of
252 bytes. With the forthcoming Geneve options this is not sufficient
any more.
This patch solves the problem by changing the miniflow data to 64-bit
words, doubling the flow max size to 504 bytes. Since the word size
is doubled, there is some loss in compression efficiency. To counter
this some of the flow fields have been reordered to keep related
fields together (e.g., the source and destination IP addresses share
the same 64-bit word).
This change should speed up flow data processing on 64-bit CPUs, which
may help counterbalance the impact of making the struct flow bigger in
the future.
Classifier lookup stage boundaries are also changed to 64-bit
alignment, as the current algorithm depends on each miniflow word to
not be split between ranges. This has resulted in new padding (part
of the 'mpls_lse' field).
The 'dp_hash' field is also moved to packet metadata to eliminate
otherwise needed padding there. This allows the L4 to fit into one
64-bit word, and also makes matches on 'dp_hash' more efficient as
misses can be found already on stage 1.
Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com>
Acked-by: Ben Pfaff <blp@nicira.com>
Add support for adding 64-bit words to hashes. This will be used by
subsequent patches.
Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com>
Acked-by: Ben Pfaff <blp@nicira.com>
This commit changes the per dpif-netdev datapath flow-table/
classifier to per pmd-thread. As direct benefit, datapath
and flow statistics no longer need to be protected by mutex
or be declared as per-thread variable, since they are only
written by the owning pmd thread.
As side effects, the flow-dump output of userspace datapath
can contain overlapping flows. To reduce confusion, the dump
from different pmd thread will be separated by a title line.
In addition, the flow operations via 'ovs-appctl dpctl/*'
are modified so that if the given flow in_port corresponds
to a dpdk interface, the operation will be conducted to all
pmd threads recv from that interface (expect for flow-get
which will always be applied to non-pmd threads).
Signed-off-by: Alex Wang <alexw@nicira.com>
Tested-by: Mark D. Gray <mark.d.gray@intel.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
This commit adds the function dp_netdev_get_pmd() which allows
users to get 'struct dp_netdev_pmd_thread' based on the core id.
Signed-off-by: Alex Wang <alexw@nicira.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Previously, the dpif layer was responsible for determining datapath
support for UFIDs, which resulted in all ovs-dpctl utilities
inserting/deleting flows from the datapath each time they are run.
Shift this responsibility up to the dpif_backer.
There are two users of this functionality: Revalidators check for UFID
support to request a terser dump using UFIDs, and dpif-netlink uses this
to request flow_del operations to only return the UFID/stats. The latter
case was previously hidden from revalidators, but this change makes them
aware of it, and reuses the same "udpif->enable_ufid" flag for reducing
overhead of both flow dump and flow delete.
Signed-off-by: Joe Stringer <joestringer@nicira.com>
Acked-by: Andy Zhou <azhou@nicira.com>
A new function vlog_insert_module() is introduced to avoid using
list_insert() from the vlog.h header.
Signed-off-by: Thomas Graf <tgraf@noironetworks.com>
Acked-by: Ben Pfaff <blp@nicira.com>
One of the limiting factors on the number of flows that can be supported
in the datapath is the overhead of assembling flow dump messages in the
datapath. This patch modifies the dpif to allow revalidators to skip
dumping the key, mask and actions from the datapath, by making use of
the unique flow identifiers introduced in earlier patches.
For each flow dump, the dpif user specifies whether to skip these
attributes, allowing the common case to only dump a pair of 128-bit ID
and flow stats. With datapath support, this increases the number of
flows that a revalidator can handle per second by 50% or more. Support
in dpif-netdev and dpif-netlink is added in this patch; kernel support
is left for future patches.
Signed-off-by: Joe Stringer <joestringer@nicira.com>
Acked-by: Ben Pfaff <blp@nicira.com>
This patch modifies the dpif interface to allow flows to be manipulated
using a 128-bit identifier. This allows revalidator threads to perform
datapath operations faster, as they do not need to serialise the entire
flow key for operations like flow_get and flow_delete. In conjunction
with a future patch to simplify the dump interface, this provides a
significant performance benefit for revalidation.
When handlers assemble flow_put operations, they specify a unique
identifier (UFID) for each flow as it is passed down to the datapath to
be stored with the flow. The UFID is currently provided to handlers
by the dpif during upcall processing.
When revalidators assemble flow_get or flow_del operations, they may
specify the UFID for the flow along with the key. The dpif will decide
whether to send only the UFID to the datapath, or both the UFID and flow
key. The former is preferred for newer datapaths that support UFID,
while the latter is used for backwards compatibility.
Signed-off-by: Joe Stringer <joestringer@nicira.com>
Acked-by: Ben Pfaff <blp@nicira.com>
On current master, the 'struct dp_netdev_port' is destroyed
immediately when the ref count reaches 0. However, non-pmd
threads calling the dpif_netdev_execute() for sending packets
could hold pointer to 'port' that is not ref-counted. Thusly
those threads could possibly access freed memory when the port
is deleted.
To fix this bug, this commit makes non-pmd threads acquiring
the 'port_mutex' before doing the actual execution in
dpif_netdev_execute().
Signed-off-by: Alex Wang <alexw@nicira.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
This patch shifts the responsibility for determining the hash for a flow
from the revalidation logic down to the dpif layer. This assists in
handling backward-compatibility for revalidation with the upcoming
unique flow identifier "UFID" patches.
A 128-bit UFID was selected to minimize the likelihood of hash conflicts.
Handler threads will not install a flow that has an identical UFID as
another flow, to prevent misattribution of stats and to ensure that the
correct flow key cache is used for revalidation.
For datapaths that do not support UFID, which is currently all
datapaths, the dpif will generate the UFID and pass it up during upcall
and flow_dump. This is generated based on the datapath flow key.
Later patches will add support for datapaths to store and interpret this
UFID, in which case the dpif has a responsibility to pass it through
transparently.
Signed-off-by: Joe Stringer <joestringer@nicira.com>
Acked-by: Ben Pfaff <blp@nicira.com>
On current master, the exact match cache entry can keep reference to
'struct dp_netdev_flow' even after the flow is removed from the flow
table. This means the free of allocated memory of the flow is delayed
until the exact match cache entry is cleared or replaced.
If the allocated memory is ahead of chunks of freed memory on heap,
the delay will prevent the reclaim of those freed chunks, causing
falsely high memory utilization.
To fix the issue, this commit makes the owning thread conduct periodic
garbage collection on the exact match cache and clear dead entries.
Signed-off-by: Alex Wang <alexw@nicira.com>
Acked-by: Jarno Rajahalme <jrajahalme@nicira.com>
---
PATCH -> V2:
- Adopt Jarno's suggestion and conduct slow sweep to avoid introducing
jitter.
This patch adds a new functions classifier_defer() and
classifier_publish(), which control when the classifier modifications
are made available to lookups. By default, all modifications are made
available to lookups immediately. Modifications made after a
classifier_defer() call MAY be 'deferred' for later 'publication'. A
call to classifier_publish() will both publish any deferred
modifications, and cause subsequent changes to to be published
immediately.
Currently any deferring is limited to the visibility of the subtable
vector changes. pvector now processes modifications mostly in a
working copy, which needs to be explicitly published with
pvector_publish(). pvector_publish() sorts the working copy and
removes gaps before publishing it.
This change helps avoiding O(n**2) memory behavior in corner cases,
where large number of rules with different masks are inserted or
deleted.
VMware-BZ: #1322017
Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com>
Acked-by: Ben Pfaff <blp@nicira.com>
Before this commit, when 'struct dp_netdev_port' is deleted from
'dpif-netdev' datapath, if there is pmd thread, the pmd thread
will release the last reference to the port and ovs-rcu postpone
the destroy. However, the delayed close of object like 'struct
netdev' could cause failure in immediate re-add or reconfigure of
the same device.
To fix the above issue, this commit uses condition variable and
makes the main thread wait for pmd thread to release the reference
when deleting port. Then, the main thread can directly destroy the
port.
Reported-by: Cian Ferriter <cian.ferriter@intel.com>
Signed-off-by: Alex Wang <alexw@nicira.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
There is a portion of the 'struct dp_netdev_port' initialization
that is placed after the reload of pmd threads. This means in
theory, there could be a race where pmd threads access half-
initialized struct. Although such race has not been seen, it
makes sense to fully initialize the struct before use.
Found by code inspection.
Signed-off-by: Alex Wang <alexw@nicira.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Following patch adds support for userspace tunneling. Tunneling
needs three more component first is routing table which is configured by
caching kernel routes and second is ARP cache which build automatically
by snooping arp. And third is tunnel protocol table which list all
listening protocols which is populated by vswitchd as tunnel ports
are added. GRE and VXLAN protocol support is added in this patch.
Tunneling works as follows:
On packet receive vswitchd check if this packet is targeted to tunnel
port. If it is then vswitchd inserts tunnel pop action which pops
header and sends packet to tunnel port.
On packet xmit rather than generating Set tunnel action it generate
tunnel push action which has tunnel header data. datapath can use
tunnel-push action data to generate header for each packet and
forward this packet to output port. Since tunnel-push action
contains most of packet header vswitchd needs to lookup routing
table and arp table to build this action.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Jarno Rajahalme <jrajahalme@nicira.com>
Acked-by: Thomas Graf <tgraf@noironetworks.com>
Acked-by: Ben Pfaff <blp@nicira.com>