In some places we access the array of datapath ports without
RCU protection. This introduces a new function to check that, in
these cases, the dp mutex is held for protection.
Found with sparse.
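A minimal sketch of the kind of check this adds (the helper and
field names here are illustrative, not the actual ones in the tree):

    /* Illustrative: return a port while asserting that the dp mutex,
     * rather than RCU, is what protects the access. */
    static struct dp_port *get_port_protected(struct datapath *dp, int port_no)
    {
            WARN_ON_ONCE(!mutex_is_locked(&dp->mutex));
            return dp->ports[port_no];
    }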
Signed-off-by: Jesse Gross <jesse@nicira.com>
Acked-by: Ben Pfaff <blp@nicira.com>
There are several places where the flow table is accessed
without any kind of RCU protection. This is fine because the dp
mutex is held, so this adds checks for that condition.
Found with sparse.
Signed-off-by: Jesse Gross <jesse@nicira.com>
Acked-by: Ben Pfaff <blp@nicira.com>
Additional configuration is passed down to the kernel in the "config"
array of an odp_port when a vport is created. This information is not
returned when a vport is queried, though. It is useful for debugging,
since it can be used to distinguish ports based on additional data,
such as the peer in tunnels. In a forthcoming patch, it will be
essential to distinguish between plain GRE and GRE over IPsec.
Signed-off-by: Justin Pettit <jpettit@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
The n_buckets argument to tbl_create() can be zero, but the comment didn't
mention that. However, there's no reason that the callers can't just pass
in a correct size, so this commit changes them to do that.
Also, TBL_L1_SIZE was conceptually wrong as the minimum size: the minimum
size is one L2 page, i.e. TBL_L2_SIZE. But TBL_MIN_BUCKETS seems like a
better all-around way to indicate the minimum size, so this commit also
introduces that macro and uses it.
Jesse Gross pointed out inconsistencies in this area.
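A sketch of the macro and a call site, following the description
above (the exact definition is assumed):

    /* Assumed from the description: the minimum is one L2 page of buckets. */
    #define TBL_MIN_BUCKETS TBL_L2_SIZE

    table = tbl_create(TBL_MIN_BUCKETS);    /* callers now pass a real size */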
Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
The sparse checker was complaining about incorrect address spaces (e.g.
__user versus non-__user pointers). I looked at each of them, checked
that the code looked correct to me, and added the appropriate __force
annotations to casts.
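For reference, the annotation idiom looks like this (an assumed
example rather than a specific line from the patch):

    /* __force tells sparse that this address-space conversion is
     * intentional and has been verified by hand. */
    void __user *uptr = (__force void __user *)kptr;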
Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
This is a small optimization for the case where a new flow is being added
to the flow table.
Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
The functions to get and set the checksum pointers consistently across
different kernel versions had different interpretations of what the
csum_offset pointer was relative to, which is confusing, to say the least.
This makes the meaning the same as skb->csum_offset in modern kernels
and updates the caller. For a given function the results were consistent
across kernel versions, and the callers knew what the meaning should be, so
this doesn't actually fix any bugs.
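As an illustration of the convention this settles on: in modern kernels
skb->csum_offset is relative to skb->csum_start, so the checksum field
can be located as in this assumed sketch:

    /* Assumed sketch: csum_start is relative to skb->head, and
     * csum_offset is relative to csum_start. */
    static inline __sum16 *csum_field(struct sk_buff *skb)
    {
            return (__sum16 *)(skb->head + skb->csum_start + skb->csum_offset);
    }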
Signed-off-by: Jesse Gross <jesse@nicira.com>
Acked-by: Ben Pfaff <blp@nicira.com>
If the allocation of percpu stats fails when creating a new
datapath, we currently don't return the correct error code. Since
we don't explicitly set it when the allocation fails, it will keep
the value from the previous call. This means we will return success
when the creation actually failed.
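The shape of the fix is the usual one of setting the error code before
the allocation that can fail (names here are approximate):

    /* Approximate: ensure err holds -ENOMEM on failure rather than a
     * stale value from an earlier call. */
    err = -ENOMEM;
    dp->stats_percpu = alloc_percpu(struct dp_stats_percpu);
    if (!dp->stats_percpu)
            goto err_destroy_table;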
Signed-off-by: Jesse Gross <jesse@nicira.com>
Acked-by: Ben Pfaff <blp@nicira.com>
When reading actions without rcu_read_lock held, we need to hold the
datapath lock. This uses lockdep to check that we do.
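A sketch of the idiom (structure and field names assumed):

    /* Sketch: rcu_dereference_protected() lets lockdep verify that the
     * stated lock is actually held when RCU read-side protection isn't. */
    acts = rcu_dereference_protected(flow->sf_acts,
                                     lockdep_is_held(&dp->mutex));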
Signed-off-by: Jesse Gross <jesse@nicira.com>
Acked-by: Ben Pfaff <blp@nicira.com>
When accessing the flow table without holding rcu_read_lock
we need to hold the lock on the datapath. This enables lockdep
to validate that this is the case.
Signed-off-by: Jesse Gross <jesse@nicira.com>
Acked-by: Ben Pfaff <blp@nicira.com>
When accessing the array of DPs, we need to hold either rcu_read_lock
or dp_mutex. This enables lockdep to validate those conditions.
Signed-off-by: Jesse Gross <jesse@nicira.com>
Acked-by: Ben Pfaff <blp@nicira.com>
If the RTNL lock is used to protect updates to RCU data structures,
then it isn't necessary to use rcu_dereference() to access them while
RTNL is held. This adds rtnl_dereference() to access these pointers,
which has several benefits: it documents the locking expectations;
it checks that RTNL actually is held when run with lockdep; and it
keeps sparse from complaining about directly accessing RCU pointers.
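On kernels that lack it, rtnl_dereference() can be supplied in compat
code roughly as upstream defines it:

    #ifndef rtnl_dereference
    #define rtnl_dereference(p) \
            rcu_dereference_protected(p, lockdep_rtnl_is_held())
    #endif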
Signed-off-by: Jesse Gross <jesse@nicira.com>
Acked-by: Ben Pfaff <blp@nicira.com>
We have generally been using the byte-order-specific data types
(i.e. __be32 instead of u32) in most places. This corrects a
declaration and adds a few needed casts.
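For example, sparse flags a plain assignment between the two types,
which keeps byte-order conversions explicit:

    __be32 addr = iph->daddr;       /* network byte order; sparse-checked */
    u32 host_addr = ntohl(addr);    /* explicit conversion, not a bare cast */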
Signed-off-by: Jesse Gross <jesse@nicira.com>
Acked-by: Ben Pfaff <blp@nicira.com>
Sparse can warn about incorrect usage of RCU via direct access to
pointers when used in conjunction with __rcu and CONFIG_SPARSE_RCU_POINTER.
This adds the necessary annotations.
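The annotation and the accessors it enforces look like this
(illustrative member, not the exact one in the tree):

    struct tbl __rcu *table;                    /* illustrative declaration */

    rcu_assign_pointer(dp->table, new_table);   /* writer */
    t = rcu_dereference(dp->table);             /* RCU read-side */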
Signed-off-by: Jesse Gross <jesse@nicira.com>
Acked-by: Ben Pfaff <blp@nicira.com>
We have generally been using the __user annotation, but there were
a few places where it was missing or a cast was needed.
Signed-off-by: Jesse Gross <jesse@nicira.com>
Acked-by: Ben Pfaff <blp@nicira.com>
As David Miller pointed out on netdev today, IS_ERR has a built-in
'unlikely', so there's no point in adding one of our own.
Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
If we detect a packet that is looping we kill the flow but then
don't do anything with the packet that caused the problem in the
first place, so this frees the packet. This isn't a very serious
leak because we try to shut off the flow that led to the loop
as early as possible. Once this happens, packets will no longer
hit the loop detector and will be freed just as any other packet
that should be dropped.
It also fixes an issue where the offset to the stats counter is
uninitialized after a loop is detected.
Signed-off-by: Jesse Gross <jesse@nicira.com>
Acked-by: Ben Pfaff <blp@nicira.com>
It's possible for skb_gso_segment() to return NULL, but only if the
hardware supports the correct form of segmentation offload and just
wants software to verify the offload parameters. However, since we're
not hardware and don't support any kind of segmentation offload natively,
we can never get into this situation. Therefore drop the check and
comment.
Signed-off-by: Jesse Gross <jesse@nicira.com>
Acked-by: Ben Pfaff <blp@nicira.com>
We have a need to identify tunnels with keys longer than 32 bits. This
commit adds basic datapath and OpenFlow support for such keys. It doesn't
actually add any tunnel protocols that support 64-bit keys, so this is not
very useful yet.
The 'arg' member of struct odp_msg had to be expanded to 64 bits also,
because it sometimes contains a tunnel ID. This member also contains the
argument passed to ODPAT_CONTROLLER, so I expanded that action's argument
to 64 bits also so that it can use the full width of the expanded 'arg'.
Userspace doesn't take advantage of the new space though (it was only
using 16 bits anyhow).
This commit has been tested only to the extent that it doesn't disrupt
basic Open vSwitch operation. I have not tested it with tunnel traffic.
Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
Feature #3976.
In the medium term, we plan to migrate the datapath to use Netlink as its
communication channel. In the short term, we need to be able to have
actions with 64-bit arguments but "struct odp_action" only has room for
48 bits. So this patch shifts to variable-length arguments using Netlink
attributes, which starts in on the Netlink transition and makes 64-bit
arguments possible at the same time.
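As a sketch, an action with a 64-bit argument can then be emitted as an
ordinary Netlink attribute (the attribute name is assumed, and
nla_put_u64() is used as it existed on kernels of this vintage):

    /* Assumed sketch: the attribute type identifies the action and the
     * payload carries its argument, so 64 bits fit naturally. */
    if (nla_put_u64(skb, ODPAT_SET_TUNNEL, tun_id))
            goto nla_put_failure;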
Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
There have been two ops to support async access to the datapath
character device for a long time, but they have never been implemented.
Drop the commented-out code.
Signed-off-by: Jesse Gross <jesse@nicira.com>
Acked-by: Ben Pfaff <blp@nicira.com>
Currently we get a pointer to the DP in openvswitch_read() and
openvswitch_poll() and use it without any synchronization. This means
that the DP could disappear from underneath us while we are using it.
Currently, this isn't a problem because userspace is single-threaded, but
it's better for the locking to be correct.
With this change we hold the mutex while doing a blocking wait, which
means that no changes can be made, including adding/removing flows. It's
possible to make this finer-grained, but for the time being that isn't done,
since current userspace doesn't care.
Found with lockdep.
Signed-off-by: Jesse Gross <jesse@nicira.com>
Acked-by: Ben Pfaff <blp@nicira.com>
We currently drop RTNL before adding a new datapath to sysfs but
then access the dp data structures. This moves the call to
dp_sysfs_add_dp() before we drop the locks to prevent a potential
race. All other calls to sysfs functions already hold RTNL.
Found with lockdep.
Signed-off-by: Jesse Gross <jesse@nicira.com>
Acked-by: Ben Pfaff <blp@nicira.com>
Mark functions and global variables used only in a single file as
static.
Found with sparse.
Signed-off-by: Jesse Gross <jesse@nicira.com>
Acked-by: Ben Pfaff <blp@nicira.com>
Checksum offloading has changed quite a bit across different kernel
and Xen versions. Since it is part of the skb data structure, it is
unfortunately difficult to separate out into compatibility code.
This consolidates all of the checksum code in one place, which makes
it easier to read and remove as we prepare for upstreaming. On newer
kernels it also puts everything in inline functions, eliminating the
need to run through the compat code or make extra function calls.
Signed-off-by: Jesse Gross <jesse@nicira.com>
Acked-by: Ben Pfaff <blp@nicira.com>
These steps always happen in lockstep, so we might as well combine them.
This expands the region over which the vport_lock is held. I didn't
carefully verify that this was OK.
This also eliminates the synchronize_rcu() call from destruction of tunnel
vports, since they didn't appear to me to need it.
It should be possible to eliminate the synchronize_rcu() from the netdev,
patch, and internal_dev vports, but this commit does not do that.
Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
After the previous commit, which changed the datapath to always create and
attach a vport at the same time, and to always detach and delete a vport
at the same time, there is no longer any real distinction between a dp_port
and a vport. This commit, therefore, merges the two together to simplify
code. It might even improve performance, although I have not checked.
I wasn't sure at first whether the merged structure should be "struct
dp_port" or "struct vport". I went with the latter since the "v" prefix
sounds cool.
Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
For some time now, Open vSwitch datapaths have internally made a
distinction between adding a vport and attaching it to a datapath. Adding
a vport just means to create it, as an entity detached from any datapath.
Attaching it gives it a port number and a datapath. Similarly, a vport
could be detached and deleted separately.
After some study, I think I understand why this distinction exists. It is
because ovs-vswitchd tries to open all the datapath ports before it tries
to create them. However, changing it to create them before it tries to
open them is not difficult, so this commit does this.
The bulk of this commit, however, changes the datapath interface to one
that always creates a vport and attaches it to a datapath in a single step,
and similarly detaches a vport and deletes it in a single step.
Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
Upcoming commits will keep needing to pass more information to the vport
'create' member function. It's annoying to have to modify a dozen pieces
of code every time just to do this, so this commit encapsulates all of the
parameters in a new struct and passes that instead.
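The struct looks something like this (member list assumed for
illustration):

    /* Assumed shape of the consolidated parameter block. */
    struct vport_parms {
            const char *name;
            enum vport_type type;
            const void *config;     /* type-specific configuration */
            struct datapath *dp;    /* datapath to attach to */
            u16 port_no;
    };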
Acked-by: Jesse Gross <jesse@nicira.com>
Variables which are changed only infrequently should be annotated
with __read_mostly, which will group them together in a special
linker section. This prevents them from sharing cache lines with
data on the hot path.
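For example:

    /* Written once at module init, read on every packet: a good
     * candidate for __read_mostly. */
    static struct kmem_cache *flow_cache __read_mostly;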
Signed-off-by: Jesse Gross <jesse@nicira.com>
Acked-by: Ben Pfaff <blp@nicira.com>
Our normal loop detection requires disabling preemption while
packet processing takes place. On RT kernels this isn't acceptable
and interacts badly with spinlocks, so we can't use it. This
takes advantage of some extra space that is added to struct
task_struct on RT kernels (and the knowledge that we will always
have a valid task_struct) to store the loop counter for a given
thread. Since we can't make these assumptions on non-RT kernels,
we continue to use the previous method of loop detection there.
Signed-off-by: Jesse Gross <jesse@nicira.com>
Acked-by: Ben Pfaff <blp@nicira.com>
There's no need to have a mask in this action, because both parts of the
TCI are part of the flow structure.
Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
Commit a01ef04ce "datapath: Drop padding from struct odp_flow_key" removed
the "reserved" member from struct odp_flow_key but overlooked uses of that
member from 64-bit compatibility code. This commit fixes up the problem.
Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
This commit fixes a problem where a datapath flow's "used" stats can
report an incorrect value, which may result in incorrect flow
expirations. This happens when a kernel jiffies rollover occurs between
the last time a flow is hit and the time its stats are queried.
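For reference, jiffies arithmetic is only safe across a rollover when
done as an unsigned difference or via the time_after() family, roughly
(field name assumed):

    /* Illustrative: the unsigned subtraction wraps correctly across a
     * jiffies rollover; comparing or reporting raw values does not. */
    unsigned long idle = (unsigned long)jiffies - flow->used;
    unsigned int idle_ms = jiffies_to_msecs(idle);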
Acked-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: Hao Zheng <hzheng@nicira.com>
These functions run 99% of the time in atomic context and the benefit of
passing along the 'gfp' argument for the other 1% doesn't seem to outweigh
the cost.
Suggested-by: Stephen Hemminger <shemminger@vyatta.com>
Acked-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: Ben Pfaff <blp@nicira.com>
Breaking this out as a separate commit should make it easier to see what
needs to change later, if we need to reintroduce padding at some point.
Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
This allows eliminating padding from odp_flow_key, although actually doing
that is postponed until the next commit.
Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
The "port group" concept seems like a good one, but it has not been
used very much in userspace so far, so before we commit ourselves to
a frozen API that we must maintain forever, remove it. We can always
add it back in later as a new kind of vport.
Signed-off-by: Ben Pfaff <blp@nicira.com>
An upcoming commit will use some workqueue functions that weren't
available on earlier kernels, so this backports those functions.
The backporting uses timers instead of delayed work queues because
the earlier versions of work queues have some unsafe corner cases.
In addition, this removes some work queue backporting code that is
no longer used because it is potentially unsafe.
Note that this commit changes the behavior of work queues: normally
they run in process context but the backported version runs in
softirq context.
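A sketch of the timer-based approach (all names invented for
illustration; the timer is assumed to be set up elsewhere with
setup_timer()):

    /* Invented names, sketch only.  A timer stands in for a delayed work
     * item, so the handler runs in softirq rather than process context. */
    struct compat_dwork {
            struct timer_list timer;
            void (*func)(struct compat_dwork *);
    };

    static void compat_dwork_fire(unsigned long data)   /* timer callback */
    {
            struct compat_dwork *dw = (struct compat_dwork *)data;
            dw->func(dw);
    }

    static void compat_schedule_dwork(struct compat_dwork *dw,
                                      unsigned long delay)
    {
            mod_timer(&dw->timer, jiffies + delay);
    }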
Signed-off-by: Jesse Gross <jesse@nicira.com>
Reviewed-by: Ben Pfaff <blp@nicira.com>
is_frag is only used for communication between two functions, which
means that it doesn't really need to be in the SKB CB. This wouldn't
necessarily be a problem except that there are also a number of other
paths that lead to it being uninitialized. This isn't a problem
now, but uninitialized memory seems dangerous and there isn't much
upside.
Signed-off-by: Jesse Gross <jesse@nicira.com>
Reviewed-by: Ben Pfaff <blp@nicira.com>
An upcoming commit will add support for supplying cached flows for
packets entering the datapath. This adds the code in the datapath
itself to recognize these cached flows and use them instead of
extracting the flow fields and doing a lookup.
Signed-off-by: Jesse Gross <jesse@nicira.com>
Reviewed-by: Ben Pfaff <blp@nicira.com>
Currently flows are only used within the confines of one
rcu_read_lock()/rcu_read_unlock() session. However, with the
addition of header caching we will need to hold references to flows
for longer periods of time. This adds support for that by adding
refcounts to flows. RCU is still used for normal packet handling
to avoid a performance impact from constantly updating the refcount.
However, instead of directly freeing the flow after a grace period,
we simply decrement the refcount.
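The resulting pattern is roughly (function names assumed):

    /* Assumed sketch: the RCU callback only drops a reference; the flow
     * is actually freed when the last holder releases it. */
    static void flow_free_rcu(struct rcu_head *rcu)
    {
            struct sw_flow *flow = container_of(rcu, struct sw_flow, rcu);
            flow_put(flow);
    }

    void flow_deferred_free(struct sw_flow *flow)
    {
            call_rcu(&flow->rcu, flow_free_rcu);
    }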
Signed-off-by: Jesse Gross <jesse@nicira.com>
Reviewed-by: Ben Pfaff <blp@nicira.com>
As the process to allocate a flow becomes more involved, it becomes
more cumbersome for the code to be mixed in with the general datapath
code, so split it out into a new function.
Signed-off-by: Jesse Gross <jesse@nicira.com>
Reviewed-by: Ben Pfaff <blp@nicira.com>
Nothing uses it anymore and it causes problems when backported on
some distributions. Kernels we support have net_random(), which
is the same thing, so there is no reason to have an entire copy of
the random number generator in our source tree.
Reported-by: Alexey I. Froloff <raorn@altlinux.org>
Signed-off-by: Jesse Gross <jesse@nicira.com>
Acked-by: Ben Pfaff <blp@nicira.com>
flow_actions_alloc() returns an error code in the form of a pointer,
but we checked that the pointer was not NULL, which is always true.
This caused oopses on allocation errors, when we would write through
an invalid pointer.
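The corrected check, sketched (variable names approximate):

    /* flow_actions_alloc() signals failure with ERR_PTR(), so a NULL
     * check can never fire; IS_ERR() is the right test. */
    acts = flow_actions_alloc(n_actions);
    if (IS_ERR(acts)) {
            error = PTR_ERR(acts);
            goto error;
    }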
NIC-234
Signed-off-by: Jesse Gross <jesse@nicira.com>
Acked-by: Ben Pfaff <blp@nicira.com>
Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Simon Horman <horms@verge.net.au>
[Jesse: Added missing pr_fmt in vport-gre.c and dp_sysfs_dp.c]
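[For reference, the standard pr_fmt() idiom, defined before any
#include so that pr_err() and friends pick it up:

    #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
]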
Signed-off-by: Jesse Gross <jesse@nicira.com>
Until now flow_extract() has simply returned a bogus flow when memory
allocation errors occurred. This fixes the problem by propagating the
error to the caller.
Signed-off-by: Ben Pfaff <blp@nicira.com>
flow_extract() can fail due to memory allocation errors in pskb_may_pull().
Currently it doesn't return those properly, instead just reporting a bogus
flow to the caller. But its return value is currently in use for reporting
whether the packet was an IPv4 fragment. This commit switches to reporting
that in the skb itself so that the return value can be reused to report
errors.
Signed-off-by: Ben Pfaff <blp@nicira.com>
We enforce mutual exclusion when updating statistics by disabling
bottom halves and only writing to per-CPU state. However, reading
requires looking at the statistics for foreign CPUs, which could be
in the process of updating them since there isn't a lock. This means
we could get garbage for 64-bit values on 32-bit machines, or
byte counts that don't correspond to packet counts, etc.
This commit introduces a sequence lock for statistics values to avoid
this problem. Getting a write lock is very cheap: it only requires
incrementing a counter plus a memory barrier (which is compiled away
on x86) to acquire or release the lock, and it will never block. On
read we spin until the sequence number hasn't changed in the middle
of the operation, indicating that we have a consistent set of
values.
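The resulting pattern is essentially the kernel seqcount idiom
(member names assumed):

    /* Writer: cheap and never blocks. */
    write_seqcount_begin(&stats->seqlock);
    stats->n_packets++;
    stats->n_bytes += skb->len;
    write_seqcount_end(&stats->seqlock);

    /* Reader: retry until the counter is unchanged across the reads. */
    do {
            start = read_seqcount_begin(&stats->seqlock);
            packets = stats->n_packets;
            bytes = stats->n_bytes;
    } while (read_seqcount_retry(&stats->seqlock, start));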
Signed-off-by: Jesse Gross <jesse@nicira.com>