vlan_deaccel_tag() was introduced to move a vlan tag from skb->vlan_tci
to the packet but there was still an open coded variant when doing
an upcall. vlan_deaccel_tag() also clears skb->vlan_tci, which the
open-coded variant did not do, but that makes no difference.
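As a rough userspace illustration of what moving a tag "to the packet"
involves, the sketch below inserts an 802.1Q header after the two MAC
addresses and then clears the out-of-band tag. The struct toy_pkt type,
its has_tag flag, and toy_deaccel_tag() are hypothetical stand-ins, not
the kernel's skb API or the real vlan_deaccel_tag().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_ETH_ALEN    6        /* bytes per MAC address */
#define TOY_VLAN_HLEN   4        /* TPID (2 bytes) + TCI (2 bytes) */
#define TOY_ETH_P_8021Q 0x8100u  /* 802.1Q tag protocol identifier */

/* Hypothetical stand-in for an skb: frame bytes plus an out-of-band tag. */
struct toy_pkt {
    uint8_t  data[64];
    size_t   len;
    uint16_t vlan_tci;  /* out-of-band TCI, meaningful only if has_tag */
    int      has_tag;   /* simplification of a "tag present" marker */
};

/* Move the out-of-band tag into the frame and clear it.
 * Returns 0 on success, -1 if the frame is too short or has no room. */
static int toy_deaccel_tag(struct toy_pkt *p)
{
    const size_t macs = 2 * TOY_ETH_ALEN;

    assert(p != NULL);
    if (!p->has_tag)
        return 0;
    if (p->len < macs || p->len + TOY_VLAN_HLEN > sizeof(p->data))
        return -1;

    /* Shift everything after the MAC addresses right by four bytes. */
    memmove(p->data + macs + TOY_VLAN_HLEN, p->data + macs, p->len - macs);

    /* Write TPID and TCI in network byte order. */
    p->data[macs]     = (uint8_t)(TOY_ETH_P_8021Q >> 8);
    p->data[macs + 1] = (uint8_t)(TOY_ETH_P_8021Q & 0xff);
    p->data[macs + 2] = (uint8_t)(p->vlan_tci >> 8);
    p->data[macs + 3] = (uint8_t)(p->vlan_tci & 0xff);

    p->len += TOY_VLAN_HLEN;
    p->vlan_tci = 0;    /* the extra clearing step mentioned above */
    p->has_tag = 0;
    return 0;
}
```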
Signed-off-by: Jesse Gross <jesse@nicira.com>
Acked-by: Ben Pfaff <blp@nicira.com>
Using the kernel vlan acceleration has a number of benefits:
it enables hardware tagging, allows usage of TSO and checksum
offloading, and makes tags generally easier to manipulate. This switches
the vlan actions to use skb->vlan_tci field for any necessary
changes. In places that do not support vlan acceleration in a way
that we can use (in particular kernels before 2.6.37) we perform
any necessary conversions, such as tagging and GSO, before the
packet leaves Open vSwitch.
Signed-off-by: Jesse Gross <jesse@nicira.com>
Acked-by: Ben Pfaff <blp@nicira.com>
Until now, tunnel vports have had a specific MTU, in the same way that
ordinary network devices have an MTU, but treating them this way does not
always make sense. For example, consider a datapath that has three ports:
the local port, a GRE tunnel to another host, and a physical port. If
the physical port is configured with a jumbo MTU, it should be possible to
send jumbo packets across the tunnel: the tunnel can do fragmentation or
the physical port traversed by the tunnel might have a jumbo MTU.
However, until now, tunnels always had a 1500-byte MTU by default. It
could be adjusted using ODP_VPORT_MTU_SET, but nothing actually did this.
One alternative would be to make ovs-vswitchd able to set the vport's MTU.
This commit, however, takes a different approach, of dropping the concept
of MTU entirely for tunnel vports. This also solves the problem described
above, without making any additional work for anyone.
I tested that, without this change, I could not send 1600-byte "pings"
between two machines whose NICs had 2000-byte MTUs that were connected to
vswitches that were in turn connected over GRE tunnels with the default
1500-byte MTU. With this change, it worked OK, regardless of the MTU of
the network traversed by the GRE tunnel.
This patch also makes "patch" ports MTU-less.
It might make sense to remove vport_set_mtu() and the associated callback,
since ordinary network devices are now the only vports that support it.
Signed-off-by: Ben Pfaff <blp@nicira.com>
Suggested-by: Jesse Gross <jesse@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
Bug #3728.
Expanding an skbuff in a netlink dump handler doesn't work well. We
weren't updating the truesize of the skb or the allocation within the
socket that netlink_dump() had put the skb in. The code had other bugs
too.
This commit fixes the problem (in my tests, anyway) by avoiding expanding
the reply skbuff to fill in the actions. Instead, in such a case the
userspace client has to do a separate "get" action to get the actions.
This commit also updates userspace to do this automatically for dumps in
the cases where the caller cares (only "ovs-dpctl dump-flows" currently
cares).
Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
Bug #4520.
The current reporting of flow last used time has two issues that
cause it to incorrectly report the system monotonic time when the
flow was last used.
The first is that it simply converts the stored jiffies value to
milliseconds by scaling with a constant. This does not work because
jiffies is not zero based and can wrap around on 32-bit platforms.
The second is that there is no guarantee that jiffies advances at the
same rate as the RTC based monotonic time that userspace uses.
A variety of factors can cause differences, including system suspend
and clock drift. These are not too important for relatively short
time periods such as the duration of the flow (nor is the flow timing
precision of extreme importance). However, when the time being
measured is the duration since system boot (assuming that the above
issues had been addressed) the difference can become significant.
This addresses both issues by restoring behavior similar to the
previous method of computing the flow used time, though in a
slightly different form to reflect the needs of the Netlink code.
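A minimal sketch of the idea, assuming the report is built from the
current monotonic time minus the flow's idle time measured in ticks.
TOY_HZ and the function names are hypothetical, and a 32-bit counter is
used to show the wraparound case. Only the short idle interval is
converted from ticks, so drift over the time since boot does not
accumulate.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_HZ 250u  /* assumed tick rate, for illustration only */

/* Ticks elapsed between two 32-bit tick readings. Unsigned subtraction
 * is modulo 2^32, so the result is right even if the counter wrapped. */
static uint32_t toy_ticks_since(uint32_t now, uint32_t then)
{
    return now - then;
}

/* Monotonic time (ms) at which the flow was last used: take "now" on the
 * monotonic clock and subtract how long the flow has been idle. */
static uint64_t toy_flow_used_ms(uint64_t now_mono_ms,
                                 uint32_t now_ticks,
                                 uint32_t flow_ticks)
{
    uint64_t idle_ms =
        (uint64_t)toy_ticks_since(now_ticks, flow_ticks) * 1000u / TOY_HZ;

    return now_mono_ms - idle_ms;
}
```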
Signed-off-by: Jesse Gross <jesse@nicira.com>
Acked-by: Ben Pfaff <blp@nicira.com>
When doing an upcall we allocate memory for ODP_PACKET_ATTR_TYPE.
However, ODP_PACKET_ATTR_TYPE does not exist - the type is specified
by the command.
Signed-off-by: Jesse Gross <jesse@nicira.com>
Acked-by: Ben Pfaff <blp@nicira.com>
We allocate a number of multicast groups and stripe upcalls across
them using a hash function. However, instead of using the ID of
the selected group for the upcall multicast we were directly using
the output of the hash function. In the best case this leads to
intermittent failures when we choose an invalid group ID (such as
0) or in the worst case could lead to access of unallocated memory.
This fixes that by looking up the group we have been allocated.
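The difference can be sketched as follows. The hash function, the number
of groups, and what gets hashed are all assumptions made for this
example; only the distinction between "hash output" and "allocated group
ID" comes from the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_N_GROUPS 16u  /* assumed number of allocated groups */

/* Group IDs as handed out by an allocator; in general they are arbitrary
 * and need not start at zero or be contiguous. */
struct toy_mc_dp {
    uint32_t mc_groups[TOY_N_GROUPS];
};

/* Hypothetical integer hash, standing in for whatever the datapath uses. */
static uint32_t toy_hash(uint32_t x)
{
    x ^= x >> 16;
    x *= 0x45d9f3bu;
    x ^= x >> 16;
    return x;
}

/* Buggy variant: treats the stripe index as if it were a group ID. */
static uint32_t toy_group_buggy(uint32_t key)
{
    return toy_hash(key) % TOY_N_GROUPS;
}

/* Fixed variant: uses the stripe index to look up the allocated ID. */
static uint32_t toy_group_fixed(const struct toy_mc_dp *dp, uint32_t key)
{
    assert(dp != NULL);
    return dp->mc_groups[toy_hash(key) % TOY_N_GROUPS];
}
```

With a key that hashes to index 0, the buggy variant returns group 0,
which is the kind of invalid ID mentioned above, while the fixed variant
returns whatever ID was allocated for that slot.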
Signed-off-by: Jesse Gross <jesse@nicira.com>
Acked-by: Ben Pfaff <blp@nicira.com>
Jesse suggested this naming scheme, so I'm adjusting existing names to
fit it.
Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
I can't see any real value in maintaining a dp_idx separate from the
ifindex of the local port. With the current implementation it also
artificially limits the number of datapaths.
Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
This completes the transition to the Generic Netlink interface, and
so this commit restores support for Linux 2.6.18 and later.
Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
This commit calls genl_lock(), which wasn't exported before Linux 2.6.35,
and thus doesn't support earlier kernels. That problem will
be fixed once the whole userspace interface transitions to Generic
Netlink a few commits from now.
Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
The kernel Generic Netlink layer always holds a mutex (genl_lock) when it
invokes callbacks, so that means that there is no point in having
per-datapath mutexes or a separate vport lock. This commit removes them.
This commit breaks support for Linux before 2.6.35 because it calls
genl_lock(), which wasn't exported before that version. That problem will
be fixed once the whole userspace interface transitions to Generic
Netlink a few commits from now.
Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
The vport_mutex really only protects the vport dev_table, which isn't very
much. By getting rid of it we take one step toward simplifying the vswitch
locking, which will necessarily have to be based mainly around the Generic
Netlink genl_mutex once we switch to Generic Netlink.
Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
This definition wasn't actually useful for the kernel: in the only place
that it was used, it didn't really have to be, so this commit removes it from
datapath-protocol.h. It is still marginally useful in userspace, at least
as a value that converts to and from OpenFlow port number OFPP_NONE, so
move it to odp-util.c.
Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
When the datapath moves to the Netlink protocol it won't have a minor
number to use, so we have to put the dp_idx in the message.
This also changes the kernel implementation of ODP_FLOW_FLUSH to do the
datapath locking inside flush_flows() instead of inside openvswitch_ioctl()
but doesn't change that command's userspace interface, which still passes
a datapath number as the ioctl argument.
Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
Nothing was productively using the 'flags' member of odp_flow, so this
commit removes it.
ODPFF_ZERO_TCP_FLAGS isn't used at all (as of the previous commit).
ODPFF_EOF has been replaced by a special case of the 'key_len' member.
This will go away, too, once AF_NETLINK starts being used.
Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
This brings the code closer to what the Netlink interface will need to
implement.
Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
As with n_flows, n_ports was used regularly by userspace to determine how
much memory to allocate when listing ports, but it is no longer needed for
that. max_ports, on the other hand, is necessary but it is also a fixed
value for the kernel datapath right now and if we expand it we can also
come up with a way to report the expanded value.
The remaining members of odp_stats are actually real statistics that I
intend to keep.
Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
This queue information will be available through the kernel socket layer
once we move over to Netlink sockets as transports, so we might as well get
rid of the redundancy.
Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
Userspace used to use the n_flows information here to decide how much
memory needed to be allocated to list flows, but that isn't necessary any
longer now that listing flows uses an iterator abstraction. The
cur_capacity and max_capacity members are just curiosities and don't
provide much information; if the implementation ever changes away from
the current hash table implementation then they could become meaningless
anyhow.
But more than anything, these aren't really the kind of statistics that
networking people usually care about.
Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
One of the goals for Open vSwitch is to decouple kernel and userspace
software, so that either one can be upgraded or rolled back independent of
the other. To do this in full generality, it must be possible to add new
features to the kernel vport layer without changing userspace software.
The customary way to do this in the Linux networking stack is to use
Netlink and in particular Netlink attributes. This commit adopts that
model for the vport layer. It does not yet actually start using the
Netlink socket layer, which will come later.
Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
I plan to make the vport type part of the standard header stuck on each
Netlink message related to a vport. As such, it is more convenient to use
an integer than a string. In addition, by being fundamentally different
from strings, using an integer may reduce the confusion we've had in the
past over the differences in userspace and kernel names for network device
and vport types.
Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
Until now it has only been possible to query a vport if you know what
datapath it is on. This doesn't really make sense, so this commit removes
that restriction. It is a little bigger than one might naturally expect
because locking changes are required.
This also allows us to get rid of the ETHTOOL_GDRVINFO kluge that has
bothered me for a long time. The next commit does that.
Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
One of the goals for Open vSwitch is to decouple kernel and userspace
software, so that either one can be upgraded or rolled back independent of
the other. To do this in full generality, it must be possible to add new
features to the kernel vport layer without changing userspace software. In
turn, that means that the odp_port structure must become variable-length.
This does not, however, fit in well with the ODP_PORT_LIST ioctl in its
current form, because that would require userspace to know how much space
to allocate for each port in advance, or to allocate as much space as
could possibly be needed. Neither choice is very attractive.
This commit prepares for a different solution, by replacing ODP_PORT_LIST
by a new ioctl ODP_VPORT_DUMP that retrieves information about a single
vport from the datapath on each call. It is much cleaner to allocate the
maximum amount of space for a single vport than to do so for possibly a
large number of vports.
It would be faster to retrieve a number of vports in batch instead of just
one at a time, but that will naturally happen later when the kernel
datapath interface is changed to use Netlink, so this patch does not bother
with it.
The Netlink version won't need to take the starting port number from
userspace, since Netlink sockets can keep track of that state as part
of their "dump" feature.
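The one-vport-per-call pattern can be sketched in userspace as below,
assuming the caller passes a starting port number and the datapath
returns the first vport at or after it. The types and function names are
hypothetical and do not reflect the actual ODP_VPORT_DUMP argument
layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_DUMP_MAX_PORTS 8u

struct toy_vport {
    int  in_use;
    char name[16];
};

struct toy_port_table {
    struct toy_vport ports[TOY_DUMP_MAX_PORTS];
};

/* "Kernel" side: copy out the first vport numbered >= start.
 * Returns its port number, or -1 when there are no more vports. */
static int toy_vport_dump(const struct toy_port_table *t, uint32_t start,
                          struct toy_vport *out)
{
    uint32_t i;

    assert(t != NULL && out != NULL);
    for (i = start; i < TOY_DUMP_MAX_PORTS; i++) {
        if (t->ports[i].in_use) {
            *out = t->ports[i];
            return (int)i;
        }
    }
    return -1;
}

/* "Userspace" side: one fixed-size buffer is enough for the whole walk,
 * because each call returns exactly one vport. */
static unsigned toy_count_vports(const struct toy_port_table *t)
{
    struct toy_vport v;
    uint32_t next = 0;
    unsigned n = 0;
    int port_no;

    while ((port_no = toy_vport_dump(t, next, &v)) >= 0) {
        n++;
        next = (uint32_t)port_no + 1;  /* resume after the one just seen */
    }
    return n;
}
```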
Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
One of the goals for Open vSwitch is to decouple kernel and userspace
software, so that either one can be upgraded or rolled back independent of
the other. To do this in full generality, it must be possible to change
the kernel's idea of the flow key separately from the userspace version.
This commit takes one step in that direction by making the kernel report
its idea of the flow that a packet belongs to whenever it passes a packet
up to userspace. This means that userspace can intelligently figure out
what to do:
- If userspace's notion of the flow for the packet matches the kernel's,
then nothing special is necessary.
- If the kernel has a more specific notion for the flow than userspace,
for example if the kernel decoded IPv6 headers but userspace stopped
at the Ethernet type (because it does not understand IPv6), then again
nothing special is necessary: userspace can still set up the flow in
the usual way.
- If userspace has a more specific notion for the flow than the kernel,
for example if userspace decoded an IPv6 header but the kernel
stopped at the Ethernet type, then userspace can forward the packet
manually, without setting up a flow in the kernel. (This case is
bad from a performance point of view, but at least it is correct.)
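The three cases above reduce to one test: did userspace decode anything
that the kernel did not? A minimal sketch, assuming each side's notion of
the flow is summarized as a bitmask of decoded header layers (the mask
values and names here are hypothetical, not the real flow key format):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bitmask of header layers that a parser decoded. */
#define TOY_F_ETH  0x1u
#define TOY_F_IPV6 0x2u
#define TOY_F_L4   0x4u

enum toy_verdict {
    TOY_INSTALL_FLOW,     /* set up the flow in the kernel as usual */
    TOY_FORWARD_MANUALLY  /* forward this packet from userspace only */
};

static enum toy_verdict toy_classify(uint32_t kernel_fields,
                                     uint32_t user_fields)
{
    /* Userspace is more specific than the kernel: a kernel flow could
     * not express the distinction, so do not install one. */
    if (user_fields & ~kernel_fields)
        return TOY_FORWARD_MANUALLY;

    /* Same notion, or the kernel is more specific: nothing special. */
    return TOY_INSTALL_FLOW;
}
```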
This commit does not actually make userspace flexible enough to handle
changes in the kernel flow key structure, although userspace does now
have enough information to do that intelligently. This will have to wait
for later commits.
This commit is bigger than it would otherwise be because it is rolled
together with changing "struct odp_msg" to a sequence of Netlink
attributes. The alternative, to do each of those changes in a separate
patch, seemed like overkill because it meant that either we would have to
introduce and then kill off Netlink attributes for in_port and tun_id, if
Netlink conversion went first, or shove yet another variable-length header
into the stuff already after odp_msg, if adding the flow key to odp_msg
went first.
This commit will slow down performance of checksumming packets sent up to
userspace. I'm not entirely pleased with how I did it. I considered a
couple of alternatives, but none of them seemed that much better.
Suggestions welcome. Not changing anything wasn't an option,
unfortunately. At any rate some slowdown will become unavoidable when OVS
actually starts using Netlink instead of just Netlink framing.
(Actually, I thought of one option where we could avoid that: make
userspace do the checksum instead, by passing csum_start and csum_offset as
part of what goes to userspace. But that's not perfect either.)
Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
One of the goals for Open vSwitch is to decouple kernel and userspace
software, so that either one can be upgraded or rolled back independent of
the other. To do this in full generality, it must be possible to change
the kernel's idea of the flow key separately from the userspace version.
In turn, that means that flow keys must become variable-length. This
commit makes that change using Netlink attribute sequences.
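For readers unfamiliar with the encoding, the sketch below shows the
general shape of a Netlink-style attribute sequence: each attribute is a
16-bit length, a 16-bit type, and a payload padded to a 4-byte boundary.
The helper names and the attribute type numbers used with them are
hypothetical; this is not the kernel's nla_* API. Because a reader can
skip types it does not recognize, one side can add a field without
breaking the other.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_ALIGN(n) (((n) + 3u) & ~(size_t)3u)

struct toy_attr_hdr {
    uint16_t len;   /* header plus payload, excluding padding */
    uint16_t type;
};

/* Append one attribute at offset 'off'. Returns the new offset, or 0 if
 * the attribute does not fit. */
static size_t toy_put(uint8_t *buf, size_t size, size_t off,
                      uint16_t type, const void *payload, size_t plen)
{
    struct toy_attr_hdr h;
    size_t total = sizeof h + plen;
    size_t padded = TOY_ALIGN(total);

    assert(buf != NULL && payload != NULL);
    if (total > UINT16_MAX || off > size || padded > size - off)
        return 0;

    h.len = (uint16_t)total;
    h.type = type;
    memcpy(buf + off, &h, sizeof h);
    memcpy(buf + off + sizeof h, payload, plen);
    memset(buf + off + total, 0, padded - total);  /* zero the padding */
    return off + padded;
}

/* Return a pointer to the payload of the first attribute of 'type' and
 * store its length in *plen, or return NULL if absent or malformed.
 * Attributes of other types, including unknown ones, are skipped. */
static const uint8_t *toy_find(const uint8_t *buf, size_t len,
                               uint16_t type, size_t *plen)
{
    size_t off = 0;

    while (len - off >= sizeof(struct toy_attr_hdr)) {
        struct toy_attr_hdr h;
        size_t step;

        memcpy(&h, buf + off, sizeof h);
        if (h.len < sizeof h || h.len > len - off)
            return NULL;                    /* malformed length */
        if (h.type == type) {
            *plen = h.len - sizeof h;
            return buf + off + sizeof h;
        }
        step = TOY_ALIGN((size_t)h.len);
        if (step > len - off)
            return NULL;                    /* unpadded final attribute */
        off += step;
    }
    return NULL;
}
```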
This commit does not actually make userspace flexible enough to handle
changes in the kernel flow key structure, because userspace doesn't yet
have enough information to do that intelligently. Upcoming commits will
fix that.
Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
One of the goals for Open vSwitch is to decouple kernel and userspace
software, so that either one can be upgraded or rolled back independent of
the other. To do this in full generality, it must be possible to change
the kernel's idea of the flow key separately from the userspace version.
In turn, that means that flow keys must become variable-length. This does
not, however, fit in well with the ODP_FLOW_LIST ioctl in its current form,
because that would require userspace to know how much space to allocate
for each flow's key in advance, or to allocate as much space as could
possibly be needed. Neither choice is very attractive.
This commit prepares for a different solution, by replacing ODP_FLOW_LIST
by a new ioctl ODP_FLOW_DUMP that retrieves a single flow from the datapath
on each call. It is much cleaner to allocate the maximum amount of space
for a single flow key than to do so for possibly a very large number of
flow keys.
As a side effect, this patch also fixes a race condition that sometimes
made "ovs-dpctl dump-flows" print an error: previously, flows were listed
and then their actions were retrieved, which left a window in which
ovs-vswitchd could delete the flow. Now dumping a flow and its actions is
a single step, closing that window.
Dumping all of the flows in a datapath is no longer an atomic step, so now
it is possible to miss some flows or see a single flow twice during
iteration, if the flow table is modified by another process. It doesn't
look like this should be a problem for ovs-vswitchd.
It would be faster to retrieve a number of flows in batch instead of just
one at a time, but that will naturally happen later when the kernel
datapath interface is changed to use Netlink, so this patch does not bother
with it.
Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
The missing "break" here meant that an attempt to output to any port
number that happened to include the wrong bit would fail.
Problem introduced by commit cdee00fd63 (datapath: Replace "struct
odp_action" by Netlink attributes.)
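To illustrate this class of bug (not the exact code that was fixed), the
sketch below validates two hypothetical action types. Without the
"break", an output action falls through into the next case's check, so a
valid port number is rejected whenever it has one of the bits that check
forbids. The action names, limits, and mask are assumptions for this
example.

```c
#include <assert.h>
#include <stdint.h>

enum toy_action { TOY_OUTPUT, TOY_SET_TOS };

#define TOY_VALIDATE_MAX_PORTS 256u
#define TOY_ECN_MASK           0x03u  /* bits a TOS value may not set */

/* Buggy: TOY_OUTPUT falls through into the TOY_SET_TOS check. */
static int toy_validate_buggy(enum toy_action type, uint32_t arg)
{
    switch (type) {
    case TOY_OUTPUT:
        if (arg >= TOY_VALIDATE_MAX_PORTS)
            return -1;
        /* falls through: this is the missing "break" */
    case TOY_SET_TOS:
        if (arg & TOY_ECN_MASK)
            return -1;
        break;
    }
    return 0;
}

/* Fixed: each case checks only its own argument. */
static int toy_validate_fixed(enum toy_action type, uint32_t arg)
{
    switch (type) {
    case TOY_OUTPUT:
        if (arg >= TOY_VALIDATE_MAX_PORTS)
            return -1;
        break;
    case TOY_SET_TOS:
        if (arg & TOY_ECN_MASK)
            return -1;
        break;
    }
    return 0;
}
```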
Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
Reported-by: Michael Mao <mmao@nicira.com>
Bug #4385.
This is proper kernel style.
Kernel style also encourages using a type name instead of an expression as
sizeof's operand, but this patch doesn't make any of those changes.
Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
I had completely forgotten that we had a top-level compat.h and compat26.h.
It's better to distribute their contents to individual compat headers, so
this commit does so and deletes them.
Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
When deleting a datapath, we remove all of the vports and then immediately
free the datapath data structures. Since the vports are allowed to use
call_rcu() to free their data, it's possible for them to return immediately
while packet processing is still taking place. This commit separates
dropping the references from freeing the data, and defers the free with
call_rcu() for protection.
This race cannot actually occur in practice since the last port to be
deleted is an internal device, which uses synchronize_rcu() itself
(implicitly through unregister_netdevice()). However, there is no
requirement that it must do this nor should there be.
Reported-by: Ben Pfaff <blp@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
Acked-by: Ben Pfaff <blp@nicira.com>
A few places marked struct datapath pointers as const since they
didn't expect to make modifications. However, when compiled with
lockdep the datapath mutex pointer is passed to lockdep_is_held(),
which has a non-const argument. That provoked warnings about
casting away the const, so this drops the const from the original
pointers.
Reported-by: Ben Pfaff <blp@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
Acked-by: Ben Pfaff <blp@nicira.com>
On datapath creation we hold dp_mutex but not dp->mutex when
creating the vport for the datapath device. However, there are
lockdep checks that validate that we hold dp->mutex during the call
to new_vport(). The lock isn't actually necessary in this case
because no one else can access the datapath but it's good to have
the lock assertions, so this holds dp->mutex while initializing
the datapath.
Found with lockdep.
Signed-off-by: Jesse Gross <jesse@nicira.com>
Acked-by: Ben Pfaff <blp@nicira.com>
It's currently possible for operations on our character device to
be still running when we unload the module. This will result in
an oops when the executing code is suddenly freed. The chrdev
code has a way to avoid this by taking a reference on the module
every time the device is opened, which means that we can't be
unloaded as long as there is an open file descriptor and therefore
the possibility of an operation. However, our file_operations
structure doesn't include an owner member, which prevents this
mechanism from working. This adds one.
Signed-off-by: Jesse Gross <jesse@nicira.com>
Acked-by: Ben Pfaff <blp@nicira.com>
It's possible that someone is using the datapath data structures
when we attempt to delete the datapath. The first writer will
only hold dp->mutex, which we don't currently acquire when deleting.
This adds that lock to prevent a potential race (this can't currently
happen because userspace is single threaded, as long as "ovs-dpctl
del-dp" is not used at the same time).
Signed-off-by: Jesse Gross <jesse@nicira.com>
Acked-by: Ben Pfaff <blp@nicira.com>
Both do_destroy_dp() and destroy_dp() are small functions and
only have a single caller. There's no good reason for them to
be separate so this merges them together. It also makes things
more logically consistent and easier to read in the next commit,
which adds additional locking as everything is in one place.
Signed-off-by: Jesse Gross <jesse@nicira.com>
Acked-by: Ben Pfaff <blp@nicira.com>
If inserting a flow failed and we need to free the actions, we
currently directly free them from the flow struct. This is fine
but it makes sparse complain about directly accessing an RCU
protected field. We could insert some casts to avoid this but
it's cleaner to just free the data from the local variable
instead.
Found with sparse.
Signed-off-by: Jesse Gross <jesse@nicira.com>
Acked-by: Ben Pfaff <blp@nicira.com>
query_port() directly accesses the datapath port array, without
using any kind of RCU dereference. It's OK, since it is holding
DP mutex but this adds an explicit check to make sparse happy.
It also simplifies the code path somewhat.
Found with sparse.
Signed-off-by: Jesse Gross <jesse@nicira.com>
Acked-by: Ben Pfaff <blp@nicira.com>
In some places we access the array of datapath ports without
RCU protection. This introduces a new function to check that in
these cases the dp mutex is held for protection.
Found with sparse.
Signed-off-by: Jesse Gross <jesse@nicira.com>
Acked-by: Ben Pfaff <blp@nicira.com>
There are several places where the flow table is accessed
without any kind of RCU protection. This is fine because dp
mutex is held so this adds checks for that condition.
Found with sparse.
Signed-off-by: Jesse Gross <jesse@nicira.com>
Acked-by: Ben Pfaff <blp@nicira.com>
Additional configuration is passed down to the kernel in the "config"
array of an odp_port when a vport is created. This information is not
returned when a vport is queried, though. This information is useful
for debugging, since it may be used to distinguish ports based on
additional data, such as the peer in tunnels. In a forthcoming patch, it
will be essential to distinguish between plain GRE and GRE over IPsec.
Signed-off-by: Justin Pettit <jpettit@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
The n_buckets argument to tbl_create() can be zero, but the comment didn't
mention that. However, there's no reason that the caller can't just pass
in a correct size, so this commit changes them to do that.
Also, TBL_L1_SIZE was conceptually wrong as the minimum size: the minimum
size is one L2 page, i.e. TBL_L2_SIZE. But TBL_MIN_BUCKETS seems like a
better all-around way to indicate the minimum size, so this commit also
introduces that macro and uses it.
Jesse Gross pointed out inconsistencies in this area.
Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
The sparse checker was complaining about incorrect address spaces (e.g.
__user versus non-__user pointers). I looked at each of them, checked
that the code looked correct to me, and added the appropriate __force
annotations to casts.
Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
This is a small optimization for the case where a new flow is being added
to the flow table.
Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>