ODP_ACTION_ATTR_CONTROLLER in the kernel actually sends packets to
userspace, not the controller. To make it generic rename this action
to ODP_ACTION_ATTR_USERSPACE.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
This trivially supports linux 3.0 by incrementing the version check.
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Jesse Gross <jesse@nicira.com>
It's possible to trace kfree_skb() call sites to find out where
packets are getting dropped. Situations where kfree_skb() does
not actually indicate an error add noise, so use
consume_skb() instead to avoid tracing non-errors.
Suggested-by: Ben Pfaff <blp@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
Acked-by: Ben Pfaff <blp@nicira.com>
Older kernels (those before 2.6.22) rely on implicit assumptions
to determine checksum offloading status. These assumptions tend
to break down when doing switching because the switch sits in the
middle of the transmit and receive paths. Newer kernels deal with this
problem by adding more explicit information about how to checksum.
This replicates that behavior by mirroring the state from newer
kernels in private OVS storage on the kernels that lack it. On
ingress and egress we then map that state onto the appropriate
location for the given kernel and can consistently manipulate it
within OVS. Some of this was already done for the checksum type
but this makes it more robust and expands it to the checksum start
and offset as well.
Signed-off-by: Jesse Gross <jesse@nicira.com>
Acked-by: Ben Pfaff <blp@nicira.com>
Most of the time kernels older or newer than the ones we support
simply fail to compile. However, sometimes they appear to succeed
but then cause problems later on. This explicitly checks for
supported versions at compile time.
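
A check of this kind might look roughly like the sketch below. The bounds
(2.6.18 up to, but not including, 3.1) and the OVS_MIN_KERNEL and
OVS_END_KERNEL names are assumptions chosen for illustration, not the values
or names this commit uses. In a real module LINUX_VERSION_CODE and
KERNEL_VERSION come from <linux/version.h>; the fallback definitions here
only let the sketch build outside a kernel tree.

```c
#include <assert.h>

/* Encoding historically used by the kernel's <linux/version.h>. */
#ifndef KERNEL_VERSION
#define KERNEL_VERSION(a, b, c) (((a) << 16) + ((b) << 8) + (c))
#endif

/* Stand-in value so the sketch builds in userspace (assumption). */
#ifndef LINUX_VERSION_CODE
#define LINUX_VERSION_CODE KERNEL_VERSION(2, 6, 32)
#endif

/* Hypothetical bounds: first supported and first unsupported version. */
#define OVS_MIN_KERNEL KERNEL_VERSION(2, 6, 18)
#define OVS_END_KERNEL KERNEL_VERSION(3, 1, 0)

/* Fail the build instead of producing a module that misbehaves later. */
#if LINUX_VERSION_CODE < OVS_MIN_KERNEL || LINUX_VERSION_CODE >= OVS_END_KERNEL
#error "Kernel version is outside the supported range"
#endif

/* Runtime equivalent of the preprocessor test, useful for checking it. */
static int kernel_version_supported(int code)
{
    return code >= OVS_MIN_KERNEL && code < OVS_END_KERNEL;
}
```

Supporting a new kernel release then amounts to raising the upper bound.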
Signed-off-by: Jesse Gross <jesse@nicira.com>
The NXAST_DROP_SPOOFED_ARP action has been deprecated in favor of
defining flows using the NXM_NX_ARP_SHA flow match for a while. This
commit removes it.
Signed-off-by: Justin Pettit <jpettit@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
Until now, the tun_id and in_port have been lost when a packet is sent from
the kernel to userspace and then back to the kernel. I didn't think that
this was a problem, but recent behavior made me look closer and see that
it makes a difference if sFlow is turned on or if an
ODP_ACTION_ATTR_CONTROLLER action is present. We could possibly kluge
around those, but for future-proofing it seems better to pass the packet
metadata from userspace to the kernel. That is what this commit does.
This commit introduces a user-kernel protocol break. We could avoid that,
if it is desirable, by making ODP_PACKET_ATTR_KEY optional for
ODP_PACKET_CMD_EXECUTE commands.
Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
genlmsg_reply() indirectly makes a call to kmalloc but takes no
GFP flags, instead using GFP_ATOMIC if in a softirq and GFP_KERNEL
otherwise. However, here we hold rcu_read_lock(), which requires
GFP_ATOMIC but is not a softirq. Since we've already built the
reply message, it is safe to release rcu_read_lock(), so do that
before calling genlmsg_reply().
Signed-off-by: Jesse Gross <jesse@nicira.com>
CC: Hao Zheng <hzheng@nicira.com>
Acked-by: Ben Pfaff <blp@nicira.com>
Currently the whole flow key struct is hashed on every packet received from the
network or userspace. The whole struct is also compared byte-for-byte when
doing flow table lookups. This consumes a fair percentage of CPU time, and most
of the time part of the structure is unused (e.g. the IPv6 fields when handling
IPv4 traffic; the IPv4 fields when handling Ethernet frames).
This commit reorders the fields in the flow key struct to put the least
commonly used elements at the end and changes the hash and comparison functions
to look only at the part that contains data.
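
The idea can be sketched in userspace as follows. The struct layout, the
KEY_LEN_* macros, and the function names are hypothetical simplifications,
and FNV-1a is used only as a stand-in hash; none of this is the datapath's
actual key or hash function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified key: commonly used fields first, rare ones last. */
struct flow_key {
    uint8_t  eth_src[6];
    uint8_t  eth_dst[6];
    uint16_t eth_type;
    uint16_t pad;          /* explicit padding so the layout has no holes */
    uint32_t ipv4_src;
    uint32_t ipv4_dst;
    uint8_t  ipv6_src[16];
    uint8_t  ipv6_dst[16];
};

/* Length of the meaningful prefix for each kind of traffic. */
#define KEY_LEN_ETH  offsetof(struct flow_key, ipv4_src)
#define KEY_LEN_IPV4 offsetof(struct flow_key, ipv6_src)
#define KEY_LEN_IPV6 sizeof(struct flow_key)

/* FNV-1a over the first key_len bytes only (a stand-in hash). */
static uint32_t flow_hash(const struct flow_key *key, size_t key_len)
{
    const uint8_t *p = (const uint8_t *)key;
    uint32_t h = 2166136261u;
    size_t i;

    assert(key_len <= sizeof *key);
    for (i = 0; i < key_len; i++) {
        h ^= p[i];
        h *= 16777619u;
    }
    return h;
}

/* Two keys match if they have the same length and the same prefix bytes. */
static int flow_key_equal(const struct flow_key *a, size_t a_len,
                          const struct flow_key *b, size_t b_len)
{
    return a_len == b_len && memcmp(a, b, a_len) == 0;
}
```

With this layout an IPv4 lookup touches 24 bytes instead of 56, and stale
bytes in the IPv6 fields cannot affect the result.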
Signed-off-by: Andrew Evans <aevans@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
Fixes the following warning:
datapath.c:473:6: warning: variable 'port_no' set but not used
[-Wunused-but-set-variable]
Signed-off-by: Ethan Jackson <ethan@nicira.com>
It's (almost) always easier to understand a function with fewer parameters,
so this removes the now-redundant sw_flow_key and actions parameters from
execute_actions(), since they can be found through OVS_CB(skb)->flow now.
This also necessarily moves loop detection into execute_actions().
Otherwise, the flow's actions could have changed between the time that the
loop was detected and the time that it was suppressed, which would mean
that the wrong (version of the) flow would get suppressed.
Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
This way, it's always possible to get a packet's key or hash simply by
looking at its 'flow', without considering whether the packet came from
userspace or from a vport.
Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
Before commit 3f19d399f "datapath: Fix mysterious GRE-over-IPSEC problems,"
'packet' in odp_packet_cmd_execute() was an skb cloned from one created by
Netlink, so its cb member wasn't necessarily zeroed. But that commit
changed 'packet' to be freshly allocated with __dev_alloc_skb(), which
means that cb is zeroed, so we don't have to do it again.
Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
We've noticed that packets that go up to userspace and then back down to
the kernel and then enter a GRE tunnel that is then ESP encapsulated
by IPSEC end up with a bad ESP "next header" value: it ends up as zero
instead of 0x2f (IPPROTO_GRE). Just putting packets from userspace into
a freshly allocated skb fixes the problem.
The underlying problem that this works around is still unknown.
Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
Bug #4769.
Most necessary compatibility code is simply backported versions
of kernel functions from newer kernels. These belong in the compat
directory, where they can be transparently picked up when necessary.
However, in some situations there is code that is different
depending on the kernel version but is always needed in some form.
Here it is desirable to segregate the code but it does not really
belong in the compat directory because it does not exist in upstream
kernels. This moves those functions to a compat file, which makes
the meaning clear and prevents problems when Open vSwitch is integrated
into other projects.
Signed-off-by: Jesse Gross <jesse@nicira.com>
Acked-by: Ben Pfaff <blp@nicira.com>
flow_extract() sets key->tun_id from OVS_CB(packet)->tun_id, which until
now has contained whatever Netlink put there in its NETLINK_CB structure.
Zero it earlier so that its value is at least predictable.
The resulting code is still not correct, because key->tun_id and
key->in_port are now set to arbitrary values. I have known about this
since I wrote this function (and before, too, in its earlier incarnations),
but until now I did not think that it was a problem because I did not
think that there were any users along this code path. But that is wrong:
sFlow sampling uses tun_id and in_port and ODP_ACTION_ATTR_CONTROLLER uses
in_port. So we need a way to pass these back down from userspace. An
upcoming commit will add a way.
Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
Normally when performing checksum offloading the transport header
must be present in the linear data area. However, this might not
be the case with packets processed by GRO. On transmit these
packets are processed by GSO if emulation of checksum offloading
needs to be performed. Unlike skb_checksum_help(), the GSO code
does not have any requirements about the packet structure. Since
our code that copies and checksums packets to userspace is called
in conditions similar to GSO and does not have any assumptions
about layout, drop the BUG_ON assertion.
NIC-343
Signed-off-by: Jesse Gross <jesse@nicira.com>
Acked-by: Ben Pfaff <blp@nicira.com>
Netlink attributes have a maximum length of 64k. It's theoretically
possible that a packet could exceed this length, so check for it before
we try to send the packet to userspace.
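
The limit follows from the attribute header, whose 16-bit length field
counts the 4-byte header as well as the payload. A minimal sketch of such a
check is shown below; the function and macro names are hypothetical, and the
real code would use the kernel's own Netlink definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Size of the attribute header: 16-bit length plus 16-bit type. */
#define ATTR_HDRLEN 4u

/* Returns 1 if a payload of payload_len bytes fits in one attribute. */
static int payload_fits_in_attr(size_t payload_len)
{
    /* The 16-bit length field covers the header and the payload. */
    return payload_len <= UINT16_MAX - ATTR_HDRLEN;
}
```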
Signed-off-by: Jesse Gross <jesse@nicira.com>
Acked-by: Ben Pfaff <blp@nicira.com>
It's possible to encounter a few different errors when preparing
to send a packet to userspace in queue_control_packet(). This
ensures that if we encounter one of these problems, the issue is
properly recorded as a lost packet.
Signed-off-by: Jesse Gross <jesse@nicira.com>
Acked-by: Ben Pfaff <blp@nicira.com>
When destroying vports we account for two types of synchronization
mechanisms: RTNL and RCU. However, it is possible to call into
network device methods with just a device reference without either
of these. These device methods can use the datapath data structures
but we don't wait for all of the references to go away before freeing
the datapath. The actual wait happens in rtnl_unlock(), so by moving
up that call we can avoid the possibility of use after free with
internal devices.
Signed-off-by: Jesse Gross <jesse@nicira.com>
Acked-by: Ben Pfaff <blp@nicira.com>
Currently we explicitly zero out each of the fields in the OVS_CB for
executed packets. However, it seems simpler and more robust to just
memset the whole thing to zero.
Signed-off-by: Jesse Gross <jesse@nicira.com>
Acked-by: Ben Pfaff <blp@nicira.com>
The ovs_skb_cb in 'packet' in this function is initially a clone of the
corresponding area in 'skb', which came from the Netlink layer and thus
isn't necessarily all-zeros. This commit initializes it properly before
passing it along to execute_actions().
The most common problem caused by failing to initialize the ovs_skb_cb
properly was that on Linux 2.6.26 and earlier, where Open vSwitch keeps
its own vlan_tci field inside ovs_skb_cb, the first packet of a flow would
get sent out tagged with a random VLAN (usually 0x0001 or 0xffff in our
testing). This commit should fix that problem.
Another likely problem is that turning on sFlow could randomly panic the
kernel. That problem would not be kernel version dependent. We haven't
been testing sFlow so we haven't noticed this problem.
Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Justin Pettit <jpettit@nicira.com>
Reported-by: Pankaj Thakkar <thakkar@nicira.com>
vlan_deaccel_tag() was introduced to move a vlan tag from skb->vlan_tci
to the packet but there was still an open coded variant when doing
an upcall. vlan_deaccel_tag() also clears skb->vlan_tci, which the
open coded variant did not do, but it makes no difference.
Signed-off-by: Jesse Gross <jesse@nicira.com>
Acked-by: Ben Pfaff <blp@nicira.com>
Using the kernel vlan acceleration has a number of benefits:
it enables hardware tagging, allows usage of TSO and checksum
offloading, and is generally easier to manipulate. This switches
the vlan actions to use skb->vlan_tci field for any necessary
changes. In places that do not support vlan acceleration in a way
that we can use (in particular kernels before 2.6.37) we perform
any necessary conversions, such as tagging and GSO before the
packet leaves Open vSwitch.
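
The "tagging" conversion, moving a tag held out of band into the packet
itself, can be modeled on a plain byte buffer as below. This is only a
sketch: vlan_insert_tag() and its parameters are hypothetical, and the
datapath operates on skbs rather than flat buffers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ETH_ADDRS_LEN 12   /* destination + source MAC addresses */
#define VLAN_TAG_LEN  4    /* TPID (0x8100) + TCI */

/*
 * Inserts an 802.1Q tag carrying 'tci' after the MAC addresses of the
 * Ethernet frame in 'frame'.  'len' is the current frame length and 'cap'
 * the buffer capacity.  Returns the new length, or 0 if it does not fit.
 */
static size_t vlan_insert_tag(uint8_t *frame, size_t len, size_t cap,
                              uint16_t tci)
{
    if (len < ETH_ADDRS_LEN || cap < len || cap - len < VLAN_TAG_LEN)
        return 0;

    /* Shift the EtherType and payload to make room for the tag. */
    memmove(frame + ETH_ADDRS_LEN + VLAN_TAG_LEN, frame + ETH_ADDRS_LEN,
            len - ETH_ADDRS_LEN);

    /* Write the tag in network byte order. */
    frame[12] = 0x81;
    frame[13] = 0x00;
    frame[14] = (uint8_t)(tci >> 8);
    frame[15] = (uint8_t)(tci & 0xff);
    return len + VLAN_TAG_LEN;
}
```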
Signed-off-by: Jesse Gross <jesse@nicira.com>
Acked-by: Ben Pfaff <blp@nicira.com>
Until now, tunnel vports have had a specific MTU, in the same way that
ordinary network devices have an MTU, but treating them this way does not
always make sense. For example, consider a datapath that has three ports:
the local port, a GRE tunnel to another host, and a physical port. If
the physical port is configured with a jumbo MTU, it should be possible to
send jumbo packets across the tunnel: the tunnel can do fragmentation or
the physical port traversed by the tunnel might have a jumbo MTU.
However, until now, tunnels always had a 1500-byte MTU by default. It
could be adjusted using ODP_VPORT_MTU_SET, but nothing actually did this.
One alternative would be to make ovs-vswitchd able to set the vport's MTU.
This commit, however, takes a different approach, of dropping the concept
of MTU entirely for tunnel vports. This also solves the problem described
above, without making any additional work for anyone.
I tested that, without this change, I could not send 1600-byte "pings"
between two machines whose NICs had 2000-byte MTUs that were connected to
vswitches that were in turn connected over GRE tunnels with the default
1500-byte MTU. With this change, it worked OK, regardless of the MTU of
the network traversed by the GRE tunnel.
This patch also makes "patch" ports MTU-less.
It might make sense to remove vport_set_mtu() and the associated callback
now, since ordinary network devices are the only vports that support it
now.
Signed-off-by: Ben Pfaff <blp@nicira.com>
Suggested-by: Jesse Gross <jesse@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
Bug #3728.
Expanding an skbuff in a netlink dump handler doesn't work well. We
weren't updating the truesize of the skb or the allocation within the
socket that netlink_dump() had put the skb in. The code had other bugs
too.
This commit fixes the problem (in my tests, anyway) by avoiding expanding
the reply skbuff to fill in the actions. Instead, in such a case the
userspace client has to do a separate "get" action to get the actions.
This commit also updates userspace to do this automatically for dumps in
the cases where the caller cares (only "ovs-dpctl dump-flows" currently
cares).
Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
Bug #4520.
The current reporting of flow last used time has two issues that
cause it to incorrectly report the system monotonic time when the
flow was last used.
The first is that it simply converts the stored jiffies value to
milliseconds by scaling with a constant. This does not work because
jiffies is not zero based and can wrap around on 32-bit platforms.
The second is that there is no guarantee that jiffies advances at the
same rate as the RTC based monotonic time that userspace uses.
A variety of factors can cause differences, including system suspend
and clock drift. These are not too important for relatively short
time periods such as the duration of the flow (nor is the flow timing
precision of extreme importance). However, when the time being
measured is the duration since system boot (assuming that the above
issues had been addressed) the difference can become significant.
This addresses both issues by restoring behavior similar to the
previous method of computing the flow used time, though in a
slightly different form to reflect the needs of the Netlink code.
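
One way to get this effect is to measure only the short idle interval in
jiffies and subtract it from the current monotonic time. The userspace model
below illustrates that approach; the function name, its parameters, and the
32-bit tick counter are assumptions for the sketch, not the commit's code.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Returns the monotonic time, in ms, at which a flow was last used.
 * 'now_mono_ms' is the current monotonic time, 'now_jiffies' the current
 * tick counter, 'flow_jiffies' the counter value stored in the flow, and
 * 'hz' the tick rate.  All names are illustrative.
 */
static uint64_t flow_used_time_ms(uint64_t now_mono_ms, uint32_t now_jiffies,
                                  uint32_t flow_jiffies, uint32_t hz)
{
    uint32_t idle_ticks;
    uint64_t idle_ms;

    assert(hz > 0);
    /* Unsigned subtraction yields the right delta even across a wrap. */
    idle_ticks = now_jiffies - flow_jiffies;
    idle_ms = (uint64_t)idle_ticks * 1000 / hz;
    return now_mono_ms - idle_ms;
}
```

Because jiffies only ever measures the idle interval, neither its starting
value nor long-term drift against the monotonic clock enters the result.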
Signed-off-by: Jesse Gross <jesse@nicira.com>
Acked-by: Ben Pfaff <blp@nicira.com>
When doing an upcall we allocate memory for ODP_PACKET_ATTR_TYPE.
However, ODP_PACKET_ATTR_TYPE does not exist - the type is specified
by the command.
Signed-off-by: Jesse Gross <jesse@nicira.com>
Acked-by: Ben Pfaff <blp@nicira.com>
We allocate a number of multicast groups and stripe upcalls across
them using a hash function. However, instead of using the ID of
the selected group for the upcall multicast we were directly using
the output of the hash function. In the best case this leads to
intermittent failures when we choose an invalid group ID (such as
0) or in the worst case could lead to access of unallocated memory.
This fixes that by looking up the group we have been allocated.
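
The difference between the two forms is sketched below. The table name, the
group count, and both function names are hypothetical; the point is only
that the hash selects an index, and the ID sent to Netlink is whatever was
allocated at that index.

```c
#include <assert.h>
#include <stdint.h>

#define N_UPCALL_GROUPS 16   /* hypothetical number of groups */

/* IDs handed back when the groups were registered; not necessarily 0..N-1. */
static uint32_t upcall_group_ids[N_UPCALL_GROUPS];

/* Buggy form: treats the hash bucket itself as a group ID. */
static uint32_t pick_group_buggy(uint32_t hash)
{
    return hash % N_UPCALL_GROUPS;
}

/* Fixed form: uses the bucket only as an index into the allocated IDs. */
static uint32_t pick_group(uint32_t hash)
{
    return upcall_group_ids[hash % N_UPCALL_GROUPS];
}
```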
Signed-off-by: Jesse Gross <jesse@nicira.com>
Acked-by: Ben Pfaff <blp@nicira.com>
Jesse suggested this naming scheme, so I'm adjusting existing names to
fit it.
Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
I can't see any real value in maintaining a dp_idx separate from the
ifindex of the local port. With the current implementation it also
artificially limits the number of datapaths.
Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
This completes the transition to the Generic Netlink interface, and
so this commit restores support for Linux 2.6.18 and later.
Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
This commit calls genl_lock(), which wasn't exported before Linux
2.6.35, and thus doesn't support earlier kernels. That problem will
be fixed once the whole userspace interface transitions to Generic
Netlink a few commits from now.
Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
This commit calls genl_lock(), which wasn't exported before Linux
2.6.35, and thus doesn't support earlier kernels. That problem will
be fixed once the whole userspace interface transitions to Generic
Netlink a few commits from now.
Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
This commit calls genl_lock(), which wasn't exported before Linux
2.6.35, and thus doesn't support earlier kernels. That problem will
be fixed once the whole userspace interface transitions to Generic
Netlink a few commits from now.
Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
The kernel Generic Netlink layer always holds a mutex (genl_lock) when it
invokes callbacks, so that means that there is no point in having
per-datapath mutexes or a separate vport lock. This commit removes them.
This commit breaks support for Linux before 2.6.35 because it calls
genl_lock(), which wasn't exported before that version. That problem will
be fixed once the whole userspace interface transitions to Generic
Netlink a few commits from now.
Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
The vport_mutex really only protects the vport dev_table, which isn't very
much. By getting rid of it we take one step toward simplifying the vswitch
locking, which will necessarily have to be based mainly around the Generic
Netlink genl_mutex once we switch to Generic Netlink.
Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
This definition wasn't actually useful for the kernel. Its only use there
wasn't really necessary, so this commit removes it from
datapath-protocol.h. It is still marginally useful in userspace, at least
as a value that converts to and from OpenFlow port number OFPP_NONE, so
move it to odp-util.c.
Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
When the datapath moves to the Netlink protocol it won't have a minor
number to use, so we have to put the dp_idx in the message.
This also changes the kernel implementation of ODP_FLOW_FLUSH to do the
datapath locking inside flush_flows() instead of inside openvswitch_ioctl()
but doesn't change that command's userspace interface, which still passes
a datapath number as the ioctl argument.
Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
Nothing was productively using the 'flags' member of odp_flow, so this
commit removes it.
ODPFF_ZERO_TCP_FLAGS isn't used at all (as of the previous commit).
ODPFF_EOF has been replaced by a special case of the 'key_len' member.
This will go away, too, once AF_NETLINK starts being used.
Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
This brings the code closer to what the Netlink interface will need to
implement.
Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
As with n_flows, n_ports was used regularly by userspace to determine how
much memory to allocate when listing ports, but it is no longer needed for
that. max_ports, on the other hand, is necessary but it is also a fixed
value for the kernel datapath right now and if we expand it we can also
come up with a way to report the expanded value.
The remaining members of odp_stats are actually real statistics that I
intend to keep.
Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
This queue information will be available through the kernel socket layer
once we move over to Netlink sockets as the transport, so we might as well get
rid of the redundancy.
Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
Userspace used to use the n_flows information here to decide how much
memory needed to be allocated to list flows, but that isn't necessary any
longer now that listing flows uses an iterator abstraction. The
cur_capacity and max_capacity members are just curiosities and don't
provide much information; if the implementation ever changes away from
the current hash table implementation then they could become meaningless
anyhow.
But more than anything, these aren't really the kind of statistics that
networking people usually care about.
Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
One of the goals for Open vSwitch is to decouple kernel and userspace
software, so that either one can be upgraded or rolled back independent of
the other. To do this in full generality, it must be possible to add new
features to the kernel vport layer without changing userspace software.
The customary way to do this in the Linux networking stack is to use
Netlink and in particular Netlink attributes. This commit adopts that
model for the vport layer. It does not yet actually start using the
Netlink socket layer, which will come later.
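
The property that makes attributes suitable for this is that a receiver can
skip types it does not recognize. The sketch below models that with a
simplified type-length-value layout (16-bit length, 16-bit type, payload
padded to 4 bytes). The VPORT_ATTR_* names and both helper functions are
hypothetical and stand in for the kernel's Netlink attribute helpers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ATTR_HDRLEN   4u
#define ATTR_ALIGN(n) (((n) + 3u) & ~3u)

/* Hypothetical attribute types known to this receiver. */
enum { VPORT_ATTR_PORT_NO = 1, VPORT_ATTR_TYPE = 2, VPORT_ATTR_MAX = 2 };

/* Appends one attribute with a 32-bit payload; returns the new offset. */
static size_t put_u32(uint8_t *buf, size_t off, uint16_t type, uint32_t value)
{
    uint16_t len = (uint16_t)(ATTR_HDRLEN + sizeof value);

    memcpy(buf + off, &len, 2);
    memcpy(buf + off + 2, &type, 2);
    memcpy(buf + off + ATTR_HDRLEN, &value, sizeof value);
    return off + ATTR_ALIGN(len);
}

/*
 * Walks the attributes in buf[0..size) and records a pointer to the payload
 * of each known type in 'attrs'.  Unknown types are skipped, not rejected.
 * Returns 0 on success, -1 if an attribute is malformed.
 */
static int parse_attrs(const uint8_t *buf, size_t size,
                       const uint8_t *attrs[VPORT_ATTR_MAX + 1])
{
    size_t off = 0;

    memset(attrs, 0, (VPORT_ATTR_MAX + 1) * sizeof attrs[0]);
    while (off + ATTR_HDRLEN <= size) {
        uint16_t len, type;

        memcpy(&len, buf + off, 2);
        memcpy(&type, buf + off + 2, 2);
        if (len < ATTR_HDRLEN || len > size - off)
            return -1;
        if (type <= VPORT_ATTR_MAX)
            attrs[type] = buf + off + ATTR_HDRLEN;
        off += ATTR_ALIGN(len);
    }
    return 0;
}
```

A sender built against a newer header can add attributes, and an older
receiver still finds the ones it knows about.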
Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
I plan to make the vport type part of the standard header stuck on each
Netlink message related to a vport. As such, it is more convenient to use
an integer than a string. In addition, by being fundamentally different
from strings, using an integer may reduce the confusion we've had in the
past over the differences in userspace and kernel names for network device
and vport types.
Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>