We currently document that BHs need to be disabled when handling
received packets. However, this isn't actually generally the
case (usually preemption is disabled but not BHs). Only one place
actually relies on BHs being disabled so fix that and update the
documentation of our expectations.
Places that update per-cpu stats without locking need to have bottom
halves disabled. Otherwise we can be running in process context and
in the middle of an operation and be interrupted by a softirq.
We are never called in hardirq context - only process or softirq.
Therefore it is not necessary to disable interrupts with
spin_lock_irqsave(), so use spin_lock_bh() everywhere.
When a 32-bit userspace program runs on a 64-bit kernel, data structures
that contain members whose sizes or alignments change from 32- to 64-bit
must be translated when they are passed to ioctls. This commit adds such
support for openvswitch_mod.
We should really reconsider some parts of the Open vSwitch ioctl interface
to avoid needing as much translation as we do.
Lightly tested with 32-bit userspace on sparc64.
The advantages of the double-underscore variants of copy_to_user(),
copy_from_user(), get_user(), and put_user() are pretty marginal, at best,
in the places where we are using them, and it's not always obvious that we
are making the right calls to access_ok() beforehand. So switch to the
safe variants without double underscores.
Suggested-by: Jesse Gross <jesse@nicira.com>
This commit prepares the core of datapath.c and vport.c to reduce the
amount of new code duplication when the following commit adds support for
32-bit compatibility ioctls. It breaks a number of functions apart into
pairs of functions: one that copies data to and from userspace and another
that does the real work.
This change is a pure refactoring that should not change behavior.
do_flowvec_ioctl() was checking for too-big 'n_flows' but not negative
'n_flows'. We could add that check too, but 'n_flows' should never be
negative so it's better to just use an unsigned type.
Enables checksum offloading, scatter/gather, and TSO on internal
devices. While these optimizations were not previously enabled on
internal ports we already could receive these types of packets from
Xen guests. This has the obvious performance benefits when these
packets can be passed directly to hardware.
There is also a more subtle benefit for GRE on Xen. GRE packets
pass through OVS twice - once before encapsulation and once after
encapsulation, moving through an internal device in the process.
If it is a SG packet (as is common on Xen), a copy was necessary
to linearize for the internal device. However, Xen uses the
memory allocator to track packets so when the original packet is
freed after the copy netback notifies the guest that the packet
has been sent, despite the fact that it is actually sitting in the
transmit queue. The guest then sends packets as fast as the CPU
can handle, overflowing the transmit queue. By enabling SG on
the internal device, we avoid the copy and keep the accounting
correct.
In certain circumstances this patch can decrease performance for
TCP. TCP has its own mechanism for tracking in-flight packets
and therefore does not benefit from the corrected socket accounting.
However, certain NICs do not like SG when it is not being used for
TSO (these packets can no longer be handled by TSO after GRE
encapsulation). These NICs presumably enable SG even though they
can't handle it well because TSO requires SG.
Tested controllers (all 1G):
Marvell 88E8053 (large performance hit)
Broadcom BCM5721 (small performance hit)
Intel 82571EB (no change)
We currently acquire dp_mutex when we are notified that the MTU
of a device attached to the datapath has changed so that we can
set the internal devices to the minimum MTU. However, it is not
required to hold dp_mutex because we already have RTNL lock and it
causes a deadlock, so don't do it.
Specifically, the issue is that DP mutex is acquired twice: once in
dp_device_event() before calling set_internal_devs_mtu() and then
again in internal_dev_change_mtu() when it is actually being changed
(since the MTU can also be set directly). Since it's not a recursive
mutex, deadlock.
During the setup of checksumming pointers we need to make sure that
the transport headers are in the skb linear data area. However, we
don't currently verify that the lengths in the packet headers are
within the size of the packet. This makes that check before a
BUG() check does it for us.
CC: "Nick Couchman" <Nick.Couchman@seakr.com>
I don't see any value in silently truncating device names. Doing so will
sow confusion in userspace. This commit makes too-long device names
return ENAMETOOLONG.
strncpy() does not null-terminate its output buffer if the source string's
length is at least as large as its 'count' argument. We know that the
source and destination buffers are the same size and that the source buffer
is null-terminated, so just use strcpy().
This fixes a kernel BUG message that often occurred when strlen(devname)
was exactly IFNAMSIZ-1. In such a case, if
internal_dev_port.devname[IFNAMSIZ-1] happened to be nonzero, it would
eventually fail the following check in alloc_netdev_mq():
BUG_ON(strlen(name) >= sizeof(dev->name));
Bug #2722.
The strncpy_from_user() function's 'count' argument is documented to
include the trailing null byte, but create_dp() did not include it. This
commit adds it in.
Later kernel versions remove the direction argument from
skb_checksum_help. This provides a compatibility function so we
can have consistent syntax across versions.
Since CHECKSUM_PARTIAL is the same as CHECKSUM_HW on older kernels
this allows a unified code path for computing checksums.
Currently the flow hash table assumes that it is storing flows.
However, we will need additional types of hash tables in the
future so remove assumptions about flows and convert the datapath
to use the new table.
Currently the datapath directly accesses devices through their
Linux functions. Obviously this doesn't work for virtual devices
that are not backed by an actual Linux device. This creates a
new virtual port layer which handles all interaction with devices.
The existing support for Linux devices was then implemented on top
of this layer as two device types. It splits out and renames dp_dev
to internal_dev. There were several places where datapath devices
had to handled in a special manner and this cleans that up by putting
all the special casing in a single location.
Add a tun_id field which contains the ID of the encapsulating tunnel
on which a packet was received (0 if not received on a tunnel). Also
add an action which allows the tunnel ID to be set for outgoing
packets. At this point there aren't any tunnel implementations so
these fields don't have any effect.
The matching is exposed to OpenFlow by overloading the high 32 bits
of the cookie as the tunnel ID. ovs-ofctl is capable of turning
on this special behavior using a new "tun-cookie" command but this
command is intentially undocumented to avoid it being used without
a full understanding of the consequences.
LRO can play fast and loose with the packets that it merges, which
isn't very polite when you are bridging packets for other operating
systems. This disables LRO on any underlying devices that are added
to the datapath, which is the same as what the bridge does.
Note that this does not disable GRO, which has a more strict set of
rules about what is merged and is therefore safe for bridging. Both
are typically done in software anyways.
The checksum computed by hardware on receive stored in skb->csum
when skb->ip_summed == CHECKSUM_COMPLETE is supposed to reflect
the contents of the packet starting at skb->data, which includes
the VLAN tag if there is one. However, when we manipulate the
VLAN tag we don't update the checksum. This leads to all kinds
of nasty warnings about broken hardware, not to mention we can't
take advantage of the checksum that was already computed.
This also fixes some issues with our private checksum type value
on some different kernels and after GSO.
When adding a VLAN tag it is necessary for us to setup checksum
pointers for offloaded packets manually. However, this process
clobbers some of the fields that other components need to determine
the current status. Here we mark the packet with its status upon
ingress in our own format that does not get clobbered and is
consistent across kernel versions.
Bug #2436
OpenFlow 1.0 adds support for matching on IP ToS/DSCP bits.
NOTE: OVS at this point is not wire-compatible with OpenFlow 1.0 until
the final commit in this OpenFlow 1.0 set.
Starting in OpenFlow 0.9, it is possible to match on the VLAN PCP
(priority) field and rewrite the IP ToS/DSCP bits. This check-in
provides that support and bumps the wire protocol number to 0x98.
NOTE: The wire changes come together over the set of OpenFlow 0.9 commits,
so OVS will not be OpenFlow-compatible with any official release between
this commit and the one that completes the set.
'port' is a kernel-space copy of the odp_port and modifying it is useless.
'portp' is the userspace copy; modifying it is useful.
None of our current userspace users care about the port number and so we
never noticed.
Found by sparse (http://sparse.wiki.kernel.org/).
The MTU of the local port should be no larger than the minimum of
the MTUs of the ports attached to the bridge, overwise packets may be
dropped. We already prevent changes to the MTU that would violate
this constraint but don't actuallly proactively set the MTU. This
changes makes everything consistent and matches the behavior of
the bridge.
Some ports of Xen (such as Debian Lenny's) use the old style
Xen checksumming fields on newer kernels. Normally the code that
deals with those fields isn't used at all on newer kernels. This
updates the checksumming pointer code with some changes from Lenny
Xen since it is cleaner and works well with our existing compatibility
layer.
CC:pspreadborough@comcast.net
On older kernels (< 2.6.19) CHECKSUM_HW can mean either that the
checksum has already been computed by hardware or that the checksum
needs to be computed by hardware, depending on whether we are on
the transmit or receive path. Unfortunately since we are in the
middle of these two paths it is impossible to tell which is the
case. Code after us assumes that CHECKSUM_HW means that the
checksum needs to be computed and will panic if there already is
a checksum. On these kernels we mark these packets as CHECKSUM_NONE
before handing them off.
Without this change using certain NICs will cause panics.
No conflicts, but lib/dpif.c needed a few changes since struct dpif's
member "class" was renamed to "dpif_class" in master since sflow was
branched off.
vswitch_skb_checksum_setup() can be defined in datapath.h as a no-op
when defined(CONFIG_XEN) && defined(HAVE_PROTO_DATA_VALID) is false.
Also, skb_checksum_setup(), which was defined similarly, can be dropped
now, since it was unused.
According to Neil McKee, in an email archived at
http://openvswitch.org/pipermail/dev_openvswitch.org/2010-January/000934.html:
The containment rule is that a given sflow-datasource (sampler or
poller) should be scoped within only one sflow-agent (or
sub-agent). So the issue arrises when you have two
switches/datapaths defined on the same host being managed with
the same IP address: each switch is a separate sub-agent, so they
can run independently (e.g. with their own sequence numbers) but
they can't both claim to speak for the same sflow-datasource.
Specifically, they can't both represent the <ifindex>:0
data-source. This containment rule is necessary so that the
sFlow collector can scale and combine the results accurately.
One option would be to stick with the <ifindex>:0 data-source but
elevate it to be global across all bridges, with a global
sample_pool and a global sflow_agent. Not tempting. Better to
go the other way and allow each interface to have it's own
sampler, just as it already has it's own poller. The ifIndex
numbers are globally unique across all switches/datapaths on the
host, so the containment is now clean. Datasource <ifindex>:5
might be on one switch, whille <ifindex>:7 can be on another.
Other benefits are that 1) you can support the option of
overriding the default sampling-rate on an interface-by-interface
basis, and 2) this is how most sFlow implementations are coded,
so there will be no surprises or interoperability issues with any
sFlow collectors out there.
This commit implements the approach suggested by Neil.
This commit uses an atomic_t to represent the sampling pool. This is
because we do want access to it to be atomic, but we expect that it will
"mostly" be accessed from a single CPU at a time. Perhaps this is a bad
assumption; we can always switch to another form of synchronization later.
CC: Neil McKee <neil.mckee@inmon.com>
The purpose of the non-empty variant of vswitch_skb_checksum_setup is to
synchronise the proto_data_valid and proto_csum_blank fields into the
standard skb csum/ip_summed fields, therefore it is more correct to key
off of HAVE_PROTO_DATA_VALID.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
When querying flow stats allow the TCP flags to be reset. Since
the datapath ORs together all flags that have previously been
seen it is otherwise impossible to determine the set of flags from
after a particular time.
When querying flow stats allow the TCP flags to be reset. Since
the datapath ORs together all flags that have previously been
seen it is otherwise impossible to determine the set of flags from
after a particular time.
This merge took a little bit of care due to two issues:
- Crossport of "interface-reconfigure" fixes from master back to
citrix that had happened and needed to be canceled out of the merge.
- New script "refresh-xs-network-uuids" added on citrix branch that
needed to be moved from /root/vswitch/scripts to
/usr/share/vswitch/scripts.
In Linux 2.6.30, the rtnl_notify() return type was changed from int to
void along with the following commit message:
This patch also modifies the rtnetlink code to ignore the return
value of rtnl_notify() in all callers. The function rtnl_notify()
(before this patch) returned the error of the unicast notification
which makes rtnl_set_sk_err() reports errors to all listeners. This
is not of any help since the origin of the change (the socket that
requested the echoing) notices the ENOBUFS error if the notification
fails and should resync itself.
Thus there's no point in checking the return value, even in older versions
of the kernel, and so this commit changes our code to ignore it, even
on older kernel versions. We also update the rtnl_notify() wrapper macros
to make the return type void on older kernel versions.
This has not been tested, just built.
Thanks to Mikio for spurring me to try building with Linux 2.6.29 and
2.6.30.
The VLAN PCP mask is in the rightmost bits of the vlan_pcp member but we
were checking for it in its position in the VLAN tag field instead.
Slightly modified from Jean's original patch by adding and using the
VLAN_PCP_SHIFT macro.
Linux 2.6.27 introduces a new mechanism for sharing STP packets among
kernel modules, which means that the code in datapath.c to avoid loading
when the Linux bridging module is also loaded has false positives. So
fall back on these newer kernels to a less reliable way of avoiding the
bridge module, but one that does not have false positives.
CC: Jean Tourrihles <jt@hpl.hp.com>
Until now, when dp_output_control() queued a GSO packet to userspace, it
would first compute the checksum for the whole GSO packet, then break the
packet into segments. However this had two drawbacks:
1. The checksum had to be recomputed for each segment, wasting time.
2. Linux 2.6.22 and later would emit a warning in skb_gso_segment()
because the checksum was precomputed.
This commit changes dp_output_control() to instead break the packet into
segments, then compute the checksum across each of the segments
individually. This fixes both drawbacks.
This commit has seen light testing on Xen's 2.6.27. It has been build
tested on a few different kernel versions.