strncpy() does not null-terminate its output buffer if the source string's
length is at least as large as its 'count' argument. We know that the
source and destination buffers are the same size and that the source buffer
is null-terminated, so just use strcpy().
This fixes a kernel BUG message that often occurred when strlen(devname)
was exactly IFNAMSIZ-1. In such a case, if
internal_dev_port.devname[IFNAMSIZ-1] happened to be nonzero, it would
eventually fail the following check in alloc_netdev_mq():
BUG_ON(strlen(name) >= sizeof(dev->name));
Bug #2722.
The strncpy_from_user() function's 'count' argument is documented to
include the trailing null byte, but create_dp() did not include it. This
commit adds it in.
Later kernel versions remove the direction argument from
skb_checksum_help. This provides a compatibility function so we
can have consistent syntax across versions.
Since CHECKSUM_PARTIAL is the same as CHECKSUM_HW on older kernels
this allows a unified code path for computing checksums.
Currently the flow hash table assumes that it is storing flows.
However, we will need additional types of hash tables in the
future so remove assumptions about flows and convert the datapath
to use the new table.
Currently the datapath directly accesses devices through their
Linux functions. Obviously this doesn't work for virtual devices
that are not backed by an actual Linux device. This creates a
new virtual port layer which handles all interaction with devices.
The existing support for Linux devices was then implemented on top
of this layer as two device types. It splits out and renames dp_dev
to internal_dev. There were several places where datapath devices
had to handled in a special manner and this cleans that up by putting
all the special casing in a single location.
Add a tun_id field which contains the ID of the encapsulating tunnel
on which a packet was received (0 if not received on a tunnel). Also
add an action which allows the tunnel ID to be set for outgoing
packets. At this point there aren't any tunnel implementations so
these fields don't have any effect.
The matching is exposed to OpenFlow by overloading the high 32 bits
of the cookie as the tunnel ID. ovs-ofctl is capable of turning
on this special behavior using a new "tun-cookie" command but this
command is intentially undocumented to avoid it being used without
a full understanding of the consequences.
LRO can play fast and loose with the packets that it merges, which
isn't very polite when you are bridging packets for other operating
systems. This disables LRO on any underlying devices that are added
to the datapath, which is the same as what the bridge does.
Note that this does not disable GRO, which has a more strict set of
rules about what is merged and is therefore safe for bridging. Both
are typically done in software anyways.
The checksum computed by hardware on receive stored in skb->csum
when skb->ip_summed == CHECKSUM_COMPLETE is supposed to reflect
the contents of the packet starting at skb->data, which includes
the VLAN tag if there is one. However, when we manipulate the
VLAN tag we don't update the checksum. This leads to all kinds
of nasty warnings about broken hardware, not to mention we can't
take advantage of the checksum that was already computed.
This also fixes some issues with our private checksum type value
on some different kernels and after GSO.
When adding a VLAN tag it is necessary for us to setup checksum
pointers for offloaded packets manually. However, this process
clobbers some of the fields that other components need to determine
the current status. Here we mark the packet with its status upon
ingress in our own format that does not get clobbered and is
consistent across kernel versions.
Bug #2436
OpenFlow 1.0 adds support for matching on IP ToS/DSCP bits.
NOTE: OVS at this point is not wire-compatible with OpenFlow 1.0 until
the final commit in this OpenFlow 1.0 set.
Starting in OpenFlow 0.9, it is possible to match on the VLAN PCP
(priority) field and rewrite the IP ToS/DSCP bits. This check-in
provides that support and bumps the wire protocol number to 0x98.
NOTE: The wire changes come together over the set of OpenFlow 0.9 commits,
so OVS will not be OpenFlow-compatible with any official release between
this commit and the one that completes the set.
'port' is a kernel-space copy of the odp_port and modifying it is useless.
'portp' is the userspace copy; modifying it is useful.
None of our current userspace users care about the port number and so we
never noticed.
Found by sparse (http://sparse.wiki.kernel.org/).
The MTU of the local port should be no larger than the minimum of
the MTUs of the ports attached to the bridge, overwise packets may be
dropped. We already prevent changes to the MTU that would violate
this constraint but don't actuallly proactively set the MTU. This
changes makes everything consistent and matches the behavior of
the bridge.
Some ports of Xen (such as Debian Lenny's) use the old style
Xen checksumming fields on newer kernels. Normally the code that
deals with those fields isn't used at all on newer kernels. This
updates the checksumming pointer code with some changes from Lenny
Xen since it is cleaner and works well with our existing compatibility
layer.
CC:pspreadborough@comcast.net
On older kernels (< 2.6.19) CHECKSUM_HW can mean either that the
checksum has already been computed by hardware or that the checksum
needs to be computed by hardware, depending on whether we are on
the transmit or receive path. Unfortunately since we are in the
middle of these two paths it is impossible to tell which is the
case. Code after us assumes that CHECKSUM_HW means that the
checksum needs to be computed and will panic if there already is
a checksum. On these kernels we mark these packets as CHECKSUM_NONE
before handing them off.
Without this change using certain NICs will cause panics.
No conflicts, but lib/dpif.c needed a few changes since struct dpif's
member "class" was renamed to "dpif_class" in master since sflow was
branched off.
vswitch_skb_checksum_setup() can be defined in datapath.h as a no-op
when defined(CONFIG_XEN) && defined(HAVE_PROTO_DATA_VALID) is false.
Also, skb_checksum_setup(), which was defined similarly, can be dropped
now, since it was unused.
According to Neil McKee, in an email archived at
http://openvswitch.org/pipermail/dev_openvswitch.org/2010-January/000934.html:
The containment rule is that a given sflow-datasource (sampler or
poller) should be scoped within only one sflow-agent (or
sub-agent). So the issue arrises when you have two
switches/datapaths defined on the same host being managed with
the same IP address: each switch is a separate sub-agent, so they
can run independently (e.g. with their own sequence numbers) but
they can't both claim to speak for the same sflow-datasource.
Specifically, they can't both represent the <ifindex>:0
data-source. This containment rule is necessary so that the
sFlow collector can scale and combine the results accurately.
One option would be to stick with the <ifindex>:0 data-source but
elevate it to be global across all bridges, with a global
sample_pool and a global sflow_agent. Not tempting. Better to
go the other way and allow each interface to have it's own
sampler, just as it already has it's own poller. The ifIndex
numbers are globally unique across all switches/datapaths on the
host, so the containment is now clean. Datasource <ifindex>:5
might be on one switch, whille <ifindex>:7 can be on another.
Other benefits are that 1) you can support the option of
overriding the default sampling-rate on an interface-by-interface
basis, and 2) this is how most sFlow implementations are coded,
so there will be no surprises or interoperability issues with any
sFlow collectors out there.
This commit implements the approach suggested by Neil.
This commit uses an atomic_t to represent the sampling pool. This is
because we do want access to it to be atomic, but we expect that it will
"mostly" be accessed from a single CPU at a time. Perhaps this is a bad
assumption; we can always switch to another form of synchronization later.
CC: Neil McKee <neil.mckee@inmon.com>
The purpose of the non-empty variant of vswitch_skb_checksum_setup is to
synchronise the proto_data_valid and proto_csum_blank fields into the
standard skb csum/ip_summed fields, therefore it is more correct to key
off of HAVE_PROTO_DATA_VALID.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
When querying flow stats allow the TCP flags to be reset. Since
the datapath ORs together all flags that have previously been
seen it is otherwise impossible to determine the set of flags from
after a particular time.
When querying flow stats allow the TCP flags to be reset. Since
the datapath ORs together all flags that have previously been
seen it is otherwise impossible to determine the set of flags from
after a particular time.
This merge took a little bit of care due to two issues:
- Crossport of "interface-reconfigure" fixes from master back to
citrix that had happened and needed to be canceled out of the merge.
- New script "refresh-xs-network-uuids" added on citrix branch that
needed to be moved from /root/vswitch/scripts to
/usr/share/vswitch/scripts.
In Linux 2.6.30, the rtnl_notify() return type was changed from int to
void along with the following commit message:
This patch also modifies the rtnetlink code to ignore the return
value of rtnl_notify() in all callers. The function rtnl_notify()
(before this patch) returned the error of the unicast notification
which makes rtnl_set_sk_err() reports errors to all listeners. This
is not of any help since the origin of the change (the socket that
requested the echoing) notices the ENOBUFS error if the notification
fails and should resync itself.
Thus there's no point in checking the return value, even in older versions
of the kernel, and so this commit changes our code to ignore it, even
on older kernel versions. We also update the rtnl_notify() wrapper macros
to make the return type void on older kernel versions.
This has not been tested, just built.
Thanks to Mikio for spurring me to try building with Linux 2.6.29 and
2.6.30.
The VLAN PCP mask is in the rightmost bits of the vlan_pcp member but we
were checking for it in its position in the VLAN tag field instead.
Slightly modified from Jean's original patch by adding and using the
VLAN_PCP_SHIFT macro.
Linux 2.6.27 introduces a new mechanism for sharing STP packets among
kernel modules, which means that the code in datapath.c to avoid loading
when the Linux bridging module is also loaded has false positives. So
fall back on these newer kernels to a less reliable way of avoiding the
bridge module, but one that does not have false positives.
CC: Jean Tourrihles <jt@hpl.hp.com>
Until now, when dp_output_control() queued a GSO packet to userspace, it
would first compute the checksum for the whole GSO packet, then break the
packet into segments. However this had two drawbacks:
1. The checksum had to be recomputed for each segment, wasting time.
2. Linux 2.6.22 and later would emit a warning in skb_gso_segment()
because the checksum was precomputed.
This commit changes dp_output_control() to instead break the packet into
segments, then compute the checksum across each of the segments
individually. This fixes both drawbacks.
This commit has seen light testing on Xen's 2.6.27. It has been build
tested on a few different kernel versions.
Before transimitting a packet, the datapath checks that the packet
length is not greater than the MTU. It determines the length based on
the 'protocol' field in the skb. If 'protocol' is ETH_P_8021Q, it reduces
the packet length as stored in the 'len' field by four bytes, which
is the size of a VLAN tag header. Unfortunately, packets that arrived
from userspace were not having the 'protocol' field set, which would
cause MTU-sized packets to be dropped. This commit sets the 'protocol'
field appropriately.
Thanks to Ben Pfaff for the help diagnosing this issue.
NIC-17 and NIC-26
The hash table used until now in the kernel datapath for storing the flow
table provides only two slots that a given flow can occupy. If both of
those slots are already full, for a given flow, then that flow cannot be
added at all and its packets must be handled entirely in userspace, taking
a performance hit. The code does attempt to compensate for this by making
the flow table rather large: 8 slots per flow actually in the flow table.
In practice, this is usually good enough, but some of the tests that we
have run show bad enough performance degradation or even timeouts of
various kinds that we want to implement something better.
This commit replaces the existing hash table by one with a completely
different design in which buckets are flexibly sized and can accept any
number of collisions. By use of suitable levels of indirection, this
design is both simple and RCU-compatible. I did consider other schemes,
but none of the ones that I came up with shared both of those two
properties.
This commit also adds kerneldoc comments for all of the flow table
non-static functions and data structures.
This has been lightly tested for correctness. It has not been tested for
performance.
Bug #1656. Bug #1851.
The code on one side of this #if fork was difficult to test until Xen
upgraded to a new enough kernel that it would exercise it. Later Xen
kernels are now available and this code path has been tested, at least to
some extent, so remove the warning.
Thanks to Ian Campbell <Ian.Campbell@citrix.com> for pointing out the
warning.
This was a somewhat difficult merge since there was a fair amount of
superficially divergent development on the two branches, especially in the
datapath.
This has been build-tested against XenServer 5.5.0 and XenServer 5.7.0
build 15122. It has been booted and connected to XenCenter on 5.5.0.
The merge revealed a couple of outstanding bugs, which will be fixed on
citrix and then merged back into master.
For newer kernels the checksum setup is done at the point the skb is
received in netback or netfront so there is no more need to sprinkle
skb_checksum_setup calls throughout the kernel.