ovs-vswitchd doesn't declare its QoS capabilities in the database yet,
so the controller has to know what they are. We can add that later.
The linux-htb QoS class has been tested to the extent that I can see that
it sets up the queues I expect when I run "tc qdisc show" and "tc class
show". I haven't tested that the effects on flows are what we expect them
to be. I am sure that there will be problems in that area that we will
have to fix.
This simplifies a bit of existing code since it is known that an rtnetlink
socket will always be available. It will simplify additional code in
upcoming commits.
These two functions use their "sock" parameter only to figure out the
nlmsg_pid to put in the nlmsghdr. But that field can be filled in just
as well right before sending the message. Since our functions for sending
Netlink messages always modify the nlmsghdr anyhow (to fill in the length),
there is little benefit to filling in the nlmsg_pid in advance. The cost,
on the other hand, is having to pass another argument to functions that
already have too many. So this commit removes the argument.
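The idea of filling in the header at send time can be sketched as follows. This is only an illustration, not the code from this commit: `struct msg_hdr` is a hypothetical stand-in that mirrors the fields of the kernel's `struct nlmsghdr`, and `finalize_header()` is an invented name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in with the same fields as the kernel's
 * struct nlmsghdr, so that the sketch builds without Linux headers. */
struct msg_hdr {
    uint32_t nlmsg_len;    /* Total message length, header included. */
    uint16_t nlmsg_type;
    uint16_t nlmsg_flags;
    uint32_t nlmsg_seq;
    uint32_t nlmsg_pid;    /* Sender's port ID. */
};

/* Called by the send path just before transmission.  The length has to be
 * patched here anyway, so the sender's pid can be patched in the same
 * place, and message-building functions need no socket argument. */
static void
finalize_header(struct msg_hdr *hdr, size_t total_len, uint32_t sender_pid)
{
    assert(hdr != NULL);
    assert(total_len >= sizeof *hdr);
    hdr->nlmsg_len = (uint32_t) total_len;
    hdr->nlmsg_pid = sender_pid;
}
```

Fields set while building the message (type, flags, sequence number) are left untouched.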
In certain cases we require the ability to provide stats that are
added to the values collected by the kernel (currently only used
by bond fake devices). Internal devices previously implemented
this directly, but now that their stats are handled by the vport
layer the functionality has been moved there. This removes the
userspace code to set the stats and replaces it with a mechanism
to access the equivalent functionality in the vport layer.
The vport layer has the ability to track stats using 64-bit counters,
even if the kernel is only 32-bit. This first attempts to collect
stats from these counters if they are available and otherwise falls
back to the normal Linux interfaces.
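The fallback described above might look roughly like this. The struct and function names are hypothetical and only two counters are shown; a null `vport` pointer stands for "64-bit counters not available".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, trimmed-down stats records. */
struct stats64 { uint64_t rx_packets, tx_packets; };
struct stats32 { uint32_t rx_packets, tx_packets; };

/* Fills '*out' from the 64-bit counters when 'vport' is nonnull, and
 * otherwise widens the 32-bit values from the normal interface. */
static void
collect_stats(const struct stats64 *vport, const struct stats32 *legacy,
              struct stats64 *out)
{
    assert(legacy != NULL && out != NULL);
    if (vport) {
        *out = *vport;
    } else {
        out->rx_packets = legacy->rx_packets;
        out->tx_packets = legacy->tx_packets;
    }
}
```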
Tap devices can have two FDs that allow transmit and receive from
different perspectives. We previously would always share one of
the FDs among all openers. However, this is confusing to some
users (primarily the DHCP client) which expect tap devices to behave
like any other device. Now we give the tap FD to the first opener,
which knows that it has opened a tap device, and a normal system FD
to everyone else for consistency.
For tap and internal devices we swap the transmit and receive stats
to appear consistent with other devices. However, the check whether
to store the stats in a temporary location before the swap did not
include tap devices, which led to the use of uninitialized memory
when the swap occurred.
If we attempt to remove ingress policing and receive "invalid
argument" it means that policing isn't compiled into the kernel.
If it isn't compiled in then accept that policing has been
successfully removed.
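As a minimal sketch of that rule, assuming the removal request reports its result as 0 or a positive errno value (the function name is hypothetical):

```c
#include <assert.h>
#include <errno.h>

/* Interprets the result of a request to remove ingress policing.
 * 'error' is 0 on success or a positive errno value. */
static int
interpret_remove_policing_result(int error)
{
    assert(error >= 0);
    /* EINVAL is taken to mean that policing is not compiled into the
     * kernel, so there was nothing to remove in the first place. */
    return error == EINVAL ? 0 : error;
}
```

Any other error is still passed back to the caller unchanged.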
Now that we have a new patch implementation, remove the veth driver
and its userspace components. Then rename 'patchnew' to 'patch'.
The new implementation is a drop-in replacement for the old one.
It is very expensive to start a subprocess and, especially, to wait for it
to complete. This replaces the most common subprocess operation in
netdev_linux_set_policing() by a Netlink socket operation, which is much
faster.
Without this and the other netdev-linux commits, my 1000-interface test
case runs in 1 min 48 s. With them, it runs in 25 seconds.
The new GRE implementation provides a complete drop-in replacement
for the old Linux based implementation. Therefore, remove the
old implementation and rename "grenew" to "gre".
We allocate struct netdev_linux which contains struct netdev but
free the netdev. In practice this makes no difference because the
netdev is the first member of the struct, but we should be correct
anyway.
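The distinction can be illustrated with simplified stand-ins for the two structs (the field lists and function names here are assumptions, not the real definitions). The destroy path recovers the containing struct and frees that, which stays correct even if the embedded member is ever moved.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Simplified stand-ins for the real structures. */
struct netdev { const char *name; };
struct netdev_linux {
    struct netdev netdev;   /* Embedded base object. */
    int fd;
};

/* Allocates the containing struct and hands out the embedded member. */
static struct netdev *
example_create(const char *name)
{
    struct netdev_linux *dev;

    assert(name != NULL);
    dev = malloc(sizeof *dev);
    if (!dev) {
        return NULL;
    }
    dev->netdev.name = name;
    dev->fd = -1;
    return &dev->netdev;
}

/* Frees the pointer that malloc() actually returned, not the member. */
static void
example_destroy(struct netdev *netdev)
{
    if (netdev) {
        struct netdev_linux *dev = (struct netdev_linux *)
            ((char *) netdev - offsetof(struct netdev_linux, netdev));
        free(dev);
    }
}
```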
When receiving a change notification from rtnetlink we checked whether
a netdev of that name existed and if so tried to handle it. This also
checks that the type of the device is one handled by netdev-linux.
This commit introduces a new netdev type called "patch". A patch is a
pair of interfaces, in which frames sent through one of the devices
pop out of the other. This is useful for linking together datapaths.
A patch's only argument on creation is "peer", which specifies the other
side of the patch. A patch must be created in pairs, so a second netdev
must be created with the "name" and "peer" values reversed.
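The pairing rule can be expressed as a small check. This is a sketch only: `struct patch_config` and the function name are hypothetical and do not come from the netdev code.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical record of the two values that define one side of a patch. */
struct patch_config {
    const char *name;   /* This netdev's name. */
    const char *peer;   /* Value of its "peer" argument. */
};

/* Returns true if 'a' and 'b' describe the two sides of one patch, that
 * is, each one's "peer" is the other's "name". */
static bool
patch_configs_are_paired(const struct patch_config *a,
                         const struct patch_config *b)
{
    assert(a && b && a->name && a->peer && b->name && b->peer);
    return !strcmp(a->name, b->peer) && !strcmp(a->peer, b->name);
}
```

For example, a netdev named "patch0" with peer "patch1" pairs with one named "patch1" with peer "patch0".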
The current implementation is built using veth devices. Further, it's
limited to the veth devices which support configuration through sysfs.
This limits the ability to use a "patch" on 2.6.18 kernels using the
veth device we include (read: flavors of XenServer 5.5). In the not too
distant future, the implementation will be modified to use the new
kernel port abstraction introduced by Jesse Gross's forthcoming GRE
work. At that point, patch devices will work on any Linux platform
supported by OVS.
This allows path MTU discovery to properly work when used with
bridging. While there was previously support for PMTUD it used
the kernel's IP stack. This works fine for routing but when
bridging it is possible that a complete network is operating over
the bridge that the kernel has no knowledge of and the ICMP
fragmentation needed packets are lost.
When a packet arrives that is above the MTU of the tunnel, an
ICMP message is synthesized and sent back on the device that the
original packet came from. This does not rely on the kernel IP
stack and is therefore independent of the routing table. Both
IPv4 and IPv6 are supported, including over VLANs. Other types
of packets that are over the MTU are encapsulated and the outer
packets are fragmented.
This entire functionality is a layer violation since bridging
operates at layer 2 and fragmentation is a function of layer 3.
For this reason it is possible to disable PMTUD, which will
provide complete transparency but will cause the outer IP packets
to be fragmented.
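The behavior described in the last three paragraphs boils down to a three-way decision per packet. The sketch below is one reading of that description, with invented names, and it ignores details such as VLAN headers and encapsulation overhead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum oversize_action {
    ACTION_FORWARD,         /* Packet fits; encapsulate and send. */
    ACTION_SEND_ICMP,       /* Synthesize an ICMP error to the sender. */
    ACTION_FRAGMENT_OUTER   /* Encapsulate, then fragment outer packet. */
};

/* 'is_ip' is true for IPv4 and IPv6 inner packets.  'pmtud_enabled'
 * reflects whether path MTU discovery has been left on for the tunnel. */
static enum oversize_action
decide_oversize_action(size_t packet_len, size_t tunnel_mtu,
                       bool is_ip, bool pmtud_enabled)
{
    assert(tunnel_mtu > 0);
    if (packet_len <= tunnel_mtu) {
        return ACTION_FORWARD;
    }
    if (pmtud_enabled && is_ip) {
        return ACTION_SEND_ICMP;
    }
    return ACTION_FRAGMENT_OUTER;
}
```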
Currently the TTL is copied from the inner packet of the tunnel to
the outer packet if the inner packet is IP. This is good if your
GRE packets might make it into the input of your device but bad
if you want to be fully transparent.
This also resolves an inconsistency between tunnels set up using
the ioctl and using Netlink. The ioctl version would force PMTUD
on if a fixed TTL is set as a backup way to prevent loops but it
never made it over to the newer Netlink code so obviously no one
cares too much about it. This removes it to provide consistency
and transparency.
Basically, don't create loops and you will be happy.
If we are using netlink to get stats and get_ifindex() fails, then for
an internal network device we will then swap around a bunch of
indeterminate (uninitialized) data values. That won't hurt anything--the
caller will still set them to all-1-bits due to the error--but it still
seems wrong. So this commit avoids it.
Found using Clang (http://clang-analyzer.llvm.org/).
We previously maintained a list of open devices inside of the
linux netdev. Since the netdev library now maintains this list,
it is better to use that list instead of our own.
Never close the file descriptor if it is 0, since it is never a
valid FD in this context. Also initialize the FD to -1 so that
it is never set to a valid but incorrect value.
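A minimal sketch of the two rules, using a hypothetical `struct example_dev` in place of the real device structure:

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical device record holding one file descriptor. */
struct example_dev {
    int fd;
};

/* Starts with an FD value that can never name an open file. */
static void
example_dev_init(struct example_dev *dev)
{
    assert(dev != NULL);
    dev->fd = -1;
}

/* Closes the device's FD only if it is positive.  Returns 1 if close()
 * was called and 0 otherwise.  FD 0 is skipped because in this context
 * it can only be a leftover zero, never an FD the device opened. */
static int
example_dev_close_fd(struct example_dev *dev)
{
    assert(dev != NULL);
    if (dev->fd > 0) {
        close(dev->fd);
        dev->fd = -1;
        return 1;
    }
    return 0;
}
```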
We were storing a struct netdev_dev_linux ** instead of a
netdev_dev_linux * in the cache map. This prevented the cache
from being invalidated on changes such as link status.
The default burst rate was 10 kb. This increases it to 1000 kb, since
we were having problems getting traffic through at 10 kb. A better value
probably exists between these two points, but that will require
additional experimentation.
Calling close(0) at random points is bad. It means that the next call to
socket() or open() returns fd 0. Then the next time a netdev gets closed,
that socket or file fd gets closed too, and you end up with weird "Bad
file descriptor" errors.
Found by installing the following as lib/unistd.h in the source tree:
    #ifndef UNISTD_H
    #define UNISTD_H 1

    #include <stdlib.h>
    #include_next <unistd.h>

    #undef close
    #define close(fd) rpl_close(fd)

    static inline int rpl_close(int fd)
    {
        if (!fd) {
            abort();
        }
        return (close)(fd);
    }

    #endif
TAP devices need to be treated slightly differently from other
devices because they cannot be opened multiple times. Instead we
open them once and share the file descriptor. This means that if
the netdev is opened multiple times one reader can drain the buffers
of another. While this is a deviation from the normal convention,
it does not impact current or planned users.
In addition, this cleans up some confusion between the file
descriptor for tap devices versus other FDs.
This builds on earlier work that implemented netdev object refcounting.
However, rather than requiring explicit create and destroy calls,
these operations are now performed automatically based on the reference
count. This is important because in certain situations it is not
possible to know whether a netdev has already been created. A
workaround existed (which looked fairly similar to this paradigm) but
introduced its own issues. This simplifies and unifies the API.
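The paradigm can be sketched with a single-slot registry. Everything here is a simplified assumption (one device, invented names); the real code keeps a table of devices indexed by name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical shared device object. */
struct example_dev {
    int ref_cnt;
};

/* Single-slot registry; null while no one has the device open. */
static struct example_dev *the_dev;

/* Opening creates the device on first use and otherwise just takes
 * another reference, so callers need not know whether it exists yet. */
static struct example_dev *
example_open(void)
{
    if (!the_dev) {
        the_dev = calloc(1, sizeof *the_dev);
        if (!the_dev) {
            return NULL;
        }
    }
    the_dev->ref_cnt++;
    return the_dev;
}

/* Closing drops a reference and destroys the device with the last one. */
static void
example_close(struct example_dev *dev)
{
    assert(dev != NULL && dev == the_dev && dev->ref_cnt > 0);
    if (--dev->ref_cnt == 0) {
        free(dev);
        the_dev = NULL;
    }
}
```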
The latest version of GCC flags a common socket convention as breaking
strict-aliasing rules. This commit removes the aliasing and gets rid of
the scary warning.
This implements the userspace portion of GRE on Linux. It communicates
with the kernel module to set up tunnels using either Netlink or ioctls
as appropriate based on the kernel version.
Significant portions of this commit were actually written by
Justin Pettit.
This change adds netdev_create() and netdev_destroy() functions to allow
the creation of network devices through the netdev library. Previously,
network devices had to already exist or be created on demand through
netdev_open(). This caused problems such as not being able to specify
TAP devices as ports in ovs-vswitchd, which this patch fixes.
This also lays the groundwork for adding GRE and VDE support.
The comment on netdev_get_features() claimed that all of the passed-in
values were set to 0 on failure, but the implementation didn't live up
to the promise.
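One way to make an implementation keep such a promise is to clear the outputs in a single place whenever the underlying query fails. The sketch below uses only two output values and hypothetical names; the two `*_query` functions are demo backends, not real netdev code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

typedef int (*query_fn)(uint32_t *current, uint32_t *supported);

/* Demo backend that fails after scribbling on its outputs. */
static int
failing_query(uint32_t *current, uint32_t *supported)
{
    *current = 0xdeadbeef;
    *supported = 0xdeadbeef;
    return ENODEV;
}

/* Demo backend that succeeds with arbitrary feature bits. */
static int
working_query(uint32_t *current, uint32_t *supported)
{
    *current = 0x20;
    *supported = 0x3f;
    return 0;
}

/* Runs 'query' and guarantees that both outputs are 0 on failure, as the
 * function comment promises.  Returns 0 or a positive errno value. */
static int
get_features_checked(query_fn query, uint32_t *current, uint32_t *supported)
{
    int error;

    assert(query != NULL && current != NULL && supported != NULL);
    error = query(current, supported);
    if (error) {
        *current = *supported = 0;
    }
    return error;
}
```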
CC: Paul Ingram <paul@nicira.com>
Fixes a bug whereby netdev_linux_set_etheraddr() would update the cached
Ethernet address but not mark it valid. (This potentially wasted a system
call later but wasn't harmful.)
As an added optimization, don't set the Ethernet address at all if the
new address is the same as the current address.
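Both changes can be sketched together. The struct below is a hypothetical stand-in for the device cache, and a counter replaces the real system call so the effect of the optimization is visible.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define EXAMPLE_ETH_ADDR_LEN 6

/* Hypothetical device cache. */
struct example_dev {
    uint8_t etheraddr[EXAMPLE_ETH_ADDR_LEN];
    bool etheraddr_valid;
    int n_syscalls;         /* Counts simulated kernel calls. */
};

/* Sets the device's Ethernet address.  Skips the kernel call when the
 * cached address is valid and already equal to 'mac'; otherwise updates
 * the cache and marks it valid. */
static int
example_set_etheraddr(struct example_dev *dev,
                      const uint8_t mac[EXAMPLE_ETH_ADDR_LEN])
{
    assert(dev != NULL && mac != NULL);
    if (dev->etheraddr_valid
        && !memcmp(dev->etheraddr, mac, EXAMPLE_ETH_ADDR_LEN)) {
        return 0;
    }
    dev->n_syscalls++;      /* Stand-in for the real system call. */
    memcpy(dev->etheraddr, mac, EXAMPLE_ETH_ADDR_LEN);
    dev->etheraddr_valid = true;
    return 0;
}
```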
netdev_linux_receive was returning positive error codes while the
interface specifies that it should be returning negative errors.
This difference causes a huge increase in (nonexistent) packet
processing with the userspace datapath.
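The convention matters because a caller cannot tell a positive errno value from a byte count. A sketch of a receive function that follows it, built on plain `read()` and a hypothetical name:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Reads one chunk from 'fd' into 'buf'.  Returns the number of bytes
 * read (possibly 0) or a negative errno value, so that errors can never
 * be mistaken for received data. */
static int
example_receive(int fd, void *buf, size_t size)
{
    ssize_t n;

    assert(buf != NULL);
    do {
        n = read(fd, buf, size);
    } while (n < 0 && errno == EINTR);
    return n >= 0 ? (int) n : -errno;
}
```

A caller can then test `retval < 0` for failure and treat anything else as a length.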
Whether a port is internal is cached to avoid requerying the kernel
every time stats are requested. However, the cache validity bit was
never being set so the cache wasn't used. This corrects that
oversight.
Thanks to Ben Pfaff for noticing.
Internal ports appear to have their transmit and receive stats swapped
because from the kernel's point of view these ports are acting like
the machine connected to the switch, not the switch itself. This swaps
the stats for consistency with other ports.
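The swap itself is simple. Here is a sketch over a hypothetical, trimmed-down stats struct; the real structure has more rx/tx counter pairs, each of which would be swapped the same way.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stats record with two rx/tx counter pairs. */
struct example_stats {
    uint64_t rx_packets, tx_packets;
    uint64_t rx_bytes, tx_bytes;
};

static void
swap_u64(uint64_t *a, uint64_t *b)
{
    uint64_t tmp = *a;
    *a = *b;
    *b = tmp;
}

/* Exchanges every receive counter with its transmit counterpart, so that
 * the stats read from the switch's point of view. */
static void
swap_rx_tx(struct example_stats *stats)
{
    assert(stats != NULL);
    swap_u64(&stats->rx_packets, &stats->tx_packets);
    swap_u64(&stats->rx_bytes, &stats->tx_bytes);
}
```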
It was getting to be too confusing to have both netdev_linux_* functions
and linux_netdev_* functions. Rename the latter to make the distinction
more obvious. "rtnetlink" seems to be a fairly good name because that's
what the kernel calls it, so the name will be familiar at least to people
who know about rtnetlink.
This new abstraction layer allows multiple implementations of network
devices in a single running process. This will be useful, for example, to
support network devices that are simulated entirely in the running process
or that communicate with other processes over Unix domain sockets, etc.
The reimplemented tap device support in this commit has not been tested.