I looked at almost every uint<N>_t in the tree to determine whether it was
really in network byte order, and converted the ones that were.
The only remaining ones, modulo my mistakes, are in openflow.h. I'm not
sure whether we should convert those, because there might be some value
in remaining close to upstream for this header.
Only a privileged process can open a raw AF_PACKET socket, so netdev-linux
will fail to initialize if run as non-root and you get a cascade of error
messages, like this:
netdev_linux|ERR|failed to create packet socket: Operation not permitted
netdev|ERR|failed to initialize system network device class: Operation not permitted
netdev|ERR|failed to initialize internal network device class: Operation not permitted
netdev|ERR|failed to initialize tap network device class: Operation not permitted
But in fact the AF_PACKET socket is not needed for most operations (only
for sending packets) and it is never needed for testing with the "dummy"
datapath and network device, so we can avoid logging all of these errors
by opening the packet socket only on demand, as this commit does.
Reviewed-by: Simon Horman <horms@verge.net.au>
An upcoming commit will introduce another function that needs to convert
between rtnl_link_stats64 and netdev_stats, so it seemed best to just add
functions to do the conversion.
Commit 76c308b50d3 "netdev-linux: Support 'send' for netdevs opened with
NETDEV_ETH_TYPE_NONE" broke sending packets to tap devices. Sending a
packet to a tap device with an AF_PACKET socket causes that packet to be
looped back to be received on the tap device again, which obviously isn't
useful.
Obviously correct code is easier on everyone. As the C FAQ says:
20.15c: How can I swap two values without using a temporary?
A: The standard hoary old assembly language programmer's trick is:
a ^= b;
b ^= a;
a ^= b;
But this sort of code has little place in modern, HLL
programming. Temporary variables are essentially free,
and the idiomatic code using three assignments, namely
int t = a;
a = b;
b = t;
is not only clearer to the human reader, it is more likely to be
recognized by the compiler and turned into the most-efficient
code (e.g. using a swap instruction, if available). The latter
code is obviously also amenable to use with pointers and
floating-point values, unlike the XOR trick. See also questions
3.3b and 10.3.
There doesn't appear to be any reason to enforce a minimum min-rate
of 1500Bps on queues. This commit lowers the minimum to 1Bps. A
min-rate of 0 is not allowed by hfsc in the kernel.
One could quite reasonably desire to create a queue with no
min-rate. For example, a default queue could be reasonably
configured without a min-rate or a max-rate. This commit removes
the requirement that min-rate be configured on all queues. If not
configured, defaults to something very small.
Coverity complains that we're copying the unitialized "sin_zero" member
from "sin" into "r". I don't think this is an actual problem, but
there's no harm in zeroing out the structure, either.
Coverity #10916
Static analyzers hate strncpy(). This new function shares its property of
initializing an entire buffer, without its nasty habit of failing to
null-terminate long strings.
Coverity #10697,10696,10695,10694,10693,10692,10691,10690.
Until now, tunnel vports have had a specific MTU, in the same way that
ordinary network devices have an MTU, but treating them this way does not
always make sense. For example, consider a datapath that has three ports:
the local port, a GRE tunnel to another host, and a physical port. If
the physical port is configured with a jumbo MTU, it should be possible to
send jumbo packets across the tunnel: the tunnel can do fragmentation or
the physical port traversed by the tunnel might have a jumbo MTU.
However, until now, tunnels always had a 1500-byte MTU by default. It
could be adjusted using ODP_VPORT_MTU_SET, but nothing actually did this.
One alternative would be to make ovs-vswitchd able to set the vport's MTU.
This commit, however, takes a different approach, of dropping the concept
of MTU entirely for tunnel vports. This also solves the problem described
above, without making any additional work for anyone.
I tested that, without this change, I could not send 1600-byte "pings"
between two machines whose NICs had 2000-byte MTUs that were connected to
vswitches that were in turn connected over GRE tunnels with the default
1500-byte MTU. With this change, it worked OK, regardless of the MTU of
the network traversed by the GRE tunnel.
This patch also makes "patch" ports MTU-less.
It might make sense to remove vport_set_mtu() and the associated callback
now, since ordinary network devices are the only vports that support it
now.
Signed-off-by: Ben Pfaff <blp@nicira.com>
Suggested-by: Jesse Gross <jesse@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
Bug #3728.
This gives network device implementations the opportunity to fetch an
existing device's configuration and store it as their arguments, so that
netdev clients can find out how an existing device is configured.
So far netdev-vport is the only implementation that needs to use this.
The next commit will add use by clients.
Reviewed by Justin Pettit.
I introduced this a long time ago as an efficient way for userspace to find
out whether and where an internal device was attached, but I've always
considered it an ugly kluge. Now that ODP_VPORT_QUERY can fetch a vport's
info regardless of datapath, it is no longer necessary. This commit
stops using Ethtool for this purpose and drops the feature.
Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
When this library was originally implemented, support for Linux 2.4 was
important. The Netlink implementation in Linux only added support for
joining and leaving multicast groups after a socket is bound as of Linux
2.6.14, so the library did not support it either. But the current version
of Open vSwitch targets Linux 2.6.18 and over, so it's fine to add this
support now, and this commit does so.
This will be used more extensively in upcoming commits.
Reviewed by Justin Pettit.
New columns in Interface table: admin_state, link_state, link_speed, duplex,
mtu.
New keys in status map in Interface table: driver_name, driver_version,
firmware_version.
Requested-by: Peter Balland <pballand@nicira.com>
Bug #4299.
This commit fixes warnings caused by the miimon code's breakage of
strict aliasing rules.
Reported-by: Jesse Gross <jesse@nicira.com>
Implemented-by: Ben Pfaff <blp@nicira.com>
This commit removes the tunnel_egress_iface column from the
interface table and moves it's data to the status column. In the
process it reverts the database to version 1.0.0.
Abstracted rtnetlink so that it may be used for messages other than
RTM LINK messages. Created a new rtnetlink-link module which
specifically deals with these kinds of messages and follows the old
rtnetlink API.
netdev_linux_create() called rtnetlink_notifier_register() for both system
and internal devices, but netdev_linux_destroy() only did the reverse
accounting for system devices. This fixes the pairing.
This isn't really much of a bug, since it would only cause the notifier to
be active unnecessarily (not to be removed even though it was needed). At
most it was a missed opportunity for optimization, but I don't think that
optimization would ever happen anyway.
Found with valgrind --leak-check=full --show-reachable=yes.
The parts of the netlink module that are related to sockets are
Linux-specific, since only Linux has AF_NETLINK sockets. The rest can be
built anywhere. This commit breaks them into two modules, and builds the
generic one on all platforms.
Acked-by: Jesse Gross <jesse@nicira.com>
Linux kernel datapath vports have a "set_stats" method. Until now,
internal vports have been handled in the userspace netdev library as
type "system", so the "system" netdevs would try to use the vport
"set_stats" method. Now, however, internal netdevs have been broken out
as a separate netdev type, so only that new type of netdev has to be able
to call into "set_stats". This commit, therefore, removes it from the
"system" netdevs.
For some time now, Open vSwitch datapaths have internally made a
distinction between adding a vport and attaching it to a datapath. Adding
a vport just means to create it, as an entity detached from any datapath.
Attaching it gives it a port number and a datapath. Similarly, a vport
could be detached and deleted separately.
After some study, I think I understand why this distinction exists. It is
because ovs-vswitchd tries to open all the datapath ports before it tries
to create them. However, changing it to create them before it tries to
open them is not difficult, so this commit does this.
The bulk of this commit, however, changes the datapath interface to one
that always creates a vport and attaches it to a datapath in a single step,
and similarly detaches a vport and deletes it in a single step.
Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
Until now, the collection of coverage counters supported by a given OVS
program was not specific to that program. That means that, for example,
even though ovs-dpctl does not have anything to do with mac_learning, it
still has a coverage counter for it. This is confusing, at best.
This commit fixes the problem on some systems, in particular on ones that
use GCC and the GNU linker. It uses the feature of the GNU linker
described in its manual as:
If an orphaned section's name is representable as a C identifier then
the linker will automatically see PROVIDE two symbols: __start_SECNAME
and __end_SECNAME, where SECNAME is the name of the section. These
indicate the start address and end address of the orphaned section
respectively.
Systems that don't support these features retain the earlier behavior.
This commit also fixes the annoyance that files that include coverage
counters must be listed on COVERAGE_FILES in lib/automake.mk.
This commit also fixes the annoyance that modifying any source file that
includes a coverage counter caused all programs that link against
libopenvswitch.a to relink, even programs that the source file was not
linked into. For example, modifying ofproto/ofproto.c (which includes
coverage counters) caused tests/test-aes128 to relink, even though
test-aes128 does not link again ofproto.o.
A few coverage counters were incremented both in netdev generic code and
in netdev_linux code. This commit drops the increments from the
lower-level code.
(This is not an actual bug because these counters are used only for
logging.)
This commit implements the Hierarchical Fair Service Curve queuing
discipline in linux. HFSC performs better at high bandwidth and
implements min-rate proportional sharing of excess bandwidth. Only
a simplified configuration interface is exposed to the user. This
can be expand to allow more tweaking in the future.
This patch defines, by convention, queue 0 as the default queue in
a particular QOS. Thus, if queue 0 is defined, all traffic going
through the relevant interface will be enqueued in it. If queue 0
is not defined then ovs will send the traffic directly through the
interface without applying any policy to it.
A "min-rate" of zero for the "linux-htb" QoS type would cause a divide
by zero exception. This patch prevents that by just returning zero. A
later patch will try to enforce reasonable values for "min-rate".
Bug #3745
Linux kernel queue numbers are one greater than OpenFlow queue numbers, for
HTB anyhow. The code to dump queues wasn't compensating for this, so this
commit fixes it up.
The main advantage of a sparse array over a hash table is that it can be
iterated in numerical order. But the OVS implementation of sparse arrays
is quite expensive in terms of memory: on a 32-bit system, a sparse array
with exactly 1 nonnull element has 512 bytes of overhead. In this case,
the sparse array's property of iteration in numerical order is not
important, so this commit converts it to a hash table to save memory.
All of these changes avoid using the same name for two local variables
within a same function. None of them are actual bugs as far as I can tell,
but any of them could be confusing to the casual reader.
The one in lib/ovsdb-idl.c is particularly brilliant: inner and outer
loops both using (different) variables named 'i'.
Found with GCC -Wshadow.
Much of the code in the GRE implementation is not specific to the
GRE protocol but is actually common to all types of tunnels. In
order to support future types of tunnels, move this code into a
common library.
Signed-off-by: Jesse Gross <jesse@nicira.com>
Adding a macro to define the vlog module in use adds a level of
indirection, which makes it easier to change how the vlog module must be
defined. A followup commit needs to do that, so getting these widespread
changes out of the way first should make that commit easier to review.
Linux traffic control handles with minor number 0 refer to qdiscs, not
to classes. This commit deals with this by using a conversion function:
OpenFlow queue 0 maps to minor 1, queue 1 to minor 2, and so on.
A netdev-linux traffic control implementation has to dump all of a port's
traffic classes in a couple of different situations. start_queue_dump()
is supposed to do that. But it was specifying TC_H_ROOT as tcm_parent,
which only dumped classes that were direct children of the root. This
commit changes tcm_parent to 0, which obtains all traffic classes
regardless of parent.
ovs-vswitchd doesn't declare its QoS capabilities in the database yet,
so the controller has to know what they are. We can add that later.
The linux-htb QoS class has been tested to the extent that I can see that
it sets up the queues I expect when I run "tc qdisc show" and "tc class
show". I haven't tested that the effects on flows are what we expect them
to be. I am sure that there will be problems in that area that we will
have to fix.