This merge took a little bit of care due to two issues:
- Crossport of "interface-reconfigure" fixes from master back to
citrix that had happened and needed to be canceled out of the merge.
- New script "refresh-xs-network-uuids" added on citrix branch that
needed to be moved from /root/vswitch/scripts to
/usr/share/vswitch/scripts.
Commit c874dc6d6b "secchan: Fix behavior when a network device is renamed."
fixed a crash in the datapath when network devices within a datapath were
renamed. However, this missed the case where the device that was renamed
was a datapath's internal port: these devices have their br_port members
set to NULL, so we have to determine that they belong to a datapath another
way. This commit does so.
This commit also changes the initialization order in dp_dev_create().
Otherwise, dp_device_event() will dereference null when it is called via
register_netdevice(), because the newly created device is a datapath device
but its members are not yet initialized.
This was a somewhat difficult merge since there was a fair amount of
superficially divergent development on the two branches, especially in the
datapath.
This has been build-tested against XenServer 5.5.0 and XenServer 5.7.0
build 15122. It has been booted and connected to XenCenter on 5.5.0.
The merge revealed a couple of outstanding bugs, which will be fixed on
citrix and then merged back into master.
The datapath has no problems switching jumbo frames (frames with a payload
greater than 1500 bytes), but it has not supported sending and receiving
them to the device itself. With this commit, the MTU can be set as large
as the minimum MTU size of the devices that are directly attached, or 1500
bytes if there are none. This mimics the behavior of the Linux bridge.
Feature #1736
Before commit 72ca14c1 "datapath: Fix race against workqueue in dp_dev and
simplify code," the dp_dev network device had a device queue, and we would
orphan packets before sticking them on the queue. This screwed up socket
accounting a bit, but the effect was limited to the device queue length.
Now, after that commit, the dp_dev device has no device queue, but it still
orphans packets. This screws up socket accounting a *lot*, because the
effect is now unlimited, since there is no queue to limit it.
The solution is to not orphan packets at all. There is little need for it
now since packet transmission now happens immediately, not in a workqueue
whose execution may be delayed.
This should fix bug #1519, which tests "netperf -t UDP_STREAM" performance,
finding that an unrealistically high number of UDP packets could be sent
but that none at all were received. The send rate is due to the orphaning,
the receive rate presumably because at least one out of approx. 65535/1500
= 44 fragments per full packet were dropped in each case.
This reverts commit 0858ad2b9d.
Although that commit partially fixed a performance regression
relative to the Linux bridge, it introduced other problems that
were not apparent in the performance testing (e.g.
WARN_ON_ONCE(skb->destructor) now triggers in dp_process_received_packet()).
The fix is not completely obvious, so reverting now to ensure stability
while investigating the problem.
Before commit 72ca14c1 "datapath: Fix race against workqueue in dp_dev and
simplify code," the dp_dev network device had a device queue, and we would
orphan packets before sticking them on the queue. This screwed up socket
accounting a bit, but the effect was limited to the device queue length.
Now, after that commit, the dp_dev device has no device queue, but it still
orphans packets. This screws up socket accounting a *lot*, because the
effect is now unlimited, since there is no queue to limit it.
The solution is to not orphan packets at all. There is little need for it
now since packet transmission now happens immediately, not in a workqueue
whose execution may be delayed.
This should fix bug #1519, which tests "netperf -t UDP_STREAM" performance,
finding that an unrealistically high number of UDP packets could be sent
but that none at all were received. The send rate is due to the orphaning,
the receive rate presumably because at least one out of approx. 65535/1500
= 44 fragments per full packet were dropped in each case.
The dp_dev_destroy() function failed to cancel the xmit_queue work, which
allowed it to run after the device had been destroyed, accessing freed
memory. However, simply canceling the work with cancel_work_sync() would
be insufficient, since other packets could get queued while the work
function was running. Stopping the queue with netif_tx_disable() doesn't
help, because the final action in dp_dev_do_xmit() is to re-enable the TX
queue.
This issue led me to re-examine why the dp_dev needs to use a work_struct
at all. This was implemented in commit 71f13ed0b "Send of0 packets from
workqueue, to avoid recursive locking of ofN device" due to a complaint
from lockdep about recursive locking.
However, there's no actual reason that we need any locking around
dp_dev_xmit(). Until now, it has accepted the standard locking provided
by the network stack. But looking at the other software devices (veth,
loopback), those use NETIF_F_LLTX, which disables this locking, and
presumably do so for this very reason. In fact, the lwn article at
http://lwn.net/Articles/121566/ hints that NETIF_F_LLTX, which is otherwise
discouraged in the kernel, is acceptable for "certain types of software
device."
So this commit switches to using NETIF_F_LLTX for dp_dev and gets rid
of the work_struct.
In the process, I noticed that veth and loopback also take advantage of
a network device destruction "hook" using the net_device "destructor"
member. Using this we can automatically get called on network device
destruction at the point where rtnl_unlock() is called. This allows us
to stop stringing the dp_devs that are being destroyed onto a list so
that we can free them, and thus simplifies the code along all the paths
that call dp_dev_destroy().
This commit gets rid of a call to synchronize_rcu() (disguised as a call
to synchronize_net(), which is a macro that expands to synchronize_rcu()),
so it probably speeds up deleting ports, too.
The userspace tools were allowing the name of any internal port to be used
to identify a datapath. This, however, makes it hard to enumerate all the
names by which a datapath can be known, and it was never documented or
intentional behavior, so this commit disables it.