mirror of
https://github.com/openvswitch/ovs
synced 2025-10-17 14:28:02 +00:00
Suggested-by: Justin Pettit <jpettit@nicira.com> Suggested-by: Michael Mao <mmao@nicira.com>
263 lines
12 KiB
Plaintext
263 lines
12 KiB
Plaintext
Design Decisions In Open vSwitch
|
|
================================
|
|
|
|
This document describes design decisions that went into implementing
|
|
Open vSwitch. While we believe these to be reasonable decisions, it is
|
|
impossible to predict how Open vSwitch will be used in all environments.
|
|
Understanding assumptions made by Open vSwitch is critical to a
|
|
successful deployment. The end of this document contains contact
|
|
information that can be used to let us know how we can make Open vSwitch
|
|
more generally useful.
|
|
|
|
|
|
Multiple Table Support
|
|
======================
|
|
|
|
OpenFlow 1.0 has only rudimentary support for multiple flow tables.
|
|
Notably, OpenFlow 1.0 does not allow the controller to specify the
|
|
flow table to which a flow is to be added. Open vSwitch adds an
|
|
extension for this purpose, which is enabled on a per-OpenFlow
|
|
connection basis using the NXT_FLOW_MOD_TABLE_ID message. When the
|
|
extension is enabled, the upper 8 bits of the 'command' member in an
|
|
OFPT_FLOW_MOD or NXT_FLOW_MOD message designates the table to which a
|
|
flow is to be added.
|
|
|
|
The Open vSwitch software switch implementation offers 255 flow
|
|
tables. On packet ingress, only the first flow table (table 0) is
|
|
searched, and the contents of the remaining tables are not considered
|
|
in any way. Tables other than table 0 only come into play when an
|
|
NXAST_RESUBMIT_TABLE action specifies another table to search.
|
|
|
|
Tables 128 and above are reserved for use by the switch itself.
|
|
Controllers should use only tables 0 through 127.
|
|
|
|
|
|
IPv6
|
|
====
|
|
|
|
Open vSwitch supports stateless handling of IPv6 packets. Flows can be
|
|
written to support matching TCP, UDP, and ICMPv6 headers within an IPv6
|
|
packet. Deeper matching of some Neighbor Discovery messages is also
|
|
supported.
|
|
|
|
IPv6 was not designed to interact well with middle-boxes. This,
|
|
combined with Open vSwitch's stateless nature, have affected the
|
|
processing of IPv6 traffic, which is detailed below.
|
|
|
|
Extension Headers
|
|
-----------------
|
|
|
|
The base IPv6 header is incredibly simple with the intention of only
|
|
containing information relevant for routing packets between two
|
|
endpoints. IPv6 relies heavily on the use of extension headers to
|
|
provide any other functionality. Unfortunately, the extension headers
|
|
were designed in such a way that it is impossible to move to the next
|
|
header (including the layer-4 payload) unless the current header is
|
|
understood.
|
|
|
|
Open vSwitch will process the following extension headers and continue
|
|
to the next header:
|
|
|
|
* Fragment (see the next section)
|
|
* AH (Authentication Header)
|
|
* Hop-by-Hop Options
|
|
* Routing
|
|
* Destination Options
|
|
|
|
When a header is encountered that is not in that list, it is considered
|
|
"terminal". A terminal header's IPv6 protocol value is stored in
|
|
"nw_proto" for matching purposes. If a terminal header is TCP, UDP, or
|
|
ICMPv6, the packet will be further processed in an attempt to extract
|
|
layer-4 information.
|
|
|
|
Fragments
|
|
---------
|
|
|
|
IPv6 requires that every link in the internet have an MTU of 1280 octets
|
|
or greater (RFC 2460). As such, a terminal header (as described above in
|
|
"Extension Headers") in the first fragment should generally be
|
|
reachable. In this case, the terminal header's IPv6 protocol type is
|
|
stored in the "nw_proto" field for matching purposes. If a terminal
|
|
header cannot be found in the first fragment (one with a fragment offset
|
|
of zero), the "nw_proto" field is set to 0. Subsequent fragments (those
|
|
with a non-zero fragment offset) have the "nw_proto" field set to the
|
|
IPv6 protocol type for fragments (44).
|
|
|
|
Jumbograms
|
|
----------
|
|
|
|
An IPv6 jumbogram (RFC 2675) is a packet containing a payload longer
|
|
than 65,535 octets. A jumbogram is only relevant in subnets with a link
|
|
MTU greater than 65,575 octets, and are not required to be supported on
|
|
nodes that do not connect to link with such large MTUs. Currently, Open
|
|
vSwitch doesn't process jumbograms.
|
|
|
|
|
|
In-Band Control
|
|
===============
|
|
|
|
In-band control allows a single network to be used for OpenFlow traffic and
|
|
other data traffic. See ovs-vswitchd.conf.db(5) for a description of
|
|
configuring in-band control.
|
|
|
|
This comment is an attempt to describe how in-band control works at a
|
|
wire- and implementation-level. Correctly implementing in-band
|
|
control has proven difficult due to its many subtleties, and has thus
|
|
gone through many iterations. Please read through and understand the
|
|
reasoning behind the chosen rules before making modifications.
|
|
|
|
In Open vSwitch, in-band control is implemented as "hidden" flows (in that
|
|
they are not visible through OpenFlow) and at a higher priority than
|
|
wildcarded flows can be set up by through OpenFlow. This is done so that
|
|
the OpenFlow controller cannot interfere with them and possibly break
|
|
connectivity with its switches. It is possible to see all flows, including
|
|
in-band ones, with the ovs-appctl "bridge/dump-flows" command.
|
|
|
|
The Open vSwitch implementation of in-band control can hide traffic to
|
|
arbitrary "remotes", where each remote is one TCP port on one IP address.
|
|
Currently the remotes are automatically configured as the in-band OpenFlow
|
|
controllers plus the OVSDB managers, if any. (The latter is a requirement
|
|
because OVSDB managers are responsible for configuring OpenFlow controllers,
|
|
so if the manager cannot be reached then OpenFlow cannot be reconfigured.)
|
|
|
|
The following rules (with the OFPP_NORMAL action) are set up on any bridge
|
|
that has any remotes:
|
|
|
|
(a) DHCP requests sent from the local port.
|
|
(b) ARP replies to the local port's MAC address.
|
|
(c) ARP requests from the local port's MAC address.
|
|
|
|
In-band also sets up the following rules for each unique next-hop MAC
|
|
address for the remotes' IPs (the "next hop" is either the remote
|
|
itself, if it is on a local subnet, or the gateway to reach the remote):
|
|
|
|
(d) ARP replies to the next hop's MAC address.
|
|
(e) ARP requests from the next hop's MAC address.
|
|
|
|
In-band also sets up the following rules for each unique remote IP address:
|
|
|
|
(f) ARP replies containing the remote's IP address as a target.
|
|
(g) ARP requests containing the remote's IP address as a source.
|
|
|
|
In-band also sets up the following rules for each unique remote (IP,port)
|
|
pair:
|
|
|
|
(h) TCP traffic to the remote's IP and port.
|
|
(i) TCP traffic from the remote's IP and port.
|
|
|
|
The goal of these rules is to be as narrow as possible to allow a
|
|
switch to join a network and be able to communicate with the
|
|
remotes. As mentioned earlier, these rules have higher priority
|
|
than the controller's rules, so if they are too broad, they may
|
|
prevent the controller from implementing its policy. As such,
|
|
in-band actively monitors some aspects of flow and packet processing
|
|
so that the rules can be made more precise.
|
|
|
|
In-band control monitors attempts to add flows into the datapath that
|
|
could interfere with its duties. The datapath only allows exact
|
|
match entries, so in-band control is able to be very precise about
|
|
the flows it prevents. Flows that miss in the datapath are sent to
|
|
userspace to be processed, so preventing these flows from being
|
|
cached in the "fast path" does not affect correctness. The only type
|
|
of flow that is currently prevented is one that would prevent DHCP
|
|
replies from being seen by the local port. For example, a rule that
|
|
forwarded all DHCP traffic to the controller would not be allowed,
|
|
but one that forwarded to all ports (including the local port) would.
|
|
|
|
As mentioned earlier, packets that miss in the datapath are sent to
|
|
the userspace for processing. The userspace has its own flow table,
|
|
the "classifier", so in-band checks whether any special processing
|
|
is needed before the classifier is consulted. If a packet is a DHCP
|
|
response to a request from the local port, the packet is forwarded to
|
|
the local port, regardless of the flow table. Note that this requires
|
|
L7 processing of DHCP replies to determine whether the 'chaddr' field
|
|
matches the MAC address of the local port.
|
|
|
|
It is interesting to note that for an L3-based in-band control
|
|
mechanism, the majority of rules are devoted to ARP traffic. At first
|
|
glance, some of these rules appear redundant. However, each serves an
|
|
important role. First, in order to determine the MAC address of the
|
|
remote side (controller or gateway) for other ARP rules, we must allow
|
|
ARP traffic for our local port with rules (b) and (c). If we are
|
|
between a switch and its connection to the remote, we have to
|
|
allow the other switch's ARP traffic to through. This is done with
|
|
rules (d) and (e), since we do not know the addresses of the other
|
|
switches a priori, but do know the remote's or gateway's. Finally,
|
|
if the remote is running in a local guest VM that is not reached
|
|
through the local port, the switch that is connected to the VM must
|
|
allow ARP traffic based on the remote's IP address, since it will
|
|
not know the MAC address of the local port that is sending the traffic
|
|
or the MAC address of the remote in the guest VM.
|
|
|
|
With a few notable exceptions below, in-band should work in most
|
|
network setups. The following are considered "supported' in the
|
|
current implementation:
|
|
|
|
- Locally Connected. The switch and remote are on the same
|
|
subnet. This uses rules (a), (b), (c), (h), and (i).
|
|
|
|
- Reached through Gateway. The switch and remote are on
|
|
different subnets and must go through a gateway. This uses
|
|
rules (a), (b), (c), (h), and (i).
|
|
|
|
- Between Switch and Remote. This switch is between another
|
|
switch and the remote, and we want to allow the other
|
|
switch's traffic through. This uses rules (d), (e), (h), and
|
|
(i). It uses (b) and (c) indirectly in order to know the MAC
|
|
address for rules (d) and (e). Note that DHCP for the other
|
|
switch will not work unless an OpenFlow controller explicitly lets this
|
|
switch pass the traffic.
|
|
|
|
- Between Switch and Gateway. This switch is between another
|
|
switch and the gateway, and we want to allow the other switch's
|
|
traffic through. This uses the same rules and logic as the
|
|
"Between Switch and Remote" configuration described earlier.
|
|
|
|
- Remote on Local VM. The remote is a guest VM on the
|
|
system running in-band control. This uses rules (a), (b), (c),
|
|
(h), and (i).
|
|
|
|
- Remote on Local VM with Different Networks. The remote
|
|
is a guest VM on the system running in-band control, but the
|
|
local port is not used to connect to the remote. For
|
|
example, an IP address is configured on eth0 of the switch. The
|
|
remote's VM is connected through eth1 of the switch, but an
|
|
IP address has not been configured for that port on the switch.
|
|
As such, the switch will use eth0 to connect to the remote,
|
|
and eth1's rules about the local port will not work. In the
|
|
example, the switch attached to eth0 would use rules (a), (b),
|
|
(c), (h), and (i) on eth0. The switch attached to eth1 would use
|
|
rules (f), (g), (h), and (i).
|
|
|
|
The following are explicitly *not* supported by in-band control:
|
|
|
|
- Specify Remote by Name. Currently, the remote must be
|
|
identified by IP address. A naive approach would be to permit
|
|
all DNS traffic. Unfortunately, this would prevent the
|
|
controller from defining any policy over DNS. Since switches
|
|
that are located behind us need to connect to the remote,
|
|
in-band cannot simply add a rule that allows DNS traffic from
|
|
the local port. The "correct" way to support this is to parse
|
|
DNS requests to allow all traffic related to a request for the
|
|
remote's name through. Due to the potential security
|
|
problems and amount of processing, we decided to hold off for
|
|
the time-being.
|
|
|
|
- Differing Remotes for Switches. All switches must know
|
|
the L3 addresses for all the remotes that other switches
|
|
may use, since rules need to be set up to allow traffic related
|
|
to those remotes through. See rules (f), (g), (h), and (i).
|
|
|
|
- Differing Routes for Switches. In order for the switch to
|
|
allow other switches to connect to a remote through a
|
|
gateway, it allows the gateway's traffic through with rules (d)
|
|
and (e). If the routes to the remote differ for the two
|
|
switches, we will not know the MAC address of the alternate
|
|
gateway.
|
|
|
|
|
|
Suggestions
|
|
===========
|
|
|
|
Suggestions to improve Open vSwitch are welcome at discuss@openvswitch.org.
|