Design Decisions In Open vSwitch
================================

This document describes design decisions that went into implementing
Open vSwitch. While we believe these to be reasonable decisions, it is
impossible to predict how Open vSwitch will be used in all environments.
Understanding assumptions made by Open vSwitch is critical to a
successful deployment. The end of this document contains contact
information that can be used to let us know how we can make Open vSwitch
more generally useful.

Asynchronous Messages
=====================

Over time, Open vSwitch has added many knobs that control whether a
given controller receives OpenFlow asynchronous messages. This
section describes how all of these features interact.

First, a service controller never receives any asynchronous messages
unless it changes its miss_send_len from the service controller
default of zero in one of the following ways:

    - Sending an OFPT_SET_CONFIG message with nonzero miss_send_len.

    - Sending any NXT_SET_ASYNC_CONFIG message: as a side effect, this
      message changes the miss_send_len to
      OFP_DEFAULT_MISS_SEND_LEN (128) for service controllers.

Second, OFPT_FLOW_REMOVED and NXT_FLOW_REMOVED messages are generated
only if the flow that was removed had the OFPFF_SEND_FLOW_REM flag
set.

Third, OFPT_PACKET_IN and NXT_PACKET_IN messages are sent only to
OpenFlow controller connections that have the correct connection ID
(see "struct nx_controller_id" and "struct nx_action_controller"):

    - For packet-in messages generated by an NXAST_CONTROLLER action,
      the controller ID specified in the action.

    - For other packet-in messages, controller ID zero. (This is the
      default ID when an OpenFlow controller does not configure one.)

Finally, Open vSwitch consults a per-connection table indexed by the
message type, reason code, and current role. The following table
shows how this table is initialized by default when an OpenFlow
connection is made. An entry labeled "yes" means that the message is
sent; an entry labeled "---" means that the message is suppressed.

                                          master/
message and reason code                    other    slave
----------------------------------------  -------   -----
OFPT_PACKET_IN / NXT_PACKET_IN
  OFPR_NO_MATCH                             yes      ---
  OFPR_ACTION                               yes      ---
  OFPR_INVALID_TTL                          ---      ---

OFPT_FLOW_REMOVED / NXT_FLOW_REMOVED
  OFPRR_IDLE_TIMEOUT                        yes      ---
  OFPRR_HARD_TIMEOUT                        yes      ---
  OFPRR_DELETE                              yes      ---

OFPT_PORT_STATUS
  OFPPR_ADD                                 yes      yes
  OFPPR_DELETE                              yes      yes
  OFPPR_MODIFY                              yes      yes

The NXT_SET_ASYNC_CONFIG message directly sets all of the values in
this table for the current connection. The
OFPC_INVALID_TTL_TO_CONTROLLER bit in the OFPT_SET_CONFIG message
controls the setting for OFPR_INVALID_TTL for the "master" role.

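The default per-connection table described in this section can be
pictured as a small three-dimensional lookup. The following is only a
sketch: the enum names, the array, and the function are hypothetical and
are not Open vSwitch identifiers, and the reason indexes simply follow
the row order of the table for each message type.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for illustration; these are not OVS identifiers. */
enum async_type {
    ASYNC_PACKET_IN,
    ASYNC_FLOW_REMOVED,
    ASYNC_PORT_STATUS,
    N_ASYNC_TYPES
};
enum conn_role { ROLE_MASTER_OR_OTHER, ROLE_SLAVE, N_ROLES };
#define MAX_REASONS 3

/* Defaults transcribed from the table.  Reason indexes follow the row
 * order within each message type:
 *   packet-in:    0 = NO_MATCH, 1 = ACTION, 2 = INVALID_TTL
 *   flow-removed: 0 = IDLE_TIMEOUT, 1 = HARD_TIMEOUT, 2 = DELETE
 *   port-status:  0 = ADD, 1 = DELETE, 2 = MODIFY */
static const bool async_default[N_ASYNC_TYPES][MAX_REASONS][N_ROLES] = {
    [ASYNC_PACKET_IN]    = { {true, false}, {true, false}, {false, false} },
    [ASYNC_FLOW_REMOVED] = { {true, false}, {true, false}, {true, false} },
    [ASYNC_PORT_STATUS]  = { {true, true},  {true, true},  {true, true} },
};

/* Returns true if a new connection in 'role' receives the message by
 * default, before any NXT_SET_ASYNC_CONFIG or OFPT_SET_CONFIG changes. */
static bool
async_enabled_by_default(enum async_type type, int reason,
                         enum conn_role role)
{
    assert(reason >= 0 && reason < MAX_REASONS);
    return async_default[type][reason][role];
}
```

In this picture, NXT_SET_ASYNC_CONFIG would overwrite a per-connection
copy of the array, and the earlier checks (miss_send_len, the
OFPFF_SEND_FLOW_REM flag, the controller ID) would run before the lookup.
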
OFPAT_ENQUEUE
=============

The OpenFlow 1.0 specification requires the output port of the OFPAT_ENQUEUE
action to "refer to a valid physical port (i.e. < OFPP_MAX) or OFPP_IN_PORT".
Although OFPP_LOCAL is not less than OFPP_MAX, it is an 'internal' port which
can have QoS applied to it in Linux. Since we allow OFPAT_ENQUEUE to apply
to 'internal' ports whose port numbers are less than OFPP_MAX, we interpret
OFPP_LOCAL as a physical port and support OFPAT_ENQUEUE on it as well.

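The resulting acceptance rule is small enough to write down. This is a
sketch, not the Open vSwitch code: the function name is hypothetical,
and the numeric values are the reserved port numbers defined by the
OpenFlow 1.0 specification.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Reserved port numbers from the OpenFlow 1.0 specification. */
#define OFPP_MAX     0xff00
#define OFPP_IN_PORT 0xfff8
#define OFPP_LOCAL   0xfffe

/* Hypothetical helper: true if 'port' is acceptable as the output port
 * of OFPAT_ENQUEUE under the interpretation described above. */
static bool
enqueue_port_is_valid(uint16_t port)
{
    return port < OFPP_MAX          /* What the spec calls "physical". */
        || port == OFPP_IN_PORT     /* Explicitly allowed by the spec. */
        || port == OFPP_LOCAL;      /* The extra case accepted here. */
}
```
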
OFPT_FLOW_MOD
=============

The OpenFlow specification for the behavior of OFPT_FLOW_MOD is
confusing. The following tables summarize the Open vSwitch
implementation of its behavior in the following categories:

    - "match on priority": Whether the flow_mod acts only on flows
      whose priority matches that included in the flow_mod message.

    - "match on out_port": Whether the flow_mod acts only on flows
      that output to the out_port included in the flow_mod message (if
      out_port is not OFPP_NONE). OpenFlow 1.1 and later have a
      similar feature (not listed separately here) for out_group.

    - "match on flow_cookie": Whether the flow_mod acts only on flows
      whose flow_cookie matches an optional controller-specified value
      and mask.

    - "updates flow_cookie": Whether the flow_mod changes the
      flow_cookie of the flow or flows that it matches to the
      flow_cookie included in the flow_mod message.

    - "updates OFPFF_SEND_FLOW_REM": Whether the flow_mod changes the
      OFPFF_SEND_FLOW_REM flag of the flow or flows that it matches to
      the setting included in the flags of the flow_mod message.

    - "honors OFPFF_CHECK_OVERLAP": Whether the OFPFF_CHECK_OVERLAP
      flag in the flow_mod is significant.

    - "updates idle_timeout" and "updates hard_timeout": Whether the
      idle_timeout and hard_timeout in the flow_mod, respectively,
      have an effect on the flow or flows matched by the flow_mod.

    - "resets idle timer": Whether the flow_mod resets the per-flow
      timer that measures how long a flow has been idle.

    - "resets hard timer": Whether the flow_mod resets the per-flow
      timer that measures how long it has been since a flow was
      modified.

    - "zeros counters": Whether the flow_mod resets per-flow packet
      and byte counters to zero.

    - "may add a new flow": Whether the flow_mod may add a new flow to
      the flow table. (Obviously this is always true for "add"
      commands but in some OpenFlow versions "modify" and
      "modify-strict" can also add new flows.)

    - "sends flow_removed message": Whether the flow_mod generates a
      flow_removed message for the flow or flows that it affects.

An entry labeled "yes" means that the flow mod type does have the
indicated behavior, "---" means that it does not, an empty cell means
that the property is not applicable, and other values are explained
below the table.

OpenFlow 1.0
------------

                                          MODIFY          DELETE
                             ADD  MODIFY  STRICT  DELETE  STRICT
                             ===  ======  ======  ======  ======
match on priority            yes   ---     yes     ---     yes
match on out_port            ---   ---     ---     yes     yes
match on flow_cookie         ---   ---     ---     ---     ---
match on table_id            ---   ---     ---     ---     ---
controller chooses table_id  ---   ---     ---
updates flow_cookie          yes   yes     yes
updates OFPFF_SEND_FLOW_REM  yes    +       +
honors OFPFF_CHECK_OVERLAP   yes    +       +
updates idle_timeout         yes    +       +
updates hard_timeout         yes    +       +
resets idle timer            yes    +       +
resets hard timer            yes   yes     yes
zeros counters               yes    +       +
may add a new flow           yes   yes     yes
sends flow_removed message   ---   ---     ---      %       %

(+) "modify" and "modify-strict" only take these actions when they
    create a new flow, not when they update an existing flow.

(%) "delete" and "delete-strict" generate a flow_removed message if
    the deleted flow or flows have the OFPFF_SEND_FLOW_REM flag set.
    (Each controller can separately control whether it wants to
    receive the generated messages.)

OpenFlow 1.1
------------

OpenFlow 1.1 makes these changes:

    - The controller must now specify the table_id of the flow table
      that is searched and into which a flow may be inserted. Behavior
      for a table_id of 255 is undefined.

    - A flow_mod, except an "add", can now match on the flow_cookie.

    - When a flow_mod matches on the flow_cookie, "modify" and
      "modify-strict" never insert a new flow.

                                          MODIFY          DELETE
                             ADD  MODIFY  STRICT  DELETE  STRICT
                             ===  ======  ======  ======  ======
match on priority            yes   ---     yes     ---     yes
match on out_port            ---   ---     ---     yes     yes
match on flow_cookie         ---   yes     yes     yes     yes
match on table_id            yes   yes     yes     yes     yes
controller chooses table_id  yes   yes     yes
updates flow_cookie          yes   ---     ---
updates OFPFF_SEND_FLOW_REM  yes    +       +
honors OFPFF_CHECK_OVERLAP   yes    +       +
updates idle_timeout         yes    +       +
updates hard_timeout         yes    +       +
resets idle timer            yes    +       +
resets hard timer            yes   yes     yes
zeros counters               yes    +       +
may add a new flow           yes    #       #
sends flow_removed message   ---   ---     ---      %       %

(+) "modify" and "modify-strict" only take these actions when they
    create a new flow, not when they update an existing flow.

(%) "delete" and "delete-strict" generate a flow_removed message if
    the deleted flow or flows have the OFPFF_SEND_FLOW_REM flag set.
    (Each controller can separately control whether it wants to
    receive the generated messages.)

(#) "modify" and "modify-strict" only add a new flow if the flow_mod
    does not match on any bits of the flow cookie.

OpenFlow 1.2
------------

OpenFlow 1.2 makes these changes:

    - Only "add" commands ever add flows; "modify" and "modify-strict"
      never do.

    - A new flag OFPFF_RESET_COUNTS now controls whether "modify" and
      "modify-strict" reset counters, whereas previously they never
      reset counters (except when they inserted a new flow).

                                          MODIFY          DELETE
                             ADD  MODIFY  STRICT  DELETE  STRICT
                             ===  ======  ======  ======  ======
match on priority            yes   ---     yes     ---     yes
match on out_port            ---   ---     ---     yes     yes
match on flow_cookie         ---   yes     yes     yes     yes
match on table_id            yes   yes     yes     yes     yes
controller chooses table_id  yes   yes     yes
updates flow_cookie          yes   ---     ---
updates OFPFF_SEND_FLOW_REM  yes   ---     ---
honors OFPFF_CHECK_OVERLAP   yes   ---     ---
updates idle_timeout         yes   ---     ---
updates hard_timeout         yes   ---     ---
resets idle timer            yes   ---     ---
resets hard timer            yes   yes     yes
zeros counters               yes    &       &
may add a new flow           yes   ---     ---
sends flow_removed message   ---   ---     ---      %       %

(%) "delete" and "delete-strict" generate a flow_removed message if
    the deleted flow or flows have the OFPFF_SEND_FLOW_REM flag set.
    (Each controller can separately control whether it wants to
    receive the generated messages.)

(&) "modify" and "modify-strict" reset counters if the
    OFPFF_RESET_COUNTS flag is specified.

OpenFlow 1.3
------------

OpenFlow 1.3 makes these changes:

    - Behavior for a table_id of 255 is now defined, for "delete" and
      "delete-strict" commands, as meaning to delete from all tables.
      A table_id of 255 is now explicitly invalid for other commands.

    - New flags OFPFF_NO_PKT_COUNTS and OFPFF_NO_BYT_COUNTS for "add"
      operations.

The table for 1.3 is the same as the one shown above for 1.2.


OpenFlow 1.4
------------

OpenFlow 1.4 does not change flow_mod semantics.

OFPT_PACKET_IN
==============

The OpenFlow 1.1 specification for OFPT_PACKET_IN is confusing. The
definition in OF1.1 openflow.h is[*]:

    /* Packet received on port (datapath -> controller). */
    struct ofp_packet_in {
        struct ofp_header header;
        uint32_t buffer_id;     /* ID assigned by datapath. */
        uint32_t in_port;       /* Port on which frame was received. */
        uint32_t in_phy_port;   /* Physical Port on which frame was received. */
        uint16_t total_len;     /* Full length of frame. */
        uint8_t reason;         /* Reason packet is being sent (one of OFPR_*) */
        uint8_t table_id;       /* ID of the table that was looked up */
        uint8_t data[0];        /* Ethernet frame, halfway through 32-bit word,
                                   so the IP header is 32-bit aligned. The
                                   amount of data is inferred from the length
                                   field in the header. Because of padding,
                                   offsetof(struct ofp_packet_in, data) ==
                                   sizeof(struct ofp_packet_in) - 2. */
    };
    OFP_ASSERT(sizeof(struct ofp_packet_in) == 24);

The confusing part is the comment on the data[] member. This comment
is a leftover from OF1.0 openflow.h, in which the comment was correct:
sizeof(struct ofp_packet_in) is 20 in OF1.0 and offsetof(struct
ofp_packet_in, data) is 18. When OF1.1 was written, the structure
members were changed but the comment was carelessly not updated, and
the comment became wrong: sizeof(struct ofp_packet_in) and
offsetof(struct ofp_packet_in, data) are both 24 in OF1.1.

That leaves the question of how to implement ofp_packet_in in OF1.1.
The OpenFlow reference implementation for OF1.1 does not include any
padding, that is, the first byte of the encapsulated frame immediately
follows the 'table_id' member without a gap. Open vSwitch therefore
implements it the same way for compatibility.

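The sizes quoted above can be checked mechanically. The sketch below
declares both layouts under hypothetical names (of10_packet_in and
of11_packet_in; the real headers call both struct ofp_packet_in) and
uses a C99 flexible array member in place of data[0]. The OF1.0 member
list is taken from the OF1.0 openflow.h as best recalled, and the
results assume a typical ABI in which uint32_t is 4-byte aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct ofp_header {
    uint8_t version;
    uint8_t type;
    uint16_t length;
    uint32_t xid;
};

/* OF1.0 layout (hypothetical struct name). */
struct of10_packet_in {
    struct ofp_header header;
    uint32_t buffer_id;
    uint16_t total_len;
    uint16_t in_port;
    uint8_t reason;
    uint8_t pad;
    uint8_t data[];             /* Starts 2 bytes before the struct's end. */
};

/* OF1.1 layout (hypothetical struct name), as quoted above. */
struct of11_packet_in {
    struct ofp_header header;
    uint32_t buffer_id;
    uint32_t in_port;
    uint32_t in_phy_port;
    uint16_t total_len;
    uint8_t reason;
    uint8_t table_id;
    uint8_t data[];             /* Starts exactly at the struct's end. */
};

static size_t of10_size(void) { return sizeof(struct of10_packet_in); }
static size_t of10_data(void) { return offsetof(struct of10_packet_in, data); }
static size_t of11_size(void) { return sizeof(struct of11_packet_in); }
static size_t of11_data(void) { return offsetof(struct of11_packet_in, data); }
```

With these definitions the OF1.0 frame begins at offset 18 of a 20-byte
structure, while the OF1.1 frame begins at offset 24 of a 24-byte
structure, which is why the inherited comment no longer holds.
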
For an earlier discussion, please see the thread archived at:
https://mailman.stanford.edu/pipermail/openflow-discuss/2011-August/002604.html

[*] The quoted definition is directly from OF1.1. Definitions used
    inside OVS omit the 8-byte ofp_header members, so the sizes in
    this discussion are 8 bytes larger than those declared in OVS
    header files.

VLAN Matching
=============

The 802.1Q VLAN header causes more trouble than any other 4 bytes in
networking. More specifically, three versions of OpenFlow and Open
vSwitch have among them four different ways to match the contents and
presence of the VLAN header. The following table describes how each
version works.

Match     NXM        OF1.0        OF1.1         OF1.2
-----  ---------  -----------  -----------  ------------
 [1]   0000/0000  ????/1,??/?  ????/1,??/?  0000/0000,--
 [2]   0000/ffff  ffff/0,??/?  ffff/0,??/?  0000/ffff,--
 [3]   1xxx/1fff  0xxx/0,??/1  0xxx/0,??/1  1xxx/ffff,--
 [4]   z000/f000  ????/1,0y/0  fffe/0,0y/0  1000/1000,0y
 [5]   zxxx/ffff  0xxx/0,0y/0  0xxx/0,0y/0  1xxx/ffff,0y
 [6]   0000/0fff    <none>       <none>        <none>
 [7]   0000/f000    <none>       <none>        <none>
 [8]   0000/efff    <none>       <none>        <none>
 [9]   1001/1001    <none>       <none>     1001/1001,--
[10]   3000/3000    <none>       <none>        <none>

Each column is interpreted as follows.

    - Match: See the list below.

    - NXM: xxxx/yyyy means NXM_OF_VLAN_TCI_W with value xxxx and mask
      yyyy. A mask of 0000 is equivalent to omitting
      NXM_OF_VLAN_TCI(_W), a mask of ffff is equivalent to
      NXM_OF_VLAN_TCI.

    - OF1.0 and OF1.1: wwww/x,yy/z means dl_vlan wwww, OFPFW_DL_VLAN
      x, dl_vlan_pcp yy, and OFPFW_DL_VLAN_PCP z. ? means that the
      given nibble is ignored (and conventionally 0 for wwww or yy,
      conventionally 1 for x or z). <none> means that the given match
      is not supported.

    - OF1.2: xxxx/yyyy,zz means OXM_OF_VLAN_VID_W with value xxxx and
      mask yyyy, and OXM_OF_VLAN_PCP (which is not maskable) with
      value zz. A mask of 0000 is equivalent to omitting
      OXM_OF_VLAN_VID(_W), a mask of ffff is equivalent to
      OXM_OF_VLAN_VID. -- means that OXM_OF_VLAN_PCP is omitted.
      <none> means that the given match is not supported.

The matches are:

 [1] Matches any packet, that is, one without an 802.1Q header or with
     an 802.1Q header with any TCI value.

 [2] Matches only packets without an 802.1Q header.

     NXM: Any match with (vlan_tci == 0) and (vlan_tci_mask & 0x1000)
     != 0 is equivalent to the one listed in the table.

     OF1.0: The spec doesn't define behavior if dl_vlan is set to
     0xffff and OFPFW_DL_VLAN_PCP is not set.

     OF1.1: The spec says explicitly to ignore dl_vlan_pcp when
     dl_vlan is set to 0xffff.

     OF1.2: The spec doesn't say what should happen if (vlan_vid == 0)
     and (vlan_vid_mask & 0x1000) != 0 but (vlan_vid_mask != 0x1000),
     but it would be straightforward to also interpret as [2].

 [3] Matches only packets that have an 802.1Q header with VID xxx (and
     any PCP).

 [4] Matches only packets that have an 802.1Q header with PCP y (and
     any VID).

     NXM: z is ((y << 1) | 1).

     OF1.0: The spec isn't very clear, but OVS implements it this way.

     OF1.2: Presumably other masks such that (vlan_vid_mask & 0x1fff)
     == 0x1000 would also work, but the spec doesn't define their
     behavior.

 [5] Matches only packets that have an 802.1Q header with VID xxx and
     PCP y.

     NXM: z is ((y << 1) | 1).

     OF1.2: Presumably other masks such that (vlan_vid_mask & 0x1fff)
     == 0x1fff would also work.

 [6] Matches packets with no 802.1Q header or with an 802.1Q header
     with a VID of 0. Only possible with NXM.

 [7] Matches packets with no 802.1Q header or with an 802.1Q header
     with a PCP of 0. Only possible with NXM.

 [8] Matches packets with no 802.1Q header or with an 802.1Q header
     with both VID and PCP of 0. Only possible with NXM.

 [9] Matches only packets that have an 802.1Q header with an
     odd-numbered VID (and any PCP). Only possible with NXM and
     OF1.2. (This is just an example; one can match on any desired
     VID bit pattern.)

[10] Matches only packets that have an 802.1Q header with an
     odd-numbered PCP (and any VID). Only possible with NXM. (This
     is just an example; one can match on any desired PCP bit
     pattern.)

Additional notes:

    - OF1.2: The top three bits of OXM_OF_VLAN_VID are fixed to zero,
      so bits 13, 14, and 15 in the masks listed in the table may be
      set to arbitrary values, as long as the corresponding value bits
      are also zero. The suggested ffff mask for [2], [3], and [5]
      allows a shorter OXM representation (the mask is omitted) than
      the minimal 1fff mask.

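The NXM column can be exercised with a few lines of C. The sketch below
assumes the convention that the table implies: a packet without an
802.1Q header is represented by a TCI of 0, and a packet with one is
represented by its TCI with bit 12 (0x1000) set. The function and macro
names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VLAN_PRESENT 0x1000     /* Bit 12: an 802.1Q header is present. */

/* Value for NXM match [5]: header present, VID 'vid', PCP 'pcp'.  The
 * top nibble comes out as z = (pcp << 1) | 1, as noted for [4] and [5]. */
static uint16_t
nxm_tci_value(uint16_t vid, uint8_t pcp)
{
    assert(vid <= 0xfff && pcp <= 7);
    return (uint16_t) ((pcp << 13) | VLAN_PRESENT | vid);
}

/* Masked comparison used by every NXM row in the table.  'tci' is 0 for
 * a packet without an 802.1Q header (an assumption of this sketch). */
static bool
nxm_tci_matches(uint16_t tci, uint16_t value, uint16_t mask)
{
    return (tci & mask) == value;
}
```

For example, match [3] with VID 0x123 is value 0x1123 and mask 0x1fff,
which accepts a tagged packet with any PCP, while match [2] (0000/ffff)
accepts only the all-zero TCI of an untagged packet.
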
Flow Cookies
============

OpenFlow 1.0 and later versions have the concept of a "flow cookie",
which is a 64-bit integer value attached to each flow. The treatment
of the flow cookie has varied greatly across OpenFlow versions,
however.

In OpenFlow 1.0:

    - OFPFC_ADD set the cookie in the flow that it added.

    - OFPFC_MODIFY and OFPFC_MODIFY_STRICT updated the cookie for
      the flow or flows that they modified.

    - OFPST_FLOW messages included the flow cookie.

    - OFPT_FLOW_REMOVED messages reported the cookie of the flow
      that was removed.

OpenFlow 1.1 made the following changes:

    - Flow mod operations OFPFC_MODIFY, OFPFC_MODIFY_STRICT,
      OFPFC_DELETE, and OFPFC_DELETE_STRICT, plus flow stats
      requests and aggregate stats requests, gained the ability to
      match on flow cookies with an arbitrary mask.

    - OFPFC_MODIFY and OFPFC_MODIFY_STRICT were changed to add a
      new flow, in the case of no match, only if the flow table
      modification operation did not match on the cookie field.
      (In OpenFlow 1.0, modify operations always added a new flow
      when there was no match.)

    - OFPFC_MODIFY and OFPFC_MODIFY_STRICT no longer updated flow
      cookies.

OpenFlow 1.2 made the following changes:

    - OFPFC_MODIFY and OFPFC_MODIFY_STRICT were changed to never
      add a new flow, regardless of whether the flow cookie was
      used for matching.

Open vSwitch support for OpenFlow 1.0 implements the OpenFlow 1.0
behavior with the following extensions:

    - An NXM extension field NXM_NX_COOKIE(_W) allows the NXM
      versions of OFPFC_MODIFY, OFPFC_MODIFY_STRICT, OFPFC_DELETE,
      and OFPFC_DELETE_STRICT flow_mods, plus flow stats requests
      and aggregate stats requests, to match on flow cookies with
      arbitrary masks. This is much like the equivalent OpenFlow
      1.1 feature.

    - Like OpenFlow 1.1, OFPFC_MODIFY and OFPFC_MODIFY_STRICT add a
      new flow if there is no match and the mask is zero (or not
      given).

    - The "cookie" field in OFPT_FLOW_MOD and NXT_FLOW_MOD messages
      is used as the cookie value for OFPFC_ADD commands, as
      described in OpenFlow 1.0. For OFPFC_MODIFY and
      OFPFC_MODIFY_STRICT commands, the "cookie" field is used as a
      new cookie for flows that match unless it is UINT64_MAX, in
      which case the flow's cookie is not updated.

    - NXT_PACKET_IN (the Nicira extended version of
      OFPT_PACKET_IN) reports the cookie of the rule that
      generated the packet, or all-1-bits if no rule generated the
      packet. (Older versions of OVS used all-0-bits instead of
      all-1-bits.)

The following table shows the handling of different protocols when
receiving OFPFC_MODIFY and OFPFC_MODIFY_STRICT messages. A mask of 0
indicates either an explicit mask of zero or an implicit one by not
specifying the NXM_NX_COOKIE(_W) field.

                Match   Update   Add on miss   Add on miss
                cookie  cookie     mask!=0       mask==0
                ======  ======   ===========   ===========
OpenFlow 1.0      no     yes       <always add on miss>
OpenFlow 1.1     yes      no         no            yes
OpenFlow 1.2     yes      no         no            no
NXM              yes     yes*        no            yes

* Updates the flow's cookie unless the "cookie" field is UINT64_MAX.

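The table reduces to two small decisions, sketched here in C. The enum
and function names are hypothetical and exist only for illustration;
the logic is a transcription of the rows above, not Open vSwitch code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical protocol selector, one value per table row. */
enum flow_format { FMT_OF10, FMT_OF11, FMT_OF12, FMT_NXM };

/* Whether "modify" or "modify-strict" adds a new flow when nothing
 * matches.  'cookie_mask' is 0 when no cookie match was requested. */
static bool
modify_adds_on_miss(enum flow_format fmt, uint64_t cookie_mask)
{
    switch (fmt) {
    case FMT_OF10:
        return true;                /* No cookie matching; always adds. */
    case FMT_OF11:
    case FMT_NXM:
        return cookie_mask == 0;    /* Adds only without a cookie match. */
    case FMT_OF12:
        return false;               /* Never adds. */
    }
    return false;
}

/* Cookie of a flow after an NXM modify: UINT64_MAX in the "cookie"
 * field means "leave the flow's cookie unchanged". */
static uint64_t
nxm_cookie_after_modify(uint64_t old_cookie, uint64_t new_cookie)
{
    return new_cookie == UINT64_MAX ? old_cookie : new_cookie;
}
```
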
Multiple Table Support
======================

OpenFlow 1.0 has only rudimentary support for multiple flow tables.
Notably, OpenFlow 1.0 does not allow the controller to specify the
flow table to which a flow is to be added. Open vSwitch adds an
extension for this purpose, which is enabled on a per-OpenFlow
connection basis using the NXT_FLOW_MOD_TABLE_ID message. When the
extension is enabled, the upper 8 bits of the 'command' member in an
OFPT_FLOW_MOD or NXT_FLOW_MOD message designate the table to which a
flow is to be added.

The Open vSwitch software switch implementation offers 255 flow
tables. On packet ingress, only the first flow table (table 0) is
searched, and the contents of the remaining tables are not considered
in any way. Tables other than table 0 only come into play when an
NXAST_RESUBMIT_TABLE action specifies another table to search.

Tables 128 and above are reserved for use by the switch itself.
Controllers should use only tables 0 through 127.

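Splitting the 16-bit 'command' member under this extension is a matter
of a shift and a mask. In the sketch below the helper names are
hypothetical, 'command' is assumed to have already been converted to
host byte order, and the low 8 bits are assumed to carry the ordinary
OFPFC_* command.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Table designated by the upper 8 bits of 'command' (host byte order). */
static uint8_t
flow_mod_table_id(uint16_t command)
{
    return (uint8_t) (command >> 8);
}

/* Remaining low 8 bits, assumed here to be the OFPFC_* command. */
static uint8_t
flow_mod_command(uint16_t command)
{
    return (uint8_t) (command & 0xff);
}

/* Tables 128 and above are reserved for the switch itself. */
static bool
table_is_for_controllers(uint8_t table_id)
{
    return table_id <= 127;
}
```
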
IPv6
====

Open vSwitch supports stateless handling of IPv6 packets. Flows can be
written to support matching TCP, UDP, and ICMPv6 headers within an IPv6
packet. Deeper matching of some Neighbor Discovery messages is also
supported.

IPv6 was not designed to interact well with middle-boxes. This,
combined with Open vSwitch's stateless nature, has affected the
processing of IPv6 traffic, which is detailed below.

Extension Headers
-----------------

The base IPv6 header is incredibly simple with the intention of only
containing information relevant for routing packets between two
endpoints. IPv6 relies heavily on the use of extension headers to
provide any other functionality. Unfortunately, the extension headers
were designed in such a way that it is impossible to move to the next
header (including the layer-4 payload) unless the current header is
understood.

Open vSwitch will process the following extension headers and continue
to the next header:

    * Fragment (see the next section)
    * AH (Authentication Header)
    * Hop-by-Hop Options
    * Routing
    * Destination Options

When a header is encountered that is not in that list, it is considered
"terminal". A terminal header's IPv6 protocol value is stored in
"nw_proto" for matching purposes. If a terminal header is TCP, UDP, or
ICMPv6, the packet will be further processed in an attempt to extract
layer-4 information.

Fragments
---------

IPv6 requires that every link in the internet have an MTU of 1280 octets
or greater (RFC 2460). As such, a terminal header (as described above in
"Extension Headers") in the first fragment should generally be
reachable. In this case, the terminal header's IPv6 protocol type is
stored in the "nw_proto" field for matching purposes. If a terminal
header cannot be found in the first fragment (one with a fragment offset
of zero), the "nw_proto" field is set to 0. Subsequent fragments (those
with a non-zero fragment offset) have the "nw_proto" field set to the
IPv6 protocol type for fragments (44).

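The header walk described in these two sections can be sketched as a
loop over the "next header" chain. This is an illustration, not the
Open vSwitch parser: the function and enum names are hypothetical, the
header lengths follow the IPv6 and AH specifications, and returning 0
whenever the headers run out (not only in a first fragment) is an
assumption made to keep the sketch short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* IPv6 "next header" values for the extension headers listed above. */
enum {
    NH_HOPOPTS = 0,
    NH_ROUTING = 43,
    NH_FRAGMENT = 44,
    NH_AH = 51,
    NH_DSTOPTS = 60
};

/* Returns the value to store in "nw_proto".  'next_hdr' is the Next
 * Header field of the base IPv6 header, and 'p'/'len' cover the bytes
 * that follow the 40-byte base header. */
static uint8_t
ipv6_nw_proto(uint8_t next_hdr, const uint8_t *p, size_t len)
{
    assert(p != NULL);
    for (;;) {
        size_t hdr_len;

        switch (next_hdr) {
        case NH_HOPOPTS:
        case NH_ROUTING:
        case NH_DSTOPTS:
        case NH_AH:
        case NH_FRAGMENT:
            break;
        default:
            return next_hdr;        /* Terminal header reached. */
        }
        if (len < 8) {
            return 0;               /* Terminal header not reachable. */
        }
        if (next_hdr == NH_FRAGMENT) {
            /* The fragment offset is the upper 13 bits of bytes 2-3. */
            if (((p[2] << 8) | p[3]) & 0xfff8) {
                return NH_FRAGMENT; /* Later fragment: report 44. */
            }
            hdr_len = 8;
        } else if (next_hdr == NH_AH) {
            hdr_len = ((size_t) p[1] + 2) * 4;  /* 4-octet units. */
        } else {
            hdr_len = ((size_t) p[1] + 1) * 8;  /* 8-octet units. */
        }
        if (len < hdr_len) {
            return 0;
        }
        next_hdr = p[0];
        p += hdr_len;
        len -= hdr_len;
    }
}
```
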
Jumbograms
----------

An IPv6 jumbogram (RFC 2675) is a packet containing a payload longer
than 65,535 octets. Jumbograms are only relevant in subnets with a link
MTU greater than 65,575 octets, and nodes that do not connect to links
with such large MTUs are not required to support them. Currently, Open
vSwitch doesn't process jumbograms.

In-Band Control
===============

Motivation
----------

An OpenFlow switch must establish and maintain a TCP network
connection to its controller. There are two basic ways to categorize
the network that this connection traverses: either it is completely
separate from the one that the switch is otherwise controlling, or its
path may overlap the network that the switch controls. We call the
former case "out-of-band control", the latter case "in-band control".

Out-of-band control has the following benefits:

    - Simplicity: Out-of-band control slightly simplifies the switch
      implementation.

    - Reliability: Excessive switch traffic volume cannot interfere
      with control traffic.

    - Integrity: Machines not on the control network cannot
      impersonate a switch or a controller.

    - Confidentiality: Machines not on the control network cannot
      snoop on control traffic.

In-band control, on the other hand, has the following advantages:

    - No dedicated port: There is no need to dedicate a physical
      switch port to control, which is important on switches that have
      few ports (e.g. wireless routers, low-end embedded platforms).

    - No dedicated network: There is no need to build and maintain a
      separate control network. This is important in many
      environments because it reduces proliferation of switches and
      wiring.

Open vSwitch supports both out-of-band and in-band control. This
section describes the principles behind in-band control. See the
description of the Controller table in ovs-vswitchd.conf.db(5) to
configure OVS for in-band control.

Principles
----------

The fundamental principle of in-band control is that an OpenFlow
switch must recognize and switch control traffic without involving the
OpenFlow controller. All the details of implementing in-band control
are special cases of this principle.

The rationale for this principle is simple. If the switch does not
handle in-band control traffic itself, then it will be caught in a
contradiction: it must contact the controller, but it cannot, because
only the controller can set up the flows that are needed to contact
the controller.

The following points describe important special cases of this
principle.

    - In-band control must be implemented regardless of whether the
      switch is connected.

      It is tempting to implement the in-band control rules only when
      the switch is not connected to the controller, using the
      reasoning that the controller should have complete control once
      it has established a connection with the switch.

      This does not work in practice. Consider the case where the
      switch is connected to the controller. Occasionally it can
      happen that the controller forgets or otherwise needs to obtain
      the MAC address of the switch. To do so, the controller sends a
      broadcast ARP request. A switch that implements the in-band
      control rules only when it is disconnected will then send an
      OFPT_PACKET_IN message up to the controller. The controller will
      be unable to respond, because it does not know the MAC address of
      the switch. This is a deadlock situation that can only be
      resolved by the switch noticing that its connection to the
      controller has hung and reconnecting.

    - In-band control must override flows set up by the controller.

      It is reasonable to assume that flows set up by the OpenFlow
      controller should take precedence over in-band control, on the
      basis that the controller should be in charge of the switch.

      Again, this does not work in practice. Reasonable controller
      implementations may set up a "last resort" fallback rule that
      wildcards every field and, e.g., sends it up to the controller or
      discards it. If a controller does that, then it will isolate
      itself from the switch.

    - The switch must recognize all control traffic.

      The fundamental principle of in-band control states, in part,
      that a switch must recognize control traffic without involving
      the OpenFlow controller. More specifically, the switch must
      recognize *all* control traffic. "False negatives", that is,
      packets that constitute control traffic but that the switch does
      not recognize as control traffic, lead to control traffic storms.

      Consider an OpenFlow switch that only recognizes control packets
      sent to or from that switch. Now suppose that two switches of
      this type, named A and B, are connected to ports on an Ethernet
      hub (not a switch) and that an OpenFlow controller is connected
      to a third hub port. In this setup, control traffic sent by
      switch A will be seen by switch B, which will send it to the
      controller as part of an OFPT_PACKET_IN message. Switch A will
      then see the OFPT_PACKET_IN message's packet, re-encapsulate it
      in another OFPT_PACKET_IN, and send it to the controller. Switch
      B will then see that OFPT_PACKET_IN, and so on in an infinite
      loop.

      Incidentally, the consequences of "false positives", where
      packets that are not control traffic are nevertheless recognized
      as control traffic, are much less severe. The controller will
      not be able to control their behavior, but the network will
      remain in working order. False positives do constitute a
      security problem.

    - The switch should use echo-requests to detect disconnection.

      TCP will notice that a connection has hung, but this can take a
      considerable amount of time. For example, with default settings
      the Linux kernel TCP implementation will retransmit for between
      13 and 30 minutes, depending on the connection's retransmission
      timeout, according to kernel documentation. This is far too long
      for a switch to be disconnected, so an OpenFlow switch should
      implement its own connection timeout. OpenFlow OFPT_ECHO_REQUEST
      messages are the best way to do this, since they test the
      OpenFlow connection itself.

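The last point, an application-level timeout driven by echo requests,
can be sketched as a two-stage timer. Everything here is hypothetical
(the struct, the function names, and the choice of one probe interval
of silence before probing and one more before giving up); it shows the
general technique rather than the Open vSwitch implementation.

```c
#include <assert.h>
#include <stdbool.h>

enum probe_action { PROBE_NONE, PROBE_SEND_ECHO, PROBE_DISCONNECT };

/* Hypothetical per-connection liveness state, times in milliseconds. */
struct probe_state {
    long long last_rx_ms;       /* When any message was last received. */
    bool echo_pending;          /* An echo request awaits its reply. */
    long long echo_sent_ms;     /* When that echo request was sent. */
};

/* Call periodically.  After 'interval_ms' of silence the caller should
 * send an OFPT_ECHO_REQUEST; if another 'interval_ms' passes with
 * nothing received, the connection is considered dead. */
static enum probe_action
probe_run(struct probe_state *s, long long now_ms, long long interval_ms)
{
    assert(interval_ms > 0);
    if (s->echo_pending) {
        return (now_ms - s->echo_sent_ms >= interval_ms
                ? PROBE_DISCONNECT : PROBE_NONE);
    }
    if (now_ms - s->last_rx_ms >= interval_ms) {
        s->echo_pending = true;
        s->echo_sent_ms = now_ms;
        return PROBE_SEND_ECHO;
    }
    return PROBE_NONE;
}

/* Call whenever any message (including an echo reply) arrives. */
static void
probe_received(struct probe_state *s, long long now_ms)
{
    s->last_rx_ms = now_ms;
    s->echo_pending = false;
}
```

With a 5-second interval, for instance, this detects a hung connection
in roughly 10 seconds instead of the many minutes that TCP
retransmission alone may take.
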
Implementation
--------------

This section describes how Open vSwitch implements in-band control.
Correctly implementing in-band control has proven difficult due to its
many subtleties, and has thus gone through many iterations. Please
read through and understand the reasoning behind the chosen rules
before making modifications.

Open vSwitch implements in-band control as "hidden" flows, that is,
flows that are not visible through OpenFlow, and at a higher priority
than any wildcarded flow that can be set up through OpenFlow. This is
done so that the OpenFlow controller cannot interfere with them and
possibly break connectivity with its switches. It is possible to see all
flows, including in-band ones, with the ovs-appctl "bridge/dump-flows"
command.

The Open vSwitch implementation of in-band control can hide traffic to
arbitrary "remotes", where each remote is one TCP port on one IP address.
Currently the remotes are automatically configured as the in-band OpenFlow
controllers plus the OVSDB managers, if any. (The latter is a requirement
because OVSDB managers are responsible for configuring OpenFlow controllers,
so if the manager cannot be reached then OpenFlow cannot be reconfigured.)

The following rules (with the OFPP_NORMAL action) are set up on any bridge
|
|
that has any remotes:
|
|
|
|
(a) DHCP requests sent from the local port.
|
|
(b) ARP replies to the local port's MAC address.
|
|
(c) ARP requests from the local port's MAC address.
|
|
|
|
In-band also sets up the following rules for each unique next-hop MAC
|
|
address for the remotes' IPs (the "next hop" is either the remote
|
|
itself, if it is on a local subnet, or the gateway to reach the remote):
|
|
|
|
(d) ARP replies to the next hop's MAC address.
|
|
(e) ARP requests from the next hop's MAC address.
|
|
|
|
In-band also sets up the following rules for each unique remote IP address:
|
|
|
|
(f) ARP replies containing the remote's IP address as a target.
|
|
(g) ARP requests containing the remote's IP address as a source.
|
|
|
|
In-band also sets up the following rules for each unique remote (IP,port)
|
|
pair:
|
|
|
|
(h) TCP traffic to the remote's IP and port.
|
|
(i) TCP traffic from the remote's IP and port.
|
|
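The way rules (a) through (i) multiply out over next hops, remote IPs,
and (IP,port) pairs can be sketched in Python as follows. This is only
an illustration: the function names, the dictionary keys used to
describe each match, and the assumption that next-hop MAC addresses
have already been resolved are all hypothetical and do not reflect
Open vSwitch's internal data structures.

```python
from ipaddress import ip_address, ip_network

ARP_REQUEST, ARP_REPLY = 1, 2        # ARP opcodes
DHCP_CLIENT, DHCP_SERVER = 68, 67    # UDP ports used by DHCP


def next_hop_ip(remote_ip, local_subnet, gateway_ip):
    """The remote itself if it is on the local subnet, else the gateway."""
    on_link = ip_address(remote_ip) in ip_network(local_subnet)
    return remote_ip if on_link else gateway_ip


def in_band_rules(local_mac, next_hop_macs, remotes):
    """Return (letter, match) pairs for rules (a) through (i).

    remotes is a list of (ip, tcp_port) pairs.  next_hop_macs holds the
    already-resolved MAC address of each remote's next hop.
    """
    if not remotes:
        return []                    # no remotes, no in-band rules
    rules = [
        ("a", {"in_port": "LOCAL",
               "udp_src": DHCP_CLIENT, "udp_dst": DHCP_SERVER}),
        ("b", {"arp_op": ARP_REPLY, "eth_dst": local_mac}),
        ("c", {"arp_op": ARP_REQUEST, "eth_src": local_mac}),
    ]
    for mac in sorted(set(next_hop_macs)):          # each unique next hop
        rules.append(("d", {"arp_op": ARP_REPLY, "eth_dst": mac}))
        rules.append(("e", {"arp_op": ARP_REQUEST, "eth_src": mac}))
    for ip in sorted({ip for ip, _ in remotes}):    # each unique remote IP
        rules.append(("f", {"arp_op": ARP_REPLY, "arp_target_ip": ip}))
        rules.append(("g", {"arp_op": ARP_REQUEST, "arp_source_ip": ip}))
    for ip, port in sorted(set(remotes)):           # each unique (IP,port)
        rules.append(("h", {"ip_dst": ip, "tcp_dst": port}))
        rules.append(("i", {"ip_src": ip, "tcp_src": port}))
    return rules
```

For example, two remotes that share one IP address and one next hop
yield one copy each of rules (a) through (g) and two copies each of
rules (h) and (i).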
|
|
The goal of these rules is to be as narrow as possible while still
allowing a switch to join a network and communicate with the
remotes. As mentioned earlier, these rules have higher priority
than the controller's rules, so if they are too broad, they may
prevent the controller from implementing its policy. As such,
in-band actively monitors some aspects of flow and packet processing
so that the rules can be made more precise.

In-band control monitors attempts to add flows into the datapath that
could interfere with its duties. The datapath only allows exact
match entries, so in-band control is able to be very precise about
the flows it prevents. Flows that miss in the datapath are sent to
userspace to be processed, so preventing these flows from being
cached in the "fast path" does not affect correctness. The only type
of flow that is currently prevented is one that would prevent DHCP
replies from being seen by the local port. For example, a rule that
forwarded all DHCP traffic to the controller would not be allowed,
but one that forwarded to all ports (including the local port) would.
|
|
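The check described above might look roughly like the following Python
sketch. The function name, the dictionary-based flow description, and
the "LOCAL" port label are assumptions for illustration only. The
sketch treats a flow as carrying DHCP replies when it is UDP from the
DHCP server port (67) to the DHCP client port (68).

```python
UDP = 17                           # IP protocol number for UDP
DHCP_SERVER, DHCP_CLIENT = 67, 68  # UDP ports used by DHCP


def may_install_in_datapath(flow, output_ports, local_port="LOCAL"):
    """Return True if in-band control would let this exact-match flow
    be cached in the datapath (hypothetical helper)."""
    is_dhcp_reply = (flow.get("ip_proto") == UDP
                     and flow.get("udp_src") == DHCP_SERVER
                     and flow.get("udp_dst") == DHCP_CLIENT)
    if not is_dhcp_reply:
        return True                # in-band only cares about DHCP replies
    # DHCP replies must still be seen by the local port.
    return local_port in output_ports
```

A flow that is refused here is not dropped. Its packets keep missing
in the datapath and are handled in userspace instead.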
|
|
As mentioned earlier, packets that miss in the datapath are sent to
userspace for processing. Userspace has its own flow table,
the "classifier", so in-band checks whether any special processing
is needed before the classifier is consulted. If a packet is a DHCP
response to a request from the local port, the packet is forwarded to
the local port, regardless of the flow table. Note that this requires
L7 processing of DHCP replies to determine whether the 'chaddr' field
matches the MAC address of the local port.
|
|
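A minimal sketch of the 'chaddr' comparison, in Python, is shown
below. It assumes the caller has already stripped the Ethernet, IP,
and UDP headers and passes only the DHCP (BOOTP) payload. The function
name is hypothetical. The field layout follows the fixed BOOTP header,
in which 'chaddr' starts at byte offset 28.

```python
import struct

BOOTREPLY = 2          # BOOTP 'op' value for replies
HTYPE_ETHERNET = 1     # hardware type for Ethernet
ETH_ALEN = 6           # length of an Ethernet MAC address
# op, htype, hlen, hops (4 bytes) + xid (4) + secs, flags (4)
# + ciaddr, yiaddr, siaddr, giaddr (16) = 28 bytes before chaddr.
CHADDR_OFFSET = 28
CHADDR_LEN = 16


def is_dhcp_reply_for(payload: bytes, local_mac: bytes) -> bool:
    """Return True if payload is a DHCP reply whose chaddr is local_mac."""
    if len(payload) < CHADDR_OFFSET + CHADDR_LEN:
        return False                        # too short to hold chaddr
    op, htype, hlen = struct.unpack_from("!BBB", payload, 0)
    if op != BOOTREPLY or htype != HTYPE_ETHERNET or hlen != ETH_ALEN:
        return False
    chaddr = payload[CHADDR_OFFSET:CHADDR_OFFSET + ETH_ALEN]
    return chaddr == local_mac
```

Only when this check succeeds would the packet be handed to the local
port ahead of the classifier lookup.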
|
|
It is interesting to note that for an L3-based in-band control
mechanism, the majority of rules are devoted to ARP traffic. At first
glance, some of these rules appear redundant. However, each serves an
important role. First, in order to determine the MAC address of the
remote side (controller or gateway) for other ARP rules, we must allow
ARP traffic for our local port with rules (b) and (c). If we are
between a switch and its connection to the remote, we have to
allow the other switch's ARP traffic through. This is done with
rules (d) and (e), since we do not know the addresses of the other
switches a priori, but do know the remote's or gateway's. Finally,
if the remote is running in a local guest VM that is not reached
through the local port, the switch that is connected to the VM must
allow ARP traffic based on the remote's IP address, with rules (f)
and (g), since it will not know the MAC address of the local port
that is sending the traffic or the MAC address of the remote in the
guest VM.

With a few notable exceptions below, in-band should work in most
network setups. The following are considered "supported" in the
current implementation:

- Locally Connected. The switch and remote are on the same
subnet. This uses rules (a), (b), (c), (h), and (i).

- Reached through Gateway. The switch and remote are on
different subnets and must go through a gateway. This uses
rules (a), (b), (c), (h), and (i).

- Between Switch and Remote. This switch is between another
switch and the remote, and we want to allow the other
switch's traffic through. This uses rules (d), (e), (h), and
(i). It uses (b) and (c) indirectly in order to know the MAC
address for rules (d) and (e). Note that DHCP for the other
switch will not work unless an OpenFlow controller explicitly lets
this switch pass the traffic.

- Between Switch and Gateway. This switch is between another
switch and the gateway, and we want to allow the other switch's
traffic through. This uses the same rules and logic as the
"Between Switch and Remote" configuration described earlier.

- Remote on Local VM. The remote is a guest VM on the
system running in-band control. This uses rules (a), (b), (c),
(h), and (i).

- Remote on Local VM with Different Networks. The remote
is a guest VM on the system running in-band control, but the
local port is not used to connect to the remote. For
example, an IP address is configured on eth0 of the switch. The
remote's VM is connected through eth1 of the switch, but an
IP address has not been configured for that port on the switch.
As such, the switch will use eth0 to connect to the remote,
and eth1's rules about the local port will not work. In the
example, the switch attached to eth0 would use rules (a), (b),
(c), (h), and (i) on eth0. The switch attached to eth1 would use
rules (f), (g), (h), and (i).

The following are explicitly *not* supported by in-band control:

- Specify Remote by Name. Currently, the remote must be
identified by IP address. A naive approach would be to permit
all DNS traffic. Unfortunately, this would prevent the
controller from defining any policy over DNS. Since switches
that are located behind us need to connect to the remote,
in-band cannot simply add a rule that allows DNS traffic from
the local port. The "correct" way to support this is to parse
DNS requests to allow all traffic related to a request for the
remote's name through. Due to the potential security
problems and amount of processing, we decided to hold off for
the time being.

- Differing Remotes for Switches. All switches must know
the L3 addresses for all the remotes that other switches
may use, since rules need to be set up to allow traffic related
to those remotes through. See rules (f), (g), (h), and (i).

- Differing Routes for Switches. In order for the switch to
allow other switches to connect to a remote through a
gateway, it allows the gateway's traffic through with rules (d)
and (e). If the routes to the remote differ for the two
switches, we will not know the MAC address of the alternate
gateway.


Action Reproduction
===================

It seems likely that many controllers, at least at startup, use the
OpenFlow "flow statistics" request to obtain existing flows, then
compare the flows' actions against the actions that they expect to
find. Before version 1.8.0, Open vSwitch always returned exact,
byte-for-byte copies of the actions that had been added to the flow
table. The current version of Open vSwitch does not do this in
some exceptional cases. This section lists the exceptions that
controller authors must keep in mind if they compare actual actions
against desired actions in a bytewise fashion:

- Open vSwitch zeros padding bytes in action structures,
regardless of their values when the flows were added.

- Open vSwitch "normalizes" the instructions in OpenFlow 1.1
(and later) in the following way:

  * OVS sorts the instructions into the following order:
    Apply-Actions, Clear-Actions, Write-Actions,
    Write-Metadata, Goto-Table.

  * OVS drops Apply-Actions instructions that have empty
    action lists.

  * OVS drops Write-Actions instructions that have empty
    action sets.
|
|
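A controller that wants to compare its desired instructions against
what Open vSwitch reports could apply the same normalization to its
own copy first. The Python sketch below shows one way to do so. The
(name, payload) tuple representation and the function name are
assumptions for this example, and zeroing of padding bytes is not
modeled.

```python
# Order in which instructions are reported, per the list above.
INSTRUCTION_ORDER = [
    "Apply-Actions",
    "Clear-Actions",
    "Write-Actions",
    "Write-Metadata",
    "Goto-Table",
]


def normalize_instructions(instructions):
    """Normalize a list of (name, payload) instruction tuples.

    Apply-Actions and Write-Actions instructions whose payload (the
    action list or action set) is empty are dropped, and the rest are
    sorted into INSTRUCTION_ORDER.
    """
    kept = [
        (name, payload)
        for name, payload in instructions
        if not (name in ("Apply-Actions", "Write-Actions") and not payload)
    ]
    return sorted(kept, key=lambda ins: INSTRUCTION_ORDER.index(ins[0]))
```

Comparing normalize_instructions(expected) with the instructions
returned in a flow statistics reply avoids false mismatches caused
only by ordering or by empty Apply-Actions and Write-Actions
instructions.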
|
|
Please report other discrepancies, if you notice any, so that we can
fix or document them.


Suggestions
===========

Suggestions to improve Open vSwitch are welcome at discuss@openvswitch.org.