mirror of
				https://github.com/openvswitch/ovs
				synced 2025-10-25 15:07:05 +00:00 
			
		
		
		
	This defines the version number for OpenFlow 1.4 so that the switch can actually use it. The ovsdb schema is also modified. Signed-off-by: Alexandru Copot <alex.mihai.c@gmail.com> Cc: Daniel Baluta <dbaluta@ixiacom.com> [blp@nicira.com adjusted code in cases where 1.3 and 1.4 are the same] Signed-off-by: Ben Pfaff <blp@nicira.com>
		
			
				
	
	
		
                     Design Decisions In Open vSwitch
                     ================================

This document describes design decisions that went into implementing
Open vSwitch.  While we believe these to be reasonable decisions, it is
impossible to predict how Open vSwitch will be used in all environments.
Understanding assumptions made by Open vSwitch is critical to a
successful deployment.  The end of this document contains contact
information that can be used to let us know how we can make Open vSwitch
more generally useful.

Asynchronous Messages
=====================

Over time, Open vSwitch has added many knobs that control whether a
given controller receives OpenFlow asynchronous messages.  This
section describes how all of these features interact.

First, a service controller never receives any asynchronous messages
unless it changes its miss_send_len from the service controller
default of zero in one of the following ways:

    - Sending an OFPT_SET_CONFIG message with nonzero miss_send_len.

    - Sending any NXT_SET_ASYNC_CONFIG message: as a side effect, this
      message changes the miss_send_len to
      OFP_DEFAULT_MISS_SEND_LEN (128) for service controllers.

Second, OFPT_FLOW_REMOVED and NXT_FLOW_REMOVED messages are generated
only if the flow that was removed had the OFPFF_SEND_FLOW_REM flag
set.

Third, OFPT_PACKET_IN and NXT_PACKET_IN messages are sent only to
OpenFlow controller connections that have the correct connection ID
(see "struct nx_controller_id" and "struct nx_action_controller"):

    - For packet-in messages generated by an NXAST_CONTROLLER action,
      the controller ID specified in the action.

    - For other packet-in messages, controller ID zero.  (This is the
      default ID when an OpenFlow controller does not configure one.)

Finally, Open vSwitch consults a per-connection table indexed by the
message type, reason code, and current role.  The following table
shows how this table is initialized by default when an OpenFlow
connection is made.  An entry labeled "yes" means that the message is
sent; an entry labeled "---" means that the message is suppressed.

                                             master/
  message and reason code                     other     slave
  ----------------------------------------   -------    -----
  OFPT_PACKET_IN / NXT_PACKET_IN
    OFPR_NO_MATCH                              yes       ---
    OFPR_ACTION                                yes       ---
    OFPR_INVALID_TTL                           ---       ---

  OFPT_FLOW_REMOVED / NXT_FLOW_REMOVED
    OFPRR_IDLE_TIMEOUT                         yes       ---
    OFPRR_HARD_TIMEOUT                         yes       ---
    OFPRR_DELETE                               yes       ---

  OFPT_PORT_STATUS
    OFPPR_ADD                                  yes       yes
    OFPPR_DELETE                               yes       yes
    OFPPR_MODIFY                               yes       yes

The NXT_SET_ASYNC_CONFIG message directly sets all of the values in
this table for the current connection.  The
OFPC_INVALID_TTL_TO_CONTROLLER bit in the OFPT_SET_CONFIG message
controls the setting for OFPR_INVALID_TTL for the "master" role.


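As an illustration of the per-connection table described in the
Asynchronous Messages section, the following sketch stores one bitmask
of enabled reason codes per role and message type and fills in the
defaults from the table.  The type and function names (async_config,
async_config_init, async_msg_enabled, and the AM_ and ROLE_ enums) are
assumptions made for this example and are not Open vSwitch identifiers;
only the reason code values come from OpenFlow.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names for this sketch; not Open vSwitch identifiers. */
enum async_msg_type { AM_PACKET_IN, AM_FLOW_REMOVED, AM_PORT_STATUS,
                      AM_N_TYPES };
enum async_role { ROLE_MASTER_OR_OTHER, ROLE_SLAVE, ROLE_N };

/* Reason codes, numbered as in the OpenFlow headers. */
enum { OFPR_NO_MATCH = 0, OFPR_ACTION = 1, OFPR_INVALID_TTL = 2 };
enum { OFPRR_IDLE_TIMEOUT = 0, OFPRR_HARD_TIMEOUT = 1, OFPRR_DELETE = 2 };
enum { OFPPR_ADD = 0, OFPPR_DELETE = 1, OFPPR_MODIFY = 2 };

/* One bit per reason code: a set bit means "yes", a clear bit "---". */
struct async_config {
    uint32_t enabled[ROLE_N][AM_N_TYPES];
};

/* Loads the defaults used when an OpenFlow connection is made. */
void async_config_init(struct async_config *ac)
{
    const uint32_t all3 = 1u << 0 | 1u << 1 | 1u << 2;

    ac->enabled[ROLE_MASTER_OR_OTHER][AM_PACKET_IN] =
        1u << OFPR_NO_MATCH | 1u << OFPR_ACTION;
    ac->enabled[ROLE_MASTER_OR_OTHER][AM_FLOW_REMOVED] = all3;
    ac->enabled[ROLE_MASTER_OR_OTHER][AM_PORT_STATUS] = all3;

    ac->enabled[ROLE_SLAVE][AM_PACKET_IN] = 0;
    ac->enabled[ROLE_SLAVE][AM_FLOW_REMOVED] = 0;
    ac->enabled[ROLE_SLAVE][AM_PORT_STATUS] = all3;
}

/* Returns true if a message of 'type' with 'reason' should be sent to a
 * connection whose current role is 'role'. */
bool async_msg_enabled(const struct async_config *ac, enum async_role role,
                       enum async_msg_type type, unsigned int reason)
{
    return reason < 32 && ((ac->enabled[role][type] >> reason) & 1u) != 0;
}
```

In this model, an NXT_SET_ASYNC_CONFIG message would simply overwrite
the six masks for the current connection.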
OFPAT_ENQUEUE
=============

The OpenFlow 1.0 specification requires the output port of the OFPAT_ENQUEUE
action to "refer to a valid physical port (i.e. < OFPP_MAX) or OFPP_IN_PORT".
Although OFPP_LOCAL is not less than OFPP_MAX, it is an 'internal' port which
can have QoS applied to it in Linux.  Since we allow the OFPAT_ENQUEUE to apply
to 'internal' ports whose port numbers are less than OFPP_MAX, we interpret
OFPP_LOCAL as a physical port and support OFPAT_ENQUEUE on it as well.


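The port check that results from this interpretation of OFPAT_ENQUEUE
can be sketched as follows.  The port constants carry their OpenFlow
1.0 values; the function name enqueue_port_is_valid is hypothetical
and does not come from the Open vSwitch sources.

```c
#include <stdbool.h>
#include <stdint.h>

/* Port numbers as defined by OpenFlow 1.0. */
#define OFPP_MAX     0xff00
#define OFPP_IN_PORT 0xfff8
#define OFPP_LOCAL   0xfffe

/* Hypothetical helper: true if 'port' is acceptable as the output port
 * of an OFPAT_ENQUEUE action under the interpretation described above. */
bool enqueue_port_is_valid(uint16_t port)
{
    return port < OFPP_MAX            /* Ordinary switch port. */
           || port == OFPP_IN_PORT    /* Explicitly allowed by the spec. */
           || port == OFPP_LOCAL;     /* Treated as a physical port. */
}
```

Other reserved ports, such as OFPP_FLOOD (0xfffb), are still rejected
by this check.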
OFPT_FLOW_MOD
=============

The OpenFlow specification for the behavior of OFPT_FLOW_MOD is
confusing.  The following tables summarize the Open vSwitch
implementation of its behavior in the following categories:

    - "match on priority": Whether the flow_mod acts only on flows
      whose priority matches that included in the flow_mod message.

    - "match on out_port": Whether the flow_mod acts only on flows
      that output to the out_port included in the flow_mod message (if
      out_port is not OFPP_NONE).  OpenFlow 1.1 and later have a
      similar feature (not listed separately here) for out_group.

    - "match on flow_cookie": Whether the flow_mod acts only on flows
      whose flow_cookie matches an optional controller-specified value
      and mask.

    - "updates flow_cookie": Whether the flow_mod changes the
      flow_cookie of the flow or flows that it matches to the
      flow_cookie included in the flow_mod message.

    - "updates OFPFF_SEND_FLOW_REM": Whether the flow_mod changes the
      OFPFF_SEND_FLOW_REM flag of the flow or flows that it matches to
      the setting included in the flags of the flow_mod message.

    - "honors OFPFF_CHECK_OVERLAP": Whether the OFPFF_CHECK_OVERLAP
      flag in the flow_mod is significant.

    - "updates idle_timeout" and "updates hard_timeout": Whether the
      idle_timeout and hard_timeout in the flow_mod, respectively,
      have an effect on the flow or flows matched by the flow_mod.

    - "resets idle timer": Whether the flow_mod resets the per-flow
      timer that measures how long a flow has been idle.

    - "resets hard timer": Whether the flow_mod resets the per-flow
      timer that measures how long it has been since a flow was
      modified.

    - "zeros counters": Whether the flow_mod resets per-flow packet
      and byte counters to zero.

    - "may add a new flow": Whether the flow_mod may add a new flow to
      the flow table.  (Obviously this is always true for "add"
      commands but in some OpenFlow versions "modify" and
      "modify-strict" can also add new flows.)

    - "sends flow_removed message": Whether the flow_mod generates a
      flow_removed message for the flow or flows that it affects.

An entry labeled "yes" means that the flow mod type does have the
indicated behavior, "---" means that it does not, an empty cell means
that the property is not applicable, and other values are explained
below the table.

OpenFlow 1.0
------------

                                          MODIFY          DELETE
                             ADD  MODIFY  STRICT  DELETE  STRICT
                             ===  ======  ======  ======  ======
match on priority            yes    ---     yes     ---     yes
match on out_port            ---    ---     ---     yes     yes
match on flow_cookie         ---    ---     ---     ---     ---
match on table_id            ---    ---     ---     ---     ---
controller chooses table_id  ---    ---     ---
updates flow_cookie          yes    yes     yes
updates OFPFF_SEND_FLOW_REM  yes     +       +
honors OFPFF_CHECK_OVERLAP   yes     +       +
updates idle_timeout         yes     +       +
updates hard_timeout         yes     +       +
resets idle timer            yes     +       +
resets hard timer            yes    yes     yes
zeros counters               yes     +       +
may add a new flow           yes    yes     yes
sends flow_removed message   ---    ---     ---      %       %

(+) "modify" and "modify-strict" only take these actions when they
    create a new flow, not when they update an existing flow.

(%) "delete" and "delete-strict" generate a flow_removed message if
    the deleted flow or flows have the OFPFF_SEND_FLOW_REM flag set.
    (Each controller can separately control whether it wants to
    receive the generated messages.)

OpenFlow 1.1
------------

OpenFlow 1.1 makes these changes:

    - The controller must now specify the table_id of the flow table
      that is searched for matches and into which a flow may be
      inserted.  Behavior for a table_id of 255 is undefined.

    - A flow_mod, except an "add", can now match on the flow_cookie.

    - When a flow_mod matches on the flow_cookie, "modify" and
      "modify-strict" never insert a new flow.

                                          MODIFY          DELETE
                             ADD  MODIFY  STRICT  DELETE  STRICT
                             ===  ======  ======  ======  ======
match on priority            yes    ---     yes     ---     yes
match on out_port            ---    ---     ---     yes     yes
match on flow_cookie         ---    yes     yes     yes     yes
match on table_id            yes    yes     yes     yes     yes
controller chooses table_id  yes    yes     yes
updates flow_cookie          yes    ---     ---
updates OFPFF_SEND_FLOW_REM  yes     +       +
honors OFPFF_CHECK_OVERLAP   yes     +       +
updates idle_timeout         yes     +       +
updates hard_timeout         yes     +       +
resets idle timer            yes     +       +
resets hard timer            yes    yes     yes
zeros counters               yes     +       +
may add a new flow           yes     #       #
sends flow_removed message   ---    ---     ---      %       %

(+) "modify" and "modify-strict" only take these actions when they
    create a new flow, not when they update an existing flow.

(%) "delete" and "delete-strict" generate a flow_removed message if
    the deleted flow or flows have the OFPFF_SEND_FLOW_REM flag set.
    (Each controller can separately control whether it wants to
    receive the generated messages.)

(#) "modify" and "modify-strict" only add a new flow if the flow_mod
    does not match on any bits of the flow cookie.

OpenFlow 1.2
------------

OpenFlow 1.2 makes these changes:

    - Only "add" commands ever add flows; "modify" and "modify-strict"
      never do.

    - A new flag OFPFF_RESET_COUNTS now controls whether "modify" and
      "modify-strict" reset counters, whereas previously they never
      reset counters (except when they inserted a new flow).

                                          MODIFY          DELETE
                             ADD  MODIFY  STRICT  DELETE  STRICT
                             ===  ======  ======  ======  ======
match on priority            yes    ---     yes     ---     yes
match on out_port            ---    ---     ---     yes     yes
match on flow_cookie         ---    yes     yes     yes     yes
match on table_id            yes    yes     yes     yes     yes
controller chooses table_id  yes    yes     yes
updates flow_cookie          yes    ---     ---
updates OFPFF_SEND_FLOW_REM  yes    ---     ---
honors OFPFF_CHECK_OVERLAP   yes    ---     ---
updates idle_timeout         yes    ---     ---
updates hard_timeout         yes    ---     ---
resets idle timer            yes    ---     ---
resets hard timer            yes    yes     yes
zeros counters               yes     &       &
may add a new flow           yes    ---     ---
sends flow_removed message   ---    ---     ---      %       %

(%) "delete" and "delete-strict" generate a flow_removed message if
    the deleted flow or flows have the OFPFF_SEND_FLOW_REM flag set.
    (Each controller can separately control whether it wants to
    receive the generated messages.)

(&) "modify" and "modify-strict" reset counters if the
    OFPFF_RESET_COUNTS flag is specified.

OpenFlow 1.3
------------

OpenFlow 1.3 makes these changes:

    - Behavior for a table_id of 255 is now defined, for "delete" and
      "delete-strict" commands, as meaning to delete from all tables.
      A table_id of 255 is now explicitly invalid for other commands.

    - New flags OFPFF_NO_PKT_COUNTS and OFPFF_NO_BYT_COUNTS for "add"
      operations.

The table for 1.3 is the same as the one shown above for 1.2.


OpenFlow 1.4
------------

OpenFlow 1.4 does not change flow_mod semantics.


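The "may add a new flow" row is the one that changes most between
versions, so it may help to see it written as code.  The following is
a minimal sketch of that single row, not the Open vSwitch
implementation: the enum and function names are invented for this
example, and cookie_mask is assumed to be zero whenever the flow_mod
does not match on the flow cookie (which is always the case in
OpenFlow 1.0).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names for this sketch.  The version values are the
 * OpenFlow wire protocol version numbers. */
enum of_version { OF10 = 0x01, OF11 = 0x02, OF12 = 0x03,
                  OF13 = 0x04, OF14 = 0x05 };
enum fm_command { FM_ADD, FM_MODIFY, FM_MODIFY_STRICT,
                  FM_DELETE, FM_DELETE_STRICT };

/* Returns true if a flow_mod with 'command' may insert a new flow. */
bool flow_mod_may_add(enum of_version version, enum fm_command command,
                      uint64_t cookie_mask)
{
    switch (command) {
    case FM_ADD:
        return true;

    case FM_MODIFY:
    case FM_MODIFY_STRICT:
        if (version == OF10) {
            return true;              /* "yes" in the 1.0 table. */
        } else if (version == OF11) {
            return cookie_mask == 0;  /* "#" in the 1.1 table. */
        } else {
            return false;             /* "---" for 1.2 and later. */
        }

    case FM_DELETE:
    case FM_DELETE_STRICT:
    default:
        return false;
    }
}
```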
OFPT_PACKET_IN
==============

The OpenFlow 1.1 specification for OFPT_PACKET_IN is confusing.  The
definition in OF1.1 openflow.h is[*]:

  /* Packet received on port (datapath -> controller). */
  struct ofp_packet_in {
      struct ofp_header header;
      uint32_t buffer_id;     /* ID assigned by datapath. */
      uint32_t in_port;       /* Port on which frame was received. */
      uint32_t in_phy_port;   /* Physical Port on which frame was received. */
      uint16_t total_len;     /* Full length of frame. */
      uint8_t reason;         /* Reason packet is being sent (one of OFPR_*) */
      uint8_t table_id;       /* ID of the table that was looked up */
      uint8_t data[0];        /* Ethernet frame, halfway through 32-bit word,
                                 so the IP header is 32-bit aligned.  The
                                 amount of data is inferred from the length
                                 field in the header.  Because of padding,
                                 offsetof(struct ofp_packet_in, data) ==
                                 sizeof(struct ofp_packet_in) - 2. */
  };
  OFP_ASSERT(sizeof(struct ofp_packet_in) == 24);

The confusing part is the comment on the data[] member.  This comment
is a leftover from OF1.0 openflow.h, in which the comment was correct:
sizeof(struct ofp_packet_in) is 20 in OF1.0 and offsetof(struct
ofp_packet_in, data) is 18.  When OF1.1 was written, the structure
members were changed but the comment was carelessly not updated, and
the comment became wrong: sizeof(struct ofp_packet_in) and
offsetof(struct ofp_packet_in, data) are both 24 in OF1.1.

That leaves the question of how to implement ofp_packet_in in OF1.1.
The OpenFlow reference implementation for OF1.1 does not include any
padding, that is, the first byte of the encapsulated frame immediately
follows the 'table_id' member without a gap.  Open vSwitch therefore
implements it the same way for compatibility.

For an earlier discussion, please see the thread archived at:
https://mailman.stanford.edu/pipermail/openflow-discuss/2011-August/002604.html

[*] The quoted definition is directly from OF1.1.  Definitions used
    inside OVS omit the 8-byte ofp_header members, so the sizes in
    this discussion are 8 bytes larger than those declared in OVS
    header files.


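The sizes and offsets quoted for ofp_packet_in can be checked
mechanically.  The sketch below declares both layouts side by side
under the made-up names ofp10_packet_in and ofp11_packet_in (the real
headers call both of them ofp_packet_in) and uses a C99 flexible array
member in place of data[0].  The OF1.0 member order is reproduced from
the OpenFlow 1.0 header; the results assume a conventional ABI in which
uint32_t is 4-byte aligned.

```c
#include <stddef.h>
#include <stdint.h>

struct ofp_header {
    uint8_t version;
    uint8_t type;
    uint16_t length;
    uint32_t xid;
};

/* OpenFlow 1.0 layout: 18 bytes of members, padded to 20. */
struct ofp10_packet_in {
    struct ofp_header header;
    uint32_t buffer_id;
    uint16_t total_len;
    uint16_t in_port;
    uint8_t reason;
    uint8_t pad;
    uint8_t data[];
};

/* OpenFlow 1.1 layout: 24 bytes of members, no trailing padding. */
struct ofp11_packet_in {
    struct ofp_header header;
    uint32_t buffer_id;
    uint32_t in_port;
    uint32_t in_phy_port;
    uint16_t total_len;
    uint8_t reason;
    uint8_t table_id;
    uint8_t data[];
};

/* The "sizeof - 2" comment holds for OF1.0 ... */
_Static_assert(sizeof(struct ofp10_packet_in) == 20, "OF1.0 size");
_Static_assert(offsetof(struct ofp10_packet_in, data) == 18, "OF1.0 data");

/* ... but not for OF1.1, where data starts exactly at sizeof. */
_Static_assert(sizeof(struct ofp11_packet_in) == 24, "OF1.1 size");
_Static_assert(offsetof(struct ofp11_packet_in, data) == 24, "OF1.1 data");
```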
VLAN Matching
=============

The 802.1Q VLAN header causes more trouble than any other 4 bytes in
networking.  More specifically, three versions of OpenFlow and Open
vSwitch have among them four different ways to match the contents and
presence of the VLAN header.  The following table describes how each
version works.

       Match        NXM        OF1.0        OF1.1         OF1.2
       -----  ---------  -----------  -----------  ------------
         [1]  0000/0000  ????/1,??/?  ????/1,??/?  0000/0000,--
         [2]  0000/ffff  ffff/0,??/?  ffff/0,??/?  0000/ffff,--
         [3]  1xxx/1fff  0xxx/0,??/1  0xxx/0,??/1  1xxx/ffff,--
         [4]  z000/f000  ????/1,0y/0  fffe/0,0y/0  1000/1000,0y
         [5]  zxxx/ffff  0xxx/0,0y/0  0xxx/0,0y/0  1xxx/ffff,0y
         [6]  0000/0fff    <none>       <none>        <none>
         [7]  0000/f000    <none>       <none>        <none>
         [8]  0000/efff    <none>       <none>        <none>
         [9]  1001/1001    <none>       <none>     1001/1001,--
        [10]  3000/3000    <none>       <none>        <none>

Each column is interpreted as follows.

    - Match: See the list below.

    - NXM: xxxx/yyyy means NXM_OF_VLAN_TCI_W with value xxxx and mask
      yyyy.  A mask of 0000 is equivalent to omitting
      NXM_OF_VLAN_TCI(_W), a mask of ffff is equivalent to
      NXM_OF_VLAN_TCI.

    - OF1.0 and OF1.1: wwww/x,yy/z means dl_vlan wwww, OFPFW_DL_VLAN
      x, dl_vlan_pcp yy, and OFPFW_DL_VLAN_PCP z.  ? means that the
      given nibble is ignored (and conventionally 0 for wwww or yy,
      conventionally 1 for x or z).  <none> means that the given match
      is not supported.

    - OF1.2: xxxx/yyyy,zz means OXM_OF_VLAN_VID_W with value xxxx and
      mask yyyy, and OXM_OF_VLAN_PCP (which is not maskable) with
      value zz.  A mask of 0000 is equivalent to omitting
      OXM_OF_VLAN_VID(_W), a mask of ffff is equivalent to
      OXM_OF_VLAN_VID.  -- means that OXM_OF_VLAN_PCP is omitted.
      <none> means that the given match is not supported.

The matches are:

 [1] Matches any packet, that is, one without an 802.1Q header or with
     an 802.1Q header with any TCI value.

 [2] Matches only packets without an 802.1Q header.

     NXM: Any match with (vlan_tci == 0) and (vlan_tci_mask & 0x1000)
     != 0 is equivalent to the one listed in the table.

     OF1.0: The spec doesn't define behavior if dl_vlan is set to
     0xffff and OFPFW_DL_VLAN_PCP is not set.

     OF1.1: The spec says explicitly to ignore dl_vlan_pcp when
     dl_vlan is set to 0xffff.

     OF1.2: The spec doesn't say what should happen if (vlan_vid == 0)
     and (vlan_vid_mask & 0x1000) != 0 but (vlan_vid_mask != 0x1000),
     but it would be straightforward to also interpret as [2].

 [3] Matches only packets that have an 802.1Q header with VID xxx (and
     any PCP).

 [4] Matches only packets that have an 802.1Q header with PCP y (and
     any VID).

     NXM: z is ((y << 1) | 1).

     OF1.0: The spec isn't very clear, but OVS implements it this way.

     OF1.2: Presumably other masks such that (vlan_vid_mask & 0x1fff)
     == 0x1000 would also work, but the spec doesn't define their
     behavior.

 [5] Matches only packets that have an 802.1Q header with VID xxx and
     PCP y.

     NXM: z is ((y << 1) | 1).

     OF1.2: Presumably other masks such that (vlan_vid_mask & 0x1fff)
     == 0x1fff would also work.

 [6] Matches packets with no 802.1Q header or with an 802.1Q header
     with a VID of 0.  Only possible with NXM.

 [7] Matches packets with no 802.1Q header or with an 802.1Q header
     with a PCP of 0.  Only possible with NXM.

 [8] Matches packets with no 802.1Q header or with an 802.1Q header
     with both VID and PCP of 0.  Only possible with NXM.

 [9] Matches only packets that have an 802.1Q header with an
     odd-numbered VID (and any PCP).  Only possible with NXM and
     OF1.2.  (This is just an example; one can match on any desired
     VID bit pattern.)

[10] Matches only packets that have an 802.1Q header with an
     odd-numbered PCP (and any VID).  Only possible with NXM.  (This
     is just an example; one can match on any desired PCP bit
     pattern.)

Additional notes:

    - OF1.2: The top three bits of OXM_OF_VLAN_VID are fixed to zero,
      so bits 13, 14, and 15 in the masks listed in the table may be
      set to arbitrary values, as long as the corresponding value bits
      are also zero.  The suggested ffff mask for [2], [3], and [5]
      allows a shorter OXM representation (the mask is omitted) than
      the minimal 1fff mask.


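The NXM column of the VLAN table can be read as an ordinary
value-and-mask comparison.  The sketch below assumes the NXM
convention for the matched field: it is zero for a packet with no
802.1Q header, and otherwise the TCI with bit 0x1000 (the CFI bit)
forced to 1.  The helper names nxm_packet_tci and nxm_vlan_tci_matches
are invented for this example.

```c
#include <stdbool.h>
#include <stdint.h>

#define VLAN_CFI 0x1000  /* Set whenever an 802.1Q header is present. */

/* Builds the TCI value that an NXM VLAN match is compared against:
 * 0 without an 802.1Q header, else PCP (3 bits), CFI forced to 1,
 * and VID (12 bits). */
uint16_t nxm_packet_tci(bool has_vlan, uint8_t pcp, uint16_t vid)
{
    if (!has_vlan) {
        return 0;
    }
    return (uint16_t) (((pcp & 0x7) << 13) | VLAN_CFI | (vid & 0xfff));
}

/* Value/mask comparison, as for any maskable NXM field. */
bool nxm_vlan_tci_matches(uint16_t packet_tci, uint16_t value,
                          uint16_t mask)
{
    return (packet_tci & mask) == (value & mask);
}
```

For example, row [4] with PCP y = 3 gives z = ((3 << 1) | 1) = 7, that
is, a match of 7000/f000, which accepts a tagged packet with PCP 3 and
any VID but rejects every untagged packet.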
Flow Cookies
============

OpenFlow 1.0 and later versions have the concept of a "flow cookie",
which is a 64-bit integer value attached to each flow.  The treatment
of the flow cookie has varied greatly across OpenFlow versions,
however.

In OpenFlow 1.0:

        - OFPFC_ADD set the cookie in the flow that it added.

        - OFPFC_MODIFY and OFPFC_MODIFY_STRICT updated the cookie for
          the flow or flows that it modified.

        - OFPST_FLOW messages included the flow cookie.

        - OFPT_FLOW_REMOVED messages reported the cookie of the flow
          that was removed.

OpenFlow 1.1 made the following changes:

        - Flow mod operations OFPFC_MODIFY, OFPFC_MODIFY_STRICT,
          OFPFC_DELETE, and OFPFC_DELETE_STRICT, plus flow stats
          requests and aggregate stats requests, gained the ability to
          match on flow cookies with an arbitrary mask.

        - OFPFC_MODIFY and OFPFC_MODIFY_STRICT were changed to add a
          new flow, in the case of no match, only if the flow table
          modification operation did not match on the cookie field.
          (In OpenFlow 1.0, modify operations always added a new flow
          when there was no match.)

        - OFPFC_MODIFY and OFPFC_MODIFY_STRICT no longer updated flow
          cookies.

OpenFlow 1.2 made the following changes:

        - OFPFC_MODIFY and OFPFC_MODIFY_STRICT were changed to never
          add a new flow, regardless of whether the flow cookie was
          used for matching.

Open vSwitch support for OpenFlow 1.0 implements the OpenFlow 1.0
behavior with the following extensions:

        - An NXM extension field NXM_NX_COOKIE(_W) allows the NXM
          versions of OFPFC_MODIFY, OFPFC_MODIFY_STRICT, OFPFC_DELETE,
          and OFPFC_DELETE_STRICT flow_mods, plus flow stats requests
          and aggregate stats requests, to match on flow cookies with
          arbitrary masks.  This is much like the equivalent OpenFlow
          1.1 feature.

        - Like OpenFlow 1.1, OFPFC_MODIFY and OFPFC_MODIFY_STRICT add a
          new flow if there is no match and the mask is zero (or not
          given).

        - The "cookie" field in OFPT_FLOW_MOD and NXT_FLOW_MOD messages
          is used as the cookie value for OFPFC_ADD commands, as
          described in OpenFlow 1.0.  For OFPFC_MODIFY and
          OFPFC_MODIFY_STRICT commands, the "cookie" field is used as a
          new cookie for flows that match unless it is UINT64_MAX, in
          which case the flow's cookie is not updated.

        - NXT_PACKET_IN (the Nicira extended version of
          OFPT_PACKET_IN) reports the cookie of the rule that
          generated the packet, or all-1-bits if no rule generated the
          packet.  (Older versions of OVS used all-0-bits instead of
          all-1-bits.)

The following table shows the handling of different protocols when
receiving OFPFC_MODIFY and OFPFC_MODIFY_STRICT messages.  A mask of 0
indicates either an explicit mask of zero or an implicit one by not
specifying the NXM_NX_COOKIE(_W) field.

                Match   Update   Add on miss   Add on miss
                cookie  cookie     mask!=0       mask==0
                ======  ======   ===========   ===========
OpenFlow 1.0      no     yes        <always add on miss>
OpenFlow 1.1     yes      no          no           yes
OpenFlow 1.2     yes      no          no            no
NXM              yes     yes*         no           yes

* Updates the flow's cookie unless the "cookie" field is UINT64_MAX.


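The NXM row of the flow cookie table combines two independent rules: a
masked comparison that selects flows, and an update rule that treats
UINT64_MAX as "leave the cookie alone".  A minimal sketch of both
follows; the function names cookie_matches and nxm_modify_new_cookie
are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if 'flow_cookie' agrees with 'cookie' in every bit set in
 * 'mask'.  A mask of 0 therefore matches every flow. */
bool cookie_matches(uint64_t flow_cookie, uint64_t cookie, uint64_t mask)
{
    return ((flow_cookie ^ cookie) & mask) == 0;
}

/* Cookie of a matched flow after an NXM "modify" or "modify-strict":
 * the "cookie" field of the message replaces the old cookie unless it
 * is UINT64_MAX, in which case the flow's cookie is not updated. */
uint64_t nxm_modify_new_cookie(uint64_t old_cookie, uint64_t msg_cookie)
{
    return msg_cookie == UINT64_MAX ? old_cookie : msg_cookie;
}
```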
Multiple Table Support
======================

OpenFlow 1.0 has only rudimentary support for multiple flow tables.
Notably, OpenFlow 1.0 does not allow the controller to specify the
flow table to which a flow is to be added.  Open vSwitch adds an
extension for this purpose, which is enabled on a per-OpenFlow
connection basis using the NXT_FLOW_MOD_TABLE_ID message.  When the
extension is enabled, the upper 8 bits of the 'command' member in an
OFPT_FLOW_MOD or NXT_FLOW_MOD message designate the table to which a
flow is to be added.

The Open vSwitch software switch implementation offers 255 flow
tables.  On packet ingress, only the first flow table (table 0) is
searched, and the contents of the remaining tables are not considered
in any way.  Tables other than table 0 only come into play when an
NXAST_RESUBMIT_TABLE action specifies another table to search.

Tables 128 and above are reserved for use by the switch itself.
Controllers should use only tables 0 through 127.


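With the NXT_FLOW_MOD_TABLE_ID extension enabled, the 16-bit 'command'
member carries two values.  The following sketch splits it; the struct
and function names are invented for this example, and the argument is
assumed to have been converted to host byte order already (for example
with ntohs()).

```c
#include <stdint.h>

/* Hypothetical result type for this sketch. */
struct flow_mod_command {
    uint8_t table_id;  /* Upper 8 bits: table to which the flow is added. */
    uint8_t command;   /* Lower 8 bits: the OFPFC_* command itself. */
};

/* Splits a host-byte-order 'command' member when the table ID
 * extension is enabled on the connection. */
struct flow_mod_command split_flow_mod_command(uint16_t command)
{
    struct flow_mod_command fmc;

    fmc.table_id = (uint8_t) (command >> 8);
    fmc.command = (uint8_t) (command & 0xff);
    return fmc;
}
```

A controller that follows the advice above would only ever send table
IDs 0 through 127 in the upper byte.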
IPv6
====

Open vSwitch supports stateless handling of IPv6 packets.  Flows can be
written to support matching TCP, UDP, and ICMPv6 headers within an IPv6
packet.  Deeper matching of some Neighbor Discovery messages is also
supported.

IPv6 was not designed to interact well with middle-boxes.  This,
combined with Open vSwitch's stateless nature, has affected the
processing of IPv6 traffic, which is detailed below.

Extension Headers
-----------------

The base IPv6 header is incredibly simple with the intention of only
containing information relevant for routing packets between two
endpoints.  IPv6 relies heavily on the use of extension headers to
provide any other functionality.  Unfortunately, the extension headers
were designed in such a way that it is impossible to move to the next
header (including the layer-4 payload) unless the current header is
understood.

Open vSwitch will process the following extension headers and continue
to the next header:

    * Fragment (see the next section)
    * AH (Authentication Header)
    * Hop-by-Hop Options
    * Routing
    * Destination Options

When a header is encountered that is not in that list, it is considered
"terminal".  A terminal header's IPv6 protocol value is stored in
"nw_proto" for matching purposes.  If a terminal header is TCP, UDP, or
ICMPv6, the packet will be further processed in an attempt to extract
layer-4 information.

Fragments
---------

IPv6 requires that every link in the internet have an MTU of 1280 octets
or greater (RFC 2460).  As such, a terminal header (as described above in
"Extension Headers") in the first fragment should generally be
reachable.  In this case, the terminal header's IPv6 protocol type is
stored in the "nw_proto" field for matching purposes.  If a terminal
header cannot be found in the first fragment (one with a fragment offset
of zero), the "nw_proto" field is set to 0.  Subsequent fragments (those
with a non-zero fragment offset) have the "nw_proto" field set to the
IPv6 protocol type for fragments (44).

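The extension header and fragment rules above amount to a short loop
over the header chain.  The following is a rough sketch of that loop
rather than the Open vSwitch parser: the function name ipv6_nw_proto is
made up, and returning 0 for any truncated header (not only in first
fragments) is an assumption of this example.  Header lengths follow the
IPv6 and AH specifications: 8-octet units plus 8 for Hop-by-Hop,
Routing, and Destination Options, 4-octet units plus 8 for AH, and a
fixed 8 octets for Fragment.

```c
#include <stddef.h>
#include <stdint.h>

enum { IPPROTO_X_HOPOPTS = 0, IPPROTO_X_ROUTING = 43,
       IPPROTO_X_FRAGMENT = 44, IPPROTO_X_AH = 51,
       IPPROTO_X_DSTOPTS = 60 };

/* 'first' is the Next Header value from the base IPv6 header, 'ext'
 * points just past the 40-byte base header, and 'len' is the number of
 * bytes available there.  Returns the value to store in "nw_proto". */
uint8_t ipv6_nw_proto(uint8_t first, const uint8_t *ext, size_t len)
{
    uint8_t proto = first;
    size_t ofs = 0;            /* Invariant: ofs <= len. */

    for (;;) {
        size_t hdr_len;

        switch (proto) {
        case IPPROTO_X_HOPOPTS:
        case IPPROTO_X_ROUTING:
        case IPPROTO_X_DSTOPTS:
            if (len - ofs < 8) {
                return 0;      /* Terminal header not reachable. */
            }
            hdr_len = ((size_t) ext[ofs + 1] + 1) * 8;
            break;
        case IPPROTO_X_AH:
            if (len - ofs < 8) {
                return 0;
            }
            hdr_len = ((size_t) ext[ofs + 1] + 2) * 4;
            break;
        case IPPROTO_X_FRAGMENT:
            if (len - ofs < 8) {
                return 0;
            }
            /* The 13-bit fragment offset is the top of bytes 2..3. */
            if ((((ext[ofs + 2] << 8) | ext[ofs + 3]) & 0xfff8) != 0) {
                return IPPROTO_X_FRAGMENT;   /* Later fragment: 44. */
            }
            hdr_len = 8;
            break;
        default:
            return proto;      /* Terminal header. */
        }
        if (hdr_len > len - ofs) {
            return 0;
        }
        proto = ext[ofs];      /* Next Header field of this header. */
        ofs += hdr_len;
    }
}
```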
Jumbograms
----------

An IPv6 jumbogram (RFC 2675) is a packet containing a payload longer
than 65,535 octets.  Jumbograms are relevant only in subnets with a link
MTU greater than 65,575 octets, and are not required to be supported on
nodes that do not connect to links with such large MTUs.  Currently, Open
vSwitch doesn't process jumbograms.


In-Band Control
===============

Motivation
----------

An OpenFlow switch must establish and maintain a TCP network
connection to its controller.  There are two basic ways to categorize
the network that this connection traverses: either it is completely
separate from the one that the switch is otherwise controlling, or its
path may overlap the network that the switch controls.  We call the
former case "out-of-band control", the latter case "in-band control".

Out-of-band control has the following benefits:

    - Simplicity: Out-of-band control slightly simplifies the switch
      implementation.

    - Reliability: Excessive switch traffic volume cannot interfere
      with control traffic.

    - Integrity: Machines not on the control network cannot
      impersonate a switch or a controller.

    - Confidentiality: Machines not on the control network cannot
      snoop on control traffic.

In-band control, on the other hand, has the following advantages:

    - No dedicated port: There is no need to dedicate a physical
      switch port to control, which is important on switches that have
      few ports (e.g. wireless routers, low-end embedded platforms).

    - No dedicated network: There is no need to build and maintain a
      separate control network.  This is important in many
      environments because it reduces proliferation of switches and
      wiring.

Open vSwitch supports both out-of-band and in-band control.  This
section describes the principles behind in-band control.  See the
description of the Controller table in ovs-vswitchd.conf.db(5) to
configure OVS for in-band control.

Principles
----------

The fundamental principle of in-band control is that an OpenFlow
switch must recognize and switch control traffic without involving the
OpenFlow controller.  All the details of implementing in-band control
are special cases of this principle.

The rationale for this principle is simple.  If the switch does not
handle in-band control traffic itself, then it will be caught in a
contradiction: it must contact the controller, but it cannot, because
only the controller can set up the flows that are needed to contact
the controller.

The following points describe important special cases of this
principle.

   - In-band control must be implemented regardless of whether the
     switch is connected.

     It is tempting to implement the in-band control rules only when
     the switch is not connected to the controller, using the
     reasoning that the controller should have complete control once
     it has established a connection with the switch.

     This does not work in practice.  Consider the case where the
     switch is connected to the controller.  Occasionally it can
     happen that the controller forgets or otherwise needs to obtain
     the MAC address of the switch.  To do so, the controller sends a
     broadcast ARP request.  A switch that implements the in-band
     control rules only when it is disconnected will then send an
     OFPT_PACKET_IN message up to the controller.  The controller will
     be unable to respond, because it does not know the MAC address of
     the switch.  This is a deadlock situation that can only be
 | |
|      resolved by the switch noticing that its connection to the
 | |
|      controller has hung and reconnecting.
 | |
| 
 | |
|    - In-band control must override flows set up by the controller.
 | |
| 
 | |
|      It is reasonable to assume that flows set up by the OpenFlow
 | |
|      controller should take precedence over in-band control, on the
 | |
|      basis that the controller should be in charge of the switch.
 | |
| 
 | |
|      Again, this does not work in practice.  Reasonable controller
 | |
|      implementations may set up a "last resort" fallback rule that
 | |
|      wildcards every field and, e.g., sends matching packets up to the
 | |
|      controller or discards them.  If such a rule took precedence over
 | |
|      in-band control, the controller would isolate itself from the switch.
 | |
| 
 | |
|    - The switch must recognize all control traffic.
 | |
| 
 | |
|      The fundamental principle of in-band control states, in part,
 | |
|      that a switch must recognize control traffic without involving
 | |
|      the OpenFlow controller.  More specifically, the switch must
 | |
|      recognize *all* control traffic.  "False negatives", that is,
 | |
|      packets that constitute control traffic but that the switch does
 | |
|      not recognize as control traffic, lead to control traffic storms.
 | |
| 
 | |
|      Consider an OpenFlow switch that only recognizes control packets
 | |
|      sent to or from that switch.  Now suppose that two switches of
 | |
|      this type, named A and B, are connected to ports on an Ethernet
 | |
|      hub (not a switch) and that an OpenFlow controller is connected
 | |
|      to a third hub port.  In this setup, control traffic sent by
 | |
|      switch A will be seen by switch B, which will send it to the
 | |
|      controller as part of an OFPT_PACKET_IN message.  Switch A will
 | |
|      then see the OFPT_PACKET_IN message's packet, re-encapsulate it
 | |
|      in another OFPT_PACKET_IN, and send it to the controller.  Switch
 | |
|      B will then see that OFPT_PACKET_IN, and so on in an infinite
 | |
|      loop.
 | |
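The loop described above can be illustrated with a toy simulation.  This
is only a sketch under simplifying assumptions: the hub is modeled as a
queue of frames addressed to the controller, and the names (simulate,
recognizes_all_control, max_events) are hypothetical, not part of Open
vSwitch.

```python
from collections import deque


def simulate(recognizes_all_control, max_events=10):
    """Count OFPT_PACKET_IN messages generated on a shared hub segment.

    The simulation stops after max_events packet-ins, because the
    "false negative" case would otherwise never terminate.
    """
    switches = ["A", "B"]
    # Each frame is (sender, encapsulation_depth); every frame is control
    # traffic addressed to the controller and is seen by all hub ports.
    queue = deque([("A", 0)])
    packet_ins = 0
    while queue and packet_ins < max_events:
        sender, depth = queue.popleft()
        for sw in switches:
            if sw == sender:
                continue  # a hub does not send a frame back to its sender
            if recognizes_all_control:
                continue  # recognized as control traffic: no packet-in
            # False negative: the frame is wrapped in a new packet-in,
            # which is itself control traffic visible to the other switch.
            packet_ins += 1
            queue.append((sw, depth + 1))
    return packet_ins
```

With recognizes_all_control=False the count always reaches max_events,
however large it is; with recognizes_all_control=True no packet-in is
ever generated.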
| 
 | |
|      Incidentally, the consequences of "false positives", where
 | |
|      packets that are not control traffic are nevertheless recognized
 | |
|      as control traffic, are much less severe.  The controller will
 | |
|      not be able to control their behavior, but the network will
 | |
|      remain in working order.  False positives do constitute a
 | |
|      security problem.
 | |
| 
 | |
|    - The switch should use echo-requests to detect disconnection.
 | |
| 
 | |
|      TCP will notice that a connection has hung, but this can take a
 | |
|      considerable amount of time.  For example, with default settings
 | |
|      the Linux kernel TCP implementation will retransmit for between
 | |
|      13 and 30 minutes, depending on the connection's retransmission
 | |
|      timeout, according to kernel documentation.  This is far too long
 | |
|      for a switch to be disconnected, so an OpenFlow switch should
 | |
|      implement its own connection timeout.  OpenFlow OFPT_ECHO_REQUEST
 | |
|      messages are the best way to do this, since they test the
 | |
|      OpenFlow connection itself.
 | |
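A minimal sketch of such a connection timeout follows.  It is not the
Open vSwitch implementation: the class name, the method names, and the
single probe_interval parameter are assumptions made for illustration,
and time is passed in explicitly so that the logic is easy to follow.

```python
class EchoProbe:
    """Decide when to send an echo request and when to give up."""

    def __init__(self, probe_interval, now):
        self.probe_interval = probe_interval
        self.last_rx = now          # time of the last message from the peer
        self.echo_sent_at = None    # set while an echo request is pending

    def received(self, now):
        """Any message from the peer proves the connection is alive."""
        self.last_rx = now
        self.echo_sent_at = None

    def poll(self, now):
        """Return 'ok', 'send_echo', or 'disconnect'."""
        if self.echo_sent_at is not None:
            # An echo request is outstanding; give up if it goes unanswered.
            if now - self.echo_sent_at >= self.probe_interval:
                return "disconnect"
            return "ok"
        if now - self.last_rx >= self.probe_interval:
            # The connection has been idle too long: probe it.
            self.echo_sent_at = now
            return "send_echo"
        return "ok"
```

With an assumed probe_interval of 5 seconds, a hung connection is
detected after at most about 10 seconds of silence, rather than the
minutes that TCP alone may take.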
| 
 | |
| Implementation
 | |
| --------------
 | |
| 
 | |
| This section describes how Open vSwitch implements in-band control.
 | |
| Correctly implementing in-band control has proven difficult due to its
 | |
| many subtleties, and has thus gone through many iterations.  Please
 | |
| read through and understand the reasoning behind the chosen rules
 | |
| before making modifications.
 | |
| 
 | |
| Open vSwitch implements in-band control as "hidden" flows, that is,
 | |
| flows that are not visible through OpenFlow, and at a higher priority
 | |
| than any wildcarded flow that can be set up through OpenFlow.  This is done so
 | |
| that the OpenFlow controller cannot interfere with them and possibly
 | |
| break connectivity with its switches.  It is possible to see all
 | |
| flows, including in-band ones, with the ovs-appctl "bridge/dump-flows"
 | |
| command.
 | |
| 
 | |
| The Open vSwitch implementation of in-band control can hide traffic to
 | |
| arbitrary "remotes", where each remote is one TCP port on one IP address.
 | |
| Currently the remotes are automatically configured as the in-band OpenFlow
 | |
| controllers plus the OVSDB managers, if any.  (The latter is a requirement
 | |
| because OVSDB managers are responsible for configuring OpenFlow controllers,
 | |
| so if the manager cannot be reached then OpenFlow cannot be reconfigured.)
 | |
| 
 | |
| The following rules (with the OFPP_NORMAL action) are set up on any bridge
 | |
| that has any remotes:
 | |
| 
 | |
|    (a) DHCP requests sent from the local port.
 | |
|    (b) ARP replies to the local port's MAC address.
 | |
|    (c) ARP requests from the local port's MAC address.
 | |
| 
 | |
| In-band also sets up the following rules for each unique next-hop MAC
 | |
| address for the remotes' IPs (the "next hop" is either the remote
 | |
| itself, if it is on a local subnet, or the gateway to reach the remote):
 | |
| 
 | |
|    (d) ARP replies to the next hop's MAC address.
 | |
|    (e) ARP requests from the next hop's MAC address.
 | |
| 
 | |
| In-band also sets up the following rules for each unique remote IP address:
 | |
| 
 | |
|    (f) ARP replies containing the remote's IP address as a target.
 | |
|    (g) ARP requests containing the remote's IP address as a source.
 | |
| 
 | |
| In-band also sets up the following rules for each unique remote (IP,port)
 | |
| pair:
 | |
| 
 | |
|    (h) TCP traffic to the remote's IP and port.
 | |
|    (i) TCP traffic from the remote's IP and port.
 | |
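The way rules (a) through (i) multiply with the number of remotes can be
sketched as follows.  This is an illustration only: the function name,
the tuple layout, and the match field names are assumptions and do not
reflect Open vSwitch internals.  In particular, matching rule (a) on the
local port as input port with UDP ports 68 and 67 is one plausible
reading of "DHCP requests sent from the local port".

```python
def in_band_rules(local_mac, remotes):
    """Return (rule_letter, match) pairs for one bridge.

    remotes: iterable of (ip, tcp_port, next_hop_mac) tuples.
    """
    remotes = list(remotes)
    if not remotes:
        return []  # rules are set up only on a bridge that has remotes

    rules = [
        ("a", {"in_port": "LOCAL", "proto": "udp", "tp_src": 68, "tp_dst": 67}),
        ("b", {"proto": "arp", "arp_op": "reply", "dl_dst": local_mac}),
        ("c", {"proto": "arp", "arp_op": "request", "dl_src": local_mac}),
    ]
    # One pair of rules per unique next-hop MAC address.
    for mac in sorted({hop for _, _, hop in remotes}):
        rules.append(("d", {"proto": "arp", "arp_op": "reply", "dl_dst": mac}))
        rules.append(("e", {"proto": "arp", "arp_op": "request", "dl_src": mac}))
    # One pair of rules per unique remote IP address.
    for ip in sorted({ip for ip, _, _ in remotes}):
        rules.append(("f", {"proto": "arp", "arp_op": "reply", "arp_tpa": ip}))
        rules.append(("g", {"proto": "arp", "arp_op": "request", "arp_spa": ip}))
    # One pair of rules per unique remote (IP, port) pair.
    for ip, port in sorted({(ip, port) for ip, port, _ in remotes}):
        rules.append(("h", {"proto": "tcp", "nw_dst": ip, "tp_dst": port}))
        rules.append(("i", {"proto": "tcp", "nw_src": ip, "tp_src": port}))
    return rules
```

For example, an OpenFlow controller and an OVSDB manager that share one
IP address and one next hop but listen on different TCP ports yield one
copy each of rules (a) through (g) and two copies each of (h) and (i).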
| 
 | |
| The goal of these rules is to be as narrow as possible to allow a
 | |
| switch to join a network and be able to communicate with the
 | |
| remotes.  As mentioned earlier, these rules have higher priority
 | |
| than the controller's rules, so if they are too broad, they may
 | |
| prevent the controller from implementing its policy.  As such,
 | |
| in-band actively monitors some aspects of flow and packet processing
 | |
| so that the rules can be made more precise.
 | |
| 
 | |
| In-band control monitors attempts to add flows into the datapath that
 | |
| could interfere with its duties.  The datapath only allows exact
 | |
| match entries, so in-band control is able to be very precise about
 | |
| the flows it prevents.  Flows that miss in the datapath are sent to
 | |
| userspace to be processed, so preventing these flows from being
 | |
| cached in the "fast path" does not affect correctness.  The only type
 | |
| of flow that is currently prevented is one that would prevent DHCP
 | |
| replies from being seen by the local port.  For example, a rule that
 | |
| forwarded all DHCP traffic to the controller would not be allowed,
 | |
| but one that forwarded to all ports (including the local port) would.
 | |
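The check described above might look roughly like the following sketch.
The function name, the dictionary-based flow representation, and the
port names are hypothetical; only the policy itself (a datapath flow for
DHCP replies must include the local port among its outputs) comes from
the text.

```python
def may_install(flow, output_ports):
    """Return True if in-band control allows caching this datapath flow.

    flow: dict describing an exact-match flow (illustrative field names).
    output_ports: set of ports the flow's actions would output to.
    """
    is_dhcp_reply = (
        flow.get("proto") == "udp"
        and flow.get("tp_src") == 67   # from the DHCP server port
        and flow.get("tp_dst") == 68   # to the DHCP client port
    )
    if not is_dhcp_reply:
        return True
    # DHCP replies must remain visible to the local port.
    return "LOCAL" in output_ports
```

A refused flow is simply not cached in the datapath; its packets keep
going to userspace, so forwarding stays correct.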
| 
 | |
| As mentioned earlier, packets that miss in the datapath are sent to
 | |
| the userspace for processing.  The userspace has its own flow table,
 | |
| the "classifier", so in-band checks whether any special processing
 | |
| is needed before the classifier is consulted.  If a packet is a DHCP
 | |
| response to a request from the local port, the packet is forwarded to
 | |
| the local port, regardless of the flow table.  Note that this requires
 | |
| L7 processing of DHCP replies to determine whether the 'chaddr' field
 | |
| matches the MAC address of the local port.
 | |
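As an illustration of the 'chaddr' comparison, the sketch below inspects
the fixed BOOTP header that starts every DHCP message.  The function
name is hypothetical and the code is not taken from Open vSwitch; the
field offsets follow the standard BOOTP layout (op, htype, hlen, hops,
xid, secs, flags, and four IPv4 addresses precede 'chaddr', for a total
of 28 bytes).

```python
import struct

BOOTREPLY = 2        # value of the 'op' field in a reply
HTYPE_ETHERNET = 1   # hardware type for Ethernet
CHADDR_OFFSET = 28   # bytes preceding the 16-byte 'chaddr' field


def is_dhcp_reply_for_local_port(payload, local_mac):
    """Check whether a UDP payload is a DHCP reply for local_mac.

    payload: bytes of the BOOTP/DHCP message (the UDP payload).
    local_mac: the local port's MAC address as 6 raw bytes.
    """
    if len(payload) < CHADDR_OFFSET + 16:
        return False  # too short to contain a complete 'chaddr' field
    op, htype, hlen = struct.unpack_from("!BBB", payload, 0)
    if op != BOOTREPLY or htype != HTYPE_ETHERNET or hlen != 6:
        return False
    return payload[CHADDR_OFFSET:CHADDR_OFFSET + 6] == local_mac
```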
| 
 | |
| It is interesting to note that for an L3-based in-band control
 | |
| mechanism, the majority of rules are devoted to ARP traffic.  At first
 | |
| glance, some of these rules appear redundant.  However, each serves an
 | |
| important role.  First, in order to determine the MAC address of the
 | |
| remote side (controller or gateway) for other ARP rules, we must allow
 | |
| ARP traffic for our local port with rules (b) and (c).  If we are
 | |
| between a switch and its connection to the remote, we have to
 | |
| allow the other switch's ARP traffic through.  This is done with
 | |
| rules (d) and (e), since we do not know the addresses of the other
 | |
| switches a priori, but do know the remote's or gateway's.  Finally,
 | |
| if the remote is running in a local guest VM that is not reached
 | |
| through the local port, the switch that is connected to the VM must
 | |
| allow ARP traffic based on the remote's IP address, using rules (f)
 | |
| and (g), since it will not know the MAC address of the local port that
 | |
| is sending the traffic or the MAC address of the remote in the guest VM.
 | |
| 
 | |
| With a few notable exceptions below, in-band should work in most
 | |
| network setups.  The following are considered "supported" in the
 | |
| current implementation:
 | |
| 
 | |
|    - Locally Connected.  The switch and remote are on the same
 | |
|      subnet.  This uses rules (a), (b), (c), (h), and (i).
 | |
| 
 | |
|    - Reached through Gateway.  The switch and remote are on
 | |
|      different subnets and must go through a gateway.  This uses
 | |
|      rules (a), (b), (c), (h), and (i).
 | |
| 
 | |
|    - Between Switch and Remote.  This switch is between another
 | |
|      switch and the remote, and we want to allow the other
 | |
|      switch's traffic through.  This uses rules (d), (e), (h), and
 | |
|      (i).  It uses (b) and (c) indirectly in order to know the MAC
 | |
|      address for rules (d) and (e).  Note that DHCP for the other
 | |
|      switch will not work unless an OpenFlow controller explicitly lets this
 | |
|      switch pass the traffic.
 | |
| 
 | |
|    - Between Switch and Gateway.  This switch is between another
 | |
|      switch and the gateway, and we want to allow the other switch's
 | |
|      traffic through.  This uses the same rules and logic as the
 | |
|      "Between Switch and Remote" configuration described earlier.
 | |
| 
 | |
|    - Remote on Local VM.  The remote is a guest VM on the
 | |
|      system running in-band control.  This uses rules (a), (b), (c),
 | |
|      (h), and (i).
 | |
| 
 | |
|    - Remote on Local VM with Different Networks.  The remote
 | |
|      is a guest VM on the system running in-band control, but the
 | |
|      local port is not used to connect to the remote.  For
 | |
|      example, an IP address is configured on eth0 of the switch.  The
 | |
|      remote's VM is connected through eth1 of the switch, but an
 | |
|      IP address has not been configured for that port on the switch.
 | |
|      As such, the switch will use eth0 to connect to the remote,
 | |
|      and eth1's rules about the local port will not work.  In the
 | |
|      example, the switch attached to eth0 would use rules (a), (b),
 | |
|      (c), (h), and (i) on eth0.  The switch attached to eth1 would use
 | |
|      rules (f), (g), (h), and (i).
 | |
| 
 | |
| The following are explicitly *not* supported by in-band control:
 | |
| 
 | |
|    - Specify Remote by Name.  Currently, the remote must be
 | |
|      identified by IP address.  A naive approach would be to permit
 | |
|      all DNS traffic.  Unfortunately, this would prevent the
 | |
|      controller from defining any policy over DNS.  Since switches
 | |
|      that are located behind us need to connect to the remote,
 | |
|      in-band cannot simply add a rule that allows DNS traffic from
 | |
|      the local port.  The "correct" way to support this is to parse
 | |
|      DNS requests to allow all traffic related to a request for the
 | |
|      remote's name through.  Due to the potential security
 | |
|      problems and amount of processing, we decided to hold off for
 | |
|      the time being.
 | |
| 
 | |
|    - Differing Remotes for Switches.  All switches must know
 | |
|      the L3 addresses for all the remotes that other switches
 | |
|      may use, since rules need to be set up to allow traffic related
 | |
|      to those remotes through.  See rules (f), (g), (h), and (i).
 | |
| 
 | |
|    - Differing Routes for Switches.  In order for the switch to
 | |
|      allow other switches to connect to a remote through a
 | |
|      gateway, it allows the gateway's traffic through with rules (d)
 | |
|      and (e).  If the routes to the remote differ for the two
 | |
|      switches, we will not know the MAC address of the alternate
 | |
|      gateway.
 | |
| 
 | |
| 
 | |
| Action Reproduction
 | |
| ===================
 | |
| 
 | |
| It seems likely that many controllers, at least at startup, use the
 | |
| OpenFlow "flow statistics" request to obtain existing flows, then
 | |
| compare the flows' actions against the actions that they expect to
 | |
| find.  Before version 1.8.0, Open vSwitch always returned exact,
 | |
| byte-for-byte copies of the actions that had been added to the flow
 | |
| table.  The current version of Open vSwitch does not do this in
 | |
| some exceptional cases.  This section lists the exceptions that
 | |
| controller authors must keep in mind if they compare actual actions
 | |
| against desired actions in a bytewise fashion:
 | |
| 
 | |
|         - Open vSwitch zeros padding bytes in action structures,
 | |
|           regardless of their values when the flows were added.
 | |
| 
 | |
|         - Open vSwitch "normalizes" the instructions in OpenFlow 1.1
 | |
|           (and later) in the following way:
 | |
| 
 | |
|               * OVS sorts the instructions into the following order:
 | |
|                 Apply-Actions, Clear-Actions, Write-Actions,
 | |
|                 Write-Metadata, Goto-Table.
 | |
| 
 | |
|               * OVS drops Apply-Actions instructions that have empty
 | |
|                 action lists.
 | |
| 
 | |
|               * OVS drops Write-Actions instructions that have empty
 | |
|                 action sets.
 | |
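A controller that wants to compare instructions after this
normalization could apply the same three steps to its own expected
instructions first.  The following is a sketch under assumptions: the
instruction names and the (type, payload) tuple representation are
invented for illustration and are not an Open vSwitch API.

```python
# Order into which OVS sorts instructions, as listed above.
ORDER = [
    "apply_actions",
    "clear_actions",
    "write_actions",
    "write_metadata",
    "goto_table",
]


def normalize(instructions):
    """Normalize a list of (type, payload) instruction tuples.

    Drops Apply-Actions and Write-Actions whose payload is empty, then
    sorts the remaining instructions into the order given by ORDER.
    """
    kept = [
        (itype, payload)
        for itype, payload in instructions
        if not (itype in ("apply_actions", "write_actions") and not payload)
    ]
    return sorted(kept, key=lambda ins: ORDER.index(ins[0]))
```

For example, a Goto-Table instruction followed by an empty Write-Actions
and a non-empty Apply-Actions normalizes to Apply-Actions followed by
Goto-Table.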
| 
 | |
| Please report other discrepancies, if you notice any, so that we can
 | |
| fix or document them.
 | |
| 
 | |
| 
 | |
| Suggestions
 | |
| ===========
 | |
| 
 | |
| Suggestions to improve Open vSwitch are welcome at discuss@openvswitch.org.
 |