mirror of
https://github.com/openvswitch/ovs
synced 2025-08-31 06:15:47 +00:00
datapath-windows: Update DESIGN document.

In this patch, we update the design document to reflect the netlink based
kernel-userspace interface implementation and a few other changes. I have
covered at a high level. Please feel free to extend the document with more
details that you think got missed out.

Signed-off-by: Nithin Raju <nithin@vmware.com>
Acked-by: Sorin Vinturis <svinturis@cloudbasesolutions.com>
Signed-off-by: Ben Pfaff <blp@nicira.com>

@@ -1,20 +1,13 @@
 OVS-on-Hyper-V Design Document
 ==============================
-There has been an effort in the recent past to develop the Open vSwitch (OVS)
-solution onto multiple hypervisor platforms such as FreeBSD and Microsoft
-Hyper-V. VMware has been working on a OVS solution for Microsoft Hyper-V for
-the past few months and has successfully completed the implementation.
+There has been a community effort to develop Open vSwitch on Microsoft Hyper-V.
+In this document, we provide details of the development effort. We believe this
+document should give enough information to understand the overall design.

-This document provides details of the development effort. We believe this
-document should give enough information to members of the community who are
-curious about the developments of OVS on Hyper-V. The community should also be
-able to get enough information to make plans to leverage the deliverables of
-this effort.
-
-The userspace portion of the OVS has already been ported to Hyper-V and
-committed to the openvswitch repo. So, this document will mostly emphasize on
-the kernel driver, though we touch upon some of the aspects of userspace as
-well.
+The userspace portion of the OVS has been ported to Hyper-V in a separate
+effort, and committed to the openvswitch repo. So, this document will mostly
+emphasize on the kernel driver, though we touch upon some of the aspects of
+userspace as well.

 We cover the following topics:
 1. Background into relevant Hyper-V architecture
@@ -48,13 +41,13 @@ In Hyper-V, the virtual machine is called the Child Partition. Each VIF or
 physical NIC on the Hyper-V extensible switch is attached via a port. Each port
 is both on the ingress path or the egress path of the switch. The ingress path
 is used for packets being sent out of a port, and egress is used for packet
-being received on a port. By design, NDIS provides a layered interface, where
-in the ingress path, higher level layers call into lower level layers, and on
-the egress path, it is the other way round. In addition, there is a object
-identifier (OID) interface for control operations Eg. addition of a port. The
-workflow for the calls is similar in nature to the packets, where higher level
-layers call into the lower level layers. A good representational diagram of
-this architecture is in [4].
+being received on a port. By design, NDIS provides a layered interface. In this
+layered interface, higher level layers call into lower level layers, in the
+ingress path. In the egress path, it is the other way round. In addition, there
+is a object identifier (OID) interface for control operations Eg. addition of
+a port. The workflow for the calls is similar in nature to the packets, where
+higher level layers call into the lower level layers. A good representational
+diagram of this architecture is in [4].

 Windows Filtering Platform (WFP)[5] is a platform implemented on Hyper-V that
 provides APIs and services for filtering packets. WFP has been utilized to
@@ -75,22 +68,23 @@ has been used to retrieve some of the configuration information that OVS needs.
 | |
 +------+ +--------------+ | +-----------+ +------------+ |
 | | | | | | | | | |
-| OVS- | | OVS | | | Virtual | | Virtual | |
-| wind | | USERSPACE | | | Machine #1| | Machine #2 | |
-| | | DAEMON/CTL | | | | | | |
+| ovs- | | OVS- | | | Virtual | | Virtual | |
+| *ctl | | USERSPACE | | | Machine #1| | Machine #2 | |
+| | | DAEMON | | | | | | |
 +------+-++---+---------+ | +--+------+-+ +----+------++ | +--------+
-| DPIF- | | netdev- | | |VIF #1| |VIF #2| | |Physical|
-| Windows |<=>| Windows | | +------+ +------+ | | NIC |
+| dpif- | | netdev- | | |VIF #1| |VIF #2| | |Physical|
+| netlink | | windows | | +------+ +------+ | | NIC |
 +---------+ +---------+ | || /\ | +--------+
-User /\ | || *#1* *#4* || | /\
-=========||=======================+------||-------------------||--+ ||
-Kernel || \/ || ||=====/
-\/ +-----+ +-----+ *#5*
+User /\ /\ | || *#1* *#4* || | /\
+=========||=========||============+------||-------------------||--+ ||
+Kernel || || \/ || ||=====/
+\/ \/ +-----+ +-----+ *#5*
 +-------------------------------+ | | | |
 | +----------------------+ | | | | |
 | | OVS Pseudo Device | | | | | |
-| +----------------+-----+ | | | | |
-| | | I | | |
+| +----------------------+ | | | | |
+| | Netlink Impl. | | | | | |
+| ----------------- | | I | | |
 | +------------+ | | N | | E |
 | | Flowtable | +------------+ | | G | | G |
 | +------------+ | Packet | |*#2*| R | | R |
@@ -110,9 +104,8 @@ Kernel || \/ || ||=====/
 Figure 2 shows the various blocks involved in the OVS Windows implementation,
 along with some of the components available in the NDIS stack, and also the
 virtual machines. The workflow of a packet being transmitted from a VIF out and
-into another VIF and to a physical NIC is also shown. New userspace components
-being added as also shown. Later on in this section, we’ll discuss the flow of
-a packet at a high level.
+into another VIF and to a physical NIC is also shown. Later on in this section,
+we will discuss the flow of a packet at a high level.

 The figure gives a general idea of where the OVS userspace and the kernel
 components fit in, and how they interface with each other.
@@ -122,9 +115,11 @@ a forwarding extension roughly implementing the following
 sub-modules/functionality. Details of each of these sub-components in the
 kernel are contained in later sections:
 * Interfacing with the NDIS stack
+* Netlink message parser
+* Netlink sockets
 * Switch/Datapath management
 * Interfacing with userspace portion of the OVS solution to implement the
-necessary ioctls that userspace needs
+necessary functionality that userspace needs
 * Port management
 * Flowtable/Actions/packet forwarding
 * Tunneling
@@ -140,32 +135,36 @@ are:
 * Interface between the userspace and the kernel module.
 * Event notifications are significantly different.
 * The communication interface between DPIF and the kernel module need not be
-implemented in the way OVS on Linux does.
+implemented in the way OVS on Linux does. That said, it would be
+advantageous to have a similar interface to the kernel module for reasons of
+readability and maintainability.
 * Any licensing issues of using Linux kernel code directly.

 Due to these differences, it was a straightforward decision to develop the
 datapath for OVS on Hyper-V from scratch rather than porting the one on Linux.
-A re-development focussed on the following goals:
+A re-development focused on the following goals:
 * Adhere to the existing requirements of userspace portion of OVS (such as
-ovs- vswitchd), to minimize changes in the userspace workflow.
+ovs-vswitchd), to minimize changes in the userspace workflow.
 * Fit well into the typical workflow of a Hyper-V extensible switch forwarding
 extension.

 The userspace portion of the OVS solution is mostly POSIX code, and not very
-Linux specific. Majority of the code has already been ported and committed to
-the openvswitch repo. Most of the daemons such as ovs-vswitchd or ovsdb-server
-can run on Windows now. One additional daemon that has been implemented is
-called ovs-wind. At a high level ovs-wind manages keeps the ovsdb used by
-userspace in sync with the kernel state. More details in the userspace section.
+Linux specific. Majority of the userspace code does not interface directly with
+the kernel datapath and was ported independently of the kernel datapath
+effort.

 As explained in the OVS porting design document [7], DPIF is the portion of
-userspace that interfaces with the kernel portion of the OVS. Each platform can
-have its own implementation of the DPIF provider whose interface is defined in
-dpif-provider.h [3]. For OVS on Hyper-V, we have an implementation of DPIF
-provider for Hyper-V. The communication interface between userspace and the
-kernel is a pseudo device and is different from that of the Linux’s DPIF
-provider which uses netlink. But, as long as the DPIF provider interface is the
-same, the callers should be agnostic of the underlying communication interface.
+userspace that interfaces with the kernel portion of the OVS. The interface
+that each DPIF provider has to implement is defined in dpif-provider.h [3].
+Though each platform is allowed to have its own implementation of the DPIF
+provider, it was found, via community feedback, that it is desired to
+share code whenever possible. Thus, the DPIF provider for OVS on Hyper-V shares
+code with the DPIF provider on Linux. This interface is implemented in
+dpif-netlink.c, formerly dpif-linux.c.
+
+We'll elaborate more on kernel-userspace interface in a dedicated section
+below. Here it suffices to say that the DPIF provider implementation for
+Windows is netlink-based and shares code with the Linux one.

 2.a) Kernel module (datapath)
 -----------------------------
@@ -178,8 +177,8 @@ This is consistent with using a single datapath in the kernel on Linux. All the
 physical adapters are connected as external adapters to the extensible switch.

 When the OVS switch extension registers itself as a filter driver, it also
-registers callbacks for the switch management and datapath functions. In other
-words, when a switch is created on the Hyper-V root partition (host), the
+registers callbacks for the switch/port management and datapath functions. In
+other words, when a switch is created on the Hyper-V root partition (host), the
 extension gets an activate callback upon which it can initialize the data
 structures necessary for OVS to function. Similarly, there are callbacks for
 when a port gets added to the Hyper-V switch, and an External Network adapter
@@ -190,7 +189,7 @@ packet is received on an external NIC.
 As shown in the figures, an extensible switch extension gets to see a packet
 sent by the VM (VIF) twice - once on the ingress path and once on the egress
 path. Forwarding decisions are to be made on the ingress path. Correspondingly,
-we’ll be hooking onto the following interfaces:
+we will be hooking onto the following interfaces:
 * Ingress send indication: intercept packets for performing flow based
 forwarding.This includes straight forwarding to output ports. Any packet
 modifications needed to be performed are done here either inline or by
@@ -203,11 +202,41 @@ we’ll be hooking onto the following interfaces:

 Interfacing with OVS userspace
 ------------------------------
-We’ve implemented a pseudo device interface for letting OVS userspace talk to
+We have implemented a pseudo device interface for letting OVS userspace talk to
 the OVS kernel module. This is equivalent to the typical character device
-interface on POSIX platforms. The pseudo device supports a whole bunch of
+interface on POSIX platforms where we can register custom functions for read,
+write and ioctl functionality. The pseudo device supports a whole bunch of
 ioctls that netdev and DPIF on OVS userspace make use of.

+Netlink message parser
+----------------------
+The communication between OVS userspace and OVS kernel datapath is in the form
+of Netlink messages [1]. More details about this are provided in #2.c section,
+kernel-userspace interface. In the kernel, a full fledged netlink message
+parser has been implemented along the lines of the netlink message parser in
+OVS userspace. In fact, a lot of the code is ported code.
+
+On the lines of 'struct ofpbuf' in OVS userspace, a managed buffer has been
+implemented in the kernel datapath to make it easier to parse and construct
+netlink messages.
+
+Netlink sockets
+---------------
+On Linux, OVS userspace utilizes netlink sockets to pass back and forth netlink
+messages. Since much of userspace code including DPIF provider in
+dpif-netlink.c (formerly dpif-linux.c) has been reused, pseudo-netlink sockets
+have been implemented in OVS userspace. As it is known, Windows lacks native
+netlink socket support, and also the socket family is not extensible either.
+Hence it is not possible to provide a native implementation of netlink socket.
+We emulate netlink sockets in lib/netlink-socket.c and support all of the nl_*
+APIs to higher levels. The implementation opens a handle to the pseudo device
+for each netlink socket. Some more details on this topic are provided in the
+userspace section on netlink sockets.
+
+Typical netlink semantics of read message, write message, dump, and transaction
+have been implemented so that higher level layers are not affected by the
+netlink implementation not being native.
+
 Switch/Datapath management
 --------------------------
 As explained above, we hook onto the management callback functions in the NDIS
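Note on the "Netlink message parser" text added in the hunk above: the patch does not show the parser itself, so the following is only a minimal sketch of what walking netlink attributes involves, assuming the standard netlink attribute layout (a 16-bit length and 16-bit type header, with each attribute padded to a 4-byte boundary). The names `sketch_nlattr`, `sketch_nla_align` and `sketch_nla_find` are hypothetical and are not taken from the OVS tree.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same field layout as a standard netlink attribute header, declared under
 * a hypothetical name so the sketch builds on any platform. */
struct sketch_nlattr {
    uint16_t len;   /* Header plus payload, excluding trailing padding. */
    uint16_t type;  /* Attribute type, meaningful to the message family. */
};

/* Attributes start on 4-byte boundaries. */
static size_t
sketch_nla_align(size_t n)
{
    return (n + 3) & ~(size_t) 3;
}

/* Returns a pointer to the payload of the first attribute of 'type' found in
 * buf[0..size) and stores its length in '*payload_len'. Returns NULL if the
 * attribute is absent or the buffer is malformed. */
static const void *
sketch_nla_find(const uint8_t *buf, size_t size, uint16_t type,
                size_t *payload_len)
{
    size_t off = 0;

    assert(buf != NULL && payload_len != NULL);
    while (size - off >= sizeof(struct sketch_nlattr)) {
        struct sketch_nlattr hdr;
        size_t step;

        memcpy(&hdr, buf + off, sizeof hdr);    /* No alignment assumed. */
        if (hdr.len < sizeof hdr || hdr.len > size - off) {
            return NULL;                        /* Malformed attribute. */
        }
        if (hdr.type == type) {
            *payload_len = hdr.len - sizeof hdr;
            return buf + off + sizeof hdr;
        }
        step = sketch_nla_align(hdr.len);
        if (step > size - off) {
            break;                              /* Padding runs past the end. */
        }
        off += step;
    }
    return NULL;
}
```

A caller would pass the attribute region of a received message and the attribute type it needs. The bounds checks matter because, in the design described above, the buffer arrives from the other side of the kernel-userspace boundary.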
@@ -220,8 +249,19 @@ Port management
 As explained above, we hook onto the management callback functions in the NDIS
 interface to know when a port is added/connected to the Hyper-V switch. We use
 these callbacks to initialize the port related data structures in OVS. Also,
-some of the ports are tunnel ports that don’t exist on the Hyper-V switch that
-are initiated from OVS userspace.
+some of the ports are tunnel ports that don’t exist on the Hyper-V switch and
+get added from OVS userspace.
+
+In order to identify a Hyper-V port, we use the value of 'FriendlyName' field
+in each Hyper-V port. We call this the "OVS-port-name". The idea is that OVS
+userspace sets 'OVS-port-name' in each Hyper-V port to the same value as the
+'name' field of the 'Interface' table in OVSDB. When OVS userspace calls into
+the kernel datapath to add a port, we match the name of the port with the
+'OVS-port-name' of a Hyper-V port.
+
+We maintain separate hash tables, and separate counters for ports that have
+been added from the Hyper-V switch, and for ports that have been added from OVS
+userspace.

 Flowtable/Actions/packet forwarding
 -----------------------------------
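Note on the port management text added in the hunk above: the name matching it describes can be pictured with the sketch below. It is an illustration only. The structure `sketch_hv_port` and the function `sketch_find_hv_port` are hypothetical, and a linear scan stands in for the hash tables the document mentions, purely to keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified view of a port known from the Hyper-V switch. */
struct sketch_hv_port {
    unsigned int port_id;        /* Identifier assigned by the switch. */
    const char *ovs_port_name;   /* 'FriendlyName' value, NULL if unset. */
};

/* Returns the Hyper-V port whose "OVS-port-name" equals 'name', or NULL if
 * no port carries that name. Ports without a name never match. */
static const struct sketch_hv_port *
sketch_find_hv_port(const struct sketch_hv_port *ports, size_t n_ports,
                    const char *name)
{
    size_t i;

    assert(name != NULL);
    for (i = 0; i < n_ports; i++) {
        if (ports[i].ovs_port_name != NULL
            && strcmp(ports[i].ovs_port_name, name) == 0) {
            return &ports[i];
        }
    }
    return NULL;
}
```

Under this reading, an add-port request from userspace succeeds for a VIF only if some Hyper-V port already has its "OVS-port-name" set to the requested name.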
@@ -267,48 +307,90 @@ used.

 2.b) Userspace components
 -------------------------
-A new daemon has been added to userspace to manage the entities in OVSDB, and
-also to keep it in sync with the kernel state, and this include bridges,
-physical NICs, VIFs etc. For example, upon bootup, ovs-wind does a get on the
-kernel to get a list of the bridges, and the corresponding ports and populates
-OVSDB. If a new VIF gets added to the kernel switch because a user powered on a
-Virtual Machine, ovs-wind detects it, and adds a corresponding entry in the
-ovsdb. This implies that ovs-wind has a synchronous as well as an asynchronous
-interface to the OVS kernel driver.
+The userspace portion of the OVS solution is mostly POSIX code, and not very
+Linux specific. Majority of the userspace code does not interface directly with
+the kernel datapath and was ported independently of the kernel datapath
+effort.

+In this section, we cover the userspace components that interface with the
+kernel datapath.

-2.c) Kernel-Userspace interface
--------------------------------
-DPIF-Windows
-------------
-DPIF-Windows is the Windows implementation of the interface defined in dpif-
-provider.h, and provides an interface into the OVS kernel driver. We implement
-most of the callbacks required by the DPIF provider. A quick summary of the
-functionality implemented is as follows:
-* dp_dump, dp_get: dump all datapath information or get information for a
-particular datapath. Currently we only support one datapath.
-* flow_dump, flow_put, flow_get, flow_flush: These functions retrieve all
-flows in the kernel, add a flow to the kernel, get a specific flow and
-delete all the flows in the kernel.
-* recv_set, recv, recv_wait, recv_purge: these poll packets for upcalls.
-* execute: This is used to send packets from userspace to the kernel. The
-packets could be either flow miss packet punted from kernel earlier or
-userspace generated packets.
-* vport_dump, vport_get, ext_info: These functions dump all ports in the
-kernel, get a specific port in the kernel, or get extended information
-about a port.
-* event_subscribe, wait, poll: These functions subscribe, wait and poll the
-events that kernel posts. A typical example is kernel notices a port has
-gone up/down, and would like to notify the userspace.
+As explained earlier, OVS on Hyper-V shares the DPIF provider implementation
+with Linux. The DPIF provider on Linux uses netlink sockets and netlink
+messages. Netlink sockets and messages are extensively used on Linux to
+exchange information between userspace and kernel. In order to satisfy these
+dependencies, netlink socket (pseudo and non-native) and netlink messages
+are implemented on Hyper-V.
+
+The following are the major advantages of sharing DPIF provider code:
+1. Maintenance is simpler:
+Any change made to the interface defined in dpif-provider.h need not be
+propagated to multiple implementations. Also, developers familiar with the
+Linux implementation of the DPIF provider can easily ramp on the Hyper-V
+implementation as well.
+2. Netlink messages provides inherent advantages:
+Netlink messages are known for their extensibility. Each message is
+versioned, so the provided data structures offer a mechanism to perform
+version checking and forward/backward compatibility with the kernel
+module.
+
+Netlink sockets
+---------------
+As explained in other sections, an emulation of netlink sockets has been
+implemented in lib/netlink-socket.c for Windows. The implementation creates a
+handle to the OVS pseudo device, and emulates netlink socket semantics of
+receive message, send message, dump, and transact. Most of the nl_* functions
+are supported.
+
+The fact that the implementation is non-native manifests in various ways.
+One example is that PID for the netlink socket is not automatically assigned in
+userspace when a handle is created to the OVS pseudo device. There's an extra
+command (defined in OvsDpInterfaceExt.h) that is used to grab the PID generated
+in the kernel.
+
+DPIF provider
+--------------
+As has been mentioned in earlier sections, the netlink socket and netlink
+message based DPIF provider on Linux has been ported to Windows.
+Correspondingly, the file is called lib/dpif-netlink.c now from its former
+name of lib/dpif-linux.c.
+
+Most of the code is common. Some divergence is in the code to receive
+packets. The Linux implementation uses epoll() which is not natively supported
+on Windows.

 Netdev-Windows
 --------------
-We have a Windows implementation of the the interface defined in lib/netdev-
-provider.h. The implementation provided functionality to get extended
-information about an interface. It is limited in functionality compared to the
-Linux implementation of the netdev provider and cannot be used to add any
-interfaces in the kernel such as a tap interface.
+We have a Windows implementation of the interface defined in
+lib/netdev-provider.h. The implementation provides functionality to get
+extended information about an interface. It is limited in functionality
+compared to the Linux implementation of the netdev provider and cannot be used
+to add any interfaces in the kernel such as a tap interface or to send/receive
+packets. The netdev-windows implementation uses the datapath interface
+extensions defined in:
+datapath-windows/include/OvsDpInterfaceExt.h

+Powershell extensions to set "OVS-port-name"
+--------------------------------------------
+As explained in the section on "Port management", each Hyper-V port has a
+'FriendlyName' field, which we call as the "OVS-port-name" field. We have
+implemented powershell command extensions to be able to set the "OVS-port-name"
+of a Hyper-V port.
+
+2.c) Kernel-Userspace interface
+-------------------------------
+openvswitch.h and OvsDpInterfaceExt.h
+-------------------------------------
+Since the DPIF provider is shared with Linux, the kernel datapath provides the
+same interface as the Linux datapath. The interface is defined in
+datapath/linux/compat/include/linux/openvswitch.h. Derivatives of this
+interface file are created during OVS userspace compilation. The derivative for
+the kernel datapath on Hyper-V is provided in the following location:
+datapath-windows/include/OvsDpInterface.h
+
+That said, there are Windows specific extensions that are defined in the
+interface file:
+datapath-windows/include/OvsDpInterfaceExt.h

 2.d) Flow of a packet
 ---------------------
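Note on the "Each message is versioned" advantage listed in the hunk above: the patch does not say how the version is checked, so this is only a sketch under stated assumptions. It assumes a generic-netlink style header (8-bit command, 8-bit version, 16-bit reserved field) and a hypothetical policy of accepting a closed range of versions. The names `sketch_genl_hdr`, `sketch_compat` and `sketch_check_version` are invented for the illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same field layout as a generic netlink message header, declared under a
 * hypothetical name so the sketch builds on any platform. */
struct sketch_genl_hdr {
    uint8_t cmd;        /* Family-specific command. */
    uint8_t version;    /* Interface version the sender speaks. */
    uint16_t reserved;
};

enum sketch_compat {
    SKETCH_COMPAT_OK,
    SKETCH_COMPAT_TOO_OLD,
    SKETCH_COMPAT_TOO_NEW
};

/* Hypothetical policy: the receiver accepts any version in the closed range
 * [min_supported, max_supported] and reports which side is out of date
 * otherwise. */
static enum sketch_compat
sketch_check_version(const struct sketch_genl_hdr *hdr,
                     uint8_t min_supported, uint8_t max_supported)
{
    assert(hdr != NULL);
    if (hdr->version < min_supported) {
        return SKETCH_COMPAT_TOO_OLD;
    }
    if (hdr->version > max_supported) {
        return SKETCH_COMPAT_TOO_NEW;
    }
    return SKETCH_COMPAT_OK;
}
```

Separating "too old" from "too new" lets the receiver report whether userspace or the kernel module is the side that needs updating.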
@@ -354,9 +436,9 @@ driver.

 Reference list:
 ===============
-1: Hyper-V Extensible Switch
+1. Hyper-V Extensible Switch
 http://msdn.microsoft.com/en-us/library/windows/hardware/hh598161(v=vs.85).aspx
-2: Hyper-V Extensible Switch Extensions
+2. Hyper-V Extensible Switch Extensions
 http://msdn.microsoft.com/en-us/library/windows/hardware/hh598169(v=vs.85).aspx
 3. DPIF Provider
 http://openvswitch.sourcearchive.com/documentation/1.1.0-1/dpif-
@@ -369,3 +451,7 @@ http://msdn.microsoft.com/en-us/library/windows/desktop/aa366510(v=vs.85).aspx
 http://msdn.microsoft.com/en-us/library/windows/hardware/ff557015(v=vs.85).aspx
 7. How to Port Open vSwitch to New Software or Hardware
 http://git.openvswitch.org/cgi-bin/gitweb.cgi?p=openvswitch;a=blob;f=PORTING
+8. Netlink
+http://en.wikipedia.org/wiki/Netlink
+9. epoll
+http://en.wikipedia.org/wiki/Epoll