
datapath-windows: Update DESIGN document.

In this patch, we update the design document to reflect the netlink-based
kernel-userspace interface implementation and a few other changes, covered
at a high level.

Please feel free to extend the document with more details that you think
were missed.

Signed-off-by: Nithin Raju <nithin@vmware.com>
Acked-by: Sorin Vinturis <svinturis@cloudbasesolutions.com>
Signed-off-by: Ben Pfaff <blp@nicira.com>
Author: Nithin Raju
Date: 2014-11-25 09:06:43 -08:00
Committed by: Ben Pfaff
parent 5dd9826c9d
commit 36013bb157

@@ -1,20 +1,13 @@
OVS-on-Hyper-V Design Document
==============================
There has been a community effort to develop Open vSwitch on Microsoft Hyper-V.
In this document, we provide details of the development effort. We believe this
document should give enough information to understand the overall design.

The userspace portion of the OVS has been ported to Hyper-V in a separate
effort, and committed to the openvswitch repo. So, this document will mostly
emphasize the kernel driver, though we touch upon some of the aspects of
userspace as well.

We cover the following topics:
1. Background into relevant Hyper-V architecture
@@ -48,13 +41,13 @@ In Hyper-V, the virtual machine is called the Child Partition. Each VIF or
physical NIC on the Hyper-V extensible switch is attached via a port. Each port
is on both the ingress path and the egress path of the switch. The ingress path
is used for packets being sent out of a port, and egress is used for packets
being received on a port. By design, NDIS provides a layered interface. In this
layered interface, higher level layers call into lower level layers in the
ingress path. In the egress path, it is the other way round. In addition, there
is an object identifier (OID) interface for control operations, e.g. addition
of a port. The workflow for the calls is similar in nature to the packets,
where higher level layers call into the lower level layers. A good
representational diagram of this architecture is in [4].
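To make the layering concrete, below is a minimal sketch of how an NDIS 6.x
filter/forwarding extension registers its ingress (send), egress (receive) and
OID handlers. This is an illustration rather than the OVS driver's actual
code: the Sx* handler names are placeholders, and mandatory items such as the
attach/detach handlers and driver name fields are omitted for brevity; the
structures and the registration call are from the documented NDIS filter
driver API.

  #include <ndis.h>

  /* Placeholder handler prototypes; a real driver implements these. */
  FILTER_SEND_NET_BUFFER_LISTS SxSendNetBufferLists;       /* ingress */
  FILTER_RECEIVE_NET_BUFFER_LISTS SxReceiveNetBufferLists; /* egress  */
  FILTER_OID_REQUEST SxOidRequest;                         /* control */

  NDIS_HANDLE filterDriverHandle;

  NTSTATUS
  DriverEntry(PDRIVER_OBJECT driverObject, PUNICODE_STRING registryPath)
  {
      NDIS_FILTER_DRIVER_CHARACTERISTICS fChars;

      UNREFERENCED_PARAMETER(registryPath);
      NdisZeroMemory(&fChars, sizeof fChars);
      fChars.Header.Type = NDIS_OBJECT_TYPE_FILTER_DRIVER_CHARACTERISTICS;
      fChars.Header.Revision = NDIS_FILTER_CHARACTERISTICS_REVISION_2;
      fChars.Header.Size = sizeof fChars;
      fChars.MajorNdisVersion = 6;
      fChars.MinorNdisVersion = 30;

      /* Ingress path: packets sent into the switch by a port. */
      fChars.SendNetBufferListsHandler = SxSendNetBufferLists;
      /* Egress path: packets being delivered to a port. */
      fChars.ReceiveNetBufferListsHandler = SxReceiveNetBufferLists;
      /* Control path: OID requests, e.g. port addition. */
      fChars.OidRequestHandler = SxOidRequest;

      /* Attach/detach, pause/restart handlers and the driver/service
       * names are also required; omitted here for brevity. */
      return NdisFRegisterFilterDriver(driverObject,
                                       (NDIS_HANDLE) driverObject, &fChars,
                                       &filterDriverHandle);
  }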
Windows Filtering Platform (WFP)[5] is a platform implemented on Hyper-V that
provides APIs and services for filtering packets. WFP has been utilized to
@@ -75,22 +68,23 @@ has been used to retrieve some of the configuration information that OVS needs.
[Figure 2, an ASCII block diagram in the original file: ovs-*ctl and the OVS
userspace daemon sit on top of dpif-netlink and netdev-windows in userspace;
below the user/kernel boundary, the OVS pseudo device and the netlink
implementation front the flowtable and packet processing logic inside the
Hyper-V extensible switch, which sits on the ingress/egress path between the
VIFs of Virtual Machines #1 and #2 and the physical NIC.]
@@ -110,9 +104,8 @@ Kernel || \/ || ||=====/
Figure 2 shows the various blocks involved in the OVS Windows implementation,
along with some of the components available in the NDIS stack, and also the
virtual machines. The workflow of a packet being transmitted from a VIF out and
into another VIF and to a physical NIC is also shown. Later on in this section,
we will discuss the flow of a packet at a high level.

The figure gives a general idea of where the OVS userspace and the kernel
components fit in, and how they interface with each other.
@@ -122,9 +115,11 @@ a forwarding extension roughly implementing the following
sub-modules/functionality. Details of each of these sub-components in the
kernel are contained in later sections:
* Interfacing with the NDIS stack
* Netlink message parser
* Netlink sockets
* Switch/Datapath management
* Interfacing with userspace portion of the OVS solution to implement the
  necessary functionality that userspace needs
* Port management
* Flowtable/Actions/packet forwarding
* Tunneling
@@ -140,32 +135,36 @@ are:
* Interface between the userspace and the kernel module.
* Event notifications are significantly different.
* The communication interface between DPIF and the kernel module need not be
  implemented in the way OVS on Linux does. That said, it would be
  advantageous to have a similar interface to the kernel module for reasons of
  readability and maintainability.
* Any licensing issues of using Linux kernel code directly.

Due to these differences, it was a straightforward decision to develop the
datapath for OVS on Hyper-V from scratch rather than porting the one on Linux.
A re-development focused on the following goals:
* Adhere to the existing requirements of the userspace portion of OVS (such as
  ovs-vswitchd), to minimize changes in the userspace workflow.
* Fit well into the typical workflow of a Hyper-V extensible switch forwarding
  extension.

The userspace portion of the OVS solution is mostly POSIX code, and not very
Linux specific. The majority of the userspace code does not interface directly
with the kernel datapath and was ported independently of the kernel datapath
effort.

As explained in the OVS porting design document [7], DPIF is the portion of
userspace that interfaces with the kernel portion of the OVS. The interface
that each DPIF provider has to implement is defined in dpif-provider.h [3].
Though each platform is allowed to have its own implementation of the DPIF
provider, community feedback made it clear that sharing code is desirable
wherever possible. Thus, the DPIF provider for OVS on Hyper-V shares code with
the DPIF provider on Linux. This interface is implemented in dpif-netlink.c,
formerly dpif-linux.c.

We'll elaborate more on the kernel-userspace interface in a dedicated section
below. Here it suffices to say that the DPIF provider implementation for
Windows is netlink-based and shares code with the Linux one.
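To illustrate what sharing the DPIF provider buys, here is an abridged sketch
of the provider idea behind dpif-provider.h. The member list and signatures
are paraphrased for brevity rather than copied from the header; the point is
that a datapath implementation is a table of function pointers, so the same
callers can drive both the Linux and the Windows netlink-based backends.

  #include <stdbool.h>
  #include <stdint.h>

  struct dpif;                  /* opaque datapath handle */
  struct dpif_flow_put;         /* flow add/modify request */
  struct netdev;                /* network device abstraction */
  typedef uint32_t odp_port_t;  /* datapath port number */

  /* Paraphrased, heavily abridged view of the provider interface. */
  struct dpif_class {
      const char *type;         /* e.g. "system" */
      int (*open)(const char *name, bool create, struct dpif **dpifp);
      int (*port_add)(struct dpif *, struct netdev *, odp_port_t *port_no);
      int (*flow_put)(struct dpif *, const struct dpif_flow_put *);
      /* ... flow_dump, execute, recv, and many more operations ... */
  };

  /* Linux and Windows now both register the netlink-based provider
   * implemented in lib/dpif-netlink.c. */
  extern const struct dpif_class dpif_netlink_class;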
2.a) Kernel module (datapath)
-----------------------------
@@ -178,8 +177,8 @@ This is consistent with using a single datapath in the kernel on Linux. All the
physical adapters are connected as external adapters to the extensible switch.

When the OVS switch extension registers itself as a filter driver, it also
registers callbacks for the switch/port management and datapath functions. In
other words, when a switch is created on the Hyper-V root partition (host), the
extension gets an activate callback upon which it can initialize the data
structures necessary for OVS to function. Similarly, there are callbacks for
when a port gets added to the Hyper-V switch, and an External Network adapter
@@ -190,7 +189,7 @@ packet is received on an external NIC.
As shown in the figures, an extensible switch extension gets to see a packet
sent by the VM (VIF) twice - once on the ingress path and once on the egress
path. Forwarding decisions are to be made on the ingress path. Correspondingly,
we will be hooking onto the following interfaces:
* Ingress send indication: intercept packets for performing flow based
  forwarding. This includes straight forwarding to output ports. Any packet
  modifications needed to be performed are done here either inline or by
@@ -203,11 +202,41 @@ well be hooking onto the following interfaces:
Interfacing with OVS userspace
------------------------------
We have implemented a pseudo device interface for letting OVS userspace talk to
the OVS kernel module. This is equivalent to the typical character device
interface on POSIX platforms, where we can register custom functions for read,
write and ioctl functionality. The pseudo device supports a number of ioctls
that netdev and DPIF on OVS userspace make use of.
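As a concrete picture of this interface, the sketch below opens the pseudo
device from userspace and issues a single ioctl, the Windows analog of
open()/ioctl() on a POSIX character device. The device path and the ioctl code
are assumptions for illustration only, not values from the OVS tree;
CreateFileA and DeviceIoControl are the standard Win32 calls.

  #include <windows.h>
  #include <stdio.h>

  /* Both of these are placeholders; the real path and ioctl codes are
   * defined by the driver's interface headers. */
  #define OVS_DEVICE_PATH "\\\\.\\OpenVSwitchDevice"
  #define OVS_IOCTL_EXAMPLE \
      CTL_CODE(FILE_DEVICE_NETWORK, 0x800, METHOD_BUFFERED, FILE_ANY_ACCESS)

  int
  main(void)
  {
      HANDLE dev = CreateFileA(OVS_DEVICE_PATH, GENERIC_READ | GENERIC_WRITE,
                               0, NULL, OPEN_EXISTING, 0, NULL);
      if (dev == INVALID_HANDLE_VALUE) {
          fprintf(stderr, "open failed: %lu\n", GetLastError());
          return 1;
      }

      char request[128] = { 0 };  /* netlink-formatted payload in practice */
      char reply[4096];
      DWORD bytes = 0;
      if (!DeviceIoControl(dev, OVS_IOCTL_EXAMPLE, request, sizeof request,
                           reply, sizeof reply, &bytes, NULL)) {
          fprintf(stderr, "ioctl failed: %lu\n", GetLastError());
      }
      CloseHandle(dev);
      return 0;
  }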
Netlink message parser
----------------------
The communication between OVS userspace and OVS kernel datapath is in the form
of Netlink messages [8]. More details about this are provided in section 2.c,
Kernel-Userspace interface. In the kernel, a full-fledged netlink message
parser has been implemented along the lines of the netlink message parser in
OVS userspace. In fact, a lot of the code is ported code.

Along the lines of 'struct ofpbuf' in OVS userspace, a managed buffer has been
implemented in the kernel datapath to make it easier to parse and construct
netlink messages.
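Below is a minimal sketch of such a managed buffer, with illustrative names
rather than the datapath's actual API: the buffer grows on demand, and
attributes are appended as 4-byte-aligned TLVs in the standard netlink layout,
which is what makes message construction convenient.

  #include <stdint.h>
  #include <stdlib.h>
  #include <string.h>

  #define NLA_ALIGN(len) (((len) + 3) & ~3)  /* netlink 4-byte alignment */

  struct nl_attr_hdr {     /* mirrors 'struct nlattr' */
      uint16_t len;        /* header + payload, before padding */
      uint16_t type;
  };

  struct nl_buf {          /* ofpbuf-style managed buffer */
      uint8_t *data;
      uint32_t used;       /* bytes written so far */
      uint32_t allocated;  /* current capacity */
  };

  /* Reserves 'size' bytes at the tail, growing the buffer if needed.
   * (Allocation failure handling is elided for brevity.) */
  static void *
  nl_buf_tail(struct nl_buf *b, uint32_t size)
  {
      if (b->used + size > b->allocated) {
          b->allocated = (b->used + size) * 2;
          b->data = realloc(b->data, b->allocated);
      }
      void *tail = b->data + b->used;
      b->used += size;
      return tail;
  }

  /* Appends one attribute as a TLV, zeroing the alignment padding. */
  static void
  nl_buf_put_attr(struct nl_buf *b, uint16_t type,
                  const void *payload, uint16_t payload_len)
  {
      uint16_t len = sizeof(struct nl_attr_hdr) + payload_len;
      struct nl_attr_hdr *hdr = nl_buf_tail(b, NLA_ALIGN(len));

      hdr->len = len;
      hdr->type = type;
      memset(hdr + 1, 0, NLA_ALIGN(len) - sizeof *hdr);
      memcpy(hdr + 1, payload, payload_len);
  }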
Netlink sockets
---------------
On Linux, OVS userspace utilizes netlink sockets to pass netlink messages back
and forth. Since much of the userspace code, including the DPIF provider in
dpif-netlink.c (formerly dpif-linux.c), has been reused, pseudo-netlink sockets
have been implemented in OVS userspace. Windows lacks native netlink socket
support, and the socket family is not extensible either, so it is not possible
to provide a native implementation of netlink sockets. We emulate netlink
sockets in lib/netlink-socket.c and support all of the nl_* APIs to higher
levels. The implementation opens a handle to the pseudo device for each
netlink socket. Some more details on this topic are provided in the userspace
section on netlink sockets.

Typical netlink semantics of read message, write message, dump, and transaction
have been implemented so that higher level layers are not affected by the
netlink implementation not being native.
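The following sketch shows how the transaction and dump semantics can be
layered on a device handle. The structure layouts and function names are
illustrative; the real code lives in lib/netlink-socket.c and uses the
driver's actual commands. The essential behavior is the dump loop, which keeps
reading replies until the kernel terminates the sequence with an NLMSG_DONE
message.

  #include <windows.h>
  #include <stdbool.h>
  #include <stdint.h>

  #define NLMSG_DONE 3            /* standard netlink end-of-dump type */

  struct nl_msg_hdr {             /* mirrors 'struct nlmsghdr' */
      uint32_t len;
      uint16_t type;
      uint16_t flags;
      uint32_t seq;
      uint32_t pid;
  };

  struct nl_sock {                /* one emulated netlink socket */
      HANDLE dev;                 /* handle to the OVS pseudo device */
      uint32_t pid;               /* kernel-assigned netlink PID */
  };

  /* Transaction: one request, one reply. */
  static bool
  nl_emu_transact(struct nl_sock *s, const void *req, DWORD req_len,
                  void *reply, DWORD reply_cap, DWORD *reply_len)
  {
      DWORD written;
      return WriteFile(s->dev, req, req_len, &written, NULL)
             && ReadFile(s->dev, reply, reply_cap, reply_len, NULL);
  }

  /* Dump: one request, then read replies until NLMSG_DONE. */
  static bool
  nl_emu_dump(struct nl_sock *s, const void *req, DWORD req_len)
  {
      char buf[8192];
      DWORD n;

      if (!WriteFile(s->dev, req, req_len, &n, NULL)) {
          return false;
      }
      for (;;) {
          if (!ReadFile(s->dev, buf, sizeof buf, &n, NULL)
              || n < sizeof(struct nl_msg_hdr)) {
              return false;
          }
          if (((struct nl_msg_hdr *) buf)->type == NLMSG_DONE) {
              return true;
          }
          /* ... hand each reply to the caller's netlink parser ... */
      }
  }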
Switch/Datapath management
--------------------------
As explained above, we hook onto the management callback functions in the NDIS
@@ -220,8 +249,19 @@ Port management
As explained above, we hook onto the management callback functions in the NDIS
interface to know when a port is added/connected to the Hyper-V switch. We use
these callbacks to initialize the port related data structures in OVS. Also,
some of the ports are tunnel ports that don't exist on the Hyper-V switch and
get added from OVS userspace.

In order to identify a Hyper-V port, we use the value of the 'FriendlyName'
field in each Hyper-V port. We call this the "OVS-port-name". The idea is that
OVS userspace sets 'OVS-port-name' in each Hyper-V port to the same value as
the 'name' field of the 'Interface' table in OVSDB. When OVS userspace calls
into the kernel datapath to add a port, we match the name of the port with the
'OVS-port-name' of a Hyper-V port.

We maintain separate hash tables, and separate counters, for ports that have
been added from the Hyper-V switch, and for ports that have been added from
OVS userspace.
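A sketch of the name-based match is below, with illustrative types and names
rather than the driver's actual structures: when userspace adds a port, the
kernel walks the table of ports learned from the Hyper-V switch callbacks
looking for a matching OVS-port-name.

  #include <string.h>

  #define IF_MAX_STRING_SIZE 64

  struct hv_port {                /* one port learned from NDIS callbacks */
      char ovsPortName[IF_MAX_STRING_SIZE];  /* FriendlyName, UTF-8 here */
      struct hv_port *next;
      /* ... switch port id, NIC index, etc. ... */
  };

  /* Returns the Hyper-V port whose OVS-port-name matches the name supplied
   * by userspace in the vport add request, or NULL if there is none. */
  static struct hv_port *
  find_hv_port_by_name(struct hv_port *head, const char *name)
  {
      for (struct hv_port *p = head; p; p = p->next) {
          if (!strcmp(p->ovsPortName, name)) {
              return p;
          }
      }
      return NULL;
  }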
Flowtable/Actions/packet forwarding
-----------------------------------
@@ -267,48 +307,90 @@ used.
2.b) Userspace components
-------------------------
The userspace portion of the OVS solution is mostly POSIX code, and not very
Linux specific. The majority of the userspace code does not interface directly
with the kernel datapath and was ported independently of the kernel datapath
effort.

In this section, we cover the userspace components that interface with the
kernel datapath.

As explained earlier, OVS on Hyper-V shares the DPIF provider implementation
with Linux. The DPIF provider on Linux uses netlink sockets and netlink
messages. Netlink sockets and messages are extensively used on Linux to
exchange information between userspace and kernel. In order to satisfy these
dependencies, netlink sockets (pseudo and non-native) and netlink messages
have been implemented on Hyper-V.

The following are the major advantages of sharing DPIF provider code:
1. Maintenance is simpler:
   Any change made to the interface defined in dpif-provider.h need not be
   propagated to multiple implementations. Also, developers familiar with the
   Linux implementation of the DPIF provider can easily ramp up on the Hyper-V
   implementation as well.
2. Netlink messages provide inherent advantages:
   Netlink messages are known for their extensibility. Each message is
   versioned, so the provided data structures offer a mechanism to perform
   version checking and forward/backward compatibility with the kernel
   module.

Netlink sockets
---------------
As explained in other sections, an emulation of netlink sockets has been
implemented in lib/netlink-socket.c for Windows. The implementation creates a
handle to the OVS pseudo device, and emulates netlink socket semantics of
receive message, send message, dump, and transact. Most of the nl_* functions
are supported.
The fact that the implementation is non-native manifests in various ways.
One example is that the PID for the netlink socket is not automatically
assigned in userspace when a handle is created to the OVS pseudo device.
There's an extra command (defined in OvsDpInterfaceExt.h) that is used to grab
the PID generated in the kernel.
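Below is a sketch of that extra step. The ioctl code and reply layout are
placeholders rather than the actual command from OvsDpInterfaceExt.h; what it
shows is the emulation asking the kernel, right after opening the device
handle, which netlink PID was assigned to it.

  #include <windows.h>
  #include <stdbool.h>
  #include <stdint.h>

  /* Placeholder; the real command is defined in OvsDpInterfaceExt.h. */
  #define OVS_IOCTL_GET_PID \
      CTL_CODE(FILE_DEVICE_NETWORK, 0x801, METHOD_BUFFERED, FILE_ANY_ACCESS)

  /* Asks the kernel for the netlink PID it generated for this handle,
   * since there is no native autobind as on Linux. */
  static bool
  nl_sock_fetch_pid(HANDLE dev, uint32_t *pid)
  {
      DWORD bytes = 0;

      return DeviceIoControl(dev, OVS_IOCTL_GET_PID, NULL, 0,
                             pid, sizeof *pid, &bytes, NULL)
             && bytes == sizeof *pid;
  }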
DPIF provider
-------------
As has been mentioned in earlier sections, the netlink socket and netlink
message based DPIF provider on Linux has been ported to Windows.
Correspondingly, the file is called lib/dpif-netlink.c now from its former
name of lib/dpif-linux.c.
Most of the code is common. Some divergence is in the code to receive
packets. The Linux implementation uses epoll() which is not natively supported
on Windows.
Netdev-Windows
--------------
We have a Windows implementation of the interface defined in
lib/netdev-provider.h. The implementation provides functionality to get
extended information about an interface. It is limited in functionality
compared to the Linux implementation of the netdev provider and cannot be used
to add any interfaces in the kernel such as a tap interface or to send/receive
packets. The netdev-windows implementation uses the datapath interface
extensions defined in:
datapath-windows/include/OvsDpInterfaceExt.h
Powershell extensions to set "OVS-port-name"
--------------------------------------------
As explained in the section on "Port management", each Hyper-V port has a
'FriendlyName' field, which we call the "OVS-port-name" field. We have
implemented powershell command extensions to be able to set the "OVS-port-name"
of a Hyper-V port.
2.c) Kernel-Userspace interface
-------------------------------
openvswitch.h and OvsDpInterfaceExt.h
-------------------------------------
Since the DPIF provider is shared with Linux, the kernel datapath provides the
same interface as the Linux datapath. The interface is defined in
datapath/linux/compat/include/linux/openvswitch.h. Derivatives of this
interface file are created during OVS userspace compilation. The derivative for
the kernel datapath on Hyper-V is provided in the following location:
datapath-windows/include/OvsDpInterface.h
That said, there are Windows-specific extensions that are defined in the
interface file:
datapath-windows/include/OvsDpInterfaceExt.h
2.d) Flow of a packet
---------------------
@@ -354,9 +436,9 @@ driver.
Reference list:
===============
1. Hyper-V Extensible Switch
http://msdn.microsoft.com/en-us/library/windows/hardware/hh598161(v=vs.85).aspx
2. Hyper-V Extensible Switch Extensions
http://msdn.microsoft.com/en-us/library/windows/hardware/hh598169(v=vs.85).aspx
3. DPIF Provider
http://openvswitch.sourcearchive.com/documentation/1.1.0-1/dpif-
@@ -369,3 +451,7 @@ http://msdn.microsoft.com/en-us/library/windows/desktop/aa366510(v=vs.85).aspx
http://msdn.microsoft.com/en-us/library/windows/hardware/ff557015(v=vs.85).aspx
7. How to Port Open vSwitch to New Software or Hardware
http://git.openvswitch.org/cgi-bin/gitweb.cgi?p=openvswitch;a=blob;f=PORTING
8. Netlink
http://en.wikipedia.org/wiki/Netlink
9. epoll
http://en.wikipedia.org/wiki/Epoll