Expose the kernel connection tracker via OVS. Userspace components can
make use of the CT action to populate the connection state (ct_state)
field for a flow. This state can be subsequently matched.
Exposed connection states are OVS_CS_F_*:
- NEW (0x01) - Beginning of a new connection.
- ESTABLISHED (0x02) - Part of an existing connection.
- RELATED (0x04) - Related to an established connection.
- INVALID (0x20) - Could not track the connection for this packet.
- REPLY_DIR (0x40) - This packet is in the reply direction for the flow.
- TRACKED (0x80) - This packet has been sent through conntrack.
When the CT action is executed by itself, it will send the packet
through the connection tracker and populate the ct_state field with one
or more of the connection state flags above. The CT action will always
set the TRACKED bit.
When the COMMIT flag is passed to the conntrack action, this specifies
that information about the connection should be stored. This allows
subsequent packets for the same (or related) connections to be
correlated with this connection. Sending subsequent packets for the
connection through conntrack allows the connection tracker to consider
the packets as ESTABLISHED, RELATED, and/or REPLY_DIR.
The CT action may optionally take a zone to track the flow within. This
allows connections with the same 5-tuple to be kept logically separate
from connections in other zones. If the zone is specified, then the
"ct_zone" match field will be subsequently populated with the zone id.
IP fragments are handled by transparently assembling them as part of the
CT action. The maximum received unit (MRU) size is tracked so that
refragmentation can occur during output.
IP frag handling contributed by Andy Zhou.
Based on original design by Justin Pettit.
Upstream: 7f8a436 "openvswitch: Add conntrack action"
Signed-off-by: Joe Stringer <joestringer@nicira.com>
Signed-off-by: Justin Pettit <jpettit@nicira.com>
Signed-off-by: Andy Zhou <azhou@nicira.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
IPv6 fragmentation functionality is not exported by most kernels, so
backport this code from the upstream 4.3 development tree.
Signed-off-by: Joe Stringer <joestringer@nicira.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Backport IPv4 reassembly from the upstream commit caaecdd3d3f8 ("inet:
frags: remove INET_FRAG_EVICTED and use list_evictor for the test").
This is necessary because kernels prior to upstream commit d6b915e29f4a
("ip_fragment: don't forward defragmented DF packet") would not always
track the maximum received unit size during ip_defrag(). Without the
MRU, refragmentation cannot occur so reassembled packets are dropped.
Signed-off-by: Joe Stringer <joestringer@nicira.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Most kernels provide some form of ip fragmentation. However, until
recently many of them would always send ICMP responses for over_MTU
packets, even when operating in bridge mode. Backport the check to
ensure this doesn't occur.
Signed-off-by: Joe Stringer <joestringer@nicira.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Loosely based upon Linux commit 0838aa7fcfcd "netfilter: fix netns
dependencies with conntrack templates" and commit 5e8018fc6142
"netfilter: nf_conntrack: add efficient mark to zone mapping".
Signed-off-by: Joe Stringer <joestringer@nicira.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Following patch adds support for lwtunnel to OVS datapath.
With this change OVS datapath detect lwtunnel support and
make use of new APIs if available. On older kernel where the
support is not there the backported tunnel modules are used.
These backported tunnel devices acts as lwtunnel devices.
I tried to keep backported module same as upstream for easier
bug-fix backport. Since STT and LISP are not upstream OVS
always needs to use respective modules from tunnel compat layer.
To make it work on kernel 4.3 I have converted STT and LISP
modules to lwtunnel API model.
lwtunnel make use of skb-dst to pass tunnel information to the
tunnel module. On older kernel this is not possible. So the in
case of old kernel metadata ref is stored in OVS_CB and direct
call to tunnel transmit function is made by respective tunnel
vport modules. Similarly on receive side tunnel recv directly
call netdev-vport-receive to pass the skb to OVS.
Major backported components include:
Geneve, GRE, VXLAN, ip_tunnel, udp-tunnels GRO.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Joe Stringer <joe@ovn.org>
Acked-by: Jesse Gross <jesse@kernel.org>
Fixes following compilation error:
CC [M] /home/travis/build/openvswitch/ovs/datapath/linux/actions.o
In file included from
/home/travis/build/openvswitch/ovs/datapath/linux/actions.c:21:0:
/home/travis/build/openvswitch/ovs/datapath/linux/compat/include/linux/skbuff.h:
In function ‘rpl_skb_postpull_rcsum’:
/home/travis/build/openvswitch/ovs/datapath/linux/compat/include/linux/skbuff.h:384:4:
error: implicit declaration of function ‘skb_checksum_start_offset’
[-Werror=implicit-function-declaration]
cc1: some warnings being treated as errors
Reported-by: Joe Stringer <joestringer@nicira.com>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Joe Stringer <joestringer@nicira.com>
Fixes following compilation error:
In file included from ovs/datapath/linux/actions.c:30: ovs/datapath/linux/compat/include/linux/if_vlan.h:65:
error: redefinition of ‘__vlan_hwaccel_push_inside’ include/linux/if_vlan.h:353: note: previous definition of
‘__vlan_hwaccel_push_inside’ was here ovs/datapath/linux/compat/include/linux/if_vlan.h:83:
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
When linking with DPDK, if a relative path is used with the
'--with-dpdk' flag, then OVS will always be compiled with vHost Cuse
support, even if it is not enabled in the DPDK build.
This patch fixes this problem, and enables the correct version of
vHost despite whether or not a relative or absolute path is used.
Signed-off-by: Ciara Loftus <ciara.loftus@intel.com>
Signed-off-by: Ben Pfaff <blp@nicira.com>
upstream: ("netlink: implement nla_put_in_addr and nla_put_in6_addr")
upstream: ("netlink: implement nla_get_in_addr and nla_get_in6_addr")
IP addresses are often stored in netlink attributes. Add generic functions
to do that.
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Update relevant artifacts to add support for DPDK v2.1.0
- INSTALL.DPDK.md
- acinclude.m4: Change DPDK library name
- netdev-dpdk: Limit minimum mbuf size to to adapt to DPDK bug fix that
changes the treatment of the requested mbuf size
- build.sh: Change DPDK version number
Note that this breaks compatibility with DPDK v2.0.0 although only
for the library name change.
Note that throughput for vhost ports with mergeable buffers is reduced
about 10% due to a necessary bug fix in DPDK vhost code.
Signed-off-by: Mark Kavanagh <mark.b.kavanagh@intel.com>
Signed-off-by: Michal Weglicki <michalx.weglicki@intel.com>
Signed-off-by: Timo Puha <timox.puha@intel.com>
Acked-by: Daniele Di Proietto <diproiettod@vmware.com>
Red Hat Enterprise Linux 6 has backported the netdev RX
handler facility so use the netdev_rx_handler_register as
an indicator.
The handler prototype changed between 2.6.36 and 2.6.39
since there could be backports in any stage, don't look
at the kernel version, but at the prototype.
Signed-off-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
The OVS hook has been backported so it doesn't work to
decide per_cpu work arounds.
Signed-off-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
Red Hat Enterprise Linux 6 has backported it from upstream,
so check for ip_is_fragment instead of kernel version.
Signed-off-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
Red Hat Enterprise Linux 6 has backported it from upstream,
so check for proto_ports_offset instead of kernel version.
Signed-off-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
Red Hat Enterprise Linux 6 has a comment saying
that it doesn't support l4_rxhash which matches
the current grep regex.
Signed-off-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
Dpdk allows users to create a config that includes other config files and
then override values.
Eg.
defconfig_x86_64-native_vhost_cuse-linuxapp-gcc:
CONFIG_RTE_BUILD_COMBINE_LIBS=y
CONFIG_RTE_BUILD_SHARED_LIB=n
CONFIG_RTE_LIBRTE_VHOST=y
CONFIG_RTE_LIBRTE_VHOST_USER=n
This allows you to have both a vhostuser and vhostcuse config in the same
source tree without the need to replicate everything in those config files
just to change a couple of settings. The resultant .config file has all of
the settings from the included files with the updated settings at the end.
The resultant rte_config.h contains multiple undefs and defines for the
overridden settings.
Eg.
> grep RTE_LIBRTE_VHOST_USER x86_64-native_vhost_cuse-linuxapp-gcc/include/rte_config.h
#undef RTE_LIBRTE_VHOST_USER
#define RTE_LIBRTE_VHOST_USER 1
#undef RTE_LIBRTE_VHOST_USER
The current mechanism to detect the RTE_LIBRTE_VHOST_USER setting merely
greps the rte_config.h file for the string "define RTE_LIBRTE_VHOST_USER 1"
rather than the final setting of RTE_LIBRTE_VHOST_USER. The following patch
changes this test to detect the final setting of RTE_LIBRTE_VHOST_USER.
Signed-off-by: Gary Mussar <gmussar@ciena.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com
Fields found using OVS_FIND_FIELD_IFELSE would previously be printed in
the console during configure. Clean up the output.
Signed-off-by: Joe Stringer <joestringer@nicira.com>
Acked-by: Ben Pfaff <blp@nicira.com>
DPDK with vhost-user doesn't require libfuse, so we shouldn't link OVS
with libfuse unless DPDK is built with vhost-cuse support.
CC: Rapelly, Varun <vrapelly@sonusnet.com>
Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
This patch adds support for a new port type to the userspace
datapath called dpdkvhostuser.
A new dpdkvhostuser port will create a unix domain socket which
when provided to QEMU is used to facilitate communication between
the virtio-net device on the VM and the OVS port on the host.
vhost-cuse ('dpdkvhost') ports are still available as 'dpdkvhostcuse'
ports and will be enabled if vhost-cuse support is detected in the
DPDK build specified during compilation of the switch. Otherwise,
vhost-user ports are enabled.
Signed-off-by: Ciara Loftus <ciara.loftus@intel.com>
Acked-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Update relevant artifacts to add support for DPDK v2.0.0
- INSTALL.DPDK.md
- travis build script
- acinclude.m4: add 'mssse3' flag to OVS_CFLAGS
- netdev-dpdk: fix build with unified offload types in DPDK v2.0.0
Note that this breaks compatibility with DPDK v1.8.0
Signed-off-by: Mark Kavanagh <mark.b.kavanagh@intel.com>
Signed-off-by: Panu Matilainen <pmatilai@redhat.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
No longer need this compat file, we can use the upstream version
of the function.
Signed-off-by: Alex Wang <alexw@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
On NetBSD, clang (clang-3.5.0 from pkgsrc) complains
when "clang -g" is used for linking. Specify -Qunused-arguments
to suppress the warning.
Signed-off-by: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
Acked-by: Ben Pfaff <blp@nicira.com>
Lately our internal build system has been seeing intermittent failures that
I can't explain. With old glibc versions, the "configure" time check will
pass, but the equivalent (almost identical) "make check" test will fail.
One possibility, I guess, is that occasionally address space randomization
will put valid data at the 0xc0ffee address that the test assumes will
segfault, and another is that some change in compiler optimization flags
is making a difference. At any rate, I think it's safe to just always
assume that this strtok_r() bug is present whenever glibc before 2.8 is
in use.
Specifically we've seen this happen intermittently when building against
the XenServer DDK 5.6.100 build 39265, which uses glibc 2.5.
Reported-by: Alex Wang <alexw@nicira.com>
Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Alex Wang <alexw@nicira.com>
The --whole-archive option is required to link vswitchd with DPDK,
otherwise the PMD drivers are not going to be included. Omitting the
option is not going to cause build failures, but OVS won't be able to
use most physical NICs.
Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
This patch adds support for a new port type to userspace datapath
called dpdkvhost. This allows KVM (QEMU) to offload the servicing
of virtio-net devices to its associated dpdkvhost port. Instructions
for use are in INSTALL.DPDK.
This has been tested on Intel multi-core platforms and with clients
that have virtio-net interfaces.
Signed-off-by: Ciara Loftus <ciara.loftus@intel.com>
Signed-off-by: Kevin Traynor <kevin.traynor@intel.com>
Signed-off-by: Maryam Tahhan <maryam.tahhan@intel.com>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
There are two important GSO tunnel features that were introduced
after the 3.12 cutoff for our current out of tree GSO implementation:
* 3.16 introduced support for outer UDP checksums.
* 3.18 introduced support for verifying hardware support for protocols
other than VXLAN.
In cases where these features are used, we should use OVS GSO to
ensure correct behavior. However, we also want to continue to use
kernel GSO or hardware TSO in existing situations. Therefore, this
extends the range of kernels where OVS GSO is available to 3.18 and
makes it easier to select which one to use.
Signed-off-by: Jesse Gross <jesse@nicira.com>
Acked-by: Thomas Graf <tgraf@noironetworks.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Upstream commit:
genetlink: pass only network namespace to genl_has_listeners()
There's no point to force the caller to know about the internal
genl_sock to use inside struct net, just have them pass the network
namespace. This doesn't really change code generation since it's
an inline, but makes the caller less magic - there's never any
reason to pass another socket.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Upstream: f8403a2 ("genetlink: pass only network namespace to genl_has_listeners()")
Signed-off-by: Thomas Graf <tgraf@noironetworks.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Upstream commit:
vxlan: Group Policy extension
Implements supports for the Group Policy VXLAN extension [0] to provide
a lightweight and simple security label mechanism across network peers
based on VXLAN. The security context and associated metadata is mapped
to/from skb->mark. This allows further mapping to a SELinux context
using SECMARK, to implement ACLs directly with nftables, iptables, OVS,
tc, etc.
The group membership is defined by the lower 16 bits of skb->mark, the
upper 16 bits are used for flags.
SELinux allows to manage label to secure local resources. However,
distributed applications require ACLs to implemented across hosts. This
is typically achieved by matching on L2-L4 fields to identify the
original sending host and process on the receiver. On top of that,
netlabel and specifically CIPSO [1] allow to map security contexts to
universal labels. However, netlabel and CIPSO are relatively complex.
This patch provides a lightweight alternative for overlay network
environments with a trusted underlay. No additional control protocol
is required.
Host 1: Host 2:
Group A Group B Group B Group A
+-----+ +-------------+ +-------+ +-----+
| lxc | | SELinux CTX | | httpd | | VM |
+--+--+ +--+----------+ +---+---+ +--+--+
\---+---/ \----+---/
| |
+---+---+ +---+---+
| vxlan | | vxlan |
+---+---+ +---+---+
+------------------------------+
Backwards compatibility:
A VXLAN-GBP socket can receive standard VXLAN frames and will assign
the default group 0x0000 to such frames. A Linux VXLAN socket will
drop VXLAN-GBP frames. The extension is therefore disabled by default
and needs to be specifically enabled:
ip link add [...] type vxlan [...] gbp
In a mixed environment with VXLAN and VXLAN-GBP sockets, the GBP socket
must run on a separate port number.
Examples:
iptables:
host1# iptables -I OUTPUT -m owner --uid-owner 101 -j MARK --set-mark 0x200
host2# iptables -I INPUT -m mark --mark 0x200 -j DROP
OVS:
# ovs-ofctl add-flow br0 'in_port=1,actions=load:0x200->NXM_NX_TUN_GBP_ID[],NORMAL'
# ovs-ofctl add-flow br0 'in_port=2,tun_gbp_id=0x200,actions=drop'
[0] https://tools.ietf.org/html/draft-smith-vxlan-group-policy
[1] http://lwn.net/Articles/204905/
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Upstream: 351149 ("vxlan: Group Policy extension")
Signed-off-by: Thomas Graf <tgraf@noironetworks.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
This patch effectively reverts commit 500f80872645 ("net: ovs: use CRC32
accelerated flow hash if available"), and other remaining arch_fast_hash()
users such as from nfsd via commit 6282cd565553 ("NFSD: Don't hand out
delegations for 30 seconds after recalling them.") where it has been used
as a hash function for bloom filtering.
While we think that these users are actually not much of concern, it has
been requested to remove the arch_fast_hash() library bits that arose
from [1] entirely as per recent discussion [2]. The main argument is that
using it as a hash may introduce bias due to its linearity (see avalanche
criterion) and thus makes it less clear (though we tried to document that)
when this security/performance trade-off is actually acceptable for a
general purpose library function.
Lets therefore avoid any further confusion on this matter and remove it to
prevent any future accidental misuse of it. For the time being, this is
going to make hashing of flow keys a bit more expensive in the ovs case,
but future work could reevaluate a different hashing discipline.
[1] https://patchwork.ozlabs.org/patch/299369/
[2] https://patchwork.ozlabs.org/patch/418756/
Upstream: 8754589 ("net: replace remaining users of arch_fast_hash with jhash")
Signed-off-by: Thomas Graf <tgraf@noironetworks.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
So it can be used from out of openvswitch code.
Did couple of cosmetic changes on the way, namely variable naming and
adding support for 8021AD proto.
Note on backwards compatability:
Unlike the upstream version, the backport of skb_vlan_push() does not
support translating a hardware accelerated 8021AD tag to software.
This is not a problem though as it preserves existing behaviour.
Upstream: 93515d53 ("net: move vlan pop/push functions into common code")
Signed-off-by: Thomas Graf <tgraf@noironetworks.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
note that skb_make_writable already exists in net/netfilter/core.c
but does something slightly different.
Upstream: e219512 ("net: move make_writable helper into common code")
Signed-off-by: Thomas Graf <tgraf@noironetworks.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Since older kernels do not have skb->vlan_proto, it is assumed that
kernels which don't provide their own __vlan_insert_tag() will also
not have skb->vlan_proto. The backwards compat function therefore
only supports ETH_P_8021Q as the protocol type.
Upstream: 15255a43 ("vlan: introduce __vlan_insert_tag helper which does not free skb")
Signed-off-by: Thomas Graf <tgraf@noironetworks.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
__vlan_put_tag() was renamed to vlan_insert_tag_set_proto() with
the argument list kept intact.
Upstream: 62749e ("vlan: rename __vlan_put_tag to vlan_insert_tag_set_proto")
Signed-off-by: Thomas Graf <tgraf@noironetworks.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
nla_is_last() is not available in 3.18, it's only in net-next.
Convert to grep based to check to account for distribution backports.
Fixes: 684b5f ("datapath: Rename last_action() as nla_is_last() and move to netlink.h")
Signed-off-by: Thomas Graf <tgraf@noironetworks.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
ipv6_find_hdr() already fixed in newer upstram kernel by Ansis, we
can start using this API safely.
This patch also backports fix (ipv6: ipv6_find_hdr restore prev
functionality) to compat ipv6_find_hdr().
CC: Ansis Atteka <aatteka@nicira.com>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Andy Zhou <azhou@nicira.com>
This patch updates the documentation to reflect that DPDK 1.7.1
is supported. Travis scripts have also been updated to reflect
this. DPDK phy and ring ports were validated against DPDK 1.7.1.
Reviewed-by: Mark D. Gray <mark.d.gray@intel.com>
Signed-off-by: Maryam Tahhan <maryam.tahhan@intel.com>
Acked-by: Daniele Di Proietto <ddiproietto@vmware.com>
Acked-by: Thomas Graf <tgraf@noironetworks.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>