2
0
mirror of https://github.com/openvswitch/ovs synced 2025-08-29 13:27:59 +00:00

121 Commits

Author SHA1 Message Date
Ilya Maximets
7a5e0ee7cc dpif: Turn dpif_flow_hash function into generic odp_flow_key_hash.
Current implementation of dpif_flow_hash() doesn't depend on datapath
interface and only complicates the callers by forcing them to figure
out what is their current 'dpif'.  If we'll need different hashing
for different 'dpif's we'll implement an API for dpif-providers
and each dpif implementation will be able to use their local function
directly without calling it via dpif API.

This change will allow us to not store 'dpif' pointer in the userspace
datapath implementation which is broken and will be removed in next
commits.

This patch moves dpif_flow_hash() to odp-util module and replaces
unused odp_flow_key_hash() by it, along with removing of unused 'dpif'
argument.

Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Ben Pfaff <blp@ovn.org>
2020-01-08 16:02:37 +01:00
Ilya Maximets
f7392b44a2 dpif-netlink: Fix dumping uninitialized netdev flow stats.
dpif logging functions expects to be called after the operation.
log_flow_del_message() dumps flow stats on success which are not
initialized before the actual call to netdev_flow_del():

 Conditional jump or move depends on uninitialised value(s)
    at 0x6090875: _itoa_word (_itoa.c:179)
    by 0x6093F0D: vfprintf (vfprintf.c:1642)
    by 0x60C090F: vsnprintf (vsnprintf.c:114)
    by 0xE5E7EC: ds_put_format_valist (dynamic-string.c:155)
    by 0xE5E755: ds_put_format (dynamic-string.c:142)
    by 0xE5A5E6: dpif_flow_stats_format (dpif.c:903)
    by 0xE5B708: log_flow_message (dpif.c:1763)
    by 0xE5BCA4: log_flow_del_message (dpif.c:1809)
    by 0xFA6076: try_send_to_netdev (dpif-netlink.c:2190)
    by 0xFA0D3C: dpif_netlink_operate (dpif-netlink.c:2248)
    by 0xE5AFAC: dpif_operate (dpif.c:1376)
    by 0xDF176E: push_dp_ops (ofproto-dpif-upcall.c:2367)
    by 0xDF04C8: push_ukey_ops (ofproto-dpif-upcall.c:2447)
    by 0xDF008F: revalidator_sweep__ (ofproto-dpif-upcall.c:2805)
    by 0xDF5DC6: revalidator_sweep (ofproto-dpif-upcall.c:2816)
    by 0xDF1E83: udpif_revalidator (ofproto-dpif-upcall.c:949)
    by 0xF3A3FE: ovsthread_wrapper (ovs-thread.c:383)
    by 0x565F6DA: start_thread (pthread_create.c:463)
    by 0x615988E: clone (clone.S:95)
  Uninitialised value was created by a stack allocation
    at 0xDEFC24: revalidator_sweep__ (ofproto-dpif-upcall.c:2733)

Fixes: 3cd99886191e ("dpif-netlink: Use dpif logging functions")
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
2020-01-07 10:11:45 +01:00
Paul Blakey
9221c721be netdev-offload-tc: Add conntrack label and mark support
Signed-off-by: Paul Blakey <paulb@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
2019-12-22 11:54:40 +01:00
Paul Blakey
576126a931 netdev-offload-tc: Add conntrack support
Zone and ct_state first.

Signed-off-by: Paul Blakey <paulb@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
2019-12-22 11:54:40 +01:00
Paul Blakey
b2ae40690e netdev-offload-tc: Add recirculation support via tc chains
Each recirculation id will create a tc chain, and we translate
the recirculation action to a tc goto chain action.

We check for kernel support for this by probing OvS Datapath for the
tc recirc id sharing feature. If supported, we can offload rules
that match on recirc_id, and recirculation action safely.

Signed-off-by: Paul Blakey <paulb@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
2019-12-22 11:54:40 +01:00
Paul Blakey
dcdcad68c6 dpif: Add support to set user features
This enables user features on the kernel datapath via the DP_CMD_SET
command, and also retrieves them to check for actual support and
not just an older kernel ignoring the requested features.

This will be used in next patch to enable recirc_id sharing with tc.

Signed-off-by: Paul Blakey <paulb@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
2019-12-22 11:54:40 +01:00
Tonghao Zhang
0442bfb11d ofproto-dpif-upcall: Echo HASH attribute back to datapath.
The kernel datapath may sent upcall with hash info,
ovs-vswitchd should get it from upcall and then send
it back.

The reason is that:
| When using the kernel datapath, the upcall don't
| include skb hash info relatived. That will introduce
| some problem, because the hash of skb is important
| in kernel stack. For example, VXLAN module uses
| it to select UDP src port. The tx queue selection
| may also use the hash in stack.
|
| Hash is computed in different ways. Hash is random
| for a TCP socket, and hash may be computed in hardware,
| or software stack. Recalculation hash is not easy.
|
| There will be one upcall, without information of skb
| hash, to ovs-vswitchd, for the first packet of a TCP
| session. The rest packets will be processed in Open vSwitch
| modules, hash kept. If this tcp session is forward to
| VXLAN module, then the UDP src port of first tcp packet
| is different from rest packets.
|
| TCP packets may come from the host or dockers, to Open vSwitch.
| To fix it, we store the hash info to upcall, and restore hash
| when packets sent back.

Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2019-October/364062.html
Link: https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=bd1903b7c4596ba6f7677d0dfefd05ba5876707d
Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
2019-11-22 09:51:57 -08:00
Ben Pfaff
622ea8fd75 dpif-netlink: Fix some variable naming.
Usually a plural name refers to an array, but 'socks' and 'socksp' were
only single objects, so this changes their names to 'sock' and 'sockp'.

Usually a 'p' suffix means that a variable is an output argument, but
that was only true in one place here, so this changes the names of the
other variables to plain 'sock'.

Signed-off-by: Ben Pfaff <blp@ovn.org>
Reviewed-by: Yifeng Sun <pkusunyifeng@gmail.com>
2019-10-14 12:27:11 -07:00
Yifeng Sun
666602ada7 dpif-netlink: Free leaked nl_sock
Valgrind reports:
20 bytes in 1 blocks are definitely lost in loss record 94 of 353
    by 0x532594: xmalloc (util.c:138)
    by 0x553EAD: nl_sock_create (netlink-socket.c:146)
    by 0x54331D: create_nl_sock (dpif-netlink.c:255)
    by 0x54331D: dpif_netlink_port_add__ (dpif-netlink.c:756)
    by 0x5435F6: dpif_netlink_port_add_compat (dpif-netlink.c:876)
    by 0x5435F6: dpif_netlink_port_add (dpif-netlink.c:922)
    by 0x47EC1D: dpif_port_add (dpif.c:584)
    by 0x42B35F: port_add (ofproto-dpif.c:3721)
    by 0x41E64A: ofproto_port_add (ofproto.c:2032)
    by 0x40B3FE: iface_do_create (bridge.c:1817)
    by 0x40B3FE: iface_create (bridge.c:1855)
    by 0x40B3FE: bridge_add_ports__ (bridge.c:943)
    by 0x40D14A: bridge_add_ports (bridge.c:959)
    by 0x40D14A: bridge_reconfigure (bridge.c:673)
    by 0x410D75: bridge_run (bridge.c:3050)
    by 0x407614: main (ovs-vswitchd.c:127)

This leak is because when vport_add_channel() returns 0, it is expected
to take the ownership of 'socksp'. This patch fixes this issue.

Signed-off-by: Yifeng Sun <pkusunyifeng@gmail.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
2019-10-14 11:00:34 -07:00
Yi-Hung Wei
187bb41fbf ofproto-dpif-xlate: Translate timeout policy in ct action
This patch derives the timeout policy based on ct zone from the
internal data structure that we maintain on dpif layer.

It also adds a system traffic test to verify the zone-based conntrack
timeout feature.  The test uses ovs-vsctl commands to configure
the customized ICMP and UDP timeout on zone 5 to a shorter period.
It then injects ICMP and UDP traffic to conntrack, and checks if the
corresponding conntrack entry expires after the predefined timeout.

Signed-off-by: Yi-Hung Wei <yihung.wei@gmail.com>

ofproto-dpif: Checks if datapath supports OVS_CT_ATTR_TIMEOUT

This patch checks whether datapath supports OVS_CT_ATTR_TIMEOUT. With this
check, ofproto-dpif-xlate can use this information to decide whether to
translate the ct timeout policy.

Signed-off-by: Yi-Hung Wei <yihung.wei@gmail.com>
Signed-off-by: Justin Pettit <jpettit@ovn.org>
2019-09-26 13:51:04 -07:00
Yi-Hung Wei
1f16131837 ct-dpif, dpif-netlink: Add conntrack timeout policy support
This patch first defines the dpif interface for a datapath to support
adding, deleting, getting and dumping conntrack timeout policy.
The timeout policy is identified by a 4 bytes unsigned integer in
datapath, and it currently support timeout for TCP, UDP, and ICMP
protocols.

Moreover, this patch provides the implementation for Linux kernel
datapath in dpif-netlink.

In Linux kernel, the timeout policy is maintained per L3/L4 protocol,
and it is identified by 32 bytes null terminated string.  On the other
hand, in vswitchd, the timeout policy is a generic one that consists of
all the supported L4 protocols.  Therefore, one of the main task in
dpif-netlink is to break down the generic timeout policy into 6
sub policies (ipv4 tcp, udp, icmp, and ipv6 tcp, udp, icmp),
and push down the configuration using the netlink API in
netlink-conntrack.c.

This patch also adds missing symbols in the windows datapath so
that the build on windows can pass.

Appveyor CI:
* https://ci.appveyor.com/project/YiHungWei/ovs/builds/26387754

Signed-off-by: Yi-Hung Wei <yihung.wei@gmail.com>
Acked-by: Alin Gabriel Serdean <aserdean@ovn.org>
Signed-off-by: Justin Pettit <jpettit@ovn.org>
2019-09-26 13:50:17 -07:00
Darrell Ball
64207120c8 conntrack: Add option to disable TCP sequence checking.
This may be needed in some special cases, such as to support some hardware
offload implementations.  Note that disabling TCP sequence number
verification is not an optimization in itself, but supporting some
hardware offload implementations may offer better performance.  TCP
sequence number verification is enabled by default.  This option is only
available for the userspace datapath.  Access to this option is presently
provided via 'dpctl' commands as the need for this option is quite node
specific, by virtue of which nics are in use on a given node.  A test is
added to verify this option.

Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2019-May/359188.html
Signed-off-by: Darrell Ball <dlu998@gmail.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
2019-09-25 12:11:32 -07:00
Ilya Maximets
0770429e38 dpif-netlink: Allow offloading of flows with dl_type 0x1234.
'dpif_probe_feature()' always has DPIF_FP_PROBE flag set. Other probing
code uses dpif_execute() with DPIF_OP_EXECUTE, hence never calls
parse_flow_put().
Thus, this 'if' statement is wrong and should be removed as it only
forbids offloading of the real legitimate flows with dl_type 0x1234.
Dummy flows never reach this code.

CC: Paul Blakey <paulb@mellanox.com>
Fixes: 8b668ee3f0cc ("dpif-netlink: Use netdev flow put api to insert a flow")
Reported-by: Eli Britstein <elibr@mellanox.com>
Acked-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
2019-07-31 14:44:15 +03:00
Ilya Maximets
f87c135706 vswitchd: Always cleanup userspace datapath.
'netdev' datapath is implemented within ovs-vswitchd process and can
not exist without it, so it should be gracefully terminated with a
full cleanup of resources upon ovs-vswitchd exit.

This change forces dpif cleanup for 'netdev' datapath regardless of
passing '--cleanup' to 'ovs-appctl exit'. Such solution allowes to
not pass this additional option everytime for userspace datapath
installations and also allowes to not terminate system datapath in
setups where both datapaths runs at the same time.

The main part is that dpif_port_del() will lead to netdev_close()
and subsequent netdev_class->destroy(dev) which will stop HW NICs
and free their resources. For vhost-user interfaces it will invoke
vhost driver unregistering with a properly closed vhost-user
connection. For upcoming AF_XDP netdev this will allow to gracefully
destroy xdp sockets and unload xdp programs from linux interfaces.
Another important thing is that port deletion will also trigger
flushing of flows offloaded to HW NICs.

Exception made for 'internal' ports that could have user ip/route
configuration. These ports will not be removed without '--cleanup'.

This change fixes OVS disappearing from the DPDK point of view
(keeping HW NICs improperly configured, sudden closing of vhost-user
connections) and will help with linux devices clearing with upcoming
AF_XDP netdev support.

Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Tested-by: William Tu <u9012063@gmail.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Acked-by: Ben Pfaff <blp@ovn.org>
2019-07-02 12:24:47 +03:00
Ilya Maximets
b6cabb8f8f netdev: Split up netdev offloading to separate module.
New module 'netdev-offload' created to manage different flow API
implementations. All the generic and provider independent code moved
there from the 'netdev' module.

Flow API providers further encapsulated.

The only function that was changed is 'netdev_any_oor'.
Now it uses offloading related hmap instead of common 'netdev_shash'.

Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Acked-by: Ben Pfaff <blp@ovn.org>
Acked-by: Roi Dayan <roid@mellanox.com>
2019-06-11 09:39:36 +03:00
wenxu
1028cb71d0 dpif-netlink: make offload failed EOPNOTSUPP and ENOSPC cases lower priority level log
Offload flow failed for EOPNOTSUPP and ENOSPC which should not
be a err. It should e lower priority level log for this two
failure case.

Signed-off-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
2019-03-21 08:05:07 +01:00
Yifeng Sun
c225ce2202 dpif-netlink: Free leaked ofpbuf by using ofpbuf_delete
Found by valgrind.

256 bytes in 4 blocks are definitely lost in loss record 319 of 348
    by 0x52E204: xmalloc (util.c:123)
    by 0x4F6172: ofpbuf_new (ofpbuf.c:151)
    by 0x53DEF2: dpif_netlink_ct_get_limits (dpif-netlink.c:2951)
    by 0x587881: dpctl_ct_get_limits (dpctl.c:1904)
    by 0x58566F: dpctl_unixctl_handler (dpctl.c:2589)
    by 0x52D660: process_command (unixctl.c:308)
    by 0x52D660: run_connection (unixctl.c:342)
    by 0x52D660: unixctl_server_run (unixctl.c:393)
    by 0x407366: main (ovs-vswitchd.c:126)

Signed-off-by: Yifeng Sun <pkusunyifeng@gmail.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
2019-03-05 17:34:24 -08:00
Darrell Ball
4ea96698f6 Userspace datapath: Add fragmentation handling.
Fragmentation handling is added for supporting conntrack.
Both v4 and v6 are supported.

After discussion with several people, I decided to not store
configuration state in the database to be more consistent with
the kernel in future, similarity with other conntrack configuration
which will not be in the database as well and overall simplicity.
Accordingly, fragmentation handling is enabled by default.

This patch enables fragmentation tests for the userspace datapath.

Signed-off-by: Darrell Ball <dlu998@gmail.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
2019-02-14 14:18:56 -08:00
Yifeng Sun
7c5793e62a dpif-netlink: Fix a bug that causes duplicate key error in datapath
Kmod tests 122 and 123 failed and kernel reports a "Duplicate key of
type 6" error. Further debugging reveals that nl_attr_find__() should
start looking for OVS_KEY_ATTR_ETHERTYPE from offset returned by
a previous called nl_msg_start_nested(). This patch fixes it.

Tests 122 and 123 were skipped by kernel 4.15 and older versions.
Kernel 4.16 and later kernels start showing this failure.

Signed-off-by: Yifeng Sun <pkusunyifeng@gmail.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
2019-02-04 13:37:48 -08:00
Flavio Leitner
c37cb3eea6 Revert "dpif-netlink: Don't destroy and recreate port if it exists"
This reverts commit  a38dccb3ee80a1d0b8973191c9e94f045441f8cc.

The original commit 7521e0cf9e88 ("ofproto-dpif: Let the dpif report
when a port is a duplicate.") relies on the kernel to check if the
port exists or not. However, the current kernel code doesn't handle
when the port is moved to another network namespace.

Signed-off-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
2019-01-25 13:20:11 -08:00
Alin Gabriel Serdean
d240e46aca Windows: Fix broken kernel userspace communication
Patch: 69c51582ff
broke Windows userpace - kernel communication.

On windows we create netlink sockets when the handlers are initiated and
reuse them.
This patch remaps the usage of the netlink socket pool.

Fixes:
https://github.com/openvswitch/ovs-issues/issues/164

Signed-off-by: Alin Gabriel Serdean <aserdean@ovn.org>
Acked-by: Shashank Ram <rams@vmware.com>
Tested-by: Shashank Ram <rams@vmware.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
Co-authored-by: Ben Pfaff <blp@ovn.org>
2018-11-16 18:06:12 +02:00
Ben Pfaff
713a45dbd5 dpif-netlink: Fix error behavior in dpif_netlink_port_add__().
Until now, the code here would report an error to its caller as success.
This fixes the problem.

Found by inspection.

Acked-by: Alin Gabriel Serdean <aserdean@ovn.org>
Signed-off-by: Ben Pfaff <blp@ovn.org>
2018-11-15 09:36:20 -08:00
Jianbo Liu
a38dccb3ee dpif-netlink: Don't destroy and recreate port if it exists
In commit 7521e0cf9e ('ofproto-dpif: Let the dpif report when a port is
a duplicate'), the checking of port existence before adding was removed,
and it's up to the dpif to check if port exists and add only if needed.

But the port can't be added to datapath if already exists. Then it will
be destroyed and created again. This causes problem because configuration
may miss. For example, if creating two vxlan on the same port, its ingress
qdisc will be lost after recreated.

Fixes: 7521e0cf9e88 ("ofproto-dpif: Let the dpif report when a port is a duplicate.")
Signed-off-by: Jianbo Liu <jianbol@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
2018-10-30 11:28:36 -07:00
Sriharsha Basavapatna via dev
57924fc91c revalidator: Rebalance offloaded flows based on the pps rate
This is the third patch in the patch-set to support dynamic rebalancing
of offloaded flows.

The dynamic rebalancing functionality is implemented in this patch. The
ukeys that are not scheduled for deletion are obtained and passed as input
to the rebalancing routine. The rebalancing is done in the context of
revalidation leader thread, after all other revalidator threads are
done with gathering rebalancing data for flows.

For each netdev that is in OOR state, a list of flows - both offloaded
and non-offloaded (pending) - is obtained using the ukeys. For each netdev
that is in OOR state, the flows are grouped and sorted into offloaded and
pending flows.  The offloaded flows are sorted in descending order of
pps-rate, while pending flows are sorted in ascending order of pps-rate.

The rebalancing is done in two phases. In the first phase, we try to
offload all pending flows and if that succeeds, the OOR state on the device
is cleared. If some (or none) of the pending flows could not be offloaded,
then we start replacing an offloaded flow that has a lower pps-rate than
a pending flow, until there are no more pending flows with a higher rate
than an offloaded flow. The flows that are replaced from the device are
added into kernel datapath.

A new OVS configuration parameter "offload-rebalance", is added to ovsdb.
The default value of this is "false". To enable this feature, set the
value of this parameter to "true", which provides packets-per-second
rate based policy to dynamically offload and un-offload flows.

Note: This option can be enabled only when 'hw-offload' policy is enabled.
It also requires 'tc-policy' to be set to 'skip_sw'; otherwise, flow
offload errors (specifically ENOSPC error this feature depends on) reported
by an offloaded device are supressed by TC-Flower kernel module.

Signed-off-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
Co-authored-by: Venkat Duvvuru <venkatkumar.duvvuru@broadcom.com>
Signed-off-by: Venkat Duvvuru <venkatkumar.duvvuru@broadcom.com>
Reviewed-by: Sathya Perla <sathya.perla@broadcom.com>
Reviewed-by: Ben Pfaff <blp@ovn.org>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
2018-10-19 11:27:52 +02:00
Sriharsha Basavapatna via dev
738c785ff1 dpif-netlink: Detect Out-Of-Resource condition on a netdev
This is the first patch in the patch-set to support dynamic rebalancing
of offloaded flows.

The patch detects OOR condition on a netdev port when ENOSPC error is
returned by TC-Flower while adding a flow rule. A new structure is added
to the netdev called "netdev_hw_info", to store OOR related information
required to perform dynamic offload-rebalancing.

Signed-off-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
Co-authored-by: Venkat Duvvuru <venkatkumar.duvvuru@broadcom.com>
Signed-off-by: Venkat Duvvuru <venkatkumar.duvvuru@broadcom.com>
Reviewed-by: Sathya Perla <sathya.perla@broadcom.com>
Reviewed-by: Ben Pfaff <blp@ovn.org>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
2018-10-19 11:27:45 +02:00
Eli Britstein
d9677a1f0e netdev-tc-offloads: TC csum option is not matched with tunnel configuration
Tunnels (gre, geneve, vxlan) support 'csum' option (true/false), default is false.
Generated encap TC rule will now be configured as the tunnel configuration.

Signed-off-by: Eli Britstein <elibr@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
2018-10-16 09:28:30 +02:00
Matteo Croce
790a437229 dpif-netlink: Fix null pointer.
In dpif_netlink_port_add__(), socksp could be NULL, because
vport_socksp_to_pids() would allocate a new array and return a single
zero element.
Following vport_socksp_to_pids() removal, a NULL pointer can happen when
dpif_netlink_port_add__() is called and dpif->handlers is 0.

Restore the old behaviour of using a zero pid when dpif->handlers is 0.

Fixes: 69c51582f ("dpif-netlink: don't allocate per thread netlink sockets")
Reported-by: Flavio Leitner <fbl@redhat.com>
Reported-by: Guru Shetty <guru@ovn.org>
Signed-off-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
2018-10-08 08:17:31 -07:00
Ben Pfaff
769b50349f dpif: Remove support for multiple queues per port.
Commit 69c51582ff78 ("dpif-netlink: don't allocate per thread netlink
sockets") removed dpif-netlink support for multiple queues per port.
No remaining dpif provider supports multiple queues per port, so
remove infrastructure for the feature.

CC: Matteo Croce <mcroce@redhat.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
Tested-by: Yifeng Sun <pkusunyifeng@gmail.com>
Reviewed-by: Yifeng Sun <pkusunyifeng@gmail.com>
2018-09-26 15:57:06 -07:00
Matteo Croce
69c51582ff dpif-netlink: don't allocate per thread netlink sockets
When using the kernel datapath, OVS allocates a pool of sockets to handle
netlink events. The number of sockets is: ports * n-handler-threads, where
n-handler-threads is user configurable and defaults to 3/4*number of cores.

This because vswitchd starts n-handler-threads threads, each one with a
netlink socket for every port of the switch. Every thread then, starts
listening on events on its set of sockets with epoll().

On setup with lot of CPUs and ports, the number of sockets easily hits
the process file descriptor limit, and ovs-vswitchd will exit with -EMFILE.

Change the number of allocated sockets to just one per port by moving
the socket array from a per handler structure to a per datapath one,
and let all the handlers share the same sockets by using EPOLLEXCLUSIVE
epoll flag which avoids duplicate events, on systems that support it.

The patch was tested on a 56 core machine running Linux 4.18 and latest
Open vSwitch. A bridge was created with 2000+ ports, some of them being
veth interfaces with the peer outside the bridge. The latency of the upcall
is measured by setting a single 'action=controller,local' OpenFlow rule to
force all the packets going to the slow path and then to the local port.
A tool[1] injects some packets to the veth outside the bridge, and measures
the delay until the packet is captured on the local port. The rx timestamp
is get from the socket ancillary data in the attribute SO_TIMESTAMPNS, to
avoid having the scheduler delay in the measured time.

The first test measures the average latency for an upcall generated from
a single port. To measure it 100k packets, one every msec, are sent to a
single port and the latencies are measured.

The second test is meant to check latency fairness among ports, namely if
latency is equal between ports or if some ports have lower priority.
The previous test is repeated for every port, the average of the average
latencies and the standard deviation between averages is measured.

The third test serves to measure responsiveness under load. Heavy traffic
is sent through all ports, latency and packet loss is measured
on a single idle port.

The fourth test is all about fairness. Heavy traffic is injected in all
ports but one, latency and packet loss is measured on the single idle port.

This is the test setup:

  # nproc
  56
  # ovs-vsctl show |grep -c Port
  2223
  # ovs-ofctl dump-flows ovs_upc_br
   cookie=0x0, duration=4.827s, table=0, n_packets=0, n_bytes=0, actions=CONTROLLER:65535,LOCAL
  # uname -a
  Linux fc28 4.18.7-200.fc28.x86_64 #1 SMP Mon Sep 10 15:44:45 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

And these are the results of the tests:

                                          Stock OVS                 Patched
  netlink sockets
  in use by vswitchd
  lsof -p $(pidof ovs-vswitchd) \
      |grep -c GENERIC                        91187                    2227

  Test 1
  one port latency
  min/avg/max/mdev (us)           2.7/6.6/238.7/1.8       1.6/6.8/160.6/1.7

  Test 2
  all port
  avg latency/mdev (us)                   6.51/0.97               6.86/0.17

  Test 3
  single port latency
  under load
  avg/mdev (us)                             7.5/5.9                 3.8/4.8
  packet loss                                  95 %                    62 %

  Test 4
  idle port latency
  under load
  min/avg/max/mdev (us)           0.8/1.5/210.5/0.9       1.0/2.1/344.5/1.2
  packet loss                                  94 %                     4 %

CPU and RAM usage seems not to be affected, the resource usage of vswitchd
idle with 2000+ ports is unchanged:

  # ps u $(pidof ovs-vswitchd)
  USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
  openvsw+  5430 54.3  0.3 4263964 510968 pts/1  RLl+ 16:20   0:50 ovs-vswitchd

Additionally, to check if vswitchd is thread safe with this patch, the
following test was run for circa 48 hours: on a 56 core machine, a
bridge with kernel datapath is filled with 2200 dummy interfaces and 22
veth, then 22 traffic generators are run in parallel piping traffic into
the veths peers outside the bridge.
To generate as many upcalls as possible, all packets were forced to the
slowpath with an openflow rule like 'action=controller,local' and packet
size was set to 64 byte. Also, to avoid overflowing the FDB early and
slowing down the upcall processing, generated mac addresses were restricted
to a small interval. vswitchd ran without problems for 48+ hours,
obviously with all the handler threads with almost 99% CPU usage.

[1] https://github.com/teknoraver/network-tools/blob/master/weed.c

Signed-off-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
Acked-by: Flavio Leitner <fbl@sysclose.org>
2018-09-25 14:52:20 -07:00
Gavi Teitz
a692410af0 dpctl: Expand the flow dump type filter
Added new types to the flow dump filter, and allowed multiple filter
types to be passed at once, as a comma separated list. The new types
added are:
 * tc - specifies flows handled by the tc dp
 * non-offloaded - specifies flows not offloaded to the HW
 * all - specifies flows of all types

The type list is now fully parsed by the dpctl, and a new struct was
added to dpif which enables dpctl to define which types of dumps to
provide, rather than passing the type string and having dpif parse it.

Signed-off-by: Gavi Teitz <gavi@mellanox.com>
Acked-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
2018-09-13 16:56:25 +02:00
Justin Pettit
60ebc04d12 dpif-netlink: Prevent abort in probe_broken_meters().
Commit 92d0d515d ("dpif-netlink: Probe for broken Linux meter
implementations.") introduced a deadlock on the 'once' structure
declared in probe_broken_meters() with the following callstack:

        probe_broken_meters()
        probe_broken_meters__()
        dpif_netlink_meter_set()
        probe_broken_meters()

This commit introduce a modified version of dpif_netlink_meter_set()
that sets a meter without calling the probe.

Reported-by: Numan Siddique <nusiddiq@redhat.com>
Signed-off-by: Justin Pettit <jpettit@ovn.org>
Acked-by: Ben Pfaff <blp@ovn.org>
2018-08-17 13:10:01 -07:00
Yi-Hung Wei
906ff9d229 dpif-netlink: Implement conntrack zone limit
This patch provides the implementation of conntrack zone limit
in dpif-netlink.  It basically utilizes the netlink API to
communicate with OVS kernel module to set, delete, and get conntrack
zone limit.

Signed-off-by: Yi-Hung Wei <yihung.wei@gmail.com>
Signed-off-by: Justin Pettit <jpettit@ovn.org>
2018-08-17 09:46:56 -07:00
Yi-Hung Wei
cd015a11c2 dpif: Support conntrack zone limit.
This patch defines the dpif interface to support conntrack
per zone limit.  Basically, OVS users can use this interface
to set, delete, and get the conntrack per zone limit for various
dpif interfaces.  The following patch will make use of the proposed
interface to implement the feature.

Signed-off-by: Yi-Hung Wei <yihung.wei@gmail.com>
Signed-off-by: Justin Pettit <jpettit@ovn.org>
2018-08-17 09:30:55 -07:00
Justin Pettit
92d0d515d6 dpif-netlink: Probe for broken Linux meter implementations.
Meter support was introduced in Linux 4.15.  In some versions of Linux
4.15, 4.16, and 4.17, there was a bug that never set the id when the
meter was created, so all meters essentially had an id of zero.  This
commit adds a probe to check for that condition and disable meters on
those kernels.

Signed-off-by: Justin Pettit <jpettit@ovn.org>
Acked-by: Ben Pfaff <blp@ovn.org>
2018-08-16 10:20:52 -07:00
Justin Pettit
8101f03fcd dpif: Don't pass in '*meter_id' to meter_set commands.
The original intent of the API appears to be that the underlying DPIF
implementaion would choose a local meter id.  However, neither of the
existing datapath meter implementations (userspace or Linux) implemented
that; they expected a valid meter id to be passed in, otherwise they
returned an error.  This commit follows the existing implementations and
makes the API somewhat cleaner.

Signed-off-by: Justin Pettit <jpettit@ovn.org>
Acked-by: Ben Pfaff <blp@ovn.org>
2018-08-16 10:20:52 -07:00
Andy Zhou
80738e5f93 dpif-netlink: Add meter support.
To work with kernel datapath that supports meter.

Signed-off-by: Andy Zhou <azhou@ovn.org>
Co-authored-by: Justin Pettit <jpettit@ovn.org>
Signed-off-by: Justin Pettit <jpettit@ovn.org>
Acked-by: Alin Gabriel Serdean <aserdean@ovn.org>
Acked-by: Ben Pfaff <blp@ovn.org>
2018-07-30 13:00:51 -07:00
Justin Pettit
494a74557a Revert "dpctl: Expand the flow dump type filter"
Commit ab15e70eb587 ("dpctl: Expand the flow dump type filter") had a
number of issues with style, build breakage, and failing unit tests.
The patch is being reverted so that they can addressed.

This reverts commit ab15e70eb5878b46f8f84da940ffc915b6d74cad.

CC: Gavi Teitz <gavi@mellanox.com>
CC: Simon Horman <simon.horman@netronome.com>
CC: Roi Dayan <roid@mellanox.com>
CC: Aaron Conole <aconole@redhat.com>
Signed-off-by: Justin Pettit <jpettit@ovn.org>
Acked-by: Ben Pfaff <blp@ovn.org>
2018-07-25 14:17:36 -07:00
Gavi Teitz
ab15e70eb5 dpctl: Expand the flow dump type filter
Added new types to the flow dump filter, and allowed multiple filter
types to be passed at once, as a comma separated list. The new types
added are:
 * tc - specifies flows handled by the tc dp
 * non-offloaded - specifies flows not offloaded to the HW
 * all - specifies flows of all types

The type list is now fully parsed by the dpctl, and a new struct was
added to dpif which enables dpctl to define which types of dumps to
provide, rather than passing the type string and having dpif parse it.

Signed-off-by: Gavi Teitz <gavi@mellanox.com>
Acked-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
2018-07-25 18:16:27 +02:00
Jianbo Liu
f9885dc594 Add support to offload QinQ double VLAN headers match
Currently the inner VLAN header is ignored when using the TC data-path.
As TC flower supports QinQ, now we can offload the rules to match on both
outer and inner VLAN headers.

Signed-off-by: Jianbo Liu <jianbol@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
2018-07-25 18:15:52 +02:00
Gavi Teitz
d63ca5329f dpctl: Properly reflect a rule's offloaded to HW state
Previously, any rule that is offloaded via a netdev, not necessarily
to the HW, would be reported as "offloaded". This patch fixes this
misalignment, and introduces the 'dp' state, as follows:

rule is in HW via TC offload  -> offloaded=yes dp:tc
rule is in not HW over TC DP  -> offloaded=no  dp:tc
rule is in not HW over OVS DP -> offloaded=no  dp:ovs

To achieve this, the flows's 'offloaded' flag was encapsulated in a new
attrs struct, which contains the offloaded state of the flow and the
DP layer the flow is handled in, and instead of setting the flow's
'offloaded' state based solely on the type of dump it was acquired
via, for netdev flows it now sends the new attrs struct to be
collected along with the rest of the flow via the netdev, allowing
it to be set per flow.

For TC offloads, the offloaded state is set based on the 'in_hw' and
'not_in_hw' flags received from the TC as part of the flower. If no
such flag was received, due to lack of kernel support, it defaults
to true.

Signed-off-by: Gavi Teitz <gavi@mellanox.com>
Acked-by: Roi Dayan <roid@mellanox.com>
[simon: resolved conflict in lib/dpctl.man]
Signed-off-by: Simon Horman <simon.horman@netronome.com>
2018-06-18 09:57:37 +02:00
Ben Pfaff
fa37affad3 Embrace anonymous unions.
Several OVS structs contain embedded named unions, like this:

struct {
    ...
    union {
        ...
    } u;
};

C11 standardized a feature that many compilers already implemented
anyway, where an embedded union may be unnamed, like this:

struct {
    ...
    union {
        ...
    };
};

This is more convenient because it allows the programmer to omit "u."
in many places.  OVS already used this feature in several places.  This
commit embraces it in several others.

Signed-off-by: Ben Pfaff <blp@ovn.org>
Acked-by: Justin Pettit <jpettit@ovn.org>
Tested-by: Alin Gabriel Serdean <aserdean@ovn.org>
Acked-by: Alin Gabriel Serdean <aserdean@ovn.org>
2018-05-25 13:36:05 -07:00
Greg Rose
1c385f4972 lib/dpif-netlink: Fix miscompare of gre ports
In netdev_to_ovs_vport_type() it checks for netdev types matching
"gre" with a strstr().  This makes it match ip6gre as well and return
OVS_VPORT_TYPE_GRE, which is clearly wrong.

Move the usage of strstr() *after* all the exact matches with strcmp()
to avoid the problem permanently because when I added the ip6gre
type I ran into a very difficult to detect bug.

Cc: Ben Pfaff <blp@ovn.org>
Signed-off-by: Greg Rose <gvrose8192@gmail.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
Acked-by: William Tu <u9012063@gmail.com>
2018-05-21 20:33:30 -07:00
Greg Rose
3b10ceeed1 ip6gre: Add ip6gre vport type
Add handlers for OVS_VPORT_TYPE_IP6GRE

Cc: Ben Pfaff <blp@ovn.org>
Signed-off-by: Greg Rose <gvrose8192@gmail.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
Acked-by: William Tu <u9012063@gmail.com>
2018-05-21 20:33:30 -07:00
William Tu
98514eea21 erspan: add kernel datapath support
pass check, check-kernel (4.16-rc4), check-system-userspace

Signed-off-by: William Tu <u9012063@gmail.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
2018-05-21 20:33:30 -07:00
William Tu
7dc18ae96d userspace: add erspan tunnel support.
ERSPAN is a tunneling protocol based on GRE tunnel.  The patch
add erspan tunnel support for ovs-vswitchd with userspace datapath.
Configuring erspan tunnel is similar to gre tunnel, but with
additional erspan's parameters.  Matching a flow on erspan's
metadata is also supported, see ovs-fields for more details.

Signed-off-by: William Tu <u9012063@gmail.com>
Signed-off-by: Greg Rose <gvrose8192@gmail.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
2018-05-21 20:33:30 -07:00
Greg Rose
c387d8177f compat: Add ipv6 GRE and IPV6 Tunneling
This patch backports upstream ipv6 GRE and tunneling into the OVS
OOT (Out of Tree) datapath drivers.  The primary reason for this
is to support the ERSPAN feature.

Because there is no previous history of ipv6 GRE and tunneling it is
not possible to exactly reproduce the history of all the files in
the patch.  The two newly added files - ip6_gre.c and ip6_tunnel.c -
are cut from whole cloth out of the upstream Linux 4.15 kernel and
then modified as necessary with compatibility layer fixups.
These two files already included parts of several other upstream
commits that also touched other upstream files.  As such, this
patch may incorporate parts or all of the following commits:

d350a82 net: erspan: create erspan metadata uapi header
c69de58 net: erspan: use bitfield instead of mask and offset
b423d13 net: erspan: fix use-after-free
214bb1c net: erspan: remove md NULL check
afb4c97 ip6_gre: fix potential memory leak in ip6erspan_rcv
50670b6 ip_gre: fix potential memory leak in erspan_rcv
a734321 ip6_gre: fix error path when ip6erspan_rcv failed
dd8d5b8 ip_gre: fix error path when erspan_rcv failed
293a199 ip6_gre: fix a pontential issue in ip6erspan_rcv
d91e8db5 net: erspan: reload pointer after pskb_may_pull
ae3e133 net: erspan: fix wrong return value
c05fad5 ip_gre: fix wrong return value of erspan_rcv
94d7d8f ip6_gre: add erspan v2 support
f551c91 net: erspan: introduce erspan v2 for ip_gre
1d7e2ed net: erspan: refactor existing erspan code
ef7baf5 ip6_gre: add ip6 erspan collect_md mode
5a963eb ip6_gre: Add ERSPAN native tunnel support
ceaa001 openvswitch: Add erspan tunnel support.
f192970 ip_gre: check packet length and mtu correctly in erspan tx
c84bed4 ip_gre: erspan device should keep dst
c122fda ip_gre: set tunnel hlen properly in erspan_tunnel_init
5513d08 ip_gre: check packet length and mtu correctly in erspan_xmit
935a974 ip_gre: get key from session_id correctly in erspan_rcv
1a66a83 gre: add collect_md mode to ERSPAN tunnel
84e54fe gre: introduce native tunnel support for ERSPAN

In cases where the listed commits also touched other source code
files then the patches are also listed separately within this
patch series.

Signed-off-by: Greg Rose <gvrose8192@gmail.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
Acked-by: William Tu <u9012063@gmail.com>
2018-05-21 20:33:29 -07:00
Chris Mi
00a0a011d3 netdev-tc-offloads: Add offloading of multiple outputs
Currently, we support offloading of one output port. Remove that
limitation by use of mirred mirror action for all output ports,
except that the last one is mirred redirect action.

Signed-off-by: Chris Mi <chrism@mellanox.com>
Reviewed-by: Paul Blakey <paulb@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
2018-04-12 11:08:34 +02:00
Flavio Leitner
e0e2410d52 netdev-linux: fail ops not supporting remote netns.
When the netdev is in another namespace and the operation doesn't
support network namespaces, return the correct error.

Signed-off-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
2018-03-31 12:48:39 -07:00
Flavio Leitner
bfda523979 netnsid: update device only if netnsid matches.
Recent kernels provide the network namespace ID of a port,
so use that to discover where the port currently is.

A network device in another network namespace could have the
same name, so once the socket starts listening to other network
namespaces, it is necessary to confirm the netnsid.

Signed-off-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
2018-03-31 12:48:33 -07:00
Flavio Leitner
a86bd14ec9 netlink: provide network namespace id from a msg.
The netlink notification's ancillary data contains the network
namespace id (netnsid) needed to identify the device correctly.

Signed-off-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
2018-03-31 12:48:31 -07:00