This patch introduces a new command, 'ovs-appctl dpctl/dump-conntrack-exp',
that dumps the existing expectations for the userspace ct.
Signed-off-by: Paolo Valerio <pvalerio@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
In the specific call to dpif_netlink_dp_transact() (line 398) in
dpif_netlink_open(), the 'dp' content is not being used in the branch
when no error is returned (starting line 430). Furthermore, the 'dp'
and 'buf' variables are overwritten later in this same branch when a
new netlink request is sent (line 437), which results in a memory leak.
Reported by Address Sanitizer.
Indirect leak of 1024 byte(s) in 1 object(s) allocated from:
0 0x7fe09d3bfe70 in __interceptor_malloc (/usr/lib64/libasan.so.4+0xe0e70)
1 0x8759be in xmalloc__ lib/util.c:140
2 0x875a9a in xmalloc lib/util.c:175
3 0x7ba0d2 in ofpbuf_init lib/ofpbuf.c:141
4 0x7ba1d6 in ofpbuf_new lib/ofpbuf.c:169
5 0x9057f9 in nl_sock_transact lib/netlink-socket.c:1113
6 0x907a7e in nl_transact lib/netlink-socket.c:1817
7 0x8b5abe in dpif_netlink_dp_transact lib/dpif-netlink.c:5007
8 0x89a6b5 in dpif_netlink_open lib/dpif-netlink.c:398
9 0x5de16f in do_open lib/dpif.c:348
10 0x5de69a in dpif_open lib/dpif.c:393
11 0x5de71f in dpif_create_and_open lib/dpif.c:419
12 0x47b918 in open_dpif_backer ofproto/ofproto-dpif.c:764
13 0x483e4a in construct ofproto/ofproto-dpif.c:1658
14 0x441644 in ofproto_create ofproto/ofproto.c:556
15 0x40ba5a in bridge_reconfigure vswitchd/bridge.c:885
16 0x41f1a9 in bridge_run vswitchd/bridge.c:3313
17 0x42d4fb in main vswitchd/ovs-vswitchd.c:132
18 0x7fe09cc03c86 in __libc_start_main (/usr/lib64/libc.so.6+0x25c86)
Fixes: b841e3cd4a28 ("dpif-netlink: Fix feature negotiation for older kernels.")
Reviewed-by: David Marchand <david.marchand@redhat.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Yunjian Wang <wangyunjian@huawei.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Since 3d9c1b855a5f ("conntrack: Replace timeout based expiration lists
with rculists.") the sweep interval changed as well as the constraints
related to the sweeper.
Being able to change the default reschedule time may be convenient in
some conditions, like debugging.
This patch introduces new commands to get and set the sweep
interval in ms.
Signed-off-by: Paolo Valerio <pvalerio@redhat.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The SRv6 (Segment Routing IPv6) tunnel vport is responsible
for encapsulating and decapsulating inner packets with an
IPv6 header and an extension header called SRH
(Segment Routing Header). See spec in:
https://datatracker.ietf.org/doc/html/rfc8754
This patch implements SRv6 tunneling in userspace datapath.
It uses `remote_ip` and `local_ip` options as with existing
tunnel protocols. It also adds a dedicated `srv6_segs` option
to define a sequence of routers called segment list.
Signed-off-by: Nobuhiro MIKI <nmiki@yahoo-corp.jp>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Ensure at least 1 handler is created even if something goes wrong during
CPU detection or prime number calculation.
Fixes: a5cacea5f988 ("handlers: Create additional handler threads when using CPU isolation.")
Suggested-by: Aaron Conole <aconole@redhat.com>
Acked-by: Mike Pattrick <mkp@redhat.com>
Acked-by: Michael Santana <msantana@redhat.com>
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
When the ukey's action set changes, it could cause the flow to use a
different datapath, for example, when it moves from tc to kernel.
This will cause the cached statistics of the previous datapath to be used.
This change will reset the cached statistics when a change in
datapath is discovered.
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Add support to count upcall packets per port, both successful and failed,
which is a better way to see how many packets are upcalled on each interface.
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: wangchuanlei <wangchuanlei@inspur.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The assignment of the features pointer is not doing
anything and can be removed.
CC: Justin Pettit <jpettit@ovn.org>
Signed-off-by: Roi Dayan <roid@nvidia.com>
Acked-by: Justin Pettit <jpettit@ovn.org>
Signed-off-by: Simon Horman <simon.horman@corigine.com>
The current offloading code supports only a limited number of tunnel keys
and silently ignores everything it doesn't understand. This
causes, for example, offloaded ERSPAN tunnels to not work, because
the flow is offloaded, but the ERSPAN options are not provided to TC.
There are a number of tunnel keys that are supported by the userspace,
but silently ignored during offloading:
OVS_TUNNEL_KEY_ATTR_DONT_FRAGMENT
OVS_TUNNEL_KEY_ATTR_OAM
OVS_TUNNEL_KEY_ATTR_VXLAN_OPTS
OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS
OVS_TUNNEL_KEY_ATTR_CSUM is kind of supported, but only for actions
and for some reason is set from the tunnel port instead of the
provided action, and not currently supported for the tunnel key in
the match.
Adding a default case to fail offloading of unknown attributes. For
now explicitly allowing incorrect behavior for the DONT_FRAGMENT flag,
otherwise we'll break all tunnel offloading by default. VXLAN and
ERSPAN options have to fail offloading, because the tunnel will not
work otherwise. OAM is not a default configuration, so failing it
as well. The missing DONT_FRAGMENT flag though should, probably,
cause frequent flow revalidation, but that is not new with this patch.
Same for the 'match' key, only clearing masks that were actually
consumed, except for the DONT_FRAGMENT and CSUM flags, which are
explicitly allowed and highlighted as broken.
Also, the destination port as well as the CSUM configuration, for some
unknown reason, were not taken from the actions list and were passed via HW
offload info instead of being consumed from the set() action.
Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2022-July/395522.html
Reported-by: Eelco Chaudron <echaudro@redhat.com>
Fixes: 8f283af89298 ("netdev-tc-offloads: Implement netdev flow put using tc interface")
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The handler and CPU mapping in upcalls is incorrect, and this is
especially noticeable on systems with CPU isolation enabled.
Say we have a 12 core system where only every even-numbered CPU is enabled:
C0, C2, C4, C6, C8, C10
This means we will create an array of size 6 that will be sent to
kernel that is populated with sockets [S0, S1, S2, S3, S4, S5]
The problem is when the kernel does an upcall it checks the socket array
via the index of the CPU, effectively adding additional load on some
CPUs while leaving no work on other CPUs.
e.g.
C0 indexes to S0
C2 indexes to S2 (should be S1)
C4 indexes to S4 (should be S2)
Modulo of 6 (size of socket array) is applied, so we wrap back to S0
C6 indexes to S0 (should be S3)
C8 indexes to S2 (should be S4)
C10 indexes to S4 (should be S5)
Effectively sockets S0, S2, S4 get overloaded while sockets S1, S3, S5
get no work assigned to them.
This leads the kernel to emit the following message:
"openvswitch: cpu_id mismatch with handler threads"
Instead we will send the kernel a corrected array of sockets the size
of all CPUs in the system, or the largest core_id on the system,
whichever is greater. This is to take care of systems with non-contiguous
CPU core IDs.
In the above example we would create a
corrected array in a round-robin (assuming prime bias) fashion as follows:
[S0, S1, S2, S3, S4, S5, S6, S0, S1, S2, S3, S4]
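The mapping problem and the corrected array described above can be sketched as follows. This is only an illustration in Python, not the dpif-netlink code: the function names are hypothetical, and the handler count of 7 is an assumption chosen to match the example array (which contains S6).

```python
def naive_socket_index(cpu_id, n_sockets):
    """Broken lookup: index the socket array by CPU id, wrapped by its size."""
    return cpu_id % n_sockets


def corrected_socket_array(array_size, n_handlers):
    """One slot per possible CPU id, filled round-robin with handler sockets."""
    return [f"S{cpu_id % n_handlers}" for cpu_id in range(array_size)]


enabled_cpus = [0, 2, 4, 6, 8, 10]  # every even CPU of a 12 core system

# Old behavior: 6 sockets, indexed directly by CPU id.
broken = [f"S{naive_socket_index(cpu, 6)}" for cpu in enabled_cpus]

# New behavior: 12 slots; 7 handlers is an assumption matching the array above.
corrected = corrected_socket_array(12, 7)
per_cpu = [corrected[cpu] for cpu in enabled_cpus]

print(broken)     # sockets hit by the enabled CPUs with the old array
print(corrected)  # the corrected array sent to the kernel
print(per_cpu)    # sockets hit by the enabled CPUs with the new array
```

With the old array only three distinct sockets are ever selected, while with the corrected array each of the six enabled CPUs lands on a different socket.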
Fixes: b1e517bd2f81 ("dpif-netlink: Introduce per-cpu upcall dispatch.")
Co-authored-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Michael Santana <msantana@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Additional threads are required to service upcalls when we have CPU
isolation (in per-cpu dispatch mode). Additional threads are required
because they create a fairer distribution: with more threads, each
thread is assigned fewer cores, which decreases the load on each thread.
Adding additional threads also increases the chance OVS utilizes all
cores available to use. Some RPS schemes might make some handler
threads get all the workload while others get no workload. This tends
to happen when the handler thread count is low.
An example would be an RPS that sends traffic on all even cores on a
system with only the lower half of the cores available for OVS to use.
In this example we have as many handler threads as there are
available cores. In this case 50% of the handler threads get all the
workload while the other 50% get no workload. Not only that, but OVS
is only utilizing half of the cores that it can use. This is the worst
case scenario.
The ideal scenario is to have as many threads as there are cores - in
this case we guarantee that all cores OVS can use are utilized.
But, adding as many threads as there are cores could have a performance
hit when the number of active cores (which all threads have to share) is
very low. For this reason we avoid creating as many threads as there
are cores and instead meet somewhere in the middle.
The formula used to calculate the number of handler threads to create
is as follows:
handlers_n = min(next_prime(active_cores+1), total_cores)
Assume default behavior when total_cores <= 2, that is, do not create
additional threads when we have 2 or fewer total cores on the system.
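The formula above can be tried out with a small Python sketch. Two points are assumptions rather than statements from this message: next_prime(n) is taken to mean the smallest prime greater than or equal to n, and the "default behavior" is taken to be one handler per active core.

```python
def is_prime(n):
    """Trial division; enough for core counts."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def next_prime(n):
    """Smallest prime >= n (assumed meaning of next_prime)."""
    while not is_prime(n):
        n += 1
    return n


def handlers_n(active_cores, total_cores):
    """handlers_n = min(next_prime(active_cores + 1), total_cores)."""
    if total_cores <= 2:
        # Assumed default: one handler per active core, no extra threads.
        return active_cores
    return min(next_prime(active_cores + 1), total_cores)


for active, total in [(6, 12), (4, 8), (12, 12), (2, 2)]:
    print(active, total, handlers_n(active, total))
```

When no cores are isolated (active_cores equals total_cores) the min() caps the result at total_cores, so no additional threads are created in that case either.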
Fixes: b1e517bd2f81 ("dpif-netlink: Introduce per-cpu upcall dispatch.")
Signed-off-by: Michael Santana <msantana@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior in
lib/dpif-netlink.c:1077:40: runtime error:
left shift of 1 by 31 places cannot be represented in type 'int'
#0 0x73fc31 in dpif_netlink_port_add_compat lib/dpif-netlink.c:1077:40
#1 0x73fc31 in dpif_netlink_port_add lib/dpif-netlink.c:1132:17
#2 0x2c1745 in dpif_port_add lib/dpif.c:597:13
#3 0x07b279 in port_add ofproto/ofproto-dpif.c:3957:17
#4 0x01b209 in ofproto_port_add ofproto/ofproto.c:2124:13
#5 0xfdbfce in iface_do_create vswitchd/bridge.c:2066:13
#6 0xfdbfce in iface_create vswitchd/bridge.c:2109:13
#7 0xfdbfce in bridge_add_ports__ vswitchd/bridge.c:1173:21
#8 0xfb5319 in bridge_add_ports vswitchd/bridge.c:1189:5
#9 0xfb5319 in bridge_reconfigure vswitchd/bridge.c:901:9
#10 0xfae0f9 in bridge_run vswitchd/bridge.c:3334:9
#11 0xfe67dd in main vswitchd/ovs-vswitchd.c:129:9
#12 0x4b6d8f (/lib/x86_64-linux-gnu/libc.so.6+0x29d8f)
#13 0x4b6e3f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x29e3f)
#14 0x562594eed024 in _start (vswitchd/ovs-vswitchd+0x787024)
Fixes: 526df7d8543f ("tunnel: Provide framework for tunnel extensions for VXLAN-GBP and others")
Acked-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
OVS meters are created in advance and OpenFlow rules refer to them by
their unique ID. A new tc_police API is used to offload them. By calling
the API, police actions are created and meters are mapped to them.
These actions then can be used in tc filter rules by the index.
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Simon Horman <simon.horman@corigine.com>
UB Sanitizer reports:
tests/test-hash.c:59:40:
runtime error: shift exponent 64 is too large for 64-bit type
'long unsigned int'
0 0x44c3c9 in get_range128 tests/test-hash.c:59
1 0x44cb2e in check_hash_bytes128 tests/test-hash.c:178
2 0x44d14d in test_hash_main tests/test-hash.c:282
[...]
ofproto/ofproto-dpif-xlate.c:5607:45:
runtime error: left shift of 65535 by 16 places cannot be represented
in type 'int'
0 0x53fe9f in xlate_sample_action ofproto/ofproto-dpif-xlate.c:5607
1 0x54d625 in do_xlate_actions ofproto/ofproto-dpif-xlate.c:7160
2 0x553b76 in xlate_actions ofproto/ofproto-dpif-xlate.c:7806
3 0x4fcb49 in upcall_xlate ofproto/ofproto-dpif-upcall.c:1237
4 0x4fe02f in process_upcall ofproto/ofproto-dpif-upcall.c:1456
5 0x4fda99 in upcall_cb ofproto/ofproto-dpif-upcall.c:1358
[...]
tests/test-util.c:89:23:
runtime error: left shift of 1 by 31 places cannot be represented in
type 'int'
0 0x476415 in test_ctz tests/test-util.c:89
[...]
lib/dpif-netlink.c:396:33:
runtime error: left shift of 1 by 31 places cannot be represented in
type 'int'
0 0x571b9f in dpif_netlink_open lib/dpif-netlink.c:396
Acked-by: Aaron Conole <aconole@redhat.com>
Acked-by: Paolo Valerio <pvalerio@redhat.com>
Signed-off-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Expose a function to query datapath offload statistics.
This function is separate from the current one in netdev-offload
as it exposes more detailed statistics from the datapath, instead of
only from the netdev-offload provider.
Each datapath is meant to use the custom counters as it sees fit for its
handling of hardware offloads.
Call the new API from dpctl.
Signed-off-by: Gaetan Rivet <grive@u256.net>
Reviewed-by: Eli Britstein <elibr@nvidia.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
This patch adds a series of Netlink flow operation USDT probes.
These probes are in turn used in the upcall_cost Python script,
which, in addition to some kernel tracepoints, gives insight into
the time spent on processing upcalls.
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Paolo Valerio <pvalerio@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
OVS_DP_F_UNALIGNED is already set, so there is no need to set it again. If
OVS is restarted, the dp is already created, so dpif_netlink_dp_transact()
will return EEXIST. No need to probe again.
Signed-off-by: Chris Mi <cmi@nvidia.com>
Acked-by: Paolo Valerio <pvalerio@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
This patch adds a general way of viewing/configuring datapath
cache sizes, with an implementation for the netlink interface.
The ovs-dpctl/ovs-appctl show commands will display the
current cache sizes configured:
$ ovs-dpctl show
system@ovs-system:
lookups: hit:25 missed:63 lost:0
flows: 0
masks: hit:282 total:0 hit/pkt:3.20
cache: hit:4 hit-rate:4.54%
caches:
masks-cache: size:256
port 0: ovs-system (internal)
port 1: br-int (internal)
port 2: genev_sys_6081 (geneve: packet_type=ptap)
port 3: br-ex (internal)
port 4: eth2
port 5: sw0p1 (internal)
port 6: sw0p3 (internal)
A specific cache can be configured as follows:
$ ovs-appctl dpctl/cache-set-size DP CACHE SIZE
$ ovs-dpctl cache-set-size DP CACHE SIZE
For example to disable the cache do:
$ ovs-dpctl cache-set-size system@ovs-system masks-cache 0
Setting cache size successful, new size 0.
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Paolo Valerio <pvalerio@redhat.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Older kernels do not reject unsupported features. This can lead
to a situation in which 'ovs-vswitchd' believes that a feature is
supported when, in fact, it is not.
This patch probes for this by attempting to set a known unsupported
feature.
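The probing idea can be illustrated with a toy model in Python. Everything here is hypothetical (the classes, the feature masks and the probe bit are made up for the illustration and are not the kernel or OVS API); it only shows why asking for a known unsupported feature reveals whether the kernel validates requests.

```python
SUPPORTED_FEATURES = 0b0111  # hypothetical set of implemented feature bits
PROBE_FEATURE = 1 << 31      # hypothetical bit assumed to be unsupported


class OldKernel:
    """Accepts any feature mask, even bits it does not implement."""

    def set_features(self, mask):
        return True


class NewKernel:
    """Rejects masks containing bits outside SUPPORTED_FEATURES."""

    def set_features(self, mask):
        return (mask & ~SUPPORTED_FEATURES) == 0


def kernel_validates_features(kernel):
    """A kernel that accepts the unsupported probe bit cannot be trusted."""
    return not kernel.set_features(PROBE_FEATURE)


print(kernel_validates_features(OldKernel()))
print(kernel_validates_features(NewKernel()))
```

If the probe request is accepted, a later "success" when setting a real feature says nothing about whether that feature is actually supported.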
Reported-by: Dumitru Ceara <dceara@redhat.com>
Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=2004083
Suggested-by: Ilya Maximets <i.maximets@ovn.org>
Signed-off-by: Mark Gray <mark.d.gray@redhat.com>
Tested-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The Open vSwitch kernel module uses the upcall mechanism to send
packets from kernel space to user space when it misses in the kernel
space flow table. The upcall sends packets via a Netlink socket.
Currently, a Netlink socket is created for every vport. In this way,
there is a 1:1 mapping between a vport and a Netlink socket.
When a packet is received by a vport, if it needs to be sent to
user space, it is sent via the corresponding Netlink socket.
This mechanism, with various iterations of the corresponding user
space code, has seen some limitations and issues:
* On systems with a large number of vports, there is correspondingly
a large number of Netlink sockets which can limit scaling.
(https://bugzilla.redhat.com/show_bug.cgi?id=1526306)
* Packet reordering on upcalls.
(https://bugzilla.redhat.com/show_bug.cgi?id=1844576)
* A thundering herd issue.
(https://bugzilla.redhat.com/show_bug.cgi?id=1834444)
This patch introduces an alternative, feature-negotiated, upcall
mode using a per-cpu dispatch rather than a per-vport dispatch.
In this mode, the Netlink socket to be used for the upcall is
selected based on the CPU of the thread that is executing the upcall.
In this way, it resolves the issues above as:
a) The number of Netlink sockets scales with the number of CPUs
rather than the number of vports.
b) Ordering per-flow is maintained as packets are distributed to
CPUs based on mechanisms such as RSS and flows are distributed
to a single user space thread.
c) Packets from a flow can only wake up one user space thread.
Reported-at: https://bugzilla.redhat.com/1844576
Signed-off-by: Mark Gray <mark.d.gray@redhat.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Currently, conntrack in the kernel has an undocumented feature referred
to as all-zero IP address SNAT. Basically, when a source port
collision is detected during the commit, the source port will be
translated to an ephemeral port. If there is no collision, no SNAT is
performed.
This patchset documents this behavior and adds a self-test to verify
it's not changing. In addition, a datapath feature flag is added for
the all-zero IP SNAT case. This will help applications on top of OVS,
like OVN, to determine whether this feature can be used.
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Aaron Conole <aconole@redhat.com>
Acked-by: Dumitru Ceara <dceara@redhat.com>
Acked-by: Alin-Gabriel Serdean <aserdean@ovn.org>
Acked-by: Paolo Valerio <pvalerio@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
ct limit requests never initialize the whole 'struct ovs_zone_limit',
sending uninitialized stack memory to the kernel:
Syscall param sendmsg(msg.msg_iov[0]) points to uninitialised byte(s)
at 0x5E23867: sendmsg (in /usr/lib64/libpthread-2.28.so)
by 0x54F761: nl_sock_transact_multiple__ (netlink-socket.c:858)
by 0x54FB6E: nl_sock_transact_multiple.part.9 (netlink-socket.c:1079)
by 0x54FCC0: nl_sock_transact_multiple (netlink-socket.c:1044)
by 0x54FCC0: nl_sock_transact (netlink-socket.c:1108)
by 0x550B6F: nl_transact (netlink-socket.c:1804)
by 0x53BEA2: dpif_netlink_ct_get_limits (dpif-netlink.c:3052)
by 0x588B57: dpctl_ct_get_limits (dpctl.c:2178)
by 0x586FF2: dpctl_unixctl_handler (dpctl.c:2870)
by 0x52C241: process_command (unixctl.c:310)
by 0x52C241: run_connection (unixctl.c:344)
by 0x52C241: unixctl_server_run (unixctl.c:395)
by 0x407526: main (ovs-vswitchd.c:128)
Address 0x10b87480 is 32 bytes inside a block of size 4,096 alloc'd
at 0x4C30F0B: malloc (vg_replace_malloc.c:307)
by 0x52CDE4: xmalloc (util.c:138)
by 0x4F7E07: ofpbuf_init (ofpbuf.c:123)
by 0x4F7E07: ofpbuf_new (ofpbuf.c:151)
by 0x53BDE3: dpif_netlink_ct_get_limits (dpif-netlink.c:3025)
by 0x588B57: dpctl_ct_get_limits (dpctl.c:2178)
by 0x586FF2: dpctl_unixctl_handler (dpctl.c:2870)
by 0x52C241: process_command (unixctl.c:310)
by 0x52C241: run_connection (unixctl.c:344)
by 0x52C241: unixctl_server_run (unixctl.c:395)
by 0x407526: main (ovs-vswitchd.c:128)
Uninitialised value was created by a stack allocation
at 0x46AAA0: ct_dpif_get_limits (ct-dpif.c:197)
Fix that by using designated initializers that will clear all the
non-specified fields.
Fixes: 906ff9d229ee ("dpif-netlink: Implement conntrack zone limit")
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Mark D. Gray <mark.d.gray@redhat.com>
If another statement fails before info.tc_modify_flow_deleted is
assigned a value, error handling jumps to the 'out' label. At the
'out' label, the uninitialized variable is used in a condition,
which may cause unpredictable behavior.
Fixes: 65b84d4a32bd ("dpif-netlink: avoid netlink modify flow put op failed after tc modify flow put op failed.")
Signed-off-by: Mengfan Lv <lvmengfan@huawei.com>
Signed-off-by: Yunjian Wang <wangyunjian@huawei.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
There are various L3 encapsulation standards using UDP being discussed to
leverage the UDP based load balancing capability of different networks.
MPLSoUDP (https://tools.ietf.org/html/rfc7510) is one among them.
The Bareudp tunnel provides a generic L3 encapsulation support for
tunnelling different L3 protocols like MPLS, IP, NSH etc. inside a UDP
tunnel.
An example to create a bareudp device to tunnel MPLS traffic is
given below:
$ ovs-vsctl add-port br_mpls udp_port -- set interface udp_port \
type=bareudp options:remote_ip=2.1.1.3 \
options:local_ip=2.1.1.2 \
options:payload_type=0x8847 options:dst_port=6635
The bareudp device supports special handling for MPLS & IP as
they can have multiple ethertypes. The MPLS protocol can have ethertypes
ETH_P_MPLS_UC (unicast) & ETH_P_MPLS_MC (multicast). The IP protocol can have
ethertypes ETH_P_IP (v4) & ETH_P_IPV6 (v6).
A bareudp device to tunnel L3 traffic with multiple ethertypes
(MPLS & IP) can be created by passing the L3 protocol name as a string in
the payload_type field. An example to create a bareudp device to tunnel
MPLS unicast & multicast traffic is given below:
$ ovs-vsctl add-port br_mpls udp_port -- set interface udp_port \
type=bareudp options:remote_ip=2.1.1.3 \
options:local_ip=2.1.1.2 \
options:payload_type=mpls options:dst_port=6635
Signed-off-by: Martin Varghese <martin.varghese@nokia.com>
Acked-By: Greg Rose <gvrose8192@gmail.com>
Tested-by: Greg Rose <gvrose8192@gmail.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The n_offloaded_flows counter is stored in the dpif, and that is the first
one opened when ofproto is created. When a flow operation is done by an
ovs-appctl command, such as dpctl/add-flow, a new dpif is opened, and the
n_offloaded_flows in it can't be used. So, instead of using a counter,
the number of offloaded flows is queried from each netdev and then summed
up. To achieve this, a new API is added to netdev_flow_api to get
the number of flows assigned to a netdev.
In order to get better performance, this number is calculated directly
from the tc_to_ufid hmap for netdev-offload-tc, because flow dumping by tc
takes a long time if there are many flows offloaded.
Fixes: af0618470507 ("dpif-netlink: Count the number of offloaded rules")
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Add a counter for the offloaded rules, and display it in the output
of "ovs-appctl upcall/show".
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
The current code generates a UFID for flows installed by ovs-dpctl. This
makes it impossible to remove such flows with the same command. Ex:
ovs-dpctl add-dp test
ovs-dpctl add-if test vport0
ovs-dpctl add-flow test "in_port(0),eth(),eth_type(0x800),ipv4(src=100.1.0.1)" 0
ovs-dpctl del-flow test "in_port(0),eth(),eth_type(0x800),ipv4(src=100.1.0.1)"
dpif|WARN|system@test: failed to flow_del (No such file or directory)
ufid:e4457189-3990-4a01-bdcf-1e5f8b208711 in_port(0),
eth(src=00:00:00:00:00:00,dst=00:00:00:00:00:00),eth_type(0x0800),
ipv4(src=100.1.0.1,dst=0.0.0.0,proto=0,tos=0,ttl=0,frag=no)
ovs-dpctl: deleting flow (No such file or directory)
Perhaps you need to specify a UFID?
During the del-flow operation a UFID is generated too, however the resulting
value is different from the one generated during add-flow. This happens
because the odp_flow_key_hash() function uses a random base value for flow
hashes, which is different on every invocation. That is not an issue
while running 'ovs-appctl dpctl/{add,del}-flow' because execution
of these requests happens in the context of the OVS main process, i.e.
there will be the same random seed.
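The effect of a per-process random seed can be shown with a short Python sketch. This is not how odp_flow_key_hash() is implemented: the class name is hypothetical and keyed BLAKE2b merely stands in for a seeded hash.

```python
import hashlib
import os


class DpctlProcess:
    """Models one process invocation with its own random hash seed."""

    def __init__(self, seed=None):
        # Each invocation picks a fresh seed unless one is supplied.
        self.seed = seed if seed is not None else os.urandom(16)

    def ufid(self, flow_key):
        # Keyed BLAKE2b is a stand-in for a seeded 128-bit flow hash.
        digest = hashlib.blake2b(flow_key.encode(), key=self.seed,
                                 digest_size=16)
        return digest.hexdigest()


key = "in_port(0),eth(),eth_type(0x800),ipv4(src=100.1.0.1)"

add_flow = DpctlProcess()  # models the 'ovs-dpctl add-flow' invocation
del_flow = DpctlProcess()  # models the 'ovs-dpctl del-flow' invocation
print(add_flow.ufid(key) == del_flow.ufid(key))  # seeds differ between runs

vswitchd = DpctlProcess()  # models the long-running OVS main process
print(vswitchd.ufid(key) == vswitchd.ufid(key))  # same seed, same UFID
```

Two separate invocations derive different UFIDs for the same flow key, while repeated requests served by one process agree.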
Commit e61984e781e6 was intended to allow offloading for flows
added by dpctl/add-flow unixctl command, so it's better to generate
UFIDs conditionally inside dpctl command handler only for appctl
invocations. Offloading is not possible from ovs-dpctl utility anyway.
There are still a couple of corner cases: It will not be possible to
remove a flow by 'ovs-appctl dpctl/del-flow' without specifying the UFID if
the main OVS process was restarted since flow addition, and it will not
be possible to remove a flow by ovs-dpctl without specifying the UFID if
it was added by 'ovs-appctl dpctl/add-flow'. But these scenarios
seem minor since these commands are intended for testing only.
Reported-by: Eelco Chaudron <echaudro@redhat.com>
Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2020-September/374863.html
Fixes: e61984e781e6 ("dpif-netlink: Generate ufids for installing TC flowers")
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Tested-by: Eelco Chaudron <echaudro@redhat.com>
There is no real difference between the 'class' and 'type' in the
context of common lookup operations inside the netdev-offload module
because it only checks the value of pointers without using the
value itself. However, 'type' has some meaning and can be used by
offload providers in the initialization phase to check if this type
of Flow API paired with the netdev type could be used in a particular
datapath type. For example, this is needed to check if the Linux flow
API could be used for the current tunneling vport because it could be
used only if the tunneling vport belongs to the system datapath, i.e. has
a backing Linux interface.
This is needed to unblock tunneling offloads in userspace datapath
with DPDK flow API.
Acked-by: Eli Britstein <elibr@mellanox.com>
Acked-by: Roni Bar Yanai <roniba@mellanox.com>
Acked-by: Ophir Munk <ophirmu@mellanox.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Problem:
In OVS, flows with output over a bond interface of type “balance-tcp”
get translated by the ofproto layer into "HASH" and "RECIRC" datapath
actions. After recirculation, the packet is forwarded to the bond
member port based on 8-bits of the datapath hash value computed through
dp_hash. This causes performance degradation in the following ways:
1. The recirculation of the packet implies another lookup of the
packet’s flow key in the exact match cache (EMC) and potentially
Megaflow classifier (DPCLS). This is the biggest cost factor.
2. The recirculated packets have a new “RSS” hash and compete with the
original packets for the scarce number of EMC slots. This implies more
EMC misses and potentially EMC thrashing causing costly DPCLS lookups.
3. The 256 extra megaflow entries per bond for dp_hash bond selection
put additional load on the revalidation threads.
Owing to this performance degradation, deployments stick to “balance-slb”
bond mode even though it does not do active-active load balancing for
VXLAN- and GRE-tunnelled traffic because all tunnel packets have the
same source MAC address.
Proposed optimization:
This proposal introduces a new load-balancing output action instead of
recirculation.
Maintain one table per-bond (could just be an array of uint16's) and
program it the same way internal flows are created today for each
possible hash value (256 entries) from ofproto layer. Use this table to
load-balance flows as part of output action processing.
Currently xlate_normal() -> output_normal() ->
bond_update_post_recirc_rules() -> bond_may_recirc() and
compose_output_action__() generate 'dp_hash(hash_l4(0))' and
'recirc(<RecircID>)' actions. In this case the RecircID identifies the
bond. For the recirculated packets the ofproto layer installs megaflow
entries that match on RecircID and masked dp_hash and send them to the
corresponding output port.
Instead, we will now generate action as
'lb_output(<bond id>)'
This combines hash computation (only if needed, else re-use RSS hash)
and inline load-balancing over the bond. This action is used *only* for
balance-tcp bonds in userspace datapath (the OVS kernel datapath
remains unchanged).
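The per-bond table lookup can be sketched in Python as below. This is a rough model only: the function names are hypothetical, and filling the buckets round-robin over the member ports is an assumption made for the illustration (how ofproto assigns buckets is not described here).

```python
from array import array

N_BUCKETS = 256  # one entry per possible 8-bit hash value


def build_bond_table(member_ports):
    """Per-bond table of uint16 entries: bucket index -> member port.

    Round-robin assignment is an assumption for this sketch.
    """
    return array("H", (member_ports[i % len(member_ports)]
                       for i in range(N_BUCKETS)))


def lb_output(bond_table, packet_hash):
    """Select the output port from the low 8 bits of the packet hash."""
    return bond_table[packet_hash & 0xFF]


table = build_bond_table([2, 1])
print(list(table[:4]))               # first four buckets
print(lb_output(table, 0x12345600))  # bucket 0
print(lb_output(table, 0xABCDEF01))  # bucket 1
```

The lookup replaces the second flow lookup after recirculation with a single array index, which is where the saving comes from.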
Example:
Current scheme:
With 8 UDP flows (with random UDP src port):
flow-dump from pmd on cpu core: 2
recirc_id(0),in_port(7),<...> actions:hash(hash_l4(0)),recirc(0x1)
recirc_id(0x1),dp_hash(0xf8e02b7e/0xff),<...> actions:2
recirc_id(0x1),dp_hash(0xb236c260/0xff),<...> actions:1
recirc_id(0x1),dp_hash(0x7d89eb18/0xff),<...> actions:1
recirc_id(0x1),dp_hash(0xa78d75df/0xff),<...> actions:2
recirc_id(0x1),dp_hash(0xb58d846f/0xff),<...> actions:2
recirc_id(0x1),dp_hash(0x24534406/0xff),<...> actions:1
recirc_id(0x1),dp_hash(0x3cf32550/0xff),<...> actions:1
New scheme:
We can do with a single flow entry (for any number of new flows):
in_port(7),<...> actions:lb_output(1)
A new CLI has been added to dump datapath bond cache as given below.
# ovs-appctl dpif-netdev/bond-show [dp]
Bond cache:
bond-id 1 :
bucket 0 - slave 2
bucket 1 - slave 1
bucket 2 - slave 2
bucket 3 - slave 1
Co-authored-by: Manohar Krishnappa Chidambaraswamy <manukc@gmail.com>
Signed-off-by: Manohar Krishnappa Chidambaraswamy <manukc@gmail.com>
Signed-off-by: Vishal Deep Ajmera <vishal.deep.ajmera@ericsson.com>
Tested-by: Matteo Croce <mcroce@redhat.com>
Tested-by: Adrian Moreno <amorenoz@redhat.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The OVS_DP_ATTR_NAME field is required when sending OVS_DP_CMD_SET to the
Windows kernel driver. The function "dpif_netlink_set_features"
does not set the OVS_DP_ATTR_NAME field, which causes the set feature
request to fail and ovs-vswitchd to exit.
This patch fixes the issue by setting "request.name" in the request.
Reported-at: https://github.com/openvswitch/ovs-issues/issues/187
Submitted-at: https://github.com/openvswitch/ovs/pull/319
Signed-off-by: Rui Cao <rcao@vmware.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
In order to improve revalidator performance by minimizing unnecessary
copying of data, extend netdev-offloads to support terse dump mode. Extend
netdev_flow_api->flow_dump_create() with 'terse' bool argument. Implement
support for terse dump in functions that convert netlink to flower and
flower to match. Set flow stats "used" value based on difference in number
of flow packets because lastuse timestamp is not included in TC terse dump.
Kernel API support is implemented in the following patch.
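The "used" handling for terse dumps can be pictured with a minimal Python sketch. The function and its parameters are hypothetical and only model the idea stated above: since no lastuse timestamp is available, a change in the packet counter between two dumps is treated as evidence that the flow was used.

```python
def updated_used(prev_packets, prev_used, cur_packets, now):
    """Return the new "used" time for a flow seen in a terse dump.

    If the packet counter changed since the previous dump, the flow is
    considered used at the time of this dump; otherwise the previous
    value is kept.
    """
    if cur_packets != prev_packets:
        return now
    return prev_used


# Flow saw traffic between dumps: "used" moves to the current dump time.
print(updated_used(prev_packets=10, prev_used=100, cur_packets=12, now=200))

# Flow was idle between dumps: "used" is left untouched.
print(updated_used(prev_packets=10, prev_used=100, cur_packets=10, now=200))
```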
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
To support installing TC flowers to HW via the "ovs-appctl dpctl/add-flow"
command, there should be a ufid. This patch checks whether a ufid exists and,
if not, generates one. Note that when processing upcall packets, the
ufid is generated in parse_odp_packet for the kernel datapath.
Configuring max-idle/max-revalidator may help when testing this patch.
Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Acked-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
GTP, GPRS Tunneling Protocol, is a group of IP-based communications
protocols used to carry general packet radio service (GPRS) within
GSM, UMTS and LTE networks. GTP protocol has two parts: Signalling
(GTP-Control, GTP-C) and User data (GTP-User, GTP-U). GTP-C is used
for setting up GTP-U protocol, which is an IP-in-UDP tunneling
protocol. GTP is usually used to connect radio base stations, the
Serving Gateway (S-GW), and the PDN Gateway (P-GW).
This patch implements GTP-U protocol for userspace datapath,
supporting only required header fields and G-PDU message type.
See spec in:
https://tools.ietf.org/html/draft-hmm-dmm-5g-uplane-analysis-00
Tested-at: https://travis-ci.org/github/williamtu/ovs-travis/builds/666518784
Signed-off-by: Feng Yang <yangfengee04@gmail.com>
Co-authored-by: Feng Yang <yangfengee04@gmail.com>
Signed-off-by: Yi Yang <yangyi01@inspur.com>
Co-authored-by: Yi Yang <yangyi01@inspur.com>
Signed-off-by: William Tu <u9012063@gmail.com>
Acked-by: Ben Pfaff <blp@ovn.org>
The tc modify flow put always deletes the original flow first and
then adds the new flow. If the modify flow put operation fails,
the flow put operation is changed from modify to create only if the
original flow is successfully deleted in tc (which always fails with
ENOENT, because the flow was already deleted before the new flow was
added in tc). As a result, the modify flow put fails to be added in
the kernel datapath.
Signed-off-by: wenxu <wenxu@ucloud.cn>
Acked-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
'dpif_probe_feature' and 'revalidate' don't free the 'dp_extra_info'
string. Also, all the implementations of dpif_flow_get() should
initialize the value to avoid printing or freeing random memory.
30 bytes in 1 blocks are definitely lost in loss record 323 of 889
at 0x483AD19: realloc (vg_replace_malloc.c:836)
by 0xDDAD89: xrealloc (util.c:149)
by 0xCE1609: ds_reserve (dynamic-string.c:63)
by 0xCE1A90: ds_put_format_valist (dynamic-string.c:161)
by 0xCE19B9: ds_put_format (dynamic-string.c:142)
by 0xCCCEA9: dp_netdev_flow_to_dpif_flow (dpif-netdev.c:3170)
by 0xCCD2DD: dpif_netdev_flow_get (dpif-netdev.c:3278)
by 0xCCEA0A: dpif_netdev_operate (dpif-netdev.c:3868)
by 0xCDF81B: dpif_operate (dpif.c:1361)
by 0xCDEE93: dpif_flow_get (dpif.c:1002)
by 0xCDECF9: dpif_probe_feature (dpif.c:962)
by 0xC635D2: check_recirc (ofproto-dpif.c:896)
by 0xC65C02: check_support (ofproto-dpif.c:1567)
by 0xC63274: open_dpif_backer (ofproto-dpif.c:818)
by 0xC65E3E: construct (ofproto-dpif.c:1605)
by 0xC4D436: ofproto_create (ofproto.c:549)
by 0xC3931A: bridge_reconfigure (bridge.c:877)
by 0xC3FEAC: bridge_run (bridge.c:3324)
by 0xC4551D: main (ovs-vswitchd.c:127)
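The two halves of the fix can be sketched as follows: the provider always
initializes 'dp_extra_info', and the caller always frees it. All names
below are simplified stand-ins rather than the real dpif types, and the
info string is a placeholder.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for the flow structure returned by a flow get. */
struct dpif_flow_sketch {
    char *dp_extra_info;   /* Heap-allocated, owned by the caller, or NULL. */
};

/* Provider side: always initializes 'dp_extra_info', so that callers can
 * print or free it unconditionally. */
static int
flow_get_sketch(struct dpif_flow_sketch *flow, bool with_info)
{
    assert(flow != NULL);
    flow->dp_extra_info = NULL;
    if (with_info) {
        const char *info = "placeholder extra info";

        flow->dp_extra_info = malloc(strlen(info) + 1);
        if (!flow->dp_extra_info) {
            return ENOMEM;
        }
        strcpy(flow->dp_extra_info, info);
    }
    return 0;
}

/* Caller side: frees the string on every path; free(NULL) is a no-op. */
static int
probe_feature_sketch(bool with_info)
{
    struct dpif_flow_sketch flow;
    int error = flow_get_sketch(&flow, with_info);

    free(flow.dp_extra_info);
    return error;
}
```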
CC: Emma Finn <emma.finn@intel.com>
Fixes: 0e8f5c6a38d0 ("dpif-netdev: Modified ovs-appctl dpctl/dump-flows command")
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Roi Dayan <roid@mellanox.com>
The current implementation of dpif_flow_hash() doesn't depend on the
datapath interface and only complicates the callers by forcing them to
figure out what their current 'dpif' is. If we need different hashing
for different 'dpif's, we'll implement an API for dpif-providers,
and each dpif implementation will be able to use its local function
directly without calling it via the dpif API.
This change will allow us to not store the 'dpif' pointer in the
userspace datapath implementation, which is broken and will be removed
in the next commits.
This patch moves dpif_flow_hash() to the odp-util module and replaces
the unused odp_flow_key_hash() with it, along with removing the unused
'dpif' argument.
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Ben Pfaff <blp@ovn.org>
The dpif logging functions expect to be called after the operation.
log_flow_del_message() dumps flow stats on success, which are not
initialized before the actual call to netdev_flow_del():
Conditional jump or move depends on uninitialised value(s)
at 0x6090875: _itoa_word (_itoa.c:179)
by 0x6093F0D: vfprintf (vfprintf.c:1642)
by 0x60C090F: vsnprintf (vsnprintf.c:114)
by 0xE5E7EC: ds_put_format_valist (dynamic-string.c:155)
by 0xE5E755: ds_put_format (dynamic-string.c:142)
by 0xE5A5E6: dpif_flow_stats_format (dpif.c:903)
by 0xE5B708: log_flow_message (dpif.c:1763)
by 0xE5BCA4: log_flow_del_message (dpif.c:1809)
by 0xFA6076: try_send_to_netdev (dpif-netlink.c:2190)
by 0xFA0D3C: dpif_netlink_operate (dpif-netlink.c:2248)
by 0xE5AFAC: dpif_operate (dpif.c:1376)
by 0xDF176E: push_dp_ops (ofproto-dpif-upcall.c:2367)
by 0xDF04C8: push_ukey_ops (ofproto-dpif-upcall.c:2447)
by 0xDF008F: revalidator_sweep__ (ofproto-dpif-upcall.c:2805)
by 0xDF5DC6: revalidator_sweep (ofproto-dpif-upcall.c:2816)
by 0xDF1E83: udpif_revalidator (ofproto-dpif-upcall.c:949)
by 0xF3A3FE: ovsthread_wrapper (ovs-thread.c:383)
by 0x565F6DA: start_thread (pthread_create.c:463)
by 0x615988E: clone (clone.S:95)
Uninitialised value was created by a stack allocation
at 0xDEFC24: revalidator_sweep__ (ofproto-dpif-upcall.c:2733)
Fixes: 3cd99886191e ("dpif-netlink: Use dpif logging functions")
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Zone and ct_state first.
Signed-off-by: Paul Blakey <paulb@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Each recirculation id will create a tc chain, and we translate
the recirculation action to a tc goto chain action.
We check for kernel support for this by probing OvS Datapath for the
tc recirc id sharing feature. If supported, we can safely offload rules
that match on recirc_id and rules that use the recirculation action.
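The translation described above can be sketched like this. The sketch
assumes that the tc chain index is simply equal to the recirculation id,
and that without the feature only flows on recirc id 0 with no
recirculation action are offloaded; both are assumptions made for
illustration, and all type and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum action_type_sketch { ACT_OUTPUT, ACT_RECIRC, ACT_GOTO_CHAIN };

struct action_sketch {
    enum action_type_sketch type;
    uint32_t arg;   /* Port number, recirc id, or chain index. */
};

/* Assumption: one tc chain per recirc id, with chain index == recirc id. */
static uint32_t
recirc_id_to_chain(uint32_t recirc_id)
{
    return recirc_id;
}

/* Translates a datapath action into its tc counterpart.  Only the
 * recirculation action changes; it becomes a "goto chain" action. */
static struct action_sketch
translate_action(struct action_sketch in)
{
    assert(in.type != ACT_GOTO_CHAIN);   /* Input is a datapath action. */
    if (in.type == ACT_RECIRC) {
        struct action_sketch out = { ACT_GOTO_CHAIN,
                                     recirc_id_to_chain(in.arg) };
        return out;
    }
    return in;
}

/* Decides whether a flow may be offloaded, given the probe result. */
static bool
can_offload_sketch(bool recirc_sharing_supported, uint32_t match_recirc_id,
                   bool has_recirc_action)
{
    if (recirc_sharing_supported) {
        return true;
    }
    return match_recirc_id == 0 && !has_recirc_action;
}
```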
Signed-off-by: Paul Blakey <paulb@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
This enables user features on the kernel datapath via the DP_CMD_SET
command, and also retrieves them to check for actual support and
not just an older kernel ignoring the requested features.
This will be used in the next patch to enable recirc_id sharing with tc.
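The set-then-read-back check can be modelled with a toy "kernel" that
silently drops feature bits it does not know about. The feature bits,
the struct, and the helper names are all hypothetical; the real exchange
goes through netlink and is not shown here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical feature bits, for illustration only. */
#define FEATURE_OLD_SKETCH (1u << 0)
#define FEATURE_NEW_SKETCH (1u << 1)

/* Toy model of a kernel datapath: it stores only the bits it knows. */
struct kernel_sketch {
    uint32_t known_features;
    uint32_t user_features;
};

static void
kernel_set_features(struct kernel_sketch *k, uint32_t requested)
{
    k->user_features = requested & k->known_features;
}

static uint32_t
kernel_get_features(const struct kernel_sketch *k)
{
    return k->user_features;
}

/* Requests 'feature' and reads the value back.  The feature counts as
 * supported only if the kernel reports it, not merely because the set
 * request did not fail. */
static bool
negotiate_feature(struct kernel_sketch *k, uint32_t feature)
{
    assert(feature != 0);
    kernel_set_features(k, feature);
    return (kernel_get_features(k) & feature) == feature;
}
```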
Signed-off-by: Paul Blakey <paulb@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
The kernel datapath may send upcalls with hash info;
ovs-vswitchd should get it from the upcall and then send
it back.
The reason is that:
| When using the kernel datapath, the upcall don't
| include skb hash info relatived. That will introduce
| some problem, because the hash of skb is important
| in kernel stack. For example, VXLAN module uses
| it to select UDP src port. The tx queue selection
| may also use the hash in stack.
|
| Hash is computed in different ways. Hash is random
| for a TCP socket, and hash may be computed in hardware,
| or software stack. Recalculation hash is not easy.
|
| There will be one upcall, without information of skb
| hash, to ovs-vswitchd, for the first packet of a TCP
| session. The rest packets will be processed in Open vSwitch
| modules, hash kept. If this tcp session is forward to
| VXLAN module, then the UDP src port of first tcp packet
| is different from rest packets.
|
| TCP packets may come from the host or dockers, to Open vSwitch.
| To fix it, we store the hash info to upcall, and restore hash
| when packets sent back.
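The round trip described in the quoted text can be sketched as three
copies of one value: packet to upcall, upcall to execute request, and
execute request back to the packet. Every name below is hypothetical, and
the port mapping is only an illustrative hash-to-port function, not the
one the kernel uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct skb_sketch { uint32_t hash; };       /* Packet in the datapath. */
struct upcall_sketch { uint32_t hash; };    /* Datapath -> ovs-vswitchd. */
struct execute_sketch { uint32_t hash; };   /* ovs-vswitchd -> datapath. */

/* Datapath side: stores the packet hash in the upcall. */
static struct upcall_sketch
datapath_upcall(const struct skb_sketch *skb)
{
    assert(skb != NULL);
    struct upcall_sketch upcall = { skb->hash };
    return upcall;
}

/* Userspace side: passes the hash through unchanged. */
static struct execute_sketch
vswitchd_handle_upcall(const struct upcall_sketch *upcall)
{
    struct execute_sketch execute = { upcall->hash };
    return execute;
}

/* Datapath side: restores the hash on the reinjected packet. */
static struct skb_sketch
datapath_execute(const struct execute_sketch *execute)
{
    struct skb_sketch skb = { execute->hash };
    return skb;
}

/* Illustrative mapping from a packet hash to a tunnel UDP source port. */
static uint16_t
tunnel_src_port(uint32_t hash)
{
    return (uint16_t) (49152u + hash % 16384u);
}
```

With the hash preserved, the first packet of a session and the packets
that never leave the datapath map to the same source port.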
Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2019-October/364062.html
Link: https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=bd1903b7c4596ba6f7677d0dfefd05ba5876707d
Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
Usually a plural name refers to an array, but 'socks' and 'socksp' were
only single objects, so this changes their names to 'sock' and 'sockp'.
Usually a 'p' suffix means that a variable is an output argument, but
that was only true in one place here, so this changes the names of the
other variables to plain 'sock'.
Signed-off-by: Ben Pfaff <blp@ovn.org>
Reviewed-by: Yifeng Sun <pkusunyifeng@gmail.com>
Valgrind reports:
20 bytes in 1 blocks are definitely lost in loss record 94 of 353
by 0x532594: xmalloc (util.c:138)
by 0x553EAD: nl_sock_create (netlink-socket.c:146)
by 0x54331D: create_nl_sock (dpif-netlink.c:255)
by 0x54331D: dpif_netlink_port_add__ (dpif-netlink.c:756)
by 0x5435F6: dpif_netlink_port_add_compat (dpif-netlink.c:876)
by 0x5435F6: dpif_netlink_port_add (dpif-netlink.c:922)
by 0x47EC1D: dpif_port_add (dpif.c:584)
by 0x42B35F: port_add (ofproto-dpif.c:3721)
by 0x41E64A: ofproto_port_add (ofproto.c:2032)
by 0x40B3FE: iface_do_create (bridge.c:1817)
by 0x40B3FE: iface_create (bridge.c:1855)
by 0x40B3FE: bridge_add_ports__ (bridge.c:943)
by 0x40D14A: bridge_add_ports (bridge.c:959)
by 0x40D14A: bridge_reconfigure (bridge.c:673)
by 0x410D75: bridge_run (bridge.c:3050)
by 0x407614: main (ovs-vswitchd.c:127)
This leak occurs because vport_add_channel() is expected to take
ownership of 'socksp' when it returns 0. This patch fixes this issue.
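The ownership contract can be illustrated with a small sketch: on success
the callee keeps the object, and on failure the caller must release it.
The names and the fixed-size table are hypothetical and do not mirror the
actual dpif-netlink code or the exact code path that was fixed.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

struct sock_sketch { int fd; };

#define MAX_CHANNELS_SKETCH 4
static struct sock_sketch *channels[MAX_CHANNELS_SKETCH];

/* On success (0), takes ownership of 'sock' and stores it, releasing any
 * socket previously stored for 'port_no'.  On failure, the caller keeps
 * ownership and is responsible for freeing 'sock'. */
static int
add_channel_sketch(unsigned int port_no, struct sock_sketch *sock)
{
    assert(sock != NULL);
    if (port_no >= MAX_CHANNELS_SKETCH) {
        return EINVAL;
    }
    free(channels[port_no]);
    channels[port_no] = sock;
    return 0;
}

static int
port_add_sketch(unsigned int port_no)
{
    struct sock_sketch *sock = malloc(sizeof *sock);
    int error;

    if (!sock) {
        return ENOMEM;
    }
    sock->fd = -1;
    error = add_channel_sketch(port_no, sock);
    if (error) {
        free(sock);   /* Ownership was not transferred. */
    }
    return error;
}
```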
Signed-off-by: Yifeng Sun <pkusunyifeng@gmail.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
This patch derives the timeout policy based on ct zone from the
internal data structure that we maintain on dpif layer.
It also adds a system traffic test to verify the zone-based conntrack
timeout feature. The test uses ovs-vsctl commands to configure
the customized ICMP and UDP timeout on zone 5 to a shorter period.
It then injects ICMP and UDP traffic to conntrack, and checks if the
corresponding conntrack entry expires after the predefined timeout.
Signed-off-by: Yi-Hung Wei <yihung.wei@gmail.com>
ofproto-dpif: Checks if datapath supports OVS_CT_ATTR_TIMEOUT
This patch checks whether datapath supports OVS_CT_ATTR_TIMEOUT. With this
check, ofproto-dpif-xlate can use this information to decide whether to
translate the ct timeout policy.
Signed-off-by: Yi-Hung Wei <yihung.wei@gmail.com>
Signed-off-by: Justin Pettit <jpettit@ovn.org>
This patch first defines the dpif interface for a datapath to support
adding, deleting, getting and dumping conntrack timeout policy.
The timeout policy is identified by a 4-byte unsigned integer in the
datapath, and it currently supports timeouts for the TCP, UDP, and ICMP
protocols.
Moreover, this patch provides the implementation for Linux kernel
datapath in dpif-netlink.
In the Linux kernel, the timeout policy is maintained per L3/L4 protocol,
and it is identified by a 32-byte null-terminated string. On the other
hand, in vswitchd, the timeout policy is a generic one that consists of
all the supported L4 protocols. Therefore, one of the main tasks in
dpif-netlink is to break down the generic timeout policy into 6
sub policies (ipv4 tcp, udp, icmp, and ipv6 tcp, udp, icmp),
and push down the configuration using the netlink API in
netlink-conntrack.c.
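The breakdown into six sub policies can be sketched as a pair of nested
loops over L3 and L4 protocols, each producing a name that fits in the
32-byte limit. The name format and all identifiers below are assumptions
made for illustration, not necessarily what dpif-netlink generates.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define TP_NAME_LEN_SKETCH 32      /* Includes the null terminator. */
#define N_SUB_POLICIES_SKETCH 6

struct sub_policy_sketch {
    char name[TP_NAME_LEN_SKETCH];
    int l3;            /* 4 for IPv4, 6 for IPv6. */
    const char *l4;    /* "tcp", "udp", or "icmp". */
};

/* Splits the generic policy 'id' into per-L3/L4 sub policies, in the
 * order ipv4 tcp, udp, icmp, then ipv6 tcp, udp, icmp.  Returns the
 * number of sub policies written to 'subs'. */
static int
split_timeout_policy(uint32_t id,
                     struct sub_policy_sketch subs[N_SUB_POLICIES_SKETCH])
{
    static const int l3s[] = { 4, 6 };
    static const char *const l4s[] = { "tcp", "udp", "icmp" };
    int n = 0;

    for (int i = 0; i < 2; i++) {
        for (int j = 0; j < 3; j++) {
            struct sub_policy_sketch *sp = &subs[n++];
            int len = snprintf(sp->name, sizeof sp->name,
                               "ovs_tp_%" PRIu32 "_%s%d",
                               id, l4s[j], l3s[i]);

            /* The longest possible name is well under 32 bytes. */
            assert(len > 0 && len < TP_NAME_LEN_SKETCH);
            sp->l3 = l3s[i];
            sp->l4 = l4s[j];
        }
    }
    return n;
}
```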
This patch also adds missing symbols in the Windows datapath so
that the build on Windows can pass.
Appveyor CI:
* https://ci.appveyor.com/project/YiHungWei/ovs/builds/26387754
Signed-off-by: Yi-Hung Wei <yihung.wei@gmail.com>
Acked-by: Alin Gabriel Serdean <aserdean@ovn.org>
Signed-off-by: Justin Pettit <jpettit@ovn.org>
This may be needed in some special cases, such as to support some hardware
offload implementations. Note that disabling TCP sequence number
verification is not an optimization in itself, but supporting some
hardware offload implementations may offer better performance. TCP
sequence number verification is enabled by default. This option is only
available for the userspace datapath. Access to this option is presently
provided via 'dpctl' commands as the need for this option is quite node
specific, by virtue of which NICs are in use on a given node. A test is
added to verify this option.
Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2019-May/359188.html
Signed-off-by: Darrell Ball <dlu998@gmail.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
'dpif_probe_feature()' always has DPIF_FP_PROBE flag set. Other probing
code uses dpif_execute() with DPIF_OP_EXECUTE, hence never calls
parse_flow_put().
Thus, this 'if' statement is wrong and should be removed as it only
forbids offloading of the real legitimate flows with dl_type 0x1234.
Dummy flows never reach this code.
CC: Paul Blakey <paulb@mellanox.com>
Fixes: 8b668ee3f0cc ("dpif-netlink: Use netdev flow put api to insert a flow")
Reported-by: Eli Britstein <elibr@mellanox.com>
Acked-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Ilya Maximets <i.maximets@samsung.com>