The DPDK mbuf API specifies 4 status when it comes to L4 checksums:
- RTE_MBUF_F_RX_L4_CKSUM_UNKNOWN: no information about the RX L4 checksum
- RTE_MBUF_F_RX_L4_CKSUM_BAD: the L4 checksum in the packet is wrong
- RTE_MBUF_F_RX_L4_CKSUM_GOOD: the L4 checksum in the packet is valid
- RTE_MBUF_F_RX_L4_CKSUM_NONE: the L4 checksum is not correct in the packet
data, but the integrity of the L4 data is verified.
Similarly to the IP checksum offloads API, revise OVS L4 offloads API.
No information about the L4 protocol is provided by any netdev-*
implementation, so OVS needs to mark this L4 protocol during flow
extraction.
Rename current API for consistency with dp_packet_(inner_)?l4_checksum_.
Signed-off-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Rather than set bits in the mbuf ol_flags field, that only makes sense
for netdev-dpdk ports, mark packet for tunnel offload in OVS offloads
API.
While at it, since there is nothing really "hardware" related, rename
current API for consistency with dp_packet_tunnel_ prefix.
Signed-off-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
As a preparation for tracking inner checksums, separate Rx checksum
status from the DPDK ol_flags field.
To minimize the cost of translating from DPDK API to OVS API, simply map
OVS flags to DPDK Rx mbuf flags.
Signed-off-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Flagging packets with IP version is only needed at the netdev-dpdk level.
In most cases, OVS is already inspecting the IP header in packet data,
so maintaining such IP version metadata won't save much cycles
(given the cost of additional branches necessary for handling
outer/inner flags).
Cleanup OVS shared code and only set these flags in netdev-dpdk.c.
Signed-off-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
As the virtio-net offload API is used for netdev-linux ports, but
provides no information about the potentially encapsulated protocol
concerned by a checksum request, specific information from this netdev-
specific implementation is propagated into OVS code, and must be
carefully evaluated when some tunnel gets decapsulated.
This induces a cost in "normal" processing path, while the netdev-linux
path is not performance critical.
This patch removes such specific information, yet try harder to parse
the packet on the Rx side and set offload flags accordingly for non
encapsulated traffic. For encapsulated traffic, the inner
checksum is computed.
Signed-off-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Packets with L4 partial status for a simple match flow would not get L4
checksums offloads applied.
This was not caught in unit tests, because packets from netdev-dummy
(calling miniflow_extract) would get Tx flags set early, before
parse_tcp_flags() got called during packet processing.
Signed-off-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Caught by code review, offloading checksum of IPv6 UDP packets was wrong
as the IPv6 header used for the pseudo header checksum was wrong.
Fixes: cb0cbffbe8fb ("netdev-linux: Favour inner packet for multi-encapsulated TSO.")
Signed-off-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
It's not nice to send random stack memory to a kernel over ioctl:
Uninitialized bytes in ioctl_common_pre
at offset 20 inside [0x7fff8899f3c0, 40)
WARNING: MemorySanitizer: use-of-uninitialized-value
0 0x1180b7 in af_inet_ioctl lib/socket-util-unix.c:417:15
1 0x1180ff in af_inet_ifreq_ioctl lib/socket-util-unix.c:428:13
2 0x11e4fd in netdev_linux_set_mtu lib/netdev-linux.c:2005:13
3 0xba250b in netdev_set_mtu lib/netdev.c:1132:30
4 0x59a19f in update_mtu_ofproto ofproto/ofproto.c:3042:18
5 0x596947 in update_mtu ofproto/ofproto.c:3024:5
6 0x5976d6 in ofport_install ofproto/ofproto.c:2617:5
7 0x572e96 in update_port ofproto/ofproto.c:2893:21
8 0x57636b in ofproto_port_add ofproto/ofproto.c:2208:17
9 0x4f9e0b in iface_do_create vswitchd/bridge.c:2203:13
10 0x4f7f43 in iface_create vswitchd/bridge.c:2246:13
11 0x4f7a63 in bridge_add_ports__ vswitchd/bridge.c:1225:21
12 0x4d8ea4 in bridge_add_ports vswitchd/bridge.c:1241:5
13 0x4cce3a in bridge_reconfigure vswitchd/bridge.c:952:9
14 0x4ca156 in bridge_run vswitchd/bridge.c:3439:9
15 0x5278e7 in main vswitchd/ovs-vswitchd.c:137:9
16 0x7f9def in __libc_start_call_main
17 0x7f9def in __libc_start_main@GLIBC_2.2.5
18 0x432b14 in _start (vswitchd/ovs-vswitchd+0x432b14)
We do already initialize this structure for a few ioctls, let's
do that for all of them.
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Compilers are allowed to make alignment assumptions based on the
pointer type passed to memcpy(). We have to cast before using it
to avoid misaligned reads:
lib/netdev-linux.c:6738:41: runtime error: load of misaligned address
0x51b000008db4 for type 'const struct rpl_rtnl_link_stats64 *',
which requires 8 byte alignment
0x51b000008db4: note: pointer points here
cc 00 17 00 01 00 00 00 00 00 00 00 01 00 00 00
^
0 0xad2d8 in get_stats_via_netlink lib/netdev-linux.c:6738:17
1 0x9dcce in netdev_linux_get_stats lib/netdev-linux.c:2293:13
2 0xee5b8 in netdev_get_stats lib/netdev.c:1722:16
3 0xc07ad in iface_refresh_stats vswitchd/bridge.c:2795:5
4 0xbf4fc in iface_create vswitchd/bridge.c:2274:5
5 0xbf4fc in bridge_add_ports__ vswitchd/bridge.c:1225:21
6 0x989fc in bridge_add_ports vswitchd/bridge.c:1236:5
7 0x989fc in bridge_reconfigure vswitchd/bridge.c:952:9
8 0x9102a in bridge_run vswitchd/bridge.c:3439:9
9 0xc8976 in main vswitchd/ovs-vswitchd.c:137:9
10 0x2a1c9 in __libc_start_call_main
11 0x2a28a in __libc_start_main
12 0xb3ff4 in _start (vswitchd/ovs-vswitchd+0x734ff4)
Reported by Clang 18 with UBsan on Ubuntu 24.04.
Casting the original 'void *' pointer returned from nl_attr_get() call
to 'struct rpl_rtnl_link_stats64 *' also technically constitutes an
undefined behavior, since the resulting pointer is not correctly
aligned for the referenced type, but compilers do not insist on that
today.
Fixes: c8c49a9db9f2 ("netdev-linux: Properly access 32-bit aligned rtnl_link_stats64 structs.")
Reviewed-by: Simon Horman <horms@kernel.org>
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Offloading checksum with a virtio_net_hdr header works only for L4.
Remove false claim that IP checksum offloading is supported.
As a consequence, L3 checksum can be assumed to be correct when
preparing a L4 offload request.
Fixes: 3337e6d91c5b ("userspace: Enable L4 checksum offloading by default.")
Acked-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
This patch fixes an uninitialized gso_type case in
netdev_linux_prepend_vnet_hdr() by returning an error.
Fixes: 3337e6d91c5b ("userspace: Enable L4 checksum offloading by default.")
Acked-by: Ilya Maximets <i.maximets@ovn.org>
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Fix unintentional integer overflow reported by Coverity by adding
the ULL suffix to the numerical literals used in the multiplications.
Fixes: ed2300cca0d3 ("netdev-linux: Refactor put police action netlink message")
Acked-by: Mike Pattrick <mkp@redhat.com>
Acked-by: Paolo Valerio <pvalerio@redhat.com>
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Clang's static analyzer noted that the output from
netdev_linux_get_speed_locked can be checked even if this function
doesn't set any values.
Now we always set those values to a sane default in all cases.
Fixes: 19cffe30cfda ("netdev-linux: Avoid deadlock in netdev_get_speed.")
Signed-off-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
When retrieving a list of features supported by a network card, return
with an error code if the request completed without an error but the
list contains zero entries.
In practice this should never happen, but it does contribute to a
detection in Clang's static analyzer.
Fixes: 6c59c195266c ("netdev-linux: Use ethtool to detect offload support.")
Signed-off-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Extract checking for a given kernel version to a separate function.
It will be used also in the next patch.
Acked-by: Mike Pattrick <mkp@redhat.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Felix Huettner <felix.huettner@mail.schwarz>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Recently OVS adopted a policy of using the inclusive naming word list v1
[1, 2].
This patch partially addresses the use of the term master in the
context of LAG devices by using the term primary instead: the local
variables master_netdev and master_name are renamed as primary_netdev
and primary_name.
Related comments are also updated.
No functional change intended.
[1] df5e5cf ("Documentation: Add section on inclusive language.")
[2] https://inclusivenaming.org/word-lists/
Signed-off-by: Simon Horman <horms@ovn.org>
Acked-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Recently OVS adopted a policy of using the inclusive naming word list v1
[1, 2].
This patch partially addresses the use of the term master in the
context of LAG devices by using the term primary instead: the
is_lag_master field of struct netdev_linux is renamed is_lag_primary.
A related comment is also updated.
No functional change intended.
[1] df5e5cf ("Documentation: Add section on inclusive language.")
[2] https://inclusivenaming.org/word-lists/
Signed-off-by: Simon Horman <horms@ovn.org>
Acked-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Previously a change was added to the vnet prepend code to solve for the
case where no L4 checksum offloading was needed but the L3 checksum
hadn't been calculated. But the added check didn't properly account
for IPv6 traffic.
Fixes: 85bcbbed839a ("userspace: Enable tunnel tests with TSO.")
Reported-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Previously if an OVS configuration nested multiple layers of UDP tunnels
like VXLAN or GENEVE on top of each other through netdev-linux
interfaces, the vnet header would be incorrectly set to the outermost
UDP tunnel layer instead of the intermediary tunnel layer.
This resulted in the middle UDP tunnel not checksum offloading properly.
Fixes: 85bcbbed839a ("userspace: Enable tunnel tests with TSO.")
Reported-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
netdev_linux_get_speed needs to lock netdev_linux->mutex, and so do the
internal tc operations. Therefore, the former cannot be called from the
latter.
Create a lock-free version of netdev_linux_get_speed() and call it from
tc operations.
Also expand the unit test to cover queues where ceil is determined by
the maximum link speed.
Fixes: b8f8fad86435 ("netdev-linux: Use speed as max rate in tc classes.")
Reported-at: https://mail.openvswitch.org/pipermail/ovs-discuss/2024-January/052912.html
Reported-by: Daryl Wang <darylywang@google.com>
Suggested-by: Ilya Maximets <i.maximets@ovn.org>
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
This patch enables most of the tunnel tests in the testsuite, and adds a
large TCP transfer to a vxlan and geneve test to verify TSO
functionality. Some additional changes were required to accommodate these
changes with netdev-linux interfaces. The test for vlan over vxlan is
purposely not enabled as the traffic produced by this test gives
incorrect values in the vnet header.
Acked-by: Simon Horman <horms@ovn.org>
Signed-off-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Currently OVS will calculate the segment size based on the
MTU of the egress port. That usually happens to be correct
when the ports share the same MTU, but that is not always true.
Therefore, if the segment size is provided, then use that and
make sure the over sized packets are dropped.
Signed-off-by: Flavio Leitner <fbl@sysclose.org>
Co-authored-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Mike Pattrick <mkp@redhat.com>
Acked-by: Simon Horman <horms@ovn.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Currently when userspace-tso is enabled, netdev-linux interfaces will
indicate support for all offload flags regardless of interface
configuration. This patch checks for which offload features are enabled
during netdev construction.
Signed-off-by: Mike Pattrick <mkp@redhat.com>
Acked-by: Simon Horman <horms@ovn.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
When using net/af_xdp DPDK driver along OVS native AF_XDP support,
confusing logs are reported, like:
netdev_dpdk|INFO|Device 'net_af_xdpp0,iface=ovs-p0' attached to DPDK
dpif_netdev|INFO|PMD thread on numa_id: 0, core id: 11 created.
dpif_netdev|INFO|There are 1 pmd threads on numa node 0
dpdk|INFO|Device with port_id=0 already stopped
dpdk(pmd-c11/id:22)|INFO|PMD thread uses DPDK lcore 1.
netdev_dpdk|WARN|Rx checksum offload is not supported on port 0
netdev_afxdp|INFO|libbpf: elf: skipping unrecognized data section(6)
.xdp_run_config
netdev_afxdp|INFO|libbpf: elf: skipping unrecognized data section(7)
xdp_metadata
netdev_afxdp|INFO|libbpf: elf: skipping unrecognized data section(7)
xdp_metadata
netdev_afxdp|INFO|libbpf: elf: skipping unrecognized data section(7)
xdp_metadata
This comes from the fact that netdev-afxdp unconditionnally registers a
helper for logging libbpf messages.
Making both net/af_xdp and netdev-afxdp work at the same time seems
difficult, so at least, ensure that netdev-afxdp won't register this
helper unless a netdev is actually allocated.
Signed-off-by: David Marchand <david.marchand@redhat.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Simon Horman <horms@ovn.org>
For better usability, the function pairs get_config() and
set_config() for netdevs should be symmetric: Options which are
accepted by set_config() should be returned by get_config() and the
latter should output valid options for set_config() only. This patch
also moves key-value pairs which are not valid options from get_config()
to the get_status() callback.
The documentation in vswitchd/vswitch.xml for status columns has been
updated accordingly.
Reported-at: https://bugzilla.redhat.com/1949855
Signed-off-by: Jakob Meng <code@jakobmeng.de>
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Use TCA_POLICE_RATE64 if the rate cannot be expressed using 32bits.
This breaks the 32Gbps barrier. The new barrier is ~4Tbps caused by
netdev's API expressing kbps rates using 32-bit integers.
Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=2137643
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
In preparation for supporting 64-bit rates in tc policies, move the
allocation and initialization of struct tc_police object inside
nl_msg_put_act_police(). That way, the function is now called with the
actual rates.
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
It is equivalent to tc_policer_init() so remove the duplicated function.
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Currently, htb rates are capped at ~34Gbps because they are internally
expressed as 32-bit fields.
Move min and max rates to 64-bit fields and use TCA_HTB_RATE64 and
TCA_HTB_CEIL64 to configure HTC classes to break this barrier.
In order to test this, create a dummy tuntap device and set it's
speed to a very high value so we can try adding a QoS queue with big
rates.
Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=2137619
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
tc uses these "rtab" tables to estimate the time (ticks) that it takes
to send a packet of different sizes. In preparation for the introduction
of 64-bit rates, add an argument to tc_put_rtab() to allow an external
64-bit rate.
Also use 64bits for other burst buffer calculation functions.
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Instead of relying on feature bits, use the speed value directly as
maximum rate for htb and hfsc classes.
There is still a limitation with the maximum rate that we can express
with a 32-bit number in bytes/s (~ 34.3Gbps), but using the actual link speed
instead of the feature bits, we can at least use an accurate maximum for
some link speeds (such as 25Gbps) which are not supported by netdev's feature
bits.
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Currently, the netdev's speed is being calculated by taking the link's
feature bits (using netdev_get_features()) and transforming them into
bps.
This mechanism can be both inaccurate and difficult to maintain, mainly
because we currently use the feature bits supported by OpenFlow which
would have to be extended to support all new feature bits of all netdev
implementations while keeping the OpenFlow API intact.
In order to expose the link speed accurately for all current and future
hardware, add a new netdev API call that allows the implementations to
provide the current and maximum link speeds in Mbps.
Internally, the logic to get the maximum supported speed still relies on
feature bits so it might still get out of sync in the future. However,
the maximum configurable speed is not used as much as the current speed
and these feature bits are not exposed through the netdev interface so
it should be easier to add more.
Use this new function instead of netdev_get_features() where the link
speed is needed.
As a consequence of this patch, link speeds of cards is properly
reported (internally in OVSDB) even if not supported by OpenFlow.
A test verifies this behavior using a tap device.
Also, in order to avoid using the old, this patch adds a checkpatch.py
warning if the old API is used.
Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=2137567
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The netdev receiving packets is supposed to provide the flags
indicating if the L4 checksum was verified and it is OK or BAD,
otherwise the stack will check when appropriate by software.
If the packet comes with good checksum, then postpone the
checksum calculation to the egress device if needed.
When encapsulate a packet with that flag, set the checksum
of the inner L4 header since that is not yet supported.
Calculate the L4 checksum when the packet is going to be sent
over a device that doesn't support the feature.
Linux tap devices allows enabling L3 and L4 offload, so this
patch enables the feature. However, Linux socket interface
remains disabled because the API doesn't allow enabling
those two features without enabling TSO too.
Signed-off-by: Flavio Leitner <fbl@sysclose.org>
Co-authored-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The tc module combines the use of the `tc_transact` helper
function for communication with the in-kernel tc infrastructure
with assertions on the reply data by `ofpbuf_at_assert` on the
received data prior to further processing.
With the presence of bugs on the kernel side, we need to treat
the kernel as an unreliable service provider and replace assertions
on the reply from it with checks to avoid a fatal crash of OVS.
For the record, the symptom of the crash is this in the log:
EMER|include/openvswitch/ofpbuf.h:194:
assertion offset + size <= b->size failed in ofpbuf_at_assert()
And an excerpt of the backtrace looks like this:
ofpbuf_at_assert (offset=16, size=20) at include/openvswitch/ofpbuf.h:194
tc_replace_flower at lib/tc.c:3223
netdev_tc_flow_put at lib/netdev-offload-tc.c:2096
netdev_flow_put at lib/netdev-offload.c:257
parse_flow_put at lib/dpif-netlink.c:2297
try_send_to_netdev at lib/dpif-netlink.c:2384
Reported-At: https://launchpad.net/bugs/2018500
Fixes: 5c039ddc64ff ("netdev-linux: Add functions to manipulate tc police action")
Fixes: e7f6ba220e10 ("lib/tc: add ingress ratelimiting support for tc-offload")
Fixes: f98e418fbdb6 ("tc: Add tc flower functions")
Fixes: c1c9c9c4b636 ("Implement QoS framework.")
Signed-off-by: Frode Nordahl <frode.nordahl@canonical.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Add support to count upcall packets per port, both succeed and failed,
which is a better way to see how many packets upcalled on each interface.
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: wangchuanlei <wangchuanlei@inspur.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
A long long time ago, an effort was made to make tc flower
rtnl_lock() free. However, on the OVS part we forgot to add
the TCA_KIND "flower" attribute, which tell the kernel to skip
the lock. This patch corrects this by adding the attribute for
the delete and get operations.
The kernel code calls tcf_proto_is_unlocked() to determine the
rtnl_lock() is needed for the specific tc protocol. It does this
in the tc_new_tfilter(), tc_del_tfilter(), and in tc_get_tfilter().
If the name is not set, tcf_proto_is_unlocked() will always return
false. If set, the specific protocol is queried for unlocked support.
Fixes: f98e418fbdb6 ("tc: Add tc flower functions")
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
100 Mbps was a fair assumption 13 years ago. Modern days 10 Gbps seems
like a good value in case no information is available otherwise.
The change mainly affects QoS which is currently limited to 100 Mbps if
the user didn't specify 'max-rate' and the card doesn't report the
speed or OVS doesn't have a predefined enumeration for the speed
reported by the NIC.
Calculation of the path cost for STP/RSTP is also affected if OVS is
unable to determine the link speed.
Lower link speed adapters are typically good at reporting their speed,
so chances for overshoot should be low. But newer high-speed adapters,
for which there is no speed enumeration or if there are some other
issues, will not suffer that much.
Acked-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
tc_del_qdisc() function only removes qdiscs with handle '1:0'. If for
some reason the interface has a qdisc with non-zero handle attached,
tc_del_qdisc() will not delete it and subsequent tc_install() will fail
to install a new qdisc.
The problem is that Libvirt by default is setting noqueue qdisc for all
tap interfaces it creates. This is done for performance reasons to
ensure lockless xmit.
The issue is causing non-working QoS in OpenStack setups since new
versions of Libvirt started to use OVS to configure it. In the past,
Libvirt configured TC on its own, bypassing OVS.
Removing the handle value from the deletion request, so any qdisc can
be removed. Changing the error checking to also pass ENOENT, since
that is the error reported if only default qdisc is present.
Alternative solution might be to use NLM_F_REPLACE, but that will be
a larger change with a potential need of refactoring.
Potential side effect of the change is that OVS may start removing
qdiscs that it didn't remove before. Though it's not a new issue and
'linux-noop' QoS type should be used for ports that OVS should not
touch. Otherwise, OVS owns qdiscs on all interfaces attached to it.
While at it, adding more logs as errors are not logged in any way
at the moment making the issue hard to debug.
Reported-at: https://bugzilla.redhat.com/2138339
Reported-at: https://mail.openvswitch.org/pipermail/ovs-discuss/2022-October/052088.html
Reported-at: https://github.com/openvswitch/ovs-issues/issues/268
Suggested-by: Slawek Kaplonski <skaplons@redhat.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Add tc action flags when adding police action to offload meter table.
There is a restriction that the flag of skip_sw/skip_hw should be same for
filter rule and the independent created tc actions the rule uses. In this
case, if we configure the tc-policy as skip_hw, filter rule will be created
with skip_hw flag and the police action according to meter table will have
no action flag, then flower rule will fail to add to tc kernel system.
To fix this issue, we will add tc action flag when adding police action to
offload a meter table, so it will allow meter table to work in tc software
datapath.
Fixes: 5c039ddc64ff ("netdev-linux: Add functions to manipulate tc police action")
Signed-off-by: Baowen Zheng <baowen.zheng@corigine.com>
Acked-by: Ilya Maximets <i.maximets@ovn.org>
Signed-off-by: Simon Horman <simon.horman@corigine.com>
For netdev_linux_update_via_netlink(), hint to the kernel that
we do not need it to gather netlink internal stats when we want
to update the netlink flags, as those stats are not rendered
within OVS.
Background:
ovs-vswitchd can spend quite a bit of time blocked by the kernel
during netlink calls, especially systems with many cores. This
time is dominated by the kernel-side internal stats gathering
mechanism in netlink, specifically:
inet6_fill_link_af
inet6_fill_ifla6_attrs
__snmp6_fill_stats64
In Linux 4.4+, there exists a hint for netlink requests to not
trigger the ipv6 stats gathering mechanism, which greatly reduces
the amount of time that ovs-vswitchd is on CPU.
Testing and Results:
Tested booting 320 VM's and measuring OVS utilization with perf
record, then visualized into a flamegraph using a patched version
of ovs 2.14.2. Calls under bridge_run() seem to get hit the worst
by this issue.
Before bridge_run() == 11.3% of samples
After bridge_run() == 3.4% of samples
Note that there are at least two observed netlink calls under
bridge_run that are still kernel stats heavy after this patch:
Call 1:
bridge_run -> netdev_run -> route_table_run -> route_table_reset ->
ovs_router_insert -> ovs_router_insert__ -> get_src_addr ->
netdev_ger_addr_list -> netdev_linux_get_addr_list -> getifaddrs
Since the actual netlink call is coming from getifaddrs() in glibc,
fixing would likely involve either duplicating glibc code in ovs
source or patch glibc.
Call 2:
bridge_run -> iface_refresh_stats -> netdev_get_stats ->
netdev_linux_get_stats -> get_stats_via_netlink
This does use netlink based stats; however, it isn't immediately
clear if just dropping the stats from inet6_fill_link_af would
impact anything or not. Given this call is more intermittent, its
of lesser concern.
Acked-by: Greg Smith <gasmith@nutanix.com>
Signed-off-by: Jon Kohler <jon@nutanix.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Referenced commit changed policer action type from TC_ACT_UNSPEC (continue)
to TC_ACT_PIPE. However, since neither TC hardware offload layer nor mlx5
driver at the time validated action type and always assumed 'continue', the
breakage wasn't caught until later validation code was added. The change
also broke valid configuration when sending from offload-capable device to
non-offload capable. For example, when sending from mlx5 VF to OvS bridge
netdevice the traffic that passed matchall classifier with policer could no
longer match the following flower rule in software:
filter protocol all pref 1 matchall chain 0
filter protocol all pref 1 matchall chain 0 handle 0x1
in_hw (rule hit 7863)
action order 1: police 0x1 rate 32Mbit burst 1000Kb mtu 64Kb action drop/pipe overhead 0b
ref 1 bind 1 installed 17 sec firstused 17 sec
Action statistics:
Sent 152199634 bytes 102550 pkt (dropped 1315, overlimits 1315 requeues 0)
Sent software 74612172 bytes 51275 pkt
Sent hardware 77587462 bytes 51275 pkt
backlog 0b 0p requeues 0
used_hw_stats delayed
filter protocol ip pref 3 flower chain 0
filter protocol ip pref 3 flower chain 0 handle 0x1
dst_mac aa:94:1f:f2:f8:44
src_mac e4:00:01:08:00:02
eth_type ipv4
ip_flags nofrag
not_in_hw
action order 1: skbedit ptype host pipe
index 1 ref 1 bind 1 installed 6 sec used 6 sec
Action statistics:
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
action order 2: mirred (Ingress Redirect to device br-ovs) stolen
index 1 ref 1 bind 1 installed 6 sec used 6 sec
Action statistics:
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
cookie 401a9c8b3d403c62240d3eb5e21c1604
no_percpu
Fix the issue by restoring matchall and basic policers action type to
'continue'.
Fixes: c2567e533f8a ("add port-based ingress policing based packet-per-second rate-limiting")
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Simon Horman <simon.horman@corigine.com>
Bond master netdev may be created without a classification type, due
to routing or tunneling code.
If bond master is not attached to ovs, the ingress block on LAG members
should not be updated.
Simple reproducer:
tc q ls dev net3 ingress
ip a add 10.1.1.1/30 dev bond0
ip l set net3 master bond0
tc q ls dev net3 ingress
Fixes: d22f8927c3c9 ("netdev-linux: monitor and offload LAG slaves to TC")
Signed-off-by: Tao Liu <thomas.liu@ucloud.cn>
Acked-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Add helpers to add, delete and get stats of police action with
the specified index.
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Simon Horman <simon.horman@corigine.com>
To reuse the code for manipulating police action, move the common
initialization code to a function, and change PPS parameters as meter
pktps is in unit of packet per second.
null_police is redundant because either BPS or PPS, not both, can be
configured in one message. So the police passed in to
nl_msg_put_act_police can be reused as its rate is zero for PPS, and
it also provides the index for police action to be created.
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Simon Horman <simon.horman@corigine.com>
Using SHORT version of the *_SAFE loops makes the code cleaner and less
error prone. So, use the SHORT version and remove the extra variable
when possible for hmap and all its derived types.
In order to be able to use both long and short versions without changing
the name of the macro for all the clients, overload the existing name
and select the appropriate version depending on the number of arguments.
Acked-by: Dumitru Ceara <dceara@redhat.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Currently ingress policing uses the basic classifier to apply traffic
control filters if hardware offload is not enabled, in which case it
uses matchall. This change changes the behavior to always use matchall,
and fall back onto basic if the kernel is built without matchall
support.
The system tests are modified to allow either basic or matchall
classification on the ingestion filter, and to allow either 10000 or
10240 packets for the packet burst filter. 10000 is accurate for kernel
5.14 and the most recent iproute2, however, 10240 is left for
compatibility with older kernels.
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>