Currently OVS will calculate the segment size based on the
MTU of the egress port. That usually happens to be correct
when the ports share the same MTU, but that is not always true.
Therefore, if the segment size is provided, then use that and
make sure the over sized packets are dropped.
Signed-off-by: Flavio Leitner <fbl@sysclose.org>
Co-authored-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Mike Pattrick <mkp@redhat.com>
Acked-by: Simon Horman <horms@ovn.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Currently when userspace-tso is enabled, netdev-linux interfaces will
indicate support for all offload flags regardless of interface
configuration. This patch checks for which offload features are enabled
during netdev construction.
Signed-off-by: Mike Pattrick <mkp@redhat.com>
Acked-by: Simon Horman <horms@ovn.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
When using net/af_xdp DPDK driver along OVS native AF_XDP support,
confusing logs are reported, like:
netdev_dpdk|INFO|Device 'net_af_xdpp0,iface=ovs-p0' attached to DPDK
dpif_netdev|INFO|PMD thread on numa_id: 0, core id: 11 created.
dpif_netdev|INFO|There are 1 pmd threads on numa node 0
dpdk|INFO|Device with port_id=0 already stopped
dpdk(pmd-c11/id:22)|INFO|PMD thread uses DPDK lcore 1.
netdev_dpdk|WARN|Rx checksum offload is not supported on port 0
netdev_afxdp|INFO|libbpf: elf: skipping unrecognized data section(6)
.xdp_run_config
netdev_afxdp|INFO|libbpf: elf: skipping unrecognized data section(7)
xdp_metadata
netdev_afxdp|INFO|libbpf: elf: skipping unrecognized data section(7)
xdp_metadata
netdev_afxdp|INFO|libbpf: elf: skipping unrecognized data section(7)
xdp_metadata
This comes from the fact that netdev-afxdp unconditionnally registers a
helper for logging libbpf messages.
Making both net/af_xdp and netdev-afxdp work at the same time seems
difficult, so at least, ensure that netdev-afxdp won't register this
helper unless a netdev is actually allocated.
Signed-off-by: David Marchand <david.marchand@redhat.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Simon Horman <horms@ovn.org>
For better usability, the function pairs get_config() and
set_config() for netdevs should be symmetric: Options which are
accepted by set_config() should be returned by get_config() and the
latter should output valid options for set_config() only. This patch
also moves key-value pairs which are not valid options from get_config()
to the get_status() callback.
The documentation in vswitchd/vswitch.xml for status columns has been
updated accordingly.
Reported-at: https://bugzilla.redhat.com/1949855
Signed-off-by: Jakob Meng <code@jakobmeng.de>
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Use TCA_POLICE_RATE64 if the rate cannot be expressed using 32bits.
This breaks the 32Gbps barrier. The new barrier is ~4Tbps caused by
netdev's API expressing kbps rates using 32-bit integers.
Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=2137643
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
In preparation for supporting 64-bit rates in tc policies, move the
allocation and initialization of struct tc_police object inside
nl_msg_put_act_police(). That way, the function is now called with the
actual rates.
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
It is equivalent to tc_policer_init() so remove the duplicated function.
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Currently, htb rates are capped at ~34Gbps because they are internally
expressed as 32-bit fields.
Move min and max rates to 64-bit fields and use TCA_HTB_RATE64 and
TCA_HTB_CEIL64 to configure HTC classes to break this barrier.
In order to test this, create a dummy tuntap device and set it's
speed to a very high value so we can try adding a QoS queue with big
rates.
Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=2137619
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
tc uses these "rtab" tables to estimate the time (ticks) that it takes
to send a packet of different sizes. In preparation for the introduction
of 64-bit rates, add an argument to tc_put_rtab() to allow an external
64-bit rate.
Also use 64bits for other burst buffer calculation functions.
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Instead of relying on feature bits, use the speed value directly as
maximum rate for htb and hfsc classes.
There is still a limitation with the maximum rate that we can express
with a 32-bit number in bytes/s (~ 34.3Gbps), but using the actual link speed
instead of the feature bits, we can at least use an accurate maximum for
some link speeds (such as 25Gbps) which are not supported by netdev's feature
bits.
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Currently, the netdev's speed is being calculated by taking the link's
feature bits (using netdev_get_features()) and transforming them into
bps.
This mechanism can be both inaccurate and difficult to maintain, mainly
because we currently use the feature bits supported by OpenFlow which
would have to be extended to support all new feature bits of all netdev
implementations while keeping the OpenFlow API intact.
In order to expose the link speed accurately for all current and future
hardware, add a new netdev API call that allows the implementations to
provide the current and maximum link speeds in Mbps.
Internally, the logic to get the maximum supported speed still relies on
feature bits so it might still get out of sync in the future. However,
the maximum configurable speed is not used as much as the current speed
and these feature bits are not exposed through the netdev interface so
it should be easier to add more.
Use this new function instead of netdev_get_features() where the link
speed is needed.
As a consequence of this patch, link speeds of cards is properly
reported (internally in OVSDB) even if not supported by OpenFlow.
A test verifies this behavior using a tap device.
Also, in order to avoid using the old, this patch adds a checkpatch.py
warning if the old API is used.
Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=2137567
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The netdev receiving packets is supposed to provide the flags
indicating if the L4 checksum was verified and it is OK or BAD,
otherwise the stack will check when appropriate by software.
If the packet comes with good checksum, then postpone the
checksum calculation to the egress device if needed.
When encapsulate a packet with that flag, set the checksum
of the inner L4 header since that is not yet supported.
Calculate the L4 checksum when the packet is going to be sent
over a device that doesn't support the feature.
Linux tap devices allows enabling L3 and L4 offload, so this
patch enables the feature. However, Linux socket interface
remains disabled because the API doesn't allow enabling
those two features without enabling TSO too.
Signed-off-by: Flavio Leitner <fbl@sysclose.org>
Co-authored-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The tc module combines the use of the `tc_transact` helper
function for communication with the in-kernel tc infrastructure
with assertions on the reply data by `ofpbuf_at_assert` on the
received data prior to further processing.
With the presence of bugs on the kernel side, we need to treat
the kernel as an unreliable service provider and replace assertions
on the reply from it with checks to avoid a fatal crash of OVS.
For the record, the symptom of the crash is this in the log:
EMER|include/openvswitch/ofpbuf.h:194:
assertion offset + size <= b->size failed in ofpbuf_at_assert()
And an excerpt of the backtrace looks like this:
ofpbuf_at_assert (offset=16, size=20) at include/openvswitch/ofpbuf.h:194
tc_replace_flower at lib/tc.c:3223
netdev_tc_flow_put at lib/netdev-offload-tc.c:2096
netdev_flow_put at lib/netdev-offload.c:257
parse_flow_put at lib/dpif-netlink.c:2297
try_send_to_netdev at lib/dpif-netlink.c:2384
Reported-At: https://launchpad.net/bugs/2018500
Fixes: 5c039ddc64ff ("netdev-linux: Add functions to manipulate tc police action")
Fixes: e7f6ba220e10 ("lib/tc: add ingress ratelimiting support for tc-offload")
Fixes: f98e418fbdb6 ("tc: Add tc flower functions")
Fixes: c1c9c9c4b636 ("Implement QoS framework.")
Signed-off-by: Frode Nordahl <frode.nordahl@canonical.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Add support to count upcall packets per port, both succeed and failed,
which is a better way to see how many packets upcalled on each interface.
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: wangchuanlei <wangchuanlei@inspur.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
A long long time ago, an effort was made to make tc flower
rtnl_lock() free. However, on the OVS part we forgot to add
the TCA_KIND "flower" attribute, which tell the kernel to skip
the lock. This patch corrects this by adding the attribute for
the delete and get operations.
The kernel code calls tcf_proto_is_unlocked() to determine the
rtnl_lock() is needed for the specific tc protocol. It does this
in the tc_new_tfilter(), tc_del_tfilter(), and in tc_get_tfilter().
If the name is not set, tcf_proto_is_unlocked() will always return
false. If set, the specific protocol is queried for unlocked support.
Fixes: f98e418fbdb6 ("tc: Add tc flower functions")
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
100 Mbps was a fair assumption 13 years ago. Modern days 10 Gbps seems
like a good value in case no information is available otherwise.
The change mainly affects QoS which is currently limited to 100 Mbps if
the user didn't specify 'max-rate' and the card doesn't report the
speed or OVS doesn't have a predefined enumeration for the speed
reported by the NIC.
Calculation of the path cost for STP/RSTP is also affected if OVS is
unable to determine the link speed.
Lower link speed adapters are typically good at reporting their speed,
so chances for overshoot should be low. But newer high-speed adapters,
for which there is no speed enumeration or if there are some other
issues, will not suffer that much.
Acked-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
tc_del_qdisc() function only removes qdiscs with handle '1:0'. If for
some reason the interface has a qdisc with non-zero handle attached,
tc_del_qdisc() will not delete it and subsequent tc_install() will fail
to install a new qdisc.
The problem is that Libvirt by default is setting noqueue qdisc for all
tap interfaces it creates. This is done for performance reasons to
ensure lockless xmit.
The issue is causing non-working QoS in OpenStack setups since new
versions of Libvirt started to use OVS to configure it. In the past,
Libvirt configured TC on its own, bypassing OVS.
Removing the handle value from the deletion request, so any qdisc can
be removed. Changing the error checking to also pass ENOENT, since
that is the error reported if only default qdisc is present.
Alternative solution might be to use NLM_F_REPLACE, but that will be
a larger change with a potential need of refactoring.
Potential side effect of the change is that OVS may start removing
qdiscs that it didn't remove before. Though it's not a new issue and
'linux-noop' QoS type should be used for ports that OVS should not
touch. Otherwise, OVS owns qdiscs on all interfaces attached to it.
While at it, adding more logs as errors are not logged in any way
at the moment making the issue hard to debug.
Reported-at: https://bugzilla.redhat.com/2138339
Reported-at: https://mail.openvswitch.org/pipermail/ovs-discuss/2022-October/052088.html
Reported-at: https://github.com/openvswitch/ovs-issues/issues/268
Suggested-by: Slawek Kaplonski <skaplons@redhat.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Add tc action flags when adding police action to offload meter table.
There is a restriction that the flag of skip_sw/skip_hw should be same for
filter rule and the independent created tc actions the rule uses. In this
case, if we configure the tc-policy as skip_hw, filter rule will be created
with skip_hw flag and the police action according to meter table will have
no action flag, then flower rule will fail to add to tc kernel system.
To fix this issue, we will add tc action flag when adding police action to
offload a meter table, so it will allow meter table to work in tc software
datapath.
Fixes: 5c039ddc64ff ("netdev-linux: Add functions to manipulate tc police action")
Signed-off-by: Baowen Zheng <baowen.zheng@corigine.com>
Acked-by: Ilya Maximets <i.maximets@ovn.org>
Signed-off-by: Simon Horman <simon.horman@corigine.com>
For netdev_linux_update_via_netlink(), hint to the kernel that
we do not need it to gather netlink internal stats when we want
to update the netlink flags, as those stats are not rendered
within OVS.
Background:
ovs-vswitchd can spend quite a bit of time blocked by the kernel
during netlink calls, especially systems with many cores. This
time is dominated by the kernel-side internal stats gathering
mechanism in netlink, specifically:
inet6_fill_link_af
inet6_fill_ifla6_attrs
__snmp6_fill_stats64
In Linux 4.4+, there exists a hint for netlink requests to not
trigger the ipv6 stats gathering mechanism, which greatly reduces
the amount of time that ovs-vswitchd is on CPU.
Testing and Results:
Tested booting 320 VM's and measuring OVS utilization with perf
record, then visualized into a flamegraph using a patched version
of ovs 2.14.2. Calls under bridge_run() seem to get hit the worst
by this issue.
Before bridge_run() == 11.3% of samples
After bridge_run() == 3.4% of samples
Note that there are at least two observed netlink calls under
bridge_run that are still kernel stats heavy after this patch:
Call 1:
bridge_run -> netdev_run -> route_table_run -> route_table_reset ->
ovs_router_insert -> ovs_router_insert__ -> get_src_addr ->
netdev_ger_addr_list -> netdev_linux_get_addr_list -> getifaddrs
Since the actual netlink call is coming from getifaddrs() in glibc,
fixing would likely involve either duplicating glibc code in ovs
source or patch glibc.
Call 2:
bridge_run -> iface_refresh_stats -> netdev_get_stats ->
netdev_linux_get_stats -> get_stats_via_netlink
This does use netlink based stats; however, it isn't immediately
clear if just dropping the stats from inet6_fill_link_af would
impact anything or not. Given this call is more intermittent, its
of lesser concern.
Acked-by: Greg Smith <gasmith@nutanix.com>
Signed-off-by: Jon Kohler <jon@nutanix.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Referenced commit changed policer action type from TC_ACT_UNSPEC (continue)
to TC_ACT_PIPE. However, since neither TC hardware offload layer nor mlx5
driver at the time validated action type and always assumed 'continue', the
breakage wasn't caught until later validation code was added. The change
also broke valid configuration when sending from offload-capable device to
non-offload capable. For example, when sending from mlx5 VF to OvS bridge
netdevice the traffic that passed matchall classifier with policer could no
longer match the following flower rule in software:
filter protocol all pref 1 matchall chain 0
filter protocol all pref 1 matchall chain 0 handle 0x1
in_hw (rule hit 7863)
action order 1: police 0x1 rate 32Mbit burst 1000Kb mtu 64Kb action drop/pipe overhead 0b
ref 1 bind 1 installed 17 sec firstused 17 sec
Action statistics:
Sent 152199634 bytes 102550 pkt (dropped 1315, overlimits 1315 requeues 0)
Sent software 74612172 bytes 51275 pkt
Sent hardware 77587462 bytes 51275 pkt
backlog 0b 0p requeues 0
used_hw_stats delayed
filter protocol ip pref 3 flower chain 0
filter protocol ip pref 3 flower chain 0 handle 0x1
dst_mac aa:94:1f:f2:f8:44
src_mac e4:00:01:08:00:02
eth_type ipv4
ip_flags nofrag
not_in_hw
action order 1: skbedit ptype host pipe
index 1 ref 1 bind 1 installed 6 sec used 6 sec
Action statistics:
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
action order 2: mirred (Ingress Redirect to device br-ovs) stolen
index 1 ref 1 bind 1 installed 6 sec used 6 sec
Action statistics:
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
cookie 401a9c8b3d403c62240d3eb5e21c1604
no_percpu
Fix the issue by restoring matchall and basic policers action type to
'continue'.
Fixes: c2567e533f8a ("add port-based ingress policing based packet-per-second rate-limiting")
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Simon Horman <simon.horman@corigine.com>
Bond master netdev may be created without a classification type, due
to routing or tunneling code.
If bond master is not attached to ovs, the ingress block on LAG members
should not be updated.
Simple reproducer:
tc q ls dev net3 ingress
ip a add 10.1.1.1/30 dev bond0
ip l set net3 master bond0
tc q ls dev net3 ingress
Fixes: d22f8927c3c9 ("netdev-linux: monitor and offload LAG slaves to TC")
Signed-off-by: Tao Liu <thomas.liu@ucloud.cn>
Acked-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Add helpers to add, delete and get stats of police action with
the specified index.
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Simon Horman <simon.horman@corigine.com>
To reuse the code for manipulating police action, move the common
initialization code to a function, and change PPS parameters as meter
pktps is in unit of packet per second.
null_police is redundant because either BPS or PPS, not both, can be
configured in one message. So the police passed in to
nl_msg_put_act_police can be reused as its rate is zero for PPS, and
it also provides the index for police action to be created.
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Simon Horman <simon.horman@corigine.com>
Using SHORT version of the *_SAFE loops makes the code cleaner and less
error prone. So, use the SHORT version and remove the extra variable
when possible for hmap and all its derived types.
In order to be able to use both long and short versions without changing
the name of the macro for all the clients, overload the existing name
and select the appropriate version depending on the number of arguments.
Acked-by: Dumitru Ceara <dceara@redhat.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Currently ingress policing uses the basic classifier to apply traffic
control filters if hardware offload is not enabled, in which case it
uses matchall. This change changes the behavior to always use matchall,
and fall back onto basic if the kernel is built without matchall
support.
The system tests are modified to allow either basic or matchall
classification on the ingestion filter, and to allow either 10000 or
10240 packets for the packet burst filter. 10000 is accurate for kernel
5.14 and the most recent iproute2, however, 10240 is left for
compatibility with older kernels.
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
If nl_sock_join_mcgroup() returns an error, the 'sock' is freed and
set to NULL. This issues will lead to null pointer deference in
nl_sock_listen_all_nsid(). To fix it, we call nl_sock_listen_all_nsid()
before joining the mcgroups.
Fixes: cf114a7fce80 ("netlink linux: enable listening to all nsids")
Signed-off-by: Yunjian Wang <wangyunjian@huawei.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Reviewed-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
When TSO is disabled from a userspace forwarding datapath perspective,
but TSO has been wrongly enabled on the kernel side, log a warning
message, and drop the packet. With the current implementation,
OVS will crash.
[i.maximets]:
The call stack looks like this:
0 dp_packet_set_size (b=0x0, b=0x0, v=13028) at lib/dp-packet.h:578
1 netdev_linux_batch_rxq_recv_sock at lib/netdev-linux.c:1310
2 netdev_linux_rxq_recv at lib/netdev-linux.c
3 netdev_rxq_recv at lib/netdev.c
4 dp_netdev_process_rxq_port at lib/dpif-netdev.c
The problem is that the code assumes that (mmsgs[i].msg_len > std_len)
can only be true if userpace-tso is enabled and additional buffers are
provided to the kernel. However, since recvmmsg() is called with
MSG_TRUNC, the resulting msg_len reflects the original packet size
before truncation, and it can be larger than the buffer if TSO / GRO
is enabled on the network interface. If TSO support for user space is
not enabled in OVS, the aux_bufs are not allocated and are left NULL,
resulting in a crash.
Fixes: 73858f9dbe83 ("netdev-linux: Prepend the std packet in the TSO packet")
Fixes: 2109841b7984 ("Use batch process recv for tap and raw socket in netdev datapath")
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
OVS has support for using policing to enforce a rate limit in
kilobits per second. This is configured using OVSDB. f.e.
$ ovs-vsctl set interface tap0 ingress_policing_rate=1000
$ ovs-vsctl set interface tap0 ingress_policing_burst=100
This patch adds a related feature, allowing policing to enforce a rate
limit in kilo-packets per second. This is also configured using OVSDB.
$ ovs-vsctl set interface tap0 ingress_policing_kpkts_rate=1000
$ ovs-vsctl set interface tap0 ingress_policing_kpkts_burst=100
The kilo-bit and kilo-packet rate limits may be used separately or in
combination.
Add separate action for BPS and PPS in netlink message.
Revise code and change action result to pipe to allow
traffic pipe into second action.
This patch implements the feature for:
* OVSDB (northbound API)
* TC policer when used both with and without TC offload (kernel API)
Signed-off-by: Yong Xu <yong.xu@corigine.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
'if_indextoname' may fail leaving the 'master_name' uninitialized:
Conditional jump or move depends on uninitialised value(s)
at 0x4C34329: strlen (vg_replace_strmem.c:459)
by 0x51C638: hash_string (hash.h:342)
by 0x51C638: hash_name (shash.c:28)
by 0x51CC51: shash_find (shash.c:231)
by 0x51CD38: shash_find_data (shash.c:245)
by 0x4A797F: netdev_from_name (netdev.c:2013)
by 0x544148: netdev_linux_update_lag (netdev-linux.c:676)
by 0x544148: netdev_linux_run (netdev-linux.c:769)
by 0x4A5997: netdev_run (netdev.c:186)
by 0x40752B: main (ovs-vswitchd.c:129)
Uninitialised value was created by a stack allocation
at 0x543AFA: netdev_linux_run (netdev-linux.c:722)
Fixes: d22f8927c3c9 ("netdev-linux: monitor and offload LAG slaves to TC")
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Mark D. Gray <mark.d.gray@redhat.com>
Some older wireless drivers - ones relying on the
old and long deprecated wireless extension ioctl
system - can generate quite a bit of IFLA_WIRELESS
events depending on their configuration and
runtime conditions. These are delivered as
RTNLGRP_LINK via RTM_NEWLINK messages.
These tend to be relatively easily identifiable
because they report the change mask being 0. This
isn't guaranteed but in practice it shouldn't be a
problem. None of the wireless events that I ever
observed actually carry any unique information
about netdev states that ovs-vswitchd is
interested in. Hence ignoring these shouldn't
cause any problems.
These events can be responsible for a significant
CPU churn as ovs-vswitchd attempts to do plenty of
work for each and every netlink message regardless
of what that message carries. On low-end devices
such as consumer-grade routers these can lead to a
lot of CPU cycles being wasted, adding up to heat
output and reducing performance.
It could be argued that wireless drivers in
question should be fixed, but that isn't exactly a
trivial thing to do. Patching ovs seems far more
viable while still making sense.
Signed-off-by: Michal Kazior <michal@plume.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Correct calculation of burst parameter used when configuring TC policer
action for ingress port-based policing in the case where TC offload is in
use. This now matches the value calculated for the case where TC offload is
not in use.
The division by 8 is to convert from bits to bytes.
Its unclear why 64 was previously used.
Fixes: e7f6ba220 ("lib/tc: add ingress ratelimiting support for tc-offload")
Signed-off-by: Yong Xu <yong.xu@corigine.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: Louis Peens <louis.peens@netronome.com>
Reviewed-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Remove one extra space. No actual code logic changed.
Fixes: 2109841b79845 ("Use batch process recv for tap and raw socket in netdev datapath")
Signed-off-by: William Tu <u9012063@gmail.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The new term is "member".
Most of these changes should not change user-visible behavior. One
place where they do is in "ovs-ofctl dump-flows", which will now output
"members:..." inside "bundle" actions instead of "slaves:...". I don't
expect this to cause real problems in most systems. The old syntax
is still supported on input for backward compatibility.
Signed-off-by: Ben Pfaff <blp@ovn.org>
Acked-by: Alin Gabriel Serdean <aserdean@cloudbasesolutions.com>
Patch 29cf9c1b3b9c ("userspace: Add TCP Segmentation Offload support") uses
__virtio16 which is defined in kernel 3.19. Ubuntu 14.04 is using 3.13
kernel that lacks the virtio_types definition. This patch fixes that.
Fixes: 29cf9c1b3b9c ("userspace: Add TCP Segmentation Offload support")
Acked-by: Greg Rose <gvrose8192@gmail.com>
Signed-off-by: Yi-Hung Wei <yihung.wei@gmail.com>
Signed-off-by: William Tu <u9012063@gmail.com>
In some cases, when processing a netlink change event, it's possible for
an alternate part of OvS (like the IPv6 endpoint processing) to hold an
active netdev interface. This creates a race-condition, where sometimes
the OvS change processing will take the normal path. This doesn't work
because the netdev device object won't actually be enslaved to the
ovs-system (for instance, a linux bond) and ingress qdisc entries will
be missing.
To address this, we update the LAG information in ALL cases where
LAG information could come in.
Fixes: d22f8927c3c9 ("netdev-linux: monitor and offload LAG slaves to TC")
Cc: Marcelo Leitner <mleitner@redhat.com>
Cc: John Hurley <john.hurley@netronome.com>
Acked-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
When introducing the interrupt mode for netdev-afxdp, the netdev
init function is accidentally removed. Fix it by adding it back.
Fixes: 5bfc519fee499 ("netdev-afxdp: Add interrupt mode netdev class.")
Acked-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Greg Rose <gvrose8192@gmail.com>
Signed-off-by: William Tu <u9012063@gmail.com>
When using kernel veth as OVS interface, doubled drop counter
value is shown when veth drops packets due to traffic overrun.
In netdev_linux_get_stats, it reads both vport stats and kernel
netdev stats, in case vport stats retrieve failure. If both of
them success, error counters are added to include errors from
different layers. But implementation of ovs_vport_get_stats in
kernel data path has included kernel netdev stats by calling
dev_get_stats. When drop or other error counters is not zero,
its value is doubled by netdev_linux_get_stats.
In this change, adding kernel netdev stats into vport stats
is removed, since vport stats includes all information of
kernel netdev stats.
Signed-off-by: Jiang Lidong <jianglidong3@jd.com>
Signed-off-by: William Tu <u9012063@gmail.com>
The patch adds a new netdev class 'afxdp-nonpmd' to enable afxdp
interrupt mode. This is similar to 'type=afxdp', except that the
is_pmd field is set to false. As a result, the packet processing
is handled by main thread, not pmd thread. This avoids burning
the CPU to always 100% when there is no traffic.
Signed-off-by: William Tu <u9012063@gmail.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Coverity reports unreachable code in "?" statement
Fixed by removing code segment and unused variables & defines
Signed-off-by: Usman Ansari <ua1422@gmail.com>
Signed-off-by: William Tu <u9012063@gmail.com>
Use ioctl TUNSETOFFLOAD if kernel supports to enable TSO
offloading in the tap device.
Fixes: 29cf9c1b3b9c ("userspace: Add TCP Segmentation Offload support")
Reported-by: "Yi Yang (杨�D)-云服务集团" <yangyi01@inspur.com>
Tested-by: William Tu <u9012063@gmail.com>
Signed-off-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: William Tu <u9012063@gmail.com>
Ideally SCTP checksum offload needs be advertised by the
NIC when userspace TSO is enabled. However, very few drivers
do that and it's not a widely used protocol. So, this patch
enables SCTP checksum offload if available, otherwise userspace
TSO can still be enabled but SCTP packets will be dropped on
NICs without support.
Fixes: 29cf9c1b3b9c ("userspace: Add TCP Segmentation Offload support")
Signed-off-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Virtio doesn't expose flags to control which protocols checksum
offload needs to be enabled or disabled. This patch checks if the
NIC supports UDP checksum offload and active it when TSO is enabled.
Reported-by: Ilya Maximets <i.maximets@ovn.org>
Fixes: 29cf9c1b3b9c ("userspace: Add TCP Segmentation Offload support")
Signed-off-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Usually TSO packets are close to 50k, 60k bytes long, so to
to copy less bytes when receiving a packet from the kernel
change the approach. Instead of extending the MTU sized
packet received and append with remaining TSO data from
the TSO buffer, allocate a TSO packet with enough headroom
to prepend the std packet data.
Fixes: 29cf9c1b3b9c ("userspace: Add TCP Segmentation Offload support")
Suggested-by: Ben Pfaff <blp@ovn.org>
Signed-off-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Ben Pfaff <blp@ovn.org>
The patch detects the numa node id from the name of the netdev,
by reading the '/sys/class/net/<devname>/device/numa_node'.
If not available, ex: virtual device, or any error happens,
return numa id 0. Currently only the afxdp netdev type uses it,
other linux netdev types are disabled due to no use case.
Signed-off-by: William Tu <u9012063@gmail.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Abbreviated as TSO, TCP Segmentation Offload is a feature which enables
the network stack to delegate the TCP segmentation to the NIC reducing
the per packet CPU overhead.
A guest using vhostuser interface with TSO enabled can send TCP packets
much bigger than the MTU, which saves CPU cycles normally used to break
the packets down to MTU size and to calculate checksums.
It also saves CPU cycles used to parse multiple packets/headers during
the packet processing inside virtual switch.
If the destination of the packet is another guest in the same host, then
the same big packet can be sent through a vhostuser interface skipping
the segmentation completely. However, if the destination is not local,
the NIC hardware is instructed to do the TCP segmentation and checksum
calculation.
It is recommended to check if NIC hardware supports TSO before enabling
the feature, which is off by default. For additional information please
check the tso.rst document.
Signed-off-by: Flavio Leitner <fbl@sysclose.org>
Tested-by: Ciara Loftus <ciara.loftus.intel.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
Current netdev_linux_rxq_recv_tap and netdev_linux_rxq_recv_sock
just receive single packet, that is very inefficient, per my test
case which adds two tap ports or veth ports into OVS bridge
(datapath_type=netdev) and use iperf3 to do performance test
between two ports (they are set into different network name space).
The result is as below:
tap: 295 Mbits/sec
veth: 207 Mbits/sec
After I change netdev_linux_rxq_recv_tap and
netdev_linux_rxq_recv_sock to use batch process, the performance
is boosted by about 7 times, here is the result:
tap: 1.96 Gbits/sec
veth: 1.47 Gbits/sec
Undoubtedly this is a huge improvement although it can't match
OVS kernel datapath yet.
FYI: here is thr result for OVS kernel datapath:
tap: 37.2 Gbits/sec
veth: 36.3 Gbits/sec
Note: performance result is highly related with your test machine,
you shouldn't expect the same results on your test machine.
Signed-off-by: Yi Yang <yangyi01@inspur.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>