mir/ovs - ovs - Mike's Git repositories

mir/ovs

mirror of https://github.com/openvswitch/ovs synced 2025-08-22 09:58:01 +00:00

Author	SHA1	Message	Date
Flavio Leitner	e0056018c4	userspace: Respect tso/gso segment size. Currently OVS will calculate the segment size based on the MTU of the egress port. That usually happens to be correct when the ports share the same MTU, but that is not always true. Therefore, if the segment size is provided, then use that and make sure the oversized packets are dropped. Signed-off-by: Flavio Leitner <fbl@sysclose.org> Co-authored-by: Mike Pattrick <mkp@redhat.com> Signed-off-by: Mike Pattrick <mkp@redhat.com> Acked-by: Simon Horman <horms@ovn.org> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2023-12-02 00:56:36 +01:00
Mike Pattrick	6c59c19526	netdev-linux: Use ethtool to detect offload support. Currently when userspace-tso is enabled, netdev-linux interfaces will indicate support for all offload flags regardless of interface configuration. This patch checks for which offload features are enabled during netdev construction. Signed-off-by: Mike Pattrick <mkp@redhat.com> Acked-by: Simon Horman <horms@ovn.org> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2023-12-02 00:56:36 +01:00
David Marchand	b561bbdc27	netdev-afxdp: Postpone libbpf logging helper registration. When using the net/af_xdp DPDK driver alongside OVS native AF_XDP support, confusing logs are reported, like: netdev_dpdk|INFO|Device 'net_af_xdpp0,iface=ovs-p0' attached to DPDK dpif_netdev|INFO|PMD thread on numa_id: 0, core id: 11 created. dpif_netdev|INFO|There are 1 pmd threads on numa node 0 dpdk|INFO|Device with port_id=0 already stopped dpdk(pmd-c11/id:22)|INFO|PMD thread uses DPDK lcore 1. netdev_dpdk|WARN|Rx checksum offload is not supported on port 0 netdev_afxdp|INFO|libbpf: elf: skipping unrecognized data section(6) .xdp_run_config netdev_afxdp|INFO|libbpf: elf: skipping unrecognized data section(7) xdp_metadata netdev_afxdp|INFO|libbpf: elf: skipping unrecognized data section(7) xdp_metadata netdev_afxdp|INFO|libbpf: elf: skipping unrecognized data section(7) xdp_metadata This comes from the fact that netdev-afxdp unconditionally registers a helper for logging libbpf messages. Making both net/af_xdp and netdev-afxdp work at the same time seems difficult, so, at least, ensure that netdev-afxdp won't register this helper unless a netdev is actually allocated. Signed-off-by: David Marchand <david.marchand@redhat.com> Acked-by: Eelco Chaudron <echaudro@redhat.com> Signed-off-by: Simon Horman <horms@ovn.org>	2023-11-21 09:40:11 +00:00
Jakob Meng	d614f2863f	netdev-afxdp: Sync and clean {get, set}_config() callbacks. For better usability, the function pairs get_config() and set_config() for netdevs should be symmetric: Options which are accepted by set_config() should be returned by get_config() and the latter should output valid options for set_config() only. This patch also moves key-value pairs which are not valid options from get_config() to the get_status() callback. The documentation in vswitchd/vswitch.xml for status columns has been updated accordingly. Reported-at: https://bugzilla.redhat.com/1949855 Signed-off-by: Jakob Meng <code@jakobmeng.de> Signed-off-by: Kevin Traynor <ktraynor@redhat.com>	2023-11-14 11:03:28 +00:00
Adrian Moreno	07ce41da11	netdev-linux: Support 64-bit rates in tc policing. Use TCA_POLICE_RATE64 if the rate cannot be expressed using 32 bits. This breaks the 32Gbps barrier. The new barrier is ~4Tbps caused by netdev's API expressing kbps rates using 32-bit integers. Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=2137643 Acked-by: Eelco Chaudron <echaudro@redhat.com> Signed-off-by: Adrian Moreno <amorenoz@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2023-07-17 20:03:54 +02:00
Adrian Moreno	68ac6e9db7	netdev-linux: Refactor nl_msg_put_act_police. In preparation for supporting 64-bit rates in tc policies, move the allocation and initialization of struct tc_police object inside nl_msg_put_act_police(). That way, the function is now called with the actual rates. Acked-by: Eelco Chaudron <echaudro@redhat.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Adrian Moreno <amorenoz@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2023-07-17 20:03:54 +02:00
Adrian Moreno	13e183da31	netdev-linux: Remove tc_matchall_fill_police. It is equivalent to tc_policer_init() so remove the duplicated function. Reviewed-by: Simon Horman <simon.horman@corigine.com> Acked-by: Eelco Chaudron <echaudro@redhat.com> Signed-off-by: Adrian Moreno <amorenoz@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2023-07-17 20:03:54 +02:00
Adrian Moreno	a86fea06fe	netdev-linux: Use 64-bit rates in htb tc classes. Currently, htb rates are capped at ~34Gbps because they are internally expressed as 32-bit fields. Move min and max rates to 64-bit fields and use TCA_HTB_RATE64 and TCA_HTB_CEIL64 to configure HTB classes to break this barrier. In order to test this, create a dummy tuntap device and set its speed to a very high value so we can try adding a QoS queue with big rates. Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=2137619 Acked-by: Eelco Chaudron <echaudro@redhat.com> Signed-off-by: Adrian Moreno <amorenoz@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2023-07-17 20:03:54 +02:00
Adrian Moreno	7edfac5745	netdev-linux: Use 64bit rtab and burst calculations. tc uses these "rtab" tables to estimate the time (ticks) that it takes to send a packet of different sizes. In preparation for the introduction of 64-bit rates, add an argument to tc_put_rtab() to allow an external 64-bit rate. Also use 64bits for other burst buffer calculation functions. Acked-by: Eelco Chaudron <echaudro@redhat.com> Signed-off-by: Adrian Moreno <amorenoz@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2023-07-17 20:03:54 +02:00
Adrian Moreno	b8f8fad864	netdev-linux: Use speed as max rate in tc classes. Instead of relying on feature bits, use the speed value directly as maximum rate for htb and hfsc classes. There is still a limitation with the maximum rate that we can express with a 32-bit number in bytes/s (~ 34.3Gbps), but using the actual link speed instead of the feature bits, we can at least use an accurate maximum for some link speeds (such as 25Gbps) which are not supported by netdev's feature bits. Acked-by: Eelco Chaudron <echaudro@redhat.com> Signed-off-by: Adrian Moreno <amorenoz@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2023-07-17 20:03:54 +02:00
Adrian Moreno	6240c0b4c8	netdev: Add netdev_get_speed() to netdev API. Currently, the netdev's speed is being calculated by taking the link's feature bits (using netdev_get_features()) and transforming them into bps. This mechanism can be both inaccurate and difficult to maintain, mainly because we currently use the feature bits supported by OpenFlow which would have to be extended to support all new feature bits of all netdev implementations while keeping the OpenFlow API intact. In order to expose the link speed accurately for all current and future hardware, add a new netdev API call that allows the implementations to provide the current and maximum link speeds in Mbps. Internally, the logic to get the maximum supported speed still relies on feature bits so it might still get out of sync in the future. However, the maximum configurable speed is not used as much as the current speed and these feature bits are not exposed through the netdev interface so it should be easier to add more. Use this new function instead of netdev_get_features() where the link speed is needed. As a consequence of this patch, link speeds of cards are properly reported (internally in OVSDB) even if not supported by OpenFlow. A test verifies this behavior using a tap device. Also, in order to discourage use of the old API, this patch adds a checkpatch.py warning if it is used. Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=2137567 Acked-by: Eelco Chaudron <echaudro@redhat.com> Signed-off-by: Adrian Moreno <amorenoz@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2023-07-17 20:03:32 +02:00
Mike Pattrick	3337e6d91c	userspace: Enable L4 checksum offloading by default. The netdev receiving packets is supposed to provide the flags indicating if the L4 checksum was verified and it is OK or BAD, otherwise the stack will check when appropriate by software. If the packet comes with good checksum, then postpone the checksum calculation to the egress device if needed. When encapsulating a packet with that flag, set the checksum of the inner L4 header since that is not yet supported. Calculate the L4 checksum when the packet is going to be sent over a device that doesn't support the feature. Linux tap devices allow enabling L3 and L4 offload, so this patch enables the feature. However, the Linux socket interface remains disabled because the API doesn't allow enabling those two features without enabling TSO too. Signed-off-by: Flavio Leitner <fbl@sysclose.org> Co-authored-by: Flavio Leitner <fbl@sysclose.org> Signed-off-by: Mike Pattrick <mkp@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2023-06-15 23:50:30 +02:00
Frode Nordahl	106ef21860	tc: Fix crash on malformed reply from kernel. The tc module combines the use of the `tc_transact` helper function for communication with the in-kernel tc infrastructure with assertions on the reply data by `ofpbuf_at_assert` on the received data prior to further processing. With the presence of bugs on the kernel side, we need to treat the kernel as an unreliable service provider and replace assertions on the reply from it with checks to avoid a fatal crash of OVS. For the record, the symptom of the crash is this in the log: EMER|include/openvswitch/ofpbuf.h:194: assertion offset + size <= b->size failed in ofpbuf_at_assert() And an excerpt of the backtrace looks like this: ofpbuf_at_assert (offset=16, size=20) at include/openvswitch/ofpbuf.h:194 tc_replace_flower at lib/tc.c:3223 netdev_tc_flow_put at lib/netdev-offload-tc.c:2096 netdev_flow_put at lib/netdev-offload.c:257 parse_flow_put at lib/dpif-netlink.c:2297 try_send_to_netdev at lib/dpif-netlink.c:2384 Reported-At: https://launchpad.net/bugs/2018500 Fixes: 5c039ddc64ff ("netdev-linux: Add functions to manipulate tc police action") Fixes: e7f6ba220e10 ("lib/tc: add ingress ratelimiting support for tc-offload") Fixes: f98e418fbdb6 ("tc: Add tc flower functions") Fixes: c1c9c9c4b636 ("Implement QoS framework.") Signed-off-by: Frode Nordahl <frode.nordahl@canonical.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2023-06-07 22:46:45 +02:00
Miika Petäjäniemi	a6195e2c42	netdev-linux: Add jitter parameter to the netem qos options. Adds jitter option to enable emulating latency fluctuation with netem. Submitted-at: https://github.com/openvswitch/ovs/pull/407 Signed-off-by: Miika Petäjäniemi <miika.petajaniemi@solita.fi> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2023-02-21 14:25:57 +01:00
wangchuanlei	e22e1f6725	dpctl: Add support to count upcall packets. Add support to count upcall packets per port, both succeeded and failed, which is a better way to see how many packets were upcalled on each interface. Acked-by: Eelco Chaudron <echaudro@redhat.com> Signed-off-by: wangchuanlei <wangchuanlei@inspur.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2023-01-31 17:40:50 +01:00
Eelco Chaudron	e1e5eac5b0	tc: Add TCA_KIND flower to delete and get operation to avoid rtnl_lock(). A long long time ago, an effort was made to make tc flower rtnl_lock() free. However, on the OVS part we forgot to add the TCA_KIND "flower" attribute, which tells the kernel to skip the lock. This patch corrects this by adding the attribute for the delete and get operations. The kernel code calls tcf_proto_is_unlocked() to determine whether rtnl_lock() is needed for the specific tc protocol. It does this in the tc_new_tfilter(), tc_del_tfilter(), and in tc_get_tfilter(). If the name is not set, tcf_proto_is_unlocked() will always return false. If set, the specific protocol is queried for unlocked support. Fixes: f98e418fbdb6 ("tc: Add tc flower functions") Signed-off-by: Eelco Chaudron <echaudro@redhat.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2023-01-30 21:12:31 +01:00
Ilya Maximets	b22c4d8403	netdev: Assume default link speed to be 10 Gbps instead of 100 Mbps. 100 Mbps was a fair assumption 13 years ago. Nowadays, 10 Gbps seems like a good value in case no information is available otherwise. The change mainly affects QoS which is currently limited to 100 Mbps if the user didn't specify 'max-rate' and the card doesn't report the speed or OVS doesn't have a predefined enumeration for the speed reported by the NIC. Calculation of the path cost for STP/RSTP is also affected if OVS is unable to determine the link speed. Lower link speed adapters are typically good at reporting their speed, so chances for overshoot should be low. But newer high-speed adapters, for which there is no speed enumeration or if there are some other issues, will not suffer that much. Acked-by: Mike Pattrick <mkp@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2022-11-30 14:42:59 +01:00
Ilya Maximets	02be2c318c	netdev-linux: Fix inability to apply QoS on ports with custom qdiscs. tc_del_qdisc() function only removes qdiscs with handle '1:0'. If for some reason the interface has a qdisc with non-zero handle attached, tc_del_qdisc() will not delete it and subsequent tc_install() will fail to install a new qdisc. The problem is that Libvirt by default is setting noqueue qdisc for all tap interfaces it creates. This is done for performance reasons to ensure lockless xmit. The issue is causing non-working QoS in OpenStack setups since new versions of Libvirt started to use OVS to configure it. In the past, Libvirt configured TC on its own, bypassing OVS. Removing the handle value from the deletion request, so any qdisc can be removed. Changing the error checking to also pass ENOENT, since that is the error reported if only default qdisc is present. Alternative solution might be to use NLM_F_REPLACE, but that will be a larger change with a potential need of refactoring. Potential side effect of the change is that OVS may start removing qdiscs that it didn't remove before. Though it's not a new issue and 'linux-noop' QoS type should be used for ports that OVS should not touch. Otherwise, OVS owns qdiscs on all interfaces attached to it. While at it, adding more logs as errors are not logged in any way at the moment making the issue hard to debug. Reported-at: https://bugzilla.redhat.com/2138339 Reported-at: https://mail.openvswitch.org/pipermail/ovs-discuss/2022-October/052088.html Reported-at: https://github.com/openvswitch/ovs-issues/issues/268 Suggested-by: Slawek Kaplonski <skaplons@redhat.com> Acked-by: Eelco Chaudron <echaudro@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2022-11-02 19:50:02 +01:00
Baowen Zheng	ffcb6f115f	netdev-linux: Allow meter to work in tc software datapath when tc-policy is specified Add tc action flags when adding police action to offload meter table. There is a restriction that the skip_sw/skip_hw flags should be the same for the filter rule and the independently created tc actions the rule uses. In this case, if we configure the tc-policy as skip_hw, filter rule will be created with skip_hw flag and the police action according to meter table will have no action flag, then flower rule will fail to add to tc kernel system. To fix this issue, we will add tc action flag when adding police action to offload a meter table, so it will allow meter table to work in tc software datapath. Fixes: 5c039ddc64ff ("netdev-linux: Add functions to manipulate tc police action") Signed-off-by: Baowen Zheng <baowen.zheng@corigine.com> Acked-by: Ilya Maximets <i.maximets@ovn.org> Signed-off-by: Simon Horman <simon.horman@corigine.com>	2022-11-01 10:18:16 +01:00
Jon Kohler	c0e053f6d1	netdev-linux: Skip some internal kernel stats gathering. For netdev_linux_update_via_netlink(), hint to the kernel that we do not need it to gather netlink internal stats when we want to update the netlink flags, as those stats are not rendered within OVS. Background: ovs-vswitchd can spend quite a bit of time blocked by the kernel during netlink calls, especially on systems with many cores. This time is dominated by the kernel-side internal stats gathering mechanism in netlink, specifically: inet6_fill_link_af inet6_fill_ifla6_attrs __snmp6_fill_stats64 In Linux 4.4+, there exists a hint for netlink requests to not trigger the ipv6 stats gathering mechanism, which greatly reduces the amount of time that ovs-vswitchd is on CPU. Testing and Results: Tested booting 320 VMs and measuring OVS utilization with perf record, then visualized into a flamegraph using a patched version of ovs 2.14.2. Calls under bridge_run() seem to get hit the worst by this issue. Before: bridge_run() == 11.3% of samples. After: bridge_run() == 3.4% of samples. Note that there are at least two observed netlink calls under bridge_run that are still kernel stats heavy after this patch: Call 1: bridge_run -> netdev_run -> route_table_run -> route_table_reset -> ovs_router_insert -> ovs_router_insert__ -> get_src_addr -> netdev_get_addr_list -> netdev_linux_get_addr_list -> getifaddrs Since the actual netlink call is coming from getifaddrs() in glibc, fixing would likely involve either duplicating glibc code in ovs source or patching glibc. Call 2: bridge_run -> iface_refresh_stats -> netdev_get_stats -> netdev_linux_get_stats -> get_stats_via_netlink This does use netlink based stats; however, it isn't immediately clear if just dropping the stats from inet6_fill_link_af would impact anything or not. Given this call is more intermittent, it's of lesser concern.
Acked-by: Greg Smith <gasmith@nutanix.com> Signed-off-by: Jon Kohler <jon@nutanix.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2022-09-09 20:53:54 +02:00
Vlad Buslov	d9268782af	netdev-linux: set correct action for packets that passed policer Referenced commit changed policer action type from TC_ACT_UNSPEC (continue) to TC_ACT_PIPE. However, since neither TC hardware offload layer nor mlx5 driver at the time validated action type and always assumed 'continue', the breakage wasn't caught until later validation code was added. The change also broke valid configuration when sending from offload-capable device to non-offload capable. For example, when sending from mlx5 VF to OvS bridge netdevice the traffic that passed matchall classifier with policer could no longer match the following flower rule in software: filter protocol all pref 1 matchall chain 0 filter protocol all pref 1 matchall chain 0 handle 0x1 in_hw (rule hit 7863) action order 1: police 0x1 rate 32Mbit burst 1000Kb mtu 64Kb action drop/pipe overhead 0b ref 1 bind 1 installed 17 sec firstused 17 sec Action statistics: Sent 152199634 bytes 102550 pkt (dropped 1315, overlimits 1315 requeues 0) Sent software 74612172 bytes 51275 pkt Sent hardware 77587462 bytes 51275 pkt backlog 0b 0p requeues 0 used_hw_stats delayed filter protocol ip pref 3 flower chain 0 filter protocol ip pref 3 flower chain 0 handle 0x1 dst_mac aa:94:1f:f2:f8:44 src_mac e4:00:01:08:00:02 eth_type ipv4 ip_flags nofrag not_in_hw action order 1: skbedit ptype host pipe index 1 ref 1 bind 1 installed 6 sec used 6 sec Action statistics: Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 action order 2: mirred (Ingress Redirect to device br-ovs) stolen index 1 ref 1 bind 1 installed 6 sec used 6 sec Action statistics: Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 cookie 401a9c8b3d403c62240d3eb5e21c1604 no_percpu Fix the issue by restoring matchall and basic policers action type to 'continue'. 
Fixes: c2567e533f8a ("add port-based ingress policing based packet-per-second rate-limiting") Signed-off-by: Vlad Buslov <vladbu@nvidia.com> Signed-off-by: Simon Horman <simon.horman@corigine.com>	2022-08-04 10:04:28 +01:00
Tao Liu	8166c066a7	netdev-linux: Do not touch LAG members if master is not attached to OVS. Bond master netdev may be created without a classification type, due to routing or tunneling code. If bond master is not attached to ovs, the ingress block on LAG members should not be updated. Simple reproducer: tc q ls dev net3 ingress ip a add 10.1.1.1/30 dev bond0 ip l set net3 master bond0 tc q ls dev net3 ingress Fixes: d22f8927c3c9 ("netdev-linux: monitor and offload LAG slaves to TC") Signed-off-by: Tao Liu <thomas.liu@ucloud.cn> Acked-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2022-07-26 12:48:14 +02:00
Ilya Maximets	0d153bffbf	netdev-linux: Fix leak of a tc police get/del reply. Direct leak of 64 byte(s) in 1 object(s) allocated from: 0 0x51b1d8 in malloc (vswitchd/ovs-vswitchd+0x51b1d8) 1 0xc81ded in xmalloc__ lib/util.c:137:15 2 0xc81ded in xmalloc lib/util.c:172:12 3 0xb32153 in ofpbuf_new lib/ofpbuf.c:168:24 4 0xd563e4 in nl_sock_transact lib/netlink-socket.c:1113:34 5 0xd56261 in nl_transact lib/netlink-socket.c:1812:13 6 0xd5e096 in tc_transact lib/tc.c:238:17 7 0xd01622 in tc_del_policer_action lib/netdev-linux.c:5807:13 8 0xd2e681 in meter_tc_del_policer lib/netdev-offload-tc.c:2827:15 9 0x94ec21 in meter_offload_del lib/netdev-offload.c:245:23 10 0xcc86c4 in dpif_netlink_meter_del lib/dpif-netlink.c:4288:9 11 0x86c595 in dpif_meter_del lib/dpif.c:2014:13 12 0x663439 in free_meter_id ofproto/ofproto-dpif.c:6789:5 13 0xb47518 in ovsrcu_call_postponed lib/ovs-rcu.c:346:13 14 0xb48031 in ovsrcu_postpone_thread lib/ovs-rcu.c:362:14 15 0xb5015a in ovsthread_wrapper lib/ovs-thread.c:422:12 16 0x7f86af4081ce in start_thread (/lib64/libpthread.so.0+0x81ce) Fixes: 5c039ddc64ff ("netdev-linux: Add functions to manipulate tc police action") Signed-off-by: Ilya Maximets <i.maximets@ovn.org> Acked-by: Eelco Chaudron <echaudro@redhat.com> Signed-off-by: Simon Horman <simon.horman@corigine.com>	2022-07-14 14:58:10 +02:00
Jianbo Liu	5c039ddc64	netdev-linux: Add functions to manipulate tc police action Add helpers to add, delete and get stats of police action with the specified index. Signed-off-by: Jianbo Liu <jianbol@nvidia.com> Acked-by: Eelco Chaudron <echaudro@redhat.com> Signed-off-by: Simon Horman <simon.horman@corigine.com>	2022-07-11 11:18:16 +02:00
Jianbo Liu	ed2300cca0	netdev-linux: Refactor put police action netlink message To reuse the code for manipulating police action, move the common initialization code to a function, and change PPS parameters as meter pktps is in unit of packet per second. null_police is redundant because either BPS or PPS, not both, can be configured in one message. So the police passed in to nl_msg_put_act_police can be reused as its rate is zero for PPS, and it also provides the index for police action to be created. Signed-off-by: Jianbo Liu <jianbol@nvidia.com> Acked-by: Eelco Chaudron <echaudro@redhat.com> Signed-off-by: Simon Horman <simon.horman@corigine.com>	2022-07-11 11:17:44 +02:00
Dumitru Ceara	c8c49a9db9	netdev-linux: Properly access 32-bit aligned rtnl_link_stats64 structs. Detected by UB Sanitizer when running system tests: lib/netdev-linux.c:6250:26: runtime error: member access within misaligned address 0x00000229a204 for type 'const struct rpl_rtnl_link_stats64', which requires 8 byte alignment 0x00000229a204: note: pointer points here c4 00 17 00 01 00 00 00 00 00 00 00 01 00 00 00 ^ 00 00 00 00 6e 00 00 00 00 00 00 00 6e 00 00 00 0 0x89f10e in netdev_stats_from_rtnl_link_stats64 lib/netdev-linux.c:6250 1 0x89f10e in get_stats_via_netlink lib/netdev-linux.c:6298 2 0x8a039a in netdev_linux_get_stats lib/netdev-linux.c:2227 3 0x68e149 in netdev_get_stats lib/netdev.c:1599 4 0x407b21 in iface_refresh_stats vswitchd/bridge.c:2687 5 0x419eb6 in iface_create vswitchd/bridge.c:2134 6 0x419eb6 in bridge_add_ports__ vswitchd/bridge.c:1170 7 0x41f71c in bridge_add_ports vswitchd/bridge.c:1181 8 0x41f71c in bridge_reconfigure vswitchd/bridge.c:898 9 0x429f59 in bridge_run vswitchd/bridge.c:3331 10 0x430af3 in main vswitchd/ovs-vswitchd.c:129 11 0x7fbdfd43eb74 in __libc_start_main (/lib64/libc.so.6+0x27b74) 12 0x4072fd in _start (/root/ovs/vswitchd/ovs-vswitchd+0x4072fd) Signed-off-by: Dumitru Ceara <dceara@redhat.com> Acked-by: Aaron Conole <aconole@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2022-05-17 23:10:20 +02:00
Adrian Moreno	9e56549c2b	hmap: use short version of safe loops if possible. Using SHORT version of the *_SAFE loops makes the code cleaner and less error prone. So, use the SHORT version and remove the extra variable when possible for hmap and all its derived types. In order to be able to use both long and short versions without changing the name of the macro for all the clients, overload the existing name and select the appropriate version depending on the number of arguments. Acked-by: Dumitru Ceara <dceara@redhat.com> Acked-by: Eelco Chaudron <echaudro@redhat.com> Signed-off-by: Adrian Moreno <amorenoz@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2022-03-30 16:59:02 +02:00
Mike Pattrick	eb1ab5357b	netdev-linux: Use matchall classifier for ingress policing. Currently ingress policing uses the basic classifier to apply traffic control filters if hardware offload is not enabled, in which case it uses matchall. This change changes the behavior to always use matchall, and fall back onto basic if the kernel is built without matchall support. The system tests are modified to allow either basic or matchall classification on the ingestion filter, and to allow either 10000 or 10240 packets for the packet burst filter. 10000 is accurate for kernel 5.14 and the most recent iproute2, however, 10240 is left for compatibility with older kernels. Acked-by: Eelco Chaudron <echaudro@redhat.com> Signed-off-by: Mike Pattrick <mkp@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2022-01-12 15:23:29 +01:00
Yunjian Wang	849a40ccfb	netdev-linux: Fix a null pointer dereference in netdev_linux_notify_sock(). If nl_sock_join_mcgroup() returns an error, the 'sock' is freed and set to NULL. This issue will lead to a null pointer dereference in nl_sock_listen_all_nsid(). To fix it, we call nl_sock_listen_all_nsid() before joining the mcgroups. Fixes: cf114a7fce80 ("netlink linux: enable listening to all nsids") Signed-off-by: Yunjian Wang <wangyunjian@huawei.com> Acked-by: Flavio Leitner <fbl@sysclose.org> Reviewed-by: David Marchand <david.marchand@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2021-09-16 01:19:39 +02:00
Yong Xu	d2e97030ed	netdev-linux: fix compile error in nl_msg_put_act_police Use 'memset' to init memory to 0. This resolves a build problem with clang on Ubuntu 16.04 on ARM (in Travis): libtool: compile: clang -DHAVE_CONFIG_H -I. -I ./include -I ./include -I ./lib -I ./lib -Wstrict-prototypes -Wall -Wextra -Wno-sign-compare -Wpointer-arith -Wformat -Wformat-security -Wswitch-enum -Wunused-parameter -Wbad-function-cast -Wcast-align -Wstrict-prototypes -Wold-style-definition -Wmissing-prototypes -Wmissing-field-initializers -Wthread-safety -fno-strict-aliasing -Wswitch-bool -Wlogical-not-parentheses -Wsizeof-array-argument -Wshift-negative-value -Qunused-arguments -Wshadow -Warray-bounds-pointer-arithmetic -Werror -Werror -g -O2 -Wno-error=unused-command-line-argument -DHAVE_AVX512F -MT lib/netlink-conntrack.lo -MD -MP -MF lib/.deps/netlink-conntrack.Tpo -c lib/netlink-conntrack.c -o lib/netlink-conntrack.o lib/netdev-linux.c:2638:38: error: missing field 'action' initializer [-Werror,-Wmissing-field-initializers] struct tc_police null_police = {0}; ^ 1 error generated. make[2]: *** [lib/netdev-linux.lo] Error 1 make[2]: *** Waiting for unfinished jobs.... Fixes: c2567e533 ("add port-based ingress policing based packet-per-second rate-limiting") Reported-by: David Marchand <david.marchand@redhat.com> Signed-off-by: Yong Xu <yong.xu@corigine.com> Signed-off-by: Simon Horman <simon.horman@netronome.com>	2021-07-13 13:32:17 +02:00
Eelco Chaudron	b0d289bb5a	netdev-linux: Ignore TSO packets when TSO is not enabled for userspace. When TSO is disabled from a userspace forwarding datapath perspective, but TSO has been wrongly enabled on the kernel side, log a warning message, and drop the packet. With the current implementation, OVS will crash. [i.maximets]: The call stack looks like this: 0 dp_packet_set_size (b=0x0, b=0x0, v=13028) at lib/dp-packet.h:578 1 netdev_linux_batch_rxq_recv_sock at lib/netdev-linux.c:1310 2 netdev_linux_rxq_recv at lib/netdev-linux.c 3 netdev_rxq_recv at lib/netdev.c 4 dp_netdev_process_rxq_port at lib/dpif-netdev.c The problem is that the code assumes that (mmsgs[i].msg_len > std_len) can only be true if userspace-tso is enabled and additional buffers are provided to the kernel. However, since recvmmsg() is called with MSG_TRUNC, the resulting msg_len reflects the original packet size before truncation, and it can be larger than the buffer if TSO / GRO is enabled on the network interface. If TSO support for user space is not enabled in OVS, the aux_bufs are not allocated and are left NULL, resulting in a crash. Fixes: 73858f9dbe83 ("netdev-linux: Prepend the std packet in the TSO packet") Fixes: 2109841b7984 ("Use batch process recv for tap and raw socket in netdev datapath") Signed-off-by: Eelco Chaudron <echaudro@redhat.com> Acked-by: Flavio Leitner <fbl@sysclose.org> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2021-07-09 21:33:41 +02:00
Yong Xu	c2567e533f	add port-based ingress policing based packet-per-second rate-limiting OVS has support for using policing to enforce a rate limit in kilobits per second. This is configured using OVSDB. f.e. $ ovs-vsctl set interface tap0 ingress_policing_rate=1000 $ ovs-vsctl set interface tap0 ingress_policing_burst=100 This patch adds a related feature, allowing policing to enforce a rate limit in kilo-packets per second. This is also configured using OVSDB. $ ovs-vsctl set interface tap0 ingress_policing_kpkts_rate=1000 $ ovs-vsctl set interface tap0 ingress_policing_kpkts_burst=100 The kilo-bit and kilo-packet rate limits may be used separately or in combination. Add separate action for BPS and PPS in netlink message. Revise code and change action result to pipe to allow traffic pipe into second action. This patch implements the feature for: * OVSDB (northbound API) * TC policer when used both with and without TC offload (kernel API) Signed-off-by: Yong Xu <yong.xu@corigine.com> Signed-off-by: Simon Horman <simon.horman@netronome.com>	2021-07-01 20:44:07 +02:00
Ilya Maximets	577b9a8169	netdev-linux: Fix use of uninitialized LAG master name. 'if_indextoname' may fail leaving the 'master_name' uninitialized: Conditional jump or move depends on uninitialised value(s) at 0x4C34329: strlen (vg_replace_strmem.c:459) by 0x51C638: hash_string (hash.h:342) by 0x51C638: hash_name (shash.c:28) by 0x51CC51: shash_find (shash.c:231) by 0x51CD38: shash_find_data (shash.c:245) by 0x4A797F: netdev_from_name (netdev.c:2013) by 0x544148: netdev_linux_update_lag (netdev-linux.c:676) by 0x544148: netdev_linux_run (netdev-linux.c:769) by 0x4A5997: netdev_run (netdev.c:186) by 0x40752B: main (ovs-vswitchd.c:129) Uninitialised value was created by a stack allocation at 0x543AFA: netdev_linux_run (netdev-linux.c:722) Fixes: d22f8927c3c9 ("netdev-linux: monitor and offload LAG slaves to TC") Signed-off-by: Ilya Maximets <i.maximets@ovn.org> Acked-by: Mark D. Gray <mark.d.gray@redhat.com>	2021-05-24 20:21:56 +02:00
Michal Kazior	d90b4f2928	rtnetlink: ignore IFLA_WIRELESS events. Some older wireless drivers - ones relying on the old and long deprecated wireless extension ioctl system - can generate quite a bit of IFLA_WIRELESS events depending on their configuration and runtime conditions. These are delivered as RTNLGRP_LINK via RTM_NEWLINK messages. These tend to be relatively easily identifiable because they report the change mask being 0. This isn't guaranteed but in practice it shouldn't be a problem. None of the wireless events that I ever observed actually carry any unique information about netdev states that ovs-vswitchd is interested in. Hence ignoring these shouldn't cause any problems. These events can be responsible for a significant CPU churn as ovs-vswitchd attempts to do plenty of work for each and every netlink message regardless of what that message carries. On low-end devices such as consumer-grade routers these can lead to a lot of CPU cycles being wasted, adding up to heat output and reducing performance. It could be argued that wireless drivers in question should be fixed, but that isn't exactly a trivial thing to do. Patching ovs seems far more viable while still making sense. Signed-off-by: Michal Kazior <michal@plume.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2021-04-20 00:00:22 +02:00
Yong.Xu	67e0e0bc15	netdev-linux: correct unit of burst parameter. Correct calculation of the burst parameter used when configuring the TC policer action for ingress port-based policing in the case where TC offload is in use. This now matches the value calculated for the case where TC offload is not in use. The division by 8 converts from bits to bytes. It's unclear why 64 was previously used. Fixes: e7f6ba220 ("lib/tc: add ingress ratelimiting support for tc-offload") Signed-off-by: Yong Xu <yong.xu@corigine.com> Signed-off-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: Louis Peens <louis.peens@netronome.com> Reviewed-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>	2021-04-13 15:36:35 +02:00
William Tu	e7df370cff	netdev-linux: Fix indentation. Remove one extra space. No actual code logic changed. Fixes: 2109841b79845 ("Use batch process recv for tap and raw socket in netdev datapath") Signed-off-by: William Tu <u9012063@gmail.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2021-02-19 18:40:25 +01:00
Ben Pfaff	91fc374a9c	Eliminate use of term "slave" in bond, LACP, and bundle contexts. The new term is "member". Most of these changes should not change user-visible behavior. One place where they do is in "ovs-ofctl dump-flows", which will now output "members:..." inside "bundle" actions instead of "slaves:...". I don't expect this to cause real problems in most systems. The old syntax is still supported on input for backward compatibility. Signed-off-by: Ben Pfaff <blp@ovn.org> Acked-by: Alin Gabriel Serdean <aserdean@cloudbasesolutions.com>	2020-10-21 11:28:24 -07:00
Yi-Hung Wei	fd4d477760	netdev-linux: Fix broken build on Ubuntu 14.04. Patch 29cf9c1b3b9c ("userspace: Add TCP Segmentation Offload support") uses __virtio16, which is defined in kernel 3.19. Ubuntu 14.04 uses a 3.13 kernel that lacks the virtio_types definition. This patch fixes that. Fixes: 29cf9c1b3b9c ("userspace: Add TCP Segmentation Offload support") Acked-by: Greg Rose <gvrose8192@gmail.com> Signed-off-by: Yi-Hung Wei <yihung.wei@gmail.com> Signed-off-by: William Tu <u9012063@gmail.com>	2020-07-08 11:51:15 -07:00
Aaron Conole	7a076a5371	netdev-linux: Update LAG in all cases. In some cases, when processing a netlink change event, it's possible for an alternate part of OvS (like the IPv6 endpoint processing) to hold an active netdev interface. This creates a race condition, where sometimes the OvS change processing will take the normal path. This doesn't work because the netdev device object won't actually be enslaved to the ovs-system (for instance, a linux bond) and ingress qdisc entries will be missing. To address this, we update the LAG information in ALL cases where LAG information could come in. Fixes: d22f8927c3c9 ("netdev-linux: monitor and offload LAG slaves to TC") Cc: Marcelo Leitner <mleitner@redhat.com> Cc: John Hurley <john.hurley@netronome.com> Acked-by: Roi Dayan <roid@mellanox.com> Signed-off-by: Aaron Conole <aconole@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2020-05-16 13:15:50 +02:00
William Tu	5119cfe32d	netdev-afxdp: Fix missing init. When the interrupt mode for netdev-afxdp was introduced, the netdev init function was accidentally removed. Fix it by adding it back. Fixes: 5bfc519fee499 ("netdev-afxdp: Add interrupt mode netdev class.") Acked-by: Ilya Maximets <i.maximets@ovn.org> Acked-by: Greg Rose <gvrose8192@gmail.com> Signed-off-by: William Tu <u9012063@gmail.com>	2020-05-04 16:13:29 -07:00
Jiang Lidong	b9f825a544	netdev-linux: remove sum of vport stats and kernel netdev stats. When using a kernel veth as an OVS interface, a doubled drop counter value is shown when the veth drops packets due to traffic overrun. netdev_linux_get_stats reads both vport stats and kernel netdev stats, in case retrieval of the vport stats fails. If both succeed, the error counters are added together to include errors from different layers. But the implementation of ovs_vport_get_stats in the kernel datapath already includes kernel netdev stats by calling dev_get_stats. When drop or other error counters are not zero, their values are doubled by netdev_linux_get_stats. This change removes the addition of kernel netdev stats into the vport stats, since the vport stats include all information from the kernel netdev stats. Signed-off-by: Jiang Lidong <jianglidong3@jd.com> Signed-off-by: William Tu <u9012063@gmail.com>	2020-04-30 07:35:54 -07:00
William Tu	5bfc519fee	netdev-afxdp: Add interrupt mode netdev class. The patch adds a new netdev class 'afxdp-nonpmd' to enable afxdp interrupt mode. This is similar to 'type=afxdp', except that the is_pmd field is set to false. As a result, packet processing is handled by the main thread, not a pmd thread. This avoids keeping the CPU at 100% when there is no traffic. Signed-off-by: William Tu <u9012063@gmail.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2020-04-28 17:58:31 +02:00
Usman Ansari	fed4282c53	netdev-linux.c: Fix coverity unreachable code warning. Coverity reports unreachable code in a "?" statement. Fixed by removing the code segment and the unused variables and defines. Signed-off-by: Usman Ansari <ua1422@gmail.com> Signed-off-by: William Tu <u9012063@gmail.com>	2020-04-06 11:58:35 -07:00
Flavio Leitner	6211ad5708	netdev-linux: Enable TSO in the TAP device. Use the TUNSETOFFLOAD ioctl, if the kernel supports it, to enable TSO offloading in the tap device. Fixes: 29cf9c1b3b9c ("userspace: Add TCP Segmentation Offload support") Reported-by: Yi Yang (Cloud Service Group) <yangyi01@inspur.com> Tested-by: William Tu <u9012063@gmail.com> Signed-off-by: Flavio Leitner <fbl@sysclose.org> Signed-off-by: William Tu <u9012063@gmail.com>	2020-03-02 09:50:17 -08:00
Flavio Leitner	35b5586ba7	userspace TSO: SCTP checksum offload optional. Ideally, SCTP checksum offload needs to be advertised by the NIC when userspace TSO is enabled. However, very few drivers do that, and it's not a widely used protocol. So this patch enables SCTP checksum offload if available; otherwise userspace TSO can still be enabled, but SCTP packets will be dropped on NICs without support. Fixes: 29cf9c1b3b9c ("userspace: Add TCP Segmentation Offload support") Signed-off-by: Flavio Leitner <fbl@sysclose.org> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2020-02-26 15:24:15 +01:00
Flavio Leitner	8c5163fe81	userspace TSO: Include UDP checksum offload. Virtio doesn't expose flags to control for which protocols checksum offload is enabled or disabled. This patch checks whether the NIC supports UDP checksum offload and activates it when TSO is enabled. Reported-by: Ilya Maximets <i.maximets@ovn.org> Fixes: 29cf9c1b3b9c ("userspace: Add TCP Segmentation Offload support") Signed-off-by: Flavio Leitner <fbl@sysclose.org> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2020-02-26 15:24:15 +01:00
Flavio Leitner	73858f9dbe	netdev-linux: Prepend the std packet in the TSO packet. TSO packets are usually close to 50k-60k bytes long, so to copy fewer bytes when receiving a packet from the kernel, change the approach. Instead of extending the received MTU-sized packet and appending the remaining TSO data from the TSO buffer, allocate a TSO packet with enough headroom to prepend the std packet data. Fixes: 29cf9c1b3b9c ("userspace: Add TCP Segmentation Offload support") Suggested-by: Ben Pfaff <blp@ovn.org> Signed-off-by: Flavio Leitner <fbl@sysclose.org> Signed-off-by: Ben Pfaff <blp@ovn.org>	2020-02-06 11:37:23 -08:00
William Tu	105cf8df82	netdev-linux: Detect numa node id. The patch detects the numa node id from the name of the netdev by reading '/sys/class/net/<devname>/device/numa_node'. If it is not available (e.g., for a virtual device) or any error happens, numa id 0 is returned. Currently only the afxdp netdev type uses it; other linux netdev types are disabled because there is no use case. Signed-off-by: William Tu <u9012063@gmail.com> Acked-by: Eelco Chaudron <echaudro@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2020-01-18 01:42:22 +01:00
Flavio Leitner	29cf9c1b3b	userspace: Add TCP Segmentation Offload support. Abbreviated as TSO, TCP Segmentation Offload is a feature that enables the network stack to delegate TCP segmentation to the NIC, reducing the per-packet CPU overhead. A guest using a vhostuser interface with TSO enabled can send TCP packets much bigger than the MTU, which saves the CPU cycles normally used to break the packets down to MTU size and to calculate checksums. It also saves the CPU cycles used to parse multiple packets/headers during packet processing inside the virtual switch. If the destination of the packet is another guest on the same host, then the same big packet can be sent through a vhostuser interface, skipping segmentation completely. However, if the destination is not local, the NIC hardware is instructed to do the TCP segmentation and checksum calculation. It is recommended to check whether the NIC hardware supports TSO before enabling the feature, which is off by default. For additional information please check the tso.rst document. Signed-off-by: Flavio Leitner <fbl@sysclose.org> Tested-by: Ciara Loftus <ciara.loftus.intel.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>	2020-01-17 22:27:25 +00:00
Yi Yang	2109841b79	Use batch process recv for tap and raw socket in netdev datapath. The current netdev_linux_rxq_recv_tap and netdev_linux_rxq_recv_sock receive just a single packet at a time, which is very inefficient. My test case adds two tap ports or two veth ports into an OVS bridge (datapath_type=netdev), places them in different network namespaces, and uses iperf3 to test performance between the two ports. The results are: tap: 295 Mbits/sec, veth: 207 Mbits/sec. After changing netdev_linux_rxq_recv_tap and netdev_linux_rxq_recv_sock to use batch processing, performance improves by about 7 times: tap: 1.96 Gbits/sec, veth: 1.47 Gbits/sec. This is a huge improvement, although it can't match the OVS kernel datapath yet. FYI, here is the result for the OVS kernel datapath: tap: 37.2 Gbits/sec, veth: 36.3 Gbits/sec. Note: performance results depend heavily on the test machine, so you shouldn't expect the same results on yours. Signed-off-by: Yi Yang <yangyi01@inspur.com> Signed-off-by: Ben Pfaff <blp@ovn.org>	2020-01-09 09:48:49 -08:00

396 Commits