The status options pci-vendor_id and pci-device_id for dpdk netdevs
have been replaced by bus_info. This patch updates the documentation
in vswitchd/vswitch.xml accordingly.
Fixes: a77c7796f23a ("dpdk: Update to use v22.11.1.")
Signed-off-by: Jakob Meng <code@jakobmeng.de>
Acked-by: Simon Horman <horms@ovn.org>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Before the cleanup option, the bridge_exit() call was fairly fast,
because it didn't include any particularly long operations. However,
with the cleanup flag, this function destroys a lot of datapath
resources freeing a lot of memory, waiting on RCU and talking to
the kernel. That may take a noticeable amount of time, especially
on a busy system or under profilers/sanitizers. However, the unixctl
'exit' command replies instantly without waiting for any work to
actually be done. This may cause system test failures or other
issues where scripts expect ovs-vswitchd to exit or destroy all the
datapath resources shortly after the appctl call.
Fix that by waiting for the bridge_exit() before replying to the user.
At least, all the datapath resources will actually be destroyed by
the time ovs-appctl exits.
Also, move a structure from the stack to a global one; it seems cleaner
this way. Since we're not replying right away and it's technically
possible to have multiple clients requesting exit at the same time,
store the connections in an array.
Fixes: fe13ccdca6a2 ("vswitchd: Add --cleanup option to the 'appctl exit' command")
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Currently, the netdev's speed is being calculated by taking the link's
feature bits (using netdev_get_features()) and transforming them into
bps.
This mechanism can be both inaccurate and difficult to maintain, mainly
because we currently use the feature bits supported by OpenFlow which
would have to be extended to support all new feature bits of all netdev
implementations while keeping the OpenFlow API intact.
In order to expose the link speed accurately for all current and future
hardware, add a new netdev API call that allows the implementations to
provide the current and maximum link speeds in Mbps.
Internally, the logic to get the maximum supported speed still relies on
feature bits so it might still get out of sync in the future. However,
the maximum configurable speed is not used as much as the current speed
and these feature bits are not exposed through the netdev interface so
it should be easier to add more.
Use this new function instead of netdev_get_features() where the link
speed is needed.
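For illustration, a caller might look roughly like the sketch below; the
exact name, signature and error convention of the new call are assumptions
here, not quotes from the patch:

    /* Sketch only. */
    static uint64_t
    link_speed_bps(const struct netdev *netdev)
    {
        uint32_t current_mbps, max_mbps;

        if (!netdev_get_speed(netdev, &current_mbps, &max_mbps)) {
            return (uint64_t) current_mbps * 1000 * 1000;
        }
        return 0;   /* Speed unknown, the caller picks a default. */
    }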
As a consequence of this patch, the link speeds of cards are properly
reported (internally, in OVSDB) even if not supported by OpenFlow.
A test verifies this behavior using a tap device.
Also, in order to discourage use of the old API, this patch adds a
checkpatch.py warning if it is used.
Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=2137567
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
other_config:pmd-maxsleep is a config option to allow
PMD thread cores to sleep under low or no load conditions.
Rename it to 'pmd-sleep-max' to allow a more structured
name and so that additional options or commands can follow
the 'pmd-sleep-xyz' pattern.
Use of other_config:pmd-maxsleep is deprecated, to be
removed in a future release, and will result in a warning.
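A minimal sketch of how the renamed option with a deprecation fallback
could be read; the helper names (smap_get_ullong(), VLOG_WARN_ONCE()) and
the fallback logic are assumptions for illustration, not quotes from this
patch:

    uint64_t pmd_max_sleep = smap_get_ullong(&ovs_other_config,
                                             "pmd-sleep-max", 0);

    if (!smap_get(&ovs_other_config, "pmd-sleep-max")
        && smap_get(&ovs_other_config, "pmd-maxsleep")) {
        VLOG_WARN_ONCE("other_config:pmd-maxsleep is deprecated, "
                       "please use pmd-sleep-max instead.");
        pmd_max_sleep = smap_get_ullong(&ovs_other_config,
                                        "pmd-maxsleep", 0);
    }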
Reviewed-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
As per the Open vSwitch manual ovs-vsctl(8), the Bridge IPFIX parameters
can be passed as follows:
ovs-vsctl -- set Bridge br0 ipfix=@i \
-- --id=@i create IPFIX targets=\"192.168.0.34:4739\" \
obs_domain_id=123 obs_point_id=456 cache_active_timeout=60 \
cache_max_flows=13 \
other_config:enable-input-sampling=false \
other_config:enable-output-sampling=false
where the default values are:
enable_input_sampling: true
enable_output_sampling: true
But in the existing code these two parameters take on unexpected values
in some scenarios:
be_opts.enable_input_sampling = !smap_get_bool(&be_cfg->other_config,
"enable-input-sampling", false);
be_opts.enable_output_sampling = !smap_get_bool(&be_cfg->other_config,
"enable-output-sampling", false);
Here, the function smap_get_bool is being used with a negation.
This returns expected values for the default case (since the above code
will negate the "false" we get from the smap_get_bool function and return
the value "true") but unexpected values for the case where the sampling
value is passed through the CLI.
For example, if we pass "true" for other_config:enable-input-sampling
in the CLI, the above code will negate the "true" value we get from
the smap_get_bool function and return the value "false". The same would
be the case for enable_output_sampling.
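The straightforward correction implied above is to drop the negation and
pass the documented default of true directly, roughly:

    be_opts.enable_input_sampling = smap_get_bool(&be_cfg->other_config,
                                        "enable-input-sampling", true);
    be_opts.enable_output_sampling = smap_get_bool(&be_cfg->other_config,
                                        "enable-output-sampling", true);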
Acked-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: Sayali Naval <sanaval@cisco.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Some control protocols are used to maintain link status between
forwarding engines (e.g. LACP). When the system is not sized properly,
the PMD threads may not be able to process all incoming traffic from the
configured Rx queues. When a signaling packet of such protocols is
dropped, it can cause link flapping, worsening the situation.
Use the rte_flow API to redirect these protocols into a dedicated Rx
queue. The assumption is made that the ratio between control protocol
traffic and user data traffic is very low and thus this dedicated Rx
queue will never get full. Re-program the RSS redirection table to only
use the other Rx queues.
The additional Rx queue will be assigned a PMD core like any other Rx
queue. Polling that extra queue may introduce increased latency and
a slight performance penalty, with the benefit of preventing link flapping.
This feature must be enabled per port on specific protocols via the
rx-steering option. This option takes "rss" followed by a "+" separated
list of protocol names. It is only supported on ethernet ports. This
feature is experimental.
If the user has already configured multiple Rx queues on the port, an
additional one will be allocated for control packets. If the hardware
cannot satisfy the number of requested Rx queues, the last Rx queue will
be assigned to the control plane. If only one Rx queue is available, the
rx-steering feature will be disabled. If the hardware does not support
the rte_flow matchers/actions, the rx-steering feature will be
completely disabled on the port and regular RSS will be performed
instead.
It cannot be enabled when other-config:hw-offload=true as it may
conflict with the offloaded flows. Similarly, if hw-offload is enabled,
custom rx-steering will be forcibly disabled on all ports and replaced
by regular RSS.
Example use:
ovs-vsctl add-bond br-phy bond0 phy0 phy1 -- \
set interface phy0 type=dpdk options:dpdk-devargs=0000:ca:00.0 -- \
set interface phy0 options:rx-steering=rss+lacp -- \
set interface phy1 type=dpdk options:dpdk-devargs=0000:ca:00.1 -- \
set interface phy1 options:rx-steering=rss+lacp
As a starting point, only one protocol is supported: LACP. Other
protocols can be added in the future. NIC compatibility should be
checked.
To validate that this works as intended, I used a traffic generator to
generate random traffic slightly above the machine capacity at line rate
on a two-port bond interface. OVS is configured to receive traffic on
two VLANs and pop/push them in a br-int bridge based on tags set on
patch ports.
+----------------------+
| DUT |
|+--------------------+|
|| br-int || in_port=patch10,actions=mod_dl_src:$patch11,
|| || mod_dl_dst:$tgen0,
|| || output:patch10
|| || in_port=patch11,actions=mod_dl_src:$patch10
|| || mod_dl_dst:$tgen0,
|| patch10 patch11 || output:patch10
|+---|-----------|----+|
| | | |
|+---|-----------|----+|
|| patch00 patch01 ||
|| tag:10 tag:20 ||
|| ||
|| br-phy || default flow, action=NORMAL
|| ||
|| bond0 || balance-slb, lacp=passive, lacp-time=fast
|| phy0 phy1 ||
|+------|-----|-------+|
+-------|-----|--------+
| |
+-------|-----|--------+
| port0 port1 | balance L3/L4, lacp=active, lacp-time=fast
| lag | mode trunk VLANs 10, 20
| |
| switch |
| |
| vlan 10 vlan 20 | mode access
| port2 port3 |
+-----|----------|-----+
| |
+-----|----------|-----+
| tgen0 tgen1 | Random traffic that is properly balanced
| | across the bond ports in both directions.
| traffic generator |
+----------------------+
Without rx-steering, the bond0 links are randomly switching to
"defaulted" when one of the LACP packets sent by the switch is dropped
because the RX queues are full and the PMD threads did not process them
fast enough. When that happens, all traffic must go through a single
link, which causes the traffic above line rate to be dropped.
~# ovs-appctl lacp/show-stats bond0
---- bond0 statistics ----
member: phy0:
TX PDUs: 347246
RX PDUs: 14865
RX Bad PDUs: 0
RX Marker Request PDUs: 0
Link Expired: 168
Link Defaulted: 0
Carrier Status Changed: 0
member: phy1:
TX PDUs: 347245
RX PDUs: 14919
RX Bad PDUs: 0
RX Marker Request PDUs: 0
Link Expired: 147
Link Defaulted: 1
Carrier Status Changed: 0
When rx-steering is enabled, no LACP packet is dropped and the bond
links remain enabled at all times, maximizing the throughput. Neither
the "Link Expired" nor the "Link Defaulted" counters are incremented
anymore.
This feature may be considered as "QoS". However, it does not work by
limiting the rate of traffic explicitly. It only guarantees that some
protocols have a lower chance of being dropped because the PMD cores
cannot keep up with regular traffic.
The choice of protocols is limited on purpose. This is not meant to be
configurable by users. Some limited configurability could be considered
in the future, but it would expose users to more potential issues if they
accidentally redirect all traffic to the isolated queue.
Acked-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Robin Jarry <rjarry@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
It supports flowlabel-based load balancing by controlling the flowlabel
of the outer IPv6 header, which is already implemented in the Linux
kernel as the seg6_flowlabel sysctl [1].
[1]: https://docs.kernel.org/networking/seg6-sysctl.html
Signed-off-by: Nobuhiro MIKI <nmiki@yahoo-corp.jp>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The description of SRv6 was missing in vswitch.xml, which is
used to generate the man page, so this patch adds it.
Fixes: 03fc1ad78521 ("userspace: Add SRv6 tunnel support.")
Signed-off-by: Nobuhiro MIKI <nmiki@yahoo-corp.jp>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Open vSwitch generally tries to let the underlying operating system
manage the low-level details of hardware, for example DMA mapping,
bus arbitration, etc. However, when using DPDK, the underlying
operating system yields control of many of these details to userspace
for management.
In the case of some DPDK port drivers, configuring rte_flow or even
allocating resources may require access to iopl/ioperm calls, which
are guarded by the CAP_SYS_RAWIO privilege on Linux systems. These
calls are dangerous, and can allow a process to completely compromise
a system. However, they are needed in the case of some userspace
driver code which manages the hardware (for example, the mlx
implementation of backend support for rte_flow).
Here, we create an opt-in flag passed to the command line to allow
this access. We need to do this before ever accessing the database,
because we want to drop all privileges asap, and cannot wait for
a connection to the database to be established and functional before
dropping. There may be distribution-specific ways to do capability
management as well (using, for example, systemd), but they are not
as universal for the vswitchd as a flag.
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Aaron Conole <aconole@redhat.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Acked-by: Gaetan Rivet <gaetanr@nvidia.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Depending on the driver implementation, it can take from 0.2 seconds
up to 2 seconds before offloaded flow statistics are updated. This is
true for both TC and rte_flow-based offloading. This is causing a
problem with min-revalidate-pps, as old statistic values are used
during this period.
This fix will wait for at least 2 seconds, by default, before assuming no
packets were received during this period.
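A rough sketch of the idea with invented names: do not apply the
min-revalidate-pps check until the flow has existed long enough for the
offloaded statistics to have been refreshed at least once.

    /* Illustration only: names and the constant are hypothetical. */
    #define OFFLOAD_STATS_DELAY_MS 2000

    static bool
    pps_check_applies(long long int now_ms, long long int flow_created_ms)
    {
        /* Offloaded statistics may lag by up to ~2 seconds, so a younger
         * flow may still look idle even though packets were received. */
        return now_ms - flow_created_ms >= OFFLOAD_STATS_DELAY_MS;
    }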
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Add options to the IPFIX table to configure the interval at which
statistics and template information are sent.
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Add support to count upcall packets per port, both succeeded and failed,
which is a better way to see how many packets are upcalled on each interface.
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: wangchuanlei <wangchuanlei@inspur.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Today the minimum value for this setting is 1. This patch allows it to
be 0, meaning pps is not checked at all and revalidation is always done.
This is particularly useful for environments where some of the
applications with long-lived connections may have very low traffic for
certain periods but have a high rate of bursts periodically. It is
desirable to keep the datapath flows instead of periodically deleting
them to avoid bursts of packet misses to userspace.
When set to 0, there may be more datapath flows to be revalidated,
resulting in a higher CPU cost for revalidator threads. This is the
downside, but in certain cases this is still more desirable than packet
misses to user space.
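A sketch of the intended semantics with a hypothetical helper, not the
actual patch: a value of 0 disables the rate check entirely, so flows are
always revalidated.

    static bool
    worth_revalidating(uint64_t pps, uint64_t min_revalidate_pps)
    {
        if (!min_revalidate_pps) {
            return true;   /* 0 means: never skip revalidation by rate. */
        }
        return pps >= min_revalidate_pps;
    }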
Signed-off-by: Han Zhou <hzhou@ovn.org>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Now that the timer slack for the PMD threads is reduced we can also
reduce the start/increment for PMD load based sleeping to match it.
This will further reduce initial sleep times making it more resilient
to interfaces that might be sensitive to large sleep times.
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Reviewed-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Sleep for an incremental amount of time if none of the Rx queues
assigned to a PMD have at least half a batch of packets (i.e. 16 pkts)
on a polling iteration of the PMD.
Upon detecting the threshold of >= 16 pkts on an Rxq, reset the
sleep time to zero (i.e. no sleep).
Sleep time will be increased on each iteration where the low load
conditions remain, up to a total of the max sleep time which is set
by the user, e.g.:
ovs-vsctl set Open_vSwitch . other_config:pmd-maxsleep=500
The default pmd-maxsleep value is 0, which means that no sleeps
will occur and the default behaviour is unchanged from previously.
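The loop described above might look roughly like the following standalone
sketch; the constant and helper names are illustrative only, not the
actual OVS code:

    #define SLEEP_INCREMENT_US 1       /* Hypothetical step size. */
    #define HALF_BATCH_PKTS    16

    uint64_t sleep_us = 0;

    for (;;) {
        int max_burst = poll_all_rxqs();   /* Largest burst on any rxq. */

        if (max_burst >= HALF_BATCH_PKTS) {
            sleep_us = 0;                  /* Load detected: stop sleeping. */
        } else if (sleep_us < max_sleep_us) {  /* max_sleep_us: pmd-maxsleep. */
            sleep_us += SLEEP_INCREMENT_US;
        }

        if (sleep_us) {
            usleep(sleep_us);              /* Stand-in for the real sleep. */
        }
    }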
Also add new stats to pmd-perf-show to get visibility of the operation,
e.g.:
...
- sleep iterations: 153994 ( 76.8 % of iterations)
Sleep time (us): 9159399 ( 59 us/iteration avg.)
...
Reviewed-by: Robin Jarry <rjarry@redhat.com>
Reviewed-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
When openvswitch is restarted, the BFD status is kept from before
ovs-vswitchd was running. And if ovs-vswitchd is under a high workload,
updating the BFD status will be deferred, which is not what we expect.
Signed-off-by: Daniel Ding <zhihui.ding@easystack.cn>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
100 Mbps was a fair assumption 13 years ago. These days, 10 Gbps seems
like a good value in case no information is available otherwise.
The change mainly affects QoS which is currently limited to 100 Mbps if
the user didn't specify 'max-rate' and the card doesn't report the
speed or OVS doesn't have a predefined enumeration for the speed
reported by the NIC.
Calculation of the path cost for STP/RSTP is also affected if OVS is
unable to determine the link speed.
Lower link speed adapters are typically good at reporting their speed,
so the chances of overshooting should be low. But newer high-speed
adapters, for which there is no speed enumeration or which have some
other issues, will not suffer as much.
Acked-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The count of received multicast packets has been computed internally,
but not exposed to ovsdb. Fix this.
Signed-off-by: David Marchand <david.marchand@redhat.com>
Acked-by: Mike Pattrick <mkp@redhat.com>
Acked-by: Michael Santana <msantana@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
For some reason it is documented as 'rstp-port-path-cost', while
the code and some other bits of documentation use 'rstp-path-cost'.
Fixes: 9efd308e957c ("Rapid Spanning Tree Protocol (IEEE 802.1D).")
Reviewed-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The README file still mentions a kernel module and some parts of
the documentation still have XenServer references, e.g. 'xs-*'
database configuration options. Removing them.
Fixes: 422e90437854 ("make: Remove the Linux datapath.")
Fixes: 83c9518e7c67 ("xenserver: Remove xenserver.")
Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Remove the current xenserver implementation - it is obsolete and
since 3.0 we do not support kernel module builds [1].
1. https://mail.openvswitch.org/pipermail/ovs-dev/2022-July/395789.html
[i.maximets]
Can be added back if people willing to maintain it are found.
Signed-off-by: Greg Rose <gvrose8192@gmail.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
This config param allows the delivery of broadcast and multicast
packets to the secondary interface of non-LACP bonds, equivalent
to the 'all_slaves_active' option for Linux kernel bonds.
Reported-at: https://bugzilla.redhat.com/1720935
Signed-off-by: Christophe Fontaine <cfontain@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
This commit adds a new command to allow the user to switch
the active action implementation at runtime.
Usage:
$ ovs-appctl odp-execute/action-impl-set scalar
This commit also adds a new command to retrieve the list of available
action implementations. This can be used to check what implementations
of actions are available and which implementation is active at runtime.
Usage:
$ ovs-appctl odp-execute/action-impl-show
Added separate test-case for ovs-actions show/set commands:
odp-execute - actions implementation
Signed-off-by: Emma Finn <emma.finn@intel.com>
Signed-off-by: Kumar Amber <kumar.amber@intel.com>
Signed-off-by: Sunil Pai G <sunil.pai.g@intel.com>
Co-authored-by: Kumar Amber <kumar.amber@intel.com>
Co-authored-by: Sunil Pai G <sunil.pai.g@intel.com>
Acked-by: Harry van Haaren <harry.van.haaren@intel.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
This commit introduces the initial infrastructure required to allow
different implementations for OvS actions. The patch introduces action
function pointers which allow the user to switch between the different
action implementations available. This will allow for more performance
and flexibility so the user can choose the action implementation that
best suits their use case.
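Conceptually, the infrastructure boils down to a table of per-implementation
function pointers that the datapath calls through; the sketch below uses
invented names rather than the actual OvS structures:

    struct action_impl {
        const char *name;              /* e.g. "scalar". */
        bool (*probe)(void);           /* Is this impl usable on this CPU? */
        void (*execute)(void *dp, struct dp_packet_batch *batch,
                        const struct nlattr *actions, size_t actions_len);
    };

    static const struct action_impl *active_impl;   /* Switched at runtime. */

    static inline void
    odp_execute(void *dp, struct dp_packet_batch *batch,
                const struct nlattr *actions, size_t actions_len)
    {
        active_impl->execute(dp, batch, actions, actions_len);
    }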
Signed-off-by: Emma Finn <emma.finn@intel.com>
Acked-by: Harry van Haaren <harry.van.haaren@intel.com>
Acked-by: Sunil Pai G <sunil.pai.g@intel.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
Mempools may currently be shared between DPDK ports based
on port MTU and NUMA. With some hint from the user we can
increase the sharing on MTU and hence reduce memory
consumption in many cases.
For example, a port with MTU 9000, uses a mempool with an
mbuf size based on 9000 MTU. A port with MTU 1500, uses a
different mempool with an mbuf size based on 1500 MTU.
In this case, assuming same NUMA, both these ports could
share the 9000 MTU mempool.
The user must give a hint, as the order of creation of ports and
setting of MTUs may vary, and we need to ensure that upgrades
from older OVS versions do not require more memory.
This scheme can also prevent multiple mempools being created
for cases where a port is added picking up a default MTU and
an appropriate mempool, but later has its MTU changed to a
different value requiring a different mempool.
Example usage:
$ ovs-vsctl --no-wait set Open_vSwitch . \
other_config:shared-mempool-config=9000,1500:1,6000:1
Port added on NUMA 0:
* MTU 1500, use mempool based on 9000 MTU
* MTU 5000, use mempool based on 9000 MTU
* MTU 9000, use mempool based on 9000 MTU
* MTU 9300, use mempool based on 9300 MTU (existing behaviour)
Port added on NUMA 1:
* MTU 1500, use mempool based on 1500 MTU
* MTU 5000, use mempool based on 6000 MTU
* MTU 9000, use mempool based on 9000 MTU
* MTU 9300, use mempool based on 9300 MTU (existing behaviour)
Default behaviour is unchanged and mempools are still only created
when needed.
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Reviewed-by: David Marchand <david.marchand@redhat.com>
Acked-by: Sunil Pai G <sunil.pai.g@intel.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
My XML editor keeps autofixing these, which means I have to be
careful during 'git add' for unrelated changes. Might as well
just fix them.
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Reviewed-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Tunnels in LibreSwan and OpenSwan allow for many options to be set on a
per-tunnel basis. Pass through any options starting with ipsec_ to the
connection in the configuration file. Administrators are responsible for
picking valid key/value pairs.
Signed-off-by: Andreas Karis <ak.karis@gmail.com>
Acked-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Using the SHORT version of the *_SAFE loops makes the code cleaner and
less error-prone. So, use the SHORT version and remove the extra variable
when possible for hmap and all its derived types.
In order to be able to use both long and short versions without changing
the name of the macro for all the clients, overload the existing name
and select the appropriate version depending on the number of arguments.
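The overloading trick relies on counting macro arguments to pick a variant;
a standalone sketch of the pattern (not the exact OVS macros) looks like
this:

    /* Pick the 5th argument, whatever it is. */
    #define SAFE_PICK(_1, _2, _3, _4, NAME, ...) NAME

    /* The LONG form keeps the user-visible NEXT variable, the SHORT form
     * hides it.  Bodies elided here. */
    #define FOR_EACH_SAFE_LONG(NODE, NEXT, MEMBER, LIST)
    #define FOR_EACH_SAFE_SHORT(NODE, MEMBER, LIST)

    /* 3 arguments select SHORT, 4 arguments select LONG.  The trailing 0
     * only keeps the variadic part non-empty. */
    #define FOR_EACH_SAFE(...)                                           \
        SAFE_PICK(__VA_ARGS__, FOR_EACH_SAFE_LONG, FOR_EACH_SAFE_SHORT,  \
                  0)(__VA_ARGS__)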
Acked-by: Dumitru Ceara <dceara@redhat.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Using the SHORT version of the *_SAFE loops makes the code cleaner
and less error-prone. So, use the SHORT version and remove the extra
variable when possible.
In order to be able to use both long and short versions without changing
the name of the macro for all the clients, overload the existing name
and select the appropriate version depending on the number of arguments.
Acked-by: Dumitru Ceara <dceara@redhat.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Expose functions reporting user configuration of offloading threads, as
well as utility functions for multithreading.
This will only expose the configuration knob to the user, while no
datapath will implement the multiple thread request.
This will allow implementations to use this API for offload thread
management in relevant layers before enabling the actual dataplane
implementation.
The offload thread ID is lazily allocated and can as such be in a
different order than the offload thread start sequence.
The RCU thread will sometimes access hardware-offload objects from
a provider for reclamation purposes. In such a case, it will get
a default offload thread ID of 0. Care must be taken that using
this thread ID is safe concurrently with the offload threads.
Signed-off-by: Gaetan Rivet <grive@u256.net>
Reviewed-by: Eli Britstein <elibr@nvidia.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
This patch defines the OVS_STATIC_TRACE() macro, and as an
example, adds two of them in the bridge run loop.
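As an illustration of what such a macro typically expands to, a USDT-style
static tracepoint can be built on the systemtap <sys/sdt.h> probes; the
sketch below is an assumption about the general shape, not the definition
from this patch:

    /* Hypothetical sketch of a USDT-backed static trace macro. */
    #ifdef HAVE_USDT_PROBES            /* Hypothetical configure guard. */
    #include <sys/sdt.h>
    #define OVS_STATIC_TRACE(probe) DTRACE_PROBE(ovs, probe)
    #else
    #define OVS_STATIC_TRACE(probe) do { } while (0)
    #endif

    /* Example use in the bridge run loop:
     *     OVS_STATIC_TRACE(run_start);
     */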
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Paolo Valerio <pvalerio@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
netdev_set_dpif_type() can only be used with a normalized dpif type
as an argument, which is a constant static string derived from a type
of a dpif_class or a constant string "system". Usage of a same
constant string allows netdev-offload module to compare types by
simply comparing pointers.
OTOH, 'br->ofproto->type' is a dynamic string that:
a. Can be NULL.
b. Even if not NULL and equal, can be a different dynamically
allocated string.
Both of these qualities break assumptions made by all other modules
related to HW offload, breaking the functionality.
Fix that by moving netdev_set_dpif_type() to dpif.c and calling with
a correct constant string as an argument.
The call is moved from bridge.c to dpif.c, because we need to have access
to the dpif class there, which bridge.c should not.
We're not trying to set the dpif_type inside netdev_ports_insert(),
because it's now used outside the offloading context. So, it's
cleaner to move the netdev_set_dpif_type() call outside of the
netdev-offload module.
Additionally removed the redundant call from the netdev_ports_insert()
and refactored the function, since it doesn't need an extra argument
anymore.
Fixes: 4f19a78a61c5 ("netdev-vport: Fix userspace tunnel ioctl(SIOCGIFINDEX) info logs.")
Reported-by: Roi Dayan <roid@nvidia.com>
Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2021-December/390117.html
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Reviewed-by: Lin Huang <linhuang@ruijie.com.cn>
Acked-by: Roi Dayan <roid@nvidia.com>
A userspace tunnel doesn't have a valid device in the kernel, so the
get_ifindex() function (ioctl) always gets an error when adding a port,
deleting a port or updating a port's status.
The info log is
"2021-08-29T09:17:39.830Z|00059|netdev_linux|INFO|ioctl(SIOCGIFINDEX)
on vxlan_sys_4789 device failed: No such device"
If there are a lot of userspace tunnel ports on a bridge, the
iface_refresh_netdev_status() function will waste a lot of time.
So, ignore the ioctl(SIOCGIFINDEX) operation for userspace tunnel ports
and just return -ENODEV.
Signed-off-by: Lin Huang <linhuang@ruijie.com.cn>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Added a new function to return memory usage statistics for database
objects inside the IDL. The statistics are similar to what ovsdb-server
reports. The _Server database is not counted, as it should be small,
hence it isn't worth adding extra code to the ovsdb-cs module. It can
be added later if needed.
ovs-vswitchd is a user in OVS, but this API will be mostly useful for
OVN daemons.
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Han Zhou <hzhou@ovn.org>
Acked-by: Dumitru Ceara <dceara@redhat.com>
The current algorithm of ovsdb_datum_union() looks like this:
for-each atom in b:
    if not bin_search(a, atom):
        push(a, clone(atom))
quicksort(a)
So, the complexity looks like this:
Nb * log2(Na)                 <- comparisons for search
+ Nb                          <- clones
+ (Na + Nb) * log2(Na + Nb)   <- comparisons for quicksort
ovsdb_datum_union() is heavily used in database transactions when a
new element is added to a set. For example, when a new logical switch
port is added to a logical switch in OVN. This is a very common
use case where the CMS adds one new port to an existing switch that
already has, let's say, 100 ports. For this case ovsdb-server will
have to perform:
1 * log2(100)      <- comparisons for search    (~7)
+ 1 clone          <- clones                    (1)
+ 101 * log2(101)  <- comparisons for quicksort (~707)
Roughly 714 comparisons of atoms and 1 clone.
Since binary search can give us the position where the new atom should
go (it's the 'low' index after the search completes) for free, the
logic can be re-worked like this:
copied = 0
for-each atom in b:
    desired_position = bin_search(a, atom)
    push(result, a[ copied : desired_position - 1 ])
    copied = desired_position
    push(result, clone(atom))
push(result, a[ copied : Na ])
swap(a, result)
Complexity of this schema:
Nb * log2(Na)   <- comparisons for search
+ Nb            <- clones
+ Na            <- memory copies on push
'swap' is just a swap of a few pointers. 'push' is not a 'clone',
but a simple memory copy of 'union ovsdb_atom'.
In general, this schema substitutes the complexity of a quicksort
with the complexity of a memory copy of Na atom structures, where we're
not even copying the strings that these atoms are pointing to.
Complexity in the example above goes down from 714 comparisons
to 7 comparisons and memcpy of 100 * sizeof (union ovsdb_atom) bytes.
The general complexity of a memory copy should always be lower than the
complexity of a quicksort, especially because these copies are usually
performed in bulk, so this new schema should work faster for any input.
All in all, this change allows executing several times more
transactions per second for transactions that add new entries to sets.
Alternatively, union can be implemented as a linear merge of two
sorted arrays, but this will result in O(Na) comparisons, which
is more than Nb * log2(Na) in the common case, since Na is usually
far bigger than Nb. Linear merge will also mean per-atom memory
copies instead of copying in bulk.
The 'replace' functionality of ovsdb_datum_union() had no users, so it
was just removed. But it can easily be added back if needed in the future.
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Han Zhou <hzhou@ovn.org>
Acked-by: Mark D. Gray <mark.d.gray@redhat.com>
This change removes the automatic memory limit on start-up of OVS with
DPDK. As DPDK supports dynamic memory allocation, there is no
need to limit the amount of memory available, if not requested.
Currently, if socket-limit is not configured, it is set to the value of
socket-mem. With this change, the user can decide to set it or have no
memory limit.
Removed logs that announce this change and fixed documentation.
Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=1949850
Signed-off-by: Rosemarie O'Riorden <roriorde@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
This change removes the default values for EAL args socket-mem and
socket-limit. As DPDK supports dynamic memory allocation, there is no
need to allocate a certain amount of memory on start-up, nor limit the
amount of memory available, if not requested.
Currently, socket-mem has a default value of 1024 when it is not
configured by the user, and socket-limit takes on the value of
socket-mem, 1024, by default. With this change, socket-mem is not
configured by default, meaning that socket-limit is not either.
Neither, either or both options can be set.
Removed extra logs that announce this change and fixed documentation.
Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=1949850
Signed-off-by: Rosemarie O'Riorden <roriorde@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The Open vSwitch kernel module uses the upcall mechanism to send
packets from kernel space to user space when it misses in the kernel
space flow table. The upcall sends packets via a Netlink socket.
Currently, a Netlink socket is created for every vport. In this way,
there is a 1:1 mapping between a vport and a Netlink socket.
When a packet is received by a vport, if it needs to be sent to
user space, it is sent via the corresponding Netlink socket.
This mechanism, with various iterations of the corresponding user
space code, has seen some limitations and issues:
* On systems with a large number of vports, there is correspondingly
a large number of Netlink sockets which can limit scaling.
(https://bugzilla.redhat.com/show_bug.cgi?id=1526306)
* Packet reordering on upcalls.
(https://bugzilla.redhat.com/show_bug.cgi?id=1844576)
* A thundering herd issue.
(https://bugzilla.redhat.com/show_bug.cgi?id=1834444)
This patch introduces an alternative, feature-negotiated, upcall
mode using a per-cpu dispatch rather than a per-vport dispatch.
In this mode, the Netlink socket to be used for the upcall is
selected based on the CPU of the thread that is executing the upcall.
In this way, it resolves the issues above as:
a) The number of Netlink sockets scales with the number of CPUs
rather than the number of vports.
b) Ordering per-flow is maintained as packets are distributed to
CPUs based on mechanisms such as RSS and flows are distributed
to a single user space thread.
c) Packets from a flow can only wake up one user space thread.
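A simplified sketch of the selection, with illustrative names: with per-cpu
dispatch, the upcall socket is a function of the CPU handling the packet
rather than of the vport.

    static struct nl_sock *
    upcall_sock_for_cpu(struct nl_sock **socks, size_t n_socks,
                        unsigned int cpu)
    {
        return socks[cpu % n_socks];
    }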
Reported-at: https://bugzilla.redhat.com/1844576
Signed-off-by: Mark Gray <mark.d.gray@redhat.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Pinning an rxq to a PMD with pmd-rxq-affinity may be done for
various reasons such as reserving a full PMD for an rxq, or to
ensure that multiple rxqs from a port are handled on different PMDs.
Previously pmd-rxq-affinity always isolated the PMD so no other rxqs
could be assigned to it by OVS. There may be cases where there are
unused cycles on those PMDs and the user would like other rxqs to
also be assignable to them by OVS.
Add an option to pin the rxq and not isolate the PMD. The default
behaviour is unchanged, which is to pin and isolate the PMD.
In order to pin and non-isolate:
ovs-vsctl set Open_vSwitch . other_config:pmd-rxq-isolate=false
Note this is available only with the group assignment type, as pinning
conflicts with the operation of the other rxq assignment algorithms.
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: Sunil Pai G <sunil.pai.g@intel.com>
Acked-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
Add an rxq scheduling option that allows rxqs to be grouped
on a pmd based purely on their load.
The current default 'cycles' assignment sorts rxqs by measured
processing load and then assigns them to a list of round robin PMDs.
This helps to keep the rxqs that require most processing on different
cores but as it selects the PMDs in round robin order, it equally
distributes rxqs to PMDs.
'cycles' assignment has the advantage in that it separates the most
loaded rxqs from being on the same core but maintains the rxqs being
spread across a broad range of PMDs to mitigate against changes to
traffic pattern.
'cycles' assignment has the disadvantage that, in order to make the
trade-off between optimising for the current traffic load and mitigating
against future changes, it tries to assign an equal amount of rxqs
per PMD in a round robin manner, and this can lead to a less than
optimal balance of the processing load.
Now that PMD auto load balance can help mitigate future changes in
traffic patterns, a 'group' assignment can be used to assign rxqs based
on their measured cycles and the estimated running total of the PMDs.
In this case, there is no restriction about keeping equal number of
rxqs per PMD as it is purely load based.
This means that one PMD may have a group of low load rxqs assigned to it
while another PMD has one high load rxq assigned to it, as that is the
best balance of their measured loads across the PMDs.
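The 'group' policy is essentially a greedy assignment by measured load; a
standalone sketch of the idea (not the OVS implementation):

    /* Assign rxqs, sorted by measured cycles (largest first), to whichever
     * PMD currently has the lowest estimated load. */
    struct rxq_load { uint64_t cycles; size_t pmd; };
    struct pmd_load { uint64_t est_cycles; };

    static void
    assign_rxqs_by_group(struct rxq_load *rxqs, size_t n_rxqs,
                         struct pmd_load *pmds, size_t n_pmds)
    {
        for (size_t i = 0; i < n_rxqs; i++) {
            size_t best = 0;

            for (size_t j = 1; j < n_pmds; j++) {
                if (pmds[j].est_cycles < pmds[best].est_cycles) {
                    best = j;
                }
            }
            rxqs[i].pmd = best;
            pmds[best].est_cycles += rxqs[i].cycles;
        }
    }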
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: Sunil Pai G <sunil.pai.g@intel.com>
Acked-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
Currently there is an option to add/flush/show ARP/ND neighbors. This
covers the L3 side. For the L2 side, there is only the fdb show command.
This commit gives an option to add/del an fdb entry via ovs-appctl.
The CLI commands look like:
To add:
ovs-appctl fdb/add <bridge> <port> <vlan> <Mac>
ovs-appctl fdb/add br0 p1 0 50:54:00:00:00:05
To del:
ovs-appctl fdb/del <bridge> <vlan> <Mac>
ovs-appctl fdb/del br0 0 50:54:00:00:00:05
Added two new APIs to provide a convenient interface to add and delete
static-macs.
bool xlate_add_static_mac_entry(const struct ofproto_dpif *,
                                ofp_port_t in_port,
                                struct eth_addr dl_src, int vlan);
bool xlate_delete_static_mac_entry(const struct ofproto_dpif *,
                                   struct eth_addr dl_src, int vlan);
1. A static entry should not age. To indicate that the entry being
   programmed is a static entry, the 'expires' field in 'struct mac_entry'
   will be set to MAC_ENTRY_AGE_STATIC_ENTRY. A check for this value is
   made while deleting a mac entry as part of the regular aging process
   (see the sketch after this list).
2. Another change is to the mac-update logic: when a packet with the same
   dl_src as that of a static-mac entry arrives on any port, the logic
   will not modify the 'expires' field.
3. While flushing fdb entries, make sure static ones are not evicted.
4. Updated "ovs-appctl fdb/stats-show br0" to display the number of
   static entries in the switch.
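For example, the aging path could skip such entries with a check along
these lines (a sketch, not the exact patch):

    /* Sketch: a static entry is marked with the sentinel 'expires' value
     * and is never evicted by the aging sweep. */
    static bool
    mac_entry_is_static(const struct mac_entry *e)
    {
        return e->expires == MAC_ENTRY_AGE_STATIC_ENTRY;
    }

    /* ...and in the aging sweep: */
    if (mac_entry_is_static(e) || e->expires > now) {
        continue;   /* Keep static and still-fresh entries. */
    }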
Added the following tests:
ofproto-dpif - static-mac add/del/flush
ofproto-dpif - static-mac mac moves
Reported-at: https://mail.openvswitch.org/pipermail/ovs-discuss/2019-June/048894.html
Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=1597752
Signed-off-by: Vasu Dasari <vdasari@gmail.com>
Tested-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Deprecate the current OVS-provided defaults for DPDK socket-mem and
socket-limit, which are planned to be removed in OVS 2.17. At that point
DPDK defaults will be used instead. Warnings have been added to alert
users in advance.
Signed-off-by: Rosemarie O'Riorden <roriorde@redhat.com>
Acked-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
Currently, conntrack in the kernel has an undocumented feature referred
to as all-zero IP address SNAT. Basically, when a source port
collision is detected during the commit, the source port will be
translated to an ephemeral port. If there is no collision, no SNAT is
performed.
This patchset documents this behavior and adds a self-test to verify
it's not changing. In addition, a datapath feature flag is added for
the all-zero IP SNAT case. This will help applications on top of OVS,
like OVN, to determine whether this feature can be used.
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Aaron Conole <aconole@redhat.com>
Acked-by: Dumitru Ceara <dceara@redhat.com>
Acked-by: Alin-Gabriel Serdean <aserdean@ovn.org>
Acked-by: Paolo Valerio <pvalerio@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>