The status options pci-vendor_id and pci-device_id for dpdk netdevs
have been replaced by bus_info. This patch updates the documentation
in vswitchd/vswitch.xml accordingly.
Fixes: a77c7796f23a ("dpdk: Update to use v22.11.1.")
Signed-off-by: Jakob Meng <code@jakobmeng.de>
Acked-by: Simon Horman <horms@ovn.org>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Before the cleanup option, the bridge_exit() call was fairly fast,
because it didn't include any particularly long operations. However,
with the cleanup flag, this function destroys a lot of datapath
resources freeing a lot of memory, waiting on RCU and talking to
the kernel. That may take a noticeable amount of time, especially
on a busy system or under profilers/sanitizers. However, the unixctl
'exit' command replies instantly without waiting for any work to
actually be done. This may cause system test failures or other
issues where scripts expect ovs-vswitchd to exit or destroy all the
datapath resources shortly after the appctl call.
Fix that by waiting for the bridge_exit() before replying to the user.
At least, all the datapath resources will actually be destroyed by
the time ovs-appctl exits.
Also, move a structure from the stack to a global one; it seems cleaner
this way. Since we're not replying right away and it's technically
possible to have multiple clients requesting exit at the same time,
store the connections in an array.
Fixes: fe13ccdca6a2 ("vswitchd: Add --cleanup option to the 'appctl exit' command")
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Currently, the netdev's speed is being calculated by taking the link's
feature bits (using netdev_get_features()) and transforming them into
bps.
This mechanism can be both inaccurate and difficult to maintain, mainly
because we currently use the feature bits supported by OpenFlow which
would have to be extended to support all new feature bits of all netdev
implementations while keeping the OpenFlow API intact.
In order to expose the link speed accurately for all current and future
hardware, add a new netdev API call that allows the implementations to
provide the current and maximum link speeds in Mbps.
Internally, the logic to get the maximum supported speed still relies on
feature bits so it might still get out of sync in the future. However,
the maximum configurable speed is not used as much as the current speed
and these feature bits are not exposed through the netdev interface so
it should be easier to add more.
Use this new function instead of netdev_get_features() where the link
speed is needed.
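For illustration, a caller might look roughly like the sketch below; the
exact name, signature and error convention of the new call are assumptions
here, not quotes from the patch:

    /* Sketch only. */
    static uint64_t
    link_speed_bps(const struct netdev *netdev)
    {
        uint32_t current_mbps, max_mbps;

        if (!netdev_get_speed(netdev, &current_mbps, &max_mbps)) {
            return (uint64_t) current_mbps * 1000 * 1000;
        }
        return 0;   /* Speed unknown, the caller picks a default. */
    }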
As a consequence of this patch, the link speeds of cards are properly
reported (internally, in OVSDB) even if not supported by OpenFlow.
A test verifies this behavior using a tap device.
Also, in order to discourage use of the old API, this patch adds a
checkpatch.py warning if it is used.
Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=2137567
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
other_config:pmd-maxsleep is a config option to allow
PMD thread cores to sleep under low or no load conditions.
Rename it to 'pmd-sleep-max' to allow a more structured
name and so that additional options or commands can follow
the 'pmd-sleep-xyz' pattern.
Use of other_config:pmd-maxsleep is deprecated, to be
removed in a future release, and will result in a warning.
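A minimal sketch of how the renamed option with a deprecation fallback
could be read; the helper names (smap_get_ullong(), VLOG_WARN_ONCE()) and
the fallback logic are assumptions for illustration, not quotes from this
patch:

    uint64_t pmd_max_sleep = smap_get_ullong(&ovs_other_config,
                                             "pmd-sleep-max", 0);

    if (!smap_get(&ovs_other_config, "pmd-sleep-max")
        && smap_get(&ovs_other_config, "pmd-maxsleep")) {
        VLOG_WARN_ONCE("other_config:pmd-maxsleep is deprecated, "
                       "please use pmd-sleep-max instead.");
        pmd_max_sleep = smap_get_ullong(&ovs_other_config,
                                        "pmd-maxsleep", 0);
    }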
Reviewed-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
As per the Open vSwitch manual ovs-vsctl(8), the Bridge IPFIX parameters
can be passed as follows:
ovs-vsctl -- set Bridge br0 ipfix=@i \
-- --id=@i create IPFIX targets=\"192.168.0.34:4739\" \
obs_domain_id=123 obs_point_id=456 cache_active_timeout=60 \
cache_max_flows=13 \
other_config:enable-input-sampling=false \
other_config:enable-output-sampling=false
where the default values are:
enable_input_sampling: true
enable_output_sampling: true
But in the existing code these two parameters take on unexpected values
in some scenarios:
be_opts.enable_input_sampling = !smap_get_bool(&be_cfg->other_config,
"enable-input-sampling", false);
be_opts.enable_output_sampling = !smap_get_bool(&be_cfg->other_config,
"enable-output-sampling", false);
Here, the function smap_get_bool is being used with a negation.
This returns expected values for the default case (since the above code
will negate the "false" we get from the smap_get_bool function and return
the value "true") but unexpected values for the case where the sampling
value is passed through the CLI.
For example, if we pass "true" for other_config:enable-input-sampling
in the CLI, the above code will negate the "true" value we get from
the smap_get_bool function and return the value "false". The same would
be the case for enable_output_sampling.
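The straightforward correction implied above is to drop the negation and
pass the documented default of true directly, roughly:

    be_opts.enable_input_sampling = smap_get_bool(&be_cfg->other_config,
                                        "enable-input-sampling", true);
    be_opts.enable_output_sampling = smap_get_bool(&be_cfg->other_config,
                                        "enable-output-sampling", true);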
Acked-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: Sayali Naval <sanaval@cisco.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Some control protocols are used to maintain link status between
forwarding engines (e.g. LACP). When the system is not sized properly,
the PMD threads may not be able to process all incoming traffic from the
configured Rx queues. When a signaling packet of such protocols is
dropped, it can cause link flapping, worsening the situation.
Use the rte_flow API to redirect these protocols into a dedicated Rx
queue. The assumption is made that the ratio between control protocol
traffic and user data traffic is very low and thus this dedicated Rx
queue will never get full. Re-program the RSS redirection table to only
use the other Rx queues.
The additional Rx queue will be assigned a PMD core like any other Rx
queue. Polling that extra queue may introduce increased latency and
a slight performance penalty, with the benefit of preventing link flapping.
This feature must be enabled per port on specific protocols via the
rx-steering option. This option takes "rss" followed by a "+" separated
list of protocol names. It is only supported on ethernet ports. This
feature is experimental.
If the user has already configured multiple Rx queues on the port, an
additional one will be allocated for control packets. If the hardware
cannot satisfy the number of requested Rx queues, the last Rx queue will
be assigned to the control plane. If only one Rx queue is available, the
rx-steering feature will be disabled. If the hardware does not support
the rte_flow matchers/actions, the rx-steering feature will be
completely disabled on the port and regular RSS will be performed
instead.
It cannot be enabled when other-config:hw-offload=true as it may
conflict with the offloaded flows. Similarly, if hw-offload is enabled,
custom rx-steering will be forcibly disabled on all ports and replaced
by regular RSS.
Example use:
ovs-vsctl add-bond br-phy bond0 phy0 phy1 -- \
set interface phy0 type=dpdk options:dpdk-devargs=0000:ca:00.0 -- \
set interface phy0 options:rx-steering=rss+lacp -- \
set interface phy1 type=dpdk options:dpdk-devargs=0000:ca:00.1 -- \
set interface phy1 options:rx-steering=rss+lacp
As a starting point, only one protocol is supported: LACP. Other
protocols can be added in the future. NIC compatibility should be
checked.
To validate that this works as intended, I used a traffic generator to
generate random traffic slightly above the machine capacity at line rate
on a two-port bond interface. OVS is configured to receive traffic on
two VLANs and pop/push them in a br-int bridge based on tags set on
patch ports.
+----------------------+
| DUT |
|+--------------------+|
|| br-int || in_port=patch10,actions=mod_dl_src:$patch11,
|| || mod_dl_dst:$tgen0,
|| || output:patch10
|| || in_port=patch11,actions=mod_dl_src:$patch10
|| || mod_dl_dst:$tgen0,
|| patch10 patch11 || output:patch10
|+---|-----------|----+|
| | | |
|+---|-----------|----+|
|| patch00 patch01 ||
|| tag:10 tag:20 ||
|| ||
|| br-phy || default flow, action=NORMAL
|| ||
|| bond0 || balance-slb, lacp=passive, lacp-time=fast
|| phy0 phy1 ||
|+------|-----|-------+|
+-------|-----|--------+
| |
+-------|-----|--------+
| port0 port1 | balance L3/L4, lacp=active, lacp-time=fast
| lag | mode trunk VLANs 10, 20
| |
| switch |
| |
| vlan 10 vlan 20 | mode access
| port2 port3 |
+-----|----------|-----+
| |
+-----|----------|-----+
| tgen0 tgen1 | Random traffic that is properly balanced
| | across the bond ports in both directions.
| traffic generator |
+----------------------+
Without rx-steering, the bond0 links are randomly switching to
"defaulted" when one of the LACP packets sent by the switch is dropped
because the RX queues are full and the PMD threads did not process them
fast enough. When that happens, all traffic must go through a single
link, which causes the traffic above line rate to be dropped.
~# ovs-appctl lacp/show-stats bond0
---- bond0 statistics ----
member: phy0:
TX PDUs: 347246
RX PDUs: 14865
RX Bad PDUs: 0
RX Marker Request PDUs: 0
Link Expired: 168
Link Defaulted: 0
Carrier Status Changed: 0
member: phy1:
TX PDUs: 347245
RX PDUs: 14919
RX Bad PDUs: 0
RX Marker Request PDUs: 0
Link Expired: 147
Link Defaulted: 1
Carrier Status Changed: 0
When rx-steering is enabled, no LACP packet is dropped and the bond
links remain enabled at all times, maximizing the throughput. Neither
the "Link Expired" nor the "Link Defaulted" counters are incremented
anymore.
This feature may be considered as "QoS". However, it does not work by
limiting the rate of traffic explicitly. It only guarantees that some
protocols have a lower chance of being dropped because the PMD cores
cannot keep up with regular traffic.
The choice of protocols is limited on purpose. This is not meant to be
configurable by users. Some limited configurability could be considered
in the future, but it would expose users to more potential issues if they
accidentally redirect all traffic to the isolated queue.
Acked-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Robin Jarry <rjarry@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
It supports flowlabel-based load balancing by controlling the flowlabel
of the outer IPv6 header, which is already implemented in the Linux
kernel as the seg6_flowlabel sysctl [1].
[1]: https://docs.kernel.org/networking/seg6-sysctl.html
Signed-off-by: Nobuhiro MIKI <nmiki@yahoo-corp.jp>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The description of SRv6 was missing in vswitch.xml, which is
used to generate the man page, so this patch adds it.
Fixes: 03fc1ad78521 ("userspace: Add SRv6 tunnel support.")
Signed-off-by: Nobuhiro MIKI <nmiki@yahoo-corp.jp>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Open vSwitch generally tries to let the underlying operating system
manage the low-level details of hardware, for example DMA mapping,
bus arbitration, etc. However, when using DPDK, the underlying
operating system yields control of many of these details to userspace
for management.
In the case of some DPDK port drivers, configuring rte_flow or even
allocating resources may require access to iopl/ioperm calls, which
are guarded by the CAP_SYS_RAWIO privilege on Linux systems. These
calls are dangerous, and can allow a process to completely compromise
a system. However, they are needed in the case of some userspace
driver code which manages the hardware (for example, the mlx
implementation of backend support for rte_flow).
Here, we create an opt-in flag passed to the command line to allow
this access. We need to do this before ever accessing the database,
because we want to drop all privileges asap, and cannot wait for
a connection to the database to be established and functional before
dropping. There may be distribution-specific ways to do capability
management as well (using, for example, systemd), but they are not
as universal for the vswitchd as a flag.
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Aaron Conole <aconole@redhat.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Acked-by: Gaetan Rivet <gaetanr@nvidia.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Depending on the driver implementation, it can take from 0.2 seconds
up to 2 seconds before offloaded flow statistics are updated. This is
true for both TC and rte_flow-based offloading. This is causing a
problem with min-revalidate-pps, as old statistic values are used
during this period.
This fix will wait for at least 2 seconds, by default, before assuming no
packets were received during this period.
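A rough sketch of the idea with invented names: do not apply the
min-revalidate-pps check until the flow has existed long enough for the
offloaded statistics to have been refreshed at least once.

    /* Illustration only: names and the constant are hypothetical. */
    #define OFFLOAD_STATS_DELAY_MS 2000

    static bool
    pps_check_applies(long long int now_ms, long long int flow_created_ms)
    {
        /* Offloaded statistics may lag by up to ~2 seconds, so a younger
         * flow may still look idle even though packets were received. */
        return now_ms - flow_created_ms >= OFFLOAD_STATS_DELAY_MS;
    }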
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Add options to the IPFIX table to configure the interval at which
statistics and template information are sent.
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Add support to count upcall packets per port, both succeeded and failed,
which is a better way to see how many packets are upcalled on each interface.
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: wangchuanlei <wangchuanlei@inspur.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Today the minimum value for this setting is 1. This patch allows it to
be 0, meaning pps is not checked at all and revalidation is always done.
This is particularly useful for environments where some of the
applications with long-lived connections may have very low traffic for
certain periods but have a high rate of bursts periodically. It is
desirable to keep the datapath flows instead of periodically deleting
them to avoid bursts of packet misses to userspace.
When set to 0, there may be more datapath flows to be revalidated,
resulting in a higher CPU cost for revalidator threads. This is the
downside, but in certain cases this is still more desirable than packet
misses to user space.
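A sketch of the intended semantics with a hypothetical helper, not the
actual patch: a value of 0 disables the rate check entirely, so flows are
always revalidated.

    static bool
    worth_revalidating(uint64_t pps, uint64_t min_revalidate_pps)
    {
        if (!min_revalidate_pps) {
            return true;   /* 0 means: never skip revalidation by rate. */
        }
        return pps >= min_revalidate_pps;
    }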
Signed-off-by: Han Zhou <hzhou@ovn.org>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Now that the timer slack for the PMD threads is reduced we can also
reduce the start/increment for PMD load based sleeping to match it.
This will further reduce initial sleep times making it more resilient
to interfaces that might be sensitive to large sleep times.
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Reviewed-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Sleep for an incremental amount of time if none of the Rx queues
assigned to a PMD have at least half a batch of packets (i.e. 16 pkts)
on a polling iteration of the PMD.
Upon detecting the threshold of >= 16 pkts on an Rxq, reset the
sleep time to zero (i.e. no sleep).
Sleep time will be increased on each iteration where the low load
conditions remain, up to a total of the max sleep time which is set
by the user, e.g.:
ovs-vsctl set Open_vSwitch . other_config:pmd-maxsleep=500
The default pmd-maxsleep value is 0, which means that no sleeps
will occur and the default behaviour is unchanged from previously.
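The loop described above might look roughly like the following standalone
sketch; the constant and helper names are illustrative only, not the
actual OVS code:

    #define SLEEP_INCREMENT_US 1       /* Hypothetical step size. */
    #define HALF_BATCH_PKTS    16

    uint64_t sleep_us = 0;

    for (;;) {
        int max_burst = poll_all_rxqs();   /* Largest burst on any rxq. */

        if (max_burst >= HALF_BATCH_PKTS) {
            sleep_us = 0;                  /* Load detected: stop sleeping. */
        } else if (sleep_us < max_sleep_us) {  /* max_sleep_us: pmd-maxsleep. */
            sleep_us += SLEEP_INCREMENT_US;
        }

        if (sleep_us) {
            usleep(sleep_us);              /* Stand-in for the real sleep. */
        }
    }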
Also add new stats to pmd-perf-show to get visibility of the operation,
e.g.:
...
- sleep iterations: 153994 ( 76.8 % of iterations)
Sleep time (us): 9159399 ( 59 us/iteration avg.)
...
Reviewed-by: Robin Jarry <rjarry@redhat.com>
Reviewed-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
When openvswitch is restarted, the BFD status is kept from before
ovs-vswitchd was running. And if ovs-vswitchd is under a high workload,
updating the BFD status will be deferred, which is not what we expect.
Signed-off-by: Daniel Ding <zhihui.ding@easystack.cn>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
100 Mbps was a fair assumption 13 years ago. These days, 10 Gbps seems
like a good value in case no information is available otherwise.
The change mainly affects QoS which is currently limited to 100 Mbps if
the user didn't specify 'max-rate' and the card doesn't report the
speed or OVS doesn't have a predefined enumeration for the speed
reported by the NIC.
Calculation of the path cost for STP/RSTP is also affected if OVS is
unable to determine the link speed.
Lower link speed adapters are typically good at reporting their speed,
so the chances of overshooting should be low. But newer high-speed
adapters, for which there is no speed enumeration or which have some
other issues, will not suffer as much.
Acked-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The count of received multicast packets has been computed internally,
but not exposed to ovsdb. Fix this.
Signed-off-by: David Marchand <david.marchand@redhat.com>
Acked-by: Mike Pattrick <mkp@redhat.com>
Acked-by: Michael Santana <msantana@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
For some reason it is documented as 'rstp-port-path-cost', while
the code and some other bits of documentation use 'rstp-path-cost'.
Fixes: 9efd308e957c ("Rapid Spanning Tree Protocol (IEEE 802.1D).")
Reviewed-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The README file still mentions a kernel module and some parts of
the documentation still have XenServer references, e.g. 'xs-*'
database configuration options. Removing them.
Fixes: 422e90437854 ("make: Remove the Linux datapath.")
Fixes: 83c9518e7c67 ("xenserver: Remove xenserver.")
Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Remove the current xenserver implementation - it is obsolete and
since 3.0 we do not support kernel module builds [1].
1. https://mail.openvswitch.org/pipermail/ovs-dev/2022-July/395789.html
[i.maximets]
Can be added back if people willing to maintain it are found.
Signed-off-by: Greg Rose <gvrose8192@gmail.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
This config param allows the delivery of broadcast and multicast
packets to the secondary interface of non-LACP bonds, equivalent
to the 'all_slaves_active' option for Linux kernel bonds.
Reported-at: https://bugzilla.redhat.com/1720935
Signed-off-by: Christophe Fontaine <cfontain@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
This commit adds a new command to allow the user to switch
the active action implementation at runtime.
Usage:
$ ovs-appctl odp-execute/action-impl-set scalar
This commit also adds a new command to retrieve the list of available
action implementations. This can be used to check what implementations
of actions are available and which implementation is active at runtime.
Usage:
$ ovs-appctl odp-execute/action-impl-show
Added separate test-case for ovs-actions show/set commands:
odp-execute - actions implementation
Signed-off-by: Emma Finn <emma.finn@intel.com>
Signed-off-by: Kumar Amber <kumar.amber@intel.com>
Signed-off-by: Sunil Pai G <sunil.pai.g@intel.com>
Co-authored-by: Kumar Amber <kumar.amber@intel.com>
Co-authored-by: Sunil Pai G <sunil.pai.g@intel.com>
Acked-by: Harry van Haaren <harry.van.haaren@intel.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
This commit introduces the initial infrastructure required to allow
different implementations for OvS actions. The patch introduces action
function pointers which allow the user to switch between the different
action implementations available. This will allow for more performance
and flexibility so the user can choose the action implementation that
best suits their use case.
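Conceptually, the infrastructure boils down to a table of per-implementation
function pointers that the datapath calls through; the sketch below uses
invented names rather than the actual OvS structures:

    struct action_impl {
        const char *name;              /* e.g. "scalar". */
        bool (*probe)(void);           /* Is this impl usable on this CPU? */
        void (*execute)(void *dp, struct dp_packet_batch *batch,
                        const struct nlattr *actions, size_t actions_len);
    };

    static const struct action_impl *active_impl;   /* Switched at runtime. */

    static inline void
    odp_execute(void *dp, struct dp_packet_batch *batch,
                const struct nlattr *actions, size_t actions_len)
    {
        active_impl->execute(dp, batch, actions, actions_len);
    }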
Signed-off-by: Emma Finn <emma.finn@intel.com>
Acked-by: Harry van Haaren <harry.van.haaren@intel.com>
Acked-by: Sunil Pai G <sunil.pai.g@intel.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
Mempools may currently be shared between DPDK ports based
on port MTU and NUMA. With some hint from the user we can
increase the sharing on MTU and hence reduce memory
consumption in many cases.
For example, a port with MTU 9000, uses a mempool with an
mbuf size based on 9000 MTU. A port with MTU 1500, uses a
different mempool with an mbuf size based on 1500 MTU.
In this case, assuming same NUMA, both these ports could
share the 9000 MTU mempool.
The user must give a hint, as the order of creation of ports and
setting of MTUs may vary, and we need to ensure that upgrades
from older OVS versions do not require more memory.
This scheme can also prevent multiple mempools being created
for cases where a port is added picking up a default MTU and
an appropriate mempool, but later has its MTU changed to a
different value requiring a different mempool.
Example usage:
$ ovs-vsctl --no-wait set Open_vSwitch . \
other_config:shared-mempool-config=9000,1500:1,6000:1
Port added on NUMA 0:
* MTU 1500, use mempool based on 9000 MTU
* MTU 5000, use mempool based on 9000 MTU
* MTU 9000, use mempool based on 9000 MTU
* MTU 9300, use mempool based on 9300 MTU (existing behaviour)
Port added on NUMA 1:
* MTU 1500, use mempool based on 1500 MTU
* MTU 5000, use mempool based on 6000 MTU
* MTU 9000, use mempool based on 9000 MTU
* MTU 9300, use mempool based on 9300 MTU (existing behaviour)
Default behaviour is unchanged and mempools are still only created
when needed.
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Reviewed-by: David Marchand <david.marchand@redhat.com>
Acked-by: Sunil Pai G <sunil.pai.g@intel.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
My XML editor keeps autofixing these, which means I have to be
careful during 'git add' for unrelated changes. Might as well
just fix them.
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Reviewed-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Tunnels in LibreSwan and OpenSwan allow for many options to be set on a
per-tunnel basis. Pass through any options starting with ipsec_ to the
connection in the configuration file. Administrators are responsible for
picking valid key/value pairs.
Signed-off-by: Andreas Karis <ak.karis@gmail.com>
Acked-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Using the SHORT version of the *_SAFE loops makes the code cleaner and
less error-prone. So, use the SHORT version and remove the extra variable
when possible for hmap and all its derived types.
In order to be able to use both long and short versions without changing
the name of the macro for all the clients, overload the existing name
and select the appropriate version depending on the number of arguments.
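The overloading trick relies on counting macro arguments to pick a variant;
a standalone sketch of the pattern (not the exact OVS macros) looks like
this:

    /* Pick the 5th argument, whatever it is. */
    #define SAFE_PICK(_1, _2, _3, _4, NAME, ...) NAME

    /* The LONG form keeps the user-visible NEXT variable, the SHORT form
     * hides it.  Bodies elided here. */
    #define FOR_EACH_SAFE_LONG(NODE, NEXT, MEMBER, LIST)
    #define FOR_EACH_SAFE_SHORT(NODE, MEMBER, LIST)

    /* 3 arguments select SHORT, 4 arguments select LONG.  The trailing 0
     * only keeps the variadic part non-empty. */
    #define FOR_EACH_SAFE(...)                                           \
        SAFE_PICK(__VA_ARGS__, FOR_EACH_SAFE_LONG, FOR_EACH_SAFE_SHORT,  \
                  0)(__VA_ARGS__)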
Acked-by: Dumitru Ceara <dceara@redhat.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Using the SHORT version of the *_SAFE loops makes the code cleaner
and less error-prone. So, use the SHORT version and remove the extra
variable when possible.
In order to be able to use both long and short versions without changing
the name of the macro for all the clients, overload the existing name
and select the appropriate version depending on the number of arguments.
Acked-by: Dumitru Ceara <dceara@redhat.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Expose functions reporting user configuration of offloading threads, as
well as utility functions for multithreading.
This will only expose the configuration knob to the user, while no
datapath will implement the multiple thread request.
This will allow implementations to use this API for offload thread
management in relevant layers before enabling the actual dataplane
implementation.
The offload thread ID is lazily allocated and can as such be in a
different order than the offload thread start sequence.
The RCU thread will sometimes access hardware-offload objects from
a provider for reclamation purposes. In such a case, it will get
a default offload thread ID of 0. Care must be taken that using
this thread ID is safe concurrently with the offload threads.
Signed-off-by: Gaetan Rivet <grive@u256.net>
Reviewed-by: Eli Britstein <elibr@nvidia.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
This patch defines the OVS_STATIC_TRACE() macro, and as an
example, adds two of them in the bridge run loop.
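As an illustration of what such a macro typically expands to, a USDT-style
static tracepoint can be built on the systemtap <sys/sdt.h> probes; the
sketch below is an assumption about the general shape, not the definition
from this patch:

    /* Hypothetical sketch of a USDT-backed static trace macro. */
    #ifdef HAVE_USDT_PROBES            /* Hypothetical configure guard. */
    #include <sys/sdt.h>
    #define OVS_STATIC_TRACE(probe) DTRACE_PROBE(ovs, probe)
    #else
    #define OVS_STATIC_TRACE(probe) do { } while (0)
    #endif

    /* Example use in the bridge run loop:
     *     OVS_STATIC_TRACE(run_start);
     */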
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Paolo Valerio <pvalerio@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
netdev_set_dpif_type() can only be used with a normalized dpif type
as an argument, which is a constant static string derived from a type
of a dpif_class or a constant string "system". Usage of a same
constant string allows netdev-offload module to compare types by
simply comparing pointers.
OTOH, 'br->ofproto->type' is a dynamic string that:
a. Can be NULL.
b. Even if not NULL and equal, can be a different dynamically
allocated string.
Both of these qualities break assumptions made by all other modules
related to HW offload, breaking the functionality.
Fix that by moving netdev_set_dpif_type() to dpif.c and calling with
a correct constant string as an argument.
The call is moved from bridge.c to dpif.c, because we need to have access
to the dpif class there, which bridge.c should not.
We're not trying to set the dpif_type inside netdev_ports_insert(),
because it's now used outside the offloading context. So, it's
cleaner to move the netdev_set_dpif_type() call outside of the
netdev-offload module.
Additionally removed the redundant call from the netdev_ports_insert()
and refactored the function, since it doesn't need an extra argument
anymore.
Fixes: 4f19a78a61c5 ("netdev-vport: Fix userspace tunnel ioctl(SIOCGIFINDEX) info logs.")
Reported-by: Roi Dayan <roid@nvidia.com>
Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2021-December/390117.html
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Reviewed-by: Lin Huang <linhuang@ruijie.com.cn>
Acked-by: Roi Dayan <roid@nvidia.com>
A userspace tunnel doesn't have a valid device in the kernel, so the
get_ifindex() function (ioctl) always gets an error when adding a port,
deleting a port or updating a port's status.
The info log is
"2021-08-29T09:17:39.830Z|00059|netdev_linux|INFO|ioctl(SIOCGIFINDEX)
on vxlan_sys_4789 device failed: No such device"
If there are a lot of userspace tunnel ports on a bridge, the
iface_refresh_netdev_status() function will waste a lot of time.
So, ignore the ioctl(SIOCGIFINDEX) operation for userspace tunnel ports
and just return -ENODEV.
Signed-off-by: Lin Huang <linhuang@ruijie.com.cn>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Added a new function to return memory usage statistics for database
objects inside the IDL. The statistics are similar to what ovsdb-server
reports. The _Server database is not counted, as it should be small,
hence it isn't worth adding extra code to the ovsdb-cs module. It can
be added later if needed.
ovs-vswitchd is a user in OVS, but this API will be mostly useful for
OVN daemons.
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Han Zhou <hzhou@ovn.org>
Acked-by: Dumitru Ceara <dceara@redhat.com>
The current algorithm of ovsdb_datum_union() looks like this:
for-each atom in b:
    if not bin_search(a, atom):
        push(a, clone(atom))
quicksort(a)
So, the complexity looks like this:
Nb * log2(Na)                 <- comparisons for search
+ Nb                          <- clones
+ (Na + Nb) * log2(Na + Nb)   <- comparisons for quicksort
ovsdb_datum_union() is heavily used in database transactions when a
new element is added to a set. For example, when a new logical switch
port is added to a logical switch in OVN. This is a very common
use case where the CMS adds one new port to an existing switch that
already has, let's say, 100 ports. For this case ovsdb-server will
have to perform:
1 * log2(100)      <- comparisons for search    (~7)
+ 1 clone          <- clones                    (1)
+ 101 * log2(101)  <- comparisons for quicksort (~707)
Roughly 714 comparisons of atoms and 1 clone.
Since binary search can give us the position where the new atom should
go (it's the 'low' index after the search completes) for free, the
logic can be re-worked like this:
copied = 0
for-each atom in b:
    desired_position = bin_search(a, atom)
    push(result, a[ copied : desired_position - 1 ])
    copied = desired_position
    push(result, clone(atom))
push(result, a[ copied : Na ])
swap(a, result)
Complexity of this schema:
Nb * log2(Na)   <- comparisons for search
+ Nb            <- clones
+ Na            <- memory copies on push
'swap' is just a swap of a few pointers. 'push' is not a 'clone',
but a simple memory copy of 'union ovsdb_atom'.
In general, this schema substitutes the complexity of a quicksort
with the complexity of a memory copy of Na atom structures, where we're
not even copying the strings that these atoms are pointing to.
Complexity in the example above goes down from 714 comparisons
to 7 comparisons and memcpy of 100 * sizeof (union ovsdb_atom) bytes.
The general complexity of a memory copy should always be lower than the
complexity of a quicksort, especially because these copies are usually
performed in bulk, so this new schema should work faster for any input.
All in all, this change allows executing several times more
transactions per second for transactions that add new entries to sets.
Alternatively, union can be implemented as a linear merge of two
sorted arrays, but this will result in O(Na) comparisons, which
is more than Nb * log2(Na) in the common case, since Na is usually
far bigger than Nb. Linear merge will also mean per-atom memory
copies instead of copying in bulk.
The 'replace' functionality of ovsdb_datum_union() had no users, so it
was just removed. But it can easily be added back if needed in the future.
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Han Zhou <hzhou@ovn.org>
Acked-by: Mark D. Gray <mark.d.gray@redhat.com>
This change removes the automatic memory limit on start-up of OVS with
DPDK. As DPDK supports dynamic memory allocation, there is no
need to limit the amount of memory available, if not requested.
Currently, if socket-limit is not configured, it is set to the value of
socket-mem. With this change, the user can decide to set it or have no
memory limit.
Removed logs that announce this change and fixed documentation.
Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=1949850
Signed-off-by: Rosemarie O'Riorden <roriorde@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
This change removes the default values for EAL args socket-mem and
socket-limit. As DPDK supports dynamic memory allocation, there is no
need to allocate a certain amount of memory on start-up, nor limit the
amount of memory available, if not requested.
Currently, socket-mem has a default value of 1024 when it is not
configured by the user, and socket-limit takes on the value of
socket-mem, 1024, by default. With this change, socket-mem is not
configured by default, meaning that socket-limit is not either.
Neither, either or both options can be set.
Removed extra logs that announce this change and fixed documentation.
Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=1949850
Signed-off-by: Rosemarie O'Riorden <roriorde@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The Open vSwitch kernel module uses the upcall mechanism to send
packets from kernel space to user space when it misses in the kernel
space flow table. The upcall sends packets via a Netlink socket.
Currently, a Netlink socket is created for every vport. In this way,
there is a 1:1 mapping between a vport and a Netlink socket.
When a packet is received by a vport, if it needs to be sent to
user space, it is sent via the corresponding Netlink socket.
This mechanism, with various iterations of the corresponding user
space code, has seen some limitations and issues:
* On systems with a large number of vports, there is correspondingly
a large number of Netlink sockets which can limit scaling.
(https://bugzilla.redhat.com/show_bug.cgi?id=1526306)
* Packet reordering on upcalls.
(https://bugzilla.redhat.com/show_bug.cgi?id=1844576)
* A thundering herd issue.
(https://bugzilla.redhat.com/show_bug.cgi?id=1834444)
This patch introduces an alternative, feature-negotiated, upcall
mode using a per-cpu dispatch rather than a per-vport dispatch.
In this mode, the Netlink socket to be used for the upcall is
selected based on the CPU of the thread that is executing the upcall.
In this way, it resolves the issues above as:
a) The number of Netlink sockets scales with the number of CPUs
rather than the number of vports.
b) Ordering per-flow is maintained as packets are distributed to
CPUs based on mechanisms such as RSS and flows are distributed
to a single user space thread.
c) Packets from a flow can only wake up one user space thread.
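A simplified sketch of the selection, with illustrative names: with per-cpu
dispatch, the upcall socket is a function of the CPU handling the packet
rather than of the vport.

    static struct nl_sock *
    upcall_sock_for_cpu(struct nl_sock **socks, size_t n_socks,
                        unsigned int cpu)
    {
        return socks[cpu % n_socks];
    }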
Reported-at: https://bugzilla.redhat.com/1844576
Signed-off-by: Mark Gray <mark.d.gray@redhat.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Pinning an rxq to a PMD with pmd-rxq-affinity may be done for
various reasons such as reserving a full PMD for an rxq, or to
ensure that multiple rxqs from a port are handled on different PMDs.
Previously pmd-rxq-affinity always isolated the PMD so no other rxqs
could be assigned to it by OVS. There may be cases where there are
unused cycles on those PMDs and the user would like other rxqs to
also be assignable to them by OVS.
Add an option to pin the rxq and not isolate the PMD. The default
behaviour is unchanged, which is to pin and isolate the PMD.
In order to pin and non-isolate:
ovs-vsctl set Open_vSwitch . other_config:pmd-rxq-isolate=false
Note this is available only with the group assignment type, as pinning
conflicts with the operation of the other rxq assignment algorithms.
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: Sunil Pai G <sunil.pai.g@intel.com>
Acked-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
Add an rxq scheduling option that allows rxqs to be grouped
on a pmd based purely on their load.
The current default 'cycles' assignment sorts rxqs by measured
processing load and then assigns them to a list of round robin PMDs.
This helps to keep the rxqs that require most processing on different
cores but as it selects the PMDs in round robin order, it equally
distributes rxqs to PMDs.
'cycles' assignment has the advantage in that it separates the most
loaded rxqs from being on the same core but maintains the rxqs being
spread across a broad range of PMDs to mitigate against changes to
traffic pattern.
'cycles' assignment has the disadvantage that, in order to make the
trade-off between optimising for the current traffic load and mitigating
against future changes, it tries to assign an equal amount of rxqs
per PMD in a round robin manner, and this can lead to a less than
optimal balance of the processing load.
Now that PMD auto load balance can help mitigate future changes in
traffic patterns, a 'group' assignment can be used to assign rxqs based
on their measured cycles and the estimated running total of the PMDs.
In this case, there is no restriction about keeping equal number of
rxqs per PMD as it is purely load based.
This means that one PMD may have a group of low load rxqs assigned to it
while another PMD has one high load rxq assigned to it, as that is the
best balance of their measured loads across the PMDs.
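The 'group' policy is essentially a greedy assignment by measured load; a
standalone sketch of the idea (not the OVS implementation):

    /* Assign rxqs, sorted by measured cycles (largest first), to whichever
     * PMD currently has the lowest estimated load. */
    struct rxq_load { uint64_t cycles; size_t pmd; };
    struct pmd_load { uint64_t est_cycles; };

    static void
    assign_rxqs_by_group(struct rxq_load *rxqs, size_t n_rxqs,
                         struct pmd_load *pmds, size_t n_pmds)
    {
        for (size_t i = 0; i < n_rxqs; i++) {
            size_t best = 0;

            for (size_t j = 1; j < n_pmds; j++) {
                if (pmds[j].est_cycles < pmds[best].est_cycles) {
                    best = j;
                }
            }
            rxqs[i].pmd = best;
            pmds[best].est_cycles += rxqs[i].cycles;
        }
    }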
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: Sunil Pai G <sunil.pai.g@intel.com>
Acked-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
Currently there is an option to add/flush/show ARP/ND neighbors. This
covers the L3 side. For the L2 side, there is only the fdb show command.
This commit gives an option to add/del an fdb entry via ovs-appctl.
The CLI commands look like:
To add:
ovs-appctl fdb/add <bridge> <port> <vlan> <Mac>
ovs-appctl fdb/add br0 p1 0 50:54:00:00:00:05
To del:
ovs-appctl fdb/del <bridge> <vlan> <Mac>
ovs-appctl fdb/del br0 0 50:54:00:00:00:05
Added two new APIs to provide a convenient interface to add and delete
static-macs.
bool xlate_add_static_mac_entry(const struct ofproto_dpif *,
                                ofp_port_t in_port,
                                struct eth_addr dl_src, int vlan);
bool xlate_delete_static_mac_entry(const struct ofproto_dpif *,
                                   struct eth_addr dl_src, int vlan);
1. A static entry should not age. To indicate that the entry being
   programmed is a static entry, the 'expires' field in 'struct mac_entry'
   will be set to MAC_ENTRY_AGE_STATIC_ENTRY. A check for this value is
   made while deleting a mac entry as part of the regular aging process
   (see the sketch after this list).
2. Another change is to the mac-update logic: when a packet with the same
   dl_src as that of a static-mac entry arrives on any port, the logic
   will not modify the 'expires' field.
3. While flushing fdb entries, make sure static ones are not evicted.
4. Updated "ovs-appctl fdb/stats-show br0" to display the number of
   static entries in the switch.
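For example, the aging path could skip such entries with a check along
these lines (a sketch, not the exact patch):

    /* Sketch: a static entry is marked with the sentinel 'expires' value
     * and is never evicted by the aging sweep. */
    static bool
    mac_entry_is_static(const struct mac_entry *e)
    {
        return e->expires == MAC_ENTRY_AGE_STATIC_ENTRY;
    }

    /* ...and in the aging sweep: */
    if (mac_entry_is_static(e) || e->expires > now) {
        continue;   /* Keep static and still-fresh entries. */
    }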
Added the following tests:
ofproto-dpif - static-mac add/del/flush
ofproto-dpif - static-mac mac moves
Reported-at: https://mail.openvswitch.org/pipermail/ovs-discuss/2019-June/048894.html
Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=1597752
Signed-off-by: Vasu Dasari <vdasari@gmail.com>
Tested-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Deprecate the current OVS-provided defaults for DPDK socket-mem and
socket-limit, which are planned to be removed in OVS 2.17. At that point
DPDK defaults will be used instead. Warnings have been added to alert
users in advance.
Signed-off-by: Rosemarie O'Riorden <roriorde@redhat.com>
Acked-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
Currently, conntrack in the kernel has an undocumented feature referred
to as all-zero IP address SNAT. Basically, when a source port
collision is detected during the commit, the source port will be
translated to an ephemeral port. If there is no collision, no SNAT is
performed.
This patchset documents this behavior and adds a self-test to verify
it's not changing. In addition, a datapath feature flag is added for
the all-zero IP SNAT case. This will help applications on top of OVS,
like OVN, to determine whether this feature can be used.
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Aaron Conole <aconole@redhat.com>
Acked-by: Dumitru Ceara <dceara@redhat.com>
Acked-by: Alin-Gabriel Serdean <aserdean@ovn.org>
Acked-by: Paolo Valerio <pvalerio@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>