Using larger rxqs can be beneficial in highly bursty setups.
Remove the artificial limit on the number of descriptors in rxq and
txq; the device driver will limit the values in any case.
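For example, a descriptor count beyond the previous cap can now be
requested and the driver will clamp it if unsupported (interface name
illustrative):
$ ovs-vsctl set Interface dpdk0 options:n_rxq_desc=8192 \
    options:n_txq_desc=8192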
Reported-at: https://issues.redhat.com/browse/FDP-1415
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
STT and LISP tunnel types were deprecated and marked for removal in
the following commits in the OVS 3.5 release:
3b37a6154a59 ("netdev-vport: Deprecate STT tunnel port type.")
8d7ac031c03d ("netdev-vport: Deprecate LISP tunnel port type.")
The main reasons were that STT was rejected in the upstream kernel,
and LISP was never upstreamed either, so it doesn't really have a
supported implementation. Both protocols also appear to have lost
their former relevance.
Remove both now. While at it, also fix some small documentation
issues and comments.
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Alin Serdean <aserdean@ovn.org>
Acked-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
ovs_get_program_version() already returns the formatted program name
and version, so there is no need to format them again.
Signed-off-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
This patch uses the new rte_vhost_driver_set_max_queue_num
API to set the maximum number of queue pairs supported by
the vhost-user port.
This is required for VDUSE, which needs the maximum number of queue
pairs to be specified at creation time. Without it, metadata for
128 queue pairs would be allocated.
To configure it, a new 'vhost-max-queue-pairs' option is
introduced.
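For example, to size a vhost-user port for at most 4 queue pairs
(port name illustrative):
$ ovs-vsctl set Interface vhost0 options:vhost-max-queue-pairs=4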
Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
SSL protocol family is not actually being used or supported in OVS.
What we use is actually TLS.
Terms "SSL" and "TLS" are often used interchangeably in modern
software and refer to the same thing, which is normally just TLS.
Let's replace "SSL" with "SSL/TLS" in documentation and user-visible
messages, where it makes sense. This may make it clearer to a less
experienced user who looks for TLS support in OVS and would otherwise
not find much.
We're not changing any actual code because, for example, most of the
OpenSSL APIs use just SSL, for historical reasons, and our database
uses the "SSL" table. We may consider migrating to "TLS" naming for
user-visible configuration like command-line arguments and database
names, but that will require extra work to make sure upgrades still
work. In general, slightly clearer documentation should be enough
for now, especially since the term SSL is still widely used in the
industry.
"SSL/TLS" is chosen over "TLS/SSL" simply because our user-visible
configuration knobs are using "SSL" naming, e.g. '--ssl-cyphers'
or 'ovs-vsctl set-ssl'. So, it might be less confusing this way.
We may switch that, if we decide on re-working the user-visible
commands towards "TLS" naming, or providing both alternatives.
Some other projects did similar changes. For example, the python ssl
library is now using "TLS/SSL" in the documentation whenever possible.
Same goes for OpenSSL itself.
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
If prefixes are set for the flow table, ovs-vswitchd will print them
out to the log whenever something changes in the database. Since
normally prefixes will be set for every OpenFlow table, it will print
255 log messages per iteration. This is very annoying in dynamic
environments like Kubernetes, where database changes can happen
frequently, obscuring and erasing useful logs on log rotation.
These log messages are not very important. The information can be
looked up in the database and normally the values will not actually
change after initial setup. Move the log to debug level.
While at it, rate limit the warnings about misconfigured prefixes,
as they may be too numerous as well. And make the printout a little
nicer by only printing once when multiple adjacent tables have the
same prefixes configured. In most cases that reduces the amount of
logs from 255 lines to 1 per iteration with debug logging enabled.
We're now also logging the default values, since it's under debug
and will not actually add that many log lines with the new collapsed
format. This makes debug logs more accurate and useful.
An additional improvement might be to not print if nothing actually
changed, but that will require either per-table per-bridge tracking
of previous values or changing parts of the ofproto API to tell the
bridge layer if something changed or not. Doesn't seem necessary
at the moment.
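To see these messages after this change, enable debug logging, e.g.
(assuming the messages are logged from the ofproto_dpif module):
$ ovs-appctl vlog/set ofproto_dpif:dbg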
Fixes: 13751fd88c4b ("Classifier: Track address prefixes.")
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The STT tunnel implementation was rejected in the upstream Linux
kernel a long time ago and will probably never be there. So, the
only implementation for Linux is in the OOT kernel module shipped
with OVS 2.17, which is deprecated and will reach end of life in
Feb 2025.
In addition, modern network interface cards support various hardware
offload features with UDP tunnels, diminishing the main selling point
of STT - the ability to reuse hardware offload features meant for TCP.
Deprecate the port type now, so it can be removed once 2.17 is EoL.
There is another implementation for this tunnel type in the Windows
datapath. However, the protocol itself is considered harmful as it
may confuse stateful network hardware by pretending to be TCP (hence
the reason it was rejected in the Linux kernel). So, it is better if
we deprecate this implementation and stop supporting it as well.
The standard draft for the protocol itself is also expired and
archived with the latest update made in 2016:
https://datatracker.ietf.org/doc/draft-davie-stt/
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Tunnel support has been in the upstream Linux kernel since 2013.
However, despite the FAQ saying so, I'm not aware of any actual
attempts to bring support for LISP tunnels upstream. The only
available implementation is in the OOT kernel module shipped with
OVS 2.17, which is deprecated and will reach EoL in Feb 2025.
Mark the tunnel port type as deprecated, so we can fully remove the
support once the only available implementation reaches end of life
together with OVS 2.17.
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
It is common to have flow tables with OpenFlow rules for both IPv4 and
IPv6 traffic at the same time. One major example is OVN.
Today only IPv4 addresses and L4 ports are prefix matched by default.
Recently it became possible to turn on the optimization for all
4 fields at the same time (nw_dst, nw_src, ipv6_dst and ipv6_src),
but that is an inconvenience for users. IPv6 configurations are
becoming more and more common in the real world, so having IPv6
tables unoptimized by default is not good.
Enable prefix tree lookups for IPv6 addresses by default in addition
to IPv4.
This change increases memory consumption slightly as well as takes a
few extra cycles per flow to create and later iterate over additional
prefix trees. However, IPv4 and IPv6 matches are mutually exclusive,
so performance overhead should be minimal, especially in comparison
with the benefits this configuration brings to IPv6 setups reducing
the number of datapath flows by default.
A new test is added to check that the defaults are working as expected.
Acked-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Today users can enable prefix matches for tun_id, tun_src, tun_dst,
ip_src, ip_dst, ipv6_src and ipv6_dst. However, they are limited to
only 3 of these enabled at the same time. This means that if our flow
table is handling both IPv4 and IPv6 traffic, we can't optimize all
the addresses, we'll either have to split IPv4 and IPv6 rules into
separate tables or sacrifice one of the fields, as we can select only
3 out of 4 fields (ip_src, ip_dst, ipv6_src and ipv6_dst).
The maximum number of tries is a little arbitrary. Increasing it
slightly increases memory usage and may take a couple of extra
processing cycles, but it should not change classification results,
so it should be reasonable.
Actually enabling more prefixes will consume more memory and reduce
the efficiency of a single flow classification, but that's a trade-off
a user can make knowing the traffic pattern and what their particular
flow table looks like. While the efficiency of a single flow
classification may go down, the overall performance of the system may
be significantly improved by having far fewer datapath flows with
wider matches.
The number of tunnels in a typical setup is not that high, so I'm not
sure it makes sense to raise the limit any higher. At the same time,
combined IPv4 + IPv6 handling is pretty common; for example, that's
the case with OVN.
Tests in ofproto-dpif.at cover IPv4 and IPv6 address classification
separately, and these fields can't overlap, so not adding any new tests.
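With the limit raised, all four address fields can be enabled at the
same time, e.g. (assuming a Flow_Table record named t0 is attached
to the bridge):
$ ovs-vsctl set Flow_Table t0 prefixes=ip_src,ip_dst,ipv6_src,ipv6_dst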
Acked-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Currently a bridge mirror collects all packets, and tools like
ovs-tcpdump can only apply additional filters after they have already
been duplicated by vswitchd. This can result in inefficient collection.
This patch adds support to apply pre-selection to bridge mirrors, which
can limit which packets are mirrored based on flow metadata. This
significantly improves overall vswitchd performance during mirroring if
only a subset of traffic is required.
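A sketch of the intended configuration, assuming the pre-selection is
expressed as a match on the Mirror record via a 'filter' column added
by this patch:
$ ovs-vsctl -- --id=@m create Mirror name=m0 select_all=true \
    filter=\"tcp\" -- set Bridge br0 mirrors=@m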
Signed-off-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
On CentOS/RHEL, builds are based on stable branches rather than tags,
so, for debugging purposes, it's better to report the downstream
version; that makes it easier to know which commits are included in
a build.
This commit adds --with-version-suffix as a ./configure option in
order to set an OVS version suffix that is shown to the user via
ovs-vsctl -V and, therefore, also in the database, in ovs-vsctl show,
and in the other utilities.
--with-version-suffix is used in the Fedora/CentOS/RHEL spec file in
order to keep the version aligned with the downstream one.
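For example (suffix value illustrative):
$ ./configure --with-version-suffix=-1.el9
After this, the utilities report the suffixed version string, with
"-1.el9" appended to the base OVS version.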
Signed-off-by: Timothy Redaelli <tredaelli@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
When the flow translation results in a datapath action list whose
last action is an "observational" action, i.e. one generated for
IPFIX, sFlow or local sampling applications, the packet is actually
going to be dropped (and observed).
In that case, add an explicit drop action so that drop statistics
remain accurate. This behavior is controlled by a configurable
boolean knob called "explicit_sampled_drops".
Combine the "optimizations" and other odp_actions "tweaks" into a single
function.
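Assuming the knob is exposed through the Open_vSwitch table's
other_config (key spelling per this series), enabling it would look
like:
$ ovs-vsctl set Open_vSwitch . other_config:explicit-sampled-drops=true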
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Add a new column in the Flow_Sample_Collector_Set table named
"local_group_id" which enables this feature.
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Only the kernel datapath supports this action, so add a function in
dpif.c that checks for that.
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
This patch adopts the proposed RFC 6935 by allowing null UDP checksums
even if the tunnel protocol is IPv6. This is already supported by
Linux through the udp6zerocsumtx tunnel option. Zero checksums remain
disabled by default, and IPv6 tunnels are flagged as requiring a
checksum, but this patch enables the user to set csum=false on IPv6
tunnels.
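For example, a VXLAN-over-IPv6 port with zero UDP checksums
(addresses illustrative):
$ ovs-vsctl add-port br0 vx0 -- set Interface vx0 type=vxlan \
    options:remote_ip=2001:db8::2 options:csum=false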
Acked-by: Simon Horman <horms@ovn.org>
Signed-off-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The main purpose of locking the memory is to ensure that OVS can keep
doing what it did before in case of increased memory pressure, e.g.,
during VM ingest / migration. Fulfilling this requirement can be
achieved without locking all the allocated memory, but only the pages
already accessed in the past (faulted in). Processing of the new
traffic involves new memory allocations. Latency on these operations
can't be guaranteed by the locking. The main difference would be
the pre-faulting of the stack memory. However, in order to revalidate
or process upcalls on the same traffic, the same amount of stack is
likely needed, so all the necessary memory will already be faulted in.
Switch 'mlockall' to MCL_ONFAULT to avoid consuming unnecessarily
large amounts of RAM on systems with high core counts. For example,
in a densely populated OVN cluster this saves about 650 MB of RAM per
node on a system with 64 cores. This equates to 320 GB of allocated
but unused RAM in a 500 node cluster.
This also makes OVS better suited by default for small systems with
a limited amount of memory.
The MCL_ONFAULT flag was introduced in Linux kernel 4.4 and wasn't
available at the time '--mlockall' was introduced, but we can use it
now. Fall back to the old way of locking in case we're running on
an older kernel.
Only locking the faulted in pages also makes locking compatible with
vhost post-copy live migration by default, because we'll no longer
pre-fault all the guest's memory. Post-copy relies on userfaultfd
to work on shared huge pages, which is only available in 4.11+ kernels.
So, technically, it should not be possible for MCL_ONFAULT to fail
while the call without it succeeds. But keep the check for now, just
in case.
Acked-by: Simon Horman <horms@ovn.org>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Querying link status may get delayed for a nondeterministic (long)
time with mlx5 ports. This is a consequence of the mlx5 driver
calling the ethtool kernel API and getting stuck on the kernel RTNL
lock while some other operation is in progress under this lock.
One impact of a long link status query is that it is issued under the
bond lock, taken in write mode, periodically in bond_run().
In parallel, datapath threads may block requesting to read bonding
related info (like, for example, in bond_check_admissibility()).
The LSC interrupt mode is available with many DPDK drivers and is used by
default with testpmd.
It seems safe enough to switch on this feature by default in OVS.
We keep the per interface option to disable this feature in case of an
unforeseen bug.
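E.g., to opt a port out of interrupt mode:
$ ovs-vsctl set Interface dpdk0 options:dpdk-lsc-interrupt=false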
Signed-off-by: David Marchand <david.marchand@redhat.com>
Reviewed-by: Robin Jarry <rjarry@redhat.com>
Acked-by: Mike Pattrick <mkp@redhat.com>
Acked-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Acked-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: Aaron Conole <aconole@redhat.com>
Since the patch-set that included [1] there has been a policy of using
the term member for bonds, LACP, and bundle contexts. This is
consistent with the more recently adopted policy of using the inclusive
naming word list v1 [2, 3].
This patch addresses two instances where the term member should be
used in vswitch.xml. It does not address instances of alternative
wording that require code updates, which can be addressed as a
follow-up activity.
[1] 91fc374a9c5a ("Eliminate use of term "slave" in bond, LACP, and bundle contexts.")
[2] df5e5cf4318a ("Documentation: Add section on inclusive language.")
[3] https://inclusivenaming.org/word-lists/
Signed-off-by: Simon Horman <horms@ovn.org>
Acked-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Extend 'pmd-sleep-max' so that individual PMD thread cores may have a
specified max sleep request value.
Existing behaviour is maintained: any PMD thread core without a value
will use the global value if set, or the default of no sleep.
To set PMD thread cores 8 and 9 to never request a load based sleep
and all other PMD thread cores to be able to request a max sleep of
50 usecs:
$ ovs-vsctl set open_vswitch . other_config:pmd-sleep-max=50,8:0,9:0
To set PMD thread cores 10 and 11 to request a max sleep of 100 usecs
and all other PMD thread cores to never request a sleep:
$ ovs-vsctl set open_vswitch . other_config:pmd-sleep-max=10:100,11:100
'pmd-sleep-show' is updated to show the max sleep value for each PMD
thread.
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Make sure that if any zone limit is set via the DB, all zone limits
are forced to be set there as well. This is done by tracking which
datapath has zone limit protection, and it is reflected in the dpctl
commands. If the datapath is protected, the dpctl command will
return a permission error.
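For illustration, once any zone limit is managed through the
database, manually updating it via dpctl, e.g.:
$ ovs-appctl dpctl/ct-set-limits zone=5,limit=1000
is expected to fail with a permission error instead of applying the
change.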
Signed-off-by: Ales Musil <amusil@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Propagate the CT limit present in the DB into the datapath. The
limit is currently only propagated on change and can be overwritten
by the dpctl commands.
Signed-off-by: Ales Musil <amusil@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Add a limit to the CT zone DB table with ovs-vsctl helper methods.
Besides any number, the limit has two special values: 0 means
unlimited, and an empty limit leaves the value untouched in the
datapath.
This is a preparation step; the value is not yet propagated to the
datapath.
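A sketch of the intended semantics, setting the column directly on a
CT_Zone record (record reference illustrative):
$ ovs-vsctl set CT_Zone $ZONE_UUID limit=10000  # cap the zone
$ ovs-vsctl set CT_Zone $ZONE_UUID limit=0      # explicitly unlimited
$ ovs-vsctl clear CT_Zone $ZONE_UUID limit      # leave datapath as is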
Signed-off-by: Ales Musil <amusil@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
For better usability, the function pairs get_config() and
set_config() for netdevs should be symmetric: Options which are
accepted by set_config() should be returned by get_config() and the
latter should output valid options for set_config() only.
This patch moves key-value pairs which are not valid options from
get_config() to the get_status() callback. For example, get_config()
in lib/netdev-dpdk.c returned {configured,requested}_{rx,tx}_queues
previously. For requested rx queues the proper option name is n_rxq,
so requested_rx_queues has been renamed accordingly. Tx queues
cannot be changed by the user, hence requested_tx_queues has been
dropped. Both configured_{rx,tx}_queues will be returned as
n_{r,t}xq in the get_status() callback.
The netdev dpdk classes no longer share a common get_config() callback,
instead both the dpdk_class and the dpdk_vhost_client_class define
their own callbacks. The get_config() callback for dpdk_vhost_class has
been dropped because it does not have a set_config() callback.
The documentation in vswitchd/vswitch.xml for status columns as well
as tests have been updated accordingly.
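After this change, the configured queue counts can be read from the
status column instead, e.g.:
$ ovs-vsctl get Interface dpdk0 status:n_rxq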
Reported-at: https://bugzilla.redhat.com/1949855
Signed-off-by: Jakob Meng <code@jakobmeng.de>
Reviewed-by: Robin Jarry <rjarry@redhat.com>
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
For better usability, the function pairs get_config() and
set_config() for netdevs should be symmetric: Options which are
accepted by set_config() should be returned by get_config() and the
latter should output valid options for set_config() only. This patch
also moves key-value pairs which are not valid options from get_config()
to the get_status() callback.
The documentation in vswitchd/vswitch.xml for status columns has been
updated accordingly.
Reported-at: https://bugzilla.redhat.com/1949855
Signed-off-by: Jakob Meng <code@jakobmeng.de>
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
The get_status() callback for the dpdkvhostuser(/client) netdev
classes may display userspace-tso status.
Fixes: a5669fd51c9b ("netdev-dpdk: Drop TSO in case of conflicting virtio features.")
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Simon Horman <horms@ovn.org>
Add a group for the dpdkvhostuser(/client) netdevs.
They are added as a single group as they display the same status
fields, one of which is 'mode', indicating whether the port is client
or server.
Fixes: b2e8b12f8a82 ("netdev-dpdk: add vhost-user get_status.")
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Simon Horman <horms@ovn.org>
The status options pci-vendor_id and pci-device_id for dpdk netdevs
have been replaced by bus_info. This patch updates the documentation
in vswitchd/vswitch.xml accordingly.
Fixes: a77c7796f23a ("dpdk: Update to use v22.11.1.")
Signed-off-by: Jakob Meng <code@jakobmeng.de>
Acked-by: Simon Horman <horms@ovn.org>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Before the cleanup option, the bridge_exit() call was fairly fast,
because it didn't include any particularly long operations. However,
with the cleanup flag, this function destroys a lot of datapath
resources, freeing a lot of memory, waiting on RCU, and talking to
the kernel. That may take a noticeable amount of time, especially
on a busy system or under profilers/sanitizers. However, the unixctl
'exit' command replies instantly without waiting for any work to
actually be done. This may cause system test failures or other
issues where scripts expect ovs-vswitchd to exit or destroy all the
datapath resources shortly after the appctl call.
Fix that by waiting for the bridge_exit() before replying to the user.
At least, all the datapath resources will actually be destroyed by
the time ovs-appctl exits.
Also move a structure from the stack to a global; it seems cleaner
this way. Since we're not replying right away and it's technically
possible for multiple clients to request exit at the same time, store
the connections in an array.
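For example, after this change:
$ ovs-appctl exit --cleanup
only returns once bridge_exit() has finished destroying the datapath
resources.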
Fixes: fe13ccdca6a2 ("vswitchd: Add --cleanup option to the 'appctl exit' command")
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Currently, the netdev's speed is being calculated by taking the link's
feature bits (using netdev_get_features()) and transforming them into
bps.
This mechanism can be both inaccurate and difficult to maintain, mainly
because we currently use the feature bits supported by OpenFlow which
would have to be extended to support all new feature bits of all netdev
implementations while keeping the OpenFlow API intact.
In order to expose the link speed accurately for all current and future
hardware, add a new netdev API call that allows the implementations to
provide the current and maximum link speeds in Mbps.
Internally, the logic to get the maximum supported speed still relies on
feature bits so it might still get out of sync in the future. However,
the maximum configurable speed is not used as much as the current speed
and these feature bits are not exposed through the netdev interface so
it should be easier to add more.
Use this new function instead of netdev_get_features() where the link
speed is needed.
As a consequence of this patch, link speeds of cards are properly
reported (internally and in OVSDB) even if not supported by OpenFlow.
A test verifies this behavior using a tap device.
Also, in order to discourage use of the old API, this patch adds a
checkpatch.py warning if it is used.
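The current speed can then be checked via the existing Interface
column, e.g.:
$ ovs-vsctl get Interface eth0 link_speed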
Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=2137567
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
other_config:pmd-maxsleep is a config option that allows PMD thread
cores to sleep under low or no load conditions.
Rename it to 'pmd-sleep-max' to allow a more structured name, so that
additional options or commands can follow the 'pmd-sleep-xyz' pattern.
Use of other_config:pmd-maxsleep is deprecated, to be removed in a
future release, and will now result in a warning.
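The same setting under the old and new names:
$ ovs-vsctl set Open_vSwitch . other_config:pmd-maxsleep=500   # deprecated
$ ovs-vsctl set Open_vSwitch . other_config:pmd-sleep-max=500  # new name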
Reviewed-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
As per the Open vSwitch Manual ovs-vsctl(8) the Bridge IPFIX parameters
can be passed as follows:
ovs-vsctl -- set Bridge br0 ipfix=@i \
-- --id=@i create IPFIX targets=\"192.168.0.34:4739\" \
obs_domain_id=123 obs_point_id=456 cache_active_timeout=60 \
cache_max_flows=13 \
other_config:enable-input-sampling=false \
other_config:enable-output-sampling=false
where the default values are:
enable_input_sampling: true
enable_output_sampling: true
But in the existing code these two parameters take on unexpected
values in some scenarios:
be_opts.enable_input_sampling = !smap_get_bool(&be_cfg->other_config,
                                               "enable-input-sampling", false);
be_opts.enable_output_sampling = !smap_get_bool(&be_cfg->other_config,
                                                "enable-output-sampling", false);
Here, the function smap_get_bool() is used with a negation.
This returns expected values for the default case (since the above
code negates the 'false' we get from smap_get_bool() and returns
'true'), but unexpected values for the case where the sampling value
is passed through the CLI.
For example, if we pass 'true' for other_config:enable-input-sampling
in the CLI, the above code negates the 'true' value we get from
smap_get_bool() and returns 'false'. The same applies to
enable_output_sampling.
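The fix, presumably, is to drop the negation and use 'true' as the
default value instead:
be_opts.enable_input_sampling = smap_get_bool(&be_cfg->other_config,
                                              "enable-input-sampling", true);
be_opts.enable_output_sampling = smap_get_bool(&be_cfg->other_config,
                                               "enable-output-sampling", true);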
Acked-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: Sayali Naval <sanaval@cisco.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Some control protocols are used to maintain link status between
forwarding engines (e.g. LACP). When the system is not sized properly,
the PMD threads may not be able to process all incoming traffic from the
configured Rx queues. When a signaling packet of such protocols is
dropped, it can cause link flapping, worsening the situation.
Use the rte_flow API to redirect these protocols into a dedicated Rx
queue. The assumption is made that the ratio between control protocol
traffic and user data traffic is very low and thus this dedicated Rx
queue will never get full. Re-program the RSS redirection table to only
use the other Rx queues.
The additional Rx queue will be assigned a PMD core like any other Rx
queue. Polling that extra queue may introduce increased latency and
a slight performance penalty, for the benefit of preventing link
flapping.
This feature must be enabled per port on specific protocols via the
rx-steering option. This option takes "rss" followed by a "+" separated
list of protocol names. It is only supported on ethernet ports. This
feature is experimental.
If the user has already configured multiple Rx queues on the port, an
additional one will be allocated for control packets. If the hardware
cannot satisfy the number of requested Rx queues, the last Rx queue will
be assigned for control plane. If only one Rx queue is available, the
rx-steering feature will be disabled. If the hardware does not support
the rte_flow matchers/actions, the rx-steering feature will be
completely disabled on the port and regular rss will be performed
instead.
It cannot be enabled when other-config:hw-offload=true as it may
conflict with the offloaded flows. Similarly, if hw-offload is enabled,
custom rx-steering will be forcibly disabled on all ports and replaced
by regular rss.
Example use:
ovs-vsctl add-bond br-phy bond0 phy0 phy1 -- \
set interface phy0 type=dpdk options:dpdk-devargs=0000:ca:00.0 -- \
set interface phy0 options:rx-steering=rss+lacp -- \
set interface phy1 type=dpdk options:dpdk-devargs=0000:ca:00.1 -- \
set interface phy1 options:rx-steering=rss+lacp
As a starting point, only one protocol is supported: LACP. Other
protocols can be added in the future. NIC compatibility should be
checked.
To validate that this works as intended, I used a traffic generator to
generate random traffic slightly above the machine capacity at line rate
on a two ports bond interface. OVS is configured to receive traffic on
two VLANs and pop/push them in a br-int bridge based on tags set on
patch ports.
+----------------------+
| DUT |
|+--------------------+|
|| br-int || in_port=patch10,actions=mod_dl_src:$patch11,
|| || mod_dl_dst:$tgen0,
|| || output:patch10
|| || in_port=patch11,actions=mod_dl_src:$patch10
|| || mod_dl_dst:$tgen0,
|| patch10 patch11 || output:patch10
|+---|-----------|----+|
| | | |
|+---|-----------|----+|
|| patch00 patch01 ||
|| tag:10 tag:20 ||
|| ||
|| br-phy || default flow, action=NORMAL
|| ||
|| bond0 || balance-slb, lacp=passive, lacp-time=fast
|| phy0 phy1 ||
|+------|-----|-------+|
+-------|-----|--------+
| |
+-------|-----|--------+
| port0 port1 | balance L3/L4, lacp=active, lacp-time=fast
| lag | mode trunk VLANs 10, 20
| |
| switch |
| |
| vlan 10 vlan 20 | mode access
| port2 port3 |
+-----|----------|-----+
| |
+-----|----------|-----+
| tgen0 tgen1 | Random traffic that is properly balanced
| | across the bond ports in both directions.
| traffic generator |
+----------------------+
Without rx-steering, the bond0 links randomly switch to "defaulted"
when one of the LACP packets sent by the switch is dropped because
the Rx queues are full and the PMD threads did not process them fast
enough. When that happens, all traffic must go through a single link,
which causes the above-line-rate traffic to be dropped.
~# ovs-appctl lacp/show-stats bond0
---- bond0 statistics ----
member: phy0:
TX PDUs: 347246
RX PDUs: 14865
RX Bad PDUs: 0
RX Marker Request PDUs: 0
Link Expired: 168
Link Defaulted: 0
Carrier Status Changed: 0
member: phy1:
TX PDUs: 347245
RX PDUs: 14919
RX Bad PDUs: 0
RX Marker Request PDUs: 0
Link Expired: 147
Link Defaulted: 1
Carrier Status Changed: 0
When rx-steering is enabled, no LACP packet is dropped and the bond
links remain enabled at all times, maximizing the throughput. Neither
the "Link Expired" nor the "Link Defaulted" counters are incremented
anymore.
This feature may be considered as "QoS". However, it does not work by
limiting the rate of traffic explicitly. It only guarantees that some
protocols have a lower chance of being dropped because the PMD cores
cannot keep up with regular traffic.
The choice of protocols is limited on purpose. This is not meant to be
configurable by users. Some limited configurability could be considered
in the future but it would expose to more potential issues if users are
accidentally redirecting all traffic in the isolated queue.
Acked-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Robin Jarry <rjarry@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
This patch supports flow label based load balancing by controlling
the flow label of the outer IPv6 header, which is already implemented
in the Linux kernel as the seg6_flowlabel sysctl [1].
[1]: https://docs.kernel.org/networking/seg6-sysctl.html
Signed-off-by: Nobuhiro MIKI <nmiki@yahoo-corp.jp>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The description of SRv6 was missing in vswitch.xml, which is
used to generate the man page, so this patch adds it.
Fixes: 03fc1ad78521 ("userspace: Add SRv6 tunnel support.")
Signed-off-by: Nobuhiro MIKI <nmiki@yahoo-corp.jp>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Open vSwitch generally tries to let the underlying operating system
manage the low level details of hardware, for example DMA mapping,
bus arbitration, etc. However, when using DPDK, the underlying
operating system yields control of many of these details to userspace
for management.
In the case of some DPDK port drivers, configuring rte_flow or even
allocating resources may require access to iopl/ioperm calls, which
are guarded by the CAP_SYS_RAWIO privilege on linux systems. These
calls are dangerous, and can allow a process to completely compromise
a system. However, they are needed in the case of some userspace
driver code which manages the hardware (for example, the mlx
implementation of backend support for rte_flow).
Here, we create an opt-in flag passed on the command line to allow
this access. We need to do this before ever accessing the database,
because we want to drop all privileges as soon as possible, and we
cannot wait for a connection to the database to be established and
functional before dropping them. There may be distribution-specific
ways to do capability management as well (using, for example,
systemd), but they are not as universal to the vswitchd as a flag.
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Aaron Conole <aconole@redhat.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Acked-by: Gaetan Rivet <gaetanr@nvidia.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Depending on the driver implementation, it can take from 0.2 seconds
up to 2 seconds before offloaded flow statistics are updated. This is
true for both TC and rte_flow-based offloading. This causes a
problem with min-revalidate-pps, as old statistics values are used
during this period.
This fix waits for at least 2 seconds, by default, before assuming no
packets were received during this period.
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Add options to the IPFIX table to configure the intervals at which
statistics and template information are sent.
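A hedged sketch of the intended usage; the option names here are
assumptions for illustration, the authoritative ones are defined in
vswitchd/vswitch.xml:
$ ovs-vsctl -- set Bridge br0 ipfix=@i \
    -- --id=@i create IPFIX targets=\"192.168.0.34:4739\" \
       other_config:stats-interval=30 \
       other_config:template-interval=600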
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Add support for counting upcall packets per port, both succeeded and
failed, which is a better way to see how many packets were upcalled
on each interface.
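E.g., the new counters would show up among the per-interface
statistics (exact counter names per this patch):
$ ovs-vsctl get Interface eth0 statistics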
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: wangchuanlei <wangchuanlei@inspur.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Today the minimum value for min-revalidate-pps is 1. This patch
allows it to be 0, meaning pps is not checked at all and revalidation
is always performed.
This is particularly useful for environments where some of the
applications with long-lived connections may have very low traffic
for certain periods but a high rate of bursts periodically. It is
desirable to keep the datapath flows instead of periodically deleting
them, to avoid bursts of packet misses to userspace.
When set to 0, there may be more datapath flows to revalidate,
resulting in higher CPU cost for revalidator threads. This is the
downside, but in certain cases it is still more desirable than packet
misses to userspace.
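For example:
$ ovs-vsctl set Open_vSwitch . other_config:min-revalidate-pps=0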
Signed-off-by: Han Zhou <hzhou@ovn.org>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Now that the timer slack for the PMD threads is reduced we can also
reduce the start/increment for PMD load based sleeping to match it.
This will further reduce initial sleep times making it more resilient
to interfaces that might be sensitive to large sleep times.
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Reviewed-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Sleep for an incremental amount of time if none of the Rx queues
assigned to a PMD have at least half a batch of packets (i.e. 16
pkts) on a polling iteration of the PMD.
Upon detecting the threshold of >= 16 pkts on an Rxq, reset the
sleep time to zero (i.e. no sleep).
Sleep time will be increased on each iteration where the low load
conditions remain up to a total of the max sleep time which is set
by the user e.g:
ovs-vsctl set Open_vSwitch . other_config:pmd-maxsleep=500
The default pmd-maxsleep value is 0, which means that no sleeps
will occur and the default behaviour is unchanged from previously.
Also add new stats to pmd-perf-show to get visibility of operation
e.g.
...
- sleep iterations: 153994 ( 76.8 % of iterations)
Sleep time (us): 9159399 ( 59 us/iteration avg.)
...
Reviewed-by: Robin Jarry <rjarry@redhat.com>
Reviewed-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
When Open vSwitch is restarted, the previous BFD status is kept
until ovs-vswitchd is running. And if ovs-vswitchd is under a high
workload, updating the BFD status may be deferred, which is not what
we expect.
Signed-off-by: Daniel Ding <zhihui.ding@easystack.cn>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
100 Mbps was a fair assumption 13 years ago. These days, 10 Gbps
seems like a good value in case no information is available otherwise.
The change mainly affects QoS which is currently limited to 100 Mbps if
the user didn't specify 'max-rate' and the card doesn't report the
speed or OVS doesn't have a predefined enumeration for the speed
reported by the NIC.
Calculation of the path cost for STP/RSTP is also affected if OVS is
unable to determine the link speed.
Lower link speed adapters are typically good at reporting their
speed, so the chances of overshoot should be low. And newer
high-speed adapters, for which there is no speed enumeration or
which have some other issues, will not suffer as much.
Acked-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The count of received multicast packets has been computed internally
but not exposed to OVSDB. Fix this.
Signed-off-by: David Marchand <david.marchand@redhat.com>
Acked-by: Mike Pattrick <mkp@redhat.com>
Acked-by: Michael Santana <msantana@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
For some reason it is documented as 'rstp-port-path-cost', while
the code and some other bits of documentation use 'rstp-path-cost'.
Fixes: 9efd308e957c ("Rapid Spanning Tree Protocol (IEEE 802.1D).")
Reviewed-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>