mirror of https://github.com/openvswitch/ovs synced 2025-08-22 01:51:26 +00:00

1452 Commits

Author SHA1 Message Date
Jakob Meng
bb6ed2472f netdev-dpdk: Document rx-steering status options.
Fixes: fc06ea9a1883 ("netdev-dpdk: Add custom rx-steering configuration.")
Signed-off-by: Jakob Meng <code@jakobmeng.de>
Acked-by: Simon Horman <horms@ovn.org>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
2023-10-10 11:23:35 +01:00
Jakob Meng
e9ada16292 netdev-dpdk: Update docs for interface info.
The status options pci-vendor_id and pci-device_id for dpdk netdevs
have been replaced by bus_info. This patch updates the documentation
in vswitchd/vswitch.xml accordingly.

Fixes: a77c7796f23a ("dpdk: Update to use v22.11.1.")
Signed-off-by: Jakob Meng <code@jakobmeng.de>
Acked-by: Simon Horman <horms@ovn.org>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
2023-10-10 11:23:35 +01:00
Jakob Meng
8020eff9a0 netdev-dpdk: Document status options for VF MAC address.
Fixes: f4336f504b17 ("netdev-dpdk: Add option to configure VF MAC address. ")
Signed-off-by: Jakob Meng <code@jakobmeng.de>
Acked-by: Simon Horman <horms@ovn.org>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
2023-10-10 11:23:35 +01:00
Ilya Maximets
24520a401e vswitchd: Wait for a bridge exit before replying to exit unixctl.
Before the cleanup option, the bridge_exit() call was fairly fast,
because it didn't include any particularly long operations.  However,
with the cleanup flag, this function destroys a lot of datapath
resources freeing a lot of memory, waiting on RCU and talking to
the kernel.  That may take a noticeable amount of time, especially
on a busy system or under profilers/sanitizers.  However, the unixctl
'exit' command replies instantly without waiting for any work to
actually be done.  This may cause system test failures or other
issues where scripts expect ovs-vswitchd to exit or destroy all the
datapath resources shortly after appctl call.

Fix that by waiting for the bridge_exit() before replying to the user.
At least, all the datapath resources will actually be destroyed by
the time ovs-appctl exits.

Also moving a structure from stack to global.  Seems cleaner this way.

Since we're not replying right away and it's technically possible
to have multiple clients requesting exit at the same time, storing
connections in an array.

Fixes: fe13ccdca6a2 ("vswitchd: Add --cleanup option to the 'appctl exit' command")
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2023-07-18 20:45:00 +02:00
Adrian Moreno
6240c0b4c8 netdev: Add netdev_get_speed() to netdev API.
Currently, the netdev's speed is being calculated by taking the link's
feature bits (using netdev_get_features()) and transforming them into
bps.

This mechanism can be both inaccurate and difficult to maintain, mainly
because we currently use the feature bits supported by OpenFlow which
would have to be extended to support all new feature bits of all netdev
implementations while keeping the OpenFlow API intact.

In order to expose the link speed accurately for all current and future
hardware, add a new netdev API call that allows the implementations to
provide the current and maximum link speeds in Mbps.

Internally, the logic to get the maximum supported speed still relies on
feature bits so it might still get out of sync in the future. However,
the maximum configurable speed is not used as much as the current speed
and these feature bits are not exposed through the netdev interface so
it should be easier to add more.

Use this new function instead of netdev_get_features() where the link
speed is needed.

As a consequence of this patch, the link speeds of cards are properly
reported (internally in OVSDB) even if not supported by OpenFlow.
A test verifies this behavior using a tap device.

Also, in order to avoid using the old API, this patch adds a
checkpatch.py warning if it is used.

Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=2137567
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2023-07-17 20:03:32 +02:00
Kevin Traynor
023dcdc7a1 dpif-netdev: Rename pmd-maxsleep config option.
other_config:pmd-maxsleep is a config option to allow
PMD thread cores to sleep under low or no load conditions.

Rename it to 'pmd-sleep-max' to allow a more structured
name and so that additional options or commands can follow
the 'pmd-sleep-xyz' pattern.

Use of other_config:pmd-maxsleep is deprecated, will result in
a warning, and will be removed in a future release.

Reviewed-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2023-07-15 00:11:21 +02:00
Sayali Naval
8e073791d4 bridge: Fix unexpected values for IPFIX enable-input/output-sampling.
As per the Open vSwitch Manual ovs-vsctl(8) the Bridge IPFIX parameters
can be passed as follows:

  ovs-vsctl -- set Bridge br0 ipfix=@i \
    --  --id=@i create  IPFIX targets=\"192.168.0.34:4739\" \
        obs_domain_id=123 obs_point_id=456 cache_active_timeout=60 \
        cache_max_flows=13 \
        other_config:enable-input-sampling=false \
        other_config:enable-output-sampling=false

where the default values are:

  enable_input_sampling: true
  enable_output_sampling: true

But in the existing code these 2 parameters take up unexpected values
in some scenarios:

  be_opts.enable_input_sampling = !smap_get_bool(&be_cfg->other_config,
                                        "enable-input-sampling", false);

  be_opts.enable_output_sampling = !smap_get_bool(&be_cfg->other_config,
                                        "enable-output-sampling", false);

Here, the function smap_get_bool is being used with a negation.

This returns the expected value in the default case (the above code
negates the “false” returned by smap_get_bool() and yields “true”),
but unexpected values when the sampling value is passed through the
CLI.
For example, if we pass "true" for other_config:enable-input-sampling
in the CLI, the above code will negate the “true” value we get from
smap_get_bool() and return the value “false”. The same is true for
enable_output_sampling.
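
The effect of the negation can be reproduced in isolation. The sketch below is illustrative only: get_bool() is a simplified, hypothetical stand-in for smap_get_bool() (it takes the looked-up value directly, with NULL meaning "key absent"), and sampling_enabled_fixed() shows one possible correction, which is an assumption and not necessarily the change made by this commit.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for smap_get_bool() (hypothetical helper): 'value' is
 * the string stored for the key, or NULL if the key is absent.  Returns
 * 'def' unless the value explicitly says the opposite. */
bool
get_bool(const char *value, bool def)
{
    if (!value) {
        return def;
    }
    return def ? strcmp(value, "false") != 0 : strcmp(value, "true") == 0;
}

/* Pattern quoted in the commit message: negate a lookup that defaults to
 * false.  Correct when the key is absent, inverted for an explicit "true",
 * and still true for an explicit "false". */
bool
sampling_enabled_buggy(const char *value)
{
    return !get_bool(value, false);
}

/* One possible fix (assumption): drop the negation and default to true. */
bool
sampling_enabled_fixed(const char *value)
{
    return get_bool(value, true);
}
```

With these definitions, the buggy variant reports sampling as enabled for both an absent key and an explicit "false", and as disabled for an explicit "true".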

Acked-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: Sayali Naval <sanaval@cisco.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2023-07-12 00:12:49 +02:00
Robin Jarry
fc06ea9a18 netdev-dpdk: Add custom rx-steering configuration.
Some control protocols are used to maintain link status between
forwarding engines (e.g. LACP). When the system is not sized properly,
the PMD threads may not be able to process all incoming traffic from the
configured Rx queues. When a signaling packet of such protocols is
dropped, it can cause link flapping, worsening the situation.

Use the rte_flow API to redirect these protocols into a dedicated Rx
queue. The assumption is made that the ratio between control protocol
traffic and user data traffic is very low and thus this dedicated Rx
queue will never get full. Re-program the RSS redirection table to only
use the other Rx queues.

The additional Rx queue will be assigned a PMD core like any other Rx
queue. Polling that extra queue may introduce increased latency and
a slight performance penalty in exchange for preventing link flapping.

This feature must be enabled per port on specific protocols via the
rx-steering option. This option takes "rss" followed by a "+" separated
list of protocol names. It is only supported on ethernet ports. This
feature is experimental.
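
As an illustration of the option syntax described above ("rss" followed by a "+" separated list of protocol names), a parser might look roughly like the following. This is a sketch under assumptions: the function name parse_rx_steering() and the STEER_* flag names are hypothetical and are not taken from the OVS sources; only "lacp" is accepted because it is the only protocol the commit message mentions.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical flag bits, for illustration only. */
enum {
    STEER_RSS  = 1 << 0,
    STEER_LACP = 1 << 1,
};

/* Parses a value such as "rss" or "rss+lacp" into a flag mask.  Returns
 * false if the value does not start with "rss", names an unknown protocol,
 * or contains an empty element.  On failure '*flags' may be partially set. */
bool
parse_rx_steering(const char *arg, unsigned int *flags)
{
    const char *p = arg;
    bool first = true;

    *flags = 0;
    while (*p) {
        size_t len = strcspn(p, "+");    /* Length of the current element. */

        if (first) {
            if (len != 3 || strncmp(p, "rss", 3)) {
                return false;            /* Must begin with "rss". */
            }
            *flags |= STEER_RSS;
            first = false;
        } else if (len == 4 && !strncmp(p, "lacp", 4)) {
            *flags |= STEER_LACP;
        } else {
            return false;                /* Unknown or empty protocol. */
        }

        p += len;
        if (*p == '+') {
            p++;
            if (!*p) {
                return false;            /* Trailing '+'. */
            }
        }
    }
    return !first;                       /* Empty string is rejected. */
}
```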

If the user has already configured multiple Rx queues on the port, an
additional one will be allocated for control packets. If the hardware
cannot satisfy the number of requested Rx queues, the last Rx queue will
be assigned for control plane. If only one Rx queue is available, the
rx-steering feature will be disabled. If the hardware does not support
the rte_flow matchers/actions, the rx-steering feature will be
completely disabled on the port and regular rss will be performed
instead.

It cannot be enabled when other-config:hw-offload=true as it may
conflict with the offloaded flows. Similarly, if hw-offload is enabled,
custom rx-steering will be forcibly disabled on all ports and replaced
by regular rss.

Example use:

 ovs-vsctl add-bond br-phy bond0 phy0 phy1 -- \
   set interface phy0 type=dpdk options:dpdk-devargs=0000:ca:00.0 -- \
   set interface phy0 options:rx-steering=rss+lacp -- \
   set interface phy1 type=dpdk options:dpdk-devargs=0000:ca:00.1 -- \
   set interface phy1 options:rx-steering=rss+lacp

As a starting point, only one protocol is supported: LACP. Other
protocols can be added in the future. NIC compatibility should be
checked.

To validate that this works as intended, I used a traffic generator to
generate random traffic slightly above the machine capacity at line rate
on a two-port bond interface. OVS is configured to receive traffic on
two VLANs and pop/push them in a br-int bridge based on tags set on
patch ports.

   +----------------------+
   |         DUT          |
   |+--------------------+|
   ||       br-int       || in_port=patch10,actions=mod_dl_src:$patch11,
   ||                    ||                         mod_dl_dst:$tgen0,
   ||                    ||                         output:patch10
   ||                    || in_port=patch11,actions=mod_dl_src:$patch10
   ||                    ||                         mod_dl_dst:$tgen0,
   || patch10    patch11 ||                         output:patch10
   |+---|-----------|----+|
   |    |           |     |
   |+---|-----------|----+|
   || patch00    patch01 ||
   ||  tag:10    tag:20  ||
   ||                    ||
   ||       br-phy       || default flow, action=NORMAL
   ||                    ||
   ||       bond0        || balance-slb, lacp=passive, lacp-time=fast
   ||    phy0   phy1     ||
   |+------|-----|-------+|
   +-------|-----|--------+
           |     |
   +-------|-----|--------+
   |     port0  port1     | balance L3/L4, lacp=active, lacp-time=fast
   |         lag          | mode trunk VLANs 10, 20
   |                      |
   |        switch        |
   |                      |
   |  vlan 10    vlan 20  |  mode access
   |   port2      port3   |
   +-----|----------|-----+
         |          |
   +-----|----------|-----+
   |   tgen0      tgen1   |  Random traffic that is properly balanced
   |                      |  across the bond ports in both directions.
   |  traffic generator   |
   +----------------------+

Without rx-steering, the bond0 links are randomly switching to
"defaulted" when one of the LACP packets sent by the switch is dropped
because the RX queues are full and the PMD threads did not process them
fast enough. When that happens, all traffic must go through a single
link which causes above line rate traffic to be dropped.

 ~# ovs-appctl lacp/show-stats bond0
 ---- bond0 statistics ----
 member: phy0:
   TX PDUs: 347246
   RX PDUs: 14865
   RX Bad PDUs: 0
   RX Marker Request PDUs: 0
   Link Expired: 168
   Link Defaulted: 0
   Carrier Status Changed: 0
 member: phy1:
   TX PDUs: 347245
   RX PDUs: 14919
   RX Bad PDUs: 0
   RX Marker Request PDUs: 0
   Link Expired: 147
   Link Defaulted: 1
   Carrier Status Changed: 0

When rx-steering is enabled, no LACP packet is dropped and the bond
links remain enabled at all times, maximizing the throughput. Neither
the "Link Expired" nor the "Link Defaulted" counters are incremented
anymore.

This feature may be considered as "QoS". However, it does not work by
limiting the rate of traffic explicitly. It only guarantees that some
protocols have a lower chance of being dropped because the PMD cores
cannot keep up with regular traffic.

The choice of protocols is limited on purpose. This is not meant to be
configurable by users. Some limited configurability could be considered
in the future but it would expose users to more potential issues, such
as accidentally redirecting all traffic into the isolated queue.

Acked-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Robin Jarry <rjarry@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2023-07-10 15:49:44 +02:00
Nobuhiro MIKI
701c2dbfb8 userspace: Add new option srv6_flowlabel in SRv6 tunnel.
It supports flowlabel based load balancing by controlling the flowlabel
of outer IPv6 header, which is already implemented in Linux kernel as
seg6_flowlabel sysctl [1].

[1]: https://docs.kernel.org/networking/seg6-sysctl.html

Signed-off-by: Nobuhiro MIKI <nmiki@yahoo-corp.jp>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2023-05-25 17:08:32 +02:00
Nobuhiro MIKI
0f34ecbd5a vswitch.xml: Add description of SRv6 tunnel and related options.
The description of SRv6 was missing in vswitch.xml, which is
used to generate the man page, so this patch adds it.

Fixes: 03fc1ad78521 ("userspace: Add SRv6 tunnel support.")
Signed-off-by: Nobuhiro MIKI <nmiki@yahoo-corp.jp>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2023-03-30 22:10:40 +02:00
Aaron Conole
07cf5810de dpdk: Allow retaining CAP_SYS_RAWIO privileges.
Open vSwitch generally tries to let the underlying operating system
manage the low level details of hardware, for example DMA mapping,
bus arbitration, etc.  However, when using DPDK, the underlying
operating system yields control of many of these details to userspace
for management.

In the case of some DPDK port drivers, configuring rte_flow or even
allocating resources may require access to iopl/ioperm calls, which
are guarded by the CAP_SYS_RAWIO privilege on linux systems.  These
calls are dangerous, and can allow a process to completely compromise
a system.  However, they are needed in the case of some userspace
driver code which manages the hardware (for example, the mlx
implementation of backend support for rte_flow).

Here, we create an opt-in flag passed to the command line to allow
this access.  We need to do this before ever accessing the database,
because we want to drop all privileges asap, and cannot wait for
a connection to the database to be established and functional before
dropping.  There may be distribution specific ways to do capability
management as well (using for example, systemd), but they are not
as universal to the vswitchd as a flag.

Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Aaron Conole <aconole@redhat.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Acked-by: Gaetan Rivet <gaetanr@nvidia.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2023-03-22 18:56:02 +01:00
Ales Musil
e90a0727f1 vswitch: Add missing documentation for "ct_flush" capability.
Fixes: 08146bf7d9b4 ("openflow: Add extension to flush CT by generic match.")
Signed-off-by: Ales Musil <amusil@redhat.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2023-03-15 21:24:38 +01:00
Eelco Chaudron
29720e378e ofproto-dpif-upcall: Wait for valid hw flow stats before applying min-revalidate-pps.
Depending on the driver implementation, it can take from 0.2 seconds
up to 2 seconds before offloaded flow statistics are updated. This is
true for both TC and rte_flow-based offloading. This is causing a
problem with min-revalidate-pps, as old statistic values are used
during this period.

This fix will wait for at least 2 seconds, by default, before assuming no
packets were received during this period.

Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2023-03-15 21:22:22 +01:00
Adrian Moreno
f1f278f5e1 ipfix: Make template and stats interval configurable.
Add options to the IPFIX table to configure the interval to send
statistics and template information.

Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2023-02-27 21:18:59 +01:00
Miika Petäjäniemi
a6195e2c42 netdev-linux: Add jitter parameter to the netem qos options.
Adds jitter option to enable emulating latency fluctuation with netem.

Submitted-at: https://github.com/openvswitch/ovs/pull/407
Signed-off-by: Miika Petäjäniemi <miika.petajaniemi@solita.fi>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2023-02-21 14:25:57 +01:00
wangchuanlei
e22e1f6725 dpctl: Add support to count upcall packets.
Add support to count upcall packets per port, both successful and failed,
which is a better way to see how many packets were upcalled on each
interface.

Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: wangchuanlei <wangchuanlei@inspur.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2023-01-31 17:40:50 +01:00
Han Zhou
e5b3cb9995 revalidator: Allow min-revalidator-pps to be 0.
Today the minimum value for this setting is 1. This patch allows it to
be 0, meaning not checking pps at all, and always do revalidation.

This is particularly useful for environments where some of the
applications with long-lived connections may have very low traffic for
a certain period but periodically send high-rate bursts. It is desirable
to keep the datapath flows instead of periodically deleting them to
avoid bursts of packet misses to userspace.

When setting to 0, there may be more datapath flows to be revalidated,
resulting in higher CPU cost of revalidator threads. This is the
downside but in certain cases this is still more desirable than packet
misses to user space.

Signed-off-by: Han Zhou <hzhou@ovn.org>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2023-01-27 16:09:10 +01:00
Kevin Traynor
948767a18d dpif-netdev: Set PMD load based sleep start/inc to 1 us.
Now that the timer slack for the PMD threads is reduced we can also
reduce the start/increment for PMD load based sleeping to match it.

This will further reduce initial sleep times making it more resilient
to interfaces that might be sensitive to large sleep times.

Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Reviewed-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2023-01-23 17:23:28 +01:00
Kevin Traynor
de3bbdc479 dpif-netdev: Add PMD load based sleeping.
Sleep for an incremental amount of time if none of the Rx queues
assigned to a PMD have at least half a batch of packets (i.e. 16 pkts)
on a polling iteration of the PMD.

Upon detecting the threshold of >= 16 pkts on an Rxq, reset the
sleep time to zero (i.e. no sleep).

Sleep time will be increased on each iteration where the low load
conditions remain up to a total of the max sleep time which is set
by the user e.g:
ovs-vsctl set Open_vSwitch . other_config:pmd-maxsleep=500

The default pmd-maxsleep value is 0, which means that no sleeps
will occur and the default behaviour is unchanged from previously.
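
The sleep policy described above can be summarized as a small state update. The following is a minimal sketch, not the dpif-netdev code: the function name next_sleep_us() and its parameters are hypothetical, and the increment is passed in as 'inc_us' because this message does not state its value.

```c
#include <stdint.h>

/* Half a batch of packets, as stated in the commit message. */
#define SLEEP_THRESHOLD_PKTS 16

/* Returns the sleep time (in us) to use after a polling iteration.
 *
 * cur_sleep_us: sleep time used so far.
 * max_rxq_pkts: largest number of packets received from any single Rx queue
 *               in this iteration.
 * max_sleep_us: user-configured maximum; 0 disables sleeping.
 * inc_us:       increment per low-load iteration (assumed parameter). */
uint64_t
next_sleep_us(uint64_t cur_sleep_us, unsigned int max_rxq_pkts,
              uint64_t max_sleep_us, uint64_t inc_us)
{
    if (max_sleep_us == 0 || max_rxq_pkts >= SLEEP_THRESHOLD_PKTS) {
        return 0;                   /* Disabled, or load detected: reset. */
    }

    cur_sleep_us += inc_us;         /* Low load persists: sleep longer. */
    return cur_sleep_us > max_sleep_us ? max_sleep_us : cur_sleep_us;
}
```

A PMD loop would call this once per iteration and sleep for the returned time when it is non-zero.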

Also add new stats to pmd-perf-show to get visibility of operation
e.g.
...
   - sleep iterations:       153994  ( 76.8 % of iterations)
   Sleep time (us):         9159399  ( 59 us/iteration avg.)
...

Reviewed-by: Robin Jarry <rjarry@redhat.com>
Reviewed-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2023-01-12 18:56:05 +01:00
Daniel Ding
093915e04a vswitch.ovsschema: Set bfd_status to ephemeral.
When openvswitch is restarted, the old bfd status is kept
until ovs-vswitchd is running.  If ovs-vswitchd is under
high workload, updating the bfd status is deferred even
further, which is not what we expected.

Signed-off-by: Daniel Ding <zhihui.ding@easystack.cn>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-12-06 16:21:54 +01:00
Ilya Maximets
b22c4d8403 netdev: Assume default link speed to be 10 Gbps instead of 100 Mbps.
100 Mbps was a fair assumption 13 years ago.  These days, 10 Gbps seems
like a good value in case no information is available otherwise.

The change mainly affects QoS which is currently limited to 100 Mbps if
the user didn't specify 'max-rate' and the card doesn't report the
speed or OVS doesn't have a predefined enumeration for the speed
reported by the NIC.

Calculation of the path cost for STP/RSTP is also affected if OVS is
unable to determine the link speed.

Lower link speed adapters are typically good at reporting their speed,
so chances for overshoot should be low.  But newer high-speed adapters,
for which there is no speed enumeration or if there are some other
issues, will not suffer that much.

Acked-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-11-30 14:42:59 +01:00
David Marchand
c6062d1077 vswitchd: Publish per iface received multicast packets.
The count of received multicast packets has been computed internally,
but not exposed to ovsdb. Fix this.

Signed-off-by: David Marchand <david.marchand@redhat.com>
Acked-by: Mike Pattrick <mkp@redhat.com>
Acked-by: Michael Santana <msantana@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-11-24 13:38:08 +01:00
Ilya Maximets
0d0f282c19 vswitch.xml: Fix the name of rstp-path-cost option.
For some reason it is documented as 'rstp-port-path-cost', while
the code and some other bits of documentation use 'rstp-path-cost'.

Fixes: 9efd308e957c ("Rapid Spanning Tree Protocol (IEEE 802.1D).")
Reviewed-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-11-02 19:45:14 +01:00
Ilya Maximets
516f181a21 docs: Remove remaining references to OVS kmod and XenServer.
README file still mentions a kernel module and some parts of
the documentation still have XenServer references, e.g. 'xs-*'
database configuration options.  Removing them.

Fixes: 422e90437854 ("make: Remove the Linux datapath.")
Fixes: 83c9518e7c67 ("xenserver: Remove xenserver.")
Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-08-15 19:46:00 +02:00
Greg Rose
83c9518e7c xenserver: Remove xenserver.
Remove the current xenserver implementation - it is obsolete and
since 3.0 we do not support kernel module builds [1].

1. https://mail.openvswitch.org/pipermail/ovs-dev/2022-July/395789.html

[i.maximets]
Can be added back if people willing to maintain it are found.

Signed-off-by: Greg Rose <gvrose8192@gmail.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-08-15 13:07:13 +02:00
Christophe Fontaine
1b53826d6c ofproto/bond: Add knob 'all-members-active'.
This config param allows the delivery of broadcast and multicast
packets to the secondary interface of non-lacp bonds, equivalent
to the option 'all_slaves_active' for Linux kernel bonds.

Reported-at: https://bugzilla.redhat.com/1720935
Signed-off-by: Christophe Fontaine <cfontain@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-07-15 23:08:38 +02:00
Emma Finn
1713fc0116 odp-execute: Add command to switch action implementation.
This commit adds a new command to allow the user to switch
the active action implementation at runtime.

Usage:
  $ ovs-appctl odp-execute/action-impl-set scalar

This commit also adds a new command to retrieve the list of available
action implementations. This can be used to check what implementations
of actions are available and what implementation is active during runtime.

Usage:
   $ ovs-appctl odp-execute/action-impl-show

Added separate test-case for ovs-actions show/set commands:
odp-execute - actions implementation

Signed-off-by: Emma Finn <emma.finn@intel.com>
Signed-off-by: Kumar Amber <kumar.amber@intel.com>
Signed-off-by: Sunil Pai G <sunil.pai.g@intel.com>
Co-authored-by: Kumar Amber <kumar.amber@intel.com>
Co-authored-by: Sunil Pai G <sunil.pai.g@intel.com>
Acked-by: Harry van Haaren <harry.van.haaren@intel.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2022-07-15 11:39:20 +01:00
Emma Finn
95e4a35b0a odp-execute: Add function pointers to odp-execute for different action implementations.
This commit introduces the initial infrastructure required to allow
different implementations for OvS actions. The patch introduces action
function pointers which allow the user to switch between the different
action implementations available. This will allow for more performance
and flexibility so the user can choose the action implementation that
best suits their use case.

Signed-off-by: Emma Finn <emma.finn@intel.com>
Acked-by: Harry van Haaren <harry.van.haaren@intel.com>
Acked-by: Sunil Pai G <sunil.pai.g@intel.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2022-07-15 11:38:16 +01:00
Frode Nordahl
2fc29c4278 man: Fix various typos across manual pages.
As reported by Debian lintian.

Signed-off-by: Frode Nordahl <frode.nordahl@canonical.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-07-14 15:24:07 +02:00
Kevin Traynor
3757e9f8e9 netdev-dpdk: Add shared mempool config.
Mempools may currently be shared between DPDK ports based
on port MTU and NUMA. With some hint from the user we can
increase the sharing on MTU and hence reduce memory
consumption in many cases.

For example, a port with MTU 9000, uses a mempool with an
mbuf size based on 9000 MTU. A port with MTU 1500, uses a
different mempool with an mbuf size based on 1500 MTU.

In this case, assuming same NUMA, both these ports could
share the 9000 MTU mempool.

The user must give a hint as order of creation of ports and
setting of MTUs may vary and we need to ensure that upgrades
from older OVS versions do not require more memory.

This scheme can also prevent multiple mempools being created
for cases where a port is added picking up a default MTU and
an appropriate mempool, but later has its MTU changed to a
different value requiring a different mempool.

Example usage:

 $ ovs-vsctl --no-wait set Open_vSwitch . \
   other_config:shared-mempool-config=9000,1500:1,6000:1

Port added on NUMA 0:
* MTU 1500, use mempool based on 9000 MTU
* MTU 5000, use mempool based on 9000 MTU
* MTU 9000, use mempool based on 9000 MTU
* MTU 9300, use mempool based on 9300 MTU (existing behaviour)

Port added on NUMA 1:
* MTU 1500, use mempool based on 1500 MTU
* MTU 5000, use mempool based on 6000 MTU
* MTU 9000, use mempool based on 9000 MTU
* MTU 9300, use mempool based on 9300 MTU (existing behaviour)

Default behaviour is unchanged and mempools are still only created
when needed.
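
The selection rule implied by the example above can be sketched as follows. This is an illustration inferred from the listed cases, not the netdev-dpdk implementation: the struct and function names are hypothetical, and the "smallest configured MTU that fits" rule is an assumption that happens to reproduce every line of the example.

```c
#include <stddef.h>

#define ANY_NUMA (-1)   /* Entry given without ":numa" applies to all nodes. */

/* One element of shared-mempool-config, e.g. "1500:1" (hypothetical type). */
struct shared_mtu {
    int mtu;
    int numa;           /* NUMA node, or ANY_NUMA. */
};

/* Returns the MTU that the mempool for a port should be based on: the
 * smallest configured MTU that applies to the port's NUMA node and is at
 * least the port MTU, or the port MTU itself if no entry fits (existing
 * behaviour). */
int
mempool_mtu_for_port(const struct shared_mtu *cfg, size_t n,
                     int port_mtu, int port_numa)
{
    int best = 0;

    for (size_t i = 0; i < n; i++) {
        if (cfg[i].numa != ANY_NUMA && cfg[i].numa != port_numa) {
            continue;   /* Entry is restricted to another NUMA node. */
        }
        if (cfg[i].mtu >= port_mtu && (!best || cfg[i].mtu < best)) {
            best = cfg[i].mtu;
        }
    }
    return best ? best : port_mtu;
}
```

For the configuration "9000,1500:1,6000:1", the table would be {9000, ANY_NUMA}, {1500, 1}, {6000, 1}.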

Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Reviewed-by: David Marchand <david.marchand@redhat.com>
Acked-by: Sunil Pai G <sunil.pai.g@intel.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2022-07-14 13:17:59 +01:00
Kevin Traynor
6c50462096 vswitchd.xml: Fix whitespace.
My xml editor keeps autofixing these which means I have to be
careful during 'git add' for unrelated changes. Might as well
just fix them.

Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Reviewed-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-05-04 21:47:18 +02:00
Andreas Karis
e8515c8cc0 ovs-monitor-ipsec: Allow custom options per tunnel.
Tunnels in LibreSwan and OpenSwan allow for many options to be set on a
per tunnel basis. Pass through any options starting with ipsec_ to the
connection in the configuration file. Administrators are responsible for
picking valid key/value pairs.

Signed-off-by: Andreas Karis <ak.karis@gmail.com>
Acked-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-05-04 16:30:21 +02:00
Adrian Moreno
9e56549c2b hmap: use short version of safe loops if possible.
Using SHORT version of the *_SAFE loops makes the code cleaner and less
error prone. So, use the SHORT version and remove the extra variable
when possible for hmap and all its derived types.

In order to be able to use both long and short versions without changing
the name of the macro for all the clients, overload the existing name
and select the appropriate version depending on the number of arguments.
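
The argument-count overloading described above can be illustrated with a toy singly linked list. This is a sketch of the general preprocessor technique only: the NODE_* macro names and the helper functions are hypothetical, and the actual OVS macros for hmap differ in detail.

```c
#include <stddef.h>
#include <stdlib.h>

struct node {
    struct node *next;
    int value;
};

/* Long form: the caller supplies the variable that holds the next node. */
#define NODE_FOR_EACH_SAFE_LONG(ITER, NEXT, HEAD)                         \
    for ((ITER) = (HEAD);                                                 \
         (ITER) && ((NEXT) = (ITER)->next, 1);                            \
         (ITER) = (NEXT))

/* Short form: the next pointer lives in a variable declared by the loop. */
#define NODE_FOR_EACH_SAFE_SHORT(ITER, HEAD)                              \
    for (struct node *next__ = ((ITER) = (HEAD)) ? (ITER)->next : NULL;   \
         (ITER);                                                          \
         (ITER) = next__, next__ = (ITER) ? (ITER)->next : NULL)

/* Selects the fourth argument, which depends on how many were passed. */
#define PICK_SAFE_(_1, _2, _3, NAME, ...) NAME
#define NODE_FOR_EACH_SAFE(...)                                           \
    PICK_SAFE_(__VA_ARGS__, NODE_FOR_EACH_SAFE_LONG,                      \
               NODE_FOR_EACH_SAFE_SHORT, unused)(__VA_ARGS__)

/* Prepends a node holding 'value' to the list and returns the new head. */
struct node *
push(struct node *head, int value)
{
    struct node *n = malloc(sizeof *n);

    if (!n) {
        abort();
    }
    n->value = value;
    n->next = head;
    return n;
}

/* Frees every node using the long (3-argument) form; returns the count. */
size_t
free_all_long(struct node *head)
{
    struct node *iter, *next;
    size_t n = 0;

    NODE_FOR_EACH_SAFE (iter, next, head) {
        free(iter);
        n++;
    }
    return n;
}

/* Same loop with the short (2-argument) form: no extra variable needed. */
size_t
free_all_short(struct node *head)
{
    struct node *iter;
    size_t n = 0;

    NODE_FOR_EACH_SAFE (iter, head) {
        free(iter);
        n++;
    }
    return n;
}
```

Both functions use the same macro name; the preprocessor picks the long or short expansion from the number of arguments, so existing callers of the long form keep compiling unchanged.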

Acked-by: Dumitru Ceara <dceara@redhat.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-03-30 16:59:02 +02:00
Adrian Moreno
e9bf5bffb0 list: use short version of safe loops if possible.
Using the SHORT version of the *_SAFE loops makes the code cleaner
and less error-prone. So, use the SHORT version and remove the extra
variable when possible.

In order to be able to use both long and short versions without changing
the name of the macro for all the clients, overload the existing name
and select the appropriate version depending on the number of arguments.

Acked-by: Dumitru Ceara <dceara@redhat.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-03-30 16:59:02 +02:00
Maxime Coquelin
a7f52b7eb6 vswitchd.xml: Add missing tx-steering PMD option.
This patch documents PMD's other_config:tx-steering option.

Acked-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-01-31 21:36:59 +01:00
Gaetan Rivet
62c2d8a675 netdev-offload: Add multi-thread API.
Expose functions reporting user configuration of offloading threads, as
well as utility functions for multithreading.

This will only expose the configuration knob to the user, while no
datapath will implement the multiple thread request.

This will allow implementations to use this API for offload thread
management in relevant layers before enabling the actual dataplane
implementation.

The offload thread ID is lazily allocated and can as such be in a
different order than the offload thread start sequence.

The RCU thread will sometimes access hardware-offload objects from
a provider for reclamation purposes.  In such case, it will get
a default offload thread ID of 0. Care must be taken that using
this thread ID is safe concurrently with the offload threads.

Signed-off-by: Gaetan Rivet <grive@u256.net>
Reviewed-by: Eli Britstein <elibr@nvidia.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-01-19 01:35:19 +01:00
Eelco Chaudron
512fab8f21 openvswitch: Define the OVS_STATIC_TRACE() macro.
This patch defines the OVS_STATIC_TRACE() macro, and as an
example, adds two of them in the bridge run loop.

Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Paolo Valerio <pvalerio@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-01-18 00:46:30 +01:00
Ilya Maximets
11441385c2 bridge: Fix incorrect configuration of netdev's dpif type.
netdev_set_dpif_type() can only be used with a normalized dpif type
as an argument, which is a constant static string derived from a type
of a dpif_class or a constant string "system".  Usage of a same
constant string allows netdev-offload module to compare types by
simply comparing pointers.

OTOH, 'br->ofproto->type' is a dynamic string that:
a. Can be NULL.
b. Even if not NULL and equal, can be a different dynamically
   allocated string.

Both these qualities break assumptions made by all other modules
related to HW offload, breaking the functionality.
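
Why a dynamically allocated string breaks pointer comparison can be shown in a few lines. The sketch below is illustrative and does not reproduce the OVS functions: normalize_type() and same_type() are hypothetical helpers, and the set of known types ("system", "netdev") is an assumption made for the example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* One canonical constant per type; every user must hold these pointers. */
static const char type_system[] = "system";
static const char type_netdev[] = "netdev";

/* Maps any spelling of a known type (including NULL or "") to its canonical
 * constant.  Returns NULL for types this sketch does not know about. */
const char *
normalize_type(const char *type)
{
    if (!type || !type[0] || !strcmp(type, type_system)) {
        return type_system;
    }
    if (!strcmp(type, type_netdev)) {
        return type_netdev;
    }
    return NULL;
}

/* Cheap comparison that is only valid for normalized types. */
bool
same_type(const char *a, const char *b)
{
    return a == b;
}
```

A separately stored copy of "system" has equal contents but a different address, so same_type() reports a mismatch until the copy is passed through normalize_type().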

Fix that by moving netdev_set_dpif_type() to dpif.c and calling with
a correct constant string as an argument.

The call moved from bridge.c to dpif.c, because we need to have access
to the dpif class, but bridge.c should not.

Not trying to set the dpif_type inside the netdev_ports_insert(),
because it's used now outside the offloading context.  So, it's
cleaner to move the netdev_set_dpif_type() call outside of the
netdev-offload module.

Additionally removed the redundant call from the netdev_ports_insert()
and refactored the function, since it doesn't need an extra argument
anymore.

Fixes: 4f19a78a61c5 ("netdev-vport: Fix userspace tunnel ioctl(SIOCGIFINDEX) info logs.")
Reported-by: Roi Dayan <roid@nvidia.com>
Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2021-December/390117.html
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Reviewed-by: Lin Huang <linhuang@ruijie.com.cn>
Acked-by: Roi Dayan <roid@nvidia.com>
2021-12-17 21:31:55 +01:00
Lin Huang
4f19a78a61 netdev-vport: Fix userspace tunnel ioctl(SIOCGIFINDEX) info logs.
A userspace tunnel doesn't have a valid device in the kernel.  So the
get_ifindex() function (ioctl) always gets an error when
adding a port, deleting a port or updating a port status.

The info log is
"2021-08-29T09:17:39.830Z|00059|netdev_linux|INFO|ioctl(SIOCGIFINDEX)
on vxlan_sys_4789 device failed: No such device"

If there are a lot of userspace tunnel ports on a bridge, the
iface_refresh_netdev_status() function will spend a lot of time.

So skip the ioctl(SIOCGIFINDEX) operation for userspace tunnel ports
and just return -ENODEV.

Signed-off-by: Lin Huang <linhuang@ruijie.com.cn>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2021-12-08 18:17:19 +01:00
Ilya Maximets
066741d9c5 ovsdb-idl: Add memory report function.
Added a new function to return memory usage statistics for database
objects inside IDL.  The statistics are similar to what ovsdb-server
reports.  Not counting the _Server database as it should be small, hence
it isn't worth adding extra code to the ovsdb-cs module.  Can be added
later if needed.

ovs-vswitchd is a user in OVS, but this API will be mostly useful for
OVN daemons.

Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Han Zhou <hzhou@ovn.org>
Acked-by: Dumitru Ceara <dceara@redhat.com>
2021-11-04 23:13:13 +01:00
Ilya Maximets
51946d2227 ovsdb-data: Optimize union of sets.
Current algorithm of ovsdb_datum_union looks like this:

  for-each atom in b:
      if not bin_search(a, atom):
          push(a, clone(atom))
  quicksort(a)

So, the complexity looks like this:

   Nb * log2(Na)   +    Nb     +   (Na + Nb) * log2(Na + Nb)
   Comparisons        clones       Comparisons for quicksort
   for search

ovsdb_datum_union() is heavily used in database transactions when a
new element is added to a set.  For example, when a new logical switch
port is added to a logical switch in OVN.  This is a very common
use case where CMS adds one new port to an existing switch that
already has, let's say, 100 ports.  For this case ovsdb-server will
have to perform:

   1 * log2(100)  + 1 clone + 101 * log2(101)
   Comparisons                Comparisons for
   for search                   quicksort.
       ~7           1            ~707
   Roughly 714 comparisons of atoms and 1 clone.

Since binary search can give us position, where new atom should go
(it's the 'low' index after the search completion) for free, the
logic can be re-worked like this:

  copied = 0
  for-each atom in b:
      desired_position = bin_search(a, atom)
      push(result, a[ copied : desired_position - 1 ])
      copied = desired_position
      push(result, clone(atom))
  push(result, a[ copied : Na ])
  swap(a, result)

Complexity of this schema:

   Nb * log2(Na)   +    Nb     +         Na
   Comparisons        clones       memory copy on push
   for search

'swap' is just a swap of a few pointers.  'push' is not a 'clone',
but a simple memory copy of 'union ovsdb_atom'.

In general, this schema substitutes complexity of a quicksort
with complexity of a memory copy of Na atom structures, where we're
not even copying strings that these atoms are pointing to.

Complexity in the example above goes down from 714 comparisons
to 7 comparisons and memcpy of 100 * sizeof (union ovsdb_atom) bytes.
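
The reworked schema can be sketched in a few lines.  This is only an
illustration over plain sorted lists, not the ovsdb_datum_union()
implementation: 'union_sorted' is a hypothetical name, both inputs are
assumed to be sorted and free of duplicates, and list slices stand in for
bulk memory copies.  The explicit skip of atoms already present in 'a' is
added so that the result stays a set.

```python
from bisect import bisect_left


def union_sorted(a, b):
    """Return the sorted union of sorted, duplicate-free lists a and b."""
    result = []
    copied = 0  # Number of leading elements of 'a' already in 'result'.
    for atom in b:
        pos = bisect_left(a, atom)  # Binary search gives the position.
        if pos < len(a) and a[pos] == atom:
            continue  # Already in 'a'; it is copied with a later slice.
        result.extend(a[copied:pos])  # Bulk copy of the untouched run.
        copied = pos
        result.append(atom)
    result.extend(a[copied:])  # Copy whatever remains of 'a'.
    return result


ports = list(range(0, 200, 2))      # 100 existing sorted elements.
merged = union_sorted(ports, [51])  # Add one new element.
print(len(merged))                  # 101
```

No sort is run on the result: each new atom lands at the position found
by the binary search and the rest of 'a' is copied around it.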

General complexity of a memory copy should always be lower than
complexity of a quicksort, especially because these copies are usually
performed in bulk, so this new schema should work faster for any input.

All in all, this change makes it possible to execute several times more
transactions per second for transactions that add new entries to sets.

Alternatively, union can be implemented as a linear merge of two
sorted arrays, but this will result in O(Na) comparisons, which
is more than Nb * log2(Na) in common case, since Na is usually
far bigger than Nb.  Linear merge will also mean per-atom memory
copies instead of copying in bulk.

'replace' functionality of ovsdb_datum_union() had no users, so it
was just removed.  But it can easily be added back if needed in the future.

Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Han Zhou <hzhou@ovn.org>
Acked-by: Mark D. Gray <mark.d.gray@redhat.com>
2021-09-24 14:55:54 +02:00
Rosemarie O'Riorden
de15afa50f dpdk: Stop configuring socket-limit with the value of socket-mem.
This change removes the automatic memory limit on start-up of OVS with
DPDK. As DPDK supports dynamic memory allocation, there is no
need to limit the amount of memory available, if not requested.

Currently, if socket-limit is not configured, it is set to the value of
socket-mem. With this change, the user can decide to set it or have no
memory limit.

Removed logs that announce this change and fixed documentation.

Reported at: https://bugzilla.redhat.com/show_bug.cgi?id=1949850
Signed-off-by: Rosemarie O'Riorden <roriorde@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2021-07-26 03:31:46 +02:00
Rosemarie O'Riorden
a8621f49d0 dpdk: Remove default values for socket-mem and limit.
This change removes the default values for EAL args socket-mem and
socket-limit.  As DPDK supports dynamic memory allocation, there is no
need to allocate a certain amount of memory on start-up, nor limit the
amount of memory available, if not requested.

Currently, socket-mem has a default value of 1024 when it is not
configured by the user, and socket-limit takes on the value of
socket-mem, 1024, by default.  With this change, socket-mem is not
configured by default, meaning that socket-limit is not either.
Neither, either or both options can be set.
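
The behaviour described in this message can be sketched as follows.  It
is a simplified illustration, not the OVS code: 'eal_memory_args' is a
hypothetical helper, and reading the values from the other_config keys
dpdk-socket-mem and dpdk-socket-limit is assumed for the example.

```python
def eal_memory_args(other_config):
    """Build the memory-related EAL arguments from user configuration."""
    args = []
    socket_mem = other_config.get("dpdk-socket-mem")
    # With no explicit limit, socket-limit follows socket-mem, so it
    # stays unset when socket-mem is unset.
    socket_limit = other_config.get("dpdk-socket-limit", socket_mem)
    if socket_mem is not None:
        args += ["--socket-mem", socket_mem]
    if socket_limit is not None:
        args += ["--socket-limit", socket_limit]
    return args


print(eal_memory_args({}))  # []
print(eal_memory_args({"dpdk-socket-mem": "1024"}))
# ['--socket-mem', '1024', '--socket-limit', '1024']
```

With an empty configuration no memory arguments are passed at all, so
the DPDK defaults apply.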

Removed extra logs that announce this change and fixed documentation.

Reported at: https://bugzilla.redhat.com/show_bug.cgi?id=1949850
Signed-off-by: Rosemarie O'Riorden <roriorde@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2021-07-26 03:31:46 +02:00
Mark Gray
b1e517bd2f dpif-netlink: Introduce per-cpu upcall dispatch.
The Open vSwitch kernel module uses the upcall mechanism to send
packets from kernel space to user space when it misses in the kernel
space flow table. The upcall sends packets via a Netlink socket.
Currently, a Netlink socket is created for every vport. In this way,
there is a 1:1 mapping between a vport and a Netlink socket.
When a packet is received by a vport, if it needs to be sent to
user space, it is sent via the corresponding Netlink socket.

This mechanism, with various iterations of the corresponding user
space code, has seen some limitations and issues:

* On systems with a large number of vports, there is correspondingly
a large number of Netlink sockets which can limit scaling.
(https://bugzilla.redhat.com/show_bug.cgi?id=1526306)
* Packet reordering on upcalls.
(https://bugzilla.redhat.com/show_bug.cgi?id=1844576)
* A thundering herd issue.
(https://bugzilla.redhat.com/show_bug.cgi?id=1834444)

This patch introduces an alternative, feature-negotiated, upcall
mode using a per-cpu dispatch rather than a per-vport dispatch.

In this mode, the Netlink socket to be used for the upcall is
selected based on the CPU of the thread that is executing the upcall.
In this way, it resolves the issues above as:

a) The number of Netlink sockets scales with the number of CPUs
rather than the number of vports.
b) Ordering per-flow is maintained as packets are distributed to
CPUs based on mechanisms such as RSS and flows are distributed
to a single user space thread.
c) Packets from a flow can only wake up one user space thread.
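
The difference between the two dispatch modes can be sketched as below.
This is a toy model, not the kernel or dpif-netlink code: the function
names and socket labels are hypothetical, and strings stand in for
Netlink sockets.

```python
def per_vport_sockets(vports):
    """Per-vport dispatch: one socket for every vport (1:1 mapping)."""
    return {vport: f"nl-sock-vport-{vport}" for vport in vports}


def per_cpu_sockets(n_cpus):
    """Per-cpu dispatch: one socket for every CPU."""
    return [f"nl-sock-cpu-{cpu}" for cpu in range(n_cpus)]


def select_socket_per_cpu(sockets, cpu_id):
    """Pick the socket by the CPU that is executing the upcall."""
    return sockets[cpu_id]


vport_map = per_vport_sockets(range(1000))
cpu_socks = per_cpu_sockets(8)
print(len(vport_map), len(cpu_socks))       # 1000 8
print(select_socket_per_cpu(cpu_socks, 3))  # nl-sock-cpu-3
```

In this model the socket count depends only on the number of CPUs, and
every upcall executed on a given CPU goes to the same socket.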

Reported-at: https://bugzilla.redhat.com/1844576
Signed-off-by: Mark Gray <mark.d.gray@redhat.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2021-07-16 20:05:03 +02:00
Kevin Traynor
6193e03267 dpif-netdev: Allow pin rxq and non-isolate PMD.
Pinning an rxq to a PMD with pmd-rxq-affinity may be done for
various reasons such as reserving a full PMD for an rxq, or to
ensure that multiple rxqs from a port are handled on different PMDs.

Previously pmd-rxq-affinity always isolated the PMD so no other rxqs
could be assigned to it by OVS. There may be cases where there are
unused cycles on those PMDs and the user would like OVS to be able to
assign other rxqs to them as well.

Add an option to pin the rxq and non-isolate the PMD. The default
behaviour is unchanged, which is pin and isolate the PMD.

In order to pin and non-isolate:
ovs-vsctl set Open_vSwitch . other_config:pmd-rxq-isolate=false

Note this is available only with group assignment type, as pinning
conflicts with the operation of the other rxq assignment algorithms.

Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: Sunil Pai G <sunil.pai.g@intel.com>
Acked-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2021-07-16 16:51:57 +01:00
Kevin Traynor
3dd050909a dpif-netdev: Add group rxq scheduling assignment type.
Add an rxq scheduling option that allows rxqs to be grouped
on a pmd based purely on their load.

The current default 'cycles' assignment sorts rxqs by measured
processing load and then assigns them to a list of round robin PMDs.
This helps to keep the rxqs that require most processing on different
cores but as it selects the PMDs in round robin order, it equally
distributes rxqs to PMDs.

'cycles' assignment has the advantage that it keeps the most loaded
rxqs off the same core while still spreading the rxqs across a broad
range of PMDs to mitigate against changes to traffic pattern.

'cycles' assignment has the disadvantage that in order to make the
trade off between optimising for current traffic load and mitigating
against future changes, it tries to assign an equal number of rxqs
per PMD in a round robin manner and this can lead to a less than optimal
balance of the processing load.

Now that PMD auto load balance can help mitigate future changes in
traffic patterns, a 'group' assignment can be used to assign rxqs based
on their measured cycles and the estimated running total of the PMDs.

In this case, there is no restriction about keeping an equal number of
rxqs per PMD as it is purely load based.

This means that one PMD may have a group of low load rxqs assigned to it
while another PMD has one high load rxq assigned to it, as that is the
best balance of their measured loads across the PMDs.
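
A purely load-based assignment of this kind can be sketched as follows.
This is an illustration under assumptions, not the dpif-netdev code:
'group_assign' is a hypothetical name, the loads are made-up numbers, and
the sketch simply gives each rxq, from the most loaded down, to the PMD
with the lowest running total.

```python
import heapq


def group_assign(rxq_cycles, n_pmds):
    """Assign rxqs to PMDs by measured load only."""
    pmds = [(0, pmd) for pmd in range(n_pmds)]  # (running total, PMD id)
    heapq.heapify(pmds)
    assignment = {pmd: [] for pmd in range(n_pmds)}
    # Handle the most loaded rxqs first.
    by_load = sorted(rxq_cycles.items(), key=lambda kv: kv[1], reverse=True)
    for rxq, cycles in by_load:
        total, pmd = heapq.heappop(pmds)  # Least loaded PMD so far.
        assignment[pmd].append(rxq)
        heapq.heappush(pmds, (total + cycles, pmd))
    return assignment


loads = {"rxq0": 80, "rxq1": 10, "rxq2": 10, "rxq3": 10}
print(group_assign(loads, 2))
# {0: ['rxq0'], 1: ['rxq1', 'rxq2', 'rxq3']}
```

The result matches the case described above: one PMD carries the single
high load rxq while the other carries the group of low load rxqs.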

Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: Sunil Pai G <sunil.pai.g@intel.com>
Acked-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2021-07-16 16:51:47 +01:00
Vasu Dasari
ccc24fc88d ofproto-dpif: APIs and CLI option to add/delete static fdb entry.
Currently there is an option to add/flush/show ARP/ND neighbor. This
covers L3 side.  For L2 side, there is only fdb show command.  This
commit gives an option to add/del an fdb entry via ovs-appctl.

CLI command looks like:

To add:
    ovs-appctl fdb/add <bridge> <port> <vlan> <Mac>
    ovs-appctl fdb/add br0 p1 0 50:54:00:00:00:05

To del:
    ovs-appctl fdb/del <bridge> <vlan> <Mac>
    ovs-appctl fdb/del br0 0 50:54:00:00:00:05

Added two new APIs to provide convenient interface to add and delete
static-macs.
bool xlate_add_static_mac_entry(const struct ofproto_dpif *,
                                ofp_port_t in_port,
                                struct eth_addr dl_src, int vlan);
bool xlate_delete_static_mac_entry(const struct ofproto_dpif *,
                                   struct eth_addr dl_src, int vlan);

1. Static entry should not age.  To indicate that entry being
   programmed is a static entry, 'expires' field in 'struct mac_entry'
   will be set to a MAC_ENTRY_AGE_STATIC_ENTRY. A check for this value
   is made while deleting mac entry as part of regular aging process.
2. Another change to the mac-update logic, when a packet with same
   dl_src as that of a static-mac entry arrives on any port, the logic
   will not modify the expires field.
3. While flushing fdb entries, made sure static ones are not evicted.
4. Updated "ovs-appctl fdb/stats-show br0" to display number of static
   entries in switch
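
The four points above can be modelled with a toy MAC table.  This is a
sketch only and not the OVS mac-learning code: the 'MacTable' class and
its methods are hypothetical, 'STATIC' stands in for
MAC_ENTRY_AGE_STATIC_ENTRY, and leaving the port of a static entry
unchanged on a MAC move is an assumption of the sketch.

```python
STATIC = object()  # Hypothetical stand-in for MAC_ENTRY_AGE_STATIC_ENTRY.


class MacTable:
    def __init__(self, idle_time=300):
        self.idle_time = idle_time
        self.entries = {}  # (vlan, mac) -> [port, expires]

    def add_static(self, port, vlan, mac):
        self.entries[(vlan, mac)] = [port, STATIC]

    def delete_static(self, vlan, mac):
        entry = self.entries.get((vlan, mac))
        if entry is None or entry[1] is not STATIC:
            return False
        del self.entries[(vlan, mac)]
        return True

    def learn(self, port, vlan, mac, now):
        entry = self.entries.get((vlan, mac))
        if entry is not None and entry[1] is STATIC:
            return  # Traffic never refreshes or moves a static entry.
        self.entries[(vlan, mac)] = [port, now + self.idle_time]

    def expire(self, now):
        """Regular aging: static entries are skipped."""
        for key, (_, expires) in list(self.entries.items()):
            if expires is not STATIC and expires <= now:
                del self.entries[key]

    def flush(self):
        """Flush learned entries but keep static ones."""
        self.entries = {k: v for k, v in self.entries.items()
                        if v[1] is STATIC}

    def n_static(self):
        return sum(1 for _, e in self.entries.values() if e is STATIC)
```

Usage follows the CLI example: add_static("p1", 0, "50:54:00:00:00:05")
plays the role of fdb/add, and the entry then survives learn(), expire()
and flush() until delete_static() removes it.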

Added following tests:
  ofproto-dpif - static-mac add/del/flush
  ofproto-dpif - static-mac mac moves

Reported-at: https://mail.openvswitch.org/pipermail/ovs-discuss/2019-June/048894.html
Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=1597752
Signed-off-by: Vasu Dasari <vdasari@gmail.com>
Tested-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2021-07-16 16:21:02 +02:00
Rosemarie O'Riorden
ae2424696c dpdk: Logs to announce removal of defaults for socket-mem and limit.
Deprecate current OVS provided defaults for DPDK socket-mem and
socket-limit that are planned to be removed in OVS 2.17. At that point
DPDK defaults will be used instead. Warnings have been added to alert
users in advance.

Signed-off-by: Rosemarie O'Riorden <roriorde@redhat.com>
Acked-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2021-07-16 14:00:31 +01:00
Eelco Chaudron
e6ad4d8d9c conntrack: Document all-zero IP SNAT behavior and add a test case.
Currently, conntrack in the kernel has an undocumented feature referred
to as all-zero IP address SNAT. Basically, when a source port
collision is detected during the commit, the source port will be
translated to an ephemeral port. If there is no collision, no SNAT is
performed.

This patchset documents this behavior and adds a self-test to verify
it's not changing. In addition, a datapath feature flag is added for
the all-zero IP SNAT case. This will help applications on top of OVS,
like OVN, to determine whether this feature can be used.

Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Aaron Conole <aconole@redhat.com>
Acked-by: Dumitru Ceara <dceara@redhat.com>
Acked-by: Alin-Gabriel Serdean <aserdean@ovn.org>
Acked-by: Paolo Valerio <pvalerio@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2021-07-08 21:19:14 +02:00
Ben Pfaff
4e948b86c7 bridge: Use correct (legacy) role names in database.
The vswitchd database schema requires role names to be "master" or
"slave", but this code tried to use "primary" and "secondary".

Signed-off-by: Ben Pfaff <blp@ovn.org>
Reported-at: https://github.com/openvswitch/ovs-issues/issues/218
Tested-at: https://github.com/openvswitch/ovs-issues/issues/218#issuecomment-875374045
Fixes: 807152a4ddfb ("Use primary/secondary, not master/slave, as names for OpenFlow roles.")
2021-07-07 11:56:59 -07:00