mirror of https://github.com/openvswitch/ovs synced 2025-08-30 05:47:55 +00:00

18716 Commits

Author SHA1 Message Date
Ilya Maximets
43b7d960af netdev-dummy: Silence the 'may be uninitialized' warning.
GCC 11 with -O1 on Fedora 34 emits a false-positive warning like this:

 lib/netdev-dummy.c: In function ‘dummy_packet_stream_run’:
 lib/netdev-dummy.c:284:16: error: ‘n’ may be used uninitialized in this
                                   function [-Werror=maybe-uninitialized]
  284 |             if (retval == n && dp_packet_size(&s->rxbuf) > 2) {
      |                ^

This breaks the build with --enable-Werror.  Initializing 'n' to
avoid the warning.

Acked-by: Ben Pfaff <blp@ovn.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2021-07-23 15:40:08 +02:00
Ben Pfaff
f05d6d623e ofproto-dpif-xlate: Fix continuations with OF instructions in OF1.1+.
Open vSwitch supports OpenFlow "instructions", which were introduced in
OpenFlow 1.1 and act like restricted kinds of actions that can only
appear in a particular order and particular circumstances.  OVS did
not support two of these instructions, "write_metadata" and
"goto_table", properly in the case where they appeared in a flow that
needed to be frozen for continuations.

Both of these instructions had the problem that they couldn't be
properly serialized into the stream of actions, because they're not
actions.  This commit fixes that problem in freeze_unroll_actions()
by converting them into equivalent actions for serialization.

goto_table had the additional problem that it was being serialized to
the frozen stream even after it had been executed.  This was already
properly handled in do_xlate_actions() for resubmit, which is almost
equivalent to goto_table, so this commit applies the same fix to
goto_table.  (The commit removes an assertion from the goto_table
implementation, but there wasn't any real value in that assertion and
I thought the code looked cleaner without it.)

This commit adds tests that would have found these bugs.  This includes
adding a variant of each continuation test that uses OF1.3 for
monitor/resume (which is necessary to trigger these bugs) plus specific
tests for continuations with goto_table and write_metadata.  It also
improves the continuation test infrastructure to add more detail on
the problem if a test fails.

Signed-off-by: Ben Pfaff <blp@ovn.org>
Reported-by: Grayson Wu <wgrayson@vmware.com>
Reported-at: https://github.com/openvswitch/ovs-issues/issues/213
Discussed-at: https://mail.openvswitch.org/pipermail/ovs-dev/2021-July/386166.html
Acked-by: Ilya Maximets <i.maximets@ovn.org>
2021-07-22 12:26:20 -07:00
Wilson Peng
8e808e7f14 datapath-windows:Correct checksum for DNAT action
While testing OVS-windows flows for the DNAT action, the checksum
in the TCP header is set incorrectly when TCP offload is enabled by
default. As a result, the packet is dropped on the receiving Linux VM.

Sample flow, with the default configuration on both the Windows VM and
the Linux VM:
(src=40.0.1.2,dst=10.150.0.1) --dnat--> (src=40.0.1.2,dst=30.1.0.2)
Without the fix, for some TCP packets (40.0.1.2->30.1.0.2 with payload
length 207) the TCP checksum is left as the pseudo-header checksum, with
the value 0x01d6. With the fix the checksum is 0x47ee, and the receiving
Linux VM gets the correct TCP checksum.
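
For reference, the pseudo-header arithmetic mentioned above can be
reproduced in a few lines of Python. This is only a sketch of the standard
one's-complement sum, not the datapath-windows code. The 20-byte TCP header
(and therefore the TCP length of 227) is an assumption, since the message
only gives the payload length.

```python
import ipaddress
import struct


def ones_complement_sum(data: bytes) -> int:
    """Fold 16-bit big-endian words into a one's-complement sum."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total


def tcp_pseudo_header_sum(src: str, dst: str, tcp_len: int) -> int:
    """Uncomplemented sum over src, dst, zero, protocol (6) and TCP length."""
    pseudo = (ipaddress.IPv4Address(src).packed
              + ipaddress.IPv4Address(dst).packed
              + struct.pack("!BBH", 0, 6, tcp_len))
    return ones_complement_sum(pseudo)


# Assumed for this sketch: 20-byte TCP header + 207-byte payload = 227 bytes.
post_dnat = tcp_pseudo_header_sum("40.0.1.2", "30.1.0.2", 20 + 207)
```

Under that assumption post_dnat is 0x47ee, the same value the message
reports with the fix applied. The message does not say how the patch
itself computes it.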

Signed-off-by: Wilson Peng <pweisong@vmware.com>
Signed-off-by: Anand Kumar <kumaranand@vmware.com>
Acked-by: Alin-Gabriel Serdean <aserdean@ovn.org>
Signed-off-by: Alin-Gabriel Serdean <aserdean@ovn.org>
2021-07-21 12:59:06 +03:00
David Marchand
9547987526 Documentation: Remove duplicate words.
This is a simple cleanup with a script of mine.

Signed-off-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
2021-07-19 09:33:01 -07:00
Ilya Maximets
4703bc67b7 Prepare for post-2.16.0 (2.16.90).
Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2021-07-16 21:52:39 +02:00
Ilya Maximets
45bd6d93f1 Prepare for 2.16.0.
Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2021-07-16 21:52:23 +02:00
Ilya Maximets
298d4151f4 bond: Fix broken rebalancing after link state changes.
There are 3 constraints for moving hashes from one member to another:

  1. The load difference exceeds ~ 3% of the load of one member.
  2. The difference in load between members exceeds 100,000 bytes.
  3. Moving the hash reduces the load difference by more than 10%.

In the current implementation, if one of the members transitions to
the DOWN state, all hashes assigned to it will be moved to the other
members.  After that, if the member goes UP, it will wait for
rebalancing to get hashes.  But in case we have more than 10 equally
loaded hashes, it will never meet constraint # 3, because each hash
will handle less than 10% of the load.  The situation gets worse when
the number of flows grows and it is almost impossible to transfer any
hash when all 256 hash records are used, which is very likely when we
have a few hundred or a few thousand flows.

As a result, if one of the members goes down and back up while traffic
flows, it will never be used to transmit packets again.  This will not
be fixed even if we completely stop the traffic and start it again,
because the first two constraints will block rebalancing in the
earlier stages, while we have low traffic volume.

Moving a single hash if the destination does not have any hashes,
as it was before commit c460a6a7bc75 ("ofproto/bond: simplifying the
rebalancing logic"), will not help, because a single hash is not
enough to make the difference in load less than 10% of the total load,
and this member will handle only that one hash forever.

To fix this, let's try to move multiple hashes at the same time to
meet constraint # 3.

The implementation includes sorting the "records" to be able to
collect records with a cumulative load close enough to the ideal value.
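
The effect can be sketched with a toy model. It is not the bond.c
implementation: constraint # 3 is approximated here as "the moved load must
exceed 10% of the total load", and the function names and the greedy
largest-first selection are assumptions made for illustration.

```python
def single_hash_qualifies(hash_load, total_load, threshold=0.10):
    """Toy version of constraint 3: the move must shift more than 10%."""
    return hash_load > threshold * total_load


def pick_hashes(hash_loads, from_load, to_load):
    """Greedily collect hashes, largest first, without overshooting the
    ideal amount that would leave both members equally loaded."""
    ideal = (from_load - to_load) / 2
    picked, moved = [], 0
    for load in sorted(hash_loads, reverse=True):
        if moved + load <= ideal:
            picked.append(load)
            moved += load
    return picked


hashes = [50] * 20     # 20 equally loaded hashes, 5% of the load each
total = sum(hashes)    # everything sits on one member; the other just came UP
batch = pick_hashes(hashes, from_load=total, to_load=0)
```

In this model no single hash qualifies on its own, while the batch of ten
hashes moves half of the load and easily clears the threshold.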

Acked-by: Ben Pfaff <blp@ovn.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2021-07-16 21:52:23 +02:00
Mark Gray
b1e517bd2f dpif-netlink: Introduce per-cpu upcall dispatch.
The Open vSwitch kernel module uses the upcall mechanism to send
packets from kernel space to user space when it misses in the kernel
space flow table. The upcall sends packets via a Netlink socket.
Currently, a Netlink socket is created for every vport. In this way,
there is a 1:1 mapping between a vport and a Netlink socket.
When a packet is received by a vport, if it needs to be sent to
user space, it is sent via the corresponding Netlink socket.

This mechanism, with various iterations of the corresponding user
space code, has seen some limitations and issues:

* On systems with a large number of vports, there is correspondingly
a large number of Netlink sockets which can limit scaling.
(https://bugzilla.redhat.com/show_bug.cgi?id=1526306)
* Packet reordering on upcalls.
(https://bugzilla.redhat.com/show_bug.cgi?id=1844576)
* A thundering herd issue.
(https://bugzilla.redhat.com/show_bug.cgi?id=1834444)

This patch introduces an alternative, feature-negotiated, upcall
mode using a per-cpu dispatch rather than a per-vport dispatch.

In this mode, the Netlink socket to be used for the upcall is
selected based on the CPU of the thread that is executing the upcall.
In this way, it resolves the issues above as:

a) The number of Netlink sockets scales with the number of CPUs
rather than the number of vports.
b) Ordering per-flow is maintained as packets are distributed to
CPUs based on mechanisms such as RSS and flows are distributed
to a single user space thread.
c) Packets from a flow can only wake up one user space thread.
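
The difference between the two dispatch modes can be sketched as follows.
This is a toy model, not the kernel or dpif-netlink code: the socket names
are made up, and the modulo fallback for a CPU id beyond the socket array
is an assumption of the sketch.

```python
def per_vport_socket(vport_id, sockets_by_vport):
    """Per-vport dispatch: one Netlink socket per vport."""
    return sockets_by_vport[vport_id]


def per_cpu_socket(cpu_id, sockets_by_cpu):
    """Per-cpu dispatch: socket chosen by the CPU running the upcall.

    Wrapping with modulo when cpu_id exceeds the array is assumed here.
    """
    return sockets_by_cpu[cpu_id % len(sockets_by_cpu)]


n_vports, n_cpus = 1000, 8
sockets_by_vport = {v: "vport-sock-%d" % v for v in range(n_vports)}
sockets_by_cpu = ["cpu-sock-%d" % c for c in range(n_cpus)]

# Two upcalls handled on CPU 3 always reach the same socket, whatever
# vport the packets arrived on.
first = per_cpu_socket(3, sockets_by_cpu)
second = per_cpu_socket(3, sockets_by_cpu)
```

With 1000 vports and 8 CPUs the per-vport table holds 1000 sockets while
the per-cpu table holds 8.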

Reported-at: https://bugzilla.redhat.com/1844576
Signed-off-by: Mark Gray <mark.d.gray@redhat.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2021-07-16 20:05:03 +02:00
Mark Gray
485e3a13a6 dpif-netlink: Fix report_loss() message.
Fixes: 1579cf677fcb ("dpif-linux: Implement the API functions to allow multiple handler threads read upcall.")
Signed-off-by: Mark Gray <mark.d.gray@redhat.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2021-07-16 20:05:03 +02:00
Mark Gray
1325debb45 ofproto: Change type of n_handlers and n_revalidators.
'n_handlers' and 'n_revalidators' are declared as type 'size_t'.
However, dpif_handlers_set() requires parameter 'n_handlers' as
type 'uint32_t'. This patch fixes this type mismatch.

Signed-off-by: Mark Gray <mark.d.gray@redhat.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2021-07-16 20:05:03 +02:00
David Marchand
3222a89d9a dpif-netdev: Report overhead busy cycles per pmd.
Users complained that per rxq pmd usage was confusing: summing those
values per pmd would never reach 100% even if increasing traffic load
beyond pmd capacity.

This is because the dpif-netdev/pmd-rxq-show command only reports "pure"
rxq cycles, while some cycles are used in the pmd mainloop and add up to
the total pmd load.

dpif-netdev/pmd-stats-show does report per pmd load usage.
This load is measured since the last dpif-netdev/pmd-stats-clear call.
On the other hand, the per rxq pmd usage reflects the pmd load on a 10s
sliding window which makes it non trivial to correlate.

Gather per pmd busy cycles with the same periodicity and report the
difference as overhead in dpif-netdev/pmd-rxq-show so that we have all
info in a single command.

Example:
$ ovs-appctl dpif-netdev/pmd-rxq-show
pmd thread numa_id 1 core_id 3:
  isolated : true
  port: dpdk0             queue-id:  0 (enabled)   pmd usage: 90 %
  overhead:  4 %
pmd thread numa_id 1 core_id 5:
  isolated : false
  port: vhost0            queue-id:  0 (enabled)   pmd usage:  0 %
  port: vhost1            queue-id:  0 (enabled)   pmd usage: 93 %
  port: vhost2            queue-id:  0 (enabled)   pmd usage:  0 %
  port: vhost6            queue-id:  0 (enabled)   pmd usage:  0 %
  overhead:  6 %
pmd thread numa_id 1 core_id 31:
  isolated : true
  port: dpdk1             queue-id:  0 (enabled)   pmd usage: 86 %
  overhead:  4 %
pmd thread numa_id 1 core_id 33:
  isolated : false
  port: vhost3            queue-id:  0 (enabled)   pmd usage:  0 %
  port: vhost4            queue-id:  0 (enabled)   pmd usage:  0 %
  port: vhost5            queue-id:  0 (enabled)   pmd usage: 92 %
  port: vhost7            queue-id:  0 (enabled)   pmd usage:  0 %
  overhead:  7 %
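
The relationship between the reported numbers can be sketched as below.
The function name and the cycle counts are hypothetical, chosen only so
that the result mirrors the core_id 5 entry above; the real accounting in
dpif-netdev is more involved.

```python
def pmd_rxq_report(busy_cycles, total_cycles, rxq_cycles):
    """Return per-rxq usage and overhead as percentages of total cycles.

    Overhead is the part of the pmd busy cycles not attributed to any rxq.
    """
    usage = {rxq: 100 * c // total_cycles for rxq, c in rxq_cycles.items()}
    overhead_cycles = max(busy_cycles - sum(rxq_cycles.values()), 0)
    return usage, 100 * overhead_cycles // total_cycles


# Hypothetical cycle counts over one measurement window.
usage, overhead = pmd_rxq_report(
    busy_cycles=9_900,
    total_cycles=10_000,
    rxq_cycles={"vhost0": 0, "vhost1": 9_300, "vhost2": 0, "vhost6": 0},
)
```

Here the rxq usages sum to 93 % and the overhead is 6 %, so the pmd is
99 % busy even though no single rxq line shows it.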

Signed-off-by: David Marchand <david.marchand@redhat.com>
Acked-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2021-07-16 17:43:42 +01:00
Kevin Traynor
30bfba0249 tests: Add new test for cross-numa pmd rxq assignments.
Add some tests to ensure that if there are numa local
PMDs they are used for polling an rxq.

Also check that if there are only numa non-local PMDs they
will be used to poll the rxq, but the user will be warned.

Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: Sunil Pai G <sunil.pai.g@intel.com>
Acked-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2021-07-16 16:52:09 +01:00
Kevin Traynor
6193e03267 dpif-netdev: Allow pin rxq and non-isolate PMD.
Pinning an rxq to a PMD with pmd-rxq-affinity may be done for
various reasons such as reserving a full PMD for an rxq, or to
ensure that multiple rxqs from a port are handled on different PMDs.

Previously pmd-rxq-affinity always isolated the PMD so no other rxqs
could be assigned to it by OVS. There may be cases where there are
unused cycles on those PMDs and the user would like OVS to be able to
assign other rxqs to them as well.

Add an option to pin the rxq and non-isolate the PMD. The default
behaviour is unchanged, which is pin and isolate the PMD.

In order to pin and non-isolate:
ovs-vsctl set Open_vSwitch . other_config:pmd-rxq-isolate=false

Note this is available only with group assignment type, as pinning
conflicts with the operation of the other rxq assignment algorithms.

Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: Sunil Pai G <sunil.pai.g@intel.com>
Acked-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2021-07-16 16:51:57 +01:00
Kevin Traynor
3dd050909a dpif-netdev: Add group rxq scheduling assignment type.
Add an rxq scheduling option that allows rxqs to be grouped
on a pmd based purely on their load.

The current default 'cycles' assignment sorts rxqs by measured
processing load and then assigns them to a list of round robin PMDs.
This helps to keep the rxqs that require most processing on different
cores but as it selects the PMDs in round robin order, it equally
distributes rxqs to PMDs.

'cycles' assignment has the advantage in that it separates the most
loaded rxqs from being on the same core but maintains the rxqs being
spread across a broad range of PMDs to mitigate against changes to
traffic pattern.

'cycles' assignment has the disadvantage that in order to make the
trade off between optimising for current traffic load and mitigating
against future changes, it tries to assign an equal number of rxqs
per PMD in a round robin manner and this can lead to a less than optimal
balance of the processing load.

Now that PMD auto load balance can help mitigate future changes in
traffic patterns, a 'group' assignment can be used to assign rxqs based
on their measured cycles and the estimated running total of the PMDs.

In this case, there is no restriction about keeping an equal number of
rxqs per PMD as it is purely load based.

This means that one PMD may have a group of low load rxqs assigned to it
while another PMD has one high load rxq assigned to it, as that is the
best balance of their measured loads across the PMDs.
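
The two assignment types can be contrasted with a simplified sketch. It is
not the dpif-netdev code: NUMA handling and pinning are ignored, the plain
modulo round robin is an assumption, and the rxq loads are made up.

```python
def assign_cycles(rxq_loads, n_pmds):
    """Simplified 'cycles': busiest rxqs first, PMDs taken round robin."""
    pmds = [[] for _ in range(n_pmds)]
    for i, load in enumerate(sorted(rxq_loads, reverse=True)):
        pmds[i % n_pmds].append(load)
    return pmds


def assign_group(rxq_loads, n_pmds):
    """Simplified 'group': busiest rxqs first, each one placed on the PMD
    with the lowest estimated running total."""
    pmds = [[] for _ in range(n_pmds)]
    for load in sorted(rxq_loads, reverse=True):
        min(pmds, key=sum).append(load)
    return pmds


loads = [80, 10, 10, 10]   # hypothetical measured rxq loads, in percent
cycles = [sum(p) for p in assign_cycles(loads, 2)]
group = [sum(p) for p in assign_group(loads, 2)]
```

With these loads the round robin variant ends up with PMD totals of 90 and
20, while the load based variant puts the three small rxqs together and
ends up with 80 and 30.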

Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: Sunil Pai G <sunil.pai.g@intel.com>
Acked-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2021-07-16 16:51:47 +01:00
Kevin Traynor
4fb54652e0 dpif-netdev: Assign PMD for failed pinned rxqs.
Previously, if pmd-rxq-affinity was used to pin an rxq to
a core that was not in pmd-cpu-mask the rxq was not polled
for and the user received a warning. This meant that no traffic
would be received from that rxq.

Now that pinned and non-pinned rxqs are assigned to PMDs in
a common call to rxq scheduling, if an invalid core is
selected in pmd-rxq-affinity the rxq can be assigned an
available PMD (if any).

A warning will still be logged as the requested core could
not be used.

Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: Sunil Pai G <sunil.pai.g@intel.com>
Acked-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2021-07-16 16:51:35 +01:00
Kevin Traynor
0efefc4f98 dpif-netdev: Sort PMD list by core id for rxq scheduling.
The list of PMDs is round robined through for the selection
when assigning an rxq to a PMD. The list is based on a
hash map, so there is no defined order.

It means the same set of PMDs may get assigned different rxqs
on different runs for no reason other than how the PMDs are stored
in the hash map.

This can be easily changed by sorting the PMDs by core id after
they are extracted, so the PMDs will be used in a consistent order.

Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: Sunil Pai G <sunil.pai.g@intel.com>
Acked-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2021-07-16 16:51:22 +01:00
Kevin Traynor
58fed7e8d8 dpif-netdev: Make PMD auto load balance use common rxq scheduling.
PMD auto load balance had its own separate implementation of the
rxq scheduling that it used for dry runs. This was done because
previously the rxq scheduling was not made reusable for a dry run.

Apart from the code duplication (which is a good enough reason
to replace it alone) this meant that if any further rxq scheduling
changes or assignment types were added they would also have to be
duplicated in the auto load balance code too.

This patch replaces the current PMD auto load balance rxq scheduling
code to reuse the common rxq scheduling code.

The behaviour does not change from a user perspective, except the logs
are updated to be more consistent.

As the dry run will compare the pmd load variances for current and
estimated assignments, new functions are added to populate the current
assignments and use the rxq scheduling data structs for variance
calculations.
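
The dry run comparison can be sketched as a variance calculation over the
per-PMD loads. The function name, the sample loads and the 25 % threshold
are assumptions made for this example, not values taken from the patch.

```python
from statistics import pvariance


def variance_improvement(current_loads, estimated_loads):
    """Percentage by which the estimated assignment lowers load variance."""
    current = pvariance(current_loads)
    estimated = pvariance(estimated_loads)
    if current == 0:
        return 0.0   # already perfectly balanced, nothing to gain
    return (current - estimated) * 100 / current


# Hypothetical per-PMD loads: current assignment vs. dry run result.
improvement = variance_improvement([90, 20], [80, 30])
rebalance = improvement > 25   # hypothetical threshold for this sketch
```

A dry run only leads to a real reassignment when the estimated variance is
sufficiently lower than the current one.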

Now that the new rxq scheduling data structures are being used in
PMD auto load balance, the older rr_* data structs and associated
functions can be removed.

Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: Sunil Pai G <sunil.pai.g@intel.com>
Acked-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2021-07-16 16:51:12 +01:00
Kevin Traynor
f577c2d046 dpif-netdev: Rework rxq scheduling code.
This reworks the current rxq scheduling code to break it into more
generic and reusable pieces.

The behaviour does not change from a user perspective, except the logs
are updated to be more consistent.

From an implementation point of view, there are some changes made with
a view to extending functionality.

The high level reusable functions added in this patch are:
- Generate a list of current numas and pmds
- Perform rxq scheduling assignments into that list
- Effect the rxq scheduling assignments so they are used

Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: Sunil Pai G <sunil.pai.g@intel.com>
Acked-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2021-07-16 16:51:01 +01:00
Vasu Dasari
ccc24fc88d ofproto-dpif: APIs and CLI option to add/delete static fdb entry.
Currently there is an option to add/flush/show ARP/ND neighbor. This
covers L3 side.  For L2 side, there is only fdb show command.  This
commit gives an option to add/del an fdb entry via ovs-appctl.

CLI command looks like:

To add:
    ovs-appctl fdb/add <bridge> <port> <vlan> <Mac>
    ovs-appctl fdb/add br0 p1 0 50:54:00:00:00:05

To del:
    ovs-appctl fdb/del <bridge> <vlan> <Mac>
    ovs-appctl fdb/del br0 0 50:54:00:00:00:05

Added two new APIs to provide convenient interface to add and delete
static-macs.
bool xlate_add_static_mac_entry(const struct ofproto_dpif *,
                                ofp_port_t in_port,
                                struct eth_addr dl_src, int vlan);
bool xlate_delete_static_mac_entry(const struct ofproto_dpif *,
                                   struct eth_addr dl_src, int vlan);

1. Static entry should not age.  To indicate that entry being
   programmed is a static entry, 'expires' field in 'struct mac_entry'
   will be set to a MAC_ENTRY_AGE_STATIC_ENTRY. A check for this value
   is made while deleting mac entry as part of regular aging process.
2. Another change to the mac-update logic, when a packet with same
   dl_src as that of a static-mac entry arrives on any port, the logic
   will not modify the expires field.
3. While flushing fdb entries, made sure static ones are not evicted.
4. Updated "ovs-appctl fdb/stats-show br0" to display number of static
   entries in switch
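
The four points above can be modelled with a small table class. This is a
toy sketch rather than the mac-learning code: the value given to
MAC_ENTRY_AGE_STATIC_ENTRY is a placeholder, the class and method names are
hypothetical, and leaving the port of a static entry untouched on a MAC
move is an assumption.

```python
MAC_ENTRY_AGE_STATIC_ENTRY = -1   # placeholder value, assumed for this sketch


class MacTable:
    """Toy MAC learning table keyed by (mac, vlan)."""

    def __init__(self, idle_time=300):
        self.idle_time = idle_time
        self.entries = {}   # (mac, vlan) -> {"port": ..., "expires": ...}

    def _is_static(self, key):
        entry = self.entries.get(key)
        return (entry is not None
                and entry["expires"] == MAC_ENTRY_AGE_STATIC_ENTRY)

    def add_static(self, mac, vlan, port):
        self.entries[(mac, vlan)] = {"port": port,
                                     "expires": MAC_ENTRY_AGE_STATIC_ENTRY}

    def delete_static(self, mac, vlan):
        if self._is_static((mac, vlan)):
            del self.entries[(mac, vlan)]
            return True
        return False

    def learn(self, mac, vlan, port, now):
        if self._is_static((mac, vlan)):
            return   # traffic does not modify a static entry
        self.entries[(mac, vlan)] = {"port": port,
                                     "expires": now + self.idle_time}

    def expire(self, now):
        for key in list(self.entries):
            if not self._is_static(key) and self.entries[key]["expires"] <= now:
                del self.entries[key]

    def flush(self):
        for key in list(self.entries):
            if not self._is_static(key):
                del self.entries[key]

    def n_static(self):
        return sum(self._is_static(key) for key in self.entries)


table = MacTable()
table.add_static("50:54:00:00:00:05", 0, "p1")
table.learn("50:54:00:00:00:05", 0, "p2", now=0)   # same MAC seen on p2
table.learn("50:54:00:00:00:06", 0, "p2", now=0)   # ordinary dynamic entry
table.flush()
```

After the flush only the static entry remains, still pointing at p1.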

Added following tests:
  ofproto-dpif - static-mac add/del/flush
  ofproto-dpif - static-mac mac moves

Reported-at: https://mail.openvswitch.org/pipermail/ovs-discuss/2019-June/048894.html
Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=1597752
Signed-off-by: Vasu Dasari <vdasari@gmail.com>
Tested-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2021-07-16 16:21:02 +02:00
Rosemarie O'Riorden
ae2424696c dpdk: Logs to announce removal of defaults for socket-mem and limit.
Deprecate current OVS provided defaults for DPDK socket-mem and
socket-limit that are planned to be removed in OVS 2.17. At that point
DPDK defaults will be used instead. Warnings have been added to alert
users in advance.

Signed-off-by: Rosemarie O'Riorden <roriorde@redhat.com>
Acked-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2021-07-16 14:00:31 +01:00
David Marchand
15329b728b flow: Count and dump invalid IP packets.
Skipping further processing of invalid IP packets helps avoid crashes
but it does not help to figure out if the malformed packets are still
present on the network.

Add coverage counters for IPv4 and IPv6 sanity checks so that we know
there are some invalid packets.

Dump such whole packets in debug mode.

Signed-off-by: David Marchand <david.marchand@redhat.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2021-07-16 14:23:00 +02:00
Gaetan Rivet
6545977cee ovs-rcu: Remove unused perthread mutex.
A mutex is allocated, initialized and destroyed, without being
used in the perthread structure.

Signed-off-by: Gaetan Rivet <grive@u256.net>
Acked-by: Ben Pfaff <blp@ovn.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2021-07-16 14:19:41 +02:00
Guzowski Adrian
cb4bff6ff8 Don't mangle shebangs when building DKMS RPM package.
While building the package, some .in files are being subject to shebang
substitution, but the process fails, because given scripts have
placeholders in place of shebangs. In order to fix the issue, don't mangle
shebangs in this specific package.

Signed-off-by: Guzowski Adrian <adrian.guzowski@exatel.pl>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2021-07-16 14:18:10 +02:00
Ilya Maximets
1f38f9dcf2 AUTHORS: Add Adrian Guzowski.
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2021-07-16 14:13:59 +02:00
Guzowski Adrian
2abd8148cf Add ability to override default Release suffix in RPM packages.
In some cases, like building OvS packages in Jenkins, it may be
useful to set a custom version suffix that will correspond with
job's build number, etc. Currently, version number is explicitly
set to 1. This change adds a define "release_number" that may be
overridden during the package building process:

rpmbuild -ba --define="release_number X" ...

Signed-off-by: Guzowski Adrian <adrian.guzowski@exatel.pl>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2021-07-16 14:11:12 +02:00
Terry Wilson
d28c5ca576 python: Add cooperative_yield() API method to Idl.
When using eventlet monkey_patch()'d code, greenthreads can be
blocked on connection for several seconds while the database
contents are parsed. Eventlet recommends adding a sleep(0) call
to cooperatively yield in cpu-bound code. asyncio code has
asyncio.sleep(0). This patch adds an API method that defaults to
doing nothing, but can be overridden to yield as needed.
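
The hook pattern can be sketched with a stand-in class. Parser and
YieldingParser are hypothetical names used so that the example runs without
the ovs or eventlet packages; a real user would override
cooperative_yield() on their Idl subclass and call, for example,
eventlet.sleep(0) there.

```python
import time


class Parser:
    """Stand-in for a class exposing a cooperative_yield() hook."""

    def cooperative_yield(self):
        """Default hook: do nothing."""

    def parse(self, rows):
        parsed = []
        for row in rows:
            parsed.append(row.upper())   # placeholder for cpu-bound work
            self.cooperative_yield()     # give other tasks a chance to run
        return parsed


class YieldingParser(Parser):
    """Overrides the hook to yield after every parsed row."""

    def __init__(self):
        self.yields = 0

    def cooperative_yield(self):
        self.yields += 1
        time.sleep(0)   # an eventlet user would call eventlet.sleep(0) here


parser = YieldingParser()
result = parser.parse(["a", "b", "c"])
```

Callers that never override the hook see no change in behaviour, which is
why a no-op default is a safe choice.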

Signed-off-by: Terry Wilson <twilson@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2021-07-16 14:08:19 +02:00
Timothy Redaelli
487253d5b8 python: Update bundled sortedcontainers to 2.4.0.
This is needed since the current bundled version doesn't work on Python
3.10+.

Signed-off-by: Timothy Redaelli <tredaelli@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2021-07-16 13:57:21 +02:00
David Marchand
6c41bcb138 ci: Do not dump logs on error for GitHub Actions.
The GHA web UI focuses directly on the last lines of a failing step.
config and testsuite logs are attached as artifacts in GHA in case of
failures, so dumping them just adds noise.
Skip dumping those files. Travis is left untouched, though.

Signed-off-by: David Marchand <david.marchand@redhat.com>
Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2021-07-16 13:54:25 +02:00
Eli Britstein
7ab851e1b8 dpif-netdev: Do not execute packet recovery without experimental support.
rte_flow_get_restore_info() API is under experimental attribute. Using it
has a performance impact that can be avoided for non-experimental compilation.

Do not call it without experimental support.

Reported-by: Cian Ferriter <cian.ferriter@intel.com>
Signed-off-by: Eli Britstein <elibr@nvidia.com>
Reviewed-by: Gaetan Rivet <gaetanr@nvidia.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2021-07-16 13:50:28 +02:00
Harry van Haaren
a72c1dfbd5 dpif/dpcls: limit count subtable search info logs
This commit avoids many instances of "using subtable X for miniflow (x,y)"
in the ovs-vswitchd log when using the DPCLS Autovalidator. This occurs
when no specialized subtable is found, and the generic "_any" version of
the avx512 subtable search implementation is used. This change logs the
subtable usage once, avoiding duplicates.

Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
Signed-off-by: kumar Amber <kumar.amber@intel.com>
Co-authored-by: kumar Amber <kumar.amber@intel.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2021-07-16 12:02:43 +01:00
Ian Stokes
26fbd1a1be AUTHORS: Add Cian Ferriter.
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2021-07-16 11:56:24 +01:00
Ian Stokes
83aae83e65 AUTHORS: Add Amber Kumar.
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2021-07-16 11:51:10 +01:00
Harry van Haaren
aa85a25095 dpif-netdev/mfex: Add more AVX512 traffic profiles
This commit adds 3 new traffic profile implementations to the
existing avx512 miniflow extract infrastructure. The profiles added are:
- Ether()/IP()/TCP()
- Ether()/Dot1Q()/IP()/UDP()
- Ether()/Dot1Q()/IP()/TCP()

The design of the avx512 code here is for scalability to add more
traffic profiles, as well as enabling CPU ISA. Note that an implementation
is primarily adding static const data, which the compiler then specializes
away when the profile specific function is declared below.

As a result, the code is relatively maintainable, and scalable for new
traffic profiles as well as new ISA, and does not lower performance
compared with manually written code for each profile/ISA.

Note that confidence in the correctness of each implementation is
achieved through autovalidation, unit tests with known packets, and
fuzz tested packets.

Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2021-07-16 11:31:49 +01:00
Harry van Haaren
250ceddcc2 dpif-netdev/mfex: Add AVX512 based optimized miniflow extract
This commit adds AVX512 implementations of miniflow extract.
By using the 64 bytes available in an AVX512 register, it is
possible to convert a packet to a miniflow data-structure in
a small number of instructions.

The implementation here probes for Ether()/IP()/UDP() traffic,
and builds the appropriate miniflow data-structure for packets
that match the probe.
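
The idea of probing a packet against a traffic profile can be sketched as a
masked byte comparison over the first 64 bytes. This scalar Python version
only illustrates the concept; the offsets follow the standard Ethernet and
IPv4 header layouts, while the function names and the minimum length check
are assumptions and say nothing about how the AVX512 code is organized.

```python
def make_probe():
    """Build a 64-byte mask and pattern for Ether()/IP()/UDP()."""
    mask = bytearray(64)
    pattern = bytearray(64)
    for offset, value in ((12, 0x08), (13, 0x00),   # EtherType: IPv4
                          (14, 0x45),               # IPv4, IHL 5 (no options)
                          (23, 17)):                # IP protocol: UDP
        mask[offset] = 0xFF
        pattern[offset] = value
    return bytes(mask), bytes(pattern)


def probe_matches(packet, mask, pattern):
    """Return True if the masked header bytes equal the pattern."""
    if len(packet) < 14 + 20 + 8:   # Ethernet + IPv4 + UDP headers
        return False
    block = packet[:64].ljust(64, b"\x00")
    return all((b & m) == p for b, m, p in zip(block, mask, pattern))


def build_packet(ip_proto):
    """Minimal 42-byte frame with zeroed addresses, ports and checksums."""
    eth = bytes(12) + b"\x08\x00"
    ip = bytes([0x45]) + bytes(8) + bytes([ip_proto]) + bytes(10)
    return eth + ip + bytes(8)


mask, pattern = make_probe()
udp_ok = probe_matches(build_packet(17), mask, pattern)
tcp_ok = probe_matches(build_packet(6), mask, pattern)
```

A 64-byte vector register lets the same mask and compare be applied to the
whole block at once instead of byte by byte.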

The implementation here is auto-validated by the miniflow
extract autovalidator, hence its correctness can be easily
tested and verified.

Note that this commit is designed to easily allow addition of new
traffic profiles in a scalable way, without code duplication for
each traffic profile.

Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2021-07-16 11:31:42 +01:00
Harry van Haaren
32f93dc5ed dpdk: Add additional CPU ISA detection strings
This commit enables OVS to at runtime check for more detailed
AVX512 capabilities, specifically Byte and Word (BW) extensions,
and Vector Bit Manipulation Instructions (VBMI).

These instructions will be used in the CPU ISA optimized
implementations of traffic profile aware miniflow extract.

Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2021-07-16 11:31:29 +01:00
Harry van Haaren
dc39608d2a dpif/stats: Add miniflow extract opt hits counter
This commit adds a new counter to be displayed to the user when
requesting datapath packet statistics. It counts the number of
packets that are parsed and a miniflow built up from it by the
optimized miniflow extract parsers.

The ovs-appctl command "dpif-netdev/pmd-perf-show" now has an
extra entry indicating if the optimized MFEX was hit:

  - MFEX Opt hits:        6786432  (100.0 %)

Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2021-07-16 11:31:14 +01:00
Kumar Amber
50be6715c0 test/sytem-dpdk: Add unit test for mfex autovalidator
Tests:
  6: OVS-DPDK - MFEX Autovalidator
  7: OVS-DPDK - MFEX Autovalidator Fuzzy
  8: OVS-DPDK - MFEX Configuration

Added a new directory to store the PCAP file used
in the tests and a script to generate the fuzzy traffic
type pcap to be used in fuzzy unit test.

Signed-off-by: Kumar Amber <kumar.amber@intel.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2021-07-16 11:30:31 +01:00
Kumar Amber
a395b132b7 dpif-netdev: Add packet count and core id paramters for study
This commit introduces an additional command line parameter
for the mfex study function. If the user provides a packet count,
study uses it as the minimum number of packets that must be processed;
otherwise a default value is chosen.
It also introduces a third parameter for choosing a particular pmd core.

$ ovs-appctl dpif-netdev/miniflow-parser-set study 500 3

Signed-off-by: Kumar Amber <kumar.amber@intel.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2021-07-16 11:11:13 +01:00
Kumar Amber
5324b54e60 dpif-netdev: Add configure to enable autovalidator at build time.
This commit adds a new configure option that lets the user enable the
autovalidator by default at build time, so that the unit tests
run against it by default.

 $ ./configure --enable-mfex-default-autovalidator

Signed-off-by: Kumar Amber <kumar.amber@intel.com>
Co-authored-by: Harry van Haaren <harry.van.haaren@intel.com>
Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2021-07-16 10:34:49 +01:00
Kumar Amber
5c5c98cec2 docs/dpdk/bridge: Add miniflow extract section.
This commit adds a section to the dpdk/bridge.rst netdev documentation,
detailing the added miniflow functionality. The newly added commands are
documented, and sample output is provided.

The use of auto-validator and special study function is also described
in detail as well as running fuzzy tests.

Signed-off-by: Kumar Amber <kumar.amber@intel.com>
Co-authored-by: Cian Ferriter <cian.ferriter@intel.com>
Signed-off-by: Cian Ferriter <cian.ferriter@intel.com>
Co-authored-by: Harry van Haaren <harry.van.haaren@intel.com>
Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2021-07-16 10:32:41 +01:00
Kumar Amber
72dd22a0df dpif-netdev: Add study function to select the best mfex function
The study function runs all the available implementations
of miniflow_extract, chooses the one whose hitmask has the
most hits, and sets the mfex to that function.

Study can be run at runtime using the following command:

$ ovs-appctl dpif-netdev/miniflow-parser-set study
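
The selection step can be sketched as counting the bits of each hitmask
and keeping the implementation with the most hits. The implementation
names and the string based "packets" are hypothetical stand-ins, used only
to keep the example self-contained.

```python
def make_impl(profile):
    """Build a toy extract function that only handles one profile."""
    def extract(batch):
        hitmask = 0
        for i, pkt in enumerate(batch):
            if pkt == profile:
                hitmask |= 1 << i   # bit i set: packet i was handled
        return hitmask
    return extract


def study(implementations, batch):
    """Return the best implementation name and the hit count of each one."""
    hits = {}
    for name, extract in implementations.items():
        hits[name] = bin(extract(batch)).count("1")
    return max(hits, key=hits.get), hits


implementations = {
    "ip_udp": make_impl("ip_udp"),   # hypothetical implementation names
    "ip_tcp": make_impl("ip_tcp"),
}
batch = ["ip_udp", "ip_tcp", "ip_udp", "ip_udp"]
best, hits = study(implementations, batch)
```

For this batch the UDP profile handles three of the four packets and is
selected.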

Signed-off-by: Kumar Amber <kumar.amber@intel.com>
Co-authored-by: Harry van Haaren <harry.van.haaren@intel.com>
Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2021-07-16 10:29:52 +01:00
Kumar Amber
dd3f5d86d9 dpif-netdev: Add auto validation function for miniflow extract
This patch introduces the auto-validation function, which
allows users to compare the batch of packets obtained from
different miniflow implementations against the linear
miniflow extract and return a hitmask.

The autovalidator function can be triggered at runtime using the
following command:

$ ovs-appctl dpif-netdev/miniflow-parser-set autovalidator
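
As a rough illustration of the comparison idea, the Python sketch below
checks each candidate implementation against a reference on the same batch.
It is not the OVS implementation: the function signature, the use of `None`
for "packet not handled", and the meaning given to each hitmask bit are
assumptions made for the example.

```python
def autovalidate(packets, reference, candidates):
    """Compare candidate extractors against a reference on one batch.

    reference and each candidate take a packet and return a comparable
    result; a candidate returns None for a packet it does not handle.
    Returns {name: (hitmask, mismatches)}, where bit i of hitmask is set
    when the candidate handled packet i, and mismatches lists the indexes
    whose result differs from the reference.
    """
    expected = [reference(pkt) for pkt in packets]
    report = {}
    for name, impl in candidates.items():
        hitmask = 0
        mismatches = []
        for i, pkt in enumerate(packets):
            result = impl(pkt)
            if result is None:
                continue  # Not handled by this implementation.
            hitmask |= 1 << i
            if result != expected[i]:
                mismatches.append(i)
        report[name] = (hitmask, mismatches)
    return report
```

A candidate with an empty mismatch list agrees with the reference on every
packet it handled; any listed index points at a packet worth investigating.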

Signed-off-by: Kumar Amber <kumar.amber@intel.com>
Co-authored-by: Harry van Haaren <harry.van.haaren@intel.com>
Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2021-07-16 09:57:19 +01:00
Kumar Amber
3d8f47bc04 dpif-netdev: Add command line and function pointer for miniflow extract
This patch introduces the MFEX function pointers, which allow
the user to switch between the different miniflow extract implementations
that OVS provides, optimized for the ISA of the CPU.

The user can query the miniflow extract variants available
for that CPU with the following command:

$ ovs-appctl dpif-netdev/miniflow-parser-get

Similarly, a user can set the miniflow implementation with the following
command:

$ ovs-appctl dpif-netdev/miniflow-parser-set name

This gives the user more performance and the flexibility to choose
the miniflow implementation according to their needs.

Signed-off-by: Kumar Amber <kumar.amber@intel.com>
Co-authored-by: Harry van Haaren <harry.van.haaren@intel.com>
Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2021-07-16 09:56:58 +01:00
Ilya Maximets
3e82604b7c docs: Add documentation for ovsdb relay mode.
Main documentation for the service model and tutorial with the use case
and configuration examples.

Acked-by: Mark D. Gray <mark.d.gray@redhat.com>
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2021-07-15 22:38:52 +02:00
Ilya Maximets
e26bf9726f ovsdb: Make clients aware of relay service model.
Clients need to reconnect away from a relay that has no connection
with the database source.  Also, from the consistency point of view, a
relay acts similarly to a follower in the clustered model, so it's not
suitable for leader-only connections.

Acked-by: Mark D. Gray <mark.d.gray@redhat.com>
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2021-07-15 22:38:49 +02:00
Ilya Maximets
edcf441722 ovsdb: relay: Reflect connection status in _Server database.
It might be important for clients to know that a relay has lost its
connection with the relay remote, so that they can reconnect to another relay.

Acked-by: Mark D. Gray <mark.d.gray@redhat.com>
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2021-07-15 22:38:31 +02:00
Ilya Maximets
7964ffe7d2 ovsdb: relay: Add support for transaction forwarding.
The current version of ovsdb relay allows scaling out read-only
access to the primary database.  However, many clients are not
read-only but read-mostly, for example, ovn-controller.

In order to scale out database access for this case, ovsdb-server
needs to process transactions that are not read-only.  Relay is not
allowed to do that, i.e. not allowed to modify the database, but it
can act like a proxy and forward transactions that include database
modifications to the primary server and forward replies back to a
client.  At the same time, it may serve read-only transactions and
monitor requests by itself, greatly reducing the load on the primary
server.

This configuration will slightly increase transaction latency, but
it's not very important for read-mostly use cases.

Implementation details:
With this change, instead of creating a trigger to commit the
transaction, ovsdb-server will create a trigger for transaction
forwarding.  Later, ovsdb_relay_run() will send all new transactions
to the relay source.  Once the transaction reply is received from the
relay source, the ovsdb-relay module will update the state of the
transaction forwarding with the reply.  After that, trigger_run()
will complete the trigger and jsonrpc_server_run() will send the
reply back to the client.  Since the transaction reply from the relay
source will be received after all the updates, the client will receive
all the updates before receiving the transaction reply, as it does in
a normal scenario with other database models.
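
The division of work between a relay and its source can be modelled with a
toy Python sketch. Everything here is hypothetical and greatly simplified:
the `Source` and `Relay` classes, the dictionary-based operations, and the
synchronous call stand in for the asynchronous trigger and JSON-RPC
machinery described above, and none of it reflects the actual OVSDB wire
protocol.

```python
class Source:
    """Hypothetical stand-in for the primary database server."""

    def __init__(self):
        self.rows = {}

    def transact(self, ops):
        """Apply write operations; return (updates, reply)."""
        updates = []
        for op in ops:
            if op["op"] == "insert":
                self.rows[op["key"]] = op["value"]
                updates.append((op["key"], op["value"]))
        return updates, {"result": "ok"}


class Relay:
    """Serves reads from an in-memory copy and forwards writes."""

    def __init__(self, source):
        self.source = source
        self.copy = dict(source.rows)  # In-memory copy, no on-disk storage.
        self.forwarded = 0

    def handle(self, ops):
        if all(op["op"] == "select" for op in ops):
            # Read-only transaction: answered locally.
            return [self.copy.get(op["key"]) for op in ops]
        # Transaction with modifications: forwarded to the source.
        self.forwarded += 1
        updates, reply = self.source.transact(ops)
        for key, value in updates:
            self.copy[key] = value  # Updates land before the reply.
        return reply
```

In this model, the relay's copy already reflects a forwarded write by the
time the reply is returned, which mirrors the ordering guarantee stated in
the commit message.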

Acked-by: Mark D. Gray <mark.d.gray@redhat.com>
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2021-07-15 22:38:07 +02:00
Ilya Maximets
026c77c58d ovsdb: New ovsdb 'relay' service model.
New database service model 'relay' that is needed to scale out
read-mostly database access, e.g. ovn-controller connections to
OVN_Southbound.

In this service model, ovsdb-server connects to an existing OVSDB
server and maintains an in-memory copy of the database.  It serves
read-only transactions and monitor requests on its own, but
forwards write transactions to the relay source.

Key differences from the active-backup replication:
- support for "write" transactions (next commit).
- no on-disk storage (probably faster operation).
- support for multiple remotes (connect to the clustered db).
- doesn't try to keep connection as long as possible, but
  faster reconnects to other remotes to avoid missing updates.
- no need to know the complete database schema beforehand,
  only the schema name.
- can be used along with other standalone and clustered databases
  by the same ovsdb-server process. (doesn't turn the whole
  jsonrpc server to read-only mode)
- supports the modern version of monitors (monitor_cond_since),
  because it is based on ovsdb-cs.
- could be chained, i.e. multiple relays could be connected
  one to another in a row or in a tree-like form.
- doesn't increase availability.
- cannot be converted to other service models or become a main
  active server.

Some performance test results can be found here:
  https://mail.openvswitch.org/pipermail/ovs-dev/2021-July/385825.html

Acked-by: Mark D. Gray <mark.d.gray@redhat.com>
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2021-07-15 22:38:03 +02:00
Ilya Maximets
b4cef64c83 ovsdb: row: Add support for xor-based row updates.
This will be used to apply update3 type updates to ovsdb tables
while processing updates for future ovsdb 'relay' service model.

'ovsdb_datum_apply_diff' is allowed to fail, so adding support
to return this error.
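
The xor-style diff idea can be sketched in Python as follows. This is an
illustration based on my understanding of how diffs for set and map columns
are applied, not a port of 'ovsdb_datum_apply_diff': the function names are
hypothetical and the failure path mentioned above is not modelled.

```python
def apply_set_diff(old, diff):
    """A set diff holds the elements present in exactly one of old and
    new, so applying it is a symmetric difference (xor)."""
    return set(old) ^ set(diff)


def apply_map_diff(old, diff):
    """Apply a map diff:
    - key absent from old: the pair is added;
    - key present with the same value: the pair is removed;
    - key present with a different value: the value is replaced.
    """
    new = dict(old)
    for key, value in diff.items():
        if key in new and new[key] == value:
            del new[key]
        else:
            new[key] = value
    return new
```

For sets, applying the same diff twice gives back the original value, which
is the usual xor property.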

Acked-by: Mark D. Gray <mark.d.gray@redhat.com>
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2021-07-15 22:37:46 +02:00
Ilya Maximets
85dbbe275b ovsdb: table: Expose functions to execute operations on ovsdb tables.
These functions will be used later for ovsdb 'relay' service model, so
moving them to a common code.

Warnings are now translated to ovsdb errors.  The caller in replication.c
only printed inconsistency warnings, but mostly ignored them, so the
same logic is implemented by checking the error tag.

Also, ovsdb_execute_insert() previously printed an incorrect warning about
a duplicate row when the problem was a syntax error in the JSON.  Fixing
that by actually checking for the duplicate and reporting the correct ovsdb
error.

Acked-by: Mark D. Gray <mark.d.gray@redhat.com>
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2021-07-15 22:37:43 +02:00