mirror of https://github.com/openvswitch/ovs synced 2025-08-29 21:38:13 +00:00

3402 Commits

Author SHA1 Message Date
Frode Nordahl
bfee9f6c01 netlink: Add support for parsing link layer address.
Data retrieved from netlink and friends may include link layer
address.  Add type to nl_attr_type and min/max functions to allow
use of nl_policy_parse with this type of data.

While this will not be used by Open vSwitch itself at this time,
sibling and derived projects want to use the great netlink library
that OVS provides, and it is not possible to safely override the
global nl_attr_type symbol at link time.

Signed-off-by: Frode Nordahl <frode.nordahl@canonical.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
2021-08-20 11:32:52 -07:00
Dumitru Ceara
daf627f459 ovsdb-cs: Perform forced reconnects without a backoff.
The ovsdb-cs layer triggers a forced reconnect in various cases:
- when an inconsistency is detected in the data received from the
  remote server.
- when the remote server is running in clustered mode and transitioned
  to "follower", if the client is configured in "leader-only" mode.
- when explicitly requested by upper layers (e.g., by the user
  application, through the IDL layer).

In such cases it's desirable that reconnection should happen as fast as
possible, without the current exponential backoff maintained by the
underlying reconnect object.  Furthermore, since 3c2d6274bcee ("raft:
Transfer leadership before creating snapshots."), leadership changes
inside the clustered database happen more often and, therefore,
"leader-only" clients need to reconnect more often too.

Forced reconnects call jsonrpc_session_force_reconnect() which will not
reset backoff.  To make sure clients reconnect as fast as possible in
the aforementioned scenarios we first call the new API,
jsonrpc_session_reset_backoff(), in ovsdb-cs, for sessions that are in
state CS_S_MONITORING (i.e., the remote is likely still alive and
functioning fine).

jsonrpc_session_reset_backoff() resets the number of backoff-free
reconnect retries to the number of remotes configured for the session,
ensuring that all remotes are retried exactly once with backoff 0.
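
The effect of the reset can be illustrated with a small toy model. This is only a sketch in Python, not the OVS reconnect or jsonrpc code: the class, its fields, and the 1000 ms / 8000 ms backoff bounds are all assumptions made for illustration.

```python
class ReconnectModel:
    """Toy model of a session with several remotes and exponential backoff."""

    def __init__(self, n_remotes, min_backoff_ms=1000, max_backoff_ms=8000):
        # The backoff bounds are illustrative values, not OVS defaults.
        self.n_remotes = n_remotes
        self.max_backoff_ms = max_backoff_ms
        self.backoff_ms = min_backoff_ms
        self.backoff_free_tries = 0

    def reset_backoff(self):
        # One backoff-free attempt per configured remote.
        self.backoff_free_tries = self.n_remotes

    def next_delay_ms(self):
        # Delay to wait before the next connection attempt.
        if self.backoff_free_tries > 0:
            self.backoff_free_tries -= 1
            return 0
        delay = self.backoff_ms
        self.backoff_ms = min(self.backoff_ms * 2, self.max_backoff_ms)
        return delay
```

With three remotes, calling reset_backoff() makes the next three attempts immediate, and only then does the exponential backoff resume.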

This commit also updates the Python IDL and jsonrpc implementations.
The Python IDL wasn't tracking the IDL_S_MONITORING state explicitly;
we now do that too.  Tests were also added to make sure the IDL forced
reconnects happen without backoff.

Reported-at: https://bugzilla.redhat.com/1977264
Suggested-by: Ilya Maximets <i.maximets@ovn.org>
Signed-off-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2021-07-23 17:29:36 +02:00
kumar Amber
69b2bdfd3f system-dpdk.at: Fix module not found error for Python < 3.6.
This fixes the flake8 error on Python versions older than 3.6,
as ModuleNotFoundError is not available before 3.6.

../../tests/mfex_fuzzy.py:5:8: F821 undefined name 'ModuleNotFoundError'
Makefile:5826: recipe for target 'flake8-check' failed

Since it doesn't really make any sense to catch this exception,
the try-except block is simply removed.  Additionally, the check for
scapy is replaced with a more reliable one.  Imports are re-ordered,
because standard imports should go first.

Fixes: 50be6715c0 ("test/sytem-dpdk: Add unit test for mfex autovalidator")
Signed-off-by: kumar Amber <kumar.amber@intel.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2021-07-23 17:25:46 +02:00
Ben Pfaff
f05d6d623e ofproto-dpif-xlate: Fix continuations with OF instructions in OF1.1+.
Open vSwitch supports OpenFlow "instructions", which were introduced in
OpenFlow 1.1 and act like restricted kinds of actions that can only
appear in a particular order and particular circumstances.  OVS did
not support two of these instructions, "write_metadata" and
"goto_table", properly in the case where they appeared in a flow that
needed to be frozen for continuations.

Both of these instructions had the problem that they couldn't be
properly serialized into the stream of actions, because they're not
actions.  This commit fixes that problem in freeze_unroll_actions()
by converting them into equivalent actions for serialization.

goto_table had the additional problem that it was being serialized to
the frozen stream even after it had been executed.  This was already
properly handled in do_xlate_actions() for resubmit, which is almost
equivalent to goto_table, so this commit applies the same fix to
goto_table.  (The commit removes an assertion from the goto_table
implementation, but there wasn't any real value in that assertion and
I thought the code looked cleaner without it.)

This commit adds tests that would have found these bugs.  This includes
adding a variant of each continuation test that uses OF1.3 for
monitor/resume (which is necessary to trigger these bugs) plus specific
tests for continuations with goto_table and write_metadata.  It also
improves the continuation test infrastructure to add more detail on
the problem if a test fails.

Signed-off-by: Ben Pfaff <blp@ovn.org>
Reported-by: Grayson Wu <wgrayson@vmware.com>
Reported-at: https://github.com/openvswitch/ovs-issues/issues/213
Discussed-at: https://mail.openvswitch.org/pipermail/ovs-dev/2021-July/386166.html
Acked-by: Ilya Maximets <i.maximets@ovn.org>
2021-07-22 12:26:20 -07:00
Ilya Maximets
298d4151f4 bond: Fix broken rebalancing after link state changes.
There are 3 constraints for moving hashes from one member to another:

  1. The load difference exceeds ~ 3% of the load of one member.
  2. The difference in load between members exceeds 100,000 bytes.
  3. Moving the hash reduces the load difference by more than 10%.

In the current implementation, if one of the members transitions to
the DOWN state, all hashes assigned to it will be moved to the other
members.  After that, if the member goes UP, it will wait for
rebalancing to get hashes.  But in case we have more than 10 equally
loaded hashes, it will never meet constraint # 3, because each hash
will handle less than 10% of the load.  The situation gets worse when
the number of flows grows and it is almost impossible to transfer any
hash when all 256 hash records are used, which is very likely when we
have few hundred/thousand flows.

As a result, if one of the members goes down and back up while traffic
flows, it will never be used to transmit packets again.  This will not
be fixed even if we completely stop the traffic and start it again,
because the first two constraints will block rebalancing in the
earlier stages, while we have low traffic volume.

Moving a single hash if the destination does not have any hashes,
as it was before commit c460a6a7bc75 ("ofproto/bond: simplifying the
rebalancing logic"), will not help, because a single hash is not
enough to make the difference in load less than 10% of the total load,
and this member will handle only that one hash forever.

To fix this, let's try to move multiple hashes at the same time to
meet constraint # 3.

The implementation includes sorting the "records" to be able to
collect records with a cumulative load close enough to the ideal value.
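
The idea of moving several hashes at once can be sketched as follows. This is a simplified Python illustration, not the code in ofproto/bond.c: the function name, the greedy selection, and the choice of half the load difference as the "ideal" amount are assumptions.

```python
def pick_hashes_to_move(hash_loads, from_load, to_load):
    """Return the hashes to move off the busier member, or [] if not worthwhile.

    hash_loads maps hash id -> bytes carried by that hash on the busier member.
    """
    old_diff = from_load - to_load
    ideal = old_diff / 2  # assumption: moving half the difference evens the pair

    picked, moved = [], 0
    # Largest records first, so the cumulative load approaches 'ideal' quickly.
    for hash_id, load in sorted(hash_loads.items(),
                                key=lambda kv: kv[1], reverse=True):
        if load > 0 and moved + load <= ideal:
            picked.append(hash_id)
            moved += load

    new_diff = abs((from_load - moved) - (to_load + moved))
    # Constraint 3: the move must cut the load difference by more than 10%.
    if new_diff * 10 < old_diff * 9:
        return picked
    return []
```

With 20 equally loaded hashes on one member and none on the other, no single hash satisfies constraint 3, but a group of ten does.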

Acked-by: Ben Pfaff <blp@ovn.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2021-07-16 21:52:23 +02:00
David Marchand
3222a89d9a dpif-netdev: Report overhead busy cycles per pmd.
Users complained that per rxq pmd usage was confusing: summing those
values per pmd would never reach 100% even if increasing traffic load
beyond pmd capacity.

This is because the dpif-netdev/pmd-rxq-show command only reports "pure"
rxq cycles, while some cycles are used in the pmd mainloop and add up to
the total pmd load.

dpif-netdev/pmd-stats-show does report per pmd load usage.
This load is measured since the last dpif-netdev/pmd-stats-clear call.
On the other hand, the per rxq pmd usage reflects the pmd load on a 10s
sliding window which makes it non trivial to correlate.

Gather per pmd busy cycles with the same periodicity and report the
difference as overhead in dpif-netdev/pmd-rxq-show so that we have all
info in a single command.

Example:
$ ovs-appctl dpif-netdev/pmd-rxq-show
pmd thread numa_id 1 core_id 3:
  isolated : true
  port: dpdk0             queue-id:  0 (enabled)   pmd usage: 90 %
  overhead:  4 %
pmd thread numa_id 1 core_id 5:
  isolated : false
  port: vhost0            queue-id:  0 (enabled)   pmd usage:  0 %
  port: vhost1            queue-id:  0 (enabled)   pmd usage: 93 %
  port: vhost2            queue-id:  0 (enabled)   pmd usage:  0 %
  port: vhost6            queue-id:  0 (enabled)   pmd usage:  0 %
  overhead:  6 %
pmd thread numa_id 1 core_id 31:
  isolated : true
  port: dpdk1             queue-id:  0 (enabled)   pmd usage: 86 %
  overhead:  4 %
pmd thread numa_id 1 core_id 33:
  isolated : false
  port: vhost3            queue-id:  0 (enabled)   pmd usage:  0 %
  port: vhost4            queue-id:  0 (enabled)   pmd usage:  0 %
  port: vhost5            queue-id:  0 (enabled)   pmd usage: 92 %
  port: vhost7            queue-id:  0 (enabled)   pmd usage:  0 %
  overhead:  7 %
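
The overhead figure is described above as the difference between the busy cycles of a pmd and the cycles attributed to its rxqs. A minimal Python sketch of that arithmetic follows; the function name and the cycle counts in the usage note are made up for illustration and are not taken from dpif-netdev.

```python
def pmd_overhead_pct(pmd_busy_cycles, rxq_cycles, total_cycles):
    """Percentage of a pmd's cycles spent busy but not attributed to any rxq."""
    overhead = max(pmd_busy_cycles - sum(rxq_cycles), 0)
    return overhead * 100 // total_cycles
```

For instance, a pmd that was busy for 94 out of 100 cycles while its only rxq accounts for 90 of them would report an overhead of 4 %.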

Signed-off-by: David Marchand <david.marchand@redhat.com>
Acked-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2021-07-16 17:43:42 +01:00
Kevin Traynor
30bfba0249 tests: Add new test for cross-numa pmd rxq assignments.
Add some tests to ensure that if there are numa local
PMDs they are used for polling an rxq.

Also check that if there are only numa non-local PMDs they
will be used to poll the rxq, but the user will be warned.

Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: Sunil Pai G <sunil.pai.g@intel.com>
Acked-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2021-07-16 16:52:09 +01:00
Kevin Traynor
6193e03267 dpif-netdev: Allow pin rxq and non-isolate PMD.
Pinning an rxq to a PMD with pmd-rxq-affinity may be done for
various reasons such as reserving a full PMD for an rxq, or to
ensure that multiple rxqs from a port are handled on different PMDs.

Previously pmd-rxq-affinity always isolated the PMD so no other rxqs
could be assigned to it by OVS. There may be cases where there are
unused cycles on those PMDs and the user would like other rxqs to
also be able to be assigned to them by OVS.

Add an option to pin the rxq and non-isolate the PMD. The default
behaviour is unchanged, which is pin and isolate the PMD.

In order to pin and non-isolate:
ovs-vsctl set Open_vSwitch . other_config:pmd-rxq-isolate=false

Note this is available only with group assignment type, as pinning
conflicts with the operation of the other rxq assignment algorithms.

Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: Sunil Pai G <sunil.pai.g@intel.com>
Acked-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2021-07-16 16:51:57 +01:00
Kevin Traynor
3dd050909a dpif-netdev: Add group rxq scheduling assignment type.
Add an rxq scheduling option that allows rxqs to be grouped
on a pmd based purely on their load.

The current default 'cycles' assignment sorts rxqs by measured
processing load and then assigns them to a list of round robin PMDs.
This helps to keep the rxqs that require most processing on different
cores but as it selects the PMDs in round robin order, it equally
distributes rxqs to PMDs.

'cycles' assignment has the advantage in that it separates the most
loaded rxqs from being on the same core but maintains the rxqs being
spread across a broad range of PMDs to mitigate against changes to
traffic pattern.

'cycles' assignment has the disadvantage that in order to make the
trade off between optimising for current traffic load and mitigating
against future changes, it tries to assign an equal number of rxqs
per PMD in a round robin manner and this can lead to a less than optimal
balance of the processing load.

Now that PMD auto load balance can help mitigate with future changes in
traffic patterns, a 'group' assignment can be used to assign rxqs based
on their measured cycles and the estimated running total of the PMDs.

In this case, there is no restriction about keeping equal number of
rxqs per PMD as it is purely load based.

This means that one PMD may have a group of low load rxqs assigned to it
while another PMD has one high load rxq assigned to it, as that is the
best balance of their measured loads across the PMDs.
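
The difference between the two assignment types can be illustrated with a short Python sketch. It is a simplified model, not the dpif-netdev scheduler: it ignores NUMA placement and pinning, uses a plain round robin for 'cycles', and all function names are hypothetical.

```python
def _by_load_desc(rxq_loads):
    # Busiest rxqs first, as both assignment types sort by measured load.
    return sorted(rxq_loads.items(), key=lambda kv: kv[1], reverse=True)


def assign_cycles(rxq_loads, n_pmds):
    """'cycles'-like: hand out sorted rxqs to PMDs in round robin order."""
    pmds = [[] for _ in range(n_pmds)]
    for i, (rxq, _load) in enumerate(_by_load_desc(rxq_loads)):
        pmds[i % n_pmds].append(rxq)
    return pmds


def assign_group(rxq_loads, n_pmds):
    """'group'-like: give each rxq to the PMD with the lowest running total."""
    pmds = [[] for _ in range(n_pmds)]
    totals = [0] * n_pmds
    for rxq, load in _by_load_desc(rxq_loads):
        target = totals.index(min(totals))
        pmds[target].append(rxq)
        totals[target] += load
    return pmds, totals
```

With one rxq at 80 % load and three at 10 % each on two PMDs, the round robin model yields totals of 90 and 20, while the load-based model places the busy rxq alone and groups the three light ones, giving 80 and 30.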

Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: Sunil Pai G <sunil.pai.g@intel.com>
Acked-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2021-07-16 16:51:47 +01:00
Kevin Traynor
4fb54652e0 dpif-netdev: Assign PMD for failed pinned rxqs.
Previously, if pmd-rxq-affinity was used to pin an rxq to
a core that was not in pmd-cpu-mask the rxq was not polled
for and the user received a warning. This meant that no traffic
would be received from that rxq.

Now that pinned and non-pinned rxqs are assigned to PMDs in
a common call to rxq scheduling, if an invalid core is
selected in pmd-rxq-affinity the rxq can be assigned an
available PMD (if any).

A warning will still be logged as the requested core could
not be used.

Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: Sunil Pai G <sunil.pai.g@intel.com>
Acked-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2021-07-16 16:51:35 +01:00
Kevin Traynor
f577c2d046 dpif-netdev: Rework rxq scheduling code.
This reworks the current rxq scheduling code to break it into more
generic and reusable pieces.

The behaviour does not change from a user perspective, except the logs
are updated to be more consistent.

From an implementation view, there are some changes with a view to
extending functionality.

The high level reusable functions added in this patch are:
- Generate a list of current numas and pmds
- Perform rxq scheduling assignments into that list
- Effect the rxq scheduling assignments so they are used

Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: Sunil Pai G <sunil.pai.g@intel.com>
Acked-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2021-07-16 16:51:01 +01:00
Vasu Dasari
ccc24fc88d ofproto-dpif: APIs and CLI option to add/delete static fdb entry.
Currently there is an option to add/flush/show ARP/ND neighbor. This
covers L3 side.  For L2 side, there is only fdb show command.  This
commit gives an option to add/del an fdb entry via ovs-appctl.

CLI command looks like:

To add:
    ovs-appctl fdb/add <bridge> <port> <vlan> <Mac>
    ovs-appctl fdb/add br0 p1 0 50:54:00:00:00:05

To del:
    ovs-appctl fdb/del <bridge> <vlan> <Mac>
    ovs-appctl fdb/del br0 0 50:54:00:00:00:05

Added two new APIs to provide convenient interface to add and delete
static-macs.
bool xlate_add_static_mac_entry(const struct ofproto_dpif *,
                                ofp_port_t in_port,
                                struct eth_addr dl_src, int vlan);
bool xlate_delete_static_mac_entry(const struct ofproto_dpif *,
                                   struct eth_addr dl_src, int vlan);

1. Static entry should not age.  To indicate that entry being
   programmed is a static entry, 'expires' field in 'struct mac_entry'
   will be set to a MAC_ENTRY_AGE_STATIC_ENTRY. A check for this value
   is made while deleting mac entry as part of regular aging process.
2. Another change to the mac-update logic, when a packet with same
   dl_src as that of a static-mac entry arrives on any port, the logic
   will not modify the expires field.
3. While flushing fdb entries, made sure static ones are not evicted.
4. Updated "ovs-appctl fdb/stats-show br0" to display number of static
   entries in switch
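
The four rules above can be modelled with a small Python class. This is only a sketch and not the OVS mac-learning code: the class, its methods, and the STATIC sentinel (standing in for MAC_ENTRY_AGE_STATIC_ENTRY) are hypothetical, and leaving the port of a static entry untouched when the MAC shows up elsewhere is an assumption of the sketch.

```python
STATIC = object()  # hypothetical stand-in for MAC_ENTRY_AGE_STATIC_ENTRY


class MacTable:
    def __init__(self, idle_time=300):
        self.idle_time = idle_time
        self.entries = {}  # (vlan, mac) -> (port, expires)

    def add_static(self, vlan, mac, port):
        self.entries[(vlan, mac)] = (port, STATIC)

    def del_static(self, vlan, mac):
        entry = self.entries.get((vlan, mac))
        if entry is not None and entry[1] is STATIC:
            del self.entries[(vlan, mac)]
            return True
        return False

    def learn(self, vlan, mac, port, now):
        entry = self.entries.get((vlan, mac))
        if entry is not None and entry[1] is STATIC:
            return  # rule 2: a static entry is not refreshed by traffic
        self.entries[(vlan, mac)] = (port, now + self.idle_time)

    def age(self, now):
        # Rule 1: static entries never expire.
        self.entries = {k: v for k, v in self.entries.items()
                        if v[1] is STATIC or v[1] > now}

    def flush(self):
        # Rule 3: flushing keeps static entries.
        self.entries = {k: v for k, v in self.entries.items()
                        if v[1] is STATIC}

    def n_static(self):
        # Rule 4: count of static entries, as a stats command might show.
        return sum(1 for v in self.entries.values() if v[1] is STATIC)
```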

Added following tests:
  ofproto-dpif - static-mac add/del/flush
  ofproto-dpif - static-mac mac moves

Reported-at: https://mail.openvswitch.org/pipermail/ovs-discuss/2019-June/048894.html
Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=1597752
Signed-off-by: Vasu Dasari <vdasari@gmail.com>
Tested-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2021-07-16 16:21:02 +02:00
Harry van Haaren
dc39608d2a dpif/stats: Add miniflow extract opt hits counter
This commit adds a new counter to be displayed to the user when
requesting datapath packet statistics. It counts the number of
packets that are parsed and a miniflow built up from it by the
optimized miniflow extract parsers.

The ovs-appctl command "dpif-netdev/pmd-perf-show" now has an
extra entry indicating if the optimized MFEX was hit:

  - MFEX Opt hits:        6786432  (100.0 %)

Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2021-07-16 11:31:14 +01:00
Kumar Amber
50be6715c0 test/sytem-dpdk: Add unit test for mfex autovalidator
Tests:
  6: OVS-DPDK - MFEX Autovalidator
  7: OVS-DPDK - MFEX Autovalidator Fuzzy
  8: OVS-DPDK - MFEX Configuration

Added a new directory to store the PCAP file used
in the tests and a script to generate the fuzzy traffic
type pcap to be used in fuzzy unit test.

Signed-off-by: Kumar Amber <kumar.amber@intel.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2021-07-16 11:30:31 +01:00
Ilya Maximets
7964ffe7d2 ovsdb: relay: Add support for transaction forwarding.
The current version of ovsdb relay allows scaling out read-only
access to the primary database.  However, many clients are not
read-only but read-mostly.  For example, ovn-controller.

In order to scale out database access for this case, ovsdb-server
needs to process transactions that are not read-only.  Relay is not
allowed to do that, i.e. not allowed to modify the database, but it
can act like a proxy and forward transactions that include database
modifications to the primary server and forward replies back to a
client.  At the same time it may serve read-only transactions and
monitor requests by itself, greatly reducing the load on the primary
server.

This configuration will slightly increase transaction latency, but
it's not very important for read-mostly use cases.

Implementation details:
With this change instead of creating a trigger to commit the
transaction, ovsdb-server will create a trigger for transaction
forwarding.  Later, ovsdb_relay_run() will send all new transactions
to the relay source.  Once the transaction reply is received from the
relay source, the ovsdb-relay module will update the state of the
transaction forwarding with the reply.  After that, trigger_run()
will complete the trigger and jsonrpc_server_run() will send the
reply back to the client.  Since transaction reply from the relay
source will be received after all the updates, client will receive
all the updates before receiving the transaction reply as it is in
a normal scenario with other database models.

Acked-by: Mark D. Gray <mark.d.gray@redhat.com>
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2021-07-15 22:38:07 +02:00
Ilya Maximets
e93fc5db9b ovsdb: storage: Allow setting the name for the unbacked storage.
ovsdb_create() requires schema or storage to be nonnull, but in
practice it requires a schema name or a storage name to
use as the database name.  Only clustered storage has a name.
This means that only a clustered database can be created without
a schema.  Change that by allowing unbacked storage to have a
name.  This way we can create a database with unbacked storage
without a schema.  Will be used in the next commits to create a
database for the ovsdb 'relay' service model.

Acked-by: Mark D. Gray <mark.d.gray@redhat.com>
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2021-07-15 22:37:32 +02:00
Ilya Maximets
00dda78ed4 ovsdb-cs: Avoid unnecessary re-connections when updating remotes.
If a new database server is added to the cluster, or if one of the
database servers changed its IP address or port, then you need to
update the list of remotes for the client.  For example, if a new
OVN_Southbound database server is added, you need to update the
ovn-remote for the ovn-controller.

However, in the current implementation, the ovsdb-cs module always
closes the current connection and creates a new one.  This can lead
to a storm of re-connections if all ovn-controllers are updated
simultaneously.  They can also start re-downloading the database
content, creating even more load on the database servers.

Correct this by saving an existing connection if it is still in the
list of remotes after the update.

'reconnect' module will report connection state updates, but that
is OK since no real re-connection happened and we only updated the
state of a new 'reconnect' instance.

If required, re-connection can be forced after the update of remotes
with ovsdb_cs_force_reconnect().
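
The decision described above can be reduced to a few lines. The following Python sketch is a hypothetical model, not the ovsdb-cs implementation; the function name, its arguments, and the choice of the first new remote as the fallback are assumptions.

```python
def update_remotes(current_remote, connected, new_remotes):
    """Return (remote_to_use, must_reconnect) after the remote list changes."""
    if connected and current_remote in new_remotes:
        # The live connection is still valid: keep it, no reconnection.
        return current_remote, False
    # Otherwise a new connection is needed (first remote, as an assumption).
    return (new_remotes[0] if new_remotes else None), True
```

Adding a server to the list therefore leaves existing clients connected, while removing the server a client is using makes that client reconnect.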

Acked-by: Dumitru Ceara <dceara@redhat.com>
Acked-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2021-07-15 21:04:49 +02:00
Cian Ferriter
d76a719a7a dpif-netdev: Add a partial HWOL PMD statistic.
It is possible for packets traversing the userspace datapath to match a
flow before hitting the EMC by using a mark ID provided by a NIC. Add a
PMD statistic for this hit.

Signed-off-by: Cian Ferriter <cian.ferriter@intel.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2021-07-09 17:13:55 +01:00
Paolo Valerio
61e48c2d1d conntrack: Handle SNAT with all-zero IP address.
This patch introduces for the userspace datapath the handling
of rules like the following:

  ct(commit,nat(src=0.0.0.0),...)

The kernel datapath already handles this case, which is particularly
handy in scenarios like the following:

Given A: 10.1.1.1, B: 192.168.2.100, C: 10.1.1.2

A opens a connection toward B on port 80 selecting as source port 10000.
B's IP gets dnat'ed to C's IP (10.1.1.1:10000 -> 192.168.2.100:80).

This will result in:

  tcp,orig=(src=10.1.1.1,dst=192.168.2.100,sport=10000,dport=80),
     reply=(src=10.1.1.2,dst=10.1.1.1,sport=80,dport=10000),
     protoinfo=(state=ESTABLISHED)

A now tries to establish another connection with C using source port
10000, this time using C's IP address (10.1.1.1:10000 -> 10.1.1.2:80).

This second connection, if processed by conntrack with no SNAT/DNAT
involved, collides with the reverse tuple of the first connection,
so the entry for this valid connection doesn't get created.

With this commit, adding a SNAT rule with 0.0.0.0 for
10.1.1.1:10000 -> 10.1.1.2:80 allows the conn entry to be created:

  tcp,orig=(src=10.1.1.1,dst=10.1.1.2,sport=10000,dport=80),
     reply=(src=10.1.1.2,dst=10.1.1.1,sport=80,dport=10001),
     protoinfo=(state=ESTABLISHED)
  tcp,orig=(src=10.1.1.1,dst=192.168.2.100,sport=10000,dport=80),
     reply=(src=10.1.1.2,dst=10.1.1.1,sport=80,dport=10000),
     protoinfo=(state=ESTABLISHED)

The issue exists even in the opposite case (with A trying to connect
to C using B's IP after establishing a direct connection from A to C).

This commit refactors the relevant function in a way that both of the
previously mentioned cases are handled as well.

Suggested-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Paolo Valerio <pvalerio@redhat.com>
Acked-by: Gaetan Rivet <grive@u256.net>
Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2021-07-08 23:49:34 +02:00
Paolo Valerio
1e19f9aa26 conntrack: Handle already natted packets.
When a packet gets dnatted and then recirculated, it is possible
that it matches another rule that performs another nat action.
The kernel datapath handles this situation by turning the second nat
action into a no-op, so the packet is natted only once.  In the
userspace datapath instead, when the ct action gets executed, an initial
lookup of the translated packet fails to retrieve the connection related
to the packet, leading to the creation of a new entry in ct for the src
nat action with a subsequent failure of the connection establishment.

With the following flows:

table=0,priority=30,in_port=1,ip,nw_dst=192.168.2.100,actions=ct(commit,nat(dst=10.1.1.2:80),table=1)
table=0,priority=20,in_port=2,ip,actions=ct(nat,table=1)
table=0,priority=10,ip,actions=resubmit(,2)
table=0,priority=10,arp,actions=NORMAL
table=0,priority=0,actions=drop
table=1,priority=5,ip,actions=ct(commit,nat(src=10.1.1.240),table=2)
table=2,in_port=ovs-l0,actions=2
table=2,in_port=ovs-r0,actions=1

Establishing a connection from 10.1.1.1 to 192.168.2.100 the outcome is:

  tcp,orig=(src=10.1.1.1,dst=10.1.1.2,sport=4000,dport=80),
     reply=(src=10.1.1.2,dst=10.1.1.240,sport=80,dport=4000),
     protoinfo=(state=ESTABLISHED)
  tcp,orig=(src=10.1.1.1,dst=192.168.2.100,sport=4000,dport=80),
     reply=(src=10.1.1.2,dst=10.1.1.1,sport=80,dport=4000),
     protoinfo=(state=ESTABLISHED)

With this patch applied the outcome is:

  tcp,orig=(src=10.1.1.1,dst=192.168.2.100,sport=4000,dport=80),
     reply=(src=10.1.1.2,dst=10.1.1.1,sport=80,dport=4000),
     protoinfo=(state=ESTABLISHED)

The patch performs, for already natted packets, a lookup of the
reverse key in order to retrieve the related entry.  It also adds a
test case that, besides testing the scenario, ensures that the other ct
actions are executed.

Reported-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Paolo Valerio <pvalerio@redhat.com>
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2021-07-08 23:49:34 +02:00
Eelco Chaudron
e6ad4d8d9c conntrack: Document all-zero IP SNAT behavior and add a test case.
Currently, conntrack in the kernel has an undocumented feature referred
to as all-zero IP address SNAT. Basically, when a source port
collision is detected during the commit, the source port will be
translated to an ephemeral port. If there is no collision, no SNAT is
performed.
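
The behavior can be sketched in a few lines of Python. This is an illustrative model only, not the kernel or userspace conntrack code: the function name, the representation of occupied tuples as a set, and the candidate port range passed by the caller are assumptions.

```python
def all_zero_snat(src, sport, dst, dport, in_use, candidate_ports):
    """Pick the source port for a connection under an all-zero SNAT rule.

    in_use holds the (src, sport, dst, dport) tuples that existing
    connections already occupy.  Returns the port to use, or None if
    every candidate collides.
    """
    if (src, sport, dst, dport) not in in_use:
        return sport  # no collision: no SNAT is performed
    for port in candidate_ports:
        if (src, port, dst, dport) not in in_use:
            return port  # collision: translate only the source port
    return None
```

If 10.1.1.1:10000 -> 10.1.1.2:80 is already occupied, a second connection with the same tuple keeps its address and is moved to the first free candidate port.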

This patchset documents this behavior and adds a self-test to verify
it's not changing. In addition, a datapath feature flag is added for
the all-zero IP SNAT case. This will help applications on top of OVS,
like OVN, to determine whether this feature can be used.

Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Aaron Conole <aconole@redhat.com>
Acked-by: Dumitru Ceara <dceara@redhat.com>
Acked-by: Alin-Gabriel Serdean <aserdean@ovn.org>
Acked-by: Paolo Valerio <pvalerio@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2021-07-08 21:19:14 +02:00
Eelco Chaudron
355fef6f2c ofproto-dpif-xlate: Avoid successive ct_clear datapath actions.
Due to flow lookup optimizations, especially in the resubmit/clone cases,
we might end up with multiple ct_clear actions, which are not necessary.

This patch only adds the ct_clear action to the datapath if any ct state
is tracked.

Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Timothy Redaelli <tredaelli@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2021-07-08 21:19:06 +02:00
Ilya Maximets
b7809111a6 odp-util: Stop key parsing if already oversized.
We don't need to continue parsing if already oversized.  This is not
very important, but the fuzzer times out while parsing a very long flow.

The check could be written as a single 'if' statement, but I found
my variant much more readable.

Reported-at: https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=35519
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Ben Pfaff <blp@ovn.org>
2021-07-07 23:39:28 +02:00
David Wilder
3da3cc1a0c ovs-numa: Support non-contiguous numa nodes and offline CPU cores.
This change removes the assumption that numa nodes and cores are numbered
contiguously in Linux.  This change is required to support some Power
systems.

A check has been added to verify that cores are online, since
offline cores result in non-contiguously numbered cores.

DPDK EAL option generation is updated to work with non-contiguous numa nodes.
These options can be seen in the ovs-vswitchd.log.  For example:
a system containing only numa nodes 0 and 8 will generate the following:

EAL ARGS: ovs-vswitchd --socket-mem 1024,0,0,0,0,0,0,0,1024 \
                       --socket-limit 1024,0,0,0,0,0,0,0,1024 -l 0
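
The shape of the generated list can be reproduced with a short Python sketch. It only mirrors the example above (one entry per node id up to the highest present node, zero for absent nodes); the function name and the 1024 MB default are assumptions, and this is not the OVS code that builds the EAL arguments.

```python
def socket_mem_arg(numa_nodes, mem_mb=1024):
    """Build a comma-separated per-node memory list for sparse NUMA node ids."""
    present = set(numa_nodes)
    if not present:
        return ""
    return ",".join(str(mem_mb) if node in present else "0"
                    for node in range(max(present) + 1))
```

For nodes 0 and 8 this yields "1024,0,0,0,0,0,0,0,1024", matching the --socket-mem value shown in the log line above.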

Tests for pmd and dpif-netdev have been updated to validate non-contiguous
numbered nodes.

Signed-off-by: David Wilder <dwilder@us.ibm.com>
Acked-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2021-07-07 23:35:57 +02:00
Ilya Maximets
b57b062f5d ofp-actions: Report an error if there are too many actions to parse.
Not a very important fix, but the fuzzer times out trying to test
parsing of a huge number of actions.  Fix that by reporting an error as
soon as ofpacts is oversized.

It would be great to use ofpbuf_oversized() function instead of manual
size checking, but ofpacts->header here always points to the last
pushed action, so the value that ofpbuf_oversized() would check is
always small.

Adding a unit test for this, plus the extra test for too deep nesting.

Reported-at: https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=20254
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Alin-Gabriel Serdean <aserdean@ovn.org>
2021-07-07 22:48:05 +02:00
Tianyu Yuan
f686957c96 add test cases for ingress_policing_kpkts parameters
Exercise OVS setting of ingress_policing_kpkts parameters using ovs-vsctl
and verify that the correct values are stored on OVSDB.

Verify the ingress_policing parameters with tc command. Also check offload
and non-offload in tc software datapath based on tc filter type (matchall
and basic).  Skip test of pps if OVS or kernel does not support pps rate
limit.

Example invocation:
make check TESTSUITEFLAGS='-k ingress_policing_kpkts'
make check-offloads TESTSUITEFLAGS='-k ingress_policing_kpkts'

Signed-off-by: Tianyu Yuan <tianyu.yuan@corigine.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
2021-07-01 20:44:25 +02:00
Aaron Conole
b6c5f30cfa checkpatch: Ignore macro definitions of FOR_EACH.
When defining a FOR_EACH macro, checkpatch freaks out and generates a
control block whitespace error.  Create an exception so that it doesn't
generate errors for this case.

Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2020-August/373509.html
Reported-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
Signed-off-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2021-07-01 16:31:55 +02:00
Kevin Traynor
f0e4a7338c tests: Add PMD auto load balance unit tests.
These tests focus on enabling/disabling and user parameters.

Co-Authored-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: Sunil Pai G <sunil.pai.g@intel.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2021-06-24 22:11:10 +02:00
Kevin Traynor
833f1b843d pmd.at: Get next line number of log.
Some tests get the current log line number so they can check that
there is a new occurrence of a log entry after a command.

'tail' uses the line number as the starting line number. However,
this will include the last line of the log before the command.

To prevent any races on logs and possibly checking an existing log
entry prior to a command here or in reuse of this method, get the
next line number of the log and use that as the starting line for tail.
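
The off-by-one can be shown with a small Python model of "count the lines, then read from a starting line", which is what a 'tail -n +N' invocation does. The helper names are hypothetical and this is not the pmd.at macro itself.

```python
def next_log_line(log_text):
    """Line number that the next appended log line will get (1-based)."""
    return len(log_text.splitlines()) + 1


def lines_from(log_text, start):
    """Lines from 'start' onwards, like 'tail -n +start'."""
    return log_text.splitlines()[start - 1:]
```

Starting at the current line count would return the last pre-existing line as well; starting at the next line number returns only what was logged after the command.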

Suggested-by: Ilya Maximets <i.maximets@ovn.org>
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2021-06-24 22:10:40 +02:00
Rosemarie O'Riorden
bd90524550 Remove Python 2 leftovers.
Fixes: 1ca0323e7c29 ("Require Python 3 and remove support for Python 2.")
Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=1949875
Signed-off-by: Rosemarie O'Riorden <roriorde@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2021-06-22 21:29:57 +02:00
Martin Varghese
e81ed94214 Fix redundant datapath set ethernet action with NSH Decap.
When a decap action is applied to an NSH header encapsulating an
Ethernet packet, a redundant set MAC address action is programmed
to the datapath.

Fixes: f839892a206a ("OF support and translation of generic encap and decap")
Signed-off-by: Martin Varghese <martin.varghese@nokia.com>
Acked-by: Jan Scheurich <jan.scheurich@ericsson.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2021-06-16 12:44:54 +02:00
Martin Varghese
c2999459d2 tests: Fixed L3 over patch port tests.
The NORMAL action is replaced with output to the GRE port for sending
L3 packets over the GRE tunnel, because the NORMAL action cannot be
used with L3 packets.

Fixes: d03d0cf2b71b ("tests: Extend PTAP unit tests with decap action")
Signed-off-by: Martin Varghese <martin.varghese@nokia.com>
Acked-by: Jan Scheurich <jan.scheurich@ericsson.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2021-06-16 11:06:29 +02:00
Toms Atteka
cca40141a8 netlink: removed incorrect optimization
This optimization caused the FLOW_TNL_F_UDPIF flag not to be used in
the hash calculation for Geneve tunnels when revalidating flows, which
resulted in different cache hash values and incorrect behaviour.

Added test to prevent regression.

CC: Jesse Gross <jesse@nicira.com>
Fixes: 6728d578f64e ("dpif-netdev: Translate Geneve options per-flow, not per-packet.")
Reported-at: https://github.com/vmware-tanzu/antrea/issues/897
Signed-off-by: Toms Atteka <cpp.code.lv@gmail.com>
Acked-by: Ansis Atteka <aatteka@ovn.org>
2021-06-15 13:34:34 -05:00
Ilya Maximets
2afe31169a odp-util: Return an error on actions overflow while parsing from string.
We don't need to continue parsing if already oversized.  This is not
very important, but the fuzzer times out while parsing a very long list
of actions.

Reported-at: https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=29190
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Ben Pfaff <blp@ovn.org>
2021-06-14 21:19:00 +02:00
Ben Pfaff
5fe3ef1a0c tests: Fix spelling error in test name.
Signed-off-by: Ben Pfaff <blp@ovn.org>
Acked-by: Ilya Maximets <i.maximets@ovn.org>
2021-06-14 11:21:23 -07:00
Ilya Maximets
c5a58ec155 python: idl: Allow retry even when using a single remote.
As described in commit [1], it's possible that the remote IP is backed by
a load-balancer and re-connection to this same IP will lead to a
connection to a different server.  This case is supported for the C version
of the IDL and should be supported in the same way for the Python
implementation.

[1] ca367fa5f8bb ("ovsdb-idl.c: Allows retry even when using a single remote.")

Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Dumitru Ceara <dceara@redhat.com>
2021-06-11 01:11:57 +02:00
Tao YunXiang
91cb55bc8a system-traffic.at: Add missing comma.
Add missing comma.

Signed-off-by: Tao YunXiang <taoyunxiang@cmss.chinamobile.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
Cc: Joe Stringer <joe@ovn.org>
2021-06-10 16:10:30 -07:00
Ilya Maximets
4275b5b7fb ovsdb-client: Integrate record/replay functionality.
This is primarily to be able to test recording of client connections.
Unit test added accordingly.

Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Dumitru Ceara <dceara@redhat.com>
2021-06-07 21:03:16 +02:00
Ilya Maximets
0be15ad76f ovsdb-server.at: Add unit test for record/replay.
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Dumitru Ceara <dceara@redhat.com>
2021-06-07 21:03:16 +02:00
Ben Pfaff
3012710ec2 tests: Fix PKIDIR checks in AT_SKIP.
In Autotest, [xyz] just expands to xyz.  To get [xyz] in output, we
need [[xyz]] in input.

I spotted this based on "expr" reporting an error in testsuite output.

Signed-off-by: Ben Pfaff <blp@ovn.org>
Acked-by: Han Zhou <hzhou@ovn.org>
2021-06-02 09:33:21 -07:00
Ben Pfaff
5da031d6df tests: Drop support for glibc before version 2.11.
The "ldd" call here didn't work if libtool was involved and would print
an error message.  We could fix that, but the check is only needed for
glibc earlier than 2.11.  glibc 2.11 was released in 2009, so it should
be safe to expect that testers are running it or a newer version.

This is a crossport of a patch originally applied to OVN as
commit 2870efff89337298.

Signed-off-by: Ben Pfaff <blp@ovn.org>
Acked-by: Numan Siddique <numans@ovn.org>
2021-06-01 11:55:38 -07:00
Adrian Moreno
0b3ff31d35 ofp_actions: Fix set_mpls_tc formatting.
Apart from a cut-and-paste typo, the man page claims that mpls_labels
can be provided in hexadecimal format but that's currently not the case.

Fix mpls ofp-action formatting, add size checks on ofp-action parsing
and add some unit tests.

Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2021-05-19 12:42:16 +02:00
Ilya Maximets
7731d26144 dpif-netdev: Remove meter rate from the bucket size calculation.
The implementation of meters is supposed to be a classic token bucket
with 2 typical parameters: rate and burst size.

Burst size in this schema is the maximum number of bytes/packets that
could pass without being rate limited.

Recent changes to the userspace datapath brought the meter implementation
in line with the kernel one, and this uncovered several issues.

The main problem is that the maximum bucket size, for an unknown reason,
accounts not only for the burst size, but also for the numerical value of
the rate.  This creates a lot of confusion around the behavior of meters.

For example, if rate is configured as 1000 pps and burst size set to 1,
this should mean that meter will tolerate bursts of 1 packet at most,
i.e. not a single packet above the rate should pass the meter.
However, current implementation calculates maximum bucket size as
(rate + burst size), so the effective bucket size will be 1001.  This
means that first 1000 packets will not be rate limited and average
rate might be twice as high as the configured rate.  This also makes
it practically impossible to configure meter that will have burst size
lower than the rate, which might be a desirable configuration if the
rate is high.
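
The arithmetic above can be checked with a simplified token bucket model
in Python.  This is only a sketch under stated assumptions: the function
packets_passed, the 1 ms refill step, the bucket starting full and the
offered load of 10000 pps are all hypothetical choices for illustration
and are not the dpif-netdev code:

```python
def packets_passed(rate_pps, bucket_max_pkts, offered_pps, duration_ms=1000):
    """Count packets passed by a simple token bucket that starts full.

    Tokens are kept in thousandths of a packet, so a 1 ms refill at
    'rate_pps' packets per second is exactly 'rate_pps' tokens.
    """
    bucket_max = bucket_max_pkts * 1000
    tokens = bucket_max
    per_ms = offered_pps // 1000            # Packets arriving every ms.
    passed = 0
    for _ in range(duration_ms):
        tokens = min(tokens + rate_pps, bucket_max)   # Refill, capped.
        for _ in range(per_ms):
            if tokens >= 1000:              # One packet costs 1000 tokens.
                tokens -= 1000
                passed += 1
    return passed


rate, burst = 1000, 1
old = packets_passed(rate, rate + burst, offered_pps=10000)  # Bucket: 1001.
new = packets_passed(rate, burst, offered_pps=10000)         # Bucket: 1.
```

With these inputs the model passes 2000 packets in the first second when
the bucket size is (rate + burst size), and 1000 packets when the bucket
size is the burst size alone.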

Inability to configure low values of a burst size and overall inability
for a user to predict what will be a maximum and average rate from the
configured parameters of a meter without looking at the OVS and kernel
code might be also classified as a security issue, because drop meters
are frequently used as a way of protection from DoS attacks.

This change removes the rate from the calculation of the bucket size,
making it in line with the classic token bucket algorithm and making the
rate and burst tolerance predictable from a user's perspective.

Same change will be proposed for the kernel implementation.
Unit tests changed back to their correct version and enhanced.

Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Reviewed-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
2021-05-18 22:11:14 +02:00
Mark Gray
5dce24d04d ipsec: Fix race in system tests.
This patch fixes an issue where, depending on timing fluctuations,
each node has not fully loaded all connections before the other
node begins to establish a connection. In this failure case, the
"ovs-monitor-ipsec" instance on the "left" node may `ipsec auto --start`
a connection which then gets rejected by the "right" side. Almost
simultaneously, the "right" side may initiate a connection that gets
rejected by the "left" side. This can happen as, for all tunnels except
for GRE, each node has two connections (an "in" connection and an "out"
connection) that get added one after the other. If the "in" connection
"starts" on both sides, the "out" connection from the other node
may not be available causing the connection to fail. At this point,
"Libreswan" will wait to retry the connection. In the interim, the
OVS system test times out. This race manifests itself more frequently
in a virtualized environment.

This patch resolves this issue by waiting for the "left" node to load
all connections before starting the "right" side. This will cause
the "left" side to fail to establish a connection with the "right"
side (as the "right" side connections have not been loaded) but will
cause the "right" side to succeed to establish a connection as all
connections will have been loaded on the "left" side.

Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2021-April/381857.html
Fixes: 8fc62df8b135 ("ipsec: Introduce IPsec system tests for Libreswan.")
Signed-off-by: Mark Gray <mark.d.gray@redhat.com>
Tested-by: Flavio Leitner <fbl@sysclose.org>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2021-04-20 00:00:22 +02:00
Tianyu Yuan
44ea24427e Add test cases for ingress_policing parameters
tests/ovs-vsctl.at: Add ingress_policing test in ovs-vsctl unit test
tests/system-offloads-traffic.at: Check ingress_policing with offloads enabled and disabled

Exercise OVS setting of ingress_policing parameters using ovs-vsctl and verify that the correct values are stored in OVSDB.
Verify the ingress_policing parameters with the tc command. Also check offload and non-offload behavior in the tc software datapath based on the tc filter type (matchall and basic).

Example invocation:
make check TESTSUITEFLAGS='-k ingress_policing'
make check-offloads TESTSUITEFLAGS='-k ingress_policing'

Signed-off-by: Tianyu Yuan <tianyu.yuan@corigine.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: Louis Peens <louis.peens@netronome.com>
Reviewed-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
2021-04-13 15:36:44 +02:00
Mark Gray
8fc62df8b1 ipsec: Introduce IPsec system tests for Libreswan.
This patch adds system tests for OVS IPsec using Libreswan.
If Libreswan is not present on the system, the tests will
be skipped.

These tests set up an underlay switch with bridge 'br0'
to carry encrypted traffic between two emulated "nodes".
Each "node" is a separate network namespace ('left' and
'right') and runs an instance of the Libreswan "pluto"
daemon, ovs-monitor-ipsec, ovs-vswitchd and ovsdb-server.

Each test sets up IPsec between the two emulated "nodes"
using various configurations (currently tunnel
type, IPv4/IPv6, authentication method, local_ip). After
configuration, connectivity between the two nodes is
tested and the underlay traffic is also inspected to
ensure the traffic is encrypted.

All IPsec system tests can be run by using the ipsec
keyword:

sudo make check-kernel TESTSUITEFLAGS='-k ipsec'

Signed-off-by: Mark Gray <mark.d.gray@redhat.com>
Acked-by: Aaron Conole <aconole@redhat.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2021-04-01 19:13:31 +02:00
Mark Gray
4ce8bb159e system-common-macros: clean up veth device on test failure.
'on_exit' should be run directly after creation
of veth device.

Fixes: 119db2cb18a7 ("kmod-macros: Move some code to traffic-common-macros.")
Signed-off-by: Mark Gray <mark.d.gray@redhat.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2021-04-01 19:13:31 +02:00
Dumitru Ceara
ac85cdb38c ovsdb-idl: Mark arc sources as updated when destination is deleted.
Considering two DB rows, 'a' from table A and 'b' from table B (with
column 'ref_a' a reference to table A):
a = {A._uuid=<U1>}
b = {B._uuid=<U2>, B.ref_a=<U1>}

When the IDL client processes an update that deletes row 'a', row 'b'
is also marked as 'updated' if change tracking is enabled for table B.

Fixes: 102781cc02c6 ("ovsdb-idl: Track changes for table references.")
Signed-off-by: Dumitru Ceara <dceara@redhat.com>
Acked-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2021-04-01 14:15:49 +02:00
Dumitru Ceara
95689f1668 ovsdb-idl: Preserve references for deleted rows.
Considering two DB rows, 'a' from table A and 'b' from table B (with
column 'ref_a' a reference to table A):
a = {A._uuid=<U1>}
b = {B._uuid=<U2>, B.ref_a=<U1>}

Assuming both records are present in the IDL client's in-memory view of
the database, depending on whether row 'b' is also deleted in the same
transaction or not, deletion of row 'a' should generate the following
tracked changes:

1. only row 'a' is deleted:
- for table A:
  - deleted records: a = {A._uuid=<U1>}
- for table B:
  - updated records: b = {B._uuid=<U2>, B.ref_a=[]}

2. row 'a' and row 'b' are deleted in the same update:
- for table A:
  - deleted records: a = {A._uuid=<U1>}
- for table B:
  - deleted records: b = {B._uuid=<U2>, B.ref_a=<U1>}

To ensure this, we now delay reparsing row backrefs for deleted rows
until all updates in the current run have been processed.

Without this change, in scenario 2 above, the tracked changes for table
B would be:
- deleted records: b = {B._uuid=<U2>, B.ref_a=[]}
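
The effect of delaying the reparse can be sketched with a toy model in
Python.  The function apply_deletes and its data layout (a dict from row
name to referenced row) are hypothetical and much simpler than the real
ovsdb-idl structures; None stands for an empty reference:

```python
def apply_deletes(refs, deleted, delay_reparse):
    """Apply one update's deletions and return the tracked changes.

    refs:    dict mapping each live row to the row it references, or None.
    deleted: rows deleted by the update, in processing order.
    """
    refs = dict(refs)
    tracked = {}
    pending = []

    def reparse_backrefs(dst):
        # Clear the reference in every live row that still points at 'dst'.
        for src in refs:
            if refs[src] == dst:
                refs[src] = None
                tracked[src] = ("updated", None)

    for row in deleted:
        # Record the deleted row with whatever reference it holds right now.
        tracked[row] = ("deleted", refs.pop(row))
        if delay_reparse:
            pending.append(row)
        else:
            reparse_backrefs(row)

    # With the delay, backrefs are reparsed only after all deletions.
    for row in pending:
        reparse_backrefs(row)
    return tracked


view = {"a": None, "b": "a"}
only_a = apply_deletes(view, ["a"], delay_reparse=True)
both_delayed = apply_deletes(view, ["a", "b"], delay_reparse=True)
both_eager = apply_deletes(view, ["a", "b"], delay_reparse=False)
```

In this model 'both_delayed' tracks row 'b' as deleted with its reference
to 'a' intact, while 'both_eager' tracks it as deleted with an empty
reference, which corresponds to the problematic output shown above.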

In particular, for strong references, row 'a' can never be deleted in
a transaction that happens strictly before row 'b' is deleted.  In some
cases [0] both rows are deleted in the same transaction and having
B.ref_a=[] would violate the integrity of the database from the client's
perspective.  This would force the client to always validate that
strong reference fields are non-NULL.  This is not really an option
because the information in the original reference is required for
incrementally processing the record deletion.

[0] with ovn-monitor-all=true, the following command triggers a crash
    in ovn-controller because a strong reference field becomes NULL:
    $ ovn-nbctl --wait=hv -- lr-add r -- lrp-add r rp 00:00:00:00:00:01 1.0.0.1/24
    $ ovn-nbctl lr-del r

Reported-at: https://bugzilla.redhat.com/1932642
Fixes: 72aeb243a52a ("ovsdb-idl: Tracking - preserve data for deleted rows.")
Signed-off-by: Dumitru Ceara <dceara@redhat.com>
Acked-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2021-04-01 14:15:49 +02:00
Dumitru Ceara
4c0d093b17 ovsdb-idl.at: Make test outputs more predictable.
IDL tests need predictable output from test-ovsdb.

This used to be done by first sorting the output of test-ovsdb and then
applying uuidfilt to predictably translate UUIDs.  This was not
reliable enough in case test-ovsdb processes two or more insert/delete
operations in the same iteration because the order of lines in the
output depends on the automatically generated UUID values.

To fix this we change the way test-ovsdb and test-ovsdb.py generate
outputs and prepend the table name and tracking information before
printing the contents of a row.

All existing ovsdb-idl.at and ovsdb-cluster.at tests are updated to
expect the new output format.

Signed-off-by: Dumitru Ceara <dceara@redhat.com>
Acked-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2021-04-01 13:53:20 +02:00