mirror of https://github.com/openvswitch/ovs synced 2025-08-31 14:25:26 +00:00
19471 Commits

Author SHA1 Message Date
Ilya Maximets
49dec92421 AUTHORS: Add Jon Kohler.
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-09-09 20:56:52 +02:00
Jon Kohler
c0e053f6d1 netdev-linux: Skip some internal kernel stats gathering.
For netdev_linux_update_via_netlink(), hint to the kernel that
we do not need it to gather netlink internal stats when we want
to update the netlink flags, as those stats are not rendered
within OVS.

Background:
ovs-vswitchd can spend quite a bit of time blocked by the kernel
during netlink calls, especially on systems with many cores. This
time is dominated by the kernel-side internal stats gathering
mechanism in netlink, specifically:
  inet6_fill_link_af
    inet6_fill_ifla6_attrs
      __snmp6_fill_stats64

In Linux 4.4+, there exists a hint for netlink requests to not
trigger the ipv6 stats gathering mechanism, which greatly reduces
the amount of time that ovs-vswitchd is on CPU.
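
As an illustration of the hint described above: the commit message does not name the flag, but it appears to correspond to the `RTEXT_FILTER_SKIP_STATS` bit carried in the `IFLA_EXT_MASK` attribute of an `RTM_GETLINK` request. The constant values below are assumptions taken from the Linux uapi headers, and the helper name is hypothetical. A minimal Python sketch of packing that one attribute:

```python
import struct

# Assumed values from linux/if_link.h and linux/rtnetlink.h.
IFLA_EXT_MASK = 29
RTEXT_FILTER_SKIP_STATS = 1 << 3


def pack_ext_mask_attr(mask):
    """Pack a single netlink attribute: 4-byte header plus a u32 payload."""
    payload = struct.pack("=I", mask)
    nla_len = 4 + len(payload)  # header is u16 length + u16 type
    return struct.pack("=HH", nla_len, IFLA_EXT_MASK) + payload


attr = pack_ext_mask_attr(RTEXT_FILTER_SKIP_STATS)
```

The resulting bytes would be appended to the link request after the `ifinfomsg` header; building the full request is left out of this sketch.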

Testing and Results:
Tested booting 320 VM's and measuring OVS utilization with perf
record, then visualized into a flamegraph using a patched version
of ovs 2.14.2. Calls under bridge_run() seem to get hit the worst
by this issue.

Before bridge_run() == 11.3% of samples
After bridge_run() == 3.4% of samples

Note that there are at least two observed netlink calls under
bridge_run that are still kernel stats heavy after this patch:

Call 1:
  bridge_run -> netdev_run -> route_table_run -> route_table_reset ->
    ovs_router_insert -> ovs_router_insert__ -> get_src_addr ->
      netdev_get_addr_list -> netdev_linux_get_addr_list -> getifaddrs

Since the actual netlink call is coming from getifaddrs() in glibc,
fixing this would likely involve either duplicating glibc code in the
ovs source or patching glibc.

Call 2:
  bridge_run -> iface_refresh_stats -> netdev_get_stats ->
    netdev_linux_get_stats -> get_stats_via_netlink

This does use netlink based stats; however, it isn't immediately
clear if just dropping the stats from inet6_fill_link_af would
impact anything or not. Given this call is more intermittent, it is
of lesser concern.

Acked-by: Greg Smith <gasmith@nutanix.com>
Signed-off-by: Jon Kohler <jon@nutanix.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-09-09 20:53:54 +02:00
Eelco Chaudron
b2151d8fd1 tests: Use _DAEMONIZE macros to start tcpdump.
Using [NETNS|OVS]_DAEMONIZE will start tcpdump in the background,
and it will also make sure it gets killed in corner cases.

For the check_pkt_len tests, we also kill tcpdump between individual
tests in the same test case to avoid confusion when analyzing results.
This also required some changes to the awk expressions, as an extra
newline is added to the output when tcpdump gets stopped.

Fixes: 02dabb21f2 ("tests: Add check_pkt_len action test to system-offload-traffic.")
Suggested-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-09-09 19:08:12 +02:00
Ilya Maximets
49efc63ad8 ofproto-dpif-xlate: Fix error messages for nonexistent ports/recirc_ids.
If tnl_port_should_receive() is false, we're looking for a normal
port, not a tunnel one.  And it's better to print recirculation IDs
in hex since they are typically printed this way in flow dumps.

Fixes: d40533fc82 ("odp-util: Improve log messages and error reporting for Netlink parsing.")
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-09-09 19:03:09 +02:00
Eelco Chaudron
4f5decf4ab ofproto-dpif-xlate: Optimize datapath action set by removing last clone action.
When OFPROTO non-reversible actions are translated to data plane
actions, the only thing looked at is if there are more actions
pending. If this is the case, the action is encapsulated in a
clone().

This could lead to unnecessary clones if no meaningful data
plane actions are added. For example, the register pop in the
included test case.

The best solution would probably be to build the full action
path and determine if the clone is needed. However, this would
be a huge change in the existing design, so for now, we just try
to optimize the generated datapath flow. We can revisit this
later, as some of the pending CT issues might need this rework.

Fixes: feee58b958 ("ofproto-dpif-xlate: Keep track of the last action")
Fixes: dadd8357f2 ("ofproto-dpif: Fix issue with non-reversible actions on a patch ports.")
Acked-by: Ales Musil <amusil@redhat.com>
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-09-09 16:54:58 +02:00
wenxu
d046453b56 ofproto-dpif-xlate: Clear tunnel wc bits if original packet is non-tunnel.
When a packet goes through OpenFlow encap actions (set_field
tun_id/src/dst), the tunnel wc bits are set.  But they should be
cleared if the original packet is non-tunnel.  The datapath does not
need to match on the tunnel info in that case (similar to the
existing logic for vlan).

Signed-off-by: wenxu <wenxu@chinatelecom.cn>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-09-09 16:22:33 +02:00
Ilya Maximets
5046f2e35f sset, smap, hmapx: Reserve hash map space while cloning.
This makes the clone a little bit faster by avoiding multiple
incremental expansions with re-allocations on big sets.

Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-08-30 23:51:05 +02:00
Ilya Maximets
a32a4e1fa2 raft: Fix unnecessary periodic compactions.
While creating a new database file and storing a new snapshot
into it, raft module by mistake updates the base offset for the
old file.  So, the base offset of a new file remains zero.  Then
the old file is getting replaced with the new one, copying new
offsets as well.  In the end, after a full compaction, base offset
is always zero.  And any offset is twice as large as zero.  That
triggers a new compaction again at the earliest scheduled time.
In practice this issue triggers compaction every 10-20 minutes
regardless of the database load, after the first one is triggered
by the actual file growth or by the 24h maximum limit.
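
To make the "twice as large as zero" reasoning concrete, here is a simplified model of the growth check. It assumes compaction is considered once the current file offset is at least double the base offset recorded at the last snapshot; the real check may include further conditions, and the function name is hypothetical.

```python
def grew_lots(offset, base_offset):
    """Simplified growth check: has the file at least doubled in size
    since the last snapshot was written?"""
    return offset >= 2 * base_offset


# Base offset correctly recorded after a 1000-byte snapshot.
ok_case = grew_lots(1200, 1000)

# Buggy case: the base offset of the new file was left at zero,
# so any non-negative offset satisfies the check.
buggy_case = grew_lots(1200, 0)
```

With a zero base offset the check passes for every offset, which is consistent with the repeated compactions described above.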

Fixes: 1b1d2e6daa ("ovsdb: Introduce experimental support for clustered databases.")
Reported-at: https://mail.openvswitch.org/pipermail/ovs-discuss/2022-August/051977.html
Reported-by: Oleksandr Mykhalskyi <oleksandr.mykhalskyi@netcracker.com>
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-08-30 22:43:46 +02:00
Ilya Maximets
1336eeb570 netdev-offload-tc: Parse tunnel options only for geneve ports.
Cited commit correctly fixed the issue of incorrect reporting
of zero-length geneve option matches as wildcarded.  But as a
side effect, exact match on the metadata length was added to
every tunnel flow match, even if the tunnel is not geneve.
That doesn't generate any functional issues, but it may be
confusing to see 'tunnel(...,geneve(),...)' while looking at
datapath flow dumps for, e.g., vxlan tunnel flows.

Fix that by checking the port type before parsing geneve options.
tunnel() attribute itself doesn't have enough information to
figure out the tunnel type.

Fixes: 7a6c8074c5 ("netdev-offload-tc: Fix the mask for tunnel metadata length.")
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-08-30 22:37:35 +02:00
Ilya Maximets
e5f79eaea5 ovsdb: Don't store rows that didn't change in transaction history.
Transaction history is used to construct database updates for clients.
But if the row didn't change it will never be used for monitor updates,
because ovsdb_monitor_changes_classify() will always return
OVSDB_CHANGES_NO_EFFECT.  So, ovsdb_monitor_history_change_cb()
will never add it to the update.

This condition is very common for rows with references.  While
processing strong references in ovsdb_txn_adjust_atom_refs() the
whole destination row will be cloned into transaction just to update
the reference counter.  If this row will not be changed later in
the transaction, it will just stay in that state and will be added
to the transaction history.  Since the data didn't change, both 'old'
and 'new' datums will be the same and equal to one in the database.
So, we're keeping 2 copies of the same row in memory and we are
never using them.  In this case, we should just not add them to the
transaction history in the first place.
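
The idea can be sketched in a few lines of Python. This is only a model of the filtering step, not the ovsdb-server code: rows are represented as plain dictionaries, `None` marks a missing old or new version, and the function name is hypothetical.

```python
def rows_for_history(txn_rows):
    """Keep only the rows whose data actually changed in the transaction.

    txn_rows is an iterable of (old, new) pairs.
    """
    return [(old, new) for old, new in txn_rows if old != new]


txn = [
    ({"name": "lsp1", "up": False}, {"name": "lsp1", "up": True}),  # modified
    ({"name": "ls1"}, {"name": "ls1"}),  # cloned only to update a refcount
    (None, {"name": "lsp2"}),            # inserted
]
history = rows_for_history(txn)
```

Only the modified and inserted rows end up in `history`; the row that was cloned solely for reference counting is dropped.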

This change should save some space in the transaction history in case
of transactions with rows with a large number of strong references.
This should also speed up the processing since we will not clone
these rows for transaction history and will not count their atoms.

Testing shows about 5-10% performance improvement in ovn-heater
test scenarios.

'n_atoms' counter for transaction adjusted to count only changed
rows, so we will have accurate value for a number of atoms in the
history.

Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-08-30 22:32:46 +02:00
Paolo Valerio
0c4b063190 netlink-conntrack: Do not fail to parse if optional TCP protocol attributes are not found.
Some of the CTA_PROTOINFO_TCP nested attributes are not always
included in the received message, but the parsing logic considers them
as required, failing in case they are not found.

This was observed while monitoring some connections by reading the
events sent by conntrack:

./ovstest test-netlink-conntrack monitor
[...]
2022-08-04T09:39:02Z|00007|netlink_conntrack|ERR|Could not parse nested TCP protoinfo
  options. Possibly incompatible Linux kernel version.
2022-08-04T09:39:02Z|00008|netlink_notifier|WARN|unexpected netlink message contents
[...]

All the TCP DELETE/DESTROY events fail to parse with the message
above.

Fix it by turning the relevant attributes to optional.
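
A rough model of policy-driven attribute parsing shows why marking attributes as optional matters. The attribute names are real kernel identifiers, but which of them are required or optional in this sketch is an assumption made for illustration, and the parser itself is hypothetical.

```python
def parse_nested(attrs, policy):
    """Parse 'attrs' (name -> value) against 'policy' (name -> optional?).

    Returns the parsed attributes, or None if a required one is missing.
    """
    parsed = {}
    for name, optional in policy.items():
        if name in attrs:
            parsed[name] = attrs[name]
        elif not optional:
            return None
    return parsed


# Assumed shape of a DESTROY event: only the TCP state is present.
event = {"CTA_PROTOINFO_TCP_STATE": 7}

strict = {
    "CTA_PROTOINFO_TCP_STATE": False,
    "CTA_PROTOINFO_TCP_WSCALE_ORIGINAL": False,
    "CTA_PROTOINFO_TCP_WSCALE_REPLY": False,
}
relaxed = {
    "CTA_PROTOINFO_TCP_STATE": False,
    "CTA_PROTOINFO_TCP_WSCALE_ORIGINAL": True,
    "CTA_PROTOINFO_TCP_WSCALE_REPLY": True,
}

strict_result = parse_nested(event, strict)
relaxed_result = parse_nested(event, relaxed)
```

The strict policy rejects the event outright, while the relaxed one returns the attributes that are present.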

Signed-off-by: Paolo Valerio <pvalerio@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-08-30 22:31:34 +02:00
Ilya Maximets
78dad3a0c8 python-c-ext: Use designated initializers for type and module.
Python documentation suggests to do so "to avoid listing all the
PyTypeObject fields that you don't care about and also to avoid
caring about the fields' declaration order".  And that does make
sense.

Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-08-30 21:41:32 +02:00
Ilya Maximets
ff55e8f385 netdev-offload-tc: Add missing handling of the tunnel source port.
netdev_tc_flow_put() "consumes" the tunnel.tp_src value, but
it's never passed down to TC, and not parsed back.  Fix that.

Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-08-19 19:40:00 +02:00
Ilya Maximets
262eded5fb netdev-offload-tc: Fix ignoring unknown tunnel keys.
Current offloading code supports only a limited number of tunnel keys
and silently ignores everything it doesn't understand.  This is
causing, for example, offloaded ERSPAN tunnels to not work, because
flow is offloaded, but ERSPAN options are not provided to TC.

There is a number of tunnel keys, which are supported by the userspace,
but silently ignored during offloading:

  OVS_TUNNEL_KEY_ATTR_DONT_FRAGMENT
  OVS_TUNNEL_KEY_ATTR_OAM
  OVS_TUNNEL_KEY_ATTR_VXLAN_OPTS
  OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS

OVS_TUNNEL_KEY_ATTR_CSUM is kind of supported, but only for actions
and for some reason is set from the tunnel port instead of the
provided action, and not currently supported for the tunnel key in
the match.

Adding a default case to fail offloading of unknown attributes.  For
now explicitly allowing incorrect behavior for the DONT_FRAGMENT flag,
otherwise we'll break all tunnel offloading by default.  VXLAN and
ERSPAN options have to fail offloading, because the tunnel will not
work otherwise.  OAM is not a default configuration, so failing it
as well. The missing DONT_FRAGMENT flag though should, probably,
cause frequent flow revalidation, but that is not new with this patch.

Same for the 'match' key, only clearing masks that were actually
consumed, except for the DONT_FRAGMENT and CSUM flags, which are
explicitly allowed and highlighted as broken.

Also, the destination port as well as the CSUM configuration for an
unknown reason were not taken from the actions list and were passed via
HW offload info instead of being consumed from the set() action.

Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2022-July/395522.html
Reported-by: Eelco Chaudron <echaudro@redhat.com>
Fixes: 8f283af892 ("netdev-tc-offloads: Implement netdev flow put using tc interface")
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-08-19 19:39:51 +02:00
Ilya Maximets
7a26634678 netdev-offload-tc: Use masks instead of keys while parsing tunnel attributes.
If the key is zero, it doesn't mean we don't need to match on it.
Masks should be checked instead.
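
A small Python sketch of the distinction, using a generic masked-match model rather than the actual TC code; the helper names are hypothetical.

```python
def needs_match(mask):
    """A field takes part in the match if any mask bit is set,
    regardless of the key value."""
    return mask != 0


def matches(packet_value, key, mask):
    """Standard masked comparison."""
    return (packet_value & mask) == (key & mask)


# Exact match on a tunnel TTL of zero: the key is 0, the mask is all-ones.
ttl_key, ttl_mask = 0, 0xFF
must_match_ttl = needs_match(ttl_mask)
```

Checking `ttl_key != 0` instead would skip this field, and a packet with TTL 64 would wrongly match the flow.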

Fixes: 49a7961fca ("lib/tc: Avoid matching on tunnel ttl or tos if not needed")
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-08-19 19:39:51 +02:00
Ilya Maximets
5d91bdf12e netdev-offload-tc: Explicitly handle mask for the tunnel destination port.
netdev_tc_flow_put() ignores the tunnel.tp_dst mask.  That results
in the exact match on the value.  TC supports the masked match on
this field and it does return the mask back during the flow dump
even if it wasn't provided initially.  OVS should correctly handle
that.  There is a problem though.  Some drivers (mlx5) don't
support offloading if the destination port is not an exact match [1].

Keeping the logic as-is for now, but making it explicit and somewhat
documented in the comment, so it is clear what is happening and we can
revisit this in the future.

[1] https://patchwork.ozlabs.org/project/openvswitch/patch/20220704224505.1117988-3-i.maximets@ovn.org/#2927396

Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-08-19 19:39:51 +02:00
Ilya Maximets
7a6c8074c5 netdev-offload-tc: Fix the mask for tunnel metadata length.
'wc.masks.tunnel.metadata.present.len' must be a mask for the same
field in the flow key, but in the tc_flower structure it's the real
length of metadata masks.

That is correctly handled for the individual opt->length, setting all
the masks to 0x1f like it's done in the tun_metadata_to_geneve_mask__(),
but not handled for the main 'len' field.

Fix that by setting the mask to 0xff like it's done during the flow
translation in xlate_actions() and during the flow dump in the
tun_metadata_from_geneve_nlattr().

Also, flower always has an exact match on the present.len field
regardless of its value and regardless of this field being masked
by OVS flow translation layer while installing the flow.  Hence,
all tunnel flows dumped from TC should have an exact match on
present.len and also UDPIF flag, because present.len doesn't make
sense without that flag.  Without the change, zero-length options
match is incorrectly reported as a wildcard match.  The side effect
though is that zero-length match on geneve options is reported even
for other tunnel types, e.g. vxlan.  But that should be fairly
harmless.  To avoid reporting a match on empty geneve options for
vxlan/etc. tunnels we'll need to check the tunnel port type, as there
is not enough information in the TUNNEL attribute itself.

Extra checks and comments added around the code to better explain
what is going on.

Fixes: a468645c6d ("lib/tc: add geneve with option match offload")
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-08-19 19:39:45 +02:00
Ilya Maximets
8d35e5fafc AUTHORS: Add Michael Santana.
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-08-16 01:20:31 +02:00
Ilya Maximets
960c0e742f Set release date for 3.0.0.
Acked-by: Ian Stokes <ian.stokes@intel.com>
Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-08-15 21:25:58 +02:00
Ilya Maximets
8f62b45e40 releases: Mark 2.17 as a new LTS release.
With release of OVS v3.0.0, according to our release process,
2.17.x becomes a new LTS series.

Acked-by: Ian Stokes <ian.stokes@intel.com>
Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-08-15 21:25:58 +02:00
Ilya Maximets
516f181a21 docs: Remove remaining references to OVS kmod and XenServer.
README file still mentions a kernel module and some parts of
the documentation still have XenServer references, e.g. 'xs-*'
database configuration options.  Removing them.

Fixes: 422e904378 ("make: Remove the Linux datapath.")
Fixes: 83c9518e7c ("xenserver: Remove xenserver.")
Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-08-15 19:46:00 +02:00
Michael Santana
2803b3fb53 handlers: Fix handlers mapping.
The handler and CPU mapping in upcalls is incorrect, and this is
especially noticeable on systems with CPU isolation enabled.

Say we have a 12-core system where only the even-numbered CPUs are
enabled: C0, C2, C4, C6, C8, C10

This means we will create an array of size 6 that will be sent to
kernel that is populated with sockets [S0, S1, S2, S3, S4, S5]

The problem is when the kernel does an upcall it checks the socket array
via the index of the CPU, effectively adding additional load on some
CPUs while leaving no work on other CPUs.

e.g.

C0  indexes to S0
C2  indexes to S2 (should be S1)
C4  indexes to S4 (should be S2)

Modulo of 6 (size of socket array) is applied, so we wrap back to S0
C6  indexes to S0 (should be S3)
C8  indexes to S2 (should be S4)
C10 indexes to S4 (should be S5)

Effectively sockets S0, S2, S4 get overloaded while sockets S1, S3, S5
get no work assigned to them

This leads the kernel to throw the following message:
"openvswitch: cpu_id mismatch with handler threads"

Instead we will send the kernel a corrected array of sockets the size
of all CPUs in the system, or the largest core_id on the system,
whichever is greater. This is to take care of systems with
non-contiguous core IDs.

In the above example we would create a corrected array in a
round-robin (assuming prime bias) fashion as follows:
[S0, S1, S2, S3, S4, S5, S6, S0, S1, S2, S3, S4]
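
The example above can be reproduced with a small Python model. It assumes the kernel picks a socket by indexing the array with the CPU id modulo the array size, as the message describes; the function and variable names are hypothetical.

```python
def dispatch(cpu_id, sockets):
    """Model of per-CPU dispatch: index the socket array by CPU id,
    wrapping around at the array size."""
    return sockets[cpu_id % len(sockets)]


active_cpus = [0, 2, 4, 6, 8, 10]

# Old behavior: one socket per active CPU, six entries in total.
old_array = ["S%d" % i for i in range(6)]
old_map = {cpu: dispatch(cpu, old_array) for cpu in active_cpus}

# Corrected behavior: one entry per CPU id (12), filled round-robin
# from 7 handler sockets.
n_handlers, n_cpus = 7, 12
new_array = ["S%d" % (i % n_handlers) for i in range(n_cpus)]
new_map = {cpu: dispatch(cpu, new_array) for cpu in active_cpus}
```

In `old_map` only S0, S2 and S4 are ever selected, while in `new_map` each of the six active CPUs lands on a different socket.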

Fixes: b1e517bd2f ("dpif-netlink: Introduce per-cpu upcall dispatch.")
Co-authored-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Michael Santana <msantana@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-08-15 19:43:57 +02:00
Michael Santana
a5cacea5f9 handlers: Create additional handler threads when using CPU isolation.
Additional threads are required to service upcalls when we have CPU
isolation (in per-cpu dispatch mode). Additional threads are required
because they create a fairer distribution. With more threads we
decrease the load of each thread, as more threads decrease the number
of cores each thread is assigned.

Adding additional threads also increases the chance OVS utilizes all
cores available to use. Some RPS schemas might make some handler
threads get all the workload while others get no workload. This tends
to happen when the handler thread count is low.

An example would be an RPS that sends traffic on all even cores on a
system with only the lower half of the cores available for OVS to use.
In this example we have as many handlers threads as there are
available cores. In this case 50% of the handler threads get all the
workload while the other 50% get no workload. Not only that, but OVS
is only utilizing half of the cores that it can use. This is the worst
case scenario.

The ideal scenario is to have as many threads as there are cores - in
this case we guarantee that all cores OVS can use are utilized

But, adding as many threads as there are cores could have a performance
hit when the number of active cores (which all threads have to share) is
very low. For this reason we avoid creating as many threads as there
are cores and instead meet somewhere in the middle.

The formula used to calculate the number of handler threads to create
is as follows:

handlers_n = min(next_prime(active_cores+1), total_cores)

Assume default behavior when total_cores <= 2, that is, do not create
additional threads when we have 2 or fewer total cores on the system.
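
A short Python sketch of the formula. Two points are assumptions: `next_prime(n)` is taken to mean the smallest prime greater than or equal to `n`, and "default behavior" is taken to mean one handler per active core.

```python
def is_prime(n):
    """Trial-division primality test, sufficient for core counts."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def next_prime(n):
    """Smallest prime >= n (assumed meaning of next_prime)."""
    while not is_prime(n):
        n += 1
    return n


def handlers_n(active_cores, total_cores):
    """Number of handler threads to create in per-cpu dispatch mode."""
    if total_cores <= 2:
        return active_cores  # assumed default: no additional threads
    return min(next_prime(active_cores + 1), total_cores)
```

For example, 6 active cores out of 12 give `min(next_prime(7), 12)`, that is 7 handlers, and when all 12 cores are active the result is capped at 12.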

Fixes: b1e517bd2f ("dpif-netlink: Introduce per-cpu upcall dispatch.")
Signed-off-by: Michael Santana <msantana@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-08-15 18:51:09 +02:00
Greg Rose
83c9518e7c xenserver: Remove xenserver.
Remove the current xenserver implementation - it is obsolete and
since 3.0 we do not support kernel module builds [1].

1. https://mail.openvswitch.org/pipermail/ovs-dev/2022-July/395789.html

[i.maximets]
Can be added back if people willing to maintain it are found.

Signed-off-by: Greg Rose <gvrose8192@gmail.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-08-15 13:07:13 +02:00
Cian Ferriter
ac1332216e acinclude: Improve vpopcntdq build check.
Support for vpopcntdq instruction generation by the compiler was already
checked in the OVS_CHECK_AVX512 AC function by checking if the compiler
accepted the -mavx512vpopcntdq option.  However, there can be situations
where the compiler supports vpopcntdq generation but the assembler
doesn't support the instruction.

The below OVS_CHECK_AVX512VPOPCNTDQ AC function will check for both
compiler and assembler support for the vpopcntdq instruction.

Fixes: cb1c640077 ("acinclude: Add seperate checks for AVX512 ISA.")
Reported-by: Ian Stokes <ian.stokes@intel.com>
Signed-off-by: Cian Ferriter <cian.ferriter@intel.com>
Acked-by: Sunil Pai G <sunil.pai.g@intel.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2022-08-12 09:52:20 +01:00
Ales Musil
39364e11dd packets: Fix misaligned access to ip6_hdr.
The ip6_hdr is aligned to 4 bytes, but the pointer
from dp_packet_l3 is aligned to 2 bytes. Use
ovs_16aligned_ip6_hdr instead to get 2 bytes alignment.

Signed-off-by: Ales Musil <amusil@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-08-12 01:40:58 +02:00
Ilya Maximets
16193fe730 AUTHORS: Add Miro Tomaska.
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-08-12 01:29:18 +02:00
Miro Tomaska
1731ed43c6 python: Do not send non-zero flag for a SSL socket.
pyOpenSSL was recently switched for the Python standard library ssl
module in the cited commit.  Python SSLsocket.send() does not allow
non-zero optional flag and it will explicitly raise an exception for
that.  pyOpenSSL did nothing with this flag but kept it to be
compatible with socket API:
  https://github.com/pyca/pyopenssl/blob/main/src/OpenSSL/SSL.py#L1844
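
A minimal sketch of the kind of guard this implies, assuming a caller that may hold either a plain or an SSL-wrapped socket. The helper name is hypothetical and this is not the actual OVS change; it relies only on the documented behavior that `ssl.SSLSocket.send()` rejects non-zero flags.

```python
import socket
import ssl


def send_bytes(sock, data, flags=0):
    """Send data, never passing flags to an SSL socket."""
    if isinstance(sock, ssl.SSLSocket):
        # SSLSocket.send() raises ValueError on non-zero flags,
        # so the flags argument is dropped entirely here.
        return sock.send(data)
    return sock.send(data, flags)
```

Plain sockets keep their usual behavior, so callers do not need to know which kind of socket they were given.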

Fixes: 68543dd523 ("python: Replace pyOpenSSL with ssl.")
Reported-at: https://bugzilla.redhat.com/2115035
Acked-By: Timothy Redaelli <tredaelli@redhat.com>
Signed-off-by: Miro Tomaska <mtomaska@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-08-12 01:23:39 +02:00
Ilya Maximets
d1864effeb ovsdb: Fix copying weak references into transaction history.
Transaction history is used only to construct row data updates for
clients, it's not used for checking data integrity, hence it doesn't
need a copy of weak references.

Not copying this data saves a lot of CPU cycles and memory in some
cases.  For example, in 250-node density-heavy scenario in ovn-heater
these references can take up to 70% of RSS, which is about 8 GB of
essentially wasted memory as reported by valgrind massif:

 -------------------------------------------------------------------------------
   n        time(i)         total(B)    useful-heap(B) extra-heap(B)   stacks(B)
 -------------------------------------------------------------------------------
  20 1,011,495,832,314  11,610,557,104  10,217,785,620 1,392,771,484        0

 88.00% (10,217,785,620B) (heap allocation functions) malloc/new/new[]
 ->70.47% (8,181,819,064B) 0x455372: xcalloc__ (util.c:121)
   ->70.07% (8,135,785,424B) 0x41609D: ovsdb_weak_ref_clone (row.c:66)
     ->70.07% (8,135,785,424B) 0x41609D: ovsdb_row_clone (row.c:151)
       ->34.74% (4,034,041,440B) 0x41B7C9: ovsdb_txn_clone (transaction.c:1124)
       | ->34.74% (4,034,041,440B) 0x41B7C9: ovsdb_txn_add_to_history (transaction.c:1163)
       |   ->34.74% (4,034,041,440B) 0x41B7C9: ovsdb_txn_replay_commit (transaction.c:1198)
       |     ->34.74% (4,034,041,440B) 0x408C35: parse_txn (ovsdb-server.c:633)
       |       ->34.74% (4,034,041,440B) 0x408C35: read_db (ovsdb-server.c:663)
       |         ->34.74% (4,034,041,440B) 0x406C9D: main_loop (ovsdb-server.c:238)
       |           ->34.74% (4,034,041,440B) 0x406C9D: main (ovsdb-server.c:500)
       |
       ->34.74% (4,034,041,440B) 0x41B7DE: ovsdb_txn_clone (transaction.c:1125)
         ->34.74% (4,034,041,440B) 0x41B7DE: ovsdb_txn_add_to_history (transaction.c:1163)
           ->34.74% (4,034,041,440B) 0x41B7DE: ovsdb_txn_replay_commit (transaction.c:1198)
             ->34.74% (4,034,041,440B) 0x408C35: parse_txn (ovsdb-server.c:633)
               ->34.74% (4,034,041,440B) 0x408C35: read_db (ovsdb-server.c:663)
                 ->34.74% (4,034,041,440B) 0x406C9D: main_loop (ovsdb-server.c:238)
                   ->34.74% (4,034,041,440B) 0x406C9D: main (ovsdb-server.c:500)

Replacing ovsdb_row_clone() with ovsdb_row_datum_clone() to avoid
cloning unnecessary metadata.  The ovsdb_txn_clone() function re-named
to avoid issues if it will be re-used in the future for some other
use-case.

Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-08-12 01:20:04 +02:00
Sunil Pai G
b0e8668f38 dpif-netdev: Simplify AVX512 build time checks to enhance readability.
The preprocessor comparison strings used to check AVX512 capabilities
are lengthy and hurt readability. Simplify this by aliasing the checks.

Suggested-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Sunil Pai G <sunil.pai.g@intel.com>
Acked-by: Cian Ferriter <cian.ferriter@intel.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
2022-08-10 15:57:39 +01:00
Ilya Maximets
a7045017d8 github: Move CI to ubuntu 20.04 base image.
18.04 image is deprecated and will disappear soon.  Also some
slowdowns and brownouts are planned to push users away from
this deprecated version:

  https://github.com/actions/virtual-environments/issues/6002

Moving to 20.04.  Can't move to 22.04 at the moment because of
deprecation warnings from openssl 3.0.

Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-08-09 13:38:39 +02:00
Ilya Maximets
1fd336ccee netdev-offload-tc: Disable offload of IPv6 fragments.
OVS kernel datapath and TC are parsing IPv6 fragments differently.
For IPv6 later fragments, according to the original design [1], OVS
always sets nw_proto=44 (IPPROTO_FRAGMENT), regardless of the type
of the L4 protocol.

This leads to a situation where a flow for nw_proto=44 gets installed
to TC, but packets can not match on it, causing all IPv6 later
fragments to go to userspace significantly degrading performance.

Disabling offload for such packets, so the flow can be installed
to the OVS kernel datapath instead.  Disabling for all IPv6 fragments
including the first one, because it doesn't make a lot of sense to
handle them separately.  It may also cause potential problems with
conntrack trying to re-assemble a packet from fragments handled by
different datapaths (first in HW, later in OVS kernel).

Checking both 'nw_proto' and 'nw_frag' as classifier might decide
to match only on one of them and also nw_proto will not be 44 for
the first fragment.
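
The two-field check can be modelled as below. This is a sketch under assumptions: the flow is reduced to a handful of integers, the function name is hypothetical, and any set bit in the masked `nw_frag` value is treated as "is a fragment".

```python
ETH_TYPE_IPV6 = 0x86DD
IPPROTO_FRAGMENT = 44


def can_offload(eth_type, nw_proto, nw_proto_mask, nw_frag, nw_frag_mask):
    """Return False for any flow that matches on IPv6 fragments."""
    if eth_type != ETH_TYPE_IPV6:
        return True
    # The classifier may match on only one of the two fields,
    # so either one is enough to refuse the offload.
    by_proto = nw_proto_mask != 0 and nw_proto == IPPROTO_FRAGMENT
    by_frag = (nw_frag & nw_frag_mask) != 0
    return not (by_proto or by_frag)
```

A first fragment carries the real L4 protocol number in `nw_proto`, so only the `nw_frag` test catches it; a later fragment matched solely on `nw_proto=44` is caught by the other test.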

The issue was hidden for some time due to incorrect behavior of the
OVS kernel datapath that was recently fixed in kernel commit:

 12378a5a75e3 ("net: openvswitch: fix parsing of nw_proto for IPv6 fragments")

To allow offloading in the future either flow dissector in TC
should be changed to parse packets in the same way as OVS does,
or parsing in OVS kernel and userspace should be made configurable,
so users can opt-in to the behavior change.  Silent change of the
behavior (change by default) is not an option, because existing
OpenFlow pipelines may depend on a certain behavior.

[1] https://docs.openvswitch.org/en/latest/topics/design/#fragments

Fixes: 83e866067e ("netdev-tc-offloads: Add support for IP fragmentation")
Reviewed-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-08-08 19:15:09 +02:00
Han Ding
269a947c7b ovs-save: Use right OpenFlow version for add-tlv-map.
When the bridge protocols do not include OpenFlow10, the error message
"version negotiation failed" is printed while "Restoring saved flows".

Signed-off-by: Han Ding <handing@chinatelecom.cn>
Acked-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-08-08 19:03:48 +02:00
Paolo Valerio
b47ebf7186 system-traffic: Fix IPv4 fragmentation test sequence for check-kernel.
The following test sequence:

conntrack - IPv4 fragmentation incomplete reassembled packet
conntrack - IPv4 fragmentation with fragments specified

leads to a systematic failure of the latter test on the kernel
datapath (linux).  Multiple executions of the former may also lead to
multiple failures.
This is due to the fact that fragments not yet reassembled are kept in
a queue for /proc/sys/net/ipv4/ipfrag_time seconds, and if the
kernel receives a fragment already present in the queue, it returns
-EINVAL.

Below is the related log message:
|00058|dpif|WARN|system@ovs-system: execute ct(commit) failed (Invalid argument)
  on packet udp,vlan_tci=0x0000,dl_src=50:54:00:00:00:09,dl_dst=50:54:00:00:00:0a,
  nw_src=10.1.1.1,nw_dst=10.1.1.2,nw_tos=0,nw_ecn=0,nw_ttl=0,nw_frag=first,tp_src=1,
  tp_dst=2 udp_csum:0

Fix the sequence by sending the second fragment in "conntrack - IPv4
fragmentation incomplete reassembled packet", once the checks are
done.

IPv6 tests are not affected as the defrag kernel code path pretends to
add the duplicate fragment to the queue returning -EINPROGRESS, when a
duplicate is detected.

Signed-off-by: Paolo Valerio <pvalerio@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-08-08 16:37:32 +02:00
Ilya Maximets
d6c6b216e4 system-traffic: Fix incorrect neigh entry in ipv6 header modification test.
The permanent neighbor entry for fc00::1 is added into a wrong
namespace, so in order to reply to a ping from at_ns1, the
address of fc00::1 has to be discovered.  Interfaces are attached
to OVS and we're removing flows that can forward ND requests
after initial setup.  In case ND request wasn't sent and replied
before that, at_ns1 will not be able to discover fc00::1 and won't
reply to pings.

It's hard to catch this condition while running tests locally,
but for some reason our CI is failing consistently.

Fix the issue by removing all the unnecessary permanent entries
and just allowing all the normal traffic to flow through the
low priority OVS flow, so all addresses can be discovered.

Also adding one more wait to avoid occasional drops of the very
first packet.

Fixes: 2ff43c78c6 ("packets: Re-calculate IPv6 checksum only for first frag upon modify.")
Acked-by: Salem Sol <salems@nvidia.com>
Acked-by: Michael Phelan <michael.phelan@intel.com>
Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-08-08 16:36:38 +02:00
Ilya Maximets
6fcd733f75 system-traffic: Don't run IPv6 header modification test on kernels < 5.19.
OVS kernel module is incorrectly updating checksums while changing
IPv6 fields of later fragments that don't really have L4 headers.

This makes the 'ping6 between two ports with header modify' test
fail on most of the distribution kernels.

The issue got indirectly fixed in latest 5.19 with commit:

  12378a5a75e3 ("net: openvswitch: fix parsing of nw_proto for IPv6 fragments")

The reason is that set_ipv6() function in net/openvswitch/actions.c
is using the protocol number from the parsed flow key and not from
the packet itself, and nw_proto=44 is not a protocol where we can
update the checksum.

It was backported to all supported upstream stable trees, but hasn't
found its way into most of the distributions yet.

Restricting the test to 5.19+ kernels to avoid failures on distro
kernels.  Additionally allowing the previous test for later fragments
to be executed in userspace testsuite.
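
For illustration only, a minimal sketch of a "5.19 or newer" check on a
kernel release string such as the one returned by uname(2).  The helper
name kernel_at_least() is hypothetical and this is not the check the
testsuite itself uses:

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical helper: returns true if 'release' (e.g. "5.19.0-generic")
 * is at least version 'major'.'minor'. */
static bool
kernel_at_least(const char *release, int major, int minor)
{
    int maj, min;

    if (sscanf(release, "%d.%d", &maj, &min) != 2) {
        return false;   /* Unparseable: treat as too old and skip. */
    }
    return maj > major || (maj == major && min >= minor);
}
```

Only the major and minor numbers are compared, so a distribution kernel
that carries the backported fix under an older version number would
still be skipped by a check like this.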

Fixes: 2ff43c78c6 ("packets: Re-calculate IPv6 checksum only for first frag upon modify.")
Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-08-08 12:36:45 +02:00
Ilya Maximets
434025a154 python: Fix E275 missing whitespace after keyword.
With the just-released flake8 5.0 we're getting a bunch of E275 errors:

 utilities/bugtool/ovs-bugtool.in:959:23: E275 missing whitespace after keyword
 tests/test-ovsdb.py:623:11: E275 missing whitespace after keyword
 python/setup.py:105:8: E275 missing whitespace after keyword
 python/setup.py:106:8: E275 missing whitespace after keyword
 python/ovs/db/idl.py:145:15: E275 missing whitespace after keyword
 python/ovs/db/idl.py:167:15: E275 missing whitespace after keyword
 make[2]: *** [flake8-check] Error 1

This breaks CI on branches below 2.16.  We don't see a problem right
now on newer branches because we're installing extra dependencies
that backtrack flake8 down to 4.1 or even 3.9.

Acked-by: Mike Pattrick <mkp@redhat.com>
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-08-04 18:13:20 +02:00
Ilya Maximets
398623a63e tc: Use sparse hex dump while printing inconsistencies.
Instead of a very long hex string something like this will be printed:

 |DBG|tc flower compare failed mask compare:
 Expected Mask:
 00000000  ff ff 00 00 ff ff ff ff-ff ff ff ff ff ff ff ff
 00000020  00 00 00 00 00 00 00 00-00 00 00 00 00 00 03 00
 00000090  00 00 00 00 00 00 00 00-ff ff ff ff ff ff ff ff
 000000c0  ff 00 00 00 ff ff 00 00-ff ff ff ff ff ff ff ff

 Received Mask:
 00000000  ff ff 00 00 ff ff ff ff-ff ff ff ff ff ff ff ff
 00000020  00 00 00 00 00 00 00 00-00 00 00 00 00 00 03 00
 00000090  00 00 00 00 00 00 00 00-ff ff ff ff ff ff ff ff
 000000c0  ff 00 00 00 00 00 00 00-ff ff ff ff ff ff ff ff

It's easier to spot the difference this way and count which bytes are
to blame, since offsets are printed as well.

Using a sparse dump to avoid printing a huge number of all-zero lines.

Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-08-04 14:18:17 +02:00
Ilya Maximets
a7680c3caf netdev-offload-tc: Print unused mask bits on failure.
This change extends the debug logging with a sparse
dump of the flow mask structure to make debugging
easier.

Sample output:

  |netdev_offload_tc|DBG|offloading isn't supported, unknown attribute
  Unused mask bits:
  00000270  00 00 00 00 00 00 00 00-00 00 00 ff 00 00 00 00

In this example, 0x270 + 11 = 635, which is an offset of
the nsh.mdtype in the struct flow.
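
The arithmetic above can be sketched as a small helper.  The name
first_nonzero_offset() is hypothetical and is not part of OVS; it only
shows how a row offset and a byte position combine into a structure
offset:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns the absolute offset of the first nonzero
 * byte in a 16-byte dump row that starts at 'row_ofs', or -1 if the row
 * contains only zeros. */
static long
first_nonzero_offset(size_t row_ofs, const uint8_t row[16])
{
    for (size_t i = 0; i < 16; i++) {
        if (row[i]) {
            return (long) (row_ofs + i);
        }
    }
    return -1;
}
```

For the sample row, the 0xff byte sits at index 11 (counting from zero),
so the helper yields 0x270 + 11 = 635.  That value can then be compared
against offsetof() results for the candidate fields.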

Suggested-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-08-04 14:18:09 +02:00
Ilya Maximets
823d4f6bc8 dynamic-string: Add function for a sparse hex dump.
New function to dump large and sparsely populated data structures
like struct flow.
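
A minimal sketch of the idea, assuming 16-byte rows and the row layout
seen in the tc debug output.  The name sparse_hex_dump() and its
signature are hypothetical and differ from the function added to
dynamic-string:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not the OVS API: writes a hex dump of 'buf' into
 * 'out', skipping every 16-byte row that contains only zeros.  Stops
 * early if 'out' is too small.  Returns the number of rows written. */
static size_t
sparse_hex_dump(char *out, size_t out_size, const void *buf_, size_t size)
{
    const uint8_t *buf = buf_;
    size_t used = 0, rows = 0;

    if (out_size) {
        out[0] = '\0';
    }
    for (size_t ofs = 0; ofs < size; ofs += 16) {
        size_t n = size - ofs < 16 ? size - ofs : 16;
        char line[80];      /* A full row needs 58 bytes plus the NUL. */
        int nonzero = 0;
        int len;

        for (size_t i = 0; i < n; i++) {
            nonzero |= buf[ofs + i];
        }
        if (!nonzero) {
            continue;       /* All-zero row: omit it entirely. */
        }

        len = snprintf(line, sizeof line, "%08zx ", ofs);
        for (size_t i = 0; i < n; i++) {
            /* A dash separates the two 8-byte halves of the row. */
            len += snprintf(line + len, sizeof line - len, "%c%02x",
                            i == 8 ? '-' : ' ', buf[ofs + i]);
        }
        len += snprintf(line + len, sizeof line - len, "\n");

        if (used + len >= out_size) {
            break;          /* Not enough room for this row. */
        }
        memcpy(out + used, line, len + 1);
        used += len;
        rows++;
    }
    return rows;
}
```

Because each printed row carries its own offset, dropping the all-zero
rows loses no information about where the remaining bytes live.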

Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-08-04 14:18:05 +02:00
Ilya Maximets
050dc8fed2 system-offloads-traffic: Fix waiting for netcat indefinitely.
$NC_EOF_OPT should be used to prevent some netcat implementations
from waiting indefinitely.

This fixes the check-offloads testsuite hanging in Ubuntu 22.04.

Fixes: 5660b89a30 ("dpif-netlink: Offloading meter to tc police action")
Acked-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-08-04 14:10:03 +02:00
Ilya Maximets
01edbc3add dpif-netlink: Fix incorrect bit shift in compat mode.
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior in
 lib/dpif-netlink.c:1077:40: runtime error:
   left shift of 1 by 31 places cannot be represented in type 'int'

     #0  0x73fc31 in dpif_netlink_port_add_compat lib/dpif-netlink.c:1077:40
     #1  0x73fc31 in dpif_netlink_port_add lib/dpif-netlink.c:1132:17
     #2  0x2c1745 in dpif_port_add lib/dpif.c:597:13
     #3  0x07b279 in port_add ofproto/ofproto-dpif.c:3957:17
     #4  0x01b209 in ofproto_port_add ofproto/ofproto.c:2124:13
     #5  0xfdbfce in iface_do_create vswitchd/bridge.c:2066:13
     #6  0xfdbfce in iface_create vswitchd/bridge.c:2109:13
     #7  0xfdbfce in bridge_add_ports__ vswitchd/bridge.c:1173:21
     #8  0xfb5319 in bridge_add_ports vswitchd/bridge.c:1189:5
     #9  0xfb5319 in bridge_reconfigure vswitchd/bridge.c:901:9
     #10 0xfae0f9 in bridge_run vswitchd/bridge.c:3334:9
     #11 0xfe67dd in main vswitchd/ovs-vswitchd.c:129:9
     #12 0x4b6d8f  (/lib/x86_64-linux-gnu/libc.so.6+0x29d8f)
     #13 0x4b6e3f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x29e3f)
     #14 0x562594eed024 in _start (vswitchd/ovs-vswitchd+0x787024)
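
For illustration, a minimal sketch of this class of bug and one common
way to avoid it.  The helper name bit_mask() is hypothetical and the
actual change in lib/dpif-netlink.c is not reproduced here:

```c
#include <stdint.h>

/* The literal 1 has type 'int', so with a 32-bit int the expression
 * '1 << 31' shifts into the sign bit, which is undefined behavior.
 * Shifting an unsigned 32-bit one instead is well defined for bit
 * positions 0 through 31. */
static inline uint32_t
bit_mask(unsigned int bit)
{
    return UINT32_C(1) << bit;
}
```

Callers must still keep 'bit' below 32; shifting by the full width of
the type is undefined for unsigned operands as well.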

Fixes: 526df7d854 ("tunnel: Provide framework for tunnel extensions for VXLAN-GBP and others")
Acked-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-08-04 14:07:37 +02:00
Ilya Maximets
91b41af0d9 checkpatch: Add check for a Fixes tag.
A new check for common mistakes while formatting a 'Fixes:' tag.

Acked-by: Sunil Pai G <sunil.pai.g@intel.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-08-04 14:06:37 +02:00
Timothy Redaelli
6a9ec13aa3 python: Use setuptools instead of distutils.
On Python 3.12, distutils will be removed and it's currently (3.10+)
deprecated (see PEP 632).

Since the suggested and simplest replacement is setuptools, this commit
replaces distutils with setuptools.

setuptools < 59.0 doesn't have setuptools.errors and so, in this case,
distutils.errors is still used.

Signed-off-by: Timothy Redaelli <tredaelli@redhat.com>
Acked-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-08-04 14:01:23 +02:00
Ilya Maximets
47cfa89412 AUTHORS: Add Salem Sol.
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-08-04 14:01:23 +02:00
Salem Sol
2ff43c78c6 packets: Re-calculate IPv6 checksum only for first frag upon modify.
In case of modifying an IPv6 packet src/dst address the L4 checksum
should be recalculated only for the first frag.  Currently it's done
for all frags, leading to incorrect reassembled packet checksum.
Fix it by adding a new flag to recalculate the checksum only for the
first frag.
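
A minimal sketch of the decision, assuming the caller has already
located the IPv6 fragment header and converted its 16-bit
offset/flags field to host byte order.  The names below are
hypothetical and do not match the flag added by this patch:

```c
#include <stdbool.h>
#include <stdint.h>

/* In the IPv6 fragment header, the upper 13 bits of the offset/flags
 * field hold the fragment offset in 8-byte units; the lowest bit is the
 * "more fragments" flag. */
#define FRAG_OFF_MASK 0xfff8

/* Hypothetical helper: returns true if the L4 checksum should be
 * recalculated after an address change.  Only unfragmented packets and
 * first fragments (offset 0) carry the L4 header. */
static bool
l4_csum_needs_update(bool has_frag_hdr, uint16_t offlg)
{
    return !has_frag_hdr || (offlg & FRAG_OFF_MASK) == 0;
}
```

Later fragments carry only payload, so touching bytes at the usual
checksum position there would corrupt data and break the checksum of
the reassembled packet.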

Fixes: bc7a5acdff ("datapath: add ipv6 'set' action")
Signed-off-by: Salem Sol <salems@nvidia.com>
Acked-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-08-04 13:32:44 +02:00
Vlad Buslov
d9268782af netdev-linux: set correct action for packets that passed policer
Referenced commit changed policer action type from TC_ACT_UNSPEC (continue)
to TC_ACT_PIPE. However, since neither TC hardware offload layer nor mlx5
driver at the time validated action type and always assumed 'continue', the
breakage wasn't caught until later validation code was added. The change
also broke valid configuration when sending from offload-capable device to
non-offload capable. For example, when sending from mlx5 VF to OvS bridge
netdevice the traffic that passed matchall classifier with policer could no
longer match the following flower rule in software:

filter protocol all pref 1 matchall chain 0
filter protocol all pref 1 matchall chain 0 handle 0x1
  in_hw (rule hit 7863)
        action order 1:  police 0x1 rate 32Mbit burst 1000Kb mtu 64Kb action drop/pipe overhead 0b
        ref 1 bind 1  installed 17 sec firstused 17 sec
        Action statistics:
        Sent 152199634 bytes 102550 pkt (dropped 1315, overlimits 1315 requeues 0)
        Sent software 74612172 bytes 51275 pkt
        Sent hardware 77587462 bytes 51275 pkt
        backlog 0b 0p requeues 0
        used_hw_stats delayed

filter protocol ip pref 3 flower chain 0
filter protocol ip pref 3 flower chain 0 handle 0x1
  dst_mac aa:94:1f:f2:f8:44
  src_mac e4:00:01:08:00:02
  eth_type ipv4
  ip_flags nofrag
  not_in_hw
        action order 1: skbedit  ptype host pipe
         index 1 ref 1 bind 1 installed 6 sec used 6 sec
        Action statistics:
        Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
        backlog 0b 0p requeues 0

        action order 2: mirred (Ingress Redirect to device br-ovs) stolen
        index 1 ref 1 bind 1 installed 6 sec used 6 sec
        Action statistics:
        Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
        backlog 0b 0p requeues 0
        cookie 401a9c8b3d403c62240d3eb5e21c1604
        no_percpu

Fix the issue by restoring matchall and basic policers action type to
'continue'.

Fixes: c2567e533f ("add port-based ingress policing based packet-per-second rate-limiting")
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Simon Horman <simon.horman@corigine.com>
2022-08-04 10:04:28 +01:00
Ilya Maximets
c43da842fb test-ovsdb: Fix false-positive leaks from LeakSanitizer.
LeakSanitizer for some reason reports these json objects as leaked,
even though we do have references to them at the moment ovs_fatal()
is called from check_ovsdb_error().

Previously it complained only with -O2, but with newer versions of
clang/llvm it started complaining even with -O1.  For example, negative
ovsdb parsing tests are failing on ubuntu 22.04 with clang 14 if built
with ASan and detect_leaks=1.

Fix that by destroying the json object before aborting the process.
With that change, we may also build with the default -O2 in CI.

An alternative implementation might be to just pass the json to destroy
to every check_ovsdb_error() call, but indirect registering of the
pointer seems a bit less invasive.

Acked-by: Ales Musil <amusil@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-07-29 17:22:02 +02:00
Ilya Maximets
7670c7c2e1 m4: Update ax_func_posix_memalign to the latest version.
This fixes the obsolescence warning for AC_TRY_RUN with autoconf 2.70+:

  $ ./boot.sh
  configure.ac:141: warning: The macro `AC_TRY_RUN' is obsolete.
  configure.ac:141: You should run autoupdate.
  ./lib/autoconf/general.m4:2997: AC_TRY_RUN is expanded from...
  lib/m4sugar/m4sh.m4:692: _AS_IF_ELSE is expanded from...
  lib/m4sugar/m4sh.m4:699: AS_IF is expanded from...
  ./lib/autoconf/general.m4:2249: AC_CACHE_VAL is expanded from...
  ./lib/autoconf/general.m4:2270: AC_CACHE_CHECK is expanded from...
  m4/ax_func_posix_memalign.m4:27: AX_FUNC_POSIX_MEMALIGN is expanded from...
  configure.ac:141: the top level

Acked-by: Sunil Pai G <sunil.pai.g@intel.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-07-29 17:14:55 +02:00
Ilya Maximets
97adbe9437 m4: Replace obsolete AC_HELP_STRING with AS_HELP_STRING.
AS_HELP_STRING is a direct replacement for AC_HELP_STRING.
It is available since autoconf 2.57a.  OVS requires 2.63,
so AS_HELP_STRING can be freely used.

This fixes the following warning on systems with 2.70+:

  $ ./boot.sh
  ...
  configure.ac:92: warning: The macro `AC_HELP_STRING' is obsolete.
  configure.ac:92: You should run autoupdate.
  ...

Acked-by: Sunil Pai G <sunil.pai.g@intel.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-07-29 17:14:39 +02:00