mirror of https://github.com/openvswitch/ovs synced 2025-08-22 01:51:26 +00:00

20531 Commits

Author SHA1 Message Date
Mike Pattrick
0add983b38 ovsdb: Use table indexes if available for ovsdb_query().
Currently, all OVSDB database queries except for UUID lookups result
in linear lookups over the entire table, even if an index is present.

This patch modifies ovsdb_query() to attempt an index lookup first, if
possible. If no matching indexes are present then a linear scan is
still conducted.
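
The lookup order described above can be sketched roughly as follows. This is a toy Python model, not the C code in ovsdb_query(); the `Table` class, the `query()` helper and the condition tuples are hypothetical names, and it assumes an index is usable only when every indexed column is pinned by an equality condition.

```python
import operator

OPS = {"==": operator.eq, "!=": operator.ne, "<": operator.lt, ">": operator.gt}


class Table:
    """Toy table with one unique index over `index_columns` (hypothetical)."""

    def __init__(self, index_columns):
        self.index_columns = tuple(index_columns)
        self.rows = []
        self.index = {}

    def insert(self, row):
        self.rows.append(row)
        self.index[tuple(row[c] for c in self.index_columns)] = row


def query(table, conditions):
    """Return (matching rows, whether the index was used).

    `conditions` is a list of (column, op, value) tuples; a row matches
    only if all of them hold.
    """
    equalities = {col: val for col, op, val in conditions if op == "=="}
    if table.index_columns and all(c in equalities for c in table.index_columns):
        # Every indexed column is pinned by an equality: one hash lookup.
        key = tuple(equalities[c] for c in table.index_columns)
        row = table.index.get(key)
        candidates = [] if row is None else [row]
        used_index = True
    else:
        # No usable index: fall back to scanning the whole table.
        candidates = table.rows
        used_index = False
    # Remaining conditions are still checked against every candidate.
    matches = [r for r in candidates
               if all(OPS[op](r[col], val) for col, op, val in conditions)]
    return matches, used_index


table = Table(index_columns=["name"])
for i in range(1000):
    table.insert({"name": f"row{i}", "value": i})

indexed, used = query(table, [("name", "==", "row42")])
scanned, used_scan = query(table, [("value", ">", 997)])
```

The second query has no equality on the indexed column, so it takes the linear path, which corresponds to the "post-patch linear scan" case measured below.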

To test this, I set up an ovsdb database with a variable number of rows
and timed the average of how long ovsdb-client took to query a single
row. The first two tests involved a linear scan that didn't match any
rows, so there was no overhead associated with sending or encoding
output. The post-patch linear scan was a worst case scenario where the
table did have an appropriate index but the conditions made its usage
impossible. The indexed lookup test was for a matching row, which did
also include overhead associated with a match. The results are included
in the table below.

Rows                   | 100k | 200k | 300k | 400k | 500k
-----------------------+------+------+------+------+-----
Pre-patch linear scan  |  9ms | 24ms | 37ms | 49ms | 61ms
Post-patch linear scan |  9ms | 24ms | 38ms | 49ms | 61ms
Indexed lookup         |  3ms |  3ms |  3ms |  3ms |  3ms

I also tested the performance of ovsdb_query() by wrapping it in a loop
and measuring the time it took to perform 1000 linear scans on 1, 10,
100k, and 200k rows. This test showed that the new index checking code
did not slow down worst case lookups to a statistically detectable
degree.

Reported-at: https://issues.redhat.com/browse/FDP-590
Signed-off-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2025-07-15 18:05:32 +02:00
Eelco Chaudron
5c4d60671c dpif: Fix infinite netlink loop in dpif_execute_helper_cb.
When a meter action is encountered and stored in the auxiliary
structure, and subsequently, a non-meter action is processed
within a nested list during callback execution, an infinite
loop is triggered.

This patch maintains the current behavior but stores all
required meter actions in an ofpbuf for deferred execution.

Reported-at: https://patchwork.ozlabs.org/project/openvswitch/patch/20250506022337.3242-1-danieldin186@gmail.com/
Fixes: 076caa2fb077 ("ofproto: Meter translation.")
Acked-by: Ilya Maximets <i.maximets@ovn.org>
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
2025-07-15 16:40:45 +02:00
Eelco Chaudron
50e1e57f81 utilities:gdb: Add GDB function to dump Netlink attributes.
This commit adds the ovs_dump_nla GDB command, which allows
dumping Netlink attributes. Here are some examples:

(gdb) ovs_dump_nla 0x7f10e35d8858 172 ovs_check_pkt_len_attr
(struct nlattr *) 0x7f10e35d8858:[OVS_CHECK_PKT_LEN_ATTR_PKT_...
(struct nlattr *) 0x7f10e35d8860:[OVS_CHECK_PKT_LEN_ATTR_...
(struct nlattr *) 0x7f10e35d88b0:[OVS_CHECK_PKT_LEN_ATTR_...

(gdb) ovs_dump_nla 0x7f10e35d8858 172
(struct nlattr *) 0x7f10e35d8858: {nla_len = 6, nla_type ...
(struct nlattr *) 0x7f10e35d8860: {nla_len = 80, nla_type ...
...len = 80, nla_type = 3}, nl_attr_get() = 0x7f10e35d88b4

(gdb)  ovs_dump_nla 0x7f10e35d88b4 80 ovs_action_attr dump
... nla_type = 19}, nl_attr_get() = 0x7f10e35d88b8: 3f 01 00 00
...

Acked-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
2025-07-15 15:21:45 +02:00
Adrian Moreno
6d4044899e docs: Specify retis dependency on USDT probes.
Retis' "--ovs-track" option relies on USDT probes being available.

Fixes: 22732c0e6770 ("tests: Add support for running system tests under retis.")
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2025-07-10 14:10:36 +02:00
Ilya Maximets
0d9dc8e9ca dpif-netlink: Provide original upcall pid in 'execute' commands.
When a packet enters kernel datapath and there is no flow to handle it,
packet goes to userspace through a MISS upcall.  With per-CPU upcall
dispatch mechanism, we're using the current CPU id to select the
Netlink PID on which to send this packet.  This allows us to send
packets from the same traffic flow through the same handler.

The handler will process the packet, install required flow into the
kernel and re-inject the original packet via OVS_PACKET_CMD_EXECUTE.

While handling OVS_PACKET_CMD_EXECUTE, however, we may hit a
recirculation action that will pass the (likely modified) packet
through the flow lookup again.  And if the flow is not found, the
packet will be sent to userspace again through another MISS upcall.

However, the handler thread in userspace is likely running on a
different CPU core, and the OVS_PACKET_CMD_EXECUTE request is handled
in the syscall context of that thread.  So, when the time comes to
send the packet through another upcall, the per-CPU dispatch will
choose a different Netlink PID, and this packet will end up processed
by a different handler thread on a different CPU.

The process continues as long as there are new recirculations, each
time the packet goes to a different handler thread before it is sent
out of the OVS datapath to the destination port.  In real setups the
number of recirculations can go up to 4 or 5, sometimes more.

There is always a chance to re-order packets while processing upcalls,
because userspace will first install the flow and then re-inject the
original packet.  So, there is a race window when the flow is already
installed and the second packet can match it inside the kernel and be
forwarded to the destination before the first packet is re-injected.
But the fact that packets are going through multiple upcalls handled
by different userspace threads makes the reordering noticeably more
likely, because we not only have a race between the kernel and a
userspace handler (which is hard to avoid), but also between multiple
userspace handlers.

For example, let's assume that 10 packets got enqueued through a MISS
upcall for handler-1.  It will start processing them, install the
flow into the kernel and start re-injecting packets back, from where
they will go through another MISS to handler-2.  Handler-2 will install
the flow into the kernel and start re-injecting the packets.  While
handler-1 continues to re-inject the last of the 10 packets, they will
hit the flow installed by handler-2 and be forwarded without going to
handler-2, while handler-2 still re-injects the first of these 10
packets.  Given multiple recirculations and misses, these 10 packets
may end up completely mixed up on the output from the datapath.

Let's provide the original upcall PID via the new netlink attribute
OVS_PACKET_ATTR_UPCALL_PID.  This way the upcall triggered during the
execution will go to the same handler.  Packets will be enqueued to
the same socket and re-injected in the same order.  This doesn't
eliminate re-ordering as stated above, since we still have a race
between the kernel and the handler thread, but it does eliminate
races between multiple handlers.
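
The effect on handler selection can be illustrated with a small simulation. It is only a sketch in Python, not kernel or OVS code: `select_upcall_pid()`, `trace_packet()`, the modulo mapping from CPU to PID, and the `handler_cpu` table are all assumptions made up for the example.

```python
def select_upcall_pid(handler_pids, cpu_id, original_pid=None):
    """Pick the Netlink PID that receives a MISS upcall.

    With per-CPU dispatch the PID is derived from the current CPU.  If
    the execute request carried the PID of the original upcall, that PID
    is reused instead.
    """
    if original_pid is not None:
        return original_pid
    return handler_pids[cpu_id % len(handler_pids)]


def trace_packet(handler_pids, rx_cpu, handler_cpu, recirculations, pass_pid):
    """Return the PIDs that see one packet across repeated misses."""
    pid = select_upcall_pid(handler_pids, rx_cpu)
    seen = [pid]
    for _ in range(recirculations):
        # The re-injected packet is processed in the syscall context of
        # the handler thread, i.e. on that handler's CPU, not on rx_cpu.
        cpu = handler_cpu[pid]
        pid = select_upcall_pid(handler_pids, cpu, pid if pass_pid else None)
        seen.append(pid)
    return seen


handler_pids = [100, 101, 102, 103]              # one PID per CPU (made up)
handler_cpu = {100: 1, 101: 2, 102: 3, 103: 0}   # CPU each handler runs on

without_attr = trace_packet(handler_pids, 0, handler_cpu, 3, pass_pid=False)
with_attr = trace_packet(handler_pids, 0, handler_cpu, 3, pass_pid=True)
```

In this model `without_attr` visits a different PID on every miss, while `with_attr` stays on the PID of the first upcall, so all misses for the packet are queued to one socket.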

The openvswitch kernel module ignores unknown attributes for the
OVS_PACKET_CMD_EXECUTE, so it's safe to provide it even on older
kernels.

Reported-at: https://issues.redhat.com/browse/FDP-1479
Link: https://lore.kernel.org/netdev/20250702155043.2331772-1-i.maximets@ovn.org/
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2025-07-10 12:20:54 +02:00
Eelco Chaudron
0d5eece55c mcast-snooping: Properly check MLD packet length.
If an MLD packet is not large enough to contain the
message-specific data, it may lead to a NULL pointer access.
This patch fixes the issue by adding appropriate length checks.
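
As an illustration of this kind of check, here is a minimal Python sketch that refuses to parse a message shorter than the fixed MLDv1 layout from RFC 2710 (type, code, checksum, maximum response delay, reserved, 16-byte multicast address). It is not the OVS code; `parse_mld_v1()` is a hypothetical helper, and the commit does not say which message types or lengths it checks.

```python
import struct

# type(1) + code(1) + checksum(2) + max response delay(2) + reserved(2)
# + multicast address(16)
MLD_V1_LEN = 24


def parse_mld_v1(payload):
    """Parse an MLDv1 message, or return None if it is too short."""
    if len(payload) < MLD_V1_LEN:
        # Not enough bytes for the message-specific data: bail out
        # instead of reading past the end of the buffer.
        return None
    msg_type, code, checksum, max_delay, _reserved = struct.unpack_from(
        "!BBHHH", payload, 0)
    return {
        "type": msg_type,
        "code": code,
        "checksum": checksum,
        "max_delay": max_delay,
        "group": payload[8:24],
    }


group = bytes.fromhex("ff020000000000000000000000000001")
report = bytes([131, 0, 0, 0, 0, 0, 0, 0]) + group   # 131: listener report
truncated = report[:10]

parsed = parse_mld_v1(report)
rejected = parse_mld_v1(truncated)
```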

Fixes: 06994f879c9d ("mcast-snooping: Add Multicast Listener Discovery support")
Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
2025-07-07 16:01:56 +02:00
Ilya Maximets
22732c0e67 tests: Add support for running system tests under retis.
Retis is very useful for debugging our system tests or debugging
kernel issues through our system tests.  This change adds a convenient
way to run any kernel system test with the retis capture in the
background.  E.g.:

  make check-kernel OVS_TEST_WITH_RETIS=yes TESTSUITEFLAGS='167 -d'

Retis 1.5 is required, since we're using ifdump profile, and it also
will mount debugfs for us in case of running in a different namespace.
It should be available in $PATH.

In addition to just capturing the retis.data, we're also running the
capture with --print to print all the events as they appear, and
producing the sorted output in the end.  This makes it easier to work
across systems with different versions of retis and saves time for
running the sort manually.  The raw data is still available for
advanced processing, if needed.

Not specifying any particular collector, capturing everything that's
enabled by default.  OVS tracking is turned on by default.

Since OVS tracking is used, it's required to start retis after the
kernel datapath is created, otherwise it will fail to obtain the map
of upcall PIDs.  That's why we need to start it after the bridge is
created.

Only adding support for kernel-related test suites for now.  For
userspace test suites it may also be useful at some point, but
currently that requires running without --ovs-track and isn't too
important.

Startup of the retis capture adds a significant amount of time to each
test, so not running it by default.

Link: https://github.com/retis-org/retis
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2025-07-04 17:43:55 +02:00
Ilya Maximets
0491972828 seq: Fix deadlock with the time_init.
There is an ABBA deadlock between time_init() and seq_wait():

    Thread 1:
        poll_block()
        time_poll()
        time_init()
        pthread_once() <-- lock A
        do_time_init()
        seq_create()
        pthread_mutex_lock(seq_mutex) <-- lock B

    Thread 2:
        seq_wait(different seqno)
        pthread_mutex_lock(seq_mutex) <-- lock B
        poll_immediate_wake()
        poll_timer_wait()
        time_msec()
        time_init()
        pthread_once() <-- lock A

This is likely the same deadlock Intel CI saw last year before the lab
was shut down.

The issue should not happen with normal applications as those would
normally have the time module initialized early in the process before
waiting on any sequence numbers, but it happens in the test-barrier
application from time to time causing the test suite to hang.

Fix that by making sure we're not calling poll_immediate_wake() under
the seq_mutex.  The time and seq modules are independent and it's hard
to ensure the dependency without exporting some of their internals.
Instead re-defining the prototype of the poll_immediate_wake_at(),
adding the thread safety annotation, so we have some basic protection
from this deadlock if the code ever changes.  Compiler will warn on
the prototype mismatch as well if it ever happens, so it's not a big
problem.  Having this prototype also gives us a spot in the code where
we can place a comment explaining the locking order.
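
The shape of the fix, deciding under the lock and calling out only after releasing it, can be sketched in Python. This is an analogy rather than the OVS code: `Seq`, `seq_wait()` and the `wake` callback are simplified stand-ins, and `wake` plays the role of poll_immediate_wake().

```python
import threading

seq_mutex = threading.Lock()


class Seq:
    def __init__(self):
        self.value = 1


def seq_wait(seq, observed, wake):
    """Call `wake()` if `seq` already changed, but never under seq_mutex."""
    with seq_mutex:
        changed = seq.value != observed
    # The lock is released at this point, so `wake` is free to take other
    # locks (e.g. one-time initialization) without forming an ABBA cycle.
    if changed:
        wake()
    return changed


held_during_wake = []


def wake():
    # Record whether seq_mutex was held when the callback ran.
    held_during_wake.append(seq_mutex.locked())


seq = Seq()
seq.value = 2
woke = seq_wait(seq, observed=1, wake=wake)
idle = seq_wait(seq, observed=2, wake=wake)
```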

Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2024-July/415436.html
Reported-at: https://issues.redhat.com/browse/FDP-1493
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2025-07-04 17:39:19 +02:00
David Marchand
1210864a63 netdev-dpdk: Remove unused macro for TSO offloads.
This macro is a leftover from the previous implementation.

Fixes: 3337e6d91c5b ("userspace: Enable L4 checksum offloading by default.")
Acked-by: Mike Pattrick <mkp@redhat.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2025-07-01 12:57:20 +02:00
Ilya Maximets
3d2f64e5d6 ovsdb-idl: Add new functions to check the column type on the server.
Currently, there is no convenient way to know what are the constraints
for a particular column in the server's schema in C IDL.  This is
a problem, because clients may want to know how many elements are
allowed in a certain column.  For example, we recently increased the
allowed number of prefixes configured in the Flow_Table table in OVS,
but the client (ovn-controller) has no good way to know how many
prefixes are actually supported in the schema of the currently running
ovsdb-server.  The IDL's code is generated from one schema version,
while the actual server may be using a newer or older one.  If the
client specifies too many prefixes, the transaction will fail, and
there is also no good way to tell from the ovn-controller why exactly
the transaction failed.

The currently used solution is to create another database connection just
to intercept schema changes and parse the schema JSON manually inside
the ovn-controller:
  89e43f7528

While this approach works, it's not a clean solution.  We have the
server's schema on the CS level and we can provide the types to the
application via IDL functions.  This will allow ovn-controller to
just use ovsrec_flow_table_prefixes_server_type(idl)->n_max instead
of all the awkward schema parsing.
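
To make `n_max` concrete: in the OVSDB schema JSON (RFC 7047) a column type is either a plain atomic type name or an object whose optional "min" and "max" members bound the number of elements, with both defaulting to 1. The Python sketch below reads that bound from a made-up schema fragment. The fragment, the limit of 3 and `column_n_max()` are illustrative assumptions, and this is the manual kind of parsing the new IDL functions are meant to make unnecessary.

```python
import json

# Hypothetical schema fragment; the real schema has many more tables
# and columns, and the real limit may differ.
SCHEMA_FRAGMENT = json.loads("""
{
  "name": "Example",
  "tables": {
    "Flow_Table": {
      "columns": {
        "name": {"type": "string"},
        "prefixes": {"type": {"key": "string", "min": 0, "max": 3}},
        "external_ids": {"type": {"key": "string", "value": "string",
                                  "min": 0, "max": "unlimited"}}
      }
    }
  }
}
""")


def column_n_max(schema, table, column):
    """Return the maximum number of elements allowed in a column."""
    col_type = schema["tables"][table]["columns"][column]["type"]
    if isinstance(col_type, str):
        return 1                      # bare atomic type: exactly one value
    n_max = col_type.get("max", 1)    # "max" defaults to 1 when omitted
    return float("inf") if n_max == "unlimited" else n_max


prefixes_max = column_n_max(SCHEMA_FRAGMENT, "Flow_Table", "prefixes")
```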

Python IDL is more dynamic and has a different way of connecting
where the user first obtains the schema and then initializes IDL
with that schema.  The parsed schema object with all the types is
also available through the get_idl_schema() method.  So, it is
already possible to check the types there.

Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2025-07-01 12:52:17 +02:00
Ilya Maximets
f5819e699e json: Store short arrays in-place.
Similarly to strings, 24 bytes that we have in 'struct json' can fit
up to 3 JSON_ARRAY elements.  And we can use separate storage types
to count them.
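
One way to picture "separate storage types to count them" is a storage tag that doubles as the element count. The Python sketch below is only a model: the `ArrayStorage` names are hypothetical, 64-bit pointers are assumed, and treating an empty array as dynamic is a guess rather than something the commit states.

```python
from enum import Enum

UNION_BYTES = 24     # size quoted in the text above
POINTER_BYTES = 8    # assumption: 64-bit pointers
MAX_INLINE = UNION_BYTES // POINTER_BYTES


class ArrayStorage(Enum):
    DYNAMIC = 0      # elements live in a separately allocated buffer
    INLINE_1 = 1     # one element pointer stored in the union
    INLINE_2 = 2
    INLINE_3 = 3


def array_storage(n_elements):
    """Pick the storage tag for an array with `n_elements` elements."""
    if 1 <= n_elements <= MAX_INLINE:
        # The tag value itself records how many inline slots are used.
        return ArrayStorage(n_elements)
    return ArrayStorage.DYNAMIC


uuid_pair = array_storage(2)   # e.g. a 2-element array such as a UUID
large = array_storage(10)
```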

There are many small arrays in typical databases, for example, every
UUID is a 2-element array.  So, the change does have a noticeable
performance impact.

With 350MB OVN Northbound database with 12M atoms:

                         Before        After       Improvement
 ovsdb-client dump      16.6 sec      14.9 sec       10.2 %
 Compaction             13.4 sec      11.0 sec       17.9 %
 Memory usage (RSS)     2.05 GB       1.90 GB         7.3 %

With 615MB OVN Southbound database with 23M atoms:

                         Before        After       Improvement
 ovsdb-client dump      43.7 sec      40.5 sec        7.3 %
 Compaction             32.5 sec      29.4 sec        9.5 %
 Memory usage (RSS)     4.80 GB       4.46 GB         7.1 %

In the results above, 'ovsdb-client dump' is measuring how long it
takes for the server to prepare and send a reply, 'Memory usage (RSS)'
reflects the RSS of the ovsdb-server after loading the full database.
ovn-heater tests report similar reduction in CPU and memory usage
on heavy operations like compaction.

Acked-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2025-06-30 16:53:56 +02:00
Ilya Maximets
1de4a08c22 json: Use functions to access json arrays.
The internal implementation of JSON arrays will be changed in future
commits.  Add access functions that users can rely on instead of
accessing the internals of 'struct json' directly and convert all the
users.  Structure fields are intentionally renamed to make sure that
no code is using the old fields directly.

The json_array() function is removed, as it is no longer needed.  Added new
functions:  json_array_size(), json_array_at(), json_array_set()
and json_array_pop().  These are enough to cover all the use cases
within OVS.
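
The value of going through accessors is that callers stop depending on how the elements are stored. A Python analogy, not the C API: the function names mirror the ones listed above, but their signatures and the behavior of `json_array_pop()` on an empty array are assumptions.

```python
class JsonArray:
    """Toy array whose storage callers are not supposed to touch."""

    def __init__(self, elements=()):
        # Private on purpose: the representation may change later
        # without affecting code that only uses the accessors below.
        self._elems = list(elements)


def json_array_size(arr):
    return len(arr._elems)


def json_array_at(arr, index):
    return arr._elems[index]


def json_array_set(arr, index, value):
    arr._elems[index] = value


def json_array_pop(arr):
    """Remove and return the last element, or None if the array is empty."""
    return arr._elems.pop() if arr._elems else None


arr = JsonArray(["uuid", "a1b2"])
json_array_set(arr, 1, "c3d4")
last = json_array_pop(arr)
```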

The change is fairly large, however, IMO, it's a much overdue cleanup
that we need even without changing the underlying implementation.

Acked-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2025-06-30 16:53:56 +02:00
Ilya Maximets
9669b50f56 json: Store short strings in-place.
The 'struct json' contains a union and the largest element of that
union is 'struct json_array', which takes 24 bytes.  It means, that a
lot of space in this structure remains unused whenever the type is not
JSON_ARRAY.

For example, the 'string' pointer used for JSON_STRING only takes 8
bytes on a 64-bit system leaving 24 - 8 = 16 bytes unused.  There is
also a 4-byte hole between the 'type' and the 'count'.

A pretty common optimization technique for storing strings is to
store short ones in place of the pointer and only allocate dynamically
the larger strings that do not fit.  In our case, we have even larger
space of 24 bytes to work with.  So, we could use all 24 bytes to
store the strings (23 string bytes + '\0') and use the 4 byte unused
space outside the union to store the storage type.
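
The size rule from this paragraph can be written down as a tiny Python model. It only mirrors the arithmetic (23 string bytes plus the terminator fit in 24 bytes); `store_string()` and the "inline"/"dynamic" tags are hypothetical and say nothing about the actual field names in 'struct json'.

```python
INLINE_CAPACITY = 24   # bytes available in the union, per the text above


def store_string(s):
    """Return (storage kind, stored bytes) for a JSON string value."""
    data = s.encode("utf-8") + b"\0"
    if len(data) <= INLINE_CAPACITY:
        # Up to 23 bytes of content plus the terminator fit in place.
        return ("inline", data)
    # Anything longer still needs a separate allocation.
    return ("dynamic", data)


short_kind, _ = store_string("br-int")
edge_kind, _ = store_string("x" * 23)
long_kind, _ = store_string("x" * 24)
```

Note that the limit applies to encoded bytes, so a string with multi-byte UTF-8 characters reaches it with fewer than 23 characters.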

This approach should allow us to save on memory allocation for short
strings and also save on accesses to them, as the content will fit
into the same cache line as the 'struct json' itself.

In practice, large OVN databases tend to operate with quite large
strings.  For example, all the logical flow matches and actions in
OVN Southbound database would not fit.  However, this approach still
improves performance with large OVN databases.

With 350MB OVN Northbound database with 12M atoms:

                         Before        After       Improvement
 ovsdb-client dump      18.6 sec      16.6 sec       10.7 %
 Compaction             14.0 sec      13.4 sec        4.2 %
 Memory usage (RSS)     2.28 GB       2.05 GB        10.0 %

With 615MB OVN Southbound database with 23M atoms:

                         Before        After       Improvement
 ovsdb-client dump      46.1 sec      43.7 sec        5.2 %
 Compaction             34.8 sec      32.5 sec        6.6 %
 Memory usage (RSS)     5.29 GB       4.80 GB         9.3 %

In the results above, 'ovsdb-client dump' is measuring how long it
takes for the server to prepare and send a reply, 'Memory usage (RSS)'
reflects the RSS of the ovsdb-server after loading the full database.
ovn-heater tests report similar reduction in CPU and memory usage
on heavy operations like compaction.

Acked-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2025-06-30 16:53:56 +02:00
Ilya Maximets
6c48b29f52 json: Always use the json_string() method to access the strings.
We'll be changing the way strings are stored, so the direct access
will not be safe anymore.  Change all the users to use the proper
API as they should have been doing anyway.  This also means splitting
the handling of strings and serialized objects in most cases as
they will be treated differently.

The only code outside of json implementation for which direct access
is preserved is substitute_uuids() in test-ovsdb.c.  It's an unusual
string manipulation that is only needed for the testing, so doesn't
seem worth adding a new API function.  We could introduce something
like json_string_replace() if this use case appears somewhere
else in the future.

Acked-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2025-06-30 16:53:56 +02:00
Dumitru Ceara
41a4a3723d sparse/socket.h: Add AF_BRIDGE definition.
OVN will be using AF_BRIDGE in the near future as part of the effort to
add dynamic routing support for EVPN.  Without these definitions
compilation (with sparse enabled) fails in OVN.

Acked-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2025-06-25 12:45:04 +02:00
Ilya Maximets
83af8ee6f1 tests: ipsec: Adjust status checks for upcoming Libreswan 5.3.
Future Libreswan will also start reporting a number of 'routed'
connections:
  8f754fe854

Need to adjust our parsing commands accordingly.

Instead of just adding '.*' in the sed regex, making it more generic,
so we can query 'routed' connections in the future without changing
the macro and be more tolerant to future format changes.  While at it,
also changing some `` into $() to be more consistent with the rest of
the file.

Acked-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2025-06-25 12:43:29 +02:00
Ilya Maximets
83de251fa5 ipsec: libreswan: Remove old certs before importing new ones.
If started with --no-restart-ike-daemon, ovs-monitor-ipsec doesn't
clear the NSS database.  This is not a problem if the certificates do
not change while the monitor is down, because completely duplicate
entries cannot be added to the NSS database.  However, if the monitor
is stopped, certificates then change on disk, and the monitor is
started again, it will add new tunnel certificates alongside the old
ones and will fail to add the new CA certificate.  So, we'll end up
with multiple certificates for the same tunnel and the outdated CA
certificate.  This will not allow creating new connections as we'll
not be able to verify certificates of the new CA:

  # certutil -L -d sql:/var/lib/ipsec/nss

  Certificate Nickname             Trust Attributes
                                   SSL,S/MIME,JAR/XPI

  ovs_certkey_c04c352b             u,u,u
  ovs_cert_cacert                  CT,,
  ovs_certkey_c04c352b             u,u,u
  ovs_certkey_c04c352b             u,u,u
  ovs_certkey_c04c352b             u,u,u
  ovs_certkey_c04c352b             u,u,u
  ovs_certkey_c04c352b             u,u,u
  ovs_certkey_c04c352b             u,u,u

  pluto: "ovn-c04c35-0-out-1" #459: processing decrypted
   IKE_AUTH request containing SK{IDi,CERT,CERTREQ,IDr,AUTH,SA,
   TSi,TSr,N(USE_TRANSPORT_MODE)}
  pluto: "ovn-c04c35-0-out-1" #459: NSS: ERROR:
   IPsec certificate CN=c04c352b,OU=kind,O=ovnkubernetes,C=US invalid:
   SEC_ERROR_UNKNOWN_ISSUER: Peer's Certificate issuer is not recognized.
  pluto: "ovn-c04c35-0-out-1" #459: NSS: end certificate invalid

Fix that by always checking certificates in the NSS database before
importing the new one.  If they do not match, then remove the old
one from the NSS and add the new one.

We have to call deletion multiple times in order to remove all the
potential duplicates from previous runs.  This will be useful on
upgrade, but also may save us if one of the deletions ever fails for
any reason and we end up with a duplicate entry anyway.
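
The check-then-replace logic from the two paragraphs above might look like this in outline. The sketch models the NSS database as a Python list of (nickname, certificate) pairs instead of calling certutil; `refresh_cert()` and `delete_one()` are hypothetical helpers, and "one deletion removes one entry" is an assumption of the model.

```python
def delete_one(nss_db, nickname):
    """Remove a single entry with this nickname; report whether one existed."""
    for i, (name, _cert) in enumerate(nss_db):
        if name == nickname:
            del nss_db[i]
            return True
    return False


def refresh_cert(nss_db, nickname, new_cert):
    """Make `new_cert` the only entry for `nickname`; return True if changed."""
    existing = [cert for name, cert in nss_db if name == nickname]
    if existing == [new_cert]:
        # Unchanged certificate: leave it alone so that existing
        # connections are not disturbed.
        return False
    # Keep deleting until nothing is left, which also clears duplicates
    # accumulated by earlier runs.
    while delete_one(nss_db, nickname):
        pass
    nss_db.append((nickname, new_cert))
    return True


db = [
    ("ovs_cert_cacert", b"ca-v1"),
    ("ovs_certkey_c04c352b", b"cert-v1"),
    ("ovs_certkey_c04c352b", b"cert-v2"),
]
changed = refresh_cert(db, "ovs_certkey_c04c352b", b"cert-v3")
unchanged = refresh_cert(db, "ovs_certkey_c04c352b", b"cert-v3")
```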

One alternative might be to always clear the database, even if the
--no-restart-ike-daemon option is set, but there is a chance that
we'll refresh and ask to re-read secrets before we got all the tunnel
information from the database.  That may affect dataplane.  Even if
this is really not possible, the logic seems too far apart to rely on.
Also, Libreswan 4.6 seems to have some bug that prevents re-adding
deleted connections if we remove and re-add the same certificate
(newer versions don't have this issue), so it's better if we do not
touch certificates that didn't actually change if we're not restarting
the IKE daemon.

The clearing may seem redundant now, but it may still be useful to
clean up certificates for tunnels that disappeared while the monitor
was down.  Approach taken in this change doesn't cover this case.

Test is added to check the described scenario.  The 'on_exit' command
is converted to obtain the monitor PID at exit, since we're now killing
one monitor and starting another.

Fixes: fe5ff26a49f6 ("ovs-monitor-ipsec: Add option to not restart IKE daemon.")
Reported-at: https://issues.redhat.com/browse/FDP-1473
Acked-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2025-06-25 12:42:08 +02:00
Ilya Maximets
80d723736b cirrus: Update to FreeBSD 14.3 and 13.5.
13.5 was released a few months back and 14.3 just recently.  Older
point releases may become unavailable in gcloud in the near future.

Acked-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2025-06-23 22:08:50 +02:00
David Marchand
6090603703 netdev-dpdk: Remove limit on maximum descriptors count.
Using larger rxq can be beneficial in highly bursty setups.
Remove the artificial limit on the count of descriptors in rxq and txq.
The device driver will limit the values in any case.

Reported-at: https://issues.redhat.com/browse/FDP-1415
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2025-06-20 21:17:02 +02:00
Ilya Maximets
edecb74043 python: idl: Don't notify the application on _Server database updates.
_Server database is not managed by the user and is needed mostly for IDL
itself to see changes in the schema or cluster leadership.  However,
we're currently delivering notifications about changes in that database
confusing the application (the application didn't subscribe to this
database) and also we're increasing the change_seqno potentially
returning true for has_ever_connected() call even if we didn't really
get any real data yet or even connected to the right database.

In the tests these notifications can be seen as two events at the
beginning of every test with the notification enabled:

  000: event:create, row={}, uuid=<0>, updates=None
  000: event:create, row={}, uuid=<1>, updates=None

Tests only print the 'simple' table, so the content is omitted, but
the data is still there and the empty events are printed out.

We should not notify the application nor touch the change_seqno.
Tests updated accordingly.  Unfortunately, removing first two lines
from a test changes the numbers generated by the UUID filter, so the
rest of the test needs adjustments as well.
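
In outline, the intended behavior is that updates for the internal database are consumed without notifying the application or advancing the sequence number. The Python sketch below uses a made-up `ToyIdl` class, not ovs.db.idl; only the names `_Server`, `change_seqno` and `has_ever_connected()` come from the description above.

```python
class ToyIdl:
    """Minimal stand-in for an IDL that tracks two databases."""

    SERVER_DB = "_Server"

    def __init__(self, notify):
        self.change_seqno = 0
        self._notify = notify

    def process_update(self, db_name, event, row):
        if db_name == self.SERVER_DB:
            # Internal bookkeeping only: no notification and no seqno
            # bump, so the application never sees these updates.
            return
        self.change_seqno += 1
        self._notify(event, row)

    def has_ever_connected(self):
        return self.change_seqno != 0


events = []
idl = ToyIdl(notify=lambda event, row: events.append((event, row)))

idl.process_update("_Server", "create", {"leader": True})
connected_early = idl.has_ever_connected()

idl.process_update("Open_vSwitch", "create", {"name": "br0"})
connected_later = idl.has_ever_connected()
```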

Fixes: c39751e44539 ("python: Monitor Database table to manage lifecycle of IDL client.")
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2025-06-20 21:17:02 +02:00
David Marchand
ab062d3cb4 netdev-dpdk: Adjust IPv4 checksum capability for vhost-user.
If no L4 checksum can be requested, OVS may as well compute IPv4
checksum when needed.

This allows a small optimization where the whole preparation step can
be skipped on a batch when a (vhost-user) DPDK port has no offload
capability.

Signed-off-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2025-06-19 21:03:20 +02:00
David Marchand
dd443c1a7a netdev-dpdk: Stop relying on vhost-user Tx flags.
vhost-user legacy behavior has been to mark mbuf with Tx offload flags
based on what the virtio-net header contained (but provide no
Rx information, like IP checksum or L4 checksum validity).

Changing to the non legacy mode means that no code out of OVS should set
any RTE_MBUF_F_TX_* flag. Add a check accordingly.

Link: https://git.dpdk.org/dpdk/commit/?id=ca7036b4af3a
Reported-at: https://issues.redhat.com/browse/FDP-1147
Signed-off-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2025-06-19 21:03:15 +02:00
David Marchand
b8032fac2c dp-packet: Remove direct access to DPDK offloads.
Now that every use of ol_flags has been reworked, we can remove the helper
and the additional field in dp_packet when not building with DPDK.

Signed-off-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2025-06-19 21:03:12 +02:00
David Marchand
cf7b86db1f dp-packet: Rework TCP segmentation.
Rather than marking packets with both an offload flag and a segmentation
size, simply rely on the netdev implementation which sets a segmentation
size when appropriate.

Signed-off-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2025-06-19 21:03:09 +02:00
David Marchand
e36793e11f dp-packet: Resolve unknown checksums.
Now that IP and L4 checksum offloading don't require tweaking Tx flags,
update checksum status in parts of OVS that validate checksums (in case
of unknown status).

Signed-off-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2025-06-19 21:03:03 +02:00
David Marchand
2956a61265 dp-packet: Rework L4 checksum offloads.
The DPDK mbuf API specifies 4 statuses when it comes to L4 checksums:
- RTE_MBUF_F_RX_L4_CKSUM_UNKNOWN: no information about the RX L4 checksum
- RTE_MBUF_F_RX_L4_CKSUM_BAD: the L4 checksum in the packet is wrong
- RTE_MBUF_F_RX_L4_CKSUM_GOOD: the L4 checksum in the packet is valid
- RTE_MBUF_F_RX_L4_CKSUM_NONE: the L4 checksum is not correct in the packet
  data, but the integrity of the L4 data is verified.

Similarly to the IP checksum offloads API, revise OVS L4 offloads API.

No information about the L4 protocol is provided by any netdev-*
implementation, so OVS needs to mark this L4 protocol during flow
extraction.

Rename current API for consistency with dp_packet_(inner_)?l4_checksum_.

Signed-off-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2025-06-19 21:02:56 +02:00
David Marchand
3daf04a4c5 dp-packet: Rework IP checksum offloads.
As the packet traverses through OVS, offloading Tx flags must be carefully
evaluated and updated which results in a bit of complexity because of a
separate "outer" Tx offloading flag coming from DPDK API,
and a "normal"/"inner" Tx offloading flag.

On the other hand, the DPDK mbuf API specifies 4 statuses when it comes to
IP checksums:
- RTE_MBUF_F_RX_IP_CKSUM_UNKNOWN: no information about the RX IP checksum
- RTE_MBUF_F_RX_IP_CKSUM_BAD: the IP checksum in the packet is wrong
- RTE_MBUF_F_RX_IP_CKSUM_GOOD: the IP checksum in the packet is valid
- RTE_MBUF_F_RX_IP_CKSUM_NONE: the IP checksum is not correct in the
  packet data, but the integrity of the IP header is verified.

This patch changes OVS API so that OVS code only tracks the status of
the checksum of the "current" L3 header and leaves the Tx flags aspect
to the netdev-* implementations.

With this API, the flow extraction can be cleaned up.

During packet processing, OVS can simply look for the IP checksum validity
(either good, or partial) before changing some IP header, and then mark
the checksum as partial.

In the conntrack case, when natting packets, the checksum status of the
inner part (ICMP error case) must be forced temporarily as unknown
to force checksum resolution.

When tunneling comes into play, IP checksums status is bit-shifted for
future considerations in the processing if, for example, the tunnel
header gets decapsulated again, or in the netdev-* implementations that
support tunnel offloading.

Finally, netdev-* implementations only need to care about packets in
partial status: a good checksum does not need touching, a bad checksum
has been updated but kept as bad by OVS, and an unknown checksum either
belongs to an IPv6 packet or, if the packet was IPv4, OVS updated it
too (keeping it good or bad accordingly).
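
The status handling described in the last few paragraphs can be summarized as a small state model. This is a Python sketch with hypothetical names (`IpCsum`, `after_header_change()`, `netdev_must_fill_checksum()`), not the dp_packet API, and it only encodes the rules as stated in this message.

```python
from enum import Enum


class IpCsum(Enum):
    UNKNOWN = "unknown"
    BAD = "bad"
    GOOD = "good"
    PARTIAL = "partial"


def after_header_change(status):
    """Return (new status, whether software must update the checksum now).

    A good or partial checksum can simply be marked partial and fixed up
    later.  A bad or unknown one keeps its status and is updated in
    software along with the header change.
    """
    if status in (IpCsum.GOOD, IpCsum.PARTIAL):
        return IpCsum.PARTIAL, False
    return status, True


def netdev_must_fill_checksum(status):
    """Only partial checksums need work at transmit time."""
    return status is IpCsum.PARTIAL


good_after = after_header_change(IpCsum.GOOD)
bad_after = after_header_change(IpCsum.BAD)
```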

Rename current API for consistency with dp_packet_(inner_)?ip_checksum_.

Signed-off-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2025-06-19 21:00:54 +02:00
David Marchand
67abd51540 dp-packet: Rework tunnel offloads.
Rather than set bits in the mbuf ol_flags field, that only makes sense
for netdev-dpdk ports, mark packet for tunnel offload in OVS offloads
API.

While at it, since there is nothing really "hardware" related, rename
current API for consistency with dp_packet_tunnel_ prefix.

Signed-off-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2025-06-19 21:00:48 +02:00
David Marchand
e2200485c5 dp-packet: Expand offloads preparation helper.
Expand this helper to clearly separate the non tunnel case from the
tunnel one. This will make later changes easier to read.

Signed-off-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2025-06-19 21:00:45 +02:00
David Marchand
d29ba0abdc dp-packet: Add OVS offloading API.
As a preparation for tracking inner checksums, separate Rx checksum
status from the DPDK ol_flags field.
To minimize the cost of translating from DPDK API to OVS API, simply map
OVS flags to DPDK Rx mbuf flags.

Signed-off-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2025-06-19 21:00:34 +02:00
David Marchand
19ef1b1f0f dp-packet: Remove DPDK specific IP version.
Flagging packets with IP version is only needed at the netdev-dpdk level.

In most cases, OVS is already inspecting the IP header in packet data,
so maintaining such IP version metadata won't save many cycles
(given the cost of additional branches necessary for handling
outer/inner flags).

Clean up OVS shared code and only set these flags in netdev-dpdk.c.

Signed-off-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2025-06-19 20:59:22 +02:00
David Marchand
52fdeda11a dp-packet: Remove Linux specific L4 offloads.
As the virtio-net offload API is used for netdev-linux ports, but
provides no information about the potentially encapsulated protocol
concerned by a checksum request, specific information from this netdev-
specific implementation is propagated into OVS code, and must be
carefully evaluated when some tunnel gets decapsulated.

This induces a cost in "normal" processing path, while the netdev-linux
path is not performance critical.

This patch removes such specific information, yet tries harder to parse
the packet on the Rx side and sets offload flags accordingly for
non-encapsulated traffic. For encapsulated traffic, the inner
checksum is computed.

Signed-off-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2025-06-19 20:59:04 +02:00
Terry Wilson
a86ae3c865 python: Add uuid/convert references to uuid for Row.__str__.
Row stringification happens a lot in client logs and it is far
more useful to have the logged Row's uuid printed. This also
adds converting referenced Row objects, and references within
set and map columns to UUIDs.

Signed-off-by: Terry Wilson <twilson@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2025-06-19 12:37:29 +02:00
Mark Michelson
8ee7ecb8a2 db-ctl-base: Allow retrieving rows of type OVSDB_TYPE_UUID.
The ctl_get_row() function attempts to match a user-provided string to a
particular database row. This works by comparing the user-provided
string to the values of columns provided by the ctl utility (e.g.
ovs-vsctl).

Before this commit, this comparison could only be made for columns of
type OVSDB_TYPE_INTEGER and OVSDB_TYPE_STRING. If a ctl utility provided
a column of a different type, then db-ctl-base.c would assert in
get_row_by_id().

This commit enhances the ability of ctl_get_row() to also retrieve rows
based on columns of type OVSDB_TYPE_UUID. The user-provided string is
converted to a UUID and compared against the column's value. If it
matches, then the row matches.

Acked-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Mark Michelson <mmichels@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2025-06-19 12:36:29 +02:00
Aaron Conole
8a1a0ea7c0 AUTHORS: Add Changliang Wu.
Signed-off-by: Aaron Conole <aconole@redhat.com>
2025-06-13 14:09:11 -04:00
Changliang Wu
aea4734299 lldp: Fix out of bound write in chassisid_to_string.
snprintf() always writes a terminating \0 at the end of the string,
so the last byte lands out of bounds.

Create a new function, ds_put_hex_with_delimiter, to use instead of
chassisid_to_string and format_hex_arg.

Found in a sanitizer test.
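
The off-by-one can be sketched with a standalone formatter that accounts
for the terminating \0 when sizing the output. This is only an
illustration: hex_with_delimiter() is a hypothetical helper working on a
fixed buffer, not the ds_put_hex_with_delimiter function added by the
patch, whose signature is not shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: formats 'n' bytes from 'buf' as two-digit hex
 * values separated by 'delim' into 'out', which holds 'out_size' bytes.
 * Returns the length of the resulting string, or -1 if 'out' is too
 * small. */
int
hex_with_delimiter(const uint8_t *buf, size_t n, char delim,
                   char *out, size_t out_size)
{
    /* 2 digits per byte, n - 1 delimiters, plus 1 byte for the '\0'. */
    size_t needed = n ? n * 3 : 1;
    size_t pos = 0;

    assert(out != NULL);
    if (out_size < needed) {
        return -1;
    }
    out[0] = '\0';
    for (size_t i = 0; i < n; i++) {
        if (i > 0) {
            out[pos++] = delim;
        }
        /* snprintf() writes two digits and a '\0'; the size argument is
         * the space that is really left, so it cannot overrun 'out'. */
        snprintf(out + pos, out_size - pos, "%02x", (unsigned int) buf[i]);
        pos += 2;
    }
    return (int) pos;
}
```

A three-byte ID therefore needs a nine-byte buffer; an eight-byte
buffer, which has room for the visible characters only, is rejected
instead of being overrun.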

Signed-off-by: Changliang Wu <changliang.wu@smartx.com>
Signed-off-by: Aaron Conole <aconole@redhat.com>
2025-06-13 14:06:55 -04:00
Mike Pattrick
614029aac0 conntrack: Allow inner NAT of related fragments.
Currently conntrack will refuse to extract metadata from fragmented
IPv4 packets. Usually the fragments would be processed by the ipf
module, but this isn't the case for ICMP related packets. The current
handling will result in these being incorrectly processed.

This patch checks for a frag offset instead of just frag flags, which is
similar to how conntrack handles fragments in the kernel.
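
One reading of the difference between the two checks, as a sketch. The
constants are the standard IPv4 "more fragments" bit and fragment offset
mask; the function names are hypothetical, and the fragment field is
assumed to be already converted to host byte order.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_IP_MORE_FRAGMENTS 0x2000 /* "More fragments" flag. */
#define SKETCH_IP_FRAG_OFF_MASK  0x1fff /* Fragment offset, 8-byte units. */

/* Flags-based check: true for any fragment, including the first one,
 * which has offset 0 but the "more fragments" bit set. */
bool
sketch_is_fragment(uint16_t frag_off)
{
    return (frag_off & (SKETCH_IP_MORE_FRAGMENTS
                        | SKETCH_IP_FRAG_OFF_MASK)) != 0;
}

/* Offset-based check: true only for later fragments, which do not carry
 * the start of the payload. */
bool
sketch_is_later_fragment(uint16_t frag_off)
{
    return (frag_off & SKETCH_IP_FRAG_OFF_MASK) != 0;
}
```

Under this reading, a first fragment (0x2000) is rejected by the flags
check but accepted by the offset check, so its headers can still be
parsed.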

Reported-at: https://issues.redhat.com/browse/FDP-136
Reported-by: Ales Musil <amusil@redhat.com>
Fixes: a489b16854b5 ("conntrack: New userspace connection tracker.")
Signed-off-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Aaron Conole <aconole@redhat.com>
2025-06-13 14:06:07 -04:00
Eelco Chaudron
ca9e67c801 daemon-unix: Handle potential negative values from sysconf().
Coverity reports that daemon_set_new_user() may receive a large
unsigned value from get_sysconf_buffer_size(), due to sysconf()
returning -1 and being cast to size_t.

Although this would likely lead to an allocation failure and abort,
it's better to handle the error in place.
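
A minimal sketch of the problem and one way to handle it, assuming a
fallback size is acceptable; the actual patch may handle the error
differently. The helper names and the 1024-byte fallback are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: converts a sysconf()-style result into a buffer
 * size.  sysconf() returns -1 on error or for an indeterminate limit;
 * casting that straight to size_t would yield a huge value, so
 * 'fallback' is used for any non-positive result. */
size_t
buffer_size_from_sysconf(long value, size_t fallback)
{
    return value > 0 ? (size_t) value : fallback;
}

/* Example use for sizing a getpwnam_r() buffer. */
size_t
passwd_buffer_size(void)
{
    return buffer_size_from_sysconf(sysconf(_SC_GETPW_R_SIZE_MAX), 1024);
}
```

The check happens on the signed value, before any conversion to
size_t, which is the point where the error is still recognizable.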

Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
2025-06-12 15:28:31 +02:00
Eelco Chaudron
99af7f3791 ovsdb: Fix Coverity leak warning by marking code as unreachable.
Coverity reports a memory leak on the 'error' variable in
ovsdb_trigger_try(). However, this code path is unreachable due to an
ovs_assert() in an earlier function call.

To make this clear to Coverity and silence the warning, the section is
explicitly marked as unreachable.

Acked-by: Mike Pattrick <mkp@redhat.com>
Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
2025-06-10 17:07:06 +02:00
Eelco Chaudron
2c634482f2 raft: Fix resource leak from ignored ovsdb_log_write_and_free() error.
The Raft codebase includes calls to ovsdb_log_write_and_free() that
are incorrectly wrapped in ignore(). This causes potential error
resources to be leaked.

These calls should be wrapped in ovsdb_error_destroy() instead, to
ensure that any returned error objects are properly freed and do not
result in memory leaks.
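
The ownership pattern can be sketched with stand-in types. Everything
below is hypothetical and only mimics the convention that a function
returns NULL on success or a heap-allocated error the caller must free;
it is not the ovsdb error API.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a heap-allocated error object. */
struct sketch_error {
    char *message;
};

/* Number of error objects currently allocated, for illustration only. */
int sketch_live_errors = 0;

/* Returns NULL on success, or a new error object owned by the caller. */
struct sketch_error *
sketch_write(int fail)
{
    struct sketch_error *error;

    if (!fail) {
        return NULL;
    }
    error = malloc(sizeof *error);
    assert(error != NULL);
    error->message = malloc(sizeof "write failed");
    assert(error->message != NULL);
    strcpy(error->message, "write failed");
    sketch_live_errors++;
    return error;
}

/* Frees 'error' and its message.  Accepts NULL, so it can wrap a call
 * directly whether or not that call failed. */
void
sketch_error_destroy(struct sketch_error *error)
{
    if (error) {
        free(error->message);
        free(error);
        sketch_live_errors--;
    }
}
```

Writing sketch_error_destroy(sketch_write(1)) leaves the counter at
zero, whereas discarding the return value would leave one object
allocated.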

Fixes: 1b1d2e6daa56 ("ovsdb: Introduce experimental support for clustered databases.")
Acked-by: Mike Pattrick <mkp@redhat.com>
Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
2025-06-10 17:05:37 +02:00
Eelco Chaudron
b90304bfe7 ovsdb-server: Fix potential memory leak in parse_options().
When duplicate --config-file command-line arguments are passed,
the resources for the previously specified file path were not freed.

This fix ensures unused resources are properly freed while
preserving the existing behavior of using the last configuration
file path specified.
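
A sketch of the "last one wins, earlier ones are freed" behavior,
using a hypothetical helper rather than the actual parse_options()
code:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: stores a copy of 'value' in '*slot', freeing
 * whatever the slot held before.  '*slot' must be NULL or a pointer
 * previously stored by this function. */
void
replace_string_option(char **slot, const char *value)
{
    char *copy = malloc(strlen(value) + 1);

    assert(copy != NULL);
    strcpy(copy, value);
    free(*slot);        /* free(NULL) is a no-op on first use. */
    *slot = copy;
}
```

Calling it once per occurrence of the option keeps only the last path
and releases the earlier copies.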

Acked-by: Aaron Conole <aconole@redhat.com>
Acked-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
2025-06-10 17:04:49 +02:00
Eelco Chaudron
d1bd62dae5 ofproto-dpif-upcall: Check odp_tun_key_from_attr() return value.
In the IPFIX and flow sample upcall handling, check the validity
of the tunnel key returned by odp_tun_key_from_attr(). If the
tunnel key is invalid, return an error.

This was reported by Coverity, but the change also improves
robustness and avoids undefined behavior in the case of malformed
tunnel attributes.

Fixes: 8b7ea2d48033 ("Extend OVS IPFIX exporter to export tunnel headers")
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Aaron Conole <aconole@redhat.com>
2025-06-10 17:04:09 +02:00
Eelco Chaudron
88737f02ed ofproto-dpif-xlate: Fix memory leak in xlate_generic_encap_action().
This is not a real issue, as the initializer function,
rewrite_flow_push_nsh(), ensures it returns NULL on error.
However, cleaning this up improves code clarity and resolves
a Coverity warning about a potential memory leak.

Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
2025-06-10 17:03:39 +02:00
Eelco Chaudron
8fca3f99cf lldp: Fix Coverity warning about resource leak in lldp test.
Coverity reported a potential resource leak in the LLDP test code.
While this condition should never occur in practice, since the test
would crash on out-of-memory, the warning is addressed by ensuring
the cleanup function is called on error paths.

Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
2025-06-10 17:02:58 +02:00
Ales Musil
d283829477 sparse: Define new AVX10 includes added in GCC >= 15.
GCC >= 15 added new AVX10 header files.  Add defines for them, as
sparse is not able to understand the new types in those headers.  This
can be seen with DPDK headers.

Tested-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ales Musil <amusil@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2025-06-03 18:48:45 +02:00
Ales Musil
0e419d1b4f sparse: Add workaround for OpenSSL configuration.
sparse fails to process the OpenSSL configuration header file in recent
OpenSSL versions (3.2.x).  Add a workaround header that disables
the problematic macro.

Signed-off-by: Ales Musil <amusil@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2025-06-02 17:30:02 +02:00
Ilya Maximets
8224cd47f3 tests: tunnel-push-pop: Fix occasional failure of the drop test.
Datapath port zero is normally taken by the 'datapath interface', i.e.
the ovs-dummy interface.  This makes it impossible to allocate port
zero for the p0 interface.  So, it will race with p1 for the number 1.
If p0 happens to be created first, it will take the 1 and p1 will get
the port 2 and then the test passes.  However, if p1 is created first,
then it will take the 1 and p0 will take the 2.  In this case the
test fails as the port name in the trace will be different.

Use '--names' to avoid this problem, but also fix the port numbers and
use the 'add_of_ports' macro instead of plain-coding the port addition.
The macro would've made the issue more obvious in the first place.

Fixes: 1015b13f054d ("ofproto-dpif-xlate: Add a drop action for native tunnel failure.")
Acked-by: Eli Britstein <elibr@nvidia.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2025-06-02 16:03:58 +02:00
David Marchand
e99ce7d5df flow: Fix checksum offloads with simple match.
Packets with L4 partial status for a simple match flow would not get L4
checksum offloads applied.

This was not caught in unit tests, because packets from netdev-dummy
(calling miniflow_extract) would get Tx flags set early, before
parse_tcp_flags() got called during packet processing.

Signed-off-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2025-05-30 18:00:56 +02:00
Kevin Traynor
48ce3a5a52 dpdk: Use DPDK 24.11.2 release.
Update the CI and docs to use DPDK 24.11.2.

Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: Aaron Conole <aconole@redhat.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
2025-05-30 16:48:44 +01:00
Roi Dayan
b42f9fde4a netdev-dpdk: Fix possible memory leak in vhost stats.
On an error condition, the allocated structs need to be released.

Reported by Coverity.

Fixes: 3b29286db1c5 ("netdev-dpdk: Add per virtqueue statistics.")
Signed-off-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
2025-05-30 14:22:23 +01:00