mirror of https://github.com/openvswitch/ovs synced 2025-08-31 06:15:47 +00:00
162 Commits

Author SHA1 Message Date
David Marchand
2956a61265 dp-packet: Rework L4 checksum offloads.
The DPDK mbuf API specifies four statuses when it comes to L4 checksums:
- RTE_MBUF_F_RX_L4_CKSUM_UNKNOWN: no information about the RX L4 checksum
- RTE_MBUF_F_RX_L4_CKSUM_BAD: the L4 checksum in the packet is wrong
- RTE_MBUF_F_RX_L4_CKSUM_GOOD: the L4 checksum in the packet is valid
- RTE_MBUF_F_RX_L4_CKSUM_NONE: the L4 checksum is not correct in the packet
  data, but the integrity of the L4 data is verified.
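
The four statuses listed above can be pictured as a two-bit field. The
following is a minimal sketch only: the SKETCH_* names and bit positions are
hypothetical and do not reproduce the real RTE_MBUF_F_RX_L4_CKSUM_* values,
which are defined by DPDK.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit positions, for illustration only. */
#define SKETCH_L4_CKSUM_BAD   (UINT64_C(1) << 0)
#define SKETCH_L4_CKSUM_GOOD  (UINT64_C(1) << 1)
#define SKETCH_L4_CKSUM_MASK  (SKETCH_L4_CKSUM_BAD | SKETCH_L4_CKSUM_GOOD)

enum sketch_l4_status {
    SKETCH_L4_UNKNOWN,  /* No information about the checksum. */
    SKETCH_L4_BAD,      /* Checksum in the packet is wrong. */
    SKETCH_L4_GOOD,     /* Checksum in the packet is valid. */
    SKETCH_L4_NONE,     /* Checksum not correct in the packet data, but
                         * the integrity of the L4 data is verified. */
};

/* Decodes the two checksum bits of 'ol_flags' into one of four statuses. */
static enum sketch_l4_status
sketch_l4_status_from_flags(uint64_t ol_flags)
{
    switch (ol_flags & SKETCH_L4_CKSUM_MASK) {
    case SKETCH_L4_CKSUM_BAD:
        return SKETCH_L4_BAD;
    case SKETCH_L4_CKSUM_GOOD:
        return SKETCH_L4_GOOD;
    case SKETCH_L4_CKSUM_MASK:
        return SKETCH_L4_NONE;   /* Both bits set. */
    default:
        return SKETCH_L4_UNKNOWN;
    }
}
```

Using both bits together for the fourth status keeps the field at two bits
while still distinguishing "unknown" (no bits) from "none" (both bits).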

Similarly to the IP checksum offloads API, revise OVS L4 offloads API.

No information about the L4 protocol is provided by any netdev-*
implementation, so OVS needs to mark this L4 protocol during flow
extraction.

Rename current API for consistency with dp_packet_(inner_)?l4_checksum_.

Signed-off-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2025-06-19 21:02:56 +02:00
David Marchand
3daf04a4c5 dp-packet: Rework IP checksum offloads.
As the packet traverses through OVS, offloading Tx flags must be carefully
evaluated and updated which results in a bit of complexity because of a
separate "outer" Tx offloading flag coming from DPDK API,
and a "normal"/"inner" Tx offloading flag.

On the other hand, the DPDK mbuf API specifies four statuses when it comes
to IP checksums:
- RTE_MBUF_F_RX_IP_CKSUM_UNKNOWN: no information about the RX IP checksum
- RTE_MBUF_F_RX_IP_CKSUM_BAD: the IP checksum in the packet is wrong
- RTE_MBUF_F_RX_IP_CKSUM_GOOD: the IP checksum in the packet is valid
- RTE_MBUF_F_RX_IP_CKSUM_NONE: the IP checksum is not correct in the
  packet data, but the integrity of the IP header is verified.

This patch changes the OVS API so that OVS code only tracks the status of
the checksum of the "current" L3 header and leaves the Tx flags aspect to
the netdev-* implementations.

With this API, the flow extraction can be cleaned up.

During packet processing, OVS can simply look for the IP checksum validity
(either good, or partial) before changing some IP header, and then mark
the checksum as partial.
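
The "check validity, then mark as partial" step described above could look
roughly like the sketch below. The enum and function names are hypothetical
and are not the dp_packet API; only the good/partial/bad/unknown states come
from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum sketch_ip_csum {
    SKETCH_IP_UNKNOWN,
    SKETCH_IP_BAD,
    SKETCH_IP_GOOD,
    SKETCH_IP_PARTIAL,
};

/* Called before modifying an IP header.  If the checksum is known to be
 * valid (good or partial), marks it partial and returns true: the final
 * checksum can be resolved later.  Otherwise returns false and leaves the
 * status untouched, so the caller has to update the checksum in software. */
static bool
sketch_ip_csum_defer(enum sketch_ip_csum *status)
{
    assert(status != NULL);

    if (*status == SKETCH_IP_GOOD || *status == SKETCH_IP_PARTIAL) {
        *status = SKETCH_IP_PARTIAL;
        return true;
    }
    return false;
}
```

A caller would then rewrite the header fields and recompute the checksum in
software only when the function returns false.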

In the conntrack case, when natting packets, the checksum status of the
inner part (ICMP error case) must be forced temporarily as unknown
to force checksum resolution.

When tunneling comes into play, IP checksums status is bit-shifted for
future considerations in the processing if, for example, the tunnel
header gets decapsulated again, or in the netdev-* implementations that
support tunnel offloading.
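
One way to picture the bit-shifting mentioned above is a small field that
holds the status of the current header in its low bits and the status of the
encapsulated header in the next bits. The layout and names below are
assumptions made for illustration, not the layout OVS actually uses.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_CSUM_BITS 2u    /* Bits per checksum status. */
#define SKETCH_CSUM_MASK 0x3u  /* Mask for one status. */

/* Assumed layout: bits 0-1 hold the status of the current (outermost)
 * header, bits 2-3 hold the status of the inner header. */

/* On encapsulation the current status becomes the inner one and the new
 * outer header gets 'outer_status'. */
static uint8_t
sketch_csum_on_encap(uint8_t state, uint8_t outer_status)
{
    return (uint8_t) (((state & SKETCH_CSUM_MASK) << SKETCH_CSUM_BITS)
                      | (outer_status & SKETCH_CSUM_MASK));
}

/* On decapsulation the inner status becomes the current one again. */
static uint8_t
sketch_csum_on_decap(uint8_t state)
{
    return (uint8_t) ((state >> SKETCH_CSUM_BITS) & SKETCH_CSUM_MASK);
}
```

With this layout, encapsulating and then decapsulating a packet returns the
original status unchanged.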

Finally, netdev-* implementations only need to care about packets in
partial status: a good checksum does not need touching, a bad checksum
has been updated but kept as bad by OVS, and an unknown checksum belongs
either to an IPv6 packet or to an IPv4 packet that OVS updated too
(keeping it good or bad accordingly).

Rename current API for consistency with dp_packet_(inner_)?ip_checksum_.

Signed-off-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2025-06-19 21:00:54 +02:00
Mike Pattrick
614029aac0 conntrack: Allow inner NAT of related fragments.
Currently conntrack will refuse to extract metadata from fragmented
IPv4 packets. Usually the fragments would be processed by the ipf
module, but this isn't the case for ICMP related packets. The current
handling will result in these being incorrectly processed.

This patch checks for a frag offset instead of just frag flags, which is
similar to how conntrack handles fragments in the kernel.
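
The distinction between fragment flags and fragment offset can be sketched as
follows. The masks are the standard IPv4 values (IP_MF and IP_OFFMASK); the
function names are hypothetical and the field is assumed to be already
converted to host byte order.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_IP_MF       0x2000u  /* "More fragments" flag. */
#define SKETCH_IP_OFFMASK  0x1fffu  /* Fragment offset, in 8-byte units. */

/* True for any fragment, including the first one. */
static bool
sketch_is_fragment(uint16_t frag_off)
{
    return (frag_off & (SKETCH_IP_MF | SKETCH_IP_OFFMASK)) != 0;
}

/* True only for fragments with a non-zero offset.  The first fragment
 * (offset 0, MF set) still carries the L4 header and is not matched. */
static bool
sketch_is_later_fragment(uint16_t frag_off)
{
    return (frag_off & SKETCH_IP_OFFMASK) != 0;
}
```

Checking the offset alone still allows metadata to be extracted from a first
fragment, which a check on the flags would reject.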

Reported-at: https://issues.redhat.com/browse/FDP-136
Reported-by: Ales Musil <amusil@redhat.com>
Fixes: a489b16854 ("conntrack: New userspace connection tracker.")
Signed-off-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Aaron Conole <aconole@redhat.com>
2025-06-13 14:06:07 -04:00
David Marchand
71f3dd3e9c conntrack: Fix embedded checksums in ICMP errors.
Helpers like packet_set_ipv4() reset IP csum flags.
Inspecting and natting embedded payload in an ICMP error is thus broken
if the "outer" IP header had some Rx checksum flags that made it
eligible to Tx IP checksum.
Reset temporarily any Tx checksum to force those helpers to resolve the
checksums.

Acked-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2025-05-21 23:14:42 +02:00
David Marchand
4b00509ea1 conntrack: Do not validate already checked checksum.
Bad packets were still being validated in software when entering
conntrack.  Trust decision taken wrt IP checksum offloading (checking
dp_packet_hwol_l3_csum_ipv4_ol()) and avoid revalidating a known
bad checksum.

While at it, add coverage counters so that checksum validation impact
can be monitored, and unit tests.

Acked-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2025-05-21 22:05:48 +02:00
Mike Pattrick
484208bd17 ipf: Maintain packet zone and direction.
Currently ipf will inject completed fragments into the first available
batch. In almost all cases, this is the batch which contained the last
fragment of the packet. However, in cases where the batch is full the
packets are added to whatever random subsequent batch arrives to
conntrack. This could result in packets being processed incorrectly, for
example some completed frags may be inserted into a batch from the
interface that they should have been destined for.

This patch verifies the zone matches, and that the batch contains a
packet of the same in_port as the completed fragments.

Reported-at: https://issues.redhat.com/browse/FDP-1052
Fixes: 4ea96698f6 ("Userspace datapath: Add fragmentation handling.")
Signed-off-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Aaron Conole <aconole@redhat.com>
2025-04-17 16:51:46 -04:00
Aaron Conole
4c5c1aa9f9 conntrack: Fix Windows build due to ternary syntax extension.
In the cited commit a ternary using syntax extension slipped in.

The extension allows omitting the second operand and it is not
supported by MSVC resulting in a build failure.

Fix it by simply specifying the second operand.
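
For illustration, the portable form looks like the sketch below. The function
and parameter names are hypothetical and are not taken from the conntrack
code.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 'limit' if it is non-zero, otherwise 'fallback'. */
static uint32_t
sketch_limit_or_default(uint32_t limit, uint32_t fallback)
{
    /* GNU extension, not accepted by MSVC:
     *     return limit ?: fallback;
     * Standard C spells out the second operand: */
    return limit ? limit : fallback;
}
```

The two forms differ only in that the extension evaluates the first operand
once, which does not matter for a plain variable.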

Fixes: b57c1da5c3 ("conntrack: Use a per zone default limit.")
Reported-by: Ilya Maximets <i.maximets@ovn.org>
[Paolo: added commit message]
Co-authored-by: Paolo Valerio <pvalerio@redhat.com>
Signed-off-by: Paolo Valerio <pvalerio@redhat.com>
Signed-off-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
2024-10-14 12:11:51 +02:00
Paolo Valerio
b57c1da5c3 conntrack: Use a per zone default limit.
Before this change the default limit, instead of being considered
per-zone, was considered as a global value that every new entry was
checked against during the creation. This was not the intended
behavior as the default limit should be inherited by each zone instead
of being an aggregate number.

This change corrects that by removing the default limit from the cmap
and making it global (atomic). Now, whenever a new connection needs to
be committed, if default_zone_limit is set and the entry for the zone
doesn't exist, a new entry for that zone is lazily created, marked as
default. All subsequent packets for that zone will undergo the regular
lookup process.
To distinguish between default and user-defined entries, the storage
for the limit member of struct conntrack_zone_limit has been changed
from a 32-bit unsigned integer to a 64-bit signed integer. The
negative value ZONE_LIMIT_CONN_DEFAULT now indicates a default entry.
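
A minimal sketch of the sentinel idea, assuming a value of -1 for the default
marker. The struct, the macro value and the helper names below are
hypothetical; only the use of a negative value in a signed 64-bit member
comes from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed sentinel value; any negative number fits the description. */
#define SKETCH_ZONE_LIMIT_DEFAULT INT64_C(-1)

struct sketch_zone_limit {
    int32_t zone;
    int64_t limit;  /* >= 0: user-defined limit; < 0: default entry. */
};

static bool
sketch_zone_limit_is_default(const struct sketch_zone_limit *zl)
{
    return zl->limit < 0;
}

/* Returns the limit to enforce for 'zl': the global default for default
 * entries, the stored value otherwise. */
static uint32_t
sketch_zone_limit_effective(const struct sketch_zone_limit *zl,
                            uint32_t default_limit)
{
    return sketch_zone_limit_is_default(zl)
           ? default_limit
           : (uint32_t) zl->limit;
}
```

Widening the member to 64 bits keeps the whole 32-bit unsigned range
available for user-defined limits while leaving room for the marker.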

Operations such as creation/deletion are modified accordingly taking
into account this new behavior.

Worth noting that OVS_REQUIRES(ct->ct_lock) is not a strict
requirement for zone_limit_lookup_or_default(), however since the
function operates under the lock and it can create an entry in the
slow path, the lock requirement is enforced in order to make thread
safety checks work. The function can still be moved outside the
creation lock or any lock, keeping the fastpath lockless (turning
zone_limit_lookup_protected() to its unprotected version) and locking
only in the slow path (replacing zone_limit_create__() with
zone_limit_create__()).

The patch also extends `conntrack - limit by zone` test in order to
check the behavior, and while at it, update test's packet-out to use
compose-packet function.

Fixes: a7f33fdbfb ("conntrack: Support zone limits.")
Reported-at: https://issues.redhat.com/browse/FDP-122
Reported-by: Ilya Maximets <i.maximets@ovn.org>
Signed-off-by: Paolo Valerio <pvalerio@redhat.com>
Signed-off-by: Aaron Conole <aconole@redhat.com>
2024-10-09 10:55:18 -04:00
Paolo Valerio
41f3f5b902 conntrack: Turn zl local limit into atomic.
While at it, change the struct zone_limit initialization in
zone_limit_create() to use atomic init operations instead of
relying on memset() which, although it correctly initializes the struct,
is semantically not aware of atomics.

Signed-off-by: Paolo Valerio <pvalerio@redhat.com>
Signed-off-by: Aaron Conole <aconole@redhat.com>
2024-10-09 10:55:18 -04:00
Paolo Valerio
8ff40f3358 conntrack: Do not use atomics to report zones info.
Atomics are not needed when reporting zone limits.
Remove the restriction by defining a non-atomic common structure
to report such data.
The change also accesses atomics using the related operations,
reporting only the fields required by the requesting
level instead of relying on struct copy.

Signed-off-by: Paolo Valerio <pvalerio@redhat.com>
Signed-off-by: Aaron Conole <aconole@redhat.com>
2024-10-09 10:55:18 -04:00
Paolo Valerio
8ec7d55bfc conntrack: Add zone limit coverage counter.
Similarly to what is done for conntrack_full, add
conntrack_zone_full, which is increased when new entries are not added
due to reaching the zone limit.

Signed-off-by: Paolo Valerio <pvalerio@redhat.com>
Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Aaron Conole <aconole@redhat.com>
2024-10-09 10:55:18 -04:00
Aaron Conole
6c3074686a conntrack: Disambiguate the cleaned count log.
After 3d9c1b855a ("conntrack: Replace timeout based expiration lists with rculists.")
the conntrack cleanup log reports the number of connections it checked
rather than the number of connections it cleaned.  This patch includes the
count of connections cleaned during expiration sweeping.

Reported-by: Cheng Li <lic121@chinatelecom.cn>
Suggested-by: Cheng Li <lic121@chinatelecom.cn>
Fixes: 3d9c1b855a ("conntrack: Replace timeout based expiration lists with rculists.")
Acked-by: Simon Horman <horms@ovn.org>
Signed-off-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
2024-09-11 15:34:39 +02:00
Mike Pattrick
3833506db0 conntrack: Fully initialize conn struct before insertion.
In case packets are concurrently received in both directions, there's
a chance that the ones in the reverse direction get received right
after the connection gets added to the connection tracker but before
some of the connection's fields are fully initialized.
This could cause OVS to access potentially invalid or uninitialized
memory, as the lookup may end up retrieving the wrong offsets during
CONTAINER_OF().

This may happen in case of regular NAT or all-zero SNAT.

Fix it by initializing the connection's fields early.

Fixes: 1116459b3b ("conntrack: Remove nat_conn introducing key directionality.")
Reported-at: https://issues.redhat.com/browse/FDP-616
Acked-by: Simon Horman <horms@ovn.org>
Signed-off-by: Mike Pattrick <mkp@redhat.com>
Co-authored-by: Paolo Valerio <pvalerio@redhat.com>
Signed-off-by: Paolo Valerio <pvalerio@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2024-05-13 21:00:58 +02:00
Xavier Simonart
4989dc7e0e conntrack: Do not use {0} to initialize unions.
In the following case:
    union ct_addr {
        unsigned int ipv4;
        struct in6_addr ipv6;
    };
    union ct_addr zero_ip = {0};

The ipv6 field might not be properly initialized.
For instance, clang 18.1.1 does not initialize the ipv6 field.
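
A common way to avoid depending on how `{0}` treats the larger member is to
clear every byte explicitly. The sketch below uses a stand-in 16-byte struct
instead of struct in6_addr so that it has no platform dependencies; all names
are hypothetical and this is not necessarily the approach the patch takes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct sketch_in6 {
    unsigned char bytes[16];
};

union sketch_ct_addr {
    unsigned int ipv4;
    struct sketch_in6 ipv6;
};

/* Clears the whole union, regardless of which member is the first one. */
static void
sketch_ct_addr_zero(union sketch_ct_addr *addr)
{
    memset(addr, 0, sizeof *addr);
}

/* Returns 1 if every byte of 'addr' is zero, 0 otherwise. */
static int
sketch_ct_addr_is_zero(const union sketch_ct_addr *addr)
{
    const unsigned char *p = (const unsigned char *) addr;

    for (size_t i = 0; i < sizeof *addr; i++) {
        if (p[i]) {
            return 0;
        }
    }
    return 1;
}
```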

Reported-at: https://issues.redhat.com/browse/FDP-608
Acked-by: Paolo Valerio <pvalerio@redhat.com>
Signed-off-by: Xavier Simonart <xsimonar@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2024-05-13 20:54:53 +02:00
Felix Huettner
139b564dbd conntrack: Key connections by zone.
Currently conntrack uses a single large cmap for all connections stored.
This cmap contains all connections for all conntrack zones which are
completely separate from each other. By separating each zone to its own
cmap we can significantly optimize the performance when using multiple
zones.

The change fixes a similar issue as [1] where slow conntrack zone flush
operations significantly slow down OVN router failover. The difference is
just that this fix is used with DPDK, while [1] was when using the OVS
kernel module.

As we now need to store more cmaps the memory usage of struct conntrack
increases by 524280 bytes. Additionally we need 65535 cmaps with 128
bytes each. This leads to a total memory increase of around 10MB.
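
The figures above can be checked with a short calculation. The sketch assumes
that the 524280 bytes correspond to 65535 pointers of 8 bytes each, which is
an interpretation and is not stated in the commit message.

```c
#include <assert.h>
#include <stdint.h>

/* Extra memory for 'n_zones' per-zone cmaps: one pointer slot in struct
 * conntrack plus one cmap allocation per zone. */
static uint64_t
sketch_per_zone_overhead(uint64_t n_zones, uint64_t ptr_size,
                         uint64_t cmap_size)
{
    return n_zones * ptr_size + n_zones * cmap_size;
}
```

With 65535 zones, 8-byte pointers and 128-byte cmaps this gives
524280 + 8388480 = 8912760 bytes, roughly 8.9 MB, which is in line with the
"around 10MB" estimate.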

Running "./ovstest test-conntrack benchmark 4 33554432 32 1" shows no
real difference in the multithreading behaviour against a single zone.

Running the new "./ovstest test-conntrack benchmark-zones" shows
significant speedups as shown below. The values for "ct execute" are for
acting on the complete zone with all its entries in total (so in the
first case adding 10,000 new conntrack entries). All tests are run 1000
times.

When running with 1,000 zones with 10,000 entries each we see the
following results (all in microseconds):
"./ovstest test-conntrack benchmark-zones 10000 1000 1000"

                         +------+--------+---------+---------+
                         |  Min |   Max  |  95%ile |   Avg   |
+------------------------+------+--------+---------+---------+
| ct execute (commit)    |      |        |         |         |
|            with commit | 2266 |   3505 | 2707.06 | 2592.06 |
|         without commit | 2411 |  12730 | 4432.50 | 2736.78 |
+------------------------+------+--------+---------+---------+
| ct execute (no commit) |      |        |         |         |
|            with commit |  699 |   1238 |  886.15 |  722.67 |
|         without commit |  700 |   3377 | 1934.42 |  803.53 |
+------------------------+------+--------+---------+---------+
| flush full zone        |      |        |         |         |
|            with commit |  619 |   1122 |  901.36 |  679.15 |
|         without commit |  618 | 105078 |   64591 | 2886.46 |
+------------------------+------+--------+---------+---------+
| flush empty zone       |      |        |         |         |
|            with commit |    0 |      5 |    1.00 |    0.64 |
|         without commit |   54 |  87469 |   64520 | 2172.25 |
+------------------------+------+--------+---------+---------+

When running with 10,000 zones with 1,000 entries each we see the
following results (all in microseconds):
"./ovstest test-conntrack benchmark-zones 1000 10000 1000"

                         +------+--------+---------+---------+
                         |  Min |   Max  |  95%ile |   Avg   |
+------------------------+------+--------+---------+---------+
| ct execute (commit)    |      |        |         |         |
|            with commit |  215 |    287 |  231.88 |  222.30 |
|         without commit |  214 |   1692 |  569.18 |  285.83 |
+------------------------+------+--------+---------+---------+
| ct execute (no commit) |      |        |         |         |
|            with commit |   68 |     97 |   74.69 |   70.09 |
|         without commit |   68 |    300 |  158.40 |   82.06 |
+------------------------+------+--------+---------+---------+
| flush full zone        |      |        |         |         |
|            with commit |   47 |    211 |   56.34 |   50.34 |
|         without commit |   48 |  96330 |   63392 |   63923 |
+------------------------+------+--------+---------+---------+
| flush empty zone       |      |        |         |         |
|            with commit |    0 |      1 |    1.00 |    0.44 |
|         without commit |    3 | 109728 |   63923 | 3629.44 |
+------------------------+------+--------+---------+---------+

Comparing the averages we see:
* a moderate performance improvement for conntrack_execute with or
  without committing of around 6% to 23%
* a significant performance improvement for flushing a full zone of
  around 75% to 99%
* an even more significant improvement for flushing empty zones since we
  no longer need to check any unrelated connections

[1] 9ec849e8aa

Signed-off-by: Felix Huettner <felix.huettner@mail.schwarz>
Signed-off-by: Simon Horman <horms@ovn.org>
2024-05-03 13:03:40 +01:00
Paolo Valerio
b5e6829254 conntrack: Do not use icmp reverse helper for icmpv6.
In the flush tuple code path, while populating the conn_key,
reverse_icmp_type() gets called for both icmp and icmpv6 cases,
while, depending on the proto, its respective helper should be
called, instead.

The above leads to an abort:

[...]
__GI_abort () at abort.c:79
reverse_icmp_type (type=128 '\200') at lib/conntrack.c:1795
tuple_to_conn_key (...) at lib/conntrack.c:2590
in conntrack_flush_tuple (...) at lib/conntrack.c:2787
in dpif_netdev_ct_flush (...) at lib/dpif-netdev.c:9618
ct_dpif_flush_tuple (...) at lib/ct-dpif.c:331
ct_dpif_flush (...) at lib/ct-dpif.c:361
dpctl_flush_conntrack (...) at lib/dpctl.c:1797
[...]

Fix it by calling reverse_icmp6_type() when needed.
Furthermore, self tests have been modified in order to exercise and
check this behavior.
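
The per-protocol dispatch can be sketched as below. Only the echo
request/reply pairs are handled, using the standard ICMP (8/0) and ICMPv6
(128/129) type values; the function names are hypothetical, and the sketch
returns -1 for an unknown type rather than aborting.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_IPPROTO_ICMP    1
#define SKETCH_IPPROTO_ICMPV6  58

/* Returns the reply type for a request and vice versa, -1 if unknown. */
static int
sketch_reverse_icmp4_type(uint8_t type)
{
    switch (type) {
    case 8: return 0;    /* Echo request -> echo reply. */
    case 0: return 8;    /* Echo reply -> echo request. */
    default: return -1;
    }
}

static int
sketch_reverse_icmp6_type(uint8_t type)
{
    switch (type) {
    case 128: return 129;  /* Echo request -> echo reply. */
    case 129: return 128;  /* Echo reply -> echo request. */
    default: return -1;
    }
}

/* Picks the helper that matches the IP protocol of the tuple. */
static int
sketch_reverse_type(uint8_t proto, uint8_t type)
{
    return proto == SKETCH_IPPROTO_ICMPV6
           ? sketch_reverse_icmp6_type(type)
           : sketch_reverse_icmp4_type(type);
}
```

Passing ICMPv6 type 128 to the ICMPv4 helper is the unknown-type case that
corresponds to the abort in the backtrace above.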

Fixes: 271e48a0e2 ("conntrack: Support conntrack flush by ct 5-tuple")
Reported-at: https://issues.redhat.com/browse/FDP-447
Signed-off-by: Paolo Valerio <pvalerio@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2024-04-02 22:13:38 +02:00
Xavier Simonart
6c082a8310 conntrack: Fix flush not flushing all elements.
On netdev datapath, when a ct element was cleaned, the cmap
could be shrunk, potentially causing some elements to be skipped
in the flush iteration.

Fixes: 967bb5c5cd ("conntrack: Add rcu support.")
Signed-off-by: Xavier Simonart <xsimonar@redhat.com>
Acked-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Simon Horman <horms@ovn.org>
2024-03-06 17:50:27 +00:00
Paolo Valerio
afdc1171a8 conntrack: Handle persistent selection for IP addresses.
The patch, when 'persistent' flag is specified, makes the IP selection
in a range persistent across reboots.

Signed-off-by: Paolo Valerio <pvalerio@redhat.com>
Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Simon Horman <horms@ovn.org>
2024-02-21 10:12:42 +00:00
Paolo Valerio
99413ec261 conntrack: Handle random selection for port ranges.
The userspace conntrack only supported hash for port selection.
With the patch, both userspace and kernel datapath support the random
flag.

The default behavior remains the same, that is, if no flags are
specified, hash is selected.

Signed-off-by: Paolo Valerio <pvalerio@redhat.com>
Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Simon Horman <horms@ovn.org>
2024-02-21 10:12:04 +00:00
Viacheslav Galaktionov
8abe32f957 conntrack: Use helpers from committed connections.
When a packet hits a flow rule without an explicitly specified helper,
OvS has to rely on automatic application layer gateway detection to
find related connections. This works as long as services are running on
their standard ports, e.g. when FTP servers use TCP port 21.

However, sometimes it's necessary to run services on non-standard ports.
In that case, there is no way for OvS to guess which protocol is used
within a given flow. Of course, this means that no related connections
can be recognized.

When a connection is committed with a particular helper, it's reasonable
to assume this helper will be used in subsequent CT actions, as long as
they don't override it. Achieve this behaviour by using the committed
connection's helper when a flow rule does not specify one.

Signed-off-by: Viacheslav Galaktionov <viacheslav.galaktionov@arknetworks.am>
Acked-by: Ivan Malov <ivan.malov@arknetworks.am>
Signed-off-by: Aaron Conole <aconole@redhat.com>
2024-01-10 16:16:08 -05:00
Viacheslav Galaktionov
14ef8b451f lib/conntrack: Only use given packet in protocol detection.
The current protocol detection logic relies on two pieces of metadata
passed as arguments: tp_src and tp_dst, which represent the L4 source
and destination port numbers from the flow that triggered the current
flow rule first, and was responsible for creating the current DP flow.

Since multiple network flows of many different kinds, potentially using
different protocols on all layers, can be processed by one flow rule,
using the metadata of some unrelated flow might lead to unexpected
results. For example, ICMP type and code can be interpreted as TCP
source and destination ports. This can confuse the code responsible for
the helper selection, leading to errors in traffic handling and
incorrect detection of related flows.

One of the easiest ways to fix this problem is to simply remove the
tp_src and tp_dst parameters from the picture. The current code base has
no good use for them.

The helper selection logic was based on these values and therefore needs
to be changed. Ensure that the helper specified in a flow rule is used,
given it is compatible with the L4 protocol of the packet. When a flow
rule does not specify a helper, one can still be picked using the given
packet's metadata like TCP/UDP ports.

Signed-off-by: Viacheslav Galaktionov <viacheslav.galaktionov@arknetworks.am>
Signed-off-by: Aaron Conole <aconole@redhat.com>
2024-01-10 16:16:08 -05:00
Ales Musil
8f4b86237b dpctl: Allow the default CT zone limit to be deleted.
Add an optional argument to dpctl ct-del-limits called
"default", which allows removing the default limit,
effectively reverting it to the system default.

Signed-off-by: Ales Musil <amusil@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2023-12-05 21:53:52 +01:00
Ales Musil
4b9eb061b1 ct-dpif: Handle default zone limit the same way as other limits.
Internally handle default CT zone limit as other limits that
can be passed via the list with special value -1. Currently,
the -1 is treated by both datapaths as default, add static
asserts to make sure that this remains the case in the future.
This allows us to easily delete the default zone limit.

Signed-off-by: Ales Musil <amusil@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2023-12-05 20:42:04 +01:00
Peng He
1116459b3b conntrack: Remove nat_conn introducing key directionality.
The patch avoids the extra allocation for nat_conn.
Currently, when doing NAT, the userspace conntrack will use an extra
conn for the two directions in a flow. However, each conn actually has
the two keys for both orig and rev directions. This patch introduces, as
per Aaron's suggestion, a key_node[CT_DIRS] member in the conn, which
consists of a key, direction, and a cmap_node for hash lookup, thereby
addressing the feedback received on the original patch [0].

With this adjustment, we also remove the assertion that connections in
the table are DEFAULT while updating connection state and/or removing
connections.

[0] https://patchwork.ozlabs.org/project/openvswitch/patch/20201129033255.64647-2-hepeng.0320@bytedance.com/

Reported-by: Michael Plato <michael.plato@tu-berlin.de>
Reported-at: https://mail.openvswitch.org/pipermail/ovs-discuss/2022-September/052065.html
Signed-off-by: Peng He <hepeng.0320@bytedance.com>
Co-authored-by: Paolo Valerio <pvalerio@redhat.com>
Signed-off-by: Paolo Valerio <pvalerio@redhat.com>
Tested-by: Frode Nordahl <frode.nordahl@canonical.com>
Acked-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Aaron Conole <aconole@redhat.com>
2023-08-31 13:41:08 -04:00
Paolo Valerio
501f665a5a conntrack: Extract l4 information for SCTP.
Since a27d70a89 ("conntrack: add generic IP protocol support") all
the unrecognized IP protocols get handled using ct_proto_other ops
and are managed as L3 using 3 tuples.

This patch stores L4 information for SCTP in the conn_key so that
multiple conn instances, instead of one with ports zeroed, will be
created when there are multiple SCTP connections between two hosts.
It also performs crc32c check when not offloaded, and adds SCTP to
pat_enabled.
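
For reference, CRC32c (the Castagnoli polynomial used by SCTP) can be computed
with a simple bitwise loop. This is a generic, unoptimized sketch with a
hypothetical function name; it is not the routine OVS uses and it leaves out
the SCTP-specific handling of the checksum field in the common header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC32c over 'len' bytes at 'data' (reflected polynomial
 * 0x82f63b78, initial value and final XOR of 0xffffffff). */
static uint32_t
sketch_crc32c(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint32_t crc = UINT32_C(0xffffffff);

    while (len--) {
        crc ^= *p++;
        for (int bit = 0; bit < 8; bit++) {
            /* The mask is all ones when the low bit is set, else zero. */
            crc = (crc >> 1) ^ (UINT32_C(0x82f63b78) & (0u - (crc & 1u)));
        }
    }
    return ~crc;
}
```

The standard check value for the ASCII string "123456789" is 0xe3069283,
which is a convenient self-test for any implementation.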

With this patch, given two SCTP associations between two hosts,
tracking the connection will result in:

sctp,orig=(src=10.1.1.2,dst=10.1.1.1,sport=55884,dport=5201),
    reply=(src=10.1.1.1,dst=10.1.1.2,sport=5201,dport=12345),zone=1
sctp,orig=(src=10.1.1.2,dst=10.1.1.1,sport=59874,dport=5202),
    reply=(src=10.1.1.1,dst=10.1.1.2,sport=5202,dport=12346),zone=1

instead of:

sctp,orig=(src=10.1.1.2,dst=10.1.1.1,sport=0,dport=0),
    reply=(src=10.1.1.1,dst=10.1.1.2,sport=0,dport=0),zone=1

Signed-off-by: Paolo Valerio <pvalerio@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2023-07-13 21:22:41 +02:00
Paolo Valerio
9b4d2ad8e8 conntrack: Allow to dump userspace conntrack expectations.
The patch introduces a new command, ovs-appctl dpctl/dump-conntrack-exp,
that allows dumping the existing expectations for the userspace ct.

Signed-off-by: Paolo Valerio <pvalerio@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2023-06-29 22:20:43 +02:00
Mike Pattrick
3337e6d91c userspace: Enable L4 checksum offloading by default.
The netdev receiving packets is supposed to provide the flags
indicating if the L4 checksum was verified and it is OK or BAD,
otherwise the stack will check when appropriate by software.

If the packet comes with good checksum, then postpone the
checksum calculation to the egress device if needed.

When encapsulating a packet with that flag, set the checksum
of the inner L4 header since that is not yet supported.

Calculate the L4 checksum when the packet is going to be sent
over a device that doesn't support the feature.

Linux tap devices allow enabling L3 and L4 offload, so this
patch enables the feature. However, Linux socket interface
remains disabled because the API doesn't allow enabling
those two features without enabling TSO too.

Signed-off-by: Flavio Leitner <fbl@sysclose.org>
Co-authored-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2023-06-15 23:50:30 +02:00
Mike Pattrick
5d11c47d3e userspace: Enable IP checksum offloading by default.
The netdev receiving packets is supposed to provide the flags
indicating if the IP checksum was verified and it is GOOD or BAD,
otherwise the stack will check when appropriate by software.

If the packet comes with good checksum, then postpone the
checksum calculation to the egress device if needed.

When encapsulating a packet with that flag, set the checksum
of the inner IP header since that is not yet supported.

Calculate the IP checksum when the packet is going to be sent over
a device that doesn't support the feature.

Linux devices don't support IP checksum offload alone, so the
support is not enabled.

Signed-off-by: Flavio Leitner <fbl@sysclose.org>
Co-authored-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2023-06-15 23:49:51 +02:00
Paolo Valerio
9fa612959c ovs-dpctl: Add new command dpctl/ct-[sg]et-sweep-interval.
Since 3d9c1b855a ("conntrack: Replace timeout based expiration lists
with rculists.") the sweep interval changed as well as the constraints
related to the sweeper.
Being able to change the default reschedule time may be convenient in
some conditions, like debugging.
This patch introduces new commands allowing to get and set the sweep
interval in ms.

Signed-off-by: Paolo Valerio <pvalerio@redhat.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2023-04-06 22:59:25 +02:00
Nobuhiro MIKI
349112f975 flow: Support rt_hdr in parse_ipv6_ext_hdrs().
Checks whether IPPROTO_ROUTING exists in the IPv6 extension headers.
If it exists, the first address is retrieved.

If NULL is specified for "frag_hdr" and/or "rt_hdr", those addresses in
the header are not reported to the caller. Of course, "frag_hdr" and
"rt_hdr" are properly parsed inside this function.

Signed-off-by: Nobuhiro MIKI <nmiki@yahoo-corp.jp>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2023-03-29 21:41:28 +02:00
Liang Mancang
b0d9a1efcc conntrack: Fix conntrack_clean may access the same exp_list each time.
When an exp_list contains more nodes than clean_end allows, and these
nodes will not expire immediately, every call to conntrack_clean uses
the same next_sweep to get the exp_list.

Instead, we should advance i every time after we call ct_sweep.

Fixes: 3d9c1b855a ("conntrack: Replace timeout based expiration lists with rculists.")
Acked-by: Paolo Valerio <pvalerio@redhat.com>
Signed-off-by: Liang Mancang <liangmc1@chinatelecom.cn>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2023-02-21 21:02:20 +01:00
Ales Musil
0a7587034d conntrack: Properly unNAT inner header of related traffic.
The inner header was not handled properly.
Simplify the code which allows proper handling
of the inner headers.

Reported-at: https://bugzilla.redhat.com/2137754
Acked-by: Paolo Valerio <pvalerio@redhat.com>
Signed-off-by: Ales Musil <amusil@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2023-02-13 19:17:18 +01:00
Paolo Valerio
a3848d98e1 conntrack: Show parent key if present.
Similarly to what happens when CTA_TUPLE_MASTER is present in a ct
netlink dump, add the ability to print out the parent key to the
userspace implementation as well.

Signed-off-by: Paolo Valerio <pvalerio@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-11-02 19:49:07 +01:00
Ilya Maximets
b159525903 conntrack: Check for expiration before comparing the keys during the lookup.
This could save some costly key comparison miss, especially in the
case there are many expired connections waiting for the sweeper to
evict them.

Acked-by: Aaron Conole <aconole@redhat.com>
Co-authored-by: Paolo Valerio <pvalerio@redhat.com>
Signed-off-by: Paolo Valerio <pvalerio@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-07-13 00:50:23 +02:00
Gaetan Rivet
78387e88bd conntrack: Use an atomic conn expiration value.
A lock is taken during conn_lookup() to check whether a connection is
expired before returning it. This lock can have some contention.

Even though this lock ensures a consistent sequence of writes, it does
not imply a specific order. A ct_clean thread taking the lock first
could read a value that would be updated immediately after by a PMD
waiting on the same lock, just as well as the inverse order.

As such, the expiration time can be stale anytime it is read. In this
context, using an atomic will ensure the same guarantees for either
writes or reads, i.e. writes are consistent and reads are not undefined
behaviour. Reading an atomic is however less costly than taking and
releasing a lock.
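
The idea can be sketched with standard C11 atomics. OVS has its own atomic
wrappers, so the types, names and memory ordering below are illustrative
assumptions and are not the code from the patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct sketch_conn {
    atomic_llong expiration_ms;  /* Absolute expiration time. */
};

static void
sketch_conn_init(struct sketch_conn *conn, long long expiration_ms)
{
    atomic_init(&conn->expiration_ms, expiration_ms);
}

/* Writer side, e.g. a thread refreshing the timeout on packet update. */
static void
sketch_conn_set_expiration(struct sketch_conn *conn, long long expiration_ms)
{
    atomic_store_explicit(&conn->expiration_ms, expiration_ms,
                          memory_order_relaxed);
}

/* Reader side: no lock is taken.  The value may be stale, exactly as it
 * could be right after releasing a lock, but the read is well defined. */
static bool
sketch_conn_expired(struct sketch_conn *conn, long long now_ms)
{
    return now_ms >= atomic_load_explicit(&conn->expiration_ms,
                                          memory_order_relaxed);
}
```

Relaxed ordering is used here on the assumption that only the expiration
value itself needs to be read consistently, with no ordering required
against other fields.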

Signed-off-by: Gaetan Rivet <grive@u256.net>
Signed-off-by: Paolo Valerio <pvalerio@redhat.com>
Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-07-13 00:50:23 +02:00
Gaetan Rivet
3d9c1b855a conntrack: Replace timeout based expiration lists with rculists.
This patch aims to replace the expiration lists as, due to the way
they are used, besides being a source of contention, they have a known
issue when used with non-default policies for different zones that
could lead to retaining expired connections potentially for a long
time.

This patch replaces them with an array of rculist used to distribute
all the newly created connections in order to, during the sweeping
phase, scan them without locking, and evict the expired connections
only locking during the actual removal.  This reduces the
contention introduced by the pushback performed at every packet
update, and also solves the issue related to zones and timeout policies.

Signed-off-by: Gaetan Rivet <grive@u256.net>
Co-authored-by: Paolo Valerio <pvalerio@redhat.com>
Signed-off-by: Paolo Valerio <pvalerio@redhat.com>
Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-07-13 00:50:23 +02:00
Gaetan Rivet
4847baf4a9 conntrack-tp: Use a cmap to store timeout policies.
Multiple lookups are done on stored timeout policies, each time taking
the global 'ct_lock'. This is usually not necessary and it should be
acceptable to get policy updates slightly delayed (by one RCU sync
at most). Using a CMAP reduces multiple lock taking and releasing in
the connection insertion path.

Signed-off-by: Gaetan Rivet <grive@u256.net>
Reviewed-by: Eli Britstein <elibr@nvidia.com>
Acked-by: William Tu <u9012063@gmail.com>
Signed-off-by: Paolo Valerio <pvalerio@redhat.com>
Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-07-12 20:44:46 +02:00
Gaetan Rivet
6edc278c85 conntrack: Use a cmap to store zone limits.
Change the data structure from hmap to cmap for zone limits.
As they are shared amongst multiple conntrack users, multiple
readers want to check the current zone limit state before progressing in
their processing. Using a CMAP allows doing lookups without taking the
global 'ct_lock', thus reducing contention.

Signed-off-by: Gaetan Rivet <grive@u256.net>
Reviewed-by: Eli Britstein <elibr@nvidia.com>
Signed-off-by: Paolo Valerio <pvalerio@redhat.com>
Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-07-12 20:44:46 +02:00
Ilya Maximets
4e1e1e189f conntrack: Fix incorrect bit shift while hashing nat range.
'max_port' is a 16-bit field; the shift promotes it to 'int', not 'unsigned int'.

 lib/conntrack.c:2245:41: runtime error:
   left shift of 34568 by 16 places cannot be represented in type 'int'.

     0 0xec45f4 in nat_range_hash lib/conntrack.c:2245:41
     1 0xec45f4 in nat_get_unique_tuple lib/conntrack.c:2422:21
     2 0xec45f4 in conn_not_found lib/conntrack.c:1035:32
     3 0xeaf0a5 in process_one lib/conntrack.c:1407:20
     4 0xea9390 in conntrack_execute lib/conntrack.c:1465:13
     5 0x839530 in dp_execute_cb lib/dpif-netdev.c:9060:9
     6 0x9909cc in odp_execute_actions lib/odp-execute.c:868:17
     7 0x830946 in dp_netdev_execute_actions lib/dpif-netdev.c:9106:5
     8 0x830946 in handle_packet_upcall lib/dpif-netdev.c:8294:5
     9 0x82ea5e in fast_path_processing lib/dpif-netdev.c:8390:25
     10 0x7ed87f in dp_netdev_input__ lib/dpif-netdev.c:8479:9
     11 0x7eb5fc in dp_netdev_input lib/dpif-netdev.c:8517:5
     12 0x81dada in dp_netdev_process_rxq_port lib/dpif-netdev.c:5329:19
     13 0x7f0063 in dpif_netdev_run lib/dpif-netdev.c:6664:25
     14 0x85f036 in dpif_run lib/dpif.c:467:16
     15 0x61833a in type_run ofproto/ofproto-dpif.c:366:9
     16 0x5c210e in ofproto_type_run ofproto/ofproto.c:1822:31
     17 0x565db2 in bridge_run__ vswitchd/bridge.c:3245:9
     18 0x562f82 in bridge_run vswitchd/bridge.c:3310:5
     19 0x59a98c in main vswitchd/ovs-vswitchd.c:129:9
     20 0x7f8864c3acf2 in __libc_start_main (/lib64/libc.so.6+0x3acf2)
     21 0x47e60d in _start (vswitchd/ovs-vswitchd+0x47e60d)
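
The class of fix can be sketched as follows, assuming a 32-bit 'int'.
'pack_ports' is a hypothetical helper written for illustration; it is not the
code in nat_range_hash().  The value 34568 (0x8708) from the report above has
its top bit set, so shifting it left by 16 as a signed 'int' overflows.

```c
#include <assert.h>
#include <stdint.h>

/* Undefined behaviour when max_port >= 0x8000 and 'int' is 32 bits wide:
 * 'max_port' is promoted to signed 'int' before the shift, and the result
 * cannot be represented:
 *
 *     uint32_t bad = max_port << 16;
 */

/* Hypothetical helper: widen to an unsigned 32-bit type before shifting,
 * so the shift is well defined for every 16-bit value. */
static inline uint32_t
pack_ports(uint16_t min_port, uint16_t max_port)
{
    assert(min_port <= max_port);
    return ((uint32_t) max_port << 16) | min_port;
}
```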

Fixes: 92edd073ce ("conntrack: Hash entire NAT data structure in nat_range_hash().")
Acked-by: Paolo Valerio <pvalerio@redhat.com>
Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-06-24 23:51:31 +02:00
wenxu
165f5fbb5e conntrack: Limit port clash resolution attempts.
In case all or almost all available ports are taken, clash resolution can
take a very long time, resulting in PMD lockup.

This can happen when many to-be-natted hosts connect to the same
destination:port (e.g. a proxy) and all connections pass the same SNAT.

Pick a random offset in the acceptable range, then try an ever smaller
number of adjacent port numbers, until either the limit is reached or a
usable port is found.  This results in at most 248 attempts
(128 + 64 + 32 + 16 + 8, i.e. 4 restarts with new search offset)
instead of 64000+.
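
A minimal sketch of the bounded search described above.  'find_free_port' and
the 'in_use' table are hypothetical names introduced for illustration, and
rand() stands in for whatever random source the real code uses; this is not
the conntrack implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Searches [min, max] for a port whose 'in_use' entry is false.  'in_use'
 * is indexed by port number.  Returns true and stores the port in '*port'
 * on success.  '*n_tried' receives the number of ports probed. */
static inline bool
find_free_port(uint16_t min, uint16_t max, const bool *in_use,
               uint16_t *port, unsigned int *n_tried)
{
    unsigned int range, attempts, tried = 0;
    bool found = false;

    assert(min <= max);
    range = (unsigned int) max - min + 1;
    attempts = range < 128 ? range : 128;

    while (!found) {
        /* New random search offset for every round. */
        unsigned int off = (unsigned int) rand() % range;

        for (unsigned int i = 0; i < attempts && !found; i++) {
            uint16_t candidate = (uint16_t) (min + (off + i) % range);

            tried++;
            if (!in_use[candidate]) {
                *port = candidate;
                found = true;
            }
        }
        /* Stop once the whole range was covered or the round got small. */
        if (found || attempts >= range || attempts < 16) {
            break;
        }
        attempts /= 2;      /* Rounds of 128, 64, 32, 16, 8 probes. */
    }
    *n_tried = tried;
    return found;
}
```

With every port in a large range taken, the rounds probe
128 + 64 + 32 + 16 + 8 = 248 ports before giving up.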

Signed-off-by: wenxu <wenxu@chinatelecom.cn>
Acked-by: Paolo Valerio <pvalerio@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-06-07 14:50:36 +02:00
wenxu
c608ace71d conntrack: Remove the IP iterations in nat_get_unique_l4.
Remove the IP iterations and just pick the IP address using a hash
based on the least-used src-ip/dst-ip/proto triple.

Signed-off-by: wenxu <wenxu@chinatelecom.cn>
Acked-by: Paolo Valerio <pvalerio@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-06-07 14:50:00 +02:00
Adrian Moreno
745c80f52c hindex: remove the next variable in safe loops.
Using SHORT version of the *_SAFE loops makes the code cleaner and less
error prone. So, use the SHORT version and remove the extra variable
when possible for HINDEX_*_SAFE.

In order to be able to use both long and short versions without changing
the name of the macro for all the clients, overload the existing name
and select the appropriate version depending on the number of arguments.
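
The argument-count dispatch can be sketched as below.  All names here
('struct node', NODE_FOR_EACH_SAFE and its helpers) are hypothetical and
chosen for illustration; they are not the macros used in the OVS tree.

```c
#include <assert.h>
#include <stddef.h>

struct node {
    struct node *next;
    int value;
};

/* Long version: the caller provides the NEXT variable. */
#define NODE_FOR_EACH_SAFE_LONG(ITER, NEXT, HEAD)                        \
    for ((ITER) = (HEAD);                                                \
         (ITER) ? ((NEXT) = (ITER)->next, 1) : 0;                        \
         (ITER) = (NEXT))

/* Short version: the successor is kept in a loop-local variable. */
#define NODE_FOR_EACH_SAFE_SHORT(ITER, HEAD)                             \
    for (struct node *next__ = ((ITER) = (HEAD),                         \
                                (ITER) ? (ITER)->next : NULL);           \
         (ITER);                                                         \
         (ITER) = next__, next__ = (ITER) ? (ITER)->next : NULL)

/* Picks the 4th argument, whose identity depends on how many arguments
 * the caller passed in front of the candidate names. */
#define SELECT_4TH_(_1, _2, _3, NAME, ...) NAME
#define NODE_FOR_EACH_SAFE(...)                                          \
    SELECT_4TH_(__VA_ARGS__, NODE_FOR_EACH_SAFE_LONG,                    \
                NODE_FOR_EACH_SAFE_SHORT, unused)(__VA_ARGS__)

/* Unlinks every node while iterating.  This is only safe because the
 * successor is saved before the loop body runs. */
static inline int
unlink_all_short(struct node *head)
{
    struct node *iter;
    int n = 0;

    NODE_FOR_EACH_SAFE (iter, head) {       /* Two arguments: short. */
        iter->next = NULL;
        n++;
    }
    assert(iter == NULL);
    return n;
}

static inline int
unlink_all_long(struct node *head)
{
    struct node *iter, *next;
    int n = 0;

    NODE_FOR_EACH_SAFE (iter, next, head) { /* Three arguments: long. */
        iter->next = NULL;
        n++;
    }
    assert(iter == NULL);
    return n;
}
```

Existing three-argument callers keep compiling unchanged, while new callers
can drop the extra variable.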

Acked-by: Dumitru Ceara <dceara@redhat.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-03-30 16:59:03 +02:00
Adrian Moreno
e9bf5bffb0 list: use short version of safe loops if possible.
Using the SHORT version of the *_SAFE loops makes the code cleaner
and less error-prone. So, use the SHORT version and remove the extra
variable when possible.

In order to be able to use both long and short versions without changing
the name of the macro for all the clients, overload the existing name
and select the appropriate version depending on the number of arguments.

Acked-by: Dumitru Ceara <dceara@redhat.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-03-30 16:59:02 +02:00
wenxu
545b64415d conntrack: Prefer dst port range during unique tuple search.
This commit splits the nested loop used to search the unique ports for
the reverse tuple.
It affects only the dnat action, giving more precedence to the dnat
range, similarly to the kernel dp, instead of searching through the
default ephemeral source range for each destination port.

Acked-by: Paolo Valerio <pvalerio@redhat.com>
Signed-off-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-03-04 20:16:37 +01:00
wenxu
ec85f5325f conntrack: Select correct sport range for well-known origin sport.
Like the kernel datapath, the sport nat range for a well-known origin
sport should be limited to the well-known ports.

Acked-by: Paolo Valerio <pvalerio@redhat.com>
Signed-off-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2022-03-04 20:14:29 +01:00
wenxu
a2fa8b2895 conntrack: Remove the nat_action_info from the conn.
Only 'nat_action_info->nat_action' is used for packet forwarding.
Other items such as min/max_ip/port are used only when creating
new connections. No need to store the whole nat_action_info in conn.

Signed-off-by: wenxu <wenxu@ucloud.cn>
Acked-by: Gaetan Rivet <grive@u256.net>
Acked-by: Michael Santana <msantana@redhat.com>
Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2021-09-16 00:01:47 +02:00
Gaetan Rivet
b889d5dcc8 conntrack: Init hash basis first at creation.
The 'hash_basis' field is used sometimes during sub-systems init
routine. It will be 0 by default before randomization. Sub-systems would
then init some nodes with incorrect hash values.

The timeout policies module is affected, causing the default policy to be
referenced using an incorrect hash value.

Fixes: 2078901a4c ("userspace: Add conntrack timeout policy support.")
Signed-off-by: Gaetan Rivet <grive@u256.net>
Reviewed-by: Eli Britstein <elibr@nvidia.com>
Acked-by: William Tu <u9012063@gmail.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2021-07-09 22:23:59 +02:00
Paolo Valerio
61e48c2d1d conntrack: Handle SNAT with all-zero IP address.
This patch introduces for the userspace datapath the handling
of rules like the following:

  ct(commit,nat(src=0.0.0.0),...)

The kernel datapath already handles this case, which is particularly
handy in scenarios like the following:

Given A: 10.1.1.1, B: 192.168.2.100, C: 10.1.1.2

A opens a connection toward B on port 80 selecting as source port 10000.
B's IP gets dnat'ed to C's IP (10.1.1.1:10000 -> 192.168.2.100:80).

This will result in:

  tcp,orig=(src=10.1.1.1,dst=192.168.2.100,sport=10000,dport=80),
     reply=(src=10.1.1.2,dst=10.1.1.1,sport=80,dport=10000),
     protoinfo=(state=ESTABLISHED)

A now tries to establish another connection with C using source port
10000, this time using C's IP address (10.1.1.1:10000 -> 10.1.1.2:80).

This second connection, if processed by conntrack with no SNAT/DNAT
involved, collides with the reverse tuple of the first connection,
so the entry for this valid connection doesn't get created.
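
The collision can be checked with a small sketch.  'struct tuple_sketch' and
the helpers are hypothetical simplifications for illustration, not the
conntrack key structures; addresses and ports are the ones from the example
above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified 4-tuple (the protocol is assumed to be TCP). */
struct tuple_sketch {
    uint32_t src, dst;
    uint16_t sport, dport;
};

#define IP4(a, b, c, d) (((uint32_t) (a) << 24) | ((uint32_t) (b) << 16) \
                         | ((uint32_t) (c) << 8) | (uint32_t) (d))

static inline struct tuple_sketch
tuple_reverse(struct tuple_sketch t)
{
    struct tuple_sketch r = { t.dst, t.src, t.dport, t.sport };
    return r;
}

static inline bool
tuple_equal(struct tuple_sketch a, struct tuple_sketch b)
{
    return a.src == b.src && a.dst == b.dst
           && a.sport == b.sport && a.dport == b.dport;
}

/* Reply tuple of a connection whose destination is DNATed to 'new_dst'. */
static inline struct tuple_sketch
reply_after_dnat(struct tuple_sketch orig, uint32_t new_dst)
{
    orig.dst = new_dst;
    return tuple_reverse(orig);
}

static inline void
demo_clash(void)
{
    /* A -> B, with B DNATed to C. */
    struct tuple_sketch first = { IP4(10, 1, 1, 1), IP4(192, 168, 2, 100),
                                  10000, 80 };
    struct tuple_sketch first_reply = reply_after_dnat(first,
                                                       IP4(10, 1, 1, 2));
    /* A -> C directly, same source port. */
    struct tuple_sketch second = { IP4(10, 1, 1, 1), IP4(10, 1, 1, 2),
                                   10000, 80 };

    /* Without NAT, the second connection's reply tuple is already taken. */
    assert(tuple_equal(first_reply, tuple_reverse(second)));

    /* Moving the source port, as an all-zero SNAT may do, avoids it. */
    second.sport = 10001;
    assert(!tuple_equal(first_reply, tuple_reverse(second)));
}
```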

With this commit, adding a SNAT rule with 0.0.0.0 for
10.1.1.1:10000 -> 10.1.1.2:80 allows the conn entry to be created:

  tcp,orig=(src=10.1.1.1,dst=10.1.1.2,sport=10000,dport=80),
     reply=(src=10.1.1.2,dst=10.1.1.1,sport=80,dport=10001),
     protoinfo=(state=ESTABLISHED)
  tcp,orig=(src=10.1.1.1,dst=192.168.2.100,sport=10000,dport=80),
     reply=(src=10.1.1.2,dst=10.1.1.1,sport=80,dport=10000),
     protoinfo=(state=ESTABLISHED)

The issue exists even in the opposite case (with A trying to connect
to C using B's IP after establishing a direct connection from A to C).

This commit refactors the relevant function in a way that both of the
previously mentioned cases are handled as well.

Suggested-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Paolo Valerio <pvalerio@redhat.com>
Acked-by: Gaetan Rivet <grive@u256.net>
Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2021-07-08 23:49:34 +02:00
Paolo Valerio
1e19f9aa26 conntrack: Handle already natted packets.
When a packet gets dnatted and then recirculated, it could be possible
that it matches another rule that performs another nat action.
The kernel datapath handles this situation by turning the second nat
action into a no-op, so the packet is natted only once.  In the userspace
datapath instead, when the ct action gets executed, an initial lookup
of the translated packet fails to retrieve the connection related to
the packet, leading to the creation of a new entry in ct for the src
nat action with a subsequent failure of the connection establishment.

With the following flows:

table=0,priority=30,in_port=1,ip,nw_dst=192.168.2.100,actions=ct(commit,nat(dst=10.1.1.2:80),table=1)
table=0,priority=20,in_port=2,ip,actions=ct(nat,table=1)
table=0,priority=10,ip,actions=resubmit(,2)
table=0,priority=10,arp,actions=NORMAL
table=0,priority=0,actions=drop
table=1,priority=5,ip,actions=ct(commit,nat(src=10.1.1.240),table=2)
table=2,in_port=ovs-l0,actions=2
table=2,in_port=ovs-r0,actions=1

Establishing a connection from 10.1.1.1 to 192.168.2.100, the outcome is:

  tcp,orig=(src=10.1.1.1,dst=10.1.1.2,sport=4000,dport=80),
     reply=(src=10.1.1.2,dst=10.1.1.240,sport=80,dport=4000),
     protoinfo=(state=ESTABLISHED)
  tcp,orig=(src=10.1.1.1,dst=192.168.2.100,sport=4000,dport=80),
     reply=(src=10.1.1.2,dst=10.1.1.1,sport=80,dport=4000),
     protoinfo=(state=ESTABLISHED)

With this patch applied the outcome is:

  tcp,orig=(src=10.1.1.1,dst=192.168.2.100,sport=4000,dport=80),
     reply=(src=10.1.1.2,dst=10.1.1.1,sport=80,dport=4000),
     protoinfo=(state=ESTABLISHED)

The patch performs, for already natted packets, a lookup of the
reverse key in order to retrieve the related entry.  It also adds a
test case that, besides testing the scenario, ensures that the other ct
actions are executed.

Reported-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Paolo Valerio <pvalerio@redhat.com>
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2021-07-08 23:49:34 +02:00
Paolo Valerio
2c597c8900 conntrack: add coverage counters for L3 bad checksum.
Similarly to what already exists for L4, add conntrack_l3csum_err
and ipf_l3csum_err for L3.

Received packets with a bad L3 checksum increment ipf_l3csum_err
if they are fragments and conntrack_l3csum_err otherwise.

Although the patch basically covers IPv4, the names are kept generic.

Signed-off-by: Paolo Valerio <pvalerio@redhat.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Reviewed-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2021-06-30 23:56:03 +02:00