mir/ovs

mirror of https://github.com/openvswitch/ovs synced 2025-08-31 06:15:47 +00:00

Author	SHA1	Message	Date
David Marchand	2956a61265	dp-packet: Rework L4 checksum offloads. The DPDK mbuf API specifies four statuses when it comes to L4 checksums: - RTE_MBUF_F_RX_L4_CKSUM_UNKNOWN: no information about the RX L4 checksum - RTE_MBUF_F_RX_L4_CKSUM_BAD: the L4 checksum in the packet is wrong - RTE_MBUF_F_RX_L4_CKSUM_GOOD: the L4 checksum in the packet is valid - RTE_MBUF_F_RX_L4_CKSUM_NONE: the L4 checksum is not correct in the packet data, but the integrity of the L4 data is verified. Similarly to the IP checksum offloads API, revise the OVS L4 offloads API. No information about the L4 protocol is provided by any netdev-* implementation, so OVS needs to mark this L4 protocol during flow extraction. Rename current API for consistency with dp_packet_(inner_)?l4_checksum_. Signed-off-by: David Marchand <david.marchand@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2025-06-19 21:02:56 +02:00
David Marchand	3daf04a4c5	dp-packet: Rework IP checksum offloads. As the packet traverses OVS, offloading Tx flags must be carefully evaluated and updated, which results in a bit of complexity because of a separate "outer" Tx offloading flag coming from the DPDK API, and a "normal"/"inner" Tx offloading flag. On the other hand, the DPDK mbuf API specifies four statuses when it comes to IP checksums: - RTE_MBUF_F_RX_IP_CKSUM_UNKNOWN: no information about the RX IP checksum - RTE_MBUF_F_RX_IP_CKSUM_BAD: the IP checksum in the packet is wrong - RTE_MBUF_F_RX_IP_CKSUM_GOOD: the IP checksum in the packet is valid - RTE_MBUF_F_RX_IP_CKSUM_NONE: the IP checksum is not correct in the packet data, but the integrity of the IP header is verified. This patch changes the OVS API so that OVS code only tracks the status of the checksum of the "current" L3 header and leaves the Tx flags aspect to the netdev-* implementations. With this API, the flow extraction can be cleaned up. During packet processing, OVS can simply look for the IP checksum validity (either good, or partial) before changing some IP header, and then mark the checksum as partial. In the conntrack case, when natting packets, the checksum status of the inner part (ICMP error case) must be forced temporarily as unknown to force checksum resolution. When tunneling comes into play, the IP checksum status is bit-shifted for future considerations in the processing if, for example, the tunnel header gets decapsulated again, or in the netdev-* implementations that support tunnel offloading. Finally, netdev-* implementations only need to care about packets in partial status: a good checksum does not need touching, a bad checksum has been updated but kept as bad by OVS, and an unknown checksum belongs either to an IPv6 packet or, if it was an IPv4 packet, OVS updated it too (keeping it good or bad accordingly). Rename current API for consistency with dp_packet_(inner_)?ip_checksum_. 
Signed-off-by: David Marchand <david.marchand@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2025-06-19 21:00:54 +02:00
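The status handling described in the commit above can be sketched roughly as follows. This is a minimal illustration, not the OVS implementation: the enum, its values, and all `ex_` function names are hypothetical, and "partial" here stands for the case where the header integrity is verified but the checksum field in the packet data is not filled in.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical status values; the real OVS flag names and encoding differ. */
enum ex_ip_csum {
    EX_IP_CSUM_UNKNOWN,  /* No information about the checksum. */
    EX_IP_CSUM_BAD,      /* Checksum in the packet data is wrong. */
    EX_IP_CSUM_GOOD,     /* Checksum in the packet data is valid. */
    EX_IP_CSUM_PARTIAL   /* Header verified, checksum field not filled in. */
};

/* "Valid" in the sense of the commit message: either good or partial. */
static bool
ex_ip_csum_valid(enum ex_ip_csum status)
{
    return status == EX_IP_CSUM_GOOD || status == EX_IP_CSUM_PARTIAL;
}

/* Status to record after an IP header field has been rewritten. */
static enum ex_ip_csum
ex_ip_csum_after_rewrite(enum ex_ip_csum status)
{
    assert(status <= EX_IP_CSUM_PARTIAL);
    /* Valid checksums become partial, deferring the computation.  Unknown
     * and bad checksums keep their status; the caller is assumed to update
     * the checksum field in software in that case. */
    return ex_ip_csum_valid(status) ? EX_IP_CSUM_PARTIAL : status;
}

/* At transmit time only partial packets need any work. */
static bool
ex_netdev_must_fill_csum(enum ex_ip_csum status)
{
    return status == EX_IP_CSUM_PARTIAL;
}
```

With this shape, a netdev implementation would only check `ex_netdev_must_fill_csum()` and either compute the checksum in software or request hardware offload.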
Mike Pattrick	614029aac0	conntrack: Allow inner NAT of related fragments. Currently conntrack will refuse to extract metadata from fragmented IPv4 packets. Usually the fragments would be processed by the ipf module, but this isn't the case for ICMP related packets. The current handling will result in these being incorrectly processed. This patch checks for a frag offset instead of just frag flags, which is similar to how conntrack handles fragments in the kernel. Reported-at: https://issues.redhat.com/browse/FDP-136 Reported-by: Ales Musil <amusil@redhat.com> Fixes: `a489b16854` ("conntrack: New userspace connection tracker.") Signed-off-by: Mike Pattrick <mkp@redhat.com> Signed-off-by: Aaron Conole <aconole@redhat.com>	2025-06-13 14:06:07 -04:00
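The difference between testing the fragment flags and testing the fragment offset can be illustrated with the sketch below. The macro and function names are hypothetical (they are not the OVS ones), and the field is assumed to be already converted to host byte order. The likely rationale, stated here as an assumption, is that a first fragment has offset zero and so still begins with the L4 or ICMP header, while later fragments do not.

```c
#include <stdbool.h>
#include <stdint.h>

/* IPv4 flags/fragment-offset field, host byte order (RFC 791 layout). */
#define EX_IP_MORE_FRAGMENTS 0x2000u  /* "More Fragments" flag. */
#define EX_IP_FRAG_OFF_MASK  0x1fffu  /* Offset in units of 8 bytes. */

/* True for any fragment: first, middle or last. */
static bool
ex_ipv4_is_fragment(uint16_t frag_off)
{
    return (frag_off & (EX_IP_MORE_FRAGMENTS | EX_IP_FRAG_OFF_MASK)) != 0;
}

/* True only for fragments that do not start at the beginning of the
 * original payload, i.e. those that cannot contain the L4 header. */
static bool
ex_ipv4_is_later_fragment(uint16_t frag_off)
{
    return (frag_off & EX_IP_FRAG_OFF_MASK) != 0;
}
```

A check based on `ex_ipv4_is_fragment()` rejects the first fragment as well, whereas one based on `ex_ipv4_is_later_fragment()` lets it through for metadata extraction.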
David Marchand	71f3dd3e9c	conntrack: Fix embedded checksums in ICMP errors. Helpers like packet_set_ipv4() reset IP csum flags. Inspecting and natting embedded payload in an ICMP error is thus broken if the "outer" IP header had some Rx checksum flags that made it eligible to Tx IP checksum. Reset temporarily any Tx checksum to force those helpers to resolve the checksums. Acked-by: Mike Pattrick <mkp@redhat.com> Signed-off-by: David Marchand <david.marchand@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2025-05-21 23:14:42 +02:00
David Marchand	4b00509ea1	conntrack: Do not validate already checked checksum. Bad packets were still being validated in software when entering conntrack. Trust decision taken wrt IP checksum offloading (checking dp_packet_hwol_l3_csum_ipv4_ol()) and avoid revalidating a known bad checksum. While at it, add coverage counters so that checksum validation impact can be monitored, and unit tests. Acked-by: Mike Pattrick <mkp@redhat.com> Signed-off-by: David Marchand <david.marchand@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2025-05-21 22:05:48 +02:00
Mike Pattrick	484208bd17	ipf: Maintain packet zone and direction. Currently ipf will inject completed fragments into the first available batch. In almost all cases, this is the batch which contained the last fragment of the packet. However, in cases where the batch is full the packets are added to whatever random subsequent batch arrives to conntrack. This could result in packets being processed incorrectly, for example some completed frags may be inserted into a batch from the interface that they should have been destined for. This patch verifies the zone matches, and that the batch contains a packet of the same in_port as the completed fragments. Reported-at: https://issues.redhat.com/browse/FDP-1052 Fixes: `4ea96698f6` ("Userspace datapath: Add fragmentation handling.") Signed-off-by: Mike Pattrick <mkp@redhat.com> Signed-off-by: Aaron Conole <aconole@redhat.com>	2025-04-17 16:51:46 -04:00
Aaron Conole	4c5c1aa9f9	conntrack: Fix Windows build due to ternary syntax extension. In the cited commit a ternary using syntax extension slipped in. The extension allows omitting the second operand and it is not supported by MSVC resulting in a build failure. Fix it by simply specifying the second operand. Fixes: `b57c1da5c3` ("conntrack: Use a per zone default limit.") Reported-by: Ilya Maximets <i.maximets@ovn.org> [Paolo: added commit message] Co-authored-by: Paolo Valerio <pvalerio@redhat.com> Signed-off-by: Paolo Valerio <pvalerio@redhat.com> Signed-off-by: Aaron Conole <aconole@redhat.com> Signed-off-by: Eelco Chaudron <echaudro@redhat.com>	2024-10-14 12:11:51 +02:00
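For readers unfamiliar with the extension mentioned above: GNU C allows `a ?: b` as shorthand for `a ? a : b`. A minimal sketch of the portable form follows; the function and parameter names are hypothetical and unrelated to the actual conntrack code.

```c
/* GNU extension, not accepted by MSVC:
 *
 *     return limit ?: fallback;
 *
 * Portable equivalent with the second operand spelled out. */
static unsigned int
ex_limit_or_default(unsigned int limit, unsigned int fallback)
{
    return limit ? limit : fallback;
}
```

One difference worth keeping in mind: the portable form evaluates the first operand twice, which matters if that operand has side effects.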
Paolo Valerio	b57c1da5c3	conntrack: Use a per zone default limit. Before this change the default limit, instead of being considered per-zone, was considered as a global value that every new entry was checked against during the creation. This was not the intended behavior as the default limit should be inherited by each zone instead of being an aggregate number. This change corrects that by removing the default limit from the cmap and making it global (atomic). Now, whenever a new connection needs to be committed, if default_zone_limit is set and the entry for the zone doesn't exist, a new entry for that zone is lazily created, marked as default. All subsequent packets for that zone will undergo the regular lookup process. To distinguish between default and user-defined entries, the storage for the limit member of struct conntrack_zone_limit has been changed from a 32-bit unsigned integer to a 64-bit signed integer. The negative value ZONE_LIMIT_CONN_DEFAULT now indicates a default entry. Operations such as creation/deletion are modified accordingly taking into account this new behavior. Worth noting that OVS_REQUIRES(ct->ct_lock) is not a strict requirement for zone_limit_lookup_or_default(), however since the function operates under the lock and it can create an entry in the slow path, the lock requirement is enforced in order to make thread safety checks work. The function can still be moved outside the creation lock or any lock, keeping the fastpath lockless (turning zone_limit_lookup_protected() to its unprotected version) and locking only in the slow path (replacing zone_limit_create__() with zone_limit_create__(). The patch also extends `conntrack - limit by zone` test in order to check the behavior, and while at it, update test's packet-out to use compose-packet function. 
Fixes: `a7f33fdbfb` ("conntrack: Support zone limits.") Reported-at: https://issues.redhat.com/browse/FDP-122 Reported-by: Ilya Maximets <i.maximets@ovn.org> Signed-off-by: Paolo Valerio <pvalerio@redhat.com> Signed-off-by: Aaron Conole <aconole@redhat.com>	2024-10-09 10:55:18 -04:00
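The idea of marking an entry as "default" with a negative value in a widened signed field can be sketched as below. This is an illustration under assumptions: the struct layout, the sentinel value of -1, and every `ex_`/`EX_` name are hypothetical, and the limit is treated as a plain ceiling without any other special values.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed sentinel value; the commit only states that it is negative. */
#define EX_ZONE_LIMIT_DEFAULT INT64_C(-1)

struct ex_zone_limit {
    int32_t zone;
    int64_t limit;   /* User-defined limit, or EX_ZONE_LIMIT_DEFAULT. */
    uint32_t count;  /* Connections currently tracked in this zone. */
};

/* Resolves the limit that applies to this zone's own counter. */
static int64_t
ex_effective_limit(const struct ex_zone_limit *zl, uint32_t default_limit)
{
    return zl->limit == EX_ZONE_LIMIT_DEFAULT ? (int64_t) default_limit
                                              : zl->limit;
}

/* True if one more connection may be committed in this zone. */
static bool
ex_zone_has_room(const struct ex_zone_limit *zl, uint32_t default_limit)
{
    return (int64_t) zl->count < ex_effective_limit(zl, default_limit);
}
```

Because each zone carries its own `count`, two zones that both inherit the default are checked independently, which is the per-zone behavior the commit describes.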
Paolo Valerio	41f3f5b902	conntrack: Turn zl local limit into atomic. While at it, change struct zone_limit initialization in zone_limit_create() to use atomic init operations instead of relying on memset() which, although it correctly initializes the struct, is semantically not aware of atomics. Signed-off-by: Paolo Valerio <pvalerio@redhat.com> Signed-off-by: Aaron Conole <aconole@redhat.com>	2024-10-09 10:55:18 -04:00
Paolo Valerio	8ff40f3358	conntrack: Do not use atomics to report zones info. Atomics are not needed when reporting zone limits. Remove the restriction by defining a non-atomic common structure to report such data. The change also accesses atomics through the related operations, retrieving and reporting only the fields required by the requesting level instead of relying on a struct copy. Signed-off-by: Paolo Valerio <pvalerio@redhat.com> Signed-off-by: Aaron Conole <aconole@redhat.com>	2024-10-09 10:55:18 -04:00
Paolo Valerio	8ec7d55bfc	conntrack: Add zone limit coverage counter. Similarly to what is done for conntrack_full, add conntrack_zone_full, which is increased when new entries are not added because the zone limit has been reached. Signed-off-by: Paolo Valerio <pvalerio@redhat.com> Acked-by: Aaron Conole <aconole@redhat.com> Signed-off-by: Aaron Conole <aconole@redhat.com>	2024-10-09 10:55:18 -04:00
Aaron Conole	6c3074686a	conntrack: Disambiguate the cleaned count log. After `3d9c1b855a` ("conntrack: Replace timeout based expiration lists with rculists.") the conntrack cleanup log reports the number of connections it checked rather than the number of connections it cleaned. This patch includes the count of connections cleaned during expiration sweeping. Reported-by: Cheng Li <lic121@chinatelecom.cn> Suggested-by: Cheng Li <lic121@chinatelecom.cn> Fixes: `3d9c1b855a` ("conntrack: Replace timeout based expiration lists with rculists.") Acked-by: Simon Horman <horms@ovn.org> Signed-off-by: Aaron Conole <aconole@redhat.com> Signed-off-by: Eelco Chaudron <echaudro@redhat.com>	2024-09-11 15:34:39 +02:00
Mike Pattrick	3833506db0	conntrack: Fully initialize conn struct before insertion. In case packets are concurrently received in both directions, there's a chance that the ones in the reverse direction get received right after the connection gets added to the connection tracker but before some of the connection's fields are fully initialized. This could cause OVS to access potentially invalid, as the lookup may end up retrieving the wrong offsets during CONTAINER_OF(), or uninitialized memory. This may happen in case of regular NAT or all-zero SNAT. Fix it by initializing early the connections fields. Fixes: `1116459b3b` ("conntrack: Remove nat_conn introducing key directionality.") Reported-at: https://issues.redhat.com/browse/FDP-616 Acked-by: Simon Horman <horms@ovn.org> Signed-off-by: Mike Pattrick <mkp@redhat.com> Co-authored-by: Paolo Valerio <pvalerio@redhat.com> Signed-off-by: Paolo Valerio <pvalerio@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2024-05-13 21:00:58 +02:00
Xavier Simonart	4989dc7e0e	conntrack: Do not use {0} to initialize unions. In the following case: union ct_addr { unsigned int ipv4; struct in6_addr ipv6; }; union ct_addr zero_ip = {0}; The ipv6 field might not be properly initialized. For instance, clang 18.1.1 does not initialize the ipv6 field. Reported-at: https://issues.redhat.com/browse/FDP-608 Acked-by: Paolo Valerio <pvalerio@redhat.com> Signed-off-by: Xavier Simonart <xsimonar@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2024-05-13 20:54:53 +02:00
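The issue above comes from `{0}` initializing only the first union member explicitly, leaving any bytes beyond that member without a guaranteed value. A minimal sketch of an initialization that zeroes the whole object is shown below. The union is a stand-in: a 16-byte array replaces `struct in6_addr` to keep the example free of platform headers, and the `ex_` names are hypothetical. Whether the actual fix uses memset() is an assumption.

```c
#include <stdbool.h>
#include <string.h>

/* Stand-in for the union quoted in the commit message. */
union ex_ct_addr {
    unsigned int ipv4;
    unsigned char ipv6[16];
};

/* Returns an address with every byte cleared, not only the first member. */
static union ex_ct_addr
ex_ct_addr_zero(void)
{
    union ex_ct_addr addr;

    memset(&addr, 0, sizeof addr);
    return addr;
}

/* Byte-wise comparison against an all-zero object of the same size. */
static bool
ex_ct_addr_is_zero(const union ex_ct_addr *addr)
{
    static const unsigned char zeros[sizeof(union ex_ct_addr)];

    return memcmp(addr, zeros, sizeof *addr) == 0;
}
```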
Felix Huettner	139b564dbd	conntrack: Key connections by zone. Currently conntrack uses a single large cmap for all connections stored. This cmap contains all connections for all conntrack zones which are completely separate from each other. By separating each zone to its own cmap we can significantly optimize the performance when using multiple zones. The change fixes a similar issue as [1] where slow conntrack zone flush operations significantly slow down OVN router failover. The difference is just that this fix is used with DPDK, while [1] was when using the OVS kernel module. As we now need to store more cmaps, the memory usage of struct conntrack increases by 524280 bytes. Additionally we need 65535 cmaps with 128 bytes each. This leads to a total memory increase of around 10MB. Running "./ovstest test-conntrack benchmark 4 33554432 32 1" shows no real difference in the multithreading behaviour against a single zone. Running the new "./ovstest test-conntrack benchmark-zones" shows significant speedups as shown below. The values for "ct execute" are for acting on the complete zone with all its entries in total (so in the first case adding 10,000 new conntrack entries). All tests are run 1000 times. 
When running with 1,000 zones with 10,000 entries each we see the following results (all in microseconds): "./ovstest test-conntrack benchmark-zones 10000 1000 1000"
                         +------+--------+---------+---------+
                         |  Min |   Max  |  95%ile |   Avg   |
+------------------------+------+--------+---------+---------+
| ct execute (commit)    |      |        |         |         |
|            with commit | 2266 |   3505 | 2707.06 | 2592.06 |
|         without commit | 2411 |  12730 | 4432.50 | 2736.78 |
+------------------------+------+--------+---------+---------+
| ct execute (no commit) |      |        |         |         |
|            with commit |  699 |   1238 |  886.15 |  722.67 |
|         without commit |  700 |   3377 | 1934.42 |  803.53 |
+------------------------+------+--------+---------+---------+
| flush full zone        |      |        |         |         |
|            with commit |  619 |   1122 |  901.36 |  679.15 |
|         without commit |  618 | 105078 |   64591 | 2886.46 |
+------------------------+------+--------+---------+---------+
| flush empty zone       |      |        |         |         |
|            with commit |    0 |      5 |    1.00 |    0.64 |
|         without commit |   54 |  87469 |   64520 | 2172.25 |
+------------------------+------+--------+---------+---------+
When running with 10,000 zones with 1,000 entries each we see the following results (all in microseconds): "./ovstest test-conntrack benchmark-zones 1000 10000 1000"
                         +------+--------+---------+---------+
                         |  Min |   Max  |  95%ile |   Avg   |
+------------------------+------+--------+---------+---------+
| ct execute (commit)    |      |        |         |         |
|            with commit |  215 |    287 |  231.88 |  222.30 |
|         without commit |  214 |   1692 |  569.18 |  285.83 |
+------------------------+------+--------+---------+---------+
| ct execute (no commit) |      |        |         |         |
|            with commit |   68 |     97 |   74.69 |   70.09 |
|         without commit |   68 |    300 |  158.40 |   82.06 |
+------------------------+------+--------+---------+---------+
| flush full zone        |      |        |         |         |
|            with commit |   47 |    211 |   56.34 |   50.34 |
|         without commit |   48 |  96330 |   63392 |   63923 |
+------------------------+------+--------+---------+---------+
| flush empty zone       |      |        |         |         |
|            with commit |    0 |      1 |    1.00 |    0.44 |
|         without commit |    3 | 109728 |   63923 | 3629.44 |
+------------------------+------+--------+---------+---------+
Comparing the averages we see:
* a moderate performance improvement for conntrack_execute with or without committing of around 6% to 23%
* a significant performance improvement for flushing a full zone of around 75% to 99%
* an even more significant improvement for flushing empty zones since we no longer need to check any unrelated connections
[1] `9ec849e8aa` Signed-off-by: Felix Huettner <felix.huettner@mail.schwarz> Signed-off-by: Simon Horman <horms@ovn.org>	2024-05-03 13:03:40 +01:00
Paolo Valerio	b5e6829254	conntrack: Do not use icmp reverse helper for icmpv6. In the flush tuple code path, while populating the conn_key, reverse_icmp_type() gets called for both icmp and icmpv6 cases, while, depending on the proto, its respective helper should be called, instead. The above leads to an abort: [...] __GI_abort () at abort.c:79 reverse_icmp_type (type=128 '\200') at lib/conntrack.c:1795 tuple_to_conn_key (...) at lib/conntrack.c:2590 in conntrack_flush_tuple (...) at lib/conntrack.c:2787 in dpif_netdev_ct_flush (...) at lib/dpif-netdev.c:9618 ct_dpif_flush_tuple (...) at lib/ct-dpif.c:331 ct_dpif_flush (...) at lib/ct-dpif.c:361 dpctl_flush_conntrack (...) at lib/dpctl.c:1797 [...] Fix it by calling reverse_icmp6_type() when needed. Furthermore, self tests have been modified in order to exercise and check this behavior. Fixes: `271e48a0e2` ("conntrack: Support conntrack flush by ct 5-tuple") Reported-at: https://issues.redhat.com/browse/FDP-447 Signed-off-by: Paolo Valerio <pvalerio@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2024-04-02 22:13:38 +02:00
Xavier Simonart	6c082a8310	conntrack: Fix flush not flushing all elements. On netdev datapath, when a ct element was cleaned, the cmap could shrink, potentially causing some elements to be skipped in the flush iteration. Fixes: `967bb5c5cd` ("conntrack: Add rcu support.") Signed-off-by: Xavier Simonart <xsimonar@redhat.com> Acked-by: Mike Pattrick <mkp@redhat.com> Signed-off-by: Simon Horman <horms@ovn.org>	2024-03-06 17:50:27 +00:00
Paolo Valerio	afdc1171a8	conntrack: Handle persistent selection for IP addresses. The patch, when 'persistent' flag is specified, makes the IP selection in a range persistent across reboots. Signed-off-by: Paolo Valerio <pvalerio@redhat.com> Acked-by: Aaron Conole <aconole@redhat.com> Signed-off-by: Simon Horman <horms@ovn.org>	2024-02-21 10:12:42 +00:00
Paolo Valerio	99413ec261	conntrack: Handle random selection for port ranges. The userspace conntrack only supported hash for port selection. With the patch, both userspace and kernel datapath support the random flag. The default behavior remains the same, that is, if no flags are specified, hash is selected. Signed-off-by: Paolo Valerio <pvalerio@redhat.com> Acked-by: Aaron Conole <aconole@redhat.com> Signed-off-by: Simon Horman <horms@ovn.org>	2024-02-21 10:12:04 +00:00
Viacheslav Galaktionov	8abe32f957	conntrack: Use helpers from committed connections. When a packet hits a flow rule without an explicitly specified helper, OvS has to rely on automatic application layer gateway detection to find related connections. This works as long as services are running on their standard ports, e.g. when FTP servers use TCP port 21. However, sometimes it's necessary to run services on non-standard ports. In that case, there is no way for OvS to guess which protocol is used within a given flow. Of course, this means that no related connections can be recognized. When a connection is committed with a particular helper, it's reasonable to assume this helper will be used in subsequent CT actions, as long as they don't override it. Achieve this behaviour by using the committed connection's helper when a flow rule does not specify one. Signed-off-by: Viacheslav Galaktionov <viacheslav.galaktionov@arknetworks.am> Acked-by: Ivan Malov <ivan.malov@arknetworks.am> Signed-off-by: Aaron Conole <aconole@redhat.com>	2024-01-10 16:16:08 -05:00
Viacheslav Galaktionov	14ef8b451f	lib/conntrack: Only use given packet in protocol detection. The current protocol detection logic relies on two pieces of metadata passed as arguments: tp_src and tp_dst, which represent the L4 source and destination port numbers from the flow that triggered the current flow rule first, and was responsible for creating the current DP flow. Since multiple network flows of many different kinds, potentially using different protocols on all layers, can be processed by one flow rule, using the metadata of some unrelated flow might lead to unexpected results. For example, ICMP type and code can be interpreted as TCP source and destination ports. This can confuse the code responsible for the helper selection, leading to errors in traffic handling and incorrect detection of related flows. One of the easiest ways to fix this problem is to simply remove the tp_src and tp_dst parameters from the picture. The current code base has no good use for them. The helper selection logic was based on these values and therefore needs to be changed. Ensure that the helper specified in a flow rule is used, given it is compatible with the L4 protocol of the packet. When a flow rule does not specify a helper, one can still be picked using the given packet's metadata like TCP/UDP ports. Signed-off-by: Viacheslav Galaktionov <viacheslav.galaktionov@arknetworks.am> Signed-off-by: Aaron Conole <aconole@redhat.com>	2024-01-10 16:16:08 -05:00
Ales Musil	8f4b86237b	dpctl: Allow the default CT zone limit to be deleted. Add optional argument to dpctl ct-del-limits called "default", which allows to remove the default limit making it effectively system default. Signed-off-by: Ales Musil <amusil@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2023-12-05 21:53:52 +01:00
Ales Musil	4b9eb061b1	ct-dpif: Handle default zone limit the same way as other limits. Internally handle default CT zone limit as other limits that can be passed via the list with special value -1. Currently, the -1 is treated by both datapaths as default, add static asserts to make sure that this remains the case in the future. This allows us to easily delete the default zone limit. Signed-off-by: Ales Musil <amusil@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2023-12-05 20:42:04 +01:00
Peng He	1116459b3b	conntrack: Remove nat_conn introducing key directionality. The patch avoids the extra allocation for nat_conn. Currently, when doing NAT, the userspace conntrack will use an extra conn for the two directions in a flow. However, each conn has actually the two keys for both orig and rev directions. This patch introduces a key_node[CT_DIRS] member as per Aaron's suggestion in the conn which consists of a key, direction, and a cmap_node for hash lookup so addressing the feedback received by the original patch [0]. With this adjustment, we also remove the assertion that connections in the table are DEFAULT while updating connection state and/or removing connections. [0] https://patchwork.ozlabs.org/project/openvswitch/patch/20201129033255.64647-2-hepeng.0320@bytedance.com/ Reported-by: Michael Plato <michael.plato@tu-berlin.de> Reported-at: https://mail.openvswitch.org/pipermail/ovs-discuss/2022-September/052065.html Signed-off-by: Peng He <hepeng.0320@bytedance.com> Co-authored-by: Paolo Valerio <pvalerio@redhat.com> Signed-off-by: Paolo Valerio <pvalerio@redhat.com> Tested-by: Frode Nordahl <frode.nordahl@canonical.com> Acked-by: Ilya Maximets <i.maximets@ovn.org> Acked-by: Aaron Conole <aconole@redhat.com> Signed-off-by: Aaron Conole <aconole@redhat.com>	2023-08-31 13:41:08 -04:00
Paolo Valerio	501f665a5a	conntrack: Extract l4 information for SCTP. Since `a27d70a89` ("conntrack: add generic IP protocol support") all the unrecognized IP protocols get handled using ct_proto_other ops and are managed as L3 using 3 tuples. This patch stores L4 information for SCTP in the conn_key so that multiple conn instances, instead of one with ports zeroed, will be created when there are multiple SCTP connections between two hosts. It also performs crc32c check when not offloaded, and adds SCTP to pat_enabled. With this patch, given two SCTP associations between two hosts, tracking the connection will result in: sctp,orig=(src=10.1.1.2,dst=10.1.1.1,sport=55884,dport=5201), reply=(src=10.1.1.1,dst=10.1.1.2,sport=5201,dport=12345),zone=1 sctp,orig=(src=10.1.1.2,dst=10.1.1.1,sport=59874,dport=5202), reply=(src=10.1.1.1,dst=10.1.1.2,sport=5202,dport=12346),zone=1 instead of: sctp,orig=(src=10.1.1.2,dst=10.1.1.1,sport=0,dport=0), reply=(src=10.1.1.1,dst=10.1.1.2,sport=0,dport=0),zone=1 Signed-off-by: Paolo Valerio <pvalerio@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2023-07-13 21:22:41 +02:00
Paolo Valerio	9b4d2ad8e8	conntrack: Allow to dump userspace conntrack expectations. The patch introduces a new command, ovs-appctl dpctl/dump-conntrack-exp, that dumps the existing expectations for the userspace ct. Signed-off-by: Paolo Valerio <pvalerio@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2023-06-29 22:20:43 +02:00
Mike Pattrick	3337e6d91c	userspace: Enable L4 checksum offloading by default. The netdev receiving packets is supposed to provide the flags indicating if the L4 checksum was verified and it is OK or BAD, otherwise the stack will check when appropriate by software. If the packet comes with good checksum, then postpone the checksum calculation to the egress device if needed. When encapsulating a packet with that flag, set the checksum of the inner L4 header since that is not yet supported. Calculate the L4 checksum when the packet is going to be sent over a device that doesn't support the feature. Linux tap devices allow enabling L3 and L4 offload, so this patch enables the feature. However, Linux socket interface remains disabled because the API doesn't allow enabling those two features without enabling TSO too. Signed-off-by: Flavio Leitner <fbl@sysclose.org> Co-authored-by: Flavio Leitner <fbl@sysclose.org> Signed-off-by: Mike Pattrick <mkp@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2023-06-15 23:50:30 +02:00
Mike Pattrick	5d11c47d3e	userspace: Enable IP checksum offloading by default. The netdev receiving packets is supposed to provide the flags indicating if the IP checksum was verified and it is GOOD or BAD, otherwise the stack will check when appropriate by software. If the packet comes with good checksum, then postpone the checksum calculation to the egress device if needed. When encapsulating a packet with that flag, set the checksum of the inner IP header since that is not yet supported. Calculate the IP checksum when the packet is going to be sent over a device that doesn't support the feature. Linux devices don't support IP checksum offload alone, so the support is not enabled. Signed-off-by: Flavio Leitner <fbl@sysclose.org> Co-authored-by: Flavio Leitner <fbl@sysclose.org> Signed-off-by: Mike Pattrick <mkp@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2023-06-15 23:49:51 +02:00
Paolo Valerio	9fa612959c	ovs-dpctl: Add new command dpctl/ct-[sg]et-sweep-interval. Since `3d9c1b855a` ("conntrack: Replace timeout based expiration lists with rculists.") the sweep interval changed as well as the constraints related to the sweeper. Being able to change the default reschedule time may be convenient in some conditions, like debugging. This patch introduces new commands allowing to get and set the sweep interval in ms. Signed-off-by: Paolo Valerio <pvalerio@redhat.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2023-04-06 22:59:25 +02:00
Nobuhiro MIKI	349112f975	flow: Support rt_hdr in parse_ipv6_ext_hdrs(). Checks whether IPPROTO_ROUTING exists in the IPv6 extension headers. If it exists, the first address is retrieved. If NULL is specified for "frag_hdr" and/or "rt_hdr", those addresses in the header are not reported to the caller. Of course, "frag_hdr" and "rt_hdr" are properly parsed inside this function. Signed-off-by: Nobuhiro MIKI <nmiki@yahoo-corp.jp> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2023-03-29 21:41:28 +02:00
Liang Mancang	b0d9a1efcc	conntrack: Fix conntrack_clean may access the same exp_list each time. When an exp_list contains more nodes than the clean_end number, and these nodes will not expire immediately, every call to conntrack_clean uses the same next_sweep to get the exp_list. Instead, we should add i every time after we call ct_sweep. Fixes: `3d9c1b855a` ("conntrack: Replace timeout based expiration lists with rculists.") Acked-by: Paolo Valerio <pvalerio@redhat.com> Signed-off-by: Liang Mancang <liangmc1@chinatelecom.cn> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2023-02-21 21:02:20 +01:00
Ales Musil	0a7587034d	conntrack: Properly unNAT inner header of related traffic. The inner header was not handled properly. Simplify the code which allows proper handling of the inner headers. Reported-at: https://bugzilla.redhat.com/2137754 Acked-by: Paolo Valerio <pvalerio@redhat.com> Signed-off-by: Ales Musil <amusil@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2023-02-13 19:17:18 +01:00
Paolo Valerio	a3848d98e1	conntrack: Show parent key if present. Similarly to what happens when CTA_TUPLE_MASTER is present in a ct netlink dump, add the ability to print out the parent key to the userspace implementation as well. Signed-off-by: Paolo Valerio <pvalerio@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2022-11-02 19:49:07 +01:00
Ilya Maximets	b159525903	conntrack: Check for expiration before comparing the keys during the lookup. This could save some costly key comparison miss, especially in the case there are many expired connections waiting for the sweeper to evict them. Acked-by: Aaron Conole <aconole@redhat.com> Co-authored-by: Paolo Valerio <pvalerio@redhat.com> Signed-off-by: Paolo Valerio <pvalerio@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2022-07-13 00:50:23 +02:00
Gaetan Rivet	78387e88bd	conntrack: Use an atomic conn expiration value. A lock is taken during conn_lookup() to check whether a connection is expired before returning it. This lock can have some contention. Even though this lock ensures a consistent sequence of writes, it does not imply a specific order. A ct_clean thread taking the lock first could read a value that would be updated immediately after by a PMD waiting on the same lock, just as well as the inverse order. As such, the expiration time can be stale anytime it is read. In this context, using an atomic will ensure the same guarantees for either writes or reads, i.e. writes are consistent and reads are not undefined behaviour. Reading an atomic is however less costly than taking and releasing a lock. Signed-off-by: Gaetan Rivet <grive@u256.net> Signed-off-by: Paolo Valerio <pvalerio@redhat.com> Acked-by: Aaron Conole <aconole@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2022-07-13 00:50:23 +02:00
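The trade-off described above, reading the expiration through an atomic rather than under a lock, can be sketched with standard C11 atomics. OVS uses its own atomics wrappers, so `<stdatomic.h>` is a swapped-in substitute here, and the struct, the `ex_` names, and the `now >= expiration` comparison are assumptions made for illustration.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical connection with only the field relevant to the example. */
struct ex_conn {
    atomic_llong expiration_ms;
};

static void
ex_conn_init(struct ex_conn *conn, long long expiration_ms)
{
    atomic_init(&conn->expiration_ms, expiration_ms);
}

/* Writer side, e.g. a packet-processing thread refreshing the timeout. */
static void
ex_conn_set_expiration(struct ex_conn *conn, long long expiration_ms)
{
    atomic_store_explicit(&conn->expiration_ms, expiration_ms,
                          memory_order_relaxed);
}

/* Reader side: no lock is taken.  The value may already be stale when it
 * is used, which the commit message argues was also true with the lock. */
static bool
ex_conn_expired(struct ex_conn *conn, long long now_ms)
{
    return now_ms >= atomic_load_explicit(&conn->expiration_ms,
                                          memory_order_relaxed);
}
```

Relaxed ordering is used because the sketch only needs each read and write of the value to be indivisible; it does not order the access against other memory operations.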
Gaetan Rivet	3d9c1b855a	conntrack: Replace timeout based expiration lists with rculists. This patch aims to replace the expiration lists as, due to the way they are used, besides being a source of contention, they have a known issue when used with non-default policies for different zones that could lead to retaining expired connections potentially for a long time. This patch replaces them with an array of rculist used to distribute all the newly created connections in order to, during the sweeping phase, scan them without locking, and evict the expired connections only locking during the actual removal. This allows to reduce the contention introduced by the pushback performed at every packet update, also solving the issue related to zones and timeout policies. Signed-off-by: Gaetan Rivet <grive@u256.net> Co-authored-by: Paolo Valerio <pvalerio@redhat.com> Signed-off-by: Paolo Valerio <pvalerio@redhat.com> Acked-by: Aaron Conole <aconole@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2022-07-13 00:50:23 +02:00
Gaetan Rivet	4847baf4a9	conntrack-tp: Use a cmap to store timeout policies. Multiple lookups are done to stored timeout policies, each time blocking the global 'ct_lock'. This is usually not necessary and it should be acceptable to get policy updates slightly delayed (by one RCU sync at most). Using a CMAP reduces multiple lock taking and releasing in the connection insertion path. Signed-off-by: Gaetan Rivet <grive@u256.net> Reviewed-by: Eli Britstein <elibr@nvidia.com> Acked-by: William Tu <u9012063@gmail.com> Signed-off-by: Paolo Valerio <pvalerio@redhat.com> Acked-by: Aaron Conole <aconole@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2022-07-12 20:44:46 +02:00
Gaetan Rivet	6edc278c85	conntrack: Use a cmap to store zone limits. Change the data structure from hmap to cmap for zone limits. As they are shared amongst multiple conntrack users, multiple readers want to check the current zone limit state before progressing in their processing. Using a CMAP allows doing lookups without taking the global 'ct_lock', thus reducing contention. Signed-off-by: Gaetan Rivet <grive@u256.net> Reviewed-by: Eli Britstein <elibr@nvidia.com> Signed-off-by: Paolo Valerio <pvalerio@redhat.com> Acked-by: Aaron Conole <aconole@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2022-07-12 20:44:46 +02:00
Ilya Maximets	4e1e1e189f	conntrack: Fix incorrect bit shift while hashing nat range. 'max_port' is 16bit field, shift expands it to 'int', not unsigned int. lib/conntrack.c:2245:41: runtime error: left shift of 34568 by 16 places cannot be represented in type 'int'. 0 0xec45f4 in nat_range_hash lib/conntrack.c:2245:41 1 0xec45f4 in nat_get_unique_tuple lib/conntrack.c:2422:21 2 0xec45f4 in conn_not_found lib/conntrack.c:1035:32 3 0xeaf0a5 in process_one lib/conntrack.c:1407:20 4 0xea9390 in conntrack_execute lib/conntrack.c:1465:13 5 0x839530 in dp_execute_cb lib/dpif-netdev.c:9060:9 6 0x9909cc in odp_execute_actions lib/odp-execute.c:868:17 7 0x830946 in dp_netdev_execute_actions lib/dpif-netdev.c:9106:5 8 0x830946 in handle_packet_upcall lib/dpif-netdev.c:8294:5 9 0x82ea5e in fast_path_processing lib/dpif-netdev.c:8390:25 10 0x7ed87f in dp_netdev_input__ lib/dpif-netdev.c:8479:9 11 0x7eb5fc in dp_netdev_input lib/dpif-netdev.c:8517:5 12 0x81dada in dp_netdev_process_rxq_port lib/dpif-netdev.c:5329:19 13 0x7f0063 in dpif_netdev_run lib/dpif-netdev.c:6664:25 14 0x85f036 in dpif_run lib/dpif.c:467:16 15 0x61833a in type_run ofproto/ofproto-dpif.c:366:9 16 0x5c210e in ofproto_type_run ofproto/ofproto.c:1822:31 17 0x565db2 in bridge_run__ vswitchd/bridge.c:3245:9 18 0x562f82 in bridge_run vswitchd/bridge.c:3310:5 19 0x59a98c in main vswitchd/ovs-vswitchd.c:129:9 20 0x7f8864c3acf2 in __libc_start_main (/lib64/libc.so.6+0x3acf2) 21 0x47e60d in _start (vswitchd/ovs-vswitchd+0x47e60d) Fixes: `92edd073ce` ("conntrack: Hash entire NAT data structure in nat_range_hash().") Acked-by: Paolo Valerio <pvalerio@redhat.com> Acked-by: Aaron Conole <aconole@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2022-06-24 23:51:31 +02:00
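The undefined behaviour reported above comes from C's integer promotions: a 16-bit operand is promoted to signed `int` before the shift, so shifting a value of 0x8000 or more left by 16 overflows. A minimal sketch of the usual remedy, casting to an unsigned 32-bit type before shifting, follows. `pack_port_range()` is a hypothetical helper written for this example; it is not `nat_range_hash()` and does not claim to reproduce the exact change made in the commit.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: packs a port range into one 32-bit word, with
 * 'max_port' in the upper half and 'min_port' in the lower half. */
static inline uint32_t
pack_port_range(uint16_t min_port, uint16_t max_port)
{
    assert(min_port <= max_port);

    /* Writing 'max_port << 16' would promote 'max_port' to signed int and
     * overflow for values >= 0x8000 (such as 34568).  Casting first makes
     * the shift happen in uint32_t, where it is well defined. */
    return ((uint32_t) max_port << 16) | min_port;
}
```

With the value from the sanitizer report, `pack_port_range(1, 34568)` yields `0x87080001`.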
wenxu	165f5fbb5e	conntrack: Limit port clash resolution attempts. In case almost or all available ports are taken, clash resolution can take a very long time, resulting in pmd lockup. This can happen when many to-be-natted hosts connect to same destination:port (e.g. a proxy) and all connections pass the same SNAT. Pick a random offset in the acceptable range, then try ever smaller number of adjacent port numbers, until either the limit is reached or a useable port was found. This results in at most 248 attempts (128 + 64 + 32 + 16 + 8, i.e. 4 restarts with new search offset) instead of 64000+. Signed-off-by: wenxu <wenxu@chinatelecom.cn> Acked-by: Paolo Valerio <pvalerio@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2022-06-07 14:50:36 +02:00
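The bounded search described above can be sketched as follows: start at a random offset, probe a run of adjacent ports, and on failure halve the run length and restart from a new random offset. All names here (`pick_free_port()`, `port_in_use_fn`, the two demo callbacks) are hypothetical, and `rand()` stands in for whatever random source the real code uses. With a large range and every port taken, the loop probes 128 + 64 + 32 + 16 + 8 = 248 ports before giving up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical callback: returns true if 'port' is already taken. */
typedef bool (*port_in_use_fn)(uint16_t port, void *aux);

/* Searches [min, max] for a free port with a bounded number of probes.
 * Stores the port in '*out' and returns true on success.  '*n_tried' is
 * incremented once per probe so that callers can observe the bound. */
static bool
pick_free_port(uint16_t min, uint16_t max, port_in_use_fn in_use, void *aux,
               uint16_t *out, unsigned int *n_tried)
{
    assert(min <= max);

    uint32_t range = (uint32_t) max - min + 1;
    uint32_t attempts = range < 128 ? range : 128;

    for (;;) {
        /* New random starting point for every round. */
        uint32_t off = (uint32_t) rand() % range;

        for (uint32_t i = 0; i < attempts; i++, off++) {
            uint16_t port = (uint16_t) (min + off % range);

            (*n_tried)++;
            if (!in_use(port, aux)) {
                *out = port;
                return true;
            }
        }

        /* Stop once the whole range was covered or the run got too short;
         * otherwise halve the run length and try again elsewhere. */
        if (attempts >= range || attempts < 16) {
            return false;
        }
        attempts /= 2;
    }
}

/* Demo callbacks: every port taken, and no port taken. */
static bool
always_busy(uint16_t port, void *aux)
{
    (void) port;
    (void) aux;
    return true;
}

static bool
never_busy(uint16_t port, void *aux)
{
    (void) port;
    (void) aux;
    return false;
}
```

Halving the run on each restart keeps the worst case small while still sampling several distinct regions of the port range.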
wenxu	c608ace71d	conntrack: Remove the IP iterations in nat_get_unique_l4. Remove the IP iterations and just pick the IP address with the hash based on the least-used src-ip/dst-ip/proto triple. Signed-off-by: wenxu <wenxu@chinatelecom.cn> Acked-by: Paolo Valerio <pvalerio@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2022-06-07 14:50:00 +02:00
Adrian Moreno	745c80f52c	hindex: remove the next variable in safe loops. Using SHORT version of the _SAFE loops makes the code cleaner and less error prone. So, use the SHORT version and remove the extra variable when possible for HINDEX__SAFE. In order to be able to use both long and short versions without changing the name of the macro for all the clients, overload the existing name and select the appropriate version depending on the number of arguments. Acked-by: Dumitru Ceara <dceara@redhat.com> Acked-by: Eelco Chaudron <echaudro@redhat.com> Signed-off-by: Adrian Moreno <amorenoz@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2022-03-30 16:59:03 +02:00
Adrian Moreno	e9bf5bffb0	list: use short version of safe loops if possible. Using the SHORT version of the *_SAFE loops makes the code cleaner and less error-prone. So, use the SHORT version and remove the extra variable when possible. In order to be able to use both long and short versions without changing the name of the macro for all the clients, overload the existing name and select the appropriate version depending on the number of arguments. Acked-by: Dumitru Ceara <dceara@redhat.com> Acked-by: Eelco Chaudron <echaudro@redhat.com> Signed-off-by: Adrian Moreno <amorenoz@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2022-03-30 16:59:02 +02:00
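The overloading trick mentioned in the two commits above (one macro name that selects a long or short variant based on the number of arguments) is commonly built on a variadic argument-counting helper. The sketch below shows that general pattern on a toy singly linked list. `EACH_SAFE`, `EACH_SAFE_LONG`, `EACH_SAFE_SHORT`, `GET_MACRO` and `struct node` are made up for this example and are not the OVS macros, which differ in detail.

```c
#include <assert.h>
#include <stddef.h>

struct node {
    struct node *next;
    int value;
};

/* Long form: the caller supplies the variable that holds the successor. */
#define EACH_SAFE_LONG(N, NEXT, HEAD) \
    for ((N) = (HEAD); (N) && ((NEXT) = (N)->next, 1); (N) = (NEXT))

/* Short form: the successor lives in a variable scoped to the loop. */
#define EACH_SAFE_SHORT(N, HEAD)                                        \
    for (struct node *next__ = ((N) = (HEAD)) ? (N)->next : NULL;       \
         (N);                                                           \
         (N) = next__, next__ = (N) ? (N)->next : NULL)

/* Picks the fourth argument; the caller's arguments shift the candidate
 * names so that three arguments select LONG and two select SHORT. */
#define GET_MACRO(_1, _2, _3, NAME, ...) NAME
#define EACH_SAFE(...) \
    GET_MACRO(__VA_ARGS__, EACH_SAFE_LONG, EACH_SAFE_SHORT, unused)(__VA_ARGS__)

/* Sums the list while unlinking every node, using the short form. */
static int
drain_short(struct node *head)
{
    struct node *n;
    int sum = 0;

    assert(head != NULL);
    EACH_SAFE (n, head) {
        sum += n->value;
        n->next = NULL;     /* Safe: the successor was saved already. */
    }
    return sum;
}

/* Same operation with the long form and an explicit 'next' variable. */
static int
drain_long(struct node *head)
{
    struct node *n, *next;
    int sum = 0;

    assert(head != NULL);
    EACH_SAFE (n, next, head) {
        sum += n->value;
        n->next = NULL;
    }
    return sum;
}
```

Because both call sites spell the macro `EACH_SAFE`, existing three-argument users keep compiling while new code can drop the extra variable.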
wenxu	545b64415d	conntrack: Prefer dst port range during unique tuple search. This commit splits the nested loop used to search the unique ports for the reverse tuple. It affects only the dnat action, giving more precedence to the dnat range, similarly to the kernel dp, instead of searching through the default ephemeral source range for each destination port. Acked-by: Paolo Valerio <pvalerio@redhat.com> Signed-off-by: wenxu <wenxu@ucloud.cn> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2022-03-04 20:16:37 +01:00
wenxu	ec85f5325f	conntrack: Select correct sport range for well-known origin sport. Like the kernel datapath, the sport NAT range for a well-known origin sport should be limited to the well-known ports. Acked-by: Paolo Valerio <pvalerio@redhat.com> Signed-off-by: wenxu <wenxu@ucloud.cn> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2022-03-04 20:14:29 +01:00
wenxu	a2fa8b2895	conntrack: Remove the nat_action_info from the conn. Only 'nat_action_info->nat_action' is used for packet forwarding. Other items such as min/max_ip/port are used only when creating new connections. No need to store the whole nat_action_info in conn. Signed-off-by: wenxu <wenxu@ucloud.cn> Acked-by: Gaetan Rivet <grive@u256.net> Acked-by: Michael Santana <msantana@redhat.com> Acked-by: Aaron Conole <aconole@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2021-09-16 00:01:47 +02:00
Gaetan Rivet	b889d5dcc8	conntrack: Init hash basis first at creation. The 'hash_basis' field is used sometimes during sub-systems init routine. It will be 0 by default before randomization. Sub-systems would then init some nodes with incorrect hash values. The timeout policies module is affected, making the default policy being referenced using an incorrect hash value. Fixes: `2078901a4c` ("userspace: Add conntrack timeout policy support.") Signed-off-by: Gaetan Rivet <grive@u256.net> Reviewed-by: Eli Britstein <elibr@nvidia.com> Acked-by: William Tu <u9012063@gmail.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2021-07-09 22:23:59 +02:00
Paolo Valerio	61e48c2d1d	conntrack: Handle SNAT with all-zero IP address. This patch introduces for the userspace datapath the handling of rules like the following: ct(commit,nat(src=0.0.0.0),...) Kernel datapath already handle this case that is particularly handy in scenarios like the following: Given A: 10.1.1.1, B: 192.168.2.100, C: 10.1.1.2 A opens a connection toward B on port 80 selecting as source port 10000. B's IP gets dnat'ed to C's IP (10.1.1.1:10000 -> 192.168.2.100:80). This will result in: tcp,orig=(src=10.1.1.1,dst=192.168.2.100,sport=10000,dport=80), reply=(src=10.1.1.2,dst=10.1.1.1,sport=80,dport=10000), protoinfo=(state=ESTABLISHED) A now tries to establish another connection with C using source port 10000, this time using C's IP address (10.1.1.1:10000 -> 10.1.1.2:80). This second connection, if processed by conntrack with no SNAT/DNAT involved, collides with the reverse tuple of the first connection, so the entry for this valid connection doesn't get created. With this commit, and adding a SNAT rule with 0.0.0.0 for 10.1.1.1:10000 -> 10.1.1.2:80 will allow to create the conn entry: tcp,orig=(src=10.1.1.1,dst=10.1.1.2,sport=10000,dport=80), reply=(src=10.1.1.2,dst=10.1.1.1,sport=80,dport=10001), protoinfo=(state=ESTABLISHED) tcp,orig=(src=10.1.1.1,dst=192.168.2.100,sport=10000,dport=80), reply=(src=10.1.1.2,dst=10.1.1.1,sport=80,dport=10000), protoinfo=(state=ESTABLISHED) The issue exists even in the opposite case (with A trying to connect to C using B's IP after establishing a direct connection from A to C). This commit refactors the relevant function in a way that both of the previously mentioned cases are handled as well. Suggested-by: Eelco Chaudron <echaudro@redhat.com> Signed-off-by: Paolo Valerio <pvalerio@redhat.com> Acked-by: Gaetan Rivet <grive@u256.net> Acked-by: Aaron Conole <aconole@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2021-07-08 23:49:34 +02:00
Paolo Valerio	1e19f9aa26	conntrack: Handle already natted packets. When a packet gets dnatted and then recirculated, it could be possible that it matches another rule that performs another nat action. The kernel datapath handles this situation turning to a no-op the second nat action, so natting only once the packet. In the userspace datapath instead, when the ct action gets executed, an initial lookup of the translated packet fails to retrieve the connection related to the packet, leading to the creation of a new entry in ct for the src nat action with a subsequent failure of the connection establishment. with the following flows: table=0,priority=30,in_port=1,ip,nw_dst=192.168.2.100,actions=ct(commit,nat(dst=10.1.1.2:80),table=1) table=0,priority=20,in_port=2,ip,actions=ct(nat,table=1) table=0,priority=10,ip,actions=resubmit(,2) table=0,priority=10,arp,actions=NORMAL table=0,priority=0,actions=drop table=1,priority=5,ip,actions=ct(commit,nat(src=10.1.1.240),table=2) table=2,in_port=ovs-l0,actions=2 table=2,in_port=ovs-r0,actions=1 Establishing a connection from 10.1.1.1 to 192.168.2.100 the outcome is: tcp,orig=(src=10.1.1.1,dst=10.1.1.2,sport=4000,dport=80), reply=(src=10.1.1.2,dst=10.1.1.240,sport=80,dport=4000), protoinfo=(state=ESTABLISHED) tcp,orig=(src=10.1.1.1,dst=192.168.2.100,sport=4000,dport=80), reply=(src=10.1.1.2,dst=10.1.1.1,sport=80,dport=4000), protoinfo=(state=ESTABLISHED) With this patch applied the outcome is: tcp,orig=(src=10.1.1.1,dst=192.168.2.100,sport=4000,dport=80), reply=(src=10.1.1.2,dst=10.1.1.1,sport=80,dport=4000), protoinfo=(state=ESTABLISHED) The patch performs, for already natted packets, a lookup of the reverse key in order to retrieve the related entry, it also adds a test case that besides testing the scenario ensures that the other ct actions are executed. 
Reported-by: Dumitru Ceara <dceara@redhat.com> Signed-off-by: Paolo Valerio <pvalerio@redhat.com> Acked-by: Dumitru Ceara <dceara@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2021-07-08 23:49:34 +02:00
Paolo Valerio	2c597c8900	conntrack: add coverage counters for L3 bad checksum. similarly to what already exists for L4, add conntrack_l3csum_err and ipf_l3csum_err for L3. Received packets with L3 bad checksum will increase respectively ipf_l3csum_err if they are fragments and conntrack_l3csum_err otherwise. Although the patch basically covers IPv4, the names are kept generic. Signed-off-by: Paolo Valerio <pvalerio@redhat.com> Acked-by: Eelco Chaudron <echaudro@redhat.com> Reviewed-by: Aaron Conole <aconole@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>	2021-06-30 23:56:03 +02:00

162 Commits