When an exp_list contains more nodes than clean_end and these nodes
will not expire immediately, then every time we call conntrack_clean(),
it uses the same next_sweep to get the exp_list. Instead, we should
increment i every time after we call ct_sweep().
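The intended control flow can be sketched as follows (the names and the
number of lists are illustrative, not the actual lib/conntrack.c code):

```c
#include <assert.h>

#define N_EXP_LISTS 8   /* hypothetical number of expiration lists. */

/* Hypothetical stand-in: sweep one expiration list. */
static void ct_sweep_one(int list_idx, long now) { (void) list_idx; (void) now; }

/* The buggy version left next_sweep unchanged when nodes did not expire,
 * so every call swept the same list again.  The fix advances the index
 * after every sweep, wrapping around the array of lists. */
static int
conntrack_clean(int next_sweep, long now)
{
    ct_sweep_one(next_sweep, now);
    return (next_sweep + 1) % N_EXP_LISTS;   /* Advance unconditionally. */
}
```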
Fixes: 3d9c1b855a ("conntrack: Replace timeout based expiration lists with rculists.")
Acked-by: Paolo Valerio <pvalerio@redhat.com>
Signed-off-by: Liang Mancang <liangmc1@chinatelecom.cn>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
IPFIX templates have to be sent for each Observation Domain ID.
Currently, a timer is kept at each dpif_ipfix_exporter to send them.
This works fine for per-bridge sampling, where there is only one
Observation Domain ID per exporter. However, it does not work for
per-flow sampling, where more than one Observation Domain ID can be
specified by the controller. In this case, ovs-vswitchd will only send
template information for one (arbitrary) Domain ID.
Fix per-flow sampling by using an hmap to keep a timer for each
Observation Domain ID.
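A minimal sketch of the idea, simplified to a linked list here (the
actual patch uses an hmap, and all names below are hypothetical):

```c
#include <assert.h>
#include <stdlib.h>

/* One template-send timer per Observation Domain ID, instead of a
 * single timer per exporter. */
struct domain_timer {
    unsigned int domain_id;
    long next_template_tx;      /* When templates are due for this domain. */
    struct domain_timer *next;
};

static struct domain_timer *
timer_lookup(struct domain_timer *head, unsigned int domain_id)
{
    for (; head; head = head->next) {
        if (head->domain_id == domain_id) {
            return head;
        }
    }
    return NULL;
}

/* Find or create the timer for a domain, so every domain gets its own
 * template schedule rather than one arbitrary domain winning. */
static struct domain_timer *
timer_ensure(struct domain_timer **head, unsigned int domain_id)
{
    struct domain_timer *t = timer_lookup(*head, domain_id);
    if (!t) {
        t = calloc(1, sizeof *t);
        t->domain_id = domain_id;
        t->next = *head;
        *head = t;
    }
    return t;
}
```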
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
I have not been active in OVS development for long enough that I
should move to emeritus status.
Signed-off-by: Russell Bryant <russell@ovn.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
This patch adds new ovs-ctl options to pass umask configuration,
allowing OVS daemons to set the requested socket permissions for the
group. The previous behaviour (when used with a systemd service unit)
created sockets with a 0750 permissions mask (group has no write
permission).
Write permission for the group is reasonable in the use case where
ovs-vswitchd or ovsdb-server runs as a non-privileged user:group (say,
openvswitch:openvswitch) and a process running as another
non-privileged user needs to access the unix socket. In this case,
the administrator adds that user to the openvswitch group and can then
connect to OVS sockets from a process running under that user.
Two new ovs-ctl options, --ovsdb-server-umask and --ovs-vswitchd-umask,
were added to manage umask values for the respective daemons. This is
useful for systemd users: both the ovs-vswitchd and ovsdb-server
systemd units read options from a single /etc/sysconfig/openvswitch
configuration file, so with separate options it is possible to set the
umask only for a specific daemon.
OPTIONS="--ovsdb-server-umask=0002"
in the /etc/sysconfig/openvswitch file will set the umask to 0002
before starting only ovsdb-server, while
OPTIONS="--ovs-vswitchd-umask=0002"
will set the umask for the ovs-vswitchd daemon.
Previous behaviour (not setting umask) is left as default.
Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2023-January/401501.html
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Vladislav Odintsov <odivlad@gmail.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Instead of using all-zero stats when executing a revalidate for missed
dp flows, use the last known stats to avoid odd statistics being used.
As these zero stats are stored in the ukey, the next time
revalidate_ukey() is called, the delta between the new stats and the
zero stats is used, which would cause an additional increase in total
packets/bytes.
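A small sketch of why stored zero stats inflate the next delta
(hypothetical names, not the actual revalidator code):

```c
#include <assert.h>

struct flow_stats { unsigned long n_packets, n_bytes; };

/* The revalidator accounts the delta between the stats just read and
 * the stats stored in the ukey.  If a missed dump stored zeros in the
 * ukey, the next delta re-counts everything seen so far; keeping the
 * last known stats avoids that. */
static unsigned long
packet_delta(const struct flow_stats *cur, const struct flow_stats *stored)
{
    return cur->n_packets >= stored->n_packets
           ? cur->n_packets - stored->n_packets
           : cur->n_packets;
}
```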
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Michael Santana <msantana@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Commit 08e9e53373 fixed proper initialization of the dns-resolve
module and made DNS resolution asynchronous.
A side effect of that change revealed a long-standing logic bug
which broke ovsdb-server listener configuration using DNS names.
Previously this worked because the DNS resolution would block;
now that DNS resolution is asynchronous, the code before this
change would assume the error from jsonrpc_pstream_open meant
the remote was a specification for an active outgoing
connection, even when that was not the case.
To fix this, a couple of changes were made to socket-util:
1) Pass the optional result of DNS resolution from inet_parse_passive.
When (re-)configuring listeners that use DNS names, we may need
to know whether the provided connection string is invalid or if
the provided DNS name has finished resolving.
2) Check DNS resolution status in inet_open_passive.
If the connection string is valid and contains a DNS name,
inet_open_passive will now return -EAGAIN if DNS resolution
failed. DNS resolution failure may either mean the asynchronous
resolver has not completed yet, or that the name does not resolve.
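The resulting decision logic can be sketched roughly like this
(hypothetical signature; the real checks live in lib/socket-util.c):

```c
#include <assert.h>
#include <errno.h>

/* A valid connection string with a DNS name that has not resolved yet
 * is reported as -EAGAIN, so the caller can retry instead of
 * misclassifying the target as an active outgoing connection. */
static int
passive_open_status(int string_valid, int has_dns_name, int dns_resolved)
{
    if (!string_valid) {
        return -EINVAL;         /* Genuinely bad connection string. */
    }
    if (has_dns_name && !dns_resolved) {
        return -EAGAIN;         /* Pending (or failed) resolution. */
    }
    return 0;                   /* Ready to open the listener. */
}
```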
Reported-at: https://bugs.launchpad.net/bugs/1998781
Fixes: 08e9e53373 ("ovsdb: raft: Fix inability to read the database with DNS host names.")
Fixes: 771680d96f ("DNS: Add basic support for asynchronous DNS resolving")
Signed-off-by: Frode Nordahl <frode.nordahl@canonical.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
I commented the three remaining failures when running the
system-traffic tests with tc. In addition, I ran the following test
to verify we did not see any failures with recheck enabled:
  for i in {1..50}; do make check-offloads || \
      make check-offloads TESTSUITEFLAGS="--recheck" || break; \
      echo "ALL_50_OK: $i"; done;
Unfortunately, a bunch of test cases showed occasional failures.
For now, they are excluded from the test cases and need further
investigation. They are:
datapath - truncate and output to gre tunnel
datapath - truncate and output to gre tunnel by simulated packets
These tests were executed on a Fedora 37 machine with kernel
6.1.5-200.fc37.x86_64 installed.
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Tested-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
With some datapaths, like TC, it takes a bit longer to update the
OpenFlow statistics. Rather than adding an additional delay, try
to read the counters multiple times until we get the desired value.
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
If a tc flow was installed but has not yet been used, report it as such.
In addition, add a delay to the "IGMP - flood under normal action" test
case to make it work with many repetitions. This delay is also present
in other ICMP/IGMP tests.
Fixes: f98e418fbd ("tc: Add tc flower functions")
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Tested-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Make the order of the Netlink attributes from odp_flow_key_from_flow__()
the same as the kernel will return them.
This makes sure the attributes displayed in the dpctl/dump-flows
output appear in the same order for all datapaths.
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Tested-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
tc does not include the Ethernet header length in the packet byte
count. This fix allows the byte count of packets that go through tc
to be 14 bytes less.
This difference in the TC implementation is already described in
tc-offload.rst.
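As a sketch of the arithmetic (the helper name is illustrative):

```c
#include <assert.h>

#define ETH_HEADER_LEN 14

/* tc byte counters exclude the 14-byte Ethernet header, so to compare
 * a tc counter against a datapath-level expectation, add the header
 * length back once per packet. */
static unsigned long
tc_bytes_to_dp_bytes(unsigned long tc_bytes, unsigned long n_packets)
{
    return tc_bytes + n_packets * ETH_HEADER_LEN;
}
```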
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Tested-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The tc conntrack implementation does not support the timeout option.
The current implementation silently ignores the timeout option by
adding a general conntrack entry.
This patch skips the related test by overriding the support macro.
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Tested-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
tc does not support conntrack ALGs. Even worse, with tc enabled, they
should not be used/configured at all. This is because even though tc
will ignore the rules with an ALG configured, i.e., they will flow
through the kernel module, return traffic might flow through a tc
conntrack rule, which will not invoke the ALG helper.
Fixes: 576126a931 ("netdev-offload-tc: Add conntrack support")
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Tested-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Include and run the system-traffic.at tests as part of the system
offload testsuite. Exclude all the tests that will not run without
special modifications.
Lower the log level of the "recirc_id sharing not supported" message,
so tests will not fail with older kernels. This is not an error-level
message; it should be debug, like all other EOPNOTSUPP-related log
messages.
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Tested-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The userspace datapath manages all the megaflows with a cmap. The cmap
data structure will grow/shrink during datapath processing and will
re-position megaflows. This might result in two revalidator threads
processing the same megaflow during one dump stage.
Consider a situation where revalidator 1 processes megaflow A and
decides to delete it from the datapath while, at the same time,
megaflow A is also queued in the process batch of revalidator 2.
Normally it's OK for revalidators to process the same megaflow
multiple times, as the dump_seq shows it's already dumped and the
stats will not be contributed twice.
Assume that right after A is deleted, a PMD thread generates a new
megaflow B which has the same match and actions as A. The ukey of
megaflow B will replace the one of megaflow A. Now ukey B is new to
the revalidator system and its dump_seq is 0.
Since the dump_seq of ukey B is 0, when processing megaflow A,
revalidator 2 will not recognize that megaflow A has already been
dumped by revalidator 1 and will contribute the old megaflow A's stats
again. This results in inconsistent stats between ukeys and megaflows.
To fix this, the newly generated ukey B should take the dump_seq of
the replaced ukey A, to avoid the same megaflow being revalidated
twice in one dump stage.
We observed in production environments that the OpenFlow rules' stats
are sometimes amplified compared to the actual values.
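A minimal sketch of the inheritance (hypothetical types; the real code
manipulates udpif_key structures):

```c
#include <assert.h>
#include <stdint.h>

struct ukey { uint64_t dump_seq; };

/* When ukey B replaces ukey A for the same match/actions, B inherits
 * A's dump_seq, so a revalidator that already dumped A in this stage
 * will skip the stale megaflow instead of re-counting its stats. */
static void
ukey_replace(struct ukey *new_ukey, const struct ukey *old_ukey)
{
    new_ukey->dump_seq = old_ukey->dump_seq;
}

static int
already_dumped(const struct ukey *ukey, uint64_t dump_seq)
{
    return ukey->dump_seq == dump_seq;
}
```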
Signed-off-by: Peng He <hepeng.0320@bytedance.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
UB Sanitizer report:
lib/dp-packet.h:587:22: runtime error: member access within misaligned
address 0x000001ecde10 for type 'struct dp_packet', which requires 64
byte alignment
#0 in dp_packet_set_base lib/dp-packet.h:587
#1 in dp_packet_use__ lib/dp-packet.c:46
#2 in dp_packet_use lib/dp-packet.c:60
#3 in dp_packet_init lib/dp-packet.c:126
#4 in dp_packet_new lib/dp-packet.c:150
[...]
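A sketch of the kind of fix involved, assuming a 64-byte alignment
requirement (the helper name is illustrative):

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Allocate with the alignment the type declares (64 bytes here)
 * instead of plain malloc(), which only guarantees max_align_t
 * alignment and therefore trips UBSan's member-access check. */
static void *
xmalloc_cacheline(size_t size)
{
    void *p = NULL;
    if (posix_memalign(&p, 64, size)) {
        return NULL;
    }
    return p;
}
```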
Signed-off-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
When a flow gets modified, i.e. the actions are changed, the tc layer
will remove and re-add the flow, causing all the counters to be reset.
This patch remembers the previous tc counters and adjusts any requests
for statistics. This is done in a similar way to the rte_flow
implementation.
It also updates the check_pkt_len tc test to purge the flows, so we do
not use existing updated tc flow counters, but start with a freshly
installed set of datapath flows.
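A sketch of the adjustment (hypothetical names; the real code keeps the
saved counters in the tc layer):

```c
#include <assert.h>

struct tc_stats { unsigned long n_packets, n_bytes; };

/* Before tc re-adds a modified flow, the old counters are remembered;
 * stats queries then add them back, so the flow's totals survive the
 * remove/re-add cycle instead of resetting to zero. */
static struct tc_stats
adjust_stats(struct tc_stats fresh, struct tc_stats saved)
{
    fresh.n_packets += saved.n_packets;
    fresh.n_bytes += saved.n_bytes;
    return fresh;
}
```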
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The current numa.h header for sparse re-defines functions in a way
that breaks the header from libnuma 2.0.13+, because the original
issue was fixed in that version:
25dcde021d
Sparse errors as a result:
lib/netdev-afxdp.c: note: in included file (through include/sparse/numa.h):
/usr/include/numa.h:346:26: error: macro "numa_get_interleave_mask_compat"
passed 1 arguments, but takes just 0
/usr/include/numa.h:376:26: error: macro "numa_get_membind_compat"
passed 1 arguments, but takes just 0
/usr/include/numa.h:406:26: error: macro "numa_get_run_node_mask_compat"
passed 1 arguments, but takes just 0
/usr/include/numa.h:347:1: error: Expected ; at end of declaration
/usr/include/numa.h:347:1: error: got {
/usr/include/numa.h:351:9: error: 'tp' has implicit type
It's hard to adjust the defines to work with both versions of the
header. Just define all the functions we actually use in OVS instead
of including the original header.
Fixes: e8568993e0 ("netdev-afxdp: NUMA-aware memory allocation for XSK related memory.")
Reviewed-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Add support to count upcall packets per port, both succeeded and
failed, which is a better way to see how many packets were upcalled
on each interface.
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: wangchuanlei <wangchuanlei@inspur.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
A long long time ago, an effort was made to make tc flower
rtnl_lock() free. However, on the OVS side we forgot to add
the TCA_KIND "flower" attribute, which tells the kernel to skip
the lock. This patch corrects this by adding the attribute for
the delete and get operations.
The kernel calls tcf_proto_is_unlocked() to determine whether
rtnl_lock() is needed for the specific tc protocol. It does this
in tc_new_tfilter(), tc_del_tfilter(), and tc_get_tfilter().
If the name is not set, tcf_proto_is_unlocked() will always return
false. If set, the specific protocol is queried for unlocked support.
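For illustration, a simplified netlink-style string attribute append;
the actual patch just adds the TCA_KIND string attribute to the delete
and get requests, while this standalone sketch only shows the
attribute layout (4-byte header, payload, padding to 4 bytes):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

struct nlattr_hdr { uint16_t nla_len; uint16_t nla_type; };

/* Append one string attribute (e.g. the "flower" kind) at 'ofs' in
 * 'buf' and return the offset past the padded attribute. */
static size_t
put_string_attr(uint8_t *buf, size_t ofs, uint16_t type, const char *s)
{
    size_t payload = strlen(s) + 1;            /* Include the NUL byte. */
    struct nlattr_hdr hdr = { (uint16_t) (sizeof hdr + payload), type };

    memcpy(buf + ofs, &hdr, sizeof hdr);
    memcpy(buf + ofs + sizeof hdr, s, payload);
    return ofs + ((hdr.nla_len + 3) & ~3u);    /* NLA_ALIGN to 4 bytes. */
}
```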
Fixes: f98e418fbd ("tc: Add tc flower functions")
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The following tests use the nc command and should be skipped if
nc is not present.
- "offloads - check interface meter offloading - offloads disabled"
- "offloads - check interface meter offloading - offloads enabled"
Fixes: 5660b89a30 ("dpif-netlink: Offloading meter to tc police action")
Reported-by: David Marchand <david.marchand@redhat.com>
Reviewed-by: Louis Peens <louis.peens@corigine.com>
Signed-off-by: Simon Horman <simon.horman@corigine.com>
Acked-by: Ilya Maximets <i.maximets@ovn.org>
Reviewed-by: David Marchand <david.marchand@redhat.com>
pmd-perf-show with pmd-perf-metrics=true displays a histogram
with averages. However, averages were not displayed when there
are no iterations.
They will all be zero, so it is not hiding useful information,
but the stats look incomplete without them, especially when
they are displayed for some PMD thread cores and not others.
The histogram print is large and this is just an extra couple
of lines, so we might as well print them all the time to ensure
that the user does not think there is something missing from
the display.
Before patch:
Histograms
cycles/it
499 0
716 0
1025 0
1469 0
<snip>
After patch:
Histograms
cycles/it
499 0
716 0
1025 0
1469 0
<snip>
---------------
cycles/it
0
Acked-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Some stats in pmd-perf-show don't check for division by zero,
which results in not-a-number (-nan) output.
This is a normal case for some of the stats when there are
no Rx queues assigned to the PMD thread core.
It is not obvious to a user what -nan is, so add a check for
division by zero and set the stat to 0 if it would occur.
Before patch:
pmd thread numa_id 1 core_id 9:
Iterations: 0 (-nan us/it)
- Used TSC cycles: 0 ( 0.0 % of total cycles)
- idle iterations: 0 ( -nan % of used cycles)
- busy iterations: 0 ( -nan % of used cycles)
After patch:
pmd thread numa_id 1 core_id 9:
Iterations: 0 (0.00 us/it)
- Used TSC cycles: 0 ( 0.0 % of total cycles)
- idle iterations: 0 ( 0.0 % of used cycles)
- busy iterations: 0 ( 0.0 % of used cycles)
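A sketch of the guard (the helper name is illustrative):

```c
#include <assert.h>

/* Guard every ratio in the stats printing so a PMD core with zero
 * iterations or zero used cycles prints 0 instead of -nan. */
static double
safe_pct(double part, double total)
{
    return total != 0.0 ? part / total * 100.0 : 0.0;
}
```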
Acked-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
On Fedora 37 (at least), MFEX unit tests are failing because of
deprecation warnings:
$ python3 tests/mfex_fuzzy.py test_traffic.pcap 2000
/usr/lib/python3.11/site-packages/scapy/layers/ipsec.py:471:
CryptographyDeprecationWarning: Blowfish has been deprecated
cipher=algorithms.Blowfish,
/usr/lib/python3.11/site-packages/scapy/layers/ipsec.py:485:
CryptographyDeprecationWarning: CAST5 has been deprecated
cipher=algorithms.CAST5,
Signed-off-by: Robin Jarry <rjarry@redhat.com>
Signed-off-by: David Marchand <david.marchand@redhat.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Today the minimum value for this setting is 1. This patch allows it to
be 0, meaning the pps is not checked at all and revalidation is always
performed.
This is particularly useful for environments where some of the
applications with long-lived connections may have very low traffic for
certain periods but a high rate of bursts periodically. It is
desirable to keep the datapath flows instead of periodically deleting
them, to avoid bursts of packet misses to userspace.
When set to 0, there may be more datapath flows to be revalidated,
resulting in a higher CPU cost for revalidator threads. This is the
downside, but in certain cases it is still more desirable than packet
misses to userspace.
Signed-off-by: Han Zhou <hzhou@ovn.org>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The rte_pktmbuf_free_bulk() function was introduced in DPDK 19.11 and
became stable in 21.11. Use it to free arrays of mbufs instead of
freeing packets one by one.
In simple V2V testing with 64B packets, 2 PMD threads and bidirectional
traffic this change improves performance by 3.5 - 4.5 %.
Reviewed-by: David Marchand <david.marchand@redhat.com>
Acked-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Column conversion involves converting the data to JSON and back.
These are heavy operations, and they are completely unnecessary if the
column type didn't change. Most of the time, schema changes only add
new columns/tables without changing existing ones at all. Clone the
column instead to save some time.
This will also save time while destroying the original database since
we will only need to reduce reference counters on unchanged datum
objects that were cloned instead of actually freeing them.
Additionally, move the column lookup into a separate loop, so we
don't perform a shash lookup for each column of each row.
Testing with a 440 MB OVN_Southbound database shows a 70% speed-up of
the ovsdb_convert() function. Execution time reduced from 15 to 4.4
seconds, 3.5 of which is a post-conversion transaction replay. Overall
time required for the online database conversion reduced from 37 to 25
seconds.
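A toy sketch of the trade-off (hypothetical types; the real code clones
refcounted ovsdb_datum objects):

```c
#include <assert.h>

struct datum { int refcount; };

/* Cloning is just a refcount bump; destroying the old database later
 * only drops refcounts instead of freeing the shared datums. */
static struct datum *
datum_clone(struct datum *d)
{
    d->refcount++;
    return d;
}

/* Returns 1 if the expensive json round-trip was needed. */
static int
convert_column(struct datum *d, int type_changed, int *json_roundtrips)
{
    if (!type_changed) {
        datum_clone(d);          /* Cheap: shared, refcounted. */
        return 0;
    }
    (*json_roundtrips)++;        /* Expensive: to json and back. */
    return 1;
}
```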
Acked-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Now that the timer slack for the PMD threads is reduced, we can also
reduce the start/increment for PMD load-based sleeping to match it.
This will further reduce initial sleep times, making it safer for
interfaces that might be sensitive to large sleep times.
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Reviewed-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The default Linux timer slack groups timer expirations into 50 us
intervals. With some traffic patterns this can mean that returning to
process packets after a sleep takes too long and packets are dropped.
Add a helper to util.c and use it to reduce the timer slack for PMD
threads, so that sleeps with smaller resolutions can be done to
prevent sleeping for too long.
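On Linux the helper boils down to a prctl() call; a minimal sketch (the
helper name is illustrative, the real one lives in lib/util.c):

```c
#include <assert.h>
#include <sys/prctl.h>

/* PR_SET_TIMERSLACK sets the current thread's timer slack in
 * nanoseconds, so short PMD sleeps are not rounded up by the default
 * 50 us slack.  Returns 0 on success, like prctl() itself. */
static int
set_timer_slack_ns(unsigned long ns)
{
    return prctl(PR_SET_TIMERSLACK, ns, 0, 0, 0);
}
```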
Fixes: de3bbdc479 ("dpif-netdev: Add PMD load based sleeping.")
Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2023-January/401121.html
Reported-by: Ilya Maximets <i.maximets@ovn.org>
Signed-off-by: David Marchand <david.marchand@redhat.com>
Co-authored-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
As Ilya reported, we have an ABBA deadlock between the DPDK
vq->access_lock and the OVS dev->mutex when the OVS main thread
refreshes statistics while a vring state change event is being
processed for the same vhost port.
To break out of this situation, move vring state change notification
handling from the vhost-events DPDK thread to a dedicated thread
using a lockless queue.
Besides, for the case when a bogus/malicious guest is sending
continuous updates, add a counter of pending updates in the queue and
warn if a threshold of 1000 entries is reached.
Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2023-January/401101.html
Fixes: 3b29286db1 ("netdev-dpdk: Add per virtqueue statistics.")
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The counter for the number of atoms has to be reset to the number from
the new database, otherwise the value will be incorrect. For example,
this causes the atom counter to double after online conversion of
a clustered database.
Miscounting may also lead to increased memory consumption by the
transaction history, or an otherwise too aggressive transaction
history sweep.
Fixes: 317b1bfd7d ("ovsdb: Don't let transaction history grow larger than the database.")
Acked-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>