RHEL 8.4 is the first of the RHEL 8.x kernels that has broken ABI, so
it requires the same sort of fix as we did for the several RHEL 7.x
kernels that needed two kernel RPMs to work for all minor revisions of
the baseline kernel module.
Signed-off-by: Greg Rose <gvrose8192@gmail.com>
The same json from the json cache is typically sent to all clients,
e.g., in case of an OVN deployment with ovn-monitor-all=true.
There could be hundreds or thousands of connected clients, and ovsdb
will serialize the same json object for each of them before sending.
Serialize it once before storing it into the json cache to speed up
processing.
This change saves a lot of CPU cycles and a bit of memory, since we
only need to keep a string in memory and not the full json object.
Testing with ovn-heater on 120 nodes using the density-heavy scenario
shows a reduction of the total CPU time used by Southbound DB processes
from 256 to 147 minutes.  The duration of unreasonably long poll
intervals also dropped dramatically, from 7 to 2 seconds:
         Count   Min    Max   Median    Mean   95 percentile
 -----------------------------------------------------------
 Before   1934   1012   7480   4302.5   4875.3      7034.3
 After    1909   1004   2730   1453.0   1532.5      2053.6
(All times are in milliseconds.)
Acked-by: Dumitru Ceara <dceara@redhat.com>
Acked-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Raft log entries (and the raft database snapshot) contain json objects
of the data.  A follower receives append requests with data that gets
parsed and added to the raft log.  The leader receives execution
requests, parses the data out of them and adds it to the log.  In both
cases, ovsdb-server later reads the log with ovsdb_storage_read(),
constructs a transaction and updates the database.  On followers, these
json objects are in the common case never used again.  The leader may
use them to send append requests or snapshot installation requests to
followers.  However, all these operations (except for
ovsdb_storage_read()) just serialize the json in order to send it over
the network.
Json objects are significantly larger than their serialized string
representation.  For example, the snapshot of the database from one of
the ovn-heater scale tests takes 270 MB as a string, but 1.6 GB as
a json object, out of the total 3.8 GB consumed by the ovsdb-server
process.
ovsdb_storage_read() for a given raft entry happens only once in its
lifetime, so after this call we can serialize the json object, store
the string representation and free the actual json object that ovsdb
will never need again.  This can save a lot of memory and also saves
serialization time, because each raft entry for append requests
and snapshot installation requests is serialized only once instead of
every time such a request needs to be sent.
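The serialize-then-free pattern can be sketched as below (a hypothetical
Python model of an entry's lifecycle; the actual logic is C code around
ovsdb_storage_read(), and the names here are illustrative):

```python
import json

# Hypothetical model: after the single read, keep only the string form
# and drop the much larger parsed json object.
class RaftEntry:
    def __init__(self, parsed):
        self.json = parsed        # large parsed json object
        self.serialized = None

    def storage_read(self):
        """Called once per entry: use the object, then keep only its string."""
        obj = self.json
        self.serialized = json.dumps(obj)
        self.json = None          # drop the big object; never needed again
        return obj

    def to_wire(self):
        # append / snapshot-installation requests reuse the cached string
        return self.serialized

entry = RaftEntry({"table": {"row": "value"}})
entry.storage_read()
```

Every later send of this entry costs only a string copy, never a fresh
serialization.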
JSON_SERIALIZED_OBJECT can be used in order to seamlessly integrate
pre-serialized data into raft_header and similar json objects.
One major special case is the creation of a database snapshot.
A snapshot installation request received over the network will be
parsed and read by ovsdb-server just like any other raft log entry.
However, snapshots created locally with raft_store_snapshot() will
never be read back, because they reflect the current state of the
database and hence are already applied.  For this case we can free the
json object right after writing the snapshot to disk.
Tests performed with ovn-heater on a 60-node density-light scenario,
where the on-disk database grows up to 97 MB, show that the average
memory consumption of ovsdb-server Southbound DB processes decreased
by 58% (from 602 MB to 256 MB per process) and peak memory consumption
decreased by 40% (from 1288 MB to 771 MB).
A test with 120 nodes on the density-heavy scenario with a 270 MB
on-disk database shows the expected 1.5 GB decrease in memory
consumption.  Also, the total CPU time consumed by the Southbound DB
process dropped from 296 to 256 minutes, and the number of unreasonably
long poll intervals from 2896 down to 1934.
Acked-by: Dumitru Ceara <dceara@redhat.com>
Acked-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Introduce a new json type, JSON_SERIALIZED_OBJECT.  It's not an
actual type that can be seen in a json message on the wire, but an
internal type that is intended to hold a serialized version of
some other json object.  For this reason it's defined after
JSON_N_TYPES, to not confuse parsers and other parts of the code
that rely on compliance with RFC 4627.
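The pass-through behavior during serialization can be modeled like this
(a hypothetical Python sketch; the real type is a C enum value placed
after JSON_N_TYPES in the json library, and these names are illustrative):

```python
import json

# Hypothetical model of an internal-only node type that holds
# already-serialized JSON text.
class SerializedObject:
    def __init__(self, text):
        self.text = text

def to_string(node):
    """Serialize a json tree, inlining pre-serialized parts verbatim."""
    if isinstance(node, SerializedObject):
        return node.text                     # no re-processing needed
    if isinstance(node, dict):
        items = (json.dumps(k) + ':' + to_string(v) for k, v in node.items())
        return '{' + ','.join(items) + '}'
    return json.dumps(node)

inner = SerializedObject(json.dumps({"a": 1}))   # pre-serialized part
wire = to_string({"data": inner, "term": 7})     # embedded verbatim
```

The outer serializer copies the stored text directly into the result, so a
large pre-serialized member costs no further processing.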
With this JSON type internal users may construct large JSON objects,
parts of which are already serialized. This way, while serializing
the larger object, data from JSON_SERIALIZED_OBJECT can be added
directly to the result, without additional processing.
This will be used by the next commits to add pre-serialized JSON data
to the raft_header structure, which can be converted to JSON before
writing a file transaction to disk or sending it to other servers.
The same technique can also be used to pre-serialize the json_cache
for ovsdb monitors; this should allow us to skip serialization for
every client and save some more memory.
Since serialized JSON is just a string, the 'json->string' pointer is
reused for it.
Acked-by: Dumitru Ceara <dceara@redhat.com>
Acked-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
This way it's easier to show it on a website, as it will be updated
automatically along with the rest of the documentation.
Sphinx doesn't render everything perfectly, but it looks good enough
in both the man and html versions.  rST is a bit easier to read and
takes less space.
The conversion was performed manually, since I didn't find any good
tool that could actually make the process any faster.
Along the way I replaced versions like x.y.90 with x.y+1, because
it doesn't seem correct to me to refer to non-released versions of OVS
in the docs.  Fixed a couple of small mistakes, like a duplicated
paragraph and a reference to a different section by an incorrect name.
Also removed the bits of xml->nroff conversion code that are not
needed anymore.
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Roi Dayan <roid@nvidia.com>
The current string serialization code puts all characters one by one.
This is slow, because the dynamic string needs to perform length checks
on every ds_put_char(), and it also doesn't allow the compiler to use
better memory copy operations, i.e. doesn't allow copying a few bytes
at once.
Special symbols are rare in a typical database.  Quotes are frequent,
but not too frequent.  In databases created by ovn-kubernetes, for
example, there are usually at least 10 to 50 chars between quotes.
So, it's better to count the characters that don't require escaping
and use a fast data copy for the whole sequential block.
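The run-copying idea can be sketched in Python (the real implementation is
C in the json library and uses bulk memory copies; this is an illustrative
model, not the actual code):

```python
import json

# Hypothetical model: instead of emitting one character at a time, track a
# run of characters that need no escaping and copy the whole run at once.
def escape_string(s):
    out = []
    start = 0                      # beginning of the current safe run
    for i, c in enumerate(s):
        if c == '"' or c == '\\' or ord(c) < 32:
            out.append(s[start:i])     # one bulk copy for the whole run
            out.append('\\' + {'"': '"', '\\': '\\'}.get(c, 'u%04x' % ord(c)))
            start = i + 1
    out.append(s[start:])              # trailing run
    return '"' + ''.join(out) + '"'
```

With rare special characters, almost all of the output is produced by the
bulk copies rather than per-character appends.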
Testing with a synthetic benchmark (included) on my laptop shows
following performance improvement:
        Size   Q   S       Before        After      Diff
  --------------------------------------------------------
      100000   0   0 :    0.227 ms     0.142 ms   -37.4 %
      100000   2   1 :    0.277 ms     0.186 ms   -32.8 %
      100000  10   1 :    0.361 ms     0.309 ms   -14.4 %
    10000000   0   0 :   22.720 ms    12.160 ms   -46.4 %
    10000000   2   1 :   27.470 ms    19.300 ms   -29.7 %
    10000000  10   1 :   37.950 ms    31.250 ms   -17.6 %
   100000000   0   0 :  239.600 ms   126.700 ms   -47.1 %
   100000000   2   1 :  292.400 ms   188.600 ms   -35.4 %
   100000000  10   1 :  387.700 ms   321.200 ms   -17.1 %
Here Q is the probability (%) for a character to be a '\"' and
S is the probability (%) for it to be a special character (< 32).
Testing with a scenario closer to the real world shows an overall
decrease of the time needed for database compaction by ~5-10 %.  This
change also decreases CPU consumption in general, because string
serialization is used in many different places, including ovsdb
monitors and raft.
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Numan Siddique <numans@ovn.org>
Acked-by: Dumitru Ceara <dceara@redhat.com>
FreeBSD tests in Cirrus CI are broken and, I guess, Windows tests too:
89. library.at:258: testing netlink policy ...
./library.at:259: ovstest test-netlink-policy ll_addr
--- /dev/null 2021-08-20 19:02:41.907547000 +0000
+++ /tmp/cirrus-ci-build/tests/testsuite.dir/at-groups/89/stderr
@@ -0,0 +1 @@
+ovstest: unknown command 'test-netlink-policy'; use --help for help
./library.at:259: exit code was 1, expected 0
89. library.at:258: 89. netlink policy (library.at:258): FAILED
'tests/test-netlink-policy.c' is built only on Linux, so the test
must be skipped on all other platforms to unblock CI.
Fixes: bfee9f6c01 ("netlink: Add support for parsing link layer address.")
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Frode Nordahl <frode.nordahl@canonical.com>
This patch adds 2 new APIs to the ovsdb-idl client library
- ovsdb_idl_server_has_table() and ovsdb_idl_server_has_column() - to
query whether a table or a column is present in the IDL or not.  This
patch also adds IDL helper functions, auto-generated from the schema,
which make this easier for clients.
These APIs are required for scenarios where the server schema is old
and missing a table or column, and the client (built with a new schema
version) does a transaction with the missing table or column.  This
results in a continuous loop of transaction failures.
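How a client can use such checks to avoid the failure loop can be sketched
as follows (a hypothetical Python model; the real APIs are the C functions
ovsdb_idl_server_has_table() and ovsdb_idl_server_has_column(), and the
schema/table names below are invented):

```python
# Hypothetical model of the new presence checks.
def server_has_table(schema, table):
    return table in schema

def server_has_column(schema, table, column):
    return server_has_table(schema, table) and column in schema[table]

def build_txn(schema, table, updates):
    """Skip columns the old server doesn't know, instead of retrying a
    transaction that can never succeed."""
    if not server_has_table(schema, table):
        return None
    return {col: val for col, val in updates.items()
            if server_has_column(schema, table, col)}

old_schema = {"Logical_Switch": {"name", "ports"}}   # no 'other_config'
txn = build_txn(old_schema, "Logical_Switch",
                {"name": "sw0", "other_config": {"mcast": "true"}})
```

The client drops the unsupported column up front, so the transaction can
succeed against the older server.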
Related-Bug: https://bugzilla.redhat.com/show_bug.cgi?id=1992705
Signed-off-by: Numan Siddique <nusiddiq@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Data retrieved from netlink and friends may include a link layer
address.  Add the type to nl_attr_type, and min/max functions, to
allow the use of nl_policy_parse with this type of data.
While this will not be used by Open vSwitch itself at this time,
sibling and derived projects want to use the great netlink library
that OVS provides, and it is not possible to safely override the
global nl_attr_type symbol at link time.
Signed-off-by: Frode Nordahl <frode.nordahl@canonical.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
When nl_msg_nlmsgerr() returns true, which basically means that
nlmsg_type == NLMSG_ERROR, we need to log the error code stored by
nl_msg_nlmsgerr(), in addition to the descriptive representation,
instead of the generic "error".
Fixes: 72d32ac0b3 ("netlink-socket: Make caller provide message receive buffers.")
Suggested-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
Reviewed-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
Signed-off-by: Paolo Valerio <pvalerio@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The 's_tnl' member in flow_patterns and flow_actions should be
set to DS_EMPTY_INITIALIZER, to be consistent with other dynamic
string initializations.
Also, there's a potential memory leak of flow_patterns->s_tnl.
Fix this by destroying s_tnl in free_flow_patterns().
Fixes: 507d20e77b ("netdev-offload-dpdk: Support vports flows offload.")
Fixes: be56e063d0 ("netdev-offload-dpdk: Support tunnel pop action.")
Signed-off-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
Acked-by: Gaetan Rivet <grive@u256.net>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
ds_clone() crashes while trying to clone an empty dynamic string.
This happens because it doesn't check whether memory was allocated
and tries to read from a NULL pointer.  ds_init() doesn't allocate
any memory.
For example:
In netdev_offload_dpdk_flow_create(), when an offload request fails,
dump_flow() is called to log a warning message.  The 's_tnl' string
in flow_patterns is initialized in vport_to_rte_tunnel() only
conditionally, via ds_put_format().  If it is not initialized, a crash
happens later in dump_flow_attr()->ds_clone()->memcpy() while
dereferencing this string.
To fix this, check whether memory for the src string has been
allocated before copying it to the dst string.
Fixes: fa44a4a3ff ("ovn-controller: Persist desired conntrack groups.")
Signed-off-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
This patch fixes a memory leak in the commands for DPIF and MFEX
get and set.  To operate, the commands require a pmd_list, which is
currently not freed after it has been used.  This issue was identified
by a static analysis tool.
Fixes: 3d8f47bc ("dpif-netdev: Add command line and function pointer for miniflow extract")
Fixes: abb807e2 ("dpif-netdev: Add command to switch dpif implementation.")
Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
Acked-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
This patch fixes a memory leak when the
"dpif-netdev/subtable-lookup-prio-set" command is run: the pmd_list
required to iterate over the PMD threads was not being freed.
This issue was identified by a static analysis tool.
Fixes: 3d018c3e ("dpif-netdev: Add subtable lookup prio set command.")
Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
Acked-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
This commit removes dead code from the MFEX set command, as
highlighted by static analysis.
Fixes: a395b132b7 ("dpif-netdev: Add packet count and core id paramters for study")
Signed-off-by: kumar Amber <kumar.amber@intel.com>
Acked-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
The "name" parameter isn't optional, so don't use brackets around it.
Fixes: 5c5c98cec2 ("docs/dpdk/bridge: Add miniflow extract section.")
Signed-off-by: Cian Ferriter <cian.ferriter@intel.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
Document the "ovs-appctl dpif-netdev/miniflow-parser-get" and
"ovs-appctl dpif-netdev/miniflow-parser-set" commands in the vswitchd
manpage.
Fixes: 3d8f47bc04 ("dpif-netdev: Add command line and function pointer for miniflow extract")
Signed-off-by: Cian Ferriter <cian.ferriter@intel.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
This commit adds extra checks around enabling the AVX-512 vpopcnt
instruction, ensuring that in the function where the ISA is enabled
the compiler has also indicated its support for the ISA.  This is
achieved by checking the __AVX512VPOPCNTDQ__ define, which the compiler
sets if it is capable of handling the vpopcnt instruction.
If the compiler is not capable of handling vpopcnt, we fall back to
the emulated vpopcnt implementation.
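The fallback idea can be illustrated with the classic SWAR population
count for one 64-bit lane (a hedged Python sketch; the real fallback is
C code operating on AVX-512 vector registers):

```python
# SWAR (SIMD-within-a-register) popcount emulation for a 64-bit value,
# the kind of technique used when the vpopcnt instruction is unavailable.
MASK64 = (1 << 64) - 1

def popcount64(x):
    x = x - ((x >> 1) & 0x5555555555555555)                # 2-bit sums
    x = (x & 0x3333333333333333) + ((x >> 2) & 0x3333333333333333)
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0F               # byte sums
    return ((x * 0x0101010101010101) & MASK64) >> 56       # total in top byte
```

The same arithmetic can be expressed with vector shifts, masks and
multiplies, which is why it serves as a drop-in emulation of vpopcnt.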
Reported-by: Ian Stokes <ian.stokes@intel.com>
Fixes: 1e31489134 ("dpcls-avx512: Enable avx512 vector popcount instruction.")
Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
Lin Huang authored several patches, but wasn't added to the list of
authors for some reason.  Adding by request.
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Wentao Jia provided useful bug reports for ovsdb relay mode.
Gaetan Rivet authored several patches already, but I seem to have
forgotten to add him to the list of authors.
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
'sent_node' is initialized to all zeroes by xzalloc(), but
HMAP_NODE_NULL is not all zeroes.  hmap_node_is_null() is used
to detect whether the node is valid, but it will fail and cause a
segmentation fault on an attempt to remove the non-existent node
from the hash map.  This can happen if a client disconnects while
the transaction is not yet forwarded to the relay source:
Program terminated with signal 11, Segmentation fault.
#0  in hmap_remove at include/openvswitch/hmap.h:293
293         while (*bucket != node) {
(gdb) bt
#0  hmap_remove at include/openvswitch/hmap.h:293
#1  ovsdb_txn_forward_unlist at ovsdb/transaction-forward.c:67
#2  ovsdb_txn_forward_destroy at ovsdb/transaction-forward.c:79
#3  ovsdb_trigger_destroy at ovsdb/trigger.c:70
#4  ovsdb_jsonrpc_trigger_complete at ovsdb/jsonrpc-server.c:1192
#5  ovsdb_jsonrpc_trigger_remove__ at ovsdb/jsonrpc-server.c:1204
#6  ovsdb_jsonrpc_trigger_complete_all at ovsdb/jsonrpc-server.c:1223
#7  ovsdb_jsonrpc_session_run at ovsdb/jsonrpc-server.c:546
#8  ovsdb_jsonrpc_session_run_all at ovsdb/jsonrpc-server.c:591
#9  ovsdb_jsonrpc_server_run at ovsdb/jsonrpc-server.c:406
#10 main_loop
(gdb) print db->txn_forward_sent
$20 = {buckets = 0x..., one = 0x0, mask = 63, n = 0}
(gdb) print txn_fwd->sent_node
$24 = {hash = 0, next = 0x0}
Fix that by correctly initializing 'sent_node'.
Reported-by: Wentao Jia <wentao.jia@easystack.cn>
Reported-at: https://mail.openvswitch.org/pipermail/ovs-discuss/2021-August/051354.html
Fixes: 7964ffe7d2 ("ovsdb: relay: Add support for transaction forwarding.")
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Mark D. Gray <mark.d.gray@redhat.com>
The correct way to pass configuration options is to define them
inside config.h.  Additionally, a few long lines were wrapped and
the unnecessary double check for -mavx512f was fixed.
Fixes: abb807e27d ("dpif-netdev: Add command to switch dpif implementation.")
Fixes: 5324b54e60 ("dpif-netdev: Add configure to enable autovalidator at build time.")
Fixes: e90e115a01 ("dpif-netdev: implement subtable lookup validation.")
Fixes: 352b6c7116 ("dpif-lookup: add avx512 gather implementation.")
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Cian Ferriter <cian.ferriter@intel.com>
Association of a mark with a flow is done as part of its offload
handling, in the offloading thread.  However, the PMD thread decides
whether an offload request is an "add" or a "modify" based on whether
a mark is already associated with the flow.
This is exposed to a race condition.  A flow might be created with
actions that cannot be fully offloaded, for example flooding (before
MAC learning), and later modified to have actions that can be fully
offloaded.  If the two requests are queued before the offload thread
handles them, they are both marked as "add".  When the offload thread
handles them, the first request is partially offloaded, and the second
one is ignored, as the flow is already considered offloaded.
Fix it by deriving add/modify of an offload request from the actual
flow state change, without relying on the mark.
Fixes: 3c7330ebf0 ("netdev-offload-dpdk: Support offload of output action.")
Signed-off-by: Eli Britstein <elibr@nvidia.com>
Reviewed-by: Gaetan Rivet <gaetanr@nvidia.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The dp_netdev_flow_offload_main thread is asynchronous, as of the cited
commit.  There might be a case where modification requests for the same
flow are submitted before being handled.  Then, if the first handling
fails, the rule for the flow is deleted and the mark is freed.  The
following request should then be handled not as a modification, but as
an "add".
Fixes: 02bb2824e5 ("dpif-netdev: do hw flow offload in a thread")
Signed-off-by: Eli Britstein <elibr@nvidia.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Before flushing offloads of a removed port was supported by [1], it was
necessary to flush the 'marks'.  In doing so, all offloads of the PMD
are removed, including the ones that are not related to the removed
port and that are not modified following this removal.  As a result,
such flows are evicted from being offloaded and won't resume
offloading.
As the PMD offload flush is no longer necessary, avoid it.
[1] 62d1c28e9c ("dpif-netdev: Flush offload rules upon port deletion.")
Signed-off-by: Eli Britstein <elibr@nvidia.com>
Reviewed-by: Gaetan Rivet <gaetanr@nvidia.com>
Reviewed-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Atomic variables must be read atomically.  Since we already have
'smc_enable_db' in the PMD context, use it from there to avoid reading
the atomic twice.
Also, 'smc_enable_db' is a global configuration, so there is no need
to read it per-port or per-rxq.
Fixes: 9ac84a1a36 ("dpif-avx512: Add ISA implementation of dpif.")
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Cian Ferriter <cian.ferriter@intel.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
This change removes the automatic memory limit on start-up of OVS with
DPDK.  As DPDK supports dynamic memory allocation, there is no need to
limit the amount of memory available if it is not requested.
Currently, if socket-limit is not configured, it is set to the value of
socket-mem.  With this change, the user can decide to set it or to have
no memory limit.
Removed the logs that announced this change and fixed the
documentation.
Reported at: https://bugzilla.redhat.com/show_bug.cgi?id=1949850
Signed-off-by: Rosemarie O'Riorden <roriorde@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
This change removes the default values for the EAL args socket-mem and
socket-limit.  As DPDK supports dynamic memory allocation, there is no
need to allocate a certain amount of memory on start-up, nor to limit
the amount of memory available, if not requested.
Currently, socket-mem has a default value of 1024 when it is not
configured by the user, and socket-limit takes on the value of
socket-mem, 1024, by default.  With this change, socket-mem is not
configured by default, which means that socket-limit is not either.
Neither, either or both options can be set.
Removed the extra logs that announced this change and fixed the
documentation.
Reported at: https://bugzilla.redhat.com/show_bug.cgi?id=1949850
Signed-off-by: Rosemarie O'Riorden <roriorde@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Fixes: b6207b1d27 ("netdev-offload-dpdk: Support offload of set IPv6 actions.")
Signed-off-by: Eli Britstein <elibr@nvidia.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
19 bytes in 1 blocks are definitely lost in loss record 24 of 121
at 0x4839748: malloc (vg_replace_malloc.c:306)
by 0x483BD63: realloc (vg_replace_malloc.c:834)
by 0x521C26: xrealloc (util.c:149)
by 0x478F91: ds_reserve (dynamic-string.c:63)
by 0x47928B: ds_put_format_valist (dynamic-string.c:161)
by 0x47920A: ds_put_format (dynamic-string.c:142)
by 0x506DE5: process_status_msg (process.c:0)
by 0x52A6D0: fork_and_wait_for_startup (daemon-unix.c:284)
by 0x52A54D: daemonize_start (daemon-unix.c:453)
by 0x40EB3E: main (ovs-vswitchd.c:91)
Fixes: b925336a36 ("daemon: restart child process if it died before signaling its readiness")
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Roi Dayan <roid@nvidia.com>
The ovsdb-cs layer triggers a forced reconnect in various cases:
- when an inconsistency is detected in the data received from the
remote server.
- when the remote server is running in clustered mode and transitioned
to "follower", if the client is configured in "leader-only" mode.
- when explicitly requested by upper layers (e.g., by the user
application, through the IDL layer).
In such cases it's desirable that reconnection should happen as fast as
possible, without the current exponential backoff maintained by the
underlying reconnect object. Furthermore, since 3c2d6274bc ("raft:
Transfer leadership before creating snapshots."), leadership changes
inside the clustered database happen more often and, therefore,
"leader-only" clients need to reconnect more often too.
Forced reconnects call jsonrpc_session_force_reconnect(), which does
not reset the backoff.  To make sure clients reconnect as fast as
possible in
the aforementioned scenarios we first call the new API,
jsonrpc_session_reset_backoff(), in ovsdb-cs, for sessions that are in
state CS_S_MONITORING (i.e., the remote is likely still alive and
functioning fine).
jsonrpc_session_reset_backoff() resets the number of backoff-free
reconnect retries to the number of remotes configured for the session,
ensuring that all remotes are retried exactly once with backoff 0.
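The reset semantics described above can be modeled as follows (a toy
Python sketch; the real API is the C function
jsonrpc_session_reset_backoff(), and the class below is invented):

```python
# Hypothetical model: grant one backoff-free reconnect attempt per
# configured remote; after those are used, normal backoff resumes.
class Session:
    def __init__(self, remotes):
        self.remotes = remotes
        self.backoff = 8.0            # current exponential backoff, seconds
        self.free_tries = 0

    def reset_backoff(self):
        # analogue of jsonrpc_session_reset_backoff(): each configured
        # remote is retried exactly once without waiting
        self.free_tries = len(self.remotes)

    def next_delay(self):
        if self.free_tries > 0:
            self.free_tries -= 1
            return 0.0
        return self.backoff

s = Session(["db1", "db2", "db3"])
s.reset_backoff()
delays = [s.next_delay() for _ in range(4)]
```

All three remotes are tried immediately; only a fourth attempt would wait
out the accumulated backoff.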
This commit also updates the Python IDL and jsonrpc implementations.
The Python IDL wasn't tracking the IDL_S_MONITORING state explicitly;
we now do that too.  Tests were also added to make sure that forced
IDL reconnects happen without backoff.
Reported-at: https://bugzilla.redhat.com/1977264
Suggested-by: Ilya Maximets <i.maximets@ovn.org>
Signed-off-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
This fixes the flake8 error on Python versions older than 3.6,
as ModuleNotFoundError is not available before 3.6.
../../tests/mfex_fuzzy.py:5:8: F821 undefined name 'ModuleNotFoundError'
Makefile:5826: recipe for target 'flake8-check' failed
Since it doesn't really make any sense to catch this exception, the
try-except block is simply removed.  Additionally, the check for
scapy is replaced with a more reliable one.  Imports are re-ordered,
because standard imports should go first.
Fixes: 50be6715c0 ("test/sytem-dpdk: Add unit test for mfex autovalidator")
Signed-off-by: kumar Amber <kumar.amber@intel.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
GCC 11 with -O1 on Fedora 34 emits a false-positive warning like this:
lib/netdev-dummy.c: In function ‘dummy_packet_stream_run’:
lib/netdev-dummy.c:284:16: error: ‘n’ may be used uninitialized in this
function [-Werror=maybe-uninitialized]
284 | if (retval == n && dp_packet_size(&s->rxbuf) > 2) {
| ^
This breaks the build with --enable-Werror.  Initialize 'n' to
avoid the warning.
Acked-by: Ben Pfaff <blp@ovn.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Open vSwitch supports OpenFlow "instructions", which were introduced in
OpenFlow 1.1 and act like restricted kinds of actions that can only
appear in a particular order and particular circumstances. OVS did
not support two of these instructions, "write_metadata" and
"goto_table", properly in the case where they appeared in a flow that
needed to be frozen for continuations.
Both of these instructions had the problem that they couldn't be
properly serialized into the stream of actions, because they're not
actions. This commit fixes that problem in freeze_unroll_actions()
by converting them into equivalent actions for serialization.
goto_table had the additional problem that it was being serialized to
the frozen stream even after it had been executed. This was already
properly handled in do_xlate_actions() for resubmit, which is almost
equivalent to goto_table, so this commit applies the same fix to
goto_table. (The commit removes an assertion from the goto_table
implementation, but there wasn't any real value in that assertion and
I thought the code looked cleaner without it.)
This commit adds tests that would have found these bugs. This includes
adding a variant of each continuation test that uses OF1.3 for
monitor/resume (which is necessary to trigger these bugs) plus specific
tests for continuations with goto_table and write_metadata. It also
improves the continuation test infrastructure to add more detail on
the problem if a test fails.
Signed-off-by: Ben Pfaff <blp@ovn.org>
Reported-by: Grayson Wu <wgrayson@vmware.com>
Reported-at: https://github.com/openvswitch/ovs-issues/issues/213
Discussed-at: https://mail.openvswitch.org/pipermail/ovs-dev/2021-July/386166.html
Acked-by: Ilya Maximets <i.maximets@ovn.org>
While testing OVS-Windows flows for the DNAT action, the checksum
in the TCP header is set incorrectly when TCP offload is enabled by
default.  As a result, the packet will be dropped on the receiver
Linux VM.
Sample flow, default configuration on both Windows VM and Linux VM:
(src=40.0.1.2,dst=10.150.0.1) --dnat--> (src=40.0.1.2,dst=30.1.0.2)
Without the fix, for some TCP packets (40.0.1.2->30.1.0.2 with payload
len 207) the TCP checksum will be the pseudo-header checksum, with the
value 0x01d6.  With the fix, the checksum will be 0x47ee, so the
correct TCP checksum is seen on the receiver Linux VM.
Signed-off-by: Wilson Peng <pweisong@vmware.com>
Signed-off-by: Anand Kumar <kumaranand@vmware.com>
Acked-by: Alin-Gabriel Serdean <aserdean@ovn.org>
Signed-off-by: Alin-Gabriel Serdean <aserdean@ovn.org>
There are 3 constraints for moving hashes from one member to another:
1. The load difference exceeds ~ 3% of the load of one member.
2. The difference in load between members exceeds 100,000 bytes.
3. Moving the hash reduces the load difference by more than 10%.
In the current implementation, if one of the members transitions to
the DOWN state, all hashes assigned to it will be moved to the other
members. After that, if the member goes UP, it will wait for
rebalancing to get hashes.  But in case we have more than 10 equally
loaded hashes, it will never meet constraint #3, because each hash
will handle less than 10% of the load.  The situation gets worse as
the number of flows grows, and it is almost impossible to transfer any
hash when all 256 hash records are used, which is very likely when we
have a few hundred or thousand flows.
As a result, if one of the members goes down and back up while traffic
flows, it will never be used to transmit packets again. This will not
be fixed even if we completely stop the traffic and start it again,
because the first two constraints will block rebalancing in the
earlier stages, while we have low traffic volume.
Moving a single hash if the destination does not have any hashes,
as it was before commit c460a6a7bc ("ofproto/bond: simplifying the
rebalancing logic"), will not help, because a single hash is not
enough to make the difference in load less than 10% of the total load,
and this member will handle only that one hash forever.
To fix this, let's try to move multiple hashes at the same time to
meet constraint #3.
The implementation includes sorting the "records" to be able to
collect records with a cumulative load close enough to the ideal value.
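A simplified sketch of the multi-hash selection (a hypothetical Python
model that ignores constraints #1 and #2; the real C code sorts the bond
records and applies all three constraints):

```python
# Hypothetical model: walk hashes from most to least loaded and greedily
# collect a set whose cumulative load approaches half of the load
# difference 'delta' between the two members, i.e. the ideal transfer.
def pick_hashes_to_move(loads, delta):
    target = delta / 2.0
    moved, moved_load = [], 0.0
    for i, load in sorted(enumerate(loads), key=lambda e: -e[1]):
        if moved_load + load <= target:
            moved.append(i)
            moved_load += load
    return moved, moved_load

# ten equally loaded hashes on a busy member, idle peer: move half of them
moved, moved_load = pick_hashes_to_move([100] * 10, delta=1000)
```

Moving five of the ten hashes closes the whole load gap, which no single
hash could do on its own under constraint #3.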
Acked-by: Ben Pfaff <blp@ovn.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The Open vSwitch kernel module uses the upcall mechanism to send
packets from kernel space to user space when it misses in the kernel
space flow table. The upcall sends packets via a Netlink socket.
Currently, a Netlink socket is created for every vport. In this way,
there is a 1:1 mapping between a vport and a Netlink socket.
When a packet is received by a vport, if it needs to be sent to
user space, it is sent via the corresponding Netlink socket.
This mechanism, with various iterations of the corresponding user
space code, has seen some limitations and issues:
* On systems with a large number of vports, there is correspondingly
a large number of Netlink sockets which can limit scaling.
(https://bugzilla.redhat.com/show_bug.cgi?id=1526306)
* Packet reordering on upcalls.
(https://bugzilla.redhat.com/show_bug.cgi?id=1844576)
* A thundering herd issue.
(https://bugzilla.redhat.com/show_bug.cgi?id=1834444)
This patch introduces an alternative, feature-negotiated, upcall
mode using a per-cpu dispatch rather than a per-vport dispatch.
In this mode, the Netlink socket to be used for the upcall is
selected based on the CPU of the thread that is executing the upcall.
In this way, it resolves the issues above as:
a) The number of Netlink sockets scales with the number of CPUs
rather than the number of vports.
b) Ordering per-flow is maintained as packets are distributed to
CPUs based on mechanisms such as RSS and flows are distributed
to a single user space thread.
c) Packets from a flow can only wake up one user space thread.
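The dispatch described in (a)-(c) can be sketched as a toy model (Python
for illustration only; the real mechanism lives in the kernel module and
the userspace handler code, and the names below are invented):

```python
# Hypothetical model of per-cpu upcall dispatch: the socket is chosen
# from the CPU executing the upcall, so the socket count scales with
# CPUs rather than with vports.
def upcall_socket(sockets, cpu_id):
    return sockets[cpu_id % len(sockets)]

sockets = ["nl-sock-%d" % i for i in range(4)]
# packets of one flow land on one CPU (RSS), hence always the same socket
chosen = {upcall_socket(sockets, cpu) for cpu in (2, 2, 2)}
```

Since a flow's packets always hash to the same CPU, they always reach the
same socket and the same handler thread, preserving per-flow ordering and
avoiding the thundering herd.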
Reported-at: https://bugzilla.redhat.com/1844576
Signed-off-by: Mark Gray <mark.d.gray@redhat.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
'n_handlers' and 'n_revalidators' are declared as type 'size_t'.
However, dpif_handlers_set() requires parameter 'n_handlers' as
type 'uint32_t'. This patch fixes this type mismatch.
Signed-off-by: Mark Gray <mark.d.gray@redhat.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Users complained that per-rxq pmd usage was confusing: summing those
values per pmd would never reach 100%, even when increasing the traffic
load beyond pmd capacity.
This is because the dpif-netdev/pmd-rxq-show command only reports
"pure" rxq cycles, while some cycles are used in the pmd mainloop and
add to the total pmd load.
dpif-netdev/pmd-stats-show does report per-pmd load usage.
This load is measured since the last dpif-netdev/pmd-stats-clear call.
On the other hand, the per-rxq pmd usage reflects the pmd load over a
10s sliding window, which makes it non-trivial to correlate.
Gather per-pmd busy cycles with the same periodicity and report the
difference as overhead in dpif-netdev/pmd-rxq-show, so that we have
all the info in a single command.
Example:
  $ ovs-appctl dpif-netdev/pmd-rxq-show
  pmd thread numa_id 1 core_id 3:
    isolated : true
    port: dpdk0   queue-id: 0 (enabled)   pmd usage: 90 %
    overhead:  4 %
  pmd thread numa_id 1 core_id 5:
    isolated : false
    port: vhost0  queue-id: 0 (enabled)   pmd usage:  0 %
    port: vhost1  queue-id: 0 (enabled)   pmd usage: 93 %
    port: vhost2  queue-id: 0 (enabled)   pmd usage:  0 %
    port: vhost6  queue-id: 0 (enabled)   pmd usage:  0 %
    overhead:  6 %
  pmd thread numa_id 1 core_id 31:
    isolated : true
    port: dpdk1   queue-id: 0 (enabled)   pmd usage: 86 %
    overhead:  4 %
  pmd thread numa_id 1 core_id 33:
    isolated : false
    port: vhost3  queue-id: 0 (enabled)   pmd usage:  0 %
    port: vhost4  queue-id: 0 (enabled)   pmd usage:  0 %
    port: vhost5  queue-id: 0 (enabled)   pmd usage: 92 %
    port: vhost7  queue-id: 0 (enabled)   pmd usage:  0 %
    overhead:  7 %
Signed-off-by: David Marchand <david.marchand@redhat.com>
Acked-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>