It seems that on slow systems with high concurrency and CPU contention,
time/warp is not accurate enough for the ALB unit tests with the minimum
time/warp values that were used to hit a given number of events. This
results in some intermittent test failures.
As those tests are just waiting for a certain number of events to occur
and there is no functional change during that time, let's do the time/warp
again with higher values.
With this no failures are seen in several hundred runs.
Fixes: a83a406096 ("dpif-netdev: Sync PMD ALB state with user commands.")
Reported-by: Ilya Maximets <i.maximets@ovn.org>
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Reviewed-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The next log line number should be updated to ensure that the
anticipated log has occurred again after more time has passed.
Fixes: a83a406096 ("dpif-netdev: Sync PMD ALB state with user commands.")
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Reviewed-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The 'dataofs' field of the TCP header indicates the TCP header length
in 32-bit words. The header length should be at least 20 bytes
(dataofs >= 20 / 4 words) and must not exceed the TCP data length.
This patch tests 'dataofs' and skips parsing of layer 4 fields when a
bad dataofs is met.
This behavior is consistent with the openvswitch kernel module.
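The check described above can be sketched in Python. This is a hypothetical illustration of the validation logic, not the actual C code from the patch:

```python
def tcp_header_ok(l4_data: bytes) -> bool:
    """Return True if the TCP 'dataofs' field is sane."""
    if len(l4_data) < 20:        # shorter than the minimum TCP header
        return False
    dataofs = l4_data[12] >> 4   # header length in 32-bit words
    hdr_len = dataofs * 4
    # Header must be at least 20 bytes and must fit in the L4 data.
    return 20 <= hdr_len <= len(l4_data)
```

A packet with dataofs below 5 fails the check, and layer 4 parsing would be skipped for it.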
Fixes: 5a51b2cd34 ("lib/ofpbuf: Remove 'l7' pointer.")
Signed-off-by: lic121 <lic121@chinatelecom.cn>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Without this fix, flowgen.py generates bad TCP packets:
tcpdump reports "bad hdr length 4 - too short" on the pcap
generated by flowgen.py.
This patch corrects the packet data endianness.
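As a hypothetical illustration of the class of bug involved (not the actual flowgen.py code): multi-byte header fields must be packed in network byte order, which `struct` selects with the `'!'` prefix:

```python
import struct

# Packing TCP ports in network (big-endian) byte order. Using the
# host's native order instead would corrupt the header on
# little-endian machines.
def pack_tcp_ports(src_port: int, dst_port: int) -> bytes:
    return struct.pack('!HH', src_port, dst_port)
```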
Signed-off-by: lic121 <lic121@chinatelecom.cn>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
OVS_DP_F_UNALIGNED is already set, so there is no need to set it again.
If OVS is restarting, the datapath is already created, so
dpif_netlink_dp_transact() will return EEXIST and there is no need to
probe again.
Signed-off-by: Chris Mi <cmi@nvidia.com>
Acked-by: Paolo Valerio <pvalerio@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Currently, when a user creates an OpenFlow group with multiple
buckets without specifying a selection type, the efficient dp_hash
selection method is only used if the user is creating fewer than 64
buckets. But when dp_hash is explicitly selected, up to 256 buckets
are supported.
While up to 64 buckets seems like a lot, certain OVN/OpenStack
workloads could result in the user creating more than 64 buckets, for
example, when using OVN to load balance. This patch increases the
default maximum from 64 to 256.
This change to the default limit doesn't affect how many buckets are
actually created (that is specified by the user when the group is
created), just how traffic is distributed across them.
Signed-off-by: Mike Pattrick <mkp@redhat.com>
Acked-by: Gaetan Rivet <grive@u256.net>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Rows that refer to rows inserted in the current IDL run should only
be reparsed if they don't get deleted (become orphans) in the current
IDL run.
Fixes: 7b8aeadd60 ("ovsdb-idl: Re-parse backrefs of inserted rows only once.")
Reported-by: Ilya Maximets <i.maximets@ovn.org>
Signed-off-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
While removing flows, the removal itself is deferred, so classifier
changes are performed from the RCU thread. This way, every deferred
removal triggers a classifier change and a reallocation of a pvector.
Freeing of the old version of a pvector is postponed. Since all this
is happening in an RCU thread, all these copies of the same pvector
will be freed only after the next grace period.
Below is the example output of the 'valgrind --tool=massif' from an OVN
deployment, where copies of that pvector took 5 GB of memory while
processing a bundled flow removal:
-------------------------------------------------------------------
n time(i) total(B) useful-heap(B) extra-heap(B)
-------------------------------------------------------------------
89 176,257,987,954 5,329,763,160 5,318,171,607 11,591,553
99.78% (5,318,171,607B) (heap allocation functions) malloc/new/new[]
->98.45% (5,247,008,392B) xmalloc__ (util.c:137)
|->98.17% (5,232,137,408B) pvector_impl_dup (pvector.c:48)
||->98.16% (5,231,472,896B) pvector_remove (pvector.c:159)
|||->98.16% (5,231,472,800B) destroy_subtable (classifier.c:1558)
||||->98.16% (5,231,472,800B) classifier_remove (classifier.c:792)
|||| ->98.16% (5,231,472,800B) classifier_remove_assert (classifier.c:832)
|||| ->98.16% (5,231,472,800B) remove_rule_rcu__ (ofproto.c:2978)
|||| ->98.16% (5,231,472,800B) remove_rule_rcu (ofproto.c:2990)
|||| ->98.16% (5,231,472,800B) ovsrcu_call_postponed (ovs-rcu.c:346)
|||| ->98.16% (5,231,472,800B) ovsrcu_postpone_thread (ovs-rcu.c:362)
|||| ->98.16% (5,231,472,800B) ovsthread_wrapper
|||| ->98.16% (5,231,472,800B) start_thread
|||| ->98.16% (5,231,472,800B) clone
Fix the problem by collecting all the flows to be removed and
postponing the removal for all of them together. This way, all
removals will trigger only a single pvector re-allocation, greatly
reducing the CPU and memory usage.
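The batching idea above can be sketched with a plain list standing in for the copy-on-write pvector (a hypothetical illustration, not the C implementation):

```python
# Removing N items one by one from a copy-on-write vector makes N
# copies of the array; batching the removals makes exactly one.
def remove_all(pvector: list, to_remove: set) -> list:
    # One new copy replaces N per-removal copies of the array.
    return [item for item in pvector if item not in to_remove]
```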
Reported-by: Vladislav Odintsov <odivlad@gmail.com>
Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2021-November/389538.html
Tested-by: Vladislav Odintsov <odivlad@gmail.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
While processing a bundle, OVS will add all new and modified rules
to classifiers. Classifiers are using RCU-protected pvector to
store subtables. Addition of a new subtable or removal of the old
one leads to re-allocation and memory copy of the pvector array.
Old version of that array is given to RCU thread to free it later.
The problem is that the bundle is processed under the mutex without
entering the quiescent state, therefore memory cannot be freed
until the whole bundle is processed. So, if a few thousand flows
are added to the same table in a bundle, the pvector will be
re-allocated while adding each of them, and we'll end up with a few
thousand copies of the same array waiting to be freed.
In OVN deployments, there could be hundreds of thousands of flows
in the same table, leading to fast consumption of a huge amount of
memory and a lot of CPU cycles wasted on allocations and copies.
The snippet of 'valgrind --tool=massif' output below shows
ovs-vswitchd consuming 3.5 GB of RAM while processing a bundle with
65K FLOW_MODs in an OVN deployment. 3.4 GB of that memory are
copies of the same pvector.
-------------------------------------------------------------------
n time(i) total(B) useful-heap(B) extra-heap(B)
-------------------------------------------------------------------
64 109,907,465,404 3,559,987,568 3,546,879,748 13,107,820
99.63% (3,546,879,748B) (heap allocation functions) malloc/new/new[]
->97.61% (3,474,750,333B) xmalloc__ (util.c:137)
|->97.61% (3,474,750,333B) xmalloc (util.c:172)
| ->96.38% (3,431,068,352B) pvector_impl_dup (pvector.c:48)
|| ->96.38% (3,431,067,840B) pvector_insert (pvector.c:138)
|| |->96.38% (3,431,067,840B) classifier_replace (classifier.c:664)
|| | ->96.38% (3,431,067,840B) classifier_insert (classifier.c:695)
|| | ->96.38% (3,431,067,840B) replace_rule_start (ofproto.c:5563)
|| | ->96.38% (3,431,067,840B) add_flow_start (ofproto.c:5179)
|| | ->96.38% (3,431,067,840B) ofproto_flow_mod_start (ofproto.c:8017)
|| | ->96.38% (3,431,067,744B) do_bundle_commit (ofproto.c:8168)
|| | |->96.38% (3,431,067,744B) handle_bundle_control (ofproto.c:8309)
|| | | ->96.38% (3,431,067,744B) handle_single_part_openflow (ofproto.c:8593)
|| | | ->96.38% (3,431,067,744B) handle_openflow (ofproto.c:8674)
|| | | ->96.38% (3,431,067,744B) ofconn_run (connmgr.c:1329)
|| | | ->96.38% (3,431,067,744B) connmgr_run (connmgr.c:356)
|| | | ->96.38% (3,431,067,744B) ofproto_run (ofproto.c:1879)
|| | | ->96.38% (3,431,067,744B) bridge_run__ (bridge.c:3251)
|| | | ->96.38% (3,431,067,744B) bridge_run (bridge.c:3310)
|| | | ->96.38% (3,431,067,744B) main (ovs-vswitchd.c:127)
Fix that by postponing the publishing of classifier updates,
so each flow modification can work with the same version of the
pvector. Unfortunately, bundled PACKET_OUT messages require all
previous changes to be published before processing, otherwise the
packet would use the wrong version of the OF tables. Publish all
changes before processing PACKET_OUT messages to avoid this issue.
Hopefully, a mix of a large number of FLOW_MOD and PACKET_OUT
messages is not a common use case.
Reported-by: Vladislav Odintsov <odivlad@gmail.com>
Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2021-November/389503.html
Tested-by: Vladislav Odintsov <odivlad@gmail.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
ssl_send() clones the data before sending, but if SSL_write() succeeds
on the first attempt, this is just a waste of CPU cycles.
Trying to send the original buffer instead, and only copying the
remaining data if it's not possible to send it all right away.
This should save a few cycles on every send.
Note:
It's probably possible to avoid the copy even if we can't send
everything at once, but that will likely require a major change
of the stream-ssl module in order to take into account all the
corner cases related to SSL connections. So, not trying to do that
for now.
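The fast path can be sketched as follows (a hypothetical illustration of the idea; the real code is C in stream-ssl, and `send` here stands in for SSL_write returning the number of bytes written):

```python
# Try to send the caller's buffer directly; copy only the remainder
# that could not be sent right away.
def send_with_lazy_copy(send, data: bytes):
    n = send(data)             # may send only part of the buffer
    if n == len(data):
        return None            # fast path: no copy needed at all
    return bytes(data[n:])     # copy only the unsent remainder
```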
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
ovsdb_atom_string and json_string are basically the same data structure,
and ovsdb-server frequently needs to convert one to the other. We can
avoid that by using json_string from the beginning for all ovsdb
strings. This way, the conversion turns into a simple json_clone(),
i.e. an increment of a reference counter. This change gives a moderate
performance boost in some scenarios, improves code clarity and
may be useful for future development.
Acked-by: Mike Pattrick <mkp@redhat.com>
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
With the next commit, reference counting of json objects will take a
significant part of the CPU time in ovsdb-server. Inlining these
functions reduces the cost of a function call.
Acked-by: Mike Pattrick <mkp@redhat.com>
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The length in the Data Link Header for these packets should not include
the source and destination MACs or the length field itself.
Therefore, it should be 14 bytes less; otherwise other network
tools like Wireshark complain:
Expert Info (Error/Malformed):
Length field value goes past the end of the payload
Additionally, fixing the printing of the packet/flow configuration,
as it currently prints '%s=%s' strings without any real data.
Acked-by: Paolo Valerio <pvalerio@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Currently the ovs-tcpdump utility creates a virtual tunnel to send
packets to. This method functions perfectly fine, however, it can
greatly impact performance of the monitored port.
It has been reported to reduce packet throughput significantly. I was
able to reproduce a reduction in throughput of up to 70 percent in some
tests with a simple setup of two hosts communicating through a single
bridge on Linux with the kernel module datapath. Another more complex
test was configured for the usermode datapath both with and without
DPDK. This test involved a data path going from a VM, through a port
into one OVS bridge, out through a network card which could be DPDK
enabled for the relevant tests, in to a different network interface,
then into a different OVS bridge, through another port, and then into
a virtual machine.
Using the dummy driver resulted in the following impact on performance
compared to no ovs-tcpdump. Due to intra-test variance and fluctuations
during the first few seconds after installing a tap, multiple samples
were taken over multiple test runs. The first few seconds' worth of
results were discarded and the remaining results were averaged.
If the dummy driver isn't present, the script falls back on the
existing tap code.
Original Script
===============
Category Impact on Throughput
Kernel datapath - 65%
Usermode (no DPDK) - 26%
DPDK ports in use - 37%
New Script
==========
Category Impact on Throughput
Kernel datapath - 5%
Usermode (no DPDK) - 16%
DPDK ports in use - 29%
Signed-off-by: Mike Pattrick <mkp@redhat.com>
Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
ovsdb-tool join-cluster requires a remote address, so the existing
code that tried to join a cluster without one when there was an
existing $DB_FILE would fail.
Instead, if we are trying to specifically join a cluster and there
is an existing $DB_FILE, back it up and remove the original before
continuing to join the cluster.
Signed-off-by: Terry Wilson <twilson@redhat.com>
Acked-by: Dumitru Ceara <dceara@redhat.com>
Acked-by: Flavio Fernandes <flavio@flaviof.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Many Python implementations pre-allocate space for multiple
objects in empty dicts and lists. Using a custom dict-like object
that only creates these objects when they are accessed can save
memory.
On a fairly pathological case where the DB has 1000 networks each
with 100 ports, with only 'name' fields set, this saves around
300MB of memory.
One could argue that if values are not going to change from their
defaults, then users should not be monitoring those columns, but
it's also probably good to not waste memory even if user code is
sub-optimal.
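The idea can be sketched like this (a hypothetical minimal illustration, not the actual class from the Python IDL):

```python
class LazyDict:
    """Dict-like object that defers allocating the real dict until
    the first write, so empty instances carry no dict at all."""
    __slots__ = ('_data',)

    def __init__(self):
        self._data = None  # real dict created on first write

    def __setitem__(self, key, value):
        if self._data is None:
            self._data = {}
        self._data[key] = value

    def __getitem__(self, key):
        if self._data is None:
            raise KeyError(key)
        return self._data[key]

    def __len__(self):
        return 0 if self._data is None else len(self._data)
```

With a million rows whose columns keep their defaults, the underlying dicts are simply never allocated.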
Signed-off-by: Terry Wilson <twilson@redhat.com>
Acked-by: Dumitru Ceara <dceara@redhat.com>
Acked-by: Flavio Fernandes <flavio@flaviof.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The match keyword "igmp" is not supported in ofp-parse, which means
that flow dumps cannot be restored. Previously a workaround was
added to ovs-save to avoid changing output in stable branches.
This patch changes the output to print the igmp match in the accepted
ofp-parse format (ip,nw_proto=2) and to print igmp_type/code as the
generic tp_src/dst. Tests are added, and NEWS is updated to reflect
this change.
The workaround in ovs-save is still included to ensure that flows
can be restored when upgrading an older ovs-vswitchd. This workaround
should be removed in later versions.
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: Salvatore Daniele <sdaniele@redhat.com>
Co-authored-by: Salvatore Daniele <sdaniele@redhat.com>
Acked-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
match.c generates the keyword "igmp", which is not supported in ofp-parse.
This means that flow dumps containing 'igmp' cannot be restored.
Removing the 'igmp' keyword entirely could break existing scripts in stable
branches, so this patch creates a workaround within ovs-save by converting any
instances of "igmp" within $bridge.flows.dump into "ip,nw_proto=2", and any
instances of igmp_type/code into the generic tp_src/dst.
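The rewrite can be sketched in Python (a hypothetical illustration; the actual workaround lives in the ovs-save shell script):

```python
import re

# Rewrite matches that ofp-parse cannot read back into their
# accepted equivalents.
def fix_igmp(flow: str) -> str:
    # \b stops the bare-keyword substitution from touching
    # igmp_type/igmp_code, since '_' is a word character.
    flow = re.sub(r'\bigmp\b', 'ip,nw_proto=2', flow)
    flow = flow.replace('igmp_type', 'tp_src')
    flow = flow.replace('igmp_code', 'tp_dst')
    return flow
```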
Signed-off-by: Salvatore Daniele <sdaniele@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
It is useful to also log when tnl_port_build_header() fails.
Signed-off-by: Yunjian Wang <wangyunjian@huawei.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
While adding new rows, ovsdb-idl re-parses all the other rows that
reference the new one. For example, current ovn-kubernetes creates
load balancers and adds the same load balancer to all logical switches
and logical routers. So, when a new load balancer is added, the rows
for all logical switches and routers are re-parsed.
During initial database connection (or re-connection with
monitor/monitor_cond or monitor_cond_since with outdated last
transaction id) the client downloads the whole content of a database.
In case of OVN, there might be already thousands of load balancers
configured. ovsdb-idl will process rows in that initial monitor reply
one-by-one. Therefore, for each load balancer row, it will re-parse
all rows for switches and routers.
Assuming that we have 120 logical switches and 30K load balancers,
processing of the initial monitor reply will take 120 (switch rows) *
30K (load balancer references in a switch row) * 30K (load balancer
rows) = 10^11 operations, which may take hours. ovn-kubernetes will
use LB groups eventually, but there are other, less obvious cases that
cannot be changed that easily.
Re-parsing doesn't change any internal structures of the IDL. It
destroys and re-creates exactly the same arcs between rows. The only
thing that changes is the application-facing array of pointers.
Since the internal structures remain intact, the suggested solution is
to postpone the re-parsing of back references until all the monitor
updates are processed. This way each row is re-parsed only once.
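The batching can be sketched as follows (a hypothetical illustration of the scheme, with `backrefs_of` and `reparse` standing in for the IDL internals):

```python
# Instead of re-parsing every referencing row after each inserted row,
# collect the affected rows across the whole monitor update and
# re-parse each of them exactly once at the end.
def process_monitor_update(new_rows, backrefs_of, reparse) -> int:
    to_reparse = set()
    for row in new_rows:
        to_reparse.update(backrefs_of(row))
    for row in to_reparse:
        reparse(row)  # each affected row is re-parsed exactly once
    return len(to_reparse)
```

With 30K inserted load balancers all referenced by the same 120 switch rows, this performs 120 re-parses instead of 120 * 30K.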
Tested in a sandbox with 120 LSs, 120 LRs and 3K LBs, where each
load balancer added to each LS and LR, by re-statring ovn-northd and
measuring the time spent in ovsdb_idl_run().
Before the change:
OVN_Southbound: ovsdb_idl_run took: 924 ms
OVN_Northbound: ovsdb_idl_run took: 825118 ms --> 13.75 minutes!
After:
OVN_Southbound: ovsdb_idl_run took: 692 ms
OVN_Northbound: ovsdb_idl_run took: 1698 ms
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Commit dc0bd12f5b removed the restriction that a tunnel endpoint must
be a bridge port. So, currently, OVS has to check whether the native
tunnel needs to be terminated regardless of the output port.
Unfortunately, there is a side effect: tnl_port_map_lookup() always
adds at least a 'dl_dst' match to the megaflow that ends up in the
corresponding datapath flow. And since tunneling works at the L3
level and is not restricted to any particular bridge, this extra
match criterion is added to every datapath flow on every bridge,
even if that bridge cannot be part of tunnel processing.
For example, if OVS has at least one tunnel configured and we're
adding a completely separate bridge with 2 ports and simple rules
to forward packets between two ports, there still will be a match on
a destination mac address:
1. <create a tunnel configuration in OVS>
2. ovs-vsctl add-br br-non-tunnel -- set bridge datapath_type=netdev
3. ovs-vsctl add-port br-non-tunnel port0
-- add-port br-non-tunnel port1
4. ovs-ofctl del-flows br-non-tunnel
5. ovs-ofctl add-flow br-non-tunnel in_port=port0,actions=port1
6. ovs-ofctl add-flow br-non-tunnel in_port=port1,actions=port0
# ovs-appctl ofproto/trace br-non-tunnel in_port=port0
Flow: in_port=1,vlan_tci=0x0000,
dl_src=00:00:00:00:00:00,dl_dst=00:00:00:00:00:00,dl_type=0x0000
bridge("br-non-tunnel")
-----------------------
0. in_port=1, priority 32768
output:2
Final flow: unchanged
Megaflow: recirc_id=0,eth,in_port=1,dl_dst=00:00:00:00:00:00,dl_type=0x0000
Datapath actions: 5 ^^^^^^^^^^^^^^^^^^^^^^^^
This increases the number of upcalls and installed datapath flows,
since a separate flow needs to be installed per destination MAC,
reducing the switching performance. This also blocks datapath
performance optimizations that are based on the simplicity of
datapath flows.
In general, in order to be a tunnel endpoint, a port has to have an IP
address. Hence, native tunnel termination should be attempted only
for such ports. This avoids the extra matches in most cases.
Fixes: dc0bd12f5b ("userspace: Enable non-bridge port as tunnel endpoint.")
Reported-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2021-October/388904.html
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Mike Pattrick <mkp@redhat.com>
xlate_check_pkt_larger() sets ctx->exit to 'true' at the end
causing the translation to stop. This results in incomplete
datapath rules.
For example, for the below OF rules configured on a bridge,
table=0,in_port=1 actions=load:0x1->NXM_NX_REG1[[]],resubmit(,1),
load:0x2->NXM_NX_REG1[[]],resubmit(,1),
load:0x3->NXM_NX_REG1[[]],resubmit(,1)
table=1,in_port=1,reg1=0x1 actions=check_pkt_larger(200)->NXM_NX_REG0[[0]],
resubmit(,4)
table=1,in_port=1,reg1=0x2 actions=output:2
table=1,in_port=1,reg1=0x3 actions=output:4
table=4,in_port=1 actions=output:3
The datapath flow should be:
check_pkt_len(size=200,gt(3),le(3)),2,4
But right now it is:
check_pkt_len(size=200,gt(3),le(3))
Actions after the first resubmit(,1) in the first flow in table 0
are never applied. This patch fixes this issue.
Fixes: 5b34f8fc3b ("Add a new OVS action check_pkt_larger")
Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=2018365
Reported-by: Ihar Hrachyshka <ihrachys@redhat.com>
Signed-off-by: Numan Siddique <numans@ovn.org>
Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Previously, when a user enabled the PMD auto load balancer with
pmd-auto-lb="true", some conditions required for a rebalance to take
place, such as the number of PMDs/RxQs, were checked.
If the configuration meant that a rebalance would not take place,
then PMD ALB was logged as 'disabled' and not run.
Later, if the PMD/RxQ configuration changed whereby a rebalance
could be effective, PMD ALB was logged as 'enabled' and would run at
the appropriate time.
This worked fine from a functional point of view, but it is
unintuitive for a user reading the logs.
e.g. with one PMD (PMD ALB would not be effective)
User enables ALB, but logs say it is disabled because it won't run.
$ ovs-vsctl set open_vSwitch . other_config:pmd-auto-lb="true"
|dpif_netdev|INFO|PMD auto load balance is disabled
No dry run takes place.
Add more PMDs (PMD ALB may be effective).
$ ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=50
|dpif_netdev|INFO|PMD auto load balance is enabled ...
Dry run takes place.
|dpif_netdev|DBG|PMD auto load balance performing dry run.
A better approach is to simply reflect back the user enable/disable
state in the logs and deal with whether the rebalance will be effective
when needed. That is the approach taken in this patch.
To cut down on unnecessary work, some basic checks are also made before
starting a PMD ALB dry run, and debug logs can indicate this to the user.
e.g. with one PMD (PMD ALB would not be effective)
User enables ALB, and logs confirm the user has enabled it.
$ ovs-vsctl set open_vSwitch . other_config:pmd-auto-lb="true"
|dpif_netdev|INFO|PMD auto load balance is enabled...
No dry run takes place.
|dpif_netdev|DBG|PMD auto load balance nothing to do, not enough non-isolated PMDs or RxQs.
Add more PMDs (PMD ALB may be effective).
$ ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=50
Dry run takes place.
|dpif_netdev|DBG|PMD auto load balance performing dry run.
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: Sunil Pai G <sunil.pai.g@intel.com>
Reviewed-by: David Marchand <david.marchand@redhat.com>
Acked-by: Gaetan Rivet <grive@u256.net>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
When a PMD reload occurs, some PMD cycle measurements are reset.
In order to preserve the full cycle history of an RxQ, the RxQ
cycle measurements were not reset.
These are both used together to display the % of a PMD that an
RxQ is using in the pmd-rxq-show stat.
Resetting one and not the other can lead to some unintuitive-looking
stats while the stats settle for the new config. In some cases, it
may appear as if the RxQs are using >100% of a PMD.
This resolves when the stats settle for the new config, but
seeing RxQs apparently using >100% of a PMD may confuse a user
and lead them to think there is a bug.
To avoid this, reset the RxQ cycle measurements on PMD reload.
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Acked-by: Gaetan Rivet <grive@u256.net>
Acked-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The admissibility check currently logs a message like this (line
wrapped in this commit log):
bond(revalidator11)|DBG|member (dpdk0): Admissibility
verdict is to drop pkt as different port is learned.active member: false,
may_enable: true enable: true LACP status:2
Fix the spaces around the period character and separate the debug info
with commas.
Prefix all log messages in this check with the bond and member names.
Display a human-readable string for the LACP status.
New logs look like:
bond(revalidator11)|DBG|bond dpdkbond0: member dpdk0: admissibility
verdict is to drop pkt as different port is learned, active member: false,
may_enable: true, enabled: true, LACP status: off
Signed-off-by: David Marchand <david.marchand@redhat.com>
Acked-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
While testing OVS on Windows flows for IP fragments, the traffic is
dropped as it may match an incorrect OVS flow. From the code, after
the IPv4 fragments are reassembled, OVS still uses the flow key of the
last IPv4 fragment to match against CT, causing a match error.
Reported-at: https://github.com/openvswitch/ovs-issues/issues/232
Signed-off-by: Wilson Peng <pweisong@vmware.com>
Signed-off-by: Alin-Gabriel Serdean <aserdean@ovn.org>
Python objects normally have a dictionary named __dict__ allocated
for handling dynamically assigned attributes. Depending on the
architecture and Python version, that empty dict may be between
64 and 280 bytes.
Seeing as Atom and Datum objects do not need dynamic attribute
support and there can be millions of rows in a database, avoiding
this allocation with __slots__ can save 100s of MBs of memory per
Idl process.
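The difference can be illustrated with a minimal sketch (hypothetical class names, not the actual Atom/Datum definitions):

```python
# With a plain class, every instance allocates a __dict__ for dynamic
# attributes; with __slots__, attribute storage is fixed and no
# per-instance dict exists.
class AtomWithDict:
    def __init__(self, value):
        self.value = value

class AtomWithSlots:
    __slots__ = ('value',)

    def __init__(self, value):
        self.value = value
```

`sys.getsizeof()` on the instances plus their (absent) dicts shows the per-object saving that adds up across millions of rows.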
Signed-off-by: Terry Wilson <twilson@redhat.com>
Acked-by: Timothy Redaelli <tredaelli@redhat.com>
Tested-by: Timothy Redaelli <tredaelli@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The systemd unit file generates warnings about the PID file path,
since /var/run is a legacy path, so just use /run instead of /var/run.
/var/run has been a symlink to /run starting from RHEL 7 (and any
other distribution that uses systemd).
Reported-at: https://bugzilla.redhat.com/1952081
Signed-off-by: Timothy Redaelli <tredaelli@redhat.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
This patch adds a general way of viewing and configuring datapath
cache sizes, with an implementation for the netlink interface.
The ovs-dpctl/ovs-appctl show commands will display the
current cache sizes configured:
$ ovs-dpctl show
system@ovs-system:
lookups: hit:25 missed:63 lost:0
flows: 0
masks: hit:282 total:0 hit/pkt:3.20
cache: hit:4 hit-rate:4.54%
caches:
masks-cache: size:256
port 0: ovs-system (internal)
port 1: br-int (internal)
port 2: genev_sys_6081 (geneve: packet_type=ptap)
port 3: br-ex (internal)
port 4: eth2
port 5: sw0p1 (internal)
port 6: sw0p3 (internal)
A specific cache can be configured as follows:
$ ovs-appctl dpctl/cache-set-size DP CACHE SIZE
$ ovs-dpctl cache-set-size DP CACHE SIZE
For example to disable the cache do:
$ ovs-dpctl cache-set-size system@ovs-system masks-cache 0
Setting cache size successful, new size 0.
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Paolo Valerio <pvalerio@redhat.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
If a user frequently changes a lot of rows in a database, the
transaction history can grow much larger than the database itself.
This wastes a lot of memory and also makes monitor_cond_since slower
than a usual monitor_cond if the transaction id is old enough, because
re-construction of the changes from the history is slower than just
creating an initial database snapshot. This is also the case if the
user deleted a lot of data, since the transaction history still holds
all of it while the database itself doesn't.
In the case of the current lb-per-service model in ovn-kubernetes,
each load-balancer is added to every logical switch/router, so such
a transaction touches more than half of the OVN_Northbound database.
And each of these transactions is added to the transaction history.
Since the transaction history depth is 100, in the worst case
scenario it will hold 100 copies of the database, increasing memory
consumption dramatically. In tests with 3000 LBs and 120 LSs, memory
goes up to 3 GB, while holding at 30 MB with transaction history
disabled in the code.
Fix that by keeping count of the number of ovsdb_atom's in the
database and not allowing the total number of atoms in the transaction
history to grow larger than this value. Counting atoms is fairly
cheap because we don't need to iterate over them, so it doesn't have
a significant performance impact. It would be ideal to measure the
size of individual atoms, but that would hurt performance.
Counting cells instead of atoms is not sufficient, because OVN
users add hundreds or thousands of atoms to a single cell, so cells
largely differ in size.
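The eviction policy can be sketched as follows (a hypothetical illustration of the idea, with the history held as a deque of per-transaction atom counts):

```python
from collections import deque

# Drop the oldest transactions once the history holds more atoms
# than the database itself currently contains.
def trim_history(history: deque, atoms_in_history: int,
                 db_atoms: int) -> int:
    """history holds (txn_id, n_atoms) tuples, oldest first.
    Returns the atom count remaining in the history."""
    while history and atoms_in_history > db_atoms:
        _, n_atoms = history.popleft()
        atoms_in_history -= n_atoms
    return atoms_in_history
```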
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Han Zhou <hzhou@ovn.org>
Acked-by: Dumitru Ceara <dceara@redhat.com>
On large scale deployments with records that contain large sets, this
significantly improves client side performance as it avoids comparing
full contents of the old and new rows.
Signed-off-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The main idea is to not store the list of weak references in the source
row, so they don't all need to be re-checked/updated on every
modification of that source row. The point is that the source row
already knows the UUIDs of all destination rows stored in the data, so
there is not much profit in storing this information somewhere else.
If needed, the destination row can be looked up, and the reference can
be looked up in the destination row. For fast lookup, the destination
row now stores references in a hash map.
The weak reference structure now contains the table and uuid of the
source row instead of a direct pointer. This allows replacing/updating
the source row without breaking any weak references stored in
destination rows.
The structure also now contains the key-value pair of atoms that
triggered the creation of this reference. These atoms can be used to
quickly subtract removed references from a source row. During
reassessment, ovsdb now only needs to care about newly added or
removed atoms, and about atoms that got removed due to removal of the
destination rows, but those are marked for reassessment by the
destination row.
ovsdb_datum_subtract() is used to remove atoms that point to removed
or incorrect rows, so there is no need to re-sort the datum in the end.
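The reversed bookkeeping can be sketched like this (a hypothetical illustration with simplified names, not the actual ovsdb-server structures):

```python
# The destination row keeps a hash map of incoming weak references
# keyed by (source table, source row uuid), instead of the source row
# keeping a list of outgoing ones.
class Row:
    def __init__(self, uuid):
        self.uuid = uuid
        # (src_table, src_uuid) -> set of atoms that created the refs
        self.incoming_weak_refs = {}

def add_weak_ref(dst: Row, src_table: str, src_uuid: str, atom):
    dst.incoming_weak_refs.setdefault(
        (src_table, src_uuid), set()).add(atom)

def drop_refs_from(dst: Row, src_table: str, src_uuid: str):
    # The source row can be replaced without touching any of its own
    # structures; its references are found here by table and uuid.
    return dst.incoming_weak_refs.pop((src_table, src_uuid), set())
```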
Results of an OVN load-balancer benchmark that adds 3K load-balancers
to each of 120 logical switches and 120 logical routers in the OVN
sandbox with clustered Northbound database and then removes them:
Before:
%CPU CPU Time CMD
86.8 00:16:05 ovsdb-server nb1.db
44.1 00:08:11 ovsdb-server nb2.db
43.2 00:08:00 ovsdb-server nb3.db
After:
%CPU CPU Time CMD
54.9 00:02:58 ovsdb-server nb1.db
33.3 00:01:48 ovsdb-server nb2.db
32.2 00:01:44 ovsdb-server nb3.db
So, on the cluster leader the processing time dropped by 5.4x and on
followers by 4.5x. The more load-balancers, the larger the
performance difference. There is a slight increase in memory usage,
because the new reference structure is larger, but the difference is
not significant.
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Dumitru Ceara <dceara@redhat.com>
Added a new function to return memory usage statistics for database
objects inside the IDL, similar to what ovsdb-server reports.
The _Server database is not counted, as it should be small, hence it
isn't worth adding extra code to the ovsdb-cs module. It can be added
later if needed.
ovs-vswitchd is a user in OVS, but this API will be mostly useful for
OVN daemons.
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Han Zhou <hzhou@ovn.org>
Acked-by: Dumitru Ceara <dceara@redhat.com>
Currently, some patches have tags wrongly written (with a space
instead of a dash), and this may prevent automatic systems or CI
from detecting them correctly.
This commit adds a check in checkpatch to be sure a tag is written
correctly with a dash and not with a space.
The tags supported by the commit are:
Acked-by, Reported-at, Reported-by, Requested-by, Reviewed-by, Submitted-at
and Suggested-by.
It's not necessary to add "Signed-off-by" since it's already checked in
checkpatch.
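A minimal sketch of the kind of check described (the function name and exact regex are illustrative, not the actual checkpatch.py code):

```python
import re

# Flag tags written with a space ("Acked by:") instead of a dash
# ("Acked-by:").  Covers the tags listed above.
_BAD_TAG = re.compile(
    r'^(Acked|Reported|Requested|Reviewed|Submitted|Suggested) (at|by):')

def has_miswritten_tag(line):
    """Return True if the line starts with a tag using a space, not a dash."""
    return _BAD_TAG.match(line) is not None
```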
Signed-off-by: Timothy Redaelli <tredaelli@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
There was a typo in the function pointer check in
dpif_bond_add() before calling bond_add().
Fixes: 9df65060cf ("userspace: Avoid dp_hash recirculation for balance-tcp bond mode.")
Signed-off-by: Somnath Chatterjee <somnath.b.chatterjee@ericsson.com>
Acked-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Currently, pyOpenSSL is half-deprecated upstream and so it's removed on
some distributions (for example on CentOS Stream 9,
https://issues.redhat.com/browse/CS-336), but since OVS only
supports Python 3 it's possible to replace pyOpenSSL with "import ssl"
included in base Python 3.
Stream recv and send had to be split into _recv and _send, since
SSLError is a subclass of socket.error, and so it was not possible to
catch SSLWantReadError and SSLWantWriteError in recv and send of
SSLStream.
TCPstream._open cannot be used in SSLStream, since the Python ssl
module requires the SSL socket to be created before connecting it, so
SSLStream._open needs to create the socket, create the SSL socket and
then connect the SSL socket.
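The ordering constraint can be sketched as follows (illustrative only; the function name is hypothetical and the real code lives in python/ovs/stream.py, which also verifies peer certificates):

```python
import socket
import ssl

def ssl_create_unconnected(server_hostname=None):
    """Create the TCP socket and wrap it in SSL *before* connecting,
    as the Python ssl module requires for asynchronous connects."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE  # sketch only; real code verifies peers
    # do_handshake_on_connect=False defers the handshake, which is later
    # driven by catching SSLWantReadError/SSLWantWriteError.
    return ctx.wrap_socket(sock, server_hostname=server_hostname,
                           do_handshake_on_connect=False)
```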
Reported-by: Timothy Redaelli <tredaelli@redhat.com>
Reported-at: https://bugzilla.redhat.com/1988429
Signed-off-by: Timothy Redaelli <tredaelli@redhat.com>
Acked-by: Terry Wilson <twilson@redhat.com>
Tested-by: Terry Wilson <twilson@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
In an upcoming patch, pyOpenSSL will be replaced with the Python ssl
module, but in order to do an async connection with the Python ssl
module, the SSL socket must be created when the socket is created,
before the socket is connected.
So, the inet_open_active function is split into 3 parts:
- inet_create_socket_active: creates the socket and returns the family and
the socket, or (error, None) if some error needs to be returned.
- inet_connect_active: connects the socket and returns the errno (it
returns 0 if errno is EINPROGRESS or EWOULDBLOCK).
connect is replaced by connect_ex, since Python suggests using it for
asynchronous connects and it's also cleaner, since inet_connect_active
returns the errno that connect_ex already returns; moreover, due to a
Python limitation, connect cannot be used with the ssl module.
The inet_open_active function is changed to use the new functions
inet_create_socket_active and inet_connect_active.
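A condensed sketch of the resulting split (simplified to IPv4; the real helpers in python/ovs/socket_util.py also handle IPv6 and address parsing):

```python
import errno
import socket

def inet_create_socket_active(style, address):
    """Create an unconnected, non-blocking socket.
    Returns (family, socket) or (error, None)."""
    try:
        sock = socket.socket(socket.AF_INET, style)
        sock.setblocking(False)
        return socket.AF_INET, sock
    except socket.error as e:
        return e.errno, None

def inet_connect_active(sock, address):
    """Connect the socket; returns 0 on success or in-progress, else errno.
    connect_ex() returns the errno directly instead of raising."""
    error = sock.connect_ex(address)
    if error in (0, errno.EINPROGRESS, errno.EWOULDBLOCK):
        return 0
    return error
```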
Signed-off-by: Timothy Redaelli <tredaelli@redhat.com>
Acked-by: Terry Wilson <twilson@redhat.com>
Tested-by: Terry Wilson <twilson@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
While testing OVS-Windows flows for the DNAT/SNAT action, the checksum
in the TCP header is set incorrectly when TCP offload is enabled by
default. As a result, the packet will be dropped on the Windows VM when
processing a packet from a Linux VM which included a correct checksum
in the first place. On the Windows VM, the packet has gone through two
NAT actions, and the OVS Windows kernel will reset the checksum to the
pseudo checksum, losing the original correct checksum value that was
set outside.
Regarding the NAT TCP/UDP checksum reset logic, the TCP checksum should
be reset to the pseudo checksum value only in the Tx direction for the
TCP offload case. For packets from the outside, the OVS Windows kernel
does not need to reset the TCP/UDP checksum, as it is the job of the
receiving network driver to produce a correct checksum value.
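For reference, the pseudo checksum mentioned above is the one's-complement sum of just the IP pseudo header (addresses, protocol, TCP length); with offload enabled the NIC finishes the computation on Tx. A hypothetical standalone sketch of that seed value (not the OVS Windows kernel code):

```python
import socket
import struct

def tcp_pseudo_checksum(src_ip, dst_ip, tcp_len):
    """One's-complement sum of the IPv4 pseudo header for TCP (proto 6),
    not complemented: the value offload engines expect pre-seeded."""
    data = (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
            + struct.pack("!BBH", 0, 6, tcp_len))
    total = sum(struct.unpack("!6H", data))
    while total >> 16:                      # fold carries into 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return total
```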
>>> Sample flow on default configuration on both Windows VM and Linux VM:
(src=192.168.252.1,dst=10.110.225.146) ->dnat/snat->
(src=169.254.169.253,dst=10.176.26.107). Without the fix, the returning
packet (src=10.176.26.107,dst=169.254.169.253) will have the correct
TCP checksum. After the reverse NAT actions, it will be changed to the
packet (src=10.110.225.146,dst=192.168.252.1), but with the incorrect
TCP checksum 0xa97a, which is the pseudo checksum. The related packet
capture is attached to the reported issue below.
Reported-at: https://github.com/openvswitch/ovs-issues/issues/231
Signed-off-by: Wilson Peng <pweisong@vmware.com>
Signed-off-by: Alin-Gabriel Serdean <aserdean@ovn.org>
Recently, the GitHub Actions CI environment has been broken due to an
incompatibility between sphinx-build and the docutils python package.
The pip3 install command will upgrade docutils to an incompatible
version.
Since we install sphinx via pip3, it will always install an appropriate
version of docutils package. By forcing the upgrade, we created a broken
situation. Remove the upgrade command and trust pip3.
Signed-off-by: Aaron Conole <aconole@redhat.com>
Reported-by: Ilya Maximets <i.maximets@ovn.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Since recently, actions/setup-python@v2 started to pull python 3.10.0,
which seems to be incompatible with meson 0.47.1, which we're using
to build DPDK.
This broke CI on 2.16 and master branches:
https://github.com/ovsrobot/ovs/runs/3967167374
Pinning the version to 3.9 for now to avoid the CI failure.
Dependency resolver is still not very happy, but at least it works.
We'll need to find a newer version of meson to use later and revert
this change.
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Aaron Conole <aconole@redhat.com>
Currently, the layers info propagated to ProcessDeferredActions may be
incorrect. Because of this, any subsequent usage of the layers might
result in undesired behavior. Accordingly, this patch adds the related
layers to the deferred action to make sure the layers are consistent
with the related NBL.
In the reported issue 229, we encountered a problem when decapping a
Geneve packet and doing NAT twice (via two flow tables), and found that
the TCP sequence of the HTTP packet got changed.
After debugging, we found the issue is caused by the not-updated layers
values isTcp and isUdp for the Geneve decapping case.
The related function call chain is listed below:
OvsExecuteDpIoctl->OvsActionsExecute->OvsDoExecuteActions->OvsTunnelPortRx
->OvsDoExecuteActions->nat ct action and recirc action
->OvsActionsExecute->deferred actions processing for nat and recirc action
When decapping the Geneve packet, the datapath will first set the
layers for the UDP packet. Then it will go on to do OVS flow extraction
to get the inner packet layers and process the first nat action and the
first recirc action. After that, the datapath will do deferred actions
processing in OvsActionsExecute, and it inherits the incorrect Geneve
packet layers values (isTcp 0 and isUdp 1). So in the second nat action
processing it will get the wrong TCP headers in OvsUpdateAddressAndPort
and update the related TCP checksum field value, but in this case it
will change the packet's TCP seq value.
Reported-at: https://github.com/openvswitch/ovs-issues/issues/229
Signed-off-by: Wilson Peng <pweisong@vmware.com>
Signed-off-by: Alin-Gabriel Serdean <aserdean@ovn.org>
CT zone could be set from a field that is not included in frozen
metadata. Consider the example rules which are typically seen in
OpenStack security group rules:
priority=100,in_port=1,tcp,ct_state=-trk,action=ct(zone=5,table=0)
priority=100,in_port=1,tcp,ct_state=+trk,action=ct(commit,zone=NXM_NX_CT_ZONE[]),2
The zone is set from the first rule's ct action. These two rules will
generate two megaflows: the first one uses zone=5 to query the CT
module, the second one sets the zone-id from the first megaflow and
commits to CT.
The current implementation will generate a megaflow that does not use
ct_zone=5 as a match, but directly commits into CT using zone=5, as the
zone is set by an Imm, not a field.
Consider a situation in which one changes the zone id (for example to
15) in the first rule but keeps the second rule unchanged. During this
change, there is traffic hitting the two generated megaflows; the
revalidator would revalidate all megaflows, but it will not change the
second megaflow, because zone=5 is recorded in the megaflow. So the
xlate will still translate the commit action into zone=5, and new
traffic will still commit to CT with zone=5, not zone=15, resulting in
traffic drops and other issues.
Just like the OVS set-field convention, if a field X is set by Y
(Y being a variable, not an Imm), we should also mask Y as a match in
the generated megaflow. An exception is that if the zone-id is set from
a field that is included in the frozen state (i.e. regs) and this
upcall is a resume of a thawed xlate, the un-wildcarding can be
skipped, as the recirc_id is a hash of the values in these fields and
will change following changes of those fields. When the recirc_id
changes, all megaflows with the old recirc_id will become invalid
later.
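The convention can be sketched abstractly (a toy model with hypothetical names, not the actual xlate code):

```python
def translate_ct_zone(flow, wc, zone_spec):
    """Resolve the zone for a ct() action.  zone_spec is either
    ('imm', value) or ('field', name).  When the zone comes from a
    field, that field is added to the megaflow match (un-wildcarded)."""
    kind, value = zone_spec
    if kind == 'imm':
        return value                 # immediate zone: nothing to match on
    wc.add(value)                    # un-wildcard the source field
    return flow[value]
```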
Fixes: 07659514c3 ("Add support for connection tracking.")
Reported-by: Sai Su <susai.ss@bytedance.com>
Signed-off-by: Peng He <hepeng.0320@bytedance.com>
Acked-by: Mark D. Gray <mark.d.gray@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>