Found by Coverity.
Fixes: 1b1d2e6daa56 ("ovsdb: Introduce experimental support for clustered databases.")
Signed-off-by: Yunjian Wang <wangyunjian@huawei.com>
Acked-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
ovsdb_datum_apply_diff() is heavily used in ovsdb transactions, but
it's linear in terms of number of comparisons. And it also clones
all the atoms along the way. In most cases size of a diff is much
smaller than the size of the original datum, this allows to perform
the same operation in-place with only O(diff->n * log2(old->n))
comparisons and O(old->n + diff->n) memory copies with memcpy.
Using this function while applying diffs read from the storage gives
a significant performance boost and allows to execute much more
transactions per second.
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Mark D. Gray <mark.d.gray@redhat.com>
ovsdb_create() requires schema or storage to be nonnull, but in
practice it requires to have schema name or a storage name to
use it as a database name. Only clustered storage has a name.
This means that only clustered database can be created without
schema, Changing that by allowing unbacked storage to have a
name. This way we can create database with unbacked storage
without schema. Will be used in next commits to create database
for ovsdb 'relay' service model.
Acked-by: Mark D. Gray <mark.d.gray@redhat.com>
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Currently, ovsdb-server stores complete value for the column in a database
file and in a raft log in case this column changed. This means that
transaction that adds, for example, one new acl to a port group creates
a log entry with all UUIDs of all existing acls + one new. Same for
ports in logical switches and routers and more other columns with sets
in Northbound DB.
There could be thousands of acls in one port group or thousands of ports
in a single logical switch. And the typical use case is to add one new
if we're starting a new service/VM/container or adding one new node in a
kubernetes or OpenStack cluster. This generates huge amount of traffic
within ovsdb raft cluster, grows overall memory consumption and hurts
performance since all these UUIDs are parsed and formatted to/from json
several times and stored on disks. And more values we have in a set -
more space a single log entry will occupy and more time it will take to
process by ovsdb-server cluster members.
Simple test:
1. Start OVN sandbox with clustered DBs:
# make sandbox SANDBOXFLAGS='--nbdb-model=clustered --sbdb-model=clustered'
2. Run a script that creates one port group and adds 4000 acls into it:
# cat ../memory-test.sh
pg_name=my_port_group
export OVN_NB_DAEMON=$(ovn-nbctl --pidfile --detach --log-file -vsocket_util:off)
ovn-nbctl pg-add $pg_name
for i in $(seq 1 4000); do
echo "Iteration: $i"
ovn-nbctl --log acl-add $pg_name from-lport $i udp drop
done
ovn-nbctl acl-del $pg_name
ovn-nbctl pg-del $pg_name
ovs-appctl -t $(pwd)/sandbox/nb1 memory/show
ovn-appctl -t ovn-nbctl exit
---
4. Check the current memory consumption of ovsdb-server processes and
space occupied by database files:
# ls sandbox/[ns]b*.db -alh
# ps -eo vsz,rss,comm,cmd | egrep '=[ns]b[123].pid'
Test results with current ovsdb log format:
On-disk Nb DB size : ~369 MB
RSS of Nb ovsdb-servers: ~2.7 GB
Time to finish the test: ~2m
In order to mitigate memory consumption issues and reduce computational
load on ovsdb-servers let's store diff between old and new values
instead. This will make size of each log entry that adds single acl to
port group (or port to logical switch or anything else like that) very
small and independent from the number of already existing acls (ports,
etc.).
Added a new marker '_is_diff' into a file transaction to specify that
this transaction contains diffs instead of replacements for the existing
data.
One side effect is that this change will actually increase the size of
file transaction that removes more than a half of entries from the set,
because diff will be larger than the resulted new value. However, such
operations are rare.
Test results with change applied:
On-disk Nb DB size : ~2.7 MB ---> reduced by 99%
RSS of Nb ovsdb-servers: ~580 MB ---> reduced by 78%
Time to finish the test: ~1m27s ---> reduced by 27%
After this change new ovsdb-server is still able to read old databases,
but old ovsdb-server will not be able to read new ones.
Since new servers could join ovsdb cluster dynamically it's hard to
implement any runtime mechanism to handle cases where different
versions of ovsdb-server joins the cluster. However we still need to
handle cluster upgrades. For this case added special command line
argument to disable new functionality. Documentation updated with the
recommended way to upgrade the ovsdb cluster.
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Memory leak happens while converting existing database into new
database according to the specified schema (ovsdb-client convert
new-schema). Memory leak was detected by valgrind while executing
functional test "schema conversion online - clustered"
==16202== 96 bytes in 6 blocks are definitely lost in loss record 326 of 399
==16202== at 0x4C2DB8F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==16202== by 0x44A5D4: xmalloc (util.c:138)
==16202== by 0x4377A6: alloc_default_atoms (ovsdb-data.c:315)
==16202== by 0x437F18: ovsdb_datum_init_default (ovsdb-data.c:918)
==16202== by 0x413D82: ovsdb_row_create (row.c:59)
==16202== by 0x40AA53: ovsdb_convert_table (file.c:220)
==16202== by 0x40AA53: ovsdb_convert (file.c:275)
==16202== by 0x416BE1: ovsdb_trigger_try (trigger.c:255)
==16202== by 0x40D29E: ovsdb_jsonrpc_trigger_create (jsonrpc-server.c:1119)
==16202== by 0x40D29E: ovsdb_jsonrpc_session_got_request (jsonrpc-server.c:986)
==16202== by 0x40D29E: ovsdb_jsonrpc_session_run (jsonrpc-server.c:556)
==16202== by 0x40D29E: ovsdb_jsonrpc_session_run_all (jsonrpc-server.c:586)
==16202== by 0x40D29E: ovsdb_jsonrpc_server_run (jsonrpc-server.c:401)
==16202== by 0x40682E: main_loop (ovsdb-server.c:209)
==16202== by 0x40682E: main (ovsdb-server.c:460)
The problem was in ovsdb_datum_convert() function, which overrides
pointers to datum memory allocated in ovsdb_row_create() function.
Fix was done by freeing this memory before ovsdb_datum_convert()
is called.
Signed-off-by: Damijan Skvarc <damjan.skvarc@gmail.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
Several OVS structs contain embedded named unions, like this:
struct {
...
union {
...
} u;
};
C11 standardized a feature that many compilers already implemented
anyway, where an embedded union may be unnamed, like this:
struct {
...
union {
...
};
};
This is more convenient because it allows the programmer to omit "u."
in many places. OVS already used this feature in several places. This
commit embraces it in several others.
Signed-off-by: Ben Pfaff <blp@ovn.org>
Acked-by: Justin Pettit <jpettit@ovn.org>
Tested-by: Alin Gabriel Serdean <aserdean@ovn.org>
Acked-by: Alin Gabriel Serdean <aserdean@ovn.org>
This commit adds support for OVSDB clustering via Raft. Please read
ovsdb(7) for information on how to set up a clustered database. It is
simple and boils down to running "ovsdb-tool create-cluster" on one server
and "ovsdb-tool join-cluster" on each of the others and then starting
ovsdb-server in the usual way on all of them.
One you have a clustered database, you configure ovn-controller and
ovn-northd to use it by pointing them to all of the servers, e.g. where
previously you might have said "tcp:1.2.3.4" was the database server,
now you say that it is "tcp:1.2.3.4,tcp:5.6.7.8,tcp:9.10.11.12".
This also adds support for database clustering to ovs-sandbox.
Acked-by: Justin Pettit <jpettit@ovn.org>
Tested-by: aginwala <aginwala@asu.edu>
Signed-off-by: Ben Pfaff <blp@ovn.org>
With this change, "ovsdb-client convert" can be used to convert a database
from one schema to another without taking the database offline.
This can be useful to minimize downtime for a database during a software
upgrade.
Signed-off-by: Ben Pfaff <blp@ovn.org>
Acked-by: Justin Pettit <jpettit@ovn.org>
Until now, ovsdb-server has internally chained a list of replicas from each
database. Whenever ovsdb_txn_commit() commits a transaction, it passes the
transaction to each replica. The first replica, which is always the disk
file that stores the database, is special because it is the only replica
that can report an error and thereby abort the transaction. This is a very
special property that genuinely distinguishes this first replica from the
others on the chain. This commit breaks that first replica out as a
separate kind of entity that is not on the list of replicas. When later
commits add support for clustering, there will only be more and more
special cases for the "first replica", so it makes sense to distinguish it
this way.
Signed-off-by: Ben Pfaff <blp@ovn.org>
Acked-by: Justin Pettit <jpettit@ovn.org>
The OVSDB log code has always had the ability to commit the log to disk and
wait for the commit to finish. This patch introduces a new feature that
allows the client to start a commit in the background and then to determine
asynchronously that the commit has completed. This will be especially
useful later for the distributed database feature.
Signed-off-by: Ben Pfaff <blp@ovn.org>
Reviewed-by: Yifeng Sun <pkusunyifeng@gmail.com>
We want to compact database file if it has been over 24 hours since we
last compacted it and there's more than 100 commits regardless of the
size of the database. This patch fixes the previous comparisson which
checked if 24 hours was elapsed since the next scheduled compaction.
Signed-off-by: Daniel Alvarez <dalvarez@redhat.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
Before this patch, the databases were automatically compacted when a
transaction is logged when:
* It's been > 10 minutes after last compaction AND
* At least 100 commits have occurred AND
* Database has grown at least 4x since last compaction (and it's > 10M)
This patch changes the conditions as follows:
* It's been > 10 minutes after last compaction AND
* At least 100 commits have occurred AND either
- It's been > 24 hours after the last compaction OR
- Database has grown at least 2x since last compaction (and it's > 10M)
Reported-by: Daniel Alvarez <dalvarez@redhat.com>
Reported-at: https://mail.openvswitch.org/pipermail/ovs-discuss/2018-March/046309.html
Signed-off-by: Daniel Alvarez <dalvarez@redhat.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
Current code is mixing wall and monotonic clocks and the traces are not
useful since the timestamps are not accurate. This patch fixes it by
using the same time reference for the log as used in the code.
Without this patch, the traces look like this:
compacting database online (1519124364.908 seconds old, 951 transactions)
Signed-off-by: Daniel Alvarez <dalvarez@redhat.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
Upcoming support for clustered databases will need to provide a more
abstract way to determine when a given file should be compacted, so this
changes the standalone database support to use this mechanism in advance.
Signed-off-by: Ben Pfaff <blp@ovn.org>
Acked-by: Justin Pettit <jpettit@ovn.org>
Until now, OVSDB_LOG_CREATE implied EXCL, but this commit breaks them
apart.
Signed-off-by: Ben Pfaff <blp@ovn.org>
Acked-by: Justin Pettit <jpettit@ovn.org>
Until now, the logging code in ovsdb has only supported a single file
format, for OVSDB standalone database files. Upcoming commits will add
support for another, incompatible format, which uses a different magic
string for identification. This commit allows the logging code to
support both formats.
Signed-off-by: Ben Pfaff <blp@ovn.org>
Acked-by: Justin Pettit <jpettit@ovn.org>
This allows slight code simplifications across the tree.
Signed-off-by: Ben Pfaff <blp@ovn.org>
Tested-by: Yifeng Sun <pkusunyifeng@gmail.com>
Reviewed-by: Yifeng Sun <pkusunyifeng@gmail.com>
This patch allows online compacting to be done under Windows.
To achieve the above we need to close all file handles before trying to
rename the file, switch from rename to MoveFileEx (because rename/MoveFile
fails if the destination exists), reopen the right type of log after the
rename.
If we could not reopen the compacted database or the original database
after the close simply abort and rely on the service manager. This
can be changed in the future.
Signed-off-by: Alin Gabriel Serdean <aserdean@cloudbasesolutions.com>
Co-authored-by: Ben Pfaff <blp@ovn.org>
Signed-off-by: Ben Pfaff <blp@ovn.org>
Tested-by: Alin Gabriel Serdean <aserdean@cloudbasesolutions.com>
To easily allow both in- and out-of-tree building of the Python
wrapper for the OVS JSON parser (e.g. w/ pip), move json.h to
include/openvswitch. This also requires moving lib/{hmap,shash}.h.
Both hmap.h and shash.h were #include-ing "util.h" even though the
headers themselves did not use anything from there, but rather from
include/openvswitch/util.h. Fixing that required including util.h
in several C files mostly due to OVS_NOT_REACHED and things like
xmalloc.
Signed-off-by: Terry Wilson <twilson@redhat.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
Until now, the minimum database size before automatically compacting has
been 10 MB, regardless of the inherent size of the data in the database.
A couple of people have pointed out that this won't scale well to larger
databases. This commit changes this criterion to 4 times the previously
compacted size of the database, with 10 MB as a minimum.
The 4x factor is suggested by Diego Ongaro's thesis, "Consensus: Bridging
Theory and Practice", section 5.1.2 "When to snapshot".
Signed-off-by: Ben Pfaff <blp@ovn.org>
Acked-by: Justin Pettit <jpettit@ovn.org>
A new function vlog_insert_module() is introduced to avoid using
list_insert() from the vlog.h header.
Signed-off-by: Thomas Graf <tgraf@noironetworks.com>
Acked-by: Ben Pfaff <blp@nicira.com>
This is expected to make system debugging easier.
This raises two compatibility issues:
1. When a new ovsdb-tool reads an old database, it will multiply by 1000 any
timestamp it reads which is less than 1<<31. Since this date corresponds to
Jan 16 1970 this is unlikely to cause a problem.
2. When an old ovsdb-tool reads a new database, it will interpret the
millisecond timestamps as seconds and report dates in the far future; the
time of this commit is reported as the year 45672 (each second since the
epoch is interpreted as 16 minutes).
Signed-off-by: Paul Ingram <pingram@nicira.com>
Signed-off-by: Ben Pfaff <blp@nicira.com>
The ovsdb-server compaction timing logic is written assuming monotonic
time at milliscond resolution but it calculated the next compaction time
based on the oldest commit in the database. This was a problem because
commit timestamps are written in wall-clock time to second resolution.
This commit calculates the next compaction time based on the time when
the database was first loaded or the last compaction was done, both in
monotonic time at millisecond resolution.
Signed-off-by: Paul Ingram <pingram@nicira.com>
Signed-off-by: Ben Pfaff <blp@nicira.com>
This is a straight search-and-replace, except that I also removed #include
<assert.h> from each file where there were no assert calls left.
Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Ethan Jackson <ethan@nicira.com>
lockfile_lock() accepts a timeout argument but, aside from unit tests
pertaining to timeout, its value is always 0. Since this feature relies on
a periodic SIGALRM signal, which is not a given if we're not caching time,
the cleanest solution is just to remove it.
Signed-off-by: Leo Alterman <lalterman@nicira.com>
Replaced all instances of Nicira Networks(, Inc) to Nicira, Inc.
Feature #10593
Signed-off-by: Raju Subramanian <rsubramanian@nicira.com>
Signed-off-by: Ben Pfaff <blp@nicira.com>
If 'error' is nonnull then we destroy the row, so we must not try to reuse
the row immediately after that.
Support request #6155.
Repoted-by: Geoff White <gwhite@nicira.com>
When ovsdb-server reads a database file that is corrupted at the
transaction level (that is, the transaction is valid JSON and has the
correct SHA-1 hash, but it does not describe a valid database transaction),
then ovsdb-server should truncate it and overwrite it by valid
transactions. However, until now, it didn't. Instead, it would keep the
invalid transaction and possibly every transaction in the database file
(depending on in what way the transaction was invalid), which would just
cause the same trouble again the next time the database was read.
This fixes the problem. An invalid transaction will be deleted from the
database file at the first write to the database.
Bug #5144.
Bug #5149.
If there's database corruption then it indicates that something went wrong,
e.g. the machine was powered-off by power failure. It's definitely
something that the admin should know about. This sounds like an error to
me, so use that log level.
ovsdb_txn_commit() may return a ovsdb_error structure, which should be
freed by the caller. The only remaining caller that discards the result
is in ovsdb_file_open__(), which this fixes.
Suggested-by: Ben Pfaff <blp@nicira.com>
There's no indication that "date" is optional in the description of
ovsdb_file_txn_from_json(), and the one caller always passes it in, so
don't bother checking whether it exists.
Coverity #10732
This new function saves reading the whole database when only the schema is
of interest. This commit adapts ovsdb-tool to use it for the "db-version"
command. Upcoming commits will introduce another caller.
Adding a macro to define the vlog module in use adds a level of
indirection, which makes it easier to change how the vlog module must be
defined. A followup commit needs to do that, so getting these widespread
changes out of the way first should make that commit easier to review.
Most of the timekeeping needs of OVS are simply to measure intervals,
which means that it is sensitive to changes in the clock. This commit
replaces the existing clocks with monotonic timers. An additional set
of wall clock timers are added and used in locations that need absolute
time.
Bug #1858
If the database grows fairly large, and we've written a fair number of
transactions to it, and it's been a while since the database was compacted,
then (after the next commit) compact the database.
Also, compact the database online if the "ovsdb-server/compact" command is
issued via unixctl. I suspect that this feature will rarely if ever be
used in practice, but it's easier to test than compacting automatically.
Bug #2391.
This is in preparation for exposing ovsdb_file to clients outside this
translation unit. These clients don't care that the ovsdb_file is an
ovsdb replica--that's an implementation detail--and so it makes sense to
rename it from their point of view.
This is just a search-and-replace plus reindenting where appropriate.
Until now, each part of a transaction commit that is interested in whether
a column's value has changed has had to do a comparison of the old and new
values itself. There can be several interested parties per commit
(generally one for file storage and one for each remove OVSDB connection),
so this seems like too much redundancy. This commit adds a bitmap
to struct ovsdb_txn_row that tracks whether a column's value has actually
changed, to reduce this overhead.
As a convenient side effect of doing these checks up front, it then
becomes easily possible to drop txn_rows (and txn_tables and entire txns)
that become no-ops. (This probably fixes bug #2400, which reported that
some no-ops actually report updates over monitors.)
The current callers of ovsdb_log_open() always want to lock the file if
they are accessing it for read/write access. An upcoming commit will add
a new caller that does not fit this model (it wants to lock the file
across a wider region) and so the caller should be able to choose whether
to do locking. This commit adds that ability.
Also, get rid of the use of <fcntl.h> flags to choose the open mode, which
has always seemed somewhat crude and which this change would make even
cruder.
When a new record is inserted into a database, ovsdb logs the values of all
of the fields in the record. However, often new records have many columns
that contain default values. There is no need to log those values, so this
commit causes them to be omitted.
As a side effect, this also makes "ovsdb-tool show-log --more --more"
output easier to read, because record insertions print less noise. (Adding
--more --more to this command makes it print changes to database records.
The --more option will be introduced in an upcoming commit.)