/* Copyright (c) 2009, 2010, 2011, 2016, 2017 Nicira, Inc.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at:
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
#ifndef OVSDB_STORAGE_H
#define OVSDB_STORAGE_H 1

#include <stdbool.h>
#include <stdint.h>
#include <sys/types.h>
#include "compiler.h"

struct json;
struct ovsdb_schema;
struct ovsdb_storage;
struct simap;
struct uuid;

struct ovsdb_error *ovsdb_storage_open(const char *filename, bool rw,
                                       struct ovsdb_storage **)
    OVS_WARN_UNUSED_RESULT;
struct ovsdb_storage *ovsdb_storage_create_unbacked(const char *name);
void ovsdb_storage_close(struct ovsdb_storage *);
const char *ovsdb_storage_get_model(const struct ovsdb_storage *);
bool ovsdb_storage_is_clustered(const struct ovsdb_storage *);
bool ovsdb_storage_is_connected(const struct ovsdb_storage *);
bool ovsdb_storage_is_dead(const struct ovsdb_storage *);
bool ovsdb_storage_is_leader(const struct ovsdb_storage *);
const struct uuid *ovsdb_storage_get_cid(const struct ovsdb_storage *);
const struct uuid *ovsdb_storage_get_sid(const struct ovsdb_storage *);
uint64_t ovsdb_storage_get_applied_index(const struct ovsdb_storage *);
void ovsdb_storage_get_memory_usage(const struct ovsdb_storage *,
                                    struct simap *usage);
char *ovsdb_storage_get_error(const struct ovsdb_storage *);
void ovsdb_storage_run(struct ovsdb_storage *);
void ovsdb_storage_wait(struct ovsdb_storage *);
const char *ovsdb_storage_get_name(const struct ovsdb_storage *);
struct ovsdb_error *ovsdb_storage_read(struct ovsdb_storage *,
                                       struct ovsdb_schema **schemap,
                                       struct json **txnp,
                                       struct uuid *txnid)
    OVS_WARN_UNUSED_RESULT;
bool ovsdb_storage_read_wait(struct ovsdb_storage *);
void ovsdb_storage_unread(struct ovsdb_storage *);
struct ovsdb_write *ovsdb_storage_write(struct ovsdb_storage *,
                                        const struct json *,
                                        const struct uuid *prereq,
                                        struct uuid *result,
                                        bool durable)
    OVS_WARN_UNUSED_RESULT;
struct ovsdb_error *ovsdb_storage_write_block(struct ovsdb_storage *,
                                              const struct json *,
                                              const struct uuid *prereq,
                                              struct uuid *result,
                                              bool durable);
bool ovsdb_write_is_complete(const struct ovsdb_write *);
const struct ovsdb_error *ovsdb_write_get_error(const struct ovsdb_write *);
uint64_t ovsdb_write_get_commit_index(const struct ovsdb_write *);
void ovsdb_write_wait(const struct ovsdb_write *);
void ovsdb_write_destroy(struct ovsdb_write *);
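The `ovsdb_write` accessors above form an asynchronous completion protocol: a write is pending until the storage layer finishes it, after which the error (or NULL for success) can be read. The following self-contained miniature sketches that protocol; every `mini_` name is invented for illustration, and `mini_complete()` stands in for the completion that `ovsdb_storage_run()` drives in the real code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct mini_error {
    const char *msg;
};

struct mini_write {
    bool complete;
    const struct mini_error *error;   /* Meaningful only once complete. */
};

/* Marks the write as finished; 'error' is NULL on success. */
static void
mini_complete(struct mini_write *w, const struct mini_error *error)
{
    w->complete = true;
    w->error = error;
}

static bool
mini_write_is_complete(const struct mini_write *w)
{
    return w->complete;
}

/* Valid only after mini_write_is_complete() returns true. */
static const struct mini_error *
mini_write_get_error(const struct mini_write *w)
{
    assert(w->complete);
    return w->error;
}
```

A caller would poll `mini_write_is_complete()` between event-loop iterations, then fetch the error and destroy the write, mirroring the `ovsdb_write_is_complete()` / `ovsdb_write_get_error()` / `ovsdb_write_destroy()` sequence declared above.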
bool ovsdb_storage_should_snapshot(struct ovsdb_storage *);
struct ovsdb_error *ovsdb_storage_store_snapshot(struct ovsdb_storage *storage,
                                                 const struct json *schema,
                                                 const struct json *snapshot,
                                                 uint64_t applied_index)
    OVS_WARN_UNUSED_RESULT;
struct ovsdb_write *ovsdb_storage_write_schema_change(
    struct ovsdb_storage *,
    const struct ovsdb_schema *, const struct json *data,
    const struct uuid *prereq, struct uuid *result)
    OVS_WARN_UNUSED_RESULT;

/* Convenience functions for ovsdb-tool and other command-line utilities,
 * for use with standalone database files only, which terminate the process
 * on error. */
struct ovsdb_storage *ovsdb_storage_open_standalone(const char *filename,
                                                    bool rw);
struct ovsdb_schema *ovsdb_storage_read_schema(struct ovsdb_storage *);

/* Checks that there is a chance for a record with specified prerequisites
 * to be successfully written to the storage. */
bool ovsdb_storage_precheck_prereq(const struct ovsdb_storage *,
                                   const struct uuid *prereq);
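The precheck above exists because a record whose prerequisite no longer matches the storage's latest entry id cannot possibly commit, so proposing it only wastes CPU on guaranteed retries. The miniature below sketches that comparison under stated assumptions: the real code compares `struct uuid` entry ids inside the raft module, while every `mini_` name here is invented for illustration.

```c
#include <stdbool.h>
#include <string.h>

struct mini_eid {
    unsigned char bytes[16];   /* Stand-in for a 128-bit uuid. */
};

struct mini_storage {
    struct mini_eid current_eid;   /* eid of the latest appended record. */
};

/* Returns true if a record with the given prerequisite still has a
 * chance to commit, i.e. the prerequisite matches the current eid. */
static bool
mini_precheck_prereq(const struct mini_storage *s,
                     const struct mini_eid *prereq)
{
    /* Having no prerequisite at all is always acceptable. */
    return !prereq || !memcmp(&s->current_eid, prereq, sizeof *prereq);
}
```

A caller would run this check before constructing and sending a command, dropping stale transactions early instead of letting the leader reject them one by one.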
#endif /* ovsdb/storage.h */