The internal implementation of the JSON array will be changed in future
commits.  Add access functions that users can rely on instead of
accessing the internals of 'struct json' directly and convert all the
users.  Structure fields are intentionally renamed to make sure that
no code is using the old fields directly.
The json_array() function is removed, as it is not needed anymore.  New
functions are added: json_array_size(), json_array_at(), json_array_set()
and json_array_pop().  These are enough to cover all the use cases
within OVS.
The change is fairly large; however, IMO, it's a long-overdue cleanup
that we need even without changing the underlying implementation.
Acked-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Every transaction has RAFT log prerequisites, even if transactions
are not related (because RAFT doesn't actually know what data it is
handling).  When the leader writes a new record to the RAFT storage, it
is appended to the log right away and changes the current 'eid',
i.e., changes the prerequisites.  The leader will not try to write new
records until the current one is committed, because until then the
pre-check will keep failing.
However, that is different for the follower. Followers do not add
records to the RAFT log until the leader sends an append request back.
So, if there are multiple transactions pending on a follower, it will
create a command for each of them and prerequisites will be set to the
same values. All these commands will be sent to the leader, but only
one can succeed at a time, because accepting one command immediately
changes prerequisites and all other commands become non-applicable.
So, out of N commands, 1 will succeed and N - 1 will fail. The cluster
failure is a transient failure, so the follower will re-process all the
failed transactions and send them again. 1 will succeed and N - 2 will
fail. And so on, until there are no more transactions. In the end,
instead of processing N transactions, the follower is performing
N * (N - 1) / 2 transaction processing iterations. That is consuming
a huge amount of CPU resources completely unnecessarily.
Since there is no real chance for multiple transactions from the same
follower to succeed, it's better to not send them in the first place.
This also eliminates prerequisite mismatch messages on a leader in
this particular case.
In a test with 30 parallel shell threads executing 12K transactions in
total with separate ovsdb-client calls through the same follower, there
is about a 60% performance improvement.  The test takes ~100 seconds to
complete without this change and ~40 seconds with this change applied.
The new time is very close to what it takes to execute the same test
through the cluster leader. The test can be found at the link below.
Note: prerequisite failures on a leader are still possible, but mostly
in a case of simultaneous transactions from different followers. It's
a normal thing for a distributed database due to its nature.
Link: https://mail.openvswitch.org/pipermail/ovs-dev/2024-June/415167.html
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Clustered databases do not support ephemeral columns, but ovsdb-server
checks for them after the conversion result is read from the storage.
It's much easier to recover if this constraint is checked before writing
to the storage instead.
It's not a big problem, because the check is always performed by the
native ovsdb clients before sending a conversion request. But the
server, in general, should not trust clients to do the right thing.
The check in update_schema() remains, because we shouldn't blindly
trust the storage.
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Conversion of the database data into JSON object, serialization
and destruction of that object are the most heavy operations
during the database compaction. If these operations are moved
to a separate thread, the main thread can continue processing
database requests in the meantime.
With this change, the compaction is split into 3 phases:

1. Initialization:
   - Create a copy of the database.
   - Remember the current database index.
   - Start a separate thread to convert the copy of the database
     into a serialized JSON object.
2. Wait:
   - Continue normal operation until the compaction thread is done.
   - Meanwhile, the compaction thread:
     * Converts the database copy to JSON.
     * Serializes the resulting JSON.
     * Destroys the original JSON object.
3. Finish:
   - Destroy the database copy.
   - Take the snapshot created by the thread.
   - Write it to disk.
The key to making this scheme fast is the ability to create a shallow
copy of the database.  This doesn't take much time, allowing the
thread to do most of the work.
Database copy is created and destroyed only by the main thread,
so there is no need for synchronization.
This solution reduces the time the main thread is blocked by
compaction by 80-90%.  For example, in ovn-heater tests with a
120-node density-heavy scenario, where compaction normally takes
5-6 seconds at the end of a test, measured compaction times were
all below 1 second with the change applied.  Also, note that these
measured times are the sum of phases 1 and 3, so actual poll
intervals are about half a second in this case.
This is only implemented for raft storage for now.  The implementation
for standalone databases can be added later by using a file offset
as the database index and copying newly added changes from the old
file to the new one during ovsdb_log_replace().
Reported-at: https://bugzilla.redhat.com/2069108
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Snapshots are scheduled every 10-20 minutes; it's a random value in
this interval for each server.  Once that time is up, but the maximum
time (24 hours) is not reached yet, ovsdb will start checking on every
iteration whether the log grew a lot.  Once such growth is detected,
compaction is triggered.
OTOH, it's very common for an OVSDB cluster to not have the log growing
very fast.  If the log didn't grow 2x in 20 minutes, the randomness of
the initially scheduled time is gone and all the servers are checking
whether they need to create a snapshot on every iteration.  And since
all of them are part of the same cluster, their logs are growing at the
same speed.  Once the critical mass is reached, all the servers will
start creating snapshots at the same time.  If the database is big
enough, that might leave the cluster unresponsive for an extended
period of time (e.g., 10-15 seconds for the OVN_Southbound database in
a larger-scale OVN deployment) until the compaction completes.
Fix that by re-scheduling a quick retry if the minimal time already
passed. Effectively, this will work as a randomized 1-2 min delay
between checks, so the servers will not synchronize.
The scheduling function is updated to not change the upper limit on
quick re-schedules, to avoid delaying the snapshot creation
indefinitely.  Currently, quick re-schedules are only used for error
cases, and there is always a 'slow' re-schedule after a successful
compaction.  So, this change to the scheduling function doesn't change
the current behavior much.
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Han Zhou <hzhou@ovn.org>
Acked-by: Dumitru Ceara <dceara@redhat.com>
Raft log entries (and the raft database snapshot) contain json objects
with the data.  A follower receives append requests with data that gets
parsed and added to the raft log.  The leader receives execution
requests, parses data out of them and adds it to the log.  In both
cases, ovsdb-server later reads the log with ovsdb_storage_read(),
constructs a transaction and updates the database.  On followers, these
json objects are, in the common case, never used again.  The leader may
use them to send append requests or snapshot installation requests to
followers.
However, all these operations (except for ovsdb_storage_read()) are
just serializing the json in order to send it over the network.
Json objects are significantly larger than their serialized string
representation. For example, the snapshot of the database from one of
the ovn-heater scale tests takes 270 MB as a string, but 1.6 GB as
a json object from the total 3.8 GB consumed by ovsdb-server process.
ovsdb_storage_read() for a given raft entry happens only once in a
lifetime, so after this call, we can serialize the json object, store
the string representation and free the actual json object that ovsdb
will never need again.  This can save a lot of memory and can also
save serialization time, because each raft entry for append requests
and snapshot installation requests is serialized only once instead of
doing that every time such a request needs to be sent.
JSON_SERIALIZED_OBJECT can be used in order to seamlessly integrate
pre-serialized data into raft_header and similar json objects.
One major special case is creation of a database snapshot.
Snapshot installation request received over the network will be parsed
and read by ovsdb-server just like any other raft log entry. However,
snapshots created locally with raft_store_snapshot() will never be
read back, because they reflect the current state of the database,
hence already applied.  For this case, we can free the json object
right after writing the snapshot to disk.
Tests performed with ovn-heater on a 60-node density-light scenario,
where the on-disk database goes up to 97 MB, show average memory
consumption of ovsdb-server Southbound DB processes decreased by 58%
(from 602 MB to 256 MB per process) and peak memory consumption
decreased by 40% (from 1288 MB to 771 MB).
Test with 120 nodes on density-heavy scenario with 270 MB on-disk
database shows 1.5 GB memory consumption decrease as expected.
Also, total CPU time consumed by the Southbound DB process reduced
from 296 to 256 minutes. Number of unreasonably long poll intervals
reduced from 2896 down to 1934.
Acked-by: Dumitru Ceara <dceara@redhat.com>
Acked-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
ovsdb_create() requires a schema or a storage to be nonnull; in
practice, it requires a schema name or a storage name to use as the
database name.  Only clustered storage has a name.  This means that
only a clustered database can be created without a schema.  Change
that by allowing unbacked storage to have a name.  This way we can
create a database with unbacked storage and without a schema.  This
will be used in upcoming commits to create databases for the ovsdb
'relay' service model.
Acked-by: Mark D. Gray <mark.d.gray@redhat.com>
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
With a big database, writing a snapshot could take a lot of time.  For
example, on one of the systems, compaction of a 300 MB database takes
about 10 seconds to complete.  For a clustered database, 40% of this
time is taken by conversion of the database to the file transaction
json format; the rest of the time goes to formatting a string and
writing to disk.  Of course, this highly depends on disk and CPU
speeds.  300 MB is a very possible database size for the OVN
Southbound DB, and it might be even bigger than that.
During compaction the database is not available and ovsdb-server
doesn't do any other tasks.  If the leader spends 10-15 seconds writing
a snapshot, the cluster is not functional for that time period.  The
leader also likely has some monitors to serve, so a single poll
interval may end up being 15-20 seconds long.  Systems with such big
databases typically have very high election timers configured
(16 seconds), so followers will start an election only after this
significant amount of time.  Once the old leader is back to an
operational state, it will re-connect and try to join the cluster
again.  In some cases, this might also trigger 'connected' state
flapping on the old leader, triggering re-connection of clients.  This
issue has been observed in large-scale OVN deployments.
One of the methods to improve the situation is to transfer the
leadership before compacting.  This allows the cluster to remain
functional while one of the servers writes a snapshot.
Additionally, log the time spent on compaction if it was longer
than 1 second.  This adds a bit of visibility into 'unreasonably long
poll interval's.
Reported-at: https://bugzilla.redhat.com/1960391
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Dumitru Ceara <dceara@redhat.com>
If a database enters an error state, e.g., in the RAFT case, when
applying the RAFT records read from the DB file triggers constraint
violations, there's no way to determine this unless a client generates
a write transaction.  Such write transactions would fail with
"ovsdb-error: inconsistent data".
This commit adds a new command to show the status of the storage that's
backing a database.
Example, on an inconsistent database:
$ ovs-appctl -t /tmp/test.ctl ovsdb-server/get-db-storage-status DB
status: ovsdb error: inconsistent data
Example, on a consistent database:
$ ovs-appctl -t /tmp/test.ctl ovsdb-server/get-db-storage-status DB
status: ok
Signed-off-by: Dumitru Ceara <dceara@redhat.com>
Acked-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Memory reports can be found in the logs or obtained by calling the
'memory/show' appctl command.  For ovsdb-server, the report includes
information about db cells, monitor connections with their backlog
size, etc.  But it doesn't contain any information about memory
consumed by raft.  Backlogs of raft connections can be insanely large,
because snapshot installation requests simply contain the whole
database.  In not-that-healthy clusters, where one of the ovsdb servers
is not able to handle all the incoming raft traffic in time, the
backlog on the sender's side can cause significant memory consumption
issues.
Adding new 'raft-connections' and 'raft-backlog' counters to the
memory report to better track such conditions.
Acked-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
In the current OVSDB Raft design, when there are multiple transactions
pending, either from the same server node or from different nodes in
the cluster, only the first one can succeed at a time.  The following
ones will fail the prerequisite check on the leader node, because the
first one updates the expected prerequisite eid on the leader, while
the prerequisite used for proposing a commit has to be a committed
eid.  So, it is not possible for a node to use the latest prerequisite
expected by the leader to propose a commit until the latest transaction
is committed by the leader and the committed_index is updated on the
node.
The current implementation proposes the commit as soon as the
transaction is requested by the client, which results in continuous
retries that cause high CPU load and waste.
Particularly, even if all clients are using leader_only to connect
only to the leader, the prereq check failure still happens a lot when
a batch of transactions is pending on the leader node: the leader node
proposes a batch of commits using the same committed eid as the
prerequisite, and it updates the expected prereq as soon as the first
one is in progress, but it needs time to append to followers and wait
until the majority replies to update the committed_index, which results
in continuous, useless retries of the following transactions proposed
by the leader itself.
This patch doesn't change the design but simply pre-checks whether the
current eid is the same as the prereq before proposing the commit, to
avoid wasting CPU cycles, on both leaders and followers.  When clients
use leader_only mode, this patch completely eliminates the prereq check
failures.
In scale test of OVN with 1k HVs and creating and binding 10k lports,
the patch resulted in 90% CPU cost reduction on leader and >80% CPU cost
reduction on followers. (The test was with leader election base time
set to 10000ms, because otherwise the test couldn't complete because
of the frequent leader re-election.)
This is just one of the related performance problems of the prereq
checking mechanism discussed at:
https://mail.openvswitch.org/pipermail/ovs-discuss/2019-February/048243.html
Signed-off-by: Han Zhou <hzhou8@ebay.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
Several OVS structs contain embedded named unions, like this:
struct {
    ...
    union {
        ...
    } u;
};
C11 standardized a feature that many compilers already implemented
anyway, where an embedded union may be unnamed, like this:
struct {
    ...
    union {
        ...
    };
};
This is more convenient because it allows the programmer to omit "u."
in many places. OVS already used this feature in several places. This
commit embraces it in several others.
Signed-off-by: Ben Pfaff <blp@ovn.org>
Acked-by: Justin Pettit <jpettit@ovn.org>
Tested-by: Alin Gabriel Serdean <aserdean@ovn.org>
Acked-by: Alin Gabriel Serdean <aserdean@ovn.org>
This commit adds support for OVSDB clustering via Raft. Please read
ovsdb(7) for information on how to set up a clustered database. It is
simple and boils down to running "ovsdb-tool create-cluster" on one server
and "ovsdb-tool join-cluster" on each of the others and then starting
ovsdb-server in the usual way on all of them.
Once you have a clustered database, you configure ovn-controller and
ovn-northd to use it by pointing them to all of the servers, e.g. where
previously you might have said "tcp:1.2.3.4" was the database server,
now you say that it is "tcp:1.2.3.4,tcp:5.6.7.8,tcp:9.10.11.12".
This also adds support for database clustering to ovs-sandbox.
Acked-by: Justin Pettit <jpettit@ovn.org>
Tested-by: aginwala <aginwala@asu.edu>
Signed-off-by: Ben Pfaff <blp@ovn.org>