mirror of https://gitlab.isc.org/isc-projects/bind9 synced 2025-08-31 06:25:31 +00:00

Refactor qp-trie to use QSBR

The first working multi-threaded qp-trie was stuck with an unpleasant
trade-off:

  * Use `isc_rwlock`, which has acceptable write performance, but
    terrible read scalability because the qp-trie made all accesses
    through a single lock.

  * Use `liburcu`, which has great read scalability, but terrible
    write performance, because I was relying on `synchronize_rcu()`,
    which is rather slow. And `liburcu` is LGPL.

To get the best of both worlds, we need our own scalable read side,
which we now have with `isc_qsbr`. And we need to modify the write
side so that it is not blocked by readers.

Better write performance requires an asynchronous cleanup function
like `call_rcu()`, instead of the blocking `synchronize_rcu()`.
(There is no blocking cleanup in `isc_qsbr`, because I have concluded
that it would be an attractive nuisance.)
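The difference can be sketched with a toy deferral queue (illustrative only; this is NOT the `liburcu` or `isc_qsbr` API, just the shape of the idea):

```c
#include <stddef.h>

/* Toy illustration of asynchronous cleanup in the style of
 * call_rcu(): the writer queues a callback and returns immediately,
 * instead of blocking like synchronize_rcu() until all readers
 * have finished. */
typedef struct cleanup {
	struct cleanup *next;
	void (*func)(void *arg);
	void *arg;
} cleanup_t;

static cleanup_t *pending = NULL;
static int cleaned = 0;

/* Writer side: O(1), never blocks on readers. */
static void
defer_cleanup(cleanup_t *c, void (*func)(void *), void *arg) {
	c->func = func;
	c->arg = arg;
	c->next = pending;
	pending = c;
}

/* Run by the reclamation machinery once a grace period has passed,
 * i.e. once every reader has gone through a quiescent state. */
static void
run_cleanups(void) {
	while (pending != NULL) {
		cleanup_t *c = pending;
		pending = c->next;
		c->func(c->arg);
	}
}

/* Example callback that just counts invocations. */
static void
count_cleanup(void *arg) {
	(void)arg;
	cleaned++;
}
```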

Until now, all my multithreaded qp-trie designs have been based
around two versions, read-only and mutable. This is too few to
work with asynchronous cleanup. The bare minimum (as in epoch
based reclamation) is three, but it makes more sense to support an
arbitrary number. Doing multi-version support "properly" makes
fewer assumptions about how safe memory reclamation works, and it
makes snapshots and rollbacks simpler.

To avoid making the memory management even more complicated, I
have introduced a new kind of "packed reader node" to anchor the
root of a version of the trie. This is simpler because it re-uses
the existing chunk lifetime logic - see the discussion under
"packed reader nodes" in `qp_p.h`.

I have also made the chunk lifetime logic simpler. The idea of a
"generation" is gone; instead, chunks are either mutable or
immutable. And the QSBR phase number is used to indicate when a
chunk can be reclaimed.
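Schematically, the chunk lifetime rule looks like this (hypothetical names and a simplified grace-period test, not the real `isc_qsbr` interface):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sketch: chunks are either mutable (exclusive to the
 * writer) or immutable (possibly shared with readers). An immutable
 * chunk retired in QSBR phase P can be freed once the global phase
 * has advanced far enough that no reader can still hold a reference
 * into it. */
typedef struct chunk_state {
	bool mutable;	/* still exclusive to the writer? */
	uint32_t phase; /* QSBR phase when the chunk was retired */
} chunk_state_t;

static bool
chunk_reclaimable(const chunk_state_t *c, uint32_t global_phase) {
	/* Two phase transitions guarantee a grace period has elapsed;
	 * the counter is monotonic, so unsigned wraparound is safe. */
	return !c->mutable && (global_phase - c->phase) >= 2;
}
```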

Instead of the `shared_base` flag (which was basically a one-bit
reference count, with a two-version limit) the base array now has a
refcount, which replaces the confusing ad-hoc lifetime logic with
something more familiar and systematic.
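For illustration, a refcounted base array might look like this (a hypothetical `base_array_t` standing in for the real `dns_qpbase_t`, with a plain counter instead of atomics):

```c
#include <stdlib.h>

/* Hypothetical sketch of a reference-counted chunk `base` array.
 * Every trie version (or packed reader) that shares the array holds
 * one reference; the array is freed when the last reference is
 * dropped. */
typedef struct base_array {
	unsigned refcount;
	void **chunks; /* pointers to the start of each chunk */
} base_array_t;

static base_array_t *
base_new(size_t slots) {
	base_array_t *b = malloc(sizeof(*b));
	b->refcount = 1;
	b->chunks = calloc(slots, sizeof(void *));
	return b;
}

static base_array_t *
base_attach(base_array_t *b) {
	b->refcount++;
	return b;
}

static void
base_detach(base_array_t *b) {
	if (--b->refcount == 0) {
		free(b->chunks);
		free(b);
	}
}
```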
This commit is contained in:
Tony Finch
2022-12-22 14:55:14 +00:00
parent 549854f63b
commit 4b5ec07bb7
13 changed files with 2040 additions and 1049 deletions


@@ -1,3 +1,8 @@
+6117. [func] Add a qp-trie data structure. This is a foundation for
+       our plan to replace, in stages, BIND's red-black tree.
+       The qp-trie has lock-free multithreaded reads, using
+       QSBR for safe memory reclamation. [GL !7130]
 6116. [placeholder]
 6115. [bug] Unregister db update notify callback before detaching


@@ -362,7 +362,7 @@ one 64 bit word and one 32-bit word.
 A branch node contains
-* a branch/leaf tag bit
+* two type tag bits
 * a 47-wide bitmap, with a bit for each common hostname character
   and each escape character
@@ -374,8 +374,8 @@ A branch node contains
 these references are described in more detail below
 A leaf node contains a pointer value (which we assume to be 64 bits)
-and a 32-bit integer value. The branch/leaf tag is smuggled into the
-low-order bit of the pointer value, so the pointer value must have
+and a 32-bit integer value. The type tag is smuggled into the
+low-order bits of the pointer value, so the pointer value must have
 large enough alignment. (This requirement is checked when a leaf is
 added to the trie.) Apart from that, the meaning of leaf values
 is entirely under control of the qp-trie user.
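The tag-smuggling scheme above can be sketched as follows (the tag values match the enum in `lib/dns/qp_p.h`, but the helper functions are hypothetical illustrations):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative sketch of smuggling a two-bit type tag into the
 * low-order bits of an aligned pointer value. */
enum { LEAF_TAG = 0, BRANCH_TAG = 1, READER_TAG = 2, TAG_MASK = 3 };

static uint64_t
pack_leaf(void *pval) {
	uint64_t word = (uint64_t)(uintptr_t)pval;
	/* The pointer must have at least 4-byte alignment so that
	 * the two tag bits are known to be zero. */
	assert((word & TAG_MASK) == 0);
	return word | LEAF_TAG;
}

static unsigned
node_tag(uint64_t word) {
	return (unsigned)(word & TAG_MASK);
}

static void *
leaf_pointer(uint64_t word) {
	return (void *)(uintptr_t)(word & ~(uint64_t)TAG_MASK);
}
```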
@@ -478,8 +478,8 @@ labels. This is slightly different from the root node, which tested the
 first character of the label; here we are testing the last character.
-memory management for concurrency
----------------------------------
+concurrency and transactions
+----------------------------
 The following sections discuss how the qp-trie supports concurrency.
@@ -487,12 +487,32 @@ The requirement is to support many concurrent read threads, and
 allow updates to occur without blocking readers (or blocking readers
 as little as possible).
+Concurrent access to a qp-trie uses a transactional API. There can be
+at most one writer at a time. When a writer commits its transaction
+(by atomically replacing the trie's root pointer) the changes become
+visible to readers. Read transactions ensure that memory is not
+reclaimed while readers are still using it.
+If there are relatively long read transactions and brief write
+transactions (though that is unlikely) there can be multiple versions
+of a qp-trie in use at a time.
+copy-on-write
+-------------
 The strategy is to use "copy-on-write", that is, when an update
 needs to alter the trie it makes a copy of the parts that it needs
 to change, so that concurrent readers can continue to use the
 original. (It is analogous to multiversion concurrency in databases
 such as PostgreSQL, where copy-on-write uses a write-ahead log.)
+The qp-trie only uses copy-on-write when the nodes that need to be
+altered can be shared with concurrent readers. After copying, the
+nodes are exclusive to the writer and can be updated in place. This
+reduces the pressure on the allocator a lot: pure copy-on-write
+allocates and discards memory at a ferocious rate.
 Software that uses copy-on-write needs some mechanism for clearing
 away old versions that are no longer in use. (For example, VACUUM in
 PostgreSQL.) The qp-trie code uses a custom allocator with a simple
@@ -567,27 +587,114 @@ garbage collector. Reference counting for value objects is handled
 by the `attach()` and `detach()` qp-trie methods.
-memory layout
--------------
+chunked memory layout
+---------------------
-BIND's qp-trie code organizes its memory as a collection of "chunks",
-each of which is a few pages in size and large enough to hold a few
-thousand nodes.
+BIND's qp-trie code organizes its memory as a collection of "chunks"
+allocated by `malloc()`, each of which is a few pages in size and
+large enough to hold a thousand nodes or so.
+Most memory management is per-chunk: obtaining memory from the
+system allocator and returning it; keeping track of which chunks are
+in use by readers, and which chunks can be mutated; and counting
+whether chunks are fragmented enough to need garbage collection.
 As noted above, we also use the chunk-based layout to reduce the size
 of interior nodes. Instead of using a native pointer (typically 64
 bits) to refer to a node, we use a 32 bit integer containing the chunk
 number and the position of the node in the chunk. This reduces the
-memory used by interior nodes by 25%.
+memory used for interior nodes by 25%. See the "helper types" section
+in `lib/dns/qp_p.h` for the relevant definitions.
+BIND stores each zone separately, and there can be a very large number
+of zones in a server. To avoid wasting memory on small zones that only
+have a few names, chunks can be "shrunk" using `realloc()` to fit just
+the nodes that have been allocated.
+chunk metadata
+--------------
+The chunked memory layout is supported by a `base` array of pointers
+to the start of each chunk. A chunk number is just an index into this
+array.
+Alongside the `base` array is a `usage` array, indexed the same way.
+Instead of keeping track of individual nodes, the allocator just keeps
+a count of how many nodes have been allocated from a chunk, and how
+many were subsequently freed. The `used` count of the newest chunk
+also serves as the allocation point for the bump allocator, and the
+size of the chunk when it has been shrunk. This is why we increment
+the `free` count when a node is discarded, instead of decrementing the
+`used` count. The `usage` array also contains some fields used for
+chunk reclamation, about which more below.
+The `base` and `usage` arrays are separate because the `usage` array
+is only used by writers, and never shared with readers. The read-only
+hot path only needs the `base` array, so keeping it separate is more
+cache-friendly: less memory pressure on the read path and less
+interference from false sharing with write ops.
+Both arrays can have empty slots in which new chunks can be allocated;
+when a chunk is reclaimed its slot becomes empty. Additions and
+removals from the `base` array don't affect readers: they will not see
+a reference to a new chunk until after the writer commits, and the
+chunk reclamation machinery ensures that no readers depend on a chunk
+before it is deleted.
+When the arrays fill up they are reallocated. This is easy for the
+`usage` array because it is only accessed by writers, but the `base`
+array must be cloned, and the old version must be reclaimed later
+after it is no longer used by readers. For this reason the `base`
+array has a reference count.
+lightweight write transactions
+------------------------------
+"Write" transactions are intended for use when there is a heavy write
+load, such as a resolver cache. They minimize the amount of allocation
+by re-using the same chunk for the bump allocator across multiple
+transactions until it fills up.
+When a write (or update) is committed, a new packed read-only trie
+anchor is created. This contains a pointer to the `base` array and a
+32-bit reference to the trie's root node. The packed reader is stored
+in a pair of nodes in the current chunk, allocated by the bump
+allocator, so it does not need to be `malloc()`ed separately, and so
+the chunk reclamation machinery can also reclaim the `base` array when
+it is no longer in use.
+heavyweight update transactions
+-------------------------------
+By contrast, "update" transactions are intended to keep memory usage
+as low as possible between writes. On commit, the trie is compacted,
+and the bump allocator's chunk is shrunk to fit. When a transaction is
+opened, a fresh chunk must be allocated.
+Update transactions also support rollback, which requires making a
+copy of all the chunk metadata.
+lightweight query transactions
+------------------------------
+A "query" transaction dereferences a pointer to the current trie
+anchor and unpacks it into a `dns_qpread_t` object on the stack. There
+is no explicit interlocking with writers. Instead, query transactions
+must only be used inside an `isc_loop` callback function; the qp-trie
+memory reclamation machinery knows that the reader has completed when
+the callback returns to the loop. See `include/isc/qsbr.h` for more
+about how this works.
+heavyweight read-only snapshots
+-------------------------------
+A "snapshot" is for things like zone transfers that need a long-lived
+consistent view of a zone. When a snapshot is created, it includes a
+copy of the necessary parts of the `base` array. A qp-trie keeps a
+list of its snapshots, and there are flags in the `usage` array to
+mark which chunks are in use by snapshots and therefore cannot be
+reclaimed.
+In `lib/dns/qp_p.h`, the _"main qp-trie structures"_ hold information
+about a trie's chunks. Most of the chunk handling code is in the
+_"allocator"_ and _"chunk reclamation"_ sections in `lib/dns/qp.c`.
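The `used`/`free` accounting described under "chunk metadata" can be sketched like this (schematic, with hypothetical names and an arbitrary chunk size):

```c
#include <stdbool.h>
#include <stdint.h>

#define CHUNK_SIZE 1024 /* nodes per chunk; illustrative only */

/* Schematic per-chunk usage record, as in the `usage` array. */
typedef struct chunk_usage {
	uint32_t used; /* nodes handed out; bump-allocation point */
	uint32_t free; /* nodes discarded since allocation */
} chunk_usage_t;

/* Bump-allocate `count` contiguous nodes, returning the position of
 * the first node, or UINT32_MAX if the chunk is full. */
static uint32_t
chunk_alloc(chunk_usage_t *u, uint32_t count) {
	if (u->used + count > CHUNK_SIZE) {
		return UINT32_MAX;
	}
	uint32_t pos = u->used;
	u->used += count;
	return pos;
}

/* Discarding nodes increments `free` rather than decrementing
 * `used`, because `used` doubles as the bump-allocation point. */
static void
chunk_discard(chunk_usage_t *u, uint32_t count) {
	u->free += count;
}

/* A chunk is empty, and a candidate for reclamation, when everything
 * allocated from it has since been freed. */
static bool
chunk_empty(const chunk_usage_t *u) {
	return u->used == u->free;
}
```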
 lifecycle of value objects
@@ -609,103 +716,23 @@ adding special lookup functions that return whether leaf objects are
 mutable - see the "todo" in `include/dns/qp.h`.
-locking and RCU
----------------
+chunk cleanup
+-------------
-The Linux kernel has a collection of copy-on-write schemes collectively
-called read-copy-update; there is also https://liburcu.org/ for RCU in
-userspace. RCU is attractively speedy: readers can proceed without
-blocking at all; writers can proceed concurrently with readers, and
-updates can be committed without blocking. A commit is just a single
-atomic pointer update. RCU only requires writers to block when waiting
-for a "grace period" while older readers complete their critical
-sections, after which the writer can free memory that is no longer in
-use. Writers must also block on a mutex to ensure there is only one
-writer at a time.
+After a "write" or "update" transaction has committed, there can be a
+number of chunks that are no longer needed by the latest version of
+the trie, but still in use by readers accessing an older version.
+The qp-trie uses a QSBR callback to clean up chunks when they are no
+longer used at all.
-The qp-trie concurrency strategy is designed to be able to use RCU, but
-RCU is not required. Instead of RCU we can use a reader-writer lock.
-This requires readers to block when a writer commits, which (in RCU
-style) just requires an atomic pointer swap. The rwlock also changes
-when writers must block: commits must wait for readers to exit their
-critical sections, but there is no further waiting to be able to release
-memory.
+When reclaiming a chunk, we have to scan it for any remaining leaf
+nodes. When nodes are accessible only to the writer, they are zeroed
+out when they are freed. If they are shared with readers, they must be
+left in place (though the `free` count in the usage array is still
+adjusted), and finally `detach()`ed when the chunk is reclaimed.
-In BIND, there are two kinds of reader: queries, which are relatively
-quick, and zone transfers, which are relatively slow. BIND's dbversion
-machinery allows updates to proceed while there are long-running zone
-transfers. RCU supports this without further machinery, but a
-reader-writer lock needs some help so that long-running readers can
-avoid blocking writers.
+This chunk scan also cleans up old `base` arrays referred to by packed
+reader nodes.
-To avoid blocking updates, long-running readers can take a snapshot of a
-qp-trie, which only requires copying the allocator's chunk array. After
-a writer commits, it does not release memory if there are any
-snapshots. Instead, chunks that are no longer needed by the latest
-version of the trie are stashed on a list to be released later,
-analogous to RCU waiting for a grace period.
-The locking occurs only in the functions under _"read-write
-transactions"_ and _"read-only transactions"_ in `lib/dns/qp.c`.
-immutability and copy-on-write
-------------------------------
-A qp-trie has a `generation` counter which is incremented by each
-write transaction. We keep track of which generation each chunk was
-created in; only chunks created in the current generation are
-mutable, because older chunks may be in use by concurrent readers.
-This logic is implemented by `chunk_alloc()` and `chunk_mutable()`
-in `lib/dns/qp.c`.
-The `make_twigs_mutable()` function ensures that a node is mutable,
-copying it if necessary.
-The chunk arrays are a mixture of mutable and immutable. Pointers to
-immutable chunks are immutable; new chunks can be assigned to unused
-entries; and entries are cleared when it is safe to reclaim the chunks
-they refer to. If the chunk arrays need to be expanded, the existing
-arrays are retained for use by readers, and the writer uses the
-expanded arrays (see `alloc_slow()`). The old arrays are cleaned up
-after the writer commits.
-update transactions
--------------------
-A typical heavy-weight `update` transaction comprises:
-* make a copy of the chunk arrays in case we need to roll back
-* get a freshly allocated chunk where new nodes or copied nodes
-  can be written
-* make any changes that are required; nodes in old chunks are
-  copied to the new space first; new nodes are modified in place
-  to avoid creating unnecessary garbage
-* when the updates are finished, and before committing, run the
-  garbage collector to clear out chunks that were fragmented by the
-  update
-* shrink the allocation chunk to eliminate unused space
-* commit the update by flipping the root pointer of the trie; this
-  is the only point that needs a multithreading interlock
-* free any chunks that were emptied by the garbage collector
-A lightweight `write` transaction is similar, except that:
-* rollback is not supported
-* any existing allocation chunk is reused if possible
-* the garbage collector is not run before committing
-* the allocation chunk is not shrunk
 testing strategies


@@ -6,6 +6,7 @@ AM_CFLAGS += \
 AM_CPPFLAGS += \
 	$(LIBISC_CFLAGS) \
 	$(LIBDNS_CFLAGS) \
+	$(LIBUV_CFLAGS) \
 	-DFUZZDIR=\"$(abs_srcdir)\"
 AM_LDFLAGS += \


@@ -16,6 +16,7 @@
 #include <stdbool.h>
 #include <stdint.h>
+#include <isc/qsbr.h>
 #include <isc/random.h>
 #include <isc/refcount.h>
 #include <isc/rwlock.h>


@@ -42,30 +42,24 @@
 * value can be freed after it is no longer needed by readers using an old
 * version of the trie.
 *
-* For fast concurrent reads, call `dns_qpmulti_query()` to get a
-* `dns_qpread_t`. Readers can access a single version of the trie between
-* write commits. Most write activity is not blocked by readers, but reads
-* must finish before a write can commit (a read-write lock blocks
-* commits).
+* For fast concurrent reads, call `dns_qpmulti_query()` to fill in a
+* `dns_qpread_t`, which must be allocated on the stack. Readers can
+* access a single version of the trie within the scope of an isc_loop
+* thread (NOT an isc_work thread). We rely on the loop to bound the
+* lifetime of a `dns_qpread_t`, instead of using locks. Readers are
+* not blocked by any write activity, and vice versa.
 *
-* For long-running reads that need a stable view of the trie, while still
-* allow commits to proceed, call `dns_qpmulti_snapshot()` to get a
-* `dns_qpsnap_t`. It briefly gets the write mutex while creating the
-* snapshot, which requires allocating a copy of some of the trie's
-* metadata. A snapshot is for relatively heavy long-running read-only
-* operations such as zone transfers.
+* For reads that need a stable view of the trie for multiple cycles
+* of an isc_loop, or which can be used from any thread, call
+* `dns_qpmulti_snapshot()` to get a `dns_qpsnap_t`. A snapshot is for
+* relatively heavy long-running read-only operations such as zone
+* transfers.
 *
-* While snapshots exist, a qp-trie cannot reclaim memory: it does not
-* retain detailed information about which memory is used by which
-* snapshots, so it pessimistically retains all memory that might be
-* used by old versions of the trie.
-*
 * You can start one read-write transaction at a time using
 * `dns_qpmulti_write()` or `dns_qpmulti_update()`. Either way, you
 * get a `dns_qp_t` that can be modified like a single-threaded trie,
 * without affecting other read-only query or snapshot users of the
-* `dns_qpmulti_t`. Committing a transaction only blocks readers
-* briefly when flipping the active readonly `dns_qp_t` pointer.
+* `dns_qpmulti_t`.
 *
 * "Update" transactions are heavyweight. They allocate working memory to
 * hold modifications to the trie, and compact the trie before committing.
@@ -96,34 +90,68 @@
 typedef struct dns_qp dns_qp_t;
 /*%
-* A `dns_qpmulti_t` supports multi-version concurrent reads and transactional
-* modification.
+* A `dns_qpmulti_t` supports multi-version wait-free concurrent reads
+* and one transactional modification at a time.
 */
 typedef struct dns_qpmulti dns_qpmulti_t;
 /*%
-* A `dns_qpread_t` is a lightweight read-only handle on a `dns_qpmulti_t`.
+* Read-only parts of a qp-trie.
+*
+* A `dns_qpreader_t` is the common prefix of the `dns_qpreadable`
+* types, containing just the fields needed for the hot path.
+*
+* Ranty aside: annoyingly, C doesn't allow us to use a predeclared
+* structure type as an anonymous struct member, so we have to use a
+* macro. (GCC and Clang have the feature we want under -fms-extensions,
+* but a non-standard extension won't make these declarations neater if
+* we must also have a standard alternative.)
 */
-typedef struct dns_qpread dns_qpread_t;
+#define DNS_QPREADER_FIELDS \
+	uint32_t magic; \
+	uint32_t root_ref; \
+	dns_qpbase_t *base; \
+	void *uctx; \
+	const struct dns_qpmethods *methods
+typedef struct dns_qpbase dns_qpbase_t; /* private */
+typedef struct dns_qpreader {
+	DNS_QPREADER_FIELDS;
+} dns_qpreader_t;
 /*%
-* A `dns_qpsnap_t` is a heavier read-only snapshot of a `dns_qpmulti_t`.
+* A `dns_qpread_t` is a read-only handle on a `dns_qpmulti_t`.
+* The caller provides space for it on the stack; it can be
+* used by only one thread. As well as the `DNS_QPREADER_FIELDS`,
+* it contains a thread ID to check for incorrect usage.
+*/
+typedef struct dns_qpread {
+	DNS_QPREADER_FIELDS;
+	uint32_t tid;
+} dns_qpread_t;
+/*%
+* A `dns_qpsnap_t` is a read-only snapshot of a `dns_qpmulti_t`.
+* It requires allocation and taking the `dns_qpmulti_t` mutex to
+* create; it can be used from any thread.
 */
 typedef struct dns_qpsnap dns_qpsnap_t;
-/*
+/*%
 * The read-only qp-trie functions can work on either of the read-only
-* qp-trie types or the general-purpose read-write `dns_qp_t`. They
-* relies on the fact that all the `dns_qpreadable_t` structures start
-* with a `dns_qpread_t`.
+* qp-trie types dns_qpsnap_t or dns_qpread_t, or the general-purpose
+* read-write `dns_qp_t`. They rely on the fact that all the
+* `dns_qpreadable_t` structures start with a `dns_qpreader_t`
 */
 typedef union dns_qpreadable {
+	dns_qpreader_t *qp;
 	dns_qpread_t *qpr;
 	dns_qpsnap_t *qps;
 	dns_qp_t *qpt;
 } dns_qpreadable_t __attribute__((__transparent_union__));
-#define dns_qpreadable_cast(qp) ((qp).qpr)
+#define dns_qpreader(qpr) ((qpr).qp)
 /*%
 * A trie lookup key is a small array, allocated on the stack during trie
@@ -136,9 +164,6 @@ typedef union dns_qpreadable {
 * common hostname character; otherwise unusual characters are escaped,
 * using two bytes in the key. So we allow keys to be up to 512 bytes.
 * (The actual max is (255 - 5) * 2 + 6 == 506)
-*
-* Every byte of a key must be greater than 0 and less than 48. Elements
-* after the end of the key are treated as having the value 1.
 */
 typedef uint8_t dns_qpkey_t[512];
@@ -154,7 +179,9 @@ typedef uint8_t dns_qpkey_t[512];
 *
 * The `attach` and `detach` methods adjust reference counts on value
 * objects. They support copy-on-write and safe memory reclamation
-* needed for multi-version concurrency.
+* needed for multi-version concurrency. The methods are only called
+* when the `dns_qpmulti_t` mutex is held, so they only need to use
+* atomic ops if the refcounts are used by code other than the qp-trie.
 *
 * Note: When a value object reference count is greater than one, the
 * object is in use by concurrent readers so it must not be modified. A
@@ -237,15 +264,17 @@ dns_qp_destroy(dns_qp_t **qptp);
 */
 void
-dns_qpmulti_create(isc_mem_t *mctx, const dns_qpmethods_t *methods, void *uctx,
+dns_qpmulti_create(isc_mem_t *mctx, isc_loopmgr_t *loopmgr,
+		   const dns_qpmethods_t *methods, void *uctx,
 		   dns_qpmulti_t **qpmp);
 /*%<
 * Create a multi-threaded qp-trie.
 *
 * Requires:
 * \li `mctx` is a pointer to a valid memory context.
-* \li all the methods are non-NULL
+* \li 'loopmgr' is a valid loop manager.
 * \li `qpmp != NULL && *qpmp == NULL`
+* \li all the methods are non-NULL
 *
 * Ensures:
 * \li `*qpmp` is a pointer to a valid multi-threaded qp-trie
@@ -400,7 +429,7 @@ dns_qp_insert(dns_qp_t *qp, void *pval, uint32_t ival);
 * Requires:
 * \li `qp` is a pointer to a valid qp-trie
 * \li `pval != NULL`
-* \li `alignof(pval) > 1`
+* \li `alignof(pval) >= 4`
 *
 * Returns:
 * \li ISC_R_EXISTS if the trie already has a leaf with the same key
@@ -440,34 +469,32 @@ dns_qp_deletename(dns_qp_t *qp, const dns_name_t *name);
 */
 void
-dns_qpmulti_query(dns_qpmulti_t *multi, dns_qpread_t **qprp);
+dns_qpmulti_query(dns_qpmulti_t *multi, dns_qpread_t *qpr);
 /*%<
 * Start a lightweight (brief) read-only transaction
 *
-* This takes a read lock on `multi`s rwlock that prevents
-* transactions from committing.
+* The `dns_qpmulti_query()` function must be called from an isc_loop
+* thread and its 'qpr' argument must be allocated on the stack.
 *
 * Requires:
 * \li `multi` is a pointer to a valid multi-threaded qp-trie
-* \li `qprp != NULL`
-* \li `*qprp == NULL`
+* \li `qpr != NULL`
 *
 * Returns:
-* \li `*qprp` is a pointer to a valid read-only qp-trie handle
+* \li `qpr` is a valid read-only qp-trie handle
 */
 void
-dns_qpread_destroy(dns_qpmulti_t *multi, dns_qpread_t **qprp);
+dns_qpread_destroy(dns_qpmulti_t *multi, dns_qpread_t *qpr);
 /*%<
-* End a lightweight read transaction, i.e. release read lock
+* End a lightweight read transaction.
 *
 * Requires:
 * \li `multi` is a pointer to a valid multi-threaded qp-trie
-* \li `qprp != NULL`
-* \li `*qprp` is a read-only qp-trie handle obtained from `multi`
+* \li `qpr` is a read-only qp-trie handle obtained from `multi`
 *
 * Returns:
-* \li `*qprp == NULL`
+* \li `qpr` is invalidated
 */
 void
@@ -478,7 +505,7 @@ dns_qpmulti_snapshot(dns_qpmulti_t *multi, dns_qpsnap_t **qpsp);
 * This function briefly takes and releases the modification mutex
 * while allocating a copy of the trie's metadata. While the snapshot
 * exists it does not interfere with other read-only or read-write
-* transactions on the trie, except that memory cannot be reclaimed.
+* transactions on the trie.
 *
 * Requires:
 * \li `multi` is a pointer to a valid multi-threaded qp-trie
@@ -494,10 +521,6 @@ dns_qpsnap_destroy(dns_qpmulti_t *multi, dns_qpsnap_t **qpsp);
 /*%<
 * End a heavyweight read transaction
 *
-* If this is the last remaining snapshot belonging to `multi` then
-* this function takes the modification mutex in order to free() any
-* memory that is no longer in use.
-*
 * Requires:
 * \li `multi` is a pointer to a valid multi-threaded qp-trie
 * \li `qpsp != NULL`
@@ -538,6 +561,12 @@ dns_qpmulti_write(dns_qpmulti_t *multi, dns_qp_t **qptp);
 * for a large trie that gets frequent small writes, such as a DNS
 * cache.
 *
+* A sequence of lightweight write transactions can accumulate
+* garbage that the automatic compact/recycle cannot reclaim.
+* To reclaim this space, you can use the `dns_qp_memusage`
+* fragmented flag to trigger a call to dns_qp_compact(), or you
+* can use occasional update transactions to compact the trie.
+*
 * During the transaction, the modification mutex is held.
 *
 * Requires:
@@ -554,10 +583,9 @@ dns_qpmulti_commit(dns_qpmulti_t *multi, dns_qp_t **qptp);
 /*%<
 * Complete a modification transaction
 *
-* The commit itself only requires flipping the read pointer inside
-* `multi` from the old version of the trie to the new version. This
-* function takes a write lock on `multi`s rwlock just long enough to
-* flip the pointer. This briefly blocks `query` readers.
+* Apart from memory management logistics, the commit itself only
+* requires flipping the read pointer inside `multi` from the old
+* version of the trie to the new version. Readers are not blocked.
 *
 * This function releases the modification mutex after the post-commit
 * memory reclamation is completed.

(File diff suppressed because it is too large.)


@@ -13,6 +13,8 @@
 /*
 * For an overview, see doc/design/qp-trie.md
+*
+* This private header defines the internal data structures,
 */
 #pragma once
@@ -23,12 +25,15 @@
  */
 
 /*
- * A qp-trie node can be a leaf or a branch. It consists of three 32-bit
- * words into which the components are packed. They are used as a 64-bit
- * word and a 32-bit word, but they are not declared like that to avoid
- * unwanted padding, keeping the size down to 12 bytes. They are in native
- * endian order so getting the 64-bit part should compile down to an
- * unaligned load.
+ * A qp-trie node is normally either a branch or a leaf. It consists of
+ * three 32-bit words into which the components are packed. They are used
+ * as a 64-bit word and a 32-bit word, but they are not declared like that
+ * to avoid unwanted padding, keeping the size down to 12 bytes. They are
+ * in native endian order so getting the 64-bit part should compile down to
+ * an unaligned load.
+ *
+ * The type of node is identified by the tag in the least significant bits
+ * of the 64-bit word.
  *
  * In a branch the 64-bit word is described by the enum below. The 32-bit
  * word is a reference to the packed sparse vector of "twigs", i.e. child
@@ -37,8 +42,12 @@
  * actually branch, i.e. branches cannot have only 1 child.
  *
  * The contents of each leaf are set by the trie's user. The 64-bit word
- * contains a pointer value (which must be word-aligned), and the 32-bit
- * word is an arbitrary integer value.
+ * contains a pointer value (which must be word-aligned, so the tag bits
+ * are zero), and the 32-bit word is an arbitrary integer value.
+ *
+ * There is a third kind of node, reader nodes, which anchor the root of a
+ * trie. A pair of reader nodes together contain a packed `dns_qpreader_t`.
+ * See the section on "packed reader nodes" below.
  */
 typedef struct qp_node {
 #if WORDS_BIGENDIAN
@@ -49,16 +58,36 @@ typedef struct qp_node {
 } qp_node_t;
 
 /*
- * A branch node contains a 64-bit word comprising the branch/leaf tag,
- * the bitmap, and an offset into the key. It is called an "index word"
- * because it describes how to access the twigs vector (think "database
- * index"). The following enum sets up the bit positions of these parts.
+ * The possible values of the node type tag. Type tags must fit in two bits
+ * for compatibility with 4-byte pointer alignment on 32-bit systems.
+ */
+enum {
+	LEAF_TAG = 0,	/* leaf node */
+	BRANCH_TAG = 1, /* branch node */
+	READER_TAG = 2, /* reader node */
+	TAG_MASK = 3,	/* mask covering tag bits */
+};
+
+/*
+ * This code does not work on CPUs with large pointers, e.g. CHERI capability
+ * architectures. When porting to that kind of machine, a `dns_qpnode` should
+ * be just a `uintptr_t`; a leaf node will contain a single pointer, and a
+ * branch node will fit in the same space with room to spare.
+ */
+STATIC_ASSERT(sizeof(void *) <= sizeof(uint64_t),
+	      "pointers must fit in 64 bits");
+
+/*
+ * A branch node contains a 64-bit word comprising the type tag, the
+ * bitmap, and an offset into the key. It is called an "index word" because
+ * it describes how to access the twigs vector (think "database index").
+ * The following enum sets up the bit positions of these parts.
  *
  * In a leaf, the same 64-bit word contains a pointer. The pointer
  * must be word-aligned so that the branch/leaf tag bit is zero.
  * This requirement is checked by the newleaf() constructor.
  *
- * The bitmap is just above the tag bit. The `bits_for_byte[]` table is
+ * The bitmap is just above the type tag. The `bits_for_byte[]` table is
  * used to fill in a key so that bit tests can work directly against the
  * index word without superfluous masking or shifting; we don't need to
  * mask out the bitmap before testing a bit, but we do need to mask the
@@ -70,24 +99,17 @@ typedef struct qp_node {
  * The names are SHIFT_thing because they are qp_shift_t values. (See
  * below for the various `qp_*` type declarations.)
  *
- * These values are relatively fixed in practice; the symbolic names
- * avoid mystery numbers in the code.
+ * These values are relatively fixed in practice: SHIFT_NOBYTE needs
+ * to leave space for the type tag, and the implementation of
+ * `dns_qpkey_fromname()` depends on the bitmap being large enough.
+ * The symbolic names avoid mystery numbers in the code.
  */
 enum {
-	SHIFT_BRANCH = 0,  /* branch / leaf tag */
-	SHIFT_NOBYTE,	   /* label separator has no byte value */
+	SHIFT_NOBYTE = 2,  /* label separator has no byte value */
 	SHIFT_BITMAP,	   /* many bits here */
-	SHIFT_OFFSET = 48, /* offset of byte in key */
+	SHIFT_OFFSET = 49, /* offset of byte in key */
 };
-
-/*
- * Value of the node type tag bit.
- *
- * It is defined this way to be explicit about where the value comes
- * from, even though we know it is always the bottom bit.
- */
-#define BRANCH_TAG (1ULL << SHIFT_BRANCH)
 
 /***********************************************************************
  *
  * garbage collector tuning parameters
@@ -123,7 +145,13 @@ STATIC_ASSERT(6 <= QP_CHUNK_LOG && QP_CHUNK_LOG <= 20,
 #define QP_CHUNK_BYTES (QP_CHUNK_SIZE * sizeof(qp_node_t))
 
 /*
- * A chunk needs to be compacted if it has fragmented this much.
+ * We need a bitfield this big to count how much of a chunk is in use:
+ * it needs to count from 0 up to and including `1 << QP_CHUNK_LOG`.
+ */
+#define QP_USAGE_BITS (QP_CHUNK_LOG + 1)
+
+/*
+ * A chunk needs to be compacted if it is less full than this threshold.
  * (12% overhead seems reasonable)
  */
 #define QP_MAX_FREE (QP_CHUNK_SIZE / 8)
@@ -221,93 +249,198 @@ ref_cell(qp_ref_t ref) {
 	return (ref % QP_CHUNK_SIZE);
 }
 
+/*
+ * We should not use the `root_ref` in an empty trie, so we set it
+ * to a value that should trigger an obvious bug. See qp_init()
+ * and get_root() below.
+ */
+#define INVALID_REF ((qp_ref_t)~0UL)
+
+/***********************************************************************
+ *
+ *  chunk arrays
+ */
+
+/*
+ * A `dns_qp_t` contains two arrays holding information about each chunk.
+ *
+ * The `base` array holds pointers to the base of each chunk.
+ * The `usage` array holds the allocator's state for each chunk.
+ *
+ * The `base` array is used by the hot qp-trie traversal paths. It can
+ * be shared by multiple versions of a trie, which are tracked with a
+ * refcount. Old versions of the trie can retain old versions of the
+ * `base` array.
+ *
+ * In multithreaded code, the `usage` array is only used when the
+ * `dns_qpmulti_t` mutex is held, and there is only one version of
+ * it in active use (maybe with a snapshot for rollback support).
+ *
+ * The two arrays are separate because they have rather different
+ * access patterns, different lifetimes, and different element sizes.
+ */
+/*
+ * For most purposes we don't need to know exactly which cells are
+ * in use in a chunk, we only need to know how many of them there are.
+ *
+ * After we have finished allocating from a chunk, the `used` counter
+ * is the size we need to know for shrinking the chunk and for
+ * scanning it to detach leaf values before the chunk is free()d. The
+ * `free` counter tells us when the chunk needs compacting and when it
+ * has become empty.
+ *
+ * The `exists` flag allows the chunk scanning loops to look at the
+ * usage array only.
+ *
+ * In multithreaded code, we mark chunks as `immutable` when a modify
+ * transaction is opened. (We don't mark them immutable on commit,
+ * because the old bump chunk must remain mutable between write
+ * transactions, but it must become immutable when an update
+ * transaction is opened.)
+ *
+ * When a chunk becomes empty (wrt the latest version of the trie), we
+ * note the QSBR phase after which no old versions of the trie will
+ * need the chunk and it will be safe to free(). There are a few flags
+ * used to mark which chunks are still needed by snapshots after the
+ * chunks have passed their normal reclamation phase.
+ */
+typedef struct qp_usage {
+	/*% the allocation point, increases monotonically */
+	qp_cell_t used : QP_USAGE_BITS;
+	/*% count of nodes no longer needed, also monotonic */
+	qp_cell_t free : QP_USAGE_BITS;
+	/*% qp->base->ptr[chunk] != NULL */
+	bool exists : 1;
+	/*% is this chunk shared? [MT] */
+	bool immutable : 1;
+	/*% is a snapshot using this chunk? [MT] */
+	bool snapshot : 1;
+	/*% tried to free it but a snapshot needs it [MT] */
+	bool snapfree : 1;
+	/*% for mark/sweep snapshot flag updates [MT] */
+	bool snapmark : 1;
+	/*% in which phase did this chunk become unused? [MT] */
+	isc_qsbr_phase_t phase : ISC_QSBR_PHASE_BITS;
+} qp_usage_t;
+/*
+ * The chunks are owned by the current version of the `base` array.
+ * When the array is resized, the old version might still be in use by
+ * concurrent readers, in which case it is free()d later when its
+ * refcount drops to zero.
+ *
+ * A `dns_qpbase_t` counts references from `dns_qp_t` objects and
+ * from packed readers, but not from `dns_qpread_t` nor from
+ * `dns_qpsnap_t` objects. Refcount adjustments for `dns_qpread_t`
+ * would wreck multicore scalability; instead we rely on QSBR.
+ *
+ * The `usage` array determines when a chunk is no longer needed: old
+ * chunk pointers in old `base` arrays are ignored. (They can become
+ * dangling pointers to free memory, but they will never be
+ * dereferenced.)
+ *
+ * We ensure that individual chunk base pointers remain immutable
+ * after assignment, and they are not cleared until the chunk is
+ * free()d, after all readers have departed. Slots can be reused, and
+ * we allow transactions to fill or re-fill empty slots adjacent to
+ * busy slots that are in use by readers.
+ */
+struct dns_qpbase {
+	isc_refcount_t refcount;
+	qp_node_t *ptr[];
+};
+
+/*
+ * Returns true when the base array can be free()d.
+ */
+static inline bool
+qpbase_unref(dns_qpreadable_t qpr) {
+	dns_qpreader_t *qp = dns_qpreader(qpr);
+	return (qp->base != NULL &&
+		isc_refcount_decrement(&qp->base->refcount) == 1);
+}
+
+/*
+ * Now we know about `dns_qpreader_t` and `dns_qpbase_t`,
+ * here's how we convert a twig reference into a pointer.
+ */
+static inline qp_node_t *
+ref_ptr(dns_qpreadable_t qpr, qp_ref_t ref) {
+	dns_qpreader_t *qp = dns_qpreader(qpr);
+	return (qp->base->ptr[ref_chunk(ref)] + ref_cell(ref));
+}
 /***********************************************************************
  *
  * main qp-trie structures
  */
 
 #define QP_MAGIC       ISC_MAGIC('t', 'r', 'i', 'e')
-#define QP_VALID(qp)   ISC_MAGIC_VALID(qp, QP_MAGIC)
+#define QPMULTI_MAGIC  ISC_MAGIC('q', 'p', 'm', 'v')
+#define QPREADER_MAGIC ISC_MAGIC('q', 'p', 'r', 'x')
+
+#define QP_VALID(qp)	  ISC_MAGIC_VALID(qp, QP_MAGIC)
+#define QPMULTI_VALID(qp) ISC_MAGIC_VALID(qp, QPMULTI_MAGIC)
 /*
- * This is annoying: C doesn't allow us to use a predeclared structure as
- * an anonymous struct member, so we have to fart around. The feature we
- * want is available in GCC and Clang with -fms-extensions, but a
- * non-standard extension won't make these declarations neater if we must
- * also have a standard alternative.
+ * Polymorphic initialization of the `dns_qpreader_t` prefix.
+ *
+ * The location of the root node is actually a qp_ref_t, but is
+ * declared in DNS_QPREADER_FIELDS as uint32_t to avoid leaking too
+ * many internal details into the public API.
+ *
+ * The `uctx` and `methods` support callbacks into the user's code.
+ * They are constant after initialization.
  */
+#define QP_INIT(qp, m, x)                 \
+	(*(qp) = (typeof(*(qp))){         \
+		 .magic = QP_MAGIC,       \
+		 .root_ref = INVALID_REF, \
+		 .uctx = x,               \
+		 .methods = m,            \
+	 })
 /*
- * Lightweight read-only access to a qp-trie.
- *
- * Just the fields needed for the hot path. The `base` field points
- * to an array containing pointers to the base of each chunk like
- * `qp->base[chunk]` - see `refptr()` below.
- *
- * A `dns_qpread_t` has a lifetime that does not extend across multiple
- * write transactions, so it can share a chunk `base` array belonging to
- * the `dns_qpmulti_t` it came from.
- *
- * We're lucky with the layout on 64 bit systems: this is only 40 bytes,
- * with no padding.
- */
-#define DNS_QPREAD_COMMON \
-	uint32_t magic;   \
-	qp_node_t root;   \
-	qp_node_t **base; \
-	void *uctx;       \
-	const dns_qpmethods_t *methods
-
-struct dns_qpread {
-	DNS_QPREAD_COMMON;
-};
-
-/*
- * Heavyweight read-only snapshots of a qp-trie.
- *
- * Unlike a lightweight `dns_qpread_t`, a snapshot can survive across
- * multiple write transactions, any of which may need to expand the
- * chunk `base` array. So a `dns_qpsnap_t` keeps its own copy of the
- * array, which will always be equal to some prefix of the expanded
- * arrays in the `dns_qpmulti_t` that it came from.
- *
- * The `dns_qpmulti_t` keeps a refcount of its snapshots, and while
- * the refcount is non-zero, chunks are not freed or reused. When a
- * `dns_qpsnap_t` is destroyed, if it decrements the refcount to zero,
- * it can do any deferred cleanup.
- *
- * The generation number is used for tracing.
+ * Snapshots have some extra cleanup machinery.
+ *
+ * Originally, a snapshot was basically just a `dns_qpread_t`
+ * allocated on the heap, with the extra behaviour that memory
+ * reclamation is suppressed for a particular trie while it has any
+ * snapshots. However that design gets into trouble for a zone with
+ * frequent updates and many zone transfers.
+ *
+ * Instead, each snapshot records which chunks it needs. When a
+ * snapshot is created, it makes a copy of the `base` array, except
+ * for chunks that are empty and waiting to be reclaimed. When a
+ * snapshot is destroyed, we can traverse the list of snapshots to
+ * accurately mark which chunks are still needed.
+ *
+ * A snapshot's `whence` pointer helps ensure that a `dns_qpsnap_t` is
+ * not muddled up with the wrong `dns_qpmulti_t`.
+ *
+ * A trie's `base` array might have grown after the snapshot was
+ * created, so it records its own `chunk_max`.
  */
 struct dns_qpsnap {
-	DNS_QPREAD_COMMON;
-	uint32_t generation;
+	DNS_QPREADER_FIELDS;
 	dns_qpmulti_t *whence;
-	qp_node_t *base_array[];
+	uint32_t chunk_max;
+	ISC_LINK(struct dns_qpsnap) link;
 };
 /*
  * Read-write access to a qp-trie requires extra fields to support the
  * allocator and garbage collector.
  *
+ * The chunk `base` and `usage` arrays are separate because the `usage`
+ * array is only needed for allocation, so it is kept separate from the
+ * data needed by the read-only hot path. The arrays have empty slots where
+ * new chunks can be placed, so `chunk_max` is the maximum number of chunks
+ * (until the arrays are resized).
+ *
  * Bare instances of a `struct dns_qp` are used for stand-alone
- * single-threaded tries. For multithreaded access, transactions alternate
- * between the `phase` pair of dns_qp objects inside a dns_qpmulti.
+ * single-threaded tries. For multithreaded access, a `dns_qpmulti_t`
+ * wraps a `dns_qp_t` with a mutex and other fields that are only needed
+ * at the start or end of a transaction.
  *
- * For multithreaded access, the `generation` counter allows us to know
- * which chunks are writable or not: writable chunks were allocated in the
- * current generation. For single-threaded access, the generation counter
- * is always zero, so all chunks are considered to be writable.
- *
- * Allocations are made sequentially in the `bump` chunk. Lightweight write
- * transactions can re-use the `bump` chunk, so its prefix before `fender`
- * is immutable, and the rest is mutable even though its generation number
- * does not match the current generation.
+ * Allocations are made sequentially in the `bump` chunk. A sequence
+ * of lightweight write transactions can use the same `bump` chunk, so
+ * its prefix before `fender` is immutable, and the rest is mutable.
  *
  * To decide when to compact and reclaim space, QP_MAX_GARBAGE() examines
  * the values of `used_count`, `free_count`, and `hold_count`. The
@@ -332,39 +465,25 @@ struct dns_qpsnap {
  * normal compaction failed to clear the QP_MAX_GARBAGE() condition.
  * (This emergency is a bug even tho we have a rescue mechanism.)
  *
- * - The `shared_arrays` flag indicates that the chunk `base` and `usage`
- *   arrays are shared by both `phase`s in this trie's `dns_qpmulti_t`.
- *   This allows us to delay allocating copies of the arrays during a
- *   write transaction, until we definitely need to resize them.
+ * - When a qp-trie is destroyed while it has pending cleanup work, its
+ *   `destroy` flag is set so that it is destroyed by the reclaim worker.
+ *   (Because items cannot be removed from the middle of the cleanup list.)
  *
  * - When built with fuzzing support, we can use mprotect() and munmap()
  *   to ensure that incorrect memory accesses cause fatal errors. The
  *   `write_protect` flag must be set straight after the `dns_qpmulti_t`
  *   is created, then left unchanged.
  *
- * Some of the dns_qp_t fields are only used for multithreaded transactions
+ * Some of the dns_qp_t fields are only needed for multithreaded transactions
  * (marked [MT] below) but the same code paths are also used for single-
- * threaded writes. To reduce the size of a dns_qp_t, these fields could
- * perhaps be moved into the dns_qpmulti_t, but that would require some kind
- * of conditional runtime downcast from dns_qp_t to dns_multi_t, which is
- * likely to be ugly. It is probably best to keep things simple if most tries
- * need multithreaded access (XXXFANF do they? e.g. when there are many auth
- * zones),
+ * threaded writes.
  */
 struct dns_qp {
-	DNS_QPREAD_COMMON;
+	DNS_QPREADER_FIELDS;
+	/*% memory context (const) */
 	isc_mem_t *mctx;
 	/*% array of per-chunk allocation counters */
-	struct {
-		/*% the allocation point, increases monotonically */
-		qp_cell_t used;
-		/*% count of nodes no longer needed, also monotonic */
-		qp_cell_t free;
-		/*% when was this chunk allocated? */
-		uint32_t generation;
-	} *usage;
-	/*% transaction counter [MT] */
-	uint32_t generation;
+	qp_usage_t *usage;
 	/*% number of slots in `chunk` and `usage` arrays */
 	qp_chunk_t chunk_max;
 	/*% which chunk is used for allocations */
@@ -375,14 +494,14 @@ struct dns_qp {
 	qp_cell_t leaf_count;
 	/*% total of all usage[] counters */
 	qp_cell_t used_count, free_count;
-	/*% cells that cannot be recovered right now */
+	/*% free cells that cannot be recovered right now */
 	qp_cell_t hold_count;
 	/*% what kind of transaction was most recently started [MT] */
 	enum { QP_NONE, QP_WRITE, QP_UPDATE } transaction_mode : 2;
 	/*% compact the entire trie [MT] */
 	bool compact_all : 1;
-	/*% chunk arrays are shared with a readonly qp-trie [MT] */
-	bool shared_arrays : 1;
+	/*% destroy the trie asynchronously [MT] */
+	bool destroy : 1;
 	/*% optionally when compiled with fuzzing support [MT] */
 	bool write_protect : 1;
 };
@@ -390,45 +509,60 @@ struct dns_qp {
 
 /*
  * Concurrent access to a qp-trie.
  *
- * The `read` pointer is used for read queries. It points to one of the
- * `phase` elements. During a transaction, the other `phase` (see
- * `write_phase()` below) is modified incrementally in copy-on-write
- * style. On commit the `read` pointer is swapped to the altered phase.
+ * The `reader` pointer provides wait-free access to the current version
+ * of the trie. See the "packed reader nodes" section below for a
+ * description of what it points to.
+ *
+ * We need access to the loopmgr to hook into QSBR safe memory reclamation.
+ * It is constant after initialization.
+ *
+ * The main object under the protection of the mutex is the `writer`
+ * containing all the allocator state. There can be a backup copy when
+ * we want to be able to rollback an update transaction.
+ *
+ * There is a `reader_ref` which corresponds to the `reader` pointer
+ * (`ref_ptr(multi->reader_ref) == multi->reader`). The `reader_ref` is
+ * necessary when freeing the space used by the reader, because there
+ * isn't a good way to recover a qp_ref_t from a qp_node_t pointer.
+ *
+ * There is a per-trie list of snapshots that is used for reclaiming
+ * memory when a snapshot is destroyed.
+ *
+ * Finally, we maintain a global list of `dns_qpmulti_t` objects that
+ * need asynchronous safe memory recovery.
  */
 struct dns_qpmulti {
 	uint32_t magic;
-	/*% controls access to the `read` pointer and its target phase */
-	isc_rwlock_t rwlock;
-	/*% points to phase[r] and swaps on commit */
-	dns_qp_t *read;
-	/*% protects the snapshot counter and `write_phase()` */
+	/*% safe memory reclamation context (const) */
+	isc_loopmgr_t *loopmgr;
+	/*% pointer to current packed reader */
+	atomic_ptr(qp_node_t) reader;
+	/*% the mutex protects the rest of this structure */
 	isc_mutex_t mutex;
-	/*% so we know when old chunks are still shared */
-	unsigned int snapshots;
-	/*% one is read-only, one is mutable */
-	dns_qp_t phase[2];
+	/*% ref_ptr(writer, reader_ref) == reader */
+	qp_ref_t reader_ref;
+	/*% the main working structure */
+	dns_qp_t writer;
+	/*% saved allocator state to support rollback */
+	dns_qp_t *rollback;
+	/*% all snapshots of this trie */
+	ISC_LIST(dns_qpsnap_t) snapshots;
+	/*% safe memory reclamation work list */
+	ISC_SLINK(dns_qpmulti_t) cleanup;
 };
-
-/*
- * Get a pointer to the phase that isn't read-only.
- */
-static inline dns_qp_t *
-write_phase(dns_qpmulti_t *multi) {
-	bool read0 = multi->read == &multi->phase[0];
-	return (read0 ? &multi->phase[1] : &multi->phase[0]);
-}
-
-#define QPMULTI_MAGIC	  ISC_MAGIC('q', 'p', 'm', 'v')
-#define QPMULTI_VALID(qp) ISC_MAGIC_VALID(qp, QPMULTI_MAGIC)
 
 /***********************************************************************
  *
  * interior node constructors and accessors
  */
 
 /*
- * See the comments under "interior node basics" above, which explain the
- * layout of nodes as implemented by the following functions.
+ * See the comments under "interior node basics" above, which explain
+ * the layout of nodes as implemented by the following functions.
+ *
+ * These functions are (mostly) constructors and getters. Imagine how
+ * much less code there would be if C had sum types with control over
+ * the layout...
  */
 
 /*
@@ -462,7 +596,24 @@ make_node(uint64_t big, uint32_t small) {
 }
 
 /*
- * Test a node's tag bit.
+ * Extract a pointer from a node's 64 bit word. The double cast is to avoid
+ * a warning about mismatched pointer/integer sizes on 32 bit systems.
+ */
+static inline void *
+node_pointer(qp_node_t *n) {
+	return ((void *)(uintptr_t)(node64(n) & ~TAG_MASK));
+}
+
+/*
+ * Examine a node's tag bits
+ */
+static inline uint32_t
+node_tag(qp_node_t *n) {
+	return (n->biglo & TAG_MASK);
+}
+
+/*
+ * simplified for the hot path
  */
 static inline bool
 is_branch(qp_node_t *n) {
@@ -472,12 +623,11 @@ is_branch(qp_node_t *n) {
 
 /* leaf nodes *********************************************************/
 
 /*
- * Get a leaf's pointer value. The double cast is to avoid a warning
- * about mismatched pointer/integer sizes on 32 bit systems.
+ * Get a leaf's pointer value.
  */
 static inline void *
 leaf_pval(qp_node_t *n) {
-	return ((void *)(uintptr_t)node64(n));
+	return (node_pointer(n));
 }
 
 /*
@@ -494,7 +644,7 @@ leaf_ival(qp_node_t *n) {
 
 static inline qp_node_t
 make_leaf(const void *pval, uint32_t ival) {
 	qp_node_t leaf = make_node((uintptr_t)pval, ival);
-	REQUIRE(!is_branch(&leaf) && pval != NULL);
+	REQUIRE(node_tag(&leaf) == LEAF_TAG);
 	return (leaf);
 }
@@ -551,15 +701,6 @@ branch_keybit(qp_node_t *n, const dns_qpkey_t key, size_t len) {
 	return (qpkey_bit(key, len, branch_key_offset(n)));
 }
 
-/*
- * Convert a twig reference into a pointer.
- */
-static inline qp_node_t *
-ref_ptr(dns_qpreadable_t qpr, qp_ref_t ref) {
-	dns_qpread_t *qp = dns_qpreadable_cast(qpr);
-	return (qp->base[ref_chunk(ref)] + ref_cell(ref));
-}
-
 /*
  * Get a pointer to a branch node's twigs vector.
  */
@@ -576,6 +717,33 @@ prefetch_twigs(dns_qpreadable_t qpr, qp_node_t *n) {
 	__builtin_prefetch(branch_twigs_vector(qpr, n));
 }
 
+/* root node **********************************************************/
+
+/*
+ * Get a pointer to the root node, checking if the trie is empty.
+ */
+static inline qp_node_t *
+get_root(dns_qpreadable_t qpr) {
+	dns_qpreader_t *qp = dns_qpreader(qpr);
+	if (qp->root_ref == INVALID_REF) {
+		return (NULL);
+	} else {
+		return (ref_ptr(qp, qp->root_ref));
+	}
+}
+
+/*
+ * When we need to move the root node, we avoid repeating allocation
+ * logistics by making a temporary fake branch node that has
+ * `branch_twigs_size() == 1 && branch_twigs_ref() == root_ref`
+ * just enough to treat the root node as a vector of one twig.
+ */
+#define MOVABLE_ROOT(qp)                                    \
+	(&(qp_node_t){                                      \
+		 .biglo = BRANCH_TAG | (1 << SHIFT_NOBYTE), \
+		 .small = qp->root_ref,                     \
+	 })
+
 /***********************************************************************
  *
  * bitmap popcount shenanigans
@@ -585,26 +753,26 @@ prefetch_twigs(dns_qpreadable_t qpr, qp_node_t *n) {
  */
 
 /*
  * How many twigs appear in the vector before the one corresponding to the
  * given bit? Calculated using popcount of part of the branch's bitmap.
  *
- * To calculate a mask that covers the lesser bits in the bitmap, we
- * subtract 1 to set the bits, and subtract the branch tag because it
- * is not part of the bitmap.
+ * To calculate a mask that covers the lesser bits in the bitmap,
+ * we subtract 1 to set all lesser bits, and subtract the tag mask
+ * because the type tag is not part of the bitmap.
  */
 static inline qp_weight_t
-branch_twigs_before(qp_node_t *n, qp_shift_t bit) {
-	uint64_t mask = (1ULL << bit) - 1 - BRANCH_TAG;
-	uint64_t bmp = branch_index(n) & mask;
-	return ((qp_weight_t)__builtin_popcountll(bmp));
+branch_count_bitmap_before(qp_node_t *n, qp_shift_t bit) {
+	uint64_t mask = (1ULL << bit) - 1 - TAG_MASK;
+	uint64_t bitmap = branch_index(n) & mask;
+	return ((qp_weight_t)__builtin_popcountll(bitmap));
 }
 
 /*
- * How many twigs does this node have?
+ * How many twigs does this branch have?
  *
  * The offset is directly after the bitmap so the offset's lesser bits
  * covers the whole bitmap, and the bitmap's weight is the number of twigs.
  */
 static inline qp_weight_t
 branch_twigs_size(qp_node_t *n) {
-	return (branch_twigs_before(n, SHIFT_OFFSET));
+	return (branch_count_bitmap_before(n, SHIFT_OFFSET));
 }
 
 /*
@@ -612,7 +780,7 @@ branch_twigs_size(qp_node_t *n) {
  */
 static inline qp_weight_t
 branch_twig_pos(qp_node_t *n, qp_shift_t bit) {
-	return (branch_twigs_before(n, bit));
+	return (branch_count_bitmap_before(n, bit));
 }
 
 /*
@@ -643,6 +811,80 @@ zero_twigs(qp_node_t *twigs, qp_weight_t size) {
 	memset(twigs, 0, size * sizeof(qp_node_t));
 }
 
+/***********************************************************************
+ *
+ *  packed reader nodes
+ */
+
+/*
+ * The purpose of these packed reader nodes is to simplify safe memory
+ * reclamation for a multithreaded qp-trie.
+ *
+ * After the `reader` pointer in a qpmulti is replaced, we need to wait
+ * for a grace period before we can reclaim the memory that is no longer
+ * needed by the trie. So we need some kind of structure to hold
+ * pointers to the (logically) detached memory until it is safe to free.
+ * This memory includes the chunks and the `base` arrays.
+ *
+ * Packed reader nodes save us from having to track `dns_qpread_t`
+ * objects as distinct allocations: the packed reader nodes get
+ * reclaimed when the chunk containing their cells is reclaimed.
+ * When a real `dns_qpread_t` object is needed, it is allocated on the
+ * stack (it must not live longer than an isc_loop callback) and the
+ * packed reader is unpacked into it.
+ *
+ * Chunks are owned by the current `base` array, so unused chunks are
+ * held there until they are free()d. Old `base` arrays are attached
+ * to packed reader nodes with a refcount. When a chunk is reclaimed,
+ * it is scanned so that `chunk_free()` can call `detach_leaf()` on
+ * any remaining references to leaf objects. Similarly, it calls
+ * `qpbase_unref()` to reclaim old `base` arrays.
+ */
+
+/*
+ * Two nodes is just enough space for the information needed by
+ * readers and for deferred memory reclamation.
+ */
+#define READER_SIZE 2
+
+/*
+ * Create a packed reader; space for the reader should have been
+ * allocated using `alloc_twigs(&multi->writer, READER_SIZE)`.
+ */
+static inline void
+make_reader(qp_node_t *reader, dns_qpmulti_t *multi) {
+	dns_qp_t *qp = &multi->writer;
+	reader[0] = make_node(READER_TAG | (uintptr_t)multi, QPREADER_MAGIC);
+	reader[1] = make_node(READER_TAG | (uintptr_t)qp->base, qp->root_ref);
+}
+
+static inline bool
+reader_valid(qp_node_t *reader) {
+	return (reader != NULL && //
+		node_tag(&reader[0]) == READER_TAG &&
+		node_tag(&reader[1]) == READER_TAG &&
+		node32(&reader[0]) == QPREADER_MAGIC);
+}
+
+/*
+ * Verify and unpack a reader. We return the `multi` pointer to use in
+ * consistency checks.
+ */
+static inline dns_qpmulti_t *
+unpack_reader(dns_qpreader_t *qp, qp_node_t *reader) {
+	INSIST(reader_valid(reader));
+	dns_qpmulti_t *multi = node_pointer(&reader[0]);
+	INSIST(QPMULTI_VALID(multi));
+	*qp = (dns_qpreader_t){
+		.magic = QP_MAGIC,
+		.uctx = multi->writer.uctx,
+		.methods = multi->writer.methods,
+		.root_ref = node32(&reader[1]),
+		.base = node_pointer(&reader[1]),
+	};
+	return (multi);
+}
 /***********************************************************************
  *
  * method invocation helpers
 
@@ -650,26 +892,26 @@ zero_twigs(qp_node_t *twigs, qp_weight_t size) {
 
 static inline void
 attach_leaf(dns_qpreadable_t qpr, qp_node_t *n) {
-	dns_qpread_t *qp = dns_qpreadable_cast(qpr);
+	dns_qpreader_t *qp = dns_qpreader(qpr);
 	qp->methods->attach(qp->uctx, leaf_pval(n), leaf_ival(n));
 }
 
 static inline void
 detach_leaf(dns_qpreadable_t qpr, qp_node_t *n) {
-	dns_qpread_t *qp = dns_qpreadable_cast(qpr);
+	dns_qpreader_t *qp = dns_qpreader(qpr);
 	qp->methods->detach(qp->uctx, leaf_pval(n), leaf_ival(n));
 }
 
 static inline size_t
 leaf_qpkey(dns_qpreadable_t qpr, qp_node_t *n, dns_qpkey_t key) {
-	dns_qpread_t *qp = dns_qpreadable_cast(qpr);
+	dns_qpreader_t *qp = dns_qpreader(qpr);
 	return (qp->methods->makekey(key, qp->uctx, leaf_pval(n),
 				     leaf_ival(n)));
 }
 
 static inline char *
 triename(dns_qpreadable_t qpr, char *buf, size_t size) {
-	dns_qpread_t *qp = dns_qpreadable_cast(qpr);
+	dns_qpreader_t *qp = dns_qpreader(qpr);
 	qp->methods->triename(qp->uctx, buf, size);
 	return (buf);
 }

@@ -1,11 +1,14 @@
 include $(top_srcdir)/Makefile.top

+AM_CFLAGS += -Wno-vla
+
 AM_CPPFLAGS += \
 	$(LIBUV_CFLAGS) \
 	$(LIBISC_CFLAGS) \
 	$(LIBDNS_CFLAGS) \
 	-I$(top_srcdir)/fuzz \
 	-I$(top_srcdir)/lib/dns \
+	-I$(top_srcdir)/lib/isc \
 	-I$(top_srcdir)/tests/include

 LDADD += \

@@ -17,6 +17,8 @@
 #include <isc/file.h>
 #include <isc/hashmap.h>
 #include <isc/ht.h>
+#include <isc/list.h>
+#include <isc/qsbr.h>
 #include <isc/rwlock.h>
 #include <isc/util.h>

File diff suppressed because it is too large
@@ -40,30 +40,30 @@ ISC_RUN_TEST_IMPL(qpkey_name) {
 	} testcases[] = {
 		{
 			.namestr = ".",
-			.key = { 0x01, 0x01 },
+			.key = { 0x02, 0x02 },
 			.len = 1,
 		},
 		{
 			.namestr = "\\000",
-			.key = { 0x02, 0x02, 0x01, 0x01 },
+			.key = { 0x03, 0x03, 0x02, 0x02 },
 			.len = 3,
 		},
 		{
 			.namestr = "example.com.",
-			.key = { 0x01, 0x15, 0x21, 0x1f, 0x01, 0x17, 0x2a, 0x13,
-				 0x1f, 0x22, 0x1e, 0x17, 0x01, 0x01 },
+			.key = { 0x02, 0x16, 0x22, 0x20, 0x02, 0x18, 0x2b, 0x14,
+				 0x20, 0x23, 0x1f, 0x18, 0x02, 0x02 },
 			.len = 13,
 		},
 		{
 			.namestr = "example.com",
-			.key = { 0x15, 0x21, 0x1f, 0x01, 0x17, 0x2a, 0x13, 0x1f,
-				 0x22, 0x1e, 0x17, 0x01, 0x01 },
+			.key = { 0x16, 0x22, 0x20, 0x02, 0x18, 0x2b, 0x14, 0x20,
+				 0x23, 0x1f, 0x18, 0x02, 0x02 },
 			.len = 12,
 		},
 		{
 			.namestr = "EXAMPLE.COM",
-			.key = { 0x15, 0x21, 0x1f, 0x01, 0x17, 0x2a, 0x13, 0x1f,
-				 0x22, 0x1e, 0x17, 0x01, 0x01 },
+			.key = { 0x16, 0x22, 0x20, 0x02, 0x18, 0x2b, 0x14, 0x20,
+				 0x23, 0x1f, 0x18, 0x02, 0x02 },
 			.len = 12,
 		},
 	};
@@ -78,8 +78,8 @@ ISC_RUN_TEST_IMPL(qpkey_name) {
 		in = dns_fixedname_name(&fn1);
 		len = dns_qpkey_fromname(key, in);
-		assert_true(testcases[i].len == len);
-		assert_true(memcmp(testcases[i].key, key, len) == 0);
+		assert_int_equal(testcases[i].len, len);
+		assert_memory_equal(testcases[i].key, key, len);

 		out = dns_fixedname_initname(&fn2);
 		qp_test_keytoname(key, out);

@@ -22,8 +22,12 @@
 #define UNIT_TESTING
 #include <cmocka.h>

+#include <isc/assertions.h>
 #include <isc/log.h>
+#include <isc/loop.h>
+#include <isc/magic.h>
 #include <isc/mem.h>
+#include <isc/qsbr.h>
 #include <isc/random.h>
 #include <isc/refcount.h>
 #include <isc/rwlock.h>
@@ -44,11 +48,10 @@
 #define TRANSACTION_COUNT 1234

 #if VERBOSE
 #define TRACE(fmt, ...)                                                     \
 	isc_log_write(dns_lctx, DNS_LOGCATEGORY_DATABASE, DNS_LOGMODULE_QP, \
-		      ISC_LOG_DEBUG(7), "%s:%d:%s: " fmt, __FILE__,         \
-		      __LINE__, __func__, ##__VA_ARGS__)
+		      ISC_LOG_DEBUG(7), "%s:%d:%s(): " fmt, __FILE__,       \
+		      __LINE__, __func__, ##__VA_ARGS__)
 #else
 #define TRACE(...)
 #endif
@@ -110,8 +113,8 @@ item_attach(void *ctx, void *pval, uint32_t ival) {
 static void
 item_detach(void *ctx, void *pval, uint32_t ival) {
-	INSIST(ctx == NULL);
-	INSIST(pval == &item[ival]);
+	assert_null(ctx);
+	assert_ptr_equal(pval, &item[ival]);
 	item[ival].refcount--;
 }
@@ -124,11 +127,11 @@ item_makekey(dns_qpkey_t key, void *ctx, void *pval, uint32_t ival) {
 	if (!(ival < ARRAY_SIZE(item) && lo <= ip && ip < hi &&
 	      pval == &item[ival]))
 	{
-		TRACE("ival %u pval %lx", ival, ip);
-		INSIST(ival < ARRAY_SIZE(item));
-		INSIST(ip >= lo);
-		INSIST(ip < hi);
-		INSIST(pval == &item[ival]);
+		ISC_INSIST(ival < ARRAY_SIZE(item));
+		ISC_INSIST(pval != NULL);
+		ISC_INSIST(ip >= lo);
+		ISC_INSIST(ip < hi);
+		ISC_INSIST(pval == &item[ival]);
 	}
 	memmove(key, item[ival].key, item[ival].len);
 	return (item[ival].len);
@@ -182,13 +185,13 @@ checkkey(dns_qpreadable_t qpr, size_t i, bool exists, const char *rubric) {
 	isc_result_t result;
 	result = dns_qp_getkey(qpr, item[i].key, item[i].len, &pval, &ival);
 	if (result == ISC_R_SUCCESS) {
-		ASSERT(exists == true);
-		ASSERT(pval == &item[i]);
-		ASSERT(ival == i);
+		assert_true(exists);
+		assert_ptr_equal(pval, &item[i]);
+		assert_int_equal(ival, i);
 	} else if (result == ISC_R_NOTFOUND) {
-		ASSERT(exists == false);
-		ASSERT(pval == NULL);
-		ASSERT(ival == ~0U);
+		assert_false(exists);
+		assert_null(pval);
+		assert_int_equal(ival, ~0U);
 	} else {
 		UNREACHABLE();
 	}
@@ -232,9 +235,9 @@ one_transaction(dns_qpmulti_t *qpm) {
 	isc_result_t result;
 	bool ok = true;
-	dns_qpreadable_t qpo = (dns_qpreadable_t)(dns_qp_t *)NULL;
-	dns_qpread_t *qpr = NULL;
+	dns_qpreader_t *qpo = NULL;
 	dns_qpsnap_t *qps = NULL;
+	dns_qpread_t qpr = { 0 };
 	dns_qp_t *qpw = NULL;

 	bool snap = isc_random_uniform(2) == 0;
@@ -242,7 +245,7 @@
 	bool rollback = update && isc_random_uniform(4) == 0;
 	size_t count = isc_random_uniform(TRANSACTION_SIZE);

-	TRACE("transaction %s %s %s %zu", snap ? "snapshot" : "query",
+	TRACE("transaction %s %s %s size %zu", snap ? "snapshot" : "query",
 	      update ? "update" : "write", rollback ? "rollback" : "commit",
 	      count);
@@ -255,7 +258,7 @@
 	/* briefly take and drop mutex */
 	if (snap) {
 		dns_qpmulti_snapshot(qpm, &qps);
-		qpo = (dns_qpreadable_t)qps;
+		qpo = (dns_qpreader_t *)qps;
 	}

 	/* take mutex */
@@ -265,10 +268,9 @@
 		dns_qpmulti_write(qpm, &qpw);
 	}

-	/* take rwlock */
 	if (!snap) {
 		dns_qpmulti_query(qpm, &qpr);
-		qpo = (dns_qpreadable_t)qpr;
+		qpo = (dns_qpreader_t *)&qpr;
 	}

 	for (size_t n = 0; n < count; n++) {
@@ -278,15 +280,15 @@
 		ASSERT(checkkey(qpw, i, item[i].in_rw, "before rw"));
 		if (item[i].in_rw) {
-			/* TRACE("delete %zu %.*s", i, item[i].len,
-			   item[i].ascii); */
+			/* TRACE("delete %zu %.*s", i,
+			   item[i].len, item[i].ascii); */
 			result = dns_qp_deletekey(qpw, item[i].key,
 						  item[i].len);
 			ASSERT(result == ISC_R_SUCCESS);
 			item[i].in_rw = false;
 		} else {
-			/* TRACE("insert %zu %.*s", i, item[i].len,
-			   item[i].ascii); */
+			/* TRACE("insert %zu %.*s", i,
+			   item[i].len, item[i].ascii); */
 			result = dns_qp_insert(qpw, &item[i], i);
 			ASSERT(result == ISC_R_SUCCESS);
 			item[i].in_rw = true;
@@ -307,7 +309,6 @@
 	assert_true(checkallrw(qpw));

 	if (!snap) {
-		/* drop the rwlock so the commit can take it */
 		dns_qpread_destroy(qpm, &qpr);
 	}
@@ -322,7 +323,7 @@
 					       "rollback ro"));
 			}
 			item[i].in_rw = item[i].in_ro;
-			ASSERT(checkkey(qpr, i, item[i].in_rw, "rollback rw"));
+			ASSERT(checkkey(&qpr, i, item[i].in_rw, "rollback rw"));
 		}
 		dns_qpread_destroy(qpm, &qpr);
 	} else {
@@ -336,7 +337,7 @@
 					       "commit ro"));
 			}
 			item[i].in_ro = item[i].in_rw;
-			ASSERT(checkkey(qpr, i, item[i].in_rw, "commit rw"));
+			ASSERT(checkkey(&qpr, i, item[i].in_rw, "commit rw"));
 		}
 		dns_qpread_destroy(qpm, &qpr);
 	}
@@ -347,28 +348,45 @@
 		dns_qpsnap_destroy(qpm, &qps);
 	}

+	TRACE("completed %s %s %s size %zu", snap ? "snapshot" : "query",
+	      update ? "update" : "write", rollback ? "rollback" : "commit",
+	      count);
+
 	if (!ok) {
 		TRACE("transaction failed");
 		dns_qpmulti_query(qpm, &qpr);
-		qp_test_dumptrie(qpr);
+		qp_test_dumptrie(&qpr);
 		dns_qpread_destroy(qpm, &qpr);
 	}
 	assert_true(ok);
 }

-ISC_RUN_TEST_IMPL(qpmulti) {
-	setup_logging();
-	setup_items();
+static void
+many_transactions(void *arg) {
+	UNUSED(arg);
 	dns_qpmulti_t *qpm = NULL;
-	dns_qpmulti_create(mctx, &test_methods, NULL, &qpm);
+	dns_qpmulti_create(mctx, loopmgr, &test_methods, NULL, &qpm);
+	qpm->writer.write_protect = true;
 	for (size_t n = 0; n < TRANSACTION_COUNT; n++) {
+		TRACE("transaction %zu", n);
 		one_transaction(qpm);
+		isc__qsbr_quiescent_state(isc_loop_current(loopmgr));
+		isc_loopmgr_wakeup(loopmgr);
 	}
 	dns_qpmulti_destroy(&qpm);
+	isc_loopmgr_shutdown(loopmgr);
+}
+
+ISC_RUN_TEST_IMPL(qpmulti) {
+	setup_loopmgr(NULL);
+	setup_logging();
+	setup_items();
+	isc_loop_setup(isc_loop_main(loopmgr), many_transactions, NULL);
+	isc_loopmgr_run(loopmgr);
+	isc_loopmgr_destroy(&loopmgr);
 	isc_log_destroy(&dns_lctx);
 }

@@ -16,7 +16,9 @@
 #include <stdio.h>

 #include <isc/buffer.h>
+#include <isc/loop.h>
 #include <isc/magic.h>
+#include <isc/qsbr.h>
 #include <isc/refcount.h>
 #include <isc/rwlock.h>
 #include <isc/util.h>
@@ -136,7 +138,7 @@ qp_test_keytoname(const dns_qpkey_t key, dns_name_t *name) {
 static size_t
 getheight(dns_qp_t *qp, qp_node_t *n) {
-	if (!is_branch(n)) {
+	if (node_tag(n) == LEAF_TAG) {
 		return (0);
 	}
 	size_t max_height = 0;
@@ -151,18 +153,15 @@ getheight(dns_qp_t *qp, qp_node_t *n) {
 size_t
 qp_test_getheight(dns_qp_t *qp) {
-	return (getheight(qp, &qp->root));
+	qp_node_t *root = get_root(qp);
+	return (root == NULL ? 0 : getheight(qp, root));
 }

 static size_t
 maxkeylen(dns_qp_t *qp, qp_node_t *n) {
-	if (!is_branch(n)) {
-		if (leaf_pval(n) == NULL) {
-			return (0);
-		} else {
-			dns_qpkey_t key;
-			return (leaf_qpkey(qp, n, key));
-		}
+	if (node_tag(n) == LEAF_TAG) {
+		dns_qpkey_t key;
+		return (leaf_qpkey(qp, n, key));
 	}
 	size_t max_len = 0;
 	qp_weight_t size = branch_twigs_size(n);
@@ -176,7 +175,8 @@ maxkeylen(dns_qp_t *qp, qp_node_t *n) {
 size_t
 qp_test_maxkeylen(dns_qp_t *qp) {
-	return (maxkeylen(qp, &qp->root));
+	qp_node_t *root = get_root(qp);
+	return (root == NULL ? 0 : maxkeylen(qp, root));
 }

 /***********************************************************************
@@ -186,8 +186,9 @@ qp_test_maxkeylen(dns_qp_t *qp) {
 static void
 dumpread(dns_qpreadable_t qpr, const char *type, const char *tail) {
-	dns_qpread_t *qp = dns_qpreadable_cast(qpr);
-	printf("%s %p root %p base %p methods %p%s", type, qp, &qp->root,
-	       qp->base, qp->methods, tail);
+	dns_qpreader_t *qp = dns_qpreader(qpr);
+	printf("%s %p root %u %u:%u base %p methods %p%s", type, qp,
+	       qp->root_ref, ref_chunk(qp->root_ref), ref_cell(qp->root_ref),
+	       qp->base, qp->methods, tail);
 }
@@ -195,17 +196,14 @@ static void
 dumpqp(dns_qp_t *qp, const char *type) {
 	dumpread(qp, type, " mctx ");
 	printf("%p\n", qp->mctx);
-	printf("%s %p usage %p generation %u "
-	       "chunk_max %u bump %u fender %u\n",
-	       type, qp, qp->usage, qp->generation, qp->chunk_max, qp->bump,
-	       qp->fender);
+	printf("%s %p usage %p chunk_max %u bump %u fender %u\n", type, qp,
+	       qp->usage, qp->chunk_max, qp->bump, qp->fender);
 	printf("%s %p leaf %u live %u used %u free %u hold %u\n", type, qp,
 	       qp->leaf_count, qp->used_count - qp->free_count, qp->used_count,
 	       qp->free_count, qp->hold_count);
-	printf("%s %p compact_all=%d shared_arrays=%d"
-	       " transaction_mode=%d write_protect=%d\n",
-	       type, qp, qp->compact_all, qp->shared_arrays,
-	       qp->transaction_mode, qp->write_protect);
+	printf("%s %p compact_all=%d transaction_mode=%d write_protect=%d\n",
+	       type, qp, qp->compact_all, qp->transaction_mode,
+	       qp->write_protect);
 }

 void
@@ -229,10 +227,19 @@ qp_test_dumpqp(dns_qp_t *qp) {
 void
 qp_test_dumpmulti(dns_qpmulti_t *multi) {
-	dumpqp(&multi->phase[0], "qpmulti->phase[0]");
-	dumpqp(&multi->phase[1], "qpmulti->phase[1]");
-	printf("qpmulti %p read %p snapshots %u\n", &multi, multi->read,
-	       multi->snapshots);
+	dns_qpreader_t qpr;
+	qp_node_t *reader = atomic_load(&multi->reader);
+	dns_qpmulti_t *whence = unpack_reader(&qpr, reader);
+	dumpqp(&multi->writer, "qpmulti->writer");
+	printf("qpmulti->reader %p root_ref %u %u:%u base %p\n", reader,
+	       qpr.root_ref, ref_chunk(qpr.root_ref), ref_cell(qpr.root_ref),
+	       qpr.base);
+	printf("qpmulti->reader %p whence %p\n", reader, whence);
+	unsigned int snapshots = 0;
+	for (dns_qpsnap_t *snap = ISC_LIST_HEAD(multi->snapshots); //
+	     snap != NULL; snap = ISC_LIST_NEXT(snap, link), snapshots++)
+	{}
+	printf("qpmulti %p snapshots %u\n", multi, snapshots);
 	fflush(stdout);
 }
@@ -242,9 +249,11 @@ qp_test_dumpchunks(dns_qp_t *qp) {
 	qp_cell_t free = 0;

 	dumpqp(qp, "qp");
 	for (qp_chunk_t c = 0; c < qp->chunk_max; c++) {
-		printf("qp %p chunk %u base %p used %u free %u generation %u\n",
-		       qp, c, qp->base[c], qp->usage[c].used, qp->usage[c].free,
-		       qp->usage[c].generation);
+		printf("qp %p chunk %u base %p "
+		       "used %u free %u immutable %u phase %u\n",
+		       qp, c, qp->base->ptr[c], qp->usage[c].used,
+		       qp->usage[c].free, qp->usage[c].immutable,
+		       qp->usage[c].phase);
 		used += qp->usage[c].used;
 		free += qp->usage[c].free;
 	}
@@ -254,7 +263,7 @@
 void
 qp_test_dumptrie(dns_qpreadable_t qpr) {
-	dns_qpread_t *qp = dns_qpreadable_cast(qpr);
+	dns_qpreader_t *qp = dns_qpreader(qpr);
 	struct {
 		qp_ref_t ref;
 		qp_shift_t max, pos;
@@ -267,11 +276,18 @@ qp_test_dumptrie(dns_qpreadable_t qpr) {
 	 * node; the ref is deliberately out of bounds, and pos == max
 	 * so we will immediately stop scanning it
 	 */
-	stack[sp].ref = ~0U;
+	stack[sp].ref = INVALID_REF;
 	stack[sp].max = 0;
 	stack[sp].pos = 0;

-	qp_node_t *n = &qp->root;
-	printf("%p ROOT\n", n);
+	qp_node_t *n = get_root(qp);
+	if (n == NULL) {
+		printf("%p EMPTY\n", n);
+		fflush(stdout);
+		return;
+	} else {
+		printf("%p ROOT qp %p base %p\n", n, qp, qp->base);
+	}

 	for (;;) {
 		if (is_branch(n)) {
@@ -291,25 +307,20 @@
 			}
 			assert(len == max);
 			qp_test_keytoascii(bits, len);
-			printf("%*s%p BRANCH %p %d %zu %s\n", (int)sp * 2, "",
-			       n, twigs, ref, branch_key_offset(n), bits);
+			printf("%*s%p BRANCH %p %u %u:%u %zu %s\n", (int)sp * 2,
+			       "", n, twigs, ref, ref_chunk(ref), ref_cell(ref),
+			       branch_key_offset(n), bits);
 			++sp;
 			stack[sp].ref = ref;
 			stack[sp].max = max;
 			stack[sp].pos = 0;
 		} else {
-			if (leaf_pval(n) != NULL) {
-				dns_qpkey_t key;
-				qp_test_keytoascii(key, leaf_qpkey(qp, n, key));
-				printf("%*s%p LEAF %p %d %s\n", (int)sp * 2, "",
-				       n, leaf_pval(n), leaf_ival(n), key);
-				leaf_count++;
-			} else {
-				assert(n == &qp->root);
-				assert(leaf_count == 0);
-				printf("%p EMPTY", n);
-			}
+			dns_qpkey_t key;
+			qp_test_keytoascii(key, leaf_qpkey(qp, n, key));
+			printf("%*s%p LEAF %p %d %s\n", (int)sp * 2, "", n,
+			       leaf_pval(n), leaf_ival(n), key);
+			leaf_count++;
 		}

 		while (stack[sp].pos == stack[sp].max) {
@@ -328,7 +339,9 @@
 static void
 dumpdot_name(qp_node_t *n) {
-	if (is_branch(n)) {
+	if (n == NULL) {
+		printf("empty");
+	} else if (is_branch(n)) {
 		qp_ref_t ref = branch_twigs_ref(n);
 		printf("c%dn%d", ref_chunk(ref), ref_cell(ref));
 	} else {
@@ -338,7 +351,9 @@ dumpdot_name(qp_node_t *n) {
 static void
 dumpdot_twig(dns_qp_t *qp, qp_node_t *n) {
-	if (is_branch(n)) {
+	if (n == NULL) {
+		printf("empty [shape=oval, label=\"\\N EMPTY\"];\n");
+	} else if (is_branch(n)) {
 		dumpdot_name(n);
 		printf(" [shape=record, label=\"{ \\N\\noff %zu | ",
 		       branch_key_offset(n));
@@ -370,11 +385,7 @@ dumpdot_twig(dns_qp_t *qp, qp_node_t *n) {
 	} else {
 		dns_qpkey_t key;
 		const char *str;
-		if (leaf_pval(n) == NULL) {
-			str = "EMPTY";
-		} else {
-			str = qp_test_keytoascii(key, leaf_qpkey(qp, n, key));
-		}
+		str = qp_test_keytoascii(key, leaf_qpkey(qp, n, key));
 		printf("v%p [shape=oval, label=\"\\N ival %d\\n%s\"];\n",
 		       leaf_pval(n), leaf_ival(n), str);
 	}
@@ -383,7 +394,7 @@ dumpdot_twig(dns_qp_t *qp, qp_node_t *n) {
 void
 qp_test_dumpdot(dns_qp_t *qp) {
 	REQUIRE(QP_VALID(qp));
-	qp_node_t *n = &qp->root;
+	qp_node_t *n = get_root(qp);
 	printf("strict digraph {\nrankdir = \"LR\"; ranksep = 1.0;\n");
 	printf("ROOT [shape=point]; ROOT -> ");
 	dumpdot_name(n);