diff --git a/doc/design/qp-trie.md b/doc/design/qp-trie.md new file mode 100644 index 0000000000..a5f4d9b6dd --- /dev/null +++ b/doc/design/qp-trie.md @@ -0,0 +1,770 @@ + + +A qp-trie for the DNS +===================== + +A qp-trie is a data structure that supports lookups in a sorted +collection of keys. It is efficient both in terms of fast lookups and +using little memory. It is particularly well-suited for use in DNS +servers. + +These notes outline how BIND's `dns_qp` implementation works, how it +is optimized for lookups keyed by DNS names, and how it supports +multi-version concurrency. + + +data structure zoo +------------------ + +Chasing a pointer indirection is very slow, up to 100ns, whereas a +sequential memory access takes less than 10ns. So, to make a data +structure fast, we need to minimize indirections. + +There is a tradeoff between speed and flexibility in standard data +structures: + + * Arrays are very simple and fast (a lookup goes straight to the + right address), but the key can only be a small integer. + + * Hash tables allow you to use arbitrary lookup keys (such as + strings), but may require probing multiple addresses to find the + right element. + + * Radix trees allow you to do lookups based on the sorting order of + the keys, provided it is lexical like `memcmp()`; however, lookups + require multiple indirections. + + * Comparison search trees (binary trees and B-trees) allow you to + use an arbitrary ordering predicate, but each indirection during + a lookup also requires a comparison. + +In the DNS, we need to use some kind of tree to support the kinds of +lookup required for DNSSEC: find longest match, find nearest +predecessor or successor, and so forth. So what kind of tree is best? + + +in theory +--------- + +In a tree where the average length of a key is `k`, and the number of +elements in the tree is `n`, the theoretical performance bounds are, +for a comparison tree: + + * `Ω(k * log n)` + * `Ο(k * n)` + +And for a radix tree: + + * `Ω(k + log n)` + * `Ο(k + k)` + +Here, `Ω()` is the lower bound and `Ο()` is the upper bound; we +expect typical performance to be close to the lower bound. + +The multiplications in the comparison tree expressions means that each +indirection requires a comparison `Ο(k)`, whereas they are additions +in the radix tree expressions because a radix tree traversal only +needs one key comparison. + +The upper bounds say that (in the absence of balancing) a comparison +tree can devolve into a linked list of nodes, whereas the shape of a +radix tree is determined by the set of keys independent of the order +of insertion or the number of keys. + +The logarithms hide some interesting constant factors. In a binary +tree, the log is base 2. In a radix tree, the radix is the base of the +logarithm. So, if we increase the radix, the constant factor gets +smaller. The rough equivalent for a binary tree would be to use a +B-tree instead, but although B-trees have fewer indirections they do +not reduce the number of comparisons. + +In implementation terms, a larger radix means tree nodes get wider +and the tree becomes shallower. A shallower tree requires fewer +indirections, so it should be faster. The trick is to increase the +radix without blowing up the tree's memory usage, which can lose +more performance than we win. + +This analysis suggests that a radix tree is better than a comparison +tree, provided keys can be compared lexically - which is true for DNS +names, with some rearrangement (described below). 
When using big-o +notation, we also need to be wary of the constant factors; but in this +case they also favour a radix tree, especially with the optimization +tricks used by BIND's qp-trie. + +Note: "radix" comes from the latin for "root", so "radix tree" is a +pun, which is geekily amusing especially when talking about logs. + + +what is a trie? +--------------- + +A trie is another name for a radix tree (or "digital tree" according +to Knuth). It is short for information reTRIEval, and I pronounce it +exactly like "tree" (though Knuth pronounces it like "try"). + +In a trie, keys are divided into digits depending on some radix e.g. +base 2 for binary tries, base 256 for byte-indexed tries. When +searching the trie, successive digits in the key, from most to least +significant, are used to select branches from successive nodes in +the trie, roughly like: + + for (offset = 0; isbranch(node); offset++) + node = node->child[key[offset]]; + +All of the keys in a subtrie have identical prefixes. Tries do not +need to store keys since they are implicit in the structure. + + +binary crit-bit trees +--------------------- + +A patricia trie is a binary trie which omits nodes that have only one +child. Dan Bernstein calls his tightly space-optimized version a +"crit-bit tree". +https://cr.yp.to/critbit.html +https://github.com/agl/critbit/ + +Unlike a basic trie, a crit-bit tree skips parts of the key when +every element in a subtree shares the same sequence of bits. +Each node is annotated with the offset of the bit that is used to +select the branch; offsets always increase as you go deeper into +the tree. + + while (isbranch(node)) + node = node->child[key[node->offset]]; + +In a crit-bit tree the keys are not implicit in the structure +because parts of them are skipped. Therefore, each leaf refers to a +copy of its key so that when you find a leaf you can verify that the +skipped bits match. + + +prefetching +----------- + +Observe that in the loop above, the current node has only one child +pointer, and the child nodes are adjacent in memory. This means it +is possible to tell the CPU to prefetch the child nodes before +extracting the critical bit from the key and choosing which child is +next. A qp-trie has a similar layout, but it has more child nodes +(still adjacent in memory) and it does more computation to choose +which one is next. + +When I originally invented the qp-trie code, I found that explicit +prefetch hints made the qp-trie substantially faster and the crit-bit +tree slightly faster. The hints help the CPU to do useful work at the +same time as the memory subsystem. (This is unusual for linked data +structures, which tend to alternate between CPU waiting for memory, +and memory waiting for CPU.) + +Large modern CPUs (after about 2015) are better at prefetching +automatically, so the explicit hint is less important than it used to +be, but `lib/dns/qp.c` still has `__builtin_prefetch()` hints in its +inner traversal loops. + + +packed sparse vectors with popcount +----------------------------------- + +The `popcount` instruction counts the number of bits that are set +in a word. It's also known as the Hamming weight; Knuth calls it +"sideways add". https://en.wikipedia.org/wiki/popcount + +You can use `popcount` to implement a sparse vector of length `N` +containing `M <= N` members using bitmap of length `N` and a packed +vector of `M` elements. A member `b` is present in the vector if bit +`b` is set, so `M == popcount(bitmap)`. 
The index of member `b` in +the packed vector is the popcount of the bits preceding `b`. + + // size of vector + size = popcount(bitmap); + // bit position + bit = 1 << b; + // is element present? + if (bitmap & bit) { + // mask covers the preceding elements + mask = bit - 1; + // position of element in packed vector + pos = popcount(bitmap & mask); + // fetch element + elem = vector[pos]; + } + +See "Hacker's Delight" by Hank Warren, section 5-1 "Counting 1 +bits", subsection "applications". http://www.hackersdelight.org + +See under _"bitmap popcount shenanigans"_ in `lib/dns/qp.c` for how +this is implemented in BIND. + + +popcount for trie nodes +----------------------- + +Phil Bagwell's hashed array-mapped tries (HAMT) use popcount for +compact trie nodes. In a HAMT, string keys are hashed, and the hash is +used as the index to the trie, with radix 2^32 or 2^64. +http://infoscience.epfl.ch/record/64394/files/triesearches.pdf +http://infoscience.epfl.ch/record/64398/files/idealhashtrees.pdf + +As discussed above, increasing the radix makes the tree shallower, so +it should be faster. The downside is usually much greater memory +overhead. Child vectors are often sparsely populated, so we can +greatly reduce the overhead by packing them with popcount. + +The HAMT relies on hashing, which keeps keys dense. This means it +can be laid out like a basic trie with implicit keys (i.e. hash +values). The disadvantage of hashing is that strings are stored +out of order. + + +qp-trie +------- + +A qp-trie is a mash-up of Bernstein's crit-bit tree with Bagwell's +HAMT. Like a crit-bit tree, a qp-trie omits nodes with one child; +nodes include a key offset; and keys a referenced from leaves instead +of being implicit in the trie structure. Like a HAMT, nodes have a +popcount packed vector of children, but unlike a HAMT, keys are not +hashed. + +A qp-trie is faster than a crit-bit tree and uses less memory, because +its wider fan-out requires fewer nodes and popcount packs them very +efficiently. Like a crit-bit tree but unlike a HAMT, a qp-trie stores +keys in lexical order. + +As in a HAMT, the original layout of a qp-trie node is a pair of +words, which are used as key and value pointers in leaf nodes, and +index word and pointer in branch nodes. The index word contains the +popcount bitmap (as in a HAMT) and the offset into the key (as in a +crit-bit tree), as well as a leaf/branch tag bit. The pointer refers +to the branch node's "twigs", which is what we call the packed sparse +vector of child nodes. + +The fan-out of a qp-trie is limited by the need to fit the bitmap and +the nybble offset into a 64-bit word; a radix of 16 or 32 works well, +and 32 is slightly faster (though 5-bit nybbles are fiddly). But radix +64 requires an extra word per node, and the extra memory overhead +makes it slower as well as bulkier. + +Early qp-trie implementations used a node layout like the +following. However, in practice C bitfields have too many +portability gotchas to work well. It is better to use hand-written +shifting and masking to access the parts of the index word. + + #define NYBBLE 4 // or 5 + #define RADIX (1 << NYBBLE) + + union qp_node { + struct { + unsigned tag : 1; + unsigned bitmap : RADIX; + unsigned offset : (64 - 1 - RADIX); + union qp_node *twigs; + } branch; + struct { + void *value; + const char *key; + } leaf; + }; + + +DNS qp-trie +----------- + +BIND uses a variant of a qp-trie optimized for DNS names. 
DNS names +almost always use the usual hostname alphabet of (case-insensitive) +letters, digits, hyphen, plus underscore (which is often used in the DNS +for non-hostname purposes), and finally the label separator (which is +written as '.' in presentation-format domain names, and is the label +length in wire format). This adds up to 39 common characters. + +A bitmap for 39 common characters is small enough to fit into a +qp-trie index word, so we can (in principle) walk down the trie one +character at a time, as if the radix were 256, but without needing a +multi-word bitmap. + +However, DNS names can contain arbitrary bytes. To support the 200-ish +unusual characters we use an escaping scheme, described in more detail +below. This requires a few more bits in the bitmap to represent the +escape characters, so our radix ends up being 47. This still fits into +the 64-bit index word, so we get the compactness of a qp-trie but with +faster byte-at-a-time lookups for DNS names that use common hostname +characters. + +You can also use other kinds of keys with BIND's DNS qp-trie, provided +they are not too long. You must provide your own key preparation +function, e.g. for uniform binary keys you might extract 5-bit nybbles +to get a radix-32 trie. + + +preparing a lookup key +---------------------- + +A DNS name needs to be rearranged to use it as a qp-trie key, so that +the lexical order of rearranged keys matches the canonical DNS name +order specified in RFC 4034 section 6.1: + + * reverse the order of the labels so that they run from most + significant to least significant, left to right (but the + characters in each label remain in the same order) + + * convert uppercase ASCII letters to lowercase ASCII + + * change the label separators to a non-byte value that sorts before + the zero byte + +For qp-trie lookups there are a couple of extra steps: + + * There is an escaping mechanism to support DNS names that use + unusual characters. Common characters use one byte in the lookup + key, but unusual characters are expanded to two bytes. To preserve + the correct lexical order, there are different escape bytes + depending on how the unusual character sorts relative to the + common hostname characters. + + * Characters in the DNS name need to be converted to bitmap + positions. This is done at the same time as preparing the lookup + key, to move work out of the inner trie traversal loop. + +These 5 transformations can be done in a single pass over a DNS name +using a single lookup table. The transformed name is usually the +same length (up to 2x longer if it contains unusual characters). + +You can use absolute or relative DNS names as keys, without ambiguity +(provided you have some way of knowing what names are relative to). +When converted to a lookup key, absolute names start with a non-byte +value representing the root, and relative names do not. + +Lookup keys are ephemeral, allocated on the stack during a lookup. + +See under _"converting DNS names to trie keys"_ in `lib/dns/qp.c` +for how this is implemented in BIND. + + +node layout +----------- + +Earlier I said that the original qp-trie node layout consists of two +words: one 64 bit word for the branch index, and one pointer-sized +word. BIND's qp-trie uses a layout that is smaller on 64-bit systems: +one 64 bit word and one 32-bit word. 
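+
+As a conceptual sketch (the field names here are invented, and a naive
+C compiler would pad this union out to 16 bytes; the real definition in
+`lib/dns/qp_p.h` packs the same fields into 12 bytes):
+
+    union qp_node {
+        struct {
+            uint64_t index; /* tag bit, bitmap, key offset */
+            uint32_t twigs; /* reference to the child vector */
+        } branch;
+        struct {
+            uint64_t pval; /* tagged pointer to the value object */
+            uint32_t ival; /* user-defined integer value */
+        } leaf;
+    };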
+ +A branch node contains + + * a branch/leaf tag bit + + * a 47-wide bitmap, with a bit for each common hostname character + and each escape character + + * a 9-bit key offset, enough to count twice the length of a DNS + name + + * a 32-bit "twigs" reference to the packed vector of child nodes; + these references are described in more detail below + +A leaf node contains a pointer value (which we assume to be 64 bits) +and a 32-bit integer value. The branch/leaf tag is smuggled into the +low-order bit of the pointer value, so the pointer value must have +large enough alignment. (This requirement is checked when a leaf is +added to the trie.) Apart from that, the meaning of leaf values +is entirely under control of the qp-trie user. + +When constructing a qp-trie the user provides a collection of method +pointers. The qp-trie code calls these methods when it needs to do +anything that needs to look into a leaf value, such as extracting the +key. + +See under _"interior node basics"_ and _"interior node constructors +and accessors"_ in `lib/dns/qp_p.h` for the implementation. + + +example +------- + +Consider a small zone: + + example. ; apex + mail.example. ; IMAP server + mx.example. ; incoming mail + www.example. ; web load balancer + www1.example. ; back-end web servers + www2.example. + +It becomes a qp-trie as follows. I am writing bitmaps as lists of +characters representing the bits that are set, with `'.'` for label +separators. I have used arbitrary names for the addresses of the twigs +vectors. + + root = (qp_node){ + tag: BRANCH, + offset: 9, + bitmap: [ '.', 'm', 'w' ], + twigs: &one, + }; + +Note that the offset skips the root zone, the zone name, and the apex +label separator. If the offset is beyond the end of the key, the byte +value is the label separator. + + one = (qp_node[3]){ + { + tag: LEAF, + key: "example.", + }, + { + tag: BRANCH, + offset: 10, + bitmap: [ 'a', 'x' ], + twigs: &two, + }, + { + tag: BRANCH, + offset: 12, + bitmap: [ '.', '1', '2' ], + twigs: &three, + }, + }; + +This twigs vector has an element for the zone apex, and the two +different initial characters of the subdomains. + +The mail servers differ in the next character, so the offset bumps from +9 to 10 without skipping any characters. The web servers all start with +www, so the offset bumps from 9 to 12, skipping the common prefix. + + two = (qp_node[2]){ + { + tag: LEAF, + key: "mail.example.", + }, + { + tag: LEAF, + key: "mx.example.", + }, + }; + +The different lengths of `mail` and `mx` don't matter: we implicitly +skip to the end of the key when we reach a leaf node. + + three = (qp_node[3]){ + { + tag: LEAF, + key: "www.example.", + }, + { + tag: LEAF, + key: "www1.example.", + }, + { + tag: LEAF, + key: "www2.example.", + }, + }; + +When the trie includes labels of differing lengths, we can have a node +that chooses between a label separator and characters from the longer +labels. This is slightly different from the root node, which tested the +first character of the label; here we are testing the last character. + + +memory management for concurrency +--------------------------------- + +The following sections discuss how the qp-trie supports concurrency. + +The requirement is to support many concurrent read threads, and +allow updates to occur without blocking readers (or blocking readers +as little as possible). 
+ +The strategy is to use "copy-on-write", that is, when an update +needs to alter the trie it makes a copy of the parts that it needs +to change, so that concurrent readers can continue to use the +original. (It is analogous to multiversion concurrency in databases +such as PostgreSQL, where copy-on-write uses a write-ahead log.) + +Software that uses copy-on-write needs some mechanism for clearing +away old versions that are no longer in use. (For example, VACUUM in +PostgreSQL.) The qp-trie code uses a custom allocator with a simple +garbage collector; as well as supporting concurrency, the qp-trie's +memory manager makes tries smaller and faster. + + +allocation +---------- + +A qp-trie is relatively demanding on its allocator. Twigs vectors +can be lots of different sizes, and every mutation of the trie +requires an alloc and/or a free. + +Older versions of the qp-trie code used the system allocator. Many +allocators (such as `jemalloc`) segregate the heap into different +size classes, so that each chunk of memory is dedicated to +allocations of the same size. While this memory layout provides good +locality when objects of the same type have the same size, it tends +to scatter the interior nodes of a qp-trie all over the address space. + +BIND's qp-trie code uses a "bump allocator" for its interior nodes, +which is one of the simplest and fastest possible: an allocation +usually only requires incrementing a pointer and checking if it has +reached a limit. (If the check fails the allocator goes into its +slow path.) Allocations have good locality because they write +sequentially into memory. (A bit like a write-ahead log.) + +Bump allocators need reasonably large contiguous chunks of empty +memory to make the most of their efficiency, so they are often +coupled with some kind of compacting garbage collector, which +defragments the heap to recover free space. + +See `alloc_twigs()` in `lib/dns/qp.c` for the bump allocator fast +path. + + +garbage collection +------------------ + +[The Garbage Collection Handbook](https://gchandbook.org/) says +there are four basic kinds of automatic memory management. + +Reference counting is used by scripting languages such as Perl and +Python, and also for manual memory management such as in operating +system kernels and BIND. + +To avoid writing a custom allocator, I previously tried adapting the +qp-trie code to use refcounting to support copy-on-write, but I was +not very happy with the complexity of the implementation, and I +thought it was ugly that I needed to modify refcounts in nodes that +were logically read-only. + +(Two other kinds of GC are mark-sweep and mark-compact. Both of them +have a similar disadvantage to refcounting: a simple GC mark phase +modifies nodes that are logically read-only. And mark-sweep leaves +memory fragmented so it does not support a bump allocator.) + +The fourth kind is copying garbage collection. It works well with a +bump allocator, because copying the data structure using a bump +allocator in the most obvious way naturally compacts the data. And +the copying phase of the GC can run concurrently with readers +without interference. + +BIND's qp-trie code uses a copying garbage collector only for its +interior nodes. The value objects that are attached to the leaves of +the trie are allocated by `isc_mem` and use reference counting like +the rest of BIND. + +See `compact()` in `lib/dns/qp.c` for the copying phase of the +garbage collector. 
Reference counting for value objects is handled +by the `attach()` and `detach()` qp-trie methods. + + +memory layout +------------- + +BIND's qp-trie code organizes its memory as a collection of "chunks", +each of which is a few pages in size and large enough to hold a few +thousand nodes. + +Most memory management is per-chunk: obtaining memory from the +system allocator and returning it; keeping track of which chunks are +in use by readers, and which chunks can be mutated; and counting +whether chunks are fragmented enough to need garbage collection. + +As noted above, we also use the chunk-based layout to reduce the size +of interior nodes. Instead of using a native pointer (typically 64 +bits) to refer to a node, we use a 32 bit integer containing the chunk +number and the position of the node in the chunk. This reduces the +memory used by interior nodes by 25%. + +In `lib/dns/qp_p.h`, the _"main qp-trie structures"_ hold information +about a trie's chunks. Most of the chunk handling code is in the +_"allocator"_ and _"chunk reclamation"_ sections in `lib/dns/qp.c`. + + +lifecycle of value objects +-------------------------- + +A leaf node contains a pointer to a value object that is not managed +by the qp-trie garbage collector. Instead, the user provides +`attach` and `detach` methods that the qp-trie code calls to update +the reference counts in the value objects. + +Value object reference counts do not indicate whether the object is +mutable: its refcount can be 1 while it is only in use by readers +(and must be left unchanged), or newly created by a writer (and +therefore mutable). + +So, callers must keep track themselves whether leaf objects are newly +inserted (and therefore mutable) or not. XXXFANF this might change, by +adding special lookup functions that return whether leaf objects are +mutable - see the "todo" in `include/dns/qp.h`. + + +locking and RCU +--------------- + +The Linux kernel has a collection of copy-on-write schemes collectively +called read-copy-update; there is also https://liburcu.org/ for RCU in +userspace. RCU is attractively speedy: readers can proceed without +blocking at all; writers can proceed concurrently with readers, and +updates can be committed without blocking. A commit is just a single +atomic pointer update. RCU only requires writers to block when waiting +for a "grace period" while older readers complete their critical +sections, after which the writer can free memory that is no longer in +use. Writers must also block on a mutex to ensure there is only one +writer at a time. + +The qp-trie concurrency strategy is designed to be able to use RCU, but +RCU is not required. Instead of RCU we can use a reader-writer lock. +This requires readers to block when a writer commits, which (in RCU +style) just requires an atomic pointer swap. The rwlock also changes +when writers must block: commits must wait for readers to exit their +critical sections, but there is no further waiting to be able to release +memory. + +In BIND, there are two kinds of reader: queries, which are relatiely +quick, and zone transfers, which are relatively slow. BIND's dbversion +machinery allows updates to proceed while there are long-running zone +transfers. RCU supports this without further machinery, but a +reader-writer lock needs some help so that long-running readers can +avoid blocking writers. + +To avoid blocking updates, long-running readers can take a snapshot of a +qp-trie, which only requires copying the allocator's chunk array. 
After +a writer commits, it does not releases memory if there are any +snapshots. Instead, chunks that are no longer needed by the latest +version of the trie are stashed on a list to be released later, +analogous to RCU waiting for a grace period. + +The locking occurs only in the functions under _"read-write +transactions"_ and _"read-only transactions"_ in `lib/dns/qp.c`. + + +immutability and copy-on-write +------------------------------ + +A qp-trie has a `generation` counter which is incremented by each +write transaction. We keep track of which generation each chunk was +created in; only chunks created in the current generation are +mutable, because older chunks may be in use by concurrent readers. + +This logic is implemented by `chunk_alloc()` and `chunk_mutable()` +in `lib/dns/qp.c`. + +The `make_twigs_mutable()` function ensures that a node is mutable, +copying it if necessary. + +The chunk arrays are a mixture of mutable and immutable. Pointers to +immutable chunks are immutable; new chunks can be assigned to unused +entries; and entries are cleared when it is safe to reclaim the chunks +they refer to. If the chunk arrays need to be expanded, the existing +arrays are retained for use by readers, and the writer uses the +expanded arrays (see `alloc_slow()`). The old arrays are cleaned up +after the writer commits. + + +update transactions +------------------- + +A typical heavy-weight `update` transaction comprises: + + * make a copy of the chunk arrays in case we need to roll back + + * get a freshly allocated chunk where new nodes or copied nodes + can be written + + * make any changes that are required; nodes in old chunks are + copied to the new space first; new nodes are modified in place + to avoid creating unnecessary garbage + + * when the updates are finished, and before committing, run the + garbage collector to clear out chunks that were fragmented by the + update + + * shrink the allocation chunk to eliminate unused space + + * commit the update by flipping the root pointer of the trie; this + is the only point that needs a multithreading interlock + + * free any chunks that were emptied by the garbage collector + +A lightweight `write` transaction is similar, except that: + + * rollback is not supported + + * any existing allocation chunk is reused if possible + + * the gabage collector is not run before committing + + * the allocation chunk is not shrunk + + +testing strategies +------------------ + +The main qp-trie test is in `tests/dns/qpmulti_test.c`. This uses +randomized testing of the transactional API, with a lot of consistency +checking to detect bugs. + +There are also a couple of fuzzers, which aim to benefit from +coverage-guided exploration of the test space and test minimization. +In `fuzz/dns_qp.c` we treat the fuzzer input as a bytecode to exercise +the single-threaded API, and `fuzz/dns_qpkey_name.c` checks conversion +from DNS names to lookup keys. + +In `tests/bench` there are a few benchmarks. `load-names` does a very +basic comparison between BIND's hash table, red-black tree, and +qp-trie. `qpmulti` checks multicore performance of the transactional +API (similar to `qpmulti_test` but without the consistency checking). +And `qp-dump` is a utility for printing out the contents of a qp-trie. + +John Regehr has some nice essays about testing data structures: + + * Levels of fuzzing: https://blog.regehr.org/archives/1039 + + (how much semantic knowledge does your fuzzer have?) 
+ + * Testing with small capacities: https://blog.regehr.org/archives/1138 + + (I need to be able to change the chunk size) + + * Write fuzzable code: https://blog.regehr.org/archives/1687 + + * Oracles for random testing: https://blog.regehr.org/archives/856 + + +warning: generational collection +-------------------------------- + +The "generational hypothesis" is that most allocations have a short +lifetime, so it is profitable for a garbage collector to split its +heap into a number of generations. The youngest generation is where +allocations happen; it typically uses a bump allocator, and when the +allocation pointer reaches its limit, the youngest generation's +contents are copied to the second generation. The hypothesis is that +only a small fraction of the youngest generation will still be live +when the GC runs, so this copy will not take much time or space. + +For a qp-trie the truth of this hypothesis depends on the order in +which keys are added or removed. It may be true if there is good +locality, for example, adding keys in lexicographic order, but not in +general. + +When a qp-trie is mutated, only one node needs to be altered, near the +leaf that is added or removed. Nodes near the root of the trie tend to +be more stable and long-lived. However, during a copy-on-write +transaction, the path from the root to an altered leaf must be copied, +so nodes near the root are no longer stable and long-lived. They may +become stable in a long transaction, but that isn't guaranteed. + +So the idea of generational garbage collection seems to be unhelpful +for a qp-trie. diff --git a/lib/dns/Makefile.am b/lib/dns/Makefile.am index 2539ad5517..aa88ad1962 100644 --- a/lib/dns/Makefile.am +++ b/lib/dns/Makefile.am @@ -99,6 +99,7 @@ libdns_la_HEADERS = \ include/dns/order.h \ include/dns/peer.h \ include/dns/private.h \ + include/dns/qp.h \ include/dns/rbt.h \ include/dns/rcode.h \ include/dns/rdata.h \ @@ -157,6 +158,7 @@ libdns_la_SOURCES = \ cache.c \ callbacks.c \ catz.c \ + client.c \ clientinfo.c \ compress.c \ db.c \ @@ -206,6 +208,8 @@ libdns_la_SOURCES = \ order.c \ peer.c \ private.c \ + qp.c \ + qp_p.h \ rbt.c \ rbtdb.h \ rbtdb.c \ @@ -233,18 +237,17 @@ libdns_la_SOURCES = \ transport.c \ tkey.c \ tsig.c \ + tsig_p.h \ ttl.c \ update.c \ validator.c \ view.c \ xfrin.c \ zone.c \ + zone_p.h \ zoneverify.c \ zonekey.c \ - zt.c \ - client.c \ - tsig_p.h \ - zone_p.h + zt.c if HAVE_GSSAPI libdns_la_SOURCES += \ diff --git a/lib/dns/include/dns/log.h b/lib/dns/include/dns/log.h index 03b97ac77d..0b2f8eb508 100644 --- a/lib/dns/include/dns/log.h +++ b/lib/dns/include/dns/log.h @@ -80,6 +80,7 @@ extern isc_logmodule_t dns_modules[]; #define DNS_LOGMODULE_DYNDB (&dns_modules[30]) #define DNS_LOGMODULE_DNSTAP (&dns_modules[31]) #define DNS_LOGMODULE_SSU (&dns_modules[32]) +#define DNS_LOGMODULE_QP (&dns_modules[33]) ISC_LANG_BEGINDECLS diff --git a/lib/dns/include/dns/qp.h b/lib/dns/include/dns/qp.h new file mode 100644 index 0000000000..118f22d677 --- /dev/null +++ b/lib/dns/include/dns/qp.h @@ -0,0 +1,574 @@ +/* + * Copyright (C) Internet Systems Consortium, Inc. ("ISC") + * + * SPDX-License-Identifier: MPL-2.0 + * + * This Source Code Form is subject to the terms of the Mozilla Public + * License, v. 2.0. If a copy of the MPL was not distributed with this + * file, you can obtain one at https://mozilla.org/MPL/2.0/. + * + * See the COPYRIGHT file distributed with this work for additional + * information regarding copyright ownership. 
+ */ + +#pragma once + +/* + * A qp-trie is a kind of key -> value map, supporting lookups that are + * aware of the lexicographic order of keys. + * + * Keys are `dns_qpkey_t`, which is a string-like thing, usually created + * from a DNS name. You can use both relative and absolute DNS names as + * keys. + * + * Leaf values are a pair of a `void *` pointer and a `uint32_t` + * (because that is what fits inside an internal qp-trie leaf node). + * + * The trie does not store keys; instead keys are derived from leaf values + * by calling a method provided by the user. + * + * There are a few flavours of qp-trie. + * + * The basic `dns_qp_t` supports single-threaded read/write access. + * + * A `dns_qpmulti_t` is a wrapper that supports multithreaded access. + * There can be many concurrent readers and a single writer. Writes are + * transactional, and support multi-version concurrency. + * + * The concurrency strategy uses copy-on-write. When making changes during + * a transaction, the caller must not modify leaf values in place, but + * instead delete the old leaf from the trie and insert a replacement. Leaf + * values have reference counts, which will indicate when the old leaf + * value can be freed after it is no longer needed by readers using an old + * version of the trie. + * + * For fast concurrent reads, call `dns_qpmulti_query()` to get a + * `dns_qpread_t`. Readers can access a single version of the trie between + * write commits. Most write activity is not blocked by readers, but reads + * must finish before a write can commit (a read-write lock blocks + * commits). + * + * For long-running reads that need a stable view of the trie, while still + * allow commits to proceed, call `dns_qpmulti_snapshot()` to get a + * `dns_qpsnap_t`. It briefly gets the write mutex while creating the + * snapshot, which requires allocating a copy of some of the trie's + * metadata. A snapshot is for relatively heavy long-running read-only + * operations such as zone transfers. + * + * While snapshots exist, a qp-trie cannot reclaim memory: it does not + * retain detailed information about which memory is used by which + * snapshots, so it pessimistically retains all memory that might be + * used by old versions of the trie. + * + * You can start one read-write transaction at a time using + * `dns_qpmulti_write()` or `dns_qpmulti_update()`. Either way, you + * get a `dns_qp_t` that can be modified like a single-threaded trie, + * without affecting other read-only query or snapshot users of the + * `dns_qpmulti_t`. Committing a transaction only blocks readers + * briefly when flipping the active readonly `dns_qp_t` pointer. + * + * "Update" transactions are heavyweight. They allocate working memory to + * hold modifications to the trie, and compact the trie before committing. + * For extra space savings, a partially-used allocation chunk is shrunk to + * the smallest size possible. Unlike "write" transactions, an "update" + * transaction can be rolled back instead of committed. (Update + * transactions are intended for things like authoritative zones, where it + * is important to keep the per-trie memory overhead low because there can + * be a very large number of them.) + * + * "Write" transactions are more lightweight: they skip the allocation and + * compaction at the start and end of the transaction. (Write transactions + * are intended for frequent small changes, as in the DNS cache.) 
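+ *
+ * As a rough sketch of the intended call pattern (error handling and
+ * the caller's own `multi`, `name`, and value object are assumed, not
+ * provided here), a writer might do:
+ *
+ *	dns_qp_t *qp = NULL;
+ *	dns_qpmulti_write(multi, &qp);
+ *	result = dns_qp_insert(qp, value, 0);
+ *	dns_qpmulti_commit(multi, &qp);
+ *
+ * while a concurrent reader does:
+ *
+ *	dns_qpread_t *qpr = NULL;
+ *	dns_qpmulti_query(multi, &qpr);
+ *	result = dns_qp_getname(qpr, name, &pval, &ival);
+ *	dns_qpread_destroy(multi, &qpr);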
+ */ + +/*********************************************************************** + * + * types + */ + +#include + +#include + +/*% + * A `dns_qp_t` supports single-threaded read/write access. + */ +typedef struct dns_qp dns_qp_t; + +/*% + * A `dns_qpmulti_t` supports multi-version concurrent reads and transactional + * modification. + */ +typedef struct dns_qpmulti dns_qpmulti_t; + +/*% + * A `dns_qpread_t` is a lightweight read-only handle on a `dns_qpmulti_t`. + */ +typedef struct dns_qpread dns_qpread_t; + +/*% + * A `dns_qpsnap_t` is a heavier read-only snapshot of a `dns_qpmulti_t`. + */ +typedef struct dns_qpsnap dns_qpsnap_t; + +/* + * The read-only qp-trie functions can work on either of the read-only + * qp-trie types or the general-purpose read-write `dns_qp_t`. They + * relies on the fact that all the `dns_qpreadable_t` structures start + * with a `dns_qpread_t`. + */ +typedef union dns_qpreadable { + dns_qpread_t *qpr; + dns_qpsnap_t *qps; + dns_qp_t *qpt; +} dns_qpreadable_t __attribute__((__transparent_union__)); + +#define dns_qpreadable_cast(qp) ((qp).qpr) + +/*% + * A trie lookup key is a small array, allocated on the stack during trie + * searches. Keys are usually created on demand from DNS names using + * `dns_qpkey_fromname()`, but in principle you can define your own + * functions to convert other types to trie lookup keys. + * + * A domain name can be up to 255 bytes. When converted to a key, each + * character in the name corresponds to one byte in the key if it is a + * common hostname character; otherwise unusual characters are escaped, + * using two bytes in the key. So we allow keys to be up to 512 bytes. + * (The actual max is (255 - 5) * 2 + 6 == 506) + * + * Every byte of a key must be greater than 0 and less than 48. Elements + * after the end of the key are treated as having the value 1. + */ +typedef uint8_t dns_qpkey_t[512]; + +/*% + * These leaf methods allow the qp-trie code to call back to the code + * responsible for the leaf values that are stored in the trie. The + * methods are provided for a whole trie when the trie is created. + * + * The qp-trie is also given a context pointer that is passed to the + * methods, so the methods know about the trie's context as well as a + * particular leaf value. + * + * The `attach` and `detach` methods adjust reference counts on value + * objects. They support copy-on-write and safe memory reclamation + * needed for multi-version concurrency. + * + * Note: When a value object reference count is greater than one, the + * object is in use by concurrent readers so it must not be modified. A + * refcount equal to one does not indicate whether or not the object is + * mutable: its refcount can be 1 while it is only in use by readers (and + * must be left unchanged), or newly created by a writer (and therefore + * mutable). + * + * The `makekey` method fills in a `dns_qpkey_t` corresponding to a + * value object stored in the qp-trie. It returns the length of the + * key. This method will typically call dns_qpkey_fromname() with a + * name stored in the value object. + * + * For logging and tracing, the `triename` method copies a human- + * readable identifier into `buf` which has max length `size`. 
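+ *
+ * As an illustrative sketch (the leaf type and function name here are
+ * hypothetical), a `makekey` method for value objects that contain a
+ * `dns_name_t` might look like:
+ *
+ *	static size_t
+ *	my_makekey(dns_qpkey_t key, void *ctx, void *pval, uint32_t ival) {
+ *		my_value_t *value = pval; /* hypothetical leaf type */
+ *		UNUSED(ctx);
+ *		UNUSED(ival);
+ *		return (dns_qpkey_fromname(key, &value->name));
+ *	}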
+ */ +typedef struct dns_qpmethods { + void (*attach)(void *ctx, void *pval, uint32_t ival); + void (*detach)(void *ctx, void *pval, uint32_t ival); + size_t (*makekey)(dns_qpkey_t key, void *ctx, void *pval, + uint32_t ival); + void (*triename)(void *ctx, char *buf, size_t size); +} dns_qpmethods_t; + +/*% + * Buffers for use by the `triename()` method need to be large enough + * to hold a zone name and a few descriptive words. + */ +#define DNS_QP_TRIENAME_MAX 300 + +/*% + * A container for the counters returned by `dns_qp_memusage()` + */ +typedef struct dns_qp_memusage { + void *ctx; /*%< qp-trie method context */ + size_t leaves; /*%< values in the trie */ + size_t live; /*%< nodes in use */ + size_t used; /*%< allocated nodes */ + size_t hold; /*%< nodes retained for readers */ + size_t free; /*%< nodes to be reclaimed */ + size_t node_size; /*%< in bytes */ + size_t chunk_size; /*%< nodes per chunk */ + size_t chunk_count; /*%< allocated chunks */ + size_t bytes; /*%< total memory in chunks and metadata */ +} dns_qp_memusage_t; + +/*********************************************************************** + * + * functions - create, destory, enquire + */ + +void +dns_qp_create(isc_mem_t *mctx, const dns_qpmethods_t *methods, void *ctx, + dns_qp_t **qptp); +/*%< + * Create a single-threaded qp-trie. + * + * Requires: + * \li `mctx` is a pointer to a valid memory context. + * \li all the methods are non-NULL + * \li `qptp != NULL && *qptp == NULL` + * + * Ensures: + * \li `*qptp` is a pointer to a valid single-threaded qp-trie + */ + +void +dns_qp_destroy(dns_qp_t **qptp); +/*%< + * Destroy a single-threaded qp-trie. + * + * Requires: + * \li `qptp != NULL` + * \li `*qptp` is a pointer to a valid single-threaded qp-trie + * + * Ensures: + * \li all memory allocated by the qp-trie has been released + * \li `*qptp` is NULL + */ + +void +dns_qpmulti_create(isc_mem_t *mctx, const dns_qpmethods_t *methods, void *ctx, + dns_qpmulti_t **qpmp); +/*%< + * Create a multi-threaded qp-trie. + * + * Requires: + * \li `mctx` is a pointer to a valid memory context. + * \li all the methods are non-NULL + * \li `qpmp != NULL && *qpmp == NULL` + * + * Ensures: + * \li `*qpmp` is a pointer to a valid multi-threaded qp-trie + */ + +void +dns_qpmulti_destroy(dns_qpmulti_t **qpmp); +/*%< + * Destroy a multi-threaded qp-trie. + * + * Requires: + * \li `qptp != NULL` + * \li `*qptp` is a pointer to a valid multi-threaded qp-trie + * \li there are no write or update transactions in progress + * \li no snapshots exist + * + * Ensures: + * \li all memory allocated by the qp-trie has been released + * \li `*qpmp` is NULL + */ + +void +dns_qp_compact(dns_qp_t *qp); +/*%< + * Defragment the entire qp-trie and release unused memory. + * + * When modifications make a trie too fragmented, it is automatically + * compacted. Automatic compaction avoids compacting chunks that are not + * fragmented to save time, but this function compacts the entire trie to + * defragment it as much as possible. + * + * This function can be used with a single-threaded qp-trie and during a + * transaction on a multi-threaded trie. + * + * Requires: + * \li `qp` is a pointer to a valid qp-trie + */ + +void +dns_qp_gctime(uint64_t *compact_us, uint64_t *recover_us, + uint64_t *rollback_us); +/*%< + * Get the total times spent on garbage collection in microseconds. + * + * These counters are global, covering every qp-trie in the program. + * + * XXXFANF This is a placeholder until we can record times in histograms. 
+ */ + +dns_qp_memusage_t +dns_qp_memusage(dns_qp_t *qp); +/*%< + * Get the memory counters from a qp-trie + * + * Requires: + * \li `qp` is a pointer to a valid qp-trie + * + * Returns: + * \li a `dns_qp_memusage_t` structure described above + */ + +/*********************************************************************** + * + * functions - search, modify + */ + +/* + * XXXFANF todo, based on what we discover BIND needs + * + * fancy searches: longest match, lexicographic predecessor, + * etc. + * + * do we need specific lookup functions to find out if the + * returned value is readonly or mutable? + * + * richer modification such as dns_qp_replace{key,name} + * + * iteration - probably best to put an explicit stack in the iterator, + * cf. rbtnodechain + */ + +size_t +dns_qpkey_fromname(dns_qpkey_t key, const dns_name_t *name); +/*%< + * Convert a DNS name into a trie lookup key. + * + * Requires: + * \li `name` is a pointer to a valid `dns_name_t` + * + * Returns: + * \li the length of the key + */ + +isc_result_t +dns_qp_getkey(dns_qpreadable_t qpr, const dns_qpkey_t searchk, size_t searchl, + void **pval_r, uint32_t *ival_r); +/*%< + * Find a leaf in a qp-trie that matches the given key + * + * The leaf values are assigned to `*pval_r` and `*ival_r` + * + * Requires: + * \li `qpr` is a pointer to a readable qp-trie + * \li `pval_r != NULL` + * \li `ival_r != NULL` + * + * Returns: + * \li ISC_R_NOTFOUND if the trie has no leaf with a matching key + * \li ISC_R_SUCCESS if the leaf was found + */ + +isc_result_t +dns_qp_getname(dns_qpreadable_t qpr, const dns_name_t *name, void **pval_r, + uint32_t *ival_r); +/*%< + * Find a leaf in a qp-trie that matches the given DNS name + * + * The leaf values are assigned to `*pval_r` and `*ival_r` + * + * Requires: + * \li `qpr` is a pointer to a readable qp-trie + * \li `name` is a pointer to a valid `dns_name_t` + * \li `pval_r != NULL` + * \li `ival_r != NULL` + * + * Returns: + * \li ISC_R_NOTFOUND if the trie has no leaf with a matching key + * \li ISC_R_SUCCESS if the leaf was found + */ + +isc_result_t +dns_qp_insert(dns_qp_t *qp, void *pval, uint32_t ival); +/*%< + * Insert a leaf into a qp-trie + * + * Requires: + * \li `qp` is a pointer to a valid qp-trie + * \li `pval != NULL` + * \li `alignof(pval) > 1` + * + * Returns: + * \li ISC_R_EXISTS if the trie already has a leaf with the same key + * \li ISC_R_SUCCESS if the leaf was added to the trie + */ + +isc_result_t +dns_qp_deletekey(dns_qp_t *qp, const dns_qpkey_t key, size_t len); +/*%< + * Delete a leaf from a qp-trie that matches the given key + * + * Requires: + * \li `qp` is a pointer to a valid qp-trie + * + * Returns: + * \li ISC_R_NOTFOUND if the trie has no leaf with a matching key + * \li ISC_R_SUCCESS if the leaf was deleted from the trie + */ + +isc_result_t +dns_qp_deletename(dns_qp_t *qp, const dns_name_t *name); +/*%< + * Delete a leaf from a qp-trie that matches the given DNS name + * + * Requires: + * \li `qp` is a pointer to a valid qp-trie + * \li `name` is a pointer to a valid qp-trie + * + * Returns: + * \li ISC_R_NOTFOUND if the trie has no leaf with a matching name + * \li ISC_R_SUCCESS if the leaf was deleted from the trie + */ + +/*********************************************************************** + * + * functions - transactions + */ + +void +dns_qpmulti_query(dns_qpmulti_t *multi, dns_qpread_t **qprp); +/*%< + * Start a lightweight (brief) read-only transaction + * + * This takes a read lock on `multi`s rwlock that prevents + * transactions from 
committing. + * + * Requires: + * \li `multi` is a pointer to a valid multi-threaded qp-trie + * \li `qprp != NULL` + * \li `*qprp == NULL` + * + * Returns: + * \li `*qprp` is a pointer to a valid read-only qp-trie handle + */ + +void +dns_qpread_destroy(dns_qpmulti_t *multi, dns_qpread_t **qprp); +/*%< + * End a lightweight read transaction, i.e. release read lock + * + * Requires: + * \li `multi` is a pointer to a valid multi-threaded qp-trie + * \li `qprp != NULL` + * \li `*qprp` is a read-only qp-trie handle obtained from `multi` + * + * Returns: + * \li `*qprp == NULL` + */ + +void +dns_qpmulti_snapshot(dns_qpmulti_t *multi, dns_qpsnap_t **qpsp); +/*%< + * Start a heavyweight (long) read-only transaction + * + * This function briefly takes and releases the modification mutex + * while allocating a copy of the trie's metadata. While the snapshot + * exists it does not interfere with other read-only or read-write + * transactions on the trie, except that memory cannot be reclaimed. + * + * Requires: + * \li `multi` is a pointer to a valid multi-threaded qp-trie + * \li `qpsp != NULL` + * \li `*qpsp == NULL` + * + * Returns: + * \li `*qpsp` is a pointer to a snapshot obtained from `multi` + */ + +void +dns_qpsnap_destroy(dns_qpmulti_t *multi, dns_qpsnap_t **qpsp); +/*%< + * End a heavyweight read transaction + * + * If this is the last remaining snapshot belonging to `multi` then + * this function takes the modification mutex in order to free() any + * memory that is no longer in use. + * + * Requires: + * \li `multi` is a pointer to a valid multi-threaded qp-trie + * \li `qpsp != NULL` + * \li `*qpsp` is a pointer to a snapshot obtained from `multi` + * + * Returns: + * \li `*qpsp == NULL` + */ + +void +dns_qpmulti_update(dns_qpmulti_t *multi, dns_qp_t **qptp); +/*%< + * Start a heavyweight write transaction + * + * This style of transaction allocates a copy of the trie's metadata to + * support rollback, and it aims to minimize the memory usage of the + * trie between transactions. The trie is compacted when the transaction + * commits, and any partly-used chunk is shrunk to fit. + * + * During the transaction, the modification mutex is held. + * + * Requires: + * \li `multi` is a pointer to a valid multi-threaded qp-trie + * \li `qptp != NULL` + * \li `*qptp == NULL` + * + * Returns: + * \li `*qptp` is a pointer to the modifiable qp-trie inside `multi` + */ + +void +dns_qpmulti_write(dns_qpmulti_t *multi, dns_qp_t **qptp); +/*%< + * Start a lightweight write transaction + * + * This style of transaction does not need extra allocations in addition + * to the ones required by insert and delete operations. It is intended + * for a large trie that gets frequent small writes, such as a DNS + * cache. + * + * During the transaction, the modification mutex is held. + * + * Requires: + * \li `multi` is a pointer to a valid multi-threaded qp-trie + * \li `qptp != NULL` + * \li `*qptp == NULL` + * + * Returns: + * \li `*qptp` is a pointer to the modifiable qp-trie inside `multi` + */ + +void +dns_qpmulti_commit(dns_qpmulti_t *multi, dns_qp_t **qptp); +/*%< + * Complete a modification transaction + * + * The commit itself only requires flipping the read pointer inside + * `multi` from the old version of the trie to the new version. This + * function takes a write lock on `multi`s rwlock just long enough to + * flip the pointer. This briefly blocks `query` readers. + * + * This function releases the modification mutex after the post-commit + * memory reclamation is completed. 
+ * + * Requires: + * \li `multi` is a pointer to a valid multi-threaded qp-trie + * \li `qptp != NULL` + * \li `*qptp` is a pointer to the modifiable qp-trie inside `multi` + * + * Returns: + * \li `*qptp == NULL` + */ + +void +dns_qpmulti_rollback(dns_qpmulti_t *multi, dns_qp_t **qptp); +/*%< + * Abandon an update transaction + * + * This function reclaims the memory allocated during the transaction + * and releases the modification mutex. + * + * Requires: + * \li `multi` is a pointer to a valid multi-threaded qp-trie + * \li `qptp != NULL` + * \li `*qptp` is a pointer to the modifiable qp-trie inside `multi` + * \li `*qptp` was obtained from `dns_qpmulti_update()` + * + * Returns: + * \li `*qptp == NULL` + */ + +/**********************************************************************/ diff --git a/lib/dns/log.c b/lib/dns/log.c index a8bf01d04b..6900a47374 100644 --- a/lib/dns/log.c +++ b/lib/dns/log.c @@ -36,23 +36,18 @@ isc_logcategory_t dns_categories[] = { * \#define to . */ isc_logmodule_t dns_modules[] = { - { "dns/db", 0 }, { "dns/rbtdb", 0 }, - { "dns/rbt", 0 }, { "dns/rdata", 0 }, - { "dns/master", 0 }, { "dns/message", 0 }, - { "dns/cache", 0 }, { "dns/config", 0 }, - { "dns/resolver", 0 }, { "dns/zone", 0 }, - { "dns/journal", 0 }, { "dns/adb", 0 }, - { "dns/xfrin", 0 }, { "dns/xfrout", 0 }, - { "dns/acl", 0 }, { "dns/validator", 0 }, - { "dns/dispatch", 0 }, { "dns/request", 0 }, - { "dns/masterdump", 0 }, { "dns/tsig", 0 }, - { "dns/tkey", 0 }, { "dns/sdb", 0 }, - { "dns/diff", 0 }, { "dns/hints", 0 }, - { "dns/unused1", 0 }, { "dns/dlz", 0 }, - { "dns/dnssec", 0 }, { "dns/crypto", 0 }, - { "dns/packets", 0 }, { "dns/nta", 0 }, - { "dns/dyndb", 0 }, { "dns/dnstap", 0 }, - { "dns/ssu", 0 }, { NULL, 0 } + { "dns/db", 0 }, { "dns/rbtdb", 0 }, { "dns/rbt", 0 }, + { "dns/rdata", 0 }, { "dns/master", 0 }, { "dns/message", 0 }, + { "dns/cache", 0 }, { "dns/config", 0 }, { "dns/resolver", 0 }, + { "dns/zone", 0 }, { "dns/journal", 0 }, { "dns/adb", 0 }, + { "dns/xfrin", 0 }, { "dns/xfrout", 0 }, { "dns/acl", 0 }, + { "dns/validator", 0 }, { "dns/dispatch", 0 }, { "dns/request", 0 }, + { "dns/masterdump", 0 }, { "dns/tsig", 0 }, { "dns/tkey", 0 }, + { "dns/sdb", 0 }, { "dns/diff", 0 }, { "dns/hints", 0 }, + { "dns/unused1", 0 }, { "dns/dlz", 0 }, { "dns/dnssec", 0 }, + { "dns/crypto", 0 }, { "dns/packets", 0 }, { "dns/nta", 0 }, + { "dns/dyndb", 0 }, { "dns/dnstap", 0 }, { "dns/ssu", 0 }, + { "dns/qp", 0 }, { NULL, 0 }, }; isc_log_t *dns_lctx = NULL; diff --git a/lib/dns/qp.c b/lib/dns/qp.c new file mode 100644 index 0000000000..890539ab2b --- /dev/null +++ b/lib/dns/qp.c @@ -0,0 +1,1571 @@ +/* + * Copyright (C) Internet Systems Consortium, Inc. ("ISC") + * + * SPDX-License-Identifier: MPL-2.0 + * + * This Source Code Form is subject to the terms of the Mozilla Public + * License, v. 2.0. If a copy of the MPL was not distributed with this + * file, you can obtain one at https://mozilla.org/MPL/2.0/. + * + * See the COPYRIGHT file distributed with this work for additional + * information regarding copyright ownership. 
+ */ + +/* + * For an overview, see doc/design/qp-trie.md + */ + +#include +#include +#include +#include +#include + +#if FUZZING_BUILD_MODE_UNSAFE_FOR_PRODUCTION +#include +#include +#endif + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include +#include + +#include "qp_p.h" + +/* + * very basic garbage collector statistics + * + * XXXFANF for now we're logging GC times, but ideally we should + * accumulate stats more quietly and report via the statschannel + */ +static atomic_uint_fast64_t compact_time; +static atomic_uint_fast64_t recycle_time; +static atomic_uint_fast64_t rollback_time; + +#if 1 +#define QP_LOG_STATS(...) \ + isc_log_write(dns_lctx, DNS_LOGCATEGORY_DATABASE, DNS_LOGMODULE_QP, \ + ISC_LOG_DEBUG(1), __VA_ARGS__) +#else +#define QP_LOG_STATS(...) +#endif + +#define PRItime " %" PRIu64 " us " + +#if 0 +/* + * QP_TRACE is generally used in allocation-related functions so it doesn't + * trace very high-frequency ops + */ +#define QP_TRACE(fmt, ...) \ + if (isc_log_wouldlog(dns_lctx, ISC_LOG_DEBUG(7))) { \ + isc_log_write(dns_lctx, DNS_LOGCATEGORY_DATABASE, \ + DNS_LOGMODULE_QP, ISC_LOG_DEBUG(7), \ + "%s:%d:%s(qp %p ctx \"%s\" gen %u): " fmt, \ + __FILE__, __LINE__, __func__, qp, TRIENAME(qp), \ + qp->generation, ##__VA_ARGS__); \ + } else \ + do { \ + } while (0) +#else +#define QP_TRACE(...) +#endif + +/*********************************************************************** + * + * converting DNS names to trie keys + */ + +/* + * Number of distinct byte values, i.e. 256 + */ +#define BYTE_VALUES (UINT8_MAX + 1) + +/* + * Lookup table mapping bytes in DNS names to bit positions, used + * by dns_qpkey_fromname() to convert DNS names to qp-trie keys. + * + * Each element holds one or two bit positions, bit_one in the + * lower half and bit_two in the upper half. + * + * For common hostname characters, bit_two is zero (which cannot + * be a valid bit position). + * + * For others, bit_one is the escape bit, and bit_two is the + * position of the character within the escaped range. + */ +uint16_t dns_qp_bits_for_byte[BYTE_VALUES] = { 0 }; + +/* + * And the reverse, mapping bit positions to characters, so the tests + * can print diagnostics involving qp-trie keys. + * + * This table only handles the first bit in an escape sequence; we + * arrange that we can calculate the byte value for both bits by + * adding the the second bit to the first bit's byte value. + */ +uint8_t dns_qp_byte_for_bit[SHIFT_OFFSET] = { 0 }; + +/* + * Fill in the lookup tables at program startup. (It doesn't matter + * when this is initialized relative to other startup code.) + */ +static void +initialize_bits_for_byte(void) ISC_CONSTRUCTOR; + +/* + * The bit positions have to be between SHIFT_BITMAP and SHIFT_OFFSET. + * + * Each byte range in between common hostname characters has a different + * escape character, to preserve the correct lexical order. + * + * Escaped byte ranges mostly fit into the space available in the + * bitmap, except for those above 'z' (which is mostly bytes with the + * top bit set). So, when we reach the end of the bitmap we roll over + * to the next escape character. + * + * After filling the table we ensure that the bit positions for + * hostname characters and escape characters all fit. 
+ */ +static void +initialize_bits_for_byte(void) { + /* zero common character marker not a valid shift position */ + INSIST(0 < SHIFT_BITMAP); + /* first bit is common byte or escape byte */ + qp_shift_t bit_one = SHIFT_BITMAP; + /* second bit is position in escaped range */ + qp_shift_t bit_two = SHIFT_BITMAP; + bool escaping = true; + + for (unsigned int byte = 0; byte < BYTE_VALUES; byte++) { + if (qp_common_character(byte)) { + escaping = false; + bit_one++; + dns_qp_byte_for_bit[bit_one] = byte; + dns_qp_bits_for_byte[byte] = bit_one; + } else if ('A' <= byte && byte <= 'Z') { + /* map upper case to lower case */ + qp_shift_t after_esc = bit_one + 1; + qp_shift_t skip_punct = 'a' - '_'; + qp_shift_t letter = byte - 'A'; + qp_shift_t bit = after_esc + skip_punct + letter; + dns_qp_bits_for_byte[byte] = bit; + /* to simplify reverse conversion in the tests */ + bit_two++; + } else { + /* non-hostname characters need to be escaped */ + if (!escaping || bit_two >= SHIFT_OFFSET) { + escaping = true; + bit_one++; + dns_qp_byte_for_bit[bit_one] = byte; + bit_two = SHIFT_BITMAP; + } + dns_qp_bits_for_byte[byte] = bit_two << 8 | bit_one; + bit_two++; + } + } + ENSURE(bit_one < SHIFT_OFFSET); +} + +/* + * Convert a DNS name into a trie lookup key. + * + * Returns the length of the key. + * + * For performance we get our hands dirty in the guts of the name. + * + * We don't worry about the distinction between absolute and relative + * names. When the trie is only used with absolute names, the first byte + * of the key will always be SHIFT_NOBYTE and it will always be skipped + * when traversing the trie. So keeping the root label costs little, and + * it allows us to support tries of relative names too. In fact absolute + * and relative names can be mixed in the same trie without causing + * confusion, because the presence or absence of the initial + * SHIFT_NOBYTE in the key disambiguates them (exactly like a trailing + * dot in a zone file). + */ +size_t +dns_qpkey_fromname(dns_qpkey_t key, const dns_name_t *name) { + size_t len, label; + + REQUIRE(ISC_MAGIC_VALID(name, DNS_NAME_MAGIC)); + REQUIRE(name->offsets != NULL); + REQUIRE(name->labels > 0); + + len = 0; + label = name->labels; + while (label-- > 0) { + const uint8_t *ldata = name->ndata + name->offsets[label]; + size_t label_len = *ldata++; + while (label_len-- > 0) { + uint16_t bits = dns_qp_bits_for_byte[*ldata++]; + key[len++] = bits & 0xFF; /* bit_one */ + if ((bits >> 8) != 0) { /* escape? */ + key[len++] = bits >> 8; /* bit_two */ + } + } + /* label terminator */ + key[len++] = SHIFT_NOBYTE; + } + /* mark end with a double NOBYTE */ + key[len] = SHIFT_NOBYTE; + return (len); +} + +/* + * Sentinel value for equal keys + */ +#define QPKEY_EQUAL (~(size_t)0) + +/* + * Compare two keys and return the offset where they differ. + * + * This offset is used to work out where a trie search diverged: when one + * of the keys is in the trie and one is not, the common prefix (up to the + * offset) is the part of the unknown key that exists in the trie. This + * matters for adding new keys or finding neighbours of missing keys. + * + * When the keys are different lengths it is possible (but unwise) for + * the longer key to be the same as the shorter key but with superfluous + * trailing SHIFT_NOBYTE elements. This makes the keys equal for the + * purpose of traversing the trie. 
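+ * In that case qpkey_compare() treats the keys as identical and returns
+ * QPKEY_EQUAL, just as it does for keys that match exactly.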
+ */ +static size_t +qpkey_compare(const dns_qpkey_t key_a, const size_t keylen_a, + const dns_qpkey_t key_b, const size_t keylen_b) { + size_t keylen = ISC_MAX(keylen_a, keylen_b); + for (size_t offset = 0; offset < keylen; offset++) { + if (qpkey_bit(key_a, keylen_a, offset) != + qpkey_bit(key_b, keylen_b, offset)) + { + return (offset); + } + } + return (QPKEY_EQUAL); +} + +/*********************************************************************** + * + * allocator wrappers + */ + +#if FUZZING_BUILD_MODE_UNSAFE_FOR_PRODUCTION + +/* + * Optionally (for debugging) during a copy-on-write transaction, use + * memory protection to ensure that the shared chunks are not modified. + * Once a chunk becomes shared, it remains read-only until it is freed. + * POSIX says we have to use mmap() to get an allocation that we can + * definitely pass to mprotect(). + */ + +static size_t +chunk_size_raw(void) { + size_t size = (size_t)sysconf(_SC_PAGE_SIZE); + return (ISC_MAX(size, QP_CHUNK_BYTES)); +} + +static void * +chunk_get_raw(dns_qp_t *qp) { + if (qp->write_protect) { + size_t size = chunk_size_raw(); + void *ptr = mmap(NULL, size, PROT_READ | PROT_WRITE, + MAP_ANON | MAP_PRIVATE, -1, 0); + RUNTIME_CHECK(ptr != MAP_FAILED); + return (ptr); + } else { + return (isc_mem_allocate(qp->mctx, QP_CHUNK_BYTES)); + } +} + +static void +chunk_free_raw(dns_qp_t *qp, void *ptr) { + if (qp->write_protect) { + RUNTIME_CHECK(munmap(ptr, chunk_size_raw()) == 0); + } else { + isc_mem_free(qp->mctx, ptr); + } +} + +static void * +chunk_shrink_raw(dns_qp_t *qp, void *ptr, size_t bytes) { + if (qp->write_protect) { + return (ptr); + } else { + return (isc_mem_reallocate(qp->mctx, ptr, bytes)); + } +} + +static void +write_protect(dns_qp_t *qp, void *ptr, bool readonly) { + if (qp->write_protect) { + int prot = readonly ? PROT_READ : PROT_READ | PROT_WRITE; + size_t size = chunk_size_raw(); + RUNTIME_CHECK(mprotect(ptr, size, prot) >= 0); + } +} + +static void +write_protect_all(dns_qp_t *qp) { + for (qp_chunk_t chunk = 0; chunk < qp->chunk_max; chunk++) { + if (chunk != qp->bump && qp->base[chunk] != NULL) { + write_protect(qp, qp->base[chunk], true); + } + } +} + +#else + +#define chunk_get_raw(qp) isc_mem_allocate(qp->mctx, QP_CHUNK_BYTES) +#define chunk_free_raw(qp, ptr) isc_mem_free(qp->mctx, ptr) + +#define chunk_shrink_raw(qp, ptr, size) isc_mem_reallocate(qp->mctx, ptr, size) + +#define write_protect(qp, chunk, readonly) +#define write_protect_all(qp) + +#endif + +static void * +clone_array(isc_mem_t *mctx, void *oldp, size_t oldsz, size_t newsz, + size_t elemsz) { + uint8_t *newp = NULL; + + INSIST(oldsz <= newsz); + INSIST(newsz < UINT32_MAX); + INSIST(elemsz < UINT32_MAX); + INSIST(((uint64_t)newsz) * ((uint64_t)elemsz) <= UINT32_MAX); + + /* sometimes we clone an array before it has been populated */ + if (newsz > 0) { + oldsz *= elemsz; + newsz *= elemsz; + newp = isc_mem_allocate(mctx, newsz); + if (oldsz > 0) { + memmove(newp, oldp, oldsz); + } + memset(newp + oldsz, 0, newsz - oldsz); + } + return (newp); +} + +/*********************************************************************** + * + * allocator + */ + +/* + * How many cells are actually in use in a chunk? + */ +static inline qp_cell_t +chunk_usage(dns_qp_t *qp, qp_chunk_t chunk) { + return (qp->usage[chunk].used - qp->usage[chunk].free); +} + +/* + * Is this chunk wasting space? 
+ */ +static inline qp_cell_t +chunk_fragmented(dns_qp_t *qp, qp_chunk_t chunk) { + return (qp->usage[chunk].free > QP_MAX_FREE); +} + +/* + * We can mutate a chunk if it was allocated in the current generation. + * This might not be true for the `bump` chunk when it is reused. + */ +static inline bool +chunk_mutable(dns_qp_t *qp, qp_chunk_t chunk) { + return (qp->usage[chunk].generation == qp->generation); +} + +/* + * When we reuse the bump chunk across multiple write transactions, + * it can have an immutable prefix and a mutable suffix. + */ +static inline bool +twigs_mutable(dns_qp_t *qp, qp_ref_t ref) { + qp_chunk_t chunk = ref_chunk(ref); + qp_cell_t cell = ref_cell(ref); + if (chunk == qp->bump) { + return (cell >= qp->fender); + } else { + return (chunk_mutable(qp, chunk)); + } +} + +/* + * Create a fresh bump chunk and allocate some twigs from it. + */ +static qp_ref_t +chunk_alloc(dns_qp_t *qp, qp_chunk_t chunk, qp_weight_t size) { + REQUIRE(qp->base[chunk] == NULL); + REQUIRE(qp->usage[chunk].generation == 0); + REQUIRE(qp->usage[chunk].used == 0); + REQUIRE(qp->usage[chunk].free == 0); + + qp->base[chunk] = chunk_get_raw(qp); + qp->usage[chunk].generation = qp->generation; + qp->usage[chunk].used = size; + qp->usage[chunk].free = 0; + qp->used_count += size; + qp->bump = chunk; + qp->fender = 0; + + QP_TRACE("chunk %u gen %u base %p", chunk, qp->usage[chunk].generation, + qp->base[chunk]); + return (make_ref(chunk, 0)); +} + +static void +free_chunk_arrays(dns_qp_t *qp) { + QP_TRACE("base %p usage %p max %u", qp->base, qp->usage, qp->chunk_max); + /* + * They should both be null or both non-null; if they are out of sync, + * this will intentionally trigger an assert in `isc_mem_free()`. + */ + if (qp->base != NULL || qp->usage != NULL) { + isc_mem_free(qp->mctx, qp->base); + isc_mem_free(qp->mctx, qp->usage); + } +} + +/* + * This is used both to grow the arrays when they fill up, and to copy them at + * the start of an update transaction. We check if the old arrays are in use by + * readers, in which case we will do safe memory reclamation later. + */ +static void +clone_chunk_arrays(dns_qp_t *qp, qp_chunk_t newmax) { + qp_chunk_t oldmax; + void *base, *usage; + + oldmax = qp->chunk_max; + qp->chunk_max = newmax; + + base = clone_array(qp->mctx, qp->base, oldmax, newmax, + sizeof(*qp->base)); + usage = clone_array(qp->mctx, qp->usage, oldmax, newmax, + sizeof(*qp->usage)); + + if (qp->shared_arrays) { + qp->shared_arrays = false; + } else { + free_chunk_arrays(qp); + } + qp->base = base; + qp->usage = usage; + + QP_TRACE("base %p usage %p max %u", qp->base, qp->usage, qp->chunk_max); +} + +/* + * There was no space in the bump chunk, so find a place to put a fresh + * chunk in the chunk table, then allocate some twigs from it. + */ +static qp_ref_t +alloc_slow(dns_qp_t *qp, qp_weight_t size) { + qp_chunk_t chunk; + + for (chunk = 0; chunk < qp->chunk_max; chunk++) { + if (qp->base[chunk] == NULL) { + return (chunk_alloc(qp, chunk, size)); + } + } + ENSURE(chunk == qp->chunk_max); + clone_chunk_arrays(qp, GROWTH_FACTOR(chunk)); + return (chunk_alloc(qp, chunk, size)); +} + +/* + * Ensure we are using a fresh bump chunk. + */ +static void +alloc_reset(dns_qp_t *qp) { + (void)alloc_slow(qp, 0); +} + +/* + * Allocate some fresh twigs. This is the bump allocator fast path. 
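+ *
+ * For example, if the bump chunk has used 100 of its cells, a request
+ * for a 3-cell twigs vector just returns make_ref(bump, 100) and
+ * advances the chunk's `used` counter to 103; only a request that
+ * does not fit in the bump chunk takes the slow path and starts a
+ * fresh chunk.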
+ */ +static inline qp_ref_t +alloc_twigs(dns_qp_t *qp, qp_weight_t size) { + qp_chunk_t chunk = qp->bump; + qp_cell_t cell = qp->usage[chunk].used; + if (cell + size <= QP_CHUNK_SIZE) { + qp->usage[chunk].used += size; + qp->used_count += size; + return (make_ref(chunk, cell)); + } else { + return (alloc_slow(qp, size)); + } +} + +/* + * Record that some twigs are no longer being used, and if possible + * zero them to ensure that there isn't a spurious double detach when + * the chunk is later recycled. + * + * NOTE: the caller is responsible for attaching or detaching any + * leaves as required. + */ +static inline void +free_twigs(dns_qp_t *qp, qp_ref_t twigs, qp_weight_t size) { + qp_chunk_t chunk = ref_chunk(twigs); + + qp->free_count += size; + qp->usage[chunk].free += size; + ENSURE(qp->free_count <= qp->used_count); + ENSURE(qp->usage[chunk].free <= qp->usage[chunk].used); + + if (twigs_mutable(qp, twigs)) { + zero_twigs(ref_ptr(qp, twigs), size); + } else { + qp->hold_count += size; + ENSURE(qp->free_count >= qp->hold_count); + } +} + +/*********************************************************************** + * + * chunk reclamation + */ + +/* + * When a chunk is being recycled after a long-running read transaction, + * or after a rollback, we need to detach any leaves that remain. + */ +static void +chunk_free(dns_qp_t *qp, qp_chunk_t chunk) { + QP_TRACE("chunk %u gen %u base %p", chunk, qp->usage[chunk].generation, + qp->base[chunk]); + + qp_node_t *n = qp->base[chunk]; + write_protect(qp, n, false); + + for (qp_cell_t count = qp->usage[chunk].used; count > 0; count--, n++) { + if (!is_branch(n) && leaf_pval(n) != NULL) { + detach_leaf(qp, n); + } + } + chunk_free_raw(qp, qp->base[chunk]); + + INSIST(qp->used_count >= qp->usage[chunk].used); + INSIST(qp->free_count >= qp->usage[chunk].free); + qp->used_count -= qp->usage[chunk].used; + qp->free_count -= qp->usage[chunk].free; + qp->usage[chunk].used = 0; + qp->usage[chunk].free = 0; + qp->usage[chunk].generation = 0; + qp->base[chunk] = NULL; +} + +/* + * If we have any nodes on hold during a transaction, we must leave + * immutable chunks intact. As the last stage of safe memory reclamation, + * we can clear the hold counter and recycle all empty chunks (even from a + * nominally read-only `dns_qp_t`) because nothing refers to them any more. + * + * If we are using RCU, this can be called by `defer_rcu()` or `call_rcu()` + * to clean up after readers have left their critical sections. 
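+ *
+ * An empty chunk (none of its cells still in use) can be freed here
+ * when either it was allocated in the current generation, so it is
+ * not reachable from the read-only phase, or the hold count has
+ * dropped to zero because all readers are known to have gone.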
+ */ +static void +recycle(dns_qp_t *qp) { + isc_time_t t0, t1; + uint64_t time; + unsigned int live = 0; + unsigned int keep = 0; + unsigned int free = 0; + + QP_TRACE("expect to free %u cells -> %u chunks", + (qp->free_count - qp->hold_count), + (qp->free_count - qp->hold_count) / QP_CHUNK_SIZE); + + isc_time_now_hires(&t0); + + for (qp_chunk_t chunk = 0; chunk < qp->chunk_max; chunk++) { + if (qp->base[chunk] == NULL) { + continue; + } else if (chunk == qp->bump || chunk_usage(qp, chunk) > 0) { + live++; + } else if (chunk_mutable(qp, chunk) || qp->hold_count == 0) { + chunk_free(qp, chunk); + free++; + } else { + keep++; + } + } + + isc_time_now_hires(&t1); + time = isc_time_microdiff(&t1, &t0); + atomic_fetch_add_relaxed(&recycle_time, time); + + QP_LOG_STATS("qp recycle" PRItime "live %u keep %u free %u chunks", + time, live, keep, free); + QP_LOG_STATS("qp recycle after leaf %u live %u used %u free %u hold %u", + qp->leaf_count, qp->used_count - qp->free_count, + qp->used_count, qp->free_count, qp->hold_count); +} + +/*********************************************************************** + * + * garbage collector + */ + +/* + * Move a branch node's twigs to the `bump` chunk, for copy-on-write + * or for garbage collection. We don't update the node in place + * because `compact_recursive()` does not ensure the node itself is + * mutable until after it discovers evacuation was necessary. + */ +static qp_ref_t +evacuate_twigs(dns_qp_t *qp, qp_node_t *n) { + qp_weight_t size = branch_twigs_size(n); + qp_ref_t old_ref = branch_twigs_ref(n); + qp_ref_t new_ref = alloc_twigs(qp, size); + qp_node_t *old_twigs = ref_ptr(qp, old_ref); + qp_node_t *new_twigs = ref_ptr(qp, new_ref); + + move_twigs(new_twigs, old_twigs, size); + free_twigs(qp, old_ref, size); + + /* + * free_twigs() could not zero out the old twigs, + * so we have to re-attach to any leaves + */ + if (!twigs_mutable(qp, old_ref)) { + for (qp_weight_t pos = 0; pos < size; pos++) { + qp_node_t *twig = &new_twigs[pos]; + if (!is_branch(twig)) { + attach_leaf(qp, twig); + } + } + } + + return (new_ref); +} + +/* + * Evacuate the node's twigs and update the node in place. + */ +static void +evacuate(dns_qp_t *qp, qp_node_t *n) { + *n = make_node(branch_index(n), evacuate_twigs(qp, n)); +} + +/* + * Compact the trie by traversing the whole thing recursively, copying + * bottom-up as required. The aim is to avoid evacuation as much as + * possible, but when parts of the trie are shared, we need to evacuate + * the paths from the root to the parts of the trie that occupy + * fragmented chunks. + * + * Without the "should we evacuate?" check, the algorithm will leave + * the trie unchanged. If the twigs are all leaves, the loop changes + * nothing, so we will return this node's original ref. If all of the + * twigs that are branches did not need moving, again, the loop + * changes nothing. So the evacuation check is the only place that the + * algorithm introduces ref changes, that then bubble up through the + * logic inside the loop. + */ +static qp_ref_t +compact_recursive(dns_qp_t *qp, qp_node_t *parent) { + qp_ref_t ref = branch_twigs_ref(parent); + /* should we evacuate the twigs? 
*/ + if (chunk_fragmented(qp, ref_chunk(ref)) || qp->compact_all) { + ref = evacuate_twigs(qp, parent); + } + bool mutable = twigs_mutable(qp, ref); + qp_weight_t size = branch_twigs_size(parent); + for (qp_weight_t pos = 0; pos < size; pos++) { + qp_node_t *child = ref_ptr(qp, ref) + pos; + if (!is_branch(child)) { + continue; + } + qp_ref_t old_ref = branch_twigs_ref(child); + qp_ref_t new_ref = compact_recursive(qp, child); + if (old_ref == new_ref) { + continue; + } + if (!mutable) { + ref = evacuate_twigs(qp, parent); + /* the twigs have moved */ + child = ref_ptr(qp, ref) + pos; + mutable = true; + } + *child = make_node(branch_index(child), new_ref); + } + return (ref); +} + +static void +compact(dns_qp_t *qp) { + isc_time_t t0, t1; + uint64_t time; + + QP_LOG_STATS( + "qp compact before leaf %u live %u used %u free %u hold %u", + qp->leaf_count, qp->used_count - qp->free_count, qp->used_count, + qp->free_count, qp->hold_count); + + isc_time_now_hires(&t0); + + /* + * Reset the bump chunk if it is fragmented. + */ + if (chunk_fragmented(qp, qp->bump)) { + alloc_reset(qp); + } + + if (is_branch(&qp->root)) { + qp->root = make_node(branch_index(&qp->root), + compact_recursive(qp, &qp->root)); + } + qp->compact_all = false; + + isc_time_now_hires(&t1); + time = isc_time_microdiff(&t1, &t0); + atomic_fetch_add_relaxed(&compact_time, time); + + QP_LOG_STATS("qp compact" PRItime + "leaf %u live %u used %u free %u hold %u", + time, qp->leaf_count, qp->used_count - qp->free_count, + qp->used_count, qp->free_count, qp->hold_count); +} + +void +dns_qp_compact(dns_qp_t *qp) { + REQUIRE(VALID_QP(qp)); + qp->compact_all = true; + compact(qp); + recycle(qp); +} + +static void +auto_compact_recycle(dns_qp_t *qp) { + compact(qp); + recycle(qp); + /* + * This shouldn't happen if the garbage collector is + * working correctly. We can recover at the cost of some + * time and space, but recovery should be cheaper than + * letting compact+recycle fail repeatedly. + */ + if (QP_MAX_GARBAGE(qp)) { + isc_log_write(dns_lctx, DNS_LOGCATEGORY_DATABASE, + DNS_LOGMODULE_QP, ISC_LOG_NOTICE, + "qp %p ctx \"%s\" compact/recycle " + "failed to recover any space, " + "scheduling a full compaction", + qp, TRIENAME(qp)); + qp->compact_all = true; + } +} + +/* + * Free some twigs and compact the trie if necessary; the space + * accounting is similar to `evacuate_twigs()` above. + * + * This is called by the trie modification API entry points. The + * free_twigs() function requires the caller to attach or detach any + * leaves as necessary. Callers of squash_twigs() satisfy this + * requirement by calling cow_twigs(). + * + * Aside: In typical garbage collectors, compaction is triggered when + * the allocator runs out of space. But that is because typical garbage + * collectors do not know how much memory can be recovered, so they must + * find out by scanning the heap. The qp-trie code was originally + * designed to use malloc() and free(), so it has more information about + * when garbage collection might be worthwhile. Hence we can trigger + * collection when garbage passes a threshold. + * + * XXXFANF: If we need to avoid latency outliers caused by compaction in + * write transactions, we can check qp->transaction_mode here. + */ +static inline void +squash_twigs(dns_qp_t *qp, qp_ref_t twigs, qp_weight_t size) { + free_twigs(qp, twigs, size); + if (twigs_mutable(qp, twigs) && QP_MAX_GARBAGE(qp)) { + auto_compact_recycle(qp); + } +} + +/* + * Shared twigs need copy-on-write. 
As we walk down the trie finding + * the right place to modify, make_twigs_mutable() is called to ensure + * that shared nodes on the path from the root are copied to a mutable + * chunk. + */ +static inline void +make_twigs_mutable(dns_qp_t *qp, qp_node_t *n) { + if (!twigs_mutable(qp, branch_twigs_ref(n))) { + evacuate(qp, n); + } +} + +/*********************************************************************** + * + * public accessors for memory management internals + */ + +dns_qp_memusage_t +dns_qp_memusage(dns_qp_t *qp) { + REQUIRE(VALID_QP(qp)); + + dns_qp_memusage_t memusage = { + .ctx = qp->ctx, + .leaves = qp->leaf_count, + .live = qp->used_count - qp->free_count, + .used = qp->used_count, + .hold = qp->hold_count, + .free = qp->free_count, + .node_size = sizeof(qp_node_t), + .chunk_size = QP_CHUNK_SIZE, + }; + + for (qp_chunk_t chunk = 0; chunk < qp->chunk_max; chunk++) { + if (qp->base[chunk] != NULL) { + memusage.chunk_count += 1; + } + } + + /* slight over-estimate if chunks have been shrunk */ + memusage.bytes = memusage.chunk_count * QP_CHUNK_BYTES + + qp->chunk_max * sizeof(*qp->base) + + qp->chunk_max * sizeof(*qp->usage); + + return (memusage); +} + +void +dns_qp_gctime(uint64_t *compact_p, uint64_t *recycle_p, uint64_t *rollback_p) { + *compact_p = atomic_load_relaxed(&compact_time); + *recycle_p = atomic_load_relaxed(&recycle_time); + *rollback_p = atomic_load_relaxed(&rollback_time); +} + +/*********************************************************************** + * + * read-write transactions + */ + +static dns_qp_t * +transaction_open(dns_qpmulti_t *multi, dns_qp_t **qptp) { + dns_qp_t *qp, *old; + + REQUIRE(VALID_QPMULTI(multi)); + REQUIRE(qptp != NULL && *qptp == NULL); + + LOCK(&multi->mutex); + + old = multi->read; + qp = write_phase(multi); + + INSIST(VALID_QP(old)); + INSIST(!VALID_QP(qp)); + + /* + * prepare for copy-on-write + */ + *qp = *old; + qp->shared_arrays = true; + qp->hold_count = qp->free_count; + + /* + * Start a new generation, and ensure it isn't zero because we + * want to avoid confusion with unset qp->usage structures. + */ + if (++qp->generation == 0) { + ++qp->generation; + } + + *qptp = qp; + return (qp); +} + +/* + * a write is light + * + * We need to ensure we alloce from a fresh chunk if the last transaction + * shrunk the bump chunk; but usually in a sequence of write transactions + * we just mark the point where we started this generation. + * + * (Instead of keeping the previous transaction's mode, I considered + * forcing allocation into the slow path by fiddling with the bump + * chunk's usage counters. But that is troublesome because + * `chunk_free_now()` needs to know how much of the chunk to scan.) + */ +void +dns_qpmulti_write(dns_qpmulti_t *multi, dns_qp_t **qptp) { + dns_qp_t *qp = transaction_open(multi, qptp); + QP_TRACE(""); + + if (qp->transaction_mode == QP_UPDATE) { + alloc_reset(qp); + } else { + qp->fender = qp->usage[qp->bump].used; + } + + qp->transaction_mode = QP_WRITE; + write_protect_all(qp); +} + +/* + * an update is heavy + * + * Make sure we have copies of all usage counters so that we can rollback. + * Do this before allocating a bump chunk so that all chunks allocated in + * this transaction are in the fresh chunk arrays. (If the existing chunk + * arrays happen to be full we might immediately clone them a second time. + * Probably not worth worrying about?) 
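+ *
+ * A minimal usage sketch (hypothetical caller; declarations of
+ * `multi`, `pval`, `ival` and `result` are omitted):
+ *
+ *	dns_qp_t *qp = NULL;
+ *	dns_qpmulti_update(multi, &qp);
+ *	result = dns_qp_insert(qp, pval, ival);
+ *	dns_qpmulti_commit(multi, &qp);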
+ */ +void +dns_qpmulti_update(dns_qpmulti_t *multi, dns_qp_t **qptp) { + dns_qp_t *qp = transaction_open(multi, qptp); + QP_TRACE(""); + + clone_chunk_arrays(qp, qp->chunk_max); + alloc_reset(qp); + + qp->transaction_mode = QP_UPDATE; + write_protect_all(qp); +} + +void +dns_qpmulti_commit(dns_qpmulti_t *multi, dns_qp_t **qptp) { + dns_qp_t *qp, *old; + + REQUIRE(VALID_QPMULTI(multi)); + REQUIRE(qptp != NULL); + REQUIRE(*qptp == write_phase(multi)); + + old = multi->read; + qp = write_phase(multi); + + QP_TRACE(""); + + if (qp->transaction_mode == QP_UPDATE) { + qp_chunk_t c; + size_t bytes; + + compact(qp); + c = qp->bump; + bytes = qp->usage[c].used * sizeof(qp_node_t); + if (bytes == 0) { + chunk_free(qp, c); + } else { + qp->base[c] = chunk_shrink_raw(qp, qp->base[c], bytes); + } + } + +#if HAVE_LIBURCU + rcu_assign_pointer(multi->read, qp); + /* + * XXXFANF: At this point we need to wait for a grace period (to be + * sure readers have finished) before recovering memory. This is not + * very fast, hurting write throughput. To fix it we need read + * transactions to be able to survive multiple write transactions, so + * that it matters less if we are slow to detect when readers have + * exited their critical sections. Instead of the current read / snap + * distinction, we need to allocate a read snapshot when a + * transaction commits, and clean it up (along with the unused + * chunks) in an rcu callback. + */ + synchronize_rcu(); +#else + RWLOCK(&multi->rwlock, isc_rwlocktype_write); + multi->read = qp; + RWUNLOCK(&multi->rwlock, isc_rwlocktype_write); +#endif + + /* + * Were the chunk arrays reallocated at some point? + */ + if (qp->shared_arrays) { + INSIST(old->base == qp->base); + INSIST(old->usage == qp->usage); + /* this becomes correct when `*old` is invalidated */ + qp->shared_arrays = false; + } else { + INSIST(old->base != qp->base); + INSIST(old->usage != qp->usage); + free_chunk_arrays(old); + } + + /* + * It is safe to recycle all empty chunks if they aren't being + * used by snapshots. + */ + qp->hold_count = 0; + if (multi->snapshots == 0) { + recycle(qp); + } + + *old = (dns_qp_t){}; + *qptp = NULL; + UNLOCK(&multi->mutex); +} + +/* + * Throw away everything that was allocated during this transaction. 
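+ *
+ * Only heavy update transactions can be rolled back (see the REQUIRE
+ * below): an update keeps its own copies of the chunk arrays and
+ * usage counters and allocates from a fresh bump chunk, so everything
+ * allocated in this generation can simply be discarded.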
+ */ +void +dns_qpmulti_rollback(dns_qpmulti_t *multi, dns_qp_t **qptp) { + dns_qp_t *qp; + isc_time_t t0, t1; + uint64_t time; + unsigned int free = 0; + + REQUIRE(VALID_QPMULTI(multi)); + REQUIRE(qptp != NULL); + REQUIRE(*qptp == write_phase(multi)); + + qp = *qptp; + + REQUIRE(qp->transaction_mode == QP_UPDATE); + QP_TRACE(""); + + isc_time_now_hires(&t0); + + /* + * recycle any chunks allocated in this transaction, + * including the bump chunk, and detach value objects + */ + for (qp_chunk_t chunk = 0; chunk < qp->chunk_max; chunk++) { + if (qp->base[chunk] != NULL && chunk_mutable(qp, chunk)) { + chunk_free(qp, chunk); + free++; + } + } + + /* free the cloned arrays */ + INSIST(!qp->shared_arrays); + free_chunk_arrays(qp); + + isc_time_now_hires(&t1); + time = isc_time_microdiff(&t1, &t0); + atomic_fetch_add_relaxed(&rollback_time, time); + + QP_LOG_STATS("qp rollback" PRItime "free %u chunks", time, free); + + *qp = (dns_qp_t){}; + *qptp = NULL; + UNLOCK(&multi->mutex); +} + +/*********************************************************************** + * + * read-only transactions + */ + +/* + * a query is light + */ + +void +dns_qpmulti_query(dns_qpmulti_t *multi, dns_qpread_t **qprp) { + REQUIRE(VALID_QPMULTI(multi)); + REQUIRE(qprp != NULL && *qprp == NULL); + +#if HAVE_LIBURCU + rcu_read_lock(); + *qprp = (dns_qpread_t *)rcu_dereference(multi->read); +#else + RWLOCK(&multi->rwlock, isc_rwlocktype_read); + *qprp = (dns_qpread_t *)multi->read; +#endif +} + +void +dns_qpread_destroy(dns_qpmulti_t *multi, dns_qpread_t **qprp) { + REQUIRE(VALID_QPMULTI(multi)); + REQUIRE(qprp != NULL && *qprp != NULL); + + /* + * when we are using RCU, then multi->read can change during + * our critical section, so it can be different from *qprp + */ + dns_qp_t *qp = (dns_qp_t *)*qprp; + *qprp = NULL; + REQUIRE(qp == &multi->phase[0] || qp == &multi->phase[1]); + +#if HAVE_LIBURCU + rcu_read_unlock(); +#else + RWUNLOCK(&multi->rwlock, isc_rwlocktype_read); +#endif +} + +/* + * a snapshot is heavy + */ + +void +dns_qpmulti_snapshot(dns_qpmulti_t *multi, dns_qpsnap_t **qpsp) { + dns_qp_t *old; + dns_qpsnap_t *qp; + size_t array_size, alloc_size; + + REQUIRE(VALID_QPMULTI(multi)); + REQUIRE(qpsp != NULL && *qpsp == NULL); + + /* + * we need a consistent view of the chunk base array and chunk_max so + * we can't use the rwlock here (nor can we use dns_qpmulti_query) + */ + LOCK(&multi->mutex); + old = multi->read; + + array_size = sizeof(qp_node_t *) * old->chunk_max; + alloc_size = sizeof(dns_qpsnap_t) + array_size; + qp = isc_mem_allocate(old->mctx, alloc_size); + *qp = (dns_qpsnap_t){ + .magic = QP_MAGIC, + .root = old->root, + .methods = old->methods, + .ctx = old->ctx, + .generation = old->generation, + .base = qp->base_array, + .whence = multi, + }; + /* sometimes we take a snapshot of an empty trie */ + if (array_size > 0) { + memmove(qp->base, old->base, array_size); + } + + multi->snapshots++; + *qpsp = qp; + + QP_TRACE("multi %p snaps %u", multi, multi->snapshots); + UNLOCK(&multi->mutex); +} + +void +dns_qpsnap_destroy(dns_qpmulti_t *multi, dns_qpsnap_t **qpsp) { + dns_qpsnap_t *qp; + + REQUIRE(VALID_QPMULTI(multi)); + REQUIRE(qpsp != NULL && *qpsp != NULL); + + qp = *qpsp; + *qpsp = NULL; + + /* + * `multi` and `whence` are redundant, but it helps + * to make sure the API is being used correctly + */ + REQUIRE(multi == qp->whence); + + LOCK(&multi->mutex); + QP_TRACE("multi %p snaps %u gen %u", multi, multi->snapshots, + multi->read->generation); + + isc_mem_free(multi->read->mctx, qp); + 
multi->snapshots--; + if (multi->snapshots == 0) { + /* + * Clean up if there were updates while we were working, + * and we are the last snapshot keeping the memory alive + */ + recycle(multi->read); + } + UNLOCK(&multi->mutex); +} + +/*********************************************************************** + * + * constructors, destructors + */ + +static void +initialize_guts(isc_mem_t *mctx, const dns_qpmethods_t *methods, void *ctx, + dns_qp_t *qp) { + REQUIRE(methods != NULL); + REQUIRE(methods->attach != NULL); + REQUIRE(methods->detach != NULL); + REQUIRE(methods->makekey != NULL); + REQUIRE(methods->triename != NULL); + + *qp = (dns_qp_t){ + .magic = QP_MAGIC, + .methods = methods, + .ctx = ctx, + }; + isc_mem_attach(mctx, &qp->mctx); +} + +void +dns_qp_create(isc_mem_t *mctx, const dns_qpmethods_t *methods, void *ctx, + dns_qp_t **qptp) { + dns_qp_t *qp; + + REQUIRE(qptp != NULL && *qptp == NULL); + + qp = isc_mem_get(mctx, sizeof(*qp)); + initialize_guts(mctx, methods, ctx, qp); + alloc_reset(qp); + QP_TRACE(""); + *qptp = qp; +} + +void +dns_qpmulti_create(isc_mem_t *mctx, const dns_qpmethods_t *methods, void *ctx, + dns_qpmulti_t **qpmp) { + dns_qpmulti_t *multi; + dns_qp_t *qp; + + REQUIRE(qpmp != NULL && *qpmp == NULL); + + multi = isc_mem_get(mctx, sizeof(*multi)); + *multi = (dns_qpmulti_t){ + .magic = QPMULTI_MAGIC, + .read = &multi->phase[0], + }; + isc_rwlock_init(&multi->rwlock); + isc_mutex_init(&multi->mutex); + + /* + * Do not waste effort allocating a bump chunk that will be thrown + * away when a transaction is opened. dns_qpmulti_update() always + * allocates; to ensure dns_qpmulti_write() does too, pretend the + * previous transaction was an update + */ + qp = multi->read; + initialize_guts(mctx, methods, ctx, qp); + qp->transaction_mode = QP_UPDATE; + QP_TRACE(""); + *qpmp = multi; +} + +static void +destroy_guts(dns_qp_t *qp) { + if (qp->leaf_count == 1) { + detach_leaf(qp, &qp->root); + } + if (qp->chunk_max == 0) { + return; + } + for (qp_chunk_t chunk = 0; chunk < qp->chunk_max; chunk++) { + if (qp->base[chunk] != NULL) { + chunk_free(qp, chunk); + } + } + ENSURE(qp->used_count == 0); + ENSURE(qp->free_count == 0); + ENSURE(qp->hold_count == 0); + free_chunk_arrays(qp); +} + +void +dns_qp_destroy(dns_qp_t **qptp) { + dns_qp_t *qp; + + REQUIRE(qptp != NULL); + REQUIRE(VALID_QP(*qptp)); + + qp = *qptp; + *qptp = NULL; + + /* do not try to destroy part of a dns_qpmulti_t */ + REQUIRE(qp->transaction_mode == QP_NONE); + + QP_TRACE(""); + destroy_guts(qp); + isc_mem_putanddetach(&qp->mctx, qp, sizeof(*qp)); +} + +void +dns_qpmulti_destroy(dns_qpmulti_t **qpmp) { + dns_qp_t *qp = NULL; + dns_qpmulti_t *multi = NULL; + + REQUIRE(qpmp != NULL); + REQUIRE(VALID_QPMULTI(*qpmp)); + + multi = *qpmp; + qp = multi->read; + *qpmp = NULL; + + REQUIRE(VALID_QP(qp)); + REQUIRE(!VALID_QP(write_phase(multi))); + REQUIRE(multi->snapshots == 0); + + QP_TRACE(""); + destroy_guts(qp); + isc_mutex_destroy(&multi->mutex); + isc_rwlock_destroy(&multi->rwlock); + isc_mem_putanddetach(&qp->mctx, multi, sizeof(*multi)); +} + +/*********************************************************************** + * + * modification + */ + +isc_result_t +dns_qp_insert(dns_qp_t *qp, void *pval, uint32_t ival) { + qp_ref_t new_ref, old_ref; + qp_node_t new_leaf, old_node; + qp_node_t *new_twigs, *old_twigs; + qp_shift_t new_bit, old_bit; + dns_qpkey_t new_key, old_key; + size_t new_keylen, old_keylen; + size_t offset; + uint64_t index; + qp_shift_t bit; + qp_weight_t pos, size; + qp_node_t *n; + + 
REQUIRE(VALID_QP(qp)); + + new_leaf = make_leaf(pval, ival); + new_keylen = leaf_qpkey(qp, &new_leaf, new_key); + + /* first leaf in an empty trie? */ + if (qp->leaf_count == 0) { + qp->root = new_leaf; + qp->leaf_count++; + attach_leaf(qp, &new_leaf); + return (ISC_R_SUCCESS); + } + + /* + * We need to keep searching down to a leaf even if our key is + * missing from this branch. It doesn't matter which twig we + * choose since the keys are all the same up to this node's + * offset. Note that if we simply use branch_twig_pos(n, bit) + * we may get an out-of-bounds access if our bit is greater + * than all the set bits in the node. + */ + n = &qp->root; + while (is_branch(n)) { + prefetch_twigs(qp, n); + bit = branch_keybit(n, new_key, new_keylen); + pos = branch_has_twig(n, bit) ? branch_twig_pos(n, bit) : 0; + n = branch_twigs_vector(qp, n) + pos; + } + + /* do the keys differ, and if so, where? */ + old_keylen = leaf_qpkey(qp, n, old_key); + offset = qpkey_compare(new_key, new_keylen, old_key, old_keylen); + if (offset == QPKEY_EQUAL) { + return (ISC_R_EXISTS); + } + new_bit = qpkey_bit(new_key, new_keylen, offset); + old_bit = qpkey_bit(old_key, old_keylen, offset); + + qp->leaf_count++; + attach_leaf(qp, &new_leaf); + + /* find where to insert a branch or grow an existing branch. */ + n = &qp->root; + while (is_branch(n)) { + prefetch_twigs(qp, n); + if (offset < branch_key_offset(n)) { + goto newbranch; + } + if (offset == branch_key_offset(n)) { + goto growbranch; + } + make_twigs_mutable(qp, n); + bit = branch_keybit(n, new_key, new_keylen); + INSIST(branch_has_twig(n, bit)); + n = branch_twig_ptr(qp, n, bit); + } + +newbranch: + new_ref = alloc_twigs(qp, 2); + new_twigs = ref_ptr(qp, new_ref); + + /* save before overwriting. */ + old_node = *n; + + /* new branch node takes old node's place */ + index = BRANCH_TAG | (1ULL << new_bit) | (1ULL << old_bit) | + ((uint64_t)offset << SHIFT_OFFSET); + *n = make_node(index, new_ref); + + /* populate twigs */ + new_twigs[old_bit > new_bit] = old_node; + new_twigs[new_bit > old_bit] = new_leaf; + + return (ISC_R_SUCCESS); + +growbranch: + INSIST(!branch_has_twig(n, new_bit)); + + /* locate twigs vectors */ + size = branch_twigs_size(n); + old_ref = branch_twigs_ref(n); + new_ref = alloc_twigs(qp, size + 1); + old_twigs = ref_ptr(qp, old_ref); + new_twigs = ref_ptr(qp, new_ref); + + /* embiggen branch node */ + index = branch_index(n) | (1ULL << new_bit); + *n = make_node(index, new_ref); + + /* embiggen twigs vector */ + pos = branch_twig_pos(n, new_bit); + move_twigs(new_twigs, old_twigs, pos); + new_twigs[pos] = new_leaf; + move_twigs(new_twigs + pos + 1, old_twigs + pos, size - pos); + + /* clean up */ + squash_twigs(qp, old_ref, size); + + return (ISC_R_SUCCESS); +} + +isc_result_t +dns_qp_deletekey(dns_qp_t *qp, const dns_qpkey_t search_key, + size_t search_keylen) { + dns_qpkey_t found_key; + size_t found_keylen; + qp_shift_t bit = 0; /* suppress warning */ + qp_weight_t pos, size; + qp_ref_t ref; + qp_node_t *twigs; + qp_node_t *parent; + qp_node_t *n; + + REQUIRE(VALID_QP(qp)); + + parent = NULL; + n = &qp->root; + while (is_branch(n)) { + prefetch_twigs(qp, n); + bit = branch_keybit(n, search_key, search_keylen); + if (!branch_has_twig(n, bit)) { + return (ISC_R_NOTFOUND); + } + make_twigs_mutable(qp, n); + parent = n; + n = branch_twig_ptr(qp, n, bit); + } + + /* empty trie? 
*/ + if (leaf_pval(n) == NULL) { + return (ISC_R_NOTFOUND); + } + + found_keylen = leaf_qpkey(qp, n, found_key); + if (qpkey_compare(search_key, search_keylen, found_key, found_keylen) != + QPKEY_EQUAL) + { + return (ISC_R_NOTFOUND); + } + + qp->leaf_count--; + detach_leaf(qp, n); + + /* trie becomes empty */ + if (qp->leaf_count == 0) { + INSIST(n == &qp->root && parent == NULL); + zero_twigs(n, 1); + return (ISC_R_SUCCESS); + } + + /* step back to parent node */ + n = parent; + parent = NULL; + + INSIST(bit != 0); + size = branch_twigs_size(n); + pos = branch_twig_pos(n, bit); + ref = branch_twigs_ref(n); + twigs = ref_ptr(qp, ref); + + if (size == 2) { + /* + * move the other twig to the parent branch. + */ + *n = twigs[!pos]; + squash_twigs(qp, ref, 2); + } else { + /* + * shrink the twigs in place, to avoid using the bump + * chunk too fast - the gc will clean up after us + */ + *n = make_node(branch_index(n) & ~(1ULL << bit), ref); + move_twigs(twigs + pos, twigs + pos + 1, size - pos - 1); + squash_twigs(qp, ref + size - 1, 1); + } + + return (ISC_R_SUCCESS); +} + +isc_result_t +dns_qp_deletename(dns_qp_t *qp, const dns_name_t *name) { + dns_qpkey_t key; + size_t keylen = dns_qpkey_fromname(key, name); + return (dns_qp_deletekey(qp, key, keylen)); +} + +/*********************************************************************** + * + * search + */ + +isc_result_t +dns_qp_getkey(dns_qpreadable_t qpr, const dns_qpkey_t search_key, + size_t search_keylen, void **pval_r, uint32_t *ival_r) { + dns_qpread_t *qp = dns_qpreadable_cast(qpr); + dns_qpkey_t found_key; + size_t found_keylen; + qp_shift_t bit; + qp_node_t *n; + + REQUIRE(VALID_QP(qp)); + REQUIRE(pval_r != NULL); + REQUIRE(ival_r != NULL); + + n = &qp->root; + while (is_branch(n)) { + prefetch_twigs(qp, n); + bit = branch_keybit(n, search_key, search_keylen); + if (!branch_has_twig(n, bit)) { + return (ISC_R_NOTFOUND); + } + n = branch_twig_ptr(qp, n, bit); + } + + /* empty trie? */ + if (leaf_pval(n) == NULL) { + return (ISC_R_NOTFOUND); + } + + found_keylen = leaf_qpkey(qp, n, found_key); + if (qpkey_compare(search_key, search_keylen, found_key, found_keylen) != + QPKEY_EQUAL) + { + return (ISC_R_NOTFOUND); + } + + *pval_r = leaf_pval(n); + *ival_r = leaf_ival(n); + return (ISC_R_SUCCESS); +} + +isc_result_t +dns_qp_getname(dns_qpreadable_t qpr, const dns_name_t *name, void **pval_r, + uint32_t *ival_r) { + dns_qpkey_t key; + size_t keylen = dns_qpkey_fromname(key, name); + return (dns_qp_getkey(qpr, key, keylen, pval_r, ival_r)); +} + +/**********************************************************************/ diff --git a/lib/dns/qp_p.h b/lib/dns/qp_p.h new file mode 100644 index 0000000000..c74d503f8d --- /dev/null +++ b/lib/dns/qp_p.h @@ -0,0 +1,703 @@ +/* + * Copyright (C) Internet Systems Consortium, Inc. ("ISC") + * + * SPDX-License-Identifier: MPL-2.0 + * + * This Source Code Form is subject to the terms of the Mozilla Public + * License, v. 2.0. If a copy of the MPL was not distributed with this + * file, you can obtain one at https://mozilla.org/MPL/2.0/. + * + * See the COPYRIGHT file distributed with this work for additional + * information regarding copyright ownership. + */ + +/* + * For an overview, see doc/design/qp-trie.md + */ + +#pragma once + +/*********************************************************************** + * + * interior node basics + */ + +/* + * A qp-trie node can be a leaf or a branch. It consists of three 32-bit + * words into which the components are packed. 
They are used as a 64-bit + * word and a 32-bit word, but they are not declared like that to avoid + * unwanted padding, keeping the size down to 12 bytes. They are in native + * endian order so getting the 64-bit part should compile down to an + * unaligned load. + * + * In a branch the 64-bit word is described by the enum below. The 32-bit + * word is a reference to the packed sparse vector of "twigs", i.e. child + * nodes. A branch node has at least 2 and less than SHIFT_OFFSET twigs + * (see the enum below). The qp-trie update functions ensure that branches + * actually branch, i.e. branches cannot have only 1 child. + * + * The contents of each leaf are set by the trie's user. The 64-bit word + * contains a pointer value (which must be word-aligned), and the 32-bit + * word is an arbitrary integer value. + */ +typedef struct qp_node { +#if WORDS_BIGENDIAN + uint32_t bighi, biglo, small; +#else + uint32_t biglo, bighi, small; +#endif +} qp_node_t; + +/* + * A branch node contains a 64-bit word comprising the branch/leaf tag, + * the bitmap, and an offset into the key. It is called an "index word" + * because it describes how to access the twigs vector (think "database + * index"). The following enum sets up the bit positions of these parts. + * + * In a leaf, the same 64-bit word contains a pointer. The pointer + * must be word-aligned so that the branch/leaf tag bit is zero. + * This requirement is checked by the newleaf() constructor. + * + * The bitmap is just above the tag bit. The `bits_for_byte[]` table is + * used to fill in a key so that bit tests can work directly against the + * index word without superfluous masking or shifting; we don't need to + * mask out the bitmap before testing a bit, but we do need to mask the + * bitmap before calling popcount. + * + * The byte offset into the key is at the top of the word, so that it + * can be extracted with just a shift, with no masking needed. + * + * The names are SHIFT_thing because they are qp_shift_t values. (See + * below for the various `qp_*` type declarations.) + * + * These values are relatively fixed in practice; the symbolic names + * avoid mystery numbers in the code. + */ +enum { + SHIFT_BRANCH = 0, /* branch / leaf tag */ + SHIFT_NOBYTE, /* label separator has no byte value */ + SHIFT_BITMAP, /* many bits here */ + SHIFT_OFFSET = 48, /* offset of byte in key */ +}; + +/* + * Value of the node type tag bit. + * + * It is defined this way to be explicit about where the value comes + * from, even though we know it is always the bottom bit. + */ +#define BRANCH_TAG (1ULL << SHIFT_BRANCH) + +/*********************************************************************** + * + * garbage collector tuning parameters + */ + +/* + * A "cell" is a location that can contain a `qp_node_t`, and a "chunk" + * is a moderately large array of cells. A big trie can occupy + * multiple chunks. (Unlike other nodes, a trie's root node lives in + * its `struct dns_qp` instead of being allocated in a cell.) + * + * The qp-trie allocator hands out space for twigs vectors. Allocations are + * made sequentially from one of the chunks; this kind of "sequential + * allocator" is also known as a "bump allocator", so in `struct dns_qp` + * (see below) the allocation chunk is called `bump`. + */ + +/* + * Number of cells in a chunk is a power of 2, which must have space for + * a full twigs vector (48 wide). When testing, use a much smaller chunk + * size to make the allocator work harder. 
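+ *
+ * For example, with the default QP_CHUNK_LOG of 10 below, a chunk is
+ * 1024 cells, i.e. 12 KiB of 12-byte nodes; the fuzzing build uses
+ * 128-cell chunks to exercise the slow paths more often.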
+ */ +#ifdef FUZZING_BUILD_MODE_UNSAFE_FOR_PRODUCTION +#define QP_CHUNK_LOG 7 +#else +#define QP_CHUNK_LOG 10 +#endif + +STATIC_ASSERT(6 <= QP_CHUNK_LOG && QP_CHUNK_LOG <= 20, + "qp-trie chunk size is unreasonable"); + +#define QP_CHUNK_SIZE (1U << QP_CHUNK_LOG) +#define QP_CHUNK_BYTES (QP_CHUNK_SIZE * sizeof(qp_node_t)) + +/* + * A chunk needs to be compacted if it has fragmented this much. + * (12% overhead seems reasonable) + */ +#define QP_MAX_FREE (QP_CHUNK_SIZE / 8) + +/* + * Compact automatically when we pass this threshold: when there is a lot + * of free space in absolute terms, and when we have freed more than half + * of the space we allocated. + * + * The current compaction algorithm scans the whole trie, so it is important + * to scale the threshold based on the size of the trie to avoid quadratic + * behaviour. XXXFANF find an algorithm that scans less of the trie! + * + * During a modification transaction, when we copy-on-write some twigs we + * count the old copy as "free", because they will be when the transaction + * commits. But they cannot be recovered immediately so they are also + * counted as on hold, and discounted when we decide whether to compact. + */ +#define QP_MAX_GARBAGE(qp) \ + (((qp)->free_count - (qp)->hold_count) > QP_CHUNK_SIZE * 4 && \ + ((qp)->free_count - (qp)->hold_count) > (qp)->used_count / 2) + +/* + * The chunk base and usage arrays are resized geometically and start off + * with two entries. + */ +#define GROWTH_FACTOR(size) ((size) + (size) / 2 + 2) + +/*********************************************************************** + * + * helper types + */ + +/* + * C is not strict enough with its integer types for these typedefs to + * improve type safety, but it helps to have annotations saying what + * particular kind of number we are dealing with. + */ + +/* + * The number or position of a bit inside a word. (0..63) + * + * Note: A dns_qpkey_t is logically an array of qp_shift_t values, but it + * isn't declared that way because dns_qpkey_t is a public type whereas + * qp_shift_t is private. + */ +typedef uint8_t qp_shift_t; + +/* + * The number of bits set in a word (as in Hamming weight or popcount) + * which is used for the position of a node in the packed sparse + * vector of twigs. (0..47) because our bitmap does not fill the word. + */ +typedef uint8_t qp_weight_t; + +/* + * A chunk number, i.e. an index into the chunk arrays. + */ +typedef uint32_t qp_chunk_t; + +/* + * Cell offset within a chunk, or a count of cells. Each cell in a + * chunk can contain a node. + */ +typedef uint32_t qp_cell_t; + +/* + * A twig reference is used to refer to a twigs vector, which occupies a + * contiguous group of cells. + */ +typedef uint32_t qp_ref_t; + +/* + * Constructors and accessors for qp_ref_t values, defined here to show + * how the qp_ref_t, qp_chunk_t, qp_cell_t types relate to each other + */ + +static inline qp_ref_t +make_ref(qp_chunk_t chunk, qp_cell_t cell) { + return (QP_CHUNK_SIZE * chunk + cell); +} + +static inline qp_chunk_t +ref_chunk(qp_ref_t ref) { + return (ref / QP_CHUNK_SIZE); +} + +static inline qp_cell_t +ref_cell(qp_ref_t ref) { + return (ref % QP_CHUNK_SIZE); +} + +/*********************************************************************** + * + * main qp-trie structures + */ + +#define QP_MAGIC ISC_MAGIC('t', 'r', 'i', 'e') +#define VALID_QP(qp) ISC_MAGIC_VALID(qp, QP_MAGIC) + +/* + * This is annoying: C doesn't allow us to use a predeclared structure as + * an anonymous struct member, so we have to fart around. 
The feature we + * want is available in GCC and Clang with -fms-extensions, but a + * non-standard extension won't make these declarations neater if we must + * also have a standard alternative. + */ + +/* + * Lightweight read-only access to a qp-trie. + * + * Just the fields neded for the hot path. The `base` field points + * to an array containing pointers to the base of each chunk like + * `qp->base[chunk]` - see `refptr()` below. + * + * A `dns_qpread_t` has a lifetime that does not extend across multiple + * write transactions, so it can share a chunk `base` array belonging to + * the `dns_qpmulti_t` it came from. + * + * We're lucky with the layout on 64 bit systems: this is only 40 bytes, + * with no padding. + */ +#define DNS_QPREAD_COMMON \ + uint32_t magic; \ + qp_node_t root; \ + qp_node_t **base; \ + void *ctx; \ + const dns_qpmethods_t *methods + +struct dns_qpread { + DNS_QPREAD_COMMON; +}; + +/* + * Heavyweight read-only snapshots of a qp-trie. + * + * Unlike a lightweight `dns_qpread_t`, a snapshot can survive across + * multiple write transactions, any of which may need to expand the + * chunk `base` array. So a `dns_qpsnap_t` keeps its own copy of the + * array, which will always be equal to some prefix of the expanded + * arrays in the `dns_qpmulti_t` that it came from. + * + * The `dns_qpmulti_t` keeps a refcount of its snapshots, and while + * the refcount is non-zero, chunks are not freed or reused. When a + * `dns_qpsnap_t` is destroyed, if it decrements the refcount to zero, + * it can do any deferred cleanup. + * + * The generation number is used for tracing. + */ +struct dns_qpsnap { + DNS_QPREAD_COMMON; + uint32_t generation; + dns_qpmulti_t *whence; + qp_node_t *base_array[]; +}; + +/* + * Read-write access to a qp-trie requires extra fields to support the + * allocator and garbage collector. + * + * The chunk `base` and `usage` arrays are separate because the `usage` + * array is only needed for allocation, so it is kept separate from the + * data needed by the read-only hot path. The arrays have empty slots where + * new chunks can be placed, so `chunk_max` is the maximum number of chunks + * (until the arrays are resized). + * + * Bare instances of a `struct dns_qp` are used for stand-alone + * single-threaded tries. For multithreaded access, transactions alternate + * between the `phase` pair of dns_qp objects inside a dns_qpmulti. + * + * For multithreaded access, the `generation` counter allows us to know + * which chunks are writable or not: writable chunks were allocated in the + * current generation. For single-threaded access, the generation counter + * is always zero, so all chunks are considered to be writable. + * + * Allocations are made sequentially in the `bump` chunk. Lightweight write + * transactions can re-use the `bump` chunk, so its prefix before `fender` + * is immutable, and the rest is mutable even though its generation number + * does not match the current generation. + * + * To decide when to compact and reclaim space, QP_MAX_GARBAGE() examines + * the values of `used_count`, `free_count`, and `hold_count`. The + * `hold_count` tracks nodes that need to be retained while readers are + * using them; they are free but cannot be reclaimed until the transaction + * has committed, so the `hold_count` is discounted from QP_MAX_GARBAGE() + * during a transaction. + * + * There are some flags that alter the behaviour of write transactions. 
+ * + * - The `transaction_mode` indicates whether the current transaction is a + * light write or a heavy update, or (between transactions) the previous + * transaction's mode, because the setup for the next transaction + * depends on how the previous one committed. The mode is set at the + * start of each transaction. It is QP_NONE in a single-threaded qp-trie + * to detect if part of a `dns_qpmulti_t` is passed to dns_qp_destroy(). + * + * - The `compact_all` flag is used when every node in the trie should be + * copied. (Usually compation aims to avoid moving nodes out of + * unfragmented chunks.) It is used when compaction is explicitly + * requested via `dns_qp_compact()`, and as an emergency mechanism if + * normal compaction failed to clear the QP_MAX_GARBAGE() condition. + * (This emergency is a bug even tho we have a rescue mechanism.) + * + * - The `shared_arrays` flag indicates that the chunk `base` and `usage` + * arrays are shared by both `phase`s in this trie's `dns_qpmulti_t`. + * This allows us to delay allocating copies of the arrays during a + * write transaction, until we definitely need to resize them. + * + * - When built with fuzzing support, we can use mprotect() and munmap() + * to ensure that incorrect memory accesses cause fatal errors. The + * `write_protect` flag must be set straight after the `dns_qpmulti_t` + * is created, then left unchanged. + * + * Some of the dns_qp_t fields are only used for multithreaded transactions + * (marked [MT] below) but the same code paths are also used for single- + * threaded writes. To reduce the size of a dns_qp_t, these fields could + * perhaps be moved into the dns_qpmulti_t, but that would require some kind + * of conditional runtime downcast from dns_qp_t to dns_multi_t, which is + * likely to be ugly. It is probably best to keep things simple if most tries + * need multithreaded access (XXXFANF do they? e.g. when there are many auth + * zones), + */ +struct dns_qp { + DNS_QPREAD_COMMON; + isc_mem_t *mctx; + /*% array of per-chunk allocation counters */ + struct { + /*% the allocation point, increases monotonically */ + qp_cell_t used; + /*% count of nodes no longer needed, also monotonic */ + qp_cell_t free; + /*% when was this chunk allocated? */ + uint32_t generation; + } *usage; + /*% transaction counter [MT] */ + uint32_t generation; + /*% number of slots in `chunk` and `usage` arrays */ + qp_chunk_t chunk_max; + /*% which chunk is used for allocations */ + qp_chunk_t bump; + /*% twigs in the `bump` chunk below `fender` are read only [MT] */ + qp_cell_t fender; + /*% number of leaf nodes */ + qp_cell_t leaf_count; + /*% total of all usage[] counters */ + qp_cell_t used_count, free_count; + /*% cells that cannot be recovered right now */ + qp_cell_t hold_count; + /*% what kind of transaction was most recently started [MT] */ + enum { QP_NONE, QP_WRITE, QP_UPDATE } transaction_mode : 2; + /*% compact the entire trie [MT] */ + bool compact_all : 1; + /*% chunk arrays are shared with a readonly qp-trie [MT] */ + bool shared_arrays : 1; + /*% optionally when compiled with fuzzing support [MT] */ + bool write_protect : 1; +}; + +/* + * Concurrent access to a qp-trie. + * + * The `read` pointer is used for read queries. It points to one of the + * `phase` elements. During a transaction, the other `phase` (see + * `write_phase()` below) is modified incrementally in copy-on-write + * style. On commit the `read` pointer is swapped to the altered phase. 
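+ *
+ * So the two `phase` elements alternate roles from one transaction to
+ * the next: one is visible to readers via `read` while the other is
+ * the scratch copy being modified, and write_phase() below returns
+ * whichever phase `read` does not currently point at.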
+ */ +struct dns_qpmulti { + uint32_t magic; + /*% controls access to the `read` pointer and its target phase */ + isc_rwlock_t rwlock; + /*% points to phase[r] and swaps on commit */ + dns_qp_t *read; + /*% protects the snapshot counter and `write_phase()` */ + isc_mutex_t mutex; + /*% so we know when old chunks are still shared */ + unsigned int snapshots; + /*% one is read-only, one is mutable */ + dns_qp_t phase[2]; +}; + +/* + * Get a pointer to the phase that isn't read-only. + */ +static inline dns_qp_t * +write_phase(dns_qpmulti_t *multi) { + bool read0 = multi->read == &multi->phase[0]; + return (read0 ? &multi->phase[1] : &multi->phase[0]); +} + +#define QPMULTI_MAGIC ISC_MAGIC('q', 'p', 'm', 'v') +#define VALID_QPMULTI(qp) ISC_MAGIC_VALID(qp, QPMULTI_MAGIC) + +/*********************************************************************** + * + * interior node constructors and accessors + */ + +/* + * See the comments under "interior node basics" above, which explain the + * layout of nodes as implemented by the following functions. + */ + +/* + * Get the 64-bit word of a node. + */ +static inline uint64_t +node64(qp_node_t *n) { + uint64_t lo = n->biglo; + uint64_t hi = n->bighi; + return (lo | (hi << 32)); +} + +/* + * Get the 32-bit word of a node. + */ +static inline uint32_t +node32(qp_node_t *n) { + return (n->small); +} + +/* + * Create a node from its parts + */ +static inline qp_node_t +make_node(uint64_t big, uint32_t small) { + return ((qp_node_t){ + .biglo = (uint32_t)(big), + .bighi = (uint32_t)(big >> 32), + .small = small, + }); +} + +/* + * Test a node's tag bit. + */ +static inline bool +is_branch(qp_node_t *n) { + return (n->biglo & BRANCH_TAG); +} + +/* leaf nodes *********************************************************/ + +/* + * Get a leaf's pointer value. The double cast is to avoid a warning + * about mismatched pointer/integer sizes on 32 bit systems. + */ +static inline void * +leaf_pval(qp_node_t *n) { + return ((void *)(uintptr_t)node64(n)); +} + +/* + * Get a leaf's integer value + */ +static inline uint32_t +leaf_ival(qp_node_t *n) { + return (node32(n)); +} + +/* + * Create a leaf node from its parts + */ +static inline qp_node_t +make_leaf(const void *pval, uint32_t ival) { + qp_node_t leaf = make_node((uintptr_t)pval, ival); + REQUIRE(!is_branch(&leaf) && pval != NULL); + return (leaf); +} + +/* branch nodes *******************************************************/ + +/* + * The following function names use plural `twigs` when they work on a + * branch's twigs vector as a whole, and singular `twig` when they work on + * a particular twig. + */ + +/* + * Get a branch node's index word + */ +static inline uint64_t +branch_index(qp_node_t *n) { + return (node64(n)); +} + +/* + * Get a reference to a branch node's child twigs. + */ +static inline qp_ref_t +branch_twigs_ref(qp_node_t *n) { + return (node32(n)); +} + +/* + * Bit positions in the bitmap come directly from the key. DNS names are + * converted to keys using the tables declared at the end of this file. + */ +static inline qp_shift_t +qpkey_bit(const dns_qpkey_t key, size_t len, size_t offset) { + if (offset < len) { + return (key[offset]); + } else { + return (SHIFT_NOBYTE); + } +} + +/* + * Extract a branch node's offset field, used to index the key. + */ +static inline size_t +branch_key_offset(qp_node_t *n) { + return ((size_t)(branch_index(n) >> SHIFT_OFFSET)); +} + +/* + * Which bit identifies the twig of this node for this key? 
+ */ +static inline qp_shift_t +branch_keybit(qp_node_t *n, const dns_qpkey_t key, size_t len) { + return (qpkey_bit(key, len, branch_key_offset(n))); +} + +/* + * Convert a twig reference into a pointer. + */ +static inline qp_node_t * +ref_ptr(dns_qpreadable_t qpr, qp_ref_t ref) { + dns_qpread_t *qp = dns_qpreadable_cast(qpr); + return (qp->base[ref_chunk(ref)] + ref_cell(ref)); +} + +/* + * Get a pointer to a branch node's twigs vector. + */ +static inline qp_node_t * +branch_twigs_vector(dns_qpreadable_t qpr, qp_node_t *n) { + dns_qpread_t *qp = dns_qpreadable_cast(qpr); + return (ref_ptr(qp, branch_twigs_ref(n))); +} + +/* + * Warm up the cache while calculating which twig we want. + */ +static inline void +prefetch_twigs(dns_qpreadable_t qpr, qp_node_t *n) { + __builtin_prefetch(branch_twigs_vector(qpr, n)); +} + +/*********************************************************************** + * + * bitmap popcount shenanigans + */ + +/* + * How many twigs appear in the vector before the one corresponding to the + * given bit? Calculated using popcount of part of the branch's bitmap. + * + * To calculate a mask that covers the lesser bits in the bitmap, we + * subtract 1 to set the bits, and subtract the branch tag because it + * is not part of the bitmap. + */ +static inline qp_weight_t +branch_twigs_before(qp_node_t *n, qp_shift_t bit) { + uint64_t mask = (1ULL << bit) - 1 - BRANCH_TAG; + uint64_t bmp = branch_index(n) & mask; + return ((qp_weight_t)__builtin_popcountll(bmp)); +} + +/* + * How many twigs does this node have? + * + * The offset is directly after the bitmap so the offset's lesser bits + * covers the whole bitmap, and the bitmap's weight is the number of twigs. + */ +static inline qp_weight_t +branch_twigs_size(qp_node_t *n) { + return (branch_twigs_before(n, SHIFT_OFFSET)); +} + +/* + * Position of a twig within the packed sparse vector. + */ +static inline qp_weight_t +branch_twig_pos(qp_node_t *n, qp_shift_t bit) { + return (branch_twigs_before(n, bit)); +} + +/* + * Get a pointer to a particular twig. + */ +static inline qp_node_t * +branch_twig_ptr(dns_qpreadable_t qpr, qp_node_t *n, qp_shift_t bit) { + return (branch_twigs_vector(qpr, n) + branch_twig_pos(n, bit)); +} + +/* + * Is the twig identified by this bit present? 
+ */ +static inline bool +branch_has_twig(qp_node_t *n, qp_shift_t bit) { + return (branch_index(n) & (1ULL << bit)); +} + +/* twig logistics *****************************************************/ + +static inline void +move_twigs(qp_node_t *to, qp_node_t *from, qp_weight_t size) { + memmove(to, from, size * sizeof(qp_node_t)); +} + +static inline void +zero_twigs(qp_node_t *twigs, qp_weight_t size) { + memset(twigs, 0, size * sizeof(qp_node_t)); +} + +/*********************************************************************** + * + * method invocation helpers + */ + +static inline void +attach_leaf(dns_qpreadable_t qpr, qp_node_t *n) { + dns_qpread_t *qp = dns_qpreadable_cast(qpr); + qp->methods->attach(qp->ctx, leaf_pval(n), leaf_ival(n)); +} + +static inline void +detach_leaf(dns_qpreadable_t qpr, qp_node_t *n) { + dns_qpread_t *qp = dns_qpreadable_cast(qpr); + qp->methods->detach(qp->ctx, leaf_pval(n), leaf_ival(n)); +} + +static inline size_t +leaf_qpkey(dns_qpreadable_t qpr, qp_node_t *n, dns_qpkey_t key) { + dns_qpread_t *qp = dns_qpreadable_cast(qpr); + return (qp->methods->makekey(key, qp->ctx, leaf_pval(n), leaf_ival(n))); +} + +static inline char * +triename(dns_qpreadable_t qpr, char *buf, size_t size) { + dns_qpread_t *qp = dns_qpreadable_cast(qpr); + qp->methods->triename(qp->ctx, buf, size); + return (buf); +} + +#define TRIENAME(qp) \ + triename(qp, (char[DNS_QP_TRIENAME_MAX]){}, DNS_QP_TRIENAME_MAX) + +/*********************************************************************** + * + * converting DNS names to trie keys + */ + +/* + * This is a deliberate simplification of the hostname characters, + * because it doesn't matter much if we treat a few extra characters + * favourably: there is plenty of space in the index word for a + * slightly larger bitmap. + */ +static inline bool +qp_common_character(uint8_t byte) { + return (('-' <= byte && byte <= '9') || ('_' <= byte && byte <= 'z')); +} + +/* + * Lookup table mapping bytes in DNS names to bit positions, used + * by dns_qpkey_fromname() to convert DNS names to qp-trie keys. + */ +extern uint16_t dns_qp_bits_for_byte[]; + +/* + * And the reverse, mapping bit positions to characters, so the tests + * can print diagnostics involving qp-trie keys. + */ +extern uint8_t dns_qp_byte_for_bit[]; + +/**********************************************************************/