Add a qp-trie data structure

A qp-trie is a kind of radix tree that is particularly well-suited to
DNS servers. I invented the qp-trie in 2015, based on Dan Bernstein's
crit-bit trees and Phil Bagwell's HAMT. https://dotat.at/prog/qp/

This code incorporates some new ideas that I prototyped using
NLnet Labs NSD in 2020 (optimizations for DNS names as keys) and
2021 (custom allocator and garbage collector).
https://dotat.at/cgi/git/nsd.git

The BIND version of my qp-trie code has a number of improvements
compared to the prototype developed for NSD.

  * The main omission in the prototype was the very sketchy outline of
    how locking might work. Now the locking has been implemented,
    using a reader/writer lock and a mutex. However, it is designed to
    benefit from liburcu if that is available.

  * The prototype was designed for two-version concurrency, one
    version for readers and one for the writer. The new code supports
    multiversion concurrency, to provide a basis for BIND's dbversion
    machinery, so that updates are not blocked by long-running zone
    transfers.

  * There are now two kinds of transaction that modify the trie: an
    `update` aims to support many very small zones without wasting
    memory; a `write` avoids unnecessary allocation to help the
    performance of many small changes to the cache.

  * There is also a single-threaded interface for situations where
    concurrent access is not necessary.

  * The API makes better use of types to make it more clear which
    operations are permitted when.

  * The lookup table used to convert a DNS name to a qp-trie key is
    now initialized by a run-time constructor instead of a programmer
    using copy-and-paste. Key conversion is more flexible, so the
    qp-trie can be used with keys other than DNS names.

  * There has been much refactoring and re-arranging things to improve
    the terminology and order of presentation in the code, and the
    internal documentation has been moved from a comment into a file
    of its own.

Some of the required functionality has been stripped out, to be
brought back later after the basics are known to work.

  * Garbage collector performance statistics are missing.

  * Fancy searches are missing, such as longest match and nearest
    match.

  * Iteration is missing.

  * Search for update is missing, for cases where the caller needs to
    know if the value object is mutable or not.
2025-08-31 06:25:31 +00:00 · 2022-05-09 14:31:35 +01:00
parent 7975b785fd
commit 6b9ddbd1ce
7 changed files with 3638 additions and 21 deletions
--- a/doc/design/qp-trie.md
+++ b/doc/design/qp-trie.md
@@ -0,0 +1,770 @@
+<!--
+Copyright (C) Internet Systems Consortium, Inc. ("ISC")
+
+SPDX-License-Identifier: MPL-2.0
+
+This Source Code Form is subject to the terms of the Mozilla Public
+License, v. 2.0.  If a copy of the MPL was not distributed with this
+file, you can obtain one at https://mozilla.org/MPL/2.0/.
+
+See the COPYRIGHT file distributed with this work for additional
+information regarding copyright ownership.
+-->
+
+A qp-trie for the DNS
+=====================
+
+A qp-trie is a data structure that supports lookups in a sorted
+collection of keys. It is efficient both in terms of fast lookups and
+using little memory. It is particularly well-suited for use in DNS
+servers.
+
+These notes outline how BIND's `dns_qp` implementation works, how it
+is optimized for lookups keyed by DNS names, and how it supports
+multi-version concurrency.
+
+
+data structure zoo
+------------------
+
+Chasing a pointer indirection is very slow, up to 100ns, whereas a
+sequential memory access takes less than 10ns. So, to make a data
+structure fast, we need to minimize indirections.
+
+There is a tradeoff between speed and flexibility in standard data
+structures:
+
+  * Arrays are very simple and fast (a lookup goes straight to the
+    right address), but the key can only be a small integer.
+
+  * Hash tables allow you to use arbitrary lookup keys (such as
+    strings), but may require probing multiple addresses to find the
+    right element.
+
+  * Radix trees allow you to do lookups based on the sorting order of
+    the keys, provided it is lexical like `memcmp()`; however, lookups
+    require multiple indirections.
+
+  * Comparison search trees (binary trees and B-trees) allow you to
+    use an arbitrary ordering predicate, but each indirection during
+    a lookup also requires a comparison.
+
+In the DNS, we need to use some kind of tree to support the kinds of
+lookup required for DNSSEC: find longest match, find nearest
+predecessor or successor, and so forth. So what kind of tree is best?
+
+
+in theory
+---------
+
+In a tree where the average length of a key is `k`, and the number of
+elements in the tree is `n`, the theoretical performance bounds are,
+for a comparison tree:
+
+  * `Ω(k * log n)`
+  * `Ο(k * n)`
+
+And for a radix tree:
+
+  * `Ω(k + log n)`
+  * `Ο(k + k)`
+
+Here, `Ω()` is the lower bound and `Ο()` is the upper bound; we
+expect typical performance to be close to the lower bound.
+
+The multiplications in the comparison tree expressions mean that each
+indirection requires a comparison `Ο(k)`, whereas they are additions
+in the radix tree expressions because a radix tree traversal only
+needs one key comparison.
+
+The upper bounds say that (in the absence of balancing) a comparison
+tree can devolve into a linked list of nodes, whereas the shape of a
+radix tree is determined by the set of keys independent of the order
+of insertion or the number of keys.
+
+The logarithms hide some interesting constant factors. In a binary
+tree, the log is base 2. In a radix tree, the radix is the base of the
+logarithm. So, if we increase the radix, the constant factor gets
+smaller. The rough equivalent for a binary tree would be to use a
+B-tree instead, but although B-trees have fewer indirections they do
+not reduce the number of comparisons.
+
+In implementation terms, a larger radix means tree nodes get wider
+and the tree becomes shallower. A shallower tree requires fewer
+indirections, so it should be faster. The trick is to increase the
+radix without blowing up the tree's memory usage, which can lose
+more performance than we win.
+
+This analysis suggests that a radix tree is better than a comparison
+tree, provided keys can be compared lexically - which is true for DNS
+names, with some rearrangement (described below). When using big-O
+notation, we also need to be wary of the constant factors; but in this
+case they also favour a radix tree, especially with the optimization
+tricks used by BIND's qp-trie.
+
+Note: "radix" comes from the Latin for "root", so "radix tree" is a
+pun, which is geekily amusing especially when talking about logs.
+
+
+what is a trie?
+---------------
+
+A trie is another name for a radix tree (or "digital tree" according
+to Knuth). It is short for information reTRIEval, and I pronounce it
+exactly like "tree" (though Knuth pronounces it like "try").
+
+In a trie, keys are divided into digits depending on some radix e.g.
+base 2 for binary tries, base 256 for byte-indexed tries. When
+searching the trie, successive digits in the key, from most to least
+significant, are used to select branches from successive nodes in
+the trie, roughly like:
+
+        for (offset = 0; isbranch(node); offset++)
+            node = node->child[key[offset]];
+
+All of the keys in a subtrie have identical prefixes. Tries do not
+need to store keys since they are implicit in the structure.
+
+
+binary crit-bit trees
+---------------------
+
+A patricia trie is a binary trie which omits nodes that have only one
+child. Dan Bernstein calls his tightly space-optimized version a
+"crit-bit tree".
+https://cr.yp.to/critbit.html
+https://github.com/agl/critbit/
+
+Unlike a basic trie, a crit-bit tree skips parts of the key when
+every element in a subtree shares the same sequence of bits.
+Each node is annotated with the offset of the bit that is used to
+select the branch; offsets always increase as you go deeper into
+the tree.
+
+    while (isbranch(node))
+        node = node->child[key[node->offset]];
+
+In a crit-bit tree the keys are not implicit in the structure
+because parts of them are skipped. Therefore, each leaf refers to a
+copy of its key so that when you find a leaf you can verify that the
+skipped bits match.
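
A minimal sketch of the crit-bit lookup described above, including the final key check. The `cb_node` layout and the function names are hypothetical, chosen for illustration; they are not Bernstein's or BIND's.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical node type for illustration only. */
struct cb_node {
    bool is_branch;
    size_t bit;             /* branch: index of the critical bit */
    struct cb_node *child;  /* branch: two children, adjacent in memory */
    const char *key;        /* leaf: the complete key */
};

/*
 * Bit number `bit` of the key, counting from the most significant bit
 * of byte 0. Positions beyond the end of the key read as zero.
 */
unsigned
cb_getbit(const char *key, size_t len, size_t bit) {
    size_t byte = bit / 8;
    if (byte >= len) {
        return 0;
    }
    return ((unsigned char)key[byte] >> (7 - bit % 8)) & 1;
}

/* Returns the matching leaf, or NULL if the key is absent. */
const struct cb_node *
cb_lookup(const struct cb_node *node, const char *key) {
    size_t len = strlen(key);
    if (node == NULL) {
        return NULL;
    }
    while (node->is_branch) {
        node = &node->child[cb_getbit(key, len, node->bit)];
    }
    /* The walk skipped bits, so compare the whole key at the end. */
    return strcmp(node->key, key) == 0 ? node : NULL;
}
```

The comparison at the leaf is the only place where the whole key is examined, which is why the traversal cost is additive in the key length.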
+
+
+prefetching
+-----------
+
+Observe that in the loop above, the current node has only one child
+pointer, and the child nodes are adjacent in memory. This means it
+is possible to tell the CPU to prefetch the child nodes before
+extracting the critical bit from the key and choosing which child is
+next. A qp-trie has a similar layout, but it has more child nodes
+(still adjacent in memory) and it does more computation to choose
+which one is next.
+
+When I originally invented the qp-trie code, I found that explicit
+prefetch hints made the qp-trie substantially faster and the crit-bit
+tree slightly faster. The hints help the CPU to do useful work at the
+same time as the memory subsystem. (This is unusual for linked data
+structures, which tend to alternate between CPU waiting for memory,
+and memory waiting for CPU.)
+
+Large modern CPUs (after about 2015) are better at prefetching
+automatically, so the explicit hint is less important than it used to
+be, but `lib/dns/qp.c` still has `__builtin_prefetch()` hints in its
+inner traversal loops.
+
+
+packed sparse vectors with popcount
+-----------------------------------
+
+The `popcount` instruction counts the number of bits that are set
+in a word. It's also known as the Hamming weight; Knuth calls it
+"sideways add". https://en.wikipedia.org/wiki/popcount
+
+You can use `popcount` to implement a sparse vector of length `N`
+containing `M <= N` members using a bitmap of length `N` and a packed
+vector of `M` elements. A member `b` is present in the vector if bit
+`b` is set, so `M == popcount(bitmap)`. The index of member `b` in
+the packed vector is the popcount of the bits preceding `b`.
+
+    // size of vector
+    size = popcount(bitmap);
+    // bit position
+    bit = 1ULL << b;
+    // is element present?
+    if (bitmap & bit) {
+        // mask covers the preceding elements
+        mask = bit - 1;
+        // position of element in packed vector
+        pos = popcount(bitmap & mask);
+        // fetch element
+        elem = vector[pos];
+    }
+
+See "Hacker's Delight" by Hank Warren, section 5-1 "Counting 1
+bits", subsection "applications". http://www.hackersdelight.org
+
+See under _"bitmap popcount shenanigans"_ in `lib/dns/qp.c` for how
+this is implemented in BIND.
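
The same technique as a self-contained function. This is a sketch under assumptions: the `sparse` structure and `sparse_get()` are hypothetical names, the element type is `int` for simplicity, and `__builtin_popcountll()` is the GCC/Clang builtin rather than anything specific to BIND.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sparse vector with up to 64 slots. */
struct sparse {
    uint64_t bitmap;    /* bit b is set when slot b is present */
    const int *packed;  /* popcount(bitmap) elements, in slot order */
};

/* Copy slot b into *out and return true, or return false if absent. */
bool
sparse_get(const struct sparse *v, unsigned b, int *out) {
    uint64_t bit = UINT64_C(1) << b; /* requires b < 64 */
    if ((v->bitmap & bit) == 0) {
        return false;
    }
    /* Count the present slots below b to find the packed index. */
    unsigned pos = (unsigned)__builtin_popcountll(v->bitmap & (bit - 1));
    *out = v->packed[pos];
    return true;
}
```

With slots 3, 17, and 40 present, the packed vector has three elements, and slot 17 is found at packed index 1 because exactly one lower bit is set.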
+
+
+popcount for trie nodes
+-----------------------
+
+Phil Bagwell's hashed array-mapped tries (HAMT) use popcount for
+compact trie nodes. In a HAMT, string keys are hashed, and the hash is
+used as the index to the trie, with radix 32 or 64.
+http://infoscience.epfl.ch/record/64394/files/triesearches.pdf
+http://infoscience.epfl.ch/record/64398/files/idealhashtrees.pdf
+
+As discussed above, increasing the radix makes the tree shallower, so
+it should be faster. The downside is usually much greater memory
+overhead. Child vectors are often sparsely populated, so we can
+greatly reduce the overhead by packing them with popcount.
+
+The HAMT relies on hashing, which keeps keys dense. This means it
+can be laid out like a basic trie with implicit keys (i.e. hash
+values). The disadvantage of hashing is that strings are stored
+out of order.
+
+
+qp-trie
+-------
+
+A qp-trie is a mash-up of Bernstein's crit-bit tree with Bagwell's
+HAMT. Like a crit-bit tree, a qp-trie omits nodes with one child;
+nodes include a key offset; and keys are referenced from leaves instead
+of being implicit in the trie structure. Like a HAMT, nodes have a
+popcount packed vector of children, but unlike a HAMT, keys are not
+hashed.
+
+A qp-trie is faster than a crit-bit tree and uses less memory, because
+its wider fan-out requires fewer nodes and popcount packs them very
+efficiently. Like a crit-bit tree but unlike a HAMT, a qp-trie stores
+keys in lexical order.
+
+As in a HAMT, the original layout of a qp-trie node is a pair of
+words, which are used as key and value pointers in leaf nodes, and
+index word and pointer in branch nodes. The index word contains the
+popcount bitmap (as in a HAMT) and the offset into the key (as in a
+crit-bit tree), as well as a leaf/branch tag bit. The pointer refers
+to the branch node's "twigs", which is what we call the packed sparse
+vector of child nodes.
+
+The fan-out of a qp-trie is limited by the need to fit the bitmap and
+the nybble offset into a 64-bit word; a radix of 16 or 32 works well,
+and 32 is slightly faster (though 5-bit nybbles are fiddly). But radix
+64 requires an extra word per node, and the extra memory overhead
+makes it slower as well as bulkier.
+
+Early qp-trie implementations used a node layout like the
+following. However, in practice C bitfields have too many
+portability gotchas to work well. It is better to use hand-written
+shifting and masking to access the parts of the index word.
+
+        #define NYBBLE 4 // or 5
+        #define RADIX (1 << NYBBLE)
+
+        union qp_node {
+            struct {
+                unsigned tag : 1;
+                unsigned bitmap : RADIX;
+                unsigned offset : (64 - 1 - RADIX);
+                union qp_node *twigs;
+            } branch;
+            struct {
+                void *value;
+                const char *key;
+            } leaf;
+        };
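
For comparison, here is a sketch of the hand-written shifting and masking alternative for a radix-16 index word. The bit assignment (tag in bit 0, bitmap above it, offset in the remaining high bits) and all of the names are assumptions made for this example, not the layout used by any particular qp-trie implementation.

```c
#include <stdbool.h>
#include <stdint.h>

#define NYBBLE 4
#define RADIX (1 << NYBBLE)

/*
 * Assumed layout, low bits first: 1 tag bit, RADIX bitmap bits, and
 * the key offset in the remaining 64 - 1 - RADIX high bits.
 */
#define TAG_BITS 1
#define BITMAP_SHIFT TAG_BITS
#define BITMAP_MASK ((UINT64_C(1) << RADIX) - 1)
#define OFFSET_SHIFT (TAG_BITS + RADIX)

/* Pack the three fields; the offset must fit in the high bits. */
uint64_t
make_index(bool is_branch, uint64_t bitmap, uint64_t offset) {
    return (uint64_t)is_branch |
           ((bitmap & BITMAP_MASK) << BITMAP_SHIFT) |
           (offset << OFFSET_SHIFT);
}

bool
index_is_branch(uint64_t index) {
    return (index & 1) != 0;
}

uint64_t
index_bitmap(uint64_t index) {
    return (index >> BITMAP_SHIFT) & BITMAP_MASK;
}

uint64_t
index_offset(uint64_t index) {
    return index >> OFFSET_SHIFT;
}
```

Because the field positions are spelled out in the macros, the layout is the same on every compiler and byte order, which is the property that bitfields do not guarantee.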
+
+
+DNS qp-trie
+-----------
+
+BIND uses a variant of a qp-trie optimized for DNS names. DNS names
+almost always use the usual hostname alphabet of (case-insensitive)
+letters, digits, hyphen, plus underscore (which is often used in the DNS
+for non-hostname purposes), and finally the label separator (which is
+written as '.' in presentation-format domain names, and is the label
+length in wire format). This adds up to 39 common characters.
+
+A bitmap for 39 common characters is small enough to fit into a
+qp-trie index word, so we can (in principle) walk down the trie one
+character at a time, as if the radix were 256, but without needing a
+multi-word bitmap.
+
+However, DNS names can contain arbitrary bytes. To support the 200-ish
+unusual characters we use an escaping scheme, described in more detail
+below. This requires a few more bits in the bitmap to represent the
+escape characters, so our radix ends up being 47. This still fits into
+the 64-bit index word, so we get the compactness of a qp-trie but with
+faster byte-at-a-time lookups for DNS names that use common hostname
+characters.
+
+You can also use other kinds of keys with BIND's DNS qp-trie, provided
+they are not too long. You must provide your own key preparation
+function, e.g. for uniform binary keys you might extract 5-bit nybbles
+to get a radix-32 trie.
+
+
+preparing a lookup key
+----------------------
+
+A DNS name needs to be rearranged to use it as a qp-trie key, so that
+the lexical order of rearranged keys matches the canonical DNS name
+order specified in RFC 4034 section 6.1:
+
+  * reverse the order of the labels so that they run from most
+    significant to least significant, left to right (but the
+    characters in each label remain in the same order)
+
+  * convert uppercase ASCII letters to lowercase ASCII
+
+  * change the label separators to a non-byte value that sorts before
+    the zero byte
+
+For qp-trie lookups there are a couple of extra steps:
+
+  * There is an escaping mechanism to support DNS names that use
+    unusual characters. Common characters use one byte in the lookup
+    key, but unusual characters are expanded to two bytes. To preserve
+    the correct lexical order, there are different escape bytes
+    depending on how the unusual character sorts relative to the
+    common hostname characters.
+
+  * Characters in the DNS name need to be converted to bitmap
+    positions. This is done at the same time as preparing the lookup
+    key, to move work out of the inner trie traversal loop.
+
+These 5 transformations can be done in a single pass over a DNS name
+using a single lookup table. The transformed name is usually the
+same length (up to 2x longer if it contains unusual characters).
+
+You can use absolute or relative DNS names as keys, without ambiguity
+(provided you have some way of knowing what names are relative to).
+When converted to a lookup key, absolute names start with a non-byte
+value representing the root, and relative names do not.
+
+Lookup keys are ephemeral, allocated on the stack during a lookup.
+
+See under _"converting DNS names to trie keys"_ in `lib/dns/qp.c`
+for how this is implemented in BIND.
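
A simplified sketch of the key preparation steps for names that use only the common hostname characters. It takes a presentation-format name instead of wire format, it rejects unusual characters instead of escaping them, and the character-to-value mapping in `key_value()` is hypothetical: it is not the table that BIND builds at run time.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { KEY_SEP = 0, KEY_MAX = 512 };

/*
 * Hypothetical mapping of the common hostname characters to key
 * values 1..38 in ASCII order; returns -1 for anything else.
 */
static int
key_value(char c) {
    if (c == '-') return 1;
    if (c >= '0' && c <= '9') return 2 + (c - '0');
    if (c == '_') return 12;
    if (c >= 'A' && c <= 'Z') return 13 + (c - 'A'); /* fold case */
    if (c >= 'a' && c <= 'z') return 13 + (c - 'a');
    return -1;
}

/*
 * Convert a name to a key. Returns the key length, or 0 if the name
 * is empty, too long, has an empty label, or would need escaping.
 */
size_t
name_to_key(const char *name, uint8_t key[KEY_MAX]) {
    size_t end = strlen(name);
    size_t len = 0;
    if (end == 0 || end >= KEY_MAX / 2) {
        return 0;
    }
    if (name[end - 1] == '.') {
        key[len++] = KEY_SEP; /* absolute name: root marker */
        end--;
    }
    /* Walk the labels right to left, copying each left to right. */
    while (end > 0) {
        size_t start = end;
        while (start > 0 && name[start - 1] != '.') {
            start--;
        }
        if (start == end) {
            return 0; /* empty label */
        }
        for (size_t i = start; i < end; i++) {
            int v = key_value(name[i]);
            if (v < 0) {
                return 0; /* unusual character: not handled here */
            }
            key[len++] = (uint8_t)v;
        }
        key[len++] = KEY_SEP; /* separator sorts before every character */
        end = (start > 0) ? start - 1 : 0;
    }
    return len;
}
```

Under these assumptions `example.` becomes a 9-element key (root marker, seven letters, separator), and `www.Example.` shares that 9-element prefix before continuing with the values for `www`.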
+
+
+node layout
+-----------
+
+Earlier I said that the original qp-trie node layout consists of two
+words: one 64-bit word for the branch index, and one pointer-sized
+word. BIND's qp-trie uses a layout that is smaller on 64-bit systems:
+one 64-bit word and one 32-bit word.
+
+A branch node contains
+
+  * a branch/leaf tag bit
+
+  * a 47-wide bitmap, with a bit for each common hostname character
+    and each escape character
+
+  * a 9-bit key offset, enough to count twice the length of a DNS
+    name
+
+  * a 32-bit "twigs" reference to the packed vector of child nodes;
+    these references are described in more detail below
+
+A leaf node contains a pointer value (which we assume to be 64 bits)
+and a 32-bit integer value. The branch/leaf tag is smuggled into the
+low-order bit of the pointer value, so the pointer value must have
+large enough alignment. (This requirement is checked when a leaf is
+added to the trie.) Apart from that, the meaning of leaf values
+is entirely under control of the qp-trie user.
+
+When constructing a qp-trie the user provides a collection of method
+pointers. The qp-trie code calls these methods when it needs to do
+anything that needs to look into a leaf value, such as extracting the
+key.
+
+See under _"interior node basics"_ and _"interior node constructors
+and accessors"_ in `lib/dns/qp_p.h` for the implementation.
+
+
+example
+-------
+
+Consider a small zone:
+
+        example.        ; apex
+        mail.example.   ; IMAP server
+        mx.example.     ; incoming mail
+        www.example.    ; web load balancer
+        www1.example.   ; back-end web servers
+        www2.example.
+
+It becomes a qp-trie as follows. I am writing bitmaps as lists of
+characters representing the bits that are set, with `'.'` for label
+separators. I have used arbitrary names for the addresses of the twigs
+vectors.
+
+    root = (qp_node){
+        tag: BRANCH,
+        offset: 9,
+        bitmap: [ '.', 'm', 'w' ],
+        twigs: &one,
+    };
+
+Note that the offset skips the root zone, the zone name, and the apex
+label separator. If the offset is beyond the end of the key, the byte
+value is the label separator.
+
+    one = (qp_node[3]){
+        {
+            tag: LEAF,
+            key: "example.",
+        },
+        {
+            tag: BRANCH,
+            offset: 10,
+            bitmap: [ 'a', 'x' ],
+            twigs: &two,
+        },
+        {
+            tag: BRANCH,
+            offset: 12,
+            bitmap: [ '.', '1', '2' ],
+            twigs: &three,
+        },
+    };
+
+This twigs vector has an element for the zone apex, and the two
+different initial characters of the subdomains.
+
+The mail servers differ in the next character, so the offset bumps from
+9 to 10 without skipping any characters. The web servers all start with
+www, so the offset bumps from 9 to 12, skipping the common prefix.
+
+    two = (qp_node[2]){
+        {
+            tag: LEAF,
+            key: "mail.example.",
+        },
+        {
+            tag: LEAF,
+            key: "mx.example.",
+        },
+    };
+
+The different lengths of `mail` and `mx` don't matter: we implicitly
+skip to the end of the key when we reach a leaf node.
+
+    three = (qp_node[3]){
+        {
+            tag: LEAF,
+            key: "www.example.",
+        },
+        {
+            tag: LEAF,
+            key: "www1.example.",
+        },
+        {
+            tag: LEAF,
+            key: "www2.example.",
+        },
+    };
+
+When the trie includes labels of differing lengths, we can have a node
+that chooses between a label separator and characters from the longer
+labels. This is slightly different from the root node, which tested the
+first character of the label; here we are testing the last character.
+
+
+memory management for concurrency
+---------------------------------
+
+The following sections discuss how the qp-trie supports concurrency.
+
+The requirement is to support many concurrent read threads, and
+allow updates to occur without blocking readers (or blocking readers
+as little as possible).
+
+The strategy is to use "copy-on-write", that is, when an update
+needs to alter the trie it makes a copy of the parts that it needs
+to change, so that concurrent readers can continue to use the
+original. (It is analogous to multiversion concurrency in databases
+such as PostgreSQL, where copy-on-write uses a write-ahead log.)
+
+Software that uses copy-on-write needs some mechanism for clearing
+away old versions that are no longer in use. (For example, VACUUM in
+PostgreSQL.) The qp-trie code uses a custom allocator with a simple
+garbage collector; as well as supporting concurrency, the qp-trie's
+memory manager makes tries smaller and faster.
+
+
+allocation
+----------
+
+A qp-trie is relatively demanding on its allocator. Twigs vectors
+can be lots of different sizes, and every mutation of the trie
+requires an alloc and/or a free.
+
+Older versions of the qp-trie code used the system allocator. Many
+allocators (such as `jemalloc`) segregate the heap into different
+size classes, so that each chunk of memory is dedicated to
+allocations of the same size. While this memory layout provides good
+locality when objects of the same type have the same size, it tends
+to scatter the interior nodes of a qp-trie all over the address space.
+
+BIND's qp-trie code uses a "bump allocator" for its interior nodes,
+which is one of the simplest and fastest possible: an allocation
+usually only requires incrementing a pointer and checking if it has
+reached a limit. (If the check fails the allocator goes into its
+slow path.) Allocations have good locality because they write
+sequentially into memory. (A bit like a write-ahead log.)
+
+Bump allocators need reasonably large contiguous chunks of empty
+memory to make the most of their efficiency, so they are often
+coupled with some kind of compacting garbage collector, which
+defragments the heap to recover free space.
+
+See `alloc_twigs()` in `lib/dns/qp.c` for the bump allocator fast
+path.
+
+
+garbage collection
+------------------
+
+[The Garbage Collection Handbook](https://gchandbook.org/) says
+there are four basic kinds of automatic memory management.
+
+Reference counting is used by scripting languages such as Perl and
+Python, and also for manual memory management such as in operating
+system kernels and BIND.
+
+To avoid writing a custom allocator, I previously tried adapting the
+qp-trie code to use refcounting to support copy-on-write, but I was
+not very happy with the complexity of the implementation, and I
+thought it was ugly that I needed to modify refcounts in nodes that
+were logically read-only.
+
+(Two other kinds of GC are mark-sweep and mark-compact. Both of them
+have a similar disadvantage to refcounting: a simple GC mark phase
+modifies nodes that are logically read-only. And mark-sweep leaves
+memory fragmented so it does not support a bump allocator.)
+
+The fourth kind is copying garbage collection. It works well with a
+bump allocator, because copying the data structure using a bump
+allocator in the most obvious way naturally compacts the data. And
+the copying phase of the GC can run concurrently with readers
+without interference.
+
+BIND's qp-trie code uses a copying garbage collector only for its
+interior nodes. The value objects that are attached to the leaves of
+the trie are allocated by `isc_mem` and use reference counting like
+the rest of BIND.
+
+See `compact()` in `lib/dns/qp.c` for the copying phase of the
+garbage collector. Reference counting for value objects is handled
+by the `attach()` and `detach()` qp-trie methods.
+
+
+memory layout
+-------------
+
+BIND's qp-trie code organizes its memory as a collection of "chunks",
+each of which is a few pages in size and large enough to hold a few
+thousand nodes.
+
+Most memory management is per-chunk: obtaining memory from the
+system allocator and returning it; keeping track of which chunks are
+in use by readers, and which chunks can be mutated; and counting
+whether chunks are fragmented enough to need garbage collection.
+
+As noted above, we also use the chunk-based layout to reduce the size
+of interior nodes. Instead of using a native pointer (typically 64
+bits) to refer to a node, we use a 32 bit integer containing the chunk
+number and the position of the node in the chunk. This reduces the
+memory used by interior nodes by 25%.
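
A sketch of how a 32-bit reference might be split into a chunk number and a cell number, and of a 12-byte node made from three 32-bit words. The 12-bit cell field, the type, and the function names are assumptions for this example; the real definitions are in `lib/dns/qp_p.h`.

```c
#include <stdint.h>

#define CELL_BITS 12 /* assumption: up to 4096 nodes per chunk */
#define CELL_MASK ((UINT32_C(1) << CELL_BITS) - 1)

/*
 * Three 32-bit words give a 12-byte node with 4-byte alignment. A
 * struct holding a uint64_t and a uint32_t would usually be padded
 * to 16 bytes.
 */
struct node {
    uint32_t biglo, bighi; /* the two halves of the 64-bit word */
    uint32_t small;        /* the 32-bit word, e.g. a twigs reference */
};
_Static_assert(sizeof(struct node) == 12, "node must be 12 bytes");

uint32_t
make_ref(uint32_t chunk, uint32_t cell) {
    return (chunk << CELL_BITS) | (cell & CELL_MASK);
}

uint32_t
ref_chunk(uint32_t ref) {
    return ref >> CELL_BITS;
}

uint32_t
ref_cell(uint32_t ref) {
    return ref & CELL_MASK;
}

/* `base` is the trie's array of chunk start addresses. */
struct node *
ref_ptr(struct node *const *base, uint32_t ref) {
    return base[ref_chunk(ref)] + ref_cell(ref);
}
```

The saving quoted above follows from the sizes: 12 bytes instead of 16 bytes per node is a reduction of one quarter.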
+
+In `lib/dns/qp_p.h`, the _"main qp-trie structures"_ hold information
+about a trie's chunks. Most of the chunk handling code is in the
+_"allocator"_ and _"chunk reclamation"_ sections in `lib/dns/qp.c`.
+
+
+lifecycle of value objects
+--------------------------
+
+A leaf node contains a pointer to a value object that is not managed
+by the qp-trie garbage collector. Instead, the user provides
+`attach` and `detach` methods that the qp-trie code calls to update
+the reference counts in the value objects.
+
+Value object reference counts do not indicate whether the object is
+mutable: its refcount can be 1 while it is only in use by readers
+(and must be left unchanged), or newly created by a writer (and
+therefore mutable).
+
+So, callers must themselves keep track of whether leaf objects are newly
+inserted (and therefore mutable) or not. XXXFANF this might change, by
+adding special lookup functions that return whether leaf objects are
+mutable - see the "todo" in `include/dns/qp.h`.
+
+
+locking and RCU
+---------------
+
+The Linux kernel has a collection of copy-on-write schemes collectively
+called read-copy-update; there is also https://liburcu.org/ for RCU in
+userspace. RCU is attractively speedy: readers can proceed without
+blocking at all; writers can proceed concurrently with readers, and
+updates can be committed without blocking. A commit is just a single
+atomic pointer update. RCU only requires writers to block when waiting
+for a "grace period" while older readers complete their critical
+sections, after which the writer can free memory that is no longer in
+use. Writers must also block on a mutex to ensure there is only one
+writer at a time.
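
The "single atomic pointer update" can be illustrated with C11 atomics. This sketch shows only how a new version is published and observed. It does not implement grace periods, the writer mutex, or memory reclamation, and the type and function names are hypothetical rather than taken from liburcu or BIND.

```c
#include <stdatomic.h>

/* Stands in for the root of one version of the trie. */
struct version {
    int serial;
};

struct multi {
    _Atomic(struct version *) current;
};

/* Reader: pick up whichever version is current; this never blocks. */
struct version *
reader_snapshot(struct multi *m) {
    return atomic_load_explicit(&m->current, memory_order_acquire);
}

/*
 * Writer: publish `next` and return the previous version. The caller
 * must not free the previous version until every reader that might
 * still be using it has finished.
 */
struct version *
writer_commit(struct multi *m, struct version *next) {
    return atomic_exchange_explicit(&m->current, next,
                                    memory_order_acq_rel);
}
```

The acquire and release orderings ensure that a reader which sees the new pointer also sees the nodes the writer built before committing.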
+
+The qp-trie concurrency strategy is designed to be able to use RCU, but
+RCU is not required. Instead of RCU we can use a reader-writer lock.
+This requires readers to block when a writer commits, which (in RCU
+style) just requires an atomic pointer swap. The rwlock also changes
+when writers must block: commits must wait for readers to exit their
+critical sections, but there is no further waiting to be able to release
+memory.
+
+In BIND, there are two kinds of reader: queries, which are relatively
+quick, and zone transfers, which are relatively slow. BIND's dbversion
+machinery allows updates to proceed while there are long-running zone
+transfers. RCU supports this without further machinery, but a
+reader-writer lock needs some help so that long-running readers can
+avoid blocking writers.
+
+To avoid blocking updates, long-running readers can take a snapshot of a
+qp-trie, which only requires copying the allocator's chunk array. After
+a writer commits, it does not release memory if there are any
+snapshots. Instead, chunks that are no longer needed by the latest
+version of the trie are stashed on a list to be released later,
+analogous to RCU waiting for a grace period.
+
+The locking occurs only in the functions under _"read-write
+transactions"_ and _"read-only transactions"_ in `lib/dns/qp.c`.
+
+
+immutability and copy-on-write
+------------------------------
+
+A qp-trie has a `generation` counter which is incremented by each
+write transaction. We keep track of which generation each chunk was
+created in; only chunks created in the current generation are
+mutable, because older chunks may be in use by concurrent readers.
+
+This logic is implemented by `chunk_alloc()` and `chunk_mutable()`
+in `lib/dns/qp.c`.
+
+The `make_twigs_mutable()` function ensures that a node is mutable,
+copying it if necessary.
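
The generation check can be sketched as follows. The structures and function names here are hypothetical simplifications for illustration, not the definitions in `lib/dns/qp.c`.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-chunk and per-trie bookkeeping. */
struct chunk_info {
    uint32_t generation; /* generation in which the chunk was created */
};

struct trie {
    uint32_t generation; /* bumped by each write transaction */
    struct chunk_info *chunks;
};

/* Start a write transaction: every existing chunk becomes immutable. */
void
begin_write(struct trie *t) {
    t->generation++;
}

/* Record that a chunk was created by the current transaction. */
void
note_new_chunk(struct trie *t, uint32_t chunk) {
    t->chunks[chunk].generation = t->generation;
}

/* Only chunks created in the current generation may be modified. */
bool
chunk_is_mutable(const struct trie *t, uint32_t chunk) {
    return t->chunks[chunk].generation == t->generation;
}
```

A writer that wants to change a node in an immutable chunk first copies the node into a chunk from the current generation and then edits the copy.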
+
+The chunk arrays are a mixture of mutable and immutable. Pointers to
+immutable chunks are immutable; new chunks can be assigned to unused
+entries; and entries are cleared when it is safe to reclaim the chunks
+they refer to. If the chunk arrays need to be expanded, the existing
+arrays are retained for use by readers, and the writer uses the
+expanded arrays (see `alloc_slow()`). The old arrays are cleaned up
+after the writer commits.
+
+
+update transactions
+-------------------
+
+A typical heavy-weight `update` transaction comprises:
+
+  * make a copy of the chunk arrays in case we need to roll back
+
+  * get a freshly allocated chunk where new nodes or copied nodes
+    can be written
+
+  * make any changes that are required; nodes in old chunks are
+    copied to the new space first; new nodes are modified in place
+    to avoid creating unnecessary garbage
+
+  * when the updates are finished, and before committing, run the
+    garbage collector to clear out chunks that were fragmented by the
+    update
+
+  * shrink the allocation chunk to eliminate unused space
+
+  * commit the update by flipping the root pointer of the trie; this
+    is the only point that needs a multithreading interlock
+
+  * free any chunks that were emptied by the garbage collector
+
+A lightweight `write` transaction is similar, except that:
+
+  * rollback is not supported
+
+  * any existing allocation chunk is reused if possible
+
+  * the garbage collector is not run before committing
+
+  * the allocation chunk is not shrunk
+
+
+testing strategies
+------------------
+
+The main qp-trie test is in `tests/dns/qpmulti_test.c`. This uses
+randomized testing of the transactional API, with a lot of consistency
+checking to detect bugs.
+
+There are also a couple of fuzzers, which aim to benefit from
+coverage-guided exploration of the test space and test minimization.
+In `fuzz/dns_qp.c` we treat the fuzzer input as a bytecode to exercise
+the single-threaded API, and `fuzz/dns_qpkey_name.c` checks conversion
+from DNS names to lookup keys.
+
+In `tests/bench` there are a few benchmarks. `load-names` does a very
+basic comparison between BIND's hash table, red-black tree, and
+qp-trie. `qpmulti` checks multicore performance of the transactional
+API (similar to `qpmulti_test` but without the consistency checking).
+And `qp-dump` is a utility for printing out the contents of a qp-trie.
+
+John Regehr has some nice essays about testing data structures:
+
+  * Levels of fuzzing: https://blog.regehr.org/archives/1039
+
+    (how much semantic knowledge does your fuzzer have?)
+
+  * Testing with small capacities: https://blog.regehr.org/archives/1138
+
+    (I need to be able to change the chunk size)
+
+  * Write fuzzable code: https://blog.regehr.org/archives/1687
+
+  * Oracles for random testing: https://blog.regehr.org/archives/856
+
+
+warning: generational collection
+--------------------------------
+
+The "generational hypothesis" is that most allocations have a short
+lifetime, so it is profitable for a garbage collector to split its
+heap into a number of generations. The youngest generation is where
+allocations happen; it typically uses a bump allocator, and when the
+allocation pointer reaches its limit, the youngest generation's
+contents are copied to the second generation. The hypothesis is that
+only a small fraction of the youngest generation will still be live
+when the GC runs, so this copy will not take much time or space.
+
+For a qp-trie the truth of this hypothesis depends on the order in
+which keys are added or removed. It may be true if there is good
+locality, for example, adding keys in lexicographic order, but not in
+general.
+
+When a qp-trie is mutated, only one node needs to be altered, near the
+leaf that is added or removed. Nodes near the root of the trie tend to
+be more stable and long-lived. However, during a copy-on-write
+transaction, the path from the root to an altered leaf must be copied,
+so nodes near the root are no longer stable and long-lived. They may
+become stable in a long transaction, but that isn't guaranteed.
+
+So the idea of generational garbage collection seems to be unhelpful
+for a qp-trie.
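
Reviewer note (not part of the patch): the point about copy-on-write making root-side nodes short-lived can be seen in a toy model. The sketch below is a persistent bitwise trie over 8-bit keys, not the qp-trie node layout; `node_t`, `cow_insert()`, `lookup()` and the `copies` counter are hypothetical names invented for illustration. It assumes one insert per transaction, so every insert reallocates the whole path from the root, while untouched subtrees are shared with the old version. Nodes are never freed, to keep it short.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Toy persistent trie: one node per key bit. NOT the qp-trie layout. */
typedef struct node {
	struct node *child[2];
} node_t;

/* Number of nodes allocated so far (hypothetical instrumentation). */
static unsigned int copies;

/*
 * Return a new version of the trie that contains `key`. The old
 * version is left untouched: every node on the path from the root
 * to the new leaf is freshly allocated, and all other subtrees are
 * shared between the old and new versions.
 */
static node_t *
cow_insert(const node_t *old, uint8_t key, int depth) {
	node_t *n = calloc(1, sizeof(*n));
	if (n == NULL) {
		abort();
	}
	copies++;
	if (old != NULL) {
		*n = *old; /* share both subtrees for now */
	}
	if (depth < 8) {
		int bit = (key >> (7 - depth)) & 1;
		n->child[bit] = cow_insert(n->child[bit], key, depth + 1);
	}
	return (n);
}

/* Return 1 if `key` is present in this version of the trie. */
static int
lookup(const node_t *n, uint8_t key) {
	for (int depth = 0; n != NULL && depth < 8; depth++) {
		n = n->child[(key >> (7 - depth)) & 1];
	}
	return (n != NULL);
}
```

In this model the root is replaced by every insert, so the oldest node in the trie is also the one with the shortest lifetime, which is the opposite of what a generational collector hopes for.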
--- a/lib/dns/Makefile.am
+++ b/lib/dns/Makefile.am
@@ -99,6 +99,7 @@ libdns_la_HEADERS =			\
 	include/dns/order.h		\
 	include/dns/peer.h		\
 	include/dns/private.h		\
+	include/dns/qp.h		\
 	include/dns/rbt.h		\
 	include/dns/rcode.h		\
 	include/dns/rdata.h		\
@@ -157,6 +158,7 @@ libdns_la_SOURCES =			\
 	cache.c				\
 	callbacks.c			\
 	catz.c				\
+	client.c			\
 	clientinfo.c			\
 	compress.c			\
 	db.c				\
@@ -206,6 +208,8 @@ libdns_la_SOURCES =			\
 	order.c				\
 	peer.c				\
 	private.c			\
+	qp.c				\
+	qp_p.h				\
 	rbt.c				\
 	rbtdb.h				\
 	rbtdb.c				\
@@ -233,18 +237,17 @@ libdns_la_SOURCES =			\
 	transport.c			\
 	tkey.c				\
 	tsig.c				\
+	tsig_p.h			\
 	ttl.c				\
 	update.c			\
 	validator.c			\
 	view.c				\
 	xfrin.c				\
 	zone.c				\
+	zone_p.h			\
 	zoneverify.c			\
 	zonekey.c			\
-	zt.c				\
-	client.c			\
-	tsig_p.h			\
-	zone_p.h
+	zt.c

 if HAVE_GSSAPI
 libdns_la_SOURCES +=			\
--- a/lib/dns/include/dns/log.h
+++ b/lib/dns/include/dns/log.h
@@ -80,6 +80,7 @@ extern isc_logmodule_t	 dns_modules[];
 #define DNS_LOGMODULE_DYNDB	 (&dns_modules[30])
 #define DNS_LOGMODULE_DNSTAP	 (&dns_modules[31])
 #define DNS_LOGMODULE_SSU	 (&dns_modules[32])
+#define DNS_LOGMODULE_QP	 (&dns_modules[33])

 ISC_LANG_BEGINDECLS

--- a/lib/dns/include/dns/qp.h
+++ b/lib/dns/include/dns/qp.h
@@ -0,0 +1,574 @@
+/*
+ * Copyright (C) Internet Systems Consortium, Inc. ("ISC")
+ *
+ * SPDX-License-Identifier: MPL-2.0
+ *
+ * This Source Code Form is subject to the terms of the Mozilla Public
+ * License, v. 2.0. If a copy of the MPL was not distributed with this
+ * file, you can obtain one at https://mozilla.org/MPL/2.0/.
+ *
+ * See the COPYRIGHT file distributed with this work for additional
+ * information regarding copyright ownership.
+ */
+
+#pragma once
+
+/*
+ * A qp-trie is a kind of key -> value map, supporting lookups that are
+ * aware of the lexicographic order of keys.
+ *
+ * Keys are `dns_qpkey_t`, which is a string-like thing, usually created
+ * from a DNS name. You can use both relative and absolute DNS names as
+ * keys.
+ *
+ * Leaf values are a pair of a `void *` pointer and a `uint32_t`
+ * (because that is what fits inside an internal qp-trie leaf node).
+ *
+ * The trie does not store keys; instead keys are derived from leaf values
+ * by calling a method provided by the user.
+ *
+ * There are a few flavours of qp-trie.
+ *
+ * The basic `dns_qp_t` supports single-threaded read/write access.
+ *
+ * A `dns_qpmulti_t` is a wrapper that supports multithreaded access.
+ * There can be many concurrent readers and a single writer. Writes are
+ * transactional, and support multi-version concurrency.
+ *
+ * The concurrency strategy uses copy-on-write. When making changes during
+ * a transaction, the caller must not modify leaf values in place, but
+ * instead delete the old leaf from the trie and insert a replacement. Leaf
+ * values have reference counts, which will indicate when the old leaf
+ * value can be freed after it is no longer needed by readers using an old
+ * version of the trie.
+ *
+ * For fast concurrent reads, call `dns_qpmulti_query()` to get a
+ * `dns_qpread_t`. Readers can access a single version of the trie between
+ * write commits. Most write activity is not blocked by readers, but reads
+ * must finish before a write can commit (a read-write lock blocks
+ * commits).
+ *
+ * For long-running reads that need a stable view of the trie, while still
+ * allowing commits to proceed, call `dns_qpmulti_snapshot()` to get a
+ * `dns_qpsnap_t`. It briefly gets the write mutex while creating the
+ * snapshot, which requires allocating a copy of some of the trie's
+ * metadata. A snapshot is for relatively heavy long-running read-only
+ * operations such as zone transfers.
+ *
+ * While snapshots exist, a qp-trie cannot reclaim memory: it does not
+ * retain detailed information about which memory is used by which
+ * snapshots, so it pessimistically retains all memory that might be
+ * used by old versions of the trie.
+ *
+ * You can start one read-write transaction at a time using
+ * `dns_qpmulti_write()` or `dns_qpmulti_update()`. Either way, you
+ * get a `dns_qp_t` that can be modified like a single-threaded trie,
+ * without affecting other read-only query or snapshot users of the
+ * `dns_qpmulti_t`. Committing a transaction only blocks readers
+ * briefly when flipping the active readonly `dns_qp_t` pointer.
+ *
+ * "Update" transactions are heavyweight. They allocate working memory to
+ * hold modifications to the trie, and compact the trie before committing.
+ * For extra space savings, a partially-used allocation chunk is shrunk to
+ * the smallest size possible. Unlike "write" transactions, an "update"
+ * transaction can be rolled back instead of committed. (Update
+ * transactions are intended for things like authoritative zones, where it
+ * is important to keep the per-trie memory overhead low because there can
+ * be a very large number of them.)
+ *
+ * "Write" transactions are more lightweight: they skip the allocation and
+ * compaction at the start and end of the transaction. (Write transactions
+ * are intended for frequent small changes, as in the DNS cache.)
+ */
+
+/***********************************************************************
+ *
+ *  types
+ */
+
+#include <isc/attributes.h>
+
+#include <dns/types.h>
+
+/*%
+ * A `dns_qp_t` supports single-threaded read/write access.
+ */
+typedef struct dns_qp dns_qp_t;
+
+/*%
+ * A `dns_qpmulti_t` supports multi-version concurrent reads and transactional
+ * modification.
+ */
+typedef struct dns_qpmulti dns_qpmulti_t;
+
+/*%
+ * A `dns_qpread_t` is a lightweight read-only handle on a `dns_qpmulti_t`.
+ */
+typedef struct dns_qpread dns_qpread_t;
+
+/*%
+ * A `dns_qpsnap_t` is a heavier read-only snapshot of a `dns_qpmulti_t`.
+ */
+typedef struct dns_qpsnap dns_qpsnap_t;
+
+/*
+ * The read-only qp-trie functions can work on either of the read-only
+ * qp-trie types or the general-purpose read-write `dns_qp_t`. They
+ * rely on the fact that all the `dns_qpreadable_t` structures start
+ * with a `dns_qpread_t`.
+ */
+typedef union dns_qpreadable {
+	dns_qpread_t *qpr;
+	dns_qpsnap_t *qps;
+	dns_qp_t     *qpt;
+} dns_qpreadable_t __attribute__((__transparent_union__));
+
+#define dns_qpreadable_cast(qp) ((qp).qpr)
+
+/*%
+ * A trie lookup key is a small array, allocated on the stack during trie
+ * searches. Keys are usually created on demand from DNS names using
+ * `dns_qpkey_fromname()`, but in principle you can define your own
+ * functions to convert other types to trie lookup keys.
+ *
+ * A domain name can be up to 255 bytes. When converted to a key, each
+ * character in the name corresponds to one byte in the key if it is a
+ * common hostname character; otherwise unusual characters are escaped,
+ * using two bytes in the key. So we allow keys to be up to 512 bytes.
+ * (The actual max is (255 - 5) * 2 + 6 == 506)
+ *
+ * Every byte of a key must be greater than 0 and less than 48. Elements
+ * after the end of the key are treated as having the value 1.
+ */
+typedef uint8_t dns_qpkey_t[512];
+
+/*%
+ * These leaf methods allow the qp-trie code to call back to the code
+ * responsible for the leaf values that are stored in the trie. The
+ * methods are provided for a whole trie when the trie is created.
+ *
+ * The qp-trie is also given a context pointer that is passed to the
+ * methods, so the methods know about the trie's context as well as a
+ * particular leaf value.
+ *
+ * The `attach` and `detach` methods adjust reference counts on value
+ * objects. They support copy-on-write and safe memory reclamation
+ * needed for multi-version concurrency.
+ *
+ * Note: When a value object reference count is greater than one, the
+ * object is in use by concurrent readers so it must not be modified. A
+ * refcount equal to one does not indicate whether or not the object is
+ * mutable: its refcount can be 1 while it is only in use by readers (and
+ * must be left unchanged), or newly created by a writer (and therefore
+ * mutable).
+ *
+ * The `makekey` method fills in a `dns_qpkey_t` corresponding to a
+ * value object stored in the qp-trie. It returns the length of the
+ * key. This method will typically call dns_qpkey_fromname() with a
+ * name stored in the value object.
+ *
+ * For logging and tracing, the `triename` method copies a human-
+ * readable identifier into `buf` which has max length `size`.
+ */
+typedef struct dns_qpmethods {
+	void (*attach)(void *ctx, void *pval, uint32_t ival);
+	void (*detach)(void *ctx, void *pval, uint32_t ival);
+	size_t (*makekey)(dns_qpkey_t key, void *ctx, void *pval,
+			  uint32_t ival);
+	void (*triename)(void *ctx, char *buf, size_t size);
+} dns_qpmethods_t;
+
+/*%
+ * Buffers for use by the `triename()` method need to be large enough
+ * to hold a zone name and a few descriptive words.
+ */
+#define DNS_QP_TRIENAME_MAX 300
+
+/*%
+ * A container for the counters returned by `dns_qp_memusage()`
+ */
+typedef struct dns_qp_memusage {
+	void  *ctx;	    /*%< qp-trie method context */
+	size_t leaves;	    /*%< values in the trie */
+	size_t live;	    /*%< nodes in use */
+	size_t used;	    /*%< allocated nodes */
+	size_t hold;	    /*%< nodes retained for readers */
+	size_t free;	    /*%< nodes to be reclaimed */
+	size_t node_size;   /*%< in bytes */
+	size_t chunk_size;  /*%< nodes per chunk */
+	size_t chunk_count; /*%< allocated chunks */
+	size_t bytes;	    /*%< total memory in chunks and metadata */
+} dns_qp_memusage_t;
+
+/***********************************************************************
+ *
+ *  functions - create, destroy, enquire
+ */
+
+void
+dns_qp_create(isc_mem_t *mctx, const dns_qpmethods_t *methods, void *ctx,
+	      dns_qp_t **qptp);
+/*%<
+ * Create a single-threaded qp-trie.
+ *
+ * Requires:
+ * \li  `mctx` is a pointer to a valid memory context.
+ * \li  all the methods are non-NULL
+ * \li  `qptp != NULL && *qptp == NULL`
+ *
+ * Ensures:
+ * \li  `*qptp` is a pointer to a valid single-threaded qp-trie
+ */
+
+void
+dns_qp_destroy(dns_qp_t **qptp);
+/*%<
+ * Destroy a single-threaded qp-trie.
+ *
+ * Requires:
+ * \li  `qptp != NULL`
+ * \li  `*qptp` is a pointer to a valid single-threaded qp-trie
+ *
+ * Ensures:
+ * \li  all memory allocated by the qp-trie has been released
+ * \li  `*qptp` is NULL
+ */
+
+void
+dns_qpmulti_create(isc_mem_t *mctx, const dns_qpmethods_t *methods, void *ctx,
+		   dns_qpmulti_t **qpmp);
+/*%<
+ * Create a multi-threaded qp-trie.
+ *
+ * Requires:
+ * \li  `mctx` is a pointer to a valid memory context.
+ * \li  all the methods are non-NULL
+ * \li  `qpmp != NULL && *qpmp == NULL`
+ *
+ * Ensures:
+ * \li  `*qpmp` is a pointer to a valid multi-threaded qp-trie
+ */
+
+void
+dns_qpmulti_destroy(dns_qpmulti_t **qpmp);
+/*%<
+ * Destroy a multi-threaded qp-trie.
+ *
+ * Requires:
+ * \li  `qpmp != NULL`
+ * \li  `*qpmp` is a pointer to a valid multi-threaded qp-trie
+ * \li  there are no write or update transactions in progress
+ * \li  no snapshots exist
+ *
+ * Ensures:
+ * \li  all memory allocated by the qp-trie has been released
+ * \li  `*qpmp` is NULL
+ */
+
+void
+dns_qp_compact(dns_qp_t *qp);
+/*%<
+ * Defragment the entire qp-trie and release unused memory.
+ *
+ * When modifications make a trie too fragmented, it is automatically
+ * compacted. Automatic compaction avoids compacting chunks that are not
+ * fragmented to save time, but this function compacts the entire trie to
+ * defragment it as much as possible.
+ *
+ * This function can be used with a single-threaded qp-trie and during a
+ * transaction on a multi-threaded trie.
+ *
+ * Requires:
+ * \li  `qp` is a pointer to a valid qp-trie
+ */
+
+void
+dns_qp_gctime(uint64_t *compact_us, uint64_t *recover_us,
+	      uint64_t *rollback_us);
+/*%<
+ * Get the total times spent on garbage collection in microseconds.
+ *
+ * These counters are global, covering every qp-trie in the program.
+ *
+ * XXXFANF This is a placeholder until we can record times in histograms.
+ */
+
+dns_qp_memusage_t
+dns_qp_memusage(dns_qp_t *qp);
+/*%<
+ * Get the memory counters from a qp-trie
+ *
+ * Requires:
+ * \li  `qp` is a pointer to a valid qp-trie
+ *
+ * Returns:
+ * \li  a `dns_qp_memusage_t` structure described above
+ */
+
+/***********************************************************************
+ *
+ *  functions - search, modify
+ */
+
+/*
+ * XXXFANF todo, based on what we discover BIND needs
+ *
+ * fancy searches: longest match, lexicographic predecessor,
+ * etc.
+ *
+ * do we need specific lookup functions to find out if the
+ * returned value is readonly or mutable?
+ *
+ * richer modification such as dns_qp_replace{key,name}
+ *
+ * iteration - probably best to put an explicit stack in the iterator,
+ * cf. rbtnodechain
+ */
+
+size_t
+dns_qpkey_fromname(dns_qpkey_t key, const dns_name_t *name);
+/*%<
+ * Convert a DNS name into a trie lookup key.
+ *
+ * Requires:
+ * \li  `name` is a pointer to a valid `dns_name_t`
+ *
+ * Returns:
+ * \li  the length of the key
+ */
+
+isc_result_t
+dns_qp_getkey(dns_qpreadable_t qpr, const dns_qpkey_t searchk, size_t searchl,
+	      void **pval_r, uint32_t *ival_r);
+/*%<
+ * Find a leaf in a qp-trie that matches the given key
+ *
+ * The leaf values are assigned to `*pval_r` and `*ival_r`
+ *
+ * Requires:
+ * \li  `qpr` is a pointer to a readable qp-trie
+ * \li  `pval_r != NULL`
+ * \li  `ival_r != NULL`
+ *
+ * Returns:
+ * \li  ISC_R_NOTFOUND if the trie has no leaf with a matching key
+ * \li  ISC_R_SUCCESS if the leaf was found
+ */
+
+isc_result_t
+dns_qp_getname(dns_qpreadable_t qpr, const dns_name_t *name, void **pval_r,
+	       uint32_t *ival_r);
+/*%<
+ * Find a leaf in a qp-trie that matches the given DNS name
+ *
+ * The leaf values are assigned to `*pval_r` and `*ival_r`
+ *
+ * Requires:
+ * \li  `qpr` is a pointer to a readable qp-trie
+ * \li  `name` is a pointer to a valid `dns_name_t`
+ * \li  `pval_r != NULL`
+ * \li  `ival_r != NULL`
+ *
+ * Returns:
+ * \li  ISC_R_NOTFOUND if the trie has no leaf with a matching key
+ * \li  ISC_R_SUCCESS if the leaf was found
+ */
+
+isc_result_t
+dns_qp_insert(dns_qp_t *qp, void *pval, uint32_t ival);
+/*%<
+ * Insert a leaf into a qp-trie
+ *
+ * Requires:
+ * \li  `qp` is a pointer to a valid qp-trie
+ * \li  `pval != NULL`
+ * \li  `alignof(pval) > 1`
+ *
+ * Returns:
+ * \li  ISC_R_EXISTS if the trie already has a leaf with the same key
+ * \li  ISC_R_SUCCESS if the leaf was added to the trie
+ */
+
+isc_result_t
+dns_qp_deletekey(dns_qp_t *qp, const dns_qpkey_t key, size_t len);
+/*%<
+ * Delete a leaf from a qp-trie that matches the given key
+ *
+ * Requires:
+ * \li  `qp` is a pointer to a valid qp-trie
+ *
+ * Returns:
+ * \li  ISC_R_NOTFOUND if the trie has no leaf with a matching key
+ * \li  ISC_R_SUCCESS if the leaf was deleted from the trie
+ */
+
+isc_result_t
+dns_qp_deletename(dns_qp_t *qp, const dns_name_t *name);
+/*%<
+ * Delete a leaf from a qp-trie that matches the given DNS name
+ *
+ * Requires:
+ * \li  `qp` is a pointer to a valid qp-trie
+ * \li  `name` is a pointer to a valid `dns_name_t`
+ *
+ * Returns:
+ * \li  ISC_R_NOTFOUND if the trie has no leaf with a matching name
+ * \li  ISC_R_SUCCESS if the leaf was deleted from the trie
+ */
+
+/***********************************************************************
+ *
+ *  functions - transactions
+ */
+
+void
+dns_qpmulti_query(dns_qpmulti_t *multi, dns_qpread_t **qprp);
+/*%<
+ * Start a lightweight (brief) read-only transaction
+ *
+ * This takes a read lock on `multi`'s rwlock that prevents
+ * transactions from committing.
+ *
+ * Requires:
+ * \li  `multi` is a pointer to a valid multi-threaded qp-trie
+ * \li  `qprp != NULL`
+ * \li  `*qprp == NULL`
+ *
+ * Returns:
+ * \li  `*qprp` is a pointer to a valid read-only qp-trie handle
+ */
+
+void
+dns_qpread_destroy(dns_qpmulti_t *multi, dns_qpread_t **qprp);
+/*%<
+ * End a lightweight read transaction, i.e. release read lock
+ *
+ * Requires:
+ * \li  `multi` is a pointer to a valid multi-threaded qp-trie
+ * \li  `qprp != NULL`
+ * \li  `*qprp` is a read-only qp-trie handle obtained from `multi`
+ *
+ * Returns:
+ * \li  `*qprp == NULL`
+ */
+
+void
+dns_qpmulti_snapshot(dns_qpmulti_t *multi, dns_qpsnap_t **qpsp);
+/*%<
+ * Start a heavyweight (long) read-only transaction
+ *
+ * This function briefly takes and releases the modification mutex
+ * while allocating a copy of the trie's metadata. While the snapshot
+ * exists it does not interfere with other read-only or read-write
+ * transactions on the trie, except that memory cannot be reclaimed.
+ *
+ * Requires:
+ * \li  `multi` is a pointer to a valid multi-threaded qp-trie
+ * \li  `qpsp != NULL`
+ * \li  `*qpsp == NULL`
+ *
+ * Returns:
+ * \li  `*qpsp` is a pointer to a snapshot obtained from `multi`
+ */
+
+void
+dns_qpsnap_destroy(dns_qpmulti_t *multi, dns_qpsnap_t **qpsp);
+/*%<
+ * End a heavyweight read transaction
+ *
+ * If this is the last remaining snapshot belonging to `multi` then
+ * this function takes the modification mutex in order to free() any
+ * memory that is no longer in use.
+ *
+ * Requires:
+ * \li  `multi` is a pointer to a valid multi-threaded qp-trie
+ * \li  `qpsp != NULL`
+ * \li  `*qpsp` is a pointer to a snapshot obtained from `multi`
+ *
+ * Returns:
+ * \li  `*qpsp == NULL`
+ */
+
+void
+dns_qpmulti_update(dns_qpmulti_t *multi, dns_qp_t **qptp);
+/*%<
+ * Start a heavyweight write transaction
+ *
+ * This style of transaction allocates a copy of the trie's metadata to
+ * support rollback, and it aims to minimize the memory usage of the
+ * trie between transactions. The trie is compacted when the transaction
+ * commits, and any partly-used chunk is shrunk to fit.
+ *
+ * During the transaction, the modification mutex is held.
+ *
+ * Requires:
+ * \li  `multi` is a pointer to a valid multi-threaded qp-trie
+ * \li  `qptp != NULL`
+ * \li  `*qptp == NULL`
+ *
+ * Returns:
+ * \li  `*qptp` is a pointer to the modifiable qp-trie inside `multi`
+ */
+
+void
+dns_qpmulti_write(dns_qpmulti_t *multi, dns_qp_t **qptp);
+/*%<
+ * Start a lightweight write transaction
+ *
+ * This style of transaction does not need extra allocations in addition
+ * to the ones required by insert and delete operations. It is intended
+ * for a large trie that gets frequent small writes, such as a DNS
+ * cache.
+ *
+ * During the transaction, the modification mutex is held.
+ *
+ * Requires:
+ * \li  `multi` is a pointer to a valid multi-threaded qp-trie
+ * \li  `qptp != NULL`
+ * \li  `*qptp == NULL`
+ *
+ * Returns:
+ * \li  `*qptp` is a pointer to the modifiable qp-trie inside `multi`
+ */
+
+void
+dns_qpmulti_commit(dns_qpmulti_t *multi, dns_qp_t **qptp);
+/*%<
+ * Complete a modification transaction
+ *
+ * The commit itself only requires flipping the read pointer inside
+ * `multi` from the old version of the trie to the new version. This
+ * function takes a write lock on `multi`'s rwlock just long enough to
+ * flip the pointer. This briefly blocks `query` readers.
+ *
+ * This function releases the modification mutex after the post-commit
+ * memory reclamation is completed.
+ *
+ * Requires:
+ * \li  `multi` is a pointer to a valid multi-threaded qp-trie
+ * \li  `qptp != NULL`
+ * \li  `*qptp` is a pointer to the modifiable qp-trie inside `multi`
+ *
+ * Returns:
+ * \li  `*qptp == NULL`
+ */
+
+void
+dns_qpmulti_rollback(dns_qpmulti_t *multi, dns_qp_t **qptp);
+/*%<
+ * Abandon an update transaction
+ *
+ * This function reclaims the memory allocated during the transaction
+ * and releases the modification mutex.
+ *
+ * Requires:
+ * \li  `multi` is a pointer to a valid multi-threaded qp-trie
+ * \li  `qptp != NULL`
+ * \li  `*qptp` is a pointer to the modifiable qp-trie inside `multi`
+ * \li  `*qptp` was obtained from `dns_qpmulti_update()`
+ *
+ * Returns:
+ * \li  `*qptp == NULL`
+ */
+
+/**********************************************************************/
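
Reviewer note (not part of the patch): to make the leaf-method contract in `dns_qpmethods_t` concrete, here is a minimal stand-alone sketch of a method table for a hypothetical value type. The `dns_qpkey_t` and `dns_qpmethods_t` declarations are copied from the header above so that the sketch compiles without BIND. Everything named `item_*` or `toy_*` is an assumption made up for illustration. In particular, `item_makekey()` uses a toy letter-to-byte mapping rather than `dns_qpkey_fromname()`, and the refcount is a plain integer where real code would need an atomic counter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Copied from <dns/qp.h> so that this sketch is self-contained. */
typedef uint8_t dns_qpkey_t[512];

typedef struct dns_qpmethods {
	void (*attach)(void *ctx, void *pval, uint32_t ival);
	void (*detach)(void *ctx, void *pval, uint32_t ival);
	size_t (*makekey)(dns_qpkey_t key, void *ctx, void *pval,
			  uint32_t ival);
	void (*triename)(void *ctx, char *buf, size_t size);
} dns_qpmethods_t;

/* Hypothetical value object; `name` is assumed to hold only 'a'..'z'. */
typedef struct item {
	unsigned int refs; /* not atomic: single-threaded sketch only */
	char name[32];
} item_t;

static void
item_attach(void *ctx, void *pval, uint32_t ival) {
	item_t *it = pval;
	(void)ctx;
	(void)ival;
	it->refs++;
}

static void
item_detach(void *ctx, void *pval, uint32_t ival) {
	item_t *it = pval;
	(void)ctx;
	(void)ival;
	assert(it->refs > 0);
	it->refs--; /* real code would free the object at zero */
}

/*
 * Toy key conversion: 'a'..'z' become 2..27, which satisfies the
 * documented rule that every key byte is greater than 0 and less
 * than 48. Value 1 is avoided because elements past the end of a
 * key are treated as 1.
 */
static size_t
item_makekey(dns_qpkey_t key, void *ctx, void *pval, uint32_t ival) {
	item_t *it = pval;
	size_t len = 0;
	(void)ctx;
	(void)ival;
	for (const char *p = it->name;
	     *p != '\0' && len < sizeof(dns_qpkey_t); p++)
	{
		key[len++] = (uint8_t)(*p - 'a' + 2);
	}
	return (len);
}

/* Here `ctx` is assumed to be a NUL-terminated label for the trie. */
static void
item_triename(void *ctx, char *buf, size_t size) {
	snprintf(buf, size, "toy trie %s", (const char *)ctx);
}

const dns_qpmethods_t toy_methods = {
	.attach = item_attach,
	.detach = item_detach,
	.makekey = item_makekey,
	.triename = item_triename,
};
```

With the real library, a table like this would presumably be passed to `dns_qp_create()` or `dns_qpmulti_create()` together with the context pointer, and leaves would be inserted with `dns_qp_insert(qp, item, 0)`.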
--- a/lib/dns/log.c
+++ b/lib/dns/log.c
@@ -36,23 +36,18 @@ isc_logcategory_t dns_categories[] = {
 * \#define to <dns/log.h>.
 */
 isc_logmodule_t dns_modules[] = {
-	{ "dns/db", 0 },	 { "dns/rbtdb", 0 },
-	{ "dns/rbt", 0 },	 { "dns/rdata", 0 },
-	{ "dns/master", 0 },	 { "dns/message", 0 },
-	{ "dns/cache", 0 },	 { "dns/config", 0 },
-	{ "dns/resolver", 0 },	 { "dns/zone", 0 },
-	{ "dns/journal", 0 },	 { "dns/adb", 0 },
-	{ "dns/xfrin", 0 },	 { "dns/xfrout", 0 },
-	{ "dns/acl", 0 },	 { "dns/validator", 0 },
-	{ "dns/dispatch", 0 },	 { "dns/request", 0 },
-	{ "dns/masterdump", 0 }, { "dns/tsig", 0 },
-	{ "dns/tkey", 0 },	 { "dns/sdb", 0 },
-	{ "dns/diff", 0 },	 { "dns/hints", 0 },
-	{ "dns/unused1", 0 },	 { "dns/dlz", 0 },
-	{ "dns/dnssec", 0 },	 { "dns/crypto", 0 },
-	{ "dns/packets", 0 },	 { "dns/nta", 0 },
-	{ "dns/dyndb", 0 },	 { "dns/dnstap", 0 },
-	{ "dns/ssu", 0 },	 { NULL, 0 }
+	{ "dns/db", 0 },	 { "dns/rbtdb", 0 },	{ "dns/rbt", 0 },
+	{ "dns/rdata", 0 },	 { "dns/master", 0 },	{ "dns/message", 0 },
+	{ "dns/cache", 0 },	 { "dns/config", 0 },	{ "dns/resolver", 0 },
+	{ "dns/zone", 0 },	 { "dns/journal", 0 },	{ "dns/adb", 0 },
+	{ "dns/xfrin", 0 },	 { "dns/xfrout", 0 },	{ "dns/acl", 0 },
+	{ "dns/validator", 0 },	 { "dns/dispatch", 0 }, { "dns/request", 0 },
+	{ "dns/masterdump", 0 }, { "dns/tsig", 0 },	{ "dns/tkey", 0 },
+	{ "dns/sdb", 0 },	 { "dns/diff", 0 },	{ "dns/hints", 0 },
+	{ "dns/unused1", 0 },	 { "dns/dlz", 0 },	{ "dns/dnssec", 0 },
+	{ "dns/crypto", 0 },	 { "dns/packets", 0 },	{ "dns/nta", 0 },
+	{ "dns/dyndb", 0 },	 { "dns/dnstap", 0 },	{ "dns/ssu", 0 },
+	{ "dns/qp", 0 },	 { NULL, 0 },
 };

 isc_log_t *dns_lctx = NULL;
--- a/lib/dns/qp.c
+++ b/lib/dns/qp.c
--- a/lib/dns/qp_p.h
+++ b/lib/dns/qp_p.h
@@ -0,0 +1,703 @@
+/*
+ * Copyright (C) Internet Systems Consortium, Inc. ("ISC")
+ *
+ * SPDX-License-Identifier: MPL-2.0
+ *
+ * This Source Code Form is subject to the terms of the Mozilla Public
+ * License, v. 2.0. If a copy of the MPL was not distributed with this
+ * file, you can obtain one at https://mozilla.org/MPL/2.0/.
+ *
+ * See the COPYRIGHT file distributed with this work for additional
+ * information regarding copyright ownership.
+ */
+
+/*
+ * For an overview, see doc/design/qp-trie.md
+ */
+
+#pragma once
+
+/***********************************************************************
+ *
+ *  interior node basics
+ */
+
+/*
+ * A qp-trie node can be a leaf or a branch. It consists of three 32-bit
+ * words into which the components are packed. They are used as a 64-bit
+ * word and a 32-bit word, but they are not declared like that to avoid
+ * unwanted padding, keeping the size down to 12 bytes. They are in native
+ * endian order so getting the 64-bit part should compile down to an
+ * unaligned load.
+ *
+ * In a branch the 64-bit word is described by the enum below. The 32-bit
+ * word is a reference to the packed sparse vector of "twigs", i.e. child
+ * nodes. A branch node has at least 2 and less than SHIFT_OFFSET twigs
+ * (see the enum below). The qp-trie update functions ensure that branches
+ * actually branch, i.e. branches cannot have only 1 child.
+ *
+ * The contents of each leaf are set by the trie's user. The 64-bit word
+ * contains a pointer value (which must be word-aligned), and the 32-bit
+ * word is an arbitrary integer value.
+ */
+typedef struct qp_node {
+#if WORDS_BIGENDIAN
+	uint32_t bighi, biglo, small;
+#else
+	uint32_t biglo, bighi, small;
+#endif
+} qp_node_t;
+
+/*
+ * A branch node contains a 64-bit word comprising the branch/leaf tag,
+ * the bitmap, and an offset into the key. It is called an "index word"
+ * because it describes how to access the twigs vector (think "database
+ * index"). The following enum sets up the bit positions of these parts.
+ *
+ * In a leaf, the same 64-bit word contains a pointer. The pointer
+ * must be word-aligned so that the branch/leaf tag bit is zero.
+ * This requirement is checked by the newleaf() constructor.
+ *
+ * The bitmap is just above the tag bit. The `bits_for_byte[]` table is
+ * used to fill in a key so that bit tests can work directly against the
+ * index word without superfluous masking or shifting; we don't need to
+ * mask out the bitmap before testing a bit, but we do need to mask the
+ * bitmap before calling popcount.
+ *
+ * The byte offset into the key is at the top of the word, so that it
+ * can be extracted with just a shift, with no masking needed.
+ *
+ * The names are SHIFT_thing because they are qp_shift_t values. (See
+ * below for the various `qp_*` type declarations.)
+ *
+ * These values are relatively fixed in practice; the symbolic names
+ * avoid mystery numbers in the code.
+ */
+enum {
+	SHIFT_BRANCH = 0,  /* branch / leaf tag */
+	SHIFT_NOBYTE,	   /* label separator has no byte value */
+	SHIFT_BITMAP,	   /* many bits here */
+	SHIFT_OFFSET = 48, /* offset of byte in key */
+};
+
+/*
+ * Value of the node type tag bit.
+ *
+ * It is defined this way to be explicit about where the value comes
+ * from, even though we know it is always the bottom bit.
+ */
+#define BRANCH_TAG (1ULL << SHIFT_BRANCH)
+
+/***********************************************************************
+ *
+ *  garbage collector tuning parameters
+ */
+
+/*
+ * A "cell" is a location that can contain a `qp_node_t`, and a "chunk"
+ * is a moderately large array of cells. A big trie can occupy
+ * multiple chunks. (Unlike other nodes, a trie's root node lives in
+ * its `struct dns_qp` instead of being allocated in a cell.)
+ *
+ * The qp-trie allocator hands out space for twigs vectors. Allocations are
+ * made sequentially from one of the chunks; this kind of "sequential
+ * allocator" is also known as a "bump allocator", so in `struct dns_qp`
+ * (see below) the allocation chunk is called `bump`.
+ */
+
+/*
+ * Number of cells in a chunk is a power of 2, which must have space for
+ * a full twigs vector (48 wide). When testing, use a much smaller chunk
+ * size to make the allocator work harder.
+ */
+#ifdef FUZZING_BUILD_MODE_UNSAFE_FOR_PRODUCTION
+#define QP_CHUNK_LOG 7
+#else
+#define QP_CHUNK_LOG 10
+#endif
+
+STATIC_ASSERT(6 <= QP_CHUNK_LOG && QP_CHUNK_LOG <= 20,
+	      "qp-trie chunk size is unreasonable");
+
+#define QP_CHUNK_SIZE  (1U << QP_CHUNK_LOG)
+#define QP_CHUNK_BYTES (QP_CHUNK_SIZE * sizeof(qp_node_t))
+
+/*
+ * A chunk needs to be compacted if it has fragmented this much.
+ * (12% overhead seems reasonable)
+ */
+#define QP_MAX_FREE (QP_CHUNK_SIZE / 8)
+
+/*
+ * Compact automatically when we pass this threshold: when there is a lot
+ * of free space in absolute terms, and when we have freed more than half
+ * of the space we allocated.
+ *
+ * The current compaction algorithm scans the whole trie, so it is important
+ * to scale the threshold based on the size of the trie to avoid quadratic
+ * behaviour. XXXFANF find an algorithm that scans less of the trie!
+ *
+ * During a modification transaction, when we copy-on-write some twigs we
+ * count the old copy as "free", because they will be when the transaction
+ * commits. But they cannot be recovered immediately so they are also
+ * counted as on hold, and discounted when we decide whether to compact.
+ */
+#define QP_MAX_GARBAGE(qp)                                            \
+	(((qp)->free_count - (qp)->hold_count) > QP_CHUNK_SIZE * 4 && \
+	 ((qp)->free_count - (qp)->hold_count) > (qp)->used_count / 2)
+
+/*
+ * The chunk base and usage arrays are resized geometrically and start off
+ * with two entries.
+ */
+#define GROWTH_FACTOR(size) ((size) + (size) / 2 + 2)
+
+/***********************************************************************
+ *
+ *  helper types
+ */
+
+/*
+ * C is not strict enough with its integer types for these typedefs to
+ * improve type safety, but it helps to have annotations saying what
+ * particular kind of number we are dealing with.
+ */
+
+/*
+ * The number or position of a bit inside a word. (0..63)
+ *
+ * Note: A dns_qpkey_t is logically an array of qp_shift_t values, but it
+ * isn't declared that way because dns_qpkey_t is a public type whereas
+ * qp_shift_t is private.
+ */
+typedef uint8_t qp_shift_t;
+
+/*
+ * The number of bits set in a word (as in Hamming weight or popcount)
+ * which is used for the position of a node in the packed sparse
+ * vector of twigs. (0..47) because our bitmap does not fill the word.
+ */
+typedef uint8_t qp_weight_t;
+
+/*
+ * A chunk number, i.e. an index into the chunk arrays.
+ */
+typedef uint32_t qp_chunk_t;
+
+/*
+ * Cell offset within a chunk, or a count of cells. Each cell in a
+ * chunk can contain a node.
+ */
+typedef uint32_t qp_cell_t;
+
+/*
+ * A twig reference is used to refer to a twigs vector, which occupies a
+ * contiguous group of cells.
+ */
+typedef uint32_t qp_ref_t;
+
+/*
+ * Constructors and accessors for qp_ref_t values, defined here to show
+ * how the qp_ref_t, qp_chunk_t, qp_cell_t types relate to each other
+ */
+
+static inline qp_ref_t
+make_ref(qp_chunk_t chunk, qp_cell_t cell) {
+	return (QP_CHUNK_SIZE * chunk + cell);
+}
+
+static inline qp_chunk_t
+ref_chunk(qp_ref_t ref) {
+	return (ref / QP_CHUNK_SIZE);
+}
+
+static inline qp_cell_t
+ref_cell(qp_ref_t ref) {
+	return (ref % QP_CHUNK_SIZE);
+}
+
+/***********************************************************************
+ *
+ *  main qp-trie structures
+ */
+
+#define QP_MAGIC     ISC_MAGIC('t', 'r', 'i', 'e')
+#define VALID_QP(qp) ISC_MAGIC_VALID(qp, QP_MAGIC)
+
+/*
+ * This is annoying: C doesn't allow us to use a predeclared structure as
+ * an anonymous struct member, so we have to fart around. The feature we
+ * want is available in GCC and Clang with -fms-extensions, but a
+ * non-standard extension won't make these declarations neater if we must
+ * also have a standard alternative.
+ */
+
+/*
+ * Lightweight read-only access to a qp-trie.
+ *
+ * Just the fields needed for the hot path. The `base` field points
+ * to an array containing pointers to the base of each chunk like
+ * `qp->base[chunk]` - see `refptr()` below.
+ *
+ * A `dns_qpread_t` has a lifetime that does not extend across multiple
+ * write transactions, so it can share a chunk `base` array belonging to
+ * the `dns_qpmulti_t` it came from.
+ *
+ * We're lucky with the layout on 64 bit systems: this is only 40 bytes,
+ * with no padding.
+ */
+#define DNS_QPREAD_COMMON \
+	uint32_t magic;   \
+	qp_node_t root;   \
+	qp_node_t **base; \
+	void *ctx;        \
+	const dns_qpmethods_t *methods
+
+struct dns_qpread {
+	DNS_QPREAD_COMMON;
+};
+
+/*
+ * Heavyweight read-only snapshots of a qp-trie.
+ *
+ * Unlike a lightweight `dns_qpread_t`, a snapshot can survive across
+ * multiple write transactions, any of which may need to expand the
+ * chunk `base` array. So a `dns_qpsnap_t` keeps its own copy of the
+ * array, which will always be equal to some prefix of the expanded
+ * arrays in the `dns_qpmulti_t` that it came from.
+ *
+ * The `dns_qpmulti_t` keeps a refcount of its snapshots, and while
+ * the refcount is non-zero, chunks are not freed or reused. When a
+ * `dns_qpsnap_t` is destroyed, if it decrements the refcount to zero,
+ * it can do any deferred cleanup.
+ *
+ * The generation number is used for tracing.
+ */
+struct dns_qpsnap {
+	DNS_QPREAD_COMMON;
+	uint32_t generation;
+	dns_qpmulti_t *whence;
+	qp_node_t *base_array[];
+};
+
+/*
+ * Read-write access to a qp-trie requires extra fields to support the
+ * allocator and garbage collector.
+ *
+ * The chunk `base` and `usage` arrays are separate because the `usage`
+ * array is only needed for allocation, and should stay out of the way of the
+ * data needed by the read-only hot path. The arrays have empty slots where
+ * new chunks can be placed, so `chunk_max` is the maximum number of chunks
+ * (until the arrays are resized).
+ *
+ * Bare instances of a `struct dns_qp` are used for stand-alone
+ * single-threaded tries. For multithreaded access, transactions alternate
+ * between the `phase` pair of dns_qp objects inside a dns_qpmulti.
+ *
+ * For multithreaded access, the `generation` counter allows us to know
+ * which chunks are writable or not: writable chunks were allocated in the
+ * current generation. For single-threaded access, the generation counter
+ * is always zero, so all chunks are considered to be writable.
+ *
+ * Allocations are made sequentially in the `bump` chunk. Lightweight write
+ * transactions can re-use the `bump` chunk, so its prefix before `fender`
+ * is immutable, and the rest is mutable even though its generation number
+ * does not match the current generation.
+ *
+ * To decide when to compact and reclaim space, QP_MAX_GARBAGE() examines
+ * the values of `used_count`, `free_count`, and `hold_count`. The
+ * `hold_count` tracks nodes that need to be retained while readers are
+ * using them; they are free but cannot be reclaimed until the transaction
+ * has committed, so the `hold_count` is discounted from QP_MAX_GARBAGE()
+ * during a transaction.
+ *
+ * There are some flags that alter the behaviour of write transactions.
+ *
+ *  - The `transaction_mode` indicates whether the current transaction is a
+ *    light write or a heavy update, or (between transactions) the previous
+ *    transaction's mode, because the setup for the next transaction
+ *    depends on how the previous one committed. The mode is set at the
+ *    start of each transaction. It is QP_NONE in a single-threaded qp-trie
+ *    to detect if part of a `dns_qpmulti_t` is passed to dns_qp_destroy().
+ *
+ *  - The `compact_all` flag is used when every node in the trie should be
+ *    copied. (Usually compaction aims to avoid moving nodes out of
+ *    unfragmented chunks.) It is used when compaction is explicitly
+ *    requested via `dns_qp_compact()`, and as an emergency mechanism if
+ *    normal compaction failed to clear the QP_MAX_GARBAGE() condition.
+ *    (This emergency is a bug even though we have a rescue mechanism.)
+ *
+ *  - The `shared_arrays` flag indicates that the chunk `base` and `usage`
+ *    arrays are shared by both `phase`s in this trie's `dns_qpmulti_t`.
+ *    This allows us to delay allocating copies of the arrays during a
+ *    write transaction, until we definitely need to resize them.
+ *
+ *  - When built with fuzzing support, we can use mprotect() and munmap()
+ *    to ensure that incorrect memory accesses cause fatal errors. The
+ *    `write_protect` flag must be set straight after the `dns_qpmulti_t`
+ *    is created, then left unchanged.
+ *
+ * Some of the dns_qp_t fields are only used for multithreaded transactions
+ * (marked [MT] below) but the same code paths are also used for single-
+ * threaded writes. To reduce the size of a dns_qp_t, these fields could
+ * perhaps be moved into the dns_qpmulti_t, but that would require some kind
+ * of conditional runtime downcast from dns_qp_t to dns_qpmulti_t, which is
+ * likely to be ugly. It is probably best to keep things simple if most tries
+ * need multithreaded access (XXXFANF do they? e.g. when there are many auth
+ * zones).
+ */
+struct dns_qp {
+	DNS_QPREAD_COMMON;
+	isc_mem_t *mctx;
+	/*% array of per-chunk allocation counters */
+	struct {
+		/*% the allocation point, increases monotonically */
+		qp_cell_t used;
+		/*% count of nodes no longer needed, also monotonic */
+		qp_cell_t free;
+		/*% when was this chunk allocated? */
+		uint32_t generation;
+	} *usage;
+	/*% transaction counter [MT] */
+	uint32_t generation;
+	/*% number of slots in `base` and `usage` arrays */
+	qp_chunk_t chunk_max;
+	/*% which chunk is used for allocations */
+	qp_chunk_t bump;
+	/*% twigs in the `bump` chunk below `fender` are read only [MT] */
+	qp_cell_t fender;
+	/*% number of leaf nodes */
+	qp_cell_t leaf_count;
+	/*% total of all usage[] counters */
+	qp_cell_t used_count, free_count;
+	/*% cells that cannot be recovered right now */
+	qp_cell_t hold_count;
+	/*% what kind of transaction was most recently started [MT] */
+	enum { QP_NONE, QP_WRITE, QP_UPDATE } transaction_mode : 2;
+	/*% compact the entire trie [MT] */
+	bool compact_all : 1;
+	/*% chunk arrays are shared with a readonly qp-trie [MT] */
+	bool shared_arrays : 1;
+	/*% optionally when compiled with fuzzing support [MT] */
+	bool write_protect : 1;
+};
+
+/*
+ * Concurrent access to a qp-trie.
+ *
+ * The `read` pointer is used for read queries. It points to one of the
+ * `phase` elements. During a transaction, the other `phase` (see
+ * `write_phase()` below) is modified incrementally in copy-on-write
+ * style. On commit the `read` pointer is swapped to the altered phase.
+ */
+struct dns_qpmulti {
+	uint32_t magic;
+	/*% controls access to the `read` pointer and its target phase */
+	isc_rwlock_t rwlock;
+	/*% points to phase[r] and swaps on commit */
+	dns_qp_t *read;
+	/*% protects the snapshot counter and `write_phase()` */
+	isc_mutex_t mutex;
+	/*% so we know when old chunks are still shared */
+	unsigned int snapshots;
+	/*% one is read-only, one is mutable */
+	dns_qp_t phase[2];
+};
+
+/*
+ * Get a pointer to the phase that isn't read-only.
+ */
+static inline dns_qp_t *
+write_phase(dns_qpmulti_t *multi) {
+	bool read0 = multi->read == &multi->phase[0];
+	return (read0 ? &multi->phase[1] : &multi->phase[0]);
+}
+
+#define QPMULTI_MAGIC	  ISC_MAGIC('q', 'p', 'm', 'v')
+#define VALID_QPMULTI(qp) ISC_MAGIC_VALID(qp, QPMULTI_MAGIC)
+
+/***********************************************************************
+ *
+ *  interior node constructors and accessors
+ */
+
+/*
+ * See the comments under "interior node basics" above, which explain the
+ * layout of nodes as implemented by the following functions.
+ */
+
+/*
+ * Get the 64-bit word of a node.
+ */
+static inline uint64_t
+node64(qp_node_t *n) {
+	uint64_t lo = n->biglo;
+	uint64_t hi = n->bighi;
+	return (lo | (hi << 32));
+}
+
+/*
+ * Get the 32-bit word of a node.
+ */
+static inline uint32_t
+node32(qp_node_t *n) {
+	return (n->small);
+}
+
+/*
+ * Create a node from its parts
+ */
+static inline qp_node_t
+make_node(uint64_t big, uint32_t small) {
+	return ((qp_node_t){
+		.biglo = (uint32_t)(big),
+		.bighi = (uint32_t)(big >> 32),
+		.small = small,
+	});
+}
+
+/*
+ * Test a node's tag bit.
+ */
+static inline bool
+is_branch(qp_node_t *n) {
+	return (n->biglo & BRANCH_TAG);
+}
+
+/* leaf nodes *********************************************************/
+
+/*
+ * Get a leaf's pointer value. The double cast is to avoid a warning
+ * about mismatched pointer/integer sizes on 32 bit systems.
+ */
+static inline void *
+leaf_pval(qp_node_t *n) {
+	return ((void *)(uintptr_t)node64(n));
+}
+
+/*
+ * Get a leaf's integer value
+ */
+static inline uint32_t
+leaf_ival(qp_node_t *n) {
+	return (node32(n));
+}
+
+/*
+ * Create a leaf node from its parts
+ */
+static inline qp_node_t
+make_leaf(const void *pval, uint32_t ival) {
+	qp_node_t leaf = make_node((uintptr_t)pval, ival);
+	REQUIRE(!is_branch(&leaf) && pval != NULL);
+	return (leaf);
+}
+
+/* branch nodes *******************************************************/
+
+/*
+ * The following function names use plural `twigs` when they work on a
+ * branch's twigs vector as a whole, and singular `twig` when they work on
+ * a particular twig.
+ */
+
+/*
+ * Get a branch node's index word
+ */
+static inline uint64_t
+branch_index(qp_node_t *n) {
+	return (node64(n));
+}
+
+/*
+ * Get a reference to a branch node's child twigs.
+ */
+static inline qp_ref_t
+branch_twigs_ref(qp_node_t *n) {
+	return (node32(n));
+}
+
+/*
+ * Bit positions in the bitmap come directly from the key. DNS names are
+ * converted to keys using the tables declared at the end of this file.
+ */
+static inline qp_shift_t
+qpkey_bit(const dns_qpkey_t key, size_t len, size_t offset) {
+	if (offset < len) {
+		return (key[offset]);
+	} else {
+		return (SHIFT_NOBYTE);
+	}
+}
+
+/*
+ * Extract a branch node's offset field, used to index the key.
+ */
+static inline size_t
+branch_key_offset(qp_node_t *n) {
+	return ((size_t)(branch_index(n) >> SHIFT_OFFSET));
+}
+
+/*
+ * Which bit identifies the twig of this node for this key?
+ */
+static inline qp_shift_t
+branch_keybit(qp_node_t *n, const dns_qpkey_t key, size_t len) {
+	return (qpkey_bit(key, len, branch_key_offset(n)));
+}
+
+/*
+ * Convert a twig reference into a pointer.
+ */
+static inline qp_node_t *
+ref_ptr(dns_qpreadable_t qpr, qp_ref_t ref) {
+	dns_qpread_t *qp = dns_qpreadable_cast(qpr);
+	return (qp->base[ref_chunk(ref)] + ref_cell(ref));
+}
+
+/*
+ * Get a pointer to a branch node's twigs vector.
+ */
+static inline qp_node_t *
+branch_twigs_vector(dns_qpreadable_t qpr, qp_node_t *n) {
+	dns_qpread_t *qp = dns_qpreadable_cast(qpr);
+	return (ref_ptr(qp, branch_twigs_ref(n)));
+}
+
+/*
+ * Warm up the cache while calculating which twig we want.
+ */
+static inline void
+prefetch_twigs(dns_qpreadable_t qpr, qp_node_t *n) {
+	__builtin_prefetch(branch_twigs_vector(qpr, n));
+}
+
+/***********************************************************************
+ *
+ *  bitmap popcount shenanigans
+ */
+
+/*
+ * How many twigs appear in the vector before the one corresponding to the
+ * given bit? Calculated using popcount of part of the branch's bitmap.
+ *
+ * To calculate a mask that covers the lesser bits in the bitmap, we
+ * subtract 1 to set the bits, and subtract the branch tag because it
+ * is not part of the bitmap.
+ */
+static inline qp_weight_t
+branch_twigs_before(qp_node_t *n, qp_shift_t bit) {
+	uint64_t mask = (1ULL << bit) - 1 - BRANCH_TAG;
+	uint64_t bmp = branch_index(n) & mask;
+	return ((qp_weight_t)__builtin_popcountll(bmp));
+}
+
+/*
+ * How many twigs does this node have?
+ *
+ * The offset is directly after the bitmap, so the offset's lesser bits
+ * cover the whole bitmap, and the bitmap's weight is the number of twigs.
+ */
+static inline qp_weight_t
+branch_twigs_size(qp_node_t *n) {
+	return (branch_twigs_before(n, SHIFT_OFFSET));
+}
+
+/*
+ * Position of a twig within the packed sparse vector.
+ */
+static inline qp_weight_t
+branch_twig_pos(qp_node_t *n, qp_shift_t bit) {
+	return (branch_twigs_before(n, bit));
+}
+
+/*
+ * Get a pointer to a particular twig.
+ */
+static inline qp_node_t *
+branch_twig_ptr(dns_qpreadable_t qpr, qp_node_t *n, qp_shift_t bit) {
+	return (branch_twigs_vector(qpr, n) + branch_twig_pos(n, bit));
+}
+
+/*
+ * Is the twig identified by this bit present?
+ */
+static inline bool
+branch_has_twig(qp_node_t *n, qp_shift_t bit) {
+	return (branch_index(n) & (1ULL << bit));
+}
+
+/* twig logistics *****************************************************/
+
+static inline void
+move_twigs(qp_node_t *to, qp_node_t *from, qp_weight_t size) {
+	memmove(to, from, size * sizeof(qp_node_t));
+}
+
+static inline void
+zero_twigs(qp_node_t *twigs, qp_weight_t size) {
+	memset(twigs, 0, size * sizeof(qp_node_t));
+}
+
+/***********************************************************************
+ *
+ *  method invocation helpers
+ */
+
+static inline void
+attach_leaf(dns_qpreadable_t qpr, qp_node_t *n) {
+	dns_qpread_t *qp = dns_qpreadable_cast(qpr);
+	qp->methods->attach(qp->ctx, leaf_pval(n), leaf_ival(n));
+}
+
+static inline void
+detach_leaf(dns_qpreadable_t qpr, qp_node_t *n) {
+	dns_qpread_t *qp = dns_qpreadable_cast(qpr);
+	qp->methods->detach(qp->ctx, leaf_pval(n), leaf_ival(n));
+}
+
+static inline size_t
+leaf_qpkey(dns_qpreadable_t qpr, qp_node_t *n, dns_qpkey_t key) {
+	dns_qpread_t *qp = dns_qpreadable_cast(qpr);
+	return (qp->methods->makekey(key, qp->ctx, leaf_pval(n), leaf_ival(n)));
+}
+
+static inline char *
+triename(dns_qpreadable_t qpr, char *buf, size_t size) {
+	dns_qpread_t *qp = dns_qpreadable_cast(qpr);
+	qp->methods->triename(qp->ctx, buf, size);
+	return (buf);
+}
+
+#define TRIENAME(qp) \
+	triename(qp, (char[DNS_QP_TRIENAME_MAX]){}, DNS_QP_TRIENAME_MAX)
+
+/***********************************************************************
+ *
+ *  converting DNS names to trie keys
+ */
+
+/*
+ * This is a deliberate simplification of the hostname characters,
+ * because it doesn't matter much if we treat a few extra characters
+ * favourably: there is plenty of space in the index word for a
+ * slightly larger bitmap.
+ */
+static inline bool
+qp_common_character(uint8_t byte) {
+	return (('-' <= byte && byte <= '9') || ('_' <= byte && byte <= 'z'));
+}
+
+/*
+ * Lookup table mapping bytes in DNS names to bit positions, used
+ * by dns_qpkey_fromname() to convert DNS names to qp-trie keys.
+ */
+extern uint16_t dns_qp_bits_for_byte[];
+
+/*
+ * And the reverse, mapping bit positions to characters, so the tests
+ * can print diagnostics involving qp-trie keys.
+ */
+extern uint8_t dns_qp_byte_for_bit[];
+
+/**********************************************************************/