mirror of https://gitlab.isc.org/isc-projects/bind9 synced 2025-08-31 06:25:31 +00:00

Add a qp-trie data structure

A qp-trie is a kind of radix tree that is particularly well-suited to
DNS servers. I invented the qp-trie in 2015, based on Dan Bernstein's
crit-bit trees and Phil Bagwell's HAMT. https://dotat.at/prog/qp/

This code incorporates some new ideas that I prototyped using
NLnet Labs NSD in 2020 (optimizations for DNS names as keys)
and 2021 (custom allocator and garbage collector).
https://dotat.at/cgi/git/nsd.git

The BIND version of my qp-trie code has a number of improvements
compared to the prototype developed for NSD.

  * The prototype had only a very sketchy outline of how locking
    might work. Locking is now implemented, using a reader/writer
    lock and a mutex. However, it is designed to benefit from liburcu
    if that is available.

  * The prototype was designed for two-version concurrency, one
    version for readers and one for the writer. The new code supports
    multiversion concurrency, to provide a basis for BIND's dbversion
    machinery, so that updates are not blocked by long-running zone
    transfers.

  * There are now two kinds of transaction that modify the trie: an
    `update` aims to support many very small zones without wasting
    memory; a `write` avoids unnecessary allocation to help the
    performance of many small changes to the cache.

  * There is also a single-threaded interface for situations where
    concurrent access is not necessary.

  * The API makes better use of types to make it clearer which
    operations are permitted when.

  * The lookup table used to convert a DNS name to a qp-trie key is
    now initialized by a run-time constructor instead of a programmer
    using copy-and-paste. Key conversion is more flexible, so the
    qp-trie can be used with keys other than DNS names.

  * There has been much refactoring and re-arranging things to improve
    the terminology and order of presentation in the code, and the
    internal documentation has been moved from a comment into a file
    of its own.

Some of the required functionality has been stripped out, to be
brought back later after the basics are known to work.

  * Garbage collector performance statistics are missing.

  * Fancy searches are missing, such as longest match and
    nearest match.

  * Iteration is missing.

  * Search for update is missing, for cases where the caller needs to
    know if the value object is mutable or not.
Tony Finch
2022-05-09 14:31:35 +01:00
parent 7975b785fd
commit 6b9ddbd1ce
7 changed files with 3638 additions and 21 deletions

doc/design/qp-trie.md (new file, 770 lines)

@@ -0,0 +1,770 @@
<!--
Copyright (C) Internet Systems Consortium, Inc. ("ISC")
SPDX-License-Identifier: MPL-2.0
This Source Code Form is subject to the terms of the Mozilla Public
License, v. 2.0. If a copy of the MPL was not distributed with this
file, you can obtain one at https://mozilla.org/MPL/2.0/.
See the COPYRIGHT file distributed with this work for additional
information regarding copyright ownership.
-->
A qp-trie for the DNS
=====================
A qp-trie is a data structure that supports lookups in a sorted
collection of keys. It is efficient both in terms of fast lookups and
using little memory. It is particularly well-suited for use in DNS
servers.
These notes outline how BIND's `dns_qp` implementation works, how it
is optimized for lookups keyed by DNS names, and how it supports
multi-version concurrency.
data structure zoo
------------------
Chasing a pointer indirection is very slow, up to 100ns, whereas a
sequential memory access takes less than 10ns. So, to make a data
structure fast, we need to minimize indirections.
There is a tradeoff between speed and flexibility in standard data
structures:
* Arrays are very simple and fast (a lookup goes straight to the
right address), but the key can only be a small integer.
* Hash tables allow you to use arbitrary lookup keys (such as
strings), but may require probing multiple addresses to find the
right element.
* Radix trees allow you to do lookups based on the sorting order of
the keys, provided it is lexical like `memcmp()`; however, lookups
require multiple indirections.
* Comparison search trees (binary trees and B-trees) allow you to
use an arbitrary ordering predicate, but each indirection during
a lookup also requires a comparison.
In the DNS, we need to use some kind of tree to support the kinds of
lookup required for DNSSEC: find longest match, find nearest
predecessor or successor, and so forth. So what kind of tree is best?
in theory
---------
In a tree where the average length of a key is `k`, and the number of
elements in the tree is `n`, the theoretical performance bounds are,
for a comparison tree:
* `Ω(k * log n)`
* `Ο(k * n)`
And for a radix tree:
* `Ω(k + log n)`
* `Ο(k + k)`
Here, `Ω()` is the lower bound and `Ο()` is the upper bound; we
expect typical performance to be close to the lower bound.
The multiplications in the comparison tree expressions mean that each
indirection requires an `Ο(k)` comparison, whereas they are additions
in the radix tree expressions because a radix tree traversal only
needs one key comparison.
The upper bounds say that (in the absence of balancing) a comparison
tree can devolve into a linked list of nodes, whereas the shape of a
radix tree is determined by the set of keys independent of the order
of insertion or the number of keys.
The logarithms hide some interesting constant factors. In a binary
tree, the log is base 2. In a radix tree, the radix is the base of the
logarithm. So, if we increase the radix, the constant factor gets
smaller. The rough equivalent for a binary tree would be to use a
B-tree instead, but although B-trees have fewer indirections they do
not reduce the number of comparisons.
In implementation terms, a larger radix means tree nodes get wider
and the tree becomes shallower. A shallower tree requires fewer
indirections, so it should be faster. The trick is to increase the
radix without blowing up the tree's memory usage, which can lose
more performance than we win.
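To put rough numbers on the effect of the radix, the following sketch computes the depth of a perfectly balanced tree for a given fan-out. The function name is hypothetical and the calculation ignores everything that makes real tries unbalanced; it only illustrates how the base of the logarithm changes the number of indirections.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Smallest depth d such that radix^d >= n, that is ceil(log_radix(n)),
 * computed with integer arithmetic. This is the depth of a perfectly
 * balanced tree with `n` leaves and the given fan-out.
 */
unsigned
balanced_depth(uint64_t n, unsigned radix) {
	unsigned depth = 0;
	uint64_t reach = 1; /* leaves reachable at the current depth */

	assert(radix >= 2);
	while (reach < n) {
		/* the next multiplication would overflow, so it exceeds n */
		if (reach > UINT64_MAX / radix) {
			return depth + 1;
		}
		reach *= radix;
		depth++;
	}
	return depth;
}
```

For a million keys this gives 20 levels for a binary tree, 5 levels at radix 16, and 4 levels at radix 32.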
This analysis suggests that a radix tree is better than a comparison
tree, provided keys can be compared lexically - which is true for DNS
names, with some rearrangement (described below). When using big-o
notation, we also need to be wary of the constant factors; but in this
case they also favour a radix tree, especially with the optimization
tricks used by BIND's qp-trie.
Note: "radix" comes from the Latin for "root", so "radix tree" is a
pun, which is geekily amusing especially when talking about logs.
what is a trie?
---------------
A trie is another name for a radix tree (or "digital tree" according
to Knuth). It is short for information reTRIEval, and I pronounce it
exactly like "tree" (though Knuth pronounces it like "try").
In a trie, keys are divided into digits depending on some radix e.g.
base 2 for binary tries, base 256 for byte-indexed tries. When
searching the trie, successive digits in the key, from most to least
significant, are used to select branches from successive nodes in
the trie, roughly like:
    for (offset = 0; isbranch(node); offset++)
        node = node->child[key[offset]];
All of the keys in a subtrie have identical prefixes. Tries do not
need to store keys since they are implicit in the structure.
binary crit-bit trees
---------------------
A patricia trie is a binary trie which omits nodes that have only one
child. Dan Bernstein calls his tightly space-optimized version a
"crit-bit tree".
https://cr.yp.to/critbit.html
https://github.com/agl/critbit/
Unlike a basic trie, a crit-bit tree skips parts of the key when
every element in a subtree shares the same sequence of bits.
Each node is annotated with the offset of the bit that is used to
select the branch; offsets always increase as you go deeper into
the tree.
    while (isbranch(node))
        node = node->child[key[node->offset]];
In a crit-bit tree the keys are not implicit in the structure
because parts of them are skipped. Therefore, each leaf refers to a
copy of its key so that when you find a leaf you can verify that the
skipped bits match.
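A minimal sketch of a crit-bit lookup, including the final verification against the key stored at the leaf. The node type and function names here are hypothetical and chosen for readability; they are not Bernstein's layout or BIND's.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical node type for illustration only. */
struct cb_node {
	bool is_branch;
	size_t bit;            /* branch: index of the critical bit */
	struct cb_node *child; /* branch: two adjacent children */
	const char *key;       /* leaf: NUL-terminated key */
};

/* Bit number `bit` of the key (MSB first); bits past the end are 0. */
unsigned
cb_bit(const char *key, size_t len, size_t bit) {
	size_t byte = bit / 8;
	if (byte >= len) {
		return 0;
	}
	return ((unsigned char)key[byte] >> (7 - bit % 8)) & 1;
}

/* Return the stored key equal to `key`, or NULL if it is absent. */
const char *
cb_lookup(const struct cb_node *node, const char *key) {
	size_t len;

	assert(node != NULL && key != NULL);
	len = strlen(key);
	while (node->is_branch) {
		node = &node->child[cb_bit(key, len, node->bit)];
	}
	/* bits were skipped on the way down, so compare the whole key */
	return (strcmp(node->key, key) == 0) ? node->key : NULL;
}
```

A search for a key that is not in the tree still ends at some leaf; only the final comparison reveals the mismatch.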
prefetching
-----------
Observe that in the loop above, the current node has only one child
pointer, and the child nodes are adjacent in memory. This means it
is possible to tell the CPU to prefetch the child nodes before
extracting the critical bit from the key and choosing which child is
next. A qp-trie has a similar layout, but it has more child nodes
(still adjacent in memory) and it does more computation to choose
which one is next.
When I originally invented the qp-trie code, I found that explicit
prefetch hints made the qp-trie substantially faster and the crit-bit
tree slightly faster. The hints help the CPU to do useful work at the
same time as the memory subsystem. (This is unusual for linked data
structures, which tend to alternate between CPU waiting for memory,
and memory waiting for CPU.)
Large modern CPUs (after about 2015) are better at prefetching
automatically, so the explicit hint is less important than it used to
be, but `lib/dns/qp.c` still has `__builtin_prefetch()` hints in its
inner traversal loops.
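As a rough illustration of where such a hint goes, the sketch below issues the prefetch for the adjacent children before doing the work of extracting the critical bit. It assumes GCC or Clang (for `__builtin_prefetch()`), and the node type is hypothetical rather than the one in `lib/dns/qp.c`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node type for illustration only. */
struct pf_node {
	struct pf_node *twigs; /* adjacent children, or NULL in a leaf */
	size_t bit;            /* branch: index of the critical bit */
	int value;             /* leaf: payload */
};

/* Walk down to a leaf; the caller must still verify the leaf's key. */
const struct pf_node *
pf_walk(const struct pf_node *node, const unsigned char *key, size_t len) {
	assert(node != NULL && key != NULL);
	while (node->twigs != NULL) {
		size_t byte = node->bit / 8;
		unsigned b = 0;

		/* start fetching the children before we know which we need */
		__builtin_prefetch(node->twigs);
		if (byte < len) {
			b = (key[byte] >> (7 - node->bit % 8)) & 1;
		}
		node = &node->twigs[b];
	}
	return node;
}
```

The hint is only advice to the CPU, so the loop behaves the same with or without it.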
packed sparse vectors with popcount
-----------------------------------
The `popcount` instruction counts the number of bits that are set
in a word. It's also known as the Hamming weight; Knuth calls it
"sideways add". https://en.wikipedia.org/wiki/popcount
You can use `popcount` to implement a sparse vector of length `N`
containing `M <= N` members using a bitmap of length `N` and a packed
vector of `M` elements. A member `b` is present in the vector if bit
`b` is set, so `M == popcount(bitmap)`. The index of member `b` in
the packed vector is the popcount of the bits preceding `b`.
    // size of vector
    size = popcount(bitmap);
    // bit position
    bit = 1 << b;
    // is element present?
    if (bitmap & bit) {
        // mask covers the preceding elements
        mask = bit - 1;
        // position of element in packed vector
        pos = popcount(bitmap & mask);
        // fetch element
        elem = vector[pos];
    }
See "Hacker's Delight" by Hank Warren, section 5-1 "Counting 1
bits", subsection "applications". http://www.hackersdelight.org
See under _"bitmap popcount shenanigans"_ in `lib/dns/qp.c` for how
this is implemented in BIND.
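The same idea as a compilable sketch, assuming GCC or Clang for `__builtin_popcountll()`. The struct and function names are hypothetical. Note the 64-bit constant in the shift: with a plain `int` the expression `1 << b` is undefined for `b >= 31`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* A sparse vector of up to 64 ints; names are illustrative only. */
struct sparse {
	uint64_t bitmap;   /* bit b is set if member b is present */
	const int *packed; /* popcount(bitmap) elements, in bit order */
};

/* Number of members that are present. */
unsigned
sparse_size(const struct sparse *s) {
	return (unsigned)__builtin_popcountll(s->bitmap);
}

/* Fetch member `b` into `*out`; returns false if it is absent. */
bool
sparse_get(const struct sparse *s, unsigned b, int *out) {
	uint64_t bit;
	unsigned pos;

	assert(b < 64);
	bit = UINT64_C(1) << b;
	if ((s->bitmap & bit) == 0) {
		return false;
	}
	/* count the members that sort before b */
	pos = (unsigned)__builtin_popcountll(s->bitmap & (bit - 1));
	*out = s->packed[pos];
	return true;
}
```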
popcount for trie nodes
-----------------------
Phil Bagwell's hashed array-mapped tries (HAMT) use popcount for
compact trie nodes. In a HAMT, string keys are hashed, and the hash is
used as the index to the trie, with radix 32 or 64 (one bitmap bit per
child, so the bitmap fills a 32-bit or 64-bit word).
http://infoscience.epfl.ch/record/64394/files/triesearches.pdf
http://infoscience.epfl.ch/record/64398/files/idealhashtrees.pdf
As discussed above, increasing the radix makes the tree shallower, so
it should be faster. The downside is usually much greater memory
overhead. Child vectors are often sparsely populated, so we can
greatly reduce the overhead by packing them with popcount.
The HAMT relies on hashing, which keeps keys dense. This means it
can be laid out like a basic trie with implicit keys (i.e. hash
values). The disadvantage of hashing is that strings are stored
out of order.
qp-trie
-------
A qp-trie is a mash-up of Bernstein's crit-bit tree with Bagwell's
HAMT. Like a crit-bit tree, a qp-trie omits nodes with one child;
nodes include a key offset; and keys are referenced from leaves instead
of being implicit in the trie structure. Like a HAMT, nodes have a
popcount packed vector of children, but unlike a HAMT, keys are not
hashed.
A qp-trie is faster than a crit-bit tree and uses less memory, because
its wider fan-out requires fewer nodes and popcount packs them very
efficiently. Like a crit-bit tree but unlike a HAMT, a qp-trie stores
keys in lexical order.
As in a HAMT, the original layout of a qp-trie node is a pair of
words, which are used as key and value pointers in leaf nodes, and
index word and pointer in branch nodes. The index word contains the
popcount bitmap (as in a HAMT) and the offset into the key (as in a
crit-bit tree), as well as a leaf/branch tag bit. The pointer refers
to the branch node's "twigs", which is what we call the packed sparse
vector of child nodes.
The fan-out of a qp-trie is limited by the need to fit the bitmap and
the nybble offset into a 64-bit word; a radix of 16 or 32 works well,
and 32 is slightly faster (though 5-bit nybbles are fiddly). But radix
64 requires an extra word per node, and the extra memory overhead
makes it slower as well as bulkier.
Early qp-trie implementations used a node layout like the
following. However, in practice C bitfields have too many
portability gotchas to work well. It is better to use hand-written
shifting and masking to access the parts of the index word.
    #define NYBBLE 4 // or 5
    #define RADIX (1 << NYBBLE)
    union qp_node {
        struct {
            unsigned tag : 1;
            unsigned bitmap : RADIX;
            unsigned offset : (64 - 1 - RADIX);
            union qp_node *twigs;
        } branch;
        struct {
            void *value;
            const char *key;
        } leaf;
    };
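For comparison, here is a sketch of the hand-written shifting and masking approach for the same three fields. The field order (tag in the lowest bit, then the bitmap, then the offset) and all the names are assumptions made for this example; the real accessors may arrange the word differently.

```c
#include <assert.h>
#include <stdint.h>

#define NYBBLE 4 /* or 5 */
#define RADIX  (1 << NYBBLE)

/* assumed field order, low to high: tag, bitmap, offset */
#define BITMAP_SHIFT 1
#define OFFSET_SHIFT (1 + RADIX)
#define BITMAP_MASK  ((UINT64_C(1) << RADIX) - 1)
#define OFFSET_MAX   ((UINT64_C(1) << (64 - OFFSET_SHIFT)) - 1)

/* Pack the three fields into one 64-bit index word. */
uint64_t
index_make(unsigned tag, uint64_t bitmap, uint64_t offset) {
	assert(tag <= 1 && bitmap <= BITMAP_MASK && offset <= OFFSET_MAX);
	return (uint64_t)tag | (bitmap << BITMAP_SHIFT) |
	       (offset << OFFSET_SHIFT);
}

unsigned
index_tag(uint64_t word) {
	return (unsigned)(word & 1);
}

uint64_t
index_bitmap(uint64_t word) {
	return (word >> BITMAP_SHIFT) & BITMAP_MASK;
}

uint64_t
index_offset(uint64_t word) {
	return word >> OFFSET_SHIFT;
}
```

Unlike bitfields, the position of each field in the word is fixed by the code rather than by the compiler's ABI.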
DNS qp-trie
-----------
BIND uses a variant of a qp-trie optimized for DNS names. DNS names
almost always use the usual hostname alphabet of (case-insensitive)
letters, digits, hyphen, plus underscore (which is often used in the DNS
for non-hostname purposes), and finally the label separator (which is
written as '.' in presentation-format domain names, and is the label
length in wire format). This adds up to 39 common characters.
A bitmap for 39 common characters is small enough to fit into a
qp-trie index word, so we can (in principle) walk down the trie one
character at a time, as if the radix were 256, but without needing a
multi-word bitmap.
However, DNS names can contain arbitrary bytes. To support the 200-ish
unusual characters we use an escaping scheme, described in more detail
below. This requires a few more bits in the bitmap to represent the
escape characters, so our radix ends up being 47. This still fits into
the 64-bit index word, so we get the compactness of a qp-trie but with
faster byte-at-a-time lookups for DNS names that use common hostname
characters.
You can also use other kinds of keys with BIND's DNS qp-trie, provided
they are not too long. You must provide your own key preparation
function, e.g. for uniform binary keys you might extract 5-bit nybbles
to get a radix-32 trie.
preparing a lookup key
----------------------
A DNS name needs to be rearranged to use it as a qp-trie key, so that
the lexical order of rearranged keys matches the canonical DNS name
order specified in RFC 4034 section 6.1:
* reverse the order of the labels so that they run from most
significant to least significant, left to right (but the
characters in each label remain in the same order)
* convert uppercase ASCII letters to lowercase ASCII
* change the label separators to a non-byte value that sorts before
the zero byte
For qp-trie lookups there are a couple of extra steps:
* There is an escaping mechanism to support DNS names that use
unusual characters. Common characters use one byte in the lookup
key, but unusual characters are expanded to two bytes. To preserve
the correct lexical order, there are different escape bytes
depending on how the unusual character sorts relative to the
common hostname characters.
* Characters in the DNS name need to be converted to bitmap
positions. This is done at the same time as preparing the lookup
key, to move work out of the inner trie traversal loop.
These 5 transformations can be done in a single pass over a DNS name
using a single lookup table. The transformed name is usually the
same length (up to 2x longer if it contains unusual characters).
You can use absolute or relative DNS names as keys, without ambiguity
(provided you have some way of knowing what names are relative to).
When converted to a lookup key, absolute names start with a non-byte
value representing the root, and relative names do not.
Lookup keys are ephemeral, allocated on the stack during a lookup.
See under _"converting DNS names to trie keys"_ in `lib/dns/qp.c`
for how this is implemented in BIND.
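The first three transformations can be sketched as follows. This is a deliberately simplified illustration, not the code in `lib/dns/qp.c`: it takes a presentation-format string instead of a wire-format name, ignores backslash escapes in the input, uses a hypothetical 16-bit key element so that the separator can sort before the zero byte, and leaves out the escaping and bitmap-position steps entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define KEY_SEP 0   /* label separator: sorts before every byte value */
#define KEY_MAX 512 /* illustrative limit on key length */

/*
 * Convert a name such as "www.Example." into a simplified lookup key
 * and return its length. Each byte `c` becomes `c + 1` so that
 * KEY_SEP sorts before the zero byte.
 */
size_t
name_to_key(const char *name, uint16_t key[KEY_MAX]) {
	size_t end = strlen(name);
	size_t klen = 0;

	assert(end + 1 <= KEY_MAX);
	if (end > 0 && name[end - 1] == '.') {
		key[klen++] = KEY_SEP; /* absolute name: root comes first */
		end--;
	}
	/* copy labels from right to left, characters left to right */
	while (end > 0) {
		size_t start = end;
		while (start > 0 && name[start - 1] != '.') {
			start--;
		}
		for (size_t i = start; i < end; i++) {
			unsigned char c = (unsigned char)name[i];
			if (c >= 'A' && c <= 'Z') {
				c += 'a' - 'A';
			}
			key[klen++] = (uint16_t)(c + 1);
		}
		key[klen++] = KEY_SEP;
		end = (start > 0) ? start - 1 : 0;
	}
	return klen;
}
```

With this encoding the key for `www.example.` is the root separator, `example`, a separator, `www`, and a final separator, 13 elements in total, and a relative name such as `www` yields a key that starts with a character instead of a separator.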
node layout
-----------
Earlier I said that the original qp-trie node layout consists of two
words: one 64 bit word for the branch index, and one pointer-sized
word. BIND's qp-trie uses a layout that is smaller on 64-bit systems:
one 64 bit word and one 32-bit word.
A branch node contains
* a branch/leaf tag bit
* a 47-wide bitmap, with a bit for each common hostname character
and each escape character
* a 9-bit key offset, enough to count twice the length of a DNS
name
* a 32-bit "twigs" reference to the packed vector of child nodes;
these references are described in more detail below
A leaf node contains a pointer value (which we assume to be 64 bits)
and a 32-bit integer value. The branch/leaf tag is smuggled into the
low-order bit of the pointer value, so the pointer value must have
large enough alignment. (This requirement is checked when a leaf is
added to the trie.) Apart from that, the meaning of leaf values
is entirely under control of the qp-trie user.
When constructing a qp-trie the user provides a collection of method
pointers. The qp-trie code calls these methods when it needs to do
anything that needs to look into a leaf value, such as extracting the
key.
See under _"interior node basics"_ and _"interior node constructors
and accessors"_ in `lib/dns/qp_p.h` for the implementation.
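A sketch of how a leaf might be packed into 12 bytes with the tag in the low-order bit of the pointer. The type and function names are hypothetical, and the choice that a set tag bit means "branch" is an assumption made for this example. The node is built from three 32-bit words so that the compiler does not pad it to 16 bytes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical 12-byte node: a 64-bit word (lo, hi) and a 32-bit word. */
typedef struct {
	uint32_t lo, hi, small;
} node_t;

#define BRANCH_TAG UINT64_C(1) /* assumption: tag bit set means branch */

/* Reassemble the 64-bit word from its two halves. */
uint64_t
node_word(const node_t *n) {
	return ((uint64_t)n->hi << 32) | n->lo;
}

bool
is_branch(const node_t *n) {
	return (node_word(n) & BRANCH_TAG) != 0;
}

node_t
make_leaf(void *pval, uint32_t ival) {
	uint64_t p = (uint64_t)(uintptr_t)pval;

	/* the pointer's alignment must leave the tag bit clear */
	assert((p & BRANCH_TAG) == 0);
	return (node_t){ .lo = (uint32_t)p,
			 .hi = (uint32_t)(p >> 32),
			 .small = ival };
}

void *
leaf_pval(const node_t *n) {
	assert(!is_branch(n));
	return (void *)(uintptr_t)node_word(n);
}

uint32_t
leaf_ival(const node_t *n) {
	return n->small;
}
```

The assertion in `make_leaf()` corresponds to the alignment requirement described above: a value pointer with its low bit set would be mistaken for a branch.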
example
-------
Consider a small zone:
    example.        ; apex
    mail.example.   ; IMAP server
    mx.example.     ; incoming mail
    www.example.    ; web load balancer
    www1.example.   ; back-end web servers
    www2.example.
It becomes a qp-trie as follows. I am writing bitmaps as lists of
characters representing the bits that are set, with `'.'` for label
separators. I have used arbitrary names for the addresses of the twigs
vectors.
    root = (qp_node){
        tag: BRANCH,
        offset: 9,
        bitmap: [ '.', 'm', 'w' ],
        twigs: &one,
    };
Note that the offset skips the root zone, the zone name, and the apex
label separator. If the offset is beyond the end of the key, the byte
value is the label separator.
    one = (qp_node[3]){
        {
            tag: LEAF,
            key: "example.",
        },
        {
            tag: BRANCH,
            offset: 10,
            bitmap: [ 'a', 'x' ],
            twigs: &two,
        },
        {
            tag: BRANCH,
            offset: 12,
            bitmap: [ '.', '1', '2' ],
            twigs: &three,
        },
    };
This twigs vector has an element for the zone apex, and the two
different initial characters of the subdomains.
The mail servers differ in the next character, so the offset bumps from
9 to 10 without skipping any characters. The web servers all start with
www, so the offset bumps from 9 to 12, skipping the common prefix.
    two = (qp_node[2]){
        {
            tag: LEAF,
            key: "mail.example.",
        },
        {
            tag: LEAF,
            key: "mx.example.",
        },
    };
The different lengths of `mail` and `mx` don't matter: we implicitly
skip to the end of the key when we reach a leaf node.
    three = (qp_node[3]){
        {
            tag: LEAF,
            key: "www.example.",
        },
        {
            tag: LEAF,
            key: "www1.example.",
        },
        {
            tag: LEAF,
            key: "www2.example.",
        },
    };
When the trie includes labels of differing lengths, we can have a node
that chooses between a label separator and characters from the longer
labels. This is slightly different from the root node, which tested the
first character of the label; here we are testing the last character.
memory management for concurrency
---------------------------------
The following sections discuss how the qp-trie supports concurrency.
The requirement is to support many concurrent read threads, and
allow updates to occur without blocking readers (or blocking readers
as little as possible).
The strategy is to use "copy-on-write", that is, when an update
needs to alter the trie it makes a copy of the parts that it needs
to change, so that concurrent readers can continue to use the
original. (It is analogous to multiversion concurrency in databases
such as PostgreSQL, where copy-on-write uses a write-ahead log.)
Software that uses copy-on-write needs some mechanism for clearing
away old versions that are no longer in use. (For example, VACUUM in
PostgreSQL.) The qp-trie code uses a custom allocator with a simple
garbage collector; as well as supporting concurrency, the qp-trie's
memory manager makes tries smaller and faster.
allocation
----------
A qp-trie is relatively demanding on its allocator. Twigs vectors
can be lots of different sizes, and every mutation of the trie
requires an alloc and/or a free.
Older versions of the qp-trie code used the system allocator. Many
allocators (such as `jemalloc`) segregate the heap into different
size classes, so that each chunk of memory is dedicated to
allocations of the same size. While this memory layout provides good
locality when objects of the same type have the same size, it tends
to scatter the interior nodes of a qp-trie all over the address space.
BIND's qp-trie code uses a "bump allocator" for its interior nodes,
which is one of the simplest and fastest possible: an allocation
usually only requires incrementing a pointer and checking if it has
reached a limit. (If the check fails the allocator goes into its
slow path.) Allocations have good locality because they write
sequentially into memory. (A bit like a write-ahead log.)
Bump allocators need reasonably large contiguous chunks of empty
memory to make the most of their efficiency, so they are often
coupled with some kind of compacting garbage collector, which
defragments the heap to recover free space.
See `alloc_twigs()` in `lib/dns/qp.c` for the bump allocator fast
path.
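The fast path of a bump allocator can be sketched in a few lines. This is an illustration under assumed names and an assumed chunk size, not the body of `alloc_twigs()`; the slow path (getting a fresh chunk) is left to the caller.

```c
#include <assert.h>
#include <stdint.h>

#define CHUNK_CELLS 1024       /* illustrative chunk size, in nodes */
#define NO_SPACE    UINT32_MAX /* tells the caller to use the slow path */

/* Hypothetical chunk: a usage counter plus space for 12-byte nodes. */
struct chunk {
	uint32_t used; /* cells handed out so far */
	uint32_t cells[CHUNK_CELLS][3];
};

/*
 * Reserve `count` adjacent cells for a twigs vector. Returns the index
 * of the first cell, or NO_SPACE if the chunk cannot hold them.
 */
uint32_t
bump_alloc(struct chunk *c, uint32_t count) {
	uint32_t first;

	assert(count > 0 && count <= CHUNK_CELLS);
	if (CHUNK_CELLS - c->used < count) {
		return NO_SPACE;
	}
	first = c->used;
	c->used += count; /* the "bump" */
	return first;
}
```

There is no per-allocation free list: space is only recovered when a whole chunk is evacuated and released.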
garbage collection
------------------
[The Garbage Collection Handbook](https://gchandbook.org/) says
there are four basic kinds of automatic memory management.
Reference counting is used by scripting languages such as Perl and
Python, and also for manual memory management such as in operating
system kernels and BIND.
To avoid writing a custom allocator, I previously tried adapting the
qp-trie code to use refcounting to support copy-on-write, but I was
not very happy with the complexity of the implementation, and I
thought it was ugly that I needed to modify refcounts in nodes that
were logically read-only.
(Two other kinds of GC are mark-sweep and mark-compact. Both of them
have a similar disadvantage to refcounting: a simple GC mark phase
modifies nodes that are logically read-only. And mark-sweep leaves
memory fragmented so it does not support a bump allocator.)
The fourth kind is copying garbage collection. It works well with a
bump allocator, because copying the data structure using a bump
allocator in the most obvious way naturally compacts the data. And
the copying phase of the GC can run concurrently with readers
without interference.
BIND's qp-trie code uses a copying garbage collector only for its
interior nodes. The value objects that are attached to the leaves of
the trie are allocated by `isc_mem` and use reference counting like
the rest of BIND.
See `compact()` in `lib/dns/qp.c` for the copying phase of the
garbage collector. Reference counting for value objects is handled
by the `attach()` and `detach()` qp-trie methods.
memory layout
-------------
BIND's qp-trie code organizes its memory as a collection of "chunks",
each of which is a few pages in size and large enough to hold a few
thousand nodes.
Most memory management is per-chunk: obtaining memory from the
system allocator and returning it; keeping track of which chunks are
in use by readers, and which chunks can be mutated; and counting
whether chunks are fragmented enough to need garbage collection.
As noted above, we also use the chunk-based layout to reduce the size
of interior nodes. Instead of using a native pointer (typically 64
bits) to refer to a node, we use a 32 bit integer containing the chunk
number and the position of the node in the chunk. This reduces the
memory used by interior nodes by 25%.
In `lib/dns/qp_p.h`, the _"main qp-trie structures"_ hold information
about a trie's chunks. Most of the chunk handling code is in the
_"allocator"_ and _"chunk reclamation"_ sections in `lib/dns/qp.c`.
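A sketch of how a 32-bit reference might be split into a chunk number and a cell position, and resolved through an array of chunk base pointers. The 10-bit cell field (1024 nodes per chunk) and all the names are assumptions for this example. Shrinking a 16-byte node to 12 bytes is where the 25% saving comes from.

```c
#include <assert.h>
#include <stdint.h>

#define CELL_BITS 10 /* assumption: 1024 nodes per chunk */
#define CELL_MASK ((UINT32_C(1) << CELL_BITS) - 1)
#define CHUNK_MAX (UINT32_MAX >> CELL_BITS)

typedef uint32_t ref_t;

/* Hypothetical 12-byte node, as in a 64-bit word plus a 32-bit word. */
typedef struct {
	uint32_t lo, hi, small;
} node_t;

ref_t
make_ref(uint32_t chunk, uint32_t cell) {
	assert(chunk <= CHUNK_MAX && cell <= CELL_MASK);
	return (chunk << CELL_BITS) | cell;
}

uint32_t
ref_chunk(ref_t ref) {
	return ref >> CELL_BITS;
}

uint32_t
ref_cell(ref_t ref) {
	return ref & CELL_MASK;
}

/* Turn a reference into a pointer using the array of chunk bases. */
node_t *
ref_ptr(node_t *const base[], ref_t ref) {
	return &base[ref_chunk(ref)][ref_cell(ref)];
}
```

Because every node access goes through the chunk array, a reader that holds its own copy of that array keeps seeing the chunks it started with.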
lifecycle of value objects
--------------------------
A leaf node contains a pointer to a value object that is not managed
by the qp-trie garbage collector. Instead, the user provides
`attach` and `detach` methods that the qp-trie code calls to update
the reference counts in the value objects.
Value object reference counts do not indicate whether the object is
mutable: its refcount can be 1 while it is only in use by readers
(and must be left unchanged), or newly created by a writer (and
therefore mutable).
So, callers must themselves keep track of whether leaf objects are newly
inserted (and therefore mutable) or not. XXXFANF this might change, by
adding special lookup functions that return whether leaf objects are
mutable - see the "todo" in `include/dns/qp.h`.
locking and RCU
---------------
The Linux kernel has a collection of copy-on-write schemes collectively
called read-copy-update; there is also https://liburcu.org/ for RCU in
userspace. RCU is attractively speedy: readers can proceed without
blocking at all; writers can proceed concurrently with readers, and
updates can be committed without blocking. A commit is just a single
atomic pointer update. RCU only requires writers to block when waiting
for a "grace period" while older readers complete their critical
sections, after which the writer can free memory that is no longer in
use. Writers must also block on a mutex to ensure there is only one
writer at a time.
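The "single atomic pointer update" can be illustrated with C11 atomics. This sketch shows only the publish step, using hypothetical types; it leaves out the writer mutex and, importantly, the grace period that must pass before the old version is freed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for one committed version of a trie. */
struct version {
	int generation;
};

struct multi {
	_Atomic(struct version *) current;
};

/* Reader: a single acquire load, with no locking. */
const struct version *
reader_begin(struct multi *m) {
	return atomic_load_explicit(&m->current, memory_order_acquire);
}

/*
 * Writer: publish `next` and return the previous version. The caller
 * may free the previous version only after every reader that could
 * still hold it has finished.
 */
struct version *
writer_commit(struct multi *m, struct version *next) {
	assert(next != NULL);
	return atomic_exchange_explicit(&m->current, next,
					memory_order_acq_rel);
}
```

A reader that loaded the old pointer before the exchange keeps using the old version, which is why reclamation has to wait.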
The qp-trie concurrency strategy is designed to be able to use RCU, but
RCU is not required. Instead of RCU we can use a reader-writer lock.
This requires readers to block when a writer commits, which (in RCU
style) just requires an atomic pointer swap. The rwlock also changes
when writers must block: commits must wait for readers to exit their
critical sections, but there is no further waiting to be able to release
memory.
In BIND, there are two kinds of reader: queries, which are relatively
quick, and zone transfers, which are relatively slow. BIND's dbversion
machinery allows updates to proceed while there are long-running zone
transfers. RCU supports this without further machinery, but a
reader-writer lock needs some help so that long-running readers can
avoid blocking writers.
To avoid blocking updates, long-running readers can take a snapshot of a
qp-trie, which only requires copying the allocator's chunk array. After
a writer commits, it does not release memory if there are any
snapshots. Instead, chunks that are no longer needed by the latest
version of the trie are stashed on a list to be released later,
analogous to RCU waiting for a grace period.
The locking occurs only in the functions under _"read-write
transactions"_ and _"read-only transactions"_ in `lib/dns/qp.c`.
immutability and copy-on-write
------------------------------
A qp-trie has a `generation` counter which is incremented by each
write transaction. We keep track of which generation each chunk was
created in; only chunks created in the current generation are
mutable, because older chunks may be in use by concurrent readers.
This logic is implemented by `chunk_alloc()` and `chunk_mutable()`
in `lib/dns/qp.c`.
The `make_twigs_mutable()` function ensures that a node is mutable,
copying it if necessary.
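The generation check can be sketched like this. The struct and function names are hypothetical and the fixed-size array is only for brevity; the point is that mutability is decided per chunk by comparing two counters, without touching the nodes themselves.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CHUNKS 4 /* illustrative fixed number of chunks */

struct trie {
	uint32_t generation;        /* bumped by each write transaction */
	uint32_t chunk_gen[CHUNKS]; /* generation that created each chunk */
};

void
begin_write(struct trie *t) {
	t->generation++;
}

/* Record that `chunk` was allocated during the current transaction. */
void
note_new_chunk(struct trie *t, uint32_t chunk) {
	assert(chunk < CHUNKS);
	t->chunk_gen[chunk] = t->generation;
}

/* Older chunks may be visible to readers, so they must be copied. */
bool
is_chunk_mutable(const struct trie *t, uint32_t chunk) {
	assert(chunk < CHUNKS);
	return t->chunk_gen[chunk] == t->generation;
}
```

A node in a chunk that fails this check is copied into a chunk of the current generation before it is changed.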
The chunk arrays are a mixture of mutable and immutable. Pointers to
immutable chunks are immutable; new chunks can be assigned to unused
entries; and entries are cleared when it is safe to reclaim the chunks
they refer to. If the chunk arrays need to be expanded, the existing
arrays are retained for use by readers, and the writer uses the
expanded arrays (see `alloc_slow()`). The old arrays are cleaned up
after the writer commits.
update transactions
-------------------
A typical heavy-weight `update` transaction comprises:
* make a copy of the chunk arrays in case we need to roll back
* get a freshly allocated chunk where new nodes or copied nodes
can be written
* make any changes that are required; nodes in old chunks are
copied to the new space first; new nodes are modified in place
to avoid creating unnecessary garbage
* when the updates are finished, and before committing, run the
garbage collector to clear out chunks that were fragmented by the
update
* shrink the allocation chunk to eliminate unused space
* commit the update by flipping the root pointer of the trie; this
is the only point that needs a multithreading interlock
* free any chunks that were emptied by the garbage collector
A lightweight `write` transaction is similar, except that:
* rollback is not supported
* any existing allocation chunk is reused if possible
* the garbage collector is not run before committing
* the allocation chunk is not shrunk
testing strategies
------------------
The main qp-trie test is in `tests/dns/qpmulti_test.c`. This uses
randomized testing of the transactional API, with a lot of consistency
checking to detect bugs.
There are also a couple of fuzzers, which aim to benefit from
coverage-guided exploration of the test space and test minimization.
In `fuzz/dns_qp.c` we treat the fuzzer input as a bytecode to exercise
the single-threaded API, and `fuzz/dns_qpkey_name.c` checks conversion
from DNS names to lookup keys.
In `tests/bench` there are a few benchmarks. `load-names` does a very
basic comparison between BIND's hash table, red-black tree, and
qp-trie. `qpmulti` checks multicore performance of the transactional
API (similar to `qpmulti_test` but without the consistency checking).
And `qp-dump` is a utility for printing out the contents of a qp-trie.
John Regehr has some nice essays about testing data structures:
* Levels of fuzzing: https://blog.regehr.org/archives/1039
(how much semantic knowledge does your fuzzer have?)
* Testing with small capacities: https://blog.regehr.org/archives/1138
(I need to be able to change the chunk size)
* Write fuzzable code: https://blog.regehr.org/archives/1687
* Oracles for random testing: https://blog.regehr.org/archives/856
warning: generational collection
--------------------------------
The "generational hypothesis" is that most allocations have a short
lifetime, so it is profitable for a garbage collector to split its
heap into a number of generations. The youngest generation is where
allocations happen; it typically uses a bump allocator, and when the
allocation pointer reaches its limit, the youngest generation's
contents are copied to the second generation. The hypothesis is that
only a small fraction of the youngest generation will still be live
when the GC runs, so this copy will not take much time or space.
For a qp-trie the truth of this hypothesis depends on the order in
which keys are added or removed. It may be true if there is good
locality, for example, adding keys in lexicographic order, but not in
general.
When a qp-trie is mutated, only one node needs to be altered, near the
leaf that is added or removed. Nodes near the root of the trie tend to
be more stable and long-lived. However, during a copy-on-write
transaction, the path from the root to an altered leaf must be copied,
so nodes near the root are no longer stable and long-lived. They may
become stable in a long transaction, but that isn't guaranteed.
So the idea of generational garbage collection seems to be unhelpful
for a qp-trie.


@@ -99,6 +99,7 @@ libdns_la_HEADERS = \
 include/dns/order.h \
 include/dns/peer.h \
 include/dns/private.h \
+include/dns/qp.h \
 include/dns/rbt.h \
 include/dns/rcode.h \
 include/dns/rdata.h \
@@ -157,6 +158,7 @@ libdns_la_SOURCES = \
 cache.c \
 callbacks.c \
 catz.c \
+client.c \
 clientinfo.c \
 compress.c \
 db.c \
@@ -206,6 +208,8 @@ libdns_la_SOURCES = \
 order.c \
 peer.c \
 private.c \
+qp.c \
+qp_p.h \
 rbt.c \
 rbtdb.h \
 rbtdb.c \
@@ -233,18 +237,17 @@ libdns_la_SOURCES = \
 transport.c \
 tkey.c \
 tsig.c \
+tsig_p.h \
 ttl.c \
 update.c \
 validator.c \
 view.c \
 xfrin.c \
 zone.c \
+zone_p.h \
 zoneverify.c \
 zonekey.c \
-zt.c \
-client.c \
-tsig_p.h \
-zone_p.h
+zt.c

 if HAVE_GSSAPI
 libdns_la_SOURCES += \


@@ -80,6 +80,7 @@ extern isc_logmodule_t dns_modules[];
#define DNS_LOGMODULE_DYNDB (&dns_modules[30])
#define DNS_LOGMODULE_DNSTAP (&dns_modules[31])
#define DNS_LOGMODULE_SSU (&dns_modules[32])
#define DNS_LOGMODULE_QP (&dns_modules[33])
ISC_LANG_BEGINDECLS

574
lib/dns/include/dns/qp.h Normal file

@@ -0,0 +1,574 @@
/*
* Copyright (C) Internet Systems Consortium, Inc. ("ISC")
*
* SPDX-License-Identifier: MPL-2.0
*
* This Source Code Form is subject to the terms of the Mozilla Public
* License, v. 2.0. If a copy of the MPL was not distributed with this
* file, you can obtain one at https://mozilla.org/MPL/2.0/.
*
* See the COPYRIGHT file distributed with this work for additional
* information regarding copyright ownership.
*/
#pragma once
/*
* A qp-trie is a kind of key -> value map, supporting lookups that are
* aware of the lexicographic order of keys.
*
* Keys are `dns_qpkey_t`, which is a string-like thing, usually created
* from a DNS name. You can use both relative and absolute DNS names as
* keys.
*
* Leaf values are a pair of a `void *` pointer and a `uint32_t`
* (because that is what fits inside an internal qp-trie leaf node).
*
* The trie does not store keys; instead keys are derived from leaf values
* by calling a method provided by the user.
*
* There are a few flavours of qp-trie.
*
* The basic `dns_qp_t` supports single-threaded read/write access.
*
* A `dns_qpmulti_t` is a wrapper that supports multithreaded access.
* There can be many concurrent readers and a single writer. Writes are
* transactional, and support multi-version concurrency.
*
* The concurrency strategy uses copy-on-write. When making changes during
* a transaction, the caller must not modify leaf values in place, but
* instead delete the old leaf from the trie and insert a replacement. Leaf
* values have reference counts, which will indicate when the old leaf
* value can be freed after it is no longer needed by readers using an old
* version of the trie.
*
* For fast concurrent reads, call `dns_qpmulti_query()` to get a
* `dns_qpread_t`. Readers can access a single version of the trie between
* write commits. Most write activity is not blocked by readers, but reads
* must finish before a write can commit (a read-write lock blocks
* commits).
*
* For long-running reads that need a stable view of the trie, while still
* allowing commits to proceed, call `dns_qpmulti_snapshot()` to get a
* `dns_qpsnap_t`. It briefly gets the write mutex while creating the
* snapshot, which requires allocating a copy of some of the trie's
* metadata. A snapshot is for relatively heavy long-running read-only
* operations such as zone transfers.
*
* While snapshots exist, a qp-trie cannot reclaim memory: it does not
* retain detailed information about which memory is used by which
* snapshots, so it pessimistically retains all memory that might be
* used by old versions of the trie.
*
* You can start one read-write transaction at a time using
* `dns_qpmulti_write()` or `dns_qpmulti_update()`. Either way, you
* get a `dns_qp_t` that can be modified like a single-threaded trie,
* without affecting other read-only query or snapshot users of the
* `dns_qpmulti_t`. Committing a transaction only blocks readers
* briefly when flipping the active readonly `dns_qp_t` pointer.
*
* "Update" transactions are heavyweight. They allocate working memory to
* hold modifications to the trie, and compact the trie before committing.
* For extra space savings, a partially-used allocation chunk is shrunk to
* the smallest size possible. Unlike "write" transactions, an "update"
* transaction can be rolled back instead of committed. (Update
* transactions are intended for things like authoritative zones, where it
* is important to keep the per-trie memory overhead low because there can
* be a very large number of them.)
*
* "Write" transactions are more lightweight: they skip the allocation and
* compaction at the start and end of the transaction. (Write transactions
* are intended for frequent small changes, as in the DNS cache.)
*/
/***********************************************************************
*
* types
*/
#include <isc/attributes.h>
#include <dns/types.h>
/*%
* A `dns_qp_t` supports single-threaded read/write access.
*/
typedef struct dns_qp dns_qp_t;
/*%
* A `dns_qpmulti_t` supports multi-version concurrent reads and transactional
* modification.
*/
typedef struct dns_qpmulti dns_qpmulti_t;
/*%
* A `dns_qpread_t` is a lightweight read-only handle on a `dns_qpmulti_t`.
*/
typedef struct dns_qpread dns_qpread_t;
/*%
* A `dns_qpsnap_t` is a heavier read-only snapshot of a `dns_qpmulti_t`.
*/
typedef struct dns_qpsnap dns_qpsnap_t;
/*
* The read-only qp-trie functions can work on either of the read-only
* qp-trie types or the general-purpose read-write `dns_qp_t`. They
* rely on the fact that all the `dns_qpreadable_t` structures start
* with a `dns_qpread_t`.
*/
typedef union dns_qpreadable {
dns_qpread_t *qpr;
dns_qpsnap_t *qps;
dns_qp_t *qpt;
} dns_qpreadable_t __attribute__((__transparent_union__));
#define dns_qpreadable_cast(qp) ((qp).qpr)
/*%
* A trie lookup key is a small array, allocated on the stack during trie
* searches. Keys are usually created on demand from DNS names using
* `dns_qpkey_fromname()`, but in principle you can define your own
* functions to convert other types to trie lookup keys.
*
* A domain name can be up to 255 bytes. When converted to a key, each
* character in the name corresponds to one byte in the key if it is a
* common hostname character; otherwise unusual characters are escaped,
* using two bytes in the key. So we allow keys to be up to 512 bytes.
* (The actual max is (255 - 5) * 2 + 6 == 506)
*
* Every byte of a key must be greater than 0 and less than 48. Elements
* after the end of the key are treated as having the value 1.
*/
typedef uint8_t dns_qpkey_t[512];
/*%
* These leaf methods allow the qp-trie code to call back to the code
* responsible for the leaf values that are stored in the trie. The
* methods are provided for a whole trie when the trie is created.
*
* The qp-trie is also given a context pointer that is passed to the
* methods, so the methods know about the trie's context as well as a
* particular leaf value.
*
* The `attach` and `detach` methods adjust reference counts on value
* objects. They support copy-on-write and safe memory reclamation
* needed for multi-version concurrency.
*
* Note: When a value object reference count is greater than one, the
* object is in use by concurrent readers so it must not be modified. A
* refcount equal to one does not indicate whether or not the object is
* mutable: its refcount can be 1 while it is only in use by readers (and
* must be left unchanged), or newly created by a writer (and therefore
* mutable).
*
* The `makekey` method fills in a `dns_qpkey_t` corresponding to a
* value object stored in the qp-trie. It returns the length of the
* key. This method will typically call dns_qpkey_fromname() with a
* name stored in the value object.
*
* For logging and tracing, the `triename` method copies a human-
* readable identifier into `buf` which has max length `size`.
*/
typedef struct dns_qpmethods {
void (*attach)(void *ctx, void *pval, uint32_t ival);
void (*detach)(void *ctx, void *pval, uint32_t ival);
size_t (*makekey)(dns_qpkey_t key, void *ctx, void *pval,
uint32_t ival);
void (*triename)(void *ctx, char *buf, size_t size);
} dns_qpmethods_t;
/*%
* Buffers for use by the `triename()` method need to be large enough
* to hold a zone name and a few descriptive words.
*/
#define DNS_QP_TRIENAME_MAX 300
/*%
* A container for the counters returned by `dns_qp_memusage()`
*/
typedef struct dns_qp_memusage {
void *ctx; /*%< qp-trie method context */
size_t leaves; /*%< values in the trie */
size_t live; /*%< nodes in use */
size_t used; /*%< allocated nodes */
size_t hold; /*%< nodes retained for readers */
size_t free; /*%< nodes to be reclaimed */
size_t node_size; /*%< in bytes */
size_t chunk_size; /*%< nodes per chunk */
size_t chunk_count; /*%< allocated chunks */
size_t bytes; /*%< total memory in chunks and metadata */
} dns_qp_memusage_t;
/***********************************************************************
*
* functions - create, destroy, enquire
*/
void
dns_qp_create(isc_mem_t *mctx, const dns_qpmethods_t *methods, void *ctx,
dns_qp_t **qptp);
/*%<
* Create a single-threaded qp-trie.
*
* Requires:
* \li `mctx` is a pointer to a valid memory context.
* \li all the methods are non-NULL
* \li `qptp != NULL && *qptp == NULL`
*
* Ensures:
* \li `*qptp` is a pointer to a valid single-threaded qp-trie
*/
void
dns_qp_destroy(dns_qp_t **qptp);
/*%<
* Destroy a single-threaded qp-trie.
*
* Requires:
* \li `qptp != NULL`
* \li `*qptp` is a pointer to a valid single-threaded qp-trie
*
* Ensures:
* \li all memory allocated by the qp-trie has been released
* \li `*qptp` is NULL
*/
void
dns_qpmulti_create(isc_mem_t *mctx, const dns_qpmethods_t *methods, void *ctx,
dns_qpmulti_t **qpmp);
/*%<
* Create a multi-threaded qp-trie.
*
* Requires:
* \li `mctx` is a pointer to a valid memory context.
* \li all the methods are non-NULL
* \li `qpmp != NULL && *qpmp == NULL`
*
* Ensures:
* \li `*qpmp` is a pointer to a valid multi-threaded qp-trie
*/
void
dns_qpmulti_destroy(dns_qpmulti_t **qpmp);
/*%<
* Destroy a multi-threaded qp-trie.
*
* Requires:
* \li `qpmp != NULL`
* \li `*qpmp` is a pointer to a valid multi-threaded qp-trie
* \li there are no write or update transactions in progress
* \li no snapshots exist
*
* Ensures:
* \li all memory allocated by the qp-trie has been released
* \li `*qpmp` is NULL
*/
void
dns_qp_compact(dns_qp_t *qp);
/*%<
* Defragment the entire qp-trie and release unused memory.
*
* When modifications make a trie too fragmented, it is automatically
* compacted. Automatic compaction avoids compacting chunks that are not
* fragmented to save time, but this function compacts the entire trie to
* defragment it as much as possible.
*
* This function can be used with a single-threaded qp-trie and during a
* transaction on a multi-threaded trie.
*
* Requires:
* \li `qp` is a pointer to a valid qp-trie
*/
void
dns_qp_gctime(uint64_t *compact_us, uint64_t *recover_us,
uint64_t *rollback_us);
/*%<
* Get the total times spent on garbage collection in microseconds.
*
* These counters are global, covering every qp-trie in the program.
*
* XXXFANF This is a placeholder until we can record times in histograms.
*/
dns_qp_memusage_t
dns_qp_memusage(dns_qp_t *qp);
/*%<
* Get the memory counters from a qp-trie
*
* Requires:
* \li `qp` is a pointer to a valid qp-trie
*
* Returns:
* \li a `dns_qp_memusage_t` structure described above
*/
/***********************************************************************
*
* functions - search, modify
*/
/*
* XXXFANF todo, based on what we discover BIND needs
*
* fancy searches: longest match, lexicographic predecessor,
* etc.
*
* do we need specific lookup functions to find out if the
* returned value is readonly or mutable?
*
* richer modification such as dns_qp_replace{key,name}
*
* iteration - probably best to put an explicit stack in the iterator,
* cf. rbtnodechain
*/
size_t
dns_qpkey_fromname(dns_qpkey_t key, const dns_name_t *name);
/*%<
* Convert a DNS name into a trie lookup key.
*
* Requires:
* \li `name` is a pointer to a valid `dns_name_t`
*
* Returns:
* \li the length of the key
*/
isc_result_t
dns_qp_getkey(dns_qpreadable_t qpr, const dns_qpkey_t searchk, size_t searchl,
void **pval_r, uint32_t *ival_r);
/*%<
* Find a leaf in a qp-trie that matches the given key
*
* The leaf values are assigned to `*pval_r` and `*ival_r`
*
* Requires:
* \li `qpr` is a pointer to a readable qp-trie
* \li `pval_r != NULL`
* \li `ival_r != NULL`
*
* Returns:
* \li ISC_R_NOTFOUND if the trie has no leaf with a matching key
* \li ISC_R_SUCCESS if the leaf was found
*/
isc_result_t
dns_qp_getname(dns_qpreadable_t qpr, const dns_name_t *name, void **pval_r,
uint32_t *ival_r);
/*%<
* Find a leaf in a qp-trie that matches the given DNS name
*
* The leaf values are assigned to `*pval_r` and `*ival_r`
*
* Requires:
* \li `qpr` is a pointer to a readable qp-trie
* \li `name` is a pointer to a valid `dns_name_t`
* \li `pval_r != NULL`
* \li `ival_r != NULL`
*
* Returns:
* \li ISC_R_NOTFOUND if the trie has no leaf with a matching key
* \li ISC_R_SUCCESS if the leaf was found
*/
isc_result_t
dns_qp_insert(dns_qp_t *qp, void *pval, uint32_t ival);
/*%<
* Insert a leaf into a qp-trie
*
* Requires:
* \li `qp` is a pointer to a valid qp-trie
* \li `pval != NULL`
* \li `alignof(pval) > 1`
*
* Returns:
* \li ISC_R_EXISTS if the trie already has a leaf with the same key
* \li ISC_R_SUCCESS if the leaf was added to the trie
*/
isc_result_t
dns_qp_deletekey(dns_qp_t *qp, const dns_qpkey_t key, size_t len);
/*%<
* Delete a leaf from a qp-trie that matches the given key
*
* Requires:
* \li `qp` is a pointer to a valid qp-trie
*
* Returns:
* \li ISC_R_NOTFOUND if the trie has no leaf with a matching key
* \li ISC_R_SUCCESS if the leaf was deleted from the trie
*/
isc_result_t
dns_qp_deletename(dns_qp_t *qp, const dns_name_t *name);
/*%<
* Delete a leaf from a qp-trie that matches the given DNS name
*
* Requires:
* \li `qp` is a pointer to a valid qp-trie
* \li `name` is a pointer to a valid `dns_name_t`
*
* Returns:
* \li ISC_R_NOTFOUND if the trie has no leaf with a matching name
* \li ISC_R_SUCCESS if the leaf was deleted from the trie
*/
/***********************************************************************
*
* functions - transactions
*/
void
dns_qpmulti_query(dns_qpmulti_t *multi, dns_qpread_t **qprp);
/*%<
* Start a lightweight (brief) read-only transaction
*
* This takes a read lock on `multi`'s rwlock that prevents
* transactions from committing.
*
* Requires:
* \li `multi` is a pointer to a valid multi-threaded qp-trie
* \li `qprp != NULL`
* \li `*qprp == NULL`
*
* Returns:
* \li `*qprp` is a pointer to a valid read-only qp-trie handle
*/
void
dns_qpread_destroy(dns_qpmulti_t *multi, dns_qpread_t **qprp);
/*%<
* End a lightweight read transaction, i.e. release read lock
*
* Requires:
* \li `multi` is a pointer to a valid multi-threaded qp-trie
* \li `qprp != NULL`
* \li `*qprp` is a read-only qp-trie handle obtained from `multi`
*
* Returns:
* \li `*qprp == NULL`
*/
void
dns_qpmulti_snapshot(dns_qpmulti_t *multi, dns_qpsnap_t **qpsp);
/*%<
* Start a heavyweight (long) read-only transaction
*
* This function briefly takes and releases the modification mutex
* while allocating a copy of the trie's metadata. While the snapshot
* exists it does not interfere with other read-only or read-write
* transactions on the trie, except that memory cannot be reclaimed.
*
* Requires:
* \li `multi` is a pointer to a valid multi-threaded qp-trie
* \li `qpsp != NULL`
* \li `*qpsp == NULL`
*
* Returns:
* \li `*qpsp` is a pointer to a snapshot obtained from `multi`
*/
void
dns_qpsnap_destroy(dns_qpmulti_t *multi, dns_qpsnap_t **qpsp);
/*%<
* End a heavyweight read transaction
*
* If this is the last remaining snapshot belonging to `multi` then
* this function takes the modification mutex in order to free() any
* memory that is no longer in use.
*
* Requires:
* \li `multi` is a pointer to a valid multi-threaded qp-trie
* \li `qpsp != NULL`
* \li `*qpsp` is a pointer to a snapshot obtained from `multi`
*
* Returns:
* \li `*qpsp == NULL`
*/
void
dns_qpmulti_update(dns_qpmulti_t *multi, dns_qp_t **qptp);
/*%<
* Start a heavyweight write transaction
*
* This style of transaction allocates a copy of the trie's metadata to
* support rollback, and it aims to minimize the memory usage of the
* trie between transactions. The trie is compacted when the transaction
* commits, and any partly-used chunk is shrunk to fit.
*
* During the transaction, the modification mutex is held.
*
* Requires:
* \li `multi` is a pointer to a valid multi-threaded qp-trie
* \li `qptp != NULL`
* \li `*qptp == NULL`
*
* Returns:
* \li `*qptp` is a pointer to the modifiable qp-trie inside `multi`
*/
void
dns_qpmulti_write(dns_qpmulti_t *multi, dns_qp_t **qptp);
/*%<
* Start a lightweight write transaction
*
* This style of transaction does not need extra allocations in addition
* to the ones required by insert and delete operations. It is intended
* for a large trie that gets frequent small writes, such as a DNS
* cache.
*
* During the transaction, the modification mutex is held.
*
* Requires:
* \li `multi` is a pointer to a valid multi-threaded qp-trie
* \li `qptp != NULL`
* \li `*qptp == NULL`
*
* Returns:
* \li `*qptp` is a pointer to the modifiable qp-trie inside `multi`
*/
void
dns_qpmulti_commit(dns_qpmulti_t *multi, dns_qp_t **qptp);
/*%<
* Complete a modification transaction
*
* The commit itself only requires flipping the read pointer inside
* `multi` from the old version of the trie to the new version. This
* function takes a write lock on `multi`'s rwlock just long enough to
* flip the pointer. This briefly blocks `query` readers.
*
* This function releases the modification mutex after the post-commit
* memory reclamation is completed.
*
* Requires:
* \li `multi` is a pointer to a valid multi-threaded qp-trie
* \li `qptp != NULL`
* \li `*qptp` is a pointer to the modifiable qp-trie inside `multi`
*
* Returns:
* \li `*qptp == NULL`
*/
void
dns_qpmulti_rollback(dns_qpmulti_t *multi, dns_qp_t **qptp);
/*%<
* Abandon an update transaction
*
* This function reclaims the memory allocated during the transaction
* and releases the modification mutex.
*
* Requires:
* \li `multi` is a pointer to a valid multi-threaded qp-trie
* \li `qptp != NULL`
* \li `*qptp` is a pointer to the modifiable qp-trie inside `multi`
* \li `*qptp` was obtained from `dns_qpmulti_update()`
*
* Returns:
* \li `*qptp == NULL`
*/
/**********************************************************************/
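
The commit behaviour documented for `dns_qpmulti_commit()` can be
modelled with a small single-threaded sketch. It is a toy under stated
assumptions, not the BIND implementation: the names (`toy_multi_t`,
`toy_begin()`, `toy_commit()`, `toy_rollback()`) are hypothetical, an
`int` stands in for the trie contents, and the rwlock and modification
mutex are only mentioned in comments rather than actually taken.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct toy_trie {
	int leaves; /* stand-in for the real trie contents */
} toy_trie_t;

typedef struct toy_multi {
	toy_trie_t phase[2]; /* one is read-only, one is mutable */
	toy_trie_t *read;    /* the version that readers see */
	bool in_transaction;
} toy_multi_t;

static void
toy_init(toy_multi_t *m) {
	m->phase[0].leaves = 0;
	m->phase[1].leaves = 0;
	m->read = &m->phase[0];
	m->in_transaction = false;
}

/* the phase that readers are not currently looking at */
static toy_trie_t *
toy_other(toy_multi_t *m) {
	return ((m->read == &m->phase[0]) ? &m->phase[1] : &m->phase[0]);
}

/*
 * Start a modification transaction. Real code would take the
 * modification mutex here and hold it until commit or rollback.
 */
static toy_trie_t *
toy_begin(toy_multi_t *m) {
	assert(!m->in_transaction);
	toy_trie_t *w = toy_other(m);
	*w = *m->read; /* start from the committed state */
	m->in_transaction = true;
	return (w);
}

/*
 * Commit by flipping the read pointer. Real code would hold the
 * write side of the rwlock only around this assignment.
 */
static void
toy_commit(toy_multi_t *m) {
	assert(m->in_transaction);
	m->read = toy_other(m);
	m->in_transaction = false;
}

/* Abandon the changes; readers never saw the modified phase. */
static void
toy_rollback(toy_multi_t *m) {
	assert(m->in_transaction);
	m->in_transaction = false;
}
```

Changes made through the pointer returned by `toy_begin()` stay
invisible via `read` until `toy_commit()` runs, and `toy_rollback()`
leaves `read` untouched. In the API above, rollback is only available
for transactions started with `dns_qpmulti_update()`.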


@@ -36,23 +36,18 @@ isc_logcategory_t dns_categories[] = {
* \#define to <dns/log.h>.
*/
isc_logmodule_t dns_modules[] = {
{ "dns/db", 0 }, { "dns/rbtdb", 0 },
{ "dns/rbt", 0 }, { "dns/rdata", 0 },
{ "dns/master", 0 }, { "dns/message", 0 },
{ "dns/cache", 0 }, { "dns/config", 0 },
{ "dns/resolver", 0 }, { "dns/zone", 0 },
{ "dns/journal", 0 }, { "dns/adb", 0 },
{ "dns/xfrin", 0 }, { "dns/xfrout", 0 },
{ "dns/acl", 0 }, { "dns/validator", 0 },
{ "dns/dispatch", 0 }, { "dns/request", 0 },
{ "dns/masterdump", 0 }, { "dns/tsig", 0 },
{ "dns/tkey", 0 }, { "dns/sdb", 0 },
{ "dns/diff", 0 }, { "dns/hints", 0 },
{ "dns/unused1", 0 }, { "dns/dlz", 0 },
{ "dns/dnssec", 0 }, { "dns/crypto", 0 },
{ "dns/packets", 0 }, { "dns/nta", 0 },
{ "dns/dyndb", 0 }, { "dns/dnstap", 0 },
{ "dns/ssu", 0 }, { NULL, 0 }
{ "dns/db", 0 }, { "dns/rbtdb", 0 }, { "dns/rbt", 0 },
{ "dns/rdata", 0 }, { "dns/master", 0 }, { "dns/message", 0 },
{ "dns/cache", 0 }, { "dns/config", 0 }, { "dns/resolver", 0 },
{ "dns/zone", 0 }, { "dns/journal", 0 }, { "dns/adb", 0 },
{ "dns/xfrin", 0 }, { "dns/xfrout", 0 }, { "dns/acl", 0 },
{ "dns/validator", 0 }, { "dns/dispatch", 0 }, { "dns/request", 0 },
{ "dns/masterdump", 0 }, { "dns/tsig", 0 }, { "dns/tkey", 0 },
{ "dns/sdb", 0 }, { "dns/diff", 0 }, { "dns/hints", 0 },
{ "dns/unused1", 0 }, { "dns/dlz", 0 }, { "dns/dnssec", 0 },
{ "dns/crypto", 0 }, { "dns/packets", 0 }, { "dns/nta", 0 },
{ "dns/dyndb", 0 }, { "dns/dnstap", 0 }, { "dns/ssu", 0 },
{ "dns/qp", 0 }, { NULL, 0 },
};
isc_log_t *dns_lctx = NULL;

1571
lib/dns/qp.c Normal file

File diff suppressed because it is too large

703
lib/dns/qp_p.h Normal file

@@ -0,0 +1,703 @@
/*
* Copyright (C) Internet Systems Consortium, Inc. ("ISC")
*
* SPDX-License-Identifier: MPL-2.0
*
* This Source Code Form is subject to the terms of the Mozilla Public
* License, v. 2.0. If a copy of the MPL was not distributed with this
* file, you can obtain one at https://mozilla.org/MPL/2.0/.
*
* See the COPYRIGHT file distributed with this work for additional
* information regarding copyright ownership.
*/
/*
* For an overview, see doc/design/qp-trie.md
*/
#pragma once
/***********************************************************************
*
* interior node basics
*/
/*
* A qp-trie node can be a leaf or a branch. It consists of three 32-bit
* words into which the components are packed. They are used as a 64-bit
* word and a 32-bit word, but they are not declared like that to avoid
* unwanted padding, keeping the size down to 12 bytes. They are in native
* endian order so getting the 64-bit part should compile down to an
* unaligned load.
*
* In a branch the 64-bit word is described by the enum below. The 32-bit
* word is a reference to the packed sparse vector of "twigs", i.e. child
* nodes. A branch node has at least 2 and fewer than SHIFT_OFFSET twigs
* (see the enum below). The qp-trie update functions ensure that branches
* actually branch, i.e. branches cannot have only 1 child.
*
* The contents of each leaf are set by the trie's user. The 64-bit word
* contains a pointer value (which must be word-aligned), and the 32-bit
* word is an arbitrary integer value.
*/
typedef struct qp_node {
#if WORDS_BIGENDIAN
uint32_t bighi, biglo, small;
#else
uint32_t biglo, bighi, small;
#endif
} qp_node_t;
/*
* A branch node contains a 64-bit word comprising the branch/leaf tag,
* the bitmap, and an offset into the key. It is called an "index word"
* because it describes how to access the twigs vector (think "database
* index"). The following enum sets up the bit positions of these parts.
*
* In a leaf, the same 64-bit word contains a pointer. The pointer
* must be word-aligned so that the branch/leaf tag bit is zero.
* This requirement is checked by the newleaf() constructor.
*
* The bitmap is just above the tag bit. The `bits_for_byte[]` table is
* used to fill in a key so that bit tests can work directly against the
* index word without superfluous masking or shifting; we don't need to
* mask out the bitmap before testing a bit, but we do need to mask the
* bitmap before calling popcount.
*
* The byte offset into the key is at the top of the word, so that it
* can be extracted with just a shift, with no masking needed.
*
* The names are SHIFT_thing because they are qp_shift_t values. (See
* below for the various `qp_*` type declarations.)
*
* These values are relatively fixed in practice; the symbolic names
* avoid mystery numbers in the code.
*/
enum {
SHIFT_BRANCH = 0, /* branch / leaf tag */
SHIFT_NOBYTE, /* label separator has no byte value */
SHIFT_BITMAP, /* many bits here */
SHIFT_OFFSET = 48, /* offset of byte in key */
};
/*
* Value of the node type tag bit.
*
* It is defined this way to be explicit about where the value comes
* from, even though we know it is always the bottom bit.
*/
#define BRANCH_TAG (1ULL << SHIFT_BRANCH)
/***********************************************************************
*
* garbage collector tuning parameters
*/
/*
* A "cell" is a location that can contain a `qp_node_t`, and a "chunk"
* is a moderately large array of cells. A big trie can occupy
* multiple chunks. (Unlike other nodes, a trie's root node lives in
* its `struct dns_qp` instead of being allocated in a cell.)
*
* The qp-trie allocator hands out space for twigs vectors. Allocations are
* made sequentially from one of the chunks; this kind of "sequential
* allocator" is also known as a "bump allocator", so in `struct dns_qp`
* (see below) the allocation chunk is called `bump`.
*/
/*
* Number of cells in a chunk is a power of 2, which must have space for
* a full twigs vector (48 wide). When testing, use a much smaller chunk
* size to make the allocator work harder.
*/
#ifdef FUZZING_BUILD_MODE_UNSAFE_FOR_PRODUCTION
#define QP_CHUNK_LOG 7
#else
#define QP_CHUNK_LOG 10
#endif
STATIC_ASSERT(6 <= QP_CHUNK_LOG && QP_CHUNK_LOG <= 20,
"qp-trie chunk size is unreasonable");
#define QP_CHUNK_SIZE (1U << QP_CHUNK_LOG)
#define QP_CHUNK_BYTES (QP_CHUNK_SIZE * sizeof(qp_node_t))
/*
* A chunk needs to be compacted if it has fragmented this much.
* (12.5% overhead seems reasonable)
*/
#define QP_MAX_FREE (QP_CHUNK_SIZE / 8)
/*
* Compact automatically when we pass this threshold: when there is a lot
* of free space in absolute terms, and when we have freed more than half
* of the space we allocated.
*
* The current compaction algorithm scans the whole trie, so it is important
* to scale the threshold based on the size of the trie to avoid quadratic
* behaviour. XXXFANF find an algorithm that scans less of the trie!
*
* During a modification transaction, when we copy-on-write some twigs we
* count the old copy as "free", because they will be when the transaction
* commits. But they cannot be recovered immediately so they are also
* counted as on hold, and discounted when we decide whether to compact.
*/
#define QP_MAX_GARBAGE(qp) \
(((qp)->free_count - (qp)->hold_count) > QP_CHUNK_SIZE * 4 && \
((qp)->free_count - (qp)->hold_count) > (qp)->used_count / 2)
/*
* The chunk base and usage arrays are resized geometrically and start off
* with two entries.
*/
#define GROWTH_FACTOR(size) ((size) + (size) / 2 + 2)
/***********************************************************************
*
* helper types
*/
/*
* C is not strict enough with its integer types for these typedefs to
* improve type safety, but it helps to have annotations saying what
* particular kind of number we are dealing with.
*/
/*
* The number or position of a bit inside a word. (0..63)
*
* Note: A dns_qpkey_t is logically an array of qp_shift_t values, but it
* isn't declared that way because dns_qpkey_t is a public type whereas
* qp_shift_t is private.
*/
typedef uint8_t qp_shift_t;
/*
* The number of bits set in a word (as in Hamming weight or popcount)
* which is used for the position of a node in the packed sparse
* vector of twigs. (0..47) because our bitmap does not fill the word.
*/
typedef uint8_t qp_weight_t;
/*
* A chunk number, i.e. an index into the chunk arrays.
*/
typedef uint32_t qp_chunk_t;
/*
* Cell offset within a chunk, or a count of cells. Each cell in a
* chunk can contain a node.
*/
typedef uint32_t qp_cell_t;
/*
* A twig reference is used to refer to a twigs vector, which occupies a
* contiguous group of cells.
*/
typedef uint32_t qp_ref_t;
/*
* Constructors and accessors for qp_ref_t values, defined here to show
* how the qp_ref_t, qp_chunk_t, qp_cell_t types relate to each other
*/
static inline qp_ref_t
make_ref(qp_chunk_t chunk, qp_cell_t cell) {
return (QP_CHUNK_SIZE * chunk + cell);
}
static inline qp_chunk_t
ref_chunk(qp_ref_t ref) {
return (ref / QP_CHUNK_SIZE);
}
static inline qp_cell_t
ref_cell(qp_ref_t ref) {
return (ref % QP_CHUNK_SIZE);
}
/***********************************************************************
*
* main qp-trie structures
*/
#define QP_MAGIC ISC_MAGIC('t', 'r', 'i', 'e')
#define VALID_QP(qp) ISC_MAGIC_VALID(qp, QP_MAGIC)
/*
* This is annoying: C doesn't allow us to use a predeclared structure as
* an anonymous struct member, so we have to fart around. The feature we
* want is available in GCC and Clang with -fms-extensions, but a
* non-standard extension won't make these declarations neater if we must
* also have a standard alternative.
*/
/*
* Lightweight read-only access to a qp-trie.
*
* Just the fields needed for the hot path. The `base` field points
* to an array containing pointers to the base of each chunk like
* `qp->base[chunk]` - see `refptr()` below.
*
* A `dns_qpread_t` has a lifetime that does not extend across multiple
* write transactions, so it can share a chunk `base` array belonging to
* the `dns_qpmulti_t` it came from.
*
* We're lucky with the layout on 64 bit systems: this is only 40 bytes,
* with no padding.
*/
#define DNS_QPREAD_COMMON \
uint32_t magic; \
qp_node_t root; \
qp_node_t **base; \
void *ctx; \
const dns_qpmethods_t *methods
struct dns_qpread {
DNS_QPREAD_COMMON;
};
/*
* Heavyweight read-only snapshots of a qp-trie.
*
* Unlike a lightweight `dns_qpread_t`, a snapshot can survive across
* multiple write transactions, any of which may need to expand the
* chunk `base` array. So a `dns_qpsnap_t` keeps its own copy of the
* array, which will always be equal to some prefix of the expanded
* arrays in the `dns_qpmulti_t` that it came from.
*
* The `dns_qpmulti_t` keeps a refcount of its snapshots, and while
* the refcount is non-zero, chunks are not freed or reused. When a
* `dns_qpsnap_t` is destroyed, if it decrements the refcount to zero,
* it can do any deferred cleanup.
*
* The generation number is used for tracing.
*/
struct dns_qpsnap {
DNS_QPREAD_COMMON;
uint32_t generation;
dns_qpmulti_t *whence;
qp_node_t *base_array[];
};
/*
* Read-write access to a qp-trie requires extra fields to support the
* allocator and garbage collector.
*
* The chunk `base` and `usage` arrays are separate because the `usage`
* array is only needed for allocation, so it is kept separate from the
* data needed by the read-only hot path. The arrays have empty slots where
* new chunks can be placed, so `chunk_max` is the maximum number of chunks
* (until the arrays are resized).
*
* Bare instances of a `struct dns_qp` are used for stand-alone
* single-threaded tries. For multithreaded access, transactions alternate
* between the `phase` pair of dns_qp objects inside a dns_qpmulti.
*
* For multithreaded access, the `generation` counter allows us to know
* which chunks are writable or not: writable chunks were allocated in the
* current generation. For single-threaded access, the generation counter
* is always zero, so all chunks are considered to be writable.
*
* Allocations are made sequentially in the `bump` chunk. Lightweight write
* transactions can re-use the `bump` chunk, so its prefix before `fender`
* is immutable, and the rest is mutable even though its generation number
* does not match the current generation.
*
* To decide when to compact and reclaim space, QP_MAX_GARBAGE() examines
* the values of `used_count`, `free_count`, and `hold_count`. The
* `hold_count` tracks nodes that need to be retained while readers are
* using them; they are free but cannot be reclaimed until the transaction
* has committed, so the `hold_count` is discounted from QP_MAX_GARBAGE()
* during a transaction.
*
* There are some flags that alter the behaviour of write transactions.
*
* - The `transaction_mode` indicates whether the current transaction is a
* light write or a heavy update, or (between transactions) the previous
* transaction's mode, because the setup for the next transaction
* depends on how the previous one committed. The mode is set at the
* start of each transaction. It is QP_NONE in a single-threaded qp-trie
* to detect if part of a `dns_qpmulti_t` is passed to dns_qp_destroy().
*
* - The `compact_all` flag is used when every node in the trie should be
* copied. (Usually compaction aims to avoid moving nodes out of
* unfragmented chunks.) It is used when compaction is explicitly
* requested via `dns_qp_compact()`, and as an emergency mechanism if
* normal compaction failed to clear the QP_MAX_GARBAGE() condition.
* (This emergency is a bug even though we have a rescue mechanism.)
*
* - The `shared_arrays` flag indicates that the chunk `base` and `usage`
* arrays are shared by both `phase`s in this trie's `dns_qpmulti_t`.
* This allows us to delay allocating copies of the arrays during a
* write transaction, until we definitely need to resize them.
*
* - When built with fuzzing support, we can use mprotect() and munmap()
* to ensure that incorrect memory accesses cause fatal errors. The
* `write_protect` flag must be set straight after the `dns_qpmulti_t`
* is created, then left unchanged.
*
* Some of the dns_qp_t fields are only used for multithreaded transactions
* (marked [MT] below) but the same code paths are also used for single-
* threaded writes. To reduce the size of a dns_qp_t, these fields could
* perhaps be moved into the dns_qpmulti_t, but that would require some kind
* of conditional runtime downcast from dns_qp_t to dns_qpmulti_t, which is
* likely to be ugly. It is probably best to keep things simple if most tries
* need multithreaded access (XXXFANF do they? e.g. when there are many auth
* zones),
*/
struct dns_qp {
DNS_QPREAD_COMMON;
isc_mem_t *mctx;
/*% array of per-chunk allocation counters */
struct {
/*% the allocation point, increases monotonically */
qp_cell_t used;
/*% count of nodes no longer needed, also monotonic */
qp_cell_t free;
/*% when was this chunk allocated? */
uint32_t generation;
} *usage;
/*% transaction counter [MT] */
uint32_t generation;
/*% number of slots in `base` and `usage` arrays */
qp_chunk_t chunk_max;
/*% which chunk is used for allocations */
qp_chunk_t bump;
/*% twigs in the `bump` chunk below `fender` are read only [MT] */
qp_cell_t fender;
/*% number of leaf nodes */
qp_cell_t leaf_count;
/*% total of all usage[] counters */
qp_cell_t used_count, free_count;
/*% cells that cannot be recovered right now */
qp_cell_t hold_count;
/*% what kind of transaction was most recently started [MT] */
enum { QP_NONE, QP_WRITE, QP_UPDATE } transaction_mode : 2;
/*% compact the entire trie [MT] */
bool compact_all : 1;
/*% chunk arrays are shared with a readonly qp-trie [MT] */
bool shared_arrays : 1;
/*% optionally when compiled with fuzzing support [MT] */
bool write_protect : 1;
};
/*
* Concurrent access to a qp-trie.
*
* The `read` pointer is used for read queries. It points to one of the
* `phase` elements. During a transaction, the other `phase` (see
* `write_phase()` below) is modified incrementally in copy-on-write
* style. On commit the `read` pointer is swapped to the altered phase.
*/
struct dns_qpmulti {
uint32_t magic;
/*% controls access to the `read` pointer and its target phase */
isc_rwlock_t rwlock;
/*% points to phase[r] and swaps on commit */
dns_qp_t *read;
/*% protects the snapshot counter and `write_phase()` */
isc_mutex_t mutex;
/*% so we know when old chunks are still shared */
unsigned int snapshots;
/*% one is read-only, one is mutable */
dns_qp_t phase[2];
};
/*
* Get a pointer to the phase that isn't read-only.
*/
static inline dns_qp_t *
write_phase(dns_qpmulti_t *multi) {
bool read0 = multi->read == &multi->phase[0];
return (read0 ? &multi->phase[1] : &multi->phase[0]);
}
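/*
* Illustration only, not part of the qp-trie code: a minimal sketch of
* the two-phase swap described above. The `toy_` types and functions
* are hypothetical stand-ins for `dns_qp_t` and `dns_qpmulti_t`, and
* the rwlock and mutex are deliberately left out, so this shows only
* how the `read` pointer alternates between the two phases.
*/
```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for dns_qp_t: one value instead of a trie. */
typedef struct toy_phase {
	int value;
} toy_phase_t;

/* Hypothetical stand-in for dns_qpmulti_t, without any locks. */
typedef struct toy_multi {
	toy_phase_t *read; /* points to phase[0] or phase[1] */
	toy_phase_t phase[2];
} toy_multi_t;

static inline void
toy_init(toy_multi_t *multi, int value) {
	multi->phase[0].value = value;
	multi->phase[1].value = value;
	multi->read = &multi->phase[0];
}

/* Same selection logic as write_phase(): the phase that is not `read`. */
static inline toy_phase_t *
toy_write_phase(toy_multi_t *multi) {
	bool read0 = multi->read == &multi->phase[0];
	return (read0 ? &multi->phase[1] : &multi->phase[0]);
}

/* Alter the writable phase, then publish it by swapping `read`. */
static inline void
toy_commit(toy_multi_t *multi, int value) {
	toy_phase_t *writer = toy_write_phase(multi);
	assert(writer != multi->read);
	*writer = *multi->read; /* start from the readers' version */
	writer->value = value;
	multi->read = writer; /* readers now see the altered phase */
}
```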
#define QPMULTI_MAGIC ISC_MAGIC('q', 'p', 'm', 'v')
#define VALID_QPMULTI(qp) ISC_MAGIC_VALID(qp, QPMULTI_MAGIC)
/***********************************************************************
*
* interior node constructors and accessors
*/
/*
* See the comments under "interior node basics" above, which explain the
* layout of nodes as implemented by the following functions.
*/
/*
* Get the 64-bit word of a node.
*/
static inline uint64_t
node64(qp_node_t *n) {
uint64_t lo = n->biglo;
uint64_t hi = n->bighi;
return (lo | (hi << 32));
}
/*
* Get the 32-bit word of a node.
*/
static inline uint32_t
node32(qp_node_t *n) {
return (n->small);
}
/*
* Create a node from its parts
*/
static inline qp_node_t
make_node(uint64_t big, uint32_t small) {
return ((qp_node_t){
.biglo = (uint32_t)(big),
.bighi = (uint32_t)(big >> 32),
.small = small,
});
}
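/*
* Illustration only: a standalone sketch of the split 64-bit word used
* by node64() and make_node(). The `toy_node_t` layout is an assumption
* modelled on the `biglo`, `bighi`, and `small` fields used above; the
* real `qp_node_t` is defined elsewhere in this file.
*/
```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical node: three 32-bit words holding a 64-bit and a 32-bit value. */
typedef struct toy_node {
	uint32_t biglo, bighi, small;
} toy_node_t;

/* Split the 64-bit word into its low and high halves. */
static inline toy_node_t
toy_make_node(uint64_t big, uint32_t small) {
	return ((toy_node_t){
		.biglo = (uint32_t)(big),
		.bighi = (uint32_t)(big >> 32),
		.small = small,
	});
}

/* Reassemble the 64-bit word; widen before shifting to keep the high half. */
static inline uint64_t
toy_node64(const toy_node_t *n) {
	uint64_t lo = n->biglo;
	uint64_t hi = n->bighi;
	return (lo | (hi << 32));
}
```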
/*
* Test a node's tag bit.
*/
static inline bool
is_branch(qp_node_t *n) {
return (n->biglo & BRANCH_TAG);
}
/* leaf nodes *********************************************************/
/*
* Get a leaf's pointer value. The double cast is to avoid a warning
* about mismatched pointer/integer sizes on 32 bit systems.
*/
static inline void *
leaf_pval(qp_node_t *n) {
return ((void *)(uintptr_t)node64(n));
}
/*
* Get a leaf's integer value
*/
static inline uint32_t
leaf_ival(qp_node_t *n) {
return (node32(n));
}
/*
* Create a leaf node from its parts
*/
static inline qp_node_t
make_leaf(const void *pval, uint32_t ival) {
qp_node_t leaf = make_node((uintptr_t)pval, ival);
REQUIRE(!is_branch(&leaf) && pval != NULL);
return (leaf);
}
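/*
* Illustration only: a sketch of how a leaf's pointer survives the
* round trip through the 64-bit word. It assumes the branch tag is
* bit 0 (`TOY_BRANCH_TAG` is hypothetical), so any pointer to an
* object aligned to at least two bytes has the tag bit clear.
*/
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_BRANCH_TAG 1ULL /* assumed position of the tag bit */

typedef struct toy_node {
	uint32_t biglo, bighi, small;
} toy_node_t;

/* Store the pointer in the 64-bit word and the integer in the 32-bit word. */
static inline toy_node_t
toy_make_leaf(const void *pval, uint32_t ival) {
	uint64_t big = (uintptr_t)pval;
	/* mirrors the REQUIRE() in make_leaf() */
	assert(pval != NULL && (big & TOY_BRANCH_TAG) == 0);
	return ((toy_node_t){
		.biglo = (uint32_t)(big),
		.bighi = (uint32_t)(big >> 32),
		.small = ival,
	});
}

/* The cast via uintptr_t avoids a size-mismatch warning on 32-bit systems. */
static inline void *
toy_leaf_pval(const toy_node_t *n) {
	uint64_t big = (uint64_t)n->biglo | ((uint64_t)n->bighi << 32);
	return ((void *)(uintptr_t)big);
}
```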
/* branch nodes *******************************************************/
/*
* The following function names use plural `twigs` when they work on a
* branch's twigs vector as a whole, and singular `twig` when they work on
* a particular twig.
*/
/*
* Get a branch node's index word
*/
static inline uint64_t
branch_index(qp_node_t *n) {
return (node64(n));
}
/*
* Get a reference to a branch node's child twigs.
*/
static inline qp_ref_t
branch_twigs_ref(qp_node_t *n) {
return (node32(n));
}
/*
* Bit positions in the bitmap come directly from the key. DNS names are
* converted to keys using the tables declared at the end of this file.
*/
static inline qp_shift_t
qpkey_bit(const dns_qpkey_t key, size_t len, size_t offset) {
if (offset < len) {
return (key[offset]);
} else {
return (SHIFT_NOBYTE);
}
}
/*
* Extract a branch node's offset field, used to index the key.
*/
static inline size_t
branch_key_offset(qp_node_t *n) {
return ((size_t)(branch_index(n) >> SHIFT_OFFSET));
}
/*
* Which bit identifies the twig of this node for this key?
*/
static inline qp_shift_t
branch_keybit(qp_node_t *n, const dns_qpkey_t key, size_t len) {
return (qpkey_bit(key, len, branch_key_offset(n)));
}
/*
* Convert a twig reference into a pointer.
*/
static inline qp_node_t *
ref_ptr(dns_qpreadable_t qpr, qp_ref_t ref) {
dns_qpread_t *qp = dns_qpreadable_cast(qpr);
return (qp->base[ref_chunk(ref)] + ref_cell(ref));
}
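/*
* Illustration only: a sketch of turning a twig reference into a
* pointer. The real `ref_chunk()` and `ref_cell()` are defined
* elsewhere; here we assume a reference is a chunk number followed by
* a cell number, and `TOY_CHUNK_LOG` is an arbitrary size chosen for
* the example.
*/
```c
#include <assert.h>
#include <stdint.h>

enum {
	TOY_CHUNK_LOG = 10, /* assumed: 1024 cells per chunk */
	TOY_CHUNK_SIZE = 1 << TOY_CHUNK_LOG,
};

typedef uint32_t toy_ref_t;

typedef struct toy_node {
	uint32_t biglo, bighi, small;
} toy_node_t;

static inline uint32_t
toy_ref_chunk(toy_ref_t ref) {
	return (ref / TOY_CHUNK_SIZE);
}

static inline uint32_t
toy_ref_cell(toy_ref_t ref) {
	return (ref % TOY_CHUNK_SIZE);
}

static inline toy_ref_t
toy_make_ref(uint32_t chunk, uint32_t cell) {
	assert(cell < TOY_CHUNK_SIZE);
	return (chunk * TOY_CHUNK_SIZE + cell);
}

/* Index the array of chunk base pointers, then step to the cell. */
static inline toy_node_t *
toy_ref_ptr(toy_node_t **base, toy_ref_t ref) {
	return (base[toy_ref_chunk(ref)] + toy_ref_cell(ref));
}
```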
/*
* Get a pointer to a branch node's twigs vector.
*/
static inline qp_node_t *
branch_twigs_vector(dns_qpreadable_t qpr, qp_node_t *n) {
dns_qpread_t *qp = dns_qpreadable_cast(qpr);
return (ref_ptr(qp, branch_twigs_ref(n)));
}
/*
* Warm up the cache while calculating which twig we want.
*/
static inline void
prefetch_twigs(dns_qpreadable_t qpr, qp_node_t *n) {
__builtin_prefetch(branch_twigs_vector(qpr, n));
}
/***********************************************************************
*
* bitmap popcount shenanigans
*/
/*
* How many twigs appear in the vector before the one corresponding to the
* given bit? Calculated using popcount of part of the branch's bitmap.
*
* To calculate a mask that covers the lesser bits in the bitmap, we
* subtract 1 to set the bits, and subtract the branch tag because it
* is not part of the bitmap.
*/
static inline qp_weight_t
branch_twigs_before(qp_node_t *n, qp_shift_t bit) {
uint64_t mask = (1ULL << bit) - 1 - BRANCH_TAG;
uint64_t bmp = branch_index(n) & mask;
return ((qp_weight_t)__builtin_popcountll(bmp));
}
/*
* How many twigs does this node have?
*
* The offset is directly after the bitmap, so the mask of bits below the
* offset covers the whole bitmap, and the bitmap's weight is the number of
* twigs.
*/
static inline qp_weight_t
branch_twigs_size(qp_node_t *n) {
return (branch_twigs_before(n, SHIFT_OFFSET));
}
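/*
* Illustration only: a worked example of the mask arithmetic above.
* The bit positions are assumptions for the sketch (tag at bit 0,
* offset field starting at bit 48); the real SHIFT_ constants are
* defined elsewhere in this file. Like the code above, it relies on
* the GCC/Clang builtin __builtin_popcountll().
*/
```c
#include <assert.h>
#include <stdint.h>

enum {
	TOY_SHIFT_BRANCH = 0,  /* assumed position of the branch tag */
	TOY_SHIFT_OFFSET = 48, /* assumed start of the offset field */
};
#define TOY_BRANCH_TAG (1ULL << TOY_SHIFT_BRANCH)

/* Count the bitmap bits below `bit`, excluding the tag bit. */
static inline unsigned int
toy_twigs_before(uint64_t index, unsigned int bit) {
	assert(bit > TOY_SHIFT_BRANCH && bit <= TOY_SHIFT_OFFSET);
	/* subtracting 1 sets every lesser bit; then remove the tag */
	uint64_t mask = (1ULL << bit) - 1 - TOY_BRANCH_TAG;
	return ((unsigned int)__builtin_popcountll(index & mask));
}

/* Everything below the offset field (apart from the tag) is bitmap. */
static inline unsigned int
toy_twigs_size(uint64_t index) {
	return (toy_twigs_before(index, TOY_SHIFT_OFFSET));
}
```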
/*
* Position of a twig within the packed sparse vector.
*/
static inline qp_weight_t
branch_twig_pos(qp_node_t *n, qp_shift_t bit) {
return (branch_twigs_before(n, bit));
}
/*
* Get a pointer to a particular twig.
*/
static inline qp_node_t *
branch_twig_ptr(dns_qpreadable_t qpr, qp_node_t *n, qp_shift_t bit) {
return (branch_twigs_vector(qpr, n) + branch_twig_pos(n, bit));
}
/*
* Is the twig identified by this bit present?
*/
static inline bool
branch_has_twig(qp_node_t *n, qp_shift_t bit) {
return (branch_index(n) & (1ULL << bit));
}
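/*
* Illustration only: a sketch of a lookup in a packed sparse vector,
* combining the presence test with the position calculation. Plain
* `int` values stand in for twigs, the tag is assumed to be bit 0, and
* all `toy_` names are hypothetical.
*/
```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_BRANCH_TAG 1ULL /* assumed position of the tag bit */

static inline bool
toy_has_twig(uint64_t index, unsigned int bit) {
	return ((index & (1ULL << bit)) != 0);
}

/* The twig's position is the number of bitmap bits set below its own. */
static inline unsigned int
toy_twig_pos(uint64_t index, unsigned int bit) {
	uint64_t mask = (1ULL << bit) - 1 - TOY_BRANCH_TAG;
	return ((unsigned int)__builtin_popcountll(index & mask));
}

/* Return the twig for `bit`, or NULL when the bitmap says it is absent. */
static inline const int *
toy_find_twig(uint64_t index, const int *twigs, unsigned int bit) {
	assert(bit >= 1 && bit < 64);
	if (!toy_has_twig(index, bit)) {
		return (NULL);
	}
	return (twigs + toy_twig_pos(index, bit));
}
```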
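/* twig logistics *****************************************************/
/*
* Copy `size` twigs from one vector to another. memmove() is used, so
* the source and destination are allowed to overlap.
*/
static inline void
move_twigs(qp_node_t *to, qp_node_t *from, qp_weight_t size) {
memmove(to, from, size * sizeof(qp_node_t));
}
/*
* Clear `size` twigs to all-zero bytes.
*/
static inline void
zero_twigs(qp_node_t *twigs, qp_weight_t size) {
memset(twigs, 0, size * sizeof(qp_node_t));
}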
/***********************************************************************
*
* method invocation helpers
*/
static inline void
attach_leaf(dns_qpreadable_t qpr, qp_node_t *n) {
dns_qpread_t *qp = dns_qpreadable_cast(qpr);
qp->methods->attach(qp->ctx, leaf_pval(n), leaf_ival(n));
}
static inline void
detach_leaf(dns_qpreadable_t qpr, qp_node_t *n) {
dns_qpread_t *qp = dns_qpreadable_cast(qpr);
qp->methods->detach(qp->ctx, leaf_pval(n), leaf_ival(n));
}
static inline size_t
leaf_qpkey(dns_qpreadable_t qpr, qp_node_t *n, dns_qpkey_t key) {
dns_qpread_t *qp = dns_qpreadable_cast(qpr);
return (qp->methods->makekey(key, qp->ctx, leaf_pval(n), leaf_ival(n)));
}
static inline char *
triename(dns_qpreadable_t qpr, char *buf, size_t size) {
dns_qpread_t *qp = dns_qpreadable_cast(qpr);
qp->methods->triename(qp->ctx, buf, size);
return (buf);
}
#define TRIENAME(qp) \
triename(qp, (char[DNS_QP_TRIENAME_MAX]){}, DNS_QP_TRIENAME_MAX)
/***********************************************************************
*
* converting DNS names to trie keys
*/
/*
* This is a deliberate simplification of the hostname characters,
* because it doesn't matter much if we treat a few extra characters
* favourably: there is plenty of space in the index word for a
* slightly larger bitmap.
*/
static inline bool
qp_common_character(uint8_t byte) {
return (('-' <= byte && byte <= '9') || ('_' <= byte && byte <= 'z'));
}
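/*
* Illustration only: a sketch that enumerates which bytes the test
* above accepts. The two ranges cover '-', '.', '/', the digits, '_',
* '`', and the lower-case letters. `toy_common_character()` repeats
* the predicate so the example stands alone; `toy_count_common()` is a
* hypothetical helper.
*/
```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

static inline bool
toy_common_character(uint8_t byte) {
	return (('-' <= byte && byte <= '9') || ('_' <= byte && byte <= 'z'));
}

/* Count how many of the 256 byte values are treated as common. */
static inline unsigned int
toy_count_common(void) {
	unsigned int count = 0;
	for (unsigned int byte = 0; byte < 256; byte++) {
		if (toy_common_character((uint8_t)byte)) {
			count++;
		}
	}
	return (count);
}
```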
/*
* Lookup table mapping bytes in DNS names to bit positions, used
* by dns_qpkey_fromname() to convert DNS names to qp-trie keys.
*/
extern uint16_t dns_qp_bits_for_byte[];
/*
* And the reverse, mapping bit positions to characters, so the tests
* can print diagnostics involving qp-trie keys.
*/
extern uint8_t dns_qp_byte_for_bit[];
/**********************************************************************/