mirror of
https://gitlab.isc.org/isc-projects/bind9
synced 2025-08-31 06:25:31 +00:00
Add a qp-trie data structure
A qp-trie is a kind of radix tree that is particularly well-suited to
DNS servers. I invented the qp-trie in 2015, based on Dan Bernstein's
crit-bit trees and Phil Bagwell's HAMT. https://dotat.at/prog/qp/

This code incorporates some new ideas that I prototyped using
NLnet Labs NSD in 2020 (optimizations for DNS names as keys) and
2021 (custom allocator and garbage collector).
https://dotat.at/cgi/git/nsd.git

The BIND version of my qp-trie code has a number of improvements
compared to the prototype developed for NSD.

  * The main omission in the prototype was the very sketchy outline of
    how locking might work. Now the locking has been implemented,
    using a reader/writer lock and a mutex. However, it is designed to
    benefit from liburcu if that is available.

  * The prototype was designed for two-version concurrency, one
    version for readers and one for the writer. The new code supports
    multiversion concurrency, to provide a basis for BIND's dbversion
    machinery, so that updates are not blocked by long-running zone
    transfers.

  * There are now two kinds of transaction that modify the trie: an
    `update` aims to support many very small zones without wasting
    memory; a `write` avoids unnecessary allocation to help the
    performance of many small changes to the cache.

  * There is also a single-threaded interface for situations where
    concurrent access is not necessary.

  * The API makes better use of types to make it more clear which
    operations are permitted when.

  * The lookup table used to convert a DNS name to a qp-trie key is
    now initialized by a run-time constructor instead of a programmer
    using copy-and-paste. Key conversion is more flexible, so the
    qp-trie can be used with keys other than DNS names.

  * There has been much refactoring and re-arranging things to improve
    the terminology and order of presentation in the code, and the
    internal documentation has been moved from a comment into a file
    of its own.

Some of the required functionality has been stripped out, to be
brought back later after the basics are known to work.

  * Garbage collector performance statistics are missing.

  * Fancy searches are missing, such as longest match and nearest
    match.

  * Iteration is missing.

  * Search for update is missing, for cases where the caller needs to
    know if the value object is mutable or not.

770
doc/design/qp-trie.md
Normal file
@@ -0,0 +1,770 @@
<!--
Copyright (C) Internet Systems Consortium, Inc. ("ISC")

SPDX-License-Identifier: MPL-2.0

This Source Code Form is subject to the terms of the Mozilla Public
License, v. 2.0. If a copy of the MPL was not distributed with this
file, you can obtain one at https://mozilla.org/MPL/2.0/.

See the COPYRIGHT file distributed with this work for additional
information regarding copyright ownership.
-->

A qp-trie for the DNS
=====================

A qp-trie is a data structure that supports lookups in a sorted
collection of keys. It is efficient both in terms of fast lookups and
using little memory. It is particularly well-suited for use in DNS
servers.

These notes outline how BIND's `dns_qp` implementation works, how it
is optimized for lookups keyed by DNS names, and how it supports
multi-version concurrency.


data structure zoo
------------------

Chasing a pointer indirection is very slow, up to 100ns, whereas a
sequential memory access takes less than 10ns. So, to make a data
structure fast, we need to minimize indirections.

There is a tradeoff between speed and flexibility in standard data
structures:

  * Arrays are very simple and fast (a lookup goes straight to the
    right address), but the key can only be a small integer.

  * Hash tables allow you to use arbitrary lookup keys (such as
    strings), but may require probing multiple addresses to find the
    right element.

  * Radix trees allow you to do lookups based on the sorting order of
    the keys, provided it is lexical like `memcmp()`; however, lookups
    require multiple indirections.

  * Comparison search trees (binary trees and B-trees) allow you to
    use an arbitrary ordering predicate, but each indirection during
    a lookup also requires a comparison.

In the DNS, we need to use some kind of tree to support the kinds of
lookup required for DNSSEC: find longest match, find nearest
predecessor or successor, and so forth. So what kind of tree is best?


in theory
---------

In a tree where the average length of a key is `k`, and the number of
elements in the tree is `n`, the theoretical performance bounds are,
for a comparison tree:

  * `Ω(k * log n)`
  * `Ο(k * n)`

And for a radix tree:

  * `Ω(k + log n)`
  * `Ο(k + k)`

Here, `Ω()` is the lower bound and `Ο()` is the upper bound; we
expect typical performance to be close to the lower bound.

The multiplications in the comparison tree expressions mean that each
indirection requires a comparison `Ο(k)`, whereas they are additions
in the radix tree expressions because a radix tree traversal only
needs one key comparison.

The upper bounds say that (in the absence of balancing) a comparison
tree can devolve into a linked list of nodes, whereas the shape of a
radix tree is determined by the set of keys independent of the order
of insertion or the number of keys.

The logarithms hide some interesting constant factors. In a binary
tree, the log is base 2. In a radix tree, the radix is the base of the
logarithm. So, if we increase the radix, the constant factor gets
smaller. The rough equivalent for a binary tree would be to use a
B-tree instead, but although B-trees have fewer indirections they do
not reduce the number of comparisons.

In implementation terms, a larger radix means tree nodes get wider
and the tree becomes shallower. A shallower tree requires fewer
indirections, so it should be faster. The trick is to increase the
radix without blowing up the tree's memory usage, which can lose
more performance than we win.

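
To get a feel for how the radix affects depth, the following sketch
computes the smallest depth at which a fully populated tree of a given
radix can hold `n` keys, that is `ceil(log_radix(n))`. It is an
idealized illustration and not code from BIND: the name `min_depth()`
is made up for this example, and real tries are rarely perfectly
balanced or fully populated.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Smallest depth d such that radix^d >= n, using only integer
 * arithmetic. Hypothetical helper, for illustration only.
 */
static unsigned
min_depth(uint64_t n, uint64_t radix) {
    unsigned depth = 0;
    uint64_t reach = 1; /* number of leaves reachable at this depth */

    assert(radix >= 2);
    while (reach < n) {
        if (reach > UINT64_MAX / radix) {
            /* one more level would exceed any 64-bit n */
            return depth + 1;
        }
        reach *= radix;
        depth++;
    }
    return depth;
}
```

For a million keys this gives a depth of 20 for a binary tree, 5 for
radix 16, and 4 for radix 32.
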
This analysis suggests that a radix tree is better than a comparison
tree, provided keys can be compared lexically - which is true for DNS
names, with some rearrangement (described below). When using big-O
notation, we also need to be wary of the constant factors; but in this
case they also favour a radix tree, especially with the optimization
tricks used by BIND's qp-trie.

Note: "radix" comes from the Latin for "root", so "radix tree" is a
pun, which is geekily amusing especially when talking about logs.


what is a trie?
---------------

A trie is another name for a radix tree (or "digital tree" according
to Knuth). It is short for information reTRIEval, and I pronounce it
exactly like "tree" (though Knuth pronounces it like "try").

In a trie, keys are divided into digits depending on some radix, e.g.
base 2 for binary tries, base 256 for byte-indexed tries. When
searching the trie, successive digits in the key, from most to least
significant, are used to select branches from successive nodes in
the trie, roughly like:

    for (offset = 0; isbranch(node); offset++)
        node = node->child[key[offset]];

All of the keys in a subtrie have identical prefixes. Tries do not
need to store keys since they are implicit in the structure.


binary crit-bit trees
---------------------

A patricia trie is a binary trie which omits nodes that have only one
child. Dan Bernstein calls his tightly space-optimized version a
"crit-bit tree".
https://cr.yp.to/critbit.html
https://github.com/agl/critbit/

Unlike a basic trie, a crit-bit tree skips parts of the key when
every element in a subtree shares the same sequence of bits.
Each node is annotated with the offset of the bit that is used to
select the branch; offsets always increase as you go deeper into
the tree.

    while (isbranch(node))
        node = node->child[key[node->offset]];

In a crit-bit tree the keys are not implicit in the structure
because parts of them are skipped. Therefore, each leaf refers to a
copy of its key so that when you find a leaf you can verify that the
skipped bits match.


prefetching
-----------

Observe that in the loop above, the current node has only one child
pointer, and the child nodes are adjacent in memory. This means it
is possible to tell the CPU to prefetch the child nodes before
extracting the critical bit from the key and choosing which child is
next. A qp-trie has a similar layout, but it has more child nodes
(still adjacent in memory) and it does more computation to choose
which one is next.

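
As a rough sketch of where such a hint goes, here is a crit-bit style
traversal with a prefetch issued before the bit extraction. The
`struct cb_node` layout and the names `key_bit()` and `cb_find_leaf()`
are assumptions invented for this example; they are not BIND's data
structures. `__builtin_prefetch()` is the GCC/Clang builtin.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical crit-bit node: a branch owns two adjacent children. */
struct cb_node {
    int is_branch;
    uint32_t bit_offset;         /* branch: index of the critical bit */
    const struct cb_node *twigs; /* branch: array of two children */
    const char *value;           /* leaf: payload */
};

/*
 * Bit number `bit_offset` of the key, counting from the most
 * significant bit of the first byte; bits past the end read as zero.
 */
static unsigned
key_bit(const uint8_t *key, size_t len, uint32_t bit_offset) {
    size_t byte = bit_offset / 8;
    if (byte >= len) {
        return 0;
    }
    return (key[byte] >> (7 - bit_offset % 8)) & 1;
}

static const struct cb_node *
cb_find_leaf(const struct cb_node *node, const uint8_t *key, size_t len) {
    assert(node != NULL);
    while (node->is_branch) {
        const struct cb_node *twigs = node->twigs;
        /* start fetching the children while we compute the index */
        __builtin_prefetch(twigs);
        node = &twigs[key_bit(key, len, node->bit_offset)];
    }
    return node;
}
```

As with any crit-bit lookup, the caller still has to compare the
search key with the key stored in the leaf that is returned.
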
When I originally invented the qp-trie code, I found that explicit
prefetch hints made the qp-trie substantially faster and the crit-bit
tree slightly faster. The hints help the CPU to do useful work at the
same time as the memory subsystem. (This is unusual for linked data
structures, which tend to alternate between CPU waiting for memory,
and memory waiting for CPU.)

Large modern CPUs (after about 2015) are better at prefetching
automatically, so the explicit hint is less important than it used to
be, but `lib/dns/qp.c` still has `__builtin_prefetch()` hints in its
inner traversal loops.


packed sparse vectors with popcount
-----------------------------------

The `popcount` instruction counts the number of bits that are set
in a word. It's also known as the Hamming weight; Knuth calls it
"sideways add". https://en.wikipedia.org/wiki/popcount

You can use `popcount` to implement a sparse vector of length `N`
containing `M <= N` members using a bitmap of length `N` and a packed
vector of `M` elements. A member `b` is present in the vector if bit
`b` is set, so `M == popcount(bitmap)`. The index of member `b` in
the packed vector is the popcount of the bits preceding `b`.

    // size of vector
    size = popcount(bitmap);
    // bit position (64-bit constant, so that b can be 32 or more)
    bit = 1ULL << b;
    // is element present?
    if (bitmap & bit) {
        // mask covers the preceding elements
        mask = bit - 1;
        // position of element in packed vector
        pos = popcount(bitmap & mask);
        // fetch element
        elem = vector[pos];
    }

See "Hacker's Delight" by Hank Warren, section 5-1 "Counting 1
bits", subsection "applications". http://www.hackersdelight.org

See under _"bitmap popcount shenanigans"_ in `lib/dns/qp.c` for how
this is implemented in BIND.

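
For readers who want something they can compile, here is a minimal
self-contained version of the same idea, assuming a 64-bit bitmap and
a packed vector of `int`. The names `struct sparse_vec` and
`sparse_get()` are made up for this sketch and do not appear in BIND;
`__builtin_popcountll()` is the GCC/Clang builtin.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sparse vector: up to 64 slots, only present ones stored. */
struct sparse_vec {
    uint64_t bitmap;   /* bit b is set if member b is present */
    const int *packed; /* popcount(bitmap) elements, in bit order */
};

/* Fetch member b into *out; returns false if it is absent. */
static bool
sparse_get(const struct sparse_vec *v, unsigned b, int *out) {
    assert(b < 64);
    uint64_t bit = UINT64_C(1) << b;
    if ((v->bitmap & bit) == 0) {
        return false;
    }
    /* count the present members that come before b */
    unsigned pos = (unsigned)__builtin_popcountll(v->bitmap & (bit - 1));
    *out = v->packed[pos];
    return true;
}
```
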
popcount for trie nodes
-----------------------

Phil Bagwell's hashed array-mapped tries (HAMT) use popcount for
compact trie nodes. In a HAMT, string keys are hashed, and the hash is
used as the index to the trie, with radix 2^32 or 2^64.
http://infoscience.epfl.ch/record/64394/files/triesearches.pdf
http://infoscience.epfl.ch/record/64398/files/idealhashtrees.pdf

As discussed above, increasing the radix makes the tree shallower, so
it should be faster. The downside is usually much greater memory
overhead. Child vectors are often sparsely populated, so we can
greatly reduce the overhead by packing them with popcount.

The HAMT relies on hashing, which keeps keys dense. This means it
can be laid out like a basic trie with implicit keys (i.e. hash
values). The disadvantage of hashing is that strings are stored
out of order.


qp-trie
-------

A qp-trie is a mash-up of Bernstein's crit-bit tree with Bagwell's
HAMT. Like a crit-bit tree, a qp-trie omits nodes with one child;
nodes include a key offset; and keys are referenced from leaves instead
of being implicit in the trie structure. Like a HAMT, nodes have a
popcount packed vector of children, but unlike a HAMT, keys are not
hashed.

A qp-trie is faster than a crit-bit tree and uses less memory, because
its wider fan-out requires fewer nodes and popcount packs them very
efficiently. Like a crit-bit tree but unlike a HAMT, a qp-trie stores
keys in lexical order.

As in a HAMT, the original layout of a qp-trie node is a pair of
words, which are used as key and value pointers in leaf nodes, and
index word and pointer in branch nodes. The index word contains the
popcount bitmap (as in a HAMT) and the offset into the key (as in a
crit-bit tree), as well as a leaf/branch tag bit. The pointer refers
to the branch node's "twigs", which is what we call the packed sparse
vector of child nodes.

The fan-out of a qp-trie is limited by the need to fit the bitmap and
the nybble offset into a 64-bit word; a radix of 16 or 32 works well,
and 32 is slightly faster (though 5-bit nybbles are fiddly). But radix
64 requires an extra word per node, and the extra memory overhead
makes it slower as well as bulkier.

Early qp-trie implementations used a node layout like the
following. However, in practice C bitfields have too many
portability gotchas to work well. It is better to use hand-written
shifting and masking to access the parts of the index word.

    #define NYBBLE 4 // or 5
    #define RADIX (1 << NYBBLE)

    union qp_node {
        struct {
            unsigned tag : 1;
            unsigned bitmap : RADIX;
            unsigned offset : (64 - 1 - RADIX);
            union qp_node *twigs;
        } branch;
        struct {
            void *value;
            const char *key;
        } leaf;
    };

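
A minimal sketch of the shifting-and-masking alternative, for the
radix-16 case, might look like this. The bit positions (tag in bit 0,
then the bitmap, then the offset) and all of the names are assumptions
chosen for this example; BIND's actual accessors live in
`lib/dns/qp_p.h` and use a different layout.

```c
#include <assert.h>
#include <stdint.h>

enum {
    TAG_BITS = 1,
    BITMAP_BITS = 16, /* radix 16 */
    OFFSET_BITS = 64 - TAG_BITS - BITMAP_BITS,
    BITMAP_SHIFT = TAG_BITS,
    OFFSET_SHIFT = TAG_BITS + BITMAP_BITS,
};

/* Pack the three fields into one 64-bit index word. */
static uint64_t
make_index(unsigned tag, uint16_t bitmap, uint64_t offset) {
    assert(tag <= 1);
    assert(offset < (UINT64_C(1) << OFFSET_BITS));
    return (uint64_t)tag | ((uint64_t)bitmap << BITMAP_SHIFT) |
           (offset << OFFSET_SHIFT);
}

static unsigned
index_tag(uint64_t index) {
    return (unsigned)(index & 1);
}

static uint16_t
index_bitmap(uint64_t index) {
    return (uint16_t)((index >> BITMAP_SHIFT) & 0xFFFF);
}

static uint64_t
index_offset(uint64_t index) {
    return index >> OFFSET_SHIFT;
}
```

Unlike bitfields, the position of every field here is fixed by the
code rather than by the compiler's ABI.
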
DNS qp-trie
-----------

BIND uses a variant of a qp-trie optimized for DNS names. DNS names
almost always use the usual hostname alphabet of (case-insensitive)
letters, digits, hyphen, plus underscore (which is often used in the DNS
for non-hostname purposes), and finally the label separator (which is
written as '.' in presentation-format domain names, and is the label
length in wire format). This adds up to 39 common characters.

A bitmap for 39 common characters is small enough to fit into a
qp-trie index word, so we can (in principle) walk down the trie one
character at a time, as if the radix were 256, but without needing a
multi-word bitmap.

However, DNS names can contain arbitrary bytes. To support the 200-ish
unusual characters we use an escaping scheme, described in more detail
below. This requires a few more bits in the bitmap to represent the
escape characters, so our radix ends up being 47. This still fits into
the 64-bit index word, so we get the compactness of a qp-trie but with
faster byte-at-a-time lookups for DNS names that use common hostname
characters.

You can also use other kinds of keys with BIND's DNS qp-trie, provided
they are not too long. You must provide your own key preparation
function, e.g. for uniform binary keys you might extract 5-bit nybbles
to get a radix-32 trie.


preparing a lookup key
----------------------

A DNS name needs to be rearranged to use it as a qp-trie key, so that
the lexical order of rearranged keys matches the canonical DNS name
order specified in RFC 4034 section 6.1:

  * reverse the order of the labels so that they run from most
    significant to least significant, left to right (but the
    characters in each label remain in the same order)

  * convert uppercase ASCII letters to lowercase ASCII

  * change the label separators to a non-byte value that sorts before
    the zero byte

For qp-trie lookups there are a couple of extra steps:

  * There is an escaping mechanism to support DNS names that use
    unusual characters. Common characters use one byte in the lookup
    key, but unusual characters are expanded to two bytes. To preserve
    the correct lexical order, there are different escape bytes
    depending on how the unusual character sorts relative to the
    common hostname characters.

  * Characters in the DNS name need to be converted to bitmap
    positions. This is done at the same time as preparing the lookup
    key, to move work out of the inner trie traversal loop.

These 5 transformations can be done in a single pass over a DNS name
using a single lookup table. The transformed name is usually the
same length (up to 2x longer if it contains unusual characters).

You can use absolute or relative DNS names as keys, without ambiguity
(provided you have some way of knowing what names are relative to).
When converted to a lookup key, absolute names start with a non-byte
value representing the root, and relative names do not.

Lookup keys are ephemeral, allocated on the stack during a lookup.

See under _"converting DNS names to trie keys"_ in `lib/dns/qp.c`
for how this is implemented in BIND.

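
The first three steps can be sketched as follows. This is a
deliberately simplified illustration, not BIND's implementation: it
takes a presentation-format name with no backslash escapes, it omits
the two-byte escaping and the bitmap-position mapping, and it uses
`uint16_t` digits so that the separator (`KEY_SEP`, value 0) can sort
before every byte (stored as byte value plus one). All names in the
code are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { KEY_SEP = 0, KEY_MAX = 512 };

/* Convert e.g. "www.Example." to SEP "example" SEP "www" SEP. */
static size_t
name_to_key(const char *name, uint16_t key[KEY_MAX]) {
    size_t len = strlen(name);
    size_t klen = 0;
    size_t end = len; /* one past the end of the current label */

    assert(len < KEY_MAX / 2);
    if (len > 0 && name[len - 1] == '.') {
        key[klen++] = KEY_SEP; /* absolute name: root comes first */
        end = len - 1;
    }
    while (end > 0) {
        /* find the start of the rightmost remaining label */
        size_t start = end;
        while (start > 0 && name[start - 1] != '.') {
            start--;
        }
        /* characters within a label keep their order */
        for (size_t i = start; i < end; i++) {
            key[klen++] = (uint16_t)(tolower((unsigned char)name[i]) + 1);
        }
        key[klen++] = KEY_SEP;
        end = (start > 0) ? start - 1 : 0;
    }
    return klen;
}

/* Lexical comparison; a key that is a prefix of another sorts first. */
static int
key_compare(const uint16_t *a, size_t alen, const uint16_t *b, size_t blen) {
    size_t n = alen < blen ? alen : blen;
    for (size_t i = 0; i < n; i++) {
        if (a[i] != b[i]) {
            return a[i] < b[i] ? -1 : 1;
        }
    }
    return (alen > blen) - (alen < blen);
}
```

With this encoding `example.` becomes a 9-digit key, and the key for
`www.example.` sorts before the key for `www1.example.` because the
separator after `www` is smaller than any character.
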
node layout
-----------

Earlier I said that the original qp-trie node layout consists of two
words: one 64-bit word for the branch index, and one pointer-sized
word. BIND's qp-trie uses a layout that is smaller on 64-bit systems:
one 64-bit word and one 32-bit word.

A branch node contains

  * a branch/leaf tag bit

  * a 47-wide bitmap, with a bit for each common hostname character
    and each escape character

  * a 9-bit key offset, enough to count twice the length of a DNS
    name

  * a 32-bit "twigs" reference to the packed vector of child nodes;
    these references are described in more detail below

A leaf node contains a pointer value (which we assume to be 64 bits)
and a 32-bit integer value. The branch/leaf tag is smuggled into the
low-order bit of the pointer value, so the pointer value must have
large enough alignment. (This requirement is checked when a leaf is
added to the trie.) Apart from that, the meaning of leaf values
is entirely under control of the qp-trie user.

When constructing a qp-trie the user provides a collection of method
pointers. The qp-trie code calls these methods when it needs to do
anything that needs to look into a leaf value, such as extracting the
key.

See under _"interior node basics"_ and _"interior node constructors
and accessors"_ in `lib/dns/qp_p.h` for the implementation.

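
As a hedged sketch of how a 12-byte node with a tagged pointer can be
represented, the example below stores the 64-bit word as two 32-bit
halves next to the 32-bit word. The struct and function names, and the
choice that a set low bit means "branch", are assumptions made for
this example; see `lib/dns/qp_p.h` for what BIND really does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Three 32-bit cells: 12 bytes with 4-byte alignment, instead of the
 * 16 bytes needed by a 64-bit word plus a native 64-bit pointer.
 */
struct node12 {
    uint32_t lo;    /* low half of the 64-bit word (holds the tag) */
    uint32_t hi;    /* high half of the 64-bit word */
    uint32_t small; /* twigs reference (branch) or integer (leaf) */
};

enum { BRANCH_TAG = 1 }; /* assumption: bit 0 set means branch */

static struct node12
make_leaf(void *pval, uint32_t ival) {
    uint64_t word = (uint64_t)(uintptr_t)pval;
    /* the pointer must be aligned so that its low bit is free */
    assert((word & BRANCH_TAG) == 0);
    return (struct node12){ (uint32_t)word, (uint32_t)(word >> 32), ival };
}

static bool
is_branch(const struct node12 *n) {
    return (n->lo & BRANCH_TAG) != 0;
}

static void *
leaf_pval(const struct node12 *n) {
    assert(!is_branch(n));
    return (void *)(uintptr_t)(((uint64_t)n->hi << 32) | n->lo);
}
```
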
example
-------

Consider a small zone:

    example.        ; apex
    mail.example.   ; IMAP server
    mx.example.     ; incoming mail
    www.example.    ; web load balancer
    www1.example.   ; back-end web servers
    www2.example.

It becomes a qp-trie as follows. I am writing bitmaps as lists of
characters representing the bits that are set, with `'.'` for label
separators. I have used arbitrary names for the addresses of the twigs
vectors.

    root = (qp_node){
        tag: BRANCH,
        offset: 9,
        bitmap: [ '.', 'm', 'w' ],
        twigs: &one,
    };

Note that the offset skips the root zone, the zone name, and the apex
label separator. If the offset is beyond the end of the key, the byte
value is the label separator.

    one = (qp_node[3]){
        {
            tag: LEAF,
            key: "example.",
        },
        {
            tag: BRANCH,
            offset: 10,
            bitmap: [ 'a', 'x' ],
            twigs: &two,
        },
        {
            tag: BRANCH,
            offset: 12,
            bitmap: [ '.', '1', '2' ],
            twigs: &three,
        },
    };

This twigs vector has an element for the zone apex, and the two
different initial characters of the subdomains.

The mail servers differ in the next character, so the offset bumps from
9 to 10 without skipping any characters. The web servers all start with
www, so the offset bumps from 9 to 12, skipping the common prefix.

    two = (qp_node[2]){
        {
            tag: LEAF,
            key: "mail.example.",
        },
        {
            tag: LEAF,
            key: "mx.example.",
        },
    };

The different lengths of `mail` and `mx` don't matter: we implicitly
skip to the end of the key when we reach a leaf node.

    three = (qp_node[3]){
        {
            tag: LEAF,
            key: "www.example.",
        },
        {
            tag: LEAF,
            key: "www1.example.",
        },
        {
            tag: LEAF,
            key: "www2.example.",
        },
    };

When the trie includes labels of differing lengths, we can have a node
that chooses between a label separator and characters from the longer
labels. This is slightly different from the root node, which tested the
first character of the label; here we are testing the last character.


memory management for concurrency
---------------------------------

The following sections discuss how the qp-trie supports concurrency.

The requirement is to support many concurrent read threads, and
allow updates to occur without blocking readers (or blocking readers
as little as possible).

The strategy is to use "copy-on-write", that is, when an update
needs to alter the trie it makes a copy of the parts that it needs
to change, so that concurrent readers can continue to use the
original. (It is analogous to multiversion concurrency in databases
such as PostgreSQL, where copy-on-write uses a write-ahead log.)

Software that uses copy-on-write needs some mechanism for clearing
away old versions that are no longer in use. (For example, VACUUM in
PostgreSQL.) The qp-trie code uses a custom allocator with a simple
garbage collector; as well as supporting concurrency, the qp-trie's
memory manager makes tries smaller and faster.


allocation
----------

A qp-trie is relatively demanding on its allocator. Twigs vectors
can be lots of different sizes, and every mutation of the trie
requires an alloc and/or a free.

Older versions of the qp-trie code used the system allocator. Many
allocators (such as `jemalloc`) segregate the heap into different
size classes, so that each chunk of memory is dedicated to
allocations of the same size. While this memory layout provides good
locality when objects of the same type have the same size, it tends
to scatter the interior nodes of a qp-trie all over the address space.

BIND's qp-trie code uses a "bump allocator" for its interior nodes,
which is one of the simplest and fastest possible: an allocation
usually only requires incrementing a pointer and checking if it has
reached a limit. (If the check fails the allocator goes into its
slow path.) Allocations have good locality because they write
sequentially into memory. (A bit like a write-ahead log.)

Bump allocators need reasonably large contiguous chunks of empty
memory to make the most of their efficiency, so they are often
coupled with some kind of compacting garbage collector, which
defragments the heap to recover free space.

See `alloc_twigs()` in `lib/dns/qp.c` for the bump allocator fast
path.

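
The fast path of a bump allocator is small enough to show in full.
The following is a generic sketch under assumed names (`struct bump`,
`bump_alloc()`, `BUMP_FULL`); it counts in node-sized cells and leaves
the slow path to the caller, and it should not be read as the body of
`alloc_twigs()`.

```c
#include <assert.h>
#include <stdint.h>

#define BUMP_FULL UINT32_MAX /* sentinel: caller must take the slow path */

/* Hypothetical allocation state for one chunk. */
struct bump {
    uint32_t used;  /* cells handed out so far */
    uint32_t limit; /* total cells in the chunk */
};

/* Reserve `count` adjacent cells; returns the index of the first. */
static uint32_t
bump_alloc(struct bump *b, uint32_t count) {
    assert(b->used <= b->limit);
    if (count > b->limit - b->used) {
        return BUMP_FULL; /* not enough room left in this chunk */
    }
    uint32_t start = b->used;
    b->used += count; /* the "bump" */
    return start;
}
```

Successive allocations land next to each other in the chunk, which is
where the good locality comes from.
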
garbage collection
------------------

[The Garbage Collection Handbook](https://gchandbook.org/) says
there are four basic kinds of automatic memory management.

Reference counting is used by scripting languages such as Perl and
Python, and also for manual memory management such as in operating
system kernels and BIND.

To avoid writing a custom allocator, I previously tried adapting the
qp-trie code to use refcounting to support copy-on-write, but I was
not very happy with the complexity of the implementation, and I
thought it was ugly that I needed to modify refcounts in nodes that
were logically read-only.

(Two other kinds of GC are mark-sweep and mark-compact. Both of them
have a similar disadvantage to refcounting: a simple GC mark phase
modifies nodes that are logically read-only. And mark-sweep leaves
memory fragmented so it does not support a bump allocator.)

The fourth kind is copying garbage collection. It works well with a
bump allocator, because copying the data structure using a bump
allocator in the most obvious way naturally compacts the data. And
the copying phase of the GC can run concurrently with readers
without interference.

BIND's qp-trie code uses a copying garbage collector only for its
interior nodes. The value objects that are attached to the leaves of
the trie are allocated by `isc_mem` and use reference counting like
the rest of BIND.

See `compact()` in `lib/dns/qp.c` for the copying phase of the
garbage collector. Reference counting for value objects is handled
by the `attach()` and `detach()` qp-trie methods.


memory layout
-------------

BIND's qp-trie code organizes its memory as a collection of "chunks",
each of which is a few pages in size and large enough to hold a few
thousand nodes.

Most memory management is per-chunk: obtaining memory from the
system allocator and returning it; keeping track of which chunks are
in use by readers, and which chunks can be mutated; and counting
whether chunks are fragmented enough to need garbage collection.

As noted above, we also use the chunk-based layout to reduce the size
of interior nodes. Instead of using a native pointer (typically 64
bits) to refer to a node, we use a 32-bit integer containing the chunk
number and the position of the node in the chunk. This reduces the
memory used by interior nodes by 25%.

In `lib/dns/qp_p.h`, the _"main qp-trie structures"_ hold information
about a trie's chunks. Most of the chunk handling code is in the
_"allocator"_ and _"chunk reclamation"_ sections in `lib/dns/qp.c`.

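
A 32-bit reference of this kind can be sketched as below. The split
of 12 bits for the position and 20 bits for the chunk number is an
assumption picked for illustration (it gives 4096 nodes per chunk);
the names `make_ref()`, `ref_chunk()`, `ref_cell()` and `ref_ptr()`
are likewise invented here.

```c
#include <assert.h>
#include <stdint.h>

enum {
    CELL_BITS = 12, /* assumption: 4096 nodes per chunk */
    CHUNK_BITS = 32 - CELL_BITS,
};

/* Stand-in for a 12-byte interior node. */
typedef struct {
    uint32_t word[3];
} cell_t;

static uint32_t
make_ref(uint32_t chunk, uint32_t cell) {
    assert(chunk < (UINT32_C(1) << CHUNK_BITS));
    assert(cell < (UINT32_C(1) << CELL_BITS));
    return (chunk << CELL_BITS) | cell;
}

static uint32_t
ref_chunk(uint32_t ref) {
    return ref >> CELL_BITS;
}

static uint32_t
ref_cell(uint32_t ref) {
    return ref & ((UINT32_C(1) << CELL_BITS) - 1);
}

/* Turn a reference into a pointer via the array of chunk bases. */
static cell_t *
ref_ptr(cell_t *const *chunks, uint32_t ref) {
    return &chunks[ref_chunk(ref)][ref_cell(ref)];
}
```

The 25% figure follows from the node sizes: a 64-bit word plus a
64-bit pointer is 16 bytes, whereas a 64-bit word plus a 32-bit
reference is 12 bytes.
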
lifecycle of value objects
--------------------------

A leaf node contains a pointer to a value object that is not managed
by the qp-trie garbage collector. Instead, the user provides
`attach` and `detach` methods that the qp-trie code calls to update
the reference counts in the value objects.

Value object reference counts do not indicate whether the object is
mutable: its refcount can be 1 while it is only in use by readers
(and must be left unchanged), or newly created by a writer (and
therefore mutable).

So, callers must themselves keep track of whether leaf objects are
newly inserted (and therefore mutable) or not. XXXFANF this might
change, by adding special lookup functions that return whether leaf
objects are mutable - see the "todo" in `include/dns/qp.h`.


locking and RCU
---------------

The Linux kernel has a collection of copy-on-write schemes collectively
called read-copy-update; there is also https://liburcu.org/ for RCU in
userspace. RCU is attractively speedy: readers can proceed without
blocking at all; writers can proceed concurrently with readers, and
updates can be committed without blocking. A commit is just a single
atomic pointer update. RCU only requires writers to block when waiting
for a "grace period" while older readers complete their critical
sections, after which the writer can free memory that is no longer in
use. Writers must also block on a mutex to ensure there is only one
writer at a time.

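
The "single atomic pointer update" can be illustrated with C11
atomics. This sketch shows only the pointer swap: it has no grace
period, no mutex, and no reader-writer lock, so on its own it does not
say when the old version may be freed. The names `struct version`,
`current_root`, `reader_snapshot()` and `writer_commit()` are
hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for one version of the trie. */
struct version {
    int serial;
};

static _Atomic(struct version *) current_root;

/* Readers pick up whichever version is current when they start. */
static struct version *
reader_snapshot(void) {
    return atomic_load_explicit(&current_root, memory_order_acquire);
}

/*
 * The writer publishes a fully built version in one step and gets
 * the previous version back, to be reclaimed once it is safe.
 */
static struct version *
writer_commit(struct version *next) {
    assert(next != NULL);
    return atomic_exchange_explicit(&current_root, next,
                                    memory_order_acq_rel);
}
```
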
The qp-trie concurrency strategy is designed to be able to use RCU, but
RCU is not required. Instead of RCU we can use a reader-writer lock.
This requires readers to block when a writer commits, which (in RCU
style) just requires an atomic pointer swap. The rwlock also changes
when writers must block: commits must wait for readers to exit their
critical sections, but there is no further waiting to be able to release
memory.

In BIND, there are two kinds of reader: queries, which are relatively
quick, and zone transfers, which are relatively slow. BIND's dbversion
machinery allows updates to proceed while there are long-running zone
transfers. RCU supports this without further machinery, but a
reader-writer lock needs some help so that long-running readers can
avoid blocking writers.

To avoid blocking updates, long-running readers can take a snapshot of a
qp-trie, which only requires copying the allocator's chunk array. After
a writer commits, it does not release memory if there are any
snapshots. Instead, chunks that are no longer needed by the latest
version of the trie are stashed on a list to be released later,
analogous to RCU waiting for a grace period.

The locking occurs only in the functions under _"read-write
transactions"_ and _"read-only transactions"_ in `lib/dns/qp.c`.


immutability and copy-on-write
------------------------------

A qp-trie has a `generation` counter which is incremented by each
write transaction. We keep track of which generation each chunk was
created in; only chunks created in the current generation are
mutable, because older chunks may be in use by concurrent readers.

This logic is implemented by `chunk_alloc()` and `chunk_mutable()`
in `lib/dns/qp.c`.

The `make_twigs_mutable()` function ensures that a node is mutable,
copying it if necessary.

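
The generation rule can be sketched in a few lines. The structure and
function names below (`struct trie_meta`, `begin_write()`,
`note_new_chunk()`, `chunk_is_mutable()`) and the fixed-size array are
assumptions for this example, not the definitions used by
`chunk_alloc()` and `chunk_mutable()`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum { MAX_CHUNKS = 8 }; /* arbitrary small limit for the sketch */

struct trie_meta {
    uint32_t generation; /* bumped by each write transaction */
    uint32_t chunk_generation[MAX_CHUNKS]; /* when each chunk was made */
};

static void
begin_write(struct trie_meta *t) {
    t->generation++;
}

/* Record that `chunk` was allocated by the current transaction. */
static void
note_new_chunk(struct trie_meta *t, uint32_t chunk) {
    assert(chunk < MAX_CHUNKS);
    t->chunk_generation[chunk] = t->generation;
}

/* Older chunks may be visible to readers, so they are read-only. */
static bool
chunk_is_mutable(const struct trie_meta *t, uint32_t chunk) {
    assert(chunk < MAX_CHUNKS);
    return t->chunk_generation[chunk] == t->generation;
}
```

A writer that wants to change twigs in a chunk for which this check
fails copies them into a chunk from the current generation first.
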
The chunk arrays are a mixture of mutable and immutable. Pointers to
immutable chunks are immutable; new chunks can be assigned to unused
entries; and entries are cleared when it is safe to reclaim the chunks
they refer to. If the chunk arrays need to be expanded, the existing
arrays are retained for use by readers, and the writer uses the
expanded arrays (see `alloc_slow()`). The old arrays are cleaned up
after the writer commits.


update transactions
|
||||
-------------------
|
||||
|
||||
A typical heavy-weight `update` transaction comprises:
|
||||
|
||||
* make a copy of the chunk arrays in case we need to roll back
|
||||
|
||||
* get a freshly allocated chunk where new nodes or copied nodes
|
||||
can be written
|
||||
|
||||
* make any changes that are required; nodes in old chunks are
|
||||
copied to the new space first; new nodes are modified in place
|
||||
to avoid creating unnecessary garbage
|
||||
|
||||
* when the updates are finished, and before committing, run the
|
||||
garbage collector to clear out chunks that were fragmented by the
|
||||
update
|
||||
|
||||
* shrink the allocation chunk to eliminate unused space
|
||||
|
||||
* commit the update by flipping the root pointer of the trie; this
|
||||
is the only point that needs a multithreading interlock
|
||||
|
||||
* free any chunks that were emptied by the garbage collector
|
||||
|
||||
A lightweight `write` transaction is similar, except that:
|
||||
|
||||
* rollback is not supported
|
||||
|
||||
* any existing allocation chunk is reused if possible
|
||||
|
||||
* the gabage collector is not run before committing
|
||||
|
||||
* the allocation chunk is not shrunk
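
The difference between the two styles can be sketched with a toy
transaction object. Everything here is a hypothetical simplification:
integers stand in for chunk array entries and root pointers, and garbage
collection and chunk shrinking are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TOY_SLOTS 4

struct toy_multi {
    int chunks[TOY_SLOTS]; /* stand-in for the chunk array */
    int saved[TOY_SLOTS];  /* rollback copy, made by `update` only */
    bool can_rollback;
    int committed_root;    /* the version readers see */
    int working_root;      /* the version the writer is building */
};

/* Heavyweight: keep a copy of the chunk array for rollback. */
static void
toy_txn_update(struct toy_multi *m) {
    memcpy(m->saved, m->chunks, sizeof(m->saved));
    m->can_rollback = true;
    m->working_root = m->committed_root;
}

/* Lightweight: no copy, so no rollback. */
static void
toy_txn_write(struct toy_multi *m) {
    m->can_rollback = false;
    m->working_root = m->committed_root;
}

/* Flipping the root is the only step that would need an interlock. */
static void
toy_txn_commit(struct toy_multi *m) {
    m->committed_root = m->working_root;
    m->can_rollback = false;
}

/* Discard the writer's changes; readers never saw them. */
static void
toy_txn_rollback(struct toy_multi *m) {
    assert(m->can_rollback);
    memcpy(m->chunks, m->saved, sizeof(m->chunks));
    m->working_root = m->committed_root;
    m->can_rollback = false;
}
```

Readers only ever look at `committed_root`, so nothing the writer does
before `toy_txn_commit()` is visible to them.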


testing strategies
------------------

The main qp-trie test is in `tests/dns/qpmulti_test.c`. This uses
randomized testing of the transactional API, with a lot of consistency
checking to detect bugs.

There are also a couple of fuzzers, which aim to benefit from
coverage-guided exploration of the test space and test minimization.
In `fuzz/dns_qp.c` we treat the fuzzer input as a bytecode to exercise
the single-threaded API, and `fuzz/dns_qpkey_name.c` checks conversion
from DNS names to lookup keys.
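
To illustrate the "input as bytecode" idea in general terms, here is a
self-contained sketch that drives a toy set rather than a qp-trie. The
opcode encoding and all `toy_` names are assumptions made for this
example; they do not describe what `fuzz/dns_qp.c` actually does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical structure under test: a set of small keys plus a count. */
struct toy_set {
    bool present[256];
    size_t count;
};

static bool
toy_insert(struct toy_set *s, uint8_t key) {
    if (s->present[key]) {
        return false;
    }
    s->present[key] = true;
    s->count++;
    return true;
}

static bool
toy_delete(struct toy_set *s, uint8_t key) {
    if (!s->present[key]) {
        return false;
    }
    s->present[key] = false;
    s->count--;
    return true;
}

/* Consistency check: the cached count must match the contents. */
static void
toy_check(const struct toy_set *s) {
    size_t n = 0;
    for (size_t i = 0; i < 256; i++) {
        n += s->present[i] ? 1 : 0;
    }
    assert(n == s->count);
}

/*
 * Interpret the input as (opcode, operand) byte pairs: an even opcode
 * inserts the operand, an odd opcode deletes it. A trailing odd byte
 * is ignored. Returns the final number of keys.
 */
static size_t
toy_run_bytecode(const uint8_t *data, size_t size) {
    struct toy_set s = { .count = 0 };
    for (size_t i = 0; i + 1 < size; i += 2) {
        if (data[i] % 2 == 0) {
            (void)toy_insert(&s, data[i + 1]);
        } else {
            (void)toy_delete(&s, data[i + 1]);
        }
        toy_check(&s);
    }
    return s.count;
}
```

Because every input decodes to some valid sequence of operations, a
coverage-guided fuzzer can explore operation orderings without wasting
effort on inputs that are rejected by a parser.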

In `tests/bench` there are a few benchmarks. `load-names` does a very
basic comparison between BIND's hash table, red-black tree, and
qp-trie. `qpmulti` checks multicore performance of the transactional
API (similar to `qpmulti_test` but without the consistency checking).
And `qp-dump` is a utility for printing out the contents of a qp-trie.

John Regehr has some nice essays about testing data structures:

* Levels of fuzzing: https://blog.regehr.org/archives/1039

  (how much semantic knowledge does your fuzzer have?)

* Testing with small capacities: https://blog.regehr.org/archives/1138

  (I need to be able to change the chunk size)

* Write fuzzable code: https://blog.regehr.org/archives/1687

* Oracles for random testing: https://blog.regehr.org/archives/856


warning: generational collection
--------------------------------

The "generational hypothesis" is that most allocations have a short
lifetime, so it is profitable for a garbage collector to split its
heap into a number of generations. The youngest generation is where
allocations happen; it typically uses a bump allocator, and when the
allocation pointer reaches its limit, the youngest generation's
contents are copied to the second generation. The hypothesis is that
only a small fraction of the youngest generation will still be live
when the GC runs, so this copy will not take much time or space.

For a qp-trie the truth of this hypothesis depends on the order in
which keys are added or removed. It may be true if there is good
locality, for example, adding keys in lexicographic order, but not in
general.

When a qp-trie is mutated, only one node needs to be altered, near the
leaf that is added or removed. Nodes near the root of the trie tend to
be more stable and long-lived. However, during a copy-on-write
transaction, the path from the root to an altered leaf must be copied,
so nodes near the root are no longer stable and long-lived. They may
become stable in a long transaction, but that isn't guaranteed.
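
The effect of path copying can be seen in a much simpler structure. The
sketch below uses a plain binary search tree, not a qp-trie, and the
`toy_` names are hypothetical; it never frees memory, which is acceptable
only because it is an illustration.

```c
#include <assert.h>
#include <stdlib.h>

struct toy_node {
    int key;
    struct toy_node *left, *right;
};

static struct toy_node *
toy_new(int key, struct toy_node *left, struct toy_node *right) {
    struct toy_node *n = malloc(sizeof(*n));
    assert(n != NULL);
    n->key = key;
    n->left = left;
    n->right = right;
    return n;
}

/*
 * Copy-on-write insert: returns a new root and leaves the old tree
 * untouched. Every node on the path from the root to the insertion
 * point is copied; subtrees off that path are shared.
 */
static struct toy_node *
toy_cow_insert(const struct toy_node *n, int key) {
    if (n == NULL) {
        return toy_new(key, NULL, NULL);
    }
    if (key < n->key) {
        return toy_new(n->key, toy_cow_insert(n->left, key), n->right);
    }
    if (key > n->key) {
        return toy_new(n->key, n->left, toy_cow_insert(n->right, key));
    }
    /* key already present: return a copy so the result is still new */
    return toy_new(n->key, n->left, n->right);
}
```

Even though only one leaf changes, each insert allocates a fresh root, so
the root is the node that is copied most often and is never old for long.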

So the idea of generational garbage collection seems to be unhelpful
for a qp-trie.
|
@@ -99,6 +99,7 @@ libdns_la_HEADERS = \
|
||||
include/dns/order.h \
|
||||
include/dns/peer.h \
|
||||
include/dns/private.h \
|
||||
include/dns/qp.h \
|
||||
include/dns/rbt.h \
|
||||
include/dns/rcode.h \
|
||||
include/dns/rdata.h \
|
||||
@@ -157,6 +158,7 @@ libdns_la_SOURCES = \
|
||||
cache.c \
|
||||
callbacks.c \
|
||||
catz.c \
|
||||
client.c \
|
||||
clientinfo.c \
|
||||
compress.c \
|
||||
db.c \
|
||||
@@ -206,6 +208,8 @@ libdns_la_SOURCES = \
|
||||
order.c \
|
||||
peer.c \
|
||||
private.c \
|
||||
qp.c \
|
||||
qp_p.h \
|
||||
rbt.c \
|
||||
rbtdb.h \
|
||||
rbtdb.c \
|
||||
@@ -233,18 +237,17 @@ libdns_la_SOURCES = \
|
||||
transport.c \
|
||||
tkey.c \
|
||||
tsig.c \
|
||||
tsig_p.h \
|
||||
ttl.c \
|
||||
update.c \
|
||||
validator.c \
|
||||
view.c \
|
||||
xfrin.c \
|
||||
zone.c \
|
||||
zone_p.h \
|
||||
zoneverify.c \
|
||||
zonekey.c \
|
||||
zt.c \
|
||||
client.c \
|
||||
tsig_p.h \
|
||||
zone_p.h
|
||||
zt.c
|
||||
|
||||
if HAVE_GSSAPI
|
||||
libdns_la_SOURCES += \
|
||||
|
@@ -80,6 +80,7 @@ extern isc_logmodule_t dns_modules[];
|
||||
#define DNS_LOGMODULE_DYNDB (&dns_modules[30])
|
||||
#define DNS_LOGMODULE_DNSTAP (&dns_modules[31])
|
||||
#define DNS_LOGMODULE_SSU (&dns_modules[32])
|
||||
#define DNS_LOGMODULE_QP (&dns_modules[33])
|
||||
|
||||
ISC_LANG_BEGINDECLS
|
||||
|
||||
|
574
lib/dns/include/dns/qp.h
Normal file
@@ -0,0 +1,574 @@
|
||||
/*
|
||||
* Copyright (C) Internet Systems Consortium, Inc. ("ISC")
|
||||
*
|
||||
* SPDX-License-Identifier: MPL-2.0
|
||||
*
|
||||
* This Source Code Form is subject to the terms of the Mozilla Public
|
||||
* License, v. 2.0. If a copy of the MPL was not distributed with this
|
||||
* file, you can obtain one at https://mozilla.org/MPL/2.0/.
|
||||
*
|
||||
* See the COPYRIGHT file distributed with this work for additional
|
||||
* information regarding copyright ownership.
|
||||
*/
|
||||
|
||||
#pragma once
|
||||
|
||||
/*
|
||||
* A qp-trie is a kind of key -> value map, supporting lookups that are
|
||||
* aware of the lexicographic order of keys.
|
||||
*
|
||||
* Keys are `dns_qpkey_t`, which is a string-like thing, usually created
|
||||
* from a DNS name. You can use both relative and absolute DNS names as
|
||||
* keys.
|
||||
*
|
||||
* Leaf values are a pair of a `void *` pointer and a `uint32_t`
|
||||
* (because that is what fits inside an internal qp-trie leaf node).
|
||||
*
|
||||
* The trie does not store keys; instead keys are derived from leaf values
|
||||
* by calling a method provided by the user.
|
||||
*
|
||||
* There are a few flavours of qp-trie.
|
||||
*
|
||||
* The basic `dns_qp_t` supports single-threaded read/write access.
|
||||
*
|
||||
* A `dns_qpmulti_t` is a wrapper that supports multithreaded access.
|
||||
* There can be many concurrent readers and a single writer. Writes are
|
||||
* transactional, and support multi-version concurrency.
|
||||
*
|
||||
* The concurrency strategy uses copy-on-write. When making changes during
|
||||
* a transaction, the caller must not modify leaf values in place, but
|
||||
* instead delete the old leaf from the trie and insert a replacement. Leaf
|
||||
* values have reference counts, which will indicate when the old leaf
|
||||
* value can be freed after it is no longer needed by readers using an old
|
||||
* version of the trie.
|
||||
*
|
||||
* For fast concurrent reads, call `dns_qpmulti_query()` to get a
|
||||
* `dns_qpread_t`. Readers can access a single version of the trie between
|
||||
* write commits. Most write activity is not blocked by readers, but reads
|
||||
* must finish before a write can commit (a read-write lock blocks
|
||||
* commits).
|
||||
*
|
||||
* For long-running reads that need a stable view of the trie, while still
|
||||
* allowing commits to proceed, call `dns_qpmulti_snapshot()` to get a
|
||||
* `dns_qpsnap_t`. It briefly gets the write mutex while creating the
|
||||
* snapshot, which requires allocating a copy of some of the trie's
|
||||
* metadata. A snapshot is for relatively heavy long-running read-only
|
||||
* operations such as zone transfers.
|
||||
*
|
||||
* While snapshots exist, a qp-trie cannot reclaim memory: it does not
|
||||
* retain detailed information about which memory is used by which
|
||||
* snapshots, so it pessimistically retains all memory that might be
|
||||
* used by old versions of the trie.
|
||||
*
|
||||
* You can start one read-write transaction at a time using
|
||||
* `dns_qpmulti_write()` or `dns_qpmulti_update()`. Either way, you
|
||||
* get a `dns_qp_t` that can be modified like a single-threaded trie,
|
||||
* without affecting other read-only query or snapshot users of the
|
||||
* `dns_qpmulti_t`. Committing a transaction only blocks readers
|
||||
* briefly when flipping the active readonly `dns_qp_t` pointer.
|
||||
*
|
||||
* "Update" transactions are heavyweight. They allocate working memory to
|
||||
* hold modifications to the trie, and compact the trie before committing.
|
||||
* For extra space savings, a partially-used allocation chunk is shrunk to
|
||||
* the smallest size possible. Unlike "write" transactions, an "update"
|
||||
* transaction can be rolled back instead of committed. (Update
|
||||
* transactions are intended for things like authoritative zones, where it
|
||||
* is important to keep the per-trie memory overhead low because there can
|
||||
* be a very large number of them.)
|
||||
*
|
||||
* "Write" transactions are more lightweight: they skip the allocation and
|
||||
* compaction at the start and end of the transaction. (Write transactions
|
||||
* are intended for frequent small changes, as in the DNS cache.)
|
||||
*/
|
||||
|
||||
/***********************************************************************
|
||||
*
|
||||
* types
|
||||
*/
|
||||
|
||||
#include <isc/attributes.h>
|
||||
|
||||
#include <dns/types.h>
|
||||
|
||||
/*%
|
||||
* A `dns_qp_t` supports single-threaded read/write access.
|
||||
*/
|
||||
typedef struct dns_qp dns_qp_t;
|
||||
|
||||
/*%
|
||||
* A `dns_qpmulti_t` supports multi-version concurrent reads and transactional
|
||||
* modification.
|
||||
*/
|
||||
typedef struct dns_qpmulti dns_qpmulti_t;
|
||||
|
||||
/*%
|
||||
* A `dns_qpread_t` is a lightweight read-only handle on a `dns_qpmulti_t`.
|
||||
*/
|
||||
typedef struct dns_qpread dns_qpread_t;
|
||||
|
||||
/*%
|
||||
* A `dns_qpsnap_t` is a heavier read-only snapshot of a `dns_qpmulti_t`.
|
||||
*/
|
||||
typedef struct dns_qpsnap dns_qpsnap_t;
|
||||
|
||||
/*
|
||||
* The read-only qp-trie functions can work on either of the read-only
|
||||
* qp-trie types or the general-purpose read-write `dns_qp_t`. They
|
||||
* rely on the fact that all the `dns_qpreadable_t` structures start
|
||||
* with a `dns_qpread_t`.
|
||||
*/
|
||||
typedef union dns_qpreadable {
|
||||
dns_qpread_t *qpr;
|
||||
dns_qpsnap_t *qps;
|
||||
dns_qp_t *qpt;
|
||||
} dns_qpreadable_t __attribute__((__transparent_union__));
|
||||
|
||||
#define dns_qpreadable_cast(qp) ((qp).qpr)
|
||||
|
||||
/*%
|
||||
* A trie lookup key is a small array, allocated on the stack during trie
|
||||
* searches. Keys are usually created on demand from DNS names using
|
||||
* `dns_qpkey_fromname()`, but in principle you can define your own
|
||||
* functions to convert other types to trie lookup keys.
|
||||
*
|
||||
* A domain name can be up to 255 bytes. When converted to a key, each
|
||||
* character in the name corresponds to one byte in the key if it is a
|
||||
* common hostname character; otherwise unusual characters are escaped,
|
||||
* using two bytes in the key. So we allow keys to be up to 512 bytes.
|
||||
* (The actual max is (255 - 5) * 2 + 6 == 506)
|
||||
*
|
||||
* Every byte of a key must be greater than 0 and less than 48. Elements
|
||||
* after the end of the key are treated as having the value 1.
|
||||
*/
|
||||
typedef uint8_t dns_qpkey_t[512];
|
||||
|
||||
/*%
|
||||
* These leaf methods allow the qp-trie code to call back to the code
|
||||
* responsible for the leaf values that are stored in the trie. The
|
||||
* methods are provided for a whole trie when the trie is created.
|
||||
*
|
||||
* The qp-trie is also given a context pointer that is passed to the
|
||||
* methods, so the methods know about the trie's context as well as a
|
||||
* particular leaf value.
|
||||
*
|
||||
* The `attach` and `detach` methods adjust reference counts on value
|
||||
* objects. They support copy-on-write and safe memory reclamation
|
||||
* needed for multi-version concurrency.
|
||||
*
|
||||
* Note: When a value object reference count is greater than one, the
|
||||
* object is in use by concurrent readers so it must not be modified. A
|
||||
* refcount equal to one does not indicate whether or not the object is
|
||||
* mutable: its refcount can be 1 while it is only in use by readers (and
|
||||
* must be left unchanged), or newly created by a writer (and therefore
|
||||
* mutable).
|
||||
*
|
||||
* The `makekey` method fills in a `dns_qpkey_t` corresponding to a
|
||||
* value object stored in the qp-trie. It returns the length of the
|
||||
* key. This method will typically call dns_qpkey_fromname() with a
|
||||
* name stored in the value object.
|
||||
*
|
||||
* For logging and tracing, the `triename` method copies a human-
|
||||
* readable identifier into `buf` which has max length `size`.
|
||||
*/
|
||||
typedef struct dns_qpmethods {
|
||||
void (*attach)(void *ctx, void *pval, uint32_t ival);
|
||||
void (*detach)(void *ctx, void *pval, uint32_t ival);
|
||||
size_t (*makekey)(dns_qpkey_t key, void *ctx, void *pval,
|
||||
uint32_t ival);
|
||||
void (*triename)(void *ctx, char *buf, size_t size);
|
||||
} dns_qpmethods_t;
|
||||
|
||||
/*%
|
||||
* Buffers for use by the `triename()` method need to be large enough
|
||||
* to hold a zone name and a few descriptive words.
|
||||
*/
|
||||
#define DNS_QP_TRIENAME_MAX 300
|
||||
|
||||
/*%
|
||||
* A container for the counters returned by `dns_qp_memusage()`
|
||||
*/
|
||||
typedef struct dns_qp_memusage {
|
||||
void *ctx; /*%< qp-trie method context */
|
||||
size_t leaves; /*%< values in the trie */
|
||||
size_t live; /*%< nodes in use */
|
||||
size_t used; /*%< allocated nodes */
|
||||
size_t hold; /*%< nodes retained for readers */
|
||||
size_t free; /*%< nodes to be reclaimed */
|
||||
size_t node_size; /*%< in bytes */
|
||||
size_t chunk_size; /*%< nodes per chunk */
|
||||
size_t chunk_count; /*%< allocated chunks */
|
||||
size_t bytes; /*%< total memory in chunks and metadata */
|
||||
} dns_qp_memusage_t;
|
||||
|
||||
/***********************************************************************
|
||||
*
|
||||
* functions - create, destroy, enquire
|
||||
*/
|
||||
|
||||
void
|
||||
dns_qp_create(isc_mem_t *mctx, const dns_qpmethods_t *methods, void *ctx,
|
||||
dns_qp_t **qptp);
|
||||
/*%<
|
||||
* Create a single-threaded qp-trie.
|
||||
*
|
||||
* Requires:
|
||||
* \li `mctx` is a pointer to a valid memory context.
|
||||
* \li all the methods are non-NULL
|
||||
* \li `qptp != NULL && *qptp == NULL`
|
||||
*
|
||||
* Ensures:
|
||||
* \li `*qptp` is a pointer to a valid single-threaded qp-trie
|
||||
*/
|
||||
|
||||
void
|
||||
dns_qp_destroy(dns_qp_t **qptp);
|
||||
/*%<
|
||||
* Destroy a single-threaded qp-trie.
|
||||
*
|
||||
* Requires:
|
||||
* \li `qptp != NULL`
|
||||
* \li `*qptp` is a pointer to a valid single-threaded qp-trie
|
||||
*
|
||||
* Ensures:
|
||||
* \li all memory allocated by the qp-trie has been released
|
||||
* \li `*qptp` is NULL
|
||||
*/
|
||||
|
||||
void
|
||||
dns_qpmulti_create(isc_mem_t *mctx, const dns_qpmethods_t *methods, void *ctx,
|
||||
dns_qpmulti_t **qpmp);
|
||||
/*%<
|
||||
* Create a multi-threaded qp-trie.
|
||||
*
|
||||
* Requires:
|
||||
* \li `mctx` is a pointer to a valid memory context.
|
||||
* \li all the methods are non-NULL
|
||||
* \li `qpmp != NULL && *qpmp == NULL`
|
||||
*
|
||||
* Ensures:
|
||||
* \li `*qpmp` is a pointer to a valid multi-threaded qp-trie
|
||||
*/
|
||||
|
||||
void
|
||||
dns_qpmulti_destroy(dns_qpmulti_t **qpmp);
|
||||
/*%<
|
||||
* Destroy a multi-threaded qp-trie.
|
||||
*
|
||||
* Requires:
|
||||
* \li `qpmp != NULL`
|
||||
* \li `*qpmp` is a pointer to a valid multi-threaded qp-trie
|
||||
* \li there are no write or update transactions in progress
|
||||
* \li no snapshots exist
|
||||
*
|
||||
* Ensures:
|
||||
* \li all memory allocated by the qp-trie has been released
|
||||
* \li `*qpmp` is NULL
|
||||
*/
|
||||
|
||||
void
|
||||
dns_qp_compact(dns_qp_t *qp);
|
||||
/*%<
|
||||
* Defragment the entire qp-trie and release unused memory.
|
||||
*
|
||||
* When modifications make a trie too fragmented, it is automatically
|
||||
* compacted. Automatic compaction avoids compacting chunks that are not
|
||||
* fragmented to save time, but this function compacts the entire trie to
|
||||
* defragment it as much as possible.
|
||||
*
|
||||
* This function can be used with a single-threaded qp-trie and during a
|
||||
* transaction on a multi-threaded trie.
|
||||
*
|
||||
* Requires:
|
||||
* \li `qp` is a pointer to a valid qp-trie
|
||||
*/
|
||||
|
||||
void
|
||||
dns_qp_gctime(uint64_t *compact_us, uint64_t *recover_us,
|
||||
uint64_t *rollback_us);
|
||||
/*%<
|
||||
* Get the total times spent on garbage collection in microseconds.
|
||||
*
|
||||
* These counters are global, covering every qp-trie in the program.
|
||||
*
|
||||
* XXXFANF This is a placeholder until we can record times in histograms.
|
||||
*/
|
||||
|
||||
dns_qp_memusage_t
|
||||
dns_qp_memusage(dns_qp_t *qp);
|
||||
/*%<
|
||||
* Get the memory counters from a qp-trie
|
||||
*
|
||||
* Requires:
|
||||
* \li `qp` is a pointer to a valid qp-trie
|
||||
*
|
||||
* Returns:
|
||||
* \li a `dns_qp_memusage_t` structure described above
|
||||
*/
|
||||
|
||||
/***********************************************************************
|
||||
*
|
||||
* functions - search, modify
|
||||
*/
|
||||
|
||||
/*
|
||||
* XXXFANF todo, based on what we discover BIND needs
|
||||
*
|
||||
* fancy searches: longest match, lexicographic predecessor,
|
||||
* etc.
|
||||
*
|
||||
* do we need specific lookup functions to find out if the
|
||||
* returned value is readonly or mutable?
|
||||
*
|
||||
* richer modification such as dns_qp_replace{key,name}
|
||||
*
|
||||
* iteration - probably best to put an explicit stack in the iterator,
|
||||
* cf. rbtnodechain
|
||||
*/
|
||||
|
||||
size_t
|
||||
dns_qpkey_fromname(dns_qpkey_t key, const dns_name_t *name);
|
||||
/*%<
|
||||
* Convert a DNS name into a trie lookup key.
|
||||
*
|
||||
* Requires:
|
||||
* \li `name` is a pointer to a valid `dns_name_t`
|
||||
*
|
||||
* Returns:
|
||||
* \li the length of the key
|
||||
*/
|
||||
|
||||
isc_result_t
|
||||
dns_qp_getkey(dns_qpreadable_t qpr, const dns_qpkey_t searchk, size_t searchl,
|
||||
void **pval_r, uint32_t *ival_r);
|
||||
/*%<
|
||||
* Find a leaf in a qp-trie that matches the given key
|
||||
*
|
||||
* The leaf values are assigned to `*pval_r` and `*ival_r`
|
||||
*
|
||||
* Requires:
|
||||
* \li `qpr` is a pointer to a readable qp-trie
|
||||
* \li `pval_r != NULL`
|
||||
* \li `ival_r != NULL`
|
||||
*
|
||||
* Returns:
|
||||
* \li ISC_R_NOTFOUND if the trie has no leaf with a matching key
|
||||
* \li ISC_R_SUCCESS if the leaf was found
|
||||
*/
|
||||
|
||||
isc_result_t
|
||||
dns_qp_getname(dns_qpreadable_t qpr, const dns_name_t *name, void **pval_r,
|
||||
uint32_t *ival_r);
|
||||
/*%<
|
||||
* Find a leaf in a qp-trie that matches the given DNS name
|
||||
*
|
||||
* The leaf values are assigned to `*pval_r` and `*ival_r`
|
||||
*
|
||||
* Requires:
|
||||
* \li `qpr` is a pointer to a readable qp-trie
|
||||
* \li `name` is a pointer to a valid `dns_name_t`
|
||||
* \li `pval_r != NULL`
|
||||
* \li `ival_r != NULL`
|
||||
*
|
||||
* Returns:
|
||||
* \li ISC_R_NOTFOUND if the trie has no leaf with a matching key
|
||||
* \li ISC_R_SUCCESS if the leaf was found
|
||||
*/
|
||||
|
||||
isc_result_t
|
||||
dns_qp_insert(dns_qp_t *qp, void *pval, uint32_t ival);
|
||||
/*%<
|
||||
* Insert a leaf into a qp-trie
|
||||
*
|
||||
* Requires:
|
||||
* \li `qp` is a pointer to a valid qp-trie
|
||||
* \li `pval != NULL`
|
||||
* \li `alignof(pval) > 1`
|
||||
*
|
||||
* Returns:
|
||||
* \li ISC_R_EXISTS if the trie already has a leaf with the same key
|
||||
* \li ISC_R_SUCCESS if the leaf was added to the trie
|
||||
*/
|
||||
|
||||
isc_result_t
|
||||
dns_qp_deletekey(dns_qp_t *qp, const dns_qpkey_t key, size_t len);
|
||||
/*%<
|
||||
* Delete a leaf from a qp-trie that matches the given key
|
||||
*
|
||||
* Requires:
|
||||
* \li `qp` is a pointer to a valid qp-trie
|
||||
*
|
||||
* Returns:
|
||||
* \li ISC_R_NOTFOUND if the trie has no leaf with a matching key
|
||||
* \li ISC_R_SUCCESS if the leaf was deleted from the trie
|
||||
*/
|
||||
|
||||
isc_result_t
|
||||
dns_qp_deletename(dns_qp_t *qp, const dns_name_t *name);
|
||||
/*%<
|
||||
* Delete a leaf from a qp-trie that matches the given DNS name
|
||||
*
|
||||
* Requires:
|
||||
* \li `qp` is a pointer to a valid qp-trie
|
||||
* \li `name` is a pointer to a valid `dns_name_t`
|
||||
*
|
||||
* Returns:
|
||||
* \li ISC_R_NOTFOUND if the trie has no leaf with a matching name
|
||||
* \li ISC_R_SUCCESS if the leaf was deleted from the trie
|
||||
*/
|
||||
|
||||
/***********************************************************************
|
||||
*
|
||||
* functions - transactions
|
||||
*/
|
||||
|
||||
void
|
||||
dns_qpmulti_query(dns_qpmulti_t *multi, dns_qpread_t **qprp);
|
||||
/*%<
|
||||
* Start a lightweight (brief) read-only transaction
|
||||
*
|
||||
* This takes a read lock on `multi`'s rwlock that prevents
|
||||
* transactions from committing.
|
||||
*
|
||||
* Requires:
|
||||
* \li `multi` is a pointer to a valid multi-threaded qp-trie
|
||||
* \li `qprp != NULL`
|
||||
* \li `*qprp == NULL`
|
||||
*
|
||||
* Returns:
|
||||
* \li `*qprp` is a pointer to a valid read-only qp-trie handle
|
||||
*/
|
||||
|
||||
void
|
||||
dns_qpread_destroy(dns_qpmulti_t *multi, dns_qpread_t **qprp);
|
||||
/*%<
|
||||
* End a lightweight read transaction, i.e. release read lock
|
||||
*
|
||||
* Requires:
|
||||
* \li `multi` is a pointer to a valid multi-threaded qp-trie
|
||||
* \li `qprp != NULL`
|
||||
* \li `*qprp` is a read-only qp-trie handle obtained from `multi`
|
||||
*
|
||||
* Returns:
|
||||
* \li `*qprp == NULL`
|
||||
*/
|
||||
|
||||
void
|
||||
dns_qpmulti_snapshot(dns_qpmulti_t *multi, dns_qpsnap_t **qpsp);
|
||||
/*%<
|
||||
* Start a heavyweight (long) read-only transaction
|
||||
*
|
||||
* This function briefly takes and releases the modification mutex
|
||||
* while allocating a copy of the trie's metadata. While the snapshot
|
||||
* exists it does not interfere with other read-only or read-write
|
||||
* transactions on the trie, except that memory cannot be reclaimed.
|
||||
*
|
||||
* Requires:
|
||||
* \li `multi` is a pointer to a valid multi-threaded qp-trie
|
||||
* \li `qpsp != NULL`
|
||||
* \li `*qpsp == NULL`
|
||||
*
|
||||
* Returns:
|
||||
* \li `*qpsp` is a pointer to a snapshot obtained from `multi`
|
||||
*/
|
||||
|
||||
void
|
||||
dns_qpsnap_destroy(dns_qpmulti_t *multi, dns_qpsnap_t **qpsp);
|
||||
/*%<
|
||||
* End a heavyweight read transaction
|
||||
*
|
||||
* If this is the last remaining snapshot belonging to `multi` then
|
||||
* this function takes the modification mutex in order to free() any
|
||||
* memory that is no longer in use.
|
||||
*
|
||||
* Requires:
|
||||
* \li `multi` is a pointer to a valid multi-threaded qp-trie
|
||||
* \li `qpsp != NULL`
|
||||
* \li `*qpsp` is a pointer to a snapshot obtained from `multi`
|
||||
*
|
||||
* Returns:
|
||||
* \li `*qpsp == NULL`
|
||||
*/
|
||||
|
||||
void
|
||||
dns_qpmulti_update(dns_qpmulti_t *multi, dns_qp_t **qptp);
|
||||
/*%<
|
||||
* Start a heavyweight write transaction
|
||||
*
|
||||
* This style of transaction allocates a copy of the trie's metadata to
|
||||
* support rollback, and it aims to minimize the memory usage of the
|
||||
* trie between transactions. The trie is compacted when the transaction
|
||||
* commits, and any partly-used chunk is shrunk to fit.
|
||||
*
|
||||
* During the transaction, the modification mutex is held.
|
||||
*
|
||||
* Requires:
|
||||
* \li `multi` is a pointer to a valid multi-threaded qp-trie
|
||||
* \li `qptp != NULL`
|
||||
* \li `*qptp == NULL`
|
||||
*
|
||||
* Returns:
|
||||
* \li `*qptp` is a pointer to the modifiable qp-trie inside `multi`
|
||||
*/
|
||||
|
||||
void
|
||||
dns_qpmulti_write(dns_qpmulti_t *multi, dns_qp_t **qptp);
|
||||
/*%<
|
||||
* Start a lightweight write transaction
|
||||
*
|
||||
* This style of transaction does not need extra allocations in addition
|
||||
* to the ones required by insert and delete operations. It is intended
|
||||
* for a large trie that gets frequent small writes, such as a DNS
|
||||
* cache.
|
||||
*
|
||||
* During the transaction, the modification mutex is held.
|
||||
*
|
||||
* Requires:
|
||||
* \li `multi` is a pointer to a valid multi-threaded qp-trie
|
||||
* \li `qptp != NULL`
|
||||
* \li `*qptp == NULL`
|
||||
*
|
||||
* Returns:
|
||||
* \li `*qptp` is a pointer to the modifiable qp-trie inside `multi`
|
||||
*/
|
||||
|
||||
void
|
||||
dns_qpmulti_commit(dns_qpmulti_t *multi, dns_qp_t **qptp);
|
||||
/*%<
|
||||
* Complete a modification transaction
|
||||
*
|
||||
* The commit itself only requires flipping the read pointer inside
|
||||
* `multi` from the old version of the trie to the new version. This
|
||||
* function takes a write lock on `multi`'s rwlock just long enough to
|
||||
* flip the pointer. This briefly blocks `query` readers.
|
||||
*
|
||||
* This function releases the modification mutex after the post-commit
|
||||
* memory reclamation is completed.
|
||||
*
|
||||
* Requires:
|
||||
* \li `multi` is a pointer to a valid multi-threaded qp-trie
|
||||
* \li `qptp != NULL`
|
||||
* \li `*qptp` is a pointer to the modifiable qp-trie inside `multi`
|
||||
*
|
||||
* Returns:
|
||||
* \li `*qptp == NULL`
|
||||
*/
|
||||
|
||||
void
|
||||
dns_qpmulti_rollback(dns_qpmulti_t *multi, dns_qp_t **qptp);
|
||||
/*%<
|
||||
* Abandon an update transaction
|
||||
*
|
||||
* This function reclaims the memory allocated during the transaction
|
||||
* and releases the modification mutex.
|
||||
*
|
||||
* Requires:
|
||||
* \li `multi` is a pointer to a valid multi-threaded qp-trie
|
||||
* \li `qptp != NULL`
|
||||
* \li `*qptp` is a pointer to the modifiable qp-trie inside `multi`
|
||||
* \li `*qptp` was obtained from `dns_qpmulti_update()`
|
||||
*
|
||||
* Returns:
|
||||
* \li `*qptp == NULL`
|
||||
*/
|
||||
|
||||
/**********************************************************************/
|
@@ -36,23 +36,18 @@ isc_logcategory_t dns_categories[] = {
|
||||
* \#define to <dns/log.h>.
|
||||
*/
|
||||
isc_logmodule_t dns_modules[] = {
|
||||
{ "dns/db", 0 }, { "dns/rbtdb", 0 },
|
||||
{ "dns/rbt", 0 }, { "dns/rdata", 0 },
|
||||
{ "dns/master", 0 }, { "dns/message", 0 },
|
||||
{ "dns/cache", 0 }, { "dns/config", 0 },
|
||||
{ "dns/resolver", 0 }, { "dns/zone", 0 },
|
||||
{ "dns/journal", 0 }, { "dns/adb", 0 },
|
||||
{ "dns/xfrin", 0 }, { "dns/xfrout", 0 },
|
||||
{ "dns/acl", 0 }, { "dns/validator", 0 },
|
||||
{ "dns/dispatch", 0 }, { "dns/request", 0 },
|
||||
{ "dns/masterdump", 0 }, { "dns/tsig", 0 },
|
||||
{ "dns/tkey", 0 }, { "dns/sdb", 0 },
|
||||
{ "dns/diff", 0 }, { "dns/hints", 0 },
|
||||
{ "dns/unused1", 0 }, { "dns/dlz", 0 },
|
||||
{ "dns/dnssec", 0 }, { "dns/crypto", 0 },
|
||||
{ "dns/packets", 0 }, { "dns/nta", 0 },
|
||||
{ "dns/dyndb", 0 }, { "dns/dnstap", 0 },
|
||||
{ "dns/ssu", 0 }, { NULL, 0 }
|
||||
{ "dns/db", 0 }, { "dns/rbtdb", 0 }, { "dns/rbt", 0 },
|
||||
{ "dns/rdata", 0 }, { "dns/master", 0 }, { "dns/message", 0 },
|
||||
{ "dns/cache", 0 }, { "dns/config", 0 }, { "dns/resolver", 0 },
|
||||
{ "dns/zone", 0 }, { "dns/journal", 0 }, { "dns/adb", 0 },
|
||||
{ "dns/xfrin", 0 }, { "dns/xfrout", 0 }, { "dns/acl", 0 },
|
||||
{ "dns/validator", 0 }, { "dns/dispatch", 0 }, { "dns/request", 0 },
|
||||
{ "dns/masterdump", 0 }, { "dns/tsig", 0 }, { "dns/tkey", 0 },
|
||||
{ "dns/sdb", 0 }, { "dns/diff", 0 }, { "dns/hints", 0 },
|
||||
{ "dns/unused1", 0 }, { "dns/dlz", 0 }, { "dns/dnssec", 0 },
|
||||
{ "dns/crypto", 0 }, { "dns/packets", 0 }, { "dns/nta", 0 },
|
||||
{ "dns/dyndb", 0 }, { "dns/dnstap", 0 }, { "dns/ssu", 0 },
|
||||
{ "dns/qp", 0 }, { NULL, 0 },
|
||||
};
|
||||
|
||||
isc_log_t *dns_lctx = NULL;
|
||||
|
1571
lib/dns/qp.c
Normal file
File diff suppressed because it is too large
703
lib/dns/qp_p.h
Normal file
@@ -0,0 +1,703 @@
|
||||
/*
|
||||
* Copyright (C) Internet Systems Consortium, Inc. ("ISC")
|
||||
*
|
||||
* SPDX-License-Identifier: MPL-2.0
|
||||
*
|
||||
* This Source Code Form is subject to the terms of the Mozilla Public
|
||||
* License, v. 2.0. If a copy of the MPL was not distributed with this
|
||||
* file, you can obtain one at https://mozilla.org/MPL/2.0/.
|
||||
*
|
||||
* See the COPYRIGHT file distributed with this work for additional
|
||||
* information regarding copyright ownership.
|
||||
*/
|
||||
|
||||
/*
|
||||
* For an overview, see doc/design/qp-trie.md
|
||||
*/
|
||||
|
||||
#pragma once
|
||||
|
||||
/***********************************************************************
|
||||
*
|
||||
* interior node basics
|
||||
*/
|
||||
|
||||
/*
|
||||
* A qp-trie node can be a leaf or a branch. It consists of three 32-bit
|
||||
* words into which the components are packed. They are used as a 64-bit
|
||||
* word and a 32-bit word, but they are not declared like that to avoid
|
||||
* unwanted padding, keeping the size down to 12 bytes. They are in native
|
||||
* endian order so getting the 64-bit part should compile down to an
|
||||
* unaligned load.
|
||||
*
|
||||
* In a branch the 64-bit word is described by the enum below. The 32-bit
|
||||
* word is a reference to the packed sparse vector of "twigs", i.e. child
|
||||
* nodes. A branch node has at least 2 and fewer than SHIFT_OFFSET twigs
|
||||
* (see the enum below). The qp-trie update functions ensure that branches
|
||||
* actually branch, i.e. branches cannot have only 1 child.
|
||||
*
|
||||
* The contents of each leaf are set by the trie's user. The 64-bit word
|
||||
* contains a pointer value (which must be word-aligned), and the 32-bit
|
||||
* word is an arbitrary integer value.
|
||||
*/
|
||||
typedef struct qp_node {
|
||||
#if WORDS_BIGENDIAN
|
||||
uint32_t bighi, biglo, small;
|
||||
#else
|
||||
uint32_t biglo, bighi, small;
|
||||
#endif
|
||||
} qp_node_t;

/*
 * A branch node contains a 64-bit word comprising the branch/leaf tag,
 * the bitmap, and an offset into the key. It is called an "index word"
 * because it describes how to access the twigs vector (think "database
 * index"). The following enum sets up the bit positions of these parts.
 *
 * In a leaf, the same 64-bit word contains a pointer. The pointer
 * must be word-aligned so that the branch/leaf tag bit is zero.
 * This requirement is checked by the make_leaf() constructor.
 *
 * The bitmap is just above the tag bit. The `dns_qp_bits_for_byte[]`
 * table is used to fill in a key so that bit tests can work directly
 * against the index word without superfluous masking or shifting; we
 * don't need to mask out the bitmap before testing a bit, but we do
 * need to mask the bitmap before calling popcount.
 *
 * The byte offset into the key is at the top of the word, so that it
 * can be extracted with just a shift, with no masking needed.
 *
 * The names are SHIFT_thing because they are qp_shift_t values. (See
 * below for the various `qp_*` type declarations.)
 *
 * These values are relatively fixed in practice; the symbolic names
 * avoid mystery numbers in the code.
 */
enum {
	SHIFT_BRANCH = 0,  /* branch / leaf tag */
	SHIFT_NOBYTE,	   /* label separator has no byte value */
	SHIFT_BITMAP,	   /* many bits here */
	SHIFT_OFFSET = 48, /* offset of byte in key */
};

/*
 * Value of the node type tag bit.
 *
 * It is defined this way to be explicit about where the value comes
 * from, even though we know it is always the bottom bit.
 */
#define BRANCH_TAG (1ULL << SHIFT_BRANCH)
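
/*
 * A minimal sketch of how an index word could be assembled and taken
 * apart using the bit positions above. All `sk_*` / `SK_*` names are
 * hypothetical; in particular `sk_make_index()` is an illustration, not
 * a function from the qp-trie code. It assumes the offset fits in the 16
 * bits above SHIFT_OFFSET.
 */

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum {
	SK_SHIFT_BRANCH = 0,  /* branch / leaf tag */
	SK_SHIFT_NOBYTE,      /* first bitmap bit (value 1) */
	SK_SHIFT_BITMAP,      /* remaining bitmap bits start here (value 2) */
	SK_SHIFT_OFFSET = 48, /* key offset occupies bits 48..63 */
};

#define SK_BRANCH_TAG  (1ULL << SK_SHIFT_BRANCH)
#define SK_BITMAP_MASK (((1ULL << SK_SHIFT_OFFSET) - 1) - SK_BRANCH_TAG)

/* Pack the tag, a bitmap (bits 1..47) and a key offset into one word. */
static uint64_t
sk_make_index(uint64_t bitmap, size_t offset) {
	assert((bitmap & ~SK_BITMAP_MASK) == 0);
	assert(offset < (1U << 16));
	return (SK_BRANCH_TAG | bitmap | ((uint64_t)offset << SK_SHIFT_OFFSET));
}

/* The offset is at the top of the word: a shift alone extracts it. */
static size_t
sk_index_offset(uint64_t index) {
	return ((size_t)(index >> SK_SHIFT_OFFSET));
}

/* Bit tests work against the whole index word, with no masking. */
static bool
sk_index_has_bit(uint64_t index, unsigned int bit) {
	return ((index & (1ULL << bit)) != 0);
}
```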

/***********************************************************************
 *
 * garbage collector tuning parameters
 */

/*
 * A "cell" is a location that can contain a `qp_node_t`, and a "chunk"
 * is a moderately large array of cells. A big trie can occupy
 * multiple chunks. (Unlike other nodes, a trie's root node lives in
 * its `struct dns_qp` instead of being allocated in a cell.)
 *
 * The qp-trie allocator hands out space for twigs vectors. Allocations are
 * made sequentially from one of the chunks; this kind of "sequential
 * allocator" is also known as a "bump allocator", so in `struct dns_qp`
 * (see below) the allocation chunk is called `bump`.
 */

/*
 * The number of cells in a chunk is a power of 2, and a chunk must have
 * space for a full twigs vector (48 wide). When testing, use a much
 * smaller chunk size to make the allocator work harder.
 */
#ifdef FUZZING_BUILD_MODE_UNSAFE_FOR_PRODUCTION
#define QP_CHUNK_LOG 7
#else
#define QP_CHUNK_LOG 10
#endif

STATIC_ASSERT(6 <= QP_CHUNK_LOG && QP_CHUNK_LOG <= 20,
	      "qp-trie chunk size is unreasonable");

#define QP_CHUNK_SIZE (1U << QP_CHUNK_LOG)
#define QP_CHUNK_BYTES (QP_CHUNK_SIZE * sizeof(qp_node_t))

/*
 * A chunk needs to be compacted if it has fragmented this much.
 * (one eighth of the chunk, i.e. 12.5% overhead, seems reasonable)
 */
#define QP_MAX_FREE (QP_CHUNK_SIZE / 8)

/*
 * Compact automatically when we pass this threshold: when there is a lot
 * of free space in absolute terms (more than four chunks' worth), and
 * when we have freed more than half of the space we allocated.
 *
 * The current compaction algorithm scans the whole trie, so it is important
 * to scale the threshold based on the size of the trie to avoid quadratic
 * behaviour. XXXFANF find an algorithm that scans less of the trie!
 *
 * During a modification transaction, when we copy-on-write some twigs we
 * count the old copies as "free", because they will be when the
 * transaction commits. But they cannot be recovered immediately so they
 * are also counted as on hold, and discounted when we decide whether to
 * compact.
 */
#define QP_MAX_GARBAGE(qp) \
	(((qp)->free_count - (qp)->hold_count) > QP_CHUNK_SIZE * 4 && \
	 ((qp)->free_count - (qp)->hold_count) > (qp)->used_count / 2)
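
/*
 * A minimal sketch of the threshold test above, written as a function
 * over a hypothetical `struct sk_counts` rather than a `dns_qp_t`. It
 * assumes the non-fuzzing chunk size (QP_CHUNK_LOG of 10) and that
 * `hold_count` never exceeds `free_count`, since held cells are also
 * counted as free.
 */

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SK_CHUNK_SIZE (1U << 10) /* assumed: 1024 cells per chunk */

/* Hypothetical subset of the counters kept in struct dns_qp. */
struct sk_counts {
	uint32_t used_count; /* cells allocated */
	uint32_t free_count; /* cells no longer needed */
	uint32_t hold_count; /* free cells that cannot be reclaimed yet */
};

static bool
sk_max_garbage(const struct sk_counts *c) {
	assert(c->hold_count <= c->free_count);
	/* Only cells that can be reclaimed right now count as garbage. */
	uint32_t reclaimable = c->free_count - c->hold_count;
	return (reclaimable > SK_CHUNK_SIZE * 4 &&
		reclaimable > c->used_count / 2);
}
```

/*
 * Both conditions must hold: a small trie with a high proportion of
 * garbage is left alone, and so is a large trie whose garbage is big in
 * absolute terms but small relative to its size.
 */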

/*
 * The chunk base and usage arrays are resized geometrically and start off
 * with two entries.
 */
#define GROWTH_FACTOR(size) ((size) + (size) / 2 + 2)
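
/*
 * A minimal sketch showing the sizes produced by repeatedly applying the
 * growth formula above, starting from an empty array. The `SK_` / `sk_`
 * names are hypothetical helpers for this illustration only.
 */

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SK_GROWTH_FACTOR(size) ((size) + (size) / 2 + 2)

/* Fill out[0..n-1] with successive array sizes, starting from size 0. */
static void
sk_growth_sequence(uint32_t *out, size_t n) {
	assert(n == 0 || out != NULL);
	uint32_t size = 0;
	for (size_t i = 0; i < n; i++) {
		size = SK_GROWTH_FACTOR(size);
		out[i] = size;
	}
}
```

/*
 * The first result is 2, matching "start off with two entries"; after
 * that each step grows the array by roughly half, plus 2.
 */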

/***********************************************************************
 *
 * helper types
 */

/*
 * C is not strict enough with its integer types for these typedefs to
 * improve type safety, but it helps to have annotations saying what
 * particular kind of number we are dealing with.
 */

/*
 * The number or position of a bit inside a word. (0..63)
 *
 * Note: A dns_qpkey_t is logically an array of qp_shift_t values, but it
 * isn't declared that way because dns_qpkey_t is a public type whereas
 * qp_shift_t is private.
 */
typedef uint8_t qp_shift_t;

/*
 * The number of bits set in a word (as in Hamming weight or popcount)
 * which is used for the position of a node in the packed sparse
 * vector of twigs. (0..47) because our bitmap does not fill the word.
 */
typedef uint8_t qp_weight_t;

/*
 * A chunk number, i.e. an index into the chunk arrays.
 */
typedef uint32_t qp_chunk_t;

/*
 * Cell offset within a chunk, or a count of cells. Each cell in a
 * chunk can contain a node.
 */
typedef uint32_t qp_cell_t;

/*
 * A twig reference is used to refer to a twigs vector, which occupies a
 * contiguous group of cells.
 */
typedef uint32_t qp_ref_t;

/*
 * Constructors and accessors for qp_ref_t values, defined here to show
 * how the qp_ref_t, qp_chunk_t, qp_cell_t types relate to each other
 */

static inline qp_ref_t
make_ref(qp_chunk_t chunk, qp_cell_t cell) {
	return (QP_CHUNK_SIZE * chunk + cell);
}

static inline qp_chunk_t
ref_chunk(qp_ref_t ref) {
	return (ref / QP_CHUNK_SIZE);
}

static inline qp_cell_t
ref_cell(qp_ref_t ref) {
	return (ref % QP_CHUNK_SIZE);
}
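
/*
 * A minimal sketch of the round trip between a reference and its chunk
 * and cell parts, assuming the non-fuzzing chunk size (QP_CHUNK_LOG of
 * 10). The `sk_*` names are hypothetical copies made so the example
 * stands alone.
 */

```c
#include <assert.h>
#include <stdint.h>

#define SK_CHUNK_LOG  10
#define SK_CHUNK_SIZE (1U << SK_CHUNK_LOG)

typedef uint32_t sk_chunk_t;
typedef uint32_t sk_cell_t;
typedef uint32_t sk_ref_t;

static sk_ref_t
sk_make_ref(sk_chunk_t chunk, sk_cell_t cell) {
	assert(cell < SK_CHUNK_SIZE);
	return (SK_CHUNK_SIZE * chunk + cell);
}

static sk_chunk_t
sk_ref_chunk(sk_ref_t ref) {
	return (ref / SK_CHUNK_SIZE);
}

static sk_cell_t
sk_ref_cell(sk_ref_t ref) {
	return (ref % SK_CHUNK_SIZE);
}
```

/*
 * Because the chunk size is a power of 2, the division and remainder are
 * equivalent to a shift and a mask: the chunk number is the upper bits of
 * the reference and the cell offset is the lower QP_CHUNK_LOG bits.
 */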

/***********************************************************************
 *
 * main qp-trie structures
 */

#define QP_MAGIC ISC_MAGIC('t', 'r', 'i', 'e')
#define VALID_QP(qp) ISC_MAGIC_VALID(qp, QP_MAGIC)

/*
 * This is annoying: C doesn't allow us to use a predeclared structure as
 * an anonymous struct member, so we have to fart around. The feature we
 * want is available in GCC and Clang with -fms-extensions, but a
 * non-standard extension won't make these declarations neater if we must
 * also have a standard alternative.
 */

/*
 * Lightweight read-only access to a qp-trie.
 *
 * Just the fields needed for the hot path. The `base` field points
 * to an array containing pointers to the base of each chunk, indexed
 * like `qp->base[chunk]`; see `ref_ptr()` below.
 *
 * A `dns_qpread_t` has a lifetime that does not extend across multiple
 * write transactions, so it can share a chunk `base` array belonging to
 * the `dns_qpmulti_t` it came from.
 *
 * We're lucky with the layout on 64 bit systems: this is only 40 bytes,
 * with no padding.
 */
#define DNS_QPREAD_COMMON \
	uint32_t magic; \
	qp_node_t root; \
	qp_node_t **base; \
	void *ctx; \
	const dns_qpmethods_t *methods

struct dns_qpread {
	DNS_QPREAD_COMMON;
};

/*
 * Heavyweight read-only snapshots of a qp-trie.
 *
 * Unlike a lightweight `dns_qpread_t`, a snapshot can survive across
 * multiple write transactions, any of which may need to expand the
 * chunk `base` array. So a `dns_qpsnap_t` keeps its own copy of the
 * array, which will always be equal to some prefix of the expanded
 * arrays in the `dns_qpmulti_t` that it came from.
 *
 * The `dns_qpmulti_t` keeps a refcount of its snapshots, and while
 * the refcount is non-zero, chunks are not freed or reused. When a
 * `dns_qpsnap_t` is destroyed, if it decrements the refcount to zero,
 * it can do any deferred cleanup.
 *
 * The generation number is used for tracing.
 */
struct dns_qpsnap {
	DNS_QPREAD_COMMON;
	uint32_t generation;
	dns_qpmulti_t *whence;
	qp_node_t *base_array[];
};

/*
 * Read-write access to a qp-trie requires extra fields to support the
 * allocator and garbage collector.
 *
 * The chunk `base` and `usage` arrays are separate because the `usage`
 * array is only needed for allocation, so it is kept separate from the
 * data needed by the read-only hot path. The arrays have empty slots where
 * new chunks can be placed, so `chunk_max` is the maximum number of chunks
 * (until the arrays are resized).
 *
 * Bare instances of a `struct dns_qp` are used for stand-alone
 * single-threaded tries. For multithreaded access, transactions alternate
 * between the `phase` pair of dns_qp objects inside a dns_qpmulti.
 *
 * For multithreaded access, the `generation` counter allows us to know
 * which chunks are writable or not: writable chunks were allocated in the
 * current generation. For single-threaded access, the generation counter
 * is always zero, so all chunks are considered to be writable.
 *
 * Allocations are made sequentially in the `bump` chunk. Lightweight write
 * transactions can re-use the `bump` chunk, so its prefix before `fender`
 * is immutable, and the rest is mutable even though its generation number
 * does not match the current generation.
 *
 * To decide when to compact and reclaim space, QP_MAX_GARBAGE() examines
 * the values of `used_count`, `free_count`, and `hold_count`. The
 * `hold_count` tracks nodes that need to be retained while readers are
 * using them; they are free but cannot be reclaimed until the transaction
 * has committed, so the `hold_count` is discounted from QP_MAX_GARBAGE()
 * during a transaction.
 *
 * There are some flags that alter the behaviour of write transactions.
 *
 *  - The `transaction_mode` indicates whether the current transaction is a
 *    light write or a heavy update, or (between transactions) the previous
 *    transaction's mode, because the setup for the next transaction
 *    depends on how the previous one committed. The mode is set at the
 *    start of each transaction. It is QP_NONE in a single-threaded qp-trie
 *    to detect if part of a `dns_qpmulti_t` is passed to dns_qp_destroy().
 *
 *  - The `compact_all` flag is used when every node in the trie should be
 *    copied. (Usually compaction aims to avoid moving nodes out of
 *    unfragmented chunks.) It is used when compaction is explicitly
 *    requested via `dns_qp_compact()`, and as an emergency mechanism if
 *    normal compaction failed to clear the QP_MAX_GARBAGE() condition.
 *    (This emergency is a bug even though we have a rescue mechanism.)
 *
 *  - The `shared_arrays` flag indicates that the chunk `base` and `usage`
 *    arrays are shared by both `phase`s in this trie's `dns_qpmulti_t`.
 *    This allows us to delay allocating copies of the arrays during a
 *    write transaction, until we definitely need to resize them.
 *
 *  - When built with fuzzing support, we can use mprotect() and munmap()
 *    to ensure that incorrect memory accesses cause fatal errors. The
 *    `write_protect` flag must be set straight after the `dns_qpmulti_t`
 *    is created, then left unchanged.
 *
 * Some of the dns_qp_t fields are only used for multithreaded transactions
 * (marked [MT] below) but the same code paths are also used for single-
 * threaded writes. To reduce the size of a dns_qp_t, these fields could
 * perhaps be moved into the dns_qpmulti_t, but that would require some kind
 * of conditional runtime downcast from dns_qp_t to dns_qpmulti_t, which is
 * likely to be ugly. It is probably best to keep things simple if most tries
 * need multithreaded access (XXXFANF do they? e.g. when there are many auth
 * zones).
 */
struct dns_qp {
	DNS_QPREAD_COMMON;
	isc_mem_t *mctx;
	/*% array of per-chunk allocation counters */
	struct {
		/*% the allocation point, increases monotonically */
		qp_cell_t used;
		/*% count of nodes no longer needed, also monotonic */
		qp_cell_t free;
		/*% when was this chunk allocated? */
		uint32_t generation;
	} *usage;
	/*% transaction counter [MT] */
	uint32_t generation;
	/*% number of slots in `base` and `usage` arrays */
	qp_chunk_t chunk_max;
	/*% which chunk is used for allocations */
	qp_chunk_t bump;
	/*% twigs in the `bump` chunk below `fender` are read only [MT] */
	qp_cell_t fender;
	/*% number of leaf nodes */
	qp_cell_t leaf_count;
	/*% total of all usage[] counters */
	qp_cell_t used_count, free_count;
	/*% cells that cannot be recovered right now */
	qp_cell_t hold_count;
	/*% what kind of transaction was most recently started [MT] */
	enum { QP_NONE, QP_WRITE, QP_UPDATE } transaction_mode : 2;
	/*% compact the entire trie [MT] */
	bool compact_all : 1;
	/*% chunk arrays are shared with a readonly qp-trie [MT] */
	bool shared_arrays : 1;
	/*% optionally when compiled with fuzzing support [MT] */
	bool write_protect : 1;
};

/*
 * Concurrent access to a qp-trie.
 *
 * The `read` pointer is used for read queries. It points to one of the
 * `phase` elements. During a transaction, the other `phase` (see
 * `write_phase()` below) is modified incrementally in copy-on-write
 * style. On commit the `read` pointer is swapped to the altered phase.
 */
struct dns_qpmulti {
	uint32_t magic;
	/*% controls access to the `read` pointer and its target phase */
	isc_rwlock_t rwlock;
	/*% points to phase[r] and swaps on commit */
	dns_qp_t *read;
	/*% protects the snapshot counter and `write_phase()` */
	isc_mutex_t mutex;
	/*% so we know when old chunks are still shared */
	unsigned int snapshots;
	/*% one is read-only, one is mutable */
	dns_qp_t phase[2];
};

/*
 * Get a pointer to the phase that isn't read-only.
 */
static inline dns_qp_t *
write_phase(dns_qpmulti_t *multi) {
	bool read0 = multi->read == &multi->phase[0];
	return (read0 ? &multi->phase[1] : &multi->phase[0]);
}
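
/*
 * A minimal sketch of the two-phase arrangement described above, with
 * all locking left out. `struct sk_multi`, `sk_init()` and `sk_commit()`
 * are hypothetical; the real commit logic is not shown in this file and
 * does more than swap a pointer. The sketch only illustrates how the
 * reader and writer phases trade places.
 */

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for dns_qp_t. */
struct sk_phase {
	int version;
};

/* Hypothetical stand-in for dns_qpmulti_t, without locks or counters. */
struct sk_multi {
	struct sk_phase *read;
	struct sk_phase phase[2];
};

static void
sk_init(struct sk_multi *m) {
	assert(m != NULL);
	m->phase[0].version = 0;
	m->phase[1].version = 0;
	m->read = &m->phase[0];
}

/* The phase that readers are not using. */
static struct sk_phase *
sk_write_phase(struct sk_multi *m) {
	return (m->read == &m->phase[0] ? &m->phase[1] : &m->phase[0]);
}

/* Publish the modified phase by pointing readers at it. */
static void
sk_commit(struct sk_multi *m) {
	m->read = sk_write_phase(m);
}
```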

#define QPMULTI_MAGIC ISC_MAGIC('q', 'p', 'm', 'v')
#define VALID_QPMULTI(qp) ISC_MAGIC_VALID(qp, QPMULTI_MAGIC)

/***********************************************************************
 *
 * interior node constructors and accessors
 */

/*
 * See the comments under "interior node basics" above, which explain the
 * layout of nodes as implemented by the following functions.
 */

/*
 * Get the 64-bit word of a node.
 */
static inline uint64_t
node64(qp_node_t *n) {
	uint64_t lo = n->biglo;
	uint64_t hi = n->bighi;
	return (lo | (hi << 32));
}

/*
 * Get the 32-bit word of a node.
 */
static inline uint32_t
node32(qp_node_t *n) {
	return (n->small);
}

/*
 * Create a node from its parts
 */
static inline qp_node_t
make_node(uint64_t big, uint32_t small) {
	return ((qp_node_t){
		.biglo = (uint32_t)(big),
		.bighi = (uint32_t)(big >> 32),
		.small = small,
	});
}

/*
 * Test a node's tag bit.
 */
static inline bool
is_branch(qp_node_t *n) {
	return (n->biglo & BRANCH_TAG);
}

/* leaf nodes *********************************************************/

/*
 * Get a leaf's pointer value. The double cast is to avoid a warning
 * about mismatched pointer/integer sizes on 32 bit systems.
 */
static inline void *
leaf_pval(qp_node_t *n) {
	return ((void *)(uintptr_t)node64(n));
}

/*
 * Get a leaf's integer value
 */
static inline uint32_t
leaf_ival(qp_node_t *n) {
	return (node32(n));
}

/*
 * Create a leaf node from its parts
 */
static inline qp_node_t
make_leaf(const void *pval, uint32_t ival) {
	qp_node_t leaf = make_node((uintptr_t)pval, ival);
	REQUIRE(!is_branch(&leaf) && pval != NULL);
	return (leaf);
}

/* branch nodes *******************************************************/

/*
 * The following function names use plural `twigs` when they work on a
 * branch's twigs vector as a whole, and singular `twig` when they work on
 * a particular twig.
 */

/*
 * Get a branch node's index word
 */
static inline uint64_t
branch_index(qp_node_t *n) {
	return (node64(n));
}

/*
 * Get a reference to a branch node's child twigs.
 */
static inline qp_ref_t
branch_twigs_ref(qp_node_t *n) {
	return (node32(n));
}

/*
 * Bit positions in the bitmap come directly from the key. DNS names are
 * converted to keys using the tables declared at the end of this file.
 */
static inline qp_shift_t
qpkey_bit(const dns_qpkey_t key, size_t len, size_t offset) {
	if (offset < len) {
		return (key[offset]);
	} else {
		return (SHIFT_NOBYTE);
	}
}
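
/*
 * A minimal sketch of the key lookup above. It assumes, for illustration
 * only, that a key is a plain array of `uint8_t` bit positions; the
 * `sk_*` names and the key contents are hypothetical. The point is that
 * reading past the end of a key behaves as if the key were padded with
 * SHIFT_NOBYTE.
 */

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { SK_SHIFT_NOBYTE = 1 }; /* same value as SHIFT_NOBYTE in the enum */

static uint8_t
sk_key_bit(const uint8_t *key, size_t len, size_t offset) {
	assert(key != NULL || len == 0);
	if (offset < len) {
		return (key[offset]);
	} else {
		return (SK_SHIFT_NOBYTE);
	}
}
```

/*
 * A branch whose offset lies beyond the end of a short key therefore
 * still yields a well-defined bit to test, without the caller having to
 * check the key length first.
 */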

/*
 * Extract a branch node's offset field, used to index the key.
 */
static inline size_t
branch_key_offset(qp_node_t *n) {
	return ((size_t)(branch_index(n) >> SHIFT_OFFSET));
}

/*
 * Which bit identifies the twig of this node for this key?
 */
static inline qp_shift_t
branch_keybit(qp_node_t *n, const dns_qpkey_t key, size_t len) {
	return (qpkey_bit(key, len, branch_key_offset(n)));
}

/*
 * Convert a twig reference into a pointer.
 */
static inline qp_node_t *
ref_ptr(dns_qpreadable_t qpr, qp_ref_t ref) {
	dns_qpread_t *qp = dns_qpreadable_cast(qpr);
	return (qp->base[ref_chunk(ref)] + ref_cell(ref));
}

/*
 * Get a pointer to a branch node's twigs vector.
 */
static inline qp_node_t *
branch_twigs_vector(dns_qpreadable_t qpr, qp_node_t *n) {
	dns_qpread_t *qp = dns_qpreadable_cast(qpr);
	return (ref_ptr(qp, branch_twigs_ref(n)));
}

/*
 * Warm up the cache while calculating which twig we want.
 */
static inline void
prefetch_twigs(dns_qpreadable_t qpr, qp_node_t *n) {
	__builtin_prefetch(branch_twigs_vector(qpr, n));
}

/***********************************************************************
 *
 * bitmap popcount shenanigans
 */

/*
 * How many twigs appear in the vector before the one corresponding to the
 * given bit? Calculated using popcount of part of the branch's bitmap.
 *
 * To calculate a mask that covers the lesser bits in the bitmap, we
 * subtract 1 to set the bits, and subtract the branch tag because it
 * is not part of the bitmap.
 */
static inline qp_weight_t
branch_twigs_before(qp_node_t *n, qp_shift_t bit) {
	uint64_t mask = (1ULL << bit) - 1 - BRANCH_TAG;
	uint64_t bmp = branch_index(n) & mask;
	return ((qp_weight_t)__builtin_popcountll(bmp));
}

/*
 * How many twigs does this node have?
 *
 * The offset is directly after the bitmap so the offset's lesser bits
 * cover the whole bitmap, and the bitmap's weight is the number of twigs.
 */
static inline qp_weight_t
branch_twigs_size(qp_node_t *n) {
	return (branch_twigs_before(n, SHIFT_OFFSET));
}

/*
 * Position of a twig within the packed sparse vector.
 */
static inline qp_weight_t
branch_twig_pos(qp_node_t *n, qp_shift_t bit) {
	return (branch_twigs_before(n, bit));
}

/*
 * Get a pointer to a particular twig.
 */
static inline qp_node_t *
branch_twig_ptr(dns_qpreadable_t qpr, qp_node_t *n, qp_shift_t bit) {
	return (branch_twigs_vector(qpr, n) + branch_twig_pos(n, bit));
}

/*
 * Is the twig identified by this bit present?
 */
static inline bool
branch_has_twig(qp_node_t *n, qp_shift_t bit) {
	return (branch_index(n) & (1ULL << bit));
}
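
/*
 * A minimal worked example of the popcount calculation above, operating
 * on a bare 64-bit index word instead of a `qp_node_t`. The `sk_*` /
 * `SK_*` names and the example bit positions are hypothetical, and
 * `__builtin_popcountll()` assumes GCC or Clang, as in the code above.
 */

```c
#include <assert.h>
#include <stdint.h>

enum {
	SK_SHIFT_NOBYTE = 1,  /* lowest bitmap bit */
	SK_SHIFT_OFFSET = 48, /* first bit above the bitmap */
};

#define SK_BRANCH_TAG 1ULL

/* Count the bitmap bits below `bit`, excluding the tag bit. */
static unsigned int
sk_twigs_before(uint64_t index, unsigned int bit) {
	assert(bit >= SK_SHIFT_NOBYTE && bit <= SK_SHIFT_OFFSET);
	uint64_t mask = (1ULL << bit) - 1 - SK_BRANCH_TAG;
	return ((unsigned int)__builtin_popcountll(index & mask));
}

/* A branch at key offset 7 with twigs for bits 5, 9 and 30. */
static uint64_t
sk_example_index(void) {
	return (SK_BRANCH_TAG | (1ULL << 5) | (1ULL << 9) | (1ULL << 30) |
		(7ULL << SK_SHIFT_OFFSET));
}
```

/*
 * With this index word the twigs for bits 5, 9 and 30 sit at positions
 * 0, 1 and 2 of the packed vector, and asking for SHIFT_OFFSET counts
 * the whole bitmap. Neither the tag bit nor the offset field leaks into
 * the count.
 */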

/* twig logistics *****************************************************/

static inline void
move_twigs(qp_node_t *to, qp_node_t *from, qp_weight_t size) {
	memmove(to, from, size * sizeof(qp_node_t));
}

static inline void
zero_twigs(qp_node_t *twigs, qp_weight_t size) {
	memset(twigs, 0, size * sizeof(qp_node_t));
}

/***********************************************************************
 *
 * method invocation helpers
 */

static inline void
attach_leaf(dns_qpreadable_t qpr, qp_node_t *n) {
	dns_qpread_t *qp = dns_qpreadable_cast(qpr);
	qp->methods->attach(qp->ctx, leaf_pval(n), leaf_ival(n));
}

static inline void
detach_leaf(dns_qpreadable_t qpr, qp_node_t *n) {
	dns_qpread_t *qp = dns_qpreadable_cast(qpr);
	qp->methods->detach(qp->ctx, leaf_pval(n), leaf_ival(n));
}

static inline size_t
leaf_qpkey(dns_qpreadable_t qpr, qp_node_t *n, dns_qpkey_t key) {
	dns_qpread_t *qp = dns_qpreadable_cast(qpr);
	return (qp->methods->makekey(key, qp->ctx, leaf_pval(n), leaf_ival(n)));
}

static inline char *
triename(dns_qpreadable_t qpr, char *buf, size_t size) {
	dns_qpread_t *qp = dns_qpreadable_cast(qpr);
	qp->methods->triename(qp->ctx, buf, size);
	return (buf);
}

#define TRIENAME(qp) \
	triename(qp, (char[DNS_QP_TRIENAME_MAX]){}, DNS_QP_TRIENAME_MAX)

/***********************************************************************
 *
 * converting DNS names to trie keys
 */

/*
 * This is a deliberate simplification of the hostname characters,
 * because it doesn't matter much if we treat a few extra characters
 * favourably: there is plenty of space in the index word for a
 * slightly larger bitmap.
 */
static inline bool
qp_common_character(uint8_t byte) {
	return (('-' <= byte && byte <= '9') || ('_' <= byte && byte <= 'z'));
}
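
/*
 * A minimal sketch that enumerates which byte values the test above
 * accepts. `sk_common_character()` is a copy made so the example stands
 * alone, and `sk_count_common()` is a hypothetical helper. It assumes an
 * ASCII character set.
 */

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

static bool
sk_common_character(uint8_t byte) {
	return (('-' <= byte && byte <= '9') || ('_' <= byte && byte <= 'z'));
}

/* Count how many of the 256 byte values are treated as common. */
static unsigned int
sk_count_common(void) {
	unsigned int count = 0;
	for (unsigned int b = 0; b < 256; b++) {
		if (sk_common_character((uint8_t)b)) {
			count++;
		}
	}
	return (count);
}
```

/*
 * The two ranges cover '-', '.', '/', the digits, '_', '`' and the
 * lower-case letters. The '/' and '`' characters are among the extras
 * that come along because contiguous ranges are used.
 */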

/*
 * Lookup table mapping bytes in DNS names to bit positions, used
 * by dns_qpkey_fromname() to convert DNS names to qp-trie keys.
 */
extern uint16_t dns_qp_bits_for_byte[];

/*
 * And the reverse, mapping bit positions to characters, so the tests
 * can print diagnostics involving qp-trie keys.
 */
extern uint8_t dns_qp_byte_for_bit[];

/**********************************************************************/