2
0
mirror of https://gitlab.isc.org/isc-projects/bind9 synced 2025-08-29 13:38:26 +00:00

59 Commits

Author SHA1 Message Date
Tony Finch
8a3a216f40 Support for iterating over the leaves in a qp-trie
The iterator object records a path through the trie, in a similar
manner to the existing dns_rbtnodechain.
2023-04-05 12:35:04 +01:00
Tony Finch
3c333d02a0 More dns_qpkey_t safety checks
My original idea had been that the core qp-trie code would be mostly
independent of the storage for keys, so I did not make it check at run
time that key lengths are sensible. However, the qp-trie search
routines need to get keys out of leaf objects, for which they provide
storage on the stack, which is particularly dangerous for unchecked
buffer overflows. So this change checks that key lengths are in bounds
at the API boundary between the qp-trie code and the rest of BIND, and
there is no more pretence that keys might be longer.
2023-04-03 15:10:47 +00:00
Tony Finch
0858514ae8 Improve qp-trie compaction in write transactions
In general, it's better to do one thorough compaction when a batch of
work is complete, which is the way that `update` transactions work.
Conversely, `write` transactions are designed so that lots of little
transactions are not too inefficient, but they need explicit
compaction. This changes `dns_qp_compact()` so that it is easier to
compact any time that makes sense, if there isn't a better way to
schedule compaction. And `dns_qpmulti_commit()` only recycles garbage
when there is enough to make it worthwhile.
2023-02-27 13:47:57 +00:00
Tony Finch
7dcde5d2fc Make the qp-trie stats logging quieter
Only log when useful work was done
2023-02-27 13:47:57 +00:00
Tony Finch
4b5ec07bb7 Refactor qp-trie to use QSBR
The first working multi-threaded qp-trie was stuck with an unpleasant
trade-off:

  * Use `isc_rwlock`, which has acceptable write performance, but
    terrible read scalability because the qp-trie made all accesses
    through a single lock.

  * Use `liburcu`, which has great read scalability, but terrible
    write performance, because I was relying on `rcu_synchronize()`
    which is rather slow. And `liburcu` is LGPL.

To get the best of both worlds, we need our own scalable read side,
which we now have with `isc_qsbr`. And we need to modify the write
side so that it is not blocked by readers.

Better write performance requires an async cleanup function like
`call_rcu()`, instead of the blocking `rcu_synchronize()`. (There
is no blocking cleanup in `isc_qsbr`, because I have concluded
that it would be an attractive nuisance.)

Until now, all my multithreading qp-trie designs have been based
around two versions, read-only and mutable. This is too few to
work with asynchronous cleanup. The bare minimum (as in epoch
based reclamation) is three, but it makes more sense to support an
arbitrary number. Doing multi-version support "properly" makes
fewer assumptions about how safe memory reclamation works, and it
makes snapshots and rollbacks simpler.

To avoid making the memory management even more complicated, I
have introduced a new kind of "packed reader node" to anchor the
root of a version of the trie. This is simpler because it re-uses
the existing chunk lifetime logic - see the discussion under
"packed reader nodes" in `qp_p.h`.

I have also made the chunk lifetime logic simpler. The idea of a
"generation" is gone; instead, chunks are either mutable or
immutable. And the QSBR phase number is used to indicate when a
chunk can be reclaimed.

Instead of the `shared_base` flag (which was basically a one-bit
reference count, with a two version limit) the base array now has a
refcount, which replaces the confusing ad-hoc lifetime logic with
something more familiar and systematic.
2023-02-27 13:47:55 +00:00
Tony Finch
549854f63b Some minor qp-trie improvements
Adjust the dns_qp_memusage() and dns_qp_compact() functions
to be more informative and flexible about handling fragmentation.

Avoid wasting space in runt chunks.

Switch from twigs_mutable() to cells_immutable() because that is the
sense we usually want.

Drop the redundant evacuate() function and rename evacuate_twigs() to
evacuate(). Move some chunk test functions closer to their point of
use.

Clarify compact_recursive(). Some small cleanups to comments.

Use isc_time_monotonic() for qp-trie timing stats.

Use #define constants to control debug logging.

Set up DNS name label offsets in dns_qpkey_fromname() so it is easier
to use in cases where the name is not fully hydrated.
2023-02-27 13:47:25 +00:00
Tony Finch
4b09c9a6ae qp-trie naming improvements
Adjust to typename_operation style
	s/VALID_QP/QP_VALID/g
	s/QP_VALIDMULTI/QPMULTI_VALID/g

Improved greppability
	s/\bctx\b/uctx/g

Less cluttered logging
	s/QP_TRACE/TRACE/g
	s/QP_LOG_STATS/LOG_STATS/g
2023-02-27 13:47:25 +00:00
Tony Finch
df6747ee70 Fix qp-trie refcounting mistake
The error occurred when:

  * The bump chunk was re-used across multiple write transactions.
    In this situation the bump chunk is marked immutable, but the
    immutable flag is disregarded for cells after the fender, which
    were allocated in the current transaction.

  * The bump chunk fills up during an insert operation, so that the
    enlarged twigs vector is allocated from a new bump chunk.

  * Before this happened, we should have (but didn't) made the twigs
    vector mutable. This would have adjusted its refcounts as necessary.

  * However, moving to a new bump chunk has a side effect: twigs that
    were previously considered mutable because they are after the
    fender become immutable.

  * Because of this, the old twigs vector was not destroyed as expected.

  * So leaves were duplicated without their refcounts being increased.

The effect is that the refcounts were lower than they should have
been, and underflowed. The tests failed to check for refcount
underflow, so this mistake was detected much later than it ideally
could have been.

After the fix, it is now correct not to ensure the twigs are mutable,
because they are about to be copied to a larger vector. Instead, we
need to find out whether `squash_twigs()` destroyed the old twigs, and
adjust the refcounts accordingly.
2023-02-27 13:47:25 +00:00
Tony Finch
6b9ddbd1ce Add a qp-trie data structure
A qp-trie is a kind of radix tree that is particularly well-suited to
DNS servers. I invented the qp-trie in 2015, based on Dan Bernstein's
crit-bit trees and Phil Bagwell's HAMT. https://dotat.at/prog/qp/

This code incorporates some new ideas that I prototyped using
NLnet Labs NSD in 2020 (optimizations for DNS names as keys)
and 2021 (custom allocator and garbage collector).
https://dotat.at/cgi/git/nsd.git

The BIND version of my qp-trie code has a number of improvements
compared to the prototype developed for NSD.

  * The main omission in the prototype was the very sketchy outline of
    how locking might work. Now the locking has been implemented,
    using a reader/writer lock and a mutex. However, it is designed to
    benefit from liburcu if that is available.

  * The prototype was designed for two-version concurrency, one
    version for readers and one for the writer. The new code supports
    multiversion concurrency, to provide a basis for BIND's dbversion
    machinery, so that updates are not blocked by long-running zone
    transfers.

  * There are now two kinds of transaction that modify the trie: an
    `update` aims to support many very small zones without wasting
    memory; a `write` avoids unnecessary allocation to help the
    performance of many small changes to the cache.

  * There is also a single-threaded interface for situations where
    concurrent access is not necessary.

  * The API makes better use of types to make it more clear which
    operations are permitted when.

  * The lookup table used to convert a DNS name to a qp-trie key is
    now initialized by a run-time constructor instead of a programmer
    using copy-and-paste. Key conversion is more flexible, so the
    qp-trie can be used with keys other than DNS names.

  * There has been much refactoring and re-arranging things to improve
    the terminology and order of presentation in the code, and the
    internal documentation has been moved from a comment into a file
    of its own.

Some of the required functionality has been stripped out, to be
brought back later after the basics are known to work.

  * Garbage collector performance statistics are missing.

  * Fancy searches are missing, such as longest match and
    nearest match.

  * Iteration is missing.

  * Search for update is missing, for cases where the caller needs to
    know if the value object is mutable or not.
2023-02-27 13:47:25 +00:00