mirror of
https://gitlab.isc.org/isc-projects/bind9
synced 2025-08-22 01:59:26 +00:00
254 lines
12 KiB
Plaintext
254 lines
12 KiB
Plaintext
Copyright (C) Internet Systems Consortium, Inc. ("ISC")
|
|
|
|
See COPYRIGHT in the source root or http://isc.org/copyright.html for terms.
|
|
|
|
$Id: red-black,v 1.9 2004/03/05 05:04:46 marka Exp $
|
|
|
|
Red-Black Tree Implementation Notes
|
|
|
|
OVERVIEW
|
|
|
|
BIND9's basic name storage mechanism is to use a modified form of
|
|
balanced binary tree known as a red-black tree. Red-black trees
|
|
provide for relatively efficient storage, retrieval and removal of
|
|
data while maintaining the lexical order of all stored keys, a
|
|
necessary function for DNS security.
|
|
|
|
DESCRIPTION
|
|
|
|
A red-black tree is a balanced binary tree named for the coloring that
|
|
is done in the tree, identifying each node as either red or black.
|
|
There are two simple rules for maintaining the color of nodes:
|
|
(1) A red node has only black children.
|
|
(2) The path from the root to any leaf node always includes the
|
|
same number of black nodes.
|
|
|
|
Whenever a key is added or removed, adjustments are made to adhere to
|
|
those two rules. These adjustments are relatively cheap to make but
|
|
maintain the balance of the tree, thus making for efficient addition,
|
|
lookup and deletion operations, all of which are O(log N). The color
|
|
of a node is not relevant to external users of the tree; it is needed
|
|
only to maintain the balance of the tree.
|
|
|
|
For more information on basic red-black trees, see _Introduction to
|
|
Algorithms_, Cormen, Leiserson, and Rivest, MIT Press / McGraw Hill,
|
|
1990, ISBN 0-262-03141-8, chapter 14.
|
|
|
|
In BIND9, the red-black tree implementation uses DNS names as keys,
|
|
and can store arbitrary data with each key value. "name" and "key"
|
|
are used interchangably in this document.
|
|
|
|
The basic red-black tree algorithm is further adapted for use in BIND9
|
|
to incorporate the notion of hierarchy, creating a tree of red-black
|
|
trees. Where there is more than one name with a common suffix, all
|
|
names with that suffix are stored in their own red-black tree, with a
|
|
down pointer from the suffix locating the subtree.
|
|
|
|
For example, consider storing the following names:
|
|
a x.d.e.f o.w.y.d.e.f
|
|
b z.d.e.f p.w.y.d.e.f
|
|
c g.h q.w.y.d.e.f
|
|
|
|
No matter which order the keys were added, this would result in a tree
|
|
that can be visualized as:
|
|
|
|
b
|
|
/ \
|
|
a d.e.f
|
|
/|\
|
|
c | g.h
|
|
|
|
|
w.y
|
|
/|\
|
|
x | z
|
|
|
|
|
p
|
|
/ \
|
|
o q
|
|
|
|
This tree shows that when there is no key for a particular label, and
|
|
when there is only one known label for its immediate subordinate, then
|
|
multiple labels can appear in a single node, such as at d.e.f and g.h.
|
|
It also demonstrates that there can be more nodes in the tree of trees
|
|
than there are actual keys (which degrades the O(log N) performance
|
|
marginally); the nodes at d.e.f and w.y do not represent keys.
|
|
|
|
As an aside, remember that when ordering DNS names, labels are
|
|
examined from the right, therefore w.y sorts after x and before z.
|
|
|
|
A split can occur not only on a regular label boundary, but also
|
|
between any two bits in an EDNS bitstring label. The common-suffix
|
|
rules will be applied to keep as many bits together as possible.
|
|
|
|
In the current implementation of the tree of trees, a node is
|
|
considered to "formally" exist only if it has data associated with
|
|
it. So if the above tree then had the key d.e.f added to it, the
|
|
operation would succeed rather than getting an "already exists"
|
|
error.
|
|
|
|
Along the same lines, if a key is added with a name which is a proper
|
|
superdomain of the name stored in an existing node, the operation will
|
|
succeed by splitting the existing node into one node that is the key
|
|
and another node that is the remaining parts of the name. Adding e.f
|
|
to the above tree results in the top level red-black tree having a
|
|
node named e.f where the current d.e.f is, and a down pointer from
|
|
d.e.f to a "tree" of a single node named d. The down pointer from d
|
|
would be kept to the level which has x, w.y, and z.
|
|
|
|
A similar split of d.e.f would occur if the name k.e.f were added.
|
|
The top level tree would have the node e.f with a down pointer to a
|
|
level that had both d and k, and d would continue to have its down
|
|
pointer to the x, w.y and z level.
|
|
|
|
It is guaranteed when splitting that external references to the node
|
|
that is split will remain valid --- in the previous examples, anything
|
|
that was pointing to the node that was d.e.f will still point to the
|
|
node that is now just d.
|
|
|
|
When deleting keys, nodes can be rejoined. If both of p.w.y.d.e.f and
|
|
q.w.y.d.e.f were removed from the example tree, the node named w.y
|
|
would become o.w.y. Unlike splitting, it is _not_ guaranteed that
|
|
external references remain consistent; sometimes they will, sometimes
|
|
they won't. Also, note that deletion is not perfectly symmetric with
|
|
addition. If you "undo" the last addition with a deletion of the same
|
|
key then the tree of trees is not guaranteed to have exactly the same
|
|
structure as it had prior to the addition. Sometimes, but not always.
|
|
|
|
Rejoining does not happen if it would violate any of the rules that
|
|
cause a split. o would not be rejoined with w.y if w.y had data
|
|
associated with the key; o would remain as a single node on its own
|
|
level. This emphasizes the rule that a node is considered to formally
|
|
exist only if data is associated with it, because even if w.y.d.e.f
|
|
had been explicitly added as a key but with no data, then o would
|
|
still be merged with the w.y node when p and q were deleted.
|
|
|
|
Searching for a node generally returns one of three possible results:
|
|
either the key is found, a superdomain (partial match) of the key is
|
|
found, or no part of the key is found. The first and last are rather
|
|
obvious, and the second result basically means that a hierarchically
|
|
enclosing name is found; e.g, searching for bb.rc.vix.com turned up
|
|
rc.vix.com, but not the full name.
|
|
|
|
No locking is done within the RBT library. @@@
|
|
|
|
CHAINS
|
|
|
|
@@@
|
|
|
|
When a partial match is made, level_matches is set while the chain
|
|
points to the partial match node that was found. Then the chain is
|
|
adjusted to point to the DNSSEC predecessor node, which might not even
|
|
be under the same top level domain as the name that was searched for.
|
|
For example, consider a database that had only the names vix.com and
|
|
isc.org. A search for uu.net would leave the chain pointed to
|
|
vix.com, the DNSSEC predecessor. Though this might first appear to
|
|
cause level_matches to be bogus because the chain has been unwound and
|
|
sent down another path, note that the partial match node will always
|
|
be in the chain of the predecessor, too --- and often the partial
|
|
match node will be the predecessor itself. In the vix.com/isc.org
|
|
example, the search for uu.net finds a partial match at ".", which is
|
|
of course also in the path to the vix.com predecessor. A search for
|
|
www.isc.org would find that isc.org is both the partial match and the
|
|
predecessor.
|
|
|
|
EXTERNAL PROGRAMMATIC DETAILS
|
|
|
|
This section details the functions used to interact with the BIND9
|
|
red-black tree library, or RBT for short.
|
|
|
|
A source file that will be using RBT will usually need to include
|
|
<dns/rbt.h>. This header file automatically includes <isc/result.h),
|
|
<isc/mem.h>, <dns/types.h>, and <dns/name.h>.
|
|
|
|
The rbt.h file has more complete descriptions of each of the functions
|
|
named here, including what is required for each argument, what each
|
|
function ensures (and might not ensure) will occur, and the full range
|
|
of possible results for each call. Note well: if a function returns a
|
|
dns_result_t rather than void, it definitely means there is something
|
|
that can go possibly wrong in the function and it should be checked by
|
|
the caller.
|
|
|
|
A new tree of trees must be initialized using:
|
|
|
|
dns_result_t dns_rbt_create(isc_mem_t *mctx, void (*deleter)(void *, void *),
|
|
void *deleter_arg, dns_rbt_t **rbtp);
|
|
|
|
The memory context, mctx, must be a non-null pointer that was
|
|
initialized with isc_mem_create(). The deleter argument, if non-null,
|
|
should point to a function that is responsible for cleaning up any
|
|
memory associated with the data pointer of a node when the node is
|
|
deleted. It is passed the deleted node's data pointer as its first
|
|
argument and deleter_arg as its second argument.
|
|
|
|
After initializing an RBT manager, to add keys to the tree, use:
|
|
|
|
dns_result_t dns_rbt_addname(dns_rbt_t *rbt, dns_name_t *name, void *data);
|
|
|
|
The name _must_ be an absolute name. It is not required that the data
|
|
pointer be non-null, but it is recommended that it point to something,
|
|
even just invalid memory, because of the various searching and
|
|
deletion issues described in the previous section. The RBT code will
|
|
not attempt to dereference the pointer.
|
|
|
|
To find a key in the tree, use:
|
|
|
|
dns_result_t dns_rbt_findname(dns_rbt_t *rbt, dns_name_t *name, void **data);
|
|
|
|
The data parameter must not be NULL, but *data must be NULL. The
|
|
result will be either DNS_R_SUCCESS, DNS_R_PARTIALMATCH or
|
|
DNS_R_NOTFOUND. In the first case, an exact match was found for the
|
|
name and there was an associate data pointer, which is returned via
|
|
the data parameter. A partial match results when the name has not
|
|
been found but a superdomain name, with data, does exist; then the
|
|
data for that name is returned in the data parameter. If no data is
|
|
found for the name or a superdomain, *data will remain NULL.
|
|
|
|
|
|
INTERNAL PROGRAMMATIC DETAILS
|
|
|
|
This section is mainly relevant to the RBT DB implementation. It is
|
|
highly recommended that programmers using the RBT library stick to the
|
|
functions named in the previous section.
|
|
|
|
The dns_rbt_addname and dns_rbt_findname functions named in the
|
|
previous section are wrappers around dns_rbt_addnode and
|
|
dns_rbt_findnode. The *node functions for the most part do not
|
|
particularly care whether a node has an associated data pointer or
|
|
not, whereas the *name functions do. The one exception to this is
|
|
that when a PARTIALMATCH is returned for a search, the indicated node
|
|
is the deepest match that has data, rather than just the deepest
|
|
match. Even that behavior is selectable, however, using the boolean
|
|
empty_data_ok argument to dns_rbt_findnode.
|
|
|
|
Each node in the tree of trees is represented by the following structure:
|
|
|
|
typedef struct dns_rbtnode {
|
|
struct dns_rbtnode *left;
|
|
struct dns_rbtnode *right;
|
|
struct dns_rbtnode *down;
|
|
/*
|
|
* The following bitfields add up to a total bitwidth of 32.
|
|
* The range of values necessary for each item is indicated,
|
|
* but in the case of "attributes" the field is wider to accomodate
|
|
* possible future expansion. "offsetlen" could be one bit
|
|
* narrower by always adjusting its value by 1 to find the real
|
|
* offsetlen, but doing so does not gain anything (except perhaps
|
|
* another bit for "attributes", which doesn't yet need any more).
|
|
*/
|
|
unsigned int color:1; /* range is 0..1 */
|
|
unsigned int attributes:6; /* range is 0..2 */
|
|
unsigned int namelen:8; /* range is 1..255 */
|
|
unsigned int offsetlen:8; /* range is 1..128 */
|
|
unsigned int padbytes:9; /* range is 0..380 */
|
|
/*
|
|
* These values are used in the RBT DB implementation. The
|
|
* appropriate node lock must be held before accessing them.
|
|
*/
|
|
void *data;
|
|
unsigned int dirty:1;
|
|
unsigned int locknum:DNS_RBT_LOCKLENGTH;
|
|
unsigned int references:DNS_RBT_REFLENGTH;
|
|
} dns_rbtnode_t;
|
|
|
|
@@@
|