Replace the use of the isc_ht API with the isc_hashmap API in the dns_adb
database implementation. This requires extending the
dns_adbnamebucket_t and dns_adbentrybucket_t structures to include
the key size and a copy of the key, because the isc_hashmap API needs
the raw key when resizing the hashmap table.
Previously:
* applications were using isc_app as the base unit for running the
application and signal handling.
* networking was handled in the netmgr layer, which would start a
number of threads, each with a uv_loop event loop.
* task/event handling was done in the isc_task unit, which used
netmgr event loops to run the isc_event calls.
In this refactoring:
* the network manager now uses isc_loop instead of maintaining its
own worker threads and event loops.
* the taskmgr that manages isc_task instances now also uses isc_loopmgr,
and every isc_task runs on a specific isc_loop bound to the specific
thread.
* applications have been updated as necessary to use the new API.
* new ISC_LOOP_TEST macros have been added to enable unit tests to
run isc_loop event loops. Unit tests have been updated to use these
where needed.
When dumping an ADB address entry associated with a name,
the name bucket lock was held, but the entry bucket lock was
not; this could cause data races when other threads were updating
address entry info. (These races are probably not operationally
harmful, but they triggered TSAN error reports.)
The BUFSIZ value varies between platforms: it can be 8K on Linux and
512 bytes on mingw. Make sure the buffers are always big enough for the
output data, enlarging or explicitly sizing them where needed, so that
the output is not truncated.
this command runs dns_adb_dumpquota() to display all servers
in the ADB that are being actively fetchlimited by the
fetches-per-server controls (i.e., servers with a nonzero average
timeout ratio or with the quota having been reduced from the
default value).
the "fetchlimit" system test has been updated to use the
new command to check quota values instead of "rndc dumpdb".
it's a style violation to have REQUIRE or INSIST contain code that
must run for the server to work. this was being done with some
atomic_compare_exchange calls. these have been cleaned up. uses
of atomic_compare_exchange in assertions have been replaced with
a new macro atomic_compare_exchange_enforced, which uses RUNTIME_CHECK
to ensure that the exchange was successful.
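The idea can be sketched with C11 atomics as follows. This is only an illustration: `SKETCH_RUNTIME_CHECK`, `sketch_cmpxchg_enforced` and `sketch_mark_shutting_down` are hypothetical stand-ins, not the actual definitions of RUNTIME_CHECK or atomic_compare_exchange_enforced.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical stand-in for RUNTIME_CHECK: unlike assert(), it is never
 * compiled out, so the checked expression always runs.
 */
#define SKETCH_RUNTIME_CHECK(cond)                                        \
	((cond) ? (void)0                                                 \
		: (fprintf(stderr, "runtime check failed: %s\n", #cond), \
		   abort()))

/* Hypothetical equivalent of atomic_compare_exchange_enforced. */
#define sketch_cmpxchg_enforced(obj, expected, desired) \
	SKETCH_RUNTIME_CHECK(                           \
		atomic_compare_exchange_strong((obj), (expected), (desired)))

/* Flip a flag from false to true; abort if it was not false. */
static bool
sketch_mark_shutting_down(atomic_bool *flag) {
	bool expected = false;

	sketch_cmpxchg_enforced(flag, &expected, true);
	return atomic_load(flag);
}
```

The point of the pattern is that the exchange has a side effect the program depends on, so it must not live inside an assertion that a build configuration could remove.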
Commits 76bcb4d16b776e25cc67937f7d1a2fe6e365cfd7 and
d48d8e1cf0879b818d710cc1238643610e386d38 did not include
isc_refcount_destroy() calls that would be logical counterparts of the
isc_refcount_init() calls these commits added. Add the missing
isc_refcount_destroy() calls to destroy().
Adding these calls (which ensure a given structure's reference count
equals 0 when it is destroyed, therefore detecting reference counting
issues) uncovered another flaw in the commits mentioned above: missing
isc_refcount_decrement() calls that would be logical counterparts of the
isc_refcount_increment*() calls these commits added. Add the missing
isc_refcount_decrement() calls to unlink_name() and unlink_entry().
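The pairing described above can be illustrated with a minimal reference counter built on C11 atomics. The `sketch_refcount_*` names are hypothetical and only mimic the shape of the isc_refcount calls; the real implementation may differ.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

typedef atomic_uint_fast32_t sketch_refcount_t;

static void
sketch_refcount_init(sketch_refcount_t *r, uint_fast32_t value) {
	atomic_init(r, value);
}

/* Returns the value held before the increment. */
static uint_fast32_t
sketch_refcount_increment(sketch_refcount_t *r) {
	return atomic_fetch_add(r, 1);
}

/* Returns the value held before the decrement. */
static uint_fast32_t
sketch_refcount_decrement(sketch_refcount_t *r) {
	return atomic_fetch_sub(r, 1);
}

static uint_fast32_t
sketch_refcount_current(sketch_refcount_t *r) {
	return atomic_load(r);
}

/*
 * Counterpart of init: abort if any reference is still outstanding.
 * A missing decrement elsewhere is caught here.
 */
static void
sketch_refcount_destroy(sketch_refcount_t *r) {
	if (atomic_load(r) != 0) {
		abort();
	}
}
```

With a check like this in the destroy path, every increment that lacks a matching decrement becomes a visible failure instead of a silent leak.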
Add isc_mutex_destroy() and isc_rwlock_destroy() calls missing from the
commits that introduced the relevant isc_mutex_init() and
isc_rwlock_init() calls:
- 76bcb4d16b776e25cc67937f7d1a2fe6e365cfd7
- 15953043124416ab1dbc857f6885ecdb167401bb
- 857f3bede37ccb419dac3816a0f96fa490af7d92
None of these omissions affect any hot paths, so they are not expected
to cause operational issues; correctness is the only concern here.
Previously, tasks could be created either unbound or bound to a specific
thread (worker loop). The unbound tasks would be assigned to a random
thread every time isc_task_send() was called. Because there's no logic
that would assign the task to the least busy worker, this just creates
unpredictability. Instead of random assignment, bind all the previously
unbound tasks to worker 0, which is guaranteed to exist.
Since commit bad5a523c2e, when the fetches-per-server quota
was increased or decreased, instead of the value being set to
the newly calculated quota, it was set to the *minimum* of
the new quota and 1, which effectively meant it was always set to 1.
It should instead have been the maximum, to prevent the value from
ever dropping to zero.
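A minimal sketch of the difference, using hypothetical helper names (`sketch_quota_buggy`, `sketch_quota_fixed`) rather than the actual ADB code:

```c
#include <assert.h>

#define SKETCH_MIN(a, b) ((a) < (b) ? (a) : (b))
#define SKETCH_MAX(a, b) ((a) > (b) ? (a) : (b))

/* Buggy clamp: the result can never exceed 1. */
static unsigned int
sketch_quota_buggy(unsigned int newquota) {
	return SKETCH_MIN(newquota, 1U);
}

/* Intended clamp: the result can never drop below 1. */
static unsigned int
sketch_quota_fixed(unsigned int newquota) {
	return SKETCH_MAX(newquota, 1U);
}
```

For a newly calculated quota of 50, the buggy version yields 1 while the fixed version yields 50; for a calculated quota of 0, only the fixed version keeps the value at 1.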
the ADB depends on the resolver, but previously only accessed it
via the view. as view->resolver may now be detached before the ADB
finishes, a shutdown race was possible. attaching to the resolver
directly prevents this.
weakly attaching and detaching when creating and destroying the
resolver obviates the need to have a callback event to do the weak
detach. remove the dns_resolver_whenshutdown() mechanism, as it is
now unused.
weakly attaching and detaching the view when creating or destroying
the ADB obviates the need for a whenshutdown callback event to do
the detaching. remove the dns_adb_whenshutdown() mechanism, since
it is no longer needed.
for better object separation, ADB and resolver statistics counters
are now stored in the ADB and resolver objects themselves, rather than
in the associated view.
In dns_adb_cancelfind(), we need to release the find lock and
then acquire the bucket and find locks in that order, for
consistency with the locking hierarchy elsewhere. Previously we
were only acquiring the bucket lock.
Also rewrote the function for better readability.
dns__adb_attach() had an assertion that prevented attaching to
dns_adb if the dns_adb was shutting down. There was a race between
checking for .exiting in dns_adb_createfind() and calling new_adbfind():
another thread could have set .exiting to true in between.
Remove the assertion and allow attaching to dns_adb even while it is
shutting down. The shutdown of dns_adb will be noticed only moments
later, when any other callback is called.
due to a typo in the code, ADB entries were unlinked from their entry
buckets during shutdown if they had a nonzero reference count. they
were only supposed to be unlinked if the reference count was exactly
one (that being the reference held by the bucket itself).
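The two conditions can be contrasted in a small sketch. The function names are hypothetical, and the real code operates on entry structures rather than bare counters.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Typo'd condition: any entry that is still referenced gets unlinked. */
static bool
sketch_should_unlink_buggy(uint_fast32_t refs) {
	return refs != 0;
}

/*
 * Intended condition: unlink only when the bucket itself holds the
 * last remaining reference.
 */
static bool
sketch_should_unlink_fixed(uint_fast32_t refs) {
	return refs == 1;
}
```

An entry with three references is still in use elsewhere, so only the buggy condition would unlink it.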
*** CID 351371: Null pointer dereferences (REVERSE_INULL)
/lib/dns/adb.c: 2615 in dns_adb_createfind()
2609 /*
2610 * Copy out error flags from the name structure into the find.
2611 */
2612 find->result_v4 = find_err_map[adbname->fetch_err];
2613 find->result_v6 = find_err_map[adbname->fetch6_err];
2614
>>> CID 351371: Null pointer dereferences (REVERSE_INULL)
>>> Null-checking "find" suggests that it may be null, but it has already been dereferenced on all paths leading to the check.
2615 if (find != NULL) {
2616 if (want_event) {
2617 INSIST((find->flags & DNS_ADBFIND_ADDRESSMASK) != 0);
2618 isc_task_attach(task, &(isc_task_t *){ NULL });
2619 find->event.ev_sender = task;
2620 find->event.ev_action = action;
The ADB previously used separate reference counters for internal
and external references, plus additional counters for ADB find
and namehook objects, and used all these counters to coordinate
its shutdown process, which was a multi-stage affair involving
a sequence of control events.
It also used a complex interlocking set of static functions for
referencing, dereferencing, linking, unlinking, and cleaning up various
internal objects; these functions returned boolean values to their
callers to indicate what additional processing was needed.
The changes in the previous two commits destabilized this fragile
system in a way that was difficult to recover from, so in this commit
we refactor all of it. The dns_adb and dns_adbentry objects now use
conventional attach and detach functions for reference counting, and
the shutdown process is much more straightforward. Instead of
handling shutdown asynchronously, we can just destroy the ADB when
references reach zero.
In addition, ADB locking has been simplified. Instead of a
single `find_{name,entry}_and_lock()` function which searches for
a name or entry's hash bucket, locks it, and then searches for the
name or entry in the bucket, we now use one function to find the
bucket (leaving it to the caller to do the locking) and another
to find the name or entry. Instead of locking the entire ADB when
modifying hash tables, we now use read-write locks around the
specific hash table. The only remaining need for adb->lock
is when modifying the `whenshutdown` list.
Comments throughout the module have been improved.
Replace adb->{names,entries} and related arrays (indexed by hashed
bucket) with isc_ht hash tables storing the new struct
adb{name,entry}bucket_t, which wraps all the variables that were
originally stored in arrays indexed by "bucket" number directly
in struct dns_adb.
Previously, the task exclusive mode was used to grow the internal
arrays used to store the names and entries objects. The isc_ht hash
tables are now protected by an isc_rwlock instead, and thus the usage of
the task exclusive mode has been removed from dns_adb.
Co-authored-by: Ondřej Surý <ondrej@isc.org>
the use of "result" as a variable name for a boolean return value
was confusing; all 'result' variables that are not isc_result_t
have been renamed to 'ret'.
The static function print_dns_name() was a duplicate of
dns_name_print(), so it has been replaced with that.
Changed INSIST to REQUIRE where appropriate, and added NULL
initialization for pointer variables.
Historically, the inline keyword was a strong suggestion to the compiler
that it should inline the function marked inline. As compilers became
better at optimising, this functionality has receded, and using inline
as a suggestion to inline a function is obsolete. The compiler will
happily ignore it and inline something else entirely if it finds that's
a better optimisation.
Therefore, remove all occurrences of the inline keyword on static
functions inside a single compilation unit and leave the decision whether
to inline a function entirely to the compiler.
NOTE: We keep the usage of the inline keyword when the purpose is to
change the linkage behaviour.
This commit converts the license handling to adhere to the REUSE
specification. It specifically:
1. Adds used licenses to the LICENSES/ directory
2. Adds the "isc" template for adding the copyright boilerplate
3. Changes all source files to include copyright and SPDX license
headers; this includes all the C sources, documentation, zone files,
and configuration files. There are notes in the doc/dev/copyrights file
on how to add correct headers to new files.
4. Handles the rest that can't be modified via the .reuse/dep5 file. The
binary (or otherwise unmodifiable) files could have licenses placed
next to them in <foo>.license files, but this would lead to a cluttered
repository, and most of the files handled in the .reuse/dep5 file are
system test files.
When dns_adb is shutting down, first the adb->shutting_down flag is set
and then a task is created that runs shutdown_stage2(), which sets the
shutdown flag on names and entries. However, when dns_adb_createfind()
was called, only the individual shutdown flags were checked, and the
global adb->shutting_down flag was not. Because of that, it was
possible for a different thread to slip in and create a new find between
dns_adb_shutdown() and dns_adb_detach(), but before the
shutdown_stage2() task was complete. This was detected by
ThreadSanitizer as a data race because the zonetable might have been
already detached by the dns_view shutdown process and simultaneously
accessed by dns_adb_createfind().
This commit converts adb->shutting_down to atomic_bool to avoid
taking the global adb lock when creating the find.
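A rough sketch of the lock-free check, assuming a simplified stand-in structure (`sketch_adb_t`) and hypothetical function names; the real dns_adb structure and dns_adb_createfind() are considerably more involved.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Simplified stand-in for the ADB object. */
typedef struct sketch_adb {
	atomic_bool shutting_down;
	unsigned int finds_created;
} sketch_adb_t;

/* Set the global flag; no lock is needed because the flag is atomic. */
static void
sketch_adb_shutdown(sketch_adb_t *adb) {
	atomic_store(&adb->shutting_down, true);
}

/*
 * Refuse to create a find once shutdown has started. Returns true if
 * a find was created. The counter stands in for the real work and is
 * not itself thread-safe in this sketch.
 */
static bool
sketch_adb_createfind(sketch_adb_t *adb) {
	if (atomic_load(&adb->shutting_down)) {
		return false;
	}
	adb->finds_created++;
	return true;
}
```

Note that an atomic flag only makes the check itself race-free; it does not close the window between the check and the work that follows it.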
The NAME_FETCH_A and NAME_FETCH_AAAA macros were meant to be
boolean, indicating whether the pointers were set or not, while
the NAME_FETCH_V4 and NAME_FETCH_V6 macros were meant to return
the pointer values. The latter were only used as booleans, so
they've been removed in favor of the former.
Also did some style cleanup and removed an unreachable code block.
Remove the dynamic registration of result codes. Convert isc_result_t
from unsigned + #defines into 32-bit enum type in grand unified
<isc/result.h> header. Keep the existing values of the result codes
even at the expense of the description and identifier tables being
unnecessarily large.
Additionally, add a couple of:
    switch (result) {
    [...]
    default:
        break;
    }
statements where compiler now complains about missing enum values in the
switch statement.
- The `timeout_action` parameter to dns_dispatch_addresponse() has been
replaced with a netmgr callback that is called when a dispatch read
times out. this callback may optionally reset the read timer and
resume reading.
- Added a function to convert isc_interval to milliseconds; this is used
to translate fctx->interval into a value that can be passed to
dns_dispatch_addresponse() as the timeout.
- Note that netmgr timeouts are accurate to the millisecond, so code to
check whether a timeout has been reached cannot rely on microsecond
accuracy.
- If serve-stale is configured, then a timeout received by the resolver
may trigger it to return stale data, and then resume waiting for the
read timeout. this is no longer based on a separate stale timer.
- The code for canceling requests in request.c has been altered so that
it can run asynchronously.
- TCP timeout events apply to the dispatch, which may be shared by
multiple queries. since in the event of a timeout we have no query ID
to use to identify the resp we wanted, we now just send the timeout to
the oldest query that was pending.
- There was some additional refactoring in the resolver: combining
fctx_join() and fctx_try_events() into one function to reduce code
duplication, and using fixednames in fetchctx and fetchevent.
- Incidental fix: new_adbaddrinfo() can't return NULL anymore, so the
code can be simplified.
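The interval conversion mentioned in the list above might look roughly like the following. The structure and function names are hypothetical; the sketch assumes an interval stored as whole seconds plus nanoseconds.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for an interval: seconds plus nanoseconds. */
typedef struct sketch_interval {
	unsigned int seconds;
	unsigned int nanoseconds;
} sketch_interval_t;

/*
 * Convert to whole milliseconds. Anything below one millisecond is
 * truncated, which is why callers cannot rely on finer accuracy.
 */
static uint64_t
sketch_interval_ms(const sketch_interval_t *i) {
	return (uint64_t)i->seconds * 1000U + i->nanoseconds / 1000000U;
}
```

An interval of 2 seconds and 500,000,000 nanoseconds converts to 2500 ms, while an interval of 999,999 nanoseconds converts to 0 ms.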
When the isc_mem water settings were changed, the old water_t structure
could be used after free. Instead of introducing reference counting on
the hot path, we introduce additional constraints on
isc_mem_setwater(). Once it has been set for the first time, additional
calls have to be made with the same water and water_arg arguments.
The proper way to disable the water limit in the isc_mem context is
to call:
isc_mem_setwater(ctx, NULL, NULL, 0, 0);
this ensures that the old water callback is called with ISC_MEM_LOWATER
if the callback was called with ISC_MEM_HIWATER before.
Historically, there were some places where the limits were disabled by
calling:
isc_mem_setwater(ctx, water, water_arg, 0, 0);
which would also call the old callback, but it also causes the water_t
to be allocated and an extra check to be executed because the water
callback is not NULL.
This commit unifies the calls to disable water to the preferred form.
Current mempools are kind of hybrid structures - they serve two
purposes:
1. a mempool with a lock is basically a statically sized allocator with
pre-allocated free items
2. a mempool without a lock is a doubly-linked list of preallocated items
The first kind of usage could be easily replaced with jemalloc small
sized arena objects and thread-local caches.
The second kind of usage cannot be replaced as easily, and we need to
keep it (in libdns:message.c) for performance reasons.
The DNS Flag Day 2020 aims to remove the IP fragmentation problem from
the UDP DNS communication. In this commit, we implement the required
changes and simplify the logic for picking the EDNS Buffer Size.
1. The defaults for `edns-udp-size`, `max-udp-size` and
`nocookie-udp-size` have been changed to `1232` (the value picked by
DNS Flag Day 2020).
2. The probing heuristics that would try 512->4096->1432->1232 buffer
sizes have been removed, and the resolver will always use just the
`edns-udp-size` value.
3. Instead of just disabling the PMTUD mechanism on the UDP sockets, we
now set the IP_DONTFRAG (IPV6_DONTFRAG) flag. That means that UDP
packets will never be fragmented. If the ICMP packets are lost, the
UDP query will just time out and eventually be retried over TCP.
There were several problems with the rbt hashtable implementation:
1. Our internal hashing function returns uint64_t value, but it was
silently truncated to unsigned int in dns_name_hash() and
dns_name_fullhash() functions. As the SipHash 2-4 higher bits are
more random, we need to use the upper half of the return value.
2. The hashtable implementation in rbt.c was using modulo to pick the
slot number for the hash table. This has several problems because
modulo is: a) slow, b) oblivious to patterns in the input data. This
could lead to very uneven distribution of the hashed data in the
hashtable. Combined with the single-linked lists we use, it could
really bog down the lookup and removal of the nodes from the rbt
tree[a]. Fibonacci Hashing is a much better fit for the hashtable
function here. For a longer description, read "Fibonacci Hashing: The
Optimization that the World Forgot"[b] or just look at the Linux
kernel. Also this will make Diego very happy :).
3. The hashtable would rehash every time the number of nodes in the rbt
tree would exceed 3 * (hashtable size). The overcommit will make the
uneven distribution in the hashtable even worse, but the main problem
lies in the rehashing - every time the database grows beyond the
limit, each subsequent rehashing will be much slower. The mitigation
here is letting the rbt know how big the cache can grow and
pre-allocating the hashtable to be big enough to never actually need to
rehash. This will consume more memory at the start, but since the
size of the hashtable is capped to `1 << 32` (i.e. about 4 billion
entries), it will only consume a maximum of 32GB of memory for the
hashtable in the worst case (and max-cache-size would need to be set to
more than 4TB). Calling dns_db_adjusthashsize() will also cap the maximum
size of the hashtable to the pre-computed number of bits, so it won't
try to consume more gigabytes of memory than available for the
database.
FIXME: What is the average size of the rbt node that gets hashed? I
chose the pagesize (4k) as initial value to precompute the size of
the hashtable, but the value is based on feeling and not any real
data.
For future work, there are more places where we use result of the hash
value modulo some small number and that would benefit from Fibonacci
Hashing to get better distribution.
Notes:
a. A doubly linked list should be used here to speedup the removal of
the entries from the hashtable.
b. https://probablydance.com/2018/06/16/fibonacci-hashing-the-optimization-that-the-world-forgot-or-a-better-alternative-to-integer-modulo/
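As an illustration of points 1 and 2, a 32-bit Fibonacci hashing step can be sketched as below. The names are hypothetical and the multiplier is the classic floor(2^32 / golden ratio) constant; the constant and exact form used in rbt.c may differ.

```c
#include <assert.h>
#include <stdint.h>

/* floor(2^32 / golden ratio); an assumed, commonly used multiplier. */
#define SKETCH_GOLDEN_RATIO_32 0x9E3779B9U

/* Keep the upper half of a 64-bit hash value instead of truncating. */
static uint32_t
sketch_hash_upper(uint64_t hash64) {
	return (uint32_t)(hash64 >> 32);
}

/*
 * Map a 32-bit hash to a slot in a table of (1 << bits) entries.
 * The multiplication mixes all input bits into the top bits, and the
 * shift keeps only those. 'bits' must be in the range 1..32.
 */
static uint32_t
sketch_fib_slot(uint32_t hash, unsigned int bits) {
	return (uint32_t)(hash * SKETCH_GOLDEN_RATIO_32) >> (32 - bits);
}
```

With a 256-slot table, the inputs 256 and 512 both land in slot 0 under `hash % 256`, whereas the multiplicative step places them in different slots. The multiply-and-shift also avoids the division that modulo requires.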
If there are more than 5 NS records for a zone, only perform a
maximum of 4 address lookups for all the name servers. This
limits the number of remote lookups performed for server
addresses at each level for a given query.
Due to the way the stdatomic.h shim is implemented on Windows, MSVC
always thinks that the outside type is the largest one, atomic_(u)int_fast64_t.
This can lead to false positives such as this one:
lib\dns\adb.c(3678): warning C4477: 'fprintf' : format string '%u' requires an argument of type 'unsigned int', but variadic argument 2 has type 'unsigned __int64'
We work around the issue by first loading the value into a scoped local
variable of the correct type.
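The workaround pattern can be sketched as follows, with a hypothetical helper that formats into a buffer instead of calling fprintf() on a file:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdio.h>

/*
 * Load the atomic into a plain local whose type matches the "%u"
 * conversion, then pass the local to the variadic function.
 */
static int
sketch_format_quota(char *buf, size_t len, atomic_uint_fast32_t *quota) {
	unsigned int value = (unsigned int)atomic_load(quota);

	return snprintf(buf, len, "%u", value);
}
```

Passing the result of the atomic load directly as a variadic argument leaves its type up to the atomics implementation; the intermediate local makes the argument type explicit.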
Both clang-tidy and uncrustify choke on statements like this:
    for (...)
        if (...)
            break;
This commit uses a very simple semantic patch (below) to add braces around such
statements.
Semantic patch used:
@@
statement S;
expression E;
@@
while (...)
- if (E) S
+ { if (E) { S } }
@@
statement S;
expression E;
@@
for (...;...;...)
- if (E) S
+ { if (E) { S } }
@@
statement S;
expression E;
@@
if (...)
- if (E) S
+ { if (E) { S } }