View matching on an incoming query checks the query's signature,
which can be a CPU-heavy task for a SIG(0)-signed message. Implement
an asynchronous mode of the view matching function which uses the
offloaded signature checking facilities, and use it for the incoming
queries.
Add support for using the offload threadpool to perform message
signature verifications. This should allow check SIG(0)-signed
messages without affecting the worker threads.
This is a tiny helper function which is used only once and can be
replaced with two function calls instead. Removing this makes
supporting asynchronous signature checking less complicated.
In order to protect from a malicious DNS client that sends many
queries with a SIG(0)-signed message, add a quota of simultaneously
running SIG(0) checks.
This protection can only help when named is using more than one worker
threads. For example, if named is running with the '-n 4' option, and
'sig0checks-quota 2;' is used, then named will make sure to not use
more than 2 workers for the SIG(0) signature checks in parallel, thus
leaving the other workers to serve the remaining clients which do not
use SIG(0)-signed messages.
That limitation is going to change when SIG(0) signature checks are
offloaded to "slow" threads in a future commit.
The 'sig0checks-quota-exempt' ACL option can be used to exempt certain
clients from the quota requirements using their IP or network addresses.
The 'sig0checks-quota-maxwait-ms' option is used to define a maximum
amount of time for named to wait for a quota to appear. If during that
time no new quota becomes available, named will answer to the client
with DNS_R_REFUSED.
By default we log a rekey failure on debug level. We should probably
change the log level to error. We make an exception for when the zone
is not loaded yet, it often happens at startup that a rekey is
run before the zone is fully loaded.
when signatures were not added because of too many types already
existing at a node, the diff was not being cleaned up; this led to
a memory leak being reported at shutdown.
Previously, the number of RR types for a single owner name was limited
only by the maximum number of the types (64k). As the data structure
that holds the RR types for the database node is just a linked list, and
there are places where we just walk through the whole list (again and
again), adding a large number of RR types for a single owner named with
would slow down processing of such name (database node).
Add a configurable limit to cap the number of the RR types for a single
owner. This is enforced at the database (rbtdb, qpzone, qpcache) level
and configured with new max-types-per-name configuration option that
can be configured globally, per-view and per-zone.
Previously, the number of RRs in the RRSets were internally unlimited.
As the data structure that holds the RRs is just a linked list, and
there are places where we just walk through all of the RRs, adding an
RRSet with huge number of RRs inside would slow down processing of said
RRSets.
Add a configurable limit to cap the number of the RRs in a single RRSet.
This is enforced at the database (rbtdb, qpzone, qpcache) level and
configured with new max-records-per-type configuration option that can
be configured globally, per-view and per-zone.
The changes in this MR prevent the memory used for sending the outgoing
TCP requests to spike so much. That strictly remove the extra need for
own memory context, and thus since we generally prefer simplicity,
remove the extra memory context with own jemalloc arenas just for the
outgoing send buffers.
The single TCP read can create as much as 64k divided by the minimum
size of the DNS message. This can clog the processing thread and trash
the memory allocator because we need to do as much as ~20k allocations in
a single UV loop tick.
Limit the number of the DNS messages processed in a single UV loop tick
to just single DNS message and limit the number of the outstanding DNS
messages back to 23. This effectively limits the number of pipelined
DNS messages to that number (this is the limit we already had before).
As a single thread can process only one TCP send at the time, we don't
really need a memory pool for the TCP buffers, but it's enough to have
a single per-loop (client manager) static buffer that's being used to
assemble the DNS message and then it gets copied into own sending
buffer.
In the future, this should get optimized by exposing the uv_try API
from the network manager, and first try to send the message directly
and allocate the sending buffer only if we need to send the data
asynchronously.
Constantly allocating, reallocating and deallocating 64K TCP send
buffers by 'ns_client' instances takes too much CPU time.
There is an existing mechanism to reuse the ns_clent_t structure
associated with the handle using 'isc_nmhandle_getdata/_setdata'
(see ns_client_request()), but it doesn't work with TCP, because
every time ns_client_request() is called it gets a new handle even
for the same TCP connection, see the comments in
streamdns_on_complete_dnsmessage().
To solve the problem, we introduce an array of available (unused)
TCP buffers stored in ns_clientmgr_t structure so that a 'client'
working via TCP can have a chance to reuse one (if there is one)
instead of allocating a new one every time.
When TCP client would not read the DNS message sent to them, the TCP
sends inside named would accumulate and cause degradation of the
service. Throttle the reading from the TCP socket when we accumulate
enough DNS data to be sent. Currently this is limited in a way that a
single largest possible DNS message can fit into the buffer.
This commit ensures that an HTTP endpoints set reference is stored in
a socket object associated with an HTTP/2 stream instead of
referencing the global set stored inside a listener.
This helps to prevent an issue like follows:
1. BIND is configured to serve DoH clients;
2. A client is connected and one or more HTTP/2 stream is
created. Internal pointers are now pointing to the data on the
associated HTTP endpoints set;
3. BIND is reconfigured - the new endpoints set object is created and
promoted to all listeners;
4. The old pointers to the HTTP endpoints set data are now invalid.
Instead referencing a global object that is updated on
re-configurations we now store a local reference which prevents the
endpoints set objects to go out of scope prematurely.
It was reported that HTTP/2 session might get closed or even deleted
before all async. processing has been completed.
This commit addresses that: now we are avoiding using the object when
we do not need it or specifically check if the pointers used are not
'NULL' and by ensuring that there is at least one reference to the
session object while we are doing incoming data processing.
This commit makes the code more resilient to such issues in the
future.
Replace the ISC_LIST based deadnodes implementation with isc_queue which
is wait-free and we don't have to acquire neither the tree nor node lock
to append nodes to the queue and the cleaning process can also
copy (splice) the list into a local copy without acquiring the list.
Currently, there's little benefit to this as we need to hold those
locks anyway, but in the future as we move to RCU based implementation,
this will be ready.
To align the cleaning with our event loop based model, remove the
hardcoded count for the node locks and use the number of the event loops
instead. This way, each event loop can have its own cleaning as part of
the process. Use uniform random numbers to spread the nodes evenly
between the buckets (instead of hashing the domain name).
Add an isc_queue implementation that hides the gory details of cds_wfcq
into more neat API. The same caveats as with cds_wfcq.
TODO: Add documentation to the API.
When the cache's memory context was in over memory state when the
cache was flushed it resulted in LRU cleaning removing newly entered
data in the new cache straight away until the old cache had been
destroyed enough to take it out of over memory state. When flushing
the cache create a new memory context for the new db to prevent this.
The `axfr_makedb()` didn't set the loop on the newly created database,
effectively killing delayed cleaning on such database. Move the
database creation into dns_zone API that knows all the gory details of
creating new database suitable for the zone.
The rdataslab.c was full of code like this:
length = raw[0] * 256 + raw[1];
and
count2 = *current2++ * 256;
count2 += *current2++;
Refactor code like this into peek_uint16() and get_uint16 macros
to prevent code repetition and possible mistakes when copy and
pasting the same code over and over.
As a side note for an entertainment of a careful reader of the commit
messages: The byte manipulation was changed from multiplication and
addition to shift with or.
The difference in the assembly looks like this:
MUL and ADD:
movzx eax, BYTE PTR [rdi]
movzx edi, BYTE PTR [rdi+1]
sal eax, 8
or edi, eax
SHIFT and OR:
movzx edi, WORD PTR [rdi]
rol di, 8
movzx edi, di
If the result and/or buffer is then being used after the macro call,
there's more differences in favor of the SHIFT+OR solution.
The detach function declaration in `ISC__REFCOUNT_TRACE_DECL` had an
returned an accidental implicit int. While not allowed since C99, it
became an error by default in GCC 14.
`ISC_REFCOUNT_TRACE_IMPL` and `ISC_REFCOUNT_STATIC_TRACE_IMPL` expanded
into the wrong macros, trying to declare it again with the wrong number
of parameters.
- duplicated question
- duplicated answer
- qtype as an answer
- two question types
- question names
- nsec3 bad owner name
- short record
- short question
- mismatching question class
- bad record owner name
- mismatched class in record
- mismatched KEY class
- OPT wrong owner name
- invalid RRSIG "covers" type
- UPDATE malformed delete type
- TSIG wrong class
- TSIG not the last record
there were TSAN error reports because of conflicting uses of
node->dirty and node->nsec, which were in the same qword.
this could be resolved by separating them, but we could also
make them into atomic values and remove some node locking.
The fix_iterator() function had a lot of bugs in it and while fixing
them, the number of corner cases and the complexity of the function
got out of hand. Rewrite the function with the following modifications:
The function now requires that the iterator is pointing to a leaf node.
This removes the cases we have to deal when the iterator was left on a
dead branch.
From the leaf node, pop up the iterator stack until we encounter the
branch where the offset point is before the point where the search key
differs. This will bring us to the right branch, or at the first
unmatched node, in which case we pop up to the parent branch. From
there it is easier to retrieve the predecessor.
Once we are at the right branch, all we have to do is find the right
twig (which is either the twig for the character at the position where
the search key differs, or the previous twig) and walk down from there
to the greatest leaf or, in case there is no good twig, get the
previous twig from the successor and get the greatest leaf from there.
If there is no previous twig to select in this branch, because every
leaf from this branch node is greater than the one we wanted, we need
to pop up the stack again and resume at the parent branch. This is
achieved by calling prevleaf().
Move the fix_iterator out of the loop and only call it when we found
a leaf node. This leaf node may be the wrong leaf node, but fix_iterator
should correct that.
Also, when we don't need to set the iterator, just get any leaf. We
only need to have a leaf for the qpkey_compare and the end result does
not matter if compare was against an ancestor leaf or any leaf below
that point.
An assertion failure would be triggered when sending the TCP data ends
after the TCP reading gets closed. Implement proper reference counting
for the isc_httpd object.
When searching for a requested name in dns_qp_lookup(), we may add
a leaf node to the QP chain, then subsequently determine that the
branch we were on was a dead end. When that happens, the chain can be
left holding a pointer to a node that is *not* an ancestor of the
requested name.
We correct for this by unwinding any chain links with an offset
value greater or equal to that of the node we found.
The code below the if/else construction could only be run if the 'if'
code path was taken. Move the code into the 'if' code block so that
it is more easier to read.
The high-water allows administrators to better tune the recursive
clients limit without having to to poll the statistics channel in high
rates to get this number.
Returning the value allows for better high-water tracking without
running into edge cases like the following:
0. The counter is at value X
1. Increment the value (X+1)
2. The value is decreased multiple times in another threads (X+1-Y)
3. Get the value (X+1-Y)
4. Update-if-greater misses the X+1 value which should have been the
high-water