When we are recursing, RPZ processing is not allowed. But when we are
performing a lookup due to "stale-answer-client-timeout", we are still
recursing. This effectively means that RPZ processing is disabled on
such a lookup.
In this case, bail out of the "stale-answer-client-timeout" lookup and wait
for recursion to complete, as we can't apply the RPZ rewrite
rules reliably.
The dboption DNS_DBFIND_STALEONLY caused confusion because it implies
we are looking for stale data **only** and ignore any active RRsets in
the cache. Rename it to DNS_DBFIND_STALETIMEOUT, which makes it clearer that
the option is related to a lookup due to "stale-answer-client-timeout".
Rename other usages of "staleonly" as well, using "lookup due to..." instead.
Also rename related function and variable names.
When doing a staleonly lookup we don't want to fall back to recursion.
After all, there are obviously problems with recursion, otherwise we
wouldn't do a staleonly lookup.
When resuming from recursion however, we should restore the
RECURSIONOK flag, allowing future required lookups for this client
to recurse.
When implementing "stale-answer-client-timeout", we decided that
we should only return positive answers prematurely to clients. A
negative response is not useful, and in that case it is better to
wait for the recursion to complete.
To do so, we check the result and if it is not ISC_R_SUCCESS, we
decide that it is not good enough. However, there are more return
codes that could lead to a positive answer (e.g. CNAME chains).
This commit removes the exception and now uses the same logic that
other stale lookups use to determine if we found a useful stale
answer (stale_found == true).
This means we can simplify two test cases in the serve-stale system
test: nodata.example is no longer treated differently than data.example.
The NS_QUERYATTR_ANSWERED attribute is to prevent sending a response
twice. Without the attribute, this may happen if a staleonly lookup
found a useful answer and sends a response to the client, and later
recursion ends and also tries to send a response.
The attribute was also used to mask adding a duplicate RRset. This is
considered harmful. When we created a response to the client with a
staleonly lookup (regardless of whether we actually sent the response),
we should clear the rdatasets that were added during that lookup.
Mark such rdatasets with a new attribute,
DNS_RDATASETATTR_STALE_ADDED. Set a query attribute
NS_QUERYATTR_STALEOK if we may have added rdatasets during a stale
only lookup. Before creating a response on a normal lookup, check if
we can expect rdatasets to have been added during a staleonly lookup.
If so, clear the rdatasets from the message with the attribute
DNS_RDATASETATTR_STALE_ADDED set.
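A minimal sketch of the tagging and clearing described above, assuming a message can be modelled as a flat array of rdataset attribute words. The SK_* bit values and sk_* types are hypothetical stand-ins for DNS_RDATASETATTR_STALE_ADDED, NS_QUERYATTR_STALEOK and dns_message_t; the real code walks message sections and name lists instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit values; the real ones live in libdns/libns headers. */
#define SK_RDATASETATTR_STALE_ADDED 0x01u
#define SK_QUERYATTR_STALEOK        0x01u

#define SK_MAX_RDATASETS 8

/* Hypothetical, simplified message: a flat list of rdataset attributes. */
typedef struct {
	uint32_t rdatasets[SK_MAX_RDATASETS];
	size_t count;
	uint32_t query_attributes;
} sk_message_t;

/* During a staleonly lookup, tag every rdataset added to the message and
 * remember that such rdatasets may exist. */
static void
sk_add_stale_rdataset(sk_message_t *msg, uint32_t attrs) {
	assert(msg->count < SK_MAX_RDATASETS);
	msg->rdatasets[msg->count++] = attrs | SK_RDATASETATTR_STALE_ADDED;
	msg->query_attributes |= SK_QUERYATTR_STALEOK;
}

/* Before building a response on a normal lookup, drop the rdatasets the
 * staleonly lookup added, so fresh data is not added as a duplicate.
 * Returns the number of rdatasets left in the message. */
static size_t
sk_clear_stale_added(sk_message_t *msg) {
	size_t kept = 0;

	if ((msg->query_attributes & SK_QUERYATTR_STALEOK) == 0) {
		return (msg->count); /* nothing can have been added */
	}
	for (size_t i = 0; i < msg->count; i++) {
		if ((msg->rdatasets[i] & SK_RDATASETATTR_STALE_ADDED) == 0) {
			msg->rdatasets[kept++] = msg->rdatasets[i];
		}
	}
	msg->count = kept;
	return (kept);
}
```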
With stale-answer-client-timeout, we may send a response to the client,
but we may want to hold on to the network manager handle, because
recursion is going on in the background, or we need to refresh a
stale RRset.
Simplify the setting of 'nodetach':
* During a staleonly lookup we should not detach the nmhandle, so just
set it prior to 'query_lookup()'.
* During a staleonly "stalefirst" lookup set the 'nodetach' to true
if we are going to refresh the RRset.
Now there is no longer a need to clear the 'nodetach' if we go
through the "dbfind_stale", "stale_refresh_window", or "stale_only"
paths.
When doing a staleonly lookup, ignore active RRsets from cache. If we
don't, we may add a duplicate RRset to the message, and hit an
assertion failure in query.c because adding the duplicate RRset to the
ANSWER section failed.
This can happen due to a race condition. When a client query is received,
the recursion is started. When 'stale-answer-client-timeout' triggers
around the same time the recursion completes, the following sequence
of events may happen:
1. Queue the "try stale" fetch_callback() event to the client task.
2. Add the RRsets from the authoritative response to the cache.
3. Queue the "fetch complete" fetch_callback() event to the client task.
4. Execute the "try stale" fetch_callback(), which retrieves the
just-inserted RRset from the database.
5. In "ns_query_done()" we are still recursing, but the "staleonly"
query attribute has already been cleared. In other words, the
query will resume when recursion ends (it already has ended but is
still on the task queue).
6. Execute the "fetch complete" fetch_callback(). It finds the answer
from recursion in the cache again and tries to add the duplicate to
the answer section.
This commit changes the logic for finding stale answers in the cache,
such that "stale_only" lookups really only consider stale RRsets. It
also refactors the code so that the code paths for "dbfind_stale",
"stale_refresh_window", and "stale_only" are clearer.
First we call some generic code that applies in all three cases:
formatting the domain name for logging purposes, incrementing the
trystale stats, and checking whether we actually found stale data that
we can use.
The "dbfind_stale" lookup will return SERVFAIL if we didn't find a
usable answer; otherwise we will continue with the lookup
(query_gotanswer()). This is no different from before the introduction of
"stale-answer-client-timeout" and "stale-refresh-time".
The "stale_refresh_window" lookup is similar to the "dbfind_stale"
lookup: return SERVFAIL if we didn't find a usable answer, otherwise
continue with the lookup (query_gotanswer()).
Finally the "stale_only" lookup.
If the "stale_only" lookup was triggered because of an actual client
timeout (stale-answer-client-timeout > 0), and if the database lookup
returned a usable stale RRset, trigger a response to the client.
Otherwise return and wait until the recursion completes (or the
resolver query times out).
If the "stale_only" lookup is a "stale-answer-client-timeout 0" lookup,
we prefer stale data over a lookup. In this case, if there was no stale
data, or the data was not a positive answer, retry the lookup with the
stale options cleared, a.k.a. a normal lookup. Otherwise, continue
with the lookup (query_gotanswer()) and refresh the stale RRset. This
will trigger a response to the client, but will not detach the handle
because a fetch will be created to refresh the RRset.
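The decision table for the three paths can be condensed into one function. This is a sketch only: the enum names are hypothetical, and `usable` stands for "a usable stale answer was found" (for the stale-answer-client-timeout 0 case, that also means a positive answer, as described above).

```c
#include <assert.h>
#include <stdbool.h>

typedef enum {
	SK_PATH_DBFIND_STALE,
	SK_PATH_STALE_REFRESH_WINDOW,
	SK_PATH_STALE_ONLY
} sk_path_t;

typedef enum {
	SK_SERVFAIL,          /* no usable answer: fail the query */
	SK_GOTANSWER,         /* continue with query_gotanswer() */
	SK_RESPOND,           /* answer the client early */
	SK_GOTANSWER_REFRESH, /* answer now, keep handle, refresh RRset */
	SK_WAIT_RECURSION,    /* return; recursion (or its timeout) finishes */
	SK_RETRY_NORMAL       /* redo the lookup with stale options cleared */
} sk_action_t;

static sk_action_t
sk_stale_action(sk_path_t path, bool usable, bool client_timeout_zero) {
	switch (path) {
	case SK_PATH_DBFIND_STALE:
	case SK_PATH_STALE_REFRESH_WINDOW:
		return (usable ? SK_GOTANSWER : SK_SERVFAIL);
	case SK_PATH_STALE_ONLY:
		if (!client_timeout_zero) {
			/* Triggered by an actual client timeout. */
			return (usable ? SK_RESPOND : SK_WAIT_RECURSION);
		}
		/* stale-answer-client-timeout 0: prefer stale data. */
		return (usable ? SK_GOTANSWER_REFRESH : SK_RETRY_NORMAL);
	}
	assert(0 && "unreachable");
	return (SK_SERVFAIL);
}
```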
The stale-answer-client-timeout feature introduced a dependency on
when a client may be detached from the handle. The dboption
DNS_DBFIND_STALEONLY was reused to track this attribute. This overloads
the meaning of this database option, and actually introduced a bug
because the option was checked in other places. In particular, in
'ns_query_done()' there is a check for 'RECURSING(qctx->client) &&
(!QUERY_STALEONLY(&qctx->client->query) || ...' and the condition is
satisfied because recursion has not completed yet and
DNS_DBFIND_STALEONLY is already cleared by that time (in
query_lookup()), because we found a useful answer and we should detach
the client from the handle after sending the response.
Add a new boolean to the client structure to keep track of whether
detaching the client from the handle is allowed. It is only disallowed
if we are in a staleonly lookup and we didn't find a useful answer.
RFC 7828 specifies the keepalive interval as a 16-bit value in units
of 100 milliseconds, and the tcp-*-timeout configuration options
follow suit. Units of 100 milliseconds are very unintuitive, and
while we can't change the configuration and presentation format, we
should not follow this odd unit in the API.
This commit changes the isc_nm_(get|set)timeouts() functions to work
with milliseconds, and converts the values to milliseconds before passing
them to the functions, not just internally.
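The conversion itself is plain arithmetic: a configured value of 300 (in 100 ms units) becomes 30000 ms before it is handed to the netmgr API. The helper name below is hypothetical, and the 16-bit bound is an assumption taken from the RFC 7828 wire format.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a value in RFC 7828 / tcp-*-timeout units (100 ms) to
 * milliseconds. Hypothetical helper, not part of libisc. */
static uint32_t
sk_units100ms_to_ms(uint32_t units) {
	/* Assumes a 16-bit input, as on the wire; the product then always
	 * fits in 32 bits. */
	assert(units <= UINT16_MAX);
	return (units * 100U);
}
```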
When we query the resolver for a domain name that is in the same zone
for which one or more fetches are already outstanding, we could
potentially hit the fetch limits. If so, recursion fails immediately
for the incoming query, and if serve-stale is enabled, we may try to
return a stale answer.
If the resolver is also authoritative for the parent zone (for
example the root zone), a delegation is found first, but we then
check the cache for a better response.
Nothing is found in the cache, so we try to recurse to find the
answer to the query.
Because of fetch-limits, 'dns_resolver_createfetch()' returns an error,
which 'ns_query_recurse()' propagates to the caller,
'query_delegation_recurse()'.
Because serve-stale is enabled, 'query_usestale()' is called,
setting 'qctx->db' to the cache db, but leaving 'qctx->version'
untouched. Now 'query_lookup()' is called to search for stale data
in the cache database with a non-NULL 'qctx->version'
(which is set to a zone db version), and thus we hit an assertion
in rbtdb.
This crash was introduced in 'main' by commit
8bcd7fe69e5343071fc917738d6092a8b974ef3f.
- style, cleanup, and removal of unnecessary code.
- combined isc_nm_http_add_endpoint() and isc_nm_http_add_doh_endpoint()
into one function, renamed isc_http_endpoint().
- moved isc_nm_http_connect_send_request() into doh_test.c as a helper
function; removed it from the public API.
- renamed isc_http2 and isc_nm_http2 types and functions to just isc_http
and isc_nm_http, for consistency with other existing names.
- shortened a number of long names.
- the caller is now responsible for determining the peer address
in isc_nm_httpconnect(); this eliminates the need to parse the URI
and the dependency on an external resolver.
- the caller is also now responsible for creating the SSL client context,
for consistency with isc_nm_tlsdnsconnect().
- added setter functions for HTTP/2 ALPN. Instead of setting up ALPN in
isc_tlsctx_createclient(), we now have a function
isc_tlsctx_enable_http2client_alpn() that can be run from
isc_nm_httpconnect().
- refactored isc_nm_httprequest() into separate read and send functions.
When isc_nm_send() or isc_nm_read() is called on an http socket, the
call will be stored until a corresponding isc_nm_read() or _send() arrives;
when we have both halves of the pair, the HTTP request will be initiated.
- isc_nm_httprequest() is renamed isc__nm_http_request() for use as an
internal helper function by the DoH unit test. (eventually doh_test
should be rewritten to use read and send, and this function should
be removed.)
- added implementations of isc__nm_tls_settimeout() and
isc__nm_http_settimeout().
- increased NGHTTP2 header block length for client connections to 128K.
- use isc_mem_t for internal memory allocations inside nghttp2, to
help track memory leaks.
- send "Cache-Control" header in requests and responses. (note:
currently we try to bypass HTTP caching proxies, but ideally we should
interact with them: https://tools.ietf.org/html/rfc8484#section-5.1)
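The read/send pairing described in the list above can be sketched as a tiny state machine: whichever half arrives first is stored, and the request starts when the second half arrives. The struct and functions are hypothetical and do not reflect the real isc_nmsocket_t fields or the nghttp2 calls.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-socket state: the request starts only once both
 * halves (a send and a read) have been registered, in either order. */
typedef struct {
	const char *pending_send; /* payload waiting for a read */
	bool read_pending;        /* a read callback is waiting */
	int requests_started;
} sk_httpsock_t;

static void
sk_maybe_start(sk_httpsock_t *sock) {
	if (sock->pending_send != NULL && sock->read_pending) {
		/* Both halves present: this is where the HTTP/2 request
		 * would be submitted. */
		sock->requests_started++;
		sock->pending_send = NULL;
		sock->read_pending = false;
	}
}

static void
sk_send(sk_httpsock_t *sock, const char *payload) {
	assert(sock->pending_send == NULL);
	sock->pending_send = payload;
	sk_maybe_start(sock);
}

static void
sk_read(sk_httpsock_t *sock) {
	assert(!sock->read_pending);
	sock->read_pending = true;
	sk_maybe_start(sock);
}
```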
The value returned by pthread_self(), thrd_current() or
GetCurrentThreadId() could actually be a pointer, so we should convert
the value to uintptr_t instead of unsigned long.
When a staleonly lookup doesn't find a satisfying answer, it should
not try to respond to the client.
This is not true when the initial lookup is staleonly (that is when
'stale-answer-client-timeout' is set to 0), because no resolver fetch
has been created at this point. In this case continue with the lookup
normally.
Fix a crash that can happen in the following scenario:
A client request is received. There is no data for it in the cache
(not even stale data). A resolver fetch is created as part of
recursion.
Some time later, the fetch still hasn't completed, and
stale-answer-client-timeout is triggered. A staleonly lookup is
started. It will also find no data in the cache.
So 'query_lookup()' will call 'query_gotanswer()' with ISC_R_NOTFOUND,
which calls 'query_notfound()', which in turn starts recursion.
We will eventually end up in 'ns_query_recurse()' and that requires
the client query fetch to be NULL:
REQUIRE(client->query.fetch == NULL);
If the previously started fetch is still running this assertion
fails.
The crash is easily prevented by not requiring recursion for
staleonly lookups.
Also remove a redundant setting of the staleonly flag at the end of
'query_lookup_staleonly()' before destroying the query context.
Add a system test to catch this case.
On a 24-core machine, the tests would crash because we would run out of
hazard pointers. We now adjust the number of hazard pointers to be
in the [128, 256] interval based on the number of available cores.
Note: This is just a band-aid and needs a proper fix.
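The text does not say how the core count maps to a hazard pointer count, only that the result lands in [128, 256]. The sketch below assumes a hypothetical scaling factor of 8 per core, clamped to that interval, purely to illustrate the clamping.

```c
#include <assert.h>
#include <stddef.h>

#define SK_HP_MIN 128
#define SK_HP_MAX 256
/* Assumption: the per-core factor is illustrative, not taken from BIND. */
#define SK_HP_PER_CORE 8

static size_t
sk_hazard_pointer_count(size_t ncores) {
	assert(ncores > 0);
	size_t n = ncores * SK_HP_PER_CORE;

	if (n < SK_HP_MIN) {
		n = SK_HP_MIN;
	} else if (n > SK_HP_MAX) {
		n = SK_HP_MAX;
	}
	return (n);
}
```

With this assumed factor, 4 cores yield the 128 minimum, 24 cores yield 192, and 64 cores are capped at 256.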
The 'query_usestale()' function was only called from
'query_gotanswer()' when an unexpected error occurred. This may have
been "quota reached", and thus in some cases we were returning
stale data on fetch-limits (if serve-stale was enabled, of course).
But we can also hit fetch-limits when recursing because we are
following a referral (in 'query_notfound()' and
'query_delegation_recurse()'). Here we should also check for using
stale data in case an error occurred.
Specifically don't check for using stale data when refetching a
zero TTL RRset from cache.
Move the setting of DNS_DBFIND_STALESTART into the 'query_usestale()'
function to avoid code duplication.
This commit completes the support for DNS-over-HTTP(S) built on top of
nghttp2 and plugs it into BIND. Support for both GET and POST
requests is present, as required by RFC8484.
Both encrypted (via TLS) and unencrypted HTTP/2 connections are
supported. The latter are mostly there for debugging/troubleshooting
purposes and for offloading encryption to third-party
software (as might be desirable in some environments to simplify TLS
certificate management).
If we did not attempt a fetch due to fetch-limits, we should not start
the stale-refresh-time window.
Introduce a new flag DNS_DBFIND_STALESTART to differentiate between
a resolver failure and an unexpected error. If we are resuming, this
indicates a resolver failure, so start the stale-refresh-time window;
otherwise don't start the stale-refresh-time window, but still fall
back to using stale data.
(This commit also wraps some docstrings to 80 characters width)
Before this change, BIND would only fall back to using stale data if
there was an actual attempt to resolve the query. Then on a timeout,
the stale data from cache would become eligible.
This commit changes this so that on any unexpected error stale data
becomes eligible (you would still have to have 'stale-answer-enable'
enabled, of course).
If there is no stale data, this may result in an error again, so don't
loop on stale data lookup attempts. If the DNS_DBFIND_STALEOK flag is
set, this means we already tried to look up stale data, so if that is
the case, don't use stale data again.
First of all, there was a flaw in the code related to the
'stale-refresh-time' option. If stale answers are enabled, and we
returned stale data, then it was assumed that it was because we were
in the 'stale-refresh-time' window. But now we could also have returned
stale data because of a 'stale-answer-client-timeout'. To fix this,
introduce an rdataset attribute DNS_RDATASETATTR_STALE_WINDOW to
indicate whether the stale cache entry was returned because the
'stale-refresh-time' window is active.
Second, remove the special case handling when the result is
DNS_R_NCACHENXRRSET. This can be handled more generically in the code
block that deals with stale data.
Putting all stale case handling in the code block that deals with
stale data makes the code easier to follow.
Update documentation to be more verbose and to match the new code
flow.
Both functions used the same code to allocate query context
buffers, which are used to store query results, so this shared portion
of code was extracted out to a new function, qctx_prepare_buffers.
Also, this commit uses qctx_init to initialize the query context within
the query_refresh_rrset function.
This commit allows a stale RRset to be used (if available) for responding
to a query, before an attempt is made to refresh an expired RRset, or
otherwise resolve an RRset that is unavailable in cache.
For that to work, a value of zero must be specified for
stale-answer-client-timeout statement.
To better understand the logic implemented, there are three flags being
used during database lookup and other parts of code that must be
understood:
. DNS_DBFIND_STALEOK: This flag is set when BIND fails to refresh an
RRset due to a timeout (resolver-query-timeout); its intent is to
try to look for stale data in cache as a fallback, but only if
stale answers are enabled in the configuration.
This flag is also used to activate stale-refresh-time window, since it
is the only way the database knows that a resolution has failed.
. DNS_DBFIND_STALEENABLED: This flag is used as a hint to the database
that it may use stale data. It is always set during query lookup if
stale answers are enabled, but only effectively used during
stale-refresh-time window. Also during this window, the resolver will
not try to resolve the query, in other words no attempt to refresh the
data in cache is made when the stale-refresh-time window is active.
. DNS_DBFIND_STALEONLY: This newly introduced flag is used when we want
stale data from the database, but not due to a failure in resolution;
it also doesn't require the stale-refresh-time window timer to be active.
As long as there is a stale RRset available, it should be returned.
It is mainly used in two situations:
1. When stale-answer-client-timeout timer is triggered: in that case
we want to know if there is stale data available to answer the
client.
2. When stale-answer-client-timeout value is set to zero: in that
case, we also want to know if there is some stale RRset available
to promptly answer the client.
We must also discern between three situations that may happen when
resolving a query after the addition of stale-answer-client-timeout
statement, and how to handle them:
1. Are we running query_lookup() due to stale-answer-client-timeout
timer being triggered?
In this case, we look for stale data, making use of the
DNS_DBFIND_STALEONLY flag. If a stale RRset is available, then
respond to the client with the data found and mark this query as
answered (query attribute NS_QUERYATTR_ANSWERED), so when the
fetch completes the client won't be answered twice.
We must also take care not to detach from the client, as a
fetch will still be running in the background; this is handled by the
following snippet:
    if (!QUERY_STALEONLY(&client->query)) {
            isc_nmhandle_detach(&client->reqhandle);
    }
This basically tests whether the DNS_DBFIND_STALEONLY flag is set,
which means we are here due to a stale-answer-client-timeout timer
expiration.
2. Are we running query_lookup() due to resolver-query-timeout being
triggered?
In this case, DNS_DBFIND_STALEOK flag will be set and an attempt
to look for stale data will be made.
As already explained, this flag is also used to activate the
stale-refresh-time window, as it means that we failed to refresh
an RRset due to a timeout.
It is OK in this situation to detach from the client, as the
fetch has already completed.
3. Are we running query_lookup() during the first time, looking for
a RRset in cache and stale-answer-client-timeout value is set to
zero?
In this case, if stale answers are enabled (probably), we must do
an initial database lookup with DNS_DBFIND_STALEONLY flag set, to
indicate to the database that we want stale data.
If we find an active RRset, proceed as normal, answer the client
and the query is done.
If we find a stale RRset, we respond to the client and mark the
query as answered, but don't detach from the client yet, as an
attempt to refresh the RRset will still be made by means of
the newly introduced function 'query_resolve'.
If no active or stale RRset is available, begin resolution as
usual.
The general logic behind the addition of this new feature works as
follows:
When a client query arrives, the basic path (query.c / ns_query_recurse)
was to create a fetch, waiting for completion in fetch_callback.
With the introduction of stale-answer-client-timeout, a new event of
type DNS_EVENT_TRYSTALE may invoke fetch_callback, whenever stale
answers are enabled and the fetch took longer than
stale-answer-client-timeout to complete.
When an event of type DNS_EVENT_TRYSTALE triggers fetch_callback, we
must ensure that the following happens:
1. Set up a new query context with the sole purpose of looking up
stale RRset data only; for that purpose a new flag,
'DNS_DBFIND_STALEONLY', was added for use in database lookups.
. If a stale RRset is found, mark the original client query as
answered (with a new query attribute named NS_QUERYATTR_ANSWERED),
so when the fetch completion event is received later, we avoid
answering the client twice.
. If a stale RRset is not found, cleanup and wait for the normal
fetch completion event.
2. In ns_query_done, we must change this part:
    /*
     * If we're recursing then just return; the query will
     * resume when recursion ends.
     */
    if (RECURSING(qctx->client)) {
            return (qctx->result);
    }
To this:
    if (RECURSING(qctx->client) &&
        !QUERY_STALEONLY(&qctx->client->query)) {
            return (qctx->result);
    }
Otherwise we would not proceed to answer the client if it happened
that a stale answer was found when looking up stale-only data.
When an event of type DNS_EVENT_FETCHDONE triggers fetch_callback, we
proceed as before, resuming the query, updating stats, etc., but a few
exceptions had to be added, the two most important of which are:
1. Before answering the client (ns_client_send), check that the query
wasn't already answered.
2. Before detaching a client, e.g.
isc_nmhandle_detach(&client->reqhandle), ensure that this is the
fetch completion event, and not the one triggered due to
stale-answer-client-timeout, so a correct call would be:
    if (!QUERY_STALEONLY(&client->query)) {
            isc_nmhandle_detach(&client->reqhandle);
    }
Other than these notes, comments were added to the code in an attempt to
make these updates easier to follow.
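The interplay of the two event types can be modelled with a hypothetical client struct: the TRYSTALE event may answer early but never detaches, and the FETCHDONE event answers only if nobody has yet and then detaches. The names below are illustrative and do not match the real fetch_callback() signature.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { SK_EVENT_TRYSTALE, SK_EVENT_FETCHDONE } sk_event_t;

/* Hypothetical client state. */
typedef struct {
	bool answered; /* plays the role of NS_QUERYATTR_ANSWERED */
	bool attached; /* still holding client->reqhandle */
	int responses; /* responses actually sent */
} sk_client_t;

static void
sk_fetch_callback(sk_client_t *c, sk_event_t ev, bool stale_found) {
	assert(c->attached);
	if (ev == SK_EVENT_TRYSTALE) {
		if (stale_found && !c->answered) {
			c->responses++;
			c->answered = true;
		}
		/* The fetch is still running: keep the handle. */
		return;
	}
	/* SK_EVENT_FETCHDONE: answer only if not answered already. */
	if (!c->answered) {
		c->responses++;
		c->answered = true;
	}
	c->attached = false; /* isc_nmhandle_detach(&client->reqhandle) */
}
```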
The BIND 9 libraries are considered to be internal only, and hence the
API and ABI change a lot. Keeping track of the API/ABI changes takes
time and is a complicated matter, as the safest way to make everything
stable would be to bump every library in the dependency chain. In theory,
if libns links with libdns, and a binary links with both, and we bump
the libdns SOVERSION but not the libns SOVERSION, the old libns might
be loaded by the binary, pulling in the old libdns alongside the new
libdns loaded by the binary itself. The situation gets even more
complicated when loading plugins that have been compiled against BIND 9
libraries a few versions old and are then dynamically loaded into named.
We are picking the safest option possible and usable for internal
libraries - instead of using -version-info that has only a weak link to
BIND 9 version number, we are using -release libtool option that will
embed the corresponding BIND 9 version number into the library name.
That means that instead of libisc.so.1701 (as an example) the library
will now be named libisc-9.17.10.so.
* Following the example set in 634bdfb16d8, the tlsdns netmgr
module now uses libuv and SSL primitives directly, rather than
opening a TLS socket which opens a TCP socket, as the previous
model was difficult to debug. Closes #2335.
* Remove the netmgr tls layer (we will have to re-add it for DoH)
* Add isc_tls API to wrap the OpenSSL SSL_CTX object into libisc
library; move the OpenSSL initialization/deinitialization from dstapi
needed for OpenSSL 1.0.x to the isc_tls_{initialize,destroy}()
* Add a couple of new shims needed for OpenSSL 1.0.x
* When LibreSSL is used, require at least version 2.7.0 that
has the best OpenSSL 1.1.x compatibility and auto init/deinit
* Enforce OpenSSL 1.1.x usage on Windows
* Added a TLSDNS unit test and implemented a simple TLSDNS echo
server and client.
These options were ancient or were made obsolete a long time ago, so it
is safe to remove them.
Also stop printing ancient options; they should be treated the same as
unknown options.
Removed options: lwres, geoip-use-ecs, sit-secret, use-ixfr,
acache-cleaning-interval, acache-enable, additional-from-auth,
additional-from-cache, allow-v6-synthesis, dnssec-enable,
max-acache-size, nosit-udp-size, queryport-pool-ports,
queryport-pool-updateinterval, request-sit, use-queryport-pool, and
support-ixfr.
When using the `unixtime` or `date` method to update the SOA serial,
`named` and `dnssec-signzone` would silently fall back to the `increment`
method to prevent the new serial number from being smaller than the old
serial number (using serial number arithmetic). Add a warning
message when such a fallback happens.
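The check behind the fallback is RFC 1982 serial number comparison. For example, with the `date` method a zone already bumped to 2021010105 would compute 2021010100 again on the same day, which is not "greater", so the increment method is used instead. The helper names and the warning text below are hypothetical; the real code lives in named and dnssec-signzone.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* RFC 1982 "greater than" for 32-bit serials: a > b if the forward
 * distance from b to a is non-zero and less than 2^31. */
static bool
sk_serial_gt(uint32_t a, uint32_t b) {
	uint32_t d = a - b; /* unsigned subtraction wraps modulo 2^32 */
	return (d != 0 && d < 0x80000000u);
}

/* 'candidate' is the value the unixtime/date method produced. */
static uint32_t
sk_next_serial(uint32_t old, uint32_t candidate, bool *fell_back) {
	assert(fell_back != NULL);
	if (sk_serial_gt(candidate, old)) {
		*fell_back = false;
		return (candidate);
	}
	/* Would not move the serial forward: fall back to incrementing,
	 * and tell the operator (message text is made up). */
	*fell_back = true;
	fprintf(stderr,
		"warning: serial %u is not greater than %u; "
		"falling back to increment\n",
		(unsigned)candidate, (unsigned)old);
	return (old + 1); /* wraps modulo 2^32 */
}
```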
This is a part of the work that intends to make the netmgr stable,
testable, maintainable and tested. It contains numerous changes to
the netmgr code and, unfortunately, it was not possible to split this
into smaller chunks, as the work here needs to be committed as a
whole.
NOTE: There's quite a lot of duplicated code between udp.c, tcp.c and
tcpdns.c, and it should be subject to refactoring in the future.
The changes that are included in this commit are listed here
(extensively, but not exclusively):
* The netmgr_test unit test was split into individual tests (udp_test,
tcp_test, tcpdns_test and newly added tcp_quota_test)
* The udp_test and tcp_test have been extended to allow programmatic
failures from the libuv API. Unfortunately, we can't use cmocka
mock() and will_return(), so we emulate the behaviour with #define and
by including the netmgr/{udp,tcp}.c source files directly.
* The netievents that we put on the nm queue have a variable number of
members; of these, the isc_nmsocket_t and isc_nmhandle_t always
need to be attached before enqueueing the netievent_<foo> and
detached after we have called the isc_nm_async_<foo>, to ensure that
the socket (handle) doesn't disappear between scheduling the event and
actually executing the event.
* Cancelling the in-flight TCP connection using libuv requires calling
uv_close() on the original uv_tcp_t handle, which just breaks too many
assumptions we have in the netmgr code. Instead of using uv_timer for
TCP connection timeouts, we use a platform-specific socket option.
* Fix the synchronization between {nm,async}_{listentcp,tcpconnect}
When isc_nm_listentcp() or isc_nm_tcpconnect() was called, it
waited for the socket to either end up with an error (that path was
fine) or to be listening or connected, using a condition variable and
mutex. Several things could happen:
0. everything is ok
1. the waiting thread would miss the SIGNAL() - because the enqueued
event would be processed faster than we could start WAIT()ing.
In case the operation would end up with an error, it would be ok, as
the error variable would be unchanged.
2. the waiting thread would miss sock->{connected,listening} being set
to `true`, because it would already have been set back to `false` in
tcp_{listen,connect}close_cb(), as the connection would be so
short-lived that the socket would be closed before we could even start
WAIT()ing
* The tcpdns has been converted to use libuv directly. Previously,
the tcpdns protocol used the tcp protocol from netmgr; this proved to be
very complicated to understand, fix and make changes to. The new
tcpdns protocol is modeled similarly to the tcp netmgr protocol.
Closes: #2194, #2283, #2318, #2266, #2034, #1920
* The tcp and tcpdns protocols no longer use isc_uv_import/isc_uv_export to
pass accepted TCP sockets between netthreads, but instead (similar to
UDP) use a per-netthread uv_loop listener. This greatly reduces the
complexity, as the socket is always run in the associated nm and uv
loops, and we are also not touching the libuv internals.
There's an unfortunate side effect, though: the new code requires
support for load-balanced sockets from the operating system for both
UDP and TCP (see #2137). If the operating system doesn't support
load-balanced sockets (either SO_REUSEPORT on Linux or SO_REUSEPORT_LB
on FreeBSD 12+), the number of netthreads is limited to 1.
* The netmgr now has two debugging #ifdefs:
1. The already existing NETMGR_TRACE prints any dangling nmsockets and
nmhandles before triggering an assertion failure. This option would
reduce performance when enabled, but in theory, it could be enabled
on low-performance systems.
2. A new NETMGR_TRACE_VERBOSE option has been added that enables
extensive netmgr logging, allowing the software engineer to
precisely track any attach/detach operations on the nmsockets and
nmhandles. This is not suitable for any kind of production
machine, only for debugging.
* The tlsdns netmgr protocol has been split from the tcpdns and it still
uses the old method of stacking the netmgr boxes on top of each other.
We will have to refactor the tlsdns netmgr protocol to use the same
approach - build the stack using only libuv and openssl.
* Limit, but do not assert, the tcp buffer size in tcp_alloc_cb
Closes: #2061
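The "Fix the synchronization" item above lists the races but not the fix. One conventional way to make such a wait robust is sketched below with POSIX threads: the waiter checks a predicate under the mutex (so an early signal is never lost), and the completion flag is separate from the connected/listening state (so a fast close cannot reset it). The struct and function names are hypothetical, and this is not necessarily how the netmgr code was changed. On older glibc it may need to be built with `-pthread`.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for the state shared by caller and netthread. */
typedef struct {
	pthread_mutex_t lock;
	pthread_cond_t cond;
	bool done;  /* set once by the worker, never reset by a close */
	int result; /* 0 = listening/connected, otherwise an error code */
} sk_sync_t;

static void *
sk_worker(void *arg) {
	sk_sync_t *s = arg;

	assert(s != NULL);
	/* Simulates the netthread finishing immediately, possibly before
	 * the caller has started waiting. */
	pthread_mutex_lock(&s->lock);
	s->result = 0;
	s->done = true;
	pthread_cond_signal(&s->cond);
	pthread_mutex_unlock(&s->lock);
	return (NULL);
}

/* Cannot miss the signal: the predicate is checked under the mutex. */
static int
sk_start_and_wait(void) {
	sk_sync_t s = { .done = false, .result = -1 };
	pthread_t thr;
	int result;

	pthread_mutex_init(&s.lock, NULL);
	pthread_cond_init(&s.cond, NULL);
	if (pthread_create(&thr, NULL, sk_worker, &s) != 0) {
		return (-1);
	}

	pthread_mutex_lock(&s.lock);
	while (!s.done) {
		pthread_cond_wait(&s.cond, &s.lock);
	}
	result = s.result;
	pthread_mutex_unlock(&s.lock);

	pthread_join(thr, NULL);
	pthread_cond_destroy(&s.cond);
	pthread_mutex_destroy(&s.lock);
	return (result);
}
```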
The return values of the dns_db_getservestalerefresh() and
dns_db_getservestalettl() functions were previously unhandled.
This commit purposefully ignores those return values, since there is
no side effect if those results are != ISC_R_SUCCESS; it also suppresses
Coverity warnings.
previously query plugins were strictly synchronous - the query
process would be interrupted at some point, data would be looked
up or a change would be made, and then the query processing would
resume immediately.
this commit enables query plugins to initiate asynchronous processes
and resume on a completion event, as with recursion.
several small changes to query processing to make it easier to
use hook-based recursion (and other asynchronous functionality)
later.
- recursion quota check is now a separate function,
check_recursionquota(), which is called by ns_query_recurse().
- pass isc_result to query_nxdomain() instead of bool.
the value of 'empty_wild' will be determined in the function
based on the passed result. this is similar to query_nodata(),
and makes the signatures of the two functions more consistent.
- pass the current 'result' value into plugin hooks.
Before this update, BIND would attempt to do a full recursive resolution
process for each query received if the requested rrset had its TTL
expired. If the resolution failed for any reason, only then would BIND
check for a stale rrset in cache (if 'stale-cache-enable' and
'stale-answer-enable' are on).
The problem with this approach is that if an authoritative server is
unreachable or is failing to respond, it is very unlikely that the
problem will be fixed within the next few seconds.
A better approach to improve performance in those cases is to mark the
moment at which a resolution failed, and if new queries arrive for that
same rrset, try to respond directly from the stale cache, and do that
for a window of time configured via 'stale-refresh-time'.
Only when this interval expires do we then try to do a normal refresh of
the rrset.
The logic behind this commit is as follows:
- In query.c / query_gotanswer(), if the test of 'result' variable falls
to the default case, an error is assumed to have happened, and a call
to 'query_usestale()' is made to check if serving of stale rrset is
enabled in configuration.
- If serving of stale answers is enabled, a flag will be turned on in
the query context to look for stale records:
query.c:6839
qctx->client->query.dboptions |= DNS_DBFIND_STALEOK;
- A call to query_lookup() will be made again; inside it a call to
'dns_db_findext()' is made, which in turn will invoke rbtdb.c /
cache_find().
- In rbtdb.c / cache_find() the important bit of this change is the
call to 'check_stale_header()', which is a function that yields true
if we should skip the stale entry, or false if we should consider it.
- In check_stale_header() we now check if the DNS_DBFIND_STALEOK option
is set; if that is the case, we know that this new search for stale
records was made due to a failure in a normal resolution, so we keep
track of the time at which the failure occurred in rbtdb.c:4559:
header->last_refresh_fail_ts = search->now;
- In check_stale_header(), if DNS_DBFIND_STALEOK is not set, then we
know this is a normal lookup; if the record is stale and the query
time is within the stale-refresh-time window following the last failure
time, then we return false so cache_find() knows it can consider this
stale rrset entry to return as a response.
The last additions are two new methods to the database interface:
- setservestale_refresh
- getservestale_refresh
Those were added so that rbtdb can be aware of the value set in the
configuration option, since at that level we have no access to the view object.
Parse the configuration of tls objects into SSL_CTX* objects. Listen on
DoT if the 'tls' option is set up in the listen-on directive. Use the
DoT/DoH ports for DoT/DoH.
This commit extends the perl Configure script to also check for libssl
in addition to libcrypto, and changes the vcxproj source files to link
with both libcrypto and libssl.