2
0
mirror of https://gitlab.isc.org/isc-projects/bind9 synced 2025-08-28 21:17:54 +00:00

33814 Commits

Author SHA1 Message Date
Michał Kępień
7eb87270a4 Add CHANGES entry 2021-04-08 10:33:44 +02:00
Michał Kępień
d954e152d9 Free resources when gss_accept_sec_context() fails
Even if a call to gss_accept_sec_context() fails, it might still cause a
GSS-API response token to be allocated and left for the caller to
release.  Make sure the token is released before an early return from
dst_gssapi_acceptctx().
2021-04-08 10:33:44 +02:00
Ondřej Surý
3c5267cc5c Merge branch '2600-general-error-managed-keys-zone-dns_journal_compact-failed-no-more' into 'main'
Resolve "general: error: managed-keys-zone: dns_journal_compact failed: no more"

Closes #2600

See merge request isc-projects/bind9!4849
2021-04-07 19:28:39 +00:00
Mark Andrews
0174098aca Add CHANGES and release note for [GL #2600] 2021-04-07 21:02:10 +02:00
Mark Andrews
bb6f0faeed Check that upgrade of managed-keys.bind.jnl succeeded
Update the system to include a recoverable managed.keys journal created
with <size,serial0,serial1,0> transactions and test that it has been
updated as part of the start up process.
2021-04-07 20:27:22 +02:00
Mark Andrews
0fbdf189c7 Rewrite managed-key journal immediately
Both managed keys and regular zone journals need to be updated
immediately when a recoverable error is discovered.
2021-04-07 20:23:46 +02:00
Mark Andrews
83310ffd92 Update dns_journal_compact() to handle bad transaction headers
Previously, dns_journal_begin_transaction() could reserve the wrong
amount of space.  We now check that the transaction is internally
consistent when upgrading / downgrading a journal and we also handle the
bad transaction headers.
2021-04-07 20:23:46 +02:00
Mark Andrews
520509ac7e Compute transaction size based on journal/transaction type
previously the code assumed that it was a new transaction.
2021-04-07 20:20:57 +02:00
Mark Andrews
5a6112ec8f Use journal_write_xhdr() to write the dummy transaction header
Instead of journal_write(), use correct format call journal_write_xhdr()
to write the dummy transaction header which looks at j->header_ver1 to
determine which transaction header to write instead of always writing a
zero filled journal_rawxhdr_t header.
2021-04-07 20:18:44 +02:00
Ondřej Surý
81c5f5e6a8 Merge branch '2401-ISC_R_TIMEDOUT-is-recoverable' into 'main'
netmgr: Make it possible to recover from ISC_R_TIMEDOUT

Closes #2401

See merge request isc-projects/bind9!4845
2021-04-07 14:34:46 +00:00
Evan Hunt
5496e51a80 Add CHANGES note for GL #2401 2021-04-07 15:38:16 +02:00
Artem Boldariev
8da12738f1 Use T_CONNECT timeout constant for TCP tests (instead of 1 ms)
The netmgr_test would be failing on heavily loaded systems because the
connection timeout was set to 1 ms.  Use the global constant instead.
2021-04-07 15:37:10 +02:00
Evan Hunt
d2ea8f4245 Ensure dig lookup is detached on UDP connect failure
dig could hang when UDP connect failed due to a dangling lookup object.
2021-04-07 15:36:59 +02:00
Ondřej Surý
72ef5f465d Refactor async callbacks and fix the double tlsdnsconnect callback
The isc_nm_tlsdnsconnect() call could end up with two connect callbacks
called when the timeout fired and the TCP connection was aborted,
but the TLS handshake was not complete yet.  isc__nm_connecttimeout_cb()
forgot to clean up sock->tls.pending_req when the connect callback was
called with ISC_R_TIMEDOUT, leading to a second callback running later.

A new argument has been added to the isc__nm_*_failed_connect_cb and
isc__nm_*_failed_read_cb functions, to indicate whether the callback
needs to run asynchronously or not.
2021-04-07 15:36:59 +02:00
Ondřej Surý
58e75e3ce5 Skip long tls_tests in the CI
We already skip most of the recv_send tests in CI because they are
too timing-related to be run in overloaded environment.  This commit
adds a similar change to tls_test before we merge tls_test into
netmgr_test.
2021-04-07 15:36:59 +02:00
Artem Boldariev
340235c855 Prevent short TLS tests from hanging in case of errors
The tests in tls_test.c could hang in the event of a connect
error.  This commit allows the tests to bail out when such an
error occurs.
2021-04-07 15:36:59 +02:00
Evan Hunt
426c40c96d rearrange nm_teardown() to check correctness after shutting down
if a test failed at the beginning of nm_teardown(), the function
would abort before isc_nm_destroy() or isc_tlsctx_free() were reached;
we would then abort when nm_setup() was run for the next test case.
rearranging the teardown function prevents this problem.
2021-04-07 15:36:59 +02:00
Ondřej Surý
86f4872dd6 isc_nm_*connect() always return via callback
The isc_nm_*connect() functions were refactored to always return the
connection status via the connect callback instead of sometimes returning
the hard failure directly (for example, when the socket could not be
created, or when the network manager was shutting down).

This commit changes the connect functions in all the network manager
modules, and also makes the necessary refactoring changes in places
where the connect functions are called.
2021-04-07 15:36:59 +02:00
Evan Hunt
a70cd026df move UDP connect retries from dig into isc_nm_udpconnect()
dig previously ran isc_nm_udpconnect() three times before giving
up, to work around a freebsd bug that caused connect() to return
a spurious transient EADDRINUSE. this commit moves the retry code
into the network manager itself, so that isc_nm_udpconnect() no
longer needs to return a result code.
2021-04-07 15:36:59 +02:00
Ondřej Surý
ca12e25bb0 Use generic functions for reading and timers in TCP
The TCP module has been updated to use the generic functions from
netmgr.c instead of its own local copies.  This brings the module
mostly up to par with the TCPDNS and TLSDNS modules.
2021-04-07 15:36:59 +02:00
Ondřej Surý
7df8c7061c Fix and clean up handling of connect callbacks
Serveral problems were discovered and fixed after the change in
the connection timeout in the previous commits:

  * In TLSDNS, the connection callback was not called at all under some
    circumstances when the TCP connection had been established, but the
    TLS handshake hadn't been completed yet.  Additional checks have
    been put in place so that tls_cycle() will end early when the
    nmsocket is invalidated by the isc__nm_tlsdns_shutdown() call.

  * In TCP, TCPDNS and TLSDNS, new connections would be established
    even when the network manager was shutting down.  The new
    call isc__nm_closing() has been added and is used to bail out
    early even before uv_tcp_connect() is attempted.
2021-04-07 15:36:59 +02:00
Ondřej Surý
5a87c7372c Make it possible to recover from connect timeouts
Similarly to the read timeout, it's now possible to recover from
ISC_R_TIMEDOUT event by restarting the timer from the connect callback.

The change here also fixes platforms that missing the socket() options
to set the TCP connection timeout, by moving the timeout code into user
space.  On platforms that support setting the connect timeout via a
socket option, the timeout has been hardcoded to 2 minutes (the maximum
value of tcp-initial-timeout).
2021-04-07 15:36:58 +02:00
Ondřej Surý
33c00c281f Make it possible to recover from read timeouts
Previously, when the client timed out on read, the client socket would
be automatically closed and destroyed when the nmhandle was detached.
This commit changes the logic so that it's possible for the callback to
recover from the ISC_R_TIMEDOUT event by restarting the timer. This is
done by calling isc_nmhandle_settimeout(), which prevents the timeout
handling code from destroying the socket; instead, it continues to wait
for data.

One specific use case for multiple timeouts is serve-stale - the client
socket could be created with shorter timeout (as specified with
stale-answer-client-timeout), so we can serve the requestor with stale
answer, but keep the original query running for a longer time.
2021-04-07 15:36:58 +02:00
Ondřej Surý
0aad979175 Disable netmgr tests only when running under CI
The full netmgr test suite is unstable when run in CI due to various
timing issues.  Previously, we enabled the full test suite only when
CI_ENABLE_ALL_TESTS environment variable was set, but that went against
original intent of running the full suite when an individual developer
would run it locally.

This change disables the full test suite only when running in the CI and
the CI_ENABLE_ALL_TESTS is not set.
2021-04-07 15:36:58 +02:00
Matthijs Mekking
ad25ca8bc6 Merge branch '2608-stale-answer-client-timeout-default-off' into 'main'
Change default stale-answer-client-timeout to off

Closes #2608

See merge request isc-projects/bind9!4862
2021-04-07 12:45:48 +00:00
Matthijs Mekking
e443279bbf Change default stale-answer-client-timeout to off
Using "stale-answer-client-timeout" turns out to have unforeseen
negative consequences, and thus it is better to disable the feature
by default for the time being.
2021-04-07 14:10:31 +02:00
Diego dos Santos Fronza
e8313d91ea Merge branch '2582-threadsanitizer-data-race-lib-dns-zone-c-10272-7-in-zone_maintenance' into 'main'
Resolve "ThreadSanitizer: data race lib/dns/zone.c:10272:7 in zone_maintenance"

Closes #2582

See merge request isc-projects/bind9!4864
2021-04-07 12:05:05 +00:00
Diego Fronza
6e08307bc8 Resolve TSAN data race in zone_maintenance
Fix race between zone_maintenance and dns_zone_notifyreceive functions,
zone_maintenance was attempting to read a zone flag calling
DNS_ZONE_FLAG(zone, flag) while dns_zone_notifyreceive was updating
a flag in the same zone calling DNS_ZONE_SETFLAG(zone, ...).

The code reading the flag in zone_maintenance was not protected by the
zone's lock, to avoid a race the zone's lock is now being acquired
before an attempt to read the zone flag is made.
2021-04-07 12:04:01 +00:00
Michał Kępień
2e5a6ab7fc Merge branch '2579-enforce-a-run-time-limit-on-unit-test-binaries' into 'main'
Enforce a run time limit on unit test binaries

Closes #2579

See merge request isc-projects/bind9!4802
2021-04-07 09:46:40 +00:00
Michał Kępień
6bdd55a9b3 Enforce a run time limit on unit test binaries
When a unit test binary hangs, the GitLab CI job in which it is run is
stuck until its run time limit is exceeded.  Furthermore, it is not
trivial to determine which test(s) hung in a given GitLab CI job based
on its log.  To prevent these issues, enforce a run time limit on every
binary executed by the lib/unit-test-driver.sh script.  Use a timeout of
5 minutes for consistency with older BIND 9 branches, which employed
Kyua for running unit tests.  Report an exit code of 124 when the run
time limit is exceeded for a unit test binary, for consistency with the
"timeout" tool included in GNU coreutils.
2021-04-07 11:41:45 +02:00
Artem Boldariev
d1bb1b01b9 Merge branch '2611-doth-failure' into 'main'
Fix "doth" system test failure with SSL_ERROR_SYSCALL (5)

See merge request isc-projects/bind9!4863
2021-04-07 08:44:38 +00:00
Artem Boldariev
ee10948e2d Remove dead code which was supposed to handle TLS shutdowns nicely
Fixes Coverity issue CID 330954 (See #2612).
2021-04-07 11:21:08 +03:00
Artem Boldariev
e6062210c7 Handle buggy situations with SSL_ERROR_SYSCALL
See "BUGS" section at:

https://www.openssl.org/docs/man1.1.1/man3/SSL_get_error.html

It is mentioned there that when TLS status equals SSL_ERROR_SYSCALL
AND errno == 0 it means that underlying transport layer returned EOF
prematurely.  However, we are managing the transport ourselves, so we
should just resume reading from the TCP socket.

It seems that this case has been handled properly on modern versions
of OpenSSL. That being said, the situation goes in line with the
manual: it is briefly mentioned there that SSL_ERROR_SYSCALL might be
returned not only in a case of low-level errors (like system call
failures).
2021-04-07 11:21:08 +03:00
Mark Andrews
6b121171a5 Merge branch '2613-lib-dns-gen-is-not-deleted-on-make-clean' into 'main'
Resolve "lib/dns/gen is not deleted on make clean"

Closes #2613

See merge request isc-projects/bind9!4865
2021-04-07 07:18:53 +00:00
Mark Andrews
9c28df2204 remove lib/dns/gen when running 'make clean' 2021-04-07 08:06:49 +10:00
Matthijs Mekking
8556c7f854 Merge branch '2594-servestale-staleonly-recursion-race' into 'main'
Serve-stale "staleonly" recursion race condition

See merge request isc-projects/bind9!4859
2021-04-02 11:26:57 +00:00
Matthijs Mekking
3d3a6415f7 If RPZ config'd, bail stale-answer-client-timeout
When we are recursing, RPZ processing is not allowed. But when we are
performing a lookup due to "stale-answer-client-timeout", we are still
recursing. This effectively means that RPZ processing is disabled on
such a lookup.

In this case, bail the "stale-answer-client-timeout" lookup and wait
for recursion to complete, as we we can't perform the RPZ rewrite
rules reliably.
2021-04-02 10:02:40 +02:00
Matthijs Mekking
839df94190 Rename "staleonly"
The dboption DNS_DBFIND_STALEONLY caused confusion because it implies
we are looking for stale data **only** and ignore any active RRsets in
the cache. Rename it to DNS_DBFIND_STALETIMEOUT as it is more clear
the option is related to a lookup due to "stale-answer-client-timeout".

Rename other usages of "staleonly", instead use "lookup due to...".
Also rename related function and variable names.
2021-04-02 10:02:40 +02:00
Matthijs Mekking
3f81d79ffb Restore the RECURSIONOK attribute after staleonly
When doing a staleonly lookup we don't want to fallback to recursion.
After all, there are obviously problems with recursion, otherwise we
wouldn't do a staleonly lookup.

When resuming from recursion however, we should restore the
RECURSIONOK flag, allowing future required lookups for this client
to recurse.
2021-04-02 10:02:40 +02:00
Matthijs Mekking
aaed7f9d8c Remove result exception on staleonly lookup
When implementing "stale-answer-client-timeout", we decided that
we should only return positive answers prematurely to clients. A
negative response is not useful, and in that case it is better to
wait for the recursion to complete.

To do so, we check the result and if it is not ISC_R_SUCCESS, we
decide that it is not good enough. However, there are more return
codes that could lead to a positive answer (e.g. CNAME chains).

This commit removes the exception and now uses the same logic that
other stale lookups use to determine if we found a useful stale
answer (stale_found == true).

This means we can simplify two test cases in the serve-stale system
test: nodata.example is no longer treated differently than data.example.
2021-04-02 10:02:40 +02:00
Matthijs Mekking
e44bcc6f53 Add notes and changes for [#2594]
Pretty newsworthy.
2021-04-02 10:02:40 +02:00
Matthijs Mekking
3d5429f61f Remove INSIST on NS_QUERYATTR_ANSWERED
The NS_QUERYATTR_ANSWERED attribute is to prevent sending a response
twice. Without the attribute, this may happen if a staleonly lookup
found a useful answer and sends a response to the client, and later
recursion ends and also tries to send a response.

The attribute was also used to mask adding a duplicate RRset. This is
considered harmful. When we created a response to the client with a
stale only lookup (regardless if we actually have send the response),
we should clear the rdatasets that were added during that lookup.

Mark such rdatasets with the a new attribute,
DNS_RDATASETATTR_STALE_ADDED. Set a query attribute
NS_QUERYATTR_STALEOK if we may have added rdatasets during a stale
only lookup. Before creating a response on a normal lookup, check if
we can expect rdatasets to have been added during a staleonly lookup.
If so, clear the rdatasets from the message with the attribute
DNS_RDATASETATTR_STALE_ADDED set.
2021-04-02 09:15:07 +02:00
Matthijs Mekking
48b0dc159b Simplify when to detach the client
With stale-answer-client-timeout, we may send a response to the client,
but we may want to hold on to the network manager handle, because
recursion is going on in the background, or we need to refresh a
stale RRset.

Simplify the setting of 'nodetach':
* During a staleonly lookup we should not detach the nmhandle, so just
  set it prior to 'query_lookup()'.
* During a staleonly "stalefirst" lookup set the 'nodetach' to true
  if we are going to refresh the RRset.

Now there is no longer the need to clear the 'nodetach' if we go
through the "dbfind_stale", "stale_refresh_window", or "stale_only"
paths.
2021-04-02 09:14:09 +02:00
Matthijs Mekking
92f7a67892 Refactor stale lookups, ignore active RRsets
When doing a staleonly lookup, ignore active RRsets from cache. If we
don't, we may add a duplicate RRset to the message, and hit an
assertion failure in query.c because adding the duplicate RRset to the
ANSWER section failed.

This can happen on a race condition. When a client query is received,
the recursion is started. When 'stale-answer-client-timeout' triggers
around the same time the recursion completes, the following sequence
of events may happen:
1. Queue the "try stale" fetch_callback() event to the client task.
2. Add the RRsets from the authoritative response to the cache.
3. Queue the "fetch complete" fetch_callback() event to the client task.
4. Execute the "try stale" fetch_callback(), which retrieves the
   just-inserted RRset from the database.
5. In "ns_query_done()" we are still recursing, but the "staleonly"
   query attribute has already been cleared. In other words, the
   query will resume when recursion ends (it already has ended but is
   still on the task queue).
6. Execute the "fetch complete" fetch_callback(). It finds the answer
   from recursion in the cache again and tries to add the duplicate to
   the answer section.

This commit changes the logic for finding stale answers in the cache,
such that on "stale_only" lookups actually only stale RRsets are
considered. It refactors the code so that code paths for "dbfind_stale",
"stale_refresh_window", and "stale_only" are more clear.

First we call some generic code that applies in all three cases,
formatting the domain name for logging purposes, increment the
trystale stats, and check if we actually found stale data that we can
use.

The "dbfind_stale" lookup will return SERVFAIL if we didn't found a
usable answer, otherwise we will continue with the lookup
(query_gotanswer()). This is no different as before the introduction of
"stale-answer-client-timeout" and "stale-refresh-time".

The "stale_refresh_window" lookup is similar to the "dbfind_stale"
lookup: return SERVFAIL if we didn't found a usable answer, otherwise
continue with the lookup (query_gotanswer()).

Finally the "stale_only" lookup.

If the "stale_only" lookup was triggered because of an actual client
timeout (stale-answer-client-timeout > 0), and if database lookup
returned a stale usable RRset, trigger a response to the client.
Otherwise return and wait until the recursion completes (or the
resolver query times out).

If the "stale_only" lookup is a "stale-anwer-client-timeout 0" lookup,
preferring stale data over a lookup. In this case if there was no stale
data, or the data was not a positive answer, retry the lookup with the
stale options cleared, a.k.a. a normal lookup. Otherwise, continue
with the lookup (query_gotanswer()) and refresh the stale RRset. This
will trigger a response to the client, but will not detach the handle
because a fetch will be created to refresh the RRset.
2021-04-02 09:14:09 +02:00
Matthijs Mekking
fee164243f Keep track of allow client detach
The stale-answer-client-timeout feature introduced a dependancy on
when a client may be detached from the handle. The dboption
DNS_DBFIND_STALEONLY was reused to track this attribute. This overloads
the meaning of this database option, and actually introduced a bug
because the option was checked in other places. In particular, in
'ns_query_done()' there is a check for 'RECURSING(qctx->client) &&
(!QUERY_STALEONLY(&qctx->client->query) || ...' and the condition is
satisfied because recursion has not completed yet and
DNS_DBFIND_STALEONLY is already cleared by that time (in
query_lookup()), because we found a useful answer and we should detach
the client from the handle after sending the response.

Add a new boolean to the client structure to keep track of client
detach from handle is allowed or not. It is only disallowed if we are
in a staleonly lookup and we didn't found a useful answer.
2021-04-02 09:14:09 +02:00
Artem Boldariev
e7fe606020 Merge branch 'artem/tls-tests-and-fixes' into 'main'
TLS transport code refactoring and unit tests

See merge request isc-projects/bind9!4851
2021-04-01 15:41:52 +00:00
Artem Boldariev
fa062162a7 Fix crash (regression) in DIG when handling non-DoH responses
This commit fixes crash in dig when it encounters non-expected header
value. The bug was introduced at some point late in the last DoH
development cycle. Also, refactors the relevant code a little bit to
ensure better incoming data validation for client-side DoH
connections.
2021-04-01 17:31:29 +03:00
Artem Boldariev
11ed7aac5d TLS code refactoring, fixes and unit-tests
This commit fixes numerous stability issues with TLS transport code as
well as adds unit tests for it.
2021-04-01 17:31:29 +03:00
Ondřej Surý
01cd310407 Merge branch '2607-remove-custom-spnego' into 'main'
Remove custom ISC SPNEGO implementation

Closes #2607

See merge request isc-projects/bind9!4856
2021-04-01 14:14:00 +00:00
Ondřej Surý
66bd47a129 Add CHANGES and release note for GL #2607 2021-04-01 16:08:19 +02:00