2
0
mirror of https://gitlab.isc.org/isc-projects/bind9 synced 2025-08-31 14:35:26 +00:00
Commit Graph

33779 Commits

Author SHA1 Message Date
Matthijs Mekking
9af8caa733 Implement draft-vandijk-dnsop-nsec-ttl
The draft says that the NSEC(3) TTL must have the same TTL value
as the minimum of the SOA MINIMUM field and the SOA TTL. This was
always the intended behaviour.

Update the zone structure to also track the SOA TTL. Whenever we
use the MINIMUM value to determine the NSEC(3) TTL, use the minimum
of MINIMUM and SOA TTL instead.

There is no specific test for this, however two tests need adjusting
because otherwise they failed: They were testing for NSEC3 records
including the TTL. Update these checks to use 600 (the SOA TTL),
rather than 3600 (the SOA MINIMUM).
2021-04-13 11:26:26 +02:00
Matthijs Mekking
8ffb4b0a13 Merge branch '2289-cache-dump-stale-ttl-weird-values' into 'main'
Fix nonsensical stale TTL values in cache dump

Closes #2289

See merge request isc-projects/bind9!4799
2021-04-13 08:54:49 +00:00
Matthijs Mekking
a83c8cb0af Use stale TTL as RRset TTL in dumpdb
It is more intuitive to have the countdown 'max-stale-ttl' as the
RRset TTL, instead of 0 TTL. This information was already available
in a comment "; stale (will be retained for x more seconds", but
Support suggested to put it in the TTL field instead.
2021-04-13 09:48:20 +02:00
Matthijs Mekking
debee6157b Check staleness in bind_rdataset
Before binding an RRset, check the time and see if this record is
stale (or perhaps even ancient). Marking a header stale or ancient
happens only when looking up an RRset in cache, but binding an RRset
can also happen on other occasions (for example when dumping the
database).

Check the time and compare it to the header. If according to the
time the entry is stale, but not ancient, set the STALE attribute.
If according to the time is ancient, set the ANCIENT attribute.

We could mark the header stale or ancient here, but that requires
locking, so that's why we only compare the current time against
the rdh_ttl.

Adjust the test to check the dump-db before querying for data. In the
dumped file the entry should be marked as stale, despite no cache
lookup happened since the initial query.
2021-04-13 09:48:20 +02:00
Matthijs Mekking
2a5e0232ed Fix nonsensical stale TTL values in cache dump
When introducing change 5149, "rndc dumpdb" started to print a line
above a stale RRset, indicating how long the data will be retained.

At that time, I thought it should also be possible to load
a cache from file. But if a TTL has a value of 0 (because it is stale),
stale entries wouldn't be loaded from file. So, I added the
'max-stale-ttl' to TTL values, and adjusted the $DATE accordingly.

Since we actually don't have a "load cache from file" feature, this
is premature and is causing confusion at operators. This commit
changes the 'max-stale-ttl' adjustments.

A check in the serve-stale system test is added for a non-stale
RRset (longttl.example) to make sure the TTL in cache is sensible.

Also, the comment above stale RRsets could have nonsensical
values. A possible reason why this may happen is when the RRset was
marked a stale but the 'max-stale-ttl' has passed (and is actually an
RRset awaiting cleanup). This would lead to the "will be retained"
value to be negative (but since it is stored in an uint32_t, you would
get a nonsensical value (e.g. 4294362497).

To mitigate against this, we now also check if the header is not
ancient. In addition we check if the stale_ttl would be negative, and
if so we set it to 0. Most likely this will not happen because the
header would already have been marked ancient, but there is a possible
race condition where the 'rdh_ttl + serve_stale_ttl' has passed,
but the header has not been checked for staleness.
2021-04-13 09:48:20 +02:00
Mark Andrews
1941ce99d4 Merge branch '2622-command-line-option-l-not-shown-with-usage-message' into 'main'
Resolve "Command-line option -L not shown with usage message"

Closes #2622

See merge request isc-projects/bind9!4881
2021-04-13 01:33:28 +00:00
Mark Andrews
38449de93b Update named's usage description 2021-04-12 12:07:44 +10:00
Michał Kępień
b64af491bf Merge branch 'michal/add-placeholder-entries-to-CHANGES' into 'main'
Add placeholders for GL #2467, GL #2540, GL #2604

See merge request isc-projects/bind9!4878
2021-04-08 11:10:54 +00:00
Michał Kępień
0874242db6 Add placeholders for GL #2467, GL #2540, GL #2604 2021-04-08 13:06:57 +02:00
Michał Kępień
b517108cbc Merge branch '2578-rework-get_ports.sh-to-make-it-not-use-a-lock-file' into 'main'
Rework get_ports.sh to make it not use a lock file

Closes #2578

See merge request isc-projects/bind9!4801
2021-04-08 09:37:51 +00:00
Michał Kępień
c3718b926b Use the same port selection method on all systems
When system tests are run on Windows, they are assigned port ranges that
are 100 ports wide and start from port number 5000.  This is a different
port assignment method than the one used on Unix systems.  Drop the "-p"
command line option from bin/tests/system/run.sh invocations used for
starting system tests on Windows to unify the port assignment method
used across all operating systems.
2021-04-08 11:12:37 +02:00
Michał Kępień
31e5ca4bd9 Rework get_ports.sh to make it not use a lock file
The get_ports.sh script is used for determining the range of ports a
given system test should use.  It first determines the start of the port
range to return (the base port); it can either be specified explicitly
by the caller or chosen randomly.  Subsequent ports are picked
sequentially, starting from the base port.  To ensure no single port is
used by multiple tests, a state file (get_ports.state) containing the
last assigned port is maintained by the script.  Concurrent access to
the state file is protected by a lock file (get_ports.lock); if one
instance of the script holds the lock file while another instance tries
to acquire it, the latter retries its attempt to acquire the lock file
after sleeping for 1 second; this retry process can be repeated up to 10
times before the script returns an error.

There are some problems with this approach:

  - the sleep period in case of failure to acquire the lock file is
    fixed, which leads to a "thundering herd" type of problem, where
    (depending on how processes are scheduled by the operating system)
    multiple system tests try to acquire the lock file at the same time
    and subsequently sleep for 1 second, only for the same situation to
    likely happen the next time around,

  - the lock file is being locked and then unlocked for every single
    port assignment made, not just once for the entire range of ports a
    system test should use; in other words, the lock file is currently
    locked and unlocked 13 times per system test; this increases the
    odds of the "thundering herd" problem described above preventing a
    system test from getting one or more ports assigned before the
    maximum retry count is reached (assuming multiple system tests are
    run in parallel); it also enables the range of ports used by a given
    system test to be non-sequential (which is a rather cosmetic issue,
    but one that can make log interpretation harder than necessary when
    test failures are diagnosed),

  - both issues described above cause unnecessary delays when multiple
    system tests are started in parallel (due to high lock file
    contention among the system tests being started),

  - maintaining a state file requires ensuring proper locking, which
    complicates the script's source code.

Rework the get_ports.sh script so that it assigns non-overlapping port
ranges to its callers without using a state file or a lock file:

  - add a new command line switch, "-t", which takes the name of the
    system test to assign ports for,

  - ensure every instance of get_ports.sh knows how many ports all
    system tests which form the test suite are going to need in total
    (based on the number of subdirectories found in bin/tests/system/),

  - in order to ensure all instances of get_ports.sh work on the same
    global port range (so that no port range collisions happen), a
    stable (throughout the expected run time of a single system test
    suite) base port selection method is used instead of the random one;
    specifically, the base port, unless specified explicitly using the
    "-p" command line switch, is derived from the number of hours which
    passed since the Unix Epoch time,

  - use the name of the system test to assign ports for (passed via the
    new "-t" command line switch) as a unique index into the global
    system test range, to ensure all system tests use disjoint port
    ranges.
2021-04-08 11:12:37 +02:00
Michal Nowak
a2484f2673 Merge branch 'mnowak/fix-missing-fromhex.pl-in-out-of-tree' into 'main'
Move fromhex.pl script to bin/tests/system/

See merge request isc-projects/bind9!4875
2021-04-08 09:07:07 +00:00
Michal Nowak
cd0a34df1b Move fromhex.pl script to bin/tests/system/
The fromhex.pl script needs to be copied from the source directory to
the build directory before any test is run, otherwise the out-of-tree
fails to find it. Given that the script is used only in system test,
move it to bin/tests/system/.
2021-04-08 11:04:26 +02:00
Michał Kępień
e01b3ccfaa Merge branch '2620-free-resources-when-gss_accept_sec_context-fails' into 'main'
Free resources when gss_accept_sec_context() fails

Closes #2620

See merge request isc-projects/bind9!4873
2021-04-08 08:40:27 +00:00
Michał Kępień
7eb87270a4 Add CHANGES entry 2021-04-08 10:33:44 +02:00
Michał Kępień
d954e152d9 Free resources when gss_accept_sec_context() fails
Even if a call to gss_accept_sec_context() fails, it might still cause a
GSS-API response token to be allocated and left for the caller to
release.  Make sure the token is released before an early return from
dst_gssapi_acceptctx().
2021-04-08 10:33:44 +02:00
Ondřej Surý
3c5267cc5c Merge branch '2600-general-error-managed-keys-zone-dns_journal_compact-failed-no-more' into 'main'
Resolve "general: error: managed-keys-zone: dns_journal_compact failed: no more"

Closes #2600

See merge request isc-projects/bind9!4849
2021-04-07 19:28:39 +00:00
Mark Andrews
0174098aca Add CHANGES and release note for [GL #2600] 2021-04-07 21:02:10 +02:00
Mark Andrews
bb6f0faeed Check that upgrade of managed-keys.bind.jnl succeeded
Update the system to include a recoverable managed.keys journal created
with <size,serial0,serial1,0> transactions and test that it has been
updated as part of the start up process.
2021-04-07 20:27:22 +02:00
Mark Andrews
0fbdf189c7 Rewrite managed-key journal immediately
Both managed keys and regular zone journals need to be updated
immediately when a recoverable error is discovered.
2021-04-07 20:23:46 +02:00
Mark Andrews
83310ffd92 Update dns_journal_compact() to handle bad transaction headers
Previously, dns_journal_begin_transaction() could reserve the wrong
amount of space.  We now check that the transaction is internally
consistent when upgrading / downgrading a journal and we also handle the
bad transaction headers.
2021-04-07 20:23:46 +02:00
Mark Andrews
520509ac7e Compute transaction size based on journal/transaction type
previously the code assumed that it was a new transaction.
2021-04-07 20:20:57 +02:00
Mark Andrews
5a6112ec8f Use journal_write_xhdr() to write the dummy transaction header
Instead of journal_write(), use correct format call journal_write_xhdr()
to write the dummy transaction header which looks at j->header_ver1 to
determine which transaction header to write instead of always writing a
zero filled journal_rawxhdr_t header.
2021-04-07 20:18:44 +02:00
Ondřej Surý
81c5f5e6a8 Merge branch '2401-ISC_R_TIMEDOUT-is-recoverable' into 'main'
netmgr: Make it possible to recover from ISC_R_TIMEDOUT

Closes #2401

See merge request isc-projects/bind9!4845
2021-04-07 14:34:46 +00:00
Evan Hunt
5496e51a80 Add CHANGES note for GL #2401 2021-04-07 15:38:16 +02:00
Artem Boldariev
8da12738f1 Use T_CONNECT timeout constant for TCP tests (instead of 1 ms)
The netmgr_test would be failing on heavily loaded systems because the
connection timeout was set to 1 ms.  Use the global constant instead.
2021-04-07 15:37:10 +02:00
Evan Hunt
d2ea8f4245 Ensure dig lookup is detached on UDP connect failure
dig could hang when UDP connect failed due to a dangling lookup object.
2021-04-07 15:36:59 +02:00
Ondřej Surý
72ef5f465d Refactor async callbacks and fix the double tlsdnsconnect callback
The isc_nm_tlsdnsconnect() call could end up with two connect callbacks
called when the timeout fired and the TCP connection was aborted,
but the TLS handshake was not complete yet.  isc__nm_connecttimeout_cb()
forgot to clean up sock->tls.pending_req when the connect callback was
called with ISC_R_TIMEDOUT, leading to a second callback running later.

A new argument has been added to the isc__nm_*_failed_connect_cb and
isc__nm_*_failed_read_cb functions, to indicate whether the callback
needs to run asynchronously or not.
2021-04-07 15:36:59 +02:00
Ondřej Surý
58e75e3ce5 Skip long tls_tests in the CI
We already skip most of the recv_send tests in CI because they are
too timing-related to be run in overloaded environment.  This commit
adds a similar change to tls_test before we merge tls_test into
netmgr_test.
2021-04-07 15:36:59 +02:00
Artem Boldariev
340235c855 Prevent short TLS tests from hanging in case of errors
The tests in tls_test.c could hang in the event of a connect
error.  This commit allows the tests to bail out when such an
error occurs.
2021-04-07 15:36:59 +02:00
Evan Hunt
426c40c96d rearrange nm_teardown() to check correctness after shutting down
if a test failed at the beginning of nm_teardown(), the function
would abort before isc_nm_destroy() or isc_tlsctx_free() were reached;
we would then abort when nm_setup() was run for the next test case.
rearranging the teardown function prevents this problem.
2021-04-07 15:36:59 +02:00
Ondřej Surý
86f4872dd6 isc_nm_*connect() always return via callback
The isc_nm_*connect() functions were refactored to always return the
connection status via the connect callback instead of sometimes returning
the hard failure directly (for example, when the socket could not be
created, or when the network manager was shutting down).

This commit changes the connect functions in all the network manager
modules, and also makes the necessary refactoring changes in places
where the connect functions are called.
2021-04-07 15:36:59 +02:00
Evan Hunt
a70cd026df move UDP connect retries from dig into isc_nm_udpconnect()
dig previously ran isc_nm_udpconnect() three times before giving
up, to work around a freebsd bug that caused connect() to return
a spurious transient EADDRINUSE. this commit moves the retry code
into the network manager itself, so that isc_nm_udpconnect() no
longer needs to return a result code.
2021-04-07 15:36:59 +02:00
Ondřej Surý
ca12e25bb0 Use generic functions for reading and timers in TCP
The TCP module has been updated to use the generic functions from
netmgr.c instead of its own local copies.  This brings the module
mostly up to par with the TCPDNS and TLSDNS modules.
2021-04-07 15:36:59 +02:00
Ondřej Surý
7df8c7061c Fix and clean up handling of connect callbacks
Serveral problems were discovered and fixed after the change in
the connection timeout in the previous commits:

  * In TLSDNS, the connection callback was not called at all under some
    circumstances when the TCP connection had been established, but the
    TLS handshake hadn't been completed yet.  Additional checks have
    been put in place so that tls_cycle() will end early when the
    nmsocket is invalidated by the isc__nm_tlsdns_shutdown() call.

  * In TCP, TCPDNS and TLSDNS, new connections would be established
    even when the network manager was shutting down.  The new
    call isc__nm_closing() has been added and is used to bail out
    early even before uv_tcp_connect() is attempted.
2021-04-07 15:36:59 +02:00
Ondřej Surý
5a87c7372c Make it possible to recover from connect timeouts
Similarly to the read timeout, it's now possible to recover from
ISC_R_TIMEDOUT event by restarting the timer from the connect callback.

The change here also fixes platforms that missing the socket() options
to set the TCP connection timeout, by moving the timeout code into user
space.  On platforms that support setting the connect timeout via a
socket option, the timeout has been hardcoded to 2 minutes (the maximum
value of tcp-initial-timeout).
2021-04-07 15:36:58 +02:00
Ondřej Surý
33c00c281f Make it possible to recover from read timeouts
Previously, when the client timed out on read, the client socket would
be automatically closed and destroyed when the nmhandle was detached.
This commit changes the logic so that it's possible for the callback to
recover from the ISC_R_TIMEDOUT event by restarting the timer. This is
done by calling isc_nmhandle_settimeout(), which prevents the timeout
handling code from destroying the socket; instead, it continues to wait
for data.

One specific use case for multiple timeouts is serve-stale - the client
socket could be created with shorter timeout (as specified with
stale-answer-client-timeout), so we can serve the requestor with stale
answer, but keep the original query running for a longer time.
2021-04-07 15:36:58 +02:00
Ondřej Surý
0aad979175 Disable netmgr tests only when running under CI
The full netmgr test suite is unstable when run in CI due to various
timing issues.  Previously, we enabled the full test suite only when
CI_ENABLE_ALL_TESTS environment variable was set, but that went against
original intent of running the full suite when an individual developer
would run it locally.

This change disables the full test suite only when running in the CI and
the CI_ENABLE_ALL_TESTS is not set.
2021-04-07 15:36:58 +02:00
Matthijs Mekking
ad25ca8bc6 Merge branch '2608-stale-answer-client-timeout-default-off' into 'main'
Change default stale-answer-client-timeout to off

Closes #2608

See merge request isc-projects/bind9!4862
2021-04-07 12:45:48 +00:00
Matthijs Mekking
e443279bbf Change default stale-answer-client-timeout to off
Using "stale-answer-client-timeout" turns out to have unforeseen
negative consequences, and thus it is better to disable the feature
by default for the time being.
2021-04-07 14:10:31 +02:00
Diego dos Santos Fronza
e8313d91ea Merge branch '2582-threadsanitizer-data-race-lib-dns-zone-c-10272-7-in-zone_maintenance' into 'main'
Resolve "ThreadSanitizer: data race lib/dns/zone.c:10272:7 in zone_maintenance"

Closes #2582

See merge request isc-projects/bind9!4864
2021-04-07 12:05:05 +00:00
Diego Fronza
6e08307bc8 Resolve TSAN data race in zone_maintenance
Fix race between zone_maintenance and dns_zone_notifyreceive functions,
zone_maintenance was attempting to read a zone flag calling
DNS_ZONE_FLAG(zone, flag) while dns_zone_notifyreceive was updating
a flag in the same zone calling DNS_ZONE_SETFLAG(zone, ...).

The code reading the flag in zone_maintenance was not protected by the
zone's lock, to avoid a race the zone's lock is now being acquired
before an attempt to read the zone flag is made.
2021-04-07 12:04:01 +00:00
Michał Kępień
2e5a6ab7fc Merge branch '2579-enforce-a-run-time-limit-on-unit-test-binaries' into 'main'
Enforce a run time limit on unit test binaries

Closes #2579

See merge request isc-projects/bind9!4802
2021-04-07 09:46:40 +00:00
Michał Kępień
6bdd55a9b3 Enforce a run time limit on unit test binaries
When a unit test binary hangs, the GitLab CI job in which it is run is
stuck until its run time limit is exceeded.  Furthermore, it is not
trivial to determine which test(s) hung in a given GitLab CI job based
on its log.  To prevent these issues, enforce a run time limit on every
binary executed by the lib/unit-test-driver.sh script.  Use a timeout of
5 minutes for consistency with older BIND 9 branches, which employed
Kyua for running unit tests.  Report an exit code of 124 when the run
time limit is exceeded for a unit test binary, for consistency with the
"timeout" tool included in GNU coreutils.
2021-04-07 11:41:45 +02:00
Artem Boldariev
d1bb1b01b9 Merge branch '2611-doth-failure' into 'main'
Fix "doth" system test failure with SSL_ERROR_SYSCALL (5)

See merge request isc-projects/bind9!4863
2021-04-07 08:44:38 +00:00
Artem Boldariev
ee10948e2d Remove dead code which was supposed to handle TLS shutdowns nicely
Fixes Coverity issue CID 330954 (See #2612).
2021-04-07 11:21:08 +03:00
Artem Boldariev
e6062210c7 Handle buggy situations with SSL_ERROR_SYSCALL
See "BUGS" section at:

https://www.openssl.org/docs/man1.1.1/man3/SSL_get_error.html

It is mentioned there that when TLS status equals SSL_ERROR_SYSCALL
AND errno == 0 it means that underlying transport layer returned EOF
prematurely.  However, we are managing the transport ourselves, so we
should just resume reading from the TCP socket.

It seems that this case has been handled properly on modern versions
of OpenSSL. That being said, the situation goes in line with the
manual: it is briefly mentioned there that SSL_ERROR_SYSCALL might be
returned not only in a case of low-level errors (like system call
failures).
2021-04-07 11:21:08 +03:00
Mark Andrews
6b121171a5 Merge branch '2613-lib-dns-gen-is-not-deleted-on-make-clean' into 'main'
Resolve "lib/dns/gen is not deleted on make clean"

Closes #2613

See merge request isc-projects/bind9!4865
2021-04-07 07:18:53 +00:00
Mark Andrews
9c28df2204 remove lib/dns/gen when running 'make clean' 2021-04-07 08:06:49 +10:00