Validation pipeline: https://gitlab.isc.org/isc-projects/bind9/-/pipelines/160984
Prerequisites:
- [x] isc-private/devops!11
- [x] https://gitlab.isc.org/isc-projects/bind9-qa/-/merge_requests/9
Things to consider:
- FreeBSD DoH jobs are not added because Flamethrower queries always timeout.
- This adds 15 more CI jobs:
- Linux (AWS autoscaler): `(auth + recursive + RPZ) * (DoH + DoT) * (amd64 + arm64) = 12`
- FreeBSD (one FreeBSD runner): `(auth + recursive + RPZ) * (DoT) * (amd64) = 3`
- Autoscaler is not yet present on FreeBSD. Adding 3 CI jobs (i.e., DoT) run serially adds 3 hours to the pipeline runtime. Should we add just one FreeBSD DoT job to limit the runtime?
- DoH/DoT performance is slightly lower than pure TCP, so the threshold for the test to pass must be lowered by 5-10% (see isc-private/bind-qa!40).
Merge branch 'mnowak/stress-test-with-doh-dot' into 'main'
See merge request isc-projects/bind9!5800
Add DoH and DoT stress test jobs. The DoH scenario on FreeBSD is omitted
because all Flamethrower's DoH queries timeout on this platform.
Since the response rate of DoT queries is lower than that of DoH and
TCP, the expected TCP response rate is 80%.
Due to the large number of similar stress test configurations, the
"util/generate-stress-test-configs.py" script now generates them as part
of a downstream pipeline. The script is expected to be run exclusively
within the CI environment, which sources all environmental variables and
files.
This refactoring brought the following changes:
- To start a stress test immediately and not wait for artifacts of the
autoreconf job, run the "autoreconf -fi" command as part of every job.
- Drop the BIND_STRESS_TEST_* variables as they were rarely used and
conflicted with mode and platform selection in the configuration
generator.
- Most pipelines now include a few short, randomly selected stress test
jobs. To schedule all stress tests, set the ALL_BIND_STRESS_TESTS
environmental variable, push a tag to CI, or run a scheduled pipeline.
- Set the BIND_STRESS_TESTS_RUN_TIME environmental variable to pick the
stress test runtime of your choosing, set the BIND_STRESS_TESTS_RATE
environmental variable to set different than the default query rate.
- Job timeout is set to 30 minutes plus stress test runtime in minutes.
Extended DNS error 22 (No reachable authority) was previously detected when `fctx_expired` fired. It turns out this function is used as a "safety net" and the timeout detection should be caught earlier.
It was working though, because of another issue fixed by !9927. But then, the recursive request timed out detection occurs before `fctx_expired` making impossible to raise the EDE 22 error.
This fixes the problem by triggering the EDE 22 in the part of the code detecting the (TCP or UDP) time out and taking the decision to cancel the whole fetch (i.e. There is no other server to attempt to contact).
Note this is not targeting users (no release note) because there is no release versions of BIND between !9927 and this changes. Thus a release note would be confusing.
Closes#5137
Merge branch '5137-ede22' into 'main'
See merge request isc-projects/bind9!9985
When an EDE code is added to a message, the code is converted early in a
big-endian order so it can be memcpy-ed directly in the EDE buffer that
will go on the wire.
This previous change forget to update debug logs which still assume the
EDE code was in host byte order. Add a separate variable to
differentiate both and avoid ambiguities
When EDE 3 (stale answer) was added the serve-stale tests were checking
for those exclusively, i.e. grepping for no "EDE" in the dig output when
no stale answer was expected.
However, some stale tests disable stale answers and make the
authoritative server unresponsive, effectively triggering a timed out
request thus an EDE 22. Update those tests so they still tests the
absence of EDE 3 error, but also the presence of EDE 22.
This re-do a previously existing EDE 22 system test as well as add
another one making sure the timed out flow detection works also on UDP
when the resolver is contacting the authoritative server. (the existing
test was using TCP to contact the authoritative servers).
Extended DNS error 22 (No reachable authority) was previously detected
when `fctx_expired` fired. It turns out this function is used as a
"safety net" and the timeout detection should be caught earlier.
It was working though, because of another issue fixed by !9927. Since
this change, the recursive request timed out detection occurs before
`fctx_expired` so EDE 22 is not added to the response message anymore.
The fix of the problem is to add the EDE 22 code in two situations:
- When the dispatch code timed out (rctx_timedout) the resolver code
checks various properties to figure out if it needs to make another
fetch attempt. One of the paramters if the fetch expiration time. If
it expires, the whole recursion is canceled, so it now adds the EDE 22
code.
- If the fetch expiration time doesn't expires in the case above (and
other parameters allows it) a new fetch attempt is made (fctx_query).
But before the new request is actually made, the fetch expiration time
is re-checked. It might then has elapsed, and the whole recursion is
canceled. So it now also adds the EDE 22 code here as well.
The new command is `rndc memprof`. The memory profiling status is also
reported inside `rndc status`. The status also shows whether named can
toggle memory profiling or not and if the server is built with jemalloc.
Closes#4759
Merge branch '4759-add-a-trigger-to-dump-jeprof-data-or-memory-statistics' into 'main'
See merge request isc-projects/bind9!9370
The new command is `rndc memprof`. The memory profiling status is also
reported inside `rndc status`. The status also shows whether named can
toggle memory profiling or not and if the server is built with jemalloc.
The changelog job is supposed to test that the text from GitLab MR
title&description is valid rst syntax and can be built with sphinx. In
49128fc1, the way gitchangelog generates entries was changed - it no
longer writes to the changelog file, but generates output on stdout
instead. Ensure the generated notes is actually written to (some)
rendered file which is part of the docs so that the subsequent sphinx
build attempts to render the note.
Merge branch 'nicki/ci-fix-changelog-job' into 'main'
See merge request isc-projects/bind9!9804
The changelog job is supposed to test that the text from GitLab MR
title&description is valid rst syntax and can be built with sphinx. In
49128fc1, the way gitchangelog generates entries was changed - it no
longer writes to the changelog file, but generates output on stdout
instead. Ensure the generated notes is actually written to (some)
rendered file which is part of the docs so that the subsequent sphinx
build attempts to render the note.
Add support for EDE codes 1 & 2 which might occurs during DNSSEC validation in case of unsupported RRSIG algorithm or DNSKEY digest.
See #2715
Merge branch '2715-ede-unsupported-digest-alg' into 'main'
See merge request isc-projects/bind9!9948
A DNSSEC validation can fail in the case where multiple DNSKEY are
available for a zone and none of them are supported, but for different
reasons: one has a DS record in the parent zone using an unsupported
digest while the other one uses an unsupported encryption algorithm.
Add a specific test case covering this flow and making sure that two
extended DNS error are provided: code 1 and 2, each of them highlighting
unsupported algorithm and digest.
Add support for EDE codes 1 (Unsupported DNSKEY Algorithm) and 2
(Unsupported DS Digest Type) which might occurs during DNSSEC
validation in case of unsupported DNSKEY algorithm or DS digest type.
Because DNSSEC internally kicks off various fetches, we need to copy
all encountered extended errors from fetch responses to the fetch
context. Upon an event, the errors from the fetch context are copied
to the client response.
The minimal dnspython version to run this test is 2.5.0.
Merge branch 'mnowak/pytest_rewrite_cipher-suites' into 'main'
See merge request isc-projects/bind9!8662
The servers are setup and torn down once per each test module. All the
logs and server state persists between individual tests within the same
module. The servers fixture representing these servers should be
module-wide as well.
The base image tends to have a rather old dnspython version and when
used with pylint and mypy it produces errors about newer dnspython
features the old version does not know about.
$ mypy "bin/tests/system/isctest/"
bin/tests/system/isctest/query.py:55: error: Unexpected keyword argument "verify" for "tls" [call-arg]
/usr/lib/python3/dist-packages/dns/query.py:958: note: "tls" defined here
$ pylint --rcfile $CI_PROJECT_DIR/.pylintrc --disable=wrong-import-position $(git ls-files 'bin/tests/system/*.py' | grep -vE 'ans\.py')
************* Module isctest.query
bin/tests/system/isctest/query.py:55:11: E1123: Unexpected keyword argument 'verify' in function call (unexpected-keyword-arg)
When explicitly set to True, the "verify" argument lets dnspython verify
certificates used for the connection. As most certificates in the system
test will inevitably be self-signed, the "verify" argument defaults to
False.
The "verify" argument is present in dnspython since the version 2.5.0.
A number of result codes are obsolete and can be removed. Others, including `ISC_R_NOMEMORY`, are still checked in various places even though they can't occur any longer. These have been cleaned up.
Merge branch 'each-cleanup-results' into 'main'
See merge request isc-projects/bind9!9942
ISCCC_R_SYNTAX, ISCCC_R_EXPIRED, and ISCCC_R_CLOCKSKEW have the
same usage and text formats as DNS_R_SYNTAX, DNS_R_EXPIRED and
DNS_R_CLOCKSCREW respectively. this was originally done because
result codes were defined in separate libraries, and some tool
might be linked with libisccc but not libdns. as the result codes
are now defined in only one place, there's no need to retain the
duplicates.
building BIND without crypto support is no longer possible.
consequently this result code is never sent, and therefore we
don't need code in calling functions to handle it.
the isc_mem allocation functions can no longer fail; as a result,
ISC_R_NOMEMORY is now rarely used: only when an external library
such as libjson-c or libfstrm could return NULL. (even in
these cases, arguably we should assert rather than returning
ISC_R_NOMEMORY.)
code and comments that mentioned ISC_R_NOMEMORY have been
cleaned up, and the following functions have been changed to
type void, since (in most cases) the only value they could
return was ISC_R_SUCCESS:
- dns_dns64_create()
- dns_dyndb_create()
- dns_ipkeylist_resize()
- dns_kasp_create()
- dns_kasp_key_create()
- dns_keystore_create()
- dns_order_create()
- dns_order_add()
- dns_peerlist_new()
- dns_tkeyctx_create()
- dns_view_create()
- dns_zone_setorigin()
- dns_zone_setfile()
- dns_zone_setstream()
- dns_zone_getdbtype()
- dns_zone_setjournal()
- dns_zone_setkeydirectory()
- isc_lex_openstream()
- isc_portset_create()
- isc_symtab_create()
(the exception is dns_view_create(), which could have returned
other error codes in the event of a crypto library failure when
calling isc_file_sanitize(), but that should be a RUNTIME_CHECK
anyway.)
Adjust the limit of maximum disagreements in respdiff results based on
recent pipeline results.
The respdiff and respdiff:asan seem to have almost identical results,
typically around 0.07 % of differences with ocassional spikes up to
around 0.11 %. Similar results are for respdiff:tsan, perhaps with more
common spikes with values up to around 0.12 %. Set the limit to 0.15 %
to allow for some tolerance due to network conditions, time of day etc.
The respdiff:third-party has a slightly higher disagreements average,
with typical values being around 0.12 %. Set the limit to 0.2 %.
Exceeding either of those values should be quite clear indication that
some resolution behaviour has changed, since the values appear to be
very stable within the newly configured limits.
Merge branch 'nicki/ci-respdiff-limits' into 'main'
See merge request isc-projects/bind9!9950
Adjust the limit of maximum disagreements in respdiff results based on
recent pipeline results.
The respdiff and respdiff:asan seem to have almost identical results,
typically around 0.07 % of differences with ocassional spikes up to
around 0.11 %. Similar results are for respdiff:tsan, perhaps with more
common spikes with values up to around 0.12 %. Set the limit to 0.15 %
to allow for some tolerance due to network conditions, time of day etc.
The respdiff:third-party has a slightly higher disagreements average,
with typical values being around 0.12 %. Set the limit to 0.2 %.
Exceeding either of those values should be quite clear indication that
some resolution behaviour has changed, since the values appear to be
very stable within the newly configured limits.
There was confusion about whether the interval was calculated from
the validity period provided on the command line (with -s and -e),
or from the signature being replaced.
Add text to clarify that the interval is calculated from the new
validity period.
Closes#5128
Merge branch '5128-clarify-dnssec-signzone-interval' into 'main'
See merge request isc-projects/bind9!9955
There was confusion about whether the interval was calculated from
the validity period provided on the command line (with -s and -e),
or from the signature being replaced.
Add text to clarify that the interval is calculated from the new
validity period.
In the case when `dnssec-signzone` is called on an already signed zone, and the private key file is unavailable, a signature that needs to be refreshed may be dropped without being able to generate a replacement. This has been fixed.
Closes#5126
Merge branch '5126-dnssec-signzone-retain-rrsig-if-key-is-offline' into 'main'
See merge request isc-projects/bind9!9951
Track inside the dns_dnsseckey structure whether we have seen the
private key, or if this key only has a public key file.
If the key only has a public key file, or a DNSKEY reference in the
zone, mark the key 'pubkey'. In dnssec-signzone, if the key only
has a public key available, consider the key to be offline. Any
signatures that should be refreshed for which the key is not available,
retain the signature.
So in the code, 'expired' becomes 'refresh', and the new 'expired'
is only used to determine whether we need to keep the signature if
the corresponding key is not available (retaining the signature if
it is not expired).
In the 'keysthatsigned' function, we can remove:
- key->force_publish = false;
- key->force_sign = false;
because they are redundant ('dns_dnsseckey_create' already sets these
values to false).
Add a test case for the scenario below.
There is a case when signing a zone with dnssec-signzone where the
private key file is moved outside the key directory (for offline
ksk purposes), and then the zone is resigned. The signature of the
DNSKEY needs refreshing, but is not expired.
Rather than removing the signature without having a valid replacement,
leave the signature in the zone (despite it needs to be refreshed).
If the generated status output exceeds 4096 it was silently truncated, now we output that the status was truncated.
Closes#4180
Merge branch '4180-possible-truncation-in-dns_keymgr_status' into 'main'
See merge request isc-projects/bind9!9905
Extended DNS error mechanism (EDE) may have several errors raised during a DNS resolution. `named` is now able to add up to three EDE codes in a DNS response. In the case of duplicate error codes, only the first one will be part of the DNS response.
Closes#5085
Merge branch '5085-multiple-ede' into 'main'
See merge request isc-projects/bind9!9952
Extended DNS error mechanism (EDE) enables to have several EDE raised
during a DNS resolution (typically, a DNSSEC query will do multiple
fetches which each of them can have an error). Add support to up to 3
EDE errors in an DNS response. If duplicates occur (two EDEs with the
same code, the extra text is not compared), only the first one will be
part of the DNS answer.
Because the maximum number of EDE is statically fixed, `ns_client_t`
object own a static vector of `DNS_DE_MAX_ERRORS` (instead of a linked
list, for instance). The array can be fully filled (all slots point to
an allocated `dns_ednsopt_t` object) or partially filled (or
empty). In such case, the first NULL slot means there is no more EDE
objects.
When 'ISC_R_TIMEDOUT' is received in 'tcp_recv()', it times out the
oldest response in the active responses queue, and only after that it
checks whether other active responses have also timed out. So when
setting a timeout value for a read operation after a successful
connection, it makes sense to take the timeout value from the oldest
response in the active queue too, because, theoretically, the responses
can have different timeout values, e.g. when the TCP dispatch is shared.
Currently 'resp' is always NULL. Previously when connect and read timeouts
were not separated in dispatch this affected only logging, but now since
we are setting a new timeout after a successful connection, we need to
choose a suitable response from the active queue.
Merge branch 'aram/dispatch-tcp_connected-fix' into 'main'
See merge request isc-projects/bind9!9927