RUNTIME_CHECK on the "wrap" variable avoids possible NULL dereference:
thread.c: In function 'thread_wrap':
thread.c:60:15: error: dereference of possibly-NULL 'wrap' [CWE-690] [-Werror=analyzer-possible-null-dereference]
60 | *wrap = (struct thread_wrap){
The RUNTIME_CHECK was there before commit
7d1ceaf35dbc25dfbca32deffa7f6ff8f452e75f.
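A minimal sketch of the pattern, for illustration only. The names
SKETCH_RUNTIME_CHECK, struct sketch_wrap and sketch_wrap_new() are
hypothetical stand-ins and do not reproduce the real definitions in
thread.c or the ISC utility headers; the sketch also assumes the
allocation can return NULL, as plain malloc() can.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for RUNTIME_CHECK: abort when cond is false. */
#define SKETCH_RUNTIME_CHECK(cond)                                      \
	((cond) ? (void)0                                               \
		: (fprintf(stderr, "%s:%d: RUNTIME_CHECK(%s) failed\n", \
			   __FILE__, __LINE__, #cond),                   \
		   abort()))

/* Hypothetical wrapper object; field names are assumptions. */
struct sketch_wrap {
	void *(*func)(void *);
	void *arg;
};

static struct sketch_wrap *
sketch_wrap_new(void *(*func)(void *), void *arg) {
	struct sketch_wrap *wrap = malloc(sizeof(*wrap));

	/*
	 * Without this check the analyzer sees a possibly-NULL 'wrap'
	 * being dereferenced by the assignment below.
	 */
	SKETCH_RUNTIME_CHECK(wrap != NULL);

	*wrap = (struct sketch_wrap){ .func = func, .arg = arg };
	return wrap;
}
```

The check turns an allocation failure into an immediate, diagnosable
abort instead of undefined behaviour at the assignment.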
The line summarising TSAN reports was misplaced in the ASAN territory
and thus never used.
I also made the detection of core dumps, assertion failures, and TSAN
reports independent of each other.
The last use of the cmocka_add_test_byname() helper macro was removed in
commit 63fe9312ff8f54bc79e399bdbd5aaa15cd3e5459. Remove the
<isc/cmocka.h> header that defines it.
tests/isc/ht_test.c triggers the following compiler warnings when built
against development versions of cmocka:
In file included from ht_test.c:24:
ht_test.c: In function ‘test_ht_full’:
ht_test.c:68:45: warning: passing argument 2 of ‘_assert_ptr_equal’ makes pointer from integer without a cast [-Wint-conversion]
68 | assert_ptr_equal((void *)i, (uintptr_t)f);
/usr/include/cmocka.h:1513:56: note: in definition of macro ‘assert_ptr_equal’
1513 | #define assert_ptr_equal(a, b) _assert_ptr_equal((a), (b), __FILE__, __LINE__)
| ^
/usr/include/cmocka.h:2907:36: note: expected ‘const void *’ but argument is of type ‘long unsigned int’
2907 | const void *b,
| ~~~~~~~~~~~~^
ht_test.c:163:45: warning: passing argument 2 of ‘_assert_ptr_equal’ makes pointer from integer without a cast [-Wint-conversion]
163 | assert_ptr_equal((void *)i, (uintptr_t)f);
/usr/include/cmocka.h:1513:56: note: in definition of macro ‘assert_ptr_equal’
1513 | #define assert_ptr_equal(a, b) _assert_ptr_equal((a), (b), __FILE__, __LINE__)
| ^
/usr/include/cmocka.h:2907:36: note: expected ‘const void *’ but argument is of type ‘long unsigned int’
2907 | const void *b,
| ~~~~~~~~~~~~^
These are caused by a change to the definitions of pointer assert
functions in cmocka's development branch [1]. Fix by casting the
affected variables to (void *) instead of (uintptr_t).
[1] https://git.cryptomilk.org/projects/cmocka.git/commit/?id=09621179af67535788a67957a910d9f17c975b45
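The shape of the fix can be sketched without cmocka. Here
sketch_ptr_equal() is a hypothetical stand-in for a pointer assertion
helper whose parameters are strictly typed as 'const void *', and the
sketch assumes 'i' is a uintptr_t and 'f' is a 'void *', which may not
match the actual declarations in ht_test.c.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a strictly typed pointer assertion helper. */
static int
sketch_ptr_equal(const void *a, const void *b) {
	return a == b;
}

static int
sketch_check(uintptr_t i, void *f) {
	/*
	 * Passing (uintptr_t)f as the second argument would trigger
	 * -Wint-conversion, because an integer would be converted to
	 * 'const void *' implicitly.  Both arguments are pointers here.
	 */
	return sketch_ptr_equal((void *)i, (void *)f);
}
```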
Development versions of cmocka require the intmax_t and uintmax_t types
to be defined by the time the test code includes the <cmocka.h> header.
These types are defined in the <stdint.h> header, which is included by
the <inttypes.h> header, which in turn is already explicitly included by
some of the programs in the tests/ directory. Ensure all programs in
that directory that include the <cmocka.h> header also include the
<inttypes.h> header to future-proof the code while keeping the change
set minimal and the resulting code consistent. Also prevent explicitly
including the <stdint.h> header in those programs as it is included by
the <inttypes.h> header.
When the FIPS provider is available, RSASHA1 signing keys for zone
"example.com." are ignored if an attempt is made to sign the zone with
the dnssec-signzone "-F" (FIPS mode) option:
"fatal: No signing keys specified or found"
The upforwd test for forwarding updates to a dead primary can continue
running a little bit past its end, causing update replies to be
recorded during a subsequent test case. Correct this by only looking
for update requests and replies for the specific domain name being
tested at any given time.
After the RCU changes were merged, the `upforwd` test started
consistently failing when run under thread sanitizer. After some
investigation, it turned out that retry attempts were continuing after
the "update forwarding to dead primary" test. This caused mismatches
in the DNSTAP message counts for the subsequent tests, because they
were also counting retries.
Fix this problem by `wait`ing for the `nsupdate` processes to exit.
While investigating the bug, I replaced several fixed 15 second delays
with `wait_for_log`, so the test runs faster.
Move registration and deregistration of the main thread from
`isc_loopmgr_run()` into `isc__initialize()` / `isc__shutdown()`:
liburcu-qsbr fails an assertion if we try to use it from an
unregistered thread, and we need to be able to use it when the
event loops are not running.
Use `rcu_assign_pointer()` and `rcu_dereference()` in qp-trie
transactions so that they properly mark threads as online. The
RCU-protected pointer is no longer declared atomic because
liburcu does not (yet) use standard C atomics.
Fix the definition of `isc_qsbr_rcu_dereference()` to return
the referenced value, and to call the right function inside
liburcu.
Change the thread sanitizer suppressions to match any variant of
`rcu_*_barrier()`.
An omission pointed out by the following report from Coverity:
/lib/isc/loop.c: 483 in isc_loopmgr_pause()
>>> CID 455002: Error handling issues (CHECKED_RETURN)
>>> Calling "uv_async_send" without checking return value (as is done elsewhere 5 out of 6 times).
483 uv_async_send(&loop->pause_trigger);
When reading on a streamdns socket failed due to a timeout, but
the dispatch was still waiting for other responses, it would
resume reading by calling isc_nm_read() again. This caused
an assertion failure because the socket was already reading.
We now check that either the socket is reading, or that it was
already reading on the same handle.
Create and free per-CPU helper threads from the main thread and tell
thread sanitizer to suppress leaking threads. (We are not leaking
threads ourselves and we can safely ignore the Userspace-RCU thread
leaks.)
All the places the qp-trie code was using `call_rcu()` needed
`__tsan_release()` and `__tsan_acquire()` annotations, so
add a couple of wrappers to encapsulate this pattern.
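A rough sketch of such a wrapper pair follows. `__tsan_release()` and
`__tsan_acquire()` are the thread sanitizer annotations declared in
`<sanitizer/tsan_interface.h>`; every `sketch_*` and `SKETCH_*` name is
hypothetical, and `sketch_call_later()` is a stand-in for `call_rcu()`
that runs the callback immediately so the example needs no liburcu.
The real wrappers may look different.

```c
#include <assert.h>
#include <stdlib.h>

/* Detect thread sanitizer under GCC or Clang. */
#if defined(__SANITIZE_THREAD__)
#define SKETCH_TSAN 1
#elif defined(__has_feature)
#if __has_feature(thread_sanitizer)
#define SKETCH_TSAN 1
#endif
#endif

#ifdef SKETCH_TSAN
#include <sanitizer/tsan_interface.h>
#define SKETCH_RELEASE(p) __tsan_release(p)
#define SKETCH_ACQUIRE(p) __tsan_acquire(p)
#else
#define SKETCH_RELEASE(p) ((void)(p))
#define SKETCH_ACQUIRE(p) ((void)(p))
#endif

struct sketch_head {
	void (*func)(struct sketch_head *);
};

/*
 * Stand-in for call_rcu().  The real function runs the callback later,
 * from another thread, after a grace period; this stub runs it now.
 */
static void
sketch_call_later(struct sketch_head *head,
		  void (*func)(struct sketch_head *)) {
	head->func = func;
	head->func(head);
}

struct sketch_garbage {
	struct sketch_head head; /* must be the first member */
	int *counter;
};

/* Callback side: acquire before touching the deferred object. */
static void
sketch_reclaim_cb(struct sketch_head *head) {
	struct sketch_garbage *g = (struct sketch_garbage *)head;
	SKETCH_ACQUIRE(g);
	assert(g->counter != NULL);
	(*g->counter)++;
	free(g);
}

/* Caller side: release after the last write, then hand the object off. */
static void
sketch_reclaim_later(int *counter) {
	struct sketch_garbage *g = malloc(sizeof(*g));
	if (g == NULL) {
		abort();
	}
	*g = (struct sketch_garbage){ .counter = counter };
	SKETCH_RELEASE(g);
	sketch_call_later(&g->head, sketch_reclaim_cb);
}
```

Pairing the release in the caller with the acquire in the callback
tells thread sanitizer about the happens-before edge that the deferral
mechanism provides but that the sanitizer cannot see by itself.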
With these wrappers, the tests run almost clean under thread
sanitizer. The remaining problems are due to `rcu_barrier()`
which can be suppressed using `.tsan-suppress`. It does not
suppress the whole of `liburcu`, because we would like thread
sanitizer to detect problems in `call_rcu()` callbacks, which
are called from `liburcu`.
The CI jobs have been updated to use `.tsan-suppress` by
default, except for a special-case job that needs the
additional suppressions in `.tsan-suppress-extra`.
We might be able to get rid of some of this after liburcu gains
support for thread sanitizer.
Note: the `rcu_barrier()` suppression is not entirely effective:
tsan sometimes reports races that originate inside `rcu_barrier()`
but tsan has discarded the stack so it does not have the
information required to suppress the report. These "races" can
be made much easier to reproduce by adding `atexit_sleep_ms=1000`
to `TSAN_OPTIONS`. The problem with tsan's short memory can be
addressed by increasing `history_size`: when it is large enough
(6 or 7) the `rcu_barrier()` stack usually survives long enough
for suppression to work.
Shutdown and cleanup of zones is more asynchronous with the qp-trie
zone table. As a result it's possible that some activity is delayed
until after a zone has been released from its zonemanager.
Previously, the dns_zone code was not very strict in the way it
refers to the loop it is running on: The loop pointer was stashed when
dns_zonemgr_managezone() was called and never cleared. Now, zones
properly attach to and detach from their loops.
The zone timer depends on its loop. The shutdown crashes occurred
when asynchronous calls tried to modify the zone timer after
dns_zonemgr_releasezone() had been called and the loop was
invalidated. In these cases the attempt to set the timer is now
ignored, with a debug log message.
A `dns_qpmulti_t` no longer needs to know about its loopmgr. We no
longer keep a linked list of `dns_qpmulti_t` that have reclamation
work, and we no longer mark chunks with the phase in which they are to
be reclaimed. Instead, empty chunks are listed in an array in a
`qp_rcu_t`, which is passed to call_rcu().
Memory reclamation by `call_rcu()` is asynchronous, so during shutdown
it can lose a race with the destruction of its memory context. When we
defer memory reclamation, we need to attach to the memory context to
indicate that it is still in use, but that is not enough to delay its
destruction. So, call `rcu_barrier()` in `isc_mem_destroy()` to wait
for pending RCU work to finish before proceeding to destroy the memory
context.
It can be fairly long-winded to allocate space for a struct with a
flexible array member: in general we need the size of the struct, the
size of the member, and the number of elements. Wrap them all up in a
STRUCT_FLEX_SIZE() macro, and use the new macro for the flexible
arrays in isc_ht and dns_qp.
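One plausible shape for such a macro is sketched below. The name
SKETCH_FLEX_SIZE, its argument order, and struct sketch_table are
assumptions made for illustration; the definition of
STRUCT_FLEX_SIZE() in the tree may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Size of the struct that 'pointer' points to, plus 'count' elements
 * of its flexible array 'member'.  sizeof does not evaluate its
 * operand, so 'pointer' may be NULL or uninitialised here.
 */
#define SKETCH_FLEX_SIZE(pointer, member, count) \
	(sizeof(*(pointer)) + sizeof(*(pointer)->member) * (count))

struct sketch_table {
	size_t size;
	int slots[]; /* flexible array member */
};

static struct sketch_table *
sketch_table_new(size_t count) {
	struct sketch_table *t = NULL;

	/* One allocation covers the header and all 'count' slots. */
	t = calloc(1, SKETCH_FLEX_SIZE(t, slots, count));
	if (t == NULL) {
		return NULL;
	}
	t->size = count;
	return t;
}
```

Deriving both sizes from the pointer keeps the allocation correct if
the struct or the element type changes later.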
Allow an arbitrary TCP timeout value to be specified when running
rndc, so that commands that take a long time to execute (for example,
reloading a very large configuration) can be given time to do so.
The zone_resigninc() function does not check the validity of
'zone->db', which can crash named if the zone was unloaded earlier,
for example with "rndc delete".
Check that 'zone->db' is not 'NULL' before attaching to it, as is
done in the zone_sign() and zone_nsec3chain() functions, which
can similarly be called by zone maintenance.
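The guard can be sketched as follows. All types and functions here
(struct sketch_zone, struct sketch_db, sketch_db_attach(),
sketch_resign()) are hypothetical simplifications and not the real
dns_zone or dns_db API; locking and the actual signing work are left
out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct sketch_db {
	int references;
};

struct sketch_zone {
	struct sketch_db *db; /* NULL once the zone has been unloaded */
};

/* Simplified attach: take a reference and hand it to the caller. */
static void
sketch_db_attach(struct sketch_db *source, struct sketch_db **targetp) {
	source->references++;
	*targetp = source;
}

/* Returns false, doing nothing, when the zone has no database. */
static bool
sketch_resign(struct sketch_zone *zone) {
	struct sketch_db *db = NULL;

	if (zone->db == NULL) {
		return false;
	}
	sketch_db_attach(zone->db, &db);

	/* ... incremental re-signing would happen here ... */

	db->references--; /* simplified detach */
	return true;
}
```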