Memory reclamation by `call_rcu()` is asynchronous, so during shutdown
it can lose a race with the destruction of its memory context. When we
defer memory reclamation, we need to attach to the memory context to
indicate that it is still in use, but that is not enough to delay its
destruction. So, call `rcu_barrier()` in `isc_mem_destroy()` to wait
for pending RCU work to finish before proceeding to destroy the memory
context.
It can be fairly long-winded to allocate space for a struct with a
flexible array member: in general we need the size of the struct, the
size of the member, and the number of elements. Wrap them all up in a
STRUCT_FLEX_SIZE() macro, and use the new macro for the flexible
arrays in isc_ht and dns_qp.
The Userspace-RCU headers are now needed for more parts of the libisc
and libdns, thus we need to add it globally to prevent compilation
failures on systems with non-standard Userspace-RCU installation path.
The isc_quota API was using locked list of isc_job_t objects to keep the
waiting TCP accepts. Change the isc_quota implementation to use
cds_wfcqueue internally - the enqueue is wait-free and only dequeue
needs to be locked.
The isc_async API was using lock-free stack (where enqueue operation was
not wait-free). Change the isc_async to use cds_wfcqueue internally -
enqueue and splice (move the queue members from one list to another) is
nonblocking and wait-free.
Instead of having a global hashtable with a global rwlock for the GLUE
cache, move the glue_list directly into rdatasetheader and use
Userspace-RCU to update the pointer when the glue_list is empty.
Additionally, the cached glue_lists needs to be stored in the RBTDB
version for early cleaning, otherwise the circular dependencies between
nodes and glue_lists will prevent nodes to be ever cleaned up.
Clang 16 LeakSanitizer reports a memory leak when dns_request_create()
returned a TLS error in the nsupdate system test. While technically a
memory leak on error handling, it's not a problem because the program is
immediately terminated; nsupdate is not expected to run for a prolonged
time.
Stop deliberately breaking const rules by copying file->name into
dirbuf and truncating it there. Handle files located in the root
directory properly. Use unlinkat() from POSIX 200809.
Removing old timestamp or increment versions of log backup files did
not work when the file is an absolute path: only the entry name was
provided to the file remove function.
The dirname was also bogus, since the file separater was put back too
soon.
Fix these issues to make log file rotation work when the file is
configured to be an absolute path.
All the per-loop `libuv` setup remains in `isc_loop`, but the per-thread
RCU setup is moved to `isc_thread` alongside the other per-thread setup.
This avoids repeating the per-thread setup for `call_rcu()` helpers,
and explains a little better why some parts of the per-thread setup
is missing for `call_rcu()` helpers.
This also removes the per-loop `call_rcu()` helpers as we refactored the
isc__random_initialize() in the previous commit.
Instead of writing complicated wrappers for every thread, move the
initialization back to isc_random unit and check whether the random seed
was initialized with a thread_local variable.
Ensure that isc_entropy_get() returns a non-zero seed.
This avoids problems with thread sanitizer tests getting stuck in an
infinite loop.
Remove the `isc_threadarg_t` and `isc_threadresult_t`
typedefs which were unhelpful disguises for `void *`,
and free the dummy jemalloc allocation sooner.
When liburcu is not installed from a system package, its headers are
not treated as system headers by the compiler, so BIND's -Werror and
other warning options take effect. The liburcu headers have a lot
of inline functions, some of which do not use all their arguments,
which BIND's build treats as an error.
This commit allows BIND 9 to be compiled with different flavours of
Userspace RCU, and improves the integration between Userspace RCU and
our event loop:
- In the RCU QSBR, the thread is put offline when polling and online
when rcu_dereference, rcu_assign_pointer (or friends) are called.
- In other RCU modes, we check that we are not reading when reaching the
quiescent callback in the event loop.
- We register the thread before uv_work_run() callback is called and
after it has finished. The rcu_(un)register_thread() has a large
overhead, but that's fine in this case.
There's a recurring pattern walking the ISC_LISTs that just repeats over
and over. Add two macros:
* ISC_LIST_FOREACH(list, elt, link) - walk the static list
* ISC_LIST_FOREACH_SAFE(list, elt, link, next) - walk the list in
a manner that's safe against list member deletions
The spinlock is small (atomic_uint_fast32_t at most), lightweight
synchronization primitive and should only be used for short-lived and
most of the time a isc_mutex should be used.
Add a isc_spinlock unit which is either (most of the time) a think
wrapper around pthread_spin API or an efficient shim implementation of
the simple spinlock.
When shutting down TCP sockets, the read callback calling logic was
flawed, it would call either one less callback or one extra. Fix the
logic in the way:
1. When isc_nm_read() has been called but isc_nm_read_stop() hasn't on
the handle, the read callback will be called with ISC_R_CANCELED to
cancel active reading from the socket/handle.
2. When isc_nm_read() has been called and isc_nm_read_stop() has been
called on the on the handle, the read callback will be called with
ISC_R_SHUTTINGDOWN to signal that the dormant (not-reading) socket
is being shut down.
3. The .reading and .recv_read flags are little bit tricky. The
.reading flag indicates if the outer layer is reading the data (that
would be uv_tcp_t for TCP and isc_nmsocket_t (TCP) for TLSStream),
the .recv_read flag indicates whether somebody is interested in the
data read from the socket.
Usually, you would expect that the .reading should be false when
.recv_read is false, but it gets even more tricky with TLSStream as
the TLS protocol might need to read from the socket even when sending
data.
Fix the usage of the .recv_read and .reading flags in the TLSStream
to their true meaning - which mostly consist of using .recv_read
everywhere and then wrapping isc_nm_read() and isc_nm_read_stop()
with the .reading flag.
4. The TLS failed read helper has been modified to resemble the TCP code
as much as possible, clearing and re-setting the .recv_read flag in
the TCP timeout code has been fixed and .recv_read is now cleared
when isc_nm_read_stop() has been called on the streaming socket.
5. The use of Network Manager in the named_controlconf, isccc_ccmsg, and
isc_httpd units have been greatly simplified due to the improved design.
6. More unit tests for TCP and TLS testing the shutdown conditions have
been added.
Co-authored-by: Ondřej Surý <ondrej@isc.org>
Co-authored-by: Artem Boldariev <artem@isc.org>
By inspecting the code, it was discovered that .sendbuf member of the
isc__nm_networker_t was unused and just consuming ~64k per worker.
Remove the member and the association allocation/deallocation.
In e18541287231b721c9cdb7e492697a2a80fd83fc, the TCP accept quota code
became broken in a subtle way - the quota would get initialized on the
first accept for the server socket and then deleted from the server
socket, so it would never get applied again.
Properly fixing this required a bigger refactoring of the isc_quota API
code to make it much simpler. The new code decouples the ownership of
the quota and acquiring/releasing the quota limit.
After (during) the refactoring it became more clear that we need to use
the callback from the child side of the accepted connection, and not the
server side.
Instead of using isc_async_run() when closing StreamDNS handle, add
isc_job_t member to the isc_nmhandle_t structure and use isc_job_run()
to avoid allocation/deallocation on the StreamDNS hot-path.
If the quota callback is called on a thread matching the socket, call
the TCP accept function directly instead of using isc_async_run() which
allocates-deallocates memory.
The isc_tid() function is often called on the hot-path and it's the only
function is to return thread_local variable, make the isc_tid() function
a header-only to save several function calls during query-response
processing.
Only attempt to digest 'in' if it is non NULL. This will prevent
false positives about NULL pointer dereferences against 'in' and
should also speed up the processing.
This commit optimises isc_dnsstream_assembler_t in such a way that
memory copying and reallocation are avoided when receiving one or more
complete DNS messages at once. We try to handle the data from the
messages directly, without storing them in an intermediate memory
buffer.
The `isc_histosummary_t` functions were written in the early days of
`hg64` and carried over when I brought `hg64` into BIND. They were
intended to be useful for graphing cumulative frequency distributions
and the like, but in practice whatever draws charts is better off with
a raw histogram export. Especially because of the poor performance of
the old functions.
The replacement `isc_histo_quantiles()` function is intended for
providing a few quantile values in BIND's stats channel, when the user
does not want the full histogram. Unlike the old functions, the caller
provides all the query fractions up-front, so that the values can be
found in a single scan instead of a scan per value. The scan is from
larger values to smaller, since larger quantiles are usually more
interesting, so the scan can bail out early.
Although an `isc_histo_t` is thread-safe, it can suffer
from cache contention under heavy load. To avoid this,
an `isc_histomulti_t` contains a histogram per thread,
so updates are local and low-contention.
This is an adaptation of my `hg64` experiments for use in BIND.
As well as renaming everything according to ISC style, I have
written some more extensive tests that ensure the edge cases are
correct and the fenceposts are in the right places.
I have added utility functions for working with precision in terms of
decimal significant figures as well as this code's native binary.
Cleanup the remnants of MS Compiler bits from <isc/refcount.h>, printing
the information in named/main.c, and cleanup some comments about Windows
that no longer apply.
The bits in picohttpparser.{h,c} were left out, because it's not our
code.
OPENSSL_cleanup is supposed to free all remaining memory in use
provided the application has cleaned up properly. This is not the
case on some operating systems. Silently ignore memory that is
freed after OPENSSL_cleanup has been called.
When fatal is called we may be holding memory allocated by OpenSSL.
This may result in the reference count for the FIPS provider not
going to zero and the shared library not being unloaded during
OPENSSL_cleanup. When the shared library is ultimately unloaded,
when all remaining dynamically loaded libraries are freed, we have
already destroyed the memory context we where using to track memory
leaks / late frees resulting in INSIST being called.
Disable triggering the INSIST when fatal has being called.
The `isc_trampoline` module had a lot of machinery to support stable
thread IDs for use by hazard pointers. But the hazard pointer code
is gone, and the `isc_loop` module now has its own per-loop thread
IDs.
The trampoline machinery seems over-complicated for its remaining
tasks, so move the per-thread initialization into `isc/thread.c`,
and delete the rest.
The isc_time_now() and isc_time_now_hires() were used inconsistently
through the code - either with status check, or without status check,
or via TIME_NOW() macro with RUNTIME_CHECK() on failure.
Refactor the isc_time_now() and isc_time_now_hires() to always fail when
getting current time has failed, and return the isc_time_t value as
return value instead of passing the pointer to result in the argument.
The isc_fsaccess API was created to hide the implementation details
between POSIX and Windows APIs. As we are not supporting the Windows
APIs anymore, it's better to drop this API used in the DST part.
Moreover, the isc_fsaccess was setting the permissions in an insecure
manner - it operated on the filename, and not on the file descriptor
which can lead to all kind of attacks if unpriviledged user has read (or
even worse write) access to key directory.
Replace the code that operates on the private keys with code that uses
mkstemp(), fchmod() and atomic rename() at the end, so at no time the
private key files have insecure permissions.
As it's impossible to get the current umask without modifying it at the
same time, initialize the current umask at the program start and keep
the loaded value internally. Add isc_os_umask() function to access the
starttime umask.
This is a simple replacement using the semantic patch from the previous
commit and as added bonus, one removal of previously undetected unused
variable in named/server.c.
When the loopmanager is shutting down following a signal,
`dig` and `host` should stop cleanly. Before this commit
they were oblivious to ISC_R_SHUTTINGDOWN.
The `isc_signal` callbacks now report this kind of mistake
with a stack backtrace.
Instead of marking the unused entities with UNUSED(x) macro in the
function body, use a `ISC_ATTR_UNUSED` attribute macro that expans to
C23 [[maybe_unused]] or __attribute__((__unused__)) as fallback.