Previously, the read callback would be synchronous only on success or
timeout. Add an option (similar to what other callbacks have) to decide
whether we need the asynchronous read callback on a higher level.
On a general level, we need the asynchronous callbacks to happen only
when we are invoking the callback from the public API. If the path to
the callback went through the libuv callback or netmgr callback, we are
already on asynchronous path, and there's no need to make the call to
the callback asynchronous again.
For the read callback, this means we need the asynchronous path for
failure paths inside the isc_nm_read() (which calls isc__nm_udp_read(),
isc__nm_tcp_read(), etc...) - all other invocations of the read callback
could be synchronous, because those are called from the respective libuv
or netmgr read callbacks.
The netievent handler for isc_nmsocket_set_tlsctx() was inadvertently
ifdef'd out when BIND was built with --disable-doh, resulting in an
assertion failure on startup when DoT was configured.
This commit introduces a primitive isc__nmsocket_stop() which performs
shutting down on a multilayered socket ensuring the proper order of
the operations.
The shared data within the socket object can be destroyed after the
call completed, as it is guaranteed to not be used from within the
context of other worker threads.
During loop manager refactoring isc_nmsocket_set_tlsctx() was not
properly adapted. The function is expected to broadcast the new TLS
context for every worker, but this behaviour was accidentally broken.
After the loopmgr work has been merged, we can now cleanup the TCP and
TLS protocols a little bit, because there are stronger guarantees that
the sockets will be kept on the respective loops/threads. We only need
asynchronous call for listening sockets (start, stop) and reading from
the TCP (because the isc_nm_read() might be called from read callback
again.
This commit does the following changes (they are intertwined together):
1. Cleanup most of the asynchronous events in the TCP code, and add
comments for the events that needs to be kept asynchronous.
2. Remove isc_nm_resumeread() from the netmgr API, and replace
isc_nm_resumeread() calls with existing isc_nm_read() calls.
3. Remove isc_nm_pauseread() from the netmgr API, and replace
isc_nm_pauseread() calls with a new isc_nm_read_stop() call.
4. Disable the isc_nm_cancelread() for the streaming protocols, only the
datagram-like protocols can use isc_nm_cancelread().
5. Add isc_nmhandle_close() that can be used to shutdown the socket
earlier than after the last detach. Formerly, the socket would be
closed only after all reading and sending would be finished and the
last reference would be detached. The new isc_nmhandle_close() can
be used to close the underlying socket earlier, so all the other
asynchronous calls would call their respective callbacks immediately.
Co-authored-by: Ondřej Surý <ondrej@isc.org>
Co-authored-by: Artem Boldariev <artem@isc.org>
The destructor for the isc__nmsocket_t was missing call to the
isc_refcount_destroy() on the reference counter, which might lead to
spurious ThreadSanitizer data race warnings if we ever change the
acquire-release memory order in the isc_refcount_decrement().
The network manager UDP code was misinterpreting when the libuv called
the udp_recv_cb with nrecv == 0 and addr == NULL -> this doesn't really
mean that the "stream" has ended, but the libuv indicates that the
receive buffer can be freed. This could lead to assertion failure in
the code that calls isc_nm_read() from the network manager read callback
due to the extra spurious callbacks.
Properly handle the extra callback calls from the libuv in the client
read callback, and refactor the UDP isc_nm_read() implementation to be
synchronous, so no datagram is lost between the time that we stop the
reading from the UDP socket and we restart it again in the asychronous
udpread event.
Add a unit test that tests the isc_nm_read() call from the read
callback to receive two datagrams.
When we are closing the listening sockets, there's a time window in
which the TCP connection could be accepted although the respective
stoplistening function has already returned to control to the caller.
Clear the accept callback function early, so it doesn't get called when
we are not interested in the incoming connections anymore.
Previously:
* applications were using isc_app as the base unit for running the
application and signal handling.
* networking was handled in the netmgr layer, which would start a
number of threads, each with a uv_loop event loop.
* task/event handling was done in the isc_task unit, which used
netmgr event loops to run the isc_event calls.
In this refactoring:
* the network manager now uses isc_loop instead of maintaining its
own worker threads and event loops.
* the taskmgr that manages isc_task instances now also uses isc_loopmgr,
and every isc_task runs on a specific isc_loop bound to the specific
thread.
* applications have been updated as necessary to use the new API.
* new ISC_LOOP_TEST macros have been added to enable unit tests to
run isc_loop event loops. unit tests have been updated to use this
where needed.
The STATID_CONNECT and STATID_CONNECTFAIL statistics were used
incorrectly. The STATID_CONNECT was incremented twice (once in
the *_connect_direct() and once in the callback) and STATID_CONNECTFAIL
would not be incremented at all if the failure happened in the callback.
Closes: #3452
On FreeBSD (and perhaps other *BSD) systems, the TCP connect() call (via
uv_tcp_connect()) can fail with transient UV_EADDRINUSE error. The UDP
code already handles this by trying three times (is a charm) before
giving up. Add a code for the TCP, TCPDNS and TLSDNS layers to also try
three times before giving up by calling uv_tcp_connect() from the
callback two more time on UV_EADDRINUSE error.
Additionally, stop the timer only if we succeed or on hard error via
isc__nm_failed_connect_cb().
it's a style violation to have REQUIRE or INSIST contain code that
must run for the server to work. this was being done with some
atomic_compare_exchange calls. these have been cleaned up. uses
of atomic_compare_exchange in assertions have been replaced with
a new macro atomic_compare_exchange_enforced, which uses RUNTIME_CHECK
to ensure that the exchange was successful.
Under specific rare timing circumstances the uv_read_start() could
fail with UV_EINVAL when the connection is reset between the connect (or
accept) and the uv_read_start() call on the nmworker loop. Handle such
situation gracefully by propagating the errors from uv_read_start() into
upper layers, so the socket can be internally closed().
As we are going to use libuv outside of the netmgr, we need the shims to
be readily available for the rest of the codebase.
Move the "netmgr/uv-compat.h" to <isc/uv.h> and netmgr/uv-compat.c to
uv.c, and as a rule of thumb, the users of libuv should include
<isc/uv.h> instead of <uv.h> directly.
Additionally, merge netmgr/uverr2result.c into uv.c and rename the
single function from isc__nm_uverr2result() to isc_uverr2result().
Move the netmgr socket related functions from netmgr/netmgr.c and
netmgr/uv-compat.c to netmgr/socket.c, so they are all present all in
the same place. Adjust the names of couple interal functions
accordingly.
This commit ensures that write callbacks are getting called only after
the data has been sent via the network.
Without this fix, a situation could appear when a write callback could
get called before the actual encrypted data would have been sent to
the network. Instead, it would get called right after it would have
been passed to the OpenSSL (i.e. encrypted).
Most likely, the issue does not reveal itself often because the
callback call was asynchronous, so in most cases it should have been
called after the data has been sent, but that was not guaranteed by
the code logic.
Also, this commit removes one memory allocation (netievent) from a hot
path, as there is no need to call this callback asynchronously
anymore.
When we compile with libuv that has some capabilities via flags passed
to f.e. uv_udp_listen() or uv_udp_bind(), the call with such flags would
fail with invalid arguments when older libuv version is linked at the
runtime that doesn't understand the flag that was available at the
compile time.
Enforce minimal libuv version when flags have been available at the
compile time, but are not available at the runtime. This check is less
strict than enforcing the runtime libuv version to be same or higher
than compile time libuv version.
This commit adds isc_nmsocket_set_tlsctx() - an asynchronous function
that replaces the TLS context within a given TLS-enabled listener
socket object. It is based on the newly added reference counting
functionality.
The intention of adding this function is to add functionality to
replace a TLS context without recreating the whole socket object,
including the underlying TCP listener socket, as a BIND process might
not have enough permissions to re-create it fully on reconfiguration.
Previously, HAVE_SO_REUSEPORT_LB has been defined only in the private
netmgr-int.h header file, making the configuration of load balanced
sockets inoperable.
Move the missing HAVE_SO_REUSEPORT_LB define the isc/netmgr.h and add
missing isc_nm_getloadbalancesockets() implementation.
Previously, the option to enable kernel load balancing of the sockets
was always enabled when supported by the operating system (SO_REUSEPORT
on Linux and SO_REUSEPORT_LB on FreeBSD).
It was reported that in scenarios where the networking threads are also
responsible for processing long-running tasks (like RPZ processing, CATZ
processing or large zone transfers), this could lead to intermitten
brownouts for some clients, because the thread assigned by the operating
system might be busy. In such scenarious, the overall performance would
be better served by threads competing over the sockets because the idle
threads can pick up the incoming traffic.
Add new configuration option (`load-balance-sockets`) to allow enabling
or disabling the load balancing of the sockets.
Previously, the task privileged mode has been used only when the named
was starting up and loading the zones from the disk as the "first" thing
to do. The privileged task was setup with quantum == 2, which made the
taskmgr/netmgr spin around the privileged queue processing two events at
the time.
The same effect can be achieved by setting the quantum to UINT_MAX (e.g.
practically unlimited) for the loadzone task, hence the privileged task
mode was removed in favor of just processing all the events on the
loadzone task in a single task_run().
In couple places, we have missed INSIST(0) or ISC_UNREACHABLE()
replacement on some branches with UNREACHABLE(). Replace all
ISC_UNREACHABLE() or INSIST(0) calls with UNREACHABLE().
This commit adds support for ISC_R_TLSBADPEERCERT error code, which is
supposed to be used to signal for TLS peer certificates verification
in dig and other code.
The support for this error code is added to our TLS and TLS DNS
implementations.
This commit also adds isc_nm_verify_tls_peer_result_string() function
which is supposed to be used to get a textual description of the
reason for getting a ISC_R_TLSBADPEERCERT error.
Previously, it was possible to assign a bit of memory space in the
nmhandle to store the client data. This was complicated and prevents
further refactoring of isc_nmhandle_t caching (future work).
Instead of caching the data in the nmhandle, allocate the hot-path
ns_client_t objects from per-thread clientmgr memory context and just
assign it to the isc_nmhandle_t via isc_nmhandle_set().
Previously, the unreachable code paths would have to be tagged with:
INSIST(0);
ISC_UNREACHABLE();
There was also older parts of the code that used comment annotation:
/* NOTREACHED */
Unify the handling of unreachable code paths to just use:
UNREACHABLE();
The UNREACHABLE() macro now asserts when reached and also uses
__builtin_unreachable(); when such builtin is available in the compiler.
Gcc 7+ and Clang 10+ have implemented __attribute__((fallthrough)) which
is explicit version of the /* FALLTHROUGH */ comment we are currently
using.
Add and apply FALLTHROUGH macro that uses the attribute if available,
but does nothing on older compilers.
In one case (lib/dns/zone.c), using the macro revealed that we were
using the /* FALLTHROUGH */ comment in wrong place, remove that comment.
Instead of passing the "workers" variable back and forth along with
passing the single isc_nm_t instance, add isc_nm_getnworkers() function
that returns the number of netmgr threads are running.
Change the ns_interfacemgr and ns_taskmgr to utilize the newly acquired
knowledge.
When sock->closehandle_cb is set, we need to run nmhandle_detach_cb()
asynchronously to ensure correct order of multiple packets processing in
the isc__nm_process_sock_buffer(). When not run asynchronously, it
would cause:
a) out-of-order processing of the return codes from processbuffer();
b) stack growth because the next TCP DNS message read callback will
be called from within the current TCP DNS message read callback.
The sock->closehandle_cb is set to isc__nm_resume_processing() for TCP
sockets which calls isc__nm_process_sock_buffer(). If the read callback
(called from isc__nm_process_sock_buffer()->processbuffer()) doesn't
attach to the nmhandle (f.e. because it wants to drop the processing or
we send the response directly via uv_try_write()), the
isc__nm_resume_processing() (via .closehandle_cb) would call
isc__nm_process_sock_buffer() recursively.
The below shortened code path shows how the stack can grow:
1: ns__client_request(handle, ...);
2: isc_nm_tcpdns_sequential(handle);
3: ns_query_start(client, handle);
4: query_lookup(qctx);
5: query_send(qctcx->client);
6: isc__nmhandle_detach(&client->reqhandle);
7: nmhandle_detach_cb(&handle);
8: sock->closehandle_cb(sock); // isc__nm_resume_processing
9: isc__nm_process_sock_buffer(sock);
10: processbuffer(sock); // isc__nm_tcpdns_processbuffer
11: isc_nmhandle_attach(req->handle, &handle);
12: isc__nm_readcb(sock, req, ISC_R_SUCCESS);
13: isc__nm_async_readcb(NULL, ...);
14: uvreq->cb.recv(...); // ns__client_request
Instead, if 'sock->closehandle_cb' is set, we need to run detach the
handle asynchroniously in 'isc__nmhandle_detach', so that on line 8 in
the code flow above does not start this recursion. This ensures the
correct order when processing multiple packets in the function
'isc__nm_process_sock_buffer()' and prevents the stack growth.
When not run asynchronously, the out-of-order processing leaves the
first TCP socket open until all requests on the stream have been
processed.
If the pipelining is disabled on the TCP via `keep-response-order`
configuration option, named would keep the first socket in lingering
CLOSE_WAIT state when the client sends an incomplete packet and then
closes the connection from the client side.
Previously, the established TCP connections (both client and server)
would be gracefully closed waiting for the write timeout.
Don't wait for TCP connections to gracefully shutdown, but directly
reset them for faster shutdown.
Previously, there was a single per-socket write timer that would get
restarted for every new write. This turned out to be insufficient
because the other side could keep reseting the timer, and never reading
back the responses.
Change the single write timer to per-send timer which would in turn
reset the TCP connection on the first send timeout.
Previously the socket code would set the TCPv6 maximum segment size to
minimum value to prevent IP fragmentation for TCP. This was not yet
implemented for the network manager.
Implement network manager functions to set and use minimum MTU socket
option and set the TCP_MAXSEG socket option for both IPv4 and IPv6 and
use those to clamp the TCP maximum segment size for TCP, TCPDNS and
TLSDNS layers in the network manager to 1220 bytes, that is 1280 (IPv6
minimum link MTU) minus 40 (IPv6 fixed header) minus 20 (TCP fixed
header)
We already rely on a similar value for UDP to prevent IP fragmentation
and it make sense to use the same value for IPv4 and IPv6 because the
modern networks are required to support IPv6 packet sizes. If there's
need for small TCP segment values, the MTU on the interfaces needs to be
properly configured.
The IPV6_USE_MIN_MTU socket option directs the IP layer to limit the
IPv6 packet size to the minimum required supported MTU from the base
IPv6 specification, i.e. 1280 bytes. Many implementations of TCP
running over IPv6 neglect to check the IPV6_USE_MIN_MTU value when
performing MSS negotiation and when constructing a TCP segment despite
MSS being defined to be the MTU less the IP and TCP header sizes (60
bytes for IPv6). This leads to oversized IPv6 packets being sent
resulting in unintended Path Maximum Transport Unit Discovery (PMTUD)
being performed and to fragmented IPv6 packets being sent.
Add and use a function to set socket option to limit the MTU on IPv6
sockets to the minimum MTU (1280) both for UDP and TCP.
The current implementation of isc_queue uses Michael-Scott lock-free
queue that in turn uses hazard pointers. It was discovered that the way
we use the isc_queue, such complicated mechanism isn't really needed,
because most of the time, we either execute the work directly when on
nmthread (in case of UDP) or schedule the work from the matching
nmthreads.
Replace the current implementation of the isc_queue with a simple locked
ISC_LIST. There's a slight improvement - since copying the whole list
is very lightweight - we move the queue into a new list before we start
the processing and locking just for moving the queue and not for every
single item on the list.
NOTE: There's a room for future improvements - since we don't guarantee
the order in which the netievents are processed, we could have two lists
- one unlocked that would be used when scheduling the work from the
matching thread and one locked that would be used from non-matching
thread.
The isc__nmsocket_reset() was missing a case for raw TCP sockets (used
by RNDC and DoH) which would case a assertion failure when write timeout
would be triggered.
TCP sockets are now also properly handled in isc__nmsocket_reset().
When isc__nm_uvreq_t gets deactivated, it could be just put onto array
stack to be reused later to save some initialization time.
Unfortunately, this might hide some use-after-free errors.
Disable the inactive uvreqs caching when compiled with Address or
Thread Sanitizer.
When isc_nmhandle_t gets deactivated, it could be just put onto array
stack to be reused later to safe some initialization time.
Unfortunately, this might hide some use-after-free errors.
Disable the inactive handles caching when compiled with Address or
Thread Sanitizer.
The isc__nmsocket_t has locked array of isc_nmhandle_t that's not used
for anything. The isc__nmhandle_get() adds the isc_nmhandle_t to the
locked array (and resized if necessary) and removed when
isc_nmhandle_put() finally destroys the handle. That's all it does, so
it serves no useful purpose.
Remove the .ah_handles, .ah_size, and .ah_frees members of the
isc__nmsocket_t and .ah_pos member of the isc_nmhandle_t struct.
When the TCP, TCPDNS or TLSDNS connection times out, the isc__nm_uvreq_t
would be pushed into sock->inactivereqs before the uv_tcp_connect()
callback finishes. Because the isc__nmsocket_t keeps the list of
inactive isc__nm_uvreq_t, this would cause use-after-free only when the
sock->inactivereqs is full (which could never happen because the failure
happens in connection timeout callback) or when the sock->inactivereqs
mechanism is completely removed (f.e. when running under Address or
Thread Sanitizer).
Delay isc__nm_uvreq_t deallocation to the connection callback and only
signal the connection callback should be called by shutting down the
libuv socket from the connection timeout callback.
The keep-response-order option has been obsoleted, and in this commit,
remove the keep-response-order ACL map rendering the option no-op, the
call the isc_nm_sequential() and the now unused isc_nm_sequential()
function itself.
There was an artificial limit of 23 on the number of simultaneous
pipelined queries in the single TCP connection. The new network
managers is capable of handling "unlimited" (limited only by the TCP
read buffer size ) queries similar to "unlimited" handling of the DNS
queries receive over UDP.
Don't limit the number of TCP queries that we can process within a
single TCP read callback.
When invalid DNS message is received, there was a handling mechanism for
DoH that would be called to return proper HTTP response.
Reuse this mechanism and reset the TCP connection when the client is
blackholed, DNS message is completely bogus or the ns_client receives
response instead of query.