Under specific rare timing circumstances the uv_read_start() could
fail with UV_EINVAL when the connection is reset between the connect (or
accept) and the uv_read_start() call on the nmworker loop. Handle such
situation gracefully by propagating the errors from uv_read_start() into
upper layers, so the socket can be internally closed().
Setting the sock->write_timeout from the TCP, TCPDNS, and TLSDNS send
functions could lead to (harmless) data race when setting the value for
the first time when the isc_nm_send() function would be called from
thread not-matching the socket we are sending to. Move the setting the
sock->write_timeout to the matching async function which is always
called from the matching thread.
As we are going to use libuv outside of the netmgr, we need the shims to
be readily available for the rest of the codebase.
Move the "netmgr/uv-compat.h" to <isc/uv.h> and netmgr/uv-compat.c to
uv.c, and as a rule of thumb, the users of libuv should include
<isc/uv.h> instead of <uv.h> directly.
Additionally, merge netmgr/uverr2result.c into uv.c and rename the
single function from isc__nm_uverr2result() to isc_uverr2result().
Move the netmgr socket related functions from netmgr/netmgr.c and
netmgr/uv-compat.c to netmgr/socket.c, so they are all present all in
the same place. Adjust the names of couple interal functions
accordingly.
For some applications, it's useful to not listen on full battery of
threads. Add workers argument to all isc_nm_listen*() functions and
convenience ISC_NM_LISTEN_ONE and ISC_NM_LISTEN_ALL macros.
Previously, the option to enable kernel load balancing of the sockets
was always enabled when supported by the operating system (SO_REUSEPORT
on Linux and SO_REUSEPORT_LB on FreeBSD).
It was reported that in scenarios where the networking threads are also
responsible for processing long-running tasks (like RPZ processing, CATZ
processing or large zone transfers), this could lead to intermitten
brownouts for some clients, because the thread assigned by the operating
system might be busy. In such scenarious, the overall performance would
be better served by threads competing over the sockets because the idle
threads can pick up the incoming traffic.
Add new configuration option (`load-balance-sockets`) to allow enabling
or disabling the load balancing of the sockets.
Previously, it was possible to assign a bit of memory space in the
nmhandle to store the client data. This was complicated and prevents
further refactoring of isc_nmhandle_t caching (future work).
Instead of caching the data in the nmhandle, allocate the hot-path
ns_client_t objects from per-thread clientmgr memory context and just
assign it to the isc_nmhandle_t via isc_nmhandle_set().
Previously, the unreachable code paths would have to be tagged with:
INSIST(0);
ISC_UNREACHABLE();
There was also older parts of the code that used comment annotation:
/* NOTREACHED */
Unify the handling of unreachable code paths to just use:
UNREACHABLE();
The UNREACHABLE() macro now asserts when reached and also uses
__builtin_unreachable(); when such builtin is available in the compiler.
Previously, there was a single per-socket write timer that would get
restarted for every new write. This turned out to be insufficient
because the other side could keep reseting the timer, and never reading
back the responses.
Change the single write timer to per-send timer which would in turn
reset the TCP connection on the first send timeout.
The C17 standard deprecated ATOMIC_VAR_INIT() macro (see [1]). Follow
the suite and remove the ATOMIC_VAR_INIT() usage in favor of simple
assignment of the value as this is what all supported stdatomic.h
implementations do anyway:
* MacOSX.plaform: #define ATOMIC_VAR_INIT(__v) {__v}
* Gcc stdatomic.h: #define ATOMIC_VAR_INIT(VALUE) (VALUE)
1. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p1138r0.pdf
Previously the socket code would set the TCPv6 maximum segment size to
minimum value to prevent IP fragmentation for TCP. This was not yet
implemented for the network manager.
Implement network manager functions to set and use minimum MTU socket
option and set the TCP_MAXSEG socket option for both IPv4 and IPv6 and
use those to clamp the TCP maximum segment size for TCP, TCPDNS and
TLSDNS layers in the network manager to 1220 bytes, that is 1280 (IPv6
minimum link MTU) minus 40 (IPv6 fixed header) minus 20 (TCP fixed
header)
We already rely on a similar value for UDP to prevent IP fragmentation
and it make sense to use the same value for IPv4 and IPv6 because the
modern networks are required to support IPv6 packet sizes. If there's
need for small TCP segment values, the MTU on the interfaces needs to be
properly configured.
The IPV6_USE_MIN_MTU socket option directs the IP layer to limit the
IPv6 packet size to the minimum required supported MTU from the base
IPv6 specification, i.e. 1280 bytes. Many implementations of TCP
running over IPv6 neglect to check the IPV6_USE_MIN_MTU value when
performing MSS negotiation and when constructing a TCP segment despite
MSS being defined to be the MTU less the IP and TCP header sizes (60
bytes for IPv6). This leads to oversized IPv6 packets being sent
resulting in unintended Path Maximum Transport Unit Discovery (PMTUD)
being performed and to fragmented IPv6 packets being sent.
Add and use a function to set socket option to limit the MTU on IPv6
sockets to the minimum MTU (1280) both for UDP and TCP.
When the TCP, TCPDNS or TLSDNS connection times out, the isc__nm_uvreq_t
would be pushed into sock->inactivereqs before the uv_tcp_connect()
callback finishes. Because the isc__nmsocket_t keeps the list of
inactive isc__nm_uvreq_t, this would cause use-after-free only when the
sock->inactivereqs is full (which could never happen because the failure
happens in connection timeout callback) or when the sock->inactivereqs
mechanism is completely removed (f.e. when running under Address or
Thread Sanitizer).
Delay isc__nm_uvreq_t deallocation to the connection callback and only
signal the connection callback should be called by shutting down the
libuv socket from the connection timeout callback.
When the outgoing TCP write buffers are full because the other party is
not reading the data, the uv_write() could wait indefinitely on the
uv_loop and never calling the callback. Add a new write timer that uses
the `tcp-idle-timeout` value to interrupt the TCP connection when we are
not able to send data for defined period of time.
Before adding the write timer, we have to remove the generic sock->timer
to sock->read_timer. We don't touch the function names to limit the
impact of the refactoring.
When isc_quota_attach_cb() API returns ISC_R_QUOTA (meaning hard quota
was reached) the accept_connection() would return without logging a
message about quota reached.
Change the connection callback to log the quota reached message.
Some operating systems (OpenBSD and DragonFly BSD) don't restrict the
IPv6 sockets to sending and receiving IPv6 packets only. Explicitly
enable the IPV6_V6ONLY socket option on the IPv6 sockets to prevent
failures from using the IPv4-mapped IPv6 address.
Previously, the netmgr/udp.c tried to detect the recvmmsg detection in
libuv with #ifdef UV_UDP_<foo> preprocessor macros. However, because
the UV_UDP_<foo> are not preprocessor macros, but enum members, the
detection didn't work. Because the detection didn't work, the code
didn't have access to the information when we received the final chunk
of the recvmmsg and tried to free the uvbuf every time. Fortunately,
the isc__nm_free_uvbuf() had a kludge that detected attempt to free in
the middle of the receive buffer, so the code worked.
However, libuv 1.37.0 changed the way the recvmmsg was enabled from
implicit to explicit, and we checked for yet another enum member
presence with preprocessor macro, so in fact libuv recvmmsg support was
never enabled with libuv >= 1.37.0.
This commit changes to the preprocessor macros to autoconf checks for
declaration, so the detection now works again. On top of that, it's now
possible to cleanup the alloc_cb and free_uvbuf functions because now,
the information whether we can or cannot free the buffer is available to
us.
This commit converts the license handling to adhere to the REUSE
specification. It specifically:
1. Adds used licnses to LICENSES/ directory
2. Add "isc" template for adding the copyright boilerplate
3. Changes all source files to include copyright and SPDX license
header, this includes all the C sources, documentation, zone files,
configuration files. There are notes in the doc/dev/copyrights file
on how to add correct headers to the new files.
4. Handle the rest that can't be modified via .reuse/dep5 file. The
binary (or otherwise unmodifiable) files could have license places
next to them in <foo>.license file, but this would lead to cluttered
repository and most of the files handled in the .reuse/dep5 file are
system test files.
Commit 9ee60e7a17bf34c7ef7f4d79e6a00ca45444ec8c erroneously introduced
duplicate conditions to several existing conditional statements
responsible for determining error codes passed to connection callbacks
upon failure. Fix the affected expressions to ensure connection
callbacks are invoked with:
- the ISC_R_SHUTTINGDOWN error code when a global netmgr shutdown is
in progress,
- the ISC_R_CANCELED error code when a specific operation has been
canceled.
This does not fix any known bugs, it only adjusts the changes introduced
by commit 9ee60e7a17bf34c7ef7f4d79e6a00ca45444ec8c so that they match
its original intent.
Previously, when TCP accept failed, we have logged a message with
ISC_LOG_ERROR level. One common case, how this could happen is that the
client hits TCP client quota and is put on hold and when resumed, the
client has already given up and closed the TCP connection. In such
case, the named would log:
TCP connection failed: socket is not connected
This message was quite confusing because it actually doesn't say that
it's related to the accepting the TCP connection and also it logs
everything on the ISC_LOG_ERROR level.
Change the log message to "Accepting TCP connection failed" and for
specific error states lower the severity of the log message to
ISC_LOG_INFO.
This commit fixes a peculiar corner case in the client-side DoT code
because of which a crash could occur during a zone transfer. A junk
DNS message should be sent at the end of a zone transfer via TLS to
trigger the crash (abort).
This commit, hopefully, fixes that.
Also, this commit adds similar changes to the TCP DNS code, as it
shares the same origin and most of the logic.
After support for route/netlink sockets is merged, not all sockets
will have stats counters associated with them, so it's now necessary
to check whether socket stats exist before incrementing or decrementing
them. rather than relying on the caller for this, we now just pass the
socket and an index, and the correct stats counter will be updated if
it exists.
- The read timer must always be stopped when reading stops.
- Read callbacks can now call isc_nm_read() again in TCP, TCPDNS and
TLSDNS; previously this caused an assertion.
- The wrong failure code could be sent after a UDP recv failure because
the if statements were in the wrong order. the check for a NULL
address needs to be after the check for an error code, otherwise the
result will always be set to ISC_R_EOF.
- When aborting a read or connect because the netmgr is shutting down,
use ISC_R_SHUTTINGDOWN. (ISC_R_CANCELED is now reserved for when the
read has been canceled by the caller.)
- A new function isc_nmhandle_timer_running() has been added enabling a
callback to check whether the timer has been reset after processing a
timeout.
- Incidental netmgr fix: always use isc__nm_closing() instead of
referencing sock->mgr->closing directly
- Corrected a few comments that used outdated function names.
this commit removes isc__nm_tcpdns_keepalive() and
isc__nm_tlsdns_keepalive(); keepalive for these protocols and
for TCP will now be set directly from isc_nmhandle_keepalive().
protocols that have an underlying TCP socket (i.e., TLS stream
and HTTP), now have protocol-specific routines, called by
isc_nmhandle_keeaplive(), to set the keepalive value on the
underlying socket.
previously, receiving a keepalive option had no effect on how
long named would keep the connection open; there was a place to
configure the keepalive timeout but it was never used. this commit
corrects that.
this also fixes an error in isc__nm_{tcp,tls}dns_keepalive()
in which the sense of a REQUIRE test was reversed; previously this
error had not been noticed because the functions were not being
used.
- fix some duplicated and out-of-order prototypes declared in
netmgr-int.h
- rename isc_nm_tcpdns_keepalive to isc__nm_tcpdns_keepalive as
it's for internal use
Previously, each protocol (TCPDNS, TLSDNS) has specified own function to
disable pipelining on the connection. An oversight would lead to
assertion failure when opcode is not query over non-TCPDNS protocol
because the isc_nm_tcpdns_sequential() function would be called over
non-TCPDNS socket. This commit removes the per-protocol functions and
refactors the code to have and use common isc_nm_sequential() function
that would either disable the pipelining on the socket or would handle
the request in per specific manner. Currently it ignores the call for
HTTP sockets and causes assertion failure for protocols where it doesn't
make sense to call the function at all.
The Windows support has been completely removed from the source tree
and BIND 9 now no longer supports native compilation on Windows.
We might consider reviewing mingw-w64 port if contributed by external
party, but no development efforts will be put into making BIND 9 compile
and run on Windows again.
The isc_nmiface_t type was holding just a single isc_sockaddr_t,
so we got rid of the datatype and use plain isc_sockaddr_t in place
where isc_nmiface_t was used before. This means less type-casting and
shorter path to access isc_sockaddr_t members.
At the same time, instead of keeping the reference to the isc_sockaddr_t
that was passed to us when we start listening, we will keep a local
copy. This prevents the data race on destruction of the ns_interface_t
objects where pending nmsockets could reference the sockaddr of already
destroyed ns_interface_t object.
This commit adds a new configuration option to set the receive and send
buffer sizes on the TCP and UDP netmgr sockets. The default is `0`
which doesn't set any value and just uses the value set by the operating
system.
There's no magic value here - set it too small and the performance will
drop, set it too large, the buffers can fill-up with queries that have
already timeouted on the client side and nobody is interested for the
answer and this would just make the server clog up even more by making
it produce useless work.
The `netstat -su` can be used on POSIX systems to monitor the receive
and send buffer errors.
Network manager events that require interlock (pause, resume, listen)
are now always executed in the same worker thread, mgr->workers[0],
to prevent races.
"stoplistening" events no longer require interlock.
The netmgr listening, stoplistening, pausing and resuming functions
now use barriers for synchronization, which makes the code much simpler.
isc/barrier.h defines isc_barrier macros as a front-end for uv_barrier
on platforms where that works, and pthread_barrier where it doesn't
(including TSAN builds).
when running read callbacks, if the event result is not ISC_R_SUCCESS,
the callback is always run asynchronously. this is a problem on timeout,
because there's no chance to reset the timer before the socket has
already been destroyed. this commit allows read callbacks to run
synchronously for both ISC_R_SUCCESS and ISC_R_TIMEDOUT result codes.
This commit changes the taskmgr to run the individual tasks on the
netmgr internal workers. While an effort has been put into keeping the
taskmgr interface intact, couple of changes have been made:
* The taskmgr has no concept of universal privileged mode - rather the
tasks are either privileged or unprivileged (normal). The privileged
tasks are run as a first thing when the netmgr is unpaused. There
are now four different queues in in the netmgr:
1. priority queue - netievent on the priority queue are run even when
the taskmgr enter exclusive mode and netmgr is paused. This is
needed to properly start listening on the interfaces, free
resources and resume.
2. privileged task queue - only privileged tasks are queued here and
this is the first queue that gets processed when network manager
is unpaused using isc_nm_resume(). All netmgr workers need to
clean the privileged task queue before they all proceed normal
operation. Both task queues are processed when the workers are
finished.
3. task queue - only (traditional) task are scheduled here and this
queue along with privileged task queues are process when the
netmgr workers are finishing. This is needed to process the task
shutdown events.
4. normal queue - this is the queue with netmgr events, e.g. reading,
sending, callbacks and pretty much everything is processed here.
* The isc_taskmgr_create() now requires initialized netmgr (isc_nm_t)
object.
* The isc_nm_destroy() function now waits for indefinite time, but it
will print out the active objects when in tracing mode
(-DNETMGR_TRACE=1 and -DNETMGR_TRACE_VERBOSE=1), the netmgr has been
made a little bit more asynchronous and it might take longer time to
shutdown all the active networking connections.
* Previously, the isc_nm_stoplistening() was a synchronous operation.
This has been changed and the isc_nm_stoplistening() just schedules
the child sockets to stop listening and exits. This was needed to
prevent a deadlock as the the (traditional) tasks are now executed on
the netmgr threads.
* The socket selection logic in isc__nm_udp_send() was flawed, but
fortunatelly, it was broken, so we never hit the problem where we
created uvreq_t on a socket from nmhandle_t, but then a different
socket could be picked up and then we were trying to run the send
callback on a socket that had different threadid than currently
running.
The isc_nm_tlsdnsconnect() call could end up with two connect callbacks
called when the timeout fired and the TCP connection was aborted,
but the TLS handshake was not complete yet. isc__nm_connecttimeout_cb()
forgot to clean up sock->tls.pending_req when the connect callback was
called with ISC_R_TIMEDOUT, leading to a second callback running later.
A new argument has been added to the isc__nm_*_failed_connect_cb and
isc__nm_*_failed_read_cb functions, to indicate whether the callback
needs to run asynchronously or not.
The isc_nm_*connect() functions were refactored to always return the
connection status via the connect callback instead of sometimes returning
the hard failure directly (for example, when the socket could not be
created, or when the network manager was shutting down).
This commit changes the connect functions in all the network manager
modules, and also makes the necessary refactoring changes in places
where the connect functions are called.
Serveral problems were discovered and fixed after the change in
the connection timeout in the previous commits:
* In TLSDNS, the connection callback was not called at all under some
circumstances when the TCP connection had been established, but the
TLS handshake hadn't been completed yet. Additional checks have
been put in place so that tls_cycle() will end early when the
nmsocket is invalidated by the isc__nm_tlsdns_shutdown() call.
* In TCP, TCPDNS and TLSDNS, new connections would be established
even when the network manager was shutting down. The new
call isc__nm_closing() has been added and is used to bail out
early even before uv_tcp_connect() is attempted.
Similarly to the read timeout, it's now possible to recover from
ISC_R_TIMEDOUT event by restarting the timer from the connect callback.
The change here also fixes platforms that missing the socket() options
to set the TCP connection timeout, by moving the timeout code into user
space. On platforms that support setting the connect timeout via a
socket option, the timeout has been hardcoded to 2 minutes (the maximum
value of tcp-initial-timeout).
The udp, tcpdns and tlsdns contained lot of cut&paste code or code that
was very similar making the stack harder to maintain as any change to
one would have to be copied to the the other protocols.
In this commit, we merge the common parts into the common functions
under isc__nm_<foo> namespace and just keep the little differences based
on the socket type.
After the TCPDNS refactoring the initial and idle timers were broken and
only the tcp-initial-timeout was always applied on the whole TCP
connection.
This broke any TCP connection that took longer than tcp-initial-timeout,
most often this would affect large zone AXFRs.
This commit changes the timeout logic in this way:
* On TCP connection accept the tcp-initial-timeout is applied
and the timer is started
* When we are processing and/or sending any DNS message the timer is
stopped
* When we stop processing all DNS messages, the tcp-idle-timeout
is applied and the timer is started again
Although harmless, the memmove() in tlsdns and tcpdns was guarded by a
current message length variable that was always bigger than 0 instead of
correct current buffer length remainder variable.
* Following the example set in 634bdfb16d8, the tlsdns netmgr
module now uses libuv and SSL primitives directly, rather than
opening a TLS socket which opens a TCP socket, as the previous
model was difficult to debug. Closes#2335.
* Remove the netmgr tls layer (we will have to re-add it for DoH)
* Add isc_tls API to wrap the OpenSSL SSL_CTX object into libisc
library; move the OpenSSL initialization/deinitialization from dstapi
needed for OpenSSL 1.0.x to the isc_tls_{initialize,destroy}()
* Add couple of new shims needed for OpenSSL 1.0.x
* When LibreSSL is used, require at least version 2.7.0 that
has the best OpenSSL 1.1.x compatibility and auto init/deinit
* Enforce OpenSSL 1.1.x usage on Windows
* Added a TLSDNS unit test and implemented a simple TLSDNS echo
server and client.
On Windows, we were limiting the number of listening children to just 1,
but we were then iterating on mgr->nworkers. That lead to scheduling
more async_*listen() than actually allocated and out-of-bound read-write
operation on the heap.
When we were in nmthread, the isc__nm_async_<proto>connect() function
executes in the same thread as the isc__nm_<proto>connect() and on a
failure, it would block indefinitely because the failure branch was
setting sock->active to false before the condition around the wait had a
chance to skip the WAIT().
This also fixes the zero system test being stuck on FreeBSD 11, so we
re-enable the test in the commit.
On platforms without load-balancing socket all the queries would be
handle by a single thread. Currently, the support for load-balanced
sockets is present in Linux with SO_REUSEPORT and FreeBSD 12 with
SO_REUSEPORT_LB.
This commit adds workaround for such platforms that:
1. setups single shared listening socket for all listening nmthreads for
UDP, TCP and TCPDNS netmgr transports
2. Calls uv_udp_bind/uv_tcp_bind on the underlying socket just once and
for rest of the nmthreads only copy the internal libuv flags (should
be just UV_HANDLE_BOUND and optionally UV_HANDLE_IPV6).
3. start reading on UDP socket or listening on TCP socket
The load distribution among the nmthreads is uneven, but it's still
better than utilizing just one thread for processing all the incoming
queries
On FreeBSD, the stack is destroyed more aggressively than on Linux and
that revealed a bug where we were allocating the 16-bit len for the
TCPDNS message on the stack and the buffer got garbled before the
uv_write() sendback was executed. Now, the len is part of the uvreq, so
we can safely pass it to the uv_write() as the req gets destroyed after
the sendcb is executed.