Simplify the stopping of the TCP children by using the isc_async API
from the loopmgr instead of using the asychronous netievent mechanism in
the netmgr.
Simplify the starting of the TCP children by using the isc_async API
from the loopmgr instead of using the asychronous netievent mechanism in
the netmgr.
The old code didn't handle race conditions and errors on systems
with non load balancing sockets gracefully. Look for an error on
any child socket and if found close all the child sockets and return
an error.
We already have a synchronization mechanism when starting the UDP and
TCP listener children - barriers. Change how we start the first-born
child (tid == 0), so we don't have to race for sock->parent->result and
sock->parent->fd.
Always track the per-worker sockets in the .active_sockets field in the
isc__networker_t struct and always track the per-socket handles in the
.active_handles field ian the isc_nmsocket_t struct.
On some platforms, when a synchronizing barrier is cleared, one
thread can progress while other threads are still in the process
of releasing the barrier. If a barrier is reused by the progressing
thread during this window, it can cause a deadlock. This can occur if,
for example, we stop listening immediately after we start, because the
stop and listen functions both use socket->barrier. This has been
addressed by using separate barrier objects for stop and listen.
The new internal function works in the same way as isc_nm_send()
except that it sends a DNS message size ahead of the DNS message
data (the format used in DNS over TCP).
The intention is to provide a fast path for sending DNS messages over
streams protocols - that is, without allocating any intermediate
memory buffers.
This commit adds a manual read timer control mode to the TCP
code (adding isc__nmhandle_set_manual_timer() as the interface to it).
Manual read timer control mode suppresses read timer restarting the
read timer when receiving any amount of data. This way the read timer
can be controlled manually using:
* isc__nmsocket_timer_start();
* isc__nmsocket_timer_stop();
* isc__nmsocket_timer_restart().
The change is required to make it possible to implement more
sophisticated read timer control policies in DNS transports, built on
top of TCP.
Add internal logging functions isc__netmgr_log, isc__nmsocket_log(), and
isc__nmhandle_log() that can be used to add logging messages to the
netmgr, and change all direct use of isc_log_write() to use those
logging functions to properly prefix them with netmgr, nmsocket and
nmsocket+nmhandle.
This commit make TCP code use uv_try_write() on best effort basis,
just like TCP DNS and TLS DNS code does.
This optimisation was added in
'caa5b6548a11da6ca772d6f7e10db3a164a18f8d' but, similar change was
mistakenly omitted for generic TCP code. This commit fixes that.
Previously, the send callback would be synchronous only on success. Add
an option (similar to what other callbacks have) to decide whether we
need the asynchronous send callback on a higher level.
On a general level, we need the asynchronous callbacks to happen only
when we are invoking the callback from the public API. If the path to
the callback went through the libuv callback or netmgr callback, we are
already on asynchronous path, and there's no need to make the call to
the callback asynchronous again.
For the send callback, this means we need the asynchronous path for
failure paths inside the isc_nm_send() (which calls isc__nm_udp_send(),
isc__nm_tcp_send(), etc...) - all other invocations of the send callback
could be synchronous, because those are called from the respective libuv
send callbacks.
Previously, the read callback would be synchronous only on success or
timeout. Add an option (similar to what other callbacks have) to decide
whether we need the asynchronous read callback on a higher level.
On a general level, we need the asynchronous callbacks to happen only
when we are invoking the callback from the public API. If the path to
the callback went through the libuv callback or netmgr callback, we are
already on asynchronous path, and there's no need to make the call to
the callback asynchronous again.
For the read callback, this means we need the asynchronous path for
failure paths inside the isc_nm_read() (which calls isc__nm_udp_read(),
isc__nm_tcp_read(), etc...) - all other invocations of the read callback
could be synchronous, because those are called from the respective libuv
or netmgr read callbacks.
This commit introduces a primitive isc__nmsocket_stop() which performs
shutting down on a multilayered socket ensuring the proper order of
the operations.
The shared data within the socket object can be destroyed after the
call completed, as it is guaranteed to not be used from within the
context of other worker threads.
The check is left from when tcp_connect_direct() called isc__nm_socket()
and it was uncertain whether it had succeeded, but now isc__nm_socket()
is called before tcp_connect_direct(), so sock->fd cannot be -1.
*** CID 357292: (REVERSE_NEGATIVE)
/lib/isc/netmgr/tcp.c: 309 in isc_nm_tcpconnect()
303
304 atomic_store(&sock->active, true);
305
306 result = tcp_connect_direct(sock, req);
307 if (result != ISC_R_SUCCESS) {
308 atomic_store(&sock->active, false);
>>> CID 357292: (REVERSE_NEGATIVE)
>>> You might be using variable "sock->fd" before verifying that it is >= 0.
309 if (sock->fd != (uv_os_sock_t)(-1)) {
310 isc__nm_tcp_close(sock);
311 }
312 isc__nm_connectcb(sock, req, result, true);
313 }
314
Add new semantic patch to replace the straightfoward uses of:
ptr = isc_mem_{get,allocate}(..., size);
memset(ptr, 0, size);
with the new API call:
ptr = isc_mem_{get,allocate}x(..., size, ISC_MEM_ZERO);
After the loopmgr work has been merged, we can now cleanup the TCP and
TLS protocols a little bit, because there are stronger guarantees that
the sockets will be kept on the respective loops/threads. We only need
asynchronous call for listening sockets (start, stop) and reading from
the TCP (because the isc_nm_read() might be called from read callback
again.
This commit does the following changes (they are intertwined together):
1. Cleanup most of the asynchronous events in the TCP code, and add
comments for the events that needs to be kept asynchronous.
2. Remove isc_nm_resumeread() from the netmgr API, and replace
isc_nm_resumeread() calls with existing isc_nm_read() calls.
3. Remove isc_nm_pauseread() from the netmgr API, and replace
isc_nm_pauseread() calls with a new isc_nm_read_stop() call.
4. Disable the isc_nm_cancelread() for the streaming protocols, only the
datagram-like protocols can use isc_nm_cancelread().
5. Add isc_nmhandle_close() that can be used to shutdown the socket
earlier than after the last detach. Formerly, the socket would be
closed only after all reading and sending would be finished and the
last reference would be detached. The new isc_nmhandle_close() can
be used to close the underlying socket earlier, so all the other
asynchronous calls would call their respective callbacks immediately.
Co-authored-by: Ondřej Surý <ondrej@isc.org>
Co-authored-by: Artem Boldariev <artem@isc.org>
Simplify the closing code - during the loopmgr implementation, it was
discovered that the various lists used by the uv_loop_t aren't FIFO, but
LIFO. See doc/dev/libuv.md for more details.
With this knowledge, we can close the protocol handles (uv_udp_t and
uv_tcp_t) and uv_timer_t at the same time by reordering the uv_close()
calls, and thus making sure that after calling the
isc__nm_stoplistening(), the code will not issue any additional callback
calls (accept, read) on the socket that stopped listening.
This might help with the TLS and DoH shutting down sequence as described
in the [GL #3509] as we now stop the reading, stop the timer and call
the uv_close() as earliest as possible.
When we are closing the listening sockets, there's a time window in
which the TCP connection could be accepted although the respective
stoplistening function has already returned to control to the caller.
Clear the accept callback function early, so it doesn't get called when
we are not interested in the incoming connections anymore.
Previously:
* applications were using isc_app as the base unit for running the
application and signal handling.
* networking was handled in the netmgr layer, which would start a
number of threads, each with a uv_loop event loop.
* task/event handling was done in the isc_task unit, which used
netmgr event loops to run the isc_event calls.
In this refactoring:
* the network manager now uses isc_loop instead of maintaining its
own worker threads and event loops.
* the taskmgr that manages isc_task instances now also uses isc_loopmgr,
and every isc_task runs on a specific isc_loop bound to the specific
thread.
* applications have been updated as necessary to use the new API.
* new ISC_LOOP_TEST macros have been added to enable unit tests to
run isc_loop event loops. unit tests have been updated to use this
where needed.
The STATID_CONNECT and STATID_CONNECTFAIL statistics were used
incorrectly. The STATID_CONNECT was incremented twice (once in
the *_connect_direct() and once in the callback) and STATID_CONNECTFAIL
would not be incremented at all if the failure happened in the callback.
Closes: #3452
On FreeBSD (and perhaps other *BSD) systems, the TCP connect() call (via
uv_tcp_connect()) can fail with transient UV_EADDRINUSE error. The UDP
code already handles this by trying three times (is a charm) before
giving up. Add a code for the TCP, TCPDNS and TLSDNS layers to also try
three times before giving up by calling uv_tcp_connect() from the
callback two more time on UV_EADDRINUSE error.
Additionally, stop the timer only if we succeed or on hard error via
isc__nm_failed_connect_cb().
Under specific rare timing circumstances the uv_read_start() could
fail with UV_EINVAL when the connection is reset between the connect (or
accept) and the uv_read_start() call on the nmworker loop. Handle such
situation gracefully by propagating the errors from uv_read_start() into
upper layers, so the socket can be internally closed().
Setting the sock->write_timeout from the TCP, TCPDNS, and TLSDNS send
functions could lead to (harmless) data race when setting the value for
the first time when the isc_nm_send() function would be called from
thread not-matching the socket we are sending to. Move the setting the
sock->write_timeout to the matching async function which is always
called from the matching thread.
As we are going to use libuv outside of the netmgr, we need the shims to
be readily available for the rest of the codebase.
Move the "netmgr/uv-compat.h" to <isc/uv.h> and netmgr/uv-compat.c to
uv.c, and as a rule of thumb, the users of libuv should include
<isc/uv.h> instead of <uv.h> directly.
Additionally, merge netmgr/uverr2result.c into uv.c and rename the
single function from isc__nm_uverr2result() to isc_uverr2result().
Move the netmgr socket related functions from netmgr/netmgr.c and
netmgr/uv-compat.c to netmgr/socket.c, so they are all present all in
the same place. Adjust the names of couple interal functions
accordingly.
For some applications, it's useful to not listen on full battery of
threads. Add workers argument to all isc_nm_listen*() functions and
convenience ISC_NM_LISTEN_ONE and ISC_NM_LISTEN_ALL macros.
Previously, the option to enable kernel load balancing of the sockets
was always enabled when supported by the operating system (SO_REUSEPORT
on Linux and SO_REUSEPORT_LB on FreeBSD).
It was reported that in scenarios where the networking threads are also
responsible for processing long-running tasks (like RPZ processing, CATZ
processing or large zone transfers), this could lead to intermitten
brownouts for some clients, because the thread assigned by the operating
system might be busy. In such scenarious, the overall performance would
be better served by threads competing over the sockets because the idle
threads can pick up the incoming traffic.
Add new configuration option (`load-balance-sockets`) to allow enabling
or disabling the load balancing of the sockets.
Previously, it was possible to assign a bit of memory space in the
nmhandle to store the client data. This was complicated and prevents
further refactoring of isc_nmhandle_t caching (future work).
Instead of caching the data in the nmhandle, allocate the hot-path
ns_client_t objects from per-thread clientmgr memory context and just
assign it to the isc_nmhandle_t via isc_nmhandle_set().
Previously, the unreachable code paths would have to be tagged with:
INSIST(0);
ISC_UNREACHABLE();
There was also older parts of the code that used comment annotation:
/* NOTREACHED */
Unify the handling of unreachable code paths to just use:
UNREACHABLE();
The UNREACHABLE() macro now asserts when reached and also uses
__builtin_unreachable(); when such builtin is available in the compiler.
Previously, there was a single per-socket write timer that would get
restarted for every new write. This turned out to be insufficient
because the other side could keep reseting the timer, and never reading
back the responses.
Change the single write timer to per-send timer which would in turn
reset the TCP connection on the first send timeout.
The C17 standard deprecated ATOMIC_VAR_INIT() macro (see [1]). Follow
the suite and remove the ATOMIC_VAR_INIT() usage in favor of simple
assignment of the value as this is what all supported stdatomic.h
implementations do anyway:
* MacOSX.plaform: #define ATOMIC_VAR_INIT(__v) {__v}
* Gcc stdatomic.h: #define ATOMIC_VAR_INIT(VALUE) (VALUE)
1. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p1138r0.pdf
Previously the socket code would set the TCPv6 maximum segment size to
minimum value to prevent IP fragmentation for TCP. This was not yet
implemented for the network manager.
Implement network manager functions to set and use minimum MTU socket
option and set the TCP_MAXSEG socket option for both IPv4 and IPv6 and
use those to clamp the TCP maximum segment size for TCP, TCPDNS and
TLSDNS layers in the network manager to 1220 bytes, that is 1280 (IPv6
minimum link MTU) minus 40 (IPv6 fixed header) minus 20 (TCP fixed
header)
We already rely on a similar value for UDP to prevent IP fragmentation
and it make sense to use the same value for IPv4 and IPv6 because the
modern networks are required to support IPv6 packet sizes. If there's
need for small TCP segment values, the MTU on the interfaces needs to be
properly configured.
The IPV6_USE_MIN_MTU socket option directs the IP layer to limit the
IPv6 packet size to the minimum required supported MTU from the base
IPv6 specification, i.e. 1280 bytes. Many implementations of TCP
running over IPv6 neglect to check the IPV6_USE_MIN_MTU value when
performing MSS negotiation and when constructing a TCP segment despite
MSS being defined to be the MTU less the IP and TCP header sizes (60
bytes for IPv6). This leads to oversized IPv6 packets being sent
resulting in unintended Path Maximum Transport Unit Discovery (PMTUD)
being performed and to fragmented IPv6 packets being sent.
Add and use a function to set socket option to limit the MTU on IPv6
sockets to the minimum MTU (1280) both for UDP and TCP.
When the TCP, TCPDNS or TLSDNS connection times out, the isc__nm_uvreq_t
would be pushed into sock->inactivereqs before the uv_tcp_connect()
callback finishes. Because the isc__nmsocket_t keeps the list of
inactive isc__nm_uvreq_t, this would cause use-after-free only when the
sock->inactivereqs is full (which could never happen because the failure
happens in connection timeout callback) or when the sock->inactivereqs
mechanism is completely removed (f.e. when running under Address or
Thread Sanitizer).
Delay isc__nm_uvreq_t deallocation to the connection callback and only
signal the connection callback should be called by shutting down the
libuv socket from the connection timeout callback.
When the outgoing TCP write buffers are full because the other party is
not reading the data, the uv_write() could wait indefinitely on the
uv_loop and never calling the callback. Add a new write timer that uses
the `tcp-idle-timeout` value to interrupt the TCP connection when we are
not able to send data for defined period of time.
Before adding the write timer, we have to remove the generic sock->timer
to sock->read_timer. We don't touch the function names to limit the
impact of the refactoring.
When isc_quota_attach_cb() API returns ISC_R_QUOTA (meaning hard quota
was reached) the accept_connection() would return without logging a
message about quota reached.
Change the connection callback to log the quota reached message.
Some operating systems (OpenBSD and DragonFly BSD) don't restrict the
IPv6 sockets to sending and receiving IPv6 packets only. Explicitly
enable the IPV6_V6ONLY socket option on the IPv6 sockets to prevent
failures from using the IPv4-mapped IPv6 address.
The commit itself is harmless, but at the same time it is also useless,
so we are reverting it.
This reverts commit 11c869a3d53eafa4083b404e6b6686a120919c26.
Previously, the netmgr/udp.c tried to detect the recvmmsg detection in
libuv with #ifdef UV_UDP_<foo> preprocessor macros. However, because
the UV_UDP_<foo> are not preprocessor macros, but enum members, the
detection didn't work. Because the detection didn't work, the code
didn't have access to the information when we received the final chunk
of the recvmmsg and tried to free the uvbuf every time. Fortunately,
the isc__nm_free_uvbuf() had a kludge that detected attempt to free in
the middle of the receive buffer, so the code worked.
However, libuv 1.37.0 changed the way the recvmmsg was enabled from
implicit to explicit, and we checked for yet another enum member
presence with preprocessor macro, so in fact libuv recvmmsg support was
never enabled with libuv >= 1.37.0.
This commit changes to the preprocessor macros to autoconf checks for
declaration, so the detection now works again. On top of that, it's now
possible to cleanup the alloc_cb and free_uvbuf functions because now,
the information whether we can or cannot free the buffer is available to
us.
This commit converts the license handling to adhere to the REUSE
specification. It specifically:
1. Adds used licnses to LICENSES/ directory
2. Add "isc" template for adding the copyright boilerplate
3. Changes all source files to include copyright and SPDX license
header, this includes all the C sources, documentation, zone files,
configuration files. There are notes in the doc/dev/copyrights file
on how to add correct headers to the new files.
4. Handle the rest that can't be modified via .reuse/dep5 file. The
binary (or otherwise unmodifiable) files could have license places
next to them in <foo>.license file, but this would lead to cluttered
repository and most of the files handled in the .reuse/dep5 file are
system test files.
The isc__nm_tcp_resumeread() was using maybe_enqueue function to enqueue
netmgr event which could case the read callback to be executed
immediately if there was enough data waiting in the TCP queue.
If such thing would happen, the read callback would be called before the
previous read callback was finished and the worker receive buffer would
be still marked "in use" causing a assertion failure.
This would affect only raw TCP channels, e.g. rndc and http statistics.
Commit 9ee60e7a17bf34c7ef7f4d79e6a00ca45444ec8c erroneously introduced
duplicate conditions to several existing conditional statements
responsible for determining error codes passed to connection callbacks
upon failure. Fix the affected expressions to ensure connection
callbacks are invoked with:
- the ISC_R_SHUTTINGDOWN error code when a global netmgr shutdown is
in progress,
- the ISC_R_CANCELED error code when a specific operation has been
canceled.
This does not fix any known bugs, it only adjusts the changes introduced
by commit 9ee60e7a17bf34c7ef7f4d79e6a00ca45444ec8c so that they match
its original intent.
Previously, when TCP accept failed, we have logged a message with
ISC_LOG_ERROR level. One common case, how this could happen is that the
client hits TCP client quota and is put on hold and when resumed, the
client has already given up and closed the TCP connection. In such
case, the named would log:
TCP connection failed: socket is not connected
This message was quite confusing because it actually doesn't say that
it's related to the accepting the TCP connection and also it logs
everything on the ISC_LOG_ERROR level.
Change the log message to "Accepting TCP connection failed" and for
specific error states lower the severity of the log message to
ISC_LOG_INFO.
After support for route/netlink sockets is merged, not all sockets
will have stats counters associated with them, so it's now necessary
to check whether socket stats exist before incrementing or decrementing
them. rather than relying on the caller for this, we now just pass the
socket and an index, and the correct stats counter will be updated if
it exists.
- The read timer must always be stopped when reading stops.
- Read callbacks can now call isc_nm_read() again in TCP, TCPDNS and
TLSDNS; previously this caused an assertion.
- The wrong failure code could be sent after a UDP recv failure because
the if statements were in the wrong order. the check for a NULL
address needs to be after the check for an error code, otherwise the
result will always be set to ISC_R_EOF.
- When aborting a read or connect because the netmgr is shutting down,
use ISC_R_SHUTTINGDOWN. (ISC_R_CANCELED is now reserved for when the
read has been canceled by the caller.)
- A new function isc_nmhandle_timer_running() has been added enabling a
callback to check whether the timer has been reset after processing a
timeout.
- Incidental netmgr fix: always use isc__nm_closing() instead of
referencing sock->mgr->closing directly
- Corrected a few comments that used outdated function names.
The Windows support has been completely removed from the source tree
and BIND 9 now no longer supports native compilation on Windows.
We might consider reviewing mingw-w64 port if contributed by external
party, but no development efforts will be put into making BIND 9 compile
and run on Windows again.
The isc_nmiface_t type was holding just a single isc_sockaddr_t,
so we got rid of the datatype and use plain isc_sockaddr_t in place
where isc_nmiface_t was used before. This means less type-casting and
shorter path to access isc_sockaddr_t members.
At the same time, instead of keeping the reference to the isc_sockaddr_t
that was passed to us when we start listening, we will keep a local
copy. This prevents the data race on destruction of the ns_interface_t
objects where pending nmsockets could reference the sockaddr of already
destroyed ns_interface_t object.
This commit adds a new configuration option to set the receive and send
buffer sizes on the TCP and UDP netmgr sockets. The default is `0`
which doesn't set any value and just uses the value set by the operating
system.
There's no magic value here - set it too small and the performance will
drop, set it too large, the buffers can fill-up with queries that have
already timeouted on the client side and nobody is interested for the
answer and this would just make the server clog up even more by making
it produce useless work.
The `netstat -su` can be used on POSIX systems to monitor the receive
and send buffer errors.