The root cause is the fix for CVE-2024-0760 (part 3), which resets
the TCP connection on a failed send. Specifically commit
4b7c61381f186e20a476c35032a871295ebbd385 stops reading on the socket
because the TCP connection is throttling.
When the tcpdns_send_cb callback thinks about restarting reading
on the socket, this fails because the socket is a client socket.
And nsupdate is a client and is using the same netmgr code.
This commit removes the requirement that the socket must be a server
socket, allowing reading on the socket again after being throttled.
Add query counters for DoT, DoH, unencrypted DoH and their proxied
counterparts. The protocols don't increment TCP/UDP counters anymore
since they aren't the same as plain DNS-over-53.
Currently, the outgoing UDP sockets have enabled
SO_REUSEADDR (SO_REUSEPORT on BSDs) which allows multiple UDP sockets to
bind to the same address+port. There's one caveat though - only a
single (the last one) socket is going to receive all the incoming
traffic. This in turn could lead to incoming DNS message matching to
invalid dns_dispatch and getting dropped.
Disable setting the SO_REUSEADDR on the outgoing UDP sockets. This
needs to be done explicitly because `uv_udp_open()` silently enables the
option on the socket.
If the operating system UDP queue gets full and the outgoing UDP sending
starts to be delayed, BIND 9 could exhibit memory spikes as it tries to
enqueue all the outgoing UDP messages. As those are not going to be
delivered anyway (as we argued when we stopped enlarging the operating
system send and receive buffers), try to send the UDP messages directly
using `uv_udp_try_send()` and if that fails, drop the outgoing UDP
message.
We currently set SO_INCOMING_CPU incorrectly, and testing by Ondrej
shows that fixing the issue and setting affinities is worse than letting
the kernel schedule threads without constraints. So we should not set
SO_INCOMING_CPU anymore.
The new
isc_log_createandusechannel() function combines following calls:
isc_log_createchannel()
isc_log_usechannel()
calls into a single call that cannot fail and therefore can be used in
places where we know this cannot fail thus simplifying the error
handling.
When TLS connection (TLSstream) connection was accepted, the children
listening socket was not attached to sock->server and thus it could have
been freed before all the accepted connections were actually closed.
In turn, this would cause us to call isc_tls_free() too soon - causing
cascade errors in pending SSL_read_ex() in the accepted connections.
Properly attach and detach the children listening socket when accepting
and closing the server connections.
This commit ensures that we are not attempting to accept an expired
TCP connection as we are not interested in any data that could have
been accumulated in its internal buffers. Now we just drop them for
good.
Be more aggressive when throttling the reading - when we can't send the
outgoing TCP synchronously with uv_try_write(), we start throttling the
reading immediately instead of waiting for the send buffers to fill up.
This should not affect behaved clients that read the data from the TCP
on the other end.
Due to omission it was possible to un-throttle a TCP connection
previously throttled due to the peer not reading back data we are
sending.
In particular, that affected DoH code, but it could also affect other
transports (the current or future ones) that pause/resume reading
according to its internal state.
The single TCP read can create as much as 64k divided by the minimum
size of the DNS message. This can clog the processing thread and trash
the memory allocator because we need to do as much as ~20k allocations in
a single UV loop tick.
Limit the number of the DNS messages processed in a single UV loop tick
to just single DNS message and limit the number of the outstanding DNS
messages back to 23. This effectively limits the number of pipelined
DNS messages to that number (this is the limit we already had before).
When TCP client would not read the DNS message sent to them, the TCP
sends inside named would accumulate and cause degradation of the
service. Throttle the reading from the TCP socket when we accumulate
enough DNS data to be sent. Currently this is limited in a way that a
single largest possible DNS message can fit into the buffer.
This commit ensures that an HTTP endpoints set reference is stored in
a socket object associated with an HTTP/2 stream instead of
referencing the global set stored inside a listener.
This helps to prevent an issue like follows:
1. BIND is configured to serve DoH clients;
2. A client is connected and one or more HTTP/2 stream is
created. Internal pointers are now pointing to the data on the
associated HTTP endpoints set;
3. BIND is reconfigured - the new endpoints set object is created and
promoted to all listeners;
4. The old pointers to the HTTP endpoints set data are now invalid.
Instead referencing a global object that is updated on
re-configurations we now store a local reference which prevents the
endpoints set objects to go out of scope prematurely.
It was reported that HTTP/2 session might get closed or even deleted
before all async. processing has been completed.
This commit addresses that: now we are avoiding using the object when
we do not need it or specifically check if the pointers used are not
'NULL' and by ensuring that there is at least one reference to the
session object while we are doing incoming data processing.
This commit makes the code more resilient to such issues in the
future.
To reduce memory pressure, we can add light per-loop (netmgr worker)
memory pools for isc_nmsocket_t structures. This will help in
situations where there's a lot of churn creating and destroying the
nmsockets.
Embedding isc_nmsocket_h2_t directly inside isc_nmsocket_t had increased
the size of isc_nmsocket_t to 1840 bytes. Making the isc_nmsocket_h2_t
to be a pointer to the structure and allocated on demand allows us to
reduce the size to 1208 bytes. While there are still some possible
reductions in the isc_nmsocket_t (embedded tlsstream, streamdns
structures), this was the far biggest drop in the memory usage.
The uv_req union member of struct isc__nm_uvreq contained libuv request
types that we don't use. Turns out that uv_getnameinfo_t is 1000 bytes
big and unnecessarily enlarged the whole structure. Remove all the
unused members from the uv_req union.
This commit improves TLS messages framing by avoiding an extra call to
SSL_write_ex(). Before that we would use an extra SSL_write_ex() call
to pass DNS message length to OpenSSL. That could create an extra TLS
frame, increasing number of bytes sent due to frame header and
padding.
This commit fixes that by making the code pass both DNS message length
and data at once, just like old TLS code did.
It should improve compatibility with some buggy clients that expect
both DNS message length and data to be in one TLS frame.
Older TLS DNS code worked like this, too.
The statistics channel does not expose the current number of TCP clients
connected, only the highwater. Therefore, users did not have an easy
means to collect statistics about TCP clients served over time. This
information could only be measured as a seperate mechanism via rndc by
looking at the TCP quota filled.
In order to expose the exact current count of connected TCP clients
(tracked by the "tcp-clients" quota) as a statistics counter, an
extra, dedicated Network Manager callback would need to be
implemented for that purpose (a counterpart of ns__client_tcpconn()
that would be run when a TCP connection is torn down), which is
inefficient. Instead, track the number of currently-connected TCP
clients separately for IPv4 and IPv6, as Network Manager statistics.
This commit removes wrong INSIST() condition as the assumption that if
'csock->recv_cb != NULL' iff 'csock->statichandle != NULL' is wrong.
There is no direct relation between 'csock->statichandle' and
'csock->recv_cb', as 'csock->statichandle' gets set when allocating a
handle regardless of 'csock->recv_cb' not being NULL, as it is
possible to attach to the handle without starting a read operation (at
the very least, it is correct to start writing before reading).
That condition made `cipher-suites` system test fail with crash on
some platforms in FIPS mode (namely, Oracle Linux 9) despite not being
related to FIPS at all.
This commit modifies TLS Stream and DNS-over-HTTPS transports so that
they do not use the "sock->iface" and "sock->peer" of the lower level
transport directly.
That did not cause any problems before, as things worked as expected,
but with the introduction of PROXYv2 support we use handles to store
the information in both PROXY Stream and UDP Proxy
transports. Therefore, in order to propagate the information (like
addresses), extracted from PROXYv2 headers, from the lower level
transports to the higher-level ones, we need to get that information
from the lower-level handles rather than sockets. That means that we
should get the peer and interface addresses using the intended
APIs ("isc_nmhandle_peeraddr()" and "isc_nmhandle_localaddr()").
Add the new isc__nm_dump_active_manager() function that can be used
for debugging purposes: it dumps all active sockets withing the
network manager instance.
This commit adds a new transport that supports PROXYv2 over UDP. It is
built on top of PROXYv2 handling code (just like PROXY Stream). It
works by processing and stripping the PROXYv2 headers at the beginning
of a datagram (when accepting a datagram) or by placing a PROXYv2
header to the beginning of an outgoing datagram.
The transport is built in such a way that incoming datagrams are being
handled with minimal memory allocations and copying.
In the previous versions of the NM, detecting the case when worker is
shutting down was not that important and actual status code did not
matter much. However, that might be not the case all the time.
This commit makes necessary modifications to the code.
This commit makes it possible to use PROXY Stream not only over TCP,
but also over TLS. That is, now PROXY Stream can work in two modes as
far as TLS is involved:
1. PROXY over (plain) TCP - PROXYv2 headers are sent unencrypted before
TLS handshake messages. That is the main mode as described in the
PROXY protocol specification (as it is clearly stated there), and most
of the software expects PROXYv2 support to be implemented that
way (e.g. HAProxy);
2. PROXY over (encrypted) TLS - PROXYv2 headers are sent after the TLS
handshake has happened. For example, this mode is being used (only ?)
by "dnsdist". As far as I can see, that is, in fact, a deviation from
the spec, but I can certainly see how PROXYv2 could end up being
implemented this way elsewhere.
This commit modifies TLS Stream to make it possible to use over PROXY
Stream. That is required to add PROVYv2 support into TLS-based
transports (DNS over HTTP, DNS over TLS).
This commit adds a new stream-based transport with an interface
compatible with TCP. The transport is built on top of TCP transport
and the new PROXYv2 handling code. Despite being built on top of TCP,
it can be easily extended to work on top of any TCP-like stream-based
transport. The intention of having this transport is to add PROXYv2
support into all existing stream-based DNS transport (DNS over TCP,
DNS over TLS, DNS over HTTP) by making the work on top of this new
transport.
The idea behind the transport is simple after accepting the connection
or connecting to a remote server it enters PROXYv2 handling mode: that
is, it either attempts to read (when accepting the connection) or send
(when establishing a connection) a PROXYv2 header. After that it works
like a mere wrapper on top of the underlying stream-based
transport (TCP).
The Unix Domain Sockets support in BIND 9 has been completely disabled
since BIND 9.18 and it has been a fatal error since then. Cleanup the
code and the documentation that suggest that Unix Domain Sockets are
supported.
Use the new isc_mem_c*() calloc-like API for allocations that are
zeroed.
In turn, this also fixes couple of incorrect usage of the ISC_MEM_ZERO
for structures that need to be zeroed explicitly.
There are few places where isc_mem_cput() is used on structures with a
flexible member (or similar).