This commit makes it possible to use PROXY Stream not only over TCP,
but also over TLS. That is, now PROXY Stream can work in two modes as
far as TLS is involved:
1. PROXY over (plain) TCP - PROXYv2 headers are sent unencrypted,
before the TLS handshake messages. This is the main mode, as described
in the PROXY protocol specification (it is clearly stated there), and
most software expects PROXYv2 support to be implemented this
way (e.g. HAProxy);
2. PROXY over (encrypted) TLS - PROXYv2 headers are sent after the TLS
handshake has completed. For example, this mode is used (only?)
by "dnsdist". As far as I can see, this is, in fact, a deviation from
the spec, but I can certainly see how PROXYv2 could end up being
implemented this way elsewhere.
This commit adds a new stream-based transport with an interface
compatible with TCP. The transport is built on top of the TCP
transport and the new PROXYv2 handling code. Despite being built on
top of TCP, it can easily be extended to work on top of any TCP-like
stream-based transport. The intention behind this transport is to add
PROXYv2 support to all existing stream-based DNS transports (DNS over
TCP, DNS over TLS, DNS over HTTP) by making them work on top of this
new transport.
The idea behind the transport is simple: after accepting a connection
or connecting to a remote server, it enters PROXYv2 handling mode; that
is, it either attempts to read (when accepting a connection) or
send (when establishing a connection) a PROXYv2 header. After that, it
works as a mere wrapper on top of the underlying stream-based
transport (TCP).
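For reference, a minimal sketch of validating the fixed 16-byte PROXYv2
prefix on the accepting side, following the wire format from the PROXY
protocol specification (the helper name and surrounding code are
illustrative, not the code added by this commit):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* The 12-byte PROXYv2 signature from the protocol specification. */
static const uint8_t proxyv2_sig[12] = { 0x0D, 0x0A, 0x0D, 0x0A, 0x00, 0x0D,
					 0x0A, 0x51, 0x55, 0x49, 0x54, 0x0A };

/*
 * Returns true if 'buf' starts with a valid PROXYv2 prefix; '*extlen'
 * is set to the number of bytes (addresses + TLVs) that follow the
 * fixed 16-byte prefix.
 */
static bool
proxyv2_parse_prefix(const uint8_t *buf, size_t len, size_t *extlen) {
	if (len < 16 || memcmp(buf, proxyv2_sig, sizeof(proxyv2_sig)) != 0) {
		return false;
	}
	if ((buf[12] >> 4) != 0x2) { /* protocol version must be 2 */
		return false;
	}
	/* buf[12] & 0x0F is the command (LOCAL/PROXY); buf[13] holds
	 * the address family and transport protocol. */
	*extlen = ((size_t)buf[14] << 8) | buf[15]; /* big-endian */
	return true;
}
```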
Instead of growing and never shrinking the list of inactive
handles (reused mostly for UDP connections), cap the number of inactive
handles kept at 64. Instead of caching the inactive handles for all
listening sockets, enable the caching only on UDP listening sockets.
For TCP, the handles were cached for each accepted socket, thus reusing
the handles only within long-standing TCP connections, but never across
different TCP streams.
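A minimal sketch of the bounded free-list idea (illustrative types and
names, not the netmgr's actual structures):

```c
#include <stdlib.h>

#define MAX_INACTIVE_HANDLES 64

typedef struct handle {
	struct handle *next;
	/* ... payload ... */
} handle_t;

typedef struct {
	handle_t *head;
	size_t	  count;
} handle_cache_t;

/* Keep at most MAX_INACTIVE_HANDLES handles; free the excess instead
 * of letting the cache grow without bound. */
static void
cache_put(handle_cache_t *cache, handle_t *h) {
	if (cache->count >= MAX_INACTIVE_HANDLES) {
		free(h);
		return;
	}
	h->next = cache->head;
	cache->head = h;
	cache->count++;
}

static handle_t *
cache_get(handle_cache_t *cache) {
	handle_t *h = cache->head;
	if (h != NULL) {
		cache->head = h->next;
		cache->count--;
	}
	return h;
}
```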
When shutting down TCP sockets, the read callback calling logic was
flawed: it would call either one callback too few or one too many. Fix
the logic as follows (a condensed sketch of the resulting dispatch
appears after the list):
1. When isc_nm_read() has been called but isc_nm_read_stop() hasn't on
the handle, the read callback will be called with ISC_R_CANCELED to
cancel active reading from the socket/handle.
2. When isc_nm_read() has been called and isc_nm_read_stop() has been
called on the handle, the read callback will be called with
ISC_R_SHUTTINGDOWN to signal that the dormant (not-reading) socket
is being shut down.
3. The .reading and .recv_read flags are a little bit tricky. The
.reading flag indicates whether the outer layer is reading the
data (that would be uv_tcp_t for TCP and isc_nmsocket_t (TCP) for
TLSStream); the .recv_read flag indicates whether somebody is
interested in the data read from the socket.
Usually, you would expect .reading to be false when .recv_read is
false, but it gets even more tricky with TLSStream, as the TLS protocol
might need to read from the socket even when sending data.
Fix the usage of the .recv_read and .reading flags in the TLSStream
to match their true meaning - which mostly consists of using .recv_read
everywhere and then wrapping isc_nm_read() and isc_nm_read_stop()
with the .reading flag.
4. The TLS failed-read helper has been modified to resemble the TCP
code as much as possible; the clearing and re-setting of the .recv_read
flag in the TCP timeout code has been fixed; and .recv_read is now
cleared when isc_nm_read_stop() is called on the streaming socket.
5. The use of the Network Manager in the named_controlconf,
isccc_ccmsg, and isc_httpd units has been greatly simplified thanks to
the improved design.
6. More unit tests for TCP and TLS testing the shutdown conditions have
been added.
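A condensed sketch of the dispatch described in points 1 and 2 (the
types, flags, and helper here are illustrative stand-ins for the netmgr
internals):

```c
#include <stdbool.h>
#include <stddef.h>

typedef enum { R_CANCELED, R_SHUTTINGDOWN } result_t; /* stand-ins */

typedef struct sock {
	void (*recv_cb)(struct sock *, result_t); /* set by isc_nm_read() */
	bool recv_read; /* true between isc_nm_read() and isc_nm_read_stop() */
} sock_t;

/* Decide which result the pending read callback receives at shutdown. */
static void
stream_shutdown_reading(sock_t *sock) {
	if (sock->recv_cb == NULL) {
		return; /* isc_nm_read() was never called: no callback due */
	}
	if (sock->recv_read) {
		/* 1. An active read is pending: cancel it. */
		sock->recv_cb(sock, R_CANCELED);
	} else {
		/* 2. isc_nm_read_stop() was called earlier: the dormant
		 * socket is being shut down. */
		sock->recv_cb(sock, R_SHUTTINGDOWN);
	}
}
```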
Co-authored-by: Ondřej Surý <ondrej@isc.org>
Co-authored-by: Artem Boldariev <artem@isc.org>
By inspecting the code, it was discovered that the .sendbuf member of
the isc__nm_networker_t was unused and just consuming ~64k per worker.
Remove the member and the associated allocation/deallocation.
In e18541287231b721c9cdb7e492697a2a80fd83fc, the TCP accept quota code
became broken in a subtle way - the quota would get initialized on the
first accept for the server socket and then deleted from the server
socket, so it would never get applied again.
Properly fixing this required a bigger refactoring of the isc_quota API
code to make it much simpler. The new code decouples the ownership of
the quota and acquiring/releasing the quota limit.
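A sketch of what the decoupling means in practice (the acquire/release
names are taken from the description; treat the exact signatures as
assumptions):

```c
#include <isc/quota.h>
#include <isc/result.h>

static void
accept_one(isc_quota_t *quota) {
	/*
	 * Ownership is decoupled from acquiring/releasing: the quota
	 * object itself stays with the listening socket, and each
	 * accepted connection just takes and returns one unit, so the
	 * limit keeps being applied on every accept.
	 */
	if (isc_quota_acquire(quota) != ISC_R_SUCCESS) {
		return; /* over quota: refuse or defer the connection */
	}
	/* ... handle the connection ... */
	isc_quota_release(quota);
}
```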
During the refactoring, it became clearer that we need to use the
callback from the child side of the accepted connection, and not from
the server side.
Instead of using isc_async_run() when closing a StreamDNS handle, add
an isc_job_t member to the isc_nmhandle_t structure and use
isc_job_run() to avoid allocation/deallocation on the StreamDNS hot
path.
Instead of marking the unused entities with the UNUSED(x) macro in the
function body, use an `ISC_ATTR_UNUSED` attribute macro that expands to
C23 [[maybe_unused]] or to __attribute__((__unused__)) as a fallback.
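A plausible definition along those lines (the feature test shown is
illustrative; the actual header may differ):

```c
/* Prefer the standard C23 attribute; fall back to the GNU one. */
#if defined(__has_c_attribute)
#if __has_c_attribute(maybe_unused)
#define ISC_ATTR_UNUSED [[maybe_unused]]
#endif
#endif
#ifndef ISC_ATTR_UNUSED
#define ISC_ATTR_UNUSED __attribute__((__unused__))
#endif

/* Usage: the attribute replaces a UNUSED(flags); statement in the
 * function body. */
static void
example_cb(void *arg, ISC_ATTR_UNUSED int flags) {
	(void)arg;
	/* ... */
}
```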
When accepting a TCP connection in the higher layers (tlsstream,
streamdns, and http), attach to the socket the connection was accepted
on, and use this socket instead of the parent listening socket.
This has an advantage: accessing sock->listener now doesn't break
the thread boundaries, so we can properly check whether the socket is
being closed without requiring the .closing member to be an atomic_bool.
The last atomic_bool variable, sock->active, was converted to a
non-atomic bool by properly handling the listening socket case, where
we were checking the parent socket instead of the children sockets.
Checking the parent socket is no longer necessary, as we now properly
set .active to false on the children sockets.
Additionally, clean up .rchildren - the atomic variable was used
together with a mutex and condition variable to block until all
children were listening, but that's now handled by a barrier.
Finally, just remove the dead .self and .active_child_connections
members of the netmgr socket.
Now that everything runs on its own loop and we don't cross the thread
boundaries (with few exceptions), most of the atomic_bool variables
used to track the socket state have been converted to plain booleans,
because they are always accessed from the matching thread.
The remaining few have been relaxed: a) sock->active now uses
acquire/release memory ordering; b) the various global limits now use
relaxed memory ordering - we don't really care about the
synchronization for those.
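In C11 stdatomic terms, the two relaxations look roughly like
this (variable names are illustrative):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

static atomic_bool sock_active;	       /* a) acquire/release */
static atomic_size_t global_limit_ctr; /* b) relaxed */

static void
examples(void) {
	/* a) Publishing and observing the socket state pairs a
	 * store-release with a load-acquire. */
	atomic_store_explicit(&sock_active, true, memory_order_release);
	bool active = atomic_load_explicit(&sock_active,
					   memory_order_acquire);
	(void)active;

	/* b) Global limits are mere counters: relaxed ordering is
	 * enough, as no synchronizes-with relationship is needed. */
	atomic_fetch_add_explicit(&global_limit_ctr, 1,
				  memory_order_relaxed);
}
```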
Change isc_job_run() to not make any allocations. The caller must
make sure that it allocates the isc_job_t - usually as part of the
argument passed to the callback.
For simple jobs, using isc_async_run() is advised, as it allocates its
own separate isc_job_t.
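A sketch of the intended pattern - embedding the isc_job_t in the
callback argument so that no extra allocation is needed (the
isc_job_run() signature shown is an assumption based on the
description):

```c
#include <isc/job.h>
#include <isc/loop.h>

typedef struct my_task {
	isc_job_t job; /* embedded: lives as long as the argument */
	int	  value;
} my_task_t;

static void
my_task_cb(void *arg) {
	my_task_t *task = arg;
	/* ... do the work, then free or reuse 'task' ... */
	(void)task;
}

static void
schedule(isc_loop_t *loop, my_task_t *task) {
	/* No allocation here: the job storage is part of 'task'. */
	isc_job_run(loop, &task->job, my_task_cb, task);
}
```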
Change the isc__nm_uvreq_t to have the idle callback as a separate
member as we always need to use it to properly close the uvreq.
Slightly refactor uvreq_put() and uvreq_get() to remove the unneeded
arguments - in uvreq_get(), we always use sock->worker, and in
uvreq_put(), we always use req->sock, so there's no reason to pass
those extra arguments.
Instead of using isc_job_run(), which is quite heavy as it allocates
memory for every new job, add a uv_idle_t to the uvreq union and use
the uv_idle API directly to execute the connect/read/send callback
without any additional allocations.
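A sketch of the uv_idle mechanics (the uvreq layout is illustrative;
the libuv calls are the real API):

```c
#include <uv.h>

typedef struct uvreq {
	union {
		uv_idle_t idle; /* reused to defer the callback */
		/* ... other libuv request types ... */
	} uv;
	void (*cb)(struct uvreq *);
} uvreq_t;

static void
idle_cb(uv_idle_t *handle) {
	uvreq_t *req = uv_handle_get_data((uv_handle_t *)handle);

	/* One-shot: stop the idle handle, then run the callback.
	 * (A real implementation also has to uv_close() the handle
	 * before its memory is reused.) */
	uv_idle_stop(handle);
	req->cb(req);
}

/* Defer req->cb to the next loop iteration without any allocation:
 * the uv_idle_t storage is part of the request itself. */
static void
uvreq_defer(uv_loop_t *loop, uvreq_t *req) {
	uv_idle_init(loop, &req->uv.idle);
	uv_handle_set_data((uv_handle_t *)&req->uv.idle, req);
	uv_idle_start(&req->uv.idle, idle_cb);
}
```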
Add a public function ns_interface_create(), allowing the caller
to set up a listening interface directly, without having to set up
listen-on and scan the network interfaces.
Simplify the stopping of the generic socket children by using the
isc_async API from the loopmgr instead of using the asynchronous
netievent mechanism in the netmgr.
Simplify the setting of the TLS contexts by using the isc_async API
from the loopmgr instead of using the asynchronous netievent mechanism in
the netmgr.
Simplify the canceling of the StreamDNS socket by using the isc_async API
from the loopmgr instead of using the asynchronous netievent mechanism in
the netmgr.
Simplify the reading from the StreamDNS socket by using the isc_async API
from the loopmgr instead of using the asynchronous netievent mechanism in
the netmgr.
Simplify the setting of the DoH endpoints by using the isc_async API
from the loopmgr instead of using the asynchronous netievent mechanism in
the netmgr.
Simplify the accepting of the new TCP connection by using the isc_async API
from the loopmgr instead of using the asynchronous netievent mechanism in
the netmgr.
Simplify the canceling of the UDP socket by using the isc_async API
from the loopmgr instead of using the asynchronous netievent mechanism in
the netmgr.
Simplify the stopping of the TCP children by using the isc_async API
from the loopmgr instead of using the asynchronous netievent mechanism in
the netmgr.
Simplify the starting of the TCP children by using the isc_async API
from the loopmgr instead of using the asynchronous netievent mechanism in
the netmgr.
Simplify the stopping of the UDP children by using the isc_async API
from the loopmgr instead of using the asynchronous netievent mechanism in
the netmgr.
Simplify the starting of the UDP children by using the isc_async API
from the loopmgr instead of using the asynchronous netievent mechanism in
the netmgr.
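The common pattern behind all of these changes looks roughly like
this (the isc_async_run() signature is assumed from its usage here; the
callback is a placeholder):

```c
#include <isc/async.h>
#include <isc/loop.h>

static void
stop_children_cb(void *arg) {
	/* Runs on the target loop's own thread; no netievent
	 * allocation or queueing is involved. */
	(void)arg;
	/* ... perform the per-loop work ... */
}

static void
request_stop(isc_loop_t *target_loop, void *arg) {
	/* Post the work straight to the owning loop. */
	isc_async_run(target_loop, stop_children_cb, arg);
}
```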
The active handles accounting was using both an atomic counter and an
ISC_LIST to keep track of active handles. Remove the atomic counter,
which was in use before the ISC_LIST was added for better tracking of
the handles attached to the socket.
Instead of having isc__nmhandle_detach() call nmhandle_detach_cb()
asynchronously when a closehandle_cb is initialized, convert the
closehandle_cb to use isc_job, and make isc__nmhandle_detach() fully
synchronous.
The netmgr connect, read, and send callbacks can now only be executed
on the same loop; convert them from asynchronous netievent queue events
to more direct isc_job calls.
Return an 'isc_result_t' type value instead of 'bool' to indicate
the actual failure. Rename the function to something that does not
suggest a boolean-type result. Update the places where the API
function is used to check for the result code instead of a boolean
value.
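Illustratively (the commit doesn't name the API, so the function and
type names here are invented):

```c
#include <isc/result.h>
#include <stdio.h>

typedef struct thing thing_t;

/* Before: bool thing_is_valid(thing_t *t); - a bool hides the reason.
 * After: the result code carries the actual failure: */
extern isc_result_t thing_validate(thing_t *t);

static isc_result_t
use_thing(thing_t *t) {
	isc_result_t result = thing_validate(t);
	if (result != ISC_R_SUCCESS) {
		/* Callers can now log or propagate the real reason. */
		fprintf(stderr, "validate failed: %s\n",
			isc_result_totext(result));
	}
	return result;
}
```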
Change the per-socket inactive uvreq cache (implemented as an
isc_astack) to a per-worker memory pool.
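Sketched with the long-standing ISC memory-pool interface (treat the
exact isc_mempool signatures as assumptions; the uvreq type is a
stand-in):

```c
#include <isc/mem.h>

typedef struct uvreq {
	unsigned char storage[256]; /* stand-in for isc__nm_uvreq_t */
} uvreq_t;

/* One pool per worker, sized for the fixed-size uvreq objects. */
static void
worker_pool_init(isc_mem_t *mctx, isc_mempool_t **poolp) {
	isc_mempool_create(mctx, sizeof(uvreq_t), poolp);
}

/* uvreq_get()/uvreq_put() then reduce to pool get/put on the worker
 * the socket belongs to. */
static uvreq_t *
uvreq_get(isc_mempool_t *pool) {
	return isc_mempool_get(pool);
}

static void
uvreq_put(isc_mempool_t *pool, uvreq_t *req) {
	isc_mempool_put(pool, req);
}
```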
Change the per-socket inactive nmhandle cache (implemented as an
isc_astack) to an unlocked per-socket ISC_LIST.
Always track the per-worker sockets in the .active_sockets field in the
isc__networker_t struct, and always track the per-socket handles in the
.active_handles field in the isc_nmsocket_t struct.
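The ISC_LIST variant is an intrusive, unlocked doubly-linked list;
tracking handles on a socket looks roughly like this (types are
illustrative):

```c
#include <isc/list.h>

typedef struct handle {
	ISC_LINK(struct handle) link; /* intrusive list linkage */
} handle_t;

typedef struct nmsocket {
	ISC_LIST(handle_t) active_handles;
} nmsocket_t;

static void
socket_init(nmsocket_t *sock) {
	ISC_LIST_INIT(sock->active_handles);
}

static void
handle_track(nmsocket_t *sock, handle_t *h) {
	ISC_LINK_INIT(h, link);
	ISC_LIST_APPEND(sock->active_handles, h, link);
}

static void
handle_untrack(nmsocket_t *sock, handle_t *h) {
	ISC_LIST_UNLINK(sock->active_handles, h, link);
}
```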
On some platforms, when a synchronizing barrier is cleared, one
thread can progress while other threads are still in the process
of releasing the barrier. If a barrier is reused by the progressing
thread during this window, it can cause a deadlock. This can occur if,
for example, we stop listening immediately after we start, because the
stop and listen functions both use socket->barrier. This has been
addressed by using separate barrier objects for stop and listen.
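A sketch of the fix using POSIX barriers for illustration (the actual
code uses the netmgr's own barrier objects):

```c
#include <pthread.h>

typedef struct socket_ctl {
	/*
	 * Separate barrier objects: with a single shared barrier, a
	 * thread that clears the "listen" barrier can re-enter the
	 * wait for "stop" while slower threads are still releasing
	 * the previous wait, which can deadlock on some platforms.
	 */
	pthread_barrier_t listen_barrier;
	pthread_barrier_t stop_barrier;
} socket_ctl_t;

static void
worker_listen(socket_ctl_t *ctl) {
	/* ... start listening on this worker ... */
	(void)pthread_barrier_wait(&ctl->listen_barrier);
}

static void
worker_stop(socket_ctl_t *ctl) {
	/* ... stop listening on this worker ... */
	/* A distinct object, so stopping immediately after starting
	 * cannot reuse a barrier that is still being released. */
	(void)pthread_barrier_wait(&ctl->stop_barrier);
}
```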
This commit replaces ad-hoc code for send requests buffer management
within TLS with one based on isc_buffer_t.
The previous version of the code tried to use pre-allocated small
buffers to avoid extra allocations, allocating a larger dynamic buffer
when needed. There is no need to have ad-hoc code for this, as
isc_buffer_t now provides this functionality internally.
In addition, the old version of the code lacked any logic to reuse the
dynamically allocated buffers. Now that we manage isc_buffer_t objects
rather than raw memory buffers, we can implement this strategy. It is
particularly helpful for longer-lasting connections, as in that case
the buffer adjusts itself to the size of the messages being
transferred. That makes it especially useful for XoT, as Stream DNS
happens to order send requests in such a way that the send request will
get reused.
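A sketch of the isc_buffer_t-based approach (the calls follow the
long-standing isc_buffer interface, but treat the exact signatures and
the automatic-growth behavior as assumptions based on the description
above):

```c
#include <isc/buffer.h>
#include <isc/mem.h>

/* Fill a per-send-request dynamic buffer, reusing it across sends so
 * that on long-lived connections it settles at the typical message
 * size instead of being reallocated every time. */
static void
sendreq_fill(isc_mem_t *mctx, isc_buffer_t **bufp,
	     const unsigned char *data, unsigned int len) {
	if (*bufp == NULL) {
		isc_buffer_allocate(mctx, bufp, len);
	} else {
		isc_buffer_clear(*bufp); /* reuse: keep the allocation */
	}
	/* A dynamic isc_buffer_t grows as needed, so no ad-hoc
	 * "small pre-allocated vs. large dynamic" logic is required. */
	isc_buffer_putmem(*bufp, data, len);
}
```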
The new internal function works in the same way as isc_nm_send(),
except that it sends the DNS message size ahead of the DNS message
data (the format used in DNS over TCP).
The intention is to provide a fast path for sending DNS messages over
stream protocols - that is, without allocating any intermediate
memory buffers.
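For reference, DNS over TCP (RFC 1035, section 4.2.2) prefixes each
message with a two-byte, network-order length. A naive sender copies
the message into a fresh buffer just to prepend those two bytes - the
intermediate allocation the new fast path avoids by handing both pieces
to the transport together:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* The naive framing: allocate-and-copy to prepend the length.
 * 'out' must have room for msglen + 2 bytes. */
static size_t
dns_tcp_frame(uint8_t *out, const uint8_t *msg, uint16_t msglen) {
	out[0] = (uint8_t)(msglen >> 8); /* high byte first */
	out[1] = (uint8_t)(msglen & 0xff);
	memcpy(out + 2, msg, msglen);
	return (size_t)msglen + 2;
}
```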
This commit optimises TLS send request object allocation to enable
send request object reuse, somewhat reducing pressure on the memory
manager. It is especially helpful in the case when Stream DNS uses the
TLS implementation as the transport.