There was an artificial limit of 23 on the number of simultaneous
pipelined queries in the single TCP connection. The new network
managers is capable of handling "unlimited" (limited only by the TCP
read buffer size ) queries similar to "unlimited" handling of the DNS
queries receive over UDP.
Don't limit the number of TCP queries that we can process within a
single TCP read callback.
When invalid DNS message is received, there was a handling mechanism for
DoH that would be called to return proper HTTP response.
Reuse this mechanism and reset the TCP connection when the client is
blackholed, DNS message is completely bogus or the ns_client receives
response instead of query.
In some situations (unit test and forthcoming XFR timeouts MR), we need
to modify the write timeout independently of the read timeout. Add a
isc_nmhandle_setwritetimeout() function that could be called before
isc_nm_send() to specify a custom write timeout interval.
When the outgoing TCP write buffers are full because the other party is
not reading the data, the uv_write() could wait indefinitely on the
uv_loop and never calling the callback. Add a new write timer that uses
the `tcp-idle-timeout` value to interrupt the TCP connection when we are
not able to send data for defined period of time.
Before adding the write timer, we have to remove the generic sock->timer
to sock->read_timer. We don't touch the function names to limit the
impact of the refactoring.
Some operating systems (OpenBSD and DragonFly BSD) don't restrict the
IPv6 sockets to sending and receiving IPv6 packets only. Explicitly
enable the IPV6_V6ONLY socket option on the IPv6 sockets to prevent
failures from using the IPv4-mapped IPv6 address.
Previously, the netmgr/udp.c tried to detect the recvmmsg detection in
libuv with #ifdef UV_UDP_<foo> preprocessor macros. However, because
the UV_UDP_<foo> are not preprocessor macros, but enum members, the
detection didn't work. Because the detection didn't work, the code
didn't have access to the information when we received the final chunk
of the recvmmsg and tried to free the uvbuf every time. Fortunately,
the isc__nm_free_uvbuf() had a kludge that detected attempt to free in
the middle of the receive buffer, so the code worked.
However, libuv 1.37.0 changed the way the recvmmsg was enabled from
implicit to explicit, and we checked for yet another enum member
presence with preprocessor macro, so in fact libuv recvmmsg support was
never enabled with libuv >= 1.37.0.
This commit changes to the preprocessor macros to autoconf checks for
declaration, so the detection now works again. On top of that, it's now
possible to cleanup the alloc_cb and free_uvbuf functions because now,
the information whether we can or cannot free the buffer is available to
us.
This commit converts the license handling to adhere to the REUSE
specification. It specifically:
1. Adds used licnses to LICENSES/ directory
2. Add "isc" template for adding the copyright boilerplate
3. Changes all source files to include copyright and SPDX license
header, this includes all the C sources, documentation, zone files,
configuration files. There are notes in the doc/dev/copyrights file
on how to add correct headers to the new files.
4. Handle the rest that can't be modified via .reuse/dep5 file. The
binary (or otherwise unmodifiable) files could have license places
next to them in <foo>.license file, but this would lead to cluttered
repository and most of the files handled in the .reuse/dep5 file are
system test files.
The isc_queue_new() was using dirty tricks to allocate the head and tail
members of the struct aligned to the cacheline. We can now use
isc_mem_get_aligned() to allocate the structure to the cacheline
directly.
Use ISC_OS_CACHELINE_SIZE (64) instead of arbitrary ALIGNMENT (128), one
cacheline size is enough to prevent false sharing.
Cleanup the unused max_threads variable - there was actually no limit on
the maximum number of threads. This was changed a while ago.
On FreeBSD, the pthread primitives are not solely allocated on stack,
but part of the object lives on the heap. Missing pthread_*_destroy
causes the heap memory to grow and in case of fast lived object it's
possible to run out-of-memory.
Properly destroy the leaking mutex (worker->lock) and
the leaking condition (sock->cond).
Previously, when TCP accept failed, we have logged a message with
ISC_LOG_ERROR level. One common case, how this could happen is that the
client hits TCP client quota and is put on hold and when resumed, the
client has already given up and closed the TCP connection. In such
case, the named would log:
TCP connection failed: socket is not connected
This message was quite confusing because it actually doesn't say that
it's related to the accepting the TCP connection and also it logs
everything on the ISC_LOG_ERROR level.
Change the log message to "Accepting TCP connection failed" and for
specific error states lower the severity of the log message to
ISC_LOG_INFO.
This commit adds an isc_nm_socket_type() function which can be used to
obtain a handle's socket type.
This change obsoletes isc_nm_is_tlsdns_handle() and
isc_nm_is_http_handle(). However, it was decided to keep the latter as
we eventually might end up supporting multiple HTTP versions.
Change 5756 (GL #2854) introduced build errors when using
'configure --disable-doh'. To fix this, isc_nm_is_http_handle() is
now defined in all builds, not just builds that have DoH enabled.
Missing code comments were added both for that function and for
isc_nm_is_tlsdns_handle().
This commit adds an isc_nm_set_min_answer_ttl() function which is
intended to to be used to give a hint to the underlying transport
regarding the answer TTL.
The interface is intentionally kept generic because over time more
transports might benefit from this functionality, but currently it is
intended for DoH to set "max-age" value within "Cache-Control" HTTP
header (as recommended in the RFC8484, section 5.1 "Cache
Interaction").
It is no-op for other DNS transports for the time being.
it is possible for udp_recv_cb() to fire after the socket
is already shutting down and statichandle is NULL; we need to
create a temporary handle in this case.
route/netlink sockets don't have stats counters associated with them,
so it's now necessary to check whether socket stats exist before
incrementing or decrementing them. rather than relying on the caller
for this, we now just pass the socket and an index, and the correct
stats counter will be updated if it exists.
isc_nm_routeconnect() opens a route/netlink socket, then calls a
connect callback, much like isc_nm_udpconnect(), with a handle that
can then be monitored for network changes.
Internally the socket is treated as a UDP socket, since route/netlink
sockets follow the datagram contract.
After support for route/netlink sockets is merged, not all sockets
will have stats counters associated with them, so it's now necessary
to check whether socket stats exist before incrementing or decrementing
them. rather than relying on the caller for this, we now just pass the
socket and an index, and the correct stats counter will be updated if
it exists.
This commit makes BIND verify that zone transfers are allowed to be
done over the underlying connection. Currently, it makes sense only
for DoT, but the code is deliberately made to be protocol-agnostic.
The intention of having this function is to have a predicate to check
if a zone transfer could be performed over the given handle. In most
cases we can assume that we can do zone transfers over any stream
transport except DoH, but this assumption will not work for zone
transfers over DoT (XoT), as the RFC9103 requires ALPN to happen,
which might not be the case for all deployments of DoT.
- The `timeout_action` parameter to dns_dispatch_addresponse() been
replaced with a netmgr callback that is called when a dispatch read
times out. this callback may optionally reset the read timer and
resume reading.
- Added a function to convert isc_interval to milliseconds; this is used
to translate fctx->interval into a value that can be passed to
dns_dispatch_addresponse() as the timeout.
- Note that netmgr timeouts are accurate to the millisecond, so code to
check whether a timeout has been reached cannot rely on microsecond
accuracy.
- If serve-stale is configured, then a timeout received by the resolver
may trigger it to return stale data, and then resume waiting for the
read timeout. this is no longer based on a separate stale timer.
- The code for canceling requests in request.c has been altered so that
it can run asynchronously.
- TCP timeout events apply to the dispatch, which may be shared by
multiple queries. since in the event of a timeout we have no query ID
to use to identify the resp we wanted, we now just send the timeout to
the oldest query that was pending.
- There was some additional refactoring in the resolver: combining
fctx_join() and fctx_try_events() into one function to reduce code
duplication, and using fixednames in fetchctx and fetchevent.
- Incidental fix: new_adbaddrinfo() can't return NULL anymore, so the
code can be simplified.
- The read timer must always be stopped when reading stops.
- Read callbacks can now call isc_nm_read() again in TCP, TCPDNS and
TLSDNS; previously this caused an assertion.
- The wrong failure code could be sent after a UDP recv failure because
the if statements were in the wrong order. the check for a NULL
address needs to be after the check for an error code, otherwise the
result will always be set to ISC_R_EOF.
- When aborting a read or connect because the netmgr is shutting down,
use ISC_R_SHUTTINGDOWN. (ISC_R_CANCELED is now reserved for when the
read has been canceled by the caller.)
- A new function isc_nmhandle_timer_running() has been added enabling a
callback to check whether the timer has been reset after processing a
timeout.
- Incidental netmgr fix: always use isc__nm_closing() instead of
referencing sock->mgr->closing directly
- Corrected a few comments that used outdated function names.
Previously isc_nm_read() required references on the handle to be at
least 2, under the assumption that it would only ever be called from a
connect or accept callback. however, it can also be called from a read
callback, in which case the reference count might be only 1.
On TCPDNS/TLSDNS read callback, the socket buffer could be reallocated
if the received contents would be larger than the buffer. The existing
code would not preserve the contents of the existing buffer which lead
to the loss of the already received data.
This commit changes the isc_mem_put()+isc_mem_get() with isc_mem_reget()
to preserve the existing contents of the socket buffer.
The netmgr, has an internal cache for freed active handles. This cache
was allocated using isc_mem_allocate()/isc_mem_free() API because it was
simpler to reallocate the cache when we needed to grow it. The new
isc_mem_reget() function could be used here reducing the need to use
isc_mem_allocate() API which is tad bit slower than isc_mem_get() API.
this commit removes isc__nm_tcpdns_keepalive() and
isc__nm_tlsdns_keepalive(); keepalive for these protocols and
for TCP will now be set directly from isc_nmhandle_keepalive().
protocols that have an underlying TCP socket (i.e., TLS stream
and HTTP), now have protocol-specific routines, called by
isc_nmhandle_keeaplive(), to set the keepalive value on the
underlying socket.
previously, receiving a keepalive option had no effect on how
long named would keep the connection open; there was a place to
configure the keepalive timeout but it was never used. this commit
corrects that.
this also fixes an error in isc__nm_{tcp,tls}dns_keepalive()
in which the sense of a REQUIRE test was reversed; previously this
error had not been noticed because the functions were not being
used.
Instead of disabling the fragmentation on the UDP sockets, we now
disable the Path MTU Discovery by setting IP(V6)_MTU_DISCOVER socket
option to IP_PMTUDISC_OMIT on Linux and disabling IP(V6)_DONTFRAG socket
option on FreeBSD. This option sets DF=0 in the IP header and also
ignores the Path MTU Discovery.
As additional mitigation on Linux, we recommend setting
net.ipv4.ip_no_pmtu_disc to Mode 3:
Mode 3 is a hardend pmtu discover mode. The kernel will only accept
fragmentation-needed errors if the underlying protocol can verify
them besides a plain socket lookup. Current protocols for which pmtu
events will be honored are TCP, SCTP and DCCP as they verify
e.g. the sequence number or the association. This mode should not be
enabled globally but is only intended to secure e.g. name servers in
namespaces where TCP path mtu must still work but path MTU
information of other protocols should be discarded. If enabled
globally this mode could break other protocols.
It was discovered that setting the thread affinity on both the netmgr
and netthread threads lead to inconsistent recursive performance because
sometimes the netmgr and netthread threads would compete over single
resource and sometimes not.
Removing setting the affinity causes a slight dip in the authoritative
performance around 5% (the measured range was from 3.8% to 7.8%), but
the recursive performance is now consistently good.
In the jemalloc merge request, we missed the fact that ah_frees and ah_handles
are reallocated which is not compatible with using isc_mem_get() for allocation
and isc_mem_put() for deallocation. This commit reverts that part and restores
use of isc_mem_allocate() and isc_mem_free().
Current mempools are kind of hybrid structures - they serve two
purposes:
1. mempool with a lock is basically static sized allocator with
pre-allocated free items
2. mempool without a lock is a doubly-linked list of preallocated items
The first kind of usage could be easily replaced with jemalloc small
sized arena objects and thread-local caches.
The second usage not-so-much and we need to keep this (in
libdns:message.c) for performance reasons.
The isc_mem_allocate() comes with additional cost because of the memory
tracking. In this commit, we replace the usage with isc_mem_get()
because we track the allocated sizes anyway, so it's possible to also
replace isc_mem_free() with isc_mem_put().
This commit makes BIND return HTTP status codes for malformed or too
small requests.
DNS request processing code would ignore such requests. Such an
approach works well for other DNS transport but does not make much
sense for HTTP, not allowing it to complete the request/response
sequence.
Suppose execution has reached the point where DNS message handling
code has been called. In that case, it means that the HTTP request has
been successfully processed, and, thus, we are expected to respond to
it either with a message containing some DNS payload or at least to
return an error status code. This commit ensures that BIND behaves
this way.
This commit adds two new autoconf options `--enable-doh` (enabled by
default) and `--with-libnghttp2` (mandatory when DoH is enabled).
When DoH support is disabled the library is not linked-in and support
for http(s) protocol is disabled in the netmgr, named and dig.
In DNS Flag Day 2020, we started setting the DF (Don't Fragment socket
option on the UDP sockets. It turned out, that this code was incomplete
leading to dropping the outgoing UDP packets.
This has been now remedied, so it is possible to disable the
fragmentation on the UDP sockets again as the sending error is now
handled by sending back an empty response with TC (truncated) bit set.
This reverts commit 66eefac78c92b64b6689a1655cc677a2b1d13496.
Previously, each protocol (TCPDNS, TLSDNS) has specified own function to
disable pipelining on the connection. An oversight would lead to
assertion failure when opcode is not query over non-TCPDNS protocol
because the isc_nm_tcpdns_sequential() function would be called over
non-TCPDNS socket. This commit removes the per-protocol functions and
refactors the code to have and use common isc_nm_sequential() function
that would either disable the pipelining on the socket or would handle
the request in per specific manner. Currently it ignores the call for
HTTP sockets and causes assertion failure for protocols where it doesn't
make sense to call the function at all.
This commit makes NM code to report HTTP as a stream protocol. This
makes it possible to handle large responses properly. Like:
dig +https @127.0.0.1 A cmts1-dhcp.longlines.com
The Windows support has been completely removed from the source tree
and BIND 9 now no longer supports native compilation on Windows.
We might consider reviewing mingw-w64 port if contributed by external
party, but no development efforts will be put into making BIND 9 compile
and run on Windows again.
The libuv has a support for running long running tasks in the dedicated
threadpools, so it doesn't affect networking IO.
This commit adds isc_nm_work_enqueue() wrapper that would wraps around
the libuv API and runs it on top of associated worker loop.
The only limitation is that the function must be called from inside
network manager thread, so the call to the function should be wrapped
inside a (bound) task.
configuring with --enable-mutex-atomics flagged these incorrectly
initialised variables on systems where pthread_mutex_init doesn't
just zero out the structure.
The isc_nmiface_t type was holding just a single isc_sockaddr_t,
so we got rid of the datatype and use plain isc_sockaddr_t in place
where isc_nmiface_t was used before. This means less type-casting and
shorter path to access isc_sockaddr_t members.
At the same time, instead of keeping the reference to the isc_sockaddr_t
that was passed to us when we start listening, we will keep a local
copy. This prevents the data race on destruction of the ns_interface_t
objects where pending nmsockets could reference the sockaddr of already
destroyed ns_interface_t object.
Previously, as a way of reducing the contention between threads a
clientmgr object would be created for each interface/IP address.
We tasks being more strictly bound to netmgr workers, this is no longer
needed and we can just create clientmgr object per worker queue (ncpus).
Each clientmgr object than would have a single task and single memory
context.
Instead of using fixed quantum, this commit adds atomic counter for
number of items on each queue and uses the number of netievents
scheduled to run as the limit of maximum number of netievents for a
single process_queue() run.
This prevents the endless loops when the netievent would schedule more
netievents onto the same loop, but we don't have to pick "magic" number
for the quantum.