The native PKCS#11 support has been removed in favour of better
maintained, more performance and easier to use OpenSSL PKCS#11 engine
from the OpenSC project.
This commit adds new function isc_nm_http_makeuri() which is supposed
to unify DoH URI construction throughout the codebase.
It handles IPv6 addresses, hostnames, and IPv6 addresses given as
hostnames properly, and replaces similar ad-hoc code in the codebase.
The previous versions of BIND 9 exported its internal libraries so that
they can be used by third-party applications more easily. Certain
library functions were altered from specific BIND-only behavior to more
generic behavior when used by other applications.
This commit removes the function isc_lib_register() that was used by
external applications to enable the functionality.
previously, receiving a keepalive option had no effect on how
long named would keep the connection open; there was a place to
configure the keepalive timeout but it was never used. this commit
corrects that.
this also fixes an error in isc__nm_{tcp,tls}dns_keepalive()
in which the sense of a REQUIRE test was reversed; previously this
error had not been noticed because the functions were not being
used.
- fix some duplicated and out-of-order prototypes declared in
netmgr-int.h
- rename isc_nm_tcpdns_keepalive to isc__nm_tcpdns_keepalive as
it's for internal use
Add a new function to resize the number of counters in a statistics
counter structure. This will be needed when we keep track of DNSSEC
sign statistics and new keys are introduced due to a rollover.
This commit gets rid of RW locks in a hot path of the DoH code. In the
original design, it was implied that we add new endpoints after the
HTTP listener was created. Such a design implies some locking. We do
not need such flexibility, though. Instead, we could build a set of
endpoints before the HTTP listener gets created. Such a design does
not need RW locks at all.
On the isc_mem water change the old water_t structure could be used
after free. Instead of introducing reference counting on the hot-path
we are going to introduce additional constraints on the
isc_mem_setwater. Once it's set for the first time, the additional
calls have to be made with the same water and water_arg arguments.
This commit makes number of concurrent HTTP/2 streams per connection
configurable as a mean to fight DDoS attacks. As soon as the limit is
reached, BIND terminates the whole session.
The commit adds a global configuration
option (http-streams-per-connection) which can be overridden in an
http <name> {...} statement like follows:
http local-http-server {
...
streams-per-connection 100;
...
};
For now the default value is 100, which should be enough (e.g. NGINX
uses 128, but it is a full-featured WEB-server). When using lower
numbers (e.g. ~70), it is possible to hit the limit with
e.g. flamethrower.
This commit adds support for http-listener-clients global options as
well as ability to override the default in an HTTP server description,
like:
http local-http-server {
...
listener-clients 100;
...
};
This way we have ability to specify per-listener active connections
quota globally and then override it when required. This is exactly
what AT&T requested us: they wanted a functionality to specify quota
globally and then override it for specific IPs. This change
functionality makes such a configuration possible.
It makes sense: for example, one could have different quotas for
internal and external clients. Or, for example, one could use BIND's
internal ability to serve encrypted DoH with some sane quota value for
internal clients, while having un-encrypted DoH listener without quota
to put BIND behind a load balancer doing TLS offloading for external
clients.
Moreover, the code no more shares the quota with TCP, which makes
little sense anyway (see tcp-clients option), because of the nature of
interaction of DoH clients: they tend to keep idle opened connections
for longer periods of time, preventing the TCP and TLS client from
being served. Thus, the need to have a separate, generally larger,
quota for them.
Also, the change makes any option within "http <name> { ... };"
statement optional, making it easier to override only required default
options.
By default, the DoH connections are limited to 300 per listener. I
hope that it is a good initial guesstimate.
This commit adds the code (and some tests) which allows verifying
validity of HTTP paths both in incoming HTTP requests and in BIND's
configuration file.
The isc_mem_get(), isc_mem_allocate() and isc_mem_reallocate() can
return NULL ptr in case where the allocation size is NULL. Remove the
nonnull attribute from the functions' declarations.
This stems from the following definition in the C11 standard:
> If the size of the space requested is zero, the behavior is
> implementation-defined: either a null pointer is returned, or the
> behavior is as if the size were some nonzero value, except that the
> returned pointer shall not be used to access an object.
In this case, we return NULL as it's easier to detect errors when
accessing pointer from zero-sized allocation which should obviously
never happen.
- isc_mempool_get() can no longer fail; when there are no more objects
in the pool, more are always allocated. checking for NULL return is
no longer necessary.
- the isc_mempool_setmaxalloc() and isc_mempool_getmaxalloc() functions
are no longer used and have been removed.
Current mempools are kind of hybrid structures - they serve two
purposes:
1. mempool with a lock is basically static sized allocator with
pre-allocated free items
2. mempool without a lock is a doubly-linked list of preallocated items
The first kind of usage could be easily replaced with jemalloc small
sized arena objects and thread-local caches.
The second usage not-so-much and we need to keep this (in
libdns:message.c) for performance reasons.
Previously, we only had capability to trace the mempool gets and puts,
but for debugging, it's sometimes also important to keep track how many
and where do the memory pools get created and destroyed. This commit
adds such tracking capability.
The ISC_MEM_DEBUGSIZE and ISC_MEM_DEBUGCTX did sanity checks on matching
size and memory context on the memory returned to the allocator. Those
will no longer needed when most of the allocator will be replaced with
jemalloc.
Previously, we only had capability to trace the memory gets and puts,
but for debugging, it's sometimes also important to keep track how many
and where do the memory contexts get created and destroyed. This commit
adds such tracking capability.
This commit makes BIND return HTTP status codes for malformed or too
small requests.
DNS request processing code would ignore such requests. Such an
approach works well for other DNS transport but does not make much
sense for HTTP, not allowing it to complete the request/response
sequence.
Suppose execution has reached the point where DNS message handling
code has been called. In that case, it means that the HTTP request has
been successfully processed, and, thus, we are expected to respond to
it either with a message containing some DNS payload or at least to
return an error status code. This commit ensures that BIND behaves
this way.
This commit adds two new autoconf options `--enable-doh` (enabled by
default) and `--with-libnghttp2` (mandatory when DoH is enabled).
When DoH support is disabled the library is not linked-in and support
for http(s) protocol is disabled in the netmgr, named and dig.
The isc/platform.h header was left empty which things either already
moved to config.h or to appropriate headers. This is just the final
cleanup commit.
The last remaining defines needed for platforms without NAME_MAX and
PATH_MAX (I'm looking at you, GNU Hurd) were moved to isc/dir.h where
it's prevalently used.
The ISC_STRERRORSIZE was defined in isc/platform.h header as the
value was different between Windows and POSIX platforms. Now that
Windows is gone, move the define to where it belongs.
Previously, each protocol (TCPDNS, TLSDNS) has specified own function to
disable pipelining on the connection. An oversight would lead to
assertion failure when opcode is not query over non-TCPDNS protocol
because the isc_nm_tcpdns_sequential() function would be called over
non-TCPDNS socket. This commit removes the per-protocol functions and
refactors the code to have and use common isc_nm_sequential() function
that would either disable the pipelining on the socket or would handle
the request in per specific manner. Currently it ignores the call for
HTTP sockets and causes assertion failure for protocols where it doesn't
make sense to call the function at all.
The requirements for BIND 9.17+ now requires C11 support from the
compiler, so we can safely drop most of the stdatomic.h shims from
lib/isc/unix/include/stdatomic.h.
This commit removes support for clang atomic builtins (clang >= 3.6.0
includes stdatomic.h header) and for Gcc __sync builtins.
The only compatibility shim that remains is support for __atomic
builtins for Gcc >= 4.7.0 since CentOS 7 still includes only Gcc 4.8.1
and the proper stdatomic.h header was only introduced in Gcc >= 4.9.
We cannot use DoH for zone transfers. According to RFC8484 a DoH
request contains exactly one DNS message (see Section 6: Definition of
the "application/dns-message" Media Type,
https://datatracker.ietf.org/doc/html/rfc8484#section-6). This makes
DoH unsuitable for zone transfers as often (and usually!) these need
more than one DNS message, especially for larger zones.
As zone transfers over DoH are not (yet) standardised, nor discussed
in RFC8484, the best thing we can do is to return "not implemented."
Technically DoH can be used to transfer small zones which fit in one
message, but that is not enough for the generic case.
Also, this commit makes the server-side DoH code ensure that no
multiple responses could be attempted to be sent over one HTTP/2
stream. In HTTP/2 one stream is mapped to one request/response
transaction. Now the write callback will be called with failure error
code in such a case.
The Windows support has been completely removed from the source tree
and BIND 9 now no longer supports native compilation on Windows.
We might consider reviewing mingw-w64 port if contributed by external
party, but no development efforts will be put into making BIND 9 compile
and run on Windows again.
The libuv has a support for running long running tasks in the dedicated
threadpools, so it doesn't affect networking IO.
This commit adds isc_nm_work_enqueue() wrapper that would wraps around
the libuv API and runs it on top of associated worker loop.
The only limitation is that the function must be called from inside
network manager thread, so the call to the function should be wrapped
inside a (bound) task.
On some platforms, the __attribute__ constructor and destructor won't
take priorities and the compilation failed. On such platform would be
macOS. For this reason, the constructor/destructor in the libisc was
reworked to not use priorities, but have a single constructor and
destructor that calls the appropriate routines in correct order.
This commit removes the extra priority because it's now not needed and
it also breaks a compilation on macOS with GCC 10.
The isc_nmiface_t type was holding just a single isc_sockaddr_t,
so we got rid of the datatype and use plain isc_sockaddr_t in place
where isc_nmiface_t was used before. This means less type-casting and
shorter path to access isc_sockaddr_t members.
At the same time, instead of keeping the reference to the isc_sockaddr_t
that was passed to us when we start listening, we will keep a local
copy. This prevents the data race on destruction of the ns_interface_t
objects where pending nmsockets could reference the sockaddr of already
destroyed ns_interface_t object.
Previously, as a way of reducing the contention between threads a
clientmgr object would be created for each interface/IP address.
We tasks being more strictly bound to netmgr workers, this is no longer
needed and we can just create clientmgr object per worker queue (ncpus).
Each clientmgr object than would have a single task and single memory
context.
Since a client object is bound to a netmgr handle, each client
will always be processed by the same netmgr worker, so we can
simplify the code by binding client->task to the same thread as
the client. Since ns__client_request() now runs in the same event
loop as client->task events, is no longer necessary to pause the
task manager before launching them.
Also removed some functions in isc_task that were not used.
Instead of using fixed quantum, this commit adds atomic counter for
number of items on each queue and uses the number of netievents
scheduled to run as the limit of maximum number of netievents for a
single process_queue() run.
This prevents the endless loops when the netievent would schedule more
netievents onto the same loop, but we don't have to pick "magic" number
for the quantum.
This commit adds a new configuration option to set the receive and send
buffer sizes on the TCP and UDP netmgr sockets. The default is `0`
which doesn't set any value and just uses the value set by the operating
system.
There's no magic value here - set it too small and the performance will
drop, set it too large, the buffers can fill-up with queries that have
already timeouted on the client side and nobody is interested for the
answer and this would just make the server clog up even more by making
it produce useless work.
The `netstat -su` can be used on POSIX systems to monitor the receive
and send buffer errors.
The netmgr listening, stoplistening, pausing and resuming functions
now use barriers for synchronization, which makes the code much simpler.
isc/barrier.h defines isc_barrier macros as a front-end for uv_barrier
on platforms where that works, and pthread_barrier where it doesn't
(including TSAN builds).
all zone loading tasks have the privileged flag, but we only want
them to run as privileged tasks when the server is being initialized;
if we privilege them the rest of the time, the server may hang for a
long time after a reload/reconfig. so now we call isc_taskmgr_setmode()
to turn privileged execution mode on or off in the task manager.
isc_task_privileged() returns true if the task's privilege flag is
set *and* the taskmgr is in privileged execution mode. this is used
to determine in which netmgr event queue the task should be run.
There was a theoretical possibility of clogging up the queue processing
with an endless loop where currently processing netievent would schedule
new netievent that would get processed immediately. This wasn't such a
problem when only netmgr netievents were processed, but with the
addition of the tasks, there are at least two situation where this could
happen:
1. In lib/dns/zone.c:setnsec3param() the task would get re-enqueued
when the zone was not yet fully loaded.
2. Tasks have internal quantum for maximum number of isc_events to be
processed, when the task quantum is reached, the task would get
rescheduled and then immediately processed by the netmgr queue
processing.
As the isc_queue doesn't have a mechanism to atomically move the queue,
this commit adds a mechanism to quantize the queue, so enqueueing new
netievents will never stop processing other uv_loop_t events.
The default quantum size is 128.
Since the queue used in the network manager allows items to be enqueued
more than once, tasks are now reference-counted around task_ready()
and task_run(). task_ready() now has a public API wrapper,
isc_task_ready(), that the netmgr can use to reschedule processing
of a task if the quantum has been reached.
Incidental changes: Cleaned up some unused fields left in isc_task_t
and isc_taskmgr_t after the last refactoring, and changed atomic
flags to atomic_bools for easier manipulation.
With taskmgr running on top of netmgr, the ordering of how the tasks and
netmgr shutdown interacts was wrong as previously isc_taskmgr_destroy()
was waiting until all tasks were properly shutdown and detached. This
responsibility was moved to netmgr, so we now need to do the following:
1. shutdown all the tasks - this schedules all shutdown events onto
the netmgr queue
2. shutdown the netmgr - this also makes sure all the tasks and
events are properly executed
3. Shutdown the taskmgr - this now waits for all the tasks to finish
running before returning
4. Shutdown the netmgr - this call waits for all the netmgr netievents
to finish before returning
This solves the race when the taskmgr object would be destroyed before
all the tasks were finished running in the netmgr loops.
Previously, netmgr, taskmgr, timermgr and socketmgr all had their own
isc_<*>mgr_create() and isc_<*>mgr_destroy() functions. The new
isc_managers_create() and isc_managers_destroy() fold all four into a
single function and makes sure the objects are created and destroy in
correct order.
Especially now, when taskmgr runs on top of netmgr, the correct order is
important and when the code was duplicated at many places it's easy to
make mistake.
The former isc_<*>mgr_create() and isc_<*>mgr_destroy() functions were
made private and a single call to isc_managers_create() and
isc_managers_destroy() is required at the program startup / shutdown.
This commit adds support for generating backtraces on Windows and
refactors the isc_backtrace API to match the Linux/BSD API (without
the isc_ prefix)
* isc_backtrace_gettrace() was renamed to isc_backtrace(), the third
argument was removed and the return type was changed to int
* isc_backtrace_symbols() was added
* isc_backtrace_symbols_fd() was added and used as appropriate
The malloc attribute allows compiler to do some optmizations on
functions that behave like malloc/calloc, like assuming that the
returned pointer do not alias other pointers.
This commit changes the taskmgr to run the individual tasks on the
netmgr internal workers. While an effort has been put into keeping the
taskmgr interface intact, couple of changes have been made:
* The taskmgr has no concept of universal privileged mode - rather the
tasks are either privileged or unprivileged (normal). The privileged
tasks are run as a first thing when the netmgr is unpaused. There
are now four different queues in in the netmgr:
1. priority queue - netievent on the priority queue are run even when
the taskmgr enter exclusive mode and netmgr is paused. This is
needed to properly start listening on the interfaces, free
resources and resume.
2. privileged task queue - only privileged tasks are queued here and
this is the first queue that gets processed when network manager
is unpaused using isc_nm_resume(). All netmgr workers need to
clean the privileged task queue before they all proceed normal
operation. Both task queues are processed when the workers are
finished.
3. task queue - only (traditional) task are scheduled here and this
queue along with privileged task queues are process when the
netmgr workers are finishing. This is needed to process the task
shutdown events.
4. normal queue - this is the queue with netmgr events, e.g. reading,
sending, callbacks and pretty much everything is processed here.
* The isc_taskmgr_create() now requires initialized netmgr (isc_nm_t)
object.
* The isc_nm_destroy() function now waits for indefinite time, but it
will print out the active objects when in tracing mode
(-DNETMGR_TRACE=1 and -DNETMGR_TRACE_VERBOSE=1), the netmgr has been
made a little bit more asynchronous and it might take longer time to
shutdown all the active networking connections.
* Previously, the isc_nm_stoplistening() was a synchronous operation.
This has been changed and the isc_nm_stoplistening() just schedules
the child sockets to stop listening and exits. This was needed to
prevent a deadlock as the the (traditional) tasks are now executed on
the netmgr threads.
* The socket selection logic in isc__nm_udp_send() was flawed, but
fortunatelly, it was broken, so we never hit the problem where we
created uvreq_t on a socket from nmhandle_t, but then a different
socket could be picked up and then we were trying to run the send
callback on a socket that had different threadid than currently
running.