While cleaning up the usage of HAVE_UV_<func> macros, we forgot to
cleanup the HAVE_UV_UDP_CONNECT in the actual code and
HAVE_UV_TRANSLATE_SYS_ERROR and this was causing Windows build to fail
on uv_udp_send() because the socket was already connected and we were
falsely assuming that it was not.
The platforms with autoconf support were not affected, because we were
still checking for the functions from the configure.
This commit adds the ability to consolidate HTTP/2 write requests if
there is already one in flight. If it is the case, the code will
consolidate multiple subsequent write request into a larger one
allowing to utilise the network in a more efficient way by creating
larger TCP packets as well as by reducing TLS records overhead (by
creating large TLS records instead of multiple small ones).
This optimisation is especially efficient for clients, creating many
concurrent HTTP/2 streams over a transport connection at once. This
way, the code might create a small amount of multi-kilobyte requests
instead of many 50-120 byte ones.
In fact, it turned out to work so well that I had to add a work-around
to the code to ensure compatibility with the flamethrower, which, at
the time of writing, does not support TLS records larger than two
kilobytes. Now the code tries to flush the write buffer after 1.5
kilobyte, which is still pretty adequate for our use case.
Essentially, this commit implements a recommendation given by nghttp2
library:
https://nghttp2.org/documentation/nghttp2_session_mem_send.html
The libuv has a support for running long running tasks in the dedicated
threadpools, so it doesn't affect networking IO.
This commit adds isc_nm_work_enqueue() wrapper that would wraps around
the libuv API and runs it on top of associated worker loop.
The only limitation is that the function must be called from inside
network manager thread, so the call to the function should be wrapped
inside a (bound) task.
Instead of having a configure check for every missing function that has
been added in later version of libuv, we now use UV_VERSION_HEX to
decide whether we need the shim or not.
The uv_req_get_data() and uv_req_set_data() functions were introduced in
libuv >= 1.19.0, so we need to add compatibility shims with older libuv
versions.
On some platforms, the __attribute__ constructor and destructor won't
take priorities and the compilation failed. On such platform would be
macOS. For this reason, the constructor/destructor in the libisc was
reworked to not use priorities, but have a single constructor and
destructor that calls the appropriate routines in correct order.
This commit removes the extra priority because it's now not needed and
it also breaks a compilation on macOS with GCC 10.
configuring with --enable-mutex-atomics flagged these incorrectly
initialised variables on systems where pthread_mutex_init doesn't
just zero out the structure.
The isc_nmiface_t type was holding just a single isc_sockaddr_t,
so we got rid of the datatype and use plain isc_sockaddr_t in place
where isc_nmiface_t was used before. This means less type-casting and
shorter path to access isc_sockaddr_t members.
At the same time, instead of keeping the reference to the isc_sockaddr_t
that was passed to us when we start listening, we will keep a local
copy. This prevents the data race on destruction of the ns_interface_t
objects where pending nmsockets could reference the sockaddr of already
destroyed ns_interface_t object.
Previously, as a way of reducing the contention between threads a
clientmgr object would be created for each interface/IP address.
We tasks being more strictly bound to netmgr workers, this is no longer
needed and we can just create clientmgr object per worker queue (ncpus).
Each clientmgr object than would have a single task and single memory
context.
Since a client object is bound to a netmgr handle, each client
will always be processed by the same netmgr worker, so we can
simplify the code by binding client->task to the same thread as
the client. Since ns__client_request() now runs in the same event
loop as client->task events, is no longer necessary to pause the
task manager before launching them.
Also removed some functions in isc_task that were not used.
Instead of using fixed quantum, this commit adds atomic counter for
number of items on each queue and uses the number of netievents
scheduled to run as the limit of maximum number of netievents for a
single process_queue() run.
This prevents the endless loops when the netievent would schedule more
netievents onto the same loop, but we don't have to pick "magic" number
for the quantum.
This commit adds a new configuration option to set the receive and send
buffer sizes on the TCP and UDP netmgr sockets. The default is `0`
which doesn't set any value and just uses the value set by the operating
system.
There's no magic value here - set it too small and the performance will
drop, set it too large, the buffers can fill-up with queries that have
already timeouted on the client side and nobody is interested for the
answer and this would just make the server clog up even more by making
it produce useless work.
The `netstat -su` can be used on POSIX systems to monitor the receive
and send buffer errors.
The outgoing UDP socket selection would pick unintialized children
socket on Windows, because we have more netmgr workers than we have
listening sockets. This commit fixes the selection by keeping the
outgoing socket the same, so it's always run on existing socket.
The initial intent was to limit the number of concurrent streams by
the value of 100 but due to the error when reading the documentation
it was set to the maximum possible number of streams per session.
This could lead to security issues, e.g. a remote attacker could have
taken down the BIND instance by creating lots of sessions via low
number of transport connections. This commit fixes that.
We should not call nghttp2_session_terminate_session() in server-side
code after all of the active HTTP/2 streams are processed. The
underlying transport connection is expected to remain opened at least
for some time in this case for new HTTP/2 requests to arrive. That is
what flamethrower was expecting and it makes perfect sense from the
HTTP/2 perspective.
During the stress testing, it was discovered that the default netmgr
quantum of 128 is not enough and there was a performance drop for TCP on
FreeBSD. Bumping the default quantum to 1024 solves the performance
issue and is still enough to prevent the endless loops.
We were clearing the pointer to taskmgr as soon as isc_taskmgr_destroy()
would be called and before all tasks were finished. Unfortunately, some
tasks would use global named_g_taskmgr objects from inside the events
and this would cause either a data race or NULL pointer dereference.
This commit fixes the data race by moving the destruction of the
referenced pointer to the time after all tasks are finished.
Network manager events that require interlock (pause, resume, listen)
are now always executed in the same worker thread, mgr->workers[0],
to prevent races.
"stoplistening" events no longer require interlock.
- ensure isc_nm_pause() and isc_nm_resume() work the same whether
run from inside or outside of the netmgr.
- promote 'stop' events to the priority event level so they can
run while the netmgr is pausing or paused.
- when pausing, drain the priority queue before acquiring an
interlock; this prevents a deadlock when another thread is waiting
for us to complete a task.
- release interlock after pausing, reacquire it when resuming, so
that stop events can happen.
some incidental changes:
- use a function to enqueue pause and resume events (this was part of a
different change attempt that didn't work out; I kept it because I
thought was more readable).
- make mgr->nworkers a signed int to remove some annoying integer casts.
The netmgr listening, stoplistening, pausing and resuming functions
now use barriers for synchronization, which makes the code much simpler.
isc/barrier.h defines isc_barrier macros as a front-end for uv_barrier
on platforms where that works, and pthread_barrier where it doesn't
(including TSAN builds).
When isc__nm_http_stoplistening() is run from inside the netmgr, we need
to make sure it's run synchronously. This commit is just a band-aid
though, as the desired behvaior for isc_nm_stoplistening() is not always
the same:
1. When run from outside user of the interface, the call must be
synchronous, e.g. the calling code expects the call to really stop
listening on the interfaces.
2. But if there's a call from listen<proto> when listening fails,
that needs to be scheduled to run asynchronously, because
isc_nm_listen<proto> is being run in a paused (interlocked)
netmgr thread and we could get stuck.
The proper solution would be to make isc_nm_stoplistening()
behave like uv_close(), i.e., to have a proper callback.
all zone loading tasks have the privileged flag, but we only want
them to run as privileged tasks when the server is being initialized;
if we privilege them the rest of the time, the server may hang for a
long time after a reload/reconfig. so now we call isc_taskmgr_setmode()
to turn privileged execution mode on or off in the task manager.
isc_task_privileged() returns true if the task's privilege flag is
set *and* the taskmgr is in privileged execution mode. this is used
to determine in which netmgr event queue the task should be run.
There was a theoretical possibility of clogging up the queue processing
with an endless loop where currently processing netievent would schedule
new netievent that would get processed immediately. This wasn't such a
problem when only netmgr netievents were processed, but with the
addition of the tasks, there are at least two situation where this could
happen:
1. In lib/dns/zone.c:setnsec3param() the task would get re-enqueued
when the zone was not yet fully loaded.
2. Tasks have internal quantum for maximum number of isc_events to be
processed, when the task quantum is reached, the task would get
rescheduled and then immediately processed by the netmgr queue
processing.
As the isc_queue doesn't have a mechanism to atomically move the queue,
this commit adds a mechanism to quantize the queue, so enqueueing new
netievents will never stop processing other uv_loop_t events.
The default quantum size is 128.
Since the queue used in the network manager allows items to be enqueued
more than once, tasks are now reference-counted around task_ready()
and task_run(). task_ready() now has a public API wrapper,
isc_task_ready(), that the netmgr can use to reschedule processing
of a task if the quantum has been reached.
Incidental changes: Cleaned up some unused fields left in isc_task_t
and isc_taskmgr_t after the last refactoring, and changed atomic
flags to atomic_bools for easier manipulation.
With taskmgr running on top of netmgr, the ordering of how the tasks and
netmgr shutdown interacts was wrong as previously isc_taskmgr_destroy()
was waiting until all tasks were properly shutdown and detached. This
responsibility was moved to netmgr, so we now need to do the following:
1. shutdown all the tasks - this schedules all shutdown events onto
the netmgr queue
2. shutdown the netmgr - this also makes sure all the tasks and
events are properly executed
3. Shutdown the taskmgr - this now waits for all the tasks to finish
running before returning
4. Shutdown the netmgr - this call waits for all the netmgr netievents
to finish before returning
This solves the race when the taskmgr object would be destroyed before
all the tasks were finished running in the netmgr loops.
Previously, netmgr, taskmgr, timermgr and socketmgr all had their own
isc_<*>mgr_create() and isc_<*>mgr_destroy() functions. The new
isc_managers_create() and isc_managers_destroy() fold all four into a
single function and makes sure the objects are created and destroy in
correct order.
Especially now, when taskmgr runs on top of netmgr, the correct order is
important and when the code was duplicated at many places it's easy to
make mistake.
The former isc_<*>mgr_create() and isc_<*>mgr_destroy() functions were
made private and a single call to isc_managers_create() and
isc_managers_destroy() is required at the program startup / shutdown.
Under some circumstances a situation might occur when server-side
session gets finished while there are still active HTTP/2
streams. This would lead to isc_nm_httpsocket object leaks.
This commit fixes this behaviour as well as refactors failed_read_cb()
to allow better code reuse.
This commit fixes a situation when a cstream object could get unlinked
from the list as a result of a cstream->read_cb call. Thus, unlinking
it after the call could crash the program.
... the last handle has been detached after calling write
callback. That makes it possible to detach from the underlying socket
and not to keep the socket object alive for too long. This issue was
causing TLS tests with quota to fail because quota might not have been
detached on time (because it was still referenced by the underlying
TCP socket).
One could say that this commit is an ideological continuation of:
513cdb52ecd4e63566672217f7390574f68c4d2d.
This way we create less netievent objects, not bombarding NM with the
messages in case of numerous low-level errors (like too many open
files) in e.g. unit tests.
This change ensures that a TCP connect callback is called from within
the context of a worker thread in case of a low-level error when
descriptors cannot be created (e.g. when there are too many open file
descriptors).