mirror of https://gitlab.isc.org/isc-projects/bind9 synced 2025-08-29 13:38:26 +00:00

Remove isc_qsbr (we are using liburcu instead)

This commit breaks the qp-trie code.
This commit is contained in:
Tony Finch 2023-03-08 10:20:16 +00:00
parent cd0795beea
commit 05ca11e122
7 changed files with 1 addition and 1123 deletions

@@ -1,397 +0,0 @@
<!--
Copyright (C) Internet Systems Consortium, Inc. ("ISC")
SPDX-License-Identifier: MPL-2.0
This Source Code Form is subject to the terms of the Mozilla Public
License, v. 2.0. If a copy of the MPL was not distributed with this
file, you can obtain one at https://mozilla.org/MPL/2.0/.
See the COPYRIGHT file distributed with this work for additional
information regarding copyright ownership.
-->
QSBR: quiescent state based reclamation
=======================================
QSBR is a safe memory reclamation (SMR) algorithm for lock-free data
structures such as a qp-trie. (See `doc/dev/qp.md`.)
When an object is unlinked from a lock-free data structure, it
cannot be `free()`ed immediately, because there can still be readers
accessing the object via an old version of the data structure. SMR
algorithms determine when it is safe to reclaim memory after it has
been unlinked.
Introductions and overviews
---------------------------
There is a terse overview in `include/isc/qsbr.h`.
Jeff Preshing has a nice introduction to QSBR,
_<https://preshing.com/20160726/using-quiescent-states-to-reclaim-memory/>_
At the end of this note is a copy of a blog post about writing BIND's
`isc_qsbr`, _<https://dotat.at/@/2023-01-10-qsbr.html>_
[Paul McKenney's web page][paulmck] has links to his book on
concurrent programming, the [Userspace RCU library][urcu], and more.
McKenney invented RCU and QSBR. RCU is the Linux kernel's machinery
for lock-free data structures and safe memory reclamation, based on
QSBR.
[paulmck]: http://www.rdrop.com/~paulmck/
[urcu]: https://liburcu.org/
Example code
------------
If you are implementing a lock-free data structure that needs safe
memory reclamation, here's a guide to using `isc_qsbr`, based on how
QSBR is used by `dns_qp`.
### registration
When the program starts up you need to register a global callback
function that will reclaim unused memory. You can do so using an
ISC_CONSTRUCTOR function that runs automatically at startup.
```c
static void
qp_qsbr_register(void) ISC_CONSTRUCTOR;

static void
qp_qsbr_register(void) {
	isc_qsbr_register(qp_qsbr_reclaimer);
}
```
### work list
Your module will need somewhere that your callback can find the work
it needs to do. The qp-trie has an atomic list of `dns_qpmulti_t`
objects for this purpose.
```c
/* a global variable */
static ISC_ASTACK(dns_qpmulti_t) qsbr_work;
```
We use global variables so that we don't need to allocate a thunk
every time there is memory reclamation work to do.
### read-only access
You should design your data structure so that it has a single atomic
root pointer referring to its current version. A lock-free reader
_must_ run in an `isc_loop` callback. It gains access to the data
structure by taking a copy of this pointer:
```c
qp_node_t *reader = atomic_load_acquire(&multi->reader);
```
During an `isc_loop` callback, a reader should keep using the same
pointer to get a consistent view of the data structure. If it reloads
the pointer, it may get a different version, changed by concurrent
writers.
A reader _must_ stop using the root pointer and any interior pointers
obtained via the root pointer before it returns to the `isc_loop`.
### modifications and writes
All changes to the data structure must be copy-on-write (aka
read-copy-update) so that concurrent readers are not disturbed.
When a new version of the data structure has been prepared, it is
committed by overwriting the atomic root pointer,
```c
atomic_store_release(&multi->reader, reader); /* COMMIT */
```
### scheduling cleanup
After committing a change, your data structure may have memory that
will become free, after concurrent readers have stopped accessing it.
To reclaim the memory when it is safe, use code like:
```c
isc_qsbr_phase_t phase = isc_qsbr_phase(multi->loopmgr);
if (defer_chunk_reclamation(qp, phase)) {
	ISC_ASTACK_ADD(qsbr_work, multi, cleanup);
	isc_qsbr_activate(multi->loopmgr, phase);
}
```
* First, get the current QSBR phase.
* Second, mark free memory with the phase number. The qp-trie scans
  its chunks, marks those that will become free, and returns
  `true` if there is cleanup work to do.
* If so, the qp-trie is added to the work list. (`ISC_ASTACK_ADD()`
  is idempotent.)
* Finally, QSBR is informed that there is work to do.
In other cases it might not make sense to scan the data structure
after committing, and instead you might make note of which memory to
clean up while making changes before you know what the phase will be.
You can then have per-phase work lists, like:
```c
static ISC_ASTACK(my_work_t) qsbr_work[ISC_QSBR_PHASES];

isc_qsbr_phase_t phase = isc_qsbr_phase(loopmgr);
ISC_ASTACK_ADD(qsbr_work[phase], cleanup_work, link);
isc_qsbr_activate(loopmgr, phase);
```
In general, there will be several (maybe many) write operations during
a grace period. Your lock-free data structure should collect its
reclamation work from all these writes into a batch per phase, i.e.
per grace period.
### reclaiming
Inside the reclaimer callback, we iterate over the work list and clean
up each item on it. If there is more cleanup work to do in another
phase, we put the qp-trie back on the work list for another go.
```c
static void
qp_qsbr_reclaimer(isc_qsbr_phase_t phase) {
	ISC_STACK(dns_qpmulti_t) drain = ISC_ASTACK_TO_STACK(qsbr_work);

	while (!ISC_STACK_EMPTY(drain)) {
		dns_qpmulti_t *multi = ISC_STACK_POP(drain, cleanup);
		INSIST(QPMULTI_VALID(multi));
		LOCK(&multi->mutex);
		if (reclaim_chunks(&multi->writer, phase)) {
			/* more to do next time */
			ISC_ASTACK_ADD(qsbr_work, multi, cleanup);
		}
		UNLOCK(&multi->mutex);
	}
}
```
### reclaim marks
In the qp-trie data structure, each chunk has some metadata which
includes a bitfield for the reclaim phase:
```c
isc_qsbr_phase_t phase : ISC_QSBR_PHASE_BITS;
```
We use a bitfield so that all the metadata fits in a single word.
------------------------------------------------------------------------
Safe memory reclamation for BIND
================================
At the end of October 2022, I _finally_ got [my multithreaded
qp-trie][qp-gc] working! It could be built with two different
concurrency control mechanisms:
* A reader/writer lock
This has poor read-side scalability, because every thread is
hammering on the same shared location. But its write performance
is reasonably good: concurrent readers don't slow it down too much.
* [`liburcu`, userland read-copy-update][urcu]
RCU has a fast and scalable read side, nice! But on the write side
I used `synchronize_rcu()`, which is blocking and rather slow, so
my write performance was terrible.
OK, but I want the best of both worlds! To fix it, I needed to change
the qp-trie code to use safe memory reclamation more effectively:
instead of blocking inside `synchronize_rcu()` before cleaning up, use
`call_rcu()` to clean up asynchronously. I expect I'll write about the
qp-trie changes another time.
Another issue is that I want the best of both worlds _by default_,
but `liburcu` is [LGPL][] and we don't want BIND to depend on
code whose licence demands more from our users than the [MPL][].
[qp-gc]: https://dotat.at/@/2021-06-23-page-based-gc-for-qp-trie-rcu.html
[LGPL]: https://opensource.org/licenses/LGPL-2.1
[MPL]: https://opensource.org/licenses/MPL-2.0
So I set out to write my own safe memory reclamation support code.
lock freedom
------------
In a [multithreaded qp-trie][qp-gc], there can be many concurrent
readers, but there can be only one writer at a time and modifications
are strictly serialized. When I have got it working properly, readers
are completely wait-free, unaffected by other readers, and almost
unaffected by writers. Writers need to get a mutex to ensure there is
only one at a time, but once the mutex is acquired, a writer is not
obstructed by readers.
The way this works is that readers use an atomic load to get a pointer
to the root of the current version of the trie. Readers can make
multiple queries using this root pointer and the results will be
consistent wrt that particular version, regardless of what changes
writers might be making concurrently. Writers do not affect readers
because all changes are made by copy-on-write. When a writer is ready
to commit a new version of the trie, it uses an atomic store to flip
the root pointer.
safe memory reclamation
-----------------------
We can't copy-on-write indefinitely: we need to reclaim the memory
used by old versions of the trie. And we must do so "safely", i.e.
without `free()`ing memory that readers are still using.
So, before `free()`ing memory, a writer must wait for a _"grace
period"_, which is a jargon term meaning "until readers are not using
the old version". There are a bunch of algorithms for determining when
a grace period is over, with varying amounts of over-approximation,
CPU overhead, and memory backlog.
The [RCU][urcu] function `synchronize_rcu()` is slow because it blocks
waiting for a grace period; the `call_rcu()` function runs a callback
asynchronously after a grace period has passed. I wanted to avoid
blocking my writers, so I needed to implement something like
`call_rcu()`.
aversions
---------
When I started trying to work out how to do safe memory reclamation,
it all seemed quite intimidating. But as I learned more, I found that
my circumstances make it easier than it appeared at first.
The [`liburcu`][urcu] homepage has a long list of supported CPU
architectures and operating systems. Do I have to care about those
details too? No! The RCU code dates back to before the age of
standardized concurrent memory models, so the RCU developers had to
invent their own atomic primitives and correctness rules. Twenty-ish
years later the state of the art has advanced, so I can use
`<stdatomic.h>` without having to re-do it like `liburcu`.
You can also choose between several algorithms implemented by
[`liburcu`][urcu], involving questions about kernel support, specially
reserved signals, and intrusiveness in application code. But while I
was working out how to schedule asynchronous memory reclamation work,
I realised that BIND is already well-suited to the fastest flavour of
RCU, called "QSBR".
QSBR
----
QSBR stands for "quiescent state based reclamation". A _"quiescent
state"_ is a fancy name for a point when a thread is not accessing a
lock-free data structure, and does not retain any root pointers or
interior pointers.
When a thread has passed through a quiescent state, it no longer has
access to older versions of the data structures. When _all_ threads
have passed through quiescent states, then nothing in the program has
access to old versions. This is how QSBR detects grace periods: after
a writer commits a new version, it waits for all threads to pass
through quiescent states, and therefore a grace period has definitely
elapsed, and so it is then safe to reclaim the old version's memory.
QSBR is fast because readers do not need to explicitly mark the
critical section surrounding the atomic load that I mentioned earlier.
Threads just need to pass through a quiescent state frequently enough
that there isn't a huge build-up of unreclaimed memory.
Inside an operating system kernel (RCU's native environment), a
context switch provides a natural quiescent state. In a userland
application, you need to find a good place to call
`rcu_quiescent_state()`. You could call it every time you have
finished using a root pointer, but marking a quiescent state is not
completely free, so there are probably more efficient ways.
`libuv`
-------
BIND is multithreaded, and (basically) each thread runs an event loop.
Recent versions of BIND use [`libuv`][uv] for the event loops.
A lot of things started falling into place when I realised that the
`libuv` event loop gives BIND a [natural quiescent state][uv-loop]:
when the event callbacks have finished running, and `libuv` is about
to call `select()` or `poll()` or whatever, we can mark a quiescent
state. We can require that event-handling functions do not stash root
pointers in the heap, but only use them via local variables, so we
know that old versions are inaccessible after the callback returns.
My design marks a quiescent state once per loop, so on a busy server
where each loop has lots to do, the cost of marking a quiescent state
is amortized across several I/O events.
[uv]: http://libuv.org/
[uv-loop]: http://docs.libuv.org/en/v1.x/design.html#the-i-o-loop
fuzzy barrier
-------------
So, how do we mark a quiescent state? Using a _"fuzzy barrier"_.
When a thread reaches a normal barrier, it blocks until all the other
threads have reached the barrier, after which exactly one of the
threads can enter a protected section of code, and the others are
unblocked and can proceed as normal.
When a thread encounters a fuzzy barrier, it never blocks. It either
proceeds immediately as normal, or if it is the last thread to reach
the barrier, it enters the protected code.
RCU does not actually use a fuzzy barrier as I have described it. Like
a fuzzy barrier, each thread keeps track of whether it has passed
through a quiescent state in the current grace period, without
blocking; but unlike a fuzzy barrier, no thread is diverted to the
protected code. Instead, code that wants to enter a protected section
uses the blocking `synchronize_rcu()` function.
EBR-ish
-------
As in the paper ["performance of memory reclamation for lockless
synchronization"][HMBW], my implementation of QSBR uses a fuzzy
barrier designed for another safe memory reclamation algorithm, EBR,
epoch based reclamation. (EBR was invented here in Cambridge by [Keir
Fraser][tr579].)
[HMBW]: http://csng.cs.toronto.edu/publication_files/0000/0159/jpdc07.pdf
[tr579]: https://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-579.html
Actually, my fuzzy barrier is slightly different to EBR's. In EBR, the
fuzzy barrier is used every time the program enters a critical
section. (In qp-trie terms, that would be every time a reader fetches
a root pointer.) So it is vital that EBR's barrier avoids mutating
shared state, because that would wreck multithreaded performance.
Because BIND will only pass through the fuzzy barrier when it is about
to use a blocking system call, my version mutates shared state more
frequently (typically, once per CPU per grace period, instead of once
per grace period). If this turns out to be a problem, it won't be too
hard to make it work more like EBR.
More trivially, I'm using the term "phase" instead of "epoch", because
it's nothing to do with the unix epoch, because there are three
phases, and because I can talk about phase transitions and threads
being out of phase with each other.
coda
----
While reading various RCU-related papers, I was amused by ["user-level
implementations of read-copy update"][DMSDW], which says:
> BIND, a major domain-name server used for Internet domain-name
> resolution, is facing scalability issues. Since domain names
> are read often but rarely updated, using user-level RCU might be
> beneficial.
Yes, I think it might :-)
[DMSDW]: https://www.efficios.com/publications/


@@ -66,7 +66,6 @@ libisc_la_HEADERS = \
 	include/isc/pause.h \
 	include/isc/portset.h \
 	include/isc/quota.h \
-	include/isc/qsbr.h \
 	include/isc/radix.h \
 	include/isc/random.h \
 	include/isc/ratelimiter.h \
@@ -174,7 +173,6 @@ libisc_la_SOURCES = \
 	picohttpparser.h \
 	portset.c \
 	quota.c \
-	qsbr.c \
 	radix.c \
 	random.c \
 	ratelimiter.c \


@@ -69,17 +69,6 @@ isc_loopmgr_run(isc_loopmgr_t *loopmgr);
  *\li 'loopmgr' is a valid loop manager.
  */
-void
-isc_loopmgr_wakeup(isc_loopmgr_t *loopmgr);
-/*%<
- * Send no-op events to wake up all running loops in 'loopmgr' except
- * the current one. (See <isc/qsbr.h>.)
- *
- * Requires:
- *\li 'loopmgr' is a valid loop manager.
- *\li We are in a running loop.
- */
 void
 isc_loopmgr_pause(isc_loopmgr_t *loopmgr);
 /*%<


@@ -1,282 +0,0 @@
/*
* Copyright (C) Internet Systems Consortium, Inc. ("ISC")
*
* SPDX-License-Identifier: MPL-2.0
*
* This Source Code Form is subject to the terms of the Mozilla Public
* License, v. 2.0. If a copy of the MPL was not distributed with this
* file, you can obtain one at https://mozilla.org/MPL/2.0/.
*
* See the COPYRIGHT file distributed with this work for additional
* information regarding copyright ownership.
*/
#pragma once
#include <isc/atomic.h>
#include <isc/stack.h>
#include <isc/types.h>
#include <isc/uv.h>
/*
* Quiescent state based reclamation
* =================================
*
* QSBR is a safe memory reclamation algorithm for lock-free data
* structures such as a qp-trie.
*
* When an object is unlinked from a lock-free data structure, it
* cannot be free()d immediately, because there can still be readers
* accessing the object via an old version of the data structure. SMR
* algorithms determine when it is safe to reclaim memory after it has
* been unlinked.
*
* With QSBR, reading a data structure is wait-free. All that is
* required is an atomic load to get the data structure's current
* root; there is no need to explicitly mark any read-side critical
* section.
*
* QSBR is used by RCU (read-copy-update) in the Linux kernel. BIND's
* implementation also uses some ideas from EBR (epoch-based reclamation).
* The following summary is based on the overview in the paper
* "performance of memory reclamation for lockless synchronization",
* (http://csng.cs.toronto.edu/publication_files/0000/0159/jpdc07.pdf).
*
* Aside: This QSBR implementation is somewhat different from the one
* in liburcu, described in the paper "user-level implementations of
* read-copy update", (https://www.efficios.com/publications/), which
* contains the amusing comment:
*
* BIND, a major domain-name server used for Internet domain-name
* resolution, is facing scalability issues. Since domain names
* are read often but rarely updated, using user-level RCU might
* be beneficial.
*
* A "quiescent state" is a point when a thread is not accessing any
* lock-free data structure. After passing through a quiescent state,
* a thread can no longer access versions of a data structure that
* were replaced before that point. In BIND, we use a point in the
* event loop (a uv_prepare_t callback) to identify a quiescent state.
*
* Aside: a prepare handle runs its callbacks before the loop sleeps,
* which reduces reclaim latency (unlike a check handle) and it does
* not affect timeout calculations (unlike an idle handle).
*
* A "grace period" is any time interval such that after the end of
* the grace period, all objects removed before the start of the grace
* period can safely be reclaimed. Different SMR algorithms detect
* grace periods with varying degrees of tightness or looseness.
*
* QSBR uses quiescent states to detect grace periods: a grace period
* is a time interval in which every thread passes through a quiescent
* state. (This is a safe over-estimate.) A "fuzzy barrier" is used to
* find out when all threads have passed through a quiescent state.
*
* NOTE: In BIND this means that code which is not running in an event
* loop thread (such as an isc_work / uv_work_t callback) must use
* locking (not lock-free) data structure accessors.
*
* Because a quiescent state happens once per event loop, a grace
* period takes roughly the same amount of time as the slowest event
* loop in each cycle.
*
* Similar to the paper linked above, this QSBR implementation uses a
* variant of the EBR fuzzy barrier. Like EBR, each grace period is
* numbered with a "phase", which cycles round 1,2,3,1,2,3,... (Phases
* are called epochs in EBR, but I think "phase" is a better metaphor.)
* When entering the fuzzy barrier, each thread updates its local phase
* to match the global phase, keeping a global count of the number of
* threads still to pass. When this count reaches zero, it is the end of
* the grace period; the global phase is updated and reclamation is
* triggered.
*
* Note that threads are usually slightly out-of-phase wrt the global
* grace period. At any particular point in time, there will be some
* threads in the current global phase, and some in the previous
* global phase. EBR has three phases because that is the minimum
* number that leaves one phase unoccupied by readers. Any objects that
* were detached from the data structure in the third phase can be
* reclaimed after the start of the current phase, because a grace
* period (the previous phase) has elapsed since the objects were
* detached.
*
* A phase number can be used by a lock-free data structure (such as a
* qp-trie) to record when an object was detached. QSBR calls the data
* structure's reclaimer function, passing a phase number indicating
 * that objects detached in that phase can now be reclaimed.
*
* In general, there will be several (maybe many) write operations
* during a grace period. The lock-free data structures that use QSBR
* will collect their reclamation work from all these writes into a
* batch per phase, i.e. per grace period.
*
* There is some example code in `doc/dev/qsbr.md`, with pointers to
* less terse introductions to QSBR and other overview material.
*/
#define ISC_QSBR_PHASE_BITS 2
typedef unsigned int isc_qsbr_phase_t;
/*%<
* A grace period phase number. It can be stored in a bitfield of size
* ISC_QSBR_PHASE_BITS. You can use zero to indicate "no phase".
* (Don't assume the maximum is three: We might want to increase the
* number of phases so that there is more than one unoccupied phase.
* This would allow concurrent reclamation of objects released in
* multiple unoccupied phases.)
*/
typedef void
isc_qsbreclaimer_t(isc_qsbr_phase_t phase);
/*%<
* The type of memory reclaimer callback functions.
*
* The `phase` identifies which objects are to be reclaimed.
*
* An isc_qsbreclaimer_t can call isc_qsbr_activate() if it could not
* reclaim everything and needs to be called again.
*/
typedef struct isc_qsbr_registered {
ISC_SLINK(struct isc_qsbr_registered) link;
isc_qsbreclaimer_t *func;
} isc_qsbr_registered_t;
/*%<
* Each reclaimer callback has a static `isc_qsbr_registered_t` object
* so that QSBR can find it.
*/
void
isc__qsbr_register(isc_qsbr_registered_t *reg);
/*%<
* Requires:
* \li reclaimer->link is not linked
* \li reclaimer->func is not NULL
*/
#define isc_qsbr_register(cb) \
do { \
static isc_qsbr_registered_t registration = { \
.link = ISC_SLINK_INITIALIZER, \
.func = cb, \
}; \
isc__qsbr_register(&registration); \
} while (0)
/*%<
* Register a callback function with QSBR. This macro should be used
* inside an `ISC_CONSTRUCTOR` function. There should be one callback
 * for each lock-free data structure implementation, which is able to
* reclaim all the unused memory across all instances of its data
* structure.
*/
isc_qsbr_phase_t
isc_qsbr_phase(isc_loopmgr_t *loopmgr);
/*%<
* Get the current phase, to use for marking detached objects.
*
* To commit a write that requires cleanup, the ordering must be:
*
* - Use atomic_store_release() to commit the data structure's new
* root pointer; release ordering ensures that the interior changes
* are written before the root pointer.
*
* - Call isc_qsbr_phase() to get the phase to be used for marking
* objects to reclaim. This must happen after the commit, to ensure
* there is at least one grace period between commit and cleanup.
*
* - Pass the same phase to isc_qsbr_activate() so that the reclaimer
* will be called after a grace period has passed.
*/
void
isc_qsbr_activate(isc_loopmgr_t *loopmgr, isc_qsbr_phase_t phase);
/*%<
* Tell QSBR that objects have been detached and will need reclaiming
* after a grace period.
*/
/***********************************************************************
*
* private parts
*/
/*
* Accessors and constructors for the `grace` variable.
* It contains two bit fields:
*
* - the global phase in the lower ISC_QSBR_PHASE_BITS
*
* - a thread counter in the upper bits
*/
#define ISC_QSBR_ONE_THREAD (1 << ISC_QSBR_PHASE_BITS)
#define ISC_QSBR_PHASE_MAX (ISC_QSBR_ONE_THREAD - 1)
#define ISC_QSBR_GRACE_PHASE(grace)   ((grace) & ISC_QSBR_PHASE_MAX)
#define ISC_QSBR_GRACE_THREADS(grace) ((grace) >> ISC_QSBR_PHASE_BITS)
#define ISC_QSBR_GRACE(threads, phase) \
	(((threads) << ISC_QSBR_PHASE_BITS) | (phase))
typedef struct isc_qsbr {
/*
* The `grace` variable keeps track of the current grace period.
* When the phase changes, the thread counter is set to the number of
* threads that need to observe the new phase before the grace period
* can end.
*
* The thread counter is an add-on to the usual EBR fuzzy barrier.
* Counting threads through the barrier adds multi-thread update
* contention, and in EBR the fuzzy barrier runs frequently enough
* (on every access) that it's important to minimize its cost. With
* QSBR, the fuzzy barrier runs less frequently (roughly, per loop,
* instead of per-callback) so contention is less of a concern. The
* thread counter helps to reduce reclaim latency, because unlike EBR
* we don't probabilistically check, we know deterministically when
* all threads have changed phase.
*/
atomic_uint_fast32_t grace;
/*
* A flag for each phase indicating that there will be work to
* do, so we don't invoke the reclaim machinery unnecessarily.
* Set by `isc_qsbr_activate()` and cleared before the reclaimer
* functions are invoked (so they can re-set their flag if
* necessary).
*/
atomic_uint_fast32_t activated;
/*
* The time of the last phase transition (isc_nanosecs_t). Used
* to ensure that grace periods do not last forever. We use
* `isc_time_monotonic()` because we need the same time in all
* threads. (`uv_now()` is different in different threads.)
*/
atomic_uint_fast64_t transition_time;
} isc_qsbr_t;
/*
* When we start there is no worker thread yet, so the thread
* count is equal to the number of loops. The global phase starts
* off at one (it must always be nonzero).
*/
#define ISC_QSBR_INITIALIZER(nloops) \
(isc_qsbr_t) { \
.grace = ISC_QSBR_GRACE(nloops, 1), \
.transition_time = isc_time_monotonic(), \
}
/*
* For use by tests that need to explicitly drive QSBR phase transitions.
*/
void
isc__qsbr_quiescent_state(isc_loop_t *loop);
/*
* Used by the loopmgr
*/
void
isc__qsbr_quiescent_cb(uv_prepare_t *handle);
void
isc__qsbr_destroy(isc_loopmgr_t *loopmgr);


@@ -26,14 +26,12 @@
 #include <isc/magic.h>
 #include <isc/mem.h>
 #include <isc/mutex.h>
-#include <isc/qsbr.h>
 #include <isc/refcount.h>
 #include <isc/result.h>
 #include <isc/signal.h>
 #include <isc/strerr.h>
 #include <isc/thread.h>
 #include <isc/tid.h>
-#include <isc/time.h>
 #include <isc/urcu.h>
 #include <isc/util.h>
 #include <isc/uv.h>
@@ -151,7 +149,6 @@ destroy_cb(uv_async_t *handle) {
 	uv_close(&loop->run_trigger, isc__job_close);
 	uv_close(&loop->destroy_trigger, NULL);
 	uv_close(&loop->pause_trigger, NULL);
-	uv_close(&loop->wakeup_trigger, NULL);
 	uv_close(&loop->quiescent, NULL);
 	uv_walk(&loop->loop, loop_walk_cb, (char *)"destroy_cb");
@@ -162,8 +159,6 @@ shutdown_cb(uv_async_t *handle) {
 	isc_loop_t *loop = uv_handle_get_data(handle);
 	isc_loopmgr_t *loopmgr = loop->loopmgr;
-	loop->shuttingdown = true;
 	/* Make sure, we can't be called again */
 	uv_close(&loop->shutdown_trigger, shutdown_trigger_close_cb);
@@ -185,12 +180,6 @@ shutdown_cb(uv_async_t *handle) {
 	UV_RUNTIME_CHECK(uv_async_send, r);
 }
-static void
-wakeup_cb(uv_async_t *handle) {
-	/* we only woke up to make the loop take a spin */
-	UNUSED(handle);
-}
 static void
 loop_init(isc_loop_t *loop, isc_loopmgr_t *loopmgr, uint32_t tid) {
 	*loop = (isc_loop_t){
@@ -226,9 +215,6 @@ loop_init(isc_loop_t *loop, isc_loopmgr_t *loopmgr, uint32_t tid) {
 	UV_RUNTIME_CHECK(uv_async_init, r);
 	uv_handle_set_data(&loop->destroy_trigger, loop);
-	r = uv_async_init(&loop->loop, &loop->wakeup_trigger, wakeup_cb);
-	UV_RUNTIME_CHECK(uv_async_init, r);
 	r = uv_prepare_init(&loop->loop, &loop->quiescent);
 	UV_RUNTIME_CHECK(uv_prepare_init, r);
 	uv_handle_set_data(&loop->quiescent, loop);
@@ -245,7 +231,7 @@ loop_init(isc_loop_t *loop, isc_loopmgr_t *loopmgr, uint32_t tid) {
 static void
 quiescent_cb(uv_prepare_t *handle) {
-	isc__qsbr_quiescent_cb(handle);
+	UNUSED(handle);
 #if defined(RCU_QSBR)
 	/* safe memory reclamation */
@@ -340,7 +326,6 @@ isc_loopmgr_create(isc_mem_t *mctx, uint32_t nloops, isc_loopmgr_t **loopmgrp) {
 	loopmgr = isc_mem_get(mctx, sizeof(*loopmgr));
 	*loopmgr = (isc_loopmgr_t){
 		.nloops = nloops,
-		.qsbr = ISC_QSBR_INITIALIZER(nloops),
 	};
 	isc_mem_attach(mctx, &loopmgr->mctx);
@@ -465,22 +450,6 @@ isc_loopmgr_run(isc_loopmgr_t *loopmgr) {
 	isc_thread_main(loop_thread, &loopmgr->loops[0]);
 }
-void
-isc_loopmgr_wakeup(isc_loopmgr_t *loopmgr) {
-	REQUIRE(VALID_LOOPMGR(loopmgr));
-	for (size_t i = 0; i < loopmgr->nloops; i++) {
-		isc_loop_t *loop = &loopmgr->loops[i];
-		/* Skip current loop */
-		if (i == isc_tid()) {
-			continue;
-		}
-		uv_async_send(&loop->wakeup_trigger);
-	}
-}
 void
 isc_loopmgr_pause(isc_loopmgr_t *loopmgr) {
 	REQUIRE(VALID_LOOPMGR(loopmgr));


@@ -21,7 +21,6 @@
 #include <isc/loop.h>
 #include <isc/magic.h>
 #include <isc/mem.h>
-#include <isc/qsbr.h>
 #include <isc/refcount.h>
 #include <isc/result.h>
 #include <isc/signal.h>
@@ -76,9 +75,7 @@ struct isc_loop {
 	uv_async_t destroy_trigger;
 	/* safe memory reclamation */
-	uv_async_t wakeup_trigger;
 	uv_prepare_t quiescent;
-	isc_qsbr_phase_t qsbr_phase;
 };
 /*
@@ -113,9 +110,6 @@ struct isc_loopmgr {
 	/* per-thread objects */
 	isc_loop_t *loops;
-	/* safe memory reclamation */
-	isc_qsbr_t qsbr;
 };
 /*


@@ -1,393 +0,0 @@
/*
* Copyright (C) Internet Systems Consortium, Inc. ("ISC")
*
* SPDX-License-Identifier: MPL-2.0
*
* This Source Code Form is subject to the terms of the Mozilla Public
* License, v. 2.0. If a copy of the MPL was not distributed with this
* file, you can obtain one at https://mozilla.org/MPL/2.0/.
*
* See the COPYRIGHT file distributed with this work for additional
* information regarding copyright ownership.
*/
#include <isc/atomic.h>
#include <isc/log.h>
#include <isc/loop.h>
#include <isc/qsbr.h>
#include <isc/stack.h>
#include <isc/tid.h>
#include <isc/time.h>
#include <isc/types.h>
#include <isc/uv.h>
#include "loop_p.h"
#define MAX_GRACE_PERIOD_NS 53 * NS_PER_MS
#if 0
#define TRACE(fmt, ...) \
isc_log_write(isc_lctx, ISC_LOGCATEGORY_GENERAL, ISC_LOGMODULE_OTHER, \
ISC_LOG_DEBUG(7), "%s:%u:%s():t%u: " fmt, __FILE__, \
__LINE__, __func__, isc_tid(), ##__VA_ARGS__)
#else
#define TRACE(...)
#endif
static ISC_STACK(isc_qsbr_registered_t) qsbreclaimers = ISC_STACK_INITIALIZER;
static void
reclaim_cb(void *arg);
static void
reclaimed_cb(void *arg);
/**********************************************************************/

/*
 * 3,2,1,3,2,1,...
 */
static isc_qsbr_phase_t
change_phase(isc_qsbr_phase_t phase) {
	return (--phase > 0 ? phase : ISC_QSBR_PHASE_MAX);
}
/*
 * For marking or checking that a phase has cleanup work to do.
 */
static unsigned int
active_bit(isc_qsbr_phase_t phase) {
	return (1 << phase);
}

/*
 * Extract the global phase from the grace period state.
 */
static isc_qsbr_phase_t
global_phase(isc_qsbr_t *qsbr, memory_order m_o) {
	uint32_t grace = atomic_load_explicit(&qsbr->grace, m_o);
	return (ISC_QSBR_GRACE_PHASE(grace));
}
/*
 * Record that the current thread has passed the barrier.
 * Returns true if more threads still need to pass.
 *
 * ATOMIC: acquire-release, to ensure that this is not reordered wrt
 * read-only accesses to lock-free data structures. This implements the
 * ordering requirements of a quiescent state.
 */
static bool
fuzzy_barrier_not_yet(isc_qsbr_t *qsbr) {
	uint32_t grace = atomic_fetch_sub_acq_rel(&qsbr->grace,
						  ISC_QSBR_ONE_THREAD);
	uint32_t threads = ISC_QSBR_GRACE_THREADS(grace);
	return (threads > 1);
}
/*
 * Ungracefully drive all cleanup work to completion.
 *
 * ATOMIC: everything is relaxed, because we assume that concurrent
 * readers have already finished. `reclaim_cb()` uses the `activated`
 * flags to ensure it is OK that threads will race to complete the
 * cleanup.
 */
static void
qsbr_shutdown(isc_loopmgr_t *loopmgr) {
	isc_qsbr_t *qsbr = &loopmgr->qsbr;
	isc_qsbr_phase_t phase = global_phase(qsbr, memory_order_relaxed);
	uint32_t threads = isc_loopmgr_nloops(loopmgr);
	uint32_t grace;

	while (atomic_load_relaxed(&qsbr->activated) != 0) {
		reclaim_cb(loopmgr);
		phase = change_phase(phase);
		grace = ISC_QSBR_GRACE(threads, phase);
		atomic_store_relaxed(&qsbr->grace, grace);
	}
}
/*
 * On a quiet server that does not have enough network traffic to keep
 * all its threads spinning, grace periods might extend indefinitely.
 * So check if we have been waiting an unreasonably long time since
 * the last phase change. If so, send a no-op async request to every
 * thread to make them all cycle through a quiescent state.
 */
static void
maybe_wakeup(isc_loop_t *loop) {
	isc_loopmgr_t *loopmgr = loop->loopmgr;
	isc_qsbr_t *qsbr = &loopmgr->qsbr;

	/*
	 * ATOMIC: relaxed is OK here because we don't use any values guarded
	 * by the `activated` flags.
	 */
	if (atomic_load_relaxed(&qsbr->activated) == 0) {
		return;
	}

	if (loop->shuttingdown) {
		qsbr_shutdown(loopmgr);
		return;
	}

	/*
	 * ATOMIC: relaxed, because the `transition_time` doesn't guard any
	 * other values, just the isc_loopmgr_wakeup() call below.
	 */
	atomic_uint_fast64_t *qsbr_ttp = &qsbr->transition_time;
	isc_nanosecs_t now = isc_time_monotonic();
	isc_nanosecs_t start = atomic_load_relaxed(qsbr_ttp);

	if (now < start + MAX_GRACE_PERIOD_NS) {
		return;
	}

	/*
	 * To stop other threads from also invoking `isc_loopmgr_wakeup()`,
	 * we try to push the timer into the future (expecting that it will
	 * not trigger again), and quit if someone else got there first.
	 * ATOMIC: relaxed, as before; strong, because there is no retry loop.
	 */
	if (!atomic_compare_exchange_strong_relaxed(qsbr_ttp, &start, now)) {
		return;
	}

	TRACE("long grace period of %llu ns, waking up other threads",
	      (unsigned long long)(now - start));
	isc_loopmgr_wakeup(loopmgr);
}
/*
 * Callers use the fuzzy barrier to ensure only one thread can enter
 * this function at a time.
 *
 * Phase transitions happen at roughly the same frequency that IO
 * event loops cycle, limited by the slowest loop in each cycle.
 */
static void
phase_transition(isc_loop_t *loop, isc_qsbr_phase_t current_phase) {
	isc_loopmgr_t *loopmgr = loop->loopmgr;
	isc_qsbr_t *qsbr = &loopmgr->qsbr;

	if (loop->shuttingdown) {
		qsbr_shutdown(loopmgr);
		return;
	}

	/*
	 * After we change phase, threads will be in either the
	 * `current_phase` or the `next_phase`. We will reclaim memory
	 * from the `third_phase`.
	 *
	 * ATOMIC: relaxed is OK here because the necessary synchronization
	 * happens in `reclaim_cb()`.
	 */
	isc_qsbr_phase_t next_phase = change_phase(current_phase);
	isc_qsbr_phase_t third_phase = change_phase(next_phase);
	bool activated = atomic_load_relaxed(&qsbr->activated) &
			 active_bit(third_phase);

	/*
	 * Reset the wakeup timer, and log the length of the grace period.
	 * ATOMIC: relaxed, per the commentary in `maybe_wakeup()`.
	 */
	atomic_uint_fast64_t *qsbr_tt = &qsbr->transition_time;
	isc_nanosecs_t now = isc_time_monotonic();
	isc_nanosecs_t start = atomic_exchange_relaxed(qsbr_tt, now);
	TRACE("phase %u -> %u after grace period of %f ms", current_phase,
	      next_phase, (double)(now - start) / NS_PER_MS);
	UNUSED(start); /* ifndef TRACE() */

	/*
	 * Work out the threads counter for this grace period.
	 *
	 * We need to add one for any reclamation worker thread, to
	 * prevent us from changing phase before the work is done. If
	 * we change too early, any newly detached objects will be
	 * marked with the same phase as the running reclaimer, which
	 * might lead to them being free()d too soon.
	 */
	uint32_t threads = isc_loopmgr_nloops(loopmgr) + (activated ? 1 : 0);

	/*
	 * Start the new grace period.
	 *
	 * ATOMIC: release, to pair with the load-acquire in `reclaim_cb()`
	 * which is spawned in a separate worker thread.
	 */
	uint32_t grace = ISC_QSBR_GRACE(threads, next_phase);
	atomic_store_release(&qsbr->grace, grace);

	if (activated) {
		isc_work_enqueue(loop, reclaim_cb, reclaimed_cb, loopmgr);
	}
}
/*
 * This function is called once per cycle of each IO event loop by the
 * `uv_prepare` callback below.
 */
void
isc__qsbr_quiescent_state(isc_loop_t *loop) {
	isc_loopmgr_t *loopmgr = loop->loopmgr;
	isc_qsbr_t *qsbr = &loopmgr->qsbr;

	/*
	 * ATOMIC: relaxed. If we are in phase then we don't need to
	 * synchronize; if we are not then this thread's presence in
	 * the thread counter will prevent the phase from changing
	 * before we get to the fuzzy barrier.
	 */
	isc_qsbr_phase_t phase = global_phase(qsbr, memory_order_relaxed);
	if (loop->qsbr_phase == phase) {
		maybe_wakeup(loop);
		return;
	}

	/*
	 * Enter the current phase and count us out of the previous phase.
	 */
	loop->qsbr_phase = phase;
	if (fuzzy_barrier_not_yet(qsbr)) {
		maybe_wakeup(loop);
		return;
	}

	/*
	 * We were the last thread to enter the current phase so the
	 * grace period is up. No other thread can reach this point.
	 */
	phase_transition(loop, phase);
}

void
isc__qsbr_quiescent_cb(uv_prepare_t *handle) {
	isc_loop_t *loop = uv_handle_get_data((uv_handle_t *)handle);
	isc__qsbr_quiescent_state(loop);
}
static void
reclaimed_cb(void *arg) {
	/* we are back on a loop thread */
	isc_loopmgr_t *loopmgr = arg;
	isc_qsbr_t *qsbr = &loopmgr->qsbr;
	isc_loop_t *loop = CURRENT_LOOP(loopmgr);

	/*
	 * Remove the reclaimers from the thread count, so that the
	 * next grace period can start.
	 */
	if (fuzzy_barrier_not_yet(qsbr)) {
		return;
	}

	/*
	 * The reclaimers were the last thread to be counted out: every
	 * other thread already passed through a quiescent state.
	 *
	 * We expect loop->qsbr_phase == global_phase() at this point,
	 * except during shutdown when the phase shifts rapidly. Also,
	 * the current loop might not have received the shutdown
	 * message yet, so it seems easiest to omit the assertion.
	 *
	 * ATOMIC: relaxed, the fuzzy barrier already synchronized.
	 */
	TRACE("reclaimers overran");
	phase_transition(loop, global_phase(qsbr, memory_order_relaxed));
}
static void
reclaim_cb(void *arg) {
	/* we are on a work thread not a loop thread */
	isc_loopmgr_t *loopmgr = arg;
	isc_qsbr_t *qsbr = &loopmgr->qsbr;

	/*
	 * The global phase has just been bumped by a `phase_transition()`
	 * and it cannot change again until the grace period is up, which
	 * cannot happen until we have finished working.
	 *
	 * ATOMIC: acquire, to pair with the release in `phase_transition()`.
	 *
	 * The phase we are to clean up is 2 before the current phase,
	 * which is the same as the one after the current phase (mod 3).
	 */
	isc_qsbr_phase_t cur_phase = global_phase(qsbr, memory_order_acquire);
	isc_qsbr_phase_t third_phase = change_phase(cur_phase);
	unsigned int third_bit = active_bit(third_phase);

	/*
	 * If any reclaimers need to be called again later, they can use
	 * `isc_qsbr_activate()`, so we need to clear the bit first.
	 *
	 * ATOMIC: acquire, so that `isc_qsbr_activate()` happens before
	 * the callbacks are invoked.
	 */
	uint32_t activated = atomic_fetch_and_explicit(
		&qsbr->activated, ~third_bit, memory_order_acquire);

	/* this can happen when we are racing to clean up on shutdown */
	if ((activated & third_bit) == 0) {
		return;
	}

	isc_qsbr_registered_t *reclaimer = ISC_STACK_TOP(qsbreclaimers);
	while (reclaimer != NULL) {
		reclaimer->func(third_phase);
		reclaimer = ISC_SLINK_NEXT(reclaimer, link);
	}
}

void
isc__qsbr_register(isc_qsbr_registered_t *reclaimer) {
	REQUIRE(reclaimer->func != NULL);
	ISC_STACK_PUSH(qsbreclaimers, reclaimer, link);
}
/*
 * ATOMIC: This function needs to ensure that the global phase is read
 * after a write has committed. Acquire/release ordering is not sufficient
 * for ordering between separate atomics (the data structure's root pointer
 * and the global phase), so it must be sequentially consistent.
 *
 * In general, the phases up to and including the next phase transition
 * look like:
 *
 *	1. local phase
 *	2. global phase
 *	3. next phase
 *	1. third phase
 *
 * i.e. some threads are still one behind the global phase, on the same
 * phase that will be cleaned up immediately after the phase transition.
 *
 * This function is called just after a write commits. It's likely that
 * some threads on the global phase (2) are using a version of the data
 * structure from before the write, and they can continue using it while
 * the straggler threads (1) catch up and cause a phase transition.
 *
 * The writer can be one of the straggler threads. If it incorrectly marks
 * cleanup work with its local phase (1), memory will be reclaimed
 * immediately after the next phase transition (when the third phase is
 * also 1), which could be almost immediately when the writer returns to
 * the event loop. This will cause a use-after-free for existing readers
 * (in phase 2).
 *
 * More straightforwardly, we need to be able to queue up reclaim work from
 * a thread that isn't running a loop, which also means this function has
 * to return the global phase.
 */
isc_qsbr_phase_t
isc_qsbr_phase(isc_loopmgr_t *loopmgr) {
	isc_qsbr_t *qsbr = &loopmgr->qsbr;
	return (global_phase(qsbr, memory_order_seq_cst));
}
void
isc_qsbr_activate(isc_loopmgr_t *loopmgr, isc_qsbr_phase_t phase) {
	/*
	 * ATOMIC: release ordering ensures that writing the cleanup lists
	 * happens before the callback is invoked from a worker thread.
	 */
	atomic_fetch_or_release(&loopmgr->qsbr.activated, active_bit(phase));
}