mirror of
https://gitlab.isc.org/isc-projects/kea
synced 2025-08-30 21:45:37 +00:00
Delete 01-scaling-across-cores
This commit is contained in:
@@ -1,347 +0,0 @@
|
||||
Scaling across (many) cores
|
||||
===========================
|
||||
|
||||
Problem statement
|
||||
-----------------
|
||||
|
||||
The general issue is how to insure that the resolver scales.
|
||||
|
||||
Currently resolvers are CPU bound, and it seems likely that both
|
||||
instructions-per-cycle and CPU frequency will not increase radically,
|
||||
scaling will need to be across multiple cores.
|
||||
|
||||
How can we best scale a recursive resolver across multiple cores?
|
||||
|
||||
Image of how resolution looks like
|
||||
----------------------------------
|
||||
|
||||
Receive the query. @# <------------------------\
|
||||
| |
|
||||
| |
|
||||
v |
|
||||
Parse it, etc. $ |
|
||||
| |
|
||||
| |
|
||||
v |
|
||||
Look into the cache. $# |
|
||||
Cry <---- No <---------- Is it there? -----------> Yes ---------\ |
|
||||
| ^ | |
|
||||
Prepare upstream query $ | | |
|
||||
| | | |
|
||||
v | | |
|
||||
Send an upstream query (#) | | |
|
||||
| | | |
|
||||
| | | |
|
||||
v | | |
|
||||
Wait for answer @(#) | | |
|
||||
| | | |
|
||||
v | | |
|
||||
Parse $ | | |
|
||||
| | | |
|
||||
v | | |
|
||||
Is it enough? $ ----> No ---------/ | |
|
||||
| | |
|
||||
Yes | |
|
||||
| | |
|
||||
\-----------------------> Build answer $ <----------------------/ |
|
||||
| |
|
||||
| |
|
||||
v |
|
||||
Send answer # -----------------------------/
|
||||
|
||||
This is simplified version, however. There may be other tasks (validation, for
|
||||
example), which are not drawn mostly for simplicity, as they don't produce more
|
||||
problems. The validation would be done as part of some computational task and
|
||||
they could do more lookups in the cache or upstream queries.
|
||||
|
||||
Also, multiple queries may generate the same upstream query, so they should be
|
||||
aggregated together somehow.
|
||||
|
||||
Legend
|
||||
~~~~~~
|
||||
* $ - CPU intensive
|
||||
* @ - Waiting for external event
|
||||
* # - Possible interaction with other tasks
|
||||
|
||||
Goals
|
||||
-----
|
||||
* Run the CPU intensive tasks in multiple threads to allow concurrency.
|
||||
* Minimize waiting for locks.
|
||||
* Don't require too much memory.
|
||||
* Minimize the number of upstream queries (both because they are slow and
|
||||
expensive and also because we don't want to eat too much bandwidth and spam
|
||||
the authoritative servers).
|
||||
* Design simple enough so it can be implemented.
|
||||
|
||||
Naïve version
|
||||
-------------
|
||||
|
||||
Let's look at possible approaches and list their pros and cons. Many of the
|
||||
simple versions would not really work, but let's have a look at them anyway,
|
||||
because thinking about them might bring some solutions for the real versions.
|
||||
|
||||
We take one query, handle it fully, with blocking waits for the answers. After
|
||||
this is done, we take another. The cache is private for each one process.
|
||||
|
||||
Advantages:
|
||||
|
||||
* Very simple.
|
||||
* No locks.
|
||||
|
||||
Disadvantages:
|
||||
|
||||
* To scale across cores, we need to run *a lot* of processes, since they'd be
|
||||
waiting for something most of their time. That means a lot of memory eaten,
|
||||
because each one has its own cache. Also, running so many processes may be
|
||||
problematic, processes are not very cheap.
|
||||
* Many things would be asked multiple times, because the caches are not
|
||||
shared.
|
||||
|
||||
Threads
|
||||
~~~~~~~
|
||||
|
||||
Some of the problems could be solved by using threads, but they'd not improve
|
||||
it much, since threads are not really cheap either (starting several hundred
|
||||
threads might not be a good idea either).
|
||||
|
||||
Also, threads bring other problems. When we still assume separate caches (for
|
||||
caches, see below), we need to ensure safe access to logging, configuration,
|
||||
network, etc. These could be a bottleneck (eg. if we lock every time we read a
|
||||
packet from network, when there are many threads, they'll just fight over the
|
||||
lock).
|
||||
|
||||
Supercache
|
||||
~~~~~~~~~~
|
||||
|
||||
The problem with cache could be solved by placing a ``supercache'' between the
|
||||
resolvers and the Internet. That one would do almost no processing, it would
|
||||
just take the query, looked up in the cache and either answered from the cache
|
||||
or forwarded the query to the external world. It would store the answer and
|
||||
forward it back.
|
||||
|
||||
The cache, if single-threaded, could be a bottle-neck. To solve it, there could
|
||||
be several approaches:
|
||||
|
||||
Layered cache::
|
||||
Each process has it's own small cache, which catches many queries. Then, a
|
||||
group of processes shares another level of bigger cache, which catches most
|
||||
of the queries that get past the private caches. We further group them and
|
||||
each level handles less queries from each process, so they can keep up.
|
||||
However, with each level, we add some overhead to do another lookup.
|
||||
Segmented cache::
|
||||
We have several caches of the same level, in parallel. When we would ask a
|
||||
cache, we hash the query and decide which cache to ask by the hash. Only that
|
||||
cache would have that answer if any and each could run in a separate process.
|
||||
The only problem is, could there be a pattern of queries that would skew to
|
||||
use only one cache while the rest would be idle?
|
||||
Shared cache access::
|
||||
A cache would be accessed by multiple processes/threads. See below for
|
||||
details, but there's a risk of lock contention on the cache (it depends on
|
||||
the data structure).
|
||||
|
||||
Upstream queries
|
||||
~~~~~~~~~~~~~~~~
|
||||
|
||||
Before doing an upstream query, we look into the cache to ensure we don't have
|
||||
the information yet. When we get the answer, we want to update the cache.
|
||||
|
||||
This suggests the upstream queries are tightly coupled with the cache. Now,
|
||||
when we have several cache processes/threads, each can have some set of opened
|
||||
sockets which are not shared with other caches to do the lookups. This way we
|
||||
can avoid locking the upstream network communication.
|
||||
|
||||
Also, we can have three conceptual states for data in cache, and act
|
||||
differently when it is requested.
|
||||
|
||||
Present::
|
||||
If it is available, in positive or negative version, we just provide the
|
||||
answer right away.
|
||||
Not present::
|
||||
The continuation of processing is queued somehow (blocked/callback is
|
||||
stored/whatever). An upstream query is sent and we get to the next state.
|
||||
Waiting for answer::
|
||||
If another query for the same thing arrives, we just queue it the same way
|
||||
and keep waiting. When the answer comes, all the queued tasks are resumed.
|
||||
If the TTL > 0, we store the answer and set it to ``present''.
|
||||
|
||||
We want to do aggregation of upstream queries anyway, using cache for it saves
|
||||
some more processing and possibly locks.
|
||||
|
||||
Multiple parallel queries
|
||||
-------------------------
|
||||
|
||||
It seems obvious we can't afford to have a thread or process for each
|
||||
outstanding query. We need to handle multiple queries in each one at any given
|
||||
time.
|
||||
|
||||
Coroutines
|
||||
~~~~~~~~~~
|
||||
|
||||
The OS-level threads might be too expensive, but coroutines might be cheap
|
||||
enough. In that way, we could still write a code that would be easy to read,
|
||||
but limit the number of OS threads to reasonable number.
|
||||
|
||||
In this model, when a query comes, a new coroutine/user-level thread is created
|
||||
for it. We use special reads and writes whenever there's an operation that
|
||||
could block. These reads and writes would internally schedule the operation
|
||||
and switch to another coroutine (if there's any ready to be executed).
|
||||
|
||||
Each thread/process maintains its own set of coroutines and they do not
|
||||
migrate. This way, the queue of coroutines is kept lock-less, as well as any
|
||||
private caches. Only the shared caches are protected by a lock.
|
||||
|
||||
[NOTE]
|
||||
The `coro` unit we have in the current code is *not* considered a coroutine
|
||||
library here. We would need a coroutine library where we have real stack for
|
||||
each coroutine and we switch the stacks on coroutine switch. That is possible
|
||||
with reasonable amount of dark magic (see `ucontext.h`, for example, but there
|
||||
are surely some higher-level libraries for that).
|
||||
|
||||
There are some trouble with multiple coroutines waiting on the same event, like
|
||||
the same upstream query (possibly even coroutines from different threads), but
|
||||
it should be possible to solve.
|
||||
|
||||
Event-based
|
||||
~~~~~~~~~~~
|
||||
|
||||
We use events (`asio` and stuff) for writing it. Each outstanding query is an
|
||||
object with some callbacks on it. When we would do a possibly blocking
|
||||
operation, we schedule a callback to happen once the operation finishes.
|
||||
|
||||
This is more lightweight than the coroutines (the query objects will be smaller
|
||||
than the stacks for coroutines), but it is harder to write and read for.
|
||||
|
||||
[NOTE]
|
||||
Do not consider cross-breeding the models. That leads to space-time distortions
|
||||
and brain damage. Implementing one on top of other is OK, but mixing it in the
|
||||
same bit of code is a way do madhouse.
|
||||
|
||||
Landlords and peasants
|
||||
~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
In both the coroutines and event-based models, the cache and other shared
|
||||
things are easier to imagine as objects the working threads fight over to hold
|
||||
for a short while. In this model, it is easier to imagine each such shared
|
||||
object as something owned by a landlord that doesn't let anyone else on it,
|
||||
but you can send requests to him.
|
||||
|
||||
A query is an object once again, with some kind of state machine.
|
||||
|
||||
Then there are two kinds of threads. The peasants are just to do the heavy
|
||||
work. There's a global work-queue for peasants. Once a peasant is idle, it
|
||||
comes to the queue and picks up a handful of queries from there. It does as
|
||||
much on each the query as possible without requiring any shared resource.
|
||||
|
||||
The other kind, the landlords, have a resource to watch over each. So we would
|
||||
have a cache (or several parts of cache), the sockets for accepting queries and
|
||||
answering them, possibly more. Each of these would have a separate landlord
|
||||
thread and a queue of tasks to do on the resource (look up something, send an
|
||||
answer...).
|
||||
|
||||
Similarly, the landlord would take a handful of tasks from its queue and start
|
||||
handling them. It would possibly produce some more tasks for the peasants.
|
||||
|
||||
The point here is, all the synchronisation is done on the queues, not on the
|
||||
shared resources themselves. And, we would append to a queues once the whole
|
||||
batch was completed. By tweaking the size of the batch, we could balance the
|
||||
lock contention, throughput and RTT. The append/remove would be a quick
|
||||
operation, and the cost of locks would amortize in the larger amount of queries
|
||||
handled per one lock operation.
|
||||
|
||||
The possible downside is, a query needs to travel across several threads during
|
||||
its lifetime. It might turn out it is faster to move the query between cores
|
||||
than accessing the cache from several threads, since it is smaller, but it
|
||||
might be slower as well.
|
||||
|
||||
It would be critical to make some kind of queue that is fast to append to and
|
||||
fast to take out first n items. Also, the tasks in the queues can be just
|
||||
abstract `boost::function<void (Worker&)>` functors, and each worker would just
|
||||
iterate through the queue, calling each functor. The parameter would be to
|
||||
allow easy generation of more tasks for other queues (they would be stored
|
||||
privately first, and appended to remote queues at the end of batch).
|
||||
|
||||
Also, if we wanted to generate multiple parallel upstream queries from a single
|
||||
query, we would need to be careful. A query object would not have a lock on
|
||||
itself and the upstream queries could end up in a different caches/threads. To
|
||||
protect the original query, we would add another landlord that would aggregate
|
||||
answers together and let the query continue processing once it got enough
|
||||
answers. That way, the answers would be pushed all to the same threads and they
|
||||
could not fight over the query.
|
||||
|
||||
[NOTE]
|
||||
This model would work only with threads, not processes.
|
||||
|
||||
Shared caches
|
||||
-------------
|
||||
|
||||
While it seems it is good to have some sort of L1 cache with pre-rendered
|
||||
answers (according to measurements in the #2777 ticket), we probably need some
|
||||
kind of larger shared cache.
|
||||
|
||||
If we had just a single shared cache protected by lock, there'd be a lot of
|
||||
lock contention on the lock.
|
||||
|
||||
Partitioning the cache
|
||||
~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
We split the cache into parts, either by the layers or by parallel bits we
|
||||
switch between by a hash. If we take it to the extreme, a lock on each hash
|
||||
bucket would be this kind, though that might be wasting resources (how
|
||||
expensive is it to create a lock?).
|
||||
|
||||
Landlords
|
||||
~~~~~~~~~
|
||||
|
||||
The landlords do synchronizations themselves. Still, the cache would need to be
|
||||
partitioned.
|
||||
|
||||
RCU
|
||||
~~~
|
||||
|
||||
The RCU is a lock-less synchronization mechanism. An item is accessed through a
|
||||
pointer. An updater creates a copy of the structure (in our case, it would be
|
||||
content of single hash bucket) and then atomically replaces the pointer. The
|
||||
readers from before have the old version, the new ones get the new version.
|
||||
When all the old readers die out, the old copy is reclaimed. Also, the
|
||||
reclamation can AFAIK be postponed for later times when we are slightly more
|
||||
idle or to a different thread.
|
||||
|
||||
We could use it for cache ‒ in the fast track, we would just read the cache. In
|
||||
the slow one, we would have to wait in queue to do the update, in a single
|
||||
updater thread (because we don't really want to be updating the same cell twice
|
||||
at the same time).
|
||||
|
||||
Proposals
|
||||
---------
|
||||
|
||||
In either case, we would have some kind of L1 cache with pre-rendered answers.
|
||||
For these proposals (except the third), we wouldn't care if we split the cache
|
||||
into parallel chunks or layers.
|
||||
|
||||
Hybrid RCU/Landlord
|
||||
~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
The landlord approach, just read only accesses to the cache are done directly
|
||||
by the peasants. Only if they don't find what they want, they'd append the
|
||||
queue to the task of the landlord. The landlord would be doing the RCU updates.
|
||||
It could happen that by the time the landlord gets to the task the answer is
|
||||
already there, but that would not matter much.
|
||||
|
||||
Accessing network would be from landlords.
|
||||
|
||||
Coroutines+RCU
|
||||
~~~~~~~~~~~~~~
|
||||
|
||||
We would do the coroutines, and the reads from shared cache would go without
|
||||
locking. When doing write, we would have to lock.
|
||||
|
||||
To avoid locking, each worker thread would have its own set of upstream sockets
|
||||
and we would dup the sockets from users so we don't have to lock that.
|
||||
|
||||
Multiple processes with coroutines and RCU
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
This would need the layered cache. The upper caches would be mapped to local
|
||||
memory for read-only access. Each cache would be a separate process. The
|
||||
process would do the updates ‒ if the answer was not there, the process would
|
||||
be asked by some kind of IPC to pull it from upstream cache or network.
|
Reference in New Issue
Block a user