03-cache-algorithm

Introduction
------------

Cache performance may be important for the resolver, but it might not
be critical. We need to research this.

One key question is: given a specific cache hit rate, how much of an
impact does cache performance have?

For example, if we have a 90% cache hit rate, will we still be
spending most of our time in system calls, or in looking things up in
our cache?

There are several ways we can consider figuring this out, including
measuring this in existing resolvers (BIND 9, Unbound) or modeling
with specific values.

Once we know how critical cache performance is, we can consider which
algorithm is best for it. If it is very critical, then a custom
algorithm designed for DNS caching makes sense. If it is not, then we
can consider using an STL-based data structure.

Effectiveness of Cache
----------------------

First, I'll try to answer the introductory questions.

In a simplified model, we can express the fraction of the total
running time spent answering queries directly from the cache (where
the total includes the time used for recursive resolution due to
cache misses) as follows:

A = r*Q2 / (r*Q2 + Q1*(1-r))

where
A: fraction of time spent answering queries from the cache per unit
   time (such as sec, 0<=A<=1)
r: cache hit rate (0<=r<=1)
Q1: max qps of the server with 100% cache hit
Q2: max qps of the server with 0% cache hit

(This follows from the per-query costs: a cache hit takes about 1/Q1
of a unit time and a cache miss about 1/Q2, so
A = (r/Q1) / (r/Q1 + (1-r)/Q2), which reduces to the above.)

Q1 can be measured easily for a given data set; measuring Q2 is
tricky in general (it requires many external queries with unreliable
results), but we can still get some not-so-unrealistic numbers
through controlled simulation.

As a data point for these values, see previous experimental results
of mine:
https://lists.isc.org/pipermail/bind10-dev/2012-July/003628.html

Looking at the "ideal" server implementation (no protocol overhead)
with setups of 90% and 85% cache hit rates, 1 recursion on cache
miss, and the maximum possible total throughput, we can deduce Q1 and
Q2: 170591 qps and 60138 qps respectively.

This means that with a 90% cache hit rate (r = 0.9), the server would
spend 76% of its run time receiving queries and answering responses
directly from the cache: 0.9*60138/(0.9*60138 + 0.1*170591) = 0.76.
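
As a sanity check, this model is simple enough to compute directly.
The following is a minimal C++ sketch (the function name and layout
are mine, not from any actual resolver code) that reproduces the
numbers used in this memo:

    #include <cstdio>

    // Fraction of total run time spent answering queries directly
    // from the cache: A = r*Q2 / (r*Q2 + (1-r)*Q1).
    double cacheTimeFraction(double r, double q1, double q2) {
        return r * q2 / (r * q2 + (1 - r) * q1);
    }

    int main() {
        // Q1 and Q2 deduced above: 170591 qps and 60138 qps.
        std::printf("%.2f\n", cacheTimeFraction(0.9, 170591, 60138)); // 0.76
        // The forward-only experiment below suggests Q2 = 0.28*Q1.
        std::printf("%.3f\n", cacheTimeFraction(0.9, 1.0, 0.28));     // 0.716
        return 0;
    }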

I also ran more realistic experiments: using BIND 9.9.2 and unbound
1.4.19 in "forward only" mode with crafted query data and a forwarded
server, to emulate the situations of 100% and 0% cache hit rates. I
then measured the max response throughput using a queryperf-like
tool. In both cases Q2 is about 28% of Q1 (I'm not showing specific
numbers to avoid unnecessary discussion about the specific
performance of existing servers; that's out of scope for this memo).
Using Q2 = 0.28*Q1, the above equation with a 90% cache hit rate
becomes: A = 0.9*0.28 / (0.9*0.28 + 0.1) = 0.716. So the server will
spend about 72% of its running time answering queries directly from
the cache.

Of course, these experimental results are too simplified. First, in
these experiments we assumed only one external query is needed on a
cache miss. In general it can be more; however, the assumption may
not actually be too optimistic either. See another research result of
mine:
http://bind10.isc.org/wiki/ResolverPerformanceResearch
In the more detailed analysis there, using real query samples and
tracing what an actual resolver would do, it looked like we'd need
about 1.44 to 1.63 external queries per cache miss on average.

Still, of course, the real world cases are not that simple: in
reality we'd need to deal with timeouts, slower remote servers,
unexpected intermediate results, etc. DNSSEC validating resolvers
will clearly need to do more work.

So, in a real world deployment Q2 should be much smaller than Q1.
Solving the above equation for Q2 gives
Q2 = (A*(1-r) / (r*(1-A))) * Q1. Here are some specific cases of the
relationship between Q1 and Q2 for a given A (assuming r = 0.9):

70%: Q2 = 0.26 * Q1
60%: Q2 = 0.17 * Q1
50%: Q2 = 0.11 * Q1

So, even if "recursive resolution is 10 times heavier" than the
cache-only case, we can assume the server spends half of its run time
answering queries directly from the cache at a cache hit rate of 90%.
I think this is a reasonably safe assumption.

Now, assuming a number of 50% or more, does this suggest we should
highly optimize the cache? Opinions may vary on this point, but I
personally think the answer is yes. I've written an experimental
cache-only implementation that employs the idea of fully-rendered
cached data. On one test machine (2.20GHz AMD64, using a single
core), a queryperf-like benchmark shows it can handle over 180Kqps,
while BIND 9.9.2 can handle just 41Kqps. The experimental
implementation skips some features necessary for a production server,
and cache management itself is always an inevitable bottleneck, so
the production version wouldn't be that fast, but it still suggests
it may not be very difficult to reach over 100Kqps in a production
environment, including recursive resolution overhead.

Cache Types
-----------

1. Record cache

Conceptually, any recursive resolver (with a cache) implementation
would have a cache for the RRs (or RRsets, in the modern version of
the protocol) given in responses to its external queries. In BIND 9,
it's called the "cached DB", using an in-memory rbt-like tree.
unbound calls it the "rrset cache", which is implemented as a hash
table.
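
As a minimal illustration of the common concept (not either
implementation's actual data structure; the type and field names are
mine), such a record cache maps an (owner name, type) pair to a
cached RRset:

    #include <cstdint>
    #include <ctime>
    #include <map>
    #include <string>
    #include <utility>
    #include <vector>

    struct CachedRRset {
        std::vector<std::string> rdata;  // one element per RR
        uint32_t ttl;                    // original TTL from the response
        std::time_t expiry;              // absolute expiration time
    };

    // Key: owner name (canonical form) and RR type code.  A sorted
    // map here stands in for BIND 9's tree; a hash table (as in
    // unbound) works equally well for exact-match lookups.
    using RecordCache =
        std::map<std::pair<std::string, uint16_t>, CachedRRset>;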
2. Delegation cache

Recursive server implementations would also have a cache used to
determine the deepest zone cut for a given query name in the
recursion process. Neither BIND 9 nor unbound has a separate cache
for this purpose; basically, they try to find an NS RRset in the
"record cache" whose owner name best matches the given query name.
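
To illustrate the "best match" operation, here is a minimal C++
sketch of a longest-match lookup against a table of cached NS sets
(the types and names are mine, purely for illustration):

    #include <string>
    #include <unordered_map>
    #include <vector>

    struct NSSet {
        std::vector<std::string> nameservers;  // NS target names
    };

    // Strip leading labels from qname until a cached NS entry
    // matches; the first match is the deepest (longest) zone cut.
    // Names are assumed lower-cased, absolute (ending with '.'), in
    // presentation format, with the root represented as ".".
    const NSSet* findDeepestCut(
        const std::unordered_map<std::string, NSSet>& ns_cache,
        std::string qname) {
        for (;;) {
            const auto it = ns_cache.find(qname);
            if (it != ns_cache.end()) {
                return &it->second;
            }
            if (qname == ".") {
                return nullptr;  // nothing cached, not even the root
            }
            const auto dot = qname.find('.');
            qname = (dot + 1 == qname.size()) ? "." : qname.substr(dot + 1);
        }
    }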

3. Remote server cache

In addition, a recursive server implementation may maintain a cache
of information about remote authoritative servers. Both BIND 9 and
unbound conceptually have this type of cache, although there are some
non-negligible differences in the details. BIND 9's implementation
of this cache is called ADB. It's a hash table whose key is a domain
name, and each entry stores the corresponding IPv6/v4 addresses;
another data structure for each address stores the averaged RTT for
the address, lameness information, EDNS availability, etc. unbound's
implementation is called the "infrastructure cache". It's a hash
table keyed with IP addresses whose entries store similar information
to that in BIND 9's per-address ADB entries. In unbound a remote
server's address must be determined by looking up the record cache
(rrset cache in unbound terminology); unlike BIND 9's ADB, there's no
direct shortcut from a server's domain name to its IP addresses.
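
As a rough illustration of the two-level structure described above
for BIND 9 (all type and field names here are mine, not the actual
ADB definitions):

    #include <chrono>
    #include <string>
    #include <unordered_map>
    #include <vector>

    // Per-address state: RTT estimate, lameness, EDNS support, etc.
    struct ServerAddressInfo {
        std::string address;             // textual IPv4/IPv6 address
        std::chrono::microseconds smoothed_rtt{0};
        bool edns_ok = true;
        bool lame = false;
    };

    // Keyed by the server's domain name, with a direct shortcut to
    // its addresses (this is the shortcut unbound does not have).
    using AdbLikeCache =
        std::unordered_map<std::string, std::vector<ServerAddressInfo>>;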

4. Full response cache

unbound has an additional cache layer, called the "message cache".
It's a hash table whose hash key is the query parameters (essentially
qname and type) and whose entry is a sequence of record (rrset) cache
entries. This sequence constitutes a complete response to the
corresponding query, so it helps optimize building a response
message, skipping the record cache lookups for each section
(answer/authority/additional) of the response message. PowerDNS
recursor has (seemingly) the same concept, called the "packet cache"
(though I don't know its implementation details well).
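
Conceptually, such a message cache entry might look like the
following C++ sketch (the names are mine; this is not unbound's
actual structure):

    #include <memory>
    #include <vector>

    struct RRsetCacheEntry {
        // owner name, type, TTL, RDATA, etc., as stored in the
        // record (rrset) cache
    };

    // A full response as a sequence of references into the record
    // cache, one vector per message section.
    struct MessageCacheEntry {
        std::vector<std::shared_ptr<RRsetCacheEntry>> answer;
        std::vector<std::shared_ptr<RRsetCacheEntry>> authority;
        std::vector<std::shared_ptr<RRsetCacheEntry>> additional;
    };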

BIND 9 doesn't have this type of cache; it always looks into the
record cache to build a complete response to a given query.

Miscellaneous General Requirements
----------------------------------

- Minimize contention between threads (if threaded)
- Cache purge policy: normally only a very small part of cached DNS
  information will be reused, and what is reused is reused very
  heavily. So an LRU-like algorithm should generally work well, but
  we'll also need to honor the DNS TTL (see the sketch below).
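
A minimal sketch of an LRU cache that also honors TTLs follows.
Entries expire at their TTL deadline even if recently used; the class
name and the plain-string value type are mine, purely for
illustration:

    #include <chrono>
    #include <list>
    #include <string>
    #include <unordered_map>

    class LruTtlCache {
    public:
        explicit LruTtlCache(size_t max_entries)
            : max_entries_(max_entries) {}

        void insert(const std::string& key, const std::string& value,
                    std::chrono::seconds ttl) {
            erase(key);
            lru_.push_front(
                {key, value, std::chrono::steady_clock::now() + ttl});
            map_[key] = lru_.begin();
            if (map_.size() > max_entries_) {  // evict least recently used
                map_.erase(lru_.back().key);
                lru_.pop_back();
            }
        }

        const std::string* find(const std::string& key) {
            auto it = map_.find(key);
            if (it == map_.end()) {
                return nullptr;
            }
            if (std::chrono::steady_clock::now() >= it->second->expiry) {
                lru_.erase(it->second);        // TTL expired: purge now
                map_.erase(it);
                return nullptr;
            }
            // Move to the front to mark as most recently used.
            lru_.splice(lru_.begin(), lru_, it->second);
            return &it->second->value;
        }

    private:
        struct Entry {
            std::string key, value;
            std::chrono::steady_clock::time_point expiry;
        };
        void erase(const std::string& key) {
            auto it = map_.find(key);
            if (it != map_.end()) {
                lru_.erase(it->second);
                map_.erase(it);
            }
        }
        size_t max_entries_;
        std::list<Entry> lru_;
        std::unordered_map<std::string,
                           std::list<Entry>::iterator> map_;
    };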

Random Ideas for BIND 10
------------------------

Below are specific random ideas for BIND 10. Some are based on
experimental results with reasonably realistic data; others are
mostly guesses.

1. Fully rendered response cache

Some real world query samples show that a very small portion of all
queries are very popular and are queried very often, many times over;
the rest are rarely reused, if ever. Two different data sets show
that the top 10,000 queries cover around 80% of all queries,
regardless of the total query volume. This suggests the idea of
having a small, highly optimized full response cache.

I tried this idea in the jinmei-l1cache branch. It's a hash table
keyed with a tuple of query name and type whose entry stores a fully
rendered, wire-format response image (answer section only, assuming
the "minimal-responses" option). It also maintains offsets to each
RR, so it can easily update TTLs when necessary or rotate RRs if
optionally requested. If neither TTL adjustment nor RR rotation is
required, query handling is just a hash table lookup and a copy of
the pre-rendered data. An experimental benchmark showed it ran very
fast: more than 4 times faster than BIND 9, and even much faster than
other implementations that have a full response cache (although, as
usual, the comparison is not entirely fair).
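
A rough C++ sketch of such a cache entry (the layout and names are
mine, not the branch's actual definitions):

    #include <cstdint>
    #include <string>
    #include <unordered_map>
    #include <vector>

    struct RenderedResponse {
        std::vector<uint8_t> wire;       // pre-rendered answer section
        std::vector<size_t> rr_offsets;  // start of each RR in 'wire',
                                         // for TTL patching/RR rotation
        // plus per-RR TTL positions, an expiration time, etc.
    };

    // Keyed by (query name, query type); a combined string key is
    // used here just for brevity.
    using ResponseCache =
        std::unordered_map<std::string, RenderedResponse>;

    // Answering from this cache is conceptually a lookup plus a copy
    // of entry->wire into the outgoing message buffer.
    inline const RenderedResponse* lookup(const ResponseCache& cache,
                                          const std::string& qname,
                                          uint16_t qtype) {
        const auto it = cache.find(qname + "/" + std::to_string(qtype));
        return it == cache.end() ? nullptr : &it->second;
    }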

Also, the cache size is quite small; the run time memory footprint of
this server process was just about 5MB. So, I think it's reasonable
to have each process/thread keep its own copy of this cache to
completely eliminate contention. Also, if we can keep the cache size
this small, it would be easier to dump it to a file on shutdown and
reuse it on restart. This would be quite effective (if the downtime
is reasonably short) because the cached data are expected to be
highly popular.

2. Record cache

For the normal record cache, I don't have a particular idea beyond
the obvious, like a hash table mapping query parameters to the
corresponding RRset (or negative information). But I guess this
cache should be shared by multiple threads. That will help
reconstruct the full response cache data on TTL expiration more
efficiently. And, if shared, the data structure should be chosen so
that contention overhead can be minimized. In general, I guess
something like a hash table is more suitable than a tree-like
structure in that sense (see the sketch below).
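
One common way to keep contention low in a shared hash table is to
shard it and give each shard its own lock, so threads touching
different shards never contend. A minimal sketch (the names, the
plain-string value type, and the shard count are mine):

    #include <array>
    #include <functional>
    #include <mutex>
    #include <string>
    #include <unordered_map>

    class ShardedRecordCache {
    public:
        static constexpr size_t kShards = 16;

        void insert(const std::string& key, const std::string& rrset) {
            Shard& s = shard(key);
            std::lock_guard<std::mutex> lock(s.mutex);
            s.map[key] = rrset;
        }

        bool find(const std::string& key, std::string& rrset_out) {
            Shard& s = shard(key);
            std::lock_guard<std::mutex> lock(s.mutex);
            const auto it = s.map.find(key);
            if (it == s.map.end()) {
                return false;
            }
            rrset_out = it->second;
            return true;
        }

    private:
        struct Shard {
            std::mutex mutex;
            std::unordered_map<std::string, std::string> map;
        };
        Shard& shard(const std::string& key) {
            return shards_[std::hash<std::string>{}(key) % kShards];
        }
        std::array<Shard, kShards> shards_;
    };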

There are other points to discuss for this cache related to other
types of cache (see below).

3. Separate delegation cache

One thing I'm guessing is that it may make sense to have a separate
cache structure for delegation data. It's conceptually a set of NS
RRs, from which we can identify the best (longest) matching one for a
given query name.

Analysis of some sets of query data showed that the vast majority of
end clients' queries are for A and AAAA (not surprisingly). So, even
if we separate this cache from the record cache, the additional
overhead (both for memory and fetch) will probably (hopefully) be
marginal. Separating the caches will also help reduce contention
between threads. It *might* also help improve lookup performance
because this cache can be optimized for longest match search.

4. Remote server cache without involving the record cache

Likewise, it may make sense to maintain the remote server cache
separately from the record cache. I guess the AAAA and A records of
remote servers are rarely queried by end clients, so, as in the case
of the delegation cache, it's possible that the data sets are mostly
disjoint. Also, for this purpose the RRsets don't have to have a
higher trust rank (per RFC 2181 5.4.1): glue or additional data are
okay, and, by separating these from the record cache, we can avoid
accidentally promoting such data to trustworthy answers and returning
them to clients (BIND 9 had this type of bug before).

Custom vs Existing Library (STL etc)
------------------------------------

It may have to be discussed, but I guess in many cases we'll end up
introducing custom implementations, because these caches have to be
highly performance sensitive, are directly related to our core
business, and also have to be memory efficient. But in some
sub-components we may be able to benefit from existing generic
libraries.