Is there a time when new_qp(c|z)node() would not be followed by
assignment of the namespace? No, so let's add the assignment to the
function that creates the node.
Now that we have the code working, rename 'dns_qp_lookup2' back to
'dns_qp_lookup' and adjust all remaining 'dns_qp_lookup' occurrences
to take a space=0 parameter.
For now we only allow DNS_DB_NSEC_* values so it makes sense to change
the type to an enum.
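For illustration, the type could look like this (a sketch: the typedef
name is made up; only the DNS_DB_NSEC_* names come from the existing
defines):

    typedef enum {
            DNS_DB_NSEC_NORMAL,   /* an ordinary node */
            DNS_DB_NSEC_HAS_NSEC, /* a node with an associated NSEC record */
            DNS_DB_NSEC_NSEC,     /* a node in the NSEC namespace */
            DNS_DB_NSEC_NSEC3,    /* a node in the NSEC3 namespace */
    } dns_dbnsec_t;

An enum makes the invalid states unrepresentable and lets the compiler
warn about unhandled cases in switch statements.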
Rename 'denial' to the more intuitive 'space', indicating the namespace
of the key-value pair.
In preparation to merge the three qp tries (tree, nsec, nsec3) into
one, add the namespace information to the qpkey. This is the most
significant piece of information, so prepend the denial type to the
qpkey.
This means we need to pass on the denial type when constructing the
qpkey from a name, or doing a lookup.
Reuse the DNS_DB_NSEC_* values. For most qp tries in the code (nta,
rpz, zt, etc.) we just pass 0, because there is no need for denial of
existence, but for qpzone and qpcache we must pass the right value.
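As a rough sketch of the key construction (names and types here are
illustrative, not the real qp API):

    #include <stdint.h>
    #include <string.h>

    /* Prepending the namespace guarantees that keys from different
     * namespaces can never collide once the three tries are merged. */
    static size_t
    qpkey_make(uint8_t *key, uint8_t space, const uint8_t *namekey,
               size_t namelen) {
            key[0] = space; /* the most significant element comes first */
            memcpy(key + 1, namekey, namelen);
            return namelen + 1;
    }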
Change the code so that node->nsec can no longer have the value
DNS_DB_NSEC_HAS_NSEC; instead, track this in a new attribute 'havensec'.
Since we use node->nsec to convert names to keys, the value MUST be set
before inserting the node into the qp-trie.
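A possible node layout (the field packing here is purely illustrative):

    struct qpnode {
            unsigned int nsec : 2;     /* namespace; immutable after
                                        * insertion, because it is baked
                                        * into the key */
            unsigned int havensec : 1; /* the name has an NSEC record;
                                        * this can change after insertion */
            /* ... */
    };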
Update the fuzzing and unit tests accordingly. This only adds a few
extra test cases; more are needed.
In qp_test.c we can remove the test code for empty keys, as these are
no longer possible.
RRset ordering is now an enum inside the struct rdataset attributes.
This was done to keep the size of the structure at its original value
from before this MR.
I expect zero performance impact but it should be easier to deal with
attributes in debuggers and language servers.
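The shape of the change, as a sketch (the neighbouring flag names are
invented for illustration):

    #include <stdbool.h>

    struct rdataset_attributes {
            bool questionok : 1; /* invented example flags */
            bool rendered : 1;
            unsigned int order : 2; /* the RRset ordering enum, packed
                                     * into two bits so the structure
                                     * size stays unchanged */
    };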
Per pspacek, currently qp and qpcache logs are too verbose and enabled at a
level too low compared to how often the logging is useful.
This commit increases the logging level, while keeping it configurable
via a define.
qp-tries allocate their nodes (twigs) in chunks to reduce allocator
pressure and improve memory locality. The choice of chunk size presents
a tradeoff: larger chunks benefit qp-tries with many values (as seen
in large zones and resolvers) but waste memory in smaller use cases.
Previously, our fixed chunk size of 2^10 twigs meant that even an
empty qp-trie would consume 12KB of memory, while reducing this size
would negatively impact resolver performance.
This commit implements an adaptive chunking strategy that:
- Tracks the size of the most recently allocated chunk.
- Doubles the chunk size for each new allocation until reaching a
predefined maximum.
This approach effectively balances memory efficiency for small tries
while maintaining the performance benefits of larger chunk sizes for
bigger data structures.
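A minimal sketch of the growth rule, under assumed names and bounds:

    /* Assumed bounds; the real minimum and maximum may differ. */
    #define QP_CHUNK_MIN (1U << 5)
    #define QP_CHUNK_MAX (1U << 10) /* the previous fixed size */

    static unsigned int
    next_chunk_size(unsigned int last_size) {
            if (last_size == 0) { /* first chunk of a fresh qp-trie */
                    return QP_CHUNK_MIN;
            }
            return (last_size < QP_CHUNK_MAX) ? last_size * 2
                                              : QP_CHUNK_MAX;
    }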
This commit also splits the callback freeing qpmultis into two
phases, one that frees the underlying qptree, and one that reclaims
the qpmulti memory. In order to prevent races between the qpmulti
destructor and chunk garbage collection jobs, the second phase is
protected by reference counting.
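The two-phase teardown can be sketched like this (names and layout
assumed, not the real qp structures):

    #include <stdatomic.h>
    #include <stdlib.h>

    struct qpmulti {
            atomic_uint references; /* held by the destructor and by
                                     * every pending chunk-GC job */
            /* ... the underlying qptree, chunk bookkeeping ... */
    };

    static void
    qpmulti_unref(struct qpmulti *multi) {
            /* Phase one (freeing the qptree) has already run; phase two
             * reclaims the qpmulti only after the last reference drops,
             * so GC jobs cannot race with the destructor. */
            if (atomic_fetch_sub(&multi->references, 1) == 1) {
                    free(multi);
            }
    }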
The 'qpnode->nsec' structure member isn't protected by a lock and
there's a data race between the reading and writing parts in the
qpcache_addrdataset() function. Use a node read lock for accessing
'qpnode->nsec' in qpcache_addrdataset(). Add an additional
'qpnode->nsec != DNS_DB_NSEC_HAS_NSEC' check under the write lock
to be sure that no competing thread changed it in the window between
releasing the read lock and acquiring the write lock.
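A minimal sketch of the pattern with a plain rwlock (the real code uses
BIND's node lock macros; the types here are stand-ins):

    #include <pthread.h>
    #include <stdbool.h>

    struct node {
            pthread_rwlock_t lock;
            int nsec; /* stand-in for qpnode->nsec */
    };
    #define HAS_NSEC 1 /* stand-in for DNS_DB_NSEC_HAS_NSEC */

    static void
    maybe_set_nsec(struct node *n) {
            pthread_rwlock_rdlock(&n->lock);
            bool missing = (n->nsec != HAS_NSEC);
            pthread_rwlock_unlock(&n->lock);

            if (missing) {
                    pthread_rwlock_wrlock(&n->lock);
                    /* Recheck: a competing writer may have won the race
                     * while no lock was held. */
                    if (n->nsec != HAS_NSEC) {
                            n->nsec = HAS_NSEC;
                    }
                    pthread_rwlock_unlock(&n->lock);
            }
    }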
The *DB_VIRTUAL value was introduced to allow clients (presumably
ns_clients) that have been running for some time to access the cached
data that was valid at the time of their inception. The default value
of 5 minutes is far longer than the longevity of the ns_client object,
as the resolver gives up after 2 minutes.
Reduce the value to 10 seconds to honour the original intent more
closely, while still allowing some leeway for clients that started some
time in the past.
Our measurements show that even setting this value to 0 has no
statistically significant effect, thus the value of 10 seconds should be
on the safe side.
Profiles show that a high amount of CPU time is spent in memset.
By removing the zero-initialization of certain large buffers we improve
performance in certain authoritative workloads.
This is the core implementation of the SIEVE algorithm described in the
following paper:
Zhang, Yazhuo, Juncheng Yang, Yao Yue, Ymir Vigfusson, and K. V.
Rashmi. “SIEVE Is Simpler than LRU: An Efficient Turn-Key Eviction
Algorithm for Web Caches,” NSDI '24, available online from
https://junchengyang.com/publication/nsdi24-SIEVE.pdf
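In outline, SIEVE keeps a FIFO queue with a moving "hand": a cache hit
only sets a visited bit, and eviction sweeps the hand from tail toward
head, clearing visited bits and removing the first unvisited entry. A
toy sketch of that outline (not the qpcache data structures):

    #include <stdbool.h>
    #include <stddef.h>

    struct entry {
            struct entry *prev, *next; /* prev points toward the head */
            bool visited;
    };

    struct sieve {
            struct entry *head, *tail; /* head = newest insertion */
            struct entry *hand; /* eviction cursor, kept across calls */
    };

    static void
    on_hit(struct entry *e) {
            e->visited = true; /* the only work on the hot path */
    }

    static void
    insert(struct sieve *s, struct entry *e) {
            /* new entries always enter at the head, unvisited */
            e->visited = false;
            e->prev = NULL;
            e->next = s->head;
            if (s->head != NULL) {
                    s->head->prev = e;
            } else {
                    s->tail = e;
            }
            s->head = e;
    }

    static struct entry *
    evict(struct sieve *s) { /* caller ensures the queue is not empty */
            struct entry *e = (s->hand != NULL) ? s->hand : s->tail;
            while (e->visited) {
                    e->visited = false; /* second chance */
                    e = (e->prev != NULL) ? e->prev : s->tail; /* wrap */
            }
            s->hand = e->prev; /* NULL means: restart from the tail */
            if (e->prev != NULL) {
                    e->prev->next = e->next;
            } else {
                    s->head = e->next;
            }
            if (e->next != NULL) {
                    e->next->prev = e->prev;
            } else {
                    s->tail = e->prev;
            }
            return e;
    }

Note that hits never reorder the list, which is what makes SIEVE
attractive for a cache under concurrent read load compared to LRU.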
In qpcache, there were two places that tried to upgrade the lock: in
clean_stale_header(), the code would try to upgrade the lock and clean
up the header, and in qpzonode_release(), the tree lock would optionally
be upgraded so that the node could be cleaned up directly if empty.
These optimizations are not needed and they have no effect on the
performance.
dns_zonekey_iszonekey() was the only function defined in the
dns_zonekey module, and was only called from one place. It
makes more sense to group this with the dns_dnssec functions.
The qpcache_findzonecut() function accepts two "foundnames",
'foundname' and 'dcname', of which 'dcname' could be NULL. Originally,
when 'dcname' was NULL, it would be set to 'foundname'. Then code like
this was present:

    result = find_deepest_zonecut(&search, node, nodep, foundname,
                                  rdataset,
                                  sigrdataset DNS__DB_FLARG_PASS);
    dns_name_copy(foundname, dcname);

which basically means that we were copying the .ndata over itself for
no apparent reason. Clean up the dcname vs foundname usage.
Co-authored-by: Evan Hunt <each@isc.org>
Co-authored-by: Ondřej Surý <ondrej@isc.org>
The offsets were meant to speed up repeated dns_name operations, but
it was experimentally proven that there's actually no real-world
benefit. Remove the offsets and labels fields from dns_name, and the
static offsets fields, to save 128 bytes in the fixedname, in favor of
calculating labels and offsets only when needed.
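For illustration, the labels and offsets can be recomputed on demand
from the wire-format name (a sketch, not the actual dns_name
internals):

    #include <stddef.h>

    /* Walk the length-prefixed labels, recording each label's offset
     * if the caller provided a buffer; returns the label count. */
    static unsigned int
    count_labels(const unsigned char *ndata, size_t length,
                 unsigned char *offsets) {
            unsigned int labels = 0;
            size_t pos = 0;
            while (pos < length) {
                    if (offsets != NULL) {
                            offsets[labels] = (unsigned char)pos;
                    }
                    labels++;
                    if (ndata[pos] == 0) {
                            break; /* the root label ends the name */
                    }
                    pos += ndata[pos] + 1; /* length byte + label data */
            }
            return labels;
    }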
Acquire a database reference in detachnode() to prevent the last
reference from being released while the NODE_LOCK is held. The
NODE_LOCK is locked/unlocked inside the RCU critical section, so this
most probably would not pose a problem, as the database uses call_rcu
memory reclamation, but it is still safer to acquire the reference
before releasing the node.
The "raw" version of the header was used for the noqname and the closest
proofs to save around 152 bytes of the dns_slabheader_t while bringing
an additional complexity. Remove the raw version of the dns_slabheader
API at the slight expense of having unused dns_slabheader_t data sitting
in front of the proofs.
when dns_rdataslab_fromrdataset() is run, in addition to
allocating space for a slab header, it also partially
initializes it, setting the type to match rdataset->type and
rdataset->covers, the trust to rdataset->trust, and the TTL to
rdataset->ttl.
- there are now two functions for creating an rdataslab from an
rdataset: dns_rdataslab_fromrdataset() creates a full slab (including
space for a slab header), and dns_rdataslab_raw_fromrdataset() creates
a raw slab.
- there are now two functions for getting rdataslab size:
dns_rdataslab_size() is for full slabs and dns_rdataslab_sizeraw()
for raw slabs. there is no longer a need for a reservelen parameter.
- dns_rdataslab_count() also no longer takes a reservelen parameter.
(currently it's never used for raw slabs, so there is no _countraw()
function.)
- dns_rdataslab_rdatasize() has been removed, because
dns_rdataslab_sizeraw() can do the same thing.
- dns_rdataslab_merge() and dns_rdataslab_subtract() both take
slabheader parameters instead of character buffers, and the
reservelen parameter has been removed.
if both rdataslabs being compared have zero length, return true.
also, since these functions are only ever called on slabheaders
with sizeof(dns_slabheader_t) as the reserve length, we can
simplify the API: remove the reservelen argument, and pass the
slabs as type dns_slabheader_t * instead of unsigned char *.
Database versions are not used in cache databases. Some places in
qpcache.c required the version argument to be NULL; others marked it
as UNUSED. Unify all cases to require version to be NULL.
Add a new related_headers() function that simplifies the code
flow in qpcache_findrdataset(). Also use check_stale_header() function
to remove code duplication.
Add a helper function both_headers() that unifies the slabheader
matching for a simple type: it returns true when both the type and
the matching RRSIG have been found.
The new maybe_update_headers() function unifies the LRU updates to the
slabheaders, which were scattered all over the place. More calls to
update the headers after bindrdatasets() were also added for
completeness.
This removes code duplication between the dual bindrdataset() calls.
It also unifies the handling, as there were small differences between
the calls: one variant was checking for the !NEGATIVE(found) condition
and one wasn't, and it is technically OK to do the check for all
variants.
The check_stale_header() function now updates header_prev directly
so it doesn't have to be handled in the outer loop; it's always
set to the correct value of the previous header in the chain.
some code was left in the cache database implementation after
it was separated from the zone database, and can be cleaned up
and refactored now:
- the DNS_SLABHEADERATTR_IGNORE flag is never set in the cache
- support for loading the cache from a file was removed, but the add()
function still had a 'loading' flag that's always false
- two different macros were used for checking the
DNS_SLABHEADERATTR_NONEXISTENT flag - EXISTS() and NONEXISTENT().
it's clearer to just use EXISTS().
- the cache doesn't support versions, so it isn't necessary to
walk down the 'down' pointer chain when iterating through the
cache or looking for a header to update. 'down' now only points
to records that are deleted from the cache but have not yet been
purged from memory. this allows us to simplify both the iterator
and the add() function.
Reduce the number of qpzone_ref() and qpzone_unref() calls in
qpzone_detachnode() by relying on the call_rcu to delay
the destruction of the lock buckets.
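The shape of the change, sketched with liburcu (the struct here is
illustrative; call_rcu() and caa_container_of() are the real liburcu
API):

    #include <stdlib.h>
    #include <urcu.h>

    struct buckets { /* illustrative stand-in for the lock buckets */
            struct rcu_head rcu_head;
            /* ... */
    };

    static void
    buckets_free_rcu(struct rcu_head *head) {
            free(caa_container_of(head, struct buckets, rcu_head));
    }

    /* Deferring the teardown past the RCU grace period means the
     * detach path doesn't need to pin the qpzone around each use: */
    /* call_rcu(&b->rcu_head, buckets_free_rcu); */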
Instead of having many node_lock_count * sizeof(<member>) arrays, pack
all the members into a qpcache_bucket_t struct that is cacheline
aligned and have a single array of those.
Additionally, make both the head and the tail of isc_queue_t padded,
not just the head, to prevent false sharing of the lock-free structure
with the lock that follows it.
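A sketch of the layout (member names assumed; isc_rwlock_t replaced by
a plain rwlock for self-containment):

    #include <pthread.h>
    #include <stdalign.h>
    #include <stddef.h>

    /* One cacheline-aligned struct per bucket replaces several parallel
     * node_lock_count-sized arrays; with 64-byte alignment, two buckets
     * never share a cacheline. */
    typedef struct qpcache_bucket {
            alignas(64) pthread_rwlock_t lock;
            size_t references;
            /* ... the other per-bucket members ... */
    } qpcache_bucket_t;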
In #1870, the expiration time of ANCIENT records was printed, but
ancient records are actually very short-lived, and the information
carries little value.
Instead of printing the expiration of ANCIENT records, print the
expiration time of STALE records.
The find_coveringnsec() function was getting 'now' from two sources:
search->now and a separate 'now' argument. Things like this are ticking
bombs; remove the extra 'now' argument and use a single source of
'now'.
When the mark_ancient() helper function was introduced, a couple of
places with duplicate (or almost duplicate) code were missed. Move the
mark_ancient() function closer to the top of the file, and use it
correctly in the places that mark the header as ANCIENT.
If we know that the header has ZEROTTL set, the server should never
send stale records for it and the TTL should never be anything other
than 0. The comment was already there, but the code did not match the
comment.
The old name was misleading, as it never meant time-to-live, i.e. the
number of seconds from now when the header should expire. The true
meaning was an expiration time, i.e. now + ttl. This was the original
design bug that caused the slip when we assigned header->ttl to
rdataset->ttl. Because the name matched, nobody questioned the
correctness of the code, either during the MR review or during the
numerous re-reviews when we were searching for the cause of the 54-year
TTL.
When the header has been marked as ANCIENT but the ttl hasn't been
reset (this happens in a couple of places), the rdataset TTL would be
set to the header timestamp instead of a reasonable TTL value.
Since such a header has already expired (ANCIENT is set), set the
rdataset TTL to 0 and don't reuse this field to print the expiration
time when dumping the cache. Instead of printing the time, we now
just print 'expired (awaiting cleanup)'.
the search for the deepest known zone cut in the cache could
improperly reject a node containing stale data, even if the
NS rdataset wasn't the data that was stale.
this change also improves the efficiency of the search by
stopping it when both NS and RRSIG(NS) have been found.
Change the names of the node reference counting functions
and add comments to make the mechanism easier to understand:
- newref() and decref() are now called qpcnode_acquire()/
qpznode_acquire() and qpcnode_release()/qpznode_release()
respectively; this reflects the fact that they modify both
the internal and external reference counters for a node.
- qpcnode_newref() and qpznode_newref() are now called
qpcnode_erefs_increment() and qpznode_erefs_increment(), and
qpcnode_decref() and qpznode_decref() are now called
qpcnode_erefs_decrement() and qpznode_erefs_decrement(),
to reflect that they only increase and decrease the node's
external reference counters, not internal.
This removes the db_nodelock_t structure and changes the node_locks
array to be composed only of isc_rwlock_t pointers. The .reference
member has been moved to qpdb->references, in addition to
common.references, which is external to dns_db API users. The .exiting
member has been completely removed, as it has no use when the reference
counting is used correctly.
The origin_node in qpcache was always NULL, so we can remove the
getoriginnode() function and the origin_node pointer, as
dns_db_getoriginnode() correctly returns ISC_R_NOTFOUND when the
function is not implemented.
Clean up the pattern in the decref() functions in both qpcache.c and
qpzone.c, so that it follows a pattern similar to what we already have
in the newref() function.
Previously, this function always acquired a node write lock if it
might need node cleanup in case the reference count dropped to 0. In
fact, the lock is unnecessary if the reference count is larger than 1,
and this can be optimized as an "easy" case. This optimization could
even be "necessary": in some extreme cases, many worker threads could
repeatedly acquire and release the reference on the same node,
resulting in severe lock contention for nothing (as the ref wouldn't
drop to 0 in most cases). This change prevents a noticeable performance
drop, such as query timeouts, in such cases.
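The "easy" case can be sketched with C11 atomics (illustrative; the
real code wraps its reference counters differently):

    #include <stdatomic.h>
    #include <stdbool.h>

    /* Returns true if the reference was released without a lock.
     * Returns false when this thread may drop the last reference;
     * the caller must then take the node write lock and recheck. */
    static bool
    decref_fast(atomic_uint *refs) {
            unsigned int cur = atomic_load(refs);
            while (cur > 1) {
                    /* CAS so we never go below 1 without the lock */
                    if (atomic_compare_exchange_weak(refs, &cur,
                                                     cur - 1)) {
                            return true;
                    }
            }
            return false;
    }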
Co-authored-by: JINMEI Tatuya <jtatuya@infoblox.com>
Co-authored-by: Ondřej Surý <ondrej@isc.org>