The serve-stale feature was added in the BIND 9.11.4-S subscriber edition in 2018, and then included in the open source edition as of BIND 9.12.0.
There were three options provided with BIND 9’s initial implementation of serve-stale:
stale-answer-enable
: If yes, enable the returning of “stale” cached answers when the name servers for a zone are not answering. The default is not to return stale answers. Answering from stale cache can also be enabled and disabled dynamically by the BIND server administrator via rndc serve-stale on|off.

max-stale-ttl
: If stale cache is enabled, max-stale-ttl sets the maximum time for which the server retains records past their normal expiry so that it can return them as stale records when the servers for those records are not reachable.

stale-answer-ttl
: This specifies the TTL to be returned on stale answers. The default was originally set at 30 seconds. As of the 9.16.6 and 9.11.22-S1 update, it defaults to one second.
ISC has received some complaints that the serve-stale implementation is not efficient in production.
1. Every client that asks for a record that is stale but still available in cache waits for a lengthy timeout, because BIND re-queries the authority before sending the stale answer. That was part of the original design, which was meant as a last resort during a lengthy outage, but it makes responses slow. The timeout BIND uses is based on an option called resolver-query-timeout. The default value of this timeout is 10 seconds, but it can be configured from 301 msec to 30 seconds.
2. Another observation was that we had not provided operators with an option to disable the stale cache.
3. Records are kept in cache for too long after they expire.
9.16.6 and 9.11.22-S1 improvements
We added the following parameter:
stale-cache-enable
: If yes (the default), enables the retention of expired cache records so that they are available to be returned from cache if stale-answer-enable is either set to yes or switched on later using rndc serve-stale on.
This deals with complaint number 2.
At the same time, we made another tweak. As of 9.16.6 and 9.11.22-S1, answers that are received with TTL=0 are ineligible for serve-stale, and we updated the default values for stale-answer-ttl (from 30 seconds to one second) and max-stale-ttl (from one week to 12 hours).
This deals with complaint number 3.
9.16.9 improvements
In 9.16.9 we introduced a new option to deal with complaint number 1.
stale-refresh-time
: If set, BIND replies with the stale answer in cache immediately if an attempt to refresh the RRset has previously failed, and continues to provide the stale answer for the amount of time specified by stale-refresh-time. The default is 30 seconds, as RFC 8767 recommends.
This enhancement speeds up responses for nearly all of the users that are in need of a stale answer: the very first user that queries for a record that has just become unavailable from the authority will still have to wait for the query timeout, but all the subsequent users will get the stale answer from cache.
9.16.12 improvements
After introducing stale-refresh-time we realized that we could have chosen better default values than those introduced in 9.16.6 and 9.11.22-S1. Instead of coming up with our own values, we should have used the values recommended by RFC 8767, so this release revises these defaults: it resets stale-answer-ttl to 30 seconds and changes max-stale-ttl from 12 hours to one day.
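The default changes described so far can be collected in a small lookup table. The Python sketch below only restates the values given in this article; the dictionary layout, the release labels used as keys, and the helper name `default_for` are illustrative assumptions and not part of BIND.

```python
# Defaults as described in this article, expressed in seconds.
# The structure and names here are illustrative, not BIND APIs.
HOUR = 3600
DAY = 24 * HOUR
WEEK = 7 * DAY

DEFAULTS = {
    # Initial serve-stale implementation.
    "initial": {"stale-answer-ttl": 30, "max-stale-ttl": WEEK},
    # 9.16.6 and 9.11.22-S1.
    "9.16.6": {"stale-answer-ttl": 1, "max-stale-ttl": 12 * HOUR},
    # 9.16.12, aligned with the RFC 8767 recommendations.
    "9.16.12": {"stale-answer-ttl": 30, "max-stale-ttl": DAY},
}


def default_for(release: str, option: str) -> int:
    """Return the default value (in seconds) of `option` as of `release`."""
    return DEFAULTS[release][option]
```

For example, `default_for("9.16.12", "max-stale-ttl")` yields one day expressed in seconds.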
To further improve the serving of stale data, we added another option: stale-answer-client-timeout, which is the maximum amount of time a recursive resolver should allow between the receipt of a resolution request and the sending of its response (only applicable if stale-answer-enable is set).
All improvements made in 9.16.9 and later will be backported to 9.11-S1.
A client walks into a bar and asks the bartender: “Do you serve stale data?”
cache-enable | answer-enable | refresh-time | answer-client-timeout | bartender |
---|---|---|---|---|
no | no | 0 (disabled) | off (disabled) | no |
no | no | 0 (disabled) | 0 | no |
no | no | 0 (disabled) | value | no |
no | no | value | off (disabled) | no |
no | no | value | 0 | no |
no | no | value | value | no |
no | yes | 0 (disabled) | off (disabled) | no |
no | yes | 0 (disabled) | 0 | no |
no | yes | 0 (disabled) | value | no |
no | yes | value | off (disabled) | no |
no | yes | value | 0 | no |
no | yes | value | value | no |
yes | no | 0 (disabled) | off (disabled) | no |
yes | no | 0 (disabled) | 0 | no |
yes | no | 0 (disabled) | value | no |
yes | no | value | off (disabled) | no |
yes | no | value | 0 | no |
yes | no | value | value | no |
yes | yes | 0 (disabled) | off (disabled) | maybe |
yes | yes | 0 (disabled) | 0 | yes |
yes | yes | 0 (disabled) | value | maybe |
yes | yes | value | off (disabled) | most of the time |
yes | yes | value | 0 | yes |
yes | yes | value | value | most of the time |
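The last column of the table follows a simple pattern, which can be written down as a short function. This is a minimal Python sketch that reproduces the table only; the function name `bartender` and the choice of `None` to represent "off (disabled)" are assumptions made for illustration.

```python
from typing import Optional


def bartender(cache_enable: bool, answer_enable: bool,
              refresh_time: int, client_timeout: Optional[int]) -> str:
    """Reproduce the 'bartender' column of the table.

    refresh_time: 0 means disabled, any positive number is a "value".
    client_timeout: None means off (disabled), 0 or a positive number otherwise.
    """
    # Without both the stale cache and stale answers, nothing stale is served.
    if not (cache_enable and answer_enable):
        return "no"
    # A client timeout of 0 prioritizes stale answers.
    if client_timeout == 0:
        return "yes"
    # A refresh window serves stale data to all but the first client.
    if refresh_time > 0:
        return "most of the time"
    return "maybe"
```

For example, `bartender(True, True, 30, None)` corresponds to the row "yes | yes | value | off (disabled)".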
With all these options, we now allow operators to be RFC 8767 compliant. But it is also less predictable when stale data is returned and when an attempt to resolve the query is made.
If stale-cache-enable is set to no, stale data is not kept in cache and it will not be used in DNS responses. It does not matter what the other options are set to; they are ignored in this case.
If stale-cache-enable is set to yes, stale data may be served to the client if stale-answer-enable is set, or if serving stale answers is enabled via rndc serve-stale on. Otherwise, the stale cache entries will not be used in DNS responses, and it does not matter what the other options are set to; they are ignored.
So stale RRsets can only be returned to the client if stale-cache-enable is set to yes and stale-answer-enable is set to yes (or enabled with rndc serve-stale on). But when will a stale answer be returned, and when will BIND attempt to resolve the query?
There are two options that influence the query path: stale-refresh-time and stale-answer-client-timeout.
The initial implementation
If stale-refresh-time is set to 0 (disabled) and stale-answer-client-timeout is set to off or disabled, the behaviour is the same as in the initial implementation: if there is no active data in cache (but there may be stale data), BIND will first attempt to resolve the query. Only after resolver-query-timeout expires does it fall back to stale data in cache.
The stale refresh window
If stale-refresh-time is set to a positive value, and stale-answer-client-timeout is disabled, the behaviour improves for most clients. Suppose there is stale data in cache, and it is being requested. The first client that queries for a specific RRset will still face the lengthy timeout, but subsequent clients that query for the same RRset will immediately be served the stale data. In this time window, no attempts to refresh the RRset are made (the refresh has failed before, so BIND waits some time before trying to resolve again).
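The refresh window can be pictured as a per-RRset timer that starts when a refresh attempt fails. The following Python sketch models that idea only; the class name `RefreshWindow`, its methods, and the treatment of the exact boundary (the window is considered closed once the full stale-refresh-time has elapsed) are assumptions for illustration and do not describe BIND's internal data structures.

```python
from typing import Dict


class RefreshWindow:
    """Toy model of the stale-refresh-time window, tracked per RRset."""

    def __init__(self, stale_refresh_time: float) -> None:
        # 0 means the window is disabled, as in the article.
        self.stale_refresh_time = stale_refresh_time
        self._failed_at: Dict[str, float] = {}

    def record_failure(self, rrset: str, now: float) -> None:
        """Remember when the last refresh attempt for this RRset failed."""
        self._failed_at[rrset] = now

    def is_active(self, rrset: str, now: float) -> bool:
        """True while stale data is served without a new refresh attempt."""
        if self.stale_refresh_time == 0:
            return False
        failed_at = self._failed_at.get(rrset)
        if failed_at is None:
            return False
        return now - failed_at < self.stale_refresh_time
```

With a 30-second window and a failure recorded at time 100, `is_active` is true at time 110 and false again at time 130, at which point the next client triggers a new refresh attempt.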
Bartender, how much longer do I have to wait before being served?
If stale-refresh-time is set to 0 (disabled), BIND may still serve stale data within a reasonable time if stale-answer-client-timeout is enabled.
If stale-answer-client-timeout is set to a positive value, and there is stale data in cache, BIND will first try to resolve the query, but if that takes longer than stale-answer-client-timeout, a database lookup for stale data is executed. If that results in a stale positive answer, it is given to the client, while in the background BIND continues to resolve the query, with the goal of updating the cache entry (and if that resolver query times out, BIND falls back to stale data in cache, regardless of whether it is a positive or negative entry). If no stale positive answer is available, the database lookup is dropped and BIND waits until the resolver is finished.
If stale-answer-client-timeout is set to 0, we prioritize stale positive answers: if such a stale entry exists in the cache, we immediately return it to the client and start an attempt to refresh the RRset. If the cache does not have a stale positive answer, a regular lookup is started (again, this may succeed, responding to the client with an authoritative answer, or it may fail, in which case BIND will fall back to stale data in cache, regardless of whether it is a positive or negative entry).
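To make the timing concrete, the sketch below estimates how long a client waits and what kind of answer it gets when stale data is in cache and stale-refresh-time is disabled. It is a simplified Python model of the description above: the function name `client_outcome`, its parameters, the use of milliseconds throughout, and the 10-second resolver timeout default are assumptions for illustration.

```python
from typing import Optional, Tuple


def client_outcome(resolve_ms: Optional[float], client_timeout_ms: float,
                   stale_positive: bool,
                   resolver_timeout_ms: float = 10_000) -> Tuple[float, str]:
    """Return (wait in ms, kind of answer) for one query.

    resolve_ms: time the authority needs to answer, None if it never answers.
    stale_positive: True if the stale cache entry is a positive answer.
    Assumes a stale entry (positive or negative) exists in cache.
    """
    answered = resolve_ms is not None and resolve_ms <= resolver_timeout_ms
    # The authority answered before the client timeout fired.
    if answered and resolve_ms <= client_timeout_ms:
        return resolve_ms, "fresh"
    # Client timeout fired: a stale positive answer is handed out,
    # while the refresh continues in the background.
    if stale_positive:
        return client_timeout_ms, "stale"
    # No stale positive answer: wait until the resolver is finished.
    if answered:
        return resolve_ms, "fresh"
    # Resolver timed out: fall back to whatever stale entry exists.
    return resolver_timeout_ms, "stale"
```

With a client timeout of 1800 and an unreachable authority, `client_outcome(None, 1800, True)` gives a stale answer after 1800 ms instead of after the full resolver timeout.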
The stale refresh window in combination with client timeout
Now what happens if both stale-refresh-time and stale-answer-client-timeout are used?
First let us look at the case where stale-answer-client-timeout is 0.
A request comes in and BIND checks whether there is a stale cache entry. If one is found, then:
- If stale-refresh-time is active for that RRset, then we have two situations:
  - The stale cache entry is a positive answer: the stale answer is used in the response to the client, and no attempt to refresh the entry is made, because the time window explicitly tells us not to do so.
  - The stale cache entry is a negative answer: this query will now result in a server failure response.
- If the stale-refresh-time window was inactive for that RRset, then again we have two situations:
  - The stale cache entry is a positive answer: the stale answer is used in the response to the client, but BIND starts an attempt to refresh the RRset.
  - The stale cache entry is a negative answer: BIND will start resolving the query.
What if the stale-answer-client-timeout is set to a positive value? Then if a stale cache entry exists, and if the stale-refresh-time window was active for this RRset, the entry is immediately returned, without an attempt to resolve the query.
If the stale-refresh-time window was inactive, a normal lookup follows. The stale cache entry is not immediately used. Either the query is resolved swiftly, and the answer from the authority is used in the response to the client, or resolving the query takes a long time and the stale-answer-client-timeout occurs.
If the stale-answer-client-timeout occurs, a database lookup for stale data is executed. BIND now acts the same as when stale-refresh-time is disabled: if that results in a stale positive answer, it is given to the client, while in the background BIND continues to resolve the query, with the goal of updating the cache entry (and if that resolver query times out, BIND falls back to stale data in cache, regardless of whether it is a positive or negative entry). If no stale positive answer is available, the database lookup is dropped and BIND waits until the resolver is finished.
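The combinations discussed in this article can be summarized as one decision function. The Python sketch below is an interpretation of the prose, not BIND source code: the names `Action` and `query_path`, the use of `None` for a disabled client timeout, and the reduction of each case to a single label are all assumptions. It presumes that stale-cache-enable and stale-answer-enable are both yes and that a stale entry for the RRset exists in cache.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    SERVE_STALE = "serve stale immediately, no refresh attempt"
    SERVE_STALE_AND_REFRESH = "serve stale immediately, refresh in background"
    RESOLVE_WITH_CLIENT_TIMEOUT = "resolve; serve stale if client timeout fires"
    RESOLVE_FIRST = "resolve; fall back to stale after resolver-query-timeout"
    SERVFAIL = "server failure"


def query_path(refresh_time: int, client_timeout: Optional[int],
               window_active: bool, stale_positive: bool) -> Action:
    """Pick the query path for a request that hits a stale cache entry."""
    # The refresh window only exists if stale-refresh-time is non-zero.
    in_window = refresh_time > 0 and window_active

    if client_timeout is None:  # stale-answer-client-timeout disabled
        return Action.SERVE_STALE if in_window else Action.RESOLVE_FIRST

    if client_timeout == 0:  # prioritize stale positive answers
        if in_window:
            return Action.SERVE_STALE if stale_positive else Action.SERVFAIL
        if stale_positive:
            return Action.SERVE_STALE_AND_REFRESH
        return Action.RESOLVE_FIRST

    # Positive client timeout.
    if in_window:
        return Action.SERVE_STALE
    if stale_positive:
        return Action.RESOLVE_WITH_CLIENT_TIMEOUT
    return Action.RESOLVE_FIRST
```

For example, `query_path(30, 0, True, False)` returns `Action.SERVFAIL`, matching the case of a stale negative entry inside an active refresh window.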