Serving stale data



Table of Contents

9.16.6 and 9.11.22-S1 improvements
9.16.9 improvements
9.16.12 improvements
A client walks into a bar, and asks the bartender do you serve stale data?

The initial implementation
The stale refresh window
Bartender, how much longer do I have to wait before being served?
The stale refresh window in combination with client timeout


The serve-stale feature was added in the BIND 9.11.4-S subscriber edition in 2018, and was then included in the open source release as of BIND 9.12.0.

There were three options provided with BIND 9’s initial implementation of serve-stale:

- stale-answer-enable: If yes, enable the returning of “stale” cached answers when the name servers for a zone are not answering. The default is not to return stale answers (answering from stale cache can also be enabled and disabled dynamically by the BIND server administrator via rndc serve-stale on|off).
- max-stale-ttl: If stale cache is enabled, max-stale-ttl sets the maximum time for which the server retains records past their normal expiry to return them as stale records, when the servers for those records are not reachable.
- stale-answer-ttl: This specifies the TTL to be returned on stale answers. The default was originally set at 30 seconds. As of the 9.16.6 and 9.11.22-S1 update, it defaults to one second.

ISC has received some complaints that the serve-stale implementation is not efficient in production.

1. Every client that asks for a record that is stale but still available in cache waits for a lengthy timeout, because BIND re-queries the authoritative servers before sending the stale answer. This was part of the original design, which treated stale answers as a last resort during a lengthy outage, but it makes responses slow. The timeout BIND uses is set by an option called resolver-query-timeout. Its default value is 10 seconds, and it can be configured from 301 milliseconds to 30 seconds.
2. We had not provided operators with an option to disable the stale cache.
3. Records are kept in cache for too long after they expire.

9.16.6 and 9.11.22-S1 improvements

We added the following parameter:

stale-cache-enable: If yes (the default), enables the retention of expired cache records so that they are available to be returned from cache if either stale-answer-enable is set to yes, or is switched on later using rndc serve-stale on.

This deals with complaint number 2.

At the same time, we made another tweak. As of 9.16.6 and 9.11.22-S1, answers that are received with TTL=0 are ineligible for serve-stale, and we updated the default values for stale-answer-ttl (from 30 seconds to one second) and max-stale-ttl (from one week to 12 hours).

This deals with complaint number 3.

9.16.9 improvements

In 9.16.9 we introduced a new option to deal with complaint number 1.

stale-refresh-time: If set, BIND replies with the stale answer in cache immediately if an attempt to refresh the RRset has previously failed, and continues to provide the stale answer for an amount of time specified by stale-refresh-time. The default is 30 seconds, as RFC 8767 recommends.

This enhancement speeds up responses for nearly all of the users that are in need of a stale answer: the very first user that queries for a record that has just become unavailable from the authority will still have to wait for the query timeout, but all the subsequent users will get the stale answer from cache.

9.16.12 improvements

After introducing stale-refresh-time we realized that we could have used better default values than those introduced in 9.16.6 and 9.11.22-S1. Instead of coming up with our own values, we should have used the values recommended by RFC 8767. This release therefore revises these defaults: it resets stale-answer-ttl to 30 seconds and changes max-stale-ttl from 12 hours to one day.
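The default changes described so far can be collected in one place. The following is only a summary sketch: the values are taken from the text above, while the `DEFAULTS` table, the `"initial"` key (standing for releases before 9.16.6 and 9.11.22-S1), and the `default_for` helper are illustrative names, not part of BIND.

```python
from datetime import timedelta

# Defaults as described in the text, keyed by the release that set them.
# "initial" is a hypothetical label for releases before 9.16.6 / 9.11.22-S1.
DEFAULTS = {
    "initial": {
        "stale-answer-ttl": timedelta(seconds=30),
        "max-stale-ttl": timedelta(weeks=1),
    },
    "9.16.6": {
        "stale-answer-ttl": timedelta(seconds=1),
        "max-stale-ttl": timedelta(hours=12),
    },
    "9.16.12": {
        "stale-answer-ttl": timedelta(seconds=30),
        "max-stale-ttl": timedelta(days=1),
    },
}


def default_for(release: str, option: str) -> timedelta:
    """Look up the default of one option as of the given release."""
    return DEFAULTS[release][option]
```

For example, `default_for("9.16.12", "max-stale-ttl")` returns `timedelta(days=1)`.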

To further improve the serving of stale data, we added another option:

stale-answer-client-timeout: The maximum amount of time a recursive resolver should allow between the receipt of a resolution request and the sending of its response (only applicable if stale-answer-enable is set).

All improvements made in 9.16.9 and later will be backported to 9.11-S1.

A client walks into a bar, and asks the bartender do you serve stale data?

stale-cache-enable	stale-answer-enable	stale-refresh-time	stale-answer-client-timeout	bartender (stale data served?)
no	no	0 (disabled)	off (disabled)	no
no	no	0 (disabled)	0	no
no	no	0 (disabled)	value	no
no	no	value	off (disabled)	no
no	no	value	0	no
no	no	value	value	no
no	yes	0 (disabled)	off (disabled)	no
no	yes	0 (disabled)	0	no
no	yes	0 (disabled)	value	no
no	yes	value	off (disabled)	no
no	yes	value	0	no
no	yes	value	value	no
yes	no	0 (disabled)	off (disabled)	no
yes	no	0 (disabled)	0	no
yes	no	0 (disabled)	value	no
yes	no	value	off (disabled)	no
yes	no	value	0	no
yes	no	value	value	no
yes	yes	0 (disabled)	off (disabled)	maybe
yes	yes	0 (disabled)	0	yes
yes	yes	0 (disabled)	value	maybe
yes	yes	value	off (disabled)	most of the time
yes	yes	value	0	yes
yes	yes	value	value	most of the time
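The table above can be condensed into a few rules. The sketch below reproduces the last column of the table; the function name `bartender` and the parameter encoding (`refresh_time` of 0 meaning disabled, `client_timeout` of `None` meaning off) are assumptions made for illustration and do not correspond to anything in the BIND source.

```python
from typing import Optional


def bartender(cache_enable: bool, answer_enable: bool,
              refresh_time: int, client_timeout: Optional[int]) -> str:
    """Return the table's verdict on whether stale data is served.

    refresh_time: stale-refresh-time, 0 means disabled.
    client_timeout: stale-answer-client-timeout, None means off.
    """
    # Both switches must be on before any stale data can be served.
    if not (cache_enable and answer_enable):
        return "no"
    # A client timeout of 0 always prefers the stale answer.
    if client_timeout == 0:
        return "yes"
    # A refresh window makes stale answers the common case.
    if refresh_time > 0:
        return "most of the time"
    return "maybe"
```

For example, `bartender(True, True, 30, None)` returns `"most of the time"`, matching the corresponding row of the table.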

With all these options, we now allow operators to be RFC 8767 compliant. The trade-off is that it becomes less predictable when stale data is returned and when an attempt to resolve the query is made.

If stale-cache-enable is set to no, stale data is not kept in cache and it will not be used in DNS responses. The other options are ignored in this case, whatever they are set to.

If stale-cache-enable is set to yes, stale data may be served to the client if stale-answer-enable is set, or if serving stale answers is enabled via rndc serve-stale on. Otherwise, the stale cache entries will not be used in DNS responses, and the other options are ignored, whatever they are set to.

So stale RRsets can only be returned to the client if stale-cache-enable is set to yes and if stale-answer-enable is set to yes (or enabled with rndc serve-stale on). But when will a stale answer be returned, and when will BIND attempt to resolve the query?

There are two options that influence the query path: stale-refresh-time and stale-answer-client-timeout.

The initial implementation

If stale-refresh-time is set to 0 (disabled) and stale-answer-client-timeout is set to off (or disabled), the behaviour is the same as the initial implementation: if there is no active data in cache (but there may be stale data), BIND will first attempt to resolve the query. Only after resolver-query-timeout expires does it fall back to stale data in cache.

The stale refresh window

If stale-refresh-time is set to a positive value, and stale-answer-client-timeout is disabled, the behaviour improves for most clients. Suppose there is stale data in cache, and it is being requested. The first client that queries for a specific RRset will still face the lengthy timeout, but subsequent clients that query for the same RRset will immediately be served the stale data. In this time window, no attempts to refresh the RRset are made (the refresh has failed before, so BIND waits some time before trying to resolve again).
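The effect of the window on client wait times can be illustrated with a toy model. This is a sketch under simplifying assumptions, not BIND's implementation: it tracks a single RRset whose authoritative servers never answer, it assumes the window starts at the moment the refresh attempt fails, and it assumes a query arriving while an attempt is still in flight waits for that attempt to finish. The class name and its attributes are hypothetical.

```python
from typing import Optional


class StaleRefreshWindow:
    """Toy model of one stale RRset whose authority is unreachable."""

    def __init__(self, stale_refresh_time: float,
                 resolver_query_timeout: float = 10.0):
        self.stale_refresh_time = stale_refresh_time
        self.resolver_query_timeout = resolver_query_timeout
        # Time at which the most recent refresh attempt failed (or will fail).
        self.last_failure: Optional[float] = None

    def query(self, now: float) -> float:
        """Return how many seconds the client waits for the stale answer."""
        if self.last_failure is not None:
            if now < self.last_failure:
                # Simplification: an attempt is in flight, wait for it to end.
                return self.last_failure - now
            if now - self.last_failure < self.stale_refresh_time:
                # Inside the window: stale answer at once, no refresh attempt.
                return 0.0
        # Outside the window: try to resolve, time out, then serve stale.
        self.last_failure = now + self.resolver_query_timeout
        return self.resolver_query_timeout
```

With `w = StaleRefreshWindow(30.0)`, the first call `w.query(0.0)` returns `10.0` (the full timeout), `w.query(12.0)` returns `0.0` because the window opened at second 10, and `w.query(41.0)` returns `10.0` again because the window has closed and a new refresh attempt is made.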

Bartender, how much longer do I have to wait before being served?

If stale-refresh-time is set to 0 (disabled), BIND may still serve stale data within a reasonable time if stale-answer-client-timeout is enabled.

If stale-answer-client-timeout is set to a positive value, and there is stale data in cache, BIND will first try to resolve the query, but if that takes longer than stale-answer-client-timeout, a database lookup for stale data is executed. If that results in a stale positive answer, it is given to the client, while in the background BIND continues to resolve the query, with the goal of updating the cache entry (and if that resolver query times out, BIND falls back to stale data in cache, regardless of whether it is a positive or negative entry). If no stale positive answer is available, the database lookup is dropped and BIND waits until the resolver is finished.

If stale-answer-client-timeout is set to 0, we prioritize stale positive answers: if such a stale entry exists in the cache, we immediately return it to the client and start an attempt to refresh the RRset. If the cache does not have a stale positive answer, a regular lookup is started (again, this may succeed, responding to the client with an authoritative answer, or it may fail, in which case BIND falls back to stale data in cache, regardless of whether it is a positive or negative entry).
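The two cases above, together with the initial implementation, can be sketched as a single function that says where the client's first answer comes from and how long the client waits. This is an illustrative model only: it assumes stale-refresh-time is disabled and that some stale entry exists in cache, and the function name, parameters, and return labels are hypothetical. The 1800 ms used in the examples is an arbitrary value.

```python
from typing import Optional, Tuple


def first_response(stale_positive: bool,
                   resolve_ms: Optional[int],
                   client_timeout_ms: Optional[int],
                   resolver_query_timeout_ms: int = 10_000) -> Tuple[str, int]:
    """Return (source of the first answer, milliseconds the client waited).

    stale_positive: whether the stale cache entry is a positive answer.
    resolve_ms: time the authority needs to answer, None if it never does.
    client_timeout_ms: stale-answer-client-timeout, None means off.
    """
    resolved = resolve_ms is not None and resolve_ms <= resolver_query_timeout_ms
    if stale_positive and client_timeout_ms is not None:
        if client_timeout_ms == 0:
            # Stale answer at once; the refresh runs in the background.
            return ("stale", 0)
        answered_in_time = resolved and resolve_ms <= client_timeout_ms
        if not answered_in_time and client_timeout_ms < resolver_query_timeout_ms:
            # Client timeout fired first; resolution continues in background.
            return ("stale", client_timeout_ms)
    if resolved:
        return ("fresh", resolve_ms)
    # Resolution failed: fall back to the stale entry, positive or negative.
    return ("fallback", resolver_query_timeout_ms)
```

For example, `first_response(True, 5000, 1800)` returns `("stale", 1800)`, while `first_response(True, None, None)` models the initial implementation and returns `("fallback", 10000)`.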

The stale refresh window in combination with client timeout

Now what happens if both stale-refresh-time and stale-answer-client-timeout are used?

First, let us look at the case where stale-answer-client-timeout is 0. When a request comes in, BIND checks whether there is a stale cache entry. If one is found, then:

- If the stale-refresh-time window is active for that RRset, there are two situations:
  - The stale cache entry is a positive answer: the stale answer is used in the response to the client, and no attempt to refresh the entry is made, because the time window explicitly tells us not to do so.
  - The stale cache entry is a negative answer: the query results in a server failure response.
- If the stale-refresh-time window is inactive for that RRset, there are again two situations:
  - The stale cache entry is a positive answer: the stale answer is used in the response to the client, but BIND starts an attempt to refresh the RRset.
  - The stale cache entry is a negative answer: BIND starts resolving the query.

What if stale-answer-client-timeout is set to a positive value? Then, if a stale cache entry exists and the stale-refresh-time window is active for this RRset, the entry is immediately returned, without an attempt to resolve the query.

If the stale-refresh-time window is inactive, a normal lookup follows. The stale cache entry is not immediately used. Either the query is resolved swiftly, and the answer from the authority is used in the response to the client, or resolving the query takes a long time and the stale-answer-client-timeout occurs.

If the stale-answer-client-timeout occurs, a database lookup for stale data is executed. BIND now acts the same as when stale-refresh-time is disabled: if the lookup results in a stale positive answer, it is given to the client, while in the background BIND continues to resolve the query, with the goal of updating the cache entry (and if that resolver query times out, BIND falls back to stale data in cache, regardless of whether it is a positive or negative entry). If no stale positive answer is available, the database lookup is dropped and BIND waits until the resolver is finished.
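The combined behaviour at the moment a stale cache entry is found can be summarised as a small decision function. This is a sketch of the rules in this section only, assuming a stale entry exists and both options are in use; the function name, parameters, and return labels are hypothetical and do not mirror BIND's internals.

```python
from typing import Optional, Tuple


def on_stale_hit(window_active: bool, stale_is_positive: bool,
                 client_timeout_ms: int) -> Tuple[Optional[str], bool]:
    """Return (immediate response or None, whether resolution is attempted).

    window_active: whether the stale-refresh-time window is active.
    client_timeout_ms: stale-answer-client-timeout, 0 or a positive value.
    A response of None means the client waits for the lookup (or for the
    client timeout) before anything is sent.
    """
    if client_timeout_ms == 0:
        if window_active:
            # Window active: never refresh; negative entries give SERVFAIL.
            return ("stale" if stale_is_positive else "SERVFAIL", False)
        if stale_is_positive:
            # Answer now and refresh the RRset in the background.
            return ("stale", True)
        # Negative stale entry: resolve the query first.
        return (None, True)
    if window_active:
        # Positive timeout, window active: return the entry, do not resolve.
        return ("stale", False)
    # Positive timeout, window inactive: normal lookup comes first.
    return (None, True)
```

For example, `on_stale_hit(True, False, 0)` returns `("SERVFAIL", False)`, and `on_stale_hit(False, True, 1800)` returns `(None, True)` because the stale entry is only consulted once the client timeout occurs.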