mirror of https://gitlab.isc.org/isc-projects/bind9 synced 2025-08-21 17:48:07 +00:00
Serving stale data
Diego dos Santos Fronza edited this page 2021-01-21 18:07:51 +00:00

The serve-stale feature was added in the BIND 9.11.4-S subscriber edition in 2018, and then included in the open-source release as of BIND 9.12.0.

There were three options provided with BIND 9's initial implementation of serve-stale:

  • stale-answer-enable: If yes, enable the returning of “stale” cached answers when the name servers for a zone are not answering. The default is not to return stale answers (answering from stale cache can also be enabled and disabled dynamically by the BIND server administrator via rndc serve-stale on|off).
  • max-stale-ttl: If stale cache is enabled, max-stale-ttl sets the maximum time for which the server retains records past their normal expiry to return them as stale records, when the servers for those records are not reachable.
  • stale-answer-ttl: This specifies the TTL to be returned on stale answers. The default was originally set at 30 seconds. As of the 9.16.6 and 9.11.22-S1 update, it defaults to one second.
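
As a rough illustration of how max-stale-ttl and stale-answer-ttl interact, the following Python sketch models a single cache entry. It is a toy model based only on the descriptions above, not BIND's code: the names `CacheEntry` and `lookup` are hypothetical, and it only decides whether an entry is fresh, usable as stale, or gone. Whether a stale entry is actually served also depends on stale-answer-enable and on the authoritative servers being unreachable.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class CacheEntry:
    data: str
    expires_at: float  # time at which the record's normal TTL runs out


def lookup(entry: CacheEntry, now: float, max_stale_ttl: float,
           stale_answer_ttl: float) -> Optional[Tuple[str, float, bool]]:
    """Return (data, ttl_to_send, is_stale), or None if nothing is usable."""
    if now < entry.expires_at:
        # Still fresh: answer with the remaining TTL.
        return entry.data, entry.expires_at - now, False
    if now < entry.expires_at + max_stale_ttl:
        # Expired but retained: usable as a stale answer with a short TTL.
        return entry.data, stale_answer_ttl, True
    # Past the retention period: the record is no longer kept.
    return None
```

With an entry that expires at time 100, a lookup at time 40 returns the remaining TTL of 60, while a lookup at time 200 returns the entry marked as stale with the configured stale-answer-ttl.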

ISC has received some complaints that the serve-stale implementation is not efficient in production.

  1. Every client that asks for a record that is stale but still available in cache waits for a lengthy timeout, as BIND re-queries the authority before sending the stale answer. That was part of the original design, which was to serve as a last resort in case of a lengthy outage, but it provides a slower response. The timeout BIND uses is based on an option called resolver-query-timeout. The default value of this timeout is 10 seconds, but it can be configured from 301 msec to 30 seconds.
  2. Another observation was that we had not provided operators with an option to disable the stale cache.
  3. Records are kept in cache for too long after they expire.

9.16.6 and 9.11.22-S1 improvements

We added the following parameter:

  • stale-cache-enable: If yes (the default), enables the retention of expired cache records so that they are available to be returned from cache if either stale-answer-enable is set to yes, or is switched on later using rndc serve-stale on.

This deals with complaint number 2.

At the same time, we made another tweak. As of 9.16.6 and 9.11.22-S1, answers that are received with TTL=0 are ineligible for serve-stale, and we updated the default values for stale-answer-ttl (from 30 seconds to one second) and max-stale-ttl (from one week to 12 hours).

This deals with complaint number 3.

9.16.9 improvements

In 9.16.9 we introduced a new option to deal with complaint number 1.

  • stale-refresh-time: If set, BIND replies with the stale answer in cache immediately if an attempt to refresh the RRset has previously failed, and continues to provide the stale answer for an amount of time specified by stale-refresh-time. The default is 30 seconds, as RFC 8767 recommends.

This enhancement speeds up responses for nearly all of the users that are in need of a stale answer: the very first user that queries for a record that has just become unavailable from the authority will still have to wait for the query timeout, but all the subsequent users will get the stale answer from cache.
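
The refresh window can be pictured with a small Python sketch. This is a simplified model, not BIND's implementation: the class `RefreshWindow` and its methods are hypothetical, and it assumes the window starts at the moment a refresh attempt fails.

```python
class RefreshWindow:
    """Tracks, per RRset, when the last refresh attempt failed."""

    def __init__(self, stale_refresh_time: float):
        self.stale_refresh_time = stale_refresh_time
        self.last_failure = {}  # RRset key -> time of the last failed refresh

    def record_failure(self, rrset: str, now: float) -> None:
        # Called when resolving this RRset timed out.
        self.last_failure[rrset] = now

    def is_active(self, rrset: str, now: float) -> bool:
        # A value of 0 disables the window entirely.
        if self.stale_refresh_time <= 0:
            return False
        failed_at = self.last_failure.get(rrset)
        return failed_at is not None and now - failed_at < self.stale_refresh_time
```

While `is_active` returns true for an RRset, a resolver following this model would answer from stale cache immediately instead of querying the authority again.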

9.16.12 improvements

After introducing stale-refresh-time we realized that we could have used better default values than those introduced in 9.16.6 and 9.11.22-S1. Instead of coming up with our own values, we should have used the values recommended by RFC 8767, so this release revises these defaults: it resets stale-answer-ttl to 30 seconds and changes max-stale-ttl from 12 hours to one day.
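
To keep the successive default values straight, here is a small Python lookup table built only from the numbers stated on this page. The names `DEFAULTS` and `defaults_for` are hypothetical, and "initial" is merely a label for the original implementation.

```python
# Default values as described on this page, in seconds.
HOUR = 3600
DAY = 24 * HOUR
WEEK = 7 * DAY

DEFAULTS = [
    # (release that introduced the defaults, stale-answer-ttl, max-stale-ttl)
    ("initial", 30, WEEK),
    ("9.16.6", 1, 12 * HOUR),
    ("9.16.12", 30, DAY),
]


def defaults_for(release: str) -> dict:
    """Return the defaults introduced by the given release label."""
    for name, answer_ttl, max_stale in DEFAULTS:
        if name == release:
            return {"stale-answer-ttl": answer_ttl, "max-stale-ttl": max_stale}
    raise KeyError(release)
```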

To further improve the serving of stale data, we added another option:

  • stale-answer-client-timeout, which is the maximum amount of time a recursive resolver should allow between the receipt of a resolution request and the sending of its response (only applicable if stale-answer-enable is set).

All improvements made in 9.16.9 and later will be backported to 9.11-S1.

A client walks into a bar and asks the bartender: "Do you serve stale data?"

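| cache-enable | answer-enable | refresh-time | answer-client-timeout | bartender        |
|--------------|---------------|--------------|-----------------------|------------------|
| no           | no            | 0 (disabled) | off (disabled)        | no               |
| no           | no            | 0 (disabled) | 0                     | no               |
| no           | no            | 0 (disabled) | value                 | no               |
| no           | no            | value        | off (disabled)        | no               |
| no           | no            | value        | 0                     | no               |
| no           | no            | value        | value                 | no               |
| no           | yes           | 0 (disabled) | off (disabled)        | no               |
| no           | yes           | 0 (disabled) | 0                     | no               |
| no           | yes           | 0 (disabled) | value                 | no               |
| no           | yes           | value        | off (disabled)        | no               |
| no           | yes           | value        | 0                     | no               |
| no           | yes           | value        | value                 | no               |
| yes          | no            | 0 (disabled) | off (disabled)        | no               |
| yes          | no            | 0 (disabled) | 0                     | no               |
| yes          | no            | 0 (disabled) | value                 | no               |
| yes          | no            | value        | off (disabled)        | no               |
| yes          | no            | value        | 0                     | no               |
| yes          | no            | value        | value                 | no               |
| yes          | yes           | 0 (disabled) | off (disabled)        | maybe            |
| yes          | yes           | 0 (disabled) | 0                     | yes              |
| yes          | yes           | 0 (disabled) | value                 | maybe            |
| yes          | yes           | value        | off (disabled)        | most of the time |
| yes          | yes           | value        | 0                     | yes              |
| yes          | yes           | value        | value                 | most of the time |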

With all these options, we now allow operators to be RFC 8767 compliant. But it is also less predictable when stale data is returned and when an attempt to resolve the query is made.

If stale-cache-enable is set to no, stale data is not kept in cache and it will not be used in DNS responses. In this case the other options are ignored, whatever they are set to.

If stale-cache-enable is set to yes, stale data may be served to the client if stale-answer-enable is set, or if enabled via rndc serve-stale on. Otherwise, the stale cache entries will not be used in DNS responses, and the other options are ignored, whatever they are set to.

So stale RRsets can only be returned to the client if stale-cache-enable is set to yes and if stale-answer-enable is set to yes (or enabled with rndc serve-stale on). But when will a stale answer be returned, and when will BIND attempt to resolve the query?
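
The bartender's answers in the table follow a simple pattern, which the Python sketch below reproduces. The function `bartender` is purely illustrative: it encodes the table rows, not BIND's internal logic, and it uses `None` as a stand-in for a stale-answer-client-timeout of off (disabled).

```python
from typing import Optional


def bartender(cache_enable: bool, answer_enable: bool,
              refresh_time: int, client_timeout: Optional[int]) -> str:
    """Summarize whether stale data is served for a combination of options.

    refresh_time: 0 means disabled, any positive number is a "value".
    client_timeout: None means off (disabled), 0 or a positive number otherwise.
    """
    if not (cache_enable and answer_enable):
        # Both options are required before stale data can be served at all.
        return "no"
    if client_timeout == 0:
        # Stale positive answers are prioritized and returned immediately.
        return "yes"
    if refresh_time > 0:
        # Only the first client after a failure waits for a timeout.
        return "most of the time"
    return "maybe"
```

For example, `bartender(True, True, 30, None)` gives "most of the time", which matches the row with a refresh-time value and the client timeout off.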

There are two options that influence the query path: stale-refresh-time and stale-answer-client-timeout.

The initial implementation

If stale-refresh-time is set to 0 (disabled) and stale-answer-client-timeout is set to off or disabled, the behaviour is the same as in the initial implementation: if there is no active data in cache (but there may be stale data), BIND first attempts to resolve the query. Only after resolver-query-timeout expires does it fall back to stale data in cache.

The stale refresh window

If stale-refresh-time is set to a positive value, and stale-answer-client-timeout is disabled, the behaviour improves for most clients. Suppose there is stale data in cache, and it is being requested. The first client that queries for a specific RRset will still face the lengthy timeout, but subsequent clients that query for the same RRset will immediately be served the stale data. In this time window, no attempts to refresh the RRset are made (the previous attempt failed, so BIND waits some time before trying to resolve again).

Bartender, how much longer do I have to wait before being served?

If stale-refresh-time is set to 0 (disabled), BIND may still serve stale data within a reasonable time if stale-answer-client-timeout is enabled.

If stale-answer-client-timeout is set to a positive value, and there is stale data in cache, BIND will first try to resolve the query, but if that takes longer than stale-answer-client-timeout, a database lookup for stale data is executed. If that results in a stale positive answer, it is given to the client, while in the background BIND continues to resolve the query, with the goal of updating the cache entry (and if that resolver query times out, BIND falls back to stale data in cache, regardless of whether it is a positive or negative entry). If no stale positive answer is available, the database lookup is dropped and BIND waits until the resolver is finished.

If stale-answer-client-timeout is set to 0, we prioritize stale positive answers: if such a stale entry exists in the cache, we immediately return it to the client and start an attempt to refresh the RRset. If the cache does not have a stale positive answer, a regular lookup is started (again, this may succeed, responding to the client with an authoritative answer, or it may fail, in which case BIND falls back to stale data in cache, regardless of whether it is a positive or negative entry).
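
The timing described above, with stale-refresh-time disabled, can be sketched as a small Python model of what a single client sees. This is an interpretation of the prose, not BIND's code: the function `respond` and its parameters are hypothetical, the 10-second resolver-query-timeout is the default mentioned earlier, and the "failure" result for a query with no stale entry at all is an assumption, since the text does not cover that case.

```python
from typing import Optional, Tuple


def respond(client_timeout: Optional[float], stale: Optional[str],
            resolve_time: Optional[float],
            resolver_query_timeout: float = 10.0) -> Tuple[str, float]:
    """Model one query when stale-refresh-time is 0 (disabled).

    client_timeout: None for off (disabled), otherwise seconds (0 allowed).
    stale: "positive", "negative", or None if no stale entry is cached.
    resolve_time: seconds until the authority answers, None if it never does.
    Returns (kind of answer, seconds the client waited).
    """
    resolved = resolve_time is not None and resolve_time < resolver_query_timeout
    if client_timeout is not None and stale == "positive":
        # The client timeout fires before resolution completes, so the stale
        # positive answer is sent; resolution continues in the background.
        if not resolved or resolve_time >= client_timeout:
            return "stale positive", client_timeout
    if resolved:
        return "fresh", resolve_time
    if stale is not None:
        # Resolver timed out: fall back to stale data, positive or negative.
        return f"stale {stale}", resolver_query_timeout
    return "failure", resolver_query_timeout  # assumption, see lead-in
```

In this model, a client timeout of 0 with a stale positive entry yields an answer after 0 seconds, whereas with the timeout off the same client waits the full resolver-query-timeout when the authority is unreachable.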

The stale refresh window in combination with client timeout

Now what happens if both stale-refresh-time and stale-answer-client-timeout are used?

First let us look at the case where stale-answer-client-timeout is 0. When a request comes in, BIND checks whether there is a stale cache entry, and if one is found, then:

  1. If stale-refresh-time is active for that RRset, then we have two situations:

    • The stale cache entry is a positive answer: the stale answer is used in the response to the client, no attempt to refresh the entry is made, because the time window explicitly tells us not to do so.
    • The stale cache entry is a negative answer: this query will now result in a server failure response.
  2. If the stale-refresh-time window was inactive for that RRset, then again we have two situations:

    • The stale cache entry is a positive answer: the stale answer is used in the response to the client but BIND starts an attempt to refresh the RRset.
    • The stale cache entry is a negative answer: BIND will start resolving the query.

What if the stale-answer-client-timeout is set to a positive value? Then if a stale cache entry exists, and if the stale-refresh-time window was active for this RRset, the entry is immediately returned, without an attempt to resolve the query.

If the stale-refresh-time window was inactive, a normal lookup follows. The stale cache entry is not immediately used. Either the query is resolved swiftly, and the answer from the authority is used in the response to the client, or resolving the query takes a long time and the stale-answer-client-timeout occurs.

If the stale-answer-client-timeout occurs, a database lookup for stale data is executed. BIND now acts the same as when stale-refresh-time is disabled: if that results in a stale positive answer, it is given to the client, while in the background BIND continues to resolve the query, with the goal of updating the cache entry (and if that resolver query times out, BIND falls back to stale data in cache, regardless of whether it is a positive or negative entry). If no stale positive answer is available, the database lookup is dropped and BIND waits until the resolver is finished.
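
The combined cases can be condensed into a decision function. The Python sketch below restates the first action BIND takes, as described in this section, for a query that finds a stale cache entry. The function name `first_step` and the returned strings are hypothetical labels, not BIND terminology.

```python
def first_step(client_timeout: float, stale: str, window_active: bool) -> str:
    """Return the first action for a query that hits a stale cache entry.

    client_timeout: 0 or a positive value (both options are in use).
    stale: "positive" or "negative".
    window_active: whether the stale-refresh-time window is active
                   for the requested RRset.
    """
    if client_timeout == 0:
        if window_active:
            # Inside the window no refresh is attempted.
            return ("answer stale, no refresh" if stale == "positive"
                    else "server failure")
        return ("answer stale, refresh in background" if stale == "positive"
                else "resolve")
    if window_active:
        # Positive timeout, active window: the entry is returned at once.
        return "answer stale, no refresh"
    # Positive timeout, inactive window: a normal lookup comes first.
    return "resolve, fall back to stale on client timeout"
```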