DNS Benchmarks

A collection of hints and links that can be useful for people new to DNS benchmarking.

Step 0. Prerequisites

See DNS-OARC 42, DNS Benchmarking 101: Essentials and Common Pitfalls slides, video recording

Test design must be different for any combination of
- resolver
- authoritative server
- normal traffic
- DoS traffic
- server management operations
⚠️ Be absolutely sure to test your test environment first.
- If you don't, the results are probably garbage
- An echo server will be very useful
  - user-space: https://github.com/DNS-OARC/dumdumd/
  - XDP (UDP-only): https://gitlab.nic.cz/knot/xdp-utils/
The usual tuning tips also apply to an echo server:
- beware of NUMA domains in hardware - you might want to restrict the server to a single NUMA domain - commands numactl, taskset
- pick the NUMA domain that is directly connected to the network card in use - the lstopo tool can help with that
- find out how many network card queues are available - command ethtool -l devicename
- use the smaller of (number of CPUs in the chosen NUMA domain, number of network card queues) as the number of threads
- these are just a starting point, experiment!
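
The thread-count rule of thumb above can be sketched in a few lines. This is only an illustration under stated assumptions: the helper names and the sample inputs are made up, the CPU list is assumed to use the Linux sysfs format (for example `0-7,16-23`), and the `ethtool -l` output is assumed to have the usual layout with a `Combined:` line under `Current hardware settings:`.

```python
import re

def count_cpus(cpulist: str) -> int:
    """Count CPUs in a Linux cpulist string such as '0-7,16-23'."""
    total = 0
    for part in cpulist.strip().split(","):
        if not part:
            continue
        first, _, last = part.partition("-")
        total += int(last or first) - int(first) + 1
    return total

def current_queues(ethtool_l_output: str) -> int:
    """Return the currently configured channel count from `ethtool -l` output."""
    # Ignore the "Pre-set maximums" part; only current settings matter here.
    _, _, current = ethtool_l_output.partition("Current hardware settings:")
    values = {
        name: int(value)
        for name, value in re.findall(
            r"^(\w+):[ \t]+(\d+)[ \t]*$", current, re.MULTILINE
        )
    }
    # Assumption: NICs without combined channels report separate RX queues.
    return values.get("Combined") or values.get("RX", 0)

def suggested_threads(cpulist: str, ethtool_l_output: str) -> int:
    """Starting point only: min(CPUs in the NUMA domain, NIC queues)."""
    return min(count_cpus(cpulist), current_queues(ethtool_l_output))

# Invented sample data for a hypothetical 16-CPU NUMA node and an 8-queue NIC.
SAMPLE_CPULIST = "0-7,16-23"
SAMPLE_ETHTOOL = """\
Channel parameters for eth0:
Pre-set maximums:
RX:             0
TX:             0
Other:          1
Combined:       63
Current hardware settings:
RX:             0
TX:             0
Other:          1
Combined:       8
"""

print(suggested_threads(SAMPLE_CPULIST, SAMPLE_ETHTOOL))
```

On a real machine the CPU list would typically come from `/sys/devices/system/node/nodeN/cpulist` and the queue count from `ethtool -l devicename`. Treat the result as a first guess to experiment around, not as a final setting.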

Resolvers

An example with explanation why and how we test resolvers: https://www.isc.org/blogs/bind-resolver-performance-july-2021/
⚠️ Results cannot be generalized to other data sets or setups with any certainty
- Why? Because every single query (and its timing) changes the state of the system under test. Not to speak of the dependency on authoritative server performance ...
Introduction to DNS Shotgun test tool slides, video
Stateful transports are even more complicated to test because they pull client transport layer behavior into the mix.
- See DNS-OARC 33 "DNS Shotgun: realistic DNS resolver benchmarking (also) for stateful transports" at slides, video
To run a resolver test with a real data set, using DNS Shotgun, in our GitLab CI:
- Go to https://gitlab.isc.org/isc-projects/bind9-shotgun-ci/-/pipelines/new and fill in the form:
- SHOTGUN_TEST_VERSION: versions to test, specify multiple versions e.g. like ["main", "another_branch_for_testing"] (tags and commit hashes also work)
- SHOTGUN_TRAFFIC_MULTIPLIER: Load factor. For a normal resolver UDP scenario, experiment with values in the range 10-20.
- SHOTGUN_ROUNDS: How many times each individual test should be repeated. Use at least 3 if you want sensible results.
- SHOTGUN_DURATION: Test duration in seconds. Normally most cache misses happen in the first minute. Use e.g. 300 if you are also interested in hot-cache behavior.
- Click [Run pipeline] button at the bottom of the form
- Now wait - you might need to restart failing jobs
  - Also stop/restart the postproc job so it picks up results from the restarted "test" jobs
- Test results will be in the "postproc" job in the child pipeline - click on index.html, which shows the most interesting jobs, or dive into the subdirectories full of charts
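
As a rough illustration of how the form fields above fit together, the sketch below builds the same variables as a request body in the shape used by GitLab's "create a new pipeline" API (`ref` plus a list of key/value `variables`). It is an assumption that this CI project accepts the variables through the API the same way as through the form. The function name, the `ref="main"` default, and the default values (taken from the ranges mentioned above) are illustrative only. Nothing is sent over the network.

```python
import json

def pipeline_request(versions, multiplier=15, rounds=3, duration=300, ref="main"):
    """Build a pipeline-creation body carrying the SHOTGUN_* variables.

    Assumptions: ref="main" is the branch of the CI project to run, and
    all variable values are passed as strings, as they would be in the form.
    """
    variables = {
        # A JSON list of branches, tags, or commit hashes to test.
        "SHOTGUN_TEST_VERSION": json.dumps(versions),
        "SHOTGUN_TRAFFIC_MULTIPLIER": str(multiplier),
        "SHOTGUN_ROUNDS": str(rounds),
        "SHOTGUN_DURATION": str(duration),
    }
    return {
        "ref": ref,
        "variables": [{"key": k, "value": v} for k, v in variables.items()],
    }

body = pipeline_request(["main", "another_branch_for_testing"])
print(json.dumps(body, indent=2))
```

The main thing the sketch shows is the value format: SHOTGUN_TEST_VERSION is a single string containing a JSON list, while the other variables are plain numbers.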

Variables to consider

Just a short list of the most important ones. Not a full list by a long shot.

Client query patterns
- distribution of queries among clients (especially with stateful transports)
- theoretical cache hit ratio
- RD=1 queries
- DO=1 queries
- CD=1 queries
- Client Subnet in DNS Queries (aka EDNS Client Subnet, ECS)
Client transport
- UDP with TCP fallback
- DoT
- DoH
  - HTTP
  - HTTPS
- TLS version
- TLS resumption
- TLS certificate/algorithm in use
- connection management - how clients and server manage idle connections
Resolver side
- forwarding
  - via DoT
- dnssec-validation
- cache size
- TTL limits
- availability of IPv4 and/or IPv6
RPZ/policy
- number of RPZ zones
- RPZ zone config
  - break-dnssec option
  - min-update-interval option
  - *-wait-recurse options
- content of RPZ zones
- update size
- update frequency
views / GeoIP
Auth side (i.e. how does the Internet behave)
- TTL of answers
- size of answers
- validity of answers
- weird answers - very long CNAME chains etc.

Typical setups

Cartesian product of

(recursor, forwarder)
(no ECS, ECS)
(no RPZ, lots of RPZ)
(one view, more views with GeoIP)
(closed resolver, open resolver)
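
To get a feeling for how quickly the test matrix grows, the Cartesian product above can be enumerated directly. This is a minimal sketch. The axis names (`mode`, `ecs`, and so on) are labels invented for this example, and the values are copied from the list above.

```python
from itertools import product

# Axis names are made up for illustration; values come from the list above.
AXES = {
    "mode": ("recursor", "forwarder"),
    "ecs": ("no ECS", "ECS"),
    "rpz": ("no RPZ", "lots of RPZ"),
    "views": ("one view", "more views with GeoIP"),
    "access": ("closed resolver", "open resolver"),
}

def all_setups(axes=AXES):
    """Return every combination of axis values as a list of dicts."""
    names = list(axes)
    return [dict(zip(names, combo)) for combo in product(*axes.values())]

setups = all_setups()
print(len(setups))  # 2**5 = 32 combinations
for setup in setups[:3]:
    print(setup)
```

Even these five two-valued axes already give 32 setups, before any of the variables listed earlier are varied.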

Authoritative

⚠️ Different queries have different processing cost - and it also depends on zone contents!
- See e.g. https://www.knot-dns.cz/benchmark/ with test results for half a dozen different test setups
⚠️ Raw QPS is not everything!
Authoritative servers can (also) suffer from latency spikes - see DNS-OARC 40 Detecting latency spikes in DNS server implementation(s) slides, video
Our CI for primary authoritative server is Perflab: https://perflab.isc.org/
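
A small numeric illustration of why raw QPS and averages hide latency spikes. The numbers are invented for the example and the helper is a plain nearest-rank percentile, not the method used in the talk or in Perflab.

```python
def percentile(samples, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Integer form of ceil(pct / 100 * n), avoiding float rounding.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]

# Invented data: 98 % of answers take 1 ms, 2 % take 200 ms.
latencies_ms = [1.0] * 980 + [200.0] * 20

mean = sum(latencies_ms) / len(latencies_ms)
print(f"mean: {mean:.2f} ms")
print(f"p50:  {percentile(latencies_ms, 50):.2f} ms")
print(f"p99:  {percentile(latencies_ms, 99):.2f} ms")
```

With this data the median stays at 1 ms and the mean is still under 5 ms, while the 99th percentile is 200 ms. A server can therefore post a good QPS and average latency while a noticeable fraction of clients waits much longer.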

Variables to consider

Just a short list of the most important ones. Not a full list by a long shot.

Client query patterns
- distribution of queries among clients - for load distribution
- DO=1 queries
Transport
- DoT for zone transfers
- TCP parameters - especially for outgoing connections
  - sysctl net.ipv4.ip_local_port_range
  - sysctl net.ipv4.tcp_max_tw_buckets
  - sysctl net.ipv4.tcp_tw_reuse
- see above in Resolver section if relevant
Number of zones
Content/size of the zones
- DNSSEC signed zones - NSEC parameters
Update mechanisms
- UPDATE
- AXFR
- IXFR
- rndc reload
- DNSSEC resigning
journal configuration
- ixfr-from-differences
- max-journal-size
- storage properties - mainly IOPS
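
The TCP sysctls listed under Transport above can be recorded together with each test run so that results stay comparable. The sketch below is a minimal example, assuming a Linux host where sysctls are exposed under `/proc/sys`. The helper names are made up, and the values print as None on systems without these files.

```python
from pathlib import Path

TCP_SYSCTLS = (
    "net.ipv4.ip_local_port_range",
    "net.ipv4.tcp_max_tw_buckets",
    "net.ipv4.tcp_tw_reuse",
)

def sysctl_path(name, root="/proc/sys"):
    """Map a dotted sysctl name to its file under /proc/sys."""
    return Path(root) / name.replace(".", "/")

def read_sysctl(name):
    """Return the sysctl value as text, or None if it cannot be read."""
    try:
        return sysctl_path(name).read_text().strip()
    except OSError:
        return None

def ephemeral_port_count(port_range_text):
    """Number of local ports in an ip_local_port_range value ('low high')."""
    low, high = (int(x) for x in port_range_text.split())
    return high - low + 1

for name in TCP_SYSCTLS:
    print(name, "=", read_sysctl(name))

# Example value only; the actual range depends on the system configuration.
print(ephemeral_port_count("32768\t60999"))
```

The port count matters mostly for outgoing connections: it limits how many simultaneous connections one source address can hold to a single destination address and port, and sockets in TIME_WAIT count against it.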

Typical setups

Test primary and secondaries for each scenario. Example values as known on 2024-09-25.

TLD
- a couple of large zones
- DNSSEC signed
- infrequent but large updates (often DNSSEC resigning)
- mostly delegation hits - see e.g. CZ stats
DNS hosting
- lots of mostly small zones
  - possibly unsigned or signed, depends on users
  - number of zones varies widely, roughly from 100 to 2M
- most zones static
- small but frequent-ish updates
- reconfig possibilities
  - catalog zone
  - rndc addzone
  - file change & rndc reconfig
root
- small number of small zones (root, arpa, etc.)
  - NSEC signed
- read only
- lots of NXDOMAIN traffic, see e.g. ITHI stats
- see RSSAC 002 stats
- roughly 70 % of queries have DO=1, see e.g. ITHI stats

Benchmarks

server startup time
server restart/reload/reconfig/addzone/delzone times
memory consumption
- also in transient states - updates etc.
CPU load
- also in transient states - updates etc.
QPS
- ⚠️ measurement is always valid only for specific config/zones/query stream/transport etc.
- in steady state
- when AXFR/IXFR/UPDATE etc. is in progress
  - ⚠️ latency
- when a management operation is in progress
  - reconfig(s)
  - statistics query
  - rndc addzone/delzone/modzone
impact of UPDATEs
impact of DNSSEC resigning
zone transfers
- time to cold start secondary - no local data
- time to restart secondary - a copy of zones available, time to finish first refreshes
- number of secondaries handled by a single primary
- propagation delay - from change on primary to last secondary being up-to-date
compare behavior with only legitimate clients vs. when under attack (attack parallel with legit clients)
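
The propagation delay item in the zone transfers list above can be computed from simple timestamps. This is a sketch under assumptions: the function name and the sample data are invented, and it presumes you already recorded, for each secondary, the time at which it first served the new SOA serial (for example by polling each secondary).

```python
def propagation_delay(change_time_ms, synced_at_ms):
    """Return (delay, slowest secondary) for one change on the primary.

    change_time_ms: when the change was made on the primary.
    synced_at_ms: maps secondary name to the time it first served the new serial.
    Both use the same clock, in milliseconds.
    """
    if not synced_at_ms:
        raise ValueError("no secondaries observed")
    slowest = max(synced_at_ms, key=synced_at_ms.get)
    return synced_at_ms[slowest] - change_time_ms, slowest

# Invented sample: change at t=1000 ms, three hypothetical secondaries.
observed = {"secondary-a": 1400, "secondary-b": 2900, "secondary-c": 1700}
delay, slowest = propagation_delay(1000, observed)
print(f"propagation delay: {delay} ms (last to sync: {slowest})")
```

The delay is defined by the last secondary to catch up, so a single slow secondary determines the result. Repeating the measurement while the primary is under query load shows whether transfers and query answering interfere with each other.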