[[TOC]]
DNS Benchmarks
A collection of hints and links that can be useful for people new to DNS benchmarking.
Step 0. Prerequisites
See the DNS-OARC 42 talk "DNS Benchmarking 101: Essentials and Common Pitfalls" (slides, video recording)
- Test design must be different for any combination of
- resolver
- authoritative server
- normal traffic
- DoS traffic
- server management operations
- ⚠️ Be absolutely sure to test your test environment first.
- If you don't, the results are probably garbage
- An echo server will be very useful
- user-space: https://github.com/DNS-OARC/dumdumd/
- XDP (UDP-only): https://gitlab.nic.cz/knot/xdp-utils/
- The usual tuning tips also apply to an echo server
- beware of NUMA domains in hardware - you might want to restrict yourself to a single NUMA domain - commands numactl, taskset
- pick the NUMA domain that is directly connected to the network card in use - the lstopo tool can help with that
- find out how many network card queues are available - command ethtool -l devicename
- take the minimum of (number of CPUs in the chosen NUMA domain, number of network queues) and use that as the number of threads
- these are just a starting point, experiment!
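The thread-count rule of thumb above can be sketched in a few lines. This is only an illustration: the helper names are made up for this page, the cpulist string is assumed to come from a file such as /sys/devices/system/node/node0/cpulist, and the queue count is assumed to be read by hand from the ethtool -l output.

```python
def parse_cpulist(text):
    """Parse a Linux cpulist string such as '0-3,8-11' into a sorted list of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def pick_thread_count(numa_cpus, nic_queues):
    """Starting point only: min(CPUs in the chosen NUMA domain, NIC queues)."""
    return max(1, min(len(numa_cpus), nic_queues))


# Example values are made up; read the real ones from your own machine.
node_cpus = parse_cpulist("0-7,16-23")  # 16 CPUs in the chosen NUMA domain
queues = 8                              # queue count taken from ethtool -l
print(pick_thread_count(node_cpus, queues))  # prints 8
```

The same cpulist string can be passed to taskset -c to pin the server to those CPUs. Treat the result as a first guess and measure around it.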
Resolvers
- An example with an explanation of why and how we test resolvers: https://www.isc.org/blogs/bind-resolver-performance-july-2021/
- ⚠️ Results cannot be generalized to other data sets or setups with any certainty
- Why? Because every single query (and its timing) changes the state of the system under test. Not to mention the dependency on authoritative server performance ...
- Stateful transports are even more complicated to test because they pull client transport layer behavior into the mix.
- To run a resolver test with a real data set, using DNS Shotgun, in our GitLab CI:
- Go to https://gitlab.isc.org/isc-projects/bind9-shotgun-ci/-/pipelines/new and fill in the form:
- SHOTGUN_TEST_VERSION: versions to test; specify multiple versions e.g. like ["main", "another_branch_for_testing"] (tags and commit hashes also work)
- SHOTGUN_TRAFFIC_MULTIPLIER: load factor. For the normal resolver UDP scenario, play in the range 10-20.
- SHOTGUN_ROUNDS: how many times each individual test should be repeated. Use at least 3 if you want sensible results.
- SHOTGUN_DURATION: test duration in seconds. Normally most cache misses happen in the first minute. Use e.g. 300 if you are also interested in hot-cache behavior.
- Click the [Run pipeline] button at the bottom of the form
- Now wait - you might need to restart failing jobs
- Also stop/restart the postproc job to pick up results from the restarted "test" jobs
- Test results will be in the "postproc" job in the child pipeline - click on index.html, which will show you the most interesting jobs, or dive into the subdirectories full of charts
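To avoid typos when filling in the form, the values can be prepared with a small helper. This is a sketch, not part of the CI: shotgun_form_values is a hypothetical name, the defaults are just the hints from the list above, and it assumes that a single version is also written as a list.

```python
import json


def shotgun_form_values(versions, multiplier, rounds=3, duration=300):
    """Return the pipeline form values as strings (hypothetical helper)."""
    if rounds < 3:
        raise ValueError("use at least 3 rounds for sensible results")
    return {
        # Assumption: the version list is written like ["main", "other_branch"]
        "SHOTGUN_TEST_VERSION": json.dumps(list(versions)),
        "SHOTGUN_TRAFFIC_MULTIPLIER": str(multiplier),
        "SHOTGUN_ROUNDS": str(rounds),
        "SHOTGUN_DURATION": str(duration),
    }


for key, value in shotgun_form_values(["main", "another_branch_for_testing"], 15).items():
    print(f"{key}={value}")
```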
Variables to consider
Just a short list of the most important ones. Not a full list by a long shot.
- Client query patterns
- distribution of queries among clients (especially with stateful transports)
- theoretical cache hit ratio
- RD=1 queries
- DO=1 queries
- CD=1 queries
- Client Subnet in DNS Queries (aka EDNS Client Subnet, ECS)
- Client transport
- UDP with TCP fallback
- DoT
- DoH
- HTTP
- HTTPS
- TLS version
- TLS resumption
- TLS certificate/algorithm in use
- connection management - how clients and server manage idle connections
- Resolver side
- forwarding
- via DoT
- dnssec-validation
- cache size
- TTL limits
- availability of IPv4 and/or IPv6
- RPZ/policy
- number of RPZ zones
- RPZ zone config
- break-dnssec option
- min-update-interval option
- *-wait-recurse options
- content of RPZ zones
- update size
- update frequency
- views / GeoIP
- Auth side (i.e. how does the Internet behave)
- TTL of answers
- size of answers
- validity of answers
- weird answers - very long CNAME chains etc.
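The "theoretical cache hit ratio" item in the list above can be estimated from a query list before any test is run. The sketch below computes only an upper bound and rests on loud assumptions: an infinitely large cache, TTLs that never expire, and a cache keyed on (qname, qtype) only. The function name and the sample queries are made up.

```python
def cache_hit_upper_bound(queries):
    """Upper bound on the hit ratio: every repeat of a (qname, qtype) pair counts as a hit."""
    total = 0
    seen = set()
    for qname, qtype in queries:
        total += 1
        # DNS names are case-insensitive; also ignore the trailing dot
        seen.add((qname.lower().rstrip("."), qtype.upper()))
    if total == 0:
        return 0.0
    return 1.0 - len(seen) / total


queries = [
    ("example.com", "A"),
    ("Example.com.", "a"),    # same key as the first query
    ("example.com", "AAAA"),
    ("example.net", "A"),
]
print(cache_hit_upper_bound(queries))  # prints 0.25
```

Real hit ratios will be lower, because TTL limits, cache size and query timing all matter.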
Typical setups
Cartesian product of
- (recursor, forwarder)
- (no ECS, ECS)
- (no RPZ, lots of RPZ)
- (one view, more views with GeoIP)
- (closed resolver, open resolver)
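The Cartesian product above grows quickly, so it helps to enumerate it explicitly before deciding which combinations to actually test. A minimal sketch; the axis names (mode, ecs, ...) are labels invented here, the values are the ones from the list.

```python
from itertools import product

AXES = {
    "mode": ("recursor", "forwarder"),
    "ecs": ("no ECS", "ECS"),
    "rpz": ("no RPZ", "lots of RPZ"),
    "views": ("one view", "more views with GeoIP"),
    "access": ("closed resolver", "open resolver"),
}


def enumerate_setups(axes=AXES):
    """Return one dict per combination of axis values."""
    names = list(axes)
    return [dict(zip(names, combo)) for combo in product(*axes.values())]


setups = enumerate_setups()
print(len(setups))  # prints 32, i.e. 2**5 combinations
print(setups[0])
```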
Authoritative
- ⚠️ Different queries have different processing cost - and it also depends on zone contents!
- See e.g. https://www.knot-dns.cz/benchmark/ with test results for half a dozen different test setups
- ⚠️ Raw QPS is not everything!
- Authoritative servers can (also) suffer from latency spikes - see DNS-OARC 40 Detecting latency spikes in DNS server implementation(s) slides, video
- Our CI for primary authoritative server is Perflab: https://perflab.isc.org/
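To illustrate why raw QPS or a mean latency can hide the spikes mentioned above, here is a small sketch that summarizes response times with nearest-rank percentiles. The function names and the sample data are made up, and the way latencies are collected is out of scope here.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile of samples, for p in (0, 100]."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def latency_summary(samples_ms):
    """Mean, median, 99th percentile and maximum of latencies in milliseconds."""
    return {
        "mean": sum(samples_ms) / len(samples_ms),
        "p50": percentile(samples_ms, 50),
        "p99": percentile(samples_ms, 99),
        "max": max(samples_ms),
    }


# Made-up data: 980 fast answers and 20 answers delayed by a spike
samples = [0.2] * 980 + [150.0] * 20
print(latency_summary(samples))
```

With this data the median stays at 0.2 ms while the 99th percentile is 150 ms, which is the kind of difference a QPS number alone will not show.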
Variables to consider
Just a short list of the most important ones. Not a full list by a long shot.
- Client query patterns
- distribution of queries among clients - for load distribution
- DO=1 queries
- Transport
- DoT for zone transfers
- TCP parameters - especially for outgoing connections
- sysctl net.ipv4.ip_local_port_range
- sysctl net.ipv4.tcp_max_tw_buckets
- sysctl net.ipv4.tcp_tw_reuse
- see above in Resolver section if relevant
- Number of zones
- Content/size of the zones
- DNSSEC signed zones - NSEC parameters
- Update mechanisms
- UPDATE
- AXFR
- IXFR
- rndc reload
- DNSSEC resigning
- journal configuration
- ixfr-from-differences
- max-journal-size
- storage properties - mainly IOPS
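Since results are only valid for one specific setup, it can be useful to record the TCP sysctls listed above next to each test run. A minimal sketch, assuming Linux with procfs mounted at /proc/sys; the function names are made up, and the root parameter exists only to make the code testable.

```python
from pathlib import Path

TCP_SYSCTLS = (
    "net.ipv4.ip_local_port_range",
    "net.ipv4.tcp_max_tw_buckets",
    "net.ipv4.tcp_tw_reuse",
)


def read_sysctl(name, root="/proc/sys"):
    """Read one sysctl via procfs; for these names, dots map to path separators."""
    return Path(root, *name.split(".")).read_text().split()


def snapshot(names=TCP_SYSCTLS, root="/proc/sys"):
    """Return {name: list of values, or None if not readable on this system}."""
    values = {}
    for name in names:
        try:
            values[name] = read_sysctl(name, root)
        except OSError:
            values[name] = None
    return values


def local_port_count(port_range):
    """Number of ports in an ip_local_port_range value such as ['32768', '60999']."""
    low, high = (int(v) for v in port_range)
    return high - low + 1


print(snapshot())
```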
Typical setups
Test primary and secondaries for each scenario. Example values as known on 2024-09-25.
- TLD
- a couple of large zones
- DNSSEC signed
- infrequent but large updates (often DNSSEC resigning)
- mostly delegation hits - see e.g. CZ stats
- DNS hosting
- lots of mostly small zones
- possibly unsigned or signed, depends on users
- number of zones varies widely, from 100 to 2M
- most zones static
- small but frequent-ish updates
- reconfig possibilities
- catalog zone
- rndc addzone
- file change & rndc reconfig
- root
- small number of small zones (root, arpa, etc.)
- NSEC signed
- read only
- lots of NXDOMAIN traffic, see e.g. ITHI stats
- see RSSAC 002 stats
- roughly 70 % of queries have DO=1, see e.g. ITHI stats
Benchmarks
- server startup time
- server restart/reload/reconfig/addzone/delzone times
- memory consumption
- also in transient states - updates etc.
- CPU load
- also in transient states - updates etc.
- QPS
- ⚠️ measurement is always valid only for specific config/zones/query stream/transport etc.
- in steady state
- when AXFR/IXFR/UPDATE etc. is in progress
- ⚠️ latency
- when a management operation is in progress
- reconfig(s)
- statistics query
- rndc addzone/delzone/modzone
- impact of UPDATEs
- impact of DNSSEC resigning
- zone transfers
- time to cold start secondary - no local data
- time to restart secondary - a copy of zones available, time to finish first refreshes
- number of secondaries handled by a single primary
- propagation delay - from change on primary to last secondary being up-to-date
- compare behavior with only legitimate clients vs. when under attack (attack parallel with legit clients)
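The propagation delay from the list above is simply the time between the change on the primary and the moment the slowest secondary catches up. A sketch under stated assumptions: the timestamps are in seconds from one common clock, how they are collected (e.g. by polling the SOA serial) is left out, and the function and server names are hypothetical.

```python
def propagation_delay(primary_change_time, secondary_update_times):
    """Return (delay in seconds, name of the last secondary to become up-to-date)."""
    if not secondary_update_times:
        raise ValueError("no secondaries observed")
    last_name, last_time = max(secondary_update_times.items(), key=lambda kv: kv[1])
    return last_time - primary_change_time, last_name


# Made-up timestamps in seconds
delay, slowest = propagation_delay(100.0, {"ns2": 100.75, "ns3": 103.5, "ns4": 101.25})
print(delay, slowest)  # prints 3.5 ns3
```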