
[[TOC]]

DNS Benchmarks

Collection of hints and links which can be useful for people new to DNS benchmarking.

Step 0. Prerequisites

See DNS-OARC 42, DNS Benchmarking 101: Essentials and Common Pitfalls slides, video recording

  • Test design must differ for each combination of

    • resolver
    • authoritative server
    • normal traffic
    • DoS traffic
    • server management operations
  • ⚠️ Be absolutely sure to test your test environment first.

  • The usual tuning tips also apply to an echo server

    • beware of NUMA domains in hardware - you might want to restrict the test to a single NUMA domain - commands numactl, taskset
    • pick the NUMA domain that is directly connected to the network card in use - the tool lstopo can help with that
    • find out how many network card queues are available - command ethtool -l devicename
    • pick the minimum of (# of CPUs in the chosen NUMA domain, number of network queues) and use that as the number of threads
    • these are just a starting point, experiment! (see the example commands below)
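
A minimal sketch of the inspection and pinning commands mentioned above, assuming a Linux host with numactl, hwloc (lstopo), and ethtool installed; eth0, NUMA node 0, the CPU list, and ./server-under-test are placeholders:

  # list NUMA nodes and the CPUs belonging to each
  numactl --hardware
  # show which NUMA node the NIC is attached to (-1 means no NUMA affinity)
  cat /sys/class/net/eth0/device/numa_node
  # show the hardware topology, including which NUMA node the NIC hangs off
  lstopo
  # show how many RX/TX queues the NIC supports and how many are currently active
  ethtool -l eth0
  # pin the server (or echo server) to the chosen NUMA domain ...
  numactl --cpunodebind=0 --membind=0 ./server-under-test
  # ... or to an explicit CPU list
  taskset -c 0-15 ./server-under-test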

Resolvers

  • An example with an explanation of why and how we test resolvers: https://www.isc.org/blogs/bind-resolver-performance-july-2021/

  • ⚠️ Results cannot be generalized to other data sets or setups with any certainty

    • Why? Because every single query (and its timing) changes the state of the system under test. Not to mention the dependency on authoritative server performance ...
  • Introduction to DNS Shotgun test tool slides, video

  • Stateful transports are even more complicated to test because they pull client transport-layer behavior into the mix.

    • See DNS-OARC 33 "DNS Shotgun: realistic DNS resolver benchmarking (also) for stateful transports" at slides, video
  • To run a resolver test with a real data set, using DNS Shotgun, in our GitLab CI:

    • Go to https://gitlab.isc.org/isc-projects/bind9-shotgun-ci/-/pipelines/new and fill in the form (example values below):
    • SHOTGUN_TEST_VERSION: versions to test; specify multiple versions e.g. like ["main", "another_branch_for_testing"] (tags and commit hashes also work)
    • SHOTGUN_TRAFFIC_MULTIPLIER: Load factor. For the normal resolver UDP scenario, experiment in the range 10-20.
    • SHOTGUN_ROUNDS: How many times each individual test should be repeated. Use at least 3 if you want sensible results.
    • SHOTGUN_DURATION: Test duration in seconds; normally most cache misses happen in the first minute. Use e.g. 300 if you are also interested in hot-cache behavior.
    • Click the [Run pipeline] button at the bottom of the form
    • Now wait - you might need to restart failing jobs
      • Also stop/restart the postproc job to pick up results from the restarted "test" jobs
    • Test results will be in the "postproc" job in the child pipeline - click on index.html, which shows the most interesting jobs, or dive into the subdirectories full of charts
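
One plausible combination of form values for comparing a work-in-progress branch against main, including hot-cache behavior (my_feature_branch is just an example name):

  SHOTGUN_TEST_VERSION = ["main", "my_feature_branch"]
  SHOTGUN_TRAFFIC_MULTIPLIER = 15
  SHOTGUN_ROUNDS = 3
  SHOTGUN_DURATION = 300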

Variables to consider

Just a short list of the most important ones. Not a full list by a long shot.

  • Client query patterns (see the dig examples after this list)
    • distribution of queries among clients (especially with stateful transports)
    • theoretical cache hit ratio
    • RD=1 queries
    • DO=1 queries
    • CD=1 queries
    • Client Subnet in DNS Queries (aka EDNS Client Subnet, ECS)
  • Client transport
    • UDP with TCP fallback
    • DoT
    • DoH
      • HTTP
      • HTTPS
    • TLS version
    • TLS resumption
    • TLS certificate/algorithm in use
    • connection management - how clients and servers manage idle connections
  • Resolver side
    • forwarding
      • via DoT
    • dnssec-validation
    • cache size
    • TTL limits
    • availability of ipv4 and/or ipv6
  • RPZ/policy
    • number of RPZ zones
    • RPZ zone config
      • break-dnssec option
      • min-update-interval option
      • *-wait-recurse options
    • content of RPZ zones
    • update size
    • update frequency
  • views / GeoIP
  • Auth side (i.e. how the Internet behaves)
    • TTL of answers
    • size of answers
    • validity of answers
    • weird answers - very long CNAME chains etc.
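
For orientation, the query-pattern and transport knobs above map roughly onto dig flags as follows; these send single queries for sanity-checking a setup, not benchmark load, and the resolver address and query name are placeholders:

  dig @192.0.2.53 example.com A                            # RD=1 query (dig sets RD by default)
  dig @192.0.2.53 example.com A +norecurse                 # RD=0
  dig @192.0.2.53 example.com A +dnssec                    # DO=1
  dig @192.0.2.53 example.com A +cdflag                    # CD=1
  dig @192.0.2.53 example.com A +subnet=198.51.100.0/24    # EDNS Client Subnet (ECS)
  dig @192.0.2.53 example.com A +tcp                       # TCP instead of UDP
  dig @192.0.2.53 example.com A +tls                       # DoT (dig from BIND 9.18 or newer)
  dig @192.0.2.53 example.com A +https                     # DoH (dig from BIND 9.18 or newer)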

Typical setups

Cartesian product of

  • (recursor, forwarder)
  • (no ECS, ECS)
  • (no RPZ, lots of RPZ)
  • (one view, more views with GeoIP)
  • (closed resolver, open resolver)

Authoritative

  • ⚠️ Different queries have different processing cost - and it also depends on zone contents!
  • ⚠️ Raw QPS is not everything!
  • Authoritative servers can (also) suffer from latency spikes - see DNS-OARC 40 Detecting latency spikes in DNS server implementation(s) slides, video
  • Our CI for primary authoritative server is Perflab: https://perflab.isc.org/

Variables to consider

Just a short list of the most important ones. Not a full list by a long shot.

  • Client query patterns
    • distribution of queries among clients - for load distribution
    • DO=1 queries
  • Transport
    • DoT for zone transfers
    • TCP parameters - especially for outgoing connections (see the sysctl examples after this list)
      • sysctl net.ipv4.ip_local_port_range
      • sysctl net.ipv4.tcp_max_tw_buckets
      • sysctl net.ipv4.tcp_tw_reuse
    • see the Resolver section above if relevant
  • Number of zones
  • Content/size of the zones
    • DNSSEC signed zones - NSEC parameters
  • Update mechanisms
    • UPDATE
    • AXFR
    • IXFR
    • rndc reload
    • DNSSEC resigning
  • Journal configuration
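
A sketch of inspecting and temporarily adjusting the TCP-related sysctls listed above; the values shown are illustrative only, not recommendations:

  # current values
  sysctl net.ipv4.ip_local_port_range net.ipv4.tcp_max_tw_buckets net.ipv4.tcp_tw_reuse
  # widen the ephemeral port range for many outgoing connections
  sysctl -w net.ipv4.ip_local_port_range="1024 65535"
  # allow reuse of TIME-WAIT sockets for new outgoing connections
  sysctl -w net.ipv4.tcp_tw_reuse=1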

Typical setups

Test primary and secondaries for each scenario. Example values as known on 2024-09-25.

  • TLD
    • a couple of large zones
    • DNSSEC signed
    • infrequent but large updates (often DNSSEC resigning)
    • mostly delegation hits - see e.g. CZ stats
  • DNS hosting
    • lots of mostly small zones
      • possibly unsigned or signed, depending on the users
      • wide range in # of zones, roughly 100 - 2M
    • most zones static
    • small but frequent-ish updates
    • reconfig possibilities (see the example commands after this list)
      • catalog zone
      • rndc addzone
      • file change & rndc reconfig
  • root
    • small number of small zones (root, arpa, etc.)
      • NSEC signed
    • read only
    • lots of NXDOMAIN traffic, see e.g. ITHI stats
    • see RSSAC 002 stats
    • roughly 70 % of queries have DO=1, see e.g. ITHI stats
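
A rough sketch of the reconfig mechanisms mentioned under "DNS hosting"; the zone name and file name are placeholders, and rndc addzone requires allow-new-zones to be enabled:

  # catalog zone: add the member zone to the catalog zone on the primary;
  # secondaries configured with that catalog pick it up automatically
  # runtime addition via rndc
  rndc addzone example.com '{ type primary; file "example.com.db"; };'
  # edit named.conf (or an included file), then reload the configuration
  rndc reconfig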

Benchmarks

  • server startup time (see the measurement sketch at the end of this list)
  • server restart/reload/reconfig/addzone/delzone times
  • memory consumption
    • also in transient states - updates etc.
  • CPU load
    • also in transient states - updates etc.
  • QPS
    • ⚠️ a measurement is only ever valid for a specific config/zones/query stream/transport etc.
    • in steady state
    • when AXFR/IXFR/UPDATE etc. is in progress
      • ⚠️ latency
    • when a management operation is in progress
      • reconfig(s)
      • statistics query
      • rndc addzone/delzone/modzone
  • impact of UPDATEs
  • impact of DNSSEC resigning
  • zone transfers
    • time to cold start secondary - no local data
    • time to restart secondary - a copy of zones available, time to finish first refreshes
    • number of secondaries handled by a single primary
    • propagation delay - from change on primary to last secondary being up-to-date
  • compare behavior with only legitimate clients vs. under attack (attack running in parallel with legitimate clients)
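
A minimal sketch for the "server startup time" benchmark, measured as time until the server answers its first query; the configuration path, probed name, and address are placeholders:

  start=$(date +%s.%N)
  named -c /etc/bind/named.conf
  # poll until the server responds at all, then report the elapsed time
  until dig @127.0.0.1 +time=1 +tries=1 example.com SOA >/dev/null 2>&1; do sleep 0.1; done
  end=$(date +%s.%N)
  echo "startup time: $(echo "$end - $start" | bc) seconds"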