System Design Article

DNS and Domain Routing

Difficulty: Easy

DNS (Domain Name System) is the phonebook of the internet. It translates human-readable domain names like 'google.com' into IP addresses that computers use to route traffic. This lesson explains how DNS resolution works, the hierarchy of DNS servers, common record types, and how DNS is used in system design for load balancing, failover, and global traffic routing.

What is DNS?

DNS (Domain Name System) is a hierarchical, distributed database that maps human-readable domain names to machine-readable IP addresses.

Without DNS, you would need to memorize IP addresses like 142.250.80.46 instead of typing google.com. DNS acts as a translation layer, converting names to numbers.
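From a program's point of view, this translation is usually a single call into the operating system's resolver. Here is a minimal Python sketch using the standard library's `socket.getaddrinfo`. The helper name `resolve_ipv4` is ours for illustration, and the addresses returned for a real domain vary by time and location.

```python
import socket

def resolve_ipv4(hostname: str) -> list[str]:
    """Return the distinct IPv4 addresses the OS resolver reports for hostname."""
    infos = socket.getaddrinfo(
        hostname, None, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, sockaddr);
    # for IPv4, sockaddr is an (ip, port) pair.
    return sorted({info[4][0] for info in infos})

# resolve_ipv4("example.com") needs network access, and its result varies.
```

Passing an IP literal such as "127.0.0.1" returns that address without any lookup, which is handy for testing the helper offline.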

Why Not Just Use IP Addresses?

Memorability: Humans remember names, not numbers.
Abstraction: A domain name can point to different IPs over time (server migration, failover) without users needing to know.
Load distribution: A single domain name can resolve to multiple IP addresses, distributing traffic across servers.
Flexibility: Services can change their infrastructure (new servers, new regions) without changing their public-facing domain.

DNS as a Distributed System

DNS itself is one of the largest distributed systems in the world. It handles trillions of queries per day across millions of servers, with no single point of failure at the protocol level. Understanding DNS is understanding distributed systems in practice.

"google.com"      --->  [DNS System]  --->  142.250.80.46
"github.com"      --->  [DNS System]  --->  140.82.121.3
"api.stripe.com"  --->  [DNS System]  --->  13.227.108.42

The DNS Hierarchy

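DNS is organized as a tree-like hierarchy of nameservers (root, TLD, authoritative), with recursive resolvers walking that tree on behalf of clients. That gives four kinds of servers: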

1. Root Nameservers

At the top of the hierarchy are 13 root nameserver clusters (labeled A through M), operated by organizations like ICANN, Verisign, and NASA. They do not know the IP for google.com directly, but they know which servers are responsible for .com.

There are only 13 logical root server addresses, but well over a thousand physical server instances behind them (using anycast routing for redundancy and performance).

2. TLD (Top-Level Domain) Nameservers

TLD servers handle top-level domains like .com, .org, .io, .dev. When asked about google.com, the root server directs you to the .com TLD server, which knows the authoritative nameservers for google.com.

Examples of TLDs:

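Generic: .com, .org, .net, .dev
Country-code: .uk, .de, .jp, .ca, .io (.io is assigned to the British Indian Ocean Territory, though it is widely used as if it were generic)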
Sponsored: .gov, .edu, .mil

3. Authoritative Nameservers

These are the servers that actually hold the DNS records for a specific domain. When you register example.com, you configure its authoritative nameservers (often provided by your domain registrar or a service like AWS Route 53 or Cloudflare).

The authoritative nameserver responds with the actual IP address (or other record types) for the requested domain.

4. Recursive Resolvers

Recursive resolvers (also called DNS resolvers or caching resolvers) are the intermediaries that do the work on behalf of clients. Your ISP runs recursive resolvers, and public ones include Google (8.8.8.8) and Cloudflare (1.1.1.1).

The resolver receives a query from your browser, walks the DNS hierarchy (root -> TLD -> authoritative), caches the result, and returns the answer.

DNS Hierarchy
                    [Root Nameservers]      (. / root)
                         /     \
                        /       \
              [.com TLD]         [.org TLD]
                /    \               \
               /      \              \
     [google.com]  [github.com]  [wikipedia.org]
     Authoritative  Authoritative  Authoritative
     Nameservers    Nameservers    Nameservers

How DNS Resolution Works

When you type www.example.com in your browser, here is the full resolution process:

Step-by-Step DNS Resolution

Browser cache: The browser checks its own DNS cache. If you visited example.com recently, the IP is already cached.
OS cache: If not in the browser cache, the operating system's DNS cache (stub resolver) is checked.
Recursive resolver: The OS sends the query to a configured recursive resolver (e.g., your ISP's resolver or 8.8.8.8). The resolver checks its own cache.
Root nameserver: On a cache miss, the resolver queries a root nameserver: "Where can I find .com?" The root responds with the IP of the .com TLD server.
TLD nameserver: The resolver queries the .com TLD server: "Where can I find example.com?" The TLD server responds with the IP of example.com's authoritative nameserver.
Authoritative nameserver: The resolver queries the authoritative nameserver: "What is the IP for www.example.com?" The authoritative server responds with the A record (e.g., 93.184.216.34).
Response cached and returned: The resolver caches the result (respecting the TTL) and returns it to the client. The browser can now establish a TCP connection to the IP.

Iterative vs Recursive Queries

Recursive: The client asks the resolver to do all the work and return the final answer. This is what your browser does.
Iterative: The resolver asks each server in the hierarchy, and each server responds with a referral to the next server. The resolver follows the chain itself.

In practice, the client-to-resolver query is recursive, and the resolver-to-nameserver queries are iterative.
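The resolution steps and the recursive/iterative split can be sketched with a toy, in-memory model. Everything here is hypothetical: the server names, the dictionaries standing in for nameservers, and the class and function names. A real resolver speaks the DNS wire protocol and handles TTLs, errors, and referrals with glue records.

```python
# Toy stand-ins for the three nameserver levels (all data is made up).
ROOT = {"com": "tld-com"}                                  # TLD -> TLD server
TLD_SERVERS = {"tld-com": {"example.com": "ns-example"}}   # domain -> authoritative server
AUTHORITATIVE = {"ns-example": {"www.example.com": "93.184.216.34"}}

def iterative_lookup(name: str) -> tuple[str, list[str]]:
    """Walk root -> TLD -> authoritative, following one referral per step."""
    labels = name.split(".")
    tld, domain = labels[-1], ".".join(labels[-2:])
    trace = []
    tld_server = ROOT[tld]                         # root refers us to the TLD server
    trace.append(f"root -> {tld_server}")
    auth_server = TLD_SERVERS[tld_server][domain]  # TLD refers us to the authoritative server
    trace.append(f"{tld_server} -> {auth_server}")
    ip = AUTHORITATIVE[auth_server][name]          # authoritative server gives the answer
    trace.append(f"{auth_server} -> {ip}")
    return ip, trace

class RecursiveResolver:
    """What the client talks to: one question in, one final answer out."""

    def __init__(self):
        self.cache = {}
        self.upstream_walks = 0  # how many times we had to walk the hierarchy

    def resolve(self, name: str) -> str:
        if name not in self.cache:
            self.upstream_walks += 1
            self.cache[name], _ = iterative_lookup(name)
        return self.cache[name]
```

Calling `resolve("www.example.com")` twice walks the hierarchy only once. The second call is answered from the resolver's cache, which is the effect the latency figures in the next section rely on.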

DNS Resolution Latency

Step	Typical Latency
Browser cache hit	0ms
OS cache hit	<1ms
Resolver cache hit	1-5ms
Full resolution (cache miss)	20-120ms
Full resolution (distant nameserver)	100-300ms

This is why DNS caching is critical for performance. Per the table above, a cold DNS lookup can add anywhere from about 20ms to 300ms to the very first request.
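To see how much caching helps on average, the per-layer latencies can be combined into an expected lookup time. The hit rates below are invented for illustration and are not measurements. Each hit rate is the chance of a hit at that layer, given the query got that far.

```python
def expected_lookup_ms(layers, miss_latency_ms):
    """Expected DNS latency when caches are checked in order.

    layers: list of (hit_rate, latency_ms) pairs, outermost cache first.
    miss_latency_ms: cost of a full resolution when every cache misses.
    """
    expected = 0.0
    p_reach = 1.0  # probability that the query reaches the current layer
    for hit_rate, latency_ms in layers:
        expected += p_reach * hit_rate * latency_ms
        p_reach *= 1 - hit_rate
    return expected + p_reach * miss_latency_ms

# Hypothetical hit rates: browser 50% (0ms), OS 50% (1ms), resolver 80% (5ms).
average = expected_lookup_ms([(0.5, 0), (0.5, 1), (0.8, 5)], miss_latency_ms=100)
```

With these assumed numbers the average works out to about 6.25ms, even though the 5% of lookups that miss every cache each pay the full 100ms.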

DNS Record Types

DNS does more than just map names to IPs. Different record types serve different purposes:

Essential Record Types

Record	Purpose	Example
A	Maps a domain to an IPv4 address	`example.com -> 93.184.216.34`
AAAA	Maps a domain to an IPv6 address	`example.com -> 2606:2800:220:1:248:1893:25c8:1946`
CNAME	Alias one domain to another domain	`www.example.com -> example.com`
MX	Mail server for the domain (with priority)	`example.com -> mail.example.com (priority 10)`
TXT	Arbitrary text (used for verification, SPF, DKIM)	`example.com -> "v=spf1 include:_spf.google.com ~all"`
NS	Delegates a domain to specific nameservers	`example.com -> ns1.cloudflare.com`
SRV	Specifies host, port, priority, and weight for a service	`_sip._tcp.example.com -> 10 60 5060 sipserver.example.com` (priority 10, weight 60, port 5060)
SOA	Start of Authority - metadata about the DNS zone	Serial number, refresh interval, TTL defaults
PTR	Reverse DNS - maps IP to domain name	`34.216.184.93.in-addr.arpa -> example.com`
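One way to picture these records is as a table keyed by name and record type. The sketch below is a simplified in-memory model, not how any particular DNS server stores its zone. The `ZONE` contents and the `lookup` helper are hypothetical.

```python
# A tiny zone: (name, record type) -> list of record values.
ZONE = {
    ("example.com", "A"): ["93.184.216.34"],
    ("example.com", "MX"): ["10 mail.example.com"],
    ("www.example.com", "CNAME"): ["example.com"],
}

def lookup(name, rtype, zone=ZONE, max_hops=8):
    """Return the records of type rtype for name, following CNAME aliases."""
    for _ in range(max_hops):
        if (name, rtype) in zone:
            return zone[(name, rtype)]
        alias = zone.get((name, "CNAME"))
        if alias is None:
            return []        # no such record and no alias to follow
        name = alias[0]      # restart the lookup at the alias target
    raise RuntimeError("CNAME chain too long")
```

Looking up the A record for `www.example.com` follows the alias to `example.com` and returns its address, which is why a CNAME lets the target's IP change without touching the alias.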

CNAME vs A Record: A Common Confusion

A record: Points directly to an IP. Use when you control the IP address.
CNAME: Points to another domain name. Use when the target IP might change (e.g., pointing to a load balancer's DNS name).

Limitation: A CNAME cannot coexist with other records at the same name. The zone apex (example.com) must always carry SOA and NS records, and usually MX records too, so a CNAME is not allowed there. This is why many DNS providers offer ALIAS or ANAME records as a workaround.
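The coexistence rule is easy to check mechanically. This is a minimal sketch with a hypothetical helper name. It ignores the DNSSEC-related record types that are permitted alongside a CNAME.

```python
def cname_conflicts(records):
    """Return names where a CNAME shares the name with another record type.

    records: iterable of (name, record_type) pairs.
    """
    types_by_name = {}
    for name, rtype in records:
        types_by_name.setdefault(name, set()).add(rtype)
    return sorted(
        name
        for name, types in types_by_name.items()
        if "CNAME" in types and len(types) > 1
    )

zone_records = [
    ("example.com", "SOA"),
    ("example.com", "NS"),
    ("example.com", "CNAME"),      # invalid: the apex already has SOA and NS
    ("www.example.com", "CNAME"),  # fine: nothing else lives at this name
]
```

Running `cname_conflicts(zone_records)` flags only `example.com`, illustrating why the apex is the usual place this limitation shows up.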

TTL (Time to Live)

Every DNS record has a TTL value (in seconds) that tells caches how long to remember the result:

Short TTL (60-300s): Fast propagation of changes, but more DNS queries (higher cost, slightly higher latency).
Long TTL (3600-86400s): Fewer queries, faster resolution from cache, but changes take longer to propagate.

Best practice: Use short TTLs (60-300s) before a migration or failover event. Use longer TTLs (3600s+) for stable records that rarely change.
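A TTL-respecting cache can be sketched in a few lines. This is an illustrative model rather than any resolver's actual implementation. The class name and the injectable `clock` parameter are our own choices, and the clock is injectable so expiry can be tested without waiting.

```python
import time

class TTLCache:
    """Remember DNS answers until their TTL runs out."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._entries = {}  # name -> (value, expires_at)

    def put(self, name, value, ttl_seconds):
        self._entries[name] = (value, self._clock() + ttl_seconds)

    def get(self, name):
        entry = self._entries.get(name)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[name]  # expired: the caller must resolve again
            return None
        return value
```

With a 300-second TTL, a record change becomes visible to this cache at most 300 seconds after the entry was stored. With 86400 seconds, it can take a full day.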

DNS for Load Balancing & Traffic Routing

DNS is not just name resolution - it is a powerful tool for directing traffic in system design.

DNS Load Balancing Strategies

1. Round-Robin DNS

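Return multiple A records for a domain. The nameserver typically rotates the order of the records from one response to the next, and most clients connect to the first address they receive.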

example.com -> 10.0.1.1
example.com -> 10.0.1.2
example.com -> 10.0.1.3

Pros: Simple, no special infrastructure.
Cons: No health checking (sends traffic to dead servers), uneven distribution due to caching, no session affinity.
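One common way to implement round-robin is for the nameserver to rotate the record order on every response, with clients typically using the first address. Here is a minimal sketch of that rotation. The class name is hypothetical.

```python
from collections import deque

class RoundRobinRecordSet:
    """Serve the same A records in a rotating order."""

    def __init__(self, addresses):
        self._addresses = deque(addresses)

    def answer(self):
        """Return all addresses, then rotate so the next answer starts one further along."""
        response = list(self._addresses)
        self._addresses.rotate(-1)
        return response

records = RoundRobinRecordSet(["10.0.1.1", "10.0.1.2", "10.0.1.3"])
first_choices = [records.answer()[0] for _ in range(4)]
```

Here `first_choices` cycles through the three servers and wraps back to `10.0.1.1`. Note that nothing in the rotation knows whether a server is alive, which is the health-checking gap listed under Cons.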

2. Weighted Routing

Assign weights to different records. 70% of responses return Server A, 30% return Server B. Use case: Canary deployments (send 5% of traffic to new version), gradual migration between data centers.
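Weighted routing amounts to a weighted random choice per query. This is a sketch under the assumption that each response returns exactly one target. The function and target names are hypothetical, and real providers may implement the selection differently.

```python
import random

def pick_weighted(targets, rng=random):
    """Choose one target in proportion to its weight.

    targets: dict mapping target name -> non-negative weight.
    rng: the random module or a random.Random instance (for reproducibility).
    """
    names = list(targets)
    weights = [targets[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]

# 70/30 split between two servers, using a seeded generator for repeatability.
rng = random.Random(42)
sample = [pick_weighted({"server-a": 70, "server-b": 30}, rng) for _ in range(10_000)]
share_a = sample.count("server-a") / len(sample)
```

Over many queries `share_a` lands close to 0.70. For a canary deployment the same function works with weights such as 95 and 5.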

3. Latency-Based Routing

Return the IP of the server with the lowest latency to the requester's location. Use case: Global services with multiple regions. A user in Tokyo gets routed to ap-northeast-1, a user in London gets eu-west-1.
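Conceptually, latency-based routing is a minimum over measured latencies. The table below is invented for illustration. A real service maintains its own measurements between networks and regions.

```python
# Hypothetical round-trip times in milliseconds, keyed by client location.
MEASURED_LATENCY_MS = {
    "tokyo": {"ap-northeast-1": 8, "eu-west-1": 230, "us-east-1": 160},
    "london": {"ap-northeast-1": 220, "eu-west-1": 12, "us-east-1": 80},
}

def lowest_latency_region(client_location: str) -> str:
    """Return the region with the smallest measured latency for this client."""
    latencies = MEASURED_LATENCY_MS[client_location]
    return min(latencies, key=latencies.get)
```

With this table, a Tokyo client is sent to `ap-northeast-1` and a London client to `eu-west-1`, matching the example in the text.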

4. Geolocation Routing

Return different IPs based on the geographic location of the resolver (which approximates the user's location). Use case: Compliance requirements (EU users must hit EU servers), localized content delivery.
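Geolocation routing, by contrast, is a fixed rule table rather than a measurement. This is a minimal sketch in which the country-to-region mapping and the default region are assumptions.

```python
# Hypothetical policy: which region serves which country.
GEO_RULES = {"DE": "eu-west-1", "FR": "eu-west-1", "JP": "ap-northeast-1"}
DEFAULT_REGION = "us-east-1"  # used when no rule matches

def region_for_country(country_code: str) -> str:
    """Map an ISO country code to the region that must serve it."""
    return GEO_RULES.get(country_code.upper(), DEFAULT_REGION)
```

Because the rule is explicit, a German user is always sent to the EU region even if another region happens to be faster, which is what a compliance requirement needs.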

5. Failover Routing

Configure a primary and secondary record. DNS returns the primary unless health checks detect it is down, then switches to secondary. Use case: Disaster recovery. Primary in us-east-1, secondary in us-west-2.
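The failover decision can be sketched as a small state machine driven by health-check results. The failure threshold of three consecutive failed checks is an assumption for illustration, as are the class and method names.

```python
class FailoverRecord:
    """Answer with the primary unless it has failed several checks in a row."""

    def __init__(self, primary, secondary, failure_threshold=3):
        self.primary = primary
        self.secondary = secondary
        self.failure_threshold = failure_threshold
        self._consecutive_failures = 0

    def report_check(self, ok: bool) -> None:
        """Record one health-check result for the primary."""
        self._consecutive_failures = 0 if ok else self._consecutive_failures + 1

    def answer(self) -> str:
        if self._consecutive_failures >= self.failure_threshold:
            return self.secondary
        return self.primary
```

Requiring several consecutive failures avoids flapping on a single dropped check. Clients still keep using a cached primary address until its TTL expires, so failover records are usually given short TTLs.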

AWS Route 53 as a Case Study

AWS Route 53 is a managed DNS service that implements all of these strategies:

Health checks: Route 53 monitors endpoints and removes unhealthy ones from DNS responses.
Alias records: Route 53's proprietary record type that works like a CNAME but can be used at the zone apex (root domain).
Traffic policies: Complex routing rules combining weighted, latency, geolocation, and failover strategies.

DNS vs Application-Level Load Balancing

Aspect	DNS Load Balancing	Application Load Balancer
Layer	DNS (name resolution, before any connection is made)	HTTP (layer 7) or TCP (layer 4)
Granularity	Per DNS query (coarse)	Per request (fine-grained)
Health checking	Slow (TTL-dependent)	Fast (real-time)
Session affinity	Not possible	Supported (sticky sessions)
Cost	Low (just DNS records)	Higher (dedicated infrastructure)
Best for	Global traffic distribution	Request-level routing within a region

DNS Failures & Security Concerns

DNS is critical infrastructure, and its failure modes and security vulnerabilities are important to understand.

Common DNS Failure Scenarios

1. DNS Provider Outage

In 2016, a DDoS attack on Dyn (a major DNS provider) took down Twitter, Netflix, GitHub, and dozens of other sites. The applications were fine, but users could not resolve their domain names.

Mitigation: Use multiple DNS providers. Configure secondary NS records pointing to a backup DNS service.

2. TTL-Related Delays

After changing a DNS record, the old IP is still cached worldwide for up to the old TTL duration. If your TTL was 86400 (24 hours), some users will hit the old server for a full day.

Mitigation: Lower TTL before planned changes. Wait for the old TTL to expire before making the change.
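The timing behind this mitigation is simple arithmetic, sketched below with hypothetical helper names. It assumes resolvers honor TTLs. Some do not, so treat the results as a lower bound and keep the old servers up a little longer.

```python
from datetime import datetime, timedelta

def earliest_safe_cutover(ttl_lowered_at: datetime, old_ttl_seconds: int) -> datetime:
    """Caches may hold the long-TTL record until this moment; change the IP after it."""
    return ttl_lowered_at + timedelta(seconds=old_ttl_seconds)

def earliest_safe_decommission(cutover_at: datetime, new_ttl_seconds: int) -> datetime:
    """After the IP change, the old address can stay cached for one more (short) TTL."""
    return cutover_at + timedelta(seconds=new_ttl_seconds)

# Example: TTL lowered from 86400s to 300s at midnight on 1 January.
lowered = datetime(2024, 1, 1, 0, 0)
cutover = earliest_safe_cutover(lowered, old_ttl_seconds=86400)
retire = earliest_safe_decommission(cutover, new_ttl_seconds=300)
```

In this example the record can be switched a day after the TTL was lowered, and the old server is needed for only five more minutes after the switch, instead of a full day.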

3. Propagation Inconsistency

Different resolvers update at different times. For a period after a DNS change, some users see the old IP and some see the new one.

Mitigation: Plan for a transition period. Keep old servers running until propagation is complete.

DNS Security Threats

DNS Spoofing / Cache Poisoning: An attacker injects false DNS records into a resolver's cache, redirecting users to malicious servers. Mitigation: DNSSEC (DNS Security Extensions) adds cryptographic signatures to DNS records, allowing resolvers to verify authenticity.

DNS Amplification Attack: Attacker sends small DNS queries with a spoofed source IP. The DNS server sends large responses to the victim's IP, amplifying the DDoS attack. Mitigation: Rate limiting on DNS servers, response rate limiting (RRL), BCP38 (ingress filtering).

DNS Hijacking: An attacker changes the DNS configuration (via compromised registrar account or BGP hijacking) to redirect a domain to their server. Mitigation: Registrar lock, two-factor authentication on registrar accounts, DNSSEC.

DNS over HTTPS (DoH) and DNS over TLS (DoT)

Traditional DNS queries are sent in plaintext, typically over UDP port 53 (with TCP as a fallback for large responses), allowing ISPs and network operators to see (and potentially modify) your DNS queries.

DoH: Encrypts DNS queries inside HTTPS requests (port 443). Adopted by Firefox and Chrome.
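DoT: Encrypts DNS queries using TLS (port 853). Because it uses a dedicated port, DoT traffic is easier for network operators to identify (and block) than DoH, which blends in with ordinary HTTPS.

Both prevent eavesdropping on and tampering with DNS queries in transit between the client and its resolver.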

DNS in System Design Interviews

DNS appears in nearly every system design discussion. Here is how to use it effectively:

When to Mention DNS

Multi-region architecture: "We will use DNS-based geolocation routing (e.g., Route 53) to direct users to the nearest data center."
CDN integration: "CloudFront distributions have a DNS name. We create a CNAME or Alias record pointing our domain to the CloudFront distribution."
Failover design: "Route 53 health checks will monitor our primary region. On failure, DNS failover routes traffic to the secondary region."
Service discovery: "Internal services register with a private DNS zone. Service A finds Service B by querying service-b.internal.example.com."

Architecture Pattern: DNS + CDN + Load Balancer

User types: www.example.com
         |
         v
[DNS Resolution] - Route 53 (geolocation routing)
         |
         v
[CDN Edge Server] - CloudFront (cache hit? serve directly)
         |
    Cache miss
         |
         v
[Load Balancer] - ALB (distributes to healthy instances)
         |
         v
[Application Servers] - EC2/ECS (process request)
         |
         v
[Database] - RDS/DynamoDB

This three-tier routing (DNS -> CDN -> LB) is a common architecture for a globally distributed web application.

Common Interview Patterns

"How would you design for 99.99% availability?" - Mention multi-region with DNS failover.
"How do you handle a region going down?" - DNS health checks detect failure, route traffic to backup region.
"How does a CDN work?" - The CDN provider gives you a DNS name. You create a CNAME to it.
"How would you migrate to a new infrastructure with zero downtime?" - Lower TTL, switch DNS records, wait for propagation, decommission old servers.

Real-World Examples

How real systems implement this in production

AWS Route 53

Route 53 is Amazon's managed DNS service, handling billions of queries daily. It supports weighted, latency-based, geolocation, and failover routing policies, with built-in health checks that automatically remove unhealthy endpoints from DNS responses.

Trade-off: Managed DNS reduces operational burden but creates vendor dependency. Route 53's Alias records solve the CNAME-at-apex limitation but are AWS-specific.

Cloudflare DNS

Cloudflare operates one of the fastest public DNS resolvers (1.1.1.1) with an average response time under 12ms globally. Their authoritative DNS service uses a massive anycast network to serve DNS responses from the nearest edge location.

Trade-off: Using Cloudflare as both DNS provider and CDN/security layer simplifies architecture but concentrates risk in a single provider.

Dyn DNS Outage (2016)

A massive DDoS attack against Dyn DNS disrupted access to Twitter, Netflix, Reddit, GitHub, and many other major sites. The applications themselves were running fine, but users could not resolve domain names to reach them.

Trade-off: This incident demonstrated that a single DNS provider can be a single point of failure even for companies with otherwise redundant infrastructure. It drove adoption of multi-provider DNS strategies.

Quick Interview Phrases

Key terms to use in your answer

DNS resolution

recursive vs iterative lookup

TTL-based caching

A record vs CNAME

GeoDNS routing

DNS propagation delay

Common Interview Questions

Questions you might be asked about this topic

How would you design a system to handle DNS failures gracefully?

Combine several layers: client-side caching that respects TTLs, fallback DNS resolvers, health checks with automatic failover, multiple DNS providers, and retry logic with exponential backoff.
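The fallback-resolver and backoff parts of that answer can be sketched as follows. The resolvers are passed in as plain callables so the logic stays independent of any DNS library. The function name, the parameters, and the choice to catch `OSError` (the base class of `socket.gaierror`) are assumptions of this sketch.

```python
import time

def resolve_with_fallback(name, resolvers, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Try each resolver in order; if all fail, back off and try the whole list again.

    resolvers: non-empty list of callables taking a hostname and returning an
    address, raising OSError on failure.
    """
    if attempts < 1 or not resolvers:
        raise ValueError("need at least one attempt and one resolver")
    last_error = None
    for attempt in range(attempts):
        for resolve in resolvers:
            try:
                return resolve(name)
            except OSError as exc:
                last_error = exc  # remember the failure and try the next resolver
        if attempt < attempts - 1:
            sleep(base_delay * 2 ** attempt)  # delays of 0.1s, 0.2s, 0.4s, ...
    raise last_error
```

A caller would typically wrap this in a TTL-respecting cache, so that a brief resolver outage is invisible for names that were looked up recently.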

How does GeoDNS help with latency?

What happens if a DNS provider goes down?

How do CDNs use DNS?

Interview Tips

How to discuss this topic effectively

In system design interviews, always include DNS as the first step in your request flow. 'The client resolves our domain via DNS, which returns the IP of our load balancer or CDN.' This shows you think about the complete request path.

When designing multi-region systems, mention DNS-based routing (latency, geolocation, or failover) as the mechanism for directing users to the right region. Be specific about which strategy and why.

Know the TTL trade-off cold: short TTLs give you faster failover but more DNS query volume; long TTLs reduce query load but slow down propagation of changes. In an interview, suggest starting with 300s TTL as a reasonable default.

If asked about zero-downtime migration, outline the DNS strategy: lower TTL days before the migration, switch the DNS record, wait for the old TTL to expire, then decommission old infrastructure.

Mention that DNS resolution can be a noticeable share of latency in cold-start scenarios. Pre-warming DNS caches or using DNS prefetching (<link rel='dns-prefetch'>) can reduce perceived latency for users.

Common Mistakes

Pitfalls to avoid in interviews

Assuming DNS changes propagate instantly

DNS changes take time to propagate due to caching at every level (browser, OS, resolver). The propagation time depends on the old record's TTL. If the TTL was 86400 seconds (24 hours), some resolvers will serve the old IP for up to a full day.

Using CNAME records at the zone apex (root domain)

The DNS specification prohibits CNAME records at the zone apex (e.g., example.com) because CNAMEs cannot coexist with other record types (like MX or NS). Use your DNS provider's ALIAS/ANAME record or an A record instead.

Relying on DNS round-robin as a production load balancer

DNS round-robin has no health checking - it will happily send traffic to a dead server. It also distributes unevenly because resolvers cache responses. Use it for basic distribution, but pair it with a proper load balancer for production traffic.

Ignoring DNS in availability calculations

If your DNS provider goes down, your entire service is unreachable even if all servers are healthy. Include DNS in your availability design: use multiple DNS providers, monitor DNS health, and test failover regularly.
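A quick calculation shows why DNS belongs in the availability budget. The availability figures below are hypothetical, and the formulas assume that components fail independently, which real outages do not always respect.

```python
import math

def serial_availability(*components):
    """Every component must be up: multiply the availabilities."""
    return math.prod(components)

def parallel_availability(*replicas):
    """At least one replica must be up: one minus the chance that all are down."""
    return 1 - math.prod(1 - a for a in replicas)

# One DNS provider at 99.9% in front of an application at 99.99%.
single_provider = serial_availability(0.999, 0.9999)

# Two independent DNS providers at 99.9% each, in front of the same application.
dual_dns = parallel_availability(0.999, 0.999)
dual_provider = serial_availability(dual_dns, 0.9999)
```

With these numbers, a single 99.9% DNS provider drags a 99.99% application down to roughly 99.89% end to end, while two independent providers bring the total back to about 99.99%.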
