System Design Article
DNS and Domain Routing
Difficulty: Easy
DNS (Domain Name System) is the phonebook of the internet. It translates human-readable domain names like 'google.com' into IP addresses that computers use to route traffic. This lesson explains how DNS resolution works, the hierarchy of DNS servers, common record types, and how DNS is used in system design for load balancing, failover, and global traffic routing.
What is DNS?
DNS (Domain Name System) is a hierarchical, distributed database that maps human-readable domain names to machine-readable IP addresses.
Without DNS, you would need to memorize IP addresses like 142.250.80.46 instead of typing google.com. DNS acts as a translation layer, converting names to numbers.
Why Not Just Use IP Addresses?
- Memorability: Humans remember names, not numbers.
- Abstraction: A domain name can point to different IPs over time (server migration, failover) without users needing to know.
- Load distribution: A single domain name can resolve to multiple IP addresses, distributing traffic across servers.
- Flexibility: Services can change their infrastructure (new servers, new regions) without changing their public-facing domain.
DNS as a Distributed System
DNS itself is one of the largest distributed systems in the world. It handles trillions of queries per day across millions of servers, with no single point of failure at the protocol level. Understanding DNS is understanding distributed systems in practice.
"google.com" ---> [DNS System] ---> 142.250.80.46
"github.com" ---> [DNS System] ---> 140.82.121.3
"api.stripe.com" -> [DNS System] ---> 13.227.108.42
The DNS Hierarchy
DNS is organized as a tree-like hierarchy with four levels:
1. Root Nameservers
At the top of the hierarchy are 13 root nameserver clusters (labeled A through M), operated by organizations like ICANN, Verisign, and NASA. They do not know the IP for google.com directly, but they know which servers are responsible for .com.
There are only 13 logical root server addresses, but well over a thousand physical server instances behind them (using anycast routing for redundancy and performance).
2. TLD (Top-Level Domain) Nameservers
TLD servers handle top-level domains like .com, .org, .io, .dev. When asked about google.com, the root server directs you to the .com TLD server, which knows the authoritative nameservers for google.com.
Examples of TLDs:
- Generic: .com, .org, .net
- Country-code: .uk, .de, .jp, .ca, .io (a country-code TLD that is widely used as if it were generic)
- Sponsored: .gov, .edu, .mil
3. Authoritative Nameservers
These are the servers that actually hold the DNS records for a specific domain. When you register example.com, you configure its authoritative nameservers (often provided by your domain registrar or a service like AWS Route 53 or Cloudflare).
The authoritative nameserver responds with the actual IP address (or other record types) for the requested domain.
4. Recursive Resolvers
Recursive resolvers (also called DNS resolvers or caching resolvers) are the intermediaries that do the work on behalf of clients. Your ISP runs recursive resolvers, and public ones include Google (8.8.8.8) and Cloudflare (1.1.1.1).
The resolver receives a query from your browser, walks the DNS hierarchy (root -> TLD -> authoritative), caches the result, and returns the answer.
DNS Hierarchy
                 [Root Nameservers] (. / root)
                    /            \
                   /              \
            [.com TLD]          [.org TLD]
             /      \                \
            /        \                \
   [google.com]  [github.com]   [wikipedia.org]
   Authoritative Authoritative  Authoritative
   Nameservers   Nameservers    Nameservers
How DNS Resolution Works
When you type www.example.com in your browser, here is the full resolution process:
Step-by-Step DNS Resolution
1. Browser cache: The browser checks its own DNS cache. If you visited example.com recently, the IP is already cached.
2. OS cache: If not in the browser cache, the operating system's DNS cache (stub resolver) is checked.
3. Recursive resolver: The OS sends the query to a configured recursive resolver (e.g., your ISP's resolver or 8.8.8.8). The resolver checks its own cache.
4. Root nameserver: On a cache miss, the resolver queries a root nameserver: "Where can I find .com?" The root responds with the IP of the .com TLD server.
5. TLD nameserver: The resolver queries the .com TLD server: "Where can I find example.com?" The TLD server responds with the IP of example.com's authoritative nameserver.
6. Authoritative nameserver: The resolver queries the authoritative nameserver: "What is the IP for www.example.com?" The authoritative server responds with the A record (e.g., 93.184.216.34).
7. Response cached and returned: The resolver caches the result (respecting the TTL) and returns it to the client. The browser can now establish a TCP connection to the IP.
Iterative vs Recursive Queries
- Recursive: The client asks the resolver to do all the work and return the final answer. This is what your browser does.
- Iterative: The resolver asks each server in the hierarchy, and each server responds with a referral to the next server. The resolver follows the chain itself.
In practice, the client-to-resolver query is recursive, and the resolver-to-nameserver queries are iterative.
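The split between the recursive client query and the iterative upstream queries can be sketched with a toy, in-memory model. This is only an illustration: the `ROOT`, `TLD`, and `AUTHORITATIVE` dictionaries and the server labels such as `tld-com` are hypothetical stand-ins for real nameservers, and TTL handling is left out.

```python
# Toy model of the DNS hierarchy. All server labels are hypothetical.
ROOT = {"com": "tld-com"}                          # root knows the TLD servers
TLD = {"tld-com": {"example.com": "ns-example"}}   # TLD knows authoritative servers
AUTHORITATIVE = {"ns-example": {"www.example.com": "93.184.216.34"}}


class RecursiveResolver:
    """Answers a client's recursive query by issuing iterative queries."""

    def __init__(self):
        self.cache = {}
        self.upstream_queries = 0

    def resolve(self, name):
        if name in self.cache:  # resolver cache hit: no upstream traffic
            return self.cache[name]
        labels = name.split(".")
        # Each step returns a referral that the resolver follows itself.
        tld_server = self._ask(ROOT, labels[-1])
        auth_server = self._ask(TLD[tld_server], ".".join(labels[-2:]))
        ip = self._ask(AUTHORITATIVE[auth_server], name)
        self.cache[name] = ip  # a real resolver would also store the TTL
        return ip

    def _ask(self, server, question):
        self.upstream_queries += 1
        return server[question]


resolver = RecursiveResolver()
first = resolver.resolve("www.example.com")   # walks root -> TLD -> authoritative
second = resolver.resolve("www.example.com")  # served from the resolver cache
print(first, resolver.upstream_queries)
```

The client makes one call to `resolve` and receives a final answer, while the resolver makes three upstream queries on the first lookup and none on the second.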
DNS Resolution Latency
| Step | Typical Latency |
|---|---|
| Browser cache hit | 0ms |
| OS cache hit | <1ms |
| Resolver cache hit | 1-5ms |
| Full resolution (cache miss) | 20-120ms |
| Full resolution (distant nameserver) | 100-300ms |
This is why DNS caching is critical for performance. A cold DNS lookup typically adds 20-120ms to the very first request, and can add up to 300ms when the nameservers are distant.
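A short worked calculation shows how much the cache hit ratio matters. The 1ms and 100ms figures are taken from the ranges in the table above; the 90% hit ratio is an assumed value chosen only for illustration.

```python
def expected_lookup_ms(hit_ratio, hit_ms, miss_ms):
    """Average lookup latency for a given cache hit ratio (0.0 to 1.0)."""
    return hit_ratio * hit_ms + (1 - hit_ratio) * miss_ms


# Assumed: 90% of lookups hit the resolver cache (1ms), 10% miss (100ms).
average = expected_lookup_ms(hit_ratio=0.9, hit_ms=1, miss_ms=100)
print(round(average, 1))
```

Under these assumptions the average is dominated by the rare misses, which is why even a modest drop in hit ratio is noticeable.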
DNS Record Types
DNS does more than just map names to IPs. Different record types serve different purposes:
Essential Record Types
| Record | Purpose | Example |
|---|---|---|
| A | Maps a domain to an IPv4 address | example.com -> 93.184.216.34 |
| AAAA | Maps a domain to an IPv6 address | example.com -> 2606:2800:220:1:248:1893:25c8:1946 |
| CNAME | Alias one domain to another domain | www.example.com -> example.com |
| MX | Mail server for the domain (with priority) | example.com -> mail.example.com (priority 10) |
| TXT | Arbitrary text (used for verification, SPF, DKIM) | example.com -> "v=spf1 include:_spf.google.com ~all" |
| NS | Delegates a domain to specific nameservers | example.com -> ns1.cloudflare.com |
| SRV | Specifies host, port, priority, and weight for a service | _sip._tcp.example.com -> 10 60 5060 sipserver.example.com (priority, weight, port, target) |
| SOA | Start of Authority - metadata about the DNS zone | Serial number, refresh interval, TTL defaults |
| PTR | Reverse DNS - maps IP to domain name | 34.216.184.93.in-addr.arpa -> example.com |
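The PTR naming scheme in the last row can be derived mechanically: the octets of the IPv4 address are reversed and suffixed with `in-addr.arpa`. A minimal sketch using Python's standard `ipaddress` module (the helper name `ptr_name` is made up for this example):

```python
import ipaddress


def ptr_name(ip):
    """Return the reverse-DNS name under which the PTR record for ip is stored."""
    return ipaddress.ip_address(ip).reverse_pointer


print(ptr_name("93.184.216.34"))  # 34.216.184.93.in-addr.arpa
```

The same attribute works for IPv6 addresses, where the name is built from reversed hex digits under `ip6.arpa`.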
CNAME vs A Record: A Common Confusion
- A record: Points directly to an IP. Use when you control the IP address.
- CNAME: Points to another domain name. Use when the target IP might change (e.g., pointing to a load balancer's DNS name).
Limitation: A CNAME cannot coexist with other records at the same name. The root domain (example.com) always carries NS and SOA records, and usually MX records as well, so it cannot hold a CNAME. This is why many DNS providers offer ALIAS or ANAME records as a workaround.
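The coexistence rule is easy to check mechanically. The sketch below assumes a zone is represented as a list of `(name, type, value)` tuples, which is a simplification made up for this example; it also ignores DNSSEC record types, which are allowed alongside a CNAME.

```python
from collections import defaultdict


def cname_conflicts(records):
    """Return the names where a CNAME shares the name with another record type."""
    types_by_name = defaultdict(set)
    for name, rtype, _value in records:
        types_by_name[name].add(rtype)
    return sorted(
        name
        for name, types in types_by_name.items()
        if "CNAME" in types and len(types) > 1
    )


# Hypothetical zone: the apex CNAME collides with the MX record.
zone = [
    ("example.com", "CNAME", "lb.example.net"),
    ("example.com", "MX", "mail.example.com"),
    ("www.example.com", "CNAME", "example.com"),
]
print(cname_conflicts(zone))  # ['example.com']
```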
TTL (Time to Live)
Every DNS record has a TTL value (in seconds) that tells caches how long to remember the result:
- Short TTL (60-300s): Fast propagation of changes, but more DNS queries (higher cost, slightly higher latency).
- Long TTL (3600-86400s): Fewer queries, faster resolution from cache, but changes take longer to propagate.
Best practice: Use short TTLs (60-300s) before a migration or failover event. Use longer TTLs (3600s+) for stable records that rarely change.
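To make the TTL behaviour concrete, here is a minimal sketch of a cache that forgets entries once their TTL has elapsed. The `TtlCache` class is illustrative and not how any particular resolver is implemented; the clock is injectable so that expiry can be demonstrated without waiting.

```python
import time


class TtlCache:
    """Stores values until their TTL (in seconds) has elapsed."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._entries = {}  # name -> (value, expires_at)

    def put(self, name, value, ttl_seconds):
        self._entries[name] = (value, self._clock() + ttl_seconds)

    def get(self, name):
        entry = self._entries.get(name)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:  # expired: caller must re-resolve
            del self._entries[name]
            return None
        return value


# Fake clock so the example runs instantly.
now = [0.0]
cache = TtlCache(clock=lambda: now[0])
cache.put("example.com", "93.184.216.34", ttl_seconds=300)
now[0] = 299.0
still_cached = cache.get("example.com")   # within the TTL
now[0] = 300.0
after_expiry = cache.get("example.com")   # TTL elapsed, entry dropped
print(still_cached, after_expiry)
```

A record change made at time 0 would not be seen by this cache until second 300, which is the propagation delay the TTL trade-off describes.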
DNS for Load Balancing & Traffic Routing
DNS is not just name resolution - it is a powerful tool for directing traffic in system design.
DNS Load Balancing Strategies
1. Round-Robin DNS
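Return multiple A records for a domain. The DNS server rotates the order of the records between responses, and clients typically connect to the first address they receive.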
example.com -> 10.0.1.1
example.com -> 10.0.1.2
example.com -> 10.0.1.3
Pros: Simple, no special infrastructure.
Cons: No health checking (sends traffic to dead servers), uneven distribution due to caching, no session affinity.
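A rough sketch of the server side of round-robin DNS, assuming the server simply rotates its record list after every response. The `RoundRobinZone` class is a hypothetical name for this example, and real servers may randomize the order instead of rotating it.

```python
from collections import deque


class RoundRobinZone:
    """Returns all addresses, rotating the order after each response."""

    def __init__(self, addresses):
        self._addresses = deque(addresses)

    def answer(self):
        response = list(self._addresses)
        self._addresses.rotate(-1)  # next response starts with the next address
        return response


zone = RoundRobinZone(["10.0.1.1", "10.0.1.2", "10.0.1.3"])
responses = [zone.answer() for _ in range(4)]
for response in responses:
    print(response[0])  # the address a typical client would try first
```

Note that nothing in this loop knows whether `10.0.1.2` is alive, which is the missing health checking listed among the cons.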
2. Weighted Routing
Assign weights to different records. For example, 70% of responses return Server A and 30% return Server B.
Use case: Canary deployments (send 5% of traffic to new version), gradual migration between data centers.
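Weighted routing amounts to a weighted random choice per DNS response. The sketch below uses the 70/30 split from the text; the target names `server-a` and `server-b` and the fixed random seed are assumptions made so the example is repeatable.

```python
import random
from collections import Counter


def pick_weighted(targets, rng):
    """Pick one target name; targets maps name -> relative weight."""
    names = list(targets)
    weights = [targets[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


rng = random.Random(42)  # seeded only to make the demo repeatable
weights = {"server-a": 70, "server-b": 30}
counts = Counter(pick_weighted(weights, rng) for _ in range(10_000))
share_a = counts["server-a"] / 10_000
print(f"server-a share: {share_a:.2%}")
```

The split holds per DNS response, not per user request: resolver caching means the traffic actually observed can drift from the configured weights.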
3. Latency-Based Routing
Return the IP of the server with the lowest latency to the requester's location.
Use case: Global services with multiple regions. A user in Tokyo gets routed to ap-northeast-1, a user in London gets eu-west-1.
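In its simplest form, latency-based routing is a minimum over measured latencies. The sketch assumes the DNS service already has a latency estimate from the requester's network to each region; the IP addresses (from a documentation range) and the latency figures are made up.

```python
# Hypothetical region -> endpoint IP mapping.
REGION_IPS = {
    "ap-northeast-1": "203.0.113.10",
    "eu-west-1": "203.0.113.20",
}


def lowest_latency_ip(latency_ms_by_region):
    """Return the endpoint IP of the region with the smallest latency."""
    region = min(latency_ms_by_region, key=latency_ms_by_region.get)
    return REGION_IPS[region]


# Assumed measurements for a requester in Tokyo and one in London.
tokyo_answer = lowest_latency_ip({"ap-northeast-1": 8, "eu-west-1": 230})
london_answer = lowest_latency_ip({"ap-northeast-1": 240, "eu-west-1": 12})
print(tokyo_answer, london_answer)
```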
4. Geolocation Routing
Return different IPs based on the geographic location of the resolver (which approximates the user's location).
Use case: Compliance requirements (EU users must hit EU servers), localized content delivery.
5. Failover Routing
Configure a primary and secondary record. DNS returns the primary unless health checks detect it is down, then switches to secondary.
Use case: Disaster recovery. Primary in us-east-1, secondary in us-west-2.
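The failover decision can be sketched as a small state machine driven by health-check results. The `FailoverRecord` class and its threshold of three consecutive failures are assumptions for illustration; managed DNS services expose their own thresholds and check intervals.

```python
class FailoverRecord:
    """Answers with the primary IP unless it has failed several checks in a row."""

    def __init__(self, primary, secondary, failure_threshold=3):
        self.primary = primary
        self.secondary = secondary
        self.failure_threshold = failure_threshold
        self._consecutive_failures = 0

    def report_health_check(self, primary_ok):
        # A single success resets the counter and restores the primary.
        if primary_ok:
            self._consecutive_failures = 0
        else:
            self._consecutive_failures += 1

    def answer(self):
        if self._consecutive_failures >= self.failure_threshold:
            return self.secondary
        return self.primary


record = FailoverRecord("198.51.100.1", "198.51.100.2")
answers = []
for check_result in [False, False, False, True]:
    record.report_health_check(check_result)
    answers.append(record.answer())
print(answers)
```

Requiring several failed checks avoids flapping on a single dropped probe, at the cost of a slower reaction; clients still hold the old answer until its TTL expires.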
AWS Route 53 as a Case Study
AWS Route 53 is a managed DNS service that implements all of these strategies:
- Health checks: Route 53 monitors endpoints and removes unhealthy ones from DNS responses.
- Alias records: Route 53's proprietary record type that works like a CNAME but can be used at the zone apex (root domain).
- Traffic policies: Complex routing rules combining weighted, latency, geolocation, and failover strategies.
DNS vs Application-Level Load Balancing
| Aspect | DNS Load Balancing | Application Load Balancer |
|---|---|---|
| Layer | DNS (at name-resolution time, before a connection exists) | HTTP/TCP (layer 4/7) |
| Granularity | Per DNS query (coarse) | Per request (fine-grained) |
| Health checking | Slow (TTL-dependent) | Fast (real-time) |
| Session affinity | Not possible | Supported (sticky sessions) |
| Cost | Low (just DNS records) | Higher (dedicated infrastructure) |
| Best for | Global traffic distribution | Request-level routing within a region |
DNS Failures & Security Concerns
DNS is critical infrastructure, and its failure modes and security vulnerabilities are important to understand.
Common DNS Failure Scenarios
1. DNS Provider Outage
In 2016, a DDoS attack on Dyn (a major DNS provider) took down Twitter, Netflix, GitHub, and dozens of other sites. The applications were fine - but users could not resolve their domain names.
Mitigation: Use multiple DNS providers. Configure secondary NS records pointing to a backup DNS service.
2. TTL-Related Delays
After changing a DNS record, the old IP is still cached worldwide for up to the old TTL duration. If your TTL was 86400 (24 hours), some users will hit the old server for a full day.
Mitigation: Lower TTL before planned changes. Wait for the old TTL to expire before making the change.
3. Propagation Inconsistency
Different resolvers update at different times. For a period after a DNS change, some users see the old IP and some see the new one.
Mitigation: Plan for a transition period. Keep old servers running until propagation is complete.
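The timing in the TTL and propagation mitigations above can be worked out with simple date arithmetic. The function names and the example dates are hypothetical, and the sketch assumes resolvers honour TTLs, which not all do.

```python
from datetime import datetime, timedelta


def earliest_safe_cutover(ttl_lowered_at, old_ttl_seconds):
    """Caches may hold the old long-TTL record until this moment."""
    return ttl_lowered_at + timedelta(seconds=old_ttl_seconds)


def earliest_decommission(cutover_at, new_ttl_seconds):
    """After the switch, caches may hold the old IP for one more (short) TTL."""
    return cutover_at + timedelta(seconds=new_ttl_seconds)


# Example: TTL lowered from 86400s to 300s at midnight on an arbitrary date.
lowered_at = datetime(2024, 1, 1, 0, 0)
cutover = earliest_safe_cutover(lowered_at, old_ttl_seconds=86400)
decommission = earliest_decommission(cutover, new_ttl_seconds=300)
print(cutover, decommission)
```

In practice it is prudent to keep the old servers running well past the computed decommission time, since some resolvers cache longer than the TTL allows.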
DNS Security Threats
DNS Spoofing / Cache Poisoning: An attacker injects false DNS records into a resolver's cache, redirecting users to malicious servers. Mitigation: DNSSEC (DNS Security Extensions) adds cryptographic signatures to DNS records, allowing resolvers to verify authenticity.
DNS Amplification Attack: Attacker sends small DNS queries with a spoofed source IP. The DNS server sends large responses to the victim's IP, amplifying the DDoS attack. Mitigation: Rate limiting on DNS servers, response rate limiting (RRL), BCP38 (ingress filtering).
DNS Hijacking: An attacker changes the DNS configuration (via compromised registrar account or BGP hijacking) to redirect a domain to their server. Mitigation: Registrar lock, two-factor authentication on registrar accounts, DNSSEC.
DNS over HTTPS (DoH) and DNS over TLS (DoT)
Traditional DNS queries are sent in plaintext, usually over UDP port 53 (with TCP port 53 as a fallback for large responses), allowing ISPs and network operators to see (and potentially modify) your DNS queries.
- DoH: Encrypts DNS queries inside HTTPS requests (port 443). Adopted by Firefox and Chrome.
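- DoT: Encrypts DNS queries using TLS (port 853). Because it uses a dedicated port, DoT traffic is easier for network operators to identify (and block) than DoH, which blends in with ordinary HTTPS.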
Both prevent eavesdropping and tampering of DNS queries in transit.
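To show what a DoH request carries, the sketch below builds a DNS query in wire format and wraps it in the GET form defined by RFC 8484, where the message is base64url-encoded without padding in a `dns` parameter. The endpoint `https://dns.example/dns-query` is a placeholder, not a real resolver, and the code only constructs the URL without sending anything.

```python
import base64
import struct


def build_query(name, qtype=1):
    """Encode a single-question DNS query (qtype 1 = A record) in wire format."""
    # Header: ID 0 (recommended for DoH), RD flag set, one question.
    header = struct.pack("!HHHHHH", 0, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    question = qname + struct.pack("!HH", qtype, 1)  # QCLASS 1 = IN
    return header + question


def doh_get_url(endpoint, name):
    """Build an RFC 8484 GET URL for an A-record lookup of name."""
    encoded = base64.urlsafe_b64encode(build_query(name)).rstrip(b"=")
    return f"{endpoint}?dns={encoded.decode('ascii')}"


url = doh_get_url("https://dns.example/dns-query", "www.example.com")
print(url)
```

To a network observer this is just an HTTPS request on port 443; the domain being looked up sits inside the encrypted payload.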
DNS in System Design Interviews
DNS appears in nearly every system design discussion. Here is how to use it effectively:
When to Mention DNS
- Multi-region architecture: "We will use DNS-based geolocation routing (e.g., Route 53) to direct users to the nearest data center."
- CDN integration: "CloudFront distributions have a DNS name. We create a CNAME or Alias record pointing our domain to the CloudFront distribution."
- Failover design: "Route 53 health checks will monitor our primary region. On failure, DNS failover routes traffic to the secondary region."
- Service discovery: "Internal services register with a private DNS zone. Service A finds Service B by querying service-b.internal.example.com."
Architecture Pattern: DNS + CDN + Load Balancer
User types: www.example.com
|
v
[DNS Resolution] - Route 53 (geolocation routing)
|
v
[CDN Edge Server] - CloudFront (cache hit? serve directly)
|
Cache miss
|
v
[Load Balancer] - ALB (distributes to healthy instances)
|
v
[Application Servers] - EC2/ECS (process request)
|
v
[Database] - RDS/DynamoDB
This three-tier routing (DNS -> CDN -> LB) is the standard architecture for a globally distributed web application.
Common Interview Patterns
- "How would you design for 99.99% availability?" - Mention multi-region with DNS failover.
- "How do you handle a region going down?" - DNS health checks detect failure, route traffic to backup region.
- "How does a CDN work?" - The CDN provider gives you a DNS name. You create a CNAME to it.
- "How would you migrate to a new infrastructure with zero downtime?" - Lower TTL, switch DNS records, wait for propagation, decommission old servers.
Real-World Examples
How real systems implement this in production
Route 53 is Amazon's managed DNS service, handling billions of queries daily. It supports weighted, latency-based, geolocation, and failover routing policies, with built-in health checks that automatically remove unhealthy endpoints from DNS responses.
Trade-off: Managed DNS reduces operational burden but creates vendor dependency. Route 53's Alias records solve the CNAME-at-apex limitation but are AWS-specific.
Cloudflare operates one of the fastest public DNS resolvers (1.1.1.1) with an average response time under 12ms globally. Their authoritative DNS service uses a massive anycast network to serve DNS responses from the nearest edge location.
Trade-off: Using Cloudflare as both DNS provider and CDN/security layer simplifies architecture but concentrates risk in a single provider.
In 2016, a massive DDoS attack against Dyn DNS disrupted access to Twitter, Netflix, Reddit, GitHub, and many other major sites. The applications themselves were running fine, but users could not resolve domain names to reach them.
Trade-off: This incident demonstrated that a single DNS provider can be a single point of failure even for companies with otherwise redundant infrastructure. It drove adoption of multi-provider DNS strategies.
Common Interview Questions
Questions you might be asked about this topic
Client-side caching with TTL, fallback DNS resolvers, health checks with automatic failover, use multiple DNS providers, implement retry logic with exponential backoff.
GeoDNS returns different IP addresses based on the client's geographic location, routing them to the nearest data center. Reduces round-trip time. Mention anycast as an alternative.
Reference the 2016 Dyn attack. Mitigations: multiple DNS providers, longer TTLs for critical records, client-side caching, hardcoded fallback IPs for critical services.
CNAME records point to CDN's DNS, which uses GeoDNS/anycast to return the nearest edge server IP. Short TTLs allow dynamic routing based on load and availability.
Interview Tips
How to discuss this topic effectively
In system design interviews, always include DNS as the first step in your request flow. 'The client resolves our domain via DNS, which returns the IP of our load balancer or CDN.' This shows you think about the complete request path.
When designing multi-region systems, mention DNS-based routing (latency, geolocation, or failover) as the mechanism for directing users to the right region. Be specific about which strategy and why.
Know the TTL trade-off cold: short TTLs give you faster failover but more DNS query volume; long TTLs reduce query load but slow down propagation of changes. In an interview, suggest starting with 300s TTL as a reasonable default.
If asked about zero-downtime migration, outline the DNS strategy: lower TTL days before the migration, switch the DNS record, wait for the old TTL to expire, then decommission old infrastructure.
Mention that DNS is often the bottleneck in cold-start scenarios. Pre-warming DNS caches or using DNS prefetching (<link rel='dns-prefetch'>) can reduce perceived latency for users.
Common Mistakes
Pitfalls to avoid in interviews
Assuming DNS changes propagate instantly
DNS changes take time to propagate due to caching at every level (browser, OS, resolver). The propagation time depends on the old record's TTL. If the TTL was 86400 seconds (24 hours), some resolvers will serve the old IP for up to a full day.
Using CNAME records at the zone apex (root domain)
The DNS specification prohibits CNAME records at the zone apex (e.g., example.com) because CNAMEs cannot coexist with other record types (like MX or NS). Use your DNS provider's ALIAS/ANAME record or an A record instead.
Relying on DNS round-robin as a production load balancer
DNS round-robin has no health checking - it will happily send traffic to a dead server. It also distributes unevenly because resolvers cache responses. Use it for basic distribution, but pair it with a proper load balancer for production traffic.
Ignoring DNS in availability calculations
If your DNS provider goes down, your entire service is unreachable even if all servers are healthy. Include DNS in your availability design: use multiple DNS providers, monitor DNS health, and test failover regularly.
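A back-of-the-envelope calculation makes the point. The availability figures below are assumed for illustration, and the formulas treat failures as independent, which is optimistic for real providers.

```python
def serial(*availabilities):
    """All components must be up (e.g., DNS and the application)."""
    result = 1.0
    for availability in availabilities:
        result *= availability
    return result


def redundant(*availabilities):
    """At least one component must be up (e.g., two DNS providers)."""
    all_down = 1.0
    for availability in availabilities:
        all_down *= 1.0 - availability
    return 1.0 - all_down


app = 0.9999          # assumed: application tier at "four nines"
dns_provider = 0.999  # assumed: one DNS provider at "three nines"

single_dns = serial(dns_provider, app)
dual_dns = serial(redundant(dns_provider, dns_provider), app)
print(f"one provider:  {single_dns:.6f}")
print(f"two providers: {dual_dns:.6f}")
```

With a single provider at the assumed figure, the service as a whole cannot reach 99.99% no matter how reliable the application is; adding a second provider moves the end-to-end number back close to the application's own availability.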
