Before designing anything, understand the properties you're optimising for and estimate the scale you're working at. Numbers ground your decisions.
The Core Properties
Availability — % uptime. 99.9% ("three nines") = 8.77 hrs downtime/year. 99.99% = 52 min/year. 99.999% = 5.26 min/year. SLAs are commitments; SLOs are internal targets; SLIs are actual measurements.
Reliability — performs correctly, not just responds. A service returning stale data is available but not reliable. MTBF (Mean Time Between Failures) and MTTR (Mean Time To Recovery) are key metrics.
Scalability — handles growing load without degrading. Define your load parameters first: RPS, concurrent users, data volume, read/write ratio. Twitter handles ~6k tweets/sec on average, 150k at peak.
Consistency — all nodes agree on the same data. Strong consistency = every read sees the latest write. Eventual = replicas converge over time. The right model depends on the business (bank balance vs likes count).
Durability — committed data survives crashes. Achieved via WAL (Write-Ahead Logs), replication, and periodic snapshots. S3 offers 99.999999999% (11 nines) durability.
Latency & Throughput — latency = time for one request; throughput = requests/second. Often a trade-off: batching improves throughput but increases latency. Target both P50 and P99 latencies.
Numbers Every Engineer Should Know
| Operation | Latency | Intuition |
|---|---|---|
| L1 cache access | 0.5 ns | 1 step across a room |
| L2 cache access | 7 ns | 14 steps |
| RAM read | 100 ns | 200 steps across a field |
| SSD random read (4K) | 150 µs | Drive to the corner store |
| HDD seek + read | 10 ms | A walk around the block |
| Same datacenter RTT | 0.5 ms | Ping next building |
| Cross-continent (US↔EU) | 150 ms | Speed of light delay |
| Reading 1 MB from RAM | 250 µs | — |
| Reading 1 MB from SSD | 1 ms | 4× slower than RAM |
| Reading 1 MB from disk | 20 ms | 80× slower than RAM |
Back-of-Envelope: Twitter Example
Given: 300M monthly users, 50% daily active = 150M DAU. Each user reads 100 tweets/day, posts 1 tweet every 2 days.
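Carrying the arithmetic through (the ~300-byte average tweet size in the storage line is an assumption, not from the source):

```python
SECONDS_PER_DAY = 86_400

dau = 150_000_000               # 50% of 300M monthly users
reads_per_user_per_day = 100
writes_per_user_per_day = 0.5   # 1 tweet every 2 days

read_qps = dau * reads_per_user_per_day / SECONDS_PER_DAY
write_qps = dau * writes_per_user_per_day / SECONDS_PER_DAY

print(f"Read QPS:  ~{read_qps:,.0f}")                       # ~173,611
print(f"Write QPS: ~{write_qps:,.0f}")                      # ~868
print(f"Read/write ratio: ~{read_qps / write_qps:.0f}:1")   # ~200:1

# Storage: assume ~300 bytes per tweet (text + metadata)
tweets_per_day = dau * writes_per_user_per_day
storage_per_year_tb = tweets_per_day * 300 * 365 / 1e12
print(f"Tweet storage/year: ~{storage_per_year_tb:.1f} TB")  # ~8.2 TB
```

Round numbers like these are the point: ~175k read QPS tells you immediately that a cache tier is mandatory, while ~900 write QPS is well within a single primary's reach.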
Scalability is about how your system behaves as load grows. The goal is to add resources in proportion to load without redesigning the system.
Horizontal vs Vertical Scaling
| Attribute | Vertical (Scale Up) | Horizontal (Scale Out) |
|---|---|---|
| Method | Bigger machine (more CPU/RAM) | More machines of same type |
| Complexity | Simple — no distributed concerns | Complex — must handle distributed state |
| Ceiling | Hardware limit exists (~96 cores) | Nearly unlimited (Google has millions of servers) |
| Downtime | Requires restart to resize | Rolling deploys; zero downtime |
| Failure | Single point of failure | Partial failure tolerance |
| Cost curve | Superlinear — 2× RAM costs 4× | Linear — commodity hardware |
| Real example | Early Instagram: one fat Postgres | Airbnb: thousands of identical app nodes |
Load Balancing Deep Dive
A load balancer distributes traffic across a server pool. It can operate at Layer 4 (TCP) or Layer 7 (HTTP, aware of content).
Round Robin — requests go to each server in turn: A → B → C → A... Best when servers are identical and requests are homogeneous.
Weighted Round Robin — assign weights by capacity. A server with 2× RAM gets 2× the requests. Used when servers have different specs.
Least Connections — route to the server with the fewest active connections. Best for long-lived connections (WebSocket, streaming).
Consistent Hashing — hash(key) maps to a point on a ring. Each server owns an arc. Adding a server only remaps ~1/n keys. Essential for caches (Redis Cluster, Cassandra).
IP Hash — the same client always goes to the same server. Enables sticky sessions without a shared store. Risk: hot spots if one client is very active.
Random — statistically balanced without state. Good for short-lived stateless requests. AWS ALB supports a weighted-random algorithm (round robin is the default).
Rate Limiting
Protect services from abuse, ensure fair use, and prevent cascade failures. Implemented at API gateway or per-service.
| Algorithm | How it Works | Burst Allowed? | Use When |
|---|---|---|---|
| Token Bucket | Bucket holds N tokens, refills at rate R. Each request consumes 1 token. | Yes (up to bucket size) | Most APIs — allows bursty traffic |
| Leaky Bucket | Requests enter queue, processed at fixed rate. Overflow is dropped. | No — strictly smooth | Payment systems, smooth traffic |
| Fixed Window | Count per time window (e.g., 100 req/min). Resets at boundary. | 2× burst at boundary | Simple quota enforcement |
| Sliding Window Log | Store timestamp of each request. Count within rolling window. | Precise no-burst | Accurate enforcement, high memory |
| Sliding Window Counter | Blend current + previous window weighted by position. | Approximate | Best balance of accuracy vs memory |
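A minimal in-process token bucket sketch. Production limiters usually live in Redis or at the gateway so the count is shared across instances; this single-process version just shows the mechanics:

```python
import time

class TokenBucket:
    """Token-bucket rate limiter: capacity N, refills at R tokens/sec.
    A request passes if a token is available, so bursts up to N pass."""

    def __init__(self, capacity: int, refill_rate: float):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = float(capacity)        # start full
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(capacity=5, refill_rate=2.0)   # burst of 5, then 2 req/sec
results = [bucket.allow() for _ in range(7)]
print(results)   # the first 5 pass (the burst); the next 2 are rejected
```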
Auto-Scaling
Automatically add or remove servers based on metrics. Cloud providers (AWS, GCP, Azure) all offer managed auto-scaling groups.
- Reactive scaling — trigger on CPU > 70% for 5 min. Lag means brief overload before new instances are warm.
- Predictive scaling — ML predicts load (e.g., scale up at 8 AM before morning traffic). AWS now offers this natively.
- Schedule-based — scale up Friday evenings, down weekday nights. Known traffic patterns (e-commerce flash sales).
- Rule of thumb: scale out aggressively, scale in slowly (conservative scale-in avoids thrashing).
Database choice is often the most consequential decision in a system. Understand the trade-offs before committing — migrations are expensive.
SQL vs NoSQL
| Attribute | SQL (Relational) | NoSQL |
|---|---|---|
| Schema | Fixed, enforced by DB | Flexible, app-enforced |
| Transactions | Full ACID | Varies (Mongo 4.0+ has multi-doc ACID) |
| Scaling | Vertical primary; sharding complex | Designed for horizontal scale |
| Joins | Native; performant with indexes | Avoid; denormalise data |
| Query language | SQL — expressive, standardised | Varies by DB |
| Best for | Complex queries, financial, relational data | High-write, flexible schema, huge scale |
NoSQL Database Types
Key-Value — Examples: Redis, DynamoDB, Riak
Fast O(1) lookup by key. No complex queries. Best for: sessions, caches, user preferences, shopping carts.
Document — Examples: MongoDB, CouchDB, Firestore
JSON/BSON documents, flexible schema. Good for: user profiles, content management, catalogs with varying attributes.
Wide-Column — Examples: Cassandra, HBase, BigTable
Rows + dynamic columns. Optimised for time-series, write-heavy. Good for: IoT, analytics, activity feeds. Cassandra handles 1M writes/sec.
Graph — Examples: Neo4j, Neptune, JanusGraph
Nodes + edges + properties. Relationships are first-class. Good for: social graphs, fraud detection, recommendation engines.
Time-Series — Examples: InfluxDB, TimescaleDB, Prometheus
Optimised for timestamp-indexed data. Automatic downsampling and retention. Good for: metrics, monitoring, IoT sensor data.
Search — Examples: Elasticsearch, OpenSearch, Solr
Inverted index, full-text search, fuzzy matching, aggregations. Not a primary DB — sync from main DB via events.
Indexing
Without an index, every query is a full table scan O(n). With an index, it's O(log n) for B-Tree or O(1) for hash.
- B-Tree — default; great for range queries, equality, ORDER BY
- Hash — O(1) equality only; no range support
- GIN (Generalized Inverted Index) — arrays, JSONB, full-text search in Postgres
- Partial Index — index only rows matching a condition, e.g. WHERE status = 'active'
- Covering Index — includes all SELECT columns; the query never touches the heap table
Replication & Sharding Strategies
Single-Leader (primary-replica) — one primary accepts all writes. Changes replicate to read replicas, which can serve reads and offload the primary. Used by: most RDBMS, Redis Sentinel.
- Sync replication: write confirmed when replica acknowledges — no data loss, but latency doubles
- Async replication: write confirmed immediately — fast, but replica lag means potential data loss on failover
Multi-Leader — multiple nodes accept writes, each replicating to the others. Needed for multi-datacenter active-active. Used by: CouchDB, Google Docs real-time sync.
- Write conflicts must be resolved (LWW, CRDTs, custom merge)
- Complex to reason about; avoid if possible
Leaderless (Quorum) — any node accepts reads/writes. Uses quorums: write to W nodes, read from R nodes. If W + R > N, you have strong consistency.
- Cassandra default: N=3, W=1, R=1 (eventual consistency)
- Strong consistency: N=3, W=2, R=2 (quorum)
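The quorum rule as a one-line check (illustrative only):

```python
def is_strongly_consistent(n: int, w: int, r: int) -> bool:
    """Quorum reads and writes must overlap on at least one node,
    which holds exactly when W + R > N."""
    return w + r > n

print(is_strongly_consistent(3, 1, 1))   # False — Cassandra's eventual default
print(is_strongly_consistent(3, 2, 2))   # True  — quorum reads and writes
```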
Sharding (Partitioning) — split data across multiple DB instances. Each shard is an independent DB holding a subset of the data.
- Range sharding: user A-M on shard 1, N-Z on shard 2. Risk: hot shards
- Hash sharding: hash(user_id) mod N. Even distribution but no range queries
- Directory-based: lookup service maps key to shard. Flexible but adds a hop
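A quick simulation of why plain hash-mod-N sharding hurts at resharding time (the key names are hypothetical): adding one shard changes the shard of most keys.

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Hash sharding: stable hash of the key, mod the shard count."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % num_shards

keys = [f"user:{i}" for i in range(10_000)]
before = {k: shard_for(k, 4) for k in keys}
after = {k: shard_for(k, 5) for k in keys}   # add one shard: 4 → 5

moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved / len(keys):.0%} of keys moved")   # roughly 4 in 5 keys change shards
```

That mass reshuffle is the problem consistent hashing solves, as covered later in the consistent-hashing section.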
Database Selection Guide
| Need | Best Choice | Why |
|---|---|---|
| User accounts, orders, payments | PostgreSQL / MySQL | ACID, joins, complex queries |
| Session store, rate limiting | Redis | In-memory, O(1), TTL support |
| Product catalog, CMS | MongoDB | Flexible schema, nested docs |
| Social graph, recommendations | Neo4j / Neptune | Graph traversals are native |
| Write-heavy IoT / activity logs | Cassandra | 1M+ writes/sec, no single point of failure |
| Full-text search, autocomplete | Elasticsearch | Inverted index, relevance scoring |
| Metrics, monitoring dashboards | InfluxDB / Prometheus | Built-in downsampling, fast range queries |
| Data warehouse, analytics | BigQuery / Redshift / Snowflake | Columnar, massively parallel |
Caching is the single highest-leverage optimisation in distributed systems. A well-placed cache can reduce DB load by 90%+.
Cache Hierarchy
Each layer should only be needed when the layer above it misses. A well-tuned system sees 90%+ of traffic absorbed by CDN + Redis.
Cache Patterns
Cache-Aside — app checks cache → miss → reads DB → writes cache → returns. The cache only contains accessed data. Most common pattern. Risk: stale data if the DB is updated without invalidating the cache.
Read-Through — the cache is the only data source from the app's perspective. On a miss, the cache itself fetches and stores the value. Transparent to the app. Used by: DynamoDB Accelerator (DAX).
Write-Through — every write goes to cache AND DB synchronously. No stale data ever, but write latency doubles. Good for: user preferences, settings (infrequent writes, frequent reads).
Write-Back — write to cache, return immediately, flush to DB asynchronously. Low write latency. Risk: data loss if the cache crashes before the flush. Good for: likes/views counters, analytics.
Refresh-Ahead — the cache proactively refreshes entries before they expire, based on predicted access patterns. Complex but eliminates cold misses for hot items. Used by: Netflix for content metadata.
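A cache-aside sketch with plain dicts standing in for Redis and the database; the delete-on-write invalidation addresses the stale-data risk noted above. In production the cache write would also carry a TTL:

```python
db = {"user:1": {"name": "Ada"}}   # stand-in for the database
cache = {}                          # stand-in for Redis

def get_user(key: str):
    if key in cache:                # 1. check cache first
        return cache[key]
    value = db.get(key)             # 2. miss → read the DB
    if value is not None:
        cache[key] = value          # 3. populate the cache
    return value                    # 4. return to the caller

def update_user(key: str, value: dict):
    db[key] = value
    cache.pop(key, None)            # invalidate so the next read refills

print(get_user("user:1"))           # miss → DB → cached
print(get_user("user:1"))           # hit
update_user("user:1", {"name": "Ada L."})
print(get_user("user:1"))           # miss again → fresh value
```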
Cache Problems & Solutions
| Problem | Description | Real Example | Solutions |
|---|---|---|---|
| Cache Stampede | Many requests miss simultaneously after TTL expiry → DB flooded | Reddit front page after a popular post's cache expires | Mutex/lock on miss, probabilistic early expiry, background refresh |
| Cache Penetration | Requests for keys that will never exist in DB (e.g., invalid IDs) → always miss cache, always hit DB | Attacker querying random non-existent user IDs | Cache null results with short TTL, Bloom filter at gateway |
| Cache Avalanche | Many keys expire at same time → large spike on DB | Black Friday: product catalog cache all set to 1-hour TTL at deploy time | Jitter on TTL (±20%), persistent cache (no global TTL), staggered deploys |
| Hot Key | One key gets massive traffic (e.g., a celebrity's profile) | Justin Bieber problem on Twitter | Replicate hot key across N cache nodes, local in-process cache for ultra-hot keys |
| Cache Inconsistency | Cache and DB diverge — stale reads | User changes name but sees old name for 10 min | Write-through, event-driven invalidation, shorter TTL, version keys |
Redis Data Structures (with Use Cases)
| Structure | Operations | Use Case |
|---|---|---|
| String | GET, SET, INCR, EXPIRE | Sessions, counters, simple cache, rate limit counters |
| Hash | HGET, HSET, HMGET | User profile fields, product attributes |
| List | LPUSH, RPOP, LRANGE | Activity feed, job queues, chat history |
| Set | SADD, SISMEMBER, SUNION | Unique visitors, tags, friends list |
| Sorted Set (ZSet) | ZADD, ZRANGE, ZRANGEBYSCORE | Leaderboards, rate limiting (sliding window), priority queues |
| Bitmap | SETBIT, GETBIT, BITCOUNT | Daily active users, feature flags per user (512MB = 4B users) |
| HyperLogLog | PFADD, PFCOUNT | Approximate unique counts with <1% error, tiny memory (12KB at any cardinality) |
| Geo | GEOADD, GEORADIUS | Nearby drivers, location search |
| Stream | XADD, XREAD, XGROUP | Event log, lightweight Kafka alternative |
API Design Patterns Compared
| Style | Protocol | Payload | Best For | Avoid When |
|---|---|---|---|---|
| REST | HTTP/1.1+ | JSON/XML | Public APIs, simple CRUD, well-understood semantics | Need real-time; complex nested queries |
| GraphQL | HTTP | JSON | Mobile clients (minimise bytes), flexible frontends, aggregating multiple APIs | Simple CRUD; when caching is critical (no GET) |
| gRPC | HTTP/2 | Protobuf (binary) | Internal microservices, high throughput, streaming | Browser clients (needs grpc-web proxy) |
| WebSocket | TCP upgrade from HTTP | Any (text/binary) | Chat, live dashboards, multiplayer games, collaborative editing | Simple request-response; fire-and-forget |
| Server-Sent Events | HTTP | Text | One-way server push: notifications, live feeds, progress updates | Bidirectional; binary data |
| Webhooks | HTTP POST callback | JSON | Event notifications to third parties (Stripe payment events, GitHub PR hooks) | Real-time (client must have public endpoint) |
REST API Best Practices
Polling vs Long Polling vs WebSocket vs SSE
Short Polling — client asks "any updates?" every N seconds. Simple but wasteful: most responses are empty. Good for: email clients checking every 30s.
Long Polling — client asks, server holds the connection open until an update arrives, then responds; the client immediately re-polls. Simulates push over HTTP. Good for: notifications when WebSocket isn't available.
WebSocket — full-duplex persistent TCP connection after an HTTP upgrade handshake. Both client and server can send at any time. Best for: chat, games, live collaboration.
Server-Sent Events (SSE) — server pushes events to the client over a persistent HTTP GET. One-directional, with automatic reconnection built in. Good for: live timelines, sports scores, progress bars.
Async communication via queues is the backbone of scalable, resilient systems. It decouples services so they can fail and scale independently.
Why Queues? A Concrete Example
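As a minimal illustration (an in-process queue.Queue standing in for a real broker like Kafka or SQS): an order endpoint that enqueues a slow email send instead of doing it inline returns immediately, while a worker absorbs the latency in the background.

```python
import queue
import threading
import time

jobs = queue.Queue()    # stand-in for Kafka / RabbitMQ / SQS
sent = []

def send_email(order_id: int):
    time.sleep(0.05)    # pretend SMTP takes 50 ms
    sent.append(order_id)

def worker():
    while True:
        order_id = jobs.get()
        if order_id is None:        # shutdown sentinel
            break
        send_email(order_id)
        jobs.task_done()

threading.Thread(target=worker, daemon=True).start()

def place_order(order_id: int) -> str:
    jobs.put(order_id)  # enqueue and return — don't block on the email
    return "202 Accepted"

start = time.monotonic()
responses = [place_order(i) for i in range(10)]
elapsed = time.monotonic() - start
print(f"10 orders acknowledged in {elapsed * 1000:.1f} ms")  # ~0 ms, not ~500 ms
jobs.join()             # emails finish in the background
print(f"{len(sent)} confirmation emails sent")
```

The same decoupling also means the email worker can crash, redeploy, or fall behind without affecting order acceptance, which is the resilience argument made above.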
Kafka vs RabbitMQ vs SQS
| Feature | Apache Kafka | RabbitMQ | Amazon SQS |
|---|---|---|---|
| Model | Distributed log (topics/partitions) | Traditional broker (exchanges/queues) | Managed queue service |
| Message retention | Configurable (days/weeks), replayable | Until consumed | 14 days max |
| Throughput | Millions/sec per cluster | 20-50k msg/sec | Unlimited (managed) |
| Ordering | Per partition (strict) | Per queue | Standard: no; FIFO: yes |
| Routing | Simple (topic-based) | Complex (topic, direct, fanout, headers) | Simple |
| Consumer model | Pull; consumer groups share partitions | Push; competing consumers | Pull; long polling |
| Best for | Event sourcing, analytics pipelines, audit logs, stream processing | Task queues, RPC, complex routing, work distribution | Serverless, AWS ecosystem, simple queuing |
Kafka Deep Dive
CAP Theorem — The Real Story
During a network partition (unavoidable in any distributed system), you must choose:
CP (Consistency + Partition tolerance) — returns an error or waits rather than returning stale data. Sacrifices availability. Examples: HBase, ZooKeeper, etcd, MongoDB (with majority read/write concern). Use when: banking, inventory management (can't oversell).
AP (Availability + Partition tolerance) — returns the best available data (possibly stale). Sacrifices consistency. Examples: Cassandra, CouchDB, DynamoDB (default). Use when: social feeds, shopping carts, DNS — stale data is acceptable.
Distributed Patterns
Circuit Breaker — prevents calling a failing service. States: Closed (normal) → Open (fail fast after threshold) → Half-Open (probe after timeout). Prevents cascade failures. Used by: Netflix Hystrix, Resilience4j.
Saga — distributed transactions without 2-Phase Commit. A sequence of local transactions; on failure, run compensating transactions backwards. Two flavours: Choreography (events) or Orchestration (central coordinator).
Bulkhead — isolate failures like watertight compartments in a ship. Separate thread pools per dependency: if the payment service hangs, it doesn't exhaust threads for the search service.
Retry with Backoff — retry failed requests with exponential backoff + jitter. Without jitter, all clients retry in sync → stampede. Formula: wait = min(cap, base × 2^attempt) + rand()
Idempotency Keys — client sends a unique key per logical operation; the server deduplicates on that key. Critical for payment APIs — ensures a charge happens once even if the request is retried. Stripe and PayPal use this.
Transactional Outbox — write to the DB and an outbox table atomically. A separate process reads the outbox and publishes to the queue. Guarantees at-least-once message delivery without distributed transactions.
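The backoff-with-jitter formula above, sketched as a retry helper (the flaky function and its failure count are purely illustrative):

```python
import random
import time

def call_with_retry(fn, base=0.05, cap=2.0, max_attempts=5):
    """Retry fn on transient errors with exponential backoff plus jitter:
    wait = min(cap, base * 2^attempt) + rand()."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise                          # out of attempts: surface the error
            wait = min(cap, base * 2 ** attempt) + random.uniform(0, base)
            time.sleep(wait)

calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

print(call_with_retry(flaky))   # "ok" — succeeds on the third attempt
```

The random term is what desynchronises clients; with "full jitter" variants the entire wait is randomised, which spreads retries even further.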
Consistent Hashing — Why It Matters
Virtual nodes (vnodes): each physical server owns multiple points on the ring. Distributes load evenly even with heterogeneous servers.
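A minimal ring with virtual nodes (MD5 for point placement is an arbitrary choice) demonstrating the ~1/n property: when a node joins, only the keys it takes over move, and they all move to the new node.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Hash ring with virtual nodes: each physical server owns many
    points, so load stays even and a new node steals only small arcs."""

    def __init__(self, servers, vnodes=100):
        self.ring = sorted(
            (self._hash(f"{server}#{v}"), server)
            for server in servers
            for v in range(vnodes)
        )
        self.points = [point for point, _ in self.ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def lookup(self, key: str) -> str:
        # First server clockwise from the key's position (wrapping around)
        i = bisect.bisect(self.points, self._hash(key)) % len(self.ring)
        return self.ring[i][1]

keys = [f"user:{i}" for i in range(10_000)]
ring3 = ConsistentHashRing(["cache-a", "cache-b", "cache-c"])
ring4 = ConsistentHashRing(["cache-a", "cache-b", "cache-c", "cache-d"])

moved = sum(1 for k in keys if ring3.lookup(k) != ring4.lookup(k))
print(f"{moved / len(keys):.0%} of keys moved")  # roughly 1/4, all to cache-d
```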
Storage Types
| Type | Description | Example | Use Case | Latency |
|---|---|---|---|---|
| In-Memory | RAM, volatile | Redis, Memcached | Cache, sessions, real-time leaderboards | ~1 µs |
| Block Storage | Raw volumes, OS-level | AWS EBS, GCP PD | Database storage, OS boot volumes | ~1 ms |
| File Storage (NAS) | Hierarchical filesystem | AWS EFS, NFS | Shared filesystems, CMS media libraries | ~5 ms |
| Object Storage | Flat key-value for blobs | AWS S3, GCS, R2 | Media, backups, ML datasets, static assets | ~50-100 ms |
| Cold/Archive | Tape/glacier, infrequent | S3 Glacier, GCS Archive | Compliance backups, audit logs | Minutes to hours |
CDN Architecture
- What to cache on CDN: Static assets (JS, CSS, images), HTML for SPAs, video segments, API responses with low variance
- What NOT to cache: Personalised responses, payment pages, admin dashboards, frequently changing data
- Cache-Control headers: max-age=31536000, immutable for versioned assets; no-cache for HTML
- Cache invalidation: content-hashed filenames (bundle.a3f9b2.js) are safe to cache forever and invalidate naturally on each deploy
Microservices split a system into small, independently deployable services. Not always better than a monolith — understand the trade-offs.
Monolith vs Microservices
| Dimension | Monolith | Microservices |
|---|---|---|
| Deployment | Deploy everything at once | Deploy services independently |
| Scaling | Scale everything or nothing | Scale only bottleneck services |
| Dev velocity (small team) | Fast — no network overhead | Slow — service contracts, infra complexity |
| Dev velocity (large team) | Slow — merge conflicts, coordination | Fast — teams own services independently |
| Failure isolation | One bug can take down everything | Failures contained to a service |
| Data | Shared DB — easy joins | Each service owns its DB — no direct joins |
| Debugging | Simple — single process | Hard — distributed traces needed |
| Start with | Always — extract later if needed | Only when team/scale demands it |
Service Communication
Synchronous (request/response) — Service A calls Service B and waits for the response. Simple, easy to reason about. Problem: if B is slow, A is slow. Creates temporal coupling.
Asynchronous (event-driven) — Service A publishes an event; Service B processes it when ready. Loose coupling: B can be down and catch up later. Problem: harder to trace, eventual consistency.
Service Mesh — a sidecar proxy handles mTLS, retries, circuit breaking, distributed tracing, and load balancing, transparently and without app code changes. The operations team's best friend.
API Gateway — single entry point: auth, rate limiting, routing, response aggregation, SSL termination. Clients don't need to know the internal service topology. Examples: Kong, AWS API Gateway, nginx.
Authentication Patterns
Session-Based — the server stores the session; the client sends a session ID cookie. Simple, revocable immediately. Problem: requires a shared session store (Redis) for horizontal scaling. Used by: traditional web apps.
JWT (JSON Web Tokens) — stateless: the server signs the payload, no storage needed. The client sends it in the Authorization header. Problem: can't revoke before expiry without a blocklist. Use short expiry (15 min) + refresh tokens.
OAuth 2.0 — delegated authorization. The "Login with Google" flow: the user authorises your app to access their Google data. A separate auth server issues tokens. Industry standard for third-party integrations.
API Keys — long-lived opaque tokens for machine-to-machine auth. Simple, no expiry complexity. Hash before storing (treat like passwords). Rate limit per key. Used by: Stripe, SendGrid, Twilio.
JWT Flow
Security Essentials Checklist
- Always use HTTPS — TLS everywhere, including internal services
- Hash passwords with bcrypt/Argon2 — never SHA1/MD5, never plain text
- Input validation — validate server-side, never trust client input
- SQL injection prevention — always use parameterised queries, never string concat
- Rate limiting on auth endpoints — prevent brute force; lock after N failures
- Secrets management — AWS Secrets Manager / Vault; never hardcode in code
- Principle of least privilege — DB user only has SELECT on read-only service
- CSRF protection — SameSite cookie attribute or CSRF tokens
You can't fix what you can't see. Observability is the ability to infer internal system state from external outputs.
The Three Pillars
Metrics — aggregated numbers over time: CPU%, RPS, error rate, P99 latency, cache hit rate. Tools: Prometheus (scraping) + Grafana (dashboards). Alert on SLO violations.
Logs — immutable, timestamped event records. Structured logs (JSON) are searchable. Ship to: ELK Stack (Elasticsearch + Logstash + Kibana) or Loki + Grafana. Include request IDs for correlation.
Traces — follow a single request across multiple services. Each span records: service, operation, duration, status, tags. Tools: Jaeger, Zipkin, AWS X-Ray. Critical for diagnosing microservice latency.
Key Metrics to Track
| Category | Metrics | Typical Alert Thresholds |
|---|---|---|
| Traffic | Requests/sec, active connections | Alert on sudden 2× spike or drop |
| Latency | P50, P95, P99 response time | P99 > 500ms for APIs, P99 > 200ms for search |
| Errors | 5xx rate, exception count, failed jobs | 5xx rate > 0.1% of traffic |
| Saturation | CPU%, memory%, disk%, queue depth | CPU > 80% sustained, queue > 1M items |
| Database | Query latency, connection pool usage, replication lag | Replica lag > 30s, pool usage > 90% |
| Cache | Hit rate, eviction rate, memory usage | Hit rate drops below 90% |
| Business | Orders/min, signups/hour, revenue/min | 50% drop in orders → page on-call |
These are the 12 most commonly asked system design problems. For each, understand the key decisions — not just the answer.
URL Shortener (Bitly)
Key characteristics: read-heavy (~100:1 read/write ratio), low latency required, writes ~100 URLs/sec.
- Key design choice: short code generation. Options: (a) auto-increment ID → base62 encode — simple, sequential but predictable; (b) random UUID → take 7 chars — unpredictable but needs collision check; (c) MD5(longURL) → take 7 chars — deterministic, same URL always same code
- Schema:
id, short_code (indexed), long_url, user_id, created_at, expires_at, click_count - Redirect type: 301 Permanent (browser caches → can't track clicks) vs 302 Temporary (always hits server → track every click) — Bitly uses 302
- Analytics: Async — write click events to Kafka, process in batches, store in Cassandra by (short_code, date)
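Short-code options (a) and (c) from the list above, sketched:

```python
import hashlib

ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def base62_encode(n: int) -> str:
    """Option (a): auto-increment ID → base62. 62^7 ≈ 3.5 trillion codes."""
    if n == 0:
        return ALPHABET[0]
    out = []
    while n:
        n, rem = divmod(n, 62)
        out.append(ALPHABET[rem])
    return "".join(reversed(out))

def hash_code(long_url: str, length: int = 7) -> str:
    """Option (c): deterministic MD5-derived code.
    Same URL always yields the same code; collisions must still be checked."""
    digest = int(hashlib.md5(long_url.encode()).hexdigest(), 16)
    return base62_encode(digest)[:length]

print(base62_encode(125))   # "21" (125 = 2*62 + 1)
print(hash_code("https://example.com/some/long/path"))   # 7-char code
```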
Twitter / Social Feed
Key challenges: the fan-out problem, the celebrity edge case, heavily read-dominated traffic.
- Timeline storage: Redis Sorted Set per user, score = timestamp, member = tweet_id. Fetch top 200 tweets → hydrate from tweet cache
- Who is a celebrity? Beyond a follower threshold, fan-out on write is too expensive, so switch those accounts to the pull model. A common interview answer is > 1M followers; Twitter reportedly used a much lower internal cutoff (~10K).
- Tweet storage: MySQL with Vitess sharding (Twitter's actual stack), sharded by tweet_id
- Media: Images/videos → S3 → CDN (CloudFront). Never stored in DB.
WhatsApp / Chat System
Key challenges: real-time delivery, offline handling, end-to-end encryption.
- Connection management: each user holds a WebSocket to one chat server. To route messages, store which server owns each user's connection in Redis: user_id → server_id
- Read receipts: a small event sent back through the queue: {type: "read", msg_id: X, user_id: Y, ts: T}
- Message ordering: logical timestamps / sequence IDs per conversation. Cassandra for message storage: partition key = conversation_id, clustering key = message_id (time-ordered)
- E2E encryption: Signal Protocol — keys generated on device, server never sees plaintext
YouTube / Video Platform
Key challenges: storage-heavy workload, transcoding pipeline, CDN delivery.
- Resumable uploads: clients upload in chunks (e.g. 5MB). The server tracks progress, and a failed upload resumes from the last confirmed chunk. The open tus protocol standardises this pattern; YouTube's upload API implements its own resumable scheme.
- Adaptive bitrate: HLS/DASH splits video into segments. Manifest file (.m3u8) lists available quality tiers. Player switches dynamically — smooth even on variable connections.
- View count: Not real-time. Batch updated from Kafka stream via Flink. Prevents race conditions at scale.
- Recommendation: Separate ML service. Reads from Cassandra (watch history) + Redis (trending). Serves pre-computed recommendations.
Uber / Ride-Sharing
Key challenges: geospatial indexing, real-time matching, event-driven architecture.
- Geohash: Encode (lat, lng) as string. "ww8p1r4t8" → nearby cells share prefix. Range query = prefix search. Redis GEO uses sorted set with geohash score internally.
- Surge pricing: Real-time supply/demand ratio per geohash cell. Stream processing on Kafka with Flink.
- ETA calculation: Not just distance — real-time traffic data from map service + historical patterns per time of day.
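A from-scratch geohash encoder showing the prefix property described above (the two coordinates below are arbitrary San Francisco points):

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"   # geohash alphabet (no a, i, l, o)

def geohash(lat: float, lon: float, precision: int = 9) -> str:
    """Binary-search each coordinate range, interleaving bits (longitude
    first), then pack every 5 bits into one base32 character."""
    lat_range, lon_range = [-90.0, 90.0], [-180.0, 180.0]
    code, bits, bit_count, even = [], 0, 0, True
    while len(code) < precision:
        rng, val = (lon_range, lon) if even else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        bits <<= 1
        if val >= mid:
            bits |= 1
            rng[0] = mid     # keep the upper half
        else:
            rng[1] = mid     # keep the lower half
        even = not even
        bit_count += 1
        if bit_count == 5:
            code.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(code)

castro = geohash(37.7609, -122.4350)   # two points ~2 km apart
mission = geohash(37.7599, -122.4148)
print(castro, mission)   # nearby points usually share a long prefix
```

One caveat worth naming in an interview: points on opposite sides of a cell boundary can be physically close yet share almost no prefix, so real services query the neighbouring cells too.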
Google Drive / File Storage
Key challenges: sync across devices, deduplication, real-time collaboration.
Rate Limiter Service
Key challenges: distributed state, the token bucket algorithm.
A repeatable process for any design question. Interviewers evaluate your thinking process, not just the final diagram.
The 5-Step Framework (45 minutes)
Clarify Requirements (5 min)
Never start designing before asking questions. Functional requirements: what does the system DO? Non-functional: scale, latency SLA, availability, consistency needs, geographic distribution?
Questions to ask: "How many daily active users? What's the read/write ratio? Do we need global distribution? What's the acceptable latency? Does the feed need to be real-time?"
Capacity Estimation (5 min)
Estimate QPS (write + read), storage per day/year, bandwidth. These numbers drive your architecture choices. Don't skip this — it reveals if you need 1 server or 1,000.
High-Level Design (10 min)
Draw the major boxes: clients, DNS/CDN, load balancer, app servers, cache, DB, queues, object store. Connect them with arrows and label data flows. Get alignment before diving deep.
Deep Dive (15 min)
Pick the most interesting or complex component. Design the DB schema, API endpoints, or the core algorithm. The interviewer often directs you here. Show depth in the area that matters most.
Scale & Trade-offs (10 min)
Identify bottlenecks. How does the design handle 10× traffic? What are the failure modes? What did you trade off — and why? Great engineers articulate trade-offs, not just solutions.
Standard Component Checklist
- ☐ DNS + CDN for static assets
- ☐ Load balancer (L4 or L7)
- ☐ Stateless app servers (scale-out)
- ☐ Primary DB + read replicas
- ☐ Cache layer (Redis)
- ☐ Async queue for heavy tasks
- ☐ Object store for media (S3)
- ☐ Search index (Elasticsearch)
- ☐ Auth service / API gateway
- ☐ Rate limiter
- ☐ Notification service (push/email)
- ☐ Monitoring (Prometheus + Grafana)
- ☐ Distributed tracing (Jaeger)
- ☐ Feature flags (LaunchDarkly)
Avoid By Default (over-engineering red flags)
- ☐ Microservices for small teams
- ☐ Multiple DB types unless needed
- ☐ Custom consensus algorithms
- ☐ ML pipelines before product-market fit
- ☐ Multi-region before 1M users
System design is as much about how you think out loud as what you design. This section is about developing the mental models and communication skills that separate great engineers from good ones.
The Mental Model Ladder
Move through these levels for any component you're designing:
What does it do? (Function)
State the job in one sentence. Don't overcomplicate. If you can't explain it simply, you don't understand it yet.
Why this choice? (Trade-offs)
Every choice excludes alternatives. Name what you're trading off. This shows you've thought beyond "I know Redis".
What breaks at scale? (Failure modes)
Every design has weaknesses. Name them before the interviewer does. This demonstrates systems-level thinking.
What would you change with 10× load? (Evolution path)
Good designs have a clear growth path. You shouldn't need to redesign from scratch to handle more load.
Articulation Patterns to Practice
"I'm choosing X over Y because our workload is [read/write/latency] heavy. The downside of X is [Z], which I'll mitigate by [solution]. If requirements change towards [different constraint], Y would be better."
"For our initial scale of [N users], [simple solution] works. As we grow to [10N], the bottleneck becomes [component] because [reason]. I'd address that by [next level solution]."
"Before I dive in — can I check: is the primary concern here latency, throughput, or consistency? That'll change my approach significantly. For a payment system I'd prioritise consistency; for a social feed I'd accept eventual consistency for lower latency."
"This design has a weakness: [specific failure mode]. In production I'd add [mitigation]. I'm leaving it simplified for now — should I detail the production-grade version?"
Common Thinking Traps
| Trap | What It Looks Like | Fix |
|---|---|---|
| Gold plating | Designing a Kubernetes cluster with 12 microservices for a URL shortener MVP | Ask scale first. Match complexity to requirements. |
| Buzzword dropping | "I'd use blockchain and Kubernetes and Kafka and GraphQL" without reasoning | Every technology should solve a stated problem. Name the problem first. |
| Skipping the obvious | Jumping to sharding without considering "can a single Postgres handle this?" | Start simple. Postgres handles millions of rows fine with good indexing. |
| Ignoring failure modes | Designing the happy path only | For every component, ask "what happens when this fails?" |
| Premature optimisation | "We'll need CDN, multi-region, eventual consistency..." for day-1 product | Design for current scale, show awareness of future scaling steps. |
| Silent designing | Drawing boxes without explaining why | Think out loud. Narrate your decisions. The interviewer evaluates process, not just output. |
Building Intuition: The 5-Whys Drill
When you encounter any design decision, drill into the "why" five levels deep. This builds permanent mental models:
"Why does Uber use consistent hashing for driver location?"
- Why 1: To distribute driver keys across multiple Redis nodes.
- Why 2: Because one Redis instance can't hold 5M driver locations.
- Why 3: Wait — RAM alone isn't the constraint: 5M drivers × 100 bytes = 500MB, which fits on one node. The real drivers are write throughput, fault tolerance, and room to scale horizontally.
- Why 4: Because consistent hashing means adding a new node only remaps 1/N drivers, not all of them.
- Why 5: Because if adding a node invalidated all keys, you'd get a thundering herd hitting the DB during scaling events — exactly when you need stability most.
Deliberate practice beats passive reading. Here's a structured roadmap from beginner to advanced.
Learning Roadmap
Month 1 — Foundations
Read DDIA Chapters 1–6 (storage, encoding, replication, partitioning). Design a URL shortener and a key-value store from scratch. Watch ByteByteGo's fundamentals playlist. Build latency intuition.
Month 2 — Core Systems
Design Twitter, WhatsApp, and YouTube. Read Dynamo and Bigtable papers. Study consistent hashing, CAP theorem, and caching patterns deeply. Start mock interviews with peers.
Month 3 — Advanced Patterns
Read DDIA Chapters 7–12 (transactions, distributed systems). Design Google Drive, Uber, and a search engine. Study Raft consensus. Read Netflix/Discord/Airbnb engineering blogs.
Month 4+ — Mastery
Design novel systems (TypeAhead/Autocomplete, Notification Service, Ad Click Aggregator). Do 2 timed mock interviews/week. Read primary sources: papers, engineering blogs. Teach concepts to others.
Weekly Practice Template
5 hours/week split
- Monday (1 hr): Read one chapter of DDIA or one engineering blog post. Take notes.
- Wednesday (1.5 hr): Design one system from scratch. Timer = 45 min. Then compare to reference solution.
- Friday (1 hr): Mock interview with a peer. One person designs, one asks clarifying questions.
- Weekend (1.5 hr): Deep dive into one component from the week's design (e.g., how does Redis cluster work internally?).
The Solo Practice Loop
Pick a system
Use this list: URL Shortener → KV Store → Twitter → Pastebin → Instagram → WhatsApp → Uber → YouTube → Google Drive → TypeAhead → Notification Service → Web Crawler → Distributed Cache
Set a 45-minute timer and design without references
Use Excalidraw or paper. Write out your requirements, estimation, high-level design, and one deep dive. Narrate out loud as if in an interview. This is uncomfortable — that's the point.
Compare with the reference solution
ByteByteGo, Alex Xu's book, or Grokking. Note every component you missed or over-engineered. Write down 3 specific gaps.
Deep dive one gap
Spend 30 minutes understanding one thing you got wrong. Read the relevant section of DDIA, a blog post, or watch a video. Build the mental model from first principles.
Spaced repetition
Revisit each system 1 week later and 1 month later. Can you reproduce the design in 15 minutes? True mastery = fast retrieval, not recognition.
Real Engineering Blog Reading Strategy
Read engineering blogs — but read them analytically, not passively:
- Before reading: "What problem is this company solving? What's their scale?"
- During reading: "What trade-off did they make? What alternatives did they consider? What surprised me?"
- After reading: Write 3 bullet points from memory. If you can't, you didn't absorb it.
Know the landscape. In real systems and interviews, naming the right tools (and knowing why) demonstrates production experience.
Diagramming & Design
Databases & Storage
Message Queues & Streaming
Infrastructure & Orchestration
Observability & Monitoring
API Gateways & Load Balancers
Cloud Providers Quick Reference
| Category | AWS | Google Cloud | Azure |
|---|---|---|---|
| Compute | EC2, Lambda, ECS/EKS | GCE, Cloud Run, GKE | VMs, Functions, AKS |
| Object Storage | S3 | Cloud Storage | Blob Storage |
| Relational DB | RDS, Aurora | Cloud SQL, Spanner | Azure SQL, CosmosDB |
| NoSQL / Cache | DynamoDB, ElastiCache | Firestore, Memorystore | CosmosDB, Cache for Redis |
| Queue / Streaming | SQS, Kinesis, MSK | Pub/Sub, Dataflow | Service Bus, Event Hubs |
| CDN | CloudFront | Cloud CDN | Azure CDN |
| Load Balancer | ALB / NLB | Cloud Load Balancing | Application Gateway |
| Search | OpenSearch Service | Vertex AI Search | Azure Cognitive Search |
References & Resources
Designing Data-Intensive Applications
Martin Kleppmann (O'Reilly, 2017). The definitive resource. Deep dives on storage engines, replication, consistency, and stream processing. Read once, reference forever.
dataintensive.net →
System Design Interview Vol. 1 & 2
Alex Xu. Vol 1 covers 16 systems; Vol 2 covers 13 more complex ones. Step-by-step with diagrams. Best book for interview preparation.
Building Microservices (2nd Ed)
Sam Newman (O'Reilly). Comprehensive guide to microservice patterns: decomposition, communication, data management, observability.
ByteByteGo
Alex Xu's channel. Animated explainers of system design concepts and architecture patterns. Free, high quality, beginner-friendly.
youtube.com/@ByteByteGo →
Gaurav Sen
Deep conceptual explanations. Especially good for consistent hashing, messaging systems, and distributed system fundamentals.
youtube.com/@gkcs →
Hussein Nasser
Backend engineering, database internals, proxy servers, networking. Very practical and hands-on. Covers HTTP/2, gRPC, WebSocket deeply.
youtube.com/@hnasr →
Grokking System Design — Educative
Structured course with 20+ systems, interactive diagrams, and design considerations. Good for structured learners.
educative.io →
System Design Primer
Donne Martin's massive open-source resource. 270k+ stars. Covers everything with diagrams, Anki flashcards, and coding problems.
github.com/donnemartin →
High Scalability Blog
Real-world architecture breakdowns of how companies (Pinterest, Twitter, Netflix) built and scaled their systems.
highscalability.com →
ByteByteGo Newsletter
Weekly system design posts. Covers new topics not in the book. Excellent diagrams. Free tier available.
blog.bytebytego.com →
Amazon Dynamo (2007)
Foundational paper on highly available key-value storage. Popularised consistent hashing and vector clocks. Basis for Cassandra and DynamoDB.
allthingsdistributed.com →
Google Bigtable (2006)
Describes the wide-column storage system behind Google's web indexing. Foundation for HBase and Cassandra's data model.
research.google →
Raft Consensus Algorithm
"In Search of an Understandable Consensus Algorithm" — Diego Ongaro & John Ousterhout, 2014. Powers etcd (the Kubernetes backbone), CockroachDB, and TiKV. Far more readable than Paxos.
raft.github.io →
Google Spanner (2012)
Globally distributed SQL database with external consistency. Uses TrueTime (GPS/atomic clocks) to achieve linearisability at global scale.
research.google →
Netflix Tech Blog
Chaos engineering, streaming at scale, microservices patterns, Hystrix. Essential reading for resilience and large-scale system design.
netflixtechblog.com →
Discord Engineering Blog
Running WebSockets at millions of concurrent connections, migrating from Cassandra to ScyllaDB, storing billions of messages. Excellent case studies.
discord.com/blog →