Engineering Study Notes — v2.0

System Design

Comprehensive notes covering core concepts, distributed patterns, thinking frameworks, tooling, practice strategies, and real-world architectures — with curated references.

15 Sections · 50+ Concepts · 30+ Tools Listed · 12 Design Problems

Table of Contents

01 Fundamentals & Estimation
02 Scalability
03 Databases
04 Caching
05 Networking & APIs
06 Message Queues
07 Distributed Systems
08 Storage & CDN
09 Microservices
10 Security & Auth
11 Observability
12 Classic Design Problems
13 Design Interview Framework
14 Thinking & Articulation
15 How to Practice
16 Tooling Ecosystem
References

Before designing anything, understand the properties you're optimising for and estimate the scale you're working at. Numbers ground your decisions.

The Core Properties

Availability

% uptime. 99.9% ("three nines") = 8.77 hrs downtime/year. 99.99% = 52 min/year. 99.999% = 5.26 min/year. SLAs are commitments; SLOs are internal targets; SLIs are actual measurements.

Reliability

Performs correctly, not just responds. A service returning stale data is available but not reliable. MTBF (Mean Time Between Failures) and MTTR (Mean Time To Recovery) are key metrics.

Scalability

Handles growing load without degrading. Define your load parameters first: RPS, concurrent users, data volume, read/write ratio. Twitter handles ~6k tweets/sec average, 150k at peak.

Consistency

All nodes agree on the same data. Strong consistency = every read sees latest write. Eventual = replicas converge over time. The right model depends on the business (bank balance vs likes count).

Durability

Committed data survives crashes. Achieved via WAL (Write-Ahead Logs), replication, and periodic snapshots. S3 offers 99.999999999% (11 nines) durability.

Latency vs Throughput

Latency = time for one request. Throughput = requests/second. Often a trade-off. Batching improves throughput but increases latency. Target both P50 and P99 latencies.

Numbers Every Engineer Should Know

| Operation | Latency | Intuition |
|---|---|---|
| L1 cache access | 0.5 ns | 1 step across a room |
| L2 cache access | 7 ns | 14 steps |
| RAM read | 100 ns | 200 steps across a field |
| SSD random read (4K) | 150 µs | Drive to the corner store |
| HDD seek + read | 10 ms | A walk around the block |
| Same datacenter RTT | 0.5 ms | Ping the next building |
| Cross-continent RTT (US↔EU) | 150 ms | Speed-of-light delay |
| Reading 1 MB from RAM | 250 µs | baseline for the rows below |
| Reading 1 MB from SSD | 1 ms | 4× slower than RAM |
| Reading 1 MB from disk | 20 ms | 80× slower than RAM |
Key takeaway: For random access, RAM (~100 ns) is roughly 100,000× faster than an HDD seek (~10 ms); even sequential 1 MB reads are ~80× faster from RAM than from disk. A same-datacenter round trip (~0.5 ms) is ~300× faster than a cross-continent one (~150 ms). These gaps inform every caching and replication decision you make.

Back-of-Envelope: Twitter Example

Given: 300M monthly users, 50% daily active = 150M DAU. Each user reads 100 tweets/day, posts 1 tweet every 2 days.

// Write QPS
tweet_writes = 150M users × (1 tweet / 2 days) / 86,400 s = ~870 writes/sec
               (peak × 3 = ~2,500/sec)

// Read QPS
timeline_reads = 150M × 100 / 86,400 s = ~174,000 reads/sec
                 (peak × 3 = ~500,000/sec)

// Read:Write ratio = ~200:1 — heavily read-dominated
// Decision: optimise for reads, cache timelines aggressively

// Storage
tweet_size      = 64 bytes (id) + 140 bytes (text) + metadata = ~300 bytes
storage_per_day = 870/sec × 86,400 s × 300 bytes = ~22.6 GB/day
per_year        = 22.6 GB × 365 = ~8 TB/year (before replication)

Scalability is about how your system behaves as load grows. The goal is to add resources in proportion to load without redesigning the system.

Horizontal vs Vertical Scaling

| Attribute | Vertical (Scale Up) | Horizontal (Scale Out) |
|---|---|---|
| Method | Bigger machine (more CPU/RAM) | More machines of the same type |
| Complexity | Simple — no distributed concerns | Complex — must handle distributed state |
| Ceiling | Hardware limit exists (~96 cores) | Nearly unlimited (Google has millions of servers) |
| Downtime | Requires restart to resize | Rolling deploys; zero downtime |
| Failure | Single point of failure | Partial failure tolerance |
| Cost curve | Superlinear — 2× RAM costs 4× | Linear — commodity hardware |
| Real example | Early Instagram: one fat Postgres | Airbnb: thousands of identical app nodes |

Load Balancing Deep Dive

A load balancer distributes traffic across a server pool. It can operate at Layer 4 (TCP) or Layer 7 (HTTP, aware of content).

┌─────────────────────────────────────────┐
│              LOAD BALANCER              │
│         Health checks every 5s          │
│        Removes unhealthy servers        │
└──────┬─────────────┬───────────┬────────┘
       │             │           │
  [Server A]    [Server B]   [Server C]
  sessions:5    sessions:2   sessions:8

← Least Connections routes the new request to Server B →
Round Robin

Requests go to each server in turn: A → B → C → A... Best when servers are identical and requests are homogeneous.

Weighted Round Robin

Assign weights by capacity. Server with 2× RAM gets 2× requests. Used when servers have different specs.

Least Connections

Route to server with fewest active connections. Best for long-lived connections (WebSocket, streaming).

Consistent Hashing

Hash(key) maps to a point on a ring. Each server owns an arc. Adding a server only remaps ~1/n keys. Essential for caches (Redis cluster, Cassandra).

IP / URL Hash

Same client always goes to same server. Enables sticky sessions without shared store. Risk: hot spots if one client is very active.

Random

Statistically balanced without state. Good for short-lived stateless requests. AWS ALB offers weighted random as one of its routing algorithms (round robin is its default).

Rate Limiting

Protect services from abuse, ensure fair use, and prevent cascade failures. Implemented at API gateway or per-service.

| Algorithm | How it Works | Burst Allowed? | Use When |
|---|---|---|---|
| Token Bucket | Bucket holds N tokens, refills at rate R. Each request consumes 1 token. | Yes (up to bucket size) | Most APIs — allows bursty traffic |
| Leaky Bucket | Requests enter a queue, processed at a fixed rate. Overflow is dropped. | No — strictly smooth | Payment systems, smooth traffic |
| Fixed Window | Count per time window (e.g., 100 req/min). Resets at the boundary. | 2× burst at boundary | Simple quota enforcement |
| Sliding Window Log | Store the timestamp of each request. Count within a rolling window. | Precise no-burst | Accurate enforcement, high memory |
| Sliding Window Counter | Blend current + previous window weighted by position. | Approximate | Best balance of accuracy vs memory |
Real world: GitHub API: 5,000 req/hour per token (token bucket). Twitter API: 300 req/15 min (fixed window). Stripe: 100 req/sec (token bucket with burst). Return HTTP 429 with a Retry-After header.
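The token-bucket row above fits in a few lines. A minimal single-process sketch (the `TokenBucket` class and its parameters are illustrative, not any specific library's API); production limiters usually live in Redis or at the gateway:

```python
import time

class TokenBucket:
    """Bucket holds up to `capacity` tokens, refilled at `rate` tokens/sec."""
    def __init__(self, capacity: int, rate: float):
        self.capacity = capacity
        self.rate = rate
        self.tokens = float(capacity)   # start full → bursts allowed
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1            # each request consumes one token
            return True
        return False                    # caller should return HTTP 429

bucket = TokenBucket(capacity=5, rate=2.0)    # burst of 5, sustained 2 req/sec
burst = [bucket.allow() for _ in range(7)]    # 5 pass, then the bucket is empty
```

A back-to-back burst drains the bucket: the first five calls pass and the next two are rejected until the 2/sec refill catches up.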

Auto-Scaling

Automatically add or remove servers based on metrics. Cloud providers (AWS, GCP, Azure) all offer managed auto-scaling groups.

  • Reactive scaling — trigger on CPU > 70% for 5 min. Lag means brief overload before new instances are warm.
  • Predictive scaling — ML predicts load (e.g., scale up at 8 AM before morning traffic). AWS now offers this natively.
  • Schedule-based — scale up Friday evenings, down weekday nights. Known traffic patterns (e-commerce flash sales).
  • Always scale in slowly (to avoid thrashing) and scale out aggressively.

Database choice is often the most consequential decision in a system. Understand the trade-offs before committing — migrations are expensive.

SQL vs NoSQL

| Attribute | SQL (Relational) | NoSQL |
|---|---|---|
| Schema | Fixed, enforced by DB | Flexible, app-enforced |
| Transactions | Full ACID | Varies (Mongo 4.0+ has multi-doc ACID) |
| Scaling | Vertical primary; sharding complex | Designed for horizontal scale |
| Joins | Native; performant with indexes | Avoid; denormalise data |
| Query language | SQL — expressive, standardised | Varies by DB |
| Best for | Complex queries, financial, relational data | High-write, flexible schema, huge scale |
Common mistake: Defaulting to NoSQL for "scale" without considering that most products never exceed what a well-tuned PostgreSQL cluster can handle (millions of req/day). Instagram ran on Postgres for years at massive scale.

NoSQL Database Types

Key-Value Store

Examples: Redis, DynamoDB, Riak
Fast O(1) lookup by key. No complex queries. Best for: sessions, caches, user preferences, shopping carts.

Document Store

Examples: MongoDB, CouchDB, Firestore
JSON/BSON documents, flexible schema. Good for: user profiles, content management, catalogs with varying attributes.

Wide-Column

Examples: Cassandra, HBase, BigTable
Rows + dynamic columns. Optimised for time-series, write-heavy. Good for: IoT, analytics, activity feeds. Cassandra handles 1M writes/sec.

Graph DB

Examples: Neo4j, Neptune, JanusGraph
Nodes + edges + properties. Relationships are first-class. Good for: social graphs, fraud detection, recommendation engines.

Time-Series DB

Examples: InfluxDB, TimescaleDB, Prometheus
Optimised for timestamp-indexed data. Automatic downsampling and retention. Good for: metrics, monitoring, IoT sensor data.

Search Engine

Examples: Elasticsearch, OpenSearch, Solr
Inverted index, full-text search, fuzzy matching, aggregations. Not a primary DB — sync from main DB via events.

Indexing

Without an index, every query is a full table scan O(n). With an index, it's O(log n) for B-Tree or O(1) for hash.

-- Without index: full scan of 10M rows
SELECT * FROM orders WHERE user_id = 42;   -- scans ALL rows

-- With B-Tree index: log(10M) ≈ 23 comparisons
CREATE INDEX idx_orders_user ON orders(user_id);
SELECT * FROM orders WHERE user_id = 42;   -- ~23 comparisons

-- Composite index: column order matters! Left-prefix rule
CREATE INDEX idx_comp ON orders(user_id, created_at);
-- ✅ WHERE user_id = 42                                (uses index)
-- ✅ WHERE user_id = 42 AND created_at > '2024-01-01'  (uses index)
-- ❌ WHERE created_at > '2024-01-01'                   (skips user_id prefix, full scan)
  • B-Tree — default; great for range queries, equality, ORDER BY
  • Hash — O(1) equality only; no range support
  • GIN (Generalized Inverted Index) — arrays, JSONB, full-text search in Postgres
  • Partial Index — index only rows matching a condition: WHERE status = 'active'
  • Covering Index — includes all SELECT columns; never touches heap table

Replication Strategies

Single-Leader (Primary-Replica)

One primary accepts all writes. Changes replicated to read replicas. Replicas can serve reads, offloading the primary. Used by: most RDBMS, Redis Sentinel.

  • Sync replication: write confirmed when replica acknowledges — no data loss, but latency doubles
  • Async replication: write confirmed immediately — fast, but replica lag means potential data loss on failover
Multi-Leader

Multiple nodes accept writes. Each replicates to others. Needed for multi-datacenter active-active. Used by: CouchDB, Google Docs real-time sync.

  • Write conflicts must be resolved (LWW, CRDTs, custom merge)
  • Complex to reason about; avoid if possible
Leaderless (Dynamo-style)

Any node accepts reads/writes. Uses quorums: write to W nodes, read from R nodes. If W + R > N, every read quorum overlaps every write quorum, so at least one replica in any read holds the latest write.

  • Cassandra default: N=3, W=1, R=1 (eventual consistency)
  • Strong consistency: N=3, W=2, R=2 (quorum)
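The W + R > N rule can be checked exhaustively for small N. A brute-force sketch (the function name is mine) confirming that quorum reads must overlap quorum writes:

```python
from itertools import combinations

def quorums_intersect(n: int, w: int, r: int) -> bool:
    """True iff every possible write quorum shares a node with every read quorum."""
    nodes = range(n)
    return all(set(ws) & set(rs)
               for ws in combinations(nodes, w)
               for rs in combinations(nodes, r))

assert quorums_intersect(3, 2, 2)        # W+R > N → some replica has the latest write
assert not quorums_intersect(3, 1, 1)    # W+R ≤ N → a read can miss the write entirely
```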
Sharding (Partitioning)

Split data across multiple DB instances. Each shard is an independent DB holding a subset of data.

  • Range sharding: user A-M on shard 1, N-Z on shard 2. Risk: hot shards
  • Hash sharding: hash(user_id) mod N. Even distribution but no range queries
  • Directory-based: lookup service maps key to shard. Flexible but adds a hop
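A minimal hash-sharding router, assuming MD5 as the stable hash (Python's built-in `hash()` is randomised per process, so it cannot be used for routing):

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Hash sharding: stable digest of the key, modulo the shard count."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_shards

# The same key always routes to the same shard, in any process:
assert shard_for("user:42", 4) == shard_for("user:42", 4)
assert all(0 <= shard_for(f"user:{i}", 8) < 8 for i in range(100))
```

Note the drawback the bullet mentions: changing `num_shards` remaps almost every key, which is exactly what consistent hashing (covered later) fixes.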

Database Selection Guide

| Need | Best Choice | Why |
|---|---|---|
| User accounts, orders, payments | PostgreSQL / MySQL | ACID, joins, complex queries |
| Session store, rate limiting | Redis | In-memory, O(1), TTL support |
| Product catalog, CMS | MongoDB | Flexible schema, nested docs |
| Social graph, recommendations | Neo4j / Neptune | Graph traversals are native |
| Write-heavy IoT / activity logs | Cassandra | 1M+ writes/sec, no single point of failure |
| Full-text search, autocomplete | Elasticsearch | Inverted index, relevance scoring |
| Metrics, monitoring dashboards | InfluxDB / Prometheus | Built-in downsampling, fast range queries |
| Data warehouse, analytics | BigQuery / Redshift / Snowflake | Columnar, massively parallel |

Caching is the single highest-leverage optimisation in distributed systems. A well-placed cache can reduce DB load by 90%+.

Cache Hierarchy

Client Browser Cache → (miss) → CDN Edge Cache → (miss) → API Gateway Cache → (miss) → App-Level Cache → (miss) → Redis / Memcached → (miss) → Database

Each layer should only be needed when the layer above it misses. A well-tuned system sees 90%+ of traffic absorbed by CDN + Redis.

Cache Patterns

Cache-Aside (Lazy)

App checks cache → miss → reads DB → writes cache → returns. Cache only contains accessed data. Most common. Risk: stale data if DB updated without invalidating cache.
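A cache-aside sketch with plain dicts standing in for Redis and the primary DB (all names here are illustrative), including the invalidate-on-write step that avoids the stale-data risk:

```python
cache = {}                          # stand-in for Redis
db = {"user:1": {"name": "Ada"}}    # stand-in for the primary DB

def get_user(key: str):
    """Cache-aside read: check cache, on miss read DB and populate the cache."""
    if key in cache:
        return cache[key]           # hit
    value = db.get(key)             # miss → read from DB
    if value is not None:
        cache[key] = value          # populate for the next reader
    return value

def update_user(key: str, value):
    """On write: update the DB, then invalidate (not update) the cache entry."""
    db[key] = value
    cache.pop(key, None)            # next read re-fetches fresh data

get_user("user:1")                              # miss → DB → cached
update_user("user:1", {"name": "Grace"})        # cache entry invalidated
fresh = get_user("user:1")                      # re-fetched from DB
```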

Read-Through

Cache is the only data source from app's perspective. On miss, cache itself fetches and caches. Transparent to app. Used by: AWS ElastiCache with DAX.

Write-Through

Every write goes to cache AND DB synchronously. No stale data ever. Write latency doubles. Good for: user preferences, settings (infrequent writes, frequent reads).

Write-Behind (Write-Back)

Write to cache, return immediately, flush to DB asynchronously. Low write latency. Risk: data loss if cache crashes before flush. Good for: likes/views counters, analytics.

Refresh-Ahead

Cache proactively refreshes entries before they expire, based on predicted access patterns. Complex but eliminates cold misses for hot items. Used by: Netflix for content metadata.

Cache Problems & Solutions

| Problem | Description | Real Example | Solutions |
|---|---|---|---|
| Cache Stampede | Many requests miss simultaneously after TTL expiry → DB flooded | Reddit front page after a popular post's cache expires | Mutex/lock on miss, probabilistic early expiry, background refresh |
| Cache Penetration | Requests for keys that will never exist in DB (e.g., invalid IDs) → always miss cache, always hit DB | Attacker querying random non-existent user IDs | Cache null results with short TTL, Bloom filter at gateway |
| Cache Avalanche | Many keys expire at the same time → large spike on DB | Black Friday: product catalog cache all set to 1-hour TTL at deploy time | Jitter on TTL (±20%), persistent cache (no global TTL), staggered deploys |
| Hot Key | One key gets massive traffic (e.g., a celebrity's profile) | Justin Bieber problem on Twitter | Replicate hot key across N cache nodes, local in-process cache for ultra-hot keys |
| Cache Inconsistency | Cache and DB diverge — stale reads | User changes name but sees the old name for 10 min | Write-through, event-driven invalidation, shorter TTL, version keys |
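The avalanche mitigation in the table, TTL jitter, is nearly a one-liner. A sketch assuming a 1-hour base TTL and ±20% spread (both values illustrative):

```python
import random

def jittered_ttl(base: int = 3600, spread: float = 0.2) -> int:
    """Add ±spread random jitter so keys written together don't expire together."""
    return int(base * random.uniform(1 - spread, 1 + spread))

ttls = [jittered_ttl() for _ in range(1000)]
assert all(2880 <= t <= 4320 for t in ttls)   # every TTL within ±20% of 3600s
assert len(set(ttls)) > 100                    # expiries are spread out, not aligned
```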

Redis Data Structures (with Use Cases)

| Structure | Operations | Use Case |
|---|---|---|
| String | GET, SET, INCR, EXPIRE | Sessions, counters, simple cache, rate-limit counters |
| Hash | HGET, HSET, HMGET | User profile fields, product attributes |
| List | LPUSH, RPOP, LRANGE | Activity feed, job queues, chat history |
| Set | SADD, SISMEMBER, SUNION | Unique visitors, tags, friends list |
| Sorted Set (ZSet) | ZADD, ZRANGE, ZRANGEBYSCORE | Leaderboards, rate limiting (sliding window), priority queues |
| Bitmap | SETBIT, GETBIT, BITCOUNT | Daily active users, feature flags per user (512 MB = 4B users) |
| HyperLogLog | PFADD, PFCOUNT | Approximate unique counts with <1% error, tiny memory (12 KB for any cardinality) |
| Geo | GEOADD, GEORADIUS | Nearby drivers, location search |
| Stream | XADD, XREAD, XGROUP | Event log, lightweight Kafka alternative |

API Design Patterns Compared

| Style | Protocol | Payload | Best For | Avoid When |
|---|---|---|---|---|
| REST | HTTP/1.1+ | JSON/XML | Public APIs, simple CRUD, well-understood semantics | Need real-time; complex nested queries |
| GraphQL | HTTP | JSON | Mobile clients (minimise bytes), flexible frontends, aggregating multiple APIs | Simple CRUD; when caching is critical (no GET) |
| gRPC | HTTP/2 | Protobuf (binary) | Internal microservices, high throughput, streaming | Browser clients (needs grpc-web proxy) |
| WebSocket | TCP upgrade from HTTP | Any (text/binary) | Chat, live dashboards, multiplayer games, collaborative editing | Simple request-response; fire-and-forget |
| Server-Sent Events | HTTP | Text | One-way server push: notifications, live feeds, progress updates | Bidirectional; binary data |
| Webhooks | HTTP POST callback | JSON | Event notifications to third parties (Stripe payment events, GitHub PR hooks) | Real-time (client must have a public endpoint) |

REST API Best Practices

// Resource-oriented URLs — nouns, not verbs
GET    /users/42           ✅ get user
POST   /users              ✅ create user
PUT    /users/42           ✅ replace user (idempotent)
PATCH  /users/42           ✅ partial update (not guaranteed idempotent)
DELETE /users/42           ✅ delete user (idempotent)
GET    /getUserInfo?id=42  ❌ verb in URL, not RESTful

// Versioning — always version your API
GET /v1/users/42                              ✅ URL versioning (visible, cacheable)
Accept: application/vnd.api+json; version=2   header versioning

// Pagination for list endpoints
GET /posts?page=2&limit=20          offset pagination (simple, but expensive on large offsets)
GET /posts?cursor=abc123&limit=20   cursor pagination (consistent, scalable)

// Consistent error responses
{ "error": { "code": "USER_NOT_FOUND", "message": "...", "request_id": "abc" } }

Long Polling vs WebSocket vs SSE

Short Polling

Client asks "any updates?" every N seconds. Simple but wasteful — most responses are empty. Good for: email clients checking every 30s.

Long Polling

Client asks, server holds connection open until update arrives, then responds. Client immediately re-polls. Simulates push over HTTP. Good for: notifications when WebSocket not available.

WebSocket

Full-duplex persistent TCP connection after HTTP upgrade handshake. Both client and server can send anytime. Best for: chat, games, live collaboration.

Server-Sent Events

Server pushes events to client over persistent HTTP GET. One-directional. Automatic reconnection built-in. Good for: Twitter live timeline, sports scores, progress bars.

Async communication via queues is the backbone of scalable, resilient systems. It decouples services so they can fail and scale independently.

Why Queues? A Concrete Example

WITHOUT QUEUE (synchronous):
  User → API Server → Image Resizer → Email Service → Analytics → DB
  If any service is slow or down, the whole request fails. User waits 8 seconds.

WITH QUEUE (asynchronous):
  User → API Server → Queue → [Image Resizer]   ← each service picks up when ready
             ↓          └───→ [Email Service]   ← can retry on failure
  Responds in <100ms     └──→ [Analytics]       ← can scale independently

Kafka vs RabbitMQ vs SQS

| Feature | Apache Kafka | RabbitMQ | Amazon SQS |
|---|---|---|---|
| Model | Distributed log (topics/partitions) | Traditional broker (exchanges/queues) | Managed queue service |
| Message retention | Configurable (days/weeks), replayable | Until consumed | 14 days max |
| Throughput | Millions/sec per cluster | 20-50k msg/sec | Unlimited (managed) |
| Ordering | Per partition (strict) | Per queue | Standard: no; FIFO: yes |
| Routing | Simple (topic-based) | Complex (topic, direct, fanout, headers) | Simple |
| Consumer model | Pull; consumer groups share partitions | Push; competing consumers | Pull; long polling |
| Best for | Event sourcing, analytics pipelines, audit logs, stream processing | Task queues, RPC, complex routing, work distribution | Serverless, AWS ecosystem, simple queuing |

Kafka Deep Dive

Topic: "order-events" (partitioned for parallelism)

┌─ Partition 0 ──────────────────────────────────────┐
│ offset: 0  1  2  3  4  5  6  7  8 ...              │ ← append-only log
└────────────────────────────────────────────────────┘
┌─ Partition 1 ──────────────────────────────────────┐
│ offset: 0  1  2  3  4 ...                          │
└────────────────────────────────────────────────────┘

Producer hashes order_id to pick a partition
  → the same order always lands on the same partition → ordered per order

Consumer Group A (payment service):   each instance reads 1 partition
Consumer Group B (analytics service): independent read position, can replay from offset 0
Key insight: Because Kafka stores messages on disk and tracks only offsets (not message acks), adding a new consumer service can replay ALL historical events. This is the foundation of Event Sourcing — your queue IS your audit log.

CAP Theorem — The Real Story

During a network partition (unavoidable in any distributed system), you must choose:

CP System (Consistent + Partition-Tolerant)

Returns an error or waits rather than returning stale data. Sacrifices availability. Examples: HBase, ZooKeeper, etcd, MongoDB (with majority read/write concern). Use when: banking, inventory management (can't oversell).

AP System (Available + Partition-Tolerant)

Returns best available data (possibly stale). Sacrifices consistency. Examples: Cassandra, CouchDB, DynamoDB (default). Use when: social feeds, shopping carts, DNS — stale data is acceptable.

PACELC extension: CAP only considers behaviour during partitions. PACELC adds: even without partitions, there's a Latency vs Consistency trade-off. DynamoDB: PA/EL (available during partitions, low latency otherwise). Spanner: PC/EC (consistent always, at a latency cost).

Distributed Patterns

Circuit Breaker

Prevents calling a failing service. States: Closed (normal) → Open (fail fast after threshold) → Half-Open (probe after timeout). Prevents cascade failures. Used by: Netflix Hystrix, Resilience4j.
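A minimal sketch of the three-state machine described above (class name and thresholds are illustrative, not the Hystrix or Resilience4j API):

```python
import time

class CircuitBreaker:
    """Closed → Open after `failure_threshold` failures → Half-Open after `reset_timeout`."""
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.state = "closed"
        self.opened_at = 0.0

    def call(self, fn):
        if self.state == "open":
            if time.monotonic() - self.opened_at >= self.reset_timeout:
                self.state = "half-open"              # let one probe request through
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self.state = "open"                   # trip: fail fast from now on
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        self.state = "closed"                         # success closes the circuit
        return result

breaker = CircuitBreaker(failure_threshold=2, reset_timeout=60.0)

def failing_dependency():
    raise ConnectionError("downstream unavailable")

for _ in range(2):                                    # two failures trip the breaker
    try:
        breaker.call(failing_dependency)
    except ConnectionError:
        pass
```

After the threshold is reached, further calls raise immediately without touching the failing dependency, giving it time to recover.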

Saga Pattern

Distributed transactions without 2-Phase Commit. Sequence of local transactions; on failure, run compensating transactions backwards. Two flavours: Choreography (events) or Orchestration (central coordinator).

Bulkhead

Isolate failures like watertight compartments in a ship. Separate thread pools per dependency. If payment service hangs, it doesn't exhaust threads for search service.

Retry with Backoff

Retry failed requests with exponential backoff + jitter. Without jitter, all clients retry in sync → stampede. Formula: wait = min(cap, base × 2^attempt) + rand()
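The formula above, wrapped in a retry helper. A sketch (function name and defaults are mine) that retries on ConnectionError:

```python
import random
import time

def call_with_retry(fn, max_attempts: int = 5, base: float = 0.1, cap: float = 30.0):
    """Retry with exponential backoff + jitter: wait = min(cap, base * 2**attempt) + rand."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise                                     # out of attempts: propagate
            wait = min(cap, base * 2 ** attempt) + random.uniform(0, base)
            time.sleep(wait)                              # jitter desynchronises clients

calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "ok"

result = call_with_retry(flaky, base=0.01)   # succeeds on the 3rd attempt
```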

Idempotency Keys

Client sends unique key per logical operation. Server deduplicates using the key. Critical for payment APIs — ensures charging once even if request retried. Stripe, PayPal use this.

Outbox Pattern

Write to DB and outbox table atomically. Separate process reads outbox and publishes to queue. Guarantees at-least-once message delivery without distributed transactions.

Consistent Hashing — Why It Matters

Traditional hash: hash(key) mod N servers
  Add 1 server: mod 4 instead of 3 → ~75% of keys remapped → mass cache invalidation

Consistent hashing: keys and servers on a ring [0, 2^32)
  hash("user:42") = 500 → goes to the first server clockwise → Server B (at pos 600)
  Add Server D (at pos 550):  only keys between 500 and 550 move to D → ~25% remapped
  Remove Server B:            only its keys move to the next server  → ~33% remapped (~1/N)

Virtual nodes (vnodes): each physical server owns multiple points on the ring. Distributes load evenly even with heterogeneous servers.
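A consistent-hash ring with vnodes, small enough to verify the "only ~1/N of keys move" claim empirically (class name and vnode count are illustrative):

```python
import bisect
import hashlib

class HashRing:
    """Consistent hash ring; each physical server owns `vnodes` points on the ring."""
    def __init__(self, servers, vnodes: int = 100):
        self.ring = sorted(
            (self._hash(f"{server}#{v}"), server)
            for server in servers
            for v in range(vnodes)
        )
        self.points = [point for point, _ in self.ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2 ** 32)

    def get(self, key: str) -> str:
        """Route to the first server clockwise from hash(key), wrapping at the end."""
        i = bisect.bisect_right(self.points, self._hash(key)) % len(self.points)
        return self.ring[i][1]

keys = [f"user:{i}" for i in range(1000)]
old_ring = HashRing(["A", "B", "C"])
new_ring = HashRing(["A", "B", "C", "D"])    # add one server
moved = sum(1 for k in keys if old_ring.get(k) != new_ring.get(k))
```

With 4 servers, roughly a quarter of the 1000 keys change owner, and every key that moves lands on the new server D, never shuffling between A, B, and C.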

Storage Types

| Type | Description | Example | Use Case | Latency |
|---|---|---|---|---|
| In-Memory | RAM, volatile | Redis, Memcached | Cache, sessions, real-time leaderboards | ~1 µs |
| Block Storage | Raw volumes, OS-level | AWS EBS, GCP PD | Database storage, OS boot volumes | ~1 ms |
| File Storage (NAS) | Hierarchical filesystem | AWS EFS, NFS | Shared filesystems, CMS media libraries | ~5 ms |
| Object Storage | Flat key-value for blobs | AWS S3, GCS, R2 | Media, backups, ML datasets, static assets | ~50-100 ms |
| Cold/Archive | Tape/glacier, infrequent | S3 Glacier, GCS Archive | Compliance backups, audit logs | Minutes to hours |

CDN Architecture

              Origin Server (US-East)
                 ↑ cache miss (rare)
┌────────────────────────────────────┐
│          CDN Edge Network          │
│  ┌──────────┐   ┌──────────────┐   │
│  │ PoP Dubai│   │ PoP Singapore│   │
│  │  ~10ms   │   │    ~15ms     │   │
│  └──────────┘   └──────────────┘   │
└────────────────────────────────────┘
      ↑ cache hit (~95% of requests)
User in Pakistan        User in Malaysia
  • What to cache on CDN: Static assets (JS, CSS, images), HTML for SPAs, video segments, API responses with low variance
  • What NOT to cache: Personalised responses, payment pages, admin dashboards, frequently changing data
  • Cache-Control headers: max-age=31536000, immutable for versioned assets; no-cache for HTML
  • Cache invalidation: Content-hashed filenames (bundle.a3f9b2.js) are infinitely cacheable and invalidate naturally

Microservices split a system into small, independently deployable services. Not always better than a monolith — understand the trade-offs.

Monolith vs Microservices

| Dimension | Monolith | Microservices |
|---|---|---|
| Deployment | Deploy everything at once | Deploy services independently |
| Scaling | Scale everything or nothing | Scale only bottleneck services |
| Dev velocity (small team) | Fast — no network overhead | Slow — service contracts, infra complexity |
| Dev velocity (large team) | Slow — merge conflicts, coordination | Fast — teams own services independently |
| Failure isolation | One bug can take down everything | Failures contained to a service |
| Data | Shared DB — easy joins | Each service owns its DB — no direct joins |
| Debugging | Simple — single process | Hard — distributed traces needed |
| Start with | Always — extract later if needed | Only when team/scale demands it |
Amazon's rule of thumb: If a service can't be owned by a team that could be fed by 2 pizzas, it's too big. But also: don't start with microservices. Shopify, Stack Overflow, and GitHub run primarily on monoliths at scale.

Service Communication

Synchronous (HTTP/gRPC)

Service A calls Service B and waits for response. Simple, easy to reason about. Problem: if B is slow, A is slow. Creates temporal coupling.

Asynchronous (Queue/Events)

Service A publishes event. Service B processes when ready. No coupling. Service B can be down and catch up later. Problem: harder to trace, eventual consistency.

Service Mesh (Envoy/Istio)

Sidecar proxy handles: mTLS, retries, circuit breaking, distributed tracing, load balancing — transparently without app code changes. Operations team's best friend.

API Gateway

Single entry point: auth, rate limiting, routing, response aggregation, SSL termination. Clients don't need to know internal service topology. Examples: Kong, AWS API Gateway, nginx.

Authentication Patterns

Session Tokens

Server stores session. Client sends session ID cookie. Simple, revocable immediately. Problem: requires shared session store (Redis) for horizontal scaling. Used by: traditional web apps.

JWT (JSON Web Token)

Stateless — server signs payload, no storage needed. Client sends in Authorization header. Problem: can't revoke before expiry without a blocklist. Use short expiry (15 min) + refresh tokens.

OAuth 2.0

Delegated authorization. "Login with Google" flow: user authorises your app to access their Google data. Separate auth server issues tokens. Industry standard for third-party integrations.

API Keys

Long-lived opaque tokens for machine-to-machine auth. Simple, no expiry complexity. Hash before storing (treat like passwords). Rate limit per key. Used by: Stripe, SendGrid, Twilio.

JWT Flow

1. User logs in → server validates credentials
2. Server signs JWT: Header.Payload.Signature
     Payload   = { user_id: 42, role: "admin", exp: now + 15min }
     Signature = HMAC_SHA256(base64(header) + "." + base64(payload), SECRET_KEY)
3. Client stores JWT (memory or httpOnly cookie — NOT localStorage for sensitive apps)
4. Client sends: Authorization: Bearer eyJhbGci...
5. Server verifies the signature (no DB lookup needed!) and checks exp
6. Refresh flow: access token (15 min) + refresh token (7 days, stored in DB, revocable)
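The signing step of the flow above in runnable form, assuming HS256 and a hard-coded demo secret (in production, use a KMS-managed key and a vetted library such as PyJWT):

```python
import base64
import hashlib
import hmac
import json
import time

SECRET = b"demo-secret"  # assumption: symmetric HS256 key, for illustration only

def b64url(data: bytes) -> str:
    """JWTs use unpadded base64url encoding."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def sign_jwt(payload: dict) -> str:
    """Build header.payload.signature exactly as in step 2 above (HS256)."""
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    body = b64url(json.dumps(payload).encode())
    sig = hmac.new(SECRET, f"{header}.{body}".encode(), hashlib.sha256).digest()
    return f"{header}.{body}.{b64url(sig)}"

def verify_jwt(token: str) -> bool:
    """Recompute the signature and compare in constant time — no DB lookup."""
    header, body, sig = token.split(".")
    expected = hmac.new(SECRET, f"{header}.{body}".encode(), hashlib.sha256).digest()
    return hmac.compare_digest(b64url(expected), sig)

token = sign_jwt({"user_id": 42, "role": "admin", "exp": int(time.time()) + 900})
assert verify_jwt(token)
```

Any change to the payload invalidates the signature, which is why a stateless server can trust the claims without a lookup (expiry still has to be checked separately against `exp`).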

Security Essentials Checklist

  • Always use HTTPS — TLS everywhere, including internal services
  • Hash passwords with bcrypt/Argon2 — never SHA1/MD5, never plain text
  • Input validation — validate server-side, never trust client input
  • SQL injection prevention — always use parameterised queries, never string concat
  • Rate limiting on auth endpoints — prevent brute force; lock after N failures
  • Secrets management — AWS Secrets Manager / Vault; never hardcode in code
  • Principle of least privilege — DB user only has SELECT on read-only service
  • CSRF protection — SameSite cookie attribute or CSRF tokens

You can't fix what you can't see. Observability is the ability to infer internal system state from external outputs.

The Three Pillars

Metrics

Aggregated numbers over time. CPU%, RPS, error rate, P99 latency, cache hit rate. Tool: Prometheus (scraping) + Grafana (dashboards). Alert on SLO violations.

Logs

Immutable, timestamped event records. Structured logs (JSON) are searchable. Ship to: ELK Stack (Elasticsearch + Logstash + Kibana) or Loki + Grafana. Include request IDs for correlation.

Traces

Follow a single request across multiple services. Each span records: service, operation, duration, status, tags. Tool: Jaeger, Zipkin, AWS X-Ray. Critical for diagnosing microservice latency.

Key Metrics to Track

| Category | Metrics | Typical Alert Thresholds |
|---|---|---|
| Traffic | Requests/sec, active connections | Alert on a sudden 2× spike or drop |
| Latency | P50, P95, P99 response time | P99 > 500 ms for APIs, P99 > 200 ms for search |
| Errors | 5xx rate, exception count, failed jobs | 5xx rate > 0.1% of traffic |
| Saturation | CPU%, memory%, disk%, queue depth | CPU > 80% sustained, queue > 1M items |
| Database | Query latency, connection pool usage, replication lag | Replica lag > 30 s, pool usage > 90% |
| Cache | Hit rate, eviction rate, memory usage | Hit rate drops below 90% |
| Business | Orders/min, signups/hour, revenue/min | 50% drop in orders → page on-call |
Google's Four Golden Signals: Latency, Traffic, Errors, Saturation. If you monitor only these four, you'll catch most production incidents.

These are the 12 most commonly asked system design problems. For each, understand the key decisions — not just the answer.

URL Shortener (Bitly)

Read: 100:1 Low latency Write: ~100 URLs/sec

User → [API] → Short Code Generator → DB (shortCode → longURL)
                                    → Cache (Redis: shortCode → longURL, TTL 24h)

GET /abc123 → check Redis → hit:  302 redirect
                          → miss: check DB → cache + redirect
  • Key design choice: short code generation. Options: (a) auto-increment ID → base62 encode — simple, sequential but predictable; (b) random UUID → take 7 chars — unpredictable but needs collision check; (c) MD5(longURL) → take 7 chars — deterministic, same URL always same code
  • Schema: id, short_code (indexed), long_url, user_id, created_at, expires_at, click_count
  • Redirect type: 301 Permanent (browser caches → can't track clicks) vs 302 Temporary (always hits server → track every click) — Bitly uses 302
  • Analytics: Async — write click events to Kafka, process in batches, store in Cassandra by (short_code, date)
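Option (a) from the first bullet, auto-increment ID to base62, is a short loop (helper names are mine):

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def base62_encode(n: int) -> str:
    """Encode a non-negative integer DB id as a compact base62 short code."""
    if n == 0:
        return ALPHABET[0]
    out = []
    while n:
        n, rem = divmod(n, 62)
        out.append(ALPHABET[rem])
    return "".join(reversed(out))

def base62_decode(code: str) -> int:
    """Inverse: short code back to the numeric DB id."""
    n = 0
    for ch in code:
        n = n * 62 + ALPHABET.index(ch)
    return n

# 62**7 ≈ 3.5 trillion codes fit in 7 characters
assert base62_decode(base62_encode(123456789)) == 123456789
```

At ~100 URLs/sec the 7-character space lasts for centuries; the predictability drawback the bullet mentions (sequential codes are guessable) still applies.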

Twitter / Social Feed

Fan-out problem Celebrity edge case Read-heavy

HYBRID APPROACH (Twitter's actual strategy):

Regular user posts a tweet:
  Tweet DB → Fan-out Service → pushes to all followers' timeline caches (Redis)
  Read: load the user's pre-computed timeline cache → O(1)

Celebrity (10M followers) posts a tweet:
  Tweet DB only (no fan-out — 10M cache writes would take too long)
  Read: merge the pre-computed timeline with the celebrity's recent tweets fetched live
        → cost scales with the number of celebrities the user follows
  • Timeline storage: Redis Sorted Set per user, score = timestamp, member = tweet_id. Fetch top 200 tweets → hydrate from tweet cache
  • Who is a celebrity? Threshold: > 1M followers = pull model. Twitter uses ~10K as threshold internally
  • Tweet storage: MySQL with Vitess sharding (Twitter's actual stack), sharded by tweet_id
  • Media: Images/videos → S3 → CDN (CloudFront). Never stored in DB.

WhatsApp / Chat System

Real-time Offline handling E2E encryption

Online-to-online:
  Alice → WebSocket → Chat Server A → Message Queue → Chat Server B → WebSocket → Bob

Offline delivery:
  Alice → Chat Server A → Message Queue → Storage DB
        → Push notification (APNs/FCM) → Bob's device
  Bob comes online → WebSocket connects → fetches queued messages → server deletes them from the queue

Group message (100 members):
  Alice → Chat Server → fan-out to 100 members' message queues (or direct delivery if online)
  Large groups (1000+): different strategy — pull model on load
  • Connection management: Each user has WebSocket to one Chat Server. Need to know which server holds which user's connection → store in Redis: user_id → server_id
  • Read receipts: Small event sent back through queue: {type: "read", msg_id: X, user_id: Y, ts: T}
  • Message ordering: Logical timestamps / sequence IDs per conversation. Cassandra for message storage: partition key = conversation_id, cluster key = message_id (time-ordered)
  • E2E encryption: Signal Protocol — keys generated on device, server never sees plaintext

YouTube / Video Platform

Storage-heavy Transcoding pipeline CDN-delivered

Upload pipeline:
  User → Resumable Upload API → Raw Storage (S3) → Kafka: "video.uploaded" event
  → Transcoding Workers (pick up from Kafka)
  → FFmpeg: 360p, 720p, 1080p, 4K in H.264 + VP9
  → Thumbnail generation → store outputs in S3 → Kafka: "video.ready" event
  → CDN pre-warming (popular content pushed to edge)

Watch pipeline:
  User requests video → API returns CDN URLs for each quality tier
  Player (HLS/DASH) picks quality based on bandwidth → requests 2-10s segments from CDN
  → segment not in CDN → CDN fetches from S3 → cached for the next viewers
  • Resumable uploads: Clients upload in chunks (e.g. 5MB). The server tracks progress, and a failed upload resumes from the last acknowledged chunk. The open tus protocol standardises this pattern; YouTube's API implements its own resumable-upload scheme.
  • Adaptive bitrate: HLS/DASH splits video into segments. Manifest file (.m3u8) lists available quality tiers. Player switches dynamically — smooth even on variable connections.
  • View count: Not real-time. Batch-updated from the Kafka stream via Flink, which avoids hot-counter write contention at scale.
  • Recommendation: Separate ML service. Reads from Cassandra (watch history) + Redis (trending). Serves pre-computed recommendations.
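The adaptive-bitrate logic in the watch pipeline reduces to "pick the highest tier that fits the measured bandwidth, with headroom". A minimal sketch; the ladder bitrates and the `safety` margin are illustrative assumptions, not YouTube's real values:

```python
# (quality, required bandwidth in kbit/s) -- illustrative ladder
LADDER = [("360p", 1_000), ("720p", 3_000), ("1080p", 6_000), ("4K", 20_000)]

def pick_quality(measured_kbps, safety=0.8):
    """Highest tier whose bitrate fits within a safety margin of measured bandwidth."""
    budget = measured_kbps * safety
    best = LADDER[0][0]            # always fall back to the lowest tier
    for name, need in LADDER:
        if need <= budget:
            best = name
    return best
```

A real HLS/DASH player re-evaluates this per segment, which is what makes playback smooth on variable connections.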

Uber / Ride-Sharing

Geospatial · Real-time matching · Event-driven

Driver location updates (every 4s): Driver App → Location Service → Redis GEOADD drivers {lng, lat, driver_id}
Rider requests ride: Rider App → Matching Service → GEORADIUS (find drivers within 2km) → score by distance + rating + car type → send offer to top 3 drivers via WebSocket → first to accept wins → trip created → notify rider → track driver on map
Trip events (Kafka): ride.requested → ride.driver_assigned → ride.started → ride.completed → payment.triggered
  • Geohash: Encode (lat, lng) as string. "ww8p1r4t8" → nearby cells share prefix. Range query = prefix search. Redis GEO uses sorted set with geohash score internally.
  • Surge pricing: Real-time supply/demand ratio per geohash cell. Stream processing on Kafka with Flink.
  • ETA calculation: Not just distance — real-time traffic data from map service + historical patterns per time of day.
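The geohash prefix trick can be seen in a minimal encoder (the standard interleaved base32 algorithm; the function name is illustrative). Longitude and latitude bits are interleaved, so truncating the string widens the cell while nearby points keep a shared prefix:

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)

def geohash(lat, lng, precision=9):
    """Encode (lat, lng) by binary-subdividing the ranges, interleaving bits."""
    lat_rng, lng_rng = [-90.0, 90.0], [-180.0, 180.0]
    bits, even, ch, out = 0, True, 0, []
    while len(out) < precision:
        rng, val = (lng_rng, lng) if even else (lat_rng, lat)  # alternate lng/lat
        mid = (rng[0] + rng[1]) / 2
        ch <<= 1
        if val >= mid:
            ch |= 1
            rng[0] = mid        # value in upper half -> bit 1, narrow range up
        else:
            rng[1] = mid        # value in lower half -> bit 0, narrow range down
        even = not even
        bits += 1
        if bits == 5:           # every 5 bits become one base32 character
            out.append(BASE32[ch])
            bits, ch = 0, 0
    return "".join(out)
```

A range query for "drivers near here" then becomes a prefix search over these strings, which is exactly why Redis GEO can store them as sorted-set scores.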

Google Drive / File Storage

Sync across devices · Deduplication · Collaboration

Upload: Client splits the file into chunks (4MB each) → hashes each chunk (SHA-256) and skips any already on the server (deduplication) → uploads only new chunks to S3. Metadata DB: file_id, name, owner, chunks[], version, modified_at
Sync: Client → WebSocket → Notification Service → "file X changed" → client fetches the delta (which chunks changed?) → downloads only those chunks → local reassembly
Conflict resolution: Dropbox uses last-write-wins plus a conflicted copy (e.g. "file (John's conflicted copy 2024)"); Google Docs uses OT (Operational Transformation) / CRDTs for real-time collaborative editing
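The chunk-level deduplication step can be sketched with a content-addressed store. A tiny chunk size is used so the demo is readable; real clients use ~4MB, and `upload` is an illustrative name:

```python
import hashlib

CHUNK = 4    # 4 bytes for the demo; real clients chunk at ~4 MB
store = {}   # content-addressed chunk store: sha256 hex digest -> bytes

def upload(data):
    """Split into fixed-size chunks; send only chunks the server hasn't seen."""
    manifest, new_chunks = [], 0
    for i in range(0, len(data), CHUNK):
        chunk = data[i:i + CHUNK]
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in store:      # dedup: identical chunks are stored once
            store[digest] = chunk
            new_chunks += 1
        manifest.append(digest)      # file metadata references chunks by hash
    return manifest, new_chunks
```

Editing the middle of a file re-uploads only the changed chunks; the metadata DB just stores the new manifest as a new version.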

Rate Limiter Service

Distributed · Sliding window

# Redis-based sliding window rate limiter (Python with redis-py)
import time, uuid
import redis

r = redis.Redis()

def is_allowed(user_id, limit=100, window_secs=60):
    key = f"ratelimit:{user_id}"
    now = int(time.time() * 1000)
    window_start = now - window_secs * 1000
    pipe = r.pipeline()                                  # atomic pipeline
    pipe.zremrangebyscore(key, 0, window_start)          # remove entries outside the window
    pipe.zadd(key, {f"{now}:{uuid.uuid4().hex}": now})   # unique member per request (same-ms requests must not collide)
    pipe.zcard(key)                                      # count requests in window
    pipe.expire(key, window_secs)                        # auto-cleanup of idle keys
    _, _, count, _ = pipe.execute()
    return count <= limit                                # False = reject with 429

A repeatable process for any design question. Interviewers evaluate your thinking process, not just the final diagram.

The 5-Step Framework (45 minutes)

Clarify Requirements (5 min)

Never start designing before asking questions. Functional requirements: what does the system DO? Non-functional: scale, latency SLA, availability, consistency needs, geographic distribution?

Questions to ask: "How many daily active users? What's the read/write ratio? Do we need global distribution? What's the acceptable latency? Does the feed need to be real-time?"

Capacity Estimation (5 min)

Estimate QPS (write + read), storage per day/year, bandwidth. These numbers drive your architecture choices. Don't skip this — it reveals if you need 1 server or 1,000.
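A worked back-of-envelope for a Twitter-like feed, with all inputs as stated assumptions (the numbers are round figures for illustration, not measurements):

```python
# Back-of-envelope capacity estimation (all inputs are assumptions)
dau            = 200_000_000   # daily active users
writes_per_day = 2             # tweets per user per day
read_ratio     = 100           # reads per write (read-heavy feed)
tweet_bytes    = 300           # avg stored size per tweet

write_qps = dau * writes_per_day / 86_400        # seconds per day
peak_write_qps = write_qps * 2                   # rule of thumb: peak ~2x average
read_qps = write_qps * read_ratio
storage_per_day_gb = dau * writes_per_day * tweet_bytes / 1e9

print(f"writes: {write_qps:,.0f}/s (peak {peak_write_qps:,.0f}/s), "
      f"reads: {read_qps:,.0f}/s, storage: {storage_per_day_gb:,.0f} GB/day")
```

~4.6k write QPS says a single well-tuned DB might cope with writes, but ~460k read QPS immediately says "cache plus read replicas": the estimate does real architectural work.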

High-Level Design (10 min)

Draw the major boxes: clients, DNS/CDN, load balancer, app servers, cache, DB, queues, object store. Connect them with arrows and label data flows. Get alignment before diving deep.

Deep Dive (15 min)

Pick the most interesting or complex component. Design the DB schema, API endpoints, or the core algorithm. The interviewer often directs you here. Show depth in the area that matters most.

Scale & Trade-offs (10 min)

Identify bottlenecks. How does the design handle 10× traffic? What are the failure modes? What did you trade off — and why? Great engineers articulate trade-offs, not just solutions.

Standard Component Checklist

Always Consider
  • ☐ DNS + CDN for static assets
  • ☐ Load balancer (L4 or L7)
  • ☐ Stateless app servers (scale-out)
  • ☐ Primary DB + read replicas
  • ☐ Cache layer (Redis)
  • ☐ Async queue for heavy tasks
  • ☐ Object store for media (S3)
Add When Needed
  • ☐ Search index (Elasticsearch)
  • ☐ Auth service / API gateway
  • ☐ Rate limiter
  • ☐ Notification service (push/email)
  • ☐ Monitoring (Prometheus + Grafana)
  • ☐ Distributed tracing (Jaeger)
  • ☐ Feature flags (LaunchDarkly)
Avoid Premature
  • ☐ Microservices for small teams
  • ☐ Multiple DB types unless needed
  • ☐ Custom consensus algorithms
  • ☐ ML pipelines before product-market fit
  • ☐ Multi-region before 1M users

System design is as much about how you think out loud as what you design. This section is about developing the mental models and communication skills that separate great engineers from good ones.

The Mental Model Ladder

Move through these levels for any component you're designing:

1

What does it do? (Function)

State the job in one sentence. Don't overcomplicate. If you can't explain it simply, you don't understand it yet.

Example: "Redis is an in-memory key-value store. I'm using it as a cache to avoid hitting the DB for user sessions."
2

Why this choice? (Trade-offs)

Every choice excludes alternatives. Name what you're trading off. This shows you've thought beyond "I know Redis".

Example: "I chose Redis over Memcached because I need sorted sets for the leaderboard feature, not just strings. Memcached would have been simpler for pure caching."
3

What breaks at scale? (Failure modes)

Every design has weaknesses. Name them before the interviewer does. This demonstrates systems-level thinking.

Example: "Single Redis instance is a SPOF and limited to one machine's RAM. At our scale, I'd shard with Redis Cluster across 6 nodes, giving us failover and ~6× memory."
4

What would you change with 10× load? (Evolution path)

Good designs have a clear growth path. You shouldn't need to redesign from scratch to handle more load.

Example: "At 10× load, the DB becomes the bottleneck. I'd add read replicas first. At 100×, I'd shard by user_id using consistent hashing. At 1000×, consider a NoSQL migration for the hot path."

Articulation Patterns to Practice

The Trade-off Statement

"I'm choosing X over Y because our workload is [read/write/latency] heavy. The downside of X is [Z], which I'll mitigate by [solution]. If requirements change towards [different constraint], Y would be better."

The Scale Escalation

"For our initial scale of [N users], [simple solution] works. As we grow to [10N], the bottleneck becomes [component] because [reason]. I'd address that by [next level solution]."

The Constraint Surfacing

"Before I dive in — can I check: is the primary concern here latency, throughput, or consistency? That'll change my approach significantly. For a payment system I'd prioritise consistency; for a social feed I'd accept eventual consistency for lower latency."

The Failure Acknowledgement

"This design has a weakness: [specific failure mode]. In production I'd add [mitigation]. I'm leaving it simplified for now — should I detail the production-grade version?"

Common Thinking Traps

Trap | What It Looks Like | Fix
Gold plating | Designing a Kubernetes cluster with 12 microservices for a URL shortener MVP | Ask scale first. Match complexity to requirements.
Buzzword dropping | "I'd use blockchain and Kubernetes and Kafka and GraphQL" without reasoning | Every technology should solve a stated problem. Name the problem first.
Skipping the obvious | Jumping to sharding without considering "can a single Postgres handle this?" | Start simple. Postgres handles millions of rows fine with good indexing.
Ignoring failure modes | Designing the happy path only | For every component, ask "what happens when this fails?"
Premature optimisation | "We'll need CDN, multi-region, eventual consistency..." for day-1 product | Design for current scale, show awareness of future scaling steps.
Silent designing | Drawing boxes without explaining why | Think out loud. Narrate your decisions. The interviewer evaluates process, not just output.

Building Intuition: The 5-Whys Drill

When you encounter any design decision, drill into the "why" five levels deep. This builds permanent mental models:

Example Drill

"Why does Uber use consistent hashing for driver location?"

  • Why 1: To distribute driver keys across multiple Redis nodes.
  • Why 2: Because one Redis instance can't hold 5M driver locations.
  • Why 3: Because RAM and write throughput are finite per node. Note the naive answer fails: ~5M drivers × 100 bytes = 500MB, which fits in one machine's RAM easily. The real pressure is write throughput and horizontal scale, not memory.
  • Why 4: Because consistent hashing means adding a new node only remaps 1/N drivers, not all of them.
  • Why 5: Because if adding a node invalidated all keys, you'd get a thundering herd hitting the DB during scaling events — exactly when you need stability most.
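The remapping claim in Why 4 is easy to demonstrate with a toy hash ring (a sketch with virtual nodes, not Uber's implementation; node and key names are made up):

```python
import bisect, hashlib

def _h(s):
    """Stable hash onto the ring (md5 is fine here: distribution, not security)."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

def build_ring(nodes, vnodes=100):
    """Each node gets `vnodes` points on the ring for smoother balance."""
    ring = sorted((_h(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
    return ring, [h for h, _ in ring]

def owner(ring, hashes, key):
    """A key belongs to the first ring point clockwise from its hash."""
    i = bisect.bisect(hashes, _h(key)) % len(hashes)
    return ring[i][1]

keys = [f"driver:{i}" for i in range(10_000)]
ring3, h3 = build_ring(["redis-a", "redis-b", "redis-c"])
ring4, h4 = build_ring(["redis-a", "redis-b", "redis-c", "redis-d"])
moved = sum(owner(ring3, h3, k) != owner(ring4, h4, k) for k in keys)
# roughly 1/4 of keys move, all onto the new node; naive `hash % N` would remap ~3/4
```

Every key that moves lands on the new node, so existing nodes' caches stay warm during a scale-out, which is the point of Why 5.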

Deliberate practice beats passive reading. Here's a structured roadmap from beginner to advanced.

Learning Roadmap

Month 1 — Foundations

Read DDIA Chapters 1–6 (storage, encoding, replication, partitioning). Design a URL shortener and a key-value store from scratch. Watch ByteByteGo's fundamentals playlist. Build latency intuition.

Month 2 — Core Systems

Design Twitter, WhatsApp, and YouTube. Read Dynamo and Bigtable papers. Study consistent hashing, CAP theorem, and caching patterns deeply. Start mock interviews with peers.

Month 3 — Advanced Patterns

Read DDIA Chapters 7–12 (transactions, distributed systems). Design Google Drive, Uber, and a search engine. Study Raft consensus. Read Netflix/Discord/Airbnb engineering blogs.

Month 4+ — Mastery

Design novel systems (TypeAhead/Autocomplete, Notification Service, Ad Click Aggregator). Do 2 timed mock interviews/week. Read primary sources: papers, engineering blogs. Teach concepts to others.

Weekly Practice Template

Recommended Weekly Routine

5 hours/week split

  • Monday (1 hr): Read one chapter of DDIA or one engineering blog post. Take notes.
  • Wednesday (1.5 hr): Design one system from scratch. Timer = 45 min. Then compare to reference solution.
  • Friday (1 hr): Mock interview with a peer. One person designs, one asks clarifying questions.
  • Weekend (1.5 hr): Deep dive into one component from the week's design (e.g., how does Redis cluster work internally?).

The Solo Practice Loop

Pick a system

Use this list: URL Shortener → KV Store → Twitter → Pastebin → Instagram → WhatsApp → Uber → YouTube → Google Drive → TypeAhead → Notification Service → Web Crawler → Distributed Cache

Set a 45-minute timer and design without references

Use Excalidraw or paper. Write out your requirements, estimation, high-level design, and one deep dive. Narrate out loud as if in an interview. This is uncomfortable — that's the point.

Compare with the reference solution

ByteByteGo, Alex Xu's book, or Grokking. Note every component you missed or over-engineered. Write down 3 specific gaps.

Deep dive one gap

Spend 30 minutes understanding one thing you got wrong. Read the relevant section of DDIA, a blog post, or watch a video. Build the mental model from first principles.

Spaced repetition

Revisit each system 1 week later and 1 month later. Can you reproduce the design in 15 minutes? True mastery = fast retrieval, not recognition.

Real Engineering Blog Reading Strategy

Read engineering blogs — but read them analytically, not passively:

  • Before reading: "What problem is this company solving? What's their scale?"
  • During reading: "What trade-off did they make? What alternatives did they consider? What surprised me?"
  • After reading: Write 3 bullet points from memory. If you can't, you didn't absorb it.
Top Blogs to Follow Netflix Tech Blog (streaming, chaos engineering), Discord Engineering (WebSocket at scale), Airbnb Engineering (distributed systems), Cloudflare Blog (networking/CDN), Martin Fowler (patterns), High Scalability (curated case studies).

Know the landscape. In real systems and interviews, naming the right tools (and knowing why) demonstrates production experience.

Diagramming & Design

✏️
Excalidraw
Diagramming
Free, open-source whiteboard with hand-drawn aesthetic. Best for quick architecture diagrams. Collaboration built-in. Ideal for system design interviews.
🔷
draw.io / diagrams.net
Diagramming
Feature-rich free diagramming. AWS/GCP/Azure icon packs included. Integrates with Confluence, Google Drive. Professional look for documentation.
🎨
Figma / FigJam
Whiteboard
FigJam for collaborative whiteboarding. Sticky notes, connectors, vote stickers. Used for design sprints and team architecture reviews. Free tier available.
🏗️
Miro
Whiteboard
Enterprise-grade collaborative board. Architecture templates, flowcharts, user story maps. Best for team workshops. Free tier limited to 3 boards.

Databases & Storage

🐘
PostgreSQL
Relational DB
Most advanced open-source RDBMS. JSONB support, full-text search, window functions, advanced indexing. Default choice for relational data. Scales to tens of TB.
🍃
MongoDB
Document DB
Leading document database. Flexible schemas, Atlas cloud service, built-in replication and sharding. ACID transactions in v4.0+. Great for content/catalog apps.
🔴
Redis
In-Memory / Cache
The Swiss Army knife of system design. Cache, session store, pub/sub, rate limiter, geospatial index, leaderboard, queue. Sub-millisecond latency. Redis Cluster for horizontal scaling.
Cassandra
Wide-Column DB
Masterless, peer-to-peer. No single point of failure. Handles 1M+ writes/sec. Linear horizontal scaling. Used by Netflix, Instagram, Discord (billions of messages). Partition key selection is critical.
🔍
Elasticsearch
Search Engine
Distributed search and analytics. Inverted index, relevance scoring (BM25), fuzzy matching, aggregations. ELK stack for logging. Not a primary DB — sync from main DB.
☁️
Amazon S3
Object Storage
Industry-standard object store. 11 nines durability, unlimited scale, lifecycle policies, versioning. Competitors: GCS, Azure Blob, Cloudflare R2 (no egress fees). Default for media storage.

Message Queues & Streaming

📨
Apache Kafka
Event Streaming
The standard for high-throughput event streaming. Persistent, replayable log. Consumer groups, compaction, exactly-once semantics. Confluent Cloud for managed. Used by Uber, Airbnb, LinkedIn.
🐇
RabbitMQ
Message Broker
AMQP protocol. Exchange types: direct, fanout, topic, headers. Dead letter queues, priority queues, delayed messages. Best for complex routing and task queues.
☁️
Amazon SQS + SNS
Managed Queue
SQS: managed queue (standard + FIFO). SNS: pub/sub fanout to SQS, Lambda, email, SMS. Perfect for serverless architectures. No infrastructure to manage.

Infrastructure & Orchestration

☸️
Kubernetes (K8s)
Container Orchestration
Automates deployment, scaling, and management of containerised apps. Pods, services, deployments, ingress, HPA (auto-scaling). Industry standard for microservices. Managed: EKS, GKE, AKS.
🐳
Docker
Containerisation
Package app + dependencies into portable containers. Consistent environments from dev to prod. Foundation of modern deployment pipelines. Docker Compose for local multi-service development.
🌿
Terraform
Infrastructure as Code
Declarative infra provisioning across AWS, GCP, Azure. Define infrastructure in HCL, version in Git, apply with plan/apply. State management tracks actual vs desired infra.
🔗
Istio / Envoy
Service Mesh
Sidecar proxies for every service. Automatic mTLS, circuit breaking, retries, load balancing, distributed tracing — without changing app code. Envoy is the proxy; Istio is the control plane.

Observability & Monitoring

📊
Prometheus + Grafana
Metrics & Dashboards
Prometheus scrapes metrics endpoints, stores time-series. Grafana visualises with dashboards and alerts. The open-source standard for infrastructure monitoring. PromQL for flexible querying.
🔎
Jaeger / Zipkin
Distributed Tracing
Trace requests across microservices. Visualise call graphs, find latency sources. Jaeger (Uber-created, CNCF) and Zipkin (Twitter-created) both support the OpenTelemetry standard. Datadog APM for managed.
📝
ELK Stack
Log Management
Elasticsearch + Logstash + Kibana. Ingest, transform, search, and visualise logs. Filebeat for shipping. Managed: Elastic Cloud, AWS OpenSearch. Grafana Loki as lightweight alternative.
🔔
PagerDuty / OpsGenie
Incident Management
On-call scheduling, alert routing, escalation policies. Integrates with Prometheus, Datadog, CloudWatch. Reduces alert fatigue with intelligent grouping and suppression.

API Gateways & Load Balancers

🦁
nginx
Load Balancer / Proxy
High-performance HTTP server and reverse proxy. Load balancing, SSL termination, caching, rate limiting. Powers 40%+ of the web. Lightweight and battle-tested.
🦅
Kong / AWS API Gateway
API Gateway
Kong: open-source, plugin-based (auth, rate limiting, logging, caching). AWS API Gateway: managed, integrates natively with Lambda/ECS. Both handle routing, auth, and transformation.
⚖️
HAProxy
Load Balancer
Extremely performant L4/L7 load balancer. Sub-millisecond overhead. Health checks, sticky sessions, ACLs. Used by GitHub, Reddit. Often outperforms nginx for pure LB workloads.

Cloud Providers Quick Reference

Category | AWS | Google Cloud | Azure
Compute | EC2, Lambda, ECS/EKS | GCE, Cloud Run, GKE | VMs, Functions, AKS
Object Storage | S3 | Cloud Storage | Blob Storage
Relational DB | RDS, Aurora | Cloud SQL, Spanner | Azure SQL
NoSQL / Cache | DynamoDB, ElastiCache | Firestore, Memorystore | CosmosDB, Cache for Redis
Queue / Streaming | SQS, Kinesis, MSK | Pub/Sub, Dataflow | Service Bus, Event Hubs
CDN | CloudFront | Cloud CDN | Azure CDN
Load Balancer | ALB / NLB | Cloud Load Balancing | Application Gateway
Search | OpenSearch Service | Vertex AI Search | Azure Cognitive Search

References & Resources

📖 Book — Essential

Designing Data-Intensive Applications

Martin Kleppmann (O'Reilly, 2017). The definitive resource. Deep dives on storage engines, replication, consistency, and stream processing. Read once, reference forever.

dataintensive.net →
📖 Book — Interview

System Design Interview Vol. 1 & 2

Alex Xu. Vol 1 covers 16 systems; Vol 2 covers 13 more complex ones. Step-by-step with diagrams. Best book for interview preparation.

📖 Book — Patterns

Building Microservices (2nd Ed)

Sam Newman (O'Reilly). Comprehensive guide to microservice patterns: decomposition, communication, data management, observability.

🎬 YouTube

ByteByteGo

Alex Xu's channel. Animated explainers of system design concepts and architecture patterns. Free, high quality, beginner-friendly.

youtube.com/@ByteByteGo →
🎬 YouTube

Gaurav Sen

Deep conceptual explanations. Especially good for consistent hashing, messaging systems, and distributed system fundamentals.

youtube.com/@gkcs →
🎬 YouTube

Hussein Nasser

Backend engineering, database internals, proxy servers, networking. Very practical and hands-on. Covers HTTP/2, gRPC, WebSocket deeply.

youtube.com/@hnasr →
📚 Course

Grokking System Design — Educative

Structured course with 20+ systems, interactive diagrams, and design considerations. Good for structured learners.

educative.io →
🌐 GitHub Repo

System Design Primer

Donne Martin's massive open-source resource. 270k+ stars. Covers everything with diagrams, Anki flashcards, and coding problems.

github.com/donnemartin →
🌐 Website

High Scalability Blog

Real-world architecture breakdowns of how companies (Pinterest, Twitter, Netflix) built and scaled their systems.

highscalability.com →
📰 Newsletter

ByteByteGo Newsletter

Weekly system design posts. Covers new topics not in the book. Excellent diagrams. Free tier available.

blog.bytebytego.com →
📄 Paper

Amazon Dynamo (2007)

Foundational paper on highly available key-value storage. Introduced consistent hashing + vector clocks. Basis for Cassandra and DynamoDB.

allthingsdistributed.com →
📄 Paper

Google Bigtable (2006)

Describes the wide-column storage system behind Google's web indexing. Foundation for HBase and Cassandra's data model.

research.google →
📄 Paper

Raft Consensus Algorithm

Understandable consensus. Diego Ongaro and John Ousterhout, 2014. Powers etcd (Kubernetes backbone), CockroachDB, TiKV. More readable than Paxos.

raft.github.io →
📄 Paper

Google Spanner (2012)

Globally distributed SQL database with external consistency. Uses TrueTime (GPS/atomic clocks) to achieve linearisability at global scale.

research.google →
🌐 Blog

Netflix Tech Blog

Chaos engineering, streaming at scale, microservices patterns, Hystrix. Essential reading for resilience and large-scale system design.

netflixtechblog.com →
🌐 Blog

Discord Engineering Blog

WebSocket at millions of concurrent connections, migrating from Cassandra to ScyllaDB, storing billions of messages. Excellent case studies.

discord.com/blog →