The Distributed Systems Roadmap Every Software Engineer Should Follow

Why systems fail, how they scale, and what holds them together. CAP theorem, sharding, message queues, circuit breakers, microservices, and 16 more concepts explained from scratch.

Jun 20, 2026

∙ Paid

Every application you use is a distributed system.

When you open Instagram, your phone sends a request to a load balancer, which routes it to one of hundreds of application servers, which queries a sharded database, checks a distributed cache, and returns a response assembled from data that lives on dozens of machines across multiple data centers.

If any single machine in that chain fails, the system continues working because the other machines pick up the load.

That is a distributed system: multiple computers working together to appear as a single system to the user.

The challenge is that distributed systems introduce problems that do not exist on a single machine.

Networks are unreliable. Clocks on different machines drift.

A database update on one machine might not be visible on another machine for seconds. A server can crash at any moment, including halfway through writing data.

Understanding these problems and their solutions is what system design interviews test.

This guide covers every distributed systems concept you need, starting from the fundamentals and building toward the advanced patterns that appear in staff-level interviews.

Part 1: The Fundamentals

What Happens When a Request Arrives

Before diving into theory, understand the physical reality.

When a user taps a button in an app, a request travels through a specific path. Understanding this path is the foundation of every system design answer.

The request leaves the user’s device and hits DNS, which translates the domain name (api.example.com) into an IP address.

The request reaches a load balancer, which selects one of many application servers.

The application server processes the request: it might read from a cache, query a database, put a message on a queue, or call another service.

The response travels back through the same chain.

For a visual walkthrough of what happens from the moment you click a button to the moment data arrives, that post traces the entire path with latency at each hop.

For the DNS layer specifically, how DNS works in 6 steps covers the resolution process.

The CAP Theorem

The CAP theorem is the most cited and most misunderstood concept in distributed systems. It states that a distributed system can provide at most two of three guarantees simultaneously.

Consistency: Every read returns the most recent write. If you write “balance = $100” and then immediately read, you get $100, regardless of which server handles the read.

Availability: Every request receives a response (not an error), even if some servers are down. The system always answers, though the answer might be stale.

Partition tolerance: The system continues to operate even when the network between servers is broken. Messages between servers are lost or delayed.

The insight most people miss: network partitions are not optional.

In any distributed system, the network will eventually fail.

Since partitions are inevitable, the real choice is between consistency and availability during a partition.

A CP system (like a primary database with synchronous replication) refuses to serve reads during a partition because it cannot guarantee the data is current. It returns an error instead of stale data.

Use CP for bank accounts, inventory counts, and anything where serving wrong data is worse than serving no data.

An AP system (like Cassandra or DynamoDB) continues serving reads during a partition, even though the data might be stale. It prioritizes staying available.

Use AP for social media feeds, product recommendations, and anything where an approximate answer is better than no answer.

For the 10 specific ways candidates misuse the CAP theorem in interviews, that post corrects the misconceptions.

The most common one: candidates say “I choose CP” as if it is a database-level toggle.

In reality, different parts of the same system can make different trade-offs. Your user profile might be AP (eventual consistency is fine) while your payment ledger is CP (strong consistency required).

For the extension that adds latency, the PACELC framework covers what most production systems actually reason about.

Consistency Models

The CAP theorem gives you a binary choice during partitions.

Consistency models give you a spectrum for normal operation.

Strong consistency: After a write completes, every subsequent read from any server returns the updated value.

The system behaves as if there is a single copy of the data. Simple to reason about, expensive to implement because it requires coordination between servers on every write.

Eventual consistency: After a write completes, replicas will converge to the same value eventually (typically within milliseconds to seconds).

Reads during this window might return stale data. Cheaper and faster because writes do not need to wait for all replicas to acknowledge.

Causal consistency: If event A caused event B, any server that has seen B has also seen A.

Unrelated events can appear in any order. This is the sweet spot for many applications: you get the performance of eventual consistency with the guarantee that cause-and-effect relationships are preserved.

If Alice posts a message and then Bob replies to it, every server shows Alice’s message before Bob’s reply. But an unrelated post from Carol might appear before or after Alice’s message on different servers.

In interviews, naming the consistency model for each component of your design is a strong signal.

“The feed uses eventual consistency because stale data for a few seconds is acceptable. The payment ledger uses strong consistency because financial correctness is non-negotiable.”

For the detailed comparison of strong vs eventual vs causal consistency with examples, that post covers each model. For understanding why ‘eventually consistent’ is the most dangerous phrase in system design and the failure modes it hides, that post covers the pitfalls.

Distributed Transactions

What happens when a single operation spans multiple services?

A user places an order: the order service creates the order, the inventory service decrements stock, and the payment service charges the credit card.

If the payment fails after the inventory has already been decremented, you have an inconsistent state: stock is reduced but no payment was collected.

Two-Phase Commit (2PC) solves this by coordinating all participants.

Phase 1: the coordinator asks all participants “can you commit?” Each participant locks its resources and votes yes or no.

Phase 2: if all voted yes, the coordinator tells everyone to commit. If any voted no, everyone rolls back.

The problem: if the coordinator crashes between phases, all participants are stuck holding locks, unable to proceed or roll back. This makes 2PC fragile and slow.

The Saga pattern takes a different approach. Instead of one atomic transaction, a saga is a sequence of local transactions. Each service performs its local transaction and publishes an event.

If a step fails, the saga executes compensating transactions to undo the previous steps.

The order saga: create order → reserve inventory → charge payment.

If payment fails: reverse inventory reservation → cancel order.

The trade-off: the system is temporarily inconsistent between steps, and compensating transactions add complexity.

For the complete walkthrough of distributed transactions, 2PC, and the saga pattern, that post covers the implementation details and when to use each approach.

Part 2: How Systems Communicate

The components of a distributed system need to talk to each other. The communication pattern you choose determines your system’s latency, reliability, and coupling.

Synchronous Communication (Request-Response)

The simplest pattern: Service A sends a request to Service B and waits for a response. REST APIs and gRPC are the most common protocols.

REST uses HTTP with JSON payloads. It is simple, widely supported, and human-readable.

The trade-off is verbosity (JSON is larger than binary formats) and the lack of a schema contract (the client hopes the response matches the expected format).

gRPC uses HTTP/2 with Protocol Buffers (a binary serialization format). It is faster than REST (smaller payloads, multiplexed connections), has a strict schema contract (both sides agree on the data format via .proto files), and supports bidirectional streaming.

The trade-off is complexity (harder to debug than JSON) and browser support (gRPC does not work natively in browsers without a proxy).

For a complete comparison of REST, gRPC, and GraphQL with the decision framework for choosing between them, that post covers each protocol.

The danger of synchronous communication is coupling.

If Service A calls Service B, which calls Service C, a failure in Service C cascades backward: Service B times out, which causes Service A to time out, which causes the user to see an error.

This chain of failures is a cascading failure, and it is the most common way distributed systems go down.

Keep reading with a 7-day free trial

Subscribe to System Design Nuggets to keep reading this post and get 7 days of free access to the full post archives.