Throughput vs. Latency: The System Design Interview Guide
Does high throughput always mean a fast system? Master the concepts of queuing, serialization, tail latency, saturation points, and Little's law.
Designing software that runs successfully on a single machine is a significant achievement.
When you write a function, run it locally, and receive a correct output, the process feels instantaneous.
In this local environment, resources are exclusive, the network is internal, and the data volume is trivial.
However, moving that same logic into a large-scale distributed system introduces a new set of challenges that have nothing to do with code correctness. The primary challenge is performance.
In the context of system design, “performance” is often treated as a vague synonym for “fast.” But for a backend engineer or a system architect, “fast” is an imprecise term.
A system can be “fast” in two completely different ways that are often at odds with one another.
A system might accept millions of data points per second (High Volume) but take ten minutes to process them. Alternatively, a system might return a result in two milliseconds (High Speed) but crash if more than five users access it simultaneously.
Understanding the distinction between these two scenarios is the foundation of scalable architecture. These concepts are defined by Throughput and Latency.
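To make the contrast concrete, here is a minimal back-of-envelope sketch in Python. The numbers are the hypothetical ones from the two scenarios above, and the System B throughput figure is only a rough estimate derived from concurrency divided by per-request latency:

```python
# Back-of-envelope comparison of the two hypothetical systems described above.
# Throughput = work completed per unit time; latency = duration of a single request.

# System A: a batch ingestion pipeline (high volume, slow end-to-end).
a_throughput = 1_000_000        # data points accepted per second
a_latency_s = 10 * 60           # ten minutes until a point is fully processed

# System B: a small synchronous API (fast per request, low capacity).
b_latency_s = 0.002             # 2 ms per request
b_concurrency = 5               # tips over beyond five simultaneous users

# Rough ceiling on System B's throughput: requests in flight / time per request.
b_throughput = b_concurrency / b_latency_s   # ~2,500 requests per second

print(f"System A: {a_throughput:,} items/s throughput, {a_latency_s} s latency")
print(f"System B: {b_throughput:,.0f} req/s throughput, {b_latency_s * 1000:.0f} ms latency")
```

System A wins on throughput by orders of magnitude; System B wins on latency by orders of magnitude. Neither number alone tells you whether the system is "fast."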
For junior developers and candidates preparing for System Design Interviews (SDI), confusing these metrics is a critical error.
This guide will break down these concepts using technical terminology, explain the mechanics of how they interact, and demonstrate why they fundamentally trade off against each other in a resource-constrained environment.
Defining Latency: The Measure of Duration
Latency is a measure of time.
It answers the specific question: “How much time elapses between the start of a request and the completion of that request?”
When a client sends a request to a server, a timer effectively starts.
The request travels across the network, the server processes the logic, retrieves data, and sends a response back. The timer stops when the client receives the final byte of that response. The total duration recorded is the latency.
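In code, that "timer" is simply a clock reading taken before the request is sent and another taken once the response has been fully read. A minimal Python sketch, assuming a placeholder endpoint rather than any real service:

```python
import time
import urllib.request

def measure_latency(url: str) -> float:
    """Return the end-to-end latency of a single request, in milliseconds."""
    start = time.perf_counter()            # timer starts as the request leaves the client
    with urllib.request.urlopen(url) as response:
        response.read()                    # timer keeps running until the final byte arrives
    return (time.perf_counter() - start) * 1000

# Example usage with a placeholder endpoint:
# print(f"Latency: {measure_latency('https://example.com/api/health'):.1f} ms")
```

Note that this measures client-perceived latency: it includes network transit in both directions as well as the server's processing time, not just the work done on the server.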


