System Design Interview: When to Use Heartbeats vs. Health Checks

Why is a "running" server not always a "working" server? We explain the "Zombie Process" problem and how health checks solve what heartbeats miss.

Jan 29, 2026


Building software on a single computer is predictable. You run the code.

If it crashes, you see an error message on your screen. You know immediately that something went wrong because the operating system cleans up the process and reports the failure.

Distributed systems are different.

When you move to large-scale architecture, your code runs on hundreds or thousands of servers. These servers communicate over a network, and this introduces a massive challenge that every system architect must face.

Hardware is unreliable. Networks are unreliable.

When a server fails in a data center, it rarely sends a “Goodbye” message. It rarely tells you it is crashing. It usually just stops. It goes silent.

This silence is dangerous.

Other parts of your system might continue trying to talk to the dead server. They will send requests and wait for answers that will never come. This causes delays. Eventually, these delays can pile up, consume all available resources, and crash the rest of your system.

To prevent this, we need mechanisms to detect failure. We need to know when a server is gone and when a server is there but broken.

This is where Heartbeats and Health Checks come in.

While many beginners use these terms interchangeably, they are distinct concepts. They solve different problems.

Understanding the specific role of each is essential for designing reliable software and for passing system design interviews.

The Core Problem: Uncertainty in the Network

To understand why we need these specific tools, we first need to understand the environment we are working in.

In a distributed system, we can never know the state of a remote server with complete certainty. We can only infer it from the last time we heard from that server.

If Service A sends a request to Service B and does not receive a reply, there are multiple possibilities:

Service B is powered off.
The machine hosting Service B is running, but the application process has crashed.
Service B is running and processing the request, but it is just very slow.
Service B processed the request and sent a reply, but the network lost the reply packet.

From the perspective of Service A, all these scenarios look exactly the same: silence.

We cannot simply wait forever. We need active ways to determine the status of Service B so we can decide whether to retry the request, route it to a different server, or give up and show the user an error.
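
One way to picture that decision is a caller that bounds every wait with a timeout and then retries or moves on. The sketch below is a minimal illustration, not a production client: `call_with_failover`, the `send` callable, and the retry count are all hypothetical names and values, and it assumes that a silent peer surfaces as Python's built-in `TimeoutError`.

```python
from typing import Callable, Sequence


def call_with_failover(
    replicas: Sequence[str],
    send: Callable[[str, float], str],
    timeout_s: float = 1.0,
    retries_per_replica: int = 2,
) -> str:
    """Try each replica in order, bounding every attempt with a timeout.

    `send(replica, timeout_s)` is a hypothetical transport function that
    returns a reply, or raises TimeoutError when no reply arrives in time.
    """
    for replica in replicas:
        for _attempt in range(retries_per_replica):
            try:
                return send(replica, timeout_s)
            except TimeoutError:
                # Silence: the caller cannot tell whether the peer is down,
                # slow, or whether the reply was lost in transit.
                # All it can do is retry or move on to the next replica.
                continue
    # Every replica stayed silent: give up and surface an error.
    raise RuntimeError("all replicas timed out")
```

Because one possible cause of silence is a lost reply to a request that was actually processed, blind retries like this are only safe when the request can be repeated without side effects.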

The Heartbeat

The most basic question in a distributed system is simple: Is the server online?

A Heartbeat is the mechanism used to answer this question. It provides a low-level, fundamental signal that a component is active and reachable.

How It Works

A heartbeat is a periodic signal. It is usually a small data packet containing an identifier and a timestamp.

The logic is based on a timer. The server (the sender) is configured to send this signal at a fixed interval. For example, it might send a signal every five seconds.

The monitoring service or registry (the receiver) listens for these signals.

The receiver keeps a “Time-to-Live” (TTL) countdown for that specific sender.

Every time the signal arrives, the receiver resets the counter. If the signal does not arrive before the counter runs out, the receiver assumes the sender is dead.
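
The receiver's logic can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than a reference implementation: the class name `HeartbeatMonitor`, its method names, and the injectable `clock` parameter are all hypothetical. Instead of decrementing a counter, it stores the time each sender was last heard from and compares that against the TTL, which gives the same result.

```python
import time
from typing import Callable, Dict, List


class HeartbeatMonitor:
    """Receiver side: remembers when each sender was last heard from."""

    def __init__(self, ttl_s: float, clock: Callable[[], float] = time.monotonic):
        self.ttl_s = ttl_s
        self._clock = clock  # injectable so the logic can be tested without waiting
        self._last_seen: Dict[str, float] = {}

    def record(self, sender_id: str) -> None:
        """Call whenever a heartbeat arrives; this resets that sender's countdown."""
        self._last_seen[sender_id] = self._clock()

    def is_alive(self, sender_id: str) -> bool:
        """A sender is presumed alive if it was heard from within the TTL."""
        last = self._last_seen.get(sender_id)
        if last is None:
            return False  # never heard from this sender
        return self._clock() - last <= self.ttl_s

    def dead_senders(self) -> List[str]:
        """Known senders whose countdown has run out."""
        return sorted(s for s in self._last_seen if not self.is_alive(s))
```

A common choice, assumed here rather than taken from the text, is to set the TTL to a few multiples of the send interval (for example, 15 seconds for a 5-second heartbeat) so that a single lost packet does not mark a healthy server as dead.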

The Push Model

Heartbeats typically follow a “push” model.

The server actively pushes the message out to say, “I am here.”

This is efficient for the monitoring system. The monitor does not need to initiate connections to thousands of servers. It just sits there and listens.

If the silence lasts too long, it triggers an alert or updates a registry to say the server is offline.
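
The sender's half of the push model might look like the sketch below. Everything named here is an assumption for illustration: `make_heartbeat`, `push_heartbeats`, the JSON payload layout, and the `send` callable, which stands in for whatever transport (UDP, HTTP, a message queue) a real system would use. The `count` parameter exists only to keep the example finite; a real sender would typically loop on a background thread until shutdown.

```python
import json
import time
from typing import Callable


def make_heartbeat(sender_id: str, now: float) -> bytes:
    """Build a small payload: an identifier and a timestamp."""
    return json.dumps({"id": sender_id, "ts": now}).encode("utf-8")


def push_heartbeats(
    sender_id: str,
    send: Callable[[bytes], None],
    interval_s: float = 5.0,
    count: int = 3,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.time,
) -> None:
    """Push `count` heartbeats, waiting `interval_s` seconds between them."""
    for i in range(count):
        # The server initiates the message; the monitor only listens.
        send(make_heartbeat(sender_id, clock()))
        if i < count - 1:
            sleep(interval_s)
```

Passing `sleep` and `clock` as parameters is a design choice made for this sketch so that the timing can be exercised without real delays.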
