System Design Nuggets

How to Design a Distributed Lock Service (Chubby/ZooKeeper)

Learn how to design a distributed lock service with a step-by-step breakdown. Includes back-of-the-envelope estimation, high-level architecture, core flows, scaling strategies, and bottlenecks.

Arslan Ahmad
May 11, 2026

Here is a detailed, step-by-step design for a Distributed Lock Manager (DLM) similar to Google’s Chubby.

1. Problem Definition and Scope

We are designing a distributed lock service and coordination system. It serves as the “root of trust” for a distributed architecture, allowing thousands of other systems (like databases, load balancers, and job schedulers) to elect leaders and store small, critical configuration files.

  • User Groups: Internal software services and infrastructure systems (not human end-users).

  • Main Actions: Leader election (acquiring a lock), storing metadata (e.g., “who is the primary database?”), and listening for changes (events).

  • Scope:

    • We will focus on coarse-grained locking. Locks are held for long periods (hours or even days), not milliseconds.

    • We will focus on Strong Consistency. All nodes must see the same state; stale data is not acceptable for locking.

    • We will not design a high-throughput message queue or a general-purpose file storage system for large blobs.

2. Clarify Functional Requirements

Must Have

  • Lock Acquisition: Clients can acquire an exclusive lock on a named node (file).

  • Session Management: The system must maintain an active session for each client via heartbeats. If a client dies, its locks are automatically released (see the sketch after this list).

  • Metadata Storage: Clients can read and write small files (under 1MB) organized in a directory hierarchy.

  • Events/Watches: Clients can subscribe to a node and receive notifications when it changes or is deleted.

  • Reliable Reads: Clients can read data with a guarantee that it is the most current version.
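
To make session handling concrete, here is a minimal sketch of how the server side might expire dead sessions and release their ephemeral locks. The timeout value, data structures, and function names are illustrative assumptions, not part of the design itself.

```python
import time

SESSION_TIMEOUT = 40          # seconds without a KeepAlive before a session expires (assumed)

sessions = {}                 # session_id -> last heartbeat timestamp
ephemeral_nodes = {}          # path -> owning session_id

def keep_alive(session_id):
    """Record a heartbeat; the client library calls this every ~10 seconds."""
    sessions[session_id] = time.monotonic()

def expire_dead_sessions():
    """Run periodically on the leader: drop stale sessions and release their locks."""
    now = time.monotonic()
    for session_id, last_seen in list(sessions.items()):
        if now - last_seen > SESSION_TIMEOUT:
            del sessions[session_id]
            for path, owner in list(ephemeral_nodes.items()):
                if owner == session_id:
                    del ephemeral_nodes[path]   # deleting the ephemeral node releases the lock
                    # notify_watchers(path)     # watchers would learn the lock is free
```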

Nice to Have

  • Recursive Locks: Locking a directory covers all nodes inside it.

  • ACLs: Permission controls (who can read/write).

  • Sequencers: A monotonic number returned with the lock to prevent “zombie” writes (fencing).
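
The sequencer is easiest to understand from the resource's side. Below is a minimal sketch of a fencing check, assuming the protected store simply remembers the highest token it has accepted; the class and method names are invented for illustration.

```python
class FencedStore:
    """Illustrative resource (e.g., a storage server) that enforces fencing tokens."""

    def __init__(self):
        self.highest_token = -1
        self.data = {}

    def write(self, key, value, token):
        # Reject any writer presenting a token older than one already seen:
        # a "zombie" that lost the lock also lost the right to the newest token.
        if token < self.highest_token:
            raise PermissionError(f"stale fencing token {token} < {self.highest_token}")
        self.highest_token = token
        self.data[key] = value

store = FencedStore()
store.write("config", "v1", token=7)    # current lock holder succeeds
# store.write("config", "v0", token=5)  # a zombie writer would get PermissionError
```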

3. Clarify Non-Functional Requirements

  • Target Users: 10,000+ active clients (servers) per cluster.

  • Read vs. Write: Extremely Read Heavy (1000:1). Configuration changes and leader failovers are rare; reading the config is frequent.

  • Latency:

    • Writes: Moderate (50ms - 500ms) as they require consensus across multiple nodes.

    • Reads: Very low (< 5ms), ideally served from local client cache.

  • Availability: High (99.99%). If the lock manager is down, the entire datacenter usually stalls.

  • Consistency: Strong (CP in CAP theorem). We prefer the system to be unavailable rather than returning incorrect locking data (Split Brain).

  • Data Size: Small. The entire dataset usually fits in RAM (e.g., < 10GB).

4. Back-of-the-Envelope Estimates

  • Traffic Estimates:

    • Assume 10,000 connected clients.

    • Each client sends a heartbeat (KeepAlive) every 10 seconds.

    • Heartbeat QPS = 10,000 / 10 = 1,000 QPS. This is a steady background load.

    • Read QPS: High (e.g., 50,000 QPS), but mostly served by client-side cache.

    • Write QPS: Low (e.g., 50 - 100 QPS).

  • Storage Estimates:

    • Assume 1 million nodes (files/locks).

    • Average node size = 1 KB.

    • Total Storage = 1,000,000 * 1 KB = 1 GB.

    • Conclusion: We can keep the entire tree in RAM for speed, using disk only for durability (logs).

  • Bandwidth:

    • Payloads are small text/binary. Bandwidth is not a bottleneck; request processing (CPU) is.
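
The same estimates written out as arithmetic, purely as a sanity check:

```python
clients = 10_000
heartbeat_interval_s = 10
heartbeat_qps = clients / heartbeat_interval_s            # 1,000 QPS of background load

nodes = 1_000_000
avg_node_size_kb = 1
total_storage_gb = nodes * avg_node_size_kb / 1_000_000   # ~1 GB, fits easily in RAM

print(heartbeat_qps, total_storage_gb)                    # 1000.0 1.0
```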

5. API Design

We will use a gRPC-style API (or internal RPC) rather than REST, as it fits the session-oriented nature better. The methods below are shown with REST-like paths purely for readability.

1. CreateSession

  • Method: POST /session/create

  • Response: session_id, keep_alive_interval (e.g., 10s).

2. KeepAlive

  • Method: POST /session/renew

  • Params: session_id

  • Response: status (OK/EXPIRED).

3. AcquireLock (Create Ephemeral Node)

  • Method: POST /node/create

  • Params: path (e.g., “/db/leader”), content, mode (EPHEMERAL)

  • Response: status (SUCCESS/ALREADY_EXISTS), sequencer (version).

4. GetNode

  • Method: GET /node

  • Params: path, watch (boolean)

  • Response: content, stat, lock_holder.

  • Behavior: If watch=true, the connection registers a callback. The server will notify the client later if the node changes.

5. SetContent

  • Method: PUT /node

  • Params: path, content, expected_version (for optimistic locking).

  • Response: new_version.
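
Putting the API together, a client-side leader-election loop might look roughly like the sketch below. The dlm object and its method names are hypothetical stand-ins for the client library; only the call sequence follows the API described above.

```python
# `dlm` is a hypothetical client-library handle; the method names mirror the API
# in this section but are not a real SDK. Only the call sequence is the point.
def become_leader(dlm, my_address):
    dlm.create_session()            # the library also starts background KeepAlives
    while True:
        resp = dlm.acquire_lock("/db/leader", content=my_address, mode="EPHEMERAL")
        if resp["status"] == "SUCCESS":
            return resp["sequencer"]          # we are leader; keep the fencing token
        # Lock is taken: read the node with a watch and block until it changes.
        dlm.get_node("/db/leader", watch=True)
        dlm.wait_for_event()                  # fires when the current leader's node goes away
```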

6. High-Level Architecture

Our architecture follows a “Replicated State Machine” model using a consensus protocol like Raft or Paxos.

Client App -> Client Library -> Load Balancer -> DLM Cell (Cluster)

  1. Client Application: The service using the lock (e.g., a MySQL server).

  2. Client Library: A “smart” library linked into the application. It handles:

    • Caching: Stores file content in local RAM.

    • Heartbeats: Automatically sends KeepAlives for the session.

    • Master Redirection: Remembers who the Leader is to avoid unnecessary hops.

  3. DLM Cell (The Cluster): A group of 5 servers.

    • The Leader: One node is elected Leader. All writes and lock acquisitions go here.

    • The Followers (4 nodes): They replicate the Leader’s log. If the Leader dies, they elect a new one.

  4. Consensus Module: The logic (Raft/Paxos) ensuring all 5 nodes agree on the order of operations.

  5. Storage: In-memory tree structure backed by a Write-Ahead Log (WAL) on disk.
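
As a rough illustration of how these pieces fit together, the leader's handling of a single write might look like the simplified sketch below. The follower, WAL, and state-machine objects are placeholders, and real Raft/Paxos involves terms, retries, and leader changes that are omitted here.

```python
def handle_write(entry, followers, wal, state_machine):
    """Greatly simplified leader write path; not a real Raft/Paxos implementation."""
    wal.append(entry)                        # 1. durable in the leader's own log first
    acks = 1                                 # 2. the leader counts as one vote
    for follower in followers:
        if follower.replicate(entry):        #    each follower appends the entry to its log
            acks += 1
    if acks > (len(followers) + 1) // 2:     # 3. committed once a majority of the cell has it
        state_machine.apply(entry)           # 4. only now update the in-memory tree
        return "OK"
    return "RETRY"                           # not committed; the client should retry
```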

7. Data Model

We use a hierarchical tree structure similar to a UNIX filesystem.

Storage Engine: In-Memory Tree

The entire database lives in RAM.

  • Root /

    • ls (Lock Service)

      • global

        • master_db (Node)

          • Content: “10.0.0.5”

          • Type: Ephemeral (Lock)

          • ACL: [Read:All, Write:DB_Admins]

          • Watchers: [Client_A, Client_B]

Why In-Memory?

  • Since the data is small (roughly 1 GB), the whole tree fits comfortably in RAM, which is cheap and provides the fastest access.

  • We do not need complex SQL joins. We just need path lookups (/ls/global/master_db).
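
A minimal sketch of that lookup, assuming each node is a small in-memory object whose children are keyed by name (field names are illustrative):

```python
class Node:
    """One entry in the in-memory tree; field names are illustrative."""
    def __init__(self, content=b"", ephemeral=False):
        self.content = content
        self.ephemeral = ephemeral      # True for lock files tied to a session
        self.watchers = set()           # sessions to notify when this node changes
        self.children = {}              # child name -> Node

root = Node()

def lookup(path):
    """Resolve '/ls/global/master_db' by walking one dictionary per path segment."""
    node = root
    for part in path.strip("/").split("/"):
        node = node.children[part]      # KeyError here means NOT_FOUND
    return node
```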

Durability:

  • WAL (Write Ahead Log): Every change (create, delete, write) is appended to a file on disk before updating the RAM.

  • Snapshot: Every few hours, the RAM state is saved to a file so we can truncate the log.
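
The durability rule boils down to "log first, then RAM." A minimal sketch, assuming a simple JSON-lines log format (the format and names are illustrative):

```python
import json
import os

def apply_durably(log_file, op, apply_to_tree):
    """Append the change to the WAL and fsync before touching the in-memory tree.

    `log_file` is assumed to be open in binary append mode ('ab');
    the JSON-lines format and the function names are made up for illustration.
    """
    log_file.write((json.dumps(op) + "\n").encode())
    log_file.flush()
    os.fsync(log_file.fileno())   # the change now survives a crash
    apply_to_tree(op)             # only after that is the RAM copy updated
```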

8. Core Flows End to End

These flows describe exactly how the Distributed Lock Manager (DLM) handles its three most critical jobs: deciding who is in charge (locking), guaranteeing data accuracy (consistency), and cleaning up failures (session management).

This design prioritizes Strong Consistency (CP). The system would rather be slow or unavailable than lie to a client about who holds a lock.

Flow 1: Leader Election (Acquiring a Lock)

This flow explains how distributed processes coordinate to ensure only one leader exists at a time. It relies on the concept of Ephemeral Nodes (files that disappear when the creator disconnects).
