System Design Nuggets

Designing a Notification Service in 45 Minutes: System Design Interview Guide

A complete step-by-step system design walkthrough for building a scalable notification platform. Covers APIs, queues, prioritization, retries, user preferences, and third-party integrations.

Arslan Ahmad
Dec 06, 2025

1. Problem definition

We are designing a centralized Notification Platform that sits between multiple internal services (producers) and end users (consumers). Its goal is to hide the complexity of delivering messages over multiple channels and to make that delivery reliable.

  • Main User Groups:

    • Internal Services (Producers): Applications like Billing (invoices), Orders (shipping updates), or Social (comments) that trigger alerts.

    • End Users (Consumers): The recipients of the notifications on their devices.

  • Scope:

    • We will cover the backend API, priority queuing, worker processing, user preferences, and integration with third-party providers.

    • Channels: Email, SMS, and Mobile Push (iOS/Android).

    • Out of Scope: Client-side UI implementation and building the actual delivery gateways (we will use vendors like SendGrid, Twilio, FCM).

2. Clarify functional requirements

  • Must Have:

    • Unified API: A single endpoint for internal services to trigger messages without knowing channel specifics.

    • Multi-channel: Support for Email, SMS, and Push Notifications.

    • User Preferences: Users can opt in or out of specific channels (e.g., “No SMS”) or categories (e.g., “No Marketing”).

    • Templating: Support HTML/Text templates with dynamic variable insertion (e.g., “Hi {name}”).

    • Reliability: Retry mechanisms for transient failures from providers.

    • Rate Limiting: Prevent a buggy service from spamming a user (e.g., max 10 SMS per hour).

  • Nice to Have:

    • Prioritization: Critical alerts (OTPs) must be processed before bulk marketing.

    • Deduplication: Prevent sending the same message twice in a short window.

    • Tracking: Capture status updates via webhooks (e.g., Sent, Delivered, Bounced).
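The rate-limiting requirement (e.g., max 10 SMS per hour per user) can be sketched as a sliding-window counter. This is a minimal in-process illustration, not a production design: the class name `SlidingWindowLimiter` and its parameters are hypothetical, and a real deployment would presumably keep the counters in a shared store so that all API servers see the same state.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` events per key within the last `window_s` seconds."""

    def __init__(self, limit=10, window_s=3600, clock=time.monotonic):
        self.limit = limit
        self.window_s = window_s
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self._events = defaultdict(deque)  # key -> timestamps of accepted events

    def allow(self, key):
        now = self.clock()
        events = self._events[key]
        # Drop timestamps that have fallen out of the window.
        while events and now - events[0] >= self.window_s:
            events.popleft()
        if len(events) >= self.limit:
            return False  # over quota: the caller should drop or delay the message
        events.append(now)
        return True


if __name__ == "__main__":
    limiter = SlidingWindowLimiter(limit=10, window_s=3600)
    key = ("u_12345", "sms")  # limit per (user, channel) pair
    print([limiter.allow(key) for _ in range(11)])
```

Keying on a (user, channel) pair keeps an SMS burst from consuming the user's email or push budget.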

3. Clarify non-functional requirements

  • Target Users: 10 million Daily Active Users (DAU).

  • Volume: Average of 5 notifications per user per day = 50 million notifications/day.

  • Traffic Pattern: Bursty. Marketing campaigns might trigger millions of messages in minutes.

  • Latency:

    • Critical (OTP/Security): < 5 seconds p99 (end-to-end).

    • Non-Critical (Marketing): Minutes or hours are acceptable.

  • Availability: 99.99%. If the system is down, users cannot log in (missing OTPs).

  • Consistency: Eventual consistency is acceptable for logs and status tracking. Strong consistency is required for user preferences (immediate opt-out).

  • Data Retention: Keep logs for 30 days for debugging, then archive.

4. Back-of-the-envelope estimates

  • Throughput (QPS):

    • Daily Notifications: 50,000,000.

    • Seconds per day: 86,400.

    • Average QPS: 50,000,000 / 86,400 ≈ 580 requests/sec.

    • Peak QPS: Assume 10x burst during events. Peak ≈ 6,000 QPS.

  • Storage (Logs):

    • Metadata per notification (IDs, status, timestamp): ~1 KB.

    • Daily Storage: 50,000,000 * 1 KB = 50 GB/day.

    • Monthly Storage: 50 GB * 30 = 1.5 TB.

  • Bandwidth:

    • Mostly text/JSON. Bandwidth is negligible compared to video systems. Heavy assets (email images) are hosted on CDNs.
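The arithmetic above can be reproduced in a few lines, which makes it easy to re-run with different assumptions during the interview. The constant names are illustrative, and 1 KB is treated as 1,000 bytes to match the decimal rounding used in the estimates.

```python
DAU = 10_000_000
NOTIFICATIONS_PER_USER_PER_DAY = 5
SECONDS_PER_DAY = 86_400
BURST_FACTOR = 10        # assumed peak-to-average ratio
BYTES_PER_LOG = 1_000    # ~1 KB of metadata per notification
RETENTION_DAYS = 30

daily = DAU * NOTIFICATIONS_PER_USER_PER_DAY
avg_qps = daily / SECONDS_PER_DAY
peak_qps = avg_qps * BURST_FACTOR
daily_gb = daily * BYTES_PER_LOG / 1e9
monthly_tb = daily_gb * RETENTION_DAYS / 1e3

print(f"daily notifications: {daily:,}")
print(f"average QPS: {avg_qps:.0f}")
print(f"peak QPS: {peak_qps:.0f}")
print(f"log storage: {daily_gb:.0f} GB/day, {monthly_tb:.1f} TB per 30 days")
```

The exact results (about 579 and 5,787 QPS) round to the 580 and 6,000 figures used in the estimates.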

5. API design

We will expose a REST API for internal services.

1. Send Notification

  • Method: POST /v1/notifications

  • Request:

    {
      "userId": "u_12345",
      "type": "order_update",
      "priority": "high", // or "low"
      "channels": ["push", "email"], // Optional overrides
      "templateId": "tmpl_order_shipped",
      "data": {
        "orderId": "999",
        "trackingLink": "https://..."
      },
      "idempotencyKey": "uuid_gen_123"
    }
  • Response: 202 Accepted { "notificationId": "n_98765", "status": "queued" }

  • Error: 429 Too Many Requests (if service quota exceeded).
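To make the role of `idempotencyKey` concrete, here is a minimal sketch of how the handler for this endpoint might behave. Several details are assumptions rather than part of the design above: the class name `NotificationApi`, the set of required fields, the default priority of "low", the 400 response for invalid payloads, and the in-memory dictionary standing in for a shared idempotency store.

```python
import uuid

REQUIRED_FIELDS = ("userId", "type", "templateId", "idempotencyKey")  # assumed
VALID_PRIORITIES = {"high", "low"}


class NotificationApi:
    def __init__(self, enqueue):
        self.enqueue = enqueue  # callable(job: dict) that publishes to the queue
        self._seen = {}         # idempotencyKey -> notificationId (in-memory stand-in)

    def send(self, body):
        """Return (http_status, response_body) for POST /v1/notifications."""
        missing = [f for f in REQUIRED_FIELDS if f not in body]
        if missing:
            return 400, {"error": f"missing fields: {', '.join(missing)}"}
        priority = body.get("priority", "low")
        if priority not in VALID_PRIORITIES:
            return 400, {"error": "priority must be 'high' or 'low'"}

        key = body["idempotencyKey"]
        if key in self._seen:
            # Retried request: return the original ID without enqueueing again.
            return 202, {"notificationId": self._seen[key], "status": "queued"}

        notification_id = f"n_{uuid.uuid4().hex[:12]}"
        self.enqueue({**body, "priority": priority, "notificationId": notification_id})
        self._seen[key] = notification_id
        return 202, {"notificationId": notification_id, "status": "queued"}


if __name__ == "__main__":
    queue = []
    api = NotificationApi(queue.append)
    request = {"userId": "u_12345", "type": "order_update",
               "templateId": "tmpl_order_shipped", "idempotencyKey": "uuid_gen_123"}
    print(api.send(request))
    print(api.send(request))  # same notificationId, nothing new enqueued
    print(len(queue))
```

A producer that times out and retries with the same key gets the same `notificationId` back, so the user is not notified twice.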

2. Update Preferences

  • Method: PUT /v1/users/{userId}/preferences

  • Request: {"marketing": {"email": false, "sms": false}}

  • Response: 200 OK
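One way to read the request shape above is as a map of category to per-channel flags. The sketch below shows how such a document could be updated and then consulted at send time. Both function names are hypothetical, and two behaviors are assumptions: the update is merged into the stored preferences rather than replacing them, and a channel with no stored flag counts as opted in.

```python
def apply_update(stored, update):
    """Merge a preferences update into stored preferences, category by category."""
    merged = {category: dict(channels) for category, channels in stored.items()}
    for category, channels in update.items():
        merged.setdefault(category, {}).update(channels)
    return merged


def allowed_channels(prefs, category, requested):
    """Keep only the requested channels the user has not opted out of."""
    category_prefs = prefs.get(category, {})
    # Assumption: a channel with no stored preference counts as opted in.
    return [ch for ch in requested if category_prefs.get(ch, True)]


if __name__ == "__main__":
    prefs = apply_update({}, {"marketing": {"email": False, "sms": False}})
    print(allowed_channels(prefs, "marketing", ["push", "email", "sms"]))
    print(allowed_channels(prefs, "order_update", ["push", "email"]))
```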

3. Register Device

  • Method: POST /v1/devices

  • Request: {"userId": "u_123", "token": "fcm_token_xyz", "platform": "android"}

  • Response: 201 Created
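Device registration is essentially an upsert keyed by the push token. The following is a rough in-memory sketch under that assumption; `DeviceRegistry` and its methods are hypothetical names, and the accepted platform values ("ios", "android") are inferred from the channels in scope.

```python
class DeviceRegistry:
    SUPPORTED_PLATFORMS = {"ios", "android"}  # assumed from the in-scope channels

    def __init__(self):
        self._by_token = {}  # token -> {"userId": ..., "platform": ...}

    def register(self, user_id, token, platform):
        if platform not in self.SUPPORTED_PLATFORMS:
            raise ValueError(f"unsupported platform: {platform}")
        # Keying by token means a re-registration overwrites the previous owner,
        # which covers a phone being handed to a different account.
        self._by_token[token] = {"userId": user_id, "platform": platform}

    def tokens_for(self, user_id):
        """Return every push token currently registered to the user."""
        return sorted(t for t, d in self._by_token.items() if d["userId"] == user_id)


if __name__ == "__main__":
    registry = DeviceRegistry()
    registry.register("u_123", "fcm_token_xyz", "android")
    print(registry.tokens_for("u_123"))
```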

6. High-level architecture

We use an asynchronous Queue-Worker architecture to handle bursts and decouple ingestion from delivery.

Client Svc -> LB -> API Servers -> Message Queues -> Workers -> 3rd Party Providers

  1. Notification API: Lightweight service. Validates payload, authenticates the calling service, checks simple rate limits, and pushes the job to Kafka. Returns 202.

  2. Message Queues (Kafka):

    • High Priority Topic: For OTPs, Security alerts.

    • Low Priority Topic: For Marketing, Newsletters.

    • Separation ensures marketing blasts don’t block password resets.

  3. Notification Workers:

    • Stateless servers that pull messages.

    • Check User Preferences (Cache).

    • Render templates.

    • Call Third-Party APIs (Twilio, SendGrid, FCM).

  4. Webhook Handler: A separate service that receives callbacks from providers (e.g., “Email Delivered”) to update the log status.
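The topic split and the worker steps above can be sketched end to end without a real broker or provider SDK. Everything named here is illustrative: the topic names, the `TEMPLATES` table, the `TransientError` type, the retry count and delays, and the use of the job's `type` as the preference category are all assumptions. Providers are plain callables so that no vendor API is implied.

```python
import time

TOPICS = {"high": "notifications.high", "low": "notifications.low"}  # hypothetical names
TEMPLATES = {"tmpl_order_shipped": "Order {orderId} shipped: {trackingLink}"}  # hypothetical


class TransientError(Exception):
    """Provider failure that is worth retrying (e.g. a timeout or HTTP 5xx)."""


def topic_for(job):
    """Pick the queue topic from the job's priority, defaulting to low."""
    return TOPICS[job.get("priority", "low")]


def send_with_retry(send, message, attempts=3, base_delay_s=0.5, sleep=time.sleep):
    """Call send(message); on TransientError wait base_delay_s * 2**n, then retry."""
    for attempt in range(attempts):
        try:
            return send(message)
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller dead-letter or log the job
            sleep(base_delay_s * 2 ** attempt)


def process(job, prefs, providers, sleep=time.sleep):
    """Worker steps for one job: check preferences, render, deliver per channel."""
    category_prefs = prefs.get(job["type"], {})
    text = TEMPLATES[job["templateId"]].format(**job["data"])
    results = {}
    for channel in job["channels"]:
        if not category_prefs.get(channel, True):  # assumption: default is opted in
            results[channel] = "skipped"
            continue
        send_with_retry(providers[channel], text, sleep=sleep)
        results[channel] = "sent"
    return results


if __name__ == "__main__":
    job = {"type": "order_update", "priority": "high", "channels": ["push", "email"],
           "templateId": "tmpl_order_shipped",
           "data": {"orderId": "999", "trackingLink": "https://example.test/track"}}
    outbox = []
    print(topic_for(job))
    print(process(job, {}, {"push": outbox.append, "email": outbox.append}))
```

Exponential backoff keeps a struggling provider from being hammered by immediate retries, and giving up after a bounded number of attempts keeps one bad job from stalling the partition.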
