Designing a Notification Service in 45 Minutes: System Design Interview Guide

A complete step-by-step system design walkthrough for building a scalable notification platform. Covers APIs, queues, prioritization, retries, user preferences, and third-party integrations.

Dec 06, 2025

1. Problem definition

We are designing a centralized Notification Platform that acts as middleware between multiple internal services (producers) and end users (recipients). Its goal is to hide the complexity of delivering messages over different channels and to make delivery reliable.

Main User Groups:
- Internal Services (Producers): Applications like Billing (invoices), Orders (shipping updates), or Social (comments) that trigger alerts.
- End Users (Consumers): The recipients of the notifications on their devices.
Scope:
- We will cover the backend API, priority queuing, worker processing, user preferences, and integration with third-party providers.
- Channels: Email, SMS, and Mobile Push (iOS/Android).
- Out of Scope: Client-side UI implementation and building the actual delivery gateways (we will use vendors like SendGrid, Twilio, FCM).

2. Clarify functional requirements

Must Have:
- Unified API: A single endpoint for internal services to trigger messages without knowing channel specifics.
- Multi-channel: Support for Email, SMS, and Push Notifications.
- User Preferences: Users can opt in or out of specific channels (e.g., “No SMS”) or categories (e.g., “No Marketing”).
- Templating: Support HTML/Text templates with dynamic variable insertion (e.g., “Hi {name}”).
- Reliability: Retry mechanisms for transient failures from providers.
- Rate Limiting: Prevent a buggy service from spamming a user (e.g., max 10 SMS per hour).
Nice to Have:
- Prioritization: Critical alerts (OTPs) must be processed before bulk marketing.
- Deduplication: Prevent sending the same message twice in a short window.
- Tracking: Capture status updates via webhooks (e.g., Sent, Delivered, Bounced).
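
The rate-limiting requirement above (e.g., max 10 SMS per hour) can be illustrated with a sliding-window counter. This is a minimal in-process sketch, not the article's implementation: the class name `SlidingWindowLimiter`, the key format, and the injectable `clock` are assumptions for illustration, and a real deployment would likely keep the counters in a shared store so that all API servers see the same counts.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` events per key within `window_seconds` (hypothetical helper)."""

    def __init__(self, limit: int, window_seconds: float, clock=time.monotonic):
        self.limit = limit
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the window can be tested without sleeping
        self._events = defaultdict(deque)  # key -> timestamps of allowed sends

    def allow(self, key: str) -> bool:
        now = self.clock()
        events = self._events[key]
        # Drop timestamps that have fallen out of the window.
        while events and now - events[0] >= self.window_seconds:
            events.popleft()
        if len(events) >= self.limit:
            return False
        events.append(now)
        return True


# Usage: key by user and channel so that SMS and email are limited separately.
sms_limiter = SlidingWindowLimiter(limit=10, window_seconds=3600)
if not sms_limiter.allow("u_12345:sms"):
    print("drop or delay this SMS")
```

Keying by user and channel keeps a noisy producer from exhausting a user's SMS budget while still letting email through.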

3. Clarify non-functional requirements

Target Users: 10 million Daily Active Users (DAU).
Volume: An average of 5 notifications per user per day × 10 million users = 50 million notifications/day.
Traffic Pattern: Bursty. Marketing campaigns might trigger millions of messages in minutes.
Latency:
- Critical (OTP/Security): < 5 seconds p99 (end-to-end).
- Non-Critical (Marketing): Minutes or hours are acceptable.
Availability: 99.99% (roughly 52 minutes of downtime per year). If the platform is down, OTPs are not delivered and users cannot log in.
Consistency: Eventual consistency is acceptable for logs and status tracking. Strong consistency is required for user preferences (immediate opt-out).
Data Retention: Keep logs for 30 days for debugging, then archive.

4. Back-of-the-envelope estimates

Throughput (QPS):
- Daily Notifications: 50,000,000.
- Seconds per day: 86,400.
- Average QPS: 50,000,000 / 86,400 ≈ 580 requests/sec.
- Peak QPS: Assume a 10x burst during events: 580 × 10 ≈ 5,800, rounded up to ≈ 6,000 QPS.
Storage (Logs):
- Metadata per notification (IDs, status, timestamp): ~1 KB.
- Daily Storage: 50,000,000 * 1 KB = 50 GB/day.
- Monthly Storage: 50 GB * 30 = 1.5 TB.
Bandwidth:
- Mostly text/JSON. Bandwidth is negligible compared to video systems. Heavy assets (email images) are hosted on CDNs.
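
The arithmetic above can be reproduced in a few lines, which makes it easy to rerun with different assumptions. The constant names are illustrative; the values are the ones stated in this section (1 KB is treated as 1,000 bytes, matching the 50 GB/day figure).

```python
DAU = 10_000_000
NOTIFICATIONS_PER_USER_PER_DAY = 5
SECONDS_PER_DAY = 86_400
BURST_FACTOR = 10          # assumed peak-to-average ratio
BYTES_PER_LOG = 1_000      # ~1 KB of metadata per notification
RETENTION_DAYS = 30

daily = DAU * NOTIFICATIONS_PER_USER_PER_DAY
avg_qps = daily / SECONDS_PER_DAY
peak_qps = avg_qps * BURST_FACTOR
daily_gb = daily * BYTES_PER_LOG / 1e9
retained_tb = daily_gb * RETENTION_DAYS / 1e3

print(f"{daily:,} notifications/day")
print(f"average {avg_qps:.0f} QPS, peak {peak_qps:.0f} QPS")
print(f"{daily_gb:.0f} GB/day, {retained_tb:.1f} TB over {RETENTION_DAYS} days")
```

The exact average is about 579 QPS, which the estimate rounds to 580 and, with the 10x burst, to roughly 6,000 QPS at peak.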

5. API design

We will expose a REST API for internal services.

1. Send Notification

Method: POST /v1/notifications
Request:
{
  "userId": "u_12345",
  "type": "order_update",
  "priority": "high",
  "channels": ["push", "email"],
  "templateId": "tmpl_order_shipped",
  "data": {
    "orderId": "999",
    "trackingLink": "https://..."
  },
  "idempotencyKey": "uuid_gen_123"
}

Here "priority" is "high" or "low", and "channels" is an optional override.

Response: 202 Accepted {"notificationId": "n_98765", "status": "queued"}
Error: 429 Too Many Requests (if service quota exceeded).
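
To show how the idempotencyKey might be used on the server side, here is a minimal in-memory sketch of the accept path. It is an assumption-heavy illustration: the `NotificationIntake` class, the choice of required fields, and the 400 response for a malformed payload are not part of the API described above, and a list stands in for the message queue.

```python
import uuid


class NotificationIntake:
    """Hypothetical accept path: validate, deduplicate by idempotencyKey, enqueue."""

    REQUIRED_FIELDS = ("userId", "type", "templateId", "idempotencyKey")  # assumed

    def __init__(self):
        self._by_key = {}  # idempotencyKey -> notificationId
        self.queue = []    # stand-in for the message queue

    def submit(self, request: dict) -> tuple[int, dict]:
        missing = [f for f in self.REQUIRED_FIELDS if f not in request]
        if missing:
            # Assumed error shape; the API above only documents 202 and 429.
            return 400, {"error": f"missing fields: {', '.join(missing)}"}

        key = request["idempotencyKey"]
        if key in self._by_key:
            # A retried request returns the original ID and is not enqueued again.
            return 202, {"notificationId": self._by_key[key], "status": "queued"}

        notification_id = f"n_{uuid.uuid4().hex[:12]}"
        self._by_key[key] = notification_id
        self.queue.append({**request, "notificationId": notification_id})
        return 202, {"notificationId": notification_id, "status": "queued"}
```

With this shape, a producer that times out and retries the same request gets the same notificationId back, and only one job reaches the queue.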

2. Update Preferences

Method: PUT /v1/users/{userId}/preferences
Request: {"marketing": {"email": false, "sms": false}}
Response: 200 OK
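
As a sketch of how a stored preference document of this shape could be applied at send time, the hypothetical helper below filters the requested channels for a given category. Treating a missing category or channel as opted in is an assumption; a stricter product might default to opted out for marketing.

```python
def allowed_channels(preferences: dict, category: str, requested: list[str]) -> list[str]:
    """Return the requested channels the user has not opted out of for this category."""
    category_prefs = preferences.get(category, {})
    # Assumption: no stored value means the user is still opted in.
    return [channel for channel in requested if category_prefs.get(channel, True)]


prefs = {"marketing": {"email": False, "sms": False}}
print(allowed_channels(prefs, "marketing", ["push", "email", "sms"]))
print(allowed_channels(prefs, "order_update", ["push", "email"]))
```

In this example the marketing message can only go out as a push notification, while the order update keeps both requested channels.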

3. Register Device

Method: POST /v1/devices
Request: {"userId": "u_123", "token": "fcm_token_xyz", "platform": "android"}
Response: 201 Created

6. High-level architecture

We use an asynchronous Queue-Worker architecture to handle bursts and decouple ingestion from delivery.

Client Svc -> LB -> API Servers -> Message Queues -> Workers -> 3rd Party Providers

Notification API: Lightweight service. Validates payload, authenticates the calling service, checks simple rate limits, and pushes the job to Kafka. Returns 202.
Message Queues (Kafka):
- High Priority Topic: For OTPs, Security alerts.
- Low Priority Topic: For Marketing, Newsletters.
- Separation ensures marketing blasts don’t block password resets.
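
The topic separation can be sketched with an in-memory stand-in for Kafka. The topic names, the `InMemoryBroker` class, and the strict "high first" polling policy are all assumptions made for illustration; running a dedicated worker pool per topic is an equally plausible way to get the same isolation.

```python
from collections import deque

# Hypothetical topic names; the design only calls for one high and one low priority topic.
HIGH_PRIORITY_TOPIC = "notifications.high"
LOW_PRIORITY_TOPIC = "notifications.low"


def topic_for(priority: str) -> str:
    """Route a job to a topic; anything not marked "high" is treated as low priority."""
    return HIGH_PRIORITY_TOPIC if priority == "high" else LOW_PRIORITY_TOPIC


class InMemoryBroker:
    """Stand-in for Kafka: one FIFO queue per topic."""

    def __init__(self):
        self.topics = {HIGH_PRIORITY_TOPIC: deque(), LOW_PRIORITY_TOPIC: deque()}

    def publish(self, job: dict) -> None:
        self.topics[topic_for(job.get("priority", "low"))].append(job)

    def next_job(self):
        """Drain the high priority topic first (an assumed strict-priority policy)."""
        for topic in (HIGH_PRIORITY_TOPIC, LOW_PRIORITY_TOPIC):
            if self.topics[topic]:
                return self.topics[topic].popleft()
        return None
```

Even if a marketing blast is published first, an OTP published afterwards is the next job a worker picks up.
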
Notification Workers:
- Stateless servers that pull messages.
- Check User Preferences (Cache).
- Render templates.
- Call Third-Party APIs (Twilio, SendGrid, FCM).
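
Two of the worker steps above, template rendering and the provider call with retries, can be sketched as follows. This is a minimal illustration under stated assumptions: `TransientProviderError`, `send_with_retry`, and the backoff parameters are hypothetical, the `send` callable stands in for a real vendor client, and `{name}` placeholders are rendered with Python's built-in string formatting rather than a full template engine.

```python
import random
import time


class TransientProviderError(Exception):
    """Hypothetical error type: a provider timeout or 5xx that is worth retrying."""


def render(template: str, data: dict) -> str:
    """Fill {placeholders} in the template; raises KeyError if a variable is missing."""
    return template.format_map(data)


def send_with_retry(send, payload, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call send(payload), retrying transient failures with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return send(payload)
        except TransientProviderError:
            if attempt == max_attempts:
                raise  # give up; the caller can route the job to a dead-letter queue
            # Full jitter: wait a random time up to base_delay * 2^(attempt - 1).
            sleep(random.uniform(0, base_delay * 2 ** (attempt - 1)))
```

The jitter spreads retries out so that many workers hitting the same provider outage do not all retry at the same instant.
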
Webhook Handler: A separate service that receives callbacks from providers (e.g., “Email Delivered”) to update the log status.
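
Provider callbacks can arrive late or out of order, so a webhook handler usually needs a rule for which updates to accept. The sketch below is one possible rule, not something the design above specifies: the status ranking and the `apply_callback` helper are assumptions, and a plain dict stands in for the notification log.

```python
# Assumed ordering: a status may only move forward, never back.
STATUS_RANK = {"queued": 0, "sent": 1, "delivered": 2, "bounced": 2}


def apply_callback(log: dict, notification_id: str, status: str) -> bool:
    """Update the log if the callback advances the status; return True if it changed."""
    if status not in STATUS_RANK:
        raise ValueError(f"unknown status: {status}")
    current = log.get(notification_id, "queued")
    if STATUS_RANK[status] <= STATUS_RANK[current]:
        return False  # duplicate or stale callback; keep the current status
    log[notification_id] = status
    return True
```

With this rule, a "sent" callback that arrives after "delivered" is ignored, which fits the eventual-consistency requirement for status tracking.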

System Design Nuggets