Designing Instagram
📸 Instagram System Design: Notes
🧭 1) Problem Statement
Design a social media platform like Instagram where users can:
- Create accounts, follow others
- Upload photos/videos, view feeds, like/comment/save/share
- Post Stories (24h) and Reels (short video)
- Message (DMs), search users/hashtags/places
- Receive notifications
Non-functional goals
- High availability, low latency UI
- Horizontal scalability (hundreds of millions of users)
- Durable media storage, cost-efficient delivery via CDN
- Privacy/security, abuse prevention
⚙️ 2) High-Level Architecture

                 Clients (iOS/Android/Web)
                            |
                    API Gateway / LB
                            |
    +----------------- Core Services -----------------+
    | Auth | User  | Social Graph | Feed   | Media    |
    | Post | Story | Comment | Like | Search | Notify |
    +-------------------------------------------------+
            |                  |                   |
    Message Bus (Kafka)    Datastores        Media Pipeline
            |                  |                   |
    Stream processors    SQL/NoSQL/Cache   Ingest → Transcode → Store
            |                                      |
       Analytics &                                 |
    Ranking (Flink/Beam)                  CDN + Object Storage

🧩 3) Core Services & Responsibilities
| Service | Responsibilities |
|---|---|
| Auth | Sign-up/sign-in, OAuth/JWT, sessions, rate limiters |
| User/Profile | Profiles, settings, privacy flags, blocks |
| Social Graph | Follow/unfollow, follower counts, block lists |
| Post/Media | Upload, metadata, tagging, locations, reels |
| Story | 24h TTL posts, highlights, viewer list |
| Feed/Ranking | Home timeline construction + ranking |
| Interaction | Likes, comments, saves, shares, counters |
| Search/Explore | Users, hashtags, places, trend detection |
| Notification | Follows, likes/comments, mentions, async fanout |
| DM | Real-time messaging, read receipts, media in chat |
| Moderation | Safety, spam/scam detection, reporting, ML |
| Analytics | Events, aggregates (DAU, retention, CTR), A/B tests |
💾 4) Data Model (Simplified)
SQL (transactional)
- `users(id, handle, name, bio, is_private, created_at, ...)`
- `follows(follower_id, followee_id, created_at, status)` (status for requested/approved when private)
- `posts(id, author_id, media_id, caption, location_id, created_at, visibility, ...)`
- `comments(id, post_id, author_id, text, created_at, parent_comment_id NULL)`
- `likes(user_id, post_id, created_at)` (composite PK; also per-comment likes)
- `stories(id, author_id, media_id, created_at, expires_at)`
- `saves(user_id, post_id, collection_id NULL, created_at)`
- `hashtags(id, tag)` + `post_hashtags(post_id, hashtag_id)`
- `locations(id, name, lat, lon)`
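As a rough illustration of why the `likes` table uses a composite primary key, the sketch below uses Python's built-in `sqlite3` module with an in-memory database. The helper names `make_db` and `like` are hypothetical, and SQLite stands in for whatever transactional store is actually used; the point is only that `(user_id, post_id)` as the key makes a repeated like a no-op.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with a simplified likes table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE likes (
            user_id    INTEGER NOT NULL,
            post_id    INTEGER NOT NULL,
            created_at TEXT    NOT NULL DEFAULT CURRENT_TIMESTAMP,
            PRIMARY KEY (user_id, post_id)  -- one like per user per post
        )
        """
    )
    return conn


def like(conn: sqlite3.Connection, user_id: int, post_id: int) -> bool:
    """Record a like; return True if it was new, False if it already existed."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO likes (user_id, post_id) VALUES (?, ?)",
        (user_id, post_id),
    )
    return cur.rowcount == 1


def like_count(conn: sqlite3.Connection, post_id: int) -> int:
    """Count likes for a post directly from the table (no cached counter)."""
    row = conn.execute(
        "SELECT COUNT(*) FROM likes WHERE post_id = ?", (post_id,)
    ).fetchone()
    return row[0]
```

In a real deployment the count would usually come from a cached counter rather than `COUNT(*)`; the query here is only for checking the behaviour.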
NoSQL / KV
- Timelines: `home_timeline:{user_id}` → list of (post_id, score, ts)
- Story trays: `stories_tray:{user_id}` → list of story ids (TTL)
- Counters: post like/comment counts (atomic increments)
- DM: chat threads + messages (append-only, partitioned by thread)
Object Storage (media)
- Original + transcoded renditions (images: different sizes, videos: ABR HLS/DASH)
- Thumbnails, previews, story sprites
Search Index (Elasticsearch/OpenSearch)
- Users (handle/name), hashtags, captions, places
- Features for ranking: engagement, freshness, locale
📦 5) Media Upload & Processing
- Client gets pre-signed URL → uploads directly to object storage (bypass app servers).
- Ingest event to Kafka → Transcode via FFmpeg workers (multi-bitrate, keyframe alignment).
- Generate thumbnails, story sprites, extract EXIF, run safety/NSFW checks.
- Store renditions; write media metadata to DB; publish "ready" event.
- CDN invalidation/warm-up for hot content.
Reels: short vertical video, additional music/audio mixing, on-device pre-compression, lightweight editing timelines.
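The pre-signed upload step can be sketched as follows. This is a minimal, hypothetical HMAC scheme written with the standard library only; it is not the signing format of any real object store (for example, it is not S3's Signature Version 4). The secret, the `uploads.example.com` host, and the function names are all assumptions made for illustration.

```python
import hashlib
import hmac
import time
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlparse

SECRET = b"demo-secret"  # hypothetical signing key held by the backend


def _signature(object_key: str, expires: int) -> str:
    """HMAC over the method, object key, and expiry timestamp."""
    payload = f"PUT\n{object_key}\n{expires}".encode()
    return hmac.new(SECRET, payload, hashlib.sha256).hexdigest()


def presign_upload(object_key: str, ttl_s: int, now: Optional[float] = None) -> str:
    """Return a time-limited URL the client can PUT media to directly."""
    current = time.time() if now is None else now
    expires = int(current + ttl_s)
    query = urlencode({"expires": expires, "sig": _signature(object_key, expires)})
    return f"https://uploads.example.com/{object_key}?{query}"


def verify_upload(url: str, now: Optional[float] = None) -> bool:
    """Storage-side check: signature matches and the URL has not expired."""
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    try:
        expires = int(params["expires"][0])
        sig = params["sig"][0]
    except (KeyError, ValueError):
        return False
    current = time.time() if now is None else now
    if current > expires:
        return False
    object_key = parsed.path.lstrip("/")
    return hmac.compare_digest(sig, _signature(object_key, expires))
```

The application server only signs; the media bytes never pass through it, which is the reason for the direct-to-storage upload described above.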
📰 6) Feed Generation: Fan-out vs Fan-in
Options
| Strategy | Idea | Pros | Cons |
|---|---|---|---|
| Fan-out on write | Push new post IDs into followers' home timelines at publish time | Fast feed reads; simpler ranking at read time | Heavy write amplification for celebs; backfills for late followers |
| Fan-in on read | Build feed by querying followed users' latest posts at request | Lower write cost | Expensive reads, higher tail latency |
| Hybrid | Fan-out for normal users; fan-in/cache for high-degree nodes | Balanced | Complexity, tuning needed |
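A minimal in-memory sketch of the hybrid strategy from the table above. The class name `HybridFeed` and the tiny `CELEB_THRESHOLD` are hypothetical; a real system would keep timelines in a KV store and pick the threshold from measured write cost.

```python
from collections import defaultdict

CELEB_THRESHOLD = 3  # hypothetical cutoff; deliberately tiny for the demo


class HybridFeed:
    def __init__(self):
        self.followers = defaultdict(set)         # author -> follower ids
        self.following = defaultdict(set)         # user -> followed author ids
        self.timelines = defaultdict(list)        # user -> [(ts, post_id)] pushed at write
        self.posts_by_author = defaultdict(list)  # author -> [(ts, post_id)]

    def follow(self, user, author):
        self.followers[author].add(user)
        self.following[user].add(author)

    def is_celebrity(self, author):
        return len(self.followers[author]) >= CELEB_THRESHOLD

    def publish(self, author, post_id, ts):
        self.posts_by_author[author].append((ts, post_id))
        # Fan-out on write only for normal accounts.
        if not self.is_celebrity(author):
            for follower in self.followers[author]:
                self.timelines[follower].append((ts, post_id))

    def read_feed(self, user, limit=10):
        items = list(self.timelines[user])
        # Fan-in on read for high-degree accounts.
        for author in self.following[user]:
            if self.is_celebrity(author):
                items.extend(self.posts_by_author[author])
        # De-duplicate in case an author crossed the threshold after fan-out.
        seen, feed = set(), []
        for ts, post_id in sorted(items, reverse=True):
            if post_id not in seen:
                seen.add(post_id)
                feed.append(post_id)
        return feed[:limit]
```

Reads stay cheap for most users because only the few celebrity accounts they follow are queried at request time.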
Ranking: ML model scoring by freshness, affinity, engagement likelihood, content quality, diversity, safety.
- Online features in Redis (user × creator affinity, recent interactions).
- Offline aggregates via Flink/Spark (7-day engagement rates).
- Two-stage: candidate retrieval → lightweight re-rank.
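The two-stage shape can be illustrated with a toy scorer. The formula below is a hand-written stand-in for the ML model, not a real ranking function: the half-life, the way affinity and engagement are combined, and the `Candidate` fields are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    post_id: str
    author_id: str
    age_hours: float
    engagement_rate: float  # e.g. an offline 7-day aggregate in [0, 1]


def score(c: Candidate, affinity: dict, half_life_hours: float = 6.0) -> float:
    """Toy score: exponential time decay boosted by affinity and engagement."""
    freshness = 0.5 ** (c.age_hours / half_life_hours)
    user_affinity = affinity.get(c.author_id, 0.0)  # online feature lookup
    return freshness * (1.0 + user_affinity) * (1.0 + c.engagement_rate)


def rank(candidates, affinity, retrieve_k=500, top_n=20):
    # Stage 1: cheap candidate retrieval (here: simply the most recent posts).
    retrieved = sorted(candidates, key=lambda c: c.age_hours)[:retrieve_k]
    # Stage 2: re-rank the smaller set with the more expensive scorer.
    reranked = sorted(retrieved, key=lambda c: score(c, affinity), reverse=True)
    return [c.post_id for c in reranked[:top_n]]
```

Keeping stage 1 cheap bounds how many items the scorer has to evaluate per request.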
⏳ 7) Stories
- TTL 24h; store metadata in SQL, content in object storage; Redis keys with expiry for trays.
- Viewer list as append-only log (cap at recent N).
- Privacy: only followers (if private), block lists enforced in story read path.
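A small sketch of the story read path under the rules above: stories expire after 24 hours and block lists are enforced when the tray is built. It uses a plain dictionary where the notes suggest Redis keys with expiry; `StoryStore`, `tray_for`, and the `(author_id, viewer_id)` block representation are hypothetical.

```python
STORY_TTL_S = 24 * 60 * 60  # 24h lifetime from the notes


class StoryStore:
    def __init__(self):
        # author_id -> list of (story_id, expires_at)
        self._by_author = {}

    def post(self, author_id, story_id, now):
        entry = (story_id, now + STORY_TTL_S)
        self._by_author.setdefault(author_id, []).append(entry)

    def tray_for(self, viewer_id, following, blocked, now):
        """Return {author_id: [story_id, ...]} of unexpired, visible stories.

        `blocked` is a set of (author_id, viewer_id) pairs meaning the author
        has blocked the viewer.
        """
        tray = {}
        for author_id in sorted(following):
            if (author_id, viewer_id) in blocked:
                continue  # privacy check happens on the read path
            live = [
                story_id
                for story_id, expires_at in self._by_author.get(author_id, [])
                if expires_at > now
            ]
            if live:
                tray[author_id] = live
        return tray
```

With a store that supports key expiry, the `expires_at > now` filter would be replaced by the store dropping the key on its own.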
💬 8) Direct Messaging (DM)
- WebSocket/MQTT gateway for real-time delivery + presence.
- Per-thread partitioning (`thread_id % N`) → ordered appends.
- End-to-end encryption (design option) or server-side encryption at rest.
- Media in chat follows same pre-signed upload pipeline.
- Push via APNs/FCM with collapse keys and quiet hours.
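The per-thread partitioning idea can be sketched like this. Each partition is an append-only list, so all messages of one thread land in the same partition and keep their relative order. `MessageLog` and its methods are hypothetical names; a production system would use a partitioned log or database rather than Python lists.

```python
NUM_PARTITIONS = 4  # the "N" in thread_id % N; illustrative value


class MessageLog:
    def __init__(self, num_partitions=NUM_PARTITIONS):
        self.partitions = [[] for _ in range(num_partitions)]

    def partition_for(self, thread_id: int) -> int:
        return thread_id % len(self.partitions)

    def append(self, thread_id: int, sender: str, text: str) -> int:
        """Append a message; return its offset within the partition."""
        partition = self.partitions[self.partition_for(thread_id)]
        offset = len(partition)
        partition.append((offset, thread_id, sender, text))
        return offset

    def read_thread(self, thread_id: int):
        """Messages of one thread, in the order they were appended."""
        partition = self.partitions[self.partition_for(thread_id)]
        return [text for _, tid, _, text in partition if tid == thread_id]
```

Ordering is only guaranteed within a thread; messages from different threads that share a partition are interleaved.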
🔍 9) Search & Explore
- Indices: users, hashtags, places, captions; autocomplete; typo tolerance.
- Explore = feed of ML-ranked candidates from global trending, similar-user embeddings, content-based signals (vision/text).
- Index pipeline consumes post events, updates term stats, popularity windows.
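To make the autocomplete bullet concrete, here is a toy prefix lookup over a sorted term list, ranked by a popularity count. It is only a sketch: it does not use Elasticsearch/OpenSearch, it has no typo tolerance, and the `Autocomplete` class and its popularity dictionary are hypothetical.

```python
import bisect


class Autocomplete:
    def __init__(self, popularity):
        """`popularity` maps a term (hashtag, handle, place) to a usage count."""
        self._popularity = dict(popularity)
        self._terms = sorted(self._popularity)

    def suggest(self, prefix, limit=5):
        # All terms with the given prefix form a contiguous run in sorted order.
        lo = bisect.bisect_left(self._terms, prefix)
        hi = bisect.bisect_left(self._terms, prefix + "\uffff")
        matches = self._terms[lo:hi]
        # Most popular first; ties broken alphabetically.
        matches.sort(key=lambda term: (-self._popularity[term], term))
        return matches[:limit]
```

The popularity counts would be fed by the index pipeline as it consumes post events.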
🚀 10) Scalability & Caching
- CDN for all media + static assets; origin shield to cut egress.
- Redis: hot user profiles, timelines, counts, session tokens, feature store.
- DB sharding: by `user_id` (profiles, follows), by `post_id` (interactions), time-based for cold partitions.
- Read replicas for fan-out reads; CQRS separation of write/read models for heavy tables.
- Backfill workers for follow-graph changes (new follow → seed timeline).
- Bulk counters: batch increments (or approximate unique counts, such as distinct viewers, with HyperLogLog); reconcile offline.
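The batched-increment idea for bulk counters can be sketched as below. Increments accumulate in memory and are written to the durable store in one update per batch. `BatchedCounter`, `flush_every`, and the `Counter` standing in for the database are assumptions for illustration.

```python
from collections import Counter


class BatchedCounter:
    def __init__(self, flush_every=100):
        self.durable = Counter()   # stand-in for the database / KV store
        self.pending = Counter()   # in-memory buffer of unflushed increments
        self.flush_every = flush_every
        self.flushes = 0           # how many batched writes were issued
        self._buffered = 0

    def incr(self, key, amount=1):
        self.pending[key] += amount
        self._buffered += 1
        if self._buffered >= self.flush_every:
            self.flush()

    def flush(self):
        """Apply all buffered increments in one write."""
        if not self.pending:
            return
        self.durable.update(self.pending)  # Counter.update adds counts
        self.pending.clear()
        self._buffered = 0
        self.flushes += 1

    def read(self, key):
        """Best-effort fresh value: durable count plus local buffer."""
        return self.durable[key] + self.pending[key]
```

The trade-off is that buffered increments are lost if the process dies before a flush, which is why the notes pair this with offline reconciliation.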
🛡️ 11) Privacy, Safety, Abuse
- Private accounts, approvals; block/mute/restrict flows.
- Rate limiting (IP/device/account), CAPTCHA, device fingerprinting.
- Spam/ban evasion ML, link-scam detection, comment filtering, report queues.
- PII handling: encryption at rest, scoped access, audit logs.
- Geo-compliance: data residency where required.
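For the rate-limiting bullet, one common technique is a token bucket, sketched here per key (an IP, device, or account id). The notes do not say which algorithm is used, so this choice, the class name, and the parameters are assumptions. Time is passed in explicitly to keep the example deterministic.

```python
class TokenBucket:
    def __init__(self, capacity: int, refill_per_s: float, now: float = 0.0):
        self.capacity = capacity          # maximum burst size
        self.refill_per_s = refill_per_s  # sustained allowed rate
        self.tokens = float(capacity)
        self.last = now

    def allow(self, now: float) -> bool:
        """Consume one token if available; refill based on elapsed time."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_s)
        self.last = max(self.last, now)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class RateLimiter:
    """One bucket per key, created lazily."""

    def __init__(self, capacity: int, refill_per_s: float):
        self.capacity = capacity
        self.refill_per_s = refill_per_s
        self._buckets = {}

    def allow(self, key: str, now: float) -> bool:
        bucket = self._buckets.get(key)
        if bucket is None:
            bucket = TokenBucket(self.capacity, self.refill_per_s, now)
            self._buckets[key] = bucket
        return bucket.allow(now)
```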
📊 12) Observability & SLOs
- SLOs: p95 feed load < 300ms (metadata), p95 media TTFMP < 2s on 4G, availability 99.95%+.
- Metrics: DAU/MAU, session length, feed CTR, like/comment rate, story completion, DM delivery latency, push open rate, error rates.
- Tracing: end-to-end (ingress → ranking → storage).
- Circuit breakers + feature flags for safe rollouts.
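To show what a p95 latency SLO check means in practice, the sketch below computes a nearest-rank percentile over a batch of latency samples and compares it with the 300 ms feed target from the list above. The function names are hypothetical, and real monitoring systems typically compute percentiles from histograms rather than raw samples.

```python
def percentile(samples, p: int):
    """Nearest-rank percentile for an integer p in 1..100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Ceiling of p * n / 100 using integer arithmetic (1-indexed rank).
    rank = (p * len(ordered) + 99) // 100
    return ordered[rank - 1]


def meets_slo(latencies_ms, threshold_ms=300, p=95) -> bool:
    """True if the p-th percentile latency is below the threshold."""
    return percentile(latencies_ms, p) < threshold_ms
```

With 20 samples the p95 is the 19th smallest value, so a single slow request does not break the SLO but two do.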
🧮 13) Capacity Planning (Back-of-Envelope)
- Users: 200M MAU; 50M DAU; traffic is read-heavy (feed/story/DM), so peak QPS is dominated by reads.
- Media: avg image 200 KB, video 2–6 MB per minute uploaded.
- Daily uploads: 30M posts → raw ingress 10–50 TB/day before transcodes.
- CDN offload target: >95% for media; origin guarded by signed URLs.
(Adjust numbers in interviews; show method, not exactness.)
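One way to show the method behind the ingress estimate is a few lines of arithmetic. The post count, image size, and video bitrate range come from the list above; the 20% video share and 30-second average clip length are assumptions picked only to make the calculation concrete.

```python
def daily_ingress_tb(posts_per_day, video_share, image_kb,
                     video_mb_per_min, video_len_min):
    """Raw upload volume per day in decimal terabytes (1 TB = 1e12 bytes)."""
    image_posts = posts_per_day * (1.0 - video_share)
    video_posts = posts_per_day * video_share
    image_bytes = image_posts * image_kb * 1e3
    video_bytes = video_posts * video_mb_per_min * 1e6 * video_len_min
    return (image_bytes + video_bytes) / 1e12


# Lower bound: every post is a 200 KB image.
all_images = daily_ingress_tb(30e6, 0.0, 200, 4, 0.5)

# Assumed mix: 20% of posts are 30-second videos at 4 MB per minute.
mixed = daily_ingress_tb(30e6, 0.2, 200, 4, 0.5)

# Average sustained ingest rate for the mixed case, in MB/s.
avg_mb_per_s = mixed * 1e12 / 86_400 / 1e6
```

Under these assumptions the all-image case is about 6 TB/day and the mixed case about 16.8 TB/day (roughly 194 MB/s on average), which sits inside the 10–50 TB/day range quoted above; a larger video share or longer clips moves it toward the upper end.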
🔁 14) Failure Modes & Resilience
- Graceful degradation: ranker offline → fall back to recency feed.
- Write buffers: queue posts when DB shard is degraded; drain later.
- Hot celebrity: switch to fan-in + edge cache for that creator.
- CDN origin failover: multi-region object storage with bucket failover.
- Shadow read new indices; dual write during migrations.
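The graceful-degradation bullet (ranker offline, fall back to recency) can be sketched with a small circuit breaker. The class, thresholds, and `recency_feed` helper are hypothetical; time is passed in explicitly so the behaviour is easy to follow.

```python
class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after_s=30.0):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at = None  # time the breaker tripped, or None if closed

    def call(self, fn, fallback, now):
        """Run fn(); on failure, or while the breaker is open, run fallback()."""
        if self.opened_at is not None and now - self.opened_at < self.reset_after_s:
            return fallback()  # open: skip the failing dependency entirely
        try:
            result = fn()      # closed, or a trial call after the reset window
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = now
            return fallback()
        self.failures = 0
        self.opened_at = None
        return result


def recency_feed(posts, limit=10):
    """Fallback feed: newest first, no ML ranking. posts = [(ts, post_id)]."""
    return [post_id for _, post_id in sorted(posts, reverse=True)[:limit]]
```

While the breaker is open, feed requests do not wait on the ranker at all, which protects tail latency during an outage.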
🔧 15) API Sketches
- `POST /v1/posts`
- `GET /v1/feed?cursor=...`
- `POST /v1/follow/{user_id}`
- `POST /v1/like/{post_id}`
- `GET /v1/stories/tray`
- `WS /v1/dm/connect` (auth → subscribe to thread channels)

Use idempotency keys for uploads/interactions to avoid double actions.
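A minimal sketch of how an idempotency key prevents a retried request from being applied twice. The `LikeService` class and its in-memory dictionaries are hypothetical; a real service would store keys with an expiry in a shared cache or database.

```python
class LikeService:
    def __init__(self):
        self.like_counts = {}  # post_id -> number of likes
        self._responses = {}   # idempotency_key -> response already returned

    def like(self, idempotency_key: str, user_id: int, post_id: int) -> dict:
        # A retry with the same key replays the stored response
        # instead of applying the action again.
        if idempotency_key in self._responses:
            return self._responses[idempotency_key]
        self.like_counts[post_id] = self.like_counts.get(post_id, 0) + 1
        response = {
            "post_id": post_id,
            "user_id": user_id,
            "likes": self.like_counts[post_id],
        }
        self._responses[idempotency_key] = response
        return response
```

The client generates the key once per user action and reuses it on every retry of that action.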
🧪 16) Ranking Signals (Examples)
- User–creator affinity: recent interactions, dwell time
- Content features: vision tags (food, travel), NSFW filters
- Freshness: time decay, session diversity
- Quality: historical engagement rate, viewer feedback (see-less)
- Network: graph proximity, mutuals
🧩 17) Trade-offs to Discuss
| Topic | Option A | Option B | Trade-off |
|---|---|---|---|
| Timelines | Fan-out | Fan-in | Write amp vs read latency |
| Counters | Strong | Eventual/batched | Freshness vs throughput |
| Storage | Single region | Multi-region | Cost/complexity vs resilience |
| DM | HTTP poll | WebSocket/MQTT | Simplicity vs real-time experience |
| Privacy | Server-side | E2E | Safety features vs confidentiality |
✅ 18) Interview Flow (How to Answer)
- Clarify features, privacy, scale.
- Draw overall architecture.
- Deep-dive media pipeline + feed/ranking.
- Explain timeline strategy (fan-out / hybrid) and caching.
- Cover search/explore, stories, DM briefly.
- Talk data model, sharding, observability, SLOs.
- Call out abuse prevention, privacy.
- Discuss trade-offs and failure handling.
- Summarize; propose incremental rollouts and cost controls.