Designing Instagram


📸 Instagram System Design - Notes


🧭 1) Problem Statement

Design a social media platform like Instagram where users can:

  • Create accounts, follow others
  • Upload photos/videos, view feeds, like/comment/save/share
  • Post Stories (24h) and Reels (short video)
  • Message (DMs), search users/hashtags/places
  • Receive notifications

Non-functional goals

  • High availability, low-latency UI
  • Horizontal scalability (hundreds of millions of users)
  • Durable media storage, cost-efficient delivery via CDN
  • Privacy/security, abuse prevention

โš™๏ธ 2) High-Level Architecture

Clients (iOS/Android/Web)
|
API Gateway / LB
|
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€ Core Services โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
| Auth | User | Social Graph | Feed | Media |
| Post | Story| Comment| Like | Search| Notify |
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
| | |
Message Bus (Kafka) Datastores Media Pipeline
| | |
Stream processors SQL/NoSQL/Cache Ingest โ†’ Transcode โ†’ Store
| |
Analytics & |
Ranking (Flink/Beam) CDN + Object Storage

🧩 3) Core Services & Responsibilities

Service        | Responsibilities
Auth           | Sign-up/sign-in, OAuth/JWT, sessions, rate limiters
User/Profile   | Profiles, settings, privacy flags, blocks
Social Graph   | Follow/unfollow, follower counts, block lists
Post/Media     | Upload, metadata, tagging, locations, reels
Story          | 24h TTL posts, highlights, viewer list
Feed/Ranking   | Home timeline construction + ranking
Interaction    | Likes, comments, saves, shares, counters
Search/Explore | Users, hashtags, places, trend detection
Notification   | Follows, likes/comments, mentions, async fanout
DM             | Real-time messaging, read receipts, media in chat
Moderation     | Safety, spam/scam detection, reporting, ML
Analytics      | Events, aggregates (DAU, retention, CTR), A/B tests

💾 4) Data Model (Simplified)

SQL (transactional)

  • users(id, handle, name, bio, is_private, created_at, ...)
  • follows(follower_id, followee_id, created_at, status) (status for requested/approved when private)
  • posts(id, author_id, media_id, caption, location_id, created_at, visibility, ...)
  • comments(id, post_id, author_id, text, created_at, parent_comment_id NULL)
  • likes(user_id, post_id, created_at) (composite PK on (user_id, post_id); per-comment likes follow the same pattern)
  • stories(id, author_id, media_id, created_at, expires_at)
  • saves(user_id, post_id, collection_id NULL, created_at)
  • hashtags(id, tag) + post_hashtags(post_id, hashtag_id)
  • locations(id, name, lat, lon)
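As a minimal sketch of how the composite keys above behave, the following uses Python's built-in sqlite3 module with a trimmed subset of the tables. The column types, the `requested`/`approved` status strings, and the helper names `open_db`, `follow`, and `like` are assumptions made for illustration; the notes do not prescribe them.

```python
import sqlite3


def open_db() -> sqlite3.Connection:
    """Create an in-memory database holding a trimmed subset of the schema."""
    db = sqlite3.connect(":memory:")
    db.executescript(
        """
        CREATE TABLE users (
            id INTEGER PRIMARY KEY,
            handle TEXT NOT NULL UNIQUE,
            is_private INTEGER NOT NULL DEFAULT 0
        );
        CREATE TABLE follows (
            follower_id INTEGER NOT NULL,
            followee_id INTEGER NOT NULL,
            status TEXT NOT NULL,            -- assumed values: requested / approved
            created_at TEXT DEFAULT CURRENT_TIMESTAMP,
            PRIMARY KEY (follower_id, followee_id)
        );
        CREATE TABLE likes (
            user_id INTEGER NOT NULL,
            post_id INTEGER NOT NULL,
            created_at TEXT DEFAULT CURRENT_TIMESTAMP,
            PRIMARY KEY (user_id, post_id)   -- at most one like per user per post
        );
        """
    )
    return db


def follow(db: sqlite3.Connection, follower_id: int, followee_id: int) -> str:
    """Insert a follow edge; private accounts start in the 'requested' state."""
    row = db.execute(
        "SELECT is_private FROM users WHERE id = ?", (followee_id,)
    ).fetchone()
    if row is None:
        raise ValueError("unknown followee")
    status = "requested" if row[0] else "approved"
    db.execute(
        "INSERT OR IGNORE INTO follows (follower_id, followee_id, status) "
        "VALUES (?, ?, ?)",
        (follower_id, followee_id, status),
    )
    # Return the stored status so a repeated call does not report a stale value.
    return db.execute(
        "SELECT status FROM follows WHERE follower_id = ? AND followee_id = ?",
        (follower_id, followee_id),
    ).fetchone()[0]


def like(db: sqlite3.Connection, user_id: int, post_id: int) -> bool:
    """Return True if a new like row was written, False if it already existed."""
    cur = db.execute(
        "INSERT OR IGNORE INTO likes (user_id, post_id) VALUES (?, ?)",
        (user_id, post_id),
    )
    return cur.rowcount == 1
```

The composite primary key makes a repeated like a no-op at the storage layer, which keeps the counters described elsewhere in these notes from drifting on client retries.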

NoSQL / KV

  • Timelines: home_timeline:{user_id} → list of (post_id, score, ts)
  • Story trays: stories_tray:{user_id} → list of story ids (TTL)
  • Counters: post like/comment counts (atomic increments)
  • DM: chat threads + messages (append-only, partitioned by thread)

Object Storage (media)

  • Original + transcoded renditions (images: different sizes, videos: ABR HLS/DASH)
  • Thumbnails, previews, story sprites

Search Index (Elasticsearch/OpenSearch)

  • Users (handle/name), hashtags, captions, places
  • Features for ranking: engagement, freshness, locale

📦 5) Media Upload & Processing

  1. Client gets pre-signed URL → uploads directly to object storage (bypass app servers).
  2. Ingest event to Kafka → Transcode via FFmpeg workers (multi-bitrate, keyframe alignment).
  3. Generate thumbnails, story sprites, extract EXIF, run safety/NSFW checks.
  4. Store renditions; write media metadata to DB; publish “ready” event.
  5. CDN invalidation/warm-up for hot content.
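Step 1 can be illustrated with a small, self-contained sketch of the idea behind a pre-signed upload URL: the app server signs the object key and an expiry, and the storage edge later verifies the signature without calling back. This is a generic HMAC scheme using only the standard library, not the signing format of any particular object store; `SECRET`, the `media.example.com` host, and both function names are hypothetical.

```python
import hashlib
import hmac
import time
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlparse

SECRET = b"demo-secret"  # hypothetical shared key; a real system would rotate it


def _sign(object_key: str, expires: int) -> str:
    """HMAC over the method, key, and expiry so none of them can be altered."""
    message = f"PUT\n{object_key}\n{expires}".encode()
    return hmac.new(SECRET, message, hashlib.sha256).hexdigest()


def presign_upload(object_key: str, ttl_s: int, now: Optional[float] = None) -> str:
    """Issued by the app server: a URL the client may PUT to until it expires."""
    issued = time.time() if now is None else now
    expires = int(issued + ttl_s)
    query = urlencode({"expires": expires, "sig": _sign(object_key, expires)})
    return f"https://media.example.com/{object_key}?{query}"


def verify_upload(url: str, now: Optional[float] = None) -> bool:
    """Checked at the storage edge: reject expired or tampered URLs."""
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    try:
        expires = int(params["expires"][0])
        signature = params["sig"][0]
    except (KeyError, ValueError):
        return False
    current = time.time() if now is None else now
    if current > expires:
        return False
    object_key = parsed.path.lstrip("/")
    return hmac.compare_digest(signature, _sign(object_key, expires))
```

Because the signature covers the object key, a client cannot reuse a URL issued for one upload to overwrite a different object.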

Reels: short vertical video, additional music/audio mixing, on-device pre-compression, lightweight editing timelines.


📰 6) Feed Generation: Fan-out vs Fan-in

Options

Strategy | Idea | Pros | Cons
Fan-out on write | Push new post IDs into followers’ home timelines at publish time | Fast feed reads; simpler ranking at read time | Heavy write amplification for celebs; backfills for late followers
Fan-in on read | Build feed by querying followed users’ latest posts at request | Lower write cost | Expensive reads, higher tail latency
Hybrid | Fan-out for normal users; fan-in/cache for high-degree nodes | Balanced | Complexity, tuning needed
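The hybrid row can be sketched in memory as follows. This is an illustrative toy, not a production design: the class name `HybridFeed`, the `fanout_threshold` cut-off, and the timeline cap are assumptions, and real systems would keep these structures in a KV store rather than in Python dictionaries.

```python
from collections import defaultdict, deque


class HybridFeed:
    """Fan-out on write for ordinary authors, fan-in on read for high-degree ones."""

    def __init__(self, fanout_threshold: int = 10_000, timeline_cap: int = 500):
        self.fanout_threshold = fanout_threshold  # assumed cut-off for "celebrity"
        self.followers = defaultdict(set)         # author -> follower ids
        self.following = defaultdict(set)         # user -> followed author ids
        self.posts_by_author = defaultdict(list)  # author -> [(ts, post_id)], appended in ts order
        self.timelines = defaultdict(lambda: deque(maxlen=timeline_cap))

    def follow(self, follower: str, author: str) -> None:
        # No backfill here: posts published before the follow are not seeded.
        self.followers[author].add(follower)
        self.following[follower].add(author)

    def _is_high_degree(self, author: str) -> bool:
        return len(self.followers[author]) > self.fanout_threshold

    def publish(self, author: str, post_id: str, ts: float) -> None:
        self.posts_by_author[author].append((ts, post_id))
        if not self._is_high_degree(author):
            # Fan-out on write: push the post into each follower's timeline.
            for follower in self.followers[author]:
                self.timelines[follower].append((ts, post_id))

    def read_feed(self, user: str, limit: int = 10) -> list:
        # Start from the precomputed timeline, keyed by post id to avoid duplicates.
        merged = {post_id: ts for ts, post_id in self.timelines[user]}
        # Fan-in on read: pull recent posts from followed high-degree authors.
        for author in self.following[user]:
            if self._is_high_degree(author):
                for ts, post_id in self.posts_by_author[author][-limit:]:
                    merged[post_id] = ts
        newest_first = sorted(merged.items(), key=lambda item: item[1], reverse=True)
        return [post_id for post_id, _ in newest_first[:limit]]
```

Deduplicating by post id matters when an author crosses the threshold: earlier posts were already pushed to timelines and would otherwise appear twice after the switch to fan-in.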

Ranking: ML model scoring by freshness, affinity, engagement likelihood, content quality, diversity, safety.

  • Online features in Redis (user × creator affinity, recent interactions).
  • Offline aggregates via Flink/Spark (7-day engagement rates).
  • Two-stage: candidate retrieval → lightweight re-rank.
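The two-stage shape can be sketched with a hand-written score standing in for the ML model. Everything numeric here is an assumption chosen for illustration: the 48-hour retrieval window, the 6-hour freshness half-life, and the 0.6/0.4 weights. The names `Candidate`, `retrieve`, `score`, and `rank` are likewise hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    post_id: str
    created_at: float       # seconds since epoch
    affinity: float         # user x creator affinity in [0, 1]
    engagement_rate: float  # e.g. a 7-day engagement rate in [0, 1]


def retrieve(pool, now: float, max_age_s: float, limit: int):
    """Stage 1: cheap candidate retrieval, here the most recent posts in a window."""
    fresh = [c for c in pool if now - c.created_at <= max_age_s]
    fresh.sort(key=lambda c: c.created_at, reverse=True)
    return fresh[:limit]


def score(c: Candidate, now: float, half_life_s: float = 6 * 3600.0) -> float:
    """Stage 2 score: exponential time decay times a weighted engagement signal."""
    age_s = max(0.0, now - c.created_at)
    freshness = 0.5 ** (age_s / half_life_s)
    return freshness * (0.6 * c.affinity + 0.4 * c.engagement_rate)


def rank(pool, now: float, k: int = 10):
    """Retrieve a bounded candidate set, then re-rank only that set."""
    candidates = retrieve(pool, now, max_age_s=48 * 3600.0, limit=500)
    ordered = sorted(candidates, key=lambda c: score(c, now), reverse=True)
    return [c.post_id for c in ordered[:k]]
```

Keeping retrieval cheap and bounded is what lets the more expensive scoring step run within a tight read-path latency budget.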

โณ 7) Stories

  • TTL 24h; store metadata in SQL, content in object storage; Redis keys with expiry for trays.
  • Viewer list as append-only log (cap at recent N).
  • Privacy: only followers (if private), block lists enforced in story read path.
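A minimal sketch of the story read path, assuming expiry is checked at read time and that block lists take precedence over follower status. The `Author` dataclass and both function names are hypothetical; in the design above the expiry would be handled by key TTLs rather than by filtering a list.

```python
from dataclasses import dataclass, field

STORY_TTL_S = 24 * 3600  # 24h lifetime, as stated in the notes


@dataclass
class Author:
    is_private: bool = False
    followers: set = field(default_factory=set)
    blocked: set = field(default_factory=set)
    stories: list = field(default_factory=list)  # (expires_at, story_id) pairs


def post_story(author: Author, story_id: str, now: float) -> None:
    """Record a story together with its absolute expiry time."""
    author.stories.append((now + STORY_TTL_S, story_id))


def visible_stories(author: Author, viewer: str, now: float) -> list:
    """Enforce blocks and privacy first, then drop expired stories."""
    if viewer in author.blocked:
        return []
    if author.is_private and viewer not in author.followers:
        return []
    return [story_id for expires_at, story_id in author.stories if expires_at > now]
```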

💬 8) Direct Messaging (DM)

  • WebSocket/MQTT gateway for real-time delivery + presence.
  • Per-thread partitioning (thread_id % N) โ†’ ordered appends.
  • End-to-end encryption (design option) or server-side encryption at rest.
  • Media in chat follows same pre-signed upload pipeline.
  • Push via APNs/FCM with collapse keys and quiet hours.
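The per-thread partitioning bullet can be sketched as below: all messages of a thread land in one partition, so a per-thread sequence number is enough to give readers a total order within that thread. `ThreadLog` and its methods are hypothetical names, and the in-memory lists stand in for a durable append-only store.

```python
class ThreadLog:
    """Append-only message logs, partitioned by thread_id % N."""

    def __init__(self, num_partitions: int):
        self.num_partitions = num_partitions
        # One dict per partition: thread_id -> list of (seq, sender, body).
        self.partitions = [dict() for _ in range(num_partitions)]

    def partition_for(self, thread_id: int) -> int:
        return thread_id % self.num_partitions

    def append(self, thread_id: int, sender: str, body: str) -> int:
        """Append a message and return its sequence number within the thread."""
        log = self.partitions[self.partition_for(thread_id)].setdefault(thread_id, [])
        seq = len(log)
        log.append((seq, sender, body))
        return seq

    def read(self, thread_id: int, after_seq: int = -1) -> list:
        """Return messages newer than after_seq, in append order."""
        log = self.partitions[self.partition_for(thread_id)].get(thread_id, [])
        return [message for message in log if message[0] > after_seq]
```

A client that reconnects can pass the last sequence number it saw as `after_seq` to fetch only what it missed.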

🔍 9) Search & Explore

  • Indices: users, hashtags, places, captions; autocomplete; typo tolerance.
  • Explore = feed of ML-ranked candidates from global trending, similar-user embeddings, content-based signals (vision/text).
  • Index pipeline consumes post events, updates term stats, popularity windows.
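For the autocomplete part only, a minimal sketch is a sorted term list searched by prefix and ordered by a popularity count. This stands in for what a search engine's suggester would do and does not cover typo tolerance; the class name `Autocomplete` and the popularity numbers in the usage are made up.

```python
import bisect


class Autocomplete:
    """Prefix lookup over a sorted term list, ranked by popularity."""

    def __init__(self, popularity_by_term: dict):
        self._popularity = {t.lower(): p for t, p in popularity_by_term.items()}
        self._terms = sorted(self._popularity)

    def suggest(self, prefix: str, k: int = 5) -> list:
        prefix = prefix.lower()
        # All terms starting with the prefix form one contiguous sorted range.
        lo = bisect.bisect_left(self._terms, prefix)
        hi = bisect.bisect_left(self._terms, prefix + "\uffff")
        matches = self._terms[lo:hi]
        # Most popular first; ties broken alphabetically for stable output.
        return sorted(matches, key=lambda t: (-self._popularity[t], t))[:k]
```

The `"\uffff"` suffix is a simple upper-bound sentinel; it assumes indexed terms do not themselves contain that code point.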

🚀 10) Scalability & Caching

  • CDN for all media + static assets; origin shield to cut egress.
  • Redis: hot user profiles, timelines, counts, session tokens, feature store.
  • DB sharding: by user_id (profiles, follows), by post_id (interactions), time-based for cold partitions.
  • Read replicas for fan-out reads; CQRS separation of write/read models for heavy tables.
  • Backfill workers for follow-graph changes (new follow ⇒ seed timeline).
  • Bulk counters: approximate distinct counts (e.g. unique viewers) with HLL; batch increments for like/view totals; reconcile offline.
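Batched increments can be sketched as a small write-behind buffer: increments accumulate in memory and are written to the backing store in one pass. The `BatchedCounter` name, the `flush_every` trigger, and the use of a `Counter` as the "durable" store are all stand-ins for illustration.

```python
from collections import Counter


class BatchedCounter:
    """Buffer increments in memory and flush them to the store in batches."""

    def __init__(self, flush_every: int = 100):
        self.flush_every = flush_every
        self.store = Counter()    # stand-in for the durable counter store
        self.pending = Counter()  # increments not yet written
        self.store_writes = 0     # number of per-key writes issued to the store
        self._buffered = 0

    def incr(self, key: str, delta: int = 1) -> None:
        self.pending[key] += delta
        self._buffered += 1
        if self._buffered >= self.flush_every:
            self.flush()

    def flush(self) -> None:
        """Apply one write per key, regardless of how many increments it absorbed."""
        self.store_writes += len(self.pending)
        self.store.update(self.pending)  # Counter.update adds counts
        self.pending.clear()
        self._buffered = 0

    def read(self, key: str) -> int:
        """Read-your-writes view: durable value plus anything still buffered."""
        return self.store[key] + self.pending[key]
```

The trade-off is the one named in the trade-offs table: buffered increments are lost if the process dies before a flush, which is why an offline reconcile pass is paired with this approach.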

๐Ÿ›ก๏ธ 11) Privacy, Safety, Abuse

  • Private accounts, approvals; block/mute/restrict flows.
  • Rate limiting (IP/device/account), CAPTCHA, device fingerprinting.
  • ML-based spam and ban-evasion detection, link-scam detection, comment filtering, report queues.
  • PII handling: encryption at rest, scoped access, audit logs.
  • Geo-compliance: data residency where required.
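One common way to implement the rate-limiting bullet is a token bucket per IP, device, or account. The sketch below shows a single bucket; the class name and parameters are illustrative, and time is passed in explicitly so the behaviour is easy to reason about.

```python
class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate_per_s` tokens per second."""

    def __init__(self, rate_per_s: float, capacity: float, now: float = 0.0):
        self.rate_per_s = rate_per_s
        self.capacity = capacity
        self.tokens = capacity  # start full so a new client can burst
        self.updated = now

    def allow(self, now: float, cost: float = 1.0) -> bool:
        """Refill based on elapsed time, then spend `cost` tokens if available."""
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate_per_s)
        self.updated = max(self.updated, now)  # ignore clocks that move backwards
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

In use, a limiter would keep one bucket per key, for example a dictionary keyed by `("ip", address)` or `("account", user_id)`, and reject or challenge the request when `allow` returns False.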

📈 12) Observability & SLOs

  • SLOs: p95 feed load < 300 ms (metadata), p95 media TTFMP (time to first meaningful paint) < 2 s on 4G, availability 99.95%+.
  • Metrics: DAU/MAU, session length, feed CTR, like/comment rate, story completion, DM delivery latency, push open rate, error rates.
  • Tracing: end-to-end (ingress → ranking → storage).
  • Circuit breakers + feature flags for safe rollouts.
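To make the SLO numbers concrete, here is a small sketch that checks a p95 latency target using the nearest-rank percentile definition and converts an availability target into a downtime budget. The function names and the 30-day window are assumptions; monitoring systems often use interpolated or histogram-based percentiles instead.

```python
import math


def percentile(samples, p: float) -> float:
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def meets_slo(latencies_ms, threshold_ms: float = 300.0, p: float = 95.0) -> bool:
    """True when the p-th percentile latency is under the threshold."""
    return percentile(latencies_ms, p) < threshold_ms


def downtime_budget_minutes(availability_pct: float, days: int = 30) -> float:
    """Minutes of allowed unavailability over the window for a given target."""
    return (100.0 - availability_pct) / 100.0 * days * 24 * 60
```

Under these definitions a 99.95% target leaves roughly 21.6 minutes of downtime per 30 days, which is a useful figure to quote when discussing rollout risk.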

🧮 13) Capacity Planning (Back-of-Envelope)

  • Users: 200M MAU; 50M DAU; traffic is read-heavy (feed/story/DM), so size for peak read QPS.
  • Media: avg image 200 KB, video 2–6 MB per minute uploaded.
  • Daily uploads: 30M posts → raw ingress 10–50 TB/day before transcodes.
  • CDN offload target: >95% for media; origin guarded by signed URLs.

(Adjust numbers in interviews; show method, not exactness.)
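The method behind the ingress estimate can be written out as a short calculation. The post count and per-item sizes come from the bullets above; the 20% video share, the one-minute average video length, and the 4 MB/min midpoint are assumptions picked only to show how the arithmetic lands inside the stated 10–50 TB/day range. Decimal units (1 TB = 10^12 bytes) are used throughout.

```python
KB = 1_000
MB = 1_000_000
TB = 1_000_000_000_000


def daily_ingress_tb(
    posts_per_day: int,
    video_share: float = 0.20,            # assumed fraction of posts that are video
    avg_image_bytes: int = 200 * KB,      # from the notes
    video_bytes_per_min: int = 4 * MB,    # midpoint of the 2-6 MB/min range
    avg_video_minutes: float = 1.0,       # assumed average clip length
) -> float:
    """Raw upload volume per day, before any transcoded renditions."""
    videos = posts_per_day * video_share
    images = posts_per_day - videos
    total_bytes = (
        images * avg_image_bytes
        + videos * video_bytes_per_min * avg_video_minutes
    )
    return total_bytes / TB


def posts_per_second(posts_per_day: int) -> float:
    """Average write rate; peak would be some multiple of this."""
    return posts_per_day / 86_400
```

With 30M posts per day these assumptions give 24M images at 200 KB (4.8 TB) plus 6M videos at 4 MB (24 TB), about 28.8 TB/day, at an average of roughly 347 uploads per second.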


🔁 14) Failure Modes & Resilience

  • Graceful degradation: ranker offline → fall back to recency feed.
  • Write buffers: queue posts when DB shard is degraded; drain later.
  • Hot celebrity: switch to fan-in + edge cache for that creator.
  • CDN origin failover: multi-region object storage with bucket failover.
  • Shadow-read new indices; dual-write during migrations.
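Graceful degradation of the ranker can be sketched with a small circuit breaker that falls back to a recency-ordered feed. The class, its thresholds, and the `recency_feed` helper are illustrative assumptions; the breaker opens after repeated failures, skips the ranker while open, and allows a single trial call once the reset interval has passed.

```python
class CircuitBreaker:
    """Stop calling a failing dependency for a while and serve a fallback instead."""

    def __init__(self, max_failures: int = 3, reset_after_s: float = 30.0):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def call(self, primary, fallback, now: float):
        is_open = self.opened_at is not None
        if is_open and now - self.opened_at < self.reset_after_s:
            return fallback()  # open: do not touch the primary at all
        try:
            result = primary()  # closed, or a half-open trial call
        except Exception:
            self.failures += 1
            if is_open or self.failures >= self.max_failures:
                self.opened_at = now  # open, or re-open after a failed trial
            return fallback()
        self.failures = 0
        self.opened_at = None
        return result


def recency_feed(posts) -> list:
    """Fallback ordering: newest first, no model involved."""
    return [p["id"] for p in sorted(posts, key=lambda p: p["ts"], reverse=True)]
```

Skipping the primary while the breaker is open is the point of the pattern: it keeps feed latency bounded and avoids piling more load onto a ranker that is already struggling.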

🧠 15) API Sketches

POST /v1/posts
GET /v1/feed?cursor=...
POST /v1/follow/{user_id}
POST /v1/like/{post_id}
GET /v1/stories/tray
WS /v1/dm/connect (auth → subscribe to thread channels)

Use idempotency keys for uploads/interactions to avoid double actions.
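A minimal sketch of idempotency-key handling: the server remembers the response for each (user, key) pair and replays it on a retry instead of executing the action again. `IdempotentHandler` is a hypothetical name, and the in-memory dictionary stands in for a shared store with an expiry policy.

```python
class IdempotentHandler:
    """Execute an action at most once per (user_id, idempotency_key)."""

    def __init__(self):
        self._responses = {}  # (user_id, key) -> stored response

    def handle(self, user_id: str, idempotency_key: str, action):
        cache_key = (user_id, idempotency_key)
        if cache_key in self._responses:
            # Retry of a request already processed: replay the stored response.
            return self._responses[cache_key]
        response = action()
        self._responses[cache_key] = response
        return response
```

The client generates the key once per user intent (one tap on "like", one upload) and sends the same key on every retry of that request. This single-process sketch does not handle two concurrent first attempts; a shared store would need an atomic set-if-absent for that.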


🧪 16) Ranking Signals (Examples)

  • User–creator affinity: recent interactions, dwell time
  • Content features: vision tags (food, travel), NSFW filters
  • Freshness: time decay, session diversity
  • Quality: historical engagement rate, viewer feedback (see-less)
  • Network: graph proximity, mutuals

🧩 17) Trade-offs to Discuss

Topic | Option A | Option B | Trade-off
Timelines | Fan-out | Fan-in | Write amp vs read latency
Counters | Strong | Eventual/batched | Freshness vs throughput
Storage | Single region | Multi-region | Cost/complexity vs resilience
DM | HTTP poll | WebSocket/MQTT | Simplicity vs real-time experience
Privacy | Server-side | E2E | Safety features vs confidentiality

✅ 18) Interview Flow (How to Answer)

  1. Clarify features, privacy, scale.
  2. Draw overall architecture.
  3. Deep-dive media pipeline + feed/ranking.
  4. Explain timeline strategy (fan-out / hybrid) and caching.
  5. Cover search/explore, stories, DM briefly.
  6. Talk data model, sharding, observability, SLOs.
  7. Call out abuse prevention, privacy.
  8. Discuss trade-offs and failure handling.
  9. Summarize; propose incremental rollouts and cost controls.