Design Instagram
“Design Instagram” is a classic system design question, where you’re asked to design a distributed CRUD application. Understanding the general type of system you’re asked to design will help you narrow your clarifying questions and design components.
In the video above, we consider tradeoffs, such as relational vs. non-relational databases and query performance vs. distribution. We also consider challenges unique to social media applications, such as the computation time of the feed generation process.
Below is a supplemental solution that goes into additional detail. Use it as a guide and reference for how you might structure your answer.
Step 1: Define the problem
Ask clarifying questions
The problem space of designing Instagram is incredibly large. If the interviewer hasn’t defined some constraints or requirements, ask clarifying questions to define the problem scope.
To start, we can ask clarifying questions about the key features and aspects the design should include:
- What is the scale of the product we are building (number of users)?
- How should user feeds be generated? Should they be ordered in any way?
- What kind of data are we expected to support (text, images, video)?
- Should we support updating/editing of posts?
Define functional requirements
Based on the insights gathered from our clarifying questions, we can boil down the primary areas and features of Instagram to the following:
Instagram primarily:
- Allows its wide user base to view and contribute posts on the feed.
- Customizes each user's feed based on whom they follow.
Given its primary features, Instagram’s functional requirements include:
- Users can upload images from a mobile client to create a post.
- Users can follow other users.
- Users can view a feed of images.
Define non-functional requirements
When dealing with designs that must satisfy high scale, consider what’s most important for end users and what tradeoffs you’d make. The CAP theorem is a good place to start.
Given the functional requirements, it is safe to assume that the non-functional requirements include:
- High availability
- Scalability
- Low latency
Since it’s not a high priority for users to see the most up-to-date data, we can assume it’s okay for the user to see eventually consistent data. In this case, we could choose availability and fault tolerance over consistency.
Estimate the amount of data
To start, we’ll estimate the amount of storage needed to hold a year’s worth of user data.
- 10M Monthly Active Users
- 2 Photos uploaded per month
- 5MB per photo
Result: 10^7 users * 2 photos * 5MB = 10^8MB = 100TB per month ≈ 1.2PB per year
Next, we'll estimate the system's QPS.
- 10M Monthly Active Users
- ~3M Daily Active Users
Write QPS: 10M MAU * 2 uploads per month ÷ 30 days ≈ 600k uploads per day = 6*10^5/86400 ≈ ~10 uploads per sec
Read QPS: assuming ~10 feed reads per user per day, 10 * 3M DAU / 86400 ≈ ~350 reads per sec
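These back-of-envelope numbers can be reproduced with a quick script (the inputs are the assumptions listed above):

```python
# Back-of-envelope storage and QPS estimates, using the assumptions above.

MAU = 10_000_000            # monthly active users
DAU = 3_000_000             # daily active users (~30% of MAU)
PHOTOS_PER_USER_MONTH = 2
PHOTO_SIZE_MB = 5
READS_PER_USER_DAY = 10     # assumed feed reads per user per day
SECONDS_PER_DAY = 86_400

# Storage
monthly_storage_mb = MAU * PHOTOS_PER_USER_MONTH * PHOTO_SIZE_MB
monthly_storage_tb = monthly_storage_mb / 1_000_000
yearly_storage_pb = monthly_storage_tb * 12 / 1_000

# QPS
daily_uploads = MAU * PHOTOS_PER_USER_MONTH / 30
write_qps = daily_uploads / SECONDS_PER_DAY
read_qps = DAU * READS_PER_USER_DAY / SECONDS_PER_DAY

print(f"Storage: {monthly_storage_tb:.0f} TB/month, {yearly_storage_pb:.1f} PB/year")
print(f"Write QPS: ~{write_qps:.0f}, Read QPS: ~{read_qps:.0f}")
```

Running this confirms ~100TB/month, ~1.2PB/year, and roughly 8 write vs. ~350 read QPS — a read-heavy workload, which informs the caching and read-scaling decisions later.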
For tips on estimating unknowns, check out Exponent’s Estimation Strategies and Tricks lesson.
Step 2: Design a high-level system
Design the APIs
Because the system does not require the server to independently push data to the client, we can use standard REST APIs to facilitate communication between the client and the server.
Below, we can mock some of our REST APIs mapped from our functional requirements.
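One possible sketch of that mapping follows; the endpoint paths, parameter names, and response shapes here are hypothetical choices for illustration, not a fixed spec:

```python
# Hypothetical REST endpoints mapped from the three functional requirements.
# Paths, parameter names, and response shapes are illustrative assumptions.

API_ROUTES = {
    # Users can upload images from a mobile client to create a post.
    ("POST", "/v1/posts"): {
        "body": {"user_id": "string", "caption": "string", "image": "binary"},
        "returns": {"post_id": "string", "image_url": "string"},
    },
    # Users can follow other users.
    ("POST", "/v1/users/{user_id}/follow"): {
        "body": {"followee_id": "string"},
        "returns": {"status": "ok"},
    },
    # Users can view a paginated feed of images.
    ("GET", "/v1/feed"): {
        "query": {"user_id": "string", "cursor": "string", "limit": "int"},
        "returns": {"posts": "list[Post]", "next_cursor": "string"},
    },
}

for (method, path), spec in API_ROUTES.items():
    print(method, path, "->", list(spec["returns"])) 
```

Note the cursor-based pagination on the feed endpoint: at this scale, returning the entire feed in one response is impractical.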

Design the data model
The system needs to capture three different data types:
- User data - Metadata about the users of the application
- Photos - Metadata about user-uploaded posts/photos
- User followers - The follow relationships between users
Specific attributes of the data include:
- Each photo entry traces back to the user who posted it.
- Each follower entry maps a single follower to a single followed user; the overall user-to-user relationship is many-to-many.
- The photos table stores only metadata; the image content itself lives in object storage and is referenced by a URL path.
Considering the functional requirements, which prioritize relational querying between photos, users, and followers, we can store the data by maintaining a separate table for each data type with foreign key references between them.
Backend services will perform a JOIN operation on the followers and photo tables in order to generate a feed for each unique user request.
The Instagram product relies heavily on the relation between these data types to support fetching user feeds. A common query would be to fetch all photos for a group of (followed) users. Because of this, it would make sense to use a SQL Database to host our data models.
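A minimal sketch of this schema and the feed query follows. Table and column names are illustrative assumptions, and SQLite stands in for a production SQL database:

```python
import sqlite3

# Illustrative schema: three tables with foreign-key references, as described above.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE users   (user_id INTEGER PRIMARY KEY, username TEXT);
CREATE TABLE photos  (photo_id   INTEGER PRIMARY KEY,
                      user_id    INTEGER REFERENCES users(user_id),
                      url        TEXT,     -- content lives in object storage
                      created_at INTEGER);
CREATE TABLE follows (follower_id INTEGER REFERENCES users(user_id),
                      followee_id INTEGER REFERENCES users(user_id),
                      PRIMARY KEY (follower_id, followee_id));
""")
db.executemany("INSERT INTO users VALUES (?, ?)",
               [(1, "alice"), (2, "bob"), (3, "carol")])
db.executemany("INSERT INTO follows VALUES (?, ?)", [(1, 2), (1, 3)])
db.executemany("INSERT INTO photos VALUES (?, ?, ?, ?)",
               [(10, 2, "s3://photos/10.jpg", 100),
                (11, 3, "s3://photos/11.jpg", 200),
                (12, 1, "s3://photos/12.jpg", 300)])  # alice's own post

# Feed for user 1: recent photos from everyone user 1 follows.
feed = db.execute("""
    SELECT p.photo_id, p.url
    FROM follows f
    JOIN photos  p ON p.user_id = f.followee_id
    WHERE f.follower_id = ?
    ORDER BY p.created_at DESC
""", (1,)).fetchall()
print(feed)  # [(11, 's3://photos/11.jpg'), (10, 's3://photos/10.jpg')]
```

The JOIN across follows and photos is exactly the operation the feed-generation service performs per request, which is why a relational store is a natural first choice here.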

Create a high-level design diagram
First, we mock out the core functionality of the system to incorporate all of the functional requirements.
The system below incorporates the following workflows:
- Users can upload images from a mobile client to create a post: Users will issue a request to the Write App Server, which will write post metadata and the photo to the databases.
- Users can follow other users: Similar to above, the Write App Server will process follow requests and write the relation to the Follows table in the database.
- Users can view a feed of images: In this workstream, the user issues a request to the Read App server, which will query the Database for all posts belonging to the users they follow.

At this point, all the functional requirements should be met. Moving forward, we’ll identify points of weakness in the current system and how to address them alongside the non-functional requirements.
Step 3: Deep-dive into the design
Assess tradeoffs
Below, we outline some common trade-offs within the context of the problem.
Relational vs. Non-Relational Database
To fulfill the feed-generation requirement, the system must query both the follows table and the photos table. This operation benefits greatly from the JOIN operation, which SQL databases support natively.
However, two downsides of using SQL are that we must shard the database manually and that we lose schema flexibility. If these downsides are acceptable for our use case, SQL remains a good fit.
Distribution vs. Query Performance
Since SQL databases are not natively horizontally scalable, we'll need to choose a sharding key so the data works at scale in a distributed system. Assuming feed generation prioritizes recent posts from followed users, a compound sharding key built from the timestamp and user ID serves this access pattern best.
Other requests may be slower as a result, but SQL would provide the best performance for the highest-priority request.
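One way to sketch such a compound key is below; the shard count, bucket size, and hashing scheme are all illustrative assumptions, not a prescribed design:

```python
import hashlib

NUM_SHARDS = 16  # assumed shard count for illustration

def shard_for(user_id: int, created_at: int, num_shards: int = NUM_SHARDS) -> int:
    """Route a post to a shard using a compound (time bucket, user ID) key.

    Bucketing by ~month keeps a user's recent posts co-located (so recent-post
    queries touch fewer shards), while hashing the user ID spreads users
    evenly across shards within each time bucket.
    """
    month_bucket = created_at // (30 * 86_400)  # coarse ~monthly bucket
    key = f"{month_bucket}:{user_id}".encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:8], "big") % num_shards

# Deterministic routing: the same user in the same month maps to one shard.
created_at = 1_700_000_000
assert shard_for(42, created_at) == shard_for(42, created_at + 3_600)
print(shard_for(42, created_at))
```

The trade-off is visible in the code: a lookup that spans many users or a long time range must fan out to multiple shards, which is why only the highest-priority query pattern gets the fast path.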
Read/Write Service vs. Monolithic App Server
We de-coupled the write vs. read capabilities into different services for the following reasons:
- They serve very different use cases that will leverage different technologies (e.g. Redis for the Read Server and Kafka for the Write Server).
- This technique minimizes noisy neighbor problems and supports fault tolerance. If the Write Server has an outage, users can still read and vice versa.
- This technique encourages independent horizontal scaling, where Read Servers will most likely scale out more than writes. This comes at a tradeoff of additional software engineering complexity, instead of maintaining a monolithic service.
Step 4: Identify bottlenecks and scale
A few examples of bottlenecks and their potential solutions include:
- Writes and reads in the system have high latency, especially when considering the scale being worked with.
- We can optimize reads by using a Redis cache and a CDN to store data and photos that will likely be accessed in the future.
- Feed generation may have a high computation time, leading to high latency on read requests.
- To optimize, consider maintaining a pre-computed feed cache or secondary table that is updated on write. This is the "push" (fan-out-on-write) model, as opposed to the "pull" model that computes the feed at read time; with push, a read touches only a single pre-built table.
- At scale, we may want to ensure high availability across our database and servers.
- To address this, we can add replicas across the system’s app servers and database clusters.
- For writes, we can add a message queue to asynchronously update the database to manage spikes in load. However, one caveat is that a message queue may increase the system’s latency.
- Also consider the “celebrity” case where many users would need to get the same update at the same time.
- To address this, we might consider asynchronously updating user feeds in batches. Also, consider adding redundancy in the tables to spread out user traffic.
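The push model described above can be sketched as follows; in-memory dicts stand in for the follower table and the per-user feed cache, and all names are illustrative:

```python
from collections import defaultdict, deque

FEED_LIMIT = 100  # keep only the most recent N entries per user's cached feed

# In-memory stand-ins for the followers table and the per-user feed cache.
followers = defaultdict(set)     # author_id -> set of follower ids
feed_cache = defaultdict(deque)  # user_id -> deque of recent photo ids

def follow(follower_id: int, author_id: int) -> None:
    followers[author_id].add(follower_id)

def on_photo_upload(author_id: int, photo_id: int) -> None:
    """Fan-out-on-write: push the new photo into every follower's feed at
    write time, so a read becomes one cache lookup instead of a JOIN."""
    for follower_id in followers[author_id]:
        feed = feed_cache[follower_id]
        feed.appendleft(photo_id)
        while len(feed) > FEED_LIMIT:
            feed.pop()  # evict the oldest entry

def read_feed(user_id: int) -> list[int]:
    return list(feed_cache[user_id])

follow(1, 2)  # user 1 follows users 2 and 3
follow(1, 3)
on_photo_upload(2, 10)
on_photo_upload(3, 11)
print(read_feed(1))  # [11, 10] - newest first, no JOIN at read time
```

The celebrity problem is also visible here: `on_photo_upload` loops over every follower, so an author with millions of followers makes a single write very expensive, motivating the batched/asynchronous fan-out mentioned above.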

Step 5: Review and summarize
The current state of the system works well as a distributed system and handles the basic requests scoped in the functional requirements. However, a few missing features could be supported in the future:
- Comments: To support commenting, we can add a comments table, along with a service to handle comment requests and data.
- More customization: From a product perspective, we can add more customization to user feeds by ranking them on relevance. This might mean adding a ranking service to the feed-generation pipeline that processes other user properties, e.g., location, preferences, and followers.
Other considerations
If given more time, we can explain what other design choices we might consider. For example:
- Add a reverse proxy for an additional layer of security.
- Identify which cache-update policy would provide the best performance and experience for the end user.
- Go deeper on availability patterns and how the system would respond to the failure of a service or table, ensuring there is no single point of failure.
- A potential concern is the two writes on the consumer layer: one to the metadata DB and one to object storage. If one of these writes fails, the consumer (and message queue) should retry, and writing to object storage should be idempotent, meaning we deterministically generate the same path and do not re-upload the same file.
- If the metadata DB has an outage, we can rely on our cache to provide a partially operational service rather than being completely unavailable.
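The idempotent-upload idea can be sketched by deriving the storage key from the photo's content; the bucket layout and naming here are assumptions for illustration:

```python
import hashlib

def object_storage_path(user_id: int, content: bytes) -> str:
    """Derive a deterministic object-storage key from the photo's content.

    Because the same (user, content) pair always yields the same path, a
    retried write after a partial failure targets the same object instead of
    creating a duplicate - i.e., the upload is idempotent. The path layout
    is an illustrative assumption.
    """
    digest = hashlib.sha256(content).hexdigest()
    return f"photos/{user_id}/{digest}.jpg"

photo = b"\xff\xd8\xff...fake jpeg bytes..."
# A retry produces the identical path, so re-uploading is safe.
assert object_storage_path(42, photo) == object_storage_path(42, photo)
print(object_storage_path(42, photo))
```

With deterministic keys, the consumer can also check whether the object already exists before re-uploading, turning a retry into a cheap no-op.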