Skip to main content

Design WhatsApp

Premium

WhatsApp is an instant messaging software application with functionalities that include sending text, voice and video messages. WhatsApp also offers features such as seeing when the person was last online, sending documents etc.

Step 1: Define the problem

Ask clarifying questions

To start, I’d like to ask some clarifying questions about the key features and aspects we need to include in the design, for example:

  • For group chat, what is the group member limit? 1024
  • What kind of data are we expected to support (text, images, video)? Text for now, but we can discuss images and videos if time permits.
  • For text messages, is there a message size limit? It is 65,536 characters.
  • Are we building a web or mobile application, or both? Both
  • How long shall we store the chat history? 10 years

Define functional requirements

Based on the insights gathered from my clarifying questions, I can boil down the primary areas and features of WhatsApp to the following:

  • One on one conversation between users
  • Group chat
  • User online status

I’d also list some optional functional requirements and discuss their priorities with the interviewer. For example, we decide the following are low priority/no-goals for this system design question.

  • Historical messages
  • Multi-Device support
  • Push notifications

Define non-functional requirements

When dealing with designs that must satisfy high scale, consider what’s most important for end users and what tradeoffs you’d make. The CAP theorem is a good place to start.

Given our functional requirements, it is safe to assume that our non-functional requirements include:

  • High consistency. A user can expect to receive messages according to their relevant orders.
  • High availability is desirable, but can be sacrificed for high consistency.
  • High scalability.
  • Minimum latency which achieves real time chatting experience.

Back-of-the-envelope calculation

Assuming WhatsApp has 2 billion DAUs and an average user sends 20 messages per day.

First we estimate the peak read and write query-per-second (QPS) numbers.

Average Write QPS = DAUs * average_number_of_messages_per_day / 86400 seconds per day ~= 2B * 20 / 86400 ~= 462k

Peak Write QPS = 3 * Average Read QPS = 1.4M

Peak Read QPS = 10x Peak Write QPS = 14M, this is because for a group chat, each message is delivered to multiple readers.

Then we work out the amount of storage needed to hold a year’s worth of data.

  • Size per message:
    • message_id - 8 bytes
    • user_id - 8 bytes
    • text - 100 bytes (let’s say a typical message has 50 characters, and we need two bytes to store a character without compression)
    • Timestamp: 8 bytes
    • Total: ~100 Bytes
  • New messages per day
    • 2B DAU * average_number_of_messages_per_user = 2B* 20 = 40B.
  • Overall storage to store the messages for 10 years
    • 40B messages per day * 100 bytes per message* 365 days per year * 10 years = 14.6PB.

For tips on estimating unknowns, check out our Estimation Strategies and Tricks lesson.

Step 2: Design a high-level system

Design your data model

Naturally we want a Message Table to store all the messages.

Message table:

Primary keyMessageIdInt
FromUserIdInt
ToUserIdInt
ContentVarchar
CreatedAtdatetime

With the message table we can find all the conversations between two users. However, this does not scale to support a group chat so we need a separate table that represents all the conversations inside a group chat.

Group chat table:

Primary keyGroupIdInt
UserIdInt
ParticipantIdsText
IsMutedbool
CreatedAtdatetime
UpdatedAtdatetime

The UserId does not refer to the creator of that group, instead, for a group with N participants, there are N entries in the Group Chat table. This is because some attributes such as IsMuted are private which belong to each individual participant.

We also want to add the GroupId to the Message table so we can easily find all the messages for that group. We also want to remove the ToUserId as there might be multiple receipts in a group chat.

The ParticipantIds column records a list of participants in the group chat and is the same for each of the N entries in the same group chat. This can be further optimized to save space at the cost of performing a database join operation.

A new Message table:

Primary keyGroupIdInt
Primary keyMessageIdInt
FromUserIdInt
ContentVarchar
CreatedAtdatetime

We can choose NoSQL such as Cassandra for the Message table since it’s massive and most of the write operations are just appendings. NoSQL databases are optimized for this use case.

For the Group Chat table, we choose SQL database such as MySQL because we need it to be indexed by both the UserId + GroupId (so that we can quickly locate a user’s group chat) and UserId + UpdatedAt timestamp (so that we can order a user’s most up-to-date chat chronologically).

Separately, if we want to support a user’s online status, we need another database to store that information.

A Online Status table:

Primary keyUserIdInt
StatusEnum
LastActiveAtdatetime

The Online Status table can use a KV store database such as DynamoDB.

Create a high-level design diagram

Let’s provide a high-level design that delivers the basic functional requirements.

Design WhatsApp High Level Design

The above diagram includes two core functionalities, one for sending/receiving messages, and another for the user's online status.

Sending/receiving messages

When client A wants to send a message to client B, they first send a message to the Web Server and then the Chat Server. Upon receiving the message, the Chat Server responds to client A with an acknowledgment.

The message gets saved in the Message table and the Chat Server also updates the UpdatedAt timestamp in the group chat table. Note that an one-on-one conversation is the simplest format of a group chat.

Then the Chat Server retrieves the message and sends them to client B. This can be accomplished by client B periodically pinging the Chat Server, asking if there are any newly incoming messages.

After client B successfully received the message from client A, it sends an acknowledgment message to the Chat Server, and the Chat Server notifies client A that the message has been successfully delivered to client B.

Online status

How do we define online status? We can say a user is online when they login to our web server. We can also say they are offline when they log off from our chat application.

But what if the user just gets temporarily disconnected due to an internet disruption? That might happen very frequently for some users so making the online status depend on the internet condition is probably a bad idea, as it adds too much pressure to our back end service.

A common way to define the online status is whenever a user is inactive for x (e.g. 2 mins) amount of time.

This can be achieved by maintaining a heartbeat between the client and our Web Server, and if the heartbeat is not detected in x amount of time, we change the user’s online status to offline. The user’s followers can pull the Online Status Server so that they know the user’s online status.

At this point all the functional requirements should be met. Moving forward, we can identify points of weakness in the current system and how we can address them alongside our non-functional requirements.

Step 3: Deep-dive into the design

Assess tradeoffs

Below we can outline some common trade-offs within the context of the problem.

Push vs. pull model in the chat service

In our proposed high-level design, a user periodically pulls from the Chat Server to receive any new incoming messages.

In the pull model, the Chat Service needs to keep track of what messages have been sent to the user so the user doesn’t receive duplicate messages. More importantly, the user still needs to pull from the server even if they do not receive any messages.

How about using a push model? A user keeps an open connection with the Chat Server, so that the server can send messages to the user whenever there are new messages. There are typically two ways that a client can maintain the open connection with the server: HTTP long polling and websockets.

HTTP long polling v.s. websocket

With HTTP long polling, the client sends requests to the service, but the server might or might not return to the client immediately. Instead, the server pushes information to the client whenever the data is available.

WebSocket provides two-way and persistent communication channels between the client and the server so that both ends can send and receive data at any time. Here's a comparison between HTTP Long Polling and WebSockets.

HTTP Long PollingWebSockets
Communication TypeRequest-Response ModelTwo-way Communication
LatencyHigher latency due to waiting for server responseLower latency as communication is bidirectional
OverheadHigher overhead due to constant opening/closing of connectionsLower overhead, more efficient
Implementation ComplexitySimpler implementation as it works within the traditional request-response modelCan be more complex to implement due to the need for handling bidirectional communication and potential issues with firewalls/proxies

Push vs. pull model in the online status service

We have used a pull model so that a user can pull from the Online Status Server to query the accounts they follow.

How about using a push model similar to what we do in the Twitter newsfeed generation? We can implement a fanout service to deliver a user’s online status to all of their followers. There are several difficulties in implementing a push model in Online Status service.

First of all we do not know who is online so that we can push the online status to. Most of the uses are offline so it would be a waste of resources to push the status to all of the followers.

Secondly, network disruptions tend to happen geographically. Once a network disruption is restored, we might need to fan out the online status to a large number of users which causes surges in queries.

Therefore a pull model is often desired here for its simplicity. Since a user has to periodically send the heartbeat to the server (to tell the server they are online) anyway, we can let the Online Status Server send the followees’ online status back to the user as a response.

Step 4: Identify bottlenecks and scale

Message orderings

One question that the interviewer might ask is how to ensure the ordering of the messages.

Can we sort the messages according to the message’s timestamp? The answer is no. See the following example where two users A and B sends M1 and M2 in a group chat respectively.

Design WhatsApp Message Orderings

In the above figure, neither user A nor B sees the other message when they send out their own message. Therefore they both expect to see their message appear first.

For example, at t5, user A expects to see M1 before M2 because at t-t0 when user A sends out M1, they haven't seen M2 yet. Sorting the messages according to the server timestamp would place M1 before M2. So this works out perfectly.

However, at t4, user B expects to see M2 before M1 because at t=t2 when user B sends out M2, they haven’t seen the M1 yet. But the same server timestamp order will render user B a different message order: user B actually sees M1 before M2!

This example shows we can’t order the messages’ based on their timestamp, instead, we should return different ordering to different users. This can be achieved by associating each message with a sequence number as the message is sent for each participant in a group chat.

For example, from user B’s perspective, M2 has a lower sequence number than M1 as it is sent to B earlier (M2 is sent at t2 and M1 is sent from the Chat Server to user B at t5). In this way, each participant has its own view of the correct message ordering.

Removing single point of failure

To scale up the system, we need to make sure there is no single point of failure. Here are some (but not exhaustive) optimization ideas we can discuss with the interviewer.

  • Add a Load Balancer between the client and the Web Servers. The load balancer will choose the best (in terms of availability, latency, utilization etc.) for a given incoming request. Similar approach can be applied between a web server and Chat servers.
  • Add replicas across our system’s app servers and database clusters.  All writes will first go to the primary server and then will be replicated to secondary servers, either synchronously or asynchronously. This also makes the system more robust, and when failure happens, we can failover to a secondary server.
  • Shard the database to even out the traffic. For example, we can shard the Message Table by MessageID so that we can evenly distribute all messages to avoid hotspots. However, fetching all the messages in a group chat incurs higher latencies because we need to query all database partitions to get the group chat messages. We can mitigate this issue by using a combination of GroupID and MessageID. In this way, we can get all the messages in the same group chat by querying only a few partitions.

Step 5: Review and summarize

The current state of our system works well as a distributed system and appropriately handles basic requests we scoped in our functional requirements.

If we have more time, we can continue to discuss other low priority functionalities such as historical messages, multi-device support and push notifications.

Hiring decision

  • The candidate (TC) provides a clear scope of the problem and navigates the problem by considering realistic design constraints and tradeoffs.
  • TC also demonstrates a logical thinking process and problem solving skills at  each step of the interview and lands on a working solution.
  • In most cases, TC shows knowledge of various system design technologies, evaluates the pros and cons for each and makes a sound decision based on the specific requirements of the system.
  • TC is aware that the initial working solution has several drawbacks and is able to provide mitigations and optimizations.
  • TC considers several corner cases of the system and is able to provide good coverage for these cases.
  • TC works towards a final design without a single point of failure and better performance.

Considering these factors, I would be inclined to hire this candidate.

Exponent coach decision: Hire