Ready for a high-level look at the structure of system design interviews, the core concepts you'll be assessed on, and how to answer the most common questions?
By the end, you'll understand which parts of system design interviews you feel confident in and which areas need improvement.
We wrote this system design interview guide with feedback from 50+ EM, SWE, and TPM technical interview coaches at startups and FAANG+ companies like Microsoft, Amazon, Meta, Google, Netflix, Dropbox, and Stripe.
It was reviewed and edited by senior engineers and managers, and it includes contributions by Anthony Pellegrino.
The system design interview evaluates your ability to solve a complex problem by designing a system or architecture in a semi-real-world setting.
Your solution doesn't need to be perfect.
Instead, the hiring manager will assess your ability to make decisions in the face of uncertainty, your confidence in taking risks, and your capacity to adapt to new information.
These are the most common questions you'll encounter during technical interviews, drawn from our database of frequently asked system design interview questions for EMs, TPMs, and SWEs.
Social media platforms allow users to post photos and videos, follow and unfollow each other, like and comment on posts, search for content, and get personalized newsfeeds.
Chat applications enable users to send instant messages to each other.
Popular applications like WeChat and WhatsApp support billions of users worldwide across a huge range of devices.
Chat apps require support for one-on-one communication, group threads, and sending text, images, videos, and files.
Streaming platforms let users stream content on demand, get personalized recommendations, create multiple user profiles, and search an extensive content library.
Similarly, cloud storage applications efficiently handle storing and retrieving large amounts of data.
Real-time collaboration with virtual tools has become commonplace as working remotely gains popularity.
To build these systems, consider real-time collaboration, version control, access control, and notifications.
Generating recommendations based on a user's current location can be essential for finding restaurants, hotels, and points of interest while on the go.
Location-based systems enable users to search for locations, obtain directions, estimate distances and travel times, and receive contextual search results.
Machine learning systems rely on ML models to process data and produce predictions.
You must know how to send data to a model, receive and process a response, and continuously improve the model based on feedback about the quality of its output.
Payment systems must keep track of inventory, handle transactions, issue receipts, and prevent orders if a product is out of stock or unavailable.
When matching players in online multiplayer video games, consider factors such as latency, player skill levels, and match settings.
A system design interview is usually composed of 5 steps: gathering requirements, creating a high-level design, deep-diving into the components, identifying bottlenecks and scaling, and wrapping up.
A typical system design interview lasts between 45 and 60 minutes.
Good interviewers will leave a few minutes in the beginning for introductions and a couple at the end for questions.
This is just an estimate; adjust it to fit your own interviewing style.
Time estimate: 8 minutes
Start by gathering more information from your interviewer about the system's constraints.
Use a combination of context clues and direct questions to clarify what the system must do and who will use it.
Once you've identified and confirmed the functional requirements with your interviewer, consider the non-functional requirements of the system design.
These may be related to business objectives or user experience.
Non-functional requirements include qualities like performance, scalability, reliability, and security. The table below lists questions you can ask to identify each one.
If there are many design constraints and some are more important than others, focus on the most critical ones.
For example, if designing a Twitter timeline, focus on Tweet posting and timeline generation services instead of user registration or how to follow another user.
| Requirement | Question |
| --- | --- |
| Performance | How fast is the system? |
| Scalability | How will the system respond to increased demand? |
| Reliability | What is the system’s uptime? |
| Resilience | How will the system recover if it fails? |
| Security | How are the system and data protected? |
| Usability | How do users interact with the system? |
| Maintainability | How will you troubleshoot the system? |
| Modifiability | Can users customize features? Can developers change the code? |
| Localization | Will the system handle multiple currencies and languages? |
You can roughly estimate the data volume with some quick back-of-the-envelope calculations.
For example, you can present queries per second (QPS), storage size, and bandwidth requirements to your interviewer.
This can help you pick components for your system. It will also give you an idea of scaling opportunities later.
As you estimate data, make some assumptions about user volume and typical user behavior.
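For example, a rough back-of-the-envelope calculation might look like the sketch below. The user counts, post rates, and sizes are illustrative assumptions, not targets you need to memorize.

```python
# Back-of-the-envelope estimation with assumed, illustrative numbers.
SECONDS_PER_DAY = 24 * 60 * 60

daily_active_users = 10_000_000   # assumption
writes_per_user_per_day = 2       # assumption: each user posts twice a day
reads_per_user_per_day = 100      # assumption: read-heavy workload
avg_post_size_bytes = 1_000       # assumption: ~1 KB per post

write_qps = daily_active_users * writes_per_user_per_day / SECONDS_PER_DAY
read_qps = daily_active_users * reads_per_user_per_day / SECONDS_PER_DAY
daily_storage_gb = daily_active_users * writes_per_user_per_day * avg_post_size_bytes / 1e9
yearly_storage_tb = daily_storage_gb * 365 / 1e3

print(f"Write QPS: ~{write_qps:,.0f}")
print(f"Read QPS:  ~{read_qps:,.0f}")
print(f"Storage:   ~{daily_storage_gb:,.0f} GB/day, ~{yearly_storage_tb:,.1f} TB/year")
```

Presenting numbers like these early makes it easier to justify later choices, such as whether you need sharding or a CDN.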
Time estimate: 10 minutes
Next, explain how each part of the system will work together.
Start by designing APIs (Application Programming Interfaces).
APIs define how clients can access your system's resources or functionality via requests and responses.
Consider how clients interact with the system and the types of data they're passing through.
Clients may want to create/delete resources or read/update existing ones.
Each system requirement should translate to one or more APIs.
At this step, choose which type of API you want to use and why—for example, REST, GraphQL, or gRPC.
Consider the request's parameters and the response type.
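As a sketch, a REST-style API for a hypothetical tweet service could expose endpoints like the ones below. The framework (Flask), paths, and fields are assumptions for illustration, not a prescribed design.

```python
# Illustrative REST API sketch using Flask; endpoints and fields are assumptions.
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.post("/v1/tweets")
def create_tweet():
    """Create a tweet from the request body."""
    body = request.get_json()
    # In a real system: authenticate, validate input, write to storage.
    tweet = {"id": 123, "author_id": body["author_id"], "text": body["text"]}
    return jsonify(tweet), 201

@app.get("/v1/users/<int:user_id>/timeline")
def get_timeline(user_id: int):
    """Return a page of the user's home timeline."""
    limit = int(request.args.get("limit", 20))   # pagination parameter
    # In a real system: read from a timeline cache or fan-out store.
    return jsonify({"user_id": user_id, "tweets": [], "limit": limit})

if __name__ == "__main__":
    app.run(debug=True)
```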
Next, think about how the client and web server will communicate.
There are several popular options to choose from:
Each has different communication directions and varying performance advantages and disadvantages.
| | Pros | Cons |
| --- | --- | --- |
| Ajax polling | Easy to implement; works with all browsers | High server load; high latency |
| Long polling | Lower latency than regular polling; fewer wasted requests | Server must hold many open connections; more complex to implement |
| WebSockets | Full-duplex, real-time communication | Requires a more complex server setup and connection management |
| Server-sent events | Efficient, low latency; works over plain HTTP | Unidirectional (server to client); not supported by all browsers |
Once you've designed the API and established a communication protocol, determine the core database data models.
This includes the core entities, their attributes, and the relationships between them.
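For the Twitter-like example used elsewhere in this guide, the core data model might be sketched like this. The entities and fields are assumptions you would confirm with your interviewer.

```python
# Illustrative data model for a Twitter-like system; entities and fields are assumptions.
from dataclasses import dataclass
from datetime import datetime

@dataclass
class User:
    id: int
    username: str
    created_at: datetime

@dataclass
class Tweet:
    id: int
    author_id: int        # references User.id
    text: str
    created_at: datetime

@dataclass
class Follow:
    follower_id: int      # references User.id
    followee_id: int      # references User.id
    created_at: datetime
```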
After designing the API, establishing a communication protocol, and building a rough data model, the next step is to create a high-level design diagram.
The diagram should serve as a blueprint for your design.
It highlights the most essential pieces to fulfill the functional requirements.
You don't need to go into too much detail about each service yet.
Your goal at this step is to confirm that your design meets all functional requirements.
Demonstrate the data and control flow for each requirement to your interviewer.
In the Twitter design example, you could walk your interviewer through how features like posting a Tweet, generating the home timeline, and following another user work.
Time estimate: 10 minutes
Next, examine system components and relationships in more detail.
The interviewer may prompt you to focus on a particular area, but don't rely on them to drive the conversation.
Consider how non-functional requirements impact design choices.
System design questions have no "correct" answer. Every question can be answered in multiple ways.
The most important skill of a system design interview is your ability to weigh trade-offs as you consider functional and non-functional requirements.
Time estimate: 10 minutes
After thoroughly examining the system components, take a step back.
Are there any bottlenecks in this system? How well does it scale?
Evaluate if the system can operate effectively under different conditions and has the flexibility to support future growth.
Pay particular attention to where the design could be decoupled and where buffering could absorb spikes in traffic.
Decoupling backend services is crucial for achieving scalability and reliability in system design.
By breaking down processes and implementing queuing mechanisms to manage traffic, systems can be optimized for high performance at scale.
An example of event-driven architecture is Pramp, a peer-to-peer mock-interview tool for software engineers.
On Pramp, registering a user is handled as an asynchronous event, involving multiple services working in tandem.
Message Queues (MQs) play a pivotal role in enabling orderly and efficient message transmission to a single receiver.
On the other hand, Publish-Subscribe (Pub/Sub) systems excel at broadcasting information to multiple subscribers simultaneously.
Examples include RabbitMQ and Amazon SQS for point-to-point message queues, and Apache Kafka and Google Cloud Pub/Sub for publish-subscribe messaging.
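To make the decoupling idea concrete, here's a minimal producer/consumer sketch using Python's standard-library queue module. The user-registration event is a hypothetical example echoing the Pramp scenario above; in production, the in-process queue would be a message broker.

```python
# Minimal producer/consumer sketch; the event type is hypothetical.
import queue
import threading

events = queue.Queue()  # in a real system this would be a broker (e.g., SQS, Kafka)

def producer():
    """The API server enqueues an event instead of doing the work inline."""
    events.put({"type": "user_registered", "user_id": 42})

def consumer():
    """A background worker drains the queue and processes events asynchronously."""
    while True:
        event = events.get()
        if event is None:          # sentinel to stop the worker
            break
        print(f"Processing {event['type']} for user {event['user_id']}")
        events.task_done()

worker = threading.Thread(target=consumer)
worker.start()
producer()
events.put(None)   # signal shutdown
worker.join()
```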
When discussing bottlenecks, structure your answer around the two or three most important limitations to keep it concise.
Time Estimate: 4 minutes
This is the end of the interview. You can summarize the requirements, justify decisions, suggest alternatives, and answer any questions.
Walk through your decisions, providing justification for each and discussing any space, time, and complexity tradeoffs.
Throughout the discussion, refer back to the requirements periodically.
System design interviews help determine the level at which a candidate will be hired.
For junior engineers and new graduates, system design carries less weight. Junior candidates are expected to know the basics, but not every detailed concept.
For instance, junior candidates don't need to know when to use NGINX or AWS' native load balancer. They only need to know that a load balancer is necessary.
However, for senior, staff, and lead candidates, having an in-depth understanding of system design and various trade-offs becomes vital.
Having more than one system design interview for higher-level roles is common.
During a system design interview, candidates often overlook the fact that their leadership behaviors and skills are also being evaluated.
In addition to assessing technical skills for designing at scale, the interviewer also tries to answer, "What is it like to work with you, and would they want you on their team?"
You can demonstrate leadership skills in an interview by driving the conversation, communicating trade-offs clearly, and incorporating your interviewer's feedback.
Demonstrating these skills during the interview is critical to receiving a positive evaluation.
When designing a system like ChatGPT, several functional requirements need to be considered, such as creating, updating, viewing, and deleting conversations.
Additionally, rating a response by giving thumbs up or down can help train the model.
Text-based inputs in English are assumed, and inputs and outputs go through a sanitization phase to remove profanity and detect insults.
Server response latency can become high due to extensive processing time on the back end.
Login flows and rate limiting can prevent DDoS attempts. The rate limiter must also be scalable so the system can serve many users simultaneously without issues.
A scalable database should be designed to handle conversation storage, with NoSQL being an ideal option.
The average message size is estimated at 100 bytes, so 200 million messages per day translate to 20 GB per day, roughly 7.3 TB per year, and about 73 TB over ten years.
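As a quick sanity check of those numbers, using the assumptions stated above:

```python
# Quick check of the storage estimate using the assumptions above.
messages_per_day = 200_000_000
avg_message_size_bytes = 100

bytes_per_day = messages_per_day * avg_message_size_bytes
gb_per_day = bytes_per_day / 1e9          # ~20 GB/day
tb_per_year = gb_per_day * 365 / 1e3      # ~7.3 TB/year
tb_per_decade = tb_per_year * 10          # ~73 TB over ten years

print(f"{gb_per_day:.0f} GB/day, {tb_per_year:.1f} TB/year, {tb_per_decade:.0f} TB/decade")
```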
The system's high-level design should include a conversation service. This manages user inputs and the ChatGPT model, the system's core component.
Before a message is processed, it undergoes sanitization and analysis. This ensures it meets the standards required for processing.
The model's results are stored in a single conversation database.
Finally, a thumbs-down rating indicates that the model needs to be retrained, and a risk model is designed to detect the legitimacy of user ratings.
The conversation service is a REST API that supports creating, deleting, and viewing conversations and sending messages. Each conversation has a unique ID, and each message within it has its own ID. The user can rate each message with a thumbs up or thumbs down.
The data is stored in a NoSQL database, where the conversation table contains various conversation IDs, while each message contains an ID, text, author, and parent.
ChatGPT uses a Transformer model, which predicts natural sequences of words.
The model is trained on internet data such as websites, books, and Wikipedia to provide semantically meaningful and grammatically correct replies. It can use decoding strategies such as greedy decoding, top-K sampling, or nucleus (top-p) sampling with temperature to select its output.
A dataset of question-and-answer pairs is initially used to train the model.
However, since training the model on every possible question-and-answer combination is impossible, a reward model is designed to score responses. The chatbot is then refined with reinforcement learning, rewarded for appropriate responses and penalized for inappropriate ones.
The chatbot design is expected to support different input and output formats, including images, audio, and video.
Even if the model does not have a large dataset initially, it can still provide accurate responses in natural language. The reward model considers the emotion and tone of the question and answer and continuously trains itself to improve accuracy.
A system like YouTube must cater to both content creators and consumers of the content. Content creators should be able to upload videos in different formats from any device, and the system should take care of post-processing.
The viewing experience should be device-agnostic, allowing for a seamless viewing experience across any screen size or device.
Among the non-functional requirements, high availability is crucial, while eventual consistency is acceptable. The system also needs to account for the read-to-write ratio, which will heavily favor reads.
To facilitate content creation, the system should have an API for uploading metadata around the video and the video itself, with an open socket connection for large video transferring.
The system can use a queueing system to process videos into different formats and resolutions, storing the video files in blob storage like S3 and the metadata in a relational database.
In addition, the system can shard the databases by users/creators for scalability.
Viewers will use a streaming service with a Content Delivery Network (CDN) that checks the database to validate permissions and then retrieves the video from the blob storage, storing it in the CDN for future requests.
The system could partition data by geography or genre, but sharding by creator makes the most sense.
To optimize the user experience, an adaptive bitrate system can adjust streaming quality based on the user's connection speed. Additionally, an in-memory cache can be implemented to prevent overload during high-traffic periods, and an invalidation strategy like LRU can be used to efficiently remove outdated videos from the CDN.
For analytics purposes, a stream or analytics system can be established.
One suggestion for improving the design is to add more fault-tolerance measures, such as a primary-replica (master-slave) configuration for the database.
Additionally, space and cost optimization should be considered, especially for a data-heavy system like a video streaming service.
This is a common Amazon system design interview question.
Consider dividing the system into public and internal endpoints and using a web or mobile app that interfaces with a back-end server.
To ensure high consistency and avoid double booking, use a strong consistency approach instead of an eventual consistency approach for a reservation system.
This approach would involve using read locks on replicas before writing to the Postgres database to keep the data as up-to-date as possible.
Additionally, sharding based on location can improve consistency, and read replicas and load balancing can help distribute load and maintain performance.
The data schema for the system should include a reservations table with foreign keys for garage ID, spot ID, start and end time, and payment status. The garage table should have an ID that is a primary key, a zip code, and rates based on vehicle size.
The spot table should include a serial primary key, a foreign key for garage ID, and a status that can be reserved, unavailable, or empty.
An optional users table may be added, with a primary-key ID, email, first and last name, and vehicle type.
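Here's a minimal sketch of that schema, using SQLite for illustration. The exact column names, types, and per-size rate columns are assumptions based on the description above; a production system would more likely run PostgreSQL.

```python
# Illustrative schema sketch in SQLite; names and types are assumptions.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE garage (
    id INTEGER PRIMARY KEY,
    zip_code TEXT NOT NULL,
    rate_compact REAL,          -- rates vary by vehicle size
    rate_standard REAL,
    rate_oversized REAL
);

CREATE TABLE spot (
    id INTEGER PRIMARY KEY,
    garage_id INTEGER NOT NULL REFERENCES garage(id),
    status TEXT CHECK (status IN ('reserved', 'unavailable', 'empty'))
);

CREATE TABLE reservation (
    id INTEGER PRIMARY KEY,
    garage_id INTEGER NOT NULL REFERENCES garage(id),
    spot_id INTEGER NOT NULL REFERENCES spot(id),
    start_time TEXT NOT NULL,
    end_time TEXT NOT NULL,
    paid INTEGER NOT NULL DEFAULT 0   -- payment status
);
""")
conn.commit()
```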
Remember the potential trade-offs, such as flexibility versus enums in the vehicle type column and load balancing versus maintaining consistency.
Consider using an existing third-party payment system instead of developing one in-house to save time and resources.
To design a minimum viable product (MVP) for a two-sided network like Twitter, focus on tweet creation, timeline generation, and allowing users to follow others and interact with tweets.
The corresponding API endpoints would be handled by multiple servers behind load balancers.
For timeline generation, active users' timelines are updated using a stack-like approach, adding daily tweets and considering user engagement for optimal timeline generation.
Influencers require a more dynamic approach, with data added to timelines on the fly when requested.
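A rough sketch of that hybrid approach is shown below. The in-memory structures stand in for real caches and databases, and the follower threshold is an assumption for illustration.

```python
# Hybrid timeline sketch; data stores and thresholds are illustrative assumptions.
from collections import defaultdict

FANOUT_THRESHOLD = 10_000              # accounts above this follower count are "influencers"

followers = defaultdict(set)           # author_id -> set of follower ids
timelines = defaultdict(list)          # user_id -> precomputed list of tweet ids
influencer_tweets = defaultdict(list)  # influencer_id -> their recent tweet ids

def post_tweet(author_id: int, tweet_id: int) -> None:
    """Fan out on write for regular users; store influencer tweets for read-time merge."""
    if len(followers[author_id]) > FANOUT_THRESHOLD:
        influencer_tweets[author_id].append(tweet_id)
    else:
        for follower_id in followers[author_id]:
            timelines[follower_id].append(tweet_id)

def get_timeline(user_id: int, following: set) -> list:
    """Merge the precomputed timeline with influencer tweets at read time."""
    merged = list(timelines[user_id])
    for author_id in following:
        if len(followers[author_id]) > FANOUT_THRESHOLD:
            merged.extend(influencer_tweets[author_id])
    return merged[-50:]                # return the most recent items
```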
High availability is maintained through multiple servers and databases set up in a master-slave configuration.
Twitter must optimize for reads and high availability as a read-heavy system while ensuring influencers cannot bring the system down.
Additional non-functional requirements include eventual consistency, low latency, and high availability.
Interaction with multimedia and text content is similar, and the system needs algorithms that prioritize user engagement so it can surface the subset of tweets generating the most interaction.
Consider potential security issues such as DDoS attacks. Login flows and rate limitation procedures can prevent DDoS attempts aimed at the system, ensuring its safety.
When making this application scalable, have an API server read from a separate cache for the newsfeed.
You should also use a feed service to refresh the feed cache regularly.
Use blob storage services like Amazon S3 or Google Cloud Storage to handle static content. This will give you fast response times and high availability.
Use a strong SQL database management system like PostgreSQL that works well with large-scale indexing.
Using a cache for frequently accessed content keeps it close at hand and gives you faster response times. In-memory caching with Redis or Memcached for metadata can also improve performance.
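For example, a cache-aside pattern for post metadata might look like the sketch below, assuming the redis-py client; the key names, TTL, and database stub are illustrative.

```python
# Cache-aside sketch for post metadata using Redis; keys and TTL are assumptions.
import json
import redis

cache = redis.Redis(host="localhost", port=6379, decode_responses=True)
CACHE_TTL_SECONDS = 300

def fetch_post_from_db(post_id: int) -> dict:
    """Placeholder for a real database query."""
    return {"id": post_id, "caption": "hello world", "likes": 42}

def get_post(post_id: int) -> dict:
    key = f"post:{post_id}"
    cached = cache.get(key)
    if cached is not None:                 # cache hit
        return json.loads(cached)
    post = fetch_post_from_db(post_id)     # cache miss: go to the database
    cache.setex(key, CACHE_TTL_SECONDS, json.dumps(post))
    return post
```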
Balance the trade-offs between consistency, availability, and partition tolerance. The CAP theorem can help you decide which factors are most important in your application. A load balancer and sharding can help balance the load on the database server.
Use a content delivery network (CDN) for low-latency delivery of globally distributed media.
With an estimated 200 million users, optimizing for a global audience is a key part of the design process.
To store compressed video files and user metadata, you need a highly available system that responds quickly and can scale to support more users. With one million daily active users, the system must be able to handle that level of traffic.
An API can be designed with endpoints such as "Upload Video" and "Get User Activity."
A relational database system like PostgreSQL can store and link structured user data objects.
To reduce the load on the database and improve latency, preloading a cache of the top ten videos for each user can be implemented.
To associate user activity, such as liking and following, with specific videos, a video `UUID` and `ID` field is required.
User activity can be stored in a database, and an API endpoint can be developed for a `GET` request to return a list of the user's likes and followers.
To handle a 10x increase in traffic, a Content Delivery Network (CDN) can cache and route video content traffic to the closest node.
A load balancer can also be used for scaling deployments and performing zero downtime deployments.
A regional database sharding service can distribute load between databases and read-only replicas.
A pre-caching service can be implemented to manage `GET` requests and preload video content.
The design of Uber Eats requires attention to both functional and non-functional requirements.
First, define the needs of all stakeholders – restaurants, customers, and delivery people.
The top priority is to design a system that allows restaurants to add their information and customers to view and search for nearby restaurants based on delivery time and distance. This system requires eventual consistency to ensure the accuracy of restaurant information.
High availability, security, scalability, and latency are all critical non-functional requirements. The expected numbers of daily views, customers, and restaurants must be taken into account.
Successful design involves data modeling for restaurants and menu items, including geohashing for efficient proximity searches.
To optimize proximity comparisons between locations such as restaurants and customers, the system subdivides the world into a grid and encodes each cell as a base-32 string (a geohash), so that nearby locations share a common prefix and are easy to match.
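For illustration, here's a simplified encoder following the standard geohash algorithm; it's a sketch to show how latitude/longitude pairs map to base-32 strings, not a production implementation.

```python
# Simplified geohash encoder; precision and coordinates below are illustrative.
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash_encode(lat: float, lon: float, precision: int = 8) -> str:
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    bits = []
    even = True                      # alternate bits: longitude first, then latitude
    while len(bits) < precision * 5:
        rng, value = (lon_range, lon) if even else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits.append(1)
            rng[0] = mid
        else:
            bits.append(0)
            rng[1] = mid
        even = not even
    chars = []
    for i in range(0, len(bits), 5):  # pack every 5 bits into one base-32 character
        value = 0
        for bit in bits[i:i + 5]:
            value = (value << 1) | bit
        chars.append(BASE32[value])
    return "".join(chars)

print(geohash_encode(57.64911, 10.40744))  # nearby points share a common prefix
```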
Data modeling involves using relational and NoSQL databases for different data tables and optimizing the user experience for uploading new restaurant information.
Employing an Elasticsearch service on top of Cassandra for sharding and scaling, with separate databases where optimization calls for them, is also important.
Multiple services such as a search service, viewing service, and restaurant service could be employed for efficiency, with the search service incorporating ElasticSearch and geohashing.
Isochrones and polygons help estimate delivery times and identify delivery areas.
The viewing service could include caching frequently viewed menus and leveraging isochrones to estimate customer delivery times.
The restaurant service could track all events using Kafka queues and implement an API for moderation, optimizing the system's efficiency.
Scalability, optimization, and customer experience are critical factors in Uber Eats' design. By considering all stakeholders' requirements, you can develop an efficient system that provides a seamless experience.
"I recently completed the Google engineering manager interview loop. These were regular system design interviews with different engineers and teams.
To prepare for the interviews, I watched a lot of mock interviews from Exponent. I also read some books and practiced answering system design questions in Google Docs. I practiced writing solutions for 3-4 systems, including Google Drive, Instagram, a hotel booking system, Google Maps, an analytics system, and blob storage.
A coding interview round was also evaluated by an L6 engineering manager. They advised me to spend time understanding which database to choose.
I recommend checking out Alex Xu's system design database table and use cases on Twitter. Spend an evening learning about all the different use cases for these database types. Google likes to ask detailed questions about database selections.
Additionally, I reviewed all of the databases used by Google, including Bigtable, Spanner, Firestore, and BigQuery. This gave me a few more points with the interviewer since I approached the problems with their internal tech, not just AWS or Azure. This was probably overkill, but it helped me feel more prepared."
During an Amazon system design interview, a big focus will be on behavioral questions based on Amazon's Leadership Principles.
However, the interview will also evaluate your technical, functional job fit, specifically in system design.
Focus on the big picture rather than becoming an expert on the specific system they want you to create.
Whether you come from a FinTech or HealthTech background, Amazon will likely ask you to design an Amazon-type product. This could be Alexa or Amazon Prime.
Focus on the fundamentals that create a cohesive experience across different layers required for a complex environment to work.
During the interview, you may be asked to optimize your solution or test different parameters to see how you adjust the scope and handle unforeseen circumstances.
The last part of this guide is a breakdown of the fundamental principles and concepts of designing scalable systems.
Network and web protocols are the rules and standards that govern how information is transmitted over the internet.
Refresh your knowledge of key web protocols—such as HTTP/HTTPS, TCP, UDP, DNS, and WebSockets—before your interview.
In a relational database (RDBMS), data is typically stored in tables with rows and columns.
A normalization process stores data with 1-to-N or N-to-N relationships in separate tables joined by Foreign Keys.
This ensures that the data in these tables are consistent and can be joined for a complete view of the data.
As data size increases, traditional database systems face CPU, memory, or disk usage bottlenecks that require high-end and expensive hardware.
However, even with top-quality hardware, most successful modern applications require more data than a traditional RDBMS can handle.
Sometimes, a large table is split into horizontal data partitions. Each partition contains a subset of the whole table and is stored on a separate database server.
This process is called sharding. Each partition is called a shard.
The technique used to partition data often depends on the data's structure.
Some common sharding techniques include:
This technique partitions data based on the user's location, such as their continent of origin or a similarly large area (e.g., "East US," "West US").
This technique allows users to be routed to the node closest to their location, reducing latency.
Bottlenecks: There may not be an even distribution of users in the various geographical areas.
Range-based sharding divides the data based on the ranges of the key value.
For example, selecting the first letter of the user's first name as the shard key divides the data into 26 buckets (assuming English names).
Bottlenecks: This simplifies partition computation but can lead to uneven splits across data partitions. In this example, more users have names starting with the letter A than Z.
This technique uses a hashing algorithm to generate a hash from the key value. It then computes the partition using the hash value.
A good hash algorithm distributes data evenly across partitions, reducing the risk of hotspots.
Bottlenecks: It can assign related rows to different partitions, so the server cannot enhance performance by predicting and pre-loading future queries.
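A minimal sketch of key-based (hash) sharding follows, with an assumed shard count. Note that a simple modulo scheme reshuffles most keys when the shard count changes, which is why consistent hashing is often used instead.

```python
# Hash-based sharding sketch; shard count and key format are illustrative assumptions.
import hashlib

NUM_SHARDS = 8

def shard_for_key(key: str) -> int:
    """Map a key (e.g., a user ID) to a shard using a stable hash."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

print(shard_for_key("user:12345"))   # always routes the same key to the same shard
```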
Consider the pros and cons of sharding techniques before suggesting one in an interview.
Load balancing is a technique used to distribute incoming traffic across multiple servers or resources to ensure that no single server becomes overloaded and unable to handle the traffic.
It allows a system to scale horizontally, meaning it can handle a larger workload by adding more servers or resources rather than relying on a single, powerful server.
Load balancers are essential to many modern technical systems and frequently come up in system design interviews.
Load balancers can use two types of algorithms: static (such as round robin) and dynamic (such as least connections).
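For illustration, here's a sketch contrasting a static strategy (round robin) with a dynamic one (least connections); the server names and connection counts are assumptions.

```python
# Load-balancing strategy sketch; servers and connection counts are illustrative.
import itertools

servers = ["app-1", "app-2", "app-3"]

# Static: round robin cycles through servers regardless of their current load.
round_robin = itertools.cycle(servers)

def pick_round_robin() -> str:
    return next(round_robin)

# Dynamic: least connections picks the server with the fewest active connections.
active_connections = {"app-1": 12, "app-2": 3, "app-3": 7}

def pick_least_connections() -> str:
    return min(active_connections, key=active_connections.get)

print(pick_round_robin())        # app-1
print(pick_least_connections())  # app-2
```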
Content Delivery Networks (CDNs) are a distributed network of servers that deliver content, such as web pages, web documents, images, and videos, to users based on their geographic location.
CDNs replicate content across a network of servers located in strategic locations around the world. When a user requests content, the CDN determines the user’s location and subsequently directs the request to the server that is closest to the user. In doing so, CDNs reduce latency and improve the overall user experience.
According to the CAP theorem, a distributed database system cannot simultaneously guarantee all three of the following properties: consistency, availability, and partition tolerance.
Instead, a system must choose between consistency and availability in the face of network partitions.
Databases are a critical component of technical systems and will inevitably be involved in your system design interviews.
A database is a structured collection of data that is stored and accessed electronically.
There are many kinds of databases to choose from when designing systems, including relational, key-value, document, wide-column, and graph databases.
Understand the trade-offs between different database technologies and how to choose the best database for a particular application.
Caching stores frequently accessed data in a temporary storage location, typically in memory, to improve the performance of a system.
Caching is commonly used in system design because it can significantly improve the speed at which a system retrieves data. Several types of caching are used often, including client-side (browser) caches, CDN caches, application-level in-memory caches like Redis or Memcached, and database caches.
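As an example of the LRU eviction policy mentioned earlier in this guide, here's a minimal cache sketch; the capacity and keys are illustrative.

```python
# Minimal LRU cache sketch using OrderedDict; capacity is an illustrative assumption.
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity: int = 3):
        self.capacity = capacity
        self.items = OrderedDict()           # keys ordered from least to most recently used

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)          # mark as most recently used
        return self.items[key]

    def put(self, key, value) -> None:
        if key in self.items:
            self.items.move_to_end(key)
        self.items[key] = value
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)   # evict the least recently used item

cache = LRUCache(capacity=2)
cache.put("a", "1")
cache.put("b", "2")
cache.get("a")
cache.put("c", "3")       # evicts "b", the least recently used key
print(list(cache.items))  # ['a', 'c']
```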
System design interview questions are similar to coding questions in that they are fundamentally technical.
However, they differ in a few key ways: they're open-ended, collaborative, and focused on trade-offs rather than a single correct answer.
You will need knowledge and comfort with diverse technologies to effectively answer these interview questions.
Engineers, for example, will need to elaborate deeply on the systems within their areas of expertise.
However, management roles, such as TPM, need a much broader knowledge of the systems and technologies they use.
These are some of the most commonly asked questions around prepping for these tough interviews.
Yes and no. Amazon asks system design questions in their engineering interviews.
However, they don't ask these types of questions to freshers and recent graduates. System design questions are usually only asked in interviews for experienced positions (4-5 years of experience).
Yes, Google asks system design questions, but your initial phone screens won't include them. Instead, you'll be asked about algorithms and data structures. You'll encounter system design questions if you advance to later interview rounds.
To pass the Google system design interview, focus on your whiteboarding skills.
System design interview questions are notoriously difficult to prepare for. Unlike algorithmic questions, they don't reduce to a handful of prescribed patterns. Instead, they require years of technical knowledge and experience to answer well.
For junior engineers, this can be tricky. Even senior developers sometimes find themselves scratching their heads trying to understand a system.
The key to doing well in these types of interviews is to use your entire knowledge base to think about scalability and reliability in your answer.
The high-level design focuses on the problem you're trying to solve. The low-level design breaks down how you'll actually achieve it. That includes breaking down systems into their individual components and explaining the logic behind each step. System design interviews focus on both high-level and low-level design elements so that your interviewer can understand your entire thought process.