Design Google Docs

Design a collaborative document editor similar to Google Docs. Watch Daniyal, Senior Software Engineer at Amazon, walk through this system design question.

“Design Google Docs” is a classic collaborative system design problem, where you are asked to design a real-time, multi-user, low-latency document editing system that supports concurrent edits, sharing, and access control at massive scale.

This is fundamentally different from CRUD-heavy systems like Uber Eats because the core challenge here is real-time collaboration and consistency under concurrency.

In this design, the following concerns are central:

  • Low latency for real-time typing and collaboration
  • High availability for document access and editing
  • Eventual consistency for document state across collaborators
  • The trade-off between strong consistency and user experience
  • Efficient storage of document versions and edits

This write-up presents a structured, interview-ready design modeled on how Google Docs is generally understood to work.

Step 1: Define the problem

Ask clarifying questions

Collaborative editors have many optional features. Before jumping into architecture, it’s critical to clarify scope so you don’t over-design or miss key expectations.

Some useful clarifying questions include:

  1. What scale are we designing for? Number of users, documents, and concurrent editors per document.
  2. Is real-time collaboration required, or is near real-time acceptable?
  3. What types of content do documents support? Plain text only, or rich text with formatting, images, and comments?
  4. Do we need version history and undo/redo support?
  5. What permissions should be supported? View, comment, edit, owner?
  6. Should offline editing be supported?

For this design, we’ll assume a system similar to Google Docs with real-time collaborative editing, rich text, sharing with permissions, and high global scale.

Define functional requirements

Based on the clarified scope, we can narrow Google Docs down to its core purpose: enabling multiple users to collaboratively create and edit documents in real time while preserving correctness and usability.

At a high level, the system must allow users to create and manage documents, share them with other users, and collaboratively edit content with minimal latency.

More concretely:

  • Users should be able to create, open, and edit documents
  • Users should be able to share documents with other users with role-based permissions (viewer, commenter, editor)
  • Multiple users should be able to edit the same document concurrently
  • Edits made by one user should appear in near real time for other collaborators
  • The system should support basic rich-text features such as formatting, links, images, and comments
  • Users should be able to view the latest document state when opening a document

These requirements define Google Docs as a collaborative, stateful, real-time system, rather than a simple request-response application.

Define non-functional requirements

For collaborative systems, non-functional requirements often matter more than functional ones. The CAP theorem is especially relevant when reasoning about consistency and availability under concurrent edits.

Several key non-functional requirements shape the design.

First, low latency is critical. When a user types, characters should appear instantly on their screen, and updates from other collaborators should feel nearly real time. Any noticeable delay breaks the illusion of collaboration.

Second, the system must be highly available. Users should be able to open and edit documents even during partial failures. Temporary inconsistencies are acceptable, but the system should rarely be completely unavailable.

Third, the system must be highly scalable. Google Docs serves hundreds of millions of users globally, with potentially millions of documents being edited simultaneously. The system must scale horizontally across regions and documents.

In terms of consistency, strong consistency is not strictly required. Users can tolerate small inconsistencies during collaboration, such as cursor jumps or temporary ordering differences, as long as the system eventually converges to a correct document state.

As a result, this system intentionally favors:

  • Availability and low latency over strong consistency
  • Eventual consistency with conflict resolution
  • Optimistic updates on the client

This trade-off enables a smooth real-time experience while remaining scalable.

Estimate the amount of data (back-of-envelope)

Data estimate

Assumptions:

  • 100 million Daily Active Users
  • Each user owns or edits 10 documents on average
  • Average document size: 1 MB
  • Documents store:
    • Base content
    • Metadata
    • Edit operations / version history

Total document storage

100M users × 10 documents = 1 billion documents; 1 billion documents × 1 MB ≈ 1 PB of raw document data
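The storage arithmetic can be checked with a few lines of Python. This is only a sketch of the estimate above; the constant names are illustrative, and 1 MB is taken as 10^6 bytes.

```python
# Back-of-envelope storage estimate using the assumptions listed above.
DAILY_ACTIVE_USERS = 100_000_000
DOCS_PER_USER = 10
AVG_DOC_SIZE_BYTES = 1_000_000  # 1 MB in decimal units (assumption)

total_docs = DAILY_ACTIVE_USERS * DOCS_PER_USER
total_bytes = total_docs * AVG_DOC_SIZE_BYTES
total_petabytes = total_bytes / 1e15

print(f"{total_docs:,} documents")      # 1,000,000,000 documents
print(f"{total_petabytes:.1f} PB raw")  # 1.0 PB raw
```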

In practice, documents are stored using:

  • Snapshots (periodic full versions)
  • Operation logs (incremental edits)

This significantly reduces storage growth compared to storing full copies for every edit.

QPS estimate

Read QPS (opening documents)

Assume 50M document opens/day ≈ 50M / 86,400 ≈ 600 reads/sec

Write QPS (edits)

Assume 10 edits/sec per active document with 1M concurrent active documents: ≈ 10M write operations/sec globally
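The same estimates, written out so the read/write imbalance is explicit. These are averages under the stated assumptions; peak traffic would likely be higher.

```python
# QPS estimates from the assumptions above (averages, not peaks).
SECONDS_PER_DAY = 86_400
DOC_OPENS_PER_DAY = 50_000_000
EDITS_PER_SEC_PER_DOC = 10
CONCURRENT_ACTIVE_DOCS = 1_000_000

read_qps = DOC_OPENS_PER_DAY / SECONDS_PER_DAY
write_qps = EDITS_PER_SEC_PER_DOC * CONCURRENT_ACTIVE_DOCS
write_to_read_ratio = write_qps / read_qps

print(f"reads/sec  ~ {read_qps:,.0f}")              # ~ 579, rounded to 600 in the text
print(f"writes/sec = {write_qps:,}")                # 10,000,000
print(f"writes per read ~ {write_to_read_ratio:,.0f}")
```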

This makes Google Docs an extremely write-heavy, latency-sensitive system, especially compared to traditional CRUD applications.

Step 2: Design a high-level system

Once the problem space and requirements are clear, the next step is to design a high-level system that can support real-time collaboration at scale while remaining reliable, secure, and responsive.

Unlike traditional CRUD applications, Google Docs is fundamentally a stateful system. The backend doesn't just store data; it must actively coordinate the sequence of edits between multiple actors. This core requirement heavily influences how we design APIs, services, and data flows.

Design the APIs

The APIs for Google Docs are designed around user intent and collaboration workflows. Instead of repeatedly sending full document payloads, which would be bandwidth-intensive, the system primarily exchanges small incremental operations (edits), presence signals, and metadata updates.

To keep the system modular and scalable, the backend is split into domain-specific services, each responsible for a distinct part of the document lifecycle.

ABAC service (Attribute-based access control)

The ABAC Service is responsible for fine-grained permissions. Unlike simple role-based access, it evaluates attributes (user, document, location, time) to determine if a user can view, comment, or edit.

  • APIs
    • GET /permissions/{docId}/{userId}
    • POST /permissions/{docId}/share

Doc creation / metadata service

This service handles the "management" side of documents: creating new files, updating titles, and storing metadata like owner and creation date.

  • APIs
    • POST /doc/create
    • GET /doc/{id}/metadata
    • PATCH /doc/{id}/rename

Doc editing service

The heart of the system. This service maintains a persistent connection with the client (typically via WebSockets) to receive and broadcast edits using Operational Transformation (OT) logic to resolve conflicts.

  • APIs
    • WS /edit/{docId} (WebSocket for real-time operations)
    • POST /edit/{docId}/op (Fallback for individual edit operations)
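To make "small incremental operations" concrete, here is a minimal sketch of what a single edit message on the WebSocket might look like. The `EditOp` class and the `base_version` field are assumptions for illustration; the write-up does not specify a wire format.

```python
import json
from dataclasses import dataclass, asdict
from enum import Enum


class OpType(str, Enum):
    INSERT = "INSERT"
    DELETE = "DELETE"
    FORMAT = "FORMAT"


@dataclass(frozen=True)
class EditOp:
    """Hypothetical edit message sent by the client."""
    doc_id: str
    user_id: str
    base_version: int  # assumption: last server version the client had applied
    type: OpType
    data: dict         # e.g. {"char": "H", "index": 25}

    def to_json(self) -> str:
        payload = asdict(self)
        payload["type"] = self.type.value
        return json.dumps(payload, separators=(",", ":"))

    @staticmethod
    def from_json(raw: str) -> "EditOp":
        fields = json.loads(raw)
        fields["type"] = OpType(fields["type"])
        return EditOp(**fields)


op = EditOp("doc_123", "user_789", 41, OpType.INSERT, {"char": "H", "index": 25})
wire = op.to_json()
print(len(wire), "bytes on the wire instead of the full document")
```

The message stays around a hundred bytes regardless of document size, which is the point of exchanging operations rather than full payloads.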

Presence service

This service tracks who is currently viewing or editing a document. It enables the "cursor" feature where you can see where others are typing.

  • APIs
    • POST /presence/{docId}/heartbeat
    • GET /presence/{docId}/active-users

Design the data model

1. Operations DB (NoSQL – Key-value / document)

This is the most critical database in the system. It stores every single edit (operation) ever made to a document. It is optimized for heavy writes and ordered reads, which allows the system to replay edits deterministically and reconstruct document state.

  • Partition key: doc_id
  • Sort key: version (monotonically increasing per document)

| Attribute | Type | Description |
|-----------|--------|-------------|
| doc_id | String | Unique identifier for the document. |
| version | Long | The sequential version number of this edit. |
| user_id | String | Who made the edit. |
| type | Enum | INSERT, DELETE, FORMAT. |
| data | JSON | Details of the edit (e.g., {"char": "H", "index": 25}). |
| timestamp | Long | Unix timestamp of the operation. |

Each row represents a single logical edit. Because operations are append-only, this store scales well under high write throughput and avoids contention between concurrent editors.
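The append-only access pattern can be sketched with an in-memory stand-in for the operations store. `OperationsLog`, `StoredOp`, and the method names are hypothetical; a real deployment would use a partitioned NoSQL table keyed by document ID and version.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredOp:
    doc_id: str
    version: int
    user_id: str
    type: str
    data: dict


class OperationsLog:
    """In-memory stand-in for a table keyed by (doc_id, version)."""

    def __init__(self) -> None:
        self._ops = defaultdict(list)  # doc_id -> ops ordered by version

    def append(self, doc_id: str, user_id: str, op_type: str, data: dict) -> StoredOp:
        ops = self._ops[doc_id]
        # Versions are 1-based and dense within a document.
        stored = StoredOp(doc_id, len(ops) + 1, user_id, op_type, data)
        ops.append(stored)
        return stored

    def read_after(self, doc_id: str, version: int) -> list:
        # Ordered read of every op with a version greater than `version`.
        return list(self._ops[doc_id][version:])


log = OperationsLog()
log.append("doc_123", "user_1", "INSERT", {"char": "H", "index": 0})
log.append("doc_123", "user_2", "INSERT", {"char": "i", "index": 1})
print([op.version for op in log.read_after("doc_123", 0)])  # [1, 2]
```

Because each document only ever appends to its own partition, writes to different documents never contend with each other in this model.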

2. Metadata DB (NoSQL – Document)

The metadata database stores the high-level information about a document. This data is frequently accessed when users browse their document lists, search for files, or load document headers.

| Attribute | Type | Description |
|-----------|--------|-------------|
| doc_id | String | Unique identifier. |
| owner_id | String | User ID of the creator. |
| title | String | Current name of the document. |
| created_at | Long | Creation timestamp. |
| last_snapshot_v | Long | The version number of the most recent snapshot. |
| tags/folder | List | Organization metadata. |

This database is optimized for fast lookups by document ID and owner, rather than ordered reads or heavy writes.

3. Operations DB – Snapshots (NoSQL – Document)

Replaying tens or hundreds of thousands of operations every time a document is opened would be prohibitively slow. To avoid this, the system periodically stores full document snapshots.

| Attribute | Type | Description |
|-----------|-----------|-------------|
| doc_id | String | Reference to the doc. |
| version | Long | The version number this snapshot represents. |
| content | Blob/Text | The full state of the document at this version. |
| checksum | String | To verify data integrity during loading. |

Snapshots act as checkpoints. When loading a document, the system loads the latest snapshot and replays only the operations that occurred after it.
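A minimal sketch of the load path, assuming plain-text content and only INSERT and DELETE operations. The `Snapshot`, `Op`, and `load_document` names are illustrative, not part of the design above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    version: int
    content: str


@dataclass(frozen=True)
class Op:
    version: int
    type: str        # "INSERT" or "DELETE"
    index: int
    text: str = ""   # used by INSERT
    length: int = 0  # used by DELETE


def apply_op(content: str, op: Op) -> str:
    if op.type == "INSERT":
        return content[:op.index] + op.text + content[op.index:]
    if op.type == "DELETE":
        return content[:op.index] + content[op.index + op.length:]
    raise ValueError(f"unsupported op type: {op.type}")


def load_document(snapshot: Snapshot, ops: list) -> tuple:
    """Rebuild the current state from a snapshot plus later operations."""
    content, version = snapshot.content, snapshot.version
    for op in sorted(ops, key=lambda o: o.version):
        if op.version <= version:
            continue  # already folded into the snapshot
        if op.version != version + 1:
            raise ValueError("gap in operation log")
        content = apply_op(content, op)
        version = op.version
    return content, version


snapshot = Snapshot(version=100, content="Hello world")
ops = [
    Op(version=101, type="INSERT", index=5, text=","),
    Op(version=102, type="DELETE", index=7, length=5),
    Op(version=103, type="INSERT", index=7, text="Docs"),
]
print(load_document(snapshot, ops))  # ('Hello, Docs', 103)
```

Only three operations are replayed here, no matter how many edits preceded version 100.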

4. Presence & session store (Redis)

Presence data is highly volatile and does not require durability. The goal here is speed, not persistence.

Each heartbeat refreshes an entry with a short TTL. If the entry expires before the next heartbeat arrives, the user is automatically removed from the active collaborators list. This keeps presence data accurate without explicit cleanup logic.
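The expiry behavior can be sketched with an in-memory TTL map that stands in for Redis key expiry. `PresenceStore`, the 30-second default TTL, and the injectable clock are assumptions made so the example is self-contained and testable.

```python
import time
from typing import Callable


class PresenceStore:
    """In-memory stand-in for a TTL-based presence store."""

    def __init__(self, ttl_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._expires_at = {}  # (doc_id, user_id) -> expiry time

    def heartbeat(self, doc_id: str, user_id: str) -> None:
        # Each heartbeat pushes the expiry forward by one TTL.
        self._expires_at[(doc_id, user_id)] = self._clock() + self._ttl

    def active_users(self, doc_id: str) -> list:
        now = self._clock()
        # Lazily drop entries whose TTL has elapsed.
        expired = [key for key, t in self._expires_at.items() if t <= now]
        for key in expired:
            del self._expires_at[key]
        return sorted(user for (doc, user) in self._expires_at if doc == doc_id)


store = PresenceStore(ttl_seconds=30.0)
store.heartbeat("doc_123", "alice")
print(store.active_users("doc_123"))  # ['alice']
```

A user who stops sending heartbeats simply ages out; no disconnect handler is required for correctness.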

5. ABAC service (Permissions DB)

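Permissions in Google Docs are flexible and fine-grained. The permissions database stores per-document Access Control List (ACL) entries, which the ABAC service evaluates together with request attributes. These entries are frequently cached to reduce authorization latency.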

| Attribute | Type | Value Example |
|-----------|---------|---------------|
| doc_id | String | doc_123 |
| user_id | String | user_789 |
| role | Enum | VIEWER, COMMENTER, EDITOR, OWNER |
| is_public | Boolean | True (if link sharing is on) |

This data is read far more often than it is written, especially during document open flows, which is why aggressive caching is typically applied here.
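A sketch of how the role column could drive an authorization check. The ordering of roles, the action-to-role mapping, and the choice to treat link sharing as granting VIEWER access are all assumptions for illustration; the class and method names are hypothetical.

```python
from enum import IntEnum
from typing import Optional


class Role(IntEnum):
    VIEWER = 1
    COMMENTER = 2
    EDITOR = 3
    OWNER = 4


# Assumed mapping: the minimum role each action requires.
REQUIRED_ROLE = {
    "view": Role.VIEWER,
    "comment": Role.COMMENTER,
    "edit": Role.EDITOR,
    "share": Role.OWNER,
}


class AccessControl:
    def __init__(self) -> None:
        self._acl = {}             # (doc_id, user_id) -> Role
        self._public_docs = set()  # docs with link sharing enabled

    def grant(self, doc_id: str, user_id: str, role: Role) -> None:
        self._acl[(doc_id, user_id)] = role

    def set_public(self, doc_id: str, is_public: bool) -> None:
        if is_public:
            self._public_docs.add(doc_id)
        else:
            self._public_docs.discard(doc_id)

    def effective_role(self, doc_id: str, user_id: str) -> Optional[Role]:
        explicit = self._acl.get((doc_id, user_id))
        if explicit is not None:
            return explicit
        # Assumption: link sharing gives anyone read-only access.
        return Role.VIEWER if doc_id in self._public_docs else None

    def is_allowed(self, doc_id: str, user_id: str, action: str) -> bool:
        role = self.effective_role(doc_id, user_id)
        return role is not None and role >= REQUIRED_ROLE[action]


acl = AccessControl()
acl.grant("doc_123", "user_789", Role.EDITOR)
print(acl.is_allowed("doc_123", "user_789", "edit"))   # True
print(acl.is_allowed("doc_123", "user_789", "share"))  # False
```

Since the lookup is a pure function of a few small records, its result is a natural candidate for caching on the document open path.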

High level architecture

[Figure: Google Docs high-level architecture diagram]

To better understand how the system behaves end-to-end, it’s useful to walk through the major request flows in Google Docs. Each flow exercises different parts of the architecture and highlights why the system separates metadata, editing, and presence into distinct services.

Document creation flow

When a user creates a new document, the request first enters the system through the API Gateway, which authenticates the user and validates the request. Because document creation is a control-plane operation rather than a real-time collaboration task, it is handled synchronously.

After authentication, the request is routed to the Document Creation and Metadata Service. This service generates a globally unique document identifier, assigns ownership to the creating user, and initializes the document’s metadata, including sharing settings and timestamps. At this point, the document contains no user content—only metadata that defines its identity and access rules.

The metadata is persisted in a NoSQL metadata store, which allows the system to scale to billions of documents while maintaining low-latency access. Once the metadata write succeeds, the service returns the document ID to the client, which can immediately open the document and begin editing.

This flow is intentionally simple and strongly consistent, ensuring that document identity and ownership are correctly established before any collaboration begins.

Open document flow

When a user opens an existing document, the request again passes through the API Gateway, which verifies authentication and routes the request appropriately.

Before any document data is returned, the system consults the Access Control (ABAC) Service to confirm that the user has permission to view or edit the document. This step is critical, as Google Docs supports fine-grained sharing rules and must enforce them consistently across all entry points.

Once access is granted, the Metadata Service is queried to retrieve document-level information, including pointers to the latest snapshot and recent operation logs. Using this information, the Document Editing Service loads the most recent snapshot of the document and replays any subsequent operations to reconstruct the current document state.

At the same time, the client establishes a persistent WebSocket connection with the editing service. This connection allows the server to push real-time updates to the client without relying on inefficient polling.

From the user’s perspective, the document appears almost instantly, even while background synchronization continues. This is possible because the system prioritizes fast initial rendering and progressively applies updates as they arrive.

Real-time editing flow

Real-time editing is the most performance-critical flow in the system.

When a user types or modifies the document, the client immediately applies the change optimistically to the local document state. This ensures that typing feels instantaneous, even before the server acknowledges the edit.

The client then sends a small edit operation to the Document Editing Service over the existing WebSocket connection. Rather than transmitting the entire document, the operation describes the change in terms of inserts, deletes, or formatting actions.

On the server, the editing service receives the operation and applies Operational Transformation (OT) logic to resolve conflicts with concurrent edits from other users. The transformed operation is then appended to the operations log and periodically contributes to new document snapshots.

Once persisted, the transformed edit is broadcast to all other connected collaborators. Each client applies the update to its local document state, ensuring that all users eventually converge on the same content, even if edits arrive in different orders.

This flow favors availability and responsiveness over immediate global consistency. Temporary divergence between clients is acceptable, as long as the system guarantees eventual convergence.

Presence and cursor update flow

In parallel with content editing, clients continuously send lightweight presence signals to the Presence Service. These signals include information such as whether the user is currently active, their cursor position, and text selection ranges.

The presence service stores this data in an in-memory store with short TTLs, ensuring that stale presence information is automatically removed if a client disconnects or becomes inactive.

Presence updates are broadcast to other collaborators viewing the same document, allowing them to see real-time indicators such as active cursors and collaborator avatars. Because presence data is non-critical, the system allows occasional drops or inconsistencies without affecting document correctness.

This separation ensures that high-frequency presence updates do not interfere with the core editing pipeline.

Asset upload flow

When a user inserts an image or other large asset into a document, the upload bypasses the editing service entirely. Instead, the client uploads the asset directly to object storage, receiving a reference ID upon successful upload.

The client then sends a small metadata update or edit operation containing this reference ID to the Document Editing Service, which embeds it into the document structure.

This design prevents large binary payloads from slowing down real-time collaboration and allows asset delivery to scale independently using CDNs.
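The two-step upload can be sketched as follows. `ObjectStore` is an in-memory stand-in for real object storage, and deriving the reference ID from a content hash is an assumption; the `asset_ref` field name is hypothetical.

```python
import hashlib


class ObjectStore:
    """In-memory stand-in for an object storage service."""

    def __init__(self) -> None:
        self._blobs = {}

    def put(self, payload: bytes) -> str:
        # Assumption: the reference ID is a hash of the content.
        asset_id = hashlib.sha256(payload).hexdigest()
        self._blobs[asset_id] = payload
        return asset_id

    def get(self, asset_id: str) -> bytes:
        return self._blobs[asset_id]


def build_image_op(doc_id: str, index: int, asset_id: str) -> dict:
    # The edit operation carries only a reference, never the image bytes.
    return {
        "doc_id": doc_id,
        "type": "INSERT",
        "data": {"index": index, "asset_ref": asset_id},
    }


store = ObjectStore()
image = b"\x89PNG" + bytes(1_000_000)      # roughly 1 MB of fake image data
asset_id = store.put(image)                # step 1: upload directly to storage
op = build_image_op("doc_123", 25, asset_id)  # step 2: send a tiny edit op
print(len(image), "bytes uploaded;", len(str(op)), "characters in the edit op")
```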

Failure & reconnection flow

If a client temporarily disconnects due to network issues, the local document state is preserved. Upon reconnection, the client re-authenticates through the API Gateway and reconnects to the Document Editing Service.

Using the document version and last-seen operation ID, the server sends any missed operations, allowing the client to catch up and rejoin the collaboration session seamlessly.

This ensures that short-lived failures do not disrupt the user experience or require manual recovery.

Step 3: Deep dives & trade-offs

Once the high-level architecture and request flows are established, it’s important to examine the most critical design decisions in more depth. Google Docs is a system where correctness is enforced not by strict locking or strong consistency, but by carefully balancing latency, availability, and convergence guarantees under heavy concurrency.

This section explores the key architectural choices and the trade-offs behind them.

Why operational transformation (OT)?

The core challenge in Google Docs is allowing multiple users to edit the same document at the same time without corrupting the document state. Naively locking the document would eliminate conflicts but would completely break the real-time collaboration experience.

Operational Transformation (OT) solves this by allowing users to edit optimistically, without waiting for locks or global coordination. Each edit is represented as a small operation—such as insert or delete—that can be transformed relative to other concurrent operations.

When two users edit the same region of the document, the OT algorithm adjusts the position and intent of each operation so that all clients eventually converge to the same final document state, regardless of the order in which edits arrive.
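To illustrate the position adjustment, here is a deliberately small transform for single-character inserts and deletes between two clients. It is a teaching sketch, not the algorithm Google Docs uses: the `site` tie-breaker and all names are assumptions, and real rich-text OT handles far more cases.

```python
from dataclasses import dataclass, replace
from typing import Optional, Union


@dataclass(frozen=True)
class Insert:
    pos: int
    char: str
    site: int  # assumed unique per client; breaks ties at equal positions


@dataclass(frozen=True)
class Delete:
    pos: int


Op = Union[Insert, Delete]


def apply(doc: str, op: Optional[Op]) -> str:
    if op is None:  # the operation was cancelled by the transform
        return doc
    if isinstance(op, Insert):
        return doc[:op.pos] + op.char + doc[op.pos:]
    return doc[:op.pos] + doc[op.pos + 1:]


def transform(op: Op, against: Op) -> Optional[Op]:
    """Rewrite `op` so it can be applied after the concurrent `against`."""
    if isinstance(op, Insert) and isinstance(against, Insert):
        stays = op.pos < against.pos or (
            op.pos == against.pos and op.site < against.site)
        return op if stays else replace(op, pos=op.pos + 1)
    if isinstance(op, Insert):       # against is a Delete
        return op if op.pos <= against.pos else replace(op, pos=op.pos - 1)
    if isinstance(against, Insert):  # op is a Delete
        return op if op.pos < against.pos else replace(op, pos=op.pos + 1)
    if op.pos == against.pos:        # both clients deleted the same character
        return None
    return op if op.pos < against.pos else replace(op, pos=op.pos - 1)


def converges(doc: str, a: Op, b: Op) -> bool:
    # Both application orders must produce the same document.
    via_a = apply(apply(doc, a), transform(b, a))
    via_b = apply(apply(doc, b), transform(a, b))
    return via_a == via_b


a = Insert(pos=1, char="X", site=1)
b = Insert(pos=1, char="Y", site=2)
print(apply(apply("abc", a), transform(b, a)))  # aXYbc
print(apply(apply("abc", b), transform(a, b)))  # aXYbc
```

Both orders yield the same text, which is the convergence property the paragraph above describes.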

The primary advantage of OT is low latency. Users see their edits immediately, and collaboration feels natural and fluid. The trade-off is implementation complexity. OT logic is notoriously difficult to get right, especially when supporting rich text, formatting, and embedded objects. However, for a latency-sensitive system like Google Docs, this complexity is justified by the user experience benefits.

Why optimistic client updates?

A defining characteristic of Google Docs is that typing feels instantaneous. This is achieved through optimistic updates on the client.

Instead of waiting for the server to validate or acknowledge an edit, the client immediately applies the change locally. The server later confirms or transforms the operation as needed. If the transformed operation differs, the client reconciles the difference transparently.

The trade-off here is temporary inconsistency. For brief moments, different clients may have slightly different views of the document. However, because edits converge quickly and discrepancies are rarely noticeable, this trade-off is acceptable and necessary for a high-quality user experience.

Why separate editing from presence?

Editing operations affect document correctness and must be persisted durably. Presence signals—such as cursor positions or “user is typing” indicators—do not.

By separating these concerns into different services, the system ensures that high-frequency, non-critical updates do not interfere with core editing workflows. Presence data can be dropped, delayed, or overwritten without impacting document integrity.

This separation allows each subsystem to scale independently and simplifies failure handling. If the presence service becomes temporarily unavailable, users can still edit documents without disruption.

Why NoSQL for metadata & operations?

Google Docs stores two major types of durable data: document metadata and edit operations.

The two datasets have different access patterns:

  • Edit operations are write-heavy and append-only
  • Document metadata is read-heavy and accessed by key (document ID or owner)
  • Both are naturally shardable by document ID

NoSQL databases are well-suited to both patterns. They provide horizontal scalability and high throughput without the overhead of complex relational schemas.

The trade-off is weaker transactional guarantees. However, because correctness is enforced at the application layer through OT and versioning, strict database-level transactions are not required.

Why periodic snapshots?

Storing only an ever-growing log of operations would make document recovery increasingly expensive. To avoid replaying thousands of operations every time a document is opened, the system periodically writes full document snapshots.

Snapshots act as checkpoints. When reconstructing document state, the system loads the latest snapshot and replays only the operations that occurred after it. This significantly reduces recovery time and improves document open latency.

The trade-off is additional storage and write overhead, which is acceptable given the performance benefits.

Step 4: Bottlenecks & scaling challenges

As Google Docs scales to millions of concurrent editors, certain bottlenecks naturally emerge.

Hot documents

Documents with many simultaneous editors can overload a single editing service instance.

To mitigate this, documents are partitioned by document ID, and editing sessions are pinned to specific servers. Load is distributed horizontally across the fleet, and safeguards limit the number of concurrent editors per document if necessary.
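One way to pin a document to a server is rendezvous (highest-random-weight) hashing. The text above only says sessions are pinned by document ID, so this specific technique, the `pick_server` function, and the server names are assumptions.

```python
import hashlib


def pick_server(doc_id: str, servers: list) -> str:
    """Return the editing server that owns `doc_id` (rendezvous hashing)."""

    def score(server: str) -> int:
        digest = hashlib.sha256(f"{server}:{doc_id}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    # Every node computes the same scores, so all of them agree on the owner.
    return max(servers, key=score)


servers = ["edit-1", "edit-2", "edit-3", "edit-4"]
owner = pick_server("doc_123", servers)
print("doc_123 is pinned to", owner)
```

With this scheme, removing one server only moves the documents that server owned; every other document keeps its existing session host.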

In extreme cases, the system may temporarily degrade non-essential features, such as presence updates, to preserve core editing functionality.

High edit throughput

Active documents can generate thousands of operations per second. Without control, this can overwhelm storage systems and downstream consumers.

The system mitigates this by batching operations, applying backpressure, and prioritizing user-facing responsiveness over immediate persistence guarantees.
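Batching can be sketched as a small buffer that flushes to storage once it reaches a size limit. `OpBatcher`, the `sink` callback, and the default batch size are hypothetical; a production version would likely also flush on a timer.

```python
from typing import Callable


class OpBatcher:
    """Buffers operations and writes them to `sink` in groups."""

    def __init__(self, sink: Callable[[list], None], max_batch_size: int = 100) -> None:
        self._sink = sink
        self._max = max_batch_size
        self._buffer = []

    def add(self, op: dict) -> None:
        self._buffer.append(op)
        if len(self._buffer) >= self._max:
            self.flush()

    def flush(self) -> None:
        if not self._buffer:
            return
        # Swap the buffer out before writing so new ops start a fresh batch.
        batch, self._buffer = self._buffer, []
        self._sink(batch)


writes = []
batcher = OpBatcher(sink=writes.append, max_batch_size=3)
for i in range(7):
    batcher.add({"version": i + 1})
batcher.flush()
print([len(batch) for batch in writes])  # [3, 3, 1]
```

Seven operations become three storage writes. The cost is that operations still sitting in the buffer are not yet durable, which is the persistence trade-off mentioned above.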

Global latency

Users collaborate across regions, and routing all edits through a single data center would introduce unacceptable latency.

To address this, documents are typically anchored to a primary region, with clients connecting to the nearest data center. Cross-region replication occurs asynchronously, ensuring durability without slowing down real-time collaboration.

Step 5: Review & summary

The Google Docs architecture is intentionally designed to favor responsiveness and availability over strict consistency, while still guaranteeing eventual correctness.

By combining optimistic client updates, centralized transformation logic, durable operation logs, and in-memory collaboration signals, the system enables seamless real-time editing at global scale.

This design demonstrates how carefully chosen trade-offs—rather than rigid adherence to strong consistency—can produce systems that feel fast, intuitive, and reliable to users.