
How would you implement the sync feature of Google Drive app or Google Docs? How would you design the DB for G-drive?

Asked at Google
3.3k views
Answers (3)

Gold PM
Step 1 - Understand and scope the question
I just want to reiterate the question to confirm my understanding. We need to implement the feature where Google Docs shows edits in real time and saves the document as soon as edits are made. The additional ask is to design a database that supports quick saving and retrieval of documents and other data from Google Drive. Am I correct in my understanding?
Interviewer - Yes. That sounds about correct.
 
I also assume this is for all customers, irrespective of whether they are enterprise or B2C customers.
Interviewer - That is correct.
 
Step 2 - Explain the answer
So, what I'll do is explain what happens from a UI perspective, and then go behind the scenes and try to explain how that gets built out.
Interviewer - Sounds good.
When we open a Google document that is shared with other people, we see at the top a list of the people who have the document open, and we can also see who is editing the document at that moment. If we make edits, the others can see them, and if others make edits, we see those edits appear. Likewise, when we save the document, the saved copy is visible to everyone along with the previous versions, in case someone wants to roll back the changes.
 
Step 3 - Describe the product attributes
I want to list down here some attributes that are important when designing this system. For the purposes of this question, let's assume the following attributes are important.
  1. Real time syncing of documents/edits
  2. Conflict resolution (when 2 people edit the same sentence in a G-Doc)
  3. Speed of retrieval of the documents
  4. Traceability of who made the changes
 
Step 4 - Select a goal
From a business point of view, it is very important that the documents are saved and the end user can easily find the documents from their library.
 
Step 5 - Prioritize the product attributes
For the purposes of this question, I will prioritize the following.
  1. Real time sync capabilities with conflict resolution
  2. Speed of retrieval of the documents
 
Step 6 - Design the product
There are 2 parts to this, so let's tackle the real time sync capabilities first. 
Real-time sync has a lot in common with chat, so it makes sense to send the editing operations to a web socket server. This web socket server handles the real-time operations shown in the document when a user views or edits it. Since many edits will arrive at the same time, it makes sense to add a message queue (Redis can serve this role) so that the edits are queued and applied in sequence.
The web server then serves the request (such as editing a document) and uses a collaboration algorithm to resolve conflicts. This is probably the most important part. There are a few such algorithms; Google Docs uses one called Operational Transformation.
The web server then contacts the storage layer (see the second part of the answer for the database design). There will be 2 different stores: an object store (such as an S3 bucket) for the actual folder structure and files, and a metadata DB for the metadata related to the documents. In terms of a diagram, it will probably look something like this.
[Diagram not available]
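On the conflict-resolution step, here is a minimal sketch of the idea behind Operational Transformation, restricted to two concurrent text inserts. It is only an illustration under stated assumptions, not Google's implementation: the `Insert` type, the `site` tie-breaker, and the `transform` rule are hypothetical names chosen for this example, and a real system would also handle deletes, formatting, and more than two clients.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Insert:
    pos: int    # index in the document where the text is inserted
    text: str   # text being inserted
    site: int   # hypothetical client id, used only to break ties


def apply(doc: str, op: Insert) -> str:
    """Apply a single insert to a document string."""
    return doc[:op.pos] + op.text + doc[op.pos:]


def transform(op: Insert, other: Insert) -> Insert:
    """Rewrite `op` so it can be applied after `other` was already applied.

    If `other` inserted text before `op`'s position (or at the same
    position with a lower site id), `op` shifts right by that length.
    """
    if other.pos < op.pos or (other.pos == op.pos and other.site < op.site):
        return Insert(op.pos + len(other.text), op.text, op.site)
    return op


base = "Hello world"
a = Insert(5, ",", site=1)    # user A adds a comma after "Hello"
b = Insert(11, "!", site=2)   # user B appends "!" at the same time

# Each replica applies its own edit first, then the transformed remote edit.
replica_a = apply(apply(base, a), transform(b, a))
replica_b = apply(apply(base, b), transform(a, b))
print(replica_a)  # Hello, world!
print(replica_b)  # Hello, world!
```

Both replicas end up with the same text even though they applied the edits in a different order, which is the property the collaboration algorithm needs to provide.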
Bronze PM

 

Understanding of the feature:

When a file on Google Drive changes, or the contents (or settings) of a Google Doc change, we want to reflect that to all consumers of the resource.

Let's take Google Docs as the main scenario and proceed. The same principle will apply to Drive too.

3 stakeholders to this problem:

  • User who is making changes (A)
  • Resource store (typically a database) (R)
  • User who is consuming changes (can be the same user who is also making changes) (B)

Goals of each stakeholder:

  • Users
    • Make changes and have them persist
      • Either by explicit instruction or auto-save
    • See changes of other users
      • Whenever they happen
    • Resolve conflicts if any
  • Resource store
    • Maintain a steady state copy
    • Maintain transient copies for collaboration
    • Make transient copy a steady copy once collaboration is complete

Flow:

  • A creates a file.
  • No steady copy exists, so R will create a transient copy and send it to A
  • A starts editing.
  • R writes edits to transient copy.
  • A shares document with B.
  • When B accesses the file, R sends the transient copy.
  • Now both A and B are editing.
  • R keeps writing edits to transient copy.
  • Both of them finish editing and close the document.
  • After some time (or other condition), R replaces steady copy with transient copy and deletes transient copies.
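The flow above can be sketched roughly as follows. This is a minimal sketch, assuming plain dictionaries stand in for the steady and transient stores; `ResourceStore` and its method names are hypothetical. For simplicity it promotes the transient copy as soon as the last user closes the document, whereas the flow allows for a delay or some other condition.

```python
class ResourceStore:
    """Toy version of R: one steady copy and one transient copy per document."""

    def __init__(self):
        self.steady = {}      # doc_id -> content (stand-in for the on-disk store)
        self.transient = {}   # doc_id -> content (stand-in for the in-memory store)
        self.open_by = {}     # doc_id -> set of users who have the document open

    def open(self, doc_id, user):
        # Create the transient copy on first open, seeded from the steady
        # copy if one exists (empty for a brand-new file).
        if doc_id not in self.transient:
            self.transient[doc_id] = self.steady.get(doc_id, "")
        self.open_by.setdefault(doc_id, set()).add(user)
        return self.transient[doc_id]

    def write(self, doc_id, content):
        # Edits always go to the transient copy, never straight to steady.
        self.transient[doc_id] = content

    def close(self, doc_id, user):
        users = self.open_by.get(doc_id, set())
        users.discard(user)
        # Once nobody has the document open, the transient copy replaces
        # the steady copy and is deleted.
        if not users and doc_id in self.transient:
            self.steady[doc_id] = self.transient.pop(doc_id)
            self.open_by.pop(doc_id, None)


store = ResourceStore()
store.open("doc1", "A")                  # A creates the file
store.write("doc1", "draft by A")
seen_by_b = store.open("doc1", "B")      # B receives the transient copy
store.write("doc1", "draft by A and B")
store.close("doc1", "A")
store.close("doc1", "B")                 # last close promotes the transient copy
```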

Architecture:

  • Steady state copies can be stored in a normal on-disk database. Write and read performance are important but not critical.
  • This database will have a dirty bit per file that indicates whether the file is currently being edited.
  • Transient DB can be an in-memory database offering high read and write performance.
  • Importantly we cannot stream the transient copy for every single change. It makes sense to send updates only.
    • Update should contain at the very least:
      • Coordinated time
      • Content
      • Offsets
      • Author
    • So we will need a system that
      • Receives updates and modifies transient copy
        • If conflict:
          • Send a message to original author and don’t update transient copy
        • else:
          • Sends updates to all users
    • Extra: Occasionally check if user’s copy is same as transient copy to verify if all updates are persisted properly.
  • Finally, when all users close the doc, we replace the steady copy with transient, turn off the dirty flag and delete transient copies.
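The update-handling system described above could look something like this. It is a sketch under assumptions, not a definitive design: the `Update` fields mirror the list above, but `base_version` is an added, hypothetical field, and the conflict rule used here (reject any update that was not based on the latest version) is the simplest possible stand-in for real conflict detection. The `outbox` list stands in for network sends.

```python
from dataclasses import dataclass


@dataclass
class Update:
    time: float         # coordinated timestamp of the edit
    author: str         # who made the edit
    start: int          # offsets: the range [start, end) being replaced
    end: int
    content: str        # replacement text
    base_version: int   # assumption: transient-copy version the author was editing


class UpdateBroker:
    def __init__(self, text=""):
        self.text = text      # the transient copy
        self.version = 0
        self.users = set()
        self.outbox = []      # (recipient, message) pairs instead of real sends

    def join(self, user):
        self.users.add(user)

    def submit(self, update):
        # Conflict: the author edited a stale version of the transient copy.
        # Notify only the author and leave the transient copy untouched.
        if update.base_version != self.version:
            self.outbox.append((update.author, "conflict"))
            return False
        # No conflict: modify the transient copy and broadcast to all users.
        self.text = self.text[:update.start] + update.content + self.text[update.end:]
        self.version += 1
        for user in sorted(self.users):
            self.outbox.append((user, f"v{self.version} by {update.author}"))
        return True


broker = UpdateBroker("Hello world")
broker.join("A")
broker.join("B")
ok_a = broker.submit(Update(1.0, "A", 0, 5, "Howdy", base_version=0))
ok_b = broker.submit(Update(1.1, "B", 0, 5, "Hi", base_version=0))  # stale, rejected
```

After these two calls the transient copy reads "Howdy world", both users have been told about A's update, and only B has received a conflict message.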

Additional challenges:

  • Scale:
    • If we have 100 users editing the same file
      • We may have to implement a message queuing system for updates
    • From a product PoV, I’d rather limit simultaneous opens to say 10 users.
  • Network connectivity:
    • An update may be received long after it was originally made.
      • Conflict resolution should ideally handle a late update the same way it handles a regular one

So to summarize, there will be 2 main components:

  • Transient DB system to ensure high performance of reads and writes
  • Update broker that resolves conflicts, modifies transient DB and broadcasts updates to users.

 

Silver PM

I noticed there aren't many Google technical questions answered here, so thought I'd take a stab at this. Would love feedback as I'm new to technical interview questions. 

Clarify:

  • By sync feature, do we mean the ability to keep document changes up to date, so that if two users are on the same document and one of them makes a change, that change is visible to the other? → Yes

  • Do we need to consider how much storage cost this will require? → Yes

Key features of Google Docs sync:

  • Create your own document

  • Save your document

  • Multiple people can edit the same document at once

    • Syncs are automatically merged and merge conflicts are handled

  • Additional features like comments, formatting, etc.

Qualities:

  • Simple editing features

  • Fast

  • Handle conflicts

  • Seamless

Back-of-envelope estimation of total data size and key bottlenecks:

  • Figure out total storage cost for Google Docs

    • # of Google Docs users * # of docs created per user per year * size of document * 5 years of storage

    • My rough math put total storage at around 250TB of data

  • Figure out # of Google Docs open at any time

    • Around 12M active documents per half hour

    • 2M of those have 4-5 users editing at once
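The arithmetic behind the estimate can be written out as below. The answer does not state its inputs, so the user count, documents per year, and document size here are placeholder assumptions picked only so that the formula lands on the 250TB figure; the concurrency estimate additionally assumes that the remaining 10M active documents have a single user each.

```python
# Placeholder inputs (assumptions, not figures from the answer).
users = 50_000_000
docs_per_user_per_year = 10
avg_doc_bytes = 100_000      # roughly 100 KB per document
years = 5

# users * docs per year * size * years of storage
total_bytes = users * docs_per_user_per_year * avg_doc_bytes * years
total_tb = total_bytes / 1e12
print(total_tb)  # 250.0

# Concurrent editors per half hour, from the figures quoted above.
active_docs = 12_000_000
shared_docs = 2_000_000                  # documents with 4-5 simultaneous editors
solo_docs = active_docs - shared_docs    # assumption: one user each
low = solo_docs + shared_docs * 4
high = solo_docs + shared_docs * 5
print(low, high)  # 18000000 20000000
```

Under these assumptions the sync layer would need to hold on the order of 18M to 20M open connections, which is a more likely bottleneck than raw storage.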

High level architecture:

  • When a document is opened, it is fetched from the database and put into a cache (or another store that is faster to read and edit than disk)

  • Architect a synchronization service that makes sure the client-side version of the document is not out of date with the most recent version

  • Since multiple clients can be editing at the same time, consider forcing all requests for a document through the same server so that there aren’t race conditions on the data itself

    • Proxy could figure out if requests are for the same document, then make sure they’re working off same cache

  • Create some type of queue so that changes are addressed in order

  • User session stored to figure out which user is making which edits

  • Maybe separate out services to have read and write functionality so you can scale both of those independently. Needs to be able to handle high read and write. 
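Two of the points above, routing all requests for a document to the same server and processing changes in order, can be sketched like this. It is a minimal sketch with hypothetical names (`server_for`, `DocumentWorker`), and edits are simplified to text appends. It uses plain hash-modulo routing; note that this remaps most documents whenever the server list changes, which is why consistent hashing is often used instead.

```python
import hashlib
from collections import defaultdict, deque


def server_for(doc_id, servers):
    """Pick the same server for the same document id every time."""
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    return servers[int.from_bytes(digest[:8], "big") % len(servers)]


class DocumentWorker:
    """Runs on one server: a cache plus one FIFO queue per document."""

    def __init__(self):
        self.cache = defaultdict(str)      # doc_id -> cached document text
        self.queues = defaultdict(deque)   # doc_id -> pending (user, text) edits

    def enqueue(self, doc_id, user, text):
        self.queues[doc_id].append((user, text))

    def drain(self, doc_id):
        # Apply queued edits strictly in arrival order.
        queue = self.queues[doc_id]
        while queue:
            _user, text = queue.popleft()
            self.cache[doc_id] += text
        return self.cache[doc_id]


servers = ["sync-1", "sync-2", "sync-3"]   # hypothetical server names
first = server_for("doc-42", servers)
second = server_for("doc-42", servers)     # same document, same server

worker = DocumentWorker()
worker.enqueue("doc-42", "alice", "Hello")
worker.enqueue("doc-42", "bob", ", world")
result = worker.drain("doc-42")
print(result)  # Hello, world
```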

DB tables:

  • User table

    • User id

  • Documents table

    • Document information

    • Last updated

  • Updates table

    • Document id

    • User doing the update

    • Update time

  • User permissions table

    • User id

    • Document ids
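One way the tables above could be written down, using SQLite through Python's standard `sqlite3` module. This is a sketch under assumptions: the column names and types, the `title` column standing in for "document information", the one-row-per-(user, document) permissions layout, and the index on the updates table are all illustrative choices, not part of the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (
    user_id INTEGER PRIMARY KEY
);
CREATE TABLE documents (
    document_id  INTEGER PRIMARY KEY,
    title        TEXT NOT NULL,          -- stand-in for "document information"
    last_updated TEXT NOT NULL           -- ISO 8601 timestamp
);
CREATE TABLE updates (
    update_id   INTEGER PRIMARY KEY,
    document_id INTEGER NOT NULL REFERENCES documents(document_id),
    user_id     INTEGER NOT NULL REFERENCES users(user_id),
    update_time TEXT NOT NULL
);
CREATE TABLE user_permissions (
    user_id     INTEGER NOT NULL REFERENCES users(user_id),
    document_id INTEGER NOT NULL REFERENCES documents(document_id),
    PRIMARY KEY (user_id, document_id)
);
-- Speeds up "who changed this document, and when" lookups.
CREATE INDEX idx_updates_doc_time ON updates(document_id, update_time);
""")

conn.executemany("INSERT INTO users VALUES (?)", [(1,), (2,)])
conn.execute("INSERT INTO documents VALUES (10, 'Roadmap', '2024-01-01T10:05:00Z')")
conn.executemany("INSERT INTO user_permissions VALUES (?, ?)", [(1, 10), (2, 10)])
conn.executemany(
    "INSERT INTO updates (document_id, user_id, update_time) VALUES (?, ?, ?)",
    [(10, 1, "2024-01-01T10:00:00Z"), (10, 2, "2024-01-01T10:05:00Z")],
)

# Edit history for one document, oldest first.
history = conn.execute(
    "SELECT user_id, update_time FROM updates "
    "WHERE document_id = ? ORDER BY update_time",
    (10,),
).fetchall()
print(history)
```

The updates table is what gives traceability of who changed what and when, while the permissions table answers whether a given user may open a given document.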

Ways to scale/optimize:

 

  • I already mentioned some of these in the high level architecture piece. 

  • Other thing to consider is how we store the documents across servers. Maybe use indexing to make it easier to find the document in question.
