You’re part of the Google Search web spam team. How would you detect duplicate websites?
You can't answer this Google technical question until you define "duplication." In my mind, there are two use cases:
1. The first is sites that copy and paste content from other websites to make a new website.
2. The second is sites that get links from link farms to increase their PageRank artificially.
So, let's take the first case.
Google's ranking (loosely called the PR algo here) works on keywords and relevance (links), so having the correct keywords is important.
As part of this, Google builds an index of the keywords on each website.
To tackle this problem, Google should develop a spam check algorithm that assigns a spam score to each website as it crawls the World Wide Web.
This spam score should be a percentage from 0% to 100%.
In layman's terms, it tells how much of the website's content is original and how much is copied from other websites. Google can identify the copied part because it crawls the whole internet.
The higher the spam score, the more the page's PR should decrease, by some factor. There can be upper and lower limits: let's say a spam score of up to 20% is allowed, and a score above 60% leads to Google not showing the result in the SERP.
These limits will also depend on the category. An e-commerce website will typically have a very low spam score, but a media or content website may have a high spam score even without actually copying anything.
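To make the spam score idea concrete, here is a minimal sketch, not Google's actual method. It assumes "copied" means sharing three-word sequences (shingles) with other crawled pages. The function names, the shingle size, and the linear penalty between the 20% and 60% limits are all assumptions made for illustration.

```python
import re

def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def spam_score(page_text, crawled_texts, k=3):
    """Percentage (0-100) of the page's shingles that also appear on other pages."""
    page = shingles(page_text, k)
    if not page:
        return 0.0
    seen_elsewhere = set()
    for other in crawled_texts:
        seen_elsewhere |= shingles(other, k)
    return 100.0 * len(page & seen_elsewhere) / len(page)

def rank_multiplier(score, allowed=20.0, cutoff=60.0):
    """Hypothetical penalty: 1.0 up to `allowed`, then a linear drop that
    reaches 0.0 at `cutoff`; above `cutoff` the page is hidden (0.0)."""
    if score <= allowed:
        return 1.0
    if score > cutoff:
        return 0.0
    return 1.0 - (score - allowed) / (cutoff - allowed)

page = "the quick brown fox jumps over the lazy dog"
score = spam_score(page, ["the quick brown fox sleeps"])
print(round(score, 1), rank_multiplier(score))
```

Per-category limits could be handled by passing different `allowed` and `cutoff` values for, say, e-commerce and media sites.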
Now, moving to the second use case: Google has to be very clever about detecting links from link farms, because this is essentially gaming the PR algo.
Google should take the following steps:
1. Maintain a directory of link farms. Any website with even one incoming link from a link farm is blocked by Google.
2. Identify two-way links. If website A links to B, B links back to A, and the two are unrelated, this may be a link farm. These cases have to be studied further.
So, essentially, any website taking a link from a link farm should be banned by Google.
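The two steps above can be sketched over a simple link graph. This is only an illustration: the `known_farms` directory, the `related` check, and the example domains are hypothetical, and how two sites are judged "related" is left to the caller.

```python
def flag_link_farm_candidates(links, known_farms, related):
    """links: set of (source, target) domain pairs.
    known_farms: set of domains in the link-farm directory.
    related: callable(a, b) -> bool, an assumed topical-relatedness check.
    Returns (blocked, to_review)."""
    # Step 1: any site with an incoming link from a known farm is blocked.
    blocked = {dst for src, dst in links if src in known_farms}

    # Step 2: reciprocal links between unrelated sites are queued for review.
    to_review = set()
    for src, dst in links:
        if src != dst and (dst, src) in links and not related(src, dst):
            to_review.add(frozenset((src, dst)))
    return blocked, to_review

links = {
    ("farm.example", "a.example"),
    ("a.example", "b.example"),
    ("b.example", "a.example"),
    ("c.example", "d.example"),
    ("d.example", "c.example"),
}
# Assume only c.example and d.example cover the same topic.
same_topic = lambda a, b: {a, b} == {"c.example", "d.example"}
blocked, to_review = flag_link_farm_candidates(links, {"farm.example"}, same_topic)
print(blocked, to_review)
```

Keeping the reciprocal-link pairs in a separate review set, rather than blocking them outright, matches the point that these cases need further study.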
- First, let's clarify what a duplicate site is. As per my understanding, a duplicate site is one that has:
  - Content scraped from other sites with little or no original content added. (Examples: sites consisting of embedded videos, text, or images from other sites; republished content from other sites; content copied from other sites with slight modifications.)
  - It can also include sites that have too many links, or links that are part of link schemes.
  - Other types of spam content: irrelevant keywords and sneaky redirects, i.e., cases where the visitor is redirected to a site with totally different content.
- Thus, duplicate sites are just one type of spam. Google would have a spam detection algo that evaluates web pages and assigns them a spam score. Certain logic would be in place to detect duplicate sites, and this would feed into the spam score. Any site with a high spam score (i.e., above a certain threshold) would be given a lower rank by the PageRank algo.
- Logic to detect duplicates could be:
  - Content: if the content on a web page is mostly the same as on another web page, then based on the page creation date and site reputation, the algo decides which of the two is the duplicate and increases the spam score for that site.
  - Links: if a site has only embedded videos, images, or media links to another website, or if a site has no content and only redirects to another website. For this, we can measure the percentage of the content that is links and increase the spam score based on it.
- Apart from this, there would be manual reviewers at Google who review web pages and can categorize a page as spam.
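As a rough sketch of the two detection rules above: the answer does not say how creation date and reputation are combined, so the rule below (the newer page is the copy, with lower reputation breaking ties) is an assumption, as are the `Page` fields and function names.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Page:
    url: str
    created: date
    reputation: float  # hypothetical site reputation, higher is better

def pick_duplicate(a, b):
    """Of two near-identical pages, return the one to treat as the copy.
    Assumed rule: the newer page is the copy; lower reputation breaks ties."""
    if a.created != b.created:
        return a if a.created > b.created else b
    return a if a.reputation < b.reputation else b

def link_fraction(text_words, link_count):
    """Share of a page's content items that are links or embeds (0.0 to 1.0)."""
    total = text_words + link_count
    return link_count / total if total else 0.0

original = Page("https://original.example/post", date(2020, 1, 1), 0.9)
copy = Page("https://copy.example/post", date(2021, 6, 1), 0.2)
print(pick_duplicate(original, copy).url)
print(link_fraction(text_words=90, link_count=10))
```

A page whose `link_fraction` is close to 1.0 is mostly links or embeds, which is the case the "Links" rule is meant to catch.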