
You’re part of the Google Search web spam team. How would you detect duplicate websites?

Asked at Google
8.3k views
Answers (2)

Platinum PM

 

 
  • First, let's clarify what a duplicate site is. As per my understanding, a duplicate site is one that has:

    1. Scraped content from other sites with little or no original content added (for example, a site with embedded videos, text, or images taken from other sites, republished content, or content copied from other sites with slight modifications).
    2. Too many links, or links that are part of link schemes.
  1. Other types of spam content: irrelevant keywords and sneaky redirects, i.e. cases where the visitor is redirected to a site with totally different content.
  2. Thus, duplicate sites are just one type of spam. Google would have a spam detection algorithm that evaluates web pages and assigns them a spam score. Certain logic would be in place to detect duplicate sites, which would affect the spam score. Whichever site has a spam score above a certain threshold would be given a lower rank via the PageRank algorithm.
  3. Logic to detect duplicates could include (see the sketch after this list):
    1. Content: if the content on a web page is mostly the same as another web page, then based on page creation date and site reputation the algorithm decides which of the two is the duplicate and increases that site's spam score.
    2. Links: if a site has only embedded videos, images, or media links to another website, or if a site has no content and only redirects to another website. For this we can measure the percentage of the content that is links and increase the spam score accordingly.
  4. Apart from this, there would be manual reviewers at Google who review web pages and can categorize a page as spam.
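
Here is a minimal sketch of how the content check in point 3 could work, using word shingles and Jaccard similarity and breaking ties with creation date and site reputation. The `Page` fields, thresholds, and helper names are illustrative assumptions, not Google's actual signals.

```python
# Sketch: content-based duplicate detection via word shingles + Jaccard similarity.
# Page metadata (creation_date, reputation) is assumed to be available from the crawl index.

from dataclasses import dataclass
from datetime import date

@dataclass
class Page:
    url: str
    text: str
    creation_date: date
    reputation: float  # assumed scale: 0.0 (unknown) to 1.0 (highly trusted)

def shingles(text: str, k: int = 5) -> set:
    """Break the page text into overlapping k-word shingles."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def similarity(a: Page, b: Page) -> float:
    """Jaccard similarity of the two pages' shingle sets (0.0 to 1.0)."""
    sa, sb = shingles(a.text), shingles(b.text)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def page_to_penalize(a: Page, b: Page, threshold: float = 0.8):
    """If two pages are near-identical, return the one whose spam score should rise."""
    if similarity(a, b) < threshold:
        return None  # not duplicates
    # Prefer the older page as the original; break ties with site reputation.
    original = min((a, b), key=lambda p: (p.creation_date, -p.reputation))
    return b if original is a else a
```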

 

1 like
Platinum PM

You can't answer this Google technical question until you define "duplicacy." In my mind, I come across two use cases:

1. First, sites that copy-paste content from other websites to make a new website.

2. Second, sites that try to get links from link farms to increase PageRank artificially.

 

So, let's take the first case.

As the PR algorithm works on keywords and relevance (links), having the correct keywords is important.

The PR algorithm builds an index of the keywords on a website.

To tackle this problem, Google should develop a spam-check algorithm that also assigns a spam score as it crawls the web.

This spam score should be a percentage from 0% to 100%.

In layman's terms, this tells how much of the website's content is original and how much is copied from other websites. Google can detect the copied part because it crawls the entire internet.

The higher the spam score, the more the page's PR should decrease. There can be upper and lower limits: let's say up to a 20% spam score is allowed, while more than 60% leads to Google not showing the result in the SERP.

This will also depend on the category: an e-commerce website will have a very low spam score, but a media and content website may have a large spam score even without actually copying anything. A sketch of this scoring follows.
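
As a rough illustration of how the 20% / 60% limits and the category-specific tolerance could be applied, here is a minimal sketch; the category table and the linear decay between the limits are my own assumptions, not real Google parameters.

```python
# Sketch: discount PageRank by spam score, with category-specific limits.

CATEGORY_LIMITS = {
    # (allowed_score, removal_score) -- content/media sites tolerate more reuse
    "ecommerce": (0.10, 0.50),
    "media":     (0.30, 0.70),
    "default":   (0.20, 0.60),
}

def adjusted_pagerank(pagerank: float, spam_score: float, category: str = "default") -> float:
    """Return the PageRank after applying the spam penalty.

    spam_score is the fraction of copied content (0.0 to 1.0). At or below the
    allowed limit there is no penalty; at or above the removal limit the page is
    dropped from results (rank 0); in between, the rank decays linearly.
    """
    allowed, removal = CATEGORY_LIMITS.get(category, CATEGORY_LIMITS["default"])
    if spam_score <= allowed:
        return pagerank
    if spam_score >= removal:
        return 0.0  # do not show in the SERP
    penalty = (spam_score - allowed) / (removal - allowed)
    return pagerank * (1.0 - penalty)
```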

 

Now moving to the second use case, Google has to be very clever about detecting links from link farms, because this is essentially gaming the PR algorithm.

Google should take the following steps:

1. Maintain a directory of link farms. Any website with even one incoming link from a link farm is blocked by Google.

2. Identify two-way links. If website A links to B and B also links back to A, and the two are unrelated, this may be a link farm. These cases have to be studied further.

So, essentially, any website taking a link farm's link should be banned by Google. A sketch of both checks follows.
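
Here is a minimal sketch of both checks, assuming we already have the link graph as simple dictionaries and a maintained set of known link-farm domains; the `related()` test for topical relatedness is a hypothetical helper.

```python
# Sketch: (1) block sites with any inbound link from a known link farm,
# (2) flag reciprocal (two-way) links between unrelated sites for review.

def should_block(site: str, inbound_links: dict, link_farms: set) -> bool:
    """Block a site if any of its incoming links comes from a known link farm."""
    return any(src in link_farms for src in inbound_links.get(site, set()))

def reciprocal_pairs(outbound_links: dict, related) -> list:
    """Return A<->B link pairs between unrelated sites, for manual review."""
    flagged = []
    for a, targets in outbound_links.items():
        for b in targets:
            # a < b keeps each unordered pair only once
            if a < b and a in outbound_links.get(b, set()) and not related(a, b):
                flagged.append((a, b))
    return flagged
```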

6 likes
1 Feedback

I like the use cases listed; I have a few suggestions on the answers:

  1. I went to the Spamdexing page on Wikipedia to learn more about web spamming, and it shows a few other cases that could be classified as duplicating content, such as mirror websites.
  2. Targeting a couple of scenarios is good. On the content-detection part, I didn't see how the algorithm could decide which site is original vs. copied. I think it would be good to add a few criteria for that, such as content creation time, site reputation, etc.
  3. On the link farm detection, I think the scoring system can also be used: whenever a two-way link is detected on a website, the score is incremented, and at a threshold of, say, 10 two-way links detected, the site could be blocked (see the sketch below).
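
A minimal sketch of that scoring, reusing the flagged reciprocal pairs from the sketch in the previous answer; the threshold of 10 comes from the suggestion above and is not a real product number.

```python
# Sketch: count two-way links per site and block once a threshold is crossed.

from collections import Counter

def sites_to_block(flagged_pairs, threshold: int = 10) -> set:
    """Block any site that participates in >= threshold two-way links."""
    counts = Counter()
    for a, b in flagged_pairs:
        counts[a] += 1
        counts[b] += 1
    return {site for site, n in counts.items() if n >= threshold}
```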