
You’re part of the Google Search web spam team. How would you detect duplicate websites?

Asked at Google
8.3k views
Answers (2)

Platinum PM

 

 
  • First, let's clarify what a duplicate site is. As per my understanding, a duplicate site is one that has:

    1. Scraped content from other sites with little or no original content added (for example, a site with embedded videos, text, or images taken from other sites, republished content, or content copied from other sites with slight modifications).
    2. Too many links, or links that are part of link schemes.
  1. Other types of spam content: irrelevant keywords and sneaky redirects, i.e. cases where the visitor is redirected to a site with totally different content.
  2. Thus, duplicate sites are just one type of spam. Google would have a spam detection algorithm that evaluates web pages and assigns them a spam score. Certain logic would be in place to detect duplicate sites, which would affect the spam score. Whichever site has a spam score above a certain threshold would be given a lower rank via the PageRank algorithm.
  3. Logic to detect duplicates could include (see the sketch after this list):
    1. Content: if the content on a web page is mostly the same as another web page, then based on page creation date and site reputation the algorithm decides which of the two is the duplicate and increases that site's spam score.
    2. Links: if a site has only embedded videos, images, or media links to another website, or if a site has no content and only redirects to another website. For this we can measure the percentage of the content that is links and increase the spam score accordingly.
  4. Apart from this, there would be manual reviewers at Google who review web pages and can categorize a page as spam.
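
Here is a minimal sketch of how the content check in point 3 could work, using word shingles and Jaccard similarity and breaking ties with creation date and site reputation. The `Page` fields, thresholds, and helper names are illustrative assumptions, not Google's actual signals.

```python
# Sketch: content-based duplicate detection via word shingles + Jaccard similarity.
# Page metadata (creation_date, reputation) is assumed to be available from the crawl index.

from dataclasses import dataclass
from datetime import date

@dataclass
class Page:
    url: str
    text: str
    creation_date: date
    reputation: float  # assumed scale: 0.0 (unknown) to 1.0 (highly trusted)

def shingles(text: str, k: int = 5) -> set:
    """Break the page text into overlapping k-word shingles."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def similarity(a: Page, b: Page) -> float:
    """Jaccard similarity of the two pages' shingle sets (0.0 to 1.0)."""
    sa, sb = shingles(a.text), shingles(b.text)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def page_to_penalize(a: Page, b: Page, threshold: float = 0.8):
    """If two pages are near-identical, return the one whose spam score should rise."""
    if similarity(a, b) < threshold:
        return None  # not duplicates
    # Prefer the older page as the original; break ties with site reputation.
    original = min((a, b), key=lambda p: (p.creation_date, -p.reputation))
    return b if original is a else a
```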

 

1 like
Platinum PM

You can't answer this Google technical question until you define "duplicacy." In my mind, I come across two use cases:

1. First, sites that copy-paste content from other websites to make a new website.

2. Second, sites that try to get links from link farms to increase PageRank artificially.

 

So, let's take the first case.

As the PR algorithm works on keywords and relevance (links), having the correct keywords is important.

The PR algorithm builds an index of the keywords on a website.

To tackle this problem, Google should develop a spam-check algorithm that also assigns a spam score as it crawls the web.

This spam score should be a percentage from 0% to 100%.

In layman's terms, this tells how much of the website's content is original and how much is copied from other websites. Google can detect the copied part because it crawls the entire internet.

The higher the spam score, the more the page's PR should decrease. There can be upper and lower limits: let's say up to a 20% spam score is allowed, while more than 60% leads to Google not showing the result in the SERP.

This will also depend on the category: an e-commerce website will have a very low spam score, but a media and content website may have a large spam score even without actually copying anything. A sketch of this scoring follows.
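
As a rough illustration of how the 20% / 60% limits and the category-specific tolerance could be applied, here is a minimal sketch; the category table and the linear decay between the limits are my own assumptions, not real Google parameters.

```python
# Sketch: discount PageRank by spam score, with category-specific limits.

CATEGORY_LIMITS = {
    # (allowed_score, removal_score) -- content/media sites tolerate more reuse
    "ecommerce": (0.10, 0.50),
    "media":     (0.30, 0.70),
    "default":   (0.20, 0.60),
}

def adjusted_pagerank(pagerank: float, spam_score: float, category: str = "default") -> float:
    """Return the PageRank after applying the spam penalty.

    spam_score is the fraction of copied content (0.0 to 1.0). At or below the
    allowed limit there is no penalty; at or above the removal limit the page is
    dropped from results (rank 0); in between, the rank decays linearly.
    """
    allowed, removal = CATEGORY_LIMITS.get(category, CATEGORY_LIMITS["default"])
    if spam_score <= allowed:
        return pagerank
    if spam_score >= removal:
        return 0.0  # do not show in the SERP
    penalty = (spam_score - allowed) / (removal - allowed)
    return pagerank * (1.0 - penalty)
```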

 

Now moving to the second use case, Google has to be very clever about detecting links from link farms, because this is essentially gaming the PR algorithm.

Google should take the following steps:

1. Maintain a directory of link farms. Any website with even one incoming link from a link farm is blocked by Google.

2. Identify two-way links. If website A links to B and B also links back to A, and the two are unrelated, this may be a link farm. These cases have to be studied further.

So, essentially, any website taking a link farm's link should be banned by Google. A sketch of both checks follows.
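
Here is a minimal sketch of both checks, assuming we already have the link graph as simple dictionaries and a maintained set of known link-farm domains; the `related()` test for topical relatedness is a hypothetical helper.

```python
# Sketch: (1) block sites with any inbound link from a known link farm,
# (2) flag reciprocal (two-way) links between unrelated sites for review.

def should_block(site: str, inbound_links: dict, link_farms: set) -> bool:
    """Block a site if any of its incoming links comes from a known link farm."""
    return any(src in link_farms for src in inbound_links.get(site, set()))

def reciprocal_pairs(outbound_links: dict, related) -> list:
    """Return A<->B link pairs between unrelated sites, for manual review."""
    flagged = []
    for a, targets in outbound_links.items():
        for b in targets:
            # a < b keeps each unordered pair only once
            if a < b and a in outbound_links.get(b, set()) and not related(a, b):
                flagged.append((a, b))
    return flagged
```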

6 likes
1 Feedback

I like the use cases listed; I have a few suggestions on the answers:

  1. I went to the Spamdexing page on Wikipedia to learn more about web spamming, and it shows a few other cases that could be classified as duplicating content, such as mirror websites.
  2. Targeting a couple of scenarios is good. On the content-detection part, I didn't see how the algorithm could decide which site is original vs. copied. I think it would be good to add a few criteria for that, such as content creation time, site reputation, etc.
  3. On the link farm detection, I think the scoring system can also be used: whenever a two-way link is detected on a website, the score is incremented, and at a threshold of, say, 10 two-way links detected, the site could be blocked (see the sketch below).
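
A minimal sketch of that scoring, reusing the flagged reciprocal pairs from the sketch in the previous answer; the threshold of 10 comes from the suggestion above and is not a real product number.

```python
# Sketch: count two-way links per site and block once a threshold is crossed.

from collections import Counter

def sites_to_block(flagged_pairs, threshold: int = 10) -> set:
    """Block any site that participates in >= threshold two-way links."""
    counts = Counter()
    for a, b in flagged_pairs:
        counts[a] += 1
        counts[b] += 1
    return {site for site, n in counts.items() if n >= threshold}
```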