Tuesday, September 20, 2016

How common is Cloaking? (Showing different content to search crawlers than to users)

Google often points out that sites should not show significantly different content to users and to search crawlers, and threatens penalties for doing so. Cloaking might negatively impact ranking accuracy, and Google also needs to see differences in files to update its index (not mentioned by Google, but likely).

An easy way to check whether cloaking happens is to download variations of a page with different user agents (curl, Googlebot, Firefox, ...) and then compare the md5 hashes of the versions. If a site sends different data only to crawlers (and not depending on other user agents, IE or FF for example), this indicates that the site might be cloaking.
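
A minimal sketch of that quick check, assuming a single example URL and two illustrative user-agent strings (the URL, the agent strings, and the file names are my assumptions, not the exact setup used for this post):

    # Fetch the same page with a browser and a crawler user agent and
    # compare the md5 hashes. URL and agent strings are examples only.
    URL="https://example.com/"
    UA_BROWSER="Mozilla/5.0 (Windows NT 10.0; rv:49.0) Gecko/20100101 Firefox/49.0"
    UA_BOT="Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

    # md5sum is GNU coreutils; on BSD/macOS use "md5 -q" instead.
    HASH_BROWSER=$(curl -sL -A "$UA_BROWSER" "$URL" | md5sum | cut -d' ' -f1)
    HASH_BOT=$(curl -sL -A "$UA_BOT" "$URL" | md5sum | cut -d' ' -f1)

    if [ "$HASH_BROWSER" = "$HASH_BOT" ]; then
      echo "same content for browser and crawler"
    else
      echo "content differs by user agent - possible cloaking (or dynamic content)"
    fi

Note that dynamic pages (timestamps, session tokens, rotating ads) will also produce different hashes, which is part of the vagueness mentioned in the discussion below.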

How do different categories of sites fare in this respect?


  1. Spamlist 2 (as explained previously) is basically a list of 2336 blogs on different large-scale blog platforms like Blogger, Tumblr, Wordpress, Blogspot that have many attributes that might be indicators of spam.
  2. The earlier spam list shows similar results (spam identified via industry and competitor related inbound links).
  3. The Alexa* lists use the top 1000 URLs, middle 1000 URLs, and last 1000 URLs of the Alexa top 1 million list.


The results sorted by percentage of 'cloaking':


And the table of results, again sorted by 'cloaking' percentage (the other combinations account for the remainder up to 100%):


Discussion

It seems one group of spammers differentiates a lot by user agent, but they still don't have many setups where only bots see different content. It is also quite interesting that one spam group and the top Alexa sites are more likely to cloak than the other groups. (Again, this only considers one factor, how the homepage is served, with all the resulting vagueness.)

How to replicate

First, a caveat: this approach is severely limited, as it only analyses the homepage, and only the core HTML, not the elements that are loaded with the page (images, scripts, etc.).

First, generate the lists of spam-like URLs / domains. All lists are checked for a 200 status with several variants: the URL as is, then as https, then with www, then www with https, plus one test that tries to work around crawler detection. The resulting lists of unique 200 OK URLs are used in the next steps.
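
A rough sketch of such a status check, assuming one domain per line in a hypothetical domains.txt; the variant order and the output file name urls_200.txt are illustrative, and the crawler-detection workaround is only hinted at in a comment:

    # Try a domain with several URL variants and keep the first one
    # that answers 200 OK. File names are hypothetical.
    check_domain() {
      domain="$1"
      for url in "http://$domain" "https://$domain" \
                 "http://www.$domain" "https://www.$domain"; do
        status=$(curl -s -o /dev/null -w '%{http_code}' -L --max-time 10 "$url")
        if [ "$status" = "200" ]; then
          echo "$url"
          return 0
        fi
      done
      # A further attempt could retry with a browser-like user agent
      # (curl -A "...") to get past simple crawler detection.
      return 1
    }

    # domains.txt: one domain per line; urls_200.txt: unique 200 OK URLs.
    while read -r d; do check_domain "$d"; done < domains.txt | sort -u > urls_200.txt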

Download the homepage (just the HTML part) with different user agents (Googlebot, Bingbot, FF, IE, Chrome, and whatever else pleases your heart). For each download, build the md5 hash of the file and store it in a table.
With awk we can then quickly check whether the hash is the same or different per user agent, which just needs to be summarized.
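
A hedged sketch of these two steps, assuming the urls_200.txt file from the previous step and illustrative user-agent strings; the hash table is stored as tab-separated rows in a hypothetical hashes.tsv and summarized with awk:

    # For every URL, fetch the homepage with several user agents and
    # store url, agent, and md5 hash as one tab-separated row each.
    # Agent strings and file names are illustrative.
    while read -r url; do
      for agent in \
          "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" \
          "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" \
          "Mozilla/5.0 (Windows NT 10.0; rv:49.0) Gecko/20100101 Firefox/49.0"; do
        hash=$(curl -sL --max-time 10 -A "$agent" "$url" | md5sum | cut -d' ' -f1)
        printf '%s\t%s\t%s\n' "$url" "$agent" "$hash"
      done
    done < urls_200.txt > hashes.tsv

    # Summarize with awk: one distinct hash per URL means all agents got
    # the same content, more than one means the content differs by agent.
    awk -F'\t' '{ if (!seen[$1 FS $3]++) hashes[$1]++ }
                END { for (u in hashes) print (hashes[u] == 1 ? "same" : "differs"), u }' hashes.tsv

Counting how many URLs fall into each bucket per list then gives percentages like the ones in the table above.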