Tuesday, September 20, 2016

How common is Cloaking? (Showing different content to search crawlers than to users)

Google often points out that sites should not show significantly different content to users and to search crawlers, and threatens penalties for doing so. Cloaking can hurt ranking accuracy, and Google also needs to see changes in files to update its index (the latter is not mentioned by Google, but is a likely motivation).

An easy way to see if cloaking happens is to compare variations of a page downloaded with different user agents (curl, Googlebot, Firefox, ...) and then compare the md5 hashes of the versions. If a site sends different data only to crawlers (and not based on other user agents, IE vs. Firefox for example), this indicates that the site might be cloaking.
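A minimal sketch of that check for a single url (the user agent strings are just illustrative examples):

#!/bin/bash
# Fetch the same page once as a regular browser and once as Googlebot,
# then compare the md5 hashes of the two downloads.
url="$1"
ua_browser="Mozilla/5.0 (Windows NT 10.0; rv:45.0) Gecko/20100101 Firefox/45.0"
ua_bot="Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
hash_browser=$(curl -s -A "$ua_browser" "$url" | md5sum | cut -d' ' -f1)
hash_bot=$(curl -s -A "$ua_bot" "$url" | md5sum | cut -d' ' -f1)
if [[ "$hash_browser" == "$hash_bot" ]]; then
echo -e "$url\tsame content for browser and bot"
else
echo -e "$url\tdifferent content - possibly cloaking (or just a dynamic page)"
fi

Keep in mind that timestamps, session ids or rotating ads in the html also change the hash, so a difference alone is not proof of cloaking.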

How do some categories of sites fare in this?


  1. Spamlist 2 (as explained previously) is basically a list of 2336 blogs on different large-scale blog platforms (Blogger, Tumblr, WordPress, Blogspot) that have many attributes that might be indicators for spam.
  2. The earlier spam list (spam identified via industry and competitor-related inbound links) shows similar results.
  3. The Alexa* lists use the top 1000 urls, middle 1000 urls and last 1000 urls of the Alexa top 1 million list.


The results sorted by percentage of 'cloaking':


And the table of results, again sorted by 'cloaking' percentage (the other combinations make up the difference to 100%):


Discussion

It seems one group of spammers differentiates a lot by user agent, but they still don't have many setups where only bots see different content. Quite interesting, too, that one spam group and the top Alexa sites are more likely to cloak than the other sites. (Again - this only considers one factor, how the homepage is delivered, with all the resulting vagueness.)

How to replicate

First - this approach is severely limited, as it only analyses the homepage, and only the core html, not the elements that are loaded with the page (images, scripts, etc.).

First, generate the lists of spam-like urls / domains. All lists are checked for a 200 status with several options: the url as is, then as https, then with www, then www with https, plus one test that tries to work around crawler detection. The resulting lists of unique 200 OK urls are used in the next steps.
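A minimal sketch of the status check over the url variants (the extra crawler-detection workaround is left out, and the file names are just examples):

#!/bin/bash
# For each domain in the input file, try the plain, https, www and https+www
# variants and keep the first one that answers with 200 OK.
while read -r domain; do
for variant in "http://$domain" "https://$domain" "http://www.$domain" "https://www.$domain"; do
status=$(curl -s -o /dev/null --max-time 10 --write-out '%{http_code}' "$variant")
if [[ "$status" == "200" ]]; then
echo "$variant" >> urls-200.txt
break
fi
done
done < "$1"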

Download the homepage (just the html part) with different user agents (Googlebot, Bingbot, Firefox, IE, Chrome, and whatever else pleases your heart). For each download, build the md5 hash of the file and store it in a table.
With awk we can quickly compare whether a hash is the same or different by user agent, which then just needs to be summarized.
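A sketch of the awk step, assuming the hashes were stored in a tab-separated file hashes.tsv with the columns url, user agent and md5 hash (the file name and layout are my assumption):

# Count the distinct hashes per url: 1 means every user agent got identical
# content, more than 1 means the content differs by user agent.
awk -F'\t' '!seen[$1 "\t" $3]++ { distinct[$1]++ } END { for (url in distinct) print url "\t" distinct[url] }' hashes.tsv | sort -t$'\t' -k2 -nr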

Tuesday, August 16, 2016

Are spam sites using a variate testing tool (mbox) more than other sites?

Mboxes are such an interesting topic, and the spam list I used last time was good, but I was not sure how representative it was. So my next two lists of spam domains were generated using Moz's Open Site Explorer. This tool has a feature that categorizes domains linking to a site by up to 18 parameters indicative of spam.

Spam domains linking to large sites using mbox

I used a few profiles (root domains like dell.com) from our industry first, and pulled all inbound linking subdomains with the highest spam scores. For most sites, I had to use scores of 8 and greater (rather than 18, 15 or 12), because the link profiles were actually pretty clean for all the large sites checked.

Out of a total of 4544 linking 'spam' domains I found 196 mbox implementations on the homepage, 4.3%. This is completely in line with the first set of spam domains from last time (4.4%).

Spam domains linking to blogging platforms using mbox

Now the second set of spam domains comes from the same tool, but this time I looked at the link profiles of the root domains blogger.com, blogspot.com, tumblr.com and wordpress.com, in the hope of catching some bad linkspam pointing at some of the bloggers. But I was surprised how relatively clean these were as well - not as good as the link profiles of the large sites, though. Because of the large number of overall inbound links the samples are much larger, and I could focus on domains with a spam score of 13 and higher.

Out of a total of 2336 domains I found 99 mbox implementations on the homepage, 4.2%. This is completely in line with the first set of spam domains from last time (4.4%) and the other spam list.

(At first I ran this with a 5 sec timeout and discovered only 87 sites with an mbox. Getting all files within 5 sec would be nice, but is perhaps not realistic, so with a timeout threshold of 30 sec more mboxes showed up.)

All in all this shows that mboxes (for variate testing) are used on all kinds of sites and are not an indicator of a spam-like site - but not an indicator of the opposite either.

Wednesday, July 6, 2016

Do spammers use mbox A/B testing or multivariate testing more than other sites?

A/B testing, multivariate testing and SEO

Many companies use various products for A/B and/or multivariate testing, perhaps even for personalization. If testing is ok, would a spammer not use a variant testing tool to cloak content for Google?

SEOs know that bots or crawlers should not be served different content than users, especially when this is done based on cookies or user agent ('cloaking'). My understanding of Google's position on testing is that it is good for sites and for usability, and as long as it is limited in scope and run-time, it 'should' be ok. That also means that if it runs too long, changes too much, or affects too many pages, it is not ok - and perhaps not even short term and small in scope.

Can using a testing tool hurt our rankings in Google? 

For this research, I analysed sites using a specific testing tool that adds elements in an 'mbox' on the page; it is one of the larger tools capable of large-scale implementations. If a larger percentage of spammers used the tool, it could indicate that variate testing tools might be used for cloaking (assuming spammers measure impact and adjust; other tools are excluded for now).

Spammers vs other sites: use of variate testing tools


  • A full 30% of the top 1000 list (with 200 status) have an mbox on their homepage
  • Only 7% of the last 1000 from the Alexa 1 million do
  • The spammer list showed 24 sites with an mbox on the homepage out of a total of 528 domains, about 4.4% of the suspect spam list.

How to use this result

Even if spammers use mboxes, this does NOT indicate that the tool is used for spam, for several reasons! Sites on the list might not be spam sites, sites might use the testing tool for legitimate reasons rather than for spamming, or might not use it at all even though they have an mbox element on their site, for example from self-made JS. Lastly, if the tool were a good tool for spammers to use, the usage of mboxes in these lists would likely be higher than average, but it is significantly lower.

The resulting list is still interesting as a selection of sites that deserve more scrutiny - a manual deep dive to learn about the various uses of the mbox tool for A/B or multivariate testing. 

Process - how to replicate this test

First I pulled the Alexa 1 million list and split out the top 1000 sites, then the last 1000. Then I looked for a downloadable list of spam domains, as I could not find a list of sites known for cloaking, and this one looked pretty good. It is just the list of hosts they consider spam for their own site, but as a first test that's good enough for now.
Then I downloaded all elements of the homepage (spanning hosts for scripts from other subdomains and similar) and checked with a small script whether an mbox was integrated in any of the files downloaded with the homepage. To calculate the percentage of mbox sites for each group I discounted the sites not delivering a 200 OK.
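A minimal sketch of these steps; the Alexa csv layout (rank,domain), the file names and the simple 'mbox' grep are assumptions on my side:

#!/bin/bash
# Split the Alexa top 1 million csv (rank,domain) into the top and last 1000 domains.
head -n 1000 top-1m.csv | cut -d, -f2 > alexa-top-1000.txt
tail -n 1000 top-1m.csv | cut -d, -f2 > alexa-last-1000.txt

# For each domain: download the homepage with its page requisites (scripts etc.,
# also from other hosts), then grep all downloaded files for 'mbox'.
while read -r domain; do
dir="pages/$domain"
mkdir -p "$dir"
wget -q --timeout=30 --page-requisites --span-hosts --directory-prefix="$dir" "http://$domain/"
if grep -rq "mbox" "$dir"; then
echo "$domain" >> mbox-sites.txt
fi
done < alexa-top-1000.txt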

If you have a better spam domain list or even domains known for cloaking, please share. 

Thursday, May 12, 2016

Dell wins key award for their SEO implementation

seoClarity Synergy award for Dell

The company behind one of the largest, truly global Enterprise SEO platforms (seoClarity) just awarded Dell the 'Synergy' award - out of more than 3000 brands (!) they cover with their platform.

This highlights the exemplary integration of all teams to optimize the experience for users coming from Search Engines like Google to Dell.com and beyond.

"Dell has demonstrated an award-worthy level of synergy across the product management, development, design, ux, brand and content aspects of the Dell Support Group (DSG)."

Original SeoClarity Press release: http://bit.ly/1WqEL9O

Wednesday, April 6, 2016

Google site ownership verification via DNS


Many (or perhaps all?) SEOs / webmasters have their sites verified in Google Search Console (and other tools). There are currently several options to verify ownership of a site, usually a domain or a subdomain. Many people use a file that is uploaded into the root folder of their site, or a meta tag on the homepage.

If you have access to the DNS, adding a TXT entry is a great option. We actually use several methods to verify the same domains, for example meta tag + DNS, or verification file + DNS, to add some extra protection in case one breaks.

This is the official (new) guideline from Google, but it is also explained in GSC. There's a second page with the details for the 4 different DNS entry variations that can be used - up to now I had only seen the first two.
The simplest one adds just a TXT entry whose content starts with "google-site-verification=" followed by a token specific to the domain.

We have lots of domains and a dedicated DNS team, so I needed to test whether they implemented it correctly - and as usual, with a quick script.
The script to test the DNS TXT entry for Google verification is below, and the output works - shown first with domains that have this entry, and below with domains that don't.


The same, done with a few domains where I don't have the DNS entry:



# Check each domain from the input file for a Google site verification TXT record.
# Usage: pass a file with one domain per line as the first argument.
while read -r domain; do
dig -t TXT "$domain" | grep "TXT" | grep "google-site-verification"
done < "$1"
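To run it, save the loop as a script and feed it a file with one domain per line (the script name is just an example):

bash check-google-txt.sh domains.txt
# a domain with the record prints its TXT answer line containing "google-site-verification=...",
# a domain without it prints nothing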

Tuesday, February 16, 2016

Site search box in SERP


Many of us site managers and tech SEOs will have thought about using the sitesearch link box in Google results. When Google displays expanded results with sub-results, it offers to integrate a sitesearch box. Posts on this are rare and have little info on its impact. What I have seen so far indicates that there's no significant impact on natural search traffic. Information about on-site conversion (a closer match should increase on-site conversion) is not published to my knowledge, and would be hard to separate from overall search.

There are three options to deal with this Google offer:
1. ignore, do nothing
2. block the searchbox actively
3. integrate schema into the homepage of the site to 'invite' Google to add the box.

Which route to go? A first step, as often, is to look at how common it is to use one of these options. Working on one of the largest websites, I compare here with the top 10,000 sites by traffic as estimated by Alexa (top million sites report).

Now showing:

  1. how many block, use schema, do nothing,
  2. average rank of blocked sites compared to sites with schema
  3. what do the top sites do (surprise)
  4. top sites blocked by rank
  5. top sites with schema by rank 
  6. bash script to test


1. How many block, use schema, do nothing



2. Average rank of blocked sites compared to sites with schema


3. What do the top sites do (surprise)

4. Top sites blocked by rank

5. Top sites with schema by rank 

6. Bash script to test


# Check each site from the input file for sitelinks searchbox handling:
# "blocked" - the page contains "nosite" (the nositelinkssearchbox meta tag)
# "schema"  - the page contains SearchAction schema markup
# Usage: pass a file with one domain per line as the first argument.
while read -r line; do
# prefix www. if missing, since the list contains bare domains
if [[ $line != www* ]]; then
line="www.${line}"
fi
output=$(curl -s "${line}" )
nobox=$(echo "$output" | grep "nosite")
boxschema=$( echo "$output" | grep "SearchAction")

if [[ $nobox != "" ]] ; then
nobox="blocked"
fi

if [[ $boxschema != "" ]] ; then
boxschema="schema"
fi

echo -e "$line \t $nobox \t $boxschema" | tee -a nobox-or-box.txt

done < "$1"
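To run it against one of the lists (the script name is just an example):

bash searchbox-check.sh alexa-top-10000.txt
# prints one tab-separated line per site, flagged "blocked" and/or "schema",
# and appends the same lines to nobox-or-box.txt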


Thursday, February 4, 2016

Canonical checker

For search, one header tag can be very important: the rel=canonical link. We have many areas with duplicate pages, or at least very similar pages. That's bad news for ranking in external search engines, as the 'rank equity', or 'ranking love' as some would say, is diluted.
For many of these situations the rel canonical tag can be used. It indicates to search engines which of the many urls for the same content should rank in search results. It is one of the well understood and well working elements in SEO.

Unfortunately, some webpages don't make full use of this. Most tools, internal and external, show if a canonical is on a page, but they don't check anything else, particularly not whether the canonical is self-referring, i.e. points to the url of the page on which it is implemented.
These self-referring canonicals are not bad, they are even recommended. But if we have 3 urls for the same content (think customer segments, for example) that all have a self-referring canonical, we need to change this so that one keeps the self-referring canonical and the other two point to that url instead.


Knowledge is the parent of action, or so they say, right?


Here is a little tool to check in your area; it works under Linux and also on Cygwin. Suggestions for improvement are welcome as always.

#!/bin/bash
# Canonical checker: for each url in the input file, fetch the page and report
# whether its rel=canonical points back to the url itself.
echo -e "\033[91m if a url in the feeding file has NO http or https, http is assumed and set \033[0m"
if ! [[ -f $1 ]]; then
echo "need a file with urls starting with http or https"
exit 1
fi

echo -e "URL tested\tcanonical status"
while read -r line; do
if [[ ! $line == http* ]]; then
line="http://"$line
fi
# only parse pages that answer directly with 200; everything else is reported as redirected
response=$(curl -I -s --write-out %{http_code} "$line" -o /dev/null)
if [[ $response == 200 ]]; then
# pull out the canonical link tag and strip it down to the bare url
canonical=$( curl -s --max-time 3 "$line" | grep "canonical" | grep -o '<.*canonical.*>' | sed -e 's/<link //' -e 's/rel=//' -e 's/canonical//' -e 's/href=//' -e 's/ //g' -e 's/\x22//g' -e 's/\x27//g' -e 's/\/>$//' )
case $canonical in
*hreflang*)
# the matched tag also contains hreflang - the markup needs a manual look
canonical="coding error"
;;
"")
canonical="none"
;;
*/)
# drop a trailing slash so the comparison below does not trip over it
canonical="${canonical%/}"
;;
*)
canonical="${canonical}"
esac
if [[ $line == "$canonical" ]]; then
canonical="\033[32mmatch\033[0m"
fi
else canonical="redirected"
fi
echo -e "$line\t$canonical" | tee -a output.txt

done < "$1"
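To run it (the script name is just an example):

bash canonical-checker.sh urls.txt
# prints one line per url with "match", the differing canonical url, "none",
# "coding error" or "redirected", and appends the results to output.txt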
