
Tuesday, August 16, 2016

Are spam sites using variate testing tool (mbox) more than other sites?

Mboxes are such an interesting topic, and the spam list I used last time was good, but I was not sure how representative it was. So my next two lists of spam domains were generated with Moz's Open Site Explorer. This tool has a feature that categorizes domains linking to a site by up to 18 parameters indicative of spam.

Spam domains linking to large sites using mbox

I started with a few link profiles (root domains such as dell.com) from our industry and pulled all inbound linking subdomains with the highest spam scores. For most sites I had to include scores of 8 and greater (rather than only 18, 15, or 12), because the link profiles of all the large sites I checked were actually quite clean.

Out of a total of 4544 linking 'spam' domains, I found 196 mbox implementations on the homepage, or 4.3%. This is completely in line with the first set of spam domains from last time (4.4%).

Spam domains linking to blogging platforms using mbox

The second set of spam domains comes from the same tool, but this time I looked at the link profiles of the root domains blogger.com, blogspot.com, tumblr.com and wordpress.com, hoping to catch some bad link spam aimed at the bloggers. Again I was surprised how relatively clean these profiles were, though not as clean as the link domains of the large sites. Because the overall number of inbound links is so large, the samples were much bigger, and I could focus on domains with a spam score of 13 or higher.

Out of a total of 2336 domains, I found 99 mbox implementations on the homepage, or 4.2%. This is completely in line with the first set of spam domains from last time (4.4%) and with the other spam list.
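The two rates quoted above follow directly from the raw counts; a minimal sketch of the arithmetic (variable names are mine, not from the original scripts):

```python
# Reproducing the spam-domain mbox rates from the raw counts above.
large_site_rate = 196 / 4544 * 100   # spam domains linking to large sites
blog_rate = 99 / 2336 * 100          # spam domains linking to blog platforms

print(f"{large_site_rate:.1f}%")     # rounds to 4.3%
print(f"{blog_rate:.1f}%")           # rounds to 4.2%
```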

(I first ran this with a 5-second timeout and discovered only 87 sites with an mbox. A 5-second response for all files would be nice, but is perhaps not realistic, so with a timeout threshold of 30 seconds more mboxes showed up.)
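The original script is not published, but the timeout effect can be sketched with a hypothetical fetch helper: any host slower than the threshold is silently dropped from the sample, so a larger threshold surfaces more mbox sites.

```python
import urllib.request

def fetch_homepage(domain, timeout=30):
    """Fetch a domain's homepage; return the HTML text, or None on
    timeout or any connection error.

    With a 5-second timeout, slow hosts are silently skipped; raising
    the threshold to 30 seconds is what surfaced the extra mbox sites.
    """
    try:
        with urllib.request.urlopen(f"http://{domain}/", timeout=timeout) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except (OSError, ValueError):   # URLError, timeouts, bad URLs
        return None
```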

All in all, this shows that mboxes (for variate testing) are used on all kinds of sites: they are not an indicator of a spam-like site, but not an indicator of a clean site either.

Wednesday, July 6, 2016

Do spammers use mbox A/B testing or multivariate testing more than other sites?

A/B testing, multivariate testing and SEO

Many companies use various products for A/B and/or multivariate testing, perhaps even for personalization. If testing is OK, would a spammer not use a variant testing tool to cloak content for Google?

SEOs know that bots or crawlers should not be served different content than users, especially when this is done based on cookies or user agent ('cloaking'). My understanding of Google's position on testing is that it is good for sites and for usability, and as long as it is limited in scope and run time, it 'should' be OK. That also means that if a test runs too long, changes too much, or affects too many pages, it is not OK; perhaps not even short-term, small-scope tests are.

Can using a testing tool hurt our rankings in Google? 

For this research, I analysed sites using a specific testing tool that adds elements in an 'mbox' on the page; it is one of the larger tools capable of large-scale implementations. If a larger percentage of spammers used the tool, it could indicate that variate testing tools are being used for cloaking (assuming spammers measure impact and adjust; other tools are excluded for now).

Spammers vs other sites: use of variate testing tools


  • A full 30% of the top 1000 sites (with a 200 status) have an mbox on their homepage.
  • Only 7% of the last 1000 sites from the Alexa 1 million do.
  • The spammer list showed 24 sites with an mbox on the homepage out of a total of 528 domains, about 4.4% of the suspect spam list.

How to use this result

Even with spammers using mboxes, this does NOT indicate that the tool is used for spam, for several reasons: sites on the list might not be spam sites; sites might use the testing tool for legitimate purposes rather than spamming; or they might not use it at all even though an mbox element appears on their pages, for example via self-made JavaScript. Lastly, if the tool were a good tool for spammers to use, mbox usage among them would likely be higher than average, but it is significantly lower.

The resulting list is still interesting as a selection of sites that deserve more scrutiny - a manual deep dive to learn about the various uses of the mbox tool for A/B or multivariate testing. 

Process - how to replicate this test

First I pulled the Alexa 1 million list and split out the top 1000 sites and the last 1000. Then I looked for a downloadable list of spam domains, since I could not find a list of sites known for cloaking, and this one looked pretty good. It is just the list of hosts that site considers spam, but as a first test that is good enough for now.
Then I downloaded all elements of each homepage (spanning hosts, to include scripts from other subdomains and the like) and checked with a small script whether an mbox was integrated in any of the files downloaded with the homepage. To calculate the percentage of mbox sites in each group, I discounted the sites not delivering a 200 OK.
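The detection step described above could look roughly like the sketch below: fetch the homepage, pull out referenced script and stylesheet files, and search every downloaded file for the string "mbox". The function names and the regular expressions are my assumptions, not the author's actual script.

```python
import re
import urllib.request
from urllib.parse import urljoin

MBOX_RE = re.compile(r"mbox", re.IGNORECASE)
# Hypothetical pattern for JS/CSS asset references in the homepage HTML.
ASSET_RE = re.compile(r'(?:src|href)=["\']([^"\']+\.(?:js|css))["\']', re.IGNORECASE)

def _get(url, timeout=30):
    """Download a URL as text, or return None on any error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except (OSError, ValueError):
        return None

def has_mbox(domain, timeout=30):
    """True if 'mbox' appears in the homepage or any linked JS/CSS asset."""
    base = f"http://{domain}/"
    html = _get(base, timeout)
    if html is None:
        return False          # unreachable: excluded from the denominator
    if MBOX_RE.search(html):
        return True
    for ref in ASSET_RE.findall(html):
        asset = _get(urljoin(base, ref), timeout)
        if asset and MBOX_RE.search(asset):
            return True
    return False
```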

If you have a better spam domain list or even domains known for cloaking, please share. 

Thursday, June 4, 2015

Speed: Data on top 1000 header load time vs full load time

Lots of tools give different numbers for the speed of a site: how it feels for users, over different channels and providers, including rendering time or not, including many elements or not.

This is the 'hardest' test of all:

  • With a little script I checked the top 1000 pages from the Alexa list*. The first script tests how long it takes to get the HTTP header back (yes, the document exists; yes, the server knows where it is; and it is OK). These are the blue dots.
  • The second script downloads ALL page elements of the homepage: images, scripts, and stylesheets, including those from integrated third-party tools like Ensighten, Tealeaf, Omniture, or whatever a site uses. These are the orange dots.

First I ran this without a time limit, and when I checked back the next day it was still running, so I set the timeout to 20 seconds.
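The two measurements can be sketched in one function, assuming the header clock stops once the response headers arrive and the full clock stops after every referenced element has been downloaded. The helper name and the `src` regex are mine, not from the original scripts.

```python
import re
import time
import urllib.request
from urllib.parse import urljoin

# Hypothetical pattern for src-referenced elements (scripts, images);
# stylesheets would need separate <link href=...> handling.
SRC_RE = re.compile(r'src=["\']([^"\']+)["\']', re.IGNORECASE)

def timed_fetch(domain, timeout=20):
    """Return (header_seconds, full_seconds) for a homepage, mirroring
    the blue-dot / orange-dot measurements described above."""
    url = f"http://{domain}/"
    start = time.perf_counter()
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        header_time = time.perf_counter() - start      # headers received
        html = resp.read().decode("utf-8", errors="replace")
    for ref in SRC_RE.findall(html):                   # fetch each element
        try:
            with urllib.request.urlopen(urljoin(url, ref), timeout=timeout) as r:
                r.read()
        except (OSError, ValueError):
            pass                                       # skip broken references
    return header_time, time.perf_counter() - start
```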

There seems to be a clear connection between header response time and full load time, but not so much between rank in the top 1000 by traffic and speed.

sorted by full download time




sorted by traffic rank

There also seems to be NO clear connection between rank by traffic (x-axis) and full download time. This shows that we have a great opportunity to outperform many other companies with faster download speeds.



* Alexa (owned by Amazon) publishes the top 1 million websites by traffic, globally, based on data from many different browser plugins plus Amazon (cloud?) services.