andreas.wpv: 2015

Thursday, September 24, 2015

Google Weblight - using full content from sites on their domain

Like so many 'lite' things, this web-light version by Google has some serious side effects.

Some time ago, I think it was April, Google announced that for pages that load too slowly for some users, they would scrape the content off the owner's website and serve it from their own servers. The test was supposed to start in Indonesia and targeted 2G connections. Later, they started rolling this out for 'select countries' and slow mobile connections (2G).

The 'transcoding' is done on the fly, according to news reports, and speeds things up significantly; we have seen 10x faster loads.

Weblight - look and feel

This is how it looks: the original on the left, and the 'scraped' site on the right. Not bad, really. Users can load the original page from the top section, with a warning that it might be slow. That sounds like a very user-friendly approach to me. The navigation (nav icon in the upper left) is solid, and not missing anything in the top level.

The biggest flaw is that it removes pretty much all third-party elements, including our tracking elements, scripts, etc. Google claims to leave up to two ads on the page, and to allow simple tracking with Google Analytics.

Traffic on weblight

We (Dell.com) actually see some traffic with Googleweblight. It is minuscule, noticed only because we have been specifically looking for it, but there it is. We see some referral traffic in Adobe sitecatalyst, and we also see the 'scraping' of pages in our logfiles.

Most of the traffic for us seems to come from India, trying to reach a variety of pages in several countries instead of just India, which might explain some of the slowness.

The scraped or 'transcoded' pages can be found in logs with cs_User_Agent="*googleweblight*", and the pages that receive the referral traffic can be found with the referrer "googleweblight.com".
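A minimal sketch of how both signals could be counted with grep. It assumes a plain-text logfile with one request per line, where the transcoder's user agent contains "googleweblight" without ".com" and the referrer contains "googleweblight.com". The function names are made up for illustration, and the patterns will need adjusting to the real log format.

```shell
# Hypothetical helpers, assuming one request per line in a plain-text logfile.

# count fetches by the transcoder: user agent mentions googleweblight,
# but the line is not a referral from googleweblight.com
count_weblight_fetches() {
    grep -i 'googleweblight' "$1" | grep -vci 'googleweblight\.com'
}

# count visits that were referred from googleweblight.com
count_weblight_referrals() {
    grep -ci 'googleweblight\.com' "$1"
}
```

Usage would be something like count_weblight_fetches access.log, where access.log stands in for whatever file was exported from the log tool.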

How relevant are 2G networks for global ecommerce

So far, it seems super small, but the market potential is quite big. Global data on network coverage is a bit harder to come by, but there are some relevant sources.

Some excerpts from a McKinsey report:
"In developed countries and many developing nations, 2G networks are widely available; in fact, Ericsson estimates that more than 85 percent of the world’s population is covered by a 2G signal. Germany, Italy, and Spain boast 2G networks that reach 100 percent of the population, while the United States, Sri Lanka, Egypt, Turkey, Thailand, and Bangladesh have each attained 2G coverage for more than 98 percent of the population. Some developing markets don’t fare as well: as of 2012, 2G network coverage extended to 90 percent of the population of India, 55 percent of Ethiopia, 80 percent of Tanzania, and just under 60 percent of Colombia. Growing demand and accelerated rates of smartphone adoption in many markets have spurred mobile network operators to invest in 3G networks.
Ericsson estimates that 60 percent of the world population now lives within coverage of a 3G network. The level of 3G infrastructure by country reveals a stark contrast between countries with robust 3G networks and extensive coverage, such as the United States (95 percent), Western European nations (ranging from 88 to 98 percent), and Vietnam (94 percent), and many developing markets such as India, which are still in the early stages of deploying 3G networks."
(highlights by me)

The graph from the same publication shows significant 2G/3G only coverage even for the US.
Slightly more optimistic statistics from the ICT figures report for 2015 - look at US penetration rate for example, or Norway. A map and discussion with US internet speeds on Gizmodo.

How to see your site on Google weblight

For this blog, it is: googleweblight.com/?lite_url=https://andreas-wpv.blogspot.com . Don't forget to use some kind of mobile emulator to see how it really looks! Only some resolutions are supported - one is the standard iPhone 4 with 320 x 480.

Google original:

"If you have a Google account:
- View the transcoded page
Otherwise:
- On your mobile device, browse to the link http://googleweblight.com/?lite_url=[your_website_URL] where the url is fully qualified (http://www.example.com).
  OR
- On your desktop, open the Chrome device mode emulator with the link http://googleweblight.com/?lite_url=[your_website_URL] where the url is fully qualified (http://www.example.com) "

Is this legal? How about copyright?

Honestly - I don't want to go there, I am no lawyer, and there are some complex questions arising - internally.
Google claims that companies win so much traffic that it is really in their best interest. For every site that does not want this treatment, there is a way to opt out described on the Google help pages, and it gives some insight into what's removed. Some more tech details are in this article: mostly compression, removal of third party elements, reduction of design elements.

Google names this 'transcoding' - to me it looks like scraping and serving from their website, something that might be considered a copyright violation, and would likely be against Google 'terms of service' (highlights by me):

"Do not misuse our Services, for example, do not interfere with our Services or try to access them using a method other than the interface and the instructions that we provide. You may use our Services only as permitted by law, including applicable export and control laws and regulations. We may suspend or stop providing our Services to you if you do not comply with our terms or policies or if we are investigating suspected misconduct."

Additionally - would not adwords on a weblight page violate Adwords terms of service?

" Content that is replicated from another source without adding value in the form of original content or additional functionality

Examples: Mirroring, framing, or scraping content from another source"

I would understand - perhaps misunderstand - this to mean that Google Adwords cannot be used on weblight pages.

More generally, is this scraping ok for Google to do?

Sept. 25, 2015: From a glance at the detailed data transfer (chrome, fiddler), it seems to be more of a filtering of content through some kind of proxy. Working on it.

Resources:

In addition to the links above, here are a few articles I found on this topic, plus Google sources:

http://gadgets.ndtv.com/internet/news/google-india-to-offer-faster-access-to-mobile-webpages-for-android-users-702302
http://www.unrevealtech.com/2015/07/how-to-prevent-site-loading-google-weblight.html
http://www.androidauthority.com/google-web-light-looks-616450/
http://digitalperiod.com/google-web-light/
https://support.google.com/webmasters/answer/6211428?hl=en

Tuesday, September 15, 2015

Data analysis preparation analysis script

Finally got to working on this. I am working with larger files and one of my current fun projects is to find out which urls have been visited by Google, out of all the urls we have live.

While working on a small DB to do this I downloaded some files from splunk, then imported them. And sure enough realized there are things I need to filter out before, or the DB becomes absolutely unwieldy.
The files are not huge, but large enough, with 10+ million lines, so I want to use command line tools and not redo a script for every number I need, looping repeatedly over the same file.

It shows count of

number of fields
total number of lines
empty lines
non-empty lines.

Then it pulls the full data for the

shortest line
the longest line
the first 4 lines
last 4 lines
middle 3 lines of the file.

This is test data output from the script - I cannot show real data. It works well - at least with a few fields. Running it on a one-field list with 17 million lines takes ~32 seconds, which is pretty good I think.

I highlight the description in green as you might see - otherwise it is kind of hard to read. Also - awk changed the numbers for the line counters to scientific notation, so I needed to use int(variable) to reset it to integer and be able to concatenate it into a range for the sed - system call. Ah, this is with comma separated files and needs to be adjusted if that's different.
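To illustrate the int() point on its own, here is a small sketch. It assumes the usual awk behaviour where a non-integer number is converted to a string with the default "%.6g" format, which is what turns large values into scientific notation. The function name middle_range is made up for this example.

```shell
# Sketch: build the "start,end" range for sed from a total line count.
# Without int(), a value like 8500000.5 would be converted with "%.6g"
# and end up in scientific notation, which sed cannot use as an address.
middle_range() {
    awk -v total="$1" 'BEGIN {
        start = int(total / 2 - 1)
        end   = int(total / 2 + 1)
        print start "," end
    }'
}

# Example: middle_range 17000001   prints 8499999,8500001
```

The result can then be handed to sed as sed -n "$(middle_range 17000001)p" file.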

#!/bin/bash
# usage: script.sh file.csv (comma separated; adjust -F / FS for other separators)

# this shows all field counts and how often each occurs
echo -e "\033[32;1m\n\nnumber of lines : number of fields\033[0m" ; awk -F',' ' {print NF} ' "$1" | sort -n | uniq -c

# this shows the shortest / longest lines, the line counts and the middle 3 lines

awk ' BEGIN {FS=","} (NF<=shortest || shortest=="") {shortest=NF; shortestline=$0}
(longest<=NF) {longest=NF; longestline=$0}
(!NF) {emptylines+=1}
(NF) {nonemptylines+=1}
END { middlestart=(NR/2)-1;
middleend=(NR/2)+1;
# sed line addresses start at 1
if (middlestart<1) middlestart=1;
# int() keeps awk from writing large line numbers in scientific notation
range=int(middlestart)","int(middleend);
print "\033[32;1m\ntotal number of lines is:\n\t\033[0m" NR,
"\033[32;1m\n shortest is:\t\t\033[0m" shortest,
"\033[32;1m\n longest is:\t\t\033[0m" longest,
"\033[32;1m\n shortestline is:\t\t\033[0m" shortestline,
"\033[32;1m\n longestline is:\t\t\033[0m" longestline,
"\033[32;1m\n number empty lines is: \t\t\033[0m" (emptylines+0),
"\033[32;1m\n number of non-empty lines is:\t\t\033[0m" (nonemptylines+0);

print "\033[32;1m\n\nrange is   \033[0m" range, "\033[1;32m\nFILENAME IS \033[0m" FILENAME, "\033[32;1m\nnow the middle 3 lines of the file: \n\033[0m";
# quote the filename so paths with spaces still work
system("sed -n " range "p \"" FILENAME "\"")

} ' "$1"

echo -e "\033[32;1m\n\ntop 4 lines\n\033[0m"
head -n 4 "$1"
echo -e "\033[32;1m\n\nlast 4 lines\n\033[0m"
tail -n 4 "$1"
echo -e "\n"

Wednesday, July 29, 2015

Google: "making the web faster"

... how about starting at home?

I have chrome open with two tabs (gmail and G+) and three plugins (seoclarity, everyone social and WooRank), and you need HOW many processes and HOW much memory?

I have 20 plugins I'd like to use, and 10+ tabs I'd like to keep open, but that just won't work with this wasteful resource usage.

Wednesday, July 22, 2015

Parallel on AWS

Remember the post on how many parallel (simple, wget) processes can be run reasonably on a machine?

This is how it looks on aws / amazon web services EC2:

I added lower / higher numbers after a bit of testing, but this is fast - likely because of a really good internet connection. (I might need to try something to test the processing itself without network dependency....)

Many of the scripts - even when run with parallel - still take hours or days to complete. 10,000 or 100,000 urls to check are the rule, sometimes even more. With downloads and processes on the data, this can take a while.

I can't run these from the office, and if I run them from home I keep checking... which is not that relaxing at times when I want to relax. So, I often use amazon web services EC2.
For about a year I used the free tier - a great offer from Amazon to try the service with a very small instance for free. Now I am in the paid program, but cost is really low for what I do and after a few times it became routine to start a new instance when I need it, run some things, then terminate it.

One of the programs I pretty much always use is tmux - 'terminal multiplexer' - which is now installed by default on ubuntu instances. It not only runs many terminals in one connection, but also lets me detach a session. That means I can start a session, run a script in it, detach, then close the terminal and the connection, and the script keeps running in the background. A few hours later I can just connect again and download the results.
Especially in combination with parallel this is a great way to run many of the scripts from this blog - and more.

Starting looks a bit like this (thanks to Ole Tange for the tip and the help with parallel):

ls x* | parallel -j24 --line-buffer " . script.sh {} >> results.txt "

I often just split large files, then ls or cat the split files to parallel. -j24 means 24 parallel jobs, {} picks up the data from ls/cat.
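The split step could look roughly like this. It is only a sketch: split_urls and urls.txt are made-up names, and the chunk size of 1000 lines is an arbitrary default. The prefix "x" matches the ls x* in the command above.

```shell
# Hypothetical helper: split a url list into chunk files xaa, xab, ...
# $1 = file with one url per line, $2 = lines per chunk (default 1000)
split_urls() {
    local file="$1" lines="${2:-1000}"
    split -l "$lines" "$file" x
}

# Usage sketch (run in an otherwise empty working directory):
#   split_urls urls.txt 1000
#   ls x* | parallel -j24 --line-buffer " . script.sh {} >> results.txt "
```

Running this in an empty directory keeps ls x* from picking up unrelated files that happen to start with x.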

Thursday, July 9, 2015

Crawl faster with "parallel" - but how fast?

Crawling websites - often our own - helps find technical seo, quality and speed issues; you'll see a lot of scripts on this blog for this.

With a site with millions of pages, crawling speed is crucial - even a smaller project easily has some thousand urls to crawl. While there are good tools to check for many things, wget and curl in combination with other command line tools give all the flexibility needed - but can be painfully slow.

To check download speed, for example, a regular script would loop through a list of files, download one, then move to the next. Even splitting the list up and running the same script several times is not great, and quite hard to handle at scale.

After some searching I found gnu parallel - excellent for exactly this kind of task. Running multiple parallel processes is easy and powerful. The switch -j[number] sets the number of parallel processes.

But what number is best, especially with networking projects? A wget should not be a heavy computing load, even with some awk/sed/grep afterwards, because most of the time is spent with waiting and data transfer.

So, a little script to check the time for a few different numbers of parallel processes:

for i in {12,24,36,48,60,80,100}; do var=$( time ( TIMEFORMAT="%R" ; cat "${1}" | parallel "-j${i}" . single-url-check-time-redirect.sh {} ) 2>&1 1>/dev/null ) ; echo "$i parallel threads take $var seconds" ; done

The referenced script (single-url-check-time-redirect.sh) just makes a curl call to check for a http status and a wget for the header transfer time; $1 pulls in a list of full urls.
The results differ greatly by network.

Here a slow, public network in a cafe:

This from a library with a pretty good connection:

And this from a location with a 50MB/sec connection:

Times can easily vary by a factor of 2 in each location. Network connection and load still seem to make the biggest difference, as all three were done on the same computer.

Wednesday, June 24, 2015

the script: 8 different user agents and how sites deal with it

User agent analysis script

As mentioned in the earlier post, a script helped me to grab the info on how sites, and google specifically, treat various browsers.

While there's a lot more to analyse, much of it manually, I wanted to first see if there is an indication of differences - so for first insight I use just a plain wc to get lines, words and characters of the response, and it looks like there is a clear pattern.

So, let's take a look at the source, two nested "read " loops. The outer loop through the urls, the inner loop through the agents:

# usage: script.sh urls.txt agents.txt

# check that both input files exist
if [[ ! -e $1 || ! -e $2 ]]; then
echo -e "there's no file with this name"
exit 1
fi

outfile=$RANDOM-agentdiff.txt
echo -e "agent \t url \t lines \t words \t bytes" > "$outfile"

# outer loop: urls. add a http:// to urls that don't have a scheme
while read -r line; do

if [[ $line == http://* || $line == https://* ]]; then
newline="$line"
else
newline="http://$line"
fi

# inner loop: agents. then read the wc output (lines, words, bytes) into variables with a <<< here string
# -q keeps wget's own messages out of the count

while read -r agent; do

read -r filelines words chars <<< "$(wget -q -O- -t 1 -T 3 --user-agent "$agent" "$newline" | wc)"

echo -e "$agent \t $line \t $filelines \t $words \t $chars" >> "$outfile"
done < "$2"
done < "$1"
wc -l "$outfile"

Most difficult part was to get the wc output into separate variables, thanks stackexchange for the tip with the <<< here string.
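The here-string trick in isolation, as a small sketch: wc prints three numbers, and read splits them into three variables. count_text is a made-up name for this example, and it reads its text from stdin.

```shell
# Sketch: read the three numbers wc prints into three variables at once.
count_text() {
    local lines words chars
    # wc reads the stdin of the function; <<< feeds its output to read
    read -r lines words chars <<< "$(wc)"
    echo "lines=$lines words=$words chars=$chars"
}

# Example: printf 'a b\nc\n' | count_text   prints lines=2 words=3 chars=6
```

A plain pipe into read would not work the same way in bash, because the read would run in a subshell and the variables would be gone afterwards.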

Thursday, June 4, 2015

Speed: Data on top 1000 header load time vs full load time

Lots of tools give a different number for the speed of a site, how it is for users, over different channels, providers, including rendering time or not, including many elements or not.

This is the 'hardest' test of all:

With a little script I checked the top 1000 pages from the Alexa list*. The first script tests how long it takes to get the http header back (yes, document exists, and yes, I know where it is and it is OK). These are the blue dots.
The second script downloads ALL page elements of the homepage, including images, scripts, stylesheets, also from integrated third party tools like enlighten, tealeaf, omniture, or whatever a site uses. These are the orange dots.

First I ran this without time limit, and when I checked back the next day, it was still running, so I set the timeout to 20 seconds.

There seems to be a clear connection between header response time and full time, not so much between rank in the top 1000 by traffic and speed.

sorted by full download time

sorted by traffic rank

There also seems to NOT be a clear connection on how rank by traffic (x-axis) correlates to full download time. This shows that we have a great opportunity to outperform many other companies with faster download speeds.

* Alexa (owned by Amazon) publishes the top 1 Million websites by traffic, globally, based on data from many different browser plugins plus from Amazon (cloud?) services.

Tuesday, May 26, 2015

Google mobile friendly: Industries, site types and self centered algorithm change

Maybe it is just me.

Since Google started pushing everyone to make their sites mobile friendly, many pages are now a pain to read on a regular laptop or monitor.

Example SEO Blog: moz.com/blog

Take the moz blog, for example. Great on mobile, not good to read on a desktop. HUGE image up front, pushing all content below the fold. That's just not good, not for SEO, and not for usability.

Desktop design due to Google push? Image is all I see on the screenshot 14" full laptop screen.

Example News blog: Search Engine Land

Similar issue. I do NOT like headlines that big, really.
Desktop design due to Google push? On my screenshot 14" full laptop screen I see a lot of headline, and... Ads.

Now, let's look at a few more.

Amazon has a different url concept (in parts, at least). Mobile experience is not that great, and if I hit the back button, I get the desktop homepage on phone screen, I had hoped for something different.

Bestbuy is not really different - showing off their headline :-). The responsive seems not optimized, at least not for me.

Dell? We have some great pages, some ok pages (many on separate url on m.dell.com) and some definitely having room for improvement when it comes to mobile / responsive - and we have lots of teams working on improving that, both some different url country setups, and some responsive ( http://www.dell.com/en-us/work/learn/large-enterprise-solutions for example). But responsive lends itself to some content like consulting service description, and not to other content like picking a laptop out of a larger selection.

Many sites need to serve content to mobile users much better than they currently do, and Google's push is a good reminder. I do think they have gone a bit far, and I am not sure they are aware that other industries have different needs. Google has an easy play. Their content and searching lends itself to responsive design but it is not really fair to compare search and search results to news, magazines and eCommerce sites. They all have much more complex processes users need to go through, and they have other business models than exploitation of personal data.

Does not Google's data actually confirm that there are industries where mobile is really important and some where it is not? Showing data from Google adwords tool, just random terms used to see some variance:

Not really a big surprise when you think about it. And even the keyword 'search' has 51% of searchers using a pc / monitor, not a phone or tablet, according to Adwords data. I understand the search algorithm change was not large, but all of the above indicates that it should really not be any larger either. Many industries don't have that much mobile share, and responsive does not work well for complex tasks. It can be made to work well with adaptive layouts and, even better, with separate urls, but that is slow, complex and expensive, so it takes even more mobile share to justify the investment.

But maybe I am totally wrong because I just have not seen the great examples out there. Do you have an example of a site with a complex task that really works well on phone + monitor in a responsive layout?

Thursday, April 23, 2015

8 User agents and responses Alexa top million pages

Some more interesting results from the test run with 8 different user agents and the return size of the documents from the top 1000 sites in the Alexa 1million.

These are the largest returns - interesting to see these sites here, newspapers, stores, portals.

Many of the smallest sites are returning nothing, redirecting or under construction, actually quite a lot of the top performing pages.

But... maybe that's just because of the unknown user agent? (or the one with wget in it?)

Yes, very different picture. With just the regular user agents, just a few pages don't send a page back. Some redirect, but some are just very, very small, great job!

See Baidu in there? Now let's take another look at a few search giants. Google is amazing - that looks like the exact same page with a few lines in various languages. Interesting to see how much code each line contains though, that's huge compared to other sites.

And while the search giants likely have all kinds of fancy (tracking?) elements, some of the search alternatives have a few more lines, but much less actual code on the page.

Filtering a bit further confirms that no one likes my 'andi-wget' user agent. This means future work with wget will nearly always need a different user agent!

Check out the first post with results on average response sizes and how Google responds.

Monday, April 13, 2015

Documentation - where did I store that little script?

Goodness gracious me. I recall that I did this, I cannot just type it again - not doing that often - so HOW do I find that script?

How do you find your little scripts?

Ok, good folder structure, naming convention, all fine, but it still is hard. Is it in the sitespeed folder or the sitemap folder? The test folder, perhaps? Did I spell it this way?
Most scripts here are for 'few-times' use, built to come up with a quick insight, a starting point or some scaling information to build a business case. The scripts are quick and manifold, with many different variations.

Every now and then I recall a script I'd like to re-use, and then struggle to find the right script. Working across several computers is a challenge, here. Git - too big, had some security issues, and too steep a learning curve for these one-liners.

I used this blog, then, as a repository, and to find the scripts (with a site:andreas-wpv.blogspot.com search) plus Google drive for a small part of the repository. Works, but still missing info.

Documentation script

But I am using #comments quite a lot even in these short bash scripts, so I will now extend that, and use this:

find . -name "*.sh" -exec echo -e "{}\n" \; -exec grep "^#" {} \; -exec echo -e "\n\n" \;

This pulls the script name and path, then an empty line, then all the comments, then two empty lines to separate it from the next script. Not pretty, but it works. Now I need to document a bit better :-)

How do you sort, document and find your little scripts?

Tuesday, April 7, 2015

Including Google: 8 agents - and average response code

Agents, not spies

Agents, user agents, play an important role in the setup and running of sites, likely more than many developers like. Exceptions, adjustments, fixes - and it is (?) impossible to generate code that shows up properly on all user agents and also is W3C compliant.

So, sites care, and likely Google cares as well. Google - like any site - offers only one interface to their information (search): source code. How the page looks and how it is rendered is up to the tool that renders it, and that tool is not Google, nor provided, supported, maintained or owned by Google, but something on the client side. Not even the transport is provided by Google. Google only provides a stream of data via http or https, and anything after that is not their business.

So, my question was - do user-agents make a difference?

And sure, what better to use than a little script? There are so many different user agents, I wanted to keep this open to changes, so I loop over a file with user agents - one per line - as well as over a list of urls to get the response from these sites.

Surprising results: Bingbot and Googlebot receive the smallest files - not wget or an unknown bot. The graph overstates this, but the table shows there's a clear difference.

Google.com answering

Now let's take a look at the response Google.com sent back to the little script. They do NOT like wget or andi-bot, nor google bot or bingbot, although they perform a bit better than the first two.
High focus on the regular user agents. The picture for words is the same as for lines.

Google only provides a datastream, rendering is on the client side. So whether I use firefox, IE, safari, wget or curl is my business - and not Google's. I understand that google does not want people to scrape large amounts of data off their site - which I don't do - but I am surprised the 'block' is not a bit more sophisticated.

Monday, March 16, 2015

Automate your Logfile analysis for SEO with common tools

Fully automated log analysis with tools many use all the time

Surely no substitute for splunk and its algorithms and features, but very practical, near zero cost (take that!) and highly efficient. Requires mainly free tools (thanks cygwin) or standard system tools (like windows task scheduler), plus a bit of trial and error. (I also use MSFT Excel, but other spreadsheet programs should work as well).

Analysis of large logfiles, daily

Analyzing logfiles for bot and crawler behavior, but also to check for site quality is quite helpful. So, how to analyze our huge files? For a part of the site, we're talking about many GB of logs, even zipped.

Not that hard, actually, although it took me a while to get all these steps lined up and synchronized.

With the windows task scheduler I schedule a few steps overnight:

copy the last day's logfiles to a dedicated computer
grep the respective entries into a variety of files (all 301, bot 301, etc.)
then count the file lengths (wc -l) and append the values to a table (csv file) tracking these numbers
delete the logfiles
the resulting table and one or two of the complete files (all 404.txt) are copied to a server, which hosts an Excel file which uses the txt file as database, and updates graphs and tables on open
delete temporary files (and this way avoid the dip you see)

Now our team can quickly check whether we have an issue and need to take a closer look, or not.
In a second step I also added all log entries resulting in a 404 into the spreadsheet on open.
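The grep / count / append steps could be sketched like this. The function name and the csv layout (date, status, count) are assumptions for illustration, and so is the log format: the sketch expects the status code as its own space-separated field, which needs adjusting for real logs.

```shell
# Hypothetical helper: count log lines with a given HTTP status and
# append "date,status,count" to a tracking csv.
# $1 = logfile, $2 = status code (e.g. 301), $3 = csv file to append to
append_status_count() {
    local logfile="$1" status="$2" csv="$3" count
    # grep -c counts matching lines (it prints 0 if there are none)
    count=$(grep -c " ${status} " "$logfile")
    echo "$(date +%F),${status},${count}" >> "$csv"
}
```

Scheduled once per night for each status of interest, this builds the table that the spreadsheet reads on open.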


Thursday, March 5, 2015

Who has H1?

Do we need an H1 on our homepage?

Sometimes it is necessary to convince pdms, devs and stakeholders that SEO efforts are necessary. One way to support this is to quickly run a test on competitors and / or on ... top pages on the web. (Yes, after all the other pros have been given.) Especially since we're running one of the top pages ourselves, that list contains powerful names. (And the first url is in my tests because I know what's happening there, not because of being in the top 1 million... yet ;-) )

So, H1 or not?

Out of the top 1000 pages, about
Here is a screenshot of the top pages that have an H1 on their homepage.

This is the script, running over the top 1000 urls from alexa 1 million. Very easy to adjust for other page elements.

#!/bin/bash
# usage: script.sh urls.txt
echo -e "url\t has H1 " > 2top1kH1.txt
while read -r line; do
echo "$line"
# count the lines that contain an opening h1 tag (case-insensitive)
h1yes=$(wget -qO- -w 3 -t 3 "$line" | grep -ci "<h1")
if [ "$h1yes" -gt 0 ]; then
echo -e "$line\t yes" >> 2top1kH1.txt
fi
done < "$1"

Not large, not complicated, but very convincing.
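One possible way to adjust this for other page elements is to make the tag a parameter. This is only a sketch, and count_tag is a made-up name; it reads HTML from stdin and counts lines, not individual tags.

```shell
# Hypothetical helper: count the lines that contain an opening tag.
# $1 = tag name (e.g. h1, title, h2); HTML is read from stdin
count_tag() {
    # match "<tag>" or "<tag " so that e.g. "h" does not match "<h1>"
    grep -ci "<$1[ >]"
}

# Usage sketch:
#   wget -qO- "http://www.example.com" | count_tag h1
```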

Monday, February 16, 2015

Google Sitespeed Score and Rank

While Matt Cutts mentioned that site speed only affects a small percentage of sites in their rankings - running one of the largest sites this is worth a second look. Due to the millions of pages, speed might be especially critical for rank - or at least for indexation.

Sitespeed is not sitespeed score, but a relatively 'neutral' way to measure speed related performance.

What to do? First, I selected pages across site, some 2500 pages to have a nice sample. Then I pulled the average rank for each of these pages from seoclarity(.net). And combined this with the speedscore from my script.

As a last step imported this table into Excel and ... scatterplot! Vertical is the ranking position, and x-axis is the speedscore. At first sight, this would indicate that with increasing speed score pages rank worse, but overall I would say this means there is no correlation at all (0.09), and the variance is way too big to use this to calculate a trend.
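The correlation itself came from Excel. For larger tables, a command line check could look like the sketch below. It computes the Pearson correlation and assumes a tab-separated file with two numeric columns (for example speedscore and rank) and no header line. The function name is made up.

```shell
# Sketch: Pearson correlation of column 1 and column 2 of a tab-separated file.
pearson() {
    awk -F'\t' '
        { n++; sx += $1; sy += $2; sxy += $1 * $2; sxx += $1 * $1; syy += $2 * $2 }
        END {
            den = sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))
            # no variance in one of the columns (or no data): correlation is undefined
            if (den == 0) { print "undefined"; exit }
            printf "%.2f\n", (n * sxy - sx * sy) / den
        }' "$1"
}
```

A result close to 0, like the 0.09 above, means the two columns have next to no linear relationship.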

Tuesday, February 3, 2015

Page size by position in the site structure

Large sites have specific requirements at times, for example when checking for site speed optimization.

One factor influencing page speed, usability and ease of being crawled by search engines is the total page size. With millions of pages, how can we focus our work beyond just finding examples of what works and what does not?

We ran an internal crawler to collect data on page size for each page. The second step then was visualization, to see if there is a pattern that allows us to focus on high impact areas.

For a small part of the site (!), this is the printed plot of the page weight (with R) based on folder depth = the number of “/” minus 3, which adjusts for the slashes in http://www.dell.com/. As a result, http://www.dell.com/index.aspx would be level 0, and http://www.dell.com/support/home would be level 1.
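The depth calculation is simple enough to sketch on the command line: count the slashes in the url and subtract 3. The function name url_depth is made up, and the sketch assumes full urls that start with http:// or https://.

```shell
# Sketch: folder depth = number of "/" in the url minus 3.
url_depth() {
    # with "/" as separator, a url has NF - 1 slashes
    echo "$1" | awk -F'/' '{ print NF - 1 - 3 }'
}

# Example: url_depth http://www.dell.com/index.aspx     prints 0
#          url_depth http://www.dell.com/support/home   prints 1
```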

Based on the graph we were able to clearly identify the area where large pages sit, and were able to narrow it down to a type of page. This gave the necessary information to work with Dev and Design to improve the site significantly.

We also identified a range of descriptive statistics and the 'outliers', and could fix these immediately - as many will have guessed, unoptimized pictures were the issue, and easily remedied.

Thursday, January 22, 2015

Alexa 1 million Top mobile performers on Google sitespeed score: perfect score 100

Mobile SpeedScore

Similar to the desktop numbers, here are the top performers on mobile with a speedscore of 100 (out of the top 10,000 from Alexa's top million sites)!
Checked with this script, and then just a graph in Excel to show the distribution of 100-score by 1000s. While there seems to be a strong correlation between being in the top 1000 and sitespeed, sites with a speedscore of 100 appear more frequently in other areas, highest in the sites which rank in the 8000s in Alexa.
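The graph was made in Excel. As a rough command line alternative, the distribution by 1000s could be counted like this. The sketch assumes a tab-separated file with rank, site and speedscore columns like the list in this post; the function name is made up, and a rank of 8038 is counted under 8000.

```shell
# Sketch: count sites with speedscore 100 per block of 1000 ranks.
# $1 = tab-separated file: rank, site url, speedscore
count_by_thousand() {
    awk -F'\t' '$3 == 100 { bucket[int($1 / 1000) * 1000]++ }
        END { for (b in bucket) printf "%d\t%d\n", b, bucket[b] }' "$1" | sort -n
}
```

The output has one line per block, with the start of the block and the number of sites, sorted by rank.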

Sitelist mobile Speed Score

No Google search, interestingly, and many other big names missing, but a list of urls whose content I am not even going to try to check through our company network. Bold marks sites that have a speedscore of 100 in desktop and mobile:

position in Alexa top 10,000	site url	speedscore
29	googleusercontent.com	100
122	secureserver.net	100
571	github.io	100
776	streamcloud.eu	100
1051	gstatic.com	100
1197	atpanel.com	100
1497	sourtimes.org	100
1537	openadserving.com	100
1576	googleadservices.com	100
1636	googleapis.com	100
1969	securepaynet.net	100
2153	anitube.se	100
2281	nocookie.net	100
2616	socialspark.com	100
2665	bookryanair.com	100
2696	onlinecreditcenter6.com	100
2729	gudvin.tv	100
2896	giaoduc.net.vn	100
2935	sexad.net	100
2981	9gag.tv	100
3179	withgoogle.com	100
3343	readserver.net	100
3455	ecplaza.net	100
3491	dilandau.eu	100
3606	prcm.jp	100
3659	themeko.org	100
3802	puu.sh	100
4047	get-a-fuck-tonight.com	100
4181	kienthuc.net.vn	100
4194	stream-tv.me	100
4340	api.ning.com	100
4355	benesse.ne.jp	100
4571	itrack.it	100
4606	trackoptimizer.com	100
4743	7xz3.com	100
4965	uast.ac.ir	100
4982	edgesuite.net	100
5013	liveadexchanger.com	100
5318	ipsosinteractive.com	100
5349	fun698.com	100
5433	moudamepo.com	100
6304	reduxmediia.com	100
6605	teknosa.com.tr	100
6711	tradetang.com	100
6714	vgsgaming.com	100
6804	yieldtraffic.com	100
6808	insight.ly	100
7009	contest-winners.com	100
7070	exhentai.org	100
7116	techhelpfox.com	100
7285	mlstatic.com	100
7736	mmstat.com	100
7772	lovethatsex.com	100
8038	endlessmatches.com	100
8040	savedwebhistory.org	100
8160	flirt-fuck.com	100
8213	h12-media.net	100
8298	kataskopoi.com	100
8414	ihct.mx	100
8601	evsuite.com	100
8630	rapmls.com	100
8735	9stock.com	100
8921	credomatic.com	100
8991	fullsail.edu	100
9097	youtu.be	100
9236	cncmax.cn	100
9335	gefaellt-mir.me	100
9345	vtb24.ru	100
9348	xe2c.com	100
9360	tehran.ir	100
9745	jobspapa.com	100
9792	iphone-winners.net	100
9898	nolix.ru	100
9907	mihanstore.net	100

Wednesday, January 14, 2015

Alexa 1 million Top desktop performers on Google sitespeed score: perfect score 100

Desktop sitespeed score

Using the good old (or new) Alexa top 1M sites list again, this is the list of the top performers on desktop with a speedscore of 100! Checked with this script, and then just a graph in Excel to show the distribution of 100-score by 1000s. While there seems to be a strong correlation between being in the top 1000 and sitespeed, sites with a speedscore of 100 appear more frequently in other areas, highest in the sites which rank in the 8000s in Alexa.

No. of sites with homepage speedscore 100 = perfect for desktop per 1000 sites from Alexa

Sitespeed score 100 site list

No Google search, interestingly, and many other big names missing, but a list of urls whose content I am not even going to try to check through our company network.

No. in Alexa 1 million	Site URL	score
29	googleusercontent.com	100
122	secureserver.net	100
571	github.io	100
776	streamcloud.eu	100
1027	lapatilla.com	100
1051	gstatic.com	100
1197	atpanel.com	100
1497	sourtimes.org	100
1537	openadserving.com	100
1576	googleadservices.com	100
1636	googleapis.com	100
1969	securepaynet.net	100
1995	banesconline.com	100
2281	nocookie.net	100
2696	onlinecreditcenter6.com	100
2896	giaoduc.net.vn	100
2935	sexad.net	100
2981	9gag.tv	100
3179	withgoogle.com	100
3343	readserver.net	100
3347	xxxhost.me	100
3491	dilandau.eu	100
3606	prcm.jp	100
3802	puu.sh	100
4031	womenwan.com	100
4047	get-a-fuck-tonight.com	100
4181	kienthuc.net.vn	100
4194	stream-tv.me	100
4274	tradeindia.com	100
4355	benesse.ne.jp	100
4571	itrack.it	100
4606	trackoptimizer.com	100
4743	7xz3.com	100
4982	edgesuite.net	100
5013	liveadexchanger.com	100
5318	ipsosinteractive.com	100
5349	fun698.com	100
5433	moudamepo.com	100
5732	come.in	100
6304	reduxmediia.com	100
6714	vgsgaming.com	100
6804	yieldtraffic.com	100
6808	insight.ly	100
6964	adxhosting.net	100
7009	contest-winners.com	100
7070	exhentai.org	100
7116	techhelpfox.com	100
7285	mlstatic.com	100
7337	fzg360.com	100
7382	siyahgazete.com	100
7484	imgsin.com	100
7736	mmstat.com	100
7762	cnnewmusic.com	100
7962	picketfenceblogs.com	100
8038	endlessmatches.com	100
8040	savedwebhistory.org	100
8160	flirt-fuck.com	100
8213	h12-media.net	100
8298	kataskopoi.com	100
8414	ihct.mx	100
8601	evsuite.com	100
8630	rapmls.com	100
8735	9stock.com	100
8890	travideos.com	100
8921	credomatic.com	100
9097	youtu.be	100
9236	cncmax.cn	100
9335	gefaellt-mir.me	100
9348	xe2c.com	100
9745	jobspapa.com	100
9792	iphone-winners.net	100
9898	nolix.ru	100
9907	mihanstore.net	100