Thursday, September 24, 2015

Google Weblight - using full content from sites on their domain

Like too many 'lite' things, this weblight version by Google comes with serious side effects. 

Some time ago, I think in April, Google announced they would take web pages that load too slowly for some users, scrape them off the owners' websites, and serve them from Google's own servers. The test was supposed to start in Indonesia, targeted at 2G connections. Later, they rolled this out for 'select countries' and slow mobile connections (2G).

The 'transcoding' is done on the fly, according to the news, and speeds things up significantly; we have seen 10x faster loads.

Weblight - look and feel

This is how it looks: the original on the left, and the 'scraped' site on the right. Not bad, really. Users can load the original page from the top section, with a warning that it might be slow. That sounds like a very user-friendly approach to me. The navigation (nav icon in the upper left) is solid, and not missing anything at the top level.


The biggest flaw is that it removes pretty much all third-party elements - including our tracking elements, scripts, etc. Google claims to leave up to two ads on the page, and to allow simple tracking with Google Analytics. 

Traffic on weblight

We actually see some traffic from Googleweblight. It is minuscule, and we noticed it only because we were specifically looking for it, but there it is. We see some referral traffic in Adobe SiteCatalyst, and we also see the 'scraping' of pages in our logfiles. 

Most of the traffic for us seems to come from India, trying to reach a variety of pages in several countries instead of just India, which might explain some of the slowness. 

The scraped or 'transcoded' pages can be found in the logs with cs_User_Agent="*googleweblight*", and the pages that get the referral traffic via the referrer "". 
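To make those two patterns concrete, here is a minimal grep sketch on stand-in log lines - the field layout and the lite_url parameter shown are examples; adjust the matching to your own log format:

```shell
# stand-in log lines; real entries will follow your server's field layout
cat > sample.log <<'EOF'
1.2.3.4 "GET /page1 HTTP/1.1" 200 "-" "Mozilla/5.0 ... via googleweblight.com"
5.6.7.8 "GET /page2 HTTP/1.1" 200 "http://googleweblight.com/?lite_url=..." "Mozilla/5.0 (iPhone)"
9.8.7.6 "GET /page3 HTTP/1.1" 200 "http://www.google.com/" "Mozilla/5.0"
EOF
# lines with weblight involvement, either transcoding fetch (user agent) or referral (referrer)
grep -c "googleweblight" sample.log
```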

How relevant are 2G networks for global ecommerce

So far, the traffic seems super small, but the market potential is quite big. Global data on network coverage is a bit hard to come by, but there are some relevant sources.

Some excerpts from a McKinsey report:
"In developed countries and many developing nations, 2G networks are widely available; in fact, Ericsson estimates that more than 85 percent of the world’s population is covered by a 2G signal.42 Germany, Italy, and Spain boast 2G networks that reach 100 percent of the population, while the United States, Sri Lanka, Egypt, Turkey, Thailand, and Bangladesh have each attained 2G coverage for more than 98 percent of the population.43 Some developing markets don’t fare as well: as of 2012, 2G network coverage extended to 90 percent of the population of India, 55 percent of Ethiopia, 80 percent of Tanzania, and just under 60 percent of Colombia.44 Growing demand and accelerated rates of smartphone adoption in many markets have spurred mobile network operators to invest in 3G networks.
Ericsson estimates that 60 percent of the world population now lives within coverage of a 3G network. The level of 3G infrastructure by country reveals a stark contrast between countries with robust 3G networks and extensive coverage, such as the United States (95 percent), Western European nations (ranging from 88 to 98 percent), and Vietnam (94 percent), and many developing markets such as India, which are still in the early stages of deploying 3G networks."
(highlights by me)

The graph from the same publication shows significant 2G/3G-only coverage even for the US.
Slightly more optimistic statistics come from the ICT figures report for 2015; look at the US penetration rate, for example, or Norway. There is also a map and discussion of US internet speeds on Gizmodo.

How to see your site on Google weblight

For this blog, it is: . Don't forget to use some kind of mobile emulator to see how it really looks! Only some resolutions are supported - one is the standard iPhone 4 with 320 x 480. 

Google original:

Is this legal? How about copyright?

Honestly - I don't want to go there. I am no lawyer, and some complex questions arise, at least internally. 
Google claims that companies gain so much traffic that it is really in their best interest. For every site that does not want this treatment, there is an opt-out described on the Google help pages, which also gives some insight into what gets removed. There are some more technical details in this article: mostly compression, removal of third-party elements, and reduction of design elements. 
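If I read the help pages correctly, the opt-out works by sending a Cache-Control: no-transform response header. A quick sketch of checking for it, demonstrated on a canned response so it runs offline (for a live site you would pipe curl -sI into the same grep; the url is a placeholder):

```shell
headers='HTTP/1.1 200 OK
Cache-Control: no-transform
Content-Type: text/html'
# live check would be:  curl -sI "http://www.example.com/" | grep -i "no-transform"
if printf '%s\n' "$headers" | grep -qi '^Cache-Control:.*no-transform'; then
  echo "opt-out header present"
fi
```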

Google calls this 'transcoding' - to me it looks like scraping and serving content from their website, something that might be considered a copyright violation, and would likely be against Google's own 'terms of service' (highlights by me):
"Do not misuse our Services, for example, do not interfere with our Services or try to access them using a method other than the interface and the instructions that we provide. You may use our Services only as permitted by law, including applicable export and control laws and regulations. We may suspend or stop providing our Services to you if you do not comply with our terms or policies or if we are investigating suspected misconduct."

Additionally - would AdWords on a weblight page not violate the AdWords terms of service?
" Content that is replicated from another source without adding value in the form of original content or additional functionality

  • Examples: Mirroring, framing, or scraping content from another source"

I would understand (perhaps misunderstand) this to mean that Google AdWords cannot be used on weblight pages.

    More generally, is this scraping OK for Google to do?

    Sept. 25, 2015: A glance at the detailed data transfer (Chrome, Fiddler) suggests it is more of a filtering of the content through some kind of proxy. Working on it.


    In addition to the links above, here are a few articles I found on this topic, plus Google sources:

    Tuesday, September 15, 2015

    Data analysis preparation script

    Finally got around to working on this. I work with larger files, and one of my current fun projects is to find out which urls have been visited by Google, out of all the urls we have live.

    While working on a small DB for this, I downloaded some files from Splunk and imported them. Sure enough, I realized there are things I need to filter out first, or the DB becomes absolutely unwieldy.
    The files are not huge, but large enough, with 10+ million lines, so I want to use command line tools instead of redoing a script for every number I need, looping repeatedly over the same file.

    The script shows the count of 

    • number of fields
    • total number of lines
    • empty lines
    • non-empty lines. 
    Then it pulls the full data for the 

    • shortest line
    • the longest line
    • the first 4 lines
    • last 4 lines
    • middle 3 lines of the file.

    This is test data output from the script (I cannot show real data). It works well, at least with a few fields. Running it on a 17-million-line, one-field list takes ~32 seconds, which is pretty good, I think.

    I highlight the description in green, as you can see; otherwise the output is kind of hard to read. Also, awk changed the numbers for the line counters to scientific notation, so I needed to use int(variable) to convert them back to integers and be able to concatenate them into a range for the sed system call. This is written for comma-separated files and needs to be adjusted if your separator is different.

    #this shows all field counts and how often they occur
    echo -e "\033[32;1m\n\nnumber of lines : number of fields\033[0m" ; awk -F, ' {print NF} ' "$1" | sort | uniq -c 
    #this shows only number of shortest / longest lines
    awk ' BEGIN {FS=","} (NF<=shortest || shortest=="") {shortest=NF; shortestline=$0} 
    (longest<=NF) {longest=NF; longestline=$0} 
    (!NF) {emptylines+=1} 
    (NF) {nonemptylines+=1}
    (maxcount<NR) {maxcount=NR}
    END { middlestart=int(maxcount/2)-1;
    # build the "start,end" range for the sed call; int() avoids scientific notation
    range=middlestart "," middlestart+2;
    print "\033[32;1m\ntotal number of lines is:\n\t\033[0m" NR,
    "\033[32;1m\n shortest is:\t\t\033[0m" shortest, 
    "\033[32;1m\n longest is:\t\t\033[0m" longest, 
    "\033[32;1m\n shortestline is:\t\t\033[0m" shortestline, 
    "\033[32;1m\n longestline is:\t\t\033[0m" longestline,
    "\033[32;1m\n number empty lines is: \t\t\033[0m" emptylines,
    "\033[32;1m\n number of non-empty lines is:\t\t\033[0m" nonemptylines;
    print "\033[32;1m\n\nrange is   \033[0m" range, "\033[1;32m\nFILENAME IS \033[0m" FILENAME, "\033[32;1m\nnow the middle 3 lines of the file: \n\033[0m";
    system("sed -n "range"p " FILENAME)
    } ' $1
    echo -e "\033[32;1m\n\ntop 4 lines\n\033[0m"
    head -n 4 "$1"
    echo -e "\033[32;1m\n\nlast 4 lines\n\033[0m"
    tail -n 4 "$1"
    echo -e "\n"

    Wednesday, July 29, 2015

    Google: "making the web faster"

    ... how about starting at home?

    I have chrome open with two tabs (gmail and G+) and three plugins (seoclarity, everyone social and WooRank), and you need HOW many processes and HOW much memory?

    I have 20 plugins I'd like to use, and 10+ tabs I'd like to keep open, but that just won't work with this wasteful resource usage.

    Wednesday, July 22, 2015

    Parallel on AWS

    Remember the post on how many parallel (simple wget) processes can reasonably be run on one machine?

    This is how it looks on aws / amazon web services EC2:

    I added lower/higher numbers after a bit of testing, but this is fast - likely because of a really good internet connection. (I might need to find a way to test the processing itself, without the network dependency....)

    Many of the scripts - even when run with parallel - still take hours or days to complete. 10,000 or 100,000 urls to check is the rule, sometimes even more. With downloads and processing of the data, this can take a while.

    I can't run these from the office, and if I run them from home I keep checking in... which is not that relaxing at a time when I want to relax. So I often use Amazon Web Services EC2.
    For about a year I used the free tier - a great offer from Amazon to try the service with a very small instance for free. Now I am in the paid program, but the cost is really low for what I do, and after a few times it became routine to start a new instance when I need it, run some things, then terminate it.

    One of the programs I use pretty much always is tmux - 'terminal multiplexer' - which is now installed by default on Ubuntu instances. It allows you not just to run many connected terminals, but also to detach a session. That means I can start a session, run a script in it, detach, then close the terminal and the connection, and the script keeps running in the background. A few hours later I can just connect again and download the results.
    Especially in combination with parallel this is a great way to run many of the scripts from this blog - and more.
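The basic cycle looks roughly like this - the session name is arbitrary, and interactively you would detach with Ctrl-b d rather than scripting everything:

```shell
tmux new-session -d -s crawl 'sleep 30'     # start a detached session running a job
tmux has-session -t crawl && echo "crawl session is running"
# ...close the terminal, come back later, reattach with:  tmux attach -t crawl
tmux kill-session -t crawl                  # cleanup for this demo
```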

    Starting looks a bit like this (thanks to Ole Tange for the tip and the help with parallel):

    ls x* | parallel -j24 --line-buffer  " . {}  >> results.txt "

    I often just split large files, then ls or cat the split files to parallel. -j24 means 24 parallel jobs, and {} picks up each item from ls/cat.
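A minimal sketch of that split-then-parallel pattern, with made-up file names:

```shell
printf 'url%d\n' $(seq 1 10) > urls.txt   # 10 stand-in urls
split -l 3 -d urls.txt chunk_             # chunk_00 .. chunk_03, 3 lines each
ls chunk_* | wc -l
# then hand the chunk names to parallel, e.g.:
#   ls chunk_* | parallel -j24 --line-buffer " . {} >> results.txt "
```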

    Thursday, July 9, 2015

    Crawl faster with "parallel" - but how fast?

    Crawling websites - often our own - helps find technical SEO, quality, and speed issues; you'll see a lot of scripts on this blog in this regard.

    With a site of millions of pages, crawling speed is crucial, and even a smaller project easily has some thousand urls to crawl. While there are good tools to check for many things, wget and curl in combination with other command line tools give all the flexibility needed - but can be painfully slow.
    To check download speed, for example, a regular script would loop through a list of files, download one, then move to the next. Even splitting up and running the same script multiple times is not great, and quite hard to handle at scale. 

    After some searching I found GNU parallel - excellent especially for parallel tasks. Running multiple parallel processes is easy and powerful. The switch -j[number] sets the number of parallel processes. 

    But what number is best, especially for networking projects? A wget should not be a heavy computing load, even with some awk/sed/grep afterwards, because most of the time is spent waiting and transferring data. 

    So, a little script to check the time for a few different numbers of parallel processes:

    for i in {12,24,36,48,60,80,100}; do var=$( time ( TIMEFORMAT="%R" ; cat "${1}" | parallel "-j${i}" . {} ) 2>&1 1>/dev/null ) ; echo "$i parallel threads take $var seconds" ; done

    The referenced script just makes a curl call to check the http status and a wget for the header transfer time; $1 pulls in a list of full urls.
    The results differ greatly by network.
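The worker script itself is not shown above; based on the description, a hypothetical sketch (function name and output format are mine) could look like this, demonstrated on a file:// url so it runs without the network:

```shell
# hypothetical worker: print url, http status code, and time to first byte
checkurl() {
  local url=$1 status ttfb
  status=$(curl -s -o /dev/null -m 10 -w '%{http_code}' "$url")
  ttfb=$(curl -s -o /dev/null -m 10 -w '%{time_starttransfer}' "$url")
  printf '%s\t%s\t%s\n' "$url" "$status" "$ttfb"
}
# real use:  while read -r u; do checkurl "$u"; done < urllist.txt
checkurl "file:///etc/hostname"
```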

    Here a slow, public network in a cafe:

    This from a library with a pretty good connection:

    And this from a location with a 50MB/sec connection:

    Times can easily vary by a factor of 2 within each location; network connection and load still seem to make the biggest difference, as all three tests were done on the same computer.

    Wednesday, June 24, 2015

    the script: 8 different user agents and how sites deal with it

    User agent analysis script

    As mentioned in the earlier post, a script helped me grab the info for this post on how sites, and Google specifically, treat various browsers.

    While there's a lot more to analyze, much of it manually, I first wanted to see if there was any indication of differences. So for a first insight I use just a plain wc to get characters, words, and lines of the response, and it looks like there is a clear pattern. 

    So, let's take a look at the source: two nested "read" loops. The outer loops through the urls, the inner through the agents:

    #check if the url file exists
    if [[ ! -e $1 ]]; then
    echo -e "there's no file with this name"
    exit 1
    fi
    outfile="agentcheck.txt"   # output file was not defined in the original snippet; name is an example
    echo -e "agent \t url \t lines \t words \t bytes" > $outfile
    while read -r line; do
    # add a http:// to urls that don't have it
    if [[ $line == http://* ]]; then
    newline="$line"
    else
    newline="http://$line"
    fi
    #  loop through agents, then read the wc output into variables with a <<< here string
              while read -r agent; do
                   read filelines words chars <<< $(wget -O- -t 1 -T 3 --user-agent "$agent" "$newline"  2>&1| wc)
             echo -e "$agent \t $line \t $filelines \t $words \t $chars" >> $outfile
    done < $2
    done < $1
    wc -l $outfile
    The most difficult part was getting the wc output into separate variables; thanks, Stack Exchange, for the tip with the <<< here string. 
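The trick in isolation: wc prints "lines words bytes", and the here string hands that output to read, which splits it into three variables:

```shell
read filelines words chars <<< "$(printf 'a b\nc\n' | wc)"
echo "$filelines $words $chars"   # 2 lines, 3 words, 6 bytes
```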

    Thursday, June 4, 2015

    Speed: Data on top 1000 header load time vs full load time

    Lots of tools give different numbers for the speed of a site: how it is for users, over different channels and providers, including rendering time or not, including many elements or not.

    This is the 'hardest' test of all:

    • With a little script I checked the top 1000 pages from the Alexa list*. The first script tests how long it takes to get the http header back (yes, the document exists; yes, I know where it is; and it is OK). These are the blue dots.
    • The second script downloads ALL page elements of the homepage: images, scripts, stylesheets, including everything from integrated third-party tools like Ensighten, Tealeaf, Omniture, or whatever a site uses. These are the orange dots.
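The two measurements map to standard curl/wget switches roughly like this - the exact scripts are not shown above, and the demo call runs against a local file so it works offline:

```shell
url="file:///etc/hostname"   # placeholder; use a real http url for actual measurements
# 1) header test: time until the first byte of the response arrives (blue dots)
curl -s -o /dev/null -m 20 -w '%{time_starttransfer}\n' "$url"
# 2) full page test: download the page with all requisites, capped at 20 seconds (orange dots)
#    wget -q -p -T 20 -t 1 -P /tmp/fullpage "$url"
```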

    First I ran this without a time limit, and when I checked back the next day it was still running, so I set the timeout to 20 seconds.

    There seems to be a clear connection between header response time and full load time, but not so much between rank in the top 1000 by traffic and speed.

    sorted by full download time

    sorted by traffic rank

    There also seems NOT to be a clear connection between rank by traffic (x-axis) and full download time. This shows that we have a great opportunity to outperform many other companies with faster download speeds.

    * Alexa (owned by Amazon) publishes the top 1 million websites by traffic, globally, based on data from many different browser plugins plus Amazon (cloud?) services.

    Tuesday, May 26, 2015

    Google mobile friendly: Industries, site types and self centered algorithm change

    Maybe it is just me.

    Since Google started promoting everyone to make their sites mobile friendly, many pages are now a pain to read on a regular laptop or monitor.

    Example SEO Blog:

    Take the Moz blog, for example. Great on mobile, not good to read on a desktop: a HUGE image up front, pushing all content below the fold. That's just not good, not for SEO, and not for usability.

    Desktop design due to the Google push? The image is all I see on the screenshot of a full 14" laptop screen.

    Example News blog: Search Engine Land

    A similar issue. I do NOT like headlines that big, really.
    Desktop design due to the Google push? On my screenshot of a full 14" laptop screen I see a lot of headline, and... ads.

    Now, lets look at a few more.

    Amazon has a different url concept (in parts, at least). The mobile experience is not that great, and if I hit the back button, I get the desktop homepage on a phone screen; I had hoped for something different.

    Best Buy is not really different - showing off their headline :-). The responsive design seems not optimized, at least not for me.

    Dell? We have some great pages, some OK pages (many on separate urls), and some with definite room for improvement when it comes to mobile/responsive - and we have lots of teams working on improving that, both on different per-country url setups and on responsive pages. But responsive lends itself to some content, like a consulting service description, and not to other content, like picking a laptop out of a larger selection.

    Many sites need to serve mobile users much better than they currently do, and Google's push is a good reminder. But I do think they have gone a bit far, and I am not sure they are aware that other industries have different needs. Google has it easy: their content and search lend themselves to responsive design. It is not really fair to compare search and search results to news, magazines, and eCommerce sites. These all have much more complex processes users need to go through, and they have business models other than the exploitation of personal data.

    Doesn't Google's own data actually confirm that there are industries where mobile is really important and some where it is not? Here is data from the Google AdWords tool, just random terms chosen to show some variance:

    Not really a big surprise when you think about it. Even for the keyword 'search', 51% of searchers use a PC/monitor, not a phone or tablet, according to AdWords data. I understand the search algorithm change was not large, but as all of the above indicates, it should really not be more, either. Many industries don't have that much mobile share, and responsive does not work well for complex tasks. It can be made to work well with adaptive designs, or even better with separate urls and layouts, but that is slow, complex, and expensive, so it takes even more mobile share to justify the investment.

    But maybe I am totally wrong, because I just have not seen the great examples out there. Do you have an example of a site with a complex task that works really well on phone + monitor in a responsive layout?

    Thursday, April 23, 2015

    8 User agents and responses Alexa top million pages

    Some more interesting results from the test run with 8 different user agents and the return sizes of the documents from the top 1000 sites in the Alexa 1 million.

    These are the largest returns. Interesting to see which sites show up here: newspapers, stores, portals.

    Many of the smallest sites return nothing, redirect, or are under construction - actually quite a lot of the top performing pages.

    But... maybe that's just because of the unknown user agent? (or the one with wget in it?)

    Yes, a very different picture. With just the regular user agents, only a few pages don't send a page back. Some redirect, but some are just very, very small: great job!

    See Baidu in there? Now let's take another look at a few search giants. Google is amazing - that looks like the exact same page with a few lines in various languages. It is interesting how much code each line contains, though; that's huge compared to other sites.

    And while the search giants likely have all kinds of fancy (tracking?) elements, some of the search alternatives have a few more lines, but much less actual code on the page. 

    Filtering a bit further confirms that no one likes my 'andi-wget' user agent. That means future work with wget will nearly always need a different user agent!

    Check out the first post, with results on average response sizes and how Google responds.

    Monday, April 13, 2015

    Documentation - where did I store that little script?

    Goodness gracious me. I recall that I did this, but I cannot just type it again (I don't do this often), so HOW do I find that script?

    How do you find your little scripts?

    OK, good folder structure, a naming convention, all fine, but it is still hard. Is it in the sitespeed folder or the sitemap folder? The test folder, perhaps? Did I spell it this way?
    Most scripts here are for 'few-times' use, built to come up with a quick insight, a starting point, or some scaling information for a business case. The scripts are quick and manifold, with many different variations.

    Every now and then I recall a script I'd like to reuse, and then struggle to find the right one. Working across several computers is a challenge here. Git? Too big, it has had some security issues, and it is too steep a learning curve for these one-liners.

    I have used this blog as a repository, then, finding the scripts with a search, plus Google Drive for a small part of the repository. It works, but information is still missing.

    Documentation script

    But I use #comments quite a lot, even in these short bash scripts, so I will now extend that, and use this:

    find . -name "*.sh" -exec echo -e "{}\n" \; -exec  grep "^#" {} \; -exec echo -e "\n\n" \;

    This pulls the script name and path, then an empty line, then all the comments, then two empty lines to separate it from the next script. Not pretty, but it works. Now I just need to document a bit better :-)

    How do you sort, document and find your little scripts?

    Tuesday, April 7, 2015

    Including Google: 8 agents - and average response code

    Agents, not spies

    Agents, user agents, play an important role in the setup and running of sites, likely a bigger one than many developers would like. Exceptions, adjustments, fixes: it seems impossible to generate code that shows up properly in all user agents and is also W3C compliant. 

    So sites care, and likely Google cares as well. Google - like any site - offers only one interface to their information (search): source code. How the page looks and how it is rendered depends on the rendering tool, which is not provided, supported, maintained, or owned by Google, but sits on the client side. Not even the transport is provided by Google. Google only provides a stream of data via http or https, and anything after that is not their business. 

    So, my question was - do user-agents make a difference? 

    And sure, what better to use than a little script? There are so many different user agents that I wanted to keep this open to changes, so I loop over a file with user agents (one per line), as well as over a list of urls, to get the response from these sites.

    Surprising results: Bingbot and Googlebot receive the smallest files - not wget or an unknown bot. The graph overstates this, but the table shows there's a clear difference.

    Now let's take a look at the responses sent back to the little script. They do NOT like wget or andi-bot, nor Googlebot or Bingbot, although those two perform a bit better than the first two.
    There is a high focus on the regular user agents. The picture for words is the same as for lines.

    Google only provides a data stream; rendering happens on the client side. So whether I use Firefox, IE, Safari, wget, or curl is my business, and not Google's. I understand that Google does not want people to scrape large amounts of data off their site (which I don't do), but I am surprised the 'block' is not a bit more sophisticated.

    Monday, March 16, 2015

    Automate your Logfile analysis for SEO with common tools

    Fully automated log  analysis with tools many use all the time

    Surely no substitute for Splunk with its algorithms and features, but very practical, near zero cost (take that!), and highly efficient. It requires mainly free tools (thanks, Cygwin) or standard system tools (like the Windows Task Scheduler), plus a bit of trial and error. (I also use MSFT Excel, but other spreadsheet programs should work as well.)

    Analysis of large logfiles, daily

    Analyzing logfiles for bot and crawler behavior, but also to check site quality, is quite helpful. So, how to analyze our huge files? For one part of the site alone, we're talking about many GB of logs, even zipped.

    Not that hard, actually, although it took me a while to get all these steps lined up and synchronized.

    With the Windows Task Scheduler I schedule a few steps overnight:
    • Copy the last day's logfiles to a dedicated computer
    • grep the respective entries into a variety of files (all 301s, bot 301s, etc.)
    • Count the file lengths (wc -l) and append the values to a table (csv file) tracking these numbers
    • Delete the logfiles
    • Copy the resulting table and one or two of the complete files (all 404.txt) to a server, which hosts an Excel file that uses the txt file as a database and updates graphs and tables on open
    • Delete temporary files (and this way avoid the dip you see)
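A miniature of the grep-and-count step, on stand-in log lines (real entries, file names, and match patterns will differ):

```shell
cat > day.log <<'EOF'
GET /a HTTP/1.1 301 Googlebot
GET /b HTTP/1.1 404 Mozilla
GET /c HTTP/1.1 301 Mozilla
EOF
grep " 301 " day.log > all301.txt          # all redirects
grep "Googlebot" all301.txt > bot301.txt   # bot redirects only
# append one dated row to the tracking csv that the Excel file reads on open
echo "$(date +%F),$(wc -l < all301.txt),$(wc -l < bot301.txt)" >> trackingtable.csv
tail -n 1 trackingtable.csv
```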

    Now our team can quickly check whether we have an issue and need to take a closer look, or not.
    In a second step I also added all log entries resulting in a 404 to the spreadsheet on open.


    Thursday, March 5, 2015

    Who has H1?

    Do we need an H1 on our homepage?

    Sometimes it is necessary to convince PdMs, devs, and stakeholders that SEO efforts are necessary. One way to support this is to quickly run a test on competitors and/or on top pages on the web. (Yes, after all the other pros have been given.) Especially since we're running one of the top pages ourselves, that list contains powerful names. (And the first url is in my tests because I know what's happening there, not because it is in the top 1 million... yet ;-) )

    So, H1 or not?

    Out of the top 1000 pages, about
    Here is a screenshot of the top pages that have an H1 on their homepage.

    This is the script, running over the top 1000 urls from the Alexa 1 million. It is very easy to adjust for other page elements.

    echo -e "url\t has H1" > 2top1kH1.txt
    while read -r line; do
    echo "$line"
    h1yes=$(wget -qO- -w 3 -t 3 "$line" | grep -c "<h1")
    if [ "$h1yes" -gt 0 ]; then
    echo -e "$line\t yes" >> 2top1kH1.txt
    fi
    done < "$1"

    Not large, not complicated, but very convincing.

    Monday, February 16, 2015

    Google Sitespeed Score and Rank

    While Matt Cutts mentioned that site speed only affects a small percentage of sites in their rankings, running one of the largest sites makes this worth a second look. With millions of pages, speed might be especially critical for rank - or at least for indexation.

    Sitespeed is not the same as the sitespeed score, but the score is a relatively 'neutral' way to measure speed-related performance.

    What to do? First, I selected pages from across the site, some 2500 pages, to have a nice sample. Then I pulled the average rank for each of these pages from seoclarity(.net) and combined it with the speed score from my script.
    As a last step I imported this table into Excel and... scatterplot! Vertical is the ranking position, and the x-axis is the speed score. At first sight this might indicate that pages rank worse with increasing speed score, but overall I would say it means there is no correlation at all (0.09), and the variance is way too big to use this to calculate a trend. 

    Tuesday, February 3, 2015

    Page size by position in the site structure

    Large sites have specific requirements at times, for example when checking for site speed optimization.
    One factor influencing page speed, usability, and ease of crawling by search engines is total page size. With millions of pages, how can we focus our work beyond just finding examples of what works and what doesn't?

    We ran an internal crawler to identify the pages, collecting data on page size for each page. The second step was visualization, to see if there is a pattern that allows us to focus on high impact areas. 

    For a small part of the site (!), this is the plot of page weight (printed with R), based on the number of "/" (minus 3) = folder depth, adjusted for the slashes in the http:// prefix and domain. As a result, would be 0, and would be level 1. 
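The depth count itself is a one-liner: with "/" as the awk field separator, the number of slashes is NF-1, so depth = NF-1-3 = NF-4 (the urls here are made up):

```shell
printf '%s\n' "http://www.example.com/" "http://www.example.com/laptops/gaming/p.html" |
awk -F"/" '{ print NF-4, $0 }'
# depth 0 for the homepage, depth 2 for the page two folders down
```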

    Based on the graph we were able to clearly identify the area where the large pages sit, and narrow it down to one type of page. This gave us the information needed to work with Dev and Design to improve the site significantly. 

    We also pulled a range of descriptive statistics and the 'outliers', and could fix those immediately. As many will have guessed, unoptimized pictures were the issue, and easily remedied. 

    Thursday, January 22, 2015

    Alexa 1 million Top mobile performers on Google sitespeed score: perfect score 100

    Mobile SpeedScore

    Similar to the desktop numbers, here are the top performers on mobile with a speedscore of 100 (out of the top 10,000 from Alexa's top million sites).
    Checked with this script, and then just a graph in Excel to show the distribution of 100 scores by 1000s. While there seems to be a strong correlation between being in the top 1000 and sitespeed, sites with a speedscore of 100 appear more frequently in other areas, most of all among the sites ranking in the 8000s on Alexa.

    Sitelist mobile Speed Score

    No Google search, interestingly, and many other big names are missing, but it is a list of urls I am not even going to try to check the content of through our company network. Bold shows sites that have a speedscore of 100 on both desktop and mobile:

    position in Alexa top 10,000 site url speedscore
    29 100
    122 100
    571 100
    776 100
    1051 100
    1197 100
    1497 100
    1537 100
    1576 100
    1636 100
    1969 100
    2153 100
    2281 100
    2616 100
    2665 100
    2696 100
    2729 100
    2896 100
    2935 100
    2981 100
    3179 100
    3343 100
    3455 100
    3491 100
    3606 100
    3659 100
    3802 100
    4047 100
    4181 100
    4194 100
    4340 100
    4355 100
    4571 100
    4606 100
    4743 100
    4965 100
    4982 100
    5013 100
    5318 100
    5349 100
    5433 100
    6304 100
    6605 100
    6711 100
    6714 100
    6804 100
    6808 100
    7009 100
    7070 100
    7116 100
    7285 100
    7736 100
    7772 100
    8038 100
    8040 100
    8160 100
    8213 100
    8298 100
    8414 100
    8601 100
    8630 100
    8735 100
    8921 100
    8991 100
    9097 100
    9236 100
    9335 100
    9345 100
    9348 100
    9360 100
    9745 100
    9792 100
    9898 100
    9907 100

    Wednesday, January 14, 2015

    Alexa 1 million Top desktop performers on Google sitespeed score: perfect score 100

    Desktop sitespeed score

    Using the good old (or new) Alexa top 1M sites list again, this is the list of the top performers on desktop with a speedscore of 100! Checked with this script, and then just a graph in Excel to show the distribution of 100 scores by 1000s. While there seems to be a strong correlation between being in the top 1000 and sitespeed, sites with a speedscore of 100 appear more frequently in other areas, most of all among the sites ranking in the 8000s on Alexa.

    No. of sites with homepage speedscore 100 = perfect for desktop per 1000 sites from Alexa

    Sitespeed score 100 site list

    No Google search, interestingly, and many other big names are missing, but it is a list of urls I am not even going to try to check the content of through our company network.

    No. in Alexa 1 million Site URL score
    29  100
    122  100
    571  100
    776  100
    1027  100
    1051  100
    1197  100
    1497  100
    1537  100
    1576  100
    1636  100
    1969  100
    1995  100
    2281  100
    2696  100
    2896  100
    2935  100
    2981  100
    3179  100
    3343  100
    3347  100
    3491  100
    3606  100
    3802  100
    4031  100
    4047  100
    4181  100
    4194  100
    4274  100
    4355  100
    4571  100
    4606  100
    4743  100
    4982  100
    5013  100
    5318  100
    5349  100
    5433  100
    5732  100
    6304  100
    6714  100
    6804  100
    6808  100
    6964  100
    7009  100
    7070  100
    7116  100
    7285  100
    7337  100
    7382  100
    7484  100
    7736  100
    7762  100
    7962  100
    8038  100
    8040  100
    8160  100
    8213  100
    8298  100
    8414  100
    8601  100
    8630  100
    8735  100
    8890  100
    8921  100
    9097  100
    9236  100
    9335  100
    9348  100
    9745  100
    9792  100
    9898  100
    9907  100

    Monday, January 5, 2015

    Alexa top 10,000: Google sitespeed score data and graphs

    The pagespeed score correlates pretty strongly with position on the Alexa top million list, at least for the first bracket of 1000 urls compared to the next 9000.
    Generated with this script calling the Google PageSpeed API.

    Using this script to get the pagespeed scores, and then this little script to get the average per 1000 urls (filtering out the zeros):
    awk -v counter="$2" '
    { sum += $3; partsum += $3; if ($3 == 0) zerocounter += 1 }
    (NR % counter == 0) {
        print "zeroes: " zerocounter, "partial sum: " partsum, "partial avg: " partsum/(counter-zerocounter);
        partsum = 0; totalzeroes += zerocounter; zerocounter = 0; calccounter += 1
    }
    END {
        print "number of lines: " NR, "total sum: " sum, "total average: " sum/NR;
        print "number of calculations: " calccounter;
        print "lines missed: " NR % counter;
        print "lines no value: " totalzeroes
    }' "$1"