Thursday, September 24, 2015

Google Weblight - using full content from sites on their domain

Like so many 'lite' things, this web-light version by Google has serious side effects.

Some time ago, I think it was April, Google announced that web pages that are too slow for some users would be scraped from the owner's website and served from Google's servers. The test was supposed to start in Indonesia and target 2G connections. Later, they started rolling this out for 'select countries' and slow mobile connections (2G).

The 'transcoding' is done on the fly, according to news reports, and speeds things up significantly; we have seen 10x faster loads.


Weblight - look and feel

This is how it looks: the original on the left, and the 'scraped' site on the right. Not bad, really. Users can load the original page from the top section, with a warning that it might be slow. That sounds like a very user-friendly approach to me. The navigation (nav icon in the upper left) is solid, and is not missing anything at the top level.


The biggest flaw is that it removes pretty much all third-party elements, including our tracking elements, scripts, etc. Google claims to leave up to two ads on the page and to allow simple tracking with Google Analytics.


Traffic on weblight

We (Dell.com) actually see some traffic from Google Weblight. It is minuscule, noticed only because we have been specifically looking for it, but there it is. We see some referral traffic in Adobe SiteCatalyst, and we also see the 'scraping' of pages in our logfiles.



Most of this traffic seems to come from India, and it tries to reach a variety of pages in several countries instead of just India, which might explain some of the slowness.

The scraped or 'transcoded' pages can be found in the logs with cs_User_Agent="*googleweblight*", and the pages that receive referral traffic can be found via the referrer "googleweblight.com".
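
As a rough sketch of that lookup outside the log tool, a case-insensitive grep over an exported log file finds both kinds of lines. The function names and the one-request-per-line text format are assumptions for illustration, not something taken from our setup:

```shell
#!/bin/bash
# Assumption: a plain-text access log, one request per line, with the
# user agent and the referrer somewhere on each line.

# Count the lines that mention googleweblight (user agent or referrer).
count_weblight_hits() {
    grep -ci 'googleweblight' "$1"
}

# Print the matching lines themselves for a closer look.
show_weblight_hits() {
    grep -i 'googleweblight' "$1"
}
```

Calling count_weblight_hits with the path of an exported log gives a quick number to compare against the referral traffic seen in analytics.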


How relevant are 2G networks for global ecommerce

So far, it seems super small, but the market potential is quite big. Global data on network coverage is a bit harder to come by, but there are some relevant sources.

Some excerpts from a McKinsey report:
"In developed countries and many developing nations, 2G networks are widely available; in fact, Ericsson estimates that more than 85 percent of the world’s population is covered by a 2G signal. Germany, Italy, and Spain boast 2G networks that reach 100 percent of the population, while the United States, Sri Lanka, Egypt, Turkey, Thailand, and Bangladesh have each attained 2G coverage for more than 98 percent of the population. Some developing markets don’t fare as well: as of 2012, 2G network coverage extended to 90 percent of the population of India, 55 percent of Ethiopia, 80 percent of Tanzania, and just under 60 percent of Colombia. Growing demand and accelerated rates of smartphone adoption in many markets have spurred mobile network operators to invest in 3G networks.
Ericsson estimates that 60 percent of the world population now lives within coverage of a 3G network. The level of 3G infrastructure by country reveals a stark contrast between countries with robust 3G networks and extensive coverage, such as the United States (95 percent), Western European nations (ranging from 88 to 98 percent), and Vietnam (94 percent), and many developing markets such as India, which are still in the early stages of deploying 3G networks."
(highlights by me)


The graph from the same publication shows significant 2G/3G only coverage even for the US.
The ICT figures report for 2015 has slightly more optimistic statistics; look at the penetration rate for the US or Norway, for example. Gizmodo has a map and discussion of US internet speeds.


How to see your site on Google weblight

For this blog, it is: googleweblight.com/?lite_url=https://andreas-wpv.blogspot.com . Don't forget to use some kind of mobile emulator to see how it really looks! Only some resolutions are supported; one is the standard iPhone 4 at 320 x 480.
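
To try this for other sites, the URL pattern above can be wrapped in a small helper. This is only a sketch: the function name is made up, and the http:// scheme in front of googleweblight.com is my assumption, since the address above is written without one.

```shell
#!/bin/bash
# Build the weblight address for a given page URL.
# Assumption: the service answers on http://googleweblight.com/ and takes
# the target page in the lite_url parameter, as in the example above.
weblight_url() {
    printf 'http://googleweblight.com/?lite_url=%s\n' "$1"
}

# Example call (prints the address for this blog):
# weblight_url 'https://andreas-wpv.blogspot.com'
```

Paste the printed address into a browser with mobile emulation switched on to see the transcoded version.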

Google original:


Is this legal? How about copyright?

Honestly - I don't want to go there, I am no lawyer, and there are some complex questions arising - internally. 
Google claims that companies gain so much traffic that it is really in their best interest. For every site that does not want this treatment, there is a way to opt out, described on the Google help pages, which also gives some insight into what's removed. Some more tech details are in this article: mostly compression, removal of third-party elements, and reduction of design elements.

Google calls this 'transcoding'. To me it looks like scraping and serving from their website, something that might be considered a copyright violation, and would likely be against Google's 'terms of service' (highlights by me):
"Do not misuse our Services, for example, do not interfere with our Services or try to access them using a method other than the interface and the instructions that we provide. You may use our Services only as permitted by law, including applicable export and control laws and regulations. We may suspend or stop providing our Services to you if you do not comply with our terms or policies or if we are investigating suspected misconduct."

Additionally, wouldn't AdWords ads on a weblight page violate the AdWords terms of service?
"Content that is replicated from another source without adding value in the form of original content or additional functionality
  • Examples: Mirroring, framing, or scraping content from another source"

As I understand it (and perhaps I misunderstand), this means that Google AdWords cannot be used on weblight pages.

    More generally, is this scraping OK for Google to do?

    Sept. 25, 2015: From a glance at the detailed data transfer (Chrome, Fiddler), it seems to be more a filtering of content through some kind of proxy. Working on it.

    Resources:

    In addition to the links above, here are a few articles I found on this topic, plus Google sources:

    http://gadgets.ndtv.com/internet/news/google-india-to-offer-faster-access-to-mobile-webpages-for-android-users-702302
    http://www.unrevealtech.com/2015/07/how-to-prevent-site-loading-google-weblight.html
    http://www.androidauthority.com/google-web-light-looks-616450/
    http://digitalperiod.com/google-web-light/
    https://support.google.com/webmasters/answer/6211428?hl=en

    Tuesday, September 15, 2015

    Data analysis preparation analysis script

    Finally got to working on this. I am working with larger files, and one of my current fun projects is to find out which URLs have been visited by Google, out of all the URLs we have live.

    While working on a small DB to do this, I downloaded some files from Splunk, then imported them. And sure enough, I realized there are things I need to filter out beforehand, or the DB becomes absolutely unwieldy.
    The files are not huge, but large enough, with 10+ million lines, so I want to use command line tools and not redo a script for every number I need, looping repeatedly over the same file.
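
For the URL comparison itself, a single awk pass over two lists is one possible command line approach. This is a sketch under assumptions, not the DB workflow described above: both files are taken to hold one URL per line, and the function name is made up for illustration.

```shell
#!/bin/bash
# Print every live URL that also appears in the list of crawled URLs.
# $1 = URLs Google requested (from the logs), $2 = all live URLs.
# Assumption: one URL per line in both files, written the same way.
crawled_live_urls() {
    # First file: remember each URL. Second file: print the URLs seen before.
    awk 'NR==FNR { seen[$0]=1; next } ($0 in seen)' "$1" "$2"
}
```

The first list is held in memory, so the smaller of the two files should normally go first if only the overlap matters.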

    The script shows a count of

    • number of fields
    • total number of lines
    • empty lines
    • non-empty lines. 
    Then it pulls the full data for the 

    • shortest line
    • the longest line
    • the first 4 lines
    • last 4 lines
    • middle 3 lines of the file.

    This is test data output from the script; I cannot show real data. It works well, at least with a few fields. Running it on a 17-million-line, one-field list takes ~32 seconds, which is pretty good, I think.



    I highlight the descriptions in green, as you might see; otherwise the output is kind of hard to read. Also, awk changed the numbers for the line counters to scientific notation, so I needed to use int(variable) to reset them to integers and be able to concatenate them into a range for the sed system call. Note that this is written for comma-separated files and needs to be adjusted if the delimiter is different.
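
The scientific notation issue can be reproduced in isolation. The sketch below (the function name is hypothetical) builds the same kind of "start,end" range twice, once from the raw values and once through int(). My understanding is that awk converts non-integer numbers to strings with its default CONVFMT of %.6g, which is where the exponent form comes from for large line counts.

```shell
#!/bin/bash
# Build a "start,end" range around the middle of a file with $1 lines,
# first without int() and then with it.
middle_range() {
    awk -v n="$1" 'BEGIN {
        s = (n/2) - 1; e = (n/2) + 1
        print s "," e              # raw values, may come out in exponent form
        print int(s) "," int(e)    # integers, safe to hand to sed -n
    }'
}
```

With an odd count such as 17000001, only the second line is a range that sed can use.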


    #!/bin/bash
    # usage: pass the comma-separated file to inspect as the first argument
    
    # this shows all field counts and how often each occurs
    echo -e "\033[32;1m\n\nnumber of lines : number of fields\033[0m" ; awk -F',' ' {print NF} ' "$1" | sort -n | uniq -c 
    
    # this shows the shortest / longest lines, the line counts and the middle 3 lines
    
    awk ' BEGIN {FS=","} (NF<=shortest || shortest=="") {shortest=NF; shortestline=$0} 
    (longest<=NF) {longest=NF; longestline=$0} 
    (!NF) {emptylines+=1} 
    (NF) {nonemptylines+=1}
    (maxcount<NR) {maxcount=NR}
    END { middlestart=int((maxcount/2)-1);
    middleend=int((maxcount/2)+1);
    # sed rejects line address 0, so clamp the start for very short files
    if (middlestart<1) middlestart=1;
    range=middlestart","middleend;
    print "\033[32;1m\ntotal number of lines is:\n\t\033[0m" NR,
    "\033[32;1m\n shortest is:\t\t\033[0m" shortest, 
    "\033[32;1m\n longest is:\t\t\033[0m" longest, 
    "\033[32;1m\n shortestline is:\t\t\033[0m" shortestline, 
    "\033[32;1m\n longestline is:\t\t\033[0m" longestline,
    "\033[32;1m\n number empty lines is: \t\t\033[0m" (emptylines+0),
    "\033[32;1m\n number of non-empty lines is:\t\t\033[0m" (nonemptylines+0);
    
    print "\033[32;1m\n\nrange is   \033[0m" range, "\033[1;32m\nFILENAME IS \033[0m" FILENAME, "\033[32;1m\nnow the middle 3 lines of the file: \n\033[0m";
    # quote the file name so that paths with spaces survive the shell call
    system("sed -n " range "p \"" FILENAME "\"")
    
    } ' "$1"
    
    echo -e "\033[32;1m\n\ntop 4 lines\n\033[0m"
    head -n 4 "$1"
    echo -e "\033[32;1m\n\nlast 4 lines\n\033[0m"
    tail -n 4 "$1"
    echo -e "\n"
    
