andreas.wpv: April 2015

Thursday, April 23, 2015

8 User agents and responses Alexa top million pages

Some more interesting results from the test run with 8 different user agents, looking at the size of the documents returned by the top 1000 sites in the Alexa top 1 million.

These are the largest responses - interesting to see these sites here: newspapers, stores, portals.

Many of the smallest responses come from sites that return nothing, redirect, or are under construction. That accounts for quite a lot of the top-performing pages.

But... maybe that's just because of the unknown user agent? (or the one with wget in it?)

Yes, a very different picture. With just the regular user agents, only a few sites don't send a page back. Some redirect, but some are just very, very small - great job!

See Baidu in there? Now let's take another look at a few search giants. Google is amazing - that looks like the exact same page with a few lines in various languages. It is interesting to see how much code each line contains, though - that's huge compared to other sites.

And while the search giants likely have all kinds of fancy (tracking?) elements, some of the search alternatives have a few more lines, but much less actual code on the page.

Filtering a bit further confirms it: no one likes my 'andi-wget' user agent. That means future work with wget will nearly always need a different user agent!

Check out the first post, with results on average response sizes and how Google responds.

Monday, April 13, 2015

Documentation - where did I store that little script?

Goodness gracious me. I recall that I did this, but I cannot just type it again - I don't do that often - so HOW do I find that script?

How do you find your little scripts?

Ok, good folder structure, naming convention, all fine, but it still is hard. Is it in the sitespeed folder or the sitemap folder? The test folder, perhaps? Did I spell it this way?
Most scripts here are for 'few-times' use, built to come up with a quick insight, a starting point or some scaling information to build a business case. The scripts are quick and manifold, with many different variations.

Every now and then I recall a script I'd like to re-use, and then struggle to find the right one. Working across several computers is a challenge here. Git - too big, had some security issues, and too steep a learning curve for these one-liners.

So I used this blog as a repository, finding the scripts with a site:andreas-wpv.google.com search, plus Google Drive for a small part of the repository. It works, but info is still missing.

Documentation script

But I am using #comments quite a lot even in these short bash scripts, so I will now extend that, and use this:

find . -name "*.sh" -exec echo -e "{}\n" \; -exec grep "^#" {} \; -exec echo -e "\n\n" \;

This prints the script's path and name, then an empty line, then every line starting with # (the comments, plus the shebang line), then blank lines to separate it from the next script. Not pretty, but works. Now I need to document a bit better :-)
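A slightly longer variant of the same idea, as a rough sketch only: it wraps the listing in a function so the output can be saved and searched later. The function name list_script_comments and the '==' marker are made up for illustration, and it assumes file names without newlines in them.

```shell
# Sketch only: list_script_comments and its output layout are illustrative
# choices, not part of the original one-liner.

# Print each *.sh file under the given directory (default: current one),
# followed by its lines starting with '#' and one blank separator line.
list_script_comments() {
    find "${1:-.}" -type f -name '*.sh' | sort | while IFS= read -r script; do
        printf '== %s\n' "$script"
        # grep exits non-zero when a script has no comment lines; ignore that.
        grep '^#' "$script" || true
        printf '\n'
    done
}
```

Saving the output to a file (say index.txt, a hypothetical name) and running grep -i on that file would then be one way to search across all scripts at once.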

How do you sort, document and find your little scripts?

Tuesday, April 7, 2015

Including Google: 8 agents - and average response code

Agents, not spies

Agents, user agents, play an important role in the setup and running of sites, likely more than many developers would like. Exceptions, adjustments, fixes - and it is (?) impossible to generate code that shows up properly on all user agents and is also W3C compliant.

So, sites care, and likely Google cares as well. Google - like any site - offers only one interface to its information (search): source code. How the page looks and how it is rendered is not up to Google: the tool that renders it is not provided, supported, maintained or owned by Google, but sits on the client side. Not even the transport is provided by Google. Google only provides a stream of data via http or https, and anything after that is not their business.

So, my question was - do user-agents make a difference?

And sure, what better to use than a little script? There are so many different user agents that I wanted to keep this open to changes, so I loop over a file with user agents - one per line - as well as over a list of URLs to get the response from these sites.
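The original script is not shown here, so the following is only a minimal sketch of such a double loop. The names fetch_page and measure_responses, the two input files, the wget options and the tab-separated output are all assumptions.

```shell
# Sketch only: the file layout, output columns, and helper names below are
# assumptions for illustration, not the original script.

# Fetch one URL with the given user agent and write the body to stdout.
# Redefine this function to use curl or a test stub instead of wget.
fetch_page() {
    wget --quiet --tries=1 --timeout=10 --user-agent="$1" -O - "$2"
}

# Loop over user agents (one per line) and URLs (one per line) and print
# one tab-separated row per combination: agent, url, lines, words, bytes.
# The body runs in a subshell, so its variables do not leak to the caller.
measure_responses() (
    agents_file=$1
    urls_file=$2
    while IFS= read -r agent || [ -n "$agent" ]; do
        [ -z "$agent" ] && continue
        while IFS= read -r url || [ -n "$url" ]; do
            [ -z "$url" ] && continue
            # wc prints lines, words, bytes in that order; set -- splits them.
            set -- $(fetch_page "$agent" "$url" </dev/null 2>/dev/null | wc -lwc)
            printf '%s\t%s\t%s\t%s\t%s\n' "$agent" "$url" "$1" "$2" "$3"
        done < "$urls_file"
    done < "$agents_file"
)
```

Called as measure_responses agents.txt urls.txt > results.tsv (hypothetical file names), this would give one row per user agent and URL. A failed or empty response simply shows up as a row of zeros.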

Surprising results: Bingbot and Googlebot receive the smallest files - not wget or an unknown bot. The graph overstates this, but the table shows there's a clear difference.
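For the averages behind such a table, a small awk sketch could look like this. It assumes the results sit in a tab-separated file with rows of agent, URL, lines, words, bytes; that layout and the name average_by_agent are assumptions, not the format actually used for the graph.

```shell
# Sketch only: the tab-separated layout and the function name are assumptions.

# Print the mean of a numeric column per user agent (column 1),
# smallest average first. Usage: average_by_agent results.tsv [column]
# The column defaults to 5, the bytes column in the assumed layout.
average_by_agent() {
    awk -F '\t' -v col="${2:-5}" '
        { sum[$1] += $col; n[$1]++ }
        END { for (a in sum) printf "%s\t%.1f\n", a, sum[a] / n[a] }
    ' "$1" | LC_ALL=C sort -t "$(printf '\t')" -k2,2n
}
```

Passing 3 or 4 as the second argument would average lines or words instead of bytes.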

Google.com answering

Now let's take a look at the response Google.com sent back to the little script. They do NOT like wget or andi-bot, nor Googlebot or Bingbot, although the latter two do a bit better than the first two.
High focus on the regular user agents. The picture for words is the same as for lines.

Google only provides a data stream; rendering is on the client side. So whether I use Firefox, IE, Safari, wget or curl is my business - and not Google's. I understand that Google does not want people to scrape large amounts of data off their site - which I don't do - but I am surprised the 'block' is not a bit more sophisticated.