Tuesday, November 25, 2014

Pagespeed - pages crawled to page download time

Pagespeed and indexation


At least at first sight, there seems to be a correlation between page load time and the number of pages indexed.

Google seems to maintain a roughly constant crawl 'budget' of n GB per day (apart from a few spikes we see), and when our page load time goes up, the number of crawled pages goes down, which suggests a connection.

Slower pages:




Fewer pages crawled:


Wednesday, November 19, 2014

Logfile Analysis for SEO: Get status codes


Ever wondered if bots use cookies? Or whether there is a relationship between 302s and search engine bot visits? Or what the total number of 4xx errors looks like as a trend? Our current analytics setup does not show this data, but log file analysis does, so I wrote a few brief scripts, pulled the zipped logfiles from the servers to a local folder, and analyzed them (scripts at the end of this post).

As a next step, I added the data to an xls and calculated the percentages, like 302s as a share of overall traffic or how many bot visits use a cookie (nearly all bot visits are cookied visits!).
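The same shares can also be computed directly on the command line instead of in the xls. A minimal sketch, assuming the result files produced by the scripts at the end of this post are already in place:

# 302s as a share of overall traffic, in percent
echo "scale=2; 100 * $(wc -l < all-302.txt) / $(cat logfile-results.txt)" | bc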



Visualization of one time frame's worth of HTTP status codes



It is also easy to show trends of status codes over time, for example the share of bot visits relative to overall visits, or the number of 302s on a site.



And here are the scripts. If you wonder why I write the results into files instead of just counting: I run more analysis on these resulting files.

1. Get all lines with a status code other than 200 OK:

# one file per status code - the surrounding spaces avoid matching the digits inside longer strings
zcat *.zip | grep " 301 " > all-301.txt
zcat *.zip | grep " 302 " > all-302.txt
zcat *.zip | grep " 304 " > all-304.txt
zcat *.zip | grep " 403 " > all-403.txt
zcat *.zip | grep " 404 " > all-404.txt
zcat *.zip | grep " 410 " > all-410.txt
zcat *.zip | grep " 500 " > all-500.txt
# total number of log lines - the data lines all start with the date
zcat *.zip | grep "^2014" | wc -l > logfile-results.txt

2. Error codes and redirects encountered by bots. First, filter out all lines from bots, then write the status code lines into separate files.

# "bot" matches Googlebot, bingbot and most other crawler user agents
zcat *.zip | grep "bot" > bots-traffic.txt
grep " 301 " bots-traffic.txt > bots-301.txt
grep " 302 " bots-traffic.txt > bots-302.txt
grep " 304 " bots-traffic.txt > bots-304.txt
grep " 403 " bots-traffic.txt > bots-403.txt
grep " 404 " bots-traffic.txt > bots-404.txt
grep " 410 " bots-traffic.txt > bots-410.txt
grep " 500 " bots-traffic.txt > bots-500.txt

3. Pull out the cookies for all visits - in this case only the cookie values themselves. In these logfiles they are in field $15:

# NR > 5 skips the first lines of the stream (the logfile header)
zcat *.zip | awk ' BEGIN { FS = " " } ; NR > 5 { print $15} ' > all-cookies.txt
awk ' BEGIN { FS = " " } ; NR > 5 { print $15} ' bots-traffic.txt > bots-cookies.txt
awk ' BEGIN { FS = " " } ; NR > 5 { print $15} ' all-500.txt > all-500-cookies.txt
awk ' BEGIN { FS = " " } ; NR > 5 { print $15} ' bots-500.txt > bots-500-cookies.txt
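To put a number on the "nearly all bot visits are cookied" observation, a small sketch - assuming a lone '-' in the cookie field means no cookie was sent:

# share of bot visits that carry a cookie, in percent
echo "scale=2; 100 * $(grep -vc '^-$' bots-cookies.txt) / $(wc -l < bots-cookies.txt)" | bc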

4. This pulls all URLs with a 500 error code into files - to have Dev look at these:

awk 'BEGIN {FS = " "; OFS = ""  } ($8 == "-") {print "www.dell.com"$7 } ($8 != "-" ){ print "www.dell.com"$7,$8} ' all-500.txt > all-500-urls.txt
awk 'BEGIN {FS = " "; OFS = ""  } ($8 == "-") {print "www.dell.com"$7 } ($8 != "-" ){ print "www.dell.com"$7,$8} ' bots-500.txt > bots-500-urls.txt

The print is conditional: if field 8 is just a hyphen, only the URL stem is printed; otherwise the stem plus the pagename.
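Since the same URL can show up many times in these lists, it can help to deduplicate them before handing them over - a one-line sketch per file:

# unique 500-error URLs, one per line
sort -u all-500-urls.txt > all-500-urls-unique.txt
sort -u bots-500-urls.txt > bots-500-urls-unique.txt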

Tuesday, November 11, 2014

Alexa top 10,000 and rel publisher, opengraph tags, and schema

Alexa.com might not be high in their own rankings, but they frequently provide a very, very handy list: the top 1 million domains by estimated traffic rank.

The accuracy of this data is beyond my knowledge - only Alexa has detailed insight - but I would think that the list is at least directionally right, and it is rare in the way it is offered (Quantcast offers something similar). Thanks, Alexa!

I wondered how widely used the tags are that we (SEO at Dell) consider important, or at least interesting.
I wrote two small scripts, one to test for the implementation and one to calculate the counts; a sketch of this kind of check is included below, and here are the results.

Tested for 3 elements:

  1. og:title - checks for the implementation of anything related to opengraph.org tags
  2. rel_publisher - checks if the tag is implemented on the homepage, as recommended by Google
  3. schema - checks for the implementation of anything related to schema.org, as per the search engine recommendations

All items are tested only on the homepages of the 10,000 domains! If a domain did not answer - some are not set up properly and do not forward to a default subdomain like www - all values are set to 'no' by default. I manually checked the first 400 URLs, and this happened twice.
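The original scripts are not included here, but a rough sketch of this kind of check could look like the following - the input file alexa-10k.txt (one domain per line), the output file name and the exact grep patterns are placeholders, not the actual implementation:

# check each homepage for og:title, rel publisher and schema.org markup;
# domains that do not answer keep the default "no" for all three columns
while read -r domain; do
  html=$(curl -L -s -m 15 "http://$domain/")
  og="no"; pub="no"; schema="no"
  if [ -n "$html" ]; then
    echo "$html" | grep -qi 'og:title'        && og="yes"
    echo "$html" | grep -qi 'rel="publisher"' && pub="yes"
    echo "$html" | grep -qi 'schema\.org'     && schema="yes"
  fi
  echo "$domain,$og,$pub,$schema"
done < alexa-10k.txt > tag-check-results.csv

# count the 'yes' values per column, e.g. column 2 = og:title
cut -d, -f2 tag-check-results.csv | grep -c yes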

First observation - these tags are FAR from everywhere. 

Opengraph is on ~ 26% of homepages, rel publisher on 17% and schema only on 10%.
(First column = line counts)


The second observation is interesting, though the correlation seems relatively weak:

The implementation of opengraph, rel publisher and schema happens slightly more often on the higher-ranking homepages. This is most pronounced for schema and least for opengraph, which has the highest overall adoption rate.



Is this compelling enough to implement these tags on your site?
