Wednesday, November 19, 2014

Logfile Analysis for SEO: Get status codes


Ever wondered if bots use cookies? Or whether there is a relationship between 302s and search engine bot visits? Or what the total number of 4xx errors looks like as a trend? Our current analytics setup does not show this data, so I turned to log file analysis: I wrote a few short scripts, pulled the zipped logfiles from the servers to a local folder, and then analyzed them (scripts at the end of this post).

As a next step I added the data to an xls and calculated the percentages, like 302s as a share of overall traffic or how many bot visits carry a cookie (nearly all bot visits are cookied visits!).
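For instance, here is a minimal sketch of the 302-share calculation, assuming the result files produced by the scripts below sit in the current directory:

# Share of 302 responses in overall traffic (sketch; file names as in the scripts below)
total=$(cat logfile-results.txt)   # total request count from step 1
r302=$(wc -l < all-302.txt)        # number of 302 lines
awk -v a="$r302" -v t="$total" 'BEGIN { printf "302 share: %.2f%%\n", 100 * a / t }'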



Visualization of one time frame's worth of HTTP status codes

And it is easy to show trends of status codes over time, for example the share of bot visits in overall visits, or the number of 302s on a site.



And here are the scripts. If you wonder why I write the results into files rather than just counting, it's because I run further analysis on these resulting files.

1. Get all lines with a status code other than 200 OK; the last line counts all requests (every log line starts with the year) to get the total:

zcat *.zip | grep " 301 " > all-301.txt
zcat *.zip | grep " 302 " > all-302.txt
zcat *.zip | grep " 304 " > all-304.txt
zcat *.zip | grep " 403 " > all-403.txt
zcat *.zip | grep " 404 " > all-404.txt
zcat *.zip | grep " 410 " > all-410.txt
zcat *.zip | grep " 500 " > all-500.txt
zcat *.zip | grep "^2014" | wc -l > logfile-results.txt
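To get the raw counts for the spreadsheet, a quick line count per result file is enough; just a sketch:

# One line per match, so the line count is the number of responses per status code
wc -l all-*.txt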

2. Error codes and redirects encountered by bots. First, pull all lines with bot visits into one file, then split the status code lines into separate files.

zcat *.zip | grep "bot" > bots-traffic.txt
grep " 301 " bots-traffic.txt > bots-301.txt
grep " 302 " bots-traffic.txt > bots-302.txt
grep " 304 " bots-traffic.txt > bots-304.txt
grep " 403 " bots-traffic.txt > bots-403.txt
grep " 404 " bots-traffic.txt > bots-404.txt
grep " 410 " bots-traffic.txt > bots-410.txt
grep " 500 " bots-traffic.txt > bots-500.txt
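A quick way to put the bot numbers in relation to overall traffic, again just a sketch using the files from the steps above:

# Share of bot visits in overall traffic
bots=$(wc -l < bots-traffic.txt)   # all bot lines
total=$(cat logfile-results.txt)   # total request count from step 1
awk -v b="$bots" -v t="$total" 'BEGIN { printf "bot share: %.2f%%\n", 100 * b / t }'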

3. Pull out the cookie field from all visits, in this case only the cookie values themselves. In these logfiles they are in field $15 (the NR > 5 skips the first five lines, i.e. the log header):

zcat *.zip | awk ' BEGIN { FS = " " } ; NR > 5 { print $15} ' > all-cookies.txt
awk ' BEGIN { FS = " " } ; NR > 5 { print $15} ' bots-traffic.txt > bots-cookies.txt
awk ' BEGIN { FS = " " } ; NR > 5 { print $15} ' all-500.txt > all-500-cookies.txt
awk ' BEGIN { FS = " " } ; NR > 5 { print $15} ' bots-500.txt > bots-500-cookies.txt
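To put a number on the "nearly all bot visits are cookied" observation, something like this works, assuming the log writes a plain hyphen in the cookie field when no cookie is set:

# Count cookied bot visits; a lone "-" means no cookie (assumption about this log format)
cookied=$(grep -cv '^-$' bots-cookies.txt)
total=$(wc -l < bots-cookies.txt)
awk -v c="$cookied" -v t="$total" 'BEGIN { printf "cookied bot visits: %.2f%%\n", 100 * c / t }'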

4. This pulls all urls with an 500 error code into files - to have Dev look at these:

awk 'BEGIN {FS = " "; OFS = ""  } ($8 == "-") {print "www.dell.com"$7 } ($8 != "-" ){ print "www.dell.com"$7,$8} ' all-500.txt > all-500-urls.txt
awk 'BEGIN {FS = " "; OFS = ""  } ($8 == "-") {print "www.dell.com"$7 } ($8 != "-" ){ print "www.dell.com"$7,$8} ' bots-500.txt > bots-500-urls.txt

It's filtered so that if field 8 is just a hyphen, it prints only the URL stem; otherwise it prints the stem plus the page name.
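Before handing the list to Dev, it helps to dedupe and rank the URLs by frequency; a sketch:

# Dedupe the 500-error URLs and rank by number of hits, most frequent first
sort all-500-urls.txt | uniq -c | sort -rn > all-500-urls-ranked.txt
head all-500-urls-ranked.txt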
