Showing posts with label server. Show all posts
Showing posts with label server. Show all posts

Monday, March 16, 2015

Automate your Logfile analysis for SEO with common tools

Fully automated log  analysis with tools many use all the time

Surely no substitute for splunk and its algorithms and features, but very practical, near zero cost (take that!)  and high efficiency. Requires mainly free tools (thanks cygwin) or standard system tools (like wiindows task scheduler), plus a bit of trial and error.  (I also use MSFT Excel, but other spreadsheet programs should work as well).






Analysis of large logfiles, daily

Analyzing logfiles for bot and crawler behavior, but also to check for site quality is quite helpful. So, how to analyze our huge files? For a part of the site, we're talking about many GB of logs, even zipped.

Not that hard, actually, although it took me a while to get all these steps lined up and synchronized.

With the windows task manager I schedule a few steps over night:
  • copy last days logfiles on a dedicated computer
  • grep the respective entries in a variety of files (all 301, bot 301, etc.)
  • Then count the file lenghts (wc -l ) and append the values to a table (csv file) tracking these numbers
  • Delete logfiles
  • The resulting table and one or two of the complete files (all 404.txt) are copied to a server, which hosts an Excel file with uses the txt file as database, and updates graphs and tables on open.
  • delete temporary files (and this way avoid the dip you see)

Now our team can go quickly check if we have an issue up, and need to take a closer look, or not.
In a second step I also added all log entries resulting in a 404 into the spreadsheet on open.

.

Tuesday, December 9, 2014

Logfile analysis for SEO: Which bots come how often?

Which bots crawling your site?

Which bots are visiting your site? Googlebot, bingbot and yandex? You might be surprised by the number and variety.


Script to identify crawlers

Just filtering the logfile we get from IIS with grep -i 'bot', and then writing the agent - in this logfile in position 13 - into a separate file, and then just sort, count occurrence of each.
grep -i 'bot' logfile | awk 'BEGIN { FS = " " } { print $ 13 } ' >> bots-names.txt
sort bots-names.txt | uniq -c | sort -k 1nr > bots-which-counter.txt
rm bots-names.txt
This gives me a nice list of bots, and how many requests they sent in the time of the logfile. Interesting list, lots from bots I would not have expected, like mail.RU and 'linux'.
Another post I share a table how often bots come over time - and I pick the most relevant bots with this above list (plus on what brings us traffic). 

Top bots and crawlers visiting certain parts of www.dell.com:



I cut off the numbers (count, first column) and this is just sorted the top few visiting crawlers / bots. 

Monday, September 23, 2013

Check url for http response codes with curl

A little linux helper to check the status of urls in a sitemap, based on the server response code.

Currently redirects are said to be not good for Bing ranking, neutral for Google. We want to rank in both, so we don't want 300s, and sure no 400s or 500s - the error response codes.

For this example from work I use curl, easy and fast, and  "support.dell.com".

curl -D - support.dell.com 

-D - means dump header into file - meaning stdout.
LONG result, but direction is correct.
Now add a -o /dev/null, meaning move content into output /dev/null.

curl -D - support.dell.com -o /dev/null
Still too long, but getting closer.
So I'll use sed to print just the header response, based on the regex HTTP:
curl -D - support.dell.com -o /dev/null | sed -n '/HTTP/p'

STILL not there. Adding -s to curl to silence the speed-o-meter gives me:

curl -s -D - support.dell.com -o /dev/null | sed -n '/HTTP/p'

results in: HTTP/1.1 200 OK

Got it!
It sure has limitations, this is not going to help identify server level rewrites or reverse proxy redirects without intermediate non 200 http response, nor is it going to identify a http-refresh. I still find it pretty helpful. The first is still good to submit to the Search Engines, and the second is rare, fortunately, at least where I work.

This again is patched together from a variety of sources, including stackoverflow, a sed post by Eric Pemment and little bits from Andrew Cowie (yep, that's about apis, but still helped): thanks everyone!

Thanks Andy, this is a great addition you suggest in the comments to add the L to follow redirects! I would then extend the sed to get this:
curl -s -L -D - www.dell.com/support/ -o /dev/null | sed -n '/HTTP\|Location/p' 
follows redirects, and with the extended sed we see the url and the http response like this:

HTTP/1.1 302 Found Location: http://www.dell.com/support/home/us/en/19?c=us&l=en&s=dhs
HTTP/1.1 301 Moved Permanently Location: /support/my-support
HTTP/1.1 301 Moved Permanently Location: /support/my-support/us/en/19
HTTP/1.1 200 OK

Bookmark and Share