andreas.wpv: logfile


Tuesday, September 15, 2015

Data analysis preparation analysis script

Finally got to working on this. I am working with larger files and one of my current fun projects is to find out which urls have been visited by Google, out of all the urls we have live.

While working on a small DB to do this, I downloaded some files from Splunk and imported them. Sure enough, I realized there are things I need to filter out first, or the DB becomes absolutely unwieldy.
The files are not huge, but large enough, with 10+ million lines, so I want to use command line tools and not redo a script for every number I need, looping repeatedly over the same file.

The script shows a count of

number of fields
total number of lines
empty lines
non-empty lines.

Then it pulls the full data for the

shortest line
the longest line
the first 4 lines
last 4 lines
middle 3 lines of the file.

This is test data output from the script; I cannot show real data. It works well, at least with a few fields. Running it on a 17-million-line, one-field list takes ~32 seconds, which is pretty good I think.

I highlight the descriptions in green, as you might see; otherwise the output is kind of hard to read. Also, awk changed the numbers for the line counters to scientific notation, so I needed to use int(variable) to turn them back into integers and be able to concatenate them into a range for the sed system call. Note that this is written for comma-separated files and needs to be adjusted for a different separator.

#!/bin/bash

#this shows all lengths and how often
echo -e "\033[32;1m\n\nnumber of lines : number of fields\033[0m" ; awk -F',' '{print NF}' "$1" | sort -n | uniq -c

#this shows only number of shortest / longest lines

awk ' BEGIN {FS=","} (NF<=shortest || shortest=="") {shortest=NF; shortestline=$0} 
(longest<=NF) {longest=NF; longestline=$0} 
(!NF) {emptylines+=1} 
(NF) {nonemptylines+=1}
(maxcount<NR) {maxcount=NR}
END { middlestart=(maxcount/2)-1;
middleend=(maxcount/2)+1;
range=int(middlestart)","int(middleend);
print "\033[32;1m\ntotal number of lines is:\n\t\033[0m" NR,
"\033[32;1m\n shortest is:\t\t\033[0m" shortest, 
"\033[32;1m\n longest is:\t\t\033[0m" longest, 
"\033[32;1m\n shortestline is:\t\t\033[0m" shortestline, 
"\033[32;1m\n longestline is:\t\t\033[0m" longestline,
"\033[32;1m\n number empty lines is: \t\t\033[0m" emptylines,
"\033[32;1m\n number of non-empty lines is:\t\t\033[0m" nonemptylines;

print "\033[32;1m\n\nrange is   \033[0m" range, "\033[1;32m\nFILENAME IS \033[0m" FILENAME, "\033[32;1m\nnow the middle 3 lines of the file: \n\033[0m";
system("sed -n " range "p \"" FILENAME "\"")

} ' "$1"

echo -e "\033[32;1m\n\ntop 4 lines\n\033[0m"
head -n 4 "$1"
echo -e "\033[32;1m\n\nlast 4 lines\n\033[0m"
tail -n 4 "$1"
echo -e "\n"
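The scientific-notation issue can be looked at in isolation. When awk turns a non-integer number into a string, it uses its default conversion format (%.6g), so a value like 8499999.5 ends up as 8.5e+06 and is useless as a sed line range. The following is a minimal sketch, not part of the script above; the function name middle_range is made up for illustration. It uses int() as in the script and additionally formats with printf "%d".

```shell
# middle_range N: hypothetical helper that prints the sed range
# for the middle 3 lines of a file with N lines, e.g. "4,6" for N=10.
middle_range() {
    awk -v n="$1" 'BEGIN {
        start = int((n / 2) - 1)   # int() drops the .5 that appears for odd n
        end   = int((n / 2) + 1)
        # %d always prints a plain integer, never scientific notation
        printf "%d,%d\n", start, end
    }'
}

# Example with an odd line count of about 17 million:
middle_range 17000001
```

It could then be combined with sed along the lines of sed -n "$(middle_range "$(wc -l < file.csv)")p" file.csv, where file.csv is a placeholder name.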

Monday, March 16, 2015

Automate your Logfile analysis for SEO with common tools

Fully automated log analysis with tools many use all the time

Surely no substitute for Splunk and its algorithms and features, but very practical, near zero cost (take that!) and highly efficient. It requires mainly free tools (thanks, Cygwin) or standard system tools (like the Windows Task Scheduler), plus a bit of trial and error. (I also use Microsoft Excel, but other spreadsheet programs should work as well.)

Analysis of large logfiles, daily

Analyzing logfiles for bot and crawler behavior, and also to check site quality, is quite helpful. So, how to analyze our huge files? For a part of the site, we're talking about many GB of logs, even zipped.

Not that hard, actually, although it took me a while to get all these steps lined up and synchronized.

With the Windows Task Scheduler I schedule a few steps overnight:

Copy the previous day's logfiles to a dedicated computer
grep the respective entries into a variety of files (all 301, bot 301, etc.)
Count the file lengths (wc -l) and append the values to a table (csv file) tracking these numbers
Delete the logfiles
Copy the resulting table and one or two of the complete files (all 404.txt) to a server, which hosts an Excel file that uses the txt file as a database and updates graphs and tables on open
Delete temporary files (and this way avoid the dip you see)
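The counting step in this list could look roughly like the following. This is a sketch under assumptions, not the scheduled script itself: the function name append_counts, the choice of files and the csv layout (date first, then one count per file) are all made up for illustration.

```shell
# append_counts DIR CSV: hypothetical helper.
# Counts the lines of a few grep result files in DIR and appends
# one row "date,count301,count302,count404,count500" to CSV.
append_counts() {
    local dir="$1" csv="$2" row f
    row="$(date +%F)"                       # e.g. 2015-03-16
    for f in all-301.txt all-302.txt all-404.txt all-500.txt; do
        # wc -l < file prints only the number; tr removes padding some systems add
        row="$row,$(wc -l < "$dir/$f" | tr -d ' ')"
    done
    echo "$row" >> "$csv"                   # append, so the table grows daily
}

# Usage (placeholder paths): append_counts /cygdrive/c/logs/results /cygdrive/c/logs/tracking.csv
```

Appending one row per night keeps the csv small and gives the spreadsheet a simple table to chart.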

Now our team can quickly check whether we have an issue and need to take a closer look, or not.
In a second step I also added all log entries resulting in a 404 into the spreadsheet on open.


Tuesday, December 9, 2014

Logfile analysis for SEO: Which bots come how often?

Which bots are crawling your site?

Which bots are visiting your site? Googlebot, bingbot and yandex? You might be surprised by the number and variety.

Script to identify crawlers

I just filter the logfile we get from IIS with grep -i 'bot', write the user agent (field 13 in this logfile) into a separate file, then sort and count the occurrences of each.

grep -i 'bot' logfile | awk 'BEGIN { FS = " " } { print $13 }' >> bots-names.txt

sort bots-names.txt | uniq -c | sort -k 1nr > bots-which-counter.txt
rm bots-names.txt

This gives me a nice list of bots, and how many requests they sent in the time of the logfile. Interesting list, lots from bots I would not have expected, like mail.RU and 'linux'.
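If the intermediate file is not needed for anything else, the same count can be done in one pipeline. A minimal sketch, assuming as above that the user agent sits in field 13; the function name count_bots is made up for illustration.

```shell
# count_bots LOGFILE: hypothetical helper.
# Prints "count agent" lines, most frequent agent first.
count_bots() {
    grep -i 'bot' "$1" |      # keep only lines mentioning "bot" (any case)
        awk '{ print $13 }' | # field 13 = user agent in this logfile layout
        sort |                # uniq -c needs sorted input
        uniq -c |             # collapse duplicates and prefix the count
        sort -k1,1nr          # sort by count, numerically, descending
}

# Usage: count_bots logfile > bots-which-counter.txt
```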

In another post I share a table of how often bots come over time; I pick the most relevant bots using the list above (plus which ones bring us traffic).

Top bots and crawlers visiting certain parts of www.dell.com:

I cut off the numbers (count, first column); this shows just the top few visiting crawlers / bots, sorted.

Wednesday, November 19, 2014

Logfile Analysis for SEO: Get status codes

Ever wondered if bots use cookies? Or if there is a relation between 302s and search engine bot visits? Or what the total sum of 4xx errors looks like as a trend? Our current analytics setup does not show this data, but log file analysis does, so I wrote a few brief scripts, pulled the zipped logfiles from the servers to a local folder, and then analyzed them (scripts at the end of this post).

Next, I added the data to an xls and calculated the percentages, like 302s as a share of overall traffic or how many bot visits use a cookie (nearly all bot visits are cookied visits!).
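The percentage step can also be done on the command line instead of the spreadsheet. A minimal sketch; the function name share is made up for illustration, and the two inputs are simply line counts such as those produced by wc -l.

```shell
# share PART TOTAL: hypothetical helper.
# Prints PART as a percentage of TOTAL with one decimal place.
share() {
    awk -v part="$1" -v total="$2" 'BEGIN {
        if (total == 0) { print "n/a"; exit }   # avoid division by zero
        printf "%.1f%%\n", 100 * part / total
    }'
}

# Usage with the files from the scripts in this post:
#   share "$(wc -l < all-302.txt)" "$(cat logfile-results.txt)"
```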

Visualization of one time frame worth of http status codes

And it is easy to show trends of status codes, for example the share of bot visits in overall visits and the number of 302s on a site.

And here are the scripts. If you wonder why I write the results into files instead of just counting: I run more analysis on these resulting files.

1. Get all lines with a certain status code other than 200 OK:

zcat *.zip | grep " 301 " > all-301.txt
zcat *.zip | grep " 302 " > all-302.txt
zcat *.zip | grep " 304 " > all-304.txt
zcat *.zip | grep " 403 " > all-403.txt
zcat *.zip | grep " 404 " > all-404.txt
zcat *.zip | grep " 410 " > all-410.txt
zcat *.zip | grep " 500 " > all-500.txt
zcat *.zip | grep "^2014" | wc -l > logfile-results.txt

2. Error codes and redirects encountered by bots. First, filter all lines with bots into one file, then write the lines for each status code into separate files.

zcat *.zip | grep "bot" > bots-traffic.txt
grep " 301 " bots-traffic.txt > bots-301.txt
grep " 302 " bots-traffic.txt > bots-302.txt
grep " 304 " bots-traffic.txt > bots-304.txt
grep " 403 " bots-traffic.txt > bots-403.txt
grep " 404 " bots-traffic.txt > bots-404.txt
grep " 410 " bots-traffic.txt > bots-410.txt
grep " 500 " bots-traffic.txt > bots-500.txt
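The repeated grep lines can also be written as a loop over the status codes. A minimal sketch that does the same filtering; the function name split_by_status and its arguments are made up for illustration.

```shell
# split_by_status SRC PREFIX: hypothetical helper.
# Writes all lines of SRC containing " CODE " to PREFIX-CODE.txt,
# once for each status code in the list.
split_by_status() {
    local src="$1" prefix="$2" code
    for code in 301 302 304 403 404 410 500; do
        # grep exits non-zero when nothing matches; that is fine here,
        # the result file is then simply empty
        grep " $code " "$src" > "$prefix-$code.txt" || true
    done
}

# Usage: split_by_status bots-traffic.txt bots
```

Adding another status code then only means adding it to the list.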

3. Pull out all the visits with cookies, in this case only the cookies themselves. In these logfiles they are in field $15:

zcat *.zip | awk ' BEGIN { FS = " " } ; NR > 5 { print $15} ' > all-cookies.txt
awk ' BEGIN { FS = " " } ; NR > 5 { print $15} ' bots-traffic.txt > bots-cookies.txt
awk ' BEGIN { FS = " " } ; NR > 5 { print $15} ' all-500.txt > all-500-cookies.txt
awk ' BEGIN { FS = " " } ; NR > 5 { print $15} ' bots-500.txt > bots-500-cookies.txt
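To get from these cookie files to a number of cookied visits, the lines that actually contain a cookie need to be counted. A minimal sketch, assuming that a visit without a cookie shows up as a single hyphen (or an empty line) in the cookie field; check this against your own logfiles. The function name count_cookied is made up for illustration.

```shell
# count_cookied FILE: hypothetical helper.
# FILE holds one cookie field per line, as in all-cookies.txt.
# Counts the lines that are neither "-" nor empty.
count_cookied() {
    awk '$0 != "-" && $0 != "" { n++ } END { print n + 0 }' "$1"
}

# Usage: count_cookied bots-cookies.txt
```

The n + 0 in the END block makes awk print 0 instead of an empty line when no cookie is found.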

4. This pulls all urls with a 500 error code into files, to have Dev look at these:

awk 'BEGIN {FS = " "; OFS = "" } ($8 == "-") {print "www.dell.com"$7 } ($8 != "-" ){ print "www.dell.com"$7,$8} ' all-500.txt > all-500-urls.txt
awk 'BEGIN {FS = " "; OFS = "" } ($8 == "-") {print "www.dell.com"$7 } ($8 != "-" ){ print "www.dell.com"$7,$8} ' bots-500.txt > bots-500-urls.txt

It's filtered, so if field 8 is just a hyphen, it prints just the url stem, otherwise stem and pagename.