andreas.wpv: analysis

Showing posts with label analysis. Show all posts

Tuesday, September 15, 2015

Data analysis preparation analysis script

Finally got to working on this. I am working with larger files and one of my current fun projects is to find out which urls have been visited by Google, out of all the urls we have live.

While working on a small DB to do this I downloaded some files from splunk, then imported them. And sure enough realized there are things I need to filter out before, or the DB becomes absolutely unwieldy.
The files are not huge, but large enough, with 10+ million lines, so I want to use command line tools and not redo a script for every number I need, looping repeatedly over the same file.

It shows count of

number of fields
total number of lines
empty lines
non-empty lines.

Then it pulls the full data for the

shortest line
the longest line
the first 4 lines
last 4 lines
middle 3 lines of the file.

This is testdata output from the script - cannot show real data. It works well - at least with a few fields. To run it on a 17 million lines one field list takes ~ 32 seconds, that's pretty good I think.

I highlight the description in green as you might see - otherwise it is kind of hard to read. Also - awk changed the numbers for the line counters to scientific notation, so I needed to use int(variable) to reset it to integer and be able to concatenate it into a range for the sed - system call. Ah, this is with comma separated files and needs to be adjusted if that's different.

#!bash

#this shows all lengths and how often
echo -e "\033[32;1m\n\nnumber of lines : number of fields\033[0m" ; awk ' {print NF} ' "$1" | sort | uniq -c 

#this shows only number of shortest / longest lines

awk ' BEGIN {FS=","} (NF<=shortest || shortest=="") {shortest=NF; shortestline=$0} 
(longest<=NF) {longest=NF; longestline=$0} 
(!NF) {emptylines+=1} 
(NF) {nonemptylines+=1}
(maxcount<NR) {maxcount=NR}
END { middlestart=(maxcount/2)-1;
middleend=(maxcount/2)+1;
range=int(middlestart)","int(middleend);
print "\033[32;1m\ntotal number of lines is:\n\t\033[0m" NR,
"\033[32;1m\n shortest is:\t\t\033[0m" shortest, 
"\033[32;1m\n longest is:\t\t\033[0m" longest, 
"\033[32;1m\n shortestline is:\t\t\033[0m" shortestline, 
"\033[32;1m\n longestline is:\t\t\033[0m" longestline,
"\033[32;1m\n number empty lines is: \t\t\033[0m" emptylines,
"\033[32;1m\n number of non-empty lines is:\t\t\033[0m" nonemptylines;

print "\033[32;1m\n\nrange is   \033[0m" range, "\033[1;32m\nFILENAME IS \033[0m" FILENAME, "\033[32;1m\nnow the middle 3 lines of the file: \n\033[0m";
system("sed -n "range"p " FILENAME)

} ' $1

echo -e "\033[32;1m\n\ntop 4 lines\n\033[0m"
head -n 4 "$1"
echo -e "\033[32;1m\n\nlast 4 lines\n\033[0m"
tail -n 4 "$1"
echo -e "\n"

Monday, March 16, 2015

Automate your Logfile analysis for SEO with common tools

Fully automated log analysis with tools many use all the time

Surely no substitute for splunk and its algorithms and features, but very practical, near zero cost (take that!) and high efficiency. Requires mainly free tools (thanks cygwin) or standard system tools (like wiindows task scheduler), plus a bit of trial and error. (I also use MSFT Excel, but other spreadsheet programs should work as well).

Analysis of large logfiles, daily

Analyzing logfiles for bot and crawler behavior, but also to check for site quality is quite helpful. So, how to analyze our huge files? For a part of the site, we're talking about many GB of logs, even zipped.

Not that hard, actually, although it took me a while to get all these steps lined up and synchronized.

With the windows task manager I schedule a few steps over night:

copy last days logfiles on a dedicated computer
grep the respective entries in a variety of files (all 301, bot 301, etc.)
Then count the file lenghts (wc -l ) and append the values to a table (csv file) tracking these numbers
Delete logfiles
The resulting table and one or two of the complete files (all 404.txt) are copied to a server, which hosts an Excel file with uses the txt file as database, and updates graphs and tables on open.
delete temporary files (and this way avoid the dip you see)

Now our team can go quickly check if we have an issue up, and need to take a closer look, or not.
In a second step I also added all log entries resulting in a 404 into the spreadsheet on open.

.

Tuesday, December 30, 2014

Logfile analysis for SEO: visualization of bot visits

The bot visits are filtered out of logfiles and sorted / counted as shown here. This here now filters for certain bots (out of the hundreds visiting the site) and makes a small graph out of it.

This is the resulting table - easy to adjust to the bots of interest based on earlier research. (These are not real numbers, but just fillers to show how it looks like).

This is the visualization in excel of the table with real numbers. Yandex is not in here, because they had so many visits that they dwarfed all the other bots counters. Each color stands for a different bot.

This is the filter to get the table - which can be easily adapted to show more / less or different search engine crawler data:

echo -e 'google\tbing\tyandex\tbaidu\tapple\tlinux\ttrident' > bots-table.txt
for i in $(ls | grep "[[:digit:]]-[[:digit:]]"); do
google=$( cat $i/bots-traffic.txt | awk 'BEGIN { FS = " " } $13~/oogle/ { print $ 13 } ' | wc -l ) ;
bing=$( cat $i/bots-traffic.txt | awk 'BEGIN { FS = " " } $13~/bing/ { print $ 13 } ' | wc -l ) ;
yandex=$(cat $i/bots-traffic.txt | awk 'BEGIN { FS = " " } $13~/yandex/ { print $ 13 } ' | wc -l ) ;
baidu=$(cat $i/bots-traffic.txt | awk 'BEGIN { FS = " " } $13~/baidu/ { print $ 13 } ' | wc -l ) ;
apple=$(cat $i/bots-traffic.txt | awk 'BEGIN { FS = " " } $13~/Apple/ { print $ 13 } ' | wc -l ) ;
linux=$( cat $i/bots-traffic.txt | awk 'BEGIN { FS = " " } $13~/Linux/ { print $ 13 } ' | wc -l ) ;
trident=$(cat $i/bots-traffic.txt | awk 'BEGIN { FS = " " } $13~/Trident/ { print $ 13 } ' | wc -l ) ;
echo -e "$i\t$google\t$bing\t$yandex\t$baidu\t$apple\t$linux\t$trident" >> bots-table.txt
done

Old version - with 'wrong' results, needed to show date from folder to make sure they are aligned

echo -e 'google\tbing\tyandex\tbaidu\tapple\tlinux\ttrident' > bots-table.txt
google=$(for i in $(ls | grep "[[:digit:]]-[[:digit:]]"); do cat $i/bots-traffic.txt | awk 'BEGIN { FS = " " } $13~/oogle/ { print $ 13 } ' | wc -l ; done)
bing=$(for i in $(ls | grep "[[:digit:]]-[[:digit:]]"); do cat $i/bots-traffic.txt | awk 'BEGIN { FS = " " } $13~/bing/ { print $ 13 } ' | wc -l ; done)
yandex=$(for i in $(ls | grep "[[:digit:]]-[[:digit:]]"); do cat $i/bots-traffic.txt | awk 'BEGIN { FS = " " } $13~/yandex/ { print $ 13 } ' | wc -l ; done)
baidu=$(for i in $(ls | grep "[[:digit:]]-[[:digit:]]"); do cat $i/bots-traffic.txt | awk 'BEGIN { FS = " " } $13~/baidu/ { print $ 13 } ' | wc -l ; done)
apple=$(for i in $(ls | grep "[[:digit:]]-[[:digit:]]"); do cat $i/bots-traffic.txt | awk 'BEGIN { FS = " " } $13~/Apple/ { print $ 13 } ' | wc -l ; done)
linux=$(for i in $(ls | grep "[[:digit:]]-[[:digit:]]"); do cat $i/bots-traffic.txt | awk 'BEGIN { FS = " " } $13~/Linux/ { print $ 13 } ' | wc -l ; done)
trident=$(for i in $(ls | grep "[[:digit:]]-[[:digit:]]"); do cat $i/bots-traffic.txt | awk 'BEGIN { FS = " " } $13~/Trident/ { print $ 13 } ' | wc -l ; done)
paste <(echo "$google") <(echo "$bing") <( echo "$yandex") <(echo "$baidu") <(echo "$apple") <(echo "$linux") <(echo "$trident") >> bots-table.txt

Wednesday, November 19, 2014

Logfile Analysis for SEO: Get status codes

Ever wondered if bots use cookies? Or if there is a relation from 302's to search engine bot visits? Total sum of 4xx errors as a trend? Our current analytics setup does not show this data, log file analysis, so I wrote a few brief scripts, pulled the zipped logfiles from servers to a local folder, and then analyzed (scripts at the end of this post).

Next step I added the data to an xls and calculate the percentages, like 302s as a share of overall traffic or how many bot visits use a cookie (nearly all bots visits are cookied visits!).

Visualization of one time frame worth of http status codes

And sure easy to show trends of status codes, for example share of bots visits to overall visits and the number of 302's on a site.

And here are the scripts. If you wonder why writing the results into a file, and not just count, I run more analysis on these resulting files.

1. Get all lines with a certain status code not 200 OK:

zcat *.zip | grep " 301 " > all-301.txt
zcat *.zip | grep " 302 " > all-302.txt
zcat *.zip | grep " 304 " > all-304.txt
zcat *.zip | grep " 403 " > all-403.txt
zcat *.zip | grep " 404 " > all-404.txt
zcat *.zip | grep " 410 " > all-410.txt
zcat *.zip | grep " 500 " > all-500.txt
zcat *.zip | grep "^2014" | wc -l > logfile-results.txt

2. Error codes, redirects encountered by bots. First, filter out all lines with bots, then get the status codes lines in separate files.

zcat *.zip | grep "bot" > bots-traffic.txt
grep " 301 " bots-traffic.txt > bots-301.txt
grep " 302 " bots-traffic.txt > bots-302.txt
grep " 304 " bots-traffic.txt > bots-304.txt
grep " 403 " bots-traffic.txt > bots-403.txt
grep " 404 " bots-traffic.txt > bots-404.txt
grep " 410 " bots-traffic.txt > bots-410.txt
grep " 500 " bots-traffic.txt > bots-500.txt

3. Pull out all the visits with cookies, in this case only the cookies themselves. In these logfiles they are in field $15:

zcat *.zip | awk ' BEGIN { FS = " " } ; NR > 5 { print $15} ' > all-cookies.txt
awk ' BEGIN { FS = " " } ; NR > 5 { print $15} ' bots-traffic.txt > bots-cookies.txt
awk ' BEGIN { FS = " " } ; NR > 5 { print $15} ' all-500.txt > all-500-cookies.txt
awk ' BEGIN { FS = " " } ; NR > 5 { print $15} ' bots-500.txt > bots-500-cookies.txt

4. This pulls all urls with an 500 error code into files - to have Dev look at these:

awk 'BEGIN {FS = " "; OFS = "" } ($8 == "-") {print "www.dell.com"$7 } ($8 != "-" ){ print "www.dell.com"$7,$8} ' all-500.txt > all-500-urls.txt
awk 'BEGIN {FS = " "; OFS = "" } ($8 == "-") {print "www.dell.com"$7 } ($8 != "-" ){ print "www.dell.com"$7,$8} ' bots-500.txt > bots-500-urls.txt

It's filtered, so if field 8 is just a hyphen, it prints just the url stem, otherwise stem and pagename.

Tuesday, October 21, 2014

awk: count items in n number of lines (including percentile calculations)

Using several scripts to check a number of sites on features, I needed a way to calculate the sum of occurrences of yes / no and such alike - but not just across the whole file, but for each x number of lines, each 100 lines, 1000 lines or so.

This is based on a much simpler version to calculate just the totals for a whole file, not n-numbers. The script is called with input file as first parameter and number of lines to summarize as parameter two: . runscript.sh filewithdata.txt 100. Filewithdata.txt is stored in $1, 100 in $2. Then the $2 is handed to awk with the -v counter=$2.
First step is to set variables, then count each field up with the 'if' a value occurs.
Third step happens when the number of lines is reached (NR-1) to account for the header. Values are printed, added to the totalvalue variables, reset to zero.
In the END section there are two elements - first to print the totals, then also to show if there are 'lines left' due to an the aggregation over x lines. If there are, the number of lines and counting values are printed.

awk -v counter="${2}" 'BEGIN {counturl = 0; counttitle = 0; countpub = 0; countschema = 0 ; printcounter = 0 }

{counturl++ } ( $2 == "yes" ) { counttitle++ } ( $3 == "yes" ) { countpub++ } ( $4 == "yes" ) {countschema++ }

(NR-1)%counter==0 { print "number of lines: "NR-1, "og:title: "counttitle, "rel publisher: "countpub, "schema: "countschema;
titletotal+=counttitle ; totalpub+=countpub ; totalschema+=countschema;
counturl=0; counttitle = 0; countpub = 0; countschema = 0 ; printcounter++ }

END { print "\nnumber of calculations: " printcounter; print "urls: " NR-1, "all og:title: " titletotal, "rel pub total: " totalpub, "all schema: " totalschema;
if(counturl) print "lines left: " counturl; if(counttitle) print "left og:title : " counttitle ; if(counttitle) print "left rel pub: " counttitle; if(countpub) print "left schema: " countschema ; print "\n" }' $1

Apart from the usual man pages, this post on stackoverflow was very helpful especially for the 'left' calculation in the END section.

Friday, September 5, 2014

Backlink Analysis - Tools Insights: Compare backlink domains from Google, Majestic Seo and Bing

Do 3rd party link tools show the same backlinks like Google does?

As described in the last post - there is some small overlap between domains linking to support.dell.com from either Open Site Explorer and Google Webmaster tools.

How many domains are the same from each tool?

Bing webmaster tools is showing 31 domains that link to support.dell.com that Google Webmaster tools show, and Majesticseo is showing 0 ( and remember - Moz / open site explorer showed 43 domains).

Two annotations:

Last year it was the opposite, Moz zero, Majestic some, Bing most overlap (nearly 5%).
Although Majestic seems not on spot, we used Majestic data + GWT (Google Webmaster tools) data to remove spam links and penalty - successfully. While I would expect the tools to show all links they possibly see (or even estimate like Majestic) - Google hides most backlink data. That might lead to a much larger overlap between the tools and GWT - we just don't see it.

The scripts are here - and because of the setup, the only line that needs to change to compare other tables is highlighted below:

create temporary table if not exists url_T (select mid(Source_URL, (locate("//", Source_URL)+2)) as fullurl from support_moz_t );
create temporary table if not exists url_T (select mid(Source_URL, (locate("//", Source_URL)+2)) as fullurl from supportdellcom_backlinks_t );
create temporary table if not exists url_T (select mid(Source_URL, (locate("//", Source_URL)+2)) as fullurl from supportdellcom_bing_links_urls_t );

Yeah, I had to drop the tables, but prefer to do that rather than to overwrite - I find it easier to remember.

(Ping me and I'll share the scripts. )

Monday, August 25, 2014

Backlink Analysis - Tools Insights: Generating database and tables, data import from Google Webmaster tools and open site explorer

Many SEO professionals are checking backlinks to their sites - including us. We use a variety of tools mainly out of necessity to get as much info as possible rather than relying on just one tool.

Last year I compared backlinks from Google webmaster tools, Moz (opensiteexplorer), Bing webmaster tools and Majesticseo, and found nearly zero overlap between the links shown from each tool. I used awk and excel last year - this time it's mysql, as I use the data in these tables for other purpose as well.

Opensiteexplorer finds a few backlinks in sync with Google Webmaster tool data

Now - with mysql - I found 44 lines with overlap - for Opensiteexplorer backlink information to the top 1000 backlink domains Google is giving us for support.dell.com. Not a whole lot.

I'd love to hear some comments on this!

Description of scripts, tables to compare backlink data

No complicated scripts, just generating the db and tables was not that easy, I have to admit.

Biggest challenge was to import the various formats in which all the data comes from the different sources, meaning lots of manual adjusting. It is also pretty annoying that companies like Google use the comma separator in large numbers (for 1,000) - but I found this post on stackoverflow that helped me deal with it. Basically the data for that field is read into a variable, and then with a 'set' while importing the value comes from that variable with a 'replace' of the comma.

File names and folders are removed, but we have some super long urls on our site - so some fields are loooooooong, you might want to reduce that if you have shorter urls and lots of pages.

Generate tables with the top external links from Google Webmaster tools and Bing Webmaster tools:

create table if not exists supportdellcom_gwt_sourcelinksdomains_T (
id int auto_increment not null primary key,
Source_URL varchar(2000) not null,
Link_Number int(15),
Linked_Pages int(5),
datasource varchar(10)
);
load data local infile 'path/supportdellcom_google_LinkingDomains.csv' into table supportdellcom_gwt_sourcelinksdomains_T
fields terminated by ',' optionally enclosed by '"' lines terminated by '\n' ignore 1 lines
(
Source_URL,
@var1,
@var2
)
set Link_Number = replace(@var1, ",","") , Linked_Pages = replace(@var2, ",","") , datasource = "Google";

create table if not exists supportdellcom_bing_links_urls_T (
id int auto_increment not null primary key,
Target_URL varchar(2000) not null,
source_URL varchar(2000),
Anchor varchar(2000),
datasource varchar(10)
);

load data local infile 'c:/Users/Andreas_Voss/Documents/linkbuilding/2014/data/bing-support-euro-dell-com-sitelinks_6_20_2014.csv' into table supportdellcom_bing_links_urls_T
fields terminated by ',' optionally enclosed by '"' lines terminated by '\n' ignore 1 lines
(
Target_URL,
source_URL,
Anchor
)
set datasource = "bing";

And this is the actual comparison of the backlinks given by Google vs. Bing:

create temporary table if not exists url_T (select mid(Source_URL, (locate("//", Source_URL)+2)) as fullurl from support_moz_t );
create temporary table if not exists fulldomain_T (select count(*) as count, left(fullurl,(locate("/",fullurl)-1)) as domain from url_T group by domain);
select * from fulldomain_T order by count desc;

create temporary table if not exists compare_T (select fulldomain_T.domain, fulldomain_T.count as fulldomaincount, supportdellcom_gwt_sourcelinksdomains_t.Source_URL, supportdellcom_gwt_sourcelinksdomains_t.Link_Number ,
supportdellcom_gwt_sourcelinksdomains_t.Linked_Pages from supportdellcom_gwt_sourcelinksdomains_t join fulldomain_T on supportdellcom_gwt_sourcelinksdomains_t.Source_URL
like concat('%',fulldomain_T.domain) group by Source_URL);

This next bit is necessary because domains can end the same, but still not be equal. Still need to find a better way (perhaps with a regex instead of the 'like' - but for now it works)

select * from compare_T where domain = Source_URL;

Why not use something like 'where Source_URL in (select Source_URL other table)? Reason is... that takes forever, kind of.