Tuesday, December 30, 2014

Logfile analysis for SEO: visualization of bot visits

The bot visits are filtered out of the logfiles and sorted / counted as shown here. This post now filters for certain bots (out of the hundreds visiting the site) and turns the counts into a small graph.

This is the resulting table - easy to adjust to the bots of interest based on earlier research. (These are not real numbers, just fillers to show what it looks like.)
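For illustration, the table comes out in this shape (again, filler numbers only, not real data):

date     google  bing  yandex  baidu  apple  linux  trident
2014-10  1200    310   5800    150    90     60     40
2014-11  1340    330   6100    170    95     70     45
2014-12  1290    320   5900    160    85     65     50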


This is the visualization in Excel of the table with real numbers. Yandex is not in here, because it had so many visits that it dwarfed all the other bots' counters. Each color stands for a different bot.
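The graph itself was built in Excel; if you prefer to stay on the command line, a gnuplot sketch along these lines should work as well - gnuplot and the bots-table.txt built below are assumed, and to mirror the chart you would drop the yandex column from the table first:

gnuplot <<'EOF'
set datafile separator "\t"
set key autotitle columnhead
set style data linespoints
set terminal png size 900,500
set output 'bots-graph.png'
# one line per bot; column 1 (the date from the folder name) labels the x axis
plot for [col=2:8] 'bots-table.txt' using 0:col:xtic(1)
EOF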



This is the filter that builds the table - easily adapted to show more, fewer or different search engine crawlers:
# header row - the first column carries the date taken from the folder name
echo -e 'date\tgoogle\tbing\tyandex\tbaidu\tapple\tlinux\ttrident' > bots-table.txt
# loop over the date-named folders (a glob is safer than parsing ls output)
for i in *[[:digit:]]-[[:digit:]]*; do
# count the lines whose user agent (field 13) matches each bot
google=$(awk '$13 ~ /oogle/' "$i/bots-traffic.txt" | wc -l)
bing=$(awk '$13 ~ /bing/' "$i/bots-traffic.txt" | wc -l)
yandex=$(awk '$13 ~ /yandex/' "$i/bots-traffic.txt" | wc -l)
baidu=$(awk '$13 ~ /baidu/' "$i/bots-traffic.txt" | wc -l)
apple=$(awk '$13 ~ /Apple/' "$i/bots-traffic.txt" | wc -l)
linux=$(awk '$13 ~ /Linux/' "$i/bots-traffic.txt" | wc -l)
trident=$(awk '$13 ~ /Trident/' "$i/bots-traffic.txt" | wc -l)
echo -e "$i\t$google\t$bing\t$yandex\t$baidu\t$apple\t$linux\t$trident" >> bots-table.txt
done
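To sanity-check the result before importing it into Excel, the tab-separated file can be viewed aligned on the command line:

column -t -s $'\t' bots-table.txt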

Old version - with 'wrong' results: each bot gets its own loop and the columns are pasted together afterwards, with no date column to make sure the rows stay aligned - hence the rewrite above.

# old approach, kept for reference: one loop per bot, columns pasted together
# at the end - without a date column there is no guarantee the rows line up
echo -e 'google\tbing\tyandex\tbaidu\tapple\tlinux\ttrident' > bots-table.txt
google=$(for i in *[[:digit:]]-[[:digit:]]*; do awk '$13 ~ /oogle/' "$i/bots-traffic.txt" | wc -l; done)
bing=$(for i in *[[:digit:]]-[[:digit:]]*; do awk '$13 ~ /bing/' "$i/bots-traffic.txt" | wc -l; done)
yandex=$(for i in *[[:digit:]]-[[:digit:]]*; do awk '$13 ~ /yandex/' "$i/bots-traffic.txt" | wc -l; done)
baidu=$(for i in *[[:digit:]]-[[:digit:]]*; do awk '$13 ~ /baidu/' "$i/bots-traffic.txt" | wc -l; done)
apple=$(for i in *[[:digit:]]-[[:digit:]]*; do awk '$13 ~ /Apple/' "$i/bots-traffic.txt" | wc -l; done)
linux=$(for i in *[[:digit:]]-[[:digit:]]*; do awk '$13 ~ /Linux/' "$i/bots-traffic.txt" | wc -l; done)
trident=$(for i in *[[:digit:]]-[[:digit:]]*; do awk '$13 ~ /Trident/' "$i/bots-traffic.txt" | wc -l; done)
paste <(echo "$google") <(echo "$bing") <(echo "$yandex") <(echo "$baidu") <(echo "$apple") <(echo "$linux") <(echo "$trident") >> bots-table.txt




Thursday, December 18, 2014

Google sitespeed score - script to tap the API, run on the Alexa top 10,000


Pagespeed - Score

Even if it were not relevant for large-site indexation, site speed still has a huge impact on traffic, bounce rate and CE. Google offers a great tool to get their feedback - the sitespeed score. It does not show the actual speed, but how your page is built compared to a page ideally built for speed. So it shows the potential to improve.

As input I use the top 10,000 from the Alexa top million pages - using the homepage only. (I actually split it in 10 and ran the script 10x in parallel; see the sketch after the script below.) The process is relatively slow, as Google checks the pages and gives much more detail back than just the score - which I filter out, below. Just fill in your API key (after the = sign) and feed your sitemap (just URLs) to the script. Below is for mobile sitespeed; for desktop just swap mobile with desktop. The allowance is currently 25k API calls a day for free - plenty for most sites or projects.

Mobile sitespeed score scatterplot of the Alexa top 10,000 sites' homepages

Then I just cleaned it out, set all garbage values to zero (a sketch for that step follows the list), and built a scatterplot:
  • few sites at a sitespeed score of 100
  • a relatively sparsely populated area between 80 and 100
  • the bulk of sites between 40 and 80
  • scores of 40 and lower being less frequent as well.
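This is a minimal sketch of that cleaning step, assuming the output file from the script below (the file name carries the script's random suffix, so adjust it):

# keep the header, zero out any score that is not a plain number
awk 'BEGIN { FS = OFS = "\t" } NR == 1 { print; next } $3 !~ /^[0-9]+$/ { $3 = 0 } { print }' mobile-speedtest-12345.txt > speedtest-clean.txt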




Speedtest score script:

And this is the script:
# output file with a random suffix so parallel runs do not collide
filename=mobile-speedtest-$RANDOM.txt
echo -e 'rank\turl\tspeed-score' > "${filename}"
# input is the Alexa CSV: rank,domain - one pair per line
IFS="," ; while read -r counter line; do
# query the PageSpeed API and pull the overall score out of the JSON
score=$(curl -s -m 30 -f --retry 1 --proto =https --proto-redir =https "https://www.googleapis.com/pagespeedonline/v1/runPagespeed?url=http://www.${line}&strategy=mobile&key=--- your api key ---" | sed -n '/score[^,]*,/p' | sed -e 's/\"//g' -e 's/,//g' -e 's/score: //g')
echo -e "$counter\t$line\t$score" >> "${filename}"
done < "$1"
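And this is roughly how the split-in-10, run-in-parallel step mentioned above looks - GNU split is assumed, and speedtest.sh and alexa-top10k.csv are assumed names for the script and the input:

# split the Alexa CSV into ten chunks without breaking lines
split -d -n l/10 alexa-top10k.csv part-
# start one background run per chunk; each writes its own randomly named output
for f in part-*; do
bash speedtest.sh "$f" &
done
wait
# merge the results (each chunk file still carries its own header row)
cat mobile-speedtest-*.txt > mobile-speedtest-all.txt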

Tuesday, December 9, 2014

Logfile analysis for SEO: Which bots come how often?

Which bots are crawling your site?

Which bots are visiting your site? Googlebot, bingbot and yandex? You might be surprised by the number and variety.


Script to identify crawlers

Just filter the logfile we get from IIS with grep -i 'bot', write the user agent - field 13 in this logfile - into a separate file, and then sort and count the occurrences of each agent.
# pull the user-agent field (13) from every request mentioning 'bot'
grep -i 'bot' logfile | awk '{ print $13 }' >> bots-names.txt
# count how often each agent occurs, most frequent first
sort bots-names.txt | uniq -c | sort -k 1nr > bots-which-counter.txt
rm bots-names.txt
This gives me a nice list of bots and how many requests each sent during the period covered by the logfile. An interesting list, with lots of bots I would not have expected, like mail.RU and 'linux'.
In another post I share a table of how often bots come over time - I pick the most relevant bots from the list above (plus whatever brings us traffic).

Top bots and crawlers visiting certain parts of www.dell.com:



I cut off the numbers (the count in the first column), so this is just the sorted list of the top few visiting crawlers / bots.
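That cut is a one-liner against the counter file from above:

awk '{ print $2 }' bots-which-counter.txt | head -20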