Showing posts with label quality. Show all posts

Wednesday, July 22, 2015

Parallel on AWS

Remember the post on how many parallel (simple, wget) processes can be run reasonably on a machine?

This is how it looks on AWS (Amazon Web Services) EC2:


I added lower and higher numbers after a bit of testing, and this is fast - likely because of a really good internet connection. (I might need to try something to test the processing itself without the network dependency.)

Many of the scripts - even when run with parallel - still take hours or days to complete. Checking 10,000 or 100,000 URLs is the rule, sometimes even more. With downloads and processing of the data, this can take a while.

I can't run these from the office, and if I run them from home I keep checking on them, which is not that relaxing at a time when I want to relax. So I often use Amazon Web Services EC2.
For about a year I used the free tier - a great offer from Amazon to try the service with a very small instance for free. Now I am in the paid program, but the cost is really low for what I do, and after a few times it became routine to start a new instance when I need it, run some things, then terminate it.

One of the programs I pretty much always use is tmux - 'terminal multiplexer' - which is now installed by default on Ubuntu instances. It not only allows running many terminals in one connection, but also detaching a session. That means I can start a session, run a script in it, detach, and then close the terminal and the connection while the script keeps running in the background. A few hours later I can just connect again and download the results.
Especially in combination with parallel this is a great way to run many of the scripts from this blog - and more.

Starting looks a bit like this (thanks to Ole Tange for the tip and the help with parallel):

ls x* | parallel -j24 --line-buffer  " . script.sh {}  >> results.txt "

I often just split large files, then ls or cat the split files to parallel. -j24 runs up to 24 jobs at the same time (these are separate processes, not threads), and {} is replaced with each line of input coming from ls/cat.
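
A minimal sketch of the split-then-parallel workflow described above. The names urls.txt, chunk_ and check.sh are hypothetical placeholders, and the second function assumes GNU parallel is installed:

```shell
#!/usr/bin/env bash
# Sketch: split a large input file into chunks, then process the chunks in parallel.
# urls.txt, chunk_ and check.sh are hypothetical names, not taken from the post.

# Split the input into pieces of N lines each: chunk_aa, chunk_ab, ...
split_urls() {
  local input=$1 lines_per_chunk=${2:-1000}
  split -l "$lines_per_chunk" "$input" chunk_
}

# Run check.sh once per chunk, with at most $1 jobs at a time (needs GNU parallel).
# --line-buffer passes output on line by line, so lines from different jobs
# can interleave in results.txt but each line stays whole.
run_chunks() {
  local jobs=${1:-24}
  parallel -j "$jobs" --line-buffer './check.sh {}' ::: chunk_* >> results.txt
}

# Usage:
#   split_urls urls.txt 1000
#   run_chunks 24
```

Started inside a tmux session (for example tmux new -s checks), the run keeps going after detaching with Ctrl-b d, and tmux attach -t checks reconnects to it later. The session name checks is only an example.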

Thursday, June 4, 2015

Speed: Data on top 1000 header load time vs full load time

Different tools report different numbers for the speed of a site, depending on how they model the user experience, which channels and providers they measure over, and whether they include rendering time and all page elements or not.

This is the 'hardest' test of all:

  • With a little script I checked the top 1000 pages from the Alexa list*. The first script tests how long it takes to get the HTTP header back (yes, the document exists, and yes, I know where it is and it is OK). These are the blue dots.
  • The second script downloads ALL page elements of the homepage, including images, scripts, stylesheets, also from integrated third party tools like enlighten, tealeaf, omniture, or whatever a site uses. These are the orange dots.

First I ran this without a time limit, and when I checked back the next day, it was still running, so I set the timeout to 20 seconds.

There seems to be a clear connection between header response time and full time, not so much between rank in the top 1000 by traffic and speed.

sorted by full download time




sorted by traffic rank

There also seems to be no clear connection between rank by traffic (x-axis) and full download time. This suggests that we have a great opportunity to outperform many other companies with faster download speeds.



* Alexa (owned by Amazon) publishes the top 1 million websites by traffic, globally, based on data from many different browser plugins plus from Amazon (cloud?) services.

Monday, March 16, 2015

Automate your Logfile analysis for SEO with common tools

Fully automated log analysis with tools many use all the time

Surely no substitute for Splunk and its algorithms and features, but very practical, near zero cost (take that!) and highly efficient. It requires mainly free tools (thanks, Cygwin) or standard system tools (like the Windows Task Scheduler), plus a bit of trial and error. (I also use MSFT Excel, but other spreadsheet programs should work as well.)






Analysis of large logfiles, daily

Analyzing logfiles is quite helpful for understanding bot and crawler behavior, but also for checking site quality. So, how to analyze our huge files? For just one part of the site, we're talking about many GB of logs, even zipped.

Not that hard, actually, although it took me a while to get all these steps lined up and synchronized.

With the Windows Task Scheduler I schedule a few steps overnight:
  • Copy the previous day's logfiles to a dedicated computer
  • grep the respective entries into a variety of files (all 301, bot 301, etc.)
  • Count the file lengths (wc -l) and append the values to a table (csv file) tracking these numbers
  • Delete the logfiles
  • Copy the resulting table and one or two of the complete files (all 404.txt) to a server, which hosts an Excel file that uses the txt file as its database and updates graphs and tables on open
  • Delete temporary files (and this way avoid the dip you see)
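
The grep-and-count steps in the list above could be sketched roughly like this. The file names and the CSV columns are illustrative assumptions, and the sketch assumes the status code is the 9th whitespace-separated field, as in the common/combined log format. It uses awk on that field rather than a plain grep, so that a "301" elsewhere in a line is not counted:

```shell
#!/usr/bin/env bash
# Sketch of the nightly filter-and-count step.
# Assumptions (not from the post): status code in field 9, output file names,
# and the column layout of the tracking table.

append_daily_counts() {
  local logfile=$1 day=$2 table=$3

  # Write the matching entries to their own files, like the grep step.
  awk '$9 == 301' "$logfile" > all-301.txt
  awk '$9 == 404' "$logfile" > all-404.txt

  # Start the table with a header row the first time it is used.
  [ -f "$table" ] || echo 'date,count_301,count_404' > "$table"

  # Count lines with wc -l; $(( ... )) strips padding some wc versions add.
  echo "$day,$(( $(wc -l < all-301.txt) )),$(( $(wc -l < all-404.txt) ))" >> "$table"
}

# Usage:
#   append_daily_counts access.log 2015-03-15 counts.csv
```

Each nightly run adds one row to the table, which is the kind of file a spreadsheet can read as its data source.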

Now our team can quickly check whether an issue has come up and needs a closer look, or not.
In a second step I also had all log entries resulting in a 404 loaded into the spreadsheet on open.


Wednesday, January 14, 2015

Alexa 1 million Top desktop performers on Google sitespeed score: perfect score 100

Desktop sitespeed score

Using the good old (or new) Alexa top 1M sites list again, this is the list of the top performers on desktop with a speedscore of 100! Checked with this script, and then just a graph in Excel to show the distribution of 100-scores per 1000 ranks. While there seems to be a strong correlation between being in the top 1000 and sitespeed, sites with a speedscore of 100 appear more frequently in other areas, most often among the sites which rank in the 8000s in Alexa.
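
The distribution per 1000 ranks can also be counted on the command line instead of in Excel. A small sketch, assuming a hypothetical input file with whitespace-separated lines of the form "rank url score"; rows with a rank that was already seen are skipped:

```shell
#!/usr/bin/env bash
# Sketch: count sites with score 100 per block of 1000 Alexa ranks.
# Assumption: input lines look like "rank url score" (hypothetical file layout).

count_perfect_per_1000() {
  awk '
    # Only score 100, and only the first row for each rank (skips duplicates).
    $3 == 100 && !seen[$1]++ {
      bucket = int(($1 - 1) / 1000)   # ranks 1-1000 -> 0, 1001-2000 -> 1, ...
      n[bucket]++
    }
    END {
      for (b in n)
        printf "%d-%d\t%d\n", b * 1000 + 1, (b + 1) * 1000, n[b]
    }
  ' "$1" | sort -n
}

# Usage:
#   count_perfect_per_1000 scores.txt
```

The output has one line per block of ranks that contains at least one perfect score, which can be pasted straight into a chart.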

No. of sites with homepage speedscore 100 = perfect for desktop per 1000 sites from Alexa


Sitespeed score 100 site list

No Google search, interestingly, and many other big names are missing, but it is a list of URLs whose content I am not even going to try to check through our company network.

Alexa rank  Site URL  Score
29  googleusercontent.com  100
122  secureserver.net  100
571  github.io  100
776  streamcloud.eu  100
1027  lapatilla.com  100
1051  gstatic.com  100
1197  atpanel.com  100
1497  sourtimes.org  100
1537  openadserving.com  100
1576  googleadservices.com  100
1636  googleapis.com  100
1969  securepaynet.net  100
1995  banesconline.com  100
2281  nocookie.net  100
2696  onlinecreditcenter6.com  100
2896  giaoduc.net.vn  100
2935  sexad.net  100
2981  9gag.tv  100
3179  withgoogle.com  100
3343  readserver.net  100
3347  xxxhost.me  100
3491  dilandau.eu  100
3606  prcm.jp  100
3802  puu.sh  100
4031  womenwan.com  100
4047  get-a-fuck-tonight.com  100
4181  kienthuc.net.vn  100
4194  stream-tv.me  100
4274  tradeindia.com  100
4355  benesse.ne.jp  100
4571  itrack.it  100
4606  trackoptimizer.com  100
4743  7xz3.com  100
4982  edgesuite.net  100
5013  liveadexchanger.com  100
5318  ipsosinteractive.com  100
5349  fun698.com  100
5433  moudamepo.com  100
5732  come.in  100
6304  reduxmediia.com  100
6714  vgsgaming.com  100
6804  yieldtraffic.com  100
6808  insight.ly  100
6964  adxhosting.net  100
7009  contest-winners.com  100
7070  exhentai.org  100
7116  techhelpfox.com  100
7285  mlstatic.com  100
7337  fzg360.com  100
7382  siyahgazete.com  100
7484  imgsin.com  100
7736  mmstat.com  100
7762  cnnewmusic.com  100
7962  picketfenceblogs.com  100
8038  endlessmatches.com  100
8040  savedwebhistory.org  100
8160  flirt-fuck.com  100
8213  h12-media.net  100
8298  kataskopoi.com  100
8414  ihct.mx  100
8601  evsuite.com  100
8630  rapmls.com  100
8735  9stock.com  100
8890  travideos.com  100
8921  credomatic.com  100
9097  youtu.be  100
9236  cncmax.cn  100
9335  gefaellt-mir.me  100
9348  xe2c.com  100
9745  jobspapa.com  100
9792  iphone-winners.net  100
9898  nolix.ru  100
9907  mihanstore.net  100

Tuesday, December 30, 2014

Logfile analysis for SEO: visualization of bot visits

The bot visits are filtered out of the logfiles and sorted / counted as shown here. The filter below now picks out certain bots (out of the hundreds visiting the site) and the result is turned into a small graph.

This is the resulting table - easy to adjust to the bots of interest based on earlier research. (These are not real numbers, just fillers to show what it looks like.)


This is the visualization in Excel of the table with real numbers. Yandex is not in here, because they had so many visits that they dwarfed the counts of all the other bots. Each color stands for a different bot.



This is the filter to get the table - which can be easily adapted to show more / less or different search engine crawler data:
# Header row; the first column is the folder name (the date)
echo -e 'date\tgoogle\tbing\tyandex\tbaidu\tapple\tlinux\ttrident' > bots-table.txt
# One row per dated folder (folder names containing digit-dash-digit)
for i in $(ls | grep "[[:digit:]]-[[:digit:]]"); do
# Count the lines whose field 13 matches each bot pattern
google=$(awk '$13~/oogle/' "$i/bots-traffic.txt" | wc -l)
bing=$(awk '$13~/bing/' "$i/bots-traffic.txt" | wc -l)
yandex=$(awk '$13~/yandex/' "$i/bots-traffic.txt" | wc -l)
baidu=$(awk '$13~/baidu/' "$i/bots-traffic.txt" | wc -l)
apple=$(awk '$13~/Apple/' "$i/bots-traffic.txt" | wc -l)
linux=$(awk '$13~/Linux/' "$i/bots-traffic.txt" | wc -l)
trident=$(awk '$13~/Trident/' "$i/bots-traffic.txt" | wc -l)
# Append the row, starting with the folder name so rows stay aligned with dates
echo -e "$i\t$google\t$bing\t$yandex\t$baidu\t$apple\t$linux\t$trident" >> bots-table.txt
done
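
As a possible variation, the counts can be collected in one awk pass per file instead of one pass per bot. This is only a sketch: it assumes, like the filter above, that the bot name shows up in field 13, and it covers just four of the bots to keep it short. The function names are made up for this example:

```shell
#!/usr/bin/env bash
# Sketch: count several bots in a single awk pass per logfile.
# Assumption: the bot name appears in field 13, as in the filter above.

bot_counts() {
  # Prints tab-separated counts in the order: google bing yandex baidu
  awk '
    $13 ~ /oogle/  { google++ }
    $13 ~ /bing/   { bing++ }
    $13 ~ /yandex/ { yandex++ }
    $13 ~ /baidu/  { baidu++ }
    END { printf "%d\t%d\t%d\t%d\n", google, bing, yandex, baidu }
  ' "$1"
}

build_table() {
  # Header first, then one row per dated folder in the current directory.
  printf 'date\tgoogle\tbing\tyandex\tbaidu\n'
  for dir in *[0-9]-[0-9]*/; do
    # Skip folders without a bots-traffic.txt (and an unmatched glob).
    [ -f "${dir}bots-traffic.txt" ] || continue
    printf '%s\t%s\n' "${dir%/}" "$(bot_counts "${dir}bots-traffic.txt")"
  done
}

# Usage:
#   build_table > bots-table.txt
```

Because each row is built from a single folder, the date and the counts cannot drift out of alignment.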

Old version - with 'wrong' results: it needed to show the date from the folder name to make sure the rows are aligned

echo -e 'google\tbing\tyandex\tbaidu\tapple\tlinux\ttrident' > bots-table.txt
google=$(for i in $(ls | grep "[[:digit:]]-[[:digit:]]"); do cat $i/bots-traffic.txt | awk 'BEGIN { FS = " " } $13~/oogle/ { print $ 13 } ' | wc -l ; done)
bing=$(for i in $(ls | grep "[[:digit:]]-[[:digit:]]"); do cat $i/bots-traffic.txt | awk 'BEGIN { FS = " " } $13~/bing/ { print $ 13 } ' | wc -l ; done)
yandex=$(for i in $(ls | grep "[[:digit:]]-[[:digit:]]"); do cat $i/bots-traffic.txt | awk 'BEGIN { FS = " " } $13~/yandex/ { print $ 13 } ' | wc -l ; done)
baidu=$(for i in $(ls | grep "[[:digit:]]-[[:digit:]]"); do cat $i/bots-traffic.txt | awk 'BEGIN { FS = " " } $13~/baidu/ { print $ 13 } ' | wc -l ; done)
apple=$(for i in $(ls | grep "[[:digit:]]-[[:digit:]]"); do cat $i/bots-traffic.txt | awk 'BEGIN { FS = " " } $13~/Apple/ { print $ 13 } ' | wc -l ; done)
linux=$(for i in $(ls | grep "[[:digit:]]-[[:digit:]]"); do cat $i/bots-traffic.txt | awk 'BEGIN { FS = " " } $13~/Linux/ { print $ 13 } ' | wc -l ; done)
trident=$(for i in $(ls | grep "[[:digit:]]-[[:digit:]]"); do cat $i/bots-traffic.txt | awk 'BEGIN { FS = " " } $13~/Trident/ { print $ 13 } ' | wc -l ; done)
paste <(echo "$google") <(echo "$bing") <( echo "$yandex") <(echo "$baidu")  <(echo "$apple")  <(echo "$linux") <(echo "$trident") >> bots-table.txt



