Wednesday, July 29, 2015

Google: "making the web faster"

... how about starting at home?

I have Chrome open with two tabs (Gmail and G+) and three plugins (seoClarity, EveryoneSocial and WooRank), and you need HOW many processes and HOW much memory?

I have 20 plugins I'd like to use, and 10+ tabs I'd like to keep open, but that just won't work with this wasteful resource usage.


Wednesday, July 22, 2015

Parallel on AWS

Remember the post on how many parallel (simple wget) processes can reasonably be run on one machine?

This is how it looks on AWS / Amazon Web Services EC2:


I added lower and higher numbers after a bit of testing, but this is fast - likely because of a really good internet connection. (I might need to find a way to test the processing itself without the network dependency....)

Many of the scripts - even when run with parallel - still take hours or days to complete. 10,000 or 100,000 URLs to check is the rule, sometimes even more. With downloads and processing of the data, this can take a while.

I can't run these from the office, and if I run them from home I keep checking in on them.... which is not that relaxing at a time when I want to relax. So, I often use Amazon Web Services EC2.
For about a year I used the free tier - a great offer from Amazon to try the service with a very small instance for free. Now I am on the paid plan, but the cost is really low for what I do, and after a few times it became routine to start a new instance when I need it, run some things, then terminate it.
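
The start-and-terminate routine can also be scripted with the AWS CLI. A rough sketch - the AMI ID, key name, security group and instance ID are placeholders, not real values:

aws ec2 run-instances --image-id ami-xxxxxxxx --count 1 --instance-type t2.micro --key-name my-key --security-groups my-ssh-only
# ...ssh in, run the scripts, download the results...
aws ec2 terminate-instances --instance-ids i-xxxxxxxx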

One of the programs I use pretty much always is tmux - a 'terminal multiplexer' - which is now installed by default on Ubuntu instances. It not only lets you run many terminals over one connection, it also lets you detach a session. That means I can start a session, run a script in it, detach, and then close the terminal and the connection - the script keeps running in the background. A few hours later I can just connect again and download the results.
Especially in combination with parallel this is a great way to run many of the scripts from this blog - and more.
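
In practice the whole detach-and-return routine is only a few commands:

tmux new -s crawl                     # start a named session on the instance ("crawl" is just an example name)
# ...start the long-running script inside the session...
# detach with Ctrl-b d and close the ssh connection - the session keeps running
tmux attach -t crawl                  # hours later: connect again and reattach to pick up the results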

Starting looks a bit like this (thanks to Ole Tange for the tip and the help with parallel):

ls x* | parallel -j24 --line-buffer " . script.sh {} >> results.txt "

I often just split large files, then ls or cat the split files to parallel. -j24 means 24 parallel jobs, and {} picks up each line (here, a filename) from ls/cat.
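
For example, splitting a big URL list (urls.txt here is just a stand-in) into 1,000-line chunks creates the x-prefixed files that the ls x* above picks up:

split -l 1000 urls.txt                # creates xaa, xab, xac, ... with 1,000 lines each
ls x* | parallel -j24 --line-buffer " . script.sh {} >> results.txt "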

Thursday, July 9, 2015

Crawl faster with "parallel" - but how fast?

Crawling websites - often our own - helps find technical SEO, quality and speed issues; you'll see a lot of scripts on this blog in this regard.

With a site of millions of pages, crawling speed is crucial - even a smaller project easily has a few thousand URLs to crawl. While there are good tools to check for many things, wget and curl in combination with other command line tools give all the flexibility needed - but they can be painfully slow.
To check download speed, for example, a regular script would loop through a list of URLs, download one file, then move on to the next. Even splitting the list and running the same script several times is not great, and it is quite hard to handle at scale.
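
The sequential version looks roughly like this (urls.txt is a stand-in for whatever list is being checked):

while read -r url ; do
    wget -q -O /dev/null "$url"       # one download at a time - each URL waits for the previous one to finish
done < urls.txt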

After some searching I found GNU parallel - excellent especially for parallel tasks. Running multiple parallel processes is easy and powerful. The -j[number] switch sets the number of parallel processes.

But what number is best, especially with networking projects? A wget should not be a heavy computing load, even with some awk/sed/grep afterwards, because most of the time is spent waiting and transferring data.
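
The same check as the loop above, but with parallel fanning out the downloads (again with urls.txt as a stand-in; 24 is just an arbitrary starting point):

cat urls.txt | parallel -j24 " wget -q -O /dev/null {} "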

So, a little script to check the time for a few different numbers of parallel processes:

TIMEFORMAT="%R" ; for i in {12,24,36,48,60,80,100}; do var=$( { time ( cat "${1}" | parallel "-j${i}" . single-url-check-time-redirect.sh {} ) > /dev/null 2>&1 ; } 2>&1 ) ; echo "$i parallel threads take $var seconds" ; done

The referenced script (single-url-check-time-redirect.sh) just makes a curl call to check the HTTP status and a wget call to measure the header transfer time; $1 pulls in a list of full URLs.
The results differ greatly by network.
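
The script itself is not shown in this post; a minimal version matching that description could look something like this (the exact flags and output format are guesses, not the original):

#!/bin/bash
# hypothetical sketch of single-url-check-time-redirect.sh - not the original script
url="$1"
status=$( curl -s -o /dev/null -w "%{http_code}" "$url" )    # HTTP status code (a 3xx shows a redirect)
TIMEFORMAT="%R"
headertime=$( { time wget -q --spider "$url" ; } 2>&1 )      # seconds for the header-only request
echo "$url $status $headertime"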

Here, a slow public network in a café:


This from a library with a pretty good connection:


And this from a location with a 50MB/sec connection:


Times can easily vary by a factor of 2 within each location; network connection and load still seem to make the biggest difference, as all three tests were done on the same computer.
