Thursday, July 9, 2015

Crawl faster with "parallel" - but how fast?

Crawling websites - often our own - helps find technical SEO, quality, and speed issues. You'll see a lot of scripts on this blog for that purpose.

On a site with millions of pages, crawling speed is crucial, and even a smaller project easily has a few thousand URLs to crawl. While there are good tools to check for many things, wget and curl in combination with other command-line tools give all the flexibility needed - but can be painfully slow.
To check download speed, for example, a regular script would loop through a list of URLs, download one, then move on to the next. Splitting the list and running several copies of the same script is not great either, and quite hard to handle at scale.
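To make the baseline concrete, here is a minimal sketch of such a one-at-a-time loop. It is not the script used for the measurements in this post; the function name time_sequential and the 30-second timeout are assumptions for illustration, and the input is assumed to be a file with one full URL per line.

```shell
#!/usr/bin/env bash
# Sequential baseline: download each URL in turn and print how long it took.
# time_sequential is a hypothetical name; $1 is a file with one URL per line.
time_sequential() {
  local list="$1" url secs
  # "|| [ -n "$url" ]" also handles a last line without a trailing newline
  while IFS= read -r url || [ -n "$url" ]; do
    [ -z "$url" ] && continue   # skip blank lines
    # -s: silent, -o /dev/null: discard the body,
    # -w '%{time_total}': print the total transfer time in seconds
    secs=$(curl -s -o /dev/null --max-time 30 -w '%{time_total}' "$url") || true
    printf '%s %s\n' "$secs" "$url"
  done < "$list"
}
```

Called as `time_sequential urls.txt`, each request only starts after the previous one has finished, so the total run time is roughly the sum of all individual transfer times.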

After some searching I found GNU parallel, which is excellent for exactly this kind of task. Running multiple processes in parallel is easy and powerful. The -j switch (for example -j24) sets the maximum number of jobs to run at the same time.
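As a rough sketch of how -j is used, the snippet below feeds a URL list to GNU parallel and runs a small per-URL function with a capped number of jobs. The names fetch_status and check_parallel are made up for this example, and it assumes bash, since exported functions are what make fetch_status visible to the jobs parallel starts.

```shell
#!/usr/bin/env bash
# Hypothetical per-URL job; stands in for whatever check you actually run.
fetch_status() {
  # print the HTTP status code, then the URL it belongs to
  curl -s -o /dev/null --max-time 10 -w '%{http_code}' "$1" || true
  printf ' %s\n' "$1"
}
export -f fetch_status   # make the function available to the shells parallel spawns

# Run fetch_status once per line of the list in $1, at most $2 jobs at a time.
check_parallel() {
  local list="$1" jobs="$2"
  # -j caps the number of simultaneous jobs; {} is replaced by one input line
  parallel -j "$jobs" fetch_status {} < "$list"
}
```

For example, `check_parallel urls.txt 24` keeps up to 24 requests in flight. Output lines arrive in completion order, not input order; GNU parallel's -k option keeps them in input order if that matters.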

But what number is best, especially with networking projects? A wget should not be a heavy computing load, even with some awk/sed/grep afterwards, because most of the time is spent waiting on the network and transferring data.

So, a little script to check the time for a few different numbers of parallel processes:

for i in {12,24,36,48,60,80,100}; do var=$( time ( TIMEFORMAT="%R" ; cat "${1}" | parallel "-j${i}" . {} ) 2>&1 1>/dev/null ) ; echo "$i parallel threads take $var seconds" ; done

The referenced script just makes a curl call to check the HTTP status and a wget call for the header transfer time; $1 pulls in a list of full URLs.
The results differ greatly by network.
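The per-URL script itself is not reproduced in this post, so the following is only a guess at its general shape: one curl call for the HTTP status and one timed wget call that checks the URL without saving the body. The function name check_url, the timeouts, and the output layout are all assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical per-URL check: HTTP status via curl, header fetch time via wget.
check_url() {
  local url="$1" status header_time
  local TIMEFORMAT='%R'   # make bash's "time" print elapsed seconds only

  # curl: discard the body and print just the status code
  status=$(curl -s -o /dev/null --max-time 10 -w '%{http_code}' "$url") || true

  # wget --spider checks the URL without downloading the page;
  # the braces let us capture the timing that "time" writes to stderr
  header_time=$( { time wget -q --spider --timeout=10 --tries=1 "$url" >/dev/null 2>&1; } 2>&1 ) || true

  printf '%s %s %s\n' "$url" "$status" "$header_time"
}
```

A call such as `check_url 'http://example.com/'` prints the URL, the status code, and the elapsed seconds on one line, which is easy to post-process with awk or sort.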

Here is a slow, public network in a cafe:

This is from a library with a pretty good connection:

And this is from a location with a 50MB/sec connection:

Times can easily vary by a factor of 2 at each location. Network connection and load still seem to make the biggest difference, as all three runs were done on the same computer.
