Remember the post on how many parallel (simple, wget) processes can be run reasonably on a machine?
This is how it looks on AWS / Amazon Web Services EC2:
I added lower / higher numbers after a bit of testing, but this is fast - likely because of a really good internet connection. (I might need to find a way to test the processing itself without the network dependency....)
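One quick way to do that (just a sketch - the md5sum call below is only a stand-in for whatever the real per-item processing is) would be to time the same job counts against purely local work:

# CPU-bound dummy work: no network involved, so the timing
# reflects the machine's processing capacity only
time ( seq 1000 | parallel -j8 'echo {} | md5sum' > /dev/null )
time ( seq 1000 | parallel -j24 'echo {} | md5sum' > /dev/null )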
Many of the scripts - even when run with parallel - still take hours or days to complete. 10,000 or 100,000 URLs to check are the rule, sometimes even more. With the downloads and the processing of the data, this can take a while.
I can't run these from the office, and if I run them from home I keep checking.... which is not that relaxing at a time when I want to relax. So I often use Amazon Web Services EC2.
For about a year I used the free tier - a great offer from Amazon to try the service with a very small instance for free. Now I am on the paid program, but the cost is really low for what I do, and after a few times it became routine to start a new instance when I need it, run some things, then terminate it.
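That routine can also be scripted with the AWS CLI - a minimal sketch, where the AMI ID, key name, and instance ID are placeholders, not my actual setup:

# launch one small instance (ami-xxxxxxxx and mykey are placeholders)
aws ec2 run-instances --image-id ami-xxxxxxxx --instance-type t2.micro --key-name mykey

# ... ssh in, run the scripts, download the results ...

# terminate it when done (instance ID comes from the run-instances output)
aws ec2 terminate-instances --instance-ids i-0123456789abcdef0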
One of the programs I use pretty much always is tmux - a 'terminal multiplexer' - which is now installed by default on Ubuntu instances. It lets you not just run many terminals in one connection, but also detach a session. That means I can start a session, run a script in it, detach, and close the terminal and the connection - and the script keeps running in the background. A few hours later I can just connect again and download the results.
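In practice the round trip looks roughly like this (the session name and script name are just examples):

tmux new -s run           # start a named session
. script.sh               # kick off the long-running script inside it
# press Ctrl-b then d     # detach; the script keeps running
# ... log out, come back hours later, ssh in again ...
tmux attach -t run        # reattach and check on the results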
Especially in combination with parallel this is a great way to run many of the scripts from this blog - and more.
Starting looks a bit like this (thanks to Ole Tange for the tip and the help with parallel):
ls x* | parallel -j24 --line-buffer " . script.sh {} >> results.txt "
I often just split large files, then ls or cat the split files (the x* above) into parallel. -j24 runs 24 jobs in parallel, and {} is replaced with each filename coming in from ls/cat.
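Put together, a full run looks roughly like this (urls.txt and script.sh are placeholders for the actual input list and per-chunk script):

split -l 1000 urls.txt    # produces chunks xaa, xab, xac, ... of 1000 lines each
ls x* | parallel -j24 --line-buffer " . script.sh {} >> results.txt "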