The immediate question: which one is best? Or rather, how do they differ, and which list should I use to run my tests?
First, clean up the data: pull out the URL (the other elements aren't needed for this). That's pretty easy with cut - extract the column into full lists, then slice off the top URLs with a head -1000 or such. A quick sketch of that pipeline is below.
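Here's a minimal sketch, assuming the lists come as rank,url CSV files - the file names and column number are placeholders, so adjust -d and -f to whatever the actual format is:

```bash
# Assumed input: one entry per line in the form "rank,url" (e.g. list.csv).
cut -d',' -f2 list.csv > list-urls.txt        # keep only the URL column

head -1000 list-urls.txt > list-top1000.txt   # split off the top 1000 URLs
```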
So - finally set up an Ubuntu server at home. Just an old Dell work laptop - four years old and still running like a charm with Ubuntu.
Setting up the SSH server was pretty easy too - everything sits behind my router, and I don't really need outside-in access, since the crawls will just run unattended for hours or days. Still, I'm using key authentication - it's actually easier for later logins, and much more secure. (Many thanks to DigitalOcean and Stack Overflow for all the info that helped me through this.)
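For reference, a minimal sketch of the key setup - the hostname ubuntu-laptop and the username are placeholders, not the actual machine names:

```bash
# On the client: generate a key pair (ed25519 is a solid default).
ssh-keygen -t ed25519 -C "crawl-box"

# Copy the public key into the server's authorized_keys.
ssh-copy-id user@ubuntu-laptop

# On the server, optionally disable password logins by setting
#   PasswordAuthentication no
# in /etc/ssh/sshd_config, then: sudo systemctl restart ssh

# After that, logging in is just:
ssh user@ubuntu-laptop
```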
Up and running!