Tuesday, June 27, 2017

Too many, too long, too slow

As mentioned in earlier posts, there are five different 'top 1 million websites' lists available for free.

The immediate question: which one is best? How do they differ, and which list should I use for my tests?

First, clean up the data and pull out the URL (the other fields aren't needed for this). That's easy with cut; then save the full lists and split off the top URLs with head -1000 or similar.
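
For an Alexa-style "rank,domain" CSV, that boils down to something like this (the field number and filenames are assumptions; the other lists have different column layouts):

    # keep only the domain column (field 2 in a rank,domain CSV)
    cut -d',' -f2 top-1m.csv > alexa-full.txt
    # split off the top 1,000 for quick tests
    head -1000 alexa-full.txt > alexa-1000.txt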

So I started to compare the lists with an awk script, looping one list over the other, and it takes hours. With 1,000 URLs each it worked fine; with 10,000 it takes a while; but with one million ... not so much. That's 1,000,000 lines each compared against 1,000,000 lines. Some of it can be optimized inside the script (stop looping once a match is found), but the remaining work is still huge.
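
The shape of the problem, as a rough sketch (not the exact script; listA.txt and listB.txt stand in for two of the cleaned-up lists):

    # load list A into memory, then scan it once for every URL in list B
    awk 'NR==FNR { a[NR] = $0; next }
         { for (i in a) if (a[i] == $0) { matches++; break } }
         END { print matches, "URLs appear in both lists" }' listA.txt listB.txt

The break is the "continue on a match" optimization mentioned above; it helps, but in the worst case every URL still triggers a full scan of the other list.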

So I finally set up an Ubuntu server at home. It's just an old Dell work laptop, four years old, and still running like a charm with Ubuntu.

Setting up the SSH server was pretty easy too. Everything sits behind my router, and I don't really need outside-in access, since the crawls will run for many hours or days. I'm still using key authentication: it's actually easier for later logins, and much more secure. (Many thanks to DigitalOcean and Stack Overflow for all the info that helped me through this.)
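
For reference, the key setup boils down to a few commands (the username and hostname here are placeholders, and the sshd_config change is optional):

    # on the desktop: generate a key pair, if there isn't one already
    ssh-keygen -t ed25519
    # copy the public key over to the laptop server
    ssh-copy-id me@crawler
    # on the server: optionally disable password logins in /etc/ssh/sshd_config
    #   PasswordAuthentication no
    sudo systemctl restart ssh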


Up and running!

