Alexa 1 Million, Statvoo 1 Million, OpenDNS 1 Million, Majestic 1 Million, Quantcast 1 Million:
All of these "top million websites" lists have slightly different formats, but each ranks a large number of domains by some measure of traffic or popularity - just how the domains are selected varies.
I filtered the combined lists down to unique URLs, then added http://www. at the beginning of each and checked whether that gives a 200 OK.
Starting with over 4 million URLs, only 34,000 are on all lists (when checked as above):
Here's the list for download: no warranty, no promises, absolutely at your own risk. Re-running this might yield different results due to changes in the original lists, timeouts, etc.
I'll use this list for a while to run a bunch more queries.
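The filter-and-check step described above could be sketched roughly like this in Python. This is only an illustration under assumptions, not the script actually used for the list: the function names (`candidate_url`, `returns_200`, `filter_live`) are made up for the example, the 10-second timeout is arbitrary, and `urlopen` follows redirects, so "200" here means the final response after any redirect.

```python
import http.client
import urllib.error
import urllib.request


def candidate_url(domain):
    """Build the URL to test: 'example.com' -> 'http://www.example.com'."""
    return "http://www." + domain.strip().lower()


def returns_200(url, timeout=10):
    """True if the URL answers with HTTP 200; any error or timeout counts as False."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        return False


def filter_live(domains, check=returns_200):
    """De-duplicate the domains, build candidate URLs, keep those passing `check`."""
    unique = sorted({candidate_url(d) for d in domains if d.strip()})
    return [url for url in unique if check(url)]


if __name__ == "__main__":
    # Offline demo with a fake checker (assumption: pretend only a.com is up).
    demo = ["b.com", "a.com", "B.com", ""]
    print(filter_live(demo, check=lambda url: url.endswith("a.com")))
```

Passing the checker in as a parameter keeps the de-duplication logic testable without touching the network; for a real run over millions of URLs, timeouts and retries would dominate the runtime.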
Monday, November 20, 2017
Tuesday, June 27, 2017
Too many, too long, too slow
As mentioned in earlier posts, there are five different 'top 1 million websites' lists available for free.
The immediate question: which one is best? Or, how do they differ, and which list can I use to run my tests?
First, of course, clean up the data and pull out the URL (the other fields aren't needed for this). That's pretty easy with cut; then write out the full lists, and split off the top URLs with head -1000 or similar.
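For comparison, here is a small Python sketch of that clean-up step (the post itself used cut and head). It assumes rows shaped like `rank,domain`; the lists differ in format, so the column index is a parameter, and the name `top_domains` is invented for this example.

```python
import csv
import io
from itertools import islice


def top_domains(lines, column=1, limit=1000):
    """Return the first `limit` domains from CSV rows, reading field `column`.

    Assumes one site per row, already sorted by rank (as in 'rank,domain').
    Rows that are too short (e.g. blank lines) are skipped.
    """
    reader = csv.reader(lines)
    domains = (row[column].strip().lower() for row in reader if len(row) > column)
    return list(islice(domains, limit))


# Tiny in-memory stand-in for a downloaded list (made-up rows).
sample = io.StringIO("1,google.com\n2,youtube.com\n3,Facebook.com\n")
print(top_domains(sample, limit=2))  # ['google.com', 'youtube.com']
```

With a real file, `open("top-1m.csv", newline="")` could be passed in place of the `StringIO` object; that filename is an assumption here.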
So I started to compare the lists with an awk script, looping one list over the next, and it ran for hours. Well, I started with 1,000 URLs each, which worked fine; 10,000 URLs takes a while; but then with one million ... not so much. It's 1,000,000 lines compared against 1,000,000 lines, so up to a trillion comparisons per pair of lists. Some of that can be optimized inside of the script (continue on a match), but the remainder is still large.
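One common way around the nested loop is to load each list into a hash-based set, so that every membership test takes roughly constant time on average instead of scanning up to a million lines. The sketch below is a Python illustration of that idea, not the awk script from this post; `common_domains` and `overlap_counts` are names invented for the example.

```python
def common_domains(*lists):
    """Domains present in every list, computed via set intersection."""
    sets = [set(domains) for domains in lists]
    return set.intersection(*sets) if sets else set()


def overlap_counts(named_lists):
    """Pairwise overlap sizes for a dict mapping list name -> iterable of domains."""
    sets = {name: set(domains) for name, domains in named_lists.items()}
    names = sorted(sets)
    return {
        (a, b): len(sets[a] & sets[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


# Made-up miniature lists standing in for the real ones.
lists = {
    "alexa": ["a.com", "b.com", "c.com"],
    "majestic": ["b.com", "c.com", "d.com"],
    "opendns": ["c.com", "e.com"],
}
print(common_domains(*lists.values()))  # {'c.com'}
print(overlap_counts(lists))
```

For two lists of one million entries each, this does on the order of two million hash operations rather than up to a trillion string comparisons, at the cost of holding the sets in memory.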
So I finally set up an Ubuntu server at home. It's just an old Dell work laptop - 4 years old, still running like a charm with Ubuntu.
Setting up the SSH server was pretty easy too. It's all behind my router; I don't really need outside-in access, as the crawls will run for many hours or days. I'm still using key authentication - it's actually easier for later logins, and much more secure. (Many thanks to DigitalOcean and Stack Overflow for all the info to help me through this.)
Up and running!
Monday, April 17, 2017
New top 1 million websites list
Large lists of websites:
Alexa 1 million
Alexa's list of the top 1 million sites has helped me many times to run larger scans over homepages and other pages on the top sites, to compare setup and speed depending on how much traffic a site gets, and similar.
Now, Alexa's list has officially been discontinued as far as I know, although the file is still available (right now at least).
As it is based on browser plugins to track visits, it might have a different mix than the newer OpenDNS list.
OpenDNS 1 Million
This one is based on over 100 billion DNS requests per day, not limited to HTTP/HTTPS requests ... but perhaps read the details for yourself.
Even older versions are available, which practically invites using this for time series.
MajesticSeo 1 Million
Majestic 1 Million - another company seeing the potential of the list, and publishing the list, but not maintaining... or do they? Unclear if it is maintained, unclear when the last update happened, but seems a good list with a nice web interface, too.
Quantcast 1 Million
This has been around for a while, but it's unclear if it is regularly maintained, and how exactly it is generated. But it works very similarly to Alexa, which makes it a convenient alternative. Here's the page linking to the download; testing just now, that's not available anymore, but this link is working (zip) as of today.
Statvoo 1 Million
Another offering that seems new. Like Alexa, the site has a very nice categorization, but the samples are tiny (20 sites), with no download option for them - only for the top 1 million.
