Wednesday, July 26, 2017

Sorting files for unique entries - which method is fastest

Starting with some data around keywords and logfiles: there are 23 CSV files, 3 fields each, and we need the list of unique lines across all of them. Together they hold ~58 million rows. Might take a while - and I hope there is a faster way.

Used 3 ways to do this:

a. sort -u Q*.csv > output
b. cat Q*.csv > temp
    sort -u temp > output2

c.  for file in Q*.csv; do
       cat "$file" >> results
       sort -u results > fresh
       mv fresh results
     done
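To reproduce the comparison, each variant can be prefixed with time. A minimal runnable sketch, with two tiny stand-in files in place of the real 23 Q*.csv:

```shell
# Tiny stand-in files so the sketch runs anywhere; swap in the real Q*.csv.
printf 'a,1,x\nb,2,y\n' > Q1.csv
printf 'b,2,y\nc,3,z\n' > Q2.csv

# (a) let sort open and merge all the files itself
sort -u Q*.csv > output

# (b) concatenate into one temp file first, then sort that
cat Q*.csv > temp
sort -u temp > output2

# (c) incremental: append one file at a time, re-sort, repeat
for file in Q*.csv; do
    cat "$file" >> results
    sort -u results > fresh
    mv fresh results
done
# all three should end up with identical sorted unique lines
```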

About 48 million unique lines result:
a. <1 min - and it killed performance on my computer
b. 1 min - definitely did not expect such a difference
c. 11 min - ouch, and some surprise. I thought the overlap was larger, so this indirect way - merge one file, get the uniques, then merge with the next batch - might be faster than making one huge file of all entries first and then sorting it.
Now, using just the first field, the overlap should be much larger.
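One way to pull out just the first field before deduplicating (assuming comma as the separator) is cut; a sketch with a stand-in file:

```shell
# Stand-in file; the real run used all 23 Q*.csv files.
printf 'foo,1,x\nfoo,2,y\nbar,3,z\n' > Q1.csv

# unique values of the first field only
cut -d, -f1 Q1.csv | sort -u > field1_uniques
```

If the goal is instead whole lines deduplicated on field 1 only, GNU sort can do it in one step with sort -u -t, -k1,1, which keeps one line per key.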
About 43 million unique lines result:
a. 48 sec - and again it killed performance on my computer
b. 1 min
c. 6 min - ouch, still a surprise, for the same reason as above.

Now, with at least 50% overlap between the files, all fields:

a. 4 min - and again it killed performance on my computer
b. 6 min
c. 24 min - wow.

Seems sort -u with no temp file is by far the fastest way to get the unique lines.
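If plain sort -u still feels slow, GNU sort has a few knobs worth benchmarking: byte-order collation via LC_ALL=C, a bigger in-memory buffer with -S, and --parallel (flag availability depends on the sort version; the timings above did not use any of these):

```shell
printf 'b,2,y\na,1,x\n' > Q1.csv    # stand-in data

# byte collation + 1G sort buffer + 4 threads (GNU sort options)
LC_ALL=C sort -u -S 1G --parallel=4 Q1.csv > output_fast
```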

Tuesday, July 18, 2017

Wordlists - Domains

Wordlists

Are just lists of words: there are many. I found a list with over 9000 English nouns, and just had to test it.

aardvark
abacus
abbey
abdomen
ability
abolishment
abroad
abuse
accelerant
accelerator

First, added a 'www.' to the beginning, then a '.com' to the end: ah, urls!
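Those two edits are one sed command. A sketch using the first two nouns (nouns.txt and urls.txt are hypothetical filenames):

```shell
printf 'aardvark\nabacus\n' > nouns.txt   # stand-in for the 9000-noun list

# prepend 'www.' and append '.com' to every line
sed -e 's/^/www./' -e 's/$/.com/' nouns.txt > urls.txt
```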
How many of these are used? I had heard nearly all domains that make some kind of sense (and a few more) are taken already. Seems not so much.

Just curl'd them, then counted the http statuses:

9111 urls have:

200    3565
301    1697
302     752
304       0
404      63
403      49
410       0
500       8

and a bunch of response codes I had never seen before, like 416, 456, or 501.

All in all, 2045 urls had no response at all.
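The curl-and-count step can be sketched like this (filenames hypothetical; to stay offline, the stand-in list below holds a file:// url, for which curl reports status 000 - the same bucket as the no-response urls):

```shell
printf 'hello\n' > page.txt
printf 'file://%s/page.txt\n' "$PWD" > urls.txt  # stand-in; real list held www.*.com urls

# print one status code per url (000 = no http response), then tally
while read -r url; do
    curl -s -o /dev/null --max-time 10 -w '%{http_code}\n' "$url"
done < urls.txt | sort | uniq -c | sort -rn > status_counts
```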
