Wednesday, July 26, 2017

Sorting files for unique entries - which method is fastest?

Starting with some data around keywords and logfiles: 23 CSV files, 3 fields each, and we need the list of unique lines across all files. They have a total of ~58 million rows. Might take a while - and I hope there is a faster way.

I used three ways to do this:

a. sort -u Q*.csv > output
b. cat Q*.csv >> temp
   sort -u temp > output2

c. for file in Q*.csv; do
       cat "$file" >> results
       sort -u results > fresh
       mv fresh results
   done
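
How exactly the runs were timed is not shown; a minimal sketch, assuming bash, where the time keyword can wrap a single command or the whole loop of method c:

    time sort -u Q*.csv > output                          # method a
    time { cat Q*.csv > temp; sort -u temp > output2; }   # method b
    time for file in Q*.csv; do                           # method c
        cat "$file" >> results
        sort -u results > fresh
        mv fresh results
    done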

With all three fields, the result is about 48 million unique lines:
a.  <1 min - and it killed performance on my computer
b.  1 min - definitely did not expect such a difference
c.  11 min - ouch, and some surprise. I thought the overlap was larger, so this incremental way - merge a file in, take the uniques, then merge in the next batch - might be faster than building one huge file of all entries first and then sorting it (see the merge sketch below).
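
A true merge along those lines would avoid re-sorting the growing results file on every pass. A minimal sketch, assuming GNU sort, whose -m option merges inputs that are already sorted (the *.sorted file names are made up here):

    # sort and de-duplicate each file once, then merge the pre-sorted pieces
    for file in Q*.csv; do
        sort -u "$file" > "$file.sorted"
    done
    sort -m -u Q*.csv.sorted > output3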
Now, using just the first field, there should be a much larger overlap.
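
The exact command for the first-field run is not shown; one way to do it, assuming comma-separated fields with no embedded commas:

    # unique values of the first field across all files
    cut -d, -f1 Q*.csv | sort -u > output_field1

(Alternatively, sort -u -t, -k1,1 Q*.csv keeps one full line per distinct first field.)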
The result is about 43 million unique lines:
a.  48 sec - and again it killed performance on my computer
b.  1 min - again, not the difference I expected
c.  6 min - ouch, still a surprise, for the same reason as above: I expected the larger overlap to make the incremental merge-and-dedupe loop competitive with one big sort.

Now, with at least 50% overlap between the files, using all fields:

a.  4 min - and it killed performance on my computer
b.  6 min - still not the difference I expected
c.  24 min - wow.

It seems the single sort -u with no temp file is by far the fastest way to get the unique entries.
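
If "killed performance" means the machine becomes unresponsive during the big sort, GNU sort can be told to be less greedy - a sketch with made-up values for the memory cap, thread count, and temp directory:

    # cap sort's buffer at 2 GB, use 2 threads, spill temp files to /tmp
    sort -u -S 2G --parallel=2 -T /tmp Q*.csv > output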
