I used three ways to do this:
a. sort -u Q*.csv > output

b. cat Q*.csv > temp
   sort -u temp > output2

c. for file in Q*.csv; do
       cat "$file" >> results
       sort -u results > fresh
       mv fresh results
   done
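
Since each input can be sorted on its own, there is a fourth variant worth trying - a sketch on my part, not something I timed, assuming GNU coreutils sort and the same Q*.csv naming:

# Sort each file independently first, then merge the already-sorted
# outputs in a single pass: -m merges without re-sorting, and -u
# drops duplicates during the merge.
for file in Q*.csv; do
    sort -u "$file" > "$file.sorted"
done
sort -mu Q*.csv.sorted > output3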
With about 48 million unique lines in the result:
a. <1 min, and it killed performance on my computer
b. 1 min - definitely did not expect such a difference
c. 11 min - ouch, and some surprise. I thought the overlap was larger, so this indirect way might be faster: merge, get the uniques, then merge with the next batch, rather than making one huge file of all entries first and sorting that. In hindsight it makes sense: (c) re-sorts the ever-growing results file once per input, so the total work grows roughly quadratically with the number of files.
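
For anyone repeating this, a minimal sketch of the timing harness (assuming bash; the output names are mine):

# Time each approach and confirm they agree on the unique count.
time sort -u Q*.csv > output_a
time { cat Q*.csv > temp; sort -u temp > output_b; }
wc -l output_a output_b   # the line counts should match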
Now, using just the first field, the overlap should be much larger. With about 43 million unique lines in the result:
a. 48 sec, and again it killed performance on my computer
b. 1 min
c. 6 min - same surprise as above.
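
For reference, this is how the single-sort variant (a) might look when keyed on the first field only - a sketch, assuming comma-separated files and GNU sort:

# -t, sets comma as the field separator and -k1,1 restricts the sort
# key to field 1; with -u, uniqueness is judged on that key alone, so
# lines that differ only after the first field collapse to one.
sort -t, -k1,1 -u Q*.csv > output_by_key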
Now, with at least 50% overlap, comparing all fields:
a. 4 min, and it killed performance on my computer
b. 6 min - definitely did not expect such a difference
c. 24 min - wow.
It seems the single sort with no temp file is by far the fastest way.
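
The "killed performance" side effect is usually sort grabbing a large in-memory buffer and hammering the disk with temporary files. A sketch of reining it in with standard GNU sort options and Linux priority tools - the buffer size and thread count are guesses, tune to taste:

# Cap the sort buffer, put temp files on a fast disk, and lower the
# CPU and IO priority so the rest of the machine stays responsive.
nice -n 19 ionice -c3 \
    sort -u -S 2G --parallel=4 -T /tmp Q*.csv > output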