While working on a small DB to do this I downloaded some files from splunk, then imported them. And sure enough realized there are things I need to filter out before, or the DB becomes absolutely unwieldy.
The files are not huge, but large enough, with 10+ million lines, so I want to use command line tools and not redo a script for every number I need, looping repeatedly over the same file.
It shows count of
- number of fields
- total number of lines
- empty lines
- non-empty lines.
- shortest line
- the longest line
- the first 4 lines
- last 4 lines
- middle 3 lines of the file.
This is testdata output from the script - cannot show real data. It works well - at least with a few fields. To run it on a 17 million lines one field list takes ~ 32 seconds, that's pretty good I think.
I highlight the description in green as you might see - otherwise it is kind of hard to read. Also - awk changed the numbers for the line counters to scientific notation, so I needed to use int(variable) to reset it to integer and be able to concatenate it into a range for the sed - system call. Ah, this is with comma separated files and needs to be adjusted if that's different.
#!bash #this shows all lengths and how often echo -e "\033[32;1m\n\nnumber of lines : number of fields\033[0m" ; awk ' {print NF} ' "$1" | sort | uniq -c #this shows only number of shortest / longest lines awk ' BEGIN {FS=","} (NF<=shortest || shortest=="") {shortest=NF; shortestline=$0} (longest<=NF) {longest=NF; longestline=$0} (!NF) {emptylines+=1} (NF) {nonemptylines+=1} (maxcount<NR) {maxcount=NR} END { middlestart=(maxcount/2)-1; middleend=(maxcount/2)+1; range=int(middlestart)","int(middleend); print "\033[32;1m\ntotal number of lines is:\n\t\033[0m" NR, "\033[32;1m\n shortest is:\t\t\033[0m" shortest, "\033[32;1m\n longest is:\t\t\033[0m" longest, "\033[32;1m\n shortestline is:\t\t\033[0m" shortestline, "\033[32;1m\n longestline is:\t\t\033[0m" longestline, "\033[32;1m\n number empty lines is: \t\t\033[0m" emptylines, "\033[32;1m\n number of non-empty lines is:\t\t\033[0m" nonemptylines; print "\033[32;1m\n\nrange is \033[0m" range, "\033[1;32m\nFILENAME IS \033[0m" FILENAME, "\033[32;1m\nnow the middle 3 lines of the file: \n\033[0m"; system("sed -n "range"p " FILENAME) } ' $1 echo -e "\033[32;1m\n\ntop 4 lines\n\033[0m" head -n 4 "$1" echo -e "\033[32;1m\n\nlast 4 lines\n\033[0m" tail -n 4 "$1" echo -e "\n"
No comments:
Post a Comment