Tuesday, September 15, 2015

Data analysis preparation analysis script

Finally got to working on this. I am working with larger files and one of my current fun projects is to find out which urls have been visited by Google, out of all the urls we have live.

While working on a small DB to do this I downloaded some files from splunk, then imported them. And sure enough realized there are things I need to filter out before, or the DB becomes absolutely unwieldy.
The files are not huge, but large enough, with 10+ million lines, so I want to use command line tools and not redo a script for every number I need, looping repeatedly over the same file.

It shows count of 

  • number of fields
  • total number of lines
  • empty lines
  • non-empty lines. 
Then it pulls the full data for the 

  • shortest line
  • the longest line
  • the first 4 lines
  • last 4 lines
  • middle 3 lines of the file.

This is testdata output from the script - cannot show real data. It works well - at least with a few fields. To run it on a 17 million lines one field list takes ~ 32 seconds, that's pretty good I think.



I highlight the description in green as you might see - otherwise it is kind of hard to read. Also - awk changed the numbers for the line counters to scientific notation, so I needed to use int(variable) to reset it to integer and be able to concatenate it into a range for the sed - system call. Ah, this is with comma separated files and needs to be adjusted if that's different.


#!bash

#this shows all lengths and how often
echo -e "\033[32;1m\n\nnumber of lines : number of fields\033[0m" ; awk ' {print NF} ' "$1" | sort | uniq -c 

#this shows only number of shortest / longest lines

awk ' BEGIN {FS=","} (NF<=shortest || shortest=="") {shortest=NF; shortestline=$0} 
(longest<=NF) {longest=NF; longestline=$0} 
(!NF) {emptylines+=1} 
(NF) {nonemptylines+=1}
(maxcount<NR) {maxcount=NR}
END { middlestart=(maxcount/2)-1;
middleend=(maxcount/2)+1;
range=int(middlestart)","int(middleend);
print "\033[32;1m\ntotal number of lines is:\n\t\033[0m" NR,
"\033[32;1m\n shortest is:\t\t\033[0m" shortest, 
"\033[32;1m\n longest is:\t\t\033[0m" longest, 
"\033[32;1m\n shortestline is:\t\t\033[0m" shortestline, 
"\033[32;1m\n longestline is:\t\t\033[0m" longestline,
"\033[32;1m\n number empty lines is: \t\t\033[0m" emptylines,
"\033[32;1m\n number of non-empty lines is:\t\t\033[0m" nonemptylines;

print "\033[32;1m\n\nrange is   \033[0m" range, "\033[1;32m\nFILENAME IS \033[0m" FILENAME, "\033[32;1m\nnow the middle 3 lines of the file: \n\033[0m";
system("sed -n "range"p " FILENAME)

} ' $1

echo -e "\033[32;1m\n\ntop 4 lines\n\033[0m"
head -n 4 "$1"
echo -e "\033[32;1m\n\nlast 4 lines\n\033[0m"
tail -n 4 "$1"
echo -e "\n"

No comments:

Post a Comment

Bookmark and Share