Tuesday, October 21, 2014

awk: count items in n number of lines (including percentile calculations)


Using several scripts to check a number of sites on features, I needed a way to calculate the sum of occurrences of yes / no and such alike - but not just across the whole file, but for each x number of lines, each 100 lines, 1000 lines or so.




This is based on a much simpler version to calculate just the totals for a whole file, not n-numbers. The script is called with input file as first parameter and number of lines to summarize as parameter two: . runscript.sh filewithdata.txt 100. Filewithdata.txt is stored in $1, 100 in $2. Then the $2 is handed to awk with the -v counter=$2.
First step is to set variables, then count each field up with the 'if' a value occurs.
Third step happens when the number of lines is reached (NR-1) to account for the header. Values are printed, added to the totalvalue variables, reset to zero.
In the END section there are two elements - first to print the totals, then also to show if there are 'lines left' due to an the aggregation over x lines. If there are, the number of lines and counting values are printed.
awk -v counter="${2}" 'BEGIN {counturl = 0; counttitle = 0; countpub = 0; countschema = 0 ; printcounter = 0 }

{counturl++ } ( $2 == "yes" ) { counttitle++ } ( $3 == "yes" ) { countpub++ } ( $4 == "yes" ) {countschema++ }

(NR-1)%counter==0 { print "number of lines: "NR-1, "og:title: "counttitle, "rel publisher: "countpub, "schema: "countschema;
titletotal+=counttitle ; totalpub+=countpub ; totalschema+=countschema;
counturl=0; counttitle = 0; countpub = 0; countschema = 0 ; printcounter++ }

END { print "\nnumber of calculations: " printcounter; print "urls: " NR-1, "all og:title: " titletotal, "rel pub total: " totalpub, "all schema: " totalschema;
if(counturl) print "lines left: " counturl; if(counttitle) print "left og:title : " counttitle ; if(counttitle) print "left rel pub: " counttitle; if(countpub) print "left schema: " countschema ; print "\n" }' $1  

Apart from the usual man pages, this post on stackoverflow was very helpful especially for the 'left' calculation in the END section.

No comments:

Post a Comment

Bookmark and Share