andreas.wpv: data wrangling

Tuesday, October 21, 2014

awk: count items in n number of lines (including percentile calculations)

I use several scripts to check a number of sites for certain features, and needed a way to sum the occurrences of yes / no and similar values, not just across the whole file but for every block of x lines: every 100 lines, every 1000 lines, and so on.

This is based on a much simpler version that calculates just the totals for a whole file, not for blocks of n lines. The script is called with the input file as the first parameter and the number of lines to summarize as the second: . runscript.sh filewithdata.txt 100. Here filewithdata.txt ends up in $1 and 100 in $2, and $2 is then handed to awk with -v counter=$2.
The first step is to set the variables; the second counts each field up if the value occurs.
The third step happens when the number of lines is reached, using (NR-1) to account for the header: the values are printed, added to the total variables, and reset to zero.
In the END section there are two elements: first the totals are printed, then a check shows whether there are 'lines left' over from the aggregation over x lines. If there are, the number of those lines and their counts are printed.
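
To see the block logic on its own, here is a minimal sketch reduced to a single column. The function name count_blocks and the sample table are made up for illustration, and it assumes a header row and 'yes' / 'no' values in column 2. Unlike the full script, the grand total here also includes the unfinished block.

```shell
#!/bin/sh
# Hypothetical helper: count "yes" in column 2 per block of $2 data lines.
count_blocks() {
    awk -v counter="$2" '
        NR == 1 { next }                  # skip the header row
        { lines++ }                       # data lines in the current block
        $2 == "yes" { hits++ }            # occurrences in the current block
        (NR - 1) % counter == 0 {         # block is full: report, sum up, reset
            print "lines: " (NR - 1), "yes: " (hits + 0)
            total += hits; lines = 0; hits = 0; blocks++
        }
        END {
            # grand total includes the unfinished block, if any
            print "blocks: " (blocks + 0), "total yes: " (total + hits)
            if (lines) print "lines left: " lines, "yes left: " (hits + 0)
        }
    ' "$1"
}

# Made-up sample table: one header row plus five data rows.
tmp=$(mktemp)
printf 'url\ttitle\na\tyes\nb\tno\nc\tyes\nd\tyes\ne\tyes\n' > "$tmp"
result=$(count_blocks "$tmp" 2)
rm -f "$tmp"
printf '%s\n' "$result"
```

With a block size of 2, the five data rows give two full blocks and one leftover line, which is reported separately in the END section.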

awk -v counter="${2}" 'BEGIN {counturl = 0; counttitle = 0; countpub = 0; countschema = 0 ; printcounter = 0 }

# count data lines (not the header) and the "yes" values per column
NR > 1 {counturl++ } ( $2 == "yes" ) { counttitle++ } ( $3 == "yes" ) { countpub++ } ( $4 == "yes" ) {countschema++ }

# every "counter" data lines: print the block, add it to the totals, reset
NR > 1 && (NR-1)%counter==0 { print "number of lines: "NR-1, "og:title: "counttitle, "rel publisher: "countpub, "schema: "countschema;
titletotal+=counttitle ; totalpub+=countpub ; totalschema+=countschema;
counturl=0; counttitle = 0; countpub = 0; countschema = 0 ; printcounter++ }

# totals of the completed blocks, then whatever is left over
END { print "\nnumber of calculations: " printcounter; print "urls: " NR-1, "all og:title: " titletotal, "rel pub total: " totalpub, "all schema: " totalschema;
if(counturl) print "lines left: " counturl; if(counttitle) print "left og:title: " counttitle ; if(countpub) print "left rel pub: " countpub; if(countschema) print "left schema: " countschema ; print "\n" }' "$1"

Apart from the usual man pages, this post on Stack Overflow was very helpful, especially for the 'left' calculation in the END section.

Monday, October 13, 2014

AWK - simple counter of yes / no in tables

Counting yes or no in a table (or replace with what you need to count)

For a project I write data into a tab-delimited table: URLs and properties of the pages under those URLs. Since I use the script over and over, I wanted to make the counting at the end a bit more efficient, and added the lines below to the script generating the table.

This is what the call and result look like:

And this is how it is set up:
In the BEGIN section the counters are set to zero. The body then counts three parameters up if the corresponding field contains 'yes'. The END section prints the number of lines minus one for the header row, then the name and value of each counter. Finally, as a standalone script, $1 is the file scanned; if the lines are added to the table-generating script, $1 is simply replaced with the name of the file into which the table is written.

awk 'BEGIN {counttitle = 0; countpub = 0; countschema = 0 } ( $2 == "yes" ) { counttitle++ } ( $3 == "yes" ) { countpub++ } ( $4 == "yes" ) { countschema++ } END {print "number of lines: "NR-1, "og:title: "counttitle, "rel publisher: "countpub, "schema: "countschema} ' "$1"
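
A call on a small table might look like the sketch below. The wrapper function count_yes, the example URLs and the table contents are made up for illustration; the awk program is the same counter, spread over several lines.

```shell
#!/bin/sh
# Hypothetical wrapper around the counter; $1 is the table to scan.
count_yes() {
    awk 'BEGIN { counttitle = 0; countpub = 0; countschema = 0 }
    ( $2 == "yes" ) { counttitle++ }
    ( $3 == "yes" ) { countpub++ }
    ( $4 == "yes" ) { countschema++ }
    END { print "number of lines: " (NR - 1), "og:title: " counttitle, "rel publisher: " countpub, "schema: " countschema }' "$1"
}

# Made-up tab-delimited table: header row plus three URLs.
table=$(mktemp)
printf 'url\tog:title\tpublisher\tschema\n' > "$table"
printf 'http://a.example\tyes\tno\tno\n' >> "$table"
printf 'http://b.example\tyes\tyes\tno\n' >> "$table"
printf 'http://c.example\tno\tno\tno\n' >> "$table"

summary=$(count_yes "$table")
rm -f "$table"
printf '%s\n' "$summary"
# prints: number of lines: 3 og:title: 2 rel publisher: 1 schema: 0
```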

Should be easy to adjust for re-use!
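
One possible adjustment, as a rough sketch: pass the value to count as a parameter and loop over all columns after the first, taking the labels from the header row. The function name count_value, the variable want and the sample table are assumptions for illustration, not part of the original script.

```shell
#!/bin/sh
# Hypothetical generalisation: $1 is the tab-delimited table, $2 the value to count.
count_value() {
    awk -F '\t' -v want="$2" '
        # header row: remember the column names, do not count it
        NR == 1 { for (i = 2; i <= NF; i++) name[i] = $i; cols = NF; next }
        # data rows: count matches in every column after the URL
        { for (i = 2; i <= NF; i++) if ($i == want) count[i]++ }
        END {
            printf "number of lines: %d", NR - 1
            for (i = 2; i <= cols; i++) printf " %s: %d", name[i], count[i]
            printf "\n"
        }
    ' "$1"
}

# Made-up sample table, same layout as described in the post.
sample=$(mktemp)
printf 'url\tog:title\tpublisher\tschema\n' > "$sample"
printf 'http://a.example\tyes\tno\tno\n' >> "$sample"
printf 'http://b.example\tyes\tyes\tno\n' >> "$sample"
printf 'http://c.example\tno\tno\tno\n' >> "$sample"

yes_counts=$(count_value "$sample" yes)
no_counts=$(count_value "$sample" no)
rm -f "$sample"
printf '%s\n%s\n' "$yes_counts" "$no_counts"
```

Because the labels come from the header, adding a column to the table does not require touching the counter.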