andreas.wpv: October 2014

Wednesday, October 29, 2014

Script to check for Opengraph tags, schema and rel publisher

How common are tags like opengraph, schema and rel publisher?

These are interesting, perhaps important features of a website, not just, but also for seo. What better than to take a look at a larger number of sites, and to check if they use these tags.
This is the output of a little script to test for these three tags (schema.org, opengraph.org, rel_publisher for G+ ) on a list of urls.

First generate a unique filename, then copy the header into it. The while loop iterates over a list of urls, and pulls the data into a variable, because the script needs to check for several items, and this avoids to send three requests. I added the timeout parameters to wget, because several domains I tested did not send ANY response when missing the subdomain, and the script hung up.

Next steps are the three filters for og:title, rel publisher and schema (itemtype), into variables, then writing to the line with the url. Done.

#!bash

filename=topresults-$RANDOM.txt

echo -e '\turl\tog:title\trel_publisher\tschema' > ${filename}

while read -r line; do

file=$(wget -qO- -t 1 -T 10 --dns-timeout=10 --connect-timeout=10 --read-timeout=10 "${line}")

title=$(echo "${file}" | grep 'og:title' | wc -l)

if (( "$title" > 0 ))

then title="yes"

else

title="no"

publisher=$( echo "$file" | grep 'rel="publisher"' | wc -l )

if (( "$publisher" > 0 ))

then publisher="yes"

else

publisher="no"

schema=$( echo "$file" | grep 'itemtype="' | wc -l )

if (( "$schema" > 0 ))

then schema="yes"

else

schema="no"

echo -e "$line\t$title\t$publisher\t$schema" >> ${filename}

done < $1

wc -l $filename

Tuesday, October 21, 2014

awk: count items in n number of lines (including percentile calculations)

Using several scripts to check a number of sites on features, I needed a way to calculate the sum of occurrences of yes / no and such alike - but not just across the whole file, but for each x number of lines, each 100 lines, 1000 lines or so.

This is based on a much simpler version to calculate just the totals for a whole file, not n-numbers. The script is called with input file as first parameter and number of lines to summarize as parameter two: . runscript.sh filewithdata.txt 100. Filewithdata.txt is stored in $1, 100 in $2. Then the $2 is handed to awk with the -v counter=$2.
First step is to set variables, then count each field up with the 'if' a value occurs.
Third step happens when the number of lines is reached (NR-1) to account for the header. Values are printed, added to the totalvalue variables, reset to zero.
In the END section there are two elements - first to print the totals, then also to show if there are 'lines left' due to an the aggregation over x lines. If there are, the number of lines and counting values are printed.

awk -v counter="${2}" 'BEGIN {counturl = 0; counttitle = 0; countpub = 0; countschema = 0 ; printcounter = 0 }

{counturl++ } ( $2 == "yes" ) { counttitle++ } ( $3 == "yes" ) { countpub++ } ( $4 == "yes" ) {countschema++ }

(NR-1)%counter==0 { print "number of lines: "NR-1, "og:title: "counttitle, "rel publisher: "countpub, "schema: "countschema;
titletotal+=counttitle ; totalpub+=countpub ; totalschema+=countschema;
counturl=0; counttitle = 0; countpub = 0; countschema = 0 ; printcounter++ }

END { print "\nnumber of calculations: " printcounter; print "urls: " NR-1, "all og:title: " titletotal, "rel pub total: " totalpub, "all schema: " totalschema;
if(counturl) print "lines left: " counturl; if(counttitle) print "left og:title : " counttitle ; if(counttitle) print "left rel pub: " counttitle; if(countpub) print "left schema: " countschema ; print "\n" }' $1

Apart from the usual man pages, this post on stackoverflow was very helpful especially for the 'left' calculation in the END section.

Monday, October 13, 2014

AWK - simple counter of yes / no in tables

Counting yes or no in a table (or replace with what you need to count)

For a project I write data into a tab delimited table, urls and properties of the pages under the urls. Using the script over and over, I wanted to make the counting at the end a bit more efficient, and added below lines to the script generating the table.

This is how the call and result look like:

And this is how it is set up:
In the BEGIN section the counters are set, set to zero. Then in the body it counts three parameters up, if the according field contains 'yes'. In the END section it prints the number of lines minus one for the header column, then the name and value of each counter. And last, as standalone script, $1 is the file scanned - which, if added to the tab generating script is just replaced with the filename into which the table is written.

awk 'BEGIN {counttitle = 0; countpub = 0; countschema = 0 } ( $2 == "yes" ) { counttitle++ } ( $3 == "yes" ) { countpub++ } ( $4 == "yes" ) {countschema++ } END {print "number of lines: "NR-1, "og:title: "counttitle, "rel publisher: "countpub, "schema: "countschema} ' $1

Should be easy to adjust for re-use!