Wednesday, October 29, 2014

Script to check for Opengraph tags, schema and rel publisher


How common are tags like opengraph, schema and rel publisher?

These tags are interesting, perhaps important, features of a website, and not only for SEO. What better way to find out than to take a larger number of sites and check whether they use them?
This is the output of a little script that tests a list of URLs for these three tags (schema.org, Open Graph, and rel="publisher" for Google+).



The script first generates a unique filename and writes the header row into it. The while loop iterates over a list of URLs and pulls each page into a variable: the script needs to check for several items, and this avoids sending three requests per URL. I added the timeout parameters to wget because several domains I tested did not send ANY response when the subdomain was missing, and the script hung.

The next steps are the three filters for og:title, rel publisher and schema (itemtype). Each result goes into a variable, and the variables are then written to the output file on the same line as the URL. Done.
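The filter step can also be tried on its own, without any network access. This is only a sketch: `has_tag` is a hypothetical helper that is not part of the script, and the sample HTML is made up. It uses the same three search strings as the filters described above.

```bash
#!/bin/bash
# has_tag is a hypothetical helper, not part of the original script.
# It prints "yes" if the fixed string occurs anywhere in the HTML, "no" otherwise.
has_tag() {
    if printf '%s\n' "$1" | grep -qF -- "$2"; then
        echo "yes"
    else
        echo "no"
    fi
}

# Made-up sample page: it has an Open Graph title and a schema.org itemtype,
# but no rel="publisher" link.
sample='<html itemscope itemtype="http://schema.org/Article">
<head><meta property="og:title" content="Example"></head></html>'

title=$(has_tag "$sample" 'og:title')
publisher=$(has_tag "$sample" 'rel="publisher"')
schema=$(has_tag "$sample" 'itemtype="')

printf '%s\t%s\t%s\n' "$title" "$publisher" "$schema"
```

Note that these are plain substring matches, so a page that merely mentions og:title in its text would also count as a hit.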

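#!/bin/bash
# Usage: pass a file with one URL per line as the first argument
filename="topresults-$RANDOM.txt"
# Header row; the columns match the data rows written in the loop
printf 'url\tog:title\trel_publisher\tschema\n' > "$filename"

while read -r line; do
    # Fetch the page once and reuse it for all three checks;
    # the timeouts keep unresponsive hosts from hanging the loop
    file=$(wget -qO- -t 1 -T 10 --dns-timeout=10 --connect-timeout=10 --read-timeout=10 "$line")

    # grep -q reports a match through its exit status, so no line count is needed
    if printf '%s\n' "$file" | grep -q 'og:title'; then
        title="yes"
    else
        title="no"
    fi

    if printf '%s\n' "$file" | grep -q 'rel="publisher"'; then
        publisher="yes"
    else
        publisher="no"
    fi

    if printf '%s\n' "$file" | grep -q 'itemtype="'; then
        schema="yes"
    else
        schema="no"
    fi

    # One tab-separated result row per URL
    printf '%s\t%s\t%s\t%s\n' "$line" "$title" "$publisher" "$schema" >> "$filename"
done < "$1"

# Number of lines written, including the header row
wc -l "$filename"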
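
To get from the result file to an answer for "how common are these tags", the yes entries per column can be counted with awk. The following is a rough sketch rather than part of the script above: the sample rows are made up, and it assumes the tab-separated layout the script writes (URL first, then the three yes/no columns).

```bash
#!/bin/bash
# Made-up sample results in the script's column layout:
# url, og:title, rel_publisher, schema
results=$(mktemp)
printf 'url\tog:title\trel_publisher\tschema\n' > "$results"
printf 'http://example.com\tyes\tno\tyes\n' >> "$results"
printf 'http://example.org\tyes\tno\tno\n' >> "$results"
printf 'http://example.net\tno\tno\tno\n' >> "$results"

# Skip the header row, then count the "yes" entries in each tag column
summary=$(awk -F'\t' 'NR > 1 {
    n++
    if ($2 == "yes") og++
    if ($3 == "yes") pub++
    if ($4 == "yes") schema++
}
END {
    printf "sites=%d og:title=%d rel_publisher=%d schema=%d\n", n, og, pub, schema
}' "$results")

echo "$summary"
rm -f "$results"
```

For a real run, the path of the generated topresults file would replace the temporary sample file.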

