Tuesday, November 28, 2017

1/3 of top sites use schema on homepage

How many of the top sites on the web use jQuery, schema tags, or Google's "nositesearchbox" meta tag, which tells Google not to show a sitelinks search box for the site?

34039 URLs were tested from the 'top internet sites homepages' list, which merges the Majestic, Alexa, Statvoo, OpenDNS and Quantcast top million sites. Only the http://www homepage of each domain in this list was checked.
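A rough sketch of how such a merged list could be put together. It assumes each source list has already been reduced to one bare domain per line (the real downloads come in different CSV layouts), that a leading "www." should be dropped before deduplicating, and the name merge_lists is made up for this example:

```shell
# merge_lists: hypothetical helper, not part of the original script.
# Reads any number of files with one domain per line, lowercases them,
# strips a leading "www." (assumption), drops empty lines and duplicates,
# and prints the http://www homepage URL for each remaining domain.
merge_lists() {
  cat -- "$@" \
    | tr -d '\r' \
    | tr '[:upper:]' '[:lower:]' \
    | sed -e 's/^www\.//' -e '/^$/d' \
    | sort -u \
    | sed 's|^|http://www.|'
}
```

It could be called as merge_lists majestic.txt alexa.txt > urls.txt, where the file names are placeholders.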

24407 - pages have jQuery in the source code (72%)
12187 - pages use schema in one form or another (36%)
114 - pages have a 'nositesearchbox' tag (0.3%)
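The percentages are simply each count divided by the 34039 URLs tested. A small sketch to recompute them (pct is a made-up helper name, and awk is used only because bash has no floating-point arithmetic):

```shell
# pct: hypothetical helper; prints count/total as a percentage with one decimal.
pct() {
  awk -v c="$1" -v t="$2" 'BEGIN { printf "%.1f\n", 100 * c / t }'
}

total=34039
pct 24407 "$total"   # jQuery          -> 71.7
pct 12187 "$total"   # schema          -> 35.8
pct 114   "$total"   # nositesearchbox -> 0.3
```

Rounded to whole numbers, the first two give the 72% and 36% listed above.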

The percentage of top-list sites using schema and the number of sites with nositesearchbox were the items I was particularly interested in; the jQuery info is a nice added bonus.


The script reads URLs from a file, fetches each page into a variable, tests whether each of the three given terms appears in it, keeps a running count per term, and appends the per-URL counts to a file:
(only parts shown)
while read -r line; do
  # 111 is a placeholder value; it is overwritten as soon as a page is fetched
  acount=111; bcount=111; ccount=111
  feedback=$(curl -L -s -m "$time_out" -b cookies -c cookies -A "$agent" "$line")
  if [[ $feedback ]]; then
    # grep -c counts matching lines, not individual occurrences
    acount=$(grep -i -c -- "$3" <<< "$feedback")
    bcount=$(grep -i -c -- "$4" <<< "$feedback")
    ccount=$(grep -i -c -- "$5" <<< "$feedback")
    # a page adds one to a term's total if at least one line matched
    (( acount > 0 )) && acounter=$(( acounter + 1 ))
    (( bcount > 0 )) && bcounter=$(( bcounter + 1 ))
    (( ccount > 0 )) && ccounter=$(( ccounter + 1 ))
    printf '%s\t%s\t%s\t%s\n' "$line" "$acount" "$bcount" "$ccount" | tee -a "$outfile"
  fi
done   # the redirect that feeds the URL list into the loop is not shown in this excerpt
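One thing to keep in mind when reading the per-URL numbers: grep -c reports the number of matching lines, so a minified page that sits on a single line never scores higher than 1. For the "present or not" totals this makes no difference. If actual occurrences were wanted, something along these lines might do (count_lines and count_hits are made-up names, and the sample page is invented):

```shell
# count_lines: what the script's grep -c does, number of lines containing the term.
count_lines() { printf '%s\n' "$1" | grep -i -c -- "$2"; }

# count_hits: hypothetical alternative, number of individual occurrences.
# grep -o prints each match on its own line, wc -l counts them.
count_hits() { printf '%s\n' "$1" | grep -i -o -- "$2" | wc -l | tr -d ' '; }

# Invented one-line page with two script tags
page='<script src="jquery.js"></script><script src="jQuery-ui.js"></script>'

count_lines "$page" jquery   # prints 1
count_hits  "$page" jquery   # prints 2
```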
