Tuesday, November 28, 2017

1/3 of top sites use schema on homepage

How many of the top sites on the web use jQuery, schema tags, or Google's "nositesearchbox" to exclude the site from site searches on Google?

34039 URLs tested from the 'top internet sites homepages' list, which merges the Majestic, Alexa, Statvoo, OpenDNS and Quantcast top million sites. Only the http://www homepage of each domain in this list was checked.

24407 - pages have jQuery in the source code (72%)
12187 - are using schema in one form or another (36%)
114 - sites had a 'nositesearchbox' (0.3%)
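
The percentages can be re-derived from the raw counts. A minimal sketch, using only the numbers listed above; the helper name pct is made up for illustration and is not part of the crawl script.

```shell
# pct COUNT TOTAL: print COUNT as a percentage of TOTAL, one decimal place.
# pct is a hypothetical helper name, not part of the original script.
pct() {
  LC_ALL=C awk -v c="$1" -v t="$2" 'BEGIN { printf "%.1f\n", 100 * c / t }'
}

pct 24407 34039   # jQuery
pct 12187 34039   # schema
pct 114 34039     # nositesearchbox
```

Rounded to whole numbers, the first two give the 72% and 36% quoted in the list.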

The percentage of top-list sites using schema and the number of sites with nositesearchbox were the items I was particularly interested in; the jQuery info is a nice added bonus.


The script reads URLs from a file, fetches each page into a variable, tests whether each of the three given terms appears in it, counts the matching lines, and appends the result to an output file:
(only parts shown)
while read -r line; do
  # 111 is a placeholder; it stays in the output row if curl returned nothing
  acount=111; bcount=111; ccount=111
  feedback=$(curl -L -s -m "$time_out" -b cookies -c cookies -A "$agent" "$line")
  if [[ $feedback ]]; then
    # grep -c counts matching lines, not individual occurrences
    acount=$(echo "$feedback" | grep -i -c -- "$3")
    bcount=$(echo "$feedback" | grep -i -c -- "$4")
    ccount=$(echo "$feedback" | grep -i -c -- "$5")
    [[ $acount -gt 0 ]] && acounter=$(( acounter + 1 ))
    [[ $bcount -gt 0 ]] && bcounter=$(( bcounter + 1 ))
    [[ $ccount -gt 0 ]] && ccounter=$(( ccounter + 1 ))
  fi
  echo -e "$line\t$acount\t$bcount\t$ccount" | tee -a "$outfile"
done < "$infile"   # infile is an assumed name for the URL list, set in a part not shown
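
The tab-separated output file can also be tallied after the run. A minimal sketch, assuming each row is the URL followed by the three counts, and that a row still carrying the 111 placeholder was a failed fetch; the function name summarize and the labels term1 to term3 are made up for illustration.

```shell
# summarize FILE: tally rows of the form "url<TAB>acount<TAB>bcount<TAB>ccount".
# summarize and the term1..term3 labels are hypothetical names.
summarize() {
  LC_ALL=C awk -F '\t' '
    { total++ }
    # a count of 111 is treated as the "not fetched" placeholder and skipped
    $2 > 0 && $2 != 111 { a++ }
    $3 > 0 && $3 != 111 { b++ }
    $4 > 0 && $4 != 111 { c++ }
    END {
      if (total == 0) { print "no rows"; exit }
      printf "urls\t%d\n", total
      printf "term1\t%d\t%.1f%%\n", a, 100 * a / total
      printf "term2\t%d\t%.1f%%\n", b, 100 * b / total
      printf "term3\t%d\t%.1f%%\n", c, 100 * c / total
    }
  ' "$1"
}
```

Called as summarize "$outfile", it prints one line per term with the number of pages that matched and the share of all rows. Failed fetches stay in the denominator, so the percentages are relative to URLs tested, not URLs fetched.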

Monday, November 20, 2017

URLs on all 'top million sites' lists

Alexa 1 million, Statvoo 1 million, OpenDNS 1 million, Majestic 1 million, Quantcast 1 million:
All "top million websites" lists have slightly different formats, but each ranks a large number of domains by traffic; only the way they are selected varies.
I filtered them down to a list of unique URLs, added http://www at the beginning of each, and checked whether the result gives a 200 OK.
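
The filtering step could look roughly like this. A minimal sketch, assuming each list has already been reduced to one bare domain per line (the raw downloads differ in format); the function names on_all_lists and is_200, the 10 second timeout, and the choice not to follow redirects are assumptions, not the commands actually used.

```shell
# on_all_lists FILE...: print the domains that appear in every FILE.
# Assumes one bare domain per line in each file (hypothetical, pre-cleaned input).
on_all_lists() {
  local n=$# f
  for f in "$@"; do
    sort -u "$f"            # de-duplicate within each list first
  done | sort | uniq -c | awk -v n="$n" '$1 == n { print $2 }'
}

# is_200 DOMAIN: succeed if http://www.DOMAIN answers with status 200.
# Redirects are not followed here; add -L to curl to follow them.
is_200() {
  local code
  code=$(curl -s -o /dev/null -m 10 -w '%{http_code}' "http://www.$1")
  [ "$code" = "200" ]
}
```

A domain counted once per list reaches the number of lists only if it is on all of them, which is what the awk filter checks. The two pieces would then be combined by looping over the output of on_all_lists and keeping the domains for which is_200 succeeds.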

Starting with over 4 million URLs, only 34,000 are on all lists (when checked as above):

Here's the list for download: no warranty, no promises, use absolutely at your own risk. Re-running this might yield different results due to changes in the original lists, timeouts, etc.

I'll use this list for a while to run a bunch more queries.
