Thursday, June 19, 2014

Tornado Warning in Google Results

An interesting, helpful example of one part of the Knowledge Graph:

[Screenshot: tornado warning shown directly in the Google search results]
The overview maps on google.org were very helpful, too, although I did not check their timeliness - which is key.
I did not take a screenshot of that map, but it also has a good collection of additional links and info (and the alerts active at the time of viewing).

[Screenshot: part of the configuration options for the google.org crisis map]
The map is also highly configurable (the screenshot above shows only a part of the options). Very interesting - it looks like a high-level mashup of various sources with maps.



Thursday, June 12, 2014

Finally: a free-to-use sitemap generator with automated 'priority' configuration

Easy to use - at least in a Linux environment. I am not sure if adjustments are necessary for various (bash) flavors, same with *nix emulators like Cygwin. Sitemaps can have up to 50,000 links, so this should be sufficient for many sites.

First, check if the script was called with a file - ideally the list of URLs from an earlier scan with one of the scanners on this site, or from somewhere else, perhaps a content management system. Then grab the file name without its extension and add a random number to avoid overwriting existing files.

if [ -z "$1" ]
then echo "needs a file to start with"; exit 1
else
file=$(basename "${1}")
name=${file%.*}-$RANDOM.txt
fi
Clean up the file: first remove empty lines, then lines that start with an underscore or a hyphen - errors I found a few times.
sed -i '/^$/d' "${1}"
sed -i '/^[_-]/d' "${1}"

Echo the header into the new file - the XML declaration and opening urlset tag for a sitemap.
echo '<?xml version="1.0" encoding="UTF-8"?>' > "${name}"
echo '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">' >> "${name}"
Then, check if the URL has a http:// in front; if not, add it - otherwise the following calculations get tricky.
Next, count the number of "/" and subtract the two slashes from the http://. The 'gsub' returns the number of replacements - and that's all I need here, so I replace a slash with a slash.

The next step calculates the priority based on the depth in the folder structure - the lower in the hierarchy, the lower the priority. This is a setting that might need to be changed depending on the structure of a site (a standalone check of this logic follows right after the loop).
awk ' $1 !~ /^http/ { $1 = "http://"$1 }  { count=(gsub("/" , "/" , $1)-2) } count > 9 { print $1, "0.1" } ( count > 0 && count <=9 ) { print $1, (10-count)/10 } count <= 0 { print $1, "1" }' "$1" | while read -r input1 input2
The little sed inset is then used to trim the value to three characters, removing any additional decimals - this was the easiest way to do it.
Then just echo it into the overall structure of the sitemap entries, and end the while loop with done.
do priorityvalue=$(echo $input2 | sed 's/^\(...\).*/\1/')
echo -e "<url>\n<loc>$input1</loc>\n<priority>$priorityvalue</priority>\n</url>" >> "${name}"

done 
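As a standalone sanity check of the priority logic, the awk part can be run on its own - the URLs here are just examples:

printf '%s\n' "www.tage.de" "http://andreas-wpv.blogspot.com/2014/05/new-google-schema-implementation-for.html" | awk ' $1 !~ /^http/ { $1 = "http://"$1 } { count=(gsub("/" , "/" , $1)-2) } count > 9 { print $1, "0.1" } ( count > 0 && count <=9 ) { print $1, (10-count)/10 } count <= 0 { print $1, "1" }'
# prints:
# http://www.tage.de 1
# http://andreas-wpv.blogspot.com/2014/05/new-google-schema-implementation-for.html 0.7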
Finally, add the sitemap footer per the definition and show the result.
echo "
" >> "${name}"cat $name

The result looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>http://www.tage.de</loc>
<priority>1</priority>
</url>
<url>
<loc>http://www.directorinternet.com/</loc>
<priority>0.9</priority>
</url>
<url>
<loc>http://andreas-wpv.blogspot.com/2014/05/new-google-schema-implementation-for.html</loc>
<priority>0.7</priority>
</url>
</urlset>

The same script is on Dropbox for easy use.
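To run it, assuming the script was saved as sitemap-generator.sh (the name is just an example) and urls.txt holds one URL per line:

bash sitemap-generator.sh urls.txt
# writes the sitemap to urls-<random number>.txt and prints it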

Thursday, June 5, 2014

Content ownership - is Google using 3rd party site content for ad revenue on their search results page?

Look at this:
[Screenshot: Google search result showing recipe content taken from a third-party site]
Google shows content taken from another website (credited only in a miniature link below the content).

Would you click through or not?

With this content taken from the 3rd party site and shown on Google - is there still a need to go to the other site? How much will this affect the traffic on the other website? Is Google monetizing other sites' content with their ads on the search results pages?

The site is tagged nicely, with the right descriptors in place.

[Screenshot: the source page with its recipe markup]
I tested the site in Google's markup tester, and the markup for recipe and author works (Download).
The tool also shows a very different picture as a preview - misleading, in case an author tries to use it to see what others are likely to see in Google.
[Screenshot: the preview shown by Google's markup tester]
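The markup can also be checked from the command line, in the same wget/grep style as the scripts on this blog. This is a minimal sketch that only catches microdata-style itemtype declarations, not JSON-LD (the URL is a placeholder):

url="http://www.example.com/some-recipe.html"
wget -qO- "$url" | grep -o 'itemtype="http://schema.org/[A-Za-z]*"' | sort | uniq -c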
Many sites rely on ad revenue to finance their operations and content - this will become impossible if the above becomes more common. With this revenue taken away from the other sites, it seems as if Google is cutting off the branch on which they are sitting. And it very much looks like copyright infringement to me (but I am no lawyer and might be wrong).

The search result screen (top) is from June 1, 2014.

Monday, June 2, 2014

Big social platform shares - added StumbleUpon likes, shares, lists

And another script to check for social shares... this time with more details around StumbleUpon (no, I am not really active there anymore, although it has led me to a few outstanding sites). The older version of this script pulls StumbleUpon data, but that data shows pageviews from stumblers, not amplification through StumbleUpon activity - so it was mixing traffic data with engagement data. Fixed now.



The data for StumbleUpon is pulled in two steps - first get the badge ID for the URL, then use that ID to pull the data (likes, shares, lists). They all go into just one tab-delimited variable below, which is then echoed into the larger list. That is why there are not three fields immediately visible when the data is added, but seemingly just one.


echo -e "Url\tripples\tFB-comments\tFB-shares\tFB-likes\ttwitter\tlinkedin\tstumble_likes\tstumble_shares\tstumble_lists" > "${1}-all-social-shares".csv
while read -r shortline ;
 do
This line replaces special characters with their percent-encoded version, so the URL survives being embedded in the API calls...
line=$(echo "$shortline" | sed -e 's/\?/\%3F/g' -e 's/&/\%26/g' -e 's/#/\%23/g')
gpull="https://plus.google.com/ripple/details?url=${line}"
ripples=$(wget -qO- "${gpull}" | grep -o "[0-9]*\s*public\s*shares.<" | sed "s/[^0-9]//g"  | tr "\n" "\t" | sed 's/\thttp/\nhttp/g'| sed 's/\t//')
commentpull="https://api.facebook.com/method/fql.query?query=select%20comment_count%20from%20link_stat%20where%20url=%22${line}%22&format=json"
comment_count=`wget -qO- $commentpull | sed -e 's/^.*://g' -e 's/\}//g' -e 's/\(]\)//g'`
echo "comment count: " $comment_count
sharepull="https://api.facebook.com/method/fql.query?query=select%20share_count%20from%20link_stat%20where%20url=%22${line}%22&format=json"
share_count=`wget -qO- $sharepull | sed -e 's/^.*://g' -e 's/\}//g' -e 's/\(]\)//g'`
echo "share count: " $share_count
likepull="https://api.facebook.com/method/fql.query?query=select%20like_count%20from%20link_stat%20where%20url=%22${line}%22&format=json"
like_count=`wget -qO- $likepull | sed -e 's/^.*://g' -e 's/\}//g' -e 's/\(]\)//g'`
echo "like count: " $like_count
twitterpull="http://urls.api.twitter.com/1/urls/count.json?url=${line}&callback=twttr.receiveCount"
twitternumber=$(wget -qO- "${twitterpull}"  | grep -o 'count\":[0-9]*\,' | sed -e 's/count//g' -e 's/,//g' -e 's/://g' -e 's/"//g' )
echo "twitter: " $twitternumber
linkedpull="http://www.linkedin.com/countserv/count/share?format=json&url=${line}" #echo ${linkedpull}
linkednumber=$(wget -qO- "${linkedpull}" | grep -o 'count\":[0-9]*\,' | sed -e 's/count//g' -e 's/,//g' -e 's/://g' -e 's/"//g' )
echo "linkedin count: " $linkednumber
stumblepull=`wget -qO- "http://www.stumbleupon.com/services/1.01/badge.getinfo?url=${line}" | grep -o "publicid.*views" | sed -e "s/publicid//" -e "s/\":\"//" -e "s/\",\"views//"`
echo $stumblepull
stumblenumber=$(wget -qO- "http://www.stumbleupon.com/content/${stumblepull}" | grep "mark>" | sed -e "s/^.*\">//" -e "s/<.*$//" -e "2d" -e "4d" -e "6d" | tr "\n" "\t")

echo -e "${line}\t${ripples}\t${comment_count}\t${share_count}\t${like_count}\t${twitternumber}\t${linkednumber}\t${stumblenumber}" >> "${1}-all-social-shares".csv
done < "$1"
cat -A "${1}-all-social-shares".csv
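A sample run, assuming the script was saved as social-shares.sh (the name is just an example) and urls.txt holds one URL per line:

bash social-shares.sh urls.txt
# progress is echoed per URL; the results land in urls.txt-all-social-shares.csv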