Wednesday, May 21, 2014

SeoClarity - Certification

SeoClarity has a new certification program - a nice way to connect with users and engage them. It is also nice to have on a resume, and I am looking forward to more discussions with the great team and other seoClarity clients. We use them as our enterprise SEO metrics platform.

I use the tool and services a lot - it has many features that help with the very large scale of our website. I especially like and use the sitescans - I download many GB of results data, then crunch them in MySQL, R, or just with awk and sed in bash.

Thursday, May 15, 2014

Scraper for video pages to get all data for video sitemap

This scraper is based mainly on Open Graph tags (which are used by Facebook, for example), so it should work with many pages, not just ours. More info on video sitemaps at Google.
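The core extraction trick the script below relies on - grep for the tag, then strip everything before and after the content attribute with sed - can be tried on a single line first. The HTML here is a made-up stand-in, not from a real page:

```shell
#!/bin/bash
# Stand-in HTML line; in the real script the page is fetched with wget.
html='<meta property="og:title" content="My Test Video" />'
# Isolate the content="" value: cut everything up to content=" and after the closing quote.
echo "$html" | grep "og:title" | sed -e 's/^.*content="//' -e 's/".*$//'
```

This prints just the tag value, which is exactly what gets glued together with tabs in the full script.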

#!/bin/bash
# Check that it is called with a filename - a file containing URLs for pages with videos
if [[ ! $1 ]] ; then
echo "need to call with filename"
exit 1
fi
# Now make sure to have a unique output filename, based on the file with the URLs
filename=$(basename "$1")
# Header info for the file with the results. Need to pull page url, video url, title, thumbnail url, description.
echo -e 'url\tpage\tTitle\tThumb\tDescription' > "${filename}"video-sitemap-data.txt
# Loop through the file and store each page in a variable
while read -r line; do
filecontent=$(wget -qO- "$line")
# Echo the results and clean up with sed, tr and grep, then append to the file that already has the column headers. There are five elements - each isolated in its own part. The parts are connected with &&, and everything wrapped in ( and ) - otherwise only the last part would be redirected into the file.
(echo "$line" | sed 's/\r$/\t/' | tr '\n' '\t'  && echo "$filecontent" | grep "og:video" | grep "swf" | sed -e "s/^.*content=\"//" -e "s/\".*$//" | sed 's/\r$/\t/' | tr '\n' '\t' && echo "$filecontent" | grep "og:title" | sed -e "s/^.*content=\"//" -e "s/\".*$//" | sed 's/\r$/\t/' | tr '\n' '\t' && echo "$filecontent" | grep "og:image" | sed -e "s/^.*content=\"//" -e "s/\".*$//" | sed 's/\r$/\t/' | tr '\n' '\t' && echo "$filecontent" | grep "og:description" | sed -e "s/^.*content=\"//" -e "s/\".*$//") >> "${filename}"video-sitemap-data.txt
done < "$1"
I'd be delighted to know this helped someone else - why don't you drop me a note when you do?

This is one of the pages that I used for testing, just in case someone wants to test this:

Monday, May 12, 2014

Google Webmaster Tools - import into MySQL

For a research project I am integrating all the data I can get my hands on regarding a site (Majestic, seoClarity, Omniture, Maxamine, etc.) - and, for sure, Google Webmaster Tools data (and Bing).

I get this data from the 'top pages' in GWT:

Page             Impressions  Change  Clicks  Change  CTR  Change  Avg. position  Change
http://pageurl/  22,577       19%     3,959   15%     18%  -0.5    11             -0.1

So far, so good (and great numbers, team!). When setting up the table, I am not able to use any number format; I end up using varchar(10) instead of int or decimal, because with a number type I always get 'data truncated' errors. Even after quite a bit of online research, I could not find a better way to do this. I change the impressions numbers to not have the comma, and then use string fields. Any idea how I could make this work with number formats?
Likely int for impressions, ??? for the change in %, int for clicks, and decimal for the CTR change - but how?
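One way to make numeric column types work, staying with the tools used elsewhere on this blog: strip the commas and percent signs with sed before loading, so the values parse as plain numbers. A minimal sketch - the sample row mirrors the table above, the function name is mine:

```shell
#!/bin/bash
# Clean one GWT 'Top pages' row so INT / DECIMAL columns accept it:
# drop thousands separators and percent signs; the tab delimiters stay intact.
clean_row() {
  printf '%s\n' "$1" | sed -e 's/,//g' -e 's/%//g'
}
row=$'http://pageurl/\t22,577\t19%\t3,959\t15%\t18%\t-0.5\t11\t-0.1'
clean_row "$row"
```

With the cleaned file, impressions and clicks can be int, and the change / CTR columns decimal. Alternatively, MySQL's LOAD DATA INFILE supports reading a field into a user variable and cleaning it in a SET clause (e.g. SET impressions = REPLACE(@imp, ',', '')), which skips the preprocessing step entirely.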

These are the other scripts to create the table and to load the data:

Thursday, May 1, 2014

New Google Schema implementation for Phone Numbers - little helper for lots of phone numbers

Google is showing phone numbers now for company searches

Nope, not happy about the fact that Google is showing phone numbers for company searches - but not for phone searches. How much sense does that make?

I prefer online support, usually, and call only as a last resort. Calls take much longer, and sometimes it is hard to hear or understand what the other side says. And sometimes I find it incredibly aggravating to be on a call for hours to fix something that the company I am calling "broke" in the first place - or did not prevent. (Although I know there are many companies involved in producing one product.) And I also know most companies (including the one I work for) really try hard to give good online and phone support.

And now Google is showing support phone numbers for companies - no matter if it is an online-only company. And they show them as part of the knowledge graph - but NOT when someone searches for a phone number. How smart is that?

And they are currently not showing for Chromebooks or for Google itself - how fair is that?

The schema documentation works just fine; watch out for these points:
  • the visible number on the page needs to be the same as the one in the json / schema part
  • for multiple contact points for the same organization, that part of the code can be iterated, comma separated
  • phone numbers for several countries can be integrated on one page. 
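For reference, a multi-country contactPoint block along the lines Google documents for Organization markup might look roughly like this - the company URL and phone numbers here are placeholders, not the ones from this project:

```json
{
  "@context": "http://schema.org",
  "@type": "Organization",
  "url": "http://www.example.com",
  "contactPoint": [
    {
      "@type": "ContactPoint",
      "telephone": "+1-800-555-0100",
      "contactType": "customer service"
    },
    {
      "@type": "ContactPoint",
      "telephone": "+44-20-5555-0101",
      "contactType": "customer service",
      "areaServed": "GB"
    }
  ]
}
```

The contactPoint array is the part that gets iterated, comma separated, once per country.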

Script for many phone numbers:

Ok, so I had to update ~80 country phone numbers, and had just one not-very-good list of numbers, including international calling codes. First I started concatenating in my beloved MSFT Excel, but that's no fun, especially since quotes need to be escaped (doubled), so I ended up with a lot of double quotes - which I then had to remove in a later step. And it needed to be done quickly once I had the final list - which made me write this little script.

This script loops through the list of phone numbers with awk, writes the two parameters into variables, and fills the info in the schema snippet. As a result, the script generates one snippet for each country and phone number.

Three files are needed: the script (shown at the end), a list of phone numbers, and the schema template for this kind of phone number.

This is the schema info we were going to use, where I used two placeholders - changenumber and changearea:
The list of phone numbers looks like this, and is tab delimited:

And now the script:
awk '{print $1, $2}' tab-phone-numbers | while read -r input2 input1; do
awk -v var1="${input1}" '{ sub("changearea", var1); print }' phone-code-snippet | awk -v var2="${input2}" '{ sub("changenumber", var2); print }' >> phone-results
done
What really helped me do this was the trick of using read to get the output into variables. I first found this here (thanks duckeggs01) and confirmed it on my trusted source for man pages online, ss64. I am sure I'll use this more - very quick, easy, and efficient.
The first part is a loop with awk through the numbers, with read assigning each field value to a variable. Then two awk commands are nested, each passing an external shell variable to an internal awk variable: the first awk replaces the first placeholder with the first variable, piped into a second awk where the second variable replaces the second placeholder... done.
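The same flow in miniature, with made-up values: one tab-delimited line stands in for the phone number list, and a template string stands in for phone-code-snippet, so you can watch field 1 travel to changenumber and field 2 to changearea:

```shell
#!/bin/bash
# Toy version of the loop: field 1 -> input2 -> changenumber, field 2 -> input1 -> changearea.
printf '555-0100\tUS\n' | while read -r input2 input1; do
  echo 'changearea changenumber' \
    | awk -v var1="$input1" '{ sub("changearea", var1); print }' \
    | awk -v var2="$input2" '{ sub("changenumber", var2); print }'
done
```

This prints "US 555-0100" - both placeholders replaced, one output line per input line.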

In this first run, the script produced one complete snippet per country - good if you need the snippets on several pages. For usage on one page, change the schema template to include only the contact point info, run the script, and then add the header / footer around the whole section afterwards.
