Tuesday, December 30, 2014

Logfile analysis for SEO: visualization of bot visits

The bot visits are filtered out of the logfiles and sorted / counted as shown here. This now filters for a handful of specific bots (out of the hundreds visiting the site) and turns the counts into a small graph.

This is the resulting table - easy to adjust to the bots of interest based on earlier research. (These are not real numbers, just fillers to show what it looks like.)


This is the Excel visualization of the table with real numbers. Yandex is not in here because it had so many visits that it dwarfed all the other bots' counts. Each color stands for a different bot.



This is the filter to get the table - it can easily be adapted to show more, fewer or different search engine crawlers:
echo -e 'google\tbing\tyandex\tbaidu\tapple\tlinux\ttrident' > bots-table.txt
for i in $(ls | grep "[[:digit:]]-[[:digit:]]"); do
google=$( cat $i/bots-traffic.txt | awk 'BEGIN { FS = " " } $13~/oogle/ { print $ 13 } ' | wc -l ) ;
bing=$( cat $i/bots-traffic.txt | awk 'BEGIN { FS = " " } $13~/bing/ { print $ 13 } ' | wc -l ) ;
yandex=$(cat $i/bots-traffic.txt | awk 'BEGIN { FS = " " } $13~/yandex/ { print $ 13 } ' | wc -l ) ;
baidu=$(cat $i/bots-traffic.txt | awk 'BEGIN { FS = " " } $13~/baidu/ { print $ 13 } ' | wc -l ) ;
apple=$(cat $i/bots-traffic.txt | awk 'BEGIN { FS = " " } $13~/Apple/ { print $ 13 } ' | wc -l ) ;
linux=$( cat $i/bots-traffic.txt | awk 'BEGIN { FS = " " } $13~/Linux/ { print $ 13 } ' | wc -l ) ;
trident=$(cat $i/bots-traffic.txt | awk 'BEGIN { FS = " " } $13~/Trident/ { print $ 13 } ' | wc -l ) ;
echo -e "$i\t$google\t$bing\t$yandex\t$baidu\t$apple\t$linux\t$trident" >> bots-table.txt
done
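A quick way to double-check the input before running the loop above: it expects one folder per log period (I assume names like 2014-12, matching the digit-dash-digit pattern), each containing the bots-traffic.txt produced with the grep filter from the earlier post. The column command is only used here to display the finished table.
ls | grep "[[:digit:]]-[[:digit:]]"
column -t bots-table.txt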

Old version - with 'wrong' results; it needed to show the date from the folder to make sure the columns are aligned.

echo -e 'google\tbing\tyandex\tbaidu\tapple\tlinux\ttrident' > bots-table.txt
google=$(for i in $(ls | grep "[[:digit:]]-[[:digit:]]"); do cat $i/bots-traffic.txt | awk 'BEGIN { FS = " " } $13~/oogle/ { print $ 13 } ' | wc -l ; done)
bing=$(for i in $(ls | grep "[[:digit:]]-[[:digit:]]"); do cat $i/bots-traffic.txt | awk 'BEGIN { FS = " " } $13~/bing/ { print $ 13 } ' | wc -l ; done)
yandex=$(for i in $(ls | grep "[[:digit:]]-[[:digit:]]"); do cat $i/bots-traffic.txt | awk 'BEGIN { FS = " " } $13~/yandex/ { print $ 13 } ' | wc -l ; done)
baidu=$(for i in $(ls | grep "[[:digit:]]-[[:digit:]]"); do cat $i/bots-traffic.txt | awk 'BEGIN { FS = " " } $13~/baidu/ { print $ 13 } ' | wc -l ; done)
apple=$(for i in $(ls | grep "[[:digit:]]-[[:digit:]]"); do cat $i/bots-traffic.txt | awk 'BEGIN { FS = " " } $13~/Apple/ { print $ 13 } ' | wc -l ; done)
linux=$(for i in $(ls | grep "[[:digit:]]-[[:digit:]]"); do cat $i/bots-traffic.txt | awk 'BEGIN { FS = " " } $13~/Linux/ { print $ 13 } ' | wc -l ; done)
trident=$(for i in $(ls | grep "[[:digit:]]-[[:digit:]]"); do cat $i/bots-traffic.txt | awk 'BEGIN { FS = " " } $13~/Trident/ { print $ 13 } ' | wc -l ; done)
paste <(echo "$google") <(echo "$bing") <( echo "$yandex") <(echo "$baidu")  <(echo "$apple")  <(echo "$linux") <(echo "$trident") >> bots-table.txt




Thursday, December 18, 2014

Google sitespeed score - script to tap api and Alexa top 10,000


Pagespeed - Score

Even if it were not relevant for large-site indexation, page speed still has a huge impact on traffic, bounce rate and CE. Google offers a great tool to get their feedback - the sitespeed score. It does not show the actual speed, but how your page is built compared to a page ideally built for speed. So it shows the potential to improve.

As input I use the top 10,000 from the Alexa top million list - the homepages only. (I actually split it into 10 parts and ran the script 10x in parallel, see the sketch after the script.) The process is relatively slow, as Google checks the pages and gives much more detail back than just the score - which I filter out below. Just fill in your api key (after the = sign) and feed your sitemap (just urls) to the script. Below is for mobile sitespeed; for desktop, just swap mobile with desktop. The allowance currently is 25k api calls a day for free - that's plenty for most sites or projects.

Mobile sitespeed score scatterplot of Alexa top 10,000 sites homepages

Then I just cleaned it out, set all garbage values to zero, and built a scatterplot.
  • few sites at a sitespeed score of 100
  • a relatively sparsely populated area between 80 and 100
  • the bulk of sites between 40 and 80
  • scores of 40 and lower becoming less frequent as well




Speedtest score script:

And this is the script:
filename=mobile-speedtest-$RANDOM.txt
echo -e 'rank\turl\tspeed-score' > ${filename}
IFS="," ; while read -r counter line; do
score=$(curl -s  -m 30 -f --retry 1 --proto =https --proto-redir =https "https://www.googleapis.com/pagespeedonline/v1/runPagespeed?url=http://www.${line}&strategy=mobile&key=--- your api key ---" | sed -n '/score[^,]*,/p'| sed -e 's/\"//g' -e 's/,//g' -e 's/score: //g' )
echo -e "$counter \t $line \t $score" >> ${filename}
done < $1
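To run it in the 10 parallel chunks mentioned above, something like this should work - file and script names here are placeholders, and split -n l/10 needs GNU split:
split -d -n l/10 alexa-top-10000.csv alexa-part-
for part in alexa-part-*; do
bash mobile-speedtest.sh "$part" &
done
wait
Each run writes into its own mobile-speedtest-$RANDOM.txt, so the parallel runs do not clash.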

Tuesday, December 9, 2014

Logfile analysis for SEO: Which bots come how often?

Which bots are crawling your site?

Which bots are visiting your site? Googlebot, bingbot and yandex? You might be surprised by the number and variety.


Script to identify crawlers

Just filter the logfile we get from IIS with grep -i 'bot', write the user agent - in this logfile in position 13 - into a separate file, and then sort and count the occurrences of each.
grep -i 'bot' logfile | awk 'BEGIN { FS = " " } { print $ 13 } ' >> bots-names.txt
sort bots-names.txt | uniq -c | sort -k 1nr > bots-which-counter.txt
rm bots-names.txt
This gives me a nice list of bots and how many requests they sent in the time span of the logfile. Interesting list, with lots of bots I would not have expected, like mail.RU and 'linux'.
In another post I share a table of how often bots come over time - and I pick the most relevant bots based on the list above (plus whatever brings us traffic).

Top bots and crawlers visiting certain parts of www.dell.com:



I cut off the numbers (the count in the first column), and this is just the sorted list of the top few visiting crawlers / bots.

Tuesday, November 25, 2014

Pagespeed - pages crawled to page download time

Pagespeed and indexation


At least at first sight there seems to be a correlation between page load time and number of pages indexed.

Google seems to try to maintain a constant crawl 'budget' of n GB per day, apart from a few spikes we see, and when our page load time goes up, the number of crawled pages goes down, suggesting a connection. 

Slower pages:




Fewer pages crawled:


Wednesday, November 19, 2014

Logfile Analysis for SEO: Get status codes


Ever wondered if bots use cookies? Or if there is a relation between 302s and search engine bot visits? The total sum of 4xx errors as a trend? Our current analytics setup does not show this data, but log file analysis does, so I wrote a few brief scripts, pulled the zipped logfiles from the servers to a local folder, and then analyzed them (scripts at the end of this post).

As a next step I added the data to an xls and calculated the percentages, like 302s as a share of overall traffic or how many bot visits use a cookie (nearly all bot visits are cookied visits!). A rough sketch of the same calculation on the command line is below.
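A minimal sketch, assuming the files produced by the scripts at the end of this post, and assuming a lone '-' in the cookie field means 'no cookie' in these IIS logs:
total=$(cat logfile-results.txt)                # all requests in the time frame
redirects302=$(wc -l < all-302.txt)             # all 302 responses
bots=$(wc -l < bots-traffic.txt)                # all bot requests
botcookies=$(grep -vc "^-$" bots-cookies.txt)   # bot requests that carry a cookie
echo "302 share: $(echo "scale=2; $redirects302 * 100 / $total" | bc) %"
echo "cookied bot visits: $(echo "scale=2; $botcookies * 100 / $bots" | bc) %"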



Visualization of one time frame's worth of http status codes



And it is easy to show trends of status codes, for example the share of bot visits to overall visits and the number of 302s on a site.



And here are the scripts. If you wonder why I write the results into files instead of just counting: I run more analysis on these resulting files.

1. Get all lines with a certain status code other than 200 OK:

zcat *.zip | grep " 301 " > all-301.txt
zcat *.zip | grep " 302 " > all-302.txt
zcat *.zip | grep " 304 " > all-304.txt
zcat *.zip | grep " 403 " > all-403.txt
zcat *.zip | grep " 404 " > all-404.txt
zcat *.zip | grep " 410 " > all-410.txt
zcat *.zip | grep " 500 " > all-500.txt
zcat *.zip | grep "^2014" | wc -l > logfile-results.txt

2. Error codes and redirects encountered by bots. First, filter out all lines with bots, then get the status code lines into separate files.

zcat *.zip | grep "bot" > bots-traffic.txt
grep " 301 " bots-traffic.txt > bots-301.txt
grep " 302 " bots-traffic.txt > bots-302.txt
grep " 304 " bots-traffic.txt > bots-304.txt
grep " 403 " bots-traffic.txt > bots-403.txt
grep " 404 " bots-traffic.txt > bots-404.txt
grep " 410 " bots-traffic.txt > bots-410.txt
grep " 500 " bots-traffic.txt > bots-500.txt

3.  Pull out all the visits with cookies, in this case only the cookies themselves. In these logfiles they are in field $15:

zcat *.zip | awk ' BEGIN { FS = " " } ; NR > 5 { print $15} ' > all-cookies.txt
awk ' BEGIN { FS = " " } ; NR > 5 { print $15} ' bots-traffic.txt > bots-cookies.txt
awk ' BEGIN { FS = " " } ; NR > 5 { print $15} ' all-500.txt > all-500-cookies.txt
awk ' BEGIN { FS = " " } ; NR > 5 { print $15} ' bots-500.txt > bots-500-cookies.txt

4. This pulls all urls with a 500 error code into files - to have Dev look at these:

awk 'BEGIN {FS = " "; OFS = ""  } ($8 == "-") {print "www.dell.com"$7 } ($8 != "-" ){ print "www.dell.com"$7,$8} ' all-500.txt > all-500-urls.txt
awk 'BEGIN {FS = " "; OFS = ""  } ($8 == "-") {print "www.dell.com"$7 } ($8 != "-" ){ print "www.dell.com"$7,$8} ' bots-500.txt > bots-500-urls.txt

It's filtered, so if field 8 is just a hyphen, it prints just the url stem, otherwise stem and pagename. 

Tuesday, November 11, 2014

Alexa top 10,000 and rel publisher, opengraph tags, and schema

Alexa.com might not be high in their own rankings, but they frequently provide a very, very handy list - the top 1 million domains by estimated traffic rank.

The accuracy of this data is beyond my knowledge - only Alexa has detailed insight - but I would think that this list is at least directionally right and rare in the way it is offered (Quantcast offers something similar). Thanks Alexa!

I wondered how widely used the tags are that we (SEO at Dell) consider important or at least interesting.
I wrote two small scripts to test for the implementation and to do the counting, and here are the results.

Tested for 3 elements:

  1. og:title - checking for the implementation of anything related to opengraph.org tags
  2. rel_publisher - if the tag is implemented on the homepage, as recommended by Google
  3. schema - this tests for the implementation of anything related to schema.org, according to the search engine recommendations.

All items are only tested on the homepage of the 10,000 domains! And if a domain did not answer - some are not set up properly and do not forward to a default subdomain like www - all values are set to 'no' by default. I manually checked the first 400 urls, and this happened twice.

First observation - these tags are FAR from everywhere. 

Opengraph is on ~ 26% of homepages, rel publisher on 17% and schema only on 10%.
(First column = line counts)


The second observation is interesting, but the correlation seems relatively weak:

The implementation of opengraph,  rel publisher and schema happens slightly more often on the higher ranking homepages. This is most pronounced for schema, least for opengraph, which has the highest overall adoption rate.



Is this compelling you to implement on your site?

Wednesday, October 29, 2014

Script to check for Opengraph tags, schema and rel publisher


How common are tags like opengraph, schema and rel publisher?

These are interesting, perhaps important features of a website - not just, but also, for seo. What better way to find out than to take a look at a larger number of sites and check if they use these tags?
This is the output of a little script to test for these three tags (schema.org, opengraph.org, rel_publisher for G+ ) on a list of urls.



First generate a unique filename, then copy the header into it. The while loop iterates over a list of urls and pulls the page into a variable, because the script needs to check for several items and this avoids sending three requests. I added the timeout parameters to wget because several domains I tested did not send ANY response when the subdomain was missing, and the script hung.

The next steps are the three filters for og:title, rel publisher and schema (itemtype) into variables, then writing the line with the url. Done.

#!/bin/bash
filename=topresults-$RANDOM.txt
echo -e '\turl\tog:title\trel_publisher\tschema' > ${filename}

while read -r line; do
file=$(wget -qO- -t 1 -T 10 --dns-timeout=10 --connect-timeout=10 --read-timeout=10 "${line}")

title=$(echo "${file}" | grep 'og:title'  | wc -l)
        if (( "$title" > 0 ))
                then title="yes"
        else 
                title="no"
        fi

publisher=$( echo "$file" | grep 'rel="publisher"' | wc -l )
        if (( "$publisher" > 0 ))
                then publisher="yes"
        else 
                publisher="no"
        fi

schema=$( echo "$file" | grep 'itemtype="' | wc -l )
        if (( "$schema" > 0 ))
                then schema="yes"
        else 
                schema="no"
        fi

echo -e "$line\t$title\t$publisher\t$schema" >> ${filename}

done < $1

wc -l $filename
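A typical call - the url-list file name here is a placeholder, the script reads one url per line from its first parameter; the resulting topresults-<random>.txt can then be fed into the awk counters from the posts below:
. check-tags.sh alexa-top-10000-urls.txt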


Tuesday, October 21, 2014

awk: count items in n number of lines (including percentile calculations)


Using several scripts to check a number of sites for features, I needed a way to calculate the sum of occurrences of yes / no and the like - not just across the whole file, but for every x lines, every 100 or 1,000 lines or so.




This is based on a much simpler version that calculates just the totals for a whole file, not per n lines. The script is called with the input file as the first parameter and the number of lines to summarize as parameter two: . runscript.sh filewithdata.txt 100. filewithdata.txt is stored in $1, 100 in $2. $2 is then handed to awk with -v counter=$2.
The first step sets the variables; the second counts each field up if the value occurs.
The third step happens when the number of lines is reached ((NR-1) to account for the header): the values are printed, added to the total variables, and reset to zero.
In the END section there are two elements - first to print the totals, then to show if there are 'lines left' over from the aggregation over x lines. If there are, the number of remaining lines and their counts are printed.
awk -v counter="${2}" 'BEGIN {counturl = 0; counttitle = 0; countpub = 0; countschema = 0 ; printcounter = 0 }

{counturl++ } ( $2 == "yes" ) { counttitle++ } ( $3 == "yes" ) { countpub++ } ( $4 == "yes" ) {countschema++ }

(NR-1)%counter==0 { print "number of lines: "NR-1, "og:title: "counttitle, "rel publisher: "countpub, "schema: "countschema;
titletotal+=counttitle ; totalpub+=countpub ; totalschema+=countschema;
counturl=0; counttitle = 0; countpub = 0; countschema = 0 ; printcounter++ }

END { print "\nnumber of calculations: " printcounter; print "urls: " NR-1, "all og:title: " titletotal, "rel pub total: " totalpub, "all schema: " totalschema;
if(counturl) print "lines left: " counturl; if(counttitle) print "left og:title: " counttitle; if(countpub) print "left rel pub: " countpub; if(countschema) print "left schema: " countschema; print "\n" }' $1

Apart from the usual man pages, this post on stackoverflow was very helpful especially for the 'left' calculation in the END section.

Monday, October 13, 2014

AWK - simple counter of yes / no in tables


Counting yes or no in a table (or replace with what you need to count)

For a project I write data into a tab delimited table - urls and properties of the pages under those urls. Using the script over and over, I wanted to make the counting at the end a bit more efficient, and added the lines below to the script generating the table.

This is how the call and result look like:


And this is how it is set up:
In the BEGIN section the counters are set to zero. Then in the body it counts three parameters up if the corresponding field contains 'yes'. In the END section it prints the number of lines minus one for the header row, then the name and value of each counter. And last, as a standalone script, $1 is the file scanned - which, if the line is added to the table-generating script, is just replaced with the filename into which the table is written.

awk 'BEGIN {counttitle = 0; countpub = 0; countschema = 0 } ( $2 == "yes" ) { counttitle++ } ( $3 == "yes" ) { countpub++ } ( $4 == "yes" ) {countschema++ } END {print "number of lines: "NR-1, "og:title: "counttitle, "rel publisher: "countpub, "schema: "countschema} ' $1

Should be easy to adjust for re-use!

Monday, September 29, 2014

Same domain, completely different sites - how come?

KFMA is a radio station here in Austin. As I was just listening to some music and wanted to learn what it was, I tried to access their site.

1. kfma.org in chrome - result is a redirect to this, same result for www.kfma.org:



So far, so good - or not, because that does not look like the radio station.

Thinking my memory might have tricked me, I google it and see this - with big surprise, because that's exactly what I typed when I got the Korean site. How come?

I click through, and get this result - the site I wanted to see.

Funny, though - that's the url I used earlier, which forwarded me to the other site; but this time, coming from Google, it works.

How is that possible? A wrong entry in some DNS? And why is Google sending me to the 'right' site, and not to the Korean one? Do they actually use IP addresses rather than domains?




Friday, September 5, 2014

Backlink Analysis - Tools Insights: Compare backlink domains from Google, Majestic Seo and Bing


Do 3rd party link tools show the same backlinks as Google does?

As described in the last post, there is only a small overlap between the domains linking to support.dell.com as reported by Open Site Explorer and by Google Webmaster tools.


How many domains are the same from each tool?

Bing webmaster tools shows 31 domains linking to support.dell.com that Google Webmaster tools also shows, and Majesticseo shows 0 (and remember - Moz / Open Site Explorer showed 43 domains).

Two annotations:

  1. Last year it was the opposite, Moz zero, Majestic some, Bing most overlap (nearly 5%). 
  2. Although Majestic seems not spot on here, we used Majestic data + GWT (Google Webmaster tools) data to remove spam links and a penalty - successfully. While I would expect the tools to show all links they can possibly see (or even estimate, like Majestic), Google hides most backlink data. There might be a much larger overlap between the tools and GWT - we just don't see it.


The scripts are here - and because of the setup, the only line that needs to change to compare other tables is this one, shown below in its three variants:
create temporary table if not exists url_T  (select mid(Source_URL, (locate("//", Source_URL)+2)) as fullurl from support_moz_t );
create temporary table if not exists url_T  (select mid(Source_URL, (locate("//", Source_URL)+2)) as fullurl from supportdellcom_backlinks_t );
create temporary table if not exists url_T  (select mid(Source_URL, (locate("//", Source_URL)+2)) as fullurl from supportdellcom_bing_links_urls_t );
Yeah, I had to drop the tables between runs, but I prefer to do that rather than overwrite - I find it easier to remember.

(Ping me and I'll share the scripts. )

Monday, August 25, 2014

Backlink Analysis - Tools Insights: Generating database and tables, data import from Google Webmaster tools and open site explorer

Many SEO professionals are checking backlinks to their sites - including us. We use a variety of tools mainly out of necessity to get as much info as possible rather than relying on just one tool.

Last year I compared backlinks from Google webmaster tools, Moz (opensiteexplorer), Bing webmaster tools and Majesticseo, and found nearly zero overlap between the links shown from each tool. I used awk and excel last year - this time it's mysql, as I use the data in these tables for other purpose as well.



Opensiteexplorer finds a few backlinks in sync with Google Webmaster tool data

Now - with mysql - I found 44 overlapping lines between the Opensiteexplorer backlink information and the top 1000 backlink domains Google is giving us for support.dell.com. Not a whole lot.

I'd love to hear some comments on this!

Description of scripts, tables to compare backlink data

No complicated scripts, just generating the db and tables was not that easy, I have to admit.

The biggest challenge was to import the various formats in which the data comes from the different sources, which meant lots of manual adjusting. It is also pretty annoying that companies like Google use the comma as thousands separator in large numbers (1,000) - but I found a post on stackoverflow that helped me deal with it. Basically the data for that field is read into a variable, and then, with a 'set' while importing, the value comes from that variable with a 'replace' of the comma.



File names and folders are removed, but we have some super long urls on our site - so some fields are loooooooong, you might want to reduce that if you have shorter urls and lots of pages.

Generate tables with the top external links from Google Webmaster tools and Bing Webmaster tools:
create table if not exists supportdellcom_gwt_sourcelinksdomains_T (
id int auto_increment not null primary key,
Source_URL varchar(2000) not null,
Link_Number int(15),
Linked_Pages int(5),
datasource varchar(10)
);
load data local infile 'path/supportdellcom_google_LinkingDomains.csv' into table supportdellcom_gwt_sourcelinksdomains_T
fields terminated by ',' optionally enclosed by '"' lines terminated by '\n' ignore 1 lines
(
Source_URL,
@var1,
@var2
)
set Link_Number = replace(@var1, ",","") , Linked_Pages = replace(@var2, ",","") , datasource = "Google";
 create table if not exists supportdellcom_bing_links_urls_T (
id int auto_increment not null primary key,
Target_URL varchar(2000) not null,
source_URL varchar(2000),
Anchor varchar(2000),
datasource varchar(10)
);

load data local infile 'c:/Users/Andreas_Voss/Documents/linkbuilding/2014/data/bing-support-euro-dell-com-sitelinks_6_20_2014.csv' into table supportdellcom_bing_links_urls_T
fields terminated by ',' optionally enclosed by '"' lines terminated by '\n' ignore 1 lines
(
Target_URL,
source_URL,
Anchor
)
 set datasource = "bing";

And this is the actual comparison of the backlink domains against the Google (GWT) data - the table in the first create statement (here the Moz table) decides which tool is compared:

create temporary table if not exists url_T  (select mid(Source_URL, (locate("//", Source_URL)+2)) as fullurl from support_moz_t );
create temporary table if not exists fulldomain_T (select count(*) as count, left(fullurl,(locate("/",fullurl)-1)) as domain from url_T group by domain);
select * from fulldomain_T order by count desc;
create temporary table if not exists compare_T (select fulldomain_T.domain, fulldomain_T.count as fulldomaincount, supportdellcom_gwt_sourcelinksdomains_t.Source_URL, supportdellcom_gwt_sourcelinksdomains_t.Link_Number ,
supportdellcom_gwt_sourcelinksdomains_t.Linked_Pages from supportdellcom_gwt_sourcelinksdomains_t join fulldomain_T  on supportdellcom_gwt_sourcelinksdomains_t.Source_URL
like concat('%',fulldomain_T.domain) group by Source_URL);
This next bit is necessary because domains can end the same but still not be equal. I still need to find a better way (perhaps with a regex instead of the 'like') - but for now it works:
select * from compare_T where domain = Source_URL; 
Why not use something like 'where Source_URL in (select Source_URL from the other table)'? Reason is... that takes forever, kind of.



Wednesday, July 30, 2014

Send money per Gmail

Most interesting feature - I know quite a few companies have been working on this, and have read a bit about Google planning financial services, but it still came as a surprise when this popped up in my gmail today:



It looks like Google offers to send money per email to any recipient.
And the transfer seems to be free with a debit card or when linked to a bank account.
It again provides verification - real world verification, even much stronger than a phone number! And it will also make it easier to purchase through Google - giving Google fees in the process.



Thursday, July 10, 2014

Google ngrams - see the rapid increase for Google



Ngrams has data on word occurrences in books over a long timespan (1800 to 2008). A 1gram is just the word itself, a 2gram a two word combination and so on.

What's closer than trying this on mentions of search engines in books? The results pretty much show the rapid rise of mentions of 'Google' compared to others like Bing, Baidu, Hotbot, Altavista, Wolfram Alpha (with the last two not even being found).

Take a look:


Definitely going to use this more!

Tuesday, July 1, 2014

Mysql - cautious with formatted numbers!

WATCH IT when using formatted numbers in mysql. This kept me busy for a little while; I did not expect this to happen and had to do some searching and trying to find the root cause:

The query for these two is the same, but the first calculation is with formatted numbers, the second with unformatted ones. What a difference (and what a "curious" behavior in mysql): -172306 vs 97

mysql calculation with 'formatted' number

mysql calculation with not formatted numbers



Just to show the queries in case:
set @one=(select format(count(distinct(Target_URL)),0) from supportdellcom_inbound_links_urls_T);
set @two=(select format(count(distinct(Target_URL)),0) from supportdellcom_backlinks_t);
set @three=(select count(*) from (select Target_URL from supportdellcom_backlinks_t
union
select Target_URL from supportdellcom_inbound_links_urls_T) x)  ;
select (@one + @two - @three);

set @four=(select count(distinct(Target_URL)) from supportdellcom_inbound_links_urls_T);
set @five=(select count(distinct(Target_URL)) from supportdellcom_backlinks_t);
set @six=(select count(*) from (select Target_URL from supportdellcom_backlinks_t
union
select Target_URL from supportdellcom_inbound_links_urls_T) x)  ; # the x is the required alias for the derived table - every subquery in FROM needs one
select concat("number of the same target urls majestic to google is: ", (@four + @five - @six));

Thursday, June 19, 2014

Tornado Warning in Google Results

Interesting, helpful example for a part of the knowledge graph:




















The overview maps on google.org were very helpful, too, although I did not check their timeliness - which is key.
I did not take a screenshot of the map, but it also has a good collection of additional links and info (and the alerts active at the time of looking at it).

The map is also highly configurable (the screenshot shows only a part of the options). Very interesting - it looks like a high-level mashup of various sources with maps.



Thursday, June 12, 2014

Finally: free to use Sitemap Generator with automated 'priority' configuration

Easy to use - at least in a linux environment. Not sure if adjustments are necessary for various (bash) flavors, same for *nix emulators like cygwin. Sitemaps can have up to 50,000 links - so this should be sufficient for many sites.

First, check if the script is called with a file - ideally the list of urls from an earlier scan with one of the scanners on this site, or from somewhere else, perhaps a content management system. Then grab the file name without the extension and add a random number to avoid overwriting existing files.

if [ -z "$1" ]
then echo "needs a file to start with"; exit 1
else
file=$(basename "${1}")
name=${file%.*}-$RANDOM.txt
fi
Clean up the file: remove empty lines, then lines that start with an underscore or hyphen - errors I found a few times.
sed -i '/^$/d' "${1}"
sed -i '/^[_\|-]/d' "${1}"

Echo the header into the new file - xml definition for a sitemap.
echo ""  >  "${name}"
Then, check if the url has http:// in front; if not, add it, otherwise the following calculations get tricky.
Next, count the number of "/" and subtract the two slashes that belong to the http://. The 'gsub' returns the number of replacements - and that's all I need here, so I just replace a slash with a slash.

The next step calculates the priority based on the depth in the folder structure - the lower in hierarchy, the lower in priority. This is a setting that might need to be changed, depending on the structure of a site.
awk ' $1 !~ /^http/ { $1 = "http://"$1 }  { count=(gsub("/" , "/" , $1)-2) } count > 9 { print $1, "0.1" } ( count > 0 && count <=9 ) { print $1, (10-count)/10 } count <= 0 { print $1, "1" }' $1 | while read input1 input2
The little sed inset is then used to remove the additional decimals - this was the easiest way to do this.
Then just echo it into the overall structure of the sitemap entries, then end the while loop with done.
do priorityvalue=$(echo $input2 | sed 's/^\(...\).*/\1/')
echo -e "$\n$input1\n$priorityvalue\n" >> "${name}"

done 
Finally, add the sitemap footer per definition and show the result
echo "
" >> "${name}"cat $name

looks like this:

<?xml version="1.0" encoding="UTF-8"?><urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>http://www.tage.de</loc>
<priority>1</priority>
</url>
<url>
<loc>http://www.directorinternet.com/</loc>
<priority>0.9</priority>
</url>
<url>
<loc>http://andreas-wpv.blogspot.com/2014/05/new-google-schema-implementation-for.html</loc>
<priority>0.7</priority>
</url>
</urlset>

Same script on Dropbox for easy use.

Thursday, June 5, 2014

Content ownership - is Google using 3rd party site content for Ad revenue on their search results page?

Look at this:



Google shows content taken from another  website (mentioned in a miniature link below the content).

Would you click through or not?

These are the current results:

With this content taken from the 3rd party site and shown on Google - is there still a need to go to the other site? How much is this going to affect the traffic on the other website? Is Google monetizing other sites' content with their ads on the search results pages?

The site is tagged nicely, having the right descriptors in place.

I tested the site in Google's markup tester, and the markup for recipe and author works (Download).
The tool also shows a very different picture as a preview - misleading, in case an author tries to see what others are likely to see in Google.

Many sites rely on ad revenue to finance their operations and content - this will become impossible if the above becomes more common. With this revenue taken away from the other sites, it seems as if Google is cutting off the branch on which they are sitting. And it very much seems like copyright infringement to me (but I am no lawyer and might be wrong).

the search - the result screen (top) is from June 1, 2014

Monday, June 2, 2014

Big social platform shares - added Stumble Upon likes, shares, lists

And another script to check for social shares... this time with more details around stumble upon (no, I am not really active there anymore, although it has led me to a few outstanding sites). The older version of this script pulls stumble data too, but that shows pageviews from stumblers, not amplification through stumble activity. So it was mixing traffic data with engagement data - fixed now.



The data for stumble is pulled in two steps - first get the badge ID for the url, and with that pull the data (likes, shares, lists). They all go into just one variable below, tab delimited, and then via echo into the larger list. That's why there are not 3 fields immediately visible when the data is added, but seemingly just one.


echo -e "Url\tripples\tFB-comments\tFB-shares\tFB-likes\ttwitter\tlinkedin\tstumble_likes\tstumble_shares\tstumble_lists" > "${1}-all-social-shares".csv
while read -r shortline ;
 do
This line replaces special characters with their url encoding...
line=$(echo $shortline | sed -e 's/\?/\%3F/g' -e 's/&/\%26/g' -e 's/#/\%23/g')
gpull="https://plus.google.com/ripple/details?url=${line}"
ripples=$(wget -qO- "${gpull}" | grep -o "[0-9]*\s*public\s*shares.<" | sed "s/[^0-9]//g"  | tr "\n" "\t" | sed 's/\thttp/\nhttp/g'| sed 's/\t//')
commentpull="https://api.facebook.com/method/fql.query?query=select%20comment_count%20from%20link_stat%20where%20url=%22${line}%22&format=json"
comment_count=`wget -qO- $commentpull | sed -e 's/^.*://g' -e 's/\}//g' -e 's/\(]\)//g'`
echo "comment count: " $comment_count
sharepull="https://api.facebook.com/method/fql.query?query=select%20share_count%20from%20link_stat%20where%20url=%22${line}%22&format=json"
share_count=`wget -qO- $sharepull | sed -e 's/^.*://g' -e 's/\}//g' -e 's/\(]\)//g'`
echo "share count: " $share_count
likepull="https://api.facebook.com/method/fql.query?query=select%20like_count%20from%20link_stat%20where%20url=%22${line}%22&format=json"
like_count=`wget -qO- $likepull | sed -e 's/^.*://g' -e 's/\}//g' -e 's/\(]\)//g'`
echo "like count: " $like_count
twitterpull="http://urls.api.twitter.com/1/urls/count.json?url=${line}&callback=twttr.receiveCount"
twitternumber=$(wget -qO- "${twitterpull}"  | grep -o 'count\":[0-9]*\,' | sed -e 's/count//g' -e 's/,//g' -e 's/://g' -e 's/"//g' )
echo "twitter: " $twitternumber
linkedpull="http://www.linkedin.com/countserv/count/share?format=json&url=${line}" #echo ${linkedpull}
linkednumber=$(wget -qO- "${linkedpull}" | grep -o 'count\":[0-9]*\,' | sed -e 's/count//g' -e 's/,//g' -e 's/://g' -e 's/"//g' )
echo "linkedin count: " $linkednumber
stumblepull=$(wget -qO- "http://www.stumbleupon.com/services/1.01/badge.getinfo?url=${line}" | grep -o "publicid.*views" | sed -e "s/publicid//" -e "s/\":\"//" -e "s/\",\"views//")
echo $stumblepull
stumblenumber=$(wget -qO- "http://www.stumbleupon.com/content/${stumblepull}" | grep "mark>" | sed -e "s/^.*\">//" -e "s/<.*$//" -e "2d" -e "4d" -e "6d" | tr "\n" "\t")

echo -e "${line}\t${ripples}\t${comment_count}\t${share_count}\t${like_count}\t${twitternumber}\t${linkednumber}\t${stumblenumber}" >> "${1}-all-social-shares".csv
done < $1 
cat -A "${1}-all-social-shares".csv

Wednesday, May 21, 2014

SeoClarity - Certification

SeoClarity has a new certification program - a nice way to connect with users and engage them. This is also nice to have on a resume, and I am looking forward to more discussions with the great team and other seoclarity clients. We use them as our enterprise seo metrics platform.


I use the tool and services a lot - it has a lot of features that help with the super large scale of our dell.com website. I especially like and use the sitescans - and download many GB of results data, then crunch them in mysql, R, or just with awk and sed in bash.

Thursday, May 15, 2014

Scraper for video pages to get all data for video sitemap

This scraper is based mainly on opengraph tags (which are used by Facebook, for example), so it should work well with many pages, not just Dell.com pages. More info on sitemaps at Google.

#!/bin/bash
#check if its called with a filename - a file containing urls for pages with videos
if [[ ! $1 ]] ; then
echo "need to call with filename"
exit 1
else
 #now make sure to have a unique filename, based on the file with the urls
filename=$(basename $1)
name=${filename%.*}-$RANDOM-
#header info for the file with the results. Need to pull page url, video title, thumbnail url, description.
echo -e 'url\tpage\tTitle\tThumb\tDescription' > ${name}video-sitemap-data.txt
#loop through the file and store in a variable
while read -r line; do
filecontent=$(wget -qO- "$line")
# echo results and clean up with sed, tr and grep, then append to the file that has the column headers already. It has 4 elements - and each is isolated in its own part. The parts are connected with &&, and everything in ( and ) - otherwise it only echos the last part into the file.
(echo "$line" | sed 's/\r$/\t/' | tr '\n' '\t'  && echo "$filecontent" | grep "og:video" | grep "swf" | sed -e "s/^.*content=\"//" -e "s/\".*$//" | sed 's/\r$/\t/' | tr '\n' '\t' && echo "$filecontent" | grep "og:title" | sed -e "s/^.*content=\"//" -e "s/\".*$//" | sed 's/\r$/\t/' | tr '\n' '\t' && echo "$filecontent" | grep "og:image" | sed -e "s/^.*content=\"//" -e "s/\".*$//" | sed 's/\r$/\t/' | tr '\n' '\t' && echo "$filecontent" | grep "og:description" | sed -e "s/^.*content=\"//" -e "s/\".*$//") >> ${name}video-sitemap-data.txt
done < "$1"
fi
I'd be delighted to know this helped someone else - why don't you drop me a note when you do?

This is one of the pages that I used for testing, just in case someone wants to test this:
http://video.dell.com/details.php?oid=RraTJsZDrsDKFAjNcj2WmTovA1ovugc-&c=us&l=en&s=bsd&p=learn .
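A quick test run with that page - the script file name here is just a placeholder for wherever the code above is saved:
echo "http://video.dell.com/details.php?oid=RraTJsZDrsDKFAjNcj2WmTovA1ovugc-&c=us&l=en&s=bsd&p=learn" > testpage.txt
bash video-sitemap-scraper.sh testpage.txt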

Monday, May 12, 2014

Google Webmaster tools - import into mysql

For a research project I am integrating all the data I can get my hands on regarding a site (techpageone.dell.com).

Majestic, seoclarity, omniture, maxamine, etc. - and for sure, Google webmaster data (and Bing).

I get this data from the 'top pages' in GWT:




Page             Impressions  Change  Clicks  Change  CTR  Change  Avg. position  Change
http://pageurl/  22,577       19%     3,959   15%     18%  -0.5    11             -0.1

So far, so good (and great numbers, team!). When setting up the table, I am not able to use any number format and end up using varchar(10) instead of int or decimal, because with a number format I always get 'data truncated' errors. Even with quite some online research I could not find a better way to do this. I change the number format in impressions to drop the comma, and then use string fields. Any idea how I could make this work with number formats?
Likely int for impressions, ??? for the change in %, int for clicks, and decimal for the CTR change - but how? One possible route is sketched below.
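One possible route - not tested against this exact export, and the file name below is a placeholder: clean the numbers in the csv first, then the columns can probably be declared int / decimal. The same could also be done during the import itself, extending the replace(@var, ',', '') trick from the linking-domains load in the earlier post with a second replace for the '%' sign.
infile='supportdellcom_google_TopPages.csv'   # placeholder name for the GWT 'top pages' export
# drop thousands-separator commas inside quoted numbers ("22,577" -> "22577"), then drop
# percent signs that follow a digit before a comma or end of line (assumes the % values
# are not quoted and that commas inside quotes are only thousands separators)
sed -e ':a' -e 's/\("[0-9][0-9]*\),\([0-9]\)/\1\2/' -e 'ta' \
    -e 's/\([0-9]\)%,/\1,/g' -e 's/\([0-9]\)%$/\1/' \
    "$infile" > "${infile%.csv}-clean.csv"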

These are the other scripts to create the table and to load the data:





Thursday, May 1, 2014

New Google Schema implementation for Phone Numbers - little helper for lots of phone numbers

Google is showing phone numbers now for company searches

Nope, not happy about the fact that Google is showing phone numbers for company searches - but not for phone searches. How much sense does that make?

I prefer online, usually, and call only as a last resort. Calls take much longer, and sometimes it is hard to hear or understand what the others say. And sometimes I find it incredibly aggravating to be on a call for hours to fix something that the company I am calling "broke" in the first place - or did not prevent. (Although I know there are many companies involved in producing one product.) And I also know most companies (including the one I work for) really try hard to give good online and phone support.

And now Google is showing support phone numbers for companies - no matter if it is an online only company.  And they show them as part of the knowledge graph - but NOT when someone searches for a phone number. How smart is that.

And they are currently not showing for Chromebooks or for Google - how fair is that?

The schema documentation works just fine, watch for this:
  • the visible number on the page needs to be the same as the one in the json / schema part
  • for multiple contact points for the same organization, that part of the code can be iterated, comma separated
  • phone numbers for several countries can be integrated on one page. 

Script for many phone numbers:

Ok, so I had to update ~ 80 country phone numbers, and got just one, not very good, list of phone numbers including the international calling codes. First I started concatenating in my beloved MSFT Excel, but that's no fun, especially since quotes need to be escaped (doubled), so I ended up with a lot of double quotes - which I then had to remove in a later step. And this needed to be done quickly once I had the final list, which made me write this little script.

This script loops through the list of phone numbers with awk, writes the two parameters into variables, and fills the info in the schema snippet. As a result, the script generates one schema.org snippet for each country and phone number.

Three files are needed: the script (shown at the end), a list of phone numbers, and the schema snippet for this kind of phone number.

This is the schema info we were going to use, where I used two placeholders - changenumber and changearea (a generic stand-in is sketched right below):
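The real snippet carries the actual company details; a generic stand-in built around schema.org's ContactPoint, with the two placeholders the script swaps out, could look like this - the organization url and contactType are example values, and mapping changearea onto areaServed is an assumption:
cat > phone-code-snippet <<'EOF'
<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "@type": "Organization",
  "url": "http://www.example.com",
  "contactPoint": [{
    "@type": "ContactPoint",
    "telephone": "changenumber",
    "contactType": "customer service",
    "areaServed": "changearea"
  }]
}
</script>
EOF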
The list of phone numbers looks like this, and is tab delimited:



And now the script:
awk '{print $1, $2}' tab-phone-numbers | while read input2 input1
do
awk -v var1="${input1}" '{ sub("changearea",var1)} {print $0}' phone-code-snippet |  awk -v var2="${input2}" '{sub("changenumber",var2);print$0}'>> phone-results
done 
What really helped me do this was the trick of using read to get the output into variables. I found this first here (thanks duckeggs01) and confirmed it on my trusted source for man pages online - ss64. I am sure I'll use this more; very quick, easy, efficient.
The first part is a loop with awk through the numbers, assigning each field value to a variable. Then two awk commands are nested - each assigning the external variable to an internal one. The first awk replaces the first placeholder with the first variable, piped into a second awk with the second variable replacing the second placeholder ... done.

In this first run the script produced one complete snippet per country - good if you need the snippets on several pages. For usage on one page, the template for the schema needs to be changed to only include the contact point info; then run the script, and add the header / footer to the whole section afterwards.

Monday, April 28, 2014

Moz' post on 'the greatest misconception in content marketing'

Rand Fishkin from Moz.com knows what he is talking about, and here is the proof for everyone to see. (Disclaimer - I read the transcript)

Let's cite a few core items from Rand, talking about what people and companies do wrong when using blogs for their content strategy:

"So they do a few things that are really dumb. They don't take this piece of content and put links to potentially relevant stuff on their site inside there, and they don't internally link to it well either. So they've almost orphaned off a lot of these content pieces.
You can see many people who've orphaned their blog from their main site, which of course is terrible. They'll put them on subdomains or separate root domains so that none of the link authority is shared between those.
They don't think about sharing through Google+ or building an audience with Google+, which can really help with the personalization. Nor do they think about using keywords wisely. "

So true, and in the past weeks I have found that Moz' strategy really works out well. I have compared several company blogs and news sites ( multiple blogs in magazine form) on how the content performs in social media engagement - sharing on Google Plus, Facebook, twitter, Linkedin. 

Then I took the numbers and calculated a 'social share per piece of content'. Now guess who's the blue line?




Exactly, that's Moz' company blog. Rand and his team are doing it right, as far as anyone can see from the outside. 
The numbers are impressive:



The second-to-last row is the Moz blog; the other rows are big company content sites. The last row is not integrated into the graph - someone is likely using automation to push their results, as explained here.

I cannot agree more with the Moz article on:
  1. Link from and especially towards the blogposts
  2. integrate the content blog into the on site area for this content (we call it a 'hub')
  3. Engage on Google Plus around this content, constantly. 

Wednesday, April 23, 2014

Someone has lots of posts with the same amount of shares, comments, linkedin post - how does this happen?

One of my favorite projects right now is to check how competitors (to our Techpageone) are performing with their content - and one aspect of that is the social media engagement their content receives.

This time I looked at a different competitor; their site is targeted at medium-sized companies, if I am not mistaken.

As usual, I get a list of urls with a little script; after a bit of cleanup, the list has 5326 urls in it.
Then I run a second script to check for social shares on Google Plus, Facebook, Twitter, Stumble Upon and Linkedin. Then I sort - and that's where it got really interesting: see all the duplicate numbers in FB and Linkedin? This does not look like an organic, natural result at all. (Twitter and ripples are different enough - if very low.) A quick way to surface those repeats is sketched below.
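A one-liner to see how often each FB-shares value repeats, assuming the tab separated output of the share-checking script with FB-shares in column 4 (the file name is a placeholder):
cut -f4 competitor-all-social-shares.csv | sort -n | uniq -c | sort -rn | head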




So, is the tool working? Let's do a sanity check, where I compare the numbers with numbers from sharetally.com. Share Tally checks a lot more (small) social media platforms, but includes all the platforms my script checks, so its numbers should never be lower, only sometimes slightly higher. Works pretty well, as you can see:

Is the tool stuck and cannot retrieve numbers? Sanity check no. 2, this time just different urls from the same scan:



Nope, works just fine.

Manual check:
The pages do not have a consistent (if any) canonical; they seem to be genuinely different articles from various authors, with various dates.

What, please, can cause this share pattern?
These are the potential causes I can imagine (help me if you see more!):
  • Coincidence 
  • Highly disciplined workforce sharing over and over + great internal process
  • Paid sharing 
  • Automated sharing 
What am I missing? What do you think is most likely?

Thursday, April 17, 2014

Script to check social shares: Facebook comments, shares and likes, tweets, G+ ripples, linkedin and stumbles


Social shares on Facebook, Linkedin, Twitter and Google Plus are relevant - for users, and we assume for rankings in search engines as well (likely not directly, but indirectly through user behavior). There's an earlier post checking a few platforms - this one covers more than ever! %-)

This is the version I use most - it covers the biggest relevant platforms (here in the US), and pulls at the right speed. With my current internet connection, it pulls just slowly enough to not trigger any blocks from the APIs.

I also realized it is easiest to use a random file name, rather than pulling some info in from the site scanned. The url list can come from a sitemap scan or from a sitescan - or any other url list.

This is what the results look like:


I usually sort by each column to filter out 'outliers' (see the example below), and then use the results.
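For example, to sort by the twitter column (column 6, tab separated) and spot the outliers - the file name is whatever "$1"-all-social-shares.csv resolves to for your url list:
sort -t$'\t' -k6,6nr urls.txt-all-social-shares.csv | head -20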

This is the script:

#!/bin/bash
# set all variables to zero at the beginning! then value >, replace
# check where the tabs come from (likely one tab too many in one of the echos?)
rm -f "${1}-all-social-shares".csv

echo -e "Url\tripples\tFB-comments\tFB-shares\tFB-likes\ttwitter\tstumble_upon\tlinkedin" > "${1}-all-social-shares".csv
while read -r shortline ;

 do
line=$(echo $shortline | sed -e 's/\?/\%3F/g' -e 's/&/\%26/g' -e 's/#/\%23/g')
echo $shortline
echo $line
 number=0
 re='[0-9]+'
gpull="https://plus.google.com/ripple/details?url=${line}"
ripples=$(wget -qO- "${gpull}" | grep -o "[0-9]*\s*public\s*shares.<" | sed "s/[^0-9]//g"  | tr "\n" "\t" | sed 's/\thttp/\nhttp/g'| sed 's/\t//')

commentpull="https://api.facebook.com/method/fql.query?query=select%20comment_count%20from%20link_stat%20where%20url=%22${line}%22&format=json"
comment_count=`wget -qO- $commentpull | sed -e 's/^.*://g' -e 's/\}//g' -e 's/\(]\)//g'`

sharepull="https://api.facebook.com/method/fql.query?query=select%20share_count%20from%20link_stat%20where%20url=%22${line}%22&format=json"
share_count=`wget -qO- $sharepull | sed -e 's/^.*://g' -e 's/\}//g' -e 's/\(]\)//g'`

likepull="https://api.facebook.com/method/fql.query?query=select%20like_count%20from%20link_stat%20where%20url=%22${line}%22&format=json"
like_count=`wget -qO- $likepull | sed -e 's/^.*://g' -e 's/\}//g' -e 's/\(]\)//g'`

twitterpull="http://urls.api.twitter.com/1/urls/count.json?url=${line}&callback=twttr.receiveCount"
twitternumber=$(wget -qO- "${twitterpull}"  | grep -o 'count\":[0-9]*\,' | sed -e 's/count//g' -e 's/,//g' -e 's/://g' -e 's/"//g' )

stumblepull="http://www.stumbleupon.com/services/1.01/badge.getinfo?url=${line}"
stumblenumber=$(wget -qO- "${stumblepull}" | grep -o 'views\":[0-9]*\,' | sed -e 's/views//g' -e 's/,//g' -e 's/://g' -e 's/"//g' )

linkedpull="http://www.linkedin.com/countserv/count/share?format=json&url=${line}" #echo ${linkedpull}
linkednumber=$(wget -qO- "${linkedpull}" | grep -o 'count\":[0-9]*\,' | sed -e 's/count//g' -e 's/,//g' -e 's/://g' -e 's/"//g' )

#echo -e "$line\t$value\t$comment_count\t$share_count\t$like_count" >> "$1-all-public-shares".txt
echo -e "${line}\t${ripples}\t${comment_count}\t${share_count}\t${like_count}\t${twitternumber}\t${stumblenumber}\t${linkednumber}" >> "${1}-all-social-shares".csv


done < $1



Tuesday, April 8, 2014

Scan site for list of urls - use in sitemap or for other scans

This is the sixth version (some other versions) of this scanner I use - I find it quite practical, as it allows me to scan folders easily and come back with a good number of urls.

It makes header requests and stores the responses in a textfile. Once the scan stops, the script cleans that out to only have the 200 responses in a separate file, then filters again to only have the urls in the final file.
I use a random number for the filename; since I use these in a special folder, this lets me not worry about incompatible characters in the filename, filtering out the basename, or duplicate filenames / overwriting files in case I run a scan several times - on purpose or not. I keep the intermediate textfiles so I can go back and check where something went wrong. Every now and then I clean up the folder.

#!/bin/bash
url="$1"
echo $url
sleep 10

name=$RANDOM

wget --spider -l 10 -r -e robots=on --max-redirect 1  -np "${1}" 2>&1 | grep -e 'http:\/\/' -e 'HTTP request sent' >> "$name"-forsitemap-raw.txt

echo $name

grep -B 1 "200 OK" "$name"-forsitemap-raw.txt > "$name"-forsitemap-200s.txt
grep -v "200 OK" "$name"-forsitemap-200s.txt > "$name"-forsitemap-urls.txt
sed -i "s/^.*http:/http:/" "$name"-forsitemap-urls.txt
sort -u -o"$name"-forsitemap-urls.txt "$name"-forsitemap-urls.txt
cat -n "$name"-forsitemap-urls.txt


Thoughts? Feedback?

Thursday, April 3, 2014

Pull data for video sitemap

Video sitemaps are sometimes helpful to make search engines aware of videos on a site. We use several systems to generate pages with videos, and as a result it is not easy to get information from the back-end to generate sitemaps. So - like Google - we have to take it from the front end as much as possible. This script, with its details, is likely limited to just Dell.com, and even here I have found that videos in some sections cannot be picked up by it. Still, this has been extremely helpful to find the 'hidden' details of our video implementations. (And yes, we have had requirements in the works to change these for a while :-) ).

Elements necessary for a sitemap are:
  1. Pageurl
  2. Title
  3. Keywords
  4. Description
  5. Video URL
  6. Thumbnail url
And this script pulls it nicely off many of our video pages. (We use open graph tags, which makes it relatively easy to pull most of the info.) The script needs to be called with the filename of the text list of urls as the first parameter (. script.sh listofpages.txt).
if [[ ! $1 ]] ; then
echo "need to call with filename"
exit 1
fi
file=$RANDOM-sitemap-data.txt
echo $file
echo -e "url\tTitle\tthumbnail\tdescription" > $file
while read -r line; do
filecontent=$(wget -qO- "$line")
wait
(echo "$line" | sed 's/\r$/\t/' | tr '\n' '\t'  && echo "$filecontent" | grep "og:video" | grep "swf" | sed -e "s/^.*content=\"//" -e "s/\".*$//" | sed 's/\r$/\t/' | tr '\n' '\t'  && echo "$filecontent" | grep "og:title" | sed -e "s/^.*content=\"//" -e "s/\".*$//" | sed 's/\r$/\t/' | tr '\n' '\t' && echo "$filecontent" | grep "og:image" | sed -e "s/^.*content=\"//" -e "s/\".*$//" | sed 's/\r$/\t/' | tr '\n' '\t' && echo "$filecontent" | grep "og:description" | sed -e "s/^.*content=\"//" -e "s/\".*$//" )  >> $file
done < "$1"
cat -A "$file"


As always - I use this, and would love to hear tips to improve or see other scripts for site optimization and maintenance.

