Monday, August 25, 2014

Backlink Analysis - Tools Insights: Generating database and tables, data import from Google Webmaster tools and open site explorer

Many SEO professionals are checking backlinks to their sites - including us. We use a variety of tools mainly out of necessity to get as much info as possible rather than relying on just one tool.

Last year I compared backlinks from Google webmaster tools, Moz (opensiteexplorer), Bing webmaster tools and Majesticseo, and found nearly zero overlap between the links shown from each tool. I used awk and excel last year - this time it's mysql, as I use the data in these tables for other purpose as well.

Opensiteexplorer finds a few backlinks in sync with Google Webmaster tool data

Now - with mysql - I found 44 lines with overlap - for Opensiteexplorer backlink information to the top 1000 backlink domains Google is giving us for Not a whole lot.

I'd love to hear some comments on this!

Description of scripts, tables to compare backlink data

No complicated scripts, just generating the db and tables was not that easy, I have to admit.

Biggest challenge was to import the various formats in which all the data comes from the different sources, meaning lots of manual adjusting. It is also pretty annoying that companies like Google use the comma separator in large numbers (for 1,000) - but I found this post on stackoverflow that helped me deal with it.  Basically the data for that field is read into a variable, and then with a 'set' while importing the value comes from that variable with a 'replace' of the comma.

File names and folders are removed, but we have some super long urls on our site - so some fields are loooooooong, you might want to reduce that if you have shorter urls and lots of pages.

Generate tables with the top external links from Google Webmaster tools and Bing Webmaster tools:
create table if not exists supportdellcom_gwt_sourcelinksdomains_T (
id int auto_increment not null primary key,
Source_URL varchar(2000) not null,
Link_Number int(15),
Linked_Pages int(5),
datasource varchar(10)
load data local infile 'path/supportdellcom_google_LinkingDomains.csv' into table supportdellcom_gwt_sourcelinksdomains_T
fields terminated by ',' optionally enclosed by '"' lines terminated by '\n' ignore 1 lines
set Link_Number = replace(@var1, ",","") , Linked_Pages = replace(@var2, ",","") , datasource = "Google";
 create table if not exists supportdellcom_bing_links_urls_T (
id int auto_increment not null primary key,
Target_URL varchar(2000) not null,
source_URL varchar(2000),
Anchor varchar(2000),
datasource varchar(10)

load data local infile 'c:/Users/Andreas_Voss/Documents/linkbuilding/2014/data/bing-support-euro-dell-com-sitelinks_6_20_2014.csv' into table supportdellcom_bing_links_urls_T
fields terminated by ',' optionally enclosed by '"' lines terminated by '\n' ignore 1 lines
 set datasource = "bing";

And this is the actual comparison of the backlinks given by Google vs. Bing:

create temporary table if not exists url_T  (select mid(Source_URL, (locate("//", Source_URL)+2)) as fullurl from support_moz_t );
create temporary table if not exists fulldomain_T (select count(*) as count, left(fullurl,(locate("/",fullurl)-1)) as domain from url_T group by domain);
select * from fulldomain_T order by count desc;
create temporary table if not exists compare_T (select fulldomain_T.domain, fulldomain_T.count as fulldomaincount, supportdellcom_gwt_sourcelinksdomains_t.Source_URL, supportdellcom_gwt_sourcelinksdomains_t.Link_Number ,
supportdellcom_gwt_sourcelinksdomains_t.Linked_Pages from supportdellcom_gwt_sourcelinksdomains_t join fulldomain_T  on supportdellcom_gwt_sourcelinksdomains_t.Source_URL
like concat('%',fulldomain_T.domain) group by Source_URL);
This next bit is necessary because domains can end the same, but still not be equal. Still need to find a better way (perhaps with a regex instead of the 'like' - but for now it works)
select * from compare_T where domain = Source_URL; 
Why not use something like 'where Source_URL in (select Source_URL other table)? Reason is... that takes forever, kind of.

Bookmark and Share