Showing posts with label seo. Show all posts

Monday, April 17, 2017

New top 1 million websites list

Large lists of websites:


Alexa 1 million

Alexa's list of the top 1 million sites has helped me many times to run larger scans over homepages on the top sites, and to compare setup and speed depending on how much traffic a site gets, and similar.

Now Alexa has officially been discontinued, as far as I know, although the file is still available (right now, at least).
As it is based on browser plugins to track visits, it might have a different mix than the newer OpenDNS list.
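The list ships as a zipped CSV with "rank,domain" rows. A minimal sketch for pulling the top-N domains out of it - the function name `extract_top` is mine, and it assumes the file was already downloaded and unzipped to top-1m.csv:

```shell
# The Alexa file is a CSV with rows like "1,google.com".
# extract_top is a hypothetical helper: keep the first N rows,
# drop the rank column, keep only the domain.
extract_top() {   # $1 = csv file, $2 = number of rows to keep
  head -n "$2" "$1" | cut -d, -f2
}
# usage: extract_top top-1m.csv 1000 > top1000.txt
```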

OpenDNS 1 Million


This one is based on over 100 billion DNS requests per day, so it is not limited to http/https requests - but read the details for yourself.
Even older versions are available, which practically invites time-series analysis.


MajesticSeo 1 Million

Majestic 1 Million - another company seeing the potential of such a list and publishing one. It is unclear whether it is maintained or when the last update happened, but it seems a good list, with a nice web interface, too.

Quantcast 1 Million

This one has been around for a while, but it's unclear whether it is regularly maintained and how exactly it is generated. It works very similarly to Alexa, which makes it a convenient alternative. Here's the page linking to the download; testing just now, that download is not available anymore, but this link is working (zip) as of today.

Statvoo 1 Million

Another offering that seems new - and like Alexa, the site has a very nice categorization, but the samples are tiny (20 sites) with no download option for those; a download is only offered for the full top 1 million.


Tuesday, September 20, 2016

How common is Cloaking? (Showing different content to search crawlers than to users)

Google often points out that sites should not show significantly different content to users and to search crawlers, and threatens penalties. Cloaking can hurt ranking accuracy, and Google also needs to see differences in files to update its index (not mentioned by Google, but likely).

An easy way to see if cloaking happens is to compare variations of a page downloaded with different user agents (curl, googlebot, firefox, ...) and then compare the md5 hashes of the versions. If a site sends different data only to crawlers (and not based on other user agents, IE vs. FF for example), this indicates that the site might be cloaking.
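The comparison step can be sketched like this - `hash_compare` is a hypothetical helper, and it assumes each fetch was already saved to its own file, one per user agent:

```shell
# Sketch of the hash comparison. Each page version is assumed to have
# been saved beforehand, e.g.:
#   curl -s -A "Googlebot/2.1 (+http://www.google.com/bot.html)" "$url" > googlebot.html
hash_compare() {   # $1, $2 = downloads of the same page with two user agents
  h1=$(md5sum "$1" | cut -d' ' -f1)
  h2=$(md5sum "$2" | cut -d' ' -f1)
  if [ "$h1" = "$h2" ]; then echo "identical"; else echo "differs"; fi
}
```

A "differs" for the crawler agent while all browser agents are "identical" is the interesting signal.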

How do some categories of sites fare in this?


  1. Spamlist 2 (as explained previously) is basically a list of 2336 blogs on different large scale blog platforms like blogger, tumblr, wordpress, blogspot that have many attributes that might be indicators for spam.
  2. The earlier spam list has similar results (spam based on industry and competitor related inbound links).
  3. The Alexa* lists use the top 1000 urls, middle 1000 urls and last 1000 urls of the Alexa top 1 million list.


The results sorted by percentage of 'cloaking':


And the table of results, again sorted by 'cloaking' percentage (other combinations make the difference to 100%):


Discussion

It seems one group of spammers differentiates a lot by agent - yet they don't have many setups where only bots see different content. Quite interesting, too, that one spam group and the top Alexa sites are more likely to cloak than other sites. (Again - this considers only how the homepage is delivered, with all the resulting vagueness.)

How to replicate

First - this is severely limited, as it only analyses the homepage, and only the core HTML, not the elements loaded with the page (images, scripts, etc.).

First, generate the lists of spam-like urls / domains. All lists are checked for a 200 status with several variants: the url as is, then as https, then with www, then www with https, plus one test that tries to work around crawler detection. The resulting lists of unique 200 OK urls are used in the next steps.
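The variant expansion can be sketched as below - `variants` is a hypothetical helper generating the four http/https x www/non-www combinations that would then each be probed for a 200:

```shell
# Sketch: expand a bare domain into the four url variants to probe.
variants() {   # $1 = bare domain
  printf '%s\n' "http://$1" "https://$1" "http://www.$1" "https://www.$1"
}
# each candidate would then be probed for its status, e.g.:
#   curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$u"
```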

Download the homepage (just the HTML part) with different user agents (googlebot, bingbot, FF, IE, Chrome, and whatever else pleases your heart). For each download, build the md5 hash of the file and store it in a table.
With awk we can then quickly compare whether a hash is the same or different by user agent, which then just needs to be summarized.
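The awk summary could look like this - the file name hashes.tsv and its "url, agent, md5" column layout are assumptions for illustration, and `distinct_hashes` is a hypothetical helper:

```shell
# Sketch: hashes.tsv is assumed to hold "url<TAB>agent<TAB>md5" rows.
# Count distinct hashes per url; a count above 1 means at least two
# user agents received different content.
distinct_hashes() {   # $1 = tsv file
  awk -F'\t' '{ if (!seen[$1 FS $3]++) n[$1]++ }
              END { for (u in n) print u "\t" n[u] }' "$1"
}
```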

Tuesday, August 16, 2016

Are spam sites using variate testing tools (mbox) more than other sites?

Mboxes are such an interesting topic, and the spam list I used last time was good, but I was not sure how representative it was. So my next two lists of spam domains were generated using Moz' Open Site Explorer. This tool categorizes domains linking to other sites by up to 18 parameters indicative of spam.

Spam domains linking to large sites using mbox

I used a few profiles (root domains like dell.com) from our industry first, and pulled all inbound linking subdomains with the highest spam scores. For most sites, I had to go down to scores of 8 and greater (not 18, 15, or 12), because the link profiles were actually pretty clean for all the large sites checked.

On a total of 4544 linking 'spam' domains I found 196 mbox implementations on the homepage, 4.3 %. This is completely in line with the first set of spam domains last time (4.4%).

Spam domains linking to blogging platforms using mbox

Now, the second set of spam domains comes from the same tool, but this time I looked at the link profiles of the root domains blogger.com, blogspot.com, tumblr.com, and wordpress.com, hoping to catch some bad linkspam from some of the bloggers. But I was surprised how relatively clean these were, too. Not as clean as the large-site link domains, though. Perhaps due to the large number of overall inbound links - the samples are much larger, so I could focus on domains with a spam score of 13 and higher.

On a total of 2336 domains I found 99 mbox implementations on the homepage, 4.2 %. This is completely in line with the first set of spam domains last time (4.4%) and the other spam list.

(At first I ran this with a 5-second timeout and found only 87 sites with mbox. Responding within 5 seconds would be nice for every file, but is perhaps not realistic, so with a 30-second timeout more mboxes showed up.)
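The test itself can be sketched as a small filter - `has_mbox` is a hypothetical helper that reads a homepage's HTML on stdin and reports whether an mbox reference (an mbox.js include or an mboxCreate call) appears anywhere:

```shell
# Sketch: report whether the HTML on stdin contains an mbox reference.
has_mbox() {
  if grep -qi "mbox"; then echo "mbox found"; else echo "no mbox"; fi
}
# usage, mirroring the 30-second timeout (placeholder url):
#   curl -s --max-time 30 "https://example.com/" | has_mbox
```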

All in all this shows that mboxes (for variate testing) are used on all kinds of sites; they are not an indicator of a spam-like site - but not of the opposite either.

Thursday, May 12, 2016

Dell wins key award for their SEO implementation

SEOclarity synergy award for Dell

The company behind one of the largest truly global enterprise SEO platforms (seoClarity) just awarded Dell the 'Synergy' award - out of the more than 3000 brands (!) they cover with their platform.

This highlights the exemplary integration of all teams to optimize the experience for users coming from Search Engines like Google to Dell.com and beyond.

"Dell has demonstrated an award-worthy level of synergy across the product management, development, design, UX, brand and content aspects of the Dell Support Group (DSG)."

Original SeoClarity Press release: http://bit.ly/1WqEL9O

Wednesday, April 6, 2016

Google site ownership verification per DNS


Many (or perhaps all?) SEOs / webmasters have their sites verified in Google Search Console (and others). There are currently several options to verify ownership of a site, usually a domain or a subdomain. Many people use a file uploaded into the root folder of the site, or a meta tag on the homepage.

If you have access to a DNS, adding a TXT entry to the DNS is a great option. We actually use several methods to verify the same domains, for example the meta tag + DNS, or verification file + DNS to add some extra protection in case one breaks.

This is the official (new) guideline from Google, but it is also explained in GSC. There's a second page with details on the 4 different DNS entry variations that can be used - up to now I had only seen the first two.
One simply adds a TXT entry whose content starts with "google-site-verification=" followed by a domain-specific token.

We have lots of domains and a dedicated DNS team, so I needed to test whether they implemented it correctly - and as usual, a quick script.
The script to test the DNS TXT entry for Google verification is below; the output shows domains with this entry first, and below that, domains without.


Done with a few domains where I don't have the DNS entry:



while read -r domain; do
    dig -t TXT "$domain" | grep "TXT" | grep "google-site-verification"
done < "$1"

Tuesday, February 16, 2016

site search box in SERP


Many of us site managers and tech SEOs will have thought about using the sitesearch box in Google results. When Google displays expanded results with sub-results, it offers to integrate a sitesearch box. Posts on this are rare and have little info on its impact. What I have seen so far indicates there's no significant impact on natural search traffic. Information about on-site conversion (a closer match should increase on-site conversion) is not published to my knowledge, and would be hard to separate from overall search.

There are three options to deal with this Google offer:
1. ignore, do nothing
2. block the searchbox actively
3. integrate schema into the homepage of the site to 'invite' Google to add the box.

Which route to go? A first step, as often, is to look at how common each option is. Working on one of the largest websites, I compare here with the top 10,000 sites by traffic as estimated by Alexa (top 1 million sites report).

Now showing:

  1. how many block, use schema, do nothing,
  2. average rank of blocked sites compared to sites with schema
  3. what do the top sites do (surprise)
  4. top sites blocked by rank
  5. top sites with schema by rank 
  6. bash script to test


1. How many block, use schema, do nothing



2. Average rank of blocked sites compared to sites with schema


3. What do the top sites do (surprise)

4. Top sites blocked by rank

5. Top sites with schema by rank 

6. Bash script to test


while read -r line; do
    if [[ $line != www* ]]; then
        line="www.${line}"
    fi
    output=$(curl -s "${line}")
    nobox=$(echo "$output" | grep "nosite")
    boxschema=$(echo "$output" | grep "SearchAction")

    if [[ $nobox != "" ]]; then
        nobox="blocked"
    fi

    if [[ $boxschema != "" ]]; then
        boxschema="schema"
    fi

    echo -e "$line \t $nobox \t $boxschema" | tee -a nobox-or-box.txt

done < "$1"


Thursday, February 4, 2016

Canonical checker

​For search, a header tag can be very important. We have many areas with duplicate pages, or at least very similar pages. That's bad news for ranking in external search engines as the 'rank equity', or 'ranking love' like some would say, is diluted.
For many of these situations, the rel canonical tag can be used. It indicates to search engines which of the many urls for the same content should rank in search results. It is one of the well-understood and well-working elements in SEO.

Unfortunately, some webpages don't make full use of this. Most tools, internal and external, show if a canonical is on a page, but they don't check anything else - particularly not whether the canonical is self-referring, i.e. points to the url of the page on which it is implemented.
Self-referring canonicals are nothing bad, even recommended. But if we have 3 urls for the same content (think customer segments) that all have a self-referring canonical, we need to change this so that one url keeps its self-referring canonical and the other two point to that url.


Knowledge is the parent of action, or so they say, right?


Here is a little tool to check in your area, works under linux and also on cygwin. Suggestions for improvement welcome as always.  

#!/bin/bash
echo -e "\033[91m if a url in the input file has NO http or https, http is assumed and set \033[0m"
if ! [[ -f $1 ]]; then
    echo "need a file with urls starting with http or https"
    exit 1
fi

echo -e "URL tested\tcanonical status"
while read -r line; do
    if [[ ! $line == http* ]]; then
        line="http://$line"
    fi
    response=$(curl -I -s --write-out '%{http_code}' "$line" -o /dev/null)
    if [[ $response == 200 ]]; then
        canonical=$( curl -s --max-time 3 "$line" | grep "canonical" | grep -o '<.*canonical.*>' | sed -e 's/<link //' -e 's/rel=//' -e 's/canonical//' -e 's/href=//' -e 's/ //g' -e 's/\x22//g' -e 's/\x27//g' -e 's/\/>$//' )
        case $canonical in
            *hreflang*)
                canonical="coding error"
                ;;
            "")
                canonical="none"
                ;;
            */)
                canonical="${canonical%/}"
                ;;
        esac
        if [[ $line == "$canonical" ]]; then
            canonical="\033[32mmatch\033[0m"
        fi
    else
        canonical="redirected"
    fi
    echo -e "$line\t$canonical" | tee -a output.txt

done < "$1"

Thursday, September 24, 2015

Google Weblight - using full content from sites on their domain

Like too many 'lite' things, there are serious side effects with this Web Light version by Google.

Some time ago - I think it was April - Google announced they would take web pages that are too slow for some users, scrape them off the owner's website, and serve them from Google's own servers. The test was supposed to start in Indonesia, targeted at 2G connections. Later, they started rolling this out for 'select countries' and slow mobile connections (2G).

The 'transcoding' is done on the fly, according to the news, and speeds things up significantly - we have seen 10x faster loads.


Weblight - look and feel

This is how it looks: the original on the left, the 'scraped' site on the right. Not bad, really. Users can load the original page from the top section, with a warning that it might be slow. Sounds like a very user-friendly approach to me. The navigation (nav icon in the upper left) is solid and not missing anything at the top level.


The biggest flaw is that it removes pretty much all third-party elements - including our tracking elements, scripts, etc. Google claims to leave up to two ads on the page and to allow simple tracking with Google Analytics.


Traffic on weblight

We (Dell.com) actually see some traffic from googleweblight. It is minuscule - noticed only because we were specifically looking for it - but there it is. We see some referral traffic in Adobe SiteCatalyst, and we also see the 'scraping' of pages in our logfiles.



Most of the traffic for us seems to be from India, trying to reach a variety of pages in several countries instead of just India - which might explain some of the slowness.

The scraped or 'transcoded' pages can be found in logs with cs_User_Agent="*googleweblight*", and the pages that get the referral traffic via the referrer "googleweblight.com".
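Outside of splunk, the same two signals can be counted with a small awk filter - `weblight_counts` is a hypothetical helper, and the field names (cs_User_Agent, referrer) are taken from the queries above; adjust the patterns to your own log format:

```shell
# Sketch: count weblight transcoding fetches vs. referral visits in
# raw log lines read from stdin.
weblight_counts() {
  awk 'tolower($0) ~ /cs_user_agent=.*googleweblight/ { t++ }
       tolower($0) ~ /referrer=.*googleweblight\.com/ { r++ }
       END { printf "transcoded: %d, referrals: %d\n", t, r }'
}
# usage: weblight_counts < access.log
```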


How relevant are 2G networks for global ecommerce

So far, it seems super small, but the market potential is quite big. Global data on network coverage is a bit harder to come by, but there are some relevant sources.

Some excerpts from a McKinsey report:
"In developed countries and many developing nations, 2G networks are widely available; in fact, Ericsson estimates that more than 85 percent of the world’s population is covered by a 2G signal.42 Germany, Italy, and Spain boast 2G networks that reach 100 percent of the population, while the United States, Sri Lanka, Egypt, Turkey, Thailand, and Bangladesh have each attained 2G coverage for more than 98 percent of the population.43 Some developing markets don’t fare as well: as of 2012, 2G network coverage extended to 90 percent of the population of India, 55 percent of Ethiopia, 80 percent of Tanzania, and just under 60 percent of Colombia.44 Growing demand and accelerated rates of smartphone adoption in many markets have spurred mobile network operators to invest in 3G networks.
Ericsson estimates that 60 percent of the world population now lives within coverage of a 3G network. The level of 3G infrastructure by country reveals a stark contrast between countries with robust 3G networks and extensive coverage, such as the United States (95 percent), Western European nations (ranging from 88 to 98 percent), and Vietnam (94 percent), and many developing markets such as India, which are still in the early stages of deploying 3G networks."
(highlights by me)


The graph from the same publication shows significant 2G/3G only coverage even for the US.
Slightly more optimistic statistics from the ICT figures report for 2015  - look at US penetration rate for example, or Norway. A map and discussion with US internet speeds on Gizmodo


How to see your site on Google weblight

For this blog, it is: googleweblight.com/?lite_url=https://andreas-wpv.blogspot.com . Don't forget to use some kind of mobile emulator to see how it really looks! Only some resolutions are supported - one is the standard iPhone 4 at 320 x 480.

Google original:


Is this legal? How about copyright?

Honestly - I don't want to go there; I am no lawyer, and some complex questions arise - internally, too.
Google claims that companies gain so much traffic that it is really in their best interest. For every site that does not want this treatment, there is an opt-out described on the Google help pages, which also gives some insight into what's removed. Some more tech details in this article: mostly compression, removal of third-party elements, reduction of design elements.

Google calls this 'transcoding' - to me it looks like scraping and serving from their website, something that might be considered a copyright violation and would likely be against Google's own 'terms of service' (highlights by me):
"Do not misuse our Services, for example, do not interfere with our Services or try to access them using a method other than the interface and the instructions that we provide. You may use our Services only as permitted by law, including applicable export and control laws and regulations. We may suspend or stop providing our Services to you if you do not comply with our terms or policies or if we are investigating suspected misconduct."

Additionally - wouldn't AdWords on a Web Light page violate the AdWords terms of service?
" Content that is replicated from another source without adding value in the form of original content or additional functionality



  • Examples: Mirroring, framing, or scraping content from another source"

  • I would understand - perhaps misunderstand - this to mean that Google AdWords cannot be used on Web Light pages.

More generally, is this scraping OK for Google to do?

Sept. 25, 2015: From a glance at the detailed data transfer (Chrome, Fiddler), it seems to be more of a filtering of content through some kind of proxy. Working on it.

    Resources:

In addition to the links above, here are a few articles I found on this topic, plus Google sources:

    http://gadgets.ndtv.com/internet/news/google-india-to-offer-faster-access-to-mobile-webpages-for-android-users-702302
    http://www.unrevealtech.com/2015/07/how-to-prevent-site-loading-google-weblight.html
    http://www.androidauthority.com/google-web-light-looks-616450/
    http://digitalperiod.com/google-web-light/
    https://support.google.com/webmasters/answer/6211428?hl=en

    Tuesday, September 15, 2015

Data analysis preparation script

Finally got to work on this. I am working with larger files, and one of my current fun projects is to find out which urls have been visited by Google, out of all the urls we have live.

While working on a small DB for this, I downloaded some files from splunk and imported them. Sure enough, I realized there are things I need to filter out first, or the DB becomes absolutely unwieldy.
The files are not huge, but large enough - 10+ million lines - so I want to use command line tools and not redo a script for every number I need, looping repeatedly over the same file.

    It shows count of 

    • number of fields
    • total number of lines
    • empty lines
    • non-empty lines. 
    Then it pulls the full data for the 

    • shortest line
    • the longest line
    • the first 4 lines
    • last 4 lines
    • middle 3 lines of the file.

This is test-data output from the script - I cannot show real data. It works well, at least with a few fields. Running it on a 17-million-line, one-field list takes ~32 seconds; that's pretty good, I think.



I highlight the descriptions in green, as you might see - otherwise the output is kind of hard to read. Also, awk switched the line-counter numbers to scientific notation, so I needed int(variable) to get integers back and concatenate them into a range for the sed system call. Oh, and this is for comma-separated files; adjust if yours differ.


#!/bin/bash

#this shows how many lines have each number of (comma-separated) fields
echo -e "\033[32;1m\n\nnumber of lines : number of fields\033[0m" ; awk -F',' ' {print NF} ' "$1" | sort | uniq -c
    
    #this shows only number of shortest / longest lines
    
    awk ' BEGIN {FS=","} (NF<=shortest || shortest=="") {shortest=NF; shortestline=$0} 
    (longest<=NF) {longest=NF; longestline=$0} 
    (!NF) {emptylines+=1} 
    (NF) {nonemptylines+=1}
    (maxcount<NR) {maxcount=NR}
    END { middlestart=(maxcount/2)-1;
    middleend=(maxcount/2)+1;
    range=int(middlestart)","int(middleend);
    print "\033[32;1m\ntotal number of lines is:\n\t\033[0m" NR,
    "\033[32;1m\n shortest is:\t\t\033[0m" shortest, 
    "\033[32;1m\n longest is:\t\t\033[0m" longest, 
    "\033[32;1m\n shortestline is:\t\t\033[0m" shortestline, 
    "\033[32;1m\n longestline is:\t\t\033[0m" longestline,
    "\033[32;1m\n number empty lines is: \t\t\033[0m" emptylines,
    "\033[32;1m\n number of non-empty lines is:\t\t\033[0m" nonemptylines;
    
    print "\033[32;1m\n\nrange is   \033[0m" range, "\033[1;32m\nFILENAME IS \033[0m" FILENAME, "\033[32;1m\nnow the middle 3 lines of the file: \n\033[0m";
    system("sed -n "range"p " FILENAME)
    
} ' "$1"
    
    echo -e "\033[32;1m\n\ntop 4 lines\n\033[0m"
    head -n 4 "$1"
    echo -e "\033[32;1m\n\nlast 4 lines\n\033[0m"
    tail -n 4 "$1"
    echo -e "\n"
    

    Wednesday, July 22, 2015

    Parallel on AWS

Remember the post on how many parallel (simple wget) processes can reasonably be run on one machine?

    This is how it looks on aws / amazon web services EC2:


I added lower / higher numbers after a bit of testing, but this is fast - likely because of a really good internet connection. (I might need to find a way to test the processing itself, without the network dependency....)

    Many of the scripts - even when run with parallel - still take hours or days to complete. 10,000 or 100,000 urls to check are the rule, sometimes even more. With downloads and processes on the data, this can take a while.

I can't run these from the office, and if I run them from home, I keep checking... which is not that relaxing at a time when I want to relax. So I often use Amazon Web Services EC2.
For about a year I used the free tier - a great offer from Amazon to try the service with a very small instance for free. Now I am on the paid program, but the cost is really low for what I do, and after a few times it became routine to start a new instance when I need it, run some things, then terminate it.

One of the programs I pretty much always use is tmux - the 'terminal multiplexer' - which is now installed by default on Ubuntu instances. It allows not just running many connected terminals, but also detaching a session. That means I can start a session, run a script in it, detach, and then close the terminal and connection - and the script keeps running in the background. A few hours later I can just connect again and download the results.
    Especially in combination with parallel this is a great way to run many of the scripts from this blog - and more.

    Starting looks a bit like this (thanks to Ole Tange for the tip and the help with parallel):

    ls x* | parallel -j24 --line-buffer  " . script.sh {}  >> results.txt "

I often just split large files, then ls or cat the split files into parallel. -j24 means 24 parallel threads; {} picks up the data from ls/cat.
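The split step can be sketched as follows - `make_chunks` is a hypothetical helper, and script.sh stands for whatever per-chunk worker you run:

```shell
# Sketch: break a big url list into fixed-size chunks with a distinct
# prefix, so the chunk names can be fanned out to parallel.
make_chunks() {   # $1 = input file, $2 = lines per chunk
  split -l "$2" "$1" chunk_
  ls chunk_*
}
# then e.g.: make_chunks urls.txt 1000 | parallel -j24 " . script.sh {} >> results.txt "
```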

    Thursday, July 9, 2015

    Crawl faster with "parallel" - but how fast?

Crawling websites - often our own - helps find technical SEO, quality, and speed issues; you'll see a lot of scripts on this blog in this regard.

With a site with millions of pages, crawling speed is crucial - even a smaller project easily has some thousand urls to crawl. While there are good tools to check for many things, wget and curl in combination with other command line tools give all the flexibility needed - but can be painfully slow.
To check download speed, for example, a regular script would loop through a list of files, download one, then move to the next. Even splitting up and running the same script several times is not great, and quite hard to handle at scale.

After some searching I found GNU parallel - excellent, especially for tasks like this. Running multiple parallel processes is easy and powerful, and the -j[number] switch sets the number of parallel processes.

But what number is best, especially for networking projects? A wget should not be a heavy computing load, even with some awk/sed/grep afterwards, because most of the time is spent waiting and transferring data.

    So, a little script to check the time for a few different numbers of parallel processes:

for i in 12 24 36 48 60 80 100; do
    var=$( time ( TIMEFORMAT="%R" ; cat "${1}" | parallel "-j${i}" . single-url-check-time-redirect.sh {} ) 2>&1 1>/dev/null )
    echo "$i parallel threads take $var seconds"
done

    The referenced script (single-url-check-time-redirect.sh) just makes a curl call to check for a http status and a wget for the header transfer time; $1 pulls in a list of full urls.
    The results differ greatly by network.

    Here a slow, public network in a cafe:


    This from a library with a pretty good connection:


    And this from a location with a 50MB/sec connection:


Times can easily vary by a factor of 2 within each location; network connection and load still seem to make the biggest difference, as all three tests were done on the same computer.

    Wednesday, June 24, 2015

    the script: 8 different user agents and how sites deal with it

    User agent analysis script

As mentioned in the earlier post - a script helped me grab the info for the post on how sites, and Google specifically, treat various browsers.

While there's a lot more to analyse, much of it manually, I first wanted to see if there is any indication of differences - so for a first insight I use just a plain wc to get the characters, words, and lines of the response, and it looks like there is a clear pattern.

So, let's take a look at the source: two nested read loops. The outer loops through the urls, the inner through the agents:

#!/bin/bash
#check if the url file exists
if [[ ! -e $1 ]]; then
    echo "there's no file with this name"
    exit 1
fi
outfile=$RANDOM-agentdiff.txt
echo -e "agent \t url \t lines \t words \t bytes" > "$outfile"
while read -r line; do
    # add http:// to urls that don't have it
    if [[ $line == http://* ]]; then
        newline="$line"
    else
        newline="http://$line"
    fi
    # loop through agents, then read the wc output into variables with a here string <<<
    while read -r agent; do
        read -r filelines words chars <<< "$(wget -O- -t 1 -T 3 --user-agent "$agent" "$newline" 2>/dev/null | wc)"
        echo -e "$agent \t $line \t $filelines \t $words \t $chars" >> "$outfile"
    done < "$2"
done < "$1"
wc -l "$outfile"
The most difficult part was getting the wc output into separate variables - thanks, Stack Exchange, for the tip with the <<< here-string.

    Thursday, June 4, 2015

    Speed: Data on top 1000 header load time vs full load time

Lots of tools give different numbers for the speed of a site: how it is for users, over different channels and providers, including rendering time or not, including many elements or not.

    This is the 'hardest' test of all:

• With a little script I checked the top 1000 pages from the Alexa list*. The first script tests how long it takes to get the http header back (yes, the document exists; yes, I know where it is; and it is OK). These are the blue dots.
• The second script downloads ALL page elements of the homepage - images, scripts, stylesheets, including those from integrated third-party tools like Enlighten, Tealeaf, Omniture, or whatever a site uses. These are the orange dots.

    First I ran this without time limit, and when I checked back the next day, it was still running, so I set the timeout to 20 seconds.

There seems to be a clear connection between header response time and full load time - not so much between rank in the top 1000 by traffic and speed.

    sorted by full download time




    sorted by traffic rank

There seems to be NO clear connection between rank by traffic (x-axis) and full download time. This shows a great opportunity to outperform many other companies with faster download speeds.



    * Alexa (owned by Amazon) publishes the top 1 Million websites by traffic, globally, based on data from  many different browser plugins plus from Amazon (cloud?) services.

    Tuesday, May 26, 2015

    Google mobile friendly: Industries, site types and self centered algorithm change

    Maybe it is just me.

Since Google started pushing everyone to make their sites mobile friendly, many pages are now a pain to read on a regular laptop or monitor.

    Example SEO Blog: moz.com/blog

Take the Moz blog, for example. Great on mobile, not good to read on a desktop: a HUGE image up front, pushing all content below the fold. That's just not good - not for SEO, and not for usability.

Desktop design due to the Google push? An image is all I see on my screenshot of a full 14" laptop screen.

    Example News blog: Search Engine Land

Similar issue. I do NOT like headlines that big, really.
Desktop design due to the Google push? On my screenshot of a full 14" laptop screen I see a lot of headline, and... ads.



    Now, lets look at a few more.

Amazon has a different url concept (in parts, at least). The mobile experience is not that great, and if I hit the back button, I get the desktop homepage on the phone screen - I had hoped for something different.



Bestbuy is not really different - showing off their headline :-). The responsive design seems not optimized, at least not for me.


Dell? We have some great pages, some OK pages (many on separate urls on m.dell.com), and some with definite room for improvement when it comes to mobile / responsive - and we have lots of teams working on that, both different url setups per country and some responsive pages (http://www.dell.com/en-us/work/learn/large-enterprise-solutions for example). But responsive lends itself to some content, like consulting service descriptions, and not to other content, like picking a laptop out of a larger selection.

Many sites need to serve content to mobile users much better than they currently do, and Google's push is a good reminder. I do think they have gone a bit far, and I am not sure they are aware that other industries have different needs. Google has it easy: their content and search pages lend themselves to responsive design, but it is not really fair to compare search results to news, magazines, and eCommerce sites. Those have much more complex processes users need to go through, and other business models than the exploitation of personal data.

Doesn't Google's own data actually confirm that there are industries where mobile is really important and some where it is not? Showing data from the Google AdWords tool, just random terms used to see some variance:


Not really a big surprise when you think about it. Even for the keyword 'search', 51% of searchers use a pc / monitor, not a phone or tablet, according to AdWords data. I understand the search algorithm change was not large - but as all of the above indicates, it really should not be, either. Many industries don't have that much mobile share, and responsive does not work well for complex tasks. Those can be made to work with adaptive layouts, or even better with separate urls and layouts, but that is slow, complex, and expensive - so it takes even more mobile share to justify the investment.

    But maybe I am totally wrong because I just have not seen the great examples out there. Do you have an example of a site with a complex task that works really well on both phone and monitor in a responsive layout?

    Thursday, April 23, 2015

    8 User agents and responses Alexa top million pages

    Some more interesting results from the test run with 8 different user agents, comparing the returned document sizes from the top 1000 sites in the Alexa 1 million.

    These are the largest returns - interesting to see which sites show up here: newspapers, stores, portals.



    Many of the smallest responses are empty, redirects, or under-construction pages - actually quite a lot of the top-performing pages.


    But... maybe that's just because of the unknown user agent? (or the one with wget in it?)

    Yes, a very different picture. With just the regular user agents, only a few pages don't send anything back. Some redirect, but some are just very, very small - great job!

    See Baidu in there? Now let's take another look at a few search giants. Google is amazing - that looks like the exact same page with a few lines in various languages. Interesting to see how much code each line contains, though - that's huge compared to other sites.


    And while the search giants likely have all kinds of fancy (tracking?) elements, some of the search alternatives return a few more lines, but much less actual code on the page.


    Filtering a bit further confirms that no one likes my 'andi-wget' user agent. This means future work with wget will nearly always need a different user agent!


    Check out the first post with results on average response sizes and how Google responds.

    Monday, April 13, 2015

    Documentation - where did I store that little script?

    Goodness gracious me. I recall that I did this, but I cannot just type it again - I don't do this often enough - so HOW do I find that script?

    How do you find your little scripts?

    Ok, good folder structure, naming conventions, all fine - but it is still hard. Is it in the sitespeed folder or the sitemap folder? The test folder, perhaps? Did I spell it this way?
    Most scripts here are for 'few-times' use, built to come up with a quick insight, a starting point, or some scaling information for a business case. The scripts are quick and manifold, with many different variations.

    Every now and then I recall a script I'd like to re-use, and then struggle to find it. Working across several computers is a challenge here. Git? Too big, had some security issues, and too steep a learning curve for these one-liners.

    I then used this blog as a repository and to find the scripts (with a site:andreas-wpv.google.com search), plus Google Drive for a small part of the repository. Works, but information is still missing.

    Documentation script

    But I am already using # comments quite a lot even in these short bash scripts, so I will now extend that and use this:

    find . -name "*.sh" -exec echo -e "{}\n" \; -exec  grep "^#" {} \; -exec echo -e "\n\n" \;

    This pulls the script name and path, then an empty line, then all the comments, then two blank lines to separate it from the next script. Not pretty, but it works. Now I need to comment a bit better :-)
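To take this one step further (my own sketch - the index file name is arbitrary), the same output can be written to an index file once, which can then be grepped by keyword whenever a script goes missing:

```shell
#!/bin/bash
# Build a comment index of all shell scripts under the current directory:
# for each script its path, then every "#" comment line, then a blank line.
find . -name "*.sh" -exec sh -c '
  printf "%s\n" "$1"; grep "^#" "$1"; printf "\n"
' _ {} \; > scripts-index.txt
```

A later `grep -i -B 5 sitemap scripts-index.txt` then shows the matching comment together with the path of the script it came from, a few lines above it.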

    How do you sort, document and find your little scripts?


    Tuesday, April 7, 2015

    Including Google: 8 agents - and average response code


    Agents, not spies

    Agents - user agents - play an important role in the setup and running of sites, likely a bigger one than many developers would like. Exceptions, adjustments, fixes - and it is (?) impossible to generate code that displays properly on all user agents and is also W3C compliant.

    So sites care, and likely Google cares as well. Google - like any site - offers only one interface to their information (search): source code. How the page looks, how it is rendered, the tool that renders it - that is not Google, not provided, supported, maintained, or owned by Google, but something on the client side. Not even the transport is provided by Google. Google only provides a stream of data via http / https, and anything after that is not their business.

    So, my question was - do user-agents make a difference? 

    And sure, what better to use than a little script? There are so many different user agents that I wanted to keep this open to changes, so I loop over a file with user agents - one per line - as well as over a list of URLs to get the responses from these sites.
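A minimal sketch of such a loop (my own reconstruction, not the original script - the function name and the two input file arguments are assumptions), wrapped in a function so it can be pointed at any pair of files:

```shell
#!/bin/bash
# Sketch of the loop described above: fetch every URL once per user agent
# and write the response size in bytes as a TSV to stdout.
# ua_sizes AGENTS_FILE URLS_FILE (one user-agent string / one URL per line)
ua_sizes() {
  local agents_file="$1" urls_file="$2" ua url bytes
  echo -e "url\tuser_agent\tbytes"
  while read -r ua; do
    while read -r url; do
      # -t 2: two tries, -T 10: give up after 10s; a failed fetch counts as 0 bytes
      bytes=$(wget -qO- -t 2 -T 10 --user-agent="$ua" "$url" | wc -c | tr -d ' ')
      echo -e "$url\t$ua\t$bytes"
    done < "$urls_file"
  done < "$agents_file"
}
```

Called as `ua_sizes agents.txt urls.txt > ua_sizes.tsv`, this yields a table that can go straight into a spreadsheet for the graphs below.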

    Surprising results: Bingbot and Googlebot receive the smallest files - not wget or an unknown bot. The graph overstates this, but the table shows a clear difference.




    Google.com answering

    Now let's take a look at the response Google.com sent back to the little script. They do NOT like wget or andi-bot, nor Googlebot or Bingbot, although those two perform a bit better than the first two.
    The focus is clearly on the regular user agents. The picture for words is the same as for lines.

    Google only provides a data stream; rendering happens on the client side. So whether I use Firefox, IE, Safari, wget, or curl is my business - and not Google's. I understand that Google does not want people to scrape large amounts of data off their site - which I don't do - but I am surprised the 'block' is not a bit more sophisticated.



    Monday, March 16, 2015

    Automate your Logfile analysis for SEO with common tools

    Fully automated log  analysis with tools many use all the time

    Surely no substitute for Splunk and its algorithms and features, but very practical, near zero cost (take that!) and highly efficient. Requires mostly free tools (thanks, Cygwin) or standard system tools (like the Windows Task Scheduler), plus a bit of trial and error. (I also use MSFT Excel, but other spreadsheet programs should work as well.)






    Analysis of large logfiles, daily

    Analyzing logfiles for bot and crawler behavior, but also to check site quality, is quite helpful. So, how do we analyze our huge files? For one part of the site alone, we're talking about many GB of logs, even zipped.

    Not that hard, actually, although it took me a while to get all these steps lined up and synchronized.

    With the Windows Task Scheduler I schedule a few steps over night:
    • Copy the last day's logfiles to a dedicated computer
    • Grep the respective entries into a variety of files (all 301s, bot 301s, etc.)
    • Count the file lengths (wc -l) and append the values to a table (CSV file) tracking these numbers
    • Delete the logfiles
    • Copy the resulting table and one or two of the complete files (all404.txt) to a server, which hosts an Excel file that uses the txt file as its data source and updates graphs and tables on open
    • Delete temporary files (and this way avoid the dip you see)
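The counting core of these steps can be sketched as a small bash function (a sketch only - the grep patterns and the Googlebot filter are illustrative placeholders, not the actual setup):

```shell
#!/bin/bash
# Sketch: count selected entries in a logfile and append one dated row to a CSV.
# summarize_logs LOGFILE CSVFILE - the status-code and bot patterns are placeholders.
summarize_logs() {
  local log="$1" csv="$2" n301 n404 bot301
  n301=$(grep -c ' 301 ' "$log" || true)      # all 301 responses
  n404=$(grep -c ' 404 ' "$log" || true)      # all 404 responses
  # bot-specific subset: 301s served to Googlebot
  bot301=$(grep ' 301 ' "$log" | grep -ci 'googlebot' || true)
  echo "$(date +%F),$n301,$n404,$bot301" >> "$csv"
}
```

Run nightly over the copied logfiles, this grows exactly the kind of tracking CSV the Excel file can use as its data source.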

    Now our team can quickly check whether we have an issue and need to take a closer look, or not.
    In a second step I also added all log entries resulting in a 404 to the spreadsheet on open.


    Thursday, March 5, 2015

    Who has H1?

    Do we need an H1 on our homepage?

    Sometimes it is necessary to convince product managers, devs, and stakeholders that SEO efforts are necessary. One way to support this is to quickly run a test on competitors and / or on ... the top pages on the web. (Yes, after all the other pros have been given.) Especially since we're running one of the top pages ourselves, that list contains powerful names. (And the first URL is in my tests because I know what's happening there, not because it is in the top 1 million... yet ;-) )

    So, H1 or not?

    Out of the top 1000 pages, about
    Here is a screenshot of the top pages that have an H1 on their homepage.


    This is the script, running over the top 1000 URLs from the Alexa 1 million. It is very easy to adjust for other page elements.


    #!/bin/bash
    # Check each URL in file $1 (one per line) for an <h1> on the homepage
    echo -e "url\t has H1 " > 2top1kH1.txt
    while read -r line; do
        echo "$line"
        h1count=$(wget -qO- -w 3 -t 3 "$line" | grep -c "<h1")
        if [ "$h1count" -gt 0 ]; then
            echo -e "$line\t yes" >> 2top1kH1.txt
        fi
    done < "$1"


    Not large, not complicated, but very convincing.

    Monday, February 16, 2015

    Google Sitespeed Score and Rank

    While Matt Cutts mentioned that site speed only affects a small percentage of sites in their rankings, running one of the largest sites makes this worth a second look. With millions of pages, speed might be especially critical for rank - or at least for indexation.

    Sitespeed is not the same as the sitespeed score, but the score is a relatively 'neutral' way to measure speed-related performance.

    What to do? First, I selected pages across the site - some 2500 pages, to have a nice sample. Then I pulled the average rank for each of these pages from seoclarity(.net) and combined this with the speed score from my script.
    As a last step I imported this table into Excel and... scatterplot! The vertical axis is the ranking position, and the x-axis is the speed score. At first sight this would indicate that pages rank worse with increasing speed score, but overall I would say there is no correlation at all (0.09), and the variance is way too big to use this to calculate a trend.
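The correlation number itself does not need Excel either; a small awk helper over the two-column rank / speed score table computes Pearson's r directly (a sketch - it assumes a whitespace-separated file with rank in column 1 and score in column 2):

```shell
#!/bin/bash
# pearson FILE: Pearson correlation between columns 1 and 2 of FILE
# (whitespace-separated). A sketch for the rank-vs-speedscore table.
pearson() {
  awk '{ n++; sx+=$1; sy+=$2; sxx+=$1*$1; syy+=$2*$2; sxy+=$1*$2 }
       END { printf "%.3f\n", (n*sxy - sx*sy) / sqrt((n*sxx - sx*sx) * (n*syy - sy*sy)) }' "$1"
}
```

Called as `pearson rank_score.txt`, it prints a single value between -1 and 1 - near zero meaning, as in the scatterplot, no usable trend.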

