Monday, September 30, 2013

Download urls from a sitemap into a text file

Sitemaps again - they are still very helpful, especially for a large site. Several processes need just the urls, and going back to the source file is not always possible or practical.

So, here is a small bash script that scans a given sitemap and stores just the urls in a text file. The input parameter is the full url of the sitemap.

#!/bin/bash
if [[ ! $1 ]]; then
  echo 'call script with the sitemap url as parameter'
  exit 1
fi
wget -qO- "${1}" | grep "<loc>" | sed -e 's/^.*<loc>//' -e 's/<\/loc>.*$//' > sitemap-scan-output.txt
The script first checks whether it was called with the sitemap url as a parameter ($1) and, if not, exits after echoing a message. If the parameter is set, it downloads the page without saving it, greps for the lines containing the urls, and uses sed to replace the irrelevant parts, meaning the surrounding xml tags, with nothing, which removes them.
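The extraction step can be tried without a live download. Here is a minimal sketch of the same grep/sed idea run on an inline sample sitemap (the urls are made-up examples; grep -o is used so it also works when the whole sitemap sits on one line):

```shell
# Same extraction idea as in the script, but fed an inline sample sitemap
# instead of a wget download; the urls here are made-up examples.
printf '<urlset><url><loc>https://example.com/a</loc></url><url><loc>https://example.com/b</loc></url></urlset>\n' \
  | grep -o '<loc>[^<]*</loc>' \
  | sed -e 's/<loc>//' -e 's/<\/loc>//'
```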

Not fancy, but still good to have. I will use this to check a few interesting things in the next posts, and it is also really helpful when the urls are needed for import into analytics tools or Excel.

Monday, September 23, 2013

Check url for http response codes with curl

A little linux helper to check the status of urls in a sitemap, based on the server response code.

Currently redirects are said to be bad for Bing ranking and neutral for Google. We want to rank in both, so we don't want 300s, and certainly no 400s or 500s - the error response codes.

For this example from work I use curl, which is easy and fast, on the url "".

curl -D - 

-D - means dump the headers into a file, where the second "-" stands for stdout.
A LONG result, but the direction is correct.
Now add -o /dev/null, which sends the page content to /dev/null.

curl -D - -o /dev/null
Still too long, but getting closer.
So I'll use sed to print just the response status line, matching the regex HTTP:
curl -D - -o /dev/null | sed -n '/HTTP/p'

STILL not there. Adding -s to curl to silence the progress meter gives me:

curl -s -D - -o /dev/null | sed -n '/HTTP/p'

results in: HTTP/1.1 200 OK

Got it!
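
To see what the sed part contributes on its own, the filter can be run against a canned header dump (the header lines below are invented, not from a real response):

```shell
# The sed filter alone, applied to a made-up header dump instead of
# live curl output; only the status line survives.
printf 'HTTP/1.1 200 OK\nContent-Type: text/html\nServer: Apache\n' \
  | sed -n '/HTTP/p'
```
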
It sure has limitations: it is not going to identify server-level rewrites or reverse proxy redirects that return no intermediate non-200 http response, nor is it going to identify an http-refresh. I still find it pretty helpful. The first case is still fine to submit to the search engines, and the second is rare, fortunately, at least where I work.

This again is patched together from a variety of sources, including stackoverflow, a sed post by Eric Pemment and little bits from Andrew Cowie (yep, that's about apis, but still helped): thanks everyone!

Thanks Andy for the great addition you suggested in the comments: adding -L to follow redirects! I would then extend the sed to get this:
curl -s -L -D - -o /dev/null | sed -n '/HTTP\|Location/p' 
This follows redirects, and with the extended sed we see each url and its http response, like this:

HTTP/1.1 302 Found
Location:
HTTP/1.1 301 Moved Permanently
Location: /support/my-support
HTTP/1.1 301 Moved Permanently
Location: /support/my-support/us/en/19
HTTP/1.1 200 OK
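
The extended filter can be checked the same way against a canned redirect chain (the headers below are invented; note that the \| alternation is a GNU sed extension, so this assumes Linux):

```shell
# Extended sed filter on a made-up redirect chain: status lines and
# Location headers are kept, everything else is dropped.
printf 'HTTP/1.1 301 Moved Permanently\nLocation: /new-page\nServer: nginx\n\nHTTP/1.1 200 OK\nContent-Type: text/html\n' \
  | sed -n '/HTTP\|Location/p'
```
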

Wednesday, September 18, 2013

Blogspot domains and ranking

Some blogs on blogger / blogspot appear with several top level domains, as shown last week.
What seems not to work though, is ranking with these urls.
I found some blogs with a few results showing up in the SERPs, but even German blogs had more results under the .com domain than under the .de domain, although they appeared with a .de domain when I searched for a generic blog.

I guess that to better understand why this happens, it would be necessary to check which domain each site started with, and which signals could trigger these results.

Any ideas?

Tuesday, September 10, 2013

Tools for Technical SEO - little helpers

While analyzing performance and natural search performance, lots of tools can be used. At Dell we use seoclarity, majesticseo, adobe analytics, moz, and many more enterprise metrics tools.

Still, as you can see on this blog, not everything can be done comfortably with these: sometimes the setup is too cumbersome, slow, or costly; sometimes it would require too many tweaks; and sometimes it's just not possible to get what we need.

So, here is the list of little helpers, the tools I mainly use to analyse things for technical aspects of seo.
  • Httpfox, Fiddler2 – pageload, coding, caching, errors, http response codes
  • Screaming frog – elements on page (title, redirects, meta description, keywords etc.)
  • Source code view in chrome, IE, FF for title, meta, header elements, check if copy is in source code
  • Developer tools in chrome ( Ctrl + shift + J ) –  for speed (new) , source code, various items
    • Do NOT use chrome developer tools to check whether content is in the source code (it is not accurate for this; the same goes for right-click 'inspect element')
  • Accessibility checkers – required under some federal law regulations; as a certified fed vendor we will likely hear a lot more about this. BUT they are also great for SEO checks: if a page is accessible, it is roughly 95% accessible for SEO spiders too
  • And then ad hoc tools we use rarely (and many times research to solve a particular task)
  • ADDE, an Accenture site scan tool – a large-scale tool offering similar insight to screaming frog, xenu, moz etc., just running on a bunch of internal servers, allowing in-depth access and analysis (entry added Nov 13, '13)
So, now it's up to you: which tools for tech analysis are we missing, and what can they help with?

Monday, September 2, 2013

Blogspot domains

Blogspot urls: Top level domain matters?

It seems that blogspot posts are available under several top level domains.
One of my posts has recently been shared internally at Dell in a larger newsletter; nothing special, but of course I tested whether the link works.

This is the link:

Stop! That's not right, my blog runs at

So I switched back and forth a few times, and the post shows up perfectly fine under both domains, as long as I keep the other parts of the url unchanged.
works just as well as

Other country domains also work, like

  • de
  • be
  • fr
while these don't:

  • cn
  • br
  • es

It also works for other blogs: as long as they use the default blogspot domain and not a customized domain, this seems to work for everyone.
