Showing posts with label technical seo. Show all posts
Showing posts with label technical seo. Show all posts

Monday, September 30, 2013

Download urls from sitemap into textfile

Sitemaps again - they are still very helpful, especially for a large site. For several processes the urls are necessary - and going back to the sourcefile is not always possible or practical.

So, here is a small bash script to scan a given sitemap and store the results just the urls into a textfile. Input parameter is the full url of the sitemap. Let's name this sitemap-urls.sh then it would be

# bash sitemap-urls.sh http://www.dell.com/wwwsupport-us-sitemap.xml
#!bash
if [[ ! $1 ]]; then echo 'call script with parameter of url for file'
exit 1
else
wget -qO- "${1}"  | grep "loc" | sed -e 's/^.*loc>//g' -e 's/<\/loc>.*$//g' > sitemap-scan-output.txt
fi
First checking if the file is called with the sitemap url as parameter ($1) and if not, exit with echoing a message. If parameter is set, then download the page without saving it, grep for the right line, and use sed to replace the irrelevant parts, means html tags, with nothing to remove it.

Not fancy, but still good to have. I will use this to check for a few interesting things in next posts, and this is really helpful also if the urls are needed for import in analytics tools and Excel.




Monday, August 12, 2013

Clean up WGET results for sitemap

After running the script to get urls per wget, now got to clean them out to get just plain urls. The urls need to be on the right domain / folder and need to have had a 200 OK http respone. So, now there are a bunch of text files with urls in them, but not just urls, but a lot more:

2013-07-16 21:39:00 URL:http://www.dell.com/ [21149] -> "www.dell.com/index.html" [1]
2013-07-16 21:39:00 URL:http://www.dell.com/robots.txt [112] -> "www.dell.com/robots.txt" [1]
2013-07-16 21:39:01 URL: http://www.dell.com/_/rsrc/1373543058000/system/app/css/overlay.css?cb=slate712px150goog-ws-left 200 OK

Now, that's all good to know, but not really usable for building a sitemap. Being used to work and fix a lot of things in excel, that was my first try - but I quickly quit. The only way to come close to cleaning up was changing table to text, and even that did not clean up everything - and took too long, too.

#! bash 
#loop through the files
for i in {1..30}; do
#pull only the lines with a URL, then all that do NOT match my domain 
grep 'URL' wgetlog$i.txt | grep -i 'dell.com' > wget$i.txt
# delete beginning date including url then delete ending part from space on, including file size numbers, then remove spaces, then remove lines not #matching Dell.com
sed -e 's/^[0-9 -:]*URL://g' -e 's/\ \[.*$//g' -e 's/\ //g' -e '/dell\.com/!d' wget$i.txt > support-url-$i.txt
done
Not that pretty, but works fine. I just need to adjust the file name to what I use in the wget script, plus set the number to the number of files used, and then it works just fine. 

This is great for any site using just a couple of thousand pages and a few sitemap updates per year.
Bookmark and Share