Monday, September 30, 2013

Download URLs from a sitemap into a text file

Sitemaps again - they are still very helpful, especially for a large site. Several processes need the plain list of URLs, and going back to the source file is not always possible or practical.

So, here is a small bash script that scans a given sitemap and stores just the URLs in a text file. The input parameter is the full URL of the sitemap. If we name the script sitemap-urls.sh, the call would be

# bash sitemap-urls.sh http://www.dell.com/wwwsupport-us-sitemap.xml
#!/bin/bash
if [[ ! $1 ]]; then
    echo 'Call the script with the sitemap URL as parameter'
    exit 1
else
    # Download the sitemap without saving it, keep only the <loc> lines,
    # and strip the surrounding tags so only the URL itself remains
    wget -qO- "${1}" | grep "<loc>" | sed -e 's/^.*<loc>//g' -e 's/<\/loc>.*$//g' > sitemap-scan-output.txt
fi
The script first checks whether it was called with the sitemap URL as parameter ($1); if not, it echoes a message and exits. If the parameter is set, wget downloads the sitemap without saving it to disk, grep keeps only the lines containing <loc>, and sed replaces the irrelevant parts, i.e. the XML tags around each URL, with nothing to remove them.
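
To see what the grep and sed steps actually produce, here is a minimal sketch that pipes a small made-up sitemap snippet (the example.com URLs are hypothetical) through the same filter chain:

# Hypothetical two-entry sitemap snippet, run through the same grep/sed chain
cat <<'EOF' | grep "<loc>" | sed -e 's/^.*<loc>//g' -e 's/<\/loc>.*$//g'
<urlset>
  <url><loc>http://www.example.com/page-1.html</loc></url>
  <url><loc>http://www.example.com/page-2.html</loc></url>
</urlset>
EOF
# Output:
# http://www.example.com/page-1.html
# http://www.example.com/page-2.html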

Not fancy, but still good to have. I will use this to check a few interesting things in upcoming posts, and it is also really helpful when the URLs are needed for import into analytics tools or Excel.
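
For the Excel case, a quick follow-up on the output file is enough; a minimal sketch, assuming the sitemap-scan-output.txt from above (the sitemap-urls.csv name is just a suggestion):

# Count how many URLs the sitemap contained
wc -l < sitemap-scan-output.txt

# Turn the plain list into a one-column CSV for Excel import
{ echo "url"; cat sitemap-scan-output.txt; } > sitemap-urls.csv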



