Thursday, June 12, 2014

Finally: free to use Sitemap Generator with automated 'priority' configuration

Easy to use - at least in an linux environment. Not sure if adjustments are necessary for various (bash) flavors, same with -nix emulators like cygwin. Sitemaps can have up to 50,000 links - so this should be sufficient for many sites. 

First, check if called with a file - ideally the list of urls from an earlier scan with one of the scanners of this site, or from somewhere else, perhaps a content management system. Then, grab the file name without ending and add a random number to avoid overwriting of existing files.

if [ -z $1 ]
then echo "needs a file to start with"
else
file=$(basename "${1}")
name=${file%.*}-$RANDOM.txt
fi
 Clean up the file from empty lines, then from lines that start with underscore or hyphen - errors I found a few times.
sed -i '/^$/d' "${1}"
sed -i '/^[_\|-]/d' "${1}"

Echo the header into the new file - xml definition for a sitemap.
echo ""  >  "${name}"
Then, check if the url has a http:// in front, if not, add it, otherwise the following calculations get tricky.
Next - count the number of "/ ", subtract the two dashes from the http://. The 'gsub' returns the number of replacements - and that's all I need here, so I replace a dash with a dash.

The next step calculates the priority based on the depth in the folder structure - the lower in hierarchy, the lower in priority. This is a setting that might need to be changed, depending on the structure of a site.
awk ' $1 !~ /^http/ { $1 = "http://"$1 }  { count=(gsub("/" , "/" , $1)-2) } count > 9 { print $1, "0.1" } ( count > 0 && count <=9 ) { print $1, (10-count)/10 } count <= 0 { print $1, "1" }' $1 | while read input1 input2
The little sed inset is then used to remove the additional decimals - this was the easiest way to do this.
Then just echo it into the overall structure of the sitemap entries, then end the while loop with done.
do priorityvalue=$(echo $input2 | sed 's/^\(...\).*/\1/')
echo -e "$\n$input1\n$priorityvalue\n" >> "${name}"

done 
Finally, add the sitemap footer per definition and show the result
echo "
" >> "${name}"cat $name

looks like this:










http://www.tage.de  1
http://www.directorinternet.com/  0.9
http://andreas-wpv.blogspot.com/2014/05/new-google-schema-implementation-for.html  0.7

Same script on Dropbox for easy use.

No comments:

Post a Comment

Bookmark and Share