Easy to use - at least in a Linux environment. I am not sure whether adjustments are necessary for the various shell (bash) flavors, or for Unix emulators like Cygwin. Sitemaps can have up to 50,000 links - so this should be sufficient for many sites.
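Since the 50,000 link limit applies per sitemap file, it can be worth counting the URLs before running anything. This is only a minimal sketch; the function name fits_in_one_sitemap and the variable MAX_URLS are made up for illustration and are not part of the script described here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds only if the URL list fits into a single sitemap.
MAX_URLS=50000   # per-file limit mentioned in the text

fits_in_one_sitemap() {
    local list="$1"
    local count
    # grep -c . counts the non-empty lines, i.e. the URLs that would be written
    count=$(grep -c . "$list")
    [ "$count" -le "$MAX_URLS" ]
}
```

Called as fits_in_one_sitemap urls.txt, it returns success when the list is small enough; a longer list would have to be split into several files first.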
First, check if the script was called with a file - ideally the list of URLs from an earlier scan with one of the scanners of this site, or from somewhere else, perhaps a content management system. Then, grab the file name without its extension and add a random number to avoid overwriting existing files.
Clean up the file by removing empty lines, then lines that start with an underscore or hyphen - errors I found a few times.
# stop if no file was given, otherwise build the output name
if [ -z "${1}" ]
then echo "needs a file to start with"
exit 1
else
file=$(basename "${1}")
name=${file%.*}-$RANDOM.txt
fi
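The file name handling can be tried on its own without touching any file. The sketch below wraps the same two expansions in a function; the name output_name and the sample path are assumptions for illustration only.

```shell
#!/usr/bin/env bash
# Hypothetical helper that only derives the output name.
output_name() {
    local file
    file=$(basename "$1")            # strip the directory part
    echo "${file%.*}-$RANDOM.txt"    # drop the last extension, append a random number
}

output_name "/tmp/scans/urls.csv"    # prints something like urls-12345.txt
```

In bash, $RANDOM lies between 0 and 32767, so a collision with an existing file is unlikely but not impossible.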
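sed -i '/^$/d' "${1}"
sed -i '/^[_-]/d' "${1}"
Echo the header into the new file - the XML definition for a sitemap.
echo '<?xml version="1.0" encoding="UTF-8"?>' > "${name}"
echo '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">' >> "${name}"
Then, check if the URL has an http:// in front; if not, add it, otherwise the following calculations get tricky.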
Next - count the number of "/" and subtract the two slashes from the http://. The 'gsub' returns the number of replacements - and that's all I need here, so I replace a slash with a slash.
The next step calculates the priority based on the depth in the folder structure - the lower in the hierarchy, the lower the priority. This is a setting that might need to be changed, depending on the structure of a site.
The little sed inset is then used to remove the additional decimals - this was the easiest way to do this. Then just echo it into the overall structure of the sitemap entries, and end the while loop with done.
awk ' $1 !~ /^http/ { $1 = "http://"$1 } { count=(gsub("/" , "/" , $1)-2) } count > 9 { print $1, "0.1" } ( count > 0 && count <=9 ) { print $1, (10-count)/10 } count <= 0 { print $1, "1" }' "${1}" | while read -r input1 input2
do priorityvalue=$(echo "$input2" | sed 's/^\(...\).*/\1/')
echo -e "<url>\n <loc>$input1</loc>\n <priority>$priorityvalue</priority>\n</url>" >> "${name}"
done
Finally, add the sitemap footer per definition and show the result.
echo "</urlset>" >> "${name}"
cat "${name}"
An entry in the result looks like this:
<url>
 <loc>http://www.directorinternet.com/</loc>
 <priority>0.9</priority>
</url>
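To see how the depth rule behaves before running it on a real list, the awk part can be fed a few URLs directly. This is a sketch that uses the same rules as the script; the function name url_priority and the example.com URLs are placeholders.

```shell
#!/usr/bin/env bash
# Print "url priority" for each URL given as an argument,
# using the same depth rule as the script.
url_priority() {
    printf '%s\n' "$@" | awk '
        $1 !~ /^http/ { $1 = "http://" $1 }          # add a missing scheme
        { depth = gsub("/", "/", $1) - 2 }           # slashes minus the two in http://
        depth > 9               { print $1, "0.1" }
        depth > 0 && depth <= 9 { print $1, (10 - depth) / 10 }
        depth <= 0              { print $1, "1" }
    '
}

url_priority "example.com" "example.com/" "http://example.com/blog/post.html"
```

The three sample URLs come out with priorities 1, 0.9 and 0.8. Note that the trailing slash alone costs a tenth, which is why a home page written with a closing slash ends up at 0.9 rather than 1.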
Same script on Dropbox for easy use.