Tuesday, April 8, 2014

Scan a site for a list of URLs - use in a sitemap or for other scans

This is the sixth version of this scanner I use (there are some other versions). I find it quite practical, as it lets me scan a folder easily and come back with a good number of URLs.

It makes a header request for each page and stores the responses in a text file. Once the scan stops, the script keeps only the 200 responses in a separate file, then filters that down to just the URLs in the final file.
I use a random number for the filename and keep these files in a dedicated folder. That way I don't have to worry about incompatible characters in the filename, about extracting a basename, or about duplicate filenames overwriting files when I run a scan several times, on purpose or not. I keep the intermediate text files so I can go back and check where something went wrong. Every now and then, I clean up the folder for these files.
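
As a rough sketch of that cleanup step (the ~/scans folder name and the 30-day cutoff are just placeholders, not part of the script), something like this would do:

# Delete intermediate scan files older than 30 days - adjust the path and age to taste
find ~/scans -maxdepth 1 -name '*-forsitemap-*.txt' -mtime +30 -delete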

#!/bin/bash
# Usage: pass the start URL as the only argument
url="$1"
echo "$url"
sleep 10

# Random number as the filename prefix, to avoid clashes between runs
name=$RANDOM

# Spider the site without downloading anything; keep the request-URL lines
# and the "HTTP request sent" status lines in the raw file
wget --spider -l 10 -r -e robots=on --max-redirect 1 -np "$url" 2>&1 \
  | grep -e 'http://' -e 'HTTP request sent' >> "$name"-forsitemap-raw.txt

echo "$name"

# Keep each URL line that directly precedes a "200 OK" status line
grep -B 1 "200 OK" "$name"-forsitemap-raw.txt > "$name"-forsitemap-200s.txt
# Drop the status lines and the "--" group separators that grep -B inserts
grep -v -e "200 OK" -e '^--$' "$name"-forsitemap-200s.txt > "$name"-forsitemap-urls.txt
# Strip everything before the URL itself
sed -i "s/^.*http:/http:/" "$name"-forsitemap-urls.txt
# De-duplicate in place and show the numbered result
sort -u -o "$name"-forsitemap-urls.txt "$name"-forsitemap-urls.txt
cat -n "$name"-forsitemap-urls.txt
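
To run it, just pass the start URL, for example: bash scan-for-sitemap.sh http://example.com/folder/ (the script name is whatever you saved it as). If you want to turn the final URL list straight into a sitemap, a minimal sketch like the one below works; it assumes the cleaned file is called 12345-forsitemap-urls.txt and writes bare <url><loc> entries without lastmod or priority:

#!/bin/bash
# Wrap each URL from the cleaned list in a <url><loc> entry (bare-bones sitemap)
urls="12345-forsitemap-urls.txt"
{
  echo '<?xml version="1.0" encoding="UTF-8"?>'
  echo '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
  while read -r u; do
    echo "  <url><loc>$u</loc></url>"
  done < "$urls"
  echo '</urlset>'
} > sitemap.xml

URLs containing XML-special characters such as & would still need escaping before the file is valid.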


Thoughts? Feedback?
