Thursday, October 10, 2013

Scan site or folder to generate a sitemap with wget

After searching for a way to build a sitemap easily and quickly, without any server-side installation or software - and finding none - I came up with these small scripts. They worked fine for me when I was scanning www.dell.com/support/my-support/ie/en/04/ and some other areas of the same site.

  1. Use wget to spider the site: recursive, no parent, unlimited depth = the whole section, but nothing more
  2. Ignore redirects - they don't belong in a sitemap, and skipping them also makes the resulting file easier to work with
  3. Grep for all lines containing 'http://' to get the URL, and for the line with the HTTP response code, i.e. 301, 302, 404, 200
  4. Append everything to a file

wget --spider -r -l inf --max-redirect 0 -np -nc www.dell.com/support/my-support/ie/en/iedhs1/ 2>&1 | grep -e 'http:\/\/' -e 'HTTP request sent' >> ms-ie-sitemap-raw.txt
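
For reference, here is roughly what each piece of that one-liner does (shell comments only, nothing to run; check 'man wget' for the exact wording in your version):

# --spider            only check that URLs exist, download nothing
# -r -l inf           recurse with no depth limit (-r is what actually turns on the recursion from step 1)
# --max-redirect 0    report a redirect's status instead of following it
# -np                 --no-parent: never climb above the starting folder
# -nc                 --no-clobber: do not re-fetch files that already exist locally
# 2>&1                wget logs to stderr, so merge it into stdout for grep
# grep -e ... -e ...  keep only the URL lines and the HTTP status lines
# >> ...              append, so several runs (e.g. other folders of the same site) can collect in one raw file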

Using the spider has the disadvantage that every piece of output is on its own line, so each URL gets one line and its status gets another. That is why '--max-redirect 0' matters: it suppresses the list of redirects and shows only the status, so each entry consists of exactly two lines, which comes in handy when turning the output into something usable for a sitemap.
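
If you want to verify that assumption before cleaning up, the two patterns from the grep above should occur equally often. A quick, optional check (if the counts differ, some entries are not clean url/status pairs and the merge below will mis-align):

grep -c 'http://' ms-ie-sitemap-raw.txt
grep -c 'HTTP request sent' ms-ie-sitemap-raw.txt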

Now this just needs some cleaning up:


  1. Merge every two lines into one (one has the URL, the next has the HTTP status), replacing the newline with a tab:
     sed -i 'N;s/\nHTTP/\tHTTP/g' ms-ie-sitemap-raw.txt
  2. Remove all lines with an HTTP status other than 200:
     sed -i '/200/!d' ms-ie-sitemap-raw.txt
  3. Remove everything except the URL (first from the start of the line to the URL, then from the end of the URL to the end of the line):
     sed -i 's/^.*http/http/g;s/\tHTTP.*$//' ms-ie-sitemap-raw.txt
  4. Dedupe:
     sort ms-ie-sitemap-raw.txt | uniq > final-file.txt
I ended up with ~8000 unique URLs!

Not sure if I could have used -e to chain all of these together, but I guess they need to be piped, because they are not alternatives (match any) but need to run one after another. True?
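
As it turns out, sed runs multiple -e expressions in order on each line within a single pass (it is grep's -e that means 'match any of these'), so the three edits could be chained and piped straight into the dedupe step. An untested sketch, using the same file names as above:

sed -e 'N;s/\nHTTP/\tHTTP/g' -e '/200/!d' -e 's/^.*http/http/' -e 's/\tHTTP.*$//' ms-ie-sitemap-raw.txt | sort | uniq > final-file.txt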
