- Use wget to spider the site: recursive, no parent, unlimited depth (the whole section, but nothing more)
- Ignore redirects: they don't belong in a sitemap, and it also makes the resulting file easier to work with
- Grep for all lines containing 'http://' to get the URLs, plus the lines with the HTTP response code, i.e. 301, 302, 404, 200
- Append the output to a file
wget --spider -r -l inf --max-redirect=0 -np -nc www.dell.com/support/my-support/ie/en/iedhs1/ 2>&1 | grep -e 'http://' -e 'HTTP request sent' >> ms-ie-sitemap-raw.txt
Spider mode has the disadvantage that every piece of output gets its own line, so each URL is on one line and its status on another. That's why --max-redirect=0 is relevant: it suppresses the list of redirects and shows only the final status. This way, each entry consists of exactly two lines, which comes in handy for making this usable for a sitemap.
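After the grep, the raw file looks roughly like this (the exact wording varies between wget versions, and the timestamps and second path are made up for illustration):

--2019-05-20 10:00:00--  http://www.dell.com/support/my-support/ie/en/iedhs1/
HTTP request sent, awaiting response... 200 OK
--2019-05-20 10:00:01--  http://www.dell.com/support/my-support/ie/en/some-other-page
HTTP request sent, awaiting response... 301 Moved Permanently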
Now this just needs some cleaning up:
Merge each pair of lines into one (one has the URL, one has the HTTP status), replacing the newline with a tab:

sed -i 'N;s/\nHTTP/\tHTTP/g' ms-ie-sitemap-raw.txt

Remove all lines with an HTTP status other than 200:

sed -i '/200/!d' ms-ie-sitemap-raw.txt

Remove everything except the URL (first everything from the start of the line up to the URL, then everything from the tab after the URL to the end):

sed -i 's/^.*http/http/g' ms-ie-sitemap-raw.txt
sed -i 's/\tHTTP.*//' ms-ie-sitemap-raw.txt

Dedupe:
sort ms-ie-sitemap-raw.txt | uniq > final-file.txt
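Since sort can dedupe on its own, the last step can also be a single call:

sort -u ms-ie-sitemap-raw.txt > final-file.txt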
I ended up with ~8000 unique URLs!
I wasn't sure whether -e could chain all of those sed commands into a single call, since with grep the -e patterns are alternatives (match any). With sed, though, the -e expressions are simply concatenated into one script and executed in order, one after another, so they can be combined after all.
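A sketch of the whole cleanup as one invocation, assuming GNU sed and the file names from above:

sed -i -e 'N;s/\nHTTP/\tHTTP/g' -e '/200/!d' -e 's/^.*http/http/g' -e 's/\tHTTP.*//' ms-ie-sitemap-raw.txt
sort -u ms-ie-sitemap-raw.txt > final-file.txt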
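Finally, to go from the flat URL list to an actual XML sitemap, a small shell loop is enough. This is just a sketch: sitemap.xml is an assumed output name, and URLs containing characters like & would still need XML-escaping:

{
  echo '<?xml version="1.0" encoding="UTF-8"?>'
  echo '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
  # one <url> entry per line of the deduped list
  while read -r url; do
    echo "  <url><loc>$url</loc></url>"
  done < final-file.txt
  echo '</urlset>'
} > sitemap.xml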