Monday, August 12, 2013

Clean up WGET results for sitemap

After running the script to pull URLs via wget, the next step is to clean the output down to plain URLs. The URLs need to be on the right domain / folder and need to have returned a 200 OK HTTP response. So now there are a bunch of text files with URLs in them, but not just URLs; there is a lot more:

2013-07-16 21:39:00 URL:http://www.dell.com/ [21149] -> "www.dell.com/index.html" [1]
2013-07-16 21:39:00 URL:http://www.dell.com/robots.txt [112] -> "www.dell.com/robots.txt" [1]
2013-07-16 21:39:01 URL: http://www.dell.com/_/rsrc/1373543058000/system/app/css/overlay.css?cb=slate712px150goog-ws-left 200 OK
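For the three sample lines above, all the sitemap really needs is the bare URL, so the cleaned-up files should end up looking like this:

http://www.dell.com/
http://www.dell.com/robots.txt
http://www.dell.com/_/rsrc/1373543058000/system/app/css/overlay.css?cb=slate712px150goog-ws-left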

Now, the timestamps and file sizes are good to know, but not really usable for building a sitemap. Being used to fixing a lot of things in Excel, that was my first try, but I quickly gave up. The only way to come close to cleaning it up was converting table to text, and even that did not clean up everything and took too long.

#!/bin/bash
# loop through the numbered wget log files
for i in {1..30}; do
  # keep only the lines that contain a URL and match my domain
  grep 'URL' wgetlog$i.txt | grep -i 'dell.com' > wget$i.txt
  # strip the leading date/time up to "URL:", drop a trailing " 200 OK" so it does not
  # get glued onto the URL, cut everything from the " [size] -> file" part onward,
  # remove leftover spaces, and finally drop any line that no longer contains dell.com
  sed -e 's/^[0-9 -:]*URL://g' -e 's/ 200 OK$//' -e 's/\ \[.*$//g' -e 's/\ //g' -e '/dell\.com/!d' wget$i.txt > support-url-$i.txt
done
Not that pretty, but it works fine. I just need to adjust the file name to match what I use in the wget script and set the loop count to the number of log files, and then it works just fine.
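If adjusting the hard-coded 1..30 by hand gets tedious, a small variation (just a sketch, assuming the same wgetlog*.txt naming as above) loops over whatever log files exist, so only the name pattern ever needs changing; it also pipes straight through instead of writing the intermediate wget$i.txt files:

#!/bin/bash
# same cleanup as above, but driven by a glob instead of a fixed count
for f in wgetlog*.txt; do
  n=${f#wgetlog}; n=${n%.txt}   # pull the number out of the file name
  grep 'URL' "$f" | grep -i 'dell.com' \
    | sed -e 's/^[0-9 -:]*URL://g' -e 's/ 200 OK$//' -e 's/\ \[.*$//g' -e 's/\ //g' -e '/dell\.com/!d' \
    > "support-url-$n.txt"
done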

This is great for any site with just a couple of thousand pages and a few sitemap updates per year.
