2013-07-16 21:39:00 URL:http://www.dell.com/ [21149] -> "www.dell.com/index.html" [1]
2013-07-16 21:39:00 URL:http://www.dell.com/robots.txt [112] -> "www.dell.com/robots.txt" [1]
2013-07-16 21:39:01 URL: http://www.dell.com/_/rsrc/1373543058000/system/app/css/overlay.css?cb=slate712px150goog-ws-left 200 OK
Now, that's all good to know, but not really usable for building a sitemap. Since I'm used to working in Excel and fixing a lot of things there, that was my first try, but I quickly gave up. The only way to come close to cleaning up the data was converting the table to text, and even that did not clean up everything, and it took too long.
#!/bin/bash
# loop through the log files
for i in {1..30}; do
  # keep only the lines that contain a URL, then only those that match my domain
  grep 'URL' wgetlog$i.txt | grep -i 'dell.com' > wget$i.txt
  # strip the leading date and the URL: prefix, delete everything from the space
  # before the file size onward, remove remaining spaces, then drop any lines
  # not matching dell.com
  sed -e 's/^[0-9 -:]*URL://g' -e 's/\ \[.*$//g' -e 's/\ //g' -e '/dell\.com/!d' wget$i.txt > support-url-$i.txt
done

Not that pretty, but it works fine. I just need to adjust the file names to match what I use in the wget script and set the loop count to the number of log files, and it does the job.
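From there, one way to turn those cleaned lists into an actual sitemap is a minimal sketch like the one below. It is not part of the original workflow; it only assumes the support-url-*.txt files produced by the script above, and that the URLs in them still carry their http:// prefix.

#!/bin/bash
# Hypothetical follow-up step: merge the cleaned URL lists into a basic sitemap.xml.
# Assumes the support-url-*.txt files written by the loop above.
{
  echo '<?xml version="1.0" encoding="UTF-8"?>'
  echo '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
  # sort -u removes duplicate URLs collected across the separate log files
  sort -u support-url-*.txt | while read -r url; do
    # note: URLs containing & would need XML escaping for a strictly valid sitemap
    echo "  <url><loc>$url</loc></url>"
  done
  echo '</urlset>'
} > sitemap.xml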
This is great for any site with just a couple of thousand pages and a few sitemap updates per year.