Monday, July 8, 2013

Pull URLs from a site for a sitemap with wget

Working for a large company, we get to use a lot of different tools to do our job. One thing I wanted to do was build a sitemap for a site whose content management system does not provide this feature.

So I started checking various tools, like Screaming Frog and Sitebuilder. Xenu was not reliable the last time I tried it, and the other two did not work as I wished either, since the site is relatively large. Screaming Frog is great and fast, but it slows down considerably after a few thousand URLs.

Using Linux at home, I quickly started my first trials with cURL and wget. cURL was ruled out quickly, since it does not follow links recursively on its own, so I focused on wget and tried a few things.

First, I just started with the root URL (http://example.com/ below stands in for the real site) and waited:

wget --spider --recursive --no-verbose --no-parent -t 3 -4 --save-headers --output-file=wgetlog.txt http://example.com/ &

--spider only checks the URLs instead of downloading them, --recursive with --no-parent crawls the whole directory but nothing above it, -t 3 allows three tries per URL, and --output-file sends the URLs to the log file.
Slowly but surely the list kept building. I added -4 after some research, as forcing IPv4 requests was said to speed things up.
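
Since the goal is a sitemap, the URLs still need to be pulled out of that log afterwards. A rough sketch that just grabs anything looking like an http(s) URL (the exact log format depends on the wget version, so the pattern may need adjusting):

# pull everything that looks like a URL out of the wget log, de-duplicated
grep -Eo 'https?://[^ "]+' wgetlog.txt | sort -u > url-list.txt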

Still very slow, so I tried running this with xargs:
xargs -n 1 -P 10 wget --spider --recursive --no-verbose --no-parent -t 2 -4 --save-headers --output-file=wgetlog.txt < url-list.txt &

I did not really see an improvement (just going by feel, nothing measured), but it was definitely still too slow to get through 10,000+ URLs in a day.
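
One caveat with that xargs call: with -P 10 all the parallel wget processes write to the same wgetlog.txt through --output-file, so they overwrite each other's log. Something like this (untested at scale) should keep the log intact by using wget's --append-output instead, though lines from different processes may interleave:

# run up to 10 spiders in parallel, appending to one shared log
xargs -n 1 -P 10 wget --spider --recursive --no-verbose --no-parent -t 2 -4 --save-headers --append-output=wgetlog.txt < url-list.txt &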

After some research I came up with this solution, and it seems to work well enough:
I split the site into several sections, then gathered the top ~10 URLs in a text file, which I used as input for a loop in a bash script (the commented-out echo lines are what I use for testing the script; I am a pretty bloody beginner and this helps):
#!/bin/bash
i=0   # counter so every section gets its own log file
while read -r line; do
  i=$((i+1))
  #echo "$line"
  wget --spider --recursive --no-verbose --no-parent -t 3 -4 --save-headers --output-file="wgetlog$i.txt" "$line"
  #echo "wgetlog$i.txt"
done < urls.txt
In the wget line, $line holds each URL read from urls.txt; wget happily takes variables as arguments, and the $i counter gives every run its own log file. It works well: I get a bunch of wgetlog files with all the URLs, and it sure seemed faster than xargs, although I read that xargs is better at distributing load.
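
From there, building the sitemap itself is mostly a matter of pulling the unique URLs out of all the wgetlog files and wrapping them in the sitemap XML tags. A rough sketch of that last step (it ignores <lastmod> and <priority> and only escapes ampersands, so treat it as a starting point):

#!/bin/bash
# wrap all unique URLs from the wget logs in a minimal sitemap.xml
{
  echo '<?xml version="1.0" encoding="UTF-8"?>'
  echo '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
  grep -hEo 'https?://[^ "]+' wgetlog*.txt | sort -u | sed 's/&/\&amp;/g' | while read -r url; do
    echo "  <url><loc>$url</loc></url>"
  done
  echo '</urlset>'
} > sitemap.xml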

1 comment:

If I am missing an opportunity to make it faster, shorter, or better, I am grateful for suggestions.
