So I started to check various tools, like Screaming Frog and sitebuilder. Xenu was not reliable the last time I tried it, and these two tools did not work as hoped either, since the site is relatively large. And while Screaming Frog is great and fast, it slows down considerably after a few thousand URLs.
Using Linux at home, I quickly started my first trials with cURL and wget. cURL was ruled out quickly, so focusing on wget I tried a few things.
First, I just started with the root URL, and then waited:
wget --spider --recursive --no-verbose --no-parent -t 3 -4 --save-headers --output-file=wgetlog.txt &
--spider for only getting the URLs, --recursive with --no-parent for the whole directory but nothing above it, -t 3 for three tries per URL, and --output-file for sending everything to the logfile.
Slowly but surely the list kept building. I added -4 after some research, as it was said to speed things up by forcing an IPv4 request.
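By the way, a rough way to turn that log into the plain url-list.txt I feed to xargs below (just a sketch, the grep pattern simply matches anything URL-shaped and de-duplicates it, so the exact wget log format does not matter much):

# pull every http(s) URL out of the wget log and keep each one only once
grep -Eo 'https?://[^ "]+' wgetlog.txt | sort -u > url-list.txt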
Still very slow, so I tried to run this with xargs:
xargs -n 1 -P 10 wget --spider --recursive --no-verbose --no-parent -t 2 -4 --save-headers --output-file=wgetlog.txt < url-list.txt &

I did not really see an improvement (just a gut feeling about the time, no measurement), but it was definitely still too slow to get through 10,000+ URLs in a day.
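One thing to keep in mind with that call: all ten parallel wget processes write to the same --output-file, so they overwrite each other's log. A variant like this (untested sketch; the tr-based filename mangling is just one way to do it) would give every URL its own log file instead:

# one log file per URL, named after the URL with non-alphanumeric characters replaced by _
xargs -P 10 -I{} sh -c 'wget --spider --recursive --no-verbose --no-parent -t 2 -4 --save-headers --output-file="wgetlog-$(printf %s "$1" | tr -c "A-Za-z0-9" _).txt" "$1"' _ {} < url-list.txt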
After some research I came up with this solution, and it seems to work well enough:
I split the site into several sections, then gathered the top ~10 URLs in a text file, which I used as input for a loop in a bash script (the # echo lines I use for testing the scripts; I am a pretty bloody beginner and this helps):
#!/bin/bash
i=0
while read -r line; do
  i=$((i+1))   # counter so every section gets its own log file name
  # echo $line
  wget --spider --recursive --no-verbose --no-parent -t 3 -4 --save-headers --output-file=wgetlog$i.txt "$line"
  # echo wgetlog$i.txt
done < urls.txt

In the wget line, $line stands for the URL read from the input file; wget takes variables without a problem. Works well. I get a bunch of wgetlog files with different names containing all the URLs, and it sure seemed faster than xargs, although I read that xargs is better at distributing load.
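For comparison, a backgrounded variant of the same loop (just an untested sketch) would start one wget per section and then wait for all of them, similar to what xargs -P does:

#!/bin/bash
i=0
while read -r line; do
  i=$((i+1))
  # start each section's crawl in the background, each with its own log file
  wget --spider --recursive --no-verbose --no-parent -t 3 -4 --save-headers --output-file=wgetlog$i.txt "$line" &
done < urls.txt
# wait until every backgrounded wget has finished
wait

With only ~10 section URLs in urls.txt that should be fine; with many more, the number of parallel crawls would need to be capped somehow.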
If I am missing an opportunity to make it faster, shorter, better - I am grateful for suggestions.