andreas.wpv: sitemap

Showing posts with label sitemap. Show all posts

Thursday, June 12, 2014

Finally: free to use Sitemap Generator with automated 'priority' configuration

Easy to use - at least in an linux environment. Not sure if adjustments are necessary for various (bash) flavors, same with -nix emulators like cygwin. Sitemaps can have up to 50,000 links - so this should be sufficient for many sites.

First, check if called with a file - ideally the list of urls from an earlier scan with one of the scanners of this site, or from somewhere else, perhaps a content management system. Then, grab the file name without ending and add a random number to avoid overwriting of existing files.

if [ -z $1 ]
then echo "needs a file to start with"
else
file=$(basename "${1}")
name=${file%.*}-$RANDOM.txt
fi

Clean up the file from empty lines, then from lines that start with underscore or hyphen - errors I found a few times.

sed -i '/^$/d' "${1}"
sed -i '/^[_\|-]/d' "${1}"

Echo the header into the new file - xml definition for a sitemap.

echo "" > "${name}"

Then, check if the url has a http:// in front, if not, add it, otherwise the following calculations get tricky.
Next - count the number of "/ ", subtract the two dashes from the http://. The 'gsub' returns the number of replacements - and that's all I need here, so I replace a dash with a dash.

The next step calculates the priority based on the depth in the folder structure - the lower in hierarchy, the lower in priority. This is a setting that might need to be changed, depending on the structure of a site.

awk ' $1 !~ /^http/ { $1 = "http://"$1 } { count=(gsub("/" , "/" , $1)-2) } count > 9 { print $1, "0.1" } ( count > 0 && count <=9 ) { print $1, (10-count)/10 } count <= 0 { print $1, "1" }' $1 | while read input1 input2

The little sed inset is then used to remove the additional decimals - this was the easiest way to do this.
Then just echo it into the overall structure of the sitemap entries, then end the while loop with done.

do priorityvalue=$(echo $input2 | sed 's/^$...$.*/\1/')
echo -e "$\n$input1\n$priorityvalue\n" >> "${name}"

done

Finally, add the sitemap footer per definition and show the result

echo "

" >> "${name}"cat $name

looks like this:

http://www.tage.de 1
http://www.directorinternet.com/ 0.9
http://andreas-wpv.blogspot.com/2014/05/new-google-schema-implementation-for.html 0.7

Same script on Dropbox for easy use.

Thursday, May 15, 2014

Scraper for video pages to get all data for video sitemap

This scraper is based mainly on opengraph tags (which are used by Facebook, for example), so it should work well with many pages, not just Dell.com pages. More info on sitemaps at Google .

#! bash

#check if its called with a filename - a file containing urls for pages with videos

if [[ ! $1 ]] ; then
echo "need to call with filename"
exit 1
else

#now make sure to have a unique filename, based on the file with the urls

filename=$(basename $1)
name=${filename%.*}-$RANDOM-

#header info for the file with the results. Need to pull page url, video title, thumbnail url, description.

echo -e 'url\tpage\tTitle\tThumb\tDescription' > ${name}video-sitemap-data.txt

#loop through the file and store in a variable

while read -r line; do
filecontent=$(wget -qO- "$line")

# echo results and clean up with sed, tr and grep, then append to the file that has the column headers already. It has 4 elements - and each is isolated in its own part. The parts are connected with &&, and everything in ( and ) - otherwise it only echos the last part into the file.

(echo "$line" | sed 's/\r$/\t/' | tr '\n' '\t' && echo "$filecontent" | grep "og:video" | grep "swf" | sed -e "s/^.*content=\"//" -e "s/\".*$//" | sed 's/\r$/\t/' | tr '\n' '\t' && echo "$filecontent" | grep "og:title" | sed -e "s/^.*content=\"//" -e "s/\".*$//" | sed 's/\r$/\t/' | tr '\n' '\t' && echo "$filecontent" | grep "og:image" | sed -e "s/^.*content=\"//" -e "s/\".*$//" | sed 's/\r$/\t/' | tr '\n' '\t' && echo "$filecontent" | grep "og:description" | sed -e "s/^.*content=\"//" -e "s/\".*$//") >> ${name}video-sitemap-data.txt
done < "$1"
fi

I'd be delighted to know this helped someone else - why don't you drop me a note when you do?

This is one of the pages that I used for testing, just in case someone wants to test this:
http://video.dell.com/details.php?oid=RraTJsZDrsDKFAjNcj2WmTovA1ovugc-&c=us&l=en&s=bsd&p=learn .

Tuesday, April 8, 2014

Scan site for list of urls - use in sitemap or for other scans

This is the sixth version (some other versions) of this scanner I use - I find it quite practical as it allows me to scan folders easily, come back with good number of urls.

It makes a header request, then stores these in a textfile. Once the scan stops, the script cleans out to only have the 200 responses in a separate file, then filters to only have the url in the final file.
I use a random number for the filename, as I use these in a special folder this allows me to not worry about incompatible characters in the filename, filtering out the basename or duplicate filenames / overwriting files in case I run a scan several times - on purpose or not. I ceep the intermediate textfiles so I can go back and check where something went wrong. Every now and then, I clean up the folder for these files.

#!bash
url="$1"
echo $url
sleep 10

name=$RANDOM

wget --spider -l 10 -r -e robots=on --max-redirect 1 -np "${1}" 2>&1 | grep -e 'http:\/\/' -e 'HTTP request sent' >> "$name"-forsitemap-raw.txt

echo $name

grep -B 1 "200 OK" "$name"-forsitemap-raw.txt > "$name"-forsitemap-200s.txt
grep -v "200 OK" "$name"-forsitemap-200s.txt > "$name"-forsitemap-urls.txt
sed -i "s/^.*http:/http:/" "$name"-forsitemap-urls.txt
sort -u -o"$name"-forsitemap-urls.txt "$name"-forsitemap-urls.txt
cat -n "$name"-forsitemap-urls.txt

Thoughts? Feedback?

Thursday, April 3, 2014

Pull data for video sitemap

Video sitemaps are sometimes helpful to make Search engines aware of videos on a site. We use several systems to generate pages with videos, and as a result it is not easy to get information from the back-end to generate sitemaps. So - like Google - we have to take it from the front end as much as possible. This script with details is likely limited to just Dell.com, and even here I have found that videos in some sections are not able to be indexed by this. Still, this has been extremely helpful to find the 'hidden' details on our video implementations. (And yes, we have requirements to change these in the process since a while :-) ).

Elements necessary for a sitemap are:

Pageurl
Title
Keywords
Description
Video URL
Thumbnail url

And this scripts pulls it nicely of many of our video pages. (We use open graph tags, which makes it relatively easy to pull most info). The script needs to be called with the filename of the text list of urls as first parameter ( . script.sh listofpages.txt)

if [[ ! $1 ]] ; then
echo "need to call with filename"
exit 1
fi
file=$RANDOM-sitemap-data.txt
echo $file
echo -e "url\tTitle\tthumbnail\tdescription" > $file
while read -r line; do
filecontent=$(wget -qO- "$line")
wait
(echo "$line" | sed 's/\r$/\t/' | tr '\n' '\t' && echo "$filecontent" | grep "og:video" | grep "swf" | sed -e "s/^.*content=\"//" -e "s/\".*$//" | sed 's/\r$/\t/' | tr '\n' '\t' && echo "$filecontent" | grep "og:title" | sed -e "s/^.*content=\"//" -e "s/\".*$//" | sed 's/\r$/\t/' | tr '\n' '\t' && echo "$filecontent" | grep "og:image" | sed -e "s/^.*content=\"//" -e "s/\".*$//" | sed 's/\r$/\t/' | tr '\n' '\t' && echo "$filecontent" | grep "og:description" | sed -e "s/^.*content=\"//" -e "s/\".*$//" ) >> $file
done < "$1"
cat -A "$file"

As always - I use this, and would love to hear tips to improve or see other scripts for site optimization and maintenance.

Monday, September 30, 2013

Download urls from sitemap into textfile

Sitemaps again - they are still very helpful, especially for a large site. For several processes the urls are necessary - and going back to the sourcefile is not always possible or practical.

So, here is a small bash script to scan a given sitemap and store the results just the urls into a textfile. Input parameter is the full url of the sitemap. Let's name this sitemap-urls.sh then it would be

# bash sitemap-urls.sh http://www.dell.com/wwwsupport-us-sitemap.xml

#!bash
if [[ ! $1 ]]; then echo 'call script with parameter of url for file'
exit 1
else
wget -qO- "${1}" | grep "loc" | sed -e 's/^.*loc>//g' -e 's/<\/loc>.*$//g' > sitemap-scan-output.txt
fi

First checking if the file is called with the sitemap url as parameter ($1) and if not, exit with echoing a message. If parameter is set, then download the page without saving it, grep for the right line, and use sed to replace the irrelevant parts, means html tags, with nothing to remove it.

Not fancy, but still good to have. I will use this to check for a few interesting things in next posts, and this is really helpful also if the urls are needed for import in analytics tools and Excel.

Monday, September 23, 2013

Check url for http response codes with curl

A little linux helper to check the status of urls in a sitemap, based on the server response code.

Currently redirects are said to be not good for Bing ranking, neutral for Google. We want to rank in both, so we don't want 300s, and sure no 400s or 500s - the error response codes.

For this example from work I use curl, easy and fast, and "support.dell.com".

curl -D - support.dell.com

-D - means dump header into file - meaning stdout.
LONG result, but direction is correct.
Now add a -o /dev/null, meaning move content into output /dev/null.

curl -D - support.dell.com -o /dev/null

Still too long, but getting closer.
So I'll use sed to print just the header response, based on the regex HTTP:
curl -D - support.dell.com -o /dev/null | sed -n '/HTTP/p'

STILL not there. Adding -s to curl to silence the speed-o-meter gives me:

curl -s -D - support.dell.com -o /dev/null | sed -n '/HTTP/p'

results in: HTTP/1.1 200 OK

Got it!
It sure has limitations, this is not going to help identify server level rewrites or reverse proxy redirects without intermediate non 200 http response, nor is it going to identify a http-refresh. I still find it pretty helpful. The first is still good to submit to the Search Engines, and the second is rare, fortunately, at least where I work.

This again is patched together from a variety of sources, including stackoverflow, a sed post by Eric Pemment and little bits from Andrew Cowie (yep, that's about apis, but still helped): thanks everyone!

Thanks Andy, this is a great addition you suggest in the comments to add the L to follow redirects! I would then extend the sed to get this:

curl -s -L -D - www.dell.com/support/ -o /dev/null | sed -n '/HTTP\|Location/p'

follows redirects, and with the extended sed we see the url and the http response like this:

HTTP/1.1 302 Found Location: http://www.dell.com/support/home/us/en/19?c=us&l=en&s=dhs
HTTP/1.1 301 Moved Permanently Location: /support/my-support
HTTP/1.1 301 Moved Permanently Location: /support/my-support/us/en/19
HTTP/1.1 200 OK

Monday, August 12, 2013

Clean up WGET results for sitemap

After running the script to get urls per wget, now got to clean them out to get just plain urls. The urls need to be on the right domain / folder and need to have had a 200 OK http respone. So, now there are a bunch of text files with urls in them, but not just urls, but a lot more:

2013-07-16 21:39:00 URL:http://www.dell.com/ [21149] -> "www.dell.com/index.html" [1]
2013-07-16 21:39:00 URL:http://www.dell.com/robots.txt [112] -> "www.dell.com/robots.txt" [1]
2013-07-16 21:39:01 URL: http://www.dell.com/_/rsrc/1373543058000/system/app/css/overlay.css?cb=slate712px150goog-ws-left 200 OK

Now, that's all good to know, but not really usable for building a sitemap. Being used to work and fix a lot of things in excel, that was my first try - but I quickly quit. The only way to come close to cleaning up was changing table to text, and even that did not clean up everything - and took too long, too.

#! bash

#loop through the files

for i in {1..30}; do

#pull only the lines with a URL, then all that do NOT match my domain

grep 'URL' wgetlog$i.txt | grep -i 'dell.com' > wget$i.txt

# delete beginning date including url then delete ending part from space on, including file size numbers, then remove spaces, then remove lines not #matching Dell.com

sed -e 's/^[0-9 -:]*URL://g' -e 's/\ \[.*$//g' -e 's/\ //g' -e '/dell\.com/!d' wget$i.txt > support-url-$i.txt

done

Not that pretty, but works fine. I just need to adjust the file name to what I use in the wget script, plus set the number to the number of files used, and then it works just fine.

This is great for any site using just a couple of thousand pages and a few sitemap updates per year.

Monday, July 8, 2013

Pull urls from site for sitemap with wGet

Working for a large company we can use a lot of different tools to do our job. One thing I wanted to do is to build a sitemap for a site where the content management system does not provide this feature.

So, I started to check various tools, like screaming frog, sitebuilder. Xenu was not reliable last time I tried, and these two tools did not work as wished for as well, the site is relatively large. And while screaming frog is great and fast, it slows down very much after a few thousand urls.

Using linux at home, I quickly started my first trials with cURL and wget. Curl was ruled out quickly, so focusing on wget I tried a few things.

First, I just started with the root url, and then waited:

wget --spider --recursive --no-verbose --no-parent -t 3 -4 –save-headers --output-file=wgetlog.txt &

spider for only getting the urls, recursive with no-parent for the whole directory but nothing above, -t 3 for three trials to download a url, sending urls to the logfile.
Slowly but surely the list kept building. Added -4 after some research, as is was said to help speed up to force a IPv4 request.

Still very slow, so I tried to run this with xargs:

xxargs -n 1 -P 10 url-list.txt wget --spider --recursive --no-verbose --no-parent -t 2 -4 -save-headers --output-file=wgetlog.txt &

I did not really see an improvement - just plain 'feeling' of time, but it was definitely still to slow to go through 10,000 + urls in a day.

After some research I came up with this solution, and it seems to work well enough:
I split the site into several sections, and then gathered the top ~ 10 urls in a textfile, which I used as input for a loop in a bash script (the # echo I use for testing the scripts, I am a pretty bloody beginner and this helps) :

#! bash

while read -r line; do

#echo $line

wget --spider --recursive --no-verbose --no-parent -t 3 -4 –save-headers --output-file=wgetlog$i.txt $line

#echo wgetlog$i.txt

done < urls.txt

In the wget line the $line stands for the input file into wget, it takes variables. Works well. I get a bunch of wgetlog files with different names with all the urls, and it sure seemed faster than xargs, although I read that xargs is better in distributing load.