Thursday, October 31, 2013

Check the number of Google Plus ripples of the urls in your sitemap

Want to see how many ripples a page has? The easiest way to check all your pages is to use your sitemap.xml. This only works for public shares (public ripples), because Google does not show any number or information about non-public shares, and it only shows ripples for 'regular' pages, not for Google Plus posts - but who has those in their sitemap.xml anyway?
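
For a single page you can check this by hand. A minimal sketch - the url is just a placeholder, and the grep pattern is the same one used in the scripts below:

#!bash
# fetch the ripples page for one url and keep only the digits of the public share count
url="http://www.example.com/some-page/"
wget -qO- "https://plus.google.com/ripple/details?url=${url}" | grep -o "[0-9]*\s*public\s*shares" | sed "s/[^0-9]//g"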

This little tool consists of three scripts. They are shown below so you can take a look; behind the links are the source files for Linux bash:

  1. The script to pull the urls from the sitemap
  2. The script to get the ripples for each url and store url and number of public ripples
  3. The script combining both.
I split this into three scripts because I use sitemaps for several things, and with these modules it is easier to reuse parts - like the script to pull urls from an xml sitemap.

1. Wget the sitemap and clean it up to keep only the urls:

#!bash
if [[ ! $1 ]]; then
  echo 'call with the url of the sitemap as parameter'
  exit 1
else
  # random output file name so repeated runs do not overwrite each other
  filename="output-${RANDOM}"
  # keep only the <loc> lines and strip the tags so that just the urls remain
  wget -qO- "${1}" | grep "loc" | sed -e 's/^.*<loc>//g' -e 's/<\/loc>.*$//g' > "${filename}"
  #echo $filename
fi
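
If you save this as get-urls-from-sitemap (the name used in the combining script below), a quick test could look like this; the sitemap url is only an example:

#!bash
# source it so that $filename stays available in the current shell
source get-urls-from-sitemap "http://www.example.com/sitemap.xml"
# $filename now holds the name of the generated url list, one url per line
head "${filename}"
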
2. Loop through the url list and load the page showing the ripples for each url. Grep the right line, isolate the part with the number, and then store url and number of public ripples in a csv.
#!bash
while read -r line; do
  number=0
  re='[0-9]+'
  # ripples page for this url
  pull="https://plus.google.com/ripple/details?url=${line}"
  # grab the '... public shares' snippet and strip everything except the digits
  number=$(wget -qO- "${pull}" | grep -o "[0-9]*\s*public\s*shares.<" | sed "s/[^0-9]//g" | tr "\n" "\t" | sed 's/\thttp/\nhttp/g')
  # only use it if it really is a number, otherwise fall back to 0
  if [[ $number =~ $re ]]; then
    value=${number}
  else
    value="0"
  fi
  # url, tab, number of public ripples
  echo -e "$line\t$value" >> "${1}.csv"
done < "${1}"
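
Saved as loop-through-sitemap (again, the name used in the combining script below), it takes the url list from step 1 as parameter; output-12345 is only a placeholder for the random file name created in step 1:

#!bash
bash loop-through-sitemap output-12345
# the result is written next to the input file as output-12345.csv
cat output-12345.csv
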
3. For convenience, use this script to call the scripts above in the right order and get the ripples for all urls in your sitemap. One command, all done.

#!bash
if [[ ! $1 ]]; then
  echo 'need input xml file'
else
  # pull the urls from the sitemap; this sets $filename
  source get-urls-from-sitemap "$1"
  wait ${!}
  # get the ripples for every url in $filename
  source loop-through-sitemap "${filename}"
  wait ${!}
  # show the result: url, tab, number of public ripples
  cat "${filename}.csv"
fi
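
With all three files in the same directory, one command does it all; check-sitemap-ripples is just a placeholder name for the combining script:

#!bash
bash check-sitemap-ripples "http://www.example.com/sitemap.xml"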

It might not be the easiest way to do this, but it works just fine. Please feel free to suggest improvements. I tested this only on Ubuntu 12.04.

Wednesday, October 16, 2013

Check URLs for ripples

How many ripples does a page have? Thanks to +AJ Kohn we have a little browser snippet showing this for each individual page.

How about a list of pages?


See the number of ripples for all urls in a text file

You can call this script in bash; adjust the input file (urls.txt) and the output file name and location:
#!bash
while read -r line; do
  number=0
  re='[0-9]+'
  # ripples page for this url
  pull="https://plus.google.com/ripple/details?url=${line}"
  # grab the '... public shares' snippet and strip everything except the digits
  number=$(wget -qO- "${pull}" | grep -o "[0-9]*\s*public\s*shares.<" | sed "s/[^0-9]//g" | tr "\n" "\t" | sed 's/\thttp/\nhttp/g')
  # only use it if it really is a number, otherwise fall back to 0
  if [[ $number =~ $re ]]; then
    value=${number}
  else
    value="0"
  fi
  # url, tab, number of public ripples
  echo -e "$line\t$value" >> ~/Desktop/spider-public-shares.txt
done < urls.txt


Looping through urls.txt: wget the ripples page for each url, isolate the number of ripples, and store it in the variable $number. If that value really is a number, use it; if not, store zero in the variable $value. Then echo the url ($line), a tab, and the number for each url into the output file. This works only for ripples of regular pages, not for Google Plus posts.
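
To quickly see which pages collected the most ripples, you can sort the result by the second column - just a convenience step on top of the output file:

#!bash
# sort by the tab-separated share count, highest number first, and show the top entries
sort -t$'\t' -k2 -nr ~/Desktop/spider-public-shares.txt | head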

Thursday, October 10, 2013

Scan site or folder to generate a sitemap with wget

After searching for a way to build a sitemap easily and quickly, without any server-side installation or software - and finding nothing - I came up with these small scripts. They worked fine for me when I was scanning www.dell.com/support/my-support/ie/en/04/ and some other areas of the same site.

  1. Use wget to spider the site: recursive, no parent, unlimited in depth = the whole section, but nothing more
  2. Ignore redirects - they do not belong in a sitemap, and it also makes the resulting file easier to work with
  3. Grep for all lines with 'http://' to get the url, and for the line with the http response, i.e. 301, 302, 404, 200
  4. Append to file

wget --spider -l inf --max-redirect 0 -np -nc www.dell.com/support/my-support/ie/en/iedhs1/ 2>&1 | grep -e 'http:\/\/' -e 'HTTP request sent' >> ms-ie-sitemap-raw.txt

Using the spider has the disadvantage that every piece of output is on its own line, so each url gets one line and its status gets another. That is why 'max redirect 0' matters: it suppresses the list of redirects and shows only the status. This way, each entry consists of exactly two lines, which comes in handy when turning the output into a sitemap.

Now this just needs some cleaning up:


  1. Merge each pair of lines into one (one has the http status, one has the url), replacing the newline with a tab:
  sed -i 'N;s/\nHTTP/\tHTTP/g' ms-ie-sitemap-raw.txt
  2. Remove all lines with a http status other than 200:
  sed -i '/200/!d' ms-ie-sitemap-raw.txt
  3. Remove everything except the url (first from start to url, then from url to end):
  sed -i 's/^.*http/http/g' ms-ie-sitemap-raw.txt
  4. Dedupe:
  sort ms-ie-sitemap-raw.txt | uniq > final-file.txt
I ended up with ~ 8000 unique urls!

Not sure if I could use -e to chain these all together, but I guess they need to be run one after another, because they are not alternatives (match any) but need to be applied in sequence. True?
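
For the record, multiple -e expressions within one sed call are also applied one after another, not as alternatives, so the steps should be combinable roughly like this - an untested sketch, and the last expression is only my guess at the 'from url to end' part mentioned above:

#!bash
# all expressions in one sed invocation; they run in order on each (merged) line
# the final expression, cutting the status tail after the url, is an assumed addition
sed -i -e 'N;s/\nHTTP/\tHTTP/g' -e '/200/!d' -e 's/^.*http/http/g' -e 's/\tHTTP.*$//g' ms-ie-sitemap-raw.txt
sort ms-ie-sitemap-raw.txt | uniq > final-file.txt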
