Thursday, May 15, 2014

Scraper for video pages to get all data for video sitemap

This scraper relies mainly on Open Graph tags (which Facebook uses, for example), so it should work well with many pages. More info on sitemaps at Google.

#!/bin/bash
# Check that the script is called with a filename: a file containing URLs of pages with videos.
if [[ ! $1 ]] ; then
    echo "need to call with filename"
    exit 1
fi
# Now make sure to have a unique filename, based on the file with the URLs.
name=$(basename "$1")
# Header row for the results file. Need to pull page URL, video URL, video title, thumbnail URL, description.
echo -e 'url\tpage\tTitle\tThumb\tDescription' > "${name}video-sitemap-data.txt"
# Loop through the file of URLs and store the HTML of each page in a variable.
while read -r line; do
filecontent=$(wget -qO- "$line")
# Echo the results and clean them up with sed, tr and grep, then append them to the file that already has the column headers.
# Each row has 5 parts (page URL, video URL, title, thumbnail, description), and each part is isolated in its own pipeline.
# The parts are connected with && and everything is wrapped in ( and ); otherwise only the last part is echoed into the file.
(echo "$line" | sed 's/\r$/\t/' | tr '\n' '\t' &&
 echo "$filecontent" | grep "og:video" | grep "swf" | sed -e "s/^.*content=\"//" -e "s/\".*$//" | sed 's/\r$/\t/' | tr '\n' '\t' &&
 echo "$filecontent" | grep "og:title" | sed -e "s/^.*content=\"//" -e "s/\".*$//" | sed 's/\r$/\t/' | tr '\n' '\t' &&
 echo "$filecontent" | grep "og:image" | sed -e "s/^.*content=\"//" -e "s/\".*$//" | sed 's/\r$/\t/' | tr '\n' '\t' &&
 echo "$filecontent" | grep "og:description" | sed -e "s/^.*content=\"//" -e "s/\".*$//") >> ${name}video-sitemap-data.txt
done < "$1"
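
For anyone who wants to try the extraction step without fetching real pages, here is a minimal sketch of the same grep and sed idea on its own. The helper name og_content and the sample markup are made up for illustration and are not part of the script above. It assumes each meta tag sits on its own line with the property attribute before content, which is also what the script relies on.

```bash
#!/bin/bash
# Hypothetical helper, not part of the original script: print the content
# attribute of each line on stdin that mentions the given og property.
og_content() {
    grep "$1" | sed -e 's/^.*content="//' -e 's/".*$//'
}

# Made-up sample markup, for illustration only.
sample_html='<meta property="og:title" content="Example video" />
<meta property="og:image" content="http://example.com/thumb.jpg" />'

title=$(printf '%s\n' "$sample_html" | og_content "og:title")
thumb=$(printf '%s\n' "$sample_html" | og_content "og:image")

# Group both parts in ( ) so that one redirect captures the whole row.
out=$(mktemp)
( printf '%s\t' "$title" && printf '%s\n' "$thumb" ) >> "$out"
cat "$out"
```

If a page puts several meta tags on one line, or lists content before property, this simple pattern will pick up the wrong text, so it is worth checking a few rows by hand.
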
I'd be delighted to know this helped someone else, so why not drop me a note if it did?

This is one of the pages that I used for testing, just in case someone wants to test this: .
