Thursday, April 3, 2014

Pull data for video sitemap

Video sitemaps can help make search engines aware of the videos on a site. We use several systems to generate pages with videos, so it is not easy to get the information needed for a sitemap from the back end. So, like Google, we have to take as much as possible from the front end. The details of this script are likely specific to Dell.com, and even here I have found that videos in some sections cannot be picked up by it. Still, it has been extremely helpful for finding the 'hidden' details of our video implementations. (And yes, requirements to change these have been in process for a while :-) ).

Elements necessary for a sitemap are:
  1. Page URL
  2. Title
  3. Keywords
  4. Description
  5. Video URL
  6. Thumbnail URL
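The script below pulls most of these from many of our video pages: page URL, video URL, title, thumbnail and description, but not keywords. (We use Open Graph tags, which makes it relatively easy to pull most of the info.) The script needs to be called with the filename of the text list of URLs as its first parameter (bash script.sh listofpages.txt). Run it with bash rather than sourcing it with '.', because the exit on a missing parameter would otherwise close the current shell.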
# Stop if no file with the list of URLs was given
if [[ -z "$1" ]] ; then
  echo "need to call with filename"
  exit 1
fi
# Random prefix so that repeated runs do not overwrite earlier results
file=$RANDOM-sitemap-data.txt
echo "$file"
# Header row, in the same order as the fields written below
echo -e "url\tvideo\tTitle\tthumbnail\tdescription" > "$file"
while read -r line; do
  # Drop a Windows line ending from the URL, then fetch the page to stdout
  line=${line%$'\r'}
  filecontent=$(wget -qO- "$line")
  # One row per page: the URL, then the content value of each Open Graph tag, tab separated
  (echo "$line" | tr '\n' '\t' &&
  echo "$filecontent" | grep "og:video" | grep "swf" | sed -e "s/^.*content=\"//" -e "s/\".*$//" | tr '\n' '\t' &&
  echo "$filecontent" | grep "og:title" | sed -e "s/^.*content=\"//" -e "s/\".*$//" | tr '\n' '\t' &&
  echo "$filecontent" | grep "og:image" | sed -e "s/^.*content=\"//" -e "s/\".*$//" | tr '\n' '\t' &&
  echo "$filecontent" | grep "og:description" | sed -e "s/^.*content=\"//" -e "s/\".*$//") >> "$file"
done < "$1"
# Show the result with tabs (^I) and line ends ($) made visible
cat -A "$file"
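
One weak spot of the script above: a page without a matching tag writes nothing for that field, so the later columns shift, and a page with several matching tags writes several values. A possible variation is sketched here. The helper name og_content and the sample HTML are made up for illustration. It assumes, like the script above, that each meta tag sits on one line with double-quoted attributes. It also matches the exact property name instead of filtering the video URL on "swf".

```shell
#!/usr/bin/env bash
# og_content PROPERTY
# Reads HTML on stdin and prints the content value of the first line that
# mentions PROPERTY followed by a closing quote (so og:video does not match
# og:video:type). Prints nothing when no line matches.
og_content() {
  grep -m 1 "$1\"" \
    | sed -e 's/^.*content="//' -e 's/".*$//' \
    | tr -d '\r\n'
}

# Hypothetical sample page, only to show the call pattern
html='<meta property="og:title" content="Demo video" />
<meta property="og:video" content="http://example.com/player.swf" />
<meta property="og:video:type" content="application/x-shockwave-flash" />'

# printf always writes both fields and the tab, even if one value is empty
printf '%s\t%s\n' \
  "$(printf '%s\n' "$html" | og_content og:title)" \
  "$(printf '%s\n' "$html" | og_content og:video)"
```

Because each field is captured separately and printf writes the separators, every row keeps the same number of columns.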

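To get from the tab-separated file to an actual sitemap, something along these lines might work. It is only a sketch: the function names xml_escape and tsv_to_video_sitemap are made up here, the rows are assumed to hold page URL, video URL, title, thumbnail and description in that order, and the swf URL is assumed to belong in video:player_loc of Google's video sitemap extension.

```shell
#!/usr/bin/env bash
# Escape the characters that are special in XML text (& first, to avoid
# escaping the entities produced for < and >).
xml_escape() {
  printf '%s' "$1" | sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g'
}

# Reads the tab-separated file on stdin and prints a video sitemap on stdout.
tsv_to_video_sitemap() {
  echo '<?xml version="1.0" encoding="UTF-8"?>'
  echo '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"'
  echo '        xmlns:video="http://www.google.com/schemas/sitemap-video/1.1">'
  # Remove carriage returns, skip the header row, then split each row on tabs
  tr -d '\r' | tail -n +2 | while IFS=$'\t' read -r url video title thumb desc; do
    # Consecutive tabs collapse when splitting, so a row with any empty
    # field ends up with an empty last variable; skip such rows.
    [ -n "$desc" ] || continue
    echo "  <url>"
    echo "    <loc>$(xml_escape "$url")</loc>"
    echo "    <video:video>"
    echo "      <video:thumbnail_loc>$(xml_escape "$thumb")</video:thumbnail_loc>"
    echo "      <video:title>$(xml_escape "$title")</video:title>"
    echo "      <video:description>$(xml_escape "$desc")</video:description>"
    echo "      <video:player_loc>$(xml_escape "$video")</video:player_loc>"
    echo "    </video:video>"
    echo "  </url>"
  done
  echo '</urlset>'
}
```

It would be called with the data file on stdin, for example tsv_to_video_sitemap < 12345-sitemap-data.txt > video-sitemap.xml (both file names are placeholders).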

As always: I use this myself, and would love to hear tips to improve it or to see other scripts for site optimization and maintenance.

