Tuesday, December 31, 2013

Reduce pictures with a script for imagemagick

Blogging is fun, but it can take quite some effort. One necessary step is scaling pictures so they fit into the blog: large and sharp enough to show the necessary details, but also as small as possible for fast page loads.

The best results I can generate are with Photoshop, which also has a nice batch option. On Windows, IrfanView is a great tool to automate this very easily in pretty good quality as well. My tool of choice on Linux is ImageMagick. While it has tons of options, the settings below work great for me.

I start this script from the folder with the pictures. It takes one parameter on call: the length wanted for the longer side. So calling it like 'image-resize.sh 800' is the way to go.
It checks whether the output folder (named after the parameter) exists and creates it if not; then it renames all filenames in the start folder to lowercase and renames .jpeg to .jpg, so that all JPEG files are picked up by the ImageMagick loop.
if [[ ! -d $1 ]]
      then mkdir "$1"
fi
rename 'y/A-Z/a-z/' *
rename 's/\.jpeg/\.jpg/' *

for i in *.jpg
do
convert "$i" -resize "${1}^>" -quality 25% -unsharp 1.2x1.2+1+0 "$1"/s_"$i"
done

Then it reduces pictures whose larger side (height or width) is longer than the given value (800 px in the example call) to exactly that value. It maintains the aspect ratio, sharpens, and reduces the JPEG quality to the value set in the script (25% above), which I found to be the sweet spot between quality and image size for many of my pictures. The final step is to prefix the filename with s_ and write the result into the new folder. The most important insight (from a forum) was the 'value^>' setting, which sets the longer side to this value.
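The lowercase and .jpeg renaming step relies on the Perl version of rename. As a minimal sketch, assuming that tool is not installed, the same normalization can be done with tr, sed and mv. The names normalize_name and normalize_folder are made up for illustration and are not part of the original script:

```shell
# Hypothetical helper (not part of the original script): print the
# lower-cased file name with a trailing .jpeg turned into .jpg.
normalize_name() {
    printf '%s\n' "$1" | tr 'A-Z' 'a-z' | sed 's/\.jpeg$/.jpg/'
}

# Rename every regular file in the current folder whose normalized
# name differs from its current name.
normalize_folder() {
    for f in *; do
        [ -f "$f" ] || continue
        new=$(normalize_name "$f")
        # skip names that are already fine, never overwrite an existing file
        if [ "$f" != "$new" ] && [ ! -e "$new" ]; then
            mv -- "$f" "$new"
        fi
    done
}
```

Calling normalize_folder from the picture folder before the convert loop should give the same starting point as the two rename lines.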

Monday, December 23, 2013

Download files on one page with wget - define the type

Something I rarely do - but now I had to: Download a bunch of .ogg files. I am exploring some sound capabilities of my system, and found the system sounds in /usr/share/sounds/.

Well, not very special - so I started searching in Google for 'free sound download filetype:ogg' and similar, and found a few nice sites like www.mediacollege.com.

So, I downloaded 2-3 wavs, and oh, my, that takes time. So, here's my little script:

wget -r -l2 --user-agent Mozilla -nd -A.wav "http://www.mediacollege.com/downloads/sound-effects/people/laugh/" -P download
for file in download/*.wav ; do echo "$file" && paplay "$file"; done

Using wget with -r for recursive, -nd to not rebuild the directory structure (this is 4 levels down), and -P download to save into the subfolder download. -A.wav restricts the download to wav files (-A.wav,.ogg would fetch wav and ogg files), and -l sets the recursion depth: -l1 covers just this page and the files linked from it, while the command above goes two levels deep with -l2. Raising this can lead to huge download times and sizes, so be careful.

Once done, the last line just echoes each filename and then plays it (paplay on my system; if that does not produce any sound, perhaps 'aplay' works).

Tuesday, December 17, 2013

Keyword - check: ranking 100 urls in search engine

Do you ever need to check for a keyword or two which pages rank on Google, and find the results hard to read, and especially cumbersome to copy for further use? At Dell we use large scale tools like seoclarity , and get tons of data in high quality, and I still sometimes have these one-off requests, where I need a small tool, NOW.

So here is a small script for linux bash (and thus it should also run, with slight modifications, on cygwin on windows computers).

I call the script with two parameters, first the search engine, then the search term, as in
# . script.sh www.searchengine.com "searchterm"
Search terms with spaces work, just replace each space with a '+'.
These are used to build the url, which curl then uses to pull the results from Google. Xidel is a small command-line program that makes it super easy to filter content with XPath.

# $1 is query url, $2 is the search term, skipping the check if both are given for shortness

curl -s -A "Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3" -o temp.file "$url"
xidel temp.file -e "//cite" > urls-from-curlgoogle.csv 
rm temp.file 
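
The curl call expects the finished address in $url. A minimal sketch of how the two parameters could be combined, under assumptions: the helper name build_search_url and the '/search?q=' path are my own placeholders and not taken from the original script, so check which path and parameters your search engine expects:

```shell
# Hypothetical helper: $1 = search engine host, $2 = search term.
# The "/search?q=" path is an assumption, adjust it to your engine.
build_search_url() {
    # replace spaces in the search term with '+'
    term=$(printf '%s' "$2" | tr ' ' '+')
    printf 'http://%s/search?q=%s\n' "$1" "$term"
}

url=$(build_search_url "www.searchengine.com" "used laptops")
echo "$url"
```

With this, the space-to-plus replacement no longer has to be done by hand when calling the script.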

Thanks to Benito for Xidel, and thanks to +Norbert Varzariu and +Alexander Skwar for helpful input to fix my variable assignment.

Again, this is not to replace any of the excellent tools out there, free or paid, but to accommodate small tasks, 'quick and dirty' and with some accuracy but not too much handling. And for sure please handle this responsibly, not spamming any search engine and not disregarding any terms of use. I checked when I tried this, and the current ToS seem not to prevent this - but I am no lawyer and might be mistaken. At your own risk.

Thursday, December 12, 2013

"Obamacare": Public shares of Healthcare.gov on Google Plus

Shares to the public on Google plus are named 'ripples', and they show how often and which ways a post has been shared.

A while ago I made a little script to get the number of ripples for a list of urls, and showed how to use it with Alexa top results (here).
Now, while that was entertaining, I wondered if I could learn something from it, and the Obama administration has a bit of a reputation for being tech-savvy.

I checked www.whitehouse.gov - no sitemap. So, next 'popular topic' that came to mind was 'Obamacare', so I pulled the sitemap from www.healthcare.gov (yes, they have one). And then I ran the little script.

At time of testing (Dec 11, 2013) healthcare.gov had 612 urls in the sitemap. Only 4 of these pages have been shared publicly on G+, with a pretty good number on the homepage.

While 800 plus is far from www.google.com and www.facebook.com with 7000+ shares, Texas.gov has only 5 public shares from what I can see, and www.whitehouse.gov has only 138.

PLEASE: focus on comments around SEO, not politics or healthcare. While these topics are more important overall than any seo or Google plus, for this post they are off topic.

Tuesday, December 10, 2013

Bing Webmaster Tools - check for old verification files

Sometimes it is necessary to check if domains have the right verification files in the root folder: for an agency for a multitude of sites, for a company for several domains and subdomains.

People move on, and the webmaster or seo agency from 3 years ago is not necessarily a partner any more. And there still might be old verification files in the root folder, allowing access to data in Bing Webmaster tools (BWT)  for people who should not have access to this any more. 

And this is not just about getting access to perhaps very deep knowledge, but also about the possibility of changing settings and causing immediate monetary damage. One possible example would be a false redirect or sitelinks or target country for a site - major damage possible!

So we need to check that only the right meta tags and verification files are in place. These IDs are very long, and comparing them by hand is error-prone and might take quite a while. Copying and pasting into a spreadsheet works fine, but can take a while depending on the number of domains to check.

Here I just show how to check for verification files. They are called BingSiteAuth.xml and mainly consist of an ID with a little code around it.
  1. Download the current file from BWT, and check this id.
  2. Next step, prepare a url list in a text file, for all domains / subdomains you want to check.
  3. Run this little authentication file checker script:
if [[ ! $1 ]] ; then
echo -e '\n\tplease call this file with a list of urls\n'
exit 1
fi

while read -r line; do
# build the url of the verification file, assuming each line holds a site root without trailing slash
url="${line}/BingSiteAuth.xml"
site_id=$(wget -S -qO- "$url" 2>/dev/null | grep "<user>" | tr "\n" "\t"  | sed -e "s/^.*<user>//g" -e "s/<\/user.*$//g" )
# set default value for site_id, then check if gathered value matches the given value (as mentioned in intro text)
if [[ $site_id = "" ]] ; then site_id="0" ; fi
if [ "${site_id}" == "78F80DA184A74A413....and-so-on...45671" ] ;
then match="match"
else match="\E[34;47m no match"
fi
# get all in one line, then reset formatting
echo -e  ${line} "\t\tsite-id" ${site_id} ${match} `tput sgr0`

# alternate ending: write into a file for further use
#if [ "${site_id}" == "78F80DA184A74A4137F56098D9D45671" ] ;
# then match="match"
# else match="no match"
#fi
#echo -e  ${line} "\t\tsite-id" ${site_id} ${match} >> BingSite-check.txt
done < "$1"
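
The ID extraction can be tried out without touching any live site. A small sketch, assuming the file wraps the ID in a user element (which is what the sed expression in the script implies); extract_site_id is a hypothetical name and ABC123 a made-up sample ID:

```shell
# Hypothetical helper: read a BingSiteAuth.xml document on stdin and
# print the ID between <user> and </user>, or 0 if there is none.
extract_site_id() {
    # join all lines first so the tags may sit on separate lines
    id=$(tr -d '\r\n' | sed -n 's/.*<user>\([^<]*\)<\/user>.*/\1/p')
    printf '%s\n' "${id:-0}"
}

printf '<?xml version="1.0"?>\n<users>\n<user>ABC123</user>\n</users>\n' | extract_site_id
```

In the loop this could replace the grep / tr / sed chain, as in site_id=$(wget -qO- "$url" 2>/dev/null | extract_site_id).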

Thursday, November 14, 2013

Software for SEO - standard tools for standard usage

This is a list of less specialized tools I use when working on SEO - here I show the standard tools for management of site seo, that can also be used for many other aspects of site maintenance, content development, digital strategy.

A big part of my job is the management of seo processes: writing business cases, participating in building a BRD or coming up with requirements, defining and establishing processes, and convincing stakeholders to fund or update content or to join projects.

I spend more hours in meetings and preparing them than in any other part of my role, and this shows in the list of tools I use. Analytics are another joyful part of my role - mainly combining data from various sources, and social media engagement.

At the absolute top are Microsoft PowerPoint, Outlook and Excel. Take away anything but these and I'll do ok.
I tried to sort this by importance, but that varies so much from project to project and from time to time that it can only be a rough estimate:

  • Onenote
  • Lync
  • Sharepoint
  • Chatter 
  • Freemind
  • Notepad ++
  • Photoshop elements (Adobe)
  • MS Project 
  • Cutepdf
  • Gnu-utils (cygwin)
  • Rssowl
This list does not include specific seo tools, especially for the linkbuilding and technical aspects, nor does it cover online-only or mainly online tools, or social media tools (I am working on these lists).

Anything I miss or should take a close look at? Which other tool could increase my productivity?

Wednesday, November 6, 2013

Alexa Top 20 internet websites and Google Plus public ripples

Alexa publishes a list (under the ad on the right rail) for download with the top 1 million websites based on estimated traffic. Google, Facebook, Apple, Twitter, are just a few of the well known brands in the top of this list.

I wanted to check how this list compares to the ripples (A ripple is a public share of a page via Google Plus.) companies get on their homepages, and used my little script to do exactly that  for the top 1000 companies' homepages (only homepage!):

Top 20 homepages sorted by traffic according to Alexa.com. I ran the script that checks ripples for the urls in a text file and got this list. Then I added the numbers for rank by traffic and rank by number of ripples, and compared these. All of the top 1000 urls have the same rank by traffic and by ripples.

rank traffic  url homepage       ripples  rank by ripples  rank unmatch
1             facebook.com       5793     1                TRUE
2             google.com         5560     2                TRUE
3             apple.com          5109     3                TRUE
4             twitter.com        4826     4                TRUE
5             youtube.com        4629     5                TRUE
6             delicious.com      2507     6                TRUE
7             twoo.com           2362     7                TRUE
8             blogger.com        2289     8                TRUE
9             reddit.com         2066     9                TRUE
10            flickr.com         2058     10               TRUE
11            amazon.com         1895     11               TRUE
12            archive.org        1713     12               TRUE
13            google.com.hk      1622     13               TRUE
14            500px.com          1582     14               TRUE
15            mashable.com       1514     15               TRUE
16            stackoverflow.com  1440     16               TRUE
17            w3schools.com      1396     17               TRUE
18            9gag.com           1392     18               TRUE
19            scribd.com         1338     19               TRUE
20            goo.gl             1330     20               TRUE

Several were a surprise: I would have expected more retailers on this list, and Wikipedia. I was wrong. Twoo was a big surprise, and I found a lot of bad reputation when searching for them.

The biggest surprise of all for me is that people share homepages of all these on G+, though, except for the shortener goo.gl.

You can download the list of all top 1000 homepage ripples numbers here.

Thursday, October 31, 2013

Check the number of Google Plus ripples of the urls in your sitemap

Want to see how many ripples a page has? The easiest way to check all your pages is to use your sitemap.xml. This only works for public shares or public ripples, as Google does not show any number or information about non-public shares, and it only shows ripples for 'regular' pages, not for Google Plus posts - but who has those in their sitemap.xml anyway.

This little tool has three files. Below are the scripts to take a look at; behind the links are the source files for linux bash:

  1. The script to pull the urls from the sitemap
  2. the script to get the ripples for each url and store url and number of public ripples
  3. the script combining both.
I made this into three scripts because I use sitemaps for several things, and with these modules it is easier to reuse parts - like the script to pull urls from an xml sitemap.

1. Wget urls from sitemap and clean up to keep only urls:

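if [[ ! $1 ]]; then echo 'call with parameter of url for file'
exit 1
fi
# file for the url list; the name is just an example, adjust as needed
filename="sitemap-urls.txt"
wget -qO- "${1}"  | grep "loc" | sed -e 's/^.*<loc>//g' -e 's/<\/loc>.*$//g' > ${filename}
#echo $filename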
2. Loop through the url list, then load the page showing ripples. Grep the right line, isolate the part with the number, and then store url and number in a csv. 
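while read -r line ; do
# set pull to the url of the Google Plus ripples page for ${line} before this call
number=$(wget -qO- "${pull}" | grep -o "[0-9]*\s*public\s*shares.<" | sed "s/[^0-9]//g"  | tr "\n" "\t" | sed 's/\thttp/\nhttp/g')
# keep the number if one was found, otherwise store zero
re='^[0-9]+'
if [[ $number =~ $re ]]; then value=${BASH_REMATCH[0]} ; else value=0 ; fi
echo -e "$line\t$value" >> ${1}.csv
done < ${1}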
3. For easy work, use this script to call above scripts for getting ripples for all urls in your sitemap in the right order. One command, all done. 

if [[ ! $1 ]]; then
echo 'need input xml file'
exit 1
fi
source get-urls-from-sitemap $1
wait ${!}
source loop-through-sitemap ${filename}
wait ${!}
cat ${filename}.csv

It might not be the easiest way to do this, but it works just fine. Please feel free to suggest improvements. I tested this only on Ubuntu 12.04.

Wednesday, October 16, 2013

Check URLs for ripples

How many ripples does a page have? Thanks to +AJ Kohn we have a little browser snippet showing this for each individual page.

How about a list of pages?

See number of ripples for all urls in a textfile

You can call this script in bash, adjust input file (urls.txt) and output file and location:
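while read -r line ; do
# set pull to the url of the Google Plus ripples page for ${line} before this call
number=$(wget -qO- "${pull}" | grep -o "[0-9]*\s*public\s*shares.<" | sed "s/[^0-9]//g"  | tr "\n" "\t" | sed 's/\thttp/\nhttp/g')
# keep the number if one was found, otherwise store zero
re='^[0-9]+'
if [[ $number =~ $re ]]; then value=${BASH_REMATCH[0]} ; else value=0 ; fi
echo -e "$line\t$value" >> ~/Desktop/spider-public-shares.txt
done < urls.txt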

The script loops through urls.txt. For each url it fetches the ripples page with wget, isolates the number of ripples and stores it in the variable $number. If that value is a number, it is used; if not, zero is stored in the second variable $value. Then it echoes the url ($line), a tab and the value for that url into the output file. This works only for ripples of regular pages, not for Google Plus posts.
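
The number extraction can be tested on its own, without fetching anything. A minimal sketch, assuming the page contains text like '42 public shares' as the grep in the script expects; extract_shares is a hypothetical name:

```shell
# Hypothetical helper: read the HTML of a ripples page on stdin and print
# the number in front of "public shares", or 0 when nothing is found.
extract_shares() {
    n=$(grep -o '[0-9][0-9]* *public *shares' | head -n 1 | sed 's/[^0-9]//g')
    printf '%s\n' "${n:-0}"
}

printf '<span>42 public shares.</span>\n' | extract_shares
```

Falling back to 0 inside the helper keeps every line of the output file at exactly two columns, which makes the import into a spreadsheet easier.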

Thursday, October 10, 2013

Scan site or folder to generate a sitemap with wget

After having searched for a solution to build a sitemap easily, quickly, and without any server-side installation or software - and getting no result - I came up with these small scripts. They worked fine for me when I was scanning www.dell.com/support/my-support/ie/en/04/ and some other areas of the same site.

  1. Use wget to spider the site, recursive, no parent, unlimited in depth = the whole section, but nothing more
  2. Ignore redirects - they are not good in a sitemap, and it also makes the resulting file easier to work with
  3. Grep for all lines with 'http://' to get the url, and then the line with the http response, i.e. 301, 302, 404, 200
  4. append to file

wget --spider -l inf --max-redirect 0 -np -nc www.dell.com/support/my-support/ie/en/iedhs1/ 2>&1 | grep -e 'http:\/\/' -e 'HTTP request sent' >> ms-ie-sitemap-raw.txt

Using the spider has the disadvantage that each piece of output is on its own line, so each url has one line and the status has another line. That's why it is relevant to use '--max-redirect 0' to not show the list of redirects, but only the status. This way each entry consists of exactly two lines, which comes in handy when making this usable for a sitemap.

Now this just needs some cleaning up:

Merge two lines into one (one has the url, one has the http status), replacing the newline with a tab:
sed -i 'N;s/\nHTTP/\tHTTP/g' ms-ie-sitemap-raw.txt
Remove all lines which have a different http status than 200:
sed -i '/200/!d' ms-ie-sitemap-raw.txt
Remove everything except the url (first from start to url, then from url to end):
sed -i 's/^.*http/http/g' ms-ie-sitemap-raw.txt
sed -i 's/\tHTTP.*$//g' ms-ie-sitemap-raw.txt
Finally sort and remove duplicates:
sort ms-ie-sitemap-raw.txt | uniq > final-file.txt
I ended up with ~ 8000 unique urls!

Not sure if I could use -e to add these all to each other, but I guess they need to be piped, because they are not alternatives (match any) but need to be done one after another. True?
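
On the -e question: sed runs several -e expressions on each line in the order given, so they are not alternatives and can be chained in one call. A minimal sketch, assuming GNU sed and the two-lines-per-url log format described above; the function name clean_spider_log and the stricter '... 200 ' match are my own choices, not part of the original commands:

```shell
# Hypothetical helper: read the raw spider log on stdin and print the
# unique urls that answered with status 200.
clean_spider_log() {
    # 1. join each url line with the status line that follows it
    # 2. drop pairs whose status is not 200
    # 3. cut everything in front of the url
    # 4. cut the status part behind the url
    sed -e 'N;s/\nHTTP/\tHTTP/' \
        -e '/\.\.\. 200 /!d' \
        -e 's/^.*http/http/' \
        -e 's/\tHTTP.*$//' | sort -u
}
```

Usage would be something like clean_spider_log < ms-ie-sitemap-raw.txt > final-file.txt, which also leaves the raw file untouched for a second look.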

Monday, September 30, 2013

Download urls from sitemap into textfile

Sitemaps again - they are still very helpful, especially for a large site. For several processes the urls are necessary - and going back to the sourcefile is not always possible or practical.

So, here is a small bash script to scan a given sitemap and store just the urls in a text file. The input parameter is the full url of the sitemap. Let's name this sitemap-urls.sh; then the call would be

# bash sitemap-urls.sh http://www.dell.com/wwwsupport-us-sitemap.xml
if [[ ! $1 ]]; then echo 'call script with parameter of url for file'
exit 1
fi
wget -qO- "${1}"  | grep "loc" | sed -e 's/^.*<loc>//g' -e 's/<\/loc>.*$//g' > sitemap-scan-output.txt
First the script checks whether it was called with the sitemap url as parameter ($1); if not, it exits with a message. If the parameter is set, it downloads the page without saving it, greps for the right lines, and uses sed to replace the irrelevant parts, meaning the xml tags, with nothing.

Not fancy, but still good to have. I will use this to check for a few interesting things in next posts, and this is really helpful also if the urls are needed for import in analytics tools and Excel.
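
The grep and sed combination above works line by line, so it expects one loc entry per line. A hedged variant for sitemaps that come minified on a single line, using grep -o to print every match separately; extract_locs is a hypothetical name and the example.com urls are made up:

```shell
# Hypothetical helper: read sitemap xml on stdin and print one url per
# line, even if several <loc> entries share a single line.
extract_locs() {
    grep -o '<loc>[^<]*</loc>' | sed -e 's/<loc>//' -e 's/<\/loc>//'
}

printf '<urlset><url><loc>http://www.example.com/a</loc></url><url><loc>http://www.example.com/b</loc></url></urlset>\n' | extract_locs
```

In the script this would be used as wget -qO- "${1}" | extract_locs > sitemap-scan-output.txt.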

Monday, September 23, 2013

Check url for http response codes with curl

A little linux helper to check the status of urls in a sitemap, based on the server response code.

Currently redirects are said to be not good for Bing ranking, neutral for Google. We want to rank in both, so we don't want 300s, and sure no 400s or 500s - the error response codes.

For this example from work I use curl, easy and fast, and  "support.dell.com".

curl -D - support.dell.com 

-D - means dump the headers into a file, with - meaning stdout.
LONG result, but the direction is correct.
Now add -o /dev/null, which sends the page content to /dev/null.

curl -D - support.dell.com -o /dev/null
Still too long, but getting closer.
So I'll use sed to print just the header response, based on the regex HTTP:
curl -D - support.dell.com -o /dev/null | sed -n '/HTTP/p'

STILL not there. Adding -s to curl to silence the speed-o-meter gives me:

curl -s -D - support.dell.com -o /dev/null | sed -n '/HTTP/p'

results in: HTTP/1.1 200 OK

Got it!
It sure has limitations, this is not going to help identify server level rewrites or reverse proxy redirects without intermediate non 200 http response, nor is it going to identify a http-refresh. I still find it pretty helpful. The first is still good to submit to the Search Engines, and the second is rare, fortunately, at least where I work.

This again is patched together from a variety of sources, including stackoverflow, a sed post by Eric Pement and little bits from Andrew Cowie (yep, that's about apis, but still helped): thanks everyone!

Thanks Andy for the great suggestion in the comments to add -L to follow redirects! I would then extend the sed to get this:
curl -s -L -D - www.dell.com/support/ -o /dev/null | sed -n '/HTTP\|Location/p'
This follows redirects, and with the extended sed we see the url and the http response like this:

HTTP/1.1 302 Found Location: http://www.dell.com/support/home/us/en/19?c=us&l=en&s=dhs
HTTP/1.1 301 Moved Permanently Location: /support/my-support
HTTP/1.1 301 Moved Permanently Location: /support/my-support/us/en/19
HTTP/1.1 200 OK
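
To run this check for a whole list of urls, for example the ones pulled from a sitemap, curl can print just the status code with -w '%{http_code}'. A rough sketch; classify_status and check_urls are hypothetical names, and the ok / redirect / error grouping is simply my reading of the 200s, 300s and 400s / 500s mentioned above:

```shell
# Hypothetical helper: map an HTTP status code to a rough verdict.
classify_status() {
    case "$1" in
        2??) echo "ok" ;;
        3??) echo "redirect" ;;
        4??|5??) echo "error" ;;
        *) echo "unknown" ;;
    esac
}

# Read urls from stdin, print "url<TAB>code<TAB>verdict" per line.
# Without -L curl reports the first response, so redirects stay visible.
check_urls() {
    while read -r url; do
        [ -n "$url" ] || continue
        code=$(curl -s -o /dev/null -w '%{http_code}' "$url")
        printf '%s\t%s\t%s\n' "$url" "$code" "$(classify_status "$code")"
    done
}
```

Called as check_urls < sitemap-scan-output.txt, this gives a tab separated list that is easy to filter for anything that is not 'ok'.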

Wednesday, September 18, 2013

Blogspot domains and ranking

Some blogs on blogger / blogspot appear with several top level domains, as shown last week.
What seems not to work though, is ranking with these urls.
I found some blogs which have a few results show up in SERP, but even German blogs had more results with the .com domain than with the .de domain, although they showed up with a de domain when I searched for a generic blog.

I guess to better understand why this happens, it would be necessary to check what was the start domain sites used, and what are signals that could trigger these results.

Any ideas?

Tuesday, September 10, 2013

Tools for Technical SEO - little helpers

While analyzing performance and natural search performance, lots of tools can be used. We use seoclarity, majesticseo, adobe analytics, moz,  and many more enterprise metrics tools at Dell. 

Still, as you can see on this blog, not everything can be done comfortably with these: sometimes the setup is too cumbersome or slow or costly, sometimes it would require too many tweaks, and sometimes it's just not possible to get what we need.

So, here is the list of little helpers, the tools I mainly use to analyse things for technical aspects of seo.
  •  Httpfox, Fiddler2 – pageload, coding, caching, errors, http response codes
  • Screaming frog – elements on page (title, redirects, meta description, keywords etc.)
  • Source code view in chrome, IE, FF for title, meta, header elements, check if copy is in source code
  • Developer tools in chrome ( Ctrl + shift + J ) –  for speed (new) , source code, various items
    • Do NOT use chrome developer tools to check if content is in source code (not accurate) (also not usable for this particular aspect - right click ‘inspect element’)
  • Accessibility checkers – required under some federal law regulations, likely will hear a lot more about this as a certified fed vendor. BUT it is also great to check for SEO, if accessible = 95% accessible for SEO spiders
  • And then ad hoc tools we use rarely (and many times research to solve a particular task)
  • ADDE, which is an Accenture site scan tool - a large scale tool offering similar insight like screaming frog, xenu etc, moz, just running on a bunch of internal servers allowing in depth access and analysis (entry added Nov 13, '13)
So, now it's up to you: which tools for tech analysis are we missing, and what can they help with?

Monday, September 2, 2013

Blogspot domains

Blogspot urls: Top level domain matters?

It seems that blogspot posts are available under several top level domains.
One of my posts has recently been shared internally at Dell in a larger newsletter, nothing special, but of course I checked whether the link works.

This is the link:

Stop! That's not right, my blog runs at http://andreas-wpv.blogspot.com.

So I tried and switched a few times back and forth, and the post is shown perfectly fine under both domains, as long as I keep the other parts unchanged.

works just as well as

Other countries work also like

  • de
  • co.uk
  • be
  • fr
while these don't:

  • cn
  • br
  • es

It also works for other blogs: as long as they use blogspot domains and not custom domains, this seems to work for everyone.

Monday, August 26, 2013

Better: Pull urls from site for sitemap with wGet

The other bash wget script works just fine, BUT I found it had one main flaw. Every time I ran it for another site I would either reuse the same filename for the file with the urls, and this way overwrite the older version, or I would have to change the filename in the script. So I changed the script.
  1. Now I can call the script with the filename of the url-list as startup parameter. 
  2. It also checks if it gets that parameter, if not, mentions that. 
  3. Finally I use the input filename as part of the output filename, so no overwriting there either. 

#!/bin/bash
if [[ ! $1 ]] ;
then echo "need to call this with the file name with url list as argument"
exit 1
fi
# counter so that every url gets its own log file
i=0
while read -r line; do
i=$((i+1))
wget --spider -b --recursive --no-verbose --no-parent -t 3 -4 --save-headers --output-file=wgetlog-$1-$i.txt $line
done < $1
Slowly, but getting there :-)

Monday, August 12, 2013

Clean up WGET results for sitemap

After running the script to get urls with wget, I now have to clean the results to get just plain urls. The urls need to be on the right domain / folder and need to have had a 200 OK http response. So, now there are a bunch of text files with urls in them, but not just urls, a lot more:

2013-07-16 21:39:00 URL:http://www.dell.com/ [21149] -> "www.dell.com/index.html" [1]
2013-07-16 21:39:00 URL:http://www.dell.com/robots.txt [112] -> "www.dell.com/robots.txt" [1]
2013-07-16 21:39:01 URL: http://www.dell.com/_/rsrc/1373543058000/system/app/css/overlay.css?cb=slate712px150goog-ws-left 200 OK

Now, that's all good to know, but not really usable for building a sitemap. Being used to working on and fixing a lot of things in Excel, that was my first try - but I quickly quit. The only way to come close to cleaning up was changing table to text, and even that did not clean up everything - and took too long, too.

#!/bin/bash
#loop through the files
for i in {1..30}; do
#pull only the lines with a URL, then only those that match my domain
grep 'URL' wgetlog$i.txt | grep -i 'dell.com' > wget$i.txt
# delete beginning date including url, then delete ending part from space on, including file size numbers,
# then remove spaces, then remove lines not matching dell.com
sed -e 's/^[0-9 -:]*URL://g' -e 's/\ \[.*$//g' -e 's/\ //g' -e '/dell\.com/!d' wget$i.txt > support-url-$i.txt
done
Not that pretty, but it works fine. I just need to adjust the file name to what I use in the wget script and set the number to the number of files used.

This is great for any site using just a couple of thousand pages and a few sitemap updates per year.
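
The cleanup can also be written as a filter that reads the log lines from stdin, which makes it easy to try on a few sample lines first. A minimal sketch, assuming log lines in the 'URL:http://... [size]' form shown above; clean_wget_log is a hypothetical name, and the domain is passed as a parameter instead of being hard-coded:

```shell
# Hypothetical helper: read wget log lines on stdin, keep the lines for
# the domain given as $1 and print the unique plain urls.
clean_wget_log() {
    grep 'URL:' | grep -i -F "$1" |
        sed -e 's/^.*URL: *//' -e 's/ .*$//' | sort -u
}

printf '%s\n' \
    '2013-07-16 21:39:00 URL:http://www.dell.com/ [21149] -> "www.dell.com/index.html" [1]' \
    '2013-07-16 21:39:00 URL:http://www.dell.com/robots.txt [112] -> "www.dell.com/robots.txt" [1]' |
    clean_wget_log dell.com
```

For all log files at once, something like cat wgetlog*.txt | clean_wget_log dell.com > support-urls.txt would skip the numbered loop.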

Thursday, August 1, 2013

How to optimize your Google Plus post for high CTR in notifications

Google Plus brought in top designers to give the tech company an appealing design, and here is one example of what came out of it: notifications.

Here is a picture of some random notifications I got when writing this post; let's take a closer look.

G+ notifications
What does each tile show me? 

1. Face or logo
That's great, if I know the person or company well enough. Favoring big brands - personal and companies.

2. Name and 'Shared a post' 
Shared a post once on the flap might be good, but on each tile?

3. Copy
Looks like it is showing the first ~ 45 to 50 characters of a post or if no copy is in the post it shows where it is coming from.

Since these are the notifications that alert me to new posts, the picture / logo can be used for branding, and the first 50 characters can be used to convert to a click through.
This seems to be fairly similar to the browser title for pages - 'frontload' your post with a relevant keyword and call to action, to capture attention and get a click through. 

Apart from this optimization aspect, I hope that Google will integrate more info into each box - for example swapping it in for the superfluous 'shared a post'. It would be great to see any of these or a combination:

  • Topic category
  • one of the new wonderful auto-hashtags
  • a bit more copy 
  • the number of +1 or comments on the post.
Each of these would help me decide which of the hundreds of posts are worth checking out. Now that would be great design: to have it in there and not look cramped.

Tuesday, July 23, 2013

Google Plus links count as backlinks

There is an ongoing discussion on Social Media and SEO. I am currently promoting the synergies of these at Dell, so no wonder I have some insights.

One of the common questions is:

Does Social media have an influence on natural search rankings - Answer: Yes

And there are several ways I can 'prove' that. So the question many still ask - correlation or causation, can be answered: Both!

First things first, let me show you one screen which proves the connection. Do you use Google webmaster tools? For SEO folks, that's a standard. And in that tool are backlink reports. As backlinks are considered to have one of the strongest influences on rankings, it is a standard for seo to look at these.

This is how you get there:
On the following page, select 'download latest' links from others to your site.
And then search in the results for plus.google.com as the referring url:

Voila! Clear proof that Google sees these links just as regular backlinks and tracks them in GWT.
Like with many other links, it is not possible to see HOW MUCH influence one link has - and it is not for lack of trying - but I would consider this enough  of a proof that it does count for search engine results page rankings.

Google has shown these for roughly a year, I would say. My hope is that Google easily and quickly identifies Google Plus link spammers and discredits their links, but I doubt they are there already.

As shown in my profile - I work for Dell, and we have a rather large site with a correspondingly large number of backlinks from Google Plus and many other sites.

Would you count this as proof that social influences search rankings?

Monday, July 8, 2013

Pull urls from site for sitemap with wGet

Working for a large company we can use a lot of different tools to do our job. One thing I wanted to do is to build a sitemap for a site where the content management system does not provide this feature.

So, I started to check various tools, like Screaming Frog and sitebuilder. Xenu was not reliable last time I tried, and these two tools did not work as I wished either, as the site is relatively large. And while Screaming Frog is great and fast, it slows down very much after a few thousand urls.

Using linux at home, I quickly started my first trials with cURL and wget. Curl was ruled out quickly, so focusing on wget I tried a few things.

First, I just started with the root url, and then waited:

wget --spider --recursive --no-verbose --no-parent -t 3 -4 --save-headers --output-file=wgetlog.txt &

--spider for only getting the urls, --recursive with --no-parent for the whole directory but nothing above, -t 3 for three tries to download a url, and --output-file for sending the urls to the logfile.
Slowly but surely the list kept building. I added -4 after some research, as forcing IPv4 requests was said to help speed things up.

Still very slow, so I tried to run this with xargs:
xargs -n 1 -P 10 wget --spider --recursive --no-verbose --no-parent -t 2 -4 --save-headers --output-file=wgetlog.txt < url-list.txt &

I did not really see an improvement - just going by my plain 'feeling' of time - but it was definitely still too slow to go through 10,000+ urls in a day.

After some research I came up with this solution, and it seems to work well enough:
I split the site into several sections, and then gathered the top ~ 10 urls in a text file, which I used as input for a loop in a bash script (the # echo lines I use for testing the scripts; I am a pretty bloody beginner and this helps):
#!/bin/bash
# counter so that every url gets its own log file
i=0
while read -r line; do
i=$((i+1))
#echo $line
wget --spider --recursive --no-verbose --no-parent -t 3 -4 --save-headers --output-file=wgetlog$i.txt $line
#echo wgetlog$i.txt
done < urls.txt
In the wget line, $line stands for the url handed to wget; it takes variables. Works well. I get a bunch of wgetlog files with different names with all the urls, and it sure seemed faster than xargs, although I read that xargs is better at distributing load.
