andreas.wpv: 2017

Thursday, December 14, 2017

Google A/B Testing tool on homepages

Are sites using Google's A/B testing tool?

After running the 'A/B or multivariate testing' check for 'mbox' on a few thousand homepages, here is the same test for Google's testing tool (with the same list of spam domains).

According to Google's documentation, a page that is part of a test, whether variant or control, typically contains the string 'Google Experiment'.

Only 4 spam sites use Google A/B testing, quite a small number. The Google tool seems much less used than mboxes - at least on homepages, and according to this very limited sample of domains.

All 4 domains return the same result for all user agents, like the other domains tested earlier. Not really a big surprise that sites using Google's A/B tool comply with Google's SEO requirements regarding cloaking.
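A minimal sketch of how such a check could look, assuming the marker string is 'Google Experiment' as mentioned above. The function names, the timeout and the user agent strings are made up for illustration and are not the original script.

```shell
#!/bin/bash
# Count the lines of HTML (read from stdin) that mention the marker string.
# 'Google Experiment' is the marker named in the post; pass another as $1.
count_marker() {
    grep -i -c "${1:-Google Experiment}" || true   # grep exits 1 on 0 matches
}

# Fetch one url with a given user agent and print url, agent and match count.
# check_url is a hypothetical helper name.
check_url() {
    local url=$1 agent=$2
    printf '%s\t%s\t%s\n' "$url" "$agent" \
        "$(curl -L -s -m 20 -A "$agent" "$url" | count_marker)"
}

# Example (needs network access):
# check_url "http://www.example.com" "Mozilla/5.0"
# check_url "http://www.example.com" "Googlebot/2.1 (+http://www.google.com/bot.html)"
```

Running check_url with several user agents per domain and comparing the counts is one way to see if a site serves the same page to everyone.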

Tuesday, December 12, 2017

Script timer: run vs pause script execution during certain hours

This quickly became a standard part of my scripts, especially when checking larger lists of sites over days and days, like checking the Alexa 1 million for use of 'schema' on the homepage.

Running these scripts from home, there are good times to run a larger amount of traffic over the shared internet connection: at night, usually.

This script snippet has three elements:

a variable that lists the hours during which the script may run
a variable set to the current hour
a regex match that checks if the current hour is in that list: if yes, run, if no, pause

To use it, I edit the hours, start the script in a byobu session on the server, press F6 to detach and log out, then check in a few days later and collect the data.

# hours in which the script may run (date +%H gives 00 to 23)
intime=' 00 01 02 03 04 05 06 07 22 23 '

while read -r line ; do
    hour=$(date +%H)
    # pause until the current hour is in the list, so that no line is skipped
    until [[ $intime =~ " $hour " ]] ; do
        echo "sleeping and time is $(date +%H:%M)"
        sleep 10m
        hour=$(date +%H)
    done
    echo "running now $(date +%H:%M)"

    # script to execute, for example: curl "$line"

done
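The timer can also be wrapped into two small functions, so it drops into any script with one line. This is only a sketch: the function names are made up, and it uses a plain pattern match instead of the regex match.

```shell
#!/bin/bash
# Hours (two digits, space separated) during which the job may run.
intime=' 00 01 02 03 04 05 06 07 22 23 '

# Succeed if the given hour (default: the current hour) is an allowed hour.
in_allowed_hours() {
    local hour=${1:-$(date +%H)}
    case "$intime" in
        *" $hour "*) return 0 ;;
        *)           return 1 ;;
    esac
}

# Block until an allowed hour is reached, checking every 10 minutes.
wait_for_allowed_hours() {
    until in_allowed_hours; do
        echo "sleeping and time is $(date +%H:%M)"
        sleep 10m
    done
}

# Usage inside a crawl loop (urls.txt is a placeholder file name):
# while read -r line; do
#     wait_for_allowed_hours
#     curl -s "$line"
# done < urls.txt
```

The spaces around each hour in intime matter: they keep a value like 2 from matching inside 22.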

Tuesday, November 28, 2017

1/3 of top sites uses schema on homepage

How many of the top sites on the web use: Jquery, Schema tags, Google's "nositesearchbox" to exclude the site from sitesearches on Google?

34039 urls tested from the 'top internet sites homepages' list, merging Majestic, Alexa, Statvoo, OpenDNS and Quantcast top million sites. Checking only the http://www homepage of all domains in this list.

24407 - pages have jquery in the source code (72%)
12187 - are using schema in one form or another (36%)
114 - sites had a 'nositesearchbox' (0.3%)

The share of top-list sites using schema and the number of sites with 'nositesearchbox' were the items I was particularly interested in; the jquery info is a nice added bonus.

----------------

The script reads urls from a file, stores each response in a variable, tests if any of the three given terms appears in it, counts the matches and lists them in a file:
(only parts shown)

while read -r line; do
    # 111 marks "no response", so empty pages are not recorded as 0 matches
    acount=111; bcount=111; ccount=111
    feedback=$(curl -L -s -m "$time_out" -b cookies -c cookies -A "$agent" "$line")
    if [[ $feedback ]] ; then
        # number of lines matching each search term, ignoring case
        acount=$(echo "$feedback" | grep -i -c "$3")
        bcount=$(echo "$feedback" | grep -i -c "$4")
        ccount=$(echo "$feedback" | grep -i -c "$5")
    fi
    # count the url once per term if the term was found at least once
    [[ $acount -gt 0 ]] && [[ $acount -ne 111 ]] && acounter=$(( acounter + 1 ))
    [[ $bcount -gt 0 ]] && [[ $bcount -ne 111 ]] && bcounter=$(( bcounter + 1 ))
    [[ $ccount -gt 0 ]] && [[ $ccount -ne 111 ]] && ccounter=$(( ccounter + 1 ))
    echo -e "$line\t$acount\t$bcount\t$ccount" | tee -a "$outfile"
done
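The totals and percentages can also be computed afterwards from the output file. A rough sketch, assuming the file has the four tab-separated columns written by the echo line above; the function name and the labels term1 to term3 are made up.

```shell
#!/bin/bash
# Summarize a result file with lines "url<TAB>acount<TAB>bcount<TAB>ccount".
# A count of 111 is the "no response" marker and is not counted as a hit.
# Percentages are relative to all urls in the file, responding or not.
summarize() {
    awk -F'\t' '
        { total++ }
        $2 > 0 && $2 != 111 { a++ }
        $3 > 0 && $3 != 111 { b++ }
        $4 > 0 && $4 != 111 { c++ }
        END {
            if (total == 0) { print "no rows"; exit }
            printf "urls\t%d\n", total
            printf "term1\t%d\t%.1f%%\n", a, 100 * a / total
            printf "term2\t%d\t%.1f%%\n", b, 100 * b / total
            printf "term3\t%d\t%.1f%%\n", c, 100 * c / total
        }' "$1"
}

# Example: summarize "$outfile"
```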

Monday, November 20, 2017

URLs on all 'top million sites' lists

Alexa 1 million, Statvoo 1 million, OpenDNS 1 million, Majestic 1 million, Quantcast 1 million:
All "top million websites" lists have slightly different formats, and all rank many domains by amount of traffic - just how the domains are selected varies.
I filtered them for a list of unique urls, added http://www at the beginning, and checked if this gives a 200 OK.
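Roughly, the filtering could look like the sketch below. It assumes each list has already been reduced to one domain per line; the function names and the timeout are made up, and this is not the exact script used.

```shell
#!/bin/bash
# Print the entries that appear in every one of the given files.
# Each file is expected to hold one domain per line (an assumption; the
# rank columns of the original lists need to be stripped first).
on_all_lists() {
    local n=$# f
    for f in "$@"; do
        sort -u "$f"        # de-duplicate within each list first
    done | sort | uniq -c | awk -v n="$n" '$1 == n { print $2 }'
}

# Succeed if the url answers with 200 OK (needs network access).
# is_ok is a hypothetical helper name.
is_ok() {
    [ "$(curl -s -o /dev/null -m 20 -w '%{http_code}' "$1")" = "200" ]
}

# Example (file names are placeholders):
# on_all_lists alexa.txt statvoo.txt opendns.txt majestic.txt quantcast.txt |
#     while read -r domain; do
#         is_ok "http://www.$domain" && echo "http://www.$domain"
#     done
```

Since every list is de-duplicated first, a domain with a count equal to the number of files must be on every list.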

Starting with over 4 million urls, only 34,000 are on all lists (when checked as above):

Here's the list for download, no warranty, no promises, absolutely at your own risk. Re-running this might yield different results due to changes in the original lists, timeouts, etc.

I'll use this list for a while to run a bunch more queries.


Tuesday, August 29, 2017

Backup Android to linux filesystem

Not sure if I like my Android. Phone is great, Android is great - but not great as well.
(too many walls: why should I not be root? why can I not print, natively? why can i not remove certain apps?)

And up to now, backing up my Windows phone (yes, yes) was so much easier, even while running linux on my machine. Copy and paste via file manager is no fun; cp works, but one cannot simply use the mounted file location (at least it did not work for me). I imagine the reason for making a backup hard is to make it more likely for users to use the integrated backup to Google servers, so they can scan and analyze the data. I do NOT like that, not here, nor for anything else.

Well, not anymore. With this it is just a few steps:

1. connect phone
2. select 'transfer files'
3. run script

I made a small folder, and a script basically containing these steps:

Get the directory where the phone is mounted into a variable (it might be easier to just take the string after gvfs/, but I found this on stackexchange).
rsync files from the card and internal memory to a backup folder, without the fluff of caches.

#!/bin/bash

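# name of the phone's mount directory below gvfs (thanks Stackoverflow)
directory=$(ls "$XDG_RUNTIME_DIR/gvfs/")
# copy card and internal memory; patterns are quoted so the shell does not expand them
rsync -auv --exclude '*cache*' --exclude '*/Android/data*' "$XDG_RUNTIME_DIR/gvfs/${directory}"/* ~/Documents/phone-backup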

Why first get the directory name, instead of copying the whole phone? Permissions did not allow that. But this works like a charm, and I run it irregularly, depending on how much my phone data changes.

Wednesday, July 26, 2017

Sorting files for unique entries - which method is fastest

Starting with some data around keywords and logfiles: there are 23 csv files with 3 fields each, and we need the list of unique lines across all files. They have a total of ~58 million rows. Might take a while - and I hope there is a fast way.

Used 3 ways to do this:

a. sort -u Q*.csv > output

b. cat Q*.csv >> temp
   sort -u temp > output2

c. for file in Q*.csv; do
       cat "$file" >> results
       sort -u results > fresh
       mv fresh results
   done

About 48 million unique lines result:

a. <1 min and killed performance on my computer
b. 1 min - definitely did not expect such a difference
c. 11 min - ouch, and some surprise. I thought the overlap was larger, so that this indirect way (merge, get uniques, then merge with the next batch) might be faster than making one huge file of all entries first and then sorting.

Now, just using the first field, which should have much larger overlap.
About 43 million unique lines result:

a. 48 sec and killed performance on my computer
b. 1 min
c. 6 min

Now, with at least 50% overlap, all fields

a. 4 min and killed performance on my computer
b. 6 min - definitely did not expect such a difference
c. 24 min - wow.

Seems the sort with no temp file is the fastest way, and the step-by-step merge is by far the slowest.
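For reference, the "first field only" run can be done with cut in front of the sort. A small sketch, assuming comma-separated files; the function names are made up. Setting LC_ALL=C makes sort compare plain bytes, which is often faster, but I have not timed it on this data.

```shell
#!/bin/bash
# Unique values of the first comma-separated field across all given files.
unique_first_field() {
    cut -d, -f1 "$@" | LC_ALL=C sort -u
}

# Unique full lines across all given files (variant a. from above).
unique_lines() {
    LC_ALL=C sort -u "$@"
}

# Example:
# unique_first_field Q*.csv > output_field1
# unique_lines Q*.csv > output
```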

Tuesday, July 18, 2017

Wordlists - Domains

Wordlists

Are just lists of words: there are many. I found a list with over 9000 English nouns, and just had to test.

aardvark
abacus
abbey
abdomen
ability
abolishment
abroad
abuse
accelerant
accelerator

First, added a 'www.' to the beginning, then a '.com' to the end: ah, urls!
How many of these are used? I had heard nearly all domains that make some kind of sense (and a few more) are taken already. Seems not so much.

Just curl'd them, then counted the HTTP status codes:

9111 urls have:

200 OK    3565
301     1697
302     752
304     0
404     63
403     49
410     0
500     8
and a bunch of never seen response codes, like 416, 456, or 501.

All in all, 2045 urls had no response at all.
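A sketch of how such a count can be done, not the exact script used here: curl prints only the status code, and sort plus uniq do the tally. The function names, the timeout and the file name words.txt are assumptions.

```shell
#!/bin/bash
# Print the HTTP status code for one url. curl reports 000 when there is
# no response at all (timeout, DNS failure, refused connection).
status_of() {
    curl -s -o /dev/null -m 20 -w '%{http_code}' "$1"
}

# Tally status codes read from stdin, one code per line, most frequent first.
tally_codes() {
    sort | uniq -c | sort -rn | awk '{ print $2 "\t" $1 }'
}

# Example (needs network access; words.txt is a hypothetical noun list):
# while read -r word; do
#     status_of "www.$word.com"; echo
# done < words.txt | tally_codes
```

Without -L curl does not follow redirects, so 301 and 302 show up as their own rows, as in the counts above.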

Tuesday, June 27, 2017

Too many, too long, too slow

As mentioned in the earlier posts - there are 5 different 'top 1 million websites' lists available for free.

The immediate question: which one is best? Or how do they differ, and which list can I use to run my tests?

First, sure, clean up the data and pull out the url (don't need the other elements for this). Pretty easy with cut, then into full lists, and split off the top urls with a head -1000 or such.

So I started to compare with some awk script, looping one list over the next, and it's taking hours. Well, I started with 1000 urls each, worked fine; 10,000 urls takes a while; but then with one million ... not so much. It's 1,000,000 lines compared against 1,000,000 lines. Some of that can be optimized inside the script (continue on a match), but the remainder is still large.
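For comparison, the nested loop is not really needed: awk can load one list into a lookup table and then pass over the second list once, or both lists can be sorted and compared with comm. This is a sketch with made-up function names, not the script described above.

```shell
#!/bin/bash
# Print the lines of file 2 that also occur in file 1, using one pass per
# file and an in-memory lookup table instead of comparing every line with
# every other line. Assumes file 1 is not empty.
common_lines() {
    awk 'NR == FNR { seen[$0] = 1; next } $0 in seen' "$1" "$2"
}

# Same idea via sorted input and comm (-12 keeps only lines in both files).
common_lines_sorted() {
    local a b
    a=$(mktemp)
    b=$(mktemp)
    sort -u "$1" > "$a"
    sort -u "$2" > "$b"
    comm -12 "$a" "$b"
    rm -f "$a" "$b"
}

# Example (file names are placeholders):
# common_lines alexa-urls.txt opendns-urls.txt | wc -l
```

The awk variant keeps the order of the second file and needs enough memory to hold the first list; the comm variant returns sorted output.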

So - finally set up an ubuntu server at home. Just an old Dell work laptop - 4 years old, still running like a charm with ubuntu.

Setting up the ssh server was pretty easy too - all behind my router, I don't really need outside-in access, as the crawls will run many hours or days. Still using key authentication - it's actually easier for later logons, and much more secure. (Many thanks to digitalocean and stackoverflow for all the info to help me through this.)

Up and running!

Monday, April 17, 2017

New top 1 million websites list

Large lists of websites:

Alexa 1 million

Alexa's list of the top 1 million sites has helped me many times to run larger scans over homepages and other pages on the top sites, and to compare setup and speed depending on how much traffic a site gets, and similar.

Now, Alexa has officially been discontinued as far as I know, although the file is available (right now at least).
As it is based on browser plugins to track visits, it might have a different mix than the newer OpenDNS list.

OpenDNS 1 Million

This one is based on over 100 Billion DNS requests per day, not limited to http/https requests .. but perhaps read the details for yourself.
And even older versions are available, kind of asking to use this for time series.

MajesticSeo 1 Million

Majestic 1 Million - another company seeing the potential of the list, and publishing the list, but not maintaining... or do they? Unclear if it is maintained, unclear when the last update happened, but seems a good list with a nice web interface, too.

Quantcast 1 Million

This has been around for a while, but it's unclear if it is regularly maintained and how exactly it is generated. But it works very similarly to Alexa, which makes it a convenient alternative. Here's the page linking to the download; testing just now, that is not available anymore, but this link is working (zip) as of today.

Statvoo 1 Million

Another offer that seems new - and like Alexa, the site has a very nice categorization, but the samples are tiny (20 sites), with no download option for these - only for the top 1 Million.

Friday, February 3, 2017

Google results on ubuntu

Just saw this - for the first time - on a linux laptop. Tried for a few searches, they all come up like this.