Tuesday, August 29, 2017

Backup Android to linux filesystem


Not sure if I like my Android. The phone is great, and Android is mostly great - but not entirely.
(too many walls: why should I not be root? why can I not print natively? why can I not remove certain apps?)

And up to now, backing up my Windows phone (yes, yes) was so much easier, even while running Linux on my machine. Copy and paste via the file manager is no fun; cp works, but one cannot simply use the mounted file location directly (at least it did not work for me). I imagine the reason for making backups hard is to nudge users toward the integrated backup to Google's servers, where the data can be scanned and analyzed. I do NOT like that - not here, not anywhere.

Well, not anymore. With this it is just a few steps:

1. connect phone
2. select 'transfer files'
3. run script


I made a small folder and a script that basically does two things:
  • Get the directory where the phone is mounted into a variable (it might be easier to just take the string after gvfs/, but I found this approach on Stack Exchange).
  • rsync the files from the card and internal memory to a backup folder, without the fluff of caches.

#!/bin/bash

# Find the gvfs mount point of the phone (thanks, Stack Overflow)
directory=$(ls "$XDG_RUNTIME_DIR/gvfs/")

# Archive-mode sync, skipping caches and app data
rsync -auv --exclude '*cache*' --exclude '*/Android/data*' \
    "$XDG_RUNTIME_DIR/gvfs/${directory}"/* ~/Documents/phone-backup


Why first get the directory name instead of copying the whole phone? Permissions did not allow that. But this works like a charm, and I run it irregularly, depending on how much my phone data changes.


Wednesday, July 26, 2017

Sorting files for unique entries - which method is fastest

Starting with some data around keywords and logfiles: there are 23 CSV files with 3 fields each, and we need the list of unique lines across all of them. They have a total of ~58 million rows. This might take a while - and I hope there is a faster way.

I used three ways to do this:

a. sort -u Q*.csv > output

b. cat Q*.csv >> temp
   sort -u temp > output2

c. for file in Q*.csv; do
       cat "$file" >> results
       sort -u results > fresh
       mv fresh results
   done
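To reproduce the comparison, the three approaches can be timed side by side on the same input. A sketch (the Q*.csv files and output names are assumptions; the timings in a post like this will of course vary by machine):

```shell
#!/bin/bash
# Time the three approaches to getting unique lines across all Q*.csv files.

echo "a) sort -u directly on all files"
time sort -u Q*.csv > output_a

echo "b) concatenate first, then sort once"
cat Q*.csv > temp
time sort -u temp > output_b

echo "c) incremental merge, one file at a time"
time {
    : > results
    for file in Q*.csv; do
        cat "$file" >> results
        sort -u results > fresh
        mv fresh results
    done
}

# All three should produce identical results:
wc -l output_a output_b results
```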

All fields, about 48 million unique lines in the result:

a.  <1 min - and it killed performance on my computer
b.   1 min - I definitely did not expect such a difference
c.  11 min - ouch, and some surprise. I thought the overlap was larger, so this indirect way - merge, get the uniques, then merge with the next batch - might be faster than building one huge file of all entries first and then sorting it.

Now, using just the first field, there should be much more overlap. About 43 million unique lines in the result:

a.  48 sec - again at the cost of my computer's performance
b.   1 min
c.   6 min - still slow, for the same reason as above

Now, with at least 50% overlap, all fields:

a.   4 min - again killing performance while it ran
b.   6 min
c.  24 min - wow.

It seems the direct sort with no temp file is by far the fastest way to get the unique lines.

Tuesday, July 18, 2017

Wordlists - Domains

Wordlists are just lists of words, and there are many. I found a list with over 9,000 English nouns, and just had to test it.

aardvark
abacus
abbey
abdomen
ability
abolishment
abroad
abuse
accelerant
accelerator

First, I added a 'www.' to the beginning and a '.com' to the end of each word: ah, urls!
How many of these are in use? I had heard that nearly all domains that make some kind of sense (and a few more) are taken already. Seems that's not quite true.

I just curl'd them all, then counted the HTTP status codes:

9111 urls gave:

200    3565
301    1697
302     752
304       0
404      63
403      49
410       0
500       8

and a bunch of response codes I had never seen before, like 416, 456, or 501.

All in all, 2045 urls had no response at all.
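The steps above can be sketched in a few lines of shell. The file names (nouns.txt, stati.txt) are made up for this example, and curl writes 000 when a url gives no response at all:

```shell
#!/bin/bash
# Turn each noun into a www.<noun>.com url, fetch it, and tally the status codes.

sed -e 's/^/www./' -e 's/$/.com/' nouns.txt > urls.txt

while read -r url; do
    # -o /dev/null discards the body; -w prints just the status code.
    curl -s -o /dev/null -m 10 -w '%{http_code}\n' "$url"
done < urls.txt > stati.txt

# Count occurrences of each status code, most frequent first
sort stati.txt | uniq -c | sort -rn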

Tuesday, June 27, 2017

Too many, too long, too slow

As mentioned in the earlier posts - there are 5 different 'top 1 million websites' lists available for free.

The immediate question: which one is best? Or rather, how do they differ, and which list should I use to run my tests?

First, sure, clean up the data and pull out the url (the other fields aren't needed for this). Pretty easy with cut, then into full lists, and split off the top urls with a head -1000 or such.
 

So I started to compare with an awk script, looping one list over the other - and it takes hours. Well, I started with 1,000 urls each, which worked fine; 10,000 urls takes a while; but with one million ... not so much. That is 1,000,000 lines each compared against 1,000,000 lines. Some of that can be optimized inside the script (continue on a match), but the remaining comparisons are still a lot.
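For a plain pairwise comparison, sorting both lists first and letting comm do the work avoids the quadratic loop entirely - a sketch with placeholder file names, not the awk script from above:

```shell
#!/bin/bash
# Find urls common to two lists without an O(n^2) nested loop.
# list1.txt and list2.txt are placeholder names.

sort list1.txt > list1.sorted
sort list2.txt > list2.sorted

# comm -12 suppresses columns 1 and 2, leaving only lines present in both files
comm -12 list1.sorted list2.sorted > common.txt

echo "urls in both lists: $(wc -l < common.txt)"
```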

So - I finally set up an Ubuntu server at home. Just an old Dell work laptop - 4 years old, still running like a charm with Ubuntu.

Setting up the ssh server was pretty easy too - it's all behind my router, and I don't really need outside-in access, as the crawls will run for many hours or days. I'm still using key authentication - it's actually easier for later logins, and much more secure. (Many thanks to DigitalOcean and Stack Overflow for all the info that helped me through this.)


Up and running!


Monday, April 17, 2017

New top 1 million websites list

Large lists of websites:


Alexa 1 million

Alexa's list of the top 1 million sites has helped me many times to run larger scans over homepages and others among the top sites, to compare setup and speed depending on how much traffic a site gets, and similar.

Now, Alexa's list has officially been discontinued as far as I know, although the file is still available (right now, at least).
As it is based on browser plugins tracking visits, it might have a different mix than the newer OpenDNS list.

OpenDNS 1 Million


This one is based on over 100 billion DNS requests per day, and is not limited to http/https requests .. but perhaps read the details for yourself.
Even older versions are available, which is kind of an invitation to use it for time series.


MajesticSeo 1 Million

Majestic 1 Million - another company seeing the potential of such a list and publishing one, but not maintaining it... or do they? It is unclear whether it is maintained and when the last update happened, but it seems to be a good list, with a nice web interface, too.

Quantcast 1 Million

This one has been around for a while, but it is unclear whether it is regularly maintained and how exactly it is generated. It works very similarly to Alexa's, which makes it a convenient alternative. Here's the page linking to the download, but testing just now, that is not available anymore - this link, however, is working (zip) as of today.

Statvoo 1 Million

Another offering that seems new - and like Alexa, the site has a very nice categorization, but the samples are tiny (20 sites) and there is no download option for those - only for the top 1 million.


Friday, February 3, 2017

Google results on ubuntu


Just saw this - for the first time - on a Linux laptop. Tried a few searches; they all come up like this.

