Mar 25, 2013

Secure cloud backups with Tahoe-LAFS

There's plenty of public cloud storage these days, but trusting any of these services with any kind of data seems reckless - a service is free to corrupt, monetize, leak, hold hostage or just drop it.
Given that these services are provided at no cost, and generally without many ads, I guess reputation and ToS are the only things stopping them from acting like that.
Not trusting any single one of these services looks like a sane safeguard against them suddenly collapsing or blocking one's account.
And not trusting any of them with the plaintext of sensitive data seems to be a good way to protect it from all the shady things that can be done to it.

Tahoe-LAFS is a great capability-based secure distributed storage system, where you basically do "tahoe put somefile" and get a capability string like "URI:CHK:iqfgzp3ouul7tqtvgn54u3ejee:...u2lgztmbkdiuwzuqcufq:1:1:680" in return.

That string is sufficient to find, decrypt and check the integrity of the file (or directory tree) - basically to get it back in what is guaranteed to be the same state.
Neither tahoe node state nor stored data can be used to recover that cap.
Retrieving the file afterwards is as simple as a GET with that cap in the url.
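
A minimal sketch of that retrieval in Python 3, assuming a local tahoe node serving its web API at the default http://127.0.0.1:3456 and the documented GET /uri/$FILECAP endpoint. The helper names (cap_url, fetch_cap) and the cap in the usage line are made up for illustration.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Assumption: a local tahoe node with its web API on the default port.
NODE_URL = "http://127.0.0.1:3456"

def cap_url(cap, node_url=NODE_URL):
    """Build the web API URL for a capability string (URL-escaped)."""
    return "%s/uri/%s" % (node_url.rstrip("/"), quote(cap, safe=""))

def fetch_cap(cap, node_url=NODE_URL):
    """GET the file contents behind a cap from the local node."""
    with urlopen(cap_url(cap, node_url)) as resp:
        return resp.read()
```

Usage would be something like fetch_cap("URI:CHK:aa:bb:1:1:680") with a real cap in place of that placeholder; the node does the lookup, decryption and integrity check before returning the bytes.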

With remote storage providers, the tahoe node works as a client, so with all crypto being client-side, the actual cloud provider is clueless about the stuff you store, which I find to be quite an important thing, especially if you stripe data across many of these leaky and/or plain evil things.

Finally got around to connecting a third backend to tahoe today, so wanted to share a few links on the subject:

Aug 09, 2012

Unhosted remoteStorage idea

Having a bit of free time recently, I worked a bit on the feedjack web rss reader / aggregator project.
To keep track of what's already read and what's not, historically I've used a js + client-side localStorage approach, which has quite a few advantages:
  • Works with multiple clients, i.e. everyone has their own state.
  • Server doesn't have to store any data for a possibly infinite number of clients, not even session or login data.
  • Same pages still can be served to all clients, some will just hide unwanted content.
  • Previous point leads to pages being very cache-friendly.
  • No need to "recognize" the client in any way, which is commonly achieved with authentication.
  • No interaction of the "write" kind with the server means much less potential for abuse (DDoS, spam, other kinds of exploits).

The flip side of that rosy picture is that localStorage only works in one browser (or possibly several synced instances), which is quite a drag, because one advantage of a web-based reader is that it can be accessed from anywhere, not just from a single platform, where you might as well install a specialized app.

To fix that unfortunate limitation, about a year ago I added an ad-hoc storage mechanism to just dump localStorage contents as json to some persistent storage on the server, authenticated by a special "magic" header from the browser.
It was never a public feature, basically requiring some browser tweaking and being a server admin.
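
That kind of mechanism can be sketched as a tiny WSGI app in Python 3. This is not the actual feedjack code, just an illustration of the idea: the header name (X-Storage-Key), the secret value and the in-memory dict standing in for persistent storage are all assumptions.

```python
import json

# Hypothetical names: WSGI exposes an "X-Storage-Key" header under this key.
MAGIC_HEADER = "HTTP_X_STORAGE_KEY"
MAGIC_VALUE = "change-me"   # hypothetical shared secret
_store = {}                 # stand-in for persistent storage on the server

def app(environ, start_response):
    # Reject any request that lacks the expected "magic" header value.
    if environ.get(MAGIC_HEADER) != MAGIC_VALUE:
        start_response("403 Forbidden", [("Content-Type", "text/plain")])
        return [b"forbidden"]
    if environ["REQUEST_METHOD"] == "PUT":
        # Read the uploaded localStorage dump and make sure it parses as json.
        size = int(environ.get("CONTENT_LENGTH") or 0)
        body = environ["wsgi.input"].read(size)
        _store["state"] = json.loads(body.decode("utf-8"))
        start_response("204 No Content", [])
        return [b""]
    # Any other method returns the last stored dump (empty object if none).
    start_response("200 OK", [("Content-Type", "application/json")])
    return [json.dumps(_store.get("state", {})).encode("utf-8")]
```

A header like that is only as secret as the transport, which is one more reason it was never meant to be a public feature.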

Recently, however, the remoteStorage project from the unhosted group caught my attention.

The idea itself and the movement's goals are quite ambitious and otherwise awesome - to return to the "decentralized web" idea, using simple, already available mechanisms like webfinger for service discovery (reminds me of the Thimbl concept), WebDAV for storage and OAuth2 for authorization (meaning no special per-service passwords or similar crap).
But the most interesting thing I've found about it is that it should actually be easier to use than writing an ad-hoc client syncer and server storage implementation - just put off-the-shelf remoteStorage.js on the page (it even includes a "syncer" part to sync localStorage to a remote server), deploy or find any remoteStorage provider, and you're all set.
In practice, it works as advertised, but will have quite significant changes soon (with the release of the 0.7.0 js version) and had only an ad-hoc proof-of-concept server implementation in python (though there's also ownCloud in php and node.js/ruby versions), so I wrote a django-unhosted implementation, being basically glue between simple WebDAV, oauth2app and the Django Storage API (which has backends for everything).
I'm using that in feedjack now (here, for example) instead of the hacky json cache I had, with django-unhosted deployed on my server, which also allows using it with all the other apps out there that support the protocol.
Looks like a really neat way to provide some persistent storage for any webapp out there, guess that's one problem solved for any future webapps I might deploy that will need one.
With JS now being able to even load and use binary blobs (like images) that way, it becomes possible to write even an unhosted facebook, with only events like status updates still aggregated and broadcast through some central point.
I bet there's gotta be something similar, but with facebook, twitter or maybe github backends, but as proven in many cases, it's not quite sane to rely on these centralized platforms for any kind of service, which is especially a pain if implementation there is one-platform-specific, unlike one remoteStorage protocol for any of them.
Would be really great if they'd support some protocol like that at some point though.

But aside from the short-term "problem solved" thing, it's really nice to see such movements out there, even though the whole stack of market incentives (which heavily favors control over data, centralization and monopolies) is against them.

Jun 05, 2010

Getting rid of dead bittorrent trackers for rtorrent by scrubbing .torrent files

If you're downloading stuff off the 'net via bt like me, then TPB is probably quite a familiar place to you.

Ever since the '09 trial there have been problems with the TPB trackers - the name gets inserted into every .torrent, yet it points to a bogus placeholder address.
Lately a TPB offshoot suffers from the same problem, and there are actually a lot of other trackers in .torrent files that point at the same kind of bogus addresses these days.
As I use rtorrent, I have an issue with this - rtorrent seems pretty dumb when it comes to tracker filtering, so it queries all of them on a round-robin basis, without checking where a tracker's name points or whether it has been down for the whole app uptime. Queries take quite a lot of time to time out, so that means at least a two-minute delay in starting a download (as the TPB trackers come first), plus it breaks a lot of other things like manual tracker-cycling ("t" key), since it seems to prefer only top-tier trackers and these are 100% down.
Now the problem with http to localhost can be solved with the firewall, of course, although it's an ugly solution, and these requests seem to fail pretty fast by themselves anyway, but stateless udp is still a problem.
Another way to tackle the issue is probably just to use a client that is capable of filtering the trackers by ip address, but that probably means some heavy-duty stuff like azureus or vanilla bittorrent which I've found pretty buggy and also surprisingly heavy in the past.

So the easiest generic solution (which will, of course, work for rtorrent) I've found is just to process the .torrent files before feeding them to the leecher app. Since I'm downloading these via firefox exclusively, and there I use FlashGot (not the standard "open with" interface since I also use it to download large files on a remote machine w/o firefox, and afaik that's just not the way "open with" works) to drop them into a torrent bin via a script, it's actually a matter of updating the link-receiving script.

Bencode is not a mystery, plus its pure-py implementation is actually the reference one, since it's a part of the original python bittorrent client, so all I basically had to do was to rip the relevant part from it and paste it into the script.
The right way might've been to add a dependency on the whole bittorrent client, but that's overkill for such a simple task, plus the api itself seems to be purely internal and probably subject to change with client releases anyway.
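
To show how little there is to the format, here is a minimal bencode sketch in Python 3. It is not the reference implementation mentioned above (that one is Python 2 and has its own API); the names bdecode_value and bencode_value are made up, and error handling for malformed input is left out.

```python
def bdecode_value(data, pos=0):
    """Decode one bencoded value from bytes, returning (value, next_pos)."""
    lead = data[pos:pos + 1]
    if lead == b"i":                        # integer: i<digits>e
        end = data.index(b"e", pos)
        return int(data[pos + 1:end]), end + 1
    if lead == b"l":                        # list: l<items>e
        pos, items = pos + 1, []
        while data[pos:pos + 1] != b"e":
            item, pos = bdecode_value(data, pos)
            items.append(item)
        return items, pos + 1
    if lead == b"d":                        # dict: d<key><value>...e
        pos, result = pos + 1, {}
        while data[pos:pos + 1] != b"e":
            key, pos = bdecode_value(data, pos)
            result[key], pos = bdecode_value(data, pos)
        return result, pos + 1
    colon = data.index(b":", pos)           # string: <length>:<bytes>
    start = colon + 1
    length = int(data[pos:colon])
    return data[start:start + length], start + length

def bencode_value(value):
    """Encode ints, bytes, lists and dicts (with bytes keys) to bencode."""
    if isinstance(value, int):
        return b"i%de" % value
    if isinstance(value, bytes):
        return b"%d:%s" % (len(value), value)
    if isinstance(value, list):
        return b"l" + b"".join(bencode_value(v) for v in value) + b"e"
    if isinstance(value, dict):
        items = sorted(value.items())       # dict keys go out in sorted order
        return b"d" + b"".join(
            bencode_value(k) + bencode_value(v) for k, v in items) + b"e"
    raise TypeError("cannot bencode %r" % type(value))
```

For example, bencode_value({b"a": [1, b"xy"]}) gives b"d1:ali1e2:xyee", and decoding that string returns the same dict along with the position just past its end.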

So, to the script itself...

from socket import gethostbyname, gaierror
from urlparse import urlparse
# bencode/bdecode are pasted in from the original python bittorrent client

# URL checker: False for empty urls and for hosts resolving to placeholder IPs
def trak_check(trak):
    if not trak: return False
    try: ip = gethostbyname(urlparse(trak).netloc.split(':', 1)[0])
    except gaierror: return True # perhaps it will resolve later on
    else: return ip not in ('', '')

# Actual processing
torrent = bdecode(torrent)
# iterate over copies, since tiers and trackers get removed along the way
for tier in list(torrent['announce-list']):
    for trak in list(tier):
        if not trak_check(trak):
            tier.remove(trak)
            # print >>sys.stderr, 'Dropped:', trak
    if not tier: torrent['announce-list'].remove(tier)
# print >>sys.stderr, 'Result:', torrent['announce-list']
if not trak_check(torrent['announce']):
    torrent['announce'] = torrent['announce-list'][0][0]
torrent = bencode(torrent)

That, plus the simple "fetch-dump" part, if needed.
No magic of any kind, just a plain check of the "announce-list" and "announce" urls, dropping each only if it resolves to one of those bogus placeholder IPs.
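
The same check can be sketched in Python 3 with the resolver passed in as an argument, which makes it easy to try without touching DNS and also covers torrents that have no "announce-list" at all, or end up with none after scrubbing. The names here (tracker_ok, scrub, BOGUS_IPS) are illustrative, the address in BOGUS_IPS is a documentation-range stand-in rather than one of the real placeholder IPs, and plain str keys are used for brevity where a real decoder would hand back bytes.

```python
from urllib.parse import urlparse

BOGUS_IPS = {"192.0.2.1"}  # hypothetical stand-in for the placeholder IPs

def tracker_ok(url, resolve, bogus=BOGUS_IPS):
    """True unless the url is empty or its host resolves to a bogus IP."""
    if not url:
        return False
    try:
        return resolve(urlparse(url).hostname) not in bogus
    except OSError:         # socket.gaierror is a subclass of OSError
        return True         # perhaps it will resolve later on

def scrub(meta, resolve):
    """Return a copy of decoded torrent metadata without dead trackers."""
    tiers = [[t for t in tier if tracker_ok(t, resolve)]
             for tier in meta.get("announce-list", [])]
    tiers = [tier for tier in tiers if tier]    # drop emptied tiers
    out = dict(meta)
    if tiers:
        out["announce-list"] = tiers
    else:
        out.pop("announce-list", None)
    # Replace a dead primary tracker with the first surviving one, if any.
    if not tracker_ok(meta.get("announce"), resolve) and tiers:
        out["announce"] = tiers[0][0]
    return out
```

With real data, socket.gethostbyname would be the resolver to pass in; a dict lookup is enough for experimenting.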

I wanted to make it as light as possible, so there's none of the logging or optparse/argparse stuff I tend to cram everywhere I can, and the extra and heavy imports like urllib/urllib2 are conditional as well. The only dependency is python (>=2.6) itself.

Basic use-case is one of these:

% < /path/to/file.torrent > /path/to/processed.torrent
% > /path/to/processed.torrent
    -r -c "...some cookies..." -d /path/to/torrents-bin/

All the extra headers like cookies and referer are optional, so is the destination path (dir, basename is generated from URL). My use-case in FlashGot is this: "[URL] -r [REFERER] -c [COOKIE] -d /mnt/p2p/bt/torrents"
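
A rough Python 3 sketch of what such a fetch part might look like, with optional headers and a destination name derived from the URL. This is an assumption about the approach, not the actual script: the helper names, the fallback file name and the exact way the basename is derived are all made up.

```python
import os
from urllib.parse import urlparse, unquote
from urllib.request import Request

def build_request(url, referer=None, cookie=None):
    """Create a request, adding Referer/Cookie headers only when given."""
    req = Request(url)
    if referer:
        req.add_header("Referer", referer)
    if cookie:
        req.add_header("Cookie", cookie)
    return req

def dest_path(url, dest_dir):
    """Pick a file name from the URL path (query string is ignored)."""
    name = os.path.basename(unquote(urlparse(url).path))
    return os.path.join(dest_dir, name or "download.torrent")
```

The request object would then go to urllib.request.urlopen, with the response body scrubbed and written to dest_path(url, dest_dir).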

And there's the script itself.

Quick, dirty and inconclusive testing showed an increase from almost 100 KB/s to 600 KB/s on several different popular and unrelated .torrent files (different ones because two successive tests on the same file, even with a clean session, are obviously wrong).
That's pretty inspiring. Guess now I can waste even more time on the TV-era crap than before, oh joy ;)