There's quite a bit of public cloud storage these days, but trusting any of them
with any kind of data seems reckless - the service is free to corrupt, monetize,
leak, hold hostage or just drop it at any point.
Given that these services are provided at no cost, and generally without many
ads, I guess reputation and ToS are the only things stopping them from acting like that.
Not trusting any single one of these services looks like a sane safeguard
against them suddenly collapsing or blocking one's account.
And not trusting any of them with plaintext of the sensitive data seems to be a
good way to protect it from all the shady things that can be done to it.
Tahoe-LAFS is a great capability-based secure distributed storage system,
where you basically do "tahoe put somefile" and get a capability string in return.
That string is sufficient to find, decrypt and check the integrity of the file (or
directory tree) - basically to get it back in what's guaranteed to be the same form.
Neither tahoe node state nor stored data can be used to recover that cap.
Retrieving the file afterwards is as simple as a GET with that cap in the url.
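To make that a bit more concrete, here's a minimal sketch of the same put/get
round-trip done through the node's local webapi - assuming a stock node with the
webapi on the default 127.0.0.1:3456, with "somefile" being just a placeholder:

# Store a file through a local tahoe node and read it back by its cap.
# Endpoint/port are the webapi defaults; adjust for your node's tahoe.cfg.
import urllib2

def tahoe_put(data, url='http://127.0.0.1:3456/uri'):
    # PUT of raw contents to /uri returns the capability string in the response body
    req = urllib2.Request(url, data)
    req.get_method = lambda: 'PUT'
    return urllib2.urlopen(req).read().strip()

def tahoe_get(cap, base='http://127.0.0.1:3456/uri/'):
    # GET /uri/<cap> finds, decrypts and integrity-checks the stored data
    return urllib2.urlopen(base + cap).read()

data = open('somefile', 'rb').read()
cap = tahoe_put(data)
assert tahoe_get(cap) == data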
With remote storage providers, the tahoe node works as a client, and with all
crypto being client-side, the actual cloud provider is clueless about the stuff
you store, which I find to be quite an important thing, especially if you stripe
data across many of these leaky and/or plain evil things.
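That striping is just tahoe's usual erasure coding at work - the relevant knobs
live in tahoe.cfg, roughly like this (3-of-10 is the stock default; treat the
numbers as an illustration):

[client]
# each file is split into 10 shares, any 3 of which are enough to rebuild it,
# so up to 7 backends can vanish or turn hostile without losing data
shares.needed = 3
shares.happy = 7
shares.total = 10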
Finally got around to connecting a third backend (box.net) to tahoe today, so
wanted to share a few links on the subject:
- Public cloud drivers for tahoe-lafs.
- Tool to intelligently (compression, deduplication, rate-limiting, filtering,
  metadata, etc) back up stuff to tahoe.
- Upstream repo with more enterprisey cloud backend drivers (s3, openstack, etc).
- Redundant Array of Independent Clouds concept.
- A way to link all the clouds together without having any special drivers.
As I run tahoe nodes on headless linux machines, running proprietary GUI
clients there doesn't sound too appealing, even if they exist for some of these services.
Having a bit of free time recently, I've worked a bit on the feedjack
web rss reader / aggregator project.
To keep track of what's already read and what's not, historically I've used
a js + client-side localStorage approach, which has quite a few advantages:
- Works with multiple clients, i.e. everyone has their own state.
- Server doesn't have to store any data for possible-infinite number of
clients, not even session or login data.
- Same pages still can be served to all clients, some will just hide the entries already marked as read.
- Previous point leads to pages being very cache-friendly.
- No need to "recognize" client in any way, which is commonly achieved with logins or persistent cookies.
- No interaction of the "write" kind with the server means much less
potential for abuse (DDoS, spam, other kinds of exploits).
Flip side of that rosy picture is that localStorage only works in one browser
(or possibly several synced instances), which is quite a drag, because one
advantage of a web-based reader is that it can be accessed from anywhere, not
just a single platform, where you might as well install a specialized app.
To fix that unfortunate limitation, about a year ago I've added an ad-hoc storage
mechanism to just dump localStorage contents as json to some persistent
storage on the server, authenticated by a special "magic" header from the browser.
It was never a public feature, requiring some browser tweaking and being a
server admin, basically.
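Something like the following django view is enough for such a mechanism - this
is not the actual code, just a sketch, and the header name, secret and storage
path here are made up for illustration:

# Dump POSTed localStorage contents (json) into a per-client file on the server,
# gated by a hardcoded "magic" header value set in the browser.
import os, json
from django.http import HttpResponse, HttpResponseForbidden

STATE_DIR = '/srv/feedjack/client-state' # hypothetical storage path
MAGIC = 'some-long-random-string' # shared secret, configured in the browser

def dump_state(request, client_id):
    # client_id comes from the urlconf pattern
    if request.META.get('HTTP_X_MAGIC') != MAGIC:
        return HttpResponseForbidden()
    state = json.loads(request.body) # make sure it's at least valid json
    with open(os.path.join(STATE_DIR, client_id), 'w') as dst:
        json.dump(state, dst)
    return HttpResponse('ok')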
Recently, however, the remoteStorage project from the
unhosted group has caught my attention.
The idea itself and the movement's goals are quite ambitious and otherwise
awesome - a return to the "decentralized web" idea, using simple, already
available mechanisms like webfinger for service discovery (reminds me of the
Thimbl concept by telekommunisten.net), WebDAV for storage and
OAuth2 for authorization (meaning no special per-service passwords or similar shared secrets).
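On the wire that combination is refreshingly simple - the client ends up doing
plain http PUT/GET requests against per-user WebDAV paths with an OAuth2 bearer
token attached. A rough sketch (the url, path and token below are made up for
illustration):

# Minimal "store a value / read it back" against a remoteStorage-style server:
# plain WebDAV-ish PUT/GET plus an OAuth2 bearer token. All names are made up.
import urllib2

base = 'https://storage.example.net/alice/public/feedjack/'
token = 'oauth2-access-token-goes-here'

def rs_put(key, data):
    req = urllib2.Request(base + key, data, {'Authorization': 'Bearer ' + token})
    req.get_method = lambda: 'PUT'
    urllib2.urlopen(req)

def rs_get(key):
    req = urllib2.Request(base + key, headers={'Authorization': 'Bearer ' + token})
    return urllib2.urlopen(req).read()

rs_put('readstate', '{"entries-read": [12, 15, 16]}')
print rs_get('readstate')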
But the most interesting thing I've found about it is that it should be
actually easier to use than writing an ad-hoc client syncer and server storage
implementation - just put the off-the-shelf remoteStorage.js onto the page (it even
includes a "syncer" part to sync localStorage to the remote server) and deploy or
find any remoteStorage provider and you're all set.
In practice, it works as advertised, but will have quite significant changes
soon (with the release of the 0.7.0 js version) and had only an ad-hoc
proof-of-concept server implementation in python (though there are also
php and node.js/ruby versions), so I've put together my own django-unhosted
implementation, being basically a glue between simple WebDAV, oauth2app
and the Django Storage API (which has lots of pluggable backends).
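The Storage API bit is what makes the actual backend a non-issue - from the glue
code's perspective it's just a few calls like these, with whatever
DEFAULT_FILE_STORAGE is configured (local fs, s3, whatever) behind them; a
trivial sketch, with made-up blob names:

# Backend-agnostic blob handling via django's storage api.
from django.core.files.base import ContentFile
from django.core.files.storage import default_storage

# save() returns the name actually used, open()/delete() work on that name
path = default_storage.save('unhosted/alice/feedjack/readstate', ContentFile('{}'))
data = default_storage.open(path).read()
default_storage.delete(path)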
Using that thing in feedjack now (here, for example) instead of that hacky
json cache I've had, with django-unhosted deployed on my server, which also
allows using it with all the other apps with remoteStorage support.
Looks like a really neat way to provide some persistent storage for any webapp
out there, guess that's one problem solved for any future webapps I might
deploy that will need one.
With JS being able to even load and use binary blobs (like images) that way
now, it becomes possible to write even an unhosted facebook, with only events
like status updates still aggregated and broadcast through some central hub.
I bet there's gotta be something similar with facebook, twitter or maybe
github as backends, but as proven in many cases, it's not quite sane to rely on
these centralized platforms for any kind of service, which is especially a
pain if the implementation there is one-platform-specific, unlike the one
remoteStorage protocol for any of them.
Would be really great if they'd support some protocol like that at some point.
But aside from the short-term "problem solved" thing, it's really nice to see such
movements out there, even though the whole stack of market incentives (which favor
control over data, centralization and monopolies) is against them.
If you're downloading stuff off the 'net via bt like me, then TPB is probably quite a familiar place to you.
Ever since the '09 trial there have been problems with TPB trackers
(tracker.thepiratebay.org) - the name gets inserted into every .torrent yet it
points to 127.0.0.1.
Lately, the TPB offshoot tracker.openbittorrent.com suffers from the same problem,
and actually there are a lot of other trackers in .torrent files that point
either at 0.0.0.0 or 127.0.0.1 these days.
As I use rtorrent
, I have an issue with
this - rtorrent seems pretty dumb when it comes to tracker filtering, so it
queries all of them on a round-robin basis, without checking where the name
points to or whether it's been down for the whole app uptime, and queries take
quite a lot of time to time out, so that means at least a two-minute delay in
starting a download (as TPB trackers come first), plus it breaks a lot of other
things like manual tracker-cycling ("t" key), since it seems to prefer only
top-tier trackers and these are 100% down.
Now the problem with http to localhost can be solved with the firewall, of
course, although it's an ugly solution, and 0.0.0.0 seems to fail pretty fast
by itself, but stateless udp is still a problem.
Another way to tackle the issue is probably just to use a client that is
capable of filtering the trackers by ip address, but that probably means some
heavy-duty stuff like azureus or vanilla bittorrent which I've found pretty
buggy and also surprisingly heavy in the past.
So the easiest generic solution (which will, of course, work for rtorrent) I've
found is just to process the .torrent files before feeding them to the leecher
app. Since I'm downloading these via firefox exclusively, and there I use
FlashGot (not the standard "open with" interface since
I also use it to download large files on a remote machine w/o firefox, and afaik
it's just not the way "open with" works) to drop them into a torrent bin via a
script, it's actually a matter of updating the link-receiving script.
The bencode format of .torrent files is not a mystery, plus its
pure-py implementation is actually the reference one, since it's a part of the
original python bittorrent client, so all I basically had to do was to rip
bencode.py from it and paste the relevant part into the script.
The right way might've been to add a dependency on the whole bittorrent client,
but that's overkill for such a simple task, plus the api itself seems to be
purely internal and probably subject to change with client releases anyway.
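For reference, the decoded metadata the script operates on looks roughly like
this - bdecode() returns plain dicts/lists/strings, with trackers grouped into
tiers (hence the nested loop below); the values here are made up:

# Shape of decoded .torrent metadata (made-up values):
torrent = {
    'announce': 'http://tracker.thepiratebay.org/announce',
    'announce-list': [ # tiers of trackers, tried in order
        ['http://tracker.thepiratebay.org/announce'],
        ['udp://tracker.openbittorrent.com:80/announce'],
    ],
    'info': {'name': 'myfile', 'piece length': 262144, 'pieces': '...'},
}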
So, to the script itself...
from socket import gethostbyname, gaierror
from urlparse import urlparse
# bdecode/bencode below are the functions ripped from bencode.py, as described above

# URL checker
def trak_check(trak):
    if not trak: return False
    try: ip = gethostbyname(urlparse(trak).netloc.split(':', 1)[0])
    except gaierror: return True # perhaps it will resolve later on
    else: return ip not in ('127.0.0.1', '0.0.0.0')

# Actual processing ("torrent" holds the raw .torrent file contents here)
torrent = bdecode(torrent)
for tier in list(torrent['announce-list']):
    for trak in list(tier):
        if not trak_check(trak):
            # print >>sys.stderr, 'Dropped:', trak
            tier.remove(trak)
    if not tier: torrent['announce-list'].remove(tier)
# print >>sys.stderr, 'Result:', torrent['announce-list']
if not trak_check(torrent['announce']):
    # fall back to the first tracker that survived the filtering above
    torrent['announce'] = torrent['announce-list'][0][0]
torrent = bencode(torrent)
That, plus the simple "fetch-dump" part, if needed.
No magic of any kind, just a plain "announce-list" and "announce" urls check,
dropping each only if it resolves to one of those bogus placeholder IPs.
I've wanted to make it as light as possible, so there's no logging or the
optparse/argparse stuff I tend to cram everywhere I can, and the extra heavy
imports like urllib/urllib2 are conditional as well. The only dependency is
python (>=2.6) itself.
Basic use-case is one of these:
% brecode.py < /path/to/file.torrent > /path/to/processed.torrent
% brecode.py http://tpb.org/torrents/myfile.torrent > /path/to/processed.torrent
% brecode.py http://tpb.org/torrents/myfile.torrent\
-r http://tpb.org/ -c "...some cookies..." -d /path/to/torrents-bin/
All the extra headers like cookies and referer are optional, so is the
destination path (dir, basename is generated from URL). My use-case in FlashGot
is this: "[URL] -r [REFERER] -c [COOKIE] -d /mnt/p2p/bt/torrents"
And there's the script itself.
Quick, dirty and inconclusive testing showed almost a 100 KB/s -> 600 KB/s
speed increase on several different popular and unrelated .torrent files (two
successive tests on the same file, even with a clean session, would obviously be skewed).
That's pretty inspiring. Guess now I can waste even more time on the TV-era
crap than before, oh joy ;)