Feb 04, 2013

codetag + tmsu: Tag all the Things (and Go!)

Was hacking something irrelevant together again and, as often happens with such things, realized that I implemented something like that before.
It can be something simple - a locking function in python, awk pipe to get some monitoring data, chunk of argparse-based code to process multiple subcommands, TLS wrapper for requests, dbapi wrapper, multi-module parser/generator for human-readable dates, logging buffer, etc...
Point is - some short snippet of code is needed as a base for implementing something new or maybe even to re-use as-is, yet it's not noteworthy enough on its own to split into a module or generally do anything specific about it.

Happens a lot to me, as over the years, a lot of such ad-hoc yet reusable code gets written, and I can usually remember enough implementation details (e.g. which modules were used there, how the methods/classes were called and such), but going "grep" over the source dir takes a shitload of time.

Some things make it faster - ack or pss tools can scan only relevant things (like e.g. "grep ... **/*.py" will do in zsh), but these also run for minutes, as even simple "find" does - there're several django source trees in appengine sdk, php projects with 4k+ files inside, maybe even whole linux kernel source tree or two...

Traversing all these each time on regular fs to find something that can be rewritten in a few minutes will never be an option for me, but luckily there're cool post-fs projects like tmsu, which allow to transcend single-hierarchy-index limitation of a traditional unix fs in much more elegant and useful way than gazillion of symlinks and dentries.

tmsu allows to attach any tags to any files, then query these files back using a set of tags, which it does really fast using sqlite db and clever indexes there.

So, just tagging all the "*.py" files with "lang:py" will allow to:

% time tmsu files lang:py | grep myclass
tmsu files lang:py  0.08s user 0.01s system 98% cpu 0.094 total
grep --color=auto myclass  0.01s user 0.00s system 10% cpu 0.093 total

That's 0.1s instead of several minutes for all the python code in the development area on this machine.

tmsu can actually do even cooler tricks than that with fuse-tagfs mounts, but that's all kinda wasted until all the files are tagged properly.
Which, of course, is a simple enough problem to solve.
So here's my first useful Go project - codetag.

I've added taggers for things that are immediately useful for me to tag files by - implementation language, code hosting (github, bitbucket, local project, as I sometimes remember that snippet was in some public tool), scm type (git, hg, bzr, svn), but adding a new one is just a matter of writing a "Tagger" function, which, given the path and config, returns a list of string tags, plus they're only used if explicitly enabled in config.
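To give a rough idea of that shape, here's a minimal sketch of the same idea in Python rather than Go. It is not codetag's actual code: the function names, the config keys ("taggers", "ext_map", "scm_dirs") and the tag values are all made up for illustration.

```python
import os

# Hypothetical defaults, the real tool takes its mappings from the config file
LANG_BY_EXT = {".py": "lang:py", ".go": "lang:go", ".c": "lang:c", ".sh": "lang:sh"}
SCM_BY_DIR = {".git": "scm:git", ".hg": "scm:hg", ".bzr": "scm:bzr", ".svn": "scm:svn"}

def tagger_lang(path, config):
    """Return a language tag judging by the file extension only."""
    ext_map = config.get("ext_map", LANG_BY_EXT)
    ext = os.path.splitext(path)[1].lower()
    return [ext_map[ext]] if ext in ext_map else []

def tagger_scm(path, config):
    """Return scm tags for every repository dir found in parents of the path."""
    scm_dirs = config.get("scm_dirs", SCM_BY_DIR)
    tags = []
    current = os.path.dirname(os.path.abspath(path))
    while True:
        for name, tag in sorted(scm_dirs.items()):
            if tag not in tags and os.path.isdir(os.path.join(current, name)):
                tags.append(tag)
        parent = os.path.dirname(current)
        if parent == current:  # reached the filesystem root
            return tags
        current = parent

# Registry of available taggers, keyed by the name used in the config
TAGGERS = {"lang": tagger_lang, "scm": tagger_scm}

def tags_for(path, config):
    """Run only the taggers that are explicitly enabled in the config."""
    tags = []
    for name, options in sorted(config.get("taggers", {}).items()):
        tags.extend(TAGGERS[name](path, options))
    return tags
```

With a config like {"taggers": {"lang": {}, "scm": {}}}, the resulting list for each file can then be handed over to something like "tmsu tag <path> <tags>".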

Other features include proper python-like logging and rsync-like filtering (but using more powerful re2 regexps instead of simple glob patterns).
Up-to-date list of these should be apparent from the included configuration file.

Being a proper compiled language, Go allows to make the thing into a single static binary, which is quite neat, as I realized that I now have a tool to tag all the things everywhere - media files on servers' remote-fs'es, like music and movies, hundreds of configuration files by the app they belong to (think tmsu files daemon:apache to find/grep all the horrible ".htaccess" things and its "*.conf" includes), distfiles by the os package name, etc.

So, to paraphrase well-known meme, Tag All The Things! ;)

github link

Oct 23, 2011

dm-crypt password caching between dracut and systemd, systemd password agent

Update 2015-11-25: with "ask-password" caching implemented as of systemd-227 (2015-10-07), better way would be to use that in-kernel caching, though likely requires systemd running in initramfs (e.g. dracut had that for a while).

Up until now I've used lvm on top of single full-disk dm-crypt partition.
It seems easiest to work with - no need to decrypt individual lv's, no confusion between what's encrypted (everything but /boot!) and what's not, etc.
Main problem with it though is that it's harder to have non-encrypted parts, everything is encrypted with the same keys (unless there're several dm-crypt layers) and it's bad for SSD - dm-crypt still (as of 3.0) doesn't pass any TRIM requests through, leading to nasty write amplification effect, even more so with full disk given to dm-crypt+lvm.
While there's hope that SSD issues will be kinda-solved (with an optional security trade-off) in 3.1, it's still much easier to keep different distros or some decrypted-when-needed partitions with dm-crypt after lvm, so I've decided to go with the latter for new 120G SSD.
Also, such scheme allows to re-create encrypted lvs, issuing TRIM for the old ones, thus recycling the blocks even w/o support for this in dm-crypt.
Same as with previous initramfs, I've had simple "openct" module (udev there makes it even easier) in dracut to find inserted smartcard and use it to obtain encryption key, which is used once to decrypt the only partition on which everything resides.
Since the only goal of dracut is to find root and get-the-hell-outta-the-way, it won't even try to decrypt all the /var and /home stuff without serious ideological changes.
The problem is actually solved in generic distros by plymouth, which gets the password(s), caches it, and provides it to dracut and systemd (or whatever comes as the real "init"). I don't need splash, and actually hate it for hiding all the info that scrolls in its place, so plymouth is a no-go for me.

Having a hack to obtain and cache key for dracut by non-conventional means anyway, I just needed to pass it further to systemd, and since they share common /run tmpfs these days, it basically means not to rm it in dracut after use.

Luckily, system-wide password handling mechanism in systemd is well-documented and easily extensible beyond plymouth and default console prompt.

So whole key management in my system goes like this now:

  • dracut.cmdline: create udev rule to generate key.
  • dracut.udev.openct: find smartcard, run rule to generate and cache key in /run/initramfs.
  • dracut.udev.crypt: check for cached key or prompt for it (caching result), decrypt root, run systemd.
  • systemd: start post-dracut-crypt.path unit to monitor /run/systemd/ask-password for password prompts, along with default .path units for fallback prompts via wall/console.
  • systemd.udev: discover encrypted devices, create key requests.
  • systemd.post-dracut-crypt.path: start post-dracut-crypt.service to read cached passwords from /run/initramfs and use these to satisfy requests.
  • systemd.post-dracut-crypt-cleanup.service (after local-fs.target is activated): stop post-dracut-crypt.service, flush caches, generate new one-time keys for decrypted partitions.
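The agent part of this chain can be sketched roughly as follows. This is not the actual agent linked below, just a minimal illustration of the documented request/reply exchange: request files named ask.* with an [Ask] section appear in the directory, and the answer is a single datagram with "+" and the password sent to the unix socket named there. Function names are made up, the same password is sent to every pending request, and it assumes running as root, so no PolicyKit helper is involved.

```python
import configparser
import os
import socket
import time

ASK_DIR = "/run/systemd/ask-password"  # where systemd drops ask.* request files

def parse_request(text):
    """Parse ask.* file contents, return a dict or None if it's not a valid request."""
    parser = configparser.RawConfigParser()
    parser.optionxform = str  # keep key case as-is (Socket, NotAfter, ...)
    try:
        parser.read_string(text)
        if not parser.has_option("Ask", "Socket"):
            return None
        return {
            "socket": parser.get("Ask", "Socket"),
            "message": parser.get("Ask", "Message", fallback=""),
            "not_after": int(parser.get("Ask", "NotAfter", fallback="0")),
        }
    except (configparser.Error, ValueError):
        return None

def expired(request):
    """NotAfter is a CLOCK_MONOTONIC timestamp in microseconds, 0 means no timeout."""
    return bool(request["not_after"]) and time.monotonic() * 1e6 > request["not_after"]

def send_reply(socket_path, password):
    """Send "+<password>" as one datagram to the socket from the request."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
    try:
        sock.sendto(b"+" + password, socket_path)
    finally:
        sock.close()

def answer_pending(password, ask_dir=ASK_DIR):
    """Answer every pending, non-expired request, return the number of replies sent."""
    answered = 0
    for name in sorted(os.listdir(ask_dir)):
        if not name.startswith("ask."):
            continue
        try:
            with open(os.path.join(ask_dir, name)) as src:
                request = parse_request(src.read())
            if request is None or expired(request):
                continue
            send_reply(request["socket"], password)
        except OSError:
            continue  # request file or socket is gone already, nothing to answer
        answered += 1
    return answered
```

A .path unit with DirectoryNotEmpty= pointing at that directory can start a one-shot service that calls answer_pending() with the key cached under /run/initramfs, so no inotify loop is needed in the script itself.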
End result is passwordless boot with this new layout, which seems to be only possible to spoof by getting root during that process somehow, with altering unencrypted /boot to run some extra code and revert it back being the most obvious possibility.
It's kinda weird that there doesn't seem to be any caching in place already, surely not everyone with dm-crypt is using plymouth?

Most complicated piece here is probably the password agent (in python), which actually could've been simpler if I hadn't followed the proper guidelines and had thought a bit around them.

For example, the whole inotify handling thing (I've used it via ctypes) can be dropped, since a .path unit with DirectoryNotEmpty= activation condition is there already, PolicyKit authorization just isn't working at such an early stage, there doesn't seem to be much need to check request validity since sending replies to sockets is racy anyway, etc.
Still, a good exercise.

Python password agent for systemd. Unit files to start and stop it on demand.

Dec 29, 2010

Sane playback for online streaming video via stream dumping

I rarely watch footage from various conferences online, usually because I have some work to do and video takes much more dedicated time than the same thing just written on a webpage, making it basically a waste of time, but sometimes it's just fun.
Watching familiar "desktop linux complexity" holywar right on the stage of "Desktop on the Linux..." presentation of 27c3 (here's the dump, available atm, better one should probably appear in the Recordings section) certainly was, and since there are few other interesting topics on schedule (like DJB's talk about high-speed network security) and I have some extra time, I decided not to miss the fun.

Problem is, "watching stuff online" is even worse than just "watching stuff" - either you pay attention or you just miss it, so I set up recording as a sort of "fs-based cache", at the very least to watch the recorded streams right as they get written, being able to pause or rewind, should I need to do so.

Natural tool to do the job is mplayer, with its "-dumpstream" flag.
It works well up until some network (remote or local) error or just mplayer glitch, which seems to happen quite often.
That's when mplayer crashes with funny "Core dumped ;)" line and if you're watching the recorded stream atm, you'll certainly miss it at the time, noticing the fuckup when whatever you're watching ends abruptly and the real-time talk is already finished.
Somehow, I managed to forget about the issue and got bit by it soon enough.

So, mplayer needs to be wrapped in a while loop, but then you also need to give dump files unique names to keep mplayer from overwriting them, and actually do several such loops for several streams you're recording (different halls, different talks, same time), and then probably because of strain on the streaming servers mplayer tends to reconnect several times in a row, producing lots of small dumps, which aren't really good for anything, and you'd also like to get some feedback on what's happening, and... so on.

Well, I just needed a better tool, and it looks like there aren't many simple non-gui dumpers for video+audio streams and not many libs to connect to http video streams from python, existing one being vlc bindings, which probably aren't any better than mplayer, provided all I need is just to dump a stream to a file, without any reconnection or rotation or multi-stream handling mechanism.

To cut the story short I ended up writing a bit more complicated eventloop script to control several mplayer instances, aggregating (and marking each accordingly) their output, restarting failed ones, discarding failed dumps and making up sensible names for the produced files.
It was a quick ad-hoc hack, so I thought to implement it straight through signal handling and poll loop for the output, but thinking about all the async calls and state-tracking it'd involve I quickly reconsidered and just used twisted to shove all this mess under the rug, ending up with quick and simple 100-liner.
Script code, twisted is required.
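For a single stream, the restart-and-discard logic boils down to something like the sketch below. It's not the linked script: it uses plain blocking subprocess calls instead of twisted, and the names (record, dump_name, mplayer_cmd), the size threshold and the file naming scheme are assumptions made for the example.

```python
import os
import subprocess
import time

def dump_name(label, attempt, ext="dump"):
    """Name that is unique per stream label and attempt, and sorts by time."""
    return "%s.%s.%03d.%s" % (label, time.strftime("%Y%m%d_%H%M%S"), attempt, ext)

def mplayer_cmd(url, path):
    """Command line to dump a stream as-is into a file with mplayer."""
    return ["mplayer", "-dumpstream", "-dumpfile", path, url]

def record(url, label, make_cmd=mplayer_cmd, min_size=1 << 20, max_runs=None, retry_delay=1.0):
    """Re-run the dumper after each exit, keep only dumps of at least min_size bytes.

    Returns the list of kept files once max_runs attempts are made,
    or loops until interrupted if max_runs is None.
    """
    kept, attempt = [], 0
    while max_runs is None or attempt < max_runs:
        attempt += 1
        path = dump_name(label, attempt)
        code = subprocess.call(make_cmd(url, path))
        size = os.path.getsize(path) if os.path.exists(path) else 0
        if size < min_size:
            # Short dumps come from quick reconnects and aren't useful
            if os.path.exists(path):
                os.unlink(path)
            print("[%s] discarded %s (%d bytes, exit code %s)" % (label, path, size, code))
        else:
            kept.append(path)
            print("[%s] kept %s (%d bytes, exit code %s)" % (label, path, size, code))
        time.sleep(retry_delay)  # don't hammer the streaming server
    return kept
```

Several streams can be handled by running record() in one thread per stream, with the label telling their output lines apart.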

And now, back to the ongoing talks of day 3.

Aug 15, 2010

Home-brewed NAS gluster with sensible replication

Hardware

I'd say "every sufficiently advanced user is indistinguishable from a sysadmin" (yes, it's a play on famous Arthur C Clarke's quote), and it doesn't take much of "advanced" to come to a home-server idea.
And I bet the main purpose for most of these isn't a playground, p2p client or some http/ftp server - it's just a storage. Serving and updating the stored stuff is kinda secondary.
And I guess it's some sort of nature law that any storage runs outta free space sooner or later. And when this happens, just buying more space seems to be a better option than cleanup because a) "hey, there's dirt-cheap 2TB harddisks out there!" b) you just get used to having all that stuff at packet's reach.
Going down this road I found myself out of connectors on the motherboard (which is fairly spartan D510MO miniITX) and out of slots for extension cards (the only PCI is used by dual-port nic).
So I hooked up two harddisks via usb, but either the usb-sata controllers or usb's on the motherboard were faulty and controllers just hang with both leds on, vanishing from the system. Not that it's a problem - just mdraid'ed them into raid1 and when one fails like that, I just have to replug it and start raid recovery, never losing access to the data itself.
Then, to extend the storage capacity a bit further (and to provide a backup to that media content) I just bought +1 miniITX unit.
Now, I could've mounted two NFS'es from both units, but this approach has several disadvantages:
  • Two mounts instead of one. Ok, not a big deal by itself.
  • I'd have to manage free space on these by hand, shuffling subtrees between them.
  • I need replication for some subtrees, and that complicates the previous point a bit further.
  • Some sort of HA would be nice, so I'd be able to shutdown one replica and switch to using another automatically.
The obvious improvement would be to use some distributed network filesystem, and pondering on the possibilities I've decided to stick with the glusterfs due to its flexible yet very simple "layer cake" configuration model.
Oh, and the most important reason to set this whole thing up - it's just fun ;)

The Setup

Ok, what I have is:

  • Node1
    • physical storage (raid1) "disk11", 300G, old and fairly "valuable" data (well, of course it's not "valuable", since I can just re-download it all from p2p, but I'd hate to do that)
    • physical disk "disk12", 150G, same stuff as disk11
  • Node2
    • physical disk "disk2", 1.5T, blank, to-be-filled

What I want is one single network storage, with db1 (disk11 + disk12) data available from any node (replicated) and new stuff which won't fit onto this storage should be written to db2 (what's left of disk2).

With glusterfs there are several ways to do this:

Scenario 1: fully network-aware client.

That's actually the simplest scenario - glusterfsd.vol files on "nodes" should just export local disks and client configuration ties it all together.

Pros:

  • Fault tolerance. Client is fully aware of the storage hierarchy, so if one node with db1 is down, it will just use the other one.
  • If the bandwidth is better than disk i/o, reading from db1 can be potentially faster (dunno if glusterfs allows that, actually), but that's not the case, since "client" is one of the laptops and it's a slow wifi link.

Cons:

  • Write performance is bad - client has to push data to both nodes, and that's a big minus with my link.

Scenario 2: server-side replication.

Here, "nodes" export local disks for each other and gather local+remote db1 into cluster/replicate and then export this already-replicated volume. Client just ties db2 and one of the replicated-db1 together via nufa or distribute layer.

Pros:

  • Better performance.

Cons:

  • Single point of failure, not only for db2 (which is natural, since it's not replicated), but for db1 as well.

Scenario 3: server-side replication + fully-aware client.

db1 replicas are synced by "nodes" and client mounts all three volumes (2 x db1, 1 x db2) with either cluster/unify layer and nufa scheduler (listing both db1 replicas in "local-volume-name") or cluster/nufa.

That's the answer to obvious question I've asked myself after implementing scenario 2: "why not get rid of this single_point_of_failure just by using not single, but both replicated-db1 volumes in nufa?"
In this case, if node1 goes down, client won't even notice it! And if that happens to node2, files from db2 just disappear from hierarchy, but db1 will still remain fully-accessible.

But there is a problem: cluster/nufa has no support for multiple local-volume-name specs. cluster/unify has this support, but requires its ugly "namespace-volume" hack. The solution would be to aggregate both db1's into a distribute layer and use it as a single volume alongside db2.

With aforementioned physical layout this seems to be just the best all-around case.

Pros:

  • Best performance and network utilization.

Cons:

  • None?

Implementation

So, scenarios 2 and 3 in terms of glusterfs, with the omission of different performance, lock layers and a few options, for the sake of clarity:

node1 glusterfsd.vol:

## db1: replicated node1/node2
volume local-db1
    type storage/posix
    option directory /srv/storage/db1
end-volume
# No client-caches here, because ops should already come aggregated
# from the client, and the link between servers is much fatter than the client's
volume node2-db1
    type protocol/client
    option remote-host node2
    option remote-subvolume local-db1
end-volume
volume composite-db1
    type cluster/replicate
    subvolumes local-db1 node2-db1
end-volume
## db: linear (nufa) db1 + db2
## export: local-db1 (for peers), composite-db1 (for clients)
volume export
    type protocol/server
    subvolumes local-db1 composite-db1
end-volume

node2 glusterfsd.vol:

## db1: replicated node1/node2
volume local-db1
    type storage/posix
    option directory /srv/storage/db1
end-volume
# No client-caches here, because ops should already come aggregated
# from the client, and the link between servers is much fatter than the client's
volume node1-db1
    type protocol/client
    option remote-host node1
    option remote-subvolume local-db1
end-volume
volume composite-db1
    type cluster/replicate
    subvolumes local-db1 node1-db1
end-volume
## db2: node2
volume db2
    type storage/posix
    option directory /srv/storage/db2
end-volume
## db: linear (nufa) db1 + db2
## export: local-db1 (for peers), composite-db1 and db2 (for clients)
volume export
    type protocol/server
    subvolumes local-db1 composite-db1 db2
end-volume

client (replicated to both nodes):

volume node1-db1
    type protocol/client
    option remote-host node1
    option remote-subvolume composite-db1
end-volume
volume node2-db1
    type protocol/client
    option remote-host node2
    option remote-subvolume composite-db1
end-volume
volume db1
    type cluster/distribute
    subvolumes node1-db1 node2-db1
end-volume
volume db2
    type protocol/client
    option remote-host node2
    option remote-subvolume db2
end-volume
volume db
    type cluster/nufa
    option local-volume-name db1
    subvolumes db1 db2
end-volume
Actually there's one more scenario I thought of for non-local clients - same as 2, but pushing nufa into glusterfsd.vol on "nodes", thus making client mount single unified volume on a single host via single port in a single connection.
Not that I really need this one, since all I need to mount from external networks is just music 99.9% of time, and NFS + FS-Cache offer more advantages there, although I might resort to it in the future, when music won't fit to db1 anymore (doubt that'll happen anytime soon).
P.S.
Configs are fine, but the most important thing for setting up glusterfs are these lines:
node# /usr/sbin/glusterfsd --debug --volfile=/etc/glusterfs/glusterfsd.vol
client# /usr/sbin/glusterfs --debug --volfile-server=somenode /mnt/tmp

Feb 17, 2010

Listening to music over the 'net with authentication and cache

Having seen people really obsessed with the music, I don't consider myself to be much into it, yet I've managed to accumulate more than 70G of it, and counting. That's probably because I don't like to listen to something on a loop over and over, so, naturally, it's quite a bother to either keep the collection on every machine I use or copy the parts of it just to listen and replace.

Ideal solution for me is to mount whole hoard right from home server, and mounting it over the internet means that I need some kind of authentication.
Since I also use it at work, encryption is also nice, so I can always pass this bandwidth as something work-friendly and really necessary, should it arouse any questions.
And while bandwidth at work is pretty much unlimited, it's still controlled, so I wouldn't like to overuse it too much, and listening to oggs, mp3 and flacs for the whole work-day can generate traffic of 500-800 megs, and that's quite excessive to that end, in my own estimation.

The easiest way for me was trusty sshfs - it's got the best authentication, nice performance and encryption off-the-shelf with just one simple command. Problem here is the last aforementioned point - sshfs would use as much bandwidth as possible, caching content only temporarily in volatile RAM.

Persistent caching seems to be quite easy to implement in userspace with either fuse layer over network filesystem or something even simpler (and more hacky), like aufs and inotify, catching IN_OPEN events and pulling files in question to intermediate layer of fs-union.

Another thing I've considered was fs-cache in-kernel mechanism, which appeared in the main tree around 2.6.30, but the bad thing about it was that while being quite efficient, it only worked for NFS or AFS.
Second was clearly excessive for my purposes, and the first one I've come to hate for being extremely unreliable and limiting. In fact, NFS never gave me anything but trouble in the past, yet since I haven't found any decent implementations of the above ideas, I'd decided to give it (plus fs-cache) a try.
Setting up nfs server is no harder than sharing dir on windows - just write a line to /etc/exports and fire up nfs initscript. Since nfs4 seems superior to nfs3 in every way, I've used that version.
Trickier part is authentication. With nfs' "accept-all" auth model and kerberos being out of question, it has to be implemented on some transport layer in the middle.
Luckily, ssh is always there to provide a secure authenticated channel and nfs actually supports tcp these days. So the idea is to start nfs on localhost on server and use ssh tunnel to connect to it from the client.

Setting up tunnel was quite straightforward, although I've put together a simple script to avoid re-typing the whole thing and to make sure there aren't any dead ssh processes laying around.

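#!/bin/sh
# Usage: ssh_tunnel <forward-spec> <destination> [extra ssh options]
# Lock/pid file name is derived from the tunnel spec and ssh destination
PID="/tmp/.$(basename "$0").$(echo "$1.$2" | md5sum | cut -b-5)"
touch "$PID"
# Bail out early if a running tunnel already holds the lock
flock -n 3 3<"$PID" || exit 0
exec 3>"$PID"
# ssh inherits fd 3, so the lock stays held for as long as the tunnel is alive
( flock -n 3 || exit 0
  exec ssh\
   -oControlPath=none\
   -oControlMaster=no\
   -oServerAliveInterval=3\
   -oServerAliveCountMax=5\
   -oConnectTimeout=5\
   -qyTnN $3 -L "$1" "$2" ) &
# Subshell exec's ssh, so its pid is the pid of the ssh process
echo $! >&3
exit 0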

That way, ssh process is daemonized right away. Simple locking is also implemented, based on tunnel and ssh destination, so it might be put as a cronjob (just like "ssh_tunnel 2049:127.0.0.1:2049 user@remote") to keep the link alive.

Then I've put a line like this to /etc/exports:

/var/export/leech 127.0.0.1/32(ro,async,no_subtree_check,insecure)
...and tried to "mount -t nfs4 localhost:/ /mnt/music" on the remote.
Guess what? "No such file or dir" error ;(
Ok, nfs3-way to "mount -t nfs4 localhost:/var/export/leech /mnt/music" doesn't work either. No indication of why it is whatsoever.

Then it gets even better - "mount -t nfs localhost:/var/export/leech /mnt/music" actually works (locally, since nfs3 defaults to udp).
Completely useless errors and nothing on the issue in manpages were quite weird, but perhaps I haven't looked at it well enough.

Gotcha was in the fact that it wasn't allowed to mount nfs4 root, so tweaking exports file like this...

/var/export/leech 127.0.0.1/32(ro,async,no_subtree_check,insecure,fsid=0)
/var/export/leech/music 127.0.0.1/32(ro,async,no_subtree_check,insecure,fsid=1)

...and "mount -t nfs4 localhost:/music /mnt/music" actually solved the issue.

Why can't I use one-line exports and why the fuck it's not on the first (or any!) line of manpage escapes me completely, but at least it works now even from remote. Hallelujah.

Next thing is the cache layer. Luckily, it doesn't look as crappy as nfs and tying them together can be done with a single mount parameter. One extra thing needed, aside from the kernel part, here, is cachefilesd.
Strange thing it's not in gentoo portage yet (since it's kinda necessary for kernel mechanism and quite aged already), but there's an ebuild in b.g.o (now mirrored to my overlay, as well).
Setting it up is even simpler.
Config is well-documented and consists of five lines only, the only relevant of which is the path to fs-backend, oh, and the last one seems to need user_xattr support enabled.

fstab lines for me were these:

/dev/dump/nfscache /var/fscache ext4 user_xattr
localhost:/music /mnt/music nfs4 ro,nodev,noexec,intr,noauto,user,fsc

First two days got me 800+ megs in cache and from there it was even better bandwidth-wise, so, all-in-all, this nfs circus was worth it.

Another upside of nfs was that I could easily share it with workmates just by binding ssh tunnel endpoint to a non-local interface - all that's needed from them is to issue the mount command, although I didn't come to like the implementation any more than I did before.
Wonder if it's just me, but whatever...