Mar 21, 2017

Running glusterfs in a user namespace (uid-mapped container)

Traditionally glusterd (the glusterfs storage node daemon) runs as root without any kind of namespacing, and that's suboptimal for two main reasons:

  • Grossly elevated privileges (it's root) just for using the network and storing files.
  • Inconvenient to manage in the root fs/namespace.

Apart from being a historical thing, glusterd uses its privileges for three things that I know of:

  • Set appropriate uid/gid on stored files.
  • setxattr() with "trusted" namespace for all kinds of glusterfs xattrs.
  • Maybe running nfsd? Not sure about this one, didn't use its nfs access.

For my purposes, only the first two are useful, and both can be easily satisfied in a non-uid-mapped container, e.g. systemd-nspawn without -U.

With user_namespaces(7), the first requirement is also satisfied, as chown works for the pseudo-root user inside the namespace, but the second one will never work without some kind of namespace-private fs or xattr-mapping namespace.

"user" xattr namespace works fine there though, so rather obvious fix is to make glusterd use those instead, and it has no obvious downsides, at least if backing fs is used only by glusterd.

xattr names are unfortunately used quite liberally in the gluster codebase, without any macro for the prefix, but finding all "trusted" occurrences outside of tests/docs with grep is rather easy, and there seem to be no caveats there either.
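A rough grep along these lines should turn up the relevant spots (file patterns and excluded dirs here are my assumptions about the tree layout, not the actual commands used):

```shell
# list source files that mention the "trusted." xattr prefix, skipping tests/docs
find . \( -name '*.c' -o -name '*.h' \) -print |
  grep -v -e '/tests/' -e '/doc/' |
  xargs grep -l '"trusted' 2>/dev/null
```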

Would be cool to see something like that upstream eventually.

It won't work unless all nodes run the patched glusterfs version though, as non-patched nodes will keep sending SETXATTR/XATTROP for trusted.* xattrs.

Two extra scripts that can be useful with this patch and existing setups:

The first one copies trusted.* xattrs to user.*, and the second one sets the upper 16 bits of uid/gid values to the systemd-nspawn container id.

Both allow passing an fs from an old root glusterd to a user-xattr-patched glusterd inside a uid-mapped container (i.e. bind-mounting it there) without losing anything. Both operations are also reversible - just nuke the user.* stuff or the upper part of uid/gid values to revert everything back.
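For illustration, the xattr-copying part might look something like this sketch (the brick path and the exact getfattr/setfattr plumbing are my assumptions, not the actual script):

```shell
#!/bin/sh
# Sketch: mirror all trusted.* xattrs into user.* on a glusterfs brick.
# Must run as root - reading trusted.* xattrs needs CAP_SYS_ADMIN.
root=${1:-/srv/glusterfs-brick}  # hypothetical brick path

find "$root" | while read -r p; do
  # dump trusted.* xattrs in hex, strip the "trusted." prefix...
  getfattr -h --absolute-names -e hex -d -m '^trusted\.' "$p" 2>/dev/null |
    sed -n 's/^trusted\.//p' |
    # ...and re-set each one under the user.* namespace
    while IFS='=' read -r name value; do
      setfattr -h -n "user.$name" -v "$value" "$p"
    done
done
```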

One more random bit of ad-hoc trivia - use getfattr -Rd -m '.*' /srv/glusterfs-stuff
(getfattr without the -m '.*' hack hides trusted.* xattrs)

Note that I didn't test this trick extensively (yet?), and only use simple distribute-replicate configuration here anyway, so probably a bad idea to run something like this blindly in an important and complicated production setup.

Also wow, it's been 7 years since I've written here about glusterfs last, time (is made of) flies :)

Dec 07, 2010

MooseFS usage experiences

It's been three months since I've replaced gluster with moose, and I've had a few questions about its performance so far.
Info on the subject in the internets is a bit scarce, so here goes my case. Keep in mind however that it's not a stress-benchmark of any kind and actually rather degenerate use-case, since loads aren't pushing hardware to any limits.

Guess I can say that it's quite remarkable in a way that it's really unremarkable - I just kinda forgot it's there, which is probably the best thing one can expect from a filesystem.

My setup is 4 physical nodes at most, with 1.3 TiB of data in fairly large files (3361/52862 dirs/files, which works out to about 25 MiB per file on average).
Spanned fs hosts a storage for distfiles, media content, vm images and pretty much anything that comes in the compressed form and worth keeping for further usage. Hence the access to files is highly sequential in most cases (as in reading gzip, listening to mp3, watching a movie, etc).
OS on the nodes is a gentoo/exherbo/fedora mix and is subject to constant software updates, breakages and rebuilds. Naturally, mfs has proven to be quite resilient in these conditions, since it doesn't depend on boost or other volatile crap and just consists of several binaries and configuration files, which work fine even with defaults right out of the box.
One node is a "master", eating 100 MiB RAM (115 VSZ) on a few-month average (according to atop logs). Others have metalogger slaves which cost virtually nothing (<3 MiB VSZ), so it's not a big deal to keep metadata fully replicated just in case.
Chunkservers have 500 GiB - 3 TiB space on btrfs. These usually hang on 10 MiB RAM, occasional 50-100 MiB in VSZ, though it's not swapped-out, just unused.
Cpu usage for each is negligible, even though mfsmaster + mfsmount + mfschunkserver node is Atom D510 on miniITX board.
mfsmount maintains persistent connection to master and on-demand to chunkservers.
It doesn't seem to mind if some of them are down though, so I guess it's perfectly possible to upload files via mfsmount to one (the only accessible) node and they'll be replicated to others from there (more details on that below), although I'm unsure what will happen when you try to retrieve chunks stored exclusively on inaccessible nodes (guess it's easy enough to test, anyway).
I use only one mfsmount on the same machine as master, and re-export (mostly for reading) it over NFS, SFTP, WebDAV and plain HTTP to other machines.
Re-export is there because that way I don't need access to all machines in cluster, which can be in a separate network (i.e. if I access fs from work), plus stuff like NFS comes out of the box (no need for separate client) and have a nice FS-Cache support, which saves a lot of bandwidth, webdav/sftp works for ms-os machines as well and server-based replication saves more precious bandwidth all by itself.
FS bandwidth in my case is a constant ~1 MiB read 24/7 plus any on-demand reading at speeds which are usually slower than any single hdd (over slower network links like 100 Mbps LAN and WiFi), and using only a few threads as well, so I'm afraid I can't give any real-world stress results here.
On local bulk-copy operations to/from the mfs mount though, disk always seems to be the bottleneck, with all other parameters far below any possible limitations, but in my case these are simple "wd green" low-speed/noise high-capacity disks or seagate/hitachi disks with the AAM threshold set to the lowest level via "hdparm -M" (works well for sound, but I never really cared about how it affects speed to check).

Chunkservers' storage consists of indexed (AA/AABCD...) paths, according to chunk names, which can be easily retrieved from the master. They rely on fs scanning to determine which chunks they have, so I've been able to successfully merge two nodes into one w/o storing the chunks on different filesystems/paths (which is also perfectly possible).

Chunkservers talk to each other on a p2p basis (that doesn't mean they don't need a connection to the master, but bandwidth there doesn't seem to be an issue at all) to maintain the requested replication goal and auto-balance disk space between themselves, so the free-space percentage tries to be equal on all nodes (w/o compromising the goal, of course); with goal=2 and 4 nodes I have 30% space usage on the backend-fs of both the 500 GiB node and the 3 TiB one.

Balancing seems to be managed by every chunkserver in the background (not quite sure if I've seen it in any docs, but there's a "chunk testing" process which seems to imply that, and can be tuned btw), according to info about chunks and other currently-available nodes' space utilization from the master.

Hence, adding/removing nodes is a bliss - just turn it on/off; no configuration changes on other nodes are necessary - the master sees the change (new/lost connection) and all the chunkservers start relocating/fetching chunks to restore the balance and maintain the requested goal. In a few hours everything is balanced again.

The whole approach seems superior to dumb round-robin of the chunks on creation, or rehashing and relocating every one of them on single node failure, and suggests that it might be easy to implement a custom replication and balancing scheme just by rsync'ing chunks between nodes as necessary (i.e. to make the most of a small ssd buffer, putting the most-demanded files' chunks there).
And indeed I've utilized that feature twice to merge different nodes and filesystems, although the latter is not really necessary, since a chunkserver can work with several storage paths on different filesystems; it just seems irrational to keep several btrfs trees these days, as they can even span multiple devices.
But the best part, enabling me not to look further for alternatives, is the simple fact that I've yet to see any problem in the stability department - it still just works. mfsmount never refused to give or receive a file, node daemons never crashed or came back up with a weird inconsistency (which I don't think is easy to produce with such simple setup/design, anyway).
The connection between nodes has failed quite often - sometimes my NIC/switch/cables went to 30% packet loss for no apparent reason, sometimes I messed up openswan and ipsec or some other network setup, or shut down and hard-rebooted the nodes as necessary, but such failures were always unnoticeable here, without any need to restart anything on the mfs level - chunkservers just reconnect, forget obsolete chunks and keep going about their business.
Well, there *was* one exception: one time I've managed to hard-reboot a master machine and noticed that mfsmaster has failed to start.
The problem was a missing metadata.mfs file in /var/lib, which I believe is created on mfsmaster stop and checkpointed every hour to a .back file, so, knowing there were no changes to the fs in the last few minutes, I just removed the .back suffix and everything started just fine.
Doing it The Right Way would've involved stopping one of the metalogger nodes (or signaling it somehow) and retrieving this file from there, or just starting the master on that node and updating the mfsmaster ns entry, since they're identical.
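That fallback can be scripted in a couple of lines - a hedged sketch, assuming the usual /var/lib/mfs layout and filenames:

```shell
# Sketch: before starting mfsmaster after a hard reboot, fall back to the
# hourly .back checkpoint if the clean-shutdown metadata.mfs is missing.
mfs_restore_meta() {
    dir=${1:-/var/lib/mfs}                        # assumed data dir
    [ -e "$dir/metadata.mfs" ] && return 0        # clean shutdown, nothing to do
    [ -e "$dir/metadata.mfs.back" ] || return 1   # no checkpoint either - fetch one from a metalogger
    mv "$dir/metadata.mfs.back" "$dir/metadata.mfs"
}
```

Only safe when you know nothing changed since the last checkpoint, as in the case above.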

Of course, it's just a commodity hardware and lighter loads, but it's still way above other stuff I've tried here in virtually every aspect, so thumbs up for moose.

Sep 09, 2010

Distributed fault-tolerant fs take 2: MooseFS

Ok, almost one month of glusterfs was too much for me to handle. That was an epic fail ;)

Random errors on start (yeah, just restart the nodes a few times and it'll be fine) and during operation (same ESTALE, EIOs for the whole mount, half the files just vanishing) seem to be the norm for it. I mean, that's under perfectly sane and calm conditions - everything works, links are stable.
Slightly more complicated configurations like server-side replication seem to be the cause of these, sometimes to the point where the stuff just gives ESTALE in 100% of cases right from the start w/o any reason I can comprehend. And adding a third node to the system just made things worse, and the configuration files are even more scary.

Well, maybe I'm just out of luck or too brain-dead for it, whatever.

So, moving on, I've tried (although briefly) ceph.

Being in the mainline kernel, and not just the staging part, I'd expected it to be much less work-in-progress, but as it is, it's very raw, to the point that the x86_64 monitor daemon just crashes upon receiving data from plain x86. The interface is a bunch of simple shell scripts, operation is fairly opaque, and the whole thing is built on such crap as boost.

Managed to get it running with two nodes, but it feels like the end of the world - one more kick and it all falls to pieces. Confirmed by the reports all over the mailing list and #ceph.

In-kernel and seemingly fast is a good mix though, so I may get back to it eventually, but now I'd rather prefer to settle on something that actually works.

Next thing in my sight was tahoe-lafs, but it still lacks a normal posix-fs interface layer, the sftp interface being totally unusable on 1.8.0c3 - no permissions, cp -R fails w/ I/O error, displayed data is inconsistent even with locally-made changes, and so on. A pity, the whole system design looks very cool, with raid5-like "parity" instead of plain chunk replication, and it's python!

Thus I ended up with MooseFS.

First thing to note here is the incredibly simple and yet infinitely more powerful interface that probably sold me on the fs right from the start.
None of this configuration-layers hell of gluster, just a line in hosts (so there's no need to tweak configs at all!) plus a few about what to export (subtrees-to-nets, nfs style) and where to put chunks (any fs as a simple backend key-value storage), and that's about it.

Replication? Piece of cake, and it's configured on a per-tree basis, so important or compact stuff can have one replication "goal" while some heavy trash in the neighbor path has no replication at all. No chance of anything like this with gluster, and it's not even documented for ceph.

Performance is totally I/O and network bound (which is totally not-the-case with tahoe, for instance), so no complaints here as well.

One more amazing thing is how simple and transparent it is:

fraggod@anathema:~% mfsgetgoal tmp/db/softCore/_nix/os/systemrescuecd-x86-1.5.8.iso
tmp/db/softCore/_nix/os/systemrescuecd-x86-1.5.8.iso: 2
fraggod@anathema:~% mfsfileinfo tmp/db/softCore/_nix/os/systemrescuecd-x86-1.5.8.iso
 chunk 0: 000000000000CE78_00000001 / (id:52856 ver:1)
  copy 1:
  copy 2:
 chunk 1: 000000000000CE79_00000001 / (id:52857 ver:1)
  copy 1:
  copy 2:
 chunk 2: 000000000000CE7A_00000001 / (id:52858 ver:1)
  copy 1:
  copy 2:
 chunk 3: 000000000000CE7B_00000001 / (id:52859 ver:1)
  copy 1:
  copy 2:
 chunk 4: 000000000000CE7C_00000001 / (id:52860 ver:1)
  copy 1:
  copy 2:
fraggod@anathema:~% mfsdirinfo tmp/db/softCore/_nix/os
 inodes: 12
 directories: 1
 files: 11
 chunks: 175
 length: 11532174263
 size: 11533462528
 realsize: 23066925056

And if that's not enough, there are even cow snapshots, a trash bin with a customizable grace period, and special attributes for file caching and ownership, all fully documented along with the architectural details in manpages and on the project site.
Code is plain C, no shitload of deps like boost and lib*whatevermagic*, and it's really lite. Whole thing feels like a simple and solid design, not some polished turd of a *certified professionals*.
Yes, it's not truly-scalable, as there's a master host (with optional metalogger failover backups) holding fs metadata, but there's no chance it'll be a bottleneck in my setup, and compared to the "no-way" bottlenecks of other stuff, I'd rather stick with this one.

MooseFS has yet to pass the trial of time on my makeshift "cluster", yet none of the other setups went (even remotely) as smooth as this one so far, thus I feel pretty optimistic about it.

Aug 15, 2010

Home-brewed NAS gluster with sensible replication


I'd say "every sufficiently advanced user is indistinguishable from a sysadmin" (yes, it's a play on famous Arthur C Clarke's quote), and it doesn't take much of "advanced" to come to a home-server idea.
And I bet the main purpose for most of these isn't a playground, p2p client or some http/ftp server - it's just storage. Serving and updating the stored stuff is kinda secondary.
And I guess it's some sort of law of nature that any storage runs outta free space sooner or later. And when this happens, just buying more space seems a better option than cleanup, because a) "hey, there's dirt-cheap 2TB harddisks out there!" b) you just get used to having all that stuff at packet's reach.
Going down this road I found myself out of connectors on the motherboard (which is a fairly spartan D510MO miniITX) and out of slots for extension cards (the only PCI slot is used by a dual-port nic).
So I hooked up two harddisks via usb, but either the usb-sata controllers or the usb ports on the motherboard were faulty, and the controllers just hang with both leds on, vanishing from the system. Not that it's a problem - I just mdraid'ed them into raid1, and when one fails like that, I just have to replug it and start raid recovery, never losing access to the data itself.
Then, to extend the storage capacity a bit further (and to provide a backup to that media content) I just bought +1 miniITX unit.
Now, I could've mounted two NFS'es from both units, but this approach has several disadvantages:
  • Two mounts instead of one. Ok, not a big deal by itself.
  • I'd have to manage free space on these by hand, shuffling subtrees between them.
  • I need replication for some subtrees, and that complicates the previous point a bit further.
  • Some sort of HA would be nice, so I'd be able to shutdown one replica and switch to using another automatically.
The obvious improvement would be to use some distributed network filesystem, and pondering the possibilities I've decided to stick with glusterfs due to its flexible yet very simple "layer cake" configuration model.
Oh, and the most important reason to set this whole thing up - it's just fun ;)

The Setup

Ok, what I have is:

  • Node1
    • physical storage (raid1) "disk11", 300G, old and fairly "valuable" data (well, of course it's not "valuable", since I can just re-download it all from p2p, but I'd hate to do that)
    • physical disk "disk12", 150G, same stuff as disk11
  • Node2
    • physical disk "disk2", 1.5T, blank, to-be-filled

What I want is one single network storage, with db1 (disk11 + disk12) data available from any node (replicated), while new stuff which won't fit onto this storage should be written to db2 (what's left of disk2).

With glusterfs there are several ways to do this:

Scenario 1: fully network-aware client.

That's actually the simplest scenario - glusterfsd.vol files on the "nodes" should just export local disks, and the client configuration ties it all together.


Pros:

  • Fault tolerance. Client is fully aware of the storage hierarchy, so if one node with db1 is down, it will just use the other one.
  • If the bandwidth is better than disk i/o, reading from db1 can be potentially faster (dunno if glusterfs allows that, actually), but that's not the case here, since the "client" is one of the laptops and it's a slow wifi link.

Cons:

  • Write performance is bad - the client has to push data to both nodes, and that's a big minus with my link.

Scenario 2: server-side replication.

Here, "nodes" export local disks for each other, gather local+remote db1 into cluster/replicate, and then export this already-replicated volume. The client just ties db2 and one of the replicated-db1 volumes together via a nufa or distribute layer.


Pros:

  • Better performance.

Cons:

  • Single point of failure, not only for db2 (which is natural, since it's not replicated), but for db1 as well.

Scenario 3: server-side replication + fully-aware client.

db1 replicas are synced by the "nodes" and the client mounts all three volumes (2 x db1, 1 x db2) with either a cluster/unify layer and nufa scheduler (listing both db1 replicas in "local-volume-name") or cluster/nufa.

That's the answer to the obvious question I asked myself after implementing scenario 2: "why not get rid of this single point of failure just by using not one, but both replicated-db1 volumes in nufa?"
In this case, if node1 goes down, the client won't even notice it! And if that happens to node2, files from db2 just disappear from the hierarchy, but db1 will remain fully accessible.

But there is a problem: cluster/nufa has no support for multiple local-volume-name specs. cluster/unify has this support, but requires its ugly "namespace-volume" hack. The solution is to aggregate both db1's into a distribute layer and use that as a single volume alongside db2.

With the aforementioned physical layout this seems to be the best all-around case.


Pros:

  • Best performance and network utilization.

Cons:

  • None?


So, scenarios 2 and 3 in terms of glusterfs, with the omission of different performance, lock layers and a few options, for the sake of clarity:

node1 glusterfsd.vol:

## db1: replicated node1/node2
volume local-db1
    type storage/posix
    option directory /srv/storage/db1
# No client-caches here, because ops should already come aggregated
# from the client, and the link between servers is much fatter than the client's
volume node2-db1
    type protocol/client
    option remote-host node2
    option remote-subvolume local-db1
volume composite-db1
    type cluster/replicate
    subvolumes local-db1 node2-db1
## db: linear (nufa) db1 + db2
## export: local-db1 (for peers), composite-db1 (for clients)
volume export
    type protocol/server
    subvolumes local-db1 composite-db1

node2 glusterfsd.vol:

## db1: replicated node1/node2
volume local-db1
    type storage/posix
    option directory /srv/storage/db1
# No client-caches here, because ops should already come aggregated
# from the client, and the link between servers is much fatter than the client's
volume node1-db1
    type protocol/client
    option remote-host node1
    option remote-subvolume local-db1
volume composite-db1
    type cluster/replicate
    subvolumes local-db1 node1-db1
## db2: node2
volume db2
    type storage/posix
    option directory /srv/storage/db2
## db: linear (nufa) db1 + db2
## export: local-db1 (for peers), composite-db1 (for clients)
volume export
    type protocol/server
    subvolumes local-db1 composite-db1

client (replicated to both nodes):

volume node1-db1
    type protocol/client
    option remote-host node1
    option remote-subvolume composite-db1
volume node2-db1
    type protocol/client
    option remote-host node2
    option remote-subvolume composite-db1
volume db1
    type cluster/distribute
    subvolumes node1-db1 node2-db1
volume db2
    type protocol/client
    option remote-host node2
    option remote-subvolume db2
volume db
    type cluster/nufa
    option local-volume-name db1
    subvolumes db1 db2
Actually there's one more scenario I thought of for non-local clients - same as 2, but pushing nufa into glusterfsd.vol on the "nodes", thus making the client mount a single unified volume on a single host via a single port in a single connection.
Not that I really need this one, since all I need to mount from external networks is just music 99.9% of the time, and NFS + FS-Cache offer more advantages there, although I might resort to it in the future, when the music won't fit into db1 anymore (doubt that'll happen anytime soon).
The configs are fine, but the most important things for setting up glusterfs are these lines:
node# /usr/sbin/glusterfsd --debug --volfile=/etc/glusterfs/glusterfsd.vol
client# /usr/sbin/glusterfs --debug --volfile-server=somenode /mnt/tmp

Feb 17, 2010

Listening to music over the 'net with authentication and cache

Having seen people really obsessed with music, I don't consider myself to be much into it, yet I've managed to accumulate more than 70G of it, and counting. That's probably because I don't like listening to something on a loop over and over, so, naturally, it's quite a bother to either keep the collection on every machine I use or copy parts of it just to listen and replace.

The ideal solution for me is to mount the whole hoard right from the home server, and mounting it over the internet means that I need some kind of authentication.
Since I also use it at work, encryption is also nice, so I can always pass this bandwidth off as something work-friendly and really necessary, should it arouse any questions.
And while bandwidth at work is pretty much unlimited, it's still controlled, so I wouldn't like to overuse it too much, and listening to oggs, mp3s and flacs for the whole work-day can generate 500-800 megs of traffic, which is quite excessive to that end, in my own estimation.

The easiest way for me was trusty sshfs - it's got the best authentication, nice performance and encryption off-the-shelf with just one simple command. The problem here is the last aforementioned point - sshfs generates as much bandwidth as possible, caching content only temporarily in volatile RAM.

Persistent caching seems quite easy to implement in userspace with either a fuse layer over a network filesystem or something even simpler (and more hacky), like aufs and inotify, catching IN_OPEN events and pulling the files in question into the intermediate layer of the fs-union.

Another thing I've considered was the in-kernel fs-cache mechanism, which has been in the main tree since around 2.6.30, but the bad thing about it was that, while being quite efficient, it only worked for NFS or AFS.
The second was clearly excessive for my purposes, and the first one I've come to hate for being extremely unreliable and limiting. In fact, NFS never gave me anything but trouble in the past, yet since I haven't found any decent implementations of the above ideas, I decided to give it (plus fs-cache) a try.
Setting up an nfs server is no harder than sharing a dir on windows - just write a line to /etc/exports and fire up the nfs initscript. Since nfs4 seems superior to nfs3 in every way, I've used that version.
The trickier part is authentication. With nfs' "accept-all" auth model and kerberos being out of the question, it has to be implemented on some transport layer in the middle.
Luckily, ssh is always there to provide a secure authenticated channel, and nfs actually supports tcp these days. So the idea is to start nfs on localhost on the server and use an ssh tunnel to connect to it from the client.

Setting up the tunnel was quite straightforward, although I've put together a simple script to avoid re-typing the whole thing and to make sure there aren't any dead ssh processes lying around.

#!/bin/sh
# ssh_tunnel <local-forward-spec> <user@host> [extra ssh options]
PID="/tmp/.$(basename "$0").$(echo "$1.$2" | md5sum | cut -b-5)"
touch "$PID"
# probe: if a previous instance still holds the lock, bail out
flock -n 3 3<"$PID" || exit 0
# reopen the lockfile for writing; the subshell and ssh inherit fd 3,
# so the lock stays held for as long as the tunnel process lives
exec 3>"$PID"
( flock -n 3 || exit 0
  exec ssh -qyTnN $3 -L "$1" "$2" ) &
echo $! >&3
exit 0

That way, the ssh process is daemonized right away. Simple locking is also implemented, based on the tunnel and ssh destination, so it might be put in a cronjob (just like "ssh_tunnel 2049: user@remote") to keep the link alive.
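In a crontab that might look something like this (the tunnel spec, script path and host are illustrative, not my actual entry):

```
# keep the nfs tunnel alive, re-establishing it if the ssh process died
* * * * *  /usr/local/bin/ssh_tunnel 2049:localhost:2049 user@remote
```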

Then I've put a line like this to /etc/exports:

...and tried to "mount -t nfs4 localhost:/ /mnt/music" on the remote.
Guess what? "No such file or dir" error ;(
Ok, the nfs3-style "mount -t nfs4 localhost:/var/export/leech /mnt/music" doesn't work either. No indication whatsoever of why.

Then it gets even better - "mount -t nfs localhost:/var/export/leech /mnt/music" actually works (locally, since nfs3 defaults to udp).
Completely useless errors and nothing on the issue in the manpages was quite weird, but perhaps I haven't looked at them well enough.

The gotcha was in the fact that it wasn't allowed to mount the nfs4 root, so tweaking the exports file like this...


...and "mount -t nfs4 localhost:/music /mnt/music" actually solved the issue.

Why I can't use one-line exports, and why the fuck it's not on the first (or any!) line of the manpage, escapes me completely, but at least it works now, even from remote. Hallelujah.
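For reference, a working nfs4 exports layout would be along these lines (paths and the client network are examples, not my actual config):

```
# /etc/exports - nfs4 pseudo-root (fsid=0) with the shared tree under it,
# e.g. bind-mounted there; clients then mount paths relative to that root
/var/export        127.0.0.1/32(ro,fsid=0,insecure,no_subtree_check)
/var/export/music  127.0.0.1/32(ro,nohide,insecure,no_subtree_check)
```

With that, "mount -t nfs4 localhost:/music /mnt/music" resolves /music against the fsid=0 root.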

Next thing is the cache layer. Luckily, it doesn't look as crappy as nfs, and tying them together can be done with a single mount parameter. One extra thing needed here, aside from the kernel part, is cachefilesd.
Strange that it's not in gentoo portage yet (since it's kinda necessary for the kernel mechanism and quite aged already), but there's an ebuild on b.g.o (now mirrored to my overlay as well).
Setting it up is even simpler.
The config is well-documented and consists of five lines only, the only relevant one being the path to the fs backend; oh, and the last one seems to need user_xattr support enabled.
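For illustration, a minimal /etc/cachefilesd.conf might look like this (the values are roughly the documented defaults, not necessarily my settings):

```
dir /var/fscache
tag mycache
brun 10%
bcull 7%
bstop 3%
```

The brun/bcull/bstop percentages control when culling of old cache objects starts and stops, based on free space in the backing fs.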

fstab lines for me were these:

/dev/dump/nfscache /var/fscache ext4 user_xattr
localhost:/music /mnt/music nfs4 ro,nodev,noexec,intr,noauto,user,fsc

The first two days got me 800+ megs in the cache, and from there it got even better bandwidth-wise, so, all in all, this nfs circus was worth it.

Another upside of nfs was that I could easily share it with workmates just by binding the ssh tunnel endpoint to a non-local interface - all that's needed from them is to issue the mount command, although I didn't come to like the implementation any more than I did before.
Wonder if it's just me, but whatever...