Jul 16, 2014

(yet another) Dynamic DNS thing for tinydns (djbdns)

Tried to find a simple script to update tinydns (part of djbdns) zones that'd be better than ssh dns_update@remote_host update.sh, but failed - they all seem to be hacky php scripts, doomed to run behind httpds, send passwords in urls, query random "myip" hosts or something like that.

What I want instead is something that won't be making http, tls or ssh connections (and dragging in all the crap behind these), but would rather just send udp or even icmp pings to remotes, which should be enough for an update, given the source IPs of these packets and some authentication payload.

So yep, wrote my own scripts for that - tinydns-dynamic-dns-updater project.

The tool sends 100-byte UDP packets of "( key_id || timestamp ) || Ed25519_sig" from clients, authenticating and distinguishing these server-side by their signing keys ("key_id" there is to avoid iterating over them all, checking which one matches the signature).
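For illustration, packing/parsing such packets might look like this - the 64-byte Ed25519 signature size is standard, but the key_id/timestamp split of the remaining 36 bytes is just my guess, and signing is stubbed out here:

```python
import struct, time

SIG_LEN = 64  # standard Ed25519 signature size
FMT = '>28sQ'  # guessed layout: 28-byte key_id + big-endian u64 timestamp

def build_packet(key_id, sign, ts=None):
    # sign() stands in for Ed25519 signing of the 36-byte payload
    payload = struct.pack(FMT, key_id, int(ts if ts is not None else time.time()))
    sig = sign(payload)
    assert len(sig) == SIG_LEN
    return payload + sig

def parse_packet(pkt):
    payload, sig = pkt[:-SIG_LEN], pkt[-SIG_LEN:]
    key_id, ts = struct.unpack(FMT, payload)
    return key_id, ts, sig

pkt = build_packet(b'k' * 28, lambda data: b'\0' * SIG_LEN, ts=1405500000)
assert len(pkt) == 100  # matches the 100-byte packets described above
```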

Server zone files can have "# dynamic: ts key1 key2 ..." comments before records (with a comment or empty line separating these from any static records that follow), which say that source IPs of packets with correct signatures (and more recent timestamps) will be recorded into the A/AAAA records that follow (depending on source AF), replacing what's already there and leaving everything else in the file intact.
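A sketch of how such annotated zone files can be parsed - the "# dynamic:" comment format and tinydns record syntax in the sample are from the description above, while the grouping logic is just my guess at the approach:

```python
import re

def dynamic_blocks(lines):
    # yield (key_ids, record_lines) for each "# dynamic: ts key1 key2 ..."
    # block; any comment or empty line ends the current block
    dyn_re = re.compile(r'^#\s*dynamic:\s+(\S+)((?:\s+\S+)+)$')
    keys, recs = None, []
    for line in lines:
        match = dyn_re.match(line)
        if match:
            if keys is not None: yield keys, recs
            keys, recs = match.group(2).split(), []
        elif keys is not None:
            if not line.strip() or line.startswith('#'):
                yield keys, recs
                keys, recs = None, []
            else: recs.append(line)
    if keys is not None: yield keys, recs

zone = [
    '+static.example.org:203.0.113.1:300',
    '# dynamic: 1405500000 key1 key2',
    '+home.example.org:203.0.113.50:300',
    '',
    '+other.example.org:203.0.113.2:300' ]
```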

The zone file only gets replaced if something was actually updated, and it's possible to use a dynamic IP for the server as well, by using its dynamic hostname on the client (which gets re-resolved for each delayed packet).

The lossy nature of UDP can be easily mitigated by passing e.g. "-n5" to the client script, so it'd send 5 packets (with exponential delays by default, configurable via --send-delay), plus just running the thing at fairly regular intervals from crontab.
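I.e. something like this for scheduling the duplicate packets (base delay and growth factor here are arbitrary, the real defaults are whatever --send-delay sets):

```python
def send_delays(packet_count, base=1.0, factor=2.0):
    # delays to wait between successive duplicate packets
    return [base * factor**i for i in range(packet_count - 1)]

send_delays(5)  # -> [1.0, 2.0, 4.0, 8.0]
```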

Putting server script into socket-activated systemd service file also makes all daemon-specific pains like using privileged ports (and most other security/access things), startup/daemonization, restarts, auto-suspend timeout and logging woes just go away, so there's --systemd flag for that too.
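For reference, a socket-activated setup like that boils down to a unit pair roughly like the following (unit names, binary path and port are made-up placeholders here, not what the project installs):

```ini
# ddns-server.socket
[Socket]
ListenDatagram=5533

[Install]
WantedBy=sockets.target

# ddns-server.service
[Service]
ExecStart=/usr/local/bin/ddns-server --systemd
```

systemd binds the udp socket itself and passes it to the service as an inherited fd, which is what makes privileged ports a non-issue for the otherwise-unprivileged script.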

Given how easy it is to run a djbdns/tinydns instance, there really doesn't seem to be any compelling reason not to use your own dynamic dns stuff for every single machine or device that can run simple python scripts.

Github link: tinydns-dynamic-dns-updater

Aug 08, 2013

Encrypted root on a remote vds

Most advice wrt encryption on remote hosts (VPS, VDS) doesn't seem to involve full-disk encryption as such, but is rather limited to encrypting /var and /home, so that the machine will boot from a non-crypted / and you'll be able to ssh to it, decrypt these parts manually, then start the services that use data there.

That seems to be in contrast with what's generally used on local machines - make a LUKS container right on top of the physical disk device, except for /boot (if it's not on a USB key), and don't let that encryption layer bother you anymore.

The two policies seem to differ in that the former is opt-in - you have to actively think about which data to put onto the encrypted part (e.g. /etc/ssl has private keys? move to /var, shred from /etc), while the latter is opt-out - everything is encrypted, period.

So, in the spirit of that opt-out way, I thought it'd be a drag to double-think about which data should be stored where, and better to just go ahead and put everything possible into an encrypted container on a remote host as well, leaving only /boot with kernel and initramfs in the clear.

Naturally, to enter the encryption password without having it stored alongside the LUKS header, some remote login over the network is in order, and sshd seems to be the secure and easy way to go about it.
The initramfs in question should then also be able to set up networking, which luckily dracut can do. OpenSSH's sshd is a bit too heavy for it though, but there are much lighter sshds like dropbear.

Searching around for someone to tie the two things up, I found somewhat incomplete and non-packaged solutions like this RH enhancement proposal and a set of hacky scripts and instructions in the dracut-crypt-wait repo on bitbucket.

The approach outlined in RH bugzilla is to let the dracut "crypt" module operate normally and have cryptsetup query for the password on the linux console, but also start sshd in the background, where one can log in and use a simple tool to echo the password to that console (without having it echoed back).
dracut-crypt-wait does a clever hack of removing the "crypt" module hook instead, and basically creates a "rescue" console over sshd, where the user has to do all the decryption manually and then signal initramfs to proceed with the boot.

I thought the first way was rather more elegant and clever - it allows dracut to figure out which device to decrypt and to start cryptsetup with all the necessary, configured and documented parameters, while still allowing to type the passphrase into the console - best of both worlds. So I went along with that one, creating the dracut-crypt-sshd project.

As README there explains, using it is as easy as adding it into dracut.conf (or passing to dracut on command line) and adding networking to grub.cfg, e.g.:

menuentry "My Linux" {
        linux /vmlinuz ro root=LABEL=root
                rd.luks.uuid=7a476ea0 rd.lvm.vg=lvmcrypt rd.neednet=1
        initrd /dracut.xz
}

("ip=dhcp" might be a simpler way to go, but doesn't yield a default route in my case)

And there, you'll have sshd on that IP, port 2222 (configurable), with keys pre-generated during dracut build, which it might be a good idea to put into "known_hosts" for that ip/port somewhere. "authorized_keys" is taken from /root/.ssh by default, but is also configurable via dracut.conf, if necessary.

Apart from sshd, the module includes two tools for interaction with the console - console_peek and console_auth (derived from auth.c in the bugzilla link above).

Logging in to that sshd then yields sequence like this:

[214] Aug 08 13:29:54 lastlog_perform_login: Couldn't stat /var/log/lastlog: No such file or directory
[214] Aug 08 13:29:54 lastlog_openseek: /var/log/lastlog is not a file or directory!

# console_peek
[    1.711778] Write protecting the kernel text: 4208k
[    1.711875] Write protecting the kernel read-only data: 1116k
[    1.735488] dracut: dracut-031
[    1.756132] systemd-udevd[137]: starting version 206
[    1.760022] tsc: Refined TSC clocksource calibration: 2199.749 MHz
[    1.760109] Switching to clocksource tsc
[    1.809905] systemd-udevd[145]: renamed network interface eth0 to enp0s9
[    1.974202] 8139too 0000:00:09.0 enp0s9: link up, 100Mbps, full-duplex, lpa 0x45E1
[    1.983151] dracut: sshd port: 2222
[    1.983254] dracut: sshd key fingerprint: 2048 0e:14:...:36:f9  root@congo (RSA)
[    1.983392] dracut: sshd key bubblebabble: 2048 xikak-...-poxix  root@congo (RSA)
[185] Aug 08 13:29:29 Failed reading '-', disabling DSS
[186] Aug 08 13:29:29 Running in background
[    2.093869] dracut: luksOpen /dev/sda3 luks-...
Enter passphrase for /dev/sda3:
[213] Aug 08 13:29:50 Child connection from
[213] Aug 08 13:29:54 Pubkey auth succeeded for 'root' with key md5 0b:97:bb:...

# console_auth

The first command - "console_peek" - lets you see which password is requested (if any), and the second one lets you log in.
Note that fingerprints of the host keys are also echoed to the console on sshd start, in case one has access to the console but still needs sshd later.
I quickly found out that such initramfs with sshd is also a great and robust rescue tool, especially if the "debug" and/or "rescue" dracut modules are enabled.
And as it includes fairly comprehensive network-setup options, it might be a good way to boot multiple different OSes with the same (machine-specific) network parameters.

The probably-obligatory disclaimer for such a post: the crypto above won't save you from a malicious hoster or whatever three-letter agency that might coerce it into cooperation, should it take interest in your poor machine - they'll just extract keys from a RAM image (especially easy if it's a virtualized VPS) or backdoor kernel/initramfs and force a reboot.

The threat model here is more trivial - being able to turn off and decommission a host without fear of the disks/images then falling into some other party's hands, which might also happen if the hoster eventually goes bust or sells/scraps disks due to age or bad blocks.

Also, even a minor inconvenience like forcing an adversary to extract keys as outlined above might be helpful against the quite well-known "we came fishing to a datacenter, shut everything down, give us all the hardware in these racks" tactic employed by some agencies.

Absolute security is a myth, but these measures are trivial and practical enough to be employed casually, cutting off at least some number of basic threats.

So, yay for dracut, the amazingly cool and hackable initramfs project, which made it that easy.

Code link: https://github.com/mk-fg/dracut-crypt-sshd

Jun 09, 2013

cjdns per-IP (i.e. per-peer) traffic accounting

I've been using the Hyperboria darknet for about a month now, and after a late influx of russian users (following this article) the node got plenty of peers, so it's forwarding a fair bit of network traffic.

Being a darknet proper, of course, you can't see what kind of traffic it is or whom it goes to (though cjdns doesn't have anonymity as a goal), but I thought it'd be nice to at least know when my internet lags due to someone launching a DoS flood or abusing torrents.

Over the Internet (called "clearnet" here), cjdns peers over udp, but linux conntrack seems to be good enough to track these "connections" just as if they were stateful tcp flows.

Simple-ish traffic accounting on vanilla linux usually boils down to ulogd2, which can use packet-capturing interfaces (raw sockets via libpcap, netfilter ULOG and NFLOG targets), but that's kinda heavy-handed here - the traffic is opaque, only the endpoints matter, so another one of its interfaces seems a better option - conntrack tables/events.

The handy conntrack-tools (or /proc/net/{ip,nf}_conntrack) can track all the connections, including simple udp-based ones (like cjdns uses), producing entries like:

udp 17 179 \
        src= dst= sport=52728 dport=8131 \
        src= dst= sport=8131 dport=52728 \
        [ASSURED] mark=16 use=1

First trick is to enable the packet/byte counters there, which is a simple, but default-off sysctl knob:

# sysctl -w net.netfilter.nf_conntrack_acct=1

That will add "bytes=" and "packets=" values there for both directions.

Of course, polling the table is a good way to introduce extra hangs into the system (/proc files are basically hooks that tend to lock stuff to get consistent reads) and to lose stuff in-between polls, so luckily there's an event-based netlink interface, and the ulogd2 daemon to monitor it.

One easy way to pick both incoming and outgoing udp flows in ulogd2 is to add connmarks to these:

-A INPUT -p udp --dport $cjdns_port -j CONNMARK --set-xmark 0x10/0x10
-A OUTPUT -p udp --sport $cjdns_port -j CONNMARK --set-xmark 0x10/0x10

Then setup filtering by these in ulogd.conf:
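I can't vouch that this matches the original config exactly, but with ulogd2's NFCT input plugin it'd look something along these lines (the plugin stack and "accept_mark_filter" key are from ulogd2's shipped example ulogd.conf, the log path is arbitrary):

```ini
[global]
stack=ct1:NFCT,ip2str1:IP2STR,print1:PRINTFLOW,emu1:LOGEMU

[ct1]
# only report conntrack events for flows marked by the CONNMARK rules above
accept_mark_filter=0x10/0x10

[emu1]
file=/var/log/ulogd_cjdns.log
```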






Should produce parseable log of all the traffic flows with IPs and such.

A fairly simple script can then be used to push this data to graphite, munin, ganglia, cacti or whatever other time-series graphing/processing tool. The linked script is for graphite's "carbon" interface.
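Carbon's plaintext protocol is trivial - newline-terminated "name value unix-timestamp" strings sent to tcp port 2003 - so the pushing part of such a script (host and metric names below are placeholders) is barely a few lines:

```python
import socket, time

def carbon_lines(metrics, ts=None):
    # serialize (name, value) pairs per carbon's plaintext protocol
    ts = int(ts if ts is not None else time.time())
    return ''.join(f'{name} {value} {ts}\n' for name, value in metrics)

def carbon_send(metrics, host='localhost', port=2003):
    with socket.create_connection((host, port)) as sock:
        sock.sendall(carbon_lines(metrics).encode())

carbon_lines([('cjdns.peer1.bytes_in', 1234)], ts=1370700000)
#  -> 'cjdns.peer1.bytes_in 1234 1370700000\n'
```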

Update: obsoleted/superseded by cjdns "InterfaceController_peerStats" admin api function and graphite-metrics cjdns_peer_stats collector.

Apr 24, 2013

fatrace - poor man's auditd

Was hacking on (or rather debugging) the Convergence FF plugin, and it became painfully obvious that I really needed something simple to push js changes from a local git clone to ~/.mozilla, so that I can test them.

Usually I tend to employ a simple ad-hoc for src in $(git st | awk ...); do cat $src >... hack, and did the same in this case as well, but was forgetting to run it after small "debug printf" changes waaay too often.

At this point, I sometimes hack up an ad-hoc emacs post-save hook to run the thing, but this time decided to find some simpler and more generic "run that on any changes to path" tool.

Until the last few years, the only ways to do that were polling or inotify, and for some project dir that's actually quite fine, but luckily there's fanotify in the kernel now, and fatrace looks like the simplest cli tool based on it.

# fatrace
sadc(977): W /var/log/sa/sa24
sadc(977): W /var/log/sa/sa24
sadc(977): W /var/log/sa/sa24
sadc(977): W /var/log/sa/sa24
qmgr(1195): O /var/spool/postfix/deferred
qmgr(1195): CO /var/spool/postfix/deferred/0
qmgr(1195): CO /var/spool/postfix/deferred/3
qmgr(1195): CO /var/spool/postfix/deferred/7

That thing can just watch everything that's being done to all (or any specific) local mount(s).
Even better - it reports the app that does the changes.

I never got over auditd's complexity for such simple use-cases, so was damn glad that there's a real and simpler alternative now.

Unfortunately, with the power of the thing comes the need for root, so one simple bash wrapper later, my "sync changes" issue was finally resolved:

(root) ~# fatrace_pipe ~user/hatch/project
(user) project% xargs -in1 </tmp/fatrace.fifo make

Looks like a real problem-solver for a lot of real-world "what the hell happens on the fs there!?" cases as well - can't recommend the thing highly enough for all that.

Apr 06, 2013

Fighting storage bitrot and decay

Everyone is probably aware that bits do flip here and there in the supposedly rock-solid, predictable and deterministic hardware, but somehow every single data-management layer assumes that it's not its responsibility to fix or even detect these flukes.

Bitrot in RAM is a known source of bugs, but short of ECC, dunno what one can do without huge impact on performance.

Disks, on the other hand, seem to have a lot of software layers above them, handling whatever data arrangement, compression, encryption, etc, and the fact that bits do flip in magnetic media seems to be just as well-known (study1, study2, study3, ...).
In fact, these very issues seem to be the main idea behind the well-known storage behemoth ZFS.
So it has really bugged me for quite a while that any modern linux system seems to be completely oblivious to the issue.

Consider typical linux storage stack on a commodity hardware:

  • You have closed-box proprietary hdd brick at the bottom, with no way to tell what it does to protect your data - aside from vendor marketing pitches, that is.

  • Then you have well-tested and robust linux driver for some ICH storage controller.

    I wouldn't bet that it will corrupt anything at this point, but it doesn't do much else to the data but pass around whatever it gets from the flaky device either.

  • Linux blkdev layer above, presenting /dev/sdX. No checks, just simple mapping.

  • device-mapper.

    Here things get more interesting.

    I tend to use lvm wherever possible, but it's just a convenience layer (or a set of nice tools to setup mappings) on top of dm, no checks of any kind, but at least it doesn't make things much worse either - lvm metadata is fairly redundant and easy to backup/recover.

    dm-crypt gives no noticeable performance overhead, exists either above or under lvm in the stack, and is nice hygiene against accidental leaks (selling or leasing hw, theft, bugs, etc), but lacking authenticated encryption modes it doesn't do anything to detect bit-flips.
    Worse, it amplifies the issue.
    In the most common CBC mode one flipped bit in the ciphertext will affect a few other bits of data until the end of the dm block.
    Current dm-crypt default (since the latest cryptsetup-1.6.X, iirc) is XTS block encryption mode, which somewhat limits the damage, but dm-crypt has little support for changing modes on-the-fly, so tough luck.
    But hey, there is dm-verity, which sounds like exactly what I want, except it's read-only, damn.
    Read-only nature is heavily ingrained in its "hash tree" model of integrity protection - it is hashes-of-hashes all the way up to the root hash, which you specify on mount, immutable by design.

    Block-layer integrity protection is a bit weird anyway - lots of unnecessary work potential there with free space (can probably be somewhat solved by TRIM), data that's already journaled/checksummed by fs and just plain transient block changes which aren't exposed for long and one might not care about at all.

  • Filesystem layer above does the right thing sometimes.

    COW fs'es like btrfs and zfs have checksums and scrubbing, so seem to be good options.
    btrfs was slow as hell on rotating plates last time I checked, but the zfs port might be worth a try; though if a single cow fs works fine in all the kinds of scenarios where I use ext4 (mid-sized files), xfs (glusterfs backend) and reiserfs (hard-linked backups, caches, tiny-file subtrees), I'd really be amazed.

    Other fs'es plain suck at this. No care for that sort of thing at all.

  • Above-fs syscall-hooks kernel layers.

    IMA/EVM sound great, but are also for immutable security ("integrity") purposes ;(

    In fact, this layer is heavily populated by security stuff like LSM's, which I can't imagine being sanely used for bitrot-detection purposes.
    Security tools are generally oriented towards detecting any changes, intentional tampering included, and are bound to produce a lot of false-positives instead of legitimate and actionable alerts.

    Plus, upon detecting some sort of failure, these tools generally don't care about the data anymore, acting as a Denial-of-Service attack on you, which is survivable (everything can be circumvented), but fighting your own tools doesn't sound too great.

  • Userspace.

    There is tripwire, but it's also a security tool, unsuitable for the task.

    Some rare discussions of the problem pop up here and there, but alas, I failed to salvage anything usable from these, aside from ideas and links to subject-relevant papers.
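Back to the dm-verity hash-tree model mentioned above, a toy illustration of why it's immutable by design - the root hash pins every block below it, so flipping any bit anywhere changes the root (real dm-verity packs many hashes per 4k hash-block rather than two):

```python
import hashlib

def hash_tree_root(blocks, arity=2):
    # hashes-of-hashes: hash each data block, then hash concatenated
    # child hashes level by level, up to a single root hash
    level = [hashlib.sha256(block).digest() for block in blocks]
    while len(level) > 1:
        level = [ hashlib.sha256(b''.join(level[i:i+arity])).digest()
            for i in range(0, len(level), arity) ]
    return level[0]

data = [b'block-%d' % n for n in range(8)]
root = hash_tree_root(data)
data[3] = b'block-X'  # a single corrupted block...
assert hash_tree_root(data) != root  # ...invalidates the root hash
```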

Scanning github, bitbucket and xmpp popped up a bitrot script and a proof-of-concept md-checksums md layer, which apparently hasn't even made it to lkml.

So, naturally, following long-standing "... then do it yourself" motto, introducing fs-bitrot-scrubber tool for all the scrubbing needs.

It should be fairly well described in the readme, but the gist is that it's just a simple userspace script to checksum file contents and track changes there over time, taking into account all the signs of legitimate file modifications, as well as the fact that it isn't the only thing in the system that needs i/o.
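The core check can be sketched like this - names and heuristics are mine, not the actual project code (which also does i/o throttling, a metadata db and such); the point is just flagging "contents changed while mtime/size didn't" as likely corruption:

```python
import hashlib, os

def scrub(path, db):
    # db maps path -> (mtime, size, checksum) from the previous pass
    st = os.stat(path)
    with open(path, 'rb') as src:
        digest = hashlib.sha256(src.read()).hexdigest()
    prev, db[path] = db.get(path), (st.st_mtime, st.st_size, digest)
    if not prev or prev[2] == digest: return 'ok'
    # contents changed - legit if mtime/size changed too, bitrot otherwise
    if (prev[0], prev[1]) != (st.st_mtime, st.st_size): return 'changed'
    return 'bitrot'
```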

The main goal is not to provide any sort of redundancy or backups, but rather to notify of the issue before all the old backups (or some cluster-fs mirrors in my case) that could be used to fix it are rotated out of existence or overwritten.

Don't suppose I'll see such decay phenomena often (if ever), but I don't like playing the odds, especially with an easy "most cases" fix within grasp.

If I kept a lot of important stuff compressed (think what will happen if a single bit is flipped in the middle of a few-gigabytes .xz file) or naively (without storage specifics and corruption in mind) encrypted in cbc mode (or something else to the same effect), I'd be worried about the issue much more.

Wish there was something common out-of-the-box in the linux world, but I guess it's just not the time yet (hell, there's not even one clear term in techie slang for it!) - with still-increasing hdd storage sizes and much more vulnerable ssd's, some more low-level solution should materialize eventually.

Here's me hoping to raise awareness, if only by a tiny bit.

github project link

Mar 25, 2013

Secure cloud backups with Tahoe-LAFS

There's plenty of public cloud storage these days, but trusting any of them with any kind of data seems reckless - the service is free to corrupt, monetize, leak, hold hostage or just drop it.
Given that these services are provided at no cost, and generally without many ads, I guess reputation and ToS are the only things stopping them from acting like that.
Not trusting any single one of these services looks like a sane safeguard against it suddenly collapsing or blocking one's account.
And not trusting any of them with the plaintext of sensitive data seems to be a good way to protect it from all the shady things that can be done to it.

Tahoe-LAFS is a great capability-based secure distributed storage system, where you basically do "tahoe put somefile" and get capability string like "URI:CHK:iqfgzp3ouul7tqtvgn54u3ejee:...u2lgztmbkdiuwzuqcufq:1:1:680" in return.

That string is sufficient to find, decrypt and check integrity of the file (or directory tree) - basically to get it back in what's guaranteed to be the same state.
Neither tahoe node state nor the stored data can be used to recover that cap.
Retrieving the file afterwards is as simple as a GET with that cap in the url.
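E.g. via the node-local webapi (port 3456 in the default tahoe config), where stored objects live under /uri/ - a minimal fetch sketch:

```python
from urllib.parse import quote
from urllib.request import urlopen

def tahoe_uri(cap, node_url='http://127.0.0.1:3456'):
    # tahoe webapi serves stored files/dirs under /uri/<capability>
    return f'{node_url}/uri/{quote(cap, safe="")}'

def tahoe_get(cap):
    with urlopen(tahoe_uri(cap)) as resp:
        return resp.read()

tahoe_uri('URI:CHK:aaa:bbb:1:1:680')
#  -> 'http://127.0.0.1:3456/uri/URI%3ACHK%3Aaaa%3Abbb%3A1%3A1%3A680'
```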

With remote storage providers, tahoe node works as a client, so all crypto being client-side, actual cloud provider is clueless about the stuff you store, which I find to be quite important thing, especially if you stripe data across many of these leaky and/or plain evil things.

Finally got around to connecting a third backend (box.net) to tahoe today, so wanted to share a few links on the subject:

Jun 16, 2012

Proper(-ish) way to start long-running systemd service on udev event (device hotplug)

Update 2015-01-12: There's a follow-up post with a different way to do that, enabled by "systemd-escape" tool available in more recent systemd versions.

I use a smartcard token which requires long-running (while device is plugged) handler process to communicate with the chip.
Basically, udev has to start a daemon process when the device gets plugged in.
Until recently, udev didn't mind doing that via just RUN+="/path/to/binary ...", but in recent merged systemd-udevd versions this behavior was deprecated:
Starting daemons or other long running processes is not appropriate for
udev; the forked processes, detached or not, will be unconditionally killed
after the event handling has finished.

I think it's for the best - less accumulating cruft and unmanageable pids forked from udevd - but unfortunately it also breaks existing udev rule-files, the ones which use RUN+="..." to do just that.

One of the most obvious breakages for me was the smartcard failing, so I decided to fix that. Documentation on the whole migration process is somewhat lacking (hence this post), even though the docs on all the individual pieces are there (and are actually awesome).

Main doc here is systemd.device(5) for the reference on the udev attributes which systemd recognizes, and of course udev(7) for a generic syntax reference.
Also, there's this entry on Lennart's blog.

In my case, when the device (usb smartcard token) gets plugged in, an ifdhandler process should be started via openct-control (OpenCT sc middleware), which then creates a unix socket through which openct libraries (used in turn by OpenSC PKCS#11 or PCSClite) can access the hardware.

So, basically I had something like this (there are more rules for different hw, of course, but for the sake of clarity...):

SUBSYSTEM!="usb", GOTO="openct_rules_end"
ACTION!="add", GOTO="openct_rules_end"
PROGRAM="/bin/sleep 0.1"
SUBSYSTEM=="usb", ENV{DEVTYPE}=="usb_device",\
  ENV{ID_VENDOR_ID}=="0529", ENV{ID_MODEL_ID}=="0600",\
  RUN+="/usr/sbin/openct-control attach usb:$env{PRODUCT} usb $env{DEVNAME}"

Instead of RUN here, ENV{SYSTEMD_WANTS} can be used to start a properly-handled service, but note that some hardware parameters are passed via udev properties, which a systemd unit can't reference in general.

I.e. if just ENV{SYSTEMD_WANTS}="openct-handler.service" (or a more generic smartcard.target) is started, it won't know which device to pass to the "openct-control attach" command.

One way might be storing these parameters in some dir, where they'll be picked up by some path unit; a bit more hacky way would be scanning the usb bus in the handler; and yet another one (which I decided to go along with) is to use systemd unit-file templating to pass these parameters.



ExecStart=/bin/sh -c "exec openct-control attach %I"
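Fleshed out into a full template unit file (openct-handler@.service), that might look like the following - everything besides the ExecStart line is my guess at sane settings:

```ini
# openct-handler@.service
[Unit]
Requires=openct.service
After=openct.service

[Service]
Type=oneshot
RemainAfterExit=true
ExecStart=/bin/sh -c "exec openct-control attach %I"
```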

Note that it requires openct.service, which basically does "openct-control init" once per boot to setup paths and whatnot:

ExecStart=/usr/sbin/openct-control init
ExecStop=/usr/sbin/openct-control shutdown

Another thing to note is the "sh" used in the handler.
It's intentional, because just %I would be passed by systemd as a single argument, while it should be three of them after "attach".

Finally, udev rules file for the device:

SUBSYSTEM!="usb", GOTO="openct_rules_end"
ACTION!="add", GOTO="openct_rules_end"
SUBSYSTEM=="usb", ENV{DEVTYPE}=="usb_device",\
  ENV{ID_VENDOR_ID}=="0529", ENV{ID_MODEL_ID}=="0600",\
  GROUP="usb", TAG+="systemd",\

(I highly doubt newline escaping in ENV{SYSTEMD_WANTS} above will work - added it just for readability, so pls strip these in your mind into a single line without spaces)

Systemd escaping in the rule above is described in systemd.unit(5) and produces a name - and starts a service - like this one:


Which then invokes:

sh -c "exec openct-control attach\
  usb:0529/0600/0100 usb /dev/bus/usb/002/003"

And it forks ifdhandler process, which works with smartcard from then on.

ifdhandler seems to be able to detect unplugging events and exits gracefully, but otherwise the BindTo= unit directive can be used to stop the service when udev detects that the device is unplugged.

Note that it might be more obvious to just do RUN+="systemctl start whatever.service", but it's a worse way to do it, because you don't bind that service to a device in any way, don't produce the "whatever.device" unit, and there are a lot of complications due to systemctl being a tool for the user, not the API proper.

Feb 28, 2012

Late adventures with time-series data collection and representation

When something is wrong and you look at the system, most often you'll see that... well, it works. There's some cpu, disk, ram usage, some number of requests per second on different services, some stuff piling up, something in short supply here and there...

And there's just no way of telling what's wrong without answers to the questions like "so, what's the usual load average here?", "is the disk always loaded with requests 80% of time?", "is it much more requests than usual?", etc, otherwise you might be off to some wild chase just to find out that load has always been that high, or solve the mystery of some unoptimized code that's been there for ages, without doing anything about the problem in question.

Historical data is the answer, and having used rrdtool with stuff like (customized) cacti and snmpd (with some of my hacks on top) in the past, I was overjoyed when I stumbled upon the graphite project at some point.

From then on, I strived to collect as much metrics as possible, to be able to look at history of anything I want (and lots of values can be a reasonable symptom for the actual problem), without any kind of limitations.
carbon-cache does magic by batching writes, and carbon-aggregator does a great job of relieving you from having to push aggregate metrics along with granular ones or sum all these on graphs.

Initially, I started using it with just collectd (and am still using it), but there was a need for something to convert metric names to a graphite hierarchy.

After looking over quite a few solutions for a collectd-carbon bridge, I decided to use bucky, with a few fixes of my own and a quite large translation config.

Bucky can work anywhere, just receiving data from the collectd network plugin; it understands collectd types and properly translates counter increments to N/s rates. It also includes a statsd daemon, which is brilliant at handling data from non-collector daemons and scripts, and a more powerful metricsd implementation.
The downside is that it's only maintained in forks, has bugs in less-used code (like metricsd), is quite resource-hungry (but can be easily scaled out), and there's a kinda-official collectd-carbon plugin now (although I found it buggy as well, not to mention much less featureful, but hopefully that'll be addressed in future collectd versions).

Some of the problems I've noticed with such collectd setup:

  • Disk I/O metrics are godawful or just don't work - collected read/write metrics, either for processes or devices, are either zeroes, have weird values detached from reality (judging by actual problems and what tools like atop and sysstat provide), or are just useless.
  • Lots of metrics for network and memory (vmem, slab) and from various plugins have naming inconsistent with linux /proc or documentation names.
  • Some useful metrics that are in, say, sysstat don't seem to work with collectd, like sensor data, nfsv4, and some paging and socket counters.
  • Some metrics need non-trivial post-processing to be useful - disk utilization % time is one good example.
  • Python plugins leak memory on every returned value. Some plugins (ping, for example) make collectd segfault several times a day.
  • Some of the most useful info is the metrics from per-service cgroup hierarchies, created by systemd - there you can compare resource usage of various user-space components, pinpointing exactly what caused the spikes on all the other graphs at some time.
  • The second most useful info by far is produced from logs, and while collectd has a damn powerful tail plugin, I still found it to be too limited or just too complicated to use, while simple log-tailing code does the job better and is actually simpler, due to a more powerful language than collectd configuration. Same problem with the table plugin and /proc.
  • There's still a need for a large post-processing chunk of code and for pushing the values to carbon.

Of course, I wanted to add systemd cgroup metrics, some log values and missing (and just properly-named) /proc tables data, and initially I wrote a collectd plugin for that. It worked, leaked memory, occasionally crashed (with collectd itself), used some custom data types, had to have some metric-name post-processing code chunk in bucky...
Um, what the hell for, when sending a metric value directly takes just "echo some.metric.name $val $(printf %(%s)T -1) >/dev/tcp/carbon_host/2003"?

So off with collectd for all the custom metrics.

Wrote a simple "while True: collect_and_send() && sleep(till_deadline);" loop in python, along with cgroup data collectors (there are even proper "block io" and "syscall io" per-service values!), a log tailer and a sysstat data processor (mainly for disk and network metrics, which have batshit-crazy values in collectd plugins).
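Worth noting about such a loop: sleeping until a fixed deadline grid (instead of a plain sleep(60)) keeps collection timestamps regular no matter how long each cycle takes. A sketch of that part (names are mine):

```python
import time

def next_deadline(now, interval=60):
    # next point on a fixed interval-aligned grid after "now"
    return (now // interval + 1) * interval

def collect_loop(collect_and_send, interval=60):
    # run one collection cycle, then sleep until the next grid point
    while True:
        collect_and_send()
        now = time.time()
        time.sleep(next_deadline(now, interval) - now)

next_deadline(123, 60)  # -> 180
```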

Another interesting data-collection alternative I've explored recently is ganglia.
Redundant gmond collectors and aggregators, communicating efficiently over multicast, are nice. It has support for python plugins, and is very easy to use - pulling data from a gmond node network can be done with one telnet or nc command, and it's fairly comprehensible xml, not some binary protocol. Another nice feature is that it can re-publish values only on significant changes (where you define what "significant" means), thus probably eliminating traffic for 90% of "still 0" updates.
But as I found out while trying to use it as a collectd replacement (forwarding data to graphite through amqp via custom scripts), there's a fatal flaw - gmond plugins can't handle a dynamic number of values, so writing a plugin that collects metrics from systemd services' cgroups without knowing in advance how many of these will be started is just impossible.
Also, it has no concept of timestamps for values - it only has "current" ones, making plugins like a "sysstat data parser" impossible to implement as well.
collectd, in contrast, has no constraint on how many values a plugin returns, and has timestamps, though with limitations on how far back they can go.

A pity - gmond looked like a nice, solid and resilient thing otherwise.

I still like the idea of piping graphite metrics through AMQP (like rocksteady does), routing them there not only to graphite, but also to some proper threshold-monitoring daemon like shinken (basically nagios, but distributed and more powerful), with alerts, escalations, trending and flapping detection, etc, but most of the existing solutions all seem to use graphite and whisper directly, which seems kinda wasteful.

Looking forward, I'm actually deciding between two options: replacing collectd completely for the few most basic metrics it now collects, pulling them from sysstat or just /proc directly; or integrating my collectors back into collectd as plugins, extending collectd-carbon as needed and using collectd threshold monitoring and matches/filters to generate and export events to nagios/shinken... somehow the first option seems more effort-effective, even in the long run, but then maybe I should just work more with collectd upstream, not hack around it.

Oct 23, 2011

dm-crypt password caching between dracut and systemd, systemd password agent

Update 2015-11-25: with "ask-password" caching implemented as of systemd-227 (2015-10-07), a better way would be to use that in-kernel caching, though it likely requires systemd running in initramfs (e.g. dracut has had that for a while).

Up until now I've used lvm on top of a single full-disk dm-crypt partition.
It seems easiest to work with - no need to decrypt individual lv's, no confusion between what's encrypted (everything but /boot!) and what's not, etc.
The main problem with it though is that it's harder to have non-encrypted parts, everything is encrypted with the same keys (unless there're several dm-crypt layers), and it's bad for SSDs - dm-crypt still (as of 3.0) doesn't pass any TRIM requests through, leading to a nasty write amplification effect, even more so with the full disk given to dm-crypt+lvm.
While there's hope that SSD issues will be kinda-solved (with an optional security trade-off) in 3.1, it's still much easier to keep different distros or some decrypted-when-needed partitions with dm-crypt after lvm, so I've decided to go with the latter for new 120G SSD.
Also, such a scheme allows re-creating encrypted lvs, issuing TRIM for the old ones, thus recycling the blocks even w/o support for this in dm-crypt.
Same as with the previous initramfs, I've had a simple "openct" module (udev there makes it even easier) in dracut to find an inserted smartcard and use it to obtain the encryption key, which is used once to decrypt the only partition on which everything resides.
Since the only goal of dracut is to find root and get-the-hell-outta-the-way, it won't even try to decrypt all the /var and /home stuff without serious ideological changes.
The problem is actually solved in generic distros by plymouth, which gets the password(s), caches them, and provides them to dracut and systemd (or whatever comes as the real "init"). I don't need a splash screen, and actually hate it for hiding all the info that scrolls in its place, so plymouth is a no-go for me.

Having a hack to obtain and cache the key for dracut by non-conventional means anyway, I just needed to pass it further to systemd, and since they share a common /run tmpfs these days, that basically means not to rm it in dracut after use.

Luckily, system-wide password handling mechanism in systemd is well-documented and easily extensible beyond plymouth and default console prompt.

So the whole key management in my system goes like this now:

  • dracut.cmdline: create udev rule to generate key.
  • dracut.udev.openct: find smartcard, run rule to generate and cache key in /run/initramfs.
  • dracut.udev.crypt: check for cached key or prompt for it (caching result), decrypt root, run systemd.
  • systemd: start post-dracut-crypt.path unit to monitor /run/systemd/ask-password for password prompts, along with default .path units for fallback prompts via wall/console.
  • systemd.udev: discover encrypted devices, create key requests.
  • systemd.post-dracut-crypt.path: start post-dracut-crypt.service to read cached passwords from /run/initramfs and use these to satisfy requests.
  • systemd.post-dracut-crypt-cleanup.service (after local-fs.target is activated): stop post-dracut-crypt.service, flush caches, generate new one-time keys for decrypted partitions.
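The two monitoring units from the list above could look roughly like this - a sketch only, with a guessed ExecStart path (not the actual files from the repo):

```ini
# post-dracut-crypt.path
[Unit]
Description=Watch for dm-crypt password requests
DefaultDependencies=no

[Path]
DirectoryNotEmpty=/run/systemd/ask-password

# post-dracut-crypt.service
[Unit]
Description=Answer password requests from the /run/initramfs cache
DefaultDependencies=no

[Service]
Type=oneshot
# hypothetical path to the agent script
ExecStart=/usr/local/sbin/post-dracut-crypt-agent
```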
End result is a passwordless boot with this new layout, which seems only possible to spoof by getting root during that process somehow, with altering the unencrypted /boot to run some extra code and revert it back being the most obvious possibility.
It's kinda weird that there doesn't seem to be any caching in place already - surely not everyone with dm-crypt is using plymouth?

The most complicated piece here is probably the password agent (in python), which actually could've been simpler if I hadn't followed the proper guidelines and had instead thought a bit around them.

For example, the whole inotify handling thing (I've used it via ctypes) could be dropped in favor of a .path unit with a DirectoryNotEmpty= activation condition - it's there already; PolicyKit authorization just isn't working at such an early stage; there doesn't seem to be much need to check request validity, since sending replies to sockets is racy anyway; etc.
Still, a good exercise.
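For reference, the documented agent protocol boils down to: watch /run/systemd/ask-password/ for ask.* files (ini format, with the reply socket path in the Socket= key), and answer by sending the password, prefixed with "+", over that AF_UNIX datagram socket ("-" means cancel). A minimal python sketch of that flow - not the actual agent from the repo, with parse_ask() split out for clarity:

```python
import configparser
import glob
import socket

ASK_DIR = "/run/systemd/ask-password"

def parse_ask(text):
    """Extract the reply socket path and prompt message from an ask.* file."""
    ini = configparser.ConfigParser(interpolation=None)
    ini.read_string(text)
    ask = ini["Ask"]
    return ask["Socket"], ask.get("Message", "")

def answer(sock_path, password):
    """Reply with the password; '+' prefix means success, '-' would cancel."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as s:
        s.sendto(b"+" + password, sock_path)

def serve_cached(get_password):
    """Answer every pending request using a password-lookup callable."""
    for ask_file in glob.glob(ASK_DIR + "/ask.*"):
        with open(ask_file) as src:
            sock_path, message = parse_ask(src.read())
        answer(sock_path, get_password(message))
```

Here get_password would read the key cached in /run/initramfs, keyed off the prompt message.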

Python password agent for systemd. Unit files to start and stop it on demand.

Sep 16, 2011

Detailed process memory accounting, including shared and swapped one

Two questions:

  • How to tell which pids (or groups of forks) eat most swap right now?
  • How much RAM one apache/php/whatever really consumes?

Somehow people keep pointing me at "top" and "ps" tools to do this sort of thing, but there's an obvious problem:

#include <stdlib.h>
#include <unistd.h>

#define G (1024UL * 1024 * 1024)

int main (void) {
    /* reserve 2 GiB of address space without ever touching it */
    (void) malloc(2 * G);
    pause(); /* linger, so the process is visible in top */
    return 0;
}
This code will immediately float to 1st position in top, sorted by "swap" (F p <return>), showing 2G even with no swap in the system.

The second question/issue is also common, but somehow not universally recognized. It comes up when scared admins (or whoever happens to ssh into a web backend machine) see N pids of something summing up to more than the total amount of RAM in the system, like 50 httpd processes at 50M each.
It gets even worse when tools like "atop" helpfully aggregate the numbers ("atop -p"), showing that there are 6 sphinx processes, eating 15G on a machine with 4-6G physical RAM + 4-8G swap, causing local panic and mayhem.
The answer is, of course, that sphinx, apache and pretty much anything using worker processes share a lot of memory pages between their processes, and not just because of shared objects like libc.

Guess it's just general ignorance of how memory works in linux (or other unix OSes) among those who never had to write a fork() or deal with mallocs in C, which kinda makes lots of these concepts look fairly trivial.

So, more out of curiosity than real need, I decided to find a way to answer these questions.
proc(5) reveals this data more-or-less via "maps" / "smaps" files, but that needs some post-processing to give per-pid numbers.
The closest tools I was able to find were pmap from the procps package and the ps_mem.py script from a coreutils maintainer. The former seems to give only mapped memory region sizes; the latter cleverly shows shared memory divided by the number of similar processes, but omits per-process numbers and swap.
Oh, and of course there are the glorious valgrind and gdb, but both are active debugging tools, not really suitable for normal day-to-day operation and a bit too complex for the task.
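The post-processing itself is straightforward - sum the Private_*, Shared_* and Swap fields over all mappings in "smaps". A minimal sketch of that aggregation (not the actual script; it takes an iterable of lines, so any smaps-format source works):

```python
def smaps_totals(lines):
    """Sum private/shared/swap KiB over the lines of a /proc/<pid>/smaps file."""
    totals = {"private": 0, "shared": 0, "swap": 0}
    for line in lines:
        key, _, rest = line.partition(":")
        if key in ("Private_Clean", "Private_Dirty"):
            totals["private"] += int(rest.split()[0])
        elif key in ("Shared_Clean", "Shared_Dirty"):
            totals["shared"] += int(rest.split()[0])
        elif key == "Swap":
            totals["swap"] += int(rest.split()[0])
    return totals

# usage: with open("/proc/self/smaps") as src: print(smaps_totals(src))
```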

So I thought I'd write my own tool for the job to put the matter at rest once and for all, so I can later point people at it and just say "see?" (although I bet it'll never be that simple).

The idea is to group similar processes (by cmd) and show details for each one, like this:

    private: 252.0 KiB
    shared: 712.0 KiB
    swap: 0
      private: 84.0 KiB
      shared: 712.0 KiB
      swap: 0
    -cmdline: /sbin/agetty tty3 38400
      -shared-with: rpcbind, _plutorun, redshift, dbus-launch, acpid, ...
      private: 8.0 KiB
      shared: 104.0 KiB
      swap: 0
      -shared-with: rpcbind, _plutorun, redshift, dbus-launch, acpid, ...
      private: 12.0 KiB
      shared: 548.0 KiB
      swap: 0
      -shared-with: agetty
      private: 4.0 KiB
      shared: 24.0 KiB
      swap: 0
      -shared-with: firefox, redshift, tee, sleep, ypbind, pulseaudio [updated], ...
      private: 0
      shared: 8.0 KiB
      swap: 0
      private: 20.0 KiB
      shared: 0
      swap: 0
      private: 8.0 KiB
      shared: 0
      swap: 0
      private: 24.0 KiB
      shared: 0
      swap: 0
      private: 0
      shared: 0
      swap: 0
      private: 84.0 KiB
      shared: 712.0 KiB
      swap: 0
    -cmdline: /sbin/agetty tty4 38400
      private: 84.0 KiB
      shared: 712.0 KiB
      swap: 0
    -cmdline: /sbin/agetty tty5 38400

So it's obvious that there are 3 agetty processes, which ps will report as 796 KiB RSS:

root 7606 0.0 0.0 3924 796 tty3 Ss+ 23:05 0:00 /sbin/agetty tty3 38400
root 7608 0.0 0.0 3924 796 tty4 Ss+ 23:05 0:00 /sbin/agetty tty4 38400
root 7609 0.0 0.0 3924 796 tty5 Ss+ 23:05 0:00 /sbin/agetty tty5 38400
Each of these, in fact, consumes only 84 KiB of RAM, with 24 KiB more shared between all agettys as the /sbin/agetty binary; the rest (ld, libc and such) is shared system-wide (the shared-with list contains pretty much every process in the system), so it won't be freed by killing agetty, and starting 10 more of them will consume ~1 MiB, not ~10 MiB as "ps" output might suggest.
"top" will show ~3M of "swap" (same with "SZ" in ps) for each agetty, which is also obviously untrue.
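That bogus "swap" number is (in older procps top, at least) just VIRT minus RES - everything mapped but not resident, regardless of whether it was ever swapped out. A quick sketch of that arithmetic from /proc/&lt;pid&gt;/statm page counts:

```python
import os
import resource

PAGE_KIB = resource.getpagesize() // 1024

def top_style_swap_kib(pid):
    """Old-top-style 'SWAP' estimate: (VIRT - RES) in KiB, from statm."""
    with open(f"/proc/{pid}/statm") as src:
        size, resident = map(int, src.read().split()[:2])
    return (size - resident) * PAGE_KIB
```

For the malloc example above this reports the untouched 2G of address space, which never came near actual swap.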

More machine-friendly (flat) output might be reminiscent of sysctl:

agetty.-stats.private: 252.0 KiB
agetty.-stats.shared: 712.0 KiB
agetty.-stats.swap: 0
agetty.7606.-stats.private: 84.0 KiB
agetty.7606.-stats.shared: 712.0 KiB
agetty.7606.-stats.swap: 0
agetty.7606.-cmdline: /sbin/agetty tty3 38400
agetty.7606.'/lib/ld-2.12.2.so'.-shared-with: ...
agetty.7606.'/lib/ld-2.12.2.so'.private: 8.0 KiB
agetty.7606.'/lib/ld-2.12.2.so'.shared: 104.0 KiB
agetty.7606.'/lib/ld-2.12.2.so'.swap: 0
agetty.7606.'/lib/libc-2.12.2.so'.-shared-with: ...

Script. No dependencies needed, apart from python 2.7 or 3.X (works with both w/o conversion).

Some optional parameters are supported:

usage: ps_mem_details.py [-h] [-p] [-s] [-n MIN_VAL] [-f] [--debug] [name]
Detailed process memory usage accounting tool.
positional arguments:
  name           String to look for in process cmd/binary.
optional arguments:
  -h, --help     show this help message and exit
  -p, --private  Show only private memory leaks.
  -s, --swap     Show only swapped-out stuff.
  -n MIN_VAL, --min-val MIN_VAL
            Minimal (non-inclusive) value for tracked parameter
            (KiB, see --swap, --private, default: 0).
  -f, --flat     Flat output.
  --debug        Verbose operation mode.

For example, to find what hogs more than 500K swap in the system:

# ps_mem_details.py --flat --swap -n 500
memcached.-stats.private: 28.4 MiB
memcached.-stats.shared: 588.0 KiB
memcached.-stats.swap: 1.5 MiB
memcached.927.-cmdline: /usr/bin/memcached -p 11211 -l
memcached.927.[anon].private: 28.0 MiB
memcached.927.[anon].shared: 0
memcached.927.[anon].swap: 1.5 MiB
squid.-stats.private: 130.9 MiB
squid.-stats.shared: 1.2 MiB
squid.-stats.swap: 668.0 KiB
squid.1334.-cmdline: /usr/sbin/squid -NYC
squid.1334.[heap].private: 128.0 MiB
squid.1334.[heap].shared: 0
squid.1334.[heap].swap: 660.0 KiB
udevd.-stats.private: 368.0 KiB
udevd.-stats.shared: 796.0 KiB
udevd.-stats.swap: 748.0 KiB

...or what eats more than 20K in agetty pids (should be useful to see which .so or binary "leaks" in a process):

# ps_mem_details.py --private --flat -n 20 agetty
agetty.-stats.private: 252.0 KiB
agetty.-stats.shared: 712.0 KiB
agetty.-stats.swap: 0
agetty.7606.-stats.private: 84.0 KiB
agetty.7606.-stats.shared: 712.0 KiB
agetty.7606.-stats.swap: 0
agetty.7606.-cmdline: /sbin/agetty tty3 38400
agetty.7606.[stack].private: 24.0 KiB
agetty.7606.[stack].shared: 0
agetty.7606.[stack].swap: 0
agetty.7608.-stats.private: 84.0 KiB
agetty.7608.-stats.shared: 712.0 KiB
agetty.7608.-stats.swap: 0
agetty.7608.-cmdline: /sbin/agetty tty4 38400
agetty.7608.[stack].private: 24.0 KiB
agetty.7608.[stack].shared: 0
agetty.7608.[stack].swap: 0
agetty.7609.-stats.private: 84.0 KiB
agetty.7609.-stats.shared: 712.0 KiB
agetty.7609.-stats.swap: 0
agetty.7609.-cmdline: /sbin/agetty tty5 38400
agetty.7609.[stack].private: 24.0 KiB
agetty.7609.[stack].shared: 0
agetty.7609.[stack].swap: 0