Aug 08, 2013

Encrypted root on a remote vds

Most advice wrt encryption on remote hosts (VPS, VDS) doesn't seem to involve full-disk encryption as such, but is rather limited to encrypting /var and /home, so that the machine boots from a non-encrypted /, you ssh into it, decrypt those parts manually, then start the services that use data there.

That seems to contrast with what's generally used on local machines - make a LUKS container right on top of the physical disk device, except for /boot (if it's not on a USB key), and don't let that encryption layer bother you anymore.

The two policies differ in that the former is opt-in - you have to actively think about which data to put onto the encrypted part (e.g. /etc/ssl has private keys? move them to /var, shred them from /etc) - while the latter is opt-out: everything is encrypted, period.

So, in the spirit of that opt-out way, I thought it'd be a drag to double-think about which data should be stored where, and that it'd be better to just go ahead and put everything possible into an encrypted container on a remote host as well, leaving only /boot with kernel and initramfs in the clear.

Naturally, to enter the encryption password without storing it alongside the LUKS header, some remote login from the network is in order, and sshd seems to be the secure and easy way to go about it.
The initramfs in question should then also be able to set up networking, which luckily dracut can do. OpenSSH's sshd is a bit too heavy for it though, but there are much lighter sshds like dropbear.

Searching around for someone who had tied the two things together, I found somewhat incomplete and non-packaged solutions like this RH enhancement proposal and a set of hacky scripts and instructions in the dracut-crypt-wait repo on bitbucket.

The approach outlined in RH bugzilla is to let dracut's "crypt" module operate normally, with cryptsetup querying for the password on the linux console, but also start sshd in the background, where one can log in and use a simple tool to echo the password to that console (without having it echoed back).
dracut-crypt-wait does a clever hack of removing the "crypt" module hook instead, basically creating a "rescue" console over sshd, where the user has to do all the necessary decryption manually and then signal initramfs to proceed with the boot.

I thought the first way was the more elegant and clever one - dracut still figures out which device to decrypt and starts cryptsetup with all the necessary, configured and documented parameters, while also still allowing to type the passphrase into the console - best of both worlds - so I went along with that one, creating the dracut-crypt-sshd project.

As README there explains, using it is as easy as adding it into dracut.conf (or passing to dracut on command line) and adding networking to grub.cfg, e.g.:

menuentry "My Linux" {
        linux /vmlinuz ro root=LABEL=root rd.luks.uuid=7a476ea0 rd.neednet=1
        initrd /dracut.xz
}

("ip=dhcp" might be a simpler way to go, but doesn't yield a default route in my case)

And there you'll have sshd on that IP, port 2222 (configurable), with keys pre-generated during the dracut build, which it might be a good idea to put into "known_hosts" for that ip/port somewhere. "authorized_keys" is taken from /root/.ssh by default, but is also configurable via dracut.conf, if necessary.
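For example, a dedicated ~/.ssh/config entry keeps those initramfs host keys from clashing with the host's regular sshd ones (hostname/IP here are placeholders, of course):

```
Host vds-initramfs
        HostName 198.51.100.10
        Port 2222
        User root
        # initramfs host keys differ from the real sshd's, so track them separately
        UserKnownHostsFile ~/.ssh/known_hosts.initramfs
```

With that, "ssh vds-initramfs" gets you to the initramfs sshd without host-key warnings.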

Apart from sshd, the module includes two tools for interacting with the console - console_peek and console_auth (derived from auth.c in the bugzilla link above).

Logging in to that sshd then yields sequence like this:

[214] Aug 08 13:29:54 lastlog_perform_login: Couldn't stat /var/log/lastlog: No such file or directory
[214] Aug 08 13:29:54 lastlog_openseek: /var/log/lastlog is not a file or directory!

# console_peek
[    1.711778] Write protecting the kernel text: 4208k
[    1.711875] Write protecting the kernel read-only data: 1116k
[    1.735488] dracut: dracut-031
[    1.756132] systemd-udevd[137]: starting version 206
[    1.760022] tsc: Refined TSC clocksource calibration: 2199.749 MHz
[    1.760109] Switching to clocksource tsc
[    1.809905] systemd-udevd[145]: renamed network interface eth0 to enp0s9
[    1.974202] 8139too 0000:00:09.0 enp0s9: link up, 100Mbps, full-duplex, lpa 0x45E1
[    1.983151] dracut: sshd port: 2222
[    1.983254] dracut: sshd key fingerprint: 2048 0e:14:...:36:f9  root@congo (RSA)
[    1.983392] dracut: sshd key bubblebabble: 2048 xikak-...-poxix  root@congo (RSA)
[185] Aug 08 13:29:29 Failed reading '-', disabling DSS
[186] Aug 08 13:29:29 Running in background
[    2.093869] dracut: luksOpen /dev/sda3 luks-...
Enter passphrase for /dev/sda3:
[213] Aug 08 13:29:50 Child connection from
[213] Aug 08 13:29:54 Pubkey auth succeeded for 'root' with key md5 0b:97:bb:...

# console_auth

The first command - "console_peek" - lets you see which password is requested (if any), and the second one lets you log in.
Note that fingerprints of the host keys are also echoed to the console on sshd start, in case one has access to the console but still needs sshd later.
I quickly found out that such an initramfs with sshd is also a great and robust rescue tool, especially if the "debug" and/or "rescue" dracut modules are enabled.
And as it includes fairly comprehensive network-setup options, it might be a good way to boot multiple different OSes with the same (machine-specific) network parameters.

A probably-obligatory disclaimer for such a post: the crypto above won't save you from a malicious hoster or whatever three-letter agency coerces it into cooperation, should it take interest in your poor machine - it'll just extract keys from a RAM image (especially if it's a virtualized VPS) or backdoor the kernel/initramfs and force a reboot.

The threat model here is more trivial - being able to turn off and decommission a host without fear of the disks/images then falling into some other party's hands, which might also happen if the hoster eventually goes bust or sells/scraps disks due to age or bad blocks.

Also, even a minor inconvenience like forcing the extraction of keys as outlined above might be helpful against the well-known "we came fishing to a datacenter, shut everything down, give us all the hardware in these racks" tactic employed by some agencies.

Absolute security is a myth, but these measures are trivial and practical enough to be employed casually, cutting off at least some number of basic threats.

So, yay for dracut, the amazingly cool and hackable initramfs project, which made it that easy.

Code link:

Jun 09, 2013

cjdns per-IP (i.e. per-peer) traffic accounting

I've been using the Hyperboria darknet for about a month now, and after a late influx of Russian users there (after this article) got plenty of peers, so the node is forwarding a bit of network traffic.

Being a darknet-proper, of course, you can't see what kind of traffic it is or whom it goes to (though cjdns doesn't have anonymity as a goal), but I thought it'd be nice to at least know when my internet lags due to someone launching a DoS flood or abusing torrents.

Over the Internet (called "clearnet" here), cjdns peers over udp, but linux conntrack seems to be good enough to track these "connections" just as if they were stateful tcp flows.

Simple-ish traffic accounting on vanilla linux usually boils down to ulogd2, which can use packet-capturing interfaces (raw sockets via libpcap, netfilter ULOG and NFLOG targets), but that's kinda heavy-handed here - the traffic is opaque, only the endpoints matter - so another one of its interfaces seems the better option: conntrack tables/events.

The handy conntrack-tools (or /proc/net/{ip,nf}_conntrack) can track all the connections, including simple udp-based ones (like cjdns uses), producing entries like:

udp 17 179 \
        src= dst= sport=52728 dport=8131 \
        src= dst= sport=8131 dport=52728 \
        [ASSURED] mark=16 use=1

The first trick is to enable the packet/byte counters there, via a simple but default-off sysctl knob:

# sysctl -w net.netfilter.nf_conntrack_acct=1

That will add "bytes=" and "packets=" values there for both directions.

Of course, polling the table is a good way to introduce extra hangs into the system (/proc files are basically hooks that tend to lock stuff to get consistent reads) and to lose stuff in-between polls, so luckily there's an event-based netlink interface, and the ulogd2 daemon to monitor it.

One easy way to pick up both incoming and outgoing udp flows in ulogd2 is to add connmarks to them:

-A INPUT -p udp --dport $cjdns_port -j CONNMARK --set-xmark 0x10/0x10
-A OUTPUT -p udp --sport $cjdns_port -j CONNMARK --set-xmark 0x10/0x10

Then set up filtering by these in ulogd.conf:
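Roughly, that's an NFCT-based stack with a mark filter matching the connmark set above - a sketch with option names from ulogd 2.0, not necessarily the exact config:

```
stack=ct1:NFCT,ip2str1:IP2STR,print1:PRINTFLOW,emu1:LOGEMU

[ct1]
# only pick up flows carrying the 0x10 connmark set by the iptables rules
accept_mark_filter=0x10/0x10
hash_enable=0

[emu1]
file="/var/log/ulogd_cjdns.log"
```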






Should produce a parseable log of all the traffic flows, with IPs and such.

A fairly simple script can then be used to push this data to graphite, munin, ganglia, cacti or whatever time-series graphing/processing tool. The linked script is for graphite's "carbon" interface.
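Carbon's plaintext protocol is just "name value timestamp" lines over tcp (port 2003 by default), so the pushing part can be as simple as this sketch ("carbon_host" and the metric name are placeholders):

```shell
# format one metric in carbon's plaintext "name value unix-timestamp" form
carbon_line() { printf '%s %s %s\n' "$1" "$2" "$(printf '%(%s)T' -1)"; }

carbon_line cjdns.peers.10_0_1_2.bytes_in 12345
# -> cjdns.peers.10_0_1_2.bytes_in 12345 <unix-time>
# actually shipping it is one bash redirect away:
#   carbon_line ... >/dev/tcp/carbon_host/2003
```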

Update: obsoleted/superseded by cjdns "InterfaceController_peerStats" admin api function and graphite-metrics cjdns_peer_stats collector.

Apr 24, 2013

fatrace - poor man's auditd

Was hacking on (or rather debugging) the Convergence FF plugin, and it became painfully obvious that I really needed something simple to push js changes from a local git clone to ~/.mozilla so that I can test them.

Usually I tend to employ a simple ad-hoc for src in $(git st | awk ...); do cat $src >... hack, and did the same thing in this case as well, but was forgetting to run it after small "debug printf" changes waaay too often.

At this point I sometimes hack up an ad-hoc emacs post-save hook to run the thing, but this time decided to find some simpler and more generic "run that on any changes to path" tool.

Until the last few years, the only ways to do that were polling or inotify, and for some project dir that's actually quite fine, but luckily there's fanotify in the kernel now, and fatrace looks like the simplest cli tool based on it.

# fatrace
sadc(977): W /var/log/sa/sa24
sadc(977): W /var/log/sa/sa24
sadc(977): W /var/log/sa/sa24
sadc(977): W /var/log/sa/sa24
qmgr(1195): O /var/spool/postfix/deferred
qmgr(1195): CO /var/spool/postfix/deferred/0
qmgr(1195): CO /var/spool/postfix/deferred/3
qmgr(1195): CO /var/spool/postfix/deferred/7

That thing can just watch everything that's being done to all (or any specific) local mount(s).
Even better - it reports the app that does the changes.

I never got over auditd's complexity for such simple use-cases, so was damn glad that there's a real and simpler alternative now.

Unfortunately, with the power of the thing comes the need for root, so one simple bash wrapper later, my "sync changes" issue was finally resolved:

(root) ~# fatrace_pipe ~user/hatch/project
(user) project% xargs -in1 </tmp/fatrace.fifo make
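The wrapper's core is basically a filter on fatrace output lines ("proc(pid): OP /path") feeding the fifo - something like this sketch (function name and event selection here are my guesses, not the actual fatrace_pipe code):

```shell
# pass through paths of write events ("W") under a given path prefix
filter_events() {
        awk -v p="$1" '$2 ~ /W/ && index($3, p) == 1 {print $3; fflush()}'
}

# as root: fatrace | filter_events ~user/hatch/project >/tmp/fatrace.fifo
printf 'sadc(977): W /var/log/sa/sa24\nqmgr(1195): O /var/spool/x\n' \
        | filter_events /var/log
# -> /var/log/sa/sa24
```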

Looks like a real problem-solver for a lot of real-world "what the hell happens on the fs there!?" cases as well - can't recommend the thing highly enough for all that.

Apr 08, 2013

TCP Hijacking for The Greater Good

As discordian folk celebrated Jake Day yesterday, I decided I've had it with random hanging userspace state-machines, stuck forever on tcp connections that are not legitimately dead, just with both sides waiting.

And since pretty much every tool can handle transient connection failures and reconnects, decided to come up with some simple and robust-enough solution to break such links without (or rather before) patching all the apps to behave.

The last straw was davfs2 failing after a brief net-hiccup, with my options limited to killing everything that uses (and is hanging dead on) its mount, then going the kill/remount way.
As it uses stateless http connections, I bet it's not even an issue for it to repeat whatever request it tried last, and it sure as hell handles network failures - just not well in all cases.

I've used such a technique to test some twisted-things in the past, so it was easy to dig up scapy-automata code for doing that, though the real trick is not to craft/send the FIN or RST packet, but rather to guess the TCP seq/ack numbers to stamp it with.

Alas, none of the existing tools (e.g. tcpkill) seem to do anything clever in this regard.

cutter states that

There is a feature of the TCP/IP protocol that we could use to good effect here - if a packet (other than an RST) is received on a connection that has the wrong sequence number, then the host responds by sending a corrective "ACK" packet back.

But neither the tool itself nor the technique described seem to work, and I actually failed to find (or recall) any mentions (or other uses) of such corrective behavior. Maybe it worked that way waaay back, dunno.

Naturally, as I can run such a tool on the host where the socket endpoint is, the local kernel has these numbers stored, but apparently no one really cared (or had a legitimate-enough use-case) to expose them to userspace... until very recently, that is.

Recent work of the Parallels folks on CRIU landed getsockopt(sk, SOL_TCP, TCP_QUEUE_SEQ, ...) in one of the latest mainline kernel releases.
The trick is then just to run that syscall in the pid that holds the socket fd, which looks like a trivial-enough task, but looking over crtools (which unfortunately doesn't seem to work with vanilla kernel yet) and the ptrace-parasite tricks of compiling and injecting shellcode, I decided it's just too much work for me; plus they share the same x86_64-only codebase, and I'd like to have the thing working on ia32 machines as well.
Caching all the "seen" seq numbers in advance looks tempting, especially since for most cases the relevant traffic is already processed by nflog-zmq-pcap-pipe and Snort, which could potentially dump "(endpoint1-endpoint2, seq, len)" tuples to some fast key-value backend.
Invalidation of these might be a minor issue, but I'm not too thrilled about having some dissection code pre-cache stuff that's already cached in every kernel anyway.

Patching the kernel to just expose the stuff via /proc looks like a bit of a burden as well, though an isolated module would probably do the job. Weird that there doesn't seem to be one of these around already, the closest being the tcp_probe.c code, which hooks into the tcp_recv code-path and doesn't really get seqs without some traffic either.

One interesting idea that got my attention and didn't require a single line of extra code was proposed on the local xmpp channel - to use tcp keepalives.

Sure, they won't make the kernel drop a connection when it's userspace that hangs on both ends, with the connection itself being perfectly healthy, but every one of them carries a seq number that can be spoofed and used to destroy that "healthy" state.

Pity they're optional and can't just be turned on for all sockets system-wide on linux (unlike on some BSD systems, apparently), and nothing much uses them by choice (which can be seen in netstat --timer).

Luckily, there's the dead-simple LD_PRELOAD code of libkeepalive, which can be used to enforce system-wide opt-out behavior for these (at least for non-static binaries).
For suid stuff (like mount.davfs, mentioned above), it has to go into /etc/ld.so.preload, not just the env, but as I need it "just in case" for all the connections, that seems fine in my case.

And tuning keepalives to be frequent enough seems to be a no-brainer that shouldn't have any effect on 99% of legitimate connections, as they probably pass some traffic every other second, not after minutes or hours.

net.ipv4.tcp_keepalive_time = 900
net.ipv4.tcp_keepalive_probes = 5
net.ipv4.tcp_keepalive_intvl = 156

(default is to send empty keepalive packet after 2 hours of idleness)
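Spelled out (my arithmetic, not part of the original post): an idle connection gets its first probe after tcp_keepalive_time seconds, and the kernel gives up after tcp_keepalive_probes more probes sent tcp_keepalive_intvl apart:

```shell
t=900 probes=5 intvl=156
echo "avg wait for next keepalive: $(( t / 2 ))s"          # -> 450s, ~7.5 min
echo "worst-case full timeout: $(( t + probes * intvl ))s" # -> 1680s, 28 min
```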

With that, the tool has to run ~7 min on average to kill any tcp connection in the system, which is totally acceptable, and no fragile non-portable ptrace-shellcode magic is involved (at least yet; I bet it'd be much easier to do in the future).

Code and some docs for the tool/approach can be found on github.

More of the same (update 2013-08-11):

Actually, lacking a better way to send RST/FIN from a machine to itself than swapping MACs (and hoping that the router is misconfigured enough to bounce a packet "from itself" back) or "-j REJECT --reject-with tcp-reset" (plus a "recent" match or transient-port matching, to avoid blocking the reconnect as well), the countdown for a connection should be ~7 + 15 min, as only the next keepalive will reliably produce an RST response.

With a bit of ipset/iptables/nflog magic, it was easy to make the one-time REJECT rule, snatching seq from dropped packet via NFLOG and using that to produce RST for the other side as well.

Whole magic there goes like this:

-A conn_cutter ! -p tcp -j RETURN
-A conn_cutter -m set ! --match-set conn_cutter src,src -j RETURN
-A conn_cutter -p tcp -m recent --set --name conn_cutter --rsource
-A conn_cutter -p tcp -m recent ! --rcheck --seconds 20\
        --hitcount 2 --name conn_cutter --rsource -j NFLOG
-A conn_cutter -p tcp -m recent ! --rcheck --seconds 20\
        --hitcount 2 --name conn_cutter --rsource -j REJECT --reject-with tcp-reset

-I OUTPUT -j conn_cutter

"recent" matcher there is a bit redundant in most cases, as outgoing connections usually use transient-range tcp ports, which shouldn't match for different attempts, but some apps might bind these explicitly.
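For reference, the transient (ephemeral) source-port range in question is visible via procfs:

```shell
# outgoing connections pick source ports from this range, so a reconnect
# rarely reuses the port that the one-time REJECT rule already matched
cat /proc/sys/net/ipv4/ip_local_port_range
```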

ipset turned out to be quite a neat thing to avoid iptables manipulations (to add/remove match).

It's interesting that this set of rules handles RST to both ends all by itself if packet arrives from remote first - response (e.g. ACK) from local socket will get RST but won't reach remote, and retransmit from remote will get RST because local port is legitimately closed by then.

Current code allows to optionally specify the ipset name, whether to use nflog (via the spin-off scapy-nflog-capture driver) or raw sockets, and doesn't do any mac-swapping - only sending RST to the remote end (which, again, should still be sufficient with frequent-enough keepalives).

Now, if only some decade-old undocumented code didn't explicitly disable these nice keepalives...

Feb 28, 2012

Late adventures with time-series data collection and representation

When something is wrong and you look at the system, most often you'll see that... well, it works. There's some cpu, disk, ram usage, some number of requests per second on different services, some stuff piling up, something in short supply here and there...

And there's just no way of telling what's wrong without answers to questions like "so, what's the usual load average here?", "is the disk always loaded with requests 80% of the time?", "is this much more requests than usual?", etc - otherwise you might be off on some wild chase just to find out that load has always been that high, or solve the mystery of some unoptimized code that's been there for ages, without doing anything about the problem in question.

Historical data is the answer, and having used rrdtool with stuff like (customized) cacti and snmpd (with some of my hacks on top) in the past, I was overjoyed when I stumbled upon the graphite project at some point.

From then on, I've strived to collect as many metrics as possible, to be able to look at the history of anything I want (and lots of values can be a reasonable symptom of the actual problem), without any kind of limitations.
carbon-cache does magic by batching writes, and carbon-aggregator does a great job of relieving you from having to push aggregate metrics along with granular ones or sum all these on graphs.

Initially I started using it with just collectd (and still do), but there was a need for something to convert metric names to a graphite hierarchy.

After looking over quite a few collectd-carbon bridge solutions, I decided to use bucky, with a few fixes of my own and a quite large translation config.

Bucky can work anywhere, just receiving data from the collectd network plugin; it understands collectd types and properly translates counter increments to N/s rates. It also includes a statsd daemon, which is brilliant at handling data from non-collector daemons and scripts, and a more powerful metricsd implementation.
The downside is that it's only maintained in forks, has bugs in less-used code (like metricsd), is quite resource-hungry (but can easily be scaled out), and there's a kinda-official collectd-carbon plugin now (although I found it buggy as well, not to mention much less featureful, but hopefully that'll be addressed in future collectd versions).

Some of the problems I've noticed with such collectd setup:

  • Disk I/O metrics are godawful or just don't work - collected read/write metrics, either for processes or devices, are either zeroes, have weird values detached from reality (judging by actual problems and what tools like atop and sysstat provide), or are just useless.
  • Lots of metrics for network and memory (vmem, slab) and from various plugins have naming inconsistent with linux /proc or documentation names.
  • Some useful metrics that are in, say, sysstat don't seem to work with collectd, like sensor data, nfsv4, and some paging and socket counters.
  • Some metrics need non-trivial post-processing to be useful - disk utilization % time is one good example.
  • Python plugins leak memory on every returned value. Some plugins (ping, for example) make collectd segfault several times a day.
  • Some of the most useful info is the metrics from per-service cgroup hierarchies, created by systemd - there you can compare resource usage of various user-space components, pinpointing exactly what caused the spikes on all the other graphs at some point in time.
  • The second most useful info by far is produced from logs, and while collectd has a damn powerful tail plugin, I still found it too limited or just too complicated to use, while simple log-tailing code does a better job and is actually simpler, due to a more powerful language than collectd's configuration. Same problem with the table plugin and /proc.
  • There's still a need for a large post-processing chunk of code to push the values to carbon.

Of course, I wanted to add systemd cgroup metrics, some log values and the missing (and just properly-named) /proc tables data, and initially I wrote a collectd plugin for that. It worked, leaked memory, occasionally crashed (with collectd itself), used some custom data types, had to have some metric-name post-processing code chunk in bucky...
Um, what the hell for, when sending a metric directly takes just "echo some.metric $val $(printf %(%s)T -1) >/dev/tcp/carbon_host/2003"?

So off with collectd for all the custom metrics.

Wrote a simple "while True: collect_and_send() && sleep(till_deadline);" loop in python, along with cgroup data collectors (there are even proper "block io" and "syscall io" per-service values!), a log tailer and a sysstat data processor (mainly for disk and network metrics, which have batshit-crazy values in collectd plugins).

Another interesting data-collection alternative I've explored recently is ganglia.
Redundant gmond collectors and aggregators, communicating efficiently over multicast are nice. It has support for python plugins, and is very easy to use - pulling data from gmond node network can be done with one telnet or nc command, and it's fairly comprehensible xml, not some binary protocol. Another nice feature is that it can re-publish values only on some significant changes (where you define what "significant" is), thus probably eliminating traffic for 90% of "still 0" updates.
But as I found out while trying to use it as a collectd replacement (forwarding data to graphite through amqp via custom scripts), there's a fatal flaw - gmond plugins can't handle a dynamic number of values, so writing a plugin that collects metrics from systemd services' cgroups, without knowing in advance how many of these will be started, is just impossible.
Also it has no concept of timestamps for values - only "current" ones, making plugins like a "sysstat data parser" impossible to implement as well.
collectd, in contrast, has no constraint on how many values a plugin returns and has timestamps, though with limitations on how far backwards they can go.

Pity, gmond looked like a nice, solid and resilient thing otherwise.

I still like the idea of piping graphite metrics through AMQP (like rocksteady does), routing them there not only to graphite, but also to some proper threshold-monitoring daemon like shinken (basically nagios, but distributed and more powerful), with alerts, escalations, trending and flapping detection, etc, but most of the existing solutions all seem to use graphite and whisper directly, which seems kinda wasteful.

Looking forward, I'm actually deciding between replacing collectd completely for the few most basic metrics it now collects - pulling them from sysstat or just /proc directly - or maybe integrating my collectors back into collectd as plugins, extending collectd-carbon as needed and using collectd threshold monitoring and matches/filters to generate and export events to nagios/shinken... somehow the first option seems more effort-effective, even in the long run, but then maybe I should just work more with collectd upstream, not hack around it.

Nov 12, 2011

Running stuff like firefox, flash and skype with apparmor

Should've done it a long time ago, actually. I was totally sure it'd be a much harder task, but then recently I had some spare time and decided to do something about this binary crap, and looking for possible solutions, stumbled upon apparmor.

A while ago I used SELinux (which was the reason why I thought it'd have to be hard) and kinda considered LSM-based security the kind of heavy-handed no-nonsense shit you choose NOT to deal with if you have such a choice, but apparmor totally proved this to be a silly misconception, which I'm insanely happy about.
With apparmor, it's just one file with a set of permissions, which can be loaded/enforced/removed at runtime, no xattrs (and associated maintenance burden) or huge and complicated policies like SELinux has.
For good whole-system security SELinux still seems to be the better approach, but not for confining a few crappy apps on an otherwise general system.
On top of that, it's also trivially easy to install on a general system - only kernel LSM and one userspace package needed.

Case in point - the skype apparmor profile, which doesn't allow it to access anything but ~/.Skype, /opt/skype and a few other system-wide things:

#include <tunables/global>
/usr/bin/skype {
  #include <abstractions/base>
  #include <abstractions/user-tmp>
  #include <abstractions/pulse>
  #include <abstractions/nameservice>
  #include <abstractions/ssl_certs>
  #include <abstractions/fonts>
  #include <abstractions/X>
  #include <abstractions/>
  #include <abstractions/kde>
  #include <abstractions/site/base>
  #include <abstractions/site/de>

  /usr/bin/skype mr,
  /opt/skype/skype pix,
  /opt/skype/** mr,
  /usr/share/fonts/X11/** m,

  @{PROC}/*/net/arp r,
  @{PROC}/sys/kernel/ostype r,
  @{PROC}/sys/kernel/osrelease r,

  /dev/ r,
  /dev/tty rw,
  /dev/pts/* rw,
  /dev/video* mrw,

  @{HOME}/.Skype/ rw,
  @{HOME}/.Skype/** krw,

  deny @{HOME}/.mozilla/ r, # no idea what it tries to get there
  deny @{PROC}/[0-9]*/fd/ r,
  deny @{PROC}/[0-9]*/task/ r,
  deny @{PROC}/[0-9]*/task/** r,

"deny" lines here are just to suppress audit warnings about these paths - everything is denied by default unless explicitly allowed.

Compared to the "default" linux DAC-only "as user" confinement - where it has access to all your documents, activities, smartcard, gpg keys and processes, ssh keys and sessions, etc - it's a huge improvement.

An even more useful confinement is for firefox and its plugin-container process (which can - and does, in my configuration - have a separate profile), where the known-to-be-extremely-exploitable adobe flash player runs.
Before apparmor, I mostly relied on the FlashBlock extension to keep flash in check somehow, but at some point I noticed that plugin-container seems to be running regardless of FlashBlock and whether flash is displayed on pages or not. I don't know if it's just a warm-start, check-run or something, but it still looks like a possible hole.
Aforementioned profiles (among others) can be found here.
I'm actually quite surprised that I failed to find functional profiles for common apps like firefox and pulseaudio on the internets, aside from some blog posts like this one.
In theory, Ubuntu and SUSE should have these, since apparmor is developed and deployed there by default (afaik), so maybe google just hasn't picked these files up in the package manifests, and all I needed was to go over them by hand. Not sure if that'd have been much faster or more productive than writing them myself though.

Oct 23, 2011

dm-crypt password caching between dracut and systemd, systemd password agent

Update 2015-11-25: with "ask-password" caching implemented as of systemd-227 (2015-10-07), better way would be to use that in-kernel caching, though likely requires systemd running in initramfs (e.g. dracut had that for a while).

Up until now I've used lvm on top of single full-disk dm-crypt partition.
It seems easiest to work with - no need to decrypt individual lv's, no confusion between what's encrypted (everything but /boot!) and what's not, etc.
The main problem with it though is that it's harder to have non-encrypted parts, everything is encrypted with the same keys (unless there are several dm-crypt layers), and it's bad for SSDs - dm-crypt still (as of 3.0) doesn't pass any TRIM requests through, leading to a nasty write amplification effect, even more so with the full disk given to dm-crypt+lvm.
While there's hope that the SSD issues will be kinda-solved (with an optional security trade-off) in 3.1, it's still much easier to keep different distros or some decrypted-when-needed partitions with dm-crypt on top of lvm, so I've decided to go with the latter layout for the new 120G SSD.
Also, such a scheme allows to re-create encrypted lvs, issuing TRIM for the old ones, thus recycling the blocks even w/o support for this in dm-crypt.
Same as with the previous initramfs, I've had a simple "openct" module (udev there makes it even easier) in dracut to find an inserted smartcard and use it to obtain the encryption key, which is used once to decrypt the only partition on which everything resides.
Since the only goal of dracut is to find root and get-the-hell-outta-the-way, it won't even try to decrypt all the /var and /home stuff without serious ideological changes.
The problem is actually solved in generic distros by plymouth, which gets the password(s), caches them, and provides them to dracut and systemd (or whatever comes as the real "init"). I don't need a splash, and actually hate it for hiding all the info that scrolls in its place, so plymouth is a no-go for me.

Having a hack to obtain and cache the key for dracut by non-conventional means anyway, I just needed to pass it further to systemd, and since they share a common /run tmpfs these days, that basically means not rm'ing it in dracut after use.

Luckily, the system-wide password handling mechanism in systemd is well-documented and easily extensible beyond plymouth and the default console prompt.
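In a nutshell (per the systemd password agents spec; the paths/names below are illustrative): each pending query drops an ini-ish ask.* file into /run/systemd/ask-password, naming an AF_UNIX socket where a single datagram - "+" plus the password for success, "-" for cancellation - should be sent:

```shell
# pull the reply-socket path out of an ask.* request file
parse_socket() { awk -F= '$1 == "Socket" {print $2; exit}' "$1"; }

req=$(mktemp)   # stand-in for /run/systemd/ask-password/ask.XXXX
printf '[Ask]\nPID=1\nSocket=/run/systemd/ask-password/sck.1\n' >"$req"
parse_socket "$req"
# -> /run/systemd/ask-password/sck.1
# reply (e.g. via socat): printf '+%s' "$pass" | socat - "UNIX-SENDTO:$sock"
```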

So the whole key management in my system goes like this now:

  • dracut.cmdline: create udev rule to generate key.
  • dracut.udev.openct: find smartcard, run rule to generate and cache key in /run/initramfs.
  • dracut.udev.crypt: check for cached key or prompt for it (caching result), decrypt root, run systemd.
  • systemd: start post-dracut-crypt.path unit to monitor /run/systemd/ask-password for password prompts, along with default .path units for fallback prompts via wall/console.
  • systemd.udev: discover encrypted devices, create key requests.
  • start post-dracut-crypt.service to read cached passwords from /run/initramfs and use these to satisfy requests.
  • (after is activated): stop post-dracut-crypt.service, flush caches, generate new one-time keys for decrypted partitions.
The end result is a passwordless boot with this new layout, which seems to be spoofable only by getting root during that process somehow, with altering the unencrypted /boot to run some extra code and revert it back being the most obvious possibility.
It's kinda weird that there doesn't seem to be any caching in place already - surely not everyone with dm-crypt is using plymouth?

The most complicated piece here is probably the password agent (in python), which actually could've been simpler if I hadn't followed the proper guidelines and had instead thought a bit around them.

For example, the whole inotify handling thing (which I've used via ctypes) could be dropped in favor of a .path unit with a DirectoryNotEmpty= activation condition - it's there already; PolicyKit authorization just doesn't work at such an early stage; there doesn't seem to be much need to check request validity, since sending replies to the sockets is racy anyway; etc.
Still, a good exercise.
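For reference, the .path-based alternative mentioned above could look something like this - a sketch, not the actual shipped unit:

```ini
# post-dracut-crypt.path (sketch): start the agent service as soon as
# /run/systemd/ask-password gets its first request file
[Unit]
DefaultDependencies=no
Conflicts=shutdown.target

[Path]
DirectoryNotEmpty=/run/systemd/ask-password
Unit=post-dracut-crypt.service
```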

Python password agent for systemd. Unit files to start and stop it on demand.

Sep 16, 2011

Detailed process memory accounting, including shared and swapped memory

Two questions:

  • How to tell which pids (or groups of forks) eat most swap right now?
  • How much RAM one apache/php/whatever really consumes?

Somehow people keep pointing me at "top" and "ps" tools to do this sort of thing, but there's an obvious problem:

#include <stdlib.h>

#define G (1024L * 1024 * 1024)

int main (void) {
    /* 2 GiB allocated, but never touched - no pages actually get committed */
    malloc(2 * G);
    return 0;
}
This code will immediately float to 1st position in top, sorted by "swap" (F p <return>), showing 2G even with no swap in the system.

The second question/issue is also common, but somehow not universally recognized. It comes up when scared admins (or whoever happens to ssh into a web backend machine) see N pids of something summing up to more than the total amount of RAM in the system - like 50 httpd processes at 50M each.
It gets even worse when tools like "atop" helpfully aggregate the numbers ("atop -p"), showing e.g. 6 sphinx processes eating 15G on a machine with 4-6G physical RAM + 4-8G swap, causing local panic and mayhem.
The answer is, of course, that sphinx, apache and pretty much anything using worker processes share a lot of memory pages between their processes, and not just because of shared objects like libc.

Guess it's just general ignorance of how memory works in linux (or other unix-like OSes) among those who never had to write a fork() or deal with malloc in C, which kinda makes lots of these concepts look fairly trivial.

So, more out of curiosity than real need, I decided to find a way to answer these questions.
proc(5) reveals this data more-or-less via "maps" / "smaps" files, but that needs some post-processing to give per-pid numbers.
The closest tools I was able to find were pmap from the procps package and a script from the coreutils maintainer. The former seems to give only mapped memory region sizes; the latter cleverly shows shared memory divided by the number of similar processes, omitting per-process numbers and swap.
Oh, and of course there are the glorious valgrind and gdb, but both are active debugging tools, not much suited for normal day-to-day operation and a bit too complex for the task.
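For illustration, the core of that post-processing is just summing per-mapping counters from smaps; a minimal sketch (the field selection here is mine, and the actual script does more, like grouping by mapping path):

```python
# Sum the private/shared/swap counters across all mappings in a
# /proc/<pid>/smaps dump - a deliberately simplified take on the idea.
import re

def smaps_totals(smaps_text):
    totals = dict(private=0, shared=0, swap=0)
    for line in smaps_text.splitlines():
        m = re.match(r'(\w+):\s+(\d+) kB$', line)
        if not m:
            continue  # mapping header lines, VmFlags, etc
        key, kb = m.group(1), int(m.group(2))
        if key in ('Private_Clean', 'Private_Dirty'):
            totals['private'] += kb
        elif key in ('Shared_Clean', 'Shared_Dirty'):
            totals['shared'] += kb
        elif key == 'Swap':
            totals['swap'] += kb
    return totals  # all values in KiB

def pid_totals(pid):
    with open('/proc/{0}/smaps'.format(pid)) as src:
        return smaps_totals(src.read())
```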

So I thought I'd write my own tool for the job to put the matter at rest once and for all, so I can later point people at it and just say "see?" (although I bet it'll never be that simple).

The idea is to group similar processes (by cmd) and show details for each one, like this:

    private: 252.0 KiB
    shared: 712.0 KiB
    swap: 0
      private: 84.0 KiB
      shared: 712.0 KiB
      swap: 0
    -cmdline: /sbin/agetty tty3 38400
      -shared-with: rpcbind, _plutorun, redshift, dbus-launch, acpid, ...
      private: 8.0 KiB
      shared: 104.0 KiB
      swap: 0
      -shared-with: rpcbind, _plutorun, redshift, dbus-launch, acpid, ...
      private: 12.0 KiB
      shared: 548.0 KiB
      swap: 0
      -shared-with: agetty
      private: 4.0 KiB
      shared: 24.0 KiB
      swap: 0
      -shared-with: firefox, redshift, tee, sleep, ypbind, pulseaudio [updated], ...
      private: 0
      shared: 8.0 KiB
      swap: 0
      private: 20.0 KiB
      shared: 0
      swap: 0
      private: 8.0 KiB
      shared: 0
      swap: 0
      private: 24.0 KiB
      shared: 0
      swap: 0
      private: 0
      shared: 0
      swap: 0
      private: 84.0 KiB
      shared: 712.0 KiB
      swap: 0
    -cmdline: /sbin/agetty tty4 38400
      private: 84.0 KiB
      shared: 712.0 KiB
      swap: 0
    -cmdline: /sbin/agetty tty5 38400

So it's obvious that there are 3 agetty processes, which ps will report as 796 KiB RSS:

root 7606 0.0 0.0 3924 796 tty3 Ss+ 23:05 0:00 /sbin/agetty tty3 38400
root 7608 0.0 0.0 3924 796 tty4 Ss+ 23:05 0:00 /sbin/agetty tty4 38400
root 7609 0.0 0.0 3924 796 tty5 Ss+ 23:05 0:00 /sbin/agetty tty5 38400
Each of these, in fact, consumes only 84 KiB of RAM, with 24 KiB more shared between all agettys as the /sbin/agetty binary; the rest of the stuff like ld and libc is shared system-wide (the shared-with list contains pretty much every process in the system), so it won't be freed by killing agetty, and starting 10 more of them will consume ~1 MiB, not ~10 MiB, as "ps" output might suggest.
"top" will show ~3M of "swap" (same with "SZ" in ps) for each agetty, which is also obviously untrue.
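The grouping step itself is trivial - here's a sketch that bins pids by their comm name (the tool proper matches on cmd; this is a simplification):

```python
# Bin pids by process name, read from /proc/<pid>/comm.
import os
from collections import defaultdict

def pids_by_comm():
    groups = defaultdict(list)
    for entry in os.listdir('/proc'):
        if not entry.isdigit():
            continue  # skip non-pid entries like /proc/meminfo
        try:
            with open('/proc/{0}/comm'.format(entry)) as src:
                comm = src.read().strip()
        except (IOError, OSError):
            continue  # process exited mid-scan
        groups[comm].append(int(entry))
    return dict(groups)
```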

More machine-friendly (flat) output might remind you of sysctl:

agetty.-stats.private: 252.0 KiB
agetty.-stats.shared: 712.0 KiB
agetty.-stats.swap: 0
agetty.7606.-stats.private: 84.0 KiB
agetty.7606.-stats.shared: 712.0 KiB
agetty.7606.-stats.swap: 0
agetty.7606.-cmdline: /sbin/agetty tty3 38400
agetty.7606.'/lib/'.-shared-with: ...
agetty.7606.'/lib/'.private: 8.0 KiB
agetty.7606.'/lib/'.shared: 104.0 KiB
agetty.7606.'/lib/'.swap: 0
agetty.7606.'/lib/'.-shared-with: ...

Script. No dependencies needed, apart from python 2.7 or 3.X (works with both w/o conversion).

Some optional parameters are supported:

usage: [-h] [-p] [-s] [-n MIN_VAL] [-f] [--debug] [name]
Detailed process memory usage accounting tool.
positional arguments:
  name           String to look for in process cmd/binary.
optional arguments:
  -h, --help     show this help message and exit
  -p, --private  Show only private memory leaks.
  -s, --swap     Show only swapped-out stuff.
  -n MIN_VAL, --min-val MIN_VAL
            Minimal (non-inclusive) value for tracked parameter
            (KiB, see --swap, --private, default: 0).
  -f, --flat     Flat output.
  --debug        Verbose operation mode.

For example, to find what hogs more than 500K swap in the system:

# --flat --swap -n 500
memcached.-stats.private: 28.4 MiB
memcached.-stats.shared: 588.0 KiB
memcached.-stats.swap: 1.5 MiB
memcached.927.-cmdline: /usr/bin/memcached -p 11211 -l
memcached.927.[anon].private: 28.0 MiB
memcached.927.[anon].shared: 0
memcached.927.[anon].swap: 1.5 MiB
squid.-stats.private: 130.9 MiB
squid.-stats.shared: 1.2 MiB
squid.-stats.swap: 668.0 KiB
squid.1334.-cmdline: /usr/sbin/squid -NYC
squid.1334.[heap].private: 128.0 MiB
squid.1334.[heap].shared: 0
squid.1334.[heap].swap: 660.0 KiB
udevd.-stats.private: 368.0 KiB
udevd.-stats.shared: 796.0 KiB
udevd.-stats.swap: 748.0 KiB

...or what eats more than 20K in agetty pids (should be useful to see which .so or binary "leaks" in a process):

# --private --flat -n 20 agetty
agetty.-stats.private: 252.0 KiB
agetty.-stats.shared: 712.0 KiB
agetty.-stats.swap: 0
agetty.7606.-stats.private: 84.0 KiB
agetty.7606.-stats.shared: 712.0 KiB
agetty.7606.-stats.swap: 0
agetty.7606.-cmdline: /sbin/agetty tty3 38400
agetty.7606.[stack].private: 24.0 KiB
agetty.7606.[stack].shared: 0
agetty.7606.[stack].swap: 0
agetty.7608.-stats.private: 84.0 KiB
agetty.7608.-stats.shared: 712.0 KiB
agetty.7608.-stats.swap: 0
agetty.7608.-cmdline: /sbin/agetty tty4 38400
agetty.7608.[stack].private: 24.0 KiB
agetty.7608.[stack].shared: 0
agetty.7608.[stack].swap: 0
agetty.7609.-stats.private: 84.0 KiB
agetty.7609.-stats.shared: 712.0 KiB
agetty.7609.-stats.swap: 0
agetty.7609.-cmdline: /sbin/agetty tty5 38400
agetty.7609.[stack].private: 24.0 KiB
agetty.7609.[stack].shared: 0
agetty.7609.[stack].swap: 0

Aug 14, 2011

Notification-daemon in python

I've delayed updating the whole libnotify / notification-daemon / notify-python stack for a while now, because notification-daemon got too GNOME-oriented around 0.7, becoming a lot simpler, but sadly dropping lots of the good stuff I've used there.
The default nice-looking theme is gone in favor of black blobs (although the colors are probably subject to gtkrc); it's one-note-at-a-time only, which makes reading them intolerable; configurability was dropped as well - guess the blobs follow some gnome-panel settings now.
Older notification-daemon versions won't build with newer libnotify.
Same problem with notify-python, which seems to be unnecessary now, since its functionality is accessible via introspection and PyGObject (the part known as PyGI before the merge - gi.repository.Notify).
Looking for a more-or-less drop-in replacement, I've found the notipy project, which looked like what I needed, and the best part is that it's python - no need to filter notification requests in a proxy anymore, eliminating some associated complexity.
The project has somewhat different goals though - simplicity, fewer deps and concept separation - so I incorporated notipy (more or less) as a simple NotificationDisplay class into notification-proxy, turning the result into notification-thing (first name that came to mind, not that it matters).
All the rendering is now in python using the PyGObject (gi) / gtk-3.0 toolkit, which seems like a good idea, given that I still have no reason to keep Qt in my system, and gtk-2.0 is obsolete.
Exploring newer Gtk stuff like css styling and honest auto-generated interfaces was fun, although the whole mess seems much harder than expected. Simple things like adding a border, margins or some non-solid background to existing widgets are very complex and totally counter-intuitive - unlike, say, doing the same (even in a totally cross-browser fashion) with html. I also failed to find a way to just draw what I want on arbitrary widgets; looks like it was removed (in favor of GtkDrawable) on purpose.
My (uneducated) guess is that the gtk authors geared toward a "one way to do one thing" philosophy, but unlike the Python motto, they had to ditch the "one *obvious* way" part. But then, maybe it's just me being too lazy to read the docs properly.
All the previous features like filtering and rate-limiting are there.

Looking over the Desktop Notifications Spec in the process, I've noticed that there are more good ideas in there that I'm not using, so I guess I might need to revisit my local notification setup in the near future.

Jun 12, 2011

Using csync2 for security-sensitive paths

Usually I was using fabric to clone similar stuff to many machines, but since I've been deploying csync2 everywhere to sync some web templates, and I'm not the only one introducing changes, it occurred to me that it'd be great to use it for scripts as well.
The problem I see there is security - most scripts I need to sync are cronjobs executed as root, so updating some script on one (compromised) machine with "rm -Rf /*" and running csync2 to push this change to the other machines would cause a lot of trouble.

So I came up with a simple way to provide one-time keys to csync2 hosts, valid only when I want them to be.

The idea is to create a FIFO in place of the key on the remote hosts, then just pipe a key into each socket while the script is running on my dev machine. The simplest form of such a "pipe" I could come up with is "ssh host 'cat >remote.key.fifo'" - no fancy sockets, queues or protocols.
That way, even if one host is compromised, changes can't be propagated to the other hosts without access to the fifo sockets there and knowledge of the right key. Plus, accidentally running a sync for that "privileged" group will just hang until the script pushes data to the fifo socket - nothing will break down or crash horribly, it'll just wait.
The key can be spoofed of course, and a sync can be timed to the moment the keys are available, so the method is far from perfect, but it's insanely fast and convenient.
The implementation is a fairly simple twisted eventloop, spawning ssh processes (guess twisted.conch or stuff like paramiko could be used for the ssh part there, but neither performance nor flexibility is an issue with the ssh binary).
The script also (by default) figures out the hosts to connect to from the provided group name(s) and the local copy of the csync2 configuration file, so I don't have to keep a separate list of these or specify them each time.
As always, twisted makes it insanely simple to write such an IO-parallel loop.
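The fifo mechanics can be shown locally - open() blocks until both ends are present, which is exactly the "sync just waits" behavior described above (function names here are made up for the sketch):

```python
# One-time key delivery through a fifo sitting where csync2 expects its key.
import os

def make_key_fifo(path):
    # Remote-host side: replace the key file with a fifo, so any read
    # of it blocks until the actual key gets piped in
    if os.path.exists(path):
        os.unlink(path)
    os.mkfifo(path, 0o600)

def serve_key_once(path, key):
    # Pushing side (the post does this remotely via "ssh host 'cat >fifo'"):
    # open() blocks here until some reader opens the fifo as well
    with open(path, 'w') as dst:
        dst.write(key)
```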

csync2 can be configured like this:

group sbin_sync {
    host host1 host2;
    key /var/lib/csync2/privileged.key;
    include /usr/local/sbin/*.sh;
}

And then I just run it with something like "./ sbin_sync" when I need to replicate updates between hosts.

