Dec 08, 2015

GHG - simplier GnuPG (gpg) replacement for file encryption

Have been using gpg for many years now, many times a day, as I keep lot of stuff in .gpg files, but still can't seem to get used to its quirky interface and practices.

Most notably, it's "trust" thing, keyrings and arcane key editing, expiration dates, gpg-agent interaction and encrypted keys are all sources of dread and stress for me.

Last drop, following the tradition of many disastorous interactions with the tool, was me loosing my master signing key password, despite it being written down on paper and working before. #fail ;(

Certainly my fault, but as I'll be replacing the damn key anyway, why not throw out the rest of that incomprehensible tangle of pointless and counter-productive practices and features I never use?

Took ~6 hours to write a replacement ghg tool - same thing as gpg, except with simple and sane key management (which doesn't assume you entering anything, ever!!!), none of that web-of-trust or signing crap, good (and non-swappable) djb crypto, and only for file encryption.

Does everything I've used gpg for from the command-line, and has one flat file for all the keys, so no more hassle with --edit-key nonsense.

Highly suggest to anyone who ever had trouble and frustration with gpg to check ghg project out or write their own (trivial!) tool, and ditch the old thing - life's too short to deal with that constant headache.

Dec 07, 2015

Resizing first FAT32 partition to microSD card size on boot from Raspberry Pi

One other thing I've needed to do recently is to have Raspberry Pi OS resize its /boot FAT32 partition to full card size (i.e. "make it as large as possible") from right underneath itself.

RPis usually have first FAT (fat16 / fat32 / vfat) partition needed by firmware to load config.txt and uboot stuff off, and that is the only partition one can see in Windows OS when plugging microSD card into card-reader (which is a kinda arbitrary OS limitation).

Map of the usual /dev/mmcblk0 on RPi (as seen in parted):

Number  Start   End     Size    Type     File system  Flags
        32.3kB  1049kB  1016kB           Free Space
 1      1049kB  106MB   105MB   primary  fat16        lba
 2      106MB   1887MB  1782MB  primary  ext4

Resizing that first partition is naturally difficult, as it is followed by ext4 one with RPi's OS, but when you want to have small (e.g. <2G) and easy-to-write "rpi.img" file for any microSD card, there doesn't seem to be a way around that - img have to have as small initial partitions as possible to fit on any card.

Things get even more complicated by the fact that there don't seem to be any tools around for resizing FAT fs'es, so it has to be re-created from scratch.

There is quite an easy way around all these issues however, which can be summed-up as a sequence of the following steps:

  • Start while rootfs is mounted read-only or when it can be remounted as such, i.e. on early boot.

    Before=systemd-remount-fs.service in systemd terms.

  • Grab sfdisk/parted map of the microSD and check if there's "Free Space" chunk left after last (ext4/os) partition.

    If there is, there's likely a lot of it, as SD cards increase in 2x size factors, so 4G image written on larger card will have 4+ gigs there, in fact a lot more for 16G or 32G cards.

    Or there can be only a few megs there, in case of matching card size, where it's usually a good idea to make slightly smaller images, as actual cards do vary in size a bit.

  • "dd" whole rootfs to the end of the microSD card.

    This is safe with read-only rootfs, and dumb "dd" approach to copying it (as opposed to dmsetup + mkfs + cp) seem to be simpliest and least error-prone.

  • Update partition table to have rootfs in the new location (at the very end of the card) and boot partition covering rest of the space.

  • Initiate reboot, so that OS will load from the new rootfs location.

  • Starting on early-boot again, remount rootfs rw if necessary, temporary copy all contents of boot partition (which should still be small) to rootfs.

  • Run mkfs.vfat on the new large boot partition and copy stuff back to it from rootfs.

  • Reboot once again, in case whatever boot timeouts got triggered.

  • Avoid running same thing on all subsequent boots.

    E.g. touch /etc/boot-resize-done and have ConditionPathExists=!/etc/boot-resize-done in the systemd unit file.

That should do it \o/

resize-rpi-fat32-for-card (in fgtk repo) is a script I wrote to do all of this stuff, exactly as described.

systemd unit file for the thing (can also be printed by running script with "--print-systemd-unit" option):

Before=systemd-remount-fs.service -.mount



It does use lsblk -Jnb JSON output to get rootfs device and partition, and get whether it's mounted read-only, then parted -ms /dev/... unit B print free to grab machine-readable map of the device.

sfdisk -J (also JSON output) could've been better option than parted (extra dep, which is only used to get that one map), except it doesn't conveniently list "free space" blocks and device size, pity.

If partition table doesn't have extra free space at the end, "fsstat" tool from sleuthkit is used to check whether FAT filesystem covers whole partition and needs to be resized.

After that, and only if needed, either "dd + sfdisk" or "cp + mkfs.vfat + cp back" sequence gets executed, followed by a reboot command.

Extra options for the thing:

  • "--skip-ro-check" - don't bother checkin/forcing read-only rootfs before "dd" step, which should be fine, if there's no activity there (e.g. early boot).

  • "--done-touch-file" - allows to specify location of file to create (if missing) when "no resize needed" state gets reached.

    Script doesn't check whether this file exists and always does proper checks of partition table and "fsstat" when deciding whether something has to be done, only creates the file at the end (if it doesn't exist already).

  • "--overlay-image" uses splash.go tool that I've mentioned earlier (be sure to compile it first, ofc) to set some "don't panic, fs resize in progress" image (resized/centered and/or with text and background) during the whole process, using RPi's OpenVG GPU API, covering whatever console output.

  • Misc other stuff for setup/debug - "--print-systemd-unit", "--debug", "--reboot-delay".

    Easy way to debug the thing with these might be to add StandardOutput=tty to systemd unit's Service section and ... --debug --reboot-delay 60 options there, or possibly adding extra ExecStart=/bin/sleep 60 after the script (and changing its ExecStart= to ExecStart=-, so delay will still happen on errors).

    This should provide all the info on what's happening in the script (has plenty of debug output) to the console (one on display or UART).

One more link to the script: resize-rpi-fat32-for-card

Sep 04, 2015

Parsing OpenSSH Ed25519 keys for fun and profit

Adding key derivation to git-nerps from OpenSSH keys, needed to get the actual "secret" or something deterministically (plus in an obvious and stable way) derived from it (to then feed into some pbkdf2 and get the symmetric key).

Idea is for liteweight ad-hoc vms/containers to have a single "master secret", from which all others (e.g. one for git-nerps' encryption) can be easily derived or decrypted, and omnipresent, secure, useful and easy-to-generate ssh key in ~/.ssh/id_ed25519 seem to be the best candidate.

Unfortunately, standard set of ssh tools from openssh doesn't seem to have anything that can get key material or its hash - next best thing is to get "fingerprint" or such, but these are derived from public keys, so not what I wanted at all (as anyone can derive that, having public key, which isn't secret).

And I didn't want to hash full openssh key blob, because stuff there isn't guaranteed to stay the same when/if you encrypt/decrypt it or do whatever ssh-keygen does.

What definitely stays the same is the values that openssh plugs into crypto algos, so wrote a full parser for the key format (as specified in PROTOCOL.key file in openssh sources) to get that.

While doing so, stumbled upon fairly obvious and interesting application for such parser - to get really and short easy to backup, read or transcribe string which is the actual secret for Ed25519.

I.e. that's what OpenSSH private key looks like:


And here's the only useful info in there, enough to restore whole blob above from, in the same base64 encoding:


Latter, of course, being way more suitable for tried-and-true "write on a sticker and glue at the desk" approach. Or one can just have a file with one host key per line - also cool.

That's the 32-byte "seed" value, which can be used to derive "ed25519_sk" field ("seed || pubkey") in that openssh blob, and all other fields are either "none", "ssh-ed25519", "magic numbers" baked into format or just padding.

So rolled the parser from git-nerps into its own tool - ssh-keyparse, which one can run and get that string above for key in ~/.ssh/id_ed25519, or do some simple crypto (as implemented by djb in, not me) to recover full key from the seed.

From output serialization formats that tool supports, especially liked the idea of Douglas Crockford's Base32 - human-readable one, where all confusing l-and-1 or O-and-0 chars are interchangeable, and there's an optional checksum (one letter) at the end:

% ssh-keyparse test-key --base32

% ssh-keyparse test-key --base32-nodashes

base64 (default) is still probably most efficient for non-binary (there's --raw otherwise) backup though.

[ssh-keyparse code link]

Sep 01, 2015

Transparent and easy encryption for files in git repositories

Have been installing things to an OS containers (/var/lib/machines) lately, and looking for proper configuration management in these.

Large-scale container setups use some hard-to-integrate things like etcd, where you have to template configuration from values in these, which is not very convenient and very low effort-to-results ratio (maintenance of that system itself) for "10 service containers on 3 hosts" case.

Besides, such centralized value store is a bit backwards for one-container-per-service case, where most values in such "central db" are specific to one container, and it's much easier to edit end-result configs then db values and then templates and then check how it all gets rendered on every trivial tweak.

Usual solution I have for these setups is simply putting all confs under git control, but leaving all the secrets (e.g. keys, passwords, auth data) out of the repo, in case it might be pulled from on other hosts, by different people and for purposes which don't need these sensitive bits and might leak them (e.g. giving access to contracted app devs).

For more transient container setups, something should definitely keep track of these "secrets" however, as "rm -rf /var/lib/machines/..." is much more realistic possibility and has its uses.

So my (non-original) idea here was to have one "master key" per host - just one short string - with which to encrypt all secrets for that host, which can then be shared between hosts and specific people (making these public might still be a bad idea), if necessary.

This key should then be simply stored in whatever key-management repo, written on a sticker and glued to a display, or something.

Git can be (ab)used for such encryption, with its "filter" facilities, which are generally used for opposite thing (normalization to one style), but are easy to adapt for this case too.

Git filters work by running "clear" operation on selected paths (can be a wildcard patterns like "*.c") every time git itself uses these and "smudge" when showing to user and checking them out to a local copy (where they are edited).

In case of encryption, "clear" would not be normalizing CR/LF in line endings, but rather wrapping contents (or parts of them) into a binary blob, and "smudge" should do the opposite, and gitattributes patterns would match files to be encrypted.

Looking for projects that already do that, found quite a few, but still decided to write my own tool, because none seem have all the things I wanted:

  • Use sane encryption.

    It's AES-CTR in the absolutely best case, and AES-ECB (wtf!?) in some, sometimes openssl is called with "password" on the command line (trivial to spoof in /proc).

    OpenSSL itself is a red flag - hard to believe that someone who knows how bad its API and primitives are still uses it willingly, for non-TLS, at least.

    Expected to find at least one project using AEAD through NaCl or something, but no such luck.

  • Have tool manage gitattributes.

    You don't add file to git repo by typing /path/to/myfile managed=version-control some-other-flags to some config, why should you do it here?

  • Be easy to deploy.

    Ideally it'd be a script, not some c++/autotools project to install build tools for or package to every setup.

    Though bash script is maybe taking it a bit too far, given how messy it is for anything non-trivial, secure and reliable in diff environments.

  • Have "configuration repository" as intended use-case.

So wrote git-nerps python script to address all these.

Crypto there is trivial yet solid PyNaCl stuff, marking files for encryption is as easy as git-nerps taint /what/ever/path and bootstrapping the thing requires nothing more than python, git, PyNaCl (which are norm in any of my setups) and git-nerps key-gen in the repo.

README for the project has info on every aspect of how the thing works and more on the ideas behind it.

I expect it'll have a few more use-case-specific features and convenience-wrapper commands once I'll get to use it in a more realistic cases than it has now (initially).

[project link]

Jan 28, 2015

Sample code for using ST7032I I2C/SMBus driver in Midas LCD with python

There seem to be a surprising lack of python code on the net for this particular device, except for this nice pi-ras blog post, in japanese.
So, to give google some more food and a bit of commentary in english to that post - here goes.

I'm using Midas MCCOG21605C6W-SPTLYI 2x16 chars LCD panel, connected to 5V VDD and 3.3V BeagleBone Black I2C bus:

simple digital clock on lcd

Code for the above LCD clock "app" (python 2.7):

import smbus, time

class ST7032I(object):

  def __init__(self, addr, i2c_chan, **init_kws):
    self.addr, self.bus = addr, smbus.SMBus(i2c_chan)

  def _write(self, data, cmd=0, delay=None):
    self.bus.write_i2c_block_data(self.addr, cmd, list(data))
    if delay: time.sleep(delay)

  def init(self, contrast=0x10, icon=False, booster=False):
    assert contrast < 0x40 # 6 bits only, probably not used on most lcds
    pic_high = 0b0111 << 4 | (contrast & 0x0f) # c3 c2 c1 c0
    pic_low = ( 0b0101 << 4 |
      icon << 3 | booster << 2 | ((contrast >> 4) & 0x03) ) # c5 c4
    self._write([0x38, 0x39, 0x14, pic_high, pic_low, 0x6c], delay=0.01)
    self._write([0x0c, 0x01, 0x06], delay=0.01)

  def move(self, row=0, col=0):
    assert 0 <= row <= 1 and 0 <= col <= 15, [row, col]
    self._write([0b1000 << 4 | (0x40 * row + col)])

  def addstr(self, chars, pos=None):
    if pos is not None:
      row, col = (pos, 0) if isinstance(pos, int) else pos
      self.move(row, col)
    self._write(map(ord, chars), cmd=0x40)

  def clear(self):

if __name__ == '__main__':
  lcd = ST7032I(0x3e, 2)
  while True:
    ts_tuple = time.localtime()
    lcd.addstr(time.strftime('date: %y-%m-%d', ts_tuple), 0)
    lcd.addstr(time.strftime('time: %H:%M:%S', ts_tuple), 1)
Note the constants in the "init" function - these are all from "INITIALIZE(5V)" sequence on page-8 of the Midas LCD datasheet , setting up things like voltage follower circuit, OSC frequency, contrast (not used on my panel), modes and such.
Actual reference on what all these instructions do and how they're decoded can be found on page-20 there.

Even with the same exact display, but connected to 3.3V, these numbers should probably be a bit different - check the datasheet (e.g. page-7 there).

Also note the "addr" and "i2c_chan" values (0x3E and 2) - these should be taken from the board itself.

"i2c_chan" is the number of the device (X) in /dev/i2c-X, of which there seem to be usually more than one on ARM boards like RPi or BBB.
For instance, Beaglebone Black has three I2C buses, two of which are available on the expansion headers (with proper dtbs loaded).
See this post on Fortune Datko blog and/or this one on minix-i2c blog for one way to tell reliably which device in /dev corresponds to which hardware bus and pin numbers.

And the address is easy to get from the datasheet (lcd I have uses only one static slave address), or detect via i2cdetect -r -y <i2c_chan>, e.g.:

# i2cdetect -r -y 2
     0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f
00:          -- -- -- -- -- -- -- -- -- -- -- -- --
10: -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
20: -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
30: -- -- -- -- -- -- -- -- -- -- -- -- -- -- 3e --
40: -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
50: -- -- -- -- UU UU UU UU -- -- -- -- -- -- -- --
60: -- -- -- -- -- -- -- -- 68 -- -- -- -- -- -- --
70: -- -- -- -- -- -- -- --

Here I have DS1307 RTC on 0x68 and an LCD panel on 0x3E address (again, also specified in the datasheet).

Both "i2cdetect" command-line tool and python "smbus" module are part of i2c-tools project, which is developed under lm-sensors umbrella.
On Arch or source-based distros these all come with "i2c-tools" package, but on e.g. debian, python module seem to be split into "python-smbus".

Plugging these bus number and the address for your particular hardware into the script above and maybe adjusting the values there for your lcd panel modes should make the clock show up and tick every second.

In general, upon seeing tutorial on some random blog (like this one), please take it with a grain of salt, because it's highly likely that it was written by a fairly incompetent person (like me), since engineers who deal with these things every day don't see above steps as any kind of accomplishment - it's a boring no-brainer routine for them, and they aren't likely to even think about it, much less write tutorials on it (all trivial and obvious, after all).

Nevertheless, hope this post might be useful to someone as a pointer on where to look to get such device started, if nothing else.

Jul 16, 2014

(yet another) Dynamic DNS thing for tinydns (djbdns)

Tried to find any simple script to update tinydns (part of djbdns) zones that'd be better than ssh dns_update@remote_host, but failed - they all seem to be hacky php scripts, doomed to run behind httpds, send passwords in url, query random "myip" hosts or something like that.

What I want instead is something that won't be making http, tls or ssh connections (and stirring all the crap behind these), but would rather just send udp or even icmp pings to remotes, which should be enough for update, given source IPs of these packets and some authentication payload.

So yep, wrote my own scripts for that - tinydns-dynamic-dns-updater project.

Tool sends UDP packets with 100 bytes of "( key_id || timestamp ) || Ed25519_sig" from clients, authenticating and distinguishing these server-side by their signing keys ("key_id" there is to avoid iterating over them all, checking which matches signature).

Server zone files can have "# dynamic: ts key1 key2 ..." comments before records (separated from static records after these by comments or empty lines), which says that any source IPs of packets with correct signatures (and more recent timestamps) will be recorded in A/AAAA records (depending on source AF) that follow instead of what's already there, leaving anything else in the file intact.

Zone file only gets replaced if something is actually updated and it's possible to use dynamic IP for server as well, using dynamic hostname on client (which is resolved for each delayed packet).

Lossy nature of UDP can be easily mitigated by passing e.g. "-n5" to the client script, so it'd send 5 packets (with exponential delays by default, configurable via --send-delay), plus just having the thing on fairly regular intervals in crontab.

Putting server script into socket-activated systemd service file also makes all daemon-specific pains like using privileged ports (and most other security/access things), startup/daemonization, restarts, auto-suspend timeout and logging woes just go away, so there's --systemd flag for that too.

Given how easy it is to run djbdns/tinydns instance, there really doesn't seem to be any compelling reason not to use your own dynamic dns stuff for every single machine or device that can run simple python scripts.

Github link: tinydns-dynamic-dns-updater

May 12, 2014

My Firefox Homepage

Wanted to have some sort of "homepage with my fav stuff, arranged as I want to" in firefox for a while, and finally got resolve to do something about it - just finished a (first version of) script to generate the thing - firefox-homepage-generator.

Default "grid of page screenshots" never worked for me, and while there are other projects that do other layouts for different stuff, they just aren't flexible enough to do whatever horrible thing I want.

In this particular case, I wanted to experiment with chaotic tag cloud of bookmarks (so they won't ever be in the same place), relations graph for these tags and random picks for "links to read" from backlog.

Result is a dynamic d3 + (interactive example of this layout) page without much style:

homepage screenshot
"Mark of Chaos" button in the corner can fly/re-pack tags around.
Clicking tag shows bookmarks tagged as such and fades all other tags out in proportion to how they're related to the clicked one (i.e. how many links share the tag with others).

Started using FF bookmarks again in a meaningful way only recently, so not much stuff there yet, but it does seem to help a lot, especially with these handy awesome bar tricks.

Not entirely sure how useful the cloud visualization or actually having a homepage would be, but it's a fun experiment and a nice place to collect any useful web-surfing-related stuff I might think of in the future.

Repo link: firefox-homepage-generator

Sep 26, 2013

FAT32 driver in python

Wrote a driver for still common FAT32 recently, while solving the issue with shuffle on cheap "usb stick with microsd slot" mp3 player.

It's kinda amazing how crappy firmware in these things can be.

Guess one should know better than to get such crap with 1-line display, gapful playback, weak battery, rewind at non-accelerating ~3x speed, no ability to pick tracks while playing and plenty of other annoying "features", but the main issue I've had with the thing by far is missing shuffle functionality - it only plays tracks in static order in which they were uploaded (i.e. how they're stored on fs).

Seems like whoever built the thing made it deliberately hard to shuffle the tracks offline - just one sort by name would've made things a lot easier, and it's clear that device reads the full dir listing from the time it spends opening dirs with lots of files.

Most obvious way to do such "offline shuffle", given how the thing orders files, is to re-upload tracks in different order, which is way too slow and wears out flash ram.

Second obvious for me was to dig into FAT32 and just reorder entries there, which is what the script does.

It's based off example of a simplier fat16 parser in construct module repo and processes all the necessary metadata structures like PDRs, FATs (cluster maps) and directory tables with vfat long-name entries inside.

Given that directory table on FAT32 is just an array (with vfat entries linked to dentries after them though), it's kinda easy just to shuffle entries there and write data back to clusters from where it was read.

One less obvious solution to shuffle, coming from understanding how vfat lfn entries work, is that one can actually force fs driver to reorder them by randomizing filename length, as it'll be forced to move longer entries to the end of the directory table.

But that idea came a bit too late, and such parser can be useful for extending FAT32 to whatever custom fs (with e.g. FUSE or 9p interface) or implementing some of the more complex hacks.

It's interestng that fat dentries can (and apparently known to) store unix-like modes and uid/gid instead of some other less-useful attrs, but linux driver doesn't seem to make use of it.

OS'es also don't allow symlinks or hardlinks on fat, while technically it's possible, as long as you keep these read-only - just create dentries that point to the same cluster.

Should probably work for both files and dirs and allow to create multiple hierarchies of the same files, like several dirs where same tracks are shuffled with different seed, alongside dirs where they're separated by artist/album/genre or whatever other tags.

It's very fast and cheap to create these, as each is basically about "(name_length + 32B) * file_count" in size, which is like just 8 KiB for dir structure holding 100+ files.

So plan is to extend this small hack to use mutagen to auto-generate such hierarchies in the future, or maybe hook it directly into beets as an export plugin, combined with transcoding, webui and convenient music-db there.

Also, can finally tick off "write proper on-disk fs driver" from "things to do in life" list ;)

Apr 08, 2013

TCP Hijacking for The Greater Good

As discordian folk celebrated Jake Day yesterday, decided that I've had it with random hanging userspace state-machines, stuck forever with tcp connections that are not legitimately dead, just waiting on both sides.

And since pretty much every tool can handle transient connection failures and reconnects, decided to come up with some simple and robust-enough solution to break such links without (or rather before) patching all the apps to behave.

One last straw was davfs2 failing after a brief net-hiccup, with my options limited to killing everything that uses (and is hanging dead on) its mount, then going kill/remount way.
As it uses stateless http connections, I bet it's not even an issue for it to repeat whatever request it tried last and it sure as hell handles network failures, just not well in all cases.

I've used such technique to test some twisted-things in the past, so it was easy to dig scapy-automata code for doing that, though the real trick is not to craft/send FIN or RST packet, but rather to guess TCP seq/ack numbers to stamp it with.

Alas, none of the existing tools (e.g. tcpkill) seem to do anything clever in this regard.

cutter states that

There is a feature of the TCP/IP protocol that we could use to good effect here - if a packet (other than an RST) is received on a connection that has the wrong sequence number, then the host responds by sending a corrective "ACK" packet back.

But neither the tool itself nor the technique described seem to work, and I actually failed to find (or recall) any mentions (or other uses) of such corrective behavior. Maybe it was so waaay back, dunno.

Naturally, as I can run such tool on the host where socket endpoint is, local kernel has these numbers stored, but apparently no one really cared (or had a legitimate enough use-case) to expose these to the userspace... until very recently, that is.

Recent work of Parallels folks on CRIU landed getsockopt(sk, SOL_TCP, TCP_QUEUE_SEQ, ...) in one the latest mainline kernel releases.
Trick is then just to run that syscall in the pid that holds the socket fd, which looks like a trivial enough task, but looking over crtools (which unfortunately doesn't seem to work with vanilla kernel yet) and ptrace-parasite tricks of compiling and injecting shellcode, decided that it's just too much work for me, plus they share the same x86_64-only codebase, and I'd like to have the thing working on ia32 machines as well.
Caching all the "seen" seq numbers in advance looks tempting, especially since for most cases, relevant traffic is processed already by nflog-zmq-pcap-pipe and Snort, which can potentially dump "(endpoint1-endpoint2, seq, len)" tuples to some fast key-value backend.
Invalidation of these might be a minor issue, but I'm not too thrilled about having some dissection code to pre-cache stuff that's already cached in every kernel anyway.

Patching kernel to just expose stuff via /proc looks like bit of a burden as well, though an isolated module code would probably do the job well. Weird that there doesn't seem to be one of these around already, closest one being tcp_probe.c code, which hooks into tcp_recv code-path and doesn't really get seqs without some traffic either.

One interesting idea that got my attention and didn't require a single line of extra code was proposed on the local xmpp channel - to use tcp keepalives.

Sure, they won't make kernel drop connection when it's userspace that hangs on both ends, with connection itself being perfectly healthy, but every one of these carries a seq number that can be spoofed and used to destroy that "healthy" state.

Pity these are optional and can't be just turned on for all sockets system-wide on linux (unlike some BSD systems, apparently), and nothing uses these much by choice (which can be seen in netstat --timer).

Luckily, there's a dead-simple LD_PRELOAD code of libkeepalive which can be used to enforce system-wide opt-out behavior for these (at least for non-static binaries).
For suid stuff (like mount.davfs, mentioned above), it has to be in /etc/, not just env, but as I need it "just in case" for all the connections, that seems fine in my case.

And tuning keepalives to be frequent-enough seem to be a no-brainer and shouldn't have any effect on 99% of legitimate connections at all, as they probably pass some traffic every other second, not after minutes or hours.

net.ipv4.tcp_keepalive_time = 900
net.ipv4.tcp_keepalive_probes = 5
net.ipv4.tcp_keepalive_intvl = 156

(default is to send empty keepalive packet after 2 hours of idleness)

With that, tool has to run ~7 min on average to kill any tcp connection in the system, which totally acceptable, and no fragile non-portable ptrace-shellcode magic involved (at least yet, I bet it'd be much easier to do in the future).

Code and some docs for the tool/approach can be found on github.

More of the same (update 2013-08-11):

Actually, lacking some better way to send RST/FIN from a machine to itself than swapping MACs (and hoping that router is misconfigured enough to bounce packet "from itself" back) or "-j REJECT --reject-with tcp-reset" (plus a "recent" match or transient-port matching, to avoid blocking reconnect as well), countdown for a connection should be ~7 + 15 min, as only next keepalive will reliably produce RST response.

With a bit of ipset/iptables/nflog magic, it was easy to make the one-time REJECT rule, snatching seq from dropped packet via NFLOG and using that to produce RST for the other side as well.

Whole magic there goes like this:

-A conn_cutter ! -p tcp -j RETURN
-A conn_cutter -m set ! --match-set conn_cutter src,src -j RETURN
-A conn_cutter -p tcp -m recent --set --name conn_cutter --rsource
-A conn_cutter -p tcp -m recent ! --rcheck --seconds 20\
        --hitcount 2 --name conn_cutter --rsource -j NFLOG
-A conn_cutter -p tcp -m recent ! --rcheck --seconds 20\
        --hitcount 2 --name conn_cutter --rsource -j REJECT --reject-with tcp-reset

-I OUTPUT -j conn_cutter

"recent" matcher there is a bit redundant in most cases, as outgoing connections usually use transient-range tcp ports, which shouldn't match for different attempts, but some apps might bind these explicitly.

ipset turned out to be quite a neat thing to avoid iptables manipulations (to add/remove match).

It's interesting that this set of rules handles RST to both ends all by itself if packet arrives from remote first - response (e.g. ACK) from local socket will get RST but won't reach remote, and retransmit from remote will get RST because local port is legitimately closed by then.

Current code allows to optionally specify ipset name, whether to use nflog (via spin-off scapy-nflog-capture driver) or raw sockets, and doesn't do any mac-swapping, only sending RST to remote (which, again, should still be sufficient with frequent-enough keepalives).

Now, if only some decade-old undocumented code didn't explicitly disable these nice keepalives...

Apr 06, 2013

Fighting storage bitrot and decay

Everyone is probably aware that bits do flip here and there in the supposedly rock-solid, predictable and deterministic hardware, but somehow every single data-management layer assumes that it's not its responsibility to fix or even detect these flukes.

Bitrot in RAM is a known source of bugs, but short of ECC, dunno what one can do without huge impact on performance.

Disks, on the other hand, seem to have a lot of software layers above them, handling whatever data arrangement, compression, encryption, etc, and the fact that bits do flip in magnetic media seem to be just as well-known (study1, study2, study3, ...).
In fact, these very issues seem to be the main idea behind well known storage behemoth ZFS.
So it really bugged me for quite a while that any modern linux system seem to be completely oblivious to the issue.

Consider typical linux storage stack on a commodity hardware:

  • You have closed-box proprietary hdd brick at the bottom, with no way to tell what it does to protect your data - aside from vendor marketing pitches, that is.

  • Then you have well-tested and robust linux driver for some ICH storage controller.

    I wouldn't bet that it will corrupt anything at this point, but it doesn't do much else to the data but pass around whatever it gets from the flaky device either.

  • Linux blkdev layer above, presenting /dev/sdX. No checks, just simple mapping.

  • device-mapper.

    Here things get more interesting.

    I tend to use lvm wherever possible, but it's just a convenience layer (or a set of nice tools to setup mappings) on top of dm, no checks of any kind, but at least it doesn't make things much worse either - lvm metadata is fairly redundant and easy to backup/recover.

    dm-crypt gives no noticeable performance overhead, exists either above or under lvm in the stack, and is nice hygiene against accidental leaks (selling or leasing hw, theft, bugs, etc), but lacking authenticated encryption modes it doesn't do anything to detect bit-flips.
    Worse, it amplifies the issue.
    In the most common CBC mode one flipped bit in the ciphertext will affect a few other bits of data until the end of the dm block.
    Current dm-crypt default (since the latest cryptsetup-1.6.X, iirc) is XTS block encryption mode, which somewhat limits the damage, but dm-crypt has little support for changing modes on-the-fly, so tough luck.
    But hey, there is dm-verity, which sounds like exactly what I want, except it's read-only, damn.
    Read-only nature is heavily ingrained in its "hash tree" model of integrity protection - it is hashes-of-hashes all the way up to the root hash, which you specify on mount, immutable by design.

    Block-layer integrity protection is a bit weird anyway - lots of unnecessary work potential there with free space (can probably be somewhat solved by TRIM), data that's already journaled/checksummed by fs and just plain transient block changes which aren't exposed for long and one might not care about at all.

  • Filesystem layer above does the right thing sometimes.

    COW fs'es like btrfs and zfs have checksums and scrubbing, so seem to be a good options.
    btrfs was slow as hell on rotating plates last time I checked, but zfs port might be worth a try, though if a single cow fs works fine on all kinds of scenarios where I use ext4 (mid-sized files), xfs (glusterfs backend) and reiserfs (hard-linked backups, caches, tiny-file sub trees), then I'd really be amazed.

    Other fs'es plain suck at this. No care for that sort of thing at all.

  • Above-fs syscall-hooks kernel layers.

    IMA/EVM sound great, but are also for immutable security ("integrity") purposes ;(

    In fact, this layer is heavily populated by security stuff like LSM's, which I can't imagine being sanely used for bitrot-detection purposes.
    Security tools are generally oriented towards detecting any changes, intentional tampering included, and are bound to produce a lot of false-positives instead of legitimate and actionable alerts.

    Plus, upon detecting some sort of failure, these tools generally don't care about the data anymore acting as a Denial-of-Service attack on you, which is survivable (everything can be circumvented), but fighting your own tools doesn't sound too great.

  • Userspace.

    There is tripwire, but it's also a security tool, unsuitable for the task.

    Some rare discussions of the problem pop up here and there, but alas, I failed to salvage anything useable from these, aside from ideas and links to subject-relevant papers.

Scanning github, bitbucket and xmpp popped up bitrot script and a proof-of-concept md-checksums md layer, which apparently haven't even made it to lkml.

So, naturally, following long-standing "... then do it yourself" motto, introducing fs-bitrot-scrubber tool for all the scrubbing needs.

It should be fairly well-described in the readme, but the gist is that it's just a simple userspace script to checksum file contents and check changes there over time, taking all the signs of legitimate file modifications and the fact that it isn't the only thing that needs i/o in the system into account.

Main goal is not to provide any sort of redundancy or backups, but rather notify of the issue before all the old backups (or some cluster-fs mirrors in my case) that can be used to fix it are rotated out of existance or overidden.

Don't suppose I'll see such decay phenomena often (if ever), but I don't like having the odds, especially with an easy "most cases" fix within grasp.

If I'd keep lot of important stuff compressed (think what will happen if a single bit is flipped in the middle of few-gigabytes .xz file) or naively (without storage specifics and corruption in mind) encrypted in cbc mode (or something else to the same effect), I'd be worried about the issue so much more.

Wish there'd be something common out-of-the-box in the linux world, but I guess it's just not the time yet (hell, there's not even one clear term in the techie slang for it!) - with still increasing hdd storage sizes and much more vulnerable ssd's, some more low-level solution should materialize eventually.

Here's me hoping to raise awareness, if only by a tiny bit.

github project link

← Previous Next → Page 2 of 5
Member of The Internet Defense League