Dec 08, 2015
Have been using gpg for many years now, many times a day, as I keep a lot of stuff
in .gpg files, but still can't seem to get used to its quirky interface and
Most notably, its "trust" thing, keyrings and arcane key editing, expiration
dates, gpg-agent interaction and encrypted keys are all sources of dread and
stress for me.
Last drop, following the tradition of many disastrous interactions with the
tool, was me losing my master signing key password, despite it being written
down on paper and working before. #fail ;(
Certainly my fault, but as I'll be replacing the damn key anyway, why not throw
out the rest of that incomprehensible tangle of pointless and counter-productive
practices and features I never use?
Took ~6 hours to write a replacement ghg tool - same thing as gpg, except with
simple and sane key management (which doesn't assume you'll be entering anything,
ever!!!), none of that web-of-trust or signing crap, good (and non-swappable)
djb crypto, and only for file encryption.
Does everything I've used gpg for from the command-line, and has one flat file
for all the keys, so no more hassle with --edit-key nonsense.
Highly suggest to anyone who ever had trouble and frustration with gpg to check
ghg project out or write their own (trivial!) tool, and ditch the old thing -
life's too short to deal with that constant headache.
Dec 07, 2015
One other thing I've needed to do recently is to have Raspberry Pi OS resize
its /boot FAT32 partition to full card size (i.e. "make it as large as possible")
from right underneath itself.
RPis usually have first FAT (fat16 / fat32 / vfat) partition needed by firmware
to load config.txt and uboot stuff off, and that is the only partition one can
see in Windows OS when plugging microSD card into card-reader (which is a kinda
arbitrary OS limitation).
Map of the usual /dev/mmcblk0 on RPi (as seen in parted):
Number  Start   End     Size    Type     File system  Flags
        32.3kB  1049kB  1016kB           Free Space
 1      1049kB  106MB   105MB   primary  fat16        lba
 2      106MB   1887MB  1782MB  primary  ext4
Resizing that first partition is naturally difficult, as it is followed by ext4
one with RPi's OS, but when you want to have small (e.g. <2G) and easy-to-write
"rpi.img" file for any microSD card, there doesn't seem to be a way around
that - images have to have as small initial partitions as possible to fit on any
Things get even more complicated by the fact that there don't seem to be any
tools around for resizing FAT fs'es, so it has to be re-created from scratch.
There is quite an easy way around all these issues however, which can be
summed-up as a sequence of the following steps:
Start while rootfs is mounted read-only or when it can be remounted as such,
i.e. on early boot.
Before=systemd-remount-fs.service local-fs-pre.target in systemd terms.
Grab sfdisk/parted map of the microSD and check if there's "Free Space" chunk
left after last (ext4/os) partition.
If there is, there's likely a lot of it, as SD cards increase in 2x size
factors, so 4G image written on larger card will have 4+ gigs there, in fact a
lot more for 16G or 32G cards.
Or there can be only a few megs there, in case of matching card size, where
it's usually a good idea to make slightly smaller images, as actual cards do
vary in size a bit.
"dd" whole rootfs to the end of the microSD card.
This is safe with read-only rootfs, and dumb "dd" approach to copying it (as
opposed to dmsetup + mkfs + cp) seems to be the simplest and least error-prone.
Update partition table to have rootfs in the new location (at the very end of
the card) and boot partition covering rest of the space.
Initiate reboot, so that OS will load from the new rootfs location.
Starting on early-boot again, remount rootfs rw if necessary, temporarily copy
all contents of boot partition (which should still be small) to rootfs.
Run mkfs.vfat on the new large boot partition and copy stuff back to it from
Reboot once again, in case whatever boot timeouts got triggered.
Avoid running same thing on all subsequent boots.
E.g. touch /etc/boot-resize-done and have
ConditionPathExists=!/etc/boot-resize-done in the systemd unit file.
That should do it \o/
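The arithmetic behind the "dd to the end, then update partition table" steps can be sketched roughly like this. It is only an illustration: `plan_layout`, the 1 MiB alignment and the overlap check are my assumptions here, not something taken from the script itself.

```python
MiB = 1024 * 1024

def plan_layout(dev_size, root_start, root_size, align=MiB):
    """Return (new_boot_end, new_root_start) in bytes, or None if nothing to do.

    All arguments are byte offsets/sizes, e.g. taken from a partition map.
    """
    # Put rootfs as close to the end of the card as alignment allows.
    new_root_start = (dev_size - root_size) // align * align
    if new_root_start <= root_start:
        return None  # rootfs already sits in the last aligned slot
    if new_root_start < root_start + root_size:
        # Destination overlaps the source, so a plain front-to-back "dd"
        # would overwrite rootfs blocks before they get read.
        raise ValueError('not enough free space for a non-overlapping copy')
    # Boot partition can then grow up to the byte right before new rootfs.
    return new_root_start - 1, new_root_start
```

With a 4G image written to a 16G card, the copy never overlaps, which is the common case described above; the "only a few megs left" case ends up in the `ValueError` branch of this sketch and would need a different copy strategy.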
resize-rpi-fat32-for-card (in fgtk repo) is a script I wrote to do all of
this stuff, exactly as described.
systemd unit file for the thing (can also be printed by running script with
Before=systemd-remount-fs.service -.mount local-fs-pre.target local-fs.target
It does use lsblk -Jnb JSON output to get rootfs device and partition, and
get whether it's mounted read-only, then parted -ms /dev/... unit B print
free to grab machine-readable map of the device.
sfdisk -J (also JSON output) could've been better option than parted (extra
dep, which is only used to get that one map), except it doesn't conveniently
list "free space" blocks and device size, pity.
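For illustration, parsing that machine-readable parted map might look something like the sketch below. It assumes the colon-separated `-m` format ("BYT;" header, one device line, then one line per partition or free chunk); the sample text and the helper names are made up for this example and are not copied from the script.

```python
def parse_parted_ms(output):
    """Parse `parted -ms DEV unit B print free` output into (device, chunks)."""
    lines = [l.strip().rstrip(';') for l in output.splitlines() if l.strip()]
    if len(lines) < 2 or lines[0] != 'BYT':
        raise ValueError('expected "BYT;" header, i.e. sizes in bytes')
    dev_fields = lines[1].split(':')
    device = dict(path=dev_fields[0], size=int(dev_fields[1].rstrip('B')))
    chunks = []
    for line in lines[2:]:
        f = line.split(':')
        start, end, size = (int(v.rstrip('B')) for v in f[1:4])
        free = f[4] == 'free'  # free-space chunks carry "free" instead of fs type
        chunks.append(dict(
            number=None if free else int(f[0]),
            start=start, end=end, size=size, free=free,
            fs=None if free else f[4] ))
    return device, chunks

def free_space_at_end(chunks):
    """Bytes of free space after the last partition, 0 if there is none."""
    if chunks and chunks[-1]['free']: return chunks[-1]['size']
    return 0

# Hypothetical sample for a ~2G image written to a 16G card
SAMPLE = '''BYT;
/dev/mmcblk0:15931539456B:sd/mmc:512:512:msdos:SD SL16G:;
1:32256B:1048575B:1016320B:free;
1:1048576B:105906175B:104857600B:fat16::lba;
2:105906176B:1887436799B:1781530624B:ext4::;
1:1887436800B:15931539455B:14044102656B:free;
'''

if __name__ == '__main__':
    device, chunks = parse_parted_ms(SAMPLE)
    print(device['path'], free_space_at_end(chunks))
```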
If partition table doesn't have extra free space at the end, "fsstat" tool from
sleuthkit is used to check whether FAT filesystem covers whole partition and
needs to be resized.
After that, and only if needed, either "dd + sfdisk" or "cp + mkfs.vfat + cp
back" sequence gets executed, followed by a reboot command.
Extra options for the thing:
"--skip-ro-check" - don't bother checking/forcing read-only rootfs before "dd"
step, which should be fine, if there's no activity there (e.g. early boot).
"--done-touch-file" - allows to specify location of file to create (if
missing) when "no resize needed" state gets reached.
Script doesn't check whether this file exists and always does proper checks of
partition table and "fsstat" when deciding whether something has to be done,
only creates the file at the end (if it doesn't exist already).
"--overlay-image" uses splash.go tool that I've mentioned earlier (be
sure to compile it first, ofc) to set some "don't panic, fs resize in
progress" image (resized/centered and/or with text and background) during the
whole process, using RPi's OpenVG GPU API, covering whatever console output.
Misc other stuff for setup/debug - "--print-systemd-unit", "--debug",
Easy way to debug the thing with these might be to add StandardOutput=tty
to systemd unit's Service section and ... --debug --reboot-delay 60
options there, or possibly adding extra ExecStart=/bin/sleep 60 after the
script (and changing its ExecStart= to ExecStart=-, so delay will
still happen on errors).
This should provide all the info on what's happening in the script (has plenty
of debug output) to the console (one on display or UART).
One more link to the script: resize-rpi-fat32-for-card
Sep 04, 2015
Adding key derivation to git-nerps from OpenSSH keys, needed to get the actual
"secret" or something deterministically (plus in an obvious and stable way)
derived from it (to then feed into some pbkdf2 and get the symmetric key).
Idea is for lightweight ad-hoc vms/containers to have a single "master secret",
from which all others (e.g. one for git-nerps' encryption) can be easily derived
or decrypted, and the omnipresent, secure, useful and easy-to-generate ssh key in
~/.ssh/id_ed25519 seems to be the best candidate.
Unfortunately, standard set of ssh tools from openssh doesn't seem to have
anything that can get key material or its hash - next best thing is to get
"fingerprint" or such, but these are derived from public keys, so not what I
wanted at all (as anyone can derive that, having public key, which isn't
And I didn't want to hash full openssh key blob, because stuff there isn't
guaranteed to stay the same when/if you encrypt/decrypt it or do whatever
What definitely stays the same is the values that openssh plugs into crypto
algos, so wrote a full parser for the key format (as specified in PROTOCOL.key
file in openssh sources) to get that.
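The "feed into some pbkdf2" step can be sketched with nothing but the python stdlib. Salt prefix, round count and the `derive_key` name below are arbitrary picks for this example (python 3), not the values git-nerps actually uses.

```python
import hashlib

def derive_key(secret, purpose, rounds=100000, length=32):
    """Derive a purpose-specific symmetric key from one master secret.

    secret: raw key material as bytes (e.g. 32-byte ed25519 seed).
    purpose: short label, so that different uses get unrelated keys.
    """
    # Hypothetical salt scheme - fixed prefix plus the purpose label
    salt = b'example-kdf-salt:' + purpose.encode()
    return hashlib.pbkdf2_hmac('sha256', secret, salt, rounds, dklen=length)
```

Same secret and purpose always give the same key, which is the "deterministic and stable" property wanted above, while changing the purpose label gives an unrelated one.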
While doing so, stumbled upon fairly obvious and interesting application for
such parser - to get a really short and easy to backup, read or transcribe
string which is the actual secret for Ed25519.
I.e. that's what OpenSSH private key looks like:
-----BEGIN OPENSSH PRIVATE KEY-----
-----END OPENSSH PRIVATE KEY-----
And here's the only useful info in there, enough to restore whole blob above
from, in the same base64 encoding:
Latter, of course, being way more suitable for tried-and-true "write on a
sticker and glue at the desk" approach.
Or one can just have a file with one host key per line - also cool.
That's the 32-byte "seed" value, which can be used to derive "ed25519_sk" field
("seed || pubkey") in that openssh blob, and all other fields are either "none",
"ssh-ed25519", "magic numbers" baked into format or just padding.
So rolled the parser from git-nerps into its own tool - ssh-keyparse, which
one can run and get that string above for key in ~/.ssh/id_ed25519, or do some
simple crypto (as implemented by djb in ed25519.py, not me) to recover full
key from the seed.
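A stripped-down version of such a parser, for unencrypted ed25519 keys only, might look like this (python 3). It follows the field order from PROTOCOL.key as I understand it, and is not the actual ssh-keyparse code. `build_demo_pem` is a helper invented for this example, which makes a synthetic blob with a placeholder "pubkey" (not derived from the seed), just to have something to parse without pasting a real key.

```python
import base64, struct

MAGIC = b'openssh-key-v1\0'

def _read_string(buf, pos):
    # uint32 big-endian length, followed by that many bytes
    (n,) = struct.unpack_from('>I', buf, pos)
    pos += 4
    if pos + n > len(buf): raise ValueError('truncated blob')
    return buf[pos:pos+n], pos + n

def _string(data):
    return struct.pack('>I', len(data)) + data

def pem_to_blob(pem_text):
    lines = pem_text.strip().splitlines()
    return base64.b64decode(''.join(l for l in lines if not l.startswith('-----')))

def ed25519_seed(blob):
    """Return 32-byte seed from an unencrypted openssh-key-v1 blob."""
    if not blob.startswith(MAGIC): raise ValueError('not an openssh-key-v1 blob')
    pos = len(MAGIC)
    cipher, pos = _read_string(blob, pos)
    kdf, pos = _read_string(blob, pos)
    kdf_opts, pos = _read_string(blob, pos)
    if (cipher, kdf) != (b'none', b'none'):
        raise ValueError('key is passphrase-protected, decrypt it first')
    (nkeys,) = struct.unpack_from('>I', blob, pos)
    pos += 4
    if nkeys != 1: raise ValueError('expected exactly one key')
    pub_blob, pos = _read_string(blob, pos)
    priv, pos = _read_string(blob, pos)
    # Private section: two identical check-ints, key type, pubkey, seed || pubkey
    check1, check2 = struct.unpack_from('>II', priv, 0)
    if check1 != check2: raise ValueError('check-int mismatch')
    key_type, p = _read_string(priv, 8)
    if key_type != b'ssh-ed25519': raise ValueError('not an ed25519 key')
    pubkey, p = _read_string(priv, p)
    sk, p = _read_string(priv, p)
    if len(sk) != 64 or sk[32:] != pubkey:
        raise ValueError('unexpected ed25519_sk field')
    return sk[:32]

def build_demo_pem(seed, pubkey, comment=b'demo'):
    """Synthetic key file for testing the parser, NOT a usable ssh key."""
    priv = struct.pack('>II', 0x01020304, 0x01020304)
    priv += ( _string(b'ssh-ed25519') + _string(pubkey)
        + _string(seed + pubkey) + _string(comment) )
    priv += bytes(range(1, 8))[:-len(priv) % 8]  # 1, 2, 3... padding to 8B
    blob = ( MAGIC + _string(b'none') + _string(b'none') + _string(b'')
        + struct.pack('>I', 1)
        + _string(_string(b'ssh-ed25519') + _string(pubkey)) + _string(priv) )
    b64 = base64.b64encode(blob).decode()
    body = '\n'.join(b64[i:i+70] for i in range(0, len(b64), 70))
    return ( '-----BEGIN OPENSSH PRIVATE KEY-----\n'
        + body + '\n-----END OPENSSH PRIVATE KEY-----\n' )

if __name__ == '__main__':
    pem = build_demo_pem(bytes(range(32)), bytes(32))
    print(base64.b64encode(ed25519_seed(pem_to_blob(pem))).decode())
```

The printed base64 string is 44 characters long for any 32-byte seed, which is what makes it practical to write down.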
From output serialization formats that tool supports, especially liked the idea
of Douglas Crockford's Base32 - human-readable one, where all confusing
l-and-1 or O-and-0 chars are interchangeable, and there's an optional checksum
(one letter) at the end:
% ssh-keyparse test-key --base32
% ssh-keyparse test-key --base32-nodashes
base64 (default) is still probably most efficient for non-binary (there's --raw
otherwise) backup though.
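To show what makes that encoding nice for transcription, here is a small sketch of Crockford's Base32 for integers, with the optional mod-37 check symbol (python 3). It follows the published spec as I remember it; how exactly ssh-keyparse maps key bytes to symbols and where it puts dashes is not covered here.

```python
ALPHABET = '0123456789ABCDEFGHJKMNPQRSTVWXYZ'  # no I, L, O or U
CHECK_ALPHABET = ALPHABET + '*~$=U'  # 5 extra symbols, check values 32..36

def encode(n, checksum=False):
    if n < 0: raise ValueError('non-negative integers only')
    digits, m = '', n
    while True:
        m, r = divmod(m, 32)
        digits = ALPHABET[r] + digits
        if m == 0: break
    if checksum: digits += CHECK_ALPHABET[n % 37]
    return digits

def normalize(s):
    # Dashes are cosmetic, case is ignored, i/l read as 1 and o as 0
    return s.replace('-', '').upper().translate(str.maketrans('ILO', '110'))

def decode(s, checksum=False):
    s = normalize(s)
    check = None
    if checksum: s, check = s[:-1], s[-1]
    n = 0
    for c in s: n = n * 32 + ALPHABET.index(c)  # ValueError on bad symbol
    if checksum and CHECK_ALPHABET.index(check) != n % 37:
        raise ValueError('checksum mismatch')
    return n

if __name__ == '__main__':
    seed = bytes(range(32))  # stand-in for a real 32-byte seed
    print(encode(int.from_bytes(seed, 'big'), checksum=True))
```

Note that going through an integer like this drops leading zero bytes, so a real tool has to restore the fixed 32-byte length on decoding.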
[ssh-keyparse code link]
Sep 01, 2015
Have been installing things into OS containers (/var/lib/machines) lately, and
looking for proper configuration management in these.
Large-scale container setups use some hard-to-integrate things like etcd, where
you have to template configuration from values in these, which is not very
convenient and has a very poor effort-to-results ratio (maintenance of that
system itself) for "10 service containers on 3 hosts" case.
Besides, such centralized value store is a bit backwards for
one-container-per-service case, where most values in such "central db" are
specific to one container, and it's much easier to edit end-result configs than
db values and then templates and then check how it all gets rendered on every
Usual solution I have for these setups is simply putting all confs under git
control, but leaving all the secrets (e.g. keys, passwords, auth data) out of
the repo, in case it might be pulled from on other hosts, by different people
and for purposes which don't need these sensitive bits and might leak them
(e.g. giving access to contracted app devs).
For more transient container setups, something should definitely keep track of
these "secrets" however, as "rm -rf /var/lib/machines/..." is much more
realistic possibility and has its uses.
So my (non-original) idea here was to have one "master key" per host - just one
short string - with which to encrypt all secrets for that host, which can then
be shared between hosts and specific people (making these public might still be
a bad idea), if necessary.
This key should then be simply stored in whatever key-management repo, written
on a sticker and glued to a display, or something.
Git can be (ab)used for such encryption, with its "filter" facilities, which are
generally used for opposite thing (normalization to one style), but are easy to
adapt for this case too.
Git filters work by running "clean" operation on selected paths (can be
wildcard patterns like "*.c") every time git itself uses these and "smudge"
when showing to user and checking them out to a local copy (where they are
In case of encryption, "clean" would not be normalizing CR/LF in line endings,
but rather wrapping contents (or parts of them) into a binary blob, and "smudge"
should do the opposite, and gitattributes patterns would match files to be
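The plumbing part of such a filter is small enough to sketch here (python 3). Note that `seal()`/`unseal()` below are deliberately a PLACEHOLDER (zlib + base64, which is NOT encryption), standing in for whatever authenticated encryption the real tool would use, and the magic header and all names are invented for this example.

```python
import base64, sys, zlib

MAGIC = b'--example-sealed-v1--\n'  # hypothetical marker for wrapped content

def seal(data):
    # PLACEHOLDER, NOT ENCRYPTION - swap in real authenticated encryption
    return base64.b64encode(zlib.compress(data))

def unseal(data):
    return zlib.decompress(base64.b64decode(data))

def clean(data):
    """Working copy -> what git stores. Must be idempotent."""
    if data.startswith(MAGIC): return data  # already wrapped, pass through
    return MAGIC + seal(data) + b'\n'

def smudge(data):
    """What git stores -> working copy."""
    if not data.startswith(MAGIC): return data  # plain file, pass through
    return unseal(data[len(MAGIC):])

if __name__ == '__main__' and sys.argv[1:2] in (['clean'], ['smudge']):
    # git pipes file contents through stdin/stdout of the filter command
    op = dict(clean=clean, smudge=smudge)[sys.argv[1]]
    sys.stdout.buffer.write(op(sys.stdin.buffer.read()))
```

Hooking it up would be a matter of something like `git config filter.example.clean "/path/to/script clean"` (same for smudge) plus a `secrets.conf filter=example` line in gitattributes, with "example" and the paths being made-up names. Whatever replaces `seal()` has to give identical output for identical input, otherwise git will see the file as modified on every status check.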
Looking for projects that already do that, found quite a few, but still
decided to write my own tool, because none seem to have all the things I wanted:
Use sane encryption.
It's AES-CTR in the absolutely best case, and AES-ECB (wtf!?) in some,
sometimes openssl is called with "password" on the command line (trivial to
snoop in /proc).
OpenSSL itself is a red flag - hard to believe that someone who knows how bad
its API and primitives are still uses it willingly, for non-TLS, at least.
Expected to find at least one project using AEAD through NaCl or something,
but no such luck.
Have tool manage gitattributes.
You don't add file to git repo by typing /path/to/myfile
managed=version-control some-other-flags to some config, why should you do
Be easy to deploy.
Ideally it'd be a script, not some c++/autotools project to install build
tools for or package to every setup.
Though bash script is maybe taking it a bit too far, given how messy it is for
anything non-trivial, secure and reliable in diff environments.
Have "configuration repository" as intended use-case.
So wrote git-nerps python script to address all these.
Crypto there is trivial yet solid PyNaCl stuff, marking files for encryption is
as easy as git-nerps taint /what/ever/path and bootstrapping the thing
requires nothing more than python, git, PyNaCl (which are the norm in any of my
setups) and git-nerps key-gen in the repo.
README for the project has info on every aspect of how the thing works and more
on the ideas behind it.
I expect it'll have a few more use-case-specific features and
convenience-wrapper commands once I get to use it in more realistic cases
than it has now (initially).
Jan 28, 2015
There seems to be a surprising lack of python code on the net for this particular
device, except for this nice pi-ras blog post, in japanese.
So, to give google some more food and a bit of commentary in english to that
post - here goes.
I'm using Midas MCCOG21605C6W-SPTLYI 2x16 chars LCD panel, connected to 5V VDD
and 3.3V BeagleBone Black I2C bus:
Code for the above LCD clock "app" (python 2.7):
import smbus, time

class ST7032I(object):

    def __init__(self, addr, i2c_chan, **init_kws):
        self.addr, self.bus = addr, smbus.SMBus(i2c_chan)
        self.init(**init_kws)

    def _write(self, data, cmd=0, delay=None):
        # cmd=0 sends instruction bytes, cmd=0x40 sends character data
        self.bus.write_i2c_block_data(self.addr, cmd, list(data))
        if delay: time.sleep(delay)

    def init(self, contrast=0x10, icon=False, booster=False):
        assert contrast < 0x40 # 6 bits only, probably not used on most lcds
        pic_high = 0b0111 << 4 | (contrast & 0x0f) # c3 c2 c1 c0
        pic_low = ( 0b0101 << 4 |
            icon << 3 | booster << 2 | ((contrast >> 4) & 0x03) ) # c5 c4
        self._write([0x38, 0x39, 0x14, pic_high, pic_low, 0x6c], delay=0.01)
        self._write([0x0c, 0x01, 0x06], delay=0.01)

    def move(self, row=0, col=0):
        assert 0 <= row <= 1 and 0 <= col <= 15, [row, col]
        self._write([0b1000 << 4 | (0x40 * row + col)])

    def addstr(self, chars, pos=None):
        # pos can be either row number or (row, col) tuple
        if pos is not None:
            row, col = (pos, 0) if isinstance(pos, int) else pos
            self.move(row, col)
        self._write(map(ord, chars), cmd=0x40)

if __name__ == '__main__':
    lcd = ST7032I(0x3e, 2)
    while True: # redraw both lines every second
        ts_tuple = time.localtime()
        lcd.addstr(time.strftime('date: %y-%m-%d', ts_tuple), 0)
        lcd.addstr(time.strftime('time: %H:%M:%S', ts_tuple), 1)
        time.sleep(1)
Note the constants in the "init" function - these are all from the
"INITIALIZE(5V)" sequence on page-8 of the Midas LCD datasheet, setting up
things like voltage follower circuit, OSC frequency, contrast (not used on my
panel), modes and such.
Actual reference on what all these instructions do and how they're decoded can
be found on page-20 there.
Even with the same exact display, but connected to 3.3V, these numbers should
probably be a bit different - check the datasheet (e.g. page-7 there).
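To make the bit-twiddling in "init" and "move" a bit less opaque, the same expressions can be pulled into pure functions and checked without any hardware attached. The function names below are mine, and the meaning of each bit should still be verified against the datasheet for a particular panel.

```python
def contrast_cmds(contrast=0x10, icon=False, booster=False):
    """Return (pic_high, pic_low) instruction bytes, as built in init() above."""
    if not 0 <= contrast < 0x40: raise ValueError('contrast is 6 bits only')
    # 0111 c3 c2 c1 c0 - low 4 bits of contrast
    pic_high = 0b0111 << 4 | (contrast & 0x0f)
    # 0101 icon booster c5 c4 - flags plus high 2 bits of contrast
    pic_low = ( 0b0101 << 4 | int(icon) << 3
        | int(booster) << 2 | ((contrast >> 4) & 0x03) )
    return pic_high, pic_low

def ddram_addr_cmd(row, col):
    """Return "set cursor position" instruction byte, as built in move() above."""
    if not (0 <= row <= 1 and 0 <= col <= 15): raise ValueError([row, col])
    # 1 + 7-bit address, second row starting at 0x40
    return 0b1000 << 4 | (0x40 * row + col)

if __name__ == '__main__':
    print([hex(v) for v in contrast_cmds()], hex(ddram_addr_cmd(1, 5)))
```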
Also note the "addr" and "i2c_chan" values (0x3E and 2) - these should be taken
from the board itself.
"i2c_chan" is the number of the device (X) in /dev/i2c-X, of which there seem
to be usually more than one on ARM boards like RPi or BBB.
For instance, Beaglebone Black has three I2C buses, two of which are available
on the expansion headers (with proper dtbs loaded).
And the address is easy to get from the datasheet (lcd I have uses only one
static slave address), or detect via i2cdetect -r -y <i2c_chan>, e.g.:
# i2cdetect -r -y 2
0 1 2 3 4 5 6 7 8 9 a b c d e f
00: -- -- -- -- -- -- -- -- -- -- -- -- --
10: -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
20: -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
30: -- -- -- -- -- -- -- -- -- -- -- -- -- -- 3e --
40: -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
50: -- -- -- -- UU UU UU UU -- -- -- -- -- -- -- --
60: -- -- -- -- -- -- -- -- 68 -- -- -- -- -- -- --
70: -- -- -- -- -- -- -- --
Here I have DS1307 RTC on 0x68 and an LCD panel on 0x3E address (again, also
specified in the datasheet).
On Arch or source-based distros these all come with "i2c-tools" package, but
on e.g. debian, python module seems to be split into "python-smbus".
Plugging the bus number and address for your particular hardware into the
script above and maybe adjusting the values there for your lcd panel modes
should make the clock show up and tick every second.
In general, upon seeing tutorial on some random blog (like this one), please
take it with a grain of salt, because it's highly likely that it was written by
a fairly incompetent person (like me), since engineers who deal with these
things every day don't see above steps as any kind of accomplishment - it's a
boring no-brainer routine for them, and they aren't likely to even think about
it, much less write tutorials on it (all trivial and obvious, after all).
Nevertheless, hope this post might be useful to someone as a pointer on where to
look to get such device started, if nothing else.
Jul 16, 2014
Tried to find any simple script to update tinydns (part of djbdns) zones
that'd be better than ssh dns_update@remote_host update.sh, but failed -
they all seem to be hacky php scripts, doomed to run behind httpds, send
passwords in url, query random "myip" hosts or something like that.
What I want instead is something that won't be making http, tls or ssh
connections (and stirring all the crap behind these), but would rather just send
udp or even icmp pings to remotes, which should be enough for update, given
source IPs of these packets and some authentication payload.
So yep, wrote my own scripts for that - tinydns-dynamic-dns-updater project.
Tool sends UDP packets with 100 bytes of "( key_id || timestamp ) ||
Ed25519_sig" from clients, authenticating and distinguishing these
server-side by their signing keys ("key_id" there is to avoid iterating over
them all, checking which matches signature).
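Packet layout and the server-side checks can be sketched like this (python 3). The 28 + 8 byte split between key_id and timestamp is a guess that just happens to add up to 100 bytes with a 64-byte signature, not the tool's actual format, and signing/verification is left as callables to plug an Ed25519 implementation into.

```python
import struct, time

KEY_ID_LEN, TS_LEN, SIG_LEN = 28, 8, 64  # hypothetical split of the 100 bytes
PACKET_LEN = KEY_ID_LEN + TS_LEN + SIG_LEN

def build_packet(key_id, sign, ts=None):
    """Client side: ( key_id || timestamp ) || sig."""
    if len(key_id) != KEY_ID_LEN: raise ValueError('bad key_id length')
    ts = time.time() if ts is None else ts
    msg = key_id + struct.pack('>Q', int(ts * 1000))  # milliseconds
    sig = sign(msg)
    if len(sig) != SIG_LEN: raise ValueError('bad signature length')
    return msg + sig

def check_packet(packet, verifiers, last_ts):
    """Server side: return (key_id, ts_ms) for a valid fresh packet, else None.

    verifiers: {key_id: verify(msg, sig) -> bool}, so that only one key
      has to be tried per packet.
    last_ts: {key_id: ts_ms} of last accepted update, modified in-place.
    """
    if len(packet) != PACKET_LEN: return None
    msg, sig = packet[:-SIG_LEN], packet[-SIG_LEN:]
    key_id = msg[:KEY_ID_LEN]
    (ts_ms,) = struct.unpack('>Q', msg[KEY_ID_LEN:])
    verify = verifiers.get(key_id)
    if verify is None or not verify(msg, sig): return None
    if ts_ms <= last_ts.get(key_id, 0): return None  # replayed or reordered
    last_ts[key_id] = ts_ms
    return key_id, ts_ms
```

Source address of an accepted packet (second value returned by `recvfrom()`) is then what goes into the A/AAAA record. Rejecting timestamps that are not newer than the last accepted one is what keeps a captured packet from being replayed later from some other address.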
Server zone files can have "# dynamic: ts key1 key2 ..." comments before records
(separated from static records after these by comments or empty lines), which
says that any source IPs of packets with correct signatures (and more recent
timestamps) will be recorded in A/AAAA records (depending on source AF) that
follow instead of what's already there, leaving anything else in the file
Zone file only gets replaced if something is actually updated and it's possible
to use dynamic IP for server as well, using dynamic hostname on client (which is
resolved for each delayed packet).
Lossy nature of UDP can be easily mitigated by passing e.g. "-n5" to the client
script, so it'd send 5 packets (with exponential delays by default, configurable
via --send-delay), plus just having the thing on fairly regular intervals in
Putting server script into socket-activated systemd service file also makes all
daemon-specific pains like using privileged ports (and most other
security/access things), startup/daemonization, restarts, auto-suspend timeout
and logging woes just go away, so there's --systemd flag for that too.
Given how easy it is to run djbdns/tinydns instance, there really doesn't seem
to be any compelling reason not to use your own dynamic dns stuff for every
single machine or device that can run simple python scripts.
Github link: tinydns-dynamic-dns-updater
May 12, 2014
Wanted to have some sort of "homepage with my fav stuff, arranged as I want to"
in firefox for a while, and finally got the resolve to do something about it - just
finished a (first version of) script to generate the thing -
Default "grid of page screenshots" never worked for me, and while there are
other projects that do other layouts for different stuff, they just aren't
flexible enough to do whatever horrible thing I want.
In this particular case, I wanted to experiment with chaotic tag cloud of
bookmarks (so they won't ever be in the same place), relations graph for these
tags and random picks for "links to read" from backlog.
Result is a dynamic d3 + d3.layout.cloud (interactive example of this
layout) page without much style:
"Mark of Chaos" button in the corner can fly/re-pack tags around.
Clicking tag shows bookmarks tagged as such and fades all other tags out in
proportion to how they're related to the clicked one (i.e. how many links
share the tag with others).
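One way to get numbers for that fading is to count how often tags appear together on the same bookmark. The sketch below (python 3) is my reading of the idea, with opacity being the share of a tag's links that also carry the clicked tag, and is not the actual code from the generator script.

```python
from collections import Counter
from itertools import combinations

def tag_relations(bookmarks):
    """bookmarks: iterable of tag collections -> (tag_counts, pair_counts)."""
    tag_counts, pair_counts = Counter(), Counter()
    for tags in bookmarks:
        tags = sorted(set(tags))
        tag_counts.update(tags)
        pair_counts.update(combinations(tags, 2))  # keys are sorted pairs
    return tag_counts, pair_counts

def fade_levels(clicked, tag_counts, pair_counts):
    """Opacity in 0..1 range for every tag except the clicked one."""
    levels = dict()
    for tag, n in tag_counts.items():
        if tag == clicked: continue
        shared = pair_counts[tuple(sorted((tag, clicked)))]
        levels[tag] = shared / n  # 0 = unrelated, 1 = always used together
    return levels

if __name__ == '__main__':
    bookmarks = [ {'python', 'linux'}, {'python', 'crypto'},
        {'linux'}, {'python', 'linux', 'crypto'} ]
    print(fade_levels('python', *tag_relations(bookmarks)))
```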
Started using FF bookmarks again in a meaningful way only recently, so not much
stuff there yet, but it does seem to help a lot, especially with these handy
awesome bar tricks.
Not entirely sure how useful the cloud visualization or actually having a
homepage would be, but it's a fun experiment and a nice place to collect any
useful web-surfing-related stuff I might think of in the future.
Repo link: firefox-homepage-generator
Sep 26, 2013
Wrote a driver for still common FAT32 recently, while solving the issue with
shuffle on cheap "usb stick with microsd slot" mp3 player.
It's kinda amazing how crappy firmware in these things can be.
Guess one should know better than to get such crap with 1-line display, gapful
playback, weak battery, rewind at non-accelerating ~3x speed, no ability to pick
tracks while playing and plenty of other annoying "features", but the main issue
I've had with the thing by far is missing shuffle functionality - it only plays
tracks in static order in which they were uploaded (i.e. how they're stored on
Seems like whoever built the thing made it deliberately hard to shuffle the
tracks offline - just one sort by name would've made things a lot easier, and
it's clear that device reads the full dir listing from the time it spends
opening dirs with lots of files.
Most obvious way to do such "offline shuffle", given how the thing orders files,
is to re-upload tracks in different order, which is way too slow and wears out
Second obvious for me was to dig into FAT32 and just reorder entries there,
which is what the script does.
It's based off example of a simpler fat16 parser in construct module repo
and processes all the necessary metadata structures like PDRs, FATs (cluster
maps) and directory tables with vfat long-name entries inside.
Given that directory table on FAT32 is just an array (with vfat entries linked
to dentries after them though), it's kinda easy just to shuffle entries there
and write data back to clusters from where it was read.
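The shuffle itself, on a directory table that is already read into memory, can be sketched without any parser library (python 3). This is a simplified take on the idea and not the script's code: it keeps each run of vfat long-name entries glued to the dentry that follows it, leaves "." / ".." and volume label entries at the start, and drops deleted entries.

```python
import random

ENTRY = 32  # size of one directory table entry
ATTR_LFN, ATTR_VOLUME = 0x0F, 0x08

def shuffle_dir_table(table, rng=random):
    """Return directory table bytes of the same length, with files reordered."""
    entries = [table[i:i+ENTRY] for i in range(0, len(table), ENTRY)]
    fixed, groups, pending = [], [], []
    for e in entries:
        if len(e) < ENTRY or e[0] == 0x00: break  # end-of-directory marker
        if e[0] == 0xE5:  # deleted entry, no need to keep it
            pending = []
            continue
        if e[11] & 0x3F == ATTR_LFN:  # long-name part, belongs to next dentry
            pending.append(e)
            continue
        group, pending = b''.join(pending + [e]), []
        if e[:1] == b'.' or e[11] & ATTR_VOLUME: fixed.append(group)
        else: groups.append(group)
    rng.shuffle(groups)
    out = b''.join(fixed + groups)
    return out + bytes(len(table) - len(out))  # zero-pad, i.e. "end" entries
```

Passing `random.Random(seed)` as `rng` makes the order reproducible, and since the result has the same length as the original table, it can be written back to the same clusters it was read from.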
One less obvious solution to shuffle, coming from understanding how vfat lfn
entries work, is that one can actually force fs driver to reorder them by
randomizing filename length, as it'll be forced to move longer entries to the
end of the directory table.
But that idea came a bit too late, and such parser can be useful for extending
FAT32 to whatever custom fs (with e.g. FUSE or 9p interface) or implementing
some of the more complex hacks.
It's interesting that fat dentries can (and are apparently known to) store unix-like
modes and uid/gid instead of some other less-useful attrs, but linux driver
doesn't seem to make use of it.
OS'es also don't allow symlinks or hardlinks on fat, while technically it's
possible, as long as you keep these read-only - just create dentries that point
to the same cluster.
Should probably work for both files and dirs and allow to create multiple
hierarchies of the same files, like several dirs where same tracks are
shuffled with different seed, alongside dirs where they're separated by
artist/album/genre or whatever other tags.
It's very fast and cheap to create these, as each is basically about
"(name_length + 32B) * file_count" in size, which is like just 8 KiB for dir
structure holding 100+ files.
So plan is to extend this small hack to use mutagen to auto-generate such
hierarchies in the future, or maybe hook it directly into beets as an export
plugin, combined with transcoding, webui and convenient music-db there.
Also, can finally tick off "write proper on-disk fs driver" from "things to do
in life" list ;)
Apr 08, 2013
As discordian folk celebrated Jake Day yesterday, decided that I've had it
with random hanging userspace state-machines, stuck forever with tcp connections
that are not legitimately dead, just waiting on both sides.
And since pretty much every tool can handle transient connection failures and
reconnects, decided to come up with some simple and robust-enough solution to
break such links without (or rather before) patching all the apps to behave.
One last straw was davfs2 failing after a brief net-hiccup, with my options
limited to killing everything that uses (and is hanging dead on) its mount,
then going kill/remount way.
As it uses stateless http connections, I bet it's not even an issue for it to
repeat whatever request it tried last and it sure as hell handles network
failures, just not well in all cases.
I've used such technique to test some twisted-things in the past, so it was easy
to dig scapy-automata code for doing that, though the real trick is not to
craft/send FIN or RST packet, but rather to guess TCP seq/ack numbers to stamp
Alas, none of the existing tools (e.g. tcpkill) seem to do anything clever in
cutter states that
There is a feature of the TCP/IP protocol that we could use to good effect
here - if a packet (other than an RST) is received on a connection that has
the wrong sequence number, then the host responds by sending a corrective
"ACK" packet back.
But neither the tool itself nor the technique described seem to work, and I
actually failed to find (or recall) any mentions (or other uses) of such
corrective behavior. Maybe it was so waaay back, dunno.
Naturally, as I can run such tool on the host where socket endpoint is, local
kernel has these numbers stored, but apparently no one really cared (or had a
legitimate enough use-case) to expose these to the userspace... until very
recently, that is.
Recent work of Parallels folks on CRIU
landed getsockopt(sk, SOL_TCP,
in one of the latest mainline kernel releases.
Trick is then just to run that syscall in the pid that holds the socket fd,
which looks like a trivial enough task, but looking over crtools
unfortunately doesn't seem to work with vanilla kernel yet) and
tricks of compiling and injecting shellcode, decided that
it's just too much work for me, plus they share the same x86_64-only codebase,
and I'd like to have the thing working on ia32 machines as well.
Caching all the "seen" seq numbers in advance looks tempting, especially since
for most cases, relevant traffic is processed already by
, which can potentially dump
"(endpoint1-endpoint2, seq, len)" tuples to some fast key-value backend.
Invalidation of these might be a minor issue, but I'm not too thrilled about
having some dissection code to pre-cache stuff that's already cached in every
Patching kernel to just expose stuff via /proc looks like bit of a burden as
well, though an isolated module code would probably do the job well.
Weird that there doesn't seem to be one of these around already, closest one
being tcp_probe.c code, which hooks into tcp_recv code-path and doesn't really
get seqs without some traffic either.
One interesting idea that got my attention and didn't require a single line of
extra code was proposed on the local xmpp channel - to use tcp keepalives.
Sure, they won't make kernel drop connection when it's userspace that hangs on
both ends, with connection itself being perfectly healthy, but every one of
these carries a seq number that can be spoofed and used to destroy that
Pity these are optional and can't be just turned on for all sockets system-wide
on linux (unlike some BSD systems, apparently), and nothing uses these much by
choice (which can be seen in netstat --timer).
Luckily, there's a dead-simple LD_PRELOAD code of libkeepalive which can be
used to enforce system-wide opt-out behavior for these (at least for
For suid stuff (like mount.davfs, mentioned above), it has to be in
/etc/ld.so.preload, not just env, but as I need it "just in case" for all the
connections, that seems fine in my case.
And tuning keepalives to be frequent-enough seems to be a no-brainer and
shouldn't have any effect on 99% of legitimate connections at all, as they
probably pass some traffic every other second, not after minutes or hours.
net.ipv4.tcp_keepalive_time = 900
net.ipv4.tcp_keepalive_probes = 5
net.ipv4.tcp_keepalive_intvl = 156
(default is to send empty keepalive packet after 2 hours of idleness)
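For apps where source can be changed, same thing can be enabled per-socket, without any preloading. A minimal python 3 sketch, where `enable_keepalive` and `avg_wait_minutes` are names made up for this example and the TCP_KEEP* socket options are assumed to be the linux ones (skipped if python doesn't expose them on the platform):

```python
import socket

def enable_keepalive(sock, idle=900, interval=156, probes=5):
    """Per-socket equivalent of the sysctl values above."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    for name, value in [
            ('TCP_KEEPIDLE', idle),  # seconds of idleness before first probe
            ('TCP_KEEPINTVL', interval),  # seconds between probes
            ('TCP_KEEPCNT', probes) ]:  # failed probes before giving up
        opt = getattr(socket, name, None)
        if opt is not None: sock.setsockopt(socket.IPPROTO_TCP, opt, value)

def avg_wait_minutes(idle=900):
    # Tool starts at a random point within the idle interval,
    #  so next keepalive is half of that interval away on average.
    return idle / 2.0 / 60

if __name__ == '__main__':
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    enable_keepalive(s)
    print(s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE), avg_wait_minutes())
    s.close()
```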
With that, tool has to run ~7 min on average to kill any tcp connection in the
system, which is totally acceptable, and no fragile non-portable ptrace-shellcode
magic involved (at least yet, I bet it'd be much easier to do in the future).
Code and some docs for the tool/approach can be found on github.
More of the same (update 2013-08-11):
Actually, lacking some better way to send RST/FIN from a machine to itself than
swapping MACs (and hoping that router is misconfigured enough to bounce packet
"from itself" back) or "-j REJECT --reject-with tcp-reset" (plus a "recent"
match or transient-port matching, to avoid blocking reconnect as well),
countdown for a connection should be ~7 + 15 min, as only next keepalive will
reliably produce RST response.
With a bit of ipset/iptables/nflog magic, it was easy to make the one-time
REJECT rule, snatching seq from dropped packet via NFLOG and using that to
produce RST for the other side as well.
Whole magic there goes like this:
-A conn_cutter ! -p tcp -j RETURN
-A conn_cutter -m set ! --match-set conn_cutter src,src -j RETURN
-A conn_cutter -p tcp -m recent --set --name conn_cutter --rsource
-A conn_cutter -p tcp -m recent ! --rcheck --seconds 20\
--hitcount 2 --name conn_cutter --rsource -j NFLOG
-A conn_cutter -p tcp -m recent ! --rcheck --seconds 20\
--hitcount 2 --name conn_cutter --rsource -j REJECT --reject-with tcp-reset
-I OUTPUT -j conn_cutter
"recent" matcher there is a bit redundant in most cases, as outgoing connections
usually use transient-range tcp ports, which shouldn't match for different
attempts, but some apps might bind these explicitly.
ipset turned out to be quite a neat thing to avoid iptables manipulations (to
It's interesting that this set of rules handles RST to both ends all by itself
if packet arrives from remote first - response (e.g. ACK) from local socket will
get RST but won't reach remote, and retransmit from remote will get RST because
local port is legitimately closed by then.
Current code allows to optionally specify ipset name, whether to use nflog
(via spin-off scapy-nflog-capture driver) or raw sockets, and doesn't do any
mac-swapping, only sending RST to remote (which, again, should still be
sufficient with frequent-enough keepalives).
Now, if only some decade-old undocumented code didn't explicitly disable these
Apr 06, 2013
Everyone is probably aware that bits do flip here and there in the supposedly
rock-solid, predictable and deterministic hardware, but somehow every single
data-management layer assumes that it's not its responsibility to fix or even
detect these flukes.
Bitrot in RAM is a known source of bugs, but short of ECC, dunno what one
can do without huge impact on performance.
Disks, on the other hand, seem to have a lot of software layers above them,
handling whatever data arrangement, compression, encryption, etc, and the fact
that bits do flip in magnetic media seems to be just as well-known (study1
So it really bugged me for quite a while that any modern linux system seems to
be completely oblivious to the issue.
Consider typical linux storage stack on a commodity hardware:
You have closed-box proprietary hdd brick at the bottom, with no way to tell
what it does to protect your data - aside from vendor marketing pitches, that
Then you have well-tested and robust linux driver for some ICH storage
I wouldn't bet that it will corrupt anything at this point, but it doesn't do
much else to the data but pass around whatever it gets from the flaky device
Linux blkdev layer above, presenting /dev/sdX. No checks, just simple mapping.
Here things get more interesting.
I tend to use lvm wherever possible, but it's just a convenience layer (or a
set of nice tools to set up mappings) on top of dm, no checks of any kind, but
at least it doesn't make things much worse either - lvm metadata is fairly
redundant and easy to backup/recover.
dm-crypt gives no noticeable performance overhead, exists either above or
under lvm in the stack, and is nice hygiene against accidental leaks
(selling or leasing hw, theft, bugs, etc), but lacking authenticated
encryption modes it doesn't do anything to detect bit-flips.
Worse, it amplifies the issue.
In the most common CBC mode, one flipped bit in the ciphertext garbles the
whole cipher block it sits in (16 bytes with AES) on decryption, and also
flips the corresponding bit in the next cipher block.
Current dm-crypt default (since the latest cryptsetup-1.6.X, iirc) is XTS
block encryption mode, which somewhat limits the damage, but dm-crypt has
little support for changing modes on-the-fly, so tough luck.
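To get a feel for what that amplification looks like with textbook CBC, here's
a minimal Python sketch. It uses a made-up toy Feistel cipher built on hashlib
(not AES, not secure, and nothing like what dm-crypt actually runs) purely as a
stand-in for a real block cipher, and all the helper names (`cbc_decrypt`,
`damaged_blocks`, `flipped_bits`) are my own. Assuming standard CBC chaining, a
single flipped ciphertext bit should garble its whole 16-byte cipher block and
flip exactly one bit in the block after it:

```python
import hashlib

BLOCK = 16          # cipher block size in bytes, same as AES
HALF = BLOCK // 2


def _xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))


def _round(key, i, half):
    # Toy round function: NOT a real cipher, just enough to get diffusion.
    return hashlib.sha256(key + bytes([i]) + half).digest()[:HALF]


def encrypt_block(key, block):
    left, right = block[:HALF], block[HALF:]
    for i in range(4):
        left, right = right, _xor(left, _round(key, i, right))
    return left + right


def decrypt_block(key, block):
    left, right = block[:HALF], block[HALF:]
    for i in reversed(range(4)):
        left, right = _xor(right, _round(key, i, left)), left
    return left + right


def cbc_encrypt(key, iv, data):
    out, prev = [], iv
    for i in range(0, len(data), BLOCK):
        prev = encrypt_block(key, _xor(data[i:i + BLOCK], prev))
        out.append(prev)
    return b"".join(out)


def cbc_decrypt(key, iv, data):
    out, prev = [], iv
    for i in range(0, len(data), BLOCK):
        block = data[i:i + BLOCK]
        out.append(_xor(decrypt_block(key, block), prev))
        prev = block
    return b"".join(out)


def damaged_blocks(a, b):
    """Indexes of 16-byte blocks that differ between a and b."""
    return [i // BLOCK for i in range(0, len(a), BLOCK)
            if a[i:i + BLOCK] != b[i:i + BLOCK]]


def flipped_bits(a, b):
    """Number of bit positions that differ between a and b."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


if __name__ == "__main__":
    key, iv = b"k" * 16, b"\x00" * 16
    sector = bytes(range(256)) * 2           # one 512-byte "sector"
    ciphertext = bytearray(cbc_encrypt(key, iv, sector))
    ciphertext[3 * BLOCK + 5] ^= 0x01        # single bit-flip in block 3
    plain = cbc_decrypt(key, iv, bytes(ciphertext))
    print(damaged_blocks(sector, plain))     # blocks hit by the flip
    print(flipped_bits(sector[4 * BLOCK:5 * BLOCK],
                       plain[4 * BLOCK:5 * BLOCK]))
```

With the flip in block 3, only blocks 3 and 4 come back different, and block 4
differs in a single bit, so one bad bit on disk turns into (at least) 16 bytes
of garbage for whatever sits above dm-crypt.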
But hey, there is dm-verity, which sounds like exactly what I want,
except it's read-only, damn.
Read-only nature is heavily ingrained in its "hash tree" model of integrity
protection - it is hashes-of-hashes all the way up to the root hash, which
you specify on mount, immutable by design.
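The hashes-of-hashes idea can be sketched in a few lines of Python. This is
only an illustration of a generic hash tree, not dm-verity's actual on-disk
format (which has its own block layout, salt and so on), and `block_hashes` /
`root_hash` with their `block_size` and `fanout` parameters are names I made up
for the example. It also assumes non-empty data:

```python
import hashlib


def block_hashes(data, block_size=4096):
    """Leaf level of the tree: one sha256 digest per data block."""
    return [hashlib.sha256(data[i:i + block_size]).digest()
            for i in range(0, len(data), block_size)]


def root_hash(level, fanout=2):
    """Hash groups of `fanout` digests together until one digest is left."""
    while len(level) > 1:
        level = [hashlib.sha256(b"".join(level[i:i + fanout])).digest()
                 for i in range(0, len(level), fanout)]
    return level[0]


if __name__ == "__main__":
    data = bytes(range(256)) * 64                     # 16 KiB, 4 blocks
    tampered = data[:5000] + b"\xff" + data[5001:]    # change one byte
    print(root_hash(block_hashes(data)).hex())
    print(root_hash(block_hashes(tampered)).hex())    # different root
```

Any change to any block bubbles up into a different root hash, which is great
for verification, but it also means that every legitimate write would have to
re-hash a path up the tree and change the root that was supposed to be fixed
at mount time.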
Block-layer integrity protection is a bit weird anyway - lots of unnecessary
work potential there with free space (can probably be somewhat solved by
TRIM), data that's already journaled/checksummed by fs and just plain
transient block changes which aren't exposed for long and one might not care
about at all.
Filesystem layer above does the right thing sometimes.
fs'es like btrfs have checksums and scrubbing, so seem to be good options.
btrfs was slow as hell on rotating plates last time I checked, but zfs port
might be worth a try, though if a single cow fs works fine on all kinds of
scenarios where I use ext4 (mid-sized files), xfs (glusterfs backend) and
reiserfs (hard-linked backups, caches, tiny-file sub trees), then I'd really
Other fs'es plain suck at this. No care for that sort of thing at all.
Above-fs syscall-hooks kernel layers.
IMA/EVM sound great, but are also for immutable security ("integrity")
In fact, this layer is heavily populated by security stuff like LSM's, which I
can't imagine being sanely used for bitrot-detection purposes.
Security tools are generally oriented towards detecting any changes,
intentional tampering included, and are bound to produce a lot of
false-positives instead of legitimate and actionable alerts.
Plus, upon detecting some sort of failure, these tools generally don't care
about the data anymore, acting as a Denial-of-Service attack on you, which is
survivable (everything can be circumvented), but fighting your own tools
doesn't sound too great.
There is tripwire, but it's also a security tool, unsuitable for the task.
Some rare discussions of the problem pop up here and there, but alas, I
failed to salvage anything usable from these, aside from ideas and links to
Scanning github, bitbucket and xmpp popped up bitrot script and a
proof-of-concept md-checksums md layer, which apparently haven't even made it
So, naturally, following long-standing "... then do it yourself" motto,
introducing fs-bitrot-scrubber tool for all the scrubbing needs.
It should be fairly well-described in the readme, but the gist is that it's just
a simple userspace script to checksum file contents and check changes there over
time, taking all the signs of legitimate file modifications and the fact that it
isn't the only thing that needs i/o in the system into account.
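To illustrate the general idea (and only that - this is not the actual
fs-bitrot-scrubber code, see its readme for the real thing), here's a minimal
Python sketch. The `scrub` function, the in-memory `db` dict and the "same
mtime and size but different checksum" heuristic are all my simplifying
assumptions for the example, and it does no i/o rate-limiting at all:

```python
import hashlib
import os


def file_digest(path, chunk=1 << 20):
    """sha256 of file contents, read in chunks to keep memory use flat."""
    digest = hashlib.sha256()
    with open(path, "rb") as src:
        for block in iter(lambda: src.read(chunk), b""):
            digest.update(block)
    return digest.hexdigest()


def scrub(root, db):
    """Checksum all files under root, updating db (path -> record) in place.

    Returns paths where contents changed while mtime and size did not,
    i.e. changes that don't look like a legitimate modification.
    """
    suspects = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            st = os.stat(path)
            meta = [st.st_mtime_ns, st.st_size]
            digest = file_digest(path)
            old = db.get(path)
            if old and old["meta"] == meta and old["digest"] != digest:
                suspects.append(path)
            db[path] = {"meta": meta, "digest": digest}
    return suspects


if __name__ == "__main__":
    db = {}                 # a real tool would persist this between runs
    scrub(".", db)          # first pass just records checksums
    print(scrub(".", db))   # later passes report suspicious files
```

Files that were modified in a normal way get a new mtime and are simply
re-recorded, so only silent changes end up in the returned list.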
Main goal is not to provide any sort of redundancy or backups, but rather notify
of the issue before all the old backups (or some cluster-fs mirrors in my case)
that can be used to fix it are rotated out of existence or overwritten.
Don't suppose I'll see such decay phenomena often (if ever), but I don't like
playing the odds, especially with an easy "most cases" fix within grasp.
If I kept a lot of important stuff compressed (think what will happen if a
single bit is flipped in the middle of a few-gigabyte .xz file) or naively
(without storage specifics and corruption in mind) encrypted in cbc mode (or
something else to the same effect), I'd be worried about the issue so much more.
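The .xz case is easy to try out on a small scale with Python's stdlib lzma
module. The sketch below is just an experiment, with `flip_bit` and
`survives_bit_flip` being names I picked for it: it compresses some data,
flips one bit in the middle of the compressed stream and checks whether the
original can still be recovered:

```python
import lzma


def flip_bit(data, bit_index):
    """Return a copy of data with one bit inverted."""
    buf = bytearray(data)
    buf[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(buf)


def survives_bit_flip(payload):
    """Compress payload to .xz, flip one bit mid-stream, try to get it back."""
    packed = lzma.compress(payload)  # default container is .xz with CRC64
    damaged = flip_bit(packed, len(packed) * 8 // 2)
    try:
        return lzma.decompress(damaged) == payload
    except lzma.LZMAError:
        return False  # decoder gave up or the integrity check failed


if __name__ == "__main__":
    print(survives_bit_flip(b"important stuff\n" * 100_000))
```

On the plus side, the xz container at least notices the damage thanks to its
built-in check, but a plain decompress call then refuses to hand back any of
the data, not just the part around the flipped bit.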
Wish there'd be something common out-of-the-box in the linux world, but I guess
it's just not the time yet (hell, there's not even one clear term in the techie
slang for it!) - with still increasing hdd storage sizes and much more
vulnerable ssd's, some more low-level solution should materialize eventually.
Here's me hoping to raise awareness, if only by a tiny bit.
github project link