May 05, 2022
Alpine Linux is a tiny distro (like ~50 MiB for the whole system),
which is usually very good for a relatively single-purpose setup,
dedicated to running only a couple things at once.
It doesn't really sacrifice much in the process either, since busybox and OpenRC
there provide a pretty complete toolkit for setting things up to start
and for debugging everything, its repos have pretty much anything,
and what they don't have is trivial to build via APKBUILD files,
very similar to Arch's PKGBUILDs.
A really nice thing seems to have happened to it with the rise of docker/podman
and such single-app containers, when it got a huge boost in popularity,
being almost always picked as the OS base in these, so these days
it seems to be well-supported across all linux architectures,
and likely will be for quite a while.
So it's a really good choice for appliance-type ARM boards to deploy and
update for years, but it doesn't really bother specializing itself for each of
those or supporting board-specific quirks, which is why some steps to set it up
for a specific board have to be done manually.
Since I've done it for a couple different boards relatively recently, and been
tinkering with these for a while, thought to write down a list of steps on how
to set alpine up for a new board here.
In this case, it's a relatively old ODROID-C2 that I have from 2016, which is
well-supported in the mainline linux kernel by now, as popular boards tend to be
a couple years after release, and roughly the same setup steps should apply
to any of them.
Current Alpine is 3.15.4 and Linux is 5.18 (or 5.15 LTS) atm of writing this.
Get a USB-UART dongle (any dirt-cheap $5 one will do) and a couple jumper wires,
and look up where the main UART pins are on the board (the ones used by firmware/u-boot).
Official documentation or wiki is usually the best source of info on that.
There should be RX/TX and GND pins, only those need to be connected.
screen /dev/ttyUSB0 115200 command is my go-to way to interact with those,
but there're other tools for that too.
If there's another ARM board lying around, it usually has the same UART
pins, so just cross TX/RX and connect to the serial console from there.
Check which System-on-Chip (SoC) family/vendor the device uses, look it up wrt
mainline linux support, and find where such effort might be coordinated -
Beaglebones' TI SoCs at elinux.org, RPi's site/community, etc.
There are usually IRC or other chat channels linked on those too,
and tons of helpful info all around, though sometimes outdated and distro-specific.
Get U-Boot for the board, ideally configured for some sane filesystem layout.
"Generic ARM" images on Alpine Downloads page tend to have these,
but architecture is limited to armv7/aarch64 boards, but a board vendor's
u-boot will work just as well, and it might come with pre-uboot stages or
signing anyway, which definitely have to be downloaded from their site and docs.
With ODC2, there are docs and download links for the bootloader specifically,
which include an archive with a signing tool and a script to sign/combine all the blobs
and dd them into the right location on the microSD card, where the firmware will try to run them from.
The bootloader can also be extracted from some vendor-supplied distro .img
(be it ubuntu or android), if there's no better option - dd that onto the card
and rm -rf all the typical linux filesystem dirs on the main partition there, as well as
kernel/initramfs in /boot - what's left should be the bootloader
(incl. parts of it dd'ed before the partitions) and its configs.
Often all the magic there is "which block to dd u-boot.img to", and if
Alpine's Generic ARM image has a u-boot blob for the specific board (it has one for ODROID-C2),
then that one should be good to use, along with blobs in /efi and the main
config file in /extlinux - dd/cp u-boot to wherever the firmware expects it
and copy those extra stage/config dirs.
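Roughly, that last step looks something like this - the device name and the 512-byte
offset here are placeholders, the real values come from the board/vendor docs:

# dd if=u-boot.bin of=/dev/mmcblk0 bs=512 seek=1 conv=notrunc,fsync
# mount /dev/mmcblk0p1 /mnt/sd
# cp -a efi extlinux /mnt/sd/ && umount /mnt/sd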
Alpine wiki might have plenty of board-specific advice too, e.g. check out
Odroid-C2 page there.
Boot the board into u-boot with UART console connected, get that working.
There should be at least some output from the board's firmware, even without any
microSD card, then u-boot will print a lot of info about itself and what
it's doing, and eventually it (or later stages) will try to load the linux kernel,
initramfs and dtb files specified in whatever configs were copied over.
If "Generic ARM" files or whatever vendor distro was not cleaned-up from card
partitions, it might even start linux and some kind of proper OS, though the kernel
in that generic alpine image seems to lack ODC2 board support and has no
dtb for it.
The next part is to drop some working linux kernel/dtb (+initramfs maybe) and
a rootfs there, and that should boot into something interactive.
Conversely, there's no point moving ahead with the rest of the setup until the bootloader
and the console connected to it all work as expected.
Get or build a working Linux kernel.
It's not a big deal (imo at least) to configure and build these manually,
as there aren't really that many options when configuring those for ARM -
you basically pick the SoC and all options with e.g. MESON or SUNXI in the name
for that specific family, plus whatever generic features/filesystems/etc
you will need in there.
Idea is to enable with =y (compile-in) all drivers that are needed
for SoC to boot, and the rest can be =m (loadable modules).
Ideally, after being built somewhere, the kernel should be installed via a proper
Alpine package, and for my custom ODC2 kernel I just reused the linux-lts
APKBUILD file from the "aports" git repository, running "make nconfig" in the
src/build-... dir to make a new .config and then copying it back for abuild to use.
Running "abuild -K build rootpkg" is generally how the kernel package can be
rebuilt after any .config modifications in the build directory, i.e. after
fixing whatever forgotten things are needed for boot.
(good doc/example of the process can be found at strfry's blog here)
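To make that loop a bit more concrete, it can look roughly like this - package and
config-file names here are made up, and details will differ between aports versions:

% cp -r aports/main/linux-lts aports/main/linux-odc2
% cd aports/main/linux-odc2
% # edit APKBUILD here - pkgname, config file names, drop unneeded flavors
% abuild -K build rootpkg
% # if something turns out to be missing on boot - tweak config and rebuild:
% make -C src/build-* nconfig
% cp src/build-*/.config config-odc2.aarch64 && abuild checksum
% abuild -K build rootpkg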
These builds can be run in same-arch alpine chroot, in a VM or on some other board.
I've used an old RPi3 board to do aarch64 build with an RPi Alpine image
there - there're special pre-configured ones for those boards, likely
due to their massive popularity (so it was just the easiest thing to do).
I'd also suggest renaming the package from linux-lts to something unique for
the specific build/config, like "linux-odc2-5-15-36", the idea being to have
multiple such linux-* packages installable alongside each other easily.
This can help a lot if one of them turns out to have issues in the future -
not booting, crashes/panics, any incompatibilities, etc - as you can then
simply flip the bootloader config to load an earlier version, and test new versions
without committing to them via kexec (especially cool for remote
headless setups - see below).
For C2 board kconfig, I've initially grabbed one from
superna9999/meta-meson repo, built/booted that (on 1G-RAM RPi3, that large
build needed zram enabled for final vmlinuz linking), to test if it works at all,
and then configured much more minimal kernel from scratch.
One quirk needed there for ODC2 was enabling CONFIG_REGULATOR_* options for
"X voltage supplied by Y" messages in dmesg that eventually enable MMC reader,
but generally kernel's console output is pretty descriptive wrt what might be missing,
especially if it can be easily compared with an output from a working build like that.
For the purposes of initial boot, the linux kernel is just one "vmlinuz" file
inside the resulting .apk (tar can unpack those), but the dtbs-* directory must be
copied from the package alongside it too for ARM boards like these odroids, as they
use the blobs there to describe to the kernel what is connected where and how.
A specific dtb can also be concatenated onto the kernel image file, to have it all-in-one,
but there should be no need to do that, as all such board bootloaders expect
to have dtbs and already have configuration lines to load them.
So, to recap:
- Linux configuration: enable stuff needed to boot and for accessing MMC slot,
or whatever rootfs/squashfs storage.
- Build scripts: use APKBUILD from Alpine aports, just tweak config there
and maybe rename it, esp. if you want multiple/fallback kernels.
- Build machine: use other same-arch board or VM with alpine or its chroot.
- Install: "tar -xf" apk, drop vmlinuz file and dtbs dir into /boot.
Update extlinux.conf or whatever bootloader config for new kernel files, boot that.
E.g. after dropping vmlinuz and dtbs from a "linux-c51536r01" package, for u-boot.bin
from Alpine's "Generic ARM" build (running an EFI blob with extlinux.conf support):
TIMEOUT 20
PROMPT 1
DEFAULT odc2
LABEL odc2
MENU LABEL Linux odc2
KERNEL /boot/vmlinuz-c51536r01
# INITRD /boot/initramfs-c51536r01
FDTDIR /boot/dtbs-c51536r01
APPEND root=/dev/mmcblk0p1
Idea with this boot config is to simply get kernel to work and mount some
rootfs without issues, so e.g. for custom non-generic/modular one,
built specifically for ODROID-C2 board in prev step, there's no need for
that commented-out INITRD line.
Once this boots and mounts rootfs (and then presumably panics as it can't find
/sbin/init there), remaining part is to bootstrap/grab basic alpine rootfs for
it to run userspace OS parts from.
This test can also be skipped for a more generic and modular kernel config,
as it's not hard to test that with a proper rootfs and initramfs later either.
Set up a bootable Alpine rootfs.
It can be grabbed from the same Alpine downloads URL for any arch,
but if there's already a same-arch alpine build setup for the kernel package,
it might be easier to bootstrap it with all the necessary packages there instead:
# apk='apk -p odc2-root -X https://dl.alpinelinux.org/alpine/v3.15/main/ -U --allow-untrusted'
# $apk --arch aarch64 --initdb add alpine-base
# $apk add linux-firmware-none mkinitfs
# $apk add e2fsprogs # or ones for f2fs, btrfs or whatever
# $apk add linux-c51536r01-5.15.36-r1.apk
To avoid needing --allow-untrusted for anything but that local linux-c51536r01
package, apk keys can be pre-copied from /etc/apk to odc2-root/etc/apk just
like Alpine's setup-disk script does it.
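That key-copying step is just something like this (with the same odc2-root dir as above):

# mkdir -p odc2-root/etc/apk/keys
# cp /etc/apk/keys/* odc2-root/etc/apk/keys/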
That rootfs will have everything needed for the ODC2 board to boot, including the
custom kernel with an initramfs generated for it, containing any necessary modules.
Linux files in /boot should overwrite the manually-unpacked/copied ones from the earlier
test, the INITRD or similar line can be enabled/corrected in the bootloader config,
but after cp/rsync, the rootfs still needs a couple additional tweaks
(again similar to what Alpine's setup-disk script does):
etc/fstab: add e.g. /dev/mmcblk0p1 / ext4 rw,noatime 0 1 to fsck/remount rootfs.
I'd add tmpfs /tmp tmpfs size=30%,nodev,nosuid,mode=1777 0 0 there as well,
since Alpine's OpenRC is not systemd and doesn't have default mountpoints
all over the place.
etc/inittab: comment-out all default tty's and enable UART one.
E.g. ttyAML0::respawn:/sbin/getty -L ttyAML0 115200 vt100 for ODROID-C2 board.
This is important to get login prompt on the right console,
so make sure to check kernel output to find the right one, e.g. with ODC2 board:
c81004c0.serial: ttyAML0 at MMIO 0xc81004c0 (irq = 23, base_baud = 1500000) is a meson_uart
But it can also be ttyS0 or ttyAMA0 or something else on other ARM platforms.
etc/securetty: add a ttyAML0 line for the console there too, same as with inittab.
Otherwise, even though the login prompt will be there, getty will refuse to ask
for a password on that console, and immediately respond with "user is disabled"
or something like that.
This should be enough to boot and login into working OS on the same UART console.
Set up basic stuff on the booted rootfs.
With the initial rootfs booting and login-able, all that's left is to get this to
a more typical OS with working network and apk package management.
"vi" is default editor in busybox, and it can be easy to use once you know to
press "i" at the start (switching it to a notepad-like "insert" mode),
and press "esc" followed by ":wq" + "enter" at the end to save edits
(or ":q!" to discard).
/etc/apk/repositories should have couple lines like
https://dl.alpinelinux.org/alpine/v3.15/main/, appropriate for this
Alpine release and repos (just main + community, probably)
echo nameserver 1.1.1.1 > /etc/resolv.conf + /etc/network/interfaces:
printf '%s\n' >/etc/network/interfaces \
'auto lo' 'iface lo inet loopback' 'auto eth0' 'iface eth0 inet dhcp'
Or something like that.
Also, to dhcp-configure network right now:
ip link set eth0 up && udhcpc
ntpd -dd -n -q -p pool.ntp.org
rc-update add anything used/useful from /etc/init.d.
There aren't many scripts in there by default,
and all should be pretty self-explanatory wrt what they are for.
Something like this can be a good default:
for s in bootmisc hostname modules sysctl syslog ; do rc-update add $s boot; done
for s in devfs dmesg hwdrivers mdev ; do rc-update add $s sysinit; done
for s in killprocs mount-ro savecache ; do rc-update add $s shutdown; done
for s in networking ntpd sshd haveged ; do rc-update add $s; done
Run rc-update show -v to get a full picture of what is set up to run when
with openrc - it should be much simpler and more comprehensible than systemd in general.
sshd and haveged there can probably be installed after network works, or in the
earlier rootfs-setup step (and haveged is likely unnecessary with more recent
kernels that fixed blocking /dev/random).
ln -s /usr/share/zoneinfo/... /etc/localtime maybe?
That should be it - supported and upgradable Alpine with a custom kernel apk
and bootloader for this specific board.
Extra: kexec for updating kernels safely, without breaking the boot.
Once some initial linux kernel boots and works, and the board is potentially
tucked away somewhere hard to reach for pulling the microSD card out,
it can be scary to update the kernel to something that might fail
to boot, start crashing, etc.
An easy solution for that is kexec - a syscall/mechanism to start a new kernel
from another working kernel/OS.
The kexec-tools apk might need to be built/installed to use it - it's missing on some
architectures, but the APKBUILD from aports and the tool itself should work just
fine without changes. Also don't forget to enable kexec support in the kernel.
Using multiple kernel packages alongside each other as suggested above,
something like this should work for starting a new linux from the old one immediately:
# apk add --allow-untrusted linux-c51705r08-5.17.5-r0.apk
# kexec -sf /boot/vmlinuz-c51705r08 --initrd /boot/initramfs-c51705r08 --reuse-cmdline
It works just like the userspace exec() syscalls, but gets the current kernel to exec
a new one instead, which generally looks like a reboot, except without involving
the firmware and bootloader at all.
This way it's easy to run and test anything in the new kernel or its cmdline
options safely, with a simple reboot or power-cycle reverting back to the
old known-good linux/bootloader setup.
Once everything is confirmed working with the new kernel, the bootloader config can
be updated to use it instead, and the old linux-something package
replaced with the new one as a known-good fallback on the next update.
This process is not entirely foolproof however, as sometimes linux drivers or
maybe hardware have some kind of init-state mismatch, which actually happens
with C2 board, where its built-in network card fails to work after kexec,
unless its linux module is unloaded before that.
Something like this inside "screen" session can be used to fix that particular issue:
# rmmod dwmac_meson8b stmmac_platform stmmac && kexec ... ; reboot
"reboot" at the end should never run if "kexec" works, but if any of this fails,
C2 board won't end up misconfigured/disconnected, and just reboot back into
old kernel instead.
So far I've only found such hack needed with ODROID-C2, other boards seem to
work fine after kexec, so likely just a problem in this specific driver expecting
NIC to be in one specific state on init.
This post is kinda long because of how writing works, but really it all boils
down to "dd" for board-specific bootloader and using board-specific kernel
package with a generic Alpine rootfs, so not too difficult, especially given how
simple and obvious everything is in Alpine Linux itself.
Supporting many different boards, each with its unique kernel, bootloader and
quirks, seems to be a pain for most distros, which can barely get by with x86 support,
but if the task is simplified to just providing a rootfs and a pre-built package
repository, a reasonably well-supported distro like Alpine has a good chance
to work well in the long run, I think.
ArchLinuxARM, Armbian and various vendor distros like Raspbian provide a nicer
experience out-of-the-box, but eventually have to drop support for old
hardware, while these old ARM boards don't really go anywhere in my experience,
and keep working fine for their purpose, hence a long-term bet on something
like alpine seems more reasonable than distro-hopping or needless hardware
replacement every couple years, and alpine in particular is just a great fit
for such smaller systems.
May 04, 2022
It's usually easy to get computers to talk over ethernet, but making traditional
PC hardware talk to electronics or a GPIO line tends to be more difficult.
The issue came up when I wanted to make a smart wakeup switch for a local backup
ARM-board the other day, which would toggle separate 5V power to it and its
drive in-between weekly backups, so I went on to explore the options there.
The simplest one is actually to wake the ARM board up using Wake-on-LAN, so it can
then control its own and its peripherals' power via its GPIO lines, but I don't think
mine supports that.
The next easiest one (from my pov) seems to be grabbing any Arduino-ish MCU lying
around, hooking it up to a relay and making it into a "smart switch" via a couple
dozen lines of glue code.
The problem with that approach is that PC mobos tend not to have simple GPIO lines
easily exposed and intended for use from userspace, but there are options:
Parallel port, if motherboard is ancient enough, or if hooked-up via one of
cheap USB-to-LPT cables with tiny controller in the connector.
These have a bunch of essentially GPIO lines, albeit in a funny connector
package, and with a weird voltage, but pretty sure only really old mobos have
that built-in by now.
RS-232 COM port, likely via a header on the board.
Dunno about the latest hardware, but at least boards from 6-7 years ago can still
have these.
Even though it's a serial port, it has DTR and RTS pins, which are easy
to control directly, using ioctl() with TIOCM_DTR and TIOCM_RTS flags on /dev/ttyS*.
Outside of pure bit-banging, RS-232 can also usually be hooked up to simpler
hardware's UART pins via an RS232-to-TTL chip, instead of just being used as GPIO via an optocoupler.
PC Speaker!
That's a 5V signal too (PWM, sure, but still no problem to read),
and not like it's in much demand for anything else these days.
Might have to either emit/detect distinct frequency or sequence of beeps
to make sure that's not the signal of machine booting on the other side.
3-pin fan headers with voltage-level controls.
Can be a neat use for those analog pins on arduinos, though not all monitoring
chips are RE-ed enough to have linux drivers with control over fans, unfortunately.
Power on USB ports as a signal - at least some mobos can put those to sleep
via sysfs, I think, usually with per-controller level of granularity.
For my purposes, COM was available via an easy pin header, and it was easy
enough to use its control pins, with some simple voltage-level shifting
(COM itself has -12V/+12V for 0/1).
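For reference, toggling those control lines from userspace needs just a couple
ioctl() calls - a python sketch of that (device path, timing and which line to use
are of course setup-specific):

import fcntl, os, struct, termios, time

# Open the port without becoming its controlling tty and without
# blocking on carrier-detect - only need an fd for ioctl() here
fd = os.open('/dev/ttyS0', os.O_RDWR | os.O_NOCTTY | os.O_NONBLOCK)
dtr = struct.pack('I', termios.TIOCM_DTR)
fcntl.ioctl(fd, termios.TIOCMBIS, dtr)  # assert DTR - e.g. relay on
time.sleep(5)                           # do whatever needs the power toggled
fcntl.ioctl(fd, termios.TIOCMBIC, dtr)  # clear DTR - relay off
os.close(fd)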
Bit of a shame that PCs don't traditionally have robust GPIO headers for users -
feel like that'd have enabled a lot of fancy case modding, home automation,
or gotten more people into understanding basic electronics, if nothing else.
Apr 05, 2022
The Internet has been heavily restricted here for a while (Russia), but with recent
events, the local gov seems to have gotten a renewed interest - shared with everyone
else in the world - in blocking whatever's left of it, so it has obviously gotten worse.
Until full whitelisting and/or ML-based filtering for "known-good" traffic is
implemented though, the simple way around it seems to be tunneling traffic through
any IP that's not blocked yet.
I've used a simple "ssh -D" socks-proxy with an on-off button in the browser to
flip around restrictions (or more like different restriction regimes),
but it does get tiresome these days, and some scripts and tools outside of
browsing are starting to get affected too.
So ideally traffic to some services/domains should be routed around internal
censorship or geo-blocking on the other side automatically.
Linux routing tables allow doing that for specific IPs (which is usually called
"policy(-based) routing" or PBR), but one missing component for my purposes was
something to sync these tables with reality, where internet services use DNS
hostnames over ever-shifting IPs, blocks get applied at the HTTP protocol level or as
connections dropped after a DPI filter, and these disruptions are rather dynamic
from day to day, most times looking like some fickle infants flipping on/off
switches on either side.
Having a script to http(s)-check remote endpoints and change routing on the fly
seems to work well for most stuff in my case, with a regular linux box as the
router here, which can run such a script and easily accommodate such changes,
implemented here:
https://github.com/mk-fg/name-based-routing-policy-controller
Checks are pretty straightforward parallel curl fetches (as curl can be
generally relied upon to work with all modern http[s] stuff properly),
but processing their results can be somewhat tricky,
and dynamic routing itself can be implemented in a bunch of different ways.
Generally, sending traffic through tunnels needs some help from the firewall
(esp. since straight-up "nat" in "ip rule" is deprecated), to at least NAT
packets going out through whatever local IPs and to exclude the checker app itself from
the workarounds (so that it can do its checks), so it seems to make sense to mark
route-around traffic there as well.
fwmarks based on an nftables set work well for that, together with an "ip rule"
sending marked packets through a separate routing table, while a simple skuid
match or something like cgroup service-based filtering works for the exceptions.
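A bare-bones sketch of that part - table/set/chain names, the fwmark value, the
"nbrpc" user and the wg0 interface here are all arbitrary:

table inet nbrpc {
  set bypass4 { type ipv4_addr; }
  chain mark-out {
    type route hook output priority mangle;
    meta skuid nbrpc return comment "checker app goes over the direct route"
    ip daddr @bypass4 meta mark set 0x123
  }
  chain nat-out {
    type nat hook postrouting priority srcnat;
    oifname wg0 masquerade
  }
}

Plus the routing side of it:

# ip rule add fwmark 0x123 lookup 123
# ip route add default dev wg0 table 123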
There's a (hopefully) more comprehensive example of such routing setup in the
nbrpc project's repo/README.
Strongly suspect that this kind of workaround is only the beginning though,
and checks for different/mangled content (instead of just HTTP codes indicating blocking)
might be needed there, with workarounds implemented as something masquerading
as multiplexed H2 connections instead of straightforward encrypted wg tunnels,
with individual exit points checked for whether they've been discovered and have to
be cycled today, as well as ditching centrally-subverted TLS PKI for any kind of
authenticity guarantees.
This heavily reminds me of the relatively old "Internet Perspectives" and
Convergence ideas/projects (see also the second part of an old post here),
most of which seem to have been long-abandoned, as people got used to the
corporate-backed internet being mostly-working and mostly-secure, with
apparently significant money riding on that being the case.
The only counter-examples are heavily-authoritarian, lawless and/or pariah states
where global currency no longer matters as much, and commercial internet doesn't
care about those.
Feel like I might need to at least re-examine those efforts here soon enough,
if not just stop bothering and go offline entirely.
Aug 31, 2021
Linux 5.13 finally merged an nftables feature that seems to have been forgotten and lost
as a random PATCH on LKML since ~2016 - the "cgroupsv2" path match (upstreamed patch here).
Given how no one seemed particularly interested in it for years, guess distros only
migrated to using the unified cgroup hierarchy (cgroup-v2) by
default rather recently, so probably not many people use network filtering by
these either, even though it's really neat.
On a modern linux with systemd, the cgroup tree is mostly managed by it, and looks
something like this when you run e.g. systemctl status:
/init.scope # pid-1 systemd itself
/system.slice/systemd-udevd.service
/system.slice/systemd-networkd.service
/system.slice/dbus.service
/system.slice/sshd.service
/system.slice/postfix.service
...
/system.slice/system-getty.slice/getty@tty1.service
/system.slice/system-getty.slice/getty@tty2.service
...
/user.slice/user-1000.slice/user@1000.service/app.slice/dbus.service
/user.slice/user-1000.slice/user@1000.service/app.slice/pulseaudio.service
/user.slice/user-1000.slice/user@1000.service/app.slice/xorg.service
/user.slice/user-1000.slice/user@1000.service/app.slice/WM.service
...
I.e. every defined app neatly separated into its own cgroup, and there are
well-defined ways to group these into slices (or manually-started stuff into scopes),
which is done automatically for instantiated units, user sessions, and some special stuff.
See earlier "cgroup-v2 resource limits for apps with systemd scopes and slices"
post for more details on all these and some neat ways to use them for any arbitrary
pids (think "systemd-run") as well as on-the-fly cgroup-controller resource limits.
Such "cgroup-controller" resource limits notably do not include networking,
as it's historically been filtered separately from cpu/memory/io stuff via
systemd-wide firewalls - ipchains, iptables, nftables - and more recently via
various eBPFs microkernels - either bpfilter or cgroup and socket-attached BPFs
(e.g. "cgroup/skb" eBPF).
And that kind of per-cgroup filtering is a very useful concept, since you already
have these nicely grouping and labelling everything in the system, and any new
(sub-)groups are easy to add with extra slices/scopes or systemd-run wrappers.
It allows you to say, for example - "only my terminal app is allowed to access
VPN network, and only on ssh ports", or "this game doesn't get any network access",
or "this ssh user only needs this and this network access", etc.
systemd allows some very basic filtering for this kind of stuff via
IPAddressAllow=/IPAddressDeny= (systemd.resource-control) and custom BPFs,
and these can work fine in some use-cases, but are somewhat unreliable (systemd
doesn't treat missing resource controls as failure and quietly ignores them),
have very limited matching capabilities (unless you code these yourself into
custom BPFs), and are spread all over the system with "systemd --user" units
adding/setting their own restrictions from home dirs, often in form of tiny
hard-to-track override files.
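For reference, those systemd-side options look something like this in a unit or
drop-in file (addresses here are made up):

[Service]
IPAddressDeny=any
IPAddressAllow=10.10.0.0/16 192.168.1.0/24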
But nftables and iptables can be used to filter by cgroups too,
with a single coherent and easy-to-read system-wide policy in one file,
using rules like:
add rule inet filter vpn.whitelist socket cgroupv2 level 5 \
"user.slice/user-1000.slice/user@1000.service/app.slice/claws-mail.scope" \
ip daddr mail.intranet.local tcp dport {25, 143} accept
I already use system/container-wide firewalls everywhere, so to me this looks
like a more convenient and much more powerful approach from a top-down system
admin perspective, while attaching custom BPF filters for apps is more useful in
a bottom-up scenario and should probably mostly be left for devs and packagers,
shipped/bundled with the apps, just like landlock or seccomp rulesets and
such wrappers - e.g. set in apps' systemd unit files or flatpak containers
(flatpaks only support trivial network on/off switch atm though).
There's a quirk in these "socket cgroupsv2" rules in nftables however (same as with
their iptables "-m cgroup --path ..." counterpart) - they don't actually match
cgroup paths, but rather resolve them to numeric cgroup IDs when such rules are
loaded into the kernel, and don't automatically update them in any way afterwards.
This means that:
Firewall rules can't be added for not-yet-existing cgroups.
I.e. loading nftables.conf with a rule like the one above on early boot would
produce "Error: cgroupv2 path fails: No such file or directory" from nft (and
"xt_cgroup: invalid path, errno=-2" error in dmesg for iptables).
When a cgroup gets removed and re-created, none of the existing rules will apply
to it, as it will have a new and unique ID.
Basically such rules in a system-wide policy config only work for cgroups
that are created early on boot and never removed after that, which is not how
systemd works with its cgroups, obviously - they are entirely transient and get
added/removed as necessary.
But this doesn't mean that such filtering is unusable at all, just that it
has to work slightly differently, in a "decoupled" fashion:
Following the decoupled approach: If the cgroup is gone, the filtering
policy would not match anymore. You only have to subscribe to events
and perform an incremental updates to tear down the side of the
filtering policy that you don't need anymore. If a new cgroup is
created, you load the filtering policy for the new cgroup and then add
processes to that cgroup. You only have to follow the right sequence
to avoid problems.
There seem to be no easy-to-find helpers to manage such filtering policy around
yet though. It was proposed for systemd itself to do that in RFE-7327,
but as it doesn't manage the system firewall (yet?), this seems to be a non-starter.
So I had to add one myself - mk-fg/systemd-cgroup-nftables-policy-manager
(scnpm) - a small tool to monitor system/user unit events from systemd journal
tags (think "journalctl -o json-pretty") and (re-)apply rules for these via libnftables.
Since such rules can't be applied directly, and to be explicit wrt what to monitor,
they have to be specified as comment lines in nftables.conf, e.g.:
# postfix.service :: add rule inet filter vpn.whitelist \
# socket cgroupv2 level 2 "system.slice/postfix.service" tcp dport 25 accept
# app-mail.scope :: add rule inet filter vpn.whitelist socket cgroupv2 level 5 \
# "user.slice/user-1000.slice/user@1000.service/app.slice/app-mail.scope" \
# ip daddr mail.intranet.local tcp dport {25, 143} accept
add rule inet filter output oifname my-vpn jump vpn.whitelist
add rule inet filter output oifname my-vpn reject with icmpx type admin-prohibited
And this works pretty well for my purposes so far.
One particularly relevant use-case, as per the example above, is migrating everything
to use "zero-trust" overlay networks (or just VPNs), though on modern server
setups access to these tends to be much easier to manage by running something
like innernet (or tailscale, or one of a dozen other WireGuard tunnel
managers) in netns containers (docker, systemd-nspawn, lxc) or VMs, as access
in those systems tends to be regulated by just link availability/bridging,
which translates to having the right crypto keys for a set of endpoints with wg tunnels.
So this is more of a thing for more complicated desktop machines rather than
properly containerized servers, but it's still a very nice way to handle access controls,
instead of just old-style IP/port/etc matching without specifying which app
should have that kind of access (as that's almost never universal outside of the
aforementioned dedicated single-app containers), composing it all together in
one coherent system-wide policy file.
Aug 30, 2021
Being an old linux user with slackware/gentoo background, I still prefer to
compile kernel for local desktop/server machines from sources, if only to check
which new things get added there between releases and how they're organized.
This is nowhere near as demanding on resources as building a distro kernel with
all possible modules, but still can take a while, so not a great fit for my old
ultrabook or desktop machine, which both must be 10yo+ by now.
Two obvious ways to address this seem to be distributed compilation via distcc
or just building the thing on a different machine.
And distcc turns out to be surprisingly bad for this task - it doesn't support
the gcc plugins that modern kernels use for some security features, requires
suppressing a bunch of gcc warnings, and even then, with or without pipelining, it
eats roughly the same amount of the local machine's CPU as a local build, without even fully
loading the remote i5 machine, as I guess local preprocessing and distcc's own
overhead already add up to a lot with kernel code.
The second option is a fully-remote build, but packaging just kernel + module
binaries like distros do there kinda sucks, as that adds an extra dependency
(for something very basic) and then it's hard to later quickly tweak and rebuild
it or add some module for some new networking or hardware thingy that you want
to use - and for that to be fast, kbuild/make's build cache of .o object files
needs to be local as well.
Such a cache turns out to be a bit hard to share/rsync between machines,
due to the following caveats:
Absolute paths used in intermediate kbuild files.
Just running mv linux-5.13 linux-5.13-a will force a full rebuild for "make"
inside, so the thing has to be built in the same dir everywhere, e.g.
/var/src/linux-5.13 on both local/remote machines.
Symlinks don't help with this, but bind mounts should, and just using a
consistent build location works as well.
There's a default-disabled KBUILD_ABS_SRCTREE make-flag for this, with docs
saying "Kbuild uses a relative path to point to the tree when possible",
but that doesn't seem to be the case for me at all - maybe "when possible" is too
limited, or was only true with older toolchains.
Some caches use "ls -l | md5" as a key, which breaks between machines due to
different usernames or unstable ls output in general.
One relevant place where this happens for me is kernel/gen_kheaders.sh,
and can be worked around using "find -printf ..." there:
% patch -tNp1 -l <<'EOF'
diff --git a/kernel/gen_kheaders.sh b/kernel/gen_kheaders.sh
index 34a1dc2..bfa0dd9 100755
--- a/kernel/gen_kheaders.sh
+++ b/kernel/gen_kheaders.sh
@@ -44,4 +44,5 @@ all_dirs="$all_dirs $dir_list"
-headers_md5="$(find $all_dirs -name "*.h" |
- grep -v "include/generated/compile.h" |
- grep -v "include/generated/autoconf.h" |
- xargs ls -l | md5sum | cut -d ' ' -f1)"
+headers_md5="$(
+ find $all_dirs -name "*.h" -printf '%p :: %Y:%l :: %s :: %T@\n' |
+ grep -v "include/generated/compile.h" |
+ grep -v "include/generated/autoconf.h" |
+ sed 's/\.[0-9]\+$//' | LANG=C sort | md5sum | cut -d ' ' -f1)"
@@ -50 +51 @@ headers_md5="$(find $all_dirs -name "*.h" |
-this_file_md5="$(ls -l $sfile | md5sum | cut -d ' ' -f1)"
+this_file_md5="$(md5sum $sfile | cut -d ' ' -f1)"
EOF
One funny thing there is the extra sed 's/\.[0-9]\+$//' to cut sub-second precision
from find's %T@ timestamps, as some older filesystems (like reiserfs, which is
still great for tiny-file performance and storage efficiency) don't support
such precision on these, so the timestamps can change in this output after a copy,
without any complaints from e.g. rsync.
Host and time-dependent KBUILD_* variables.
These embed build user/time/host etc, are relevant for reproducible builds,
and maybe not so much here, but still best to lock down for consistency via e.g.:
make KBUILD_BUILD_USER=user KBUILD_BUILD_HOST=host \
KBUILD_BUILD_VERSION=b1 KBUILD_BUILD_TIMESTAMP=e1
All these vars are documented under Documentation/ in the kernel tree.
The compiler toolchain must be roughly the same between these machines.
Not hard to do if they're both the same Arch, but otherwise probably the best way to
get this is to use the same distro(s) for the build within containers (e.g. nspawn).
All this allows rsync'ing the build tree (-rtlz) from the remote after "make" and doing
the usual "make install", or tweaking it locally later without doing slow full rebuilds.
Apr 26, 2021
bsdtar from libarchive is the usual go-to tool for packing stuff up in linux
scripts, but it always had an annoying quirk for me - no data checksums built
into tar formats.
Ideally you'd unpack any .tar and if it's truncated/corrupted in any way, bsdtar
will spot that and tell you, but that's often not what happens, for example:
% dd if=/dev/urandom of=file.orig bs=1M count=1
% cp -a file{.orig,} && bsdtar -czf file.tar.gz -- file
% bsdtar -xf file.tar.gz && b2sum -l96 file{.orig,}
2d6b00cfb0b8fc48d81a0545 file.orig
2d6b00cfb0b8fc48d81a0545 file
% python -c 'f=open("file.tar.gz", "r+b"); f.seek(512 * 2**10); f.write(b"\1\2\3")'
% bsdtar -xf file.tar.gz && b2sum -l96 file{.orig,}
2d6b00cfb0b8fc48d81a0545 file.orig
c9423358edc982ba8316639b file
In a compressed multi-file archive such a change can corrupt the tail of the archive
enough to affect one of the tarball headers, and those do have
checksums, so it might get detected, but that's very unreliable, and doesn't apply to
uncompressed archives (e.g. media-file backups which won't need compression).
The typical solution is to put e.g. .sha256 files next to archives and hope that
people check those, but of course no one ever does in reality - bsdtar itself
has to always do it implicitly for that kind of checking to stick,
extra opt-in steps won't work.
Putting a checksum in the filename is a bit better, but still not useful for the
same reason - almost no one will ever check it, unless it's automatic.
Luckily bsdtar has at least some safe options there, which I think should always
be used by default, unless there's a good reason not to in some particular case:
bsdtar --xz (and its --lzma predecessor):
% bsdtar -cf file.tar.xz --xz -- file
% python -c 'f=open("file.tar.xz", "r+b"); f.seek(...); f.write(...)'
% bsdtar -xf file.tar.xz && b2sum -l96 file{.orig,}
file: Lzma library error: Corrupted input data
% tar -xf file.tar.xz && b2sum -l96 file{.orig,}
xz: (stdin): Compressed data is corrupt
Heavier on resources than .gz and might be a bit less compatible, but given that even
GNU tar supports it out of the box, and it offers much better compression (with faster
decompression) in addition to mandatory checksumming, it should probably always be the
default for compressed archives.
Lowering the compression level might help a bit with performance as well.
bsdtar --format zip:
% bsdtar -cf file.zip --format zip -- file
% python -c 'f=open("file.zip", "r+b"); f.seek(...); f.write(...)'
% bsdtar -xf file.zip && b2sum -l96 file{.orig,}
file: ZIP bad CRC: 0x2c1170b7 should be 0xc3aeb29f
Can be an OK option if there's no need for unixy file metadata, streaming decompression,
and/or max compatibility is a plus, as good old zip should be readable everywhere.
Simple deflate compression is inferior to .xz, so not the best for linux-only
stuff or if compression is not needed, BUT there is --options zip:compression=store,
which basically just adds CRC32 checksums.
bsdtar --use-compress-program zstd but NOT its built-in --zstd flag:
% bsdtar -cf file.tar.zst --use-compress-program zstd -- file
% python -c 'f=open("file.tar.zst", "r+b"); f.seek(...); f.write(...)'
% bsdtar -xf file.tar.zst && b2sum -l96 file{.orig,}
file: Lzma library error: Corrupted input data
% tar -xf file.tar.zst && b2sum -l96 file{.orig,}
file: Zstd decompression failed: Restored data doesn't match checksum
Very fast and efficient, and gaining popularity quickly, but the bsdtar --zstd flag
will use libzstd defaults (and an explicit zstd --no-check when using the binary too)
and won't add checksums (!!!), even though it validates data against them on
decompression.
Still a good alternative to the above, as long as you pretend that the --zstd option
does not exist and always go with an explicit zstd command instead.
GNU tar does not seem to have this problem, as --zstd there always uses the
binary and its defaults (and -C/--check in particular).
bsdtar --lz4 --options lz4:block-checksum:
% bsdtar -cf file.tar.lz4 --lz4 --options lz4:block-checksum -- file
% python -c 'f=open("file.tar.lz4", "r+b"); f.seek(...); f.write(...)'
% bsdtar -xf file.tar.lz4 && b2sum -l96 file{.orig,}
bsdtar: Error opening archive: malformed lz4 data
% tar -I lz4 -xf file.tar.lz4 && b2sum -l96 file{.orig,}
Error 66 : Decompression error : ERROR_blockChecksum_invalid
lz4 barely adds any compression resource overhead, so is essentially free,
same for xxHash32 checksums there, so can be a safe replacement for uncompressed tar.
bsdtar manpage says that lz4 should have stream checksum default-enabled,
but it doesn't seem to help at all with corruption - only block-checksums
like used here do.
GNU tar doesn't understand lz4 by default, so requires explicit -I lz4.
bsdtar --bzip2 - actually checks integrity, but it's a very cpu-inefficient algorithm,
so best to always avoid it in favor of --xz or zstd these days.
bsdtar --lzop - similar to lz4, somewhat less common,
but always respects data consistency via adler32 checksums.
bsdtar --lrzip - opposite of --lzop above wrt compression, but even
less-common/niche wrt install base and use-cases. Adds/checks md5 hashes by default.
It's still sad that tar can't have some post-data checksum headers, but always
using one of these as a go-to option seems to mitigate that shortcoming,
and these options seem to cover most common use-cases pretty well.
What DOES NOT provide consistency checks with bsdtar: -z/--gz, --zstd (not even
when it's built without libzstd!), --lz4 without lz4:block-checksum option,
base no-compression mode.
With -z/--gz being replaced by .zst everywhere, hopefully either libzstd changes
its no-checksums default or bsdtar/libarchive might override it, though I wouldn't
hold much hope for either of these, just gotta be careful with that particular mode.
Aug 28, 2020
Recently Google stopped sending email notifications for YouTube subscription
feeds, as apparently conversion of these to page views was 0.1% or something.
And even though I've used these to track updates exclusively,
guess it's fair, as I also had xdg-open script jury-rigged to just open any
youtube links in mpv instead of bothering with much inferior ad-ridden
in-browser player.
One alternative workaround is to grab OPML of subscription Atom feeds
and only use those from now on, converting these to familiar to-watch
notification emails, which I kinda like for this purpose because they are
persistent and have convenient state tracking (via read/unread and tags) and
info text without the need to click through, collected in a dedicated mailbox dir.
Fetching/parsing feeds and sending emails are long-solved problems (feedparser
in python does a great job on the former, MTAs on the latter), while check intervals
and state tracking usually need to be custom, and can use a few tricks that I
haven't seen applied too widely.
Trick One - use moving average for an estimate of when events (feed updates) happen.
Some feeds can have daily activity, some have updates once per month, and
checking both every couple hours would be incredibly wasteful to both client and server,
yet it seems to be common practice in this type of scenario.
The obvious fix is to get some "average update interval" and space out checks in
time based on that, but using a simple mean value ("sum / n") has significant
drawbacks for this:
- You have to keep a history of old timestamps/intervals to calculate it.
- It treats recent intervals the same as old ones, even though the recent ones are more relevant.
A weighted moving average fixes both of these elegantly:
interval_ewma = w * last_interval + (1 - w) * interval_ewma
Where "w" is the weight of the latest interval vs all previous ones, e.g. 0.3 to have
the new value be ~30% determined by the last interval, ~30% of the remainder by the one
before that, and so on.
This allows keeping only one "interval_ewma" float in the state (for each individual feed)
instead of the list of values needed for a mean, and works better for prediction
due to higher weights on more recent values.
For checking feeds in particular, it can also be updated on "no new items" attempts,
to have the backoff interval increase (up to some max value), instead of using the last
interval ad infinitum, long past the point when it was relevant.
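A minimal sketch of that update logic - the specific weight/cap values and the way
time-since-last-entry gets fed back in on empty checks are just one way to do it:

def ewma_update(ewma, last_interval, w=0.3):
    return w * last_interval + (1 - w) * ewma

delay = ewma_update(4 * 3600, 24 * 3600)  # a check that did find a new entry
# A "no new items" check - feed capped time-since-last-entry back in,
# so the estimated delay backs off instead of staying fixed forever
delay = ewma_update(delay, min(2 * 24 * 3600, 30 * 24 * 3600))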
Trick Two - keep a state log.
Very useful thing for debugging automated stuff like this - instead of keeping
only the last "feed last_timestamp interval_ewma ..." state, append every new one to a log file.
When such a log file grows too long (e.g. a couple megs), rename it to .old
and seed the new one with the last state for each distinct thing (feed) from there.
Add some timestamp and basic-info prefix before the json line there, and it allows
trivially checking when the script was run, which feeds it checked, what change(s)
it detected there (affecting that state value), as well as e.g. easily removing the
last line for some feed to test-run processing of its last update again.
When something goes wrong, this kind of log is invaluable, as not only can you
re-trace what happened, but also repeat the last state transition with
e.g. some --debug option and see exactly what happened there.
Keeping only the last state doesn't allow for any of that - you'd
have to either keep a separate log of operations for that anyway and manually
re-construct the older state from there to retrace the last script steps properly,
tweak the inputs to re-construct that state another way, or maybe just drop the old state
and hope that re-running the script with the latest inputs hits the same bug(s).
I.e. there's basically no substitute for it, as a text log is pretty much the same thing,
describing state changes in non-machine-readable and often-incomplete text,
which can be added to such a state-log as extra metadata anyway.
With youtube-rss-tracker script with multiple feeds, it's a single log file,
storing mostly-json lines like these (wrapped for readability):
2020-08-28 05:59 :: UC2C_jShtL725hvbm1arSV9w 'CGP Grey' ::
{"chan": "UC2C_jShtL725hvbm1arSV9w", "delay_ewma": 400150.7167069313,
"ts_last_check": 1598576342.064458, "last_entry_id": "FUV-dyMpi8K_"}
2020-08-28 05:59 :: UCUcyEsEjhPEDf69RRVhRh4A 'The Great War' ::
{"chan": "UCUcyEsEjhPEDf69RRVhRh4A", "delay_ewma": 398816.38761319243,
"ts_last_check": 1598576342.064458, "last_entry_id": "UGomntKCjFJL"}
Text prefix there is useful when reading the log, while script itself only cares
about json bit after that.
If anything doesn't work well - notifications missing, formatted badly, errors, etc -
you just remove the last state and tweak/re-run the script (maybe in --dry-run mode too),
and get pretty much exactly the same thing as happened last time, aside from any input
(feed) changes, which should be very predictable in this particular case.
Not sure what this concept is called in CS, there's gotta be some fancy name for it.
Link to YouTube feed email-notification script used as an example here:
Jun 26, 2020
The usual and useful way to represent traffic counters for accounting purposes like
"how much inbound/outbound traffic passed through within a specific day/month/year?"
is bar charts or tables, I think,
with one bar or table entry for each day, month or year (accounting period).
Counter values in general are not special in prometheus, so grafana builds the
usual monotonic-line graphs for these by default, which are not very useful.
Results of prometheus query like increase(iface_traffic_bytes_total[1d])
are also confusing, as it returns arbitrary sliding windows,
with points that don't correspond to any fixed accounting periods.
But grafana is not picky about its data sources, so it's easy to query
prometheus and re-aggregate (and maybe cache) data as necessary,
e.g. via its "Simple JSON" datasource plugin.
To get all historical data, it's useful to go back to when the metric first appeared,
and official prometheus clients add a "_created" metric for counters, which can be
queried for its min value to get a (slightly-preceding) earliest-value timestamp, for example
min_over_time(iface_traffic_bytes_created[10y]).
From there, the value for each bar will be the diff between min/max for each accounting
interval, which can be queried naively via min_over_time + max_over_time (like
created-time above), but if the exporter is written smartly or persistent 64-bit counters
are used (e.g. SNMP IF-MIB::ifXEntry), these lookups can be simplified a lot by
just querying the values at the start and end of the period,
instead of having prometheus do a full sweep,
which can be especially bad over something like a year of datapoints.
Such optimization can potentially return no values, if data at that interval
start/end was missing, but that's easy to work around by expanding the lookup range
until something is returned.
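I.e. instead of a single min_over_time/max_over_time sweep, two instant queries at
the period boundaries - roughly like this, with example metric/timestamps and the
default local prometheus URL:

% curl -sG http://localhost:9090/api/v1/query \
    --data-urlencode query=iface_traffic_bytes_total --data-urlencode time=1590969600
% curl -sG http://localhost:9090/api/v1/query \
    --data-urlencode query=iface_traffic_bytes_total --data-urlencode time=1593561600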
Deltas between resulting max/min values are easy to dump as JSON for bar chart
or table in response to grafana HTTP requests.
Had to implement this as a prometheus-grafana-simplejson-aggregator py3
script (no deps), which runs either on timer to query/aggregate latest data from
prometheus (to sqlite), or as a uWSGI app which can be queried anytime from grafana.
Also needed to collect/export data into prometheus over SNMP for that,
so related script is prometheus-snmp-iface-counters-exporter for SNMPv3
IF-MIB/hrSystemUptime queries (using pysnmp), exporting counters via
prometheus client module.
(note - you'd probably want to check and increase retention period in prometheus
for data like that, as it defaults to dropping datapoints older than a month)
I think aggregation for such accounting use-case can be easily implemented in
either prometheus or grafana, with a special fixed-intervals query type in the
former or more clever queries like described above in the latter, but found that
both seem to explicitly reject such use-case as "not supported". Oh well.
Code links (see README in the repo and -h/--help for usage info):
Jun 21, 2020
File tagging is one useful concept that stuck with me since I started using tmsu
and made a codetag tool a while ago to auto-add tags to local files.
For example, with code, you'd usually have code files and whatever assets
arranged in some kind of project trees, with one dir per project and all files
related to it in a git repo under that.
But then when you work on something unrelated and remember "oh, I've implemented
or seen this somewhere already", there's no easy and quick "grep all python
files" option with such a hierarchy, as finding all of them on the whole fs tends
to take a while - too long for a quick check to not be distracting anyway.
And on top of that, the filesystem generally only provides filenames as metadata,
while e.g. script shebangs or file magic are not considered, so a "filetag" python
script won't even be detected when naively grepping all *.py files.
An easy and sufficient fix that I've found for that is to have a cronjob/timer go
over files in all useful non-generic fs locations and build a db for later querying,
which is what codetag and tmsu have done for years.
But I've never come to like golang in any way (would highly recommend checking
out OCaml instead), and tmsu never worked well for my purposes - it was slow to
interface with, took a lot of time to build the db, even longer to check and clean
it up, while the querying interface was clunky and lackluster (long commands, no
NUL-separated output, gone files in output, etc).
So a couple months ago I found time to just rewrite all that in one python script -
filetag - which does all the codetag + tmsu magic in something like 100 lines of
actual code, faster, and without the shortcomings of the old tools.
I was initially expecting to use sqlite there, but then realized that I only
index/lookup stuff by tags, so a key-value db should suffice, and it won't
actually need to be updated either, only rebuilt from scratch on each indexing run,
so I used simple gdbm at first.
Didn't want to store many duplicate byte-strings there however, so I split
keys into three namespaces and stored unique paths and tags as numeric indexes,
which can be looked up in the same db, ending up with a layout like this:
"\0" "tag_bits" = tag1 "\0" tag2 ...
"\1" path-index-1 = path-1
"\1" path-index-2 = path-2
...
"\2" tag-index-1 = path-index-1 path-index-2 ...
"\2" tag-index-2 = path-index-1 path-index-2 ...
...
So a db lookup loads the "tag_bits" value, finds all specified tag-indexes there
(encoded using a minimal number of bytes), then looks up each one, getting a set
of path indexes for each tag (uint64 numbers).
If any logic has to be applied on such a lookup, i.e. "want these tags or these,
but not those", it can be compiled into a DNF "OR of bitmasks" list,
which is then checked against the tag-bits of each path-index in the superset,
doing the filtering.
Resulting paths are looked up by their index and printed out.
Looks pretty minimal and efficient, nothing is really duplicated, right?
In an RDBMS like sqlite, I'd probably store this as a simple tag + path table,
with an index on the "tag" field, and have it compact that as necessary.
Well, running filetag on my source/projects dirs in ~ gets a 100M gdbm file with the
schema described above, and a 2.8M sqlite db with that simple schema.
The massive difference seems to be due to sqlite packing such repetitive and
sometimes-ascii data very tightly and just being generally very clever and efficient.
Compressing the gdbm file with zstd gets it to 1.5M too, i.e. down to 1.5% - impressive!
And it's not a mostly-empty file, aside from all those zero bytes in uint64 indexes.
Anyhow, the point and my take-away here was, once again - "just use sqlite where
possible, and don't bother with other local storages".
It's fast, efficient, always available, very reliable, easy to use, and covers a
ton of use-cases, working great for all of them, even when they look too simple
for it, like the one above.
One less-obvious aspect from the list above, which I've bumped into many times
myself, and probably even mentioned on this blog already, is "very reliable" -
dbm modules and many other "simple" databases have all sorts of poorly-documented
failure modes, corrupting the db and losing data where sqlite always "just works".
Wanted to document this interesting fail here mostly to reinforce the notion
in my own head once more.
sqlite is really awesome, basically :)
Jun 02, 2020
After replacing DNS resolver daemons in a couple places a bunch of weeks ago,
I found out the hard way that nothing is quite as reliable as the (t)rusty dnscache
from djbdns, which is sadly too venerable and lacks some essential features
at this point.
Complex things like systemd-resolved and unbound either crash, hang
or just start dropping all requests silently for no clear reason
(happens outside of conditions described in that email as well, to this day).
But whatever, a service as basic as a name resolver needs some kind of watchdog
anyway, and it seems easy to test too - delegate some subdomain to a script
(NS entry + glue record) which gives predictable responses to arbitrary
queries, and make/check those.
Implemented both sides of that testing process in dns-test-daemon script,
which can be run with some hash-key for BLAKE2S HMAC:
% ./dns-test-daemon -k hash-key -b 127.0.0.1:5533 &
% dig -p5533 @127.0.0.1 aaaa test.com
...
test.com. 300 IN AAAA eb5:7823:f2d2:2ed2:ba27:dd79:a33e:f762
...
And then query it like above, getting back the first bytes of the keyed name-hash,
after inet_ntop conversion, as the response address.
The good thing about it is that the name can be something random like
"o3rrgbs4vxrs.test.mydomain.com", to force the DNS resolver to actually do its job
and not just keep returning the same "google.com" from its cache or something.
And do it correctly too, as otherwise the resulting hash won't match the expected value.
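The hashing scheme is roughly this kind of thing - a python illustration of the idea,
not necessarily exactly what the script does:

import hashlib, socket

def name_response(name, key=b'hash-key'):
    # Keyed BLAKE2s over the queried name, truncated to 16 bytes
    # and formatted as an IPv6 address for the AAAA response
    digest = hashlib.blake2s(name.lower().encode(), key=key).digest()
    return socket.inet_ntop(socket.AF_INET6, digest[:16])

print(name_response('o3rrgbs4vxrs.test.mydomain.com'))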
So same script has client mode to use same key and do the checking,
as well as randomizing queried names:
% dns-test-daemon -k hash-key --debug @.test.mydomain.com
(optionally in a loop too, with interval/retries/timeout opts, and checking
general network availability via fping to avoid any false alarms due to that)
Ended up running this tester/hack to restart unbound occasionally when it
craps itself, which restored reliable DNS operation for me - an essential thing
for any outbound network access, so a pretty big deal.