Aug 05, 2018

rsync backups over reverse ssh tunnels

Have last reviewed this backup setup about 10 years ago (p2, p3), and surprisingly not much has changed since then - still the same ssh and rsync for secure sync to a remote host over the network.

Stable-enough btrfs makes --link-dest somewhat anachronistic, and rsync filters have come a long way since then, but everything else is the same, so why not use this opportunity to make things simpler and smoother...

In particular, it was always annoying to me that backups either had to be pulled from some open and accessible port, or pushed to such a port on the backup server, which isn't hard to fix with "ssh -R" tunnels - these allow the backup server to have at most a locked-down and reasonably secure ssh port open, yet provide no remote push access (i.e. no overwriting whatever the remote wants to, like a simple rsyncd setup would allow), and do everything through just a single outgoing ssh connection.

That is, run "rsync --daemon" on localhost, make a reverse tunnel to it when connecting to the backup host, and let that host pull from it.
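
For illustration, here's roughly what that boils down to on the backed-up machine - not the actual scripts' code, just a rough py3 sketch with a made-up module name, port and receiver command:

import subprocess, tempfile

# Read-only rsyncd module, bound to localhost only - nothing exposed to the network.
# No auth here for brevity - actual setup should use one-time auth for connection lifetime.
rsyncd_conf = tempfile.NamedTemporaryFile('w', suffix='.conf')
rsyncd_conf.write('[bak]\n  path = /\n  read only = yes\n  use chroot = no\n')
rsyncd_conf.flush()

rsyncd = subprocess.Popen([ 'rsync', '--daemon', '--no-detach',
    '--address=127.0.0.1', '--port=8873',
    '--config={0}'.format(rsyncd_conf.name) ])
try:
    # Single outgoing ssh connection, forwarding the local rsyncd port to backup-host,
    # which can then pull from rsync://127.0.0.1:8873/bak/ on its end.
    # "receive-backup myhost" is a stand-in for however the receiving script gets started there.
    subprocess.run([ 'ssh', '-R', '127.0.0.1:8873:127.0.0.1:8873',
        'backup@backup-host', 'receive-backup', 'myhost' ], check=True)
finally: rsyncd.terminate()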

On top of being a no-brainer to implement and use - as simple as ssh from behind however many NATs - it avoids the following (partly mentioned above) problematic things:

  • Pushing stuff to backup-host, which can be exploited to delete stuff.
  • Using insecure network channels and/or rsync auth - ssh only.
  • Having any kind of insecure auth or port open on backup-host (e.g. rsyncd) - ssh only.
  • Requiring backed-up machine to be accessible on the net for backup-pulls - can be behind any amount of NAT layers, and only needs one outgoing ssh connection.
  • Specifying/handling backup parameters (beyond --filter lists), rotation and cleanup on the backed-up machine - backup-host will handle all that in a known-good and uniform manner.
  • Running rsyncd or such with unrestricted fs access "for backups" - here it only runs on a localhost port, with one-time auth for the ssh connection lifetime, restricted to a specified read-only path, with local filter rules on top.
  • Needing anything beyond basic ssh/rsync/python on either side.

The actual implementation I've ended up with is the ssh-r-sync + ssh-r-sync-recv scripts in the fgtk repo, both being rather simple py3 wrappers for ssh/rsync stuff.

Both can be used by regular uids, and can use an rsync binary with capabilities or a sudo wrapper to get a 1-to-1 backup with all permissions instead of --fake-super (though note that giving root-rsync access to a uid is pretty much the same as "NOPASSWD: ALL" sudo).
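
Such a sudo wrapper can be as trivial as the hypothetical example below (made-up path, not something these scripts ship), plus a matching sudoers line along the lines of "backup ALL=(root) NOPASSWD: /usr/bin/rsync" for the backup uid:

#!/usr/bin/env python3
# Hypothetical rsync-wrapper to point backup scripts at instead of plain "rsync",
# to get full fs-metadata access without --fake-super (see NOPASSWD warning above).
import os, sys
os.execvp('sudo', ['sudo', '--', '/usr/bin/rsync'] + sys.argv[1:])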

One relatively recent realization (coming from acme-cert-tool), compared to scripts I wrote earlier, is that using a bunch of script hooks all over the place is way easier than hardcoding a dozen ad-hoc options.

I.e. have an option group like this (-h/--help output from argparse):

Hook options:
  -x hook:path, --hook hook:path
    Hook-script to run at the specified point.
    Specified path must be executable (chmod +x ...),
      will be run synchronously, and must exit with 0
      for tool to continue operation, non-0 to abort.
    Hooks are run with same uid/gid
      and env as the main script, can use PATH-lookup.
    See --hook-list output to get full list of
      all supported hook-points and arguments passed to them.
    Example spec: -x rsync.pre:~/hook.pre-sync.sh
  --hook-timeout seconds
    Timeout for waiting for hook-script to finish running,
      before aborting the operation (treated as hook error).
    Zero or negative value will disable timeout. Default: no-limit
  --hook-list
    Print the list of all supported
      hooks with descriptions/parameters and exit.

And --hook-list provides full info on all of these, like:

Available hook points:

  script.start:
    Before starting handshake with authenticated remote.
    args: backup root dir.

  ...

  rsync.pre:
    Right before backup-rsync is started, if it will be run.
    args: backup root dir, backup dir, remote name, privileged sync (0 or 1).
    stdout: any additional \0-separated args to pass to rsync.
      These must be terminated by \0, if passed,
        and can start with \0 to avoid passing any default options.

  rsync.done:
    Right after backup-rsync is finished, e.g. to check/process its output.
    args: backup root dir, backup dir, remote name, rsync exit code.
    stdin: interleaved stdout/stderr from rsync.
    stdout: optional replacement for rsync return code, int if non-empty.

Hooks are run synchronously,
  waiting for subprocess to exit before continuing.
All hooks must exit with status 0 to continue operation.
Some hooks get passed arguments, as mentioned in hook descriptions.
Setting --hook-timeout (defaults to no limit)
  can be used to abort when hook-scripts hang.

Very trivial to implement, and then allows hooking in much simpler single-purpose bash scripts handling specific stuff like passing extra options on a per-host basis, backup rotation/cleanup and --link-dest, creating "backup-done-successfully" mark and manifest files, or whatever else, without needing to add all these corner-cases to the main script.
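
For instance, a whole --link-dest hook can be as small as this hypothetical rsync.pre example (python3 here instead of bash, since any executable works, and assuming backups land into "<remote-name>.<timestamp>" dirs under the backup root - an assumption for illustration only):

#!/usr/bin/env python3
# Hypothetical rsync.pre hook: add --link-dest for the latest earlier backup of this remote.
# Args, as per --hook-list above: backup root dir, backup dir, remote name, priv-sync flag.
import os, sys

bak_root, bak_dir, remote = sys.argv[1:4]
prev = sorted(
    p for p in os.listdir(bak_root)
    if p.startswith(remote + '.') and p != os.path.basename(bak_dir) )
if prev: # any extra rsync args must be \0-separated and \0-terminated, as noted above
    sys.stdout.write('--link-dest={0}\0'.format(os.path.join(bak_root, prev[-1])))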

One boilerplate thing that looks useful to hardcode though is a "nice ionice ..." wrapper, which is pretty much inevitable for background backup scripts (though cgroup limits can also be a good idea), and fairly easy to do in python, with a minor caveat of a hardcoded ioprio_set syscall number - but these pretty much never change on linux.
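
E.g. something like this minimal sketch, with 251 being the ioprio_set syscall number on x86_64 specifically (it differs between architectures, just never changes within one):

import ctypes, os

def ionice_idle(pid=0):
    # ioprio_set(IOPRIO_WHO_PROCESS, pid, IOPRIO_PRIO_VALUE(IOPRIO_CLASS_IDLE, 0))
    IOPRIO_WHO_PROCESS, IOPRIO_CLASS_IDLE, IOPRIO_CLASS_SHIFT = 1, 3, 13
    libc = ctypes.CDLL('libc.so.6', use_errno=True)
    if libc.syscall(251, IOPRIO_WHO_PROCESS, pid, IOPRIO_CLASS_IDLE << IOPRIO_CLASS_SHIFT):
        raise OSError(ctypes.get_errno(), 'ioprio_set failed')

os.nice(20) # cpu scheduling niceness
ionice_idle() # io: only touch the disk when no one else needs it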

As a side-note, can recommend btrbk as a very nice tool for managing backups stored on btrfs, even if just for rotating/removing snapshots in an easy and sane "keep A daily ones, B weekly, C monthly, ..." manner.

[code link: ssh-r-sync + ssh-r-sync-recv scripts]

Feb 14, 2010

My "simple" (ok, not quite) backup system - implementation (backup host)

According to the general plan, with backed-up side scripts in place, some backup-grab mechanism is needed on the backup host.

So far, sshd provides a secure channel and authentication, launching the control script as a shell; the backed-up side script provides hostname:port for a one-shot ssh link on the command line, with the private key for this link and the backup-exclusion paths list piped in.

All that's left to do on this side is to read the data from the pipe and start rsync over this link, with a few preceding checks, like a free-space check, so the backup process won't be strangled by its absence and as many backups as possible will be preserved for as long as possible, removing them right before receiving new ones.

Historically, this script also works with any specified host, interactively logging into it as root for the rsync operation, so there's a bit of interactive voodoo involved, which isn't relevant for the remotely-initiated backup case.

Ssh parameters for the rsync transport are passed to rsync itself (since it's the one starting the ssh process) via the "--rsh" option. Inside the script, these are accumulated in the bak_src_ext variable.

Note that in the case when this script is started as a shell, the user is not root, yet it needs to store filesystem metadata like uids, gids, acls, etc.
To that end, rsync can employ user_xattrs, although that looks extremely unportable and improper to me, since nothing but rsync will translate them back into the original metadata, so rsync needs to be able to change fs metadata directly, and to that end there are posix capabilities.

I use a custom module for capability manipulation, as well as other convenience modules here and there; their purpose is quite obvious, and replacing them with stdlib functions should be pretty straightforward, if necessary.

Activating the inherited capabilities:

bak_user = os.getuid()
if bak_user: # non-root uid, so rely on (inherited) posix capabilities for fs metadata
    from fgc.caps import Caps # the custom module mentioned above
    import pwd
    os.putenv('HOME', pwd.getpwuid(bak_user).pw_dir)
    Caps.from_process().activate().apply()

But first things first - there's data waiting on the command line and stdin. Getting the hostname and port...

bak_src = argz[0] # "host" or "host:port" spec from the command line
try: bak_src, bak_src_ext = bak_src.split(':')
except ValueError: bak_src_ext = tuple()
else: bak_src_ext = '-p', bak_src_ext # ssh port option

...and the key / exclusions:

bak_key = bak_sub('.key_{0}'.format(bak_host))
# ssh private key and exclusion-list, separated by a triple newline on stdin
password, reply = it.imap(
    op.methodcaller('strip', spaces), sys.stdin.read().split('\n\n\n', 1) )
open(bak_key, 'w').write(password)
sh.chmod(bak_key, 0400)
bak_src_ext += '-i', os.path.realpath(bak_key)

Then, basic rsync invocation options can be constructed:

sync_optz = [ '-HaAXz',
    ('--skip-compress='
        r'gz/bz2/t\[gb\]z/tbz2/lzma/7z/zip/rar'
        r'/rpm/deb/iso'
        r'/jpg/gif/png/mov/avi/ogg/mp\[34g\]/flv/pdf'),
    '--super',
    '--exclude-from={0}'.format(bak_exclude_server),
    '--rsync-path=ionice -c3 rsync',
    '--rsh=ssh {0}'.format(' '.join(bak_src_ext)) ]

The excluded-paths list here is written to a local file, to keep track of which paths were excluded in each backup. The "--super" option is actually necessary if the local user is not root - rsync drops all the metadata otherwise. "HaAX" is like a "preserve all" set of flags - Hardlinks, ownership/modes/times ("a" flag), Acl's, eXtended attrs. "--rsh" here is the ssh command, with the parameters determined above.

Aside from that, there's also a need to specify the hardlink destination path, which should be a previous backup, and that traditionally is the domain of ugly perlisms - regexps.

bakz_re = re.compile(r'^([^.].*)\.\d+-\d+-\d+.\d+$') # host.YYYY-mm-dd.unix_time
bakz = list( bak for bak in os.listdir(bak_root)
    if bakz_re.match(bak) ) # all backups
bakz_host = sorted(it.ifilter(op.methodcaller(
    'startswith', bak_host ), bakz), reverse=True)

So, the final sync options come to these:

src = '{0}:/'.format(bak_src)
sync_optz = list(dta.chain( sync_optz, '--link-dest={0}'\
        .format(os.path.realpath(bakz_host[0])), src, bak_path ))\
    if bakz_host else list(dta.chain(sync_optz, src, bak_path))

The only interlude is to clean up the backup partition if it gets too crowded:

## Free disk space check / cleanup
ds, df = sh.df(bak_root)
min_free = ( max(min_free_avg( (ds-df) / len(bakz)), min_free_abs*G)
    if min_free_avg and bakz else min_free_abs*G )

def bakz_rmq():
    '''Iterator that returns bakz in order of removal'''
    bakz_queue = list( list(bakz) for host,bakz in it.groupby(sorted(bakz),
        key=lambda bak: bakz_re.match(bak).group(1)) )
    while bakz_queue:
        bakz_queue.sort(key=len)
        bakz_queue[-1].sort(reverse=True)
        if len(bakz_queue[-1]) <= min_keep: break
        yield bakz_queue[-1].pop()

if df < min_free:
    for bak in bakz_rmq():
        log.info('Removing backup: {0}'.format(bak))
        sh.rr(bak, onerror=False)
        ds, df = sh.df(bak_root)
        if df >= min_free: break
    else:
        log.fatal( 'Not enough space on media:'
                ' {0:.1f}G, need {1:.1f}G ({2} backups min)'\
            .format( op.truediv(df, G),
                op.truediv(min_free, G), min_keep ), crash=2 )

And from here it's just a matter of starting rsync and waiting till the job's done.
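
That last bit isn't quoted above, but boils down to something like this (a plain-subprocess sketch - the actual script uses its own process-handling helpers):

from subprocess import Popen

# Run rsync with all the options assembled above and wait for it to finish
rsync = Popen(['rsync'] + sync_optz)
if rsync.wait(): # same custom log helper as in the cleanup block above
    log.fatal('rsync finished with error code: {0}'.format(rsync.returncode), crash=1)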

This thing has been working for months now, and has saved my day on many occasions, but the most important thing here, I think, is the knowledge that the backup is there should you need it, so you never have to worry about breaking your system or losing anything important on it, whatever you do.

Here's the full script.

Actually, there's more to the story, since just keeping backups on a single local harddisk (raid1 of two disks, actually) isn't enough for me.
Call this paranoia, but setting up a system from scratch and restoring all the data I have is a horrible nightmare, and there's the possibility of fire, robbery, lightning, a voltage surge or some other disaster that can easily take these disks out of the picture, while a few gigabytes of space on the web come almost for free these days - there are p2p storages like wuala, dropbox, google apps/mail with their unlimited quotas...
So, why not upload all this stuff there and be absolutely sure it'll never go down, whatever happens? Sure thing.
Guess I'll write a note on the topic, as much to document it for myself as for the case someone might find it useful as well, plus the ability to link it instead of explaining ;)