This post originated from an RSS feed registered with Ruby Buzz
by Eigen Class.
Original Post: A better backup system based on Git
A fast, powerful backup system built upon Git and efficient, compact tools written in OCaml (faster than the C counterpart with 1/5th of the code :)
Recent events have pushed me to get serious about backing up my data.
I'm naturally inclined to use simple solutions over specialized backup
systems, preferring something like rsync to a special-purpose tool.
As far as "standard" tools go, however, git provides a very nice
infrastructure that can be used to build your own system, to wit:
- it is more space-efficient than most incremental backup schemes, since it does file compression and both textual *and* binary deltas (in particular, it's better than solutions relying on hardlinks or incremental backups à la tar/cpio)
- its transport mechanism is more efficient than rsync's
- it is fast: recovering your data is *faster* than cp -a
- you keep the full revision history
- it comes with a powerful toolset with a rich vocabulary
I'm of course not the first one to think of git as the basis for a backup
system. You can find many blog articles about this, but few people have gone
beyond saying "stuff your data in a git repo" and tried to fill the holes and
automate things properly. The most serious projects I've found are
etckeeper and git-home-history.
Neither of them gets it entirely right for my purposes, however (more on this below).
The tool I've written retains all the advantages from Git, and supplements it
in some key areas:
- metadata support
- management of submodules (nested Git repositories)
- automation of common operations; for instance, a commit consists of several steps:
  - determining if some files which were committed earlier are ignored now and removing them from the index
  - adding new and modified files to the index
  - registering new git submodules and copying them to a special area under .git
  - committing changes in the index
  - compaction and optimization of the repository
Only one command is needed in practice:
gibak commit
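As a rough illustration of what such a commit automates, here is a minimal Python sketch that drives plain git commands. It is not gibak's actual implementation: the function names (find_ignored_tracked, plan_commit, snapshot) are made up for this example, and the submodule step is left out because the post does not describe how the special area under .git is laid out.

```python
import subprocess

def find_ignored_tracked(repo):
    """List files that are in the index but match the current ignore rules."""
    out = subprocess.run(
        ["git", "ls-files", "--cached", "--ignored", "--exclude-standard", "-z"],
        cwd=repo, check=True, capture_output=True, text=True,
    ).stdout
    return [p for p in out.split("\0") if p]

def plan_commit(ignored_tracked, message):
    """Return the git commands for one snapshot, in the order they run."""
    steps = []
    if ignored_tracked:
        # Drop now-ignored files from the index only; the working copy is kept.
        steps.append(["git", "rm", "--cached", "--quiet", "--"] + list(ignored_tracked))
    steps.append(["git", "add", "--all", "."])           # new, modified, deleted
    steps.append(["git", "commit", "--quiet", "-m", message])
    steps.append(["git", "gc", "--quiet"])                # compaction
    return steps

def snapshot(repo, message="backup snapshot"):
    """Run the plan, skipping the commit when nothing is staged."""
    for cmd in plan_commit(find_ignored_tracked(repo), message):
        if cmd[1] == "commit":
            # `git diff --cached --quiet` exits with 0 when the index is clean.
            staged = subprocess.run(["git", "diff", "--cached", "--quiet"], cwd=repo)
            if staged.returncode == 0:
                continue
        subprocess.run(cmd, cwd=repo, check=True)
```

Calling snapshot("/path/to/repo") would then take one snapshot; keeping the plan separate from its execution makes it easy to print the commands for a dry run.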
I'm using it to save 2GB in over 200000 files, and it normally takes under 20
seconds to take a snapshot (AFAICS it scales linearly, so I expect to be able
to back up 10GB in under 2 minutes...). The full power of the git toolset is
available, so earlier versions can be restored with "git checkout", remote
copies can be created/synchronized with git clone/push/fetch/pull, you can see
what has changed with "git diff", and so on.
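The 10GB estimate is a plain linear extrapolation from the 2GB / 20 second figure. Spelled out in Python (the function name is made up for this example, and the linear scaling is the assumption stated above, not something measured at 10GB):

```python
def estimate_seconds(size_gb, ref_size_gb=2.0, ref_seconds=20.0):
    """Extrapolate snapshot time linearly from one measured data point."""
    return ref_seconds * size_gb / ref_size_gb

print(estimate_seconds(10))  # 100.0, i.e. under 2 minutes
```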
The major thing missing in Git when used as a backup tool is support for
file metadata (mostly file permissions) and empty directories (git just
ignores them). git-home-history doesn't handle them at all, and etckeeper
relies on metastore to preserve a
snapshot of the metadata (owner, group, permissions, mtime, etc.) in a
.metastore file located at the top of the git repository (/etc in the case of
etckeeper).
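To make the idea of a metadata snapshot concrete, here is a small Python sketch that records the fields mentioned above for every path under a directory and flags empty directories. It only illustrates the concept: it is neither metastore's code nor its file format, and the function name and dictionary keys are invented for the example.

```python
import os
import stat

def snapshot_metadata(root):
    """Map each path under root (relative to it) to the metadata git drops."""
    entries = {}
    for dirpath, dirnames, filenames in os.walk(root):
        if ".git" in dirnames:
            dirnames.remove(".git")          # never descend into the repository
        for name in dirnames + filenames:
            full = os.path.join(dirpath, name)
            st = os.lstat(full)              # lstat: do not follow symlinks
            is_dir = stat.S_ISDIR(st.st_mode)
            entries[os.path.relpath(full, root)] = {
                "uid": st.st_uid,
                "gid": st.st_gid,
                "mode": stat.S_IMODE(st.st_mode),   # permission bits only
                "mtime": int(st.st_mtime),
                # git does not track empty directories, so remember them here
                "empty_dir": is_dir and not os.listdir(full),
            }
    return entries
```

Restoring would walk the same mapping and apply os.chown, os.chmod and os.utime, recreating the directories flagged as empty first.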
The problem with metastore is that it doesn't know about Git's file exclusion
mechanisms (.gitignore), so it ends up storing the metadata of files that
aren't actually to be saved. Even though metastore is quite small and simple
(only some 1500 lines of C code), extending it to honor .gitignore files
seemed fairly involved because the semantics are a bit tricky (subdirectories
inherit patterns from their parents and there's more than one kind of
pattern) and harder to express without higher-order functions and closures.
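The inheritance of patterns is where closures come in handy: each directory's matcher can wrap its parent's. The Python sketch below illustrates that idea on a deliberately reduced subset of the .gitignore rules (basename globs, patterns anchored with a leading "/", and "!" negation). It is not ometastore's code, all names are invented for the example, and real Git semantics have more cases (for instance, fnmatch lets "*" match "/" while Git does not).

```python
import fnmatch
import os

def parse_patterns(lines):
    """Turn .gitignore lines into (negated, anchored, glob) triples."""
    rules = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        negated = line.startswith("!")
        if negated:
            line = line[1:]
        anchored = line.startswith("/")
        rules.append((negated, anchored, line.strip("/")))
    return rules

def extend(parent, base, rules):
    """Build a directory's matcher as a closure over its parent's matcher.

    A matcher takes a root-relative path and returns True (ignored),
    False (explicitly re-included) or None (no rule applies).
    """
    def matcher(relpath):
        local = relpath[len(base) + 1:] if base else relpath
        verdict = None
        for negated, anchored, glob in rules:
            target = local if anchored else os.path.basename(relpath)
            if fnmatch.fnmatchcase(target, glob):
                verdict = not negated        # the last matching rule wins
        # Local rules take precedence; otherwise defer to the inherited ones.
        return verdict if verdict is not None else parent(relpath)
    return matcher

def tracked_files(root):
    """Yield root-relative paths of files that no inherited rule ignores."""
    def visit(base, parent):
        directory = os.path.join(root, base) if base else root
        ignore_file = os.path.join(directory, ".gitignore")
        rules = []
        if os.path.isfile(ignore_file):
            with open(ignore_file) as fh:
                rules = parse_patterns(fh)
        matcher = extend(parent, base, rules)
        for name in sorted(os.listdir(directory)):
            rel = f"{base}/{name}" if base else name
            if name == ".git" or matcher(rel):
                continue                     # ignored directories are pruned
            full = os.path.join(directory, name)
            if os.path.isdir(full) and not os.path.islink(full):
                yield from visit(rel, matcher)
            else:
                yield rel
    yield from visit("", lambda relpath: None)
```

A metadata snapshot restricted to list(tracked_files(root)) would then skip exactly the files that this reduced rule set excludes.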
I decided to reimplement metastore in OCaml (I named it very unimaginatively
ometastore) and I'm glad I did so. I implemented metastore's functionality in
one fifth of the code (1500 lines of C vs. under 300 of OCaml); the resulting
executable is 10% faster without any optimization effort and needs less
memory. Even with functionality like path prefix compression (which shrinks
the snapshot by 50%) and Git-like semantics for ignored files, ometastore
needed only a quarter of metastore's code (I've since optimized glob matching,
making my directory traversal routine faster than git-ls-files', so it has
grown to about 1/3rd of the size of metastore now).
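Path prefix compression pays off because sorted paths in a snapshot share long leading directories. The post does not describe ometastore's on-disk format, so the following Python sketch shows one common way to do it (front coding: each entry stores how many leading characters it shares with the previous path, plus the remaining suffix); the function names are invented for the example.

```python
import os

def compress_paths(paths):
    """Front-code sorted paths as (shared_prefix_length, suffix) pairs."""
    encoded = []
    previous = ""
    for path in sorted(paths):
        # commonprefix compares character by character, not by path component
        shared = len(os.path.commonprefix([previous, path]))
        encoded.append((shared, path[shared:]))
        previous = path
    return encoded

def decompress_paths(encoded):
    """Rebuild the sorted path list from the front-coded pairs."""
    paths = []
    previous = ""
    for shared, suffix in encoded:
        previous = previous[:shared] + suffix
        paths.append(previous)
    return paths
```

For example, compress_paths(["etc/apt/sources.list", "etc/apt/trusted.gpg", "etc/fstab"]) yields [(0, "etc/apt/sources.list"), (8, "trusted.gpg"), (4, "fstab")], and decompress_paths restores the original sorted list.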