Ruby Buzz Forum - A better backup system based on Git

Eigen Class

Posts: 358
Nickname: eigenclass
Registered: Oct, 2005

Eigenclass is a hardcore Ruby blog.

Posted: Mar 5, 2008 8:44 AM

This post originated from an RSS feed registered with Ruby Buzz by Eigen Class.
Original Post: A better backup system based on Git
Feed Title: Eigenclass
Feed URL: http://feeds.feedburner.com/eigenclass
Feed Description: Ruby stuff --- trying to stay away from triviality.

A fast, powerful backup system built upon Git and efficient, compact tools written in OCaml (faster than the C counterpart with 1/5th of the code :)

Recent events have pushed me to get serious about backing up my data. I'm naturally inclined to use simple solutions over specialized backup systems, preferring something like rsync to a special-purpose tool. As far as "standard" tools go, however, git provides a very nice infrastructure that can be used to build your own system, to wit:

it is more space-efficient than most incremental backup schemes, since it does file compression and both textual *and* binary deltas (in particular, it's better than solutions relying on hardlinks or incremental backups à la tar/cpio)
its transport mechanism is more efficient than rsync's
it is fast: recovering your data is *faster* than cp -a
you keep the full revision history
powerful toolset with a rich vocabulary

I'm of course not the first one to think of git as the basis for a backup system. You can find many blog articles about this, but few people have gone beyond saying "stuff your data in a git repo" and tried to fill the holes and automate things properly. The most serious projects I've found are etckeeper and git-home-history. Neither of them gets it entirely right for my purposes, however (more on this below).

The tool I've written retains all the advantages from Git, and supplements it in some key areas:

metadata support
management of submodules (nested Git repositories)
automation of common operations; for instance, a commit consists of several steps:
- determining if some files which were committed earlier are ignored now and removing them from the index
- adding new and modified files to the index
- registering new git submodules and copying them to a special area under .git
- committing changes in the index
- compaction and optimization of the repository

Only one command is needed in practice:

 gibak commit
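
As a rough illustration of the steps listed above, here is a minimal Python sketch that only builds the sequence of plain git commands such a commit might run. It is not gibak's actual implementation: the function name `commit_plan` and its parameters are hypothetical, and the submodule step is left out because the post does not say how it is done.

```python
from typing import List

# Real git command that lists files which are tracked but now match an
# ignore rule (NUL-separated output).
LIST_TRACKED_IGNORED = [
    "git", "ls-files", "--cached", "--ignored", "--exclude-standard", "-z",
]


def commit_plan(tracked_but_ignored: List[str], message: str) -> List[List[str]]:
    """Return the git commands for one snapshot, in order (hypothetical helper).

    tracked_but_ignored is assumed to be the parsed output of
    LIST_TRACKED_IGNORED. Nothing is executed here.
    """
    plan: List[List[str]] = []
    if tracked_but_ignored:
        # Drop files that were committed earlier but are ignored now.
        plan.append(["git", "rm", "--cached", "--quiet", "--"] + list(tracked_but_ignored))
    # Add new and modified files to the index.
    plan.append(["git", "add", "--all", "."])
    # (Submodule registration would go here; omitted in this sketch.)
    # Commit the index, then compact and optimize the repository.
    plan.append(["git", "commit", "--quiet", "-m", message])
    plan.append(["git", "gc", "--quiet"])
    return plan
```

Each entry in the returned list could be passed to `subprocess.run(cmd, check=True)` from inside the repository.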

I'm using it to save 2GB in over 200000 files, and it normally takes under 20 seconds to take a snapshot (AFAICS it scales linearly, so I expect to be able to back up 10GB in under 2 minutes...). The full power of the git toolset is available, so earlier versions can be restored with "git checkout", remote copies can be created/synchronized with git clone/push/fetch/pull, you can see what has changed with "git diff", and so on.
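
The 10GB estimate is a straight linear extrapolation from the 2GB figure. A tiny sketch of that arithmetic, assuming the linear scaling really holds (the function name and defaults are only illustrative):

```python
def estimate_seconds(size_gb: float, ref_gb: float = 2.0, ref_seconds: float = 20.0) -> float:
    """Extrapolate snapshot time linearly from a reference measurement."""
    return ref_seconds * size_gb / ref_gb


# 10GB at the 2GB-in-20s rate: 100 seconds, i.e. under 2 minutes.
print(estimate_seconds(10.0))
```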

You can get the code with

 git clone http://eigenclass.org/repos/git/gibak/.git/

The repository can be browsed at

 http://eigenclass.org/repos/gitweb

The metadata issue

The major thing missing in Git when used as a backup tool is support for file metadata (mostly file permissions) and empty directories (git just ignores them). git-home-history doesn't handle them at all, and etckeeper relies on metastore to preserve a snapshot of the metadata (owner, group, permissions, mtime, etc.) in a .metastore file located at the top of the git repository (/etc in the case of etckeeper).
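
To make the idea of a metadata snapshot concrete, here is a minimal Python sketch that records owner, group, permissions and mtime for every entry under a directory and flags empty directories. It does not reproduce the .metastore file format; the function name `snapshot` and the dictionary layout are assumptions for illustration.

```python
import os
import stat
from typing import Dict


def snapshot(root: str) -> Dict[str, dict]:
    """Collect the metadata git does not track, keyed by path relative to root."""
    entries: Dict[str, dict] = {}
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip git's own bookkeeping directory.
        if ".git" in dirnames:
            dirnames.remove(".git")
        for name in dirnames + filenames:
            full = os.path.join(dirpath, name)
            st = os.lstat(full)  # lstat: describe symlinks, do not follow them
            is_dir = stat.S_ISDIR(st.st_mode)
            entries[os.path.relpath(full, root)] = {
                "mode": stat.S_IMODE(st.st_mode),  # permission bits only
                "uid": st.st_uid,
                "gid": st.st_gid,
                "mtime": int(st.st_mtime),
                # git drops empty directories, so they must be recorded here.
                "empty_dir": is_dir and not os.listdir(full),
            }
    return entries
```

Restoring would walk the same dictionary and apply `os.chmod`, `os.chown` and `os.utime`, recreating any directory flagged as empty.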

The problem with metastore is that it doesn't know about Git's file exclusion mechanisms (.gitignore), so it ends up storing the metadata of files that aren't actually to be saved. Even though metastore is quite small and simple (only some 1500 lines of C code), extending it to honor .gitignore files seemed fairly involved because the semantics are a bit tricky (subdirectories inherit patterns from their parents and there's more than one kind of pattern) and hard to express without higher-order functions and closures.
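
The following Python sketch shows why closures fit this problem: each pattern becomes a small function that remembers the directory it was declared in, and a subdirectory simply appends its own rules to the list inherited from its parent. It covers only a simplified subset of .gitignore (basename patterns, patterns containing a slash, and "!" negation); it is not ometastore's code, and `make_rule` and `is_ignored` are hypothetical names. Unlike git, `fnmatch` lets `*` match a slash, and directory-only patterns such as `dir/` are not handled.

```python
from fnmatch import fnmatchcase
from typing import Callable, List, Optional

# A rule returns True (ignore), False (re-include) or None (no opinion).
Rule = Callable[[str], Optional[bool]]


def make_rule(base: str, pattern: str) -> Rule:
    """Compile one pattern from the .gitignore found in directory `base`."""
    negated = pattern.startswith("!")
    if negated:
        pattern = pattern[1:]
    anchored = "/" in pattern      # patterns with a slash match relative to base
    pattern = pattern.lstrip("/")
    prefix = base + "/" if base else ""

    def rule(path: str) -> Optional[bool]:
        if not path.startswith(prefix):
            return None            # outside the directory that declared the rule
        rel = path[len(prefix):]
        target = rel if anchored else rel.rsplit("/", 1)[-1]
        if fnmatchcase(target, pattern):
            return not negated
        return None

    return rule


def is_ignored(path: str, rules: List[Rule]) -> bool:
    """Apply rules in order (parents first); the last rule with an opinion wins."""
    verdict = False
    for rule in rules:
        result = rule(path)
        if result is not None:
            verdict = result
    return verdict
```

For example, with `rules = [make_rule("", "*.o"), make_rule("src", "!keep.o")]`, the path `src/a.o` is ignored while `src/keep.o` is kept.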

I decided to reimplement metastore in OCaml (I very unimaginatively named it ometastore) and I'm glad I did so. I implemented metastore's functionality in one fifth of the code (1500 lines of C vs. under 300 of OCaml); the resulting executable is 10% faster without any optimization effort and needs less memory. Even with functionality like path prefix compression (which shrinks the snapshot by 50%) and Git-like semantics for ignored files, ometastore needed only a quarter of metastore's code (I've since optimized glob matching, making my directory traversal routine faster than git-ls-files', so it's up to 1/3rd of the size of metastore now).
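
Path prefix compression can be sketched as front coding: sort the paths and store, for each one, only the number of leading characters it shares with the previous path plus the remaining suffix. This is an illustration of the general technique in Python; the post does not describe ometastore's actual on-disk format, and the function names are hypothetical.

```python
from typing import Iterable, List, Tuple


def compress(paths: Iterable[str]) -> List[Tuple[int, str]]:
    """Front-code sorted paths as (shared_prefix_length, suffix) pairs."""
    out: List[Tuple[int, str]] = []
    prev = ""
    for path in sorted(paths):
        shared = 0
        limit = min(len(path), len(prev))
        while shared < limit and path[shared] == prev[shared]:
            shared += 1
        out.append((shared, path[shared:]))
        prev = path
    return out


def decompress(entries: Iterable[Tuple[int, str]]) -> List[str]:
    """Rebuild the sorted path list from (shared_prefix_length, suffix) pairs."""
    paths: List[str] = []
    prev = ""
    for shared, suffix in entries:
        prev = prev[:shared] + suffix
        paths.append(prev)
    return paths
```

Files in the same directory share long prefixes, which is why a scheme like this pays off on a snapshot containing many paths.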

The backup system

Read more...
