The Artima Developer Community

Ruby Buzz Forum
A better backup system based on Git

Eigen Class

Posts: 358
Nickname: eigenclass
Registered: Oct, 2005

Eigenclass is a hardcore Ruby blog.
A better backup system based on Git Posted: Mar 5, 2008 8:44 AM

This post originated from an RSS feed registered with Ruby Buzz by Eigen Class.
Original Post: A better backup system based on Git
Feed Title: Eigenclass
Feed URL: http://feeds.feedburner.com/eigenclass
Feed Description: Ruby stuff --- trying to stay away from triviality.

A fast, powerful backup system built upon Git and efficient, compact tools written in OCaml (faster than the C counterpart with 1/5th of the code :)

Recent events have pushed me to get serious about backing up my data. I'm naturally inclined to use simple solutions over specialized backup systems, preferring something like rsync to a special-purpose tool. As far as "standard" tools go, however, git provides a very nice infrastructure that can be used to build your own system, to wit:

  • it is more space-efficient than most incremental backup schemes, since it does file compression and both textual *and* binary deltas (in particular, it's better than solutions relying on hardlinks or incremental backups à la tar/cpio)
  • its transport mechanism is more efficient than rsync's
  • it is fast: recovering your data is *faster* than cp -a
  • you keep the full revision history
  • it comes with a powerful toolset with a rich vocabulary

I'm of course not the first one to think of git as the basis for a backup system. You can find many blog articles about this, but few people have gone beyond saying "stuff your data in a git repo" and tried to fill the holes and automate things properly. The most serious projects I've found are etckeeper and git-home-history. Neither of them gets it entirely right for my purposes, however (more on this below).

The tool I've written retains all the advantages from Git, and supplements it in some key areas:

  • metadata support
  • management of submodules (nested Git repositories)
  • automation of common operations; for instance, a commit consists of several steps:
    • determining if some files which were committed earlier are ignored now and removing them from the index
    • adding new and modified files to the index
    • registering new git submodules and copying them to a special area under .git
    • committing changes in the index
    • compaction and optimization of the repository
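
As a rough illustration of the steps listed above (not gibak's actual code), here is a minimal Python sketch that builds the corresponding sequence of plain git commands without running them. The function name commit_plan and its parameters are hypothetical. The list of newly ignored files is assumed to come from something like "git ls-files --cached --ignored --exclude-standard", and the submodule step is left out because the post does not describe how gibak performs it.

```python
from typing import List


def commit_plan(newly_ignored: List[str], message: str) -> List[List[str]]:
    """Return the git commands for one snapshot, in order, without running them.

    Hypothetical helper for illustration only; this is not gibak's code.
    """
    plan: List[List[str]] = []
    # 1. Drop files that were committed earlier but are ignored now.
    if newly_ignored:
        plan.append(["git", "rm", "--cached", "--quiet", "--"] + newly_ignored)
    # 2. Stage new and modified files (this also records deletions).
    plan.append(["git", "add", "--all"])
    # (Registering submodules is specific to gibak and omitted here.)
    # 3. Commit the changes in the index.
    plan.append(["git", "commit", "-m", message])
    # 4. Compact and optimize the repository.
    plan.append(["git", "gc"])
    return plan


if __name__ == "__main__":
    for cmd in commit_plan(["cache/tmp.dat"], "snapshot"):
        print(" ".join(cmd))
```

Returning the commands as argument lists keeps the sketch free of side effects; each list could be handed to subprocess.run to execute it for real.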

Only one command is needed in practice:

 gibak commit

I'm using it to save 2GB in over 200,000 files, and it normally takes under 20 seconds to take a snapshot (AFAICS it scales linearly, so I expect to be able to back up 10GB in under 2 minutes...). The full power of the git toolset is available: earlier versions can be restored with "git checkout", remote copies can be created and synchronized with git clone/push/fetch/pull, you can see what has changed with "git diff", and so on.
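
The 10GB estimate is a straight linear extrapolation from the figures above. A tiny Python sketch of the arithmetic follows; the function name estimated_seconds is made up for illustration, and the 20-second reference is the upper bound quoted for 2GB.

```python
def estimated_seconds(size_gb: float,
                      ref_gb: float = 2.0,
                      ref_seconds: float = 20.0) -> float:
    """Extrapolate snapshot time, assuming it grows linearly with data size."""
    return ref_seconds * size_gb / ref_gb


print(estimated_seconds(10.0))  # prints 100.0, i.e. under 2 minutes
```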

You can get the code with

 git clone http://eigenclass.org/repos/git/gibak/.git/

The repository can be browsed at

 http://eigenclass.org/repos/gitweb

The metadata issue

The major thing missing in Git when used as a backup tool is support for file metadata (mostly file permissions) and empty directories (git just ignores them). git-home-history doesn't handle them at all, and etckeeper relies on metastore to preserve a snapshot of the metadata (owner, group, permissions, mtime, etc.) in a .metastore file located at the top of the git repository (/etc in the case of etckeeper).

The problem with metastore is that it doesn't know about Git's file exclusion mechanism (.gitignore), so it ends up storing the metadata of files that aren't going to be saved at all. Even though metastore is quite small and simple (only some 1500 lines of C code), extending it to honor .gitignore files seemed fairly involved: the semantics are a bit tricky (subdirectories inherit patterns from their parents and there's more than one kind of pattern) and harder to express without higher-order functions and closures.
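
To give a feel for why closures help with inherited patterns, here is a deliberately simplified Python sketch. It is not ometastore's code and not full .gitignore semantics: it only handles plain globs matched against a file name, plus "!" negation, and it assumes that a subdirectory's patterns override those of its parents. The names Matcher, no_rules and extend are invented for this example.

```python
from fnmatch import fnmatchcase
from typing import Callable, List, Optional

# A matcher answers True (ignored), False (explicitly re-included)
# or None (no rule applies) for a file name.
Matcher = Callable[[str], Optional[bool]]


def no_rules(name: str) -> Optional[bool]:
    """Matcher for a directory tree without any ignore patterns."""
    return None


def extend(parent: Matcher, patterns: List[str]) -> Matcher:
    """Build a directory's matcher from its parent's matcher and its own patterns."""
    def matcher(name: str) -> Optional[bool]:
        verdict = parent(name)        # start from the inherited decision
        for pat in patterns:          # later patterns override earlier ones
            negated = pat.startswith("!")
            glob = pat[1:] if negated else pat
            if fnmatchcase(name, glob):
                verdict = not negated
        return verdict
    return matcher


root = extend(no_rules, ["*.o", "*.log"])   # patterns at the top level
sub = extend(root, ["!keep.log"])           # a subdirectory adds one rule

print(bool(root("keep.log")))  # True: ignored at the top level
print(bool(sub("keep.log")))   # False: re-included in the subdirectory
print(bool(sub("main.o")))     # True: inherited from the parent
print(bool(sub("notes.txt")))  # False: no rule applies
```

Each directory gets a function that captures its parent's function, so a traversal only has to call extend when it descends into a directory that has its own patterns.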

I decided to reimplement metastore in OCaml (I named it, very unimaginatively, ometastore) and I'm glad I did. I implemented metastore's functionality in one fifth of the code (1500 lines of C vs. under 300 of OCaml); the resulting executable is 10% faster without any optimization effort and needs less memory. Even with extra functionality like path prefix compression (which shrinks the snapshot by 50%) and Git-like semantics for ignored files, ometastore needed only a quarter of the code of metastore (I've since optimized glob matching, making my directory traversal routine faster than git-ls-files', so it's up to 1/3rd of the size of metastore now).
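
The post does not say how ometastore compresses path prefixes. One common way to do it is front coding: for a sorted list of paths, each entry stores only how many leading characters it shares with the previous path, plus the remaining suffix. The Python sketch below assumes that scheme, and the names compress and decompress are hypothetical.

```python
from os.path import commonprefix
from typing import Iterable, List, Tuple


def compress(paths: Iterable[str]) -> List[Tuple[int, str]]:
    """Front-code paths: (characters shared with the previous path, suffix)."""
    entries: List[Tuple[int, str]] = []
    prev = ""
    for path in paths:
        shared = len(commonprefix([prev, path]))
        entries.append((shared, path[shared:]))
        prev = path
    return entries


def decompress(entries: Iterable[Tuple[int, str]]) -> List[str]:
    """Rebuild the full paths from front-coded entries."""
    paths: List[str] = []
    prev = ""
    for shared, suffix in entries:
        prev = prev[:shared] + suffix
        paths.append(prev)
    return paths


paths = ["etc/apt/sources.list", "etc/apt/trusted.gpg", "etc/hosts"]
packed = compress(paths)
print(packed)
# [(0, 'etc/apt/sources.list'), (8, 'trusted.gpg'), (4, 'hosts')]
print(decompress(packed) == paths)  # True
```

Because a directory traversal lists files of the same directory next to each other, consecutive paths tend to share long prefixes, which is where the savings come from.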

The backup system


Read more...

Copyright © 1996-2019 Artima, Inc. All Rights Reserved. - Privacy Policy - Terms of Use