This post originated from an RSS feed registered with Java Buzz
by Mathias Bogaert.
Original Post: How to handle big repositories with git
Feed Title: Scuttlebutt
Feed URL: http://feeds.feedburner.com/AtlassianDeveloperBlog
Feed Description: tech gossip by mathias
git is a fantastic choice for tracking the evolution of your code base and for collaborating efficiently with your peers. But what happens when the repository you want to track is really huge? In this post I'll give you some ideas and techniques for dealing properly with the different categories of huge.

Two categories of Big repositories

If you think about it, there are broadly two major reasons for repositories growing massive:

1. They accumulate a very, very long history (the project grows over a long period of time and the baggage accumulates).
2. They include huge binary assets that need to be tracked and paired together with code.
3. Both of the above.

So a repository can grow in two orthogonal directions: the size of the working directory (i.e. the latest commit) and the size of the entire accumulated history. Sometimes the second category of problem is compounded by the fact that old, deprecated binary artifacts are still stored in the repository, but that has a moderately easy (if annoying) fix; see below. For the above two scenarios the techniques and workarounds are different, though sometimes complementary, so let me cover them separately.

Handling Repositories With Very Long History

Even though the bounds that identify a repository as massive are pretty high (for example, the latest Linux kernel clocks in at 15+ million lines of code, yet people seem happy to peruse it in full), very old projects that have to be kept intact for regulatory or legal reasons can become a pain to clone. (To be transparent: the Linux kernel is split into a historical repository and a more recent one, and requires a simple grafting setup to gain access to the full unified history.)

Simple solution is a shallow clone

The first solution for a fast clone, and for saving developers and systems time and disk space, is to perform a shallow clone. A shallow clone allows you to clone a repository while keeping only the latest n commits of history. How do you do it?
Just use the --depth option, for example:

git clone --depth <depth> <remote-url>

Imagine you have accumulated ten or more years of project history in your repository (for JIRA, for example, we migrated an 11-year-old code base to git); the time savings can add up and become very noticeable. The full clone of JIRA is 677MB, with the working directory adding another 320+MB, comprising more than 47,000 commits. From a quick check on the JIRA checkout, a shallow clone took 29.5 seconds, compared to 4 minutes 24 seconds for a full clone with all the history. The disparity also grows in proportion to how many binary assets your project has swallowed over time. In any case, build systems can greatly profit from this technique too.

Recent git has improved support for shallow clones

Shallow clones used to be somewhat impaired citizens of the git world […]
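To see the effect of --depth without touching a real remote, here is a minimal, self-contained sketch: it builds a throwaway local repository with three commits, shallow-clones it with a depth of 1, and then retrieves the rest of the history on demand. The repository paths and commit messages are invented for the demo; the git flags (--depth, --unshallow) are the real ones.

```shell
#!/bin/sh
set -e

# Build a throwaway repository with three commits to act as the "remote".
tmp=$(mktemp -d)
git init -q "$tmp/origin"
cd "$tmp/origin"
git config user.email dev@example.com
git config user.name Dev
for i in 1 2 3; do
  echo "change $i" > file.txt
  git add file.txt
  git commit -q -m "commit $i"
done

# Shallow clone: only the latest commit's history comes across.
# Note the file:// URL; a plain local path would ignore --depth.
cd "$tmp"
git clone -q --depth 1 "file://$tmp/origin" shallow
cd shallow
git rev-list --count HEAD   # prints 1, not 3

# If the full history is needed later, it can still be fetched.
git fetch -q --unshallow
git rev-list --count HEAD   # prints 3
```

On a repository the size of JIRA's, the difference between transferring one commit and 47,000+ is exactly where the quoted 29.5-second vs. 4-minute gap comes from.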