Bill Venners: One problem with deploying Java applications that require a lot of memory is that the more memory they use, the longer garbage collection pauses can become. Can you explain how Azul's pauseless garbage collector works?
Gil Tene: At Azul, we've been building high-performance Java virtual machines (JVMs) for the past eight years now. We started the company in 2002, aiming to solve several scalability and infrastructure problems for Java. An interesting thing is that garbage collection (GC) was probably number three or four on the list; it wasn't number one. If you're going to build big, scalable Java systems, you obviously have to solve garbage collection, because you can't scale without it. But that was a corollary, not an upfront need.
We've been shipping a pauseless garbage collector for the last five years on our Vega systems. First it was a simple single-generation garbage collector; that was quickly followed by a generational pauseless collector for high throughput.
With the Zing platform, which we've just recently announced and will be starting to ship soon [1], we've taken that entire stack, a stack that can deliver a pauseless collector and scalable JVM into pretty much any OS, and brought it to a pure software product. We can do this now on top of commodity x86 hardware, on commodity hypervisors like VMware and KVM. Our JVM runs on Linux, Solaris, Windows, HP-UX, AIX, and pretty much any operating system out there. The JVM gets virtualized from that OS into our stack, which includes our own underlying operating environment that allows us to run this pauseless-collection, scalable JVM.
At the heart of all this is a Java virtual machine that has a lot of interesting features, one of which is a fundamentally different way of doing garbage collection. It's a collector that's designed to concurrently do everything a collector needs to do, and also to avoid doing anything that's rare. We didn't take the typical approach, where you optimize for the common fast case but remain stuck with some things that are really hard to do, which you push into the future. Then you tune and tune to make those events rare, maybe once every ten minutes or every hour, but they are going to happen. We took the opposite approach. We figured that to have a smooth, wide operating range and high scalability, we pretty much have to solve the hardest problem all the time. If we do that well, then the rest doesn't matter. Our collector really does the only hard thing in garbage collection, but it does it all the time. It compacts the heap all the time and moves objects all the time, but it does so concurrently, without stopping the application. That's the unique trick in it, I'd say, a trick that current commercial collectors in Java SE just don't do.
Pretty much every collector out there today takes the approach of doing everything it efficiently can without moving objects around, and delaying moving objects, or at least old objects, as long as possible. If you eventually end up having to move objects because you've fragmented the heap and have to compact memory, then you pause to do it. That's the big, bad pause everybody sees as a full GC pause, even on a mostly concurrent collector. They're only mostly concurrent because eventually they have to compact the heap. It's unavoidable.
Our collector is different. The only way it ever collects is to compact the heap. It's the only thing we ever do. As a result, we basically never have a rare event. We will compact the young generation concurrently all the time. We will compact the old generation concurrently all the time. It's the only thing we know how to do. And we do it well.
In order to support concurrent compaction, though, the algorithm itself, the collector, is built using some tricks and techniques that haven't been viable on regular operating systems before, and still aren't. We're actually trying to change that through the Managed Runtime Initiative Azul launched, advocating for new features in Linux and other OS kernels, but that's going to take a while. In our virtual appliances we basically deliver a stack with OS features that allow new things to happen on top of modern hardware. Then we use those features in the collector to deliver pauseless collection behavior.
One of the key techniques we use is massive and rapid manipulation of virtual memory mappings. We will change mappings of virtual to physical memory at the rate of Java allocation. If you're allocating a whole bunch of objects per second because you're doing work, you're transacting, you're doing whatever it is your application does, then for every couple of megabytes of generated Java objects we will map a new page. Then we will remap that page, protect it, and unmap it as part of our garbage collection cycle operations. This means we have to keep up with your allocation rate on a sustained basis, and peak to much higher rates when the collector actually does its job. Those rates on a modern x86 server today are fairly high: a single Nehalem chip could easily generate four gigabytes per second of new garbage when running a classic enterprise workload. So if you want to keep up with that, you need to sustain mappings, remappings, and unmappings at those rates, and if you want to sustain it with garbage collection, you need to be able to peak to about ten to one hundred times that rate during critical phases. That means hundreds of gigabytes, and perhaps low numbers of terabytes, per second, where a regular OS on current hardware will do less than a gigabyte per second of that type of virtual memory operation. We're simply missing some features in regular OSes. We added those features in a custom kernel based on Linux and embedded it in our Zing virtual appliances. Our JVM uses those features in the same way it did on our hardware-based Vega appliances over the past several years, so Zing can now deliver exactly the same pauseless collection algorithm that our hardware appliances have been delivering since 2005, but on modern commodity hardware instead.
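To make the kind of operation concrete, here is a minimal sketch, using only standard POSIX calls, of the map, protect, and unmap steps being described. It illustrates the primitives involved, not Azul's actual kernel interface; their custom kernel exists precisely because the stock calls are too slow at these rates.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define PAGE_BYTES (2UL * 1024 * 1024)   /* a 2 MB page, as in the text */

int main(void) {
    /* Map a fresh page to back new Java allocation. */
    void *page = mmap(NULL, PAGE_BYTES, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED) { perror("mmap"); return 1; }

    memset(page, 0, PAGE_BYTES);         /* the app fills it with objects */

    /* Later, after the collector evacuates the page, protect it so any
     * stale access traps into the read barrier's slow path ... */
    if (mprotect(page, PAGE_BYTES, PROT_NONE) != 0) perror("mprotect");

    /* ... and eventually unmap the virtual range entirely. */
    if (munmap(page, PAGE_BYTES) != 0) perror("munmap");
    return 0;
}
```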
Our collection algorithm basically does several of these manipulations per cycle in order to deliver two key fundamental characteristics. The first is a single-pass, completely concurrent marker. We will find all the live objects in the heap in a single pass, and we will never have to revisit any reference. We can do that because we have a special read barrier: every time you attempt to read a reference that has not yet been marked through, the VM will intercept the attempt, make sure the reference gets marked, and then keep execution going. With our read barrier, no references can hide from the collector, and the collector never needs to revisit modified references.
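Azul's published descriptions of this algorithm track a "not marked through" (NMT) state with each reference. The following rough C sketch models that state as a pointer tag bit; the names here (expected_nmt, mark_and_enqueue) are hypothetical, and this illustrates the technique, not Azul's actual code:

```c
#include <stdint.h>

#define NMT_BIT ((uintptr_t)0x1)       /* "not marked through" tag bit */

extern uintptr_t expected_nmt;         /* flips at the start of each mark phase */
extern void mark_and_enqueue(void *o); /* hand an object to the concurrent marker */

/* Conceptually runs on every reference load the application performs. */
static void *loaded_reference_barrier(void **slot) {
    uintptr_t ref = (uintptr_t)*slot;
    if (ref != 0 && (ref & NMT_BIT) != expected_nmt) {
        /* Slow path: this reference has not been marked through in the
         * current cycle. Mark its target so nothing hides from the
         * single marking pass, then flip its tag to the expected state. */
        mark_and_enqueue((void *)(ref & ~NMT_BIT));
        ref = (ref & ~NMT_BIT) | expected_nmt;
        *slot = (void *)ref;           /* heal the slot it was loaded from */
    }
    return (void *)(ref & ~NMT_BIT);   /* strip the tag before use */
}
```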
The reason this is a very important feature is that it makes the collector insensitive to mutation rate. It doesn't matter how fast you manipulate references in the heap; the collector will always complete in one pass. With this feature, the length of the collection doesn't depend on the rate at which you mutate the heap. That's an important level of insensitivity, and it makes the collector robust.
The other key feature is concurrent compaction. We can concurrently move an object without remapping all the references to that object before we let the program run. This is where our collector really stands out in comparison to anything else out there in the Java SE world. In a classic collector in a regular JVM, if you move an object from point A to point B, you now have to find the three billion places that might point to that object and fix them to point not to A but to B. Until you've done that, until you're sure there are no more pointers to the stale A location, you cannot let the program run a single instruction, because it could find corrupt data or race with itself in bad ways. As a result, people try to do all kinds of things to avoid this problem, like not moving objects, or, if they do move them, only moving ones they know have just a few pointers to them. But eventually you will fragment the heap, and you will have to move some objects that, at least at some point in time, had a lot of pointers to them. When you do that, you're going to have to scan the heap for pointers, and you will have to do that scan in a stop-the-world pause. With the Azul collector, we simply don't do any of that, and again, this is because we rely on our read barrier. The same read barrier I mentioned before will also intercept any attempt to read a reference to an object that has been relocated, and that allows us to lazily relocate references without needing a pause. We compact by moving an entire virtual page worth of objects; we kind of blow the page up, moving all the live objects to other places and thereby compacting them. But we don't try to locate and fix all the pointers to that page immediately. We'll eventually do that: we have what we call a remap phase, which basically works as part of the next mark phase, to lazily fix all those pointers. The approach is that eventually we'll visit all the pointers again to find out what's alive, and we'll fix the pointers as we get to them.
While we're waiting for that to happen, if the program does run into a pointer to one of those objects, the VM will intercept the attempt to load the pointer, fix it right then and there, and keep running.
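A rough sketch of that interception, under the assumption that forwarding information lives in a side table (the in_relocated_page and forwarded_address helpers below are hypothetical), might look like this; again, it illustrates the described behavior rather than Azul's implementation:

```c
#include <stdbool.h>
#include <stddef.h>

extern bool  in_relocated_page(void *ref);  /* page already evacuated? */
extern void *forwarded_address(void *ref);  /* side table: old -> new location */

/* Conceptually runs when a reference load hits an evacuated page. */
static void *remap_barrier(void **slot) {
    void *ref = *slot;
    if (ref != NULL && in_relocated_page(ref)) {
        /* Fix the reference lazily, right now, instead of waiting for
         * the remap phase to reach this slot. */
        ref = forwarded_address(ref);
        *slot = ref;                        /* heal the source location */
    }
    return ref;
}
```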
In both of those execution-time read barrier operations, the interception of any reference that has not yet been marked through and the interception of any reference to an object that has been moved, we use a feature we call "self healing." Self healing is a unique capability that, as far as we know, the Azul collector is the only one to have, and it can only be done with a read barrier.
At the read barrier point, we've just loaded a reference from memory, and we don't like that reference. Maybe it has not yet been marked through, or maybe it's pointing to a place that is no longer valid and should be pointing somewhere else. We know how to fix it and make it a good reference, but we also know one other very important thing: we know where the reference came from. So at the read barrier, we can go fix that too. Because we can fix the source memory location, we can heal the reason we trapped. We fix not only the data we got but also the place we got it from. Once fixed, that place in memory will not trap anybody else. In fact, any given reference will generally trigger a read barrier, by not being in the right state, at most once: either the collector gets to it first or the program gets to it first, but it's going to happen only once. Because of self healing, there are no hot loops where you run into these triggers over and over again on the same data. Somebody's going to fix it, and it's going to run really fast after that.
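The healing store itself has to tolerate races, since the collector and several application threads may trap on the same slot at once (the plain stores in the sketches above are simplified). A minimal sketch of a race-tolerant heal, assuming the barrier has already computed the corrected reference:

```c
#include <stdatomic.h>

/* Heal the slot a bad reference was loaded from. A compare-and-swap is
 * used because another thread, or the collector itself, may already have
 * healed the same slot; in that case their value is equally valid and we
 * simply keep it. */
static void self_heal(_Atomic(void *) *slot, void *observed, void *healed) {
    atomic_compare_exchange_strong(slot, &observed, healed);
}
```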
The self-healing part, which can only be done at the read barrier, is key to our collector. It means that we're not in a hurry: we need to run the collection fast enough to free memory at your allocation rate, but that's all we have to do. We don't have a phase in this collector where, if we don't execute fast enough, you're going to generate so much bad stuff that we'll never close a cycle. We just have to do one pass at a rate that's fast enough to generate enough free space from the garbage. It's not that if we don't mark fast enough you'll make more work for us in marking, or that if we don't remap or relocate fast enough you're going to get into trouble and hit hot loops that trap all the time. There are no bad, hot anythings, because of the design. Self healing is a key part of our read barrier capability, and it allows the collector to continue very calmly on its path, if you like. It doesn't have to dedicate tons of CPU to the work. It does need to be tuned to run at a sustainable rate, basically depending on how fast you allocate, but it's not in a hurry other than that.
Another key capability that comes along with self healing is a feature I call "quick release." It too is a pretty unique feature, and it has to do with virtual memory manipulation. When we've relocated objects out of a two-megabyte memory page on x86, and they're compacted away somewhere else, I have an empty two megabytes of memory, but there may be a billion pointers into it. I can't put anything else at that address, because I have to intercept all those pointers and fix them before the end of the next mark phase. But if I manage this correctly, which we do, the physical memory behind that address doesn't need to be there, because nothing is ever going to look into it. Any attempt to look into it will end up being intercepted, remapping the reference and looking somewhere else. So we keep all the forwarding information about where objects went outside of these physical pages, and we immediately release the physical page underneath the virtual mapping once we've relocated its contents, without waiting for any of the references to be fixed or remapped. We can then immediately recycle that physical page at a new virtual address, satisfying allocation right now, without waiting for all the remaps to complete. We can also use these recycled physical pages to feed compaction itself, doing hand-over-hand compaction, which allows us to easily compact the entire heap without needing a whole bunch of empty memory to compact it into. Quick release means that we can react very quickly: when you need memory and we've relocated a page, we can give you the empty memory right on the spot, before even completing the garbage collection cycle.
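Here is a sketch of the quick-release step, assuming hypothetical helpers for the forwarding side table and the virtual-address quarantine, and using stock Linux madvise(MADV_DONTNEED) behavior (which frees the physical pages behind an anonymous mapping while keeping the mapping itself) as a stand-in for Azul's faster custom mechanism:

```c
#include <sys/mman.h>

#define PAGE_BYTES (2UL * 1024 * 1024)

extern void save_forwarding_info(void *page); /* old->new addresses, kept off-page */
extern void quarantine_virtual(void *page);   /* address stays unusable until remapped */

static void quick_release(void *page) {
    /* Nothing will ever read the evacuated page itself: every stale
     * reference gets intercepted and remapped via the side table. */
    save_forwarding_info(page);
    quarantine_virtual(page);

    /* Drop the physical memory behind the virtual address while keeping
     * the mapping. The freed physical page can back a new virtual
     * mapping and satisfy allocation immediately, long before the next
     * remap phase has fixed every stale pointer. */
    madvise(page, PAGE_BYTES, MADV_DONTNEED);
}
```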
We still have to sustain the same overall memory release rates; we have to free memory at the same overall rate at which your program allocates it. But the time from when we relocate a page of memory and free it to when your program can have it is very short. That means we're a lot more robust against heuristic mistakes. It is much harder to surprise the collector by suddenly running really fast, as you could with other collectors, where you need a full GC cycle to complete before you get to use the released memory. We generally have the memory available more quickly to give to you, and as a result the GC heuristics, the part that knows when to start collector activity in order to keep your application fed, tend to be simple and robust.
All of these key differentiating features combine into what we call a pauseless collector. It is important to note, though, that the pauseless part describes the algorithm, not necessarily the implementation. Let me explain what I mean by that. A way to think about it is that the implementation is a "pause less" collector. The algorithm itself is a truly pauseless algorithm, meaning that at no point does it need to stop all the threads and keep them from moving. Individual threads may need to be stopped to do some scanning of their stacks or something, but other threads don't have to be stopped at the same time. It is technically a fully pauseless algorithm.
However, in a JVM implementation there are annoying little details that have nothing to do with the classic garbage-collected heap: things like cleaning out the compiled code cache, the system dictionary, and other little items here and there that have to do with cleaning out classes and runtime bookkeeping. While technically they can all be done concurrently, it's a lot of engineering work to complete. And since many of these extra things tend to be insensitive in length to metrics like heap size, allocation rate, transaction rate, and the rest, if you get them to be small enough, you've achieved your engineering goal. If you can get an entire GC phase shift to be sub-millisecond, going down to zero is just gravy. Just from a complexity point of view, making every single thing in the actual runtime concurrent is far more complex. And since at our current sub-millisecond levels there are bigger causes of jitter, like the CPU scheduler for example, improving GC phase shifts further won't actually improve application behavior unless those other causes are also addressed.

Over the years (we shipped our first pauseless collector in 2005) we have been chipping away at all those little engineering tasks, so fewer and fewer things get done in a pause. For example, weak reference, soft reference, and finalizer processing are no longer done in a pause in the Azul VM; every other VM will still do them in a pause, regardless of how it traverses the heap. That's not the classic collection work, it's other things.
We've also taken out most of the code cache work, and class unloading and things like that also primarily get done outside of a pause. But there are still tiny little phase-flip operations, "I'm starting a mark phase," "I've ended a mark phase," "I'm starting a relocation phase," that will still synchronize across all the threads and let them go. We do technically do those; they exist. On Zing they generally tend to be below a millisecond. We've measured them occasionally going as high as ten or twenty milliseconds on existing systems, primarily when the scheduler ends up taking away our CPU in the middle or something of that sort. Scheduling jitter on our current systems tends to be dramatically larger than GC pauses. But technically the pauses still exist; they're just tiny. We like to be honest about that.
[1] Azul has announced the general availability of Zing and is currently shipping.
Azul Systems is at:
http://www.azulsystems.com
The Managed Runtime Initiative is at:
http://www.managedruntime.org
Gil Tene is CTO, Vice President of Technology, and Co-Founder of Azul Systems.
Gil has been involved with virtual machine technologies for the past 20 years and has been building Java technology based products since 1995. He co-founded Azul Systems in 2002 with the goal of eliminating common Java responsiveness, performance, scale, and overall deployment barriers. Gil guides Azul Systems' architectural vision and product design to align with business and market opportunity strategies.
At Azul, Gil pioneered Pauseless Garbage Collection, Java Virtualization, and various managed runtime and systems stack technologies that combine to deliver the industry's most scalable and robust Java platform.
Prior to co-founding Azul Systems, Gil was Director of Technology at Nortel Networks, Shasta Networks, and Check Point Software Technologies, where he delivered several industry-leading traffic management solutions, including the industry's first Firewall-1 based security appliance and the industry's first subscriber edge Broadband Service Node.
Gil architected operating systems for Stratus Computer, clustering solutions at Qualix/Legato, and served as an officer in the Israeli Navy Computer R&D unit. He holds a BSEE from The Technion Israel Institute of Technology, and has been awarded 19 patents in computer related technologies.
Bill Venners is president of Artima, Inc., publisher of Artima Developer (www.artima.com). He is author of the book, Inside the Java Virtual Machine, a programmer-oriented survey of the Java platform's architecture and internals. His popular columns in JavaWorld magazine covered Java internals, object-oriented design, and Jini. Active in the Jini Community since its inception, Bill led the Jini Community's ServiceUI project, whose ServiceUI API became the de facto standard way to associate user interfaces to Jini services. Bill is also the lead developer and designer of ScalaTest, an open source testing tool for Scala and Java developers, and coauthor with Martin Odersky and Lex Spoon of the book, Programming in Scala.