I read this PDF a year ago and recently it popped up on my radar again, so I read it again and absorbed more this time than last. The PDF is "Pitfalls of Object Oriented Programming" and it is a really interesting read. The author jumps around a bit - I mean, who cares about the history of C++, get on with your point please.
The PDF is about optimizing the rendering loop of a 3D engine. The author managed to reduce its total run time from 19.6ms to 3.3ms - significant if you intend to run at 60 frames per second, because that gives you only 16.6ms to compute your physics, make your AI decisions and render your scene before you have to start all over again. Many console games will aim for 30fps or 24fps, and sometimes even lower, just so that they can do more in the meager processing time they have.
The world of the PC is no different to the console in this regard. You need to squeeze as much as possible out of those 16.6ms. So, his PDF is relevant, although not really about object oriented programming in my opinion — and I'll get to that shortly.
His speed-up is achieved by doing two things: re-organizing the data in his program, and using a special CPU optimization to push data sets into the L2 CPU cache. Back in 1979 the latency for a memory access was 1 CPU cycle; these days it's around 400+ CPU cycles. He poses the question in his paper, "How much can you do in 400 cycles?" Really the question is, "How much can you do in 400 cycles without accessing data outside of the L2 cache?" The answer is: not much.
So his paper is very relevant, and it's worth considering what he's discussing in terms of Smalltalk and other object oriented programming languages. The L2 cache is much smaller than main memory, of course, so you want to get the best use out of it that you can. Consider that we might have a class X with ten instance variables and a class Y with two instance variables. Now imagine we have 10,000 instances of class X, and each instance contains an instance of class Y in one of its slots. One of the operations an instance of X does is to tell its Y to update some of its information:
manyXs do: [:eachX | eachX myY updateYourInformation]
This is not uncommon code. But using his technique, we can consider the impact running this code has on the L2 cache. Every single X and Y in this equation will pass through the L2 cache. We know that each X has ten slots, but we're only accessing one of those slots in this loop. That means that for every instance of X we load in, we're wasting nine pointers' worth of space in the L2 cache. Add on top of that however big each Y actually is, and suddenly we're blowing out the L2 cache hundreds of times as we iterate over our data set.
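To put rough numbers on it - purely illustrative, assuming 32-bit pointers, an 8-byte object header and a 512KB L2 cache:

10,000 X's at (8-byte header + 10 slots * 4 bytes) = 480,000 bytes
10,000 Y's at (8-byte header + 2 slots * 4 bytes) = 160,000 bytes

That's roughly 640KB streamed through a 512KB cache in a single pass, when the loop only needed one slot of each X and a slice of each Y.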
The paper describes the exact same problem in the rendering engine he was working on. His solution was to stop using classes and structs, and to instead use a distinct array for each slot in the object. Instead of having an instance variable called "name" on class X, you'd have an array of "names for class X". You'd have one array for each slot in your class, and each -index- into those arrays represents all the data for an individual instance.
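Here's a minimal sketch of that layout in Smalltalk - the names are mine, only two of the former slots are shown, and I've assumed the update merely increments a counter:

| xNames yAges |
xNames := Array new: 10000 withAll: 'a name'.    "slot 1 of every X, side by side"
yAges := Array new: 10000 withAll: 0.            "the one Y slot the update touches"
1 to: 10000 do: [:i | yAges at: i put: (yAges at: i) + 1].

The loop now streams through a single contiguous array; the other nine columns of X never enter the cache.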
If the memory is laid out like this, you're not going to pull in the nine extra instance variable slots you're not accessing in your loop, so you'll get much more out of the L2 cache than you could before, and the program will run measurably faster. In his paper, this re-arrangement of memory alone got him a 35% speed increase.
He further analysed his data structures and realized that the tree he was iterating over was filled with information that could be inferred from the structure itself - why store child and parent pointers when you just need to know how many children a parent has? After removing the redundant information, he could fit even more of his information into the L2 cache, and reduced the render time from 19.6ms down to 4.8ms. He's not changing the way his program works, merely changing the way the information is stored in memory.
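To illustrate the idea with a toy layout of my own (not the paper's actual one): flatten the tree into pre-order and keep only a child count per node. Parent and child relationships fall out of the ordering, so the pointers don't need to be stored at all:

| childCounts visit |
childCounts := #(2 1 0 0).    "node 1 has two children; node 2 has one"
visit := [:i | | next |
	Transcript showCR: 'visiting node ', i printString.
	next := i + 1.
	(childCounts at: i) timesRepeat: [next := visit value: next].
	next].
visit value: 1.    "walks nodes 1, 2, 3, 4 in pre-order"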
The final step in his paper is to use the CPU's prefetch instruction, which hints to the CPU that you intend to access an area of memory in the near future. It will prefetch that memory into the L2 cache, and if you've done it right, by the time you get to your second or third access it'll all be in the cache and you won't have any cache misses. This final adjustment got his total runtime down to 3.3ms.
So, what does this mean for Object Oriented Programming - is the "Object" a bad thing? Well, no. This story is not about object oriented programming; it's about memory layout and memory management in general.
When I write code that interacts with a C library, I allocate a C struct and fill in its information with #memberAt:put: and retrieve information from it with #memberAt: - I am forced to add a level of indirection between my Smalltalk code and the C library.
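For example - a sketch in the DLLCC style, where the struct type CPoint, its fields and its allocation selectors are my own assumptions rather than any particular library's API:

| point |
point := CPoint malloc.                  "allocate the external C struct"
point memberAt: #x put: 10.              "every field write is a message send"
point memberAt: #y put: 20.
Transcript showCR: (point memberAt: #x) printString.
point free.                              "no GC here - we release it ourselves"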
When I wrote the Spellchecker library the first time, I didn't care about memory consumption - but when it was suggested it should become a goodie and be used in the Refactoring Browser to do spell checking as you program, I was asked to reduce its memory footprint. In this case I ended up doing the same thing the guy in the paper did: I removed the redundant object information and stored my tree in a ByteArray. Again, I was forced to add a level of indirection between my spell check program and its tree representation. I was also forced to use a ByteArray instead of instance variables.
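To give a flavour of what that looks like - this is an invented four-byte node format, not the goodie's actual layout:

| tree offsetOf |
tree := ByteArray new: 4000.                       "room for 1,000 packed nodes"
offsetOf := [:n | (n - 1) * 4 + 1].                "a node is 4 bytes, addressed by index"
tree at: (offsetOf value: 1) put: $c asInteger.    "byte 1: the node's character"
tree at: (offsetOf value: 1) + 1 put: 1.           "byte 2: flags - 1 = ends a word"
Transcript showCR: (Character value: (tree at: (offsetOf value: 1))) printString.

Every read and write goes through that offset arithmetic instead of an instance variable.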
Finally, we have the example of splitting up an object's slots into separate array buckets. If you were to do this in Smalltalk, you'd be forced to add a layer of indirection, and once more instance variables cannot be used. Because we're not using instance variables in these three examples, we're losing encapsulation in our code base - we now have to program by pattern, very strictly, to ensure we don't make mistakes. We also add the overhead of extra message sends just to get or put data into memory.
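Concretely, each "instance" degenerates into an index into shared buckets, and every accessor hides an array hop - the names here are hypothetical:

Object subclass: #X
	instanceVariableNames: 'index'
	classVariableNames: 'Names Ages'
	poolDictionaries: ''
	category: 'SoA-Sketch'

name
	^Names at: index    "a send plus an array access for every read"

name: aString
	Names at: index put: aString

Nothing stops another object poking Names directly; the encapsulation is purely by convention.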
If ever there were an argument for moving memory management up out of the virtual machine and into the Smalltalk code itself, perhaps this would be it. The Smalltalk compiler should take instance variables as a suggestion for what data is stored, but not -how- it is stored. Right now we treat everything as an array of pointers, unless it's byte data. Instead, I would prefer it if the compiler reflected off a memory layout class - just as we have #parserClass and #compilerClass, we'd have #memoryLayoutClass (*better name required), and that would allow a developer to create custom memory management policies as required.
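A sketch of how that hook might read - entirely hypothetical, since neither #memoryLayoutClass nor ColumnarLayout exists today:

Object subclass: #Particle
	instanceVariableNames: 'x y z'
	classVariableNames: ''
	poolDictionaries: ''
	category: 'Layout-Sketch'

Particle class >> memoryLayoutClass
	"Store each inst var as a contiguous per-slot column instead of
	 an array of pointers - the compiler would generate the accessors."
	^ColumnarLayout

The developer writes ordinary-looking classes; the layout policy decides what the slots actually become in memory.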
The drawback is that we would also have to implement a custom garbage collector for our memory layout classes - on the other hand, this is also a bonus. It means we can have objects stored in foreign data structures and GC them when we know we're ready, even calling the external C APIs to inform the C library that we're destroying the structure, without needing to use weak arrays or ephemerons. It would also allow us to implement non-GC'd pockets of execution, where objects are allocated and freed explicitly… and it would allow custom object pools.
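Pools, at least, can be sketched at the image level today - a toy, not a proposal for the real interface:

| pool acquire release buffer |
pool := OrderedCollection new.
acquire := [pool isEmpty
	ifTrue: [ByteArray new: 4096]    "grow on demand"
	ifFalse: [pool removeLast]].     "reuse a released buffer"
release := [:aBuffer | pool add: aBuffer].
buffer := acquire value.
release value: buffer.               "explicit free: no GC involved"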
If the virtual machine had primitives for accessing memory directly, like you get in C with & and *, then we could implement this tomorrow, all at the image level. But memory management has been an area of black magic in Smalltalk for so long that no one seems to have gone back to improve it, or expose more of it - probably out of fear of the garbage collector.
Don't get me wrong, the memory allocators we have in VisualWorks are awesome - they cover 96% of situations. It's the last 4% that mean you'll never get your render time down from 19.6ms to 3.3ms in Smalltalk. Not unless we change.