This post originated from an RSS feed registered with Agile Buzz
by James Robertson.
Original Post: Bryce Kampjes - Exupery
Feed Title: Michael Lucas-Smith
Feed URL: http://www.michaellucassmith.com/site.atom
Feed Description: Smalltalk and my misinterpretations of life
A small crowd turned up, myself and four other people. I was hoping for more, but on short notice and at an odd time (evening, Sunday) you can't be too surprised. I'll post pictures when I get home and can download the pictures off my camera.
When we (myself and David Price) picked up Bryce from the airport we didn't know what Bryce looked like. So we decided to fire up our laptops - I ran squeak, with croquet and put up the words 'Bryce' while david tried to write some postscript to show the name 'Bryce'.
I recognised him by the balloon on his shirt. We got off to a rough start when I forgot to pay for the parking card. The booth at the gate claimed we could pay with a credit card.. three credit cards later we had to give up on that and David ran back to pay the ticket while cars lined up behind us.
On then in to the city. We went and got some italian food in the city center, had a glass of red and started discussing some of our discoveries and interests with Smalltalk. After dinner, we were 20 minutes away from starting the presentation, so we headed over to the ANU and from there the presentation began...
Exupery makes 'Squeak Fly'. It's not a usual jit. In a regular JIT, if we don't have compiled code, we stop execution, compile the code, then run it. In Exupery, if the code isn't compiled, the interpreter continues running and compiles in the background. Returns can jump back in to newly compiled code.
Exupery is designed to compete with C, but because it's written in Smalltalk, it's slow (comparitively) but that's okay, because it'll produce fast code in the end. Compilers are complex, so it's not a big deal for it to run slowly at this point.
The whole project is completely Test Driven Developed. Exupery is 2-5x faster than Squeak.
Eventually, it's planned with a full optimiser - it'll be able to compete with anything given time. One advantage, he's not writing this compiler in C, but Smalltalk.
People started compilers in around the 70's - things got interesting in 83 that made a fast compiler. They were writing fast compilers - compiling took the same time as interpreting. In this case, it was worthwhile compiling everytime you hit some missing compiled code.
In 89 there was Dynamo, which emulated risc on a risc, but it ran faster because of the jitting. This was done by the 'self' team, at the machine code instead of at a bytecode level.
The design of a 'fast compiler' limits you to expensive algorithms and it really has to be fast to avoid holding up execution for long. You don't have time to do complex optimisation. Further work tried to get rid of bytecodes and just work with the machine code. The Risc machines were masically like a Sparc.
They did lots of analysis to find where the polymorphism was. For some benchmarks this got wonderful results. But for slight changes, it all fell apart and didn't work well at all.
Self introduced a key idea. If you instrument the program, then you can find what types occur there and then feed that back in to the execution. They had two compilers: one fast and one slow. The fast compilers remove the abstractions, but you can't do much more with it than that, it's too Cish. This gave them about 1/2 the speed of C code.
When you're doing heavy CPU usage - where you want optimisations, is not when you want the program to pause to do compiling: 3d graphics, real time music.
So, Exupery wants really fast compilation, but no compile time pauses. Optimising compilers are slow, with n^3 worst case. But most of the time they run at linear time. But, generally, it's slow. So Exupery allows the interpreter to keep running as the compiler runs in the background. Squeak is media rich and can't afford pauses due to compilation.
The other side of Exupery is targetted to out of order modern CPU's. Scheduling is done by the CPU and lots of spare capacity. Good C and Fortran is using about 1/1.5 commands per cycle. The machines have roughly 3/4 commands per cycle.
For Smalltalk, modern CPU's aren't that bad. It gives plenty of room to avoid dropping the CPU cache. Tight loops are solved by moving type checks outside of the loop, or inlining methods inside the tight loop to avoid method lookups per type inside a collection. So tight loops are easy to optimise. That's how you get near C speed.
Exupery is not a JIT, it's really close to a JIT. There's one method in Exupery that compiles your method. Any time you call your method or return in to your method, the compiled code will then run. Exupery never stops, it continually compiles things. To make it a real jit, let it compile things as they are run. This isn't a big step away. But for now, it's not done, still debugging and avoiding crashes.
Eventually, it'll have a full optimiser. At the moment it's culling, coalescing, register optimiser. The register optimiser does a surprisingly good job. The problems happen when you run out of registers.
100 bytes to one bytecode. So, 3million bytecodes in a Squeak image ends up as a lot of compiled code. So not everything is compiled. It's up to the developer, as like in LISP where you choose whether to compile or not.
Exupery is taking roughly 4 clock cycles per bytecode. For simple code, it's roughly 2 clock cycles per bytecode. With further work, it can get up to 1 clock cycle per bytecode, which is pretty good - it's up to C speed. Message sends are slightly slower than the interpreter at the moment - it hasn't been optimised yet. Sends are the next most important thing to do - but Bryce is finishing off his current code first.
Architecture:
Byte code Reader
Intermediate Generator
Tree optimiser (currently being developed)
Intermediate Converter (currently being developed)
Instruction Selector
Register Allocator
Assembler
Trivial Loader
The Byte Code Reader is simple, uses two parses. Intermediate Generator which is lots of code that is not terribly interesting.. lots of tests. It basically just expands a simple tree in to a really detailed tree. The tree optimiser then optimises the detailed tree to remove redundancy. Eg: avoiding redundant tagging of integers where you already know you're dealing with integers.
Instruction Selection decides what machine code instructions to actually use. There is lots of theory on which are the best to use. This is the component that would be replaced for cross platform compilation (the assembler too). The Register Allocator is described above, culling etc. The assembler is fairly basic. The loader just provides a way to actually put the code in executable memory.
at: and at:put: are quite slow in Squeak. They produce very large code. They are an artifact of Squeak's design to be more memory friendly than speed friendly.
The interpreter was modified to do the return intercepts and the sends intercepted but not much more was done to the interpreter.
The future for Exupery: Next, Polymorphic Inline Caches (speed up sends). Then Dynamic Inlining (remove sends). Most Smalltalk methods are tiny things that don't do much. So inlining those is a good goal to remove all the sends and polymorphic branching. Then, removing expensive code from loops.
The Animorphic people found that a frequency count on the number of times a method is called was sufficient to identify the methods that should be inlined.
After all that, better register spilling. X86 is bad, you only have 6 registers. 32 registers and the problem goes away. 16 registers gives you enough coverage to avoid most spills.
Another future plan: port to the ARM and others are interested in doing a powerpc port. Another possible optimisation is escape analysis. This is deciding if an object never leaves your local part of the stack - no need to allocate it to real memory.
Current design problems/limitations: What do you need to get wide use and what do you need to do to compete with C. Can Exupery run faster than C? Bryce says so, given effort and time. Previous JIT projects failed as they were written in C++ - author lost interest.
Exupery needs PICS and some tuning. PICS gives fast sends (Polymorphic Inline Cache) and it doesn't support blocks. Most blocks would end up being inlined away.
What would it take to make Smalltalk as fast as C? Assuming full inlining, blocks, PIC's, etc. You have non-effecient code by C standards. There's lots of duplication. Induction of values, which allows you to move information outside of loops. This allows at:/at:put: to run as fast as C's, even with Squeak's horrible object header mess.
Competing with C also means understanding the code you're running. Lots of Crypto is 32-bit heavy. Smalltalk doesn't do 32-bit very well. Smalltalk steals a few bits for tagging. So to avoid the tagging, try to know that you have an unboxed integer and use it where you know you can, then tag it again when you don't know. Also, lots of floats. Some Smalltalks box floats. Vector processing using SSE means you can do four operations on floats at the same time. This can jump you ahead of lots of C compilers.
Then, at:/at:put: - Squeak is designed for space optimisation, not speed. Each object has three headers. a) the object is in the header, b) the object has a size, c) the object is a full object. A lot of the time, Squeak gets away with one-word headers, but other times it ends up being three-words.
The interesting result of having a compiler in smalltalk gives you a new an exciting way to blow your image up, by tweaking the jitter to do things the wrong way :) Same as breaking your widgets, breaking the bytecode compiler, etc..
Other uses for Exupery: Very fast regular expressions and parsers. Database querying engine. Network filtering. Email virus checkers.