There are many facets to Xtreams, so we broke it up into different “parts”. There’s the Core, which you must have; then there are the Terminals, which are implementation specific (the part that would be rewritten if Xtreams were ported to, say, Squeak); then there are Xtreams-Transforms and Xtreams-Substreams, which are common behavior that you’ll probably want to use but that go above and beyond what streaming is. Beyond that, we have Parsing, which implements a PEG parser using Xtreams along with several example grammars; Xtras, which contains a bunch of non-core and non-common stuff; and Xperiments, where we drop in ideas we haven’t fully decided on yet.
Today I’d like to touch on some of the stuff in the Xtreams-Xtras package, specifically the object marshaler. In this package you’ll find support for chunked streams, crypto streams and compression streams, but the two of interest here are interpreting streams and object streams.
The interpreting stream is a hidden treasure in Xtreams. Have you ever wished you could stream across an array of shorts, or floats, or doubles using UninterpretedBytes as your source, and do it efficiently? Well, now you can; that’s what the interpreting stream is for. On top of that, I built an object marshaler.
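A minimal workspace sketch of streaming floats over bytes. It assumes the transform is created with #interpreting:, that #float is one of the registered type names, and that #close followed by #terminal hands back the underlying ByteArray. Treat those selectors as assumptions and check them against the version of Xtreams you have loaded.

```smalltalk
| bytes floats |
"Write three floats through an interpreting write stream into a ByteArray."
bytes := (ByteArray new writing interpreting: #float)
    write: #(1.5 2.5 3.5);
    close;
    terminal.
"Read the same bytes back as floats instead of as raw bytes."
floats := (bytes reading interpreting: #float) rest.
```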
There are many object marshalers floating around, generally geared toward different goals. In VisualWorks we have BOSS, Parcels, Opentalk-STST, Sixx, a Squeak compatible marshaler in Monticello and many, many others. So why so many? I think perhaps it’s because the marshalers themselves are not pluggable. One of my goals for Xtreams was to keep the rules about how you store objects into a stream pluggable. There are three streams for the marshaler: Read, Write and Analyse. The Analyse stream is a read stream but it never instantiates any of the data. This is incredibly useful when you’re debugging: if you can’t instantiate the data properly, you can at least see how the different bytes represent data in the stream.
All three streams are given a marshaler object, the default of which is ObjectMarshaler. You can provide your own marshaler object which needs to implement #marshal:object:, #unmarshal: and #analyse: - what you do from there is your own business. Also, ObjectMarshaler itself uses pragmas to get its marshaling rules, so it itself is pluggable. If you add pragmas to it, it’ll change its version signature automatically to avoid accidentally hooking up two incompatible streams.
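For orientation, a round trip through the default marshaler might look roughly like this. It is a sketch only: it assumes the marshaling streams are created by sending #marshaling to an existing read or write stream, and that #close followed by #terminal returns the underlying ByteArray. The package may spell the creation selector differently, so verify before relying on it.

```smalltalk
| bytes copy |
"Marshal a point into a ByteArray using the default marshaler."
bytes := ByteArray new writing marshaling
    put: 3 @ 4;
    close;
    terminal.
"Unmarshal it again from a read stream over the same bytes."
copy := bytes reading marshaling get.
```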
One crazy idea floating about in my head is to have the write stream always marshal out the marshaler on open so that the read stream never uses its own local marshaler code, thus making (in theory) the compatibility a non-issue (for the most part). I won’t be doing this any time soon but I still think it’s an interesting idea.
I wanted the object marshaler to work across different kinds of streams with ease, which means no positioning at all - backward references are achieved by an internal referencing id mechanism. It’s important to note that the marshaler does not implement the idea of object identity persistence as you will find it in Opentalk-STST or Glorp. Instead, I leave that as an exercise for someone who really wants it (read: I’m deliberately creating a barrier here because in my experience, remote references and object identity both add complexity and introduce issues if you don’t consider their implications in great detail before using them).
I also wanted the marshaler to be fairly efficient on space and reasonably efficient on speed. Finally, I did not feel it was a worthwhile goal to make the marshaler able to transmit classes across the wire or other things of that nature - mostly because the marshaler expects you to be communicating with a compatible destination; given that assumption there’s little need to have the protocol able to modify the destination to -be- compatible; in fact, that could be downright undesirable almost all of the time. You are able to transmit class definitions to your remote destination and have your remote destination install them, just not class references; so once again I leave this as an exercise for someone who really wants it.
At times I sacrifice space for speed. For example, for byte sized integers I can store the value as a single byte, but once they get into the small integer size range, I store them as a long. Why? Because I can read a long using the interpreting streams, which use UninterpretedBytes, which uses a primitive, and it’s all nice and fast. Therefore, the number 256 does not take two bytes, but instead five bytes (one byte to indicate it’s a long and four bytes for the long itself). I figure, if the size really does matter for you, you can easily sacrifice CPU time by wrapping a compression stream around it, which is nice and easy to do with Xtreams.
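To make the arithmetic concrete, here is a small model of that trade-off in plain Smalltalk. It is not the ObjectMarshaler's code: the tag values 1 and 2 and the most-significant-byte-first ordering are made up for illustration, and the block only handles non-negative integers below 2^32.

```smalltalk
| encode |
"Hypothetical tags: 1 marks a byte sized integer, 2 marks a four byte long."
encode := [:integer | | out |
    out := WriteStream on: ByteArray new.
    (integer between: 0 and: 255)
        ifTrue: [out nextPut: 1; nextPut: integer]
        ifFalse: [
            out nextPut: 2.
            "Emit the four bytes of the long, most significant byte first."
            4 to: 1 by: -1 do: [:i |
                out nextPut: ((integer bitShift: -8 * (i - 1)) bitAnd: 255)]].
    out contents].
(encode value: 0) size.      "2: one tag byte plus one body byte"
(encode value: 256) size.    "5: one tag byte plus four body bytes"
```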
Assuming you’re using my ObjectMarshaler class, the two ends of the stream begin knowing about a set number of classes. Any time you reference a class that has not been discussed yet, its name is transmitted across the wire only once and a new class id is created on both sides to represent it. This means that the majority of class communication ends up being single-byte class ids. Objects such as true/false/nil end up being a single byte too, since those objects are singletons. Objects that are immediate, such as numbers, do not need an object id so they write their body directly onto the stream. The implication is that the number 0, for instance, takes two bytes: one to mark it as an integer and one for its body. An instance of Object on the other hand would also take two bytes (at a minimum): one to mark it as an Object and one for its object id, then any bytes for its body (Object has no body so this would be no bytes).
During a single object transfer (which I call a ‘transaction of objects’) an object’s body is only ever written out once. The second time it is transmitted, only its object id is written. This allows you to have objects that reference each other or themselves without wasting space. However, outside of that single transaction, there is no object identity (I may one day change that... we shall see).
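The write-the-body-once rule can be modelled in a few lines of plain Smalltalk. This is an illustration of the idea only, not the marshaler's implementation: the names seen, log and write are hypothetical, and the log records symbols where the real stream would write bytes.

```smalltalk
| seen log write anObject |
"Object ids handed out so far in this transaction, keyed by identity."
seen := IdentityDictionary new.
log := OrderedCollection new.
write := [:object |
    (seen includesKey: object)
        ifTrue: [
            "Already written in this transaction: emit only its id."
            log add: #reference -> (seen at: object)]
        ifFalse: [
            "First sighting: assign the next id and emit the body."
            seen at: object put: seen size + 1.
            log add: #body -> (seen at: object)]].
anObject := Object new.
write value: anObject.
write value: anObject.
log    "a body entry for id 1, followed by a reference entry for id 1"
```

Starting a new transaction would correspond to throwing away seen, which is why identity does not survive between transactions.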
The upside of all of this is that you can send just about anything over the wire or store it on disk or put it in a byte array to be written to the database. You can write out block closures and even whole processes if you so desire. Doing some tests on this today I made an interesting discovery about the nature of clean blocks that I did not know before. Consider the following code:
myMethod: stream
    | temp1 temp2 cleanblock |
    temp1 := self makeSomeHugeObject.
    temp2 := self makeAnotherHugeObject.
    cleanblock := [1 + 2].
    stream put: cleanblock
Point of interest #1: the clean block has no outerContext and no copiedValues because it references none of the outer variables. It does, however, have a method which contains the bytecode for 1 + 2. This method is a CompiledBlock, and the compiled block has an outerMethod variable which will be #myMethod:. Whatever #myMethod: references will also get marshaled. If #myMethod: is some big huge thing (in my case, a big long test method) then you end up with quite a surprise on your hands, as I did. Worse still, if you insert a breakpoint, that breakpoint is now part of the CompiledMethod that is eventually referenced by your clean block, and it too gets thrown into the mix.
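You can poke at this chain yourself in a VisualWorks workspace. The sketch below assumes that BlockClosure and CompiledBlock offer accessors named after the instance variables mentioned above (#outerContext, #method, #outerMethod). If your image lacks one of them, inspecting the block gets you the same information.

```smalltalk
| cleanblock |
cleanblock := [1 + 2].
"No outer context, because the block touches no outer variables."
cleanblock outerContext.
"The CompiledBlock holding the bytecode for 1 + 2."
cleanblock method.
"The enclosing method, which is what drags everything else along."
cleanblock method outerMethod.
```

In a workspace the enclosing method is just the compiled doit, so the effect is far less dramatic than inside a long test method.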
Now, that said, it all works - all of that junk happily gets sent across the wire, it's just that I never expected it to be there in the first place. Shouldn't that CompiledBlock know it's completely isolated from its parent method and be implemented in such a way as to have no external references? Well, maybe, but it turns out that this is not the case in VisualWorks. The decompiler, and one would assume the debugger, make use of this meta information, so divorcing it might not even be possible, or even desirable. Still, it took me by surprise when I was reading the Analysis of my bytes in the above scenario.
At this point, the marshaler seems to do everything I wanted it to do. It's a practical replacement for BOSS and Opentalk-STST and, if anyone ever ported it to Squeak, it would also be a practical replacement for the marshaler inside Monticello. It might also be a good basis for byte-level communication between Smalltalk images across Smalltalk implementations, perhaps for Monticello2. My next goal is to make a simple image-to-image processor for transmitting messages and running code in a trusted network scenario. My goal here is two-fold: make it streaming and make it fast. I want to be able to transmit gigs of data over this protocol while still also sending messages; this is where all previous protocols fall flat on their face.
This would also be the basis for a new Grid/Polycephaly implementation. Polycephaly was two things: a way to fire up and manage duplicate images and a way to send messages to those images via pipes. Xtreams can take care of the pipe communication and the marshaler can take care of the messaging, leaving only the image management to be taken care of. It is also a replacement for the Grid, which assumed distributed images across machines, where the management of those images was dealt with outside of the image but discovery was done via a broadcast protocol.
The new marshaler can also be used as an on-disk format, since the destination/source does not matter: it can be sockets, in-memory, disk, or whatever kind of terminal you're able to come up with (perhaps database terminals should be next).