But to take an analogy from drag racing, sometimes what matters is getting across the finish line the fastest. How you did it does not matter. Certainly, everything and anything has been tried by backyard hotrodders.
What are some things that can be done with C++, that a good backyard hotrodder with greasy overalls and a pack of cigs rolled in his shirtsleeve would do, to get a program speeding over the finish line first?
The rules of this game are simple—to be a backyard C++ hotrodding modification, it's got to make the program faster than the "correct" way.
Warning: use of these techniques is rumored to be illegal in at least 5 states. They may cause you to be fired. C++ gurus will heap contempt upon your head. You'll get an "F" from your professor. Your warranty will be voided.
So let's pop the hood and slip that bottle of nitrous in.
Consider the problem of saving a complex data structure to disk and loading it back in. The proper way to do this is to serialize the data structure, object by object, into some disk-friendly format. The format is then read, parsed, and object by object, it gets recreated back in memory. Each object needs a serialize function, where one must figure out how to represent each member on disk, and a constructor that can build a replica of that object from the serialized data. When the data structure is non-trivial, with cycles, pointers, shared objects, etc., this is not a simple task. Furthermore, it is inherently slow. Each object must be broken down into its serial representation, and when reconstituting, each object must be allocated, constructed, and put back together.
But the data structure is already in memory, in a sequence of bytes. Couldn't that just be copied to disk, and read back in again? Yes, if we're willing to step outside of defined behavior.
The first thing to do is organize the data structure so it is entirely contained within a known block or sequence of blocks of memory. This means it may not contain any pointers or references to objects outside of itself. That also means no references to string literals, which are stored in the static program data, or any other static data.
Next, create a custom allocator/deallocator for the objects that go into the data structure, so that they are allocated within a known block or sequence of blocks, and not mixed up with other allocations that are not to be saved.
So, those blocks can then just be rolled out directly into disk files, and rolled back in. Right? Nope, because there are pointers. When it's rolled back in, it's probably not going to wind up at the same address, and the pointers will all be pointing to the wrong place.
We need a way to dehydrate the pointers when writing the blocks to disk, and hydrate them (just add water!) when reading a block back in, so they become real pointers again. This can be done by walking the data structure and, for each pointer in each object, converting that pointer from an address to an offset from the start of the block. Each class gets a dehydrate function, which handles all the pointers in its objects.
But there's a problem. If the data structure is cyclic, or has multiple pointers to the same object, we need a way to determine if a pointer has been dehydrated already or not. Just comparing the pointer value to see what range it lies in is not enough, as an offset into the block(s) may overlap the physical addresses the block is mapped into.
The solution lies in noticing that allocators typically allocate data aligned on 2- or 4-byte boundaries (and since we're writing a custom allocator for this, we can ensure this property). That means that, for a valid pointer, the least significant bit will always be 0. We can then define a dehydrated pointer as being odd, and a hydrated pointer as being even. Dehydrating a pointer then becomes:
void ptr_dehydrate(void **p)
{
    if (*p)
        *(long *)p |= 1;    // mark the pointer by setting its low bit
}

and hydrating it becomes:
int isdehydrated(void *p)
{
    return ((long)p & 1) != 0;
}

void ptr_hydrate(void **p)
{
    if (isdehydrated(*p))
        *(char **)p -= ptr_adjust;   // ptr_adjust relocates from the old block address
}

and for a class:
class Foo
{
    void *p;
    Bar *o;

    virtual void dehydrate()
    {
        ptr_dehydrate(&p);
        if (o && !isdehydrated(o))
        {
            o->dehydrate();      // recurse while the pointer is still valid
            ptr_dehydrate(&o);
        }
    }

    virtual void hydrate()
    {
        ptr_hydrate(&p);
        if (isdehydrated(o))
        {
            ptr_hydrate(&o);     // make the pointer valid before using it
            o->hydrate();
        }
    }
};

In practice, this works out really fast. But since it is "bad" C++ code, there are some problem areas that must be avoided or accounted for: chiefly, the saved image can only be read back by a build using the same compiler, the same class layouts, and the same pointer size and byte order.
C++ compilers implement virtual functions by inserting into each polymorphic class object a hidden member, the vptr, which points to a table of functions called the vtbl[]. A virtual function call is performed by using the vptr to find the vtbl[], and then calling the function at a specific index into that vtbl[]. Therefore, the polymorphic behavior of a class object is controlled by where the hidden vptr member is pointing. The vptr member is set when the object is constructed, by some hidden code added by the compiler to every constructor for that object's class.
So, by manipulating the vptr ourselves, we can control the behavior of an object, even change its type, without calling a constructor. This technique is called vptr jamming.
For example, consider a collection class which needs to be very fast. Most of the time, it will be used in a single-threaded manner, but sometimes, in a multithreaded manner. The program can go back and forth between single and multithreaded more than once during execution. It's got to run as fast as possible, so in single-threaded mode the time spent to do locks is unaffordable. So, our class might look like:
struct Collection
{
    ... members of the collection ...
    virtual void foo() = 0;
};

struct SingleThreadedCollection : Collection
{
    void foo() { ... optimized for single threaded ... }
};

struct MultiThreadedCollection : Collection
{
    void foo() { ... synced for multithreaded ... }
};

The desire is to switch back and forth between single- and multithreaded operation without having to destruct and reconstruct the object—we want to dynamically change the behavior by switching (i.e. jamming) the vptr. Here's how that would look:
struct Collection
{
    ... members of the collection ...
    virtual void foo() = 0;
    void toSingle();
    void toMulti();
};

struct SingleThreadedCollection : Collection
{
    static SingleThreadedCollection tmp;
    void foo() { ... optimized for single threaded ... }
};

struct MultiThreadedCollection : Collection
{
    static MultiThreadedCollection tmp;
    void foo() { ... synced for multithreaded ... }
};

SingleThreadedCollection SingleThreadedCollection::tmp;
MultiThreadedCollection MultiThreadedCollection::tmp;

void Collection::toSingle()
{
    *(void **)this = *(void **)&SingleThreadedCollection::tmp;
}

void Collection::toMulti()
{
    *(void **)this = *(void **)&MultiThreadedCollection::tmp;
}

The assignment in the toXxxx() functions gets the value of the right vptr from the static temporary tmp created for just that purpose, and jams it into the vptr location in *this. For most compilers, the vptr is at offset 0 of the struct. For the rest, this code will have to be tweaked to account for the actual offset.
Naturally, there are problems that must be avoided or accounted for here as well:

- The compiler may put the vptr at different offsets within the object instance. A bit of simple testing and looking at the generated assembler will quickly find it.
- The technique assumes that in every class involved the vptr will be at the same offset. This is true of all the compilers I tried it on, but there is certainly no guarantee that this is true.

As vptr jamming shows, the polymorphic behavior of an object instance is controlled by what vtbl[] its vptr is pointing to. It's not a big leap from that to realize that by testing which vtbl[] the vptr points to, the type of the object can be determined.
The usual way to determine the derived type of an object is by doing a dynamic_cast. If the dynamic_cast to a derived class succeeds, it returns a pointer to the derived class; if it fails, it returns NULL. The trouble is, dynamic_cast is slooooww. For example:
struct A { virtual int foo(); };
struct B : A { int foo(); };

int test(A *a)
{
    return dynamic_cast<B*>(a) != 0;
}

Here's the generated assembly code for the test to see if a is really an instance of B:
        mov     EAX,4[ESP]
        test    EAX,EAX
        je      L24
        push    0
        push    offset FLAT:___ti?AUB@@
        push    offset FLAT:___ti?AUA@@
        push    EAX
        mov     ECX,[EAX]
        push    dword ptr -4[ECX]
        call    near ptr ?__rtti_cast@@YAPAXPAX0PBD1H@Z
        add     ESP,014h
        jmp     short L26
L24:    xor     EAX,EAX
L26:    neg     EAX
        sbb     EAX,EAX
        neg     EAX
        ret

There are lots of instructions being executed, and a function call. It also relies on RTTI being generated for the class, which is bbllooaatt.
If only we could snipe the RTTI and figure out the type directly. If we've got the need for speed, we can do the following:
B tmp;

int test(A *a)
{
    return *(void**)a == *(void**)&tmp;
}

All this does is compare the vptr in a with the vptr in tmp. Most compilers put the vptr as the first member in the class object, so this will work. When it doesn't, adjust the offsets applied to a and &tmp to match.
The generated assembler code looks like:
        mov     EAX,4[ESP]
        mov     ECX,[EAX]
        cmp     ECX,?tmp@@3UB@@A
        mov     EAX,1
        je      L15
        xor     EAX,EAX
L15:    ret

Holy hotrod, Batman! That brought the test for the type down to two instructions. We can even do slightly better. The Digital Mars C++ compiler has special support for RTTI sniping with the __istype pseudo member function:
int test(A *a) { return a->__istype(B) != 0; }and we're down to one instruction:
        mov     EAX,4[ESP]
        cmp     dword ptr [EAX],offset FLAT:??_QB@@6B@[4]
        mov     EAX,1
        je      L13
        xor     EAX,EAX
L13:    ret

The obvious question is, why doesn't dynamic_cast produce the short, fast code? The answer is that RTTI sniping only works if the class type being tested for is the most derived class in the class hierarchy (because that determines the vtbl[]), whereas dynamic_cast needs to work for any derived class.
Once again, there are problems with RTTI sniping:

- The compiler may merge identical vtbl[]s between classes, so that the vptrs for class B and class A, where B is derived from A, point to the same value. This clever compiler optimization must be defeated; sometimes that can be done via a switch to "turn on RTTI", or by some other switch. Worst case, avoid using RTTI sniping between classes derived from one another.
- The compiler may generate multiple vtbl[]s for the same class, so that two vptrs can hold different addresses but still denote the same type. Fortunately, such compilers are rarely used these days. But the problem can still crop up if one DLL generates one instance while another DLL generates another. The moral is to have all the constructors for a particular object implemented in one source file, and have that be in only one DLL, not many.
Another trick is to counterfeit the this pointer in order to hide a class's implementation from its users. The straightforward way to write a class puts the implementation right in the class definition:

#include "implementation.h"

class Foo
{
  private:
    ... the implementation ...

    // the interface
  public:
    void bar() { ... manipulate the implementation ... }
};

The trouble with this, of course, is that the implementation is still there with its bare face hanging out, and in order to compile it, every irrelevant thing that the implementation needs has to be in scope, too.
The usual workaround is the pimpl idiom, which hides the implementation behind an opaque pointer, at the cost of an extra object and an extra indirection:

// User sees this class definition
class Implementation;               // stub declaration

class Foo
{
  private:
    Implementation *pimpl;

    // the interface
  public:
    Foo();
    void bar();
};

// Separate, hidden implementation of Foo
#include "implementation.h"

Foo::Foo() : pimpl(new Implementation()) { }

void Foo::bar() { pimpl->bar(); }
But there's a way to hide the implementation completely without having an extra object. The idea is to counterfeit the this pointer, so that the user thinks it is one type, but the implementation knows it is another:

// User sees this class definition
class Foo
{
    // the interface
  public:
    static Foo *factory();      // create and initialize an instance
    void bar();
};

// Separate, hidden implementation of Foo
#include "implementation.h"

Foo *Foo::factory()
{
    return reinterpret_cast<Foo *>(new Implementation());
}

void Foo::bar()
{
    reinterpret_cast<Implementation *>(this)->bar();
}

The reinterpret_cast is doing the dirty work of counterfeiting the type of the object from Implementation to Foo and back again.
Caveats:

- Foo cannot have any data members, even hidden ones like a vptr. Therefore, it cannot have any virtual functions.
- Foo cannot have any constructors, because we aren't constructing a real Foo, only a counterfeit one.

These techniques are also applicable to the D programming language[1].
Sometimes, you just feel the need for speed.