The Artima Developer Community

Agile Buzz Forum
How to get a Human from a Chimpanzee

James Robertson

How to get a Human from a Chimpanzee Posted: Mar 12, 2010 12:57 AM

This post originated from an RSS feed registered with Agile Buzz by James Robertson.
Original Post: How to get a Human from a Chimpanzee
Feed Title: Travis Griggs - Blog
Feed URL: http://www.cincomsmalltalk.com/rssBlog/travis-rss.xml
Feed Description: This TAG Line is Extra

One of the things I've never liked about the VisualWorks differencing tools is that I have to choose between source and text differencing. Usually I want source. Source compares the source code tokens. But sometimes it shows nothing. Why? Because the only change is a word in the method comment. Text differencing, unfortunately, only shows me that the whole comment changed, not the actual localized changes. Changes only in whitespace are equally distracting. The current solution, embodied by DiffList and a UI tool called Differator, doesn't really get me where I want to go.

What I really want is ONE difference mode that shows all of this at the same time. Pursuing this goal led me on an interesting journey. The results are in the mar10.2 7.8 build that will come out tomorrow.

To begin with, we have to separate two distinct parts. One is comparing texts. The other is what to do with the differences we find. Let's leave the second behind for a minute. If we solve the first problem generally enough, we'll have a myriad of choices open to us for the second.

A good starting place is the classic Unix diff command line utility. After digging around with it for a while, we'll discover two things: 1) it's not really about text at all, and 2) the algorithm found at the heart of the diff program is a nicely tuned piece of work.

The basic diff algorithm starts from the observation that for any two sequences (whether they're sequences of characters, words, text lines, bytecodes, or even DNA markers), there is a "longest common subsequence". There may be more than one, but none of them is longer than any other. You can find lots of interesting material about this around the net. A simple example is the pair of sequences h-u-m-a-n and c-h-i-m-p-a-n-z-e-e. The longest common subsequence (LCS) between the two is h-m-a-n. Once we have that, we can determine the steps it takes to turn the first sequence into the second. This is sometimes called the shortest edit script. In this case, we end up with the following edit script: insert c, copy h, delete u, insert i, copy m, insert p, copy a and n, insert z and e and e. This gives a directional evolution from one to the other. Want to go the other direction though? Just transpose inserts with deletes and keep the matches in place.
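
To make the LCS idea concrete, here is a deliberately naive recursive sketch in Smalltalk. It uses only generic collection protocol, is exponential in the worst case, and is nothing like the tuned algorithm described further down; it's just the textbook recursion, and it answers 'hman' for the pair above.

"Naive recursive LCS; exponential, for illustration only."
| lcs |
lcs := nil.
lcs := [:a :b |
    (a isEmpty or: [b isEmpty])
        ifTrue: ['']
        ifFalse: [a last = b last
            ifTrue: [(lcs
                        value: (a copyFrom: 1 to: a size - 1)
                        value: (b copyFrom: 1 to: b size - 1)) , (String with: a last)]
            ifFalse: [| left right |
                left := lcs value: (a copyFrom: 1 to: a size - 1) value: b.
                right := lcs value: a value: (b copyFrom: 1 to: b size - 1).
                left size >= right size ifTrue: [left] ifFalse: [right]]]].
lcs value: 'human' value: 'chimpanzee'    "answers 'hman'"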

With this kind of information, we can envision a number of different ways of highlighting differences between texts, DNA sequences, and so on. We can do them in a unified view. We can do them in side-by-side views. We can even discern replacements from true insertions or deletions, because a replacement shows up as a paired delete/insert sequence.
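
To make that last point concrete, here is a toy sketch of the coalescing step: walk an edit script and merge each delete that is immediately followed by an insert into a replacement. The (operation value) pairs are stand-ins of my own, not the real difference objects described below.

"Toy edit-script walker; adjacent delete/insert pairs become replacements."
| script coalesced i |
script := OrderedCollection new.
script add: #(#insert $c); add: #(#copy $h); add: #(#delete $u); add: #(#insert $i).
coalesced := OrderedCollection new.
i := 1.
[i <= script size] whileTrue: [
    | edit |
    edit := script at: i.
    (edit first = #delete and: [i < script size and: [(script at: i + 1) first = #insert]])
        ifTrue: [
            coalesced add: (Array with: #replace with: edit last with: (script at: i + 1) last).
            i := i + 2]
        ifFalse: [
            coalesced add: edit.
            i := i + 1]].
coalesced    "(#insert $c) (#copy $h) (#replace $u $i)"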

The mar10.2 build adds a new differences: API to SequenceableCollection. You can take any two sequences and determine how to evolve one from the other. You can feed it DNA symbols, strings, numbers, whatever you can think of to put in a SequenceableCollection. The result is an array of SubsequenceDifference instances, with a subclass for each of the different kinds of edit.
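
A usage sketch: the differences: selector and the SubsequenceDifference result come from the description above, while the argument order (receiver evolving into the argument) and the printing here are my assumptions.

"Hypothetical usage; assumes the receiver is the 'before' sequence."
| edits |
edits := 'human' differences: 'chimpanzee'.
edits do: [:each | Transcript show: each printString; cr]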

At the heart of the algorithm is a modified port of a generalized version of the Myers diff algorithm. You can find lots of material about this on the net too; you'll even find a couple of visualizations. What the Myers approach to the LCS problem exploits is that the algorithm is usually applied in cases where differences are few compared to matches. It also integrates finding the LCS with determining the edit steps, and it ends up being a linear-space algorithm. To say it's a hairy, nasty work of indirection and head scratching is an understatement. It makes sorting algorithms look like kindergarten reading.
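
For a flavor of the core idea, here is the greedy "D-loop" from Myers' paper sketched in Smalltalk. It only answers the length of the shortest edit script, skips the linear-space refinement and the edit-script recovery entirely, and uses a Dictionary in place of the paper's V array so negative diagonal indices are painless; it is not the ported code in the build.

"Greedy shortest-edit-script length, after Myers; illustration only."
| a b n m v d found |
a := 'human'. b := 'chimpanzee'.
n := a size. m := b size.
v := Dictionary new.
v at: 1 put: 0.
d := 0.
found := nil.
[found isNil] whileTrue: [
    (d negated to: d by: 2) do: [:k |
        | x y |
        x := (k = d negated or: [k ~= d and: [(v at: k - 1 ifAbsent: [0]) < (v at: k + 1 ifAbsent: [0])]])
            ifTrue: [v at: k + 1 ifAbsent: [0]]
            ifFalse: [(v at: k - 1 ifAbsent: [0]) + 1].
        y := x - k.
        [x < n and: [y < m and: [(a at: x + 1) = (b at: y + 1)]]]
            whileTrue: [x := x + 1. y := y + 1].
        v at: k put: x.
        (found isNil and: [x >= n and: [y >= m]]) ifTrue: [found := d]].
    d := d + 1].
found    "7 for this pair: one delete plus six inserts"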

I ported it three times. Each implementation I found had a test suite, so I gathered those into one big test suite and added a couple of edge cases. The first port was of a C# implementation. This seemed easy, but I ended up with out-of-bounds errors (the algorithm relies heavily on 0-based array addressing). So I tried a Ruby version. It was ironically harder to port (this is the language that's supposed to be a lot like Smalltalk), and it too had issues. Then I ported a generalized C version (generalized in the sense that it was parameterizable), and it worked! All the tests passed. There's irony for you.

How much faster is it? Given the degenerate case of determining how to get from Object comment to Object comment reverse, the VisualWorks DiffList, which is a hybrid LCS searcher and produces only "unmatching range" information, takes 12.5 seconds to run. The differences: method does it in under half a second. A naive recursive LCS implementation has to be killed after grinding away for 3 hours.
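
If you want to reproduce that rough benchmark once the build is out (absolute numbers will obviously vary by machine), something along these lines should do, again assuming the differences: selector behaves as described above.

"Rough timing of the degenerate Object comment benchmark."
| source ms |
source := Object comment asString.
ms := Time millisecondsToRun: [source differences: source reverse].
Transcript show: ms printString; show: ' ms'; cr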

At least two good things have come out of this. One is that there is now a general-purpose method to quickly and efficiently produce complete shortest edit scripts between any two sequences. Application programmers don't have to be tied to the IDE's way of highlighting differences between two strings; they can do whatever they want. They can also use it in contexts beyond source comparison (e.g., comparing load lists, determining undo differences, DNA sequencing, schedule comparison, and so on).

Secondly, we've completely separated what to diff from how to diff. This means that I can have a method on RBScanner which scans source tokens. But it can go further. Since said scanner knows where comments are, we can produce strings for the comments. In fact, we can go further and easily break the comments up into words. We can even note when a token is a string literal and break it up into words as well. And we can insert elements for whitespace, so that when we difference Smalltalk methods, we get everything broken down to the granularity we'd like. Which is what I wanted in the first place.
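
I won't reproduce the RBScanner-based tokenizer here, but the word-granularity part is easy to mimic: split two comments into words with a little hand-rolled splitter and hand the word arrays to differences:. The selector is the one named above; everything else in this sketch (including the sample comments) is made up for illustration.

"Word-granular differencing sketch; the splitter is hand-rolled, not RBScanner."
| splitWords before after |
splitWords := [:text | | words word |
    words := OrderedCollection new.
    word := WriteStream on: String new.
    text do: [:each |
        each isSeparator
            ifTrue: [
                word contents isEmpty ifFalse: [words add: word contents].
                word := WriteStream on: String new]
            ifFalse: [word nextPut: each]].
    word contents isEmpty ifFalse: [words add: word contents].
    words asArray].
before := splitWords value: 'Answer a copy of the receiver'.
after := splitWords value: 'Answer a shallow copy of the receiver'.
before differences: after    "should report the single inserted word 'shallow'"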

Read: How to get a Human from a Chimpanzee
