I've realized that my initial performance comparisons
were flawed because the index stored neither the original text nor the term vectors.
According to Ferret's documentation
(and a basic understanding of inverted indices), term vectors are needed to
create search result excerpts and to perform phrase searches. Also, since
the suffix-array-based index keeps a copy
of the original text, that text should be stored in Ferret's index too if
the comparison is to be meaningful.
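To make the "copy of the original text" point concrete, here's a toy sketch (not FTSearch's actual code, and ignoring construction-time efficiency) of why a suffix-array index gets phrase searches and excerpts almost for free: the sorted suffix offsets point straight into the stored text, so a phrase lookup is a binary search and an excerpt is a plain substring.

```ruby
# Toy suffix-array index: keeps the full text and finds phrases by
# binary search over lexicographically sorted suffix offsets.
class SuffixArrayIndex
  attr_reader :text

  def initialize(text)
    @text = text # the index stores the original text itself
    # Naive O(n^2 log n) construction; enough to show the idea.
    @sa = (0...text.size).sort_by { |i| text[i..-1] }
  end

  # Offsets where +phrase+ occurs, found via binary search: since the
  # suffixes are sorted, all matches are contiguous in @sa.
  def find(phrase)
    lo = (0...@sa.size).bsearch { |i| @text[@sa[i], phrase.size] >= phrase }
    return [] if lo.nil?
    hits = []
    while lo < @sa.size && @text[@sa[lo], phrase.size] == phrase
      hits << @sa[lo]
      lo += 1
    end
    hits.sort
  end

  # Excerpts are trivial, since the original text is right there.
  def excerpt(offset, len = 20)
    @text[offset, len]
  end
end
```

For example, `SuffixArrayIndex.new("to be or not to be").find("to be")` returns the offsets `[0, 13]`, and an excerpt around either hit is a simple slice of the stored text. An inverted index without stored fields and term vectors has neither piece of information at hand, which is what made the original comparison unfair.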
Meanwhile, I've also discovered that Ferret is much slower than I thought
when you actually try to do something with the results, such as getting their
URIs (otherwise, all you have is an internal document ID that doesn't tell you
anything). In some quick tests, it needed over 0.30 seconds to return 1165
hits when looking for "sprintf" in Linux's sources after a few warm runs, and
over 8 seconds when the cache was cold. I think both figures will be quite
easy to beat, but that will come later --- I want indexing to be fast to begin
with, as I'll be running the indexer often while I develop this.
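The measurement above is not just the raw search: it includes mapping each internal document ID back to a URI. A minimal sketch of that kind of timing harness, using Ruby's stdlib Benchmark module (the index object and its `search`/`uri` methods are placeholders, not Ferret's or FTSearch's real API):

```ruby
require 'benchmark'

# Time a query end-to-end: the search itself plus resolving each
# internal doc ID to a URI, since the IDs alone are useless.
def time_query(index, term)
  Benchmark.realtime do
    hits = index.search(term)               # internal document IDs
    hits.map { |doc_id| index.uri(doc_id) } # the part that proved costly
  end
end
```

The cold-cache figure can't be reproduced in-process like this; it requires dropping the OS page cache (or rebooting) between runs, which is why the warm and cold numbers differ so much.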
I've rewritten the indexer, made it modular (e.g. it can index documents with
multiple fields, using a different analyzer for each), and then implemented a
couple of functions in C --- some 150 lines of C, compared to Ferret's >50000...
This is the beginning of FTSearch (I'll soon put the darcs repository online).
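To illustrate what "modular" means here, a hypothetical sketch of the per-field design (names and API are illustrative, not FTSearch's real interface): each field is declared with its own analyzer, a block that turns raw text into terms, and documents are indexed field by field.

```ruby
# Hypothetical multi-field indexer: one analyzer per field.
class FieldIndexer
  def initialize
    @analyzers = {}
    @postings  = Hash.new { |h, k| h[k] = [] } # [field, term] => doc IDs
  end

  # Register a field with its analyzer block (text -> array of terms).
  def define_field(name, &analyzer)
    @analyzers[name] = analyzer
  end

  # Index one document: a hash of field name => raw text.
  def add_document(doc_id, fields)
    fields.each do |name, text|
      @analyzers[name].call(text).each do |term|
        @postings[[name, term]] << doc_id
      end
    end
  end

  def lookup(field, term)
    @postings[[field, term]]
  end
end
```

With this shape, a URI field can use an identity analyzer (one term per document) while a body field tokenizes and downcases, and the C functions can slot in behind the hot paths without changing the interface.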
Here's how FTSearch compares to Ferret right now:
Needless to say, this would make it way faster than Lucene --- maybe an order
of magnitude, if this still holds*1 for this corpus.