The first attempts to optimize my
pure-Ruby, 200LoC full-text search engine
based on suffix arrays (which evolved into the in-progress FTSearch) led me to perform some n-gram analysis over Ruby's core/stdlib docs (as well as a pack of gems).
With a suffix array, locating a word in the corpus involves a binary search
of complexity O(log2(N)), N being the total number of suffixes.
Since the suffix array (SA) is but a large array of integer values representing
offsets in the "full text" file (which correspond to the start of the
suffixes), you need to read the text at the position indicated in the SA
in order to compare it to the term you're looking for.
Therefore a corpus like the Linux sources (with some 20 million suffixes)
would require about 25 disk seeks even if the suffix array itself were held in
memory. At 10ms per seek, that would be 250ms when the file isn't cached...
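To make the cost concrete, a suffix array lookup boils down to something like
the following. This is a minimal sketch assuming the suffixes.dat/full_text.dat
layout described below (32-bit little-endian offsets); the file names, the
64-byte comparison window and the prefix comparison are illustrative:

MAX_CMP = 64   # bytes of text read per comparison (assumption)

def sa_lookup(term, sa_io, ft_io, num_suffixes)
  lo, hi = 0, num_suffixes - 1
  while lo <= hi
    mid = (lo + hi) / 2
    sa_io.pos = 4 * mid                      # 32-bit offsets into the full text
    offset = sa_io.read(4).unpack("V")[0]
    ft_io.pos = offset                       # <-- the expensive part: a disk seek
    suffix = ft_io.read(MAX_CMP).to_s.split(/\0/)[0].to_s
    case term <=> suffix[0, term.size]
    when 0  then return offset               # term is a prefix of this suffix
    when -1 then hi = mid - 1
    else         lo = mid + 1
    end
  end
  nil
end

File.open("suffixes.dat", "rb") do |sa|
  File.open("full_text.dat", "rb") do |ft|
    p sa_lookup("Enumerable", sa, ft, File.size("suffixes.dat") / 4)
  end
end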
The first idea that came to mind to minimize the number of seeks was storing
the first n bytes of text right after the offset in the SA:
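Something like this; each entry becomes the 32-bit offset followed by the first
n bytes of the corresponding suffix (the pack format and n = 4 below are just
illustrative, as are the offset and full_text variables):

# One "inlined suffix" SA entry for n = 4:
#
#   [ offset: uint32 LE (4 bytes) | first 4 bytes of the suffix ]
#
# i.e. 8 bytes per entry instead of 4. Building and decoding such an entry:
entry          = [offset, full_text[offset, 4]].pack("Va4")
offset, prefix = entry.unpack("Va4")
# During the binary search, prefix can be compared against the first 4 bytes of
# the search term without touching full_text.dat; only when they match is an
# actual read (and disk seek) needed.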
Ideally, if suffixes covered the whole "character space" uniformly, those n
bytes would carry n * 8 bits of information, saving n * 8 binary search steps
and hence n * 8 disk seeks. So n = 4, which would double the size
of the SA (offsets are 32 bit integers), would be enough to find any word in
a corpus under 4GB without a single disk seek when the SA is held in
memory.
Of course, the distribution of suffixes is hardly uniform. We tend to repeat a
limited number of words instead of writing random byte strings...
I did some basic n-gram analysis over the documentation from the core/stdlib and
some ~80 gems; character-wise to begin with (since this is what matters for
the full text search engine) and then word-based while I was at it.
This analysis showed that such "inline suffixes" add a large space overhead to
the index for relatively little gain, so I adopted a slightly different solution.
Counting n-grams
If you have FastRI and have used it to build a full-text index of
your core/stdlib/gem docs, your ~/.fastri-fulltext/ directory will hold a
"suffixes.dat" file (just a large array of 32 bit ints) and a "full_text.dat"
file which holds the actual indexed documentation.
Collecting the n-grams is easy:
MAX_SUFFIX_SIZE = 64
NGRAM_SIZES = [4, 8, 10, 12]

suffixes = []
File.open("full_text.dat", "rb") do |ft|
  File.open("suffixes.dat", "rb") do |f|
    until f.eof?
      offset = f.read(4).unpack("V")[0]
      ft.pos = offset
      suffix = ft.read(MAX_SUFFIX_SIZE).split(/\0/)[0]
      suffixes << suffix
    end
  end
end

puts "Total suffixes: #{suffixes.size}"

ngrams = Hash.new{|h,k| h[k] = Hash.new{|h2,k2| h2[k2] = 0} }
all = {}
suffixes.each do |suffix|
  all[suffix] = true
  NGRAM_SIZES.each{|n| ngrams[n][suffix[0,n]] += 1 }
end

puts <<EOF
===============================
Character-based n-gram analysis
===============================
#{all.size} unique #{MAX_SUFFIX_SIZE}-grams
A lookup would take on average #{Math.log(suffixes.size)/Math.log(2)} disk seeks.
EOF
With my index, I get:
Total suffixes: 224687
196625 unique 64-grams
A lookup would take on average 17.7775571295372 disk seeks.
n-gram statistics
At this point, ngrams[n] is a hash associating each n-gram with its number of
occurrences, and we can try to obtain some interesting statistics:
* number of suffixes which start with the same n bytes (the same n-gram)
* mean, median and maximum number of suffixes per n-gram, as well as the stddev
* how many disk seeks we can expect to save on average
The latter is actually the entropy of the n-gram distribution. If there's
only one n-gram, knowing the first n bytes doesn't give you any usable info
(the entropy is 0) and you don't save any disk seeks; all you've done is
increase the size of the SA with no gain whatsoever. Now, if there are two
n-grams and 50% of the suffixes start with one, 50% with the other, then a
binary search will take 1 disk seek less (you just saved 10ms). If we consider
an n-byte n-gram, we can save at most n * 8 seeks, iff there are 2^(8*n)
suffixes and each starts with a different n-gram.
The gain (number of disk seeks saved) can thus be computed as

  gain = - sum(i = 1..M) (S_i / N) * log2(S_i / N)

if there are M distinct n-grams, S_i is the number of suffixes starting with
the i-th n-gram, and N = S_1 + S_2 + ... + S_M.
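For completeness, here is a sketch of how these statistics can be obtained from
the ngrams hash built above; the mean/median/stddev helpers and the report
format are illustrative, not necessarily the exact script used:

# Per-n statistics over ngrams[n] (n-gram => number of occurrences).
def mean(xs);   xs.inject(0.0){|s,x| s + x} / xs.size; end
def median(xs); s = xs.sort; s[s.size / 2]; end
def stddev(xs); m = mean(xs); Math.sqrt(xs.inject(0.0){|s,x| s + (x - m) ** 2} / xs.size); end

NGRAM_SIZES.each do |n|
  counts = ngrams[n].values
  total  = counts.inject(0){|s,x| s + x}          # == suffixes.size

  # Entropy of the n-gram distribution = expected number of disk seeks saved.
  gain = counts.inject(0.0) do |s, c|
    p = c.to_f / total
    s - p * Math.log(p) / Math.log(2)
  end

  puts "%2d-grams: %6d distinct, mean %8.2f, median %6d, max %6d, stddev %8.2f" %
       [n, counts.size, mean(counts), median(counts), counts.max, stddev(counts)]
  puts "          expected seeks saved (entropy): %5.2f out of %5.2f" %
       [gain, Math.log(total) / Math.log(2)]
end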