The last time I blogged about the
FTSearch (simple) full-text search engine, it was already indexing the Reuters
corpus more than twice as fast as Ferret. I have since rewritten a few more
methods in C and got an extra 50% speed boost, so it now indexes over 3 times
as fast as Ferret.
Today's news is that I've implemented enough functionality to start comparing
search performance, and FTSearch often does outperform Ferret (but read below).
The code
FTSearch is currently powered by some 500 lines of C and 800 lines of Ruby,
compared to over 50,000 LoC of C for Ferret. Of course, it's far from being as
featureful as Ferret, but the #1 advantage of being small is that it's easier to
debug, and I think it can be made more reliable than Ferret with relatively
little effort (Ferret often crashed with a segfault while I was timing it; for
instance, it was really angry at me when I searched the Linux sources for "void").
Here are some times for the Linux corpus (160MB of .c and .h files, 20 million
suffixes).
Word search
Ferret
$ ruby ferret-lookup.rb
Input term: sprintf
Needed 0.050867 for the search.
Needed 8.83525 to get the URIs.
Total time: 8.886119
Total matches: 1165
Showing top 10 matches:
0.100 corpus/linux/drivers/usb/host/uhci-debug.c
0.097 corpus/linux/drivers/scsi/aic7xxx_old/aic7xxx_proc.c
0.090 corpus/linux/drivers/isdn/hisax/q931.c
0.089 corpus/linux/drivers/s390/sysinfo.c
0.086 corpus/linux/drivers/pci/hotplug/cpqphp_sysfs.c
0.086 corpus/linux/drivers/pci/hotplug/shpchp_sysfs.c
0.083 corpus/linux/drivers/mca/mca-proc.c
0.077 corpus/linux/drivers/dio/dio-sysfs.c
0.077 corpus/linux/drivers/net/wan/lmc/lmc_debug.c
0.073 corpus/linux/drivers/media/radio/miropcm20-rds.c
0.072 corpus/linux/drivers/video/console/promcon.c
Input term: sprintf
Needed 0.004382 for the search.
Needed 0.322449 to get the URIs.
Total time: 0.326834
Total matches: 1165
[...]
FTSearch
$ ruby sample-lookup.rb
Input term: sprintf
Needed 0.085965 for the search.
Needed 0.008282 to rank 5556 hits into 1176 docs
Needed 0.001414 to get the URIs.
Total time: 0.095685
Showing top 10 matches:
44576 4913 corpus/linux/drivers/scsi/aic7xxx_old/aic7xxx_proc.c
41280 5073 corpus/linux/drivers/usb/host/uhci-debug.c
39200 4647 corpus/linux/drivers/s390/sysinfo.c
38720 4557 corpus/linux/drivers/pci/hotplug/cpqphp_sysfs.c
38704 4582 corpus/linux/drivers/pci/hotplug/shpchp_sysfs.c
34210 3640 corpus/linux/drivers/isdn/hisax/q931.c
31556 3718 corpus/linux/drivers/mca/mca-proc.c
27920 3839 corpus/linux/drivers/media/radio/miropcm20-rds.c
26925 4425 corpus/linux/drivers/net/wan/lmc/lmc_debug.c
25770 3228 corpus/linux/drivers/dio/dio-sysfs.c
25699 5293 corpus/linux/drivers/video/console/promcon.c
Input term: sprintf
Needed 0.000508 for the search.
Needed 0.007636 to rank 5556 hits into 1176 docs
Needed 0.001463 to get the URIs.
Total time: 0.009619
[...]
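To put the two sets of numbers side by side, here is a small sketch that computes the ratio of the total times. The figures are copied from the "Total time" lines above; the constant and method names are made up for this example.

```ruby
# Total times copied from the output above (same unit for both engines).
FERRET   = { first: 8.886119, second: 0.326834 }
FTSEARCH = { first: 0.095685, second: 0.009619 }

# How many times as fast the second timing is compared to the first.
def speedup(slow, fast)
  slow / fast
end

%i[first second].each do |run|
  ratio = speedup(FERRET[run], FTSEARCH[run])
  puts format("%-6s query: FTSearch is about %.1f times as fast", run, ratio)
end
```

With these numbers the ratio is roughly 93 for the first query and 34 for the repeated one. Note that in the first run Ferret's search phase alone (0.050867) was actually faster than FTSearch's (0.085965); the gap in total time comes from fetching the URIs.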