The Artima Developer Community
Ruby Buzz Forum
Outperforming Ferret at searching, 3X faster indexing, code online

Eigen Class

Nickname: eigenclass

Eigenclass is a hardcore Ruby blog.
Outperforming Ferret at searching, 3X faster indexing, code online Posted: Dec 4, 2006 12:05 PM

This post originated from an RSS feed registered with Ruby Buzz by Eigen Class.
Original Post: Outperforming Ferret at searching, 3X faster indexing, code online
Feed Title: Eigenclass
Feed URL: http://feeds.feedburner.com/eigenclass
Feed Description: Ruby stuff --- trying to stay away from triviality.

The last time I blogged about the FTSearch (simple) full-text search engine, it already indexed the Reuters corpus more than twice as fast as Ferret. I have since rewritten a few more methods in C and gained an extra 50% in speed, so it now indexes over 3 times faster than Ferret.

Today's news is that I've implemented enough functionality to start comparing search performance. FTSearch often outperforms Ferret (but read below).

The code

FTSearch is currently powered by some 500 lines of C and 800 lines of Ruby, compared to over 50,000 lines of C for Ferret. Of course, it is far from being as featureful as Ferret, but the #1 advantage of being small is that it is easier to debug, and I think it can be made more reliable than Ferret with relatively little effort. Ferret often crashed with a segfault while I was timing it; for instance, it was really angry at me when I searched the Linux sources for "void".

You can have a look at the code at http://eigenclass.org/repos/ftsearch/head

Search performance

Here are some times for the Linux corpus (160MB of .c and .h files, 20 million suffixes).
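
The mention of "suffixes" suggests a suffix-array style index. The following is a minimal sketch of that general technique, not FTSearch's actual code: the method names `build_suffix_array` and `lookup` are hypothetical, and the naive construction is only suitable for toy inputs.

```ruby
# Build a naive suffix array: the start offsets of every suffix, sorted
# by the suffix text. Fine for a toy string, far too slow for 160MB.
def build_suffix_array(text)
  (0...text.size).sort_by { |i| text[i..-1] }
end

# Return the offsets of all suffixes that start with +term+.
# Two binary searches find the boundaries of the matching range.
def lookup(text, sa, term)
  len = term.size
  first = sa.bsearch_index { |i| text[i, len] >= term }
  return [] if first.nil? || text[sa[first], len] != term
  last = sa.bsearch_index { |i| text[i, len] > term } || sa.size
  sa[first...last].sort
end

text = "sprintf(buf); printf(x); sprintf(y);"
sa = build_suffix_array(text)
p lookup(text, sa, "sprintf")  # => [0, 25]
p lookup(text, sa, "void")     # => []
```

Once the array is built, each lookup costs a logarithmic number of string comparisons regardless of how many documents the corpus holds, which is one plausible reason the search step itself can be so short.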

Word search

Ferret
   $ ruby ferret-lookup.rb 
   Input term: sprintf
   Needed 0.050867 for the search.
   Needed 8.83525 to get the URIs.
   Total time: 8.886119
   Total matches: 1165
   Showing top 10 matches:
   0.100 corpus/linux/drivers/usb/host/uhci-debug.c
   0.097 corpus/linux/drivers/scsi/aic7xxx_old/aic7xxx_proc.c
   0.090 corpus/linux/drivers/isdn/hisax/q931.c
   0.089 corpus/linux/drivers/s390/sysinfo.c
   0.086 corpus/linux/drivers/pci/hotplug/cpqphp_sysfs.c
   0.086 corpus/linux/drivers/pci/hotplug/shpchp_sysfs.c
   0.083 corpus/linux/drivers/mca/mca-proc.c
   0.077 corpus/linux/drivers/dio/dio-sysfs.c
   0.077 corpus/linux/drivers/net/wan/lmc/lmc_debug.c
   0.073 corpus/linux/drivers/media/radio/miropcm20-rds.c
   0.072 corpus/linux/drivers/video/console/promcon.c
   Input term: sprintf
   Needed 0.004382 for the search.
   Needed 0.322449 to get the URIs.
   Total time: 0.326834
   Total matches: 1165
   [...]

FTSearch
   $ ruby sample-lookup.rb 
   Input term: sprintf
   Needed 0.085965 for the search.
   Needed 0.008282 to rank 5556 hits into 1176 docs
   Needed 0.001414 to get the URIs.
   Total time: 0.095685
   Showing top 10 matches:
    44576  4913 corpus/linux/drivers/scsi/aic7xxx_old/aic7xxx_proc.c
    41280  5073 corpus/linux/drivers/usb/host/uhci-debug.c
    39200  4647 corpus/linux/drivers/s390/sysinfo.c
    38720  4557 corpus/linux/drivers/pci/hotplug/cpqphp_sysfs.c
    38704  4582 corpus/linux/drivers/pci/hotplug/shpchp_sysfs.c
    34210  3640 corpus/linux/drivers/isdn/hisax/q931.c
    31556  3718 corpus/linux/drivers/mca/mca-proc.c
    27920  3839 corpus/linux/drivers/media/radio/miropcm20-rds.c
    26925  4425 corpus/linux/drivers/net/wan/lmc/lmc_debug.c
    25770  3228 corpus/linux/drivers/dio/dio-sysfs.c
    25699  5293 corpus/linux/drivers/video/console/promcon.c
   Input term: sprintf
   Needed 0.000508 for the search.
   Needed 0.007636 to rank 5556 hits into 1176 docs
   Needed 0.001463 to get the URIs.
   Total time: 0.009619
   [...]
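
The "rank 5556 hits into 1176 docs" step in the FTSearch output collapses individual hit offsets into per-document results. As a rough illustration only, here is a sketch that ranks documents by plain hit count; the document table, the `doc_index` and `rank` helpers, and the scoring rule are all assumptions, and FTSearch's real scoring (the two numeric columns above) is not reproduced here.

```ruby
# Map a corpus offset to the index of the document containing it:
# the last document whose start offset is <= the hit offset.
# +docs+ is a list of [start_offset, uri] pairs sorted by start offset.
def doc_index(docs, offset)
  idx = docs.bsearch_index { |start, _| start > offset }
  (idx || docs.size) - 1
end

# Group hit offsets by document and order documents by hit count,
# most hits first (ties broken by document order).
def rank(docs, hits)
  counts = Hash.new(0)
  hits.each { |off| counts[doc_index(docs, off)] += 1 }
  counts.sort_by { |doc, n| [-n, doc] }.map { |doc, n| [n, docs[doc][1]] }
end

# Hypothetical document table and hit offsets, for illustration only.
docs = [[0, "corpus/a.c"], [100, "corpus/b.c"], [250, "corpus/c.h"]]
hits = [5, 120, 130, 99, 260, 140]
rank(docs, hits).each { |n, uri| puts "#{n} #{uri}" }
# prints:
# 3 corpus/b.c
# 2 corpus/a.c
# 1 corpus/c.h
```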

Phrase search


Read more...
