Java Buzz Forum - Indexing Object Graphs with Lucene

And he is right, I wasn't there -- it was simply flat indexing. Never one to turn down a fun sounding problem (I am weak that way) I spent Sunday afternoon (while Joy was out looking for wedding stuff -- not enough time to hack when planning a wedding, when does the "Yes dear, that sounds good" phase kick in?) tossing together a tool do build indexes on arbitrary object graphs in a useful way. Had some success too! Here's a source xref of the results, with some junit code showing how to use it

The code basically builds an index on the graph allowing queries of the form beer.name: Schlitz to look for instances of beer with the name field being Schlitz. A more fun one would be cooler.beer.name: Shitz~ AND cooler.location: My House would hit on the documents indexing org.skife.lucene.graph.helper.Cooler which contain beer whose name is like "Shitz~" (ie, Schlitz) and whose location is "My House".

    public void setUp() throws Exception {
        final SimpleNameMapper mapper = new SimpleNameMapper();

        indexer = new GraphIndexer(new MetadataFactory() {
            public Field[] build(final Object entity) {
                final Field name = Field.Text("name", mapper.build(entity));
                return new Field[]{name};
            }
        });
    }    

    public void testFuzzy() throws Exception {
        final Cooler cooler = new Cooler(1);
        cooler.stock(new Beer(2, "Schlitz"));
        cooler.stock(new Beer(3, "Caffreys"));
        cooler.stock(new Beer(4, "McEwan's"));
        final File index = indexer.index(cooler);
        final IndexReader reader = IndexReader.open(index);
        final IndexSearcher searcher = new IndexSearcher(reader);

        final Query query = parser.parse("cooler.beer.name: Schitz~");
        final Hits hits = searcher.search(query);

        assertEquals(1, hits.length());
    }

The GraphIndexer builds an index (or can add to an existing) via the final File index = indexer.index(cooler); call, returning the directory (on filesystem) where the index is stored. The above index will create the fields:

Field: cooler.beer.name
Field: beers
Field: cooler.beer.identity
Field: beer.name
Field: identity
Field: cooler.identity
Field: beer.identity
Field: cooler.beer
Field: name
Field: cooler.beers

And populate the correct fields onto the correct documents. You can query against the index on simple property names (name, identity, etc) or specify types (beer.name, cooler.identity) etc. This allows all the following to be useful (? ;-) queries:

cooler.identity: 7
cooler.beer.name: mcewans
name: schlitz
cozy.beer.name: Bob OR cozy.owner: Bob
identity: 1 OR 2 OR 3
beer.identity: 1 OR 2 OR 3
beer.name: Coors OR cooler.beer.name: Schlitz
beer
cooler

Also notice the MetadataFactory passed to the GraphIndexer. This is just a convenient way to add additional fields on a per instance basis. I use it here to add name field to every indexed instance where the value is the simple mapped class name (drop package and downcase). I also use the name field as the default search field, so you can do nice searches like beer quality: good and get back hits for all instances of good beer. In a real application I would add the information required to query for the entity, such as the class name and pk value (using OJB probably just the stringified Identity for the instance, as I can extract all of that from the identity =)

Right now it has a couple quirks. Being a sunday afternoon hack (and Joy getting back from looking for wedding stuff) there is one case which is not handled, which is downstream mapping from a cycle in the graph. This is a *very* small case though, and won't be difficult to add when I get a chance. The second gotcha is that if you have a really big interconnected graph (say several gigabytes) it will be interesting to index because as it is right now it needs to keep the full graph in memory while it indexes. I don't think this will be gotten around without implementing some kind of relationship-only caching, which will be almost as memory intensive as the whole graph -- at least in the same order of magnitude if with a smaller constant. The workaround is to index aggregates recognizing that between aggregates property chaining will be slightly off. For graphs which are fairly hierarchical (you can keep all cycles in the same agrgegate being indexed at once) this will all properly index.

Most of the behavior is configurable, from the manner in which it traverses the graph, to how it names root classes, to how it stringifies things, to filtering out instances. The defaults should be pretty reasonable (go read what they do) for most cases and playing around though.

Lastly, it has a couple dependencies. The first I am happy about, it uses the grafolia library I wrote for object graph manipulation for OJB 1.1. I pulled it out from OJB because I realized it was the type of code I had implemented many times before (arbitrary object graph traversal type stuff). Glad to see it fit naturally here. The other dependency is commons-beanutils because it is just so much more convenient than using java.beans. BeanUtils drags commons-logging along with it. Sorry.

Tarball is available if anyone wants to play. It is a pretty basic little maven project right now (javadocs, xrefs, etc posted), so should be easy to build. Remember, maven idea and maven eclipse are your friends ;-)


	Web Artima.com