May 1, 2011

Lucene vs. Terrier

Several text retrieval toolkits are available for building search engines or simply testing a new search algorithm. I have used Lucene, Lemur and Terrier. Lucene makes it a snap to build an application layer over the search functionality. It is fast, easy to understand, and well suited to building a commercial app. One nice feature of Lucene is its support for query-biased summarization. The downside is that it has not adopted some of the state-of-the-art models and still relies on basic Boolean retrieval and a basic vector space model. It is currently relatively difficult to integrate new retrieval models into Lucene because statistics such as average document length are missing. I am not sure whether this problem has been fixed in recent releases. I recently noticed a BM25 implementation for Lucene floating around, so you may want to check that out.
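
To see why a statistic like average document length matters, here is a minimal sketch of Okapi BM25 scoring in plain Python. It is not Lucene's or Terrier's code: the function name `bm25_score`, the toy documents, the particular IDF variant and the defaults `k1=1.2`, `b=0.75` are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_score(query_terms, doc_terms, doc_freqs, num_docs, avg_doc_len,
               k1=1.2, b=0.75):
    """Score one document against a query with a common BM25 variant."""
    tf = Counter(doc_terms)
    doc_len = len(doc_terms)
    # Length normalization: this is the part that needs avg_doc_len.
    norm = 1 - b + b * doc_len / avg_doc_len
    score = 0.0
    for term in query_terms:
        f = tf.get(term, 0)
        if f == 0:
            continue  # term absent from the document contributes nothing
        n = doc_freqs.get(term, 0)  # number of documents containing the term
        idf = math.log(1 + (num_docs - n + 0.5) / (n + 0.5))
        score += idf * f * (k1 + 1) / (f + k1 * norm)
    return score


# Toy collection (hypothetical data, for illustration only).
docs = [
    "lucene is a search library".split(),
    "terrier supports many retrieval models for search research".split(),
    "search search search".split(),
]
doc_freqs = Counter(term for d in docs for term in set(d))
avg_doc_len = sum(len(d) for d in docs) / len(docs)

scores = [
    bm25_score(["search", "retrieval"], d, doc_freqs, len(docs), avg_doc_len)
    for d in docs
]
```

Without `avg_doc_len` the `norm` factor cannot be computed, which is the kind of gap that makes a model like this awkward to add to a toolkit that does not keep that statistic.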

My experience with Terrier has been quite encouraging. The best part about Terrier (Terrier 3.0) is that it supports most state-of-the-art retrieval models, such as Dirichlet prior language models, Divergence From Randomness (DFR) models and Okapi BM25, so it is highly recommended for research use. Building an application layer over Terrier may require some additional understanding of its internal workings. Here is where you can get started. The downside of Terrier is that it does not support incremental indexing and you cannot easily generate query-biased summaries, so building an application will require some tweaks.
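
For readers who have not met the Dirichlet prior language model mentioned above, here is a rough Python sketch of the query-likelihood score with Dirichlet smoothing. It is not taken from Terrier: the function name `dirichlet_log_likelihood`, the toy collection, and the choice to skip query terms that never occur in the collection are assumptions made for this example.

```python
import math
from collections import Counter


def dirichlet_log_likelihood(query_terms, doc_terms, collection_tf,
                             collection_len, mu=2000.0):
    """Log query likelihood under a Dirichlet-smoothed document model."""
    tf = Counter(doc_terms)
    doc_len = len(doc_terms)
    log_p = 0.0
    for term in query_terms:
        p_coll = collection_tf.get(term, 0) / collection_len
        if p_coll == 0:
            continue  # assumption: ignore terms unseen in the whole collection
        # Smoothed estimate: document counts backed off to the collection model.
        p = (tf.get(term, 0) + mu * p_coll) / (doc_len + mu)
        log_p += math.log(p)
    return log_p


# Toy collection (hypothetical data, for illustration only).
lm_docs = [
    "lucene is a search library".split(),
    "terrier supports many retrieval models for search research".split(),
]
collection_tf = Counter(term for d in lm_docs for term in d)
collection_len = sum(len(d) for d in lm_docs)

lm_scores = [
    dirichlet_log_likelihood(["retrieval"], d, collection_tf, collection_len)
    for d in lm_docs
]
```

Like BM25, this model needs collection-level statistics (here, term counts over the whole collection) in addition to per-document counts.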

Update 06/22/11: As of Terrier 3.5, the following features are now supported:
  • Out-of-the-box support for query-biased summaries
  • A good example of how to set up a web-based interface
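
In case the term is unfamiliar, a query-biased summary is a snippet picked from the document because it matches the query. The sketch below shows the basic idea in Python by keeping the sentences with the most query-term hits. It is only an illustration and not how Terrier or Lucene implement it; the function name `query_biased_summary` and the simple sentence splitting are assumptions.

```python
import re


def query_biased_summary(text, query, max_sentences=1):
    """Return the sentences that contain the most query terms."""
    query_terms = set(query.lower().split())
    # Naive sentence splitting on ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    scored = []
    for pos, sentence in enumerate(sentences):
        tokens = re.findall(r"\w+", sentence.lower())
        hits = sum(1 for t in tokens if t in query_terms)
        scored.append((hits, -pos, sentence))  # ties go to earlier sentences
    top = sorted(scored, reverse=True)[:max_sentences]
    top.sort(key=lambda item: -item[1])  # put chosen sentences back in order
    return " ".join(s for hits, _, s in top if hits > 0)


sample = ("Terrier is a retrieval toolkit. "
          "It supports BM25 and language models. "
          "The weather is nice.")
summary = query_biased_summary(sample, "language models")
```
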

Here are some Lucene-related links:

Here is a very short tutorial on Terrier.