Text Analytics in Enterprise Search

A presentation given by Daniel Ling at Apache Lucene Eurocon in Barcelona, October 2011.

We think this is the first of many forthcoming presentations.

We also want to get more involved in the community in the future by giving presentations, sponsoring, and contributing code. We hope to bring more news on this subject in the next few weeks. Enjoy the presentation:

Text Analytics in Enterprise Search, Daniel Ling, Findwise, Eurocon 2011 from Lucene Revolution on Vimeo.

Distributed processing + search == true?

In June 2011, I attended the Berlin Buzzwords conference. The main theme of the conference was undoubtedly the current paradigm shift in distributed processing, driven by the major success of Hadoop. Doug Cutting – founder of Apache projects such as Lucene, Nutch and Hadoop – held one of the keynotes. He focused on what he recognized as the new foundations for this paradigm shift:

– Commodity hardware
– Sequential file access
– Sharding
– Automated, high level reliability
– Open source

Distributed processing is done fairly well with Hadoop. Distributed search, on the other hand, is more or less limited to sharding and/or replicating the index. The downside of sharding is that you perform the same search on multiple servers and then need to combine the results. Because scoring algorithms such as tf/idf rely on statistics from the whole index, and each shard only sees its own part, tasks like ranking results suffer. Andrzej Białecki (another frequent Lucene committer) gave a presentation on this topic, and his view can be summarized as: use local search as long as you can, and distribute only when the cost of local search limitations outweighs the cost of distributed search.
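To make the ranking problem concrete, here is a minimal Python sketch. It is not Lucene code: the two shards and their documents are made up for illustration, and the idf formula is only modelled on the classic 1 + ln(N / (df + 1)) form. It shows that the same term gets a different idf on each shard than it would in a single combined index, so scores computed locally are not directly comparable.

```python
import math

def idf(num_docs, doc_freq):
    """Smoothed inverse document frequency: 1 + ln(N / (df + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

# Hypothetical shards: each shard is a list of documents,
# and each document is simply a set of terms.
shard_a = [{"lucene", "search"}, {"lucene", "index"}, {"lucene", "hadoop"}, {"solr"}]
shard_b = [{"hadoop", "cluster"}, {"hadoop", "mapreduce"}, {"lucene", "hadoop"}, {"nutch"}]

def doc_freq(docs, term):
    """Number of documents in `docs` that contain `term`."""
    return sum(1 for doc in docs if term in doc)

def local_idf(docs, term):
    """idf as a single shard would compute it, using only its own statistics."""
    return idf(len(docs), doc_freq(docs, term))

def global_idf(shards, term):
    """idf computed from statistics summed over all shards."""
    total_docs = sum(len(shard) for shard in shards)
    total_df = sum(doc_freq(shard, term) for shard in shards)
    return idf(total_docs, total_df)

idf_a = local_idf(shard_a, "lucene")                 # term is common on shard A
idf_b = local_idf(shard_b, "lucene")                 # term is rare on shard B
idf_all = global_idf([shard_a, shard_b], "lucene")   # the "true" value

print(f"shard A: {idf_a:.3f}  shard B: {idf_b:.3f}  global: {idf_all:.3f}")
```

In this toy data, a document matching "lucene" on shard B is scored with a higher idf than an identical document on shard A, purely because of how the documents happened to be partitioned.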

The setup of automated replication and sharding, with help from ZooKeeper in the Solr Cloud project, is a major step in the right direction, but the question of how to properly combine search results from different nodes still remains. One thing is sure, though: there is a lot of interesting work being done in this area.

Solr 3.1 released

Last Friday, Solr 3.1 was released along with Lucene 3.1. This might seem like a big step from the previous version, 1.4.1, but it is an effect of the merged development of Solr and Lucene that took place a year ago. The Solr version now reflects the Lucene version that is used.

For a complete list of new features and enhancements, you can read the release notes. Some of the most interesting features are:

  • Extended dismax (edismax) query parser. It’s an enhancement over dismax that, among other things, supports the full Lucene query syntax.
  • Spatial search (i.e., we can now enable geo-search: sort by distance, boost by distance, etc.)
  • Numeric range facets.
  • Lots of optimizations and performance improvements, including better Unicode and 64-bit JVM support.
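As a rough illustration of how the first three features might be combined in one request, the Python sketch below only builds a query URL; it does not contact a server. The host, the field names (`title`, `body`, `location`, `price`) and the facet ranges are assumptions made up for this example, and the parameter names should be checked against the Solr documentation for the version you run.

```python
from urllib.parse import urlencode

# Assumed endpoint of a local Solr instance; adjust to your own setup.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"

def build_query_url(user_query, lat, lon, radius_km):
    """Build a select URL using edismax, a geo filter and a numeric range facet."""
    params = [
        ("q", user_query),
        ("defType", "edismax"),           # extended dismax query parser
        ("qf", "title^2 body"),           # hypothetical fields, title boosted
        ("fq", "{!geofilt}"),             # keep only hits within the radius
        ("sfield", "location"),           # hypothetical lat/lon field
        ("pt", f"{lat},{lon}"),           # the user's position
        ("d", str(radius_km)),            # radius in kilometres
        ("sort", "geodist() asc"),        # nearest results first
        ("facet", "true"),
        ("facet.range", "price"),         # hypothetical numeric field
        ("facet.range.start", "0"),
        ("facet.range.end", "1000"),
        ("facet.range.gap", "100"),
    ]
    return SOLR_SELECT_URL + "?" + urlencode(params)

# Example: search around central Barcelona within 5 km.
print(build_query_url("coffee", 41.39, 2.17, 5))
```

Building the parameters as a list of pairs keeps them in a readable order and lets `urlencode` take care of escaping characters such as braces, commas and spaces.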

Update: There’s a good list of features and enhancements at Sematext’s blog:

I’m really keen on spatial search, which opens up a new set of applications, especially for mobile search, where you have the advantage of knowing the position of the user.

I’m glad the community pulled off this release after the merge with Lucene, and it will be fun to start working with it. What’s your favorite feature in 3.1? Drop a comment!

Comparing Open Source for Search

Even Gartner has talked about open source solutions as interesting search tools. For those of you who need an introduction, a slideshow comparing Lucene, Solr and Nutch can be found here.