Text Analytics in Enterprise Search

Posted on January 11, 2012 by

A presentation made by Daniel Ling at Apache Lucene Eurocon in Barcelona, october 2011.

We think this is the first of many forthcoming presentations.

We also want to get more involved in the community in the future. By doing presentations, sponsoring, contributing code. Hope to bring more news on this subject in the next few weeks. Enjoy the presentation:

Text Analytics in Enterprise Search, Daniel Ling, Findwise, Eurocon 2011 from Lucene Revolution on Vimeo.

Distributed processing + search == true?

Posted on September 30, 2011 by

In June 2011, I attended the Berlin Buzzwords conference. The main theme of the conference was undoubtedly the current paradigm shift in distributed processing, driven by the major success of Hadoop. Doug Cutting – founder of Apache projects such as Lucene, Nutch and Hadoop – held one of the keynotes. He focused on what he recognized as the new foundations for this paradigm shift:

– Commodity hardware
– Sequential file access
– Sharding
– Automated, high level reliability
– Open source

Distributed processing is done fairly well with Hadoop. Distributed search on the other hand is more or less limited to sharding and/or replicating the index. The downside of sharding is that you perform the same search on multiple servers and then need to combine the results. Due to the nature of algorithms in search such as tf/idf, tasks like ranking results suffers. Andrzej Białecki (another frequent Lucene committer) held a presentation on this topic, and his view can be summarized as: Use local search as long as you can, distribute only when the cost of local search limitations outweighs the cost of distributed search.

The setup of automated replication and sharding, with help from Zookeeper in the Solr Cloud project, is a major step in the right direction but the question on how to properly combine search results from different nodes still remains. One thing is sure though, there is a lot of interesting work being done in this area.

Solr 3.1 released

Posted on April 5, 2011 by

Last friday, Solr 3.1 was released along with Lucene 3.1. This might seem like a big step from previous version 1.4.1, but is an effect of the merged development for Solr and Lucene that took place a year ago. The Solr version now reflects the Lucene version that is used.

For a complete list of new features and enhancements, you can read the release notes. Though, some of the most interesting features are:

Extended dismax (edismax) query parser. It’s an enhancement over dismax, supports full lucene query syntax etc.
Spatial search (ie, we can now enable geo-search; sort by distance, boost by distance etc)
Numeric range facets.
Lots of optimizations and performance improvements, including better Unicode and 64-bit JVM support.

Update: There’s a good list of features and enhancements at Sematexts blog:

I’m really keen on the Spatial Search which open up a new set of applications, espeacially for Mobile Search where you have the advantage of knowing the position of the user.

I’m glad the community pulled of this release after the merge with Lucene and it will be fun to start working with it. What’s your favorite feature in 3.1? Drop a comment!

Apache Nutch Making Use of Open Pipeline

Posted on November 11, 2010 by

During the last couple of months I’ve been working on a project for Uppsala University. The project’s goal is to improve the findability on the university web site. The solution that we are working on is based on Apache Nutch 1.1 in conjunction with Apache Solr 1.4. Nutch provides us with a robust web crawler that scales very well and also gives us a page rank for each page that we can use for relevance tuning. Besides the web information crawled by Nutch, the search application will also be used to search people and organizational information that we index from another source. I thought that I would share some details on how we are using Nutch.

We have made two extensions to Nutch, one is a parser plug-in that can run Open Pipeline embedded in it. This was an important extension in order to get better control of the information that we index to Solr and also to be able to reuse our different Open Pipeline components. The main stages of the pipeline are the following:

Extract the encoding of a web page
Extract all links from a web page
Extract all headings (hx) from a web page
Remove all tags that don’t contain complete sentences on a web page
Extract text and metadata from different types of documents with Tika
Do some metadata mapping and cleaning
Populate facets according to metadata and/or URL
Do static URL ranking
Replace certain common titles with the largest heading of the web page

The other extension we made to Nutch is an indexing filter that makes sure all our metadata fields are indexed to Solr.

So far so good. The fetching, parsing and indexing works well now and currently our largest challenge is tuning all the different relevance parameters we have, as well as harmonizing the relevance of web information to that of people and organizational information. I will have to get back to you on how that went!

Comparing Open Source for Search

Posted on December 31, 2007 by

Even Gartner has talked about open source solutions as interesting search tools. For those of you who needs an introduction, a slideshow comparing Lucene, Solr and Nutch can be found here.

The Findability blog

the enterprise search and findability blog by Tietoevry Findwise

Tag Archives: Cross-platform software

Text Analytics in Enterprise Search

Solr 3.1 released

Apache Nutch Making Use of Open Pipeline

Comparing Open Source for Search