During the last couple of months I’ve been working on a project for Uppsala University. The project’s goal is to improve findability on the university’s web site. The solution we are working on is based on Apache Nutch 1.1 in conjunction with Apache Solr 1.4. Nutch provides us with a robust web crawler that scales very well and also gives us a page rank for each page, which we can use for relevance tuning. Besides the web information crawled by Nutch, the search application will also be used to search for people and organizational information that we index from another source. I thought I would share some details on how we are using Nutch.
We have made two extensions to Nutch. The first is a parser plug-in that runs Open Pipeline embedded in Nutch. This extension was important for getting better control over the information we index to Solr, and for reusing our existing Open Pipeline components. The main stages of the pipeline are the following:
- Extract the encoding of a web page
- Extract all links from a web page
- Extract all headings (h1–h6) from a web page
- Remove all tags on a web page that don’t contain complete sentences
- Extract text and metadata from different types of documents with Tika
- Do some metadata mapping and cleaning
- Populate facets according to metadata and/or URL
- Do static URL ranking
- Replace certain common titles with the largest heading of the web page
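To give a feel for what a stage does, the last one can be sketched as plain Java. This is an illustrative stand-in, not the actual Open Pipeline stage API; the list of generic titles and all names below are assumptions:

```java
import java.util.Arrays;
import java.util.List;

/**
 * Sketch of the pipeline stage that replaces certain common,
 * uninformative titles with the largest heading of the web page.
 * Class name, method names and the generic-title list are assumptions.
 */
public class TitleStage {

    // Titles considered too generic to be useful as a search result title.
    private static final List<String> GENERIC_TITLES =
            Arrays.asList("untitled", "untitled document", "home", "index");

    /**
     * Returns the page's largest heading (the first element of headings,
     * assumed to be ordered h1 before h2, and so on) when the current
     * title is generic or empty; otherwise keeps the original title.
     */
    public static String chooseTitle(String title, List<String> headings) {
        boolean generic = title == null || title.trim().isEmpty()
                || GENERIC_TITLES.contains(title.trim().toLowerCase());
        if (generic && !headings.isEmpty()) {
            return headings.get(0);
        }
        return title;
    }

    public static void main(String[] args) {
        // A generic title is replaced by the top heading; a real one is kept.
        System.out.println(chooseTitle("Untitled",
                Arrays.asList("Student admissions")));
        System.out.println(chooseTitle("Course catalogue",
                Arrays.asList("Courses")));
    }
}
```

In the real pipeline this stage runs after heading extraction, so it can simply read the headings that an earlier stage has already put on the document.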
The other extension we made to Nutch is an indexing filter that ensures all our metadata fields are indexed to Solr.
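Conceptually, such a filter just copies each metadata field produced by the parse onto the document that is sent to Solr. The sketch below is deliberately simplified: the Map-based types stand in for Nutch’s own document and metadata classes so the example is self-contained, and the field names are made up:

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Simplified stand-in for an indexing filter that copies every metadata
 * field from the parse stage onto the document sent to Solr. Plain Maps
 * replace Nutch's document and metadata classes to keep this runnable
 * on its own; field names in main() are hypothetical.
 */
public class MetadataIndexingFilter {

    /** Adds each parse metadata entry as a field on the document. */
    public static Map<String, String> filter(Map<String, String> doc,
                                             Map<String, String> parseMetadata) {
        for (Map.Entry<String, String> e : parseMetadata.entrySet()) {
            // In a real Nutch filter this would be an add() call on the
            // document object rather than a Map put.
            doc.put(e.getKey(), e.getValue());
        }
        return doc;
    }

    public static void main(String[] args) {
        Map<String, String> doc = new HashMap<>();
        Map<String, String> meta = new HashMap<>();
        meta.put("facet.category", "education"); // hypothetical field names
        meta.put("urlrank", "0.8");
        System.out.println(filter(doc, meta));
    }
}
```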
So far so good. Fetching, parsing and indexing now work well, and currently our biggest challenge is tuning all the different relevance parameters, as well as harmonizing the relevance of the web information with that of the people and organizational information. I will have to get back to you on how that went!
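One place where that tuning happens in Solr 1.4 is the dismax request handler, whose field boosts and boost functions can fold a page-level score into the ranking. The fragment below is purely hypothetical; the handler name, field names and boost values are assumptions, not our actual configuration:

```xml
<!-- solrconfig.xml: hypothetical dismax handler. The fields title,
     content and pagerank, and the boost values, are assumptions. -->
<requestHandler name="/search" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <!-- query fields with per-field boosts -->
    <str name="qf">title^4.0 content^1.0</str>
    <!-- boost function folding a Nutch link score into the ranking -->
    <str name="bf">pagerank^0.5</str>
  </lst>
</requestHandler>
```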