The Findability blog

the enterprise search and findability blog by Tietoevry Findwise


Apache Nutch Making Use of Open Pipeline

Posted on November 11, 2010

For the last couple of months I’ve been working on a project for Uppsala University. The goal of the project is to improve findability on the university web site. Our solution is based on Apache Nutch 1.1 in conjunction with Apache Solr 1.4. Nutch provides us with a robust web crawler that scales very well, and it also gives us a page rank for each page that we can use for relevance tuning. Besides the web content crawled by Nutch, the search application will also be used to search people and organizational information that we index from another source. I thought I would share some details on how we are using Nutch.

We have made two extensions to Nutch. The first is a parser plug-in that runs Open Pipeline embedded in it. This extension was important because it gives us better control over the information that we index into Solr and lets us reuse our existing Open Pipeline components. The main stages of the pipeline are the following:

  1. Extract the encoding of a web page
  2. Extract all links from a web page
  3. Extract all headings (h1 to h6) from a web page
  4. Remove all elements that don’t contain complete sentences from a web page
  5. Extract text and metadata from different types of documents with Tika
  6. Do some metadata mapping and cleaning
  7. Populate facets according to metadata and/or URL
  8. Do static URL ranking
  9. Replace certain common titles with the largest heading of the web page
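
To make the idea of chained stages a bit more concrete, here is a minimal sketch in Python of three of the stages above (heading extraction, static URL ranking and title replacement). It is not our actual Open Pipeline configuration, which is Java based. The document dictionary, the function names, the `GENERIC_TITLES` set and the depth-based ranking formula are all assumptions made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class HeadingCollector(HTMLParser):
    """Collects (level, text) pairs for h1..h6 elements."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self._level = None  # level of the heading currently open, if any
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self._level = int(tag[1])
            self._buf = []

    def handle_data(self, data):
        if self._level is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if self._level is not None and tag == f"h{self._level}":
            # Collapse runs of whitespace inside the heading text.
            text = " ".join("".join(self._buf).split())
            if text:
                self.headings.append((self._level, text))
            self._level = None


def extract_headings(doc):
    collector = HeadingCollector()
    collector.feed(doc["html"])
    doc["headings"] = collector.headings
    return doc


# Hypothetical list of titles considered too generic to be useful.
GENERIC_TITLES = {"untitled", "home", "start"}


def replace_generic_title(doc):
    title = doc.get("title", "").strip()
    if title.lower() in GENERIC_TITLES and doc.get("headings"):
        # The lowest level number is the largest heading; first one wins on ties.
        doc["title"] = min(doc["headings"], key=lambda h: h[0])[1]
    return doc


def static_url_rank(doc):
    # Assumed heuristic: pages closer to the site root get a higher static rank.
    depth = len([part for part in urlparse(doc["url"]).path.split("/") if part])
    doc["url_rank"] = 1.0 / (1 + depth)
    return doc


PIPELINE = [extract_headings, replace_generic_title, static_url_rank]


def run_pipeline(doc, stages=PIPELINE):
    """Pass the document through each stage in order."""
    for stage in stages:
        doc = stage(doc)
    return doc
```

Each stage takes a document and returns it, so stages can be reordered or reused in other pipelines, which is the property that made running the pipeline inside the parser plug-in attractive to us.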

The second extension is an indexing filter that makes sure all our metadata fields are indexed into Solr.
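
Conceptually, the indexing filter copies a known set of metadata fields from the parse result into the document that is sent to the index. The sketch below models that idea in Python. It is not the Nutch API (a real filter is a Java class implementing Nutch's `IndexingFilter` interface), and the field names in `METADATA_FIELDS` as well as the function name are hypothetical.

```python
# Hypothetical metadata fields that should always reach the index.
METADATA_FIELDS = ("department", "doctype", "language")


def add_metadata_fields(index_doc, parse_metadata, fields=METADATA_FIELDS):
    """Return a copy of index_doc with the selected metadata fields added.

    Values are normalized to lists of non-empty, stripped strings so that
    multi-valued fields are handled the same way as single-valued ones.
    """
    out = dict(index_doc)
    for name in fields:
        values = parse_metadata.get(name)
        if values is None:
            continue
        if isinstance(values, str):
            values = [values]
        cleaned = [v.strip() for v in values if v and v.strip()]
        if cleaned:
            out[name] = cleaned
    return out
```

Metadata that is not listed in `METADATA_FIELDS` is left out, so the set of fields reaching the index stays explicit and easy to keep in sync with the Solr schema.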

So far so good. Fetching, parsing and indexing now work well, and our largest challenge at the moment is tuning all the relevance parameters we have, as well as harmonizing the relevance of web content with that of people and organizational information. I will have to get back to you on how that went!

Posted in Data Processing, Findability, Lucene, Open source, Solr | Tagged Apache HTTP Server, Apache Software Foundation, Apache Solr, Cross-platform software, Doug Cutting, findability, internet search engines, Knowledge representation, Lucene, Metadata, Nutch, Open Pipeline, search application, university web site, Uppsala University, web crawler, web information
