Processing Pipeline for the Google Search Appliance

Max has previously highlighted the subject of a processing pipeline for Apache Solr. Another enterprise search engine that lacks this feature is the Google Search Appliance (GSA). Today, I’d like to share what my colleagues and I have done to overcome this limitation in a couple of recent projects. A full discussion of why a pipeline is needed is beyond the scope of this post, but here are two brief examples of why I believe it is:

Normalize metadata across sources

In an enterprise search installation you make several sources searchable. Say you have indexed the intranet and a file share with documents. On the intranet, each page is tagged with an author in a metadata field called “creator”. The documents on the file share also carry author information, but it is stored in a metadata field called “author”. To find all information from a given author, you would need to know every field in every source that holds the author. Using a pipeline, you can map both the “creator” metadata from the intranet and the “author” metadata from the file share to one common field in the index (e.g. “author”), as the sketch below shows.
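
To make this concrete, here is a minimal sketch of such a normalization stage. The Document and PipelineStage types are made-up stand-ins for whatever abstractions your pipeline framework provides; only the mapping logic matters.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical pipeline abstractions; stand-ins for whatever your
// pipeline framework actually provides.
class Document {
    private final Map<String, String> metadata = new HashMap<String, String>();
    private byte[] content;

    String getMeta(String name)             { return metadata.get(name); }
    void setMeta(String name, String value) { metadata.put(name, value); }
    boolean hasMeta(String name)            { return metadata.containsKey(name); }
    void removeMeta(String name)            { metadata.remove(name); }
    byte[] getContent()                     { return content; }
    void setContent(byte[] content)         { this.content = content; }
}

interface PipelineStage {
    void process(Document doc);
}

// Maps the intranet's "creator" field to the common "author" field,
// so all sources expose the author under one name in the index.
class AuthorNormalizationStage implements PipelineStage {
    public void process(Document doc) {
        if (doc.hasMeta("creator")) {
            doc.setMeta("author", doc.getMeta("creator"));
            doc.removeMeta("creator");
        }
        // Documents that already carry "author" need no mapping.
    }
}
```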

Overcome GSA shortcomings

One shortcoming of the GSA is that it doesn’t calculate the file size of non-HTML documents. In a pipeline you can easily calculate the size of each incoming document, regardless of file format, and put it in a field in the GSA index.
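
Continuing the sketch above, and reusing its hypothetical Document and PipelineStage types, such a stage can be as simple as:

```java
// Computes the size of the raw content, whatever the file format,
// and stores it as a metadata field the GSA can index.
class FileSizeStage implements PipelineStage {
    public void process(Document doc) {
        byte[] content = doc.getContent();
        long sizeInBytes = (content == null) ? 0 : content.length;
        doc.setMeta("filesize", Long.toString(sizeInBytes));
    }
}
```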

So how have we done this? We wanted a standard, reusable architecture that wouldn’t interfere too much with a standard GSA setup, yet would cover as many sources as possible without adjustments to the specific connector being used.

Thus, we targeted the built-in crawler and the connector API for the solution, which can be described in a few steps:
  1. It resides as a stand-alone component between the GSA and the content sources
  2. The GSA fetches all content through this component
  3. The component delivers the content both to the GSA and to a stand-alone pipeline
  4. The content is indexed in the GSA right away; when the pipeline processing is done, the pipeline sends the updated content back to the GSA
The image below gives a visual overview, and the sketch that follows shows the same flow in code.
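
To illustrate steps 1 to 3, here is a minimal sketch of such an intermediary using only the JDK’s built-in HTTP server. The /fetch context, the url query parameter, and the queue hand-off are all assumptions made for this sketch; a production component would also forward headers, preserve the content type, and handle errors.

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpHandler;
import com.sun.net.httpserver.HttpServer;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.URL;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class ContentProxy {

    // Hand-off point to the stand-alone pipeline; the consumer side
    // (the pipeline itself) is not shown here.
    static final BlockingQueue<byte[]> pipelineQueue =
            new LinkedBlockingQueue<byte[]>();

    public static void main(String[] args) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        // The GSA crawls URLs like http://proxy:8080/fetch?url=<source-url>
        server.createContext("/fetch", new HttpHandler() {
            public void handle(HttpExchange exchange) throws IOException {
                String query = exchange.getRequestURI().getQuery();
                String sourceUrl = query.substring(query.indexOf("url=") + 4);

                // 1. Fetch the content from the real source.
                byte[] content = download(sourceUrl);

                // 2. Queue a copy for the pipeline.
                pipelineQueue.offer(content);

                // 3. Return the content to the GSA unchanged, so it can
                //    be indexed right away.
                exchange.sendResponseHeaders(200, content.length);
                OutputStream out = exchange.getResponseBody();
                out.write(content);
                out.close();
            }
        });
        server.start();
    }

    static byte[] download(String url) throws IOException {
        InputStream in = new URL(url).openStream();
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
            return buffer.toByteArray();
        } finally {
            in.close();
        }
    }
}
```

The point of step 3 is that the GSA indexes the raw content immediately, so crawling is never blocked by pipeline latency; the enriched version simply arrives later as an update.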

With this approach, once the solution is in place the pipeline can be used with the GSA crawler and with all connectors built on the connector API. We have also discussed extending the solution to support the plain feed protocol, which shouldn’t be much of a hassle. If you’re interested in finding out more about the solution, don’t hesitate to leave a comment or contact me. We will also put this up on Google Marketplace soon.
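
As a taste of what the feed delivery step involves, the sketch below pushes a processed document to the appliance’s feed port. The datasource name “pipeline” and the helper itself are invented for illustration, and the exact POST parameters and feed schema should be verified against the GSA Feeds Protocol documentation for your appliance version.

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

// Rough sketch of feeding processed content back to the GSA: an XML
// feed POSTed to the appliance's feedergate on port 19900.
public class GsaFeedClient {

    public static void pushContent(String gsaHost, String docUrl,
                                   String htmlContent) throws Exception {
        // htmlContent is assumed to be XML-escaped already (or you
        // would base64-encode it and set an encoding attribute).
        String feedXml =
              "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
            + "<!DOCTYPE gsafeed PUBLIC \"-//Google//DTD GSA Feeds//EN\" \"\">\n"
            + "<gsafeed>\n"
            + "  <header>\n"
            + "    <datasource>pipeline</datasource>\n"
            + "    <feedtype>incremental</feedtype>\n"
            + "  </header>\n"
            + "  <group>\n"
            + "    <record url=\"" + docUrl + "\" mimetype=\"text/html\">\n"
            + "      <content>" + htmlContent + "</content>\n"
            + "    </record>\n"
            + "  </group>\n"
            + "</gsafeed>";

        String body = "feedtype=incremental"
                + "&datasource=pipeline"
                + "&data=" + URLEncoder.encode(feedXml, "UTF-8");

        URL url = new URL("http://" + gsaHost + ":19900/xmlfeed");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type",
                "application/x-www-form-urlencoded");
        OutputStream out = conn.getOutputStream();
        out.write(body.getBytes("UTF-8"));
        out.close();
        System.out.println("GSA responded: " + conn.getResponseCode());
    }
}
```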

2 thoughts on “Processing Pipeline for the Google Search Appliance”

  1. Hi,

    We have a similar situation: we need to add some metadata to the documents. We need to get the information for the metadata from our content management system, append it to the document, and then feed it to the GSA. Where can I find more resources on this?

    Thanks

    Ganesan

  2. Hi Ganesan,

    This is interesting. Can you give us more information about how many and what kind of documents you want to augment with metadata? Also, what CMS do you have, and which GSA? You can write an email to andreas.franzon and me (svetoslav.marinov@findwise.com). We have made a lot of improvements to the preprocessing pipeline since this blog post was written. I look forward to your mail.
