Max has previously highlighted the subject of a processing pipeline for Apache Solr. Another enterprise search engine that lacks this feature is the Google Search Appliance (GSA). Today, I’d like to share what my colleagues and I have done to overcome this limitation in a couple of recent projects. A full discussion of why a pipeline is needed is beyond the scope of this post, but here are two brief examples:
Normalize metadata across sources
In an enterprise search installation you make several sources searchable. Say you have indexed the intranet and a file share with documents. On the intranet, each page is tagged with an author in a metadata field called “creator”. The documents on the file share also carry author information, but it is stored in a metadata field called “author”. To find all information from a given author, you would need to know every field in every source that holds the author. Using a pipeline, you can map the “creator” metadata from the intranet and the “author” metadata from the file share to a common field in the index (e.g. “author”).
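The mapping described above can be sketched roughly as follows. This is a minimal illustration, not our actual implementation, and all names (the source labels, the field map, the function) are hypothetical:

```python
# Hypothetical sketch: normalize author metadata from different sources
# into a common "author" field before the documents are indexed.

# Map each source's author field name to the common field.
AUTHOR_FIELD_BY_SOURCE = {
    "intranet": "creator",
    "fileshare": "author",
}

def normalize_author(doc: dict, source: str) -> dict:
    """Copy the source-specific author field into the common 'author' field."""
    source_field = AUTHOR_FIELD_BY_SOURCE.get(source)
    if source_field and source_field in doc:
        doc["author"] = doc[source_field]
    return doc

intranet_page = {"title": "News", "creator": "Jane Doe"}
print(normalize_author(intranet_page, "intranet")["author"])  # Jane Doe
```

With this in place, a search restricted to the common “author” field finds documents from both sources.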
Overcome GSA shortcomings
One shortcoming of the GSA is that it doesn’t calculate the file size of non-HTML documents. In a pipeline you can easily calculate the size of each incoming document, regardless of file format, and store it in a field in the GSA index.
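A pipeline stage for this can be as small as the sketch below. The function name and the “file-size” field name are hypothetical, and the metadata value is kept as a string on the assumption that it will end up in a text-based feed:

```python
# Hypothetical sketch: compute the size of each incoming document and add it
# as a metadata field, so it becomes searchable in the GSA index.

def add_file_size(content: bytes, metadata: dict) -> dict:
    """Store the document's size in bytes in a metadata field."""
    metadata["file-size"] = str(len(content))  # string value for a text-based feed
    return metadata

pdf_bytes = b"%PDF-1.4 ..."  # stand-in for a fetched PDF document
meta = add_file_size(pdf_bytes, {"title": "Report"})
print(meta["file-size"])  # 12
```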
So how have we done this? We wanted a standard, reusable architecture that wouldn’t interfere too much with a standard GSA setup, yet would cover as many sources as possible without adjustments to the specific connector being used.
Thus, we targeted the built-in crawler and the Connector API for the solution, which can be described in a few steps:
- It resides as a stand-alone component between the GSA and the content sources
- The GSA fetches the content through the component
- The component delivers the content to both the GSA and a standalone pipeline
- The GSA indexes the content immediately; when the pipeline finishes processing, it sends the updated content back to the GSA.
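The steps above can be sketched as a simple tee: content is delivered to the GSA right away and queued for the pipeline, which later re-feeds the enriched version. The in-memory structures and function names below are stand-ins, not our component’s actual interfaces:

```python
# Minimal sketch of the tee behaviour, with in-memory stubs standing in for
# the GSA feed and the pipeline queue (all names hypothetical).

gsa_index = {}       # stands in for documents fed to the GSA
pipeline_queue = []  # stands in for documents awaiting pipeline processing

def fetch(url: str) -> bytes:
    # Stub fetch; the real component proxies the request to the content source.
    return b"<html>content of " + url.encode() + b"</html>"

def fetch_and_tee(url: str) -> bytes:
    """Deliver fetched content to both the GSA and the pipeline."""
    content = fetch(url)
    gsa_index[url] = content               # GSA indexes the raw content right away
    pipeline_queue.append((url, content))  # pipeline processes it asynchronously
    return content

def run_pipeline() -> None:
    """Process queued documents and re-feed the enriched versions to the GSA."""
    while pipeline_queue:
        url, content = pipeline_queue.pop(0)
        gsa_index[url] = content + b"<!-- enriched -->"  # placeholder processing

fetch_and_tee("http://intranet/page1")
run_pipeline()
```

The point of indexing first and enriching later is that the documents are searchable immediately, and the pipeline’s updates simply replace them once processing completes.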
The image below will give you a visual overview.
With this approach, once the solution is in place the pipeline can be used with the GSA crawler and all connectors built on the Connector API. We’ve also discussed extending the solution to support the plain feed protocol, which shouldn’t be much of a hassle. If you’re interested in finding out more about the solution, don’t hesitate to leave a comment or contact me. We will also put this up on Google Marketplace.