Solr Processing Pipeline

Hi again Internet,

For once I have had time to do some thinking. Why is there no powerful data processing layer between the Lucene Connector Framework and Solr? I've been looking into the Apache Commons Pipeline, and it seems like a good candidate for doing some cool stuff. Look at the diagram below.

A schematic drawing of the Solr Pipeline concept.

What I'm thinking of is a transparent Solr processing pipeline that speaks the Solr REST protocol at each end. This means you would be able to use SolrJ or any other API to communicate with the pipeline.
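A minimal sketch of the idea, in plain Java with no Solr dependencies: documents enter as field maps, each stage transforms them, and they leave in the same shape, ready to forward on to Solr via SolrJ or plain REST. The `DocumentStage` and `Pipeline` names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a "transparent" pipeline that accepts Solr-style
// documents (field name -> value) and emits documents in the same shape,
// so the caller can keep using SolrJ or plain REST on both sides.
interface DocumentStage {
    Map<String, Object> process(Map<String, Object> doc);
}

class Pipeline {
    private final List<DocumentStage> stages = new ArrayList<>();

    Pipeline add(DocumentStage stage) {
        stages.add(stage);
        return this;
    }

    Map<String, Object> run(Map<String, Object> doc) {
        for (DocumentStage stage : stages) {
            doc = stage.process(doc);
        }
        return doc; // same shape in and out: ready to forward to Solr
    }
}

public class PipelineSketch {
    public static void main(String[] args) {
        Pipeline pipeline = new Pipeline()
            .add(doc -> { doc.put("processed", true); return doc; });

        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("id", "1");
        doc.put("title", "Hello Solr");

        Map<String, Object> out = pipeline.run(doc);
        System.out.println(out.get("processed")); // true
    }
}
```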

Has anyone attempted this before?  If you’re interested in chatting about the pipeline drop me a mail or just grab me at Eurocon in Prague this year.

13 thoughts on “Solr Processing Pipeline”

  1. I was responsible for an Enterprise Search / Text Mining software architecture back in 2006 and ran into the same issue. I integrated UIMA (formerly IBM UIMA at developerWorks) for this use case. Since UIMA, with its extensive and complex document model, is the opposite of Lucene, the two components were hard to combine.

    Nevertheless, if you asked me again today, I would still choose to integrate UIMA for this use case.

  2. Otis, you are of course right. On the LCF end one would use whatever API fits LCF best.

    But I think that most people have their own “hacked” connectors that speak “Solr REST”. In that case it would be nice to have a pipeline that acts as a Solr on both sides: a more or less totally transparent pipeline.

  3. Hannes,
    Unfortunately I haven’t worked with UIMA yet, but I’ve heard a lot about it. When it comes to hot-plugging new steps at runtime, I totally agree with you. I’ve worked on another big enterprise project where we developed a pipeline framework with a hot-pluggable domain-specific language; it came in handy a lot.

  4. I can see some pros and cons of your idea:

    – Speaking Solr at both ends makes it much easier to play with. One of the things I love about Solr is how simple it is to start using it. No need to set up crazy message queues, complex documents, etc. Just pop a document in. So this would lower the barrier to entry.
    – It gives a hook into what is going into Solr without changing Solr itself. I can see putting this in as a frontend and just logging/recording what goes through. Most of us probably cannot easily tell “what docs changed in the last 15 minutes”, for example.
    – A poor man’s ESB… when I index a document of type X, also trigger this other job, or something along those lines.
    – Maybe a simpler way to share processing steps by fitting into a standard pipeline?

    – Doesn’t this to some extent duplicate the index-time pipeline inside Solr for the various document types?
    – Another moving part for Solr?
    – Are you duplicating a lot of what DIH does in terms of processing logic? Although, I actually don’t much like the DIH approach. It seems to solve a specific issue well, but the more processing you do, the more code/logic you end up writing in complex XML!

    Something about your proposal sounded familiar, so I looked back at some projects: a couple of years ago I built a data processing pipeline using Apache Commons Chain. An XML doc defined the pipeline and any step-specific values, and you could have multiple pipelines. The ability to have several and try one against another was great. Love to see where this goes. Maybe replace DIH’s processing layer?

    I know that every time I write an indexer for a new project, I end up writing much of the same code over and over!

  5. Hi Eric,
    Yeah, that’s kind of my vision. LCF is the place to build stable connectors, and the pipeline is a playground for sharing small tidbits of code to manipulate your data. Stuff like:

    – Categorization.
    – Language detection.
    – Different language processing depending on the language detection.
    – And so much more.
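    A hedged sketch of what such shared stages might look like, again using plain field maps rather than real Solr documents. The detection logic here is a toy stand-in; a real stage would call a proper language detection or categorization library.

    ```java
    import java.util.LinkedHashMap;
    import java.util.Map;
    import java.util.function.UnaryOperator;

    // Hypothetical stand-ins for the stage types listed above.
    public class StagesSketch {

        // Language detection: tag the document with a detected language.
        // (Toy heuristic; a real stage would use a detection library.)
        static Map<String, Object> detectLanguage(Map<String, Object> doc) {
            String body = (String) doc.getOrDefault("body", "");
            String lang = body.contains(" der ") || body.contains(" und ") ? "de" : "en";
            doc.put("language", lang);
            return doc;
        }

        // Language-dependent processing: pick a processing step based on
        // the language the previous stage detected.
        static Map<String, Object> processByLanguage(Map<String, Object> doc) {
            UnaryOperator<String> stemmer =
                "de".equals(doc.get("language")) ? s -> s + " [german-stemmed]"
                                                 : s -> s + " [english-stemmed]";
            doc.put("body_processed", stemmer.apply((String) doc.get("body")));
            return doc;
        }

        public static void main(String[] args) {
            Map<String, Object> doc = new LinkedHashMap<>();
            doc.put("body", "ein Text und mehr");
            detectLanguage(doc);
            processByLanguage(doc);
            System.out.println(doc.get("language")); // de
        }
    }
    ```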

    When it comes to DIH, I’m no fan of it. I really have no concrete reason for not using it; it just seems complicated.

    And yes, it’s another moving part of Solr, but (!) a highly optional one. This type of product would only be used in applications where you need to manipulate your data a lot: for example large enterprise applications, wikis, library content, etc.

  6. The whole picture of acquiring information (e.g. web crawling), analysis, and indexing is currently covered by an Eclipse incubator project called SMILA:
    SMILA stands for SeMantic Information Logistics Architecture, which is a pretty neat idea but hard to realize. For its plug-in system of analysis components (categorization, language detection, etc.) it relies on OSGi, which makes it possible to swap components and change behaviour at runtime.

    Currently under (heavy?) development by two companies from Germany (Empolis, Brox).

  7. Good to see you continue the thinking around this important topic, Max.

    I looked at Commons Pipeline, and what it does particularly well is scalability of the individual processing stages (each stage can have its own thread!) and queuing between stages, etc. The parallel nature also allows optimized utilization for heavy processing, although order is important in many search use cases, so I don’t know how useful that is.
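    The staged model described here (each stage on its own thread, with queues in between) can be illustrated with plain java.util.concurrent; this is just the underlying pattern, not the actual Commons Pipeline API.

    ```java
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    // One stage running on its own thread, consuming from an input queue
    // and feeding an output queue; more stages would chain queues together.
    public class StagedThreads {
        static final String POISON = "<end>";

        public static void main(String[] args) throws InterruptedException {
            BlockingQueue<String> in = new ArrayBlockingQueue<>(10);
            BlockingQueue<String> out = new ArrayBlockingQueue<>(10);

            // Stage thread: uppercase each document until the end marker.
            Thread stage = new Thread(() -> {
                try {
                    String doc;
                    while (!(doc = in.take()).equals(POISON)) {
                        out.put(doc.toUpperCase());
                    }
                    out.put(POISON);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            stage.start();

            in.put("doc one");
            in.put(POISON);

            String doc;
            while (!(doc = out.take()).equals(POISON)) {
                System.out.println(doc); // DOC ONE
            }
            stage.join();
        }
    }
    ```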

    Carl M: I did not know about SMILA. Impressive. However, it feels way too heavy for what the majority of Solr installations will need. I foresee SMILA being useful only for the largest integrations, where BPEL is already in use, not for the simple processing needs of a local search application. It also smells a bit over-engineered.

    Back to your suggestion, Max. If the pipeline is to live strictly outside of Solr, it makes sense to speak the Solr API at both ends. However, you then cut off all the other UpdateRequestHandlers from benefiting from the pipeline: DIH, CSV, Extracting, (Java)Binary and the other handlers cannot use it.

    To overcome this, instead of integrating the pipeline as a standalone service, integrate it into Solr’s UpdateRequestProcessorChain, which sits between the RequestHandlers and indexing. There could be two versions of the pipeline factory: a Local one, which executes the pipeline in the same thread, and a Remote one, which streams the documents to a dedicated processing node/cluster.

    I think this plays better with the current and future Solr architecture because
    * The pipeline will be truly transparent, and ALL current RequestHandlers can be used, including DIH and LCF
    * Solr will get built-in shard routing logic, obeying the new concepts of collections etc from SolrCloud
    * It makes sense to have the choice of running a light-weight single-node without unnecessary HTTP calls, but also have the possibility of scaling out
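    As a sketch of how the Local variant might be wired up in configuration: the class name `com.example.PipelineProcessorFactory` is a hypothetical custom factory, the Log and Run factories are Solr’s standard chain members, and depending on the Solr version the request parameter selecting the chain is `update.processor` or `update.chain`.

    ```xml
    <!-- solrconfig.xml sketch: com.example.PipelineProcessorFactory is
         hypothetical; the other two factories ship with Solr. -->
    <updateRequestProcessorChain name="pipeline">
      <processor class="com.example.PipelineProcessorFactory"/>
      <processor class="solr.LogUpdateProcessorFactory"/>
      <processor class="solr.RunUpdateProcessorFactory"/>
    </updateRequestProcessorChain>
    ```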

    Another issue is that the REST document model and SolrInputDocument are not (currently) rich enough to hold metadata about a partly processed document. I think it is unavoidable at some point to at least do tokenization in the pipeline. Then we need to pass on a document with both the original version of the field and the tokenized version, with metadata saying it is already tokenized. The Lucene analysis chain could then skip tokenization. This is the way OpenPipe is integrated with Solr; they found the need to invent a custom binary protocol to convey such metadata…
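    The metadata problem can be sketched like this, with all field and flag names invented for illustration: the pipeline emits both the raw field and its token list, plus a flag the downstream analysis chain could use to skip its own tokenizer.

    ```java
    import java.util.Arrays;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    // Sketch of a "partly processed" document: original field value plus
    // a pipeline-produced token list and an already-tokenized marker.
    public class PartlyProcessedDoc {
        public static void main(String[] args) {
            Map<String, Object> doc = new LinkedHashMap<>();
            doc.put("title", "solr processing pipeline");

            // Pipeline-side tokenization (trivial whitespace split here).
            List<String> tokens = Arrays.asList(
                ((String) doc.get("title")).split("\\s+"));
            doc.put("title__tokens", tokens);
            doc.put("title__tokenized", true); // hint: skip index-side tokenization

            System.out.println(doc.get("title__tokens"));
        }
    }
    ```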

    I really like the idea of using commons-pipeline, keep up the thinking 🙂

  8. Hi Max.

    Thanks for an interesting talk about this topic at Eurocon. Integrating the pipeline with the UpdateRequestProcessorChain, as Jan mentions, is actually what I was thinking about when talking to you after the talk. Also, for simple needs: why not just implement a stage or two directly in this chain (sub-classing UpdateRequestProcessorFactory) and configure it in (a possibly separate chain in) solrconfig.xml?

  9. Pingback: Processing pipeline for the Google Search Appliance « The Findability blog

