Choosing an Open Source search engine: Solr or Elasticsearch?

There has never been a better time to be a search and open source enthusiast than 2017. The days when information retrieval was a field reserved for academia and experts are far behind us.

Now we have plenty of search engines that allow us not only to search, but also to navigate and discover our information. We are going to focus on two of the leading search engines, both of which happen to be open source projects: Elasticsearch and Solr.

Comparison (Solr and Elasticsearch)

Organisations and Support

Solr is an Apache sub-project developed in parallel with Lucene. Thanks to this, it has synchronized releases and benefits directly from any new Lucene feature.

Lucidworks (previously Lucid Imagination) is the main company supporting Solr. It provides development resources for Solr, commercial support, consulting services, technical training and commercial software around Solr. Lucidworks is based in San Francisco and offers its services in the USA and the rest of the world through strategic partners. Lucidworks has historically employed around a third of the most active Solr and Lucene committers, contributing most of the Solr code base and organizing the Lucene/Solr Revolution conference every year.

Elasticsearch is an open-source product driven by the company Elastic (formerly known as Elasticsearch). This approach creates a good balance between the open-source community contributing to the product and the company making long-term plans for future functionality, as well as ensuring transparency and quality.

Elastic develops not only Elasticsearch but a whole set of open-source products called the Elastic Stack: Elasticsearch, Kibana, Logstash, and Beats. The company offers support for the whole Elastic Stack and a set of commercial products called X-Pack, all included in different tiers of subscriptions. They offer trainings every second week around the world and organize the ElasticON user conferences.

Ecosystem

Solr is an Apache project and, as such, benefits from a large variety of Apache projects that can be used alongside it. The first and foremost example is its Lucene core (http://lucene.apache.org/core/), which is released on the same schedule and from which Solr receives all of its main functionality. The other key project is ZooKeeper, which handles the configuration and coordination of SolrCloud clusters.

On the information-gathering side there are Apache Nutch, a web crawler, and Apache Flume, a distributed log collector.

When it comes to processing information, there is no end to the available Apache projects; the ones most commonly used alongside Solr are Mahout for machine learning, Tika for extracting text and metadata from documents, and Spark for data processing.

The big advantage lies in big data management and storage, with the highly popular Hadoop library as well as the Hive, HBase, and Cassandra databases. Solr supports storing its index in the Hadoop Distributed File System (HDFS) for high resilience.

Elasticsearch is owned by Elastic, the company that drives and develops all the products in its ecosystem, which makes them very easy to use together.

The main open-source products of the Elastic Stack alongside Elasticsearch are Beats, Logstash and Kibana. Beats is a modular platform for building different lightweight data collectors. Logstash is a data processing pipeline. Kibana is a visualization platform where you can build your own data visualizations, and it already ships with many built-in tools for creating dashboards over your Elasticsearch data.

Elastic also develops a set of products that are available under subscription: X-Pack. Right now, X-Pack includes five products: Security, Alerting, Monitoring, Reporting, and Graph. Each delivers the layer of functionality over the Elastic Stack that its name describes. Most of them are included as part of Elasticsearch and Kibana.

Strengths

Solr

  • Many interfaces, many clients, many languages.
  • A query is as simple as solr/select?q=query.
  • Easy to preconfigure.
  • The base product will always be complete in functionality; commercial offerings are add-ons.

Elasticsearch

  • Everything can be done with a JSON HTTP request (see the example below).
  • Optimized for time-based information.
  • Tightly coupled ecosystem.
  • The base product contains the core functionality and is expandable; commercial offerings are additional features.
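
As a minimal sketch of the JSON-over-HTTP approach: the request below searches an Elasticsearch index for documents whose title matches "search engine" (the index name products is an assumption for illustration):

POST /products/_search
{
  "query": {
    "match": { "title": "search engine" }
  }
}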


Conclusion – Solr or Elasticsearch?

If you are already using one of them and do not explicitly need a feature exclusive to the other, there is no strong incentive to migrate.

In any case, the answer is the same as the usual answer to hardware-sizing questions for either of them: “It depends.” It depends on the amount of data, the expected growth, the type of data, the available software ecosystem around each engine and, above all, the features that your requirements and ambitions demand, just to name a few factors.

 

At Findwise we can help you carry out a platform evaluation study to find the perfect match for your organization and your information.

 

Written by: Daniel Gómez Villanueva – Findability and Search Expert

How to improve search relevance using machine learning and statistics – Apache Solr Learning to Rank

In search, relevance denotes how well a retrieved document or set of documents meets the information need of the user. Natural languages, synonymy, homonymy, term frequency, norms, relevance models, boosting, performance, subjectivity… the reasons why search relevance remains a hard problem are many. This article deals with how machine learning and search statistics can improve relevance using the Learning to Rank plugin, which will be included in an upcoming version of Solr. If you want more information than is provided in this blog post, be sure to visit our website or contact us!

Background

Consider an intranet search solution where users can be divided into two groups: developers and sales representatives.
Search and click statistics are collected, and the following picture illustrates a specific search performed 569 times by the users, with the click statistics for each document.

Example of a search with click statistics

As you can notice, the top search hit, whose score is computed from term frequency, inverse document frequency and field-length norm, is less relevant (received fewer clicks) than documents with lower scores. Instead of trying to manually tweak the relevance model and the different field boosts, which, by tweaking for one specific query, would probably decrease the global relevance of other queries, we will learn from the users' click statistics to automatically generate a relevance model.

Architecture, features and model

Architecture

Using search and click statistics as training data, the search engine can learn a ranking model over the input features that estimates the probability that a document is relevant for a user.

Search architecture with a ranking training model

Features

With the Solr Learning to Rank plugin, features can be defined using standard Solr queries.
For my specific example, I will choose the following features:
– originalScore: the original score computed by Solr
– queryMatchTitle: boolean stating whether the user query text matches the document title
– queryMatchDescription: boolean stating whether the user query text matches the document description
– isPowerPoint: boolean stating whether the document type is PowerPoint

[
   {
    "name": "isPowerPoint",
    "class": "org.apache.solr.ltr.feature.SolrFeature",
    "params": { "fq": ["{!terms f=filetype}.pptx"] }
   },
   {
    "name": "originalScore",
    "class": "org.apache.solr.ltr.feature.OriginalScoreFeature",
    "params": {}
   },
   {
    "name": "queryMatchTitle",
    "class": "org.apache.solr.ltr.feature.SolrFeature",
    "params": { "q": "{!field f=title}${user_query}" }
   },
   {
    "name": "queryMatchDescription",
    "class": "org.apache.solr.ltr.feature.SolrFeature",
    "params": { "q": "{!field f=description}${user_query}" }
   }
]
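
These feature definitions are uploaded to the plugin's feature store. A minimal sketch using the REST endpoint described in the plugin's documentation, assuming Solr runs locally and the collection is named intranet (both assumptions):

curl -XPUT 'http://localhost:8983/solr/intranet/schema/feature-store' \
     --data-binary @features.json \
     -H 'Content-type: application/json'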

Training a ranking model

From the click statistics, a training set (X, y), where X is the feature input vector and y is a boolean stating whether the user clicked the document or not, can be generated and used to fit a linear model by regression, which will output the weight of each feature.

Example of statistics:

{
    q: "i3",
    docId: "91075403",
    userId: "507f1f77bcf86cd799439011",
    clicked: 0,
    score: 3.43
},
{
    q: "i3",
    docId: "82034458",
    userId: "507f1f77bcf86cd799439011",
    clicked: 1,
    score: 3.43
},
{
    q: "coucou",
    docId: "1246732",
    userId: "507f1f77bcf86cd799439011",
    clicked: 0,
    score: 3.42
}

Training data generated:
X= (originalScore, queryMatchTitle, queryMatchDescription, isPowerPoint)
y= (clicked)

{
    originalScore: 3.43,
    queryMatchTitle: 1,
    queryMatchDescription: 1,
    isPowerPoint: 1,
    clicked: 0
},
{
    originalScore: 3.43,
    queryMatchTitle: 0,
    queryMatchDescription: 1,
    isPowerPoint: 0,
    clicked: 1
},
{
    originalScore: 3.42,
    queryMatchTitle: 1,
    queryMatchDescription: 0,
    isPowerPoint: 1,
    clicked: 0
}
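
As a minimal sketch of this training step, the weights can be fitted with an off-the-shelf regression implementation; the snippet below uses Python and scikit-learn on the three rows above (the library choice and variable names are my assumptions, not part of the Solr plugin):

# Minimal sketch: fit a model on the training data above to obtain
# one weight per feature. scikit-learn is an assumption here; any
# regression library would work just as well.
from sklearn.linear_model import LogisticRegression

FEATURES = ["originalScore", "queryMatchTitle", "queryMatchDescription", "isPowerPoint"]

# X = feature vectors, y = clicked or not (from the click statistics)
X = [
    [3.43, 1, 1, 1],
    [3.43, 0, 1, 0],
    [3.42, 1, 0, 1],
]
y = [0, 1, 0]

model = LogisticRegression().fit(X, y)

# The learned coefficients become the feature weights sent to Solr.
weights = dict(zip(FEATURES, model.coef_[0]))
print(weights)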

Once the training is complete and the different feature weights have been computed, the model can be sent to Solr using the following format:

{
    "class":"org.apache.solr.ltr.model.LinearModel",
    "name":"myModelName",
    "features":[
        { "name": "originalScore"},
        { "name": "queryMatchTitle"},
        { "name": "queryMatchDescription"},
        { "name": "isPowerPoint"},
    ],
    "params":{
        "weights": {
            "originalScore": 0.6783,
            "queryMatchTitle": 0.4833,
            "queryMatchDescription": 0.7844,
            "isPowerPoint": 0.321
      }
    }
}
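
Like the features, the model is uploaded to the plugin's model store; a sketch under the same assumptions as above (local Solr, collection named intranet):

curl -XPUT 'http://localhost:8983/solr/intranet/schema/model-store' \
     --data-binary @model.json \
     -H 'Content-type: application/json'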

Run a Rerank Query

The Solr LTR plugin makes it easy to apply the re-rank model to the search results by adding rq={!ltr model=myModelName reRankDocs=25} to the query.
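
Since the features above reference ${user_query}, the query text also has to be passed to the model as external feature information (efi). A full request could look like the sketch below (host and collection name are assumptions):

http://localhost:8983/solr/intranet/query?q=i3&rq={!ltr model=myModelName reRankDocs=25 efi.user_query=i3}&fl=id,title,score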

Personalization of the search result

If your statistics data includes information about the users, specific re-rank models can be trained for different user groups.
In my current example, I trained one model for the developer group and one for the sales representatives.

Dev model:

{
    "class":"org.apache.solr.ltr.model.LinearModel",
    "name":"devModel",
    "features":[
        { "name": "originalScore"},
        { "name": "queryMatchTitle"},
        { "name": "queryMatchDescription"},
        { "name": "isPowerPoint"},
    ],
    "params":{
        "weights": {
            "originalScore": 0.6421,
            "queryMatchTitle": 0.4561,
            "queryMatchDescription": 0.5124,
            "isPowerPoint": 0.017
      }
    }
}

Sales model:

{
    "class":"org.apache.solr.ltr.model.LinearModel",
    "name":"salesModel",
    "features":[
        { "name": "originalScore"},
        { "name": "queryMatchTitle"},
        { "name": "queryMatchDescription"},
        { "name": "isPowerPoint"},
    ],
    "params":{
        "weights": {
            "originalScore": 0.712,
            "queryMatchTitle": 0.582,
            "queryMatchDescription": 0.243,
            "isPowerPoint": 0.623
      }
    }
}
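
At query time, the application then simply selects the re-rank model matching the current user's group (how the group is looked up is left to the application):

rq={!ltr model=devModel reRankDocs=25 efi.user_query=i3}      (developers)
rq={!ltr model=salesModel reRankDocs=25 efi.user_query=i3}    (sales representatives)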

From the statistics data, the system learnt that a PowerPoint document is more relevant for a sales representative than for a developer (compare the isPowerPoint weights: 0.623 versus 0.017).

Developer search

Developer search with re-ranking

Sales representative search

Sales representative search with re-ranking

To conclude: with a search system that is continuously trained on a flow of statistics, not only will the search relevance be more customized and personalized to your users, but it will also adapt automatically as user behaviour changes.

If you want more information, have further questions or need help please visit our website or contact us!

The Solr LTR plugin is to be released soon: https://github.com/bloomberg/lucene-solr/tree/master-ltr-plugin-release/solr/contrib/ltr