How to improve search relevance using machine learning and statistics – Apache Solr Learning to Rank

In search, relevance denotes how well a retrieved document or set of documents meets the information need of the user. Natural languages, synonymy, homonymy, term frequency, norms, relevance models, boosting, performance, subjectivity… there are many reasons why search relevance remains a hard problem. This article deals with how machine learning and search statistics can improve relevance using the learning to rank plugin, which will be included in a future version of Solr. If you want more information than is provided in this blog post, be sure to visit our website or contact us!

Background

Consider an intranet search solution where users can be divided into two groups: developers and sales representatives.
Search and click statistics are collected, and the following picture illustrates a specific search performed 569 times by the users, with the click statistics for each document.

Example of a search with click statistics

As noticed, the top search hit, whose score is computed from term frequency, inverse document frequency and field-length norm, is less relevant (received fewer clicks) than documents with lower scores. Instead of manually tweaking the relevance model and the different field boosts, which, by tuning for one specific query, would probably decrease the global relevance for other queries, we will learn from the users' click statistics to automatically generate a relevance model.

Architecture, features and model

Architecture

Using search and click statistics as training data, the search engine can learn, from input features and with a ranking model, the probability that a document is relevant for a user.

Search architecture with a ranking training model

Features

With the Solr learning to rank plugin, features can be defined using standard Solr queries.
For my specific example, I will choose the following features:
– originalScore: the original score computed by Solr
– queryMatchTitle: a Boolean stating whether the user query text matches the document title
– queryMatchDescription: a Boolean stating whether the user query text matches the document description
– isPowerPoint: a Boolean stating whether the document type is PowerPoint

[
   {
    "name" : "isPowerPoint",
    "class" : "org.apache.solr.ltr.feature.SolrFeature",
    "params" : { "fq": ["{!terms f=filetype}pptx"] }
   },
   {
    "name" : "originalScore",
    "class" : "org.apache.solr.ltr.feature.OriginalScoreFeature",
    "params" : {}
   },
   {
    "name" : "queryMatchTitle",
    "class" : "org.apache.solr.ltr.feature.SolrFeature",
    "params" : { "q" : "{!field f=title}${user_query}" }
   },
   {
    "name" : "queryMatchDescription",
    "class" : "org.apache.solr.ltr.feature.SolrFeature",
    "params" : { "q" : "{!field f=description}${user_query}" }
   }
]
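These feature definitions can be uploaded to Solr's LTR feature store over HTTP. Below is a minimal sketch using Python's standard library; the host, port and collection name (`techproducts`) are assumptions for illustration, not values from this example:

```python
import json
import urllib.request

SOLR_URL = "http://localhost:8983/solr/techproducts"  # hypothetical host and collection

# A subset of the feature definitions shown above
features = [
    {"name": "isPowerPoint",
     "class": "org.apache.solr.ltr.feature.SolrFeature",
     "params": {"fq": ["{!terms f=filetype}pptx"]}},
    {"name": "originalScore",
     "class": "org.apache.solr.ltr.feature.OriginalScoreFeature",
     "params": {}},
]

# PUT the definitions to the LTR feature-store endpoint
req = urllib.request.Request(
    SOLR_URL + "/schema/feature-store",
    data=json.dumps(features).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="PUT",
)
# urllib.request.urlopen(req)  # uncomment when a Solr instance is reachable
```

The same pattern applies to uploading the trained model later on, via the `/schema/model-store` endpoint.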

Training a ranking model

From the click statistics, a training set (X, y) can be generated, where X is the feature input vector and y a Boolean stating whether the user clicked on the document. A linear model can then be fitted with regression, which outputs the weight of each feature.

Example of statistics:

{
    q: "i3",
    docId: "91075403",
    userId: "507f1f77bcf86cd799439011",
    clicked: 0,
    score: 3.43
},
{
    q: "i3",
    docId: "82034458",
    userId: "507f1f77bcf86cd799439011",
    clicked: 1,
    score: 3.43
},
{
    q: "coucou",
    docId: "1246732",
    userId: "507f1f77bcf86cd799439011",
    clicked: 0,
    score: 3.42
}

Training data generated:
X= (originalScore, queryMatchTitle, queryMatchDescription, isPowerPoint)
y= (clicked)

{
    originalScore: 3.43,
    queryMatchTitle: 1,
    queryMatchDescription: 1,
    isPowerPoint: 1,
    clicked: 0
},
{
    originalScore: 3.43,
    queryMatchTitle: 0,
    queryMatchDescription: 1,
    isPowerPoint: 0,
    clicked: 1
},
{
    originalScore: 3.42,
    queryMatchTitle: 1,
    queryMatchDescription: 0,
    isPowerPoint: 1,
    clicked: 0
}
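Fitting the linear model is a standard regression task outside of Solr. As an illustrative sketch (not part of the plugin), the example rows above can be fed to scikit-learn, assuming that library is available:

```python
from sklearn.linear_model import LinearRegression

# Feature order: originalScore, queryMatchTitle, queryMatchDescription, isPowerPoint
X = [
    [3.43, 1, 1, 1],
    [3.43, 0, 1, 0],
    [3.42, 1, 0, 1],
]
y = [0, 1, 0]  # clicked or not

model = LinearRegression().fit(X, y)

# One weight per feature, in the same order as X's columns
weights = dict(zip(
    ["originalScore", "queryMatchTitle", "queryMatchDescription", "isPowerPoint"],
    model.coef_,
))
print(weights)
```

In practice you would train on the full click log, not three rows, and the resulting weights are what gets written into the model JSON below.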

Once the training is completed and the feature weights computed, the model can be uploaded to Solr using the following format:

{
    "class":"org.apache.solr.ltr.model.LinearModel",
    "name":"myModelName",
    "features":[
        { "name": "originalScore"},
        { "name": "queryMatchTitle"},
        { "name": "queryMatchDescription"},
        { "name": "isPowerPoint"}
    ],
    "params":{
        "weights": {
            "originalScore": 0.6783,
            "queryMatchTitle": 0.4833,
            "queryMatchDescription": 0.7844,
            "isPowerPoint": 0.321
      }
    }
}

Run a Rerank Query

The Solr LTR plugin makes it easy to apply the re-rank model to the search results by adding rq={!ltr model=myModelName reRankDocs=25} to the query.
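As a sketch, such a query can be assembled with Python's standard library; the host and collection are hypothetical, and the efi.user_query parameter is how the raw query text reaches features such as queryMatchTitle, which reference ${user_query}:

```python
import urllib.parse

SOLR_URL = "http://localhost:8983/solr/techproducts/select"  # hypothetical

params = {
    "q": "i3",
    # Re-rank the top 25 documents with the uploaded LTR model
    "rq": "{!ltr model=myModelName reRankDocs=25 efi.user_query='i3'}",
    "fl": "id,score",
}
query_url = SOLR_URL + "?" + urllib.parse.urlencode(params)
print(query_url)
```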

Personalization of the search result

If your statistics data include information about the users, specific re-rank models can be trained for different user groups.
In this example, I trained one model for the developer group and one for the sales representatives.
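At query time, the model name can then be picked from the user's group. A hypothetical helper, with model names matching the group-specific models defined below and a fall-back to the generic model:

```python
# Map user groups to their group-specific LTR models
GROUP_MODELS = {
    "developer": "devModel",
    "sales": "salesModel",
}

def rerank_param(group: str, user_query: str) -> str:
    """Build the rq parameter for a user's group, falling back to the generic model."""
    model = GROUP_MODELS.get(group, "myModelName")
    return "{{!ltr model={} reRankDocs=25 efi.user_query='{}'}}".format(model, user_query)

print(rerank_param("sales", "i3"))
```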

Dev model:

{
    "class":"org.apache.solr.ltr.model.LinearModel",
    "name":"devModel",
    "features":[
        { "name": "originalScore"},
        { "name": "queryMatchTitle"},
        { "name": "queryMatchDescription"},
        { "name": "isPowerPoint"}
    ],
    "params":{
        "weights": {
            "originalScore": 0.6421,
            "queryMatchTitle": 0.4561,
            "queryMatchDescription": 0.5124,
            "isPowerPoint": 0.017
      }
    }
}

Sales model:

{
    "class":"org.apache.solr.ltr.model.LinearModel",
    "name":"salesModel",
    "features":[
        { "name": "originalScore"},
        { "name": "queryMatchTitle"},
        { "name": "queryMatchDescription"},
        { "name": "isPowerPoint"}
    ],
    "params":{
        "weights": {
            "originalScore": 0.712,
            "queryMatchTitle": 0.582,
            "queryMatchDescription": 0.243,
            "isPowerPoint": 0.623
      }
    }
}

From the statistics data, the system learnt that a PowerPoint document is more relevant for a sales representative than for a developer.

Developer search

Developer search with re-ranking

Sales representative search

Sales representative search with re-ranking

To conclude, with a search system continuously trained from a flow of statistics, not only will the search relevance be more customized and personalized to your users, but it will also adapt automatically to changes in user behavior.

If you want more information, have further questions or need help please visit our website or contact us!

Solr LTR plugin, to be released soon: https://github.com/bloomberg/lucene-solr/tree/master-ltr-plugin-release/solr/contrib/ltr