While using a search application we rarely think about what happens inside it. We just type a query, sometimes refine details with facets or additional filters, and pick one of the returned results. Ideally, the most desired result is at the top of the list. The secret of returning appropriate results, and of figuring out which fits a query better than the others, is hidden in the scoring, ranking and similarity functions enclosed in relevancy models. These concepts are crucial for the search application user’s satisfaction.
In this post we will review the basic components of the popular TF/IDF model with simple examples. Additionally, we will learn how to ask Elasticsearch for an explanation of the score of a specific document for a given query.
Document ranking is one of the fundamental problems in information retrieval, the discipline acting as the mathematical foundation of search. Ranking, i.e. assigning a rank to each document matching a search query, is closely tied to the notion of relevance. Document relevance is a function that determines how well a given document meets the search query. The concept of similarity corresponds, in turn, to the relevance idea, since relevance is a metric of similarity between a candidate result document and a search query.
One of the most popular relevance calculation algorithms is TF/IDF (Term Frequency/Inverse Document Frequency). As the name suggests, the function consists of two main components. Term frequency is expressed by the following formula:
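A basic form consistent with the description below (an assumption, since implementations vary; Lucene’s classic similarity, for instance, damps the count with a square root) is:

```latex
\mathrm{tf}(t, d) = n_{t,d}
```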
Where t is a term, d is a document and n_{t,d} is the number of occurrences of the term in the document. Intuitively, the more times the term occurs in the document, the more relevant the document is to the term.
Inverse document frequency is calculated this way:
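A standard definition matching the description below is:

```latex
\mathrm{idf}(t, D) = \log \frac{n_d}{n_{d,t}}
```

Lucene’s classic similarity uses a smoothed variant, \(1 + \log\frac{n_d}{n_{d,t} + 1}\), which avoids division by zero and keeps the value positive.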
As previously, t is a term, D is the search space (the set of all indexed documents), n_d is the number of elements in D, and n_{d,t} is the number of elements of D containing the term t. This function amplifies document significance inversely to the number of documents containing the searched term. When there is only one document with the term, its significance is much stronger than in a situation where half of the documents contain the term.
Both components complement each other. TF is responsible for determining the term’s significance among other terms in the field, whereas IDF determines the document’s strength among other documents. The overall TF/IDF score for a document is calculated as follows:
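The combined score is simply the product of the two components defined above:

```latex
\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D)
```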
As an example, let’s perform a search on four recent entries from the Findability blog:
- What’s new in Apache Solr 6?
- Understanding politics with Watson using Text Analytics
- Migration from Google Search Appliance (GSA) in 4 easy steps
- 4 quick ways to improve your search!
We will search for the term t = “documents” in the entries’ content. In the first document, the term t occurs 4 times, so:
And for the other documents:
IDF value for the term is:
Taking everything into consideration, the sorted ranking is the following (please mind the ordering):
The most relevant document in this search is What’s new in Apache Solr 6?, and the least relevant is Understanding politics with Watson using Text Analytics.
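The calculation above can be sketched in a few lines of Python. The entry texts are not reproduced here, so the strings below are invented stand-ins; only the term counts matter:

```python
import math

# Stand-in corpus: invented placeholder texts for the four blog entries.
docs = {
    "d1": "documents are indexed and documents are scored so documents rank documents",
    "d2": "text analytics understands politics with watson",
    "d3": "migrate your documents from gsa in four easy steps",
    "d4": "quick ways to improve your search",
}
tokens = {name: text.split() for name, text in docs.items()}

def tf(term, doc_tokens):
    # Raw occurrence count; Lucene's classic similarity damps this with sqrt.
    return doc_tokens.count(term)

def idf(term, corpus):
    containing = sum(1 for toks in corpus.values() if term in toks)
    return math.log(len(corpus) / containing) if containing else 0.0

term = "documents"
idf_val = idf(term, tokens)                      # log(4/2): 2 of 4 docs match
scores = {name: tf(term, toks) * idf_val for name, toks in tokens.items()}
ranking = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
# d1 (4 occurrences) ranks first, d3 (1 occurrence) second; d2 and d4 score 0
```

With these placeholder counts the ordering mirrors the one discussed above: the document with the most occurrences of a moderately rare term wins.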
TF/IDF is commonly used with some additional enhancements. For example, Lucene, and thus Elasticsearch, uses a field length norm, among others. The norm is expressed as follows:
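In Lucene’s classic similarity this is the inverse square root of the field length (ignoring the lossy single-byte encoding Lucene applies when storing the value):

```latex
\mathrm{norm}(f, d) = \frac{1}{\sqrt{n_{f,d}}}
```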
where d is a document and n_{f,d} is the overall number of terms in field f of document d. Intuitively, the norm decreases the score when there are many terms in the field and increases it when there are only a few. For example, a term included in the document title is indeed more significant than a term in the document content. The TF/IDF value for a term in a field of a document is normalized by multiplying it by the norm value:
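Putting the pieces together, the per-field score used in the following comparisons can be written as:

```latex
\mathrm{score}(t, f, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D) \cdot \mathrm{norm}(f, d)
```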
Multifield Query
The past three paragraphs described relevance calculation for a single field. This is an unlikely situation; in the vast majority of applications, more than one field is searched. The simplest approach is to just sum all the individual TF/IDF values together. Their values are normalized with the field length, so their impact on the final result is normalized as well. Let’s get back to our test documents and search for the query “Solr” in title and content. First, let’s calculate the TF/IDF values the same way as previously:
Document d3 has a higher score in content but, in contrast to d1, does not have Solr in the title. Will it remain the best result after normalizing by the field lengths?
thus:
so the overall scores are:
In line with most users’ expectations, having the searched term in the title of document d1 turned out to be much more influential than having more term occurrences in the content of document d3.
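A sketch of the multifield calculation, with invented per-field statistics standing in for the real entries (the actual term counts and field lengths differ):

```python
import math

# Invented per-field statistics for the two candidates: number of terms
# in the field and occurrences of "solr" in it.
fields = {
    "d1": {"title": {"n_terms": 6, "solr": 1}, "content": {"n_terms": 120, "solr": 2}},
    "d3": {"title": {"n_terms": 9, "solr": 0}, "content": {"n_terms": 150, "solr": 4}},
}
idf_solr = math.log(4 / 2)  # assuming "solr" occurs in 2 of the 4 documents

def field_score(stats):
    # TF/IDF damped by the field length norm 1 / sqrt(number of terms)
    return stats["solr"] * idf_solr / math.sqrt(stats["n_terms"])

scores = {doc: sum(field_score(f) for f in per_field.values())
          for doc, per_field in fields.items()}
# The short title field gives d1 the lead despite d3's extra content matches
```

The design point is visible in the numbers: a match in a six-term title contributes far more than two matches diluted across a long content field.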
Multiterm Query
After considering multifield queries, there is only one important issue left: what about queries with more than one term? The Vector Space Model addresses this problem. The idea is not very complicated, but to make it even simpler, we’ll assume terms are connected with the OR operator, which means we’ll return documents containing at least one of the terms. The query is split into terms and represented as a vector of IDF values, one per term, so the vector dimensionality equals the number of terms in the query. Each document is then modeled as a vector with the same dimensionality, with the corresponding TF/IDF values at every position. Finally, every document vector is compared to the query vector using one of the vector similarity metrics, cosine similarity in particular. Cosine similarity yields a number between 0 and 1, so its result can be used directly as a document score for ranking construction or further calculations. For the sake of simplicity and readability, let’s skip the computation details and see the result.
The query “documents solr” will take into consideration only the content field. The query vector holds the IDF value of each term, and each document vector holds the corresponding TF/IDF values.
Cosine similarity values (calculated using formulas from [1]) between query vector and individual document vectors are (please mind the ordering):
This means that document d1 is the most relevant, d3 just a little less relevant, and the rest much less relevant. This conforms with the manual investigation and the previous results.
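The cosine similarity step can be sketched as follows; the vectors below are invented placeholders over the terms (“documents”, “solr”), not the blog’s actual values:

```python
import math

def cosine(a, b):
    # Standard cosine similarity: dot product over the product of norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Invented vectors: the query vector holds IDF values, each document
# vector the matching TF/IDF values for ("documents", "solr").
query = [0.69, 0.69]
doc_vectors = {
    "d1": [1.2, 0.9],
    "d2": [0.0, 0.0],
    "d3": [0.5, 1.4],
    "d4": [0.0, 0.0],
}
scores = {name: cosine(query, vec) for name, vec in doc_vectors.items()}
```

Since all TF/IDF components are non-negative, the result always lands in [0, 1] and can be used directly as the ranking score.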
Ranking is expressed as an order of documents, where precedence denotes a better fit. With the presented methods, which all produce a single scalar score, ranking construction is trivial: it just requires sorting the scores in descending order.
The most important criterion for relevancy model verification is effectiveness: how many of the top documents are really relevant, and whether they are really more relevant than results ranked lower. Models can also be examined against many other functional and non-functional requirements. Of these, the most important seems to be performance. It is especially significant in modern search solutions offering query completion or so-called instant search, where preliminary results are displayed as successive terms land in the query input field.
Elasticsearch Explain API
TF/IDF is implemented in the Apache Lucene library (the default similarity up to version 6.0, which replaced it with BM25), which is the foundation of Elasticsearch. Elasticsearch enables us to discover scoring details on concrete data thanks to the Explain API. It can be used for relevance calculation tuning, index configuration adjustments, as well as query tuning. The Explain API consists of a single request URL and method:
GET 'localhost:9200/<index>/<type>/<docid>/_explain' -d '<query>'
In the case of the previously discussed example, it takes the following shape:
$ curl -XGET 'localhost:9200/blog/posts/1/_explain?q=title:solr&pretty'
{
  "_index" : "blog",
  "_type" : "posts",
  "_id" : "1",
  "matched" : true,
  "explanation" : {
    "value" : 0.11506981,
    "description" : "sum of:",
    "details" : [ {
      "value" : 0.11506981,
      "description" : "weight(title:solr in 0) [PerFieldSimilarity], result of:",
      "details" : [ {
        "value" : 0.11506981,
        "description" : "fieldWeight in 0, product of:",
        "details" : [ {
          "value" : 1.0,
          "description" : "tf(freq=1.0), with freq of:",
          "details" : [ {
            "value" : 1.0,
            "description" : "termFreq=1.0",
            "details" : [ ]
          } ]
        }, {
          "value" : 0.30685282,
          "description" : "idf(docFreq=1, maxDocs=1)",
          "details" : [ ]
        }, {
          "value" : 0.375,
          "description" : "fieldNorm(doc=0)",
          "details" : [ ]
        } ]
      } ]
    }, {
      … other components ...
    } ]
  }
}
The returned document consists of two elements. The first is boolean information on whether the specific document would be returned as a result of the given query or not. The second is the scoring explanation. It comprises the overall score and the results of the specific phases presented in the TF/IDF description, plus additional components used by Elasticsearch, like boosting and the query norm. The Explain API allows issuing queries with different additional settings, such as the analyzer or the implicit default field, so that they can be tried on real data without deploying to a production or testing environment.
Summary
The TF/IDF model, despite its definite advantages such as effectiveness, simplicity and performance, has some drawbacks too. Its dependence on term frequency makes it vulnerable to common, repeated words. It does not consider the context of a word’s occurrences, so a returned document may contain the sought term multiple times and yet not fit the user’s expectations. In other words, it considers documents only at the lexical level.
This post covered the basic components of relevancy and the scoring of candidate results in a search engine, with a specific focus on the popular TF/IDF model and Elasticsearch. We went through an example search on a set of four Findability blog entries, both manually and using Elasticsearch. TF/IDF turned out to be a simple yet very powerful model. Supported by other components and Findwise competence, it can show even more strength.