There are plenty of names for this functionality: query completion, suggestions, auto-complete, auto-suggest, word completion, type ahead and maybe some more. Even if we may point slight differences between them (suggestions can base on your index documents or external input such users queries), from technical point of view it’s all about the same: to propose a query for the end user.
early Google Suggest from 2008. Source: http://www.wpromote.com/blog/4-things-in-08-that-changed-the-face-of-search/
Suggester feature was started 8 years ago by Google, in 2008. Users got used to the query completion and nowadays it’s a common feature of all mature search engines, e-commerce platforms and even internal enterprise search solutions.
Suggestions help with navigating users through the web portal, allow to discover relevant content and recommend popular phrases (and thus search results). In the e-commerce area they are even more important because well implemented query completion is able to high up conversion rate and finally – increase sales revenue. Word completion never can lead to zero results, but this kind of mistake is made frequently.
And as many names describe this feature there are so many ways to build it. But still it’s not so trivial task to implement good working query completion. Software like Apache Solr doesn’t solve whole problem. Building auto-suggestions is also about data (what should we present to users), its quality (e.g. when we want to suggest other users’ queries), suggestions order (we got dozens matches, but we can show only 5; which are the most important?) or design (user experience or similar).
Going back to the technology. Query completion can be built in couple of ways with Apache Solr. You can use mechanisms like facets, terms, dedicated suggest component or just do a query (with e.g. dismax parser).
Take a look at Suggester. It’s very easy to run. You just need to configure searchComponent and requestHandler. Example:
<searchComponent name="suggester" class="solr.SuggestComponent">
<lst name="suggester">
<str name="name">suggester1</str>
<str name="lookupImpl">FuzzyLookupFactory</str>
<str name="dictionaryImpl">DocumentDictionaryFactory</str>
<str name="field">title</str>
<str name="weightField">popularity</str>
<str name="suggestAnalyzerFieldType">text</str>
</lst>
</searchComponent>
<requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy">
<lst name="defaults">
<str name="suggest">true</str>
<str name="suggest.count">10</str>
</lst>
<arr name="components">
<str>suggester</str>
</arr>
</requestHandler>
SuggestComponent is a ready-to-use implementation, which is responsible for serving up suggestions based on commands and queries. It’s an efficient solution, i.e. because it works on structure separated from main index and it’s being kept in memory. There are some basic settings like field used for autocompleting or defining text analyzing chain. LookImpl defines how to match terms in index. There are about 10 algorithms with different purpose. Probably the most popular are:
- AnalyzingLookupFactory (default, finds matches based on prefix)
- FuzzyLookupFactory (finds matches with misspellings),
- AnalyzingInfixLookupFactory (finds matches anywhere in the text),
- BlendedInfixLookupFactory (combines matches based on prefix and infix lookup)
You need to choose the one which fulfill your requirements. The second important parameter is dictionaryImpl which represents how indexed suggestions are stored. And again, you can choose between couple of implementations, e.g. DocumentDictionaryFactory (stores terms, weights, and optional payload) or HighFrequencyDictionaryFactory (works when very common terms overwhelm others, you can set up proper threshold).
There are plenty of different settings you can use to customize your suggester. SuggestComponent is a good start and probably covers many cases, but like everything, there are some limitations like e.g. you can’t easily filter out results.
Example execution:
http://localhost:8983/solr/index/suggest?wt=json&suggest.dictionary=analyzingSuggester&suggest.q=lond
suggestions: [
{ term: "london" },
{ term: "londonderry" },
{ term: "londoño" },
{ term: "londoners" },
{ term: "londo" }
]
Another way to build a query completion is to use mechanisms like faceting, terms or highlighting.
The example of QC built on facets:
http://localhost:8983/solr/index/select?q=*:*&facet=on&facet.field=title_keyword&facet.mincount=1&facet.contains=lon&rows=0&wt=json
title_keyword: [
"blonde bombshell", 2,
"12-pounder long gun", 1,
"18-pounder long gun", 1,
"1957 liga española de baloncesto", 1,
"1958 liga española de baloncesto", 1
]
Please notice that here we have used facet.contains method, so query matches also in the middle of phrase. It works on the basis of regular expression. Additionally, we have a count for every suggestion in Solr response.
TermsComponent (returns indexed terms and the number of documents which contain each term) and highlighting (originally, emphasize fragments of documents that match the user’s query) can be also used, what is presented below.
Terms example:
<searchComponent name="terms" class="solr.TermsComponent"/>
<requestHandler name="/terms" class="solr.SearchHandler" startup="lazy">
<lst name="defaults">
<bool name="terms">true</bool>
<bool name="distrib">false</bool>
</lst>
<arr name="components">
<str>terms</str>
</arr>
</requestHandler>
http://localhost:8983/solr/index/terms?terms.fl=title_general&terms.prefix=lond&terms.sort=index&wt=json
title_general: [
"londinium",
"londo",
"london",
"london's",
"londonderry"
]
Highlighting example:
http://localhost:8983/solr/index/select?q=title_ngram:lond &fl=title&hl=true&hl.fl=title&hl.simple.pre=&hl.simple.post=
title_ngram: [
"londinium",
"londo",
"london",
"london's",
"londonderry"
]
You can also do auto-complete even with usual, full-text query. It has lots of advantages: Lucene scoring is working, you have filtering, boosts, matching through many fields and whole Lucene/Solr queries syntax. Take a look at this eDisMax example:
http://localhost:8983/solr/index/select?q=lond&qf=title_ngram&fl=title&defType=edismax&wt=json
docs: [
{ title: "Londinium" },
{ title: "London" },
{ title: "Darling London" },
{ title: "London Canadians" },
{ title: "Poultry London" }
]
The secret is an analyzer chain whether you want to base on facets, query or SuggestComponent. Depending on what effect you want to achieve with your QC, you need to index data in a right way. Sometimes you may want to suggest single terms, another time – whole sentences or product names. If you want to suggest e.g. letter by letter you can use Edge N-Gram Filter. Example:
<fieldType name="text_ngram" class="solr.TextField">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory minGramSize="1" maxGramSize="50" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
N-Gram is a structure of n items (size depends on given range) from a given sequence of text. Example: term Findwise, minGramSize = 1 and maxGramSize = 10 will be indexed as:
F
Fi
Fin
Find
Findw
Findwi
Findwis
Findwise
With such indexed text you can easily achieve functionality where user is able to see changing suggestions after each letter.
Another case is an ability to complete word after word (like Google does). It isn’t trivial, but you can try with shingle structure. Shingles are similar to N-Gram, but it works on whole words. Example: Searching is really awesome, minShingleSize = 2 and minShingleSize = 3 will be indexed as:
Searching is
Searching is really
is really
is really awesome
really awesome
Example of Shingle Filter:
<fieldType name="text_shingle" class="solr.TextField">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ShingleFilterFactory" maxShingleSize="10" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
What if your users could use QC which supports synonyms? Then they could put e.g. abbreviation and find a full suggestion (NYC -> New York City, UEFA -> Union Of European Football Associations). It’s easy, just use Synonym Filter in your text field:
<fieldType name="text_synonym" class="solr.TextField">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
</analyzer>
</fieldType>
And then just do a query:
http://localhost:8983//select?defType=edismax&fl=title&q=nyc&qf=title_synonym&wt=json
docs: [
{ title: "New York City" },
{ title: "New York New York" },
{ title: "Welcome to New York City" },
{ title: "City Club of New York" },
{ title: "New York" }
]
Another very similar example concerns language support and matching suggestions regardless of the terms’ form. It can be especially valuable for languages with the rich grammar rules and declination. In the same way how SynonymsFilter is used, we can configure a stemmer / lemmatization filter e.g. for English (take a look here and remember to put language filter both for index and query time) and expand matching suggestions.
As you can see, there are many ways to run query completion, you need to adjust right mechanism and text analysis based on your own limitations and also on what you want to achieve.
There are also other topics connected with preparing type ahead solution. You need to consider performance issues, they are mostly centered on response time and memory consumption. How many requests will generate QC? You can assume that at least 3 times more than your regular search service. You can handle traffic growth by optimizing Solr caches, installing separated Solr instanced only for suggesting service. If you’ll create n-gram, shingles or similar structures, be aware that your index size will increase. Remember that if you decided to use facets or highlighting for some reason to provide suggester, this both mechanisms make your CPU heavy loaded.
In my opinion, the most challenging issue to resolve is choosing a data source for query completion mechanism. Should you suggest parts of your documents (like titles, keywords, authors)? Or use NLP algorithms to extract meaningful phrases from your content? Maybe parse search/application logs and use the most popular users queries? Be careful, filter out rubbish, normalize users input). I believe the answer is YES – to all. Suggestions should be diversified (to lead your users to a wide range of search resources) and should come from variety of sources. More than likely, you will need to do a hard job when processing documents – remember that data cleaning is crucial.
Similarly, you need to take into account different strategies when we talk about the order of proposed suggestions. It’s good to show them in alphanumeric order (still respect scoring!), but you can’t stop here. Specificity of QC is that application can return hundreds of matches, but you can present only 5 or 10 of them. That’s why you need to promote suggestions with the highest occurrence in index or the most popular among the users. Further enhancements can involve personalizing query completion, using geographical coordinates or implementing security trimming (you can see only these suggestions you are allowed to).
I’m sure that this blog post doesn’t exhaust the subject of building query completion, but I hope I brought this topic closer and showed the complexity of such a task. There are many different dimension which you need to handle, like data source of your suggestions, choosing right indexing structure, performance issues, ranking or even UX and designing (how would you like to present hints – simple text or with some graphics/images? Would you like to divide suggestions into categories? Do you always want to show result page after clicked suggestion or maybe redirect to particular landing page?).
Search engine like Apache Solr is a tool, but you still need an application with whole business logic above it. Do you want to have a prefix-match and infix-match? To support typos and synonyms? To suggest letter after the letter or word by word? To implement security requirements or advanced ranking to propose the best tips for your users? These and even more questions need to be think over to deliver successful query completion.