Query Completion with Apache Solr

This functionality goes by many names: query completion, suggestions, auto-complete, auto-suggest, word completion, type-ahead and probably a few more. We could point out slight differences between them (suggestions can be based on your indexed documents or on external input such as users’ queries), but from a technical point of view they all do the same thing: propose a query to the end user.

Early Google Suggest from 2008. Source: http://www.wpromote.com/blog/4-things-in-08-that-changed-the-face-of-search/

Google introduced its suggest feature in 2008, eight years ago. Users got used to query completion, and nowadays it’s a common feature of all mature search engines, e-commerce platforms and even internal enterprise search solutions.

Suggestions help users navigate the web portal, let them discover relevant content and recommend popular phrases (and thus search results). In e-commerce they are even more important, because well-implemented query completion can raise the conversion rate and, ultimately, increase sales revenue. A completed query should never lead to zero results, yet this mistake is made frequently.

Just as there are many names for this feature, there are many ways to build it. Still, implementing query completion that works well is not a trivial task. Software like Apache Solr doesn’t solve the whole problem. Building auto-suggestions is also about data (what should we present to users?), its quality (e.g. when we want to suggest other users’ queries), the order of suggestions (we got dozens of matches, but we can show only 5; which are the most important?) and design (user experience and the like).

Back to the technology. Query completion can be built in a couple of ways with Apache Solr. You can use mechanisms like facets, terms, the dedicated suggest component, or just run a query (with e.g. the dismax parser).

Take a look at the Suggester. It’s very easy to run: you just need to configure a searchComponent and a requestHandler. Example:

<searchComponent name="suggester" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">suggester1</str>
    <str name="lookupImpl">FuzzyLookupFactory</str>
    <str name="dictionaryImpl">DocumentDictionaryFactory</str>
    <str name="field">title</str>
    <str name="weightField">popularity</str>
    <str name="suggestAnalyzerFieldType">text</str>
  </lst>
</searchComponent>
<requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <str name="suggest">true</str>
    <str name="suggest.count">10</str>
  </lst>
  <arr name="components">
    <str>suggester</str>
  </arr>
</requestHandler>

SuggestComponent is a ready-to-use implementation responsible for serving suggestions based on commands and queries. It’s an efficient solution, partly because it works on a structure that is separate from the main index and kept in memory. There are some basic settings, such as the field used for completion and the text analysis chain. lookupImpl defines how terms in the index are matched. There are about 10 algorithms with different purposes. Probably the most popular are:

  • AnalyzingLookupFactory (finds matches based on prefix),
  • FuzzyLookupFactory (finds matches with misspellings),
  • AnalyzingInfixLookupFactory (finds matches anywhere in the text),
  • BlendedInfixLookupFactory (combines matches based on prefix and infix lookup).

You need to choose the one which fulfills your requirements. The second important parameter is dictionaryImpl, which determines how the indexed suggestions are stored. Again, you can choose between a couple of implementations, e.g. DocumentDictionaryFactory (stores terms, weights and an optional payload) or HighFrequencyDictionaryFactory (useful when very common terms overwhelm others; you can set a suitable threshold).
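To make the difference between prefix and infix lookups more tangible, here is a minimal Python sketch. It does not use Solr at all: the function names and sample titles are made up for illustration, and it ignores analysis, weights and fuzzy matching. Infix matching is simplified to "any word starts with the typed text".

```python
def prefix_matches(entries, typed):
    """Return entries whose text starts with the typed characters."""
    typed = typed.lower()
    return [text for text in entries if text.lower().startswith(typed)]


def infix_matches(entries, typed):
    """Return entries in which any word starts with the typed characters."""
    typed = typed.lower()
    return [
        text
        for text in entries
        if any(word.startswith(typed) for word in text.lower().split())
    ]


if __name__ == "__main__":
    # Hypothetical titles, chosen only to show the difference.
    titles = ["London", "Darling London", "Londonderry", "Blonde Bombshell"]
    print(prefix_matches(titles, "lond"))
    print(infix_matches(titles, "lond"))
```

With the prefix variant "Darling London" is not suggested for "lond"; with the infix variant it is.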

There are plenty of other settings you can use to customize your suggester. SuggestComponent is a good start and probably covers many cases, but like everything it has limitations, e.g. you can’t easily filter the results.

Example execution:

http://localhost:8983/solr/index/suggest?wt=json&suggest.dictionary=suggester1&suggest.q=lond

suggestions: [
  { term: "london" },
  { term: "londonderry" },
  { term: "londoño" },
  { term: "londoners" },
  { term: "londo" }
]
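In an application you will usually build this request from whatever the user has typed so far. A minimal Python sketch follows; the helper name is my own, and it assumes the /suggest handler and the suggester1 dictionary from the configuration above. It only builds the URL and does not call Solr.

```python
from urllib.parse import urlencode


def build_suggest_url(base_url, dictionary, typed, count=5):
    """Build a request URL for a Solr suggest handler (illustrative helper)."""
    params = {
        "wt": "json",
        "suggest": "true",
        "suggest.dictionary": dictionary,
        "suggest.q": typed,  # user input, URL-encoded by urlencode
        "suggest.count": count,
    }
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    # Host, core name and dictionary name are taken from the examples above.
    print(build_suggest_url("http://localhost:8983/solr/index/suggest", "suggester1", "lond"))
```

Encoding the user input matters as soon as people type spaces or special characters.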

Another way to build a query completion is to use mechanisms like faceting, terms or highlighting.

An example of query completion (QC) built on facets:

http://localhost:8983/solr/index/select?q=*:*&facet=on&facet.field=title_keyword&facet.mincount=1&facet.contains=lon&rows=0&wt=json

title_keyword: [
  "blonde bombshell", 2,
  "12-pounder long gun", 1,
  "18-pounder long gun", 1,
  "1957 liga española de baloncesto", 1,
  "1958 liga española de baloncesto", 1
]

Notice that we used the facet.contains parameter here, so the query also matches in the middle of a phrase. It does a plain substring match on the facet values. Additionally, the Solr response contains a count for every suggestion.
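In the JSON response the facet values and their counts arrive as one flat, alternating list, as in the output above. A small Python sketch of turning that list into ranked suggestions could look like this (the function names are my own, not part of any Solr client):

```python
def facet_pairs(flat):
    """Turn ["value", count, "value", count, ...] into [(value, count), ...]."""
    return list(zip(flat[0::2], flat[1::2]))


def top_suggestions(flat, limit=5):
    """Order facet values by descending count and keep the first `limit`."""
    pairs = facet_pairs(flat)
    # sorted() is stable, so values with equal counts keep Solr's order.
    ranked = sorted(pairs, key=lambda pair: -pair[1])
    return [value for value, _count in ranked[:limit]]


if __name__ == "__main__":
    flat = ["blonde bombshell", 2, "12-pounder long gun", 1, "18-pounder long gun", 1]
    print(top_suggestions(flat, limit=2))
```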

TermsComponent (which returns indexed terms and the number of documents that contain each term) and highlighting (originally meant to emphasize the fragments of documents that match the user’s query) can also be used, as shown below.

Terms example:

<searchComponent name="terms" class="solr.TermsComponent"/>
<requestHandler name="/terms" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <bool name="terms">true</bool>
    <bool name="distrib">false</bool>
  </lst>
  <arr name="components">
    <str>terms</str>
  </arr>
</requestHandler>

http://localhost:8983/solr/index/terms?terms.fl=title_general&terms.prefix=lond&terms.sort=index&wt=json

title_general: [
  "londinium",
  "londo",
  "london",
  "london's",
  "londonderry"
]
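The idea behind terms.prefix with terms.sort=index can be sketched in a few lines of Python: the terms are kept in sorted order, so all terms sharing a prefix sit next to each other. This is only an illustration of the principle, not how Solr implements it, and the function name is an assumption of mine.

```python
from bisect import bisect_left


def terms_with_prefix(sorted_terms, prefix, limit=10):
    """Return up to `limit` terms that start with `prefix`, in sorted order."""
    start = bisect_left(sorted_terms, prefix)  # first term >= prefix
    result = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix) or len(result) >= limit:
            break
        result.append(term)
    return result


if __name__ == "__main__":
    # Hypothetical term list; it must be sorted for bisect to work.
    terms = sorted(["poultry", "london", "londo", "darling", "londinium",
                    "london's", "long", "londonderry"])
    print(terms_with_prefix(terms, "lond"))
```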

Highlighting example:

http://localhost:8983/solr/index/select?q=title_ngram:lond&fl=title&hl=true&hl.fl=title&hl.simple.pre=&hl.simple.post=

title_ngram: [
  "londinium",
  "londo",
  "london",
  "london's",
  "londonderry"
]

You can even do auto-complete with an ordinary full-text query. It has lots of advantages: Lucene scoring works, and you get filtering, boosts, matching across many fields and the whole Lucene/Solr query syntax. Take a look at this eDisMax example:

http://localhost:8983/solr/index/select?q=lond&qf=title_ngram&fl=title&defType=edismax&wt=json

docs: [
  { title: "Londinium" },
  { title: "London" },
  { title: "Darling London" },
  { title: "London Canadians" },
  { title: "Poultry London" }
]

Whether you base your solution on facets, a query or SuggestComponent, the secret is the analyzer chain. Depending on the effect you want to achieve with your QC, you need to index the data in the right way. Sometimes you may want to suggest single terms, at other times whole sentences or product names. If you want to suggest letter by letter, for example, you can use the Edge N-Gram Filter. Example:

<fieldType name="text_ngram" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="50" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

An n-gram is a sequence of n items (the size depends on the given range) taken from a given text. An edge n-gram always starts at the beginning of the term. Example: the term Findwise with minGramSize = 1 and maxGramSize = 10 will be indexed as:

F
Fi
Fin
Find
Findw
Findwi
Findwis
Findwise

With text indexed this way, you can easily let the user see the suggestions change after each letter.
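If you want to check what will end up in the index for a given term, the edge n-gram expansion is easy to reproduce outside Solr. Here is a minimal Python sketch; the function and its parameter names are mine and simply mirror minGramSize and maxGramSize.

```python
def edge_ngrams(term, min_gram=1, max_gram=50):
    """Return the leading substrings of `term` with lengths min_gram..max_gram."""
    longest = min(len(term), max_gram)
    return [term[:size] for size in range(min_gram, longest + 1)]


if __name__ == "__main__":
    for gram in edge_ngrams("Findwise", min_gram=1, max_gram=10):
        print(gram)
```

Each of these grams becomes a separate indexed token, which is why a query for "Fin" can match the document without any wildcard.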

Another case is the ability to complete word after word (like Google does). It isn’t trivial, but you can try a shingle structure. Shingles are similar to n-grams, but they work on whole words. Example: Searching is really awesome with minShingleSize = 2 and maxShingleSize = 3 will be indexed as:

Searching is
Searching is really
is really
is really awesome
really awesome
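The same expansion can be reproduced with a short Python sketch. It is a simplification: the function is my own, it splits on whitespace only, and it leaves out the single words, which Solr’s shingle filter also emits by default.

```python
def shingles(text, min_size=2, max_size=3):
    """Return word n-grams (shingles) of min_size..max_size words."""
    words = text.split()
    result = []
    for start in range(len(words)):
        for size in range(min_size, max_size + 1):
            if start + size <= len(words):
                result.append(" ".join(words[start:start + size]))
    return result


if __name__ == "__main__":
    for shingle in shingles("Searching is really awesome"):
        print(shingle)
```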

Example of Shingle Filter:

<fieldType name="text_shingle" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="10" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

What if your users could use QC that supports synonyms? Then they could type e.g. an abbreviation and get the full suggestion (NYC -> New York City, UEFA -> Union Of European Football Associations). It’s easy, just use the Synonym Filter in your text field:

<fieldType name="text_synonym" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
  </analyzer>
</fieldType>

And then just do a query:

http://localhost:8983//select?defType=edismax&fl=title&q=nyc&qf=title_synonym&wt=json

docs: [
  { title: "New York City" },
  { title: "New York New York" },
  { title: "Welcome to New York City" },
  { title: "City Club of New York" },
  { title: "New York" }
]
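Conceptually, the query-time synonym filter rewrites what the user typed before it is matched against the index. A rough Python sketch of that idea is below; the dictionary stands in for synonyms.txt, and its entries, like the function name, are hypothetical.

```python
# Hypothetical mapping, standing in for the contents of synonyms.txt.
SYNONYMS = {
    "nyc": "new york city",
    "uefa": "union of european football associations",
}


def expand_query(query, synonyms=SYNONYMS):
    """Replace each known abbreviation in the query with its full form."""
    return " ".join(synonyms.get(word, word) for word in query.lower().split())


if __name__ == "__main__":
    print(expand_query("NYC"))
    print(expand_query("hotels nyc"))
```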

Another very similar example concerns language support and matching suggestions regardless of the form of a term. It can be especially valuable for languages with rich grammar rules and declension. Just as with the Synonym Filter, we can configure a stemming or lemmatization filter, e.g. for English (remember to apply the language filter at both index and query time), and so expand the matching suggestions.

As you can see, there are many ways to run query completion. You need to pick the right mechanism and text analysis based on your own limitations and on what you want to achieve.

There are also other topics connected with preparing a type-ahead solution. You need to consider performance, which mostly comes down to response time and memory consumption. How many requests will QC generate? You can assume at least 3 times more than your regular search service. You can handle the traffic growth by optimizing Solr caches or by running separate Solr instances dedicated to the suggestion service. If you create n-grams, shingles or similar structures, be aware that your index size will increase. Remember that if you decide to use facets or highlighting to provide suggestions, both mechanisms put a heavy load on your CPU.

In my opinion, the most challenging issue is choosing a data source for the query completion mechanism. Should you suggest parts of your documents (like titles, keywords, authors)? Or use NLP algorithms to extract meaningful phrases from your content? Maybe parse search/application logs and use the most popular user queries (be careful: filter out rubbish and normalize users’ input)? I believe the answer is YES, to all of them. Suggestions should be diversified (to lead your users to a wide range of search resources) and should come from a variety of sources. More than likely, you will have to work hard on processing documents; remember that data cleaning is crucial.
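As an illustration of the log-based approach, here is a minimal Python sketch that normalizes logged queries and keeps only those that are frequent enough. The thresholds, the function names and the sample log are assumptions made for the example; real logs need far more cleaning than this.

```python
from collections import Counter


def normalize(query):
    """Lowercase the query and collapse repeated whitespace."""
    return " ".join(query.lower().split())


def popular_queries(log_lines, min_count=2, min_length=3):
    """Count normalized queries and drop rare or very short ones."""
    counts = Counter(normalize(line) for line in log_lines)
    return [
        (query, count)
        for query, count in counts.most_common()
        if count >= min_count and len(query) >= min_length
    ]


if __name__ == "__main__":
    # Hypothetical log entries, one query per line.
    log = ["London", " london ", "LONDON  hotels", "london hotels", "xx", "xx", "asdfgh"]
    print(popular_queries(log))
```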

Similarly, you need to consider different strategies for ordering the proposed suggestions. It’s good to show them in alphanumeric order (while still respecting scoring!), but you can’t stop there. What is specific to QC is that the application can return hundreds of matches, but you can present only 5 or 10 of them. That’s why you need to promote the suggestions that occur most often in the index or are the most popular among users. Further enhancements can involve personalizing query completion, using geographical coordinates or implementing security trimming (you see only the suggestions you are allowed to see).
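A simple version of such a strategy, picking the few heaviest matches and falling back to alphabetical order for ties, can be sketched in Python as follows. The weights and the function name are made up for the example; in practice the weight could be an index frequency or a query-log count.

```python
import heapq


def top_n(matches, n=5):
    """Pick the n matches with the highest weight; ties are ordered alphabetically."""
    # Each match is a (text, weight) pair; negating the weight turns
    # "largest weight first" into a smallest-key selection.
    return heapq.nsmallest(n, matches, key=lambda match: (-match[1], match[0]))


if __name__ == "__main__":
    # Hypothetical matches with made-up popularity weights.
    matches = [("london", 120), ("londonderry", 15), ("londo", 3),
               ("londinium", 15), ("londoners", 40)]
    print(top_n(matches, n=3))
```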

I’m sure that this blog post doesn’t exhaust the subject of building query completion, but I hope I brought the topic closer and showed the complexity of such a task. There are many different dimensions you need to handle, like the data source of your suggestions, choosing the right indexing structure, performance, ranking, and even UX and design (how would you like to present hints: as simple text or with some graphics/images? Would you like to divide suggestions into categories? Do you always want to show the result page after a suggestion is clicked, or maybe redirect to a particular landing page?).

A search engine like Apache Solr is a tool, but you still need an application with the whole business logic on top of it. Do you want prefix matching and infix matching? To support typos and synonyms? To suggest letter by letter or word by word? To implement security requirements or advanced ranking to propose the best hints for your users? These questions, and more, need to be thought through to deliver successful query completion.

Finding business values in the emerging digital workplace

How does one experience the promised business rewards of the emerging digital workplace (a.k.a the intranet)?

A group of renowned intranet professionals have taken on this question and offer sound practical advice on how to achieve real business value in their new book “intranets that create business value”, or in Swedish “intranät som skapar värde”.

Today, in fact most days, end-users feel bewildered when using the intranet. It is to some extent impossible to navigate. There exists a hodgepodge of mixed user experiences, given that the intranet often serves as the access point to several tools. And findability too is low! With a coherent, smooth and interoperable workplace, users should be able to find information and data, peers and colleagues to solve their everyday tasks, in an efficient way… anywhere, on any device and anytime.

The authors’ narrative describes how the intranet can best be used to produce beneficial business transformation, by including detailed chapters on: strategy, content & information architecture, search/findability, governance and stakeholder management, end-user engagement and adaptation. Measures and metrics are also included to qualify the sought after business values.

Findwise have contributed to the sections relating to organising principles. Put simply, it should be easy for a user to know where and how to contribute with information and content in a good manner, so that others are able to find and co-act on such codified knowledge.

Without sound and sustainable organising principles there will be no findability: shit in = shit out! Regardless of the technology platform employed for search or intranet.

Buy the e-book today, in advance of the published printed version in May!

Finding the right information requires finding the right talent

At Findwise we are experts in helping organizations set up systems to find their corporate information and present it in every way imaginable. But we are not only good at finding the right information, we must also be good at finding the right people to come work for us.

The people working here are highly skilled consultants within different areas such as business consulting, information management, text analytics, user experience, system design etc. They all have two things in common: they were handpicked to work here because of their unique expertise and passion for search technology, and they could all easily have chosen to work someplace else.

Our way of finding these people is based on the notion that talent attracts talent. That means in order to find new talented people we must make the ones who are already working here thrive and come to work each day filled with joy and anticipation. That creates the ripple effect we need to compete for talent with the giants of the IT industry.

How do we accomplish that? Well, talented people must be respected as equals and be given the freedom to create and innovate. You don’t hire a talent to tell him or her exactly how to do what they are talented at. That would be like hiring Michael Jackson and then telling him how to write a hit song. We want our talented people to feel encouraged to act independently and bravely, that is how their talents best are put to use for Findwise and our clients.

Within the corporate world today these are still surprisingly uncommon ideas and two of the major daily newspapers in Sweden have both written about our approach to management, an article in SvD last fall and one recently in Dagens Nyheter.

To summarize the news, Findwise’s approach to management is to continue to uphold an open, trusting environment with a flat corporate structure, in which flexible working hours, freedom and personal responsibility are central. »Use your own judgment« is the golden rule which encourages a fighting spirit and the desire to develop new ideas.

We gladly walk this talk. And it has paid off so far. We are now employing more than 90 people, have managed more than 700 client projects and have enjoyed steady and profitable growth since the start in 2005. We are well on our way towards becoming a world-leading enterprise in our field, and it is all thanks to the talented people who work here.

/Olof Belfrage

Video interview: How to Improve the Search Experience

Video interview with Kristian Norling at the Intrateam Event in Copenhagen 2012. Kristian talks about his former work at VGR and what he thinks is important for improving the search experience.

Kristian Norling

Watch the video

Findability, a holistic approach to implementing search technology

We are proud to present the first video on our new Vimeo channel. Enjoy!

Findability Dimensions

A successful search project does not only involve technology; having the most skilled developers is not enough. To utilise the full potential and receive return on search technology investments there are five main dimensions (or perspectives) that all need to be in focus when developing search solutions, and that require additional competencies to be involved.

This holistic approach to implementing search technology we call Findability by Findwise.

Swedish Employees Waste Time and Money Looking for Information

Canon has just published a study showing that half of the Swedish employees waste about 4000 Euros or 6000 USD per employee and year searching for information. The conclusion was drawn after interviewing over 1000 people of which over half used more than 10 hours per month looking for information. A quarter of the subjects in the study said that they spent up to 20 hours. These are very interesting numbers that show how profitable an investment in Findability can be.

Link to the article (only in Swedish)

Search Conferences 2011

During 2011 a large number of search conferences will take place all over the world. Some of them are dedicated to search, whereas others discuss the topic related to specific products, information management, usability etc.

Here are a few that might be of interest for those of you looking to be inspired and broaden your knowledge. Within a few weeks we will compile all the research related conferences – there are quite a few of them out there!
If there is anything you miss, please post a comment.

March
IntraTeam Event Copenhagen 2011
Main focus: Social intranets, SharePoint and Enterprise Search
March 1, 2 and 3, 2011, Copenhagen, Denmark

Webcoast
Main focus: A web event that is an unconference, meaning that the attendees themselves create the program by presenting on topics of their own expertise and interest.
March 18-20, Gothenburg, Sweden

Info360
Main focus: Business productivity, Enterprise Content Management, SharePoint 2010
March 21-24, Walter E. Washington Convention Center, Washington, USA

April
International Search Summit Munich
Main focus: International search and social media.
4th April 2011, Hilton Munich Park Hotel, Germany

ECIR 2011: European Conference on Information Retrieval
Main focus: Presentation of new research results in the field of Information Retrieval
April 18-21, Dublin, Ireland

May
Enterprise Search Summit Spring 2011
Main focus: Develop, implement and enhance cutting-edge internal search capabilities
May 10-11, New York, USA

International Search Summit: London
Main focus: International search and social media
May 18th, Millennium Gloucester Hotel, London, England

Lucene Revolution
Main focus: The world’s largest conference dedicated to open source search.
May 25-26, San Francisco Airport Hyatt Regency, USA

SharePoint Fest – Denver 2011
Main focus: In search track: Enterprise Search, Search & Records Management, & FAST for SharePoint
May 19-20, Colorado Convention Center, USA

June
International Search Summit Seattle
Main focus: International search and social media
June 9th, Bell Harbor Conference Center, Seattle, USA

2011 Semantic Technology Conference
Main focus: Semantic technologies – including Search, Content Management, Business Intelligence
June 5-9, Hilton Union Square, San Francisco, USA

October
SharePoint Conference 2011
Main focus: SharePoint and related technologies
October 3-6, Anaheim, California, USA

November
Enterprise Search Summit Fall Nov 1-3
Main focus: How to implement, manage, and enhance search in your organization
Integrated with the KMWorld Conference, SharePoint Symposium and Taxonomy Bootcamp

KM-world
(Co-locating with Enterprise Search Summit Fall, Taxonomy Boot Camp and Sharepoint Symposium)
Main focus: Knowledge creation, publishing, sharing, finding, mining, reuse etc
November 1 – 3, Washington Marriott Wardman Park, Washington DC, USA

Gilbane group Boston
Main focus: Within search: semantic, mobile, SharePoint, social search
November 29 – December 1, Boston, USA

Findability by Findwise

Being the hosts of “The Search and Findability blog”, we believe it is time to define and explain what Findwise means by these terms and how they relate.

“Findability” is not a new term or concept. As stated on Wikipedia, Peter Morville is often credited for having introduced the term and it is used in different areas related to the quality of being locatable or navigable either in terms of finding information in the digital world or geographical locations.

“Search” is, at least in the world of IT, commonly associated with either Google on the web, or a search box in the corner of the company Intranet or other websites. Most people have positive experiences from searching with Google on the web but rather poor, sometimes even terrible, experiences from searching at company websites and in internal systems and applications.

Simple search box

The primary focus of Findwise is to improve the experience and benefits from using search technology in the corporate setting. By itself, we don’t believe that the term “Search” or even “Enterprise Search” fully reflects this focus as it limits the scope of search technology to being “just” the search box in the website corner, which often provides undesirable results. From experience, we know that modern search technology can be utilised in multiple ways to fulfil the needs of an organisation to make information accessible both to their employees and customers. The search box is only one way. Therefore, to support and explain our aims and focus in relation to search technology, we have defined the concept of “Findability by Findwise”.

Findability by Findwise expands the area of search and value of search technology by taking a holistic approach to the challenge of creating business value from internal and external information assets. Findability by Findwise is all about maximising the customer business value gained from search technology investments. Making sure that search technology is implemented and utilised to best support and strengthen the business processes and help the organisation to reach its business goals.

The value generated by the Findability solution could be both:

  • Internal: improving employee efficiency and their ability to truly benefit from existing information assets and previous investments in various systems to store and structure information.
  • External: making sure stakeholders can access the information they need in order to become or remain profitable customers.

From the statement above, it is easy to understand that to gain the desired effects and value of search technology investments, it is not enough only to focus on and master the actual technology. Or as stated in an AIIM report from 2008:

“Findability is more about a well-defined and executed strategy model than it is about technology.”

AIIM Market IQ Intelligence Quarterly Q2 2008

Therefore, a Findability solution by Findwise creates true customer business value, i.e. it makes desired information accessible to internal or external stakeholders, by:

BOTH using the full potential of search technology,
AND focusing on the four other critical dimensions of Findability:

  • Business – The use of search technology should support and leverage the existing business processes.
  • Users – The solution must be designed and tailored to fit the needs and capabilities of the users.
  • Information – The quality and structure of existing and newly produced information is an important success factor of the solution.
  • Organisation – The organisation must establish a process to govern the solution and maintain Findability for future needs.

We have chosen the symbol of a flower to illustrate the concept and dimensions of Findability by Findwise:

Findability by Findwise

In other words, the beauty and health of the Findability Flower™ can be likened to the extent to which search technology is utilised to support and leverage the organisation’s business needs and goals. That is what Findability by Findwise is all about.

Visit our website to read more about Findability by Findwise and how we work to create Findability solutions that make our customers truly benefit from state-of-the-art search technology.