Building a chatbot that actually works

In the era of artificial intelligence and machine learning, chatbots have gained a lot of attention. Chatbots can, for example, help a user book a restaurant or schedule a flight. But why should organizations use chatbots instead of plain user interface (UI) systems? Chatbots are both easier and more natural to interact with than a UI system, which endorses their implementation in certain use cases. Additionally, a chatbot can engage a user for a longer time, which can help a company increase its business. A chatbot needs to understand natural language, as there can be multiple ways to express one’s intention, with all the ambiguity of language. Natural Language Processing (NLP) helps us achieve this to some extent.

Natural language processing – the foundation for a chatbot

Compared to rule-based solutions, chatbots using machine learning and language understanding are much more efficient. After years of new waves of statistical models, such as deep learning RNNs, LSTMs and transformers, these algorithms have become market standards.

NLP is a field of linguistics and artificial intelligence in which algorithms are used to understand, analyze, manipulate and potentially generate human-readable text. It usually contains two components: Natural Language Understanding (NLU) and Natural Language Generation (NLG).

To start with, the natural language input is mapped into a useful representation for machine reading comprehension. This is achieved using basics like tokenization, stemming / lemmatization and part-of-speech tagging. There are also more advanced elements, such as named entity recognition and chunking. The latter is a processing method which organizes the individual terms found previously into a more prominent structure. For example, ‘South Africa’ is more useful as a chunk than the individual words ‘South’ and ‘Africa’. A minimal sketch of these basics follows below.
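As an illustration, these basics can be tried out with the open-source spaCy library (also mentioned later in this post); the sample sentence and model name are our own assumptions, not from any particular chatbot:

# NLU preprocessing basics with spaCy (pip install spacy,
# python -m spacy download en_core_web_sm). The sample sentence is illustrative.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Book a flight from South Africa to London on Monday")

# Tokenization, lemmatization and part-of-speech tagging
for token in doc:
    print(token.text, token.lemma_, token.pos_)

# Named entities: multi-word chunks such as 'South Africa' stay together
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. 'South Africa' -> GPE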

FIGURE 1: A PROCESS OF BREAKING A USER’S MESSAGE INTO TOKENS

On the other side, NLG is the process of producing meaningful phrases and sentences in natural language from an internal structural representation, using e.g. content determination, discourse planning, sentence aggregation, lexicalization, referring expression generation or linguistic realization.

Open-domain and goal-driven chatbots

Chatbots can be classified into two categories: goal-driven and open-domain. Goal-driven chatbots are built to solve specific problems, such as flight bookings or restaurant reservations. An open-domain dialogue system, on the other hand, attempts to establish a long-term connection with the user, for purposes such as psychological support and language learning.

Goal-driven chatbots are based on slot filling and handcrafted rules, which are reliable but restrictive in conversation. A user has to go through a predefined dialogue flow to accomplish a task (a minimal illustration follows below).
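To make this concrete, here is a minimal sketch of slot filling with handcrafted rules; it is not tied to any particular framework, and the slot names and prompts are invented:

# A tiny slot-filling loop for a restaurant-booking bot. No framework is used;
# all slot names and prompts are made up for this sketch.
REQUIRED_SLOTS = {
    "cuisine": "What kind of food would you like?",
    "area": "Which area should the restaurant be in?",
    "people": "For how many people?",
}

def next_action(filled_slots):
    # Ask for the first missing slot; confirm once everything is filled.
    for slot, prompt in REQUIRED_SLOTS.items():
        if slot not in filled_slots:
            return prompt
    return "Booking a {cuisine} restaurant in {area} for {people}.".format(**filled_slots)

print(next_action({"cuisine": "swedish"}))  # -> "Which area should the restaurant be in?"
print(next_action({"cuisine": "swedish", "area": "centre", "people": "4"}))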

FIGURE 2: ARCHITECTURE FOR GOAL-DRIVEN CHATBOT

Open-domain chatbots are intended to converse coherently and engagingly with humans and to maintain a long dialogue flow with a user. However, we need large amounts of data to train these chatbots.

FIGURE 3: ARCHITECTURE FOR OPEN-DOMAIN CHATBOT

Knowledge graphs bring connections and data structures to information

Knowledge graphs provide a semantic layer on top of your database, which gives you all possible entities and the relationships between them. There are a number of representation and modeling instruments available for building a knowledge graph, ontologies being one of them.

An ontology comprises classes, relationships and attributes, as shown in Figure 4. This offers a robust way to store information and concepts, similar to how humans store information.

FIGURE 4: OVERVIEW OF A KNOWLEDGE GRAPH WITH AN RDF SCHEMA

A chatbot based on an ontology can help to clarify the user’s context and intent, and it can dynamically suggest related topics. Knowledge graphs represent the knowledge of an organization, as depicted in Figure 5. Consider a knowledge graph based on an organization (shown on the right in Figure 5) and a chatbot (shown on the left in Figure 5) which is based on the ontology of this knowledge graph. In the chatbot example in Figure 5, the user asks a question about a specific employee. The NLP detects the employee as an entity and also detects the intent behind asking a question about this entity. The chatbot matches the employee entity in the ontology and navigates to the node in the graph. From that node we know all possible relationships of that entity, and the chatbot will ask back for possible options, such as co-workers and projects, to navigate further. A minimal sketch of this navigation follows below.
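A minimal sketch of this navigation using the rdflib library; the tiny employee graph and the relation names are invented for illustration:

# Knowledge-graph navigation with rdflib (pip install rdflib).
# The employee graph and relation names are invented for this sketch.
from rdflib import Graph, Namespace

EX = Namespace("http://example.org/")
g = Graph()
g.add((EX.employeeA, EX.worksIn, EX.projectX))
g.add((EX.employeeA, EX.coWorker, EX.employeeB))

# The NLP layer has matched 'employee A' to the node EX.employeeA; list its
# outgoing relationships so the bot can offer follow-up options to the user.
for predicate, obj in g.predicate_objects(subject=EX.employeeA):
    print(predicate.split("/")[-1], "->", obj.split("/")[-1])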

FIGURE 5: A SCENARIO – HOW A CHATBOT CAN INTERACT WITH A USER WITH A KNOWLEDGE GRAPH.

Moreover, the knowledge graph also improves the NLU in a chatbot. For example, a user might ask the following:

  • ‘Which assignments was employee A part of?’ To navigate further in the knowledge graph, a rank system can be created for the possible connections from the employee node. This rank system might be based on a word vector space and a similarity score.
  • In this scenario, ‘worked in, projects’ will get the highest rank when scored against ‘part of, assignments’, so the chatbot knows it needs to return the list of corresponding projects. A minimal sketch of such a rank system follows below.
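As an illustration, such a rank system could be built with spaCy word vectors; the candidate relations below are our own examples, and a model that ships with real vectors (e.g. en_core_web_md) is assumed:

# Rank candidate relations from the employee node by similarity to the user's
# phrasing. Candidates are invented; en_core_web_md provides word vectors.
import spacy

nlp = spacy.load("en_core_web_md")
query = nlp("part of assignments")
candidates = ["worked in projects", "co-worker of", "reports to manager"]

# Highest similarity wins; 'worked in projects' is expected to rank first.
ranked = sorted(candidates, key=lambda c: nlp(c).similarity(query), reverse=True)
print(ranked)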

Virtual assistants with Lucidworks Fusion

Lucidworks Fusion is an example of a platform that supports building conversational interfaces. Fusion includes NLP features to understand the meaning of content and user intent. In the end, it’s all about retrieving the right answer at the right time. Virtual assistants with a more human level of understanding go beyond static rules and profiles. They use machine learning to predict user intentions and provide insights. Customers and employees can locate critical insights to help them move to their next best action.

FIGURE 6: LUCIDWORKS FUSION DATA FLOW

Lucidworks recently announced Smart Answers, a new Fusion feature. Smart Answers enhances the intelligence of chatbots and virtual assistants by using deep learning to understand natural language questions. It uses deep learning models and mathematical logic to match the similarity of a question (which can be asked in many different ways) to the most relevant answer. As users interact with the system, Smart Answers continues to rank all answers and improve relevancy.

Fusion is focused on understanding a user’s intent. Smart Answers includes model training and serving methods for different scenarios:

  • When FAQs or question-answer pairs exist, they can be easily integrated into Smart Answers’ model training framework.
  • When there are no FAQs or question-answer pairs, knowledge base documents can be used to train deep learning models and match existing knowledge for the best answers to incoming queries. Once users click on documents returned for specific queries, these clicks become question-answer pair signals and can enrich the FAQ model training framework.
  • When there are no documents internally, Smart Answers uses cold-start models trained on large online sources, available in multiple languages. Once it goes live, the models begin training on actual user signals.

The Smart Answers API enables easy integration with any platform or knowledge base, adding value to existing applications. One of the strengths of Fusion Smart Answers is its integration with Rasa, an open-source conversation engine. It’s a framework that helps with understanding user intention and maintaining the dialogue flow. It also has prebuilt NLP components such as word vectors, tokenizers, intent classifiers and entity extractors. Rasa lets you configure the pipeline that processes a user’s message and analyzes human language. Another part of this engine enables modeling dialogues, so the chatbot knows what the next action or response should be. Below is an example of Rasa NLU training data:

## intent:greet
- Hi 
- Hey 
- Hi bot 
- Hey bot 
 
## intent:request_restaurant 
- im looking for a restaurant 
- can i get [swedish](cuisine) food in any area. 
- a restaurant that serves [caribbean](cuisine) food. 
- id like a restaurant 
- im looking for a restaurant that serves [mediterranean](cuisine) food 
- can i find a restaurant that serves [chinese](cuisine)

Building chatbots requires a lot of training examples for every intent and entity to make them understand user intentions and domain knowledge, and to improve the NLU of the chatbot. When building a simple chatbot, using prebuilt trained models can be useful and requires less training data. For example, if we build a chatbot where we only need to detect the common location entity, a few examples and spaCy’s pretrained models can be enough. However, there might be cases when you need to build a chatbot for an organization where you need different contextual entities, which might not be available in the pretrained models. Knowledge graphs can then be helpful for giving a chatbot domain knowledge, and can balance the amount of work related to training data.

Conclusion

Two main chatbot usages are: (1) solving employee frustration in accessing e.g. corporate information, and (2) providing customers with answers to support questions. Both examples look for a solution that reduces the time spent on finding information. Especially for online commerce, key performance indicators are clear and can relate to e.g. decreasing call center traffic or deflecting calls from web and email, situations where ontology-based chatbots can be very helpful. From a short-term perspective, creating a knowledge graph can initially require a lot of effort, but from a long-term perspective it can also create a lot of value. Companies rely on digital portals to provide information to users; employees search for HR or organizational policy documents. Online retailers try to increase customers’ self-service in solving their problems, or simply want to improve the discovery of their products and services. With solutions like Fusion Smart Answers, we are able to cut down time-to-resolution, increase customer retention and take knowledge sharing to the next level. It helps employees and customers resolve issues more quickly and empowers users to find the right answer immediately, without seeking out additional digital channels.

Authors: Pragya Singh, Pedro Custodio, Tomasz Sobczak

To read more:

  1. Ehud Reiter and Robert Dale. 1997. Building applied natural language generation systems. Natural Language Engineering 3(1), 57–87. DOI: https://doi.org/10.1017/S1351324997001502
  2. Challenges in Building Intelligent Open-domain Dialog Systems by Huang, M.; Zhu, X.; Gao, J.
  3. A Novel Approach for Ontology-Driven Information Retrieving Chatbot for Fashion Brands by Aisha Nazir, Muhammad Yaseen Khan, Tafseer Ahmed, Syed Imran Jami, Shaukat Wasi
  4. https://medium.com/@BhashkarKunal/conversational-ai-chatbot-using-rasa-nlu-rasa-core-how-dialogue-handling-with-rasa-core-can-use-331e7024f733
  5. https://lucidworks.com/products/smart-answers/

Query Completion with Apache Solr

There are plenty of names for this functionality: query completion, suggestions, auto-complete, auto-suggest, word completion, type-ahead and maybe some more. Even if we can point out slight differences between them (suggestions can be based on your index documents or on external input such as users’ queries), from a technical point of view it’s all about the same thing: proposing a query to the end user.

Early Google Suggest from 2008. Source: http://www.wpromote.com/blog/4-things-in-08-that-changed-the-face-of-search/

The suggester feature was introduced by Google in 2008. Users got used to query completion, and nowadays it’s a common feature of all mature search engines, e-commerce platforms and even internal enterprise search solutions.

Suggestions help to navigate users through the web portal, allow them to discover relevant content and recommend popular phrases (and thus search results). In the e-commerce area they are even more important, because well-implemented query completion can raise the conversion rate and, finally, increase sales revenue. Word completion should never lead to zero results, but this kind of mistake is made frequently.

And as there are many names describing this feature, there are just as many ways to build it. Still, it’s not a trivial task to implement well-working query completion. Software like Apache Solr doesn’t solve the whole problem. Building auto-suggestions is also about the data (what should we present to users), its quality (e.g. when we want to suggest other users’ queries), the order of suggestions (we got dozens of matches, but we can show only 5; which are the most important?) and the design (user experience and similar).

Going back to the technology: query completion can be built in a couple of ways with Apache Solr. You can use mechanisms like facets, terms, the dedicated suggest component, or just run a query (with e.g. the dismax parser).

Take a look at the Suggester first. It’s very easy to run: you just need to configure a searchComponent and a requestHandler. Example:

<searchComponent name="suggester" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">suggester1</str>
    <str name="lookupImpl">FuzzyLookupFactory</str>
    <str name="dictionaryImpl">DocumentDictionaryFactory</str>
    <str name="field">title</str>
    <str name="weightField">popularity</str>
    <str name="suggestAnalyzerFieldType">text</str>
  </lst>
</searchComponent>
<requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <str name="suggest">true</str>
    <str name="suggest.count">10</str>
  </lst>
  <arr name="components">
    <str>suggester</str>
  </arr>
</requestHandler>

SuggestComponent is a ready-to-use implementation which is responsible for serving up suggestions based on commands and queries. It’s an efficient solution, e.g. because it works on a structure separate from the main index which is kept in memory. There are some basic settings, like the field used for autocompleting or the definition of the text analysis chain. lookupImpl defines how terms are matched in the index. There are about 10 algorithms with different purposes. Probably the most popular are:

  • AnalyzingLookupFactory (finds matches based on the prefix),
  • FuzzyLookupFactory (finds matches despite misspellings),
  • AnalyzingInfixLookupFactory (finds matches anywhere in the text),
  • BlendedInfixLookupFactory (combines prefix and infix lookup).

You need to choose the one which fulfills your requirements. The second important parameter is dictionaryImpl, which determines how indexed suggestions are stored. Again, you can choose between a couple of implementations, e.g. DocumentDictionaryFactory (stores terms, weights and an optional payload) or HighFrequencyDictionaryFactory (useful when very common terms overwhelm others; you can set up a proper threshold).

There are plenty of different settings you can use to customize your suggester. SuggestComponent is a good start and probably covers many cases, but like everything, it has some limitations; e.g. you can’t easily filter out results.

Example execution:

http://localhost:8983/solr/index/suggest?wt=json&suggest.dictionary=suggester1&suggest.q=lond

suggestions: [
  { term: "london" },
  { term: "londonderry" },
  { term: "londoño" },
  { term: "londoners" },
  { term: "londo" }
]
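For completeness, the same request can be issued from application code; a minimal sketch with Python's requests library, assuming the core name index and the handler configured above:

# Query the /suggest handler configured above (core name 'index' is assumed).
import requests

resp = requests.get(
    "http://localhost:8983/solr/index/suggest",
    params={"suggest": "true", "suggest.dictionary": "suggester1",
            "suggest.q": "lond", "wt": "json"},
)
result = resp.json()["suggest"]["suggester1"]["lond"]
for s in result["suggestions"]:
    print(s["term"])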

Another way to build query completion is to use mechanisms like faceting, terms or highlighting.

An example of QC built on facets:

http://localhost:8983/solr/index/select?q=*:*&facet=on&facet.field=title_keyword&facet.mincount=1&facet.contains=lon&rows=0&wt=json

title_keyword: [
  "blonde bombshell", 2,
  "12-pounder long gun", 1,
  "18-pounder long gun", 1,
  "1957 liga española de baloncesto", 1,
  "1958 liga española de baloncesto", 1
]

Please notice that here we have used the facet.contains method, so the query also matches in the middle of a phrase (a simple substring match). Additionally, we get a count for every suggestion in the Solr response.

TermsComponent (which returns indexed terms and the number of documents containing each term) and highlighting (which originally emphasizes the fragments of documents that match the user’s query) can also be used, as presented below.

Terms example:

<searchComponent name="terms" class="solr.TermsComponent"/>
<requestHandler name="/terms" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <bool name="terms">true</bool>
    <bool name="distrib">false</bool>
  </lst>
  <arr name="components">
    <str>terms</str>
  </arr>
</requestHandler>
http://localhost:8983/solr/index/terms?terms.fl=title_general&terms.prefix=lond&terms.sort=index&wt=json

title_general: [
  "londinium",
  "londo",
  "london",
  "london's",
  "londonderry"
]

Highlighting example:

http://localhost:8983/solr/index/select?q=title_ngram:lond&fl=title&hl=true&hl.fl=title&hl.simple.pre=&hl.simple.post=

title_ngram: [
  "londinium",
  "londo",
  "london",
  "london's",
  "londonderry"
]

You can also build auto-complete with a usual full-text query. This has lots of advantages: Lucene scoring works, and you have filtering, boosts, matching across many fields and the whole Lucene/Solr query syntax. Take a look at this eDisMax example:

http://localhost:8983/solr/index/select?q=lond&qf=title_ngram&fl=title&defType=edismax&wt=json

docs: [
  { title: "Londinium" },
  { title: "London" },
  { title: "Darling London" },
  { title: "London Canadians" },
  { title: "Poultry London" }
]

The secret is the analyzer chain, whether you want to base your completion on facets, queries or the SuggestComponent. Depending on what effect you want to achieve with your QC, you need to index the data in the right way. Sometimes you may want to suggest single terms, at other times whole sentences or product names. If you want suggestions to update letter by letter, you can use the Edge N-Gram Filter. Example:

<fieldType name="text_ngram" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory minGramSize="1" maxGramSize="50" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

An N-gram is a structure of n items (the size depends on the given range) from a given sequence of text. Example: the term Findwise with minGramSize = 1 and maxGramSize = 10 will be indexed as:

F
Fi
Fin
Find
Findw
Findwi
Findwis
Findwise

With text indexed this way, you can easily achieve functionality where the user sees the suggestions change after each letter typed.

Another case is the ability to complete word after word (like Google does). It isn’t trivial, but you can try with the shingle structure. Shingles are similar to N-grams, but they work on whole words. Example: ‘Searching is really awesome’ with minShingleSize = 2 and maxShingleSize = 3 will be indexed as:

Searching is
Searching is really
is really
is really awesome
really awesome

Example of Shingle Filter:

<fieldType name="text_shingle" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="10" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

What if your users could use QC which supports synonyms? Then they could type e.g. an abbreviation and find the full suggestion (NYC -> New York City, UEFA -> Union of European Football Associations). It’s easy: just use the Synonym Filter in your text field (an example synonyms.txt sketch follows the field type below):

<fieldType name="text_synonym" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
  </analyzer>
</fieldType>
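For the abbreviations above to be rewritten at query time, explicit mappings in synonyms.txt are the simplest option (explicit => mappings are applied as given, independently of the expand flag); a sketch:

# synonyms.txt -- the left side is rewritten into the right side at query time
nyc => new york city
uefa => union of european football associations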

And then just do a query:

http://localhost:8983/solr/index/select?defType=edismax&fl=title&q=nyc&qf=title_synonym&wt=json

docs: [
  { title: "New York City" },
  { title: "New York New York" },
  { title: "Welcome to New York City" },
  { title: "City Club of New York" },
  { title: "New York" }
]

Another very similar example concerns language support, i.e. matching suggestions regardless of the terms’ form. It can be especially valuable for languages with rich grammar and declension. In the same way as the SynonymFilter above, we can configure a stemming / lemmatization filter, e.g. for English (remember to put the language filter in both the index-time and query-time analyzers), and expand the matching suggestions.

As you can see, there are many ways to implement query completion. You need to choose the right mechanism and text analysis based on your own limitations and on what you want to achieve.

There are also other topics connected with preparing a type-ahead solution. You need to consider performance issues, which mostly center on response time and memory consumption. How many requests will QC generate? You can assume at least 3 times more than your regular search service. You can handle traffic growth by optimizing Solr caches or by installing a separate Solr instance dedicated to the suggestion service. If you create n-grams, shingles or similar structures, be aware that your index size will increase. Remember also that if you decide to use facets or highlighting to provide the suggester, both mechanisms put a heavy load on the CPU.

In my opinion, the most challenging issue to resolve is choosing the data source for the query completion mechanism. Should you suggest parts of your documents (like titles, keywords, authors)? Or use NLP algorithms to extract meaningful phrases from your content? Maybe parse search/application logs and use the most popular user queries (be careful: filter out rubbish and normalize user input)? I believe the answer is YES, to all. Suggestions should be diversified (to lead your users to a wide range of search resources) and should come from a variety of sources. More than likely, you will need to do a hard job when processing documents; remember that data cleaning is crucial.

Similarly, you need to take into account different strategies for the order of the proposed suggestions. It’s good to show them in alphanumeric order (while still respecting scoring!), but you can’t stop there. A peculiarity of QC is that the application can return hundreds of matches while you can present only 5 or 10 of them. That’s why you need to promote suggestions with the highest occurrence in the index or those most popular among the users. Further enhancements can involve personalizing query completion, using geographical coordinates or implementing security trimming (you only see the suggestions you are allowed to see).

I’m sure this blog post doesn’t exhaust the subject of building query completion, but I hope I have brought the topic closer and shown the complexity of such a task. There are many different dimensions which you need to handle, like the data source of your suggestions, choosing the right indexing structure, performance issues, ranking, or even UX and design (how would you like to present hints: simple text or with some graphics/images? Would you like to divide suggestions into categories? Do you always want to show the result page after a clicked suggestion, or maybe redirect to a particular landing page?).

A search engine like Apache Solr is a tool, but you still need an application with the whole business logic on top of it. Do you want prefix matching and infix matching? To support typos and synonyms? To suggest letter after letter or word by word? To implement security requirements or advanced ranking to propose the best tips for your users? These and even more questions need to be thought through to deliver successful query completion.

What’s new in Apache Solr 6?

Apache Solr 6 has been released recently! You need to remember some important technical news: there is no more support for reading Lucene/Solr 4.x indexes, and Java 8 is required. But the most interesting part, I think, is connected with the new features, which certainly follow world trends. I mean: a SQL engine on top of Solr, graph search and replicating data across different data centers.


One of the most promising topics among the new features is the Parallel SQL Interface. In brief, it is the possibility to run SQL queries on top of Solr Cloud (only in Cloud mode right now). It can be very interesting to combine full-text capabilities with well-known SQL statements.
Solr internally uses Presto, a SQL query engine which works with various types of data stores. Presto is responsible for translating SQL statements into Streaming Expressions, since the Solr SQL engine is based on the Streaming API.
Thanks to that, SQL queries can be executed at worker nodes in parallel. There are two implementations of grouping results (aggregations): the first one is based on the map-reduce algorithm and the second one uses Solr facets. The basic difference is the number of fields used in the grouping clause. The Facet API can be used for better performance, but only when the GROUP BY isn’t complex. If it is, better try aggregationMode=map_reduce.
From the developer’s perspective it’s really transparent. A simple statement like “SELECT field1 FROM collection1” is translated to the proper fields and collection. Right now, clauses like WHERE, ORDER BY, LIMIT, DISTINCT and GROUP BY can be used.
Solr still doesn’t support the whole SQL language, but even so it’s a powerful feature. First of all, it can make beginners’ lives easier, since the relational world is commonly known. What is more, I imagine this can be useful during IT system migrations or when collecting data from Solr for further analysis. I hope to hear about many different case studies in the near future.
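A statement can be sent to the /sql handler over HTTP; here is a minimal sketch in Python, where the collection and field names are assumptions:

# Send a SQL statement to Solr's /sql handler (Solr 6, Cloud mode).
# Collection and field names are assumptions for this sketch.
import requests

resp = requests.post(
    "http://localhost:8983/solr/collection1/sql",
    data={"stmt": "SELECT field1 FROM collection1 LIMIT 10",
          "aggregationMode": "facet"},
)
print(resp.json())  # tuples arrive under the 'result-set' key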

Apache Solr 6 also introduces a topic which is crucial wherever a search engine is a business-critical system: cross data center replication (CDCR).
Since Solr Cloud was created to support near real-time (NRT) searching, it didn’t work well when cluster nodes were distributed across different data centers, because of the communication overhead generated by the leaders, replicas and synchronization operations.

The new idea is in an experimental phase and still under development, but for now we have an active-passive mode, where data is pushed from the Source DC to the Target DC. Documents can be sent in real time or according to a schedule. Every leader in the active cluster asynchronously sends updates to the corresponding leader in the passive cluster. After that, the target leaders replicate the changes to their replicas as usual.
CDCR is crucial when we think about distributed systems working in high-availability mode. It relates to disaster recovery, scaling and avoiding single points of failure (SPOF). Please visit the documentation page to find details and plans for the future: https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=62687462

What if your business works in a highly connected environment where data relationships matter, but you still benefit from full-text searching? Solr 6 has good news: graph traversal functionality.
A lot of enterprises know that focusing on relations between documents and graph data modeling is the future. Now you can build Solr queries which allow you to discover information organized in nodes and edges. You can explore your collections in terms of data interactions and connections between particular data elements. We can think about use cases from the semantic search area (query augmentation, using ontologies etc.) or more prosaic ones, like organizational security roles or access control.
The graph traversal query is still a work in progress, but we can already use it, and its basic syntax is really simple: fq={!graph from=parent_id to=id}id:"DOCUMENT_ID" (a sketch of issuing it follows below).
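For illustration, the same filter issued over HTTP; a sketch where the collection name, field names and the seed document id are placeholders:

# Run a graph traversal query (Solr 6). The filter starts from the seed
# document and follows parent_id -> id edges; all names are placeholders.
import requests

resp = requests.get(
    "http://localhost:8983/solr/collection1/select",
    params={"q": "*:*",
            "fq": '{!graph from=parent_id to=id}id:"DOCUMENT_ID"',
            "wt": "json"},
)
print(resp.json()["response"]["numFound"])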

The last Solr 6 improvement I am going to mention is a new scoring algorithm, BM25. In fact, it’s a change forced by Apache Lucene 6: BM25 is now the default similarity implementation. Similarity is the process which examines which documents are similar to the query and to what extent. There are many different factors which determine a document’s score, e.g. the number of search terms found in the document, the popularity of these search terms across the whole collection, or the document length. This is where BM25 improves scoring: it takes into consideration the average length of the documents (fields) across the entire corpus, and it better limits the impact of term frequency on the results ranking.
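For reference, the commonly stated form of the BM25 scoring formula, in LaTeX notation (Lucene's defaults are k_1 = 1.2 and b = 0.75):

\mathrm{score}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot \frac{f(q_i, D)\,(k_1 + 1)}{f(q_i, D) + k_1 \left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)}

where f(q_i, D) is the frequency of term q_i in document D, |D| is the document length and avgdl is the average document length across the corpus.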

As we can see, Apache Solr 6 provides us with many new features and those mentioned above are not all of them. We’re going to write more about the new functionalities soon. Until then, we encourage you to try the newest Solr on your own and remember: don’t hesitate to contact us in case of any problems!

How it all began: a brief history of Intranet Search

According to sources, the intranet was born between 1994 and 1996, true prehistory from an IT systems point of view. Intranet history is bound up with the development of the Internet, the global network. The idea of the WWW, proposed in 1989 by Tim Berners-Lee and others, whose aim was to enable connections and access to many various sources, became the prototype for the first internal networks. The goal of the intranet was to increase employee productivity through easier access to documents, their faster circulation and more effective communication. Although access to information was always a crucial matter, the intranet in fact offered many more functionalities, e.g. e-mail, group work support, audio-video communication, and the searching of texts or personal data.

Overload of information

Over the course of the years, the content placed on WWW servers became more important than other intranet components. First, managing more and more complicated software and the required hardware led to the development of new specializations. Second, paradoxically, the ease of publishing information became a source of serious problems. There was too much information; documents were partly outdated, duplicated, without a homogeneous structure or hierarchy. Difficulties in content management and the lack of people responsible for this process led to a situation where the final user was not able to reach the desired piece of information, or doing so required too much effort.

Google to the rescue

As early as 1998, Gartner produced a document which described this state of the Internet as a “Wild West”. On the Internet, this problem was addressed by Yahoo and Google, which became global leaders in information searching. In internal networks, it had to be improved by rules for publishing information and by CMS and Enterprise Search software. In many organizations the struggle for easier access to information is still ongoing; in others, it has just begun.


And then Search arrived

It was the search engine that influenced the perception of the intranet the most. On the one hand, the search engine is directly responsible for realizing the basic assumptions of knowledge management in the company. On the other, it is the main source of complaints and frustration among internal network users. There are many reasons for this status quo: wrong or unreadable search results, missing documents, security problems and poor access to some resources. What are the consequences of such a situation? First and foremost, they can be observed in high work costs (duplication of tasks, diminished quality, wasted time, less efficient cooperation) as well as in lost business opportunities. It must not be forgotten that search engine problems often overshadow the use of the intranet as a whole.

How to measure efficiency?

In 2002, Nielsen Norman Group consultants estimated that the productivity difference between employees using the best and the worst corporate networks is about 43%. On the other hand, the annual Enterprise Search and Findability Survey report shows that while almost 60% of companies underline the high importance of information searching for their business, nearly 45% of employees have problems finding information.
Leaving aside comfort and the level of employee satisfaction, the natural effect of implementing and improving Enterprise Search solutions is financial benefit. Contrary to popular belief, the profits from such investments and the savings from reaching information faster are entirely quantifiable, although preparing such calculations is not particularly easy. The first steps are to estimate the time employees spend searching for information, to calculate what percentage of quests end in a fiasco, and to determine how long it takes to perform a task without the necessary materials (a back-of-the-envelope sketch follows below). It should be pointed out that findings of companies such as IDC or AIIM show that office workers set aside at least 15-35% of their working hours for searching for necessary information.
Problems with searching are rarely connected with technical issues. The search engines currently on the market are mature products, regardless of the type of technology (commercial or open source). Usually it is a matter of a default installation and leaving the system untouched just after taking it “out of the box”. Each search engine is different because it deals with different document collections. Another thing is that user expectations and business requirements change continually. In conclusion, ensuring good quality searching is an unremitting process.
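As a back-of-the-envelope illustration, the calculation can look as follows; every number below is an invented assumption, not a research finding:

# Rough yearly cost of unsuccessful searching; every number is an assumption.
employees = 500
hourly_cost = 40.0        # fully loaded cost per employee per hour
search_share = 0.20       # share of working time spent searching
failure_rate = 0.45       # share of searches that end in a fiasco
hours_per_year = 1700

wasted_hours = employees * hours_per_year * search_share * failure_rate
print(f"Hours lost per year: {wasted_hours:,.0f}")
print(f"Cost per year: {wasted_hours * hourly_cost:,.0f}")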

Knowledge workers’ main tool?

The intranet has become a comprehensive tool used to accomplish company goals. It supports employee commitment and effectiveness, internal communication and knowledge sharing. However, its main task is to find information, which is often hidden in stacks of documents or dispersed among various data sources. Equipped with a search engine, the intranet has become an invaluable working tool in practically all sectors, especially in departments such as customer service or administration.

So, how is your company’s access to information?


This text is an introduction to a series of articles dedicated to intranet search. Subsequent articles will deal with: the function of a search engine in an organization, the benefits of using Enterprise Search, the requirements of an information search system, the most frequent errors and obstacles in implementations, and system architecture.

Gamification in Information Retrieval

My last article was mainly about Collaborative Information Seeking (CIS), one of the trends in enterprise search. Another interesting topic is the use of game mechanics in CIS systems. I came across this idea during the previously mentioned ESE 2014 conference, but interest is so high that this year GamifIR (a workshop on Gamification for Information Retrieval) took place in Amsterdam. The IR community has debated what kind of benefits IR tasks can draw from game techniques. The workshops covered gamified tasks in the context of searching, natural language processing, analyzing user behavior and collaboration. The last one was discussed in the article titled “Enhancing Collaborative Search Systems Engagement Through Gamification” and was mentioned by Martin White in his great presentation about search trends at the last ESE summit.

Gamification is a concept which provides and uses game elements in non-game environments. Its goal is to improve the motivation of customers or employees to use some service; in the case of Information Retrieval, it is e.g. encouraging people to find information in a more efficient way. It is quite instinctive, because competition is an inherent part of human nature. Business sectors noticed long ago that higher engagement, activating new users, establishing interaction between them and rewarding the effort of doing something lead to measurable results, even if the quality of the data provided by users could be higher. These elements include leaderboards, levels, badges, achievements, time or resource limitations, challenges and many others. There are even documented design patterns and models connected with gameplay, components, game practices and processes. Such rules are essential, because a virtual badge has no value until users assign it one.

Collaborative Information Seeking is an idea suited to people cooperating on a complex task which leads to finding specific information. Systems like this support team work, coordinate actions and improve communication in many different ways, using various mechanisms. At first glance it seems that gamification is perfectly suited to CIS projects: seekers become more social, and the feeling of competence fosters actions which in turn are rewarded.

The most important thing is to know why we need a gamified system and what kind of benefits we will get. The next step is to understand the fundamental elements of a game and find out how to adapt them to the IR case. In their article “Enhancing Collaborative Search Systems Engagement Through Gamification”, researchers from the universities of Granada and Holguin have listed proposals for how to gamify a CIS system. Based on their suggestions, I think the essential point is to prepare a highly sociable environment for seekers. Every player (seeker) needs to have their own personal profile which stores previous achievements and can be customized. Constant feedback on progress, a list of successful members, time limitations, and keeping the spirit of competition alive through all kinds of widgets are important for motivating and building loyalty. It is worth remembering that the points collected after achieving goals need to be converted into virtual values which can distinguish the most active players. A crucial thing is to construct clear and fair principles, because information seeking with such elements is often fun, and that must not be ruined.

Researchers from Finnish universities, who published the article “Does Gamification Work?”, have broken the problem of gamifying down into components and studied them thoroughly. Their conclusion was that the concept of gamification can work, but there are some weaknesses: the context which is going to be gamified and the quality of the users. Probably the main problem is the lack of knowledge about which elements really provide benefits.

Gamification can be treated as a new way to deal with complex data structures. The limitations of data analysis can be compensated by mechanisms which increase the activity of users in the Information Retrieval process. Even more, such a concept may lead to higher quality data because of increased user motivation. I believe Collaborative Information Seeking, gamification and similar ideas are solutions for improving the search experience by helping people become better searchers, rather than by just tuning up algorithms.

Enterprise Search Europe 2014 – Short Review

ESE Summit

At the end of April, the third edition of the Enterprise Search Europe conference took place. The venue was the Park Plaza Victoria Hotel in London. The two-day event was dedicated to widely understood search solutions. There were two tracks covering subjects relating to search management, big data, open source technologies, SharePoint and, as always, the future of search. According to the organizers’ information, there were 30 experts presenting their knowledge and experience in implementing search systems and making content findable. It was an opportunity to get familiar with lots of case studies focused on relevancy, text mining, system architecture and even matching business requirements. There were also speeches on softer skills, like making decisions or finding good employees.

In a word, the ESE 2014 summit was a great chance to meet highly skilled professionals with competence in business-driven search solutions. Representatives from both specialized consulting companies and universities were present. The second day even started with a compelling plenary session about the direction of enterprise search, presenting two points of view: Jeff Fried, CTO of BA-Insight, and Elaine Toms, Professor of Information Science at the University of Sheffield. From the industry perspective, analyzing user behavior, applying world knowledge or improving information structure is a real success. On the other hand, although IR systems are currently mainstream, there are many problems: integration is still a challenge, the rules by which systems work are unclear, and organizations neglect investments in search specialists. As Elaine Toms explained, the role of scientists is to restrain uncertainty by prototyping and by forming future researchers. According to her, the major search problems are primitive user interfaces and too few system services. What is more, data and information often become of secondary importance, even though they are the core of every search engine.

Trends

Among many interesting presentations, one particularly caught my attention: “Collaborative Search” by Martin White, Conference Chair and Managing Director of Intranet Focus. The subject was the current condition of enterprise search and the requirements which such systems will have to face in the future. Martin White is convinced that limited user satisfaction is mainly the fault of poor content quality and insufficient information management. The presentation covered absorbing results of various research studies. One of them, described in the “Characterizing and Supporting Cross-Device Search Tasks” paper, was an analysis of commercial search engine logs aimed at finding behavior patterns associated with cross-device searching. Switching between devices can be a hindrance because of device multiplicity: each user needs to remember both what he was searching for and what has already been found. The findings show that there are lots of opportunities to handle information seeking more effectively in a multi-device world. Saving and re-instating the user session, using the time between switching devices to fetch more results, or making use of behavioral and geospatial data to predict task resumption are just a few examples.

Still, the most interesting part of Martin White’s presentation was dedicated to Collaborative Information Seeking (CIS).

Collaborative Information Seeking

It is natural that difficult and complex tasks force people to work together. Collaboration in information retrieval helps to use systems more effectively. This idea concentrates on situations where people should cooperate to seek information or to make sense of it. In fact, CIS covers, on the one hand, elements connected with organizational behavior and decision making, and on the other, the evolution of user interfaces and the design of systems for immediate data processing. Furthermore, Martin White considers the CIS context to be focused on complex queries, “second phase” queries, results evaluation and ranking algorithms. This concept can bring the highest value in domains like chemistry, medicine and law.

During the exploration of CIS, several related terms have appeared: collaborative information retrieval, social searching, co-browsing, collaborative navigation, collaborative information behavior, collaborative information synthesis. My intention is to introduce some of them.

"Collaborative Information Seeking", Chirag Shah

1. “Collaborative Information Seeking”, Chirag Shah

Collaborative Information Retrieval (CIR) extends traditional IR to the purposes of many users. It supports scenarios where the problem is complicated and seeking common information is a need. To support a group’s actions, it is crucial to know how the group works and what its strengths and weaknesses are. In general, it might be said that such a system could be an overlay on a search engine, re-ranking results based on the knowledge of the user community. According to Chirag Shah, the author of the “Collaborative Information Seeking” book, there are examples of systems where a workgroup’s queries and related results are captured and used to filter more relevant information for a particular user. One of the most absorbing cases is SearchTogether, an interface designed for collaborative web search, described by Meredith R. Morris and Eric Horvitz. It allows working both synchronously and asynchronously. The history of queries, page metadata and annotations serve as information carriers for the user. Both automatic and manual division of labor were implemented. One of its features was recommending pages to another information seeker. All sessions and past findings were persisted and stored for future collaborative searching.

Despite the many efforts made in developing such systems, probably none of them has been widely adopted. Perhaps this was caused partly by their non-trivial nature, partly by the lack of a concept of how to integrate them with other parts of collaboration in organizations.

Other ideas associated with CIS are Social Search and Collaborative Filtering. The first one is about how social interactions can help in searching together; interestingly, despite the rather weak ties between people in social networks, their enhancement may already be observed in collaborative networks. The second refers to providing more relevant search results based on a user’s past behavior, but also on the behavior of a community of users displaying similar interests. It is noteworthy that this is an example of asynchronous interaction, because its value is based on past actions, in contrast with CIS, where the emphasis is on active user communication. Collaborative Filtering has been applied in many domains: industry, finance, insurance and the web. At present the last one is the most common, used in the e-commerce business. CF methods form the basis of recommender systems predicting user preferences. It is such a broad topic that it certainly deserves a separate article.

CIS Barriers

Regardless of all this research, CIS faces many challenges nowadays. One of them is information security in the company. How do you handle a situation where team members do not have the same security profile, or where a person cannot even share with others what has been found? The discussed systems cannot be created only for information seeking; they also need to manage security, support situations where results were not found because of permissions, and situations where it is necessary to view a new document created in the cooperation process. As if that were not enough, there are various organizational barriers hindering the CIS idea. They are divided into categories: organizational, technical, individual and team. They consist of things such as organizational culture and structure, multiple un-integrated systems, individual perception, and the various conflicts that appear during team work. The barriers and their implications have been described in detail in the paper “Barriers to Collaborative Information Seeking in Organizations” by Arvind Karunakaran and Madhu Reddy.

Collaborative Information Seeking is an exciting field of research and one of the search trends. Another absorbing topic is the adoption of gamification in IR systems. That is going to be the subject of my next article.