Query Completion with Apache Solr

There are plenty of names for this functionality: query completion, suggestions, auto-complete, auto-suggest, word completion, type-ahead and probably a few more. Even if we can point out slight differences between them (suggestions can be based on your index documents or on external input such as users' queries), from a technical point of view it is all about the same thing: proposing a query to the end user.

Early Google Suggest from 2008. Source: http://www.wpromote.com/blog/4-things-in-08-that-changed-the-face-of-search/

Google introduced its suggest feature eight years ago, in 2008. Users have since become accustomed to query completion, and nowadays it is a standard feature of every mature search engine, e-commerce platform and even internal enterprise search solution.

Suggestions help users navigate through a web portal, discover relevant content and find popular phrases (and thus search results). In e-commerce they are even more important, because a well-implemented query completion can lift the conversion rate and, in the end, increase sales revenue. Query completion should never lead to zero results, yet this mistake is made surprisingly often.

And just as there are many names for this feature, there are many ways to build it. Still, implementing query completion that works well is not a trivial task. Software like Apache Solr does not solve the whole problem. Building auto-suggestions is also about data (what should we present to users), its quality (e.g. when we want to suggest other users' queries), the order of suggestions (we may get dozens of matches but can show only 5; which are the most important?) and design (user experience and the like).

Back to the technology: query completion can be built in a couple of ways with Apache Solr. You can use mechanisms like facets, terms, the dedicated suggest component, or simply run a query (e.g. with the dismax parser).

Let's take a look at the Suggester first. It is very easy to run: you just need to configure a searchComponent and a requestHandler. Example:

<searchComponent name="suggester" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">suggester1</str>
    <str name="lookupImpl">FuzzyLookupFactory</str>
    <str name="dictionaryImpl">DocumentDictionaryFactory</str>
    <str name="field">title</str>
    <str name="weightField">popularity</str>
    <str name="suggestAnalyzerFieldType">text</str>
  </lst>
</searchComponent>
<requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <str name="suggest">true</str>
    <str name="suggest.count">10</str>
  </lst>
  <arr name="components">
    <str>suggester</str>
  </arr>
</requestHandler>

SuggestComponent is a ready-to-use implementation responsible for serving suggestions based on commands and queries. It is an efficient solution, partly because it works on a structure separate from the main index, and that structure is kept in memory. There are some basic settings, such as the field used for autocompleting or the text analysis chain. The lookupImpl parameter defines how terms are matched in the index. There are about ten algorithms with different purposes. Probably the most popular are:

  • AnalyzingLookupFactory (default, finds matches based on prefix)
  • FuzzyLookupFactory (finds matches with misspellings)
  • AnalyzingInfixLookupFactory (finds matches anywhere in the text)
  • BlendedInfixLookupFactory (combines prefix and infix matching)

You need to choose the one that fulfills your requirements. The second important parameter is dictionaryImpl, which defines how indexed suggestions are stored. Again, you can choose between a couple of implementations, e.g. DocumentDictionaryFactory (stores terms, weights and an optional payload) or HighFrequencyDictionaryFactory (useful when very common terms would otherwise overwhelm the rest; you can set a proper threshold).
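As an illustration, a minimal sketch of a suggester that uses HighFrequencyDictionaryFactory instead (the suggester name and the threshold value below are assumptions, not taken from the configuration above):

<searchComponent name="suggester" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">highFreqSuggester</str>
    <str name="lookupImpl">AnalyzingInfixLookupFactory</str>
    <str name="dictionaryImpl">HighFrequencyDictionaryFactory</str>
    <!-- ignore terms appearing in less than 1% of documents (illustrative value) -->
    <float name="threshold">0.01</float>
    <str name="field">title</str>
    <str name="suggestAnalyzerFieldType">text</str>
  </lst>
</searchComponent>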

There are plenty of other settings you can use to customize your suggester. SuggestComponent is a good start and probably covers many cases, but like everything it has limitations, e.g. you cannot easily filter the results.

Example execution:

http://localhost:8983/solr/index/suggest?wt=json&suggest.dictionary=suggester1&suggest.q=lond

suggestions: [
  { term: "london" },
  { term: "londonderry" },
  { term: "londoño" },
  { term: "londoners" },
  { term: "londo" }
]

Another way to build query completion is to use mechanisms like faceting, terms or highlighting.

An example of QC built on facets:

http://localhost:8983/solr/index/select?q=*:*&facet=on&facet.field=title_keyword&facet.mincount=1&facet.contains=lon&rows=0&wt=json

title_keyword: [
  "blonde bombshell", 2,
  "12-pounder long gun", 1,
  "18-pounder long gun", 1,
  "1957 liga española de baloncesto", 1,
  "1958 liga española de baloncesto", 1
]

Notice that we have used the facet.contains method here, so the query also matches in the middle of a phrase (it is a simple substring match on the facet terms). Additionally, we get a count for every suggestion in the Solr response.

TermsComponent (which returns indexed terms and the number of documents that contain each term) and highlighting (originally meant to emphasize fragments of documents that match the user's query) can also be used, as presented below.

Terms example:

<searchComponent name="terms" class="solr.TermsComponent"/>
<requestHandler name="/terms" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <bool name="terms">true</bool>
    <bool name="distrib">false</bool>
  </lst>
  <arr name="components">
    <str>terms</str>
  </arr>
</requestHandler>
http://localhost:8983/solr/index/terms?terms.fl=title_general&terms.prefix=lond&terms.sort=index&wt=json

title_general: [
  "londinium",
  "londo",
  "london",
  "london's",
  "londonderry"
]

Highlighting example:

http://localhost:8983/solr/index/select?q=title_ngram:lond&fl=title&hl=true&hl.fl=title&hl.simple.pre=&hl.simple.post=

title_ngram: [
  "londinium",
  "londo",
  "london",
  "london's",
  "londonderry"
]

You can also build auto-complete with a regular full-text query. This has a lot of advantages: Lucene scoring works, and you get filtering, boosts, matching across many fields and the whole Lucene/Solr query syntax. Take a look at this eDisMax example:

http://localhost:8983/solr/index/select?q=lond&qf=title_ngram&fl=title&defType=edismax&wt=json

docs: [
  { title: "Londinium" },
  { title: "London" },
  { title: "Darling London" },
  { title: "London Canadians" },
  { title: "Poultry London" }
]

Whether you base your solution on facets, a query or SuggestComponent, the secret is the analyzer chain. Depending on the effect you want to achieve with your QC, you need to index the data in the right way. Sometimes you may want to suggest single terms, other times whole sentences or product names. If you want to suggest letter by letter, you can use the Edge N-Gram Filter. Example:

<fieldType name="text_ngram" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory minGramSize="1" maxGramSize="50" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

An N-gram is a structure of n items (the size depends on the given range) taken from a sequence of text. Example: the term Findwise with minGramSize = 1 and maxGramSize = 10 will be indexed as:

F
Fi
Fin
Find
Findw
Findwi
Findwis
Findwise

With text indexed this way you can easily build functionality where the user sees the suggestions change after each typed letter.

Another case is the ability to complete word after word (like Google does). It is not trivial, but you can try the shingle structure. Shingles are similar to N-grams, but they work on whole words. Example: Searching is really awesome with minShingleSize = 2 and maxShingleSize = 3 will be indexed as:

Searching is
Searching is really
is really
is really awesome
really awesome

Example of Shingle Filter:

<fieldType name="text_shingle" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="10" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

What if your users could use QC that supports synonyms? Then they could type an abbreviation and get the full suggestion (NYC -> New York City, UEFA -> Union of European Football Associations). It is easy: just use the Synonym Filter in your text field:

<fieldType name="text_synonym" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
  </analyzer>
</fieldType>
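A minimal synonyms.txt backing the example above could look like this (the entries are an assumption; the arrow form maps the abbreviation to the full phrase at query time):

# synonyms.txt – one rule per line
nyc => new york city
uefa => union of european football associations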

And then just do a query:

http://localhost:8983/solr/index/select?defType=edismax&fl=title&q=nyc&qf=title_synonym&wt=json

docs: [
  { title: "New York City" },
  { title: "New York New York" },
  { title: "Welcome to New York City" },
  { title: "City Club of New York" },
  { title: "New York" }
]

Another, very similar example concerns language support and matching suggestions regardless of the terms' grammatical form. It can be especially valuable for languages with rich grammar and declension. In the same way as the SynonymFilter is used, we can configure a stemming / lemmatization filter, e.g. for English (take a look here, and remember to put the language filter in both the index and query analyzer chains), and thus expand the matching suggestions.
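A sketch of such a field type, assuming the Snowball English stemmer (the field type name is illustrative):

<fieldType name="text_stem" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
  </analyzer>
</fieldType>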

As you can see, there are many ways to build query completion; you need to choose the right mechanism and text analysis based on your own constraints and on what you want to achieve.

There are other topics connected with preparing a type-ahead solution. You need to consider performance, which mostly comes down to response time and memory consumption. How many requests will QC generate? You can assume at least three times more than your regular search service. You can handle the traffic growth by optimizing Solr caches or by installing separate Solr instances dedicated to the suggestion service. If you create n-grams, shingles or similar structures, be aware that your index size will increase. And remember that if you decide to use facets or highlighting to provide the suggester, both mechanisms put a heavy load on the CPU.

In my opinion, the most challenging issue is choosing a data source for the query completion mechanism. Should you suggest parts of your documents (like titles, keywords, authors)? Or use NLP algorithms to extract meaningful phrases from your content? Maybe parse search/application logs and use the most popular user queries (be careful: filter out rubbish and normalize user input)? I believe the answer is YES – to all of them. Suggestions should be diversified (to lead your users to a wide range of search resources) and should come from a variety of sources. More than likely you will have to do some hard work when processing the documents – remember that data cleaning is crucial.

Similarly, you need to consider different strategies for the order of the proposed suggestions. It is good to show them in alphanumeric order (while still respecting scoring!), but you cannot stop there. A specificity of QC is that the application may return hundreds of matches while you can present only 5 or 10 of them. That is why you need to promote the suggestions with the highest occurrence in the index, or the ones most popular among users. Further enhancements can involve personalizing query completion, using geographical coordinates or implementing security trimming (you see only the suggestions you are allowed to see).

I'm sure this blog post doesn't exhaust the subject of building query completion, but I hope I have brought the topic closer and shown the complexity of the task. There are many different dimensions you need to handle, such as the data source of your suggestions, choosing the right indexing structure, performance, ranking, or even UX and design (how would you like to present hints – as simple text or with some graphics/images? Would you like to divide suggestions into categories? Do you always want to show the result page after a clicked suggestion, or maybe redirect to a particular landing page?).

A search engine like Apache Solr is a tool, but you still need an application with all the business logic on top of it. Do you want prefix matching or infix matching? Support for typos and synonyms? Suggestions letter after letter or word by word? Security requirements or advanced ranking to propose the best tips for your users? These and even more questions need to be thought over to deliver a successful query completion.

Plan for General Data Protection Regulation (GDPR)

Another new regulation from the EU? Will this affect us? It seems so complex. Can't we sit back, wait for the first fine to come and then act if necessary?

We have to care and act – start planning now!

I think we have to care and act now. Start planning now so you get it right. The GDPR is a good thing. This is not another EU thing about the right size of a strawberry or how bendy a banana may be. This is about the fact that all individuals should feel safe giving their personal information to a business. Cyber security is a good thing; not protecting our data and our customers' data is a bad thing for us. Credit card numbers and personal data leak out from companies worldwide, with large business risks: companies don't just face fines or reputational damage, they can have their permission to issue credit cards and other financial services products withdrawn by the regulator, and responsible employees face imprisonment. We can only guess whether a company will need to be GDPR compliant to be allowed to compete in a bidding process.

What is the General Data Protection Regulation (GDPR)?

The General Data Protection Regulation (GDPR) is a new legal framework approved by the European Union (EU) to strengthen and unify data protection of personal information. GDPR will replace the current data protection directive (in Sweden Personuppgiftslagen, PUL) and applies from 25 May 2018.

Who is affected?

GDPR has global reach and applies to all companies worldwide that process personal data of European Union citizens.

Identify personal data and protect it

GDPR defines personal data widely. Organisations need to fully understand what information they have, where it is located and how it was collected. Discover, classify and manage all information, both structured and unstructured data, and secure it.

Data breach notifications

GDPR requires organisations to notify the local data protection authority of a data breach within 72 hours after discovery.

Do you have the right to store this information? Explicit consent

Personal data should be gathered under strict conditions. Organisations need to ask for consent to collect personal data and they need to be clear about how they will use the information.

The right of access

Individuals will have the right to obtain access to their personal data and other supplementary information in a portable format. You must provide a copy of the information free of charge. GDPR also gives individuals the right to have personal data corrected if it is inaccurate or incomplete.

The right to be forgotten

GDPR also introduces the right to be forgotten, or erased. Data is not to be held for longer than absolutely necessary, and it should not be used in any way other than what it was originally collected for.

Penalties and fines

Companies that fail to protect customer data adequately will face significant fines of up to €20m, or up to 4% of global turnover. This should be a serious incentive for companies to start preparing now.

First steps to GDPR compliance

  1. Create awareness and allocate resources
    First step is to make sure that your organisation is aware of the new EU legislation and what it means for you. How will your business be affected by the new regulation? You need to allocate enough resources, make sure you involve decision-makers and stakeholders in your organisation. Last, but not least, start today!
  2. Content Inventory
    The second step is to discover and classify all your information to identify exactly what types of personal identifiable data you have, where you have it and how it is collected.

Findwise can assist you in this process; please contact Maria Sunnefors to hear more.

Want to read more?

Read more about the GDPR at Datainspektionen (in Swedish) or at the ICO.

PIM is for storage

– Add search for distribution, customization and seamless multichannel experiences.


Retailers, e-commerce and product data
Having met a number of retailers to discuss information management, we have noticed they all experience the same problem. Products are (obviously) central, and product information is typically stored in a PIM or DAM system. So far so good: these systems do the trick when it comes to storing and managing fundamental product data. However, when trying to embrace current e-commerce trends1, such as mobile friendliness, multi-channel selling and connecting products to other content, PIM systems are not really helping. As it turns out, PIM is great for storage but not for distribution.

Retailers need to distribute product information across various channels – online stores, mobile and desktop, spreadsheet exports, subsets of data with adjustments for different markets and industries. They also need to connect products to availability, campaigns, user-generated content and fast-changing business rules. Add to this the need for closing the analytics feedback loop, and the IT department realises that PIM (or DAM) is not the answer.

Product attributes

Adding search technology for distribution
Whereas PIM is great for storage, search technology is the champ not only for searching but also for distribution. You may have heard the popular saying Create Once, Publish Everywhere? Well, search technology actually gives meaning to it. Gather any data (PIM, DAM, ERP, CMS), connect it to other data and display it across multiple channels and contexts.

Also, with the i3 package of components2 you can add information (metadata) or logic that is not available in the PIM system. This while the source data stays intact – there is no altering, copying or moving.

Combined with a taxonomy for categorising information, you are good to go. You can now enrich products and connect them to other products and information (processing service). Categorise content according to the product taxonomy and be done. Performance will be very high, as content is denormalised and stored in the search engine, ready for multi-channel distribution. With this setup you can also easily add new sources to enrich products or modify relevance. Who knows what information will be relevant for products in the future?

To summarise

  • PIM for input, search for output. Design for distribution!
  • Use PIM for managing products, not for managing business rules.
  • Add metadata and taxonomies to tailor product information for different channels.
  • Connect products to related content.
  • Use stand-alone components based on open source for strong TCO and flexibility.

References
1. Gartner for marketers
2. The Findwise i3 package of components (for indexing, processing, searching and analysing data) is compatible with the open source search engines Apache Solr and Elasticsearch.

Under the hood of the search engine

While using a search application we rarely think about what happens inside it. We just type a query, sometimes refine it with facets or additional filters, and pick one of the returned results. Ideally, the most desired result is at the top of the list. The secret of returning appropriate results, and of figuring out which documents fit a query better than others, is hidden in the scoring, ranking and similarity functions enclosed in relevancy models. These concepts are crucial for the search application user's satisfaction.

In this post we will review basic components of the popular TF/IDF model with simple examples. Additionally, we will learn how to ask Elasticsearch for explanation of scoring for a specific document and query.

Document ranking is one of the fundamental problems in information retrieval, a discipline acting as the mathematical foundation of search. Ranking, which literally means assigning a rank to a document matching a search query, corresponds to the notion of relevance. Document relevance is a function which determines how well a given document meets the search query. The concept of similarity corresponds, in turn, to the relevance idea, since relevance is a metric of similarity between a candidate result document and a search query.
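As a sketch of how to ask Elasticsearch for that explanation, you can set the explain flag on a search request (the index and field names below are made up for illustration):

GET /articles/_search
{
  "explain": true,
  "query": {
    "match": { "title": "london" }
  }
}

Each returned hit then carries an _explanation section that breaks the score down into its term frequency, inverse document frequency and norm components.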

What’s new in Apache Solr 6?

Apache Solr 6 has been released recently! There is some important technical news to keep in mind: indexes created with Lucene/Solr 4.x can no longer be read, and Java 8 is now required. But the most interesting part, in my opinion, is connected with the new features, which certainly follow world trends. I mean: a SQL engine on top of Solr, graph search and replicating data across different data centers.

Apache Solr

One of the most promising topics among the new features is the Parallel SQL Interface. In brief, it is the possibility to run SQL queries on top of Solr Cloud (Cloud mode only for now). It can be very interesting to combine full-text capabilities with well-known SQL statements.
Internally Solr uses Presto, a SQL query engine that works with various types of data stores. Presto is responsible for translating SQL statements into Streaming Expressions, since the Solr SQL engine is based on the Streaming API.
Thanks to that, SQL queries can be executed at worker nodes in parallel. There are two implementations of grouping results (aggregations). The first one is based on the map-reduce algorithm and the second one uses Solr facets. The basic difference is the number of fields used in the grouping clause. The facet API can be used for better performance, but only when the GROUP BY isn't complex. If it is, better try aggregationMode=map_reduce.
From the developer's perspective it is really transparent. A simple statement like "SELECT field1 FROM collection1" is translated to the proper fields and collection. Right now clauses like WHERE, ORDER BY, LIMIT, DISTINCT and GROUP BY can be used.
Solr still doesn't support the whole SQL language, but even so it is a powerful feature. First of all, it can make beginners' lives easier, since the relational world is commonly known. What is more, I imagine this can be useful during IT system migrations or when collecting data from Solr for further analysis. I hope to hear about many different case studies in the near future.
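As a sketch, such a statement can be sent to the /sql request handler like this (the collection and field names are assumptions):

curl --data-urlencode 'stmt=SELECT fieldA, count(*) FROM collection1 GROUP BY fieldA LIMIT 10' --data-urlencode 'aggregationMode=facet' http://localhost:8983/solr/collection1/sql

The response comes back as a stream of JSON tuples, one per result row.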

Apache Solr 6 also introduces a topic which is crucial wherever a search engine is a business-critical system: cross data center replication (CDCR).
Since Solr Cloud was created to support near real-time (NRT) searching, it did not work well when cluster nodes were distributed across different data centers, because of the communication overhead generated by leaders, replicas and synchronization operations.

The new idea is in an experimental phase and still under development, but for now we have an active-passive mode, where data is pushed from the source DC to the target DC. Documents can be sent in real time or according to a schedule. Every leader in the active cluster asynchronously sends updates to the corresponding leader in the passive cluster. After that, the target leaders replicate changes to their replicas as usual.
CDCR is crucial when we think about distributed systems working in high-availability mode. It always comes back to disaster recovery, scaling and avoiding single points of failure (SPOF). Please visit the documentation page to find some details and plans for the future: https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=62687462

What if your business works in a highly connected environment where data relationships matter, but you still benefit from full-text searching? Solr 6 has good news – graph traversal functionality.
A lot of enterprises know that focusing on relations between documents and on graph data modelling is the future. Now you can build Solr queries which allow you to discover information organized in nodes and edges. You can explore your collections in terms of data interactions and connections between particular data elements. We can think about use cases from the semantic search area (query augmentation, using ontologies etc.) or more prosaic ones, like organization security roles or access control.
The graph traversal query is still work in progress, but we can use it already, and its basic syntax is really simple: fq={!graph from=parent_id to=id}id:"DOCUMENT_ID"

The last Solr 6 improvement I am going to mention is a new scoring algorithm – BM25. In fact, it is a change forced by Apache Lucene 6: BM25 is now the default similarity implementation. Similarity is the process which examines which documents are similar to the query and to what extent. There are many different factors which determine a document's score, e.g. the number of search terms found in the document, the popularity of these search terms over the whole collection, or the document length. This is where BM25 improves scoring: it takes into consideration the average length of the documents (fields) across the entire corpus, and it better limits the impact of term frequency on the results ranking.
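For reference, the standard BM25 formula looks like this (a sketch; Lucene's defaults are k1 = 1.2 and b = 0.75):

score(D, Q) = \sum_{t \in Q} \mathrm{IDF}(t) \cdot \frac{tf(t, D) \cdot (k_1 + 1)}{tf(t, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}

Here |D| is the length of the document (field) and avgdl is the average field length across the corpus: term frequency saturates towards the IDF ceiling, and field length is normalized relative to the corpus average rather than in isolation.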

As we can see, Apache Solr 6 provides us with many new features and those mentioned above are not all of them. We’re going to write more about the new functionalities soon. Until then, we encourage you to try the newest Solr on your own and remember: don’t hesitate to contact us in case of any problems!

Understanding politics with Watson using Text Analytics

Understanding which topics actually matter to different political parties is a difficult task. Can text analytics together with a search index be an approach to gaining a better understanding?

This blog post describes how IBM Watson Explorer Content Analytics (WCA) can be used to make sense of Swedish politics. All speeches (in Swedish: anföranden) in the Swedish Parliament from 2004 to 2015 are analyzed using WCA. In total, 139,110 transcribed text documents were analyzed. The Swedish language support built by Findwise for WCA is used together with a few text analytic processing steps which parse out person names, political parties, dates and topics of interest. The selected topics in this analysis are all related to infrastructure and different types of fuels.

We start by looking at how some of the topics are mentioned over time.


Analyze of terms of interest in Swedish parliament between 2004 and 2014.

The view shows topics which have a higher number of mentions than would be expected during one year. Among other topics, we can see that flygplats (airport) had a large increase in the number of mentions during 2014.

So let’s dive down and see what is being said about the topic flygplats during 2014.


Swedish political parties mentioning Bromma Airport during 2014.

The image above shows how the different political parties mentioned the topic flygplats during 2014. The blue bar shows the number of times the topic was mentioned by each political party during the year. The green bar shows the WCA correlation value, which indicates how strongly related a term is to the current filter. What we can conclude is that the party Moderaterna mentioned flygplats more frequently than other parties during 2014.

Reviewing the most correlated nouns when filtering on flygplats and the year 2014 shows, among other nouns: Bromma (a place in Sweden), airport and nedläggning (closure). This gives some idea of what was discussed during the period. Filtering on the speeches held by Moderaterna and reading some of them makes it clear that Moderaterna is against closing Bromma airport.

The text analytics and the index provided by WCA helps us both discover trending topics over time and gives us a tool for understanding who talked about a subject and what was said.

All the different infrastructure topics can together form a single topic for infrastructure. Speeches mentioning tåg (train), bredband (broadband) or any other defined infrastructure term are also tagged with the topic infrastructure. This wider concept of infrastructure can of course also be viewed over time.


Discussions in Swedish parliament mentioning the defined terms which builds up the subject infrastructure 2004 to 2015.

Another way of finding which party is most correlated to a subject is by comparing pairs of facets. The following table shows parties highly related to terms regarding infrastructure and types of fuel.


Swedish political parties highly correlated to subjects regarding infrastructure and types of fuel.

Let's start by explaining the first row in order to understand the table. Mobilnät (mobile network) has only been mentioned 44 times by Centerpartiet, but Centerpartiet is still highly related to the term, with a WCA correlation value of 3.7. This means that Centerpartiet has a higher share of its speeches mentioning mobilnät compared to other parties. The table indicates that two parties, Centerpartiet and Miljöpartiet, are more engaged in infrastructure topics than the other political parties.


Swedish parties mentioning the defined concept of infrastructure.

Filtering on the concept infrastructure also shows that Miljöpartiet and Centerpartiet are the two parties with the highest share of speeches mentioning the defined infrastructure topics.

Interested in digging deeper into the data? Parsing written text with text analytics is a successful approach to increasing the understanding of subjects such as politics. Using IBM Watson Explorer Content Analytics makes it easy. Most of the functionality used in this example is available out of the box in WCA.

Migration from Google Search Appliance (GSA) in 4 easy steps

Google Search Appliance is being phased out and in 2018 renewals will end. As an existing client, you can buy one-year license renewals throughout 2017. However, if you fancy a change, here are 4 simple steps for switching to Apache Solr or Elasticsearch.

1. Choose your hosting solution or servers


Whereas Google Search Appliance comes ready to plug in, Apache Solr and Elasticsearch need to be deployed and hosted on servers. You can choose to host Solr or Elasticsearch on your own infrastructure or in the cloud. Both platforms are highly scalable and can be massively distributed.

  • Own infrastructure

Server and hardware requirements are highly dependent on the number of documents, document types, search use cases and number of users. Memory, CPUs, disk and network are the main parameters to consider.

Elasticsearch hardware recommendations: https://www.elastic.co/guide/en/elasticsearch/guide/current/hardware.html

Apache Solr performance: https://wiki.apache.org/solr/SolrPerformanceProblems

Both Elasticsearch and Solr require Java. For SolrCloud, you will also need to install ZooKeeper.

  • In the cloud

You can also choose to run Solr or Elasticsearch on a cloud platform.

Elastic official cloud platform: https://www.elastic.co/cloud

2. Define your schema and mapping

In Apache Solr and Elasticsearch, fields can be indexed and processed differently according to type, language, use case … A field and its type can be defined in Elasticsearch using the mapping API, or in Apache Solr in the schema.xml.

Elasticsearch mapping API: https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping.html

Apache Solr schema: https://wiki.apache.org/solr/SchemaXml
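As a sketch (index, type and field names are illustrative, and the exact mapping syntax depends on your Elasticsearch version: a text field is type "string" in 2.x and "text" in 5.x and later), the same title field could be declared on both sides like this:

Elasticsearch (mapping API):

PUT /products
{
  "mappings": {
    "product": {
      "properties": {
        "title": { "type": "string", "analyzer": "english" }
      }
    }
  }
}

Apache Solr (schema.xml):

<field name="title" type="text_en" indexed="true" stored="true"/>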

3. Tune your connectors


Do you need to change all connectors?

The answer is no. Connectors sending GSA feeds can be kept, just refactor the output to match the Elasticsearch or Solr indexing syntax.

However, if you use GSA to crawl websites, you will either need to reconsider crawling as the method of getting your data, or use an external web crawler (like Norconex). Contrary to GSA, Apache Solr and Elasticsearch do not come with a web crawler.

Elasticsearch Indexing API: https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-index_.html
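As an illustration of what matching the indexing syntax means in practice, here is a sketch of sending one document to each engine (the index, collection and field names are made up):

Elasticsearch:

curl -XPUT 'http://localhost:9200/products/product/1' -d '{ "id": "1", "title": "London" }'

Apache Solr:

curl 'http://localhost:8983/solr/products/update?commit=true' -H 'Content-Type: application/json' -d '[{ "id": "1", "title": "London" }]'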

4. Rewrite your queries and fetch new output

All common query functions such as filtering, sorting and dynamic navigation are standard in both Apache Solr and Elasticsearch. However, query parameters and output formats (XML or JSON) are different, which means the queries and the front-end need to be adapted to your new search engine.

If you are using Jellyfish by Findwise, queries and output will roughly be the same.

Elasticsearch response body: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-body.html

Apache Solr response: https://cwiki.apache.org/confluence/display/solr/Response+Writers
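For example, a filtered query with sorting could look roughly like this in each engine (the collection, index and field names are made up for illustration):

Apache Solr:

http://localhost:8983/solr/products/select?q=title:london&fq=category:city&sort=price%20asc&rows=10&wt=json

Elasticsearch:

POST /products/_search
{
  "query": {
    "bool": {
      "must": { "match": { "title": "london" } },
      "filter": { "term": { "category": "city" } }
    }
  },
  "sort": [ { "price": "asc" } ],
  "size": 10
}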

Google Search Appliance features equivalence

GSA feature           | Elasticsearch         | Apache Solr
Web crawling          | X                     | X
Language Bundles      | Languages             | Language Analysis
Synonyms              | Synonyms              | Synonyms
Stopwords             | Stopwords             | Stopwords
Result Biasing        | Controlling relevance | Query elevation
Suggestions           | Search-suggesters     | Suggester
Dynamic navigation    | Aggregations          | Faceting
Document preview      | X                     | X
User result           | X                     | X
Expert search         | X                     | X
Keymatch              | X                     | X
Related Queries       | X                     | X
Secure search         | Shield                | Solr Security
Search reports        | Logstash + Kibana     | X
Mirroring/Distributed | Scale Elastic         | Solr Cloud
System alert          | Watcher               | X
Email update/Alert    | Watcher               | X

X = not available out of the box

4 quick ways to improve your search!

Findwise has recently published its annual report on Enterprise Search and Findability. We can see that a lot of people complain that their search engine performs poorly: 36% of users were dissatisfied in 2015. Is there any simple recipe for that? I bet there are some things that can be applied almost immediately!


Data

It is quite common that the reason for bad results is content quality – there is simply no good content in a search engine!

Solution = editors and cleaning: educate editors to produce better content! Decide on the main message, set informative titles and be consistent with internal wording and phrasing. Also, look over your index: archive outdated content and remove duplicates. Don't be afraid to remove entire data sources from your index.


Metadata

If your data is already indexed, it is much easier to search using additional filters. For example, when looking for a new washing machine in any good online store, we can easily filter on features such as energy class, manufacturer, price range etc. The same can happen to corporate data, provided that our documents are tagged in a consistent manner.

Solution = tagging: check the tags and metadata consistency of the documents you search through. If the quality of the tagging leaves much to be desired, it should be corrected (note: this can be done automatically to a large extent!). Then consider which filters are the most useful for your company's search and implement them in your search interface.

 

Accuracy

Users' expectations are very important. If they ask and search, they usually want and need something specific: the current lunch menu, a financial settlement form, a specific procedure for calculating credit risk, the sales report for the previous quarter, etc. This unique need of each user is expressed through a simple query. And here we encounter a significant problem: these queries are not always well interpreted by the search engine. If you don't see the desired document/answer in the first five slots of the search results list, even after 2-3 trials with various queries, you quickly come to the conclusion that the search engine doesn't work (well).

Solution = user feedback: it is fundamental to regularly collect users' feedback on the search engine. If you receive signals that something does not work, you absolutely need to examine which specific search scenarios aren't functioning well. These things can usually be fixed pretty easily by using synonyms, promotions or by changing the order in which results are displayed.
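In Solr, for instance, such a promotion can be handled by the query elevation component; a sketch of an elevate.xml entry (the query text and document id below are hypothetical):

<elevate>
  <query text="lunch menu">
    <!-- pin the canteen menu page to the top of the results for this query -->
    <doc id="intranet-lunch-menu"/>
  </query>
</elevate>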


Monitoring

It is not easy to gather everyone's opinion in a large organisation, as there might be thousands of users. A search engine, like everything else, sometimes breaks down, takes too long to answer queries and gives silly results, or even no results at all. Additionally, it is not certain whether such a tool contributes to our organization or not, and who actually makes use of our search.

Solution = logging: log analysis gives a lot of information about the real use of the search engine by its users. Logs tell us how many people are looking for something, what they are asking for, how fast the search engine responds and when it gives zero results. This is priceless information for understanding what works, who really benefits, what the most popular content and questions are, and what needs to be improved. It is crucially important to do this on a regular basis.


Summary

And now, when you have fixed all four points related to the search engine, please tell me if it is still malfunctioning. I've yet to hear of such a case 🙂

Generational renewal at work – a search challenge

The big generational shift

There have been discussions surrounding the great generational renewal in the workplace for a while. The 50's generation, who have spent a large part of their working lives within the same company, are being replaced by an agile bunch born in the 90's. We are not taken in by tabloid claims that this new generation does not want to work, or that companies do not know how to attract them. What concerns us is that businesses are not adapting fast enough to the way the new generation handles information, to enable the transfer of knowledge within the organisation.

Working for the same employer for decades

Think about it for a while: for how long have the 50's generation been allowed to learn everything they know? We see it all the time, large groups of employees ready to retire after spending their whole working lives within the same organisation. They began their careers as teenagers working on the factory floor or in a similar role, step by step growing within the company, together with the company. These employees tend to carry a deep understanding of how their organisation works, and after years of training they possess a great deal of knowledge and experience. How many companies nowadays are willing to offer the 90's workers the same kind of journey? Or should they even?

2016 – It’s all about constant accessibility

The world is different today, than 50 years ago. A number of key factors are shaping the change in knowledge-intense professions:

  • Information overload – we produce more and more information. Thanks to the Internet and the World Wide Web, the amount of information available is greater than ever.
  • Education has changed. Employees of the 50’s grew up during a time when education was about learning facts by rote. The schools of today focus more on teaching how to learn through experience, to find information and how to assess its reliability.
  • Ownership is less important. We used to think it was important to own music albums, have them in our collection for display. Nowadays it’s all about accessibility, to be able to stream Spotify, Netflix or an online game or e-book on demand. Similarly we can see the increasing trend of leasing cars over owning them. Younger generations take these services and the accessibility they offer for granted and they treat information the same way, of course. Why wouldn’t they? It is no longer a competitive advantage to know something by heart, since that information is soon outdated. A smarter approach of course is to be able to access the latest information. Knowing how to search for information – when you need it.

Factors supporting the need for organising the free flow of the right information:

  • Employees don't stay as long as they used to in the same workplace anymore, which, for example, requires a more efficient onboarding process. It's no longer feasible to invest the same amount of time and effort in training one individual since he/she might be changing workplace soon enough anyway.
  • It is much debated whether it is possible to transfer knowledge or not. Current information on the other hand is relatively easy to make available to others.
  • Access to information does not automatically mean that the quality of information is high and the benefits great.

Organisations lack the right tools

Knowing a lot of facts and knowledge about a gradually evolving industry was once a competitive advantage. Companies and organisations have naturally built their entire IT infrastructure around this way of working. A lot of IT applications used today were built for a previous generation with another way of working and thinking. Today most challenges involve knowing where and how to find information. This is something we experience in our daily work with clients. Organisations more or less lack the necessary tools to support the needs of the newer generation in their daily work.

To summarize the challenge: organisations need to be able to supply their new workforce with the right tools to constantly find (and also manipulate) the latest and best information required for them to shine.

Success depends on finding the right information

In order for the new generation to succeed, companies must regularly review how information is handled plus the tools supporting information-heavy work tasks.

New employees need to be able to access the information and knowledge left by retiring employees, while creating and finding new content and information in such a way that information realises its true value as an asset.

Efficiency, automation… And Information Management!

There are several ways of improving efficiency. The first step is often to investigate whether parts of the creating and finding process, or perhaps all of it, can be automated. Secondly, attack the information challenges.

When we get a grip on the information we are to handle, it's time to look into the supporting IT systems. How are employees supposed to find what they are looking for? How do they want to?

We have gotten used to finding answers by searching online. This is in the DNA of the 90's employee. By investing in a great search platform and developing processes to ensure high information quality within the organisation, we are certain the organisation will not only manage the generational renewal but excel in continuously developing new information-centric services.

Written by: Maria “Ia” Björk & Joar Svensson

Sensemaking or Digital Despair

Finding our way in the bright, futuristic, data-driven and intertwined world often taxes us and our digital-hungry senses. Fast rewind to the recent FindabilityDay 2015 and the parade of brilliant speaker talents on stage, starting off with our dear friend and peer, Martin White, on the topic of the future of search.

Human factors matter, from idea inception to design and the practical UX of our digital artifacts. The key has been make-do and ship. This is the reason the more technically advanced mobiles fell by the wayside eight years ago when Apple's iPhone arrived.

The social life with information shapes our daily lives in a hyper-connected world. It is still very hard to find that information needle in the haystack, and most days we feel despair when losing the scent of information nuggets. The results from the Findability Survey spoke clearly: without sound organising principles for information and data, and a pliable recorded vision, we won't find anything of value.

Next, moving on to an old business model with Luna's and Sara's presentation: a great example where we see that the orchestration and choreography of their data assets will determine their survival or demise, in conjunction with infused information management practices, processes and tools. They showed a new set of facets to delivering on their mission in their line of business.

Regardless of the line of business, it becomes clear that our fragmented workplace setting is now only partly "on tap". It makes our daily lives a mess, since things do not interoperate. The vision should show the way to a shared information commons which we all cultivate.

So finally, how do we make sense of any mess?

Answer: architect a place where you can find comfort in shared social conventions around the information used. Abby Covert laid out a beautiful tapestry of things we all need to take on to make sense of everyday life, and of life at work. With clear and distinct guardrails and signposts, we don't feel so distracted or lost. Her talk was a true enlightenment for me, being of the same profession, Information Architect.

Written by: Fredric Landqvist