Elastic{ON} 2017 – breaking all the records!

Elastic{ON} 2017 draws 2200 participants to Pier 48 during these somewhat chilly San Francisco days in March. That is a 40% increase from the 1600 or so participants last year, in line with the growing interest in the Elastic Stack and Elastic's commercial success.

From Findwise – we are a team of 4 Findwizards, networking, learning and reporting.

Shay Banon, the creator of Elasticsearch and Elastic's CTO, is doing both the opening and the closing keynote. It is apparent that the transition of the CEO role from Steven Schuurman has already started.


2016 in retrospective with the future in mind

Elastic reached 100 million downloads in 2016 and has managed to land approximately 4000 paying subscription customers out of this installed base to date. A lot of the presentations during the conference are centered around new functionality that is being developed and will be released freely to the open source community. Other functionality goes into the commercial X-Pack subscriptions. Some X-Pack functionality is available for free under the Basic subscription level, which only requires registration.

Most presentations are centered around search-powered analytics, and fewer around regular free-text search. Elasticsearch and the Elastic Stack have their main use cases within logging and analytics, and in various applications as a data platform or middle layer, with search use cases as a strong sidekick.

A strong focus on analytics

There are 22 sponsors at the event, and most of the companies are offering either cloud-based monitoring or machine learning services. IBM, the platinum sponsor, is promoting the Bluemix cloud services for cognitive Watson functionality and uses the conference to reach out to the predominantly developer-focused audience.

Prelert was acquired in September last year and is now being integrated into the Elastic Stack as the machine learning component, used for unsupervised anomaly detection to give operational log insights. Together with the new modular Beats architecture and various Kibana improvements, it is apparent that Elastic is chasing the huge market Splunk currently controls within logging and analytics.

Elasticsearch SQL – giving BI what it needs

Elasticsearch SQL will give the search engine SQL capability, just like Solr got with its parallel SQL interface. Elasticsearch is becoming more and more of a “data platform”, and increasingly a competitor to HPE Vertica and Amazon Redshift, as it hits a sweet-spot use case where a combination of fast data loading and extreme scalability is needed, and the tradeoffs of limited functionality (such as the lack of JOIN operations) are acceptable. With SQL support the platform can use existing visualization tools such as Tableau, and the user base expands, as many people in the Business Intelligence sector know SQL by heart.

Fast and simple Beats is music to our ears

Beats will become modular in the next release, and more Beats modules will be created, either by Elastic or in the open source or commercial community. This increases simple connectivity to various data sources and adds standardized dashboards per data source, which will increase simplicity and speed of implementation.

Heartbeat is a new Beat (with a beautiful name!) that sends pings to check that services are alive and functioning.

Kibana goes international

Kibana is maturing, with some key updates coming soon. A time series visual builder will give graphical guidance on how to build dashboards, Kibana Canvas will give custom dynamic reports and enable slide-show presentations with live data, and the GUI frontend is being translated into various languages.

There's a new tile service for maps, so instead of relying on external map services, Elastic now has control over the maps functionality. The service can be used free of charge but requires registration (the Basic subscription) to use all 18 zoom levels.



To conclude, we've had three good days with exciting product news and lots of interesting meetings at what could very well be the biggest show for search and search-driven analytics right now! Be sure to see us at next year's Elastic{ON} again. If not before, see you then!


From San Francisco with love,

/Andreas, Christian, Joar and Peter

Digital wizardry for customers & employees – the next elements

A reflection on the Mobile World Congress topics: mobility, digitalisation, IoT, the Fourth Industrial Revolution and sustainability

Commerce has always been conversational; today it is digital. Organisations are asking how to organise clean, effective data for an open digital conversation.

The aim of digitalization is to answer customer/consumer-centric demands effectively (with relevant and related data) and in an efficient manner. [For the remainder of the article, read consumer and customer interchangeably.]

This essentially means joining the dots between clean data and information, and being able to recognise the main and most consumer-valuable use cases, be it common transaction behaviour or negating their most painful user experiences.

This includes treading the fine line between offering “intelligent” information (intelligent in terms of relevance and context) to the consumer effectively, and not seeming freaky or stalker-like. The latter is dealt with by forming a digital conversation in which the consumer understands that their information is only being used for their own needs or wants.

While clean, related data from the many multi-channel customer touchpoints forms the basis of an agile digital organisation, it is the combination of significant data-analysis insight into user demand and behaviour (clicks, log analysis etc.), machine learning and sensible prediction that forms the basis of artificial intelligence. Broken down, artificial intelligence is essentially resultant action based on the inferences drawn from knowing certain information: the elementary Dr Watson, but done by computers.

This new digital data basis means being able to take data from what were previously data silos and combine it effectively in a meaningful way, for a valuable purpose. While the tag Big Data grows weary in a generalised context, the key is picking the data/information that gets relevant answers to the most valuable questions, or, in consumer speak, gets a question answered or a job done effectively.

Digitalisation (and then the following artificial intelligence) relies obviously on computer automation, but it still requires some thoughtful human-related input. Important steps in the move towards digitalization include:

  • Content and data inventory, to clean data and information;
  • Its architecture (information modelling, content analysis, automatic classification and annotation/tagging);
  • Data analysis in combination with text analysis (or NLP, natural language processing, for the more abundant unstructured data and content), the latter to put flesh on the bone as it were, adding meaning and context;
  • Information governance: the process of being responsible for the collection, proper storage and use of important digital information (now made less ignorable by new citizen-centric data laws (GDPR) and the need for data agility or anonymisation of data);
  • Data/system interoperability: which data formats, structures and standards are most appropriate for you (relational databases, linked/graph data, data lakes etc.)?;
  • Language/cultural interoperability: letting people with different perspectives access the same information topics using their own terminology;
  • Interoperability for the future also means being able to link everything in your business ecosystem for collaboration, in- and outbound conversations, endless innovation and sustainability;
  • IoT, or the Internet of Things, making the physical world digital and adding further to the interlinked network, soon to be superseded by the AoT (the Analysis of Things);
  • Newer steps of machine learning (learning about consumer preferences and behaviour etc.) and artificial intelligence (being able to provide seemingly impossibly relevant information, clever decision-making and/or a seamless user experience).

The fusion of technologies continues further as the lines between the physical, digital and biological spheres blur, with developments in the immersive Internet such as Augmented Reality (AR) and Virtual Reality (VR).

The next elements are here already: semantic (‘intelligent’) search, virtual assistants, robots, chat bots… with 5G around the corner to move more data, faster.

Progress within mobility paves the way for a more sustainable world for all of us (the UN Sustainable Development Goals), with a future based on participation. In emerging markets we are seeing giant leaps in societal change. Rural areas now have access to the vast human resources of knowledge to service innovation, e.g. through free access to Wikipedia on cheap mobile devices and open campuses. Gender equality advances with changed monetary and mobile financial practices, and blockchain offers the means to rise to the challenge of interoperability. We have to address the open paradigm (e.g. Open Data) and the participation economy, building the next elements: shared experience and an information commons. This also falls back to the intertwingled digital workplace, and practices to move into new cloud-based arenas.

Some last remarks on the telecom industry: it is loaded with acronyms and, for laymen in the area, sometimes a maze to navigate and make sense of.

So are these steps straightforward, or is the reality still a potential headache for your organisation? 

Contact Findwise now to ease the process, before your competitor does 😉
Written by: Fredric Landqvist & Peter Voisey

Alfresco CMS permission hooking

Integration between Alfresco and Findwise i3

At Findwise we have been working for quite a while on creating a custom AMP (Alfresco Module Package… essentially a nicely packaged extension to Alfresco) that makes a nice integration between Alfresco and Findwise i3.

Alfresco is essentially an open source SharePoint written in Java. It's a CMS (Content Management System), where the basic functionality is to allow end users to import their documents and to make them easy to manage and find later on.

Findwise i3 is built on top of a number of open source products, like Solr, Elasticsearch etc., and its purpose is to provide a nice, clean, pluggable pipelined framework for loading data in one end, making it searchable and applying search aspects to the documents, to allow for searching intelligently on subsets of data. Furthermore, it's also a pluggable framework for presenting a nice web-based search frontend, where the defined search aspects can be selected, along with all other criteria, in nicely formatted and organized search result sets.

So why integrate the two?

There are many users of Alfresco around the world, and there are also many users of Findwise search functionality. As the Findwise framework allows for making lots of different data searchable across an organization, it makes sense to also be able to search data stored in Alfresco, as just one of these many sources. That's the background for the project and why we set about integrating the two.

Plugging into Alfresco's repository (which stores documents) is fairly easy, because the developers of Alfresco have prepared hooks one can use. Thus it's possible to get informed whenever a document is added to Alfresco, when its content is changed, or when it's deleted. On each of these occasions we want to let Findwise i3 know about it, so it can live-update its databases and thus always provide relevant, current data for the end users. As said, this is fairly straightforward and covered in the Alfresco documentation.
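As a sketch of what such a repository hook can look like, using Alfresco's behaviour policies (the class name and the i3 call are our illustration, not a definitive implementation):

import org.alfresco.model.ContentModel;
import org.alfresco.repo.node.NodeServicePolicies;
import org.alfresco.repo.policy.Behaviour.NotificationFrequency;
import org.alfresco.repo.policy.JavaBehaviour;
import org.alfresco.repo.policy.PolicyComponent;
import org.alfresco.service.cmr.repository.ChildAssociationRef;

public class DocumentEventListener implements NodeServicePolicies.OnCreateNodePolicy {

    private PolicyComponent policyComponent; // injected via Spring

    public void init() {
        // React whenever a content node is created in the repository
        policyComponent.bindClassBehaviour(
                NodeServicePolicies.OnCreateNodePolicy.QNAME,
                ContentModel.TYPE_CONTENT,
                new JavaBehaviour(this, "onCreateNode",
                        NotificationFrequency.TRANSACTION_COMMIT));
    }

    @Override
    public void onCreateNode(ChildAssociationRef childAssocRef) {
        // Hypothetical: push the new node to the Findwise i3 pipeline
    }

    public void setPolicyComponent(PolicyComponent policyComponent) {
        this.policyComponent = policyComponent;
    }
}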

However, there is another scenario for which no nice solution exists in Alfresco: hooking into when permissions are changed on documents stored in Alfresco.

Why do we need information about permission changes?

Because Findwise i3 of course respects the IT security roles and settings defined in a company's infrastructure. It wouldn't be cool to have a document protected by ownership or security roles within Alfresco but at the same time fully accessible via the Findwise search API. So we need to be informed about all permission changes made to each and every document hosted within Alfresco. All permission requests and settings within Alfresco happen via the PermissionService interface (permissionService implementation).

So one could choose to replace or override that class with one of our own. That would mean digging deep into Alfresco's configuration, with the risk that when a minor update of Alfresco is installed, our overridden class has to be reconfigured into the Alfresco configuration once again. That's basically a mess and should be avoided at all costs. We would much rather have the nicely packaged AMP, for which simple-to-use tools exist for reinstalling into an Alfresco installation, be able to hook the permission events as well.

The preferred way to do that is to create a method interceptor. That's actually a Spring thing, and as Alfresco is built upon Spring, it makes sense to use one of those. The interceptor does exactly that: it intercepts all method calls for a particular class or interface for which it has been registered. The easy part is making the interceptor; the difficult part (since it is not very well described) is having it registered from within the AMP itself.

The following is skeleton code for a method interceptor (a minimal sketch; which PermissionService methods to react to depends on your needs):
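import org.aopalliance.intercept.MethodInterceptor;
import org.aopalliance.intercept.MethodInvocation;

// Minimal sketch: the class name and the i3 notification are our
// illustration, not a definitive implementation.
public class PermissionServiceInterceptor implements MethodInterceptor {

    @Override
    public Object invoke(MethodInvocation invocation) throws Throwable {
        // Let the intercepted PermissionService method do its job first
        Object result = invocation.proceed();

        String methodName = invocation.getMethod().getName();
        if ("setPermission".equals(methodName)
                || "deletePermission".equals(methodName)
                || "clearPermission".equals(methodName)) {
            // Hypothetical hook: notify Findwise i3 that permissions changed,
            // e.g. by posting the affected node to the i3 pipeline
        }
        return result;
    }
}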


And this is how to register it in your AMP's service-context.xml (a sketch using Spring's BeanNameAutoProxyCreator; the bean id and package name are illustrative):
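<bean id="permissionServiceInterceptor"
      class="com.findwise.alfresco.PermissionServiceInterceptor"/>

<!-- Wrap the public PermissionService bean in a proxy that runs our
     interceptor. Note the capitalisation of the bean name; the package
     above is illustrative. -->
<bean class="org.springframework.aop.framework.autoproxy.BeanNameAutoProxyCreator">
  <property name="beanNames">
    <value>PermissionService</value>
  </property>
  <property name="interceptorNames">
    <list>
      <value>permissionServiceInterceptor</value>
    </list>
  </property>
</bean>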


When you register your AMP with Alfresco using the usual apply_amps script, your new interceptor class will spring into life whenever a method on the PermissionService interface is called. It's actually pretty simple, but the hard part is figuring out the registration process and providing the correct case and naming for the PermissionService. We spent quite a lot of time searching for this on Google and reading the Alfresco documentation, and only found fragments that didn't show the whole picture.

We hope our findings are helpful to others.

Written by: Kim Bo Madsen, Consultant Findwise

What will happen in the information sector in 2017?

As we look back at 2016, we can say that it has been an exciting and groundbreaking year that has changed how we handle information. Let’s look back at the major developments from 2016 and list key focus areas that will play an important role in 2017.

3 trends from 2016 that will lay the basis for shaping 2017


Cloud

There has been a massive shift towards the cloud, not only using the cloud for hosting services but building on top of cloud-based services. This has affected all IT projects, especially in the Enterprise Search market since Google decided to discontinue GSA and replace it with the cloud-based Springboard. Official information on Springboard is still to be published in writing, but reach out to us if you are keen on hearing about the latest developments.

There are clear reasons why search is moving towards the cloud, some of the main ones being machine learning and the sheer amount of data. We have an astonishing amount of information available, and the cloud is simply the best way to handle this overflow. Development in the cloud is faster, the cloud gives practically unlimited processing power, and the latest developments are available in the cloud at an affordable price.

Machine learning

One area that has taken huge steps forward has been machine learning. It is nowadays being used in everyday applications. Google wrote a very informative blog post about how they use Cloud machine learning in various scenarios. But Google is not alone in this space – today, everyone is doing machine learning. A very welcome development was the formation of Partnership on AI by Amazon, Google, Facebook, IBM and Microsoft.

We have seen how machine learning helps us in many areas. One good example is health care, with IBM Watson managing to find a rare type of leukemia in 10 minutes. This type of expert assistance is becoming more common. While we know there is still a long way to go before AI becomes smarter than human beings, we are taking leaps forward, as can be seen by DeepMind beating a human at the complex board game Go.

Internet of Things

Another important area is IoT. In 2016 most IoT projects have, in addition to consumer solutions, touched industry: smart cities, energy utilization or connected cars. Companies have realized that they can nowadays track any physical object, with the benefits of being able to service machines before they break, streamline or build better services, or even create completely new business based on data knowledge. On the consumer side, we saw in 2016 how IoT became mainstream, with the unfortunate effect of poorly secured devices being used for massive attacks.


3 predictions for key developments happening in 2017

As we move towards the year 2017, we see that these trends from 2016 have positive effects on how information will be handled. We will have even more data and even more effective ways to use it. Here are three predictions for how we will see the information space evolve in 2017.

Insight engine

The way we collaborate with computers is changing. For decades, we have been giving tasks to computers and waiting for their answers. This is slowly changing: we are starting to collaborate with computers and even expect them to take the initiative. The developments behind this are in machine learning and human language understanding. We no longer only index information and search it with free text. Nowadays, we can build computers that understand information. This information includes everything from IoT data points to human-created documents and data from other AI systems. This enables building an insight engine that can help us formulate the right question, or even give us insight based on information for a question we never asked. This will revolutionize how we handle our information and how we interact with our user interfaces.

We will see virtual private assistants that users will be happy to use and train, so that they can help us use information like never before in our daily lives. Google Now, in its current form, is merely the first step of something like this, being proactive in bringing information to the user.

Search-driven analytics

The way we use and interact with data is changing. With information collected about pretty much anything, we have almost any information right at our fingertips and need effective ways to learn from it, in real time. In 2017, we will see a shift away from classic BI systems towards search-driven evolutions of them. We already have Kibana dashboards with Timelion, and ThoughtSpot, but these are only the first examples of how search is revolutionizing how we interact with data. Advanced analytics available to anyone within the organization, with answers and predictions directly in graphs and diagrams, is what 2017 insights will be all about.

Conversational UIs

We have seen the rise of Chatbots in 2016. In 2017, this trend will also take on how we interact with enterprise systems. A smart conversational user interface builds on top of the same foundations as an enterprise search platform. It is highly personalized, contextually smart and builds its answers from information in various systems and information in many forms.

Imagine discussing future business focus areas with a machine that challenges our ideas and backs everything with data-based facts. Imagine your enterprise search responding to your search with a question, asking you to detail what you are actually trying to achieve.


What are your thoughts on the future development?

How do you see 2017 changing the way we interact with our information? Comment or reach out in other ways to discuss this further, and have a Happy Year 2017!


Written by: Ivar Ekman

Elastic Stack 5.0 is released

At a first glance, the major Elasticsearch version bump might seem frightening. Going from version 2.4.x to 5.0 is a big jump, but there’s no need to worry. The main reason is to align versions between the different products in the stack. Having all products on the same version will make it a lot easier to handle future upgrades and simplify the overall experience for both new and existing users.

All products in the stack have been updated, some more than others. Here are a few highlights regarding Elasticsearch 5.0 that we recommend you read before upgrading. Or schedule an appointment with us and we'll help you out!

New relevance model

Elasticsearch prior to version 5 used TF/IDF as its default scoring algorithm. From now on, the default algorithm is BM25.

Depending on the nature of your indexed information, a re-index operation might give you slightly different, and most likely more relevant, results.
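If you want to compare the models, the similarity can still be chosen per index at creation time. A minimal sketch (the index name is illustrative; in 5.x the old TF/IDF model goes under the name classic):

PUT my_index
{
  "settings": {
    "index": {
      "similarity": {
        "default": { "type": "classic" }
      }
    }
  }
}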

Re-index from remote

This new feature of the Elasticsearch API is really useful when for example upgrading from old clusters. By specifying a remote cluster in the API call, you can easily transfer old documents to your newly created 5.0 cluster without going through a rolling node upgrade procedure.
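A sketch of such a call (host and index names are illustrative; the remote host must also be whitelisted via reindex.remote.whitelist in elasticsearch.yml):

POST _reindex
{
  "source": {
    "remote": { "host": "http://old-cluster:9200" },
    "index": "logs-2016"
  },
  "dest": { "index": "logs-2016" }
}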

Ingest Node

There's a new node type in town. Starting from version 5.0, Elasticsearch gives you the possibility to do simple data manipulation within a running cluster prior to indexing. This is useful if you prefer a more simplistic architecture without Logstash instances, but still need to do some alterations to your data.

Most core processors found in Logstash are available. Often used ones include (a combined pipeline sketch follows the list):

  • Date Processor
  • Convert processor
  • Grok Processor
  • Rename Processor
  • JSON Processor
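A pipeline combining a few of them might look like this (a sketch; the pipeline name, field names and grok pattern are illustrative):

PUT _ingest/pipeline/apache-logs
{
  "description": "Parse an Apache access log line",
  "processors": [
    { "grok": { "field": "message", "patterns": ["%{COMMONAPACHELOG}"] } },
    { "date": { "field": "timestamp", "formats": ["dd/MMM/yyyy:HH:mm:ss Z"] } },
    { "rename": { "field": "clientip", "target_field": "client_ip" } }
  ]
}

Documents are then indexed with ?pipeline=apache-logs appended to the index request.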

Search and Aggregations

The search API has been refactored to be more clever about which indices are hit, and also about whether aggregations need to be recalculated or not when issuing range queries. By looking at when indices were last modified, range aggregations can be cached and only recalculated if really needed. This improvement is really useful for the typical log-analytics case with time series data. You will notice speed improvements in your Kibana dashboards.

New data structures

Lucene 6.0 introduces a new feature called dimensional points, which uses the k-d tree geo-spatial data structure to enable fast single- and multi-dimensional numeric range and geo-spatial point-in-shape filtering. Elasticsearch 5.0 implements a variant called block k-d tree, specifically designed for efficient IO, which gives significant performance boosts when indexing as well as filtering.

Should I upgrade?

If your typical use case involves geo-spatial queries and filtering, we definitely recommend that you upgrade your cluster and re-index your documents to gain the performance boost. Due to the simplicity in upgrading or even migrating data to a completely new cluster, it will be worth the time getting your Elastic Stack up to date and ready for features to come.

In case you need help, don’t hesitate to contact us and we will guide you through the process.

Written by: Joar Svensson, Consultant Findwise

How to improve search relevance using machine learning and statistics – Apache Solr Learning to Rank

In search, relevance denotes how well a retrieved document or set of documents meets the information need of the user. Natural languages, synonymy, homonymy, term frequency, norms, relevance models, boosting, performance, subjectivity… the reasons why search relevancy remains a hard problem are many. This article deals with how machine learning and search statistics can improve relevance using the Learning to Rank plugin, which will be included in a future version of Solr. If you want more information than is provided in this blog post, be sure to visit our website or contact us!


Consider an intranet search solution where users can be divided into two groups: developers and sales. Search and click statistics are collected, and the following picture illustrates a specific search performed 569 times by the users, with the click statistics for each document.

Example of a search with click statistics


As noticed, the top search hit, whose score is computed from term frequency, inverse document frequency and field-length norm, is less relevant (got fewer clicks) than documents with lower scores. Instead of manually tweaking the relevancy model and the different field boosts, which, by tweaking for a specific query, would probably decrease the global relevancy for other queries, we will try to learn from the users' click statistics to automatically generate a relevancy model.

Architecture, features and model


Using search and click statistics as training data, the search engine can learn, from input features and with a ranking model, the probability that a document is relevant for a user.

Search architecture

Search architecture with a ranking training model


With the Solr learning to rank plugin, features can be defined using standard Solr queries.
For my specific example, I will choose the following features:
– originalScore: original score computed by Solr
– queryMatchTitle: boolean stating if the user query text matches the document title
– queryMatchDescription: boolean stating if the user query text matches the document description
– isPowerPoint: boolean stating if the document type is PowerPoint

   { "name": "isPowerPoint",
     "class": "org.apache.solr.ltr.feature.SolrFeature",
     "params":{ "fq": ["{!terms f=filetype }.pptx"] }
    "name" : " queryMatchTitle",
    "class" : "org.apache.solr.ltr.feature.SolrFeature",
    "params" : { "q" : "{!field f=title}${user_query}" }
    "name" : " queryMatchDescription",
    "class" : "org.apache.solr.ltr.feature.SolrFeature",
    "params" : { "q" : "{!field f=description}${user_query}" }

Training a ranking model

From the click statistics data, a training set (X, y), where X is the feature input vector and y a boolean stating whether the user clicked a document or not, can be generated and used to compute a linear model using regression, which will output the weight of each feature (a small training sketch follows the example data below).

Example of statistics:

    q: "i3",
    docId: "91075403",
    userId: "507f1f77bcf86cd799439011",
    clicked: 0,
    score: 3,43		
    q: "i3",
    docId: "82034458",
    userId: "507f1f77bcf86cd799439011",
    clicked: 1
    score: 3,43		
    q: "coucou",
    docId: "1246732",
    userId: "507f1f77bcf86cd799439011",
    clicked: 0	
    score: 3,42	

Training data generated:
X= (originalScore, queryMatchTitle, queryMatchDescription, isPowerPoint)
y= (clicked)

    { originalScore: 3.43,
      queryMatchTitle: 1,
      queryMatchDescription: 1,
      isPowerPoint: 1,
      clicked: 0 },
    { originalScore: 3.43,
      queryMatchTitle: 0,
      queryMatchDescription: 1,
      isPowerPoint: 0,
      clicked: 1 },
    { originalScore: 3.42,
      queryMatchTitle: 1,
      queryMatchDescription: 0,
      isPowerPoint: 1,
      clicked: 0 }
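As an aside, learning the feature weights from such data can be sketched in a few lines (scikit-learn assumed; an illustration, not the plugin's own trainer):

from sklearn.linear_model import LogisticRegression

# Rows follow the training data above:
# originalScore, queryMatchTitle, queryMatchDescription, isPowerPoint
X = [[3.43, 1, 1, 1],
     [3.43, 0, 1, 0],
     [3.42, 1, 0, 1]]
y = [0, 1, 0]  # clicked or not

model = LogisticRegression().fit(X, y)
print(model.coef_[0])  # one weight per feature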

Once the training is completed and the different feature weights are computed, the model can be sent to Solr using the following format:

        { "name": "originalScore"},
        { "name": "queryMatchTitle"},
        { "name": "queryMatchDescription"},
        { "name": "isPowerPoint"},
        "weights": {
            "originalScore": 0.6783,
            "queryMatchTitle": 0.4833,
            "queryMatchDescription": 0.7844,
            "isPowerPoint": 0.321

Run a Rerank Query

The Solr LTR plugin allows you to easily apply the re-rank model on the document results by adding rq={!ltr model=myModelName reRankDocs=25} to the query.
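A full request could then look like this (a sketch; efi passes the external feature value that the ${user_query} features above expect):

http://localhost:8983/solr/index/query?q=i3&rq={!ltr model=myModelName reRankDocs=25 efi.user_query=i3}&fl=title,score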

Personalization of the search result

If your statistics data includes information about users, specific re-rank models can be trained for different user groups. In my current example, I trained one model for the developer group and one for the sales representatives.

Dev model:

        { "name": "originalScore"},
        { "name": "queryMatchTitle"},
        { "name": "queryMatchDescription"},
        { "name": "isPowerPoint"},
        "weights": {
            "originalScore": 0.6421,
            "queryMatchTitle": 0.4561,
            "queryMatchDescription": 0.5124,
            "isPowerPoint": 0.017

Sales model:

        { "name": "originalScore"},
        { "name": "queryMatchTitle"},
        { "name": "queryMatchDescription"},
        { "name": "isPowerPoint"},
        "weights": {
            "originalScore": 0.712,
            "queryMatchTitle": 0.582,
            "queryMatchDescription": 0.243,
            "isPowerPoint": 0.623

From the statistics data, the system learnt that a PowerPoint document is more relevant for a sales representative than for a developer.

Developer search

Developer search with re-ranking

Sales representative search

Sales representative search with re-ranking

To conclude: with a search system continuously trained from a flow of statistics, not only will the search relevance be more customized and personalized to your users, but the relevance will also automatically adapt to changes in user behavior.

If you want more information, have further questions or need help please visit our website or contact us!

The Solr LTR plugin is to be released soon: https://github.com/bloomberg/lucene-solr/tree/master-ltr-plugin-release/solr/contrib/ltr

Involuntarily digital footprints violate personal integrity (learn about GDPR)

The aim of this blog post is to make “average Joe” understand how the new upcoming General Data Protection Regulation (GDPR) affects his everyday life.

To start with, let’s sort some expressions out.

Digital footprint

According to Wikipedia, there are two main classifications of digital footprints:
• Passive digital footprint – data collected without the owner's knowledge.
• Active digital footprint – data released deliberately by the user themselves (e.g. sharing an image on Facebook).

Personal integrity

Integrity could be described as the quality of being honest and having strong moral principles. In general, where you stand on the question of integrity is a personal choice. Gossiping about secrets told in confidence is one example to illustrate with. Publishing images of others without their knowledge is another (this might even be illegal).

This illustrative case could be you

To understand what GDPR is about and how it affects your everyday life, I will illustrate with an example that I hope you can recognize yourself in.

Imagine: you live in an apartment in a mid-size building with other people (we can choose to call them neighbours). In front of the building there is a space dedicated to parking cars. One day a neighbour of yours chooses to move and therefore hires a real-estate agent to help with selling the apartment.

As you are somewhat curious about what the apartments in your neighbourhood are worth, you look the advertisement for the apartment up on the internet. When you find the apartment, you see your own car in the picture of the parking space. On top of this, you discover that the registration number of the car is fully visible.

Should you care?

According to Datainspektionen, registration numbers are considered “personal data”. So the first mistake being made by the broker here is creating a passive digital footprint for you. The second mistake is breaking the law: in Sweden it is not allowed to publish personal data without the owner's consent.

The moral compass of the broker should be questioned here. A passive digital footprint in your name has been created, your personal integrity has been violated and the law has been broken.

On top of that: GDPR takes effect in May 2018. You will have the right to be forgotten whenever you want (you can push companies to remove your personal data from their systems).

Is there a business case?

A lawyer could probably build a business case around suing real-estate brokers for publishing pictures of cars' registration numbers without the owners' consent.

As a regular citizen, you should probably not get too agitated about a picture of your car's registration number? Or maybe you should; it depends on your level of personal integrity. As modern society evolves, the amount of different types of information being digitalized grows by the day.

With this example, I hope “average Joe” now understands what digital footprints, personal integrity and GDPR are. Maybe this got you thinking and you want to know more about GDPR.

There are probably two sober ways to look at this: live with your personal data being spread (and get used to the fact that you soon won't have anything personal anymore), or maybe it's time to stick your neck out and say “hey, stop publishing my personal data without asking me”.

No matter if you want it or not, you are affected by GDPR.



Written by: Markus Edström

SharePoint 2013 entity extraction – part 2

In a previous blog post we described the built-in way of doing entity extraction in SharePoint. It was pointed out that it's good as long as you are able to create a full dictionary of all the entities you want to extract, which is not always possible. An alternative to dictionary-based entity extraction is the statistical approach, where we train a model to recognize entity names.

SharePoint Content Enrichment

The content enrichment web service callout offers the possibility of processing the crawled content before it is indexed. This processing, which can for example consist of cleaning data, computing new values based on existing ones, or enriching the content with metadata, is done in addition to the processing done by SharePoint. Note, however, that this solution is limited to SharePoint Server 2013 with Enterprise CALs.

The processing can be applied not only to SharePoint content; it can benefit any content indexed by SharePoint, such as external websites.

Findwise Entity Extraction

Findwise has implemented a basic content processing enrichment service for customized processing of content indexed by SharePoint. This service can thus be used for processing and enriching documents using other already tested services developed by Findwise, such as the text analytic components. Moreover, it provides the basis on which other custom services can be built.

One of these text analytic services is entity extraction. It is based on a statistical approach, so before we can run it we need to train our model. The documents used for training must of course be representative of the domain in terms of form, terminology and writing style. However, this statistical approach has the potential to improve over time through training, as more examples are provided.

Make it run

To make it work we need to set up the Findwise Entity Extraction service and provide documents for training. The more we get, the better.

The Findwise Entity Extraction Service is just a web service that gets text and returns an array of found entities. Example:

For given text:

"Bill Wiggins has joined the findwise company in 2016. 
Since then much has happened, in example George Lucas joined and left, 
Tom Rubik has decided to move forward."

We got in response:

"Bill Wiggins",
"George Lucas",
"Tom Rubik "

Next we create a Web Service Callout project. The only thing it does is get the document body and put the extracted entities in a new “Entity” managed property. No mapping is required this time; just make sure to create the new managed property. Finally, add the Web Service Callout to IIS and register it in SharePoint using PowerShell scripts (see the sketch below).
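The registration can be sketched like this (the endpoint URL and property names are illustrative; the cmdlets are the standard SharePoint 2013 search ones):

$ssa = Get-SPEnterpriseSearchServiceApplication
$config = New-SPEnterpriseSearchContentEnrichmentConfiguration
$config.Endpoint = "http://localhost:8081/ContentEnrichmentService.svc"
$config.InputProperties = "Body", "Title"
$config.OutputProperties = "Entity"
Set-SPEnterpriseSearchContentEnrichmentConfiguration -SearchApplication $ssa -ContentEnrichmentConfiguration $config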

Information on how to create and register a Content Enrichment Web Service Callout can be found at https://msdn.microsoft.com/en-us/library/office/jj163982.aspx

After that, just run the content source and add a new refiner to your search page. Below you can find the result of a run on Wikipedia texts. Note that the extracted names don't come from any dictionary but are returned by the Findwise Entity Extraction text analytic component.

SharePoint 2013 entity extraction – part 1

What is your search missing?

The built-in search experience in SharePoint 2013 has greatly improved from previous versions, and companies adopting it enjoy a bag of new features (such as the visual refiners, the social search and the hover panel with previews, to name a few). However, is your implementation of search in SharePoint 2013 matching all your business and information needs? Is your search solution reaching the target search KPIs? Are you wondering how you can cut down on the editors' tasks, improve the search experience for your users, or reduce the time spent by your information workers finding relevant content?

Entity Extraction in SharePoint 2013 Search  

To make your search good you need good metadata, which can then be used for filters, boosted fields etc. Usually that means that documents need to be tagged, which may take a long time if done manually by content owners.

However, it is possible to extract some metadata from document content at index time. In SharePoint 2013 there are two ways of doing it: “custom entity extraction” or “custom content processing”. In this post you can learn the first way.

Custom entity extraction

SharePoint 2013 introduced a new way of doing entity extraction. It allows you to extract entities from documents based on a dictionary.

The first step is, of course, preparing the dictionary. It needs to be in the following format:

Key,Display form
Findwise, Findwise
FW, Findwise
Sharepoint, Sharepoint

Then you need to register that dictionary file in SharePoint using PowerShell scripts (a sketch follows below): https://technet.microsoft.com/library/jj219614.aspx
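A registration sketch (the file path is illustrative; the cmdlet is the one described in the TechNet article above):

$ssa = Get-SPEnterpriseSearchServiceApplication
Import-SPEnterpriseSearchCustomExtractionDictionary -SearchApplication $ssa -FileName "\\server\share\entities.csv" -DictionaryName Microsoft.UserDictionaries.EntityExtraction.Custom.Word.1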

The last thing left to do is enabling entity extraction on the managed property it should be applied to. To extract entities from the content of a document, just edit the “body” managed property


and select “Word Extraction – Custom 1” checkbox:


We chose that one because our dictionary was registered with the –DictionaryName “Microsoft.UserDictionaries.EntityExtraction.Custom.Word.1” parameter. If you need more dictionaries, you register them using different dictionary name values and select the right option in the managed property settings.

After that, run a “Full Crawl” for your sources.

Finally, just add the “WordCustomRefiner1” to the refiners on your search result list and start using the new filter.

This way is really good if you are able to generate a static dictionary. E.g. for location extraction you can use a list of all countries or cities found on the internet. You can also extract a dictionary from your customer database or your employee list and then update it on a regular basis.

However, it's usually not possible to get a full list of all entities; then they must be extracted using one of the NLP algorithms, which will be described in the next part.

Query Completion with Apache Solr

There are plenty of names for this functionality: query completion, suggestions, auto-complete, auto-suggest, word completion, type-ahead and maybe some more. Even if we can point out slight differences between them (suggestions can be based on your indexed documents or on external input such as user queries), from a technical point of view it's all about the same thing: proposing a query to the end user.

Early Google Suggest from 2008. Source: http://www.wpromote.com/blog/4-things-in-08-that-changed-the-face-of-search/


The suggester feature was launched by Google 8 years ago, in 2008. Users got used to query completion, and nowadays it's a common feature of all mature search engines, e-commerce platforms and even internal enterprise search solutions.

Suggestions help navigate users through the web portal, allow them to discover relevant content and recommend popular phrases (and thus search results). In the e-commerce area they are even more important, because well-implemented query completion can push up the conversion rate and ultimately increase sales revenue. Query completion should never lead to zero results, but this kind of mistake is made frequently.

And just as many names describe this feature, there are as many ways to build it. Still, it's not a trivial task to implement well-working query completion. Software like Apache Solr doesn't solve the whole problem. Building auto-suggestions is also about data (what should we present to users?), its quality (e.g. when we want to suggest other users' queries), suggestion order (we got dozens of matches, but we can show only 5; which are the most important?) and design (user experience and the like).

Going back to the technology: query completion can be built in a couple of ways with Apache Solr. You can use mechanisms like facets, terms, the dedicated suggest component, or just do a query (with e.g. the dismax parser).

Take a look at the Suggester. It's very easy to run; you just need to configure a searchComponent and a requestHandler. Example:

<searchComponent name="suggester" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">suggester1</str>
    <str name="lookupImpl">FuzzyLookupFactory</str>
    <str name="dictionaryImpl">DocumentDictionaryFactory</str>
    <str name="field">title</str>
    <str name="weightField">popularity</str>
    <str name="suggestAnalyzerFieldType">text</str>
<requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <str name="suggest">true</str>
    <str name="suggest.count">10</str>
  <arr name="components">

SuggestComponent is a ready-to-use implementation which is responsible for serving up suggestions based on commands and queries. It's an efficient solution, partly because it works on a structure separate from the main index which is kept in memory. There are some basic settings, like the field used for autocompleting or the text-analysis chain. LookupImpl defines how to match terms in the index. There are about 10 algorithms with different purposes. Probably the most popular are:

  • AnalyzingLookupFactory (default, finds matches based on prefix),
  • FuzzyLookupFactory (finds matches with misspellings),
  • AnalyzingInfixLookupFactory (finds matches anywhere in the text),
  • BlendedInfixLookupFactory (combines matches based on prefix and infix lookup).

You need to choose the one which fulfills your requirements. The second important parameter is dictionaryImpl, which represents how indexed suggestions are stored. And again, you can choose between a couple of implementations, e.g. DocumentDictionaryFactory (stores terms, weights and optional payloads) or HighFrequencyDictionaryFactory (works when very common terms overwhelm others; you can set up a proper threshold).

There are plenty of different settings you can use to customize your suggester. The SuggestComponent is a good start and probably covers many cases, but like everything it has some limitations, e.g. you can't easily filter out results.

Example execution:
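A request against the handler above might look like this (host and core name are illustrative):

http://localhost:8983/solr/index/suggest?suggest.q=lond&suggest.dictionary=suggester1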


suggestions: [
  { term: "london" },
  { term: "londonderry" },
  { term: "londoño" },
  { term: "londoners" },
  { term: "londo" }
]

Another way to build a query completion is to use mechanisms like faceting, terms or highlighting.

An example of QC built on facets:
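The request could look like this (host, core and field names are illustrative):

http://localhost:8983/solr/index/select?q=*:*&rows=0&facet=true&facet.field=title_keyword&facet.contains=lond&facet.contains.ignoreCase=true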


title_keyword: [
  "blonde bombshell", 2,
  "12-pounder long gun", 1,
  "18-pounder long gun", 1,
  "1957 liga española de baloncesto", 1,
  "1958 liga española de baloncesto", 1
]

Please notice that here we have used the facet.contains method, so the query also matches in the middle of a phrase; it works on the basis of simple substring matching. Additionally, we get a count for every suggestion in the Solr response.

The TermsComponent (returns indexed terms and the number of documents which contain each term) and highlighting (which originally emphasizes fragments of documents that match the user's query) can also be used, as presented below.

Terms example:

<searchComponent name="terms" class="solr.TermsComponent"/>
<requestHandler name="/terms" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <bool name="terms">true</bool>
    <bool name="distrib">false</bool>
  <arr name="components">

title_general: [

Highlighting example:

http://localhost:8983/solr/index/select?q=title_ngram:lond &fl=title&hl=true&hl.fl=title&hl.simple.pre=&hl.simple.post=

title_ngram: [

You can also do auto-complete even with a usual full-text query. It has lots of advantages: Lucene scoring works, you have filtering, boosts, matching across many fields and the whole Lucene/Solr query syntax. Take a look at this eDisMax example:
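The request might be (a sketch; the field names follow the examples in this post):

http://localhost:8983/solr/index/select?q=lond&defType=edismax&qf=title_ngram&fl=title&rows=5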


docs: [
  { title: "Londinium" },
  { title: "London" },
  { title: "Darling London" },
  { title: "London Canadians" },
  { title: "Poultry London" }
]

The secret is the analyzer chain, whether you want to base your QC on facets, queries or the SuggestComponent. Depending on what effect you want to achieve with your QC, you need to index data in the right way. Sometimes you may want to suggest single terms, other times whole sentences or product names. If you want to suggest e.g. letter by letter, you can use the Edge N-Gram Filter. Example:

<fieldType name="text_ngram" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory minGramSize="1" maxGramSize="50" />
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>

An N-gram is a structure of n items (the size depends on the given range) from a given sequence of text. Example: the term Findwise with minGramSize = 1 and maxGramSize = 10 will be indexed as (lowercased by the filter chain above):

f
fi
fin
find
findw
findwi
findwis
findwise

With such indexed text you can easily achieve functionality where user is able to see changing suggestions after each letter.

Another case is the ability to complete word after word (like Google does). It isn't trivial, but you can try the shingle structure. Shingles are similar to N-grams, but they work on whole words. Example: Searching is really awesome with minShingleSize = 2 and maxShingleSize = 3 will be indexed as:

Searching is
Searching is really
is really
is really awesome
really awesome

Example of Shingle Filter:

<fieldType name="text_shingle" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="10" />
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>

What if your users could use QC which supports synonyms? Then they could type e.g. an abbreviation and find the full suggestion (NYC -> New York City, UEFA -> Union of European Football Associations). It's easy; just use the Synonym Filter in your text field:

<fieldType name="text_synonym" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>

And then just do a query:
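For example (assuming synonyms.txt maps NYC to New York City):

http://localhost:8983/solr/index/select?q=title_synonym:nyc&fl=title&rows=5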


docs: [
  { title: "New York City" },
  { title: "New York New York" },
  { title: "Welcome to New York City" },
  { title: "City Club of New York" },
  { title: "New York" }
]

Another very similar example concerns language support and matching suggestions regardless of the terms' form. This can be especially valuable for languages with rich grammar rules and declension. In the same way as the SynonymFilter is used, we can configure a stemmer/lemmatization filter, e.g. for English (take a look here and remember to put the language filter in both the index- and query-time analyzers), and expand the matching suggestions.

As you can see, there are many ways to run query completion; you need to choose the right mechanism and text analysis based on your own limitations and on what you want to achieve.

There are also other topics connected with preparing a type-ahead solution. You need to consider performance issues, which are mostly centered on response time and memory consumption. How many requests will QC generate? You can assume at least 3 times more than your regular search service. You can handle traffic growth by optimizing Solr caches or by installing a separate Solr instance only for the suggestion service. If you create n-grams, shingles or similar structures, be aware that your index size will increase. Remember that if you decide to use facets or highlighting to provide the suggester, both of these mechanisms put a heavy load on your CPU.

In my opinion, the most challenging issue to resolve is choosing a data source for the query completion mechanism. Should you suggest parts of your documents (like titles, keywords, authors)? Or use NLP algorithms to extract meaningful phrases from your content? Maybe parse search/application logs and use the most popular user queries (be careful: filter out rubbish and normalize user input)? I believe the answer is YES, to all of them. Suggestions should be diversified (to lead your users to a wide range of search resources) and should come from a variety of sources. More than likely, you will need to do a hard job when processing documents; remember that data cleaning is crucial.

Similarly, you need to take into account different strategies for the order of proposed suggestions. It's good to show them in alphanumeric order (still respecting scoring!), but you can't stop there. A peculiarity of QC is that the application can return hundreds of matches, but you can present only 5 or 10 of them. That's why you need to promote suggestions with the highest occurrence in the index or the most popular among users. Further enhancements can involve personalizing query completion, using geographical coordinates or implementing security trimming (you can see only those suggestions you are allowed to).

I'm sure this blog post doesn't exhaust the subject of building query completion, but I hope I have brought the topic closer and shown the complexity of such a task. There are many different dimensions you need to handle, like the data source of your suggestions, choosing the right indexing structure, performance issues, ranking, or even UX and design (how would you like to present hints: simple text or with some graphics/images? Would you like to divide suggestions into categories? Do you always want to show the result page after a clicked suggestion, or maybe redirect to a particular landing page?).

A search engine like Apache Solr is a tool, but you still need an application with the whole business logic on top of it. Do you want prefix matching and infix matching? To support typos and synonyms? To suggest letter after letter or word by word? To implement security requirements or advanced ranking to propose the best tips for your users? These and even more questions need to be thought over to deliver successful query completion.