SharePoint 2013 entity extraction – part 1

What is your search missing?

The built-in search experience in SharePoint 2013 has greatly improved from previous versions, and companies adopting it enjoy a bag of new features (such as the visual refiners, the social search, the hover-panel with previews, to name a few). However, is your implementation of the search in SharePoint 2013 matching all your business and information needs?! Is your search solution reaching the target search KPIs? Are you wondering how you can cut down on the task of the editor, improve the search experience for your users, or reduce the time spent by your information workers finding the relevant content?

Entity Extraction in SharePoint 2013 Search  

To make your search good you need good metadata. They can be then used as a filters, boosted fields etc. Usually that means that documents need to be tagged, which may take a long time if done manually by content owners.

However, it is possible to extract some metadata from document content during index time. In SharePoint 2013 there are two ways of doing it: “Custom entity extraction” or with use of “Custom Content processing”. In this post you can learn the first way.

Custom entity extraction

SharePoint 2013 introduced a new way for entity extraction. It allows to extract entities from document based on dictionary.

The first step is, of course, preparing the dictionary. It needs to be in following format:

Key,Display form
Findwise, Findwise
FW, Findwise
Sharepoint, Sharepoint

Then you need to register that dictionary file in SharePoint – using Powershell scripts:

Last thing left to do is enabling entity extraction on the Managed Property it should be applied to. To get them from content of a document just edit “body” Managed Property


and select “Word Extraction – Custom 1” checkbox:


We choose that one because our dictionary was registered with – DictionaryName “Microsoft.UserDictionaries.EntityExtraction.Custom.Word.1” parameter, if you need more dictionaries then you register them using different dictionary name values and selecting the right option in Managed Property settings.

After that run “Full Crawl” for your sources.

Finally, just add the “WordCustomRefiner1” to your refiners on search result list and start using new filter:scr3

This way is really good if you are able to generate a static dictionary. Eg. you can use a list of all countries or cities you can find on the internet for location extraction. You can also extract at dictionary from your customers database or your employees list and then update it on regular basis.

However usually it’s not possible to get a full list of all entities, and they must be extracted using one of NLP algorithms, that will be described in next part.

Understanding politics with Watson using Text Analytics

To understand the topics that actually are important to different political parties is a difficult task. Can text analytics together with an search index be an approach to given a better understanding?

This blog post describes how IBM Watson Explorer Content Analytics (WCA) can be used to make sense of Swedish politics. All speeches (in Swedish: anföranden) in the Swedish Parliament from 2004 to 2015 are analyzed using WCA. In total 139 110 transcribed text documents were analyzed. The Swedish language support build by Findwise for WCA is used together with a few text analytic processing steps which parses out person names, political party, dates and topics of interest. The selected topics in this analyzed are all related to infrastructure and different types of fuels.

We start by looking at how some of the topics are mentioned over time.

Analyze of terms of interets in Swedsih parlament between 2004 and 2014.

Analyze of terms of interest in Swedish parliament between 2004 and 2014.

The view shows topic which has a higher number of mentions compared to what would be expected during one year. Here we can see among other topics that the topic flygplats (airport) has a high increase in number of mentioning during 2014.

So let’s dive down and see what is being said about the topic flygplats during 2014.

Swedish political parties mentioning Bromma Airport.

Swedish political parties mentioning Bromma Airport during 2014.

The above image shows how the different political parties are mentioning the topic flygplats during the year 2014. The blue bar shows the number of times the topic flygplats was mentioned by each political party during the year. The green bar shows the WCA correlation value which indicates how strongly related a term is to the current filter. What we can conclude is that party Moderaterna mentioned flygplats during 2014 more frequently than other parties.

Reviewing the most correlated nouns when filtering on flygplats and the year 2014 shows among some other nouns: Bromma (place in Sweden), airport and nedläggning (closing). This gives some idea what was discussed during the period. By filtering on the speeches which was held by Moderaterna and reading some of them makes it clear that Moderaterna is against a closing of Bromma airport.

The text analytics and the index provided by WCA helps us both discover trending topics over time and gives us a tool for understanding who talked about a subject and what was said.

All the different topics about infrastructure can together create a single topic for infrastructure. Speeches that are mentioning tåg (train), bredband (broadband) or any other defined term for infrastructure are also tagged with the topic infrastructure. This wider concept of infrastructure can of course also be viewed over time.

Discussions in Swedish parliament mentioning the defined terms which builds up the subject infrastructure 2004 to 2015.

Discussions in Swedish parliament mentioning the defined terms which builds up the subject infrastructure 2004 to 2015.

Another way of finding which party that are most correlated to a subject is by comparing pair of facets. The following table shows parties highly related to terms regarding infrastructure and type of fuels.

Political parties highly correlated to subjects regarding infrastructure and types of fuel.

Swedish political parties highly correlated to subjects regarding infrastructure and types of fuel.

Let’s start by explain the first row in order to understand the table. Mobilnät (mobile net) has only been mentioned 44 times by Centerpartiet, but Centerpartiet is still highly related to the term with a WCA correlation value of 3.7. This means that Centerpartiet has a higher share of its speeches mentioning mobilnät compared to other parties. The table indicates that two parties Centerpartiet and Miljöpartiet are more involved about the subject infrastructure topics than other political parties.

Swedish parties mentioning the defined concept of infrastructure.

Swedish parties mentioning the defined concept of infrastructure.

Filtering on the concept infrastructure also shows that Miljöpartiet and Centerpartiet are the two parties which has the highest share of speeches mentioning the defined infrastructure topics.

Interested to dig deeper into the data? Parsing written text with text analytics is a successful approach for increasing an understanding of subjects such as politics. Using IBM Watson Explorer Content Analytics makes it easy. Most of the functionality used in this example is also out of the box functionalities in WCA.

Analyzing the Voice of Customers with Text Analytics

Understanding what your customer thinks about your company, your products and your service can be done in many different ways. Today companies regularly analyze sales statistics, customer surveys and conduct market analysis. But to get the whole picture of the voice of customer, we need to consider the information that is not captured in a structured way in databases or questionnaires.

I attended the Text Analytics Summit earlier this year in London and was introduced to several real-life implementations of how text analytics tools and techniques are used to analyze text in different ways. There were applications for text analytics within pharmaceutical industry, defense and intelligence as well as other industries, but most common at the conference were the case studies within customer analytics.

For a few years now, the social media space has boomed as platforms of all kinds of human interaction and communication, and analyzing this unstructured information found on Twitter and Facebook can give corporations deeper insight into how their customers experience their products and services. But there’s also plenty of text-based information within an organization, that holds valuable insights about their customers, for instance notes being taken in customer service centers, as well as emails sent from customers. By combining both social media information with the internally available information, a company can get a more detailed understanding of their customers.

In its most basic form, the text analytics tools can analyze how different products are perceived in different customer groups. With sentiment analysis a marketing or product development department can understand if the products are retrieved in a positive, negative or just neutral manner. But the analysis could also be combined with other data, such as marketing campaign data, where traditional structured analysis would be combined with the textual analysis.

At the text analytics conference, several exciting solutions where presented, for example an European telecom company that used voice of customer analysis to listen in on the customer ‘buzz’ about their broadband internet services, and would get early warnings when customers where annoyed with the performance of the service, before customers started phoning the customer service. This analysis had become a part of the Quality of Service work at the company.

With the emergence of social media, and where more and more communication is done digitally, the tools and techniques for text analytics has improved and we now start to see very real business cases outside the universities. This is very promising for the adaptation of text analytics within the commercial industries.

Phonetic Algorithm: Bryan, Brian, Briane, Bryne, or … what was his name again?

Let the spelling loose …

What do Callie and Kelly have in common (except for the double ‘l’ in the middle)? What about “no” and “know”, or “Ceasar’s” and “scissors” and what about “message” and “massage”? You definitely got it – Callie and Kelly, “no” and “know”, “Ceasar’s” and “scissors” sound alike, but are spelled quite differently. “message” and “massage” on the other hand differ by only one vowel (“a” vs “e”) but their pronunciation is not at all the same.

It’s a well known fact for many languages that ortography does not determine the pronunciation of words. English is a classic example. George Bernard Shaw was the attributed author of “ghoti” as an alternative spelling of “fish”. And while phonology often reflects the current state of the development of the language, orthography may often lag centuries behind. And while English is notorious for that phenomenon it is not the only one. Swedish, French, Portuguese, among others, all have their ortography/pronunciation discrepancies.

Phonetic Algorithms

So how do we represent things that sound similar but are spelled different? It’s not trivial but for most cases it is not impossible either. Soundex is probably the first algorithm to tackle this problem. It is an example of the so called phonetic algorithms which attempt to solve the problem of giving the same encoding to strings which are pronounced in a similar fashion. Soundex was designed for English only but has its limits. DoubleMetaphone (DM) is one of the possible replacements and relatively successful. Designed by Lawrence Philips in the beginning of 1990s it not only deals with native English names but also takes proper care of foreign names so omnipresent in the language. And what is more – it can output two possible encodings for a given name, hence the “Double” in the naming of the algorithm, – an anglicised and a native (be that Slavic, Germanic, Greek, Spanish, etc.) version.

By relying on DM one can encode all the four names in the title of this post as “PRN”. The name George will get two encodings – JRJ and KRK, the second version reflecting a possible German pronunciation of the name. And a name with Polish origin, like Adamowicz, would also get two encodings – ATMTS and ATMFX, depending on whether you pronounce the “cz” as the English “ch” in “church” or “ts” in “hats”.

The original implementation by Lawrence Philips allowed a string to be encoded only with 4 characters. However, in most subsequent
implementations of the algorithm this option is parameterized or just omitted.

Apache Commons Codec has an implementation of the DM among others (Soundex, Metaphone, RefinedSoundex, ColognePhonetic, Coverphone, to
name just a few.) and here is a tiny example with it:

import org.apache.commons.codec.language.DoubleMetaphone;

public class DM {

public static void main(String[] args) {

String s = "Adamowicz";

DoubleMetaphone dm = new DoubleMetaphone();

// Default encoding length is 4!

// Let's make it 10


System.out.println("Alternative 1: " + dm.doubleMetaphone(s) +

// Remember, DM can output 2 possible encodings:

"nAlternative 2: " + dm.doubleMetaphone(s, true));


The above code will print out:

Alternative 1: ATMTS

Alternative 2: ATMFX

It is also relatively straightforward to do phonetic search with Solr. You just need to ensure that you add the phonetic analysis to a field which contains names in your schema.xml:


While DM does perform quite well, at first sight, it has its limitations. We should know that it still originated from the English language and although it aims to tackle a variety of non-native borrowings most of the rules are English-centric. Suppose you work on any of the Scandinavian languages (Swedish, Danish, Norwegian, Icelandic) and one of the names you want to encode is “Örjan”. However, “Orjan” and “Örjan” get different encodings – ARJN vs RJN. Why is that? One look under the hood (the implementation in will give you the answer:

private static final String VOWELS = "AEIOUY";

So the Scandinavian vowels “ö”, “ä”, “å”, “ø” and “æ” are not present. If we just add these then compile and use the new version of the DM implementation we get the desired output – ARJN for both “Örjan” and “Orjan”.

Finally, if you don’t want to use DM or maybe it is really not suitable for your task, you still may use the same principles and create your own encoder by relying on regular expressions for example. Suppose you have a list of bogus product names which are just (mis)spelling variations of some well known names and you want to search for the original name but get back all ludicrous variants. Here is one albeit very naïve way to do it. Given the following names:








and with a bunch of regular expressions you can easily encode them as “cphldR” and “mclnR”.

String[] ar = new String[]{"CupHoulder", "CappHolder", "KeepHolder",
"MacKleena", "MackCliiner", "MacqQleanAR", "Ma'cKcle'an'ar"};

for (String a : ar) {
a = a.toLowerCase();
a = a.replaceAll("[ae]r?$", "R");
a = a.replaceAll("[aeoiuy']", "");
a = a.replaceAll("pp+", "p");
a = a.replaceAll("q|k", "c");
a = a.replaceAll("cc+", "c");

You can now easily find all the ludicrous spellings of “CupHolder” och “MacCleaner”.

I hope this blogpost gave you some ideas of how you can use phonetic algorithms and their principles in order to better discover names and entities that sound alike but are spelled unlike. At Findwise we have done a number of enhancements to DM in order to make it work better with Swedish, Danish and Norwegian.


You can learn more about Double Metaphone from the following article by the creator of the algorithm:

A German phonetic algorithm is the Kölner Phonetik:ölner_Phonetik

And SfinxBis is a phonetic algorithm based on Soundex and is Swedish specific:

Searching for Zebras: Doing More with Less

There is a very controversial and highly cited 2006 British Medical Journal (BMJ) article called “Googling for a diagnosis – use of Google as a diagnostic aid: internet based study” which concludes that, for difficult medical diagnostic cases, it is often useful to use Google Search as a tool for finding a diagnosis. Difficult medical cases are often represented by rare diseases, which are diseases with a very low prevalence.

The authors use 26 diagnostic cases published in the New England Journal of Medicine (NEJM) in order to compile a short list of symptoms describing each patient case, and use those keywords as queries for Google. The authors, blinded to the correct disease (a rare diseases in 85% of the cases), select the most ‘prominent’ diagnosis that fits each case. In 58% of the cases they succeed in finding the correct diagnosis.

Several other articles also point to Google as a tool often used by clinicians when searching for medical diagnoses.

But is that so convenient, is that enough, or can this process be easily improved? Indeed, two major advantages for Google are the clinicians’ familiarity with it, and its fresh and extensive index. But how would a vertical search engine with focused and curated content compare to Google when given the task of finding the correct diagnosis for a difficult case?

Well, take an open-source search engine such as Indri, index around 30,000 freely available medical articles describing rare or genetic diseases, use an off-the-shelf retrieval model, and there you have Zebra. In medicine, the term “zebra” is a slang for a surprising diagnosis. In comparison with a search on Google, which often returns results that point to unverified content from blogs or content aggregators, the documents from this vertical search engine are crawled from 10 web resources containing only rare and genetic disease articles, and which are mostly maintained by medical professionals or patient organizations.

Evaluating on a set of 56 queries extracted in a similar manner to the one described above, Zebra easily beats Google. Zebra finds the correct diagnosis in top 20 results in 68% of the cases, while Google succeeds in 32% of them. And this is only the performance of the Zebra with the baseline relevance model — imagine how much more could be done (for example, displaying results as a network of diseases, clustering or even ranking by diseases, or automatic extraction and translation of electronic health record data).

Text Analytics in Enterprise Search

A presentation made by Daniel Ling at Apache Lucene Eurocon in Barcelona, october 2011.

We think this is the first of many forthcoming presentations.

We also want to get more involved in the community in the future. By doing presentations, sponsoring, contributing code. Hope to bring more news on this subject in the next few weeks. Enjoy the presentation:

Text Analytics in Enterprise Search, Daniel Ling, Findwise, Eurocon 2011 from Lucene Revolution on Vimeo.