SharePoint 2013 entity extraction – part 2

In previous blog post we have described the built-in way for entity extraction in Sharepoint. It was pointed out that it’s good as long as you are able to create a full dictionary of all entities you want to extract. It’s not always possible. An alternative to the dictionary-based entity extraction is the statistical approach where we train a model for the purpose of recognizing entity names.

Sharepoint Content Enrichment

The content enrichment web service callout offers the possibility of processing the crawled content before it is indexed. This processing, which can consist of for example cleaning data, computing new values based on existing ones, or enriching the content with metadata, is done in addition to the processing done by SharePoint. Note however that this solution is limited to SharePoint Server 2013 with Enterprise CALs.

The processing can be applied not only to the SharePoint content, but it can benefit any content indexed by SharePoint, such as external websites.

Findwise Entity Extraction

Findwise has implemented a basic content processing enrichment service for customized processing of content indexed by SharePoint. This service can thus be used for processing and enriching documents using other already tested services developed by Findwise, such as the text analytic components. Moreover, it contains the basis on which other custom services can be built upon.

One of these text analytic services is Entity Extraction. It is based on statistical approach so before we can run it we need to train our model. The documents used for training must of course be representative of the domain in terms of form, terminology and writing style. However, this statistical approach has the potential of improving over time through training, as more examples are provided.

Make it run

To make it works we need to setup Findwise Entity Extraction service and provide documents for training. The more we get the better.

Findwise Entity Extraction Service is just Web Service that get text and return an array of found entities. Example:

For given text:

"Bill Wiggins has joined the findwise company in 2016. 
Since then much has happened, in example George Lucas joined and left, 
Tom Rubik has decided to move forward."

We got in response:

"Bill Wiggins",
"George Lucas",
"Tom Rubik "

Next we create Web Service Callout project. The only thing it does is getting document body and put the extracted entities in new “Entity” managed property. No mapping required this time – just make sure to create new managed property. Finally add the Web Service Callout to IIS and register it in SharePoint using the PowerShell scripts.

Information on how to create and register Content Enrichment Web Service Callout you can find at https://msdn.microsoft.com/en-us/library/office/jj163982.aspx

After than just run the content source and add new Refiner to your search page. Below you can find a result of run on Wikipedia texts. Note that extracted names doesn’t come from any dictionary but are returned by Findwise Entity Extraction text analytic component:sharepoint-results

SharePoint 2013 entity extraction – part 1

What is your search missing?

The built-in search experience in SharePoint 2013 has greatly improved from previous versions, and companies adopting it enjoy a bag of new features (such as the visual refiners, the social search, the hover-panel with previews, to name a few). However, is your implementation of the search in SharePoint 2013 matching all your business and information needs?! Is your search solution reaching the target search KPIs? Are you wondering how you can cut down on the task of the editor, improve the search experience for your users, or reduce the time spent by your information workers finding the relevant content?

Entity Extraction in SharePoint 2013 Search  

To make your search good you need good metadata. They can be then used as a filters, boosted fields etc. Usually that means that documents need to be tagged, which may take a long time if done manually by content owners.

However, it is possible to extract some metadata from document content during index time. In SharePoint 2013 there are two ways of doing it: “Custom entity extraction” or with use of “Custom Content processing”. In this post you can learn the first way.

Custom entity extraction

SharePoint 2013 introduced a new way for entity extraction. It allows to extract entities from document based on dictionary.

The first step is, of course, preparing the dictionary. It needs to be in following format:

Key,Display form
Findwise, Findwise
FW, Findwise
Sharepoint, Sharepoint
Microsoft,Microsoft

Then you need to register that dictionary file in SharePoint – using Powershell scripts: https://technet.microsoft.com/library/jj219614.aspx

Last thing left to do is enabling entity extraction on the Managed Property it should be applied to. To get them from content of a document just edit “body” Managed Property

scr1

and select “Word Extraction – Custom 1” checkbox:

scr2

We choose that one because our dictionary was registered with – DictionaryName “Microsoft.UserDictionaries.EntityExtraction.Custom.Word.1” parameter, if you need more dictionaries then you register them using different dictionary name values and selecting the right option in Managed Property settings.

After that run “Full Crawl” for your sources.

Finally, just add the “WordCustomRefiner1” to your refiners on search result list and start using new filter:scr3

This way is really good if you are able to generate a static dictionary. Eg. you can use a list of all countries or cities you can find on the internet for location extraction. You can also extract at dictionary from your customers database or your employees list and then update it on regular basis.

However usually it’s not possible to get a full list of all entities, and they must be extracted using one of NLP algorithms, that will be described in next part.