In the previous blog post we described the built-in approach to entity extraction in SharePoint. We pointed out that it works well as long as you can create a complete dictionary of all the entities you want to extract, which is not always possible. An alternative to dictionary-based entity extraction is the statistical approach, where we train a model to recognize entity names.
SharePoint Content Enrichment
The content enrichment web service callout makes it possible to process crawled content before it is indexed. This processing, which can consist of, for example, cleaning data, computing new values from existing ones, or enriching the content with metadata, is done in addition to the processing done by SharePoint. Note, however, that this solution is limited to SharePoint Server 2013 with Enterprise CALs.
The processing is not limited to SharePoint content: it can benefit any content indexed by SharePoint, such as external websites.
Findwise Entity Extraction
Findwise has implemented a basic content processing enrichment service for customized processing of content indexed by SharePoint. This service can be used to process and enrich documents using other, already tested services developed by Findwise, such as the text analytic components. It also provides a basis on which other custom services can be built.
One of these text analytic services is Entity Extraction. It is based on a statistical approach, so before we can run it we need to train a model. The documents used for training must of course be representative of the domain in terms of form, terminology and writing style. In return, the statistical approach has the potential to improve over time as more training examples are provided.
Make it run
To make it work we need to set up the Findwise Entity Extraction service and provide documents for training. The more documents we provide, the better.
The Findwise Entity Extraction service is simply a web service that takes text and returns an array of the entities it found. Example:
For the given text:
"Bill Wiggins has joined the findwise company in 2016. Since then much has happened, in example George Lucas joined and left, Tom Rubik has decided to move forward."
We get the response:
"Bill Wiggins", "George Lucas", "Tom Rubik"
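The request and response exchange above can be sketched in a few lines of Python. This is only an illustration of the "text in, array of entities out" contract: the post does not specify the wire format, so the JSON request body, the JSON array response, and the names `extract_entities`, `parse_entities` and `send` are all assumptions made for this example, not the actual Findwise API.

```python
import json
from typing import Callable, List


def parse_entities(response_body: str) -> List[str]:
    """Parse a JSON array of entity names (assumed response format).

    Names are trimmed and de-duplicated while keeping their original order.
    """
    seen = set()
    entities = []
    for item in json.loads(response_body):
        name = item.strip()
        if name and name not in seen:
            seen.add(name)
            entities.append(name)
    return entities


def extract_entities(text: str, send: Callable[[str], str]) -> List[str]:
    """Send text to an entity extraction service and return the found entities.

    `send` is a hypothetical transport function: it takes the request body
    and returns the raw response body. In a real setup it would wrap an
    HTTP call to wherever the service is hosted.
    """
    request_body = json.dumps({"text": text})  # assumed request shape
    return parse_entities(send(request_body))
```

Passing the transport in as a function keeps the sketch independent of any particular HTTP library and makes it easy to try out with a canned response.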
Next we create a Web Service Callout project. All it does is take the document body and put the extracted entities into a new “Entity” managed property. No mapping is required this time; just make sure to create the new managed property. Finally, add the Web Service Callout to IIS and register it in SharePoint using PowerShell scripts.
Information on how to create and register a Content Enrichment Web Service Callout can be found at https://msdn.microsoft.com/en-us/library/office/jj163982.aspx
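The logic of the callout itself is small, and a rough Python sketch may help to picture it. The real callout is a .NET web service hosted in IIS, so this does not reproduce the SharePoint contract. The property dictionary, the `body` input key, and the `process_item` and `extract` names are assumptions for illustration only; "Entity" is the managed property mentioned above.

```python
from typing import Callable, Dict, List


def process_item(
    properties: Dict[str, object],
    extract: Callable[[str], List[str]],
) -> Dict[str, List[str]]:
    """Return the properties to write back for one crawled item.

    `properties` stands in for the input properties sent by the crawler
    (assumed here to contain the document text under "body").
    `extract` is any function that maps text to a list of entity names,
    for example a client of the entity extraction service.
    """
    body = properties.get("body", "")
    if not isinstance(body, str) or not body.strip():
        # Nothing to analyse: leave the item unchanged.
        return {}

    entities = extract(body)
    # Only emit the "Entity" property when something was found.
    return {"Entity": entities} if entities else {}
```

Returning an empty result for items without text means such items are indexed as usual, just without any extracted entities.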
After that, just crawl the content source and add a new Refiner to your search page. Below you can find the result of a run on Wikipedia texts. Note that the extracted names don’t come from any dictionary but are returned by the Findwise Entity Extraction text analytic component: