SharePoint 2013 entity extraction – part 1

What is your search missing?

The built-in search experience in SharePoint 2013 has improved greatly over previous versions, and companies adopting it enjoy a range of new features (visual refiners, social search and the hover panel with previews, to name a few). However, does your SharePoint 2013 search implementation meet all your business and information needs? Is your search solution reaching its target search KPIs? Are you wondering how you can reduce the workload of your editors, improve the search experience for your users, or cut the time your information workers spend finding relevant content?

Entity Extraction in SharePoint 2013 Search  

Good search needs good metadata, which can then be used for filters, boosted fields and so on. Usually that means documents need to be tagged, which may take a long time if done manually by content owners.

However, it is possible to extract some metadata from document content at index time. In SharePoint 2013 there are two ways of doing this: “Custom entity extraction” and “Custom Content processing”. This post covers the first.

Custom entity extraction

SharePoint 2013 introduced a new way of doing entity extraction: it lets you extract entities from documents based on a dictionary.

The first step is, of course, preparing the dictionary. It needs to be in the following format:

Key,Display form
Findwise,Findwise
FW,Findwise
Sharepoint,Sharepoint
Microsoft,Microsoft
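
To make the role of the two columns concrete, here is a minimal Python sketch of how such a dictionary maps keys found in a text to a display form. It only illustrates the idea and is not how SharePoint implements matching: the function names and the whole-word, case-insensitive matching rule are assumptions made for this example.

```python
import csv
import io
import re

# In-memory copy of the example dictionary (for illustration only).
DICTIONARY_CSV = """Key,Display form
Findwise,Findwise
FW,Findwise
Sharepoint,Sharepoint
Microsoft,Microsoft
"""


def load_dictionary(text):
    """Return a mapping from lower-cased key to display form."""
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the "Key,Display form" header row
    mapping = {}
    for row in reader:
        if len(row) >= 2 and row[0].strip():
            mapping[row[0].strip().lower()] = row[1].strip()
    return mapping


def extract_entities(text, dictionary):
    """Return the sorted display forms of all keys found as whole words."""
    words = re.findall(r"\w+", text.lower())
    return sorted({dictionary[w] for w in words if w in dictionary})


dictionary = load_dictionary(DICTIONARY_CSV)
entities = extract_entities("FW builds search on SharePoint", dictionary)
print(entities)
```

Note how the key "FW" resolves to the display form "Findwise", so different spellings of the same entity end up under one refiner value.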

Then you need to register the dictionary file in SharePoint using PowerShell: https://technet.microsoft.com/library/jj219614.aspx

The last step is to enable entity extraction on the Managed Property it should apply to. To extract entities from the content of a document, edit the “body” Managed Property and select the “Word Extraction – Custom 1” checkbox.

We choose that option because our dictionary was registered with the parameter -DictionaryName “Microsoft.UserDictionaries.EntityExtraction.Custom.Word.1”. If you need more dictionaries, register them with different dictionary name values and select the matching option in the Managed Property settings.

After that, run a full crawl of your content sources.

Finally, add “WordCustomRefiner1” to the refiners on your search results page and start using the new filter.

This approach works really well if you are able to generate a static dictionary. For example, for location extraction you can use a list of all countries or cities found on the internet. You can also extract a dictionary from your customer database or your employee list and then update it on a regular basis.

Usually, however, it is not possible to get a full list of all entities, and they must be extracted using an NLP algorithm. That will be described in the next part.
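
Generating such a dictionary from an existing list could look like the following sketch. The country list, the alias and the function name are hypothetical, and the rule of dropping duplicate keys case-insensitively is an assumption made for the example. The resulting text would still have to be saved to a file and registered as described above.

```python
import csv
import io


def build_dictionary(entries):
    """Render (key, display form) pairs as dictionary CSV text.

    Duplicate keys (compared case-insensitively) are dropped,
    keeping the first occurrence.
    """
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["Key", "Display form"])
    seen = set()
    for key, display in entries:
        normalized = key.strip().lower()
        if normalized and normalized not in seen:
            seen.add(normalized)
            writer.writerow([key.strip(), display.strip()])
    return out.getvalue()


# Hypothetical source data: country names plus one Swedish alias.
countries = [
    ("Sweden", "Sweden"),
    ("Sverige", "Sweden"),
    ("Poland", "Poland"),
    ("sweden", "Sweden"),  # duplicate key, ignored
]
dictionary_text = build_dictionary(countries)
print(dictionary_text)
```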

Understanding politics with Watson using Text Analytics

Understanding which topics are actually important to different political parties is a difficult task. Can text analytics together with a search index give a better understanding?

This blog post describes how IBM Watson Explorer Content Analytics (WCA) can be used to make sense of Swedish politics. All speeches (in Swedish: anföranden) in the Swedish Parliament from 2004 to 2015 are analyzed using WCA. In total, 139 110 transcribed text documents were analyzed. The Swedish language support built by Findwise for WCA is used together with a few text analytics processing steps that parse out person names, political parties, dates and topics of interest. The selected topics in this analysis are all related to infrastructure and different types of fuels.

We start by looking at how some of the topics are mentioned over time.

Analysis of terms of interest in the Swedish parliament between 2004 and 2014.

The view shows topics that have a higher number of mentions in a given year than would be expected. Among other things, we can see that the topic flygplats (airport) was mentioned far more often during 2014.

So let’s dive down and see what is being said about the topic flygplats during 2014.

Swedish political parties mentioning Bromma Airport during 2014.

The image above shows how the different political parties mention the topic flygplats during 2014. The blue bar shows the number of times the topic was mentioned by each political party during the year. The green bar shows the WCA correlation value, which indicates how strongly related a term is to the current filter. We can conclude that the party Moderaterna mentioned flygplats more frequently during 2014 than the other parties.

Reviewing the most correlated nouns when filtering on flygplats and the year 2014 shows, among other nouns, Bromma (a place in Sweden), airport and nedläggning (closing). This gives some idea of what was discussed during the period. Filtering on the speeches held by Moderaterna and reading some of them makes it clear that Moderaterna is against closing Bromma airport.

The text analytics and the index provided by WCA help us discover trending topics over time and give us a tool for understanding who talked about a subject and what was said.

All the different infrastructure topics can be combined into a single topic for infrastructure. Speeches that mention tåg (train), bredband (broadband) or any other defined infrastructure term are also tagged with the topic infrastructure. This wider concept of infrastructure can of course also be viewed over time.
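
The idea of rolling individual terms up into one umbrella topic can be sketched in a few lines of Python. This is not how WCA is configured; the term list is a hypothetical subset, the function name is made up, and the exact-word matching is a simplification (a real pipeline would match on lemmas so that inflected Swedish forms are also caught).

```python
import re

# Hypothetical subset of the defined infrastructure terms.
INFRASTRUCTURE_TERMS = {"tåg", "bredband", "flygplats"}


def tag_topics(speech, terms=INFRASTRUCTURE_TERMS, umbrella="infrastructure"):
    """Return the matched terms, plus the umbrella topic if any term matched."""
    words = set(re.findall(r"\w+", speech.lower()))
    topics = sorted(words & terms)
    if topics:
        topics.append(umbrella)
    return topics


print(tag_topics("Vi behöver bättre bredband och fler tåg"))
print(tag_topics("Skatterna är för höga"))
```

A speech that mentions any single term gets both the specific topic and the umbrella topic, which is what makes it possible to view infrastructure as a whole over time.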

Discussions in the Swedish parliament mentioning the defined terms that make up the subject infrastructure, 2004 to 2015.

Another way of finding which parties are most correlated to a subject is to compare a pair of facets. The following table shows parties highly related to terms regarding infrastructure and types of fuel.

Swedish political parties highly correlated to subjects regarding infrastructure and types of fuel.

Let’s start by explaining the first row in order to understand the table. Mobilnät (mobile network) has only been mentioned 44 times by Centerpartiet, but Centerpartiet is still highly related to the term, with a WCA correlation value of 3.7. This means that Centerpartiet has a higher share of its speeches mentioning mobilnät than other parties do. The table indicates that two parties, Centerpartiet and Miljöpartiet, are more engaged in infrastructure topics than the other political parties.
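
The intuition that a party can be strongly related to a term despite few mentions can be illustrated with a simple share ratio. The exact formula behind the WCA correlation value is not described in this post, so the ratio below is only a stand-in for the idea, and all counts are invented for illustration.

```python
def share_ratio(party_mentions, party_speeches, total_mentions, total_speeches):
    """Ratio between the share of a party's speeches that mention a term
    and the share of all speeches that mention it.

    A value above 1 means the party mentions the term in a larger share
    of its speeches than the parliament as a whole.
    """
    party_share = party_mentions / party_speeches
    overall_share = total_mentions / total_speeches
    return party_share / overall_share


# Invented counts: 200 of 100,000 speeches in total mention the term.
small_party = share_ratio(40, 10_000, 200, 100_000)
large_party = share_ratio(60, 40_000, 200, 100_000)
print(small_party, large_party)
```

In this made-up example the small party has fewer mentions in absolute terms (40 against 60) but a ratio of about 2, while the large party ends up below 1.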

Swedish parties mentioning the defined concept of infrastructure.

Filtering on the concept infrastructure also shows that Miljöpartiet and Centerpartiet are the two parties that have the highest share of speeches mentioning the defined infrastructure topics.

Interested in digging deeper into the data? Parsing written text with text analytics is a successful approach for increasing the understanding of subjects such as politics, and IBM Watson Explorer Content Analytics makes it easy. Most of the functionality used in this example is available out of the box in WCA.

Big Data is a Big Challenge

Big Data is also a Big Challenge for a number of companies that would like to be ahead of the competition. I think Findwise can help a lot, both with technical expertise in text analytics and search technology and with how to put Big Data to use in a business.

During the last days of February I had the pleasure of attending the IDG Big Data conference in Warsaw, Poland. It brought together plenty of people from both vendors and industry who shared interesting insights on the topic. In general, the conference was dominated by big vendors that want to be associated with Big Data. IBM, SAS, SAP and Teradata provided extensive marketing material on their software products and capabilities around Big Data. Interestingly, every single presentation had its own definition of what Big Data is. This is probably because everybody tries to find the definition that best fits their own products.

From my perspective it was very nice to hear that everyone agrees text analytics and search components are of great importance in any Big Data solution. In many analysis applications (both predictive and deductive), and for mass social media in particular, one must use advanced linguistic techniques to retrieve and structure the data streams. This came across especially strongly in the IBM and SAS presentations.

A couple of companies revealed what they have already achieved with so-called Big Data. Orange and T-Mobile presented their approach of extending traditional business intelligence to harness Big Data. They want to go beyond the standard data collected in transaction databases and open up all the information they have from calls (answered and unanswered), SMS, data transmission logs and so on. Telecom companies consider this kind of information to be a good source of data about their clients.

But the most interesting sessions were held by companies that openly shared their experience of how their Big Data solutions, based mainly on open source software, have evolved. For example, Adam Kawa from Spotify showed how they built their platform on a Hadoop cluster, growing from a single server to a few hundred today. To me that seems like a good way to grow and adapt easily to changing business needs and shifting external conditions.

Nasza Klasa, a Polish Facebook competitor, gave a very good presentation on several dimensions of the challenges in Big Data solutions, which can serve as a summary of this post:

  1. Lack of legal regulations – Currently there are no clear regulations on how the data might be used and how to make money out of it. It is especially important for social portals where all our personal information might be used for different kinds of analysis and sold in aggregated or non-aggregated form. But the laws might be changed soon, thus changing the business too.
  2. Big Data is a bit like research – it is hard to predict the return on investment in Big Data, as it is a novelty but also a very powerful tool. For many who are looking into this, the challenge is internal: convincing executives to invest in something that is still rather vague.
  3. Lack of data scientists – even if there are tools for operating on Big Data, there is a huge lack of skilled people, the Big Data operators. These are neither IT people nor developers but rather open-minded people with a good mathematical background, able to understand and find patterns in a constantly growing stream of various structured and unstructured information.

As I stated at the beginning of this post, Big Data is also a Big Challenge for a number of companies that would like to be ahead of the competition. I truly believe we at Findwise can help a lot within this area, since we have both the technical expertise and the experience of how to put Big Data to use in a business.

SLTC 2012 in retrospect – two cutting-edge components

The 4th Swedish Language Technology Conference (SLTC) was held in Lund on 24-26 October 2012.
It is a biennial event organized by prominent research centres in Sweden.
The conference is, therefore, an excellent venue for exchanging ideas with Swedish researchers in the field of Natural Language Processing (NLP), as well as for presenting our own research and keeping up to date with the state of the art in most areas of Text Analytics (TA).

This year Findwise participated in two tracks – in a workshop and in the main conference.
As the area of Search Analytics (SA) is very important to us, we decided to be proactive and applied to organize a workshop on the topic of “Exploratory Query Log Analysis” in connection with the main conference. The application was granted and the workshop was very successful. It gathered researchers who work in the area of SA from very different perspectives, from using deep Machine Learning to discover users’ intent to looking at query logs as a totally new genre. I will do a follow-up on that in another post. All the contributions to the workshop will also be uploaded to our research page.

As for the main conference, we had two papers accepted for presentation. The first one dealt with the topic of document summarization – both single and multidocument summarization
(http://www.slideshare.net/findwise/extractive-document-summarization-an-unsupervised-approach).
The second paper was about detecting Named Entities in Swedish
(http://www.slideshare.net/findwise/identification-of-entities-in-swedish).

These two papers presented de facto state-of-the-art results for Swedish, both for document summarization and for Named Entity Recognition (NER). For the former task there is no standard corpus for evaluating summarization systems, there are not many previous results, and there are only a few other systems, which made it infeasible to compare our own system against others. Thus, we have contributed two things to the research in document summarization: a Swedish corpus based on featured Wikipedia articles to be used for evaluation, and a system based on unsupervised Machine Learning which, by relying on domain boosting, achieves state-of-the-art results for English and Swedish. Our system can be further improved by relying on our enhanced NER and coreference resolution modules.

As for the NER paper, our entity recognition system for Swedish achieves a 74.0% F-score, which is 4% higher than another study presented simultaneously at SLTC (http://www.ling.su.se/english/nlp/tools/stagger). Both systems were evaluated on the same corpus, which is considered a de facto standard for evaluating different NLP resources for Swedish. In the unlabelled setting (i.e. no fine-grained division into classes, just entity vs non-entity), our system achieves a 91.3% F-score (93.1% Precision and 89.6% Recall). When identifying people, the Findwise NER system achieves 78.1% Precision and 90.5% Recall (83.9% F-score).
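
For readers less familiar with the metric, the F-score combines Precision and Recall into a single number. The sketch below assumes the balanced F-score (F1, the harmonic mean of the two), which is consistent with the unlabelled figures reported above; the function name is ours.

```python
def f_score(precision, recall):
    """Balanced F-score (F1): the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Unlabelled result reported above: 93.1% Precision, 89.6% Recall.
unlabelled = f_score(0.931, 0.896)
print(f"{unlabelled:.1%}")  # 91.3%
```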

So, what did we take home from the conference? We were really happy to see that the tools we develop for our customers are not mediocre but of very high quality and state of the art in Swedish NLP. We actively share our results and our corpora for research purposes. Findwise showed keen interest in cooperating with other researchers to develop better tools and systems in the area of NLP and Text Analytics. And this, I think, is a huge bonus for all our current and prospective customers: we actively follow the current trends in the research community and cooperate with researchers, and our products incorporate the latest findings in the field, which lets us deliver both high quality and cutting-edge technology.

As we continuously improve our products, we have also released a Polish NER, and some work has been initiated on Danish and Norwegian ones. More NLP components will soon be available for demo and testing on our research page.

Analyzing the Voice of Customers with Text Analytics

Understanding what your customers think about your company, your products and your service can be done in many different ways. Today companies regularly analyze sales statistics and customer surveys and conduct market analysis. But to get the whole picture of the voice of the customer, we need to consider the information that is not captured in a structured way in databases or questionnaires.

I attended the Text Analytics Summit earlier this year in London and was introduced to several real-life implementations of how text analytics tools and techniques are used to analyze text in different ways. There were applications for text analytics within the pharmaceutical industry, defense and intelligence, as well as other industries, but most common at the conference were the case studies within customer analytics.

For a few years now, social media has boomed as a platform for all kinds of human interaction and communication, and analyzing the unstructured information found on Twitter and Facebook can give corporations deeper insight into how their customers experience their products and services. But there is also plenty of text-based information within an organization that holds valuable insights about its customers, for instance notes taken in customer service centers and emails sent by customers. By combining social media information with the internally available information, a company can get a more detailed understanding of its customers.

In its most basic form, text analytics tools can analyze how different products are perceived in different customer groups. With sentiment analysis, a marketing or product development department can understand whether the products are received in a positive, negative or just neutral manner. But the analysis could also be combined with other data, such as marketing campaign data, where traditional structured analysis would be combined with the textual analysis.

At the text analytics conference, several exciting solutions were presented. One example was a European telecom company that used voice of customer analysis to listen in on the customer ‘buzz’ about its broadband internet services, and would get early warnings when customers were annoyed with the performance of the service, before they started phoning customer service. This analysis had become part of the Quality of Service work at the company.

With the emergence of social media, and with more and more communication being done digitally, the tools and techniques for text analytics have improved and we now start to see very real business cases outside the universities. This is very promising for the adoption of text analytics within commercial industries.