Entity Recognition with Google Search Appliance 7.2

Introduction

In this article we present some of the possibilities offered by the entity recognition option of the Google Search Appliance (GSA). Entity recognition was introduced with the release of version 7.0, and further improvements are planned for future releases. We have used version 7.2 to write this blog post and to illustrate how the GSA can perform named-entity recognition and sentiment analysis.

Entity Recognition in brief

Entity recognition enables the GSA to discover entities (such as names of people, places, organizations, products, and dates) in documents where they are not available as metadata but may be needed to enhance the search experience (e.g. via faceted search/dynamic navigation). There are three ways of defining entities:

  • With a TXT format dictionary of entities, where each entity type is in a separate file.
  • With an XML format dictionary, where entities are defined by synonyms and regular expressions. Currently, the regular expressions only match single words.
  • With composite entities written as an LL1 grammar.
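For example, a TXT dictionary is, in its simplest form, just a list of terms, one per line. A hypothetical firstname dictionary could look like this:

John
Anders
Charlotte
Maria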

Example 1: Identifying people

The basic setup for recognizing person names is to upload a dictionary of first names and a dictionary of surnames. Then you can create a composite entity, full name, with a simple LL1 grammar rule, for example {fullname} ::= [firstname] [surname]. Every first name in your dictionary, followed by a space and a surname, will be recognized as a full name. With the same approach, you can define more complex full names such as:

{fullName} ::= {Title set} {Name set} {Middlenames} {Surname set}
{Title set} ::= [Title] {Title set}
{Title set} ::= [epsilon]
{Name set} ::= [Name] {Name set2}
{Name set2} ::= [Name] {Name set2}
{Name set2} ::= [epsilon]
{Middlenames} ::= [Middlename]
{Middlenames} ::= [epsilon]
{Surname set} ::= [Surname] {Surname set2}
{Surname set2} ::= [Surname] {Surname set2}
{Surname set2} ::= [epsilon]

A full name will be recognized if it matches zero or one titles, one or more first names, zero or one middle names, and one or more surnames, all separated by spaces (e.g. Dr John Anders Lee).
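For intuition, here is a minimal Python sketch (not how the GSA implements matching) that mirrors the rules above with a regular expression over tiny stand-in dictionaries; the middle-name dictionary is omitted for brevity:

import re

# Stand-in dictionaries; on the GSA these would be uploaded as TXT files.
TITLES = ["Dr", "Prof"]
NAMES = ["John", "Anders", "Charlotte"]
SURNAMES = ["Lee", "Stone"]

def alt(words):
    # Build an alternation such as (?:John|Anders|Charlotte)
    return "(?:" + "|".join(map(re.escape, words)) + ")"

# 0 or 1 title, 1 or more first names, 1 or more surnames,
# all separated by single spaces, as in the LL1 rules above.
FULL_NAME = re.compile(
    rf"\b(?:{alt(TITLES)} )?(?:{alt(NAMES)} )+{alt(SURNAMES)}(?: {alt(SURNAMES)})*\b"
)

print(FULL_NAME.findall("Yesterday Dr John Anders Lee met Charlotte Stone."))
# -> ['Dr John Anders Lee', 'Charlotte Stone']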

Limitations

  • All names in the content will be matched, whether relevant or not
  • Common words identical to names will be matched (e.g. Charlotte Stone, where “stone” is also a common noun). To reduce such false positives, you can enable the case-sensitive option and match only full names
  • In the preceding example, Dr John Anders Lee and John Anders Lee will be recognized as different persons
  • There is no support for nested entities within composite entities: John Anders Lee will be matched as a full name, but John will not also be matched as a first name.

[Screenshot: person entities recognized by the GSA]

Example 2: Identifying places

Place names such as cities, countries, and streets can easily be defined with the help of dictionaries in TXT format. One can also define locations using regular expressions, especially when the names share a common substring (e.g. “street” or “square”). For example, a Swedish street name will often contain the substring “gata”, meaning “street”:

<instance>
<name> Street </name>
<pattern>.*gatan</pattern>
<pattern>.*gata</pattern>
<pattern>.*torget</pattern>
<pattern>.*plats</pattern>
<pattern>.*platsen</pattern>
<store_regex_or_name>regex</store_regex_or_name>
</instance>

This will allow us to identify one-word places like “Storgatan” and “Järntorget”, but it will fail where the name consists of two or more words, such as “Olof Palmes plats”, since the regular expressions only match single words. One possible workaround is sketched below.
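A possible workaround, assuming the regex instance above is exposed to the grammar as the entity [Street] (this is our assumption; whether regex-based entities can be referenced from composite entities may depend on your GSA setup), is a small composite entity that allows name words before the suffix:

{Multiword place} ::= {Place name part} [Street]
{Place name part} ::= [Name] {Place name part}
{Place name part} ::= [epsilon]

With “Olof” and “Palmes” in the [Name] dictionary, “Olof Palmes plats” would then match, since the pattern .*plats also matches the bare word “plats”.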

Swedish postal codes can be defined with a regex matching five digits. Note, however, that every five-digit number will then be matched as a postal code, and that you cannot include the space commonly written inside postal codes, due to the GSA’s regular expression limitation of only matching a single word.
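A minimal sketch of such an instance, reusing the XML format above (the exact regex flavour supported by the GSA is an assumption on our part):

<instance>
<name> PostalCode </name>
<pattern>[0-9]{5}</pattern>
<store_regex_or_name>regex</store_regex_or_name>
</instance>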

You can use the synonyms function of the XML dictionary to link postal codes with a city:

<instance>
<name> Göteborg </name>
<term>40330</term>
<term>40510</term>
<term>41190</term>
<term>41302</term>
<store_regex_or_name>name</store_regex_or_name>
</instance>

40330, 40510, 41190 and 41302 will be recognized as the entity Göteborg.

You can also use synonyms to describe territorial divisions (kommun, län, country):

<instances>
   <instance>
     <name> Göteborg Stad</name> 
     <term> Angered </term>
     <term> Backa </term>
     <term> Göteborg </term>
     <term> Torslanda </term>
     <term> Västra Frölunda </term>
   </instance> 
   <instance>
     <name> Öckerö </name>
     <term> Hönö </term>
     <term> Öckerö </term> 
     <term> Rörö </term>
   </instance>
</instances>

[Screenshot: place entities recognized by the GSA]

Example 3: Sentiment analysis

Sentiment analysis aims at identifying the predominant mood (happy/sad, anger/happiness, positive/negative, etc.) of a document by analyzing its content. Here we will show a simple case of identifying positive vs. negative mood in a document.

Basic analysis

For a basic analysis one can create two dictionaries, one with positive words (good, fine, excellent, like, love …) and one with negative words (bad, dislike, don’t, not …). Such an analysis is simplistic and very limited for the following reasons:

  • There is no real grammar
  • Limited coverage of the lexicons
  • No degree of judgment
  • No global analysis of the document (a document containing three words of different polarity will simply be tagged with all three categories)

Analysis with grammar

If you add a dictionary of negations, you can create a more powerful tool with just a small grammar of composite entities. For example, {en negative} ::= [en negation] [en positive word] will correctly identify the English “not good”, “don’t like” and “didn’t succeed” as negative terms. One can certainly build a deeper analysis with a more advanced grammar: you can specify special dictionaries for gender, emphatic words, nouns, verbs, adjectives, etc., and build composite entities and grammar rules with them. Below you see an example of the application of a simple grammar.
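As an illustration of where this could go (the [en emphasis] dictionary is our own hypothetical addition, containing words such as “very” and “really”), a few more rules might look like:

{en negative} ::= [en negation] [en positive word]
{en negative} ::= [en emphasis] [en negative word]
{en positive} ::= [en emphasis] [en positive word]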

Degrees of sentiment

You can also add degrees of sentiment using the synonyms feature.

<instances>
 <instance>
  <name> Good </name>
  <term> good </term>
  <term> fine </term>
  <term> like </term>
 </instance>
 <instance>
  <name> Very Good </name>
  <term> excellent </term>
  <term> amazing </term>
  <term> great </term>
 </instance>
 <instance>
  <name> Bad </name>
  <term> bad </term>
  <term> dislike </term>
  <term> don’t </term>
  <term> can’t </term>
  <term> not </term>
 </instance>
 <instance>
  <name> Very Bad </name>
  <term> awful </term>
  <term> hate </term>
 </instance>
</instances>

Note, however, that you cannot combine such synonym entries with other entity dictionaries or grammar rules.

Limitations

There are some limitations of this approach as well:

  • No possibility to extract a global sentiment for a given document: you cannot count how many terms in a document are matched as good and how many as bad, and then derive an overall sentiment for the document. When the regular expression limitations are fixed, this should become possible; until then, such aggregation has to happen outside the GSA, as in the sketch after this list.
  • As with dictionary-based approaches to sentiment analysis in general, sarcasm and irony are hard to detect.
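As a hedged illustration of such post-processing outside the GSA (the word sets below stand in for the TXT dictionaries above), a minimal Python sketch:

POSITIVE = {"good", "fine", "excellent", "like", "love"}
NEGATIVE = {"bad", "dislike", "don't", "not", "awful", "hate"}

def global_sentiment(text):
    # Count dictionary hits and let the majority decide the document's polarity.
    words = text.lower().split()
    pos = sum(word in POSITIVE for word in words)
    neg = sum(word in NEGATIVE for word in words)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"

print(global_sentiment("The support was excellent and we love the product"))
# -> positive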

Conclusion

In this blog post we showed how one can use the entity recognition feature of GSA 7.2. While the tools provided still have some limitations, they are mature enough to enhance your search solution. Depending on the type of data, one can do simple sentiment analysis as well as more complex recognition of entities using LL1 grammars.

A nice add-on to the entity recognition setup in the GSA would be the possibility to load pre-trained models for named-entity recognition or sentiment analysis.

Links

Entity recognition with GSA:
http://www.google.com/support/enterprise/static/gsa/docs/admin/72/admin_console_help/crawl_entity_recognition.html

Dynamic navigation:
http://www.google.com/support/enterprise/static/gsa/docs/admin/72/admin_console_help/serve_dynamic_navigation.html

Search as a Tool for Information Quality Assurance

Feedback from stakeholders in ongoing projects has highlighted a real need for a supporting tool to assist in the analysis of large amounts of content. Such a tool would introduce a phase where super-users and information owners can go through an information quality assurance process across the information silos before releasing information to end users.

Using standard features of enterprise search platforms, great value can be delivered and time saved in extracting essential information. Furthermore, you can detect key information objects that would otherwise remain hidden by the lack of a holistic view.

In this way, adapted applications can easily be built on top to support process-specific analysis demands, e.g. through entity extraction (automatic detection and extraction of names, places, dates, etc.) and cross-referencing of unstructured and structured sources. The time has come to gain control of your enterprise information and its quality, and to turn it into knowledge.