Entity Recognition with Google Search Appliance 7.2

Introduction

In this article we present some of the possibilities offered by the entity recognition option of the Google Search Appliance (GSA). Entity recognition was introduced with the release of version 7.0, and further improvements are planned for future releases. We have used version 7.2 to write this blog post and to illustrate how the GSA can perform named-entity recognition and sentiment analysis.

Entity Recognition in brief

Entity recognition enables the GSA to discover entities (such as names of people, places, organizations, products, and dates) in documents where they are not available as metadata but may be needed to enhance the search experience (e.g. via faceted search/dynamic navigation). There are three ways of defining entities:

  • With a TXT format dictionary of entities, where each entity type is in a separate file.
  • With an XML format dictionary, where entities are defined by synonyms and regular expressions. Currently, the regular expressions only match single words.
  • With composite entities written as an LL1 grammar.
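For example, a TXT dictionary is, in its simplest form, just a list of terms, one per line. A hypothetical firstname dictionary could look like this:

John
Anders
Charlotte
Maria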

Example 1: Identifying people

The basic setup for recognizing person names is to upload a dictionary of first names and a dictionary of surnames. Then you can create a composite entity, full name, with a simple LL1 grammar rule, for example {fullname} ::= [firstname] [surname]. Every first name in your dictionary, followed by a space and a surname, will be recognized as a full name. With the same approach, you can define more complex full names such as:

{fullName} ::= {Title set} {Name set} {Middlenames} {Surname set}
{Title set} ::= [Title] {Title set}
{Title set} ::= [epsilon]
{Name set} ::= [Name] {Name set2}
{Name set2} ::= [Name] {Name set2}
{Name set2} ::= [epsilon]
{Middlenames} ::= [Middlename]
{Middlenames} ::= [epsilon]
{Surname set} ::= [Surname] {Surname set2}
{Surname set2} ::= [Surname] {Surname set2}
{Surname set2} ::= [epsilon]

A full name will be recognized if it matches zero or one titles, one or more first names, zero or one middle names, and one or more surnames, all separated by spaces (e.g. Dr John Anders Lee).
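For intuition, here is a minimal Python sketch (not how the GSA implements matching) that mirrors the rules above with a regular expression over tiny stand-in dictionaries; the middle-name dictionary is omitted for brevity:

import re

# Stand-in dictionaries; on the GSA these would be uploaded as TXT files.
TITLES = ["Dr", "Prof"]
NAMES = ["John", "Anders", "Charlotte"]
SURNAMES = ["Lee", "Stone"]

def alt(words):
    # Build an alternation such as (?:John|Anders|Charlotte)
    return "(?:" + "|".join(map(re.escape, words)) + ")"

# 0 or 1 title, 1 or more first names, 1 or more surnames,
# all separated by single spaces, as in the LL1 rules above.
FULL_NAME = re.compile(
    rf"\b(?:{alt(TITLES)} )?(?:{alt(NAMES)} )+{alt(SURNAMES)}(?: {alt(SURNAMES)})*\b"
)

print(FULL_NAME.findall("Yesterday Dr John Anders Lee met Charlotte Stone."))
# -> ['Dr John Anders Lee', 'Charlotte Stone']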

Limitations

  • All names in the content will be matched, whether relevant or not
  • Common words identical to names will be matched (e.g. Charlotte Stone, where “stone” is also a common noun). To reduce such false positives, you can enable the case-sensitive option and match only full names
  • In the preceding example, Dr John Anders Lee and John Anders Lee will be recognized as different persons
  • There is no support for nested entities within composite entities: John Anders Lee will be matched as a full name, but John will not also be matched as a first name.

[Screenshot: person entities recognized by the GSA]

Example 2: Identifying places

Place names such as cities, countries, and streets can easily be defined with the help of dictionaries in TXT format. One can also define locations using regular expressions, especially when the names share a common substring (e.g. “street” or “square”). For example, a Swedish street name will often contain the substring “gata”, meaning “street”:

<instance>
<name> Street </name>
<pattern>.*gatan</pattern>
<pattern>.*gata</pattern>
<pattern>.*torget</pattern>
<pattern>.*plats</pattern>
<pattern>.*platsen</pattern>
<store_regex_or_name>regex</store_regex_or_name>
</instance>

This will allow us to identify one-word places like “Storgatan” and “Järntorget”, but it will fail where the name consists of two or more words, such as “Olof Palmes plats”, since the regular expressions only match single words. One possible workaround is sketched below.
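A possible workaround, assuming the regex instance above is exposed to the grammar as the entity [Street] (this is our assumption; whether regex-based entities can be referenced from composite entities may depend on your GSA setup), is a small composite entity that allows name words before the suffix:

{Multiword place} ::= {Place name part} [Street]
{Place name part} ::= [Name] {Place name part}
{Place name part} ::= [epsilon]

With “Olof” and “Palmes” in the [Name] dictionary, “Olof Palmes plats” would then match, since the pattern .*plats also matches the bare word “plats”.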

Swedish postal codes can be defined with a regex matching five digits. Note, however, that every five-digit number will then be matched as a postal code, and that you cannot include the space commonly written inside postal codes, due to the GSA’s regular expression limitation of only matching a single word.
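A minimal sketch of such an instance, reusing the XML format above (the exact regex flavour supported by the GSA is an assumption on our part):

<instance>
<name> PostalCode </name>
<pattern>[0-9]{5}</pattern>
<store_regex_or_name>regex</store_regex_or_name>
</instance>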

You can use the synonyms function of the XML dictionary to link postal codes with a city:

<instance>
<name> Göteborg </name>
<term>40330</term>
<term>40510</term>
<term>41190</term>
<term>41302</term>
<store_regex_or_name>name</store_regex_or_name>
</instance>

40330, 40510, 41190 and 41302 will be recognized as the entity Göteborg.

You can also use synonyms to describe territorial divisions (kommun, län, country):

<instances>
   <instance>
     <name> Göteborg Stad</name> 
     <term> Angered </term>
     <term> Backa </term>
     <term> Göteborg </term>
     <term> Torslanda </term>
     <term> Västra Frölunda </term>
   </instance> 
   <instance>
     <name> Öckerö </name>
     <term> Hönö </term>
     <term> Öckerö </term> 
     <term> Rörö </term>
   </instance>
</instances>

[Screenshot: place entities recognized by the GSA]

Example 3: Sentiment analysis

Sentiment analysis aims at identifying the predominant mood (happy/sad, anger/happiness, positive/negative, etc.) of a document by analyzing its content. Here we will show a simple case of identifying positive vs. negative mood in a document.

Basic analysis

For a basic analysis one can create two dictionaries, one with positive words (good, fine, excellent, like, love …) and one with negative words (bad, dislike, don’t, not …). Such an analysis is simplistic and very limited for the following reasons:

  • There is no real grammar
  • Limited coverage of the lexicons
  • No degree of judgment
  • No global analysis of the document (a document containing three words of different polarity will simply be tagged with all three categories)

Analysis with grammar

If you add a dictionary of negations, you can create a more powerful tool with just a small grammar of composite entities. For example, {en negative} ::= [en negation] [en positive word] will correctly identify the English “not good”, “don’t like” and “didn’t succeed” as negative terms. One can certainly build a deeper analysis with a more advanced grammar: you can specify special dictionaries for gender, emphatic words, nouns, verbs, adjectives, etc., and build composite entities and grammar rules with them. Below you see an example of the application of a simple grammar.
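As an illustration of where this could go (the [en emphasis] dictionary is our own hypothetical addition, containing words such as “very” and “really”), a few more rules might look like:

{en negative} ::= [en negation] [en positive word]
{en negative} ::= [en emphasis] [en negative word]
{en positive} ::= [en emphasis] [en positive word]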

Degrees of sentiment

You can also add degrees of sentiment using the synonyms feature.

<instances>
 <instance>
  <name> Good </name>
  <term> good </term>
  <term> fine </term>
  <term> like </term>
 </instance>
 <instance>
  <name> Very Good </name>
  <term> excellent </term>
  <term> amazing </term>
  <term> great </term>
 </instance>
 <instance>
  <name> Bad </name>
  <term> bad </term>
  <term> dislike </term>
  <term> don’t </term>
  <term> can’t </term>
  <term> not </term>
 </instance>
 <instance>
  <name> Very Bad </name>
  <term> awful </term>
  <term> hate </term>
 </instance>
</instances>

Note, however, that you cannot combine such synonym entries with other entity dictionaries or grammar rules.

Limitations

There are some limitations of this approach as well:

  • No possibility to extract a global sentiment for a given document: you cannot count how many terms in a document are matched as good and how many as bad, and then derive an overall sentiment for the document. When the regular expression limitations are fixed, this should become possible; until then, such aggregation has to happen outside the GSA, as in the sketch after this list.
  • As with dictionary-based approaches to sentiment analysis in general, sarcasm and irony are hard to detect.
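As a hedged illustration of such post-processing outside the GSA (the word sets below stand in for the TXT dictionaries above), a minimal Python sketch:

POSITIVE = {"good", "fine", "excellent", "like", "love"}
NEGATIVE = {"bad", "dislike", "don't", "not", "awful", "hate"}

def global_sentiment(text):
    # Count dictionary hits and let the majority decide the document's polarity.
    words = text.lower().split()
    pos = sum(word in POSITIVE for word in words)
    neg = sum(word in NEGATIVE for word in words)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"

print(global_sentiment("The support was excellent and we love the product"))
# -> positive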

Conclusion

In this blog post we showed how one can use the entity recognition feature of GSA 7.2. While the tools provided still have some limitations, they are mature enough to enhance your search solution. Depending on the type of data, one can do simple sentiment analysis as well as more complex recognition of entities using LL1 grammars.

A nice add-on to the entity recognition setup in the GSA would be the possibility to load pre-trained models for named-entity recognition or sentiment analysis.

Links

Entity recognition with GSA:
http://www.google.com/support/enterprise/static/gsa/docs/admin/72/admin_console_help/crawl_entity_recognition.html

Dynamic navigation:
http://www.google.com/support/enterprise/static/gsa/docs/admin/72/admin_console_help/serve_dynamic_navigation.html

Search as a Tool for Information Quality Assurance

Feedback from stakeholders in ongoing projects has highlighted a real need for a supporting tool to assist in the analysis of large amounts of content. Such a tool would introduce a phase where super-users and information owners can go through an information quality assurance process across the information silos before releasing information to end users.

Using standard features of enterprise search platforms, great value can be delivered and time saved in extracting essential information. Furthermore, you can detect key information objects that would otherwise remain hidden by the lack of a holistic view.

In this way, adapted applications can easily be built on top to support process-specific analysis demands, e.g. through entity extraction (automatic detection and extraction of names, places, dates, etc.) and cross-referencing of unstructured and structured sources. The time has come to gain control of your enterprise information and its quality, and to turn it into knowledge.