Entity Recognition with Google Search Appliance 7.2


In this article we would like to present some of the possibilities offered by the entity recognition option of Google Search Appliance (GSA). Entity recognition was introduced with the release of version 7.0 and improvements will still be added in future releases. We have used version 7.2 to write this blogpost and illustrate how GSA can perform named-entity recognition and sentiment analysis.

Entity Recognition in brief

Entity recognition enables the GSA to discover entities (such as names of people, places, organizations, products, dates, etc.) in documents where these are not available in the Metadata or in general, may be needed in order to enhance the search experience (e.g. via faceted search/dynamic navigation). There are three ways of defining entities:

  • With a TXT format dictionary of entities, where each entity type is in a separate file.
  • With an XML format dictionary, where entities are defined by synonyms and regular expressions. Currently, the regular expressions only match single words.
  • With composite entities written as an LL1 grammar.

Example 1: Identifying people

The basic setup for recognition of person names is to upload a dictionary of first names and a dictionary of surnames. Then, you can create a composite entity full name by using a simple LL1 grammar rule, for example {fullname}::=[firstname] [surname]. Every first name in your dictionary, followed by a space and then followed by a surname will be recognized as a full name. With the same approach, you can define more complex full names such as:

{fullName}::= {Title set}{Name set}{Middlenames}{Surname set}
{Title set}::=[Title] {Title set}
{Title set} ::= [epsilon]
{Name set} ::= [Name] {Name set2}
{Name set2} ::= [Name] {Name set2}
{Name set2} ::= [epsilon]
{Middlenames} ::= [Middlename]
{Middlenames} ::= [epsilon]
{Surname set} ::= [Surname] {Surname set2}
{Surname set2} ::= [Surname] {Surname set2}
{Surname set2} ::= [epsilon]

A full name will be recognized if it matches 0 or 1 instances of a title, one or more first names, 0 or 1 middle names and one or more surnames, all separated with a space. (e.g.: Dr John Anders Lee).


  • All the names in the content will be matched
  • Common words similar to names will be matched. Example: Charlotte Stone. To reduce this limitation, you can enable the case sensitive option and match a full name
  • In the preceding example, Dr John Anders Lee and John Anders Lee will be recognized as a different person
  • No support for multiple entities within composite entities. John Anders Lee will be matched as a full name, but John will not be matched as a name.


Example 2: Identifying places

Place names such as cities, countries, streets can be easily defined with the help of dictionaries in TXT format. One can also define locations by using regular expressions, especially if these share the same substring (e.g. “street” or “square”). For example, a Swedish street will often contain the substring “gata”, meaning “street”:

<name> Street </name>

This will allow us to identify one-word places like “Storgatan“, “Järntorget” but will fail in cases where we have 2 or more words in the name such as “Olof Palmes plats”.

Swedish postal codes can be defined with a regex matching 5 digits. Note, however, that all numbers of 5 digits will be matched as a postal code and that you cannot define space in the postal code due to the regular expression limitation of the GSA only matching a single word.

You can use the synonyms function of the xml dictionary to link a postal code with a city.

<name> Göteborg </name>

40330, 40510, 41190 and 41302 will be recognized as the entity Göteborg.

You can also use the synonyms to describe a territory division (kommun, län, country).

     <name> Göteborg Stad</name> 
     <term> Angered </term>
     <term> Backa </term>
     <term> Göteborg </term>
     <term> Torslanda </term>
     <term> Västra Frölunda </term>
     <name> Öckerö </name>
     <term> Hönö </term>
     <term> Öckerö </term> 
     <term> Rörö </term>



Example 3: Sentiment analysis

Sentiment analysis aims at identifying the predominant mood (happy/sad, anger/happiness, positive/negative, etc) of a document by analyzing its content. Here we will show you a simple case of identifying positive vs negative mood in a document.

Basic analysis

For a basic analysis one can create two dictionaries, one with positive words (good, fine, excellent, like, love …) and one with negative words (bad, dislike, don’t, not …). Such an analysis is simplistic and very limited for the following reasons:

• There is no real grammar
• Limited coverage of the lexicons
• No degree of judgment
• No global analysis of the document (if a document has 3 different polarity words it will be tagged with 3 different categories)

Screen Shot 2014-04-08 at 11.19.31

Analysis with grammar

If you add a dictionary of negations, you can create a more powerful tool with just a small grammar of compose entities. For example, {en negative} ::= [en negation] [en positive word] will correctly identify the English “not good”, “don’t like”, “didn’t succeed”  as negative terms. One can certainly create deeper analysis with more advanced grammar. Thus you can  specify special dictionaries for gender, emphatic words, nouns, verbs, adjectives,etc and build composite entities, and grammar rules with them. Below you see an example of the application of a simple grammar.

Screen Shot 2014-04-08 at 11.36.46

Degrees of sentiment

You can also add some degrees in the sentiments using the synonyms feature.

  <name> Good </name>
  <term> good </term>
  <term> fine </term>
  <term> like </term>
  <name> Very Good </name>
  <term> excellent </term>
  <term> amazing </term>
  <term> great </term>
  <name> Bad </name>
  <term> bad </term>
  <term> dislike </term>
  <term> don’t </term>
  <term> can’t </term>
  <term> not </term>
  <name> Very Bad </name>
  <term> awful </term>
  <term> hate </term>

Note, however that you cannot combine such synonym entries with other entity dictionaries or grammar rules.

Screen Shot 2014-04-08 at 12.00.16


There are some limitations of this approach as well:

  • No possibility to extract global sentiment for a given document. You cannot count in a document how many terms are matched as good and how many are matched as bad and then define the global sentiment for this document. However, when the regular expression limitations are fixed, one will be able to do so.
  • As with sentiment analysis in general and other dictionary-based approaches it is hard to discover sarcasm and irony.


In this blog post we showed how one can use the Entity recognition feature of GSA 7.2. While there are still some limitations of the tools provided, they are mature enough to enhance your search solution. Depending on the type of data, one can do simple sentiment analysis as well as more complex recognition of entities by using LL1 grammar.

A nice add-on to the Entity recognition setup in the GSA would be the possibility to load pre-trained models for Named Entity Recognition or sentiment analysis.


Entity recognition with GSA:

Dynamic navigation:

Swedish language support (natural language processing) for IBM Content Analytics (ICA)

Findwise has now extended the NLP (natural language processing) in ICA to include both support for Swedish PoS tagging and Swedish sentiment analysis.

IBM Content Analytics with Enterprise Search (ICA) has its strength in natural language processing (NLP) which is achieved in the UIMA pipeline. From a Swedish perspective, one concern with ICA has always been its lack of NLP for Swedish. Previously the Swedish support in ICA consisted only of dictionary-based lemmatization (word: “sprang” -> lemma: “springa”). However, for a number of other languages ICA has also provided part of speech (PoS) tagging and sentiment analysis. One of the benefits of the PoS tagger is its ability to disambiguate words, which belong to multiple classes (e.g. “run” can be both a noun and a verb) as well as assign tags to words, which are not found in the dictionary. Furthermore, the POS tagger is crucial when it comes to improving entity extraction, which is important when a deeper understanding of the indexed text is needed.

Findwise has now extended the NLP in ICA to include both support for Swedish PoS tagging and Swedish sentiment analysis. The two images below shows simple examples of the PoS support.

Example when ICA uses NLP to analyse the string "ICA är en produkt som klarar entitetsextrahering"Example when ICA uses NLP to analyse the string "Watson deltog i jeopardy"

The question is how this extended functionality could be used?

IBM uses ICA and its NLP support together with several of their products. The jeopardy playing computer Watson may be the most famous example, even if it is not a real product. Watson used NLP in its UIMA pipeline when it analyzed its data from sources such as Wikipedia and Imdb.

One product which leverage from ICA and its NLP capabilities is Content and Predictive Analytics for Healthcare. This product helps doctors to determine which action to take for a patient given the patient’s journal and the symptoms. By also leveraging the predictive analytics from SPSS it is possible to suggest the next action for the patient.

ICA can also be connected directly to IBM Cognos or SPSS where ICA is the tool which creates structure to unstructured data. By using the NLP or sentiment analytics in ICA, structured data can be extracted from text documents. This data can then be fed to IBM Cognos, SPSS or non IBM products such as Splunk.

ICA can also be used on its own as a text miner or a search platform, but in many cases ICA delivers its maximum value together with other products. ICA is a product which helps enriching data by creating structure to unstructured data. The processed data can then be used by other products which normally work with structured data.