Semantic Search Engine – What is the Meaning?

The shortest dictionary definition of semantics is: the study of meaning. A more complete explanation of the term leads to a relationship that maps words, terms and written expressions onto a shared understanding of objects and phenomena in the real world. It is worth mentioning that objects, phenomena and the relationships between them are language independent. This means that the same semantic network of concepts can map to multiple languages, which is useful in automatic translation or cross-lingual search.

The approach

In the proposed approach, semantics is modeled as a defined ontology, making it possible for the web to “understand” and satisfy the requests and intents of people and machines using the web content. The ontology is a model that encapsulates knowledge from a specific domain and consists of a hierarchical structure of classes (a taxonomy) that represents concepts of things, phenomena, activities etc. Each concept has a set of attributes that map that particular concept to the words and phrases that represent it in written language (as shown at the top of the figure below). Moreover, the proposed ontology model will have horizontal relationships between concepts, e.g. linguistic relationships (synonymy, homonymy etc.) or domain-specific relationships (medical, legal, military, biological, chemical etc.). An ontology model defined this way will be called a Semantic Map and will be used in the proposed search engine. An example fragment of an enriched ontology of beverages is shown in the figure below. The ontology is enriched so that the concepts can be easily identified in text, using attributes such as the representation of the concept in written text.

Semantic Map

The Semantic Map is an ontology that is used for bidirectional mapping of the textual representation of concepts into the space of their meaning and associations. In this manner, it becomes possible to transform user queries into concepts, ideas and intent that can be matched against an indexed set of similar concepts (and their relationships) derived from documents, which are returned in the form of a result set. Moreover, users will be able to refine and describe their intents using visualized facets of the concept taxonomy, concept attributes and horizontal (domain) relationships. The search module will also be able to discover users’ intents based on the history of queries and other relevant factors, e.g. ontological axioms and restrictions. A potentially interesting approach is to retrieve additional information about the specific user profile from publicly available information in social portals like Facebook, blog sites etc., as well as from the user’s own bookmarks and similar private resources, enabling deeper intent discovery.
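To make the structure of the Semantic Map more concrete, here is a minimal sketch of how a single concept could be represented as a data structure: a taxonomy node with textual attributes (labels) and typed horizontal relationships. All names here (Concept, labels, relate etc.) are illustrative assumptions, not part of the actual implementation.

import java.util.*;

// Hypothetical sketch of one Semantic Map concept: a taxonomy node with
// textual attributes (labels) and typed horizontal relationships.
class Concept {
    final String id;                                   // e.g. "beverage.coffee"
    Concept parent;                                    // taxonomy (is-a) link
    final List<Concept> children = new ArrayList<>();
    final Set<String> labels = new HashSet<>();        // written forms: "coffee", "java", ...
    final Map<String, Set<Concept>> relations = new HashMap<>();  // e.g. "synonym" -> {...}

    Concept(String id) { this.id = id; }

    void addLabel(String surfaceForm) { labels.add(surfaceForm.toLowerCase()); }

    void relate(String type, Concept other) {
        relations.computeIfAbsent(type, k -> new HashSet<>()).add(other);
    }
}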

Semantic Search Map

Semantic Search Engine

The search engine will be composed of the following components:

  • Connector – This module will be responsible for acquiring data from external repositories and passing it to the search engine. The purpose of the connector is also to extract text and relevant metadata from files and external systems and pass them on to the further processing components.
  • Parser – This module will be responsible for text processing, including activities like tokenization (breaking text into lexemes – words or phrases), lemmatization (normalization of grammatical forms), exclusion of stop words, and paragraph and sentence boundary detection. The result of the parsing stage is structured text with additional annotations that is passed to the semantic Tagger.
  • Tagger – This module is responsible for adding semantic information to each lexeme extracted from the processed text. Technically, this means attaching the identifiers of relevant concepts stored in the Semantic Map to each lexeme. Moreover, phrases consisting of several words are identified, and disambiguation is performed based on the derived context. Consider the example illustrated in the figure, and the minimal sketch after this list.
  • Indexer – This module is responsible for taking all the processed information, transforming it and storing it in the search index. This module will be enriched with methods of semantic indexing using the ontology (Semantic Map) and language tools.
  • Search index – The central storage of processed documents (document repository), structured to manage the full text of the documents, their metadata and all relevant semantic information (document index). The structure is optimized for search performance and accuracy.
  • Search – This module is responsible for running queries against the search index and retrieving relevant results. The search algorithms will be enriched to use user intents (in compliance with data privacy) and the prepared Semantic Map to match the semantic information stored in the search index.
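As a rough illustration of how the Parser and Tagger stages could fit together, here is a minimal sketch that tokenizes text, drops stop words and annotates each remaining lexeme with matching concept identifiers. The tokenization and the Semantic Map lookup are deliberately simplified, and all names are assumptions rather than the actual implementation.

import java.util.*;

// Simplified Parser + Tagger pass over a piece of text.
class SimpleTaggerSketch {
    static final Set<String> STOP_WORDS = Set.of("a", "an", "the", "of", "and");

    // semanticMap: hypothetical lookup from surface form to concept ids.
    static List<String> tag(String text, Map<String, List<String>> semanticMap) {
        List<String> annotations = new ArrayList<>();
        for (String token : text.toLowerCase().split("\\W+")) {   // naive tokenization
            if (token.isEmpty() || STOP_WORDS.contains(token)) continue;
            List<String> concepts = semanticMap.getOrDefault(token, List.of());
            annotations.add(token + " -> " + concepts);
        }
        return annotations;
    }

    public static void main(String[] args) {
        Map<String, List<String>> map = Map.of("coffee", List.of("beverage.coffee"));
        tag("A cup of coffee", map).forEach(System.out::println);   // cup -> [], coffee -> [beverage.coffee]
    }
}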

What do you think? Please let us know by writing a comment.

How to Index and Search XML Content in Solr

Indexing XML Content

In Solr, there is an XML update request handler which can be used to index XML-formatted update data.

For example,

<add>
<doc>
<field name="employeeId">05991</field>
<field name="office">Bridgewater</field>
<field name="skills">Perl</field>
<field name="skills">Java</field>
</doc>
[<doc> ... </doc>[<doc> ... </doc>]]
</add>

However, when a field itself should contain XML-formatted data, the XML update handler will fail to import it. Because the XML update handler parses the import data with an XML parser, it will try to get the direct child text under the ‘field’ node, which is empty if the field’s direct child is an XML tag.

What we can do is use the JSON update handler instead. For example:

[
  {
    "id" : "MyTestDocument",
    "title" : "<root p="cc">test \ node</root>"
  }
]

There are two things to note:

  1. Both the double quote (") and backslash (\) characters should be escaped (see the sketch after this list)
  2. The XML content should be kept on a single line
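As a quick illustration, a small helper along these lines could perform that escaping; the method name and the use of plain string replacement (rather than a JSON library, which would handle this automatically) are just assumptions for the example.

// Hypothetical helper: escape backslashes and double quotes, and collapse
// line breaks, so XML content can be embedded as a single-line JSON value.
public static String toJsonValue(String xml) {
    return xml.replace("\\", "\\\\")           // escape backslashes first
              .replace("\"", "\\\"")           // then escape double quotes
              .replaceAll("\\s*\\R\\s*", " "); // keep everything on one line
}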

The JSON import data can be loaded into Solr with the following curl command:

curl 'http://localhost:8983/solr/update/json?commit=true' --data-binary @books.json -H 'Content-type:application/json'

Or, by using solrj:

import java.io.File;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
import org.apache.solr.common.util.NamedList;

// Send the JSON file to the /update/json request handler
CommonsHttpSolrServer server = new CommonsHttpSolrServer(serverpath);
server.setMaxRetries(1);
ContentStreamUpdateRequest csureq = new ContentStreamUpdateRequest("/update/json");
csureq.addFile(file);
NamedList<Object> result = server.request(csureq);

// A status of 0 in the response header means the update succeeded
NamedList<Object> responseHeader = (NamedList<Object>) result.get("responseHeader");
Integer status = (Integer) responseHeader.get("status");

Stripping Out XML Tags in the Schema Definition

When querying XML content, we will most likely not be interested in the XML tags. So we need to strip out the XML tags before indexing the XML text. We can do that by applying an HTMLStripCharFilter to the XML content.
            <analyzer type="index">
                ...
                <charFilterSpellE">solr.HTMLStripCharFilterFactory"/>
                <tokenizerSpellE">solr.StandardTokenizerFactory"/>
                <filterSpellE">solr.LowerCaseFilterFactory"/>
                ...
            </analyzer>
            <analyzer type="query">
                ...
                <charFilterSpellE">solr.HTMLStripCharFilterFactory"/>
                <tokenizerSpellE">solr.StandardTokenizerFactory"/>
                <filterSpellE">solr.LowerCaseFilterFactory"/>
                ...
            </analyzer>

Search XML Content

XML content search does not differ much from plain text content search. However, searching for XML attributes requires some special tweaking.

The HTMLStripCharFilter we mentioned earlier will filter out all XML tags, including attributes. In order to index attributes, we need to find a way to make HTMLStripCharFilter keep the attribute text.

For example, if we have the following original XML content:

<sample attr="key_o2_4">find it </sample>

After applying HTMLStripCharFilter, we want to end up with:

key_o2_4    find it

One way to do this is to add an assisting XML processing instruction to the original XML content, such as:

<sample attr="key_o2_4"><?solr key_o2_4?>find it</sample>

And then apply solr.PatternReplaceCharFilterFactory to it, as shown in the following schema fieldType definition.

<analyzer type="index">
...
<charFilter pattern="&lt;?solr ([A-Z0-9_-]*)?&gt; " replacement="       $1  " maxBlockChars="10000000"/>
<charFilter/>
...
</analyzer>

This will replace <?solr key_o2_4?> with 7 leading spaces + key_o2_4 + 2 trailing spaces, in order to keep the original character offsets.

With this technique, we can do a search on the attr attribute value and get a hit.
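For example, a query along these lines should now match the document on the attribute value; the field name "content" and the server URL are assumptions for illustration.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

// Query the (hypothetical) "content" field for the attribute value that the
// PatternReplaceCharFilter turned into plain indexed text.
CommonsHttpSolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
SolrQuery query = new SolrQuery("content:key_o2_4");
QueryResponse response = server.query(query);
System.out.println("Hits: " + response.getResults().getNumFound());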

Do you have questions? Visit our website or contact us for more information.

Evaluate Your Search Application

Search is the worst usability problem on the web according to Peter Morville (in his book Search Patterns). With that in mind, it is good to know that there are best practices and search patterns that you can follow to help ensure that your search will work. Yet, just applying best practices and patterns will not always do the trick. Patterns are examples of good things that often work, but they do not come with a guarantee that your users will understand and use search simply because you used best-practice solutions.

There is no real substitute for testing your designs, whether it’s on websites, intranets or any other type of application. By evaluating your design you will learn what works and what does not work for your users. Search is a bit tricky when it comes to testing, since there is not one single way or flow for users to take to reach their goal. You need to account for multiple courses of action. But that is also the beauty of it: you learn how very different the paths users take can be when searching for the same information. And testing does not have to be expensive, even if it is a bit tricky. There are several ways you can test your designs:

  • Test your ideas using pen and paper
  • Let a small group of users into your development or test environment to evaluate ideas under development
  • Create a computer prototype that is limited to the functionality you are evaluating
  • You can also evaluate the existing site before starting new development to identify what things need improvement
  • Your search logs are another valuable source of information about your users’ behaviors. Have a look at them as a complement.

And the best part of testing your ideas with users is that, as a bonus, you will learn even more about your users that will be valuable to you in the future. Even if you are evaluating the smallest part of your website, you will learn things that affect the experience of the overall site. So what are you waiting for? Start testing your site as well. I promise you will learn a lot from it. If you have any questions about how to best evaluate the search functionality on your site or intranet, write a comment here or drop me an email. In the meanwhile we will soon go on summer holiday. But we’ll be back again in August. Have a nice summer everyone!

Customer Service Powered by Search Technology

I was on the train, on my way to Copenhagen and UX Intensive, a four-day seminar hosted by Adaptive Path. Looking forward to this week, I was also contemplating the past year and the projects we’ve been working on. I recently finished a project at the customer service organization of a large company. The objective was to see if the agents (employees) helping customers could benefit from having a search platform. Would a search engine help the users find the right content to help their customers?

Our point was of course that it would, but it was up to us to prove it. And we did. The usability tests showed results better than I would have dared to hope for.

  • All users found it to be easier to search for information than browsing for it.
  • Searching helped the users not only in finding information faster, but finding information they didn’t know where to find or didn’t even know existed.
  • All users preferred using the search functionality instead of navigation for information.
  • The search functionality helped new employees learn the information they needed to know in order to help the customers, hence they were productive faster. Less time was spent asking colleagues and support for help, since users found the information they needed by searching for it.

These results are all very positive, but the most overwhelming thing for me in this project was the level of engagement from the users. They really enjoyed being a part of the evaluations, bringing feedback to the project team. They felt that they were a part of the process and this made them very positive to the change this project meant.

Change is often a hard thing in development projects. Even if the change is better for the end users of the system, the change in itself can still be problematic, making people hostile to the idea even though it improves their situation. Involving users not only helps in creating a good product, but also in creating a good spirit around the project. I have experienced this in other projects as well. By setting up reference groups for the development process we have not only managed to get good feedback for the project but have also created a buzz about what’s happening. People are volunteering to participate in our reference groups. This buzz spreads and creates a positive feeling about the change the project is bringing. Instead of dreading the change, the users are welcoming it. It’s user research at its best.

So the next time you are asking yourself why you should involve users in your project and not only business stakeholders – think about how not only the end product, but the project and the process as a whole, could benefit from this.

Importance of Interaction Design

Lately I’ve been working on a couple of projects involving big companies, which has given me a lot of new experience and knowledge. One of the things I’ve realized is how important good interaction design is, and that it is not always in place.

The common thing in these projects has been that the customer has already started a new IT project, and when the time comes to implement the search functionality, they contact us. Thus, our involvement starts after the interaction design has been made.

Since the customers are big companies, the interaction design has been made by external consultants who usually have a long-standing relationship with our customer but don’t have great knowledge about search. When the implementation starts, we’ve discovered that the interaction design is not ideal in terms of giving the end users a great search experience. This is due to a lack of knowledge about search technology and what can be done with it. Using my knowledge in the search area, I can propose changes in functionality that will give a better user experience. These changes of course require new interaction design, but since the interaction design consultants have finished their assignment, the interaction design decisions need to be worked out by our company.

In the worst-case scenario this means that the complete interaction design needs to be redone from scratch. This will not be popular with the customer, who needs to pay for the same thing twice. However, since we at Findwise are search experts with lots of experience from past projects and dedicated people working with interaction design, we know how to create a good interaction design for search.

In the end this means that the customer is happy with the end result, but hiring us to also do the interaction design would have resulted in a lower cost for the customer!

How Many Users Can You Afford to Annoy?

The second keynote at the Human Computer Interaction conference in Lancaster was given by Jared Spool, who talked about “Breaking through the invisible walls of usability research”. Jared is a very inspiring and entertaining speaker. If you have the chance to listen to him, take it!

One of the things he talked about was the fact that the usability techniques that are widely used today were in fact not designed for large numbers of users. We have all kinds of data about users’ behavior online, but can we really use that data in a productive way? As Jared said:

“there is a big difference between data and information, we don’t know what inferences to make from the data we have.”

He also gave examples from a couple of large American e-commerce sites that have millions of users every day. With traditional usability measures you can, according to Jakob Nielsen’s report, identify 80% of the usability problems with as few as five users. But if you have one million customers, then you could say that 200,000 of those customers would still be annoyed. Imagine how much lost revenue 200,000 users represent. So how many nines do we need (90%, 99%, 99.999%)? What percentage is enough? It is apparent that we need to find methods that can solve these problems with usability evaluations and testing.

Jared Spool visualizes how few users actually spend money on an ecommerce site, and how few users the company relies on for their revenue.

Jared also talked about the consequences that web 2.0 has had for web applications and communities. He talked about the things that make people want to use “extra functionality”, for example review functionality; the things that delight people. Things that are excitement generators today soon come to be expected in every application. And when, as Jared said, HCI becomes HHHHHCI, when social networks are widely used, things that delight us or aggravate us spread very fast. So instead of thinking about the five-user rule, think about this next time you plan a release of a new product or application: how many users can you afford to annoy?