Semantic Search Engine – What is the Meaning?

The shortest dictionary definition of semantics is: the study of meaning. A more complex explanation of the term leads to a relationship that maps words, terms and written expressions onto a common understanding of objects and phenomena in the real world. It is worth mentioning that objects, phenomena and the relationships between them are language independent. This means that the same semantic network of concepts can map to multiple languages, which is useful in automatic translation or cross-lingual search.

The approach

In the proposed approach semantics will be modeled as a defined ontology, making it possible for the web to “understand” and satisfy the requests and intents of people and machines using the web content. The ontology is a model that encapsulates knowledge from a specific domain and consists of a hierarchical structure of classes (a taxonomy) that represents concepts of things, phenomena, activities etc. Each concept has a set of attributes that map that particular concept to the words and phrases representing it in written language (as shown at the top of the figure below). Moreover, the proposed ontology model will have horizontal relationships between concepts, e.g. linguistic relationships (synonymy, homonymy etc.) or domain-specific relationships (medicine, law, military, biological, chemical etc.). An ontology model defined this way will be called a Semantic Map and will be used in the proposed search engine. An example fragment of an enriched ontology of beverages is shown in the figure below. The ontology is enriched so that the concepts can be easily identified in text using attributes such as the representation of the concept in written text.
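To make the structure a bit more concrete, here is a minimal sketch of how such an enriched ontology fragment could be represented in code. This is an illustration only; the class and field names (Concept, labels, relations and so on) are assumptions and not part of any existing implementation:

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """A node in the Semantic Map (illustrative structure, not a real API)."""
    concept_id: str
    parent: str | None = None                                      # vertical (taxonomy) link
    labels: list[str] = field(default_factory=list)                # written-text attributes
    relations: dict[str, list[str]] = field(default_factory=dict)  # horizontal links


# A tiny, hypothetical fragment of a beverage ontology
semantic_map = {
    "beverage": Concept("beverage", labels=["beverage", "drink"]),
    "coffee": Concept("coffee", parent="beverage",
                      labels=["coffee", "espresso", "cappuccino"],
                      relations={"contains": ["caffeine"]}),
    "tea": Concept("tea", parent="beverage", labels=["tea", "green tea"]),
}

# Reverse index from surface forms to concepts, used later when tagging text
label_to_concept = {
    label.lower(): concept.concept_id
    for concept in semantic_map.values()
    for label in concept.labels
}
```

The labels play the role of the written-text attributes mentioned above, and the reverse index makes it cheap to go from a word in a document back to its concept.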

Semantic Map

The Semantic Map is an ontology that is used for bidirectional mapping between the textual representation of concepts and the space of their meanings and associations. In this manner, it becomes possible to transform user queries into concepts, ideas and intents that can be matched against an indexed set of similar concepts (and their relationships) derived from documents, which are returned in the form of a result set. Moreover, users will be able to refine and describe their intents using visualized facets of the concept taxonomy, concept attributes and horizontal (domain) relationships. The search module will also be able to discover users’ intents based on the history of queries and other relevant factors, e.g. ontological axioms and restrictions. A potentially interesting approach would be to retrieve additional information about a specific user from publicly available sources such as social portals like Facebook, blog sites etc., as well as from the user’s own bookmarks and similar private resources, enabling deeper intent discovery.
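As a rough illustration of the query side, the sketch below maps query terms onto concepts via the reverse index from the previous sketch and expands them with horizontally related concepts; again, every name here is an assumption made for the example:

```python
def query_to_concepts(query: str) -> set[str]:
    """Map query terms onto Semantic Map concepts (simple exact-label lookup)."""
    concepts = set()
    for term in query.lower().split():
        concept_id = label_to_concept.get(term)
        if concept_id:
            concepts.add(concept_id)
    return concepts


def expand_with_relations(concepts: set[str]) -> set[str]:
    """Add horizontally related concepts so the search can match associations too."""
    expanded = set(concepts)
    for concept_id in concepts:
        for related in semantic_map[concept_id].relations.values():
            expanded.update(related)
    return expanded


print(expand_with_relations(query_to_concepts("strong espresso")))
# -> {'coffee', 'caffeine'} with the example map above (set order may vary)
```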

Semantic Search Map

Semantic Search Engine

The search engine will be composed of the following components:

  • Connector – This module will be responsible for acquiring data from external repositories and passing it on to the search engine. The connector also extracts text and relevant metadata from files and external systems and passes it to the subsequent processing components.
  • Parser – This module will be responsible for text processing, including activities such as tokenization (breaking text into lexemes – words or phrases), lemmatization (normalization of grammatical forms), exclusion of stop words, and paragraph and sentence boundary detection. The result of the parsing stage is structured text with additional annotations, which is passed to the semantic Tagger.
  • Tagger – This module is responsible for adding semantic information to each lexeme extracted from the processed text. Technically, this means attaching to each lexeme the identifiers of the relevant concepts stored in the Semantic Map. Moreover, phrases consisting of several words are identified, and disambiguation is performed based on the derived context. Consider the example illustrated in the figure.
  • Indexer – This module is responsible for transforming all the processed information and storing it in the search index. This module will be enriched with methods of semantic indexing using the ontology (Semantic Map) and language tools.
  • Search index – The central storage of processed documents (document repository), structured to manage the full text of the documents, their metadata and all relevant semantic information (document index). The structure is optimized for search performance and accuracy.
  • Search – This module is responsible for running queries against the search index and retrieving relevant results. The search algorithms will be enriched to use user intents (while complying with data privacy) and the prepared Semantic Map to match semantic information stored in the search index (see the pipeline sketch below).
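As a rough, hypothetical sketch of how these components could fit together (not the actual implementation), the following passes a document through a simplified Parser, Tagger and Indexer, reusing the Semantic Map lookup from the earlier sketch:

```python
import re

STOP_WORDS = {"a", "an", "the", "of", "and", "is"}


def parse(text: str) -> list[str]:
    """Parser stage: tokenize, lowercase, drop stop words (lemmatization omitted)."""
    tokens = re.findall(r"[a-zA-Z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def tag(tokens: list[str]) -> list[tuple[str, str | None]]:
    """Tagger stage: attach a Semantic Map concept id to each lexeme, if any."""
    return [(t, label_to_concept.get(t)) for t in tokens]


def index_document(doc_id: str, text: str, index: dict) -> None:
    """Indexer stage: store both plain terms and concept ids for the document."""
    tagged = tag(parse(text))
    index[doc_id] = {
        "terms": [t for t, _ in tagged],
        "concepts": sorted({c for _, c in tagged if c}),
    }


search_index: dict = {}
index_document("doc1", "Espresso is a strong coffee beverage.", search_index)
print(search_index["doc1"]["concepts"])   # -> ['beverage', 'coffee']
```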

What do you think? Please let us know by writing a comment.

Searching for Zebras: Doing More with Less

There is a very controversial and highly cited 2006 British Medical Journal (BMJ) article called “Googling for a diagnosis – use of Google as a diagnostic aid: internet based study” which concludes that, for difficult medical diagnostic cases, it is often useful to use Google Search as a tool for finding a diagnosis. Difficult medical cases are often represented by rare diseases, which are diseases with a very low prevalence.

The authors use 26 diagnostic cases published in the New England Journal of Medicine (NEJM) to compile a short list of symptoms describing each patient case, and use those keywords as queries for Google. The authors, blinded to the correct disease (a rare disease in 85% of the cases), select the most ‘prominent’ diagnosis that fits each case. In 58% of the cases they succeed in finding the correct diagnosis.

Several other articles also point to Google as a tool often used by clinicians when searching for medical diagnoses.

But is that so convenient, is that enough, or can this process be easily improved? Indeed, two major advantages for Google are the clinicians’ familiarity with it, and its fresh and extensive index. But how would a vertical search engine with focused and curated content compare to Google when given the task of finding the correct diagnosis for a difficult case?

Well, take an open-source search engine such as Indri, index around 30,000 freely available medical articles describing rare or genetic diseases, use an off-the-shelf retrieval model, and there you have Zebra. In medicine, the term “zebra” is slang for a surprising diagnosis. In comparison with a search on Google, which often returns results that point to unverified content from blogs or content aggregators, the documents in this vertical search engine are crawled from 10 web resources containing only rare and genetic disease articles, which are mostly maintained by medical professionals or patient organizations.
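To give an idea of what an off-the-shelf retrieval model does under the hood, here is a minimal BM25-style ranking sketch over a hypothetical mini-collection; this is not the Indri configuration behind Zebra, just an illustration of the kind of scoring involved:

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z]+", text.lower())


def bm25_scores(query: str, docs: dict[str, str],
                k1: float = 1.2, b: float = 0.75) -> dict[str, float]:
    """Score every document against the query with a plain BM25 formula."""
    doc_tokens = {d: tokenize(t) for d, t in docs.items()}
    avg_len = sum(len(t) for t in doc_tokens.values()) / len(doc_tokens)
    n_docs = len(doc_tokens)
    scores = {d: 0.0 for d in docs}
    for term in tokenize(query):
        df = sum(1 for toks in doc_tokens.values() if term in toks)
        if df == 0:
            continue
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
        for d, toks in doc_tokens.items():
            tf = Counter(toks)[term]
            norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(toks) / avg_len))
            scores[d] += idf * norm
    return scores


# Hypothetical mini-collection of disease articles
articles = {
    "a1": "Fabry disease is a rare genetic disorder causing angiokeratomas and pain.",
    "a2": "Seasonal influenza presents with fever, cough and muscle pain.",
}
print(bm25_scores("rare disease with angiokeratomas", articles))
```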

Evaluating on a set of 56 queries extracted in a similar manner to the one described above, Zebra easily beats Google. Zebra finds the correct diagnosis in the top 20 results in 68% of the cases, while Google succeeds in 32% of them. And this is only the performance of Zebra with the baseline relevance model; imagine how much more could be done (for example, displaying results as a network of diseases, clustering or even ranking by diseases, or automatic extraction and translation of electronic health record data).

Enterprise Search Market Overview 2011

A few weeks ago Forrester Research released a report with an enterprise search market overview of the 12 leading vendors on the global market (Attivio, Autonomy, Coveo, Endeca, Exalead, Fabasoft, Google, IBM, ISYS Search, Microsoft, Sinequa and Vivisimo).

When I wrote about the Gartner report, readers commented on the fact that open source solutions were not part of the scope, even though their market share is increasing rapidly. The Forrester report has the same approach, except it includes vendors offering their products stand-alone as well as those with products integrated in portal/ECM solutions.

So why the exclusion of open source? Well, it appears difficult to decide on how to evaluate open source, especially when it comes to more advanced appliances.

Looking at the Forrester report, it includes some familiar conclusions but also a few new insights. Leslie Owen from Forrester concludes that “Google, Autonomy, and Microsoft are the most well-known names; they own a large portion of the existing market”. Hence, these vendors are still standing strong, even though they are challenged in various areas.

More surprisingly, some niche players get higher scores than the giants in core areas such as “Indexing and connectivity”, “Interface flexibility” and “Social and collaborative features”.

Vivisimo is seen as somewhat of a leader (with a slightly lower score on Mobile support and Semantics/text analysis). In the Gartner report, Vivisimo was excluded from the information access evaluation due to the fact that they were ”focusing on specialized application categories, such as customer service”.

Search vendor overview

An interesting reflection from Forrester is that “in the next few years, we expect prices to rise as specialized vendors wax poetic on the transformative power of search in order to distinguish their products from Google and Microsoft FAST Search for SharePoint”. On the Nordic market, we have not seen a shift to such a strategy, but rather the opposite, since open source (with zero license fees) is becoming accepted in an Enterprise environment to a larger extent.

The vendors that provide integrated solutions (with CMS/WCM etc.) still remain strong, whereas the stand-alone solutions become exposed to competition in new ways. It will be interesting to follow the US and Nordic markets to see how this evolves within the next year. It might be that the markets differ when it comes to open source adoption.

If you wish to read the full report it can be downloaded from Vivisimo through a simple registration.

To get a complete overview of vendors, I recommend reading both the Gartner and the Forrester reports.

Delivering Information Where It is Needed: Location Based Information

I recently started working at Findwise after having finished my thesis on location based information delivery in a mobile phone. The purpose of my thesis was to:

  • Investigate how location-based information (as opposed to fixed locations) could be connected to search results
  • Improve the quality of location-based information by considering the course and velocity of the user

To start with, I created an iPhone application with a location-based reminder system. The reminders described location constraints, and users could create reminders tied to single locations (at home) or groups of locations (at any pharmacy). To find these groups of locations, the system searched for locations with associated information (like nearby pharmacies) and delivered this information without users having to click Search repeatedly.
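A minimal sketch of how such a location constraint could be evaluated is shown below. The haversine distance check and all names (Reminder, is_triggered, the example coordinates) are assumptions for illustration, not code from the actual application:

```python
import math
from dataclasses import dataclass


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two coordinates in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 6371.0 * 2 * math.asin(math.sqrt(a))


@dataclass
class Reminder:
    text: str
    locations: list[tuple[float, float]]   # one entry = single place, many = a group
    radius_km: float = 0.2

    def is_triggered(self, lat: float, lon: float) -> bool:
        """True if the user is within the radius of any location in the group."""
        return any(haversine_km(lat, lon, plat, plon) <= self.radius_km
                   for plat, plon in self.locations)


pharmacies = [(57.7089, 11.9746), (57.6990, 11.9870)]   # hypothetical coordinates
reminder = Reminder("Buy plasters", pharmacies)
print(reminder.is_triggered(57.7090, 11.9750))   # -> True, the user is near a pharmacy
```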

This is an unusual approach to search, as the user is passive; instead, the system performs searches on the user's behalf. However, to make the search results relevant one has to add contextual constraints that describe when, where and to whom a piece of information is relevant. When all constraints are met, the information should be relevant. If not, the system lacks some crucial contextual constraint.

When search is automated, the importance of relevant search results increases, and the more you know about the user's world, the better you can adjust the results. However, traditional search can also benefit from contextual information. It can be used as a filter, where search results that are irrelevant in the current context are removed. Alternatively, it could be part of the relevance model, improving search results by reordering them according to context. Hence, whereas automatic information delivery is probably undesirable for many types of information, contextual constraints can still be put to good use!
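As a small, hypothetical illustration of the two options above (context as a filter versus context as part of the relevance model), with invented names and tags:

```python
from dataclasses import dataclass


@dataclass
class Result:
    title: str
    score: float          # base relevance from the search engine
    context_tags: set     # e.g. {"at_work", "weekday"}


def filter_by_context(results: list[Result], context: set) -> list[Result]:
    """Option 1: drop results that do not match the current context at all."""
    return [r for r in results if r.context_tags & context]


def rerank_by_context(results: list[Result], context: set, boost: float = 0.5) -> list[Result]:
    """Option 2: keep everything, but boost results that overlap with the context."""
    return sorted(results,
                  key=lambda r: r.score + boost * len(r.context_tags & context),
                  reverse=True)
```

The filter is the more aggressive choice; re-ranking is safer when the contextual signal might be wrong.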

The people who tested my application created 25% of their reminders as groups of locations and found it useful as it helped them find places they weren’t aware of, facilitating opportunistic behavior. The course and velocity information reduced the number of false-positive information deliveries. Overall, the system worked well as a niche product.

Gartner and the Magic Quadrants – Crowning the Leaders of Enterprise Search

For years Gartner, the research and advisory company, has been publishing their Magic Quadrants – and their verdict on everything from ECM systems to data warehousing and e-commerce plays a big role in many companies' decisions to choose the right tools.
Simply put, the vendors are presented in a matrix measuring the different players by their ability to execute (product, overall viability, customer experience etc.) and the completeness of their vision (offering strategy, innovation etc.). The vendors are then positioned as niche players (a rather crowded spot), visionaries, challengers and leaders.

At the end of last year Gartner decided to retire their old “Information Access Quadrant” (Enterprise Search Quadrant) and introduce “Enterprise Search MarketScope” due to a more mature market. A number of vendors (such as Vivisimo and Recommind) were removed, in order to exclude those whose businesses were not entirely search driven.

The evaluation criteria for the MarketScope cover: offering (product) strategy, innovation, overall viability (business unit, financial, strategy and organization), customer experience, and market understanding and business model.

To summarize: the criteria are to a large extent the same, but the two areas “overall viability” and “customer experience” are weighted higher than the rest. This is most likely a result of the last few years' discussion around user-friendly interfaces and easier administration, and the fact that some customers have suffered quite badly when vendors do not survive (one example in Northern Europe is the Danish vendor that went bankrupt).

The yearly fight between the three leaders, Microsoft, Endeca and Autonomy, has been somewhat disrupted, and Microsoft, Endeca and Google are now seen as the leaders.
Microsoft has a very broad product line, which stretches from low-price offerings with less functionality to Enterprise Search built on the former FAST technology. Endeca follows the same trend; as Gartner puts it, their “products (are) intended to serve organizations seeking to develop general search installations..(..) broadly applicable for a variety of different search challenges”.

In the old quadrant, Google remained a “challenger” for quite some time – but never made it to the “leaders” corner. Ease of administration and “user friendly” are two phrases that keep being repeated. That, in combination with a profit of $7.29 billion during the last quarter of 2010, makes Google a player that can easily continue to develop its enterprise business.

Gartner’s MarketScope for Enterprise Search

Autonomy should still not be disregarded; the main reason for it falling a bit behind the other three seems to be conquerable problems with support and pricing transparency. It will be interesting to see how Autonomy chooses to handle these issues during 2011.

In short: the new MarketScope is good reading with few surprises. If you wish to get a better understanding of the development going on at the different vendors, start with Gartner and continue to search among our blog posts.

Better Search Engines and Information Practices in Digital Workplaces

During this year I have worked on a research project that aims to facilitate the development and implementation of an enterprise search engine. By understanding the use and value of information in digital workplaces, we hope to create even better preconditions for optimizing a search engine to the requirements of a specific organization.

We use a work-task-based research approach where we study information practices – that is, the normalized ways in which we recognize information needs, look for information, and value and use it. By studying such practices in real-life work tasks, we can outline the role that a search engine plays in relation to other work tasks as well as to other ways of finding information. In short, being engaged in a creativity-oriented work task initiates different types of information practices compared to the practices we use in everyday, routine-based work tasks …

The creativity-oriented work tasks involve a dimension of innovation, and concepts such as learning and development are often used to describe these activities. Uncertainty is something that is associated with curiosity and may be seen as a driving force behind information seeking. Information that is rich in nuances and that offers different, even contradictory explanations or descriptions is usually appreciated, and the task outcome is only vaguely discerned at first. Routine-oriented tasks, on the other hand, are focused on increasing effectiveness and reducing uncertainty as quickly as possible in the task outcome, which itself may be sketched out relatively clearly from the beginning. Information seeking is often directed to readily available facts. All this means that a search engine must support a variety of information practices at any given workplace!

The “we” in this project is myself together with my Findwise colleague Henrik Strindberg. The project is financially supported by the Swedish Foundation for Strategic Research, and when I am not working on this project I am employed by the University of Borås.

Just now I am finalizing a presentation of the project for the ICKM conference in Pittsburgh, PA, USA, next week. The presentation is entitled “Interrelated use and value of information sources”, and will be available through the conference proceedings in due time.

Very exciting … and while there I will also attend the meetings of ASIS&T’s Board of Directors as a newly appointed Director-at-Large. Very exciting, too!

The 73rd Annual Meeting of ASIS&T focuses on “Navigation Streams in an Information Ecosystem”.

Why is Search Easy and Hard?

Last year my colleague Lina and I went to the Workshop on Human Computer Interaction and Information Retrieval (HCIR) in Washington DC. This year we did not have the possibility to attend, but since all the material is available online I took part remotely anyway. I wanted to share with you what I found most interesting this year. (Daniel Tunkelang, who was one of the organizers, also posted a good overview of the event on his blog.)

This year's keynote speaker was Dan Russell, a researcher from Google. He talked about search quality and user happiness: why search is easy and hard. The point I found most interesting in his presentation was that improvement is needed not only when it comes to tools and data but also when it comes to users' search skills. My own experience from various search projects is similar; users are not good at searching. Even though they are looking for a specific version of the technical documentation for a specific product, they might just enter the name of the product, or even the product family. (It's a bit like searching for 'camera' when you expect to find support documentation on the dioptric lens for your Canon EOS 60D.) So I agree that users need better search skills. In his presentation Russell also presented some ideas on how a search application can help users improve their search skills.

Search is both easy and hard. Perhaps this is one of the reasons for the introduction of the HCIR Challenge as a new part of the workshop. From the HCIR website:

The aims of the challenge are to encourage researchers and practitioners to build and demonstrate information access systems satisfying at least one of the following:

  • Not only deliver relevant documents, but provide facilities for making meaning with those documents.
  • Increase user responsibility as well as control; that is, the systems require and reward human effort.
  • Offer the flexibility to adapt to user knowledge / sophistication / information need.
  • Are engaging and fun to use.

The winner of the challenge was a team of researchers from Yahoo Labs who presented Searching Through Time in the New York Times. The Time Explorer features a results page with an interactive timeline that illustrates how the volume of articles (results) has changed over time. I recommend that you read the article in tech review to learn more about the project, or try out the Time Explorer demo yourself. You can also learn more about the challenge in this blog post by Gene Golovchinsky.
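As a rough sketch of the kind of aggregation that could sit behind such a timeline (an assumption about the general technique, not the Time Explorer implementation), results can simply be bucketed by publication year:

```python
from collections import Counter
from datetime import date


def timeline_counts(results: list[dict]) -> dict[int, int]:
    """Count search results per publication year for an interactive timeline."""
    return dict(sorted(Counter(r["published"].year for r in results).items()))


results = [
    {"title": "Article A", "published": date(1987, 5, 1)},
    {"title": "Article B", "published": date(1987, 9, 12)},
    {"title": "Article C", "published": date(2003, 2, 3)},
]
print(timeline_counts(results))   # -> {1987: 2, 2003: 1}
```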

All the papers and posters from the workshop can be found on the new website.

Combining Search and Browse – Integrated Faceted Breadcrumbs

Finding information can be tricky, and as I have written in one of my previous posts, improving findability is not about providing a single entry point to information. Users have different ways of finding information (browsing, searching and asking). They often combine these techniques with each other (berrypicking), and so they all need to be supported. Peter Morville states that:

“Browse and Search work best in tandem… the best finding interfaces achieve a balance, letting users move fluidly between browsing and searching.”

A lot of sites are improving their search experience through the implementation of faceted search. However, very few successfully integrate faceted search and browsing on their site. Searching and browsing are treated as two separate flows of interaction instead of being combined, which would provide users with a much better experience.

That is why I was glad to learn about an idea from Greg Nudelman, which he presented in his session at the IA Summit that I attended last week. In his session Greg introduced his idea of the Integrated Faceted Breadcrumb. According to him, breadcrumbs are intuitive, flexible and resourceful; they are design elements that don’t cause problems but simply work. To test his idea he conducted usability tests on a prototype using the Integrated Faceted Breadcrumb. According to his evaluation, the Integrated Faceted Breadcrumb has a lot of advantages over other faceted solutions (a small state-model sketch follows the list):

  1. Combine hierarchical Location & Attribute breadcrumbs
  2. Use Change instead of Set-Remove-Set
  3. Automatically retain relevant query information
  4. Label breadcrumb aspects
  5. Make it clear how to start a new search
  6. Allow direct keyword manipulation.
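To make points 2 and 3 concrete, here is a minimal, hypothetical sketch of breadcrumb state in which changing a facet replaces its value in place rather than removing and re-adding it, while the rest of the query is retained; none of these names come from Greg Nudelman's prototype:

```python
from dataclasses import dataclass, field


@dataclass
class Breadcrumb:
    keywords: str
    facets: dict[str, str] = field(default_factory=dict)   # facet label -> selected value

    def change(self, label: str, value: str) -> None:
        """Change a facet value in place ('Change' instead of 'Set-Remove-Set')."""
        self.facets[label] = value

    def render(self) -> str:
        parts = [f'Search: "{self.keywords}"']
        parts += [f"{label}: {value}" for label, value in self.facets.items()]
        return " > ".join(parts)


crumb = Breadcrumb("digital camera", {"Category": "Electronics", "Brand": "Canon"})
crumb.change("Brand", "Nikon")          # the brand is swapped, everything else is retained
print(crumb.render())
# -> Search: "digital camera" > Category: Electronics > Brand: Nikon
```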

I find this idea interesting and I am currently thinking about whether it could be applied to one of my own projects. (According to Greg it has not been implemented anywhere yet, even though the findings from the usability testing were positive.) However, I wonder whether this is a concept that works well only for sites with relatively homogeneous content, or whether it would also work on larger collections of sites such as intranets. Can it be used in an intuitive way with a large number of facets, and can it cope with more complex filtering functionality? For some sites it might not be the best idea to keep the search settings when the user changes search terms. These are some of the things I would like to find out. What do you think about this? Could you apply it to your site(s)? I recommend that you have a look at Greg Nudelman’s presentation on SlideShare and find out for yourself. You can also find an article about the Integrated Faceted Breadcrumb on Boxes and Arrows. I look forward to a discussion about whether this is any good, so write me a comment here on the Findability blog or find me on Twitter.

Search Driven Portals – Personalizing Search

To stay at the leading edge of search technology, Findwise has a focus on research, both in the form of larger research projects and through different thesis projects. Mohammad Shadab and I just finished our thesis work at Findwise, where we explored an idea for search user interfaces which we call search driven portals. User interfaces are mostly based on analysis of a smaller audience, but the final interface is then put into production targeting a much wider range of users. The solution is in many cases static and cannot easily be changed or adapted. With search driven portals, a portlet-based UI, users or administrators can adapt an interface specially designed to fulfill the needs of different groups. Developers design and develop several searchlets (portlets powered by search technology), where every searchlet provides a specific functionality such as faceted search, a results list, related information etc. Users can then choose to add the searchlets with the functionality that suits them to a preferred location on their page. From an architectural perspective, searchlets are standalone components, independent of each other and easy to reuse.
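A minimal, hypothetical sketch of the searchlet idea follows: a library of independent, search-powered components that users compose into their own page. All names are invented for illustration, not taken from the thesis implementation:

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Searchlet:
    """A standalone, reusable search-powered component (portlet)."""
    name: str
    render: Callable[[str], str]   # takes the current query, returns rendered output


# The searchlet library maintained by developers
library = {
    "results": Searchlet("Result list", lambda q: f"Results for '{q}'"),
    "facets": Searchlet("Faceted search", lambda q: f"Facets narrowing '{q}'"),
    "related": Searchlet("Related information", lambda q: f"Related to '{q}'"),
}


@dataclass
class UserPage:
    """A user's own page, composed of the searchlets they have chosen."""
    chosen: list[str] = field(default_factory=list)

    def add(self, key: str) -> None:
        self.chosen.append(key)

    def render(self, query: str) -> list[str]:
        return [library[k].render(query) for k in self.chosen]


page = UserPage()
page.add("results")
page.add("facets")
print(page.render("enterprise search"))
```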

Such functionality includes faceted search, which serves as a filter to narrow a search. The facets might need to differ based on the role, department or background of the users. Developers can create a set of facets and let the users choose the ones that satisfy their needs. Search driven portals are a great tool for making sure that sites don't get flooded with information as new functionality is developed. If a new need evolves, or if the provider comes up with new ideas, the functionality is put into new searchlets which are deployed to the searchlet library. The administrator can broadcast new functionality to users by putting new searchlets on the master page, which affects every user's own site. However, users can still adjust to the changes by removing the newly provided functionality.

Search driven portals open up new ways of working, from both a developer and a usage perspective. It is one step away from the one-size-fits-all concept that many sites are supposed to fulfill. Providers such as Findwise can build a large component library which can be customized into packages for different customers. With the help of the searchlet library, web administrators can set up designs for different groups, project managers can set up a project-adjusted layout, and employees can adjust their site to their own requirements. With search driven portals, a wider range of user needs can more easily be covered.