The shortest dictionary definition of semantics is: the study of meaning. A fuller explanation would describe a relationship that maps words, terms and written expressions onto a shared understanding of objects and phenomena in the real world. It is worth mentioning that objects, phenomena and the relationships between them are language-independent. This means that the same semantic network of concepts can map to multiple languages, which is useful in automatic translation and cross-lingual search.
The approach
In the proposed approach, semantics will be modeled as a defined ontology, making it possible for the web to “understand” and satisfy the requests and intents of the people and machines that use web content. The ontology is a model that encapsulates knowledge from a specific domain and consists of a hierarchical structure of classes (a taxonomy) that represent concepts of things, phenomena, activities etc. Each concept has a set of attributes that map that particular concept to the words and phrases that represent it in written language (as shown at the top of the figure below). Moreover, the proposed ontology model will have horizontal relationships between concepts, e.g. linguistic relationships (synonymy, homonymy etc.) or domain-specific relationships (medicine, law, military, biological, chemical etc.). An ontology model defined this way will be called a Semantic Map and will be used in the proposed search engine. An example fragment of an enriched ontology of beverages is shown in the figure below. The ontology is enriched so that concepts can be easily identified in text using attributes such as the representation of the concept in written text.
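The structure described above can be sketched as a small data model. This is only an illustration under assumptions: the class name, the field names and the beverage concepts below are invented for the example and are not taken from the figure or from any real ontology.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Concept:
    """One node of a Semantic Map (all names here are hypothetical)."""
    concept_id: str
    parent: Optional[str] = None  # vertical (taxonomy) link to a broader concept
    # Enrichment attributes: language code -> written forms of the concept.
    labels: Dict[str, List[str]] = field(default_factory=dict)
    # Horizontal relationships: relation name -> related concept ids.
    related: Dict[str, List[str]] = field(default_factory=dict)


# Made-up fragment of a beverage ontology, for illustration only.
SEMANTIC_MAP = {
    c.concept_id: c
    for c in [
        Concept("beverage", labels={"en": ["beverage", "drink"], "de": ["Getränk"]}),
        Concept("hot_beverage", parent="beverage", labels={"en": ["hot drink"]}),
        Concept("coffee", parent="hot_beverage",
                labels={"en": ["coffee"], "de": ["Kaffee"]},
                related={"contains": ["caffeine"]}),
        Concept("caffeine", labels={"en": ["caffeine"], "de": ["Koffein"]}),
    ]
}


def ancestors(concept_id, semantic_map=SEMANTIC_MAP):
    """Walk the taxonomy upwards; return broader concepts, nearest first."""
    result = []
    parent = semantic_map[concept_id].parent
    while parent is not None:
        result.append(parent)
        parent = semantic_map[parent].parent
    return result


def labels_in(concept_id, language, semantic_map=SEMANTIC_MAP):
    """Return the written forms of a concept in one language (may be empty)."""
    return semantic_map[concept_id].labels.get(language, [])
```

Because the labels hang off the concept rather than the other way round, adding a language only means adding label entries; the taxonomy and the horizontal relationships stay untouched.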
Semantic Map
The Semantic Map is an ontology used for bidirectional mapping between the textual representation of concepts and the space of their meanings and associations. In this way it becomes possible to transform user queries into concepts, ideas and intent that can be matched against an indexed set of similar concepts (and their relationships) derived from documents, which are then returned as a result set. Moreover, users will be able to refine and describe their intents using visualized facets of the concept taxonomy, concept attributes and horizontal (domain) relationships. The search module will also be able to discover users’ intents based on the history of queries and other relevant factors, e.g. ontological axioms and restrictions. A potentially interesting approach would be to retrieve additional information about the specific user profile from publicly available sources such as social portals like Facebook, blog sites etc., as well as from the user’s own bookmarks and similar private resources, enabling deeper intent discovery.
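A minimal sketch of the bidirectional mapping, assuming a tiny hand-written label table: text is first mapped to concept identifiers, and concepts are then mapped back to written forms, here in another language. The table contents and function names are hypothetical.

```python
# Hypothetical label table: concept id -> language -> written forms.
LABELS = {
    "coffee": {"en": ["coffee", "java"], "de": ["Kaffee"]},
    "tea": {"en": ["tea"], "de": ["Tee"]},
}


def build_reverse_index(labels):
    """Text -> concept direction: lower-cased written form -> set of concept ids."""
    reverse = {}
    for concept_id, by_language in labels.items():
        for forms in by_language.values():
            for form in forms:
                reverse.setdefault(form.lower(), set()).add(concept_id)
    return reverse


REVERSE = build_reverse_index(LABELS)


def concepts_for(term):
    """Return the concepts a written term may refer to (empty set if unknown)."""
    return REVERSE.get(term.lower(), set())


def expand_query(term, target_language):
    """Map a term to concepts, then back to written forms in the target language."""
    forms = []
    for concept_id in sorted(concepts_for(term)):
        forms.extend(LABELS[concept_id].get(target_language, []))
    return forms
```

For example, `expand_query("java", "de")` goes through the concept `coffee` and comes back with the German label, which is the kind of cross-lingual lookup a language-independent concept layer makes possible. A real system would also need to disambiguate terms that map to several concepts.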
Semantic Search Engine
The search engine will be composed of the following components:
- Connector – This module will be responsible for acquiring data from external repositories and passing it to the search engine. The connector also extracts text and relevant metadata from files and external systems and passes them on to the subsequent processing components.
- Parser – This module will be responsible for text processing, including tokenization (breaking text into lexemes, i.e. words or phrases), lemmatization (normalization of grammatical forms), exclusion of stop-words, and paragraph and sentence boundary detection. The result of the parsing stage is structured text with additional annotations, which is passed to the semantic Tagger.
- Tagger – This module is responsible for adding semantic information to each lexeme extracted from the processed text. Technically, this means attaching to each lexeme the identifiers of the relevant concepts stored in the Semantic Map. Moreover, phrases consisting of several words are identified, and disambiguation is performed based on the derived context. Consider the example illustrated in the figure.
- Indexer – This module is responsible for taking all the processed information, transforming it and storing it in the search index. This module will be enriched with methods of semantic indexing that use the ontology (Semantic Map) and language tools.
- Search index – The central storage of processed documents (document repository) structured properly to manage full text of the documents, their metadata and all relevant semantic information (document index). The structure is optimized for search performance and accuracy.
- Search – This module is responsible for running queries against the search index and retrieving relevant results. The search algorithms will be enriched to use user intents (while respecting data privacy) and the prepared Semantic Map to match the semantic information stored in the search index.
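The Parser, Tagger, Indexer and Search stages listed above can be sketched end to end as follows. This is a toy illustration, not the proposed implementation: the stop-word list, the lemma table and the phrase-to-concept table are hypothetical stand-ins for real language resources and for the Semantic Map, and greedy longest-match lookup is used here in place of true context-based disambiguation.

```python
import re
from collections import defaultdict

# All tables below are hypothetical stand-ins for real language resources.
STOP_WORDS = {"a", "an", "the", "of", "is", "and", "for"}
LEMMAS = {"cancers": "cancer", "lungs": "lung", "drugs": "drug"}
# Written form (tuple of lemmas) -> concept id in the Semantic Map.
PHRASES = {
    ("lung", "cancer"): "disease/lung_cancer",
    ("lung",): "anatomy/lung",
    ("cancer",): "disease/cancer",
    ("drug",): "substance/drug",
}
MAX_PHRASE_LEN = max(len(p) for p in PHRASES)


def parse(text):
    """Parser: tokenize, lower-case, drop stop-words, lemmatize by table lookup."""
    tokens = re.findall(r"\w+", text.lower())
    return [LEMMAS.get(t, t) for t in tokens if t not in STOP_WORDS]


def tag(lemmas):
    """Tagger: greedy longest-match lookup, so multi-word phrases win."""
    concepts, i = [], 0
    while i < len(lemmas):
        for n in range(min(MAX_PHRASE_LEN, len(lemmas) - i), 0, -1):
            concept = PHRASES.get(tuple(lemmas[i:i + n]))
            if concept:
                concepts.append(concept)
                i += n
                break
        else:
            i += 1  # no concept known for this lemma; skip it
    return concepts


class SearchIndex:
    """Indexer and search index: concept id -> ids of documents mentioning it."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        """Indexer: run the document through parser and tagger, store postings."""
        for concept in tag(parse(text)):
            self.postings[concept].add(doc_id)

    def search(self, query):
        """Search: tag the query the same way; return docs matching every concept."""
        concepts = tag(parse(query))
        if not concepts:
            return set()
        hits = [self.postings.get(c, set()) for c in concepts]
        return set.intersection(*hits)
```

In this sketch a document containing "lung cancers" is indexed under the single concept `disease/lung_cancer` rather than under `anatomy/lung` and `disease/cancer` separately, so a query for "lungs" does not return it. All tagging work happens when a document is added, and the query side only tags a short string and intersects posting sets.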
What do you think? Please let us know by writing a comment.
Two observations from someone who has designed and built a semantic search engine (at Fablo):
1. What you described is based on strict taxonomies and ontologies. While those exist in many domains, in general you can’t assume that level of data quality. This is why we built a semantic search engine based on latent semantic indexing: that way you can throw any data at it and have the engine figure out the rest.
2. What we learned is that in the e-commerce search market this is not a sellable proposition. The market is not ready for semantic search. What people expect is a straightforward and understandable search. There is also a browsability expectation: search is supposed to bring up *all* possible candidates and let users browse through them. In most semantic search solutions you end up with excellent top results, but poor browsability.
Thanks Jan for your comment. Nice to see that somebody has even thought about it.
Anyway, what you put in is what you get out. The quality of the ontology is a separate topic in itself, but I believe there is a way to achieve it, and in most cases an ontology should be developed. I have also found that there are ontology developers who specialize in this.
I agree that e-commerce is not the best place to experiment with semantic search, since margins and conversion rates matter so much there. But there are other quite interesting domains, like Knowledge Management, Intelligence, Web Search, Portal Search and maybe more?
Nice blog post. How would you see Google’s move towards semantic (web) search 1) in relation to what you outline here?
1) http://semanticweb.com/google-plans-to-incorporate-semantic-search_b27477
I am eager to see how Google will handle this. Nevertheless, you can already see some other examples of semantic web search. If you try Bing.com (US version), Ask.com or Hakia.com, you can see how they implement it. For example, if you enter “lung cancer” in Bing, you will see not only the relevant results but also meaningful suggestions to narrow or broaden your search, look for related medications (with proper names!) or look up related conditions. So Bing somehow understood that the two-word query “lung cancer” refers to a specific disease and proposed several relevant actions to help you discover more information about it.
My point is that it is not so hard to implement this for your own company or enterprise!
My post is about how we could build it using search technologies such as Solr.
The performance of a semantic search engine will depend on the ontology that lies beneath it. Once the ontology is developed, what techniques can be applied over it to build a semantic search engine?
Thanks for your comment. Once the ontology is developed, it can be stored in a standard format such as RDF or OWL. The tagger stage can then be written to consume the ontology directly. The quality of the tagger stage depends on the quality of the custom code. At Findwise we usually write custom stages in Java within OpenPipeline. Please note that semantic tagging is done during the indexing phase, so it does not impact search response time.