Semantic Annotation (how to make stuff findable, and more)

With semantic annotation, your customers and employees can get the right information in order to make better decisions

Why automatic Semantic Annotation?  

Empower customers & employees with the right information 

Moving data and services to the Cloud has many advantages, including more flexible work practices. COVID-19 has boosted the trend and many organisations are benefiting from employees also being able to work from home. If employees are to become customers themselves, they should be able to expect a quality Search service. Semantic Annotation can help with this.

For many employees, finding information is still a problem. Having poor Search does little to encourage users either to use it, or to improve their decision-making, knowledge sharing or curiosity & innovation. Let’s not forget, better search means less duplication too. 

Making data and content “smarter” makes it more findable. 


Data and content are rarely structured with good metadata or tagging (annotation) unless either they are being used to sell something, or they are deemed as business critical. Generally, when we create (data, content), we just save it to storage(s). 

We could tag manually, but research shows that we’re not good at this. Even if we bother to tag, we only do it from our own perspective, and even then, we do it inconsistently over time.  

Alternatively, we could let AI do the work. Give data/content structure, meaning and context (all automatically and consistently), so that it can be found. 

The main need for automatic Semantic Annotation? About 70-80% of the average organisation’s data is unstructured (/textual). Add to this: even databases have textual labels and headings. 

How to create automatic Semantic Annotation?  

Use stored knowledge (from an Enterprise Knowledge Graph) 

When thinking about the long-term data health of an organisation, the most effective and sustainable way to set up semantic annotation is to create your own Enterprise Knowledge Graph (which can then be used for multiple use case scenarios, not just annotation).

In an Enterprise Knowledge Graph (EKG), an organisation can store its key knowledge (taxonomies, thesauri, ontologies, business rules). Tooling now exists so that business owners and domain experts can collaboratively add their knowledge, not having to know about the underlying semantic web-based technologies, the ones that allow your machines and applications to read this knowledge as well (before making their decisions). 

 Your EKG is best created using both human input and AI (NLP & ML = Natural Language Processing & Machine Learning). The AI part exploits your existing data plus any existing industry-standard terminologies or ontologies that fit your business needs (you may want to just be able to link to them). While the automation of EKG creation is set to improve, EKG robustness can be tested by using corpus analysis with your data to find any key business concepts that are missing.

How does automatic Semantic Annotation work?  

Smart processing 

Despite improvements in search features and functionality, Search in the digital workplace may still have that long tail of search – where the less frequent queries are harder to cater for. With an EKG annotation process, the quality of search results can significantly improve. Processing takes extracted concepts (Named Entity Recognition) from the resource asset that needs to be annotated. It then finds all the relationships that link these concepts to other concepts within the graph. In doing so, the aboutness of the asset is calculated using an algorithm before appropriate annotation takes place. The annotations go to making an improved index. The process essentially makes your data assets “smarter,” and therefore, more findable.
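The processing steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how any particular EKG product works: the toy graph `GRAPH`, the label lookup `LABELS` (standing in for a real Named Entity Recognition step) and the simple frequency-based `aboutness` score are all hypothetical.

```python
from collections import Counter

# Hypothetical toy knowledge graph: concept -> directly related concepts.
GRAPH = {
    "diabetes": {"insulin", "endocrinology"},
    "insulin": {"diabetes", "hormone"},
    "endocrinology": {"diabetes", "hormone"},
    "hormone": {"insulin", "endocrinology"},
}

# Hypothetical label lookup standing in for a real NER step.
LABELS = {"diabetes": "diabetes", "diabetic": "diabetes", "insulin": "insulin"}


def extract_concepts(text):
    """Return graph concepts whose labels occur in the text (naive token match)."""
    tokens = [t.strip(".,;:!?").lower() for t in text.split()]
    return [LABELS[t] for t in tokens if t in LABELS]


def aboutness(text, top_n=2):
    """Score each mentioned concept by its own mentions plus half the
    mentions of its graph neighbours, then return the top concepts."""
    mentions = Counter(extract_concepts(text))
    scores = {}
    for concept, count in mentions.items():
        support = sum(mentions[n] for n in GRAPH.get(concept, ()))
        scores[concept] = count + 0.5 * support
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [concept for concept, _ in ranked[:top_n]]


doc = "Diabetic patients often need insulin. Insulin doses vary."
annotations = aboutness(doc)  # concepts that would be written to the index
```

In a real setting the resulting annotations would be stored alongside the asset in the search index; the weighting used here is only one of many possible choices.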

Processing also includes shadow concept annotations – the adding of a concept tag where the concept itself does not appear within the resource asset, but which perfectly describes the resource (thanks to known concept relationships in the graph). Similarly, the quality of retrieved search results can be increased as the annotation process reduces the ambiguity about the meaning of certain concepts, e.g. it differentiates between Apple (the brand) and apple (the fruit) by virtue of their connections to other concepts, i.e. it can answer: are we talking tech or snacks?
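Both ideas, shadow concepts and disambiguation through connected concepts, can be illustrated with a small Python sketch. The sense inventory `SENSES`, the broader-concept links in `BROADER` and the overlap-counting rule are hypothetical and chosen only to make the principle visible; a production graph would hold far richer relationships.

```python
# Hypothetical graph fragments: each candidate sense with its related concepts.
SENSES = {
    "apple": {
        "Apple (brand)": {"iphone", "mac", "software", "tech"},
        "apple (fruit)": {"orchard", "pie", "snack", "fruit"},
    }
}

# Hypothetical broader-concept links used for shadow annotation.
BROADER = {"iphone": "smartphone", "mac": "personal computer"}


def disambiguate(term, context_tokens):
    """Pick the sense that shares the most related concepts with the context."""
    context = set(context_tokens)
    senses = SENSES[term]
    return max(sorted(senses), key=lambda s: len(senses[s] & context))


def shadow_concepts(tokens):
    """Return broader concepts implied by the text but not mentioned in it."""
    present = set(tokens)
    implied = {BROADER[t] for t in tokens if t in BROADER}
    return sorted(implied - present)


tokens = "apple unveiled a new iphone and mac software update".split()
sense = disambiguate("apple", tokens)
shadows = shadow_concepts(tokens)
```

Here the text never says "smartphone", yet the asset could still be tagged with it because the graph knows how the concepts relate.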

Your preferred tooling may be that which supports the part-human (expert) maintenance of key business language (taxonomies – including phrases, alternative labels, acronyms, synonyms etc). Thus, the EKG can cater for the differing language and cultural perspectives of both customers and employees (think Diversity & Inclusion). And of course, search just gets better when linked to any user profile concepts for personalisation.

Analysis of search queries to find “new” language means that business language can be kept “alive,” and reflect both your data and query trends (typed and spoken). Resultant APIs can offer many different UX options, e.g. for “misfired” queries: clickable, search-generating related concepts, or broader/narrower concepts for decreased/increased search granularity.

What are the alternatives? 

EKGs, AI enhancements and COTS 

There are several providers of commercial knowledge engineering and graph software in the market, many of whom Findwise partner with. As EKGs are RDF-based, once made, they are transferrable between software products, should the need arise. 

Incremental AI-based algorithmic additions can be added to improve existing search (e.g. classifiers, vector embeddings etc), though these have more of a single-focus, single-system perspective. Very often these same enhancement techniques can also provide input for improving and automating EKGs – just as the EKG can offer a logical base and rules for a robust AI engineering strategy.

EKGs offer a hybrid architecture with open source search engines. There are of course commercial off-the-shelf solutions (COTS) that offer improved search over data assets (often also with a graph behind them). But before you go for any vendor lock-in, check what it is you need and if they cover all or any of the possible EKG-related scenarios:

Are they inclusive of all your data? Do they help formalise a data governance and accountability framework? Is the AI transparent enough to understand? Can your information and business model(s) be built in and be reflected in data structures? How easy would it be to alter your business model(s) and see such changes reflected in data structures?

Does the software solution cope with other use cases? e.g. Data findability? FAIR data? Do they have multilingual functionality? Can they help make your data interoperable or connected with your ecosystem or Web data? Do they support potential data-centric solutions or just application-centric ones?


Semantic Annotation: How to make it happen? 

Your ultimate choice may be the degree to which you want or need to control your data and data assets, plus how important it is for your organisation to monitor their usage by customers and employees. 

EKGs are mostly introduced into an organisation via a singular use case rather than as the result of a future-looking, holistic, data-centric strategy – though this is not unheard of. That said, introducing automatic Semantic Annotation with an EKG could prove a great follow-up to your organisation’s Cloud project, as together they can dramatically increase the value of your data assets within the first processing.

For an example of an implemented semantic annotation use case, see the NHS Learning Hub, a collaborative Health Education England and Findwise project.

Alternatively check out Findability by Findwise and reach out to get the very best digital transformation roadmap for your organisation.

Peter Voisey

Uncover hidden insights using Information Retrieval and Social Media

Arjen de Vries’ talk at ESSIR 2013 (Granada, Spain) highlighted the opportunities and difficulties in using Information Retrieval (IR) and social media to both make sense of unstructured data that a computer cannot easily make sense of by itself and realise deeper hidden information.

Today social media has become more important as interactions on sites have developed to include user-generated content of all types from ratings, comments and experiences to uploaded images, blogs and videos. Users now not only consume content and products but also co-create, interact with other users and help categorise content through means of tagging or hash-tags. All of which may result in ‘data consumption’ traces.

Some of these social media platforms like Twitter provide access to a variety of data such as user profiles, their connections with other users, their shared or published content and even how they react to each other’s content through comments and ratings. Analysis of this data can provide new insights.

One example is a case from CWI research. They calculated a top music artist popularity chart for each of several music sites, including EchoNest and Spotify. Further research was then done for the band The Black Keys, where it was found that their popularity did not vary over time on these sites. However, when using data from tweets about the band for the very same period, it was found that interest in them rocketed after their Grammy win in the States, information that was not apparent from the previous research.

Using social media data to enrich information

The challenge that remains for IR research however is that these social media platforms vary in functionality. What they let users do will often determine the usefulness of the resultant data. For example, YouTube and Flickr will only let the uploader tag their own content, while the film site IMDb allows tagging but the tags are not registered personally, rather they go into a pool. Arjen cited ‘Red Hot Chilli Peppers’ as a simple example of the usefulness of such social media data in allowing disambiguation. The disambiguation can come from implicit metadata, such as a user comment about eating chillies, and/or from organisational data, say from Flickr, where a picture of red hot chilli peppers is grouped with other pictures of fruit and vegetables.

The key point here is that researchers often bemoan the fact that they do not always have access to server log files. Social media data left by users about content or objects can at times provide a richer representation in matching an information need and the response to that need. The potential benefits here for Information Retrieval are several:

  • The expanded content representation
  • The reduction in the ‘vocabulary gap(s)’ between content creator, indexers and information seekers
  • The increase in diversity of view on the same content
  • And the opportunity to make better assumptions about a user’s context and the variety of contexts that may exist

Allowing all users to tag all available content improves retrieval tasks

Where information about users on media sites is ‘open’ (sometimes it is not: sites like Facebook and LinkedIn are notoriously difficult to retrieve data from), there is the ability to discover which user labels what item with what word, and in some cases, even what rating they give. In essence many new sources play the role of anchor texts, be they tags, ratings, tweets, comments or reviews. The standard triangle of user, tag and item allows a unifying research approach based on random walks and can answer many questions. The talk emphasised that the area clearly has ample opportunity for researchers to make their mark.
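The user, tag and item triangle and the random-walk idea can be illustrated with a small sketch. It is not the method presented in the talk; it is one common variant (a random walk with restart, computed by repeated probability propagation rather than by sampling), and the triples, the `restart` value and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical (user, tag, item) observations.
TRIPLES = [
    ("ann", "humour", "book1"),
    ("bob", "funny", "book1"),
    ("bob", "funny", "book2"),
    ("cat", "history", "book3"),
]


def build_graph(triples):
    """Link the user, tag and item nodes that occur in the same triple."""
    graph = defaultdict(set)
    for user, tag, item in triples:
        nodes = (("user", user), ("tag", tag), ("item", item))
        for a in nodes:
            for b in nodes:
                if a != b:
                    graph[a].add(b)
    return graph


def walk_with_restart(graph, start, restart=0.3, steps=50):
    """Approximate visit probabilities of a walk that keeps restarting at `start`."""
    probs = {node: 0.0 for node in graph}
    probs[start] = 1.0
    for _ in range(steps):
        nxt = {node: 0.0 for node in graph}
        for node, p in probs.items():
            # Spread the non-restarting share evenly over the neighbours.
            share = (1 - restart) * p / len(graph[node])
            for neighbour in graph[node]:
                nxt[neighbour] += share
        nxt[start] += restart
        probs = nxt
    return probs


graph = build_graph(TRIPLES)
scores = walk_with_restart(graph, ("tag", "humour"))
related_tags = sorted(
    (n for n in scores if n[0] == "tag" and n != ("tag", "humour")),
    key=lambda n: -scores[n],
)
```

Starting the walk from a tag ranks related tags; starting it from a user or an item would instead rank candidates for recommendation or retrieval, which is what makes the single graph view attractive.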

One example of the research potential was shown in the case where a book-cataloguing website was used to detect synonyms. The website allows users to keep a record of books they have read, and to tag, rate and comment on them. From these connections a synonym detector was created, allowing the example query word ‘humour’ to have a proposed list of synonyms created that included ‘humor’ (US English), ‘funny’, ‘humorous’ and ‘British humour’.
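How such a synonym detector might work can be sketched as follows. The approach shown, comparing the sets of books that two tags have been applied to using Jaccard overlap, is an assumption made for illustration and not necessarily the technique used in the research described; the data and the `threshold` value are hypothetical.

```python
from collections import defaultdict

# Hypothetical (tag, book) assignments made by users.
ASSIGNMENTS = [
    ("humour", "b1"), ("humour", "b2"), ("humour", "b3"),
    ("humor", "b1"), ("humor", "b2"),
    ("funny", "b2"), ("funny", "b3"), ("funny", "b4"),
    ("history", "b5"),
]


def items_by_tag(assignments):
    """Map each tag to the set of items it has been applied to."""
    index = defaultdict(set)
    for tag, item in assignments:
        index[tag].add(item)
    return dict(index)


def jaccard(a, b):
    """Overlap of two sets as a value between 0 and 1."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def synonym_candidates(query, assignments, threshold=0.3):
    """Return tags whose item sets overlap strongly with the query tag's."""
    index = items_by_tag(assignments)
    query_items = index.get(query, set())
    scored = [
        (tag, jaccard(query_items, items))
        for tag, items in index.items()
        if tag != query
    ]
    scored.sort(key=lambda ts: (-ts[1], ts[0]))
    return [tag for tag, score in scored if score >= threshold]


candidates = synonym_candidates("humour", ASSIGNMENTS)
```

Tags that are applied to largely the same books are proposed as synonyms, while a tag such as `history`, which shares no books with the query tag, is filtered out.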

Analysis of this type of research has shown that allowing all users to tag all available content improves retrieval tasks, and that combining tags and ratings may both improve search and recommendation tasks even though there are some cases where lost relations between user, tag and item may occur.

The takeaway from this talk was that there is no one means or approach in retrieving information from Social Media due to the many complexities involved – not least because of the limitations of user interactivity in some platforms but also due to limitations in the usage of data accessed. As Arjen demonstrated though, it is possible in certain cases to be innovative in the collection of rewarding data – an approach that may be utilized more and more, particularly as users/customers move closer to product and service providers through increased online interactions.

Overview of the 9th European Summer School in Information Retrieval

This was an excellent week of knowledge sharing and exchanging of ideas, seen mostly (but not exclusively) from an Information Retrieval (IR) research perspective, given the high percentage of PhD students in the 100+ audience. Here are just a few brief summaries of my favourite talks from a great crop that were made throughout the week.

Highlights for me started with the IR ‘guru’ Bruce Croft (University of Massachusetts; twitter: @wbc11), who stormed through over 150 slides in the morning session, giving a speedy and concise history of IR, its foundations and formal models, and the current IR research focus. Three main issues were seen to dominate: relevance, evaluation and users along with their information needs.

He emphasised that the IR research focus was still very much on document and query representations and the various/mixed retrieval models attempting to marry them in order to produce the most relevant results (with relevance here incorporating topical and people relevance, task context and novelty). A key shift within retrieval models however was noted from the processing of text into units of language, towards the use of distribution of word counts with more statistical and predictive properties in retrieval models and algorithms. The current research ‘gap’ was seen as dealing with the long(er) query at the specific passage-level answer.

Michail Salampasis (from the Institute of Software Technology and Interactive Systems, Vienna University of Technology) gave a challenging talk on Integrating IR Technologies for Professional Search in which he highlighted the increased difficulties when dealing with a specialized domain and a smaller user base. Enterprise Search (ES) was very much seen as comparable to IR Search in terms of relevance, evaluation, user informational needs and user interactions. It was rightly noted that within ES however, other factors also play a more important part such as system performance, incorporating new data, scalability, information freshness and the presence of multiple information sources along with the need for tuning for different applications.

Arjen de Vries (leader of the Interactive Information Retrieval research group at Centrum Wiskunde & Informatica, Utrecht, Netherlands; twitter: @arjenpdevries) spoke on IR and Social Media. This was an inspirational talk telling researchers of the vast and varied social media data out there ready to be culled as a direct result of users’ interactions. He talked of having the key ideal information triangle of linked data between people (their connections and profiles), items and tags/ratings (endorsement and sharing). The take home messages were that social media can at times give IR research a rich resource of context that is an alternative to click data, although finding one theory to address the various recommendation and retrieval tasks is going to be problematic. In one example, a band’s popularity appeared static on both Spotify and EchoNest after it had won a Grammy, while analysis using bitly clearly showed a sharp increase in interest.

Tony Russell-Rose (UX Labs, London; twitter: @tonygrr) gave an entertaining talk, very much from the user perspective, on Designing the User Experience. He noted how the earliest classic IR models either lacked a user perspective completely or were too linear. He proposed four Dimensions of Search User Experience involving the user themselves (their level of expertise), their goal (its scope and complexity), the context (again its complexity) and the type of search mode to be employed (depending on whether the user was looking up something, learning or investigating). His talk went on to urge the adoption of a principled approach to design using these dimensions, with particular reference to the differing contexts requiring differing designs. Finally there was a call to apply proven design patterns and principles but to look at search holistically – towards both the analysis and the sense making of information.

Norbert Fuhr (from the Faculty of Engineering Sciences at the University of Duisburg-Essen, Germany) gave a measured talk on Interactive IR. Firstly, quantitative modelling and the Probability Ranking Principle (PRP) were looked at. The PRP ranks documents according to decreasing values of the probability of relevance (based on user choices represented as binary choices), which in turn yields optimum retrieval quality. Some obvious shortcomings with this ranking were shown: the user-assessment focus; the assumption that the relevance judgements of documents are independent; the fact that users’ search paths are often non-linear; and the fact that their information need may alter during a search session.
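As a small illustration of the Probability Ranking Principle, the following Python sketch orders documents by decreasing estimated probability of relevance and computes the expected number of relevant documents in the top k. The probabilities are hypothetical, and the independence of relevance judgements is assumed here exactly as the PRP assumes it.

```python
# Hypothetical estimated probabilities of relevance for four documents.
P_RELEVANT = {"d1": 0.2, "d2": 0.9, "d3": 0.5, "d4": 0.7}


def prp_rank(probabilities):
    """Order documents by decreasing probability of relevance (ties by id)."""
    return sorted(probabilities, key=lambda d: (-probabilities[d], d))


def expected_relevant_at_k(ranking, probabilities, k):
    """Expected number of relevant documents among the first k results,
    assuming relevance judgements are independent of each other."""
    return sum(probabilities[d] for d in ranking[:k])


ranking = prp_rank(P_RELEVANT)
top2_expected = expected_relevant_at_k(ranking, P_RELEVANT, 2)
```

No other ordering of the same documents gives a higher expected number of relevant results at any cut-off, which is the sense in which the ranking is optimal; the shortcomings listed above concern the assumptions behind the probabilities rather than the sorting itself.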

The second more impactful part of the talk dealt with various cognitive models of information seeking and searching and showed how better understanding of the user has influenced interface designs to go beyond the more traditional query-result list paradigm.

Paul Clough (University of Sheffield) gave an authoritative talk on Multilingual Retrieval, with much information coming from his new book on the subject. His talk mentioned the many reasons that multilingual search is becoming more important in its various forms as well as highlighting the more common stumbling blocks. He mentioned the many models employed today, including the more sophisticated language models. He stated that lab test results can often reach up to 99% of monolingual IR performance, but it is often the cost and the uncertainty of results due to nuances in language that have prevented the presence of more multilingual IR systems. Again the users themselves and their needs have been taken into account more with regard to system functionality.

There were of course other talks of note but some of the above topics will be covered in more depth in upcoming blog posts.

/Peter Voisey