Toward data-centric solutions with Knowledge Graphs

In the previous blog posts [1, 2] in this series, Fredric Landqvist and Peter Voisey outlined, at a high level, the benefits of making data smarter and F.A.I.R. (findable, accessible, interoperable and reusable), ideally made findable through a shareable, but controlled, type of Information Commons. In this post, we introduce you to Knowledge Graphs (based on Semantic Web Technologies), the source of the magic behind smart and FAIR data automation. They can help tackle a range of problems, from the data tsunami to the scarcity of (quality) data for that next AI project.

What is a Knowledge Graph?

There are several different types of graph, and there have certainly been many attempted definitions of a Knowledge Graph. Here’s ours:

A Knowledge Graph is the structural representation of explicit knowledge for a domain, encoded in such a way that both humans and machines can read (process) it.

Ultimately, we want to exploit data and their connections or relationships within the graph format in order to surface important and relevant data and information. Without these relationships, the understandings, the stories and the searches around our data tend to dry up fairly quickly. Our world is increasingly connected. So we hope, from an organisational perspective, you are asking: Why isn’t our data connected?!

Where does the term “Knowledge Graph” come from?

The term Knowledge Graph was popularised by Google with the release of its own Knowledge Graph in 2012, although the phrase had appeared in research literature before that. More recently, organisations have been cottoning on to the collective benefits of employing a Knowledge Graph, so much so that many now refer to the Enterprise Knowledge Graph.

What are the technologies behind the Enterprise Knowledge Graph?

The Enterprise Knowledge Graph is based on a stack of W3C-ratified Semantic Web Technologies. As their name alludes to, they form the basis of the Semantic Web. Their formulation began in 2001 with Sir Tim Berners-Lee. Sir Tim, not content with giving us the World Wide Web for free, pictured a web of connected data and concepts, besides the web of linked documents, so that machines would be able to understand our requests by virtue of known connections and relationships.

Why Enterprise Knowledge Graphs now?

These technologies are complex to the layperson and some of them are nearly 20 years old. What’s changed to make enterprises take note of them now? Three things: worsening internal data management problems, the need for some knowledge input in most sustainable AI projects, and the fact that Knowledge Graph building tools have improved to become collaborative and more user-friendly for the knowledge engineer, domain expert and business executive. New tools hide more of the underlying technologies from the end user, allowing them to concentrate on encoding their knowledge so that it can be used across enterprise systems and applications. In essence, linking enterprise data.

Thanks to Google’s success in using their Knowledge Graph with their search, Enterprise Knowledge Graphs are becoming recognised as the difference between “googling” and using the sometimes-less-than-satisfying enterprise consumer-facing or intranet search.

The key takeaway here, though, is that the real power of any knowledge graph is in the relationships/connections between its concepts. We’ll look into this in more detail next.

RDF, at the heart of the Enterprise Knowledge Graphs (EKGs)

EKGs use the simple RDF graph data model at their base. RDF stands for Resource Description Framework: a framework for describing resources or things so that we can recognise them more easily and understand more about them.

An aside: We’re talking RDF (namespace) Knowledge Graphs here, rather than their sister graph type, Property Graphs, which we will cover in a future post. It is important to note that there are advantages with both types of graph and indeed new technologies are being developed, so processes can straddle both types.

The RDF graph data model describes a thing or a resource in terms of “triples”: Subject – predicate – Object. The diagram below illustrates this more clearly with an example.


Figure 1. What does a Knowledge Graph look like? The RDF elements of a Knowledge Graph

The graph consists of nodes (vertices) that represent entities (a.k.a. concepts both concrete and abstract, terms, phrases, but now think things, not strings), and edges (lines or arrows) representing the relationships between nodes. Each concept and each relationship has its own URI (a kind of ID) that helps a search engine or application understand its meaning, in order to spot differences (disambiguation), e.g. homonyms (words spelt or pronounced the same, but with different meanings), or similarities, e.g. alternative labels, synonyms, acronyms, misspellings, foreign language term equivalents etc.
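To make "things, not strings" concrete, here is a minimal sketch that holds a handful of triples as plain Python tuples. It is an illustration only, not how a triple store works internally: the example.org URIs and the concepts are hypothetical, while the prefLabel, altLabel and broader properties are taken from the SKOS vocabulary.

```python
# A tiny in-memory set of RDF-style triples: (subject, predicate, object).
# Every URI under http://example.org/ is hypothetical.
EX = "http://example.org/"
SKOS = "http://www.w3.org/2004/02/skos/core#"

triples = {
    (EX + "Jaguar_animal", SKOS + "prefLabel", "Jaguar"),
    (EX + "Jaguar_animal", SKOS + "broader", EX + "Big_cat"),
    (EX + "Jaguar_car", SKOS + "prefLabel", "Jaguar"),
    (EX + "Jaguar_car", SKOS + "altLabel", "Jag"),
    (EX + "Jaguar_car", SKOS + "broader", EX + "Car_brand"),
}


def concepts_for_label(label):
    """Return the URIs of every concept that carries the given label."""
    label_predicates = {SKOS + "prefLabel", SKOS + "altLabel"}
    return sorted(s for s, p, o in triples if p in label_predicates and o == label)


# Same string, two different things: the URIs keep them apart.
jaguar_uris = concepts_for_label("Jaguar")
# An alternative label leads to exactly one concept.
jag_uris = concepts_for_label("Jag")
```

The string "Jaguar" is ambiguous, but each URI is not, which is what lets an application ask the user (or use context) to choose between the animal and the car brand.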

Google uses its Knowledge Graph when it crawls websites to recognise entities like People, Places, Products, Organisations and, more recently, Topics, plus all the known relationships between them. Most organisations have a dire need for readily available knowledge about People and their related Roles, Skills/Competencies, Projects, Organisations/Departments and Locations.

There are of course many other well-known Knowledge Graphs now, including IBM’s Watson, Microsoft’s Academic Knowledge Graph, Amazon’s Cortex Knowledge Graph and the Bing Knowledge Graph.

One thing to note about Google is that the space devoted to its organic (non-paid-for) search results has reduced dramatically over the last ten years. In its place, Google uses its Knowledge Graph to better understand the end user’s query and context. Information is also served automatically based on query concept relationships, either within an Information Panel or as the commonly seen Questions and Answers (Q&As). Your employees (as consumers) are of course at home with this intuitive, easy-click user experience. While Google’s supply of information has become sharper, so has its automatic assessment of all webpage content, relying increasingly on websites to provide it with semantic information, e.g. declaring their “aboutness” by using schema.org or other structured data in their markup rather than relying on SEO keywords.

How does Knowledge Graph engineering differ from traditional KM/IM processes?

In reality, not that much. We still want the same governing principles that can give data good structure, metadata, context and meaning.

Constructing a Knowledge Graph can still be likened to the development of a taxonomy or thesaurus with its concepts, plus an ontology (the relationships between concepts). Here the relationships firstly include poly-hierarchical relationships (in terms of the taxonomy): a concept may have several broader concepts, meaning that the concept itself (with its own URI) can appear in multiple places within a taxonomy. This polyhierarchy can be exploited later, for example in both search filtering and website navigation.
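As a rough sketch of how polyhierarchy pays off in search filtering, the snippet below stores each concept's broader concepts and checks whether content tagged with a narrow concept should appear under a broader facet. The concept identifiers and the helper names (ancestors, matches_filter) are made up for illustration.

```python
# Hypothetical mini-taxonomy: each concept maps to its broader concepts.
# "ex:ElectricBike" has two parents, so it sits in two branches at once.
broader = {
    "ex:ElectricBike": ["ex:Bicycle", "ex:ElectricVehicle"],
    "ex:Bicycle": ["ex:Vehicle"],
    "ex:ElectricVehicle": ["ex:Vehicle"],
    "ex:Vehicle": [],
}


def ancestors(concept):
    """Collect every broader concept reachable from `concept`."""
    seen = set()
    stack = list(broader.get(concept, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(broader.get(current, []))
    return seen


def matches_filter(concept, facet):
    """Content tagged with `concept` matches a facet that is the concept itself or any ancestor."""
    return facet == concept or facet in ancestors(concept)
```

A page tagged only with "ex:ElectricBike" would then turn up under both the bicycle and the electric vehicle filters, without being tagged twice.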

Secondly, relationships can also be associative/relational with regard to meaning and context: your organisation’s home-grown and/or industry-adopted concepts and the key relationships that define your business, and even its goals, strategy and workflows.

A key difference, though, is the way in which you can think about your data and its organisation. It is no longer flat or 2-D; think instead of 3-D, 360-degree, concept- or consumer-centric views that show how each concept connects to other concepts.

A semantic layer for Automatic Annotation, smarter data & semantic search

We will look at the many different benefits of a Knowledge Graph and further use cases in the next post, but for now, we go with the magic that an EKG can sit virtually on top of any or all of your data sources (with different formats and metadata) without the need to move or copy any data. Any data source or data catalogue consumed via a processing pipeline can then be automatically and consistently annotated (“tagged”) and classified according to declared industry or in-house standards, thus becoming more structured and its meaning more readily “understood,” ready to be found and consumed in accordance with any known or stated conditions.

The classification may also extend to levels of data security and sensitivity, provenance or trust, and accessibility by location, device and time.
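The annotation step can be sketched in a few lines. This is a deliberately naive illustration using exact label matching. Real pipelines typically add linguistic processing and disambiguation. The label table, the URIs and the annotate function are all hypothetical.

```python
import re

# Hypothetical labels (preferred and alternative) mapped to concept URIs.
LABELS = {
    "knowledge graph": "http://example.org/concept/KnowledgeGraph",
    "ekg": "http://example.org/concept/KnowledgeGraph",
    "triple store": "http://example.org/concept/TripleStore",
    "graph database": "http://example.org/concept/TripleStore",
}


def annotate(text):
    """Return the sorted concept URIs whose labels occur in the text as whole words."""
    found = set()
    lowered = text.lower()
    for label, uri in LABELS.items():
        if re.search(r"\b" + re.escape(label) + r"\b", lowered):
            found.add(uri)
    return sorted(found)


# The source text stays where it is; only the tags are produced.
tags = annotate("Our EKG lives in a graph database.")
```

Because synonyms and acronyms resolve to the same URI, documents that say "EKG" and documents that say "knowledge graph" end up with the same tag.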

Figure 2. The automatic annotation & classification process for making data/content smart by using an Enterprise Knowledge Graph

It’s often assumed, incorrectly, that there is only one Enterprise Knowledge Graph. In fact an enterprise can have one or many, perhaps overlapping, graphs for different purposes, subject domains or applications. What matters is that knowledge becomes encoded and readily usable for humans and machines.

What’s wrong with Relational Databases?

There’s nothing wrong with relational databases per se and Knowledge Graphs will not necessarily replace them any time soon. It’s good to note though that data in tabular format can be converted to RDF graph data (triples/tuples) relatively easily and stored in a triple store (Graph Database) or some equivalent. 
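A minimal sketch of that conversion, assuming a made-up table and a hypothetical example.org namespace: the row key becomes the subject URI, each remaining column becomes a predicate, and each cell becomes an object. The W3C R2RML and Direct Mapping recommendations treat the topic far more thoroughly.

```python
import csv
import io

BASE = "http://example.org/"  # hypothetical namespace

# A made-up table, given as CSV text to keep the example self-contained.
CSV_DATA = """id,name,department
1,Ada,Research
2,Grace,Engineering
"""


def rows_to_triples(csv_text, entity_type, key_column):
    """Turn each row into triples: the key forms the subject URI, every other column a predicate."""
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = f"{BASE}{entity_type}/{row[key_column]}"
        for column, value in row.items():
            if column != key_column:
                triples.append((subject, BASE + column, value))
    return triples


triples = rows_to_triples(CSV_DATA, "person", "id")
```

Two rows with two non-key columns give four triples. In practice a column such as department would usually point to a concept URI rather than hold a literal string.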

In relational databases, references to other rows and tables are indicated by referring to primary key attributes via foreign key columns. Joins are computed at query time by matching primary and foreign keys of all rows in the connected tables. 

Understanding the connections or relations is usually very cumbersome, and such costly join operations are often avoided by denormalizing the data to reduce the number of joins necessary, at the expense of the data integrity of the relational database.

The data models for relational versus graph are different. If you are used to modelling with relational databases, remember the ease and beauty of a well-designed, normalized entity-relationship diagram (e.g. using UML): a graph is exactly that, a clear model of the domain. In a native graph store, each node (entity or attribute) directly and physically contains a list of relationship records that represent its relationships to other nodes. These relationship records are organized by type and direction and may hold additional attributes.

Querying relational databases is easy with SQL. The graph has something similar by using SPARQL, a query language for RDF. If you have ever tried to write a SQL statement with a large number of joins, you know that you quickly lose sight of what the query actually does. In SPARQL, the syntax remains concise and focused on domain components and the connections among them.
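The contrast can be sketched with the same small question asked both ways: "who works on a project located in Gothenburg?". The tables, triples and the match helper are hypothetical. The Python pattern matcher only mimics what a SPARQL engine does and is not a SPARQL implementation.

```python
import sqlite3

# Relational version: three tables, two joins (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE project (id INTEGER PRIMARY KEY, title TEXT, city TEXT);
    CREATE TABLE assignment (person_id INTEGER REFERENCES person(id),
                             project_id INTEGER REFERENCES project(id));
    INSERT INTO person VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO project VALUES (10, 'Search', 'Gothenburg'), (11, 'Intranet', 'Stockholm');
    INSERT INTO assignment VALUES (1, 10), (2, 11);
""")
sql_result = [row[0] for row in conn.execute("""
    SELECT person.name
    FROM person
    JOIN assignment ON assignment.person_id = person.id
    JOIN project ON project.id = assignment.project_id
    WHERE project.city = 'Gothenburg'
""")]

# Graph version: the same facts as triples, queried by following edges.
triples = [
    ("ex:ada", "ex:name", "Ada"),
    ("ex:grace", "ex:name", "Grace"),
    ("ex:ada", "ex:worksOn", "ex:search"),
    ("ex:grace", "ex:worksOn", "ex:intranet"),
    ("ex:search", "ex:locatedIn", "Gothenburg"),
    ("ex:intranet", "ex:locatedIn", "Stockholm"),
]


def match(s=None, p=None, o=None):
    """Return the triples that fit a pattern; None acts as a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]


# Roughly what this SPARQL pattern expresses (prefix declaration omitted):
#   SELECT ?name WHERE {
#     ?project ex:locatedIn "Gothenburg" .
#     ?person  ex:worksOn ?project ; ex:name ?name .
#   }
graph_result = [
    name
    for project, _, _ in match(p="ex:locatedIn", o="Gothenburg")
    for person, _, _ in match(p="ex:worksOn", o=project)
    for _, _, name in match(s=person, p="ex:name")
]
```

Both versions return the same answer. The difference is that the graph query names the relationships directly (worksOn, locatedIn), whereas the SQL has to restate how the keys join.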

Toward data-centric solutions with RDF

With enterprise linked data, as with knowledge graphs, one is able to connect many different schemas (data models) and formats in different relational databases and build a connected worldview, a domain of discourse. Herein lies the strength of linking data: liberating data from lock-in mechanisms, whether by schema (data model) or vendor (software), and running queries and inferencing to find new knowledge and insights that were not possible before due to time or human computation factors. Semantics support this reasoning!

Of course, having interoperable graph data could well mean fewer code patches on individual systems and more sustainable and agile data-centric solutions in the future.

In conclusion

The expression “in the right place, at the right time” is generally associated with luck. We’ve been talking in our enterprises about “the right information, in the right place, at the right time” for ages, unfortunately sometimes with similar fortune attached. The opportunity is here now to embark on a journey to take back control of your data if you haven’t already, and make it an asset again in achieving your enterprise aims and goals.


Next up in the series: Knowledge Graphs: The collective Why?

Fredric Landqvist
Peter Voisey

Open or Opaque Artificial Intelligence

Data is the black gold in the information era and has similar value creation and ecology to that of petroleum. Data in its raw format needs to be refined (as does crude oil) to make sense and to add meaning and usefulness to any domain.

AI and its parts (machine learning, natural language processing, deep-learning etc.) are set to be a societal game changer in all collective human imagination domains.


The ambition should be to design for a sustainable AI future, aiming to incorporate the UN’s 17 Sustainable Development Goals with ethics at the core. One omnipresent hurdle is still the black box or opaque setting, i.e. being able to understand how, why and where different AI operates and exerts influence.

The open paradigm

All AI utilities known to man share a simple model:

input → model → output, and feedback (learning).

There is therefore a need to shift the control from the computer back towards the human, and thereby enable the addition of meaning and semantics along with conceptual models.

By using open innovation, open standards, open models (knowledge graphs, ontologies, terminologies, code systems and the like), open software and open platforms (technology stacks, e.g. SingularityNET) in the design of future AI utilities and cognitive computing, there are opportunities to leverage learning in a meaningful way, away from the opaque regime and towards cognitive-informed artificial intelligence. Interoperability enables efficient communication that can accommodate data from different semantic domains that have traditionally been separate. Open domain knowledge and data-sets (as linked data) will provide very good platforms for continuously improved datasets within the AI loop, both in terms of refining and addressing the contextual matter, and in enabling improved precision and outcome.

Informative communication: a word’s meaning should allow accurate mental reconstruction of the sender’s intended meaning, but we are well aware of the human messiness (complexity) within a language, as described by the information bottleneck method (Tishby) and rate-distortion theory (Shannon).

To take on the challenges and opportunities within AI, there are strong undercurrents to build interdisciplinary capacities, as with Chalmers AI Research, AI Innovation of Sweden and the like, where computer science, cognitive science, data science, information science, social sciences and more disciplines meet and swap ideas to improve value creation within different domains, while at the same time beginning to blend industry, public sector, academia and society together.

The societal challenges that lie ahead open up for innovation, where AI-assisted utilities will augment and automate for the benefit of mankind and the earth, but doing so requires a balancing act where the open paradigm is favoured. AI is designed and is an artefact, hence we need to address ethics in its design with ART (Accountability, Responsibility and Transparency), as in the EU draft on AI ethics.

Tinkering with AI

The emerging development of AI shows a different pathway than that of traditional software engineering. All emerging machine learning, NLP and/or deep-learning machinery relies on a tinkering approach of trial and error (re-model, refine the data-set, test-bed with different outcomes and behaviours) before it can reach a maturity level for the industrial stages in digital infrastructure, as with Google Cloud or similar services. A great example is image recognition and computer vision, with its data optimization algorithms and processing steps. Here each development has emerged from previous learnings and tinkering. Sometimes the development and use of mathematical models simply does not hold up for real AI matters and utilities.

Here, in the value creation, or the why in the first place, we should design and use ML, NLP and deep learning in the process with an expected outcome. AI is not, and never will be, the silver bullet for all problem domains in computing! In essence, we need to start making sense, with contextual use-cases and utilities, long before we reach Artificial General Intelligence.

On the 25th of April, an event will cover Sustainable Knowledge Graphs and AI, together with the Linked Data Sweden network.

Tinkering with knowledge graphs

I don’t want to sail with this ship of fools, on the opulent data sea, where people are drowning without any sense-making knowledge shores in sight. You don’t see the edge before you drop!

Knowledge Engineering: Echoencephalogram (Lars Leksell) and neural networks

How do organisations reach a level playing field, where it is possible to create a sustainable learning organisation [cybernetics]?
(Enacted Knowledge Management practices and processes)

Sadly, in many cases, we face the tragedy of the commons!

There is an urgent need to iron out the social dilemmas and focus on motivational solutions that strive for cooperation and collective action. Knowledge deciphered with the notion of intelligence and emerging utilities with AI as an assistant with us humans. We the peoples!

To make a model of the world, to codify our knowledge and to enable worldviews onto complex data is nothing new per se. A Knowledge Graph is, in essence, a constituted shared narrative within the collective imagination (i.e. the organisation), where facts about things and their inherited relationships and constraints define the model to be used to master the matrix. These concepts and topics are our means of communication to bridge between groups of people: shared nomenclatures and vocabularies.

Terminology Management

Knowledge Engineering in practice


At work, building a knowledge graph, there are some pillars that the architecture rests upon. First and foremost is the language we use every day to undertake our practices within an organisation: the corpus of concepts, topics and things that revolve around the overarching theme. No entity acts in a vacuum with no shared concepts. Humans coordinate work practices by shared narratives embedded into concepts and their translations from person to person. This communication might use different means, like cuneiform (in ancient Babel) or the digital tools of today. To curate, cultivate and nurture a good organisational vocabulary, we also need to develop practices and disciplines that to some extent resemble those of ancient clay-tablet librarians: organising principles for the organising system (information system, applications). The practitioners of this discipline could be called taxonomists (taxonomy managers), knowledge engineers or information architects.

Set the scope – no need to boil the ocean


All organisations, independent of business vertical, have known domain concepts that are defined by standards, code systems or open vocabularies. A good idea is obviously to first go foraging in the sea of terminologies, to link, re-hash/re-use and manage the domain. The second task in this scoping effort is to audit and map the internal terrain of content corpora. Information is scattered across a multitude of organising systems, but within these there are pockets of structure. Here we will find glossaries, controlled vocabularies, data models and the like. The taxonomist will then, together with subject matter experts, arrange governance principles and engage in conversations on how the outer and inner loops of concepts link, and start to build domain-specific taxonomies, preferably using the Simple Knowledge Organization System (SKOS) standard.
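Linking the inner and outer loop can start very simply. The sketch below proposes links between an internal glossary and an external vocabulary by exact label match, and leaves everything else for the subject matter experts. The vocabularies, URIs and function names are hypothetical; skos:exactMatch is the SKOS mapping property for this kind of link.

```python
# Hypothetical external vocabulary (lower-cased label -> URI) and internal glossary terms.
EXTERNAL = {
    "myocardial infarction": "http://example.org/ext/C001",
    "hypertension": "http://example.org/ext/C002",
}
INTERNAL_GLOSSARY = ["Hypertension", "Myocardial Infarction", "Ward round"]


def propose_mappings(terms, external):
    """Split internal terms into exact-label matches and terms that need expert review."""
    matched, unmatched = {}, []
    for term in terms:
        uri = external.get(term.strip().lower())
        if uri:
            matched[term] = uri
        else:
            unmatched.append(term)
    return matched, unmatched


def to_turtle(matched):
    """Emit one skos:exactMatch statement per proposed mapping, as Turtle text."""
    lines = ["@prefix skos: <http://www.w3.org/2004/02/skos/core#> ."]
    for term, uri in sorted(matched.items()):
        local = "http://example.org/internal/" + term.lower().replace(" ", "-")
        lines.append(f"<{local}> skos:exactMatch <{uri}> .")
    return "\n".join(lines)


matched, unmatched = propose_mappings(INTERNAL_GLOSSARY, EXTERNAL)
turtle = to_turtle(matched)
```

The unmatched list ("Ward round" here) is the agenda for the conversation with the domain experts: is it a genuinely local concept, or a synonym of something the external vocabulary already covers?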

Participatory Design from inception


Concepts and their resource descriptions will need to be evaluated and semantically enhanced with several different worldviews from all practices and disciplines within the organisation. Concepts might have different meanings for different groups. Meaning is subjective, demographic, socio-political, and complex. Meaning sometimes gets lost in translation (between different communities of practice).

The best approach to get a highly participatory design in the development of a sustainable model is simply to publish the concepts as open thesauri. A great example is the HealthDirect thesaurus. Such a service becomes a canonical reference that people are able to search, navigate and annotate.

It is smart to let people edit, refine and comment (annotate) in the same manner as Wikipedia evolves, e.g. by editing Wikidata entries. These annotations will then feed back to the governance network of the terminologies.

Term Update

Link to organising systems

All models (taxonomies, vocabularies, ontologies etc.) should be interlinked to the existing base of organising systems (information systems [IS]) or platforms. Most ISs have schemas, in-built models and business rules to serve as applications for a specific use-case. This also implies the use of concepts to define and describe the data in metadata, as reference data tables or as user experience controls. In all these Lego pieces within an IS or platform, there are opportunities to link these concepts to the shared narratives in the terminology service. Linked enterprise data builds a web of meaning, and opens up for a more interoperable information landscape.

One omnipresent quest is to set up a sound content model and design for e.g. Office 365, where content types, collections, resource descriptions and metadata have to be orchestrated in the back-end services as a managed metadata service. Within these features and capacities, it is wise to integrate with the semantic layer (terminologies and graphs). Other highly relevant integrations relate to search-as-a-service, where the semantic layer co-acts in the pipeline steps to add semantics, link, auto-classify and disambiguate with entity extraction. In the user experience journey, the semantic layer augments and connects things, which is for instance how Microsoft Graph has been ingrained all through their platform. Search and semantics push the envelope 😉

Data integration and information mechanics

A decoupled information systems architecture using an enterprise service bus (messaging techniques) is one of the most widely used models. To enable sustainable data integration, there is a need for a data architecture and a clear integration design. Adjacent to the data integration are means for cleaning up data and harmonising data-sets into a cohesive whole: extract-transform-load [ETL]. Data Governance is essential! In this ballpark we also find cues to master data management. Data and information have fluid properties, and the flow has to be seamless and smooth.

When defining the message structure (asynchronous) in information exchange protocols and packages, it is highly desirable to rely on standards and well-defined models (ontologies), as within the healthcare & life science domain using HL7 FHIR. These standards have domain models with entities, properties, relations and graphs. The data serialisation for data exchange might use XML or RDF (JSON-LD, Turtle etc.). The value sets (namespaces) for properties can then be linked to SKOS vocabularies with terms.
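As an illustration of such a message, here is a small JSON-LD payload built with Python's json module. The schema, property names and URIs are hypothetical and are not taken from FHIR or any other standard. The point is only that the coded value travels as a URI into a shared vocabulary rather than as a free-text string.

```python
import json

# A hypothetical message payload expressed as JSON-LD.
message = {
    "@context": {
        "ex": "http://example.org/schema/",
        # "status" holds a URI (a concept in a shared vocabulary), not a string.
        "status": {"@id": "ex:status", "@type": "@id"},
        "recordedAt": "ex:recordedAt",
    },
    "@id": "http://example.org/observation/42",
    "@type": "ex:Observation",
    "status": "http://example.org/vocab/status/final",
    "recordedAt": "2024-01-01T09:00:00Z",
}

payload = json.dumps(message, indent=2)  # what is put on the bus
received = json.loads(payload)           # what the consumer reads back

# The consumer can look the coded value up in the terminology service by its URI.
status_uri = received["status"]
```

To a system that does not know JSON-LD, this is still ordinary JSON. The @context is what lets a semantic consumer turn the same document into triples.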

Query the graph

Knowledge engineering is both setting the useful terminologies into action and loading, refining and developing ontologies (information models, data models). There are many very useful open ontologies that could or should be used and refined by the taxonomists, e.g. the ISA² Core Vocabularies. With data-sets stored in a graph (triplestore), there are many ways to query the graph to get results and insights (links): either by using SPARQL (similar to SQL in schema-based systems), by combining this with SHACL (constraints), or via RESTful APIs.
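The idea behind constraints can be sketched without any tooling. The snippet below is not SHACL itself, only a hand-rolled check in its spirit: the shape dictionary loosely mirrors SHACL's target class, path, minimum count and maximum count, and the data and identifiers are hypothetical.

```python
RDF_TYPE = "rdf:type"

triples = [
    ("ex:ada", RDF_TYPE, "ex:Person"),
    ("ex:ada", "ex:name", "Ada"),
    ("ex:grace", RDF_TYPE, "ex:Person"),  # no name: should be reported
    ("ex:search", RDF_TYPE, "ex:Project"),
]

# Hypothetical shape: every ex:Person needs exactly one ex:name.
shape = {"target_class": "ex:Person", "path": "ex:name", "min_count": 1, "max_count": 1}


def validate(triples, shape):
    """Return (node, message) pairs for every target node that violates the shape."""
    targets = [s for s, p, o in triples if p == RDF_TYPE and o == shape["target_class"]]
    violations = []
    for node in targets:
        count = sum(1 for s, p, _ in triples if s == node and p == shape["path"])
        if count < shape["min_count"]:
            violations.append((node, f"fewer than {shape['min_count']} value(s) for {shape['path']}"))
        elif count > shape["max_count"]:
            violations.append((node, f"more than {shape['max_count']} value(s) for {shape['path']}"))
    return violations


report = validate(triples, shape)
```

Only the person without a name is reported; the project is ignored because the shape does not target it. A real SHACL processor does the same kind of work declaratively, with the shapes themselves stored as RDF.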

These means of querying the knowledge graph are one reason to add semantics to data integration as described above.

Adding smartness and we are all done…

Semantic AI, or the means to bridge between symbolic representation (semantics) and machine learning (ML), natural language processing (NLP) and deep learning, is where all things come together.

The work (knowledge engineering) to build and govern the knowledge graph takes many manual steps, such as mapping models, standards and large corpora of terminologies. Here AI capacities enable automation and continuous improvements with learning networks. Understanding human capacities and intelligence, unpacking the neurosciences (as Lars Leksell did) combined with neural networks, will be our road ahead towards safe and sustainable uses of AI.
Fredric Landqvist