This is the second post in a new series by Fredric Landqvist & Peter Voisey, explaining how your organisation could best shape its data landscape for the future.
How to create a smart data framework for your organisation
In our last post, we presented the benefits of FAIR data, how to make data smarter for search engines and the potential of an Information Commons. In this post, we’re giving you the pragmatic steps to make your data FAIR by creating and applying your own smart data framework. Your data-sharing dream, internally and externally, is possible.
A smart data framework, using FAIR data principles, encompasses the tooling, models and standards that govern datasets and the different context-specific information systems (registers, catalogues). The data is then ingested and processed (enriched/refined) into smart data, datasets and data catalogues. It can then be used and reused by different applications and e-services via open APIs. In this ecosystem, all actors and information behaviours (personas) interplay: provision agents, owners, builders, enrichers, end-user searchers and referrers.
A smart data & metadata catalogue
A smart data & metadata catalogue (illustrated below) provides an organisational capability that aligns data management with the FAIR data principles. View it not so much as one system to rule them all, but rather as an ecosystem that is smart and sustainable. To simplify your complex and heterogeneous information environment, this set-up can be instantiated as one overarching mechanism. Although we are describing a data and metadata catalogue here, the exact same framework and set-up would of course also apply to your organisation’s content, making it smarter and more findable (i.e. it gets the sustainable stamp).
The above picture illustrates the services and components that, together, build smart data and metadata catalogue capabilities. We now describe each one of them for you:
Processing (Ingestion & Enrichment) for great Findability & Interoperability
- (A) Ingest, harvest and operate. Here you connect the heterogeneous data sources for ingestion.
The configured input mechanisms describe each of the data sources, with their data, datasets and metadata ready for your catalogue search. Hopefully, at the dataset upload stage, you have provided a good system/form that now supplies your search engine with great metadata (we recommend you use the open data catalogue standard DCAT-AP). Here, "upload" stands for any of three routes: machine-to-machine harvesting (as with open data), traditional data integration, or manual upload by a person.
- (D) Enterprise Metadata Repository. Here is the persistent storage of data in the data catalogue, the index and the graph.
All things get a persistent ID (how to design persistent URI) and rich metadata.
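To make the metadata and persistent ID idea concrete, here is a minimal sketch of a DCAT-style dataset description built at upload time. It is not a complete DCAT-AP profile. The base URI, the function names and the sample values are our own assumptions for illustration; only the `dcat:` and `dct:` property names come from the published vocabularies.

```python
import json
import uuid

# Hypothetical base URI for persistent identifiers; use a domain you control.
BASE_URI = "https://data.example.org/dataset/"


def mint_persistent_uri(source_system, local_id):
    """Derive a stable URI from the source system and its local identifier.

    uuid5 is deterministic: the same inputs always give the same ID, so
    harvesting the same source twice does not create a second entry.
    """
    stable_id = uuid.uuid5(uuid.NAMESPACE_URL, f"{source_system}/{local_id}")
    return f"{BASE_URI}{stable_id}"


def describe_dataset(source_system, local_id, title, description, keywords, licence_uri):
    """Build a minimal DCAT-style dataset description as a JSON-LD dictionary."""
    return {
        "@context": {
            "dcat": "http://www.w3.org/ns/dcat#",
            "dct": "http://purl.org/dc/terms/",
        },
        "@id": mint_persistent_uri(source_system, local_id),
        "@type": "dcat:Dataset",
        "dct:title": title,
        "dct:description": description,
        "dcat:keyword": keywords,
        "dct:license": {"@id": licence_uri},
    }


# Sample values are invented for the example.
record = describe_dataset(
    source_system="crm",
    local_id="customers-2019",
    title="Customer register 2019",
    description="Yearly export of the customer register.",
    keywords=["customer", "register"],
    licence_uri="https://creativecommons.org/licenses/by/4.0/",
)
print(json.dumps(record, indent=2))
```

Deriving the ID from the source system and local identifier, rather than generating a random one, is one possible design choice: it keeps the URI stable across repeated harvests.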
- (B) Enrich, refine, analyze and curate. This is the AI part (NLP, Semantics, ML) that enriches the data and datasets, making them smarter.
Concepts (read also: entities, terms, phrases, synonyms, acronyms etc.) from the data sources are found using named entity recognition (NER). By referring to a Knowledge Graph in the Enricher, the appropriate resources are annotated (“tagged”) with the concept in question. It does not end here, however. The concept also brings with it, from the Knowledge Graph, all of the known relationships it has with other concepts.
Essentially a Knowledge Graph is your encoded domain knowledge in a connected graph format. It is by reading these encoded relationships that the machine “understands” the meaning or aboutness of data.
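The tagging step can be sketched in a few lines. Note that this is a simple label lookup against a toy graph, not a trained NER model, and the concept IDs, labels and relationships are invented for the example. What it shows is the principle: once a concept is found in the text, its relationships come along from the graph.

```python
import re

# Toy knowledge graph (hypothetical): concept ID -> labels and relationships.
KNOWLEDGE_GRAPH = {
    "concept:invoice": {
        "labels": ["invoice", "bill"],
        "broader": ["concept:financial-document"],
        "related": ["concept:customer"],
    },
    "concept:customer": {
        "labels": ["customer", "client"],
        "broader": ["concept:party"],
        "related": ["concept:invoice"],
    },
}


def annotate(text):
    """Tag a text with every concept whose label occurs in it.

    Each annotation carries the concept's known relationships, so the
    resource inherits context from the graph, not just a keyword.
    """
    annotations = []
    for concept_id, node in KNOWLEDGE_GRAPH.items():
        for label in node["labels"]:
            # Whole-word, case-insensitive match with an optional plural "s".
            pattern = rf"\b{re.escape(label)}s?\b"
            if re.search(pattern, text, flags=re.IGNORECASE):
                annotations.append({
                    "concept": concept_id,
                    "matched_label": label,
                    "broader": node["broader"],
                    "related": node["related"],
                })
                break  # one annotation per concept is enough
    return annotations


tags = annotate("Each client receives monthly bills.")
for tag in tags:
    print(tag["concept"], "via", tag["matched_label"], "->", tag["related"])
```

In a real Enricher the lookup would typically be replaced by proper entity extraction and disambiguation; the shape of the output (concept plus relationships) is the point here.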
This opens up a very nice Pandora’s box for your search (understanding query intent) and for your Graphical User Interface (GUI) as your data becomes smarter now through your ability to exploit the relationships and connections (semantics and context) between concepts.
You and AI can have a symbiotic relationship in the development of your Knowledge Graph. AI can suggest new concepts and relationships as new data is added. It is, however, you and your colleagues who decide which concepts/relationships belong in the Knowledge Graph: the ones that are important to your department or business. Remember you can utilise more than one Knowledge Graph, or part of one, for a particular business need or data source. The Knowledge Graph is a flexible expression of your business/information models that gives structure to all your data and its access.
Extra optional step: If you can manage not only to index the dataset metadata but the datasets themselves, you can make your Pandora’s box even nicer. Those cryptic/nonsensical field names that your traditional database experts love to create can also be incorporated and mapped (one time only!) into your Knowledge Graph, thus increasing the machine “understanding” of the data. Thus, there is a better chance of the data asset being used more widely.
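As a small sketch of that one-time mapping, assume a plain lookup table from field names to concepts. The column names, concept IDs and labels below are hypothetical; in practice the mapping would live in your Knowledge Graph rather than in a Python dictionary.

```python
# Hypothetical one-time mapping from cryptic column names to concepts.
FIELD_TO_CONCEPT = {
    "CUST_NO": "concept:customer-number",
    "INV_DT": "concept:invoice-date",
    "AMT_NET": "concept:net-amount",
}

# Human-readable labels for the same (hypothetical) concepts.
CONCEPT_LABELS = {
    "concept:customer-number": "Customer number",
    "concept:invoice-date": "Invoice date",
    "concept:net-amount": "Net amount",
}


def describe_columns(columns):
    """Return (column, concept ID, label) for each column of a dataset.

    Unmapped columns come back with None values, so they can be flagged
    for a curator instead of being silently ignored.
    """
    described = []
    for column in columns:
        concept = FIELD_TO_CONCEPT.get(column.upper())
        label = CONCEPT_LABELS.get(concept)
        described.append((column, concept, label))
    return described


for row in describe_columns(["cust_no", "INV_DT", "XYZ_7"]):
    print(row)
```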
The configuration of processing with your Knowledge Graph can take care of dataset versioning, lineage and add further specific classifications e.g. data sensitivity, user access and personal information.
Lastly on Processing, your cultural and system interoperability is immensely improved. We’re not talking about everyone speaking the same language here, but rather everyone speaking their own language (/culture) and still being able to find the same thing. Here, open and FAIR vocabularies further enrich the meaning of your data, and your metadata is linked. System interoperability is partially achieved by exploiting the graph of connections that now “sit over” your various data sources.
Controlled Access (Accessible and Reusable)
- (C) Access, search and visualize APIs. These tools control and influence the delivery, representation, exploration and consumption/use of datasets and data catalogues via a smarter search (made so by smarter data) and a more intuitive Graphical User Interface (GUI).
This means your search can now “understand” user intent from just one or two keyword queries (through known relationship connections in the Knowledge Graph).
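One simple way to picture this is query expansion: each keyword is looked up in the graph, its alternative labels are added to the search terms, and its related concepts are kept aside as suggestions. The sketch below assumes a toy graph with invented concepts and is only one possible approach, not how any particular search engine does it.

```python
# Toy graph (hypothetical): concept ID -> labels and related concepts.
CONCEPTS = {
    "concept:invoice": {"labels": ["invoice", "bill"], "related": ["concept:customer"]},
    "concept:customer": {"labels": ["customer", "client"], "related": []},
}

# Reverse index from each label to its concept.
LABEL_TO_CONCEPT = {
    label: concept_id
    for concept_id, node in CONCEPTS.items()
    for label in node["labels"]
}


def expand_query(query):
    """Expand keywords with synonyms and collect related concepts.

    Returns the search terms (original words plus alternative labels) and
    the related concept IDs a GUI could offer as "see also" suggestions.
    """
    terms, related = [], []
    for word in query.lower().split():
        candidates = [word]
        concept_id = LABEL_TO_CONCEPT.get(word)
        if concept_id is not None:
            node = CONCEPTS[concept_id]
            candidates += node["labels"]
            for related_id in node["related"]:
                if related_id not in related:
                    related.append(related_id)
        for candidate in candidates:
            if candidate not in terms:
                terms.append(candidate)
    return {"terms": terms, "related": related}


expanded = expand_query("unpaid bill")
print(expanded)
```

A searcher typing "bill" now also finds resources that only say "invoice", and the GUI knows that customers are a related topic.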
Your search now also caters for your searchers who are searching in an unfamiliar subject area or are just having a query off day. Besides offering the standard results page, the GUI can also present related information (again due to the Knowledge Graph), past related user queries, information and question-answer (Q&A) type material. So: search, discovery, learning, serendipity.
Your GUI can also now become more intuitive, changing its information presentation and facets/filters automatically, depending on the query itself (more sustainable front-end coding).
An alternative to complex scenario coding also includes the possibility for you to create rules (set in your Knowledge Graph) that can control what data users can access (when, how and where) based on their profile, their role, their location, the time and on the device they are using. This same Knowledge Graph can help push and recommend data for certain users proactively. Accessibility will be possible by using standard communication protocols, open access (when possible), authentication where necessary, and always with metadata at hand.
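To illustrate rule-based access, here is a minimal sketch in which a plain dictionary stands in for rules that would be stored in your Knowledge Graph. The sensitivity classes, roles, locations, devices and opening hours are all hypothetical, and a real set-up would sit behind proper authentication.

```python
from dataclasses import dataclass
from datetime import time


@dataclass
class AccessRequest:
    role: str
    location: str
    device: str
    at: time  # local time of the request


# Hypothetical rules per data sensitivity class; a missing key means
# "no restriction on this attribute".
ACCESS_RULES = {
    "public": {},
    "internal": {"roles": {"analyst", "steward"}},
    "personal": {
        "roles": {"steward"},
        "locations": {"office"},
        "devices": {"managed-laptop"},
        "hours": (time(8, 0), time(18, 0)),
    },
}


def is_allowed(sensitivity, request):
    """Check a request against the rule for a dataset's sensitivity class."""
    rule = ACCESS_RULES.get(sensitivity)
    if rule is None:
        return False  # unknown classification: deny by default
    if "roles" in rule and request.role not in rule["roles"]:
        return False
    if "locations" in rule and request.location not in rule["locations"]:
        return False
    if "devices" in rule and request.device not in rule["devices"]:
        return False
    if "hours" in rule:
        start, end = rule["hours"]
        if not (start <= request.at <= end):
            return False
    return True


daytime = AccessRequest("steward", "office", "managed-laptop", time(10, 30))
late = AccessRequest("steward", "office", "managed-laptop", time(22, 0))
print(is_allowed("personal", daytime), is_allowed("personal", late))
```

Because the rules are data rather than code, changing who may see what becomes a curation task instead of a development task, which is the point of the paragraph above.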
Reusable: your new smart data framework can help increase the time your Data Managers (/Scientists, Analysts) spend using data rather than trying to find it (the 80/20 data science dilemma). It can also help reduce the risk to your AI projects (an often-cited 50% failure rate) by helping searchers find the right data, with its meaning and context, more easily. Reuse is also supported by design: metadata carries multiple attributes, a usage licence and provenance, in line with community standards.
Users and information behaviour (personas)
From experience we have defined the following broad conceptual user-groups:
- Data Managers, a.k.a. Data Ops or Data Scientists
Data Managers are, for example, knowledge engineers, taxonomists and analysts.
- Data Stewards
Data Stewards are responsible for Data Governance, such as data lineage.
- Business Professionals/Business end-users
Business Professionals may have diverse backgrounds, hence the broader label Business end-users.
- Actor Systems are the different information systems, applications and services that integrate information via the rich open APIs of the Smart Data Catalogue.
Together, the collaborative actors outlined above (user groups E-H), their information behaviours (personas), the data (repository) and the services (components) build the foundation for more FAIR data management within your organisation. They also give you the option to contribute to an even broader, shared and open FAIR information commons.
- (E) Data Ops workplace and dashboard: a combination of tools supporting the Data Ops data management processes, for the information behaviours data provision agent, enricher and developer.
- (F) Data Governance workplace: the tools that support Data Stewards’ collaborative data governance work with Data Managers, for the information behaviour data owner.
- (G) Access, search and visualize APIs: the user experience for exploring, finding and interacting with the catalogue and data, for the information behaviours searcher and referrer.
- (H) API: the set of open APIs that give consuming information systems access to catalogue data, for the information behaviour referrer (a.k.a. data exchange).
Potential tooling for this smart data framework:
- Search, integration and analytics: Findwise i3, Elastic, Fusion
- Semantic tools: PoolParty, Synaptica, Smartlogic, TopQuadrant
- Catalogue and integration: Entryscape, Informatica, Stibo, Talend, Marklogic
We hope you enjoyed this post and understand the potential benefits such a smart data framework incorporating FAIR data principles can have on your data catalogue, or for that matter, your organisational content or even your data swamps.
Fredric Landqvist research blog
In the next post, Toward data-centric solutions with RDF & Knowledge graphs, we talk about Knowledge Graphs (KGs) and their non-proprietary RDF semantic web tech, how you can create your own KG(s) and the benefits they can bring to your future data landscape.