This is the first post in a new series by Fredric Landqvist and Peter Voisey, explaining how your organisation could best shape its data landscape for the future.
A Quest for a FAIR Information Commons
You might have heard recently of the phrase, “data that saves lives”. It certainly can, but just as you need to be in shape to do your work, so does data, to work its magic. Data needs to be shaped by governing principles applied along its life journey, so that we can reap the rewards and benefits that are there to be had. Data in shape saves lives.
We all need to fix problems, usually quickly, hence the prevalence of closed models, data silos and poor data interoperability. It has had to happen this way, it will continue to do so, and there’s no shame in that. But if we can be part of a reliable data-sharing community, whose data helps us collaborate and solve problems better, we’d be foolish to turn it down.
So imagine a kind of information commons. This isn’t so far-fetched; we just need to widen our horizons and our collaborative ecosystem for it to happen, and perhaps apply the same model internally within our own organisations.
The challenge of really saving lives with data requires new collaborators, and as collaborators we require trust. To be part of this challenge, then, we need to be willing to share data (we use the terms data, content and information interchangeably here). Proof of that trust is signing an agreement to be part of an information commons in which data is governed by certain principles (a.k.a. terms and conditions, T&Cs). In essence, rules of engagement:
- Declare interest
- Sign a future rules of engagement to share and access data
- Get ready to adhere to them
The T&Cs largely apply to the condition of the data being shared and the information describing it. They match precisely how you would hope to find data in this new treasure trove. They may also be known as F.A.I.R. – data that is Findable, Accessible, Interoperable and Reusable. FAIR obviously also alludes to fairness in collaboration, and the F.A.I.R. data principles originate from a good sharing place.
Here’s a great summary in image form, from the Australian National Data Service [ANDS]:
Still here? Great! Let’s get started then with Findable!
Findable
We can only truly make data findable when we think about the range of people who might want to find it and how they might want to use or reuse it (their need determines how they will ask for it). The reality of different data sources, formats, protocols and their possible attributes or descriptors makes describing data for others problematic; besides, do you really have the time for tagging? Regardless of time, we’re not very good at putting ourselves in somebody else’s shoes (unless of course we’re selling something), and we’re certainly not able to cover the variation in how people with differing perspectives search for data.
The best answer we have at the moment is to describe data or datasets using agreed standards, which may vary a little from domain to domain. Sharing or uploading data into the “ether” feels different from uploading data that matters to a known shared source, accessed by users who understand its value. The latter may inspire us to describe data according to a collective standard, with the feeling of having done something good for a bigger cause.
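As a minimal sketch of what “describing data using an agreed standard” could look like in practice: the field names below follow a DCAT-like profile, but the exact vocabulary, the required fields and the example dataset are all assumptions; a real commons would publish its own profile.

```python
# Hypothetical minimal metadata profile for the commons (DCAT-like field names).
REQUIRED_FIELDS = {"title", "description", "publisher", "issued", "licence", "keywords"}

def validate_description(record: dict) -> list:
    """Return the sorted list of required fields missing from a dataset description."""
    return sorted(REQUIRED_FIELDS - record.keys())

# Illustrative dataset description (publisher and values are made up).
dataset = {
    "title": "Regional vaccination coverage 2019",
    "description": "Weekly vaccination coverage per region.",
    "publisher": "Example Health Registry",
    "issued": "2019-06-01",
    "keywords": ["vaccination", "public health"],
}

print(validate_description(dataset))  # → ['licence']
```

A check like this is what lets an upload flow tell a contributor, before anything is published, exactly which parts of the agreed description are still missing.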
But hang on. Why is the onus on the end-user? We have the technology now to automate much of this process. We just need a good sharing and upload design that can recognise the (hopefully evolving) standards of description (metadata). By processing data on upload, we can better understand it with reference to our standards and rules. According to what the machine recognises (pattern matching) or “understands” (by way of concept relationships in a knowledge graph), it can annotate the data, ready to serve the requests of data searchers and data applications, or at least offer a related alternative.
Such processing is done using AI (NLP, ML, etc.), but it’s not magic: we still have to teach our machines the agreed standards and rules in the first place. While that may sound cumbersome, it’s not as if you have to keep teaching them repeatedly. Conversely, the student (the AI) can also suggest new rules and annotations, keeping them current with the data being processed. The real beauty, though, is that we can employ more than one descriptive rule set for different data or datasets: depending on data source, format and context, the machine can activate different metadata rule sets. The smart part for the uploader is being presented with a semi-automated metadata form for their data, leaving them to confirm or alter it before hitting send. The “uploader” here is a broad concept covering any agent, programmatic or human, that contributes data to the shared information space.
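The semi-automated step described above can be sketched simply: pick a rule set based on the upload’s context, pattern-match the content to suggest annotations, and hand the prefilled form back to the uploader to confirm. The rule sets, patterns and tags here are purely illustrative (a real system would draw them from a knowledge graph or controlled vocabulary), and this uses plain regular expressions rather than any particular NLP library.

```python
import re

# Illustrative per-context rule sets: (pattern, tag-to-suggest) pairs.
RULE_SETS = {
    "clinical_csv": [
        (re.compile(r"\bICD-10\b", re.I), "diagnosis-codes"),
        (re.compile(r"\bpatient[_ ]?id\b", re.I), "patient-level"),
    ],
    "generic_text": [
        (re.compile(r"\bvaccin\w*", re.I), "vaccination"),
    ],
}

def suggest_metadata(content: str, context: str) -> dict:
    """Prefill a metadata form from whichever rule set matches the upload context."""
    rules = RULE_SETS.get(context, RULE_SETS["generic_text"])
    tags = [tag for pattern, tag in rules if pattern.search(content)]
    # The uploader confirms or edits the suggestions before hitting send.
    return {"suggested_tags": tags, "confirmed": False}

form = suggest_metadata("weekly vaccination rates, patient_id column removed", "clinical_csv")
print(form["suggested_tags"])  # → ['patient-level']
```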
Let’s not forget we’re at the stage where we can use “search” not only for indexing and this automatic annotation, but also for calculations during parsing, potentially annotating with even higher understanding. Such a solution fits well with the increasing demand for real-time data, too.
So Findable is really about making data both smarter and findable.
Accessible
There’s nothing worse than finding something you want, only to be told you can’t use it.
While the premise of an information commons is sharing, it doesn’t necessarily mean that everything is accessible to everyone – which is why some readers left this page at the third paragraph.
Let’s be clever about this. There are lots of ways to control accessibility automatically and to police it automatically: technical measures, IP address, sign-on, authorisation (classification of user) and so on. But it could also be done by processing data on upload, determining its sensitivity level and/or spotting indicators of GDPR-relevant data.
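Upload-time screening of this kind can be surprisingly simple to prototype: scan the data for personal-data indicators and assign a sensitivity level that access control can then act on. The two patterns below (email addresses and Swedish-style personal identity numbers) are illustrative assumptions only; a real deployment would need a fuller, vetted indicator set and human review of edge cases.

```python
import re

# Illustrative GDPR-indicator patterns; not an exhaustive or vetted set.
GDPR_INDICATORS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "personal_number": re.compile(r"\b\d{6}[-+]\d{4}\b"),  # Swedish-style personnummer
}

def classify_sensitivity(text: str) -> str:
    """Flag data as 'restricted' if any personal-data indicator matches, else 'open'."""
    hits = [name for name, pattern in GDPR_INDICATORS.items() if pattern.search(text)]
    return "restricted" if hits else "open"

print(classify_sensitivity("contact: ada@example.org"))     # → restricted
print(classify_sensitivity("aggregate counts per region"))  # → open
```

The point is that the sensitivity label is produced at upload, so the access-control layer never has to guess after the fact.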
Back to the end-user: they don’t want to see things they can’t use, but they do want to see from the get-go whether they need any new software to access the data they are interested in.
Interoperable
Now for the hard part. The reality is that variety in data sources, protocols and formats isn’t going away any time soon; we have to accept that. We’ve just touched on technical interoperability under Accessible. There’s also language interoperability (cultural and linguistic), which can again be addressed by combining a knowledge graph with search (tinkering with knowledge graphs, just like Google does).
Lastly, there’s data interoperability. The barriers preventing data and system interoperability are slowly being brought down through collaboration. In the meantime, we can convert key data into a common data format, so that AI and inferencing can be applied across different (previously incompatible) datasets – the kind of thing that can lead to computation-derived insights a human alone couldn’t reach. Converting data to RDF is a case in point: a real lingua franca of data, also connected to the Web.
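To make that concrete, here’s a minimal sketch of turning a tabular record into RDF triples, serialised as N-Triples. The namespace, predicate names and record are hypothetical placeholders, not a published vocabulary, and string values are emitted as plain literals for simplicity.

```python
# Hypothetical base namespace for the commons; a real one would be agreed and stable.
BASE = "http://example.org/commons/"

def record_to_ntriples(record_id: str, record: dict) -> list:
    """Emit one N-Triples line per field of a flat record."""
    subject = f"<{BASE}dataset/{record_id}>"
    return [
        f'{subject} <{BASE}vocab/{key}> "{value}" .'
        for key, value in record.items()
    ]

triples = record_to_ntriples("42", {"title": "Coverage 2019", "region": "Uppsala"})
print("\n".join(triples))
# First line:
# <http://example.org/commons/dataset/42> <http://example.org/commons/vocab/title> "Coverage 2019" .
```

Once two previously incompatible datasets share subjects and predicates like this, they can be loaded into one triple store and queried or reasoned over together.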
Reusable
The “F.A.I.” part of FAIR really already covers Reusable: we want to be able to find data that we can reuse. To do this, we need to see, alongside the content, information, datasets and data catalogues we find, related information on the how, what, when, who, why and where of their potential usage. More standing on the shoulders of giants, less reinventing the wheel. The rich metadata associated with Reusable also speaks to usefulness: value, age and provenance.
Healthcare Data Commons
There is an emerging FAIR data paradigm shift within the health informatics and research communities, sparked by those in the bio- and life-science domains.
There are obvious regulatory constraints around patient data, or health data, that any data commons arena will have to nail, up front.
Health data: quality-register data, EHR data and patients’ self-created data would, together, be a real gold mine in the pursuit of personalised medicine and healthcare. Patient-centric data and FAIR data governance will be key.
The outlined scenario for a FAIR Data Commons
The illustration above shows a FAIR data commons. It will be the foundational framework for all information systems (registers) in the data ecology; these information systems need to harmonise and align to become FAIR. There is a set of generic agent information-behaviour patterns (user personas):
- Data provision agent: an information behaviour covering either a human actor who uploads (provisions) data or machine-to-machine integration that contributes datasets to the register.
- Data owner: an information behaviour relating to governance, ownership and stewardship of the datasets in the register.
- Application builder: an information behaviour relating to building capabilities that use and reuse the datasets in the register.
- Data enricher: an information behaviour relating to expanding the models and enriching the datasets, e.g. using linked data and semantics to create richer metadata.
- Searcher: an information behaviour relating to finding and acting upon data.
- Referrer: an information behaviour relating to using data in information flows and data exchange, supporting different kinds of processes, activities and actions with other actors in the ecology.
The business value (effect) realised through the FAIR data commons will come via e-services used in the searcher and referrer scenarios, but also through improved efficiency and improved data quality across the other information behaviours.
Next post in the series: Making Your data smart and F.A.I.R. Further reading to help inspire you:
- FAIRsharing.org provides very useful resources as building blocks for creating any context-specific data commons.
- The National Institutes of Health (NIH) in the USA has a Data Commons programme with ongoing pilots.
- Similarly, in the Nordics there are initiatives (the Finnish catalogue, the Swedish register [RUT], HelseData in Norway and Danish Healthdata) coordinated via EU-funded research programmes.
- The life-science industry, together with healthcare, has some impressive initiatives, e.g. Electronic Health Records 4 Clinical Research [EHR4CR] with its information platform InSite (by TriNetX), in line with FAIR data.