Making your data F.A.I.R and smart

This is the second post in a new series by Fredric Landqvist & Peter Voisey, explaining how your organisation could best shape its data landscape for the future.

How to create a smart data framework for your organisation

In our last post for you, we presented the benefits of F.A.I.R data, how to make data smarter for search engines and the potential of an Information Commons. In this post, we're giving you the pragmatic steps to make your data FAIR by creating and applying your own smart data framework. Your data-sharing dream, internally and externally, is possible.

A smart data framework, using FAIR data principles, encompasses the tooling, models and standards that govern datasets and the different context-specific information systems (registers, catalogues). The data is then ingested and processed (enriched/refined) into smart data, datasets and data catalogues. It can then be used and reused by different applications and e-services via open APIs. In this ecosystem, all actors and information behaviours (personas) interplay: provision agents, owners, builders, enrichers, end-user searchers and referrers.

The workings of a smart data framework

A smart data & metadata catalogue   

A smart data & metadata catalogue (illustrated below) provides an organisational capability that aligns data management with the FAIR data principles. View it not so much as one system to rule them all, but rather as an ecosystem that is smart and sustainable. To simplify your complex and heterogeneous information environment, this set-up can be instantiated as one overarching mechanism. Although we are describing a data and metadata catalogue here, the exact same framework and set-up would of course also apply to your organisation's content, making it smarter and more findable (i.e. it gets the sustainable stamp).

Smart Data Catalogue
The necessary services and components of a smart data catalogue

The above picture illustrates the services and components that, together, build smart data and metadata catalogue capabilities. We now describe each one of them for you:

Processing (Ingestion & Enrichment) for great Findability & Interoperability

  • (A) Ingest, harvest and operate. Here you connect the heterogeneous data sources for ingestion.

The configured input mechanisms describe each of the data sources, with their data, datasets and metadata ready for your catalogue search. Hopefully, at the dataset upload stage, you have provided a good system/form that now feeds your search engine with great metadata (we recommend the open data catalogue standard DCAT-AP; a small illustrative sketch follows below). The upload concept covers machine-to-machine harvesting mechanisms, as with open data, as well as traditional data integration and manual provision by human upload effort.

  • (D) Enterprise Metadata Repository. This is the persistent storage of data in both the data catalogue, the index and the graph. All things get a persistent ID (how to design a persistent URI) and rich metadata.
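To make the metadata recommendation in (A) concrete, here is a minimal, hedged sketch of what a DCAT-AP-style dataset description could look like, expressed as JSON-LD and built in Python. The identifiers, URLs and property values are illustrative assumptions, not a complete DCAT-AP profile.

import json

# A minimal DCAT-AP-style dataset description as JSON-LD.
# All identifiers and values below are illustrative assumptions.
dataset_record = {
    "@context": {
        "dcat": "http://www.w3.org/ns/dcat#",
        "dct": "http://purl.org/dc/terms/",
    },
    "@id": "https://data.example.org/datasets/air-quality-2023",  # persistent URI (assumed)
    "@type": "dcat:Dataset",
    "dct:title": "Air quality measurements 2023",
    "dct:description": "Hourly air quality readings from city sensors.",
    "dct:publisher": {"@id": "https://example.org/organisations/environment-dept"},
    "dct:license": {"@id": "http://creativecommons.org/licenses/by/4.0/"},
    "dcat:keyword": ["air quality", "environment", "sensors"],
    "dcat:distribution": [
        {
            "@type": "dcat:Distribution",
            "dcat:accessURL": {"@id": "https://data.example.org/api/air-quality"},
            "dct:format": "CSV",
        }
    ],
}

# Serialise the record so a harvester or upload form can pass it to the catalogue.
print(json.dumps(dataset_record, indent=2))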

  • (B) Enrich, refine, analyze and curate. This is the AI part (NLP, Semantics, ML) that enriches the data and datasets, making them smarter.

Concepts (read also: entities, terms, phrases, synonyms, acronyms etc.) from the data sources are found using named entity recognition (NER). By referring to a Knowledge Graph in the Enricher, the appropriate resources are annotated ("tagged") with these concepts. It does not end there, however. Each concept also brings with it, from the Knowledge Graph, all of the known relationships it has with other concepts.

Essentially, a Knowledge Graph is your domain knowledge encoded in a connected graph format. It is by reading these encoded relationships that the machine "understands" the meaning or aboutness of the data.
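As a rough illustration of this enrichment step, the sketch below uses spaCy for entity recognition and a toy in-memory "knowledge graph" (a plain Python dict) to attach concepts and their relationships to a text. The model name, concepts and relationships are assumptions made for the example; a real Enricher would query an actual graph store.

import spacy  # assumes the small English model has been installed

# Toy "knowledge graph": each concept carries its known relationships.
KNOWLEDGE_GRAPH = {
    "Stockholm": {"type": "City", "locatedIn": "Sweden", "related": ["Gothenburg"]},
    "Sweden": {"type": "Country", "memberOf": "EU", "related": ["Norway"]},
}

nlp = spacy.load("en_core_web_sm")

def enrich(text):
    """Annotate recognised entities with concepts and relationships from the graph."""
    doc = nlp(text)
    annotations = []
    for ent in doc.ents:
        concept = KNOWLEDGE_GRAPH.get(ent.text)
        if concept:  # only tag mentions we actually know about
            annotations.append({"mention": ent.text, "label": ent.label_, "concept": concept})
    return annotations

print(enrich("The dataset covers traffic flows in Stockholm and the rest of Sweden."))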

This opens up a very nice Pandora's box for your search (understanding query intent) and for your Graphical User Interface (GUI), as your data becomes smarter through your ability to exploit the relationships and connections (semantics and context) between concepts.

You and AI can have a symbiotic relationship in the development of your Knowledge Graph. AI can suggest new concepts and relationships as new data is added. It is, however, you and your colleagues who determine which concepts and relationships belong in the Knowledge Graph: the ones that are important to your department or business. Remember that you can utilise more than one knowledge graph, or part of one, for particular business needs or data sources. The Knowledge Graph is a flexible expression of your business/information models that gives structure to all your data and its access.

Extra optional step: if you can manage to index not only the dataset metadata but the datasets themselves, you can make your Pandora's box even nicer. Those cryptic/nonsensical field names that your traditional database experts love to create can also be incorporated and mapped (one time only!) into your Knowledge Graph, increasing the machine "understanding" of the data. This gives the data asset a better chance of being used more widely.

The configuration of processing with your Knowledge Graph can take care of dataset versioning and lineage, and add further specific classifications, e.g. data sensitivity, user access and personal information.

Lastly on Processing: your cultural and system interoperability is immensely improved. We're not talking about everyone speaking the same language here, but rather everyone talking their own language (/culture) and still being able to find the same thing. Here, open and FAIR vocabularies further enrich the meaning of your data, and your metadata becomes linked. System interoperability is partially achieved by exploiting the graph of connections that now "sits over" your various data sources.

Controlled Access (Accessible and Reusable)

  • (C) Access, search and visualize APIs. These tools control and influence the delivery, representation, exploration and consumption/use of datasets and data catalogues, via a smarter search (made so by smarter data) and a more intuitive Graphical User Interface (GUI).

This means your search can now “understand” user intent from just one or two keyword queries (through known relationship connections in the Knowledge Graph). 
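A hedged sketch of what this "understanding" can mean in practice: expand a short query with concepts the Knowledge Graph marks as related, and pass the expanded term list on to the search engine. The graph content below is an assumption, stored in a plain dict for brevity.

# Toy relationship store; a real system would query the Knowledge Graph itself.
RELATED_CONCEPTS = {
    "air quality": ["emissions", "NO2", "particulate matter"],
    "emissions": ["air quality", "traffic"],
}

def expand_query(query, max_related=3):
    """Return the original query plus graph-related concepts, for broader recall."""
    related = RELATED_CONCEPTS.get(query.lower(), [])[:max_related]
    return [query] + related

# "air quality" becomes ["air quality", "emissions", "NO2", "particulate matter"]
print(expand_query("air quality"))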

Your search now also caters for your searchers who are searching in an unfamiliar subject area or are just having a query off day. Besides offering the standard results page, the GUI can also present related information (again due to the Knowledge Graph), past related user queries, information and question-answer (Q&A) type material. So: search, discovery, learning, serendipity.

Your GUI can also now become more intuitive, changing its information presentation and facets/filters automatically, depending on the query itself (more sustainable front-end coding). 

As an alternative to complex scenario coding, you can also create rules (set in your Knowledge Graph) that control what data users can access (when, how and where) based on their profile, their role, their location, the time and the device they are using. This same Knowledge Graph can help push and recommend data to certain users proactively. Accessibility is made possible by using standard communication protocols, open access (when possible), authentication where necessary, and always with metadata at hand.
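A minimal sketch of such a rule, written here as plain code rather than stored in the graph, just to show the idea: access is granted only when the user's role and device satisfy the dataset's classification. The roles, classifications and the rule itself are assumptions for illustration.

from dataclasses import dataclass

@dataclass
class UserContext:
    role: str
    location: str
    device: str

def may_access(dataset_classification, user):
    """Assumed rule: sensitive datasets require an analyst role on a managed device."""
    if dataset_classification == "public":
        return True
    if dataset_classification == "sensitive":
        return user.role == "analyst" and user.device == "managed"
    return False  # deny by default for unknown classifications

print(may_access("sensitive", UserContext("analyst", "HQ", "managed")))  # True
print(may_access("sensitive", UserContext("guest", "remote", "byod")))   # False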

Reusable: your new smart data framework can help increase the time your Data Managers (/Scientists, Analysts) spend using data (rather than trying to find it, the 80/20 data science dilemma). It can also help reduce the risk to your AI projects (50% failure rate) by helping searchers find the right data, with its meaning and context, more easily. Reuse is also made possible by design, with multiple metadata attributes, usage licences and provenance recorded in line with community standards.

Users and information behaviour (personas)

Users and personas
User groups and services

From experience we have defined the following broad conceptual user-groups:

  • Data Managers, a.k.a. Data Ops or Data Scientists
    Data Managers are, for example, knowledge engineers, taxonomists and analysts.
  • Data Stewards
    Data Stewards are responsible for data governance, such as data lineage.
  • Business Professionals/Business end-users
    Business users may have diverse backgrounds, hence the broader label Business end-users.
  • Actor Systems are the different information systems, applications and services that integrate information via the rich open APIs of the Smart Data Catalogue

The outlined collaborative actors (the user groups above) and their interplay, as information behaviours (personas), with the data (repository) and services (components E-H below) together build the foundation for more FAIR data management within your organisation, while at the same time giving you the option to contribute to an even broader, shared, open FAIR information commons.

  • (E) Data Ops workplace and dashboard: a combination of tools supporting the Data Ops data management processes, in the information behaviours: data provision agents, enrichers and developers.
  • (F) Data Governance workplace: the tools that support the Data Stewards' collaborative data governance work with Data Managers, in the information behaviour: data owner.
  • (G) Access, search and visualize APIs: the user experience for exploring, finding and interacting with the catalogue and data, in the information behaviours: searcher and referrer.
  • (H) API: the set of open APIs that support access to catalogue data for consuming information systems, in the information behaviour: referrer (a.k.a. data exchange).

Potential tooling for this smart data framework:

We hope you enjoyed this post and understand the potential benefits such a smart data framework incorporating FAIR data principles can have on your data catalogue, or for that matter, your organisational content or even your data swamps.


In the next post, Toward data-centric solutions with Knowledge Graphs, we talk about Knowledge Graphs (KGs) and their non-proprietary RDF semantic web technology, how you can create your KG(s) and the benefits they can bring to your future data landscape.

Authors: Fredric Landqvist and Peter Voisey

Well-known findability challenges in the AI-hype

Organisations are facing new types of information challenges in the AI-hype. At least the use cases, the data and the technology are different. The recommended approach and the currently experienced findability challenges remain, however, the same.

Findability is getting worse as the data landscape is changing

As clearly shown in the results of the 2019 Search & Findability Survey, finding relevant information is still a major challenge for most organisations. In the internal context, as many as 55% find it difficult or very difficult to find information, which brings us back to the same levels we saw in the survey results from 2012 and 2013.

Given the main obstacles that the respondents experience to improve search and findability, this is not very surprising:

  • Lack of resources/staff
  • Lack of ownership/mandate
  • Poor information quality

One reason behind the poor information quality might be the decreasing focus and effort spent on traditional information management activities such as content life cycle, controlled vocabularies and metadata standards, as illustrated in the diagrams below*. In 2015-16 we saw an increase in these activities, which made perfect sense since "lack of tags" or "inconsistent tagging" were considered the largest obstacles to findability in 2013-2015. Unfortunately, the lack of attention to these areas does not seem to indicate that information quality has improved; rather the opposite.

(*percent working with the noted areas)

A likely reason behind the experienced obstacles and the lack of resources to improve search and findability is a shift of focus in data and metadata management efforts, following the rapid restructuring of the data landscape. In the era of digital transformation, attention is rather on the challenge of identifying, collecting and storing the massive amounts of data being generated from all sorts of systems and sensors, both within and outside the enterprise. As a result, it is no longer only unstructured information and documents that are hard to find, but all sorts of data being aggregated in data lakes and similar data storage solutions.

Does this mean that search and findability of unstructured information is no longer relevant? No, but in combination with finding individual documents, the target groups in focus (typically Data Scientists) have an interest in finding relevant and related data(sets) from various parts of the organisation in order to perform their analysis.

Digital (or data-driven) transformation is often focused on utilising data in combination with new technology to reach levels 3 and 4 in the below "pyramid of data-driven transformation" (from In search for insight):

This is also illustrated by the technology trends that we can see in the survey results, presented in the article "What are organisations planning to focus on to improve Search and Findability?". Two of the most rapidly emerging technologies are Natural Language Processing (NLP) and Machine Learning, both key components of what is often labelled "AI". Using AI to drive transformation has become the ultimate goal for many organisations.

However, as the pyramid clearly shows, to realise digital transformation, automation and AI, you must start by sorting out the mess. If not, the mess will grow by the minute, quickly turning the data lake into a swamp. One of the biggest challenges for organisations in realising digital transformation initiatives still lies in how to access and use the right data.  

New data and use cases – same approach and challenges

The survey results indicate that, irrespective of what type of data you want to make useful, you need to take a holistic approach to succeed. In other words, if you want to get past the PoC phase and achieve true digital transformation, you must consider all perspectives:

  • Business – Identify the business challenge and form a common vision of the solution
  • User – Get to know your users and what it takes to form a successful solution
  • Information – Identify relevant data and make it meaningful and F.A.I.R.*
  • Technology – Evaluate and select the technology that is best fit for purpose
  • Organisation – Establish roles and responsibilities to manage and improve the solution over time

You might recognise the five findability dimensions that were originally introduced back in 2010 and that are more relevant than ever in the new data landscape. The survey results and the experienced obstacles indicate that the main challenges will remain, and even increase, within the dimensions of information and organisation.

Also, it is important to remember that to create value from information it is not always necessary to aim for the top of the pyramid. In many cases it is enough to extract knowledge, and thereby provide better insights and decision support, by aggregating relevant data from different sources. Provided, that is, that the data quality is good enough.

*A strategy for sustainable data management implies leaning on the FAIR Data Principles:

  1. Make data Findable, through persistent IDs, rich metadata and indexes, combining ID and index.
  2. Make data Accessible, through standard, open and free communication protocols, with authentication mechanisms where necessary, and always keep the metadata available.
  3. Make data Interoperable, through the use of vocabularies, terminologies and glossaries; use open vocabularies/models and link the metadata.
  4. Finally, make data Reusable, by using multiple metadata attributes, setting constraints based on licences, and expressing provenance, to build trusted, quality datasets leaning on community standards.

Author: Mattias Ellison, Findability Business Consultant

What are organisations planning to focus on to improve Search and Findability?

This year's Search and Findability survey gave us a good indication of upcoming trends on the market. The activities and technologies that organisations are planning to start working with are all connected to improving effectiveness. By using technology to automatically perform tasks, and by understanding users' needs and giving them a tailored search experience, there is a lot of potential to save time and effort.

Top 5 activities organisations will focus on:

  • Natural language search interface, e.g. Query aid or chatbots (29%)
  • Personalisation e.g. tailored search experience (27%)
  • Automatic content tagging (24%)
  • Natural Language Processing, NLP (22%)
  • Machine Learning (20%)

The respondents planning to start working with one of these areas are more likely to be interested in, or already working with, the other areas in the top 5. For example, of the respondents saying that they are planning to use a natural language search interface, 44% are planning to start with personalisation as well. If you were to add the respondents already working with personalisation to that amount, it would increase by 75%. This might not be a big surprise, since the different areas are closely related to one another. A natural language search interface can support a tailored search experience, in other words lead to personalisation. Automatic content tagging can be enabled by using techniques such as NLP and Machine Learning.

A Natural Language Search interface is a way of trying to find targeted answers to user questions. Instead of searching based on keywords, the goal is to understand the question and generate answers with higher relevancy. Since a large proportion of the questions asked in an organisation are similar, you could save a lot of time by clustering them and/or providing answers automatically using a conversational UI. Learn more about Conversational UI.

One way to improve the Natural Language Search interface is by using Natural Language Processing (NLP). The aim of NLP is to improve a computer's understanding of human language, for example by interpreting synonyms and spelling mistakes. NLP started out as a rule-based technique that was manually coded, but the introduction of Machine Learning (ML) improved the technology further. By using statistical techniques, ML makes it possible to learn from data without having to manually program the computer system. Read more about improving search with NLP.
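As a small illustration, the sketch below normalises a query with a synonym list and absorbs spelling mistakes with fuzzy matching from Python's difflib. The vocabulary and threshold are assumptions; in production this would more likely sit in the search engine's analysis chain (synonym filters, fuzzy queries) than in application code.

import difflib

SYNONYMS = {"car": "vehicle", "auto": "vehicle"}
KNOWN_TERMS = ["vehicle", "invoice", "personalisation", "machine learning"]

def normalise_query(query):
    """Map synonyms to a canonical term and correct likely misspellings."""
    terms = []
    for word in query.lower().split():
        word = SYNONYMS.get(word, word)
        # Fuzzy-match against the known vocabulary to absorb spelling mistakes.
        match = difflib.get_close_matches(word, KNOWN_TERMS, n=1, cutoff=0.8)
        terms.append(match[0] if match else word)
    return terms

print(normalise_query("auto invocie"))  # ['vehicle', 'invoice']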

Automatic content tagging is a trend that we see within the area of Information Management. Instead of relying on user-created tags (of varying quality), the tags are created automatically based on different patterns. The advantage of automatic content tagging is that the metadata becomes consistent and the data becomes easier to analyse.

Personalisation, e.g. a tailored search experience, is a way to filter information based on the user profile. Basically, search results are adapted to the user's needs, for example by not showing things the user does not have access to and by promoting results the user frequently looks for. Our findings in this year's survey show that respondents who say they are currently working with personalisation consider that users on both the internal and external site find information more easily. Users that easily find the information they search for tend to be more satisfied with the search solution.


Results from this year's survey indicate that organisations are working with, or planning to work with, AI and cognitive-related techniques. The percentage doing so has grown compared to previous surveys.

Do you want to learn more about cognitive search?

Author: Angelica Lahti, Findability Business Consultant

Comparison of two different methods for generating tree facets, with Elasticsearch and Solr

Let’s try to explain what a tree facet is, by starting with a common use case of a “normal” facet. It consists of a list of filters, each corresponding to a value of a common search engine field and a count representing the number of documents matching that value. The main characteristic of a tree facet is that its filters each may have a list of child filters, each of which may have a list of child filters, etc. This is where the “tree” part of its name comes from.

Tree facets are therefore well suited to represent data that is inherently hierarchical, e.g. a decision tree, a taxonomy or a file system.

Two common methods of generating tree facets, using either Elasticsearch or Solr, are the pivot approach and the path approach. Some of the characteristics, benefits and drawbacks of each method are presented below.

While ordinary facets consist of a flat list of buckets, tree facets consist of multiple levels of buckets, where each bucket may have child buckets, etc. If applying a filter query equivalent to some bucket, all documents matching that bucket, or any bucket in that sub-tree of child buckets, are returned.

Tree facets with Pivot

The name is taken from Solr (Pivot faceting) and allows faceting within results of the parent facet. This is a recursive setting, so pivot faceting can be configured for any number of levels. Think of pivot faceting as a Cartesian product of field values.

A list of fields is provided, where the first element in the list will generate the root level facet, the second element will generate the second level facet, and so on. In Elasticsearch, the same result is achieved by using the more general concept of aggregations. If we take a terms aggregation as an example, this simply means a terms aggregation within a parent terms aggregation, and so on.
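A hedged sketch of the pivot approach as an Elasticsearch request body built in Python. The index and field names (category, subcategory) are assumptions; in Solr the rough equivalent would be facet.pivot=category,subcategory.

# Two-level "pivot" facet as nested terms aggregations (Elasticsearch request body).
pivot_aggregation = {
    "size": 0,  # we only want the aggregations, not the hits
    "aggs": {
        "level_1": {
            "terms": {"field": "category", "size": 50},
            "aggs": {
                "level_2": {
                    "terms": {"field": "subcategory", "size": 50}
                }
            },
        }
    },
}

# With the official Python client this could be sent as, for example:
# from elasticsearch import Elasticsearch
# es = Elasticsearch("http://localhost:9200")
# response = es.search(index="documents", body=pivot_aggregation)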

Benefits

The major benefit of pivot faceting is that it can all be configured in query time and the data does not need to be indexed in any specific way. E.g. the list of fields can be modified to change the structure of the returned facet, without having to re-index any content.

The values of the returned facet/aggregation are already in a structured, hierarchical format. There is no need for any parsing of paths to build the tree.

Drawbacks

The number of levels in the tree must be known at query time. Since each field must be specified explicitly, it puts a limit on the maximum depth of the tree. If the tree should be extended to allow for more levels, then content must be indexed to new fields and the query needs to include these new fields.

Pivot faceting assumes a uniformity in the data, in that the values on each level in the tree, regardless of their parent, are of the same type. This is because all values on a specific level come from the same field.

When to use

At least one of the following statements holds:

  • The data is homogenous – different objects share similar sets of properties
  • The data will, structurally, not change much over time
  • There is a requirement on a high level of query time flexibility
  • There is a requirement on a high level of flexibility without re-indexing documents

Tree facets with Path

Data is indexed into a single field, on a Unix style file path format, e.g. root/middle/leaf (the path separator is configurable). The index analyzer of this field should be using a path hierarchy tokenizer (Elasticsearch, Solr). It will expand the path so that a filter query for some node in the tree will include the nodes in the sub-tree below the node. The example path above would be expanded to root, root/middle, root/middle/leaf. These represent the filter queries for which the document with this path should be returned. Note that the query analyzer should be keyword/string so that queries are interpreted verbatim.
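A sketch of what such an index configuration could look like in Elasticsearch, expressed as a Python dict; the field name category_path is an assumption. Note the path_hierarchy tokenizer on the index side, the keyword analyzer on the query side, and fielddata enabled so that a terms aggregation can run over the analysed tokens.

# Index settings and mappings for the path approach (Elasticsearch).
path_index_body = {
    "settings": {
        "analysis": {
            "tokenizer": {
                "path_tokenizer": {"type": "path_hierarchy", "delimiter": "/"}
            },
            "analyzer": {
                "path_analyzer": {"type": "custom", "tokenizer": "path_tokenizer"}
            },
        }
    },
    "mappings": {
        "properties": {
            "category_path": {
                "type": "text",
                "analyzer": "path_analyzer",    # expands root/middle/leaf into its prefixes at index time
                "search_analyzer": "keyword",   # filter queries are interpreted verbatim
                "fielddata": True,              # allows a terms aggregation over the analysed tokens
            }
        }
    },
}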

Once the values have been indexed, a normal facet or terms aggregation is put on the field. This will return all possible paths and sub-paths, which can be a large number, so make sure to request all of them. Once the facet/aggregation is returned, its values need to be parsed and built into a tree structure.
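Building the tree from the flat list of expanded paths is then a matter of splitting each bucket key on the separator and nesting the counts. A minimal sketch, assuming buckets in the shape returned by an Elasticsearch terms aggregation:

def build_tree(buckets, separator="/"):
    """Turn flat path buckets such as {"key": "root/middle", "doc_count": 7}
    into a nested tree structure keyed by path component."""
    tree = {}
    for bucket in buckets:
        children = tree
        node = None
        for part in bucket["key"].split(separator):
            node = children.setdefault(part, {"doc_count": 0, "children": {}})
            children = node["children"]
        node["doc_count"] = bucket["doc_count"]  # the count belongs to the last component
    return tree

buckets = [
    {"key": "root", "doc_count": 10},
    {"key": "root/middle", "doc_count": 7},
    {"key": "root/middle/leaf", "doc_count": 3},
]
print(build_tree(buckets))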

Benefits

The path approach can handle any number of levels in the tree, without any configuration explicitly stating how many levels there are, on either the indexing side or the query side. It is also a natural way of handling different depths in different places in the tree; not all branches need to be the same length.

Closely related to the above-mentioned benefit is the fact that the path approach does not impose any restrictions on the uniformity of the tree. Nodes on a specific level in the tree may represent different concepts, depending only on their parent. This fits very well with many real-world applications, as different objects and entities have different sets of properties.

Drawbacks

Data must be formatted at index time. If any structural changes to the tree are required, the affected documents need to be re-indexed.

To construct a full tree representation of the paths returned in the facet/aggregation, all paths need to be requested. If the tree is big, this can become costly, both for the search engines to generate and with respect to the size of the response payload.

Data is not returned in a hierarchical format and must be parsed to build the tree structure.

When to use

At least one of the following statements holds:

  • The data is heterogenous – different objects have different sets of properties, varying numbers of levels needed in different places in the tree
  • The data could change structurally over time
  • The content and structure of the tree should be controlled by content only, no configuration changes

Tree facets – Conclusion

The listed benefits and drawbacks of each method can be used as a guide to find the best method from case to case.

When there is no clear choice, I personally tend to go for the path approach, just because it is so powerful and dynamic. This comes with the main drawback of added cost of configuration for index time data formatting, but it is usually worth it in my opinion.


Author: Martin Johansson, Senior Search Consultant at Findwise

Open or Opaque Artificial Intelligence

Data is the black gold in the information era and has similar value creation and ecology to that of petroleum. Data in its raw format needs to be refined (as does crude oil) to make sense and to add meaning and usefulness to any domain.

AI and its parts (machine learning, natural language processing, deep-learning etc.) are set to be a societal game changer in all collective human imagination domains.


The ambition should be to design for a sustainable AI future, aiming to incorporate the UN's 17 Sustainable Development Goals with ethics at the core. One omnipresent hurdle is still the black-box (opaque) setting, i.e. the difficulty of understanding how, why and where different AI operates and has influence.

The open paradigm

All AI utilities known to man share a simple model:

input → model → output, with feedback (learning).

There is a need to shift the control from the computer back towards the human, and thereby enable the addition of meaning and semantics along with conceptual models.

By using open innovation, open standards, open models (knowledge graphs, ontologies, terminologies, code systems and the like), open software and open platforms (technology stacks, e.g. SingularityNET) in the design of future AI utilities and cognitive computing, there are opportunities to leverage learning in a meaningful way, away from the opaque regime and towards cognitively informed artificial intelligence. Interoperability enables efficient communication that can accommodate data from different semantic domains which have traditionally been separate. Open domain knowledge and datasets (as linked data) provide a very good platform for continuously improving the datasets within the AI loop, both refining and addressing the contextual matter, and enabling improved precision and outcomes.

Informative communication means that a word should allow accurate mental reconstruction of the sender's intended meaning, but we are well aware of the human messiness (complexity) within language, as described by the information bottleneck (Tishby) and rate-distortion theory (Shannon).

To take on the challenges and opportunities within AI, there are strong undercurrents to build interdisciplinary capacities, as with Chalmers AI Research, AI Innovation of Sweden and the like, where computer science, cognitive science, data science, information science, the social sciences and more disciplines meet and swap ideas to improve value creation within different domains, while at the same time beginning to blend industry, the public sector, academia and society together.

The societal challenges that lie ahead open up space for innovation, where AI-assisted utilities will augment and automate for the benefit of humankind and the earth, but doing so requires a balancing act in which the open paradigm is favoured. AI is designed and is an artefact; hence we need to address ethics in its design with ART (Accountability, Responsibility and Transparency), as in the EU draft guidelines on AI ethics.

Tinkering with AI

The emerging development of AI shows a different pathway than that of traditional software engineering. All emerging machine learning, NLP and/or deep learning machinery relies on a tinkering approach of trial and error (re-model, refine the dataset, test with different outcomes and behaviours) before it can reach a maturity level fit for the industrial stages of digital infrastructure, as with Google Cloud or similar services. A great example is image recognition and computer vision, with its data optimisation algorithms and processing steps. Here each development has emerged from previous learnings and tinkering. Sometimes the development and use of mathematical models simply does not add up to real AI utility.

Here, in the value creation (the why in the first place), we should design and use ML, NLP and deep learning with an expected outcome in mind. AI is not, and never will be, the silver bullet for all problem domains in computing! Making sense is, in essence, what is needed, with contextual use cases and utilities, long before we reach Artificial General Intelligence.

On the 25th of April, an event together with the Linked Data Sweden network will cover Sustainable Knowledge Graphs and AI.

Beyond Office 365 – knowledge graphs, Microsoft Graph & AI!

This is the first joint post in a series where Findwise & SearchExplained together decompose Microsoft's realm, with a focus on knowledge graphs and AI. The advent of graph technologies, and more specifically knowledge graphs, has become the epicentre of the AI hyperbole.


The use of symbolic representations of the world, as with ontologies (domain models) within AI, is by no means new. The Cyc project, for instance, started back in the 80s. The most common use for the average Joe would be the Google Knowledge Graph, which links things and concepts. In the world of Microsoft, this has become a foundational platform capacity with the Microsoft Graph.

It is key to separate the wheat from the chaff, since the Microsoft Graph is by no means a Knowledge Graph. It is a highly platform-centric way to connect things, applications, users, information and data. That is good, but it still lacks the capacity to disambiguate complex things of the world, since building a knowledge graph (i.e. an ontology) is not its core functionality.

From a Microsoft-centric worldview, one should combine the Microsoft Graph with different applications and AI to automate and augment working life with Microsoft at work. The reality is that most enterprises do not use Microsoft alone to envelop the enterprise information landscape. The information environment goes far beyond it, into a multitude of organising systems within and outside the company walls.

Question: how does one connect the dots in this maze-like workplace? By using knowledge graphs and infusing them into the Microsoft Graph realm?

Office 365 MDM

The model, artefacts and pragmatics

People at work continuously have to balance between modalities (provision/find/act), independent of work practice or discipline, when dealing with data and information. People also have to interact with groups and imagined entities (i.e. organisations, corporations and institutions). These interactions become the mould from which shared narratives emerge.

Knowledge Graphs (ontologies) are the pillar artefacts where users find a level playing field for communication and the codification of knowledge in organising systems. When linking the knowledge graphs with a smart semantic information engine utility, we get enterprise linked data that connects the dots: a sustainable, resilient model in the content continuum.

Microsoft at Work – the platform, as with Office 365, has some key building blocks: the content model that goes across applications and services. The Meccano pieces, like collections [libraries/sites] and resources [documents, pages, feeds, lists], should be configured with sound resource descriptions (metadata) and organising principles. One of the back-end services to deal with this is the Managed Metadata Service and the cumbersome Term Store (it is not a taxonomy management system!). The pragmatic approach is to infuse/integrate the smart semantic information engine (knowledge graphs) with these foundation blocks. One outstanding question is why Microsoft has left these services unchanged, with few improvements, for many years.

The unabridged pathway and lifecycle of content provision, such as the creation of sites and the curation of documents, will be a guided (automated and augmented [AI & Semantics]) route (in the best of worlds). The Microsoft Graph and its set of APIs and connectors push the envelope with people at the centre. As mentioned, it is a platform-centric graph service, but it lacks a connection to shared narratives (as with knowledge graphs). It offers fuzzy logic, where end-user profiles and behaviour patterns connect content and people, but no, or very limited, opportunity to fine-tune or align these patterns with the models (concepts and facts).

Akin to the provision modality pragmatics above is the find (search, navigate and link) domain in Office 365. The search roadmap from Microsoft, like a yellow brick road, envisions a cohesive experience across all applications. The reality is that it is still a silo search 😉 The Microsoft Graph goes hand in hand with realising personalised search, but it is still constrained in its means to deliver a targeted search experience (a search-driven application) in modern search, which is problematic to say the least. The back-end processing steps, as well as the user experience, do not lean upon the models to deliver e.g. semantic search that connects the dots. Using only end-user behaviour patterns and end-user tags (/system/keyword), the result surfaces as a disjointed experience with low precision and recall.

The smart semantic information engine will usually be a mix of services or platforms that work in tandem, for example:

  1. Semantic Tools (PoolParty, Semaphore)
  2. Search and Analytics (i3, Elastic Stack)
  3. Data Integration (Marklogic, Biztalk)
  4. AI modules (MS Cognitive stack)

In the forthcoming post on the theme Beyond Office 365 unpacking the promised land with knowledge graphs and AI, there will be some more technical assertions.
Authors: Fredric Landqvist and Agnes Molnar (SearchExplained)


Activate conference 2018

Opensource has won! Now, what about AI?

Grant Ingersoll is on stage at the opening of Activate18 explaining the reasoning behind changing the name.

The revolution is won, opensource won, search as a concept to reckon with, they all won.

The times I come across a new search project where someone is pushing anything but opensource search is few and far between these days.

Since search has taken a turn towards AI, a merge with that topic seems reasonable, not to say obvious. But AI in this context should probably be interpreted as AI to support good search results, at least judging from the talks I attended. An interesting step forward is expert systems and similar, none of which were extensively discussed as far as I know. That is a kind of system we work with at Findwise, for instance using NLP, machine learning and text analytics to improve customer service.

Among the more interesting talks I attended was Doug Turnbull's talk on the neural search frontier. Some of the matrix maths threw me back to an ANN course I took 10 years ago, way before I ever learned any matrix maths. Now, well past remembering any matrix maths course I ever took, it is equally confusing, possibly on a slightly higher level. But he pointed out interesting aspects and showed conceptually how Word2Vec vectors work and won't work. Simon Hughes' talk "Vectors in search – Towards more semantic matching" is in the same area, but leans more towards actually using it.

Machine Learning is finally mainstream

If we have a look at the overall distribution of talks, I think it's safe to say that almost all of them touched on machine learning in some way, most commonly using Learning to Rank (LTR) and Word2Vec. None of these are new techniques (our own Mickaël Delaunay wrote a nice blog post about how to use LTR for personalization a couple of years ago). They have been covered before to some extent, but this time around we saw some proper, big-scale implementations that utilize the techniques. Bloomberg gave a really interesting presentation on what their evolution from hand-tuned relevance to LTR over millions of queries has been like. Even if many talks were held on a theoretical/demo level, it is now very clear: it is fully possible and feasible to build actual, useful and ROI-reasonable machine learning into your solutions.

As Trey Grainger pointed out, there are different generations of this conference. A couple of years ago, Hadoop was everywhere. Before that, everything was SolrCloud. This year not one talk description referenced the Apache elephant (though migration to the cloud was still referenced, albeit not in the talk topics). Probably not because big data has gone out of fashion, even though that point was kind of made, but rather because we have other ways of handling and managing it these days.

Don’t forget: shit in > shit out!

And of course, there was the mandatory share of how-we-handle-our-massive-data talks, most prominently presented by Slack, every developer's favourite tool. They showed a MapReduce offline indexing pipeline that not only enabled them to handle their 100 billion documents, but also gave them an environment that was quick on its feet and well suited for testing new stuff and experimenting. That is something an environment of that size usually blocks completely, due to re-indexing times, fear of bogging down your search machines and just general sluggishness.

Among all these super interesting technical solutions to our problems, it is really easy to forget that loads of time still has to be spent getting all that good data into our systems: doing the groundwork, building connectors and optimizing data analysis. It doesn't make for such good talks, though. At Findwise we usually do that using our i3-framework, which enables you to ingest, process, index and query your unstructured data in a nice framework.

I now look forward to doing the not-so-groundwork, using inspiration from the many interesting solutions here at Activate.

Thanks so much for this year!

The presentations from the conference are available on YouTube in Lucidworks playlist for Activate18.

Author and event participant: Johan Persson Tingström, Findability Expert at Findwise

Tinkering with knowledge graphs

I don’t want to sail with this ship of fools, on the opulent data sea, where people are drowning without any sense-making knowledge shores in sight. You don’t see the edge before you drop!

Knowledge engineering, echoencephalogram (Lars Leksell) and neural networks

How do organisations reach a level playing field, where it is possible to create a sustainable learning organisation [cybernetics]?
(Enacted Knowledge Management practices and processes)

Sadly, in many cases, we face the tragedy of the commons!

There is an urgent need to iron out the social dilemmas and focus on motivational solutions that strive for cooperation and collective action: knowledge deciphered with the notion of intelligence, and emerging utilities with AI as an assistant to us humans. We the peoples!

Making a model of the world, codifying our knowledge and enabling worldviews onto complex data is nothing new per se. A Knowledge Graph is, in its essence, a constituted shared narrative within the collective imagination (i.e. an organisation), where facts about things and their inherent relationships and constraints define the model used to master the matrix. These concepts and topics are our means of communication, bridging between groups of people: shared nomenclatures and vocabularies.

Terminology Management

Knowledge Engineering in practice


At work – building a knowledge graph – there are some pillars that the architecture rests upon. First and foremost is the language we use every day to undertake our practices within an organisation: the corpus of concepts, topics and things that revolve around the overarching theme. No entity acts in a vacuum without shared concepts. Humans coordinate work practices through shared narratives embedded in concepts and their translations from person to person. This communication might use different means, like cuneiform (in ancient Babel) or the digital tools of today. To curate, cultivate and nurture a good organisational vocabulary, we also need to develop practices and disciplines that to some extent resemble those of the ancient clay-tablet librarians: organising principles applied to the organising systems (information systems, applications). This discipline could be defined as that of the taxonomist (taxonomy manager), knowledge engineer or information architect.

Set the scope – no need to boil the ocean


All organisations, independent of business vertical, have known domain concepts that are defined by standards, code systems or open vocabularies. A good idea is obviously to first go foraging in the sea of terminologies, to link, re-hash/re-use and manage the domain. The second task in this scoping effort is to audit and map the internal terrain of content corpora. Information is scattered across a multitude of organising systems, but within these there are pockets of structure: here we find glossaries, controlled vocabularies, data models and the like. The taxonomist, together with subject matter experts, then arranges governance principles, engages in conversations on how the outer and inner loops of concepts link, and starts to build domain-specific taxonomies, preferably using the Simple Knowledge Organization System (SKOS) standard.
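A minimal sketch of what the first few SKOS concepts could look like, built with rdflib in Python. The namespace, labels and hierarchy are assumptions made purely for illustration.

from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF, SKOS

EX = Namespace("https://vocab.example.org/")  # assumed namespace for the taxonomy

g = Graph()
g.bind("skos", SKOS)
g.bind("ex", EX)

scheme = EX["transport-scheme"]
vehicle = EX["vehicle"]
bicycle = EX["bicycle"]

g.add((scheme, RDF.type, SKOS.ConceptScheme))
g.add((vehicle, RDF.type, SKOS.Concept))
g.add((vehicle, SKOS.prefLabel, Literal("Vehicle", lang="en")))
g.add((vehicle, SKOS.inScheme, scheme))
g.add((bicycle, RDF.type, SKOS.Concept))
g.add((bicycle, SKOS.prefLabel, Literal("Bicycle", lang="en")))
g.add((bicycle, SKOS.broader, vehicle))  # bicycle is a narrower concept of vehicle
g.add((bicycle, SKOS.inScheme, scheme))

print(g.serialize(format="turtle"))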

Participatory Design from inception


Concepts and their resource description will need to be evaluated and semantically enhanced with several different worldviews from all practices and disciplines within the organisation. Concepts might have a different meaning. Meaning is subjective, demographic, socio-political, and complex. Meaning sometimes gets lost in translation (between different communities of practices).

The best approach to achieve highly participatory design in the development of a sustainable model is simply to publish the concepts as open thesauri. A great example is the HealthDirect thesaurus. This service becomes a canonical reference that people are able to search, navigate and annotate.

It is smart to let people edit, refine and comment on (annotate) the concepts in the same manner as Wikipedia evolves, i.e. editing wiki data entries. These annotations then feed back to the governance network of the terminologies.

Term update

Link to organising systems

All models (taxonomies, vocabularies, ontologies etc.) should be interlinked with the existing base of organising systems (information systems [IS]) or platforms. Most IS's have schemas, in-built models and business rules that serve as applications for a specific use case. This also implies the use of concepts to define and describe the data in metadata, as reference data tables or as user experience controls. In all these Lego pieces within an IS or platform, there are opportunities to link the concepts to the shared narratives in the terminology service: linked enterprise data building a web of meaning, and opening up a more interoperable information landscape.

One omnipresent quest is to set up a sound content model and design for e.g. Office 365, where content types, collections, resource descriptions and metadata have to be concerted in back-end services such as the Managed Metadata Service. Within these features and capacities, it is wise to integrate with the semantic layer (terminologies and graphs). Other highly relevant integrations relate to search-as-a-service, where the semantic layer co-acts in the pipeline steps: adding semantics, linking, auto-classifying and disambiguating with entity extraction. In the user experience journey, the semantic layer augments and connects things, which is, for instance, how the Microsoft Graph has been ingrained throughout their platform. Search and semantics push the envelope 😉

Data integration and information mechanics

A decoupled information systems architecture using an enterprise service bus (messaging techniques) is by far the most used model. To enable sustainable data integration, there is a need for a data architecture and a clear integration design. Adjacent to the data integration are the means for cleaning up data and harmonising datasets into a cohesive whole: extract-transform-load [ETL]. Data governance is essential! In this ballpark we also find cues to master data management. Data and information have fluid properties, and the flow has to be seamless and smooth.

When defining the (asynchronous) message structure in information exchange protocols and packages, it is highly desirable to rely on standards and well-defined models (ontologies), as within the healthcare & life science domain with HL7 FHIR. These standards have domain models with entities, properties, relations and graphs. The data serialisation for data exchange might use XML or RDF (JSON-LD, Turtle etc.). The value sets (namespaces) for properties can then be linked to SKOS vocabularies with terms.

Query the graph

Knowledge engineering is both setting the useful terminologies into action and loading, refining and developing ontologies (information models, data models). There are many very useful open ontologies that could or should be used and refined by the taxonomists, e.g. the ISA² Core Vocabularies. With datasets stored in a graph (triplestore), there are many ways to query the graph to get results and insights (links): either by using SPARQL (similar to SQL in schema-based systems), by combining this with SHACL (constraints), or via RESTful APIs.
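A hedged sketch of querying such a graph with SPARQL via rdflib; the file name and vocabulary contents are assumptions. Against a remote triplestore, the same query could instead be sent over its SPARQL endpoint.

from rdflib import Graph

g = Graph()
g.parse("taxonomy.ttl", format="turtle")  # assumed local Turtle export of the vocabulary

query = """
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?concept ?label ?broader
WHERE {
  ?concept a skos:Concept ;
           skos:prefLabel ?label .
  OPTIONAL { ?concept skos:broader ?broader . }
}
"""

for row in g.query(query):
    print(row.concept, row.label, row.broader)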

These means of querying the knowledge graph are one more reason to add semantics to data integration, as described above.

Adding smartness and we are all done…

Semantic AI, the means to bridge between symbolic representation (semantics) and machine learning (ML), natural language processing (NLP) and deep learning, is where all things come together.

The work (knowledge engineering) to build and govern the knowledge graph takes many manual steps, such as mapping models, standards and large corpora of terminologies. Here AI capacities enable automation and continuous improvement with learning networks. Understanding human capacities and intelligence, unpacking the neurosciences (as Lars Leksell did) and combining this with neural networks will be our road ahead for safe and sustainable uses of AI.
Author: Fredric Landqvist

Reflection, part 2

Some time ago I was writing about the Reflection mechanism in .NET Framework.

This time I will show you a use case, where it’s better NOT TO USE the Reflection.

Introduction

In the previous post about Reflection, I mentioned some doubts about using this mechanism, and one of them actually has its justification.

So, when is it better not to use Reflection, and why?

Consider a method that accepts some object, where we want to access some property on that object inside the method.

private void MyUniversalMethod(object obj)
{
    if (obj.GetType().GetProperty("MyPropertyName") is System.Reflection.PropertyInfo myProperty) //Check if our object actually has the property we're interested in.
    {
        var myPropertyValue = myProperty.GetValue(obj); //Get the property value on our object.
        myProperty.SetValue(obj, new object()); //Set the property value on our object.
    }
} 

Although, technically, there’s nothing wrong with this approach, it should be avoided in most cases, because it totally breaks the concept of strong typing.

How do we do it properly?

If we are in control of the classes we are using, we should always extract the property we want to access in such a method into an interface.

interface IMyInterface
{
    object MyProperty { get; set; }
} 

So our method will look a lot simpler and, what's most important, the compiler upholds the code's integrity, so we don't have to worry about whether the property exists or is accessible on our object, because the interface enforces its accessibility for us:

private void MyUniversalMethod(IMyInterface obj)
{
    var myPropertyValue = obj.MyProperty; //Get the property value on our object.
    obj.MyProperty = new object(); //Set the property value on our object.
} 

But, what if we have no control over the classes?

There are scenarios where we have to use someone else's code and adapt our code to the already existing one. And, what's worse, the property we are interested in is not defined in any interface, but there are several classes that may contain such a property.

Even then, it is still recommended not to use Reflection.

Instead of that, we should filter the objects that come to our method to specific types that actually contain the property we are interested in.

private void MyUniversalMethod(object obj)
{
    if (obj is TheirClass theirClass) //Check if our object is of type that has the property we're interested in. If so, assign it to a temporary variable.
    {
        var theirPropertyValue = theirClass.TheirProperty; //Get the property value on our object.
        theirClass.TheirProperty = new object(); //Set the property value on our object.
    }
} 

The inconvenience of the example above is that we have to specify all the types that might contain the property we are interested in and handle them separately, but this protects us from cases where a property with the same name has a different type in different classes. Here we have full control over what's happening, with a strongly typed property.

Then, what is the Reflection good for?

Although I said it is not recommended in most cases, there are, however, cases where the Reflection approach would be the preferred way.

Consider a list containing names of objects represented by some classes.

We create a method that will retrieve the name for us to display:

private string GetName(object obj)
{
    var type = obj.GetType();
    return (type.GetProperty("Name") as System.Reflection.PropertyInfo)? //Try to get the "Name" property.
        .GetValue(obj)? //Try to get the "Name" property value from the object.
        .ToString() //Get the string representation of the value, if it's string it just returns its value.
        ?? type.Name; //If the tries above fail, get the type name.
}

The property "Name" is commonly used by many classes, though it is very rarely defined in an interface. We can also be almost certain that it will be a string. We can then just look for this property with Reflection and, if we don't find it, use the type name instead. This approach is commonly used in the collection editors of the Windows Forms PropertyGrid.

Use of dynamic keyword

At the point we are certain we don’t want to rely on strong typing, we can access the properties at runtime in an even simpler way, by using the dynamic keyword, which introduces the flexibility of duck typing.

private void MyUniversalMethod(dynamic obj)
{
    var theirPropertyValue = obj.TheirProperty; //Get the property value on our object.
    obj.TheirProperty = new object(); //Set the property value on our object.
} 

This is very useful in cases where we don’t know the type of the object passed to the method at the design time. It is also required by some interop interfaces.

But be careful what you are passing to this method, because in case you try to access a member which doesn’t exist or is inaccessible you will get a RuntimeBinderException.

Note that all members you will try to access on a dynamic object are also dynamic and the IntelliSense is disabled for them – you’re on your own.