Enterprise-Linked-Data and the Connected Digital Workplace

The emerging hyper-connected and agile enterprises of today are stigmatised by their IS/IT-legacy, so the question is: Will emerging web and semantic technologies and practices undo this stigma?

The Shift

Semantic Technologies and Linked-Open-Data (LOD) have evolved since Tim Berners-Lee introduced their basic concepts, and they are now part of everyday business on the Internet, thanks mainly to their uptake by information and data-driven companies like Google, social networks like Facebook and large content sites like Wikipedia. The enterprise information landscape is ready to be enhanced by the semantic web, to increase findability and usability. This change will enable a more agile digital workplace where members of staff can use cloud-based services anywhere, anytime, on any device, in combination with the set of legacy systems backing their line-of-business. All in all, it promises more efficient organising principles for information and data.

The Corporate Information Landscape of today

In the everyday workplace we use digital tools to cope with the tasks at hand. These tools have been put into action to address meta models that structure the social life of dealing with information and data. The legacy of more than 60 years of digital record keeping has left us in an extremely complex environment, where most end-users have a multitude of spaces where they are supposed to contribute. In many cases their information environment lacks interoperability.

A good, or rather bad, example of this is the electronic health records (EHR) of a hospital, where several different health professionals try to codify their on-going work in order to make better-informed decisions regarding the different medical treatments. While this is a good thing, it is heavily hampered by closed-down silos of data that do not work in conjunction with the new, more agile work practices. It is not uncommon for more than 20 different information systems to be used for provisioning during a workday.

The information systems architecture in any organisation or enterprise may comprise home-grown legacy systems from the past, off-the-shelf software suites, and extremely complex enterprise-wide information systems like ERP, BI, CRM and the like. The connections between these information systems (or integration points) often resemble point-to-point “spaghetti”. The work practice for many IT professionals is to map this landscape of connections and information flows, using for example Enterprise Architecture models. Many organisations use information integration engines, like enterprise-service-bus applications or master data applications, as a means to decouple the tight integration and get away from proprietary software lock-in.

On top of all these schema-based, structured-data information systems lies the social and collaborative layer of services, with things like intranets (web based applications), document management, enterprise-wide social networks (e.g. Yammer) and collaborative platforms (e.g. SharePoint) and, more obviously, e-mail, instant messaging and voice/video meeting applications. All of these platforms and spaces where one carries out work tasks hold either semi-structured (document management) or unstructured data.

Wayfinding

Survival in the enterprise information environment requires a large dose of endurance and skill. Many end-users get lost in their quest to find the relevant data when they should be concentrating on making well-informed decisions. Wayfinding is our in-built adaptive way of coping with the unexpected and dealing with it: finding different pathways and means to solve the issues. In other words: findability.

Outside-in and Inside-Out

Today most workers in organisations and enterprises act on the edge of the corporate landscape, in network conversations with customers, clients, patients/citizens, partners, or even competitors, often employing means not necessarily hosted inside the corporate walls. On the Internet we see newly emerging technologies being used and adapted at a faster rate and in a more seamless fashion than the existing cumbersome ones of the internal information landscape. So the obvious question raised in all this flux is: why can’t our digital workplace (the inside information landscape) be as easy to use, and information as easy to find, as in the external digital landscape? Why do I find knowledgeable peers in communities of practice more easily outside than I do on the inside? Knowledge sharing at the outposts of the corporate wall is vivid and truly passionate, whereas inside it is pretty stale and lame, to say the least.

Release the DATA now

Aggregate technologies, such as Business Intelligence and data warehousing, use an extract, transform and load (ETL) mechanism to capture, clean up, transform and load data from all the existing supporting information systems. The problem is that the schemas and structures of things do not combine that easily. Different uses and contexts make even the most central terms difficult to carry over into a new context. This simply does not work. The same problem can be seen in the enterprise search realm, where we try to cope with both unstructured and semi-structured data. One way of solving all this is to create one standard that all the others have to follow, built on a least common denominator and combined with master data management. In some cases this can work, but often the set of failures from such efforts is bigger than that arising from trying to squeeze an enterprise into a one-size-fits-all mega-matrix ERP-system.

Why is that, you might ask? From the blueprint it sounds compelling: just align the business processes and then all data flows will follow a common path. The reality, unfortunately, is far more complex, because any organisation comprises several different processes, practices, professions and disciplines. These all have different perspectives on the information and data that is to be shared. This is precisely why we have so many applications in the first place! To what extent are we able to solve this with emerging semantic technologies? These technologies are not a silver bullet, far from it! The Web, however, shows a very different way of thinking about integration, with interoperability and standards becoming the main pillars that all the other things rely on. If you use agreed and controlled vocabularies and standards, there is a better chance of actually being able to sort out all the other things.

Remember that most members of staff work on the edges of the corporate body, so they have to align themselves to the lingo of all the external actor-networks and then translate it all into codified knowledge for the inside.

Semantic Interoperability

Today most end-users use internet applications and services that already rely on semantic enhancements to bridge the gap between things, without ever having to think about it. One omnipresent social network is Facebook, whose Open Graph connects people and things in a way that echoes the FOAF (Friend-of-a-Friend) vocabulary. Using a graph to connect data is the very cornerstone of linked-data and the semantic web. A thing (entity) has descriptive properties and relations to other entities. One entity’s property might be another entity in the graph. This is the simple relationship subject-predicate-object. Hence from the graph we get a very flexible and resilient platform, in stark contrast to the more traditional fixed schemas.
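To make the subject-predicate-object idea concrete, here is a minimal sketch in plain Python. The entities and predicates ("alice", "worksFor", "acme" and so on) are invented for illustration and are not taken from FOAF or any real vocabulary; a production system would use RDF and a triple store rather than Python lists.

```python
from collections import defaultdict

# Each fact is one (subject, predicate, object) triple.
# All identifiers are hypothetical examples.
triples = [
    ("alice", "knows", "bob"),
    ("alice", "worksFor", "acme"),
    ("bob", "worksFor", "acme"),
    ("acme", "locatedIn", "gothenburg"),
]

def build_index(facts):
    """Index triples by subject so an entity's properties can be looked up."""
    by_subject = defaultdict(list)
    for subject, predicate, obj in facts:
        by_subject[subject].append((predicate, obj))
    return by_subject

def describe(by_subject, entity):
    """Return the (predicate, object) pairs recorded for an entity."""
    return by_subject.get(entity, [])

index = build_index(triples)

# The object of one triple ("acme") is itself the subject of another,
# which is what turns a list of facts into a graph.
employer = dict(describe(index, "alice"))["worksFor"]
print(employer, describe(index, employer))

# A new kind of fact needs no schema change, only one more triple.
triples.append(("acme", "industry", "manufacturing"))
```

The last line is the point of the contrast with fixed schemas: a relational table would need a new column for "industry", while the graph simply gains another edge.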

The Semantic Web and Linked-Data are a way to link different data sets, which may have grown from a multitude of schemas and contexts, into one fluid interlinked experience. If all internal supporting systems, or at least the aggregate engines, could simply apply a semantic texture to all the bits and bytes flowing around, it could well provide a solution in the area where other set-ups have failed. Remember that these linked-data sets are resilient by nature.

There is a set of controlled vocabularies (thesauri, ontologies and taxonomies) that capture the topics, themes and entities that make up the world. These vocabularies have to some extent already been developed, classified and given sound resource descriptions (RDF). The Linked-Open-Data clouds are experiencing a rapid growth of meaningful expressions. Wikidata, DBpedia, Freebase and many more have a vast set of crisp and useful data that, when intersected with internal vocabularies, can make things so much easier. A very good example of such useful vocabularies, developed by professional information science people, is the Getty Research Institute’s recently released thesauri: AAT (Art & Architecture Thesaurus), CONA (Cultural Objects Name Authority) and TGN (Getty Thesaurus of Geographic Names). These are very trustworthy resources, and using linked-data anybody developing a web or mobile app can reuse their namespace for free and with high accuracy. The same goes for all the other data sets in the linked-open-data cloud. Many governments have declared open data as the main innovation space in which to release their data, under the realm of the “Commons”.

In addition to this, all major search engines have agreed on a set of very simple-to-use schemas, collected at www.schema.org. These schemas have been very well received by the webmaster community from their very inception. All of these feed into the Google Knowledge Graph and all the other smart (search-enabled) things we use daily.
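As a rough illustration of how lightweight these schemas are, the sketch below builds a schema.org description of an organisation as JSON-LD. The type Organization and the properties name, url and sameAs are real schema.org terms; the company name and URLs are placeholders, and the helper function is only an assumption about how one might generate such markup.

```python
import json

def organization_jsonld(name, url, same_as):
    """Build a schema.org Organization description as a JSON-LD dictionary."""
    return {
        "@context": "https://schema.org",
        "@type": "Organization",
        "name": name,
        "url": url,
        # sameAs links this entity to pages describing the same thing
        # elsewhere, which is where the "linked" in linked-data comes in.
        "sameAs": same_as,
    }

doc = organization_jsonld(
    name="Example Corp",                       # placeholder name
    url="https://www.example.com",             # placeholder URL
    same_as=["https://en.wikipedia.org/wiki/Example"],
)
print(json.dumps(doc, indent=2))
```

On a web page, output like this is typically embedded in a script element of type application/ld+json, where search engines can pick it up.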

For the corporate world, these Internet mega-trends have, or should have, a big impact on the way we do information management inside the corporate walls. This would particularly be the case if the siloed repositories and records were semantically enhanced from their inception (creation), for subsequent use and archiving. We would then see more flexible and fluid information management within the digital workplace.

The name of the game is interoperability at every level: not just technical device specifics, but interoperability at the semantic level and at the level of the governing principles we use to organise our data and information, regardless of their origin.

Stepping down, to some real-life examples

In the law enforcement system in any country, there is a set of actor-networks at play: the police, attorneys, courts, prisons and the like. All of them work within an inter-organisational process that runs from capturing a suspect, filing a case and running a court session to judgement, sentencing and imprisonment, followed at the end by a reassimilated member of society. Each of these actor-networks or public agencies has its own internal information landscape with supporting information systems, and they all rely on a coherent and smooth flow of information and data between each other. The problem is that while they may use similar vocabularies, the contexts in which these are used may be very different, due to their different responsibilities and enacted environments (laws, regulations, policies, guidelines, processes and practices) when looked at from a holistic perspective.

IA LOD Innovation

A way to overcome this would be to infuse semantic technologies and shared controlled vocabularies throughout, so that the mix of internal information systems could become interoperable regardless of the supporting information system or storage type. In such a case linked-open-data and semantic enhancements could glue and bridge the gaps to form one united composite, managed by just one individual’s record keeping. In such a way, the actual content would not be exposed; rather, a metadata schema would be employed to cross any of the previously existing boundaries.

This is a win-win situation, as semantic technologies and any linked-open-data tinkering use the shared conversation (terms and terminologies) that already exists within the various parts of the process. As long as all parts adhere to the semantic layers, there is no need to reconfigure internal processes or apply other parties’ resource descriptions and elements. In this way, only the parts of schemas that are context specific for a given part of a process are used, thereby allowing the lingo of the related practices and professions to be aligned.

This is already happening in practice in the internal workplace environment of an existing court, where a shared intranet is based on the organising principles already mentioned and applies sound, pragmatic information management practices and metadata standards like Dublin Core and common vocabularies, all of which are infused into content provisioning.

For the members of staff working inside a court setting, this is a major improvement, as they use external databases every day to gain insights in order to carry out their duties. And when the internal workplace uses such a set-up, their knowledge sharing can grow, leading to both improved wayfinding and findability.

Yet another interesting case is a service company that operates on a global scale. They are an authoritative resource in their line-of-business, maintaining a body of rules and regulations that has become a canonical reference. By moving into a new expanded digital workplace environment (internet, extranet and intranet) and using semantic enhancement and search, they get a linked-data set that can be used by clients, competitors and all others working within their environment. At the same time their members of staff can use the very same vocabularies to semantically enhance their provision of information and data into the different information systems internally.

The last example is an industrial company with a mix of products within their line-of-business. They have grown through M&A over the years, and ended up in a dead-end mess of information systems that do not interoperate at all. Their way to overcome the effect of past mergers and acquisitions was to create an information governance framework. Applying it with MDM and semantic search, they were able to decouple data and information, and as a result made their workplace more resilient in a world of constant flux.

One could potentially apply these pragmatic steps to any line of business, since most themes and topics have already been created and captured by the emerging semantic web and linked-data realm. It is only a matter of time before more organisations jump on this bandwagon in order to take advantage of changes that have the ability to make them a canonical reference and a market leader. Just think of the film industry’s IMDb.

A final thought: Are the vendors ready and open-minded enough to alter their software and online services in order to realise this outlined future enterprise information landscape?

For more information please read these online resources, or go for the executive brief video clip:
Enterprise-Linked-Data
http://testing.rachaelkalicun.info/led_book/led-contents.html

Exec Brief

Europeana brief for memory institutions using linked-open-data:
http://en.wikipedia.org/wiki/File:Linked-open-data-Europeana-video.ogv

Linked-Open-Data network Sweden 2014 presentation:
http://livingarchives.mah.se/2014/03/linked-data-2014/
and Fredric’s talk about semantic enhanced citizen participation and slides.

The future linked-data enterprise, from Intranätverk conference in Göteborg, in May 2014
Fredric Landqvist and Kerstin Forsberg’s talk, and slides.

Swedish language support (natural language processing) for IBM Content Analytics (ICA)

IBM Content Analytics with Enterprise Search (ICA) has its strength in natural language processing (NLP), which is achieved in the UIMA pipeline. From a Swedish perspective, one concern with ICA has always been its lack of NLP for Swedish. Previously the Swedish support in ICA consisted only of dictionary-based lemmatization (word: “sprang” -> lemma: “springa”). However, for a number of other languages ICA has also provided part of speech (PoS) tagging and sentiment analysis. One of the benefits of the PoS tagger is its ability to disambiguate words that belong to multiple classes (e.g. “run” can be both a noun and a verb), as well as to assign tags to words that are not found in the dictionary. Furthermore, the PoS tagger is crucial when it comes to improving entity extraction, which is important when a deeper understanding of the indexed text is needed.
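For readers unfamiliar with dictionary-based lemmatization, the idea can be sketched in a few lines of Python. This is not how ICA implements it; the dictionary and the function are hypothetical, and only the pair “sprang” -> “springa” comes from the text above.

```python
# Toy lemma dictionary: inflected form -> lemma.
# Only "sprang" -> "springa" is from the article; the rest is illustrative.
LEMMAS = {
    "sprang": "springa",
    "springer": "springa",
    "sprungit": "springa",
}

def lemmatize(token, lemmas=LEMMAS):
    """Look up the lower-cased token; unknown words pass through unchanged."""
    word = token.lower()
    return lemmas.get(word, word)

lemmatized = [lemmatize(t) for t in "Hon sprang hem".split()]
print(lemmatized)
```

The fallback branch shows the limitation: a word missing from the dictionary gets no analysis at all, which is one of the gaps a PoS tagger helps to close.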

Findwise has now extended the NLP in ICA to include both support for Swedish PoS tagging and Swedish sentiment analysis. The two images below show simple examples of the PoS support.

Example when ICA uses NLP to analyse the string "ICA är en produkt som klarar entitetsextrahering"
Example when ICA uses NLP to analyse the string "Watson deltog i jeopardy"

The question is: how could this extended functionality be used?

IBM uses ICA and its NLP support together with several of their products. The Jeopardy!-playing computer Watson may be the most famous example, even if it is not a real product. Watson used NLP in its UIMA pipeline when it analyzed its data from sources such as Wikipedia and IMDb.

One product that leverages ICA and its NLP capabilities is Content and Predictive Analytics for Healthcare. This product helps doctors to determine which action to take for a patient, given the patient’s medical record and symptoms. By also leveraging the predictive analytics from SPSS, it is possible to suggest the next action for the patient.

ICA can also be connected directly to IBM Cognos or SPSS, where ICA is the tool that gives structure to unstructured data. By using the NLP or sentiment analytics in ICA, structured data can be extracted from text documents. This data can then be fed to IBM Cognos, SPSS or non-IBM products such as Splunk.
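What “giving structure to unstructured data” means can be shown with a deliberately naive sketch: free text goes in, a flat record comes out that a BI or statistics tool could consume. The word lists and the to_record function are assumptions made for this example and say nothing about how ICA's sentiment analysis actually works.

```python
import re

# Hypothetical mini-lexicon; a real analyser uses far richer models.
POSITIVE = {"good", "great", "fast"}
NEGATIVE = {"bad", "slow", "broken"}

def to_record(doc_id, text):
    """Turn free text into a flat record with token and sentiment counts."""
    tokens = re.findall(r"\w+", text.lower())
    positive = sum(t in POSITIVE for t in tokens)
    negative = sum(t in NEGATIVE for t in tokens)
    if positive > negative:
        label = "positive"
    elif negative > positive:
        label = "negative"
    else:
        label = "neutral"
    return {
        "doc_id": doc_id,
        "tokens": len(tokens),
        "positive": positive,
        "negative": negative,
        "sentiment": label,
    }

record = to_record(
    "doc-1",
    "The new search is fast and the results are good, but login is broken.",
)
print(record)
```

Each document becomes one row with fixed columns, which is the shape that downstream products working on structured data expect.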

ICA can also be used on its own as a text miner or a search platform, but in many cases ICA delivers its maximum value together with other products. ICA is a product that helps enrich data by giving structure to unstructured data. The processed data can then be used by other products which normally work with structured data.

Semantic Search Engine – What is the Meaning?

The shortest dictionary definition of semantics is: the study of meaning. A fuller explanation of the term would describe a relationship that maps words, terms and written expressions onto a common-sense understanding of objects and phenomena in the real world. It is worth mentioning that objects, phenomena and the relationships between them are language independent. This means that the same semantic network of concepts can map to multiple languages, which is useful in automatic translation or cross-lingual search.

The approach

In the proposed approach, semantics will be modeled as a defined ontology, making it possible for the web to “understand” and satisfy the requests and intents of people and machines that use the web content. The ontology is a model that encapsulates knowledge from a specific domain and consists of a hierarchical structure of classes (a taxonomy) that represent concepts of things, phenomena, activities etc. Each concept has a set of attributes that map that particular concept to the words and phrases that represent it in written language (as shown at the top of the figure below). Moreover, the proposed ontology model will have horizontal relationships between concepts, e.g. linguistic relationships (synonymy, homonymy etc.) or domain-specific relationships (medicine, law, military, biological, chemical etc.). Such a defined ontology model will be called a Semantic Map and will be used in the proposed search engine. An example part of an enriched ontology of beverages is shown in the figure below. The ontology is enriched so that the concepts can be easily identified in text, using attributes such as the representation of the concept in written text.
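One possible way to picture such a Semantic Map as a data structure is sketched below. The Concept class, its field names and the tiny beverage map are assumptions made for illustration and do not reproduce the ontology in the figure. Each concept carries a taxonomy link (parent), written forms per language (labels) and horizontal relationships (related).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class Concept:
    """One node of a (hypothetical) Semantic Map."""
    id: str
    parent: Optional[str] = None                                  # taxonomy link
    labels: Dict[str, List[str]] = field(default_factory=dict)    # language -> written forms
    related: Dict[str, List[str]] = field(default_factory=dict)   # relation -> concept ids

SEMANTIC_MAP = {
    c.id: c
    for c in [
        Concept("beverage", labels={"en": ["beverage", "drink"], "sv": ["dryck"]}),
        Concept("coffee", parent="beverage",
                labels={"en": ["coffee"], "sv": ["kaffe"]},
                related={"contains": ["caffeine"]}),
        Concept("tea", parent="beverage",
                labels={"en": ["tea"], "sv": ["te"]},
                related={"contains": ["caffeine"]}),
        Concept("caffeine", labels={"en": ["caffeine"], "sv": ["koffein"]}),
    ]
}

def ancestors(concept_id, semantic_map=SEMANTIC_MAP):
    """Walk the taxonomy upwards from a concept to its root."""
    chain = []
    parent = semantic_map[concept_id].parent
    while parent is not None:
        chain.append(parent)
        parent = semantic_map[parent].parent
    return chain

def find_concepts(written_form, semantic_map=SEMANTIC_MAP):
    """Return ids of concepts that list the written form in any language."""
    needle = written_form.lower()
    return [
        c.id
        for c in semantic_map.values()
        if any(needle in forms for forms in c.labels.values())
    ]

print(find_concepts("kaffe"), ancestors("coffee"))
```

Because the labels hang off the concept rather than the other way round, the Swedish “kaffe” and the English “coffee” resolve to the same node, which is the language independence that makes cross-lingual search possible.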

Semantic Map

The Semantic Map is an ontology that is used for bidirectional mapping between textual representations of concepts and a space of their meanings and associations. In this manner, it becomes possible to transform user queries into concepts, ideas and intent that can be matched against an indexed set of similar concepts (and their relationships) derived from documents, which are then returned in the form of a result set. Moreover, users will be able to refine and describe their intents using visualized facets of the concept taxonomy, concept attributes and horizontal (domain) relationships. The search module will also be able to discover users’ intents based on the history of queries and other relevant factors, e.g. ontological axioms and restrictions. A potentially interesting approach would be to retrieve additional information regarding the specific user profile from publicly available information in social portals like Facebook, blog sites etc., as well as from the user’s own bookmarks and similar private resources, enabling deeper intent discovery.

Semantic Search Map

Semantic Search Engine

The search engine will be composed of the following components:

  • Connector – This module will be responsible for acquiring data from external repositories and passing it to the search engine. The purpose of the connector is also to extract text and relevant metadata from files and external systems and pass them on to the subsequent processing components.
  • Parser – This module will be responsible for text processing, including activities like tokenization (breaking text into lexemes, i.e. words or phrases), lemmatization (normalization of grammatical forms), exclusion of stop-words, and paragraph and sentence boundary detection. The result of the parsing stage is structured text with additional annotations, which is passed to the semantic Tagger.
  • Tagger – This module is responsible for adding semantic information to each lexeme extracted from the processed text. Technically, this means attaching to each lexeme the identifiers of the relevant concepts stored in the Semantic Map. Moreover, phrases consisting of several words are identified, and disambiguation is performed based on the derived contexts. Consider the example illustrated in the figure.
  • Indexer – This module is responsible for taking all the processed information, transforming it and storing it in the search index. This module will be enriched with methods of semantic indexing using the ontology (Semantic Map) and language tools.
  • Search index – The central storage of processed documents (document repository), structured to manage the full text of the documents, their metadata and all relevant semantic information (document index). The structure is optimized for search performance and accuracy.
  • Search – This module is responsible for running queries against the search index and retrieving relevant results. The search algorithms will be enriched to use user intents (in compliance with data privacy requirements) and the prepared Semantic Map to match semantic information stored in the search index.
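The flow through these components can be sketched end to end in a few dozen lines of Python. Everything here is a simplification under stated assumptions: the stop-word list, lemma table and concept map are tiny hypothetical stand-ins for real language resources and the Semantic Map, the Connector is reduced to passing in plain strings, and multi-word phrases and disambiguation are left out.

```python
import re
from collections import defaultdict

# Hypothetical stand-ins for language resources and the Semantic Map.
STOP_WORDS = {"the", "a", "of", "and", "is", "with"}
LEMMAS = {"drinks": "drink", "cups": "cup"}
CONCEPTS = {
    "coffee": "beverage/coffee",
    "espresso": "beverage/coffee",
    "tea": "beverage/tea",
    "drink": "beverage",
}

def parse(text):
    """Parser: tokenize, drop stop-words, lemmatize."""
    tokens = re.findall(r"\w+", text.lower())
    return [LEMMAS.get(t, t) for t in tokens if t not in STOP_WORDS]

def tag(lexemes):
    """Tagger: map each known lexeme to its concept identifier."""
    return {CONCEPTS[lexeme] for lexeme in lexemes if lexeme in CONCEPTS}

class SearchIndex:
    """Search index: concept id -> ids of the documents mentioning it."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        """Indexer: store the concepts tagged in one document."""
        for concept in tag(parse(text)):
            self.postings[concept].add(doc_id)

    def search(self, query):
        """Search: match the query's concepts against the indexed concepts."""
        hits = set()
        for concept in tag(parse(query)):
            hits |= self.postings.get(concept, set())
        return sorted(hits)

index = SearchIndex()
index.add("doc1", "A cup of espresso")
index.add("doc2", "Green tea with lemon")
print(index.search("coffee"))
```

The query “coffee” finds doc1 even though that document never contains the word, because both “coffee” and “espresso” are tagged with the same concept. A plain keyword index would return nothing here.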

What do you think? Please let us know by writing a comment.