Mankind’s preoccupation for much of this century has been to become fully digitalized. Utilities, software, services and platforms are all becoming an ‘intertwingled’ reality for all of us. Being mobile, the blurring of the borders between the workplace and recreational life, plus the ease of digital creation, are creating information overloads and (out-of-sight) digital landfills. While digital content is cheaper to create and store, its volume and its uncared-for status make it harder for everyone else to find and consume the bits they really need (and to have some provenance for peace of mind).
Fear not. A collection of emerging digital technologies exists that can both support and maintain future sustainable digital recycling: Cognitive Computing, Artificial Intelligence, Natural Language Processing, Machine Learning and the like; Semantics, adding meaning to shared concepts; and Graphs, linking our content and information resources. With good information management practice and the appropriate supporting tools to tinker with, there is a great opportunity not only to automate knowledge digitization but to augment it.
In the content continuum (from its creation to its disposal) there is a great need for automating processes as much as possible in order to reduce the amount of obsolete or hidden (currently value-less) digital content. Digital knowledge recycling is difficult as nearly every document or content creator is, by nature, reluctant to add further digital tags (a.k.a. metadata) describing their content or documents once they have been created. What’s more, experience shows that manual tagging is inefficient on a number of counts, one of which is inconsistency.
Most digital documents (and most digital content, unless intended to sell something publicly) therefore lack the proper recycling resource descriptors that can help with e.g. classification, topic description or annotation with domain specific (shared, consistent) concepts. Such descriptions add appropriate meaning or context to content, aiding its further digital reuse (consumption). Without them, the problem of findability is likely to remain omnipresent across many intranets and searched resources.
Smartphones generate content automatically, often without the user thinking or realizing. All kinds of resource descriptors (time, place etc.) are created automatically through movement and mobile usage. With the addition of further machine learning and algorithms, online services such as Google Photos use these descriptors (and some automatic annotation of their own) to add more contextual data before classifying pictures into collections. This improved data quality (read: metadata addition and improved findability) allows us to find the pictures or timeline we want more easily.
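As a rough illustration of how automatically captured descriptors can drive classification, the sketch below groups pictures into collections by place and day. The file names, places and field names are hypothetical, and this is in no way how Google Photos is implemented; it only shows the principle of classifying on metadata the user never typed in.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical descriptors, of the kind a phone records without user effort.
photos = [
    {"file": "IMG_001.jpg", "taken": datetime(2016, 2, 22, 9, 15), "place": "Barcelona"},
    {"file": "IMG_002.jpg", "taken": datetime(2016, 2, 22, 17, 40), "place": "Barcelona"},
    {"file": "IMG_003.jpg", "taken": datetime(2016, 3, 5, 12, 0), "place": "Stockholm"},
]


def classify(items):
    """Group files into collections keyed by (place, ISO date)."""
    collections = defaultdict(list)
    for item in items:
        key = (item["place"], item["taken"].date().isoformat())
        collections[key].append(item["file"])
    return dict(collections)


albums = classify(photos)
```

Here `albums[("Barcelona", "2016-02-22")]` holds the two pictures taken that day, without anyone having tagged them by hand.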
In the very same manner, workplace content or documents can now have this same type of supporting technical platform that automatically adds additional business specific context and meaning. This could include data from users: their profiles, departments or their system user behaviour patterns.
For real organizational agility, though, a further layer of automatic annotation (tagging) and classification is needed, achieved using shared models of the business. These models can be expressed through a combination of controlled vocabularies (taxonomies) that can be further joined through relationships (ontologies) and finally published (publicly or privately) as domain models in the form of linked data (in graphs). Within this layer exist not just synonyms but alternative and preferred labels and, more importantly, relationships can be expressed between concepts. Hence the graph: concepts are the dots (nodes) and relationships the joining lines (edges). Using certain tools, the relationships between concepts can additionally be given a weighting.
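A minimal sketch of such a layer, in plain Python, might look like the following. The class names, concept identifiers, relation names and weights are all assumptions made for illustration; a real deployment would more likely use a standard such as SKOS in a graph store.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Concept:
    """A node: one concept with a preferred label and alternative labels."""
    pref_label: str
    alt_labels: set[str] = field(default_factory=set)

    def labels(self) -> set[str]:
        return {self.pref_label.casefold()} | {a.casefold() for a in self.alt_labels}


class ConceptGraph:
    def __init__(self) -> None:
        self.concepts: dict[str, Concept] = {}
        # (source id, relation name, target id) -> weight
        self.edges: dict[tuple[str, str, str], float] = {}

    def add_concept(self, concept_id: str, pref_label: str, alt_labels=()) -> None:
        self.concepts[concept_id] = Concept(pref_label, set(alt_labels))

    def relate(self, source: str, relation: str, target: str, weight: float = 1.0) -> None:
        for concept_id in (source, target):
            if concept_id not in self.concepts:
                raise KeyError(f"unknown concept: {concept_id}")
        self.edges[(source, relation, target)] = weight

    def lookup(self, label: str) -> list[str]:
        """Ids of every concept carrying this label (preferred or alternative)."""
        wanted = label.casefold()
        return sorted(cid for cid, c in self.concepts.items() if wanted in c.labels())

    def related(self, concept_id: str) -> list[tuple[str, str, float]]:
        """Outgoing relationships of a concept, strongest first."""
        hits = [(rel, tgt, w) for (src, rel, tgt), w in self.edges.items() if src == concept_id]
        return sorted(hits, key=lambda hit: hit[2], reverse=True)


graph = ConceptGraph()
graph.add_concept("apple-fruit", "Apple", alt_labels=["apples"])
graph.add_concept("apple-org", "Apple Inc.", alt_labels=["Apple"])
graph.add_concept("orchard", "Orchard")
graph.add_concept("smartphone", "Smartphone", alt_labels=["mobile phone"])
graph.relate("apple-fruit", "grownIn", "orchard", weight=0.9)
graph.relate("apple-org", "produces", "smartphone", weight=0.8)

candidates = graph.lookup("apple")
```

The label "apple" resolves to two candidate concepts, and the relationships attached to each (an orchard versus a smartphone) are what a search or tagging tool could use to tell them apart.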
This added layer generates a higher quality of automated context, meaning and consistency for the annotation (tagging) of content and documents alike. The very same layer feeds information architecture in the navigation of resources (e.g. websites). In Search, it helps to disambiguate between queries (e.g. apple the fruit, or apple the organization?).
This digital helper application layer works very much in the same smooth manner as e.g. Google Photos, i.e. in the background, without troubling the user.
This automation, however, will not work without sustainable organizing principles applied in information management practices and tools. We still need a bit of human touch! (Just as Google Photos added theirs behind the scenes earlier, as a work in progress.)
This codification or digitalization of knowledge allows content to be annotated, classified and navigated more efficiently. We are all becoming more aware of the Google Knowledge Graph or the Microsoft Graph that can connect content and people. The analogy of connecting the dots in a graph is like linking digital concepts and their known relationships or values.
Augmentation can take shape in a number of forms. A user searching for a particular query can be presented not only with the most appropriate search results (via the sense-making connections and relationships) but also can be presented with related ideas they had not thought of or were unaware of – new knowledge and serendipity!
Search, semantic, and cognitive platforms have now reached a much more useful level than in earlier days of AI. Through further techniques new knowledge can also be discovered by inference, using the known relationships within the graph to fill in missing knowledge.
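To make inference a little more concrete, here is a small sketch under the assumption that the graph holds "broader than" relationships between concepts. The concept names are invented. From the asserted links the function derives the ones nobody wrote down, for example that a Golden Delicious is a kind of food.

```python
def infer_broader(broader: dict[str, set[str]]) -> dict[str, set[str]]:
    """For every concept, return all direct and inferred broader concepts."""

    def ancestors(concept: str, seen: set[str]) -> set[str]:
        for parent in broader.get(concept, set()):
            if parent not in seen:  # also guards against cycles
                seen.add(parent)
                ancestors(parent, seen)
        return seen

    return {concept: ancestors(concept, set()) for concept in broader}


# Asserted knowledge: only the direct links are recorded.
asserted = {
    "golden delicious": {"apple"},
    "apple": {"fruit"},
    "fruit": {"food"},
}

inferred = infer_broader(asserted)
```

A search for "food" could then also surface content tagged only with "golden delicious", although that relationship was never stated explicitly.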
Key to all of this, though, is building a supporting back-end platform for continuous improvement in the content continuum. Technically, this is easier to start than one may first suspect.
Sustainable Organising Principles to the Digital Workplace
A reflection on Mobile World Congress topics: mobility, digitalisation, IoT, the Fourth Industrial Revolution and sustainability
Commerce has always been conversational; today that conversation is digital. Organisations are asking how to organise clean, effective data for an open digital conversation.
Digitalization’s aim is to answer customer/consumer-centric demands effectively (with relevant and related data) and in an efficient manner. [for the remainder of the article read consumer and customer interchangeably]
This essentially means joining the dots between clean data and information, and being able to recognise the main and most consumer-valuable use cases, be it common transaction behaviour or negating their most painful user experiences.
This includes treading the fine line between offering “intelligent” information (intelligent in terms of relevance and context) to the consumer effectively and seeming freaky or stalker-like. The latter is dealt with by forming a digital conversation in which the consumer understands that their information is used only for their own end needs or wants.
While clean, related data from the many multi-channel customer touch-points forms the basis of an agile digital organisation, it is the combination of significant data analysis insight into user demand & behaviour (clicks, log analysis etc.), machine learning and sensible prediction that forms the basis of artificial intelligence. Artificial intelligence, broken down, is essentially resultant action based on inferences from known information, i.e. the elementary Dr Watson, but done by computers.
This new digital data basis means being able to take data from what were previously data silos and combine it effectively in a meaningful way, for a valuable purpose. While the tag of Big Data has become wearisome in a generalised context, the key is picking the data/information that gives relevant answers to the most valuable questions or, in consumer speak, gets a question answered or a job done effectively.
Digitalisation (and then the following artificial intelligence) relies obviously on computer automation, but it still requires some thoughtful human-related input. Important steps in the move towards digitalization include:
Content and Data Inventory: the cleansing of data and information;
Its architecture (information modelling, content analysis, automatic classification and annotation/tagging);
Data analysis in combination with text analysis (or NLP: natural language processing for the more abundant unstructured data, content), the latter to put flesh on the bone as it were, or adding meaning and context
Information Governance: the process of being responsible for the collection, proper storage and use of important digital information (now made less ignorable with new citizen-centric data laws (GDPR) and the need for data agility or anonymization of data)
Data/system Interoperability: which data formats, structures and standards are most appropriate for you? Which data collections are most appropriate (relational databases, linked/graph data, data lakes etc.)?
Language/cultural interoperability: letting people with different perspectives access the same information topics using their own terminology.
Interoperability for the future also means being able to link everything in your business ecosystem for collaboration, in- and outbound conversations, endless innovation and sustainability.
IoT, or the Internet of Things, is making the physical world digital and adding further to the interlinked network, soon to be superseded by the AoT (the analysis of things).
Newer steps of Machine learning (learning about consumer preferences and behaviour etc.) and artificial intelligence (being able to provide seemingly impossible relevant information, clever decision-making and/or seamless user experience).
The fusion of technologies continues further as the lines between the physical, digital and biological spheres blur, with developments in the immersive Internet such as Augmented Reality (AR) and Virtual Reality (VR).
The next elements are here already: semantic (‘intelligent’) search, virtual assistants, robots, chat bots… with 5G around the corner to move more data, faster.
Progress within mobility paves the way for a more sustainable world for all of us (UN Sustainable Development), with a future based on participation. In emerging markets we are seeing giant leaps in societal change. Rural areas now have access to the vast human resources of knowledge to service innovation, e.g. through free access to Wikipedia on cheap mobile devices and Open Campuses. Gender equality is supported by changed monetary and mobile financial practices, and blockchain has to rise to the challenge of interoperability. We have to address the open paradigm (e.g. Open Data) and the participation economy, building the next elements: shared experience and information commons. This also falls back to the intertwingled digital workplace, and practices to move into new cloud-based arenas.
Some last remarks on the telecom industry: it is loaded with acronyms, and for laymen it is sometimes a maze to navigate and make sense of.
So are these steps straightforward, or is the reality still a potential headache for your organisation?
This is the third post in a series (1, 2, 4, 5, 6, 7) on the challenges organisations face as they move from having online content and tools hosted firmly on their estate to renting space in the cloud. We will help you to consider the options and guide on the steps you need to take.
In the first post we set out the most common challenges you are likely to face and how you may overcome these. In the second post we focused on how Office 365 and SharePoint can play a part in moving to the cloud. Here we cover how they can help join up your organisation online using their collaboration tools and features.
When arranging the habitat, it is key to address the theme of collaboration, since each theme calls for different feature settings of artifacts and services. In many cases, teamwork is situated in the context of a project. Other themes for collaboration are line-of-business unit teamwork, or the more learning-oriented networks, a.k.a. communities of practice. I will leave these latter themes for now.
Most enterprises have some project management process (i.e. PMP) that all projects have to adhere to, with added complementary documentation and reporting mechanisms. This is so that the leadership within the organisation is able to align resources and govern the change portfolio across different business units. Given this structure, it is very easy to depict measurable outcomes, as project documents have to be produced regardless of what the project is supposed to contribute towards.
Why? Usually defined in the project description, setting common ground for the goals and expected outcome. (dc.description)
How? Defines the processes, practices and tools used to create the expected outcome for the project, with links to common resources such as the PMP framework, but also links to other key data sets, like ERP record keeping and master data, for the project number and other measures that are not stored in the habitat but are still pillars for aligning to the overarching model. (dc.relation)
When these questions have been answered, the resource description for the habitat is set. In SharePoint this is the property bag feature. During the lifespan of the on-going project, all contributions, conversations and creations can inherit rule-based metadata for the artifacts from the collection’s resource description. This reduces the burden weighing on the actors building the content, by enabling automagic metadata completion where applicable. And for wayfinding and findability within and between habitats, these resource descriptions will be the building blocks of a sustainable information architecture. In our next post we will cover how to encourage employee engagement with your content.
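The inheritance idea can be sketched in a few lines of Python. This is not SharePoint code and does not use the property bag API; the habitat description, the identifiers and the rule (fill in only what the artifact lacks) are assumptions chosen to illustrate the principle with the two Dublin Core elements named above.

```python
# Hypothetical resource description for one project habitat.
habitat_description = {
    "dc.description": "Hypothetical project: replace the customer portal",
    "dc.relation": ("erp:project/4711", "pmp:framework"),
}


def inherit_metadata(artifact, collection, inherited_keys=("dc.description", "dc.relation")):
    """Return a copy of the artifact's metadata, completed from its collection.

    Values the artifact already carries are never overwritten.
    """
    completed = dict(artifact)
    for key in inherited_keys:
        if key in collection and key not in completed:
            completed[key] = collection[key]
    return completed


# A new document created in the habitat, with only a title supplied by its author.
minutes = {"dc.title": "Steering group minutes"}
tagged = inherit_metadata(minutes, habitat_description)

# A document that already states its own relation keeps it.
budget = {"dc.title": "Budget", "dc.relation": ("erp:project/4711/budget",)}
tagged_budget = inherit_metadata(budget, habitat_description)
```

The author supplies a title and nothing more; the description and relations arrive from the habitat, which is the kind of completion the paragraph above calls automagic.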
This is the second post in a series (1, 3, 4, 5, 6, 7) on the challenges organisations face as they move from having online content and tools hosted firmly on their estate to renting space in the cloud. We will help you to consider the options and guide on the steps you need to take.
In the first post we set out the most common challenges you are likely to face and how you may overcome these. In this post we focus on how Office 365 and SharePoint online can play a part in moving to the cloud.
Let us be pragmatic and down-to-earth! It is time to roll up our sleeves and consider using Office 365, as one example of how organisations can make this transition from their estate to the cloud. Given that this is the collaborative space many organisations consider using, Office 365 is compelling as a one-size-fits-all, instant-build, just-roll-out enterprise-wide approach, sometimes taken without any Information Architecture plan whatsoever!
In the Office 365 environment, one has to map the terrain, so that there are distinct districts to which things relate – the same goes for the structure of neighborhoods of clustered habitats. Where it gets tough is having an agile and resilient city plan for the real-world experience. This is actually the pillar construction in a digital domain, aiming for resilience and emerging uses over time… but with a simple and agreed-upon game plan.
Pace-layering the information architecture
Most organisations have an ontology of entities, things, that are generic, as stated in the W3C Organisation schema. And these perspectives, domain models, vocabularies and ontologies, add up to become districts, and neighborhoods in the Information Architecture map, with a few angles:
Organisation Units (Business Unit, Division, Function, Group)
Governing agencies, or regulatory entities, intermediaries
Locations (Sites, Geographical places as /world/continent/country/region/city/address …)
Business Processes (Process & Activities)
Professions and Disciplines (Roles), Practices
Topics (derived from line of Business, and controlled vocabularies)
Regardless of an organisation’s line of business, these pan out as pretty good structural elements on which to build. Since an enterprise is a social construct with agreed borders, it is populated with people who act and interplay in various ways, and who have a multitude of facets with regard to everyday work. Some entities change more frequently, generally in the organisational units further down in the leaves, and less so in the top main branches. The vocabularies within an organisation need to be the centre pillar, to reduce linguistic insecurities.
From an Information Architecture perspective, when using Office 365 or SharePoint it is wise to apply pace-layering to the building blocks on which navigational constructs are built. This means that, using the highest level of the organisational unit tree, a pretty stable foundation for the site structure can be built. This is where content and team sites live. More fluid navigational themes (temporal or topic entities) can then be added.
This goes for activities undertaken within daily practices, where a set of professions and disciplines interact. All of these activities lay out a tapestry of overarching business processes. The outcome or result might be a thing that is detailed in topic taxonomies, for example a product structure for a specific manufacturing industry. Since all organisations have actor networks in their ecology, it is preferable to add these entities into the structure, as clients, partners, competitors, regulatory agencies, social networks, communities of practice and so forth.
All of these sets of terms have to be maintained in a Managed Metadata Service, a.k.a. the Term Store. In most organisations there are other sources of controlled vocabularies, hence mapping is key in order to have aligned master term sets, either through subscription models (batch) or enterprise linked-data sets. All these actions define the terrain, so we map the ecosystem as taxonomic cartographers.
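The mapping step could be sketched as below. The master term set, the synonyms and the incoming labels are invented for illustration, and the code stands in for whatever batch or linked-data mechanism an organisation actually uses; it does not call the SharePoint Term Store.

```python
# Hypothetical master term set: preferred term -> known synonyms.
master_terms = {
    "Customer": {"Client", "Account holder"},
    "Supplier": {"Vendor"},
}


def build_index(terms):
    """Map every label (preferred or synonym), case-insensitively, to its preferred term."""
    index = {}
    for preferred, synonyms in terms.items():
        for label in {preferred, *synonyms}:
            index[label.casefold()] = preferred
    return index


def align(labels, index):
    """Split incoming labels into those mapped to a master term and those left over."""
    mapped, unmapped = {}, []
    for label in labels:
        preferred = index.get(label.strip().casefold())
        if preferred is None:
            unmapped.append(label)
        else:
            mapped[label] = preferred
    return mapped, unmapped


# Labels arriving from another (hypothetical) source system.
crm_labels = ["client", "Vendor", "Prospect"]
mapped, unmapped = align(crm_labels, build_index(master_terms))
```

The unmapped remainder ("Prospect" here) is the useful by-product: it is the list a taxonomy owner has to review, either adding a synonym or coining a new master term.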
The Building Blocks: Artifacts and Collections
Office 365 comes with a pretty organised set of tools, themes and things to build upon. For more website-related things, one could use either published web sites / portals, or enterprise wikis. The other main services are digital habitats, or collaborative spaces: team sites. And lastly there are ESNs (Enterprise Social Networks) like Yammer, instant messaging tools like Lync, and Exchange services like mail and calendar. SharePoint Online and Office 365 together are a Swiss Army knife.
The emerging hyper-connected and agile enterprises of today are stigmatised by their IS/IT-legacy, so the question is: Will emerging web and semantic technologies and practices undo this stigma?
Semantic Technologies and Linked-Open-Data (LOD) have evolved since Tim Berners-Lee introduced their basic concepts, and they are now part of everyday business on the Internet, thanks mainly to their uptake by information and data-run companies like Google, social networks like Facebook and large content sites like Wikipedia. The enterprise information landscape is ready to be enhanced by the semantic web, to increase findability and usability. This change will enable a more agile digital workplace where members of staff can use cloud-based services, anywhere, anytime on any device, in combination with the set of legacy systems backing their line-of-business. All in all, more efficient organising principles for information and data.
The Corporate Information Landscape of today
In the everyday workplace we use digital tools to cope with the tasks at hand. These tools have been set into action to address meta models to structure the social life dealing with information and data. The legacy of more than 60 years of digital record keeping has left us in an extremely complex environment, where most end-users have a multitude of spaces where they are supposed to contribute. In many cases their information environment lacks interoperability.
A good, or rather bad, example of this is the electronic health records (EHR) of a hospital, where several different health professionals try to codify their on-going work in order to make better informed decisions regarding the different medical treatments. While this is a good thing, it is heavily hampered by closed-down silos of data that do not work in conjunction with the new, more agile work practices. It is not uncommon to have more than 20 different information systems employed to do provisioning during a workday.
The information systems architecture, in any organisation or enterprise, may comprise home-grown legacy systems from the past, or bought off-the-shelf software suites and extremely complex enterprise-wide information systems like ERP, BI, CRM and the like. The connections between these information systems (or integration points) often resemble point-to-point “spaghetti”. The work practice for many IT professionals is to map this landscape of connections and information flows, using for example Enterprise Architecture models. Many organisations use information integration engines, like enterprise-service-bus applications, or master data applications, as a means to decouple the tight integration and get away from proprietary software lock-in.
On top of all these schema-based, structured-data information systems lies the social and collaborative layer of services, with things like intranets (web-based applications), document management, enterprise-wide social networks (e.g. Yammer), collaborative platforms (e.g. SharePoint) and, more obviously, e-mail, instant messaging and voice/video meeting applications. All of these platforms and spaces where one carries out work tasks hold either semi-structured (document management) or unstructured data.
Survival in the enterprise information environment requires a large dose of endurance and skill. Many end-users get lost in their quest to find the relevant data when they should be concentrating on making well-informed decisions. Wayfinding is our in-built adaptive way of coping with the unexpected: finding different pathways and means to solve the issues. In other words … Findability.
Outside-in and Inside-Out
Today most workers in organisations and enterprises act on the edge of the corporate landscape – in network conversations with customers, clients, patients/citizens, partners, or even competitors, often employing means not necessarily hosted inside the corporate walls. On the Internet we see newly emerging technologies being used and adapted at a faster rate and in a more seamless fashion than the existing cumbersome ones of the internal information landscape. So the obvious question raised in all this flux is: why can’t our digital workplace (the inside information landscape) be as easy to use, and to find things and information in, as the external digital landscape? Why do I find knowledgeable peers in communities of practice more easily outside than I do on the inside? Knowledge sharing on the outposts of the corporate wall is vivid and truly passionate, whereas inside it is pretty stale and lame, to say the least.
Release the DATA now
Aggregate technologies, such as Business Intelligence and Data Warehouse, use a capture, clean-up, transform and load mechanism (ETL) from all the existing supporting information systems. The problem is that the schemas and structures of things do not compile that easily. Different uses and contexts make even the most central terms difficult to unleash into a new context. This simply does not work. The same problem can be seen in the enterprise search realm, where we try to cope with both unstructured and semi-structured data. One way of solving all this is to create one standard that all the others have to follow, including a least common denominator combined with master data management. In some cases this can work, but often the set of failures from such efforts is bigger than that arising from trying to squeeze an enterprise into a one-size-fits-all mega-matrix ERP-system.
Why is that, you might ask? From the blueprint it sounds compelling: just align the business processes and then all data flows will follow a common path. The reality unfortunately is way more complex, because any organisation comprises several different processes, practices, professions and disciplines. These all have different perspectives on the information and data that is to be shared. This is precisely why we have so many applications in the first place! To what extent are we able to solve this with emerging semantic technologies? These technologies are not a silver bullet, far from it! The Web, however, shows a very different way of integration thinking, with interoperability and standards becoming the main pillars that all the other things rely on. If you use agreed and controlled vocabularies and standards, there is a better chance of actually being able to sort out all the other things.
Remember that most members of staff work on the edges of the corporate body, so they have to align themselves to the lingo of all the external actor-networks and then translate it all into codified knowledge for the inside.
Today most end-users use internet applications and services that already use semantic enhancements to bridge the gap between things, without ever having to think about such things. One very omnipresent social network is Facebook, whose Open Graph protocol uses RDFa-based markup, in the same linked-data family as vocabularies like FOAF (Friend-of-a-Friend). Using a Graph to connect data is the very cornerstone of linked-data and the semantic web. A thing (entity) has descriptive properties, and relations to other entities. One entity’s property might be another entity in the Graph. This is the simple relationship subject-predicate-object. Hence from the graph we get a very flexible and resilient platform, in stark contrast to the more traditional fixed schemas.
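The subject-predicate-object relationship is simple enough to sketch in plain Python. The entities and predicate names below are invented and do not come from FOAF or any other published vocabulary; a production system would use an RDF store and real namespaces.

```python
# Each fact is one (subject, predicate, object) triple.
triples = {
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("alice", "worksFor", "acme"),
    ("acme", "locatedIn", "stockholm"),
}


def match(facts, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return sorted(
        t for t in facts
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    )


about_alice = match(triples, subject="alice")

# The object of one triple ("acme") is the subject of another,
# so facts can be chained without any fixed schema.
employer = match(triples, "alice", "worksFor")[0][2]
location = match(triples, employer, "locatedIn")[0][2]
```

Adding a new kind of fact means adding a triple with a new predicate, with no table to alter, which is where the flexibility compared with fixed schemas comes from.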
The Semantic Web and Linked-Data are a way to link different data sets that may grow from a multitude of schemas and contexts into one fluid interlinked experience. If all internal supporting systems or at least the aggregate engines could simply apply a semantic texture to all the bits and bytes flowing around, it could well provide a solution to the area where other set ups have failed. Remember that these linked-data sets are resilient by nature.
There is a set of controlled vocabularies (thesauri, ontologies and taxonomies) that capture all of the topics, themes and entities that make up the world. These vocabularies have to some extent already been developed, classified and given sound resource descriptors (RDF). The Linked-Open-Data clouds are experiencing a rapid growth of meaningful expressions. Wikidata, DBpedia, Freebase and many more ontologies have a vast set of crisp and useful data that, when intersected with internal vocabularies, can make things so much easier. A very good example of such useful vocabularies, developed by professional information science people, is the Getty Research Institute’s recently released thesauri: AAT (Art & Architecture Thesaurus), CONA (Cultural Objects Name Authority) and TGN (Getty Thesaurus of Geographic Names). These are very trustworthy resources, and using linked-data anybody developing a web or mobile app can reuse their namespace for free and with high accuracy. And the same goes for all the other data-sets in the linked-open-data cloud. Many governments have declared open data as the main innovation space in which to release their things, under the realm of the “Commons”.
In addition to this, all major search engines have agreed on a set of very simple-to-use schemas captured in the www.schema.org world. These schemas have been very well received by the webmaster community from their very inception. All of these feed into the Google Knowledge Graph and all the other smart (search-enabled) things we are using daily.
For the corporate world, these Internet mega-trends have, or should have, a big impact on the way we do information management inside the corporate walls. This would be particularly the case if the siloed repositories and records were semantically enhanced from their inception (creation), for subsequent use and archiving. We would then see more flexible and fluid information management within the digital workplace.
The name of the game is interoperability at every level: not just technical device specifics, but interoperability at the semantic level and at the level we use governing principles for how we organise our data and information, regardless of their origin.
Stepping down, to some real-life examples
In the law enforcement system in any country, there is a set of actor-networks at play: the police, attorneys, courts, prisons and the like. All of them work within an inter-organisational process running from capturing a suspect, filing a case and running a court session to judgement, sentencing and imprisonment, followed at the end by a reassimilated member of society. Each of these actor-networks or public agencies has its own internal information landscape with supporting information systems, and they all rely on a coherent and smooth flow of information and data between each other. The problem is that while they may use similar vocabularies, the contexts in which they are used may be very different, due to their different responsibilities and enacted environment (laws, regulations, policies, guidelines, processes and practices) when looking from a holistic perspective.
A way to supersede this would be to infuse semantic technologies and shared controlled vocabularies throughout, so that the mix of internal information systems could become interoperable regardless of the supporting information system or storage type. In such a case linked-open-data and semantic enhancements could glue and bridge the gaps to form one united composite, managed by just one individual’s record keeping. In such a way, the actual content would not be exposed, rather a metadata schema would be employed to cross any of the previously existing boundaries.
This is a win-win situation, as semantic technologies and any linked-open-data tinkering use the shared conversation (terms and terminologies) that already exists within the various parts of the process. While all parts cohere to the semantic layers, there is no need to reconfigure internal processes or apply other parties’ resource descriptions and elements. In such a way only parts of schemas are used that are context specific for a given part of a process, and so allowing the lingo of the related practices and professions to be aligned.
This is already happening in practice in the internal workplace environment of an existing court, where a shared intranet is based on such organising principles as already mentioned, uses applied sound and pragmatic information management practices and metadata standards like Dublin Core and Common Vocabularies – all of which are infused in Content Provisioning.
For the members of staff, working inside a court setting, this is a major improvement, as they use external databases everyday to gain insights in order to carry out their duties. And when the internal workplace uses such a set up, their knowledge sharing can grow – leading to both improved wayfinding and findability.
Yet another interesting case, is a service company that operates on a global scale. They are an authoritative resource in their line-of-business, maintaining a resource of rules and regulations that have become a canonical reference. By moving into a new expanded digital workplace environment (internet, extranet and intranet) and using semantic enhancement and search, they get a linked-data set that can be used by clients, competitors and all others working within their environment. At the same time their members of staff can use the very same vocabularies to semantically enhance their provision of information and data into the different information systems internally.
The last example is an industrial company with a mix of products within their line of business. They have grown through M&A over the years, and ended up in a dead-end mess of information systems that do not interoperate at all. A way to overcome the effect of past mergers and acquisitions was to create an information governance framework. Applying it with MDM and semantic search, they were able to decouple data and information, and as a result made their workplace more resilient in a world of constant flux.
One could potentially apply these pragmatic steps to any line of business, since most themes and topics have already been created and captured by the emerging semantic web and linked data realm. It is only a matter of time before more organisations jump on this bandwagon in order to take advantage of changes that could make them a canonical reference and a market leader. Just think of the film industry’s IMDb.
A final thought: Are the vendors ready and open-minded enough to alter their software and online services in order to realise this outlined future enterprise information landscape?
IBM Content Analytics with Enterprise Search (ICA) has its strength in natural language processing (NLP), which is performed in the UIMA pipeline. From a Swedish perspective, one concern with ICA has always been its lack of NLP for Swedish. Previously the Swedish support in ICA consisted only of dictionary-based lemmatization (word: “sprang” -> lemma: “springa”). However, for a number of other languages ICA has also provided part-of-speech (PoS) tagging and sentiment analysis. One of the benefits of the PoS tagger is its ability to disambiguate words that belong to multiple classes (e.g. “run” can be both a noun and a verb) as well as to assign tags to words that are not found in the dictionary. Furthermore, the PoS tagger is crucial when it comes to improving entity extraction, which is important when a deeper understanding of the indexed text is needed.
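To make the difference between the two techniques concrete, here is a minimal Python sketch. It is not how ICA or UIMA implement them: the lemma dictionary, the determiner rule and the function names are all assumptions made up for illustration, and real PoS taggers are typically statistical rather than rule-based.

```python
# Toy lemma dictionary; the entries are illustrative, not taken from ICA.
LEMMAS = {"sprang": "springa", "springer": "springa", "sprungit": "springa"}


def lemmatize(token):
    """Dictionary-based lemmatization: unknown words fall back to themselves."""
    key = token.lower()
    return LEMMAS.get(key, key)


# Hypothetical context rule for one ambiguous English word:
# "run" directly after a determiner is tagged as a noun, otherwise as a verb.
DETERMINERS = {"a", "an", "the", "this", "that"}
AMBIGUOUS = {"run"}


def tag_ambiguous(tokens):
    """Return (token, tag) pairs; tag is None for words this toy rule ignores."""
    tagged = []
    for i, token in enumerate(tokens):
        if token.lower() in AMBIGUOUS:
            previous = tokens[i - 1].lower() if i > 0 else ""
            tagged.append((token, "NOUN" if previous in DETERMINERS else "VERB"))
        else:
            tagged.append((token, None))
    return tagged
```

The dictionary lookup cannot say anything about a word it has never seen, while the tagger uses the surrounding words to decide, which is the property the paragraph above attributes to PoS tagging.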
The question is how this extended functionality could be used.
IBM uses ICA and its NLP support together with several of its products. The Jeopardy!-playing computer Watson may be the most famous example, even if it is not a real product. Watson used NLP in its UIMA pipeline when it analyzed data from sources such as Wikipedia and IMDb.
One product which leverages ICA and its NLP capabilities is Content and Predictive Analytics for Healthcare. This product helps doctors determine which action to take for a patient, given the patient’s medical record and symptoms. By also leveraging the predictive analytics from SPSS, it is possible to suggest the next action for the patient.
ICA can also be connected directly to IBM Cognos or SPSS, where ICA is the tool which gives structure to unstructured data. By using the NLP or sentiment analytics in ICA, structured data can be extracted from text documents. This data can then be fed to IBM Cognos, SPSS or non-IBM products such as Splunk.
ICA can also be used on its own as a text miner or a search platform, but in many cases ICA delivers its maximum value together with other products. ICA helps enrich data by giving structure to unstructured data. The processed data can then be used by other products which normally work with structured data.
The shortest dictionary definition of semantics is: the study of meaning. A fuller explanation of the term would describe a relationship that maps words, terms and written expressions onto a common understanding of objects and phenomena in the real world. It is worth mentioning that objects, phenomena and the relationships between them are language independent. This means that the same semantic network of concepts can map to multiple languages, which is useful in automatic translation and cross-lingual search.
In the proposed approach semantics will be modeled as a defined ontology, making it possible for the web to “understand” and satisfy the requests and intents of people and machines that use web content. The ontology is a model that encapsulates knowledge from a specific domain and consists of a hierarchical structure of classes (a taxonomy) that represent concepts of things, phenomena, activities etc. Each concept has a set of attributes that map that particular concept to the words and phrases that represent it in written language (as shown at the top of the figure below). Moreover, the proposed ontology model will have horizontal relationships between concepts, e.g. linguistic relationships (synonymy, homonymy etc.) or domain specific relationships (medicine, law, military, biological, chemical etc.). Such a defined ontology model will be called a Semantic Map and will be used in the proposed search engine. An exemplar part of an enriched ontology of beverages is shown in the figure below. The ontology is enriched so that the concepts can be easily identified in text using attributes such as the representation of the concept in written text.
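As a rough illustration of such a model, the sketch below represents a tiny beverage ontology in Python. It is only one possible shape under stated assumptions: the class names `Concept` and `SemanticMap`, the concept identifiers, the labels and the `contains` relation are all invented for this example and are not taken from the proposed engine.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Concept:
    concept_id: str
    parent: Optional[str] = None  # vertical (taxonomy) link
    # Attributes mapping the concept to written language: language -> labels
    labels: Dict[str, List[str]] = field(default_factory=dict)
    # Horizontal relationships: relation name -> related concept ids
    related: Dict[str, List[str]] = field(default_factory=dict)


class SemanticMap:
    def __init__(self):
        self.concepts: Dict[str, Concept] = {}

    def add(self, concept):
        self.concepts[concept.concept_id] = concept

    def ancestors(self, concept_id):
        """Walk the taxonomy upwards, nearest parent first."""
        chain = []
        parent = self.concepts[concept_id].parent
        while parent is not None:
            chain.append(parent)
            parent = self.concepts[parent].parent
        return chain

    def find_by_label(self, text, lang):
        """Return ids of concepts with a label equal to `text` in `lang`."""
        needle = text.lower()
        return [
            c.concept_id
            for c in self.concepts.values()
            if needle in (label.lower() for label in c.labels.get(lang, []))
        ]


# Hypothetical fragment of a beverage ontology.
smap = SemanticMap()
smap.add(Concept("beverage", labels={"en": ["beverage", "drink"], "sv": ["dryck"]}))
smap.add(Concept("hot_beverage", parent="beverage", labels={"en": ["hot drink"]}))
smap.add(Concept("coffee", parent="hot_beverage",
                 labels={"en": ["coffee"], "sv": ["kaffe"]},
                 related={"contains": ["caffeine"]}))
smap.add(Concept("caffeine", labels={"en": ["caffeine"], "sv": ["koffein"]}))
```

Because labels are stored per language on a language-independent concept, the same structure can serve several languages, which is the property the previous section highlights.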
The Semantic Map is an ontology that is used for bidirectional mapping between the textual representation of concepts and a space of their meaning and associations. In this manner, it becomes possible to transform user queries into concepts, ideas and intent that can be matched against an indexed set of similar concepts (and their relationships) derived from documents, which are then returned as a result set. Moreover, users will be able to refine and describe their intents using visualized facets of the concept taxonomy, concept attributes and horizontal (domain) relationships. The search module will also be able to discover users’ intents based on the history of queries and other relevant factors, e.g. ontological axioms and restrictions. A potentially interesting approach would retrieve additional information about the specific user profile from publicly available information on social portals like Facebook, blog sites etc., as well as from the user’s own bookmarks and similar private resources, enabling deeper intent discovery.
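The bidirectional mapping can be sketched in a few lines of Python. This is a toy under assumptions, not the proposed implementation: the `LABELS` table, the concept ids and both function names are hypothetical, and matching is done on whole lowercase words only.

```python
# Hypothetical concept labels per language (concept id -> language -> terms).
LABELS = {
    "coffee": {"en": ["coffee"], "sv": ["kaffe"]},
    "tea": {"en": ["tea"], "sv": ["te"]},
}


def build_reverse_index(labels):
    """Text -> concepts direction: map every label to the concepts using it."""
    index = {}
    for concept_id, by_lang in labels.items():
        for terms in by_lang.values():
            for term in terms:
                index.setdefault(term.lower(), set()).add(concept_id)
    return index


REVERSE = build_reverse_index(LABELS)


def query_to_concepts(query):
    """Turn a free-text query into concept ids, in order of first appearance."""
    found = []
    for word in query.lower().split():
        for concept_id in sorted(REVERSE.get(word, ())):
            if concept_id not in found:
                found.append(concept_id)
    return found


def concepts_to_terms(concept_ids, lang):
    """Concepts -> text direction: render concepts as terms in one language."""
    terms = []
    for concept_id in concept_ids:
        terms.extend(LABELS[concept_id].get(lang, []))
    return terms
```

A Swedish query can thus be resolved to concepts and then expanded into English terms, for example `concepts_to_terms(query_to_concepts("svart kaffe"), "en")`, which is one simple way to approach cross-lingual search.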
Semantic Search Engine
The search engine will be composed of the following components:
Connector – This module will be responsible for acquiring data from external repositories and passing it to the search engine. The purpose of the connector is also to extract text and relevant metadata from files and external systems and pass them on to the further processing components.
Parser – This module will be responsible for text processing, including activities like tokenization (breaking text into lexemes, i.e. words or phrases), lemmatization (normalization of grammatical forms), exclusion of stop-words, and paragraph and sentence boundary detection. The result of the parsing stage is structured text with additional annotations that is passed to the semantic Tagger.
Tagger – This module is responsible for adding semantic information to each lexeme extracted from the processed text. Technically, this means adding to each lexeme the identifiers of the relevant concepts stored in the Semantic Map. Moreover, phrases consisting of several words are identified, and disambiguation is performed based on the derived contexts. Consider the example illustrated in the figure.
Indexer – This module is responsible for taking all the processed information, transforming it and storing it in the search index. This module will be enriched with methods of semantic indexing using the ontology (Semantic Map) and language tools.
Search index – The central storage of processed documents (document repository) structured properly to manage full text of the documents, their metadata and all relevant semantic information (document index). The structure is optimized for search performance and accuracy.
Search – This module is responsible for running queries against the search index and retrieving relevant results. The search algorithms will be enriched to use user intents (in compliance with data privacy requirements) and the prepared Semantic Map to match semantic information stored in the search index.
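The flow through the components listed above can be sketched end to end in Python. This is a deliberately small toy under stated assumptions: the stop-word list, lemma table, concept ids and the `SearchIndex` class are invented for illustration, the Connector is replaced by plain strings, and multi-word phrases, disambiguation and user intents are left out.

```python
import re
from collections import defaultdict

# Hypothetical language resources standing in for the Semantic Map.
STOP_WORDS = {"a", "the", "of", "and", "is", "with"}
LEMMAS = {"coffees": "coffee", "teas": "tea", "drinks": "drink"}
CONCEPTS = {
    "coffee": "C_COFFEE",
    "espresso": "C_COFFEE",
    "tea": "C_TEA",
    "drink": "C_BEVERAGE",
    "beverage": "C_BEVERAGE",
}


def parse(text):
    """Parser: tokenize, drop stop-words, then lemmatize."""
    tokens = re.findall(r"\w+", text.lower())
    return [LEMMAS.get(t, t) for t in tokens if t not in STOP_WORDS]


def tag(lexemes):
    """Tagger: attach a concept id (or None) to every lexeme."""
    return [(lexeme, CONCEPTS.get(lexeme)) for lexeme in lexemes]


class SearchIndex:
    """Indexer, search index and Search in one small class."""

    def __init__(self):
        self.documents = {}                 # document repository
        self.by_concept = defaultdict(set)  # concept id -> document ids

    def index(self, doc_id, text):
        self.documents[doc_id] = text
        for _, concept in tag(parse(text)):
            if concept is not None:
                self.by_concept[concept].add(doc_id)

    def search(self, query):
        """Match on concepts rather than on the literal query words."""
        concepts = {c for _, c in tag(parse(query)) if c is not None}
        hits = set()
        for concept in concepts:
            hits |= self.by_concept.get(concept, set())
        return sorted(hits)


idx = SearchIndex()
idx.index("doc1", "A guide to espresso and other coffees")
idx.index("doc2", "The history of tea")
idx.index("doc3", "Cold drinks for summer")
```

With this setup, a query for "coffee" finds the document that only mentions "espresso" and "coffees", and a query for "beverage" finds the document about "drinks", because query and documents meet at the concept level rather than at the word level.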
What do you think? Please let us know by writing a comment.