Making your data F.A.I.R and smart

This is the second post in a new series by Fredric Landqvist & Peter Voisey, explaining how your organisation could best shape its data landscape for the future.

How to create a smart data framework for your organisation

In our last post for you, we presented the benefits of F.A.I.R data, how to make data smarter for search engines and the potentials of an Information Commons. In this post, we’re giving you the pragmatic steps to make your data FAIR by creating and applying your own smart data framework. Your data-sharing dream, internally and externally, is possible.

A smart data framework, using FAIR data principles, encompasses the tooling, models and standards that govern datasets and the different context-specific information systems (registers, catalogues). The data is then ingested and processed (enriched/refined) into smart data, datasets and data catalogues. It can then be used and reused by different applications and e-services via open APIs. In this ecosystem, all actors and information behaviours (personas) interplay: provision agents, owners, builders, enrichers, end-user searchers and referrers.

The workings of a smart data framework

A smart data & metadata catalogue   

A smart data & metadata catalogue (illustrated below), provides an organisational capability that aligns data management with the FAIR data principles. View it not so much as one system to rule them all, but rather an ecosystem that is smart and sustainable. In order to simplify your complex and heterogeneous information environment, this set-up can be  instantiated, as one overarching mechanism. Although we are describing a data and metadata catalogue here, the exact same framework and set up would of course apply also to your organisation’s content, making it smarter and more findable (i.e. it gets the sustainable stamp).

Smart Data Catalogue
The necessary services and component of a smart data catalogue

The above picture illustrates the services and components that, together, build smart data and metadata catalogue capabilities. We now describe each one of them for you:

Processing (Ingestion & Enrichment) for great Findability & Interoperability

  • (A) Ingest, harvest and operate. Here you connect the heterogeneous data sources for ingestion.

The configured input mechanisms describe each of the data sources, with their data, datasets and metadata ready for your catalogue search. Hopefully, at the dataset upload stage, you have provided a good system/form that now provides your search engine with great metadata (i.e. we recommend you use the open data catalogue standard DCAT-AP). The concept upload is interchangeable with either machine-to-machine harvester mechanisms, as with open-data, traditional data integration, or manual provision by human upload effort. (D) Enterprise Metadata Repository: here is the persistent storage of data in both data catalogue, index and graph. All things get a persistent ID (how to design persistent URI) and rich metadata.

  • (B) Enrich, refine analyze, and curate. This is the AI part (NLP, Semantics, ML) that enriches the data and datasets, making them smarter. 

Concepts (read also entities, terms, phrases, synonyms, acronyms etc.) from the data sources are found using named entity extraction (NER). By referring to a Knowledge Graph in the Enricher, the appropriate resources are annotated (“tagged”) with the said concept. It does not end here, however. The concept also takes with it from the Knowledge Graph all of the known relationships it has with other concepts.

Essentially a Knowledge Graph is your encoded domain knowledge in a connected graph format. It is by reading these encoded relationships that the machine “understands” the meaning or aboutness of data.

This opens up a very nice Pandora’s box for your search (understanding query intent) and for your Graphical User Interface (GUI) as your data becomes smarter now through your ability to exploit the relationships and connections (semantics and context) between concepts.

You and AI can have a symbiotic relationship in the development of your Knowledge Graph. AI can suggest new concepts and relationships as new data is added. It is, however, you and your colleagues who determine the of concepts/relationships in the Knowledge Graph – concepts/relationships that are important to your department or business. Remember you can utilise more than one knowledge graph, or part of one, for a particular business need(s) or data source(s). The Knowledge Graph is a flexible expression of your business/information models that give structure to all your data and its access.

Extra optional step: If you can manage not only to index the dataset metadata but the datasets themselves, you can make your Pandora’s box even nicer. Those cryptic/nonsensical field names that your traditional database experts love to create can also be incorporated and mapped (one time only!) into your Knowledge Graph, thus increasing the machine “understanding” of the data. Thus, there is a better chance of the data asset being used more widely. 

The configuration of processing with your Knowledge Graph can take care of dataset versioning, lineage and add further specific classifications e.g. data sensitivity, user access and personal information.

Lastly on Processing, your cultural and system interoperability is immensely improved. We’re not talking everyone speaking the same language here, rather everyone talking their language (/culture) and still being able to find the same thing. In this open and FAIR vocabularies further, enrich the meaning to data and your metadata is linked. System interoperability is partially achieved by exploiting the graph of connections that now “sit over” your various data sources.

Controlled Access (Accessible and Reusable)

  • (C) Access, search and visualize APIs. These tools control and influence the delivery, representation, exploration and consumption/use of datasets and data catalogues via a smarter search (made so by smarter data) and a more intuitive Graphical User interface (GUI).

This means your search can now “understand” user intent from just one or two keyword queries (through known relationship connections in the Knowledge Graph). 

Your search now also caters for your searchers who are searching in an unfamiliar subject area or are just having a query off day. Besides offering the standard results page, the GUI can also present related information (again due to the Knowledge Graph), past related user queries, information and question-answer (Q&A) type material. So: search, discovery, learning, serendipity.

Your GUI can also now become more intuitive, changing its information presentation and facets/filters automatically, depending on the query itself (more sustainable front-end coding). 

An alternative to complex scenario coding also includes the possibility for you to create rules (set in your Knowledge Graph) that can control what data users can access (when, how and where) based on their profile, their role, their location, the time and on the device they are using. This same Knowledge Graph can help push and recommend data for certain users proactively. Accessibility will be possible by using standard communication protocols, open access (when possible), authentication where necessary, and always with metadata at hand.

Reusable: your new smart data framework can help increase the time your Data Managers (/Scientists, Analysts) spend using data (and not trying to find it, the 80/20 data science dilemma). It can also help reduce the risk to your AI projects (50% failure rate) by helping searchers find the right data, with its meaning and context, more easily.  Reuse will also be possible with the design that metadata multiple attributes, use licence and provenance in line with community standards

Users and information behaviour (personas)

Users and personas
User groups and services

From experience we have defined the following broad conceptual user-groups:

  • Data Managers, a.k.a. Data Op’s or Data Scientists
    Data Managers are i.e. knowledge engineers, taxonomists and analysts. 
  • Data Stewards
    Data Stewards are responsible for Data Governance, such as data lineage. 
  • Business Professionals/Business end-users
    Business Users may have a diverse background. Hence Business end-users.
  • Actor System are different information systems and applications and services that integrate information via the rich open APIs from the Smart Data Catalogue

The outlined collaborative actors (E-H user groups) and their interplay as information behaviour (personas) with the data (repository) and services (components), together, build the foundation for a more FAIR data management within your organisation, providing for you at the same time, the option to contribute to an even broader shared open FAIR information commons.

  • (E) Data Op’s workplace and dashboard is a combination of tools supporting Data Op’s data management processes in the information behaviours: data provision agents, enrichers and developers.
  • (F) Data Governance workplace is the tools to support Data Stewards collaborative data governance work with Data Managers in the information behaviours: data owner.
  • (G) Access, search, visualize APIs, is the user experience to explore, find and interact with the catalogue and data in the information behaviours: searcher and referrer.
  • (H) API, is the set of open APIs to support access to catalogue data for consuming information systems in the information behaviours: referrer (a.k.a. data exchange).

Potential tooling for this smart data framework:

We hope you enjoyed this post and understand the potential benefits such a smart data framework incorporating FAIR data principles can have on your data catalogue, or for that matter, your organisational content or even your data swamps.


In the next post, Toward data-centric solutions with Knowledge Graphs, we talk about Knowledge Graphs (KG) and its non-proprietary RDF semantic web tech, how you can create your KG(s) and the benefits they can bring to your future data landscape.

View Fredric Landqvist's LinkedIn profileFredric Landqvist research blog
View Peter Voisey's LinkedIn profilePeter Voisey

Activate conference 2018

Opensource has won! Now, what about AI?

Grant Ingersoll is on stage at the opening of Activate18 explaining the reasoning behind changing the name.

The revolution is won, opensource won, search as a concept to reckon with, they all won.

The times I come across a new search project where someone is pushing anything but opensource search is few and far between these days.

Since Search has taken a turn towards AI, a merge with that topic seems reasonable, not to say obvious. But AI in this context should probably be interpreted as AI to support good search results. At least if judging from the talks I attended. Interesting steps forward is expert systems and similar, none which was extensively discussed as of my knowledge. A kind of system we work with at Findwise. For instance, using NLP, machine learning and text analytics to improve a customer service.

Among the more interesting talks I attended was Doug Turnbulls talk on Neural Search Frontier. Some of the matrix-math threw me back to a ANN-course I took 10 years ago. Way before I ever learned any matrix maths. Now, way post remembering any matrix math-course I ever took, it’s equally confusing, possibly on a bit higher level. But he pointed out interesting aspects and show conceptually how Word2Vec-vectors work and won’t work. Simon Hughes talk “Vectors in search – Towards more semantic matching” is in the same area but more towards actually using it.

Machine Learning is finally mainstream

If we have a look at the overall distribution of talks, I think it’s safe to say that almost all talks touched on machine learning in some way. Most commonly using Learning to Rank and Word2Vec. None of these are new techniques (Our own Mickaël Delaunay wrote a nice blog-post about how to use LTR for personalization a couple of years ago. They have been covered before to some extent but this time around we see some proper, big scale implementations that utilizes the techniques. Bloomberg gave a really interesting presentation on what their evolution from hand tuned relevance to LTR over millions of queries have been like. Even if many talks were held on a theoretical/demo-level it is now very clear. It’s fully possible and feasible to build actual, useful and ROI-reasonable Machine Learning into your solutions.

As Trey Grainer pointed out, there are different generations of this conference. A couple of years ago Hadoop were everywhere. Before that everything was Solr cloud. This year not one talk-description referenced the Apache elephant (but migration to cloud was still referenced, albeit not in the topic). Probably not because big data has grown out of fashion, even though that point was kind of made, but rather that we have other ways of handling and manage it these days.

Don’t forget: shit in > shit out!

And of course, there were the mandatory share of how-we-handle-our-massive-data-talks. Most prominently presented by Slack, all developers favourite tool. They did show a MapReduce offline indexing pipeline that not only enabled them to handle their 100 billion documents, but also gave them an environment which was quick on its feet and super suitable for testing new stuff and experimenting. Something an environment that size usually completely blocks due to re-indexing times, fear of bogging down your search-machines and just general sluggishness.

Among all these super interesting technical solutions to our problems, it’s really easy to forget that loads of time still have to be spent getting all that good data into our systems. Doing the groundwork, building connectors and optimizing data analysis. It doesn’t make for so good talks though. At Findwise we ususally do that using our i3-framework which enables you to ingest, process, index and query your unstructured data in a nice framework.activate 2018 solr lucid opensource

I now look forward to doing the not so ground work using inspiration from loads of interesting solutions here at Activate.

Thanks so much for this year!

The presentations from the conference are available on YouTube in Lucidworks playlist for Activate18.

Author and event participant: Johan Persson Tingström, Findability Expert at Findwise

Trials & Jubilations: the two sides of the GDPR coin

We have all heard about the totally unhip GDPR and the potential wave of fines and lawsuits. The long arm of the law and it’s stick have been noted. Less talked about but infinitely more exciting is the other side. Turn over the coin and there’s a whole A-Z of organisational and employee carrots. How so?

Sign up to the joint webinar the 18th of April 3PM CET with Smartlogic & Findwise, to find out more.

https://flic.kr/p/fJD1eA

Signal Tools

We all leave digital trails behind us, trails about us. Others that have access to these trails can use our data and information. The new European General Data Protection Regulation (GDPR) intends the usage of such Personal Identifiable Information (PII) to be correct and regulated, with the power to decide given to the individual.

Some organisations are wondering how on earth they can become GDPR compliant when they already have a business to run. But instead of a chore, setting a pathway to allow for some more principled digital organisational housekeeping can bring big organisational gains sooner rather than later.

Many enterprises are now beginning to realise the extra potential gains of having introduced new organisational principles to become compliant. The initial fear of painful change soon subsides when the better quality data comes along to make business life easier. With the further experience of new initiatives from new data analysis, NLP, deep learning, AI, comes the feeling:  why we didn’t we just do this sooner?

Most organisations have a system(s) in place holding PII data, even if getting the right data out in the right format remains problematical. The organisation of data for GDPR compliance can be best achieved so that it becomes transformed to be part of a semantic data layer. With such a layer, knowing all the related data from different sources you have on Joe Bloggs becomes so much easier when he asks for a copy of the data you have about him. Such a semantic data layer will also bring other far-reaching and organisation-wide benefits.

Semantic Data Layer

Semantic Data Layer

For example, heterogeneous data in different formats and from different sources can become unified for all sorts of new smart applications, new insights and new innovation that would have been previously unthinkable. Data can stay where it is… no need to change that relational database yet again because of a new type of data. The same information principles and technologies involved in keeping an eye on PII use, can also be used to improve processes or efficiencies and detect consumer behaviour or market changes.

But it’s not just the business operations that benefit, empowered employees become happier having the right information at hand to do their job. Something that is often difficult to achieve, as in many organisations, no one area “owns” search, making it is usually somebody else’s problem to solve. For the Google-loving employee, not finding stuff at work to help them in their job can be downright frustrating. Well ordered data (better still in a semantic layer) can give them the empowering results page they need. It’s easy to forget that Google only deals with the best structured and linked documentation, why shouldn’t we do the same in our organisations?

Just as the combination of (previously heterogeneous) datasets can give us new insights for innovation, we also observe that innovation increasingly comes in the form of external collaboration. Such collaboration of course increases the potential GDPR risk through data sharing, Facebook being a very current point in case. This brings in the need for organisational policy covering data access, the use and handling of existing data and any new (extra) data created through its use. Such policy should for example cover newly created personal data from statistical inference analysis.

While having a semantic layer may in fact make human error in data usage potentially more possible through increased access, it also provides a better potential solution to prevent misuse as metadata can be baked into the data to classify both information “sensitivity” and control user accessibility rights.

So how does one start?

The first step is to apply some organising principles to any digital domain, be it in or outside the corporate walls [the discipline of organising, Robert Gluschko] and to ask the key questions:

  1. What is being organised?
  2. Why is it being organised?
  3. How much of it is being organised?
  4. When is it being organised?
  5. Where is it being organised?

Secondly start small, apply organising principles by focusing on the low-hanging fruit: the already structured data within systems. The creation of quality data with added metadata in a semantic layer can have a magnetic effect within an organisation (build that semantic platform and they will come).

Step three: start being creative and agile.

A case story

A recent case, within the insurance industry reveals some cues to why these set of tools will improve signals and attention for becoming more compliant with regulations dealing with PII. Our client knew about a set of collections (file shares) where PII might be found. Adding search, and NLP/ML opened up the pandoras box with visual analytic tools. This is the simple starting point, finding i.e names or personal number concepts in the text. Second to this will be to add semantics, where industry standard terminologies and ontologies can further help define the meaning of things.

In all corporate settings, there exist both well-cultivated and governed collections of information resources, but usually also a massive unmapped terrain of content collections, where no one has a clue if there might be PII hidden amongst it. The strategy using a semantic data layer should always be combined with operations to narrowing down the collections to become part of the signalling system – it is generally not a good idea to boil the whole-data-ocean in the enterprise information environment. Rather through such work practices, workers are aware of the data hot-spots, the well-cultivated collections of information and that unmapped terrain. Having the additional notion of PII to contend with will make it that just bit easier to recognise those places where semantic enhancement is needed.

not a good idea to boil the whole-data-ocean

Running with the same pipeline (with the option of further models to refine and improve certain data) will not only allow for the discovery of multiple occurrences of named entities (individuals) but also the narrative and context in which they appear.
Having a targeted model & terminology for the insurance industry will only go to improve this semantic process further. This process can certainly ease what may be currently manual processes or processes that don’t exist because of their manual pain: for example, finding sensitive textual information from documents within applications or from online textual chats. Developing such a smart information platform enables the smarter linking of other things from the model, such as service packages, service units / or organisational entities, spatial data as named places or timelines, or medical treatments, things perhaps currently you have less control over.

There’s not much time before the 25th May and the new GDPR, but we’ll still be here afterwards to help you with a compliance burden or a creative pathway, depending on your outlook.

Alternatively sign up to the joint webinar the 11th of April 3PM CET with Smartlogic & Findwise, to find out more.

View Fredric Landqvist's LinkedIn profileFredric Landqvist research blog
View Peter Voisey's LinkedIn profilePeter Voisey
View James Morris's LinkedIn profileJames Morris

Stay Cleaning and moving boxes for cloud

This is the seventh post in a series (1, 2, 3, 4, 5, 6) on the challenges organisations face as they move from having online content and tools hosted firmly on their estate to renting space in the cloud.  We will help you to consider the options and guide on the steps you need to take.

Starting from our first post we have covered different aspects you need to consider as you take each step including information structure and how it is managed using Office 365 and SharePoint as a technology example.  Planning for migration.

Moving Boxes

Do not even think about moving into the cloud apartment without a proper  cleaning of the content buckets. Moving from an architected household to a rented place, taxes a structured audit. Clean out all redundant, outdated and trivial matter (ROT). The very same habit you have cleaning up the attic when moving out from your old house.

It is also a good idea to decorate and add any features to your new cloud apartment before the content furniture is there.  It means the content will fit with any new design and adapt to any extra functionality with new features like windows and doors.  This can be done by reviewing and updating your publishing templates at the same time.  This will save time in the future.

Leaning upon the information governance standards, it should be easy to address the cleaning before moving, for all content owners who have been appointed to a set of collections or habitats. Most organisations could use a content vacuum cleaner, or rather use the search facilities and metric means to deliver up to date reports on:

  1. Active / in-Active habitats
  2. No clear ownership or the owner has left the building
  3. Metadata and link quality to content and collections to be moved across to the cloud apartments.
  4. Review publishing templates and update features or design to be used in the Cloud

When all active habitats and qualified content buckets have been revisited by their set of curators and information owners. The preparation and use of moving boxes, should be applied.

All moving boxes do need proper tagging, so that any moving company will be able to sort out where about the stuff should be placed in the new house, or building. For collections, and habitats, this means using the very same set of questions stated for adding a new habitat or collection to the cloud apartment house. Who, why, where and so forth, through the use of a structured workflow and form. When this first cleaning steps have been addressed, there should be automatic metadata enhancement, aligned with the information management processes to be used in the new cloud.

With decent resource descriptions and cleaned up content through the audit (ROT), this last step will auto-tag content based upon the business rules applied for the collection or habitat. Then been loaded into the content moving truck, or loading dock. Ready to added to the cloud.

All content that neither have proper assigned information ownership, or are in such a shape that migration can’t be done should persist on the estate or be archived or purged. This means that all metadata and links to either content bucket or habitat that won’t be moved in the first instances, should at least have correct and unique uri:s, address, to this content. And in the case a bucket or habitat have been run down by a demolition firm, purged. All inter-linkage to that piece of content or collection have to be changed.

This is typically a perfect quality report, to the information owners and content editors, that they need to work through prior to actually loading the content on the content dock.

Rubbish and Weed
Finally when all rotten data, deserted habitats and unmanageable buckets have been weeded out. It is time to prepare the moving truck, sending the content into its new destination.

Our final thread will cover how will the organisation and it habitants will be able to find content in this mix of clouds, and things left behind on the old estate? Cloud Search and Enterprise Search, seamless or a nightmare?

Please join our Live Stream on YouTube the 20th November 8.30AM – 10AM Central European Time
View Fredric Landqvist's LinkedIn profileFredric Landqvist research blog
View Mark Morrell's LinkedIn profileMark Morell intranet-pioneer

Content Governance – life cycle and reach

This is the fifth post in a series (1, 2, 3, 4, 6, 7 ) on the challenges organisations face as they move from having online content and tools hosted firmly on their estate to renting space in the cloud.  We will help you to consider the options and guide on the steps you need to take.

 Starting from our first post we have covered different aspects you need to consider as you take each step including information structure and how it is managed using Office 365 and SharePoint as a technology example.  We will cover governance and how content should be managed in the cloud in this post.

content buckets

Content created within a context, as either a departmental site, or team habitat has usually only reach and bearing for the local context of fellow members of staff within this unit. Other pieces of content have a coverage that stretches all parts of the business. One simple example, is the bucket of content that makes up the management system, with governing principles, strategies, policies and guidelines that describes the core processes, activities, roles and so forth within an organisation.

Yet other content, as the outcome from a project, will build a bucket of content that either lives in a new context, improves a bucket of content or feeds into yet another following project.

From an information management perspective, it is vital that you have organising principles to all your content, where all these layers have been covered. Both reach, and the life cycle to the set of content.

You need a governance framework that reaches out to every bucket of content.  This covers what is still on your estate as well as the growing amount in the cloud.  All content needs to be managed to remove risks of leakage of sensitive information and prevent people having an inconsistent user experience as they move from one bucket of content in the cloud to another content bucket still on the estate.

You need to make sure people do not see the difference between buckets of content on the estate from content buckets in the cloud.  People using your content to help with their work don’t need to know where the content is kept.  They need to find it as easily as before, preferably even easier!  Content in the cloud  should feel the same and be a natural extension to the digital environment people are already used to.  Manage it with a governance framework that covers every bucket of content and make it more easy to adopt quicker and use more often without caution or delay.

Part of your governance needs to cover publishing standards based on business needs so it is easy to access from any device e.g laptops, tablets and smartphones, and to view without unnecessary authentication levels.  This helps to create that consistent good user experience that encourages people to use your content whether the bucket is in the cloud or not.

A professional team from group HR, might work in their local teamsite, with on-going conversations, work-in-progress documents and so forth. Pieces of their content production leads to governing policies that have a global reach within the organisation, and needs to be linked from the corporate intranet spaces. with versioning and good quality to resource descriptions (meta data). This practice and professional network of HR people, do also share content on a departmental site. With links and resources, that have direct impact on their internal processes. The group of people, have outreaching triggers, and in-bound conversations. And have to balance these two states.

When it comes to temporal content buckets, like a project team site. There are several considerations one have to capture. First where will the outcome and result be stored, when the project is finished. In which context will these content pieces contribute. Second, what should be captured from all on-going conversations (social elements) and work-in-progress and drafts developed during the projects lifecycle? Should a project habitat, be searchable after closing down? Or do the habitat change status, hence all documentation stay within the collection, but the overarching state to the habitat changes? Within Sharepoint these temporal states, versions, workflow and properties. All sum up the organising principles.

If these principles haven’t been ironed out, and been described and decided. Inevitable there will be emerging ghost towns, of dead habitats and lost collections of content. With no governance or ownership whatsoever. All this will become a digital landfill.

We will cover more about SharePoint in our next post in this series. Please visit Michael Sampson‘s recent slides where he takes you through strategy, planning, governance and user adoption for collaboration!
Please join our Live Stream on YouTube the 20th November 8.30AM – 10AM Central European Time
View Fredric Landqvist's LinkedIn profileFredric Landqvist research blog
View Mark Morrell's LinkedIn profileMark Morell intranet-pioneer

Wagon Trains to the Cloud

This is the first post in a series(2, 3, 4, 5, 6, 7) on the challenges organisations face when they move from having online content and tools hosted firmly on their estate to renting space in the cloud.  We will help you to consider the options and guide you on the steps you need to take.

In this first post we show you  the most common challenges that you are likely to face and how you may overcome these.

A fast migration path, to become tenants in a cloud apartment housing unfolds a set of business critical issues that have to be mitigated:

  • Wayfinding in a maze of content buckets and social habitats.
  • Emerging digital Ghost Towns due to lack of information governance.
  • Digital Landfills without organising principles for information and data.
  • Digital Litter with little or no governance or principles for ownership, with redundant, outdated and trivial (ROT) content.
  • With no strategy or plan, erodes any possibility to positive business outcome from moving to the clouds.

WagonTrn.jpg
WagonTrn” by Tillman at en.wikipedia – Transferred from en.wikipedia by SreeBot. Licensed under Public domain via Wikimedia Commons.

The way forward is to settle a sustainable information architecture, that supports an information environment in constant flux. With information and data interoperable on any platform, everywhere, anytime and on any device.

You need to show how everything is managed and everyone fits together.  A governance framework can help do this.  It can show who is responsible for the intranet, what their responsibilities are and fit with the strategy and plan.  Making it available to everyone on the intranet helps their understanding of how it is managed and supports the business.

The main point is to have a governance framework and information architecture with the same scope to avoid gaps in content being managed or not being found.

Both need to be in harmony and included in any digital strategy.  This avoids competing information architectures and governance frameworks being created by different people that causes people to have inconsistent experiences not finding that they need and using alternative, less efficient, ways in future to find what they need to help with their work.

Background

Building huts, houses and villages is an emerging social construction. As humans we coordinate our common resources, tools and practices. A habitat populated by people needs housekeeping rules with available resources for cooking, cleaning, social life and so on. Routines that defines who does what task and by when in order to keep everything ok.

A framework with governing principles that set out roles and responsibilities along with standards that set out the expected level of quality and quantity of each task that everyone is engaged and complies with, is similar to how the best intranets and digital workplaces are managed.

In the early stages with a small number of habitats the rules for coordination are pretty simple, both for shared resources between the groups and pathways to connect them. The bigger a village gets, it taxes the new structures to keep things smooth. When we move ahead into mega cities with 20+ million people living close, it boils down to a general overarching plan and common infrastructures, but you also need local networked communities, in order to find feasible solutions for living together.

Like villages and mega cities there is a need for consistency that helps everyone to work and live together.  Whenever you go out you know that there are pavements to walk on, roads for driving, traffic lights that we stop at when they turn red and signs to help us show the easiest way to get to our destination.

Sustainable architecture and governance creates a consistent user experience. A well structured information architecture that is aligned with a clear governance framework sets out roles and responsibilities. Publishing standards based on business needs that supports the publishers follow them. This means wherever content is published, whether it is accredited or collaborative, it will appear to be consistent to people and located where they expect it to be.  This encourages a normal way to move through a digital environment with recognizable headings and consistently placed search and other features.

This allegori, fits like a glove when moving into large enterprise-wide shared spaces for collaboration. Whether it is cloud based, on-premises or a mix thereof. The social constructions and constraints still remain the same. As an IT-services on tap, cloud, has certainly constraints for a flexible and adjustable habitual construction to be able to host as many similar habitats as possible. But offers a key solutions to instantly move into! Tenants share the same apartment building (Sharepoint online).

When the set of habitats grow, navigation in this maze becomes a hazard for most of us. Wayfinding in a digital mega city, is extremely difficult. To a large extent, enterprises moving into collaboration suites suffer from the same stigma. Regardless if it is SharePoint, IBM Connections, Google Apps for Work, or a similar setting. It is not a discussion of which type of house to choose, but rather which architecture and plan that work in the emerging environment.

Information Architecture for Digital Habitats

If one leans upon linked-data,  linked-open-data, and emerging semantic web and web of data standards, there are a set of very simple guidelines that one should adhere to when building a Digital Village or Mega City. The 5 stars, our beacon of light!

All collections and shared spaces, should have persistent URI:s, which is the fourth star in the ladder. When it comes to the third star of non-proprietary formats it obviously becomes a bit tricky, since i.e. MS Sharepoint and MS Office like to encourage their own format to things. But if one add resource descriptions to collections and artifacts using Dublin Core elements, it will be possible to connect different types of matter. With feasible and standardised resource descriptions it will be possible to add schemas and structures, that can tell us a little bit more about the artifacts or collection thereof. Hence the option to adhere to the second star. The first star, will inside the corporate setting become key to connect different business units, areas with open licenses and with restrictions to internal use only and in some cases open for other external parties.

Linking data-sets, that is collections or habitats, with different artifacts is the fifth star. This is where it all starts to make sense, enabling a connected digital workplace. Building a city plan, with pathways, traffic signals and rules, highways, roads, neighborhoods  and infrastructural services and more. In other words, placemaking!

Placemaking is a multi-faceted approach to the planning, design and management of public spaces. Placemaking capitalizes on a local community’s assets, inspiration, and potential, with the intention of creating public spaces that promote people’s health, happiness, and well being.

We will cover more about how this applies to Office 365 and SharePoint in our next post.

Please join our Live Stream on YouTube the 20th November 8.30AM – 10AM Central European Time
View Fredric Landqvist's LinkedIn profileFredric Landqvist research blog
View Mark Morrell's LinkedIn profileMark Morell intranet-pioneer

Phonetic Algorithm: Bryan, Brian, Briane, Bryne, or … what was his name again?

Let the spelling loose …

What do Callie and Kelly have in common (except for the double ‘l’ in the middle)? What about “no” and “know”, or “Ceasar’s” and “scissors” and what about “message” and “massage”? You definitely got it – Callie and Kelly, “no” and “know”, “Ceasar’s” and “scissors” sound alike, but are spelled quite differently. “message” and “massage” on the other hand differ by only one vowel (“a” vs “e”) but their pronunciation is not at all the same.

It’s a well known fact for many languages that ortography does not determine the pronunciation of words. English is a classic example. George Bernard Shaw was the attributed author of “ghoti” as an alternative spelling of “fish”. And while phonology often reflects the current state of the development of the language, orthography may often lag centuries behind. And while English is notorious for that phenomenon it is not the only one. Swedish, French, Portuguese, among others, all have their ortography/pronunciation discrepancies.

Phonetic Algorithms

So how do we represent things that sound similar but are spelled different? It’s not trivial but for most cases it is not impossible either. Soundex is probably the first algorithm to tackle this problem. It is an example of the so called phonetic algorithms which attempt to solve the problem of giving the same encoding to strings which are pronounced in a similar fashion. Soundex was designed for English only but has its limits. DoubleMetaphone (DM) is one of the possible replacements and relatively successful. Designed by Lawrence Philips in the beginning of 1990s it not only deals with native English names but also takes proper care of foreign names so omnipresent in the language. And what is more – it can output two possible encodings for a given name, hence the “Double” in the naming of the algorithm, – an anglicised and a native (be that Slavic, Germanic, Greek, Spanish, etc.) version.

By relying on DM one can encode all the four names in the title of this post as “PRN”. The name George will get two encodings – JRJ and KRK, the second version reflecting a possible German pronunciation of the name. And a name with Polish origin, like Adamowicz, would also get two encodings – ATMTS and ATMFX, depending on whether you pronounce the “cz” as the English “ch” in “church” or “ts” in “hats”.

The original implementation by Lawrence Philips allowed a string to be encoded only with 4 characters. However, in most subsequent
implementations of the algorithm this option is parameterized or just omitted.

Apache Commons Codec has an implementation of the DM among others (Soundex, Metaphone, RefinedSoundex, ColognePhonetic, Coverphone, to
name just a few.) and here is a tiny example with it:

import org.apache.commons.codec.language.DoubleMetaphone;

public class DM {

public static void main(String[] args) {

String s = "Adamowicz";

DoubleMetaphone dm = new DoubleMetaphone();

// Default encoding length is 4!

// Let's make it 10

dm.setMaxCodeLen(10);

System.out.println("Alternative 1: " + dm.doubleMetaphone(s) +

// Remember, DM can output 2 possible encodings:

"nAlternative 2: " + dm.doubleMetaphone(s, true));

}
}

The above code will print out:

Alternative 1: ATMTS

Alternative 2: ATMFX

It is also relatively straightforward to do phonetic search with Solr. You just need to ensure that you add the phonetic analysis to a field which contains names in your schema.xml:

Enhancements

While DM does perform quite well, at first sight, it has its limitations. We should know that it still originated from the English language and although it aims to tackle a variety of non-native borrowings most of the rules are English-centric. Suppose you work on any of the Scandinavian languages (Swedish, Danish, Norwegian, Icelandic) and one of the names you want to encode is “Örjan”. However, “Orjan” and “Örjan” get different encodings – ARJN vs RJN. Why is that? One look under the hood (the implementation in DoubleMetaphone.java) will give you the answer:

private static final String VOWELS = "AEIOUY";

So the Scandinavian vowels “ö”, “ä”, “å”, “ø” and “æ” are not present. If we just add these then compile and use the new version of the DM implementation we get the desired output – ARJN for both “Örjan” and “Orjan”.

Finally, if you don’t want to use DM or maybe it is really not suitable for your task, you still may use the same principles and create your own encoder by relying on regular expressions for example. Suppose you have a list of bogus product names which are just (mis)spelling variations of some well known names and you want to search for the original name but get back all ludicrous variants. Here is one albeit very naïve way to do it. Given the following names:

CupHoulder

CappHolder

KeepHolder

MacKleena

MackCliiner

MacqQleanAR

Ma’cKcle’an’ar

and with a bunch of regular expressions you can easily encode them as “cphldR” and “mclnR”.

String[] ar = new String[]{"CupHoulder", "CappHolder", "KeepHolder",
"MacKleena", "MackCliiner", "MacqQleanAR", "Ma'cKcle'an'ar"};

for (String a : ar) {
a = a.toLowerCase();
a = a.replaceAll("[ae]r?$", "R");
a = a.replaceAll("[aeoiuy']", "");
a = a.replaceAll("pp+", "p");
a = a.replaceAll("q|k", "c");
a = a.replaceAll("cc+", "c");
System.out.println(a);
}

You can now easily find all the ludicrous spellings of “CupHolder” och “MacCleaner”.

I hope this blogpost gave you some ideas of how you can use phonetic algorithms and their principles in order to better discover names and entities that sound alike but are spelled unlike. At Findwise we have done a number of enhancements to DM in order to make it work better with Swedish, Danish and Norwegian.

References

You can learn more about Double Metaphone from the following article by the creator of the algorithm:
http://drdobbs.com/cpp/184401251?pgno=2

A German phonetic algorithm is the Kölner Phonetik:
http://de.wikipedia.org/wiki/Kölner_Phonetik

And SfinxBis is a phonetic algorithm based on Soundex and is Swedish specific:
http://www.swami.se/projekt/sfinxbis.68.html

Search and Content Quality – Ways of Improving Your Intranet

If you have 6 minutes to spare I would recommend you to watch this interview with Gabriel Olsson from Tetra Pak. During the last years Tetra Pak has been working strategically with turning their intranet into something true end user-centric. Tetra Pak has also put effort into search and content quality.

By actually asking the employees what they expect to find and what sort of information that would make their everyday work (tasks) more efficient, Tetra Pak has managed to create a navigation structure based on facts reflecting these needs. The method used is Gerry McGovern’s Task based Customer Carewords… and the result? The ones that scream the loudest are not the most important – the need of the employees is.

Gabriel is also talking about the importance of following up on search by key matches and synonyms. This, together with content quality initiatives, helps create a solid foundation for search, the simple reasons being:

Use metadata to filter search results (note, not a Tetra Pak picture)

  • If the quality of the information is good (clear headings, good metadata, frequent keywords), the information found through search will be good as well. If you have a lot of old content and duplicates this will be just as visible, making it hard for the users to determinate what is qualitative and trustworthy.Good quality will also make it possible to group and categorize information.
  • Synonyms makes it easy to adjust the corporate language to the one used by the employees. Let people search for “report” when they want to find a “bulletin”. A simple synonym list, based on search statistics will make users find what they want, without thinking about how to phrase the query.The synonyms can used in the background (without the users knowledge) or as ‘did you mean-suggestions’:

    Synonyms used for ‘Did you mean” functionality (note, not a Tetra Pak picture)

  • Key matches (also referred to as sponsored links, best bets or editor’s pick) are used to manually force the first hit in the search result list to refer to a specific page or document. By following up on search statistics and knowing what sort of information that is frequently most asked for, it is easy to adjust the search result list. However, this take  time and effort to follow up.

Tetra Pak is not alone when it comes to adjusting their intranets to true end-user needs. During the spring there will be a number of conferences where our customers will be sharing experiences from their initiatives. Among others Ability Partner, and the recently completed IntraTeam.

Apart from this, our own breakfast seminaries is a, as always, announced on our homepage and on twitter. Looking forward to seeing you!