Activate conference 2018

Open source has won! Now, what about AI?

Grant Ingersoll is on stage at the opening of Activate18 explaining the reasoning behind changing the name.

The revolution has been won: open source won, and search as a concept to be reckoned with won too.

These days it is rare that I come across a new search project where someone is pushing anything but open-source search.

Since search has taken a turn towards AI, merging the two topics seems reasonable, not to say obvious. But AI in this context should probably be interpreted as AI in support of good search results – at least judging from the talks I attended. Interesting steps forward include expert systems and the like, though none of these were discussed in much depth as far as I know. That is the kind of system we work with at Findwise, for instance using NLP, machine learning and text analytics to improve customer service.

Among the more interesting talks I attended was Doug Turnbull's talk on the Neural Search Frontier. Some of the matrix math threw me back to an ANN course I took ten years ago, well before I had learned any matrix math. Now, long past remembering any matrix math course I ever took, it is equally confusing, just on a slightly higher level. But he pointed out interesting aspects and showed conceptually how Word2Vec vectors work – and where they don't. Simon Hughes' talk “Vectors in search – Towards more semantic matching” is in the same area but leans more towards actually putting the vectors to use.

Machine Learning is finally mainstream

If we look at the overall distribution of talks, I think it's safe to say that almost all of them touched on machine learning in some way, most commonly using Learning to Rank (LTR) and Word2Vec. Neither technique is new (our own Mickaël Delaunay wrote a nice blog post about how to use LTR for personalization a couple of years ago). They have been covered before to some extent, but this time around we saw some proper, large-scale implementations that utilize them. Bloomberg gave a really interesting presentation on what their evolution from hand-tuned relevance to LTR over millions of queries has been like. Even if many talks stayed at a theoretical or demo level, one thing is now very clear: it is fully possible and feasible to build actual, useful, ROI-reasonable machine learning into your solutions.

As Trey Grainger pointed out, there are different generations of this conference. A couple of years ago Hadoop was everywhere. Before that, everything was SolrCloud. This year not one talk description referenced the Apache elephant (migration to the cloud was still referenced, albeit not in the titles). Probably not because big data has gone out of fashion, even though that point was sort of made, but rather because we have other ways of handling and managing it these days.

Don’t forget: shit in > shit out!

And of course, there was the mandatory share of how-we-handle-our-massive-data talks, most prominently presented by Slack, every developer's favourite tool. They showed a MapReduce offline indexing pipeline that not only enables them to handle their 100 billion documents, but also gives them an environment that is quick on its feet and well suited for testing new things and experimenting – something an environment of that size usually blocks completely, due to re-indexing times, fear of bogging down the search machines and general sluggishness.

Among all these super interesting technical solutions to our problems, it's easy to forget that a lot of time still has to be spent getting all that good data into our systems: doing the groundwork, building connectors and optimizing data analysis. It doesn't make for very good talks, though. At Findwise we usually do that using our i3 framework, which lets you ingest, process, index and query your unstructured data in one place.

I now look forward to doing the not-so-groundwork, drawing inspiration from the many interesting solutions presented here at Activate.

Thanks so much for this year!

The presentations from the conference are available on YouTube in Lucidworks' playlist for Activate18.

Author and event participant: Johan Persson Tingström, Findability Expert at Findwise

Are the messages on the election posters just empty words?

It is impossible not to notice all the political conversations in Sweden now, less than two weeks before election day. During times like these, parties focus a lot of energy on getting their point across to the public – but how much of it is just slogans that sound good printed on a poster, and how much is rooted in the everyday work of their organisations?

Are the words printed on the posters present in every street corner really the same as the ones being exchanged between the walls of the Swedish parliament building?

While ferociously staying away from the subject of who is right or wrong, let's see if there is a way to evaluate whether what they talk about in the parliament's everyday sessions is the same as what is printed in the manifestos released for the last two elections (2014 and 2018).

Analytical power at your fingertips with natural language and modern visualisation

Today we are all getting used to interactive dashboards and plots in self-service business intelligence (BI) solutions to drill down and slice our facts and figures. The market for BI tools has seen increased competition recently, with Microsoft Power BI challenging proven solutions such as Tableau, Qlik, IBM Cognos, SAP Lumira and others. At the same time, it is hard to benchmark the tools against each other, as they all come with very similar features. Has BI development reached saturation?

Compared to how we are used to consuming graphics and information, the BI approach to interactive analysis is somewhat different. For instance, a dashboard or report is typically presented in a printer-oriented flat layout on a white background, weeks of user training are typically needed before “self-service” is reached, and interactions are heavily click-oriented – you can almost feel it in your mouse elbow when opening the BI front end.

On the other hand, when surfing top internet sites and utilizing social media, our interactions are centred around the search box and the natural interface of typing or speaking. Furthermore, there is typically no training needed to make use of Google, Facebook, LinkedIn, Pinterest, Twitter, etc. Through an intuitive interface we learn along the way. And looking at graphics and visualization, we can learn a lot from the gaming industry where players are presented with well-designed artwork – including statistics presented in an intuitive way to maximize the graphical impression.

Take a look at this live presentation to see what a visual analysis using natural language can look like.


Rethink your business analytics

It appears as if BI tools are sub-optimized for a limited scope and use case. To really drive digitalization and make use of our full information potential, we need a new way of thinking about business analytics – not just continuous development, but rather a revolution in the business intelligence approach. Remember: e-mail was not a consequence of the continuous development of post offices and mail handling. We need to rethink business analytics.

At Findwise, we see that the future for business analytics involves:

  • added value by enriching information with new unstructured sources,
  • utilizing the full potential of visualization and graphics to explore our information,
  • using natural language to empower colleagues to draw their own conclusions intuitively and securely.

 

Enrich data

There is a lot of talk about data science today; how we can draw conclusions from our data and make predictions about the future. This power largely depends on the value in the data we possess. Enriching data is all about adding new value. The enrichment may include a multitude of sources, internal and external, for instance:

  • detailed customer transaction logs
  • weather history and forecasts
  • geospatial data (locations and maps)
  • user tracking and streams
  • social media and (fake) news

Compared with existing data, a new data source can be orthogonal to what we already have and add a completely new understanding. Business solutions of today are often limited to highly structured information sources or information providers. There is great power in unstructured, often untouched, information sources. However, it is not as straightforward as launching a data warehouse integration, since big data techniques are required to handle the volume, velocity and variety.
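As a minimal sketch of what such an enrichment step can look like in practice, here is a Python/pandas example that joins an internal transaction log with an external weather feed. The file names, column names and daily granularity are hypothetical assumptions for illustration; a production pipeline would rather run continuously against live feeds.

import pandas as pd

# Hypothetical inputs: an internal transaction log and an external weather feed
transactions = pd.read_csv("transactions.csv", parse_dates=["timestamp"])  # store_id, timestamp, amount
weather = pd.read_csv("weather.csv", parse_dates=["date"])                 # store_id, date, temperature_c, precipitation_mm

# Align the two sources on a shared key (store location + calendar day)
transactions["date"] = transactions["timestamp"].dt.normalize()
enriched = transactions.merge(weather, on=["store_id", "date"], how="left")

# The orthogonal source now adds new explanatory columns, e.g. sales volume by temperature band
print(enriched.groupby(pd.cut(enriched["temperature_c"], bins=5))["amount"].mean())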

At Findwise, utilizing unstructured data has always been key to developing unique solutions for search and analytics. The power of our solutions lies in incorporating multiple sources online and continuously enriching them with new aspects. For this we even developed our own framework, i3, with over a hundred connectors for unstructured data sources. A modern search engine (or insight engine) scales horizontally for big data applications and easily consumes billions of texts, logs, geospatial records and other unstructured – as well as structured – data. This is where search meets analytics, and where all the enrichment takes place to add unique information value.

 

Visually explore

As human beings we have very strong visual and cognitive abilities, developed over millions of years to distinguish complex patterns and scenarios. Visualization of data is all about packaging information in such a way that we can use these cognitive skills to make sense out of the noise. Great visualization and interaction unleash the human powers of perception and derivation; they allow us to make sense of the complex world around us.

When it comes to computer visualization, we have seen strong development in the use of graphical processors (GPUs) for games and, recently, also for analytics – not least in deep learning, where powerful GPUs solve heavy computations. For visualisation, however, typical business intelligence tools today use only a minimal fraction of the total power of our modern devices. As a comparison: a typical computer game renders millions of pixels in 3D several times per second (even in the web browser), while in a modern BI tool we may struggle to display 20 000 distinct points in a plot.

There are open standards and interfaces that fully utilize the graphical power of a modern display. Computer games often build on OpenGL to interact with the GPU. In web browsers, similar performance can be reached with WebGL and JavaScript libraries. And this is not only about regular computers or installed applications: The Manhattan Population Explorer (built with JavaScript on D3.js and Mapbox GL JS) is a notable example of an interactive and visually appealing analysis application that runs perfectly well on a regular smartphone.
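Our prototypes use OpenGL/WebGL for the actual rendering. Purely as an illustration of the same "don't choke on the point count" idea on the Python side, here is a sketch using datashader, which rasterizes very large point sets server-side instead of rendering them in a GPU-backed canvas – a different technique from the one in the prototype, shown only to make the scale concrete. The random data stands in for real observations.

import numpy as np
import pandas as pd
import datashader as ds
import datashader.transfer_functions as tf

# Half a million synthetic points standing in for e.g. housing transactions
n = 500_000
df = pd.DataFrame({
    "x": np.random.standard_normal(n),
    "y": np.random.standard_normal(n),
})

# Aggregate the points onto a pixel grid, then shade by point density
canvas = ds.Canvas(plot_width=900, plot_height=600)
agg = canvas.points(df, "x", "y")
img = tf.shade(agg, how="log")   # log scaling keeps dense areas from saturating

img.to_pil().save("points.png")  # in a notebook, `img` renders inline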


Example from one of our prototypes: analysing the housing market – plotting 500 000 points interactively utilizing OpenGL.

Current analysis solutions and applications built with advanced graphical analysis are typically custom-made for a specific purpose and topic, as in the example above. This is very similar to how BI solutions were built before self-service BI came into play – specific solutions handcrafted for a few use cases. In contrast, open graphical libraries, incorporated at the core of our visualizations and with inspiration from gaming artwork, can spark a revolution in how we visually consume and utilize information.


Natural language empowers

The process of interpreting and working with speech and text is referred to as natural language processing (NLP). NLP interfaces are becoming the default interface for interaction. For instance, Google's search engine can give you instant replies to questions such as “weather London tomorrow”, and with Google Duplex (under development) NLP is used to automate phone calls that make appointments for you. Other examples include the search box popping up as a central feature on many larger websites, and voice services such as Amazon Alexa, Microsoft Cortana, Apple Siri, etc.

When it comes to analysis tools, we have seen some movement in this direction lately. In Power BI Service (web), Cortana can be activated to allow simple Q&A on your prepared reports. Tableau has started talking about NLP for data exploration, with “research prototypes you might see in the not too distant future”. The clearest example in this direction is probably ThoughtSpot, built around a search-driven analytics interface. Yet for most of the business analytics carried out today, clicking is still the focus, and clicking is what is taught in training. How can this be, when our other interactions with information are moving towards natural language interfaces? The key to moving forward is to give NLP and advanced visualization a vital role in our solutions, allowing for an entirely natural interface.

Initially it may seem hard to know exactly what to type to get at the right data. Isn't training needed with an NLP interface too? This is where AI comes in, helping us interpret our requests and providing smart feedback. Looking at Google again, we continuously get recommendations, automatic spelling correction and synonym lookup that optimize our searches and hits. With a modern NLP interface, we learn along the way as we use it. Frankly speaking, though, a natural language interface is best suited for common queries that aren't too advanced. For more advanced data munging and customized analysis, a data scientist's skill set and environment may well be needed. However, the power of e.g. scientific Python or the R language could easily be incorporated into an NLP interface, where query suggestions turn into code completion. Scripting is a core part of the data science workflow.
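To make the idea of query suggestions turning into code completion concrete, here is a deliberately simple Python sketch: a handful of keyword rules that map a natural-language question to a pandas expression. The sales DataFrame, its column names and the rule set are made-up assumptions; a real NLP interface would of course add proper parsing, synonyms and spell correction as described above.

import re
import pandas as pd

# Hypothetical data set the analyst is exploring
sales = pd.DataFrame({
    "region": ["North", "South", "North", "South"],
    "year":   [2017, 2017, 2018, 2018],
    "amount": [120.0, 90.0, 150.0, 110.0],
})

METRICS = {"average": "mean", "total": "sum", "count": "count"}
DIMENSIONS = ["region", "year"]

def question_to_code(question):
    """Map a simple natural-language question to a pandas expression (as a code string)."""
    q = question.lower()
    metric = next((fn for word, fn in METRICS.items() if word in q), "sum")
    dims = [d for d in DIMENSIONS if d in q]
    years = re.findall(r"\b(20\d{2})\b", q)
    filt = f"[sales.year == {years[0]}]" if years else ""
    group = f".groupby({dims!r})" if dims else ""
    return f"sales{filt}{group}['amount'].{metric}()"

code = question_to_code("average amount per region in 2018")
print(code)        # sales[sales.year == 2018].groupby(['region'])['amount'].mean()
print(eval(code))  # the answer the interface would present as a figure or chart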

An analytical interface built around natural language helps direct focus and fine-tunes your analysis to arrive at intuitive facts and figures, explaining relevant business questions. This is all about empowering all users, friends and colleagues to draw their own conclusions and spread a data-driven mentality. Data science and machine learning techniques fit well into this concept to leverage deeper insights.

 

Conclusion – Business data at everyone’s fingertips

We have highlighted the importance of enriching data with unstructured data sources in mind, demonstrated the importance of visual exploration in engaging our cognitive abilities, and finally shown how a natural language interface empowers colleagues to draw their own conclusions.

Compared with the current state of the art in analysis and business intelligence tools, we stand before a paradigm shift. Standardized self-service tools built on clicking, basic graphics and a focus on structured data will be overrun by a new way of thinking about analysis. We all want to create intuitive insights without thorough training in how to use a tool. And we all want our insights and findings to be visually appealing – seeing is believing. To communicate our findings, conclusions and decisions we need to show the why, convincingly. This is where advanced graphics and art will help us. Natural language is the interface we use for more and more services, and it can just as easily be powered by voice. With a natural interface, anyone can learn to utilize the analytical power in the information and draw conclusions. Business data at everyone's fingertips!

To experience our latest prototype where we demonstrate the concept of data enrichment, advanced visualization and natural language interfaces, take a look at this live presentation.

 

Author: Fredrik Moeschlin, senior Data Scientist at Findwise

Pragmatic or spontaneous – What are the most common personal qualities in IT-job ads?

Open Data Analytics

At Findwise we regularly have companywide hackathons with different themes. The latest theme was insights in open data, which I personally find very interesting.

Our group chose to fetch data from Arbetsförmedlingen (the Swedish Employment Agency), where ten years of job ads are available. There are about 4 million job ads in total for this time period, so there is plenty of material to analyze.

To enable ad hoc analysis, we started off by extracting the competences and personal traits mentioned in the job ads. This allows us to spot trends in competences over time and across regions, or to correlate competences and traits. Lots of possibilities.
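As an illustration of this extraction step, here is a minimal dictionary-based tagger in Python. The term lists and the example ad text are invented for illustration; the actual pipeline works on the full Arbetsförmedlingen corpus with considerably richer vocabularies.

import re

# Hypothetical (tiny) controlled vocabularies
COMPETENCES = {"java", "javascript", "big data", "system architecture", "support"}
TRAITS = {"social", "driven", "pragmatic", "spontaneous", "quality oriented"}

def extract(ad_text, vocabulary):
    """Return the vocabulary terms that occur in the ad (simple exact phrase matching)."""
    text = ad_text.lower()
    return {term for term in vocabulary
            if re.search(r"\b" + re.escape(term) + r"\b", text)}

ad = "We are looking for a driven and pragmatic developer with JavaScript and Big Data experience."
print(extract(ad, COMPETENCES))  # {'javascript', 'big data'}
print(extract(ad, TRAITS))       # {'driven', 'pragmatic'}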

 

Personal qualities and IT competences

As an IT ninja, I find it most exciting to focus on jobs, competences and traits within the IT industry. A lot is happening, and it is of course easy for me to relate to this area. A report from Almega suggests that there will be a huge demand for IT competences in the coming years, and it brings up a lot of examples of lacking technical skills. What is rarely addressed is which personality types are connected to these specific competences. We are able to answer this interesting question from our data:

 

What personal traits are commonly asked for together with the competences that are in demand?


Figure 1 – Relevant work titles, competences and traits for the search term “big data”

 

The most wanted personal traits are, in general, “social, driven, passionate, communicative”. All these results should of course be taken with a grain of salt, since a few staffing and general IT consulting companies account for a large share of the job ads within IT. But we can also look at a single competence and answer the question:

 

What traits are more common with this competence than in general? (Making the question a bit more specific.)

Some examples of competences in demand are system architecture, support and JavaScript. The most distinctive traits for system architecture are sharp, quality-oriented and experienced. It can be debated whether experienced really is a trait (although our model thought so), but it makes sense in any case, since system architecture tends to be more common in senior roles. For support we find traits such as service-oriented, happy and nice, which is not unexpected. Lastly, for job ads requiring JavaScript competence, personal traits such as quality-oriented, quality-aware and creative are the most predominant.
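The question "which traits are more common with this competence than in general?" boils down to a simple lift calculation: P(trait | competence) / P(trait). A sketch in Python, assuming each ad has already been reduced to its sets of competences and traits as in the extraction example above (the four toy ads are made up for illustration):

from collections import Counter

# Hypothetical pre-processed ads: (competences, traits) per job ad
ads = [
    ({"javascript"}, {"creative", "quality oriented"}),
    ({"javascript"}, {"quality oriented", "driven"}),
    ({"support"}, {"service oriented", "happy"}),
    ({"system architecture"}, {"sharp", "experienced"}),
]

def trait_lift(competence, ads):
    """Lift = P(trait | ads mentioning the competence) / P(trait | all ads)."""
    overall = Counter(t for _, traits in ads for t in traits)
    subset = [traits for comps, traits in ads if competence in comps]
    within = Counter(t for traits in subset for t in traits)
    return {t: (within[t] / len(subset)) / (overall[t] / len(ads))
            for t in within}

# Traits most over-represented in JavaScript ads, highest lift first
print(sorted(trait_lift("javascript", ads).items(), key=lambda kv: -kv[1]))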

 

Differences between Stockholm and Gothenburg

Or let's have a look at the geographical differences between Sweden's two largest cities when it comes to personal qualities in IT job ads. In Gothenburg there is a stronger correlation with the traits spontaneous, flexible and curious, while Stockholm correlates with traits such as sharp, pragmatic and delivery-focused.

 

Which jobs best suit your personality?

You could also look at it the other way around and start from the personal traits to see which jobs and competences are meant for you. If you are analytical, then jobs such as controller or accountant could be for you. If you are an optimist, then job coach or guidance counselor seems to be a good fit. We created a small application where you can type in competences or personal traits and get job suggestions this way. Try it out here!

 

Learn more about Open Data Analytics

In addition, we're hosting a breakfast seminar on December 12th where we'll use the open data from Arbetsförmedlingen to show a process for making more data-driven decisions. More information and registration here (the seminar will be held in Swedish).

 

Author: Henrik Alburg, Data Scientist

Digital recycling & knowledge growth

How do we prevent the digital debris of human clutter and mess? And to what extent will future digital platforms guide us in knowledge creation and use?

Start making sense, and the art of making sense!


People and the Post, Postal History from the Smithsonian’s National Postal Museum

Mankind's preoccupation for much of this century has been to become fully digitalized. Utilities, software, services and platforms are all becoming an ‘intertwingled’ reality for all of us. Mobility, the blurring of the border between the workplace and recreational life, and the ease of digital creation are creating information overload and (out-of-sight) digital landfills. While digital content is cheaper to create and store, its volume and its uncared-for status make it harder for everyone else to find and consume the bits they really need (and to have some provenance for peace of mind).

Fear not. A collection of emerging digital technologies exists that can both support and maintain sustainable digital recycling – things like cognitive computing and artificial intelligence, natural language processing, machine learning and the like, semantics adding meaning to shared concepts, and graphs linking our content and information resources. With good information management practice and the appropriate supporting tools to tinker with, there is a great opportunity not only to automate knowledge digitization but to augment it.

Automation

In the content continuum (from creation to disposal) there is a great need to automate processes as much as possible in order to reduce the amount of obsolete or hidden (currently valueless) digital content. Digital knowledge recycling is difficult, as nearly every document or content creator is by nature reluctant to add digital tags (a.k.a. metadata) describing their content or documents once they have been created. What's more, experience shows that manual tagging is inefficient on a number of accounts, one of which is inconsistency.

Most digital documents (and most digital content, unless intended to sell something publicly) therefore lack the proper recycling resource descriptors that can help with e.g. classification, topic description or annotation with domain specific (shared, consistent) concepts. Such descriptions add appropriate meaning or context to content, aiding its further digital reuse (consumption). Without them, the problem of findability is likely to remain omnipresent across many intranets and searched resources.

Smartphones generate content automatically, often without the user thinking about it or even realizing it. All kinds of resource descriptors (time, place etc.) are created automatically through movement and mobile usage. With the addition of machine learning and algorithms, online services such as Google Photos use these descriptors (and some automatic annotation of their own) to add more contextual data before classifying pictures into collections. This improved data quality (read: added metadata and improved findability) allows us to find the pictures or timeline we want more easily.

In the very same manner, workplace content or documents can now have this same type of supporting technical platform that automatically adds additional business specific context and meaning. This could include data from users: their profiles, departments or their system user behaviour patterns.

For real organizational agility, though, a further layer of automatic annotation (tagging) and classification is needed – achieved using shared models of the business. These models can be expressed through a combination of controlled vocabularies (taxonomies) that can be further joined through relationships (ontologies) and finally published (publicly or privately) as domain models in the form of linked data (graphs). Within this layer exist not just synonyms but alternative and preferred labels, and, more importantly, relationships can be expressed between concepts – hence the graph: concepts are the dots (nodes), relationships the joining lines (edges). With certain tools, individual relationships between concepts can also be given a weighting.
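To make the idea concrete, here is a minimal Python sketch of such a layer: a couple of concepts with preferred and alternative labels plus a broader relationship, and an annotator that tags a piece of text with matching concepts. The concepts and labels are invented for illustration; in practice this would be a full SKOS-style vocabulary managed in a dedicated tool.

# Tiny, invented domain model: concepts with preferred/alternative labels and relations
CONCEPTS = {
    "apple_company": {"prefLabel": "Apple Inc.",
                      "altLabels": ["apple", "aapl"],
                      "broader": ["technology_company"]},
    "apple_fruit":   {"prefLabel": "Apple (fruit)",
                      "altLabels": ["apple"],
                      "broader": ["fruit"]},
    "technology_company": {"prefLabel": "Technology company", "altLabels": [], "broader": []},
    "fruit": {"prefLabel": "Fruit", "altLabels": [], "broader": []},
}

def annotate(text):
    """Return the ids of concepts whose preferred or alternative labels occur in the text."""
    text = text.lower()
    hits = set()
    for cid, concept in CONCEPTS.items():
        labels = [concept["prefLabel"].lower()] + [l.lower() for l in concept["altLabels"]]
        if any(label in text for label in labels):
            hits.add(cid)
            hits.update(concept["broader"])  # the graph adds broader context automatically
    return hits

print(annotate("Quarterly report: Apple launches a new phone"))

In this toy example the text matches both apple concepts; it is exactly the relationships to surrounding concepts (phones, companies, recipes) that a real annotator uses to disambiguate, as discussed below.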

This added layer generates a higher quality of automated context, meaning and consistency in the annotation (tagging) of content and documents alike. The very same layer feeds the information architecture used to navigate resources (e.g. websites). In search, it helps to disambiguate queries (e.g. apple the fruit, or Apple the organization?).

This digital helper application layer works very much in the same smooth manner as e.g. Google Photos, i.e. in the background, without troubling the user.

This automation, however, will not work without sustainable organizing principles applied in information management practices and tools. We still need a bit of human touch! (Just as Google Photos added theirs behind the scenes, as a work in progress.)

Augmentation

This codification or digitalization of knowledge allows content to be annotated, classified and navigated more efficiently. We are all becoming more aware of the Google Knowledge Graph and the Microsoft Graph, which connect content and people. Connecting the dots in a graph is analogous to linking digital concepts with their known relationships or values.

Augmentation can take a number of forms. A user searching for a particular query can be presented not only with the most appropriate search results (via the sense-making connections and relationships) but also with related ideas they had not thought of or were unaware of – new knowledge and serendipity!

Search, semantic, and cognitive platforms have now reached a much more useful level than in earlier days of AI. Through further techniques new knowledge can also be discovered by inference, using the known relationships within the graph to fill in missing knowledge.

Key to all of this, though, is building a supporting back-end platform for continuous improvement across the content continuum. Technically, this is easier to get started with than one might first suspect.

Sustainable Organising Principles to the Digital Workplace

 


Authors: Fredric Landqvist and Peter Voisey

Elastic{ON} 2017 – breaking all the records!

Elastic{ON} 2017 draws 2200 participants to Pier 48 during these somewhat chilly San Francisco days in March. That's a 40% increase from the 1600 or so participants last year, in line with the growing interest in the Elastic Stack and Elastic's commercial successes.

From Findwise – we are a team of 4 Findwizards, networking, learning and reporting.

Shay Banon, the creator of Elasticsearch and Elastic CTO, is doing both the opening and closing keynote. It is apparent that the transition of the CEO role from Steven Schuurman has already started.


2016 in retrospective with the future in mind

Elastic reached 100 million downloads in 2016 and has managed to land approximately 4000 paying subscription customers out of this installed base to date. A lot of the presentations during the conference centre on new functionality that is being developed and will be released freely to the open source community. Other functionality goes into the commercial X-Pack subscriptions. Some X-Pack functionality is available for free under the Basic subscription level, which only requires registration.

Most presentations centre on search-powered analytics, and fewer on regular free-text search. Elasticsearch and the Elastic Stack have their main use cases within logging, analytics and various applications as a data platform or middle layer, with search as a strong sidekick.

A strong focus on analytics

There are 22 sponsors at the event, and most of them offer either cloud-based monitoring or machine learning services. IBM, the platinum sponsor, is promoting its Bluemix cloud services for cognitive Watson functionality and uses the conference to reach out to the predominantly developer-focused audience.

Prelert was acquired in September last year and is now being integrated into the Elastic Stack as the Machine Learning component, used for unsupervised anomaly detection to give insights into operations logs. Together with the new modular Beats architecture and various Kibana improvements, it looks apparent that Elastic is chasing the huge market Splunk currently controls within logging and analytics.

Elasticsearch SQL – giving BI what it needs

Elasticsearch SQL will give the search engine SQL capability, just like Solr got with its parallel SQL interface. Elasticsearch is becoming more and more of a “data platform”, and increasingly a competitor to HPE Vertica and Amazon Redshift, as it hits a sweet-spot use case where a combination of fast data loading and extreme scalability is needed and the trade-offs of limited functionality (such as the lack of JOIN operations) are acceptable. With SQL support the platform can be used with existing visualization tools such as Tableau, and it expands the user base, as many people in the business intelligence sector know SQL by heart.
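As a taste of what this looks like, the sketch below sends plain SQL to Elasticsearch over its REST API from Python. The index name, fields and endpoint path are assumptions for illustration – the SQL endpoint's availability and exact path depend on your Elasticsearch version and licence level – so treat this as a sketch rather than a recipe.

import requests

# Hypothetical index and fields; assumes a local cluster whose version ships the SQL endpoint
query = {
    "query": """
        SELECT status, COUNT(*) AS hits
        FROM "logs-web"
        WHERE "@timestamp" > NOW() - INTERVAL 1 DAY
        GROUP BY status
        ORDER BY hits DESC
    """
}

response = requests.post("http://localhost:9200/_sql",
                         params={"format": "txt"},
                         json=query)
print(response.text)  # a familiar tabular result, ready for BI-style consumption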

Fast and simple Beats is music to our ears

Beats will become modular in the next release, and more Beats modules will be created either by Elastic or by the open source and commercial communities. This increases simple connectivity to various data sources and adds standardized dashboards per data source, which will increase simplicity and speed of implementation.

Heartbeat is a new Beat (with a beautiful name!) that sends pings to check that services are alive and functioning.

Kibana goes international

Kibana is maturing, with some key updates coming soon: a Time Series Visual Builder that gives graphical guidance on how to build dashboards, Kibana Canvas that enables custom dynamic reports and slide-show presentations with live data, and a GUI front end translated into various languages.

There's a new tile service for maps, so instead of relying on external map services, Elastic now has control over the maps functionality. The service can be used free of charge but requires registration (Basic subscription) to use all 18 zoom levels.


 

To conclude, we've had three good days with exciting product news and lots of interesting meetings at what could very well be the biggest show for search and search-driven analytics right now! Be sure to see us at next year's Elastic{ON}. If not before, see you then!

 

From San Francisco with love,

/Andreas, Christian, Joar and Peter

Digital wizardry for customers & employees – the next elements

A reflection on Mobile World Congress topics mobility, digitalisation, IoT, the Fourth Industrial Revolution and sustainability

Commerce has always been conversational; today that conversation is digital. Organisations are asking how to organise clean, effective data for an open digital conversation.

Digitalization's aim is to answer customer/consumer-centric demands effectively (with relevant and related data) and in an efficient manner. [For the remainder of the article, read consumer and customer interchangeably.]

This essentially means joining the dots between clean data and information, and being able to recognise the main and most consumer-valuable use cases, be it common transaction behaviour or negating their most painful user experiences.

This includes treading the fine line between offering “intelligent” information (intelligent in terms of relevance and context) to the consumer effectively and not coming across as freaky or stalker-like. The latter is dealt with by forming a digital conversation in which the consumer understands that their information is only used for their own needs or wants.

While clean, related data from the many multi-channel customer touchpoints forms the basis of an agile digital organisation, it is the combination of solid data analysis of user demand and behaviour (clicks, log analysis etc.), machine learning and sensible prediction that forms the basis of artificial intelligence. Artificial intelligence, broken down, is essentially action based on inferences from knowing certain information – the elementary Dr Watson, but done by computers.

This new digital data basis means being able to take data from what were previously data silos and combine it effectively in a meaningful way, for a valuable purpose. While the tag “big data” becomes weary in a generalised context, the key is picking the data and information that gets relevant answers to the most valuable questions – or, in consumer speak, that gets a question answered or a job done effectively.

Digitalisation (and the artificial intelligence that follows) obviously relies on computer automation, but it still requires some thoughtful human input. Important steps in the move towards digitalization include:

  • Content and data inventory, to cleanse data and information;
  • Information architecture (information modelling, content analysis, automatic classification and annotation/tagging);
  • Data analysis in combination with text analysis (or NLP, natural language processing, for the more abundant unstructured data and content), the latter to put flesh on the bones as it were, adding meaning and context;
  • Information governance: the process of being responsible for the collection, proper storage and use of important digital information (now made less ignorable by new citizen-centric data laws (GDPR) and the need for data agility or anonymization of data);
  • Data/system interoperability: which data formats, structures and standards are most appropriate for you? Which kinds of data collections – relational databases, linked/graph data, data lakes etc.?;
  • Language/cultural interoperability: letting people with different perspectives access the same information topics using their own terminology;
  • Interoperability for the future also means being able to link everything in your business ecosystem for collaboration, in- and outbound conversations, endless innovation and sustainability;
  • IoT, or the Internet of Things, which is making the physical world digital and adding further to the interlinked network, soon to be superseded by the AoT (the Analysis of Things);
  • Newer steps of machine learning (learning about consumer preferences and behaviour etc.) and artificial intelligence (being able to provide seemingly impossibly relevant information, clever decision-making and/or a seamless user experience).

The fusion of technologies continues as the lines between the physical, digital and biological spheres blur, with developments in the immersive Internet such as Augmented Reality (AR) and Virtual Reality (VR).

The next elements are here already: semantic (‘intelligent’) search, virtual assistants, robots, chat bots… with 5G around the corner to move more data, faster.

Progress within mobility paves the way for a more sustainable world for all of us (UN Sustainable Development), with a future based on participation. In emerging markets we are seeing giant leaps in societal change. Rural areas now have access to the vast human resources of knowledge to drive service innovation, e.g. through free access to Wikipedia on cheap mobile devices and open campuses. Gender equality advances with changed monetary and mobile financial practices, and blockchain gives us the means to rise to the challenge of interoperability. We have to address the open paradigm (e.g. open data) and the participation economy, building the next elements: shared experience and information commons. This also ties back to the intertwingled digital workplace and the practices needed to move into new cloud-based arenas.

Some last remarks on the telecom industry: it is loaded with acronyms and, for laymen in the area, sometimes a maze to navigate and make sense of.

So are these steps straightforward, or is the reality still a potential headache for your organisation? 

Contact Findwise now to ease the process, before your competitor does 😉
Authors: Fredric Landqvist and Peter Voisey

What will happen in the information sector in 2017?

As we look back at 2016, we can say that it has been an exciting and groundbreaking year that has changed how we handle information. Let’s look back at the major developments from 2016 and list key focus areas that will play an important role in 2017.

3 trends from 2016 that will lay the basis for shaping 2017

Cloud

There has been a massive shift towards the cloud – not only using the cloud for hosting services, but building on top of cloud-based services. This has affected all IT projects, especially the enterprise search market, where Google decided to discontinue GSA and replace it with the cloud-based Springboard. More official information on Springboard is still to be published in writing, but reach out to us if you are keen to hear about the latest developments.

There are clear reasons why search is moving towards the cloud, some of the main ones being machine learning and the amount of data. We have an astonishing amount of information available, and the cloud is simply the best way to handle this overflow. Development in the cloud is faster, the cloud gives practically unlimited processing power, and the latest developments are available in the cloud at an affordable price.

Machine learning

One area that has taken huge steps forward is machine learning. It is nowadays used in everyday applications. Google wrote a very informative blog post about how they use cloud machine learning in various scenarios. But Google is not alone in this space – today, everyone is doing machine learning. A very welcome development was the formation of the Partnership on AI by Amazon, Google, Facebook, IBM and Microsoft.

We have seen how machine learning helps us in many areas. One good example is healthcare, with IBM Watson managing to identify a rare type of leukemia in 10 minutes. This type of expert assistance is becoming more common. While we know there is still a long path ahead before AI becomes smarter than human beings, we are taking leaps forward, as seen when DeepMind beat a human at the complex board game Go.

Internet of Things

Another important area is IoT. In 2016 most IoT projects have, in addition to consumer solutions, touched industry: creating smart cities, optimizing energy utilization or connecting cars. Companies have realized that they can now track any physical object, with the benefits of being able to service machines before they break, streamline or build better services, or even create completely new business based on data knowledge. On the consumer side, 2016 saw IoT become mainstream, with the unfortunate effect of poorly secured devices being used for massive attacks.

 

3 predictions for key developments happening in 2017

As we move towards the year 2017, we see that these trends from 2016 have positive effects on how information will be handled. We will have even more data and even more effective ways to use it. Here are three predictions for how we will see the information space evolve in 2017.

Insight engine

Our collaboration with computers is changing. For decades, we have given tasks to computers and waited for their answers. This is slowly changing: we are starting to collaborate with computers and even expect them to take the initiative. The developments behind this lie in machine learning and human language understanding. We no longer only index information and search it with free text. Nowadays, we can build computers that understand information. This information includes everything from IoT data points to human-created documents and data from other AI systems. This enables building an insight engine that can help us formulate the right question, or even give us insights in response to a question we never asked. This will revolutionize how we handle our information and how we interact with our user interfaces.

We will see virtual private assistants that users will be happy to use and train so that they can help us use information like never before in our daily lives. Google Now, in its current form, is merely the first step of something like this, proactively bringing information to the user.

Search-driven analytics

The way we use and interact with data is changing. With information collected about pretty much anything, we have almost any fact right at our fingertips and need effective ways to learn from it – in real time. In 2017, we will see a shift away from classic BI systems towards search-driven evolutions of them. We already have Kibana dashboards with Timelion, and ThoughtSpot, but these are only the first examples of how search is revolutionizing how we interact with data. Advanced analytics available to anyone within the organization, with answers and predictions delivered directly in graphs and diagrams, is what 2017 insights will be all about.

Conversational UIs

We saw the rise of chatbots in 2016. In 2017, this trend will also shape how we interact with enterprise systems. A smart conversational user interface builds on the same foundations as an enterprise search platform: it is highly personalized, contextually smart, and builds its answers from information in various systems and in many forms.

Imagine discussing future business focus areas with a machine that challenges our ideas and backs everything up with data-based facts. Imagine your enterprise search responding to your query with a question, asking you to detail what you are actually trying to achieve.

 

What are your thoughts on the future developments?

How do you see 2017 changing the way we interact with our information? Comment or reach out in other ways to discuss this further, and have a Happy Year 2017!

 

Written by: Ivar Ekman

How to improve search relevance using machine learning and statistics – Apache Solr Learning to Rank

In search, relevance denotes how well a retrieved document or set of documents meets the information need of the user. Natural languages, synonymy, homonymy, term frequency, norms, relevance models, boosting, performance, subjectivity… the reasons why search relevance remains a hard problem are many. This article deals with how machine learning and search statistics can improve relevance, using the Learning to Rank plugin that will be included in an upcoming version of Solr. If you want more information than is provided in this blog post, be sure to visit our website or contact us!

Background

Consider an intranet search solution where users can be divided into two groups: developers and sales.
Search and click statistics are collected, and the following picture illustrates a specific search, performed 569 times by the users, with the click statistics for each document.


Example of a search with click statistics

As noticed, the top search hit, whose score is computed from term frequency, inverse document frequency and field-length norm, is less relevant (got fewer clicks) than documents with lower scores. Instead of trying to manually tweak the relevance model and the different field boosts – which, by tuning for one specific query, would probably decrease the global relevance of other queries – we will try to learn from the users' click statistics to automatically generate a relevance model.

Architecture, features and model

Architecture

Using search and click statistics as training data, the search engine can learn, from input features and with a ranking model, the probability that a document is relevant for a user.


Search architecture with a ranking training model

Features

With the Solr Learning to Rank plugin, features can be defined using standard Solr queries.
For this specific example, I will choose the following features:
– originalScore: the original score computed by Solr
– queryMatchTitle: a boolean stating whether the user query text matches the document title
– queryMatchDescription: a boolean stating whether the user query text matches the document description
– isPowerPoint: a boolean stating whether the document type is PowerPoint

[
   { "name": "isPowerPoint",
     "class": "org.apache.solr.ltr.feature.SolrFeature",
     "params":{ "fq": ["{!terms f=filetype }.pptx"] }
   },
   {
    "name":"originalScore",
    "class":"org.apache.solr.ltr.feature.OriginalScoreFeature",
    "params":{}
   },
   {
    "name":"queryMatchTitle",
    "class":"org.apache.solr.ltr.feature.SolrFeature",
    "params":{ "q":"{!field f=title}${user_query}" }
   },
   {
    "name":"queryMatchDescription",
    "class":"org.apache.solr.ltr.feature.SolrFeature",
    "params":{ "q":"{!field f=description}${user_query}" }
   }
]
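For reference, here is a sketch of how the feature definitions above can be uploaded to Solr from Python, equivalent to the curl command in the Solr LTR documentation. It assumes a collection named intranet on a local Solr instance with the LTR contrib enabled; both names are assumptions for this example.

import json
import requests

SOLR = "http://localhost:8983/solr/intranet"   # assumed collection name

with open("features.json") as f:               # the feature definitions listed above
    features = json.load(f)

# The LTR contrib exposes a feature store under the schema API
resp = requests.put(f"{SOLR}/schema/feature-store",
                    json=features,
                    headers={"Content-Type": "application/json"})
print(resp.status_code, resp.text)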

Training a ranking model

From the click statistics, a training set (X, y) – where X is the feature input vector and y is a boolean stating whether the user clicked the document or not – can be generated and used to fit a linear model with regression, which outputs the weight of each feature.

Example of statistics:

{
    q: "i3",
    docId: "91075403",
    userId: "507f1f77bcf86cd799439011",
    clicked: 0,
    score: 3.43
},
{
    q: "i3",
    docId: "82034458",
    userId: "507f1f77bcf86cd799439011",
    clicked: 1,
    score: 3.43
},
{
    q: "coucou",
    docId: "1246732",
    userId: "507f1f77bcf86cd799439011",
    clicked: 0,
    score: 3.42
}

Training data generated:
X= (originalScore, queryMatchTitle, queryMatchDescription, isPowerPoint)
y= (clicked)

{
    originalScore: 3.43,
    queryMatchTitle: 1,
    queryMatchDescription: 1,
    isPowerPoint: 1,
    clicked: 0
},
{
    originalScore: 3.43,
    queryMatchTitle: 0,
    queryMatchDescription: 1,
    isPowerPoint: 0,
    clicked: 1
},
{
    originalScore: 3.42,
    queryMatchTitle: 1,
    queryMatchDescription: 0,
    isPowerPoint: 1,
    clicked: 0
}
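The post does not prescribe a particular training tool, but as an illustration, here is a minimal sketch in Python using scikit-learn: a logistic regression fitted on the (X, y) rows above, whose coefficients can then be used as the feature weights in the Solr model below. With only three example rows this is obviously just to show the mechanics, not a realistic training run.

from sklearn.linear_model import LogisticRegression

# One row per (query, document) pair from the click log:
# (originalScore, queryMatchTitle, queryMatchDescription, isPowerPoint)
X = [
    [3.43, 1, 1, 1],
    [3.43, 0, 1, 0],
    [3.42, 1, 0, 1],
]
y = [0, 1, 0]  # clicked or not

model = LogisticRegression().fit(X, y)

feature_names = ["originalScore", "queryMatchTitle",
                 "queryMatchDescription", "isPowerPoint"]
weights = dict(zip(feature_names, model.coef_[0]))
print(weights)  # plug these values into the "weights" section of the Solr LinearModel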

Once the training is completed and the feature weights are computed, the model can be sent to Solr in the following format:

{
    "class":"org.apache.solr.ltr.model.LinearModel",
    "name":"myModelName",
    "features":[
        { "name": "originalScore"},
        { "name": "queryMatchTitle"},
        { "name": "queryMatchDescription"},
        { "name": "isPowerPoint"}
    ],
    "params":{
        "weights": {
            "originalScore": 0.6783,
            "queryMatchTitle": 0.4833,
            "queryMatchDescription": 0.7844,
            "isPowerPoint": 0.321
      }
    }
}

Run a Rerank Query

The Solr LTR plugin makes it easy to apply the re-ranking model to the result list by adding rq={!ltr model=myModelName reRankDocs=25} to the query.
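In practice the rerank parameter is simply appended to an ordinary query. A sketch in Python, again assuming a local collection named intranet; note that since the features above reference ${user_query}, the query text is passed along as external feature information (efi):

import requests

SOLR = "http://localhost:8983/solr/intranet"   # assumed collection name
user_query = "i3"

params = {
    "q": user_query,
    # efi.user_query supplies the ${user_query} value used by the features above
    "rq": f"{{!ltr model=myModelName reRankDocs=25 efi.user_query='{user_query}'}}",
    "fl": "id,title,score",
    "wt": "json",
}

results = requests.get(f"{SOLR}/select", params=params).json()
for doc in results["response"]["docs"]:
    print(doc["id"], doc["score"])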

Personalization of the search result

If your statistics data includes information about users, specific re-rank models can be trained for different user groups.
In my current example, I trained one model for the developer group and one for the sales representatives.

Dev model:

{
    "class":"org.apache.solr.ltr.model.LinearModel",
    "name":"devModel",
    "features":[
        { "name": "originalScore"},
        { "name": "queryMatchTitle"},
        { "name": "queryMatchDescription"},
        { "name": "isPowerPoint"}
    ],
    "params":{
        "weights": {
            "originalScore": 0.6421,
            "queryMatchTitle": 0.4561,
            "queryMatchDescription": 0.5124,
            "isPowerPoint": 0.017
      }
    }
}

Sales model:

{
    "class":"org.apache.solr.ltr.model.LinearModel",
    "name":"salesModel",
    "features":[
        { "name": "originalScore"},
        { "name": "queryMatchTitle"},
        { "name": "queryMatchDescription"},
        { "name": "isPowerPoint"}
    ],
    "params":{
        "weights": {
            "originalScore": 0.712,
            "queryMatchTitle": 0.582,
            "queryMatchDescription": 0.243,
            "isPowerPoint": 0.623
      }
    }
}

From the statistics data, the system has learnt that a PowerPoint document is more relevant for a sales representative than for a developer.

Developer search

Developer search with re-ranking

Sales representative search

Sales representative search with re-ranking

To conclude: with a search system continuously trained on a flow of statistics, not only will the search relevance be more customized and personalized for your users, it will also automatically adapt to changes in user behaviour.

If you want more information, have further questions or need help please visit our website or contact us!

The Solr LTR plugin will be released soon: https://github.com/bloomberg/lucene-solr/tree/master-ltr-plugin-release/solr/contrib/ltr

Swedish language support (natural language processing) for IBM Content Analytics (ICA)

Findwise has now extended the NLP (natural language processing) in ICA to include both support for Swedish PoS tagging and Swedish sentiment analysis.

IBM Content Analytics with Enterprise Search (ICA) has its strength in natural language processing (NLP), which is carried out in the UIMA pipeline. From a Swedish perspective, one concern with ICA has always been its lack of NLP support for Swedish. Previously, the Swedish support in ICA consisted only of dictionary-based lemmatization (word: “sprang” -> lemma: “springa”). However, for a number of other languages ICA also provides part-of-speech (PoS) tagging and sentiment analysis. One of the benefits of the PoS tagger is its ability to disambiguate words that belong to multiple classes (e.g. “run” can be both a noun and a verb), as well as to assign tags to words that are not found in the dictionary. Furthermore, the PoS tagger is crucial for improving entity extraction, which is important when a deeper understanding of the indexed text is needed.

Findwise has now extended the NLP in ICA to include both support for Swedish PoS tagging and Swedish sentiment analysis. The two images below show simple examples of the PoS support.

Example when ICA uses NLP to analyse the string "ICA är en produkt som klarar entitetsextrahering"
Example when ICA uses NLP to analyse the string "Watson deltog i jeopardy"

The question is: how can this extended functionality be used?

IBM uses ICA and its NLP support together with several of its products. The Jeopardy-playing computer Watson may be the most famous example, even if it is not a real product. Watson used NLP in its UIMA pipeline when it analyzed data from sources such as Wikipedia and IMDb.

One product which leverages ICA and its NLP capabilities is Content and Predictive Analytics for Healthcare. This product helps doctors determine which action to take for a patient, given the patient's journal and symptoms. By also leveraging the predictive analytics from SPSS, it is possible to suggest the next action for the patient.

ICA can also be connected directly to IBM Cognos or SPSS, with ICA as the tool that brings structure to unstructured data. By using the NLP or sentiment analytics in ICA, structured data can be extracted from text documents. This data can then be fed to IBM Cognos, SPSS or non-IBM products such as Splunk.

ICA can also be used on its own as a text miner or a search platform, but in many cases ICA delivers its maximum value together with other products. ICA is a product that helps enrich data by bringing structure to unstructured data. The processed data can then be used by other products that normally work with structured data.