Plan for General Data Protection Regulation (GDPR)

Another new regulation from the EU? Will it affect us? It seems so complex. Can't we just sit back, wait for the first fine to arrive, and act then if necessary?

We have to care and act – start planning now!

I think we have to care and act now. Start planning now so you get it right. The GDPR is a good thing. This is not another EU ruling about the right size of a strawberry or how bendy a banana may be. This is about the fact that all individuals should feel safe giving their personal information to businesses. Cyber security is a good thing; not protecting our data and our customers' data is bad for us. Credit card numbers and personal data leak out of companies worldwide, with large business risks: companies don't just face fines or reputational damage, they can have their permission to issue credit cards and other financial services products withdrawn by the regulator, and responsible employees may face imprisonment. We can only guess whether a company will need to be GDPR compliant to be allowed to compete in a bidding process.

What is the General Data Protection Regulation (GDPR)?

The General Data Protection Regulation (GDPR) is a new legal framework approved by the European Union (EU) to strengthen and unify the protection of personal data. The GDPR replaces the current Data Protection Directive (implemented in Sweden as Personuppgiftslagen, PUL) and applies from 25 May 2018.

Who is affected?

GDPR has global reach and applies to all companies worldwide that process personal data of European Union citizens.

Identify personal data and protect it

The GDPR defines personal data broadly. Organisations need to fully understand what information they have, where it is located and how it was collected. Discover, classify and manage all information, both structured and unstructured, and secure it.

Data breach notifications

GDPR requires organisations to notify the local data protection authority of a data breach within 72 hours after discovery.

Do you have the right to store this information? Explicit consent

Personal data should be gathered under strict conditions. Organisations need to ask for consent to collect personal data and they need to be clear about how they will use the information.

The right of access

Individuals will have the right to obtain access to their personal data and other supplementary information in a portable format. You must provide a copy of the information free of charge. The GDPR also gives individuals the right to have personal data corrected if it is inaccurate or incomplete.

The right to be forgotten

The GDPR also introduces the right to be forgotten, or erased. Data must not be held for longer than absolutely necessary, and must not be used for any purpose other than the one it was originally collected for.

Penalties and fines

Companies that fail to protect customer data adequately face significant fines of up to €20m, or up to 4% of global annual turnover, whichever is higher. This should be a serious incentive for companies to start preparing now.

First steps to GDPR compliance

  1. Create awareness and allocate resources
    The first step is to make sure that your organisation is aware of the new EU legislation and what it means for you. How will your business be affected by the new regulation? Allocate enough resources and make sure you involve decision-makers and stakeholders in your organisation. Last, but not least, start today!
  2. Content Inventory
    The second step is to discover and classify all your information, to identify exactly what types of personally identifiable data you have, where you have it and how it is collected.

Findwise can assist you in this process, please contact Maria Sunnefors to hear more.

Want to read more?

Read more about the GDPR at Datainspektionen (in Swedish) or at the ICO.

PIM is for storage

– Add search for distribution, customization and seamless multichannel experiences.


Retailers, e-commerce and product data
Having met a number of retailers to discuss information management, we've noticed they all experience the same problem. Products are (obviously) central, and product information is typically stored in a PIM or DAM system. So far so good: these systems do the trick when it comes to storing and managing fundamental product data. However, when trying to embrace current e-commerce trends [1], such as mobile friendliness, multi-channel selling and connecting products to other content, PIM systems are not really helping. As it turns out, PIM is great for storage but not for distribution.

Retailers need to distribute product information across various channels – online stores, mobile and desktop, spreadsheet exports, subsets of data adjusted for different markets and industries. They also need to connect products to availability, campaigns, user-generated content and fast-changing business rules. Add to this the need to close the analytics feedback loop, and the IT department realises that PIM (or DAM) is not the answer.

Product attributes

Adding search technology for distribution
Whereas PIM is great for storage, search technology is the champ not only for searching but also for distribution. You may have heard the popular saying Create Once, Publish Everywhere? Well, search technology actually gives meaning to it. Gather any data (PIM, DAM, ERP, CMS), connect it to other data and display it across multiple channels and contexts.

Also, with the i3 [2] package of components you can add information (metadata) or logic that is not available in the PIM system – all while the source data stays intact: there is no altering, copying or moving.

Combined with a taxonomy for categorising information, you're good to go. You can now enrich products and connect them to other products and information (processing service). Categorise content according to the product taxonomy and be done. Performance will be very high, as content is denormalised and stored in the search engine, ready for multi-channel distribution. With this setup you can also easily add new sources to enrich products, or modify relevance. Who knows what information will be relevant for products in the future?
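
To make the denormalisation idea concrete, here is a minimal sketch in Python of how a product record from a PIM could be merged with related data (stock, campaigns) into one search-ready document and pushed to a search engine. The field names, index name and Elasticsearch endpoint are illustrative assumptions, not the i3 components themselves.

    import requests

    # Illustrative source records - in reality these would come from PIM, ERP, CMS etc.
    pim_product = {"id": "P-1001", "name": "Running shoe", "category": "Shoes/Running"}
    stock_levels = {"P-1001": 37}
    campaigns = {"P-1001": ["spring-sale"]}

    # Denormalise: merge everything the channels need into one flat document
    search_doc = {
        **pim_product,
        "in_stock": stock_levels.get(pim_product["id"], 0) > 0,
        "campaigns": campaigns.get(pim_product["id"], []),
        "taxonomy_path": pim_product["category"].split("/"),  # enables category facets
    }

    # Push the enriched document to the search engine (Elasticsearch assumed here);
    # the PIM source data is left untouched.
    requests.put(
        "http://localhost:9200/products/_doc/" + pim_product["id"],
        json=search_doc,
    )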

To summarise

  • PIM for input, search for output. Design for distribution!
  • Use PIM for managing products, not for managing business rules.
  • Add metadata and taxonomies to tailor product information for different channels.
  • Connect products to related content.
  • Use stand-alone components based on open source for strong TCO and flexibility.

References
[1] Gartner for Marketers
[2] The Findwise i3 package of components (for indexing, processing, searching and analysing data) is compatible with the open source search engines Apache Solr and Elasticsearch.

Under the hood of the search engine

While using a search application we rarely think about what happens inside it. We just type a query, sometimes refine it with facets or additional filters, and pick one of the returned results. Ideally, the most desired result is at the top of the list. The secret of returning appropriate results, and of figuring out which documents fit a query better than others, is hidden in the scoring, ranking and similarity functions enclosed in relevancy models. These concepts are crucial for the search application user's satisfaction.

In this post we will review the basic components of the popular TF/IDF model with simple examples. Additionally, we will learn how to ask Elasticsearch for an explanation of the scoring for a specific document and query.

Document ranking is one of the fundamental problems in information retrieval, the discipline acting as the mathematical foundation of search. Ranking, which literally means assigning a rank to a document matching a search query, corresponds to the notion of relevance. Document relevance is a function which determines how well a given document meets the search query. The concept of similarity corresponds, in turn, to the relevance idea, since relevance is a metric of similarity between a candidate result document and a search query.
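
As a first taste of the explain mechanism mentioned above, here is a minimal sketch of asking Elasticsearch for a scoring breakdown: setting "explain": true on a search request returns the term frequency, IDF and norm components behind each hit's score. The index name "articles" and the field "title" are illustrative assumptions.

    import json
    import requests

    # Run a match query and ask Elasticsearch to include a scoring explanation per hit
    resp = requests.get(
        "http://localhost:9200/articles/_search",
        json={
            "query": {"match": {"title": "relevance scoring"}},
            "explain": True,  # adds an _explanation tree (term frequency, IDF, norms)
            "size": 3,
        },
    )

    for hit in resp.json()["hits"]["hits"]:
        print(hit["_id"], hit["_score"])
        print(json.dumps(hit["_explanation"], indent=2))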

What’s new in Apache Solr 6?

Apache Solr 6 has been released recently! There are some important technical changes to keep in mind: there is no more support for reading Lucene/Solr 4.x indexes, and Java 8 is required. But the most interesting part, in my opinion, is connected with its new features, which certainly follow industry trends. I mean here: a SQL engine on top of Solr, graph search, and replicating data across different data centers.

Apache Solr

One of the most promising topics among the new features is the Parallel SQL Interface. In brief, it is the possibility to run SQL queries on top of Solr Cloud (Cloud mode only, for now). It can be very interesting to combine full-text capabilities with well-known SQL statements.
Solr uses Presto internally, which is a SQL query engine that works with various types of data stores. Presto is responsible for translating SQL statements into Streaming Expressions, since the Solr SQL engine is based on the Streaming API.
Thanks to that, SQL queries can be executed at worker nodes in parallel. There are two implementations of grouping results (aggregations). The first one is based on a map-reduce algorithm and the second one uses Solr facets. The basic difference is the number of fields used in the grouping clause. The Facet API can be used for better performance, but only when the GROUP BY isn't complex. If it is, it's better to try aggregationMode=map_reduce.
From a developer's perspective it's really transparent. A simple statement like "SELECT field1 FROM collection1" is translated to the proper fields and collection. Right now clauses like WHERE, ORDER BY, LIMIT, DISTINCT and GROUP BY can be used.
Solr still doesn't support the whole SQL language, but even so it's a powerful feature. First of all, it can make beginners' lives easier, since the relational world is commonly known. What is more, I imagine this can be useful during IT system migrations, or for collecting data from Solr for further analysis. I hope to hear about many different case studies in the near future.
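To make the interface concrete, here is a minimal sketch of calling it over HTTP from Python. The collection name, field and host are illustrative assumptions; the statement goes in the stmt parameter, and aggregationMode switches between the facet and map_reduce implementations mentioned above.

    import requests

    sql = "SELECT field1 FROM collection1 LIMIT 10"

    resp = requests.post(
        "http://localhost:8983/solr/collection1/sql",
        data={
            "stmt": sql,
            "aggregationMode": "facet",  # use "map_reduce" for complex GROUP BY clauses
        },
    )
    print(resp.json())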

Apache Solr 6 also introduces a feature which is crucial wherever a search engine is a business-critical system: cross data center replication (CDCR).
Since Solr Cloud was created to support near real-time (NRT) searching, it didn't work well when cluster nodes were distributed across different data centers, because of the communication overhead generated by the leaders, replicas and synchronization operations.

The new feature is in an experimental phase and still under development, but for now we have an active-passive mode, where data is pushed from the source DC to the target DC. Documents can be sent in real time or according to a schedule. Every leader in the active cluster asynchronously sends updates to the corresponding leader in the passive cluster. After that, the target leaders replicate the changes to their replicas as usual.
CDCR is crucial when we think about distributed systems working in high-availability mode. It relates to disaster recovery, scaling and avoiding single points of failure (SPOF). Please visit the documentation page to find the details and plans for the future: https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=62687462

What if your business works in a highly connected environment, where data relationships matter, but you still benefit from full-text search? Solr 6 has good news – graph traversal functionality.
A lot of enterprises know that focusing on relations between documents and on graph data modelling is the future. Now you can build Solr queries which allow you to discover information organized in nodes and edges. You can explore your collections in terms of data interactions and connections between particular data elements. We can think of use cases from the semantic search area (query augmentation, using ontologies etc.) or more prosaic ones, like organization security roles or access control.
The graph traversal query is still a work in progress, but it can already be used, and its basic syntax is really simple: fq={!graph from=parent_id to=id}id:"DOCUMENT_ID"
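A minimal sketch of sending that graph filter query as an ordinary Solr request from Python; the collection name, field names and host are illustrative assumptions.

    import requests

    params = {
        "q": "*:*",
        # walk the graph from the given document, following parent_id -> id edges
        "fq": '{!graph from=parent_id to=id}id:"DOCUMENT_ID"',
        "wt": "json",
    }

    resp = requests.get("http://localhost:8983/solr/collection1/select", params=params)
    print(resp.json()["response"]["numFound"])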

The last Solr 6 improvement I'm going to mention is a new scoring algorithm – BM25. In fact, it's a change forced by Apache Lucene 6. BM25 is now the default similarity implementation. Similarity is the process which examines which documents are similar to the query and to what extent. There are many different factors which determine a document's score, e.g. the number of search terms found in the document, the popularity of those search terms over the whole collection, or the document length. This is where BM25 improves scoring: it takes into consideration the average length of the documents (fields) across the entire corpus. It also better limits the impact of term frequency on results ranking.
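
For reference, the textbook form of the BM25 scoring function (a sketch of the idea rather than Lucene's exact implementation) is:

    score(D, Q) = \sum_{t \in Q} IDF(t) \cdot \frac{f(t, D) \cdot (k_1 + 1)}{f(t, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{avgdl}\right)}

Here f(t, D) is the frequency of term t in document D, |D| is the document (field) length, avgdl is the average length across the corpus, and k1 and b are tuning parameters (commonly k1 ≈ 1.2 and b = 0.75). The |D|/avgdl factor is the corpus-wide average length mentioned above, and the saturating fraction is what limits the impact of raw term frequency on the ranking.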

As we can see, Apache Solr 6 provides us with many new features and those mentioned above are not all of them. We’re going to write more about the new functionalities soon. Until then, we encourage you to try the newest Solr on your own and remember: don’t hesitate to contact us in case of any problems!

Understanding politics with Watson using Text Analytics

Understanding which topics actually matter to different political parties is a difficult task. Can text analytics together with a search index be an approach that gives a better understanding?

This blog post describes how IBM Watson Explorer Content Analytics (WCA) can be used to make sense of Swedish politics. All speeches (in Swedish: anföranden) in the Swedish Parliament from 2004 to 2015 are analyzed using WCA. In total 139 110 transcribed text documents were analyzed. The Swedish language support built by Findwise for WCA is used together with a few text analytic processing steps which parse out person names, political parties, dates and topics of interest. The selected topics in this analysis are all related to infrastructure and different types of fuels.

We start by looking at how some of the topics are mentioned over time.

Analysis of terms of interest in the Swedish parliament between 2004 and 2014.

The view shows topics which have a higher number of mentions than would be expected during a given year. Here we can see, among other things, that the topic flygplats (airport) has a sharp increase in the number of mentions during 2014.

So let’s dive down and see what is being said about the topic flygplats during 2014.

Swedish political parties mentioning Bromma Airport during 2014.

The above image shows how the different political parties mentioned the topic flygplats during 2014. The blue bar shows the number of times the topic flygplats was mentioned by each political party during the year. The green bar shows the WCA correlation value, which indicates how strongly related a term is to the current filter. What we can conclude is that the party Moderaterna mentioned flygplats more frequently during 2014 than the other parties.

Reviewing the most correlated nouns when filtering on flygplats and the year 2014 shows, among others: Bromma (a place in Sweden), airport and nedläggning (closure). This gives some idea of what was discussed during the period. Filtering on the speeches held by Moderaterna and reading some of them makes it clear that Moderaterna is against a closure of Bromma airport.

The text analytics and the index provided by WCA helps us both discover trending topics over time and gives us a tool for understanding who talked about a subject and what was said.

All the different topics about infrastructure can together form a single, wider infrastructure topic. Speeches mentioning tåg (train), bredband (broadband) or any other defined term for infrastructure are also tagged with the topic infrastructure. This wider concept of infrastructure can of course also be viewed over time.

Discussions in the Swedish parliament mentioning the defined terms which build up the subject infrastructure, 2004 to 2015.

Another way of finding which parties are most correlated to a subject is by comparing pairs of facets. The following table shows parties highly related to terms regarding infrastructure and types of fuel.

Swedish political parties highly correlated to subjects regarding infrastructure and types of fuel.

Let's start by explaining the first row in order to understand the table. Mobilnät (mobile network) has only been mentioned 44 times by Centerpartiet, but Centerpartiet is still highly related to the term, with a WCA correlation value of 3.7. This means that Centerpartiet has a higher share of its speeches mentioning mobilnät compared to the other parties. The table indicates that two parties, Centerpartiet and Miljöpartiet, are more involved in the infrastructure topics than the other political parties.

Swedish parties mentioning the defined concept of infrastructure.

Filtering on the concept infrastructure also shows that Miljöpartiet and Centerpartiet are the two parties with the highest share of speeches mentioning the defined infrastructure topics.

Interested in digging deeper into the data? Parsing written text with text analytics is a successful approach for increasing the understanding of subjects such as politics. Using IBM Watson Explorer Content Analytics makes it easy. Most of the functionality used in this example is available out of the box in WCA.

Migration from Google Search Appliance (GSA) in 4 easy steps

 

 

Google Search Appliance is being phased out and in 2018, renewals will end. As an existing client, you can buy one-year license renewals throughout 2017. However, if you fancy a change, here are 4 simple steps for switching to Apache Solr or Elasticsearch.

1. Choose your hosting solution or servers


Whereas Google Search Appliance comes ready to plug in, Apache Solr and Elasticsearch need to be deployed and hosted on servers. You can choose to host Solr or Elasticsearch on your own infrastructure or in the cloud. Both platforms are highly scalable and can be massively distributed.

  • Own infrastructure

Server and hardware requirements depend heavily on the number of documents, document types, search use cases and number of users. Memory, CPUs, disk and network are the main parameters to consider.

Elasticsearch hardware recommendations: https://www.elastic.co/guide/en/elasticsearch/guide/current/hardware.html

Apache Solr performance: https://wiki.apache.org/solr/SolrPerformanceProblems

Both Elasticsearch and Solr require Java. For SolrCloud, you will also need to install ZooKeeper.

  • In the cloud

You can also choose to run Solr or Elasticsearch on a cloud platform.

Elastic official cloud platform: https://www.elastic.co/cloud

2. Define your schema and mapping

In Apache Solr and Elasticsearch, fields can be indexed and processed differently according to type, language, use case and so on. A field and its type can be defined in Elasticsearch using the mapping API, or in Apache Solr in schema.xml.

Elasticsearch mapping API: https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping.html

Apache Solr schema: https://wiki.apache.org/solr/SchemaXml
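
As a minimal sketch, this is how a few typed fields could be defined with the Elasticsearch mapping API from Python (the hypothetical "products" index, its field names and a 7.x-style typeless mapping are assumptions); the Solr counterpart would be the corresponding field and fieldType entries in schema.xml.

    import requests

    mapping = {
        "mappings": {
            "properties": {
                "title":    {"type": "text", "analyzer": "english"},  # full-text, language analysis
                "category": {"type": "keyword"},                      # exact value, good for facets
                "price":    {"type": "double"},
                "modified": {"type": "date"},
            }
        }
    }

    # Create the index together with its mapping
    resp = requests.put("http://localhost:9200/products", json=mapping)
    print(resp.json())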

3. Tune your connectors


Do you need to change all connectors?

The answer is no. Connectors sending GSA feeds can be kept, just refactor the output to match the Elasticsearch or Solr indexing syntax.

However, if you use GSA to crawl websites, you will need either to reconsider crawling as the method to get your data, or to use an external web crawler (like Norconex). Unlike GSA, Apache Solr and Elasticsearch do not come with a web crawler.

Elasticsearch Indexing API: https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-index_.html
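
A minimal sketch of what the refactored connector output could look like: the same document pushed once to Elasticsearch and once to Solr. The index/collection names, fields and hosts are illustrative assumptions.

    import requests

    doc = {
        "id": "doc-1",
        "title": "Annual report 2016",
        "url": "http://intranet.example.com/doc-1",
    }

    # Elasticsearch: one JSON document per request (use the _bulk endpoint for batches)
    requests.put("http://localhost:9200/intranet/_doc/" + doc["id"], json=doc)

    # Solr: the update handler accepts a JSON array of documents
    requests.post(
        "http://localhost:8983/solr/intranet/update?commit=true",
        json=[doc],
    )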

4. Rewrite your queries and fetch new output

All common query functions such as filtering, sorting and dynamic navigation are standard in both Apache Solr and Elasticsearch. However, query parameters and output formats (XML or JSON) differ, which means queries and the front-end need adapting to your new search engine.

If you are using Jellyfish by Findwise, queries and output will roughly be the same.

Elasticsearch response body: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-body.html

Apache Solr response: https://cwiki.apache.org/confluence/display/solr/Response+Writers
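
A minimal sketch of the same filtered, sorted query with dynamic navigation expressed against both engines; index/collection names, fields and hosts are illustrative assumptions.

    import requests

    # Elasticsearch: JSON query DSL, JSON response
    es_hits = requests.get(
        "http://localhost:9200/intranet/_search",
        json={
            "query": {"match": {"title": "gdpr"}},
            "post_filter": {"term": {"source": "policies"}},
            "sort": [{"modified": "desc"}],
            "aggs": {"by_source": {"terms": {"field": "source"}}},  # dynamic navigation
        },
    ).json()

    # Solr: URL parameters, JSON (or XML) response
    solr_hits = requests.get(
        "http://localhost:8983/solr/intranet/select",
        params={
            "q": "title:gdpr",
            "fq": "source:policies",
            "sort": "modified desc",
            "facet": "true",
            "facet.field": "source",
            "wt": "json",
        },
    ).json()

    print(es_hits["hits"]["total"], solr_hits["response"]["numFound"])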

Google Search Appliance features equivalence

GSA feature                 | Elasticsearch          | Apache Solr
----------------------------|------------------------|------------------
Web crawling                | X                      | X
Language Bundles            | Languages              | Language Analysis
Synonyms                    | Synonyms               | Synonyms
Stopwords                   | Stopwords              | Stopwords
Result Biasing              | Controlling relevance  | Query elevation
Suggestions                 | Search-suggesters      | Suggester
Dynamic navigation          | Aggregations           | Faceting
Document preview            | X                      | X
User result                 | X                      | X
Expert search               | X                      | X
Keymatch                    | X                      | X
Related Queries             | X                      | X
Secure search               | Shield                 | Solr Security
Search reports              | Logstash+Kibana        | X
Mirroring/Distributed Scale | Elastic                | Solr Cloud
System alert                | Watcher                | X
Email update/Alert          | Watcher                | X

X = not available out of the box

4 quick ways to improve your search!

Findwise has recently published its annual report about Enterprise Search and Findability. We can see that a lot of people are complaining that their search engine performs poorly: 36% of users were dissatisfied in 2015. Is there any simple recipe for fixing that? I bet there are some things that can be applied almost immediately!


Data

It is quite common that the reason for bad results is content quality – there is simply no good content in a search engine!

Solution = editors and cleaning: Educate editors to produce better content! Decide on the main message, set informative titles and be consistent with internal wording and phrasing. Also, look over your index, archive outdated content and remove duplicates. Don’t be afraid to remove entire data sources from your index.


Metadata

If your indexed data carries metadata, it is much easier to search using additional filters. Let's say we are looking for a new washing machine in any good online store: we can easily filter on features such as energy class, manufacturer, price range etc. The same can happen with corporate data, provided that our documents are tagged in a consistent manner.

Solution = tagging: Check tag and metadata consistency for the documents you search through. If the quality of the tagging leaves much to be desired, correct it (note: this can be done automatically to a large extent!). Then consider which filters are the most useful for your company's search and implement them in your search interface.

 

Accuracy

Users' expectations are very important. When they ask and search, they usually want and need something specific, e.g. the current lunch menu, a financial settlement form, a specific procedure for calculating credit risk, the sales report for the previous quarter, etc. This unique need of each user is expressed through a simple query. And here we encounter a significant problem: these queries are not always well interpreted by the search engine. If you don't see the desired document/answer in the first five slots of the search results list, even after 2-3 attempts with various queries, you quickly come to the conclusion that the search engine doesn't work (well).

Solution = user feedback: It is fundamental to regularly collect user feedback on the search engine. If you receive signals that something does not work, you absolutely need to examine which specific search scenarios aren't functioning well. These things can usually be fixed pretty easily by using synonyms, promotions, or by changing the ordering of the displayed results.


Monitoring

It is not easy to gather the opinions of everyone in a large organisation, as there might be thousands of users. A search engine, like everything else, sometimes breaks down, takes too long to answer queries, and gives silly results or even no results at all. Additionally, it's not certain whether it actually contributes to our organisation, or who makes use of our search at all.

Solution = logging: Log analysis gives a lot of information about the real use of the search engine by its users. Logs tell us how many people are looking for something, what they are asking for, how fast the search engine responds and when it gives zero results. It's priceless information for understanding what works, who really benefits, what the most popular content and questions are, and what needs to be improved. It's crucially important to do this on a regular basis.


Summary

And now, once you have fixed all four of these points related to the search engine, please tell me that it is still malfunctioning. I've yet to hear of such a case 🙂

Generational renewal at work – a search challenge

The big generational shift

There have been discussions surrounding the great generational renewal in the workplace for a while. The 50's generation, who have spent a large part of their working lives within the same company, are being replaced by an agile bunch born in the 90's. We are not taken in by tabloid claims that this new generation does not want to work, or that companies do not know how to attract them. What we are concerned with is that businesses are not adapting fast enough to the way the new generation handles information, to enable the transfer of knowledge within the organisation.

Working for the same employer for decades

Think about it for a while: for how long have the 50's generation been allowed to learn everything they know? We see it all the time, large groups of employees ready to retire after spending their whole working lives within the same organisation. They began their careers as teenagers working on the factory floor or in a similar role, step by step growing within the company, together with the company. These employees tend to carry a deep understanding of how their organisation works and, after years of training, they possess a great deal of knowledge and experience. How many companies nowadays are willing to offer the 90's workers the same kind of journey? Or should they even?

2016 – It’s all about constant accessibility

The world is different today than it was 50 years ago. A number of key factors are shaping the change in knowledge-intense professions:

  • Information overload – we produce more and more information. Thanks to the Internet and the World Wide Web, the amount of information available is greater than ever.
  • Education has changed. Employees of the 50’s grew up during a time when education was about learning facts by rote. The schools of today focus more on teaching how to learn through experience, to find information and how to assess its reliability.
  • Ownership is less important. We used to think it was important to own music albums, have them in our collection for display. Nowadays it’s all about accessibility, to be able to stream Spotify, Netflix or an online game or e-book on demand. Similarly we can see the increasing trend of leasing cars over owning them. Younger generations take these services and the accessibility they offer for granted and they treat information the same way, of course. Why wouldn’t they? It is no longer a competitive advantage to know something by heart, since that information is soon outdated. A smarter approach of course is to be able to access the latest information. Knowing how to search for information – when you need it.

Factors supporting the need for organising the free flow of the right information:

  • Employees don't stay as long as they used to in the same workplace anymore, which, for example, requires a more efficient onboarding process. It's no longer feasible to invest the same amount of time and effort in training one individual, since he/she might be changing workplace soon enough anyway.
  • It is much debated whether it is possible to transfer knowledge or not. Current information on the other hand is relatively easy to make available to others.
  • Access to information does not automatically mean that the quality of information is high and the benefits great.

Organisations lack the right tools

Knowing a lot of facts and knowledge about a gradually evolving industry was once a competitive advantage. Companies and organisations have naturally built their entire IT infrastructure around this way of working. A lot of IT applications used today were built for a previous generation with another way of working and thinking. Today most challenges involve knowing where and how to find information. This is something we experience in our daily work with clients. Organisations more or less lack the necessary tools to support the needs of the newer generation in their daily work.

To summarize the challenge: organisations need to be able to supply their new workforce with the right tools to constantly find (and also manipulate) the latest and best information required for them to shine.

Success depends on finding the right information

In order for the new generation to succeed, companies must regularly review how information is handled plus the tools supporting information-heavy work tasks.

New employees need to be able to access the information and knowledge left by retiring employees, while creating and finding new content and information in such a way that information realises its true value as an asset.

Efficiency, automation… And Information Management!

There are several ways of improving efficiency; the first step is often to investigate whether parts, or perhaps the entire creating-and-finding process, can be automated. Secondly, attack the information challenges.

When we have a grip on the information we are to handle, it's time to look into the supporting IT systems. How are employees supposed to find what they are looking for? How do they want to?

We have gotten used to finding answers by searching online. This is in the DNA of the 90's employee. By investing in a great search platform and developing processes to ensure high information quality within the organisation, we are certain the organisation will not only manage the generational renewal but excel in continuously developing new information-centric services.

Written by: Maria “Ia” Björk & Joar Svensson

Sensemaking or Digital Despair

Finding our way in the bright, futuristic, data-driven and intertwined world often taxes us and our digital-hungry senses. Fast rewind to the recent FindabilityDay 2015 and the parade of brilliant speaker talents on stage, starting off with our dear friend and peer, Martin White, on the topic of the future of search.

Human factors matter, from idea inception to the design and practical UX of our digital artifacts. The key has been make-do and ship. This is the reason the more technically advanced mobiles fell by the wayside 8 years ago, when Apple's iPhone arrived.

The social life with information shapes our daily lives in a hyper-connected world. It's still very hard to find that information needle in the haystack, and most days we feel despair when losing the scent of information nuggets. The results from the Findability Survey spoke clearly: without sound organising principles for information and data, and a pliable recorded vision, we won't find anything of value.

Next, Luna's and Sara's presentation moved into an old business model and gave a great example where we see that the orchestration and choreography of their data assets will determine their survival or demise – in conjunction with infused information management practices, processes and tools. They showed a new set of facets to delivering on their mission in their line of business.

Regardless of the line of business, it becomes clear that our fragmented workplace setting is now only partly "on tap". It makes our daily lives a mess, since things do not interoperate. The vision should show the way to a shared information commons, which we all cultivate.

So finally, How do we make sense of any mess?

Answer: Architect a place where you can find comfort in the social conventions shared around the information used. Abby Covert laid out a beautiful tapestry of things we all need to take on to make sense of everyday life, and of life at work. With clear and distinct guardrails and signposts, we don't feel so distracted or lost. Her talk was a true enlightenment for me, being of the same profession, Information Architect.

Written by: Fredric Landqvist

Improving User Intelligence with the ELK Stack at SCA

SCA is a leading global hygiene and forest products company, employing around 44,000 people worldwide. The Group (all companies within SCA) develops and produces sustainable personal care, tissue and forest products. Sales are conducted in about 100 countries under many strong brands. Each brand has its own website and its own search.

At SCA we use Elasticsearch, Logstash, and Kibana to record searches, clicks on result documents and user feedback, on both the intranet and external sites. We also collect qualitative metrics by asking our public users a question after showing search results: “Did you find what you were looking for?” The user has the option to give a thumbs up or down and also write a comment.

What is logged?

All search parameters and results information is recorded for each search event: the query string, paging, sorting, facets, the number of hits, search response time, the date and time of the search, etc. Clicking a result document also records a multitude of information: the position of the document in the result list, the time it took from search to click and various document metadata (such as URL, source, format, last modified, author, and more). A click event also gets connected with the search event that generated it. This is also the case for feedback events.

Each event is written to a log file that is being monitored by Logstash, which then creates a document from each event and pushes them to Elasticsearch where the data is visualized in Kibana.
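
As a minimal sketch of this logging step (field names are illustrative assumptions, not SCA's actual schema), each event can be written as one JSON object per line, which Logstash then picks up and forwards to Elasticsearch:

    import json
    import time

    def log_search_event(log_path, query, hits, response_ms, facets):
        """Append one search event as a single JSON line for Logstash to pick up."""
        event = {
            "type": "search",
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S%z"),
            "query": query,
            "hits": hits,
            "response_ms": response_ms,
            "facets": facets,
        }
        with open(log_path, "a", encoding="utf-8") as f:
            f.write(json.dumps(event) + "\n")

    log_search_event("/var/log/search/events.log", "hygiene products", 42, 113, ["brand"])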

Why?

Due to the extent of information that is indexed, we can answer questions ranging from the very simple, such as "What are the ten most frequent queries during the past week?" and "What do users who click on document X search for?", to the more complex, like "What is the distribution of clicked documents' last modified dates, coming from source S, on Wednesdays?" The possibilities are almost endless!
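
For example, the first of those questions could be answered with a single terms aggregation against the event index. This is a minimal sketch; the index name "search-events" and the fields "timestamp" and "query.keyword" are illustrative assumptions.

    import requests

    resp = requests.get(
        "http://localhost:9200/search-events/_search",
        json={
            "size": 0,  # we only want the aggregation, not the individual events
            "query": {"range": {"timestamp": {"gte": "now-7d/d"}}},
            "aggs": {"top_queries": {"terms": {"field": "query.keyword", "size": 10}}},
        },
    )

    for bucket in resp.json()["aggregations"]["top_queries"]["buckets"]:
        print(bucket["key"], bucket["doc_count"])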

The answers to these questions allow us to tune the search to meet the needs of the users to an even greater extent and deliver even greater value. Today, we use this analysis for everything from adjusting the relevance model, to adding new facets or removing old ones, or changing the layout of the search and result pages.

Experienced value – more than “just” logs

Recording search and click events is common practice, but at SCA we have extended this to include user feedback, as mentioned above. This increases the value of the statistics even more. It allows an administrator to follow up on negative feedback in detail, e.g. by recreating the scenario. It also enables implicitly evaluated trial periods for change requests. If a statistically significant increase in the share of positive feedback is observed, then that change made it easier for users to find what they were looking for. We can also find the answers to new questions, such as "What's the feedback from users who experience zero hits?" and "Are users more likely to find what they are looking for if they use facets?"

And server monitoring as well!

This setup is not only used to record information about user behaviour; we also monitor the health of our servers. Every few seconds we index information about each server's CPU, memory and disk usage. The most obvious gain is the historic aspect: not only can we see the resource usage at a specific point in time, we can also see trends that would not be noticeable if we only had access to data from right now. This can of course be correlated with the user statistics, e.g. whether a rise in CPU usage can be correlated to an increase in query volume.
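
A minimal sketch of that monitoring loop in Python, using the third-party psutil library; the index name, field names and interval are illustrative assumptions.

    import socket
    import time

    import psutil
    import requests

    def index_server_metrics(es_url="http://localhost:9200"):
        """Index one snapshot of this server's CPU, memory and disk usage."""
        doc = {
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S%z"),
            "host": socket.gethostname(),
            "cpu_percent": psutil.cpu_percent(interval=1),
            "memory_percent": psutil.virtual_memory().percent,
            "disk_percent": psutil.disk_usage("/").percent,
        }
        requests.post(es_url + "/server-metrics/_doc", json=doc)

    while True:
        index_server_metrics()
        time.sleep(10)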

Benefits of the ELK Stack

What this means for SCA is that they get a search that is ever improving. We, the developers and administrators of the search system, are no longer in the dark regarding what changes actually change things for the better. The direct feedback loop between the users and administrators of the system creates a sense of community, especially when users see that their grievances are being tended to. Users find what they are looking for to a greater and greater extent, saving them time and frustration.

Conclusion

We rely on Elasticsearch, Logstash and Kibana as the core of our search capability, and for the insight to continually improve. We’re excited to see what the 2.0 versions bring. The challenge is to know what information you are after and create a model that will meet those needs. Getting the ELK platform up and running at SCA was the part of the project that took the least amount of our time, once the logs started streaming out of our systems.