Choosing an Open Source search engine: Solr or Elasticsearch?

There has never been a better time to be a search and open source enthusiast than 2017. The days when information retrieval was a field reserved for academia and experts are far behind us.

Now we have plenty of search engines that let us not only search, but also navigate and discover our information. We are going to focus on two of the leading search engines, both of which happen to be open source projects: Elasticsearch and Solr.

Comparison (Solr and Elasticsearch)

Organisations and Support

Solr is an Apache sub-project developed in parallel with Lucene. Thanks to this, it has synchronized releases and benefits directly from every new Lucene feature.

Lucidworks (previously Lucid Imagination) is the main company supporting Solr. It provides development resources for Solr, commercial support, consulting services, technical training and commercial software around Solr. Lucidworks is based in San Francisco and offers its services in the USA and the rest of the world through strategic partners. Lucidworks has historically employed around a third of the most active Solr and Lucene committers, contributing most of the Solr code base and organizing the Lucene/Solr Revolution conference every year.

Elasticsearch is an open-source product driven by the company Elastic (formerly known as Elasticsearch). This approach creates a good balance between the open-source community contributing to the product and the company making long term plans for future functionality as well as ensuring transparency and quality.

Elastic encompasses not only Elasticsearch but a whole set of open-source products called the Elastic Stack: Elasticsearch, Kibana, Logstash, and Beats. The company offers support for the whole Elastic Stack plus a set of commercial products called X-Pack, bundled into different tiers of subscriptions. They offer training every second week around the world and organize the Elastic{ON} user conferences.

Ecosystem

Solr is an Apache project, and as such it benefits from the large variety of Apache projects that can be used alongside it. The first and foremost example is its Lucene core (http://lucene.apache.org/core/), which is released on the same schedule and from which Solr receives all of its main functionality. The other main project is ZooKeeper, which handles configuration and coordination of SolrCloud clusters.

On the information gathering side there are Apache Nutch, a web crawler, and Apache Flume, a distributed log collector.

When it comes to processing information, there is no end of Apache projects. The ones most commonly used alongside Solr are Mahout for machine learning, Tika for document text and metadata extraction, and Spark for data processing.

The big advantage lies in big data management and storage, with the highly popular Hadoop library as well as the Hive, HBase, and Cassandra databases. Solr supports storing its index in the Hadoop Distributed File System (HDFS) for high resilience.

Elasticsearch is owned by the Elastic company, which drives and develops all the products in its ecosystem, making them very easy to use together.

The main open-source products of the Elastic Stack alongside Elasticsearch are Beats, Logstash and Kibana. Beats is a modular platform for building different lightweight data collectors. Logstash is a data processing pipeline. Kibana is a visualization platform where you can build your own data visualizations, and it already ships with many built-in tools for creating dashboards over your Elasticsearch data.

Elastic also develops a set of products that are available under subscription: X-Pack. Right now, X-Pack includes five products: Security, Alerting, Monitoring, Reporting, and Graph. Each delivers the layer of functionality over the Elastic Stack that its name describes. Most of them are delivered as part of Elasticsearch and Kibana.

Strengths

Solr

  • Many interfaces, many clients, many languages.
  • A query is as simple as solr/select?q=query (see the sketch after this list).
  • Easy to preconfigure.
  • The base product is always complete in functionality; commercial offerings are add-ons.
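
As a minimal sketch of that query simplicity, here is what such a request could look like from Python. This assumes a local Solr instance; the core name "techproducts" and the fields are illustrative assumptions, not taken from the post:

    import requests

    # Minimal sketch, assuming a local Solr instance with a core named
    # "techproducts"; core name and fields are illustrative assumptions.
    SOLR_URL = "http://localhost:8983/solr/techproducts/select"

    resp = requests.get(SOLR_URL, params={"q": "ssd", "rows": 5, "wt": "json"})
    resp.raise_for_status()

    for doc in resp.json()["response"]["docs"]:
        print(doc.get("id"), doc.get("name"))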

Elasticsearch

  • Everything can be done with a JSON HTTP request (see the sketch after this list).
  • Optimized for time-based information.
  • Tightly coupled ecosystem.
  • The base product contains the essentials and is expandable; commercial features come as additions.
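
As a minimal sketch of the JSON-over-HTTP model, assuming a local Elasticsearch 5.x node (the "articles" index and "title" field are illustrative assumptions):

    import json
    import requests

    # Minimal sketch, assuming a local Elasticsearch 5.x node; the
    # "articles" index and "title" field are illustrative assumptions.
    ES_URL = "http://localhost:9200/articles/_search"

    query = {"query": {"match": {"title": "open source search"}}, "size": 5}

    resp = requests.get(ES_URL, data=json.dumps(query),
                        headers={"Content-Type": "application/json"})
    resp.raise_for_status()

    for hit in resp.json()["hits"]["hits"]:
        print(hit["_score"], hit["_source"].get("title"))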


Conclusion – Solr or Elasticsearch?

If you are already using one of them and do not explicitly need a feature exclusive to the other, there is no strong incentive to migrate.

In any case, the answer mirrors the common answer to hardware-sizing questions for either of them: “It depends.” It depends on the amount of data, the expected growth, the type of data, the available software ecosystem around each, and above all the features that your requirements and ambitions demand, to name just a few factors.

 

At Findwise we can help you conduct a platform evaluation study to find the perfect match for your organization and your information.

 

Written by: Daniel Gómez Villanueva – Findability and Search Expert

Using search technologies to create apps that even leave Apple impressed

At Findwise we love to see how we can use the power of search technologies in ways that go beyond the typical search box application.

One thing that has exploded in the last few years is of course apps for smartphones and tablets. It’s no longer enough to store your knowledge in databases kept behind locked doors. Professionals of today want instant access to knowledge and information right where they are, whether they are working on the factory floor or showcasing new products to customers.

When you think of enterprise search today, you should consider it a central hub of knowledge rather than just a classical search page on the intranet. Once an enterprise search solution is in place and information from different places has been normalized and indexed in one place, there really are no limits to what you can do with the information.

By building this central hub of knowledge, it’s simple to make that knowledge available to other applications and services within or outside the organization. Smartphone and tablet applications are one great example.

Integrating mobile apps with search engine technologies works really well for four reasons:

  • It’s fast. Search engines can find the right information using advanced queries or filtering options in a very short time, almost regardless of how big the index is.
  • It’s lightweight. The information handled by the device should only be what is needed by the device, no more, no less.
  • It’s easy to work with. Most search engine technologies provide a simple REST interface that’s easy to integrate with (see the sketch after this list).
  • A unified interface for any content. If the content is already indexed by the enterprise search solution, you use the same interface to access any kind of information.
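
As a minimal sketch of the kind of lightweight, filtered request a mobile app might issue against such a hub. It assumes an Elasticsearch-backed hub; the index, fields and filter below are illustrative assumptions, not SKF’s actual setup:

    import json
    import requests

    # Minimal sketch of a lightweight mobile-app request; the index,
    # fields and filter are illustrative assumptions.
    ES_URL = "http://localhost:9200/products/_search"

    query = {
        "query": {
            "bool": {
                "must": {"match": {"title": "bearing"}},
                "filter": {"term": {"language": "en"}},
            }
        },
        # Return only the fields the app actually needs: no more, no less.
        "_source": ["title", "brochure_url"],
        "size": 10,
    }

    resp = requests.get(ES_URL, data=json.dumps(query),
                        headers={"Content-Type": "application/json"})
    for hit in resp.json()["hits"]["hits"]:
        print(hit["_source"])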

 

We are working together with SKF, a company that has transformed itself from a traditional industrial company into a knowledge engineering company over the last few years. I think it’s safe to say that Findwise has been a big part of that journey by helping SKF create their enterprise search solution.

And of course, since we love new challenges, we have also helped them create a few mobile apps. In particular there are two different apps that we have helped out with:

  • SKF Shelf – A portable product brochures archive. The main use case is quick and easy access to product information for sales reps when visiting customers.
  • MPLP (mobile product landing page) – A mobile web app that you reach by scanning QR codes printed on product packaging.

 

And even more recently, the tech giant Apple has noticed how the apps make the day-to-day work of employees easier.



Read more about Enterprise Search at Findwise

Elastic{ON} 2017 – breaking all the records!

Elastic{ON} 2017 draws 2200 participants to Pier 48 during these somewhat chilly San Francisco days in March. That’s a roughly 40% increase from the 1600 or so participants last year, in line with the growing interest in the Elastic Stack and Elastic’s commercial successes.

From Findwise, we are a team of four Findwizards: networking, learning and reporting.

Shay Banon, the creator of Elasticsearch and Elastic’s CTO, is delivering both the opening and the closing keynote. It is apparent that the transition of the CEO role from Steven Schuurman has already started.


2016 in retrospect, with the future in mind

Elastic reached 100 million downloads in 2016 and has managed to land approximately 4000 paying subscription customers out of this installed base to date. Many presentations during the conference center on new functionality that is being developed and will be released freely to the open source community. Other functionality goes into the commercial X-Pack subscriptions. Some X-Pack functionality is available for free under the Basic subscription level, which only requires registration.

Most presentations center on search-powered analytics, with fewer about regular free-text search. Elasticsearch and the Elastic Stack have their main use cases within logging and analytics, and in various applications as a data platform or middle layer, with search use cases as a strong sidekick.

A strong focus on analytics

There are 22 sponsors at the event, and most of them offer cloud-based monitoring or machine learning services. IBM, the platinum sponsor, is promoting its Bluemix cloud services for cognitive Watson functionality and uses the conference to reach out to the predominantly developer-focused audience.

Prelert was acquired in September last year and is now being integrated into the Elastic Stack as the Machine Learning component, used for unsupervised anomaly detection to provide insights from operations logs. Together with the new modular Beats architecture and various Kibana improvements, it seems apparent that Elastic is chasing the huge market Splunk currently controls within logging and analytics.

Elasticsearch SQL – giving BI what it needs

Elasticsearch SQL will give the search engine SQL capability, just like Solr got with its parallel SQL interface. Elasticsearch is becoming more and more of a “data platform”. It is increasingly a competitor to HPE Vertica and Amazon Redshift, as it hits a sweet-spot use case where a combination of fast data loading and extreme scalability is needed and the trade-offs of limited functionality (such as the lack of JOIN operations) are acceptable. With SQL support, the platform can use existing visualization tools such as Tableau, and the user base expands, since many people in the Business Intelligence sector know SQL by heart.
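
Elasticsearch SQL was announced rather than released at the time of the conference, so the following is only a rough sketch of what a query over the REST API could look like. The /_sql endpoint shape follows what Elastic later shipped and should be treated as an assumption here, as should the index and field names:

    import json
    import requests

    # Rough sketch only: the /_sql endpoint shape is an assumption based
    # on what Elastic later shipped; index and fields are illustrative.
    resp = requests.post(
        "http://localhost:9200/_sql",
        data=json.dumps({
            "query": "SELECT host, COUNT(*) AS hits FROM logs "
                     "GROUP BY host ORDER BY hits DESC LIMIT 5"
        }),
        headers={"Content-Type": "application/json"},
    )
    print(resp.json())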

Fast and simple Beats is music to our ears

Beats will become modular in the next release, and more Beats modules will be created, whether by Elastic or by the open source and commercial communities. This simplifies connectivity to various data sources and adds standardized dashboards per data source, increasing both simplicity and speed of implementation.

Heartbeat is a new Beat (with a beautiful name!) that sends pings to check that services are alive and functioning.

Kibana goes international

Kibana is maturing, with some key updates coming soon. A time series visual builder will give graphical guidance on how to build dashboards; Kibana Canvas will provide custom dynamic reports and enable slide-show presentations with live data; and the GUI frontend is being translated into various languages.

There’s a new tile service for maps, so instead of relying on external map services, Elastic now has control over the maps functionality. The service can be used free of charge but requires registration (Basic subscription) to use all 18 zoom levels.


 

To conclude, we’ve had three good days with exciting product news and lots of interesting meetings at what may very well be the biggest show for search and search-driven analytics right now! Be sure to see us at next year’s Elastic{ON}. If not before, see you then!

 

From San Francisco with love,

/Andreas, Christian, Joar and Peter

What will happen in the information sector in 2017?

As we look back at 2016, we can say that it has been an exciting and groundbreaking year that has changed how we handle information. Let’s look back at the major developments from 2016 and list key focus areas that will play an important role in 2017.

3 trends from 2016 that lay the basis for shaping 2017

Cloud

There has been a massive shift towards the cloud, not only using the cloud for hosting services but building on top of cloud-based services. This has affected all IT projects, especially the Enterprise Search market, where Google decided to discontinue GSA and replace it with the cloud-based Springboard. More official information on Springboard is still to be published in writing, but reach out to us if you are keen to hear about the latest developments.

There are clear reasons why search is moving towards the cloud, two of the main ones being machine learning and the sheer amount of data. We have an astonishing amount of information available, and the cloud is simply the best way to handle this overflow. Development in the cloud is faster, the cloud gives practically unlimited processing power, and the latest developments are available in the cloud at an affordable price.

Machine learning

One area that has taken huge steps forward has been machine learning. It is nowadays being used in everyday applications. Google wrote a very informative blog post about how they use Cloud machine learning in various scenarios. But Google is not alone in this space – today, everyone is doing machine learning. A very welcome development was the formation of Partnership on AI by Amazon, Google, Facebook, IBM and Microsoft.

We have seen how machine learning helps us in many areas. One good example is health care, with IBM Watson managing to identify a rare type of leukemia in 10 minutes. This type of expert assistance is becoming more common. While we know there is still a long way to go before AI becomes smarter than human beings, we are taking leaps forward, as can be seen in DeepMind beating a human at the complex board game Go.

Internet of Things

Another important area is IoT. In 2016, most IoT projects have, in addition to consumer solutions, touched industry: smart cities, energy utilization, connected cars. Companies have realized that they can now track any physical object, with the benefits of being able to service machines before they break, streamline or build better services, or even create completely new businesses based on data knowledge. On the consumer side, 2016 showed how IoT has become mainstream, with the unfortunate side effect of poorly secured devices being used for massive attacks.

 

3 predictions for key developments happening in 2017

As we move towards the year 2017, we see that these trends from 2016 have positive effects on how information will be handled. We will have even more data and even more effective ways to use it. Here are three predictions for how we will see the information space evolve in 2017.

Insight engine

Our collaboration with computers is changing. For decades, we have been giving tasks to computers and waiting for their answers. This is slowly changing: we are starting to collaborate with computers and even expect them to take the initiative. The developments behind this are in machine learning and human language understanding. We no longer only index information and search it with free text; nowadays, we can build computers that understand information. This information includes everything from IoT data points to human-created documents and data from other AI systems. It enables building an insight engine that can help us formulate the right question, or even give us insights in answer to a question we never asked. This will revolutionize how we handle our information and how we interact with our user interfaces.

We will see virtual private assistants that users will be happy to use and train, so that they can help us use information like never before in our daily lives. Google Now, in its current form, is merely a first step towards something like this, actively bringing information to the user.

Search-driven analytics

The way we use and interact with data is changing. With information collected about pretty much anything, we have almost any information right at our fingertips and need effective ways to learn from it – in real time. In 2017, we will see a shift away from classic BI systems towards search-driven evolutions of them. We already have Kibana dashboards with Timelion, and ThoughtSpot, but these are only the first examples of how search is revolutionizing how we interact with data. Advanced analytics available to anyone within the organization, delivering answers and predictions directly in graphs and diagrams, is what 2017 insights will be all about.

Conversational UIs

We saw the rise of chatbots in 2016. In 2017, this trend will also shape how we interact with enterprise systems. A smart conversational user interface builds on the same foundations as an enterprise search platform. It is highly personalized, contextually smart, and builds its answers from information in various systems and in many forms.

Imagine discussing future business focus areas with a machine that challenges our ideas and backs everything with data-based facts. Imagine your enterprise search responding to your query with a question, asking you to clarify what you are actually trying to achieve.

 

What are your thoughts on the future development?

How do you see 2017 changing the way we interact with our information? Comment or reach out in other ways to discuss this further, and have a Happy Year 2017!

 

Written by: Ivar Ekman

Elastic Stack 5.0 is released

At first glance, the major Elasticsearch version bump might seem frightening. Going from version 2.4.x to 5.0 is a big jump, but there’s no need to worry. The main reason is to align versions between the different products in the stack. Having all products on the same version will make future upgrades a lot easier to handle and simplify the overall experience for both new and existing users.

All products in the stack have been updated, some more than others. Here are a few highlights regarding Elasticsearch 5.0 that we recommend you read before upgrading. Or schedule an appointment with us and we’ll help you out!

New relevance model

Elasticsearch prior to version 5 used TF/IDF as its default scoring algorithm. From now on, the default algorithm is BM25.

Depending on the nature of your indexed information, a re-index operation might give you slightly different, and most likely more relevant, results.
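
If you want to compare old and new scoring side by side, Elasticsearch 5 still lets you opt back into the classic TF/IDF similarity per field. A minimal sketch, where the index, mapping type and field names are illustrative assumptions:

    import json
    import requests

    # Minimal sketch: create an index where one field keeps the classic
    # TF/IDF similarity, for comparison with the new BM25 default.
    # Index, type and field names are illustrative assumptions.
    mapping = {
        "mappings": {
            "doc": {
                "properties": {
                    "title_bm25": {"type": "text"},  # default similarity: BM25
                    "title_classic": {"type": "text", "similarity": "classic"},
                }
            }
        }
    }

    resp = requests.put("http://localhost:9200/scoring-test",
                        data=json.dumps(mapping),
                        headers={"Content-Type": "application/json"})
    print(resp.json())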

Re-index from remote

This new feature of the Elasticsearch API is really useful when, for example, upgrading from old clusters. By specifying a remote cluster in the API call, you can easily transfer old documents to your newly created 5.0 cluster without going through a rolling node upgrade procedure.
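
A minimal sketch of such a call, assuming the old cluster’s address has been whitelisted via reindex.remote.whitelist in elasticsearch.yml on the new cluster (host and index names are illustrative assumptions):

    import json
    import requests

    # Minimal sketch of reindex-from-remote; assumes the old cluster is
    # listed in reindex.remote.whitelist on the new 5.0 cluster.
    # Host and index names are illustrative assumptions.
    body = {
        "source": {
            "remote": {"host": "http://old-cluster:9200"},
            "index": "logs-2016",
        },
        "dest": {"index": "logs-2016"},
    }

    resp = requests.post("http://localhost:9200/_reindex",
                         data=json.dumps(body),
                         headers={"Content-Type": "application/json"})
    print(resp.json())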

Ingest Node

There’s a new node type in town. Starting from version 5.0, Elasticsearch gives you the possibility to do simple data manipulation within a running cluster prior to indexing. This is useful if you prefer a more simplistic architecture without Logstash instances but still need to make some alterations to your data.

Most of the core processors found in Logstash are available. Commonly used ones include (see the sketch after this list):

  • Date processor
  • Convert processor
  • Grok processor
  • Rename processor
  • JSON processor
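
A minimal sketch of defining and using an ingest pipeline that combines a few of these processors; the pipeline name and field names are illustrative assumptions:

    import json
    import requests

    HEADERS = {"Content-Type": "application/json"}

    # Minimal sketch of an ingest pipeline using the rename, convert and
    # date processors; pipeline and field names are illustrative.
    pipeline = {
        "description": "Tidy up incoming log documents before indexing",
        "processors": [
            {"rename": {"field": "msg", "target_field": "message"}},
            {"convert": {"field": "status", "type": "integer"}},
            {"date": {"field": "ts", "formats": ["ISO8601"]}},
        ],
    }
    requests.put("http://localhost:9200/_ingest/pipeline/tidy-logs",
                 data=json.dumps(pipeline), headers=HEADERS)

    # Index a document through the pipeline.
    doc = {"msg": "GET /index.html", "status": "200",
           "ts": "2017-03-10T12:00:00Z"}
    requests.post("http://localhost:9200/logs/doc?pipeline=tidy-logs",
                  data=json.dumps(doc), headers=HEADERS)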

Search and Aggregations

The search API has been refactored to be smarter about which indices are hit, and about whether aggregations need to be recalculated when issuing range queries. By looking at when indices were last modified, range aggregations can be cached and only recalculated when really needed. This improvement is especially useful for the typical log analytics case with time series data: you will notice speed improvements in your Kibana dashboards.

New data structures

Lucene 6.0 introduces a new feature called dimensional points, which uses the k-d tree geo-spatial data structure to enable fast single- and multi-dimensional numeric range and geo-spatial point-in-shape filtering. Elasticsearch 5.0 implements a variant called the block k-d tree, specifically designed for efficient IO, which gives significant performance boosts for indexing as well as filtering.

Should I upgrade?

If your typical use case involves geo-spatial queries and filtering, we definitely recommend that you upgrade your cluster and re-index your documents to gain the performance boost. Given the simplicity of upgrading, or even migrating data to a completely new cluster, it is worth the time getting your Elastic Stack up to date and ready for the features to come.

In case you need help, don’t hesitate to contact us and we will guide you through the process.

Written by: Joar Svensson, Consultant Findwise

Involuntarily digital footprints violate personal integrity (learn about GDPR)

The aim of this blog post is to help the “average Joe” understand how the upcoming General Data Protection Regulation (GDPR) affects his everyday life.

To start with, let’s sort some expressions out.

Digital footprint

According to Wikipedia, there are two main classifications of digital footprints:
• Passive digital footprint – data collected without the owner’s knowledge.
• Active digital footprint – data released deliberately by the user (e.g. sharing an image on Facebook).

Personal integrity

Integrity can be described as the quality of being honest and having strong moral principles. In general, where you stand on questions of integrity is a personal choice. Gossiping about secrets told in confidence is one illustrative example; publishing images of others without their knowledge is another (this might even be illegal).

This illustrative case could be you

To understand what GDPR is about and how it affects your everyday life, I will use an example that I hope you can recognize yourself in.

Imagine: you live in an apartment in a mid-size building with other people (let’s call them neighbours). In front of the building there is a space dedicated to parking cars. One day a neighbour of yours decides to move and therefore hires a real-estate agent to help sell the apartment.

As you are somewhat curious about what the apartments in your neighbourhood are worth, you look up the advertisement for the apartment on the internet. When you find it, you see your own car in the picture of the parking space. On top of this, you discover that the car’s registration number is fully visible.

Should you care?

According to Datainspektionen, registration numbers are considered “personal data”. So the first mistake the broker has made here is creating a passive digital footprint for you. The second mistake is breaking the law: in Sweden it is not allowed to publish personal data without the owner’s consent.

The moral compass of the broker should be questioned here. A passive digital footprint in your name is created, your personal integrity has been violated and the law has been broken.

On top of that, GDPR takes effect in May 2018. You then have the right to be forgotten whenever you want (you can push companies to remove your personal data from their systems).

Is there a business case?

A lawyer could probably build a business case around suing real-estate brokers for publishing pictures of cars’ registration numbers without the owners’ consent.

As a regular citizen, should you really get agitated about a picture of your car’s registration number? Maybe you should; it depends on your level of personal integrity. As modern society evolves, the amount and variety of information being digitalized grows by the day.

With this example, I hope the “average Joe” now understands what digital footprints, personal integrity and GDPR are. Maybe this got you thinking, and you want to know more about GDPR.

There are probably two sober ways to look at this: live with your personal data being spread (and get used to soon having nothing personal anymore), or stick your neck out and say “hey, stop publishing my personal data without asking me”.

Whether you want it or not, you are affected by GDPR.

 

 

Written by: Markus Edström

Generational renewal at work – a search challenge

The big generational shift

There have been discussions surrounding the great generational renewal in the workplace for a while. The 50’s generation, who have spent a large part of their working lives within the same company, are being replaced by an agile bunch born in the 90’s. We are not taken in by tabloid claims that this new generation does not want to work, or that companies do not know how to attract them. What concerns us is that businesses are not adapting fast enough to the way the new generation handles information, which is needed to enable the transfer of knowledge within the organisation.

Working for the same employer for decades

Think about it for a while: for how long has the 50’s generation been allowed to learn everything they know? We see it all the time: large groups of employees ready to retire after spending their whole working lives within the same organisation. They began their careers as teenagers, working on the factory floor or in a similar role, growing step by step within the company, together with the company. These employees tend to carry a deep understanding of how their organisation works, and after years of training they possess a great deal of knowledge and experience. How many companies nowadays are willing to offer the 90’s workers the same kind of journey? And should they even?

2016 – It’s all about constant accessibility

The world is different today than it was 50 years ago. A number of key factors are shaping the change in knowledge-intense professions:

  • Information overload – we produce more and more information. Thanks to the Internet and the World Wide Web, the amount of information available is greater than ever.
  • Education has changed. Employees of the 50’s grew up during a time when education was about learning facts by rote. The schools of today focus more on teaching how to learn through experience, to find information and how to assess its reliability.
  • Ownership is less important. We used to think it was important to own music albums and have them in our collection for display. Nowadays it’s all about accessibility: being able to stream Spotify, Netflix, an online game or an e-book on demand. Similarly, we can see the increasing trend of leasing cars over owning them. Younger generations take these services and the accessibility they offer for granted, and of course they treat information the same way. Why wouldn’t they? It is no longer a competitive advantage to know something by heart, since that information is soon outdated. A smarter approach is to be able to access the latest information: knowing how to search for information when you need it.

Factors supporting the need for organising the free flow of the right information:

  • Employees don’t stay in the same workplace as long as they used to, which, for example, requires a more efficient onboarding process. It’s no longer feasible to invest the same amount of time and effort in training one individual, since he or she might be changing workplace soon enough anyway.
  • It is much debated whether it is possible to transfer knowledge or not. Current information on the other hand is relatively easy to make available to others.
  • Access to information does not automatically mean that the quality of information is high and the benefits great.

Organisations lack the right tools

Knowing a lot of facts about a gradually evolving industry was once a competitive advantage, and companies and organisations have naturally built their entire IT infrastructure around this way of working. Many IT applications used today were built for a previous generation, with another way of working and thinking. Today, most challenges involve knowing where and how to find information. This is something we experience in our daily work with clients: organisations more or less lack the necessary tools to support the needs of the newer generation in their daily work.

To summarize the challenge: organisations need to be able to supply their new workforce with the right tools to constantly find (and also manipulate) the latest and best information required for them to shine.

Success depends on finding the right information

In order for the new generation to succeed, companies must regularly review how information is handled plus the tools supporting information-heavy work tasks.

New employees need to be able to access the information and knowledge left by retiring employees, while creating and finding new content and information in such a way that information realises its true value as an asset.

Efficiency, automation… And Information Management!

There are several ways of improving efficiency. The first step is often to investigate whether parts of the creating and finding process, or perhaps all of it, can be automated. Secondly, attack the information challenges.

Once we have a grip on the information we are to handle, it’s time to look into the supporting IT systems. How are employees supposed to find what they are looking for? How do they want to?

We have gotten used to finding answers by searching online. This is in the DNA of the 90’s employee. By investing in a great search platform and developing processes that ensure high information quality within the organisation, we are certain the organisation will not only manage the generational renewal but excel at continuously developing new information-centric services.

Written by: Maria “Ia” Björk & Joar Svensson

Best Practices for Enterprise Search: What Leading Practitioners Do

Best Practices for Enterprise Search, from a practitioner perspective. The content of this presentation focuses more on actionable tasks and processes than on search technology. The best practices are based on data from the Enterprise Search and Findability Survey, previous research, empirical evidence and, of course, the accumulated collective experience of Findwise consultants.

Your comments on this “Best Practices for Enterprise Search” presentation are very much appreciated!

Presented at Intranett 2012 in Oslo, Norway, 22 November 2012 by Kristian Norling.