What’s new in Apache Solr 6?

Apache Solr 6 has recently been released! There are some important technical changes to keep in mind: Lucene/Solr 4.x indexes can no longer be read, and Java 8 is now required. But in my opinion, the most interesting part is the new features, which clearly follow industry trends. I mean here: a SQL engine on top of Solr, graph search, and replicating data across different data centers.

One of the most promising of the new features is the Parallel SQL Interface. In brief, it is the ability to run SQL queries on top of SolrCloud (Cloud mode only, for now). Combining full-text capabilities with well-known SQL statements can be very interesting.
Internally, Solr uses Presto, a SQL query engine that works with various types of data stores. Presto is responsible for translating SQL statements into Streaming Expressions, since Solr's SQL engine is based on the Streaming API.
Thanks to that, SQL queries can be executed on worker nodes in parallel. There are two implementations of result grouping (aggregations): the first is based on a map-reduce algorithm and the second uses Solr facets. The basic difference is the number of fields used in the grouping clause. The Facet API gives better performance, but only when the GROUP BY isn't complex; if it is, it is better to try aggregationMode=map_reduce.
From a developer's perspective it's really transparent. A simple statement like "SELECT field1 FROM collection1" is translated to the proper fields and collection. Right now, clauses such as WHERE, ORDER BY, LIMIT, DISTINCT and GROUP BY can be used.
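To illustrate, a statement like the one above can be sent to the collection's /sql request handler over plain HTTP. Here is a minimal sketch (the host, port, collection and field names are placeholders; Python with the requests library is assumed):

```python
import requests

# Send a SQL statement to Solr's /sql request handler (SolrCloud mode).
# Host, port, collection and field names are illustrative placeholders.
SOLR_SQL_URL = "http://localhost:8983/solr/collection1/sql"

params = {
    "stmt": "SELECT field1 FROM collection1 WHERE field2 = 'value' LIMIT 10",
    # 'facet' suits simple GROUP BY clauses; 'map_reduce' handles complex ones.
    "aggregationMode": "facet",
}

response = requests.post(SOLR_SQL_URL, data=params)
# The handler streams results back as JSON tuples; the final tuple is an EOF marker.
print(response.json())
```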
Solr still doesn't support the whole SQL language, but it is a powerful feature even so. First of all, it can make beginners' lives easier, since the relational world is widely known. What is more, I imagine this can be useful during IT system migrations, or when collecting data from Solr for further analysis. I hope to hear about many different case studies in the near future.

Apache Solr 6 also introduces something that is crucial wherever a search engine is a business-critical system: cross data center replication (CDCR).
SolrCloud was created to support near real-time (NRT) searching, so it didn't work well when cluster nodes were distributed across different data centers, because of the communication overhead generated by leaders, replicas and synchronization operations.

The new feature is in an experimental phase and still under development, but for now there is an active-passive mode, where data is pushed from the source data center to the target data center. Documents can be sent in real time or on a schedule. Every leader in the active cluster asynchronously sends updates to the corresponding leader in the passive cluster. After that, the target leaders replicate the changes to their replicas as usual.
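Once CDCR has been configured in solrconfig.xml on both clusters, replication is controlled through a small HTTP API exposed by the source collection. A minimal sketch (the host and collection names are placeholders; Python with the requests library is assumed):

```python
import requests

# CDCR is controlled per collection via the /cdcr request handler on the
# source cluster. The host and collection name below are placeholders.
CDCR_URL = "http://source-dc-host:8983/solr/collection1/cdcr"

# Start pushing updates from the source data center to the configured target.
requests.get(CDCR_URL, params={"action": "START"})

# Inspect the current replication state.
status = requests.get(CDCR_URL, params={"action": "STATUS"}).json()
print(status)
```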
CDCR is crucial when we think about distributed systems working in high-availability mode; it relates to disaster recovery, scaling and avoiding single points of failure (SPOF). Please visit the documentation page for details and plans for the future: https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=62687462

What if your business operates in a highly connected environment where data relationships matter, but you still benefit from full-text search? Solr 6 has good news for you: graph traversal functionality.
Many enterprises know that focusing on relations between documents and on graph data modeling is the future. Now you can build Solr queries that let you discover information organized in nodes and edges, and explore your collections in terms of interactions and connections between particular data elements. Use cases range from the semantic search area (query augmentation, using ontologies, etc.) to more prosaic ones, like organizational security roles or access control.
The graph traversal query is still a work in progress, but it can already be used, and its basic syntax is really simple: fq={!graph from=parent_id to=id}id:"DOCUMENT_ID"
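For instance, such a graph filter can be sent through the regular /select handler. A small sketch (the host, collection, field names and document id are placeholders; Python with the requests library is assumed):

```python
import requests

# Follow the parent_id/id edges starting from a single root document.
# Host, collection, field names and the document id are placeholders.
SOLR_SELECT_URL = "http://localhost:8983/solr/collection1/select"

params = {
    "q": "*:*",
    "fq": '{!graph from=parent_id to=id}id:"DOCUMENT_ID"',
    "wt": "json",
}

response = requests.get(SOLR_SELECT_URL, params=params)
for doc in response.json()["response"]["docs"]:
    print(doc["id"])
```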

The last Solr 6 improvement I'm going to mention is a new scoring algorithm: BM25. In fact, it's a change driven by Apache Lucene 6, where BM25 is now the default similarity implementation. Similarity determines which documents match the query and to what extent. There are many different factors that determine a document's score, for example the number of search terms found in the document, how common those terms are across the whole collection, or the document length. This is where BM25 improves scoring: it takes into consideration the average length of the documents (fields) across the entire corpus, and it better limits the impact of term frequency on the results ranking.
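For reference, the commonly stated form of the BM25 formula is:

```latex
\mathrm{score}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot
  \frac{f(q_i, D)\,(k_1 + 1)}
       {f(q_i, D) + k_1 \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}
```

Here f(q_i, D) is the frequency of term q_i in document D, |D| is the document length and avgdl is the average document length across the corpus: this is exactly the document-length normalization mentioned above. Because f(q_i, D) appears in both the numerator and the denominator, its contribution saturates, which is what bounds the impact of term frequency on the ranking. Lucene's implementation defaults to k_1 = 1.2 and b = 0.75.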

As we can see, Apache Solr 6 provides us with many new features, and those mentioned above are not all of them. We're going to write more about the new functionality soon. Until then, we encourage you to try the newest Solr on your own, and remember: don't hesitate to contact us in case of any problems!

How it all began: a brief history of Intranet Search

According to various sources, the intranet was born between 1994 and 1996, true prehistory from an IT systems point of view. Intranet history is bound up with the development of the Internet, the global network. The idea of the WWW, proposed in 1989 by Tim Berners-Lee and others, whose aim was to enable connections and access to many different sources, became the prototype for the first internal networks. The goal of the intranet was to increase employee productivity through easier access to documents, faster document circulation and more effective communication. Although access to information was always a crucial matter, the intranet in fact offered many more functionalities, e.g. e-mail, group work support, audio-video communication, and searching of texts or personal data.

Overload of information

Over the years, the content placed on WWW servers became more important than other intranet components. First, managing ever more complicated software and the required hardware led to the development of new specializations. Second, paradoxically, the ease of publishing information became a source of serious problems. There was too much information; documents were partly outdated, duplicated, and lacked a homogeneous structure or hierarchy. Difficulties in content management, and the lack of people responsible for this process, led to a situation where the end user was unable to reach the desired piece of information, or doing so required too much effort.

Google to the rescue

As early as 1998, Gartner produced a document describing this state of the Internet as a "Wild West". On the Internet, the problem was tackled by the likes of Yahoo and Google, which became the global leader in information searching. In internal networks, it had to be addressed through rules for publishing information and through CMS and Enterprise Search software. In many organizations the struggle for easier access to information is still ongoing; in others, it has only just begun.

And then came Search

It was the search engine that influenced the perception of the intranet the most. On one side, the search engine is directly responsible for realizing the basic goals of knowledge management in a company. On the other, it is the main source of complaints and frustration among internal network users. There are many reasons for this status quo: wrong or unreadable search results, missing documents, security problems and poor access to some resources. What are the consequences of such a situation? First and foremost, they show up as high labor costs (duplication of tasks, reduced quality, wasted time, less efficient cooperation) as well as lost business opportunities. It must not be forgotten that search engine problems often overshadow the use of the intranet as a whole.

How to measure efficiency?

In 2002, Nielsen Norman Group consultants estimated that the productivity difference between employees using the best and the worst corporate networks is about 43%. Meanwhile, the annual Enterprise Search and Findability Survey report shows that while almost 60% of companies stress the high importance of information searching for their business, nearly 45% of employees have problems finding information.
Leaving aside comfort and employee satisfaction, the natural effect of implementing and improving Enterprise Search solutions is financial benefit. Contrary to popular belief, the returns on investment and the savings from reaching information faster are entirely quantifiable. Preparing such calculations is not exactly easy, though. The first step is to estimate the time employees spend searching for information, to calculate what percentage of searches end in failure, and to work out how long it takes to perform a task without the necessary materials. It is worth pointing out that findings from companies such as IDC and AIIM show that office workers spend at least 15-35% of their working hours searching for the information they need.
Problems with search are rarely technical. The search engines currently on the market are mature products, regardless of technology type (commercial or open source). Usually the problem is a default installation, the system left untouched straight "out of the box". Each search deployment is different because it deals with different document collections. On top of that, user expectations and business requirements change continually. In conclusion, ensuring good quality search is an unremitting process.
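To make the first step concrete, here is a back-of-the-envelope sketch of such a calculation. Every input number below is invented purely for illustration:

```python
# Back-of-the-envelope ROI sketch; every input below is invented.
employees = 500
hourly_cost = 50.0      # fully loaded cost per employee hour, in EUR
search_share = 0.20     # share of working time spent searching (cf. the 15-35% findings)
failure_rate = 0.30     # share of searches that end in failure
hours_per_year = 1600

searching_hours = employees * hours_per_year * search_share
wasted_hours = searching_hours * failure_rate
print(f"Hours spent searching per year: {searching_hours:,.0f}")
print(f"Hours wasted on failed searches: {wasted_hours:,.0f}")
print(f"Annual cost of failed searches: EUR {wasted_hours * hourly_cost:,.0f}")
```

Even if an improved search solution eliminated only a fraction of those failed searches, the saving is easy to compare against the cost of the investment.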

Knowledge workers' main tool?

The intranet has become a comprehensive tool for accomplishing company goals. It supports employee commitment and effectiveness, internal communication and knowledge sharing. However, its main task is helping people find information, which is often hidden in stacks of documents or dispersed among various data sources. Equipped with a search engine, the intranet has become an invaluable working tool in practically all sectors, especially in departments such as customer service or administration.

So, how is your company’s access to information?


This text is an introduction to a series of articles dedicated to intranet search. Subsequent articles will deal with: the function of the search engine in an organization, the benefits of using Enterprise Search, the requirements of an information search system, the most frequent errors and obstacles in implementations, and systems architecture.

Continuous crawl in SharePoint 2013

Continuous crawl is one of the new features that come with SharePoint 2013. As an alternative to incremental crawl, it promises to improve the freshness of the search results: that is, to shorten the time between when an item is updated in SharePoint by a user and when it becomes available in search.

Understanding how this new functionality works is especially important for SharePoint implementations where content changes often and/or where it is a requirement that content should be instantly searchable. Moreover, since many of the new SharePoint 2013 functionalities depend on search (see the social features, the popular items, or the content by search web parts), understanding continuous crawl and planning accordingly can help align user expectations with the technical capabilities of the search engine.

Both the incremental crawl and the continuous crawl look for items that were added, changed or deleted since the last successful crawl, and update the index accordingly. However, the continuous crawl overcomes the limitation of the incremental crawl, since multiple continuous crawls can run at the same time. Previously, an incremental crawl would start only after the previous incremental crawl had finished.

Limitation to content sources

Content not stored in SharePoint will not benefit from this new feature. Continuous crawls apply only to SharePoint sites, which means that if you are planning to index other content sources (such as file shares or Exchange folders), your options are restricted to incremental and full crawls.

Example scenario

The image below shows two situations. On the left (Scenario 1), incremental crawls are scheduled to start every 15 minutes. On the right (Scenario 2), continuous crawls are scheduled to start every 15 minutes. Around 7 minutes after the first crawl starts, a user updates a document. Let's also assume that in this case passing through all the items to check for updates takes 44 minutes.

Incremental vs continuous crawl in SharePoint 2013

In Scenario 1, although incremental crawls are scheduled every 15 minutes, a new incremental crawl cannot start while another one is running; the next incremental crawl only starts after the current one has finished. In this scenario, that means 44 minutes for the first incremental crawl to finish, after which the next incremental crawl kicks in, finds the updated document and sends it to the search index. In other words, it could take around 45 minutes from the time the document was updated until it is available in search.

In Scenario 2, a new continuous crawl starts every 15 minutes, since multiple continuous crawls can run in parallel. The second continuous crawl sees the updated document and sends it to the search index. By using continuous crawl in this case, we have reduced the time it takes for a document to become available in search from around 45 minutes to around 15 minutes.

Not enabled by default

Continuous crawls are not enabled by default, and enabling them is done from the same place as for incremental crawls: in Central Administration, under the Search Service Application, per content source. The interval at which a continuous crawl starts defaults to 15 minutes, but it can be lowered through PowerShell to a minimum of 1 minute if required; lowering the interval will, however, increase the load on the server. Another number to take into consideration is the maximum number of simultaneous requests, which is again configured from Central Administration.

Continuous crawl in Office 365

Unlike in SharePoint 2013 Server, continuous crawls are enabled by default in SharePoint Online, but they are managed by Microsoft. For those used to Central Administration on an on-premises SharePoint server, it might be surprising that it is not available in SharePoint Online. Instead, there is a limited set of administrative features. Most of the search features can be managed from this administrative interface, though the ability to manage crawling of content sources is missing.

The continuous crawl for Office 365 is limited by the lack of control and configuration options. The crawl frequency cannot be modified, but Microsoft targets between 15 minutes and one hour between a change and its availability in the search results, though in some cases it can take hours.

Closer to real-time indexing

The continuous crawl in SharePoint 2013 overcomes previous limitations of the incremental crawl by closing the gap between the time when a document is updated and when it becomes visible in the search index.

A different concept in this area is event-driven indexing, which we will explain in our next blog post. Stay tuned!

Reaching Findability #2

Findability is surprisingly complex due to the large number of measures that need to be understood and undertaken. I believe that one of the principal challenges lies within the pedagogical domain. This is my second post in a series of simple tips for reaching Findability. You can also sign up here for a free email course on this topic!

Take control of the technology!

The right search technology is an important foundation for making your information findable. There is a plethora of good search products on the market, all of them with different properties and strengths. The right products are those that fulfill your needs at the lowest cost. Therefore, to make the right choice, you must have a good understanding of your requirements.

A good search engine is specialized in figuring out what you’re actually intending to find, even if you only type a single word with ambiguous meaning. The search engine can make the difference when the exact term or spelling is not obvious, or a word is simply misspelled. It can also increase the relevance of search hits by only displaying results in languages you understand, and prioritizing results that are relevant in your current context.

With the right search platform in place, making a correct set-up and configuration is vital. While the initial installation may seem simple, taking advantage of the more powerful functions is complex and requires deep knowledge of search and information management.

If you lack access to a search platform, think again! Maybe your organization is using SharePoint, which in many versions contains a powerful search engine. Maybe you are using a search engine on the website, which could also be used for other purposes, or vice versa. Sometimes it pays off to investigate which technologies are already employed by the organization and look for new applications for them.

Feel free to contact me at anders.nilsson@findwise.com if you wish to discuss this further, or sign up here to get our free email course.

Reaching Findability

Findability is not rocket science, but it remains complex due to the large number of measures that need to be understood and undertaken. I've been giving this a lot of thought and believe one of the principal challenges lies within the pedagogical domain. Therefore I've compiled a number of simple tips for reaching Findability, which I will share in a series of blog posts. You can also sign up here for a free email course on this topic!

Take control of your information!

A strong incentive to improve Findability is to make information available to people who don’t have prior knowledge of where it resides or what it looks like. That doesn’t make management of the actual information less important. It’s just the other way round!

To gain control of your information, you must understand what and where it is. What do we know about it, what is its quality, and how can a search engine expose it to the users? Often existing metadata, the surrounding structure and the actual content can tell us much that can be used to make it findable. Remember that important information can reside in many places: look in the intranet, mailboxes and file servers, as well as databases and proprietary systems, to mention but a few.

When your most important information has been identified, you need to build an information model that outlines the important concepts and terms and how they fit together. This enables a structured way of working with the information, as well as technical solutions that simplify finding, discovering and navigating it.

Bear in mind that it can be difficult to cover all the information at once. To avoid being overwhelmed, start with some of the most important information, the stuff that really makes a difference in streamlining a process. Preferably, use a method for identifying and prioritizing business effects as a starting point, to ensure your efforts are wisely spent.

Feel free to contact me at anders.nilsson@findwise.com if you wish to discuss this further, or sign up here to get our free email course.

Query Rules in SharePoint 2013

Leaving both the SharePoint Conference in Las Vegas and the recent European SharePoint Conference in Copenhagen behind, Findwise continues sharing impressions about the new search in SharePoint 2013! We have previously given an overview of what is new in search in SharePoint 2013 and discussed Microsoft’s focus areas for the release. In this post, we focus more on the ranking of the search results using the query rules.

Understanding user intent in search is one of the key developments in the new release. The screenshots below, showing out-of-the-box functionality on some sample content, exemplify how the search engine adapts to the user query. Keywords such as 'deck', 'expert', or 'video' can express the user's needs and expectations for different search results and information types, and what the search engine does in this case is promote those results that have a higher probability of being relevant to the user's search.

Query rules

Source: Microsoft

The adaptability of the search results can seem remarkable, as we see in these examples, which aim to provide more relevant search results through a better understanding of user intent. This is powered by a new feature in SharePoint 2013 called query rules. Maybe even more interesting is that you can define your own custom query rules to match your specific needs, without writing any code!

The simplest query rule would be to promote a specific result for a given search query. For example, you can promote a product’s instruction manual when the users search for that product name. Previously, in SharePoint 2010, you were able to define such promoted results (or “best bets”) using the Search Keywords. The query rules in SharePoint 2013 extend this functionality, providing an easy way to create powerful search experiences that adapt to user intent and business needs.

When defining a query rule, there are two main things to consider: conditions and corresponding actions. The conditions specify when the rule will be applied and the actions specify what to do when the rule is matched. There are six different condition types and three action types that can be defined.

For example, a query condition can be that a query keyword matches a specified phrase or a term from a dictionary (such as 'picture', 'download' or a product name from the term store), that the query is more popular for a certain result type (such as images when searching for 'cameras'), or that it matches a given regular expression (useful for matching phone numbers, for example). The correlated actions can consist of promoting individual results on top of the ranked search results (for example, promoting the image library), promoting a group of search results (such as image results, or search results federated from a web search engine), or changing the ranking of the search results by modifying the query (changing the sorting of results or filtering on a content type).

Another thing to consider is where you define the rule. Query rules can be created at the Search Service Application, Site Collection, or Site level. The rules are inherited by default, but you can remove, add, configure and change the order of query rules at each level. Fortunately, SharePoint also allows you to test a query and see which rules will fire.
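Returning to the regular-expression condition mentioned above, here is a small sketch of the kind of pattern such a condition could use to recognize phone-number-like queries. The pattern and sample queries are invented for illustration:

```python
import re

# An illustrative pattern of the kind a "matches regular expression" query
# rule condition could use: digits, spaces, parentheses and dashes only.
PHONE_LIKE = re.compile(r"^\+?[\d\s()-]{7,}$")

for query in ["+46 31 123 45 67", "sharepoint governance"]:
    if PHONE_LIKE.match(query):
        print(f"'{query}' looks like a phone number: a phone-directory rule could fire")
    else:
        print(f"'{query}' is an ordinary query: no rule fires")
```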

There is one more thing you need to take into account: some query rule features are limited in certain licensing plans. Some plans only allow you to add promoted results, with the more advanced query rule actions disabled. Check TechNet for guidelines on managing query rules and a list of the features available across the different licensing plans.

With query rules, you have the freedom and power to change the search experience and adapt it to your needs. Defining the right keywords to match in user queries and mapping the conditions to the relevant actions is easy, but the process must be well managed. The management of query rules should definitely be part of your SharePoint 2013 search governance strategy.

Let’s have a chat about how you can create great search experiences that match your specific users and business needs!

Enterprise Graph Search

Facebook will soon launch their new Graph Search to the general public, and it has received a lot of interest lately.

With graph search, users will be able to query the social graph that millions of people have constructed over the years by friending each other and putting more and more personal information about themselves and their friends into the vast Facebook database. It will be possible to query for friends of friends who have similar interests to yours and invite them to a party, or to query for companies where people with similar beliefs to yours work, and so on and so forth. The information that is already available will all of a sudden become much more accessible through the power of graph search.

How can we bring this to an enterprise search environment? Well, there are lots of graphs to query in the enterprise as well, both social and other types. For example, how about being able to query for people who, in the last three years, have been members of a project that successfully brought a new product to market? That would be an interesting list of people to know about if you are a marketing director who wants to assemble a team in the company to create a new product and make sure it succeeds in the market.

If we dissect graph search, we will find three important concepts:

  1. The information we want to query must not only be indexed into one central search engine; the relations and attributes of all information objects also need to be normalized, to create the relational graph and provide standard attributes to query against. We could use the Open Graph Protocol as the foundation.
  2. We need a parser that takes human language and converts it to a formal query language that a search engine understands. We might want to query in different human languages as well.
  3. The presentation of results should be adapted to the kind of information sought for. In Facebook’s example, if you query for people you will get a list of people with their pictures and some relevant personal information in the result list, and if you query for pictures you will get a collage of pictures (similar to the Google image search).

So the recipe for success is to put a big focus on the information management part of the project, making sure to create a unified information model of the content to be indexed. Then create a natural language query parser based on actual user behavior; the same user studies would also tell us how to visualize the different result set types.
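To make the first concept a bit more concrete, here is a toy sketch of a normalized graph of people, friendship edges and interest attributes, queried for friends of friends who share an interest. All names, fields and data below are invented purely for illustration:

```python
from collections import defaultdict

# Toy normalized graph: every object is a node, relations are typed edges.
# All names, fields and data below are invented for illustration.
friends = defaultdict(set)
interests = defaultdict(set)

def add_friendship(a, b):
    friends[a].add(b)
    friends[b].add(a)

add_friendship("alice", "bob")
add_friendship("bob", "carol")
interests["carol"].add("sailing")
interests["alice"].add("sailing")

def friends_of_friends_with_interest(person, interest):
    """Traverse two friendship edges out from 'person' and keep the
    nodes that carry the requested interest attribute."""
    direct = friends[person]
    candidates = set()
    for friend in direct:
        candidates |= friends[friend]
    candidates -= direct | {person}
    return {p for p in candidates if interest in interests[p]}

print(friends_of_friends_with_interest("alice", "sailing"))  # {'carol'}
```

In a real solution the normalized graph would of course live in the search engine's index rather than in memory, but the traversal idea is the same.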

I believe we will see more of these kinds of solutions in the enterprise search market in the coming years, and I look forward to exploring the possibilities together with our clients.

Accessing Enterprise Content with Mobile Search

Today many IT departments are investing in mobile technology to make their internal enterprise content accessible on employees' mobile phones and other mobile devices. We all want to be able to work without being at the office, and without having to run around with the work laptop. Imagine being at a business lunch and wanting to pull up a presentation you have on the company intranet: why not just use the mobile phone?

In some organizations this is possible, and in some it still isn't. And in most organizations you don't have access to all the documents and content available internally in document management systems, file shares and databases. Even if you did have access to the content on your mobile phone, you wouldn't want to start browsing for it, because it's just too cumbersome to find.

Here's an idea for you: why not utilize the enterprise search platform to make the content accessible, findable and readable?

The first step is to make the content accessible. Since all content is already being indexed by the search engine, it's already in one central place, at least as a text representation. If you have a solution in place for letting mobile phones access the company intranet, it should be fairly simple to open up the enterprise search web interface to mobile devices as well, with security credentials still in place.

Secondly, the content needs to be findable, and what better way is there to find information on a mobile phone than to search for it? With mobile search user interface patterns, this will be much more efficient than traditional browsing for information.

And third, when you have found your document, you can use search engine features such as thumbnail previews, automatic summarization and HTML conversion to make it easily readable on the mobile device.

Check out my presentation on SlideShare on accessing content with mobile search as well.

If you already have an enterprise search platform in place, why not start researching how to utilize it to make your enterprise content accessible on your mobile phone?

And if you don’t have an enterprise search platform in place, I suppose you now have yet another reason to add to your business case for investing in one.

Impressions of GSA 7.0

Google released Google Search Appliance (GSA) 7.0 in early October. Magnus Ebbesson and I joined the Google-hosted pre-sales conference in Zürich, where some of the new functionality was presented, along with what the future will bring to the platform. Google is really putting effort into the platform, and it gets stronger with each release. Personally, I tend to like the hardware and security updates the most, but I have to say that some of the new features are impressive and have great potential. I have had the opportunity to try them out for a while now.

In late November we held a breakfast seminar at the office in Gothenburg, where we talked about the GSA in general with a focus on GSA 7.0 and the new features. My impression is that the translate functionality is very attractive for larger enterprises, while the previews bring a big wow factor in general. The possibility of configuring ACLs for several domains is great too, as many larger enterprises tend to have several domains. The entity extraction is of course interesting and can be very useful; a processing framework would, however, enhance it even further.

It is also nice to see that Google is improving the hardware. The robustness is a really strong argument for selecting GSA.

It's impressive to see how many languages the GSA can handle and how quickly it performs the translation. The user will still need basic knowledge of the foreign language, since the query itself is not translated. However, it is reasonably common to have a corporate language which most of the employees know.

The preview functionality is a very welcome feature. The fact that it can highlight pages within a document is really nice. I have played around with using it through our Jellyfish API, with some degree of success. Below are two examples of the preview functionality in use.

GSA 7.0 Preview

GSA 7 Preview - Details

A few thoughts

At the conference we attended in Zürich, Google mentioned that they are aiming to improve the built-in template in the GSA. The standard template is nice, and makes it possible to set up a decent graphical interface at almost no cost.

My experience, however, is that companies want a frontend integrated with their own systems. Also, we tend to use search for more than the standard purposes. Search-driven intranets, where you build intranet sites based on search results, are an example of search being used in a different manner.

A concept that we have introduced at Findwise is search as a service. It means that the search engine is a stand-alone product with APIs that make it easy to send data to it and extract data from it. We have created our own APIs around the GSA to make this possible. An easy way to extract data based on filtering is essential.

What I would like to see in the GSA is easier integration for performing searches, such as a REST or SOAP service that makes it simple to build search clients. This would make it easier to integrate functionality, such as security, externally: basically, you tell the client who the current user is, and the client handles the rest. It would also improve maintainability, in the sense that new and changed functionality would not require a new implementation of how to parse the XML response.
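As a rough illustration of the idea, here is a minimal sketch of such a service layer: a thin HTTP endpoint that forwards a query to the GSA's standard XML search interface and returns the results as JSON, so client applications never touch the XML themselves. The appliance hostname, frontend and collection names are placeholders, and Flask and requests are my own implementation choices for the sketch:

```python
import requests
import xml.etree.ElementTree as ET
from flask import Flask, jsonify, request

app = Flask(__name__)

# Placeholders: point these at your own appliance, frontend and collection.
GSA_HOST = "http://gsa.example.com"
GSA_PARAMS = {
    "client": "default_frontend",
    "site": "default_collection",
    "output": "xml_no_dtd",
}

@app.route("/search")
def search():
    # Forward the user's query to the GSA's XML search protocol.
    params = dict(GSA_PARAMS, q=request.args.get("q", ""))
    xml_text = requests.get(GSA_HOST + "/search", params=params).text

    # Each <R> element in the response holds a result: <U> url, <T> title,
    # <S> snippet. Flatten them into JSON for the calling client.
    root = ET.fromstring(xml_text)
    results = [
        {"url": r.findtext("U"), "title": r.findtext("T"),
         "snippet": r.findtext("S")}
        for r in root.iter("R")
    ]
    return jsonify(results=results)

if __name__ == "__main__":
    app.run(port=5000)
```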

I would also like to see a bigger focus on documenting how to use functionality such as the previews and translation externally.

Final words

My feeling is that the GSA is getting stronger, and I like the new features in GSA 7.0. Google has made it clear that they are continuously aiming to improve their product, and I am looking forward to future releases. I hope the GSA will take a step closer to the search-as-a-service concept; the addition of a processing framework would enhance it even further. The future will tell.