Real Time Search in the Enterprise

Real time search is creating a big buzz on the Internet. Major search engines like Google and Bing now provide users with real time search results from Facebook, Twitter, blogs and other social media sites. Real time search means that as soon as content is created or updated, it is immediately searchable. This might seem obvious, like a basic requirement, but if you work with search you know that this is not the case most of the time. Looking inside the firewall, in the enterprise, I dare say that real time search is far from common. Sometimes content does not change very frequently, so it is not necessary to make it instantly searchable. In many cases, though, it’s the technical architecture that prevents a real time search implementation.

The most common way of indexing content is by using a web crawler or a connector. Either way, you schedule them to go out and fetch new, updated and deleted content at specific intervals during the day. This is the basic architecture for search platforms these days. The advantage of this approach is that the content systems do not need to adapt to the search platform; they just deliver content through their ordinary APIs during indexing. The drawback is that new or updated content is not searchable until the next scheduled indexing run. Depending on the system, this might take several hours. For several reasons, mostly performance, you do not want to schedule connectors or web crawlers to fetch content too often. Instead, to provide real time search you have to do it the other way around: let the content system push content to the search platform.
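To make the drawback of the scheduled approach concrete, here is a minimal sketch in Python. Everything in it is a hypothetical stand-in rather than any real product's API: the ContentSystem, SearchIndex and PullConnector classes, the changed_since method, and the simulated clock passed in as `now`.

```python
class ContentSystem:
    """Hypothetical content system with a 'what changed since X' API."""

    def __init__(self):
        self.docs = {}  # doc_id -> (modified_at, text)

    def save(self, doc_id, text, now):
        self.docs[doc_id] = (now, text)

    def changed_since(self, timestamp):
        # >= so a document saved at the exact moment of the last run is not
        # missed; re-indexing the same document twice is harmless.
        return {doc_id: text
                for doc_id, (modified_at, text) in self.docs.items()
                if modified_at >= timestamp}


class SearchIndex:
    """Hypothetical, deliberately naive in-memory search index."""

    def __init__(self):
        self.docs = {}

    def index(self, doc_id, text):
        self.docs[doc_id] = text

    def search(self, term):
        return sorted(doc_id for doc_id, text in self.docs.items()
                      if term.lower() in text.lower())


class PullConnector:
    """Scheduled connector: fetches whatever changed since its last run."""

    def __init__(self, source, index):
        self.source = source
        self.index = index
        self.last_run = float("-inf")  # first run fetches everything

    def run(self, now):
        changed = self.source.changed_since(self.last_run)
        for doc_id, text in changed.items():
            self.index.index(doc_id, text)
        self.last_run = now
        return len(changed)
```

If a document is saved at time 10 and the connector is scheduled to run at time 60, a search at any point in between returns nothing for it. In this model the worst-case delay equals the scheduling interval, which is why shortening the interval is the only way to get fresher results with a pull design.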

Most systems have some sort of event system that triggers an event when content is created, updated or deleted. By listening for these events, the content system can send the content to the search platform at the same time as it’s stored in the content system. The search platform can immediately index the pushed content and make it searchable. This requires adaptation of the content system towards the search platform. In this case, though, I think the advantages outweigh the disadvantages. Modern content systems provide (or should provide) a plug-in architecture, so you should be able to plug in this kind of code fairly easily. These plug-ins could also be provided by the search platform vendors, just as ordinary connectors are provided today.
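The push approach described above could be sketched roughly like this in Python. It is an illustration under assumptions, not a real integration: EventBus, ContentSystem, SearchIndex, install_search_plugin and the event names "created", "updated" and "deleted" are all made up for the example, and a real content system would have its own event and plug-in API.

```python
from collections import defaultdict


class EventBus:
    """Hypothetical event system inside the content system."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._handlers[event_type].append(handler)

    def publish(self, event_type, payload):
        for handler in self._handlers[event_type]:
            handler(payload)


class SearchIndex:
    """Hypothetical, deliberately naive in-memory search index."""

    def __init__(self):
        self.docs = {}

    def index(self, doc_id, text):
        self.docs[doc_id] = text

    def delete(self, doc_id):
        self.docs.pop(doc_id, None)

    def search(self, term):
        return sorted(doc_id for doc_id, text in self.docs.items()
                      if term.lower() in text.lower())


class ContentSystem:
    """Hypothetical content system that fires an event on every change."""

    def __init__(self, bus):
        self.bus = bus
        self.docs = {}

    def save(self, doc_id, text):
        event_type = "updated" if doc_id in self.docs else "created"
        self.docs[doc_id] = text
        self.bus.publish(event_type, {"id": doc_id, "text": text})

    def delete(self, doc_id):
        if self.docs.pop(doc_id, None) is not None:
            self.bus.publish("deleted", {"id": doc_id})


def install_search_plugin(bus, index):
    """The 'plug-in': forwards content events to the search platform."""

    def on_upsert(payload):
        index.index(payload["id"], payload["text"])

    def on_delete(payload):
        index.delete(payload["id"])

    bus.subscribe("created", on_upsert)
    bus.subscribe("updated", on_upsert)
    bus.subscribe("deleted", on_delete)
```

Once install_search_plugin has been called, a document saved through ContentSystem.save is searchable as soon as the call returns, and a deleted document disappears from the results just as quickly. In a real setup the handlers would make a network call to the search platform, so they would also need to handle failures and retries, which this sketch leaves out.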

Do you agree, or have I been living in a cave for the past few years? I’d love to hear your comments on this subject!

3 thoughts on “Real Time Search in the Enterprise”

  1. If you are really unlucky and/or have a lot of content that is slow to fetch, the time to index can be days or even weeks in the worst case.

    So, the solution is to change from ‘pull’ connectors and crawlers to push connectors, a.k.a. ‘index as a service’ connectors, which receive index requests from external systems instead of going out and searching for content themselves.

    ‘Index as a service’ has already been successfully developed and installed for a number of our customers, and we are in the process of delivering it to even more as we speak.

    And I fully agree with you, Tobias, that this is the future if you want fresh and controlled content in the index.

  2. The “pull” approach is much simpler to implement because you have control over how to collect data, especially the more heterogeneous your sources of information are. But “push” is much simpler to maintain over time, as you aren’t putting as much load on systems all at once. I think, though, that “Real Time” is push taken to the Xth degree, because what you are saying is not just that content is updated as it changes, but that those changes are also published immediately, not just on a timed basis. Many scaling approaches require bundling up many updates and doing a single publish of all those changes. In Real Time search, by contrast, if I add data I should be able to search for it immediately, not in 5 or 10 minutes when the data gets replicated across.

  3. Pingback: Information Flow in VGR
