Real Time Search in the Enterprise

Real time search is a big buzz on the global network called the Internet. Major search engines like Google and Bing now provide users with real time search results from Facebook, Twitter, blogs and other social media sites. Real time search means that as soon as content is created or updated, it is immediately searchable. This might seem obvious, like a basic requirement, but if you work with search you know that this is not the case most of the time. Looking inside the firewall, in the enterprise, I dare say that real time search is far from common. Sometimes content is not changed very frequently, so it is not necessary to make it instantly searchable. In many cases, though, it is the technical architecture that limits a real time search implementation.

The most common way of indexing content is by using a web crawler or a connector. Either way, you schedule them to go out and fetch new/updated/deleted content at specific intervals during the day. This is the basic architecture for search platforms these days. The advantage of this approach is that the content systems do not need to adapt to the search platform; they just deliver content through their ordinary APIs during indexing. The drawback is that new or updated content is not available until the next scheduled indexing run. Depending on the system, this might take several hours. For several reasons, mostly performance, you do not want to schedule connectors or web crawlers to fetch content too often. Instead, to provide real time search you have to do it the other way around: let the content system push content to the search platform.
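
To make the pull model concrete, here is a minimal sketch of a scheduled connector run, assuming a hypothetical content system REST endpoint and a hypothetical indexing endpoint on the search platform (both URLs, the parameter name and the schedule are made up for illustration):

```python
import time
import requests

CONTENT_API = "http://cms.example.internal/api/documents"     # hypothetical content system endpoint
INDEX_API = "http://search.example.internal/index/documents"  # hypothetical search platform endpoint

def pull_and_index(since: str) -> str:
    """Fetch everything changed since the last run and push it to the index."""
    response = requests.get(CONTENT_API, params={"changed_since": since})
    response.raise_for_status()
    for doc in response.json():
        requests.post(INDEX_API, json=doc).raise_for_status()
    # Remember when this run happened so the next run only picks up newer changes.
    return time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())

if __name__ == "__main__":
    last_run = "1970-01-01T00:00:00Z"
    while True:
        last_run = pull_and_index(last_run)
        time.sleep(4 * 60 * 60)  # content created in the meantime is not searchable until the next run
```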

Most systems have some sort of event system that triggers an event when content is created/updated/deleted. By listening for these events, the system can send the content to the search platform at the same time it is stored in the content system. The search platform can immediately index the pushed content and make it searchable. This requires adapting the content system to the search platform, but in this case I think the advantages outweigh the disadvantages. Modern content systems provide (or should provide) a plug-in architecture, so it should be fairly easy to plug in this kind of code. These plug-ins could also be provided by the search platform vendors, just as ordinary connectors are provided today.
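
As a rough sketch of what such a plug-in could look like, assuming the content system lets you register handlers for save/delete events and that the search platform accepts documents over a simple HTTP API (the event fields, handler names and URL are assumptions, not any particular product's API):

```python
import requests

INDEX_API = "http://search.example.internal/index/documents"  # hypothetical indexing endpoint

def on_content_saved(event):
    """Handler registered with the content system's (hypothetical) plug-in API.
    Pushes the document to the search platform as soon as it is stored."""
    document = {
        "id": event["content_id"],
        "title": event["title"],
        "body": event["body"],
        "modified": event["modified"],
    }
    requests.post(INDEX_API, json=document).raise_for_status()

def on_content_deleted(event):
    """Remove the document from the index the moment it is deleted in the content system."""
    requests.delete(f"{INDEX_API}/{event['content_id']}").raise_for_status()
```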

Do you agree, or have I been living in a cave for the past years? I’d love to hear your comments on this subject!

To Crawl or Not to Crawl in Enterprise Search

Having an Enterprise Search Engine, there are basically two ways of getting content into the index: using a web crawler or a connector. Both methods have their advantages and disadvantages. In this post I’ll try to pinpoint the differences between the two methods.

Web crawler

Most systems today have a web interface. Be it your time reporting system, your intranet or your document management system, you probably access them with your web browser. Because of this, it is very easy to use a web crawler to index that content as well.

The web crawler indexes pages by starting at a single page. From there, it follows all outbound links and indexes those pages; from those pages, it follows all links, and so on. This process continues until all links on the web site have been followed and the pages indexed. The crawler thus uses the same technique as a human: visiting a page and clicking the links.
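
Stripped down to its core, a crawler is little more than a queue of URLs and a loop. The sketch below illustrates the idea, ignoring robots.txt, politeness delays and error handling; the index_page callback stands in for whatever the search engine does with a fetched page:

```python
from collections import deque
from urllib.parse import urljoin, urlparse
from html.parser import HTMLParser
import requests

class LinkExtractor(HTMLParser):
    """Collects the href attribute of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(value for name, value in attrs if name == "href" and value)

def crawl(start_url, index_page):
    """Breadth-first crawl: index each page, then follow its outbound links."""
    site = urlparse(start_url).netloc
    queue, seen = deque([start_url]), {start_url}
    while queue:
        url = queue.popleft()
        page = requests.get(url)
        index_page(url, page.text)      # hand the page over to the search engine
        extractor = LinkExtractor()
        extractor.feed(page.text)
        for link in extractor.links:
            absolute = urljoin(url, link)
            if urlparse(absolute).netloc == site and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
```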

Most Enterprise Search Engines are bundled with a web crawler, so it is usually very easy to get started. Just enter a start page and within minutes you will have searchable content in your index. No extra installation or license fee is required. For some sources this may also be the only option, e.g. if you are indexing external sources that your company has no control over.

The main disadvantage, though, is that web pages are designed for humans, not crawlers. This means there is a lot of extra information for presentation purposes, such as navigation menus, sticky information messages, headers and footers. All of this makes the page a more pleasant experience for the user and easier to navigate. The crawler, on the other hand, has no use for this information when retrieving pages; it actually reduces the information quality in the index. For example, a navigation menu is displayed on every page, so the crawler will index the navigation content for every page. If you have a navigation item called “Customers” and a user searches for customers, he or she will get a hit on ALL pages in the index.

There are ways to get around this, but they require either altering the produced HTML or making adjustments in the search engine. And if the design of the site changes, you have to make these adjustments again.
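
One common workaround is to strip the presentation markup before the page reaches the index. The sketch below assumes the site wraps its navigation, header and footer in identifiable tags (here <nav>, <header> and <footer>, which of course varies from site to site):

```python
from html.parser import HTMLParser

class ContentExtractor(HTMLParser):
    """Keeps only text that is outside <nav>, <header> and <footer> elements."""
    SKIP = {"nav", "header", "footer", "script", "style"}

    def __init__(self):
        super().__init__()
        self.depth = 0   # how many skipped elements we are currently inside
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.text.append(data.strip())

def extract_content(html):
    """Return the page text with navigation, header and footer removed."""
    extractor = ContentExtractor()
    extractor.feed(html)
    return " ".join(extractor.text)
```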

Connector

Even though the majority of systems have a web interface, the content is stored in a data source of some kind. It might be a database, a structured file system, etc. With a connector, you connect either to the underlying data source or to the system directly through its programming API.

Using a connector, the search engine does not get any presentation information, only the pure content, which makes the information quality in the index better. The connector can also retrieve all metadata associated with the information, which further increases the quality. You will often also have more fine-grained control over what gets indexed with a connector than with a web crawler.
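
A connector can be as simple as a query against the underlying database, mapping each row and its metadata to an index document. The sketch below assumes a SQLite database with a hypothetical documents table and a hypothetical HTTP indexing endpoint:

```python
import sqlite3
import requests

INDEX_API = "http://search.example.internal/index/documents"  # hypothetical indexing endpoint

def index_documents(db_path):
    """Read content and metadata straight from the source database and push it to the index."""
    connection = sqlite3.connect(db_path)
    rows = connection.execute(
        "SELECT id, title, body, author, department, modified FROM documents"
    )
    for id_, title, body, author, department, modified in rows:
        document = {
            "id": id_,
            "title": title,
            "body": body,             # pure content, no navigation or other presentation markup
            "author": author,         # metadata a crawler would rarely see
            "department": department,
            "modified": modified,
        }
        requests.post(INDEX_API, json=document).raise_for_status()
    connection.close()
```

Metadata fields such as author and department can then be used for filtering and faceted navigation, something a crawler rarely gets for free.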

Using a connector, however, requires more configuration. It might also cost extra money to buy one for your system, and it may require additional hardware. Once set up, though, it will most likely produce more relevant results than a web crawler.

The bottom line is that it is a trade-off between quality and cost, like most decisions in life 🙂

High Expectations to Googlify the Company = Findability Problem?

It is not a coincidence that the verb “to google” has been added to several renowned dictionaries, such as those from Oxford and Merriam-Webster. Search has been the de facto gateway to the Web for some years now. But when employees turn to Google on the Web to find information about the company they work for, your alarm bells should be ringing. Do you have a Findability problem within the firewall?

The Google Effect on User Expectations

“Give us something like Google or better.”

“Compared to Google, our Intranet search is almost unusable.”

“Most of the time it is easier to find enterprise information by using Google.”

The quotes above come from a study Findwise conducted during 2008-2009 for a customer who was on the verge of taking the first steps towards a real Enterprise Search application. The old Intranet search tool had become obsolete, providing access to only a limited set of information sources and ranking outdated information above the relevant documents that were in fact available. In short, search was causing frustration, and lots of it.

However, the executives at this company were wise enough to act on the problem. The goal was set pretty high: Everybody should be able to find the corporate information they need faster and more accurately than before. To accomplish this, an extensive Enterprise Search project was launched.

This is where the contradiction comes into play. Today users are so accustomed to using search as the main gateway to the Web that the look and feel of Google is often seen as equivalent to the kind of information access solution needed behind the firewall as well. The reasons are obvious: on the Web, Google is fast and it is relevant. But can you, and more importantly should you, adopt a solution from the Web within the firewall without question?

Enterprise Search and Web Search are different

  1. Within the firewall, information is stored in various proprietary information systems, databases and applications, on various file shares, in a myriad of formats and with sophisticated security and version control issues to take into account. On the Web, what your web crawler can find is what it indexes.
  2. Within the firewall, you know every single logged in user, the main information access needs she has, the people she knows, the projects she is taking part in and the documents she has written. On the Web, you have less precise knowledge about the context the user is in.
  3. Within the firewall, you have fewer links and other clear inter-document dependencies that you can use for ranking search results. On the Web, everything is linked together, providing an excellent starting point for algorithms such as Google’s PageRank.

Clearly, the settings differ as do user needs. Therefore, the internal search application will be different from a search service on the web; at least if you want it to really work as intended.

Start by Setting up a Findability Strategy

When you know where you are and where you want to be in terms of Findability—i.e. when you have a Findability strategy—you can design and implement your search solution using the search platform that best fits the needs of your company. It might well be Google’s Search Appliance. Just do not forget, the GSA is a totally different beast compared to the Google your users are accustomed to on the Web!

References

http://en.wikipedia.org/wiki/Googling