To Crawl or Not to Crawl in Enterprise Search

With an Enterprise Search Engine, there are basically two ways of getting content into the index: using a web crawler or using a connector. Both methods have their advantages and disadvantages. In this post I’ll try to pinpoint the differences between the two methods.

Web crawler

Most systems today have a web interface. Whether it’s your time reporting system, intranet or document management system, you’ll probably access it with your web browser. Because of this, it’s very easy to use a web crawler to index this content as well.

The web crawler indexes the pages by starting at one page. From there, it follows all outbound links and indexes the pages they point to. From those pages, it again follows all links, and so on. This process continues until all links on the web site have been followed and all pages have been indexed. The crawler thus uses the same technique as a human: visiting a page and clicking the links.
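To make the link-following process concrete, here is a minimal sketch in Python. It is not how any particular search engine implements its crawler. The `crawl` function, the `fetch` callback and the three-page `intranet.example` site are all made up for illustration, and the pages are kept in memory instead of being downloaded over HTTP.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(start_url, fetch):
    """Breadth-first crawl of one site; returns {url: html} for every page reached.

    `fetch` is a hypothetical callback that returns the HTML for a URL,
    or None if the page cannot be retrieved.
    """
    site = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    index = {}
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # broken link: nothing to index
            continue
        index[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            target = urljoin(url, href)  # resolve relative links
            # Stay on the starting site and never visit a page twice.
            if urlparse(target).netloc == site and target not in seen:
                seen.add(target)
                queue.append(target)
    return index


# Hypothetical three-page intranet, held in memory for the example.
PAGES = {
    "http://intranet.example/": '<a href="/customers">Customers</a> <a href="/reports">Reports</a>',
    "http://intranet.example/customers": '<a href="/">Home</a> <a href="http://other.example/">External</a>',
    "http://intranet.example/reports": '<a href="/customers">Customers</a>',
}

indexed = crawl("http://intranet.example/", PAGES.get)
print(sorted(indexed))
```

The `seen` set is what keeps the crawler from looping forever when pages link back to each other, and the host check is what stops it from wandering off to external sites.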

Most Enterprise Search Engines are bundled with a web crawler. Thus, it’s usually very easy to get started. Just enter a start page and within minutes you’ll have searchable content in your index. No extra installation or license fee is required. For some sources, this may also be the only option, e.g. if you’re indexing external sources that your company has no control over.

The main disadvantage, though, is that web pages are designed for humans, not crawlers. This means that there is a lot of extra information for presentation purposes, such as navigation menus, sticky information messages, headers, footers and so on. All of this makes for a more pleasant experience for the user and makes the page easier to navigate. The crawler, on the other hand, has no use for this information when retrieving pages. It actually reduces the information quality in the index. For example, a navigation menu is displayed on every page, so the crawler will index the navigation content for all pages. So if you have a navigation item called “Customers” and a user searches for customers, he/she will get a hit on ALL pages in the index.

There are ways to get around this, but they require either altering the produced HTML or making adjustments in the search engine. Also, if the design of the site changes, you have to make these adjustments again.
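As an illustration of what such an adjustment could look like, the sketch below drops everything inside navigation, header and footer elements before the text goes into the index. It assumes the site marks up those areas with `<nav>`, `<header>` and `<footer>` tags, which is far from guaranteed. The `ContentExtractor` class and the sample page are invented for the example.

```python
from html.parser import HTMLParser


class ContentExtractor(HTMLParser):
    """Keeps page text but ignores anything inside presentation-only elements."""

    # Assumption: the site uses these tags for its menus, headers and footers.
    SKIP_TAGS = {"nav", "header", "footer"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while we are inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_content(html):
    """Returns the indexable text of a page, without navigation and footer text."""
    parser = ContentExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical page with a navigation menu and a footer around the real content.
page = """
<html><body>
<nav><a href="/customers">Customers</a> <a href="/reports">Reports</a></nav>
<main><h1>Time report for March</h1><p>All hours have been approved.</p></main>
<footer>Contact the intranet team</footer>
</body></html>
"""

text = extract_content(page)
print(text)
```

With this filter in place, a search for “Customers” no longer matches the page just because of its menu. The weakness is also visible: if a redesign moves the menu into a plain `<div>`, `SKIP_TAGS` has to be updated.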

Connector

Even though the majority of systems have a web interface, the content is stored in a data source of some kind. It might be a database, a structured file system, etc. By using a connector, you connect either to the underlying data source or directly to the system through its programming API.

Using a connector, the search engine does not get any presentation information, only the pure content, which makes the information quality in the index better. The connector can also retrieve all metadata associated with the information, which further increases the quality. Often, you’ll also have more fine-grained control over what will be indexed with a connector than with a web crawler.
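A very small connector against a database could look something like the sketch below. Everything in it is an assumption made for the example: the `documents` table, its columns, the `published` flag and the shape of the dictionaries handed to the index. A real connector would also need to deal with things like permissions and incremental updates.

```python
import sqlite3


def fetch_documents(conn):
    """Yields one index document per published row, content and metadata separated."""
    rows = conn.execute(
        "SELECT id, title, body, author, modified FROM documents WHERE published = 1"
    )
    for row in rows:
        yield {
            "id": f"doc-{row['id']}",
            "content": row["body"],  # pure content, no menus or footers
            "metadata": {
                "title": row["title"],
                "author": row["author"],
                "modified": row["modified"],
            },
        }


# Hypothetical document management database, created in memory for the example.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # lets us read columns by name
conn.execute(
    "CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT, body TEXT, "
    "author TEXT, modified TEXT, published INTEGER)"
)
conn.executemany(
    "INSERT INTO documents (title, body, author, modified, published) VALUES (?, ?, ?, ?, ?)",
    [
        ("Customer policy", "How we handle customer requests.", "Anna", "2011-03-01", 1),
        ("Draft notes", "Not ready for anyone to read.", "Erik", "2011-03-02", 0),
    ],
)

docs = list(fetch_documents(conn))
print(docs)
```

The `WHERE published = 1` clause is an example of the fine-grained control mentioned above: the draft never reaches the index, and the author and modification date arrive as proper metadata fields instead of being buried in page text.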

Using a connector requires more configuration, though. It might also cost some extra money to buy one for your system, and it may require additional hardware. Once set up, however, it’s likely to produce more relevant results than a web crawler.

The bottom line is that it’s a trade-off between quality and cost, like most decisions in life 🙂