Data source analysis is one of the crucial parts of an enterprise search deployment project. The quality of search results depends strongly on the quality of the indexed data. For web-based sources, there are two basic ways of reaching the data: internal and external. The internal method reads the data directly from where it is stored, such as a database, files on a filesystem, or an API. Depending on requirements, either all documents are read or only those matching some criteria. The external technique reads the rendered HTML over HTTP, the same way human users do. Further documents are reached (so-called content discovery) by following hyperlinks present in the content or by reading a sitemap. This method is called web crawling.
Crawling, in contrast to reading a source directly, does not require much preparation. In the minimal variant, a starting URL is all that is needed. Content encoding is detected automatically, and off-the-shelf components extract text from the HTML. Web crawling may therefore appear to be a quick and easy way to collect content for indexing. On closer analysis, however, it turns out to have several serious drawbacks.
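To make content discovery concrete, here is a minimal sketch of link-following in Python, using only the standard library. It is an illustration, not how any particular crawler product works: the `discover` function, the `LinkExtractor` class and the three-page `SITE` dictionary are all hypothetical, and the "fetch" step is a dictionary lookup standing in for a real HTTP request.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects href values of <a> tags from an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def discover(start_url, fetch):
    """Breadth-first content discovery starting from start_url.

    fetch is any callable mapping a URL to HTML text, or None if unavailable.
    Returns the URLs of fetched pages in the order they were visited.
    """
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    visited = []
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop #fragments before comparing.
            absolute, _fragment = urldefrag(urljoin(url, href))
            # Stay on the starting host and never queue a URL twice.
            if urlparse(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited


# Hypothetical three-page site held in memory instead of real HTTP.
SITE = {
    "https://example.com/": '<a href="/a">A</a> <a href="/b#top">B</a>',
    "https://example.com/a": '<a href="/b">B</a> <a href="https://other.example/x">X</a>',
    "https://example.com/b": '<a href="/">Home</a>',
}

pages = discover("https://example.com/", SITE.get)
```

Even this toy version has to parse every page as soon as it is fetched, resolve relative links, strip fragments and remember what it has already seen. A production crawler adds politeness delays, robots.txt handling, retries and sitemap parsing on top.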
We don’t feel like crawling…
Firstly, web crawling is slow. Content is fetched via HTTP and must be parsed immediately in order to extract links to further documents. Many websites embed images as base64-encoded binaries or SVG tags directly in HTML or CSS files to speed up browsing. For a web crawler, however, this means transferring unnecessary data, which slows it down. The CMS has to render a fully fledged page with headers, footers and interactive elements that the crawler simply ignores. Indexing speed matters when a low publish-to-index delay is required. With web crawling, that delay rises from minutes to hours, depending on the particular deployment. Rendering the unnecessary parts also puts extra load on the website, which can slow down the browsing experience of actual customers. This load generates costs but brings no direct revenue.
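As a rough way to see how much of a page is embedded binary data that a crawler downloads but never indexes, the sketch below measures the share of a document taken up by base64 data URIs. The function name `inline_payload_ratio`, the regular expression and the sample page are assumptions made for this illustration; SVG markup and externally linked resources are not counted.

```python
import base64
import re

# Matches base64 data URIs such as data:image/png;base64,iVBOR...
DATA_URI = re.compile(r"data:[\w.+-]+/[\w.+-]+;base64,[A-Za-z0-9+/=]+")


def inline_payload_ratio(html):
    """Return the fraction of the document's characters taken up by base64 data URIs."""
    if not html:
        return 0.0
    embedded = sum(len(match) for match in DATA_URI.findall(html))
    return embedded / len(html)


# Hypothetical page: a short paragraph plus a 3000-byte inline "image".
fake_image = base64.b64encode(b"\x00" * 3000).decode("ascii")
page = (
    "<html><body><p>Quarterly report summary.</p>"
    f'<img src="data:image/png;base64,{fake_image}">'
    "</body></html>"
)

ratio = inline_payload_ratio(page)
```

In this made-up page almost all of the transferred characters belong to the inline image, while the text worth indexing is a single sentence.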
Secondly, modern websites rely heavily on content that is generated or fetched dynamically and assembled on the client side. Web crawlers lag far behind web browsers in supporting cutting-edge dynamic techniques, especially those that depend on browser-specific hacks and tricks. This increases the risk of missing parts of the content and increases the effort needed to process and analyze it.
Another factor that increases both the risk and the effort is that HTML is focused mostly on presentation. Multiple templates, or even multiple CMSs, commonly serve a single website that appears as a consistent unit to a human but looks completely different to a web crawler. Extracting content properly from a website is hard: besides the previously mentioned risk of missing part of the content, there is the opposite risk of indexing unnecessary content. Extracting metadata from sources other than HTML meta tags is risky as well. Because crawling is tightly coupled to the presentation layer, it is fragile in the face of presentation changes. Even apparently innocent template changes may break the pipeline that processes crawled content. For all these reasons, configuring a web crawling solution properly may require multiple trial-and-error iterations and thus take more effort than preparing an internal data source properly.
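The fragility described above can be shown with a small sketch. The extractor below is a hypothetical, deliberately naive example built on Python's standard `html.parser`: it collects text from `<div>` elements carrying an assumed CSS class, `article-body`. The class names and both templates are invented for the illustration.

```python
from html.parser import HTMLParser


class BodyTextExtractor(HTMLParser):
    """Collects text inside <div> elements carrying a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.depth = 0  # nesting depth of <div> tags inside the target
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth > 0:
            self.depth += 1  # nested <div> inside the target
        elif self.css_class in classes:
            self.depth = 1  # entering the target element

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())


def extract_body(html, css_class="article-body"):
    parser = BodyTextExtractor(css_class)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# The same article rendered by two versions of a hypothetical template.
old_template = (
    '<div class="header">Menu</div>'
    '<div class="article-body"><p>Annual report</p></div>'
    '<div class="footer">Contact</div>'
)
new_template = (
    '<div class="header">Menu</div>'
    '<div class="content-main"><p>Annual report</p></div>'
    '<div class="footer">Contact</div>'
)

before = extract_body(old_template)
after = extract_body(new_template)
```

With the old template the extractor returns the article text and skips the header and footer. After the class is renamed, it silently returns an empty string: nothing fails loudly, the document simply reaches the index without its content.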
…but sometimes it makes sense
On the other hand, web crawling has some niches where it may be used despite its drawbacks. Getting access to an internal data store can take a lot of time when requests have to be processed by corporate security departments. When a quick or temporary PoC or MVP deployment is needed, for example to let the UI team do their work, a website can be crawled just to provide real data from the organization. This is a lot better than indexing fake data. Crawling may also be used as part of a hybrid method. In some environments, a content inventory is easier to fetch than the actual content. In such cases, the inventory is used to obtain metadata and to generate URLs to the actual content, which is then fetched with a crawler. This mitigates the content discovery problem and avoids extracting URLs from the website content. The last, but not least, scenario is when there is simply no access to a particular internal data source. In such cases, web crawling is the last resort.
The efficiency of an enterprise search solution is closely bound to the data in the index. Indexing website-published data by fetching content and metadata directly from their sources is the preferred solution. However, there are use cases in which web crawling is the best-justified solution from a business or technical point of view. The Findwise i3 platform includes an embedded web crawler among its many data connectors. This makes it possible to unify data obtained with different techniques, or to migrate from one method to another as a drop-in replacement. An i3 enterprise search deployment project can be started quickly with a crawler and then extended with other data sources when needed.
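The hybrid method can be sketched as follows. This is only an illustration under stated assumptions: the `InventoryItem` fields, the `build_fetch_list` helper and the intranet URL are hypothetical, and a real inventory would come from a CMS export or database rather than a hard-coded list.

```python
from dataclasses import dataclass
from urllib.parse import quote, urljoin


@dataclass
class InventoryItem:
    """One row of a hypothetical content inventory export."""
    doc_id: str
    title: str
    path: str  # site-relative path of the published page


def build_fetch_list(base_url, inventory):
    """Pair each inventory item's metadata with the absolute URL to fetch."""
    fetch_list = []
    for item in inventory:
        # Percent-encode the path (spaces etc.) and resolve it against the site root.
        url = urljoin(base_url, quote(item.path))
        metadata = {"id": item.doc_id, "title": item.title}
        fetch_list.append((url, metadata))
    return fetch_list


inventory = [
    InventoryItem("42", "Travel policy", "/policies/travel policy.html"),
    InventoryItem("43", "Expense guide", "/guides/expenses.html"),
]

fetch_list = build_fetch_list("https://intranet.example.com/", inventory)
```

The crawler then only has to fetch the listed URLs. Metadata comes from the inventory rather than from the HTML, and no links need to be extracted from page content.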