Web crawling is the last resort

Data source analysis is one of the crucial parts of an enterprise search deployment project. The quality of search engine results strongly depends on the quality of the indexed data. In the case of web-based sources, there are two basic ways of reaching the data: internal and external. The internal method involves reading the data directly from where it is stored, such as a database, files on a filesystem, or an API. Depending on requirements, either all documents are read or only those matching certain criteria. The external technique relies on reading the rendered HTML content via HTTP, the same way human users read it. Reaching further documents (so-called content discovery) is achieved by following hyperlinks present in the content or by using a sitemap. This method is called web crawling.
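To make the distinction concrete, here is a minimal sketch of the two access paths. The database schema, field names and start URL are assumptions made purely for illustration, not part of any particular deployment.

```python
# Hypothetical sketch contrasting internal and external access; the table,
# columns and URLs are illustrative assumptions, not a real system.
import sqlite3
import requests

def read_internal(db_path="documents.db"):
    """Internal method: read documents straight from their storage place."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute("SELECT id, title, body FROM documents WHERE published = 1")
    return [{"id": r[0], "title": r[1], "body": r[2]} for r in rows]

def read_external(start_url="https://example.com/"):
    """External method: fetch the rendered HTML over HTTP, as a browser would."""
    response = requests.get(start_url, timeout=10)
    return response.text  # further documents are discovered via links or a sitemap
```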

Crawling, in contrast to direct source reading, does not require any particular preparation. In the minimal variant, just a starting URL is required. Content encoding is detected automatically, and off-the-shelf components extract text from the HTML. Web crawling may therefore appear to be a quick and easy way to collect content for indexing, but on closer analysis it turns out to have multiple serious drawbacks.
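As a rough illustration of how little is needed, here is a minimal crawl sketch, assuming requests and BeautifulSoup as the off-the-shelf components; the page limit and the same-domain restriction are arbitrary choices for the example.

```python
# A minimal crawl sketch, assuming only a starting URL.
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl(start_url, max_pages=50):
    domain = urlparse(start_url).netloc
    queue, seen, documents = deque([start_url]), {start_url}, {}
    while queue and len(documents) < max_pages:
        url = queue.popleft()
        response = requests.get(url, timeout=10)
        response.encoding = response.apparent_encoding   # automatic encoding detection
        soup = BeautifulSoup(response.text, "html.parser")
        documents[url] = soup.get_text(separator=" ", strip=True)  # extract plain text
        for link in soup.find_all("a", href=True):        # content discovery via hyperlinks
            target = urljoin(url, link["href"])
            if urlparse(target).netloc == domain and target not in seen:
                seen.add(target)
                queue.append(target)
    return documents
```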


Information Flow in VGR

Last week Kristian Norling from VGR (Västra Götaland Regional Council) posted a really interesting and important blog post about information flow. For those of you who don't know what VGR has been up to previously, here is a short background.

For a number of years VGR has been working to realize a model for how information is created, managed, stored, distributed and, perhaps most importantly, integrated.

Information flow in VGR

Why is Information Flow Important?

In order to give your users access to the right information, it is essential to get control of the whole information flow, i.e. from the time it is created until it reaches the end user. Without this knowledge, it is almost impossible to ensure quality and accuracy.

The fact that we have control also gives us endless possibilities when it comes to distributing the right information at the right time (an old cliché that is finally becoming reality). To sum up: that is what search is all about!

When information is created, VGR uses a Metadata service that helps editors tag their content by giving keyword suggestions.
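The post does not describe how the Metadata service works internally, but purely as an illustration of the idea, a keyword-suggestion step could look something like this sketch, where the controlled vocabulary and the matching logic are made-up assumptions:

```python
# Illustrative keyword suggestion against a controlled vocabulary;
# the vocabulary and matching rules are assumptions, not VGR's service.
CONTROLLED_VOCABULARY = {
    "meeting minutes": ["protocol", "minutes", "agenda"],
    "healthcare": ["patient", "hospital", "care"],
    "finance": ["budget", "invoice", "cost"],
}

def suggest_keywords(text):
    """Return vocabulary terms whose trigger words appear in the text."""
    lowered = text.lower()
    return [term for term, triggers in CONTROLLED_VOCABULARY.items()
            if any(trigger in lowered for trigger in triggers)]

# suggest_keywords("Draft budget and invoice routines for the hospital")
# -> ["healthcare", "finance"]
```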

In practice this means that the information can be distributed the way it is intended. News items, for example, are tagged with subject, target group and organizational info (apart from dates, author, expiry date etc., which are automated), meaning that people belonging to specific groups with certain roles will get the news that is important to them.

Once the information is tagged correctly and published, it is indexed by the search engine. This is done in a number of different ways: by HTML crawling, through RSS, by feeding the search engine, or through direct indexing.
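As one illustration of these paths, a minimal sketch of the RSS route might look like the following; the feed URL and the index_document() helper are hypothetical stand-ins, not VGR's actual integration:

```python
# Minimal sketch of the RSS indexing path; feed URL and index_document()
# are illustrative assumptions.
import feedparser

def index_document(doc):
    print("indexing", doc["url"])  # stand-in for a real search engine call

def index_rss_feed(feed_url="https://example.org/news/rss"):
    feed = feedparser.parse(feed_url)
    for entry in feed.entries:
        index_document({
            "url": entry.get("link"),
            "title": entry.get("title"),
            "summary": entry.get("summary", ""),
            "published": entry.get("published", ""),
        })
```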

After this, the information is available through search and ready to be distributed to the right target groups. Portlets are used to give single sign-on access to a number of information systems, and template pages in the WCM (Web Content Management system) use search alerts to provide updated information.

Simply put: a search alert for, say, meeting minutes that contain your department's name will give you an overview of all such information as soon as it is published, regardless of which system it resides in.
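As a toy illustration of the search-alert idea, a saved query can be matched against newly indexed documents; the document fields, the alert definition and the notify() helper below are assumptions for the example only:

```python
# Toy sketch of a search alert: saved queries are run against newly indexed
# documents and matches are pushed to the subscriber. All names are illustrative.
def notify(user, doc):
    print(f"alert for {user}: {doc['title']}")

SAVED_ALERTS = [
    {"user": "it-department", "doc_type": "meeting minutes", "must_contain": "IT Department"},
]

def run_alerts(new_documents):
    for doc in new_documents:
        for alert in SAVED_ALERTS:
            if (doc["type"] == alert["doc_type"]
                    and alert["must_contain"].lower() in doc["body"].lower()):
                notify(alert["user"], doc)
```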

Furthermore, the blog post describes VGR's work on creating short and persistent URLs (through a URL service) and on how to "monitor" and "listen to" the information flow (for real-time indexing and distribution), areas where we all have things to learn. Over time Kristian will describe the different parts of the model in detail, so be sure to keep an eye on the blog.

What are your thoughts on how to get control of the information flow? Have you been developing similar solutions for parts of it?

The Evolution of Search in Video Media

Search is increasingly becoming an infrastructure necessity and, in some areas and for some users, is considered a commodity. However, new areas of use for search are growing rapidly, both on the web and within enterprises. Google's recent acquisition of YouTube gives us one example of such a new area. Searching in video material is not simple, and I believe we have only seen the very early stages of this technique.

I am participating in an EU-funded project called RUSHES, within the 6th Framework Programme. The aim of the project is, among other things, to develop techniques for automatic content cataloguing and semantic-based indexing. So what impact will this have on end users and on search in video?

Well, users won't have to go to a category and search under, for example, "News and politics"; instead they will be able to use keywords such as "president" and "scandals" to get clips about Nixon and the Watergate saga. The content provider, on the other hand, won't have to watch the video clip in order to annotate and meta-tag it; they will just run the video through a RUSHES module and the program will handle the rest. These new scenarios, in combination with the semantic web (Web 2.0), will enable new possibilities and business opportunities that we have not even dreamt of before. Like search in video!