ExternalFileField in Solr

Sometimes we want to update document values in an indexed field more often than other fields. A good solution to this is the field type ExternalFileField, which gets its values from an external file instead of from the index. Such a file can easily be changed, and the field is updated after a commit, so no documents need to be re-indexed. A field of type ExternalFileField is not searchable; it may currently only be used as a ValueSource in a FunctionQuery.

The external file contains keys and values:

key1=value1
key2=value2

The keys don’t need to be unique.

The name of the external file must be external_<fieldname> or external_<fieldname>.* and must be placed in the index directory.

A new field type of type ExternalFileField, and a field that uses it, must be added to schema.xml.

<fieldType name="file"

           keyField="keyField" defVal="1" indexed="false"

           stored="false" valType="float" />

<field name="<fieldname>" type="file" />

keyField is the name of the field that holds the keys, and <fieldname> gets its values from the external file.

valType defines the value type of the field.
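
To illustrate how such a field can be used as a ValueSource in a FunctionQuery, here is a minimal SolrJ sketch (not from the original post). It assumes a dismax request handler and a hypothetical external field called popularity; the field names are examples only, and SolrJ class names vary between Solr versions (e.g. HttpSolrServer vs. HttpSolrClient).

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class ExternalFieldBoostExample {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr");

        SolrQuery query = new SolrQuery("annual report");
        query.set("defType", "dismax");   // dismax supports boost functions
        query.set("qf", "title body");    // hypothetical searchable fields
        query.set("bf", "popularity^2");  // boost by the ExternalFileField values

        QueryResponse response = solr.query(query);
        System.out.println("Hits: " + response.getResults().getNumFound());
    }
}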

At Findwise we have used this method for a customer who wanted the most visited pages to appear higher up in the search results. These statistics change daily for a lot of pages, and we don't want to re-index all of them every day.

Distributed processing + search == true?

In June 2011, I attended the Berlin Buzzwords conference. The main theme of the conference was undoubtedly the current paradigm shift in distributed processing, driven by the major success of Hadoop. Doug Cutting – founder of Apache projects such as Lucene, Nutch and Hadoop – held one of the keynotes. He focused on what he recognized as the new foundations for this paradigm shift:

– Commodity hardware
– Sequential file access
– Sharding
– Automated, high level reliability
– Open source

Distributed processing is handled fairly well by Hadoop. Distributed search, on the other hand, is more or less limited to sharding and/or replicating the index. The downside of sharding is that you perform the same search on multiple servers and then need to combine the results. Due to the nature of search algorithms such as tf/idf, tasks like ranking results suffer. Andrzej Białecki (another frequent Lucene committer) held a presentation on this topic, and his view can be summarized as: use local search as long as you can, and distribute only when the cost of local search limitations outweighs the cost of distributed search.

The setup of automated replication and sharding, with help from ZooKeeper in the SolrCloud project, is a major step in the right direction, but the question of how to properly combine search results from different nodes still remains. One thing is sure though: there is a lot of interesting work being done in this area.

Information Flow in VGR

Last week Kristian Norling from VGR (Västra Götaland Regional Council) posted a really interesting and important blog post about information flow. For those of you who don't know what VGR has been up to previously, here is a short background.

For a number of years VGR has been working to realize a model for how information is created, managed, stored and distributed – and, perhaps the most important part, integrated.

[Image: Information flow in VGR]

Why is Information Flow Important?

In order to give your users access to the right information, it is essential to get control of the whole information flow, i.e. from the time it is created until it reaches the end user. If we lack knowledge about this, it is almost impossible to ensure quality and accuracy.

The fact that we have control also gives us endless possibilities when it comes to distributing the right information at the right time (an old cliché that is finally becoming reality). To sum up: that is what search is all about!

When information is created, VGR uses a Metadata service which helps the editors tag their content by giving keyword suggestions.

In reality this means that the information can be distributed in the way it is intended. News items, for example, are tagged with subject, target group and organizational info (apart from dates, author, expiry date etc., which are automated) – meaning that people belonging to specific groups, with certain roles, will get the news that is important to them.

Once the information is tagged correctly and published it is indexed by search. This is done in a number of different ways: by HTML-crawling, through RSS, by feeding the search engine or through direct indexing.

After this, the information is available through search and ready to be distributed to the right target groups. Portlets are used to give single sign-on access to a number of information systems, and template pages in the WCM (Web Content Management system) use search alerts to give updated information.

Simply put: a search alert for e.g. meeting minutes that contain your department's name will give you an overview of all information that concerns this when it is published, regardless of which system it resides in.

Furthermore, the blog post describes VGR's work with creating short and persistent URLs (through a URL service) and how to "monitor" and "listen to" the information flow (for real-time indexing and distribution) – areas where we all have things to learn. Over time Kristian will describe the different parts of the model in detail, so be sure to keep an eye on the blog.

What are your thoughts on how to get control of the information flow? Have you been developing similar solutions for part of this?

Real Time Search in the Enterprise

Real time search is a big buzz on the Internet. Major search engines like Google and Bing are now providing users with real time search results from Facebook, Twitter, blogs and other social media sites. Real time search means that as soon as content is created or updated, it is immediately searchable. This might seem obvious and like a basic requirement, but if you work with search you know that this is not the case most of the time. Looking inside the firewall, in the enterprise, I dare say that real time search is far from common. Sometimes content is not changed very frequently, so it is not necessary to make it instantly searchable. In many cases, though, it's the technical architecture that limits a real time search implementation.

The most common way of indexing content is by using a web crawler or a connector. Either way, you schedule them to go out and fetch new/updated/deleted content at specific intervals during the day. This is the basic architecture for search platforms these days. The advantage of this approach is that the content systems do not need to adapt to the search platform; they just deliver content through their ordinary APIs during indexing. The drawback is that new or updated content is not available until the next scheduled indexing, which, depending on the system, might take several hours. For several reasons, mostly performance, you do not want to schedule connectors or web crawlers to fetch content too often. Instead, to provide real time search, you have to do it the other way around: let the content system push content to the search platform.

Most systems have some sort of event system that triggers an event when content is created/updated/deleted. By listening for these events, the system can send the content to the search platform at the same time as it's stored in the content system. The search platform can immediately index the pushed content and make it searchable. This requires adapting the content system towards the search platform, but in this case I think the advantages outweigh the disadvantages. Modern content systems provide (or should provide) a plug-in architecture, so you should fairly easily be able to plug in this kind of code. These plug-ins could also be provided by the search platform vendors, just as ordinary connectors are provided today.
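
As a rough illustration of such a plug-in (not from the original post), here is a minimal sketch that pushes saved content to Solr via SolrJ. The method names and fields are hypothetical placeholders for whatever plug-in API the content system exposes, and SolrJ class names vary between Solr versions.

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class SearchPushListener /* implements the content system's listener interface */ {

    private final HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr");

    // Hypothetical callback: invoked by the content system when content is created or updated.
    public void onContentSaved(String id, String title, String body) throws Exception {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", id);
        doc.addField("title", title);
        doc.addField("body", body);
        solr.add(doc);
        solr.commit(); // or rely on autoCommit/commitWithin to avoid committing on every event
    }

    // Hypothetical callback: invoked when content is deleted.
    public void onContentDeleted(String id) throws Exception {
        solr.deleteById(id);
        solr.commit();
    }
}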

Do you agree, or have I been living in a cave for the past years? I'd love to hear your comments on this subject!

To Crawl or Not to Crawl in Enterprise Search

Having an Enterprise Search Engine, there are basically two ways of getting content into the index: using a web crawler or a connector. Both methods have their advantages and disadvantages. In this post I'll try to pinpoint the differences between the two methods.

Web crawler

Most systems today have a web interface. Be it your time reporting system, your intranet or your document management system, you'll probably access them with your web browser. Because of this, it's very easy to use a web crawler to index this content as well.

The web crawler indexes the pages by starting at one page. From there, it follows all outbound links and indexes those pages. From those pages, it follows all links, and so on. This process continues until all links on a web site have been followed and the pages indexed. The crawler thus uses the same technique as a human: visiting a page and clicking the links.
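
The crawl loop described above can be sketched in a few lines of Java using the jsoup HTML parser (my choice here, not something the post prescribes). Politeness delays, robots.txt handling and error handling are left out, and the start URL is hypothetical.

import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SimpleCrawler {
    public static void main(String[] args) throws Exception {
        String site = "http://intranet.example.com/";    // hypothetical start page
        Queue<String> toVisit = new ArrayDeque<String>();
        Set<String> visited = new HashSet<String>();
        toVisit.add(site);

        while (!toVisit.isEmpty()) {
            String url = toVisit.poll();
            if (!visited.add(url)) {
                continue;                                 // already fetched this page
            }
            Document page = Jsoup.connect(url).get();     // fetch and parse the page
            index(url, page.title(), page.body().text()); // hand the page to the search engine
            for (Element link : page.select("a[href]")) {
                String next = link.attr("abs:href");      // resolve to an absolute URL
                if (next.startsWith(site)) {
                    toVisit.add(next);                    // follow links, but stay on the site
                }
            }
        }
    }

    private static void index(String url, String title, String text) {
        // Placeholder: push the page to the search engine here.
        System.out.println("Indexed: " + url + " (" + title + ")");
    }
}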

Most Enterprise Search Engines are bundled with a web crawler, so it's usually very easy to get started: just enter a start page and within minutes you'll have searchable content in your index. No extra installation or license fee is required. For some sources this may also be the only option, e.g. if you're indexing external sources that your company has no control over.

The main disadvantage, though, is that web pages are designed for humans, not crawlers. This means that there is a lot of extra information for presentation purposes, such as navigation menus, sticky information messages, headers, footers and so on. All of this makes the page a more pleasant experience for the user and easier to navigate, but the crawler has no use for this information when retrieving pages. It actually reduces the information quality in the index. For example, a navigation menu will be displayed on every page, so the crawler will index the navigation content for all pages. If you have a navigation item called "Customers" and a user searches for customers, he/she will get a hit in ALL pages in the index.

There are ways to get around this, but they require either altering the produced HTML or making adjustments in the search engine. Also, if the design of the site changes, you have to do these adjustments again.
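
As a hedged sketch of one such adjustment on the indexing side (not something the post prescribes), the purely presentational elements can be stripped before the text is indexed, again using jsoup; the CSS selectors are hypothetical and depend entirely on the site's actual markup.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class BoilerplateStripper {
    // Returns only the content text of a page, with navigation noise removed.
    public static String indexableText(String url) throws Exception {
        Document page = Jsoup.connect(url).get();
        page.select("nav, header, footer, .menu, .breadcrumb").remove(); // drop presentational markup
        return page.body().text();
    }
}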

Connector

Even though the majority of systems have a web interface, the content is stored in a data source of some format. It might be a database, a structured file system, etc. By using a connector, you connect either to the underlying data source or to the system directly through its programming API.

Using a connector, the search engine does not get any presentation information, only the pure content, which makes the information quality in the index better. The connector can also retrieve all metadata associated with the information, which further increases the quality. Often, you'll also have more fine-grained control over what will be indexed with a connector than with a web crawler.
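
As a minimal illustration (not from the original post), here is a sketch of a connector that reads rows from a relational database over JDBC and pushes them, metadata included, to Solr via SolrJ. The connection string, table and column names are hypothetical, a matching JDBC driver is assumed to be on the classpath, and SolrJ class names vary between Solr versions.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class DatabaseConnector {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr");
        Connection db = DriverManager.getConnection(
                "jdbc:mysql://localhost/cms", "user", "secret"); // hypothetical database
        Statement stmt = db.createStatement();
        ResultSet rows = stmt.executeQuery(
                "SELECT id, title, body, author, modified FROM documents");

        while (rows.next()) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", rows.getString("id"));
            doc.addField("title", rows.getString("title"));
            doc.addField("body", rows.getString("body"));
            doc.addField("author", rows.getString("author"));  // metadata comes along for free
            doc.addField("modified", rows.getDate("modified"));
            solr.add(doc);
        }
        solr.commit();
        db.close();
    }
}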

Using a connector, though, requires more configuration. It might also cost some extra money to buy one for your system, and it may require additional hardware. Once set up, however, it's likely to produce more relevant results than a web crawler.

The bottom line is that it's a trade-off between quality and cost, as with most decisions in life 🙂

Search Driven Process to Increase Content Quality

Experience from recent and ongoing search and retrieval projects has shown that enterprises get a better and deeper insight into their content when deploying a new search platform and search driven solutions – not only in unstructured content repositories, but also in structured sources. As information is indexed and visualized in a more user friendly way, it doesn't take much time before the people responsible find content issues that are brought out into the light: content that is misplaced or wrongly tagged, documents with poorly defined security information, and so on – issues that were earlier hidden due to the lack of a holistic view of the content. This leads to setting up search driven processes to increase the content quality.

It has been said that before an enterprise even thinks of deploying an enterprise search solution, it should get a completely clear picture of all its content; but maybe this should be reformulated, so that an enterprise search solution is also seen as a supporting tool in the process of improving the content.

Taking it a step further would be to allow write-backs from the search engine to content sources to enrich and improve quality and completeness of stored information. Tune search quality and content quality at the same time! Make the content quality assurance process search driven!

The Evolution of Search in Video Media

Search is becoming more and more of an infrastructure necessity and is, in some areas and for some users, considered a commodity. However, new areas of use for search are growing rapidly, both on the web and within enterprises. Google's recent acquisition of YouTube gives us one example of such a new area. Searching in video material is not simple, and I believe we have only seen the very early stages of this new technique.

I am participating in an EU funded project – RUSHES – within the 6th framework programme. The aim of the project is, among other things, to develop techniques for automatic content cataloguing and semantic based indexing. So what impact will this have on end users and search in video?

Well, they won’t have to go to a category and search under for example “News and politics”, instead the users will be able to use keywords such as “president” and “scandals” to get clips about Nixon and the Watergate saga. The content provider, on the other hand, won’t have to see the video clip in order to annotate and meta tag it, they will just run the video through a “RUSHES” module and the program will handle the rest. These new scenarios in combination with the semantic web (Web 2.0), will enable new possibilities and business opportunities which we have not even dreamt of before! Like search in video!