Google Search Appliance (GSA) 6.10 released

Last week, Google released version 6.10 of the software for their Google Search Appliance (GSA).

This is a minor update, and the focus for the Google team has been bug fixes and increased stability. Looking at the release notes, there are indeed plenty of bugs that have been fixed.

However, there are also some new features in this release. Some of the more interesting, in my opinion, are:

Multiple front-end configuration for Dynamic Navigation

Since the 6.8 release, the GSA has been able to provide facets, or Dynamic Navigation as Google calls it. However, the facets have been global, so you couldn’t have two front ends with different facets. This is now possible.

More feeds statistics and Adjust PageRank in feeds

More statistics on what’s happening with the feeds you push into the GSA is a very welcome feature. The possibility to adjust PageRank allows for some more control over relevancy in feeds.

Crawl-time Kerberos support and indexing of large files

Google is working hard on security, and every release since 6.0 has included some security improvements. It’s nice to see that this continues. Since the beginning, the GSA has simply dropped files bigger than 30 MB. Now it will index larger files (you can configure how large), but still only the first 2.5 MB of the content will be indexed.

Stopword lists for different languages

Centralized configuration

For a multi-node GSA setup, you can now specify the configuration on the master, and it’s propagated to the slaves.

For a complete list of new features, see the New and Changed Features page in the documentation.

Solr 3.1 released

Last Friday, Solr 3.1 was released along with Lucene 3.1. This might seem like a big step from the previous version, 1.4.1, but it is an effect of the merged development of Solr and Lucene that took place a year ago. The Solr version number now reflects the Lucene version that is used.

For a complete list of new features and enhancements, you can read the release notes. Some of the most interesting features are:

  • Extended dismax (edismax) query parser. It’s an enhancement over dismax that, among other things, supports the full Lucene query syntax.
  • Spatial search (i.e. we can now enable geo-search: sort by distance, boost by distance, etc.; see the code sketch after this list)
  • Numeric range facets.
  • Lots of optimizations and performance improvements, including better Unicode and 64-bit JVM support.
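As a quick illustration of the spatial search support, here is a minimal SolrJ sketch. The Solr URL, the location field name “store” and the coordinates are assumptions made for the example:

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

// Minimal sketch: filter to documents within 5 km of a point and sort them by distance.
public class SpatialSearchExample {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");

        SolrQuery query = new SolrQuery("*:*");
        query.addFilterQuery("{!geofilt}");   // spatial filter introduced in Solr 3.1
        query.set("sfield", "store");         // the location field (assumed to be a LatLonType field)
        query.set("pt", "57.70,11.97");       // the user's position, e.g. from a mobile device
        query.set("d", "5");                  // distance in km
        query.set("sort", "geodist() asc");   // closest results first

        QueryResponse response = solr.query(query);
        System.out.println("Found " + response.getResults().getNumFound() + " documents nearby");
    }
}
```

The same parameters (sfield, pt, d and the geofilt filter) can of course also be sent as plain request parameters if you query Solr directly over HTTP.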

Update: There’s a good list of features and enhancements on Sematext’s blog.

I’m really keen on the Spatial Search, which opens up a new set of applications, especially for Mobile Search where you have the advantage of knowing the position of the user.

I’m glad the community pulled off this release after the merge with Lucene, and it will be fun to start working with it. What’s your favorite feature in 3.1? Drop a comment!

The Difference Between Search and Findability

Enterprise Search and Findability: What is Different?

Is “Findability” only a buzzword to describe the same thing as before when talking about search solutions, or does it bring something new to the discussion? I’d like to think the latter, and this week I read a blog post describing the difference between search and findability very well. I couldn’t have written it better myself.

For the lazy ones, I’ve picked out a quote that is the key element of the post:

Findability: introducing the robot waiter

Imagine you’re in a futuristic restaurant and when the robot waiter approaches, you ask for ‘ham and cheese omelette’. In response he just shrugs his robotic shoulders and says ‘not found – please try again.’ You then have to keep guessing until you find a match for something you’d like to order.

Now imagine a second futuristic restaurant where the robot waiter says ‘Mr Grimes, how lovely to see you, the last time you visited you had A and B and gave them a 5 star rating. People who ordered x, also ordered y and found that the wines a, b and c went really well with it.’ At the first restaurant the menu was searchable (though regrettably the ‘ham and cheese omelette’ query didn’t match anything); at the second restaurant the menu was findable.

To me, this analogy is spot-on. I dare to say that making content searchable is more of a technical issue, while reaching great findability requires an understanding of the business. Why is that?

Well, to make a content repository searchable you “only” need to hook up a connector, index the repository and display a search box to the users. To succeed with this, it doesn’t matter whether the content is movie reviews, user manuals, recipes, a product catalog or something else. What you need to know is the format of the repository (is it a SQL database, a file system, an ECM, etc.).

But if you want your users to find what they want in your repositories, business knowledge is a requirement. It’s true that you help your users find information by implementing technical features like query completion, facets, did-you-mean, synonym dictionaries, etc. But if they are to be of any help, you need to present facets that are useful, populate the synonym dictionary with terms used in your organisation, and so on. For example, a good synonym file targeted towards nurses and doctors would be very different from one targeted at employees at an insurance company.

LDAP Connector for Openpipeline Used for Indexing Organisational Information

Finding people within your organisation, also known as People Search, is a key ingredient in a findability solution. People catalogs are often based on an LDAP directory, which holds the important information about each employee.

The LDAP connector for Openpipeline is the result of the latest activity at the Findwise development department and makes it easy to make the LDAP structure searchable. As always with a connector, you get direct access to the source, which ensures very efficient indexing and good control over the indexed information.

The LDAP connector has a number of features, some noted below:

  • SSL support – Supports LDAP over SSL
  • Pagination – LDAP entries can be retrieved in batches if the LDAP server supports the PagedResultsControl. This increases performance and reduces memory consumption drastically (see the code sketch after this list).
  • Incremental indexing – If the LDAP server flags each update to an entry with a timestamp, the connector can use this timestamp to fetch only updated entries.
  • Delete entries – LDAP entries that have been removed since the last run will be removed from the index.
  • Attribute specification – Specify which attributes should be returned for each entry. By retrieving only the attributes you need, performance is increased.
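To give a feeling for what paged retrieval, attribute specification and LDAP over SSL look like against a directory server, here is a minimal sketch using plain JNDI. The host, credentials, search base and attribute names are made up, and this is not the connector’s actual code:

```java
import java.util.Hashtable;
import javax.naming.Context;
import javax.naming.NamingEnumeration;
import javax.naming.directory.SearchControls;
import javax.naming.directory.SearchResult;
import javax.naming.ldap.*;

// Minimal sketch of paged LDAP retrieval with JNDI; all names and credentials are hypothetical.
public class PagedLdapFetch {
    public static void main(String[] args) throws Exception {
        Hashtable<String, String> env = new Hashtable<String, String>();
        env.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.ldap.LdapCtxFactory");
        env.put(Context.PROVIDER_URL, "ldaps://ldap.example.com:636"); // LDAP over SSL
        env.put(Context.SECURITY_AUTHENTICATION, "simple");
        env.put(Context.SECURITY_PRINCIPAL, "cn=indexer,dc=example,dc=com");
        env.put(Context.SECURITY_CREDENTIALS, "secret");

        LdapContext ctx = new InitialLdapContext(env, null);
        int pageSize = 500;
        ctx.setRequestControls(new Control[] { new PagedResultsControl(pageSize, Control.CRITICAL) });

        SearchControls sc = new SearchControls();
        sc.setSearchScope(SearchControls.SUBTREE_SCOPE);
        // Attribute specification: only fetch what you are going to index
        sc.setReturningAttributes(new String[] { "cn", "mail", "telephoneNumber" });

        byte[] cookie;
        do {
            NamingEnumeration<SearchResult> results =
                    ctx.search("ou=people,dc=example,dc=com", "(objectClass=person)", sc);
            while (results.hasMore()) {
                SearchResult entry = results.next();
                System.out.println(entry.getNameInNamespace()); // hand the entry to the pipeline here
            }

            // Read the cookie from the response control to know whether there are more pages
            cookie = null;
            Control[] responseControls = ctx.getResponseControls();
            if (responseControls != null) {
                for (Control c : responseControls) {
                    if (c instanceof PagedResultsResponseControl) {
                        cookie = ((PagedResultsResponseControl) c).getCookie();
                    }
                }
            }
            ctx.setRequestControls(new Control[] { new PagedResultsControl(pageSize, cookie, Control.CRITICAL) });
        } while (cookie != null && cookie.length > 0);

        ctx.close();
    }
}
```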

Interested in knowing more about the connector, or do you have any experience you’d like to share from indexing LDAP directories? Please drop a comment!

Processing Pipeline for the Google Search Appliance

Max has previously highlighted the subject of a processing pipeline for Apache Solr. Another enterprise search engine that is lacking this feature is the Google Search Appliance (GSA). Today, I’d like to share what my colleagues and I have done to overcome this issue in a couple of recent projects. Discussing the needs and motivation for a pipeline is not really within the scope of this post, but I can give you some brief examples of why I believe it’s needed:

Normalize metadata across sources

In an enterprise search installation you are making several sources searchable. Say you have indexed the intranet and a file share with documents. On the intranet, each page is tagged with an author in a metadata field called “creator”. The documents on the file share also have author information, but it is stored in a metadata field called “author”. If you want to find all information from a given author, you need to know all the different fields in all sources that hold the author. Using a pipeline, you can map the creator metadata from the intranet and the author metadata from the file share to a common field in the index (i.e. author).

Overcome GSA shortcomings

One shortcoming of the GSA is that it doesn’t calculate the file size of non-HTML documents. In a pipeline you can easily calculate the size of each document coming in, regardless of file format, and put it in a field in the GSA index.
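To make the two examples above concrete, a pipeline step for them could look roughly like the sketch below. The document model, a simple field map plus the raw content bytes, is a made-up simplification and not the actual pipeline API:

```java
import java.util.HashMap;
import java.util.Map;

// Rough sketch of a pipeline step; the document model (a field map plus raw bytes) is hypothetical.
public class NormalizeAndSizeStep {

    public void process(Map<String, String> fields, byte[] rawContent) {
        // Normalize author metadata: intranet pages use "creator", file share documents use "author"
        String author = fields.containsKey("creator") ? fields.get("creator") : fields.get("author");
        if (author != null) {
            fields.put("author", author); // the common field used in the search front end
        }
        fields.remove("creator");

        // Work around the GSA shortcoming: store the file size regardless of file format
        fields.put("filesize", String.valueOf(rawContent.length));
    }

    public static void main(String[] args) {
        Map<String, String> doc = new HashMap<String, String>();
        doc.put("creator", "Jane Doe");
        new NormalizeAndSizeStep().process(doc, "dummy content".getBytes());
        System.out.println(doc); // prints something like {filesize=13, author=Jane Doe}
    }
}
```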

So how have we done this? We wanted a standard, reusable architecture that wouldn’t interfere too much with a standard GSA setup. Still, we wanted it to cover as many sources as possible without any adjustments to the specific connector being used.

Thus, we targeted the built-in crawler and the connector API for the solution, which can be described in a couple of steps:
  1. It resides as a stand-alone component between the GSA and the content sources
  2. The GSA fetches the content through the component
  3. The component delivers the content to both the GSA and a standalone pipeline
  4. The content is indexed in the GSA. When the pipeline processing is done, the pipeline sends the updated content to the GSA.
The image below will give you a visual overview.
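To make the flow a bit more concrete in code, a very rough sketch of such a component is shown below. This is only an illustration of the idea: the servlet, its url parameter and the pipeline hand-off are hypothetical and not the actual implementation.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Hypothetical stand-alone component sitting between the GSA and the content sources (steps 1-4 above).
public class ContentProxyServlet extends HttpServlet {

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        // Step 2: the GSA fetches a source URL through the component, e.g. ?url=http://intranet/page.html
        String sourceUrl = req.getParameter("url");
        byte[] content = fetch(sourceUrl);

        // Step 3: hand a copy of the content to the standalone pipeline here, e.g. by posting it
        // asynchronously to the pipeline's intake so the GSA response is not delayed.

        // Step 4 (first half): return the original content so the GSA can index it right away.
        // The pipeline later sends the enriched version back to the GSA.
        resp.getOutputStream().write(content);
    }

    private byte[] fetch(String url) throws IOException {
        InputStream in = new URL(url).openStream();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int read;
        while ((read = in.read(buffer)) != -1) {
            out.write(buffer, 0, read);
        }
        in.close();
        return out.toByteArray();
    }
}
```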

With this approach, once the solution is in place the pipeline can be used with the GSA crawler and all connectors built using the connector API. We’ve also discussed extending the solution to support the plain feed protocol, which shouldn’t be that much of a hassle. If you’re interested in finding out more about the solution, don’t hesitate to leave a comment or contact me. We will also put this up on the Google Marketplace soon.

Real Time Search in the Enterprise

Real time search is a big buzz on the global network called the Internet. Major search engines like Google and Bing are now providing users with real time search results from Facebook, Twitter, blogs and other social media sites. Real time search means that as soon as content is created or updated, it is immediately searchable. This might seem obvious and like a basic requirement, but if you work with search you know that this is not the case most of the time. Looking inside the firewall, in the enterprise, I dare to say that real time search is far from common. Sometimes content is not changed very frequently, so it is not necessary to make it instantly searchable. In many cases, though, it’s the technical architecture that limits a real time search implementation.

The most common way of indexing content is by using a web crawler or a connector. Either way, you schedule them to go out and fetch new/updated/deleted content at specific intervals during the day. This is the basic architecture for search platforms these days. The advantage of this approach is that the content systems do not need to adapt to the search platform; they just deliver content through their ordinary APIs during indexing. The drawback is that new or updated content is not available until the next scheduled indexing. Depending on the system, this might take several hours. For several reasons, mostly performance, you do not want to schedule connectors or web crawlers to fetch content too often. Instead, to provide real time search you have to do it the other way around: let the content system push content to the search platform.

Most systems have some sort of event mechanism that triggers an event when content is created, updated or deleted. By listening for these events, the system can send the content to the search platform at the same time as it’s stored in the content system. The search platform can immediately index the pushed content and make it searchable. This requires adaptation of the content system towards the search platform, but in this case I think the advantages outweigh the disadvantages. Modern content systems provide (or should provide) a plug-in architecture, so you should fairly easily be able to plug in this kind of code. These plug-ins could also be provided by the search platform vendors, just as ordinary connectors are provided today.
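As an example, a content system plug-in that pushes content to a Solr-based search platform could look something like the sketch below. The event hook names are made up; only the SolrJ calls are standard:

```java
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

// Sketch of a plug-in that pushes content as soon as it is saved; the event methods
// (onContentSaved/onContentDeleted) are hypothetical hooks into the content system.
public class SearchPushPlugin {

    private final SolrServer solr;

    public SearchPushPlugin(String solrUrl) throws Exception {
        this.solr = new CommonsHttpSolrServer(solrUrl);
    }

    // Called by the content system's event mechanism when a document is created or updated
    public void onContentSaved(String id, String title, String body) throws Exception {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", id);
        doc.addField("title", title);
        doc.addField("body", body);
        solr.add(doc);
        solr.commit(); // makes the content searchable immediately (one commit per event is simplistic)
    }

    // Called when a document is deleted so the index does not serve stale results
    public void onContentDeleted(String id) throws Exception {
        solr.deleteById(id);
        solr.commit();
    }
}
```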

Do you agree, or have I been living in a cave for the past few years? I’d love to hear your comments on this subject!

Relevance is Important – and Relevant

A couple of weeks ago I read an interesting blog post comparing the relevance of three different search engines. This made me start thinking about relevance and how it’s sometimes overlooked when choosing or implementing a search engine in a findability solution. A big misconception is that if we just install a search engine, we will get splendid search results out of the box. While it’s true that the results will be better than those of an existing database-based search solution, the amount of configuration needed to get splendid results depends on how good the relevance is from the start. And as seen in the blog post, it can differ quite a bit between search engines, and relevance is important.

So what is relevance and why does it differ between search engines? Computing relevance is the core of a search engine. Essentially, the goal is to deliver the most relevant set of results with regard to your search query. When you submit your query, the search engine uses a number of algorithms to find, within all indexed content, the documents or pages that best correspond to the query. Each search engine uses its own set of algorithms, and that is why we get different results.

Since relevance is based on the content, it will also differ from company to company. That’s why we can’t say that one search engine has better relevance than another; we can just say that it differs. To know which performs best, you have to try it out on your own content. The best way to choose a search engine for your findability solution would thus be to compare a couple and see which yields the best results. After comparing the results, the next step would be to look at how easy it is to tune the relevance algorithms, to what extent it is possible, and how much you need to tune. Depending on how good the relevance is from the start, you might not need to do much relevance tuning, and thus you won’t need the “advanced relevance tuning functionality” that might cost extra money.

In the end, the best search engine is not the one with the most functionality. The best one is the one that gives you the most relevant results, and by choosing a search engine with good relevance for your content, some initial requirements might become obsolete, which will save you time and money.

To Crawl or Not to Crawl in Enterprise Search

With an Enterprise Search Engine, there are basically two ways of getting content into the index: using a web crawler or using a connector. Both methods have their advantages and disadvantages. In this post I’ll try to pinpoint the differences between the two.

Web crawler

Most systems today have a web interface. Be it your time reporting system, your intranet or your document management system, you’ll probably access them with your web browser. Because of this, it’s very easy to use a web crawler to index this content as well.

The web crawler indexes the pages by starting at one page. From there, it follows all outbound links and indexes those pages. From those pages, it follows all links, and so on. This process continues until all links on a web site have been followed and the pages indexed. The crawler thus uses the same technique as a human: visiting a page and clicking the links.
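In code, the core of that loop is quite small. The sketch below uses the jsoup library to fetch and parse pages; the start page is made up, and a real crawler would of course also handle robots.txt, politeness delays, duplicates and non-HTML content:

```java
import java.util.HashSet;
import java.util.LinkedList;
import java.util.Queue;
import java.util.Set;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Minimal breadth-first crawler sketch: visit a page, index its text, follow its links.
public class TinyCrawler {

    public static void main(String[] args) throws Exception {
        Queue<String> queue = new LinkedList<String>();
        Set<String> visited = new HashSet<String>();
        queue.add("http://intranet.example.com/"); // hypothetical start page

        while (!queue.isEmpty() && visited.size() < 100) {
            String url = queue.poll();
            if (!visited.add(url)) {
                continue; // already crawled
            }

            Document page = Jsoup.connect(url).get();
            index(url, page.text()); // hand the extracted text over to the search engine

            for (Element link : page.select("a[href]")) { // follow all outbound links
                String next = link.absUrl("href");
                if (next.startsWith("http")) {
                    queue.add(next);
                }
            }
        }
    }

    private static void index(String url, String text) {
        System.out.println("Indexed " + url + " (" + text.length() + " characters)");
    }
}
```

Note that page.text() returns everything on the page, navigation menus included, which leads us straight to the crawler’s main drawback.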

Most Enterprise Search Engines are bundled with a web crawler, so it’s usually very easy to get started. Just enter a start page and within minutes you’ll have searchable content in your index. No extra installation or license fee is required. For some sources this may also be the only option, e.g. if you’re indexing external sources that your company has no control over.

The main disadvantage, though, is that web pages are designed for humans, not crawlers. This means that there is a lot of extra information for presentation purposes, such as navigation menus, sticky information messages, headers and footers, and so on. All of this makes for a more pleasant experience for the user and also makes it easier to navigate the page. The crawler, on the other hand, has no use for this information when retrieving pages. It actually reduces the information quality in the index. For example, a navigation menu will be displayed on every page, so the crawler will index the navigation content for all pages. So if you have a navigation item called “Customers” and a user searches for customers, he or she will get a hit in ALL pages in the index.

There are ways to get around this, but they require either altering the produced HTML or making adjustments in the search engine. Also, if the design of the site changes, you have to make these adjustments again.

Connector

Even though the majority of systems have a web interface, the content is stored in a data source of some format. It might be a database, a structured file system, etc. By using a connector, you connect either to the underlying data source or directly to the system through its programming API.

Using a connector, the search engine does not get any presentation information, only the pure content, which makes the information quality in the index better. The connector can also retrieve all metadata associated with the information, which further increases the quality. Often, you’ll also have more fine-grained control over what will be indexed with a connector than with a web crawler.
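As a tiny illustration of the connector idea, the sketch below reads rows straight from a product database with plain JDBC and hands the pure content, metadata included, to the indexer. The connection URL, credentials and table layout are made up:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Minimal database connector sketch: no navigation menus or page decoration, just content and metadata.
public class ProductCatalogConnector {

    public static void main(String[] args) throws Exception {
        Connection conn = DriverManager.getConnection(
                "jdbc:mysql://db.example.com/catalog", "indexer", "secret"); // hypothetical source
        PreparedStatement stmt = conn.prepareStatement(
                "SELECT id, name, description, category FROM products");
        ResultSet rs = stmt.executeQuery();

        while (rs.next()) {
            index(rs.getString("id"), rs.getString("name"),
                  rs.getString("description"), rs.getString("category"));
        }

        rs.close();
        stmt.close();
        conn.close();
    }

    private static void index(String id, String name, String description, String category) {
        // In a real connector this is where the document is sent to the search engine,
        // with "category" and the other columns kept as metadata fields.
        System.out.println("Indexed product " + id + " in category " + category);
    }
}
```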

Using a connector, though, requires more configuration. It might also cost some extra money to buy one for your system, and it may require additional hardware. Once set up, however, it’s likely to produce more relevant results than a web crawler.

The bottom line is that it’s a trade-off between quality and cost, like most decisions in life 🙂

Google Search Appliance Learns What You Want to Find

Analyzing user behaviour is a key ingredient in making a search solution successful. By using search analytics, you gain knowledge of how your users use the search solution and what they expect to find. With this knowledge, simple adjustments such as Key Matches, Synonyms and Query Suggestions can enhance the findability of your search solution.

In addition to this, you can also tune the relevancy by knowing what your users are after. An exciting development in this area is automating that task: by analyzing what users click on in the search result, the relevancy of the documents is automatically adjusted. Findwise has been looking into this area lately, but there hasn’t been any out-of-the-box functionality for this from any vendor.

Until now.

Two weeks ago Google announced the second major upgrade this year for the Google Search Appliance. Labeled version 6.2, it brings a lot of new features. The most interesting and innovative one is the Self-Learning Scorer. The self-learning scorer analyzes users’ clicks and behaviour in the search results and uses them as input to adjust the relevancy. This means that if a lot of people click on the third result, the GSA will boost this document to appear higher up in the result set. So, without you having to do anything, the relevance will increase over time, making your search solution perform better the more it is used. It’s easy to imagine this creating an upward spiral.

The 6.2 release also delivers improvements regarding security, connectivity, indexing overview and more. To read more about the release, head over to the Google Enterprise Blog.