Migration from Google Search Appliance (GSA) in 4 easy steps

 

 

Google Search Appliance is being phased out, and in 2018 renewals will end. As an existing customer, you can buy one-year license renewals throughout 2017. However, if you fancy a change, here are four simple steps for switching to Apache Solr or Elasticsearch.

1. Choose your hosting solution or servers

Whereas Google Search Appliance comes ready to plug in, Apache Solr and Elasticsearch need to be deployed and hosted on servers. You can choose to host Solr or Elasticsearch on your own infrastructure or in the cloud. Both platforms are highly scalable and can be massively distributed.

  • Own infrastructure

Server and hardware requirements are highly dependent on the number of documents, document types, search use cases and number of users. Memory, CPUs, disk and network are the main parameters to consider.

Elasticsearch hardware recommendations: https://www.elastic.co/guide/en/elasticsearch/guide/current/hardware.html

Apache Solr performance: https://wiki.apache.org/solr/SolrPerformanceProblems

Both Elasticsearch and Solr require a Java runtime. For SolrCloud, you will also need to install ZooKeeper.

  • In the cloud

You can also choose to run Solr or Elasticsearch on a cloud platform.

Elastic official cloud platform: https://www.elastic.co/cloud

2. Define your schema and mapping

In Apache Solr and Elasticsearch, fields can be indexed and processed differently according to type, language, use case and so on. A field and its type can be defined in Elasticsearch using the mapping API, or in Apache Solr in schema.xml.

Elasticsearch mapping API: https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping.html

Apache Solr schema: https://wiki.apache.org/solr/SchemaXml
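
To make this concrete, here is a minimal sketch of defining a mapping over Elasticsearch’s REST API with Python and the requests library. The host, index name and field names are made-up examples, and the exact mapping body varies between Elasticsearch versions (older releases require an extra document-type level under "mappings").

    import json
    import requests

    # Assumed example host and index name; adjust to your environment.
    ES_URL = "http://localhost:9200"
    INDEX = "documents"

    # Explicit mapping: analyzed text fields for searching, a keyword field
    # for exact filtering and a date field for sorting.
    mapping = {
        "mappings": {
            "properties": {
                "title":   {"type": "text", "analyzer": "english"},
                "body":    {"type": "text", "analyzer": "english"},
                "source":  {"type": "keyword"},
                "created": {"type": "date"}
            }
        }
    }

    resp = requests.put(f"{ES_URL}/{INDEX}",
                        headers={"Content-Type": "application/json"},
                        data=json.dumps(mapping))
    print(resp.status_code, resp.text)

In Apache Solr, a corresponding field would typically be declared in schema.xml, for example <field name="title" type="text_en" indexed="true" stored="true"/>, or added through the Schema API.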

3. Tune your connectors

Do you need to change all connectors?

The answer is no. Connectors sending GSA feeds can be kept; just refactor the output to match the Elasticsearch or Solr indexing syntax (see the sketch below).

However, if you use GSA to crawl websites, you will need either to reconsider crawling as the method for getting your data or to use an external web crawler (such as Norconex). Unlike GSA, Apache Solr and Elasticsearch do not come with a web crawler.

Elasticsearch Indexing API: https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-index_.html
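
As an illustration of the refactoring mentioned above, here is a minimal sketch that takes the kind of document a connector would have sent as a GSA content feed and posts it to the Elasticsearch index API instead. The host, index name, field names and document id are assumptions; on older Elasticsearch versions the URL contains a document type instead of _doc.

    import json
    import requests

    ES_URL = "http://localhost:9200"
    INDEX = "documents"

    # A document roughly as the connector would have produced it for a GSA feed.
    doc = {
        "title": "Quarterly report Q3",
        "body": "Full text extracted by the connector...",
        "source": "fileshare",
        "created": "2016-11-01",
    }
    doc_id = "fileshare-12345"  # a stable id, so re-feeds update instead of duplicating

    # Index (create or overwrite) the document by id.
    resp = requests.put(f"{ES_URL}/{INDEX}/_doc/{doc_id}",
                        headers={"Content-Type": "application/json"},
                        data=json.dumps(doc))
    print(resp.json())

For Solr, the same connector output would instead be posted as JSON to the core’s /update handler, for example http://localhost:8983/solr/documents/update?commit=true.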

4. Rewrite your queries and fetch new output

All common query functions such as filtering, sorting and dynamic navigation are standard in both Apache Solr and Elasticsearch. However, the query parameters and the output format (XML or JSON) are different, which means both the queries and the front end need adapting to your new search engine (a sketch of the differences follows below).

If you are using Jellyfish by Findwise, queries and output will be roughly the same.

Elasticsearch response body: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-body.html

Apache Solr response: https://cwiki.apache.org/confluence/display/solr/Response+Writers
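
To give a feel for the differences, here is a hedged sketch of one and the same query, free text plus a filter, a sort and a facet on a source field, expressed against both engines. Hosts, index/core names and fields are example assumptions.

    import json
    import requests

    # Elasticsearch: the query is a JSON body sent to the _search endpoint.
    es_query = {
        "query": {
            "bool": {
                "must":   [{"match": {"body": "annual report"}}],
                "filter": [{"term": {"source": "fileshare"}}]
            }
        },
        "sort": [{"created": {"order": "desc"}}],
        "aggs": {"by_source": {"terms": {"field": "source"}}}
    }
    es_resp = requests.post("http://localhost:9200/documents/_search",
                            headers={"Content-Type": "application/json"},
                            data=json.dumps(es_query)).json()

    # Apache Solr: the same intent expressed as request parameters against /select.
    solr_params = {
        "q": "body:(annual report)",
        "fq": "source:fileshare",
        "sort": "created desc",
        "facet": "true",
        "facet.field": "source",
        "wt": "json",
    }
    solr_resp = requests.get("http://localhost:8983/solr/documents/select",
                             params=solr_params).json()

    # The hit counts live in different places in the two responses.
    print(es_resp["hits"]["total"])
    print(solr_resp["response"]["numFound"])

The front end has to be adapted accordingly: Elasticsearch returns documents under hits.hits and facets under aggregations, while Solr returns documents under response.docs and facets under facet_counts.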

Google Search Appliance feature equivalence

GSA feature | Elasticsearch | Apache Solr
Web crawling | X | X
Language Bundles | Languages | Language Analysis
Synonyms | Synonyms | Synonyms
Stopwords | Stopwords | Stopwords
Result Biasing | Controlling relevance | Query elevation
Suggestions | Search suggesters | Suggester
Dynamic navigation | Aggregations | Faceting
Document preview | X | X
User result | X | X
Expert search | X | X
Keymatch | X | X
Related Queries | X | X
Secure search | Shield | Solr Security
Search reports | Logstash + Kibana | X
Mirroring/Distributed | Scale Elastic | SolrCloud
System alert | Watcher | X
Email update/Alert | Watcher | X

X = not available out of the box

New look for the GSA-powered file share search at Implement Consulting Group

The file share search on Implement Consulting Group’s intranet is driven by a Google Search Appliance (GSA). Recently, with help from Findwise, the search interface was given a new look that integrates more seamlessly with the overall design of the intranet.

GSA comes with a default search interface similar to the Google.com search. The interface is easy to customize from GSA’s administrative interface; however, some features are simply not customizable by clicking around. For those cases, GSA supports editing an XSLT file for customizing the search. GSA returns the search results as XML, and by processing that XML with the XSLT stylesheet we can customize how the search results look and behave.

Custom CSS and JavaScript were used to integrate GSA’s search functionality into the look and feel of the intranet. Implement’s new intranet is based on thoughtfarmer.com and the design was delivered by 1508.dk.

And here is the search results page with its new look:

The new look of the search results page on Implement Consulting Group’s Google Search Appliance powered search

Impressions of GSA 7.0

Google released Google Search Appliance 7.0 in early October. Magnus Ebbesson and I joined the Google-hosted pre-sales conference in Zürich, where some of the new functionality was presented along with what the future will bring to the platform. Google is really putting an effort into their platform, and it gets stronger with each release. Personally I tend to like the hardware and security updates the most, but I have to say that some of the new features are impressive and have great potential. I have had the opportunity to try them out for a while now.

In late November we held a breakfast seminar at the office in Gothenburg where we talked about GSA in general, with a focus on GSA 7.0 and the new features. My impression is that the translate functionality is very attractive for larger enterprises, while the previews bring a big wow factor in general. The possibility of configuring ACLs for several domains is great too, since many larger enterprises tend to have several domains. The entity extraction is of course interesting and can be very useful; a processing framework would enhance it even further, however.

It is also nice to see that Google is improving the hardware. The robustness is a really strong argument for selecting GSA.

It’s impressive to see how many languages the GSA can handle and how quickly it performs the translation. The user will need basic knowledge of the foreign language, since the query itself is not translated. However, it is reasonably common to have a corporate language which most of the employees know.

The preview functionality is a very welcome feature. The fact that it can highlight pages within a document is really nice. I have played around with using it through our Jellyfish API with some degree of success. Below are two examples of the preview functionality in use.

GSA 7.0 Preview

GSA 7 Preview - Details

A few thoughts

At the conference we attended in Zürich, Google mentioned that they are aiming to improve the built-in template in the GSA. The standard template is nice and makes setting up a decent graphical interface possible at almost no cost.

My experience, however, is that companies want the front end integrated with their own systems. Also, we tend to use search for more purposes than the standard usage. Search-driven intranets, where you build intranet sites based on search results, are an example where search is used in a different manner.

A concept that we have introduced at Findwise is search as a service. It means that the search engine is a stand-alone product with APIs that make it easy to send data to it and extract data from it. We have created our own APIs around the GSA to make this possible. An easy way to extract data based on filtering is essential.

What I would like to see in the GSA is easier integration for performing search, such as a REST or SOAP service for easily creating search clients. This would make it easier to integrate functionality, such as security, externally. Basically, you tell the client who the current user is and the client handles the rest. It would also improve maintainability, in the sense that new and changed functionality would not require a new implementation of how to parse the XML response.

I would also like to see a bigger focus on documentation of how to use functionality such as previews and translation externally.

Final words

My feeling is that the GSA is getting stronger and I like the new features in GSA 7.0. Google has made it clear that they are continuously aiming to improve their product, and I am looking forward to future releases. I hope the GSA will take a step closer to the search-as-a-service concept; the addition of a processing framework would enhance it even further. The future will tell.

Google Search Appliance (GSA) 6.12 released

Google has released yet another version of the Google Search Appliance (GSA). It is good to see that Google stays active when it comes to improving their enterprise search product! Below is a list of the new features:

Dynamic navigation for secure search

The facet feature, introduced in 6.8, is still being improved. When filters are created, it is now possible to ensure that they only include secure documents that the user is authorized to see.

Nested metadata queries

In previous Search Appliance releases there were restrictions on nesting meta tags in search queries. In this release many of those restrictions are lifted.

LDAP authentication with Universal Login

You can configure a Universal Login credential group for LDAP authentication.

Index removal and backoff intervals

When the Search Appliance encounters a temporary error while trying to fetch a document during crawl, it retains the document in the crawl queue and index. It schedules a series of retries after certain time intervals, known as “backoff” intervals, before removing the URL from the index.

An example of when this is useful is when using the processing pipeline that we have implemented for the GSA. GSA uses an external component to index the content; if that component goes down, the GSA will receive a “404 – page does not exist” when trying to crawl, which may cause mass removal from the index. With this functionality turned on, that can be avoided.

Specify URLs to crawl immediately in feeds

Release 6.12 provides the ability to specify URLs to crawl immediately in a feed by using the crawl-immediately attribute. This is a nice feature for prioritising what needs to get indexed quickly.
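
A hedged sketch of what such a feed might look like, pushed to the appliance’s feed interface with Python. The hostname, data source name and URL are made-up, and the exact DTD and feed port should be checked against the Feeds Protocol documentation for your GSA version.

    import requests

    # Minimal metadata-and-url feed asking the GSA to crawl one URL immediately.
    feed_xml = """<?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE gsafeed PUBLIC "-//Google//DTD GSA Feeds//EN" "">
    <gsafeed>
      <header>
        <datasource>intranet</datasource>
        <feedtype>metadata-and-url</feedtype>
      </header>
      <group>
        <record url="http://intranet.example.com/news/urgent.html"
                mimetype="text/html"
                crawl-immediately="true"/>
      </group>
    </gsafeed>"""

    # Feeds are posted as a multipart form to the feedergate service (port 19900 by default).
    resp = requests.post("http://gsa.example.com:19900/xmlfeed",
                         data={"feedtype": "metadata-and-url", "datasource": "intranet"},
                         files={"data": ("feed.xml", feed_xml, "text/xml")})
    print(resp.status_code, resp.text)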

X-Robots-Tag support

The Appliance now supports the X-Robots-Tag header, which opens the possibility to exclude non-HTML documents from the index.

Google Search Appliance documentation page

Google Search Appliance (GSA) 6.10 released

Last week, Google released version 6.10 of the software for their Google Search Appliance (GSA).

This is a minor update and the focus of the Google teams has been bug fixes and increased stability. Looking at the release notes, there are indeed plenty of bugs that have been solved.

However, there are also some new features in this release. Some of the more interesting, in my opinion, are:

Multiple front-end configuration for Dynamic Navigation

Since the 6.8 release, the GSA has been able to provide facets, or Dynamic Navigation as Google calls it. However, the facets have been global, so you couldn’t have two front ends with different facets. This is now possible.
More feed statistics and Adjust PageRank in feeds

More statistics on what’s happening with the feeds you push into the GSA is a very welcome feature. The possibility to adjust PageRank allows for some more control over relevancy in feeds.

Crawl-time Kerberos support and indexing of large files

Google is working hard on security and every release since 6.0 has included some security improvements. It is nice to see that this continues. Since the beginning, the GSA has simply dropped files bigger than 30 MB. Now it will index larger files (you can configure how large), but still only the first 2.5 MB of the content will be indexed.

Stopword lists for different languages

Centralized configuration for scalability

For a multi-node GSA setup, you can now specify the configuration on the master and it is propagated to the slaves.

For a complete list of new features, see the New and Changed Features page in the documentation.

KMWorld 2010 Reflections: Search is a Journey Not a Destination

Two weeks ago, Ludvig Johansson, Christopher Wallström and I attended KMWorld’s quadruple conference in Washington D.C. The event consisted of four different conferences: KMWorld, Enterprise Search Summit, Taxonomy Bootcamp and SharePoint Symposium. I focused on the Enterprise Search Summit and SharePoint Symposium, while Christopher mainly covered Taxonomy Bootcamp as well as the Enterprise Search Summit. (Christopher will soon write a blog post about this as well.)

During the conferences there was some good-quality content, but most of it was old news, with speakers mainly focusing on the output of their own products. This was disappointing, since I had hoped to see the newest and coolest solutions within my area. Speakers presented systems from their corporations where the newest and coolest functionality they described was shallow filters on a Google Search Appliance. From my perspective this is not new or cool; I would rather consider it standard functionality in today’s search solutions.

However, some sessions were really good. Daniel W. Rasmus talked about the evolution of search in quite a fun and thoughtful way. One thing he wanted to see in the near future was more personalization of search. Search needs to know the user and adapt to him or her, not simply use a standardized algorithm. As Rasmus put it: “my search engine is not that into me”. This is, as I would put it, spot on how we see it at Findwise. Today’s customers want standard search with components that have existed for years now. It’s time for search to take the next step in its evolution, and for us to start delivering Findability solutions adapted to your needs as an individual. In line with this, Rasmus ended with another good quote: “Don’t let your search vendors set your expectations too low”. I think this speaks for itself, more or less. If we want contextual search, then we should push the vendors out there to start delivering!

Another good session was delivered by Ellen Feaheny, on how to utilize both old and new systems smarter. It was from this session that the title of this post originates: “It’s a journey not a destination”. I thought this sums up what we feel every day in our projects. It’s common that customers want projects to have a clear start and end. However, with search and Findability we see it as a journey; I would even go as far as to say it’s a journey without an end. We have customers coming to us and complaining about their search, saying “It doesn’t work anymore” or “The content is old”, to give two examples. The problem is that search is not a one-time problem that you solve and then never have to think about again. If you don’t work with your search solution and treat search as a journey, continually improving relevance and content and investing time in search analytics, your solution will soon get dusty and stop delivering what your employees or customers want.

Search is a journey not a destination.

Google Search Appliance Learns What You Want to Find

Analyzing user behaviour is a key ingredient in making a search solution successful. By using search analytics, you gain knowledge of how your users use the search solution and what they expect to find. With this knowledge, simple adjustments such as Key Matches, Synonyms and Query Suggestions can enhance the findability of your search solution.

In addition to this, you can also tune the relevancy by knowing what your users are after. An exciting field in this area is automating that task: by analyzing what users click on in the search results, the relevancy of the documents is automatically adjusted. Findwise has been looking into this area lately, but there hasn’t been any out-of-the-box functionality for this from any vendor.

Until now.

Two weeks ago Google announced the second major upgrade this year for the Google Search Appliance. Labeled version 6.2, it brings a lot of new features. The most interesting and innovative one is the Self-Learning Scorer. The self-learning scorer analyzes users’ clicks and behaviour in the search results and uses them as input to adjust the relevancy. This means that if a lot of people click on the third result, the GSA will boost that document to appear higher up in the result set. So, without you having to do anything, the relevance will increase over time, making your search solution perform better the more it is used. It’s easy to imagine this creating an upward spiral.
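
The scorer itself is a black box, but the underlying idea can be sketched. The snippet below is only an illustration of the concept, not how the GSA implements it: clicks are aggregated per document from a made-up search log and turned into a log-scaled boost that is multiplied into the text-relevance score at query time.

    import math
    from collections import Counter

    # (query, clicked document id) pairs harvested from search logs; example data only.
    click_log = [
        ("annual report", "doc-3"),
        ("annual report", "doc-3"),
        ("annual report", "doc-1"),
        ("vacation policy", "doc-7"),
    ]

    clicks_per_doc = Counter(doc_id for _, doc_id in click_log)

    def click_boost(doc_id):
        """Log-scaled boost so popular documents rise without drowning out text relevance."""
        return 1.0 + math.log1p(clicks_per_doc.get(doc_id, 0))

    def rerank(results):
        """Re-order (doc_id, text_score) pairs by text score times click boost."""
        return sorted(((d, s * click_boost(d)) for d, s in results),
                      key=lambda pair: pair[1], reverse=True)

    # doc-3 has two clicks, so it overtakes doc-1 despite a lower text score.
    print(rerank([("doc-1", 2.0), ("doc-3", 1.8), ("doc-7", 1.5)]))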

The 6.2 release also delivers improvements regarding security, connectivity, indexing overview and more. To read more about the release, head over to the Google Enterprise Blog.

Try the GSA Virtual Edition

One drawback with the Google Search Appliance (GSA) has been that you cannot test it before you buy it. You could go to a Google partner and ask them to index your content, but that only works well with public content. If the content is behind your firewall it gets worse, and you would most probably have to buy your own GSA just to try it out.

With the Virtual Google Search Appliance (VGSA) you can now try all the GSA functionality before buying it. The VGSA is simply a VMware image of the GSA software. Install it on a regular server, fire up VMware and you’re good to go! All functionality of the real deal is available, including the connector framework. The only limitation is the index limit of 50,000 documents.

Our customers often want to do a PoC or pilot before investing in an enterprise search solution. The VGSA is ideal for this since it’s easy and cheap to get up and running. It’s also great for us partners that we can now have multiple installations to experiment with, without buying a lot of hardware.

Read more about the VGSA here!