PIM is for storage

– Add search for distribution, customization and seamless multichannel experiences.


Retailers, e-commerce and product data
Having met a number of retailers to discuss information management, we’ve noticed they all experience the same problem. Products are (obviously) central, and product information is typically stored in a PIM or DAM system. So far so good: these systems do the trick when it comes to storing and managing fundamental product data. However, when retailers try to embrace current e-commerce trends [1], such as mobile friendliness, multi-channel selling and connecting products to other content, PIM systems are not much help. As it turns out, PIM is great for storage but not for distribution.

Retailers need to distribute product information across various channels – online stores, mobile and desktop, spreadsheet exports, subsets of data with adjustments for different markets and industries. They also need to connect products to availability, campaigns, user-generated content and fast-changing business rules. Add to this the need to close the analytics feedback loop, and the IT department realises that PIM (or DAM) is not the answer.

Product attributes

Adding search technology for distribution
Whereas PIM is great for storage, search technology is the champion not only of searching but also of distribution. You may have heard the popular saying Create Once, Publish Everywhere? Well, search technology actually gives meaning to it: gather data from any source (PIM, DAM, ERP, CMS), connect it to other data and display it across multiple channels and contexts.

Also, with the i3 [2] package of components you can add information (metadata) or logic that is not available in the PIM system, while the source data stays intact – nothing is altered, copied or moved.

Combined with a taxonomy for categorising information you’re good to go. You can now enrich products and connect them to other products and information (via a processing service), and categorise content according to the product taxonomy. Performance will be very high, as content is denormalised and stored in the search engine, ready for multi-channel distribution. With this setup you can also easily add new sources to enrich products or modify relevance. Who knows what information will be relevant for products in the future?
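
To make the idea a bit more concrete, here is a minimal sketch of what a denormalised product document could look like just before it is indexed. The field names and values are purely illustrative, and SolrJ is used only as an example client library:

import org.apache.solr.common.SolrInputDocument;

public class DenormalisedProductSketch {
    public static void main(String[] args) {
        SolrInputDocument doc = new SolrInputDocument();

        // Fundamental attributes, as stored in the PIM system
        doc.addField("id", "SKU-1234");
        doc.addField("name", "Wool sweater");
        doc.addField("description", "Classic crew-neck sweater in merino wool");

        // Enrichments added during processing, outside the PIM system
        doc.addField("taxonomy", "Clothing/Knitwear/Sweaters"); // from the product taxonomy
        doc.addField("campaign", "winter-sale");                // fast-changing business rule
        doc.addField("channel", "web");                         // channel-specific tailoring
        doc.addField("in_stock", true);                         // availability, e.g. from ERP

        // In a real setup this document would now be sent to Solr or Elasticsearch
        System.out.println(doc);
    }
}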

To summarise

  • PIM for input, search for output. Design for distribution!
  • Use PIM for managing products, not for managing business rules.
  • Add metadata and taxonomies to tailor product information for different channels.
  • Connect products to related content.
  • Use stand-alone components based on open source for strong TCO and flexibility.

References
[1] Gartner for marketers
[2] The Findwise i3 package of components (for indexing, processing, searching and analysing data) is compatible with the open source search engines Apache Solr and Elasticsearch.

Video: Introducing Hydra – An Open Source Document Processing Framework

Introducing Hydra – An Open Source Document Processing Framework, presented at Lucene Revolution, hosted on Vimeo.

Presented by Joel Westberg, Findwise AB
This presentation details the document-processing framework called Hydra that has been developed by Findwise. It is intended as a description of the framework and the problem it aims to solve. We first discuss the need for scalable document processing, noting that there is a missing link in the open source chain between the source system and the search engine. We then describe the design goals of Hydra, as well as how it has been implemented to meet those demands for flexibility, robustness and ease of use. The session ends by discussing some of the possibilities that this new pipeline framework offers, such as seamlessly scaling up the solution during peak loads, metadata enrichment, and proposed integration with Hadoop for Map/Reduce tasks such as PageRank calculations.

Phonetic Algorithm: Bryan, Brian, Briane, Bryne, or … what was his name again?

Let the spelling loose …

What do Callie and Kelly have in common (except for the double ‘l’ in the middle)? What about “no” and “know”, or “Caesar’s” and “scissors”? And what about “message” and “massage”? You definitely got it – Callie and Kelly, “no” and “know”, “Caesar’s” and “scissors” sound alike but are spelled quite differently. “Message” and “massage”, on the other hand, differ by only one vowel (“a” vs “e”), but their pronunciation is not at all the same.

It’s a well-known fact that for many languages orthography does not determine the pronunciation of words. English is a classic example: George Bernard Shaw is the attributed author of “ghoti” as an alternative spelling of “fish”. And while phonology often reflects the current state of a language’s development, orthography may lag centuries behind. English is notorious for this phenomenon, but it is not the only offender: Swedish, French and Portuguese, among others, all have their orthography/pronunciation discrepancies.

Phonetic Algorithms

So how do we represent things that sound similar but are spelled differently? It’s not trivial, but in most cases it is not impossible either. Soundex is probably the first algorithm to tackle this problem. It is an example of the so-called phonetic algorithms, which attempt to give the same encoding to strings that are pronounced in a similar fashion. Soundex was designed for English only and has its limits. Double Metaphone (DM) is one of the possible replacements, and a relatively successful one. Designed by Lawrence Philips in the early 1990s, it not only deals with native English names but also takes proper care of the foreign names so omnipresent in the language. What is more, it can output two possible encodings for a given name – hence the “Double” in the name of the algorithm – an anglicised version and a native (be it Slavic, Germanic, Greek, Spanish, etc.) one.

By relying on DM one can encode all four names in the title of this post as “PRN”. The name George will get two encodings – JRJ and KRK, the second version reflecting a possible German pronunciation of the name. And a name of Polish origin, like Adamowicz, would also get two encodings – ATMTS and ATMFX, depending on whether you pronounce the “cz” as the English “ch” in “church” or the “ts” in “hats”.

The original implementation by Lawrence Philips allowed a string to be encoded with only 4 characters. However, in most subsequent implementations of the algorithm this option is parameterized or simply omitted.

Apache Commons Codec has an implementation of DM among others (Soundex, Metaphone, RefinedSoundex, ColognePhonetic and Caverphone, to name just a few), and here is a tiny example using it:

import org.apache.commons.codec.language.DoubleMetaphone;

public class DM {

    public static void main(String[] args) {
        String s = "Adamowicz";
        DoubleMetaphone dm = new DoubleMetaphone();

        // Default encoding length is 4! Let's make it 10.
        dm.setMaxCodeLen(10);

        // Remember, DM can output 2 possible encodings:
        System.out.println("Alternative 1: " + dm.doubleMetaphone(s) +
                "\nAlternative 2: " + dm.doubleMetaphone(s, true));
    }
}

The above code will print out:

Alternative 1: ATMTS

Alternative 2: ATMFX

It is also relatively straightforward to do phonetic search with Solr. You just need to make sure that you add phonetic analysis to the field which contains names in your schema.xml.
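
A minimal setup could look something like the following sketch (the field and type names here are purely illustrative; the DoubleMetaphoneFilterFactory is bundled with Solr):

<fieldType name="text_phonetic" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.DoubleMetaphoneFilterFactory" inject="true" maxCodeLength="10"/>
  </analyzer>
</fieldType>

<field name="name_phonetic" type="text_phonetic" indexed="true" stored="false"/>

With inject="true" the original token is kept alongside its phonetic encoding, so exact matches still score as expected.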

Enhancements

While DM performs quite well at first sight, it has its limitations. We should remember that it originated from the English language, and although it aims to handle a variety of non-native borrowings, most of its rules are English-centric. Suppose you work with any of the Scandinavian languages (Swedish, Danish, Norwegian, Icelandic) and one of the names you want to encode is “Örjan”. You will find that “Orjan” and “Örjan” get different encodings – ARJN vs RJN. Why is that? One look under the hood (the implementation in DoubleMetaphone.java) will give you the answer:

private static final String VOWELS = "AEIOUY";

So the Scandinavian vowels “ö”, “ä”, “å”, “ø” and “æ” are not present. If we just add them, then compile and use the new version of the DM implementation, we get the desired output – ARJN for both “Örjan” and “Orjan”.
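
As a sketch, the patched constant in a local copy of DoubleMetaphone.java could simply be extended with the missing vowels (which characters you add depends, of course, on the languages you need to support):

// Patched in a local copy of DoubleMetaphone.java – the Scandinavian vowels are appended
private static final String VOWELS = "AEIOUYÖÄÅØÆ";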

Finally, if you don’t want to use DM, or it is simply not suitable for your task, you can still apply the same principles and create your own encoder, for example by relying on regular expressions. Suppose you have a list of bogus product names which are just (mis)spelling variations of some well-known names, and you want to search for the original name and get back all the ludicrous variants. Here is one, albeit very naïve, way to do it. Given the following names:

CupHoulder

CappHolder

KeepHolder

MacKleena

MackCliiner

MacqQleanAR

Ma’cKcle’an’ar

and with a bunch of regular expressions you can easily encode them as “cphldR” and “mclnR”.

String[] ar = new String[]{"CupHoulder", "CappHolder", "KeepHolder",
        "MacKleena", "MackCliiner", "MacqQleanAR", "Ma'cKcle'an'ar"};

for (String a : ar) {
    a = a.toLowerCase();
    a = a.replaceAll("[ae]r?$", "R");     // normalise the "-er"/"-ar"/"-a" endings to "R"
    a = a.replaceAll("[aeoiuy']", "");    // drop vowels and apostrophes
    a = a.replaceAll("pp+", "p");         // collapse repeated "p"
    a = a.replaceAll("q|k", "c");         // map "q" and "k" to "c"
    a = a.replaceAll("cc+", "c");         // collapse the resulting runs of "c"
    System.out.println(a);
}

You can now easily find all the ludicrous spellings of “CupHolder” and “MacCleaner”.

I hope this blog post gave you some ideas of how you can use phonetic algorithms and their principles in order to better discover names and entities that sound alike but are spelled unlike. At Findwise we have made a number of enhancements to DM in order to make it work better with Swedish, Danish and Norwegian.

References

You can learn more about Double Metaphone from the following article by the creator of the algorithm:
http://drdobbs.com/cpp/184401251?pgno=2

A German phonetic algorithm is the Kölner Phonetik:
http://de.wikipedia.org/wiki/Kölner_Phonetik

SfinxBis is a Swedish-specific phonetic algorithm based on Soundex:
http://www.swami.se/projekt/sfinxbis.68.html

Solr Processing Pipeline

Hi again Internet,

For once I have had time to do some thinking. Why is there no powerful data processing layer between the Lucene Connector Framework and Solr? I've been looking into the Apache Commons Processing Pipeline, and it seems like a likely candidate for doing some cool stuff. Look at the diagram below.

A schematic drawing of a Solr Pipeline concept.

What I'm thinking of is to make a transparent Solr processing pipeline that speaks the Solr REST protocol on each end. This means that you would be able to use SolrJ or any other API to communicate with the pipeline.
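
As a rough sketch of the idea (the pipeline URL below is hypothetical, and the example uses a recent SolrJ client), an indexing application would simply point SolrJ at the pipeline instead of directly at Solr:

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class PipelineClientSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical pipeline endpoint; since the pipeline speaks the Solr REST protocol,
        // the client code is identical to talking to Solr directly.
        SolrClient client = new HttpSolrClient.Builder("http://localhost:8984/pipeline").build();

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");
        doc.addField("title", "Hello pipeline");

        client.add(doc);    // the pipeline would enrich the document and forward it to Solr
        client.commit();
        client.close();
    }
}

The point is that the processing layer stays invisible to the client – swap the URL and everything else keeps working.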

Has anyone attempted this before? If you're interested in chatting about the pipeline, drop me a mail or just grab me at Eurocon in Prague this year.

Findwise releases Open Pipeline Plugins

Findwise is proud to announce that we have now released our first publicly available plugins for the Open Pipeline crawling and document processing framework. A list of all available plugins can be found on the Open Pipeline Plugins page, and the ones Findwise has created can be downloaded from our Findwise Open Pipeline Plugins page.

OpenPipeline is open source software for crawling, parsing, analyzing and routing documents. It ties together otherwise incomplete solutions for enterprise search and document processing. OpenPipeline provides a common architecture for connectors to data sources, file filters, text analyzers and modules to distribute documents across a network. It includes a job scheduler and a full UI with a point-and-click interface.

Findwise has been using this framework in a number of customer projects with great success. It works particularly well together with Apache Solr, not only because it is open source but, most importantly, because it fills a gap in Solr's functionality – an easy-to-use framework for developing document processors and connectors. However, we are not using this for Solr only: a number of plugins for the Google Search Appliance have also been made, and we have started investigating how Open Pipeline can be integrated with the IBM OmniFind search engine as well.

The best thing about this framework is that it is very flexible and customizable but still easy to use AND, maybe most importantly for me as a developer, easy to work with and develop against. It has a simple yet powerful API that handles everything you need. And because it is an open source framework, any shortcomings and limitations that we find along the way can be investigated in detail, and a better solution can be proposed to the Open Pipeline team for inclusion in future releases.

We have in fact already contributed a great deal to the development of the project by using it, testing it and reporting bugs and suggested improvements on their forums. And the response from the team has been very good – some of our suggested improvements have already been included and some are on the way in the new 0.8 version. We are also in the process of deepening the collaboration further by signing a contributor agreement so that we will eventually be able to contribute code as well.

So how do our customers benefit from this?

First, it lets us develop and deliver search and index solutions more quickly and with better quality to our customers. This is because more developers can work with the same framework as a base, and the overall code base is used more, tested more and is thus of better quality. We also have the possibility to reuse good and well-tested components, so that several customers together can share the costs of development and thus get a better service/product for less money – which is always a good thing, of course!

What Differentiates a Good Search Engine from a Bad One?

That was one of the questions the UIE research group asked themselves when conducting a study of on-site search. One of the things they discovered was that the choice of search engine was not as important as the implementation: most of the big search vendors were found among both the top sites and the bottom sites.

So even though the choice of vendor influences what functionality you can achieve and the control you have over your content, there are other things that matter, maybe even more. Because the best search engine in the world will not work for you unless you configure it properly.

According to Jared Spool there are four kinds of search results:

  • ‘Match relevant results’ – returns the exact thing you were looking for.
  • ‘Zero results’ – no relevant results found.
  • ‘Related results’ – i.e. search for a sweater and also get results for a cardigan. (If you know that a cardigan is a type of sweater you are satisfied. Otherwise you just get frustrated and wonder why you got a result for a cardigan when you searched for a sweater).
  • ‘Wacko results’ – the results seem to have nothing in common with your query.

So what did the best sites do according to Jared Spool and his colleagues?
They returned match relevant results, and they did not return zero results for searches.

So how do you achieve that then? We have previously written about the importance of content refinement and information quality. But what do you do when trying to achieve good search results with your search engine? And what if you do not have the time or knowledge to do a proper content tuning process?

Well, the search logs are a good place to start. Look at them to identify the 100 most common searches and the results they return. Are they match relevant results? It is also a good idea to look at the searches that return zero results and see if there is anything that can be done to improve those searches as well.
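
As a simple illustration of how to get going, here is a small sketch. The tab-separated log format (one line per search, holding the query and its number of hits) is an assumption – adapt it to whatever your search platform actually logs:

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class SearchLogSketch {
    public static void main(String[] args) throws Exception {
        // Assumed format: "query<TAB>numberOfHits", one line per search
        List<String[]> rows = Files.lines(Paths.get("search.log"))
                .map(line -> line.split("\t"))
                .filter(parts -> parts.length == 2)
                .collect(Collectors.toList());

        // The 100 most common searches
        Map<String, Long> counts = rows.stream()
                .collect(Collectors.groupingBy(parts -> parts[0], Collectors.counting()));
        counts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .limit(100)
                .forEach(e -> System.out.println(e.getValue() + "\t" + e.getKey()));

        // Searches that return zero results
        rows.stream()
                .filter(parts -> "0".equals(parts[1]))
                .map(parts -> parts[0])
                .distinct()
                .forEach(q -> System.out.println("zero results: " + q));
    }
}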

Jared Spool and his colleagues at UIE mostly talk about site search for e-commerce sites. For e-commerce sites, bad search results mean loss of revenue, while good search results hopefully give an increase in revenue (provided other things, such as checkout, do not fail). Working with intranet search, the implications are a bit different.

With intranet search solutions the searches can be more complex, since it is information, not items, that users are searching for. It might not be as easy as just adding synonyms or grouping similar items to achieve better search results. I believe that in such a complex information universe, proper content tuning is the key to success. But looking at the search logs is a good way for you to start. And my colleagues and I here at Findwise can always help you get the most out of your search solution.

Search as a Tool for Information Quality Assurance

Feedback from stakeholders in ongoing projects has highlighted a real need for a supporting tool to assist in the analysis of large amounts of content. This would introduce a phase where super-users and information owners have the possibility to go through an information quality assurance process across the information silos, before releasing information directly to end users.

Using standard features contained within enterprise search platforms, great value can be delivered and time saved in extracting essential information. Furthermore, you have the possibility to detect key information objects that are otherwise hidden by the lack of a holistic view.

In this way, adapted applications can easily be built on top to support process-specific analysis demands, e.g. through entity extraction (automatic detection and extraction of names, places, dates etc.) and cross-referencing of unstructured and structured sources. The time has come to gain control of your enterprise information and its quality, and to turn it into knowledge.

Search Driven Process to Increase Content Quality

Experience from recent and ongoing search and retrieval projects has shown that enterprises gain a better and deeper insight into their content when deploying a new search platform and search driven solutions – not only in unstructured content repositories, but also in structured sources. As information is indexed and visualised in a more user-friendly way, it does not take much time before the people responsible find content issues that are brought out into the light: content that is misplaced or wrongly tagged, documents with poorly defined security information, and so on. These are issues that were earlier hidden due to the lack of a holistic view of the content, and they lead to setting up search driven processes to increase content quality.

It has been said that before an enterprise deploys an enterprise search solution, it should get a completely clear picture of all its content; but maybe one should reformulate this and also think of an enterprise search solution as a supporting tool in the process of improving the content.

Taking it a step further would be to allow write-backs from the search engine to the content sources, to enrich and improve the quality and completeness of the stored information. Tune search quality and content quality at the same time! Make the content quality assurance process search driven!