PIM is for storage

– Add search for distribution, customization and seamless multichannel experiences.

Retailers, e-commerce and product data
Having met a number of retailers to discuss information management, we’ve noticed they all experience the same problem. Products are (obviously) central and information is typically stored in a PIM or DAM system. So far so good: these systems do the trick when it comes to storing and managing fundamental product data. However, when trying to embrace current e-commerce trends¹, such as mobile friendliness, multichannel selling and connecting products to other content, PIM systems are not really helping. As it turns out, PIM is great for storage but not for distribution.

Retailers need to distribute product information across various channels: online stores, mobile and desktop, spreadsheet exports, and subsets of data with adjustments for different markets and industries. They also need to connect products to availability, campaigns, user-generated content and fast-changing business rules. Add to this the need to close the analytics feedback loop, and the IT department realises that PIM (or DAM) is not the answer.

Product attributes

Adding search technology for distribution
Whereas PIM is great for storage, search technology is the champ not only for searching but also for distribution. You may have heard the popular saying “Create Once, Publish Everywhere”? Well, search technology actually gives meaning to it. Gather any data (PIM, DAM, ERP, CMS), connect it to other data and display it across multiple channels and contexts.

Also, with the i3² package of components you can add information (metadata) or logic that is not available in the PIM system, all while the source data stays intact: there is no altering, copying or moving.

Combined with a taxonomy for categorising information, you’re good to go. You can now enrich products and connect them to other products and information (processing service). Categorise content according to the product taxonomy and be done. Performance will be very high, as content is denormalised and stored in the search engine, ready for multichannel distribution. With this setup you can also easily add new sources to enrich products or modify relevance. Who knows what information will be relevant for products in the future?
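
To make the idea of denormalised content a bit more concrete, here is a minimal Python sketch of how a product record could be combined with a taxonomy and campaign data into one flat document before it is sent to a search engine. The function name, the field names and the sample data are all hypothetical and only serve as an illustration; this is not how the i3 components are implemented.

```python
from typing import Any, Dict, List


def denormalise(product: Dict[str, Any],
                campaigns: List[Dict[str, Any]],
                taxonomy: Dict[str, List[str]]) -> Dict[str, Any]:
    """Build one flat search document from a product and its related data."""
    # Work on a copy so the source (PIM) record is left untouched.
    doc = dict(product)
    # Enrich with the full category path from the taxonomy.
    doc["category_path"] = taxonomy.get(product["category"], [])
    # Connect the product to the campaigns that reference it.
    doc["campaigns"] = [c["name"] for c in campaigns
                        if product["id"] in c["product_ids"]]
    return doc


# Hypothetical sample data.
product = {"id": "p1", "name": "Rain jacket", "category": "jackets"}
campaigns = [
    {"name": "Autumn sale", "product_ids": ["p1", "p2"]},
    {"name": "Ski week", "product_ids": ["p3"]},
]
taxonomy = {"jackets": ["Clothing", "Outerwear", "Jackets"]}

doc = denormalise(product, campaigns, taxonomy)
```

Because everything a channel needs is already in the document, a query does not have to join data from several systems at request time.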

To summarise

  • PIM for input, search for output. Design for distribution!
  • Use PIM for managing products, not for managing business rules.
  • Add metadata and taxonomies to tailor product information for different channels.
  • Connect products to related content.
  • Use stand-alone components based on open source for strong TCO and flexibility.

¹ Gartner for marketers
² The Findwise i3 package of components (for indexing, processing, searching and analysing data) is compatible with the open source search engines Apache Solr and Elasticsearch.

Understanding politics with Watson using Text Analytics

Understanding which topics actually are important to different political parties is a difficult task. Can text analytics together with a search index be an approach that gives a better understanding?

This blog post describes how IBM Watson Explorer Content Analytics (WCA) can be used to make sense of Swedish politics. All speeches (in Swedish: anföranden) in the Swedish Parliament from 2004 to 2015 are analyzed using WCA. In total 139 110 transcribed text documents were analyzed. The Swedish language support built by Findwise for WCA is used together with a few text analytics processing steps that parse out person names, political parties, dates and topics of interest. The selected topics in this analysis are all related to infrastructure and different types of fuels.

We start by looking at how some of the topics are mentioned over time.

Analysis of terms of interest in the Swedish parliament between 2004 and 2014.

The view shows topics that have a higher number of mentions than would be expected during a given year. Here we can see, among other topics, that the topic flygplats (airport) has a sharp increase in the number of mentions during 2014.

So let’s dive down and see what is being said about the topic flygplats during 2014.

Swedish political parties mentioning Bromma Airport during 2014.

The above image shows how the different political parties mention the topic flygplats during the year 2014. The blue bar shows the number of times the topic flygplats was mentioned by each political party during the year. The green bar shows the WCA correlation value, which indicates how strongly related a term is to the current filter. What we can conclude is that the party Moderaterna mentioned flygplats more frequently during 2014 than the other parties.

Reviewing the most correlated nouns when filtering on flygplats and the year 2014 shows, among other nouns, Bromma (a place in Sweden), airport and nedläggning (closing). This gives some idea of what was discussed during the period. Filtering on the speeches held by Moderaterna and reading some of them makes it clear that Moderaterna is against closing Bromma airport.

The text analytics and the index provided by WCA help us discover trending topics over time and give us a tool for understanding who talked about a subject and what was said.

All the different topics about infrastructure can together form a single topic for infrastructure. Speeches that mention tåg (train), bredband (broadband) or any other defined term for infrastructure are also tagged with the topic infrastructure. This wider concept of infrastructure can of course also be viewed over time.

Discussions in the Swedish parliament mentioning the defined terms that make up the subject infrastructure, 2004 to 2015.

Another way of finding which parties are most correlated to a subject is to compare a pair of facets. The following table shows parties highly related to terms regarding infrastructure and types of fuel.

Swedish political parties highly correlated to subjects regarding infrastructure and types of fuel.

Let’s start by explaining the first row in order to understand the table. Mobilnät (mobile network) has only been mentioned 44 times by Centerpartiet, but Centerpartiet is still highly related to the term, with a WCA correlation value of 3.7. This means that Centerpartiet has a higher share of its speeches mentioning mobilnät compared to other parties. The table indicates that two parties, Centerpartiet and Miljöpartiet, are more engaged in the infrastructure topics than other political parties.
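
The intuition behind such a value can be sketched in a few lines of Python. Note that this is a simplified ratio of shares, not the formula WCA actually uses (the post does not give it), and that it compares against all speeches rather than only those of other parties. Only the 44 mentions and the 139 110 speeches come from this post; the remaining counts are made up for illustration.

```python
def share_ratio(party_mentions: int, party_speeches: int,
                all_mentions: int, all_speeches: int) -> float:
    """Ratio between a party's share of speeches mentioning a term
    and the share of all speeches mentioning it.

    A value above 1 means the party mentions the term more often,
    relative to how much it speaks, than parliament as a whole.
    """
    party_share = party_mentions / party_speeches
    overall_share = all_mentions / all_speeches
    return party_share / overall_share


# 44 and 139 110 are from the post; 10 000 and 200 are hypothetical.
ratio = share_ratio(party_mentions=44, party_speeches=10_000,
                    all_mentions=200, all_speeches=139_110)
```

A party with few speeches in total can therefore score high on a term even when the absolute number of mentions is low, which is what the first row of the table illustrates.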

Swedish parties mentioning the defined concept of infrastructure.

Filtering on the concept infrastructure also shows that Miljöpartiet and Centerpartiet are the two parties that have the highest share of speeches mentioning the defined infrastructure topics.

Interested in digging deeper into the data? Parsing written text with text analytics is a successful approach for increasing the understanding of subjects such as politics. Using IBM Watson Explorer Content Analytics makes it easy. Most of the functionality used in this example is also available out of the box in WCA.

Swedish language support (natural language processing) for IBM Content Analytics (ICA)

Findwise has now extended the NLP (natural language processing) in ICA to include both support for Swedish PoS tagging and Swedish sentiment analysis.

IBM Content Analytics with Enterprise Search (ICA) has its strength in natural language processing (NLP), which is achieved in the UIMA pipeline. From a Swedish perspective, one concern with ICA has always been its lack of NLP for Swedish. Previously the Swedish support in ICA consisted only of dictionary-based lemmatization (word: “sprang” -> lemma: “springa”). However, for a number of other languages ICA has also provided part of speech (PoS) tagging and sentiment analysis. One of the benefits of the PoS tagger is its ability to disambiguate words that belong to multiple classes (e.g. “run” can be both a noun and a verb) as well as to assign tags to words that are not found in the dictionary. Furthermore, the PoS tagger is crucial when it comes to improving entity extraction, which is important when a deeper understanding of the indexed text is needed.

Findwise has now extended the NLP in ICA to include both support for Swedish PoS tagging and Swedish sentiment analysis. The two images below show simple examples of the PoS support.

Example when ICA uses NLP to analyse the string "ICA är en produkt som klarar entitetsextrahering"

Example when ICA uses NLP to analyse the string "Watson deltog i jeopardy"

The question is: how can this extended functionality be used?

IBM uses ICA and its NLP support together with several of its products. The Jeopardy-playing computer Watson may be the most famous example, even if it is not a real product. Watson used NLP in its UIMA pipeline when it analyzed data from sources such as Wikipedia and IMDb.

One product that leverages ICA and its NLP capabilities is Content and Predictive Analytics for Healthcare. This product helps doctors determine which action to take for a patient given the patient’s medical record and symptoms. By also leveraging the predictive analytics from SPSS, it is possible to suggest the next action for the patient.

ICA can also be connected directly to IBM Cognos or SPSS, where ICA is the tool that brings structure to unstructured data. By using the NLP or sentiment analytics in ICA, structured data can be extracted from text documents. This data can then be fed to IBM Cognos, SPSS or non-IBM products such as Splunk.

ICA can also be used on its own as a text miner or a search platform, but in many cases ICA delivers its maximum value together with other products. ICA is a product that helps enrich data by bringing structure to unstructured data. The processed data can then be used by other products that normally work with structured data.

Predictive Analytics World 2012

At the end of November 2012, top predictive analytics experts, practitioners, authors and business thought leaders met in London at the Predictive Analytics World conference. The intimate nature of the conference, combined with the great variety of experience brought by over 60 attendees and speakers, made it a unique opportunity to dive into the topic from a Findwise perspective.

Dive into Big Data

In the opening keynote, presented by Program Chairman Geert Verstraeten, PhD, we heard about ways to increase the impact of Predictive Analytics. Unsurprisingly, a lot of the buzz is about embracing Big Data. As analysts have more and more data to process, their need for new tools is obvious. But business will cherish Big Data platforms only if it sees the value behind them. Thus, in my opinion, before anything else that affects successful Big Data Analytics, we should consider improving business-oriented communication. Even the most valuable data has no value if you can’t convince decision makers that it’s worth digging into.

But being able to clearly present benefits is not everything. Analysts must strive to create specific indicators and variables that are empirically measurable. Choose the right battles. As Gregory Piatetsky (data mining and predictive analytics expert) said: more data beats better algorithms, but better questions beat more data.

Finally, aim for impact. If you have a call center and want to persuade customers not to cancel your services, then it’s not wise just to call everyone. But it might also not be wise to call everyone you predict to have a high risk of leaving. Even if as a result you lose fewer clients, there might be a large group of customers that will leave only because of the call. Such customers may also be predicted. And as you split clients at high risk of leaving into “persuadable” ones and “touchy” ones, you are able to fully leverage your analytics potential.
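
The split into “persuadable” and “touchy” customers can be illustrated with a small Python sketch. It assumes, purely for illustration, that a model already provides two churn probabilities per customer: one if the customer is called and one if not. The class, the threshold and the sample numbers are hypothetical and were not presented at the conference.

```python
from dataclasses import dataclass


@dataclass
class Customer:
    name: str
    churn_if_not_called: float  # hypothetical model output
    churn_if_called: float      # hypothetical model output


def segment(customer: Customer, risk_threshold: float = 0.5) -> str:
    """Classify a customer by how a call is predicted to change churn risk."""
    # Positive uplift: the call is predicted to reduce the risk of leaving.
    uplift = customer.churn_if_not_called - customer.churn_if_called
    if uplift < 0:
        return "touchy"        # the call itself makes leaving more likely
    if customer.churn_if_not_called >= risk_threshold and uplift > 0:
        return "persuadable"   # high risk, and the call helps
    return "leave alone"       # low risk or no predicted effect


customers = [
    Customer("Anna", churn_if_not_called=0.8, churn_if_called=0.3),
    Customer("Bo", churn_if_not_called=0.2, churn_if_called=0.7),
    Customer("Cia", churn_if_not_called=0.1, churn_if_called=0.1),
]

# Only the persuadable customers end up on the call list.
to_call = [c.name for c in customers if segment(c) == "persuadable"]
```

The point of the design is that the call list is driven by the predicted effect of the call, not by the churn risk alone.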

Find it exciting

The greatest thing about Predictive Analytics World 2012 was how diverse the presentations were. Many successful business cases from a large variety of domains and a lot of inspiring speeches make it hard not to get at least a bit excited about Predictive Analytics.

They ranged from banking and financial scenarios to sports training and performance prediction for a rugby team (if you like at least one of baseball, Predictive Analytics or Brad Pitt, I recommend you watch the movie Moneyball), not to mention a case study about reducing youth unemployment in England. But there are two particular presentations I would like to say a word about.

The first was a case study on Predicting Investor Behavior in First Social Media Sentiment-Based Hedge Fund, presented by Alexander Farfuła, Chief Data Scientist at MarketPsy Capital LLC. I find it very interesting because it shows how powerful Big Data can be. By using massive amounts of social media data (e.g. Twitter), they managed to predict a lot of global market behavior in certain industries. That is the essence of Big Data: harness a large number of small information chunks that are useless alone to get a useful Big Picture.

The second was presented by Martine George, Head of Marketing Analytics & Research at BNP Paribas Fortis in Belgium. She gave a really great presentation about developing and growing teams of predictive analysts. As the topic is a lively one at Findwise, and probably in every company interested in analytics and Big Data, I was pleased to learn so much and to talk about it later on in person.

Big (Data) Picture

The day after the conference, John Elder from Elder Research led an excellent workshop. What was really nice is that we concentrated on the concepts, not the equations. It was like a semester in one day: a big picture that can be digested into technical knowledge over time. The most valuable general conclusion was twofold:

  • Leverage – an incremental improvement will matter! When your turnover is counted in millions of dollars, even a half-percent saving means a large amount of additional revenue.
  • Low hanging fruit – there is a lot to gain from what nobody else has tried yet. That includes reaching for new kinds of data (text data, social media data) and daring to make use of it in a new, cool way with tools that weren’t there a couple of years ago.

Plateau of Productivity

As a conclusion, I would say that Predictive Analytics has become a mature discipline and one of the most useful on the market. In terms of the famous Gartner Hype Cycle, Predictive Analytics has reached the Plateau of Productivity. Though often thankless and demanding in resources, money and time, it can offer your company a successful future.

Video: Introducing Hydra – An Open Source Document Processing Framework

Introducing Hydra – An Open Source Document Processing Framework, presented at Lucene Revolution and hosted on Vimeo.

Presented by Joel Westberg, Findwise AB
This presentation details the document-processing framework called Hydra that has been developed by Findwise. It is intended as a description of the framework and the problem it aims to solve. We will first discuss the need for scalable document processing, outlining that there is a missing link in the open source chain to bridge the gap between the source system and the search engine, then move on to describe the design goals of Hydra, as well as how it has been implemented to meet those demands on flexibility, robustness and ease of use. The session will end by discussing some of the possibilities that this new pipeline framework can offer, such as seamlessly scaling up the solution during peak loads, metadata enrichment, and proposed integration with Hadoop for Map/Reduce tasks such as page rank calculations.

Distributed processing + search == true?

In June 2011, I attended the Berlin Buzzwords conference. The main theme of the conference was undoubtedly the current paradigm shift in distributed processing, driven by the major success of Hadoop. Doug Cutting – founder of Apache projects such as Lucene, Nutch and Hadoop – held one of the keynotes. He focused on what he recognized as the new foundations for this paradigm shift:

– Commodity hardware
– Sequential file access
– Sharding
– Automated, high level reliability
– Open source

Distributed processing is done fairly well with Hadoop. Distributed search, on the other hand, is more or less limited to sharding and/or replicating the index. The downside of sharding is that you perform the same search on multiple servers and then need to combine the results. Due to the nature of search algorithms such as tf/idf, which rely on statistics computed across the whole index, tasks like ranking results suffer. Andrzej Białecki (another frequent Lucene committer) held a presentation on this topic, and his view can be summarized as: Use local search as long as you can, distribute only when the cost of local search limitations outweighs the cost of distributed search.
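
Why ranking suffers can be shown with a toy Python example. It uses the textbook definition idf = log(N / df), not Lucene’s exact scoring formula, and two tiny made-up shards. The same term gets a different weight on each shard, and neither matches the weight it would get in a single index.

```python
import math
from typing import List


def idf(term: str, docs: List[str]) -> float:
    """Textbook inverse document frequency: log(N / df)."""
    df = sum(1 for doc in docs if term in doc.split())
    return math.log(len(docs) / df) if df else 0.0


# Two hypothetical shards with an uneven distribution of the term "solr".
shard_a = ["solr search", "solr index", "solr shard", "hadoop job"]
shard_b = ["hadoop cluster", "hadoop job", "hadoop hdfs", "solr search"]

local_a = idf("solr", shard_a)               # common on shard A: low weight
local_b = idf("solr", shard_b)               # rare on shard B: high weight
global_idf = idf("solr", shard_a + shard_b)  # weight in one combined index
```

Scores computed with `local_a` and `local_b` are not directly comparable, so simply merging the top hits from each shard can order results differently than a single index would.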

The setup of automated replication and sharding, with help from ZooKeeper in the SolrCloud project, is a major step in the right direction, but the question of how to properly combine search results from different nodes still remains. One thing is sure though: there is a lot of interesting work being done in this area.

Development Techniques for Solr: Structure First or Structure Last?

I’d like to share two different development techniques for Solr that I commonly use when setting up an Apache Solr project. To explain them I’ll start by introducing the way I used to work. (The wrong way 😉 )

Development Techniques for Solr: The Structure First

Since I work as an enterprise search consultant I come across a lot of different data sources. All of these data sources have at least some structure, some more than others.

My objective as a backend developer was then to first of all figure out how the data source was structured and then design a Solr schema that fit the requirements, both technical and business.

The problem with this was of course that the requirements were quite fuzzy until I actually figured out how the data was structured and even more importantly what the data quality was.

In many cases I would spend a lot of time on extracting a date from the source, converting it to an ISO 8601 date format (supported by Solr), updating the schema with that field and then finally reindexing, only to learn that the date was either not required or of too poor quality to be used.

My point being that I spent a lot of time designing a schema (and connector) for a source which I, and most others, knew almost nothing about.
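
As an illustration of the kind of date conversion described above, here is a minimal Python sketch. The source format and the assumption that the source timestamps are already in UTC are made up for the example; a real source system will differ. Values that cannot be parsed are returned as None, which is a simple way to surface poor data quality.

```python
from datetime import datetime, timezone
from typing import Optional


def to_solr_date(raw: str, fmt: str = "%d/%m/%Y %H:%M") -> Optional[str]:
    """Convert a source timestamp to the ISO 8601 UTC form Solr accepts.

    Assumes (hypothetically) that the source value is already in UTC.
    Returns None when the value does not match the expected format.
    """
    try:
        parsed = datetime.strptime(raw, fmt).replace(tzinfo=timezone.utc)
    except ValueError:
        return None
    return parsed.strftime("%Y-%m-%dT%H:%M:%SZ")


good = to_solr_date("24/12/2011 18:30")  # '2011-12-24T18:30:00Z'
bad = to_solr_date("n/a")                # None
```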

Development Techniques for Solr: The Structure Last

Ok so what’s the supposed “right way” of doing this?

In Solr there is a concept called dynamic fields. It allows you to map fields whose names match a certain pattern to a specific type. In the example Solr schema you can find the following section:

<!-- uncomment the following to ignore any fields that don't already match an existing
     field name or dynamic field, rather than reporting them as an error.
     alternately, change the type="ignored" to some other type e.g. "text" if you want
     unknown fields indexed and/or stored by default -->
<!--dynamicField name="*" type="ignored" multiValued="true" /-->

Once uncommented, the section above will drop any fields that are not explicitly declared in the schema. But what I usually do to start with is the complete opposite. I map all fields to a string type.

<dynamicField name="*" type="string" multiValued="true" indexed="true" stored="true"/>

I start with a minimalist schema that only has an id field and the above stated dynamic field.

With this schema it doesn’t matter what I do, everything is mapped to a string field, exactly as it is entered.

This allows me to focus on getting the data into Solr without caring about what to name the fields, what properties they should have and, most importantly, without even having to declare them at all.

Instead I can focus on getting the data out of the source system and then into Solr. When that’s done I can use Solr’s schema browser to see which fields are of high quality, contain a lot of text or are suited to be used as facets, and use this information to help out in the requirements process.
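
To give an idea of what to look for, here is a small standalone Python sketch that computes similar statistics over a handful of in-memory documents. It does not call Solr and is not how the schema browser works internally. The function name, the chosen statistics and the sample documents are hypothetical.

```python
from collections import defaultdict
from typing import Any, Dict, List


def profile_fields(docs: List[Dict[str, Any]]) -> Dict[str, Dict[str, float]]:
    """Per field: share of docs with a value, distinct values, mean length."""
    values = defaultdict(list)
    for doc in docs:
        for field, value in doc.items():
            # Empty values say something about data quality, so skip them.
            if value not in (None, ""):
                values[field].append(str(value))
    return {
        field: {
            "fill_rate": len(vals) / len(docs),
            "distinct": len(set(vals)),
            "avg_length": sum(len(v) for v in vals) / len(vals),
        }
        for field, vals in values.items()
    }


# Hypothetical documents as they might come out of a source system.
docs = [
    {"id": "1", "type": "pdf", "body": "A long text about search"},
    {"id": "2", "type": "pdf", "body": "Another long text", "date": ""},
    {"id": "3", "type": "doc", "body": "More text"},
]

stats = profile_fields(docs)
```

A field with a high fill rate and few distinct values (like type here) is a facet candidate, while long values suggest a full-text field. A field that never has a value, like date in this sample, does not show up at all.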

The Structure Last Technique lets you be more pragmatic about your requirements.