PIM is for storage

– Add search for distribution, customization and seamless multichannel experiences.


Retailers, e-commerce and product data
Having met a number of retailers to discuss information management, we’ve noticed that they all experience the same problem. Products are (obviously) central, and product information is typically stored in a PIM or DAM system. So far so good: these systems do the trick when it comes to storing and managing fundamental product data. However, when trying to embrace current e-commerce trends¹ such as mobile friendliness, multi-channel selling and connecting products to other content, PIM systems are not much help. As it turns out, PIM is great for storage but not for distribution.

Retailers need to distribute product information across various channels – online stores on mobile and desktop, spreadsheet exports, subsets of data adjusted for different markets and industries. They also need to connect products to availability, campaigns, user-generated content and fast-changing business rules. Add to this the need to close the analytics feedback loop, and the IT department realises that PIM (or DAM) is not the answer.

Product attributes

Adding search technology for distribution
Whereas PIM is great for storage, search technology is the champion not only of searching but also of distribution. You may have heard the popular phrase Create Once, Publish Everywhere? Well, search technology actually gives meaning to the saying: gather any data (PIM, DAM, ERP, CMS), connect it to other data and display it across multiple channels and contexts.

Also, with the i3² package of components you can add information (metadata) or logic that is not available in the PIM system. All this while the source data stays intact – nothing is altered, copied or moved.

Combined with a taxonomy for categorising information, you’re good to go. You can now enrich products and connect them to other products and information (via a processing service), categorise content according to the product taxonomy, and be done. Performance will be very high, as content is denormalised and stored in the search engine, ready for multi-channel distribution. With this setup you can also easily add new sources to enrich products or modify relevance – who knows what information will be relevant for products in the future?
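As a very rough illustration of what such a denormalised product document might look like when it is pushed into a search engine (here Solr via SolrJ; the URL, collection and field names are made up for the example and are not part of the i3 components):

import java.util.Arrays;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class ProductIndexSketch {
    public static void main(String[] args) throws Exception {
        HttpSolrClient solr = new HttpSolrClient("http://localhost:8983/solr/products");

        // One product document, denormalised with data from several sources.
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "SKU-12345");                                  // from PIM
        doc.addField("name", "Trail running shoe");                       // from PIM
        doc.addField("categories", Arrays.asList("Shoes", "Running"));    // from the taxonomy
        doc.addField("in_stock", true);                                   // from ERP/availability
        doc.addField("campaigns", "spring-sale");                         // from the campaign system
        doc.addField("related_content_ids", Arrays.asList("GUIDE-42"));   // linked editorial content

        solr.add(doc);
        solr.commit();
        solr.close();
    }
}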

To summarise

  • PIM for input, search for output. Design for distribution!
  • Use PIM for managing products, not for managing business rules.
  • Add metadata and taxonomies to tailor product information for different channels.
  • Connect products to related content.
  • Use stand-alone components based on open source for strong TCO and flexibility.

References
¹ Gartner for marketers
² The Findwise i3 package of components (for indexing, processing, searching and analysing data) is compatible with the open source search engines Apache Solr and Elasticsearch.

What’s new in Apache Solr 6?

Apache Solr 6 has been released recently! There are a few important technical changes to keep in mind: Lucene/Solr 4.x indexes can no longer be read, and Java 8 is now required. But in my opinion the most interesting part is the new features, which clearly follow current trends: a SQL engine on top of Solr, graph search, and replicating data across different data centers.


One of the most promising topics among the new features is the Parallel SQL Interface. In brief, it is the ability to run SQL queries on top of SolrCloud (only Cloud mode is supported right now). It can be very interesting to combine full-text capabilities with well-known SQL statements.
Solr internally uses the Presto SQL parser (Presto is a SQL query engine that works with various types of data stores). It is responsible for translating SQL statements into Streaming Expressions, since the Solr SQL engine is based on the Streaming API.
Thanks to that, SQL queries can be executed on worker nodes in parallel. There are two implementations of grouping results (aggregations). The first one is based on a map-reduce algorithm and the second one uses Solr facets. The basic difference is the number of fields used in the grouping clause. The Facet API can be used for better performance, but only when the GROUP BY isn’t complex. If it is, it’s better to try aggregationMode=map_reduce.
From a developer’s perspective it’s really transparent. A simple statement like “SELECT field1 FROM collection1” is translated to the proper fields and collection. Right now clauses like WHERE, ORDER BY, LIMIT, DISTINCT and GROUP BY can be used.
Solr still doesn’t support the whole SQL language, but even so it’s a powerful feature. First of all, it can make beginners’ lives easier, since the relational world is commonly known. What is more, I imagine this can be useful during IT system migrations or when collecting data from Solr for further analysis. I hope to hear about many different case studies in the near future.
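As a minimal sketch of how this can be used from Java, the snippet below goes through the JDBC driver that ships with solr-solrj; the ZooKeeper address (localhost:9983, the embedded ZooKeeper of a local SolrCloud), the collection name and the field name are assumptions for illustration:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SolrSqlSketch {
    public static void main(String[] args) throws Exception {
        // The solr-solrj jar on the classpath provides the JDBC driver.
        // The connection string points at ZooKeeper; aggregationMode selects facet or map_reduce grouping.
        String url = "jdbc:solr://localhost:9983?collection=collection1&aggregationMode=facet";
        try (Connection con = DriverManager.getConnection(url);
             Statement stmt = con.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT field1 FROM collection1 LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getString("field1"));
            }
        }
    }
}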

Apache Solr 6 also introduces a topic which is crucial wherever a search engine is a business-critical system: cross data center replication (CDCR).
Because SolrCloud was designed to support near real-time (NRT) searching, it did not work well when cluster nodes were distributed across different data centers: the leaders, replicas and synchronization operations generate too much communication overhead over long distances.

The new feature is in an experimental phase and still under development, but for now we get an active-passive mode, where data is pushed from the source data center to the target data center. Documents can be sent in real time or according to a schedule. Every leader in the active cluster asynchronously sends updates to the corresponding leader in the passive cluster. After that, the target leaders replicate the changes to their replicas as usual.
CDCR is crucial when we think about distributed systems working in high-availability mode. It is always relevant to disaster recovery, scaling and avoiding single points of failure (SPOF). Please visit the documentation page for details and plans for the future: https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=62687462

What if your business works in a highly connected environment where data relationships matter, but you still want to benefit from full-text search? Solr 6 has good news for you – the graph traversal functionality.
A lot of enterprises know that focusing on relations between documents and on graph data modeling is the future. Now you can build Solr queries that let you discover information organized in nodes and edges. You can explore your collections in terms of data interactions and connections between particular data elements. Use cases range from the semantic search area (query augmentation, using ontologies etc.) to more prosaic ones, like organization security roles or access control.
Graph traversal queries are still a work in progress, but they can be used already, and the basic syntax is really simple: fq={!graph from=parent_id to=id}id:"DOCUMENT_ID"
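A minimal SolrJ sketch of the same filter query could look like this; the Solr URL, collection name and the parent_id/id fields follow the example above and are otherwise assumptions:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class GraphQuerySketch {
    public static void main(String[] args) throws Exception {
        HttpSolrClient solr = new HttpSolrClient("http://localhost:8983/solr/collection1");
        SolrQuery query = new SolrQuery("*:*");
        // Follow parent_id -> id edges, starting from one document.
        query.addFilterQuery("{!graph from=parent_id to=id}id:\"DOCUMENT_ID\"");
        QueryResponse response = solr.query(query);
        for (SolrDocument doc : response.getResults()) {
            System.out.println(doc.getFieldValue("id"));
        }
        solr.close();
    }
}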

The last Solr 6 improvement I am going to mention is the new default scoring algorithm, BM25. In fact, it is a change forced by Apache Lucene 6, where BM25 is now the default similarity implementation. Similarity is the process that determines which documents are similar to the query and to what extent. Many different factors influence a document’s score, for example the number of search terms found in the document, how common those terms are across the whole collection, and the document length. This is where BM25 improves scoring: it takes into consideration the average length of the documents (fields) across the entire corpus, and it does a better job of limiting the impact of term frequency on the results ranking.
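For reference, the commonly cited formulation of BM25 (the textbook formula, not code copied from Lucene) is:

score(D, Q) = \sum_{i=1}^{n} IDF(q_i) \cdot \frac{f(q_i, D)\,(k_1 + 1)}{f(q_i, D) + k_1 \left(1 - b + b \cdot \frac{|D|}{avgdl}\right)}

where f(q_i, D) is the frequency of query term q_i in document D, |D| is the length of the document (field), avgdl is the average field length across the corpus, and k_1 and b are tuning parameters (Lucene’s BM25Similarity defaults to k_1 = 1.2 and b = 0.75). The |D|/avgdl factor is the length normalisation mentioned above, and k_1 bounds how much a repeated term can contribute to the score.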

As we can see, Apache Solr 6 provides us with many new features, and those mentioned above are not all of them. We’re going to write more about the new functionality soon. Until then, we encourage you to try the newest Solr on your own and remember: don’t hesitate to contact us in case of any problems!

Using log4j in Tomcat and Solr and How to Make a Customized File Appender

This article shows how to use log4j for both Tomcat and Solr. Besides that, I will also show you the steps to make your own customized log4j appender and use it in Tomcat and Solr. If you want more information than is found in this blog post, feel free to visit our website or contact us.

Default Tomcat log mechanism

Tomcat by default uses a customized version of the java logging api. The configuration is located at ${tomcat_home}/conf/logging.properties. It follows the standard java logging configuration syntax, plus some special tweaks (prefixing properties with a number) for identifying the logs of different web apps.

An example is below:

handlers = 1catalina.org.apache.juli.FileHandler, 2localhost.org.apache.juli.FileHandler, 3manager.org.apache.juli.FileHandler, 4host-manager.org.apache.juli.FileHandler, java.util.logging.ConsoleHandler

.handlers = 1catalina.org.apache.juli.FileHandler, java.util.logging.ConsoleHandler

1catalina.org.apache.juli.FileHandler.level = FINE

1catalina.org.apache.juli.FileHandler.directory = ${catalina.base}/logs

1catalina.org.apache.juli.FileHandler.prefix = catalina.

2localhost.org.apache.juli.FileHandler.level = FINE

2localhost.org.apache.juli.FileHandler.directory = ${catalina.base}/logs

2localhost.org.apache.juli.FileHandler.prefix = localhost.

Default Solr log mechanism

Solr uses slf4j logging, which is a kind of wrapper for other logging mechanisms. By default, Solr uses log4j syntax but wraps the java logging api (which means that it looks like you are using log4j in the code, but it is actually java logging underneath). It uses the Tomcat logging.properties as its configuration file. If you want to define your own, place a logging.properties file under ${tomcat_home}/webapps/solr/WEB-INF/classes/

Switching to Log4j

Log4j is a very popular logging framework, which I believe is mostly due to its simplicity in both configuration and usage. It has richer logging features than java logging, and it is not difficult to extend.

Log4j for tomcat

  1. Rename/remove ${tomcat_home}/conf/logging.properties
  2. Add log4j.properties in ${tomcat_home}/lib
  3. Add log4j-xxx.jar in ${tomcat_home}/lib
  4. Download tomcat-juli-adapters.jar from extras and put it into ${tomcat_home}/lib
  5. Download tomcat-juli.jar from extras and replace the original version in ${tomcat_home}/bin

(The extras are additional jar files for special Tomcat setups; they can be found in the bin folder of a Tomcat download location, e.g. http://archive.apache.org/dist/tomcat/tomcat-6/v6.0.33/bin/extras/)

Log4j for solr

  1. Add log4j.properties in ${tomcat_home}/webapps/solr/WEB-INF/classes/ (create classes folder if not present)
  2. Replace slf4j-jdkxx-xxx.jar with slf4j-log4jxx-xxx.jar in ${tomcat_home}/webapps/solr/WEB-INF/lib (which switches the underlying implementation from java logging to log4j logging)
  3. Add log4jxxx.jar to ${tomcat_home}/webapps/solr/WEB-INF/lib

Make our own log4j file appender

Log4j has two common types of file appender:

  • DailyRollingFileAppender – rolls over at a certain time interval
  • RollingFileAppender – rolls over when a certain size limit is reached

I also found a nice customized file appender online:

  • CustodianDailyRollingFileAppender

I happened to need a file appender that rolls over at a certain time interval (each day), backs up earlier logs as zip files in a backup folder, and removes logs older than a certain number of days. CustodianDailyRollingFileAppender already has the rollover feature, so I decided to start by making a copy of that class.

Parameters

Besides the default parameters in DailyRollingFileAppender, I need two more parameters:

Outdir – backup directory

maxDaysToKeep – the number of days to keep the log file

You only need to define these two parameters in the new class and add get/set methods for them (no constructor involved). The rest is handled by the log4j framework.
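A minimal sketch of that part of the class might look as follows; the class name mirrors the placeholder used in the configuration example further down, the package is omitted, and for brevity the sketch extends log4j’s DailyRollingFileAppender rather than a copy of CustodianDailyRollingFileAppender:

import org.apache.log4j.DailyRollingFileAppender;

public class YyRollingFileAppender extends DailyRollingFileAppender {

    // Backup directory for zipped, rolled-over log files (maps to the Outdir config property).
    private String outdir;

    // Number of days to keep old log files (maps to the MaxDaysToKeep config property).
    private int maxDaysToKeep = 7;

    public String getOutdir() {
        return outdir;
    }

    public void setOutdir(String outdir) {
        this.outdir = outdir;
    }

    public int getMaxDaysToKeep() {
        return maxDaysToKeep;
    }

    public void setMaxDaysToKeep(int maxDaysToKeep) {
        this.maxDaysToKeep = maxDaysToKeep;
    }
}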

Logging entry point

When a log event arrives, the subAppend(…) method is called, and inside it super.subAppend(event) does the actual writing of the log record. So before that call we can add our backup and cleanup mechanisms.
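Inside the same class as above, that could look roughly like this; cleanupOldLogs() and backupIfNeeded() are hypothetical helper names for the two steps described in the next sections:

// LoggingEvent is org.apache.log4j.spi.LoggingEvent.
@Override
protected void subAppend(LoggingEvent event) {
    cleanupOldLogs();        // delete log files older than maxDaysToKeep
    backupIfNeeded();        // zip the previously rolled-over log file into outdir
    super.subAppend(event);  // let log4j do the actual writing
}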

Clean up old log

Use a file filter to find all log files whose names start with the configured filename, and delete those that are older than maxDaysToKeep.
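A sketch of that cleanup helper (again inside the appender class; getFile() is inherited from FileAppender and returns the configured log file path):

// Uses java.io.File and java.io.FilenameFilter.
private void cleanupOldLogs() {
    File current = new File(getFile());
    File dir = current.getAbsoluteFile().getParentFile();
    final String prefix = current.getName();
    long cutoff = System.currentTimeMillis() - maxDaysToKeep * 24L * 60 * 60 * 1000;
    File[] candidates = dir.listFiles(new FilenameFilter() {
        public boolean accept(File d, String name) {
            return name.startsWith(prefix);  // the active log plus its rolled-over siblings
        }
    });
    if (candidates == null) {
        return;
    }
    for (File f : candidates) {
        if (f.lastModified() < cutoff) {
            f.delete();  // older than maxDaysToKeep
        }
    }
}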

Backup log

Make a separate thread that zips the log file and deletes the original afterwards (I found CyclicBarrier very easy to use for this kind of waiting for a thread to complete its task, and a thread is preferable for avoiding file lock/access problems). Call the thread at the point where the current log file needs to be rolled over to backup.
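A rough sketch of the zipping part that backupIfNeeded() would delegate to once it knows which file has just been rolled over (detecting the rollover itself is omitted here); the CyclicBarrier lets the caller wait until the zip thread has finished:

// Uses java.io.*, java.util.zip.* and java.util.concurrent.CyclicBarrier.
private void backupLog(final File rolledLog) {
    final CyclicBarrier barrier = new CyclicBarrier(2);
    new Thread(new Runnable() {
        public void run() {
            boolean zipped = false;
            ZipOutputStream zos = null;
            FileInputStream in = null;
            try {
                zos = new ZipOutputStream(new FileOutputStream(new File(outdir, rolledLog.getName() + ".zip")));
                zos.putNextEntry(new ZipEntry(rolledLog.getName()));
                in = new FileInputStream(rolledLog);
                byte[] buffer = new byte[8192];
                int read;
                while ((read = in.read(buffer)) != -1) {
                    zos.write(buffer, 0, read);
                }
                zipped = true;
            } catch (IOException e) {
                org.apache.log4j.helpers.LogLog.error("Backing up " + rolledLog + " failed", e);
            } finally {
                try { if (in != null) in.close(); } catch (IOException ignored) { }
                try { if (zos != null) zos.close(); } catch (IOException ignored) { }
                if (zipped) {
                    rolledLog.delete();  // remove the original only after it has been zipped
                }
                try { barrier.await(); } catch (Exception ignored) { }
            }
        }
    }).start();
    try { barrier.await(); } catch (Exception ignored) { }  // block until the zip thread has finished
}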

Deploy the customized file appender

Let’s say we make a new jar called log4jxxappender.jar. We can deploy the appender by copying the jar file to ${tomcat_home}/lib and to ${tomcat_home}/webapps/solr/WEB-INF/lib.

Example configuration for Solr:

log4j.rootLogger=INFO, solrlog

log4j.appender.solrlog=com.findwise.xx.log4j.fileappender.YyRollingFileAppender

log4j.appender.solrlog.File=${catalina.home}/logs/solr.log

log4j.appender.solrlog.Append=true

log4j.appender.solrlog.Encoding=UTF-8

log4j.appender.solrlog.DatePattern='.'yyyy-MM-dd

log4j.appender.solrlog.MaxDaysToKeep=10

log4j.appender.solrlog.Outdir=${catalina.base}/logs/backup

log4j.appender.solrlog.layout=org.apache.log4j.PatternLayout

log4j.appender.solrlog.layout.ConversionPattern = %d [%t] %-5p %c - %m%n

Solr.war

The last thing to remember about Solr is to zip the deployment folder ${tomcat_home}/webapps/solr and rename the zip file from solr.zip to solr.war. Now you should have a log4j-enabled solr.war file with your customized file appender.

Want more information, have further questions or need help? Stop by our website or contact us!

Video: Introducing Hydra – An Open Source Document Processing Framework

Introducing Hydra – An Open Source Document Processing Framework, presented at Lucene Revolution, hosted on Vimeo.

Presented by Joel Westberg, Findwise AB
This presentation details the document-processing framework called Hydra that has been developed by Findwise. It is intended as a description of the framework and the problem it aims to solve. We will first discuss the need for scalable document processing, noting that there is a missing link in the open source chain between the source system and the search engine. We will then move on to describe the design goals of Hydra, as well as how it has been implemented to meet those demands for flexibility, robustness and ease of use. The session ends by discussing some of the possibilities that this new pipeline framework can offer, such as seamlessly scaling up the solution during peak loads, metadata enrichment, and proposed integration with Hadoop for Map/Reduce tasks such as page rank calculations.

Development Techniques for Solr: Structure First or Structure Last?

I’d like to share two different development techniques for Solr that I commonly use when setting up an Apache Solr project. To explain them I’ll start by introducing the way I used to work. (The wrong way 😉 )

Development Techniques for Solr: The Structure First

Since I work as an enterprise search consultant I come across a lot of different data sources. All of these data sources have at least some structure, some more than others.

My objective as a backend developer was then first of all to figure out how the data source was structured, and then to design a Solr schema that fit the requirements, both technical and business.

The problem with this was of course that the requirements were quite fuzzy until I had actually figured out how the data was structured and, even more importantly, what the data quality was like.

In many cases I would spend a lot of time extracting a date from the source, converting it to an ISO 8601 date format (supported by Solr), updating the schema with that field and then finally reindexing – only to learn that the date was either not required or of too poor quality to be used.

My point being that I spent a lot of time designing a schema (and connector) for a source which I, and most others, knew almost nothing about.

Development Techniques for Solr: The Structure Last

Ok so what’s the supposed “right way” of doing this?

In Solr there is a concept called dynamic fields. It allows you to map fields that fulfil a certain name criterion to a specific type. In the example Solr schema you can find the following section:

<!-- uncomment the following to ignore any fields that don't already match an existing
     field name or dynamic field, rather than reporting them as an error.
     alternately, change the type="ignored" to some other type e.g. "text" if you want
     unknown fields indexed and/or stored by default -->

<!--dynamicField name="*" type="ignored" multiValued="true" /-->

The section above will drop any fields that are not explicitly declared in the schema. But what I usually do to start with is the complete opposite: I map all fields to a string type.

<dynamicField name="*" type="string" multiValued="true" indexed="true" stored="true"/>

I start with a minimalist schema that only has an id field and the above stated dynamic field.

With this schema it doesn’t matter what I do, everything is mapped to a string field, exactly as it is entered.

This allows me to focus on getting the data into Solr without caring about what to name the fields, what properties they should have and, most importantly, without even having to declare them at all.

Instead I can focus on getting the data out of the source system and into Solr. When that’s done I can use Solr’s schema browser to see which fields are of high quality, contain a lot of text or are suited to be used as facets, and use this information to help out in the requirements process.

The Structure Last Technique lets you be more pragmatic about your requirements.

Information Flow in VGR

Last week Kristian Norling from VGR (Västra Götaland Regional Council) posted a really interesting and important blog post about information flow. For those of you who don’t know what VGR has been up to previously, here is a short background.

For a number of years VGR has been working to give reality to a model for how information is created, managed, stored and distributed. And perhaps the most important part – integrated.

Information flow in VGR

Why is Information Flow Important?

In order to give your users access to the right information it is essential to get control of the whole information flow i.e. from the time it is created until it reaches the end user. If we lack knowledge about this, it is almost impossible to ensure quality and accuracy.

The fact that we have control also gives us endless possibilities when it comes to distributing the right information at the right time (an old cliché that is finally becoming reality). To sum up: that is what search is all about!

When information is being created VGR uses a Metadata service which helps the editors to tag their content by giving keyword suggestions.

In reality this means that the information can be distributed in the way it is intended. News items, for example, are tagged with subject, target group and organizational info (apart from dates, author, expiry date etc., which are automated) – meaning that people belonging to specific groups with certain roles will get the news that is important to them.

Once the information is tagged correctly and published it is indexed by search. This is done in a number of different ways: by HTML-crawling, through RSS, by feeding the search engine or through direct indexing.

The information is then available through search and ready to be distributed to the right target groups. Portlets are used to give single sign-on access to a number of information systems, and template pages in the WCM (Web Content Management system) use search alerts to show updated information.

Simply put: a search alert for e.g. meeting minutes that contain your department’s name will give you an overview of all such information as soon as it is published, regardless of which system it resides in.

Furthermore, the blog post describes VGR’s work with creating short and persistent URLs (through a URL service) and how to ”monitor” and “listen to” the information flow (for real-time indexing and distribution) – areas where we all have things to learn. Over time Kristian will describe the different parts of the model in detail, so be sure to keep an eye on the blog.

What are your thoughts on how to get control of the information flow? Have you been developing similar solutions for part of this?

Solr Processing Pipeline

Hi again Internet,

For once I have had time to do some thinking. Why is there no powerful data processing layer between the Lucene Connector Framework and Solr? I’ve been looking into the Apache Commons Processing Pipeline. It seems like a likely candidate for doing some cool stuff. Look at the diagram below.

A schematic drawing of a Solr Pipeline concept.

What I’m thinking of is making a transparent Solr processing pipeline that speaks the Solr REST protocol at each end. This means that you would be able to use SolrJ or any other Solr API to communicate with the pipeline.
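As a minimal sketch of what “transparent” would mean in practice, a client would simply point SolrJ at the pipeline instead of at Solr; the endpoint URL below is an illustrative assumption, and the pipeline itself is of course hypothetical:

import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class PipelineClientSketch {
    public static void main(String[] args) throws Exception {
        // The pipeline exposes the same update protocol as Solr, so SolrJ works unchanged.
        CommonsHttpSolrServer pipeline = new CommonsHttpSolrServer("http://localhost:8080/pipeline");
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");
        doc.addField("title", "Hello pipeline");
        pipeline.add(doc);     // the pipeline processes/enriches the document, then forwards it to Solr
        pipeline.commit();
    }
}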

Has anyone attempted this before?  If you’re interested in chatting about the pipeline drop me a mail or just grab me at Eurocon in Prague this year.