How to improve search relevance using machine learning and statistics – Apache Solr Learning to Rank

In search, relevance denotes how well a retrieved document or set of documents meets the information need of the user. Natural language, synonymy, homonymy, term frequency, norms, relevance models, boosting, performance, subjectivity: there are many reasons why search relevance remains a hard problem. This article shows how machine learning and search statistics can improve relevance using the learning to rank plugin, which will be included in a newer version of Solr. If you want more information than is provided in this blog post, be sure to visit our website or contact us!

Background

Consider an intranet search solution whose users can be divided into two groups: developers and sales representatives.
Search and click statistics are collected. The following picture illustrates a specific search performed 569 times by the users, with the click statistics for each document.

Example of a search with click statistics

As the picture shows, the top search hit, whose score is computed from term frequency, inverse document frequency and field-length norm, is less relevant (it received fewer clicks) than documents with lower scores. Manually tweaking the relevancy model and the different field boosts for one specific query would probably decrease the overall relevancy for other queries. Instead, we will try to learn from the users' click statistics to generate a relevancy model automatically.

Architecture, features and model

Architecture

Using search and click statistics as training data, the search engine can learn, from input features and with a ranking model, the probability that a document is relevant for a user.

Search architecture

Search architecture with a ranking training model

Features

With the Solr learning to rank plugin, features can be defined using standard Solr queries.
For this example, I chose the following features:
– originalScore: Original score computed by Solr
– queryMatchTitle: Boolean stating whether the user query text matches the document title
– queryMatchDescription: Boolean stating whether the user query text matches the document description
– isPowerPoint: Boolean stating whether the document type is PowerPoint

[
   {
    "name" : "isPowerPoint",
    "class" : "org.apache.solr.ltr.feature.SolrFeature",
    "params" : { "fq" : ["{!terms f=filetype}.pptx"] }
   },
   {
    "name" : "originalScore",
    "class" : "org.apache.solr.ltr.feature.OriginalScoreFeature",
    "params" : {}
   },
   {
    "name" : "queryMatchTitle",
    "class" : "org.apache.solr.ltr.feature.SolrFeature",
    "params" : { "q" : "{!field f=title}${user_query}" }
   },
   {
    "name" : "queryMatchDescription",
    "class" : "org.apache.solr.ltr.feature.SolrFeature",
    "params" : { "q" : "{!field f=description}${user_query}" }
   }
]

Training a ranking model

From the click statistics, we can generate a training set (X, y), where X is the input feature vector and y is a boolean stating whether the user clicked a document. This training set is used to fit a linear model by regression, which outputs the weight of each feature.

Example of statistics:

[
  {
    "q": "i3",
    "docId": "91075403",
    "userId": "507f1f77bcf86cd799439011",
    "clicked": 0,
    "score": 3.43
  },
  {
    "q": "i3",
    "docId": "82034458",
    "userId": "507f1f77bcf86cd799439011",
    "clicked": 1,
    "score": 3.43
  },
  {
    "q": "coucou",
    "docId": "1246732",
    "userId": "507f1f77bcf86cd799439011",
    "clicked": 0,
    "score": 3.42
  }
]

Training data generated:
X= (originalScore, queryMatchTitle, queryMatchDescription, isPowerPoint)
y= (clicked)

[
  {
    "originalScore": 3.43,
    "queryMatchTitle": 1,
    "queryMatchDescription": 1,
    "isPowerPoint": 1,
    "clicked": 0
  },
  {
    "originalScore": 3.43,
    "queryMatchTitle": 0,
    "queryMatchDescription": 1,
    "isPowerPoint": 0,
    "clicked": 1
  },
  {
    "originalScore": 3.42,
    "queryMatchTitle": 1,
    "queryMatchDescription": 0,
    "isPowerPoint": 1,
    "clicked": 0
  }
]
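To make the training step more concrete, here is a minimal sketch in plain Python that fits the four weights to the three sample rows with batch gradient descent on the squared error. The function names, the learning rate and the number of epochs are assumptions chosen for illustration; a real setup would use far more data and most likely a dedicated machine learning library.

```python
# Hypothetical training rows built from the click statistics:
# (originalScore, queryMatchTitle, queryMatchDescription, isPowerPoint) -> clicked
FEATURES = ["originalScore", "queryMatchTitle", "queryMatchDescription", "isPowerPoint"]
X = [
    [3.43, 1, 1, 1],
    [3.43, 0, 1, 0],
    [3.42, 1, 0, 1],
]
y = [0, 1, 0]


def predict(weights, row):
    """Linear model: weighted sum of the feature values."""
    return sum(w * x for w, x in zip(weights, row))


def mean_squared_error(weights, rows, labels):
    """Average squared difference between prediction and click label."""
    return sum((predict(weights, r) - t) ** 2 for r, t in zip(rows, labels)) / len(rows)


def train(rows, labels, learning_rate=0.01, epochs=2000):
    """Fit the weights with batch gradient descent on the squared error."""
    weights = [0.0] * len(rows[0])
    for _ in range(epochs):
        gradient = [0.0] * len(weights)
        for row, target in zip(rows, labels):
            error = predict(weights, row) - target
            for i, x in enumerate(row):
                gradient[i] += 2 * error * x / len(rows)
        weights = [w - learning_rate * g for w, g in zip(weights, gradient)]
    return weights


weights = train(X, y)
# Map each weight back to its feature name, ready to be copied into a model file.
model_weights = dict(zip(FEATURES, weights))
print(model_weights)
```

With only three rows the resulting weights are not meaningful; the point is the shape of the computation, which turns click data into one weight per feature.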

Once the training is completed and the feature weights are computed, the model can be sent to Solr using the following format:

{
    "class":"org.apache.solr.ltr.model.LinearModel",
    "name":"myModelName",
    "features":[
        { "name": "originalScore"},
        { "name": "queryMatchTitle"},
        { "name": "queryMatchDescription"},
        { "name": "isPowerPoint"}
    ],
    "params":{
        "weights": {
            "originalScore": 0.6783,
            "queryMatchTitle": 0.4833,
            "queryMatchDescription": 0.7844,
            "isPowerPoint": 0.321
      }
    }
}

Run a Rerank Query

The Solr LTR plugin makes it easy to apply the re-rank model to the result documents by adding rq={!ltr model=myModelName reRankDocs=25} to the query.
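As a small illustration, the following Python sketch builds such a request URL and takes care of URL-encoding the rq parameter. The base URL, the fl value and the helper name are assumptions for the example. The value for ${user_query} used by the features also has to be supplied with the request; how that is passed depends on the plugin version and is left out here.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def rerank_url(base_url, user_query, model, rerank_docs=25):
    """Build a select URL that re-ranks the top documents with an LTR model."""
    params = {
        "q": user_query,
        "rq": "{!ltr model=%s reRankDocs=%d}" % (model, rerank_docs),
        "fl": "id,score",  # assumed field list, adjust to your schema
    }
    return base_url + "?" + urlencode(params)


# Hypothetical host and core name.
url = rerank_url("http://localhost:8983/solr/index/select", "i3", "myModelName")
print(url)

# Decoding the query string again shows the rq parameter as Solr will see it.
decoded = parse_qs(urlsplit(url).query)
print(decoded["rq"][0])
```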

Personalization of the search result

If your statistics data includes information about users, a specific re-rank model can be trained for each user group.
In my current example, I trained one model for the developer group and one for the sales representatives.

Dev model:

{
    "class":"org.apache.solr.ltr.model.LinearModel",
    "name":"devModel",
    "features":[
        { "name": "originalScore"},
        { "name": "queryMatchTitle"},
        { "name": "queryMatchDescription"},
        { "name": "isPowerPoint"}
    ],
    "params":{
        "weights": {
            "originalScore": 0.6421,
            "queryMatchTitle": 0.4561,
            "queryMatchDescription": 0.5124,
            "isPowerPoint": 0.017
      }
    }
}

Sales model:

{
    "class":"org.apache.solr.ltr.model.LinearModel",
    "name":"salesModel",
    "features":[
        { "name": "originalScore"},
        { "name": "queryMatchTitle"},
        { "name": "queryMatchDescription"},
        { "name": "isPowerPoint"}
    ],
    "params":{
        "weights": {
            "originalScore": 0.712,
            "queryMatchTitle": 0.582,
            "queryMatchDescription": 0.243,
            "isPowerPoint": 0.623
      }
    }
}

From the statistics data, the system learnt that a PowerPoint document is more relevant for a sales representative than for a developer.
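The effect of the two weight sets can be checked with a little arithmetic. The sketch below applies both linear models to a hypothetical document (feature values chosen for illustration) once as a PowerPoint file and once as another file type, and compares how much the isPowerPoint feature adds to the score in each model.

```python
DEV_WEIGHTS = {
    "originalScore": 0.6421,
    "queryMatchTitle": 0.4561,
    "queryMatchDescription": 0.5124,
    "isPowerPoint": 0.017,
}
SALES_WEIGHTS = {
    "originalScore": 0.712,
    "queryMatchTitle": 0.582,
    "queryMatchDescription": 0.243,
    "isPowerPoint": 0.623,
}


def linear_score(weights, features):
    """Score of a linear model: sum of weight * feature value."""
    return sum(weights[name] * features[name] for name in weights)


# Hypothetical document matching the query in title and description.
powerpoint_doc = {
    "originalScore": 3.43,
    "queryMatchTitle": 1,
    "queryMatchDescription": 1,
    "isPowerPoint": 1,
}
other_doc = dict(powerpoint_doc, isPowerPoint=0)

dev_boost = linear_score(DEV_WEIGHTS, powerpoint_doc) - linear_score(DEV_WEIGHTS, other_doc)
sales_boost = linear_score(SALES_WEIGHTS, powerpoint_doc) - linear_score(SALES_WEIGHTS, other_doc)
print("PowerPoint boost for developers:", round(dev_boost, 3))
print("PowerPoint boost for sales:", round(sales_boost, 3))
```

The boost equals the isPowerPoint weight of each model, so the same PowerPoint document moves up much further in the sales ranking than in the developer ranking.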

Developer search

Developer search with re-ranking

Sales representative search

Sales representative search with re-ranking

To conclude, with a search system continuously trained from a flow of statistics, the search relevance will not only be more customized and personalized for your users, it will also adapt automatically to changes in their behavior.

If you want more information, have further questions or need help please visit our website or contact us!

Solr LTR plugin, to be released soon: https://github.com/bloomberg/lucene-solr/tree/master-ltr-plugin-release/solr/contrib/ltr

Query Completion with Apache Solr

There are plenty of names for this functionality: query completion, suggestions, auto-complete, auto-suggest, word completion, type ahead and maybe some more. Even if we can point out slight differences between them (suggestions can be based on your indexed documents or on external input such as users’ queries), from a technical point of view it’s all about the same thing: proposing a query to the end user.

Early Google Suggest from 2008. Source: http://www.wpromote.com/blog/4-things-in-08-that-changed-the-face-of-search/

 

Google started its suggester feature 8 years ago, in 2008. Users got used to query completion, and nowadays it’s a common feature of all mature search engines, e-commerce platforms and even internal enterprise search solutions.

Suggestions help users navigate the web portal, let them discover relevant content, and recommend popular phrases (and thus search results). In e-commerce they are even more important, because well-implemented query completion can raise the conversion rate and ultimately increase sales revenue. A completion should never lead to zero results, yet this kind of mistake is made frequently.

There are as many ways to build this feature as there are names for it. Still, implementing query completion that works well is not a trivial task, and software like Apache Solr doesn’t solve the whole problem. Building auto-suggestions is also about the data (what should we present to users?), its quality (e.g. when we want to suggest other users’ queries), the order of suggestions (we got dozens of matches but can show only 5; which are the most important?) and design (user experience and the like).

Back to the technology. Query completion can be built in a couple of ways with Apache Solr: you can use mechanisms like facets, terms, or the dedicated suggest component, or just run a query (e.g. with the dismax parser).

Take a look at the Suggester. It’s very easy to run: you just need to configure a searchComponent and a requestHandler. Example:

<searchComponent name="suggester" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">suggester1</str>
    <str name="lookupImpl">FuzzyLookupFactory</str>
    <str name="dictionaryImpl">DocumentDictionaryFactory</str>
    <str name="field">title</str>
    <str name="weightField">popularity</str>
    <str name="suggestAnalyzerFieldType">text</str>
  </lst>
</searchComponent>
<requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <str name="suggest">true</str>
    <str name="suggest.count">10</str>
  </lst>
  <arr name="components">
    <str>suggester</str>
  </arr>
</requestHandler>

SuggestComponent is a ready-to-use implementation responsible for serving suggestions based on commands and queries. It’s an efficient solution, in part because it works on a structure that is separate from the main index and kept in memory. There are some basic settings, such as the field used for autocompletion and the text analysis chain. lookupImpl defines how terms in the index are matched. There are about 10 algorithms with different purposes. Probably the most popular are:

  • AnalyzingLookupFactory (finds matches based on prefix)
  • FuzzyLookupFactory (finds matches with misspellings)
  • AnalyzingInfixLookupFactory (finds matches anywhere in the text)
  • BlendedInfixLookupFactory (combines matches based on prefix and infix lookup)

You need to choose the one which fulfills your requirements. The second important parameter is dictionaryImpl, which defines how indexed suggestions are stored. Again, you can choose between a couple of implementations, e.g. DocumentDictionaryFactory (stores terms, weights, and an optional payload) or HighFrequencyDictionaryFactory (works when very common terms overwhelm others; you can set a proper threshold).

There are plenty of different settings you can use to customize your suggester. SuggestComponent is a good start and probably covers many cases, but like everything it has some limitations; for example, you can’t easily filter out results.

Example execution:

http://localhost:8983/solr/index/suggest?wt=json&suggest.dictionary=suggester1&suggest.q=lond

suggestions: [
  { term: "london" },
  { term: "londonderry" },
  { term: "londoño" },
  { term: "londoners" },
  { term: "londo" }
]

Another way to build a query completion is to use mechanisms like faceting, terms or highlighting.

An example of query completion (QC) built on facets:

http://localhost:8983/solr/index/select?q=*:*&facet=on&facet.field=title_keyword&facet.mincount=1&facet.contains=lon&rows=0&wt=json

title_keyword: [
  "blonde bombshell", 2,
  "12-pounder long gun", 1,
  "18-pounder long gun", 1,
  "1957 liga española de baloncesto", 1,
  "1958 liga española de baloncesto", 1
]

Please notice that here we have used the facet.contains parameter, so the query also matches in the middle of a phrase. It performs a plain substring match. Additionally, the Solr response includes a count for every suggestion.

TermsComponent (returns indexed terms and the number of documents which contain each term) and highlighting (originally meant to emphasize fragments of documents that match the user’s query) can also be used, as shown below.
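On the client side, the flat list of alternating terms and counts shown above has to be turned into suggestions. A minimal Python sketch, assuming the list has already been parsed from the JSON response (the helper name is made up for this example):

```python
def facet_pairs(flat_counts):
    """Turn a flat [term, count, term, count, ...] list into (term, count) pairs."""
    return list(zip(flat_counts[0::2], flat_counts[1::2]))


# Facet values copied from the example response above.
title_keyword = [
    "blonde bombshell", 2,
    "12-pounder long gun", 1,
    "18-pounder long gun", 1,
    "1957 liga española de baloncesto", 1,
    "1958 liga española de baloncesto", 1,
]

suggestions = facet_pairs(title_keyword)
for term, count in suggestions:
    print(term, count)
```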

Terms example:

<searchComponent name="terms" class="solr.TermsComponent"/>
<requestHandler name="/terms" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <bool name="terms">true</bool>
    <bool name="distrib">false</bool>
  </lst>
  <arr name="components">
    <str>terms</str>
  </arr>
</requestHandler>

http://localhost:8983/solr/index/terms?terms.fl=title_general&terms.prefix=lond&terms.sort=index&wt=json

title_general: [
  "londinium",
  "londo",
  "london",
  "london's",
  "londonderry"
]

Highlighting example:

http://localhost:8983/solr/index/select?q=title_ngram:lond&fl=title&hl=true&hl.fl=title&hl.simple.pre=&hl.simple.post=

title_ngram: [
  "londinium",
  "londo",
  "london",
  "london's",
  "londonderry"
]

You can also do auto-complete with a regular full-text query. It has lots of advantages: Lucene scoring works, and you get filtering, boosts, matching across many fields and the whole Lucene/Solr query syntax. Take a look at this eDisMax example:

http://localhost:8983/solr/index/select?q=lond&qf=title_ngram&fl=title&defType=edismax&wt=json

docs: [
  { title: "Londinium" },
  { title: "London" },
  { title: "Darling London" },
  { title: "London Canadians" },
  { title: "Poultry London" }
]

The secret is the analyzer chain, whether you build on facets, a query or SuggestComponent. Depending on the effect you want to achieve with your QC, you need to index the data in the right way. Sometimes you may want to suggest single terms, other times whole sentences or product names. If you want to suggest letter by letter, for example, you can use the Edge N-Gram Filter. Example:

<fieldType name="text_ngram" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="50" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

An n-gram is a sequence of n items (the size depends on the given range) taken from a given sequence of text; an edge n-gram always starts at the beginning of the term. Example: the term Findwise with minGramSize = 1 and maxGramSize = 10 will be indexed as:

F
Fi
Fin
Find
Findw
Findwi
Findwis
Findwise

With text indexed this way, you can easily build functionality where the user sees the suggestions change after each letter.
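The expansion itself is easy to reproduce outside Solr. The following Python sketch mimics what an edge n-gram filter produces for a single token; it is an illustration of the idea, not Solr's implementation, and it skips the lowercasing step of the analyzer chain.

```python
def edge_ngrams(term, min_gram=1, max_gram=50):
    """Return all prefixes of term whose length lies between min_gram and max_gram."""
    upper = min(max_gram, len(term))
    return [term[:size] for size in range(min_gram, upper + 1)]


grams = edge_ngrams("Findwise", min_gram=1, max_gram=10)
print(grams)
```

Because every prefix is indexed as its own token, a query for "Find" matches the document directly, without any wildcard search at query time.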

Another case is the ability to complete word after word (like Google does). It isn’t trivial, but you can try the shingle structure. Shingles are similar to n-grams, but they work on whole words. Example: Searching is really awesome with minShingleSize = 2 and maxShingleSize = 3 will be indexed as:

Searching is
Searching is really
is really
is really awesome
really awesome
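The list above can be reproduced with a few lines of Python. This is a sketch of the shingle idea for sizes 2 to 3, not Solr's implementation; note that Solr's shingle filter also emits the single words by default, which the list above and this sketch leave out.

```python
def shingles(text, min_size=2, max_size=3):
    """Return all word n-grams of text with min_size to max_size words."""
    words = text.split()
    result = []
    for start in range(len(words)):
        for size in range(min_size, max_size + 1):
            if start + size <= len(words):
                result.append(" ".join(words[start:start + size]))
    return result


phrases = shingles("Searching is really awesome")
for phrase in phrases:
    print(phrase)
```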

Example of Shingle Filter:

<fieldType name="text_shingle" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="10" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

What if your users could use QC which supports synonyms? Then they could type e.g. an abbreviation and find the full suggestion (NYC -> New York City, UEFA -> Union Of European Football Associations). It’s easy, just use the Synonym Filter in your text field:

<fieldType name="text_synonym" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
  </analyzer>
</fieldType>

And then just do a query:

http://localhost:8983/solr/index/select?defType=edismax&fl=title&q=nyc&qf=title_synonym&wt=json

docs: [
  { title: "New York City" },
  { title: "New York New York" },
  { title: "Welcome to New York City" },
  { title: "City Club of New York" },
  { title: "New York" }
]

Another very similar example concerns language support and matching suggestions regardless of the form of the terms. It can be especially valuable for languages with rich grammar rules and declension. In the same way as the Synonym Filter is used, we can configure a stemming or lemmatization filter, e.g. for English (remember to put the language filter in both the index-time and the query-time analyzer), and expand the matching suggestions.

As you can see, there are many ways to run query completion. You need to choose the right mechanism and text analysis based on your own limitations and on what you want to achieve.

There are also other topics connected with preparing a type-ahead solution. You need to consider performance issues, which mostly center on response time and memory consumption. How many requests will QC generate? You can assume at least 3 times more than your regular search service. You can handle the traffic growth by optimizing Solr caches or by installing separate Solr instances dedicated to the suggestion service. If you create n-grams, shingles or similar structures, be aware that your index size will increase. Remember that if you decide to use facets or highlighting to provide suggestions, both mechanisms put a heavy load on your CPU.

In my opinion, the most challenging issue is choosing a data source for the query completion mechanism. Should you suggest parts of your documents (like titles, keywords, authors)? Or use NLP algorithms to extract meaningful phrases from your content? Maybe parse search/application logs and use the most popular user queries (be careful: filter out rubbish and normalize users’ input)? I believe the answer is YES to all. Suggestions should be diversified (to lead your users to a wide range of search resources) and should come from a variety of sources. More than likely, you will need to put hard work into processing documents; remember that data cleaning is crucial.

Similarly, you need to consider different strategies for the order of the proposed suggestions. It’s good to show them in alphanumeric order (while still respecting scoring!), but you can’t stop there. A peculiarity of QC is that the application can return hundreds of matches while you can present only 5 or 10 of them. That’s why you need to promote the suggestions with the highest occurrence in the index or the greatest popularity among users. Further enhancements can involve personalizing query completion, using geographical coordinates or implementing security trimming (users see only the suggestions they are allowed to see).

I’m sure that this blog post doesn’t exhaust the subject of building query completion, but I hope I brought the topic closer and showed the complexity of such a task. There are many different dimensions you need to handle, like the data source of your suggestions, choosing the right indexing structure, performance issues, ranking, or even UX and design (how would you like to present hints: as simple text or with some graphics/images? Would you like to divide suggestions into categories? Do you always want to show the result page after a suggestion is clicked, or maybe redirect to a particular landing page?).

A search engine like Apache Solr is a tool, but you still need an application with the whole business logic on top of it. Do you want prefix matching and infix matching? To support typos and synonyms? To suggest letter by letter or word by word? To implement security requirements or advanced ranking to propose the best tips for your users? These questions and more need to be thought through to deliver successful query completion.
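One simple way to combine occurrence in the index with popularity among users is a weighted sum, followed by a cut-off at the number of suggestions you can display. The Python sketch below illustrates this; the field names, the counts and the popularity_weight parameter are all hypothetical and would need tuning on real data.

```python
def rank_suggestions(candidates, limit=5, popularity_weight=2.0):
    """Order candidates by index count plus weighted query popularity, best first."""
    def weight(candidate):
        return candidate["index_count"] + popularity_weight * candidate["query_count"]

    # Ties are broken alphabetically so that the order is stable.
    ordered = sorted(candidates, key=lambda c: (-weight(c), c["text"]))
    return [c["text"] for c in ordered[:limit]]


# Made-up candidates: how often a term occurs in the index and in user queries.
candidates = [
    {"text": "london", "index_count": 120, "query_count": 300},
    {"text": "londonderry", "index_count": 15, "query_count": 4},
    {"text": "londinium", "index_count": 3, "query_count": 1},
    {"text": "london's", "index_count": 8, "query_count": 0},
    {"text": "londo", "index_count": 2, "query_count": 9},
    {"text": "londoners", "index_count": 6, "query_count": 2},
]

top = rank_suggestions(candidates, limit=3)
print(top)
```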

What’s new in Apache Solr 6?

Apache Solr 6 has been released recently! Keep some important technical changes in mind: Lucene/Solr 4.x indexes can no longer be read, and Java 8 is required. But I think the most interesting part is its new features, which certainly follow world trends. I mean here: a SQL engine on top of Solr, graph search, and replicating data across different data centers.

Apache Solr

One of the most promising topics among the new features is the Parallel SQL Interface. In brief, it makes it possible to run SQL queries on top of SolrCloud (Cloud mode only right now). It can be very interesting to combine full-text capabilities with well-known SQL statements.
Solr uses Presto internally, which is a SQL query engine that works with various types of data stores. Presto is responsible for translating SQL statements into Streaming Expressions, since the Solr SQL engine is based on the Streaming API.
Thanks to that, SQL queries can be executed on worker nodes in parallel. There are two implementations of grouping results (aggregations). The first one is based on a map-reduce algorithm and the second one uses Solr facets. The basic difference is the number of fields used in the grouping clause. The Facet API can be used for better performance, but only when the GROUP BY isn’t complex. If it is, better try aggregationMode=map_reduce.
From the developer’s perspective it’s really transparent. A simple statement like “SELECT field1 FROM collection1” is translated to the proper fields and collection. Right now clauses like WHERE, ORDER BY, LIMIT, DISTINCT and GROUP BY can be used.
Solr still doesn’t support the whole SQL language, but even so it’s a powerful feature. First of all, it can make beginners’ lives easier, since the relational world is commonly known. What is more, I imagine this can be useful during some IT system migrations or for collecting data from Solr for further analysis. I hope to hear many different case studies in the near future.
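To give an idea of how a client could submit such a statement, the Python sketch below prepares (but does not send) an HTTP POST for the SQL handler of a collection. It assumes the handler is reachable under /sql on the collection and takes the statement in a stmt parameter; the host, the collection name and the helper name are made up for the example.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_sql_request(base_url, collection, statement, aggregation_mode="facet"):
    """Prepare a POST request carrying a SQL statement for a Solr collection."""
    query = urlencode({"aggregationMode": aggregation_mode})
    url = "%s/%s/sql?%s" % (base_url, collection, query)
    body = urlencode({"stmt": statement}).encode("utf-8")
    return Request(url, data=body, method="POST")


# Hypothetical host and collection; nothing is sent over the network here.
request = build_sql_request(
    "http://localhost:8983/solr",
    "collection1",
    "SELECT field1 FROM collection1 LIMIT 10",
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing aggregation_mode="map_reduce" instead would select the map-reduce implementation for more complex GROUP BY clauses.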

Apache Solr 6 also introduces a topic which is crucial wherever a search engine is a business-critical system: cross data center replication (CDCR).
Since SolrCloud was created to support near real-time (NRT) searching, it didn’t work well when cluster nodes were distributed across different data centers. This is because of the communication overhead generated by the leaders, replicas and synchronization operations.

The new idea is in an experimental phase and still under development, but for now we have an active-passive mode, where data is pushed from the Source DC to the Target DC. Documents can be sent in real time or according to a schedule. Every leader in the active cluster asynchronously sends updates to the corresponding leader in the passive cluster. After that, the target leaders replicate the changes to their replicas as usual.
CDCR is crucial when we think about distributed systems working in high-availability mode. It relates to disaster recovery, scaling and avoiding single points of failure (SPOF). Please visit the documentation page to find some details and plans for the future: https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=62687462

What if your business works in a highly connected environment, where data relationships matter, but you still benefit from full-text search? Solr 6 has good news: graph traversal functionality.
A lot of enterprises know that focusing on relations between documents and on graph data modeling is the future. Now you can build Solr queries which allow you to discover information organized in nodes and edges. You can explore your collections in terms of data interactions and connections between particular data elements. We can think of use cases from the semantic search area (query augmentation, using ontologies etc.) or more prosaic ones, like organizational security roles or access control.
The graph traversal query is still a work in progress, but we can use it already, and its basic syntax is really simple: fq={!graph from=parent_id to=id}id:"DOCUMENT_ID"

The last Solr 6 improvement I’m going to mention is a new scoring algorithm, BM25. In fact, it’s a change forced by Apache Lucene 6, where BM25 is now the default similarity implementation. Similarity is the process which determines which documents are similar to the query and to what extent. Many different factors determine a document’s score, e.g. the number of search terms found in the document, the popularity of these search terms across the whole collection, or the document length. This is where BM25 improves scoring: it takes into consideration the average length of the documents (fields) across the entire corpus. It also better limits the impact of term frequency on the ranking of results.
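The two properties mentioned above, length normalization against the corpus average and a limited impact of term frequency, can be seen in the textbook form of the BM25 formula. The Python sketch below scores a single term with the commonly used parameters k1 = 1.2 and b = 0.75; it follows the published formula and is only an approximation of what Lucene computes internally. All numbers are made up.

```python
import math


def bm25_term_score(term_freq, doc_len, avg_doc_len, doc_count, docs_with_term,
                    k1=1.2, b=0.75):
    """Textbook BM25 score of one query term for one document."""
    # Rare terms get a higher inverse document frequency.
    idf = math.log(1 + (doc_count - docs_with_term + 0.5) / (docs_with_term + 0.5))
    # Documents longer than average are penalized, shorter ones are favored.
    length_norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    # The term frequency saturates: the fraction can never exceed k1 + 1.
    return idf * term_freq * (k1 + 1) / (term_freq + length_norm)


# Hypothetical corpus: 1000 documents, 10 contain the term, average length 100.
short_doc = bm25_term_score(1, doc_len=50, avg_doc_len=100, doc_count=1000, docs_with_term=10)
long_doc = bm25_term_score(1, doc_len=200, avg_doc_len=100, doc_count=1000, docs_with_term=10)
one_hit = bm25_term_score(1, 100, 100, 1000, 10)
ten_hits = bm25_term_score(10, 100, 100, 1000, 10)
print(short_doc > long_doc)     # shorter document scores higher
print(ten_hits < 10 * one_hit)  # ten occurrences are worth less than ten times one
```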

As we can see, Apache Solr 6 provides us with many new features and those mentioned above are not all of them. We’re going to write more about the new functionalities soon. Until then, we encourage you to try the newest Solr on your own and remember: don’t hesitate to contact us in case of any problems!

Using log4j in Tomcat and Solr and How to Make a Customized File Appender

This article shows how to use log4j for both Tomcat and Solr. I will also show the steps to make your own customized log4j appender and use it in Tomcat and Solr. If you want more information than is found in this blog post, feel free to visit our website or contact us.

Default Tomcat log mechanism

By default, Tomcat uses a customized version of the Java logging API. The configuration is located at ${tomcat_home}/conf/logging.properties. It follows the standard Java logging configuration syntax plus some special tweaks (prefixing a property with a number) for identifying the logs of different web apps.

An example is below:

handlers = 1catalina.org.apache.juli.FileHandler, 2localhost.org.apache.juli.FileHandler, 3manager.org.apache.juli.FileHandler, 4host-manager.org.apache.juli.FileHandler, java.util.logging.ConsoleHandler

.handlers = 1catalina.org.apache.juli.FileHandler, java.util.logging.ConsoleHandler

1catalina.org.apache.juli.FileHandler.level = FINE

1catalina.org.apache.juli.FileHandler.directory = ${catalina.base}/logs

1catalina.org.apache.juli.FileHandler.prefix = catalina.

2localhost.org.apache.juli.FileHandler.level = FINE

2localhost.org.apache.juli.FileHandler.directory = ${catalina.base}/logs

2localhost.org.apache.juli.FileHandler.prefix = localhost.

Default Solr log mechanism

Solr uses slf4j logging, which is a kind of wrapper for other logging mechanisms. By default, Solr uses log4j syntax and wraps the Java logging API (which means that it looks like you are using log4j in the code, but it is actually using Java logging underneath). It uses Tomcat's logging.properties as its configuration file. If you want to define your own, you can do so by placing a logging.properties file at ${tomcat_home}/webapps/solr/WEB-INF/classes/logging.properties

Switching to Log4j

Log4j is a very popular logging framework, which I believe is mostly due to its simplicity in both configuration and usage. It has richer logging features than java logging and it is not difficult to make an extension.

Log4j for tomcat

  1. Rename/remove ${tomcat_home}/conf/logging.properties
  2. Add log4j.properties in ${tomcat_home}/lib
  3. Add log4j-xxx.jar in ${tomcat_home}/lib
  4. Download tomcat-juli-adapters.jar from extras and put it into ${tomcat_home}/lib
  5. Download tomcat-juli.jar from extras and replace the original version in ${tomcat_home}/bin

(The extras are extra jar files for special Tomcat installations. They can be found in the bin folder of a Tomcat download location, e.g. http://archive.apache.org/dist/tomcat/tomcat-6/v6.0.33/bin/extras/)

Log4j for solr

  1. Add log4j.properties in ${tomcat_home}/webapps/solr/WEB-INF/classes/ (create classes folder if not present)
  2. Replace slf4j-jdkxx-xxx.jar with slf4j-log4jxx-xxx.jar in ${tomcat_home}/webapps/solr/WEB-INF/lib (which means switching underneath implementation from java logging to log4j logging)
  3. Add log4jxxx.jar to ${tomcat_home}/webapps/solr/WEB-INF/lib

Make our own log4j file appender

Log4j has 2 types of common fileappender:

  • DailyRollingFileAppender – rollover at certain time interval
  • RollingFileAppender – rollover at certain size limit

I also found a nice customized file appender online: CustodianDailyRollingFileAppender.

I happened to need a file appender which rolls over at a certain time interval (each day), zips earlier logs and backs them up in a backup folder, and removes logs older than a certain number of days. CustodianDailyRollingFileAppender already has the rollover feature, so I decided to start by making a copy of this class.

Parameters

Besides the default parameters in DailyRollingFileAppender, I need 2 more parameters:

Outdir – backup directory

maxDaysToKeep – the number of days to keep the log file

You only need to define these 2 parameters in the new class, and add get/set methods for them (no constructor involved). The rest will be handled by log4j framework.

Logging entry point

When a log event arrives, the subAppend(…) function is called, inside which a call to super.subAppend(event); does the actual log writing. So before that function call, we can add the mechanism for backup and cleanup.

Clean up old log

Use a file filter to find all log files whose names start with the log filename, and delete those older than maxDaysToKeep.
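The cleanup logic is sketched below in Python, purely to show the steps; the real appender would implement the same thing in Java before the call to super.subAppend(event). The function name, the use of the file modification time as the age of a log, and the demo files are assumptions for this example.

```python
import os
import tempfile
import time

SECONDS_PER_DAY = 24 * 60 * 60


def delete_old_logs(directory, filename_prefix, max_days_to_keep, now=None):
    """Delete files starting with filename_prefix that are older than max_days_to_keep."""
    now = time.time() if now is None else now
    cutoff = now - max_days_to_keep * SECONDS_PER_DAY
    deleted = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        is_log = name.startswith(filename_prefix) and os.path.isfile(path)
        if is_log and os.path.getmtime(path) < cutoff:
            os.remove(path)
            deleted.append(name)
    return deleted


# Demo in a temporary folder: one old log, one recent log, one unrelated old file.
log_dir = tempfile.mkdtemp()
now = time.time()
for name, age_days in [("solr.log.2016-01-01", 30), ("solr.log.2016-01-29", 2), ("other.txt", 30)]:
    path = os.path.join(log_dir, name)
    with open(path, "w") as handle:
        handle.write("log line\n")
    modified = now - age_days * SECONDS_PER_DAY
    os.utime(path, (modified, modified))

deleted = delete_old_logs(log_dir, "solr.log", max_days_to_keep=10, now=now)
remaining = sorted(os.listdir(log_dir))
print(deleted, remaining)
```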

Backup log

Make a separate Thread for zipping the log file and deleting the original log file afterwards (I found CyclicBarrier very easy to use for this kind of waiting for a thread to complete its task, and a thread is preferable for avoiding file lock/access etc. problems). Start the thread at the point where the current log file needs to be rolled over to backup.
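The backup step can be sketched in the same way. The Python code below zips a rolled-over log file into a backup folder on a separate thread and removes the original afterwards. It only illustrates the flow: the Java appender would use a Thread (and, as mentioned, possibly a CyclicBarrier) instead, and the file and folder names here are invented.

```python
import os
import tempfile
import threading
import zipfile


def backup_log(log_path, out_dir):
    """Zip log_path into out_dir and delete the original file afterwards."""
    os.makedirs(out_dir, exist_ok=True)
    zip_path = os.path.join(out_dir, os.path.basename(log_path) + ".zip")
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.write(log_path, arcname=os.path.basename(log_path))
    os.remove(log_path)
    return zip_path


# Demo: create a rolled-over log file in a temporary folder.
work_dir = tempfile.mkdtemp()
rolled_log = os.path.join(work_dir, "solr.log.2016-01-01")
with open(rolled_log, "w") as handle:
    handle.write("log line\n")
backup_dir = os.path.join(work_dir, "backup")

# Run the backup on its own thread so that logging is not blocked while zipping.
worker = threading.Thread(target=backup_log, args=(rolled_log, backup_dir))
worker.start()
worker.join()  # the demo waits here only so the result can be inspected

archived = os.listdir(backup_dir)
print(archived)
```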

Deploy the customized file appender

Let’s say we make a new jar called log4jxxappender.jar. We can deploy the appender by copying the jar file to ${tomcat_home}/lib and to ${tomcat_home}/webapps/solr/WEB-INF/lib

Example configuration for Solr:

log4j.rootLogger=INFO, solrlog

log4j.appender.solrlog=com.findwise.xx.log4j.fileappender.YyRollingFileAppender

log4j.appender.solrlog.File=${catalina.home}/logs/solr.log

log4j.appender.solrlog.Append=true

log4j.appender.solrlog.Encoding=UTF-8

log4j.appender.solrlog.DatePattern='.'yyyy-MM-dd

log4j.appender.solrlog.MaxDaysToKeep=10

log4j.appender.solrlog.Outdir=${catalina.base}/logs/backup

log4j.appender.solrlog.layout=org.apache.log4j.PatternLayout

log4j.appender.solrlog.layout.ConversionPattern = %d [%t] %-5p %c - %m%n

Solr.war

The last thing to remember about Solr is to zip the deployment folder ${tomcat_home}/webapps/solr and rename the zip file from solr.zip to solr.war. Now you should have a log4j-enabled solr.war file with your customized file appender.

Want more information, have further questions or need help? Stop by our website or contact us!

How to Index and Search XML Content in Solr

Indexing XML Content

In Solr, there is an XML update request handler which can be used to update XML-formatted data.

For example:

<add>
<doc>
<field name="employeeId">05991</field>
<field name="office">Bridgewater</field>
<field name="skills">Perl</field>
<field name="skills">Java</field>
</doc>
[<doc> ... </doc>[<doc> ... </doc>]]
</add>

However, when a field itself should contain XML-formatted data, the XML update handler will fail to import it. Because the XML update handler parses the import data with an XML parser, it tries to get the direct child text under the ‘field’ node, which is empty if the field’s direct child is an XML tag.

What we can do is use the JSON update handler. For example:

[
  {
    "id" : "MyTestDocument",
    "title" : "<root p=\"cc\">test \\ node</root>"
  }
]

There are two things to notice:

  1. Both the " and \ characters should be escaped
  2. The XML content should be kept on a single line
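Rather than escaping by hand, a JSON library can take care of both points. The Python sketch below serializes a document whose title field holds XML content: json.dumps escapes the double quotes and the backslash, and its default output stays on a single line. The document id and content are the ones from the example above.

```python
import json

# The raw XML content as it should end up in the Solr field
# (the Python literal '\\' stands for one backslash character).
xml_value = '<root p="cc">test \\ node</root>'

documents = [{"id": "MyTestDocument", "title": xml_value}]
payload = json.dumps(documents)
print(payload)

# Parsing the payload again gives back the original XML string.
round_trip = json.loads(payload)[0]["title"]
```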

JSON import data can be loaded into Solr with the curl command:

curl 'http://localhost:8983/solr/update/json?commit=true' --data-binary @books.json -H 'Content-type:application/json'

Or by using SolrJ:

// serverpath is the Solr base URL, file is the JSON file to import
CommonsHttpSolrServer server = new CommonsHttpSolrServer(serverpath);
server.setMaxRetries(1);
// Send the file to the JSON update handler
ContentStreamUpdateRequest csureq = new ContentStreamUpdateRequest("/update/json");
csureq.addFile(file);
NamedList<Object> result = server.request(csureq);
// A status of 0 in the response header means the request succeeded
NamedList<Object> responseHeader = (NamedList<Object>) result.get("responseHeader");
Integer status = (Integer) responseHeader.get("status");

Stripping out xml tags in Schema definition

When querying XML content, we will most likely not be interested in the XML tags, so we need to strip them out before indexing the text. We can do that by applying HTMLStripCharFilter to the XML content.

<analyzer type="index">
    ...
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    ...
</analyzer>
<analyzer type="query">
    ...
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    ...
</analyzer>

Search XML Content

XML content search does not differ much from plain text search. Searching for XML attributes, however, requires a special tweak.

The HTMLStripCharFilter mentioned earlier filters out all XML tags, including their attributes. To index attributes, we need a way to make the attribute text survive HTMLStripCharFilter.

For example, if the original XML content is

<sample attr="key_o2_4">find it </sample>

then after applying HTMLStripCharFilter we want to have

key_o2_4    find it

One way to do this is to add auxiliary XML processing instructions to the original XML content, such as

<sample attr="key_o2_4"><?solr key_o2_4?>find it</sample>

and to apply solr.PatternReplaceCharFilterFactory to it, as shown in the following schema fieldType definition.

<analyzer type="index">
...
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="&lt;\?solr ([A-Za-z0-9_-]*)\?&gt;" replacement="       $1  " maxBlockChars="10000000"/>
<charFilter class="solr.HTMLStripCharFilterFactory"/>
...
</analyzer>

This replaces <?solr key_o2_4?> with 7 leading spaces + key_o2_4 + 2 trailing spaces. The replacement has the same length as the original text, so the original character offsets are kept.

With this technique, a search for the value of the attr attribute gets a hit.
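
To check the offset arithmetic, here is a small Java sketch of the same replacement. The class and method names are mine, and the regex is my own transcription of the charFilter pattern, with the question marks escaped and the character class widened to lower-case letters so that it matches key_o2_4. The tag stripping is only a crude stand-in for HTMLStripCharFilter, not its real behaviour.

```java
import java.util.regex.Pattern;

public class AttributeMarkerDemo {

    // Match <?solr VALUE?> and capture VALUE (assumed transcription of the schema pattern).
    static final Pattern MARKER = Pattern.compile("<\\?solr ([A-Za-z0-9_-]*)\\?>");

    // 7 spaces stand in for "<?solr " and 2 spaces for "?>", so the length is unchanged.
    static String keepMarkerText(String xml) {
        return MARKER.matcher(xml).replaceAll("       $1  ");
    }

    // Crude stand-in for HTMLStripCharFilter: remove everything that looks like a tag.
    static String stripTags(String xml) {
        return xml.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        String original = "<sample attr=\"key_o2_4\"><?solr key_o2_4?>find it</sample>";
        String replaced = keepMarkerText(original);
        // Same length before and after the replacement
        System.out.println(original.length() + " -> " + replaced.length());
        // The attribute value is still there once the tags are gone
        System.out.println("[" + stripTags(replaced) + "]");
    }
}
```

The order matters in this sketch: the marker has to be rewritten before the tags are stripped, otherwise it would be removed together with them.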

Do you have questions? Visit our website or contact us for more information.

ExternalFileField in Solr

Sometimes we want to update the values of one field more often than the other fields of a document. A good solution is the field type ExternalFileField, which gets its values from an external file instead of the index. The file can easily be changed, and the new values take effect after a commit, so no documents need to be re-indexed. A field of type ExternalFileField is not searchable; it may currently only be used as a ValueSource in a FunctionQuery.

The external file contains keys and values:

key1=value1
key2=value2

The keys don’t need to be unique.

The name of the external file must be external_<fieldname> or external_<fieldname>.* and must be placed in the index directory.
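
Generating such a file is simple. Below is a minimal Java sketch, assuming a hypothetical field called popularity and a temporary directory standing in for the real index directory; the class and method names are invented for this example.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ExternalFileWriter {

    // Turn key -> value pairs into key=value lines, sorted by key for a stable output.
    static List<String> toLines(Map<String, Float> values) {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, Float> e : new TreeMap<>(values).entrySet()) {
            lines.add(e.getKey() + "=" + e.getValue());
        }
        return lines;
    }

    // Write external_<fieldName> into the given directory and return the file path.
    static Path write(Path indexDir, String fieldName, Map<String, Float> values) throws IOException {
        Path target = indexDir.resolve("external_" + fieldName);
        return Files.write(target, toLines(values), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the real index directory (assumption for this sketch)
        Path dir = Files.createTempDirectory("solr-index");
        Map<String, Float> visits = new HashMap<>();
        visits.put("doc1", 12.0f);
        visits.put("doc2", 3.5f);
        Path file = write(dir, "popularity", visits); // hypothetical field name
        System.out.println(Files.readAllLines(file));
    }
}
```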

A new field type of class ExternalFileField, and a field that uses it, must be added to schema.xml.

<fieldType name="file" class="solr.ExternalFileField"
           keyField="keyField" defVal="1" indexed="false"
           stored="false" valType="float" />

<field name="<fieldname>" type="file" />

keyField is the field that contains the keys and <fieldname> contains the values from the external file.

valType defines the value type of the field.

At Findwise we have used this method for a customer who wanted the most visited pages to show up higher in the search results. These statistics change daily for a lot of pages, and we don’t want to re-index all these pages every day.

Development Techniques for Solr: Structure First or Structure Last?

I’d like to share two different development techniques for Solr that I commonly use when setting up an Apache Solr project. To explain them, I’ll start by introducing the way I used to work. (The wrong way 😉 )

Development Techniques for Solr: The Structure First

Since I work as an enterprise search consultant, I come across a lot of different data sources. All of these data sources have at least some structure, some more than others.

My objective as a backend developer was then to first of all figure out how the data source was structured and then design a Solr schema that fit the requirements, both technical and business.

The problem with this was of course that the requirements were quite fuzzy until I actually figured out how the data was structured and even more importantly what the data quality was.

In many cases I would spend a lot of time extracting a date from the source, converting it to an ISO 8601 date format (supported by Solr), updating the schema with that field and then finally reindexing, only to learn that the date was either not required or of too poor quality to be used.
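
To give an idea of the kind of conversion I mean, here is a minimal Java sketch. The source pattern, the time zone and the class name are assumptions picked for the example; every real source has its own format.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

public class SolrDateConverter {

    // Parse a local timestamp using the source's own pattern and time zone,
    // then format it as a UTC ISO 8601 instant such as 2011-03-14T12:00:00Z.
    static String toSolrDate(String sourceValue, String sourcePattern, String sourceZone) {
        DateTimeFormatter parser = DateTimeFormatter.ofPattern(sourcePattern);
        LocalDateTime local = LocalDateTime.parse(sourceValue, parser);
        Instant utc = local.atZone(ZoneId.of(sourceZone)).toInstant();
        return DateTimeFormatter.ISO_INSTANT.format(utc);
    }

    public static void main(String[] args) {
        // Hypothetical source value: day/month/year in Swedish local time
        System.out.println(toSolrDate("14/03/2011 13:00", "dd/MM/uuuu HH:mm", "Europe/Stockholm"));
    }
}
```

The code itself is short. The time went into finding out which pattern and time zone each source actually used.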

My point being that I spent a lot of time designing a schema (and connector) for a source which I, and most others, knew almost nothing about.

Development Techniques for Solr: The Structure Last

Ok so what’s the supposed “right way” of doing this?

In Solr there is a concept called dynamic fields. It allows you to map fields whose names match a certain pattern to a specific type. In the example Solr schema you can find the following section:

<!-- uncomment the following to ignore any fields that don't already match an existing
     field name or dynamic field, rather than reporting them as an error.
     alternately, change the type="ignored" to some other type e.g. "text" if you want
     unknown fields indexed and/or stored by default -->
<!--dynamicField name="*" type="ignored" multiValued="true" /-->

When uncommented, the section above will drop any fields that are not explicitly declared in the schema. But what I usually do to start with is the complete opposite. I map all fields to a string type.

<dynamicField name="*" type="string" multiValued="true" indexed="true" stored="true"/>

I start with a minimalist schema that only has an id field and the above stated dynamic field.

With this schema it doesn’t matter what I do, everything is mapped to a string field, exactly as it is entered.
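
The idea can be illustrated with a small Java sketch of how a field name is resolved against such a schema. This is only a toy model with names I made up, not Solr's actual implementation, whose matching rules are more involved.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class DynamicFieldSketch {

    // Explicitly declared fields: name -> type
    static final Map<String, String> EXPLICIT = new LinkedHashMap<>();
    // Dynamic field patterns (a glob with a leading or trailing '*') -> type
    static final Map<String, String> DYNAMIC = new LinkedHashMap<>();

    static {
        EXPLICIT.put("id", "string");
        DYNAMIC.put("*", "string"); // the catch-all from the minimalist schema
    }

    // True if the field name fits the glob, e.g. "*_s" or "attr_*" or just "*".
    static boolean matches(String glob, String name) {
        if (glob.startsWith("*")) return name.endsWith(glob.substring(1));
        if (glob.endsWith("*")) return name.startsWith(glob.substring(0, glob.length() - 1));
        return glob.equals(name);
    }

    // Explicit fields win; otherwise the first matching dynamic pattern decides the type.
    static String resolveType(String name) {
        if (EXPLICIT.containsKey(name)) return EXPLICIT.get(name);
        for (Map.Entry<String, String> e : DYNAMIC.entrySet()) {
            if (matches(e.getKey(), name)) return e.getValue();
        }
        return null; // nothing matched: this is where an unknown-field error would occur
    }

    public static void main(String[] args) {
        System.out.println(resolveType("id"));
        System.out.println(resolveType("some_field_nobody_declared"));
    }
}
```

With the catch-all pattern in place, resolveType never returns null, which is exactly why nothing has to be declared up front.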

This allows me to focus on getting the data into Solr without caring about what to name the fields, what properties they should have and, most importantly, without having to declare them at all.

Instead I can focus on getting the data out of the source system and then into Solr. When that’s done I can use Solr’s schema browser to see which fields are of high quality, contain a lot of text or are suited to be used as facets, and use this information to help out in the requirements process.

The Structure Last Technique lets you be more pragmatic about your requirements.

Solr Processing Pipeline

Hi again Internet,

For once I have had time to do some thinking. Why is there no powerful data processing layer between the Lucene Connector Framework and Solr? I’ve been looking into the Apache Commons Processing Pipeline. It seems like a likely candidate to do some cool stuff. Look at the diagram below.

A schematic drawing of a Solr Pipeline concept.

What I’m thinking of is to make a transparent Solr processing pipeline that speaks the Solr REST protocol on each end. This means that you would be able to use SolrJ or any other API to communicate with the Pipeline.

Has anyone attempted this before? If you’re interested in chatting about the pipeline drop me a mail or just grab me at Eurocon in Prague this year.