Indexing XML Content
Solr provides an XML update request handler that can be used to index XML-formatted data.
For example,
<add>
<doc>
<field name="employeeId">05991</field>
<field name="office">Bridgewater</field>
<field name="skills">Perl</field>
<field name="skills">Java</field>
</doc>
[<doc> ... </doc>[<doc> ... </doc>]]
</add>
However, when a field itself must contain XML-formatted data, the XML update handler fails to import it. The handler parses the import data with an XML parser and reads only the text that is a direct child of the ‘field’ node; that text is empty when the field’s direct child is an XML tag.
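The effect described above can be reproduced outside Solr. The following is a minimal sketch that uses the JDK's DOM parser as a stand-in (it is not Solr's own loader); the class and method names are hypothetical and exist only for illustration. It collects the text nodes that are direct children of a field element and shows that nothing is left when the field's content is itself an XML element.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class DirectTextDemo {

    /** Concatenates only the text nodes that are direct children of the element. */
    public static String directText(Element field) {
        StringBuilder sb = new StringBuilder();
        NodeList children = field.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.TEXT_NODE) {
                sb.append(child.getNodeValue());
            }
        }
        return sb.toString();
    }

    /** Parses a single field element and returns its direct child text. */
    public static String directTextOf(String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            return directText(root);
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        // Plain text content: the direct child text is the field value.
        System.out.println("[" + directTextOf("<field name=\"office\">Bridgewater</field>") + "]");
        // Nested XML content: the text lives one level deeper, so nothing is collected.
        System.out.println("[" + directTextOf(
                "<field name=\"title\"><root p=\"cc\">test node</root></field>") + "]");
    }
}
```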
One workaround is to use the JSON update handler instead. For example:
[
{
"id" : "MyTestDocument",
"title" : "<root p=\"cc\">test \\ node</root>"
}
]
There are two things to notice:
- Both the " (double quote) and \ (backslash) characters must be escaped, as \" and \\
- The XML content must be kept on a single line
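The two rules above can be applied programmatically before the JSON file is written. The sketch below is one possible way to do it in plain Java; the class and method names are hypothetical. It joins the XML onto a single line (each line break and its surrounding whitespace becomes one space, which is an assumption about what is acceptable for indexing), escapes double quotes and backslashes, and wraps the result in quotes so it can be used as a JSON string value.

```java
public class XmlJsonField {

    /** Turns XML text into a single-line, quoted JSON string literal. */
    public static String toJsonString(String xml) {
        // Rule 2: collapse every line break (and the whitespace around it) into one space.
        String singleLine = xml.replaceAll("\\s*\\R\\s*", " ").trim();

        StringBuilder sb = new StringBuilder("\"");
        for (char c : singleLine.toCharArray()) {
            if (c == '"' || c == '\\') {
                // Rule 1: escape double quotes and backslashes.
                sb.append('\\').append(c);
            } else if (c < 0x20) {
                // Remaining control characters (e.g. tabs) are not allowed raw in JSON strings.
                sb.append(String.format("\\u%04x", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.append('"').toString();
    }

    public static void main(String[] args) {
        String xml = "<root p=\"cc\">test \\ node</root>";
        System.out.println("[ { \"id\" : \"MyTestDocument\", \"title\" : "
                + toJsonString(xml) + " } ]");
    }
}
```

In practice a JSON library would normally take care of the escaping; the hand-written version is shown only to make the two rules explicit.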
The JSON import data can be loaded into Solr with curl:
curl 'http://localhost:8983/solr/update/json?commit=true' --data-binary @books.json -H 'Content-type:application/json'
Or with SolrJ:
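import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
import org.apache.solr.common.util.NamedList;

// serverpath is the Solr base URL; file is the JSON file to load
CommonsHttpSolrServer server = new CommonsHttpSolrServer(serverpath);
server.setMaxRetries(1);
// Send the file to the JSON update handler
ContentStreamUpdateRequest csureq = new ContentStreamUpdateRequest("/update/json");
csureq.addFile(file);
NamedList<Object> result = server.request(csureq);
// A status of 0 in the response header means the request succeeded
NamedList<Object> responseHeader = (NamedList<Object>) result.get("responseHeader");
Integer status = (Integer) responseHeader.get("status");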
Stripping out xml tags in Schema definition
When querying XML content, we are usually not interested in the XML tags themselves, so the tags should be stripped out before the XML text is indexed. We can do that by applying HTMLStripCharFilter to the XML content.
<analyzer type="index">
...
<charFilter class="solr.HTMLStripCharFilterFactory"/>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
...
</analyzer>
<analyzer type="query">
...
<charFilter class="solr.HTMLStripCharFilterFactory"/>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
...
</analyzer>
Search XML Content
Searching XML content does not differ much from searching plain text. Searching on XML attributes, however, requires a special tweak. The HTMLStripCharFilter mentioned earlier filters out all XML tags, including their attributes, so in order to index attributes we need a way to make HTMLStripCharFilter keep the attribute text.
For example, if the original XML content is the following,
<sample attr="key_o2_4">find it </sample>
then after applying HTMLStripCharFilter we want to have,
key_o2_4 find it
One way to do this is to add helper XML processing instructions to the original XML content, such as,
<sample attr="key_o2_4"><?solr key_o2_4?>find it</sample>
and then apply solr.PatternReplaceCharFilterFactory to it, as shown in the following schema fieldType definition.
<analyzer type="index">
...
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="&lt;\?solr ([A-Za-z0-9_-]*)\?&gt;" replacement="       $1  " maxBlockChars="10000000"/>
<charFilter class="solr.HTMLStripCharFilterFactory"/>
...
</analyzer>
This replaces <?solr key_o2_4?> with 7 leading spaces + key_o2_4 + 2 trailing spaces, so that the original character offsets are preserved. With this technique, a search on the value of the attr attribute gets a hit.
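The whole technique can be tried out in plain Java before touching the schema. The sketch below is an illustration under stated assumptions, not Solr code: the class and method names are hypothetical, the attribute is assumed to be called attr, and its value is assumed to contain only letters, digits, underscores, and hyphens. The first method adds the helper processing instruction after each opening tag that carries the attribute; the second applies the same kind of pattern replacement as the char filter and keeps the text length unchanged.

```java
import java.util.regex.Pattern;

public class AttrHintDemo {

    // Matches the helper processing instruction, e.g. <?solr key_o2_4?>
    private static final Pattern HINT =
            Pattern.compile("<\\?solr ([A-Za-z0-9_-]*)\\?>");

    // 7 spaces stand in for "<?solr " and 2 spaces for "?>", so offsets do not move.
    private static final String REPLACEMENT = "       $1  ";

    // Matches an opening tag that carries attr="value" (not a processing instruction).
    private static final Pattern TAG_WITH_ATTR =
            Pattern.compile("(<[^<>?/]*\\sattr=\"([A-Za-z0-9_-]*)\"[^<>]*>)");

    /** Adds a <?solr value?> hint right after every tag that has attr="value". */
    public static String addHints(String xml) {
        return TAG_WITH_ATTR.matcher(xml).replaceAll("$1<?solr $2?>");
    }

    /** Replaces each hint with its padded value, as the pattern-replace char filter would. */
    public static String exposeHints(String xml) {
        return HINT.matcher(xml).replaceAll(REPLACEMENT);
    }

    public static void main(String[] args) {
        String original = "<sample attr=\"key_o2_4\">find it</sample>";
        String hinted = addHints(original);
        String exposed = exposeHints(hinted);

        System.out.println(hinted);
        System.out.println(exposed);
        // Same length before and after, so character offsets still line up.
        System.out.println(hinted.length() == exposed.length());
    }
}
```

Once the tags are stripped from the second result, the attribute value remains in the text next to the element content, which is what makes it searchable.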