I’d like to share two different development techniques for Solr that I commonly use when setting up an Apache Solr project. To explain them, I’ll start by introducing the way I used to work (the wrong way 😉).
Development Techniques for Solr: The Structure First
Since I work as an enterprise search consultant, I come across a lot of different data sources. All of these data sources have at least some structure, some more than others.
My objective as a backend developer was to first figure out how the data source was structured and then design a Solr schema that fit both the technical and the business requirements.
The problem, of course, was that the requirements stayed quite fuzzy until I had actually figured out how the data was structured and, even more importantly, what the data quality was.
In many cases I would spend a lot of time extracting a date from the source, converting it to an ISO 8601 date format (which Solr supports), updating the schema with that field and finally reindexing, only to learn that the date was either not required or of too poor quality to be used.
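To give an idea of the conversion step described above, here is a minimal sketch in Python. The source format (`DD/MM/YYYY HH:MM`) and the function name `to_solr_date` are assumptions made for illustration, and the sketch assumes the source timestamps are already in UTC. Values that cannot be parsed come back as `None`, which is exactly the kind of data quality problem that tends to surface late.

```python
from datetime import datetime


def to_solr_date(raw, source_format="%d/%m/%Y %H:%M"):
    """Convert a source timestamp to Solr's date syntax (ISO 8601, UTC, trailing Z).

    Returns None when the value is missing or does not match source_format.
    Assumption: the source timestamps are already expressed in UTC.
    """
    try:
        parsed = datetime.strptime(raw.strip(), source_format)
    except (ValueError, AttributeError):
        # ValueError: wrong format; AttributeError: raw is None or not a string.
        return None
    return parsed.strftime("%Y-%m-%dT%H:%M:%SZ")


if __name__ == "__main__":
    print(to_solr_date("24/12/2013 18:30"))  # 2013-12-24T18:30:00Z
    print(to_solr_date("n/a"))               # None
```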
My point being that I spent a lot of time designing a schema (and connector) for a source which I, and most others, knew almost nothing about.
Development Techniques for Solr: The Structure Last
OK, so what’s the supposed “right way” of doing this?
In Solr there is a concept called dynamic fields. It allows you to map any field whose name matches a certain pattern to a specific type. In the example Solr schema you can find the following section:
<!-- uncomment the following to ignore any fields that don't already match an existing
field name or dynamic field, rather than reporting them as an error.
alternately, change the type="ignored" to some other type e.g. "text" if you want
unknown fields indexed and/or stored by default -->
<!--dynamicField name="*" type="ignored" multiValued="true" /-->
Once uncommented, the section above makes Solr silently ignore any field that is not explicitly declared in the schema, instead of reporting it as an error. What I usually do to start with is the complete opposite: I map all fields to a string type.
<dynamicField name="*" type="string" multiValued="true" indexed="true" stored="true"/>
I start with a minimalist schema that only has an id field and the dynamic field above.
With this schema it doesn’t matter what I send to Solr: everything is mapped to a string field and stored exactly as it is entered.
This allows me to focus on getting the data into Solr without caring about what to name the fields or what properties they should have and, most importantly, without having to declare them at all.
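As a rough illustration of what “getting the data in” can look like with such a catch-all schema, here is a sketch in Python. The helper names `to_solr_doc` and `build_update_request`, the sample record and the update URL (host, port and core name) are all hypothetical. The sketch assumes a Solr version whose `/update` handler accepts a JSON array of documents. It only builds the request, it does not send it.

```python
import json
import urllib.request


def to_solr_doc(record, id_key):
    """Turn an arbitrary source record into a Solr document of plain strings.

    Every field is kept under its original name as a list of strings, which
    fits a catch-all multiValued string dynamic field. Empty values are dropped.
    """
    doc = {}
    for key, value in record.items():
        if key == id_key or value is None:
            continue
        values = value if isinstance(value, (list, tuple)) else [value]
        values = [str(v) for v in values if v is not None]
        if values:
            doc[key] = values
    # Set the id last so it wins over any source field that is also named "id".
    doc["id"] = str(record[id_key])
    return doc


def build_update_request(docs, update_url):
    """Build (but do not send) a JSON update request for Solr."""
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        update_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    record = {"pk": 7, "title": "Hello", "tags": ["a", "b"], "updated": None}
    # Hypothetical URL; adjust host, port and core name to your installation.
    url = "http://localhost:8983/solr/collection1/update?commit=true"
    request = build_update_request([to_solr_doc(record, "pk")], url)
    print(request.data.decode("utf-8"))
    # To actually index the document: urllib.request.urlopen(request)
```

Note that nothing in the code knows the field names in advance. That is the point of the technique: a new column in the source shows up in the index without touching the schema.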
Instead I can focus on getting the data out of the source system and into Solr. When that’s done, I can use Solr’s schema browser to see which fields are of high quality, contain a lot of text or are suited to be used as facets, and use this information to help out in the requirements process.
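The schema browser does this inspection inside Solr. If you want a similar overview on a sample of documents outside Solr, the idea can be sketched in a few lines of Python. The function name `profile_fields` and the three metrics are my own choices for illustration, not something Solr provides: a field that is filled in most documents and has few distinct values is a likely facet candidate, while a long average length hints at full-text content.

```python
from collections import defaultdict


def profile_fields(docs):
    """Compute simple per-field statistics over a list of documents (dicts).

    fill_rate:  share of documents with at least one non-empty value
    distinct:   number of distinct values seen for the field
    avg_length: average length of the values, in characters
    """
    values_by_field = defaultdict(list)
    docs_with_field = defaultdict(int)
    for doc in docs:
        for field, value in doc.items():
            values = value if isinstance(value, list) else [value]
            values = [str(v) for v in values if v not in (None, "")]
            if values:
                docs_with_field[field] += 1
                values_by_field[field].extend(values)

    report = {}
    for field, values in values_by_field.items():
        report[field] = {
            "fill_rate": docs_with_field[field] / len(docs),
            "distinct": len(set(values)),
            "avg_length": sum(len(v) for v in values) / len(values),
        }
    return report


if __name__ == "__main__":
    sample = [
        {"id": "1", "category": "news", "body": "a long text here"},
        {"id": "2", "category": "news", "body": ""},
        {"id": "3", "category": "blog"},
    ]
    for field, stats in sorted(profile_fields(sample).items()):
        print(field, stats)
```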
The Structure Last Technique lets you be more pragmatic about your requirements.