Let’s try to explain what a tree facet is, by starting with a common use case of a “normal” facet. It consists of a list of filters, each corresponding to a value of a common search engine field and a count representing the number of documents matching that value. The main characteristic of a tree facet is that its filters each may have a list of child filters, each of which may have a list of child filters, etc. This is where the “tree” part of its name comes from.
Tree facets are therefore well suited to represent data that is inherently hierarchical, e.g. a decision tree, a taxonomy or a file system.
Two commons methods of generating tree facets, using either Elasticsearch or Solr, are the pivot approach and the path approach. Some of the characteristics, benefits and drawbacks of each method are presented below.
While ordinary facets consist of a flat list of buckets, tree facets consist of multiple levels of buckets, where each bucket may have child buckets, etc. If applying a filter query equivalent to some bucket, all documents matching that bucket, or any bucket in that sub-tree of child buckets, are returned.
Tree facets with Pivot
The name is taken from Solr (Pivot faceting) and allows faceting within results of the parent facet. This is a recursive setting, so pivot faceting can be configured for any number of levels. Think of pivot faceting as a Cartesian product of field values.
A list of fields is provided, where the first element in the list will generate the root level facet, the second element will generate the second level facet, and so on. In Elasticsearch, the same result is achieved by using the more general concept of aggregations. If we take a terms aggregation as an example, this simply means a terms aggregation within a parent terms aggregation, and so on.
The major benefit of pivot faceting is that it can all be configured in query time and the data does not need to be indexed in any specific way. E.g. the list of fields can be modified to change the structure of the returned facet, without having to re-index any content.
The values of the returned facet/aggregation are already in a structured, hierarchical format. There is no need for any parsing of paths to build the tree.
The number of levels in the tree must be known at query time. Since each field must be specified explicitly, it puts a limit on the maximum depth of the tree. If the tree should be extended to allow for more levels, then content must be indexed to new fields and the query needs to include these new fields.
Pivot faceting assumes a uniformity in the data, in that the values on each level in the tree, regardless of their parent, are of the same types. This is because all values on some specific level comes from the same field.
When to use
At least one of the following statements hold:
- The data is homogenous – different objects share similar sets of properties
- The data will, structurally, not change much over time
- There is a requirement on a high level of query time flexibility
- There is a requirement on a high level of flexibility without re-indexing documents
Tree facets with Path
Data is indexed into a single field, on a Unix style file path format, e.g. root/middle/leaf (the path separator is configurable). The index analyzer of this field should be using a path hierarchy tokenizer (Elasticsearch, Solr). It will expand the path so that a filter query for some node in the tree will include the nodes in the sub-tree below the node. The example path above would be expanded to root, root/middle, root/middle/leaf. These represent the filter queries for which the document with this path should be returned. Note that the query analyzer should be keyword/string so that queries are interpreted verbatim.
Once the values have been indexed, a normal facet or terms aggregation is put on the field. This will return all possible paths and sub-paths, which can be a large number, so make sure to request all of them. Once facet/aggregation is returned, its values need to be parsed and built into a tree structure.
The path approach can handle any number of levels in the tree, without any configuration explicitly stating how many levels there are, both on the indexing side and on the query side. It is also a natural way of handling different depths in different places in the tree, not all branches need to be the same length.
Closely related to the above-mentioned benefit, is the fact that the path approach does not impose any restrictions on the uniformity of the tree. Nodes on a specific level in the tree may represent different concepts, dependent only on their parent. This fits very well with many real-world applications, as different objects and entities have different sets of properties.
Data must be formatted in index time. If any structural changes to the tree are required, affected documents need to be re-indexed.
To construct a full tree representation of the paths returned in the facet/aggregation, all paths need to be requested. If the tree is big, this can become costly, both for the search engines to generate and with respect to the size of the response payload.
Data is not returned in a hierarchical format and must be parsed to build the tree structure.
When to use
At least one of the following statements hold:
- The data is heterogenous – different objects have different sets of properties, varying numbers of levels needed in different places in the tree
- The data could change structurally over time
- The content and structure of the tree should be controlled by content only, no configuration changes
Tree facets – Conclusion
The listed benefits and drawback of each method can be used as a guide to find the best method from case to case.
When there is no clear choice, I personally tend to go for the path approach, just because it is so powerful and dynamic. This comes with the main drawback of added cost of configuration for index time data formatting, but it is usually worth it in my opinion.