Hierarchical facet configuration

Valid from Datafari v5.1

The hierarchical facet (see its dedicated documentation page) works with fields that contain a path tokenization, in which each value is prefixed by its depth level. For example, the path '/home/france/labs' is tokenized as follows:

Tokens

0/home

1/home/france

2/home/france/labs
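To make the token format concrete, here is a minimal standalone sketch that produces the three tokens above from the example path. It is not the Datafari implementation: the class name HierarchyTokens and the method tokenize are hypothetical names chosen for illustration, and the sketch assumes that a leading separator does not count as a level.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only (not part of Datafari).
class HierarchyTokens {

    // Returns one token per depth level, each prefixed by its depth.
    static List<String> tokenize(String path, char separator) {
        List<String> tokens = new ArrayList<>();
        int depth = 0;
        // Assumption: a leading separator is not a level of its own.
        int from = (!path.isEmpty() && path.charAt(0) == separator) ? 1 : 0;
        while (from < path.length()) {
            int end = path.indexOf(separator, from);
            if (end == -1) {
                end = path.length(); // last level: take the rest of the path
            }
            if (end > from) { // ignore empty segments such as "//"
                tokens.add(depth + path.substring(0, end));
                depth++;
            }
            from = end + 1;
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints 0/home, 1/home/france and 2/home/france/labs, one per line
        tokenize("/home/france/labs", '/').forEach(System.out::println);
    }
}
```

With this format, the direct children of /home are the values that start with 1/home/, which is how the depth prefix lets a facet list one level at a time.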

Before v5.1, you had to write your own “hierarchical tokenizer”. Since v5.1, Datafari ships a ready-to-use tokenizer that is configured through parameters of the Datafari Update Processor in the solrconfig.xml file of the FileShare core: [DATAFARI_HOME]/solr/solrcloud/FileShare/conf/solrconfig.xml

<processor class="com.francelabs.datafari.updateprocessor.DatafariUpdateProcessorFactory">
  <bool name="entities.extract.simple">${entity.extract:false}</bool>
  <bool name="entities.extract.simple.name">${entity.name:false}</bool>
  <bool name="entities.extract.simple.phone">${entity.phone:false}</bool>
  <bool name="entities.extract.simple.special">${entity.special:false}</bool>
  <str name="entities.extract.simple.special.regex">.*resume*</str>
  <!-- <str name="entities.extract.simple.phone.regex"></str> -->

  <!-- Hierarchical tokenizer configuration -->
  <bool name="hierarchical.path.processing">true</bool>
  <str name="hierarchical.field">urlHierarchy</str>
  <str name="hierarchical.path.separator">/</str>
</processor>

Here are the parameters:

  • hierarchical.path.processing: boolean (true or false). If true, the update processor performs hierarchical tokenization on the “url” field.

  • hierarchical.field: the name of the field that stores the tokens generated by the tokenization. The default is “urlHierarchy”, which exists by default in the schema. You can set another field, but its field type must not have any stemming or other token analyzer configured. Ideally the field type should be either “string” or “keyword”.

  • hierarchical.path.separator: the separator character that the hierarchical tokenizer uses to split the path into depth levels. The default is '/' because most paths use it to separate folders.

The tokenization is always performed on the “url” field of documents; this cannot be configured. It is also performed on all documents without exception. If you want to change either of these behaviors, you have to switch off the default tokenizer (by setting hierarchical.path.processing to false) and implement your own tokenizer in a Custom Update Processor.
Here is the piece of code our update processor uses to perform hierarchical tokenization; you can use it as a base for your custom update processor:

// Get the url to tokenize
String url = (String) doc.getFieldValue("url");

if (hierarchicalPathProcessing) {
  // Create path hierarchy for facet
  final List<String> urlHierarchy = new ArrayList<>();
  final Matcher regexMatcher = hierarchicalRegexPattern.matcher(url);
  String cleanUrl = url;
  if (regexMatcher.find()) {
    final int endIndex = regexMatcher.end();
    cleanUrl = url.substring(endIndex - 2);
  }

  // Create path hierarchy for facet
  final long separatorCount = cleanUrl.chars().filter(ch -> ch == hierarchicalPathSeparator).count();
  int previousIndex = 0;
  int depth = 0;
  // Tokenize the path and add the depth as first character for each token
  // (like: 0/home,1/home/project ...)
  for (int i = 0; i < separatorCount; i++) {
    final int endIndex = cleanUrl.indexOf(hierarchicalPathSeparator, previousIndex);
    if (endIndex == -1) {
      break;
    }
    String label = cleanUrl.substring(0, endIndex);
    if (label.isEmpty()) {
      label = String.valueOf(hierarchicalPathSeparator);
    }
    urlHierarchy.add(depth + label);
    depth++;
    previousIndex = endIndex + 1;
  }

  doc.remove(hierarchicalField);
  doc.addField(hierarchicalField, urlHierarchy);
}
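If you go the custom route, the two behaviors that the default tokenizer fixes (the source field and the set of documents) become parameters of your own code. The sketch below illustrates that idea outside of Solr so that it can be run on its own. It rests on several assumptions: a plain Map is used as a hypothetical stand-in for the Solr document, the names CustomHierarchySketch and addHierarchy and the fields “path” and “pathHierarchy” are invented for illustration, and the URL prefix cleanup done with hierarchicalRegexPattern in the snippet above is left out. Like the snippet above, it emits one token per separator, so the last segment (typically the file name) produces no token; unlike it, this sketch skips an empty leading segment rather than emitting the separator as a root level.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical sketch, not Datafari code: a Map stands in for the Solr document.
class CustomHierarchySketch {

    // Tokenizes doc[sourceField] into doc[targetField], but only for
    // documents accepted by shouldTokenize.
    static void addHierarchy(Map<String, Object> doc, String sourceField, String targetField,
                             char separator, Predicate<Map<String, Object>> shouldTokenize) {
        Object raw = doc.get(sourceField);
        if (!(raw instanceof String) || !shouldTokenize.test(doc)) {
            return; // missing field or filtered out: leave the document untouched
        }
        String path = (String) raw;
        List<String> tokens = new ArrayList<>();
        int depth = 0;
        int from = 0;
        int end;
        // One token per separator: everything located before that separator
        while ((end = path.indexOf(separator, from)) != -1) {
            if (end > 0) { // assumption: skip the empty segment before a leading separator
                tokens.add(depth + path.substring(0, end));
                depth++;
            }
            from = end + 1;
        }
        doc.put(targetField, tokens); // replaces any previous value
    }

    public static void main(String[] args) {
        Map<String, Object> doc = new HashMap<>();
        doc.put("path", "/home/france/labs/report.pdf");
        // Only tokenize documents whose path starts with /home/
        addHierarchy(doc, "path", "pathHierarchy", '/',
                d -> ((String) d.get("path")).startsWith("/home/"));
        System.out.println(doc.get("pathHierarchy"));
    }
}
```

In a real Custom Update Processor, the same logic would read from and write to the Solr document, as the snippet above does with doc.getFieldValue, doc.remove and doc.addField.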

Once your documents are indexed with either our hierarchical tokenizer or yours, you can set up the hierarchical facet widget in the search UI by following the documentation: [DEPRECATED] Ajaxfrancelabs Widgets and Modules