Basic Text Tagging at Indexing and Searching Time

Valid from 4.3

The documentation below is valid from Datafari 4.3 upwards

Datafari comes bundled with a simple entity extraction tool capable of:

Extracting names provided in a list from the documents
Extracting phone numbers from the documents (if they conform to a certain format)
Extracting custom "special" entities that match a provided regular expression
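To make the three kinds of extraction concrete, here is a minimal Python sketch. It is only an illustration of the idea, not Datafari's code: the name list, the phone number format and the "special" pattern below are all hypothetical values chosen for the example.

```python
import re

# Hypothetical inputs: none of these values come from Datafari itself.
NAMES = ["John", "Jack", "Daniel", "Henry"]
PHONE_PATTERN = re.compile(r"\b\d{2}(?: \d{2}){4}\b")  # assumed format: 01 23 45 67 89
SPECIAL_PATTERN = re.compile(r"\bINV-\d{4}\b")         # assumed custom entity, e.g. INV-2024


def extract_entities(text):
    """Return the names, phone numbers and special entities found in text."""
    # A name counts as found when it appears as a whole word in the text.
    names = [n for n in NAMES if re.search(rf"\b{re.escape(n)}\b", text)]
    phones = PHONE_PATTERN.findall(text)
    special = SPECIAL_PATTERN.findall(text)
    return {"names": names, "phones": phones, "special": special}
```

For instance, extract_entities("John called 01 23 45 67 89 about INV-2024.") returns one name, one phone number and one special entity.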

Configuration of the entity extraction tool must be done BEFORE the indexation.

Any change in the configuration after the first indexation requires wiping the index completely, reloading the Solr core FileShare, and indexing the files again for it to take effect.

Entity extraction takes place during indexation, so activating this feature affects indexing performance. The impact depends on the number of features activated and on the complexity of the regex used (the time needed to compute matches).

Activating and de-activating the Feature

The activation state of this feature can be managed only from the Search Engine Configuration → Entity Extraction page in the Datafari administration panel, shown in the image below.

From there, a global switch enables or disables the whole feature, and a separate switch toggles each specific feature.

You must click the save button for the changes to take effect.

Configure the list of names to be retrieved

Remember that this must be done before the indexation. If you forget, or if you change the file after the initial indexation, you will need to empty the index and index everything again.

If you choose to use the name extraction, you need to provide the system with a list of names that must be extracted.

This list can't be edited from the admin panel at the moment.

First, download your Solr configuration from Zookeeper.

To do so, use the Search Engine Configuration → Zookeeper screen.

This downloads the Zookeeper configuration for the FileShare collection to /opt/datafari/bin/backup/solr/.

Now, copy the content of the folder /opt/datafari/bin/backup/solr/ into the folder /opt/datafari/solr/solrcloud/FileShare/conf. Open the file keep_phrases.txt and fill it with the names you want to identify in your documents, one name per line, like this:

John
Jack
Daniel
Henry
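If the names come from another source, the file can also be generated rather than typed by hand. The Python sketch below writes one name per line and drops blanks and duplicates. The helper name is hypothetical, and UTF-8 encoding is an assumption, since the expected encoding of keep_phrases.txt is not stated here.

```python
from pathlib import Path


def write_name_list(names, target):
    """Write one name per line to target, skipping blanks and duplicates.

    The original order of the names is kept. UTF-8 is an assumption.
    """
    seen = set()
    cleaned = []
    for name in names:
        name = name.strip()
        if name and name not in seen:
            seen.add(name)
            cleaned.append(name)
    Path(target).write_text("\n".join(cleaned) + "\n", encoding="utf-8")
    return cleaned


# Example call (path taken from the steps above):
# write_name_list(["John", "Jack", "Daniel", "Henry"],
#                 "/opt/datafari/solr/solrcloud/FileShare/conf/keep_phrases.txt")
```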

Once you are done, save and close the file, upload the configuration to Zookeeper using the Search Engine Configuration → Zookeeper screen, and reload the Zookeeper configuration from the same screen.

From now on, if you activate the name entity extraction and index new data, the names present in the list will be recognized as entities and extracted by the system.

Configure the special entity extraction regex

The regex must be changed before the first indexation. If you want to change it later, you will need to empty the index and re-index everything after the change, otherwise the entity extraction results will be inconsistent.

The regex used in the special entity extraction cannot be tuned from the admin panel at the moment.

To modify it, first go to the admin panel, activate the special entity extraction feature (which also requires activating the simple extraction feature, as it is the global switch), and save your changes.

Once this is done, download your Solr configuration from Zookeeper.

To do so, use the Search Engine Configuration → Zookeeper screen.

This downloads the Zookeeper configuration for the FileShare collection to /opt/datafari/bin/backup/solr/.

Now, copy the content of the folder /opt/datafari/bin/backup/solr/ into the folder /opt/datafari/solr/solrcloud/FileShare/conf.
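The copy step can be scripted. The sketch below is one possible way to do it in Python (3.8 or later, for dirs_exist_ok). The helper name is hypothetical, and files already present in the destination are overwritten, so use it with care.

```python
import shutil
from pathlib import Path


def copy_config(src, dst):
    """Copy the content of src into dst, overwriting files that already exist.

    Returns the sorted list of files present in dst afterwards.
    """
    src, dst = Path(src), Path(dst)
    if not src.is_dir():
        raise FileNotFoundError(f"backup folder not found: {src}")
    # dirs_exist_ok=True lets the copy merge into an existing conf folder.
    shutil.copytree(src, dst, dirs_exist_ok=True)
    return sorted(p.relative_to(dst).as_posix() for p in dst.rglob("*") if p.is_file())


# Example call (paths taken from the step above):
# copy_config("/opt/datafari/bin/backup/solr/",
#             "/opt/datafari/solr/solrcloud/FileShare/conf")
```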

Edit the file solrconfig.xml and go to the DatafariUpdateProcessorFactory processor section, which should look like this:

<processor class="com.francelabs.datafari.updateprocessor.DatafariUpdateProcessorFactory">
  <bool name="entities.extract.simple">${entity.extract:false}</bool>
  <bool name="entities.extract.simple.name">${entity.name:false}</bool>
  <bool name="entities.extract.simple.phone">${entity.phone:false}</bool>
  <bool name="entities.extract.simple.special">${entity.special:false}</bool>
  <str name="entities.extract.simple.special.regex">.*resume*</str>
</processor>

Edit the line <str name="entities.extract.simple.special.regex">.*resume*</str> to match your needs and save the file.
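If you prefer to script this edit, the Python sketch below replaces the regex in the text of solrconfig.xml without touching the rest of the file. It is an assumption-laden helper, not a Datafari tool: the function name is hypothetical, and it expects the element to appear exactly once, written as shown above. Characters such as & and < are XML-escaped before being written.

```python
import re
from xml.sax.saxutils import escape

# Matches the <str> element that holds the special-entity regex.
_REGEX_ELEMENT = re.compile(
    r'(<str name="entities\.extract\.simple\.special\.regex">)(.*?)(</str>)',
    re.DOTALL,
)


def set_special_regex(solrconfig_text, new_regex):
    """Return solrconfig_text with the special-entity regex replaced."""
    replacement_count = 0

    def _swap(match):
        nonlocal replacement_count
        replacement_count += 1
        # Keep the opening and closing tags, swap only the element content.
        return match.group(1) + escape(new_regex) + match.group(3)

    updated = _REGEX_ELEMENT.sub(_swap, solrconfig_text)
    if replacement_count != 1:
        raise ValueError(f"expected exactly one regex element, found {replacement_count}")
    return updated
```

A typical use is to read the file, call set_special_regex on its content, and write the result back before uploading the configuration.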

Once you are done, save and close the file, upload the configuration to Zookeeper using the Search Engine Configuration → Zookeeper screen, and reload the Zookeeper configuration from the same screen.

From now on, if you index new files, text matching the provided regex will be extracted as an entity.
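Before re-indexing, it can help to preview what a pattern captures on sample text. The sketch below uses Python's re module as a stand-in. This is an assumption: the processor is a Java class, so the real engine is presumably Java's, and how the regex is applied to the document content is not described here. Treat the output as an approximation only.

```python
import re


def preview_matches(pattern, text):
    """Return every non-empty substring of text matched by pattern."""
    return [m.group(0) for m in re.finditer(pattern, text) if m.group(0)]


# Hypothetical sample content.
sample = "see my resume here\nno match\nold resum"

# With the default pattern, ".*" also captures the text that precedes
# "resum" on the same line, and the final "e*" makes the last "e" optional.
print(preview_matches(r".*resume*", sample))
# → ['see my resume', 'old resum']
```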

You can activate and de-activate the feature as often as you want; the regex won't change.

If you change the regex, you must clear the index and perform a full indexation from scratch for it to take effect, otherwise you will face inconsistent results.

Extraction is performed at indexation time against the content of the documents only.

A complex regex, or a regex matching long text, can have a negative impact on indexing performance.
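To get a rough feel for the cost of a candidate regex before indexing, you can time it on representative text. The sketch below does this with Python's re module, which is only a proxy for the engine Datafari actually uses. The helper name and the sample text are hypothetical.

```python
import re
import time


def time_regex(pattern, text, repeats=5):
    """Return the best wall-clock time, in seconds, to scan text with pattern."""
    compiled = re.compile(pattern)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for _match in compiled.finditer(text):
            pass  # only the scanning time matters here
        best = min(best, time.perf_counter() - start)
    return best


# Hypothetical sample content standing in for an extracted document body.
sample = "Please find my resume attached. " * 2000

simple_cost = time_regex(r"resume", sample)
greedy_cost = time_regex(r".*resume*", sample)
```

Comparing simple_cost and greedy_cost on your own documents gives an indication of how much a pattern may slow down indexing.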

See Simple entity extraction implementation - Enterprise Edition for details on where the code that manages entity extraction and display is located in Datafari.

Valid from 4.1

The documentation below is valid from Datafari 4.1 upwards

Datafari comes bundled with a simple entity extraction tool capable of:

Extracting names provided in a list from the documents
Extracting phone numbers from the documents (if they conform to a certain format)
Extracting custom "special" entities that match a provided regular expression

Configuration of the entity extraction tool must be done BEFORE the indexation.

Any change in the configuration after the first indexation requires wiping the index completely, reloading the Solr core FileShare, and indexing the files again for it to take effect.

Entity extraction takes place during indexation, so activating this feature affects indexing performance. The impact depends on the number of features activated and on the complexity of the regex used (the time needed to compute matches).

Activating and de-activating the Feature

The activation state of this feature can be managed from the Search Engine Configuration → Entity Extraction page in the Datafari administration panel, shown in the image below.

From there, a global switch enables or disables the whole feature, and a separate switch toggles each specific feature.

You must click the save button for the changes to take effect.

Configure the list of names to be retrieved

Remember that this must be done before the indexation. If you forget, or if you change the file after the initial indexation, you will need to empty the index and index everything again.

If you choose to use the name extraction, you need to provide the system with a list of names that must be extracted.

This list can't be edited from the admin panel at the moment.

You must download the Zookeeper configuration using the Search Engine Configuration → Zookeeper screen.

This downloads the Zookeeper configuration for the FileShare core to /opt/datafari/solr/solrcloud/FileShare/conf.

In this folder, open the file keep_phrases.txt and fill it with the names you want to identify in your documents, one name per line, like this:

John
Jack
Daniel
Henry

Once you are done, save and close the file, upload the configuration to Zookeeper using the Search Engine Configuration → Zookeeper screen, and reload the Zookeeper configuration from the same screen.

From now on, if you activate the name entity extraction and index new data, the names present in the list will be recognized as entities and extracted by the system.

Configure the special entity extraction regex

The regex must be changed before the first indexation. If you want to change it later, you will need to empty the index and re-index everything after the change, otherwise the entity extraction results will be inconsistent.

The regex used in the special entity extraction cannot be tuned from the admin panel at the moment.

To modify it, first go to the admin panel, activate the special entity extraction feature (which also requires activating the simple extraction feature, as it is the global switch), and save your changes.

Once this is done, download the Zookeeper configuration using the Search Engine Configuration → Zookeeper screen.

This downloads the Zookeeper configuration for the FileShare core to /opt/datafari/solr/solrcloud/FileShare/conf.

Edit the file configoverlay.json, which should look like this:

{"updateProcessor":{"datafariUpdateProcessor":{
      "entities.extract.simple.special":true,
      "entities.extract.simple.special.regex":".*resume*",
      "entities.extract.simple.phone":false,
      "name":"datafariUpdateProcessor",
      "extension.fromname":false,
      "entities.extract.simple":true,
      "class":"com.francelabs.datafari.updateprocessor.DatafariUpdateProcessorFactory",
      "entities.extract.simple.name":false}}}

Edit the regex on the right-hand side of the line "entities.extract.simple.special.regex":".*resume*" to match your needs and save the file.
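Hand-editing JSON makes it easy to break the quoting, especially when the regex contains backslashes. As an alternative, the Python sketch below updates the value through the json module, which handles the escaping. The helper name is hypothetical, the key path is the one shown in the file above, and a KeyError is raised if that path is missing. Note that the output is re-indented, which changes the file layout but not its content.

```python
import json


def set_overlay_regex(overlay_text, new_regex):
    """Return the configoverlay.json text with the special-entity regex replaced."""
    overlay = json.loads(overlay_text)
    # Key path as it appears in the overlay file shown above.
    processor = overlay["updateProcessor"]["datafariUpdateProcessor"]
    processor["entities.extract.simple.special.regex"] = new_regex
    # json.dumps escapes backslashes and quotes in the regex as needed.
    return json.dumps(overlay, indent=2)
```

A typical use is to read configoverlay.json, pass its content to set_overlay_regex, and write the result back before uploading the configuration.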

Once you are done, save and close the file, upload the configuration to Zookeeper using the Search Engine Configuration → Zookeeper screen, and reload the Zookeeper configuration from the same screen.

From now on, if you index new files, text matching the provided regex will be extracted as an entity.

You can activate and de-activate the feature as often as you want; the regex won't change.

If you change the regex, you must clear the index and perform a full indexation from scratch for it to take effect, otherwise you will face inconsistent results.

Extraction is performed at indexation time against the content of the documents only.

A complex regex, or a regex matching long text, can have a negative impact on indexing performance.

See Simple entity extraction implementation - Enterprise Edition for details on where the code that manages entity extraction and display is located in Datafari.