Basic Text Tagging at Indexing and Searching Time

Valid from Datafari 6.0

To tag text, use the feature documented at https://datafari.atlassian.net/wiki/spaces/DATAFARI/pages/2844295173.


Valid from 4.3 up to 5.5

The documentation below is valid from Datafari 4.3 up to 5.5.

Datafari comes bundled with a simple entity extraction tool capable of:

  • Extracting names provided in a list from the documents

  • Extracting phone numbers from the documents (if they conform to a certain format)

  • Extracting custom "special" entities that match a provided regular expression

Configuration of the entity extraction tool must be done BEFORE indexing.

Any configuration change made after the first indexing run requires wiping the index completely, reloading the FileShare Solr core, and indexing the files again for the change to take effect.

Activating and de-activating the Feature

The activation state of this feature can only be managed from the Search Engine Configuration → Entity Extraction page in the administration panel of Datafari, which is presented in the image below.

From there, a global switch enables or disables the whole feature, and a separate switch toggles each specific extraction type.

You must click the Save button for the changes to take effect.

Configure the list of names to be retrieved

If you choose to use the name extraction, you need to provide the system with a list of names that must be extracted.

This list can't be edited from the admin panel at the moment.

First, download your Solr configuration from Zookeeper using the Search Engine Configuration → Zookeeper screen.

This will download the Zookeeper configuration for the FileShare collection to /opt/datafari/bin/backup/solr/.

Now, copy the content of the folder /opt/datafari/bin/backup/solr/ into the folder /opt/datafari/solr/solrcloud/FileShare/conf. Open the file keep_phrases.txt and fill it with the names you want to identify in your documents, one name per line, like this:

John
Jack
Daniel
Henry
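The one-name-per-line format can also be produced by a small script. The following is only a sketch: the helper name write_name_list is hypothetical, and it assumes that blank entries, surrounding whitespace and duplicate names are unwanted in keep_phrases.txt, which this page does not state.

```python
from pathlib import Path


def write_name_list(names, path):
    """Write one name per line, dropping blanks and duplicates (order is kept)."""
    seen = set()
    cleaned = []
    for name in names:
        name = name.strip()
        if name and name not in seen:
            seen.add(name)
            cleaned.append(name)
    # One name per line, with a trailing newline at the end of the file.
    Path(path).write_text("\n".join(cleaned) + "\n", encoding="utf-8")
    return cleaned


# Example call (the target path is an assumption, adapt it to your local copy):
# write_name_list(["John", "Jack", "Daniel", "Henry"],
#                 "/opt/datafari/solr/solrcloud/FileShare/conf/keep_phrases.txt")
```

The function returns the cleaned list so that you can review what was written before uploading the configuration.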

Once you are done, save and close the file, upload the configuration to Zookeeper using the Search Engine Configuration → Zookeeper screen, then reload the Zookeeper configuration from the same screen.

From now on, if you activate the name entity extraction and index new data, the names present in the list will be recognized as entities and extracted by the system.

Configure the special entity extraction regex

The regex used in the special entity extraction cannot be tuned from the admin panel at the moment.

To modify it, first go to the admin panel, activate the special entity extraction feature (which requires activating the simple extraction feature, as it is the global switch) and save your changes.

Once this is done, download your Solr configuration from Zookeeper using the Search Engine Configuration → Zookeeper screen.

This will download the Zookeeper configuration for the FileShare collection to /opt/datafari/bin/backup/solr/.

Now, copy the content of the folder /opt/datafari/bin/backup/solr/ into the folder /opt/datafari/solr/solrcloud/FileShare/conf.
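If you repeat this copy step often, it can be scripted. The sketch below uses the two folders named above; the helper name copy_config is hypothetical, and the script assumes it runs on the Datafari server with write access to the conf folder.

```python
import shutil
from pathlib import Path

# Folders taken from the steps above.
BACKUP_DIR = Path("/opt/datafari/bin/backup/solr")
CONF_DIR = Path("/opt/datafari/solr/solrcloud/FileShare/conf")


def copy_config(src, dst):
    """Copy the content of src into dst, overwriting files that share a name."""
    src, dst = Path(src), Path(dst)
    if not src.is_dir():
        raise FileNotFoundError(f"backup folder not found: {src}")
    # dirs_exist_ok requires Python 3.8 or later.
    shutil.copytree(src, dst, dirs_exist_ok=True)
    # Return the files now present in dst, relative to dst, for review.
    return sorted(p.relative_to(dst).as_posix() for p in dst.rglob("*") if p.is_file())


# Example call:
# for name in copy_config(BACKUP_DIR, CONF_DIR):
#     print(name)
```

Files that exist only in the destination folder are left untouched, which mirrors a plain copy rather than a synchronisation.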

Edit the file solrconfig.xml and go to the <processor class="com.francelabs.datafari.updateprocessor.DatafariUpdateProcessorFactory"> section, which should look like this:

<processor class="com.francelabs.datafari.updateprocessor.DatafariUpdateProcessorFactory">
  <bool name="entities.extract.simple">${entity.extract:false}</bool>
  <bool name="entities.extract.simple.name">${entity.name:false}</bool>
  <bool name="entities.extract.simple.phone">${entity.phone:false}</bool>
  <bool name="entities.extract.simple.special">${entity.special:false}</bool>
  <str name="entities.extract.simple.special.regex">.*resume*</str>
</processor>

Edit the line <str name="entities.extract.simple.special.regex">.*resume*</str> to match your needs and save the file.
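As an illustration of which element carries the regex, the sketch below changes it in a copy of the processor fragment using Python's standard xml.etree.ElementTree module. The helper name set_special_regex and the replacement pattern .*invoice.* are hypothetical. The sketch works on the fragment only: rewriting a whole solrconfig.xml this way would drop XML comments with the default parser, so editing the file by hand remains the safer route.

```python
import xml.etree.ElementTree as ET

REGEX_PARAM = "entities.extract.simple.special.regex"

# Shortened copy of the processor section shown above.
SAMPLE = """<processor class="com.francelabs.datafari.updateprocessor.DatafariUpdateProcessorFactory">
  <bool name="entities.extract.simple">${entity.extract:false}</bool>
  <bool name="entities.extract.simple.special">${entity.special:false}</bool>
  <str name="entities.extract.simple.special.regex">.*resume*</str>
</processor>"""


def set_special_regex(processor_xml, new_regex):
    """Return (old_regex, updated_xml) for a <processor> fragment."""
    root = ET.fromstring(processor_xml)
    # Locate the <str> child whose name attribute is the regex parameter.
    node = root.find(f"str[@name='{REGEX_PARAM}']")
    if node is None:
        raise ValueError(f"no <str name={REGEX_PARAM!r}> element found")
    old = node.text
    node.text = new_regex
    return old, ET.tostring(root, encoding="unicode")


# Example call (".*invoice.*" is a made-up pattern):
# old, updated = set_special_regex(SAMPLE, ".*invoice.*")
```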

Once you are done, save and close the file, upload the configuration to Zookeeper using the Search Engine Configuration → Zookeeper screen, then reload the Zookeeper configuration from the same screen.

From now on, if you index new files, text matching the provided regex will be extracted as an entity.
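Because a regex change requires a full re-index, it can be worth previewing a pattern on sample text first. The sketch below does so with Python's re module. It rests on two assumptions: Datafari runs on Java, whose regex dialect is close to Python's for simple patterns but not identical, and this page does not say whether the pattern is applied as a partial or a whole-text match, so the sketch uses a partial search. The helper name preview is hypothetical.

```python
import re

DEFAULT_REGEX = r".*resume*"  # default value shown in solrconfig.xml


def preview(regex, samples):
    """Map each sample to the text matched by a partial search, or None."""
    pattern = re.compile(regex)
    result = {}
    for text in samples:
        match = pattern.search(text)
        result[text] = match.group(0) if match else None
    return result


# Example call:
# preview(DEFAULT_REGEX, ["my resume 2020", "resum", "cover letter"])
```

With the default pattern, the sample "my resume 2020" yields "my resume" and "cover letter" yields no match. Note that the trailing e* makes the final "e" optional, so "resum" alone also matches.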

You can activate and de-activate the feature as often as you want; the regex won't change.

If you change the regex, you must clear the index and re-index everything from scratch for it to take effect; otherwise you will face inconsistent results.

See Simple entity extraction implementation - Enterprise Edition for details on where the code that manages entity extraction and display is located in Datafari.


 

Datafari comes bundled with a simple entity extraction tool capable of:

  • Extracting names provided in a list from the documents

  • Extracting phone numbers from the documents (if they conform to a certain format)

  • Extracting custom "special" entities that match a provided regular expression

Activating and de-activating the Feature

The activation state of this feature can be managed from the Search Engine Configuration → Entity Extraction page in the administration panel of Datafari, which is presented in the image below.

From there, a global switch enables or disables the whole feature, and a separate switch toggles each specific extraction type.

You must click the Save button for the changes to take effect.

Configure the list of names to be retrieved

If you choose to use the name extraction, you need to provide the system with a list of names that must be extracted.

This list can't be edited from the admin panel at the moment.

You must download the Zookeeper configuration using the Search Engine Configuration → Zookeeper screen.

This will download the Zookeeper configuration for the FileShare core to /opt/datafari/solr/solrcloud/FileShare/conf.

In this folder, open the file keep_phrases.txt and fill it with the names you want to identify in your documents, one name per line, like this:

John
Jack
Daniel
Henry

Once you are done, save and close the file, upload the configuration to Zookeeper using the Search Engine Configuration → Zookeeper screen, then reload the Zookeeper configuration from the same screen.

From now on, if you activate the name entity extraction and index new data, the names present in the list will be recognized as entities and extracted by the system.

Configure the special entity extraction regex

The regex used in the special entity extraction cannot be tuned from the admin panel at the moment.

To modify it, first go to the admin panel, activate the special entity extraction feature (which requires activating the simple extraction feature, as it is the global switch) and save your changes.

Once this is done, download the Zookeeper configuration using the Search Engine Configuration → Zookeeper screen.

This will download the Zookeeper configuration for the FileShare core to /opt/datafari/solr/solrcloud/FileShare/conf.

Edit the file configoverlay.json which should look like this:

Edit the regex on the right-hand side of the line "entities.extract.simple.special.regex":".*resume*" to match your needs and save the file.
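The same change can be sketched in a script that does not depend on where the key sits inside configoverlay.json. The SAMPLE_OVERLAY structure below is purely hypothetical and is NOT the real layout of the file; only the key name and its default value come from this page. The helper name replace_key and the pattern .*invoice.* are also made up for illustration.

```python
import json

REGEX_KEY = "entities.extract.simple.special.regex"

# Hypothetical nesting, for illustration only; the real file may differ.
SAMPLE_OVERLAY = '{"someSection": {"someProcessor": {"entities.extract.simple.special.regex": ".*resume*"}}}'


def replace_key(node, key, value):
    """Set every occurrence of key in a nested JSON structure; return the count."""
    count = 0
    if isinstance(node, dict):
        for k in node:
            if k == key:
                node[k] = value
                count += 1
            else:
                count += replace_key(node[k], key, value)
    elif isinstance(node, list):
        for item in node:
            count += replace_key(item, key, value)
    return count


# Example call:
# data = json.loads(SAMPLE_OVERLAY)
# replace_key(data, REGEX_KEY, ".*invoice.*")
# print(json.dumps(data, indent=2))
```

A returned count of 0 means the key was not found, in which case the file should be left unchanged and inspected by hand.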

Once you are done, save and close the file, upload the configuration to Zookeeper using the Search Engine Configuration → Zookeeper screen, then reload the Zookeeper configuration from the same screen.

From now on, if you index new files, text matching the provided regex will be extracted as an entity.

You can activate and de-activate the feature as often as you want; the regex won't change.

If you change the regex, you must clear the index and re-index everything from scratch for it to take effect; otherwise you will face inconsistent results.

See Simple entity extraction implementation - Enterprise Edition for details on where the code that manages entity extraction and display is located in Datafari.