Simple entity extraction implementation

 

Valid from Datafari 4.1 for the Enterprise Edition (EE) and from Datafari 5 for the Community Edition (CE)

The basic entity extraction bundled in Datafari is inspired by the Lucidworks blog post "How to Perform Entity Extraction in Solr".

We added this code to Datafari Enterprise so that it can be reused for customers.

The goal of this page is to explain the work already done and how to reuse it.

1. Solr configuration

  • Schema.xml

-- Field type key_phrased added:

This field type is useful if you have a select list of special terms or phrases for your domain that you would like to turn into facets, so that you can easily filter the documents that contain them.

-- Examples of fields created to extract entities: entity_phone, entity_phone_present, entity_people, entity_people_present

-- Copyfield used to fill the entity_person field (it may or may not be present, depending on whether the names entity extraction is activated in Datafari)

  • keep_phrases.txt

A text file containing the entities to be identified in the documents when the names entity extraction feature is activated, one entity per line. It was designed to extract names, but it can be used to extract any list of phrases the user wants.
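To make the file format concrete, here is a minimal sketch in plain Java of what "one entity per line" matching amounts to. It is not the Datafari or Solr implementation: in Datafari the matching is done by the Solr field type at indexing time. The class name KeepPhrasesSketch, its methods, and the case-insensitive whole-word matching are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration only: approximates, outside Solr, what a
// keep_phrases.txt-style list does. Not the actual Datafari code.
public class KeepPhrasesSketch {

    /** Parses keep_phrases.txt-style content: one entity per line, blank lines skipped. */
    static List<String> parsePhrases(String fileContent) {
        List<String> phrases = new ArrayList<>();
        for (String line : fileContent.split("\\R")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                phrases.add(trimmed);
            }
        }
        return phrases;
    }

    /** Returns the listed phrases that occur in the text (assumed: case-insensitive, whole words). */
    static List<String> findEntities(String text, List<String> phrases) {
        List<String> found = new ArrayList<>();
        for (String phrase : phrases) {
            Pattern p = Pattern.compile("\\b" + Pattern.quote(phrase) + "\\b",
                    Pattern.CASE_INSENSITIVE);
            if (p.matcher(text).find()) {
                found.add(phrase);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // Sample file content with two entities and a blank line between them.
        List<String> phrases = parsePhrases("Ada Lovelace\n\nAlan Turing\n");
        System.out.println(findEntities("A talk about Alan Turing and his work.", phrases));
    }
}
```

The phrases that match are the values that would end up as facet entries; anything not listed in the file is ignored.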

  • DatafariUpdateProcessor.java 

An entity extraction section has been added to this class. It checks variables provided in the update processor definition in solrconfig.xml to determine whether the feature is activated. This ensures that the code is not run if the feature is not activated.

Phone numbers are extracted with a regex pattern. If a US phone number is found in the content field, the matched expression is copied to a dedicated field, entity_phone, and the entity_phone_present field is set to true.
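The phone number extraction described here can be sketched in standalone Java as follows. This is not the code from DatafariUpdateProcessor.java: the class name PhoneEntitySketch, the enabled flag (standing in for the variables read from solrconfig.xml), and the regex itself are assumptions. Leaving both fields unset when no number is found is also an assumption.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only: a standalone approximation of the
// phone extraction step, without any Solr dependency.
public class PhoneEntitySketch {

    // Assumed pattern for US numbers such as 555-123-4567, (555) 123-4567 or 555.123.4567.
    private static final Pattern US_PHONE = Pattern.compile(
            "(?<!\\d)(?:\\(\\d{3}\\)\\s?|\\d{3}[-.\\s])\\d{3}[-.\\s]\\d{4}(?!\\d)");

    /**
     * Returns the fields to add to the document. The map is empty when the
     * feature is disabled or when no phone number is found in the content.
     */
    static Map<String, Object> extract(String content, boolean enabled) {
        Map<String, Object> fields = new LinkedHashMap<>();
        if (!enabled) {
            return fields; // feature not activated: skip the extraction entirely
        }
        List<String> phones = new ArrayList<>();
        Matcher m = US_PHONE.matcher(content);
        while (m.find()) {
            phones.add(m.group());
        }
        if (!phones.isEmpty()) {
            fields.put("entity_phone", phones);
            fields.put("entity_phone_present", Boolean.TRUE);
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(extract("Call (555) 123-4567 or 555-987-6543.", true));
    }
}
```

Checking the flag before running the matcher mirrors the activation check described above: when the feature is off, the regex is never evaluated.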

2. UI configuration - Only for Ajaxfrancelabs (check DatafariUI config documentation for DatafariUI)

The code for the facets related to entity extraction can be found in:

-- Datafari/js/search.js

-- Datafari/searchView.jsp

-- Datafari/js/AjaxFranceLabs/widgets/SubClassResult.widget.js

See the blog post http://www.francelabs.com/blog/?p=475&preview=true for more details.

The page "Basic Text Tagging at indexing and Searching time" describes the process to follow to configure the entity extraction already implemented in Datafari.