Regex Entity Connector

Valid from v6

This documentation is valid from Datafari v6 onwards.

The Regex Entity Connector applies a regular expression to the content or metadata of a document and populates a Solr metadata field with the result.

For example, you can use it to extract e-mail addresses, phone numbers, or product numbers from documents. You can also extract words coming from a list and later decide to display them as facets.

How to use it

Add the Regex Entity Connector in the job pipeline in the Connection tab of MCF.

It can be inserted anywhere in the pipeline as long as it comes after the TikaServerRmetaConnector. The connector needs the document to have been converted to text; otherwise the document contains only binary data.

Then, in the Regex Entity configuration tab that appears, fill in the fields as explained below:

Source field: the field to which the regular expression is applied. It can be set to “content”, “url”, or any ManifoldCF document field provided by the Repository Connector. Note: these are NOT the Solr index fields.

Associated regular expression: the regular expression used to extract data and fill the metadata. The example given retrieves the line in the document containing the word “option”, case insensitive.

Destination field: the metadata (present in the Datafari Solr) to be filled. Some valid metadata are given as examples. We do not recommend using them as-is, because another indexing pipeline is likely to write to the same fields and overwrite their values. Instead, create fields dedicated to your regex extractions, for instance regex_phones.

Value if match (optional): if one or more matches are found during the extraction and this field is set, its value is written to the destination metadata instead of the raw match values.

Value if no match (optional): if no match is found during the extraction and this field is set, its value is written to the destination metadata.

Keep only one value: if multiple matches are found during the extraction and if this optional field is set to true, then only the first one will be kept in the destination metadata. This parameter only applies if "Value if match" is empty.

  • The source metadata must be “content”, “url” or any ManifoldCF document field provided by your Repository Connector.

  • The destination metadata must exist in Solr (the connector does not check this).

  • Depending on your regular expression, several different values may be found in a document, so the metadata receiving the results must be multi-valued; otherwise it will contain only the last match found.

  • If the checkbox “Keep only one value” is set to true or if “Value if match” is specified, then only one value will be used.

You can add as many combinations of source metadata, regular expression and destination metadata as you want by clicking the Add button.
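The way the three optional settings interact can be sketched as follows. This is only an illustration in Python, not the connector's code: the function name `apply_rule` and its parameters are hypothetical, and it assumes that the whole match (rather than a capture group) is the value that gets stored.

```python
import re

def apply_rule(text, pattern, value_if_match=None,
               value_if_no_match=None, keep_only_one=False):
    """Return the values that would end up in the destination field.

    Hypothetical helper mirroring the documented options; assumes the
    full match (group 0) is the extracted value.
    """
    matches = [m.group(0) for m in re.finditer(pattern, text)]
    if matches:
        if value_if_match:
            # "Value if match" replaces the raw matches with a fixed value.
            return [value_if_match]
        if keep_only_one:
            # "Keep only one value" keeps the first match only.
            return matches[:1]
        return matches
    if value_if_no_match:
        return [value_if_no_match]
    # Nothing found and no fallback: the field is not added.
    return []

text = "Option A is set.\nNothing here.\nAnother option follows."
pattern = r"(?i).*option.*"  # lines containing "option", case insensitive

print(apply_rule(text, pattern))
print(apply_rule(text, pattern, keep_only_one=True))
print(apply_rule(text, pattern, value_if_match="true"))
print(apply_rule("nothing", pattern, value_if_no_match="false"))
```

Note that “Keep only one value” is only consulted when “Value if match” is empty, which matches the description of that parameter.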

Some examples of useful regular expressions:

  • Ignore case: (?i)searched_word: retrieves “searched_word” regardless of character case.

  • Retrieve the line containing: .*searched_word.*

  • Match a literal dot: \. (“\” is the escape character).

  • Spaces are taken into account, so searching “word1 word2” will search the exact expression in the content.

  • e-mails: ([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})

  • Phone number: (\+)([\s.\(\)]*\d{1}){8,13}(-)?(\d{1,5})

  • Search “word1” or “word2”: word1|word2.

More regex examples here: Regex typical use cases
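Before putting an expression into a job, it can help to try it on a sample text. The sketch below does this with Python's `re` module, purely as an illustration: the connector runs inside ManifoldCF, whose regex engine may differ in some details, and the sample strings are made up.

```python
import re

# Patterns copied from the list above.
PATTERNS = {
    "ignore_case": r"(?i)searched_word",
    "line_containing": r".*searched_word.*",
    "literal_dot": r"\.",
    "email": r"([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})",
    "phone": r"(\+)([\s.\(\)]*\d{1}){8,13}(-)?(\d{1,5})",
    "either": r"word1|word2",
}

def find_all(name, text):
    """Return every full match of the named pattern in text."""
    return [m.group(0) for m in re.finditer(PATTERNS[name], text)]

print(find_all("ignore_case", "Searched_Word and SEARCHED_WORD"))
print(find_all("line_containing", "first line\nthe searched_word is here"))
print(find_all("literal_dot", "a.b.c"))
print(find_all("email", "Contact john.doe@example.com or sales@example.org."))
print(find_all("phone", "Call +33 1 23 45 67 89 today"))
print(find_all("either", "word2 then word1"))
```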

 

Some examples of use cases:

With the first line, the crawl extracts all e-mail addresses from the content and stores them in the multi-valued field “entity_email”.

The second line indicates that if the document content contains at least one phone number, the “entity_phone_present” Solr field will be set to “true”. Otherwise, it will be set to “false”.

The third one extracts the first phone number appearing in the document and stores it in the “entity_phone” field. If no phone number is found, this field won’t be added to the document.

Finally, the last line indicates that if a line from the document contains the expression “word”, then it will be added to the multi-valued field “entity_word”. If the expression is not found, then “entity_word” will be set to “No word here”.
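The four configuration lines described above can be approximated as data and applied to a sample document. This is a sketch under assumptions: the `RULES` layout and the `extract` function are hypothetical, the regular expressions are taken from the examples list earlier on this page, and the real connector may behave differently in details such as which part of the match is stored.

```python
import re

EMAIL = r"([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})"
PHONE = r"(\+)([\s.\(\)]*\d{1}){8,13}(-)?(\d{1,5})"

# (source, regex, destination, value if match, value if no match, keep only one)
RULES = [
    ("content", EMAIL, "entity_email", None, None, False),
    ("content", PHONE, "entity_phone_present", "true", "false", False),
    ("content", PHONE, "entity_phone", None, None, True),
    ("content", r".*word.*", "entity_word", None, "No word here", False),
]

def extract(document):
    """Apply every rule to the document and return the resulting metadata."""
    metadata = {}
    for source, regex, dest, if_match, if_no_match, only_one in RULES:
        found = [m.group(0) for m in re.finditer(regex, document.get(source, ""))]
        if found:
            values = [if_match] if if_match else (found[:1] if only_one else found)
        else:
            values = [if_no_match] if if_no_match else []
        if values:  # a field with no value is not added to the document
            metadata[dest] = values
    return metadata

doc = {"content": "Write to jane@example.com\nor call +33 1 23 45 67 89."}
print(extract(doc))
```

With this sample document, “entity_word” receives the fallback value because no line contains “word”, while a document without any phone number would get “entity_phone_present” set to “false” and no “entity_phone” field at all.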

Warning! Make sure the destination field exists in Solr, and create it if necessary. It has to be a multivalued field. Note that if you want to use it in facets, it has to be a “String” field.
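One possible way to create such a field is the standard Solr Schema API, which accepts an `add-field` command as JSON. The sketch below only builds the command, assuming your Solr schema defines a field type named “string” and accepts Schema API updates; the helper name `add_field_command` is hypothetical, and your Datafari version may offer its own preferred way of declaring custom fields.

```python
import json

def add_field_command(name):
    """Build a Solr Schema API command for a multivalued string field.

    Assumes a field type named "string" exists in the schema.
    """
    return {
        "add-field": {
            "name": name,
            "type": "string",
            "multiValued": True,
            "indexed": True,
            "stored": True,
        }
    }

# The JSON printed here would be POSTed to the schema endpoint of the
# target collection, i.e. /solr/<collection>/schema.
print(json.dumps(add_field_command("regex_phones"), indent=2))
```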