Valid from v6.2

This documentation is valid from Datafari v6.2 onwards.

The Regex Entity Connector extracts information from the content or metadata of a document using a regular expression, and populates a Solr metadata field with the result.

For example, you can use it to extract e-mail addresses, phone numbers or product numbers from documents. You can also extract words coming from a list, and later decide to display them as facets.

How to use it

Add the Regex Entity Connector in the job pipeline in the Connection tab of MCF.

It can be inserted anywhere after the TikaServerRmetaConnector, but not before it: the connector needs the document to have been converted to text, otherwise the document contains only binary data.

Then, in the Regex Entity configuration tab that appears, fill in the fields as explained below:

Source field: the field to which the regular expression is applied. It can be set to “content”, “url”, or any ManifoldCF document field provided by the Repository Connector. Note: these are NOT the Solr index fields.

Associated regular expression: the regular expression used to extract data and fill the metadata. An example is given that retrieves the lines of the document containing the word “option”, case-insensitively.

Destination field: the metadata (present in the Datafari Solr) to be filled. Some valid metadata are given as examples. We do not recommend using these examples, because they are likely to collide with another indexing pipeline that may overwrite their values. Instead, you could create fields dedicated to regex extraction, for instance regex_phones.

Value if match: if one or more matches are found during the extraction and this field is set, then this value is stored in the destination metadata instead of the raw match values. This field is optional.

Value if no match: if no match is found during the extraction and this field is set, then this value is stored in the destination metadata. This field is optional.

Keep only one value: if multiple matches are found during the extraction and if this optional field is set to true, then only the first one will be kept in the destination metadata. This parameter only applies if "Value if match" is empty.

Extract regex groups: if the regular expression contains one or more groups and this option is enabled, then the values of the groups are stored in the destination metadata. If it is not checked, the complete match is stored in the destination metadata. This option is optional.
The groups of a regular expression are defined between ( and ). For example, this regular expression has one group: word1 (.*?) word2. It searches for the part of a text starting with word1 and ending with word2.
Let's take this text: My word1 titi tata toto word2 and other words. With the regular expression above, if the checkbox is enabled, the destination metadata gets the value "titi tata toto". If the checkbox is not enabled, the destination metadata gets the value "word1 titi tata toto word2".
Now, if we change the regex slightly to have several groups, for example: word1 (.*?) tata (.*?) word2, the result is: destination metadata = "titi toto". “titi” and “toto” are extracted and stored in the destination metadata, separated by a whitespace.
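
The difference between the complete match and the group values can be checked outside of Datafari. The snippet below is a minimal sketch using Python's re module purely for illustration; the connector itself may rely on a different regex engine, and the joining of groups with a whitespace simply mirrors the behaviour described above.

```python
import re

text = "My word1 titi tata toto word2 and other words"

# One group: compare the complete match with the group content.
one_group = re.compile(r"word1 (.*?) word2")
m1 = one_group.search(text)
full_match = m1.group(0).strip()               # checkbox not enabled
group_value = " ".join(m1.groups()).strip()    # checkbox enabled

# Several groups: the group values are joined with a whitespace.
two_groups = re.compile(r"word1 (.*?) tata (.*?) word2")
m2 = two_groups.search(text)
joined_groups = " ".join(g.strip() for g in m2.groups())

print(full_match)     # word1 titi tata toto word2
print(group_value)    # titi tata toto
print(joined_groups)  # titi toto
```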

The source metadata must be “content”, “url” or any ManifoldCF document field provided by your Repository Connector.
The destination metadata must exist in the Solr environment (no check is done).
Depending on your regular expression, several different values may be found in a document, so the metadata receiving the results must be multi-valued, otherwise it will contain only the last match found.
If the checkbox “Keep only one value” is set to true or if “Value if match” is specified, then only one value will be used.
For convenience, spaces before and after the extracted text are removed. We can't think of any case where these spaces would be useful for indexing purposes. More often they are a problem, and it is quite complicated to write a regular expression that gets rid of them.
This connector processes a file line by line: a line ends at an end-of-line character, or is cut once it reaches 65536 bytes.
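
To make the interplay of these options easier to follow, here is a rough Python model of the rules listed above. It is only a sketch: apply_rule and its parameters are hypothetical names invented for this illustration, not the connector's actual code or API, and the 65536-byte line limit is not modelled.

```python
import re
from typing import Iterable, List, Optional


def apply_rule(
    lines: Iterable[str],
    regex: str,
    value_if_match: Optional[str] = None,
    value_if_no_match: Optional[str] = None,
    keep_only_one: bool = False,
    extract_groups: bool = False,
) -> List[str]:
    """Hypothetical model of one rule: returns the values for the destination field."""
    pattern = re.compile(regex)
    values: List[str] = []
    for line in lines:  # the regex never sees more than one line at a time
        for match in pattern.finditer(line):
            if extract_groups and match.groups():
                # Group values are joined with a whitespace.
                value = " ".join(g.strip() for g in match.groups() if g)
            else:
                value = match.group(0)
            values.append(value.strip())  # leading/trailing spaces are removed
    if not values:
        return [value_if_no_match] if value_if_no_match else []
    if value_if_match:
        return [value_if_match]  # replaces the raw match values
    if keep_only_one:
        return values[:1]  # only the first match is kept
    return values


content = ["First line with a word", "Nothing here", "Another WORD  "]

all_lines = apply_rule(content, r"(?i).*word.*")
first_only = apply_rule(content, r"(?i).*word.*", keep_only_one=True)
flag = apply_rule(content, r"(?i)word", value_if_match="true", value_if_no_match="false")
fallback = apply_rule(content, r"\d{4}", value_if_no_match="No number here")
```

In this model, all_lines holds the two matching lines (trimmed), first_only holds only the first of them, flag is ["true"], and fallback is ["No number here"].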

You can add as many destination metadata, regular expressions and source metadata as you want by clicking on the Add button.

Some examples of useful regular expressions:

Ignore case: (?i)searched_word retrieves “searched_word” regardless of character case.
Retrieve the line containing a word: .*searched_word.*
Match a literal dot: \. (“\” is the escape character).
Spaces are taken into account, so searching for “word1 word2” looks for that exact expression in the content.
E-mails: ([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})
Phone number: (\+)([\s.\(\)]*\d{1}){8,13}(-)?(\d{1,5})
Search for “word1” or “word2”: word1|word2
For examples using regex groups, see the section above about the “Extract regex groups” option.
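
Before launching a crawl, it can help to try these expressions on a few sample strings. The sketch below does so with Python's re module, used here only for illustration; the connector's own regex engine may behave differently on edge cases, and the sample strings are made up for the test.

```python
import re

EMAIL = r"([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})"
PHONE = r"(\+)([\s.\(\)]*\d{1}){8,13}(-)?(\d{1,5})"

# E-mails: collect the complete matches, not the individual groups.
mail_text = "Contact john.doe@example.com or jane@test.org for details"
emails = [m.group(0) for m in re.finditer(EMAIL, mail_text)]

# Phone number: note that this expression requires a leading "+".
phone_match = re.search(PHONE, "Call +33 1 23 45 67 89 now")
phone = phone_match.group(0) if phone_match else None
no_plus = re.search(PHONE, "Call 01 23 45 67 89 now")

# Ignore case, alternation and literal dot.
ignore_case = re.search(r"(?i)searched_word", "A SEARCHED_WORD here").group(0)
either = [m.group(0) for m in re.finditer(r"word1|word2", "word2 then word1")]
dots = len(re.findall(r"\.", "a.b.c"))

print(emails)       # ['john.doe@example.com', 'jane@test.org']
print(phone)        # +33 1 23 45 67 89
print(no_plus)      # None
print(ignore_case)  # SEARCHED_WORD
print(either)       # ['word2', 'word1']
print(dots)         # 2
```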

More regex examples here: Regex typical use cases

Some example use cases:

With the first line, the crawl extracts all e-mail addresses from the content and stores them in the multi-valued field “entity_email”.

The second line indicates that if the document content contains at least one phone number, the “entity_phone_present” Solr field is set to “true”. Otherwise, it is set to “false”.

The third one extracts the first phone number appearing in the document and stores it in the “entity_phone” field. If no phone number is found, this field is not added to the document.

Finally, the last line indicates that if a line of the document contains the expression “word”, then that line is added to the multi-valued field “entity_word”. If the expression is not found, then “entity_word” is set to “No word here”.

You might want to use it for simple XML extractions:

Consider a file with this structure:

<part1>
  <department>DEP1</department>
</part1>
<part2>
  <department>DEP2</department>
</part2>

You can extract all department values with the following regex, with the “Extract regex groups” checkbox enabled: <department>(.*?)</department>. Your destination metadata will get the following value: DEP1 DEP2.

But if you want to extract only the department of <part1>, this is not possible with the Regex Entity Connector, because it processes a file line by line and the file above spans several lines (i.e. it contains line breaks). The regex to extract the department of <part1> could have been: <part1>\s*<department>(.*?)</department>, where \s* accounts for the line break and the indentation. But as the connector reads line by line, the line being tested contains only <part1> when the regex is applied, so it never matches.
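
The effect of line-by-line processing can be reproduced with a short Python sketch. It is an illustration only: Python's re module stands in for the connector's regex engine, and a variant of the <part1> regex using \s* is applied once per line (as the connector does) and once to the whole text (which the connector does not do).

```python
import re

xml = (
    "<part1>\n"
    "  <department>DEP1</department>\n"
    "</part1>\n"
    "<part2>\n"
    "  <department>DEP2</department>\n"
    "</part2>\n"
)

all_departments = re.compile(r"<department>(.*?)</department>")
part1_only = re.compile(r"<part1>\s*<department>(.*?)</department>")

lines = xml.splitlines()

# Applied line by line: the simple regex works, the nested one cannot match.
per_line_all = [m.group(1) for line in lines for m in all_departments.finditer(line)]
per_line_part1 = [m.group(1) for line in lines for m in part1_only.finditer(line)]

# Applied to the whole text, for comparison only.
whole_text_part1 = [m.group(1) for m in part1_only.finditer(xml)]

print(per_line_all)      # ['DEP1', 'DEP2']
print(per_line_part1)    # []
print(whole_text_part1)  # ['DEP1']
```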

However, as long as you do not need to extract data based on tag nesting, the Regex Entity Connector is an interesting solution for, say, extracting chunks of information between tags. Let's look at another example: from <department>DEP1/COMMON-DEP/DEP2</department>, you want to extract DEP1 and DEP2, and COMMON-DEP is part of the selection criteria for finding the departments to extract. The regular expression will be: <department>(.*?)/COMMON-DEP/(.*?)</department>. With the “Extract regex groups” option enabled, the destination metadata gets the value "DEP1 DEP2".

Warning! When using a destination field, make sure it exists in Solr, or create it if necessary (there is no crash, but the field is silently ignored). It has to be a multivalued field. Note that if you want to use it in facets, it has to be a “String” field.
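
As a quick check of this last expression, the Python sketch below (illustrative only, with made-up sample lines) shows the two groups being captured on a single line, and a line without COMMON-DEP being ignored.

```python
import re

pattern = re.compile(r"<department>(.*?)/COMMON-DEP/(.*?)</department>")

matching_line = "<department>DEP1/COMMON-DEP/DEP2</department>"
other_line = "<department>DEP3/OTHER/DEP4</department>"

match = pattern.search(matching_line)
groups = match.groups()
# Group values joined with a whitespace, as described for "Extract regex groups".
destination_value = " ".join(groups)

# COMMON-DEP acts as a selection criterion: this line yields nothing.
no_match = pattern.search(other_line)

print(groups)             # ('DEP1', 'DEP2')
print(destination_value)  # DEP1 DEP2
print(no_match)           # None
```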

Going further

With https://datafari.atlassian.net/wiki/spaces/DATAFARI/pages/3222306817 you can access the Solr fields created or updated after the Output Solr Connector processing.

Valid from v6.0 up to 6.1 included

This documentation is valid from Datafari v6.0 up to 6.1.

This documentation is valid from Datafari v6 onwards.