
More examples here: Regex typical use cases

Info

Valid from v6

This documentation is valid from Datafari v6 onwards.

The Regex Entity Connector extracts information from the content of a document using a regular expression and populates a Solr metadata with the result.

For example, you can use it to extract e-mails, phone numbers, or product numbers from documents. You could also extract words taken from a list, and later decide to display them as facets.

How to use it

Add the Regex Entity Connector to the job pipeline in the Connection tab of MCF.

...

It can be inserted anywhere in the pipeline, provided it comes after the TikaServerRmetaConnector. The connector needs the document to have been converted to text; before that step the document contains only binary data.

Then, in the Regex Entity configuration tab that appears, fill in the fields as explained below:

...

Column Metadata: the metadata (present in the Datafari Solr) to be filled. Some valid metadata are given as examples, but we do not recommend using them: another indexing pipeline is likely to write to the same metadata and overwrite its value. Instead, create fields dedicated to regex extraction, for instance regex_phones.

Associated regular expression: the regular expression used to extract data and fill the metadata. An example is provided that retrieves each line of the document containing the word “option”, regardless of case.
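
To get a feel for what such an expression returns, here is a minimal Python sketch. It is not the connector's code, and the exact example shipped in the configuration tab is not reproduced here: the pattern `(?i).*option.*`, the helper `matching_lines`, and the sample text are assumptions made for illustration.

```python
import re

# Assumed pattern: case-insensitive flag plus "whole line containing the word".
OPTION_LINE = re.compile(r"(?i).*option.*")

def matching_lines(content: str) -> list[str]:
    """Return every line of `content` that contains "option", ignoring case."""
    # "." does not match a newline by default, so each match stays on one line.
    return [m.group(0) for m in OPTION_LINE.finditer(content)]

sample = "Install guide\nOPTION A: fast mode\nno match here\nsee the options below\n"
print(matching_lines(sample))
```

Each returned line would become one value of the destination metadata.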

Note
  • The metadata must already exist in the Solr environment (no check is performed).

  • Depending on your regular expression, several different values may be found in a document. The metadata receiving the results must therefore be multi-valued; otherwise it will contain only the last match found.

  • You can define only one regular expression per metadata in this version (as of June 1st, 2023).

  • The regular expressions are applied only to what Tika considers the "content" of the source document, which excludes any document metadata (as of June 1st, 2023).
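
The multi-valued point in the notes above can be illustrated with a short Python sketch. It is not the connector's code: the pattern, the sample text, and the helper `all_matches` are hypothetical, and the single-valued case simply mimics the "last match" behavior described in the note.

```python
import re

# Hypothetical pattern for internal extension numbers such as "01-2345".
EXTENSION = re.compile(r"\d{2}-\d{4}")

def all_matches(content: str) -> list[str]:
    """Return every match in document order."""
    return [m.group(0) for m in EXTENSION.finditer(content)]

content = "Desk 01-2345, fax 01-9999, lab 02-0001."
multi_valued = all_matches(content)   # what a multi-valued metadata could hold
single_valued = multi_valued[-1]      # a single-valued metadata keeps one value only

print(multi_valued)
print(single_valued)
```

With a single-valued metadata, the first two extensions in this sample would be lost.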

You can add as many destination metadata as you want by clicking the Add button.
If a metadata is already present and you add it again with another regular expression, the new expression replaces the previous one.
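
One way to picture this configuration is as a table with one regular expression per metadata name. The Python sketch below models it as a dictionary, under the assumption that re-adding a name behaves like overwriting a key; `extract_entities`, `regex_products`, and the patterns are hypothetical names chosen for illustration, not part of Datafari.

```python
import re

def extract_entities(content: str, rules: dict[str, str]) -> dict[str, list[str]]:
    """Apply each metadata's regex to the content and collect all matches."""
    results: dict[str, list[str]] = {}
    for metadata, pattern in rules.items():
        matches = [m.group(0) for m in re.finditer(pattern, content)]
        if matches:
            results[metadata] = matches
    return results

rules: dict[str, str] = {}
rules["regex_products"] = r"REF-\d+"       # first expression for this metadata
rules["regex_products"] = r"(?i)ref-\d+"   # same metadata again: replaces the first

found = extract_entities("Order REF-12 and ref-34 shipped.", rules)
print(found)
```

Only the second expression is applied, so both the upper-case and lower-case references are captured.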

Some examples of useful regular expressions:

  • Ignore case: (?i)searched_word: retrieves “searched_word” regardless of character case.

  • Retrieve the line containing: .*searched_word.*

  • Match a literal dot: \. (“\” is the escape character).

  • Spaces are significant: searching for “word1 word2” matches only that exact sequence, including the single space, in the content.

  • E-mails: ([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})

  • Phone number: (\+)([\s.\(\)]*\d{1}){8,13}(-)?(\d{1,5}) (as written, this matches only numbers that start with “+”).

  • Search “word1” or “word2”: word1|word2
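
Before entering an expression in the configuration tab, it can help to try it on sample text. The sketch below does this with Python's `re` module for the patterns listed above. The sample strings and the helper `find_all` are assumptions, and the connector's regex engine may differ from Python's in edge cases, so treat the results as indicative only.

```python
import re

# Patterns copied from the list above; the keys are illustrative labels.
EXAMPLES = {
    "ignore_case": r"(?i)searched_word",
    "whole_line": r".*searched_word.*",
    "literal_dot": r"\.",
    "exact_phrase": r"word1 word2",
    "email": r"([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})",
    "phone": r"(\+)([\s.\(\)]*\d{1}){8,13}(-)?(\d{1,5})",
    "either_word": r"word1|word2",
}

def find_all(name: str, content: str) -> list[str]:
    """Return the full text of every match of the named pattern."""
    # group(0) is the whole match, even when the pattern has capture groups.
    return [m.group(0) for m in re.finditer(EXAMPLES[name], content)]

print(find_all("ignore_case", "A Searched_Word here"))
print(find_all("whole_line", "first\nhas searched_word inside\nlast"))
print(find_all("email", "Contact john.doe@example.com or sales@corp.example.org today"))
print(find_all("phone", "Call +33 1 23 45 67 89 for support"))
print(find_all("phone", "Call 01 23 45 67 89"))  # no leading "+": no match
print(find_all("exact_phrase", "word1  word2"))  # two spaces: no match
```

The last two calls return empty lists, which shows two easy-to-miss constraints: the phone pattern requires a leading “+”, and a space in a pattern matches exactly one space.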
