Valid from 5.5

Available for both the Enterprise edition and the Community edition of Datafari.

The Regex Entity Connector extracts information from the content of a document using a regular expression and populates a Solr metadata with the result.

Here are some applications:

  • Collecting data from a document, such as e-mail addresses or phone numbers.

  • Extracting words and displaying their values and counts in a facet on the search screen.

How to use it

Add the Regex Entity Connector to the job pipeline in the Connection tab of MCF.

It must be inserted after the TikaServerRmetaConnector, because it needs the document to have been converted to text. Otherwise the document contains only binary data.

Then, in the configuration screen:

Column Metadata: the metadata to be filled. Some valid metadata are given as examples.

Associated regular expression: the regular expression used to extract data and fill the metadata. An example is given that retrieves the line in the document containing the word “option”, ignoring case.

  • The metadata must exist in the Solr environment (no check is performed).

  • Because several values can be found in a document, take care to choose a multi-valued metadata.

  • You can define only one regular expression per metadata in this version.

You can add as many metadata as you want.
If a metadata is already present and you add it again with another regular expression, the new expression replaces the current one.
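
The rules above (one expression per metadata, replacement when the same metadata is added again, several values per document) can be illustrated with a small sketch. It uses Python's re module rather than the connector itself. The metadata name entity_email, the extract helper, and the choice to store the whole match are assumptions made for illustration, not the connector's actual implementation.

```python
import re

# Hypothetical configuration: one regular expression per metadata name.
config = {}
config["entity_email"] = r"[a-z]+@[a-z]+\.com"
# Adding the same metadata again replaces the previous expression.
config["entity_email"] = r"(?i)[a-z0-9_.-]+@[a-z0-9_.-]+\.[a-z]{2,5}"


def extract(text, config):
    """Return {metadata: [every full match]} for the given text."""
    return {
        name: [m.group(0) for m in re.finditer(pattern, text)]
        for name, pattern in config.items()
    }


text = "Contact Alice@Example.org or bob@example.com for details."
result = extract(text, config)
print(result)
```

Because two addresses are found in the same text, the metadata receives two values, which is why a multi-valued metadata is recommended.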

Some examples of useful regular expressions:

  • Ignore case: (?i)searched_word: retrieves “searched_word” regardless of character case.

  • Retrieve the line containing: .*searched_word.*

  • Search for a literal dot: \. (“\” is the escape character).

  • Spaces are taken into account, so searching “word1 word2” will search the exact expression in the content.

  • E-mail addresses: ([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})

  • Phone number: (\+)([\s.\(\)]*\d{1}){8,13}(-)?(\d{1,5})

  • Search “word1” or “word2”: word1|word2.
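
The expressions listed above can be tried out on a sample text before configuring the connector. The sketch below uses Python's re module for convenience. The connector may rely on a different regular expression engine, so small dialect differences are possible. The sample text and the all_matches helper are made up for illustration.

```python
import re

# Made-up sample content standing in for the text of a document.
text = (
    "Release notes\n"
    "The Option flag is described below.\n"
    "Contact: jane.doe@example.org or call +33 1 23 45 67 89.\n"
    "Both word1 and word2 appear here.\n"
)


def all_matches(pattern, content):
    """Return the full text of every match of pattern in content."""
    return [m.group(0) for m in re.finditer(pattern, content)]


# Ignore case, combined with "retrieve the line containing".
lines = all_matches(r"(?i).*option.*", text)

# E-mail and phone number expressions, copied from the list above.
emails = all_matches(
    r"([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})", text
)
phones = all_matches(r"(\+)([\s.\(\)]*\d{1}){8,13}(-)?(\d{1,5})", text)

# Alternation and the escaped dot.
words = all_matches(r"word1|word2", text)
dots = all_matches(r"\.", text)

print(lines)
print(emails)
print(phones)
print(words)
print(len(dots))
```

Note that the phone number expression only matches numbers that start with “+”, so a number written without an international prefix would not be extracted by it.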

More examples here: Regex typical use cases