Metadata Cleaner Connector

The Metadata Cleaner Connector was introduced in both the Enterprise and Community editions of Datafari v5.0.

The purpose of this connector is to replace any text in document metadata names or metadata values that matches a regular expression with an associated string value. It was mainly developed to clean potentially problematic characters or character sequences that may trigger exceptions during indexing into Solr and cause the MCF job to fail, but it can serve other purposes as well.
It is thus used by default in the MCF jobs created by the simplified crawling UI of Datafari to replace the “${” character sequence found in metadata names and values with “_{”. The reason is that Solr tries to process ${something} patterns found in metadata names and values as Solr variables, which in some cases leads to errors and unwanted behavior during the indexing phase.
The connector only tries to match the regular expressions against string-based metadata names and values.
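
The default “${” replacement described above can be illustrated with a minimal sketch. It uses Python's re module purely for illustration; it is not the connector's code, the connector's regular expression engine may differ in details, and the function name clean_dollar_brace is hypothetical.

```python
import re

# Matches the literal two-character sequence "${".
# Both "$" and "{" are regex metacharacters, so they are escaped.
DOLLAR_BRACE = re.compile(r"\$\{")


def clean_dollar_brace(text):
    """Replace every "${" in text with "_{" (hypothetical helper)."""
    return DOLLAR_BRACE.sub("_{", text)


print(clean_dollar_brace("title ${draft} v2"))
```

Note that the pattern has to escape both characters: an unescaped “$” would be read as an end-of-line anchor rather than a literal dollar sign.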

It is initialized by default in Datafari and can be added to the connectors workflow of an MCF job by selecting it in the list of transformation connectors (if the job was created by the simplified crawler UI of Datafari, it is already present in the workflow). To be fully effective, it must be inserted right before the output connector.

After adding the connector to the workflow, a new tab named “Metadata Cleaner” should appear in the list of tabs. In this tab, the following can be configured:

  • Metadata name cleaners: a list of regular expression/replacement value pairs. Each document metadata name is scanned, and every match of a regular expression in this list is replaced by the corresponding string value.
    To add a new pair, fill in the “regular expression” and “replace value” text inputs, then click the “Add” button.
    To remove a pair, click the “Delete” button next to the pair you want to delete.
    The regular expression must comply with standard regular expression syntax.

  • Metadata value cleaners: a list of regular expression/replacement value pairs. Each document metadata value is scanned, and every match of a regular expression in this list is replaced by the corresponding string value.
    To add a new pair, fill in the “regular expression” and “replace value” text inputs, then click the “Add” button.
    To remove a pair, click the “Delete” button next to the pair you want to delete.
    The regular expression must comply with standard regular expression syntax.
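
The effect of the two cleaner lists can be pictured with the sketch below. It only models the behavior described above and is not the connector's implementation: the functions apply_cleaners and clean_metadata, the dictionary-based document model, the application of pairs in list order, the literal treatment of the replacement value, and the merging of names that become identical after cleaning are all assumptions made for illustration.

```python
import re


def apply_cleaners(text, cleaners):
    """Apply each (regex, replacement) pair to text, in list order (assumed)."""
    for pattern, replacement in cleaners:
        # The lambda makes the replacement literal, so "\" and group
        # references in it are not interpreted (an assumption).
        text = re.sub(pattern, lambda _match: replacement, text)
    return text


def clean_metadata(metadata, name_cleaners, value_cleaners):
    """Return a cleaned copy of a {name: [values]} mapping (hypothetical model)."""
    cleaned = {}
    for name, values in metadata.items():
        new_name = apply_cleaners(name, name_cleaners)
        # Names that collide after cleaning have their values merged (assumed).
        cleaned.setdefault(new_name, []).extend(
            # Only string values are matched; others pass through unchanged.
            apply_cleaners(v, value_cleaners) if isinstance(v, str) else v
            for v in values
        )
    return cleaned


# One pair in each list, mirroring the default "${" -> "_{" replacement.
pairs = [(r"\$\{", "_{")]
doc = {"author${x}": ["Jane ${Doe}", 42]}
print(clean_metadata(doc, name_cleaners=pairs, value_cleaners=pairs))
```

The name list and the value list are kept separate, as in the “Metadata Cleaner” tab, so a pattern can be applied to metadata names without touching the values, and vice versa.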