Emptier Connector
The emptier connector is available for both the Enterprise edition and the Community edition of Datafari.
Its goal is to empty the content of the files, keeping only the title of the document and some of its metadata. Note that all the ACLs are kept too.
Keep in mind that this removes all the content of a file, including the metadata that is embedded in the binary.
How to use it
Add the emptier connector in the job pipeline in the Connection tab of MCF.
If you are inserting it in a job generated by the simplified connectors UI, then it must be the second filter connector applied (just after the metadata adjuster filter).
If you are inserting it in a job that you created yourself, think carefully about where to position it in your pipeline. It was created to avoid sending heavy payloads to Tika, so we recommend positioning it before the TikaServerConnector transformation. However, depending on your objectives, you may have transformation connectors that need to run before the Emptier Connector.
After that, you can configure the options in the dedicated tab: "Emptier filter"
Filter field (on which regex filters will be applied. empty = doc uri)
Indicates the field on which the regex rules below are applied. If the textbox is blank, the rules are applied to the id field (the document URI).
Include filters (empty documents that match):
Indicates the regex rules used to select the matching documents (i.e. the documents that need to be emptied).
For example, if you enter .* the rule applies to all the documents that go through this transformation connector.
Exclude filters (do not empty documents that match):
Conversely, indicates the regex rules used to select the documents that must not be emptied.
Maximum document size (higher document length will be emptied):
Indicates a maximum document size (in bytes). If a document is larger than this size, its content is emptied.
Minimum document size (lower document length will be emptied):
Indicates a minimum document size threshold (in bytes). If a document is smaller than this size, its content is emptied.
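To make the interaction between these options easier to reason about, here is a minimal Python sketch of one plausible decision logic. It is not the connector's actual code (the connector runs inside MCF). The function names select_value, matches_any and should_empty are hypothetical, and several behaviours are assumptions that this page does not confirm: that a regex must match the whole field value, that an exclude filter takes precedence over every other rule, and that the size thresholds apply independently of the include filters.

```python
import re
from typing import Mapping, Optional, Sequence


def select_value(doc_uri: str, fields: Mapping[str, str], filter_field: str = "") -> str:
    """Return the value the regex filters are tested against.

    A blank filter field falls back to the document URI (the id field).
    Assumption: a missing field is treated as an empty string.
    """
    if not filter_field:
        return doc_uri
    return fields.get(filter_field, "")


def matches_any(patterns: Sequence[str], value: str) -> bool:
    """True if at least one pattern matches the whole value (assumed full match)."""
    return any(re.fullmatch(p, value) for p in patterns)


def should_empty(
    value: str,
    size: int,
    include: Sequence[str] = (),
    exclude: Sequence[str] = (),
    max_size: Optional[int] = None,
    min_size: Optional[int] = None,
) -> bool:
    """Decide whether a document's content would be emptied.

    Assumed precedence (not confirmed by the documentation):
    exclude filters first, then include filters, then size thresholds.
    """
    if matches_any(exclude, value):
        return False  # excluded documents are never emptied
    if matches_any(include, value):
        return True  # explicitly selected for emptying
    if max_size is not None and size > max_size:
        return True  # larger than the maximum size (bytes)
    if min_size is not None and size < min_size:
        return True  # smaller than the minimum size (bytes)
    return False
```

For instance, under these assumptions, should_empty("file:/share/report.pdf", 2048, include=[r".*\.pdf"]) selects the document for emptying, while adding exclude=[r".*report.*"] keeps its content. Checking the Simple History of MCF remains the reliable way to confirm how the real connector treated a given document.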
You can check in the Simple History of MCF whether the rules are correctly applied to your documents.