Doc Filter Connector

The Doc Filter Connector has an important limitation! If you use it in several jobs that share the same repo source connector to filter different kinds of files (for example, one job that only includes PDF files and another that only includes office files), the filters will conflict. As a result, only the files matching the filters of the most recently run job will be indexed; the files matching the filters of all the previously run jobs will be deleted from the index!

For example, if you have 2 jobs configured with the same repo source connector, one that only includes PDF files and one that only includes office files, and you run the PDF job first and then the office job, your index will only contain office files, not the PDF ones.

To solve this issue, you need to use a different repo source connector per job! Yes, this means having several repo source connectors that are strictly identical except for their names. Unfortunately, there is no other solution, as this behavior is caused by a limitation of the MCF framework.
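
The conflict and its workaround can be illustrated with a toy model. The sketch below is not MCF code: the class SharedConnectorConflict, its methods, the connector names and the file paths are all hypothetical. It only assumes what is described above, namely that running a job replaces whatever was previously indexed through the same repo source connector.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Pattern;

/** Toy model of the conflict described above; this is NOT how MCF is implemented. */
final class SharedConnectorConflict {

    // Hypothetical index: repo source connector name -> URIs indexed through it.
    private final Map<String, Set<String>> index = new HashMap<>();

    /** Runs a job: only the URIs matching the include filter stay indexed for that connector. */
    void runJob(String repoConnector, List<String> crawledUris, Pattern include) {
        Set<String> kept = new TreeSet<>();
        for (String uri : crawledUris) {
            if (include.matcher(uri).find()) {
                kept.add(uri);
            }
        }
        // Assumption: the new result replaces everything indexed earlier via this connector.
        index.put(repoConnector, kept);
    }

    /** Returns every URI currently in the index, sorted. */
    Set<String> indexedUris() {
        Set<String> all = new TreeSet<>();
        for (Set<String> uris : index.values()) {
            all.addAll(uris);
        }
        return all;
    }

    public static void main(String[] args) {
        List<String> files = List.of("/share/report.pdf", "/share/notes.docx");
        Pattern pdf = Pattern.compile("\\.(?i)pdf(?-i)$");
        Pattern office = Pattern.compile("\\.(?i)docx(?-i)$");

        // Two jobs sharing one connector: the second run wipes the first job's files.
        SharedConnectorConflict shared = new SharedConnectorConflict();
        shared.runJob("repo", files, pdf);
        shared.runJob("repo", files, office);
        System.out.println("Shared connector:    " + shared.indexedUris());

        // One connector per job (identical except for the name): both sets survive.
        SharedConnectorConflict separate = new SharedConnectorConflict();
        separate.runJob("repo_pdf", files, pdf);
        separate.runJob("repo_office", files, office);
        System.out.println("Separate connectors: " + separate.indexedUris());
    }
}
```

In this model, the shared connector ends up with only the office file, while the two separately named connectors keep both files.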

The Doc Filter Connector was introduced in v5.1, in both the Enterprise and Community editions.

This connector provides regex-based filters to include documents in or exclude them from indexing.

This connector reproduces what the “Global Filters” tab offers. Since that tab is not available for every connector, the Doc Filter must be used for the connectors that lack it. This means you do not need it for the web connector or the file system connector.

It is initialized by default in Datafari and can be added to the connector workflow of an MCF job by selecting it in the list of transformation connectors. For optimal performance, it must be inserted right after the repo connector.

After adding the connector to the workflow, a new tab named “Doc Filter” should appear in the list of tabs. In this tab, several things can be configured:

  • Filter field: The document field to which the regex filters are applied. If this parameter is left empty, the filters are applied to the document URI by default. The field you set in this parameter MUST BE an existing field of the document AND a string type field, otherwise your filters won’t work.
    For example, if the crawled documents contain a “department_name” field, you can set it as the filter field so that documents are filtered according to whether their “department_name” value matches the specified regex filters.
    You can only filter on one document field!

  • Include filters: The list of regex filters of which the document field specified in the “Filter field” parameter must match at least one for the document to be included in the indexing process.
    To add a regex filter to this list, simply fill in the text input with a regex, then click on the “Add” button next to it.
    To delete a regex filter, simply click on the “Delete” button next to it.
    The regular expression must comply with the standard regular expression syntax. For instance, \.(?i)zip(?-i)$ will match any filename ending with .zip: \. literally matches the ‘.’ character because it is escaped by ‘\’, the ‘$’ at the end of the regex marks the end of the line, and the (?i) and (?-i) surrounding ‘zip’ switch case-insensitive matching on and off, so the ‘zip’ part is matched regardless of case (ZIP, zIP, Zip, etc.). You can try your regular expressions with this online tool, which also explains the behavior of the regex you enter: https://regex101.com/

    • Recommended default values: None.

  • Exclude filters: The list of regex filters of which the document field specified in the “Filter field” parameter must match at least one for the document to be excluded from the indexing process.
    This list takes priority over the include filters list: a document that matches at least one include filter and at least one exclude filter will be excluded from the indexing process.
    To add a regex filter to this list, simply fill in the text input with a regex, then click on the “Add” button next to it.
    To delete a regex filter, simply click on the “Delete” button next to it.
    The regular expression must comply with the standard regular expression syntax. For instance, \.(?i)zip(?-i)$ will match any filename ending with .zip: \. literally matches the ‘.’ character because it is escaped by ‘\’, the ‘$’ at the end of the regex marks the end of the line, and the (?i) and (?-i) surrounding ‘zip’ switch case-insensitive matching on and off, so the ‘zip’ part is matched regardless of case (ZIP, zIP, Zip, etc.). You can try your regular expressions with this online tool, which also explains the behavior of the regex you enter: https://regex101.com/

    • Recommended default values for exclusion related to the URIs:

      • \/~.*
      • \.(?i)pst(?-i)$
      • \.(?i)gz(?-i)$
      • \.(?i)ini(?-i)$
      • \.(?i)tar(?-i)$
      • \.(?i)lnk(?-i)$
      • \.(?i)db(?-i)$
      • \.(?i)odb(?-i)$
      • \.(?i)mat(?-i)$
      • \/\..*
      • \.(?i)tgz(?-i)$
      • \.(?i)zip(?-i)$
      • \.(?i)rar(?-i)$
      • \.(?i)7z(?-i)$
      • \.(?i)bz2(?-i)$

  • Maximum document size: The maximum document size, in octets. A document whose size exceeds this threshold is excluded from the indexing process.

    • Recommended default value: none

  • Minimum document size: The minimum document size, in octets. A document whose size is under this threshold is excluded from the indexing process.

    • Recommended default value: none
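
The way these parameters combine can be sketched in a few lines of Java. This is only an illustration of the rules listed above, not the connector's actual source code: the class DocFilterSketch and its method names are hypothetical, and it assumes that a filter matches when the regex is found anywhere in the field value, that an empty include list lets every document through, and that a negative size threshold means "no limit".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Illustrative sketch only; this is not the Doc Filter connector's source code. */
final class DocFilterSketch {

    private final List<Pattern> includes;
    private final List<Pattern> excludes;
    private final long minSize; // octets; a negative value means "no minimum" (assumption)
    private final long maxSize; // octets; a negative value means "no maximum" (assumption)

    DocFilterSketch(List<String> includes, List<String> excludes, long minSize, long maxSize) {
        this.includes = compileAll(includes);
        this.excludes = compileAll(excludes);
        this.minSize = minSize;
        this.maxSize = maxSize;
    }

    private static List<Pattern> compileAll(List<String> regexes) {
        List<Pattern> patterns = new ArrayList<>();
        for (String regex : regexes) {
            patterns.add(Pattern.compile(regex));
        }
        return patterns;
    }

    // Assumption: a filter matches when it is found anywhere in the value.
    private static boolean matchesAny(List<Pattern> patterns, String value) {
        for (Pattern pattern : patterns) {
            if (pattern.matcher(value).find()) {
                return true;
            }
        }
        return false;
    }

    /** fieldValue is the value of the filter field, i.e. the document URI when no field is set. */
    boolean accepts(String fieldValue, long sizeInOctets) {
        if (maxSize >= 0 && sizeInOctets > maxSize) {
            return false; // exceeds the maximum document size
        }
        if (minSize >= 0 && sizeInOctets < minSize) {
            return false; // under the minimum document size
        }
        if (matchesAny(excludes, fieldValue)) {
            return false; // exclude filters take priority over include filters
        }
        // Assumption: with no include filter configured, every remaining document passes.
        return includes.isEmpty() || matchesAny(includes, fieldValue);
    }

    public static void main(String[] args) {
        DocFilterSketch filter = new DocFilterSketch(
                List.of("\\.(?i)pdf(?-i)$", "\\.(?i)zip(?-i)$"),
                List.of("\\.(?i)zip(?-i)$", "\\/~.*"),
                -1, 1_000_000);

        System.out.println(filter.accepts("/share/report.PDF", 2_048));     // included, case ignored
        System.out.println(filter.accepts("/share/Archive.ZIP", 2_048));    // in both lists, exclude wins
        System.out.println(filter.accepts("/share/~report.pdf", 2_048));    // matches \/~.*
        System.out.println(filter.accepts("/share/huge.pdf", 5_000_000));   // over the maximum size
    }
}
```

Note that the regexes are written with doubled backslashes only because they are Java string literals; in the “Doc Filter” tab they are entered as shown in the lists above.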

When a document matches an include or exclude filter and is therefore included in or excluded from the indexing process, the connector generates a simple history entry so that you can keep track of what’s happening during the crawl: