Info

Valid as of Datafari 4.6

Documentation below is valid from Datafari 4.6 onwards

To avoid crawling errors and problems that are hard to identify or solve, it is very important to limit the scope of the documents and files to index as much as possible when you first configure a crawl job. It is better to be very restrictive at the beginning and progressively relax the limits than to try to crawl a repository exhaustively from the start.

1. Using Simplified Connectors

If you need to crawl a website or a winshare repository, start by creating a job with the MCF simplified UI, but do not select the option to start the job automatically after its creation. Instead, go to the MCF UI and edit the freshly created job.

...

The simplified web connector is also configured to exclude a set of file types from the crawl. The list is available at https://datafari.atlassian.net/browse/ under the issue key DATAFARI-452.

The filer job additionally filters out the known archive files (pst, rar, zip...). You can check the full list in the "Global Filters" tab.

...

If you are configuring a crawl job for a winshare repository, files can be either excluded or emptied (so that only the filename is indexed). You will have to decide which file paths and/or file extensions must be excluded and which ones should have only their filename indexed. The regular expressions for the file paths and file extensions to exclude from the crawl MUST BE SET in the 'Global Filters' tab of the job configuration, and the other ones in the 'Emptier Filter' tab.
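Before entering patterns in the 'Global Filters' and 'Emptier Filter' tabs, it can help to try them against a few sample paths. The following is a minimal sketch, not Datafari or MCF code: the class `CrawlFilterSketch`, its methods and the sample paths are hypothetical. It assumes that the filters are Java regular expressions searched anywhere in the path (`find()` semantics) and that exclusion takes precedence over emptying. The patterns are a small subset of those listed on this page.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class CrawlFilterSketch {
    enum Decision { EXCLUDE, EMPTY, INDEX }

    // A few exclude patterns, as they would be entered in 'Global Filters'.
    static final List<Pattern> EXCLUDE_FILTERS = compile(
            "\\/~.*", "\\/\\..*", "\\.(?i)zip(?-i)$", "\\.(?i)pst(?-i)$");

    // A few emptier patterns, as they would be entered in 'Emptier Filter'.
    static final List<Pattern> EMPTIER_FILTERS = compile(
            "\\.(?i)mp4(?-i)$", "\\.(?i)wav(?-i)$");

    static List<Pattern> compile(String... regexes) {
        List<Pattern> patterns = new ArrayList<>();
        for (String regex : regexes) {
            patterns.add(Pattern.compile(regex));
        }
        return patterns;
    }

    // Assumption: a filter applies if it is found anywhere in the path.
    static boolean matchesAny(List<Pattern> filters, String path) {
        for (Pattern filter : filters) {
            if (filter.matcher(path).find()) {
                return true;
            }
        }
        return false;
    }

    // Assumption: exclusion is checked first, then emptying.
    static Decision decide(String path) {
        if (matchesAny(EXCLUDE_FILTERS, path)) {
            return Decision.EXCLUDE;
        }
        if (matchesAny(EMPTIER_FILTERS, path)) {
            return Decision.EMPTY;
        }
        return Decision.INDEX;
    }

    public static void main(String[] args) {
        String[] samples = {
            "/share/docs/~$report.docx",
            "/share/.git/config",
            "/share/backup/archive.ZIP",
            "/share/video/demo.MP4",
            "/share/docs/report.pdf"
        };
        for (String path : samples) {
            System.out.println(decide(path) + "  " + path);
        }
    }
}
```

With these sample patterns, the temporary file, the hidden directory and the archive are excluded, the video is emptied (only its filename would be indexed), and the PDF is indexed normally.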

2. If Simplified Connectors cannot be used

If you cannot use the MCF simplified UI because you are configuring a job for a repository connector other than web or winshare, this documentation is still valid. You will only need to manually add the Doc Filter Connector and the Emptier Connector to the job connectors, and to correctly configure the data extraction server (see Data Extraction Server Configuration - Enterprise Edition). This means we recommend that you:

  • Use the emptier connector and apply the following restrictions:

    • Include filters (empty documents that match)

      \.(?i)avi(?-i)$
      \.(?i)mp4(?-i)$
      \.(?i)mkv(?-i)$
      \.(?i)mov(?-i)$
      \.(?i)wmv(?-i)$
      \.(?i)flv(?-i)$
      \.(?i)mp3(?-i)$
      \.(?i)wav(?-i)$
      \.(?i)wma(?-i)$
      \.(?i)flac(?-i)$
      \.(?i)aac(?-i)$
      \.(?i)aiff(?-i)$
      \.(?i)ogg(?-i)$
    • Maximum document size (higher document length will be emptied): 100000000

  • Open the TikaServerConnector tab and apply the following settings:

    • Keep all metadata: checked

    • Normalize metadata names: checked

    • Content write limit: 1000000

    • Extract archives content: checked

  • Use the doc filter connector and apply the following restrictions:

    • Exclude filters:

      \/~.*
      \.(?i)pst(?-i)$
      \.(?i)gz(?-i)$
      \.(?i)ini(?-i)$
      \.(?i)tar(?-i)$
      \.(?i)lnk(?-i)$
      \.(?i)db(?-i)$
      \.(?i)odb(?-i)$
      \.(?i)mat(?-i)$
      \/\..*
      \.(?i)tgz(?-i)$
      \.(?i)zip(?-i)$
      \.(?i)rar(?-i)$
      \.(?i)7z(?-i)$
      \.(?i)bz2(?-i)$
  • Use the Metadata Cleaner and apply the following restrictions:

    • Metadata name cleaners (replace expressions in the metadata names):

      Regular expression: \$\{
      Replace value: _{
    • Metadata value cleaners (replace expressions in the metadata values):

      Regular expression: \$\{
      Replace value: _{
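
The effect of the two Metadata Cleaner rules above can be previewed with a few lines of Java. This is only a sketch: `MetadataCleanerSketch` and `clean` are hypothetical names, and it assumes that the cleaner applies the regular expression to every occurrence (as Java's `replaceAll` does) and treats the replace value as literal text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MetadataCleanerSketch {
    // Regular expression and replace value taken from the settings above.
    static final Pattern DOLLAR_BRACE = Pattern.compile("\\$\\{");
    static final String REPLACE_VALUE = "_{";

    // Replaces every "${" with "_{" (assumption: literal replacement).
    static String clean(String text) {
        return DOLLAR_BRACE.matcher(text)
                .replaceAll(Matcher.quoteReplacement(REPLACE_VALUE));
    }

    public static void main(String[] args) {
        System.out.println(clean("${author}"));        // prints _{author}
        System.out.println(clean("title ${a} ${b}"));  // prints title _{a} _{b}
        System.out.println(clean("plain value"));      // prints plain value
    }
}
```

The same function applies to metadata names and metadata values, since both rules use the same expression and replace value.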

3.