Crawl jobs configuration best practices

Valid as of Datafari 6.0

Documentation below is valid from Datafari 6.0 onwards

To avoid crawling errors and problems that are hard to identify and/or solve, it is very important, on the very first configuration of your crawl job, to limit as much as possible the scope of docs/files that will be indexed. It is better to be very restrictive at the beginning and to progressively raise the limits than to try to exhaustively crawl a repository.

1. Using Simplified Connectors

If you have to crawl a website or a winshare repository (and some other types), start by creating a job using the MCF simplified UI, but do not select the option to automatically start the job after its creation; instead, go to the MCF UI and edit the recently created job.

By default, a job created through the simplified UI is configured to:

  • empty the .avi, .mp4 and .mkv files (thanks to the Emptier Connector)

  • empty the files exceeding 100 MB (thanks to the Emptier Connector)

  • limit the indexed text content of a file to 1 MB (configured in the TikaServerConnector tab)

  • clean up potential metadata problems with the Metadata Cleaner

The simplified web connector is also configured to exclude a set of filetypes from the crawl. The list is available at https://datafari.atlassian.net/browse/DATAFARI-452

The simplified web connector also offers an option to accept only a set of office-related file formats. See Extend simplified job with 2 filtering possibilities (#787) for more information.

The filer job additionally filters out known archive files (pst, rar, zip...). You can check the full list in the "Global Filters" tab.

The simplified filer connector also offers an option to accept only a set of office-related file formats. See Extend simplified job with 2 filtering possibilities (#787) for more information.

At this point, you should ask yourself one question:

  • Are there specific file paths and/or file extensions to exclude from the crawl because they are not worth indexing (for example, file paths where backup files are stored, or binary files)?

If the answer is yes, list those file extensions and file paths. You will then need to create regular expressions that match the list and set them in the 'Emptier Filter' tab of the job configuration, as in the sketch below.
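
For example, assuming hypothetically that your list contains backup folders and .iso disk images (neither is a Datafari default; both are only illustrations), the corresponding entries in the 'Emptier Filter' tab could look like this, using the same regex syntax as the default filters shown later on this page:

      \/(?i)backup(?-i)\/.*
      \.(?i)iso(?-i)$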

If you are configuring a crawl job for a winshare repository, it is possible to either exclude files or empty them (in order to only index the filename). You will therefore have to decide which file paths and/or file extensions must be excluded and which ones should only have their filename indexed. The regular expressions of file paths/file extensions to exclude from the crawl MUST BE SET in the 'Global Filters' tab of the job configuration, the other ones in the 'Emptier Filter' tab. The example below illustrates the split.
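
For instance, a hypothetical winshare job could exclude .bak backup files entirely while indexing only the filename of .dwg CAD drawings; both extensions are illustrative, not defaults:

      'Global Filters' tab (exclude filters): \.(?i)bak(?-i)$
      'Emptier Filter' tab (include filters): \.(?i)dwg(?-i)$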

2. If Simplified Connectors cannot be used

If you cannot use the MCF simplified UI because you are configuring a job for another type of repository connector that does not offer the simplified mode, this documentation is still valid: you will only need to manually add the Doc Filter Connector and the Emptier Connector to the job connectors and to correctly configure the Data Extraction Server Configuration - Enterprise Edition. This means we recommend you to:

  • Use the Emptier Connector and set the following restrictions (the regex syntax used here is demonstrated in a sketch after this list):

    • Include filters (empty documents that match)

      \.(?i)avi(?-i)$
      \.(?i)mp4(?-i)$
      \.(?i)mkv(?-i)$
      \.(?i)mov(?-i)$
      \.(?i)wmv(?-i)$
      \.(?i)flv(?-i)$
      \.(?i)mp3(?-i)$
      \.(?i)wav(?-i)$
      \.(?i)wma(?-i)$
      \.(?i)flac(?-i)$
      \.(?i)aac(?-i)$
      \.(?i)aiff(?-i)$
      \.(?i)ogg(?-i)$
    • Maximum document size (documents exceeding this size will be emptied): 100000000 (100 MB)

  • Open the TikaServerConnector tab and set the following restrictions:

    • Keep all metadata: checked

    • Normalize metadata names: checked

    • Content write limit: 1000000 (1 MB)

    • Extract archives content: checked

  • Use the Doc Filter Connector and set the following restrictions:

    • Exclude filters:

      \/~.*
      \.(?i)pst(?-i)$
      \.(?i)gz(?-i)$
      \.(?i)ini(?-i)$
      \.(?i)tar(?-i)$
      \.(?i)lnk(?-i)$
      \.(?i)db(?-i)$
      \.(?i)odb(?-i)$
      \.(?i)mat(?-i)$
      \/\..*
      \.(?i)tgz(?-i)$
      \.(?i)zip(?-i)$
      \.(?i)rar(?-i)$
      \.(?i)7z(?-i)$
      \.(?i)bz2(?-i)$
  • Use the Metadata Cleaner and set the following restrictions:

    • Metadata name cleaners (replace expressions in the metadata names):

      Regular expression: \$\{
      Replace value: _{
    • Metadata value cleaners (replace expressions in the metadata values):
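
The filter values above use Java regular expression syntax, since MCF runs on Java. The following minimal, self-contained sketch (not part of Datafari; the file and metadata names are invented) shows how the inline (?i)/(?-i) flags and the metadata name cleaner behave, assuming find-style (substring) matching as the unanchored patterns suggest:

  import java.util.regex.Pattern;

  public class FilterRegexCheck {
      public static void main(String[] args) {
          // Emptier Connector include filter: (?i) switches case-insensitivity on,
          // (?-i) switches it off again, so only the extension itself is matched
          // case-insensitively.
          Pattern avi = Pattern.compile("\\.(?i)avi(?-i)$");
          System.out.println(avi.matcher("/share/videos/demo.AVI").find());     // true
          System.out.println(avi.matcher("/share/videos/demo.avi.txt").find()); // false

          // Doc Filter Connector exclude filter: \/~.* rejects any path containing
          // a component that starts with '~' (e.g. Office temporary files).
          Pattern tilde = Pattern.compile("\\/~.*");
          System.out.println(tilde.matcher("/share/docs/~$report.docx").find()); // true

          // Metadata name cleaner: every literal "${" in a metadata name
          // is replaced by "_{".
          String cleaned = Pattern.compile("\\$\\{")
                  .matcher("custom${field}")
                  .replaceAll("_{");
          System.out.println(cleaned); // custom_{field}
      }
  }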


Valid as of Datafari 4.6 up to 5.5

Documentation below is valid from Datafari 4.6 up to 5.5

To avoid crawling errors and problems that are hard to identify and/or solve, it is very important, on the very first configuration of your crawl job, to limit as much as possible the scope of docs/files that will be indexed. It is better to be very restrictive at the beginning and to progressively raise the limits than to try to exhaustively crawl a repository.

1. Using Simplified Connectors

If you have to crawl a website or a winshare repository, start by creating a job using the MCF simplified UI, but do not select the option to automatically start the job after its creation; instead, go to the MCF UI and edit the freshly created job.

By default, jobs created through the simplified UI are configured to:

  • empty the .avi, .mp4 and .mkv files (thanks to the Emptier Connector)

  • empty the files exceeding 100 MB (thanks to the Emptier Connector)

  • limit the indexed text content of a file to 1 MB (configured in the TikaServerConnector tab)

  • clean up potential metadata problems with the Metadata Cleaner

The simplified web connector is also configured to exclude a set of filetypes from the crawl. The list is available at https://datafari.atlassian.net/browse/DATAFARI-452

The filer job additionally filters out known archive files (pst, rar, zip...). You can check the full list in the "Global Filters" tab.

At this point, you should ask yourself one question:

  • Are there specific file paths and/or file extensions to exclude from the crawl because they are not worth indexing (for example, file paths where backup files are stored, or binary files)?

If the answer is yes, list those file extensions and file paths. You will then need to create regular expressions that match the list and set them in the 'Emptier Filter' tab of the job configuration.

If you are configuring a crawl job for a winshare repository, it is possible to either exclude files or empty them (in order to only index the filename). You will therefore have to decide which file paths and/or file extensions must be excluded and which ones should only have their filename indexed. The regular expressions of file paths/file extensions to exclude from the crawl MUST BE SET in the 'Global Filters' tab of the job configuration, the other ones in the 'Emptier Filter' tab.

2. If Simplified Connectors cannot be used

If you cannot use the MCF simplified UI because you are configuring a job for a repository connector other than web or winshare, this documentation is still valid: you will only need to manually add the Doc Filter Connector and the Emptier Connector to the job connectors and to correctly configure the Data Extraction Server Configuration - Enterprise Edition. This means we recommend you to:

  • Use the Emptier Connector and set the following restrictions:

    • Include filters (empty documents that match)

    • Maximum document size (documents exceeding this size will be emptied): 100000000 (100 MB)

  • Open the TikaServerConnector tab and set the following restrictions:

    • Keep all metadata: checked

    • Normalize metadata names: checked

    • Content write limit: 1000000 (1 MB)

    • Extract archives content: checked

  • Use the Doc Filter Connector and set the following restrictions:

    • Exclude filters:

  • Use the Metadata Cleaner and set the following restrictions:

    • Metadata name cleaners (replace expressions in the metadata names):

    • Metadata value cleaners (replace expressions in the metadata values):