Info: Valid as of Datafari 4.6. The documentation below is valid from Datafari 4.6 onwards.
To avoid crawling errors and problems that are hard to identify or solve, it is very important to limit the scope of the documents and files to be indexed as much as possible when you first configure a crawl job. It is better to be very restrictive at the beginning and raise the limits progressively than to try to crawl a repository exhaustively from the start.
1. Using Simplified Connectors
If you have to crawl a website or a Windows share (winshare) repository, start by creating a job with the MCF simplified UI, but do not select the option to automatically start the job after its creation. Instead, go to the MCF UI and edit the freshly created job.
...
The simplified web connector is also configured to exclude a set of file types from the crawl. The list is available in https://datafari.atlassian.net/browse/
The filer job additionally filters out the known archive files (pst, rar, zip...). You can check the full list in the "Global Filters" tab.
...
If you are configuring a crawl job for a winshare repository, you can either exclude files or empty them (so that only the filename is indexed). You will therefore have to decide which file paths and/or file extensions must be excluded, and which ones should only have their filename indexed. The regular expressions for the file paths and file extensions to exclude from the crawl MUST BE SET in the 'Global Filters' tab of the job configuration, and the other ones in the 'Emptier Filter' tab.
2. If Simplified Connectors cannot be used
If you cannot use the MCF simplified UI because you are configuring a job for a repository connector other than web or winshare, this documentation is still valid. You only need to manually add the Doc Filter Connector and the Emptier Connector to the job connectors, and to correctly configure the data extraction server (see Data Extraction Server Configuration - Enterprise Edition). This means we recommend that you:
Use the emptier connector and put the following restrictions:
Include filters (empty documents that match)
    \.(?i)avi(?-i)$
    \.(?i)mp4(?-i)$
    \.(?i)mkv(?-i)$
    \.(?i)mov(?-i)$
    \.(?i)wmv(?-i)$
    \.(?i)flv(?-i)$
    \.(?i)mp3(?-i)$
    \.(?i)wav(?-i)$
    \.(?i)wma(?-i)$
    \.(?i)flac(?-i)$
    \.(?i)aac(?-i)$
    \.(?i)aiff(?-i)$
    \.(?i)ogg(?-i)$
Maximum document size (larger documents will be emptied): 100000000
Open the TikaServerConnector tab and put the following restrictions:
Keep all metadata: checked
Normalize metadata names: checked
Content write limit: 1000000
Extract archives content: checked
Use the doc filter connector and put the following restrictions:
Exclude filters:
    \/~.*
    \.(?i)pst(?-i)$
    \.(?i)gz(?-i)$
    \.(?i)ini(?-i)$
    \.(?i)tar(?-i)$
    \.(?i)lnk(?-i)$
    \.(?i)db(?-i)$
    \.(?i)odb(?-i)$
    \.(?i)mat(?-i)$
    \/\..*
    \.(?i)tgz(?-i)$
    \.(?i)zip(?-i)$
    \.(?i)rar(?-i)$
    \.(?i)7z(?-i)$
    \.(?i)bz2(?-i)$
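To check how these patterns behave before launching a crawl, the emptier and doc filter settings above can be tried out locally as a three-way decision (exclude, empty, or index). The following is only a minimal sketch and not the connectors' actual implementation. The class name `CrawlFilterSketch`, the `Decision` enum, the use of partial matching (`find()`), the choice to test the full file path, and the order in which the filters are applied are all assumptions made for illustration. Only a subset of the patterns is included.

```java
import java.util.List;
import java.util.regex.Pattern;

public class CrawlFilterSketch {

    /** Possible outcomes for a crawled file in this sketch (hypothetical type). */
    enum Decision { EXCLUDE, EMPTY, INDEX }

    // A subset of the exclude filters listed for the doc filter connector.
    static final List<Pattern> EXCLUDE_FILTERS = List.of(
            Pattern.compile("\\/~.*"),            // names starting with "~"
            Pattern.compile("\\.(?i)pst(?-i)$"),
            Pattern.compile("\\.(?i)zip(?-i)$"),
            Pattern.compile("\\/\\..*"));         // names starting with "."

    // A subset of the include filters listed for the emptier connector.
    static final List<Pattern> EMPTIER_FILTERS = List.of(
            Pattern.compile("\\.(?i)avi(?-i)$"),
            Pattern.compile("\\.(?i)mp4(?-i)$"),
            Pattern.compile("\\.(?i)mp3(?-i)$"));

    // Maximum document size taken from the emptier settings.
    static final long MAX_DOCUMENT_SIZE = 100_000_000L;

    /** Returns true if at least one pattern is found somewhere in the path. */
    static boolean matchesAny(List<Pattern> patterns, String path) {
        for (Pattern pattern : patterns) {
            if (pattern.matcher(path).find()) {
                return true;
            }
        }
        return false;
    }

    /** Assumed order: exclusion is checked first, then emptying. */
    static Decision classify(String path, long size) {
        if (matchesAny(EXCLUDE_FILTERS, path)) {
            return Decision.EXCLUDE;
        }
        if (matchesAny(EMPTIER_FILTERS, path) || size > MAX_DOCUMENT_SIZE) {
            return Decision.EMPTY;
        }
        return Decision.INDEX;
    }

    public static void main(String[] args) {
        // Hypothetical sample paths and sizes.
        System.out.println(classify("/share/docs/report.pdf", 2_000_000L));
        System.out.println(classify("/share/video/Demo.MP4", 5_000L));
        System.out.println(classify("/share/archive/backup.ZIP", 5_000L));
        System.out.println(classify("/share/docs/huge.pdf", 200_000_000L));
    }
}
```

Because of the `(?i)` inline flag, extensions match regardless of case, so `Demo.MP4` is treated like `demo.mp4`. Running a sample of real paths from the repository through such a check is a cheap way to confirm that a restrictive configuration excludes what you expect.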
Use the Metadata Cleaner and put the following restrictions:
Metadata name cleaners (replace expressions in the metadata names):
    Regular expression: \$\{
    Replace value: _{
Metadata value cleaners (replace expressions in the metadata values):
    Regular expression: \$\{
    Replace value: _{
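The effect of the two cleaner rules above can be illustrated with a short sketch that replaces every `${` with `_{` in both metadata names and values. This is an illustration under assumptions, not the Metadata Cleaner's actual code: the class name `MetadataCleanerSketch`, the helper methods, and the sample metadata are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MetadataCleanerSketch {

    // Regular expression from the cleaner settings: a literal "${".
    static final Pattern DOLLAR_BRACE = Pattern.compile("\\$\\{");

    // Replace value from the cleaner settings, quoted so it is used literally.
    static final String REPLACEMENT = Matcher.quoteReplacement("_{");

    /** Replaces every "${" in the text with "_{". */
    static String clean(String text) {
        return DOLLAR_BRACE.matcher(text).replaceAll(REPLACEMENT);
    }

    /** Applies the same rule to metadata names and metadata values. */
    static Map<String, String> cleanAll(Map<String, String> metadata) {
        Map<String, String> cleaned = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : metadata.entrySet()) {
            cleaned.put(clean(entry.getKey()), clean(entry.getValue()));
        }
        return cleaned;
    }

    public static void main(String[] args) {
        // Hypothetical metadata for illustration only.
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("title", "Report for ${customer}");
        metadata.put("${custom.field}", "plain value");
        System.out.println(cleanAll(metadata));
    }
}
```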
3.