
Valid as of Datafari 4.6

Documentation below is valid from Datafari 4.6 onwards

To avoid crawling errors and problems that are hard to identify or solve, it is very important, when you first configure a crawl job, to limit as much as possible the scope of documents and files that will be indexed. It is better to be very restrictive at the beginning and to raise the limits progressively than to try to crawl a repository exhaustively.

1. Using Simplified Connectors

If you have to crawl a website or a Windows share repository, start by creating a job using the MCF simplified UI, but do not select the option to automatically start the job after its creation. Instead, go to the MCF UI and edit the freshly created job.

By default, jobs created through the simplified UI are configured to:

  • empty the .avi, .mp4 and .mkv files (thanks to the Emptier Connector)

  • empty the files exceeding 100 MB (thanks to the Emptier Connector)

  • limit the indexed text content of a file to 1 MB (configured in the TikaServerConnector tab)

The simplified web connector is also configured to exclude a set of file types from the crawl. The list is available at https://datafari.atlassian.net/browse/DATAFARI-452

The filer job additionally filters out the known archive files (pst, rar, zip...). You can check the full list in the "Global Filters" tab.

At this point, you should ask yourself one question:

  • Are there specific file paths and/or file extensions to exclude from the crawl because they are not worth indexing (for example, file paths where backup files or binary files are stored)?

If the answer is yes, list those file extensions and file paths. You will then need to create regular expressions that match the list and set them in the 'Emptier Filter' tab of the job configuration.
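As an illustration of turning such a list into regular expressions, here is a minimal Java sketch. It is not part of Datafari or MCF: the class and method names are hypothetical, the sample extension and folder are made up, and it assumes that filters are applied with "find" semantics against a path using forward slashes. The extension patterns follow the same style as the ones shown later on this page.

```java
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only: not a Datafari or MCF class.
class FilterPatternBuilder {

    // Builds a case-insensitive "ends with .ext" pattern in the style used
    // on this page, e.g. "bak" gives \.(?i)bak(?-i)$
    static String extensionPattern(String extension) {
        // Only plain alphanumeric extensions are accepted, so no escaping is needed.
        if (!extension.matches("[A-Za-z0-9]+")) {
            throw new IllegalArgumentException("Unexpected extension: " + extension);
        }
        return "\\.(?i)" + extension + "(?-i)$";
    }

    // Builds a case-sensitive pattern matching any path that contains the
    // given folder name as a full segment, e.g. /backup/
    static String pathPattern(String folder) {
        return "/" + Pattern.quote(folder) + "/";
    }

    // Assumption: the filter matches if the pattern is found anywhere in the path.
    static boolean matches(String regex, String path) {
        return Pattern.compile(regex).matcher(path).find();
    }

    public static void main(String[] args) {
        String bak = extensionPattern("bak");
        String backupFolder = pathPattern("backup");
        System.out.println(bak);
        System.out.println(backupFolder);
        System.out.println(matches(bak, "/share/db/DUMP.BAK"));
        System.out.println(matches(backupFolder, "/share/backup/2020/file.docx"));
    }
}
```

Testing each generated pattern against a few real paths from your repository before pasting it into the job configuration helps catch patterns that match too much or too little.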

If you are configuring a crawl job for a Windows share repository, files can be either excluded or emptied (so that only the filename is indexed). You will therefore have to decide which file paths and/or file extensions must be excluded and which ones should have only their filename indexed. The regular expressions for the file paths and file extensions to exclude from the crawl MUST BE SET in the 'Global Filters' tab of the job configuration; the other ones go in the 'Emptier Filter' tab.

2. If Simplified Connectors cannot be used

If you cannot use the MCF simplified UI because you are configuring a job for a repository connector other than web or winshare, this documentation is still valid. You only need to manually add the Doc Filter Connector and the Emptier Connector to the job connectors, and to correctly configure the Data Extraction Server Configuration - Enterprise Edition. We therefore recommend that you:

  • Use the Emptier Connector and set the following restrictions:

    • Include filters (empty documents that match)

      \.(?i)avi(?-i)$
      \.(?i)mp4(?-i)$
      \.(?i)mkv(?-i)$
      \.(?i)mov(?-i)$
      \.(?i)wmv(?-i)$
      \.(?i)flv(?-i)$
      \.(?i)mp3(?-i)$
      \.(?i)wav(?-i)$
      \.(?i)wma(?-i)$
      \.(?i)flac(?-i)$
      \.(?i)aac(?-i)$
      \.(?i)aiff(?-i)$
      \.(?i)ogg(?-i)$
    • Maximum document size (higher document length will be emptied): 100000000

  • Open the TikaServerConnector tab and set the following restrictions:

    • Keep all metadata: checked

    • Normalize metadata names: checked

    • Content write limit: 1000000

    • Extract archives content: checked

  • Use the Doc Filter Connector and set the following restrictions:

    • Exclude filters:

      \/~.*
      \.(?i)pst(?-i)$
      \.(?i)gz(?-i)$
      \.(?i)ini(?-i)$
      \.(?i)tar(?-i)$
      \.(?i)lnk(?-i)$
      \.(?i)db(?-i)$
      \.(?i)odb(?-i)$
      \.(?i)mat(?-i)$
      \/\..*
      \.(?i)tgz(?-i)$
      \.(?i)zip(?-i)$
      \.(?i)rar(?-i)$
      \.(?i)7z(?-i)$
      \.(?i)bz2(?-i)$
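
To make the combined effect of these settings easier to reason about, the following Java sketch simulates the decision taken for a single file. It is only a model under stated assumptions, not the actual connector code: the class and method names are hypothetical, only a subset of the patterns above is included, the Doc Filter is assumed to run before the Emptier, patterns are assumed to be applied with "find" semantics, and the content write limit is assumed to be counted in characters.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical model of the filtering chain, for illustration only.
class CrawlFilterSketch {

    enum Decision { EXCLUDE, EMPTY, INDEX }

    // Subset of the Doc Filter exclude patterns listed on this page.
    static final List<Pattern> EXCLUDE_FILTERS = compile(
            "\\/~.*", "\\/\\..*", "\\.(?i)zip(?-i)$", "\\.(?i)pst(?-i)$");

    // Subset of the Emptier include patterns listed on this page.
    static final List<Pattern> EMPTIER_FILTERS = compile(
            "\\.(?i)avi(?-i)$", "\\.(?i)mp4(?-i)$", "\\.(?i)mp3(?-i)$");

    static final long MAX_DOCUMENT_SIZE = 100_000_000L; // Emptier maximum document size
    static final int CONTENT_WRITE_LIMIT = 1_000_000;   // Tika content write limit

    static List<Pattern> compile(String... regexes) {
        return Arrays.stream(regexes).map(Pattern::compile).collect(Collectors.toList());
    }

    static boolean matchesAny(List<Pattern> patterns, String path) {
        return patterns.stream().anyMatch(p -> p.matcher(path).find());
    }

    // Assumed order: excluded files are dropped first, then matching or
    // oversized files are emptied, and everything else is indexed.
    static Decision decide(String path, long sizeInBytes) {
        if (matchesAny(EXCLUDE_FILTERS, path)) {
            return Decision.EXCLUDE;
        }
        if (matchesAny(EMPTIER_FILTERS, path) || sizeInBytes > MAX_DOCUMENT_SIZE) {
            return Decision.EMPTY;
        }
        return Decision.INDEX;
    }

    // Keeps at most CONTENT_WRITE_LIMIT characters of the extracted text.
    static String limitContent(String text) {
        return text.length() <= CONTENT_WRITE_LIMIT ? text : text.substring(0, CONTENT_WRITE_LIMIT);
    }

    public static void main(String[] args) {
        System.out.println(decide("/share/docs/report.pdf", 2_000_000L));
        System.out.println(decide("/share/videos/demo.MP4", 5_000L));
        System.out.println(decide("/share/archive/old.zip", 5_000L));
        System.out.println(decide("/share/docs/huge.pdf", 100_000_001L));
    }
}
```

Running a handful of representative paths and sizes through such a model before starting the job is a cheap way to check that the most restrictive configuration still indexes the documents you care about.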