Info

Valid from 5.0

The documentation below is valid from Datafari v5.0.0 upwards

Since Datafari 4.0.0, which introduced ManifoldCF v2.8.1, a new transformation connector is available: the Content limiter.
The purpose of this connector is to truncate the content stream of a crawled file when its size exceeds the configured limit, instead of skipping the file and leaving it unindexed. This improves the stability of Solr when the amount of pure text to index is large enough to cause heavy CPU and memory load, which can lead to an OOM in Solr or in the operating system.
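
To make the behaviour concrete, here is a minimal sketch of the idea in Python. It is not the connector's actual code (the connector is part of ManifoldCF and is configured through the UI); the helper name `limit_content` and the sample sizes are assumptions chosen for illustration only.

```python
import io


def limit_content(stream, max_bytes):
    """Return at most max_bytes bytes from a binary stream.

    Hypothetical helper: it mimics the principle of the content limiter
    (keep the beginning of the text, drop the rest) and nothing more.
    """
    return stream.read(max_bytes)


# 5,000 bytes of already extracted plain text (illustrative data).
extracted_text = ("word " * 1000).encode("utf-8")

# With a 1,000 byte limit, only the first 1,000 bytes are kept for indexing.
limited = limit_content(io.BytesIO(extracted_text), 1000)

# A stream smaller than the limit is passed through unchanged.
untouched = limit_content(io.BytesIO(b"short text"), 1000)
```

The point of the sketch is that an oversized document is still indexed, only with a shortened body, rather than being dropped from the index.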

...

  • In the MCF admin UI, create a new transformation connector
    Enter "contentLimiter" as a Name

    Then save it

  • Now you need to insert this transformation connector just before your output connector in a job

Warning

THE CONTENT LIMITER TRANSFORMATION CONNECTOR MUST ALWAYS BE PLACED AFTER A TIKA CONNECTOR!
If the content limiter connector truncates a stream that is not pure text, the other connectors might not work and the MCF job can fail, or worse. This is why it MUST be placed after a Tika connector, whose output is guaranteed to be a text stream.
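
The reason behind this warning can be illustrated with a small Python sketch. It uses an in-memory ZIP archive as a stand-in for any binary document format; the file name, the sample text and the helper `is_readable_zip` are assumptions made for the example and have nothing to do with MCF itself.

```python
import io
import zipfile


def is_readable_zip(data):
    """Return True if data is a complete, readable ZIP archive."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            return archive.testzip() is None
    except zipfile.BadZipFile:
        return False


# Build a small binary container in memory (illustrative content).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("document.txt", "some text content " * 200)
binary = buffer.getvalue()

# The same content as plain text, as a text extractor would deliver it.
plain_text = b"some text content " * 200

# Cut both streams at the same byte limit.
limit = len(binary) // 2
truncated_binary_ok = is_readable_zip(binary[:limit])
truncated_text = plain_text[:limit].decode("ascii")
```

In this sketch the truncated archive can no longer be opened, because the structure a parser needs sits at the end of the file, while the truncated text is simply a shorter text. A connector that receives a truncated binary stream is in the first situation.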

  • In the "Content limiter" tab of your job, you can set the maximum allowed stream size (in bytes). The recommended maximum is 1,000,000 bytes (about 1 MB), which corresponds to the text of a book of approximately 500 pages and is normally enough for a single file.
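
As a rough check of the 500-page figure, the arithmetic can be sketched as follows. The value of 2,000 characters per page is an assumption made for this example, not a number from Datafari; the byte counts rely only on how UTF-8 encodes characters.

```python
PAGES = 500
CHARS_PER_PAGE = 2_000   # assumption: a fairly dense page of plain text
BYTES_PER_CHAR = 1       # unaccented Latin characters take 1 byte in UTF-8

estimated_bytes = PAGES * CHARS_PER_PAGE * BYTES_PER_CHAR

# Accented characters take more than one byte in UTF-8, so text in French
# or similar languages reaches the limit slightly sooner.
ascii_size = len("e".encode("utf-8"))
accented_size = len("é".encode("utf-8"))
```

Under these assumptions the estimate comes to 1,000,000 bytes, which matches the recommended limit. The limit counts bytes, not characters, so the number of pages that fit depends on the language of the text.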

Note

Note that with the standard configuration of Datafari, Solr can hardly handle a stream size above 25 MB per file. Above this limit, an OOM is very likely.

...

Info

Valid from 4.0 up to 4.6

The documentation below is valid from Datafari v4.0.0 up to v4.6

Info

Users of Datafari Enterprise Edition

Except in particular cases, do not use this content limiter with your Datafari Enterprise Edition solution: if you are using the available Tika Server Connector, it already includes a preconfigured, optimised content limiter.

Since Datafari 4.0.0, which introduced ManifoldCF v2.8.1, a new transformation connector is available: the Content limiter.
The purpose of this connector is to truncate the content stream of a crawled file when its size exceeds the configured limit, instead of skipping the file and leaving it unindexed. This improves the stability of Solr when the amount of pure text to index is large enough to cause heavy CPU and memory load, which can lead to an OOM in Solr or in the operating system.

Follow these steps to use the content limiter transformation connector in a crawl job:

  • In the MCF admin UI, create a new transformation connector
    Enter "contentLimiter" as a Name

    Then save it

  • Now you need to insert this transformation connector just before your output connector in a job

Warning

THE CONTENT LIMITER TRANSFORMATION CONNECTOR MUST ALWAYS BE PLACED AFTER A TIKA CONNECTOR!
If the content limiter connector truncates a stream that is not pure text, the other connectors might not work and the MCF job can fail, or worse. This is why it MUST be placed after a Tika connector, whose output is guaranteed to be a text stream.

  • In the "Content limiter" tab of your job, you can set the maximum allowed stream size (in bytes). The recommended maximum is 1,000,000 bytes (about 1 MB), which corresponds to the text of a book of approximately 500 pages and is normally enough for a single file.

Note

Note that with the standard configuration of Datafari, Solr can hardly handle a stream size above 25 MB per file. Above this limit, an OOM is very likely.