Content Limiter transformation connector Configuration

Valid from 5.0

The documentation below is valid from Datafari v5.0.0 upwards

Since Datafari 4.0.0, which introduced ManifoldCF v2.8.1, a new transformation connector is available: the Content limiter.
The purpose of this connector is to truncate the content stream of a crawled file when its size exceeds the configured limit, instead of skipping the file and leaving it unindexed. This improves the stability of Solr when the amount of plain text to index is large enough to cause heavy CPU and memory load, which can lead to an out-of-memory (OOM) error in Solr or the operating system.
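
The truncation idea can be sketched in a few lines of Python. This is only an illustration of "keep the first N bytes and drop the rest", not the connector's actual implementation (which runs inside ManifoldCF); the function name limit_stream and its parameters are hypothetical.

```python
import io


def limit_stream(stream, max_bytes, chunk_size=8192):
    """Return at most max_bytes bytes read from a binary stream.

    Hypothetical helper: it mimics the effect of a size limit,
    it is not the code used by the Content limiter connector.
    """
    if max_bytes < 0:
        raise ValueError("max_bytes must be >= 0")
    kept = bytearray()
    while len(kept) < max_bytes:
        # Never request more than what is still allowed.
        chunk = stream.read(min(chunk_size, max_bytes - len(kept)))
        if not chunk:
            break  # end of stream reached before the limit
        kept.extend(chunk)
    return bytes(kept)


if __name__ == "__main__":
    extracted_text = io.BytesIO(b"some extracted text " * 1000)
    truncated = limit_stream(extracted_text, 100)
    print(len(truncated))
```

A file smaller than the limit passes through unchanged; a larger one is cut at the limit and is still indexed with its first part.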

Follow these steps to use the content limiter transformation connector in a crawl job:

  • In the MCF admin UI, create a new transformation connector
    Enter "contentLimiter" as a Name

    Then save it

  • Now you need to insert this transformation connector just before your output connector in a job

THE CONTENT LIMITER TRANSFORMATION CONNECTOR MUST ALWAYS BE PLACED AFTER A TIKA CONNECTOR!
If the content limiter connector truncates a stream that is not plain text, the downstream connectors might not work and the MCF job can fail, or worse. This is why it MUST be placed after a Tika connector, whose output is guaranteed to be a text stream.

  • In the "Content limiter" tab of your job, you can set the maximum allowed stream size (in bytes). The recommended maximum stream size is 1 000 000 bytes (a little under 1 MB), which corresponds to the text of a book of approximately 500 pages and is normally enough for a single file.
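
The reason for placing the connector after Tika can be illustrated with a small Python sketch. It assumes a zlib-compressed payload as a stand-in for any binary or container format (this is not what Tika or ManifoldCF do internally, and all helper names are hypothetical): a truncated plain-text stream is still readable, while a truncated binary stream can no longer be parsed.

```python
import zlib


def truncate(data: bytes, max_bytes: int) -> bytes:
    """Keep only the first max_bytes bytes."""
    return data[:max_bytes]


def is_readable_text(data: bytes) -> bool:
    """True if the bytes still decode as UTF-8 text."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def is_valid_zlib(data: bytes) -> bool:
    """True if the bytes are a complete zlib stream."""
    try:
        zlib.decompress(data)
        return True
    except zlib.error:
        return False


# Plain text, as produced by a text extraction step (ASCII here).
plain = ("Extracted text from a document. " * 100).encode("utf-8")
# Compressed payload standing in for a binary file format.
packed = zlib.compress(plain)

if __name__ == "__main__":
    print("truncated text readable:", is_readable_text(truncate(plain, 50)))
    print("full binary valid:", is_valid_zlib(packed))
    print("truncated binary valid:", is_valid_zlib(truncate(packed, len(packed) // 2)))
```

Note that cutting multi-byte UTF-8 text at an arbitrary byte offset can split the last character; the sketch uses ASCII text to keep the comparison simple.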

Note that with the standard configuration of Datafari, Solr can hardly handle more than 25 MB of stream size per file. Above this limit, an OOM is very likely.
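
To get a feel for these sizes, the figures above can be turned into a rough page estimate. The sketch below derives a bytes-per-page ratio from the recommendation itself (1 000 000 bytes for about 500 pages) and assumes that 25 MB means 25 000 000 bytes; the constant and function names are hypothetical and the result is an order of magnitude, not a measurement.

```python
# Figures taken from the text above.
RECOMMENDED_LIMIT_BYTES = 1_000_000
PAGES_IN_REFERENCE_BOOK = 500
# Assumption: "25 MB" is read as 25 * 10^6 bytes.
SOLR_PRACTICAL_CEILING_BYTES = 25_000_000

# Ratio implied by the recommendation: bytes of text per page.
BYTES_PER_PAGE = RECOMMENDED_LIMIT_BYTES // PAGES_IN_REFERENCE_BOOK


def approx_pages(limit_bytes: int) -> int:
    """Rough number of book pages that fit in limit_bytes of plain text."""
    return limit_bytes // BYTES_PER_PAGE


if __name__ == "__main__":
    for limit in (RECOMMENDED_LIMIT_BYTES, SOLR_PRACTICAL_CEILING_BYTES):
        print(limit, "bytes is roughly", approx_pages(limit), "pages")
```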

