
Valid from 4.0

The documentation below is valid from Datafari v4.0.0 upwards

Users of Datafari Enterprise Edition

Unless you have a specific reason to, do not use this content limiter with Datafari Enterprise Edition: if you use the Tika Server Connector available in that edition, it already includes a preconfigured, optimised content limiter.

Since Datafari 4.0.0, which introduces ManifoldCF v2.8.1, a new transformation connector is available: the Content limiter.
Its purpose is to truncate the content stream of a crawled file when its size exceeds the configured limit, instead of skipping the file and leaving it unindexed. This improves the stability of Solr when the amount of plain text to index is large enough to cause heavy CPU and memory load, which can lead to an out-of-memory (OOM) error in Solr or the operating system.
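The truncate-instead-of-skip behaviour described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the connector's actual implementation (which lives in ManifoldCF and is written in Java); the function name `limit_stream` and the `chunk_size` parameter are hypothetical.

```python
import io


def limit_stream(stream, max_bytes, chunk_size=8192):
    """Yield chunks read from `stream` until `max_bytes` have been emitted.

    Hypothetical helper: anything beyond the limit is dropped rather than
    the whole document being rejected.
    """
    remaining = max_bytes
    while remaining > 0:
        chunk = stream.read(min(chunk_size, remaining))
        if not chunk:  # end of stream reached before the limit
            break
        remaining -= len(chunk)
        yield chunk


# A 2.5 MB text stream is cut down to the configured 1,000,000 bytes.
extracted_text = b"x" * 2_500_000
limited = b"".join(limit_stream(io.BytesIO(extracted_text), max_bytes=1_000_000))
print(len(limited))  # 1000000
```

A stream that is already smaller than the limit passes through unchanged, so small documents are unaffected.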

Follow these steps to use the content limiter transformation connector in a crawl job:

  • In the MCF admin UI, create a new transformation connector

    Enter "contentLimiter" as the name


    Then save it


  • Now you need to insert this transformation connector just before your output connector in a job



    The Content limiter transformation connector MUST always be placed after a Tika connector!
    If the content limiter truncates a stream that is not plain text, the connectors that follow might not work, and the MCF job can fail or worse. This is why it MUST be placed after a Tika connector, whose output is guaranteed to be a text stream.

  • In the "Content limiter" tab of your job, you can set the maximum allowed stream size (in bytes). The recommended maximum is 1,000,000 bytes (slightly less than 1 MiB), which corresponds to the text of a book of approximately 500 pages and is normally enough for a single file.
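As a quick sanity check on the recommended value, the arithmetic can be worked out as follows. The constant names are hypothetical, and the per-page figure is simply the limit divided by the page count quoted above, not a measured value.

```python
MAX_STREAM_BYTES = 1_000_000  # recommended limit from the step above
PAGES = 500                   # approximate book length quoted above
MIB = 1024 * 1024             # one mebibyte, in bytes

bytes_per_page = MAX_STREAM_BYTES / PAGES
print(bytes_per_page)          # 2000.0 bytes of text per page
print(MAX_STREAM_BYTES < MIB)  # True: the limit is slightly under 1 MiB

# The limit is expressed in bytes, not characters: in UTF-8, accented
# characters such as "é" take 2 bytes, so fewer characters fit.
print(len("é".encode("utf-8")))  # 2
```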

Note that with the standard configuration of Datafari, Solr will hardly handle a stream size of more than 25 MB per file. Above this limit, an OOM error is very likely.
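To illustrate why the content limiter must come after a Tika connector, the sketch below truncates a binary container and a plain-text stream with the same naive cut. A ZIP archive stands in here for any binary format (an assumption made for illustration; the `truncate` helper is hypothetical and is not the connector's code).

```python
import io
import zipfile


def truncate(data, max_bytes):
    """Hypothetical stand-in for the limiter: keep only the first bytes."""
    return data[:max_bytes]


# Build a small binary container in memory.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("doc.txt", "hello world " * 1000)
binary = buf.getvalue()

# Truncating the binary removes the archive's end-of-file index,
# so a downstream parser can no longer open it.
try:
    zipfile.ZipFile(io.BytesIO(truncate(binary, len(binary) // 2)))
    binary_ok = True
except zipfile.BadZipFile:
    binary_ok = False

# Truncating already-extracted text still yields usable text.
text = ("hello world " * 1000).encode("utf-8")
text_cut = truncate(text, 100).decode("utf-8", errors="ignore")

print(binary_ok)      # False
print(len(text_cut))  # 100
```

The `errors="ignore"` argument only matters when the cut lands in the middle of a multi-byte UTF-8 character, in which case the incomplete trailing character is dropped.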
