OCR on ManifoldCF Configuration with Datafari CE [DEPRECATED]

Valid up to version 4.0.0

By default, Datafari uses the Apache Solr Extracting Request Handler (aka Solr Cell), which relies on Apache Tika to extract the indexable content from crawled files. To limit resource consumption (especially network traffic when ManifoldCF is installed on an external server), you can run Tika directly in ManifoldCF. In that case, content extraction happens in ManifoldCF, and only the content to be indexed is sent to Apache Solr. A Tika Transformation Connection is configured by default in Datafari. To use it, simply add it to your crawling job in ManifoldCF:

Then click on "Insert Transformation Before". You can now go to the "Boilerplate" tab and select "Extract Everything":

Your crawling job is now ready to use Apache Tika directly in ManifoldCF to extract the content of the crawled files. You can also use Tesseract OCR to extract text from images and PDF files. Tesseract OCR is bundled with Datafari. To allow the ManifoldCF TikaOCR transformation connector to use Tesseract for the OCR analysis, change the "OCR" property in /opt/datafari/tomcat/conf/datafari.properties from "false" to "true", then restart Datafari.
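
The property change described above can also be scripted. The following is only a minimal sketch: it assumes the property is stored on its own line in the usual Java properties form (for example `OCR=false`), which this page does not confirm, and the helper names `enable_ocr` and `enable_ocr_in_file` are hypothetical. It does not restart Datafari.

```python
import re
from pathlib import Path

# Path taken from the text above. The "OCR=false" line layout is an assumption.
PROPERTIES_PATH = Path("/opt/datafari/tomcat/conf/datafari.properties")

# Matches a whole line such as "OCR=false" or "OCR = false".
# Spaces and tabs only, so surrounding blank lines are left untouched.
_OCR_LINE = re.compile(r"^([ \t]*OCR[ \t]*[=:][ \t]*)false[ \t]*$", re.MULTILINE)


def enable_ocr(properties_text: str) -> str:
    """Return the properties text with the OCR property switched to true."""
    return _OCR_LINE.sub(r"\g<1>true", properties_text)


def enable_ocr_in_file(path: Path) -> bool:
    """Rewrite the file in place; return True if the file was changed."""
    original = path.read_text(encoding="utf-8")
    updated = enable_ocr(original)
    if updated == original:
        return False  # no "OCR=false" line found, nothing to do
    path.write_text(updated, encoding="utf-8")
    return True
```

Calling `enable_ocr_in_file(PROPERTIES_PATH)` requires write access to the file. Datafari still has to be restarted afterwards for the new value to take effect.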