Data Extraction (Tika) Embedded in ManifoldCF Configuration [DEPRECATED]

Valid from 4.0

The documentation below is valid from Datafari v4.0.0 upwards

Users of Datafari Enterprise Edition

Except in special cases, do not use this documentation with your Datafari Enterprise Edition solution: it already comes equipped and preconfigured with an optimised external Tika Server Connector.

For users of the Datafari Community Edition: one option in Datafari is to use the Apache Solr Extracting Request Handler (aka Solr Cell), which relies on the Apache Tika library embedded in Solr to extract the indexable content from the crawled files. With that option, ManifoldCF sends each crawled file in full to Solr.

To limit resource consumption (especially network traffic when ManifoldCF is installed on a separate server), you can instead run Tika directly in ManifoldCF. In this case, the content is extracted in ManifoldCF itself, and only the content that should be indexed is sent to Apache Solr.

A Tika Transformation Connection is configured by default in Datafari. To use it, simply add it to your crawling job in ManifoldCF:

Then click "Insert Transformation Before". Next, open the "Boilerplate" tab and select "Extract Everything":

Now your crawling job is ready to use Apache Tika directly in ManifoldCF to extract the content of the crawled files.
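
To see why extracting content in ManifoldCF can reduce network traffic, the sketch below compares the size of a raw file with the size of the text extracted from it. It is only an illustration: it calls neither Tika nor ManifoldCF. Python's standard `html.parser` stands in for the extractor, and `TextExtractor`, `extract_text`, `payload_sizes` and `SAMPLE` are hypothetical names invented for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical stand-in for Tika: keeps visible text, skips script/style."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a <script> or <style> element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(raw: bytes) -> str:
    """Return the visible text of an HTML document as a single string."""
    parser = TextExtractor()
    parser.feed(raw.decode("utf-8"))
    parser.close()
    return " ".join(parser.chunks)


def payload_sizes(raw: bytes) -> tuple[int, int]:
    """Return (bytes sent if the whole file goes to the indexer,
    bytes sent if only the extracted text goes to the indexer)."""
    return len(raw), len(extract_text(raw).encode("utf-8"))


# Made-up sample document, not taken from a real crawl.
SAMPLE = b"""<html><head><style>body { margin: 0; }</style>
<script>console.log("tracking");</script></head>
<body><h1>Quarterly report</h1><p>Revenue grew.</p></body></html>"""

if __name__ == "__main__":
    raw_size, text_size = payload_sizes(SAMPLE)
    print(f"raw file: {raw_size} bytes, extracted text: {text_size} bytes")
```

The gap is usually far larger for binary formats such as PDF or office documents, where most of the file is layout and embedded resources rather than indexable text.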