Valid from 4.0.0
The documentation below applies to Datafari v4.0.0 and later.
By default, the 'DatafariSolr' output connector, which Datafari pre-configures in MCF, sends all documents to the /update/extract handler of Solr. This handler uses an embedded Tika to parse each incoming document before indexing it, even if the document has already been parsed by a Tika connector or a Tika service connector that you may have configured in the crawl job. This second parsing can alter the content of the document (for example for XML, CSV or JSON files) and consumes resources and processing time that could be avoided.
This is why, starting with version 4.0.0 of Datafari, we provide a new Solr handler that indexes documents without using Tika.
This handler only works with documents that have already been parsed by a Tika connector or a Tika service connector. If that is not the case, Solr returns errors and the MCF job hangs, so be really careful!
The handler Java classes are:
- com.francelabs.datafari.handler.parsed.ParsedContentHandler
- com.francelabs.datafari.handler.parsed.ParsedDocumentLoader
- com.francelabs.datafari.handler.parsed.ParsedRequestHandler
They are located in the 'datafari-handler' module of the Datafari GitHub project.
A pre-configured MCF output connector named 'DatafariSolrNoTika' is created by Datafari during the installation. This output connector is configured to use the /update/no-tika URI to push documents to Solr. On the Solr side, this URI is mapped to the ParsedRequestHandler class. You can find this configuration in [DATAFARI_HOME]/solr/solrcloud/FileShare/conf/solrconfig.xml:
<requestHandler class="com.francelabs.datafari.handler.parsed.ParsedRequestHandler" name="/update/no-tika" startup="lazy">
  <lst name="defaults">
    <str name="lowernames">true</str>
    <str name="fmap.language">ignored_</str>
    <str name="fmap.source">ignored_</str>
    <str name="uprefix">ignored_</str>
    <str name="update.chain">datafari</str>
  </lst>
</requestHandler>
You should not have to modify these parameters, as they are essentially the same as for the /update/extract handler. The only exception is the 'ignoreTikaException' parameter, which is unused here because Tika is not involved.
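To confirm that a given solrconfig.xml really contains the /update/no-tika mapping, the handler declaration can be read back with a few lines of Python. The following is only a minimal sketch: the function name `describe_handler` and the embedded `SAMPLE_SOLRCONFIG` string (a copy of the declaration above, wrapped in a `<config>` root) are illustrative assumptions, and in practice you would read the file from your own installation instead.

```python
import xml.etree.ElementTree as ET

# Sample copied from the handler declaration shown in this page.
# Assumption: in a real check, load the solrconfig.xml of your installation.
SAMPLE_SOLRCONFIG = """
<config>
  <requestHandler class="com.francelabs.datafari.handler.parsed.ParsedRequestHandler"
                  name="/update/no-tika" startup="lazy">
    <lst name="defaults">
      <str name="lowernames">true</str>
      <str name="fmap.language">ignored_</str>
      <str name="fmap.source">ignored_</str>
      <str name="uprefix">ignored_</str>
      <str name="update.chain">datafari</str>
    </lst>
  </requestHandler>
</config>
"""


def describe_handler(solrconfig_xml, handler_name):
    """Return (class name, defaults dict) for the named handler, or None if absent."""
    root = ET.fromstring(solrconfig_xml)
    for handler in root.iter("requestHandler"):
        if handler.get("name") == handler_name:
            # Collect the <str> entries of the "defaults" list as a plain dict.
            defaults = {
                entry.get("name"): entry.text
                for entry in handler.findall("./lst[@name='defaults']/str")
            }
            return handler.get("class"), defaults
    return None


if __name__ == "__main__":
    print(describe_handler(SAMPLE_SOLRCONFIG, "/update/no-tika"))
```

A `None` result means the handler is not declared in the parsed configuration, in which case the 'DatafariSolrNoTika' output connector would have nothing to push to.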
So, to avoid Tika processing on the Solr side during a crawl, simply select the 'DatafariSolrNoTika' output connector as the output of your MCF crawl job. Remember that you need at least one Tika connector or Tika service connector between your input connector and 'DatafariSolrNoTika'; otherwise you will not be able to index documents.
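The ordering rule can be expressed as a small check. This is a simplified sketch, not how MCF itself validates jobs: the pipeline is modelled as an ordered list of (connection name, connector type) pairs, and the type labels "tika", "tika-service" and "output", the connection name "MyTika" and the function name `parser_precedes_output` are hypothetical names chosen for illustration.

```python
# Hypothetical type labels for this sketch; MCF does not use these exact strings.
PARSER_TYPES = {"tika", "tika-service"}


def parser_precedes_output(stages, output_name="DatafariSolrNoTika"):
    """Return True if at least one parser stage comes before the named output.

    stages: ordered list of (connection_name, connector_type) pairs describing
    what follows the input connector in the job pipeline.
    """
    parser_seen = False
    for name, connector_type in stages:
        if connector_type in PARSER_TYPES:
            parser_seen = True
        elif name == output_name:
            # The output is reached: valid only if a parser ran before it.
            return parser_seen
    raise ValueError(f"output connector {output_name!r} not found in pipeline")


if __name__ == "__main__":
    good = [("MyTika", "tika"), ("DatafariSolrNoTika", "output")]
    bad = [("DatafariSolrNoTika", "output")]
    print(parser_precedes_output(good))
    print(parser_precedes_output(bad))
```

The second pipeline corresponds to the misconfiguration described above, where unparsed documents reach the no-Tika handler.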