Content Comparison

Info: Valid from 4.0.1

The documentation below is valid from Datafari v4.0.1 upwards.

If you want to extract text from images and PDF files with OCR, you can use Tesseract with Tika.

...

Code Block
sudo apt-get install tesseract-ocr-all
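To confirm the installation, a quick check along the following lines can help. This is only a minimal sketch: the function name check_tesseract is illustrative (it is not part of Datafari), and it assumes the package above puts the tesseract binary on the PATH.

```shell
# Illustrative helper (not part of Datafari): report whether Tesseract is usable.
check_tesseract() {
    if command -v tesseract >/dev/null 2>&1; then
        tesseract --version        # print the installed version
        tesseract --list-langs     # list the installed language packs
    else
        echo "tesseract not found on PATH" >&2
        return 1
    fi
}

check_tesseract || echo "Install tesseract-ocr first"
```

If the language list does not contain the languages you need, the corresponding language packs are probably missing.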

With this configuration, you can already extract data from your images!

To do so, configure your jobs in MCF as usual and use the Tika included in Solr: choose the DatafariSolr output and do NOT add the TikaEmbedded transformation connector.

That's it: start the job and OCR will be applied.
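If a job returns no text for an image, it can be useful to try Tesseract on that image outside Datafari first. The sketch below is an assumption-laden example, not a Datafari tool: ocr_sample is a hypothetical helper name and sample.png is a placeholder for one of your own files.

```shell
# Illustrative helper: run Tesseract on one image and print the recognized text.
# Usage: ocr_sample <image-file> [language]   (language defaults to eng)
ocr_sample() {
    image=$1
    lang=${2:-eng}
    if [ ! -f "$image" ]; then
        echo "No such file: $image" >&2
        return 1
    fi
    # "stdout" as the output base makes Tesseract print the text instead of writing a file.
    tesseract "$image" stdout -l "$lang"
}

# Example (sample.png is a placeholder for one of your own files):
# ocr_sample sample.png eng
```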

Remarks:

If you want to extract data from PDF files, you need an additional step: customize the file PDFParser.properties, which is included in tika-parsers.jar.
Install Vim (it can open a jar archive and edit a file inside it in place):
Code Block
apt-get install vim
Stop Datafari:
Code Block
cd /opt/datafari/bin
bash stop-datafari.sh
Edit the jar:
Code Block
vim /opt/datafari/solr/solrcloud/FileShare/lib/extraction/tika-parsers*.jar
Select the file org/apache/tika/parser/pdf/PDFParser.properties and change the ocrStrategy line to:
Code Block
ocrStrategy ocr_and_text_extraction
Save your changes
Start Datafari:
Code Block
cd /opt/datafari/bin
bash start-datafari.sh
As before, just start your jobs in MCF; data will now also be extracted from PDF content.
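The same change can also be made without an interactive editor. The following is a minimal sketch under several assumptions: the zip and unzip commands and GNU sed are available, the ocrStrategy line already exists in the properties file, and set_ocr_strategy is a hypothetical helper name, not something shipped with Datafari. Copy the jar to a safe place outside the lib directory before trying it.

```shell
# Illustrative helper (not shipped with Datafari): set ocrStrategy inside a Tika parsers jar.
# Usage: set_ocr_strategy <path-to-tika-parsers.jar> <strategy>
set_ocr_strategy() {
    jar_path=$1
    strategy=$2
    entry=org/apache/tika/parser/pdf/PDFParser.properties
    # zip is run from a temporary directory, so resolve the jar to an absolute path first.
    jar_abs=$(cd "$(dirname "$jar_path")" && pwd)/$(basename "$jar_path") || return 1
    workdir=$(mktemp -d) || return 1
    # Extract only the properties file, keeping its path inside the archive.
    unzip -q -o "$jar_abs" "$entry" -d "$workdir" || { rm -rf "$workdir"; return 1; }
    # Rewrite the existing ocrStrategy line (assumes the key is already present).
    sed -i "s/^ocrStrategy[[:space:]=:].*/ocrStrategy $strategy/" "$workdir/$entry"
    # Put the modified file back into the jar at the same path.
    (cd "$workdir" && zip -q "$jar_abs" "$entry")
    status=$?
    rm -rf "$workdir"
    return "$status"
}

# Example, with Datafari stopped (the glob must match exactly one jar):
# set_ocr_strategy /opt/datafari/solr/solrcloud/FileShare/lib/extraction/tika-parsers*.jar ocr_and_text_extraction
```

As with the Vim procedure, Datafari should be stopped before the jar is modified and started again afterwards.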
Customize the Tesseract configuration
By default, Tesseract extracts text for English content only. If you want to add other languages, you need to edit the Tesseract configuration file org/apache/tika/parser/ocr/TesseractOCRConfig.properties, which is also located inside the tika-parsers jar.
Install Vim:
Code Block
apt-get install vim
Stop Datafari:
Code Block
cd /opt/datafari/bin
bash stop-datafari.sh
Edit the jar:
Code Block
vim /opt/datafari/solr/solrcloud/FileShare/lib/extraction/tika-parsers*.jar
Select the file org/apache/tika/parser/ocr/TesseractOCRConfig.properties and change the language line. For example, to add French, the configuration will be:
Code Block
language=eng+fra
Save your changes
Start Datafari:
Code Block
cd /opt/datafari/bin
bash start-datafari.sh
As before, just start your jobs in MCF; the data extraction will be done with the configured languages.
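Each code in the language value (eng and fra in the example above) needs a matching Tesseract language pack on the machine. The sketch below compares such a value with the output of tesseract --list-langs; missing_langs is a hypothetical helper name introduced here for illustration only.

```shell
# Illustrative helper: print the language codes from a "eng+fra" style value
# that do not appear in the given list of installed Tesseract languages.
# Usage: missing_langs <language-value> <output of "tesseract --list-langs">
missing_langs() {
    spec=$1
    installed=$2
    # Split the value on "+" and look for each code as a whole line in the list.
    for code in $(printf '%s' "$spec" | tr '+' ' '); do
        printf '%s\n' "$installed" | grep -qx "$code" || printf '%s\n' "$code"
    done
}

# Example (some Tesseract versions print the list on stderr, hence 2>&1):
# missing_langs "eng+fra" "$(tesseract --list-langs 2>&1)"
```

An empty result means every configured language is installed; any code that is printed still needs its language pack.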

Version	Old Version 5	New Version 6
Changes made by	Cedric	Cedric
Saved on	24 Jan, 2018	10 Aug, 2018
