

To extract text from images and scanned PDF files (OCR), you can use Tesseract together with Tika.

To make this work in Datafari, install Tesseract on your Datafari server and restart Datafari. You can use either Tika embedded in MCF (with the TikaOCR transformation connector) or Tika in Solr (use the DatafariSolr output as usual); in both cases your documents will be processed by Tesseract.

Installing Tesseract is straightforward. On a Debian/Ubuntu system, run the following command:

sudo apt-get install tesseract-ocr
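
To confirm that the installation worked, you can check that the binary is reachable and print its version. The snippet below is a minimal sketch; the helper name `check_tesseract` is purely illustrative and not part of Datafari or Tesseract.

```shell
# Report whether the tesseract binary is reachable and, if so, its version line.
check_tesseract() {
    if command -v tesseract >/dev/null 2>&1; then
        # Older Tesseract releases may print the version on stderr, so merge both streams.
        tesseract --version 2>&1 | head -n 1
    else
        echo "tesseract not found on PATH" >&2
        return 1
    fi
}
```

Calling `check_tesseract` after the install should print a single version line; a non-zero return status means the package is not installed or not on the PATH.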

Tesseract is now installed and ready to use. By default, however, it only handles English. To process other languages, install the corresponding language package (if available). Language packages are named: tesseract-ocr-[language_code]

The language code is three letters long. For example, the Tesseract package for French is: tesseract-ocr-fra
To install the French package, run the following command:

sudo apt-get install tesseract-ocr-fra

You can find the list of available Tesseract language packages here: https://packages.ubuntu.com/search?keywords=tesseract-ocr
You can also install all available languages with a single command:

sudo apt-get install tesseract-ocr-all
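
After installing language packages, it can help to check which languages Tesseract actually sees. The sketch below relies on `tesseract --list-langs`, which prints a header line followed by one language code per line; the helper name `has_lang` is hypothetical.

```shell
# Return 0 if the given three-letter language code is installed for Tesseract.
has_lang() {
    # Some Tesseract versions write the list to stderr, hence 2>&1.
    # grep -x only matches a whole line, so "fra" does not match "frak".
    tesseract --list-langs 2>&1 | grep -qx "$1"
}
```

For example, `has_lang fra && echo "French is available"` prints the message only when the French package is installed.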

With this configuration, you can already extract data from your images.

To do so, configure your jobs in MCF as usual and use the Tika included in Solr: choose the DatafariSolr output and do NOT add the TikaEmbedded transformation connector.

That's it: start the job and OCR will be applied.
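
Before launching a large job, it may be worth confirming on the server that Tesseract can read a sample image. The following is a minimal sketch, assuming a local test image of your own; the helper name `ocr_preview` and the file name in the usage note are hypothetical.

```shell
# Run OCR on one image and print the recognised text to the terminal.
# Usage: ocr_preview <image> [languages]   (languages defaults to eng)
ocr_preview() {
    image="$1"
    langs="${2:-eng}"
    if [ ! -f "$image" ]; then
        echo "no such file: $image" >&2
        return 1
    fi
    # "stdout" as the output base makes tesseract write the text to standard output.
    tesseract "$image" stdout -l "$langs"
}
```

For instance, `ocr_preview sample.png eng` should print the text found in `sample.png`; if nothing readable comes out here, the MCF job is unlikely to produce better results.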

Remarks:

  1. To extract data from PDF files, one additional step is needed: customize the file PDFParser.properties, which is included in the tika-parsers jar.
    Install Vim:

    apt-get install vim


    Stop Datafari:

    cd /opt/datafari/bin
    bash stop-datafari.sh


    Edit the jar:

    vim /opt/datafari/solr/solrcloud/FileShare/lib/extraction/tika-parsers*.jar


    Open the file org/apache/tika/parser/pdf/PDFParser.properties inside the jar and change the ocrStrategy line to:

    ocrStrategy ocr_and_text_extraction


    Save your changes.
    Start Datafari:

    cd /opt/datafari/bin
    bash start-datafari.sh


    As before, start your jobs in MCF; OCR will now also be applied to PDF content.


  2. Customize the Tesseract configuration
    By default, Tesseract extracts text for English content only. To add other languages (their language packages must be installed as described above), edit the Tesseract configuration file, located inside the tika-parsers jar at: org/apache/tika/parser/ocr/TesseractOCRConfig.properties
    Install Vim:

    apt-get install vim


    Stop Datafari:

    cd /opt/datafari/bin
    bash stop-datafari.sh


    Edit the jar:

    vim /opt/datafari/solr/solrcloud/FileShare/lib/extraction/tika-parsers*.jar


    Open the file org/apache/tika/parser/ocr/TesseractOCRConfig.properties inside the jar and change the language line. For example, to add French, the configuration becomes:

    language=eng+fra


    Save your changes.
    Start Datafari:

    cd /opt/datafari/bin
    bash start-datafari.sh


    As before, start your jobs in MCF; OCR will now use the configured languages.
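
Both remarks edit a properties file inside the jar with Vim. If you prefer a non-interactive approach, the sketch below does the same with `unzip`, `sed` and `zip`. It is only an illustration under several assumptions: `unzip` and `zip` are installed, GNU `sed` is available (for `-i`), Datafari is stopped, you have backed up the jar, and the jar path is absolute. The helper names `set_property_line` and `patch_jar_property` are hypothetical.

```shell
# Replace the line starting with <key> in a properties file by <new_line>.
set_property_line() {
    file="$1"; key="$2"; new_line="$3"
    # Require a space, tab, '=' or ':' after the key so that e.g. "ocrDPI" is untouched
    # when the key is "ocr". new_line must not contain '|' or '&'.
    sed -i "s|^${key}[[:space:]=:].*|${new_line}|" "$file"
}

# Extract one entry from a jar, rewrite one property line, and write the entry back.
# <jar> must be an absolute path because zip is run from a temporary directory.
patch_jar_property() {
    jar="$1"; entry="$2"; key="$3"; new_line="$4"
    tmp="$(mktemp -d)"
    unzip -o -q "$jar" "$entry" -d "$tmp" &&
        set_property_line "$tmp/$entry" "$key" "$new_line" &&
        (cd "$tmp" && zip -q "$jar" "$entry")
    status=$?
    rm -rf "$tmp"
    return $status
}

# Usage sketch (JAR is a placeholder for the real tika-parsers jar on your server):
#   JAR=/path/to/tika-parsers-<version>.jar
#   patch_jar_property "$JAR" org/apache/tika/parser/pdf/PDFParser.properties \
#       ocrStrategy "ocrStrategy ocr_and_text_extraction"
#   patch_jar_property "$JAR" org/apache/tika/parser/ocr/TesseractOCRConfig.properties \
#       language "language=eng+fra"
```

Afterwards, `unzip -p "$JAR" org/apache/tika/parser/pdf/PDFParser.properties | grep ocrStrategy` shows whether the change was written, without opening the jar in an editor.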
