Info

Deprecated since Datafari 5.4

The documentation below is deprecated from Datafari v5.4 onwards. Please refer to "OCR on Tika Configuration" instead.

...

Info

Valid from 4.0.1

The documentation below is valid from Datafari v4.0.1 onwards.

...

To do so, configure your jobs in MCF as usual and use the Tika included in Solr: choose the DatafariSolr output and do NOT choose the TikaEmbedded transformation connector.

That's it: start the job and OCR will be activated.

Remarks:

  1. If you want to extract text from PDF files, an additional step is required: customize the file PDFParser.properties, which is included in tika-parsers.jar.
    Install Vim:

    Code Block
    apt-get install vim

    Stop Datafari:

    Code Block
    cd /opt/datafari/bin
    bash stop-datafari.sh

    Edit the jar:

    Code Block
    vim /opt/datafari/solr/solrcloud/FileShare/lib/extraction/tika-parsers*.jar

    Choose the file org/apache/tika/parser/pdf/PDFParser.properties and change the ocrStrategy line to:

    Code Block
    ocrStrategy ocr_and_text_extraction

    Save your changes.
    Start Datafari:

    Code Block
    cd /opt/datafari/bin
    bash start-datafari.sh


    As before, just start your jobs in MCF; text will now be extracted from PDF content.

  2. Customize the Tesseract configuration
    By default, Tesseract extracts text from English content only. To add other languages, edit the Tesseract configuration file org/apache/tika/parser/ocr/TesseractOCRConfig.properties, which is packaged in the same tika-parsers jar.
    Install Vim:

    Code Block
    apt-get install vim

    Stop Datafari:

    Code Block
    cd /opt/datafari/bin
    bash stop-datafari.sh

    Edit the jar:

    Code Block
    vim /opt/datafari/solr/solrcloud/FileShare/lib/extraction/tika-parsers*.jar

    Choose the file org/apache/tika/parser/ocr/TesseractOCRConfig.properties and change the language line. For example, to add French, the configuration becomes:

    Code Block
    language=eng+fra

    Save your changes.
    Start Datafari:

    Code Block
    cd /opt/datafari/bin
    bash start-datafari.sh

    As before, just start your jobs in MCF; the extraction will now use the configured languages.
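
The two manual edits above can also be scripted instead of being done interactively in Vim. The following is only a minimal sketch, not an official Datafari procedure: the function names set_property and patch_jar_property are hypothetical, and it assumes that the zip and unzip commands are installed, that Datafari has been stopped first, and that the jar is given as an absolute path to a single file.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of Datafari or Tika).

# Replace the value of one key in a properties file.
# Usage: set_property FILE KEY SEPARATOR VALUE
# SEPARATOR is ' ' for PDFParser.properties and '=' for TesseractOCRConfig.properties.
set_property() {
    local file="$1" key="$2" sep="$3" value="$4"
    # Rewrite the line starting with KEY followed by a space or '='; keep all other lines.
    sed -E "s|^${key}[ =].*|${key}${sep}${value}|" "$file" > "${file}.tmp" &&
        mv "${file}.tmp" "$file"
}

# Extract one entry from a jar, change one property, and write the entry back.
# Usage: patch_jar_property JAR ENTRY KEY SEPARATOR VALUE
# Assumption: JAR is an absolute path, because the work is done in a temporary directory.
patch_jar_property() {
    local jar="$1" entry="$2" key="$3" sep="$4" value="$5"
    local workdir status
    workdir="$(mktemp -d)" || return 1
    (
        cd "$workdir" || exit 1
        unzip -q "$jar" "$entry" &&                 # extract only the properties file
            set_property "$entry" "$key" "$sep" "$value" &&
            zip -q "$jar" "$entry"                  # replace the entry inside the jar
    )
    status=$?
    rm -rf "$workdir"
    return "$status"
}

# Example calls (commented out; run them only with Datafari stopped, and
# replace /path/to/tika-parsers.jar with the absolute path of the jar on your system):
# patch_jar_property /path/to/tika-parsers.jar \
#     org/apache/tika/parser/pdf/PDFParser.properties ocrStrategy ' ' ocr_and_text_extraction
# patch_jar_property /path/to/tika-parsers.jar \
#     org/apache/tika/parser/ocr/TesseractOCRConfig.properties language '=' 'eng+fra'
```

Keeping a backup copy of the jar before patching it is a sensible precaution, since the archive is modified in place.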