OCR on Tika Configuration
Valid from Datafari 6
The documentation below is valid from Datafari v6.0 upwards, for both the CE and EE editions
By default a Tika server does not perform "OCRization" on incoming docs. To do so, you will need two things:
Install Tesseract
Configure the Tika Server
You can use Tika OCR in two ways:
Direct use, as described here: OCR takes place during the indexing phase.
Indirect use, by going through the mechanism described at https://datafari.atlassian.net/wiki/spaces/DATAFARI/pages/1832484865 : OCR takes place asynchronously.
1. Install Tesseract
Tesseract is an open source application that performs OCR on documents. Tika uses this tool to retrieve the text content of images.
To install it on a Debian/Ubuntu system, run the following command:
sudo apt-get install tesseract-ocr
Tesseract is now installed and ready to be used. By default, however, it only handles English. To process other languages, you need to install the corresponding language package (if available). Language packages follow this naming pattern: tesseract-ocr-[language_code]
The language code is made of three letters. For example, the Tesseract package for French is: tesseract-ocr-fra
So to install the French package, run the following command:
sudo apt-get install tesseract-ocr-fra
You can find the list of available language packages for Tesseract on the web or here: https://github.com/tesseract-ocr/tessdoc/blob/master/Data-Files-in-different-versions.md
You can also install all the available languages in a single command:
sudo apt-get install tesseract-ocr-all
At this point you can check that Tesseract works correctly with your desired languages by sending it a TIF image file containing text, using the following command:
tesseract /path/to/my-test-image.tif /path/to/my-output
No need to add a file extension for the output file as Tesseract automatically adds “.txt”
The “my-output.txt” file should contain the extracted text from your TIF file
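To confirm that every language you plan to OCR is actually installed, you can compare your list against the output of `tesseract --list-langs`. The sketch below assumes a POSIX shell. The `missing_langs` helper is a hypothetical name introduced for illustration, not part of Tesseract.

```shell
# missing_langs: reads the output of `tesseract --list-langs` on stdin and
# prints every requested language code that does not appear in it.
missing_langs() {
  installed=$(cat)
  for lang in "$@"; do
    # -x matches whole lines only, -F disables regex, -q keeps grep silent
    if ! printf '%s\n' "$installed" | grep -qxF "$lang"; then
      printf '%s\n' "$lang"
    fi
  done
}

# Usage (requires Tesseract): prints nothing when all languages are present.
# tesseract --list-langs 2>&1 | missing_langs eng fra
```

Any code printed by the helper corresponds to a language package that still needs to be installed.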
2. Configure the Tika Server
This section is optional if you followed the "Tika Server - Easy creation & configuration" procedure with the OCR option.
Once Tesseract is installed on the same machine as the Tika Server, Tika usually uses it automatically to parse image files. In the Tika server provided with Datafari, however, it is disabled by default.
To enable OCR in Tika Server, you need to configure the TesseractOCRParser and the PDFParser in the tika-config.xml file located in [DATAFARI_HOME]/tika-server/conf/. By default, the TesseractOCRParser is excluded from the parsers and its configuration is commented out to disable it.
So first you need to comment out its exclusion:
<parser class="org.apache.tika.parser.DefaultParser">
<!-- comment this parser-exclude tag to activate OCR parser -->
<!-- <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser" /> -->
</parser>
Then uncomment its configuration section to enable it:
<!-- Uncomment this part to activate OCR parser -->
<parser class="org.apache.tika.parser.ocr.TesseractOCRParser">
<params>
<param name="applyRotation" type="bool">false</param>
<param name="colorSpace" type="string">gray</param>
<param name="density" type="int">300</param>
<param name="depth" type="int">4</param>
<param name="enableImagePreprocessing" type="bool">false</param>
<param name="filter" type="string">triangle</param>
<param name="imageMagickPath" type="string"></param>
<param name="language" type="string">eng+fra</param>
<param name="maxFileSizeToOcr" type="long">2147483647</param>
<param name="minFileSizeToOcr" type="long">1000</param>
<param name="pageSegMode" type="string">1</param>
<param name="pageSeparator" type="string"></param>
<param name="preserveInterwordSpacing" type="bool">false</param>
<param name="resize" type="int">900</param>
<param name="skipOcr" type="bool">false</param>
<param name="tessdataPath" type="string"></param>
<param name="tesseractPath" type="string"></param>
<param name="timeoutSeconds" type="int">420</param>
</params>
</parser>
Once the section is uncommented, you can tweak the parser's parameters to fit your needs. Explaining each of them is beyond the scope of this documentation; a web search on the parameter name usually helps. In any case, you MUST pay close attention to at least the “timeoutSeconds” and “language” parameters.
The “timeoutSeconds” parameter is the maximum time allowed to perform OCR on a document before giving up. In our experience, most documents of fewer than 10 pages filled with images take a bit more than 10 minutes to process, so we set the default value to 1200 seconds (20 minutes) in order to cover most documents. You may adapt this parameter to your needs and experience, but keep in mind that the value you set here MUST represent the same duration as the taskTimeoutMillis parameter of the Tika server parameters section (described further in this documentation), which is expressed in milliseconds.
The “language” parameter lists the languages that the Tesseract parser may encounter and will try to recognize in documents. Any language that is not listed here will not be recognized, and OCR will fail or be inaccurate. The default value is “eng+fra”. To handle documents in other languages, add the corresponding language codes (the same codes as in the package names), separated by '+'.
For example, to handle English, French and German documents, set the “language” parameter like this:
<param name="language" type="string">eng+fra+deu</param>
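If you maintain the list of languages elsewhere (for example in a deployment script), the '+'-separated value can be generated rather than typed by hand. This is a minimal sketch, assuming a POSIX shell. `language_param` is a hypothetical helper name, not a Tika or Datafari command.

```shell
# language_param: joins the given Tesseract language codes with '+' and
# prints the matching <param> line for the TesseractOCRParser section.
language_param() {
  # Inside the subshell, "$*" joins the arguments with the first IFS character.
  joined=$(IFS=+; printf '%s' "$*")
  printf '<param name="language" type="string">%s</param>\n' "$joined"
}

language_param eng fra deu
```

Each code passed to the helper should match an installed tesseract-ocr-[language_code] package.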
Now, to perform OCR on images embedded in PDF files, you need to tell the Tika PDF parser to do so, because this is disabled by default (for resource consumption and processing time reasons). This is done in the PDFParser configuration section of the tika-config.xml file, which by default is:
<parser class="org.apache.tika.parser.pdf.PDFParser">
<params>
<!-- these are the defaults; you only need to specify the ones you want to modify -->
<param name="allowExtractionForAccessibility" type="bool">true</param>
<param name="averageCharTolerance" type="float">0.3</param>
<param name="catchIntermediateIOExceptions" type="bool">true</param>
<param name="checkExtractAccessPermission" type="bool">false</param>
<!-- whether or not to add processing to detect angles and extract text accordingly -->
<param name="detectAngles" type="bool">false</param>
<param name="dropThreshold" type="float">2.5</param>
<param name="enableAutoSpace" type="bool">true</param>
<param name="extractAcroFormContent" type="bool">true</param>
<param name="extractActions" type="bool">false</param>
<param name="extractAnnotationText" type="bool">true</param>
<param name="extractBookMarksText" type="bool">true</param>
<param name="extractFontNames" type="bool">false</param>
<param name="extractInlineImages" type="bool">false</param>
<param name="extractUniqueInlineImagesOnly" type="bool">true</param>
<param name="ifXFAExtractOnlyXFA" type="bool">false</param>
<!-- Use up to 500MB when loading a pdf into a PDDocument -->
<param name="maxMainMemoryBytes" type="long">524288000</param>
<!-- dots per inch for the ocr rendering of the page image -->
<param name="ocrDPI" type="int">300</param>
<!--if you request tif, make sure you have imageio jars on your classpath! -->
<param name="ocrImageFormatName" type="string">png</param>
<param name="ocrImageQuality" type="float">1.0</param>
<!-- options: argb, binary, gray, rgb -->
<param name="ocrImageType" type="string">gray</param>
<param name="ocrRenderingStrategy" type="string">ALL</param>
<!-- options: no_ocr, auto, ocr_only, ocr_and_text_extraction -->
<param name="ocrStrategy" type="string">no_ocr</param>
<param name="ocrStrategyAuto" type="string">better</param>
<!-- whether or not to set KCMS for faster (but legacy/unsupported) image rendering -->
<param name="setKCMS" type="bool">false</param>
<param name="sortByPosition" type="bool">false</param>
<param name="spacingTolerance" type="float">0.5</param>
<param name="suppressDuplicateOverlappingText" type="bool">false</param>
</params>
</parser>
As you can see, there is an 'ocrStrategy' parameter, which is set to 'no_ocr' by default. It accepts three other values:
- auto : try to extract text, but run OCR if fewer than 10 characters were extracted or if there are more than 10 characters with unmapped Unicode values. This is the most efficient mode, but you will lose the OCR text of pages that contain both images to OCR and more than 10 characters of regular text
- ocr_only : don't bother extracting text, just run OCR on each page. If a PDF contains only text and no images, this wastes a large amount of processing time, because the OCR process will run but will not return any text. Use this mode only if you are absolutely sure that you will only feed Tika with files containing nothing but images to OCR
- ocr_and_text_extraction : run both OCR and text extraction on each page. This mode ensures that everything is retrieved (regular text + OCR text), but it is also the most expensive mode in terms of resources and processing time
Our recommendation is to use auto.
If, for example, you want the PDF parser to return both the "OCRized" text and the "natural" text of PDF files, set the 'ocrStrategy' parameter like this:
<param name="ocrStrategy" type="string">ocr_and_text_extraction</param>
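The same change can be scripted, which helps when several Tika servers must share one configuration. The sketch below assumes a POSIX shell with `sed` and `mktemp`, and assumes the ocrStrategy param sits on a single line as in the default file. `set_ocr_strategy` is a hypothetical helper name.

```shell
# set_ocr_strategy FILE STRATEGY: rewrites the value of the ocrStrategy param
# in FILE after checking STRATEGY against the four documented options.
set_ocr_strategy() {
  file=$1
  strategy=$2
  case "$strategy" in
    no_ocr|auto|ocr_only|ocr_and_text_extraction) ;;
    *) echo "unknown ocrStrategy: $strategy" >&2; return 1 ;;
  esac
  # Write to a temporary file first so a failed sed leaves FILE untouched.
  tmp=$(mktemp) || return 1
  sed "s|\(<param name=\"ocrStrategy\" type=\"string\">\)[^<]*\(</param>\)|\1${strategy}\2|" "$file" > "$tmp" \
    && cat "$tmp" > "$file"
  status=$?
  rm -f "$tmp"
  return $status
}

# Usage: set_ocr_strategy /path/to/tika-config.xml auto
```

The closing quote after ocrStrategy in the pattern keeps the ocrStrategyAuto param untouched. Back up the file before running such a script, and restart the Tika server afterwards.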
NOTE: The PDFParser configuration contains an “extractInlineImages” parameter, which is used to change the OCR mode. There are two OCR modes:
1/ render each page and then run OCR on that rendered image: this is the default mode. Each PDF page is rendered as a single image and OCR is applied to that image. The advantage is that there is only one OCR run per page, so it consumes fewer resources and is potentially faster than the other mode. The drawback is that if several images sit next to each other on the page, the layout analysis of the OCR process may be wrong or may struggle with the page, resulting in poor accuracy and poor performance
2/ run OCR against inline images: this alternative mode is enabled by setting the “extractInlineImages” parameter to true. Instead of rendering each PDF page as a single image, the parser extracts the images as they are in the page and runs OCR on each of them. The advantage is that each image is analyzed on its own, so neighbouring images cannot disturb the analysis. The drawback is that if a page contains many images, performance will be much worse than in the other mode
The second mode, “run OCR against inline images”, MUST NOT BE ENABLED with Datafari when the option “extract archive content” is enabled on the Tika Server connector in a job. Otherwise Tika will process the PDF pages twice, resulting in DRASTIC JOB PERFORMANCE LOSSES (at least 3 times slower).
Our recommendation is to keep the default configuration, so the properties are set like this:
<param name="extractInlineImages" type="bool">false</param>
<param name="extractUniqueInlineImagesOnly" type="bool">true</param>
Last but not least, since OCR can take a very long time, you also need to tell the Tika server not to kill its tasks too soon. By default, a Tika task has a timeout of 120 seconds: if the text extraction of a file is not finished after 2 minutes, the task is killed. You can change this timeout with the taskTimeoutMillis parameter of the “server” section in the tika-config.xml file. Remember that this parameter, expressed in milliseconds, MUST correspond to the same duration as the “timeoutSeconds” parameter of the TesseractOCRParser section (mentioned earlier in this documentation).
Our recommendation is:
<taskTimeoutMillis>1200000</taskTimeoutMillis>
<param name="timeoutSeconds" type="int">1200</param>
In our experience, many files take more than 10 minutes to be OCR processed, so we strongly recommend a task timeout of at least 20 minutes, i.e. a value of 1200000.
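Because the two timeouts use different units, a mismatch is easy to introduce. The following sketch checks that taskTimeoutMillis equals timeoutSeconds multiplied by 1000. It assumes a POSIX shell, that both elements are uncommented, and that each appears once on its own line in the file. `check_timeouts` is a hypothetical helper name.

```shell
# check_timeouts FILE: succeeds when taskTimeoutMillis equals
# timeoutSeconds * 1000 in the given tika-config.xml.
check_timeouts() {
  file=$1
  secs=$(sed -n 's|.*<param name="timeoutSeconds" type="int">\([0-9][0-9]*\)</param>.*|\1|p' "$file" | head -n 1)
  millis=$(sed -n 's|.*<taskTimeoutMillis>\([0-9][0-9]*\)</taskTimeoutMillis>.*|\1|p' "$file" | head -n 1)
  if [ -z "$secs" ] || [ -z "$millis" ]; then
    echo "timeoutSeconds or taskTimeoutMillis not found in $file" >&2
    return 2
  fi
  if [ "$millis" -eq $((secs * 1000)) ]; then
    echo "OK: ${secs}s matches ${millis}ms"
  else
    echo "MISMATCH: timeoutSeconds=${secs}s, taskTimeoutMillis=${millis}ms" >&2
    return 1
  fi
}

# Usage: check_timeouts /path/to/tika-config.xml
```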
Once the above steps are done, don’t forget to restart the Tika server. You can then test the OCR process by sending Tika a PDF file that contains images to OCR, using the following command:
curl -T /path/to/my-pdf-file.pdf http://localhost:9998/rmeta/text > /path/to/extract-result.json
Once processed, you should retrieve the extracted text in the “X-TIKA:content” parameter of the ‘extract-result.json’ file.
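As a quick sanity check before opening the file, you can count how many entries of the result carry an “X-TIKA:content” field. This is a rough sketch assuming a POSIX shell and a `grep` that supports `-o`. `count_tika_content` is a hypothetical helper name, and the check only confirms that the field is present, not that the OCR text is accurate.

```shell
# count_tika_content FILE: prints how many "X-TIKA:content" keys FILE contains.
count_tika_content() {
  # grep -o prints one line per match; tr strips the padding some wc add.
  grep -o '"X-TIKA:content"' "$1" | wc -l | tr -d ' '
}

# Usage: count_tika_content /path/to/extract-result.json
```

A count of 0 suggests that the extraction returned no text at all, in which case the Tika server logs are the next place to look.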
3. Configure the MCF job(s)
Go to the MCF admin UI (Data Crawlers Expert Mode) and:
Edit the TikaServerRMetaConnector transformation connector in the “List Transformation Connections” menu and set the “Max connections” parameter to “1” in the “Throttling” tab. This limits the number of documents simultaneously sent to the Tika OCR server to 1 and thus guarantees the stability of the OCR process.
You also need to change the ‘Connection timeout’ and ‘Socket timeout’ parameters in the ‘Tika Server’ tab, setting them to the timeout value you configured in the previous section plus 5 seconds.
Edit the repository connector you are using in your job in the “List Repository Connections” menu and set the “Max connections” parameter to “1” in the “Throttling” tab.
Add a filter to your job in order to limit the files that will be crawled to those that need to be OCR processed. If you don’t know or understand how to do that, please contact Francelabs.
4. Enjoy !
Once you have completed the previous sections, you can run your Tika Server and enjoy the power of OCR!