Tesseract is an open-source application that performs OCR on documents. Tika uses it to extract text from images. To install it on a Debian/Ubuntu system, run the following command:
sudo apt-get install tesseract-ocr
Congrats! Tesseract is now installed and ready to use. By default, however, it only handles English. To process other languages, install the corresponding language package (if available). Language packages are named after the following pattern:
tesseract-ocr-[language_code]
The language code has three letters. For example, the Tesseract package for French is:
tesseract-ocr-fra
To install the French package, run the same apt-get install command as above with this package name.
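As a minimal sketch of the naming pattern above, the helper below builds a package name from a three-letter language code. The function name lang_package is ours and purely illustrative; the commented lines show the install command it leads to for French.

```shell
#!/bin/sh
# Hypothetical helper (not part of Tesseract or Datafari): build the
# Debian/Ubuntu package name for a Tesseract language from its
# three-letter code (e.g. fra).
lang_package() {
    printf 'tesseract-ocr-%s\n' "$1"
}

# Install the French language package (requires root privileges):
#   sudo apt-get install "$(lang_package fra)"
# which is equivalent to:
#   sudo apt-get install tesseract-ocr-fra
```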
At this point you can check that Tesseract works with your languages by running it on a TIF image that contains text, giving it the image, an output file name and the language to use.
There is no need to add a file extension to the output file name, as Tesseract automatically appends ".txt". With "my-output" as the output name, the "my-output.txt" file should contain the text extracted from your TIF file.
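The check described above can be sketched as a small wrapper around the tesseract command line (image, output base name, then -l for the language). The function name ocr_check and the file name my-image.tif are assumptions made for the example.

```shell
#!/bin/sh
# Run Tesseract on an image and print the recognised text.
#   $1: input image
#   $2: output base name, without extension (Tesseract appends ".txt")
#   $3: language code, e.g. fra
ocr_check() {
    tesseract "$1" "$2" -l "$3" && cat "$2.txt"
}

# Example: OCR a French TIF image; the text is written to my-output.txt
#   ocr_check my-image.tif my-output fra
```

If the language package for the requested code is not installed, Tesseract reports an error instead of producing the output file.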
When Tesseract is installed on the same machine as the Tika server, Tika normally uses it automatically to parse image files. In the Tika server provided with Datafari, however, OCR is disabled by default.
To enable OCR in the Tika server, configure the TesseractOCRParser and the PDFParser in the tika-config.xml file located in [DATAFARI_HOME]/tika-server/conf/. By default, the TesseractOCRParser is excluded from the parsers and its configuration section is commented out, which disables it.
First, comment out the exclusion of the TesseractOCRParser.
Then uncomment its configuration section to enable it.
Once the section is uncommented, you can tune the parameters of the parser to fit your needs (explaining them is beyond the scope of this documentation: Google is your friend!). In any case, YOU MUST pay close attention to at least the "timeoutSeconds" and "language" parameters.
The "timeoutSeconds" parameter is the maximum time allowed to perform OCR on a document before giving up. In our experience, most documents of fewer than 10 pages filled with images take a bit more than 10 minutes to process, so we set the default value to 1200 seconds (20 minutes) to cover most documents. You may adapt this parameter to your needs and experience, but keep in mind that the value you set here MUST REPRESENT THE SAME DURATION as the taskTimeoutMillis parameter of the Tika server parameters section (described further in this documentation), which is expressed in milliseconds.
The "language" parameter lists the languages that the Tesseract parser may encounter and will try to recognize in documents. Any language that is not listed here will not be recognized, and the OCR will fail or be inaccurate! The parameter is set to "eng+fra" by default. To process documents in other languages, add their language codes (the same codes as in the package names), separated by a '+'. For example, to process English, French and German documents, set the "language" parameter to "eng+fra+deu".
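To illustrate the '+' separated format of the "language" value, here is a minimal sketch that joins language codes. The function name join_langs is hypothetical; the resulting string is what goes into the "language" parameter.

```shell
#!/bin/sh
# Hypothetical helper: join Tesseract language codes with '+', the
# separator expected by the "language" parameter.
join_langs() {
    (
        # With IFS set to '+', "$*" joins the arguments with '+'.
        IFS='+'
        printf '%s\n' "$*"
    )
}

# Example: English, French and German
#   join_langs eng fra deu    # prints eng+fra+deu
```

Each code listed should have its language package installed, otherwise Tesseract cannot use it.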
To run OCR on images embedded in PDF files, you need to tell Tika's PDF parser to do so, because it is disabled by default (to limit resource consumption and processing time). This is done in the PDFParser configuration section of the tika-config.xml file.
This section has an 'ocrStrategy' parameter, which is set to 'no_ocr' by default. Three other values are available:
- auto: try to extract text, but run OCR if fewer than 10 characters were extracted or if there are more than 10 characters with unmapped Unicode values. This is the most efficient mode, but you will lose the OCR text of pages that contain both images to OCR and more than 10 characters of text.
- ocr_only: skip text extraction and just run OCR on each page. If a PDF contains only text and no images, a lot of processing time is wasted, because the OCR process takes time but returns no text. Use this mode only if you are absolutely sure that Tika will only receive files that consist solely of images to OCR.
- ocr_and_text_extraction: run both OCR and text extraction on each page. This mode ensures that everything is retrieved (text + OCR-processed text), but it is also the most expensive mode in terms of resources and processing time.
For example, to make the PDF parser return both the "OCRized" text and the "natural" text of PDF files, set the 'ocrStrategy' parameter to 'ocr_and_text_extraction'.
NOTE: The PDFParser configuration has an "extractInlineImages" parameter, which changes the OCR mode. There are two OCR modes:
1/ Render each page and then run OCR on the rendered image: this is the default mode. Each PDF page is rendered as a single image and the OCR process is applied to that image. The advantage is that there is only one OCR pass per page, so it consumes fewer resources and is potentially faster than the other mode. The drawback is that if several images sit next to each other on the page, the semantic analysis of the OCR process may be wrong or struggle with the page, resulting in poor accuracy and poor performance.
2/ Run OCR against inline images: this alternative mode is enabled by setting the "extractInlineImages" parameter to true. Instead of rendering each PDF page as a single image, the parser extracts the images as they are in the page and runs OCR on each of them. The advantage is that each image is analyzed on its own, so neighboring images cannot disturb the semantic analysis. The drawback is that if a page contains many images, performance will be much worse than in the other mode.
The second mode ("run OCR against inline images") MUST NOT BE ENABLED with Datafari when the "extract archive content" option is enabled on the Tika Server connector in a job. Otherwise Tika will process the PDF pages twice, resulting in DRASTIC JOB PERFORMANCE LOSSES (at least 3 times slower)!
And LAST BUT NOT LEAST, as the OCR process can take a very long time, you also need to tell the Tika server not to kill its task too soon. By default, a Tika task has a timeout of 120 seconds: if the text extraction of a file is not finished after 2 minutes, the task is killed. You can change this timeout with the taskTimeoutMillis parameter of the "server" section of the tika-config.xml file. Remember that this parameter, expressed in milliseconds, MUST REPRESENT THE SAME DURATION as the "timeoutSeconds" parameter of the TesseractOCRParser section (described earlier in this documentation).
In our experience with OCR, many files take more than 10 minutes to process, so we strongly recommend a task timeout of at least 20 minutes, that is, a value of 1200000!
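Since one timeout is expressed in seconds and the other in milliseconds, a quick conversion helps keep the two values aligned. The sketch below is only an illustration; the function name seconds_to_millis is hypothetical.

```shell
#!/bin/sh
# Convert a timeout in seconds (as used by timeoutSeconds) into
# milliseconds (as used by taskTimeoutMillis).
seconds_to_millis() {
    echo $(( $1 * 1000 ))
}

# Example: the recommended 20 minutes
#   seconds_to_millis 1200    # prints 1200000
```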
Once the above instructions are done, restart the Tika server. You can then test the OCR process by sending Tika a PDF file that contains images to OCR and saving the JSON response to a file such as "extract-result.json".
Once the file is processed, the extracted text is available in the "X-TIKA:content" field of the "extract-result.json" file.
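The original command is not reproduced here, so the following is only a sketch of one way to run this test. It assumes the Tika server's /rmeta/text endpoint (whose JSON output carries the "X-TIKA:content" key) and a server reachable at http://localhost:9998, which is Tika server's usual default port; adapt the URL to your installation. The function name tika_extract and the file name my-file.pdf are hypothetical.

```shell
#!/bin/sh
# Send a file to the Tika server and save the JSON response.
#   $1: file to parse
#   $2: base URL of the Tika server (assumption: http://localhost:9998)
#   $3: output JSON file
tika_extract() {
    # -T uploads the file with an HTTP PUT; /rmeta/text returns the
    # metadata and the plain-text content as JSON.
    curl -s -T "$1" "$2/rmeta/text" -H 'Accept: application/json' -o "$3"
}

# Example:
#   tika_extract my-file.pdf http://localhost:9998 extract-result.json
```

With OCR enabled the request can take several minutes, so it is normal for the command to stay silent for a while.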
3. Configure the MCF job(s)
Go to the MCF admin UI (Data Crawlers Expert Mode) and:
Edit the TikaServerRMetaConnector transformation connector in the "List Transformation Connections" menu and set the "Max connections" parameter to "1" in the "Throttling" tab. This limits the number of documents sent simultaneously to the Tika OCR server to 1 and thus guarantees the stability of the OCR process. You also need to set the "Connection timeout" and "Socket timeout" parameters in the "Tika Server" tab to the timeout value you set in the previous section plus 5 seconds.
Edit the repository connector you are using in your job in the "List Repository Connections" menu and set the "Max connections" parameter to "1" in the "Throttling" tab.
Add a filter to your job to restrict the crawled files to those that need OCR. If you do not know how to do that, please contact Francelabs.
4. Enjoy!
Once you have completed the previous sections, you can run your Tika server and enjoy the power of OCR!
By default, a Tika server does not perform OCR on incoming documents. To enable it, you need to do two things:
Install Tesseract
Configure the Tika Server
Install Tesseract
Tesseract is an open-source application that performs OCR on documents. Tika uses it to extract text from images. To install it on a Debian/Ubuntu system, run the following command:
sudo apt-get install tesseract-ocr
Congrats! Tesseract is now installed and ready to use. By default, however, it only handles English. To process other languages, install the corresponding language package (if available). Language packages are named after the following pattern:
tesseract-ocr-[language_code]
The language code has three letters. For example, the Tesseract package for French is:
tesseract-ocr-fra
To install the French package, run the following command:
sudo apt-get install tesseract-ocr-fra
At this point you can check that Tesseract works with your languages by running it on a TIF image that contains text, giving it the image, an output file name and the language to use.
There is no need to add a file extension to the output file name, as Tesseract automatically appends ".txt". With "my-output" as the output name, the "my-output.txt" file should contain the text extracted from your TIF file.
Configure the Tika Server
When Tesseract is installed on the same machine as the Tika server, Tika normally uses it automatically to parse image files. In the Tika server provided with Datafari, however, we created a parameter to enable or disable the OCR parser, and it is disabled by default. This parameter is named "DO_OCR" and is located in the file /opt/datafari/tika-server/bin/set-tika-env.sh. You must set it to "true" to perform OCR with the Tika server.
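The change can be made in any text editor. As a sketch of a scripted alternative, the helper below rewrites a variable in a shell environment file. It assumes the variable is defined on a line starting with "DO_OCR=", which may not match the exact layout of your set-tika-env.sh; the function name set_env_var is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: set NAME=VALUE in a shell environment file.
# Assumption: the variable is defined on a line starting with "NAME=".
#   $1: file, $2: variable name, $3: new value
set_env_var() {
    file=$1 name=$2 value=$3
    tmp=$(mktemp) || return 1
    # Rewrite the matching line into a temporary copy, then write the
    # result back so the original file keeps its permissions.
    sed "s|^$name=.*|$name=$value|" "$file" > "$tmp" && cat "$tmp" > "$file"
    status=$?
    rm -f "$tmp"
    return $status
}

# Example:
#   set_env_var /opt/datafari/tika-server/bin/set-tika-env.sh DO_OCR true
```

Check the file afterwards to confirm that the line was actually changed, since a line that does not match the assumed pattern is left untouched.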
The Tika server will now use Tesseract, but by default it only uses English to OCR files, so you need to configure the Tika server to use other or additional languages. To make this easy, the Tika server provided with Datafari exposes its configuration files. Here are the locations of the two configuration files, with a quick description:
- /opt/datafari/tika-server/conf/ocr/org/apache/tika/parser/ocr/TesseractOCRConfig.properties: the configuration file of the main OCR parser, which parses image files.
- /opt/datafari/tika-server/conf/ocr/org/apache/tika/parser/pdf/PDFParser.properties: the configuration file of the PDF parser. To run OCR on images embedded in a PDF file, you need to tell Tika's PDF parser to do so, because it is disabled by default (to limit resource consumption and processing time).
Note: if you configure a Tika server independently of Datafari on a different machine, we strongly recommend copying the tika-server folder of Datafari to benefit from the simplified configuration options. Otherwise you will have to modify the two files listed above directly inside the tika-server.jar file, which is less convenient. In the jar file, the two files are located in /ocr/org/apache/tika/parser/ocr/TesseractOCRConfig.properties and /ocr/org/apache/tika/parser/pdf/PDFParser.properties.
First, edit the TesseractOCRConfig.properties file. It contains a property named "language", which is set to "eng+fra". To process documents in other languages, add their language codes (the same three-letter codes as in the package names), separated by a '+'. For example, to process English, French and German documents, set the "language" property to "eng+fra+deu".
In the TesseractOCRConfig.properties file you can also configure other settings, but explaining them is beyond the scope of this documentation: Google is your friend!
Then edit the PDFParser.properties file. It has an 'ocrStrategy' parameter, which is set to 'no_ocr' by default. Three other values are available:
- auto: try to extract text, but run OCR if fewer than 10 characters were extracted or if there are more than 10 characters with unmapped Unicode values. This is the most efficient mode, but you will lose the OCR text of pages that contain both images to OCR and more than 10 characters of text.
- ocr_only: skip text extraction and just run OCR on each page. If a PDF contains only text and no images, a lot of processing time is wasted, because the OCR process takes time but returns no text. Use this mode only if you are absolutely sure that Tika will only receive files that consist solely of images to OCR.
- ocr_and_text_extraction: run both OCR and text extraction on each page. This mode ensures that everything is retrieved (text + OCR-processed text), but it is also the most expensive mode in terms of resources and processing time.
For example, to make the PDF parser return both the "OCRized" text and the "natural" text of PDF files, set the 'ocrStrategy' parameter to 'ocr_and_text_extraction'.
NOTE: The PDFParser.properties file has an "extractInlineImages" parameter, which changes the OCR mode. There are two OCR modes:
1/ Render each page and then run OCR on the rendered image: this is the default mode. Each PDF page is rendered as a single image and the OCR process is applied to that image. The advantage is that there is only one OCR pass per page, so it consumes fewer resources and is potentially faster than the other mode. The drawback is that if several images sit next to each other on the page, the semantic analysis of the OCR process may be wrong or struggle with the page, resulting in poor accuracy and poor performance.
2/ Run OCR against inline images: this alternative mode is enabled by setting the "extractInlineImages" parameter to true. Instead of rendering each PDF page as a single image, the parser extracts the images as they are in the page and runs OCR on each of them. The advantage is that each image is analyzed on its own, so neighboring images cannot disturb the semantic analysis. The drawback is that if a page contains many images, performance will be much worse than in the other mode.
The second mode ("run OCR against inline images") MUST NOT BE ENABLED with Datafari when the "extract archive content" option is enabled on the Tika Server connector in a job. Otherwise Tika will process the PDF pages twice, resulting in DRASTIC JOB PERFORMANCE LOSSES (at least 3 times slower)!
Last but not least, as the OCR process can take a very long time, you also need to tell the Tika server not to kill its task too soon. By default, a Tika task has a timeout of 120 seconds: if the text extraction of a file is not finished after 2 minutes, the task is killed. You can change this timeout with the TIKA_SPAWN_TASK_TIMEOUT parameter in the /opt/datafari/tika-server/bin/set-tika-env.sh file.
In our experience with OCR, many files take more than 10 minutes to process, so we strongly recommend a task timeout of at least 20 minutes!
Once the above instructions are done, restart the Tika server. You can then test the OCR process by sending Tika a PDF file that contains images to OCR and saving the response as "result-file.zip".
Once the file is processed, the extracted text is available in the "__TEXT__" file inside the "result-file.zip" archive.
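The original command is not reproduced here, so the following is only a sketch of one way to run this test. It assumes the Tika server's /unpack/all endpoint, which returns an archive containing a "__TEXT__" entry, and a server reachable at http://localhost:9998 (Tika server's usual default port). The function name tika_unpack and the file name my-file.pdf are hypothetical.

```shell
#!/bin/sh
# Send a file to the Tika server and save the returned archive.
#   $1: file to parse
#   $2: base URL of the Tika server (assumption: http://localhost:9998)
#   $3: output archive
tika_unpack() {
    # -T uploads the file with an HTTP PUT; /unpack/all returns the
    # extracted text, the metadata and the embedded files as an archive.
    curl -s -T "$1" "$2/unpack/all" -o "$3"
}

# Example:
#   tika_unpack my-file.pdf http://localhost:9998 result-file.zip
#   unzip -p result-file.zip __TEXT__    # print the extracted text
```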
Enjoy!
Once you have completed the two previous sections, you can run your Tika server and enjoy the power of OCR!
By default, a Tika server does not perform OCR on incoming documents. To enable it, you need to do two things:
Install Tesseract
Configure the Tika Server
Install Tesseract
Tesseract is an open-source application that performs OCR on documents. Tika uses it to extract text from images. To install it on a Debian/Ubuntu system, run the following command:
sudo apt-get install tesseract-ocr
Congrats! Tesseract is now installed and ready to use. By default, however, it only handles English. To process other languages, install the corresponding language package (if available). Language packages are named after the following pattern:
tesseract-ocr-[language_code]
The language code has three letters. For example, the Tesseract package for French is:
tesseract-ocr-fra
To install the French package, run the following command:
sudo apt-get install tesseract-ocr-fra
Configure the Tika Server
Once Tesseract is installed on the same machine as the Tika server, Tika automatically uses it to parse image files. By default, however, it only uses English to OCR these files, so you need to configure the Tika server to use other or additional languages. To do so, open the Tika jar file and edit the following file inside it:
/org/apache/tika/parser/ocr/TesseractOCRConfig.properties
This file contains a property named "language", which is set to "eng". To process documents in other languages, add their language codes (the same three-letter codes as in the package names), separated by a '+'. For example, to process English and French documents, set the "language" property to "eng+fra".
Make your choice and then save the file inside the jar.
In the TesseractOCRConfig.properties file you can also configure other settings, but explaining them is beyond the scope of this documentation: Google is your friend!
Then you need to configure the PDF parser of Tika, because by default it does not use Tesseract to process the images of PDF files. Once again, the configuration file is located inside the Tika jar file:
/org/apache/tika/parser/pdf/PDFParser.properties
This file has an 'ocrStrategy' parameter, which is set to 'no_ocr' by default. Three other values are available:
- auto: try to extract text, but run OCR if fewer than 10 characters were extracted or if there are more than 10 characters with unmapped Unicode values. This is the most efficient mode, but you will lose the OCR text of pages that contain both images to OCR and more than 10 characters of text.
- ocr_only: skip text extraction and just run OCR on each page. If a PDF contains only text and no images, a lot of processing time is wasted, because the OCR process takes time but returns no text. Use this mode only if you are absolutely sure that Tika will only receive files that consist solely of images to OCR.
- ocr_and_text_extraction: run both OCR and text extraction on each page. This mode ensures that everything is retrieved (text + OCR-processed text), but it is also the most expensive mode in terms of resources and processing time.
For example, to make the PDF parser return both the "OCRized" text and the "natural" text of PDF files, set the 'ocrStrategy' parameter to 'ocr_and_text_extraction'.
Set the 'ocrStrategy' parameter to the desired option, then save the file inside the jar.
NOTE: The PDFParser.properties file has an "extractInlineImages" parameter, which changes the OCR mode. There are two OCR modes:
1/ Render each page and then run OCR on the rendered image: this is the default mode. Each PDF page is rendered as a single image and the OCR process is applied to that image. The advantage is that there is only one OCR pass per page, so it consumes fewer resources and is potentially faster than the other mode. The drawback is that if several images sit next to each other on the page, the semantic analysis of the OCR process may be wrong or struggle with the page, resulting in poor accuracy and poor performance.
2/ Run OCR against inline images: this alternative mode is enabled by setting the "extractInlineImages" parameter to true. Instead of rendering each PDF page as a single image, the parser extracts the images as they are in the page and runs OCR on each of them. The advantage is that each image is analyzed on its own, so neighboring images cannot disturb the semantic analysis. The drawback is that if a page contains many images, performance will be much worse than in the other mode.
The second mode ("run OCR against inline images") MUST NOT BE ENABLED with Datafari when the "extract archive content" option is enabled on the Tika Server connector in a job. Otherwise Tika will process the PDF pages twice, resulting in DRASTIC JOB PERFORMANCE LOSSES (at least 3 times slower)!
Last but not least, as the OCR process can take a very long time, you also need to tell the Tika server not to kill its task too soon. By default, a Tika task has a timeout of 120 seconds: if the text extraction of a file is not finished after 2 minutes, the task is killed. You can change this timeout with the TIKA_SPAWN_TASK_TIMEOUT parameter in the /opt/datafari/tika-server/bin/set-tika-env.sh file.
In our experience with OCR, many files take more than 10 minutes to process, so we strongly recommend a task timeout of at least 20 minutes!
Once you have completed these two tasks, you can run your Tika server and enjoy the power of OCR!