OCR on simplified jobs

Valid from version 5.4

This documentation is valid from Datafari 5.4 upwards

Datafari 5.4 adds a new option to the simplified jobs, the “Create a side OCR job” checkbox:

This feature creates a clone of the simplified job currently being declared, configured to perform OCR on PDF and image files. The clone uses a dedicated Tika Server connector that sends its requests to a Tika Server with OCR enabled. This means that you need to set up a Tika Server (you can also do it via Tika Server - Easy creation & configuration) and configure it to perform OCR.

When you check the checkbox, 3 additional parameters appear and MUST be set:

Tika server OCR Host: The hostname of the Tika server configured to perform OCR
Tika server OCR Port: The port of the Tika server configured to perform OCR
Tika server OCR Name: The name to give to the Tika server connector that will be created with the host and port provided above

Note that if you set a Tika server OCR Name that already exists in MCF, the connector with that name is used and no new one is created. In that case, the Host and Port are ignored.
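Before entering the host and port, it can be worth confirming that the Tika Server answers and actually returns OCR text. The following is a minimal sketch, not part of Datafari: it assumes a standard Apache Tika Server exposing its usual `GET /version` and `PUT /tika` REST endpoints, and the helper names (`tika_url`, `tika_version`, `extract_text`) are hypothetical.

```python
import urllib.error
import urllib.request


def tika_url(host: str, port: int, path: str = "version") -> str:
    """Build the URL of a Tika Server endpoint from the host and port."""
    return f"http://{host}:{port}/{path.lstrip('/')}"


def tika_version(host: str, port: int, timeout: float = 5.0):
    """Return the version string reported by the server, or None if unreachable."""
    try:
        with urllib.request.urlopen(tika_url(host, port), timeout=timeout) as resp:
            return resp.read().decode("utf-8", errors="replace").strip()
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, HTTP error, etc.
        return None


def extract_text(host: str, port: int, file_path: str, timeout: float = 120.0) -> str:
    """Send a file to PUT /tika and return the extracted plain text.

    With OCR enabled on the server, a scanned PDF or an image should
    come back with non-empty text. OCR can be slow, hence the long timeout.
    """
    with open(file_path, "rb") as f:
        data = f.read()
    request = urllib.request.Request(
        tika_url(host, port, "tika"),
        data=data,
        method="PUT",
        headers={"Accept": "text/plain"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

For example, `tika_version("localhost", 9998)` returning `None` suggests that the host or port is wrong or that the server is down, and `extract_text("localhost", 9998, "scan.png")` returning an empty string for a scanned image suggests that OCR is not enabled. The host, port, and file name here are placeholders.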

After you save a simplified job with the side OCR job enabled, the list of jobs shows two jobs: one with the name you entered as sourcename in the simplified UI, and one with the same name but containing the “OCR” prefix:

If you edit the OCR job, the connection tab shows what differs from the other created job: the default TikaRmetaConnector has been replaced by a TikaRmetaConnector with the name you specified in the “Tika server OCR Name” parameter, and it is preceded by a DocFilter connector:

In the “Doc Filter” tab, you will see that some default include filters have been applied to keep only PDF and image files:

Those filters are applied by default on the document id, which is most often the document URI. However, depending on the type of repository connector you are using, the document id may not be the URI of the document, or it may not contain the extension of the document. In both cases the filters cannot be applied. Check the repository connector documentation and/or contact France Labs to know whether the document id corresponds to the document URI and whether the regex filters of the DocFilter connector can match.

If the filters cannot be applied, you will need either to point the filters at another document metadata (if available, a metadata containing the extension of the document), or to modify the regex filters, or both.
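To illustrate why the document id matters, here is a minimal sketch of extension-based include filtering. The patterns below are hypothetical examples written for this illustration, not the actual defaults set by Datafari, and the function name `is_included` is an assumption as well.

```python
import re

# Hypothetical include patterns: keep PDF files and common image formats.
# These are NOT the real DocFilter defaults, only an illustration.
INCLUDE_PATTERNS = [
    re.compile(r"\.pdf$", re.IGNORECASE),
    re.compile(r"\.(png|jpe?g|tiff?|bmp|gif)$", re.IGNORECASE),
]


def is_included(value: str, patterns=INCLUDE_PATTERNS) -> bool:
    """Return True if the value (document id or another metadata) matches a pattern."""
    return any(pattern.search(value) for pattern in patterns)


# A document id that is a URI ending with the file extension matches.
print(is_included("file://///server/share/scans/invoice.PDF"))    # True
# A document id without any extension never matches, even for a PDF.
print(is_included("https://intranet.example/docs/view?id=1234"))  # False
# Falling back to a metadata that carries the file name fixes this.
print(is_included("invoice.pdf"))                                 # True
```

The second case is the one described above: the document may well be a PDF, but since its id carries no extension, an extension-based regex cannot select it.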

Sometimes, the repository connector itself provides a way to filter incoming documents. This is the case with the WinShare connector, for example. In that case, it is advised to use the filters of the repository connector to keep only PDF and image files, and then to remove the DocFilter connector from the pipeline for better efficiency.

Once you are satisfied with the configuration of the OCR job, make sure that the crawling time window of your OCR job occurs AFTER the crawling time window of your corresponding non-OCR job. Otherwise your OCR-extracted text will be deleted by the non-OCR job crawl.

Note also that if you run two jobs at the same time on an MCF node, the two jobs will interfere with each other, because MCF only has one processing queue for documents. MCF will randomly queue documents to process from the standard job and from the OCR job, resulting in longer processing time for both jobs. More importantly, some documents may be processed by the OCR job BEFORE the standard job, and in that case THE OCR WILL BE LOST, because the last version of the document to be indexed will be the one without OCR!

The recommended method is to create the OCR job on a dedicated MCF node, while still making sure that its time window is AFTER the time window of its corresponding non-OCR job (which is on another MCF node).
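The ordering constraint on the time windows can be checked with a few lines of code. This is a minimal sketch under simplifying assumptions: both windows fall within the same day and do not cross midnight, and the `Window` type and function names are hypothetical helpers, not part of MCF or Datafari.

```python
from datetime import time
from typing import NamedTuple


class Window(NamedTuple):
    """A crawling time window within a single day (no midnight crossing)."""
    start: time
    end: time


def overlaps(a: Window, b: Window) -> bool:
    """Return True if the two windows share any period of time."""
    return a.start < b.end and b.start < a.end


def ocr_runs_after(standard: Window, ocr: Window) -> bool:
    """Return True if the OCR window starts once the standard window has ended."""
    return ocr.start >= standard.end


standard = Window(time(20, 0), time(23, 0))
good_ocr = Window(time(23, 0), time(23, 59))
bad_ocr = Window(time(22, 0), time(23, 30))

print(ocr_runs_after(standard, good_ocr), overlaps(standard, good_ocr))  # True False
print(ocr_runs_after(standard, bad_ocr), overlaps(standard, bad_ocr))    # False True
```

In the second case the OCR job starts while the standard job may still be running, which is the situation where a document can be indexed last by the standard job and lose its OCR text.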