...

Datafari 6 introduces a new admin UI tool that makes it easy to create and configure a Tika Server. The tool is accessible through the ‘Expert Menu → Create a Tika Server’ side menu tab of the Datafari admin UI:

...

  • Installation directory: The path of the directory where the Tika Server files, or the installation zip file (see the ‘External Tika Server’ parameter), will be placed. Before creating the Tika Server, you MUST make sure that this directory exists, is local to the Datafari server on which you are using this admin tool, is empty, and that the ‘datafari’ user has write permissions on it.

  • Tika Server host: The hostname of the Tika Server. To accept only local requests, you MUST set ‘localhost’; otherwise, set the IP address of the machine that will host and run the Tika Server instance. For a local installation, set ‘localhost’. If the Tika Server is meant to be remote, you may want, for security reasons, to configure it to accept only local requests and expose it through an Apache proxy that forwards the requests; in that case, set ‘localhost’ too.

  • Tika Server Port: The port that the Tika Server will use. Be sure to use a different port for each Tika Server instance you have. Note that the default port of an out-of-the-box Tika Server is 9998.

  • Tika Server temporary directory: The path of the directory that the Tika Server instance will use to write temporary files. Choose a directory that exists on the machine that will host the Tika Server instance and on which the user running the instance has write permissions. Ideally, it should be located on an SSD, because the number of I/O operations will be very high. For a localhost install, the instance runs as the ‘datafari’ user by default, so make sure that user has the correct access rights.

  • Tika Server type: The type of configuration you want to apply to the Tika Server. Two types are currently available: ‘Simple’ and ‘OCR’.
    'Simple' is a standard configuration that disables the OCR parser, so that even if Tesseract is installed on the machine hosting the server, the Tika Server instance will not use it and will not perform OCR on files.
    'OCR' is an OCR-oriented configuration that enables the OCR parser and configures the PDF parser to use it, according to the OCR strategy that you select (see the OCR-specific parameters further in this doc).

  • External Tika Server: When this option is checked, the Tika Server files are archived in a zip file named ‘tika-server.zip’ and stored in the installation directory. When it is not checked, the Tika Server files are put in the installation directory and are not zipped. Enable this option when the Tika Server instance you want to create is meant to be hosted on a machine other than the one hosting the Datafari main node. As a Tika Server instance consumes a lot of resources, and even more when configured to perform OCR, we strongly recommend a dedicated machine for each Tika Server instance. Our hardware recommendations for a Tika Server machine are the following:
    CPU: 4c/8t 2.5GHz
    RAM: 16GB
    Storage: 512GB SSD
    OS: Ubuntu 20 or Debian 10
    SWAP: 10GB
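
The preconditions listed for the ‘Installation directory’ parameter can be checked from a shell before opening the admin tool. The following is only a sketch: the function name `check_install_dir` is hypothetical and is not part of Datafari.

```bash
# Hypothetical helper: verifies the preconditions listed for the
# 'Installation directory' parameter (exists, empty, writable).
check_install_dir() {
  dir="$1"
  # The directory must already exist.
  [ -d "$dir" ] || { echo "missing: $dir"; return 1; }
  # It must be empty (hidden files count too).
  [ -z "$(ls -A "$dir")" ] || { echo "not empty: $dir"; return 1; }
  # The user running this check must be able to write to it.
  [ -w "$dir" ] || { echo "not writable: $dir"; return 1; }
  echo "ok: $dir"
}
```

Run it as the ‘datafari’ user, since the write test reflects the permissions of whoever runs it (as root it always succeeds). The same check can be reused for the temporary directory, minus the emptiness requirement.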

If you select ‘OCR’ as the Tika Server type, a new parameter appears: the OCR strategy. The PDF parser of Tika uses it to determine how to process PDF files. Three strategies are available:

  • auto: tries to extract text, but runs OCR if fewer than 10 characters were extracted or if more than 10 characters have unmapped Unicode values. This is the most efficient mode, but on pages that contain more than 10 characters AND images, OCR is not performed on those images.

  • ocr_only: skips text extraction and runs OCR on each page. If a PDF contains only text and no images, a lot of processing time is wasted, since the OCR process takes time but returns no text. Use this mode only if you are absolutely sure that you will feed Tika exclusively with files that contain only images to OCR.

  • ocr_and_text_extraction: runs both OCR and text extraction on each page. This mode ensures that everything is retrieved (text + OCR-processed text), but it is also the most expensive one in terms of resources and processing time.

Select the strategy that fits your needs and it will be applied to the PDF parser configuration.
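
To make the ‘auto’ rule concrete, the sketch below models the decision for a single page. It is purely illustrative: Tika computes the two counts internally, and the function name `auto_runs_ocr` and its arguments are hypothetical.

```bash
# Illustrative model of the 'auto' rule for one page.
# $1 = number of characters extracted from the page
# $2 = number of characters with unmapped Unicode values
auto_runs_ocr() {
  extracted="$1"
  unmapped="$2"
  # OCR runs when fewer than 10 characters were extracted,
  # or when more than 10 characters have unmapped Unicode values.
  if [ "$extracted" -lt 10 ] || [ "$unmapped" -gt 10 ]; then
    echo "ocr"
  else
    echo "text-only"
  fi
}
```

For instance, `auto_runs_ocr 3 0` prints `ocr` (a scanned page with almost no text layer), while `auto_runs_ocr 500 0` prints `text-only`, even if that page also contains images.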

Currently, the ‘OCR’ type is the only one that has a specific parameter, so once all the other parameters are correctly set, you can click on the ‘Create’ button. If the creation is successful, the installation directory will contain either all the files needed to run the local Tika Server instance, or the installation zip file named ‘tika-server.zip’ if you enabled the ‘External Tika Server’ option. In both cases, the tree structure of the files is the same: a bin directory containing the Tika Server jar and the script files used to run the server, and a conf directory containing the logging configuration files of Tika and the ‘tika-config.xml’ file, which is the main configuration file of Tika.

If you have the installation zip file, simply copy it to an installation directory of your choice on the target machine and unzip it.
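
Once the files are in place (directly or after unzipping), a quick sanity check of the layout described above could look like the following sketch. The function name `check_tika_layout` is hypothetical, and the file names tested are only the ones this page mentions; the actual installation contains more files.

```bash
# Hypothetical helper: checks that an installation directory contains
# the bin/ and conf/ files named on this page.
check_tika_layout() {
  root="$1"
  for path in bin/tika-server.sh bin/set-tika-env.sh conf/tika-config.xml; do
    if [ ! -f "$root/$path" ]; then
      echo "missing: $path"
      return 1
    fi
  done
  echo "layout ok"
}
```

On a remote machine this would typically follow something like `unzip tika-server.zip -d /opt/tika-server` and then `check_tika_layout /opt/tika-server`, where the target path is an example and the archive is assumed to unpack bin and conf directly into the target directory.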

Everything is configured and ready to run. To start the Tika Server, open a bash shell in the bin directory as the user that will run the Tika Server instance, and execute the following command:

Code Block
bash tika-server.sh start
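
After starting the instance, one way to confirm that it answers is to query the `/version` endpoint that Tika Server exposes over HTTP. The sketch below assumes `curl` is installed and that the host and port are the ones you entered in the admin tool; the function names are hypothetical.

```bash
# Builds the URL of the Tika Server version endpoint.
# Defaults: localhost and 9998 (the out-of-the-box Tika Server port).
tika_version_url() {
  host="${1:-localhost}"
  port="${2:-9998}"
  echo "http://$host:$port/version"
}

# Returns 0 if the server answers within 5 seconds, non-zero otherwise.
tika_is_up() {
  curl --silent --fail --max-time 5 "$(tika_version_url "$1" "$2")" >/dev/null
}
```

For example, `tika_is_up localhost 9998 && echo "Tika Server is running"`. If the instance only accepts local requests, run the check on the machine that hosts it.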

To stop the instance simply run the following command:

Code Block
bash tika-server.sh stop

Be careful: as specified in the description of the ‘Tika Server temporary directory’ parameter, the user that runs the instance must have write permissions on the folder in which the Tika Server is installed, and also on the specified temporary directory.

By default, the logs are stored in a ‘logs’ folder located in the installation directory, but you can change the location through the ‘TIKA_LOGS_DIR’ parameter defined in the ‘set-tika-env.sh’ script located in the ‘bin’ directory. You can change the rolling strategy and the name of the log files by modifying the log4j2 properties files located in the ‘conf’ folder.

Last but not least, note that this admin tool only configures the Tika Server instance to do what you want; reaching the end goal may require additional steps. For example, even if the Tika Server instance is configured to perform OCR, OCR will not be performed if Tesseract is not installed and configured. Always refer to the documentation of the feature you want to implement to make sure you set it up correctly. For OCR, refer to this documentation: OCR on Tika Configuration
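
As a quick first check for the OCR case, the sketch below tests whether the Tesseract command-line tool is on the PATH of the user that runs the Tika Server instance. The function names are hypothetical, and a positive result does not replace the full setup described in the linked documentation.

```bash
# Returns 0 if the given command is on the PATH of the current user.
have_command() {
  command -v "$1" >/dev/null 2>&1
}

# Reports whether the Tesseract CLI is available.
report_tesseract() {
  if have_command tesseract; then
    echo "tesseract found"
  else
    echo "tesseract not found"
  fi
}
```

When Tesseract is found, `tesseract --version` and `tesseract --list-langs` show the installed version and language packs.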