Valid from Datafari 6

This documentation is valid for Datafari Community & Enterprise Edition v6 and later.

Datafari 6 introduces a new admin UI tool that makes it easy to create and configure a Tika Server. The tool is accessible through the ‘Expert Menu → Create a Tika Server’ side menu tab of the Datafari admin UI.

The required parameters depend on the type of Tika Server configuration you choose. We start with the parameters that are common to every Tika Server configuration:

  • Installation directory: The path of the directory where the Tika Server files, or the installation zip file (see the ‘External Tika Server’ parameter), will be written. Before creating the Tika Server you MUST make sure that this directory exists, is local to the Datafari server on which you are using this admin tool, is empty, and is writable by the ‘datafari’ user.

  • Tika Server host: The hostname of the Tika Server. If you want to authorize only local requests to the Tika Server, you MUST set ‘localhost’. Otherwise, set the IP address of the machine that will host and run the Tika Server instance.

  • Tika Server port: The port that the Tika Server will listen on. Be sure to use a different port for each Tika Server instance you have. Note that the default port of an out-of-the-box Tika Server is 9998.

  • Tika Server temporary directory: The path of the directory in which the Tika Server instance will write its temporary files. The directory must exist on the machine that will host the Tika Server instance, and the user that runs the instance must have write permissions on it. Ideally it should be located on an SSD, because the number of I/O operations will be very high.

  • Tika Server type: The type of configuration to apply to the Tika Server. Two types are currently available: ‘Simple’ and ‘OCR’.
    'Simple' is a standard configuration that disables the OCR parser. Even if Tesseract is installed on the machine hosting the server, this ensures that the Tika Server instance will not use it and will not perform OCR on files.
    'OCR' is an OCR-oriented configuration that enables the OCR parser and configures the PDF parser to use it, according to the OCR strategy that you select (see the OCR type specific parameters further in this doc).

  • External Tika Server: When this option is enabled, the Tika Server files are archived in a zip file named ‘tika-server.zip’, which is stored in the installation directory. When it is not checked, the Tika Server files are put in the installation directory unzipped. Enable this option when the Tika Server instance you are creating is meant to be hosted on a different machine from the one hosting the Datafari main node. Because a Tika Server instance consumes a lot of resources, and even more when configured to perform OCR, we strongly recommend dedicating a machine to each Tika Server instance. Our hardware recommendations for a Tika Server machine are the following:
    CPU: 4c/8t 2.5GHz
    RAM: 16GB
    Storage: 512GB SSD
    OS: Ubuntu 20 or Debian 10
    SWAP: 10GB
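
The preconditions listed above (an existing, empty, writable installation directory and a distinct port per Tika Server instance) can be checked before opening the admin tool. The following is only a minimal sketch and is not part of Datafari: the function names `check_install_dir` and `check_ports` are hypothetical, and the write-permission test applies to the user running the script, so it is assumed to be run as the ‘datafari’ user.

```python
import os


def check_install_dir(path):
    """Return a list of problems found with the installation directory.

    Hypothetical helper, not a Datafari API. An empty list means the
    directory exists, is empty, and is writable by the current user.
    """
    problems = []
    if not os.path.isdir(path):
        problems.append("not an existing directory")
        return problems  # the remaining checks need an existing directory
    if os.listdir(path):
        problems.append("directory is not empty")
    # os.access tests the user running this script (assumed: 'datafari').
    if not os.access(path, os.W_OK):
        problems.append("current user has no write permission")
    return problems


def check_ports(ports):
    """Return the sorted ports assigned to more than one Tika Server instance."""
    seen, duplicates = set(), set()
    for port in ports:
        if port in seen:
            duplicates.add(port)
        seen.add(port)
    return sorted(duplicates)
```

For example, `check_ports([9998, 9999, 9998])` reports `[9998]` as a conflict, and `check_install_dir` returns an empty list only when all three directory conditions hold. The same directory check could plausibly be reused for the temporary directory, minus the emptiness condition, which the text does not require there.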
