...
Installation directory: The path of the directory where the Tika Server files or the installation zip file (see the ‘External Tika Server’ parameter) will be placed. Before creating the Tika Server, you MUST make sure that this directory exists, is local to the Datafari server on which you are accessing this admin UI, is empty, and that the ‘datafari’ user has write permissions on it!
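The preconditions above can be checked before you fill in the form. The following is a minimal sketch, not part of Datafari: the function name check_install_dir is hypothetical, and it tests permissions for whichever user runs the script, so it is assumed that you run it as the ‘datafari’ user.

```python
import os


def check_install_dir(path):
    """Return a list of problems; an empty list means the directory looks usable.

    Hypothetical helper, not shipped with Datafari. Permissions are checked
    for the user running this script (assumed to be the 'datafari' user).
    """
    problems = []
    if not os.path.isdir(path):
        # Nothing else can be checked if the directory is missing.
        problems.append("directory does not exist")
        return problems
    if os.listdir(path):
        problems.append("directory is not empty")
    # Creating files in a directory requires both write and search permission.
    if not os.access(path, os.W_OK | os.X_OK):
        problems.append("current user cannot write to the directory")
    return problems
```

Calling check_install_dir with the path you intend to enter should return an empty list. The check does not verify that the directory is on a local file system rather than a network mount; that still has to be confirmed by hand.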
Tika Server host: The hostname of the Tika Server. If you want the Tika Server to accept only local requests, you MUST set ‘localhost’; otherwise, set the IP address of the machine that will host and run the Tika Server instance. For a local installation, set ‘localhost’. If the Tika Server is intended to be remote, you may want, for security reasons, to configure it to accept only local requests and expose it through an Apache proxy that forwards the requests; in that case, set ‘localhost’ too.
Tika Server Port: The port that the Tika Server will use. Be sure to use a different port for each Tika Server instance you have! Note that the default port used by an out-of-the-box Tika Server is 9998.
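To avoid reusing a port that another Tika Server instance (or any other service) already holds, you can test the port on the target machine first. This is a minimal sketch under assumptions: the function name is_port_free is hypothetical, it only checks IPv4 on the given address, and it must be run on the machine that will host the Tika Server instance.

```python
import socket


def is_port_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can currently be bound to (host, port).

    Hypothetical helper, not shipped with Datafari. A successful bind only
    shows that the port is free at this moment on this machine.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically "address already in use" or a permission error.
            return False
    return True
```

For example, is_port_free(9998) tells you whether the default Tika Server port is available on the local loopback address. Ports below 1024 usually require elevated privileges, so the function reports them as unavailable for a regular user.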
Tika Server temporary directory: The path of the directory that the Tika Server instance will use to write temporary files. You must choose a directory that exists on the machine that will host the Tika Server instance and on which the user running the Tika Server instance has write permissions. Ideally, the directory should be located on an SSD, because the number of I/O operations will be very high. For a localhost install, the Tika Server runs as the ‘datafari’ user by default, so make sure that this user has the correct access rights.
Tika Server type: The type of Tika Server configuration you want to apply on the Tika Server. Currently there are two types of Tika Server configuration available: ‘Simple’ and ‘OCR’.
'Simple' is a standard configuration that disables the OCR parser so that, even if Tesseract is installed on the machine hosting the server, we ensure that the Tika Server instance will not use it and will not perform OCR on files.
'OCR' is an OCR-oriented configuration that enables the OCR parser and configures the PDF parser to use it, according to a specific OCR strategy that you will select (see the OCR type specific parameters further in this doc).
External Tika Server: When this option is checked, the Tika Server files are archived in a zip file named ‘tika-server.zip’, which is stored in the installation directory. When it is not checked, the Tika Server files are put in the installation directory unzipped. You should enable this option when the Tika Server instance you want to create is meant to be hosted on a machine other than the one hosting the Datafari main node. As a Tika Server instance consumes a lot of resources, and even more when configured to perform OCR, we strongly recommend a dedicated machine for each Tika Server instance. Our hardware recommendations for a Tika Server machine are the following:
CPU: 4c/8t 2.5GHz
RAM: 16GB
Storage: 512GB SSD
OS: Ubuntu 20 or Debian 10
SWAP: 10GB
...