...
Installation directory: The path of the directory where the Tika Server files or the installation zip file (see the ‘External Tika Server’ parameter) will be put. Before creating the Tika Server you MUST BE SURE that this directory exists, is local to the Datafari server on which you access this admin UI, is empty, and that the ‘datafari’ user has write permissions on it!
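The preconditions above can be checked before opening the admin UI. The following is only a minimal sketch: the function name check_install_dir and the example path are hypothetical and are not part of Datafari. It tests the permissions of the user running the script, so it should be run as the ‘datafari’ user on the Datafari server itself.

```python
import os


def check_install_dir(path):
    """Return a list of problems found with a candidate installation directory.

    Hypothetical helper, not part of Datafari. An empty list means the
    directory exists, is empty, and is writable by the current user.
    """
    if not os.path.isdir(path):
        return ["directory does not exist"]
    problems = []
    if os.listdir(path):
        problems.append("directory is not empty")
    # Writing into a directory requires both write and search (execute) permission.
    if not os.access(path, os.W_OK | os.X_OK):
        problems.append("current user cannot write to directory")
    return problems


if __name__ == "__main__":
    # Example path only; replace it with your own installation directory.
    for problem in check_install_dir("/opt/tika-server-1"):
        print("Problem:", problem)
```

An empty result does not prove that the directory is local to the Datafari server; that still has to be verified by hand.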
Tika Server host: The hostname of the Tika Server. If you want the Tika Server to accept local requests only, you MUST set ‘localhost’; otherwise, set the IP address of the machine that will host and run the Tika Server instance. For a local installation, set ‘localhost’. If the Tika Server is intended to be remote, you may want, for security reasons, to configure it so that it only accepts local requests and to expose it through an Apache proxy that forwards the requests. In that case, set ‘localhost’ too.
Tika Server Port: The port that the Tika Server will use. Be sure to use a different port for each Tika Server instance you have! Note that the default port used by an out-of-the-box Tika Server is 9998.
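To reduce the risk of picking a port that is already taken, you can test whether something is listening on it. This is a minimal sketch: port_is_free is a hypothetical helper, not part of Datafari or Tika. It only detects services that are running at the time of the check, so a port reserved for a stopped Tika Server instance will still be reported as free.

```python
import socket


def port_is_free(port, host="localhost"):
    """Return True if no service currently accepts TCP connections on host:port.

    Hypothetical helper, not part of Datafari or Tika.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 when the connection succeeds, i.e. the port is in use.
        return sock.connect_ex((host, port)) != 0


if __name__ == "__main__":
    # 9998 is the default port of an out-of-the-box Tika Server.
    print("Port 9998 free:", port_is_free(9998))
```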
Tika Server temporary directory: The path of the directory that the Tika Server instance will use to write temporary files. You must choose a directory that exists on the machine that will host the Tika Server instance and on which the user that will run the Tika Server instance has write permissions. Ideally, the directory should be located on an SSD, because the number of I/O operations will be very high. If you are doing a localhost install, the default user will be the datafari user; please make sure it has the correct access rights.
Tika Server type: The type of configuration you want to apply to the Tika Server. Two types of Tika Server configuration are currently available: ‘Simple’ and ‘OCR’.
'Simple' is a standard configuration that disables the OCR parser so that, even if Tesseract is installed on the machine hosting the server, we ensure that the Tika Server instance will not use it and will not perform OCR on files.
'OCR' is an OCR-oriented configuration that enables the OCR parser and configures the PDF parser to use it, according to a specific OCR strategy that you will select (see the OCR type specific parameters further in this doc).
External Tika Server: When this option is checked, the Tika Server files are archived in a zip file named ‘tika-server.zip’ and stored in the installation directory. If this option is not checked, the Tika Server files are put in the installation directory and are not zipped. You should enable this option when the Tika Server instance you want to create is meant to be hosted on a machine other than the one hosting the Datafari main node. As a Tika Server instance consumes a lot of resources, and even more when configured to perform OCR, we strongly recommend having a dedicated machine for each Tika Server instance. Our hardware recommendations for a Tika Server machine are the following:
CPU: 4c/8t 2.5GHz
RAM: 16GB
Storage: 512GB SSD
OS: Ubuntu 20 or Debian 10
SWAP: 10GB
If you select ‘OCR’ as Tika Server type, a new parameter will appear: the OCR strategy. The OCR strategy is used by the PDF parser of Tika to determine how to process PDF files. Three strategies are available:
auto : try to extract text, but run OCR if fewer than 10 characters were extracted or if there are more than 10 characters with unmapped Unicode values. This is the most efficient mode, but you will lose the text contained in images on any page that has both more than 10 characters of extractable text AND one or several images: OCR will not be performed on those images.
ocr_only : don't bother extracting text, just run OCR on each page. If a PDF contains only text and no images, a huge amount of processing time will be wasted, since the OCR process will take time but will not return any text. Use this mode only if you are absolutely sure that you will only feed Tika with files containing only images to OCR.
ocr_and_text_extraction : run both OCR and text extraction on each page. This is the mode that ensures everything is retrieved (text + OCR processed text), but it is also the most expensive mode in terms of resources and processing time.
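The three strategies can be summarized with a small decision function. This is a sketch of the rules as described above, not Tika's actual implementation: the function name runs_ocr, its parameters, and the assumption that the character counts are evaluated per page are illustrative only.

```python
def runs_ocr(strategy, extracted_chars=0, unmapped_chars=0, threshold=10):
    """Return True if OCR would run on a page under the given strategy.

    Illustrative model of the rules described in this doc, not Tika code.
    extracted_chars: number of characters obtained by plain text extraction.
    unmapped_chars: number of characters with unmapped Unicode values.
    """
    if strategy in ("ocr_only", "ocr_and_text_extraction"):
        # Both strategies run OCR on every page, whatever the text content.
        return True
    if strategy == "auto":
        # OCR is a fallback: too little text, or too much unreadable text.
        return extracted_chars < threshold or unmapped_chars > threshold
    raise ValueError("unknown OCR strategy: %s" % strategy)


if __name__ == "__main__":
    # A scanned page with almost no extractable text triggers OCR in auto mode.
    print(runs_ocr("auto", extracted_chars=3))
    # A page with plenty of text AND images is not OCRed in auto mode,
    # so the text inside its images is lost.
    print(runs_ocr("auto", extracted_chars=500))
```

The second call illustrates the trade-off of the auto strategy: pages mixing normal text and images are treated as text-only pages.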
...