Data Extraction Server Configuration
Valid from 5.0
The documentation below is valid from Datafari v5.0.0 upwards for both the EE and CE editions
Retry behavior 4.3.1
Starting with Datafari 4.3.1, the Tika server connector no longer aborts a job. Instead, it retries a document indefinitely as long as the Tika server is down (does not respond). If a document ends in a SocketException state, the connector retries it n times (n being specified in the configuration described here) before skipping it.
This change was made because it has proved, in a production environment, to guarantee that 100% of clean documents are indexed and 100% of bad documents are reported, without any job hang. Of course, it means that you need to properly configure monitoring of the Tika server to be sure that it will not stay down for a long time (see Monitoring a Data Extraction (Tika) server - Enterprise Edition).
Hardcoded limitations
Starting with Datafari 4.3.1, there is a hardcoded limit on the Tika metadata extraction. A metadata name cannot exceed 8000 characters; otherwise, the metadata is excluded.
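The effect of this limit can be illustrated with a minimal Python sketch. It is not the connector's code: the names MAX_METADATA_NAME_LENGTH and filter_metadata are hypothetical, and the sketch assumes that a name of exactly 8000 characters is still accepted.

```python
# Hypothetical constant mirroring the hardcoded limit described above.
MAX_METADATA_NAME_LENGTH = 8000


def filter_metadata(metadata: dict) -> dict:
    """Return a copy of the metadata without entries whose name is too long.

    Assumption: a name of exactly 8000 characters is kept, a longer one is excluded.
    """
    return {
        name: value
        for name, value in metadata.items()
        if len(name) <= MAX_METADATA_NAME_LENGTH
    }


if __name__ == "__main__":
    sample = {"title": "Report", "x" * 8001: "dropped"}
    print(list(filter_metadata(sample)))  # only the "title" entry remains
```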
The Tika server is now packaged with Datafari.
In datafari.properties, the TIKASERVER property (true by default) indicates whether the Datafari startup script launches the Tika server. You can have several Tika servers running, with at most one per machine.
If TIKASERVER= yes:
Monoserver: Tika server started
Multiserver:
Main node: Tika server started if TIKASERVER= yes in this machine's datafari.properties
Solr node: not relevant, as there is no preinstalled Tika server
MCF node: Tika server started if TIKASERVER= yes in this machine's datafari.properties
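The startup rules above can be summarized as a small decision function. This is only an illustrative Python sketch, not the Datafari startup script: the function name tika_server_started and the node type labels are hypothetical, and it assumes that the Tika server is simply not launched when TIKASERVER is disabled.

```python
def tika_server_started(node_type: str, tikaserver_enabled: bool) -> bool:
    """Tell whether the startup script launches a Tika server on this machine.

    node_type is one of the hypothetical labels "mono", "main", "solr", "mcf".
    tikaserver_enabled reflects the TIKASERVER property of this machine's
    datafari.properties.
    """
    if node_type == "solr":
        # Solr nodes have no preinstalled Tika server, so the property is not relevant.
        return False
    if node_type in ("mono", "main", "mcf"):
        # Assumption: when the property is disabled, nothing is started.
        return tikaserver_enabled
    raise ValueError(f"unknown node type: {node_type}")


if __name__ == "__main__":
    for node in ("mono", "main", "solr", "mcf"):
        print(node, tika_server_started(node, True))
```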
The Tika server is used through a transformation connector named "TikaServerConnector". Here is how it is configured:
In the configuration of the connector itself, you will find the following parameters:
Tika hostname: the hostname of the Tika server to use. It may be useful to change it if you move the Tika server to another server, otherwise it should be correctly configured by Datafari.
Tika port: the port of the Tika server. By default it is 9998.
Connection timeout: time in milliseconds during which the connector tries to connect/re-connect to the Tika server. When this time has elapsed, a connection timeout is triggered. A connection timeout occurs only when establishing the TCP connection, and usually happens when the remote machine does not answer. This means that the server has been shut down, that you used the wrong IP/DNS name or the wrong port, or that the network connection to the server is down.
Socket timeout: time in milliseconds during which the connector waits for a Tika server response after sending a document to parse. When this time has elapsed, a socket timeout is triggered. A socket timeout monitors the continuous incoming data flow: if the data flow is interrupted for the specified timeout, the connection is regarded as stalled/broken. Of course, this only works with connections where data is received all the time.
Retry interval (in ms): time in milliseconds after which, if a ServiceInterruptionException is triggered by the connector, the connector retries processing the document on which the exception was triggered. A ServiceInterruptionException is triggered by the TikaServerConnector on any exception happening during document processing, except for SocketTimeoutException. Note that when the Tika server returns a response code, that code, whatever it is, is not considered a Java exception, so no ServiceInterruptionException is triggered in that case (for example, even if the Tika server response code is '500 - Internal error', the connector will not trigger a ServiceInterruptionException).
Number of retries: number of retries after which, if the document processing still ends with a SocketException, the document concerned is skipped. Note that the retry count starts after the first SocketException and that the document is skipped once the retry count exceeds the number defined here. For example, if the number of retries is set to 3, the first SocketException encountered initializes the retry count to 0. After a first failed retry it is increased to 1, after the second to 2, and after the third to 3. At this point a very last retry is attempted and, if it fails, the retry count is increased to 4 and, as it exceeds 3, the document is skipped. So by setting the value to 3, there will be 5 attempts.
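The counting rule for the number of retries can be checked with a short simulation. This is only a sketch in Python, not the connector's code: the function name attempts_before_skip is hypothetical, and it assumes a document for which every processing attempt ends with a SocketException.

```python
def attempts_before_skip(number_of_retries: int) -> int:
    """Count the processing attempts made before a failing document is skipped.

    Assumption: every attempt on the document ends with a SocketException.
    """
    attempts = 1     # initial attempt, which raises the first SocketException
    retry_count = 0  # the first SocketException initializes the count to 0
    while True:
        attempts += 1     # one more retry, which fails again
        retry_count += 1  # the count is increased after each failed retry
        if retry_count > number_of_retries:
            # The count now exceeds the configured number: the document is skipped.
            return attempts


if __name__ == "__main__":
    print(attempts_before_skip(3))  # 5
```

With this rule, a setting of n leads to n + 2 attempts in total for a document that never succeeds: the initial attempt, n retries, and the very last retry.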
In the job configuration, you will be able to change other parameters:
Field mappings: allows you to change the final name of a metadata field
Keep all metadata: keep metadata extracted by Tika or not
Normalize metadata names: normalize the metadata names. Very useful to be sure that metadata will not be ignored by Solr if their names contain disallowed characters.
Content write limit: limit the size of the text content extracted from the document. If the document contains more characters than the limit specified here, the content is truncated to fit the limit. This parameter is very important to prevent huge text content that can trigger an out-of-memory error (OOM) during Solr indexing.
Extract archives content: extract the content of archive files or simply skip the archive files. If you decide to skip the archives, they will be indexed but with empty content. The recognized types of archives are zip, gz, tar, gtar, 7z, xz, boz, bz2, cpio, jar, ar, a, pab, ost and pst.
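The combined effect of the content write limit and of the archive option can be sketched as follows. This Python sketch is illustrative only: the names ARCHIVE_TYPES, is_archive and indexed_content are hypothetical, and recognizing an archive from its file extension is an assumption made for the example (the connector may well rely on the detected content type instead).

```python
# Archive types listed in the documentation above.
ARCHIVE_TYPES = {
    "zip", "gz", "tar", "gtar", "7z", "xz", "boz", "bz2",
    "cpio", "jar", "ar", "a", "pab", "ost", "pst",
}


def is_archive(filename: str) -> bool:
    """Assumption: the archive type is guessed from the file extension."""
    _, dot, extension = filename.rpartition(".")
    return bool(dot) and extension.lower() in ARCHIVE_TYPES


def indexed_content(filename: str, extracted_text: str,
                    write_limit: int, extract_archives: bool) -> str:
    """Return the text content that would be sent for indexing."""
    if is_archive(filename) and not extract_archives:
        # Skipped archives are still indexed, but with empty content.
        return ""
    # Content longer than the limit is truncated to fit it.
    return extracted_text[:write_limit]


if __name__ == "__main__":
    print(indexed_content("notes.txt", "hello world", 5, False))  # hello
```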