Warning: Hardcoded limitations

Starting with Datafari 4.3.1, Tika metadata extraction has a hardcoded limit of 1000 characters (1000 bytes). The metadata, possibly truncated, is URL-encoded before being sent to Solr.
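
The truncate-then-encode behaviour described above can be illustrated with a minimal sketch. This is not Datafari's actual code: the class name `MetadataLimitSketch`, the method `prepare` and the constant `MAX_CHARS` are hypothetical. The sketch also assumes that the limit is counted in Java characters (UTF-16 code units) and that the encoding is UTF-8.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not taken from the Datafari code base.
final class MetadataLimitSketch {

    // Assumed limit, matching the 1000 characters mentioned above.
    static final int MAX_CHARS = 1000;

    // Truncates the value to MAX_CHARS characters if needed, then URL-encodes it.
    static String prepare(String value) {
        String truncated = value.length() > MAX_CHARS
                ? value.substring(0, MAX_CHARS)
                : value;
        try {
            // URLEncoder applies form encoding: a space becomes '+', '&' becomes "%26".
            return URLEncoder.encode(truncated, StandardCharsets.UTF_8.name());
        } catch (UnsupportedEncodingException e) {
            // UTF-8 support is mandatory on every Java platform.
            throw new IllegalStateException(e);
        }
    }
}
```

Under these assumptions, `MetadataLimitSketch.prepare("a b&c")` returns `a+b%26c`. Because encoding happens after truncation, the encoded value may be longer than 1000 characters.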


Info: Valid from 4.0

The documentation below is valid from Datafari v4.0.0 onwards.

...

  • Tika hostname: the hostname of the Tika server to use. Datafari should configure it correctly by default, but you may need to change it if you move the Tika server to another machine.
  • Tika port: the port of the Tika server. The default is 9998.
  • Connection timeout: time in milliseconds during which the connector tries to connect or reconnect to the Tika server. Once this time has elapsed, a connection timeout is triggered. A connection timeout can only occur while the TCP connection is being established, usually because the remote machine does not answer: the server has been shut down, the IP address or DNS name is wrong, the port is wrong, or the network connection to the server is down.
  • Socket timeout: time in milliseconds during which the connector waits for a response from the Tika server after sending a document to parse. Once this time has elapsed, a socket timeout is triggered. A socket timeout monitors the continuous incoming data flow: if the flow is interrupted for the specified duration, the connection is regarded as stalled or broken. This only works for connections on which data is received continuously.
  • Retry interval (in ms): time in milliseconds after which the connector retries processing a document for which it triggered a ServiceInterruptionException. The TikaServerConnector triggers a ServiceInterruptionException on any exception that occurs during document processing, except SocketTimeoutException. Note that when the Tika server returns a response code, that code is never treated as a Java exception, whatever its value, so no ServiceInterruptionException is triggered. For example, a '500 - Internal error' response from the Tika server does not trigger a ServiceInterruptionException.
  • Number of retries: number of retries after which the job is aborted if document processing still ends with a ServiceInterruptionException. The retry count starts after the first ServiceInterruptionException, and the job is aborted once the retry count exceeds the number defined here. For example, with the number of retries set to 3: the first ServiceInterruptionException encountered initializes the retry count to 0; the first, second and third retries increase it to 1, 2 and 3; a final retry is then attempted and, if it fails, the retry count reaches 4, which exceeds 3, so the job is aborted. Setting the value to 3 therefore results in 5 attempts in total.
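
The retry counting described in the last item can be checked with a short simulation. This is only a sketch of the arithmetic, not the connector's actual code: the class `RetryCountSketch` and the method `attemptsBeforeAbort` are hypothetical names, and every attempt is assumed to fail with a ServiceInterruptionException.

```java
// Hypothetical illustration only; it mirrors the counting rule described above.
final class RetryCountSketch {

    // Returns the total number of processing attempts made on a document
    // before the job is aborted, assuming every attempt fails.
    static int attemptsBeforeAbort(int numberOfRetries) {
        int attempts = 1;    // initial attempt, which fails
        int retryCount = 0;  // initialized to 0 by the first exception

        // The job is aborted only once the retry count exceeds the configured number.
        while (retryCount <= numberOfRetries) {
            attempts++;      // one more retry is attempted...
            retryCount++;    // ...and fails, so the count is increased
        }
        return attempts;
    }
}
```

With the value 3 used in the example, `RetryCountSketch.attemptsBeforeAbort(3)` returns 5. More generally, under this counting rule a setting of N leads to N + 2 attempts.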

...