When indexing documents, Datafari tries to identify the type of each file so that it can register the correct extension with the document. The registered extension is used to provide faceted search and lets users filter the results of their query by file type. This page describes how Datafari performs this detection.
Default Behavior
By default, Datafari tries to guess the extension of the file from the filename. If the URL of a document is something like:
file:///a/b/c/someFile.ext
Then the extracted extension will be ext, whatever the real type of the file is.
If extracting the extension from the file name fails, the file type guessed by Tika is used instead.
Finally, if both of the previous techniques fail, an empty string is used as the file extension.
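The fallback order described above can be sketched in a few lines of Python. This is only an illustration, not Datafari's actual code: the function names extension_from_name and default_extension are hypothetical, the Tika result is assumed to be already available as an extension string (or None), and "failure" of the filename extraction is assumed to mean that the last path segment has no dot followed by text.

```python
from urllib.parse import urlsplit


def extension_from_name(url):
    """Return the text after the last dot of the last path segment, or None.

    Hypothetical helper: the query string is ignored, only the path is used.
    """
    name = urlsplit(url).path.rsplit("/", 1)[-1]
    _, dot, ext = name.rpartition(".")
    return ext if dot and ext else None


def default_extension(url, tika_extension=None):
    """Filename guess first, then the Tika guess, then an empty string."""
    return extension_from_name(url) or tika_extension or ""


print(default_extension("file:///a/b/c/someFile.ext"))     # ext
print(default_extension("file:///a/b/c/README", "txt"))    # txt
print(default_extension("file:///a/b/c/README"))           # prints an empty line
```

Note that in this sketch the filename guess wins whenever it succeeds, even if the Tika guess disagrees with it.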
Be aware that using the default behavior on websites that have URLs like:
http://mydomain.net/some/path/getDoc.php?doc=myDocument.pdf
will result in a php type for this document (and for all other documents retrieved through the same script).
For web crawls in general, it is advised to use the alternative configuration, which uses the type extracted by Tika in priority.
Alternative Configuration
The configuration can be changed by editing the following parameter in the solrconfig.xml file:
<str name="extension.fromname">true</str>
Changing the value to false makes the type guessed by Tika the main source, with the filename guess used as a fallback if Tika fails.
This parameter is under the "datafari" updateRequestProcessorChain, near line 1180.
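To illustrate the effect of the extension.fromname parameter, here is a minimal Python sketch of both orderings. It is an assumption-laden model, not Datafari's implementation: resolve_extension and name_extension are hypothetical names, the Tika guess is passed in as a ready-made extension string, and the empty-string fallback is assumed to apply in both modes.

```python
from urllib.parse import urlsplit


def name_extension(url):
    """Hypothetical helper: extension of the last path segment, or None."""
    name = urlsplit(url).path.rsplit("/", 1)[-1]
    _, dot, ext = name.rpartition(".")
    return ext if dot and ext else None


def resolve_extension(url, tika_extension, from_name=True):
    """Model of extension.fromname: it only changes which guess is tried first."""
    by_name = name_extension(url)
    if from_name:
        candidates = (by_name, tika_extension)
    else:
        candidates = (tika_extension, by_name)
    # First non-empty candidate wins; empty string if both guesses fail.
    return next((c for c in candidates if c), "")


url = "http://mydomain.net/some/path/getDoc.php?doc=myDocument.pdf"
print(resolve_extension(url, "pdf", from_name=True))    # php
print(resolve_extension(url, "pdf", from_name=False))   # pdf
print(resolve_extension(url, None, from_name=False))    # php
```

With the value set to false, the script-served document from the example URL is registered as pdf, and the filename guess is only used when no Tika guess is available.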
Please refer to Manage Solr configuration with Zookeeper to know where the configuration files are located and how to reload the configuration correctly.