Search Index

Valid from 3.0

The documentation below is valid from Datafari 3.0 upwards

Datafari relies on Solr Cloud for its search capabilities. It uses 3 distributed Solr indexes. The main index, FileShare, contains all the crawled documents.

FileShare index

We do not cover the index fields and their configuration exhaustively, as you can check them directly in the schema.xml file, available here:

/opt/datafari/solr/solrcloud/FileShare/conf/schema.xml
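To get a quick overview of the fields declared in schema.xml, the file can be parsed with a few lines of Python. This is a minimal sketch: the embedded excerpt is hypothetical and heavily reduced, and its type, indexed and stored values are illustrative assumptions, not a copy of the actual Datafari schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical excerpt for illustration only; attribute values are assumptions.
SAMPLE_SCHEMA = """<schema name="FileShare" version="1.6">
  <field name="title_en" type="text_en" indexed="true" stored="true"/>
  <field name="content_en" type="text_en" indexed="true" stored="false"/>
  <field name="source" type="string" indexed="true" stored="true"/>
</schema>"""


def list_fields(root):
    """Map each <field> name to its type and its indexed/stored attributes.

    Attributes are returned as the raw strings found in the XML, or None
    when the attribute is absent (the field type then decides the default).
    """
    fields = {}
    for field in root.iter("field"):
        fields[field.get("name")] = {
            "type": field.get("type"),
            "indexed": field.get("indexed"),
            "stored": field.get("stored"),
        }
    return fields


if __name__ == "__main__":
    for name, attrs in list_fields(ET.fromstring(SAMPLE_SCHEMA)).items():
        print(name, attrs)
```

On a Datafari server, the same function could be applied to the real file by passing ET.parse("/opt/datafari/solr/solrcloud/FileShare/conf/schema.xml").getroot() instead of the sample.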

Field names

Field Name	Description
title_fr	Contains the French title of a document
title_en	Contains the English title of a document
content_fr	Contains the French content of a document (the data is indexed but not stored in the index)
content_en	Contains the English content of a document (the data is indexed but not stored in the index)
content_hl	Contains the content of the document, regardless of the language. This field is stored but not indexed, and is used for highlighting. It is truncated by the update processor TruncateFieldUpdateProcessor in the datafari update chain in solrconfig.xml
source	Source of the data (for example, Web)
last_modified	Last modified date of the document
extension	File format
allow_token*	These fields are used to store information on the access rights of the document
deny_token*	These fields are used to store information on the access rights of the document
suggest	Stores the terms used for autocompletion
spell	Stores the terms used for spellchecking
url	URL of the document

Field types

The field types for title_* and content_* are text_* (for example, text_en is the field type for title_en). The text_en field type contains an analyzer specific to full-text search on English text. The analysis phase is described in the next section. The field type for source is string. The string field type does not contain an analyzer and is used, for example, for faceting.

Data analysis at the indexing phase

As a reminder, a Lucene analysis chain holds a set of components that tokenise and filter data in order to extract the terms to be stored in the index. There is one analysis chain per field. For each field, the analysis chains in Datafari have been optimized. For instance, for the fields content_lang1 and title_lang1, the field type text_lang1 includes language-specific components such as stemming. Some other components have been added, such as the word_delimiter, which makes it possible to extract a file name from a URL. The LimitToken Filter limits the number of terms indexed for each field of each document, so that very big documents are truncated before the inverted index is created. A big index can lead to poor search performance.
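The idea of an analysis chain can be illustrated with a toy pipeline in Python. This is only a sketch of the concept: the functions below are simplified, ASCII-only stand-ins written for this example, not the Lucene/Solr components themselves, and the function names and the max_tokens parameter are hypothetical.

```python
import re


def tokenize(text):
    """Split on whitespace (stand-in for a real tokenizer)."""
    return text.split()


def word_delimiter(tokens):
    """Split tokens on non-alphanumeric characters such as '/', '.', '_'."""
    parts = []
    for token in tokens:
        parts.extend(p for p in re.split(r"[^0-9A-Za-z]+", token) if p)
    return parts


def lowercase(tokens):
    """Normalise case so that 'Report' and 'report' produce the same term."""
    return [t.lower() for t in tokens]


def naive_stem(tokens):
    """Strip a trailing 's' from longer tokens (crude stand-in for a stemmer)."""
    return [t[:-1] if len(t) > 3 and t.endswith("s") else t for t in tokens]


def limit_tokens(tokens, max_tokens):
    """Keep only the first max_tokens terms, dropping the rest of the document."""
    return tokens[:max_tokens]


def analyze(text, max_tokens=10):
    """Run the chain in order and return the terms that would be indexed."""
    tokens = tokenize(text)
    tokens = word_delimiter(tokens)
    tokens = lowercase(tokens)
    tokens = naive_stem(tokens)
    return limit_tokens(tokens, max_tokens)


if __name__ == "__main__":
    print(analyze("file://server/share/Annual_Reports.pdf"))
    # ['file', 'server', 'share', 'annual', 'report', 'pdf']
```

With max_tokens=3, the same input yields only the first three terms, which mirrors how a token limit keeps very large documents from inflating the index.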

Modifying structure of the index

The structure of the index is defined by a schema. The schema template is stored here: /opt/datafari/solr/solrcloud/FileShare/conf/schema.xml. The schema is then loaded (at first start, or on demand with the updateSolrConfig.sh script) into the Zookeeper that holds all Solr configurations. When the core is loaded, a new file called managed-schema is created in Zookeeper. This file is a copy of schema.xml and reflects the modifications pushed through the schema API, for example the custom fields added with the script addCustomSchemaInfo.sh (see the dedicated section Custom Solr configuration).
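To give an idea of what a modification pushed through the schema API looks like, here is a minimal Python sketch that builds (but does not send) an add-field request for the Solr Schema API. The base URL, the field name my_custom_field and the helper function are assumptions made for this example; in Datafari, the supported route for custom fields remains the addCustomSchemaInfo.sh script.

```python
import json
import urllib.request


def build_add_field_request(base_url, collection, name, field_type,
                            stored=True, indexed=True, multi_valued=False):
    """Build a POST request for the Solr Schema API 'add-field' command.

    The request is only constructed here; nothing is sent to the server.
    """
    payload = {
        "add-field": {
            "name": name,
            "type": field_type,
            "stored": stored,
            "indexed": indexed,
            "multiValued": multi_valued,
        }
    }
    return urllib.request.Request(
        url=f"{base_url}/{collection}/schema",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Assumed host, port and field name; adapt them to the actual installation.
    request = build_add_field_request(
        "http://localhost:8983/solr", "FileShare", "my_custom_field", "string"
    )
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
    # Sending it would be: urllib.request.urlopen(request)
```

A field added this way is recorded in managed-schema in Zookeeper, not in the schema.xml template on disk.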