Search Index

Valid from 3.0

The documentation below is valid from Datafari 3.0 upwards

Datafari relies on SolrCloud for its search capabilities and uses three distributed Solr indexes. The main index, which contains all the crawled documents, is FileShare.

FileShare index

We do not cover the index fields and their configuration exhaustively, as you can check them directly in the schema.xml file, available here:

/opt/datafari/solr/solrcloud/FileShare/conf/schema.xml
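To get a quick overview of the declared fields without reading the whole file, the schema can be parsed with a few lines of Python. This is only a sketch: the `SAMPLE` excerpt below is hypothetical and written for illustration, not copied from the real schema.xml, and the helper name `list_fields` is an assumption.

```python
import xml.etree.ElementTree as ET

# Path quoted in the text above; adjust it if your installation differs.
SCHEMA_PATH = "/opt/datafari/solr/solrcloud/FileShare/conf/schema.xml"


def list_fields(root):
    """Return name/type/indexed/stored for every <field> element under root."""
    fields = []
    # iter() also finds <field> elements nested under a <fields> wrapper.
    for field in root.iter("field"):
        fields.append({
            "name": field.get("name"),
            "type": field.get("type"),
            "indexed": field.get("indexed"),
            "stored": field.get("stored"),
        })
    return fields


# Hypothetical excerpt, for illustration only (not the real Datafari schema).
SAMPLE = """<schema name="sample" version="1.6">
  <field name="title_en" type="text_en" indexed="true" stored="true"/>
  <field name="content_en" type="text_en" indexed="true" stored="false"/>
  <field name="source" type="string" indexed="true" stored="true"/>
</schema>"""

if __name__ == "__main__":
    for f in list_fields(ET.fromstring(SAMPLE)):
        print(f["name"], f["type"], f["indexed"], f["stored"])
```

To run it against an actual installation, replace `ET.fromstring(SAMPLE)` with `ET.parse(SCHEMA_PATH).getroot()`.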

Field names

| Field Name | Description |
|---|---|
| title_fr | French title of the document |
| title_en | English title of the document |
| content_fr | French content of the document (the data is indexed but not stored in the index) |
| content_en | English content of the document (the data is indexed but not stored in the index) |
| content_hl | Content of the document, regardless of the language. This field is stored but not indexed, and is used for highlighting. It is truncated by the update processor TruncateFieldUpdateProcessor in the datafari update chain in solrconfig.xml |
| source | Source of the data (for example, Web) |
| last_modified | Last modification date of the document |
| extension | File format |
| allow_token* | These fields store information on the access rights of the document |
| deny_token* | These fields store information on the access rights of the document |
| suggest | Stores the terms used for autocompletion |
| spell | Stores the terms used for spellchecking |
| url | URL of the document |

Field types

The field types for title_* and content_* are text_* (for example, text_en is the field type of title_en). The text_en field type contains an analyzer specific to full-text search on English text; the analysis phase is described in the next section. The field type of source is string. The string field type has no analyzer and is used, for example, for faceting.

Data analysis at the indexing phase

As a reminder, a Lucene analysis chain holds a set of components that tokenise and filter data in order to extract the terms to be stored in the index. There is one analysis chain per field, and the chains in Datafari have been optimized for each field. For instance, for the fields content_lang1 and title_lang1, the field type text_lang1 includes language-specific components such as stemming. Other components have been added, such as the word_delimiter, which makes it possible to extract a file name from a URL. The LimitToken filter limits the number of terms indexed for each field of each document, so that very big documents are truncated before the inverted index is created. A big index can lead to poor search performance.
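The idea of an analysis chain can be illustrated with a toy pipeline in plain Python. This is a simplified sketch, not Solr's or Datafari's implementation: the function names are hypothetical, stemming is left out, and the word-delimiter step only approximates what the real filter does.

```python
import re


def tokenize(text):
    """Split the raw text on whitespace."""
    return text.split()


def word_delimiter(tokens):
    """Split each token on punctuation and underscores, e.g. the parts of a URL."""
    out = []
    for tok in tokens:
        out.extend(part for part in re.split(r"[\W_]+", tok) if part)
    return out


def lowercase(tokens):
    """Normalise case so that 'Report' and 'report' become the same term."""
    return [t.lower() for t in tokens]


def limit_tokens(tokens, max_tokens):
    """Keep only the first max_tokens terms, truncating very large inputs."""
    return tokens[:max_tokens]


def analyze(text, max_tokens=10):
    """Run the toy chain: tokenize, split on delimiters, lowercase, limit."""
    tokens = tokenize(text)
    tokens = word_delimiter(tokens)
    tokens = lowercase(tokens)
    return limit_tokens(tokens, max_tokens)


if __name__ == "__main__":
    text = "See http://example.com/docs/Annual_Report.pdf"
    print(analyze(text))                # file name parts become separate terms
    print(analyze(text, max_tokens=3))  # only the first three terms are kept
```

With the delimiter step, a search on "report" can match the document through its URL, while the token limit bounds how many terms a single document contributes to the index.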

Modifying the structure of the index

The structure of the index is defined by a schema. The schema template is stored here: /opt/datafari/solr/solrcloud/FileShare/conf/schema.xml. The schema is then loaded (at first start, or on demand with the updateSolrConfig.sh script) into the ZooKeeper instance that holds all Solr configurations. When the core is loaded, a new file called managed-schema is created in ZooKeeper. This file is a copy of schema.xml and reflects the modifications pushed through the Schema API. This is the case, for example, for custom fields added with the addCustomSchemaInfo.sh script (see the dedicated section Custom Solr configuration).
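As an illustration of what a modification pushed through the Schema API looks like, the sketch below builds an `add-field` command for Solr's Schema API. It does not reproduce what addCustomSchemaInfo.sh does internally. The host, port, collection name and field name (`my_custom_field`) are assumptions to adapt to your deployment, and the request is only built, not sent.

```python
import json
import urllib.request

# Assumptions: host, port and collection name depend on your deployment.
SOLR_URL = "http://localhost:8983/solr"
COLLECTION = "FileShare"


def add_field_command(name, field_type, indexed=True, stored=True):
    """Build the JSON body of a Schema API 'add-field' command."""
    return {
        "add-field": {
            "name": name,
            "type": field_type,
            "indexed": indexed,
            "stored": stored,
        }
    }


def build_request(command):
    """Wrap the command in a POST request to the collection's schema endpoint."""
    body = json.dumps(command).encode("utf-8")
    return urllib.request.Request(
        f"{SOLR_URL}/{COLLECTION}/schema",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # "my_custom_field" is a hypothetical field name used for illustration.
    cmd = add_field_command("my_custom_field", "string")
    req = build_request(cmd)
    print(req.get_method(), req.full_url)
    print(json.dumps(cmd, indent=2))
    # To actually send it (requires a running, reachable Solr):
    # with urllib.request.urlopen(req) as resp:
    #     print(resp.read().decode("utf-8"))
```

Once such a command is accepted, the change appears in managed-schema rather than in the schema.xml template.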