Valid from 3.0

The documentation below is valid from Datafari 3.0 upwards


Datafari relies on SolrCloud for its search capabilities. It uses three distributed Solr indexes. The main index, which contains all the crawled documents, is FileShare.

FileShare index

This page does not list every index field and its configuration, since you can check them directly in the schema.xml file, available here:

/opt/datafari/solr/solrcloud/FileShare/conf/schema.xml

Field names


Field Name | Description
title_fr | Contains the French title of a document
title_en | Contains the English title of a document
content_fr | Contains the French content of a document (the data is indexed but not stored in the index)
content_en | Contains the English content of a document (the data is indexed but not stored in the index)
content_hl | Contains the content of the document, regardless of language. This field is stored but not indexed, and is used for highlighting. It is truncated by the TruncateFieldUpdateProcessor update processor in the datafari update chain in solrconfig.xml
source | Source of the data (for example, Web)
last_modified | Last modified date of the document
extension | File format
allow_token* | These fields store information on the access rights of the document
deny_token* | These fields store information on the access rights of the document
suggest | Stores the terms used for autocompletion
spell | Stores the terms used for spellchecking
url | URL of the document

Field types

The field types for title_* and content_* are text_* (for example, text_en is the field type for title_en). This field type contains an analyzer specific to full-text search on English text. The analysis phase is described in the next section. The field type for source is string. The string field type does not contain an analyzer and is used, for example, for faceting.
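To check which field type a given field uses, the schema file can be parsed with a few lines of Python. The sketch below is only an illustration: the embedded XML excerpt is a hypothetical, simplified stand-in for the real schema.xml, and the attribute values shown are assumptions to be checked against the actual file.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified excerpt. The real file lives at
# /opt/datafari/solr/solrcloud/FileShare/conf/schema.xml and contains many more entries.
SAMPLE_SCHEMA = """
<schema name="FileShare">
  <field name="title_en" type="text_en"/>
  <field name="content_en" type="text_en" indexed="true" stored="false"/>
  <field name="source" type="string"/>
</schema>
"""


def field_types(schema_xml):
    """Return a mapping of field name -> field type declared in the schema."""
    root = ET.fromstring(schema_xml)
    return {f.get("name"): f.get("type") for f in root.iter("field")}


if __name__ == "__main__":
    for name, ftype in sorted(field_types(SAMPLE_SCHEMA).items()):
        print(f"{name}: {ftype}")
```

To run it against a real installation, read the content of schema.xml from disk and pass it to field_types instead of SAMPLE_SCHEMA.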

Data analysis at the indexing phase

As a reminder, a Lucene analysis chain is a set of components that tokenise and filter data in order to extract the terms to be stored in the index. There is one analysis chain per field. The analysis chains in Datafari have been optimized for each field. For instance, for the fields content_lang1 and title_lang1, the field type text_lang1 includes language-specific components such as stemming. Other components have been added, such as the word_delimiter, which makes it possible to extract a file name from a URL. The LimitToken filter limits the number of terms indexed for each field of each document, so that very large documents are truncated before the inverted index is built. A large index can lead to poor search performance.
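The idea of a chain of components can be illustrated with a toy pipeline in plain Python. This is a rough sketch of the concept only, not Lucene's implementation: the function names are hypothetical, the splitting rule is a crude stand-in for a word delimiter filter, and stemming is left out.

```python
import re


def tokenize(text):
    """Split the raw text on whitespace."""
    return text.split()


def word_delimiter(tokens):
    """Split each token on non-alphanumeric characters (crude stand-in for a word delimiter filter)."""
    parts = []
    for token in tokens:
        parts.extend(p for p in re.split(r"[^0-9A-Za-z]+", token) if p)
    return parts


def lowercase(tokens):
    """Normalise case so that 'Report' and 'report' produce the same term."""
    return [t.lower() for t in tokens]


def limit_tokens(tokens, max_count):
    """Keep only the first max_count tokens, mimicking a token limit on large documents."""
    return tokens[:max_count]


def analyze(text, max_tokens=10):
    """Run the components in order and return the terms that would be indexed."""
    tokens = tokenize(text)
    tokens = word_delimiter(tokens)
    tokens = lowercase(tokens)
    return limit_tokens(tokens, max_tokens)


if __name__ == "__main__":
    print(analyze("file://server/share/Annual_Report.pdf"))
```

With this toy chain, the parts of the file name (annual, report, pdf) become separate terms, and lowering max_tokens shows how a limit cuts off the end of a long document.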

Modifying structure of the index

The structure of the index is defined by a schema. The schema template is stored here: /opt/datafari/solr/solrcloud/FileShare/conf/schema.xml. The schema is then loaded (at first start, or on demand with the updateSolrConfig.sh script) into ZooKeeper, which holds all Solr configurations. When the core is loaded, a new file called managed-schema is created in ZooKeeper. This file is a copy of schema.xml and reflects the modifications pushed through the Schema API. For example, custom fields are added this way by the addCustomSchemaInfo.sh script (see the dedicated section Custom Solr configuration).
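As an illustration of what a modification pushed through the Schema API looks like, the sketch below builds an add-field request with the Python standard library. It is a minimal example under stated assumptions: the base URL (http://localhost:8983/solr), the field name my_custom_field and the helper build_add_field_request are hypothetical, and nothing here describes what addCustomSchemaInfo.sh actually does internally.

```python
import json
import urllib.request


def build_add_field_request(base_url, collection, name, field_type, stored=True, indexed=True):
    """Build (without sending) a Solr Schema API request that adds one field."""
    payload = {
        "add-field": {
            "name": name,
            "type": field_type,
            "stored": stored,
            "indexed": indexed,
        }
    }
    return urllib.request.Request(
        f"{base_url}/{collection}/schema",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Assumed host and port; adjust to the actual Solr endpoint of the installation.
req = build_add_field_request(
    "http://localhost:8983/solr", "FileShare", "my_custom_field", "string"
)

if __name__ == "__main__":
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # To actually send the request: urllib.request.urlopen(req)
```

The request is only built, not sent, so the script can be run safely to inspect the payload before anything is applied to a live index.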

