Search Index
Valid from 3.0
The documentation below is valid from Datafari 3.0 upwards
Datafari leverages SolrCloud for its search capabilities. Datafari uses three distributed Solr indexes. The main index, which contains all the crawled documents, is FileShare.
FileShare index
We do not cover the index fields and their configuration exhaustively, as you can check them directly in the schema.xml file, available here:
/opt/datafari/solr/solrcloud/FileShare/conf/schema.xml
Field names
Field Name | Description |
---|---|
title_fr | contains the French title of a document |
title_en | contains the English title of a document |
content_fr | contains the French content of a document (the data is indexed but not stored in the index) |
content_en | contains the English content of a document (the data is indexed but not stored in the index) |
content_hl | contains the content of the document, regardless of language. This field is stored but not indexed, and is used for highlighting. It is truncated by the TruncateFieldUpdateProcessor update processor in the datafari update chain in solrconfig.xml |
source | source of the data (for example, Web) |
last_modified | last modified date of the document |
extension | file format |
allow_token* | These fields are used to store information on the access rights of the document |
deny_token* | These fields are used to store information on the access rights of the document |
suggest | Stores the terms used for autocompletion |
spell | Stores the terms used for spellchecking |
url | URL of the document |
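To show how these fields fit together at query time, here is a minimal sketch that builds a Solr select URL searching title_en, filtering on extension and highlighting on content_hl. The host and port (localhost:8983) and the helper name build_select_url are assumptions made for illustration, and whether each field listed in fl is actually stored should be checked in schema.xml.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

# Assumption: Solr listens on localhost:8983; adjust to your installation.
SOLR_BASE = "http://localhost:8983/solr"


def build_select_url(terms: str, extension: str) -> str:
    """Build a /select URL on the FileShare index (hypothetical helper)."""
    params = {
        "q": f"title_en:({terms})",      # search the English title field
        "fq": f"extension:{extension}",  # restrict results to one file format
        "fl": "url,last_modified",       # fields returned for each hit
        "hl": "true",                    # enable highlighting
        "hl.fl": "content_hl",           # stored field used for highlighting
    }
    return f"{SOLR_BASE}/FileShare/select?{urlencode(params)}"


print(build_select_url("annual report", "pdf"))
```

The URL is only built and printed here; sending it requires a running Datafari instance and, depending on your setup, authentication.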
Field types
The field types for title_* and content_* are text_* (for example, text_en is the fieldType for title_en). This field type contains a specific analyzer for full-text search on English text. The analysis phase is described in the next section. The field type for source is string. The string field type has no analyzer and is used, for example, for faceting.
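To check which field type a given field uses, you can list the field declarations of schema.xml. The sketch below parses a shortened, hypothetical excerpt written in the shape of a Solr schema; the attribute values shown are illustrative and the real ones should be read from the FileShare schema.xml.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened excerpt in the shape of a Solr schema.xml.
# The real declarations live in the FileShare schema.xml.
SAMPLE_SCHEMA = """
<schema name="FileShare" version="1.6">
  <field name="title_en" type="text_en" indexed="true" stored="true"/>
  <field name="content_en" type="text_en" indexed="true" stored="false"/>
  <field name="source" type="string" indexed="true" stored="true"/>
</schema>
"""


def field_types(schema_xml: str) -> dict[str, str]:
    """Map each <field> name to the name of its field type."""
    root = ET.fromstring(schema_xml)
    return {field.get("name"): field.get("type") for field in root.iter("field")}


for name, type_name in field_types(SAMPLE_SCHEMA).items():
    print(f"{name} -> {type_name}")
```

To run it against a real installation, pass the content of /opt/datafari/solr/solrcloud/FileShare/conf/schema.xml to field_types instead of the sample string.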
Data analysis at the indexing phase
As a reminder, a Lucene analysis chain holds a set of components able to tokenise and filter data in order to extract the terms to be stored in the index. There is one analysis chain per field. For each field, the analysis chains in Datafari have been optimized. For instance, for the fields content_lang1 and title_lang1, the field type text_lang1 includes components such as stemming that are specific to the language. Some other components have been added, such as the word_delimiter, which makes it possible to extract a file name from a URL. The LimitToken filter limits the number of terms indexed for each field of each document, so that very big documents are truncated before the inverted index is created. A big index can lead to poor search performance.
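The idea of an analysis chain can be illustrated with a toy pipeline. The sketch below is a simplification written for this page, not the Lucene implementation: it tokenises on whitespace, splits tokens on non-alphanumeric characters (a rough stand-in for a word delimiter), lowercases them, and then keeps only the first terms (a rough stand-in for a token limit). All function names are hypothetical.

```python
import re
from itertools import islice
from typing import Iterable, Iterator


def split_on_delimiters(tokens: Iterable[str]) -> Iterator[str]:
    """Split each token on runs of non-alphanumeric characters."""
    for token in tokens:
        yield from (part for part in re.split(r"[^0-9A-Za-z]+", token) if part)


def lowercase(tokens: Iterable[str]) -> Iterator[str]:
    """Normalise every token to lower case."""
    for token in tokens:
        yield token.lower()


def analyze(text: str, max_tokens: int) -> list[str]:
    """Run the toy chain: tokenise, split, lowercase, then cap the token count."""
    tokens = text.split()                     # whitespace tokeniser
    tokens = split_on_delimiters(tokens)      # word-delimiter-like step
    tokens = lowercase(tokens)                # lowercase filter
    return list(islice(tokens, max_tokens))   # keep only the first max_tokens terms


print(analyze("http://server/docs/Annual_Report.pdf", max_tokens=10))
print(analyze("http://server/docs/Annual_Report.pdf", max_tokens=4))
```

With a limit of 10, the file name parts (annual, report, pdf) become separate terms; with a limit of 4, the terms beyond the fourth are dropped, which is how a token limit keeps very large documents from inflating the index.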
Modifying structure of the index
The structure of the index is defined by a schema. The schema template is stored here: /opt/datafari/solr/solrcloud/FileShare/conf/schema.xml. The schema is then loaded (at first start, or on demand with the updateSolrConfig.sh script) into the ZooKeeper that holds all Solr configurations. When the core is loaded, a new file called managed-schema is created in ZooKeeper. This file is a copy of schema.xml and reflects the modifications pushed through the Schema API, for example the custom fields added with the addCustomSchemaInfo.sh script (see the dedicated section: https://datafari.atlassian.net/wiki/spaces/DATAFARI/pages/16384017 ).
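As an illustration of what a modification pushed through the Schema API looks like, the sketch below prepares a standard Solr add-field request for the FileShare index. It does not reproduce the content of addCustomSchemaInfo.sh, which is not shown on this page. The host and port, the helper name add_field_request and the field name project_code are assumptions chosen for the example.

```python
import json
import urllib.request

# Assumption: Solr listens on localhost:8983; adjust to your installation.
SOLR_BASE = "http://localhost:8983/solr"


def add_field_request(name: str, field_type: str) -> urllib.request.Request:
    """Prepare a Schema API 'add-field' request (hypothetical helper)."""
    payload = {
        "add-field": {
            "name": name,
            "type": field_type,
            "indexed": True,
            "stored": True,
        }
    }
    return urllib.request.Request(
        f"{SOLR_BASE}/FileShare/schema",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# "project_code" is a hypothetical custom field used only for illustration.
request = add_field_request("project_code", "string")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
# To actually send it: urllib.request.urlopen(request)
```

The request is only built and printed, so the example can be run without a Solr server. Once such a request is accepted, the change appears in managed-schema in ZooKeeper, not in the schema.xml template on disk.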