Datafari Vector Search

Valid from Datafari v7.0

This documentation explains how to install, configure and use the Solr Vector Search within the RAG features or through the Datafari API. It is subject to change.

Introduction

What is vector search and how is it useful?

Vector search relies on a vectorised representation of documents, more precisely a dense vector representation in our case, since BM25 can be seen as a sparse vector search mechanism. Dense vector search based on certain pre-trained sentence transformers handles semantic search better than BM25, which is why vector search is useful in certain scenarios.
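As a toy illustration of the idea, here is a minimal sketch using the sentence-transformers library and the all-MiniLM-L6-v2 model (the same model used in the Datafari AI Agent example below); the sentences are made up for the demo:

from sentence_transformers import SentenceTransformer, util

# all-MiniLM-L6-v2 produces 384-dimension dense vectors.
model = SentenceTransformer("all-MiniLM-L6-v2")

# No keyword overlap, so BM25 would score this pair poorly,
# yet the dense vectors land close together in the embedding space.
query = "How do I reset my password?"
passage = "Credentials can be changed from the account settings page."

query_vec, passage_vec = model.encode([query, passage])
print(util.cos_sim(query_vec, passage_vec))  # relatively high cosine similarity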

Prerequisites

In order to configure the Vector Search in Datafari, you need:

  • A compatible external service that can run an embeddings model (text vectorizer). It can either be:

    • The OpenAI API (requires an OpenAI API key)

    • An instance of Datafari AI Agent (can be installed on the same server as Datafari).

    • Any other OpenAI-compatible API that can run vector embeddings

    • Mistral Cloud API (requires a Mistral Cloud API key)

    • A Cohere API

    • The Hugging Face API (requires a Hugging Face access token)

  • Your embeddings model must be configured in Datafari AdminUI.

Configure an embeddings model

Before starting the embeddings of your content, you must configure an Active Embeddings Model. You can configure multiple Embeddings Models, but only the active one is used for embeddings and vector search.

If you change the Active Model, you may need to start an Embeddings Job to vectorize your content with the new model.

Here is how to configure an embeddings model:

  1. Go to the Vector Search > Embeddings Models page, in the Admin Menu.

    95b717fa-57d0-4820-933a-e1efc486a71a.png
  2. Add a new model.

  3. Name your model. This name must only contain alphanumeric characters, “-”, and “_”. It must be unique, since it is used as an identifier.

  4. Select the Solr vector field. Vector fields are dynamic fields in Solr. It is crucial that the selected vector matches the dimension of the vectors generated by the model.
    Vector field names use the following structure:

    vector_<dimensions>_*

    Examples: “vector_384_daia” (384 dimensions), “vector_1536_openai” (1536 dimensions).
    Here is the list of the available fields (replace the “*” by any alphanumerical String, such as your model ID):
    vector_4_*, vector_256_*, vector_384_*, vector_512_*, vector_768_*, vector_1024_*, vector_1536_*, vector_3072_*, vector_4096_*
    We highly recommend using unique vector fields for your different embeddings models to avoid conflicts or overwriting.
    If you can’t find the dimension you need in the list, consider creating it in Solr configuration.

More information about the Solr Vector fields can be found in the Troubleshooting section.
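One practical way to find the right dimension is to request a sample embedding from your service and measure the vector length. A minimal sketch against an OpenAI-compatible /v1/embeddings endpoint (the URL, key and model name below are placeholders to adapt):

import requests

BASE_URL = "https://api.openai.com/v1"   # placeholder: your embeddings service
API_KEY = "YOUR_API_KEY"                 # placeholder
MODEL = "text-embedding-3-small"         # placeholder

resp = requests.post(
    f"{BASE_URL}/embeddings",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"model": MODEL, "input": "dimension probe"},
    timeout=30,
)
resp.raise_for_status()
dims = len(resp.json()["data"][0]["embedding"])
# e.g. 1536 for text-embedding-3-small -> pick a vector_1536_* field
print(f"The model returns {dims}-dimension vectors -> use a vector_{dims}_* field")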

  5. In “Interface Type”, select the type of service. Available templates are:

    1. OpenAI (for the OpenAI API or any other compatible API) => the one we pick for our example

    2. Datafari AI Agent (same interface as OpenAI, but the template’s default values are set for the Datafari AI Agent)

    3. Hugging Face (for the Hugging Face API)

    4. Mistral (for Mistral Cloud)

    5. Cohere (for Cohere’s API)

  6. Configure the model: Once the Interface Type is selected, the form extends to show specific parameters (Base URL, security token…). Fill in the form and click the “Save” button.

  7. (Optional) If you have multiple models, you can use the “Active Model” list to pick the “Active Model”, the one that will be used for embeddings and vector search.

Configure chunking (optional)

image-20251022-125020.png

This configuration section is optional (chunking itself is mandatory, but it has a default configuration, hence the optional aspect of this step). During indexing, in order to allow Vector Search, documents must be chunked into multiple sub-documents. That task is processed by the “VectorUpdateProcessor” when Vector Search is enabled in the job’s “Vector Search Connector”. The chunking can be configured in the dedicated “Vector Search” > “Chunking Configuration” AdminUI.

  1. Maximum size of the chunks:

The size, in tokens, of the chunks. We use the OpenAI tokenizer to estimate the size of chunks. Make sure that chunks are not too large for your embeddings model. (default: 300)

  2. Maximum size of the chunks overlap:

If greater than 0, chunks may overlap. (default: 0)

  3. Chunk filters:

Filter the chunks based on:

  • a minimum length, in characters (default: 1)

  • a minimum alphanumeric characters ratio, between 0 and 1 (default: 0)
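To make these settings concrete, here is a rough sketch of token-based chunking with overlap, using the tiktoken OpenAI tokenizer. Datafari's actual chunking (a recursive splitter sized with the Langchain4j OpenAI tokenizer) is smarter about sentence boundaries; this only illustrates how the size and overlap parameters interact:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # an OpenAI tokenizer

def chunk_text(text: str, max_tokens: int = 300, overlap: int = 0) -> list[str]:
    """Naive sliding-window chunking over tokens (defaults match Datafari's)."""
    tokens = enc.encode(text)
    step = max(max_tokens - overlap, 1)
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(enc.decode(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break
    return chunks

# With overlap=50, consecutive chunks share 50 tokens at their boundary.
print(len(chunk_text("word " * 1000, max_tokens=300, overlap=50)))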

Configure your crawling job

In order to use Vector Search in Datafari, your documents must be chunked. For that, you must include the Vector Search Connector in your ManifoldCF crawler job.

If you are using the Simplified Job Creation, you can check the “Enable Vector Search” and “Enable Vector Embeddings at Indexing” options to automatically configure the Vector Search Connector. This is available for Filer Jobs, Web Jobs and Database Jobs.

image-20251022-134355.png
  • Enable Vector Search for crawled documents: This option MUST be checked to allow vector search for crawled documents. If checked, indexed documents will be chunked. The chunks are stored in the VectorMain Solr collection.

  • Enable Embeddings at Indexing: Use this option to automatically start embeddings of the chunks as soon as they are indexed. If this option is not checked, you can still start the embeddings using the “Vector Search” > “Vector Embeddings Management” AdminUI (section Vector Embeddings Management).

    • The embeddings can significantly increase the processing time, depending on the Active Embeddings Model.

    • If some chunks failed to be embedded for any reason (bad configuration, timeout exception, network error…), you can use the Embeddings Job to retry failed embeddings.

Vector Embeddings Management

This feature is available in the “Vector Search” > “Vector Embeddings Management” AdminUI. From this page, you can launch the Vector Embeddings job. As explained above, this step is not necessary if you have checked the box “Enable Embeddings at Indexing”.

image-20251022-140304.png

Simply press “Start Embeddings” to run the job.

If the “Force” option is checked, EVERY chunk from VectorMain will be vectorized, including those that have already been embedded.

The progress bar indicates the number of chunks (in VectorMain) that have been vectorized by the Active Model, out of the total number of chunks in VectorMain.
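The same figures can be reproduced with two Solr counts, since embedded chunks carry the Active Vector Field name in their has_vector field (see the workflow below). A sketch, assuming direct access to Solr on its default port and vector_384_daia as the active field:

import requests

SOLR_SELECT = "http://localhost:8983/solr/VectorMain/select"  # assumption: default Solr port
ACTIVE_FIELD = "vector_384_daia"                              # the Active Model's vector field

def count(query: str) -> int:
    resp = requests.get(SOLR_SELECT, params={"q": query, "rows": 0, "wt": "json"}, timeout=30)
    resp.raise_for_status()
    return resp.json()["response"]["numFound"]

total = count("*:*")
embedded = count(f"has_vector:{ACTIVE_FIELD}")
print(f"{embedded}/{total} chunks embedded ({100 * embedded / max(total, 1):.1f}%)")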

How does it work?

When the button is clicked, the “Atomic Update” tool is used to start the pre-configured “VECTOR” (or “VECTOR_FORCE”, if the “force” option is used) job.

This job crawls all chunks from VectorMain that have NOT been embedded by the current active model, using the “/select/not-embedded” Solr Search Handler. If the “force” option is checked, the “/opensearch” handler is used instead to retrieve ALL chunks.
Then, the job sends Atomic Update requests for each chunk, using the “/update/embed” Solr Update Handler. The content of each chunk is sent to the Active Embeddings Model, vectorized, and the resulting vector is stored in the Vector Field associated with the Active Embeddings Model.

Examples of model configuration

Here are two examples of Vector Search configuration, for a quick & easy installation: one with OpenAI and one with the Datafari AI Agent.

Embeddings with OpenAI

Requirements

  • An OpenAI API key

Configuration

Embeddings Model Configuration

  • Model configuration template: OpenAI

  • Model identifier: openai

  • Model: text-embedding-3-small

  • Base URL: https://api.openai.com/v1

  • API key: {YOUR_API_KEY}

  • Vector field: vector_1536_openai (Important for the text-embedding-3-small model!)

Chunking Configuration

  • Maximum chunk size: 300

  • Maximum chunk overlap: 0

  • Chunking method: Recursive splitter

  • Chunk length filter: 50 (Optional)

  • Chunk ratio filter: 0.5 (Optional)

Embeddings with Datafari AI Agent

Requirements

  • An instance of Datafari AI Agent (can be on the same server as Datafari)

Configuration

Embeddings Model Configuration

  • Model configuration template: Datafari AI Agent

  • Model identifier: daia

  • Model: all-MiniLM-L6-v2.Q8_0.gguf

  • Base URL: http://localhost:8888/
    (Use this URL if the AI Agent is installed on the same server. Otherwise, replace “localhost” with the proper hostname)

  • API key: XXX (or any String, as long as it is not empty)

  • Vector field: vector_384_daia (Important for the all-MiniLM-L6-v2 model!)

Chunking Configuration

  • Maximum chunk size: 300

  • Maximum chunk overlap: 0

  • Chunking method: Recursive splitter

  • Chunk length filter: 50 (Optional)

  • Chunk ratio filter: 0.5 (Optional)

How does it work?

The vector search has three parts: indexing, embeddings, and searching.

  • Indexing is everything that happens during the execution of the ManifoldCF crawler. It includes chunking and an optional embeddings phase.

  • Embeddings is the conversion of the chunks’ content into semantic vectors. It can be dissociated from the indexing, allowing the use of multiple models.

  • The generated vectors can be used to run semantic “vector search” queries in Datafari.

Indexing

When “Vector Search” is enabled in the Vector Search Transformation Connector, documents indexed into FileShare by a ManifoldCF job are processed by the VectorUpdateProcessor. This processor divides each document into smaller sub-documents. This step is called chunking. The size and overlap of the chunks can be configured from the AdminUI (default: 300 tokens / 0 overlap).

While the whole document is normally indexed in the FileShare collection, the chunks are sent to a separate Solr collection: VectorMain.

image-20251022-153227.png
Document chunking during indexing process
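For illustration, the relationship between a crawled document and its chunks looks like this (the chunk ID pattern matches the alphabet.pdf_0 example used later on this page; the field names are simplified assumptions):

# Whole document, indexed as usual in the FileShare collection.
parent = {
    "id": "file://share/alphabet.pdf",
    "content": "A long document about the alphabet...",
}

# Chunks indexed in the separate VectorMain collection: one sub-document
# per chunk, identified by the parent name plus a chunk index.
chunks = [
    {"id": "alphabet.pdf_0", "content": "First ~300-token slice..."},
    {"id": "alphabet.pdf_1", "content": "Next slice, overlapping if configured..."},
]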

If the “Enable embeddings at indexing” option is enabled in the ManifoldCF job’s configuration, the chunks will be embedded by the TextToVectorUpdateProcessor as soon as they are received by VectorMain. They are then indexed, even if the embeddings failed.

The chunk size is calculated using the Langchain4j OpenAI tokenizer. This means that the evaluated size (in tokens) may differ from the actual size as estimated by the embeddings model’s own tokenizer. This may cause an error if you allow large chunks, depending on the model. We may be able to provide different tokenizers in the future.

Embeddings

Once the document chunks are stored in the VectorMain collection, each one is enriched with its vector representation. This process is done in a second step, using the Atomic Update mechanism.

The Vector Embeddings job using Atomic Update is optional if “Enable Embeddings at Indexing” is on. However, it can still be useful to rerun failed embeddings.

A dedicated Atomic Update job reads documents from VectorMain. For each document:

  • An update request is sent to Solr’s Atomic Update API.

  • The document is processed by the “TextToVectorUpdateProcessor”. The text content is sent to the configured embeddings model (Datafari AI Agent, OpenAI-compatible API, Cohere, Hugging Face…).

  • The returned vector is added to the document as a new vector_* field (the “Active Field”, as defined in the configuration of the “Active Model”).

Embeddings can be launched and monitored from the AdminUI, or started manually on the server using the following commands:

cd /opt/datafari/bin/atomicupdates
sudo bash atomic-updates-launcher.sh VECTOR full

The VECTOR job is defined by default in the Atomic Update configuration file: atomicUpdate-cfg.json

"VECTOR": { "searchHandler": "/select/not-embedded", # Select only not-embedded chunks, bypassing security layer (Enterprise Edition) "updateHandler": "/update/embed", # Forcing the use of the TextToVectorUpdateProcessor chain "source": { "baseUrl": "localhost:2181", "solrCollection": "VectorMain" # Retrieves chunks from VectorMain }, "destination": { "baseUrl": "localhost:2181", "solrCollection": "VectorMain" # Sends update requests to VectorMain }, "fieldsOperation": { "vectorize": "set" # Required to trigger embeddings. }, "nbDocsPerBatch": 100, # Setting an exceeding value here may result in timeout exceptions "fieldsMapping": {} }, "VECTOR_FORCE": { "searchHandler": "/opensearch", "updateHandler": "/update/embed", "source": { "baseUrl": "localhost:2181", "solrCollection": "VectorMain" }, "destination": { "baseUrl": "localhost:2181", "solrCollection": "VectorMain" }, "fieldsOperation": { "vectorize": "set" }, "nbDocsPerBatch": 100, "fieldsMapping": { } }
image-20251022-152145.png
Vector Embeddings workflow, using the Atomic Update job.
  1. The VECTOR job retrieves not-embedded chunks from VectorMain, using the “/select/not-embedded” search handler. That handler excludes chunks that already have the name of the Active Embeddings Model’s field in their “has_vector” field.

  2. The VECTOR job prepares atomic update requests, with 100 documents per batch. Only the id and the vectorize fields are required for each document update. Requests are sent to the “/update/embed” handler of the VectorMain collection.

  3. Incoming requests are processed by the TextToVectorUpdateProcessor. The chunk’s content is sent to the embedding model, which returns a vector stored in the document’s active vector field.

  4. If the chunk has been successfully vectorized, the TextToVectorUpdateProcessor adds the name of the Active Vector Field[1] to the “has_vector” multivalued field of the document, to prevent any unnecessary re-embeddings.

  5. The documents are updated in VectorMain with the new vector values, and an updated “has_vector” field.
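For illustration, one batch of such atomic update requests could look like the sketch below (the endpoint and the id/vectorize fields come from the description above; the exact value set on vectorize and the Solr port are assumptions):

import requests

# Assumption: default Solr port; the job targets VectorMain's /update/embed handler.
UPDATE_URL = "http://localhost:8983/solr/VectorMain/update/embed"

# Only the id and the vectorize fields are required; the "set" operation is
# what routes the documents through the TextToVectorUpdateProcessor chain
# (the job sends batches of 100 documents).
batch = [
    {"id": "alphabet.pdf_1", "vectorize": {"set": True}},
    {"id": "alphabet.pdf_2", "vectorize": {"set": True}},
]

resp = requests.post(UPDATE_URL, json=batch, params={"commit": "true"}, timeout=60)
resp.raise_for_status()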

In the example above, the chunk “alphabet.pdf_0” already contains a vector, while the others don’t. A chunk may lack a vector in several scenarios:

  • The document was indexed after the last embedding job and has not yet been embedded.

  • A previous embedding attempt failed (e.g., timeout, API error). These documents will be processed in the next run.

  • The document itself has been updated, and the incremental crawl has therefore deleted and re-created its chunks in VectorMain.

[1] The active vector field is a DenseVectorField in VectorMain’s documents that is used for embeddings and vector search in the default configuration. It can be configured in the AdminUI, in the Active Model configuration.

Vector search

Once documents in VectorMain contain a vector, users can search using the “/vector” search handler (which uses the TextToVector query parser) to run a semantic search.

When a search query is received:

  1. The query is converted to a vector using the same embedding model.

  2. Solr runs a KNN query on the active vector field using the provided TextToVector syntax. The nearest-neighbor computation is optimized with SeededKnnVectorQuery, and the filtering is optimized using the ACORN algorithm.

  3. The topK most relevant documents (from VectorMain) are returned as results.

Example of a vector search query in Solr:

curl "https://[DATAFARI_HOST]/solr/VectorMain/vector?queryrag=What%20are%20the%20first%20letters%20of%20the%20alphabet%20%3F&topK=2&vectorField=vector_1024&model=default_model"

Parameters:

  • queryrag: The user query (can be semantic or keywords)

  • topK: The number of results returned.

  • vectorField: The vector field used for vector distance calculation.

  • model: The model used for the query embeddings. Must be the same one that generated the vector in vectorField.
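The same query as the curl example, as a Python sketch (the response is assumed to follow the standard Solr shape; host, field and model names come from the example above):

import requests

params = {
    "queryrag": "What are the first letters of the alphabet ?",
    "topK": 2,
    "vectorField": "vector_1024",
    "model": "default_model",
}
resp = requests.get("https://DATAFARI_HOST/solr/VectorMain/vector", params=params, timeout=30)
resp.raise_for_status()
for doc in resp.json()["response"]["docs"]:  # standard Solr response shape assumed
    print(doc.get("id"))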

image-20251022-151847.png
Vector Search process

 

  1. Datafari sends a query to VectorMain’s /vector search handler.

  2. The TextToVector Query Parser transforms the query (e.g. “What are the first letters of the alphabet?”) into a semantic vector, using the configured embeddings model.

  3. A KNN similarity search is executed on the vectorField.

  4. The topK most similar documents are returned, ranked by vector similarity.

The same model must be used at both indexing and query time to ensure accurate semantic retrieval.

Full API reference and query examples are available at https://datafari.atlassian.net/wiki/spaces/DATAFARI/pages/1672871937 in the documentation.

Search configuration and optimization

Vector Search can be optimized with optional features that can be enabled and configured in the “Vector Search” > “Vector Search Configuration” AdminUI.

image-20260409-092317.png

  • TopK for Vector Search (solr.topK)

    Description: The default topK parameter for vector search: the number of k-nearest results to return.

    Recommended/default value: 10

    Type: Integer, min 1

  • TopK for Hybrid Search (rrf.topK)

    Description: The default topK parameter for hybrid search: the number of k-nearest results to return. It is also used to set the number of results of the BM25 part of the hybrid search. With default values, hybrid search will retrieve 50 chunks with BM25 search and 50 chunks with vector search, then merge all these chunks with the RRF algorithm to only return the best 10 results.

    Recommended/default value: 50

    Type: Integer, min 1. Must be greater than solr.topK.

  • RRF Rank Constant (rrf.rank.constant)

    Description: A constant used in the re-ranking of results by the RRF algorithm. The score of a result is based on the following equation: score = Σ (1 / (rrf.rank.constant + rank))

    Recommended/default value: 60

    Type: Integer, min 1

  • Enable ACORN (solr.enable.acorn)

    Description: ACORN is an algorithm designed to make hybrid searches consisting of a filter and a vector search more efficient. This approach tackles the performance limitations of both pre- and post-filtering. It modifies the construction of the HNSW graph and the search on it. Source: https://solr.apache.org/guide/solr/latest/query-guide/dense-vector-search.html

    Recommended/default value: true

    Type: Boolean

  • Filtered Search Threshold (solr.filtered.search.threshold)

    Description: If the percentage of documents that satisfy the filter is less than the threshold, ACORN will be used.

    Recommended/default value: 60

    Type: Integer, from 0 (never use ACORN) to 100 (always use ACORN)

  • Enable LADR (solr.enable.ladr)

    Description: Use SeededKnnVectorQuery to initiate the entry points in the HNSW graph with a “seed query”, in order to improve the relevancy of the results. Source: https://solr.apache.org/guide/solr/latest/query-guide/dense-vector-search.html

    Recommended/default value: true

    Type: Boolean
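To make the RRF merge concrete, here is a small sketch applying the score equation above to two ranked lists of chunk IDs (BM25 results and vector results), with the default rrf.rank.constant of 60:

from collections import defaultdict

def rrf_merge(bm25_ids: list[str], vector_ids: list[str],
              rank_constant: int = 60, top_k: int = 10) -> list[str]:
    """score(doc) = sum over result lists of 1 / (rank_constant + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranked_list in (bm25_ids, vector_ids):
        for rank, doc_id in enumerate(ranked_list, start=1):
            scores[doc_id] += 1.0 / (rank_constant + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# "b" ranks well in both lists, so it beats "a", which only tops the BM25 list.
print(rrf_merge(["a", "b", "c"], ["b", "d", "a"], top_k=3))  # ['b', 'a', 'd']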

Troubleshooting

How to pick the vector field?

The generated vector dimensions (size of the vector) directly depend on your embeddings model. You need to pick the proper field from the Solr VectorMain schema, matching the generated vectors’ dimensions. Datafari provides multiple default fields that you can use. If the vector field you need is not available in VectorMain (see the table below), you need to create it.

Field name                    FieldType          Dimensions

vector_4, vector_4_*          knn_vector_4       4
vector_256, vector_256_*      knn_vector_256     256
vector_384, vector_384_*      knn_vector_384     384
vector_512, vector_512_*      knn_vector_512     512
vector_768, vector_768_*      knn_vector_768     768
vector_1024, vector_1024_*    knn_vector_1024    1024
vector_1536, vector_1536_*    knn_vector_1536    1536
vector_3072, vector_3072_*    knn_vector_3072    3072
vector_4096, vector_4096_*    knn_vector_4096    4096

Important: Any vector field’s name must start with “vector” to be recognized in the Admin UI.

How do the chunk filters work?

When indexing documents, some portions (or entire documents) may not be worth embedding. To avoid generating embeddings from low-quality or irrelevant content, we provide two configurable chunk text filters:

  • Absolute filter: This filter sets a minimum number of alphanumeric characters required in a chunk’s content for it to be considered relevant. If the chunk text contains fewer such characters than the specified threshold, it will be discarded.

    • Type: Integer

    • Default value: 1

  • Relative filter: This filter checks the ratio of alphanumeric characters relative to the total number of characters in the chunk. Chunks with a lower ratio are discarded.

    • Type: decimal (between 0.0 and 1.0)

    • Default value: 0.0

Discarded chunks do not increment the chunk index used for child document IDs. This guarantees ID continuity in the VectorMain collection.

Example: file:////share/hello_world.txt
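A sketch of the two filters’ logic, with the stricter values suggested in the configuration examples above (50 characters / 0.5 ratio); the actual implementation lives in the VectorUpdateProcessor:

def keep_chunk(text: str, min_alnum: int = 1, min_ratio: float = 0.0) -> bool:
    """Absolute filter: at least min_alnum alphanumeric characters.
    Relative filter: alphanumeric / total characters >= min_ratio."""
    alnum = sum(c.isalnum() for c in text)
    if alnum < min_alnum:
        return False
    return len(text) == 0 or alnum / len(text) >= min_ratio

# A table-of-contents line of leader dots is discarded before embedding.
print(keep_chunk("Introduction ................ 12", min_alnum=50, min_ratio=0.5))  # False
print(keep_chunk("The quick brown fox jumps over the lazy dog near the riverbank today", 50, 0.5))  # True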


Vector search is using a vectorised representation of documents. More precisely, a dense vector representation in our case, since one could see BM25 as a sparse vector search mechanism. Dense vector search using certain pre-trained sentence transformers allows to manage semantic search, better than BM25, that is why vector search is useful in certain scenarios.