Datafari Vector Search

Valid from Datafari v6.2

This documentation explains how to install, configure and use the Solr Vector Search within the RAG features or through the Datafari API. It is subject to change.

What is vector search and how is it useful?

Vector search relies on a vectorised representation of documents. More precisely, it uses a dense vector representation in our case, since BM25 can itself be seen as a sparse vector search mechanism. Dense vector search with suitable pre-trained sentence transformers handles semantic search better than BM25, which is why vector search is useful in certain scenarios.
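To make the idea of comparing dense vectors concrete, here is a minimal sketch of cosine similarity, the similarity function declared on the Datafari vector field types. The three 4-dimensional vectors are made-up toy values chosen for illustration, not output from a real embeddings model, and the function names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy 4-dimensional "embeddings" (hypothetical values, for illustration only).
query = [0.9, 0.1, 0.0, 0.2]
chunk_close = [0.8, 0.2, 0.1, 0.3]
chunk_far = [0.0, 0.9, 0.8, 0.0]

# Rank the chunks by similarity to the query, most similar first.
ranked = sorted(
    [("chunk_close", chunk_close), ("chunk_far", chunk_far)],
    key=lambda item: cosine_similarity(query, item[1]),
    reverse=True,
)
```

In a real deployment the vectors have hundreds or thousands of dimensions and the ranking is done by Solr, not by client code.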

How to enable vector search features?

Since v6.3, the Solr Vector Search feature comes with a dedicated page in the AdminUI. At each of the steps below, we illustrate how to configure it to use the OpenAI cloud with the model gpt4o-XYZ and your account token ZYXW.

  1. Go to the Extra Functionalities > Solr Vector Search page, in the Admin Menu.

    image-20250521-160754.png
    The textarea is a read-only field that shows the JSON configuration that will be stored in Solr. It can be edited by setting the associated fields above.
    More information about the model configuration is available in the Solr Text-to-vector documentation.
  2. Switch the “Enable vector search” button to “On”.

  3. In the “Select an existing model configuration, or create a new one” list, pick “Add a new embeddings model”.
    If one or more models are already configured in Solr, they appear in this list. You can select one here and skip the model creation (steps 4 to 8), or edit it. You still need to make sure that it is tagged as “Active model” (step 9).

  4. Select a model configuration template (required). Available templates are:
    - OpenAI (for the OpenAI API or any other compatible API) => the one we pick for our example
    - Datafari AI Agent (same interface as OpenAI, but the template’s default values are for the Datafari AI Agent)
    - Hugging Face (for Hugging Face API)
    - Mistral (for Mistral Cloud)
    - Cohere (for Cohere’s API)

    image-20250522-082543.png
  5. Name the model configuration (required).

  • Model configuration names are identifiers and must be unique. If you create a new model configuration with the name of an existing one, the existing one will be overwritten.

  • Only use alphanumerical characters.

  • The default (and recommended) value is “default_model”.

    image-20250522-082445.png
    In our example, we are using the default value: “default_model”.
  6. Write the name of the embeddings model that will be used by the external service (required). Depending on the selected template, a default value is provided.

    image-20250522-082357.png
    In our example, we are using gpt4o-XYZ.
  7. Type the base URL of the external service (required). Depending on the selected template, a default value is provided.

    image-20250522-082653.png
    In our example, we are using the OpenAI API.
  8. Enter your security token in the “API key” field (required). If you are using Datafari AI Agent, use a placeholder key (e.g.: XXXXX).

    image-20250522-082747.png
    For our OpenAI example, our key is “ZYXW”. Use the key available in your OpenAI account.
  9. Set this model as the Solr active embeddings model. Unless you do not plan to use the model you are adding for vector embeddings, you probably want to check this option.

    image-20250522-082839.png
  10. Select a vector field (required). The vector field must match the dimension of the vectors generated by the selected model. If you can’t find the dimension you need among the available fields, consider creating it in the Solr configuration.

    image-20250521-160935.png
    Supposing that our “gpt4o-XYZ” generates 384-dimensional vectors, we will be using the “vector_384” field in our example.
  11. Configure the filters that will be applied during chunking. Content that does not match all requirements will be neither embedded nor indexed into VectorMain. Set a filter to 0 to ignore it.

  12. Save, and wait a few seconds. If everything went fine, the model list should now contain your new model configuration.

    image-20250522-083430.png
    image-20250522-083706.png
    The newly created model configuration now appears in the list. As it has been set as the “active model”, it is automatically selected on page loading.

  13. Launch your job in ManifoldCF. The VectorMain collection should soon be populated with subdocuments containing semantic vectors.

Manual configuration is documented below.

 

1. Load the embeddings model configuration into Solr

Our solution uses the TextToVector Query Parser and TextToVector Update Processor features from the Solr-LLM module. Both require an external service that can run an embeddings model (which may be a Large Language Model, but not necessarily).

  • A model encodes text into a vector, sometimes called an “embedding”.

  • A model in Solr is a reference to an external API that runs the Embedding Model responsible for the text vectorisation.

See Solr “Text to Vector” for more information.

You can use any compatible API, including our Datafari AI Agent, that can be installed onto your Datafari server.

To upload the model declaration for our VectorMain collection, please use:

curl -XPUT 'http://localhost:8983/solr/VectorMain/schema/text-to-vector-model-store' --data-binary "@/path/myModel.json" -H 'Content-type:application/json'

To view all of the declared models for VectorMain:

http://localhost:8983/solr/VectorMain/schema/text-to-vector-model-store

To delete “default_model” from VectorMain:

curl -XDELETE 'http://localhost:8983/solr/VectorMain/schema/text-to-vector-model-store/default_model'

Example of a JSON model for the Datafari AI Agent (see the “Upload” curl request): /path/myModel.json

{
  "class": "dev.langchain4j.model.openai.OpenAiEmbeddingModel",
  "name": "default_model",
  "params": {
    "baseUrl": "http://localhost:8888",
    "apiKey": "xxxxxxxxxx",
    "modelName": "all-MiniLM-L6-v2.Q8_0.gguf",
    "timeout": 60,
    "logRequests": true,
    "logResponses": true,
    "maxRetries": 5
  }
}

  • The recommended name for the embedding model is “default_model”. This is the default value for both the Datafari webapp and the Solr configuration. If you choose a different name, please edit the “texttovector.model” parameter of the VectorMain collection, as well as solr.embeddings.model in rag.properties.

  • Currently, the AI Agent does not require an API key. However, the “apiKey” field must not be empty. Otherwise, Solr will consider the model configuration incorrect.

  • The current default embeddings model is https://huggingface.co/leliuga/all-MiniLM-L6-v2-GGUF. It produces 384-dimensional vectors. Check our AI Agent - Installation and configuration documentation if you plan to use a different model.
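As an alternative to writing the JSON file by hand, the model declaration and the upload request can be sketched in Python. This is a minimal illustration, not part of Datafari: the function names are hypothetical, the URL and parameter values are copied from the curl and JSON examples above, and the request is only prepared, never sent.

```python
import json
import urllib.request

# URL taken from the curl example above; adjust host and port to your setup.
MODEL_STORE_URL = (
    "http://localhost:8983/solr/VectorMain/schema/text-to-vector-model-store"
)

def build_model_config(name, base_url, api_key, model_name):
    """Build a model declaration shaped like the JSON example above."""
    if not api_key:
        # The apiKey field must not be empty, even if the service ignores it.
        raise ValueError("apiKey must not be empty; use a placeholder if needed")
    return {
        "class": "dev.langchain4j.model.openai.OpenAiEmbeddingModel",
        "name": name,
        "params": {
            "baseUrl": base_url,
            "apiKey": api_key,
            "modelName": model_name,
            "timeout": 60,
            "logRequests": True,
            "logResponses": True,
            "maxRetries": 5,
        },
    }

def build_upload_request(config, url=MODEL_STORE_URL):
    """Prepare (but do not send) the PUT request that uploads the model."""
    body = json.dumps(config).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-type": "application/json"},
        method="PUT",
    )

config = build_model_config(
    name="default_model",
    base_url="http://localhost:8888",
    api_key="xxxxxxxxxx",
    model_name="all-MiniLM-L6-v2.Q8_0.gguf",
)
request = build_upload_request(config)
# To actually send it against a running Solr: urllib.request.urlopen(request)
```

Rejecting an empty key on the client side mirrors the note above: Solr would otherwise treat the configuration as incorrect.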

Select the proper vector field:

The vector dimension (the size of the vector) depends directly on your embeddings model. You need to pick the field from the Solr VectorMain schema that matches the dimension of the generated vectors. Datafari provides multiple default fields that you can use. If the vector field you need is not available in VectorMain, you need to create it.

Set the vector field (in this example, vector_384 is a provided vector field with 384 dimensions):

curl -XPOST -H 'Content-type:application/json' -d '{"set-user-property": {"texttovector.outputfield": "vector_384"}}' http://localhost:8983/solr/VectorMain/config

You also need to edit the rag.properties file, providing the name of the proper field:

solr.vector.field=vector_384

Available vector fields

Available vector fields are defined in the schema.xml configuration file. Dimensions are set in the vector fieldType:

<fieldType name="knn_vector_4" class="solr.DenseVectorField" vectorDimension="4" similarityFunction="cosine"/>

Datafari provides a collection of default fields compatible with the current most common dimensions. If the dimension you need is not available, you should add a new DenseVectorField and a new fieldType using the custom Solr configuration feature.

The following fields are currently declared in the Datafari configuration (as of Datafari 6.2):

| Field name  | FieldType       | Dimensions |
|-------------|-----------------|------------|
| vector_4    | knn_vector_4    | 4          |
| vector_256  | knn_vector_256  | 256        |
| vector_384  | knn_vector_384  | 384        |
| vector_512  | knn_vector_512  | 512        |
| vector_768  | knn_vector_768  | 768        |
| vector_1024 | knn_vector_1024 | 1024       |
| vector_1536 | knn_vector_1536 | 1536       |
| vector_3072 | knn_vector_3072 | 3072       |
| vector_4096 | knn_vector_4096 | 4096       |
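
Choosing the field comes down to matching the length of the vectors your model returns against the default fields listed above. A minimal sketch of that check, assuming the `vector_<dimension>` naming pattern of the default fields; the function name is hypothetical and the sample vector is an all-zero stand-in for a real embedding:

```python
# Dimensions of the default vector fields listed above (as of Datafari 6.2).
DEFAULT_DIMENSIONS = (4, 256, 384, 512, 768, 1024, 1536, 3072, 4096)

def vector_field_for(embedding):
    """Return the default Solr field whose dimension matches the embedding."""
    dimension = len(embedding)
    if dimension not in DEFAULT_DIMENSIONS:
        raise ValueError(
            f"no default field for {dimension} dimensions; "
            "a custom DenseVectorField would be needed"
        )
    return f"vector_{dimension}"

# Stand-in for a vector returned by the embeddings service (all zeros here).
sample_embedding = [0.0] * 384
field_name = vector_field_for(sample_embedding)
```

Embedding one short test sentence with your model and passing the result to such a check is a quick way to confirm the dimension before indexing.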

2. Enable the VectorUpdateProcessor

This step must be done before the indexing. When enabled, all indexed documents are chunked into smaller pieces, and each chunk is sent to the VectorMain collection. Chunks are then processed by the Solr TextToVector Update Processor.

Enable the VectorUpdateProcessor:

curl -XPOST -H 'Content-type:application/json' -d '{"set-user-property": {"vector.enabled": "true"}}' http://localhost:8983/solr/FileShare/config

Configure the chunking method:

curl -XPOST -H 'Content-type:application/json' -d '{"set-user-property": {"vector.vector.splitter": "recursiveSplitter"}}' http://localhost:8983/solr/FileShare/config

Available options for chunking methods are: recursiveSplitter (recommended), splitterByParagraph, splitterByCharacter, splitterByLine, splitterBySentence
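
If you script these calls, it can help to validate the splitter name before sending anything to Solr. The sketch below only builds the JSON bodies for the two requests shown above; the property names are copied verbatim from those curl commands, the function name is hypothetical, and the "false" value for disabling the processor is an assumption (the documentation only shows "true").

```python
import json

# Chunking methods listed above.
SPLITTERS = (
    "recursiveSplitter",
    "splitterByParagraph",
    "splitterByCharacter",
    "splitterByLine",
    "splitterBySentence",
)

def file_share_config_bodies(enabled=True, splitter="recursiveSplitter"):
    """Return the JSON bodies to POST to /solr/FileShare/config, one per call."""
    if splitter not in SPLITTERS:
        raise ValueError(f"unknown chunking method: {splitter}")
    return [
        # "false" is assumed to be the disabling value; only "true" is documented.
        json.dumps(
            {"set-user-property": {"vector.enabled": "true" if enabled else "false"}}
        ),
        json.dumps({"set-user-property": {"vector.vector.splitter": splitter}}),
    ]

bodies = file_share_config_bodies(enabled=True, splitter="splitterBySentence")
```

Each string in `bodies` can be passed as the `-d` argument of the corresponding curl command.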

3. Configure chunks text filters

When indexing documents, some portions (or entire documents) may not be worth embedding. To avoid generating embeddings from low-quality or irrelevant content, we provide two configurable chunk text filters:

  • Absolute filter: This filter sets a minimum number of alphanumeric characters required in a chunk content for it to be considered relevant. If the chunk text contains fewer characters than the specified threshold, it will be discarded.

    • Type: Integer

    • Default value: 1

  • Relative filter: This filter checks the ratio of alphanumeric characters relative to the total number of characters in the chunk. Chunks with a lower ratio are discarded.

    • Type: decimal (between 0.0 and 1.0)

    • Default value: 0.0

Use the following command to change the “absolute” filter value (minimum alphanumeric characters):

curl -XPOST -H 'Content-type:application/json' -d '{"set-user-property": {"vector.filter.minchunklength": 80}}' http://localhost:8983/solr/FileShare/config

Use the following command to change the “relative” filter value (minimum alphanumeric ratio):

curl -XPOST -H 'Content-type:application/json' -d '{"set-user-property": {"vector.filter.minalphanumratio": 0.5}}' http://localhost:8983/solr/FileShare/config
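
The two filters can be pictured with the following sketch. It is an illustration of the rules described above, not the actual Datafari implementation: the function names are hypothetical, and "alphanumeric" is approximated with Python's `str.isalnum()`, which may classify some characters differently from Datafari.

```python
def alphanumeric_count(text):
    """Count the alphanumeric characters in a chunk."""
    return sum(1 for ch in text if ch.isalnum())

def keep_chunk(text, min_chunk_length=1, min_alphanum_ratio=0.0):
    """Return True if the chunk passes both the absolute and relative filters.

    Defaults match the documented defaults (1 and 0.0).
    """
    alnum = alphanumeric_count(text)
    # Absolute filter: too few alphanumeric characters, discard the chunk.
    if alnum < min_chunk_length:
        return False
    # Relative filter: share of alphanumeric characters in the whole chunk.
    ratio = alnum / len(text) if text else 0.0
    return ratio >= min_alphanum_ratio

chunks = ["Hello world, this is text", "-----", "a----"]
# With the thresholds 5 and 0.5, only the first chunk is kept.
kept = [c for c in chunks if keep_chunk(c, min_chunk_length=5, min_alphanum_ratio=0.5)]
```

Setting both thresholds to 0 makes every chunk pass, which corresponds to ignoring the filters.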

Discarded chunks do not increment the chunk index used for child document IDs. This guarantees ID continuity in the VectorMain collection.

Example: file:////share/hello_world.txt

4. Edit the rag.properties configuration

The rag.properties file is located here: /opt/datafari/tomcat/conf/rag.properties. Find more information about it here.

| Property | Description |
|----------|-------------|
| solr.enable.vector.search | Enable “Solr Vector Search” in RAG processes. |
| solr.embeddings.model | Default value is “default_model”. Only change it if you want a different name for your model configuration in Solr. |
| solr.vector.field | The name of the Solr vector field used for vector search. Make sure that it is available in the VectorMain schema. (Default: vector_384) |
| solr.topK | The default value for “topK” in Solr vector search. Solr returns the topK most relevant snippets, which defines the number of chunks used to answer RAG queries. (Default: 10) |
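
To show how these properties fit together, the sketch below parses a hypothetical excerpt of rag.properties and assembles a query for Solr's `knn_text_to_vector` query parser, using the local-params syntax from the Solr “Text to Vector” documentation. Whether Datafari builds its queries exactly this way is an assumption; the value `true` for solr.enable.vector.search, the sample question, and the function names are also illustrative, and the parser only handles simple `key=value` lines.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props

# Hypothetical excerpt of rag.properties, using the documented defaults.
SAMPLE = """\
# Vector search settings
solr.enable.vector.search=true
solr.embeddings.model=default_model
solr.vector.field=vector_384
solr.topK=10
"""

def knn_query(props, user_query):
    """Build a knn_text_to_vector query string from the parsed properties."""
    model = props.get("solr.embeddings.model", "default_model")
    field = props.get("solr.vector.field", "vector_384")
    top_k = int(props.get("solr.topK", "10"))
    return f"{{!knn_text_to_vector model={model} f={field} topK={top_k}}}{user_query}"

props = parse_properties(SAMPLE)
query = knn_query(props, "how do I enable vector search")
```

The sketch makes one dependency visible: the model name and the vector field in rag.properties must match what was declared in the VectorMain collection, otherwise the query refers to a model or field that Solr does not know.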

 

Examples of configuration

Here are two examples of Vector Search configuration. For a quick & easy installation:

  • Check “Enable vector search”

  • Select “Add a new embeddings model”.

  • Check all the checkboxes

  • Configure the embeddings model:

| Setting | Embeddings with OpenAI | Embeddings with Datafari AI Agent |
|---------|------------------------|-----------------------------------|
| Requirements | An OpenAI API key | An instance of Datafari AI Agent (can be on the same server) |
| Model configuration template | OpenAI | Datafari AI Agent |
| Model identifier | default_model | default_model |
| Model | text-embedding-3-small | all-MiniLM-L6-v2.Q8_0.gguf |
| Base URL | https://api.openai.com/v1 | http://localhost:8888/ (use this URL if the AI Agent is installed on the same server; otherwise, replace “localhost” with the proper hostname) |
| API key | {YOUR_API_KEY} | XXX (or any string, as long as it is not empty) |
| Vector field | vector_1536 (important for the text-embedding-3-small model!) | vector_384 (important for the all-MiniLM-L6-v2 model!) |
| Maximum chunk size | 300 | 300 |
| Maximum chunk overlap | 0 | 0 |
| Chunking method | Recursive splitter | Recursive splitter |
| Chunk length filter | 50 (optional) | 50 (optional) |
| Chunk ratio filter | 0.5 (optional) | 0.5 (optional) |

  • Save

How does it work?

Work in progress