Datafari Vector Search

Valid from Datafari v7.0

This documentation explains how to install, configure and use the Solr Vector Search within the RAG features or through the Datafari API. It is subject to change.

Introduction

What is vector search and how is it useful?

Vector search relies on a vectorised representation of documents, more precisely a dense vector representation in our case, since BM25 can be seen as a sparse vector search mechanism. Dense vector search based on certain pre-trained sentence transformers handles semantic search better than BM25, which is why vector search is useful in certain scenarios.
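As a toy illustration of the idea, here is a minimal sketch using the sentence-transformers library and the all-MiniLM-L6-v2 model (the same model used in the Datafari AI Agent example below); the sentences are made up for the demo:

from sentence_transformers import SentenceTransformer, util

# all-MiniLM-L6-v2 produces 384-dimension dense vectors.
model = SentenceTransformer("all-MiniLM-L6-v2")

# No keyword overlap, so BM25 would score this pair poorly,
# yet the dense vectors land close together in the embedding space.
query = "How do I reset my password?"
passage = "Credentials can be changed from the account settings page."

query_vec, passage_vec = model.encode([query, passage])
print(util.cos_sim(query_vec, passage_vec))  # relatively high cosine similarity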

Prerequisites

In order to configure the Vector Search in Datafari, you need:

  • A compatible external service that can run an embeddings model (text vectorizer). It can either be:

    • The OpenAI API (requires an OpenAI API key)

    • An instance of Datafari AI Agent (can be installed on the same server as Datafari).

    • Any other OpenAI-compatible API that can run vector embeddings

    • Mistral Cloud API (requires a Mistral Cloud API key)

    • A Cohere API

    • The Hugging Face API (requires a Hugging Face access token)

  • Your embeddings model must be configured in Datafari AdminUI.

Configure an embeddings model

Before starting the embeddings of your content, you must configure an Active Embeddings Model. You can configure multiple Embeddings Models, but only the active one is used for embeddings and vector search.

If you change the Active Model, you may need to start an Embeddings Job to vectorize your content with the new model.

Here is how to configure an embeddings model:

  1. Go to the Vector Search > Embeddings Models page, in the Admin Menu.

    95b717fa-57d0-4820-933a-e1efc486a71a.png
  2. Add a new model.

  3. Name your model. This name must only contain alphanumeric characters, “-”, and “_”. It must be unique, since it is used as an identifier.

  4. Select the Solr vector field. Vector fields are dynamic fields in Solr. It is crucial that the selected vector matches the dimension of the vectors generated by the model.
    Vector field names use the following structure:

    vector_<dimensions>_*

    Examples: “vector_384_daia” (384 dimensions), “vector_1536_openai” (1536 dimensions).
    Here is the list of the available fields (replace the “*” by any alphanumerical String, such as your model ID):
    vector_4_*, vector_256_*, vector_384_*, vector_512_*, vector_768_*, vector_1024_*, vector_1536_*, vector_3072_*, vector_4096_*
    We highly recommend using unique vector fields for your different embeddings models to avoid conflicts or overwriting.
    If you can’t find the dimension you need in the list, consider creating it in Solr configuration.

More information about the Solr Vector fields can be found in the Troubleshooting section.
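One practical way to find the right dimension is to request a sample embedding from your service and measure the vector length. A minimal sketch against an OpenAI-compatible /v1/embeddings endpoint (the URL, key and model name below are placeholders to adapt):

import requests

BASE_URL = "https://api.openai.com/v1"   # placeholder: your embeddings service
API_KEY = "YOUR_API_KEY"                 # placeholder
MODEL = "text-embedding-3-small"         # placeholder

resp = requests.post(
    f"{BASE_URL}/embeddings",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"model": MODEL, "input": "dimension probe"},
    timeout=30,
)
resp.raise_for_status()
dims = len(resp.json()["data"][0]["embedding"])
# e.g. 1536 for text-embedding-3-small -> pick a vector_1536_* field
print(f"The model returns {dims}-dimension vectors -> use a vector_{dims}_* field")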

  5. In “Interface Type”, select the type of service. Available templates are:

    1. OpenAI (for the OpenAI API or any other compatible API) => the one we pick for our example

    2. Datafari AI Agent (same interface as OpenAI, but the template’s default values are set for the Datafari AI Agent)

    3. Hugging Face (for the Hugging Face API)

    4. Mistral (for Mistral Cloud)

    5. Cohere (for Cohere’s API)

  6. Configure the model: Once the Interface Type is selected, the form extends to show specific parameters (Base URL, security token…). Fill in the form and click the “Save” button.

  7. (Optional) If you have multiple models, you can use the “Active Model” list to pick the “Active Model”, the one that will be used for embeddings and vector search.

Configure chunking (optional)

image-20251022-125020.png

This configuration section is optional (chunking itself is mandatory, but it has a default configuration, hence the optional aspect of this step). During indexing, in order to allow Vector Search, documents must be chunked into multiple sub-documents. That task is processed by the “VectorUpdateProcessor” when Vector Search is enabled in the job’s “Vector Search Connector”. The chunking can be configured in the dedicated “Vector Search” > “Chunking Configuration” AdminUI.

  1. Maximum size of the chunks:

The size, in tokens, of the chunks. We use the OpenAI tokenizer to estimate the size of chunks. Make sure that chunks are not too large for your embeddings model. (default: 300)

  2. Maximum size of the chunks overlap:

If greater than 0, chunks may overlap. (default: 0)

  3. Chunk filters:

Filter the chunks based on:

  • a minimum length, in characters (default: 1)

  • a minimum alphanumeric characters ratio, between 0 and 1 (default: 0)
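To make these settings concrete, here is a rough sketch of token-based chunking with overlap, using the tiktoken OpenAI tokenizer. Datafari's actual chunking (a recursive splitter sized with the Langchain4j OpenAI tokenizer) is smarter about sentence boundaries; this only illustrates how the size and overlap parameters interact:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # an OpenAI tokenizer

def chunk_text(text: str, max_tokens: int = 300, overlap: int = 0) -> list[str]:
    """Naive sliding-window chunking over tokens (defaults match Datafari's)."""
    tokens = enc.encode(text)
    step = max(max_tokens - overlap, 1)
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(enc.decode(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break
    return chunks

# With overlap=50, consecutive chunks share 50 tokens at their boundary.
print(len(chunk_text("word " * 1000, max_tokens=300, overlap=50)))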

Configure your crawling job

In order to use Vector Search in Datafari, your documents must be chunked. For that, you must include the Vector Search Connector in your ManifoldCF crawler job.

If you are using the Simplified Job Creation, you can check the “Enable Vector Search” and “Enable Vector Embeddings at Indexing” options to automatically configure the Vector Search Connector. This is available for Filer Jobs, Web Jobs and Database Jobs.

image-20251022-134355.png
  • Enable Vector Search for crawled documents: This option MUST be checked to allow vector search for crawled documents. If checked, indexed documents will be chunked. The chunks are stored in the VectorMain Solr collection.

  • Enable Embeddings at Indexing: Use this option to automatically start embeddings of the chunks as soon as they are indexed. If this option is not checked, you can still start the embeddings using the “Vector Search” > “Vector Embeddings Management” AdminUI (section Vector Embeddings Management).

    • The embeddings can significantly increase the processing time, depending on the Active Embeddings Model.

    • If some chunks failed to be embedded for any reason (bad configuration, timeout exception, network error…), you can use the Embeddings Job to retry failed embeddings.

Vector Embeddings Management

This feature is available in the “Vector Search” > “Vector Embeddings Management” AdminUI. From this page, you can launch the Vector Embeddings job. As explained above, this step is not necessary if you have checked the box “Enable Embeddings at Indexing”.

image-20251022-140304.png

Simply press “Start Embeddings” to run the job.

If the “Force” option is checked, EVERY chunk from VectorMain will be vectorized, including those that have already been embedded.

The progress bar indicates the number of chunks (in VectorMain) that have been vectorized by the Active Model, out of the total number of chunks in VectorMain.
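The same figures can be reproduced with two Solr counts, since embedded chunks carry the Active Vector Field name in their has_vector field (see the workflow below). A sketch, assuming direct access to Solr on its default port and vector_384_daia as the active field:

import requests

SOLR_SELECT = "http://localhost:8983/solr/VectorMain/select"  # assumption: default Solr port
ACTIVE_FIELD = "vector_384_daia"                              # the Active Model's vector field

def count(query: str) -> int:
    resp = requests.get(SOLR_SELECT, params={"q": query, "rows": 0, "wt": "json"}, timeout=30)
    resp.raise_for_status()
    return resp.json()["response"]["numFound"]

total = count("*:*")
embedded = count(f"has_vector:{ACTIVE_FIELD}")
print(f"{embedded}/{total} chunks embedded ({100 * embedded / max(total, 1):.1f}%)")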

How does it work?

When the button is clicked, the “Atomic Update” tool is used to start the pre-configured “VECTOR” (or “VECTOR_FORCE”, if the “force” option is used) job.

This job crawls all chunks from VectorMain that have NOT been embedded by the current active model, using the “/select/not-embedded” Solr Search Handler. If the “force” option is checked, the “/opensearch” handler is used instead to retrieve ALL chunks.
Then, the job sends Atomic Update requests for each chunk, using the “/update/embed” Solr Update Handler. The content of each chunk is sent to the Active Embeddings Model, vectorized, and the resulting vector is stored in the Vector Field associated with the Active Embeddings Model.

Examples of model configuration

Here are two examples of Vector Search configuration, for a quick & easy installation: one with OpenAI and one with the Datafari AI Agent.

Embeddings with OpenAI

Requirements

  • An OpenAI API key

Configuration

Embeddings Model Configuration

  • Model configuration template: OpenAI

  • Model identifier: openai

  • Model: text-embedding-3-small

  • Base URL: https://api.openai.com/v1

  • API key: {YOUR_API_KEY}

  • Vector field: vector_1536_openai (Important for the text-embedding-3-small model!)

Chunking Configuration

  • Maximum chunk size: 300

  • Maximum chunk overlap: 0

  • Chunking method: Recursive splitter

  • Chunk length filter: 50 (Optional)

  • Chunk ratio filter: 0.5 (Optional)

Embeddings with Datafari AI Agent

Requirements

  • An instance of Datafari AI Agent (can be on the same server as Datafari)

Configuration

Embeddings Model Configuration

  • Model configuration template: Datafari AI Agent

  • Model identifier: daia

  • Model: all-MiniLM-L6-v2.Q8_0.gguf

  • Base URL: http://localhost:8888/
    (Use this URL if the AI Agent is installed on the same server. Otherwise, replace “localhost” with the proper hostname)

  • API key: XXX (or any String, as long as it is not empty)

  • Vector field: vector_384_daia (Important for the all-MiniLM-L6-v2 model!)

Chunking Configuration

  • Maximum chunk size: 300

  • Maximum chunk overlap: 0

  • Chunking method: Recursive splitter

  • Chunk length filter: 50 (Optional)

  • Chunk ratio filter: 0.5 (Optional)

How does it work?

The vector search has three parts: indexing, embeddings, and searching.

  • Indexing is everything that happens during the execution of the ManifoldCF crawler. It includes chunking and an optional embeddings phase.

  • Embeddings is the conversion of the chunks’ content into semantic vectors. It can be dissociated from the indexing, allowing the use of multiple models.

  • The generated vectors can be used to run semantic “vector search” queries in Datafari.

Indexing

When “Vector Search” is enabled in the Vector Search Transformation Connector, documents indexed into FileShare by a ManifoldCF job are processed by the VectorUpdateProcessor. This processor divides each document into smaller sub-documents. This step is called chunking. The size and overlap of the chunks can be configured from the AdminUI (default: 300 tokens / 0 overlap).

While the whole document is normally indexed in the FileShare collection, the chunks are sent to a separate Solr collection: VectorMain.

image-20251022-153227.png
Document chunking during indexing process
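For illustration, the relationship between a crawled document and its chunks looks like this (the chunk ID pattern matches the alphabet.pdf_0 example used later on this page; the field names are simplified assumptions):

# Whole document, indexed as usual in the FileShare collection.
parent = {
    "id": "file://share/alphabet.pdf",
    "content": "A long document about the alphabet...",
}

# Chunks indexed in the separate VectorMain collection: one sub-document
# per chunk, identified by the parent name plus a chunk index.
chunks = [
    {"id": "alphabet.pdf_0", "content": "First ~300-token slice..."},
    {"id": "alphabet.pdf_1", "content": "Next slice, overlapping if configured..."},
]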

If the “Enable embeddings at indexing” option is enabled in the ManifoldCF job’s configuration, the chunks will be embedded by the TextToVectorUpdateProcessor as soon as they are received by VectorMain. They are then indexed, even if the embeddings failed.

The chunk size is calculated using the Langchain4j OpenAI tokenizer. This means that the evaluated size (in tokens) may differ from the actual size as estimated by the embeddings model’s own tokenizer. This may cause an error if you allow large chunks, depending on the model. We may be able to provide different tokenizers in the future.

Embeddings

Once the document chunks are stored in the VectorMain collection, each one is enriched with its vector representation. This process is done in a second step, using the Atomic Update mechanism.

The Vector Embeddings job using Atomic Update is optional if “Enable Embeddings at Indexing” is on. However, it can still be useful to rerun failed embeddings.

A dedicated Atomic Update job reads documents from VectorMain. For each document:

  • An update request is sent to Solr’s Atomic Update API.

  • The document is processed by the “TextToVectorUpdateProcessor”. The text content is sent to the configured embeddings model (Datafari AI Agent, OpenAI-compatible API, Cohere, Hugging Face…).

  • The returned vector is added to the document as a new vector_* field (the “Active Field”, as defined in the configuration of the “Active Model”).

Embeddings can be launched and monitored from the AdminUI, or started manually on the server using the following commands:

cd /opt/datafari/bin/atomicupdates
sudo bash atomic-updates-launcher.sh VECTOR full

The VECTOR job is defined by default in the Atomic Update configuration file: atomicUpdate-cfg.json

"VECTOR": { "searchHandler": "/select/not-embedded", # Select only not-embedded chunks, bypassing security layer (Enterprise Edition) "updateHandler": "/update/embed", # Forcing the use of the TextToVectorUpdateProcessor chain "source": { "baseUrl": "localhost:2181", "solrCollection": "VectorMain" # Retrieves chunks from VectorMain }, "destination": { "baseUrl": "localhost:2181", "solrCollection": "VectorMain" # Sends update requests to VectorMain }, "fieldsOperation": { "vectorize": "set" # Required to trigger embeddings. }, "nbDocsPerBatch": 100, # Setting an exceeding value here may result in timeout exceptions "fieldsMapping": {} }, "VECTOR_FORCE": { "searchHandler": "/opensearch", "updateHandler": "/update/embed", "source": { "baseUrl": "localhost:2181", "solrCollection": "VectorMain" }, "destination": { "baseUrl": "localhost:2181", "solrCollection": "VectorMain" }, "fieldsOperation": { "vectorize": "set" }, "nbDocsPerBatch": 100, "fieldsMapping": { } }
image-20251022-152145.png
Vector Embeddings workflow, using the Atomic Update job.
  1. The VECTOR job retrieves not-embedded chunks from VectorMain, using the “/select/not-embedded” search handler. That handler excludes chunks that already have the name of the Active Embeddings Model’s field in their “has_vector” field.

  2. The VECTOR job prepares atomic update requests, with 100 documents per batch. Only the id and the vectorize fields are required for each document update. Requests are sent to the “/update/embed” handler of the VectorMain collection.

  3. Incoming requests are processed by the TextToVectorUpdateProcessor. The chunk’s content is sent to the embedding model, which returns a vector stored in the document’s active vector field.

  4. If the chunk has been successfully vectorized, the TextToVectorUpdateProcessor adds the name of the Active Vector Field[1] to the “has_vector” multivalued field of the document, to prevent any unnecessary re-embeddings.

  5. The documents are updated in VectorMain with the new vector values, and an updated “has_vector” field.
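For illustration, one batch of such atomic update requests could look like the sketch below (the endpoint and the id/vectorize fields come from the description above; the exact value set on vectorize and the Solr port are assumptions):

import requests

# Assumption: default Solr port; the job targets VectorMain's /update/embed handler.
UPDATE_URL = "http://localhost:8983/solr/VectorMain/update/embed"

# Only the id and the vectorize fields are required; the "set" operation is
# what routes the documents through the TextToVectorUpdateProcessor chain
# (the job sends batches of 100 documents).
batch = [
    {"id": "alphabet.pdf_1", "vectorize": {"set": True}},
    {"id": "alphabet.pdf_2", "vectorize": {"set": True}},
]

resp = requests.post(UPDATE_URL, json=batch, params={"commit": "true"}, timeout=60)
resp.raise_for_status()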

In the example above, the chunk “alphabet.pdf_0” already contains a vector, while the others don’t. A chunk may lack a vector in several scenarios:

  • The document was indexed after the last embedding job and has not yet been embedded.

  • A previous embedding attempt failed (e.g., timeout, API error). These documents will be processed in the next run.

  • The document itself has been updated, and the incremental crawl has therefore deleted and re-created its chunks in VectorMain.

[1] The active vector field is a DenseVectorField in VectorMain’s documents that is used for embeddings and vector search in the default configuration. It can be configured in the AdminUI, in the Active Model configuration.

Vector search

Once documents in VectorMain contain a vector, users can search using the “/vector” search handler (which uses the TextToVector query parser) to run a semantic search.

When a search query is received:

  1. The query is converted to a vector using the same embedding model.

  2. Solr runs a KNN query on the active vector field using the provided TextToVector syntax. The nearest-neighbor computation is optimized with SeededKnnVectorQuery, and the filtering is optimized using the ACORN algorithm.

  3. The topK most relevant documents (from VectorMain) are returned as results.

Example of a vector search query in Solr:

curl "https://[DATAFARI_HOST]/solr/VectorMain/vector?queryrag=What%20are%20the%20first%20letters%20of%20the%20alphabet%20%3F&topK=2&vectorField=vector_1024&model=default_model"

Parameters:

  • queryrag: The user query (can be semantic or keywords)

  • topK: The number of results returned.

  • vectorField: The vector field used for vector distance calculation.

  • model: The model used for the query embeddings. Must be the same one that generated the vector in vectorField.
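The same query as the curl example, as a Python sketch (the response is assumed to follow the standard Solr shape; host, field and model names come from the example above):

import requests

params = {
    "queryrag": "What are the first letters of the alphabet ?",
    "topK": 2,
    "vectorField": "vector_1024",
    "model": "default_model",
}
resp = requests.get("https://DATAFARI_HOST/solr/VectorMain/vector", params=params, timeout=30)
resp.raise_for_status()
for doc in resp.json()["response"]["docs"]:  # standard Solr response shape assumed
    print(doc.get("id"))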

image-20251022-151847.png
Vector Search process

 

  1. Datafari sends a query to VectorMain’s /vector search handler.

  2. The TextToVector Query Parser transforms the query (e.g. “What are the first letters of the alphabet?”) into a semantic vector, using the configured embeddings model.

  3. A KNN similarity search is executed on the vectorField.

  4. The topK most similar documents are returned, ranked by vector similarity.

The same model must be used at both indexing and query time to ensure accurate semantic retrieval.

Full API reference and query examples are available at https://datafari.atlassian.net/wiki/spaces/DATAFARI/pages/1672871937 in the documentation.

Search configuration and optimization

Vector Search can be optimized with optional features that can be enabled and configured in the “Vector Search” > “Vector Search Configuration” AdminUI.

image-20260409-092317.png

  • TopK for Vector Search (solr.topK)

    Description: The default topK parameter for vector search: the number of k-nearest results to return.

    Recommended/default value: 10

    Type: Integer, min 1

  • TopK for Hybrid Search (rrf.topK)

    Description: The default topK parameter for hybrid search: the number of k-nearest results to return. It is also used to set the number of results of the BM25 part of the hybrid search. With default values, hybrid search will retrieve 50 chunks with BM25 search and 50 chunks with vector search, then merge all these chunks with the RRF algorithm to only return the best 10 results.

    Recommended/default value: 50

    Type: Integer, min 1. Must be greater than solr.topK.

  • RRF Rank Constant (rrf.rank.constant)

    Description: A constant used in the re-ranking of results by the RRF algorithm. The score of a result is based on the following equation: score = Σ (1 / (rrf.rank.constant + rank))

    Recommended/default value: 60

    Type: Integer, min 1

  • Enable ACORN (solr.enable.acorn)

    Description: ACORN is an algorithm designed to make hybrid searches consisting of a filter and a vector search more efficient. This approach tackles the performance limitations of both pre- and post-filtering. It modifies the construction of the HNSW graph and the search on it. Source: https://solr.apache.org/guide/solr/latest/query-guide/dense-vector-search.html

    Recommended/default value: true

    Type: Boolean

  • Filtered Search Threshold (solr.filtered.search.threshold)

    Description: If the percentage of documents that satisfy the filter is less than the threshold, ACORN will be used.

    Recommended/default value: 60

    Type: Integer, from 0 (never use ACORN) to 100 (always use ACORN)

  • Enable LADR (solr.enable.ladr)

    Description: Use SeededKnnVectorQuery to initiate the entry points in the HNSW graph with a “seed query”, in order to improve the relevancy of the results. Source: https://solr.apache.org/guide/solr/latest/query-guide/dense-vector-search.html

    Recommended/default value: true

    Type: Boolean
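To make the RRF merge concrete, here is a small sketch applying the score equation above to two ranked lists of chunk IDs (BM25 results and vector results), with the default rrf.rank.constant of 60:

from collections import defaultdict

def rrf_merge(bm25_ids: list[str], vector_ids: list[str],
              rank_constant: int = 60, top_k: int = 10) -> list[str]:
    """score(doc) = sum over result lists of 1 / (rank_constant + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranked_list in (bm25_ids, vector_ids):
        for rank, doc_id in enumerate(ranked_list, start=1):
            scores[doc_id] += 1.0 / (rank_constant + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# "b" ranks well in both lists, so it beats "a", which only tops the BM25 list.
print(rrf_merge(["a", "b", "c"], ["b", "d", "a"], top_k=3))  # ['b', 'a', 'd']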

Troubleshooting

How to pick the vector field?

The generated vector dimensions (size of the vector) directly depend on your embeddings model. You need to pick the proper field from the Solr VectorMain schema, matching the generated vectors’ dimensions. Datafari provides multiple default fields that you can use. If the vector field you need is not available in VectorMain (see the table below), you need to create it.

Field name                    FieldType          Dimensions

vector_4, vector_4_*          knn_vector_4       4
vector_256, vector_256_*      knn_vector_256     256
vector_384, vector_384_*      knn_vector_384     384
vector_512, vector_512_*      knn_vector_512     512
vector_768, vector_768_*      knn_vector_768     768
vector_1024, vector_1024_*    knn_vector_1024    1024
vector_1536, vector_1536_*    knn_vector_1536    1536
vector_3072, vector_3072_*    knn_vector_3072    3072
vector_4096, vector_4096_*    knn_vector_4096    4096

Important: Any vector field’s name must start with “vector” to be recognized in the Admin UI.

How do the chunk filters work?

When indexing documents, some portions (or entire documents) may not be worth embedding. To avoid generating embeddings from low-quality or irrelevant content, we provide two configurable chunk text filters:

  • Absolute filter: This filter sets a minimum number of alphanumeric characters required in a chunk’s content for it to be considered relevant. If the chunk text contains fewer such characters than the specified threshold, it will be discarded.

    • Type: Integer

    • Default value: 1

  • Relative filter: This filter checks the ratio of alphanumeric characters relative to the total number of characters in the chunk. Chunks with a lower ratio are discarded.

    • Type: decimal (between 0.0 and 1.0)

    • Default value: 0.0

Discarded chunks do not increment the chunk index used for child document IDs. This guarantees ID continuity in the VectorMain collection.

Example: file:////share/hello_world.txt
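A sketch of the two filters’ logic, with the stricter values suggested in the configuration examples above (50 characters / 0.5 ratio); the actual implementation lives in the VectorUpdateProcessor:

def keep_chunk(text: str, min_alnum: int = 1, min_ratio: float = 0.0) -> bool:
    """Absolute filter: at least min_alnum alphanumeric characters.
    Relative filter: alphanumeric / total characters >= min_ratio."""
    alnum = sum(c.isalnum() for c in text)
    if alnum < min_alnum:
        return False
    return len(text) == 0 or alnum / len(text) >= min_ratio

# A table-of-contents line of leader dots is discarded before embedding.
print(keep_chunk("Introduction ................ 12", min_alnum=50, min_ratio=0.5))  # False
print(keep_chunk("The quick brown fox jumps over the lazy dog near the riverbank today", 50, 0.5))  # True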


Vector search is using a vectorised representation of documents. More precisely, a dense vector representation in our case, since one could see BM25 as a sparse vector search mechanism. Dense vector search using certain pre-trained sentence transformers allows to manage semantic search, better than BM25, that is why vector search is useful in certain scenarios.