Retrieval-Augmented Generation (RAG)

Valid from Datafari 7.0

Introduction

As we have been working on the implementation of a RAG (Retrieval Augmented Generation) solution in Datafari, we came up with a new feature: “Datafari RagAPI”. RagAPI is a collection of Java classes and methods designed to handle RAG-related processes within Datafari. For more AI-related features, see also AI Powered Datafari API.

RAG processes can be triggered using the AiPowered API endpoints.

All our AI features can be used by calling the proper API endpoint, or through the AI chatbot widget available on Datafari UIv2.

At the core of the AiPowered API are ChatModels, a set of Langchain4j classes that act as interfaces between Datafari and external APIs leveraging Large Language Models (LLMs). These services allow integration with third-party AI providers like OpenAI API, Mistral Cloud, or local solutions such as LocalAI.
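For intuition, here is a minimal Langchain4j sketch (not Datafari’s actual code) showing how such a ChatModel can point at any OpenAI-compatible endpoint; the endpoint, credential and model name below are placeholders, and recent Langchain4j versions may name the interface ChatModel with a chat(...) method instead:

import dev.langchain4j.model.chat.ChatLanguageModel;
import dev.langchain4j.model.openai.OpenAiChatModel;

public class ChatModelSketch {
    public static void main(String[] args) {
        // Any OpenAI-compatible endpoint (OpenAI API, Mistral Cloud, a local LocalAI instance, ...)
        // can sit behind this single Langchain4j interface.
        ChatLanguageModel model = OpenAiChatModel.builder()
                .baseUrl("https://api.openai.com/v1")     // placeholder endpoint
                .apiKey(System.getenv("OPENAI_API_KEY"))  // placeholder credential
                .modelName("gpt-4o-mini")                 // placeholder model name
                .build();

        System.out.println(model.generate("Say hello in one short sentence."));
    }
}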

This documentation covers the details of the RAG processes.

What is RAG?

RAG stands for Retrieval-Augmented Generation. It consists in using a Large Language Model to generate the response to a user question or prompt, leveraging (or potentially restricting itself to) extra contextual information provided along with the prompt. This contextual information can be relevant documents or chunks of documents, coming from sources such as Datafari search results. In Datafari’s case, we purposely ask the LLM to restrict its answer to the knowledge contained in this extra contextual information (or at least we try to).


Classic search (BM25) VS Vector Search

The “Retrieval” part of RAG is an important step. During this step, a search is performed to identify a list of documents that may contain the desired information, and to extract relevant fragments that can be interpreted by the LLM. In our terms, the “classic” method is the keyword-based search implemented in Datafari. Vector search relies on Machine Learning models that capture the meaning and context of unstructured data by converting it into vectors. The advantage of vector search is that it “understands” natural language queries, and can therefore find relevant documents that do not necessarily use the same terms as the query.

Datafari currently offers three different approaches to RAG retrieval:

  • Keyword-based Search (classic BM25): Full documents are retrieved using a traditional BM25 Datafari search, followed by a chunking process.

  • Solr Vector Search: During indexing, documents are pre-chunked, and each chunk is vectorized. The classic keyword-based search is replaced by a fully vector-based retrieval process, using Text to Vector Solr features. Short size chunks are returned, instead of whole documents.

  • Hybrid Search (RRF): Combines results from both BM25 and Vector searches, with a “Reciprocal Rank Fusion” algorithm.

How does RAG work in Datafari?

The RAG process can be started through two different Datafari API endpoints (see the Endpoints section below). It goes through the following steps:

  1. Query reception

A “RAG” query is received from the user, through one of the API endpoints.

  2. History retrieval (optional)

If “chat memory” is enabled, the chat history is retrieved from the request to be used in the prompts.

  3. Query rewriting (optional)

If “query rewriting” is enabled, the search query is rewritten by the LLM before the Solr search (source retrieval).

  • This option can be enabled for BM25 only, vector search only, or both. When enabled for both, two search queries are generated using two different prompts.

  • The rewritten query is only used during the retrieval step.

  • The initial user query is still used in the RAG process, in the “Prompting” step, and provided to the LLM as context for response generation.

  • If “chat memory” is enabled, the conversation history is used for query rewriting.

  4. Source retrieval

Documents are retrieved from Solr using Datafari search. The retrieval process can use Vector Search technology, classic BM25 Search, or Hybrid Search (RRF). If the “query rewriting” feature is enabled, the rewritten query is used for the search. Otherwise, the initial user query is used.

  5. Chunking

Any document content (or document extract, in case of vector search) larger than the maximum chunk size defined in configuration is chunked into smaller pieces. Each piece is called a “chunk”.

  6. Prompting

A list of prompts (including instructions for the model, relevant document chunks and the user query) is prepared and sent to the external LLM service.

If the prompt exceeds the length limit for a single request, each chunk is processed separately. Once all chunks have been handled, the LLM is invoked again to generate a final, consolidated response.
This process should be optimized soon to process multiple chunks at once.

  7. Response Generation

The LLM generates a text response.

  8. Response formatting

Datafari will format the webservice response into a standard JSON, attach the sources used by the LLM, and send it to the user.

 

image-20241107-164327.png

Endpoints

The RAG endpoint is documented in the AI Powered Datafari API section.

Configuration

The easiest and fastest way to configure RAG and other AI-powered features is to use the dedicated page on the Admin interface.

More information here: AI Powered Datafari API

Note: If you intend to use Solr Vector Search, refer to the dedicated documentation: Datafari Vector Search

Technical specification

Process description

Depending on the configuration and on the Retrieval approach, the global RAG process can take three forms.

image-20250603-095439.png

 

 

image-20250602-083343.png

 

  1. The client sends a query to the Datafari API, using the POST /ai (or POST /ai/stream) endpoint:

    POST https://{DATAFARI_HOST}/Datafari/rest/v2.0/ai

    Parameters are extracted from the HTTPS request, and configuration is retrieved from rag.properties.

  2. A search query is processed using Search API methods, based on the user prompt, in order to retrieve a list of potentially relevant documents. This search can be either a keyword-based BM25 search, a Solr Vector Search or a hybrid search (RRF).
    In the first case, the search returns entire documents, which will require chunking before they can be processed by the LLM.
    In the case of Vector Search or Hybrid Search, Solr returns a number of length-limited document excerpts (chunks).

  3. Retrieved documents (in particular from BM25 search) might be too big to be handled in one call to the LLM. Chunking cuts large documents into smaller pieces so that they can be processed sequentially. The maximum size of the chunks can be configured in the “RAG & AI configuration” AdminUI (chunk.size property in rag.properties).

The chunking uses the Langchain4j DocumentSplitters.recursive(...) splitter. See the Langchain4j documentation for more information about chunking strategies.

In case Vector or Hybrid Search is enabled, the retrieved excerpts may be larger than {chunk.size} characters. If that happens, they will be chunked again.

  4. During prompting, the list of documents/snippets is converted into a list of prompts that will be processed by the LLM. Each prompt contains instructions (instructions are defined in the /opt/datafari/tomcat/webapps/Datafari/WEB-INF/classes/prompts folder), document excerpts, and the user prompt as a question.

    If prompts are short enough, they might be sent to the LLM in one single request to potentially improve performance. If that is not the case, we use either the “Iterative refine” or the “Map-Reduce” method (configurable in the AdminUI) to process all chunks.
    The prompt in a single request to the LLM contains as many chunks as possible (minimum 1) without exceeding the limit (in characters) set in prompt.max.request.size (instructions and history included).

In the future, we should conduct a benchmark to compare the "Map-Reduce" and "Refining" methods for RAG. Read more about those chunking strategies here: LLM Transformation Connector

  5. Our solution is designed to be able to interact with various LLM APIs. The dispatcher selects the proper ChatLanguageModel to interact with the configured LLM API.
    A ChatLanguageModel is a Langchain4j component that provides methods to call an external LLM service.
    To this day, Datafari should support the following ChatLanguageModels:
    - OpenAiChatLanguageModel, compatible with OpenAI API and any other OpenAI-like API (including Datafari AI Agent) (tested with OpenAI API, Datafari AI Agent, Mistral Cloud)
    - AzureOpenAiChatLanguageModel, for Azure OpenAI (tested with gpt-4o-mini)
    - GoogleAiGeminiChatLanguageModel, for Google AI Gemini (tested with gemini-2.0-flash)
    - HuggingFaceChatLanguageModel, for Hugging Face API (not tested)
    - OllamaChatLanguageModel, for Ollama API (not tested)

  6. The selected interface is used to send an HTTP/HTTPS query containing the prompt to the external LLM API. In multi-step scenarios, the service can be called multiple times.

  7. The response is formatted in JSON and sent back to the user.

Chunking

Many documents stored in the FileShare Solr collection are too large to be processed in a single request by a Large Language Model. To address this, implementing a chunking strategy is essential, allowing us to work with manageable, concise, and contextually relevant text snippets.

The chunking strategy depends on the Retrieval method. The two cases are detailed below.

Case 1: BM25 Search

The BM25 search returns whole (and potentially large) documents from FileShare. Those documents are chunked into smaller pieces during the chunking step of the RAG process.

All retrieved documents are processed by the ChunkUtils Java class. The chunkContent() method uses a Langchain4j solution: recursive DocumentSplitters. This splitter recursively divides a document into paragraphs (defined by two or more consecutive newline characters), lines, sentences, words (…), in order to fit as much content as possible without exceeding the configured chunk size limit[1].

[1] The size of the chunks (currently in characters, but it should be in tokens in the future) can be configured in the AdminUI.
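As an illustration of this recursive splitting (a minimal sketch, not the ChunkUtils implementation; the 1000-character limit below is an arbitrary example standing in for chunking.chunk.size):

import java.util.List;

import dev.langchain4j.data.document.Document;
import dev.langchain4j.data.document.DocumentSplitter;
import dev.langchain4j.data.document.splitter.DocumentSplitters;
import dev.langchain4j.data.segment.TextSegment;

public class RecursiveChunkingSketch {
    public static void main(String[] args) {
        // Recursive splitter: tries paragraphs first, then lines, sentences, words...
        // 1000 = maximum chunk size in characters, 0 = no overlap between chunks.
        DocumentSplitter splitter = DocumentSplitters.recursive(1000, 0);

        Document document = Document.from("First paragraph...\n\nSecond paragraph...\n\nThird paragraph...");
        List<TextSegment> chunks = splitter.split(document);

        chunks.forEach(chunk -> System.out.println(chunk.text()));
    }
}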

image-20250528-103931.png

 

Case 2: Solr Vector Search / Hybrid Search

In this scenario, chunking occurs during document indexing within the VectorUpdateProcessor. All files uploaded to FileShare are processed and split into smaller chunks using the DocumentByParagraphSplitter.

These chunks are then stored as new "child" documents, inheriting their parent's metadata. The chunked content replaces the original content in the child documents.

The child documents are stored in a separate Solr collection, VectorMain. Once created, each child’s content is embedded using the Solr TextToVectorUpdateProcessor.

When Vector/Hybrid Search is executed in Datafari, it retrieves documents from the VectorMain collection instead of FileShare, eliminating the need for additional chunking steps.

The chunking step described in Case 1 is still applied to documents retrieved by Vector Search. However, depending on your configuration, this may have no effect since the retrieved contents are probably short enough.
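For comparison with Case 1, here is a minimal sketch of the paragraph-based splitting used at indexing time (this is not the VectorUpdateProcessor itself; 300 and 0 simply mirror the documented defaults of vector.chunksize and vector.maxoverlap, and this constructor counts characters rather than tokens):

import java.util.List;

import dev.langchain4j.data.document.Document;
import dev.langchain4j.data.document.splitter.DocumentByParagraphSplitter;
import dev.langchain4j.data.segment.TextSegment;

public class ParagraphChunkingSketch {
    public static void main(String[] args) {
        // Splits on paragraphs (two or more consecutive newlines), merging or splitting
        // further so that no chunk exceeds the configured maximum size.
        DocumentByParagraphSplitter splitter = new DocumentByParagraphSplitter(300, 0);

        Document document = Document.from("Paragraph one.\n\nParagraph two.\n\nParagraph three.");
        List<TextSegment> chunks = splitter.split(document);

        chunks.forEach(chunk -> System.out.println(chunk.text()));
    }
}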

Detailed chunking workflow (Case 1: BM25 Search)

1. Indexing

Chunking:
No chunking here

Indexed documents:

  • Whole documents are indexed into FileShare

2. Retrieval

The search returns a maximum of N whole documents from FileShare.
N corresponds to the chunking.maxFiles property.
Default value for N is 3.

3. RAG process

Chunking
All documents are chunked (if necessary) with the following conditions:

  • A chunk must not exceed the limit of {chunking.chunk.size} characters

  • A recursive splitter is used (paragraphs, then lines, then sentences…). This is not configurable.

Prompting and chunk management
The prompt is generated according to the chunk management strategy (Map-Reduce or Iterative Refining) set in prompt.chunking.strategy

  • We stuff as many chunks as possible (minimum 1) in the prompt, without exceeding the limit of {prompt.max.request.size} characters

  • Every chunk from the N documents must be processed.

  • If all chunks do not fit in a single prompt (one prompt per LLM request), we send as many requests as necessary.

property.name: Property configurable in the “RAG & AI configuration” AdminUI, or in the rag.properties file.

Detailed chunking workflow (Case 2: Solr Vector Search / Hybrid Search)

1. Indexing

Chunking:
Documents are chunked in the VectorUpdateProcessor with the following conditions:

  • A chunk must not exceed the limit of {vector.chunksize} tokens (default value is 300)

  • A recursive splitter is used by default. This is configurable by editing the FileShare property vector.splitter.

  • You can set an optional maximum overlap by editing the FileShare property vector.maxoverlap (default value is 0).

Indexed documents:

  • Whole documents are indexed into FileShare

  • Chunks (subdocuments) are indexed into VectorMain. Their content is embedded during the indexing in VectorMain by the TextToVectorUpdateProcessor.

2. Retrieval

The search returns a maximum of {rag.topK} subdocuments from VectorMain. Default value for rag.topK is 10.

3. RAG process

Chunking
All documents are chunked (if necessary, which is probably not the case if {chunking.chunk.size} > {vector.chunksize} ) with the following conditions:

  • A chunk must not exceed the limit of {chunking.chunk.size} characters

  • A recursive splitter is used (paragraphs, then lines, then sentences…). This is not configurable.

Prompting and chunk management
The prompt is generated according to the chunk management strategy (Map-Reduce or Iterative Refining) set in prompt.chunking.strategy

  • We stuff as many chunks as possible (minimum 1) in the prompt, without exceeding the limit of {prompt.max.request.size} characters

  • Every chunk from the N documents must be processed.

  • If all chunks do not fit in a single prompt (one prompt per LLM request), we send as many requests as necessary.

 

 

property.name: Property configurable in the “RAG & AI configuration” AdminUI, or in the rag.properties file.

property.name: Property configurable in the “Solr Vector Search” AdminUI, or directly in Solr with a curl command.

 

Prompts

Prompts are stored in “ChatMessage” objects sent to the LLM. Each “ChatMessage” has:

  • A type: UserMessage for the user query and document content, AiMessage for AI-generated messages, or SystemMessage for instructions. In “mono-message” prompts (one message per LLM request), we only use the UserMessage.

  • A text content: The body of the message, which may include instructions, the user query, and/or document content.

Currently, in order to support a larger variety of LLM services, Datafari only uses “mono-message” prompts.

If the RAG process needs to manage too many or too large snippets, it may not be able to fit all of them into one single LLM request. In this situation, a chunk management strategy is required. Datafari provides two options: the Iterative Refining method and the Map-Reduce method. You can pick one in the “RAG & AI configuration” AdminUI. Read more about chunk management strategies in the LLM Transformation Connector documentation.
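To make the two strategies concrete, here is a rough control-flow sketch using Langchain4j’s ChatLanguageModel (this is not Datafari’s code; it assumes the prompts have already been built from the templates shown below):

import java.util.ArrayList;
import java.util.List;

import dev.langchain4j.model.chat.ChatLanguageModel;

public class ChunkStrategySketch {

    // Map-Reduce: one LLM call per pack of chunks, then one final merge call.
    static String mapReduce(ChatLanguageModel llm, List<String> chunkPrompts, String mergePrompt) {
        List<String> partialAnswers = new ArrayList<>();
        for (String prompt : chunkPrompts) {
            partialAnswers.add(llm.generate(prompt));                // "map" step
        }
        if (partialAnswers.size() == 1) {
            return partialAnswers.get(0);                            // single call: nothing to merge
        }
        return llm.generate(mergePrompt + "\n* " + String.join("\n* ", partialAnswers)); // "reduce" step
    }

    // Iterative Refining: each pack of chunks refines the previous answer.
    static String refine(ChatLanguageModel llm, List<String> refinePrompts) {
        String answer = "";
        for (String prompt : refinePrompts) {
            answer = llm.generate(prompt.replace("{lastresponse}", answer));
        }
        return answer;
    }
}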

Below are the prompt chains associated with a RAG query, for each chunk management strategy.

Case 1: Map-Reduce method

First, the LLM is called once per chunk set, each time with the following prompt template:

template-rag.txt

You are an AI assistant specialized in answering questions strictly based on the provided documents and chat history (if any).
- Your response must be accurate, concise, and must not include any invented information.
- You must always mention the source document where you found the information.
- If the documents do not contain the answer, say that you can’t find the answer.
{format}
Below are the documents you must use:
######
{snippets}
######
{history}
Now, answer the following question in {language}, using only the information from the documents or from the chat history (if any):
query: {userquery}
answer:

Then, if more than one call was made during the first step, the LLM is called one final time to generate a final response based on all its previous responses:

template-mergeAllRag.txt

You are a helpful RAG assistant.
We have provided a list of responses to the user query based on different sources:
######
{snippets}
######
Given the context information and not prior knowledge, answer the user query
Do not provide any information that does not belong in documents or in chat history.
If the context does not provide an answer, say that you can’t find the answer.
{history}
You must mention the document names when it is possible and relevant.
Answer the user query in {language}.
Query: {userquery}
Answer:

{history} : A prompt containing the chat history (if any): (template-history.txt)

You are allowed to use the following conversation history if needed: {conversation}

{language} : The name (in English) of the user’s preferred language (e.g. French)

{snippets} (first step): A list of formatted chunks provided to the LLM as sources. In the initial prompt, they contain the retrieved sources; each chunk is formatted with the following template: (template-fromTextSegment.txt)

# Title: {title}
# Content:
'''
{content}
'''

Here, {title} is the title of the original document, based on the first value of the “title” field in Solr. {content} is the content of the chunk.

{snippets} (second step): The list of all the previous responses generated during the first step, each formatted that way:

* {previous_response}

{userquery} : The initial user query (e.g. what is Datafari ?)

{conversation} : A succession of formatted user/assistant messages that form a conversation. Each message is formatted with this template: (template-history-message.txt)

- {role}: {content}

Here, {role} is assistant or user. {content} either contains a previous user query or an AI-generated response.
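As an illustration of how such a template could be filled (a naive sketch; Datafari’s actual template handling may differ):

import java.util.Map;

public class PromptTemplateSketch {

    // Naive single-brace placeholder substitution over a template such as template-rag.txt.
    static String fill(String template, Map<String, String> values) {
        String prompt = template;
        for (Map.Entry<String, String> entry : values.entrySet()) {
            prompt = prompt.replace("{" + entry.getKey() + "}", entry.getValue());
        }
        return prompt;
    }

    public static void main(String[] args) {
        String template = "Now, answer the following question in {language}: query: {userquery} answer:";
        System.out.println(fill(template, Map.of(
                "language", "English",
                "userquery", "what is Datafari ?")));
    }
}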

 

 

Case 2: Iterative Refining method

First, the LLM is called with the first N chunks (as many chunks as a request can fit):

template-refine-initial.txt

Context information is below.
######
{snippets}
######
Given the context information and not prior knowledge, answer the query.
Your response must be accurate, concise, must not include any invented information, and must always mention the source document when it is relevant.
If the documents do not contain an answer, say that you can’t find the answer.
{format}
{history}
Now, answer the user’s question in {language} using only this information.
Query: {userquery}
Answer:

Then, the LLM is called again for each remaining pack of chunks, with the following prompt template:

template-refine-refining.txt

The original query is as follows: {userquery}
We have provided a previous answer to the query:
######
{lastresponse}
######
We have the opportunity to refine the existing answer (only if needed) with some more context below.
######
{snippets}
######
Using only context and the previous response and not prior knowledge, answer the user query.
If the context and the previous response do not contain an answer, say that you can’t find the answer.
{history}
Always answer in {language} and always mention the source document when it is relevant.
Query: {userquery}
Answer:

{history} : A prompt containing the chat history (if any): (template-history.txt)

You are allowed to use the following conversation history if needed: {conversation}

{language} : The name (in English) of the user’s preferred language (e.g. French)

{lastresponse} : The assistant’s last response (e.g. Datafari is an open source search engine developed by France Labs…)

{snippets} : A list of formatted chunks provided to the LLM as sources. They contain the retrieved sources; each chunk is formatted with the following template: (template-fromTextSegment.txt)

# Title: {title}
# Content:
'''
{content}
'''

Here, {title} is the title of the original document, based on the first value of the “title” field in Solr. {content} is the content of the chunk.

{userquery} : The initial user query (e.g. what is Datafari ?)

{conversation} : A succession of formatted user/assistant messages that form a conversation. Each message is formatted with this template: (template-history-message.txt)

- {role}: {content}

Here, {role} is assistant or user. {content} either contains a previous user query or an AI-generated response.

{title} : The title of a single document. Based on the first value of the “title” field in Solr.

{content} : The content of a single chunk/snippet/message.

To determine the number of chunks that can fit in a single request, we use our own size calculator. It compares the total size of the prompt (including instructions, history, sources and user query) to the maximum allowed size (in characters), as defined in the “RAG & AI configuration” AdminUI, or the prompt.max.request.size parameter in rag.properties.

We stuff as many chunks as possible into the prompt (minimum 1), without exceeding this limit.
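A minimal sketch of this greedy packing logic (illustrative only, not Datafari’s size calculator):

import java.util.ArrayList;
import java.util.List;

public class PromptPackingSketch {

    // Greedily packs chunks into one prompt without exceeding maxRequestSize characters.
    // baseSize is the size of everything else (instructions, history, user query).
    // At least one chunk is always included, mirroring the "minimum 1" rule;
    // chunks that do not fit are left for the next LLM request.
    static List<String> packChunks(List<String> chunks, int baseSize, int maxRequestSize) {
        List<String> packed = new ArrayList<>();
        int size = baseSize;
        for (String chunk : chunks) {
            if (!packed.isEmpty() && size + chunk.length() > maxRequestSize) {
                break;
            }
            packed.add(chunk);
            size += chunk.length();
        }
        return packed;
    }
}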

Available ChatModels

See Add new Chat Language Models in Datafari (https://datafari.atlassian.net/wiki/spaces/DATAFARI/pages/4152393729/Add+new+Chat+Language+Models+in+Datafari) to manage ChatModels.
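For intuition, a rough sketch of what dispatching to a ChatLanguageModel based on a configured provider name could look like (this is not Datafari’s dispatcher; the “service” values and parameters are illustrative, and other providers such as Azure OpenAI, Gemini, Hugging Face or Ollama would follow the same pattern with their respective Langchain4j modules):

import dev.langchain4j.model.chat.ChatLanguageModel;
import dev.langchain4j.model.openai.OpenAiChatModel;

public class ChatModelDispatcherSketch {

    // "service", "baseUrl", "apiKey" and "modelName" would come from the RAG configuration.
    static ChatLanguageModel selectModel(String service, String baseUrl, String apiKey, String modelName) {
        switch (service) {
            case "openai":
                // Also covers any OpenAI-compatible endpoint (Datafari AI Agent, Mistral Cloud, LocalAI...).
                return OpenAiChatModel.builder()
                        .baseUrl(baseUrl)
                        .apiKey(apiKey)
                        .modelName(modelName)
                        .build();
            default:
                throw new IllegalArgumentException("Unsupported LLM service: " + service);
        }
    }
}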

AI powered Search

To enhance the relevance of document excerpts sent to the LLM, we have implemented vector search solutions. This machine learning-based approach represents semantic concepts as vectors, offering more accurate results than traditional keyword-based search. Additionally, vector search improves retrieval quality in multilingual datasets, as it relies on semantic meaning rather than exact wording.

Solr Vector Search

Solr Vector Search uses the new text-to-vector feature provided by Solr 9.8. The purpose is to replace the current BM25 search and the local vector store with a full vector search solution (and in the future, a hybrid search solution for even more relevant results).

Our VectorUpdateProcessor processes all documents that are indexed into the FileShare Solr collection. Documents are split into chunks, which are embedded and stored into the VectorMain collection.

Those chunks can now be searched using the “/vector” handler, as long as the feature is enabled in the dedicated AdminUI.

The following query can be used to process a vector search through the API (an example call is sketched after the parameter list below).

https://{DATAFARI_HOST}/Datafari/rest/v2.0/search/vector?queryrag={prompt}&topK={topK}
  • queryrag or q (required) : The user query. The "queryrag" parameter is required by Solr; however, if it is missing, Datafari will automatically populate it with the value of "q".

  • topK (optional) : The number of results to return. (default: 10, editable in “RAG & AI configuration” AdminUI)

  • model (optional) : The active embeddings model name, as defined in Solr. By default, Datafari automatically uses the value stored in solr.embeddings.model in rag.properties (editable in “Solr Vector Search” AdminUI). Unless you are experimenting with multiple models, or you are directly requesting Solr API (and bypassing Datafari API), you probably don’t need to use this parameter.
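For illustration, a minimal Java call to the endpoint above might look like this (the host and query are placeholders, and any authentication required by your Datafari instance is omitted):

import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class VectorSearchCallSketch {
    public static void main(String[] args) throws Exception {
        String host = "datafari.example.com"; // placeholder host
        String query = URLEncoder.encode("what is Datafari ?", StandardCharsets.UTF_8);
        String url = "https://" + host + "/Datafari/rest/v2.0/search/vector"
                + "?queryrag=" + query + "&topK=10";

        HttpResponse<String> response = HttpClient.newHttpClient().send(
                HttpRequest.newBuilder(URI.create(url)).GET().build(),
                HttpResponse.BodyHandlers.ofString());

        System.out.println(response.body()); // JSON response containing the top-K chunks
    }
}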

Read more about Solr Vector Search set-up and configuration in the dedicated documentation: Datafari Vector Search

Hybrid Search (RRF)

Hybrid Search in Datafari API uses a Reciprocal Rank Fusion (RRF) algorithm to combine results from multiple ranking strategies (here, BM25 and vector similarity search). This is designed to significantly improve relevancy by blending semantic and lexical signals (a minimal sketch of the fusion idea follows the list below).

  • Performs both a BM25 query and a vector search query on the VectorMain collection.

  • Merges the results using Solr's RRF ranking.

  • Returns the fused top results to the user.
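For intuition only, here is a minimal sketch of the RRF idea (Datafari delegates the actual fusion to Solr; k = 60 is the constant commonly used in the literature, not necessarily Solr’s internal value):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ReciprocalRankFusionSketch {

    // Each document earns 1 / (k + rank) from every ranked list it appears in;
    // the summed score decides the fused ordering.
    static Map<String, Double> fuse(List<List<String>> rankedLists, int k) {
        Map<String, Double> scores = new HashMap<>();
        for (List<String> ranking : rankedLists) {
            for (int rank = 0; rank < ranking.size(); rank++) {
                scores.merge(ranking.get(rank), 1.0 / (k + rank + 1), Double::sum);
            }
        }
        return scores;
    }
}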

The following query can be used to process an RRF Hybrid search through the API.

https://{DATAFARI_HOST}/Datafari/rest/v2.0/search/vector?queryrag={prompt}&topK={topK}
  • queryrag or q (required) : The user query. The "queryrag" parameter is required by Solr; however, if it is missing, Datafari will automatically populate it with the value of "q".

  • rows (optional) : The number of results to return. (default: 10, editable in “RAG & AI configuration” AdminUI).

  • start (optional) : The start position of the results, for pagination. (default: 0).

  • topK (optional) : Different from vector search’s topK ! The number of results retrieved by the two initial searches (BM25 & vector search). Must be greater than “rows”. (default: 60, editable in “RAG & AI configuration” AdminUI)

  • model (optional) : The active embeddings model name, as defined in Solr. By default, Datafari automatically uses the value stored in solr.embeddings.model in rag.properties (editable in “Solr Vector Search” AdminUI). Unless you are experimenting with multiple models, or you are directly requesting Solr API (and bypassing Datafari API), you probably don’t need to use this parameter.

Read more about Solr Vector Search set-up and configuration in the dedicated documentation: Datafari Vector Search

Chat Memory (for RAG)

For models that support conversational context, it is possible to enable chat memory within the RAG process.

  • As of April 2025, no back-end storage is provided. The chat history must be managed client-side, typically in the UI or frontend application.

Enable Chat Memory

To activate chat memory:

  1. Enable the option in the AdminUI or in rag.properties:

In “RAG & AI configuration” AdminUI:

image-20250528-095055.png

 

In rag.properties:

chat.memory.enabled=true
  2. Define the maximum number of messages to include in the context with:

In “RAG & AI configuration” AdminUI:

image-20250528-095230.png

In rag.properties:

chat.memory.history.size=8

By default, 6 messages are included: 3 user messages + 3 assistant responses.

Keep in mind: all chat history is included in the prompt and consumes part of the model’s context window.

Therefore, prefer using models with a large context length based on your needs, and adjust chat.memory.history.size and prompt.max.request.size accordingly.

Using Chat Memory in API calls

To include chat history when calling the /ai/streaming or /ai endpoints, use the optional history field in your JSON payload. Chat history will be added to the LLM context during RAG Generation processes.

Example:

POST https://DATAFARI_HOST/Datafari/rest/v2.0/ai
{
  "query": "What is my dog's name ?",
  "lang": "fr",
  "history": [
    { "role": "user", "content": "I just adopted a black labrador. I called her Jumpy." },
    { "role": "assistant", "content": "How nice ! I am sure she will be happy with you." },
    { "role": "user", "content": "What is the capital of France?" },
    { "role": "assistant", "content": "La capitale de la France est Paris, d'après le document `Capitale de la France`." }
  ]
}

This chat history will be included in the prompt and passed to the LLM (in each request), providing contextual awareness for more coherent and personalized responses.
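As an illustration, the history entries could be turned into the {conversation} placeholder described earlier (template-history-message.txt) along these lines (a sketch, not Datafari’s implementation):

import java.util.List;

public class ChatHistorySketch {

    // role/content pairs as received in the "history" field of the JSON payload.
    record HistoryEntry(String role, String content) {}

    // Formats the client-side history following the "- {role}: {content}" template.
    static String toConversation(List<HistoryEntry> history) {
        StringBuilder conversation = new StringBuilder();
        for (HistoryEntry entry : history) {
            conversation.append("- ").append(entry.role())
                        .append(": ").append(entry.content()).append("\n");
        }
        return conversation.toString();
    }

    public static void main(String[] args) {
        System.out.println(toConversation(List.of(
                new HistoryEntry("user", "I just adopted a black labrador. I called her Jumpy."),
                new HistoryEntry("assistant", "How nice ! I am sure she will be happy with you."))));
    }
}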

Datafari will run a full RAG process, based on the query "What is my dog's name ?". Optionally, the query may be dynamically rewritten to include the chat history, depending on the query rewriting method. If not, the chat history will be used AFTER the chunks have been retrieved. This also means that the document chunks retrieved for earlier questions of the user (in the same discussion or any other) are NOT used again. Only the document chunks retrieved at the nth query are used for the nth answer.

To summarize:

Case 1: query rewriting is not activated