LLM Transformation Connector - ALPHA VERSION

Valid from v6.2 - TBC

This documentation is valid from Datafari v6.2 onwards.

This feature is experimental, and elements of this documentation are subject to change.

The LLM Transformation Connector currently provides two functions, each of them using an external AI model:

  • Summarization: write a summary of the document in the selected language.

  • Categorization: categorize the document into one (or more) of the available categories. Categories are configurable. Default categories are: Invoice, Call for Tenders, Request for Quotations, Technical paper, Presentation, Resumes, Others.

Glossary

LLM Connector: A ManifoldCF Transformation connector provided by France Labs that processes AI-related tasks.

Datafari AI Agent: Our own OpenAI-like API, able to run AI models to complete tasks such as RAG, summarization or categorization. Documentation here: AI Agent - API documentation

LLM Services: These Java classes handle the connection between the Transformation Connector and your AI API solution. More LlmServices can be created by extending the LlmService class and overriding the invoke method. The currently available LLM Services are OpenAiLlmService (for OpenAI-compatible APIs) and DatafariAiAgentLlmService (for the Datafari AI Agent).
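To illustrate the extension mechanism, here is a minimal, hypothetical sketch. The class name LlmService and the invoke method come from this documentation; the simplified signature, field and constructor below are illustrative assumptions, not the actual Datafari code.

```java
// Minimal, hypothetical sketch of a custom LLM Service. "LlmService" and
// "invoke" are the names mentioned in this documentation; the simplified
// signature and fields below are illustrative, not the actual Datafari code.
abstract class LlmService {
    protected final String endpoint;

    protected LlmService(String endpoint) {
        this.endpoint = endpoint;
    }

    // Sends the prompt to the backing LLM API and returns the raw answer.
    abstract String invoke(String prompt);
}

// A new service for another API would extend LlmService and override invoke.
class MyCustomLlmService extends LlmService {
    MyCustomLlmService(String endpoint) {
        super(endpoint);
    }

    @Override
    String invoke(String prompt) {
        // Real code would POST the prompt to this.endpoint and parse the
        // response; here we simply echo it for illustration.
        return "[" + endpoint + "] " + prompt;
    }
}
```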

How does it work

As mentioned above, jobs can process two operations for each document. These operations can be individually enabled or disabled. During document processing, a Large Language Model API (such as OpenAI) is called to execute the operation on the document content.

At this time, two options are available.

  • OpenAI-compatible APIs. This solution requires an API token. It is compatible with OpenAI API, Datafari AI Agent, and any other OpenAI-compatible services.

  • Datafari AI Agent. Our Datafari AI Agent is an OpenAI-compatible API, and can be used with the OpenAiLlmService. However, we provide a dedicated LLM Service. The main differences are:

    • it does not require the API token variable to be present (in this situation, it will create a default API token before contacting the LLM),

    • it is meant to use local LLMs rather than remote LLMs (local means on the Datafari AI Agent server)

    • it will automatically use a local default LLM in case it cannot use the LLM requested in the Transformation Connector.

    • it contains a request queue mechanism that manages the rate of requests sent to the LLM, in order to avoid resource issues.
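To illustrate the idea behind such a request queue mechanism, a simple gate can cap how many requests reach the LLM concurrently while queuing the others. This is a sketch of the general technique, not the AI Agent's actual implementation; all names are illustrative.

```java
// Illustrative sketch of a request-queue mechanism: cap the number of
// requests that reach the LLM at the same time, queuing the others.
// This is NOT the AI Agent's actual implementation, just the general idea.
class LlmRequestGate {
    private final java.util.concurrent.Semaphore permits;

    LlmRequestGate(int maxConcurrentRequests) {
        // fair = true: waiting callers are served in arrival order (a queue)
        this.permits = new java.util.concurrent.Semaphore(maxConcurrentRequests, true);
    }

    // Blocks until a slot is free, then forwards the request to the LLM.
    <T> T call(java.util.function.Supplier<T> llmCall) {
        permits.acquireUninterruptibly();
        try {
            return llmCall.get();
        } finally {
            permits.release(); // free the slot for the next queued caller
        }
    }
}
```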

 

image-20250207-112101.png
LLMServices are Java components whose role is to transmit calls from the LLM Transformation Connector to LLM external APIs (such as OpenAI API or Datafari AI Agent).

 

 

It is important to remember that the Datafari AI Agent will silently use its default model in case the requested model cannot be used. This means the transformation connector will keep working, and the only way to know that your model has NOT been used is to look at the logs. Surfacing this more clearly may be the objective of a later ticket.

Chunking

There is currently no chunking strategy implemented in the connector for the categorization task. We have not run any benchmark to validate this choice, but we arbitrarily decided that the categories would be generated based on the beginning of the document, since adding a chunking strategy here would result in a significant drop in performance. We may, however, add a chunking strategy for categorization (or other generative AI prompts) in the near future, in order to generate more accurate answers. This would be done on the transformation connector side, together with the loop that iterates over the chunks when contacting the LLM.

To avoid exceptions due to the number of tokens sent to the LLM, the size of the chunks is now configurable. See the “Max size (in token)” configuration for more information.

Prompt

Here are the prompts used for categorization and summarization.

  • Categorization

"""Categorize the following document in one (or more) of the following categories: {categories}. If you don't know, say \"Others\". \n\n {content}"""
  • Summarization (depending on whether a language is specified)

Summarize this document : \n\n {document}
Summarize this document in {language}: \n\n """{document}"""
  • Recursive summarization (used with the Iterative Refining chunking method). It starts after a simple summarization of the first chunk.

Here is a summary of the document {documentName} you wrote, based on {parts 1 to x-1}: """{last_generated_summary}""" Here is the part {x} of the document: """{chunk_content}""" Write a global summary {in_language} of the document.

x being the processed chunk number (the first chunk is “1”, and is excluded from the loop since it has to be summarized beforehand). The “{parts 1 to x-1}” placeholder is replaced by:
- part 1 if x = 2
- parts 1 to x-1 if x > 2
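As a sketch, the placeholder substitution rule above could be implemented like this. Class and method names are illustrative, not the connector's actual code.

```java
// Sketch of the placeholder substitution for the recursive summarization
// prompt. Class and method names are illustrative, not the connector's code.
class RefinePrompt {
    // x is the processed chunk number; chunk 1 is summarized beforehand,
    // so the refining loop starts at x = 2.
    static String partsLabel(int x) {
        return (x == 2) ? "part 1" : "parts 1 to " + (x - 1);
    }

    static String build(String documentName, int x, String lastSummary,
                        String chunkContent, String language) {
        String inLanguage = (language == null || language.isEmpty())
                ? "" : " in " + language;
        return "Here is a summary of the document " + documentName
                + " you wrote, based on " + partsLabel(x) + ": \"\"\"" + lastSummary + "\"\"\" "
                + "Here is the part " + x + " of the document: \"\"\"" + chunkContent + "\"\"\" "
                + "Write a global summary" + inLanguage + " of the document.";
    }
}
```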

How to use it

Create the job

Using Datafari’s admin interface, create a job for the source you want to crawl on WITHOUT checking the box to start the job immediately.

Define global configuration

In MCF, go to Outputs > List Transformation Connections, and “Edit” the LLM Connector.

image-20241010-083936.png

In LLM Transformation Connector tab, edit the global configuration.

image-20250124-150912.png
  1. Type of LLM

Defines the type of LLM service you are interacting with. The currently available value is “OpenAI API or similar”. You can use this option to call the OpenAI API, our Datafari AI Agent, or any other OpenAI-like API.

  2. LLM API endpoint

The base URL of your API. Default for OpenAI API is:

https://api.openai.com/v1/

For Datafari AI Agent, the URL should look like:

http://[AI_AGENT_HOST]:8888/
  3. Model you want to use for categorization or summarization

The Large Language Model you want to use. Leave empty to use default models. Default for OpenAI API is:

gpt-3.5-turbo

For Datafari AI Agent, default model is:

mistral-7b-openorca.Q4_0.gguf
  4. API key

Required for OpenAI API, or any API that requires an API token.

  5. Max size (in token) of the prompt sent to the model

Required. Defines the maximum chunk size, in tokens.

  6. Max size (in character) of the prompt sent to the model

Optional. The “Max size (in character)” option is a safety feature. If a chunk is larger than the allowed size in characters, it will be TRUNCATED. This will cause DATA LOSS.

Set to 0 to disable.

This option is subject to change.
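A minimal sketch of the character-based safety cut described above (names are illustrative; the connector's actual implementation may differ):

```java
// Sketch of the character-based safety cut: if a chunk exceeds the
// configured limit, the tail is dropped (DATA LOSS); 0 disables the check.
// Names are illustrative, not the connector's actual implementation.
class CharSafetyLimit {
    static String apply(String chunk, int maxChars) {
        if (maxChars <= 0 || chunk.length() <= maxChars) {
            return chunk; // limit disabled, or chunk already small enough
        }
        return chunk.substring(0, maxChars); // TRUNCATED: the tail is lost
    }
}
```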

Define job specifications

In MCF, go to Jobs > List all jobs, and “Edit” the job you want to configure.

  • In the connection tab, on the bottom row stating “Transformation” with a select dropdown, select the “LlmConnector”.

  • Then, click “Insert transformation before” on the output line right above.

  • You should end up with this

image-20241010-090623.png
  • Click on the “LLM Transformation Connector” tab and define the job specifications.

image-20250124-152326.png

The first three checkboxes allow you to enable or disable each feature.

The “Max tokens” field defines the maximum length (in tokens) of the generated summaries.

The “Summaries language” field is the language used for summary generation.

The “Categories” field is the list of available categories. The default list is: Invoice, Call for Tenders, Request for Quotations, Technical paper, Presentation, Resumes. The default category is “Others”. You can add or remove any category.

Then, click “Save”.

Run the job

Once your job is fully configured, on MCF, go to Jobs > Status and Job Management. Then, start the job.

If an error occurs during the process, you can consult the simple history to find out what happened.

If the job ran successfully, the following Solr fields should be filled: “llm_summary” and “llm_categories” (multivalued).

The OpenAI API has strict restrictions on the number of requests it can handle. Currently, some requests may be rejected by the API. The Datafari AI Agent is not affected by this issue, as mentioned above.

Display the “Categories” facet

If you successfully configured the LLM Transformation Connector and ran a crawler job with the “categorization” option, the “llm_categories” Solr field should now contain the documents’ categories. You can use this field to display a facet in the UI.

  • Edit the ui-config.json file:

/opt/datafari/www/ui-config.json
  • Add the following block into the "left" section:

{
  "field": "llm_categories",
  "maxShow": 6,
  "minShow": 3,
  "op": "OR",
  "sendToSolr": true,
  "show": true,
  "title": "Categories",
  "type": "FieldFacet"
},

 

 

image-20241115-150951.png
The llm_categories facet

 

 

More information and details about faceting and customizing Datafari UI here: Customizing DatafariUI

State of the Art

Decision about chunking strategy and their processing

Chunking strategy for Datafari RAG

Langchain4j provides multiple “DocumentSplitter” classes, offering multiple chunking strategy options. In order to preserve the paragraph-based meaning of the chunks, we decided to use the DocumentByParagraphSplitter (excerpt from its Javadoc):

Splits the provided Document into paragraphs and attempts to fit as many paragraphs as possible into a single TextSegment, adhering to the limit set by maxSegmentSize.

Langchain4j chunking options are documented here: RAG (Retrieval-Augmented Generation) | LangChain4j

Langchain (the Python library) provides an experimental Semantic Chunking feature. However, this solution is not available in Langchain4j, and would involve heavier processing, including vector embeddings, that would not fit into the LLM process. That is why chunking by paragraphs or by sentences currently remains the best way to split a document without losing its context, and is the closest to Semantic Chunking we can provide.
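To make the paragraph-packing idea concrete, here is a self-contained sketch of the behavior the Javadoc excerpt describes: fit as many whole paragraphs as possible into a segment without exceeding the maximum size. It is a deliberate simplification (size measured in characters, oversized paragraphs kept whole), not langchain4j's actual code.

```java
// Self-contained sketch of paragraph-based chunking: pack as many whole
// paragraphs as possible into each segment without exceeding maxSegmentSize.
// Simplified on purpose (size in characters, oversized paragraphs kept
// whole); langchain4j's DocumentByParagraphSplitter is more sophisticated.
class ParagraphChunker {
    static java.util.List<String> split(String text, int maxSegmentSize) {
        java.util.List<String> segments = new java.util.ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String paragraph : text.split("\\n\\s*\\n")) { // blank line = break
            // +2 accounts for the "\n\n" separator re-added between paragraphs
            if (current.length() > 0
                    && current.length() + paragraph.length() + 2 > maxSegmentSize) {
                segments.add(current.toString());
                current.setLength(0);
            }
            if (current.length() > 0) current.append("\n\n");
            current.append(paragraph);
        }
        if (current.length() > 0) segments.add(current.toString());
        return segments;
    }
}
```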

Read more about levels of chunking strategies: 7 Chunking Strategies in RAG You Need To Know - F22 Labs

Although we only list RAG-focused articles for chunking strategies, we decided to opt for the same strategy for the LLM Transformation Connector. Other strategies may be more appropriate, but we have not investigated them.

Chunks processing strategy for the LLM Transformation Connector for Summaries

Now that the default chunking method is set, we have to find the most convenient algorithm to handle all the chunks, in order to generate the final summary.

The strategy we use for RAG in Datafari API is the “Map-Reduce” strategy, as presented in several articles [1]:

  • Documents are chunked into TextSegments.

  • Each segment is processed (in that case, summarized) by the LLM.

  • Then, all the responses are sent back to the LLM to generate a final response. If the summaries list is too long, this operation may require a recursive algorithm, involving extra requests. Note that we have not yet implemented such a recursive algorithm.

This solution works fine for most use cases. However, the LLM Connector faces some restrictions: as it needs to process thousands or millions of documents, and each document may need multiple calls to the LLM, we need to make sure that one failure in the processing chain will not break the full document summarization.

That is why we decided to use the Iterative Refining Method [2]:

  • Documents are chunked

  • The first chunk is summarized by the LLM

  • Then, each chunk is sent one-by-one to the LLM with the last generated summary, in order to generate a refined summary.

If the processing chain is broken, we can keep the last generated summary. We have not run benchmarks on this, but a summary generated from the chunks processed so far seems better than nothing. In addition, the Iterative Refining strategy allows setting a limit on the number of chunks processed, in order to limit the number of requests sent to the LLM. This should eventually be benchmarked.
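The Iterative Refining loop described above can be sketched as follows. The llm function stands in for the actual model call, and maxChunks is the request cap mentioned in the text; all names are illustrative, not the connector's actual code.

```java
// Sketch of the Iterative Refining loop: summarize the first chunk, then
// refine the running summary with each following chunk. "llm" stands in
// for the actual model call; if a call fails, the last successfully
// generated summary is kept. Names are illustrative.
class IterativeRefiner {
    // llm.apply(previousSummary, chunk) returns the refined summary;
    // a null previous summary means "summarize this first chunk alone".
    static String summarize(java.util.List<String> chunks,
                            java.util.function.BinaryOperator<String> llm,
                            int maxChunks) {
        String summary = null;
        int limit = Math.min(chunks.size(), maxChunks); // cap LLM requests
        for (int i = 0; i < limit; i++) {
            try {
                summary = llm.apply(summary, chunks.get(i));
            } catch (RuntimeException e) {
                break; // chain broken: keep the last generated summary
            }
        }
        return summary;
    }
}
```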

Note that we think it may be possible to limit the number of chunks used to summarize the document with the Map-Reduce method as well, and still generate relevant summaries. However, we have not tested it yet.

Another significant “pro” of the Iterative Refining Method is that, since the prompt only contains a summary, a chunk and some instructions, it is easier to control its size. In contrast, with the Map-Reduce method, the final prompt may contain an unknown number of digests, requiring recursive algorithms.

 

[1] The Map-Reduce method is often mentioned in articles about RAG or AI summarization. Here are some examples:

generative-ai/language/use-cases/document-summarization/summarization_large_documents_langchain.ipynb at main · GoogleCloudPlatform/generative-ai

https://dev.to/rogiia/how-to-use-llms-summarize-long-documents-4ee1

https://medium.com/@abonia/summarization-with-langchain-b3d83c030889#:~:text=Map%2DReduce%20Method&text=The%20map_reduce%20technique%20is%20designed,to%20create%20a%20final%20summary.

https://python.langchain.com/docs/tutorials/summarization/

 

[2] The Iterative Refining Method is mentioned in these articles:

https://python.langchain.com/docs/how_to/summarize_refine/
https://medium.com/@abonia/summarization-with-langchain-b3d83c030889

 
