LLM Transformation Connector - ALPHA VERSION
Valid from v6.2 - TBC
This documentation is valid from Datafari v6.2 onwards.
This feature is experimental, and elements from this documentation are subject to changes.
The LLM Transformation Connector currently has two functions, each of them using an external AI model to serve their purpose:
Summarization: Write a summary of the document in the selected language.
Categorization: Categorize the document into one (or more) of the available categories. Categories are configurable. Default categories are: Invoice, Call for Tenders, Request for Quotations, Technical paper, Presentation, Resumes, Others.
Glossary
LLM Connector: A ManifoldCF Transformation connector provided by France Labs that processes AI-related tasks.
Datafari AI Agent: Our own OpenAI-like API, able to run AI models to complete tasks such as RAG, summarization or categorization. Documentation here: AI Agent - API documentation
LLM Services: These Java classes connect the Transformation Connector to your AI API solution. More LLM Services can be created by extending the LlmService class and overriding the invoke method. Currently available LLM Services are OpenAiLlmService (for OpenAI-compatible APIs) and DatafariAiAgentLlmService (for the Datafari AI Agent).
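To create your own LLM Service, extend LlmService and override invoke. Here is a minimal sketch, assuming invoke takes the prompt as a String and returns the model's raw answer; the real class ships with the connector sources and its signature may differ:

// Hypothetical stand-in for the connector's LlmService class; the real one
// ships with the ManifoldCF LLM Connector and its signature may differ.
abstract class LlmService {
    public abstract String invoke(String prompt);
}

// A custom LLM Service: override invoke so it calls your own AI API
// and returns the model's raw text answer.
class MyCustomLlmService extends LlmService {
    @Override
    public String invoke(String prompt) {
        // Replace with a real HTTP call to your AI solution.
        return "answer from my API for: " + prompt;
    }
}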
How does it work
As mentioned above, jobs can run these operations on each document. The operations can be individually enabled or disabled. During document processing, a Large Language Model API (such as OpenAI) is called to execute the operation on the document content.
At this time, two options are available.
OpenAI-compatible APIs. This solution requires an API token. It is compatible with OpenAI API, Datafari AI Agent, and any other OpenAI-compatible services.
Datafari AI Agent. Our Datafari AI Agent is an OpenAI-compatible API and can be used with the OpenAiLlmService. However, we provide a dedicated LLM Service. The main differences are:
it does not require the API token variable to be present (in this case, it creates a default API token before contacting the LLM),
it is meant to be used with local LLMs rather than remote ones (local meaning on the Datafari AI Agent server),
it automatically falls back to a local default LLM when it cannot use the LLM requested in the Transformation Connector,
it contains a request queue mechanism that manages the request rate sent to the LLM, in order to avoid resource issues.
It is important to remember that the Datafari AI Agent silently falls back to its default model when the requested model cannot be used. This means the transformation connector will still work, and the only way to know that your model has NOT been used is to look at the logs. Improving this behavior may be the subject of a future ticket.
Chunking
There is currently no chunking strategy implemented in the connector for the Categorization task. We have not run any benchmark to validate this choice, but we have arbitrarily decided that the categories are generated based on the beginning of the document, since adding a chunking strategy here would cause a significant drop in performance. We may however, in the near future, add a chunking strategy for categorization (or other genAI-related prompts) in order to generate more accurate answers. This would be done on the transformation connector side, together with the loop that iterates over the chunks when contacting the LLM.
To avoid exceptions due to the number of tokens sent to the LLM, the size of the chunks is now configurable. See the “Max size (in token)” configuration for more information.
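As an illustration of how a token limit bounds the content, here is a hedged sketch. Real token counting is model-specific; the 4-characters-per-token ratio below is a crude approximation used only for the example, not what the connector actually does:

// Illustrative only: bound the text sent to the LLM using a token budget.
// ~4 characters per token is a rough heuristic, not a real tokenizer.
static String fitToTokenBudget(String content, int maxTokens) {
    int approxMaxChars = maxTokens * 4; // crude chars-per-token estimate
    return content.length() <= approxMaxChars
            ? content
            : content.substring(0, approxMaxChars);
}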
Prompt
Here are the prompts used for categorization and summarization.
Categorization
"""Categorize the following document in one (or more) of the following categories: {categories}. If you don't know, say \"Others\". \n\n {content}"""
Summarization (depending on whether a language is specified)
Summarize this document : \n\n {document}
Summarize this document in {language}: \n\n """{document}"""
Recursive summarization (used with the Iterative Refining Method). It starts after a simple summarization of the first chunk.
Here is a summary of the document {documentName} you wrote, based on {parts 1 to x-1}:
"""{last_generated_summary}"""
Here is the part {x} of the document:
"""{chunk_content}"""
Write a global summary {in_language} of the document.
x being the processed chunk number (the first chunk is “1”, and is excluded from the loop since it has to be summarized beforehand). The “{parts 1 to x-1}” placeholder is replaced by:
- “part 1” if x = 2
- “parts 1 to x-1” if x > 2
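To make the loop concrete, here is a hedged sketch of how these templates can be assembled. The llm function stands for any call to the model; method and variable names are illustrative, not the connector's actual code:

// Sketch of the iterative refining loop using the prompt templates above.
static String refineSummary(String documentName, java.util.List<String> chunks,
                            String language, java.util.function.Function<String, String> llm) {
    // The first chunk is summarized beforehand with the simple prompt.
    String summary = llm.apply("Summarize this document : \n\n " + chunks.get(0));
    for (int x = 2; x <= chunks.size(); x++) {
        // {parts 1 to x-1}: "part 1" when x = 2, "parts 1 to x-1" when x > 2
        String parts = (x == 2) ? "part 1" : "parts 1 to " + (x - 1);
        String prompt = "Here is a summary of the document " + documentName
                + " you wrote, based on " + parts + ":\n"
                + "\"\"\"" + summary + "\"\"\"\n"
                + "Here is the part " + x + " of the document:\n"
                + "\"\"\"" + chunks.get(x - 1) + "\"\"\"\n"
                + "Write a global summary in " + language + " of the document.";
        summary = llm.apply(prompt); // the refined summary replaces the previous one
    }
    return summary;
}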
How to use it
Create the job
Using Datafari’s admin interface, create a job for the source you want to crawl, WITHOUT checking the box that starts the job immediately.
Define global configuration
In MCF, go to Outputs > List Transformation Connections, and “Edit” the LLM Connector.
In LLM Transformation Connector tab, edit the global configuration.
Type of LLM
Defines the type of LLM service you are interacting with. The currently available value is “OpenAI API or similar”. You can use this option to call the OpenAI API, our Datafari AI Agent, or any other OpenAI-like API.
LLM API endpoint
The base URL of your API. The default for the OpenAI API is:
https://api.openai.com/v1/
For the Datafari AI Agent, the URL should look like:
http://[AI_AGENT_HOST]:8888/
Model you want to use for categorization or summarization
The Large Language Model you want to use. Leave empty to use the default model. The default for the OpenAI API is:
gpt-3.5-turbo
For the Datafari AI Agent, the default model is:
mistral-7b-openorca.Q4_0.gguf
API key
Required for OpenAI API, or any API that requires an API token.
Max size (in token) of the prompt sent to the model
Required. Defines the maximum size, in tokens, of the chunks sent to the model.
Max size (in character) of the prompt sent to the model
Optional. The “Max size (in character)” option is a safety feature: if a chunk is larger than the allowed size in characters, it will be TRUNCATED. This will cause DATA LOSS.
Set to 0 to disable.
This option is subject to change.
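As an illustration, a global configuration targeting a local Datafari AI Agent could look like this (all values are hypothetical examples):

Type of LLM: OpenAI API or similar
LLM API endpoint: http://localhost:8888/
Model: mistral-7b-openorca.Q4_0.gguf
API key: (left empty, not required by the Datafari AI Agent)
Max size (in token) of the prompt: 2048
Max size (in character) of the prompt: 0 (truncation disabled)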
Define job specifications
In MCF, go to Jobs > List all jobs, and “Edit” the job you want to configure.
In the Connection tab, use the “Transformation” dropdown on the bottom row to select the “LlmConnector”.
Then, click “Insert transformation before” on the output line right above.
You should end up with the LlmConnector listed as a transformation right before the output.
Click on the “LLM Transformation Connector” tab and define the job specifications.
The first three checkboxes allow you to enable or disable each feature.
The “Max tokens” field defines the length of the generated summaries.
“Summaries language” is the language used for summaries generation.
“Categories” is the list of available categories. The default list is: Invoice, Call for Tenders, Request for Quotations, Technical paper, Presentation, Resumes. The default category is “Others”. You can add or remove any category.
Then, click “save”.
Run the job
Once your job is fully configured, on MCF, go to Jobs > Status and Job Management. Then, start the job.
If an error occurs during the process, you can consult the simple history to find out what happened.
If the job ran successfully, the following Solr fields should be filled: “llm_summary” and “llm_categories” (multivalued).
The OpenAI API enforces strict limits on the number of requests it can handle, so some requests may currently be rejected by the API. As mentioned above, the AI Agent is not affected by this issue thanks to its request queue mechanism.
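If you target a rate-limited API such as OpenAI, rejected requests can be retried on the client side. The sketch below shows a generic retry-with-backoff pattern; it is not part of the connector:

// Generic retry with exponential backoff for rate-limited APIs (e.g. when
// OpenAI rejects a request). Illustrative only, not the connector's code.
static String invokeWithRetry(java.util.function.Function<String, String> llm,
                              String prompt, int maxAttempts) throws InterruptedException {
    long delayMs = 1000;
    for (int attempt = 1; ; attempt++) {
        try {
            return llm.apply(prompt); // succeeds unless the API rejects the call
        } catch (RuntimeException rejected) {
            if (attempt >= maxAttempts) throw rejected; // give up after maxAttempts
            Thread.sleep(delayMs);
            delayMs *= 2; // wait twice as long before the next attempt
        }
    }
}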
Display the “Categories” facet
If you successfully configured the LLM Transformation Connector and ran a crawler job with the “categorization” option enabled, the “llm_categories” Solr field should now contain the document categories. You can use this field to display a facet in the UI.
Edit the ui-config.json file:
/opt/datafari/www/ui-config.json
Add the following block into the "left" section:
{
"field": "llm_categories",
"maxShow": 6,
"minShow": 3,
"op": "OR",
"sendToSolr": true,
"show": true,
"title": "Categories",
"type": "FieldFacet"
},
More information and details about faceting and customizing Datafari UI here: Customizing DatafariUI
State of the Art
Decision about chunking strategy and their processing
Chunking strategy for Datafari RAG
Langchain4j provides multiple “DocumentSplitter” classes, offering several chunking strategy options. In order to preserve the “paragraph-based” meaning of the chunks, we decided to use the DocumentByParagraphSplitter (extract from its Javadoc):
Splits the provided Document into paragraphs and attempts to fit as many paragraphs as possible into a single TextSegment, adhering to the limit set by maxSegmentSize.
Langchain4j chunking options are documented here: RAG (Retrieval-Augmented Generation) | LangChain4j
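For reference, here is a minimal usage sketch of the DocumentByParagraphSplitter. Segment sizes are example values, and constructor signatures may vary across Langchain4j versions:

import dev.langchain4j.data.document.Document;
import dev.langchain4j.data.document.splitter.DocumentByParagraphSplitter;
import dev.langchain4j.data.segment.TextSegment;
import java.util.List;

public class ParagraphChunkingExample {
    public static void main(String[] args) {
        Document document = Document.from("First paragraph...\n\nSecond paragraph...");
        // Fit as many whole paragraphs as possible into each segment:
        // at most 500 characters per segment, with a 50-character overlap.
        DocumentByParagraphSplitter splitter = new DocumentByParagraphSplitter(500, 50);
        List<TextSegment> segments = splitter.split(document);
        segments.forEach(segment -> System.out.println(segment.text()));
    }
}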
Langchain (the Python library) provides an experimental Semantic Chunking feature. However, this solution is not available in Langchain4j, and it would involve heavier processing, including vector embeddings that would not fit into the LLM process. That is why chunking by paragraphs or by sentences remains (currently) the best way to split a document without losing its context, and is the closest to Semantic Chunking we can currently provide.
Read more about levels of chunking strategies: 7 Chunking Strategies in RAG You Need To Know - F22 Labs
Although we only list RAG-focused articles about chunking strategies, we decided to opt for the same strategy for the LLM Transformation Connector. Other strategies may be more appropriate, but we have not investigated them.
Chunks processing strategy for the LLM Transformation Connector for Summaries
Now that the default chunking method is set, we have to find the most convenient algorithm to handle all the chunks, in order to generate the final summary.
The strategy we use for RAG in Datafari API is the “Map-Reduce” strategy, as presented in several articles [1]:
Documents are chunked into TextSegments.
Each segment is processed (in that case, summarized) by the LLM.
Then, all the responses are sent back to the LLM to generate a final response. If the list of summaries is too long, this operation may require a recursive algorithm involving extra requests; note that we have not yet implemented such a recursive algorithm. A minimal sketch follows this list.
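In the sketch below, llm stands for any call to the model; the recursive reduce step is deliberately left out, and names are illustrative rather than the connector's actual code:

// Map-Reduce summarization sketch. Illustrative only; the recursive reduce
// step mentioned above is not implemented here either.
static String mapReduceSummary(java.util.List<String> segments,
                               java.util.function.Function<String, String> llm) {
    StringBuilder partialSummaries = new StringBuilder();
    for (String segment : segments) {
        // Map step: summarize each segment independently.
        partialSummaries.append(llm.apply("Summarize this document : \n\n " + segment))
                        .append("\n\n");
    }
    // Reduce step: merge all partial summaries into one final summary.
    return llm.apply("Write a global summary of the document based on these "
            + "partial summaries:\n\n" + partialSummaries);
}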
This solution works fine for most use cases. However, the LLM Connector faces additional constraints: since it needs to process thousands or millions of documents, and each document may require multiple calls to the LLM, we need to make sure that a single failure in the chain does not break the summarization of the whole document.
That is why we decided to use the Iterative Refining Method [2]:
Documents are chunked
The first chunk is summarized by the LLM
Then, each remaining chunk is sent one by one to the LLM together with the last generated summary, in order to generate a refined summary.
If the process chain is broken, we can keep the last generated summary. We have not run benchmarks on this, but keeping the last generated summary seems better than nothing. In addition, the Iterative Refining strategy allows setting a limit on the number of chunks processed, in order to limit the number of requests sent to the LLM. This should eventually be benchmarked.
Note that it may also be possible to limit the number of chunks used to summarize a document with the Map-Reduce method and still generate relevant summaries. However, we have not tested it yet.
Another significant advantage of the Iterative Refining Method is that, since the prompt only contains a summary, a chunk and some instructions, its size is easier to control. In contrast, with the Map-Reduce method, the final prompt may contain an unknown number of digests, requiring recursive algorithms.
[1] The Map-Reduce strategy is often mentioned in articles about RAG or AI summarization. Here are some examples:
https://dev.to/rogiia/how-to-use-llms-summarize-long-documents-4ee1
https://python.langchain.com/docs/tutorials/summarization/
[2] The Iterative Refining Method is mentioned in these articles:
https://python.langchain.com/docs/how_to/summarize_refine/
https://medium.com/@abonia/summarization-with-langchain-b3d83c030889