AI Agent - Technical documentation
Valid from Datafari 6.2
The AI Agent is an easy-to-use tool provided by France Labs that can host and expose one or several Large Language Models (LLMs). These models can be used with Datafari for various features, such as:
Retrieval-Augmented Generation (RAG)
Summarization (LLM Transformation Connector - France Labs only)
Categorization (LLM Transformation Connector - France Labs only)
Find the project on our Gitlab: https://gitlab.datafari.com/sandboxespublic/datafari-ai-agent
Installation documentation: AI Agent - Installation and configuration
Functional and API documentation: AI Agent - API documentation
Installation
Installation documentation can be found here: AI Agent - Installation and configuration
How to use it
API documentation can be found here: AI Agent - API documentation
Tools and frameworks
The AI Agent is a Python solution. See the installation documentation for requirements and recommendations.
FastAPI
A simple and fast framework for building APIs.
See FastAPI documentation here.
Langchain
A powerful framework providing tools to build with LLMs.
See Langchain documentation here.
Llama-cpp-python
A Python package for managing LLMs (Python bindings for llama.cpp). This solution supports GPU acceleration.
See Llama-cpp documentation here.
Supported models
Llama-cpp-python supports models in GGUF format. These can be downloaded manually, or dynamically from Hugging Face.
Global configuration
The global configuration file, named .env, is located at the root of the project. It contains technical configuration, default values, default model…
| Attribute name | Description | Recommendation | Default value |
|---|---|---|---|
| LOCAL_MODELS_ONLY | If enabled, prevents models from being downloaded from Hugging Face. | Optional. You can set it to true once your model is downloaded and tested. | false |
| LOG_LEVEL | Log level. Accepted values are the standard Python logging levels (DEBUG, INFO, WARNING, ERROR, CRITICAL). | | INFO |
| LOCAL_DIR | The absolute or relative path (from the project root) of the folder that contains the models. Not yet tested with an absolute path. | Unless you specifically need to store the models somewhere else, use the default value. | ./models |
| LOAD_EMBEDDINGS_MODEL_ON_START | If enabled, the default embeddings model is loaded on service start. | | true |
| LOAD_LLM_ON_START | If enabled, the default LLM is loaded on service start. | | false |
| N_BATCH | Number of prompt tokens fed into the model at a time. Should be between 1 and the context size; consider the amount of VRAM on your GPU. | The recommended value differs between CPU-only and GPU setups. | 512 |
| N_UBATCH_LLM | Micro-batch size for LLM requests. Helps optimize memory usage during inference. | | 768 |
| N_UBATCH_EBD | Micro-batch size for embedding requests. Controls how inputs are split into batches. | | 512 |
| N_GPU_LAYERS | The number of layers to put on the GPU; the rest stays on the CPU. If you don't know how many layers the model has, you can use -1 to move them all to the GPU. High values may improve performance, but increase the risk of out-of-memory errors. | | -1 |
| N_THREADS | Number of CPU threads. | As many threads as your CPU provides, if you are using a dedicated machine. | (auto-detected if not set) |
| N_CTX | Deprecated (use N_CTX_LLM and N_CTX_EBD). Context window: the maximum number of tokens that can be processed at once. | For RAG use cases, prefer a large context window. | 20000 |
| N_CTX_LLM | Maximum context window (in tokens) specifically for LLM models (chat/completions). | For RAG use cases, prefer a large context window. | 20000 |
| N_CTX_EBD | Maximum input length (in tokens) for embedding models. | | 512 |
| PORT | The port used by the API. | | 8888 |
| MAX_REQUEST_QUEUE_SIZE | The maximum number of requests that can be queued. | | 1000 |
| DEFAULT_TEMPERATURE | The default temperature. It defines the level of "randomness" of the generated responses. Integer or float, from 0 to 1. | | 0 |
| DEFAULT_MAX_TOKENS | The default value for max_tokens (the limit, in tokens, of LLM responses). | | 200 |
| DEFAULT_LLM_REPOSITORY | Hugging Face repository of the default model. The model is downloaded from it if not already present in the LOCAL_DIR folder. | | bartowski/Ministral-8B-Instruct-2410-GGUF |
| DEFAULT_LLM_FILENAME | Name of the default model. | Must be a Hugging Face (or locally installed) GGUF model supported by llama.cpp. Without a GPU, prefer Q4_K_M models or lower. | Ministral-8B-Instruct-2410-Q6_K.gguf |
| DEFAULT_EMBEDDINGS_MODEL_REPOSITORY | Hugging Face repository of the default embeddings model. The model is downloaded from it if not already present in the LOCAL_DIR folder. | | leliuga/all-MiniLM-L6-v2-GGUF |
| DEFAULT_EMBEDDINGS_MODEL_FILENAME | Name of the default embeddings model. | Must be a Hugging Face (or locally installed) GGUF embeddings model supported by llama.cpp. | all-MiniLM-L6-v2.Q8_0.gguf |
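As a quick reference, here is a minimal .env sketch built only from the default values listed above; adjust it to your hardware and models:

```
LOG_LEVEL=INFO
LOCAL_DIR=./models
LOAD_EMBEDDINGS_MODEL_ON_START=true
LOAD_LLM_ON_START=false
N_CTX_LLM=20000
N_CTX_EBD=512
PORT=8888
MAX_REQUEST_QUEUE_SIZE=1000
DEFAULT_TEMPERATURE=0
DEFAULT_MAX_TOKENS=200
DEFAULT_LLM_REPOSITORY=bartowski/Ministral-8B-Instruct-2410-GGUF
DEFAULT_LLM_FILENAME=Ministral-8B-Instruct-2410-Q6_K.gguf
DEFAULT_EMBEDDINGS_MODEL_REPOSITORY=leliuga/all-MiniLM-L6-v2-GGUF
DEFAULT_EMBEDDINGS_MODEL_FILENAME=all-MiniLM-L6-v2.Q8_0.gguf
```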
Any modification requires a server restart to be applied. You can use the restart script:
bash /opt/datafari-ai-agent/bin/restart.sh
GPU support
Thanks to llama-cpp-python, the AI Agent supports GPU-accelerated models. Make sure you select the right options during the installation.
Bin scripts
The AI Agent comes with monitoring scripts, located in the ./bin folder.
The monitoring scripts should not be run with sudo, as doing so could cause privilege issues.
install.sh
Installation script. It can be used as a standalone installer with the --full option. See the AI Agent - Installation and configuration documentation.
The install.sh script can take several arguments.
| Argument name | Description | Default value |
|---|---|---|
| --full | Use this option if you are using the script as a stand-alone installer. If you have already cloned or downloaded the project, don't use it. | |
| --start | Starts the web services at the end of the installation. | |
| --gpu | Use this option to enable the GPU features. | |
| -l [location] | The AI Agent will be installed in the [location] folder. | /opt |
| -b [branch] | The Git branch from which the project is cloned. | master |
| --help | Displays the help message, presenting the available arguments. | |
| --logrotate | Enables automatic log rotation (via logrotate) and skips the interactive prompt. | |
| --nologrotate | Disables automatic log rotation and skips the interactive prompt. If neither --logrotate nor --nologrotate is set, the script asks interactively. | |
Example of use:
bash install.sh --full --gpu -l /home/debian -b dev
In this example, the "dev" version of the AI Agent will be downloaded to /home/debian/datafari-ai-agent, with GPU acceleration enabled.
start.sh
Launches the AI Agent services. Execution logs are stored in the ./logs folder. The script also stores the active session PID in /opt/datafari-ai-agent/pid/aiagent.pid (default location).
stop.sh
Stops the AI Agent services and deletes the aiagent.pid file.
model_manager.sh
The Model Manager Script allows the admin to:
list all locally installed models (in LOCAL_DIR)
download new models from Hugging Face
delete one model from LOCAL_DIR
delete all models from LOCAL_DIR
Arguments are optional. When the script is started without arguments, the administrator is asked to provide the information through interactive prompts. It is also possible to set each parameter using the corresponding argument.
| Argument name | Description |
|---|---|
| --action [action] | The action to process (e.g. add to download a model, remove to delete one; see the examples below). |
| --model-filename [model] | Sets the model file name. |
| --model-repository [repository] | Sets the model repository. |
| --force | Skips the confirmation input. |
| --help | Displays the help message, presenting the available arguments. |
Example of use without argument:
bash model_manager.sh
Example of use with arguments (download a new model):
bash model_manager.sh --action add --model-filename all-MiniLM-L6-v2.Q8_0.gguf --model-repository leliuga/all-MiniLM-L6-v2-GGUF --force
Example of use with arguments (delete that model):
bash model_manager.sh -a remove -m all-MiniLM-L6-v2.Q8_0.gguf -f
Overall processing steps
Here's a step-by-step pipeline that describes what happens when a request comes into the web service:
Client Sends a Request
A client sends a POST request to the web service, targeting one of the endpoints:
/batch for batch processing (multiple queries).
/invoke for simple processing (a single query).
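To make the request shape concrete, here is a small Python helper that builds a body for the /invoke endpoint. The "model" and "model_repository" fields are mentioned in the processing steps below; the "prompt" field name is a placeholder, so check the AI Agent - API documentation for the exact schema:

```python
import json

def build_invoke_payload(prompt, model=None, model_repository=None):
    """Build a JSON body for POST /invoke.

    "prompt" is an assumed field name; "model" and "model_repository"
    are the optional fields used for dynamic model retrieval.
    """
    payload = {"prompt": prompt}
    if model:
        payload["model"] = model
    if model_repository:
        payload["model_repository"] = model_repository
    return json.dumps(payload)
```

Sending it is then a standard HTTP POST to http://localhost:8888/invoke (8888 being the default PORT).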
API Endpoint Receives the Request
The FastAPI server receives the request and routes it to the appropriate handler based on the endpoint.
Request Validation and Queuing
The request payload is validated against predefined models.
Rate Limiting: The service checks if the request complies with the rate limits (number of requests allowed within a certain time frame).
If valid, the request is placed in the appropriate queue:
batch_request_queue for batch requests.
invoke_request_queue for simple requests.
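The queuing step can be sketched with an asyncio queue capped at MAX_REQUEST_QUEUE_SIZE. This is an illustration, not the actual Datafari code; the real service maintains separate batch_request_queue and invoke_request_queue instances behind its FastAPI handlers:

```python
import asyncio

MAX_REQUEST_QUEUE_SIZE = 1000  # mirrors the .env attribute of the same name

def enqueue_or_reject(queue, request):
    """Queue a validated request, or return the overload error.

    put_nowait raises QueueFull immediately when the bounded queue
    is at capacity, which maps to the service's overload response.
    """
    try:
        queue.put_nowait(request)
        return {"status": "queued"}
    except asyncio.QueueFull:
        return {"error": "The service is currently overloaded. Please try again later"}
```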
Background Task Picks Up the Request
One of the continuously running background tasks (process_batch_requests or process_simple_requests) checks its respective queue for any pending requests. The task dequeues the request and starts processing it.
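A minimal sketch of such a background consumer, with "handle" standing in for the model invocation and plain dicts standing in for the service's real request objects:

```python
import asyncio

async def process_simple_requests(queue, handle):
    """Continuously dequeue requests and attach a response to each.

    Illustrative only: the real tasks also deal with batching,
    model loading and rate limiting.
    """
    while True:
        request = await queue.get()      # waits until a request is pending
        try:
            request["response"] = handle(request["prompt"])
        except Exception as exc:         # an error message replaces the response
            request["error"] = str(exc)
        finally:
            queue.task_done()            # lets queue.join() track completion
```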
Model Loading and Invocation
Model Retrieval: The service retrieves the queried (or default) model from the LOCAL_DIR folder, using the get_model function. If the model is already cached, the cached version is used. Otherwise, if allowed (see LOCAL_MODELS_ONLY), and if the "model_repository" and "model" fields are set, the service tries to retrieve and download it from Hugging Face.
Processing:
The prompt is extracted from the user request and cleaned using a clean_text function.
The prompt is then processed by the LLM.
The content is extracted from the LLM response.
This response is attached to the original request object.
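The model-retrieval order described above can be sketched as follows. This is not the actual get_model implementation: load_gguf and download_from_hugging_face are hypothetical placeholders for the llama-cpp-python and Hugging Face calls:

```python
import os

_model_cache = {}

def load_gguf(path):
    # Placeholder for llama_cpp.Llama(model_path=path, ...)
    return {"path": path}

def download_from_hugging_face(repository, filename, local_dir):
    # Placeholder for a Hugging Face download call
    raise NotImplementedError("network download not sketched here")

def get_model(filename, repository=None, local_dir="./models",
              local_models_only=False):
    """Return a loaded model: cache first, then LOCAL_DIR, then Hugging Face."""
    if filename in _model_cache:                  # 1. already loaded
        return _model_cache[filename]
    path = os.path.join(local_dir, filename)
    if not os.path.exists(path):                  # 2. not on disk
        if local_models_only or not repository:
            raise FileNotFoundError(f"{filename} not found in {local_dir}")
        download_from_hugging_face(repository, filename, local_dir)  # 3. fetch
    model = load_gguf(path)
    _model_cache[filename] = model
    return model
```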
Return the Response
The API endpoint that initially received the request waits until the background task has attached the response.
The response is then sent back to the client in JSON format.
If an error occurred during processing, the error message is returned instead.
Logging and Monitoring
Throughout the process, logs are generated to record key events, such as the reception of the request, model loading, and any errors.
Log level can be configured in the .env file; see the LOG_LEVEL attribute.
Python logs are stored in a file (logs/aiagent.log) and displayed in the console for monitoring.
If the AI Agent has been started using the start.sh script, the technical logs are available in logs/aiagent-[datetime].log.
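A sketch of this logging setup with the standard library: the level comes from the LOG_LEVEL environment variable, and output goes to both a file and the console (handler details are illustrative, not the service's exact configuration):

```python
import logging
import os

def configure_logging(log_file="aiagent.log"):
    """Create a logger whose level is read from LOG_LEVEL (default INFO)."""
    level = os.environ.get("LOG_LEVEL", "INFO").upper()
    logger = logging.getLogger("aiagent")
    logger.setLevel(level)                                 # accepts level names
    logger.addHandler(logging.StreamHandler())             # console output
    logger.addHandler(logging.FileHandler(log_file))       # file output
    return logger
```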
Troubleshooting
See troubleshooting section in AI Agent - Installation and configuration.
Security
Input data validation
The code uses pydantic to validate input data via the Document, RequestData, and RequestBody data models. This ensures that the incoming data adheres to the expected format.
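As a dependency-free illustration of the kind of checks pydantic performs, here is a stdlib dataclass sketch. The real service defines Document, RequestData, and RequestBody as pydantic models; the field names below are assumptions, except the defaults, which match DEFAULT_TEMPERATURE and DEFAULT_MAX_TOKENS:

```python
from dataclasses import dataclass

@dataclass
class RequestData:
    prompt: str                 # illustrative field name
    temperature: float = 0.0    # DEFAULT_TEMPERATURE
    max_tokens: int = 200       # DEFAULT_MAX_TOKENS

    def __post_init__(self):
        # Reject malformed input early, as pydantic validation would
        if not isinstance(self.prompt, str) or not self.prompt.strip():
            raise ValueError("prompt must be a non-empty string")
        if not 0 <= self.temperature <= 1:
            raise ValueError("temperature must be between 0 and 1")
        if self.max_tokens <= 0:
            raise ValueError("max_tokens must be positive")
```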
Request queue limit:
To ensure that the web service remains manageable and does not become overwhelmed by an excessive number of requests, a limit is configured in the .env file (see MAX_REQUEST_QUEUE_SIZE).
Once this limit is reached, any new incoming request is rejected with the following error:
{
"error": "The service is currently overloaded. Please try again later"
}