AI Agent - Technical documentation

Valid from Datafari 6.2

The AI Agent is an easy-to-use tool provided by France Labs that can host and expose one or several Large Language Models (LLMs). These models can be used with Datafari for various features, such as chat completion and embeddings generation for RAG use cases.

Find the project on our Gitlab: https://gitlab.datafari.com/sandboxespublic/datafari-ai-agent

Installation documentation: AI Agent - Installation and configuration

Functional and API documentation: AI Agent - API documentation

Installation

Installation documentation can be found here: AI Agent - Installation and configuration

How to use it

API documentation can be found here: AI Agent - API documentation

Tools and frameworks

The AI Agent is a Python solution. See the installation documentation for requirements and recommendations.

FastAPI

A fast, easy-to-use framework for building APIs.

See FastAPI documentation here.

Langchain

A powerful framework providing tools to build with LLMs.

See Langchain documentation here.

Llama-cpp-python

Python bindings for llama.cpp, used to load and run LLMs. This solution supports GPU acceleration.

See Llama-cpp documentation here.

Supported models

Llama-cpp-python supports models in GGUF format. These can be downloaded manually, or dynamically from Hugging Face.

Global configuration

The global configuration file, named .env, is located at the root of the project. It contains technical configuration, default values, default model…

Attribute name

Description

Recommendation

Default value

LOCAL_MODELS_ONLY

If enabled, prevents models from being downloaded from Hugging Face.

Optional. Set to true to enable, typically once your model is downloaded and tested.

false

LOG_LEVEL

Log level. Accepted values are DEBUG, INFO, WARNING, ERROR.

INFO is recommended for normal use.

INFO

LOCAL_DIR

The absolute or relative path (from the project root) of the folder that contains the models.

Not yet tested with an absolute path.

Unless you specifically need to store the models somewhere else, use the default value (./models)

./models

LOAD_EMBEDDINGS_MODEL_ON_START

If enabled, the default embeddings model will be loaded on service start.

true

true

LOAD_LLM_ON_START

If enabled, the default LLM will be loaded on service start.

true if you need the chat_completion endpoint. Otherwise, false.

false

N_BATCH

Number of tokens in the prompt that are fed into the model at a time.

Should be between 1 and N_CTX; take into account the amount of VRAM on your GPU.

Only CPU:
512 if RAM < 8GB
1024 if 8GB < RAM < 16GB
2048 if RAM > 16GB

With GPU:
VRAM=6GB : 128/256
VRAM=12GB : 512/1024
VRAM=24GB : 1024/2048

512

N_UBATCH_LLM

Micro-batch size for LLM requests. Helps optimize memory usage during inference.

Set to a value between 512 and 768, depending on N_CTX and available RAM.

768

N_UBATCH_EBD

Micro-batch size for embedding requests. Controls how inputs are split into batches.

512 is recommended. Lower if you encounter memory pressure.

512

N_GPU_LAYERS

The number of layers to put on the GPU. The rest will be on the CPU. If you don't know how many layers there are, you can use -1 to move all to GPU.

High values may improve performance, but increase the risk of out-of-memory errors.

No GPU : 0
VRAM=4GB : 5, low benefit
VRAM=6GB : 10, slight acceleration
VRAM=8GB : 20, decent acceleration, OK for 7B Q4_K_M models
VRAM=12GB : 32, OK for 7B models, almost full GPU
VRAM=16+GB : -1, full GPU, optimal

-1

N_THREADS

Number of CPU threads.

As many threads as your CPU provides if you are using a dedicated machine.

(Auto detection if not set)

(Auto detection if not set)

N_CTX

Deprecated (use N_CTX_LLM and N_CTX_EBD)

Context Window, determines the maximum number of tokens that can be processed at once.

For RAG use cases, 8192 is usually sufficient. Can be increased up to 20000 on high-memory systems.

20000

N_CTX_LLM

Maximum context window (in tokens) specifically for LLM models (chat/completions).

For RAG use cases, 8192 is usually sufficient. Can be increased up to 20000 on high-memory systems.

20000

N_CTX_EBD

Maximum input length (in tokens) for embedding models.

512 is standard and recommended for sentence or short-paragraph embeddings.

512

PORT

The port to use with the API.

 

8888

MAX_REQUEST_QUEUE_SIZE

The maximum number of requests that can be queued.

 

1000

DEFAULT_TEMPERATURE

The default value for temperature. It defines the level of “randomness” of the generated responses.

Integer or Float, from 0 to 1.

Set to 0 for constant results.

0

DEFAULT_MAX_TOKENS

The default value for max_tokens (the limit, in tokens, of LLM responses).

 

200

DEFAULT_LLM_REPOSITORY

Hugging Face repository of the default LLM. The model will be downloaded if it is not already present in the LOCAL_DIR folder.

 

bartowski/Ministral-8B-Instruct-2410-GGUF

DEFAULT_LLM_FILENAME

Name of the default model.

Must be a Hugging Face (or locally installed) GGUF model, supported by llama.cpp.

Q4_K_M models or lower without GPU:
Ministral-8B-Instruct-2410-Q4_K_M.gguf

With a GPU:
Ministral-8B-Instruct-2410-Q6_K.gguf

Ministral-8B-Instruct-2410-Q6_K.gguf

DEFAULT_EMBEDDINGS_MODEL_REPOSITORY

Hugging Face repository of the default embeddings model. The model will be downloaded if it is not already present in the LOCAL_DIR folder.

 

leliuga/all-MiniLM-L6-v2-GGUF

DEFAULT_EMBEDDINGS_MODEL_FILENAME

Name of the default embeddings model.

Must be a Hugging Face (or locally installed) GGUF embeddings model, supported by llama.cpp.

all-MiniLM-L6-v2.Q8_0.gguf generates 384-dimension vectors.

all-MiniLM-L6-v2.Q8_0.gguf
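
As an illustration, a minimal .env could look like this (all values are the defaults documented above; adjust them to your hardware):

```
LOG_LEVEL=INFO
LOCAL_DIR=./models
LOCAL_MODELS_ONLY=false
LOAD_EMBEDDINGS_MODEL_ON_START=true
LOAD_LLM_ON_START=false
N_GPU_LAYERS=-1
N_CTX_LLM=20000
N_CTX_EBD=512
PORT=8888
MAX_REQUEST_QUEUE_SIZE=1000
DEFAULT_LLM_REPOSITORY=bartowski/Ministral-8B-Instruct-2410-GGUF
DEFAULT_LLM_FILENAME=Ministral-8B-Instruct-2410-Q6_K.gguf
DEFAULT_EMBEDDINGS_MODEL_REPOSITORY=leliuga/all-MiniLM-L6-v2-GGUF
DEFAULT_EMBEDDINGS_MODEL_FILENAME=all-MiniLM-L6-v2.Q8_0.gguf
```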

Any modification requires a server restart to be applied. You can use the restart script:

bash /opt/datafari-ai-agent/bin/restart.sh

GPU support

Thanks to llama-cpp-python, the AI Agent supports GPU-accelerated models. Make sure you select the right options during the installation.

Bin scripts

The AI Agent comes with management scripts, located in the ./bin folder.

These scripts should not be run with sudo, which could cause privilege issues.

install.sh

Installation script. Can be used as a standalone installer with the --full option. See the AI Agent - Installation and configuration documentation.

The install.sh script can take several arguments.

Attribute name

Description

Default value

--full

Use this option if you are using the script as a stand-alone installer. If you already cloned or downloaded the project, don't use it.

 

--start

Starts the web services at the end of the installation.

 

--gpu

Use this option to enable the GPU features.

 

-l [location]

The AI Agent will be installed in the [location] folder.

/opt

-b [branch]

The Git branch from which the project is cloned.

master

--help

Display the help message, presenting available arguments.

 

--logrotate

Enable automatic log rotation (via logrotate) and skip the interactive prompt.

 

--nologrotate

Disable automatic log rotation and skip the interactive prompt.

If neither --logrotate nor --nologrotate is specified, the user will be prompted to choose whether to enable log rotation.

 

Example of use:

bash install.sh --full --gpu -l /home/debian -b dev

In this example, the “dev” version of the AI Agent will be downloaded to /home/debian/datafari-ai-agent, with GPU acceleration enabled.

start.sh

Launch the AI Agent services. Execution logs are stored in the ./logs folder. The script also stores the active session PID in /opt/datafari-ai-agent/pid/aiagent.pid (default location).

stop.sh

Stop the AI Agent services and delete the aiagent.pid file.

model_manager.sh

The Model Manager Script allows the admin to:

  • list all locally installed models (in LOCAL_DIR)

  • download new models from Hugging Face

  • delete one model from LOCAL_DIR

  • delete all models from LOCAL_DIR

Arguments are optional. When the script is started without arguments, the administrator is prompted for the required information. Parameters can also be set using the corresponding arguments.

Argument name

Description

--action [action]
-a [action]

The action to perform.

  • list: display a list of the models from LOCAL_DIR

  • add: download a new model. The model filename and repository are prompted for if not provided.

  • remove: remove an existing model from LOCAL_DIR. The model filename is prompted for if not provided.

  • remove_all: remove all existing models from LOCAL_DIR.

--model-filename [model]
-m [model]

Set the model file name. Useful for add and remove actions.

--model-repository [repository]
-r [repository]

Set the model repository. Useful for the add action.

--force
-f

Skip the confirmation input.

--help
-h

Display the help message, presenting available arguments.

Example of use without argument:

bash model_manager.sh

Example of use with arguments (download a new model):

bash model_manager.sh --action add --model-filename all-MiniLM-L6-v2.Q8_0.gguf --model-repository leliuga/all-MiniLM-L6-v2-GGUF --force

Example of use with arguments (delete that model):

bash model_manager.sh -a remove -m all-MiniLM-L6-v2.Q8_0.gguf -f

Overall processing steps

Here's a step-by-step pipeline that describes what happens when a request comes into the web service:

  1. Client Sends a Request

  • A client sends a POST request to the web service, targeting one of the endpoints:

    • /batch for batch processing (multiple queries).

    • /invoke for simple processing (a single query).

  2. API Endpoint Receives the Request

  • The FastAPI server receives the request and routes it to the appropriate handler based on the endpoint.

  3. Request Validation and Queuing

  • The request payload is validated against predefined models.

  • Rate Limiting: The service checks if the request complies with the rate limits (number of requests allowed within a certain time frame).

  • If valid, the request is placed in the appropriate queue:

    • batch_request_queue for batch requests.

    • invoke_request_queue for simple requests.

  4. Background Task Picks Up the Request

  • One of the continuously running background tasks (process_batch_requests or process_simple_requests) checks its respective queue for any pending requests.

  • The task dequeues the request and starts processing it.

  5. Model Loading and Invocation

  • Model Retrieval: The service retrieves the requested (or default) model from the LOCAL_DIR folder, using the get_model function. If the model is already cached, the cached version is used. Otherwise, if downloads are allowed (see LOCAL_MODELS_ONLY) and the "model_repository" and "model" fields are set, the service tries to retrieve and download it from Hugging Face.

  • Processing:

    • The prompt is extracted from the user request, and cleaned using a clean_text function

    • The prompt is then processed by the LLM.

    • The content is extracted from the LLM response.

    • This response is attached to the original request object.

  6. Return the Response

  • The API endpoint that initially received the request waits until the background task has attached the response.

  • The response is then sent back to the client in JSON format.

  • If an error occurred during processing, the error message is returned instead.

  7. Logging and Monitoring

  • Throughout the process, logs are generated to record key events, such as the reception of the request, model loading, and any errors.

  • The log level can be configured in the .env file. See the LOG_LEVEL attribute.

  • Python logs are stored in a file (logs/aiagent.log) and displayed in the console for monitoring.

  • If the AI Agent has been started using the start.sh script, the technical logs are available in logs/aiagent-[datetime].log.
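
The queuing and background-task steps above can be sketched with asyncio (a simplified, illustrative sketch; the real service adds validation, rate limiting, and the actual LLM call):

```python
import asyncio

async def process_simple_requests(queue: asyncio.Queue) -> None:
    """Background task: dequeue pending requests and attach a response."""
    while True:
        request = await queue.get()
        # stand-in for prompt cleaning and the actual LLM invocation
        request["response"] = f"processed: {request['prompt'].strip()}"
        request["done"].set()  # wake up the endpoint waiting on this request

async def invoke_endpoint(queue: asyncio.Queue, prompt: str) -> str:
    """Simplified /invoke handler: enqueue the request, wait for the worker."""
    request = {"prompt": prompt, "done": asyncio.Event()}
    await queue.put(request)
    await request["done"].wait()
    return request["response"]

async def main() -> str:
    invoke_request_queue: asyncio.Queue = asyncio.Queue()
    worker = asyncio.create_task(process_simple_requests(invoke_request_queue))
    answer = await invoke_endpoint(invoke_request_queue, " hello ")
    worker.cancel()
    return answer

print(asyncio.run(main()))  # prints "processed: hello"
```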

Troubleshooting

See troubleshooting section in AI Agent - Installation and configuration.

Security

Input data validation

  • The code uses pydantic to validate input data via the Document, RequestData, and RequestBody data models. This ensures that the incoming data adheres to the expected format.
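
For illustration, a simplified request model could look like this (a minimal sketch; the field names and defaults are assumptions mirroring the .env attributes, not the actual Datafari schema):

```python
from pydantic import BaseModel, ValidationError

class RequestData(BaseModel):
    """Simplified request payload; field names are illustrative."""
    prompt: str
    max_tokens: int = 200      # mirrors DEFAULT_MAX_TOKENS
    temperature: float = 0.0   # mirrors DEFAULT_TEMPERATURE

# a valid payload is accepted and missing fields get their defaults
req = RequestData(prompt="What is Datafari?")
print(req.max_tokens)  # prints 200

# an invalid payload raises ValidationError before any processing happens
try:
    RequestData(max_tokens="many")
except ValidationError:
    print("rejected")
```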

Request queue limit:

To ensure that the web service remains manageable and does not become overwhelmed by an excessive number of requests, a maximum queue size is configured in the .env file (see MAX_REQUEST_QUEUE_SIZE).

Once this limit is reached, any new incoming request is rejected with the following error:

{ "error": "The service is currently overloaded. Please try again later" }
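
This behavior can be sketched with a bounded queue (illustrative only; the real service uses its own queue and response objects):

```python
import asyncio
import json

MAX_REQUEST_QUEUE_SIZE = 3  # normally read from the .env file

def enqueue(queue: asyncio.Queue, request: dict) -> dict:
    """Try to queue a request; return an overload error if the queue is full."""
    try:
        queue.put_nowait(request)
        return {"status": "queued"}
    except asyncio.QueueFull:
        return {"error": "The service is currently overloaded. Please try again later"}

queue: asyncio.Queue = asyncio.Queue(maxsize=MAX_REQUEST_QUEUE_SIZE)
results = [enqueue(queue, {"prompt": f"q{i}"}) for i in range(5)]
print(json.dumps(results[-1]))  # the 4th and 5th requests are rejected
```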