Binary Transformation Connector
Valid from v6.2 in beta version
This documentation is valid from Datafari v6.2 onwards.
This feature is a work in progress, and is subject to change.
Description
The Binary Transformation Connector is a custom component for the ManifoldCF crawling pipeline that enables metadata extraction from binary files by delegating analysis to external processing services. This allows documents such as images or PDFs to be dynamically processed by AI or third-party APIs before indexing.
Global workflow
Filter
Filters are applied. If the document should not be processed (e.g. if the connector is disabled or if the document is filtered due to configuration), the process ends here.
Binary content extraction
When a document is ingested, its binary content is extracted.
When a document’s binary content is extracted, it is removed from the original document. To prevent issues and conflicts with other processing steps in the indexing pipeline, the connector restores the document’s content immediately after the extraction.
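The extract-then-restore step can be pictured with plain Java streams. This is only a minimal sketch under assumptions, not the connector's actual code: the class and method names are hypothetical, and the real connector works on ManifoldCF's document object rather than on a bare InputStream.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Base64;

/** Hypothetical helper illustrating "read the content, then put it back". */
final class BinaryContentSketch {

    /** Reads the stream to the end. After this call the original stream is consumed. */
    static byte[] drain(InputStream in) {
        try {
            return in.readAllBytes();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Encodes the extracted bytes as the Base64 text passed to invoke(...). */
    static String toBase64(byte[] bytes) {
        return Base64.getEncoder().encodeToString(bytes);
    }

    /** Wraps the same bytes in a fresh stream so later pipeline stages can still read them. */
    static InputStream restore(byte[] bytes) {
        return new ByteArrayInputStream(bytes);
    }
}
```

Keeping the bytes in memory is what makes the restore possible; for very large files a real implementation would probably need a different strategy.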
External Service picking
Datafari provides a list of “External Service connection classes”, each containing the following method:
public String invoke(String base64content) throws ManifoldCFException { ... }
There should be one Java class per type of External Service. Currently, Datafari only supports the following:
OpenAI API (com.francelabs.datafari.transformation.binary.services.OpenAiExternalService)
Datakeen API (com.francelabs.datafari.transformation.binary.services.DatakeenExternalService)
The role of these classes is:
Retrieve the required parameters
Prepare the HTTP/HTTPS request
Call the external service
Handle service-related errors
Return the JSON response
Connection classes may require specific parameters that are not part of the default Connector specifications. Those should be set in the “Additional parameters” field.
| Default Connector Specifications | Examples of additional parameters |
|---|---|
| | max_tokens=300 |
Compatible services are documented below.
External Service calling
The selected external service is called with the invoke( … ) method. The request is prepared and sent. The JSON response is returned as a String, ready to be parsed.
In case of errors or exceptions, the connection class throws a ManifoldCFException with error details, so these are logged in the ManifoldCF Simple History.
Data extraction/injection
The connector parses the JSON response and injects the relevant fields as metadata into the document, which are then passed on for indexing by Solr.
The data extraction uses our JsonUtils class, which extracts data from the provided JSON based on a location key. Once retrieved, the data is injected into the document’s metadata.
It is also possible to override the content of the document in the index by setting the “content” metadata in the “Metadata extraction” field. See the Configuration section for more details.
Configuration
Here is how to implement the Binary Transformation Connector into your ManifoldCF pipeline.
Add the connector to the pipeline.
Add the connector to the ManifoldCF crawling job. It is possible to add multiple Binary Connectors in a single job.
The Binary connector must always be placed BEFORE the Tika connector. The Tika connector empties the document's binary content when reading it, without restoring it afterwards. If the Binary Connector were placed after Tika, it would not be able to read and process this content.
Open the connector specifications tab
Go to the "Binary Transformation Connector" tab, and configure the job.
Provide the service information
Fill the following fields:
Enable this connector: Check it!
Type of external service: select the service you are using. Current options are Datakeen and OpenAI.
Service hostname: The base URL of the API. You can leave this one empty to use the default URL, as defined in the connection class.
Service endpoint: The endpoint this connector instance must use. You can leave this one empty to use the default endpoint, as defined in the connection class.
Security token: The API key. Use it only if the service you are using requires it.
Additional parameters: Set here the additional parameters that are specific to the service you are using. See the table below for the list of parameters used by each service.
Parameters must use the format key=value.
One parameter per line.
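As an illustration of the key=value format, the sketch below reads such a text block into a map. It is a minimal example under assumptions: the class name is hypothetical, and details such as skipping malformed lines and splitting on the first "=" are choices made here, not documented connector behavior.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical parser for "one key=value per line" text areas. */
final class KeyValueLinesSketch {

    static Map<String, String> parse(String text) {
        Map<String, String> params = new LinkedHashMap<>();
        for (String line : text.split("\\R")) {          // \R matches any line break
            String trimmed = line.trim();
            int eq = trimmed.indexOf('=');
            if (trimmed.isEmpty() || eq <= 0) {
                continue;                                 // skip blank lines and lines without a key
            }
            // Split on the FIRST '=' only, so values may themselves contain '='.
            params.put(trimmed.substring(0, eq).trim(), trimmed.substring(eq + 1).trim());
        }
        return params;
    }
}
```

Splitting on the first "=" matters for values such as passwords, which may legitimately contain that character.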
| | OpenAI API | Datakeen API |
|---|---|---|
| Type of service | OpenAI | Datakeen |
| Service hostname | Optional. If empty, the default URL defined in the connection class is used. | Optional. If empty, the default URL defined in the connection class is used. |
| Service endpoint | Optional. If empty, the default endpoint defined in the connection class is used. | Optional. If empty, the default endpoint defined in the connection class is used. |
| Security token | Required! Use your own OpenAI API token. | Do not set it (unless you have a permanent token). The Datakeen API requires authentication to generate a 10-minute token. If the API token is not defined (and it should not be), the connection class uses the username/password [1] to generate a dynamic token during indexing. Currently, a new security token is generated for each document. This may be optimized in the future. [1] See “Additional parameters”. |
| Additional parameters | Example: max_tokens=500 and temperature=0.1 (one per line). max_tokens: Optional. The maximum size (in tokens) of the model response. An arbitrary default applies when unset. temperature: Optional. The randomness (from 0 to 1) of the generated response. An arbitrary default applies when unset. | Example: username=mydatakeenloggin and password=MySeCrEtPaSsWoRd! (one per line). username: Your Datakeen username. Required, unless you have a permanent security token. password: Your Datakeen password. Required, unless you have a permanent security token. |
The final URL used to call the service is built by concatenating the HOSTNAME and the ENDPOINT.
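The URL construction described above, including the fallback to defaults when a field is left empty, could look like the sketch below. The class name, method signature and the example.com URLs are assumptions for illustration; the actual default values live in each connection class.

```java
/** Hypothetical illustration of "final URL = hostname + endpoint". */
final class ServiceUrlSketch {

    static String buildUrl(String hostname, String endpoint,
                           String defaultHostname, String defaultEndpoint) {
        // An empty field falls back to the default supplied by the connection class.
        String host = isBlank(hostname) ? defaultHostname : hostname.trim();
        String path = isBlank(endpoint) ? defaultEndpoint : endpoint.trim();
        // Plain concatenation: no slash is added or removed between the two parts.
        return host + path;
    }

    private static boolean isBlank(String s) {
        return s == null || s.trim().isEmpty();
    }
}
```

For instance, buildUrl("https://api.example.com", "/v1/analyze", ...) yields "https://api.example.com/v1/analyze". Because the two parts are simply joined, a missing or doubled slash at the boundary ends up in the final URL.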
Configure metadata extraction
When configuring the connector, you need to specify the metadata (Solr fields) you want to set, and the location of the associated data within the service response JSON.
Metadata and the source location must be defined in the “Metadata extraction” textarea.
Parameters must use the format metadata=location.
One metadata per line.
{
  "predictions": [
    {
      "message": "Success",
      "results": {
        "entities": [
          {
            "firstName": "Paul",
            "fullName": "martin paul",
            "id": "eb231392-d3a5-4615-aad9-476e82f852f1",
            "lastName": "MARTIN"
          }
        ],
        "photo": false,
        "report": null
      },
      "description": "This document is a driving license that belongs to Paul MARTIN...",
      "status": 200,
      "type": "",
      "verificationId": "",
      "webhook": null
    }
  ]
}
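To make the idea of a location key concrete, here is a small sketch that resolves a path such as predictions[0].description against an already-parsed JSON structure (nested Map and List objects). It is not the JsonUtils class: the class name is hypothetical, and the path syntax (field names separated by dots, with optional [index] suffixes) is inferred from the examples in this page.

```java
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical resolver for location keys like "predictions[0].results.entities[0].lastName". */
final class LocationKeySketch {

    // One path segment: a field name followed by zero or more [index] suffixes.
    private static final Pattern SEGMENT = Pattern.compile("([^\\[\\]]+)((?:\\[\\d+\\])*)");
    private static final Pattern INDEX = Pattern.compile("\\[(\\d+)\\]");

    /** Returns the value found at the location, or null if any step is missing. */
    static Object resolve(Object root, String location) {
        Object current = root;
        for (String segment : location.split("\\.")) {
            Matcher m = SEGMENT.matcher(segment);
            if (!m.matches() || !(current instanceof Map)) {
                return null;
            }
            current = ((Map<?, ?>) current).get(m.group(1));   // step into the named field
            Matcher idx = INDEX.matcher(m.group(2));
            while (idx.find()) {                                // then into each [index], if any
                int i = Integer.parseInt(idx.group(1));
                if (!(current instanceof List) || i >= ((List<?>) current).size()) {
                    return null;
                }
                current = ((List<?>) current).get(i);
            }
        }
        return current;
    }
}
```

With the response shown above, a line such as last_name=predictions[0].results.entities[0].lastName (last_name being a hypothetical Solr field) would resolve to MARTIN under this sketch.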
Remember that the specified metadata must be existing fields of your Solr main collection (and in VECTORMAIN if you are using Vector Search). Those can be dynamic fields.
If you need to create new Solr fields, check this documentation.
You can also override the base64 content of the document (in the crawl and in the index) by setting a metadata called "content".
Example:
content=predictions[0].description
Use the “content” metadata with care. This will permanently override the document content in the pipeline and in the index, and set its mimetype to "text/plain". The new content will then be used to populate content fields in Solr (exactContent, preview_content...). The file name and URL remain intact.
Configure filters
You can use the "Filters" textarea to set one or more optional filters, which only apply to the current instance of the connector.
Use key=value pairs.
Supported filters are:
inc_extension: If set, any file extension that is not listed here will be excluded.
exc_extension: If set, any file extension that is listed here will be excluded.
inc_mimetype: If set, any file with a mimeType that is not listed here will be excluded.
exc_mimetype: If set, any file with a mimeType that is listed here will be excluded.
min_size: Integer. The minimum size in bits for the file to be processed.
max_size: Integer. The maximum size in bits for the file to be processed.
inc_metadata (NOT TESTED): The file will be excluded if it does not have the specified metadata set to the specified value.
E.g.: "inc_metadata=author:Nicolas, source:confluence" will only allow documents that have the "author" metadata set to "Nicolas" and the "source" metadata set to "confluence". If the metadata does not exist on the processed document, the document is filtered out. Regular expressions are supported here.
exc_metadata (NOT TESTED): The file will be excluded if it has the specified metadata set to the specified value.
E.g.: "exc_metadata=author:Nicolas, filename:*licence*" will deny any document with “Nicolas” in its author metadata, and any document whose filename contains the word “licence”. If none of the metadata exists on the processed document, the document is accepted. Regular expressions are supported here.
One filter per line.
When a filter has multiple values, separate them with a comma (,).
Example:
inc_extension=png, jpeg, pdf
inc_mimetype=image/png, image/jpeg, application/pdf
min_size=1
max_size=50000000
inc_metadata=filename:*licence*, source:id_share
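The extension and size filters can be sketched as a simple accept/reject check. This is an illustration only: the class and method names are hypothetical, and the case-insensitive comparison and inclusive size bounds are assumptions rather than documented behavior.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

/** Hypothetical evaluation of inc_extension, exc_extension, min_size and max_size. */
final class FilterSketch {

    /** Splits a comma-separated filter value such as "png, jpeg, pdf". */
    static List<String> values(String raw) {
        return Arrays.stream(raw.split(","))
                .map(v -> v.trim().toLowerCase(Locale.ROOT))
                .filter(v -> !v.isEmpty())
                .collect(Collectors.toList());
    }

    /** Returns true when the file passes every filter; empty lists mean "filter not set". */
    static boolean accepts(String fileName, long size,
                           List<String> incExtensions, List<String> excExtensions,
                           long minSize, long maxSize) {
        int dot = fileName.lastIndexOf('.');
        String ext = dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        if (!incExtensions.isEmpty() && !incExtensions.contains(ext)) {
            return false;   // inc_extension is set and this extension is not listed
        }
        if (excExtensions.contains(ext)) {
            return false;   // exc_extension lists this extension
        }
        return size >= minSize && size <= maxSize;   // assumed inclusive bounds
    }
}
```

The mimetype and metadata filters would follow the same include/exclude pattern and are left out to keep the sketch short.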
Start the job
Once your job is fully configured, launch the job. You can monitor it using the ManifoldCF Simple History, or by checking that indexed documents in Solr are provided with the expected metadata.
Compatible services
Currently, the Binary Transformation Connector is compatible with two types of external services. For the specifics of each type of service and its default configuration, see the Configuration section (part 3).
OpenAI API
This connection class (com.francelabs.datafari.transformation.binary.services.OpenAiExternalService) is configured to send a simple request to the OpenAI API, with the following JSON body:
{
  "messages": [
    {
      "content": [
        {
          "image_url": {
            "url": "{base64content}"
          },
          "type": "image_url"
        },
        {
          "text": "Describe the content of this image and mention all names entities.",
          "type": "text"
        }
      ],
      "role": "user"
    }
  ],
  "max_tokens": "{max_tokens}",
  "model": "{model}",
  "temperature": "{temperature}"
}
Above is an example of a request to the OpenAI API.
The prompt is currently static and hard-coded. This may change in the future.
Datakeen API
Datakeen API endpoints are documented here: https://docs.datakeen.co/reference/post_auth
The Datakeen API provides a large number of AI-related endpoints for document analysis, information extraction and fraud detection. The associated connection class (com.francelabs.datafari.transformation.binary.services.DatakeenExternalService) allows the Binary Transformation Connector to send the content of the processed documents in the following JSON body:
{
  "paramDict": {
    "files": [
      "{base64content}"
    ]
  }
}
Above is an example of a request to the Datakeen API.
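As a rough illustration of how the {base64content} placeholder gets filled, the sketch below encodes raw file bytes and places them in the body shown above. The class and method names are hypothetical and this is not the code of DatakeenExternalService.

```java
import java.util.Base64;

/** Hypothetical builder for the Datakeen-style request body. */
final class DatakeenBodySketch {

    static String buildBody(byte[] fileBytes) {
        String base64 = Base64.getEncoder().encodeToString(fileBytes);
        // Standard Base64 output only contains A-Z, a-z, 0-9, '+', '/' and '=',
        // so it can be placed inside a JSON string without escaping.
        return "{\"paramDict\":{\"files\":[\"" + base64 + "\"]}}";
    }
}
```

String concatenation is safe here only because of the restricted Base64 alphabet; any free-text field would need a proper JSON library or escaping.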
Implement a new type of service
Implementing a new type of service to the Binary Transformation Connector requires code development (Java), and project compilation. This section of the documentation is technical.
Create a new Java connection class.
The class must be located in:
{DATAFARI_CE_PATH}/datafari-binary-connector/src/main/java/com/francelabs/datafari/transformation/binary/services/
It must extend the “ExternalService” class, and implement the IExternalService interface:
package com.francelabs.datafari.transformation.binary.services;
...
public class DatakeenExternalService extends ExternalService implements IExternalService {
    ...
}
Implement a constructor:
public DatakeenExternalService(BinarySpecification spec) {
    // ALWAYS CALL SUPER AT THE BEGINNING OF THE CONSTRUCTOR
    super(spec);
    // Specific constructor code here
    ...
}
Implement the invoke(…) method:
public String invoke(String base64content) throws ManifoldCFException {
    // Your code. This method takes the base64 content as a String parameter,
    // and returns the external service response as a JSON-formatted String.
    ...
}
Handle the new ExternalService class in Binary.java:
switch (spec.getStringProperty(BinaryConfig.NODE_TYPE_OF_SERVICE)) {
    case "datakeen":
        service = new DatakeenExternalService(spec);
        break;
    case "openai":
        service = new OpenAiExternalService(spec);
        break;
    default:
        LOGGER.error("Invalid service type specified.");
        activities.recordActivity(startTime, ACTIVITY_BINARY, document.getBinaryLength(), documentURI, "KO", "Invalid service type");
        return activities.sendDocument(documentURI, document);
}
Add the new “Type of Service” in the Binary Connector’s editSpecification.html:
<!-- TYPE OF SERVICE -->
<tr>
  <td class="description"><nobr>$Encoder.bodyEscape($ResourceBundle.getString('binary.typeOfService'))</nobr></td>
  <td class="value">
    <select name="s${SEQNUM}_typeOfService">
      <option value="openai" #if($typeOfService == 'openai') selected="selected" #end >OpenAI</option>
      <option value="datakeen" #if($typeOfService == 'datakeen') selected="selected" #end >Datakeen</option>
    </select>
  </td>
</tr>
Build, deploy, test and enjoy.