Binary Transformation Connector

Valid from v6.2 in beta version

This documentation is valid from Datafari v6.2 onwards.

This feature is a work in progress, and is subject to change.

Description

The Binary Transformation Connector is a custom component for the ManifoldCF crawling pipeline that enables metadata extraction from binary files by delegating analysis to external processing services. This allows documents such as images or PDFs to be dynamically processed by AI or third-party APIs before indexing.

Global workflow

  1. Filter

Filters are applied. If the document should not be processed (e.g. if the connector is disabled or if the document is filtered due to configuration), the process ends here.

  1. Binary content extraction

When a document is ingested, its binary content is extracted.

When a document’s binary content is extracted, it is removed from the original document. To prevent conflicts with other stages of the indexing pipeline, the connector restores the document’s content immediately after the extraction.
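The extract-then-restore step can be pictured with the minimal sketch below. It uses only the JDK and is not the connector's actual code: the class name BinaryContentSketch and its Extracted holder are hypothetical, and the sketch assumes the whole document fits in memory.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Base64;

/** Illustrative only: hypothetical names, not the connector's real classes. */
final class BinaryContentSketch {

  /** The Base64 text to send to the service, plus a fresh stream for the rest of the pipeline. */
  static final class Extracted {
    final String base64;
    final ByteArrayInputStream restored;
    final long length;

    Extracted(String base64, ByteArrayInputStream restored, long length) {
      this.base64 = base64;
      this.restored = restored;
      this.length = length;
    }
  }

  /** Reads the stream to its end (which empties it), then rebuilds an equivalent stream. */
  static Extracted extractAndRestore(InputStream original) {
    try {
      byte[] bytes = original.readAllBytes(); // Java 9+; assumes the content fits in memory
      String base64 = Base64.getEncoder().encodeToString(bytes);
      return new Extracted(base64, new ByteArrayInputStream(bytes), bytes.length);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

The restored stream and its length are what a later pipeline stage (Tika, for example) would read instead of the consumed original.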

  1. External Service picking

Datafari provides a list of “External Service connection classes”, each containing the following method:

public String invoke(String base64content) throws ManifoldCFException { ... }

There should be one Java class per type of External Service. Currently, Datafari only supports the following:

  • OpenAI API (com.francelabs.datafari.transformation.binary.services.OpenAiExternalService)

  • Datakeen API (com.francelabs.datafari.transformation.binary.services.DatakeenExternalService)

The role of these classes is:

  • Retrieve required parameters

  • Prepare the HTTP/HTTPS request

  • Call the external service

  • Handle service-related errors

  • Return the JSON response

Connection classes may require specific parameters that are not part of the default Connector specifications. Those should be set in the “Additional parameters” field.

Default Connector Specifications:

  • Enable Binary Connector

  • Type of Service

  • Service hostname

  • Service endpoint

  • Service security token

Examples of additional parameters:

max_tokens=300
temperature=0
model=gpt-4o-mini

Compatible services are documented below.

  1. External Service calling

The selected external service is called with the invoke( … ) method. The request is prepared and sent. The JSON response is returned as a String, ready to be parsed.

In case of errors or exceptions, the connection class throws a ManifoldCFException with error details, so these are logged in the ManifoldCF Simple History.
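A minimal sketch of such a call, using the JDK's java.net.http client (Java 11+), is shown below. The names ServiceCallSketch and ServiceException are hypothetical; ServiceException only stands in for ManifoldCFException so that the sketch compiles on its own. The Bearer authorization header is an assumption, and the real connection classes may authenticate differently.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/** Illustrative only: hypothetical names, not the connector's real classes. */
final class ServiceCallSketch {

  /** Hypothetical stand-in for ManifoldCFException. */
  static final class ServiceException extends Exception {
    ServiceException(String message, Throwable cause) {
      super(message, cause);
    }
  }

  /** Prepares a JSON POST request; the token header is added only when a token is set. */
  static HttpRequest buildRequest(String url, String token, String jsonBody) {
    HttpRequest.Builder builder = HttpRequest.newBuilder(URI.create(url))
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(jsonBody));
    if (token != null && !token.isEmpty()) {
      builder.header("Authorization", "Bearer " + token); // assumption: bearer-token auth
    }
    return builder.build();
  }

  /** Sends the request and returns the raw JSON body, or throws with the error details. */
  static String invoke(HttpClient client, HttpRequest request) throws ServiceException {
    try {
      HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
      if (response.statusCode() / 100 != 2) {
        throw new ServiceException(
            "Service returned HTTP " + response.statusCode() + ": " + response.body(), null);
      }
      return response.body();
    } catch (IOException e) {
      throw new ServiceException("Could not reach the service: " + e.getMessage(), e);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve the interrupt flag for the caller
      throw new ServiceException("Interrupted while calling the service", e);
    }
  }
}
```

Keeping the request construction separate from the network call makes the first part easy to check without contacting any service.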

  1. Data extraction/injection

The connector parses the JSON response and injects the relevant fields as metadata into the document, which are then passed on for indexing by Solr.

The data extraction uses our JsonUtils class, which extracts data from the provided JSON based on a location key. Once retrieved, the data is injected into the document’s metadata.

It is also possible to override the content of the document in the index by setting the “content” metadata in the “Metadata extraction” field. See the Configuration section for more details.

Configuration

Here is how to add the Binary Transformation Connector to your ManifoldCF pipeline.

  1. Add the connector to the pipeline.

Add the connector to the ManifoldCF crawling job. It is possible to add multiple Binary Connectors in a single job.

The Binary Connector must always be placed BEFORE the Tika connector. The Tika connector empties the document's binary content when reading it and does not restore it afterwards. If the Binary Connector is placed after Tika, it will not be able to read and process this content.

  1. Open the connector specifications tab

Go to the "Binary Transformation Connector" tab, and configure the job.

[Screenshot: an example of configuration for the Datakeen API]

  1. Provide the service information

Fill the following fields:

  • Enable this connector: Check it!

  • Type of external service: select the service you are using. Current options are Datakeen and OpenAI.

  • Service hostname: The base URL of the API. You can leave this one empty to use the default URL, as defined in the connection class.

  • Service endpoint: The endpoint this connector instance must use. You can leave this one empty to use the default endpoint, as defined in the connection class.

  • Security token: The API key. Use it only if the service you are using requires it.

  • Additional parameters: Set here the additional parameters that are specific to the service you are using. See the per-service settings below for the parameters used by each service.

    • Parameters must use the format key=value

    • One parameter per line
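As an illustration of this format, the sketch below parses such a text area into a map. It is not the connector's own parser: the class name AdditionalParametersSketch is hypothetical, and skipping blank or malformed lines is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only: hypothetical parser for the "Additional parameters" field. */
final class AdditionalParametersSketch {

  /** Turns "key=value" lines into an ordered map; blank or malformed lines are skipped. */
  static Map<String, String> parse(String text) {
    Map<String, String> params = new LinkedHashMap<>();
    if (text == null) {
      return params;
    }
    for (String line : text.split("\\R")) { // \R matches any line break
      String trimmed = line.trim();
      int eq = trimmed.indexOf('=');
      if (eq <= 0) {
        continue; // no key, or no '=' at all
      }
      // Split on the FIRST '=' only, so values such as passwords may contain '='.
      params.put(trimmed.substring(0, eq).trim(), trimmed.substring(eq + 1).trim());
    }
    return params;
  }
}
```

For instance, parsing the three lines max_tokens=300, temperature=0 and model=gpt-4o-mini yields a map with three entries, in that order.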

 

OpenAI API

  • Type of service: OpenAI

  • Service hostname: Optional. Default value: https://api.openai.com/v1/

  • Service endpoint: Optional. Default value: /chat/completions

  • Security token: Required! Use your own OpenAI API token.

  • Additional parameters, for example:

    max_tokens=500
    temperature=0.1

    • max_tokens: Optional. The maximum size (in tokens) of the model response. Default arbitrarily set to 500.

    • temperature: Optional. The randomness (from 0 to 1) of the generated response. Default arbitrarily set to 0.

Datakeen API

  • Type of service: Datakeen

  • Service hostname: Optional. Default value: https://api.datakeen.co/api/v1/

  • Service endpoint: Optional. Default value: /reco/multi-doc

  • Security token: Do not set it (unless you have a permanent token). The Datakeen API requires authentication to generate a token that is valid for 10 minutes. If the API token is not defined (and it should not be), the connection class uses the username and password (see “Additional parameters”) to generate a dynamic token during indexing. Currently, a new security token is generated for each document. This may be optimized in the future.

  • Additional parameters, for example:

    username=mydatakeenloggin
    password=MySeCrEtPaSsWoRd!

    • username: Your Datakeen username. Required, unless you have a permanent security token.

    • password: Your Datakeen password. Required, unless you have a permanent security token.

The final URL used to call the service is built by concatenating the HOSTNAME and the ENDPOINT.
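Note that the default hostnames end with a slash and the default endpoints start with one, so a plain string concatenation would yield a double slash. How the connection classes deal with this is not described here. The sketch below shows one possible way to join the two parts; the class name ServiceUrlSketch is hypothetical.

```java
/** Illustrative only: one possible way to join hostname and endpoint. */
final class ServiceUrlSketch {

  /** Joins the two parts with exactly one slash between them. */
  static String join(String hostname, String endpoint) {
    String base = hostname.replaceAll("/+$", ""); // drop trailing slashes
    String path = endpoint.replaceAll("^/+", ""); // drop leading slashes
    return base + "/" + path;
  }

  public static void main(String[] args) {
    // Prints https://api.openai.com/v1/chat/completions
    System.out.println(join("https://api.openai.com/v1/", "/chat/completions"));
  }
}
```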

  1. Configure metadata extraction

When configuring the connector, you need to specify the metadata (Solr fields) you want to set, and the location of the associated data within the service response JSON.

  • Metadata and the source location must be defined in the “Metadata extraction” textarea.

  • Parameters must use the format metadata=location.

  • One metadata per line

Example, given the following service response:

{
  "predictions": [
    {
      "message": "Success",
      "results": {
        "entities": [
          {
            "firstName": "Paul",
            "fullName": "martin paul",
            "id": "eb231392-d3a5-4615-aad9-476e82f852f1",
            "lastName": "MARTIN"
          }
        ],
        "photo": false,
        "report": null
      },
      "description": "This document is a driving license that belongs to Paul MARTIN...",
      "status": 200,
      "type": "",
      "verificationId": "",
      "webhook": null
    }
  ]
}

Location key (as defined in the “Metadata extraction” field) and its effect:

  • entity_message=predictions[0].message
    The "entity_message" field value in Solr will be set to "Success".

  • entity_firstName=predictions[0].results.entities[0].firstName
    The "entity_firstName" field value in Solr will be set to "Paul".

  • entity_city=predictions[0].results.entities[0].city
    Since the targeted data does not exist in the service response in this specific example, this line will have no effect on the processed document.
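To make the location-key syntax concrete, here is a minimal resolver sketch. It is not the JsonUtils class: the name LocationKeySketch is hypothetical, and it assumes the JSON has already been parsed into nested java.util.Map and java.util.List objects. A missing field or an out-of-range index yields null, which corresponds to the "no effect" case above.

```java
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: hypothetical resolver, not Datafari's JsonUtils. */
final class LocationKeySketch {

  // One path segment: a field name followed by zero or more [index] suffixes.
  private static final Pattern SEGMENT = Pattern.compile("([^\\[\\]]+)((?:\\[\\d+\\])*)");
  private static final Pattern INDEX = Pattern.compile("\\[(\\d+)\\]");

  /** Walks the parsed JSON along the location key; returns null when the path does not exist. */
  static Object resolve(Object root, String location) {
    Object current = root;
    for (String part : location.split("\\.")) {
      Matcher segment = SEGMENT.matcher(part);
      if (!segment.matches() || !(current instanceof Map)) {
        return null;
      }
      current = ((Map<?, ?>) current).get(segment.group(1));
      Matcher index = INDEX.matcher(segment.group(2));
      while (index.find()) {
        int i = Integer.parseInt(index.group(1));
        if (!(current instanceof List) || i >= ((List<?>) current).size()) {
          return null;
        }
        current = ((List<?>) current).get(i);
      }
    }
    return current;
  }

  public static void main(String[] args) {
    // A hand-built, trimmed-down version of the example response.
    Map<String, Object> entity = Map.of("firstName", "Paul", "lastName", "MARTIN");
    Map<String, Object> results = Map.of("entities", List.of(entity));
    Map<String, Object> prediction = Map.of("message", "Success", "results", results);
    Map<String, Object> response = Map.of("predictions", List.of(prediction));

    System.out.println(resolve(response, "predictions[0].message"));                       // Success
    System.out.println(resolve(response, "predictions[0].results.entities[0].firstName")); // Paul
    System.out.println(resolve(response, "predictions[0].results.entities[0].city"));      // null
  }
}
```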

 

Remember that the specified metadata must be existing fields of your Solr main collection (and in VECTORMAIN if you are using Vector Search). Those can be dynamic fields.

If you need to create new Solr fields, check this documentation.

You can also override the base64 content of the document (in the crawl and in the index) by setting a metadata called "content".

Example:

content=predictions[0].description

Use the “content” metadata with care. This will permanently override the document content in the pipeline and in the index, and set its mimetype to "text/plain". The new content will then be used to populate content fields in Solr (exactContent, preview_content...). The file name and URL remain intact.

  1. Configure filters

You can use the "Filters" textarea to set one or more optional filters that apply only to the current instance of the connector.

  • Use key=value pairs

  • Supported filters are:

    • inc_extension: If set, any file extension that is not listed here will be excluded.

    • exc_extension: If set, any file extension that is listed here will be excluded.

    • inc_mimetype: If set, any file with a mimeType that is not listed here will be excluded.

    • exc_mimetype: If set, any file with a mimeType that is listed here will be excluded.

    • min_size: Integer. The minimum size in bits for the file to be processed.

    • max_size: Integer. The maximum size in bits for the file to be processed.

    • inc_metadata (NOT TESTED): The file will be excluded if it does not have the specified metadata set to the specified value.
      E.g.: "inc_metadata=author:Nicolas, source:confluence" will only allow documents that have the "author" metadata set to "Nicolas" and the "source" metadata set to "confluence". If the metadata does not exist on the processed document, the document is filtered out. Regular expressions are supported here.

    • exc_metadata (NOT TESTED): The file will be excluded if it has the specified metadata set to the specified value.
      E.g.: "exc_metadata=author:Nicolas, filename:*licence*" will reject any document with "Nicolas" in its "author" metadata, and any document whose filename contains the word "licence". If none of the metadata exists on the processed document, the document is accepted. Regular expressions are supported here.

  • One filter per line.

  • When a filter has multiple values, separate them with a comma (,)

  • Example:

    inc_extension=png, jpeg, pdf
    inc_mimetype=image/png, image/jpeg, application/pdf
    min_size=1
    max_size=50000000
    inc_metadata=filename:*licence*, source:id_share

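The sketch below illustrates how the extension and size filters could be evaluated. It is a simplified assumption, not the connector's implementation: the class name FilterSketch is hypothetical, extensions are compared case-insensitively by choice, the size is compared in whatever unit the caller passes, and the mimetype and metadata filters are left out.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.stream.Collectors;

/** Illustrative only: hypothetical evaluation of the extension and size filters. */
final class FilterSketch {

  /** Parses the "Filters" text: one key=value per line, values separated by commas. */
  static Map<String, List<String>> parse(String text) {
    Map<String, List<String>> filters = new LinkedHashMap<>();
    for (String line : text.split("\\R")) {
      int eq = line.indexOf('=');
      if (eq <= 0) {
        continue; // blank or malformed line
      }
      List<String> values = Arrays.stream(line.substring(eq + 1).split(","))
          .map(String::trim)
          .filter(v -> !v.isEmpty())
          .collect(Collectors.toList());
      filters.put(line.substring(0, eq).trim(), values);
    }
    return filters;
  }

  /** Returns true when the file passes the extension and size filters that are set. */
  static boolean accepts(Map<String, List<String>> filters, String fileName, long size) {
    int dot = fileName.lastIndexOf('.');
    String ext = dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);

    List<String> inc = filters.get("inc_extension");
    if (inc != null && !inc.contains(ext)) {
      return false; // an include list is set and the extension is not in it
    }
    List<String> exc = filters.get("exc_extension");
    if (exc != null && exc.contains(ext)) {
      return false;
    }
    List<String> min = filters.get("min_size");
    if (min != null && !min.isEmpty() && size < Long.parseLong(min.get(0))) {
      return false;
    }
    List<String> max = filters.get("max_size");
    if (max != null && !max.isEmpty() && size > Long.parseLong(max.get(0))) {
      return false;
    }
    return true;
  }
}
```

With the filters inc_extension=png, jpeg, pdf, min_size=1 and max_size=50000000, this sketch accepts a 1024-unit file named licence.PNG and rejects notes.docx as well as an empty PDF.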
  1. Start the job

Once your job is fully configured, launch the job. You can monitor it using the ManifoldCF Simple History, or by checking that indexed documents in Solr are provided with the expected metadata.

Compatible services

Currently, the Binary Transformation Connector is compatible with two types of external services. Read more about the specificities of each type of service and their default configuration in the Configuration section (step 3, "Provide the service information").

OpenAI API

This connection class (com.francelabs.datafari.transformation.binary.services.OpenAiExternalService) is configured to send a simple request to the OpenAI API, with the following JSON body:

{
  "messages": [
    {
      "content": [
        {
          "image_url": { "url": "{base64content}" },
          "type": "image_url"
        },
        {
          "text": "Describe the content of this image and mention all names entities.",
          "type": "text"
        }
      ],
      "role": "user"
    }
  ],
  "max_tokens": "{max_tokens}",
  "model": "{model}",
  "temperature": "{temperature}"
}

The prompt is currently static and hard coded. This may change in the future.

Below is an example of a request to the OpenAI API.

Request:

POST https://api.openai.com/v1/chat/completions
{
  "messages": [
    {
      "content": [
        {
          "image_url": { "url": "data:image/jpeg;base64,{base64content}" },
          "type": "image_url"
        },
        {
          "text": "Describe the content of this image and mention all names entities.",
          "type": "text"
        }
      ],
      "role": "user"
    }
  ],
  "max_tokens": "500",
  "model": "gpt-4o-mini",
  "temperature": "0"
}

Here, {base64content} contains the base64 content of the image of a fake French driver’s licence.

Response:

{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null,
      "message": {
        "annotations": [ ],
        "content": "The image is of a French driver's license (Permis de Conduire). It contains the following key (....) such as holograms or watermarks.",
        "refusal": null,
        "role": "assistant"
      }
    }
  ],
  "created": 1746623120,
  "id": "chatcmpl-BUYka2Bpc0XxuSxlYnwK2YB7V7qGL",
  "model": "gpt-4o-mini-2024-07-18",
  "object": "chat.completion",
  "service_tier": "default",
  "system_fingerprint": "fp_129a36352a",
  "usage": {
    "completion_tokens": 172,
    "completion_tokens_details": {
      "accepted_prediction_tokens": 0,
      "audio_tokens": 0,
      "reasoning_tokens": 0,
      "rejected_prediction_tokens": 0
    },
    "prompt_tokens": 25520,
    "prompt_tokens_details": {
      "audio_tokens": 0,
      "cached_tokens": 0
    },
    "total_tokens": 25692
  }
}

Datakeen API

The Datakeen API endpoints are documented here: https://docs.datakeen.co/reference/post_auth

The Datakeen API provides a large number of AI-related endpoints for document analysis, information extraction and fraud detection. The associated connection class (com.francelabs.datafari.transformation.binary.services.DatakeenExternalService) allows the Binary Transformation Connector to send the content of the processed documents in the following JSON body:

{
  "paramDict": {
    "files": [ "{base64content}" ]
  }
}

Below is an example of a request to the Datakeen API.

Request:

POST https://api.datakeen.co/api/v1/reco/multi-doc
{
  "paramDict": {
    "files": [ "data:image/jpeg;base64,{base64}" ]
  }
}

Here, {base64} contains the base64 content of the image of a fake French driver’s licence.
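As a rough illustration of how such a body can be assembled, the sketch below builds the data URL and the JSON string by hand. The class name DatakeenBodySketch is hypothetical, and where exactly the connection class adds the "data:...;base64," prefix is not stated in this documentation, so treat that part as an assumption.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Illustrative only: hypothetical helper, not the DatakeenExternalService code. */
final class DatakeenBodySketch {

  /** Builds a data URL such as data:image/jpeg;base64,.... from raw bytes. */
  static String dataUrl(String mimeType, byte[] content) {
    return "data:" + mimeType + ";base64," + Base64.getEncoder().encodeToString(content);
  }

  /**
   * Wraps a data URL in the JSON body shown above. Base64 text contains no character
   * that needs JSON escaping; the mime type is assumed to be a plain token.
   */
  static String requestBody(String dataUrl) {
    return "{ \"paramDict\": { \"files\": [ \"" + dataUrl + "\" ] } }";
  }

  public static void main(String[] args) {
    byte[] sample = "hi".getBytes(StandardCharsets.UTF_8); // stand-in for real image bytes
    // Prints { "paramDict": { "files": [ "data:image/jpeg;base64,aGk=" ] } }
    System.out.println(requestBody(dataUrl("image/jpeg", sample)));
  }
}
```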

Response:

{ "predictions": [ { "message": "Success", "results": { "controls": { }, "documents": [ { "cardSide": "front", "cardType": "driver_license_fr_2013", "code": "4.0", "codeDescription": "We could not process all the controls on the given document, please review its authenticity", "codeName": "toVerify", "controlCategories": { "dataCoherency": { "controls": { "dateConformity": { "confidence": 1, "value": false }, "dateValidity": { "confidence": 1, "value": false }, "dateValidityDelivery": { "confidence": 1, "value": false }, "mrzConformity": { "confidence": 1.0, "value": true }, "mrzValidity": { "confidence": 1.0, "value": true } }, "status": false }, "metadata": { "controls": { "notMultipleVersions": { "confidence": null, "value": null }, "notSuspectedSoftware": { "confidence": 1, "value": true } }, "status": null }, "visual": { "controls": { "chipIsPresent": { "confidence": null, "value": null }, "hologramIsPresent": { "confidence": null, "value": null }, "initialsIsPresent": { "confidence": null, "value": null }, "photoIsPresent": { "confidence": null, "value": null }, "rfSymbolIsPresent": { "confidence": null, "value": null }, "stampIsPresent": { "confidence": null, "value": null }, "waveIsPresent": { "confidence": null, "value": null } }, "status": null } }, "controls": { "chipIsPresent": { "confidence": null, "value": null }, "dateConformity": { "confidence": 1, "value": false }, "dateValidity": { "confidence": 1, "value": false }, "dateValidityDelivery": { "confidence": 1, "value": false }, "globalStatus": { "confidence": 1, "score": 1, "value": true }, "hologramIsPresent": { "confidence": null, "value": null }, "initialsIsPresent": { "confidence": null, "value": null }, "metadataAuthorConformity": { "confidence": null, "value": null }, "metadataDateConformity": { "confidence": null, "value": null }, "metadataProducerConformity": { "confidence": null, "value": null }, "metadataVersionConformity": { "confidence": null, "value": null }, "mrzConformity": { 
"confidence": 1.0, "value": true }, "mrzValidity": { "confidence": 1.0, "value": true }, "notMultipleVersions": { "confidence": null, "value": null }, "notSuspectedSoftware": { "confidence": 1, "value": true }, "photoIsPresent": { "confidence": null, "value": null }, "rfSymbolIsPresent": { "confidence": null, "value": null }, "stampIsPresent": { "confidence": null, "value": null }, "waveIsPresent": { "confidence": null, "value": null } }, "entities": [ "672d5984-9c0f-4b4c-9556-3d61d333f284" ], "extractedInformation": { "address": { "confidence": null, "value": "" }, "birthCountry": { "confidence": 0.9, "value": "" }, "birthDate": { "confidence": 0.953270197236634, "value": "14.07.1981" }, "birthDateMRZ": { "confidence": 0, "value": "" }, "birthDepartment": { "confidence": 0.9, "value": "99" }, "birthPlace": { "confidence": 0.814401126487638, "value": "UTOPIA CITY" }, "countryCode": { "confidence": 0.9, "value": "FR" }, "deliveryDate": { "confidence": 0.908951778981759, "value": "01.01.2013" }, "expiryDate": { "confidence": 0.829335106463844, "value": "31.12.2018" }, "expiryDateMRZ": { "confidence": 0, "value": "31.12.2018" }, "firstName": { "confidence": 0.823751733685576, "value": "Paul" }, "firstNameMRZ": { "confidence": 0, "value": "" }, "fullName": { "confidence": 0.645786724817933, "value": "Paul MARTIN" }, "gender": { "confidence": null, "value": "" }, "genderMRZ": { "confidence": 0, "value": "" }, "idNumber": { "confidence": 0.933639838818205, "value": "13AA00002" }, "idNumberMRZ": { "confidence": 0, "value": "13AA00002" }, "lastName": { "confidence": 0.783957955303592, "value": "MARTIN" }, "lastNameMRZ": { "confidence": 0, "value": "MARTIN" }, "licenseCategories": { "confidence": null, "value": "" }, "mrz": { "confidence": 0, "value": "D1FRA13AA000026181231MARTIN<<9" }, "nationality": { "confidence": null, "value": "" }, "neph": { "confidence": null, "value": "" }, "spouseName": { "confidence": null, "value": "" } }, "metadata": [ { "author": null, 
"created_date": "2021-02-01T14:45:16", "creator": null, "file": "file-0.jpeg", "keywords": null, "modified_date": "2021-02-01T14:45:16", "producer": "Adobe Photoshop 22.1 (Windows)", "subject": null, "title": null } ], "origin": null, "pages": [ { "num": 0, "path": "file-0.jpeg", "type": "driver-license-p1" } ], "path": "", "type": "driver-license", "verificationId": null } ], "entities": [ { "firstName": "Paul", "fullName": "martin paul", "id": "672d5984-9c0f-4b4c-9556-3d61d333f284", "lastName": "MARTIN" } ], "photo": false, "report": null }, "status": 200, "type": "", "verificationId": "", "webhook": null } ] }

 

Implement a new type of service

Adding a new type of service to the Binary Transformation Connector requires Java development and a new build of the project. This section of the documentation is technical.

  1. Create a new Java connection class.

The class must be located in:

{DATAFARI_CE_PATH}/datafari-binary-connector/src/main/java/com/francelabs/datafari/transformation/binary/services/

It must extend the “ExternalService” class, and implement the IExternalService interface:

package com.francelabs.datafari.transformation.binary.services;

...

public class DatakeenExternalService extends ExternalService implements IExternalService {
  ...
}

Implement a constructor:

public DatakeenExternalService(BinarySpecification spec) {
  // ALWAYS CALL SUPER AT THE BEGINNING OF THE CONSTRUCTOR
  super(spec);

  // Specific constructor code here
  ...
}

Implement the invoke(…) method:

public String invoke(String base64content) throws ManifoldCFException {
  ...
  // Your code. This method takes the base64 content as a String parameter,
  // and returns the external service response as a JSON-formatted String.
}

  1. Handle the new ExternalService class in Binary.java

Add a case for the new class to the switch statement that selects the service:

switch (spec.getStringProperty(BinaryConfig.NODE_TYPE_OF_SERVICE)) {
  case "datakeen":
    service = new DatakeenExternalService(spec);
    break;
  case "openai":
    service = new OpenAiExternalService(spec);
    break;
  default:
    LOGGER.error("Invalid service type specified.");
    activities.recordActivity(startTime, ACTIVITY_BINARY, document.getBinaryLength(), documentURI, "KO", "Invalid service type");
    return activities.sendDocument(documentURI, document);
}

  1. Add the new “Type of Service” in the Binary Connector’s editSpecification.html:

<!-- TYPE OF SERVICE -->
<tr>
  <td class="description"><nobr>$Encoder.bodyEscape($ResourceBundle.getString('binary.typeOfService'))</nobr></td>
  <td class="value">
    <select name="s${SEQNUM}_typeOfService">
      <option value="openai" #if($typeOfService == 'openai') selected="selected" #end >OpenAI</option>
      <option value="datakeen" #if($typeOfService == 'datakeen') selected="selected" #end >Datakeen</option>
    </select>
  </td>
</tr>

  1. Build, deploy, test and enjoy.