Vector Update Processor - BETA VERSION
Valid from Datafari 6.2
In order to implement the Vector Search feature in Datafari, we created the Vector Update Processor, which chunks and prepares documents for embedding.
What is vector search?
Vector search is a method of information retrieval where documents and queries are represented as vectors instead of plain text. In vector search, machine learning models generate the vector representations of source inputs, which can be text, images, or other content. Having a mathematic representation of content provides a common basis for search scenarios. If everything is a vector, a query can find a match in vector space, even if the associated original content is in different media or language than the query.
Source: Vector search - Azure AI Search
What does this Update Processor do?
Once documents are crawled by ManifoldCF jobs, they are injected into Solr. An Update Processor is a Solr component that can process entering documents.
The Vector Update Processor is a component that chunks the content of incoming documents into smaller pieces. For each chunk, it creates a Solr document in the newly created Solr collection, “VectorMain”. Since these chunk-based Solr documents originate from a “parent” document, we call them subdocuments; they contain the following metadata:
id: The String ID of the subdocument. It is the parent ID concatenated with the index of the chunk.
vector: An array of Float representing the embedded content. It is generated by the Solr TextToVectorUpdateProcessor.
parent_doc: The String ID of the parent document from the FileShare collection.
content: The String content of the chunk.
Read more about Update Processors here: Custom Update Processor.
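As a quick illustration of the subdocument ID scheme described above, the sketch below derives a child ID from its parent ID and chunk index. The exact separator used by Datafari between the two parts is an implementation detail not stated here; an underscore is assumed purely for illustration.

```java
// Illustrative sketch: deriving a subdocument id from the parent id and
// the chunk index. The "_" separator is an assumption, not Datafari's
// documented format.
public class SubDocId {
    static String subDocId(String parentId, int chunkIndex) {
        return parentId + "_" + chunkIndex;
    }

    public static void main(String[] args) {
        // e.g. third chunk of a crawled file
        System.out.println(subDocId("file://server/share/report.pdf", 2));
    }
}
```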
How does it work?
After the execution of crawling jobs, ManifoldCF sends CREATE, UPDATE and/or DELETE document requests to Solr. Each of these requests carries a document to be added to, updated in, or removed from the FileShare collection.
Incoming requests
CREATION and UPDATE requests (the green document in the example above) are processed by the Datafari Update Processor (and every other Update Processor listed in the Datafari processor chain in solrconfig.xml, including the Custom Update Processor).
Note: The Datafari Update Processor is a Datafari component that carries out various operations to prepare the documents. It is independent from the Vector Update Processor.
DELETION requests are not processed by the Datafari Update Processor, since they use a different handler. However, CREATION, UPDATE and DELETION requests are all processed by the Vector Update Processor, as long as it is enabled.
Inside the Vector Update Processor
The component handles DELETIONS (in the processDelete() method) and CREATIONS/UPDATES (in the processUpdate() method) separately.
CREATIONS/UPDATES
Deleting existing children:
When a document is created or updated, every existing subdocument of the processed document is deleted from the VectorMain collection.
Children are identified using their parent_id field, which matches the parent ID.
Deleting existing child documents at creation time ensures consistency and prevents issues such as leftover subdocuments from previous manual or incomplete deletions, even though under normal circumstances no children should exist.
Chunking:
The content of the processed document is extracted, then chunked using the Langchain4j libraries:
Tokenizer tokenizer = new OpenAiTokenizer();
DocumentSplitter splitter = DocumentSplitters.recursive(this.chunksize, this.maxoverlap, tokenizer);
The current default values (in tokens) for these parameters are the following:
CHUNK_SIZE: 300
MAX_OVERLAP_SIZE: 0
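To make the chunk-size and overlap semantics concrete, here is a minimal sketch of a sliding-window chunker. It is NOT the Langchain4j recursive splitter (which additionally respects paragraph and sentence boundaries), and it counts naive whitespace-separated words instead of OpenAI tokenizer tokens; it only illustrates how CHUNK_SIZE and MAX_OVERLAP_SIZE interact.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified illustration of chunking with a size limit and an overlap.
// Real Datafari chunking uses Langchain4j's recursive splitter with an
// OpenAI tokenizer; this sketch uses plain word counts.
public class NaiveChunker {
    static List<String> chunk(String text, int chunkSize, int maxOverlap) {
        String[] tokens = text.split("\\s+");
        List<String> chunks = new ArrayList<>();
        // Each new chunk starts (chunkSize - maxOverlap) tokens after the
        // previous one, so consecutive chunks share maxOverlap tokens.
        int step = Math.max(1, chunkSize - maxOverlap);
        for (int start = 0; start < tokens.length; start += step) {
            int end = Math.min(start + chunkSize, tokens.length);
            chunks.add(String.join(" ", Arrays.copyOfRange(tokens, start, end)));
            if (end == tokens.length) break;
        }
        return chunks;
    }

    public static void main(String[] args) {
        // chunkSize=3, maxOverlap=1 -> [a b c], [c d e], [e f]
        System.out.println(chunk("a b c d e f", 3, 1));
    }
}
```

With the current defaults (CHUNK_SIZE=300, MAX_OVERLAP_SIZE=0), chunks simply do not share any tokens.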
Creating subdocuments:
Each chunk is converted into a SolrInputDocument, inheriting parent metadata, with some modifications:
"id": The String ID of the subdocument. It is formed from the parent ID, concatenated with the index of the chunk.
"parent_doc": The String ID of the parent document from the FileShare collection.
"content_en", "content_fr", "preview_content": Removed from the object.
"exactContent", "embedded_content": The content of the chunk. "exactContent" is a multivalued field, used for RAG processes. "embedded_content" is singlevalued, and is required for vector embeddings.
The subdocuments are then added to the VectorMain Solr collection.
Vector embeddings:
The vector embedding is performed by the TextToVector Update Processor when child documents are indexed into VectorMain. See Datafari Vector Search | How to enable vector search features? for more information and proper configuration.
The number of dimensions of the vector depends on the model, and should match the vector dimension defined in the knn_vector fieldType, in the schema.xml file.
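For illustration, a dense vector field type in schema.xml typically looks like the fragment below (standard Solr syntax; the dimension value 768 and the field names are placeholders here, and must match your embedding model and Datafari's actual schema):

```xml
<!-- Illustrative only: the vectorDimension must equal the embedding
     model's output dimension. -->
<fieldType name="knn_vector" class="solr.DenseVectorField"
           vectorDimension="768" similarityFunction="cosine"/>
<field name="vector" type="knn_vector" indexed="true" stored="true"/>
```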
DELETIONS
When a document deletion query is received by Solr (when deleting a ManifoldCF job, for example), the processDelete() method is called. Every existing subdocument of the processed document is deleted from the VectorMain collection.
Children are identified using their parent_id field, which matches the parent ID.
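A child cleanup like the one described above typically amounts to a Solr delete-by-query on the parent_id field. The sketch below only shows the query-string construction; with SolrJ, such a string would then be passed to SolrClient.deleteByQuery(collection, query). This is an illustration of the approach, not Datafari's exact code.

```java
// Sketch: building a delete-by-query string that matches every child of a
// given parent. The id is quoted because crawled ids (file paths, URLs)
// contain characters that are special to the Solr query parser.
public class ChildDeleteQuery {
    static String childrenQuery(String parentId) {
        return "parent_id:\"" + parentId.replace("\"", "\\\"") + "\"";
    }

    public static void main(String[] args) {
        System.out.println(childrenQuery("file://server/share/report.pdf"));
    }
}
```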
Technical specifications
This section provides technical details about the processor.
Chunking
The chunking is performed by a Recursive Splitter, provided by the Langchain4j library.
Tokenizer tokenizer = new OpenAiTokenizer();
DocumentSplitter splitter = DocumentSplitters.recursive(this.chunksize, this.maxoverlap, tokenizer);
This code ensures that the text is segmented into paragraphs of at most CHUNK_SIZE tokens (current value: 300) each, with a maximum overlap of MAX_OVERLAP_SIZE tokens (current value: 0).
Embeddings model
The chunks are embedded using the Text To Vector Update Processor.
How to use it?
To enable Vector Update Processor and the Solr Vector Search, follow the instructions from Datafari Vector Search.
The processor is declared in the solrconfig.xml file, located in /opt/datafari/solr/solrcloud/FileShare/conf (you do not need to edit this file):
<processor class="com.francelabs.datafari.updateprocessor.VectorUpdateProcessorFactory">
<str name="host">${vector.host:localhost\:2181}</str>
<str name="collection">${vector.collection:VectorMain}</str>
<bool name="enabled">${vector.enabled:false}</bool>
<int name="chunksize">${vector.chunksize:300}</int>
<int name="maxoverlap">${vector.maxoverlap:0}</int>
<str name="splitter">${vector.splitter:splitterByParagraph}</str>
<str name="minchunklength">${vector.filter.minchunklength:1}</str>
<str name="minalphanumratio">${vector.filter.minalphanumratio:0.0}</str>
</processor>
The processor is invoked twice on the file: once in the datafari processor chain (for creations and updates), and once in the datafari_delete processor chain (for deletions).
To enable it, you must edit the “vector.enabled” property and configure the TextToVector Update Processor. Those steps are detailed in Datafari Vector Search.