Vector Update Processor - BETA VERSION
Valid from Datafari 6.2
In order to implement the Vector Search feature in Datafari, we created the Vector Update Processor, which chunks and prepares documents for embedding.
What is vector search?
Vector search is a method of information retrieval where documents and queries are represented as vectors instead of plain text. In vector search, machine learning models generate the vector representations of source inputs, which can be text, images, or other content. Having a mathematic representation of content provides a common basis for search scenarios. If everything is a vector, a query can find a match in vector space, even if the associated original content is in different media or language than the query.
Source: Vector search - Azure AI Search
What does this Update Processor do?
Once documents are crawled by ManifoldCF jobs, they are injected into Solr. An Update Processor is a Solr component that can process entering documents.
The Vector Update Processor is a component that chunks the content of incoming documents into smaller pieces. For each chunk, it creates a Solr document in the newly created Solr collection, “VectorMain”. Since these chunk-based Solr documents originate from a “parent” document, we call them subdocuments; they contain the following metadata:
id: The String ID of the subdocument. It is the parent ID concatenated with the index of the chunk.
vector: An array of Float representing the embedded content. It is generated by the Solr TextToVectorUpdateProcessor.
parent_doc: The String ID of the parent document from the FileShare collection.
content: The String content of the chunk.
Read more about Update Processors here: Custom Update Processor.
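As a quick illustration of the subdocument ID scheme described above, the sketch below derives a child ID from its parent ID and chunk index. The exact separator used by Datafari between the two parts is an implementation detail not stated here; an underscore is assumed purely for illustration.

```java
// Illustrative sketch: deriving a subdocument id from the parent id and
// the chunk index. The "_" separator is an assumption, not Datafari's
// documented format.
public class SubDocId {
    static String subDocId(String parentId, int chunkIndex) {
        return parentId + "_" + chunkIndex;
    }

    public static void main(String[] args) {
        // e.g. third chunk of a crawled file
        System.out.println(subDocId("file://server/share/report.pdf", 2));
    }
}
```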
How does it work?
After the execution of crawling jobs, ManifoldCF sends CREATE, UPDATE and/or DELETE document requests to Solr. Each of these requests carries a document to be added to, updated in, or removed from the FileShare collection.
Incoming requests
CREATION and UPDATE requests (the green document in the example above) are processed by the Datafari Update Processor (and every other Update Processor listed in the Datafari processor chain in solrconfig.xml, including the Custom Update Processor).
Note: The Datafari Update Processor is a Datafari component that carries out various operations to prepare the documents. It is independent from the Vector Update Processor.
DELETION requests are not processed by the Datafari Update Processor, since they use a different handler. However, CREATION, UPDATE and DELETION requests are all processed by the Vector Update Processor, as long as it is enabled.
Inside the Vector Update Processor
The component handles DELETIONS (in the processDelete() method) and CREATIONS/UPDATES (in the processUpdate() method) separately.
CREATIONS/UPDATES
Deleting existing children:
When a document is created or updated, every existing subdocument of the processed document is deleted from the VectorMain collection.
Children are identified using their parent_id field, which matches the parent ID.
Deleting existing child documents at creation time ensures consistency and prevents issues such as leftover subdocuments from previous manual or incomplete deletions, even though under normal circumstances no children should exist.
Chunking:
The content of the processed document is extracted, then chunked using the Langchain4j libraries:
Tokenizer tokenizer = new OpenAiTokenizer();
DocumentSplitter splitter = DocumentSplitters.recursive(this.chunksize, this.maxoverlap, tokenizer);
The current default values (in tokens) for these parameters are the following:
CHUNK_SIZE: 300
MAX_OVERLAP_SIZE: 0
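To make the chunk-size and overlap semantics concrete, here is a minimal sketch of a sliding-window chunker. It is NOT the Langchain4j recursive splitter (which additionally respects paragraph and sentence boundaries), and it counts naive whitespace-separated words instead of OpenAI tokenizer tokens; it only illustrates how CHUNK_SIZE and MAX_OVERLAP_SIZE interact.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified illustration of chunking with a size limit and an overlap.
// Real Datafari chunking uses Langchain4j's recursive splitter with an
// OpenAI tokenizer; this sketch uses plain word counts.
public class NaiveChunker {
    static List<String> chunk(String text, int chunkSize, int maxOverlap) {
        String[] tokens = text.split("\\s+");
        List<String> chunks = new ArrayList<>();
        // Each new chunk starts (chunkSize - maxOverlap) tokens after the
        // previous one, so consecutive chunks share maxOverlap tokens.
        int step = Math.max(1, chunkSize - maxOverlap);
        for (int start = 0; start < tokens.length; start += step) {
            int end = Math.min(start + chunkSize, tokens.length);
            chunks.add(String.join(" ", Arrays.copyOfRange(tokens, start, end)));
            if (end == tokens.length) break;
        }
        return chunks;
    }

    public static void main(String[] args) {
        // chunkSize=3, maxOverlap=1 -> [a b c], [c d e], [e f]
        System.out.println(chunk("a b c d e f", 3, 1));
    }
}
```

With the current defaults (CHUNK_SIZE=300, MAX_OVERLAP_SIZE=0), chunks simply do not share any tokens.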
Creating subdocuments:
Each chunk is converted into a SolrInputDocument, inheriting parent metadata, with some modifications:
"id": The String ID of the subdocument. It is formed from the parent ID, concatenated with the index of the chunk.
"parent_doc": The String ID of the parent document from the FileShare collection.
"content_en", "content_fr", "preview_content": Removed from the object.
"exactContent", "embedded_content": The content of the chunk. "exactContent" is a multivalued field, used for RAG processes. "embedded_content" is singlevalued, and is required for vector embeddings.
The subdocuments are then added to the VectorMain Solr collection.
Vector embeddings:
The vector embedding is performed by the TextToVector Update Processor when child documents are indexed into VectorMain. See Datafari Vector Search | How to enable vector search features? for more information and proper configuration.
The number of dimensions of the vector depends on the model, and should match the vector dimension defined in the knn_vector fieldType, in the schema.xml file.
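For illustration, a dense vector field type in schema.xml typically looks like the fragment below (standard Solr syntax; the dimension value 768 and the field names are placeholders here, and must match your embedding model and Datafari's actual schema):

```xml
<!-- Illustrative only: the vectorDimension must equal the embedding
     model's output dimension. -->
<fieldType name="knn_vector" class="solr.DenseVectorField"
           vectorDimension="768" similarityFunction="cosine"/>
<field name="vector" type="knn_vector" indexed="true" stored="true"/>
```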
DELETIONS
When a document deletion query is received by Solr (when deleting a ManifoldCF job, for example), the processDelete() method is called. Every existing subdocument of the processed document is deleted from the VectorMain collection.
Children are identified using their parent_id field, which matches the parent ID.
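A child cleanup like the one described above typically amounts to a Solr delete-by-query on the parent_id field. The sketch below only shows the query-string construction; with SolrJ, such a string would then be passed to SolrClient.deleteByQuery(collection, query). This is an illustration of the approach, not Datafari's exact code.

```java
// Sketch: building a delete-by-query string that matches every child of a
// given parent. The id is quoted because crawled ids (file paths, URLs)
// contain characters that are special to the Solr query parser.
public class ChildDeleteQuery {
    static String childrenQuery(String parentId) {
        return "parent_id:\"" + parentId.replace("\"", "\\\"") + "\"";
    }

    public static void main(String[] args) {
        System.out.println(childrenQuery("file://server/share/report.pdf"));
    }
}
```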
Technical specifications
This section provides technical details about the processor.
Chunking
The chunking is performed by a Recursive Splitter, provided by the Langchain4j library.
Tokenizer tokenizer = new OpenAiTokenizer();
DocumentSplitter splitter = DocumentSplitters.recursive(this.chunksize, this.maxoverlap, tokenizer);
This code ensures that the text is segmented into paragraphs of at most CHUNK_SIZE tokens (current value: 300) each, with a maximum overlap of MAX_OVERLAP_SIZE tokens (current value: 0).
Embeddings model
The chunks are embedded using the Text To Vector Update Processor.
How to use it?
To enable Vector Update Processor and the Solr Vector Search, follow the instructions from Datafari Vector Search.
The processor is declared in the solrconfig.xml file, located in /opt/datafari/solr/solrcloud/FileShare/conf (you do not need to edit this file):
<processor class="com.francelabs.datafari.updateprocessor.VectorUpdateProcessorFactory">
<str name="host">${vector.host:localhost\:2181}</str>
<str name="collection">${vector.collection:VectorMain}</str>
<bool name="enabled">${vector.enabled:false}</bool>
<int name="chunksize">${vector.chunksize:300}</int>
<int name="maxoverlap">${vector.maxoverlap:0}</int>
<str name="splitter">${vector.splitter:splitterByParagraph}</str>
<str name="minchunklength">${vector.filter.minchunklength:1}</str>
<str name="minalphanumratio">${vector.filter.minalphanumratio:0.0}</str>
</processor>
The processor is invoked twice on the file: once in the datafari processor chain (for creations and updates), and once in the datafari_delete processor chain (for deletions).
To enable it, you must edit the “vector.enabled” property and configure the TextToVector Update Processor. Those steps are detailed in Datafari Vector Search.