
Once you have indexed the content you need in your Solr index, you will want to start thinking about your strategy for dealing with changes to those documents. Solr supports three approaches to updating documents that have only partially changed.

The first is Atomic Update. This approach allows changing one or more fields of a document without having to reindex the entire document.
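As a minimal sketch of what an atomic update looks like, the Python snippet below builds the JSON body that could be posted to a collection's `/update` handler. The `set` modifier is part of Solr's atomic update syntax; the unique key name `id`, the document id and the field name `ocr_content` are assumptions made for illustration.

```python
import json


def build_atomic_update(doc_id, fields):
    """Build one Solr atomic update document.

    Each value is wrapped in a {"set": value} modifier, so only the listed
    fields are changed and the rest of the stored document is kept.
    """
    doc = {"id": doc_id}  # "id" is assumed to be the collection's unique key
    for name, value in fields.items():
        doc[name] = {"set": value}
    return doc


# Hypothetical document id and field name, for illustration only.
payload = [build_atomic_update("doc1", {"ocr_content": "extracted text"})]
body = json.dumps(payload)  # JSON body to send to the /update handler
```

Only the fields named in the payload are touched; a plain document without modifiers would instead replace the whole stored document.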

The second approach is known as in-place updates. It is similar to atomic updates (in some sense, a subset of them), but can be used only to update single-valued, non-indexed, non-stored, docValues-based numeric fields.

The third approach is known as optimistic concurrency or optimistic locking. It is a feature of many NoSQL databases, and allows conditionally updating a document based on its version. This approach includes semantics and rules for how to deal with version matches or mismatches.
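The version rules can be summarised with a small Python function. This is a sketch of the semantics Solr documents for the `_version_` field sent with an update, not Solr's actual code; the function name and arguments are hypothetical.

```python
def update_allowed(requested_version, stored_version):
    """Model of Solr's optimistic concurrency rules for `_version_`.

    requested_version: the `_version_` value sent with the update.
    stored_version: the version of the indexed document, or None if the
    document does not exist.
    """
    if requested_version > 1:
        # The document must exist with exactly this version.
        return stored_version == requested_version
    if requested_version == 1:
        # The document must exist; its version is not compared.
        return stored_version is not None
    if requested_version < 0:
        # The document must not exist.
        return stored_version is None
    # 0: no concurrency check, the update is always applied.
    return True
```

When the condition is not met, Solr rejects the update with an HTTP 409 (Conflict) response.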

Atomic Updates (and in-place updates) and Optimistic Concurrency may be used as independent strategies for managing changes to documents, or they may be combined: you can use optimistic concurrency to conditionally apply an atomic update.

1.2. Choice in Datafari

Partial document updates like this can be particularly useful in Datafari when we need to add a type of metadata that takes a long time to extract. First we index documents by extracting the “simple” metadata, so that these documents are quickly available for search. Then we extract the more complex metadata to complete the documents' information and improve the accuracy of results.

Given that in Datafari all the fields we wish to update are indexed and stored, we opted for the Atomic Update approach.

We also use the Optimistic Concurrency approach in combination, to require that the document already exists before it is updated. Without this restriction, an update could create new incomplete documents containing only the partial fields.
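A minimal sketch of this combination, assuming a unique key named `id`: the atomic update carries `_version_` set to 1, which in Solr's optimistic concurrency rules means "the document must exist". The function name, document id and field name are hypothetical, and the exact request Datafari sends may differ.

```python
import json


def build_guarded_update(doc_id, fields):
    """Atomic update that Solr applies only if the document already exists."""
    # "_version_": 1 means the document must exist, whatever its version.
    doc = {"id": doc_id, "_version_": 1}
    for name, value in fields.items():
        doc[name] = {"set": value}  # atomic "set" modifier
    return doc


# Hypothetical id and field name, for illustration only.
update = build_guarded_update("doc1", {"entities": ["Paris", "Solr"]})
body = json.dumps([update])
```

If `doc1` is not in the collection, Solr rejects the update instead of creating a document that holds only the `entities` field.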

2. Processing principle

Without Atomic Update, if we want to enrich a Solr collection with OCR data using the current Annotator functionality, we create a dedicated OCR job targeting the same collection. This job overwrites the collection with its data. The OCR job must be kept up to date to guarantee fresh document content. If we want to add Spacy data to the same collection (a Spacy server allows us to extract natural language entities using Hugging Face AI models), we have to add it to the same job as the OCR one. Indeed, if we created a dedicated Spacy job, it would completely overwrite the collection with its data; since it has no OCR-related data, the OCR-related field(s) would be emptied (and vice versa when the OCR job runs). One disadvantage is that this job is sequential, so it will only be as fast as the slowest computation step (either OCR or Spacy in our case). We cannot parallelize the computations.

We therefore have to be very careful when designing the job sequencing if we don't want the fresh content of documents to be overwritten by the Spacy or OCR job.
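The overwrite problem can be illustrated with a toy in-memory model, where a plain Python dict stands in for the collection (this is not Solr itself). The field names `ocr_content` and `entities` are hypothetical.

```python
def full_reindex(collection, doc):
    """Replace the whole stored document, as a regular (non-atomic) job does."""
    collection[doc["id"]] = dict(doc)


collection = {}

# The OCR job indexes the document together with its OCR field.
full_reindex(collection, {"id": "doc1", "title": "Report", "ocr_content": "scanned text"})

# A separate Spacy job then indexes the same document without any OCR data.
full_reindex(collection, {"id": "doc1", "title": "Report", "entities": ["Paris"]})

# The second job replaced the whole document, so the OCR field is gone.
ocr_field_lost = "ocr_content" not in collection["doc1"]
```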

Using Datafari’s Atomic Update Service solves this problem. We can update documents with job-specific fields only. The operating principle is somewhat similar to processing without Atomic Update. Compared to our current Annotator functionality, we can now split the work into three separate jobs: one fast job to index the collection to be enriched, one for OCR and one for Spacy. One difference compared to the current mechanism is that the OCR and Spacy jobs must index the documents in another collection. Then, the Atomic Update Service can be called. It retrieves the Spacy-specific and OCR-specific fields and uses them to update the global collection.
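The principle can be sketched in Python as follows. This is not the Atomic Update Service's actual implementation, only an illustration under stated assumptions: documents read from a job-specific collection are reduced to that job's fields and turned into atomic updates for the global collection. The function name, the unique key `id` and the field names `ocr_content` and `entities` are hypothetical.

```python
def to_atomic_updates(source_docs, job_fields):
    """Turn documents from a job-specific collection into atomic updates
    that carry only that job's fields."""
    updates = []
    for doc in source_docs:
        changed = {f: {"set": doc[f]} for f in job_fields if f in doc}
        if changed:
            # "_version_": 1 asks Solr to reject the update if the document
            # does not already exist in the target collection.
            updates.append({"id": doc["id"], "_version_": 1, **changed})
    return updates


# Documents as they might be read from the OCR and Spacy collections.
ocr_docs = [{"id": "doc1", "title": "Report", "ocr_content": "scanned text"}]
spacy_docs = [{"id": "doc1", "title": "Report", "entities": ["Paris"]}]

ocr_updates = to_atomic_updates(ocr_docs, ["ocr_content"])
spacy_updates = to_atomic_updates(spacy_docs, ["entities"])
```

Because each update names only its own job's fields, the OCR and Spacy results no longer overwrite each other and can be applied in any order.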
