...

We also use the Optimistic Concurrency approach in combination, so the document must exist to be updated. Without this constraint, we may end up with new incomplete documents containing only the partial fields.
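
To illustrate how the two mechanisms combine at the Solr level, here is a minimal sketch that builds an atomic update document carrying `_version_: 1`, the value Solr's optimistic concurrency interprets as "the document must already exist". The function name, document id and field name are hypothetical, chosen for illustration only; this is not Datafari's actual code.

```python
import json


def build_atomic_update(doc_id, fields):
    """Build a Solr atomic update document for an existing document.

    Each field is wrapped in a {"set": value} modifier so that only
    that field is replaced. "_version_": 1 asks Solr to reject the
    update if no document with this id exists.
    """
    doc = {"id": doc_id, "_version_": 1}
    for name, value in fields.items():
        doc[name] = {"set": value}
    return doc


# Hypothetical id and field name, for illustration only.
update = build_atomic_update("file://share/report.pdf",
                             {"ocr_content": "scanned text"})
# JSON body that could be posted to a collection's update handler.
payload = json.dumps([update])
```

Because the update carries only the `set` modifiers, any field not listed keeps its current value in the target document.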

...

Without Atomic Update, if we want to enrich a Solr collection with OCR data using the current Annotator functionality, we create a dedicated OCR job targeting the same collection. This job will overwrite the collection with its data. The OCR job must be kept up to date to guarantee fresh document content. If we want to add Spacy data to the same collection (a Spacy server allows us to extract natural language entities using Hugging Face AI models), we have to add it to the same job as the OCR one. Indeed, if we were to create a dedicated Spacy job, it would completely overwrite the collection with its data; and since it has no OCR-related data, the OCR-related field(s) would be emptied (and vice versa when the OCR job runs). Because this job is sequential, it will only be as fast as the slowest computation step (either OCR or Spacy in our case), and we cannot parallelize the computations.

...

Using Datafari’s Atomic Update Service solves this problem. We can update documents with job-specific fields only. The operating principle is somewhat similar to processing without Atomic Update. Compared to our current Annotator functionality, we can now do it in several separate jobs. In the scenario mentioned earlier, we would have one fast job to index the collection to be enriched, one job for the OCR and one for Spacy. Compared to the previous mechanism, the OCR and Spacy jobs must index the documents in separate collections (see the illustration below). Then, the Atomic Update Service can be called. It will retrieve the Spacy-specific and OCR-specific fields to update the global collection.
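
The principle can be sketched with plain Python dictionaries standing in for Solr collections. This is only an in-memory illustration under simplifying assumptions: the collection contents, field names and the `atomic_update` helper are hypothetical and do not reflect the service's real implementation.

```python
def atomic_update(collection, doc_id, fields):
    """Update only the given fields; the document must already exist."""
    if doc_id not in collection:
        return False  # skip: never create a partial document
    collection[doc_id].update(fields)
    return True


# Global collection indexed by the fast job (hypothetical content).
main = {"doc1": {"title": "Report"}}
# Separate collections filled by the OCR and Spacy jobs.
ocr = {"doc1": {"ocr_content": "scanned text"}}
spacy = {"doc1": {"entities": ["Paris"]}, "doc2": {"entities": ["Lyon"]}}

# One update pass per source collection.
for source in (ocr, spacy):
    for doc_id, fields in source.items():
        atomic_update(main, doc_id, fields)
```

After both passes, `doc1` holds its original title together with the OCR and Spacy fields, while `doc2`, which is absent from the global collection, is not created.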

...

The Atomic Update Service is available as an executable file with a configuration file used to set up the update jobs. You will need to launch one Atomic Update job per collection used to update the destination collection. Using the same example as above, you will launch one Atomic Update job for the Spacy collection and another for the OCR collection.

...

If there is a syntax error in the configuration file, the job will not start and no log is generated. Otherwise, at least one log line is written to indicate that the job has started.

...

You can manually launch a job with the following command (no specific permission is needed): bash atomic-updates-launcher.sh <job_name> [fromDate]

...

[fromDate] (optional): forces the date from which documents are selected (based on the last_modified Solr field). The expected date format is "yyyy-MM-dd HH:mm", with or without the time, and the French format is also supported. Specify "full" (not case sensitive) to force a full crawl. A full crawl may be necessary if, for some reason, one or more of the jobs in your Atomic Update chain has run a full crawl, causing the fields updated by Atomic Update to be overwritten.
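
As a rough illustration of how such an argument could be interpreted, the sketch below accepts "full" in any letter case and the "yyyy-MM-dd HH:mm" format with or without the time. The function name is hypothetical, the French format is deliberately left out because its exact pattern is not given here, and the launcher's real parsing logic may differ.

```python
from datetime import datetime


def parse_from_date(arg):
    """Interpret a fromDate value.

    Returns None when a full crawl is requested, otherwise the
    datetime from which documents should be selected.
    """
    value = arg.strip()
    if value.lower() == "full":
        return None  # full crawl: no lower bound on last_modified
    # Try the format with the time first, then the date-only variant.
    for fmt in ("%Y-%m-%d %H:%M", "%Y-%m-%d"):
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue
    raise ValueError("Unsupported fromDate value: " + arg)
```

A date given without a time is read as midnight of that day, which is an assumption of this sketch.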

...