1. Reminder: what is Atomic Update?
Once you have indexed the content you need in your Solr index, you will want to start thinking about your strategy for dealing with changes to those documents. Solr supports three approaches to updating documents that have only partially changed.
The first is Atomic Update. This approach allows changing only one or more fields of a document without having to reindex the entire document.
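For illustration, here is what a plain Solr atomic update request could look like when sent with curl (a minimal sketch, assuming a collection named "FileShare" and a document "doc1" with a "title" field; not specific to Datafari):

# Replace only the "title" field of document "doc1"; the other fields are left untouched.
curl -X POST -H 'Content-Type: application/json' \
  'http://localhost:8983/solr/FileShare/update?commit=true' \
  --data-binary '[{"id": "doc1", "title": {"set": "New title"}}]'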
The second approach is known as in-place updates. This approach is similar to atomic updates (is a subset of atomic updates in some sense), but can be used only for updating single valued non-indexed and non-stored docValue-based numeric fields.
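In-place updates use the same request syntax, but only the "set" and "inc" operations are available, and only on eligible fields. A sketch, assuming a numeric "popularity" field declared with docValues="true", indexed="false" and stored="false":

# Increment the docValues-only "popularity" field without reindexing the whole document.
curl -X POST -H 'Content-Type: application/json' \
  'http://localhost:8983/solr/FileShare/update?commit=true' \
  --data-binary '[{"id": "doc1", "popularity": {"inc": 1}}]'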
The third approach is known as optimistic concurrency or optimistic locking. It is a feature of many NoSQL databases, and allows conditionally updating a document based on its version. This approach includes semantics and rules for how to deal with version matches or mismatches.
Atomic Updates (and in-place updates) and Optimistic Concurrency may be used as independent strategies for managing changes to documents, or they may be combined: you can use optimistic concurrency to conditionally apply an atomic update.
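As a sketch of the combination, a "_version_" value can be sent along with an atomic update: Solr then applies the update only if the version constraint is satisfied. A value greater than 1 must match the document's current version exactly, while the special value 1 only requires that the document exists:

# Atomically set "title", but fail if document "doc1" does not already exist.
curl -X POST -H 'Content-Type: application/json' \
  'http://localhost:8983/solr/FileShare/update?commit=true' \
  --data-binary '[{"id": "doc1", "_version_": 1, "title": {"set": "New title"}}]'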
1.2. Choice in Datafari
Doing partial document updates like this can be particularly useful in Datafari when we need to add a type of metadata to documents that takes a long time to extract. First we index documents by extracting the “simple” metadata, so that these documents are quickly available for search. Then we can extract the more complex metadata to complete the documents’ information and improve result accuracy.
Given that in Datafari all the fields we wish to update are indexed and stored, we opted for the Atomic Update approach.
We also use the Optimistic Concurrency approach in combination, so the document must exist to be updated. Without this constraint, we may end up with new incomplete documents containing only the partial fields.
2. Processing principle
Without Atomic Update, if we want to enrich a Solr collection with OCR data using the current Annotator functionality, we create a dedicated OCR job targeting the same collection. This job overwrites the collection with its data, so it must be kept up to date to guarantee fresh document content. If we want to add Spacy data to the same collection (a Spacy server allows us to extract natural language entities using Hugging Face AI models), we have to add it to the same job as the OCR one. Indeed, if we were to create a dedicated Spacy job, it would completely overwrite the collection with its data; and since it has no OCR-related data, the OCR-related field(s) would be emptied (and vice versa when the OCR job passes by). Since this job is sequential, it is only as fast as the slowest computation step (either OCR or Spacy in our case). We cannot parallelize the computations.
We therefore have to be very careful when designing the job sequencing, otherwise the fresh content of documents may be overwritten by the Spacy or OCR job.
Using Datafari’s Atomic Update Service solves this problem: we can update documents with job-specific fields only. The operating principle is somewhat similar to processing without Atomic Update, but compared to our current Annotator functionality, we can now split the work into several separate jobs. In the scenario mentioned earlier, we would have one fast job to index the collection to be enriched, one job for the OCR and one for Spacy. Compared to the previous mechanism, the OCR and Spacy jobs must index the documents in separate collections (see the illustration below). Then, the Atomic Update Service can be called: it retrieves the Spacy- and OCR-specific fields and updates the global collection.
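Conceptually, each batch processed by the service can be pictured as the following sketch (hypothetical query and document values; the field names are those of the example configuration shown later; the real implementation is driven by the configuration file described below):

# 1. Select recently modified documents, with only the job-specific fields, from the source collection.
curl 'http://localhost:8983/solr/Spacy/select?q=last_modified:[NOW-1DAY%20TO%20NOW]&fl=id,entity_product,entity_loc&rows=2000'

# 2. Push the retrieved values to the destination collection as atomic updates on existing documents.
curl -X POST -H 'Content-Type: application/json' \
  'http://localhost:8983/solr/FileShare/update?commit=true' \
  --data-binary '[{"id": "doc1", "_version_": 1, "entity_product": {"set": "Datafari"}, "entity_loc": {"set": "Paris"}}]'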
4. Run Atomic Update Service
The Atomic Update Service is available as an executable file with a configuration file used to set up the update jobs. You will need to launch one Atomic Update job per collection used to update the destination collection. Using the same example as above, you will launch one Atomic Update job for the Spacy collection and another for the OCR collection.
This component is located in Datafari in [DATAFARI_HOME]/bin/atomicupdates, which contains these files:
atomic-updates-launcher.sh: to run Atomic Update (more details below)
atomicUpdate-cfg.json: to configure your job(s) (examples of values supplied)
atomicUpdate-example.json: the configuration file explained
atomicUpdate-log4j2.xml: the log configuration file (you do not need to modify this file)
datafari-solr-atomic-update-….jar: the executable file
To modify these files, you need to be the root user.
4.1. Configuration
For each Atomic Update job, you must configure the following:
the collection used to update the target collection (that is, the location and name of the Spacy collection, for example)
the target collection
the source collection fields used to update the destination collection fields
(Optional) the mapping between source fields and destination fields, when they are not the same.
The configuration file is “atomicUpdate-cfg.json” and here is how to set it up:
{ "logConfigFile": "", // Specify the log configuration file location if it is different from the provided one in the "atomicupdates" directory or if you want to move it in another location. "jobs": { "JOB_1": { // Put the name you want for this configuration "source": { "baseUrl": Solr Cloud base Url for the source Collection used to update target Collection. You can specify Solr or Zookeeper host, but prefer the Zookeeper host as Datafari use it to dispatch to all Solr hosts you have. (For information) The syntax for Solr host is: "http://datafari_domain:8983/solr", ex: "http(s)://localhost:8983/solr" ; you need to specify all Solr hosts. The syntax for Zookeeper is: "datafari_domain:2181", ex: "localhost:2181" ; No http prefix because it's another protocol. Whatever host type, you can define several severs by separating URLs with comma, but using Zookeeper, there is only one server. Example with solr host: "http://solr1:8983/solr, http://solr2:8983/solr,...". "solrCollection": the Solr source Collection for JOB_1. Exemple "Spacy". }, "destination": { "baseUrl": Solr Cloud base Url for the target Collection. The syntax is the same as in "source" block. "solrCollection": the Solr target Collection for JOB_1. Exemple "FileShare". }, "fieldsOperation": { // the fields of the source collection and Atomic Update operation like: set, add, remove, etc... // the "set" operation will be the more appropriate value for most cases, as it replaces the target value with the source value. // see more about operations available here: https://solr.apache.org/guide/solr/9_5/indexing-guide/partial-document-updates.html#atomic-updates "field_1": "set", "field_2": "add", "field_3": "add-distinct", "field_4": "set" }, "nbDocsPerBatch": The documents are selected and updated per batches. Each batch is stored in RAM so this number depends on the data size retrieved (i.e fields and their content). Experienced values are 1000 for OCR sources and 2000 for Spacy sources. We observed good performances with theses values "fieldsMapping": { // Optional: to specify a mapping between source and destination collections "field_3": "dest_field_1", "field_2": "dest_field_2" } }, "JOB_2": { // Put the name you want for this configuration ... } } }
The atomicUpdate-example.json file pretty much repeats what has been said here.
Example:
{ "logConfigFile": "/opt/datafari/tomcat/conf/atomicUpdate-log4j2.xml", "jobs": { "OCR": { "source": { "baseUrl": "dev.datafari.com:2181", "solrCollection": "OCR" }, "destination": { "baseUrl": "dev.datafari.com:2181", "solrCollection": "FileShare" }, "fieldsOperation": { "previewContent": "set", }, "nbDocsPerBatch": 1000, }, "SPACY": { "source": { "baseUrl": "dev.datafari.com:2181", "solrCollection": "Spacy" }, "destination": { "baseUrl": "dev.datafari.com:2181", "solrCollection": "FileShare" }, "nbDocsPerBatch": 2000, "fieldsOperation": { "entity_product": "set", "entity_loc": "set", "last_author": "add-distinct" }, "fieldsMapping": { "last_author": "author" } } } }
If there is a syntax error in the configuration file, the job does not start and no log is generated. Otherwise, at least one line is written to indicate that the job has started.
If there are configuration data errors, the job fails and you can check errors in the log file (see more in https://datafari.atlassian.net/wiki/spaces/DATAFARI/pages/2939387906/Atomic+Update+Management#4.3.-Logs ).
4.2. Run
You can manually launch a job with the following command (no specific permission is needed):

bash atomic-updates-launcher.sh <job_name> [fromDate]
<job_name>
the name (case sensitive) of a single job to run, as defined in the configuration file “atomicUpdate-cfg.json”:
"jobs": { "OCR": { ... }, "SPACY": { ... } }
[fromDate]
(optional) forces the date from which documents are selected (based on the last_modified Solr field). The expected date format is "yyyy-MM-dd HH:mm", with or without the time part; the French date format is also supported. Specify "full" (not case sensitive) to force a full crawl. A full crawl may be necessary if, for some reason, one or more of the jobs in the Atomic Update chain has run a full crawl, overwriting the fields that were supposed to be updated by Atomic Update.
When the job is launched without “fromDate”, a full crawl is done on the first run. Subsequent runs select documents starting from the last execution time of the job (its start time).
Example:
bash atomic-updates-launcher.sh SPACY
or
bash atomic-updates-launcher.sh SPACY full
or
bash atomic-updates-launcher.sh SPACY "05/02/2023"
To run the service on a schedule, use the cron tool, as in the sketch below.
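For example, a crontab entry (a minimal sketch, assuming Datafari is installed in /opt/datafari, as in the example configuration above) could run the SPACY job every night at 2 a.m.:

# m h dom mon dow  command
0 2 * * * bash /opt/datafari/bin/atomicupdates/atomic-updates-launcher.sh SPACY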
It is possible to run several Atomic Update jobs at the same time, for example the OCR and Spacy jobs, given that, a priori, these jobs do not update the same Solr fields; see the sketch below.
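For instance, both jobs could be launched in parallel from a shell (a sketch, assuming both jobs are defined in the configuration file):

# Launch the OCR and SPACY jobs concurrently and wait for both to finish.
bash atomic-updates-launcher.sh OCR &
bash atomic-updates-launcher.sh SPACY &
wait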
4.3. Logs
The log file is [DATAFARI_HOME]/logs/atomic-update.log.
On each line of the file, the job name is specified, as this log file is common to all Atomic Update jobs.
It contains at least one line specifying that the job has started.
If everything goes well, lines report the number of documents processed for each successful batch of updated documents.
If no error occurs during the run, the final line reports the job status as "DONE". Otherwise, the final line reports a "FAILED" status, and the total number of documents processed appears on the previous line.