1. Reminder: what is Atomic Update?
Once you have indexed the content you need in your Solr index, you will want to start thinking about your strategy for dealing with changes to those documents. Solr supports three approaches to updating documents that have only partially changed.
One of them is Atomic Update. This approach lets you change one or more fields of a document without having to reindex the entire document.
This can be particularly useful in Datafari when we need to add a type of metadata that takes a long time to extract. First we index the documents by extracting only the “simple” metadata, so that they quickly become available for search. Then we extract the more complex metadata to complete the documents' information and improve the accuracy of the results.
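To make the idea concrete, here is a minimal sketch of the payload shape Solr expects for an atomic update: the document's unique key plus, for each changed field, a modifier such as "set", "add" or "add-distinct". The sketch only builds and prints the payload, it does not contact Solr. The unique key name "id", the document id and the field values are assumptions made for illustration.

```python
import json


def build_atomic_update(doc_id, changes):
    """Build one Solr atomic update document.

    `changes` maps a field name to an (operation, value) pair, where the
    operation is a Solr modifier such as "set", "add" or "add-distinct".
    The unique key is assumed to be called "id".
    """
    update = {"id": doc_id}
    for field, (operation, value) in changes.items():
        update[field] = {operation: value}
    return update


# Hypothetical document id and values, for illustration only.
doc = build_atomic_update(
    "file://share/report.pdf",
    {
        "entity_loc": ("set", ["Paris"]),
        "author": ("add-distinct", "jdoe"),
    },
)

# Solr's JSON update handler accepts a list of such documents.
print(json.dumps([doc], indent=2))
```

Only the fields listed in the payload are modified; every other stored field of the document is left as it is.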
2. Processing principle
Without Atomic Update, if we want to enrich a Solr collection with OCR data using the current Annotator functionality, we create a dedicated OCR job targeting the same collection. This job overwrites the collection with its own data, so the OCR job must be kept up to date to guarantee fresh document content. If we also want to add Spacy data to the same collection (a Spacy server allows us to extract natural language entities using Hugging Face AI models), we have to add it to the same job as the OCR step. Indeed, if we created a dedicated Spacy job, it would completely overwrite the collection with its data; since it has no OCR-related data, the OCR-related fields would be emptied (and vice versa when the OCR job runs). One disadvantage is that, because this job is sequential, it is only as fast as its slowest computation step (either OCR or Spacy in our case). We cannot parallelize the computations.
We therefore have to be very careful when designing the job sequencing, so that the fresh content of documents is not overwritten by the Spacy or OCR job.
Using Datafari’s Atomic Update Service solves this problem: we can update documents with job-specific fields only. The operating principle is somewhat similar to processing without Atomic Update, but compared to our current Annotator functionality, the work can now be split into three separate jobs: a fast one to index the collection to be enriched, one for OCR and one for Spacy. One difference compared to the current mechanism is that the OCR and Spacy jobs must index their documents into other collections rather than into the collection to be enriched. The Atomic Update Service can then be called: it retrieves the Spacy-specific and OCR-specific fields from those collections and uses them to update the global collection.
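The difference between the two approaches can be sketched with a toy in-memory model, where a collection is a plain dictionary. This is not how Solr or Datafari store documents; the helper names, the document id and the field values are all hypothetical, and only the "set" operation is modelled.

```python
def full_reindex(collection, doc_id, new_doc):
    """Full indexing replaces the stored document entirely."""
    collection[doc_id] = dict(new_doc)


def atomic_set(collection, doc_id, fields):
    """An atomic "set" only touches the listed fields."""
    collection[doc_id].update(fields)


# Without Atomic Update: two jobs write to the same collection, and the
# second one (Spacy) wipes the field written by the first one (OCR).
overwritten = {}
full_reindex(overwritten, "doc1", {"title": "Report", "previewContent": "ocr text"})
full_reindex(overwritten, "doc1", {"title": "Report", "entity_loc": ["Paris"]})

# With Atomic Update: a fast job indexes the document, then each
# enrichment contributes only its own fields.
enriched = {}
full_reindex(enriched, "doc1", {"title": "Report"})
atomic_set(enriched, "doc1", {"previewContent": "ocr text"})  # from the OCR collection
atomic_set(enriched, "doc1", {"entity_loc": ["Paris"]})  # from the Spacy collection

print(overwritten["doc1"])
print(enriched["doc1"])
```

In the first case the OCR field is lost after the second write; in the second case both enrichments coexist, and the two atomic updates could be applied in either order.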
4. Run Atomic Update Service
The Atomic Update Service is available as an executable file, with a configuration file used to set up the update jobs. You need to launch one service job per collection used to update the destination collection. Using the same example as above, you launch one Atomic Update job for the Spacy collection and another for the OCR collection.
4.1. Configuration
For each Atomic Update job, you must configure the following:
the source collection used to update the target collection (for example, the location and name of the Spacy collection)
the target collection
the source collection fields used to update the destination collection fields
(optional) the mapping between source fields and destination fields, when their names differ
The configuration file is “atomicUpdate-cfg.json” and here is the way to set it up:
{
  // Specify the log configuration file location if it is different from the one provided
  // in the "atomicupdates" directory, or if you want to move it to another location.
  "logConfigFile": "",
  "jobs": {
    "JOB_1": { // Put the name you want for this configuration
      "source": {
        // Solr Cloud base URL for the source collection used to update the target collection.
        // You can specify Solr or Zookeeper hosts.
        // Syntax for Solr hosts: "http://datafari_domain:8983/solr", e.g. "http://localhost:8983/solr";
        // you need to specify all Solr hosts.
        // Syntax for Zookeeper: "datafari_domain:2181", e.g. "localhost:2181";
        // no http prefix because it is another protocol.
        // Whatever the host type, you can define several servers by separating the URLs with commas:
        // "http://solr1:8983/solr, http://solr2:8983/solr,...".
        "baseUrl": "http://localhost:8983/solr",
        // The Solr source collection for JOB_1. Example: "Spacy".
        "solrCollection": "Spacy"
      },
      "destination": {
        // Solr Cloud base URL for the target collection. Same syntax as in the "source" block.
        "baseUrl": "http://localhost:8983/solr",
        // The Solr target collection for JOB_1. Example: "FileShare".
        "solrCollection": "FileShare"
      },
      // The fields of the source collection and the Atomic Update operation
      // to apply to each of them: set, add, remove, etc.
      "fieldsOperation": {
        "field_1": "set",
        "field_2": "add",
        "field_3": "add-distinct",
        "field_4": "set"
      },
      // Documents are selected and updated in batches. Each batch is stored in RAM,
      // so this number depends on the size of the data retrieved (i.e. the fields and their content).
      "nbDocsPerBatch": 1000,
      // Optional: mapping between source and destination collection fields.
      "fieldsMapping": {
        "field_3": "dest_field_1",
        "field_2": "dest_field_2"
      }
    },
    "JOB_2": { // Put the name you want for this configuration
      ...
    }
  }
}
Example:
{
  "logConfigFile": "/opt/datafari/tomcat/conf/atomicUpdate-log4j2.xml",
  "jobs": {
    "OCR": {
      "source": {
        "baseUrl": "dev.datafari.com:2181",
        "solrCollection": "OCR"
      },
      "destination": {
        "baseUrl": "dev.datafari.com:2181",
        "solrCollection": "FileShare"
      },
      "fieldsOperation": {
        "previewContent": "set"
      },
      "nbDocsPerBatch": 1000
    },
    "SPACY": {
      "source": {
        "baseUrl": "dev.datafari.com:2181",
        "solrCollection": "Spacy"
      },
      "destination": {
        "baseUrl": "dev.datafari.com:2181",
        "solrCollection": "FileShare"
      },
      "nbDocsPerBatch": 2000,
      "fieldsOperation": {
        "entity_product": "set",
        "entity_loc": "set",
        "last_author": "add-distinct"
      },
      "fieldsMapping": {
        "last_author": "author"
      }
    }
  }
}
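As a rough illustration of what the "fieldsOperation", "fieldsMapping" and "nbDocsPerBatch" settings express, the sketch below turns documents read from a source collection into atomic update documents for the destination, in batches. It is not the service's actual implementation: the function names, the sample documents, the unique key "id" and the choice to skip fields missing from a source document are all assumptions. The batch size is reduced to 2 so the batching is visible.

```python
def to_atomic_update(source_doc, fields_operation, fields_mapping=None):
    """Translate one source document into an atomic update document."""
    fields_mapping = fields_mapping or {}
    update = {"id": source_doc["id"]}  # assumed unique key
    for field, operation in fields_operation.items():
        if field not in source_doc:
            continue  # assumption: nothing to propagate for this field
        # Use the mapped destination name when one is configured.
        target = fields_mapping.get(field, field)
        update[target] = {operation: source_doc[field]}
    return update


def batches(docs, size):
    """Yield successive slices of at most `size` documents."""
    for start in range(0, len(docs), size):
        yield docs[start:start + size]


# Settings copied from the SPACY job above, with a smaller batch size.
spacy_job = {
    "fieldsOperation": {
        "entity_product": "set",
        "entity_loc": "set",
        "last_author": "add-distinct",
    },
    "fieldsMapping": {"last_author": "author"},
    "nbDocsPerBatch": 2,
}

# Hypothetical documents read from the source collection.
source_docs = [
    {"id": "doc1", "entity_loc": ["Paris"], "last_author": "jdoe"},
    {"id": "doc2", "entity_product": ["Datafari"]},
    {"id": "doc3", "last_author": "asmith"},
]

updates = [
    [
        to_atomic_update(doc, spacy_job["fieldsOperation"], spacy_job["fieldsMapping"])
        for doc in batch
    ]
    for batch in batches(source_docs, spacy_job["nbDocsPerBatch"])
]

for batch in updates:
    print(batch)
```

Note how "last_author" from the source collection ends up as an "add-distinct" operation on the destination field "author", as the mapping requests.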
4.2. Run
You can manually launch a job with the command:
java -jar datafari-solr-atomic-update.jar <job_name> [fromDate]
<job_name>
refers to the name put in the configuration file “atomicUpdate-cfg.json”:
"jobs": { "OCR": { ... }, "SPACY": { ... } }
[fromDate]
(optional) forces the date from which documents are selected (based on the last_modified Solr field). The expected date format is "yyyy-MM-dd HH:mm", with or without the time, and the French format is also supported. Specify "full" (case-insensitive) to force a full crawl.
When the job is launched without “fromDate”, the first run performs a full crawl. Subsequent runs select the documents modified since the start time of the job's last execution.
Example:
java -jar datafari-solr-atomic-update.jar SPACY
or
java -jar datafari-solr-atomic-update.jar SPACY full
or
java -jar datafari-solr-atomic-update.jar SPACY "05/02/2023"
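The way the optional date argument could be interpreted can be sketched as follows. This is an illustration, not the service's own parsing code: the function name is hypothetical, and the day-first patterns are an assumption derived from the "05/02/2023" example (read here as 5 February 2023); the service may accept other variants.

```python
from datetime import datetime

# Assumed patterns: the documented "yyyy-MM-dd HH:mm" form with or without
# the time, plus a day-first form inferred from the "05/02/2023" example.
_PATTERNS = ("%Y-%m-%d %H:%M", "%Y-%m-%d", "%d/%m/%Y %H:%M", "%d/%m/%Y")


def parse_from_date(arg):
    """Return None for a full crawl, else the datetime to select from."""
    value = arg.strip()
    if value.lower() == "full":
        return None  # "full" is not case sensitive
    for pattern in _PATTERNS:
        try:
            return datetime.strptime(value, pattern)
        except ValueError:
            continue  # try the next pattern
    raise ValueError(f"Unsupported fromDate: {arg!r}")


print(parse_from_date("FULL"))
print(parse_from_date("2023-02-05 14:30"))
print(parse_from_date("05/02/2023"))
```

When the time is omitted, this sketch defaults to midnight of the given day.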
To run the service on a schedule, use the cron tool.