1. Reminder: what is Atomic Update?
Once you have indexed the content you need in your Solr index, you will want to start thinking about your strategy for dealing with changes to those documents. Solr supports three approaches to updating documents that have only partially changed: atomic updates, in-place updates and optimistic concurrency.
One of them is Atomic Update. This approach allows changing one or more fields of a document without having to reindex the entire document.
This can be particularly useful in Datafari when we need to add to documents a type of metadata that takes a long time to extract. First we index the documents with only the simplest metadata, so that they are available for search quickly. Then we can extract the more complex metadata to complete the documents' information and improve result accuracy.
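To make this concrete, here is a minimal sketch of a Solr atomic update document (the document id, field name and value are hypothetical). Posted to a collection's /update handler, it changes only the previewContent field and leaves the rest of the document untouched:

[
  {
    "id": "doc42",
    "previewContent": { "set": "Text extracted by OCR..." }
  }
]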
2. Processing principle
Without Atomic Update, if we want to enrich a Solr collection with OCR data, we create an OCR job targeting the same collection. This job overwrites the documents of the collection with its own data, so it must be kept up to date to guarantee fresh document content. If we then want to add Spacy data to the same collection, we have to create a dedicated job which, like the OCR one, overwrites the collection with its data.
We are therefore very sensitive to job sequencing if we don't want the fresh content of documents to be overwritten by the Spacy or OCR job.
Using Datafari’s Atomic Update Service solves this problem: we can update documents with job-specific fields only. The operating principle is somewhat similar to processing without Atomic Update. Taking the previous example, we still need three jobs: a fast one to index the collection to be enriched, one for OCR and one for Spacy. The difference is that the OCR and Spacy jobs must index their documents into other collections. Then the Atomic Update Service can be called: it retrieves the Spacy- and OCR-specific fields and uses them to update the first collection, as sketched below.
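Here is a hedged sketch of what the service conceptually does for the Spacy job (the collection names come from the example above, the field names match the configuration example in section 4.1; the host, document id and field values are assumptions):

# 1. Read the job-specific fields from the source collection
curl "http://localhost:8983/solr/Spacy/select?q=*:*&fl=id,entity_product,entity_loc"

# 2. Write them into the target collection as an atomic update:
#    "set" replaces those fields only, all other fields of the document are preserved
curl -X POST -H "Content-Type: application/json" \
  "http://localhost:8983/solr/FileShare/update?commit=true" \
  --data-binary '[{ "id": "doc42", "entity_product": { "set": "Datafari" }, "entity_loc": { "set": "Toulouse" } }]'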
4. Run Atomic Update Service
The Atomic Update Service is available as an executable file with a configuration file used to set up the update jobs. You will need to launch one service job per collection used to update the destination collection. Using the same example as above, you launch one Atomic Update job for the Spacy collection and another for the OCR collection.
4.1. Configuration
For each Atomic Update job, you must configure the following information:
the collection used to update the target collection, that is the location and name of the Spacy collection, for example
the target collection
the source collection fields used to update the destination collection fields
(Optional) the mapping between source fields and destination fields, when they are not the same.
The configuration file is “atomicUpdate-cfg.json” and here is how to set it up:
{ "logConfigFile": atomicUpdate-log4j2.xml loggin file location, "jobs": { "JOB_1": { // Put the name you want for this configuration "source": { "baseUrl": Solr Cloud base Url for the source Collection used to update target Collection. You can specify Solr or Zookeeper host. The syntax for Solr host is: "http://datafari_domain:8983/solr", ex: "http://localhost:8983/solr" ; you need to specify all Solr hosts. The syntax for Zookeeper is: "datafari_domain:2181", ex: "localhost:2181" ; No http prefix because it's another protocol. Whatever host type, you can define several severs by separating URLs with comma: "http://solr1:8983/solr, http://solr2:8983/solr,...". "solrCollection": the Solr source Collection for JOB_1. Exemple "Spacy". }, "destination": { "baseUrl": Solr Cloud base Url for the target Collection. The syntax is the same as in "source" bloc. "solrCollection": the Solr target Collection for JOB_1. Exemple "FileShare". }, "fieldsOperation": { // the fields of the source collection and Atomic Update operation like: set, add, remove, etc... "field_1": "set", "field_2": "add", "field_3": "add-distinct", "field_4": "set" }, "nbDocsPerBatch": The documents are selected and updated per batches. Each batch is stored in RAM so this number depends on the data size retrieved (i.e fields and their content). "fieldsMapping": { // Optional: to specify a mapping between source and destination collections "field_3": "dest_field_1", "field_2": "dest_field_2" } }, "JOB_2": { // Put the name you want for this configuration ... } } }
Example:
{
  "logConfigFile": "/opt/datafari/tomcat/conf/atomicUpdate-log4j2.xml",
  "jobs": {
    "OCR": {
      "source": {
        "baseUrl": "dev.datafari.com:2181",
        "solrCollection": "OCR"
      },
      "destination": {
        "baseUrl": "dev.datafari.com:2181",
        "solrCollection": "FileShare"
      },
      "fieldsOperation": {
        "previewContent": "set"
      },
      "nbDocsPerBatch": 1000
    },
    "SPACY": {
      "source": {
        "baseUrl": "dev.datafari.com:2181",
        "solrCollection": "Spacy"
      },
      "destination": {
        "baseUrl": "dev.datafari.com:2181",
        "solrCollection": "FileShare"
      },
      "nbDocsPerBatch": 2000,
      "fieldsOperation": {
        "entity_product": "set",
        "entity_loc": "set",
        "last_author": "add-distinct"
      },
      "fieldsMapping": {
        "last_author": "author"
      }
    }
  }
}
4.2. Run
You can manually launch a job with the command: java -jar datafari-solr-atomic-update.jar <job_name> [fromDate]
<job_name>
refers to the job name defined in the configuration file “atomicUpdate-cfg.json”:
"jobs": { "OCR": { ... }, "SPACY": { ... } }
[fromDate]
(optional) forces the date from which documents are selected (based on the last_modified Solr field). The expected date format is "yyyy-MM-dd HH:mm", with or without the time part; the French date format is also supported (e.g. "05/02/2023"). Specify "full" (case insensitive) to force a full crawl.
When the job is launched without “fromDate”, a full crawl is done on the first run. Subsequent runs select documents modified since the last execution time (start time) of the job.
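Conceptually, this incremental selection corresponds to a Solr filter query on last_modified against the source collection. The following is a hedged illustration only, not necessarily the service's actual query (the cutoff date is arbitrary and the field list matches the configuration example above):

q=*:*
fq=last_modified:[2023-02-05T00:00:00Z TO NOW]
fl=id,entity_product,entity_loc
rows=<nbDocsPerBatch>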
Example:
java -jar datafari-solr-atomic-update.jar SPACY
or
java -jar datafari-solr-atomic-update.jar SPACY full
or
java -jar datafari-solr-atomic-update.jar SPACY "05/02/2023"
To run the service on a schedule, use the cron tool, as in the example below.
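For instance, a crontab entry running the SPACY job every night at 2 a.m. could look like this (the installation path and log location are assumptions, adapt them to your setup):

# m h dom mon dow command
0 2 * * * cd /opt/datafari/bin && java -jar datafari-solr-atomic-update.jar SPACY >> /var/log/datafari/atomicUpdate-SPACY.log 2>&1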