...
atomic-updates-launcher.sh: to run Atomic Update (more details below)
atomicUpdate-cfg.json: to configure your job(s) (examples of values supplied)
atomicUpdate-example.json: the configuration file explained
atomicUpdate-log4j2.xml: the log configuration file (A user does not need to modify this file)
datafari-solr-atomic-update-….jar: the executable file
atomicUpdateLastExec: contains the job(s) status and last execution date and time (more details below).
To modify these files, you need to be the root user.
...
```
{
  "logConfigFile": "", // Specify the log configuration file location if it differs from the one provided in the "atomicupdates" directory, or if you want to move it to another location.
  "jobs": {
    "JOB_1": { // Put the name you want for this configuration
      "source": { // Defines the intermediary Solr (OCR or Spacy in our example above)
        // "baseUrl": the Solr Cloud base URL of the source collection used to update the target collection.
        // You can specify Solr or Zookeeper hosts, but prefer the Zookeeper host, as Datafari uses it to dispatch to all the Solr hosts you have.
        // (For information) The syntax for a Solr host is "http(s)://datafari_domain:8983/solr", e.g. "http://localhost:8983/solr"; you need to specify all your Solr hosts.
        // The syntax for Zookeeper is "datafari_domain:2181", e.g. "localhost:2181"; no http prefix because it is another protocol.
        // For either host type you can define several servers by separating the URLs with commas, e.g. "http://solr1:8983/solr, http://solr2:8983/solr, ...", but with Zookeeper there is only one server.
        "baseUrl": ...,
        "solrCollection": ... // The Solr source collection for JOB_1. Example: "Spacy".
      },
      "destination": { // Defines the final Solr (FileShare in our example above)
        "baseUrl": ..., // The Solr Cloud base URL of the target collection. The syntax is the same as in the "source" block.
        "solrCollection": ... // The Solr target collection for JOB_1. Example: "FileShare".
      },
      "fieldsOperation": { // The fields of the source collection and the Atomic Update operation to apply to each: set, add, remove, etc.
        // The "set" operation is the most appropriate value for most cases, as it replaces the target value with the source value.
        // See more about the available operations here: https://solr.apache.org/guide/solr/9_5/indexing-guide/partial-document-updates.html#atomic-updates
        "field_1": "set",
        "field_2": "add",
        "field_3": "add-distinct",
        "field_4": "set"
      },
      // "nbDocsPerBatch": documents are selected and updated in batches. This is the number of documents per batch fetched from the intermediary Solr
      // ("solrCollection" in the "source" parameter, for instance the Spacy collection) into the final Solr ("solrCollection" in the "destination"
      // parameter, the FileShare collection in our illustration above). Each batch is stored in RAM, so this number depends on the size of the data
      // retrieved (i.e. the fields and their content). You can try for instance 1000 for OCR sources and 2000 for Spacy sources, and tune these
      // values to optimise your atomic update performance.
      "nbDocsPerBatch": ...,
      "fieldsMapping": { // Optional: specifies a mapping between source and destination fields (if you don't need a mapping, empty the block like this: "fieldsMapping": {})
        "field_3": "dest_field_1",
        "field_2": "dest_field_2"
      }
    },
    "JOB_2": { // Put the name you want for this configuration
      ...
    }
  }
}
```
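Pieced together, a minimal configuration for the Spacy-to-FileShare job described above could look like the following. This is a sketch in strict JSON (so without the explanatory comments); the Zookeeper host, field names and batch size are illustrative values taken from the examples above, not defaults:

```json
{
  "logConfigFile": "",
  "jobs": {
    "JOB_1": {
      "source": {
        "baseUrl": "localhost:2181",
        "solrCollection": "Spacy"
      },
      "destination": {
        "baseUrl": "localhost:2181",
        "solrCollection": "FileShare"
      },
      "fieldsOperation": {
        "field_1": "set",
        "field_2": "add",
        "field_3": "add-distinct",
        "field_4": "set"
      },
      "nbDocsPerBatch": 2000,
      "fieldsMapping": {
        "field_3": "dest_field_1",
        "field_2": "dest_field_2"
      }
    }
  }
}
```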
...
[fromDate]
(Optional) You can force the date from which documents are selected (based on the last_modified Solr field). The documents in question are those of the intermediary Solr collection of the job you want to run. The expected date format is either "yyyy-MM-dd HH:mm" or "yyyy-MM-dd".
It may be convenient to force this date if you want to update documents from a date earlier than the last execution date of the job (which is the default behavior). Another good reason to force this date is to avoid a full crawl after a previous crash of the job. In this particular case, you will also need to change the job status to DONE in the “atomicUpdateLastExec“ file, otherwise the full crawl will still be executed.
Specify "full" (not case sensitive) to force a full crawl. A full crawl may be necessary if, for some reason, one or more of the jobs in your Atomic Update chain has run a full crawl, overwriting fields that Atomic Update should have updated.
...
In this example, the Atomic Update runs every day at 1:15 AM.
...
When the job is done
At the end of the job, its status and execution date are written to the file “atomicUpdateLastExec” (in [DATAFARI_HOME]/bin/atomicupdates). This file is created after the first execution of Atomic Update and is common to all configured jobs. Example, for the OCR and SPACY jobs:
```
#
#Tue Mar 13 15:51:41 CET 2024
OCR.STATUS=DONE
OCR.LAST_EXEC=2024-03-12T14\:00\:00.428Z
SPACY.STATUS=DONE
SPACY.LAST_EXEC=2024-03-13T15\:30\:00.523Z
```
The STATUS value is DONE, or FAILED if the job failed.
What happens when a job fails?
When a job fails, the default behavior is to run a full crawl the next time. As soon as the FAILED status is recorded in the “atomicUpdateLastExec“ file, a full crawl is forced. We made this choice with scheduled jobs in mind: a job might fail for a temporary reason, and by forcing a full crawl at the next run, we give a scheduled job a chance to return to a safe state by itself, with a guarantee of data integrity.
A temporary reason may be, for instance, a system operator restarting the Solr server hosting the final collection. If the Atomic Update job was running, it loses access to the Solr collection to update, and the job ends with a FAILED status.
If you don't think a full crawl is necessary, you can change this behavior by setting the job status to DONE in the “atomicUpdateLastExec“ file and running the job manually, specifying the document date from which you want to resume (the “fromDate” parameter explained above).
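The recovery step described above amounts to a one-line edit of the properties-style “atomicUpdateLastExec” file: switch the job's STATUS from FAILED back to DONE so the next run does not trigger a full crawl. A minimal sketch, assuming the file layout shown in the example above (the file path in the usage comment is a placeholder for your install):

```python
import re

def mark_job_done(lastexec_text: str, job_name: str) -> str:
    """Rewrite <JOB>.STATUS=FAILED to DONE in the atomicUpdateLastExec content.

    Only the STATUS line of the given job is touched; the LAST_EXEC lines
    and the other jobs are left as-is.
    """
    pattern = re.compile(rf"^({re.escape(job_name)}\.STATUS)=FAILED$", re.MULTILINE)
    return pattern.sub(r"\1=DONE", lastexec_text)

# Usage sketch (adapt the path to your installation):
# path = "[DATAFARI_HOME]/bin/atomicupdates/atomicUpdateLastExec"
# with open(path) as f:
#     content = f.read()
# with open(path, "w") as f:
#     f.write(mark_job_done(content, "OCR"))
```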
3.3. Logs
The log file is [DATAFARI_HOME]/logs/atomic-update.log.
On each line of the file, the job name is specified, as this log file is common to all Atomic Update jobs.
It contains at least one line specifying that the job has started.
If everything goes well, lines report the number of documents processed for each successful batch of documents updated.
If no error occurs during the run, the final line reports the job status as "DONE". Otherwise, the final line reports a "FAILED" status, and the total number of documents processed appears on the previous line.
```
INFO 2024-03-12T15:17:48Z (main) - Datafari-jar|Atomic Update|francelabs.datafari.SolrAtomicUpdateLauncher|OCR Job started !
INFO 2024-03-12T15:17:48Z (main) - Datafari-jar|Atomic Update|datafari.save.JobSaver|The Last execution file doesn't exists yet. It will be created by the first Atomic Update Job execution.
INFO 2024-03-12T15:17:48Z (main) - Datafari-jar|Atomic Update|francelabs.datafari.SolrAtomicUpdateLauncher|OCR Job: Select documents modified from: null (null indicates a full crawl))
INFO 2024-03-12T15:17:48Z (main) - Datafari-jar|Atomic Update|datafari.save.JobSaver|Job OCR RUNNING
INFO 2024-03-12T15:17:49Z (main) - Datafari-jar|Atomic Update|solrj.impl.ZkClientClusterStateProvider|Cluster at localhost:2181 ready
INFO 2024-03-12T15:17:49Z (main) - Datafari-jar|Atomic Update|solrj.impl.ZkClientClusterStateProvider|Cluster at localhost:2181 ready
INFO 2024-03-12T15:17:50Z (main) - Datafari-jar|Atomic Update|francelabs.datafari.SolrAtomicUpdateLauncher|OCR Docs processed : 1000
INFO 2024-03-12T15:17:53Z (main) - Datafari-jar|Atomic Update|francelabs.datafari.SolrAtomicUpdateLauncher|OCR Docs processed : 2000
INFO 2024-03-12T15:17:56Z (main) - Datafari-jar|Atomic Update|francelabs.datafari.SolrAtomicUpdateLauncher|OCR Docs processed : 3000
INFO 2024-03-12T15:17:59Z (main) - Datafari-jar|Atomic Update|francelabs.datafari.SolrAtomicUpdateLauncher|OCR Docs processed : 3900
INFO 2024-03-12T15:17:59Z (main) - Datafari-jar|Atomic Update|datafari.save.JobSaver|Job OCR DONE
```
Example of a first run with a failed job:
```
INFO 2024-03-12T14:58:24Z (main) - Datafari-jar|Atomic Update|francelabs.datafari.SolrAtomicUpdateLauncher|OCR Job started !
INFO 2024-03-12T14:58:24Z (main) - Datafari-jar|Atomic Update|datafari.save.JobSaver|The Last execution file doesn't exists yet. It will be created by the first Atomic Update Job execution.
INFO 2024-03-12T14:58:24Z (main) - Datafari-jar|Atomic Update|francelabs.datafari.SolrAtomicUpdateLauncher|OCR Job: Select documents modified from: null (null indicates a full crawl))
INFO 2024-03-12T14:58:24Z (main) - Datafari-jar|Atomic Update|datafari.save.JobSaver|Job OCR RUNNING
INFO 2024-03-12T14:58:24Z (main) - Datafari-jar|Atomic Update|solrj.impl.ZkClientClusterStateProvider|Cluster at localhost:2181 ready
ERROR 2024-03-12T14:58:25Z (main) - Datafari-jar|Atomic Update|francelabs.datafari.SolrAtomicUpdateLauncher|OCR Job: Total number of documents processed: 0
java.lang.NullPointerException: null
	at com.francelabs.datafari.solraccessors.DocumentsUpdator.createSolrDocToUpdate(DocumentsUpdator.java:114) ~[datafari-solr-atomic-update-6.1-dev-Community.jar:?]
	at com.francelabs.datafari.solraccessors.DocumentsUpdator.updateDocuments(DocumentsUpdator.java:62) ~[datafari-solr-atomic-update-6.1-dev-Community.jar:?]
	at com.francelabs.datafari.SolrAtomicUpdateLauncher.main(SolrAtomicUpdateLauncher.java:133) [datafari-solr-atomic-update-6.1-dev-Community.jar:?]
INFO 2024-03-12T14:58:25Z (main) - Datafari-jar|Atomic Update|datafari.save.JobSaver|Job OCR FAILED
```