Info

Valid from Datafari 6.0

1. Reminder: what is Atomic Update?

...

Given that in Datafari all the fields we wish to update are indexed and stored, we opted for the Atomic Update approach.

We also use the Optimistic Concurrency approach in combination, so a document must already exist to be updated. Without this restriction, we could create new, incomplete documents containing only the partial fields. In other words, we imposed on ourselves not to have incomplete documents: an atomic update should only update Solr documents that have been created by the “standard” crawl. Otherwise, if for instance a full crawl creates a Solr document, and an incremental standard crawl then deletes this document before the atomic update comes in (while the document is still present in its intermediary Solr), the atomic update would create a new Solr document with incomplete data. Optimistic concurrency in Solr gives us the ability to only update documents that have already been created.
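For illustration, here is what this looks like with Solr’s documented optimistic concurrency syntax for partial updates (a sketch with an invented document id and field value): passing a “_version_” of 1 tells Solr that the document must already exist, otherwise the update is rejected.

Code Block
languagejson
{
  "id": "file://fileshare/report.pdf",   // invented id, for illustration
  "previewContent": { "set": "text extracted by OCR..." },
  "_version_": 1   // 1 means: the document must already exist
}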

2. Processing principle

Without Atomic Update, if we want to enrich a Solr collection with OCR data using the current Annotator functionality, we create a dedicated OCR job targeting the same collection. This job will overwrite the collection with its data. The OCR job must be up to date to guarantee fresh document content. If we want to add Spacy data to the same collection (a Spacy server allows us to extract natural language entities using Hugging Face AI models), we have to add it to the same job as the OCR one. Indeed, if we were to create a dedicated Spacy job, it would completely overwrite the collection with its data; and since it has no OCR-related data, the OCR-related field(s) would be emptied (and vice versa when the OCR job passes by). One disadvantage is that this job, being sequential, will only be as fast as its slowest computation step (either OCR or Spacy in our case). We cannot parallelize the computations.

...

Using Datafari’s Atomic Update Service solves this problem. We can update documents with job-specific fields only. The operating principle is somewhat similar to processing without Atomic Update, but compared to our current Annotator functionality, we can now do it in several separate jobs. In the scenario mentioned earlier, we would have one fast job to index the collection to be enriched, one job for the OCR and one for Spacy. Compared to the previous mechanism, the OCR and Spacy jobs must index the documents in separate collections (see the illustration below). Then, the Atomic Update Service can be called. It will retrieve the Spacy and OCR specific fields to update the global collection.

[Gliffy diagram: Atomic Update process principle]

...

3. Run Atomic Update Service

The Atomic Update Service is available as an executable file with a configuration file used to set up the update jobs. You will need to launch one Atomic Update job per source collection used to update the destination collection. Using the same example as above, you will launch one Atomic Update job for the Spacy collection and another for the OCR collection.
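For instance, with the OCR and Spacy collections of our example, that means two separate launches (the launcher script is detailed in section 3.2):

Code Block
bash atomic-updates-launcher.sh OCR
bash atomic-updates-launcher.sh SPACY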


Find this component in Datafari in [DATAFARI_HOME]/bin/atomicupdates, containing these files:

  • atomic-updates-launcher.sh: to run Atomic Update (more details below)

  • atomicUpdate-cfg.json: to configure your job(s) (examples of values supplied)

  • atomicUpdate-example.json: the configuration file explained

  • atomicUpdate-log4j2.xml: the log configuration file (a user does not need to modify this file)

  • datafari-solr-atomic-update-.jar: the executable file

  • atomicUpdateLastExec: contains the job(s) status and last execution date and time (more details in the “When the job is done” section below).

To modify these files, you need to be the datafari user or belong to the root group.

3.1. Configuration

For each Atomic Update job, you must configure the following:

  • the source collection used to update the target collection, that is, the location and name of the Spacy collection, for example

  • the target collection

  • the source collection fields used to update the destination collection fields

  • (Optional) the mapping between source fields and destination fields, when they are not the same.

The configuration file is “atomicUpdate-cfg.json” and here is how to set it up:

Code Block
languagejson
{
  "logConfigFile": "", // Specify the log configuration file location if it is different from the provided one in the "atomicupdates" directory or if you want to move it to another location.
  "jobs": {
    "JOB_1": {  // Put the name you want for this configuration
      "source": { // Define the intermediary Solr, OCR or Spacy in our example above
        "baseUrl": Solr Cloud base URL for the source Collection used to update the target Collection.
                   You can specify a Solr or a Zookeeper host, but prefer the Zookeeper host, as Datafari uses it to dispatch to all the Solr hosts you have.
                   (For information) The syntax for a Solr host is: "http://datafari_domain:8983/solr", ex: "http://localhost:8983/solr" ; you need to specify all Solr hosts. Do not specify the https URL, because that is the proxy-based URL, not the direct Solr URL.
                   The syntax for Zookeeper is: "datafari_domain:2181", ex: "localhost:2181" ; no http prefix because it is another protocol.
                   Whatever the host type, you can define several servers by separating URLs with a comma, but using Zookeeper there is only one server. Example with Solr hosts: "http://solr1:8983/solr, http://solr2:8983/solr,...".
        "solrCollection": the Solr source Collection for JOB_1. Example "Spacy".
      },
      "destination": { // Define the final Solr, FileShare in our example above.
        "baseUrl": Solr Cloud base URL for the target Collection. The syntax is the same as in the "source" block.
        "solrCollection": the Solr target Collection for JOB_1. Example "FileShare".
      },
      "fieldsOperation": { // the fields of the source collection and the Atomic Update operation to apply: set, add, remove, etc.
                           // the "set" operation will be the most appropriate value for most cases, as it replaces the target value with the source value.
                           // see more about the available operations here: https://solr.apache.org/guide/solr/9_5/indexing-guide/partial-document-updates.html#atomic-updates
        "field_1": "set",
        "field_2": "add",
        "field_3": "add-distinct",
        "field_4": "set"
      },
      "nbDocsPerBatch": The documents are selected and updated per batch.
                        This represents the number of documents per batch fetched from the intermediary Solr ("solrCollection" in the "source" parameter, for instance the Spacy Solr collection)
                        up to the final Solr ("solrCollection" in the "destination" parameter, the FileShare collection in our illustration above).
                        Each batch is stored in RAM, so this number depends on the size of the data retrieved (i.e. the fields and their content).
                        You can try for instance 1000 for OCR sources and 2000 for Spacy sources, and tune these values to optimise your atomic update performance.
      "fieldsMapping": { // Optional: to specify a mapping between source and destination collection fields (leave the block empty if you don't need mapping, like this: "fieldsMapping": {})
        "field_3": "dest_field_1",
        "field_2": "dest_field_2"
      }
    },
    "JOB_2": {  // Put the name you want for this configuration
      ...
    }
  }
}


The atomicUpdate-example.json file pretty much repeats what has been said here.

Example:

Code Block
languagejson
{
  "logConfigFile": "/opt/datafari/tomcat/conf/atomicUpdate-log4j2.xml",
  "jobs": {
    "OCR": {
      "source": {
        "baseUrl": "dev.datafari.com:2181",
        "solrCollection": "OCR"
      },
      "destination": {
        "baseUrl": "dev.datafari.com:2181",
        "solrCollection": "FileShare"
      },
      "fieldsOperation": {
        "previewContent": "set",
      },
      "nbDocsPerBatch": 1000,
    },
    "SPACY": {
      "source": {
        "baseUrl": "dev.datafari.com:2181",
        "solrCollection": "Spacy"
      },
      "destination": {
        "baseUrl": "dev.datafari.com:2181",
        "solrCollection": "FileShare"
      },
      "nbDocsPerBatch": 2000,
      "fieldsOperation": {
        "entity_product": "set",
        "entity_loc": "set",
        "last_author": "add-distinct"
      },
      "fieldsMapping": {
        "last_author": "author"
      }
    }
  }
}
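For illustration, with the SPACY job above, each update sent to the FileShare collection would be a partial document shaped roughly like this (a sketch with invented values; note that fieldsMapping renames last_author to author, that the operations come from fieldsOperation, and that the “_version_” of 1 reflects the optimistic concurrency described in section 1):

Code Block
languagejson
{
  "id": "file://fileshare/report.docx",   // invented id and values, for illustration
  "entity_product": { "set": ["Datafari"] },
  "entity_loc": { "set": ["Paris"] },
  "author": { "add-distinct": "jdoe" },
  "_version_": 1
}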

If there is a syntax error in the configuration file, the job will not start and no log is generated. Otherwise, at least one log line is written to indicate that the job has started.

If there are configuration data errors, the job fails and you can check errors in the log file (see more in https://datafari.atlassian.net/wiki/spaces/DATAFARI/pages/2939387906/Atomic+Update+Management#4.3.-Logs).

3.2. Run

You can manually launch a job with the following command (no specific permissions needed): bash atomic-updates-launcher.sh <job_name> [fromDate]

<job_name>: the name (case sensitive) of a single job, as defined in the configuration file “atomicUpdate-cfg.json”:

Code Block
languagejson
"jobs": {
    "OCR": {
    ...
    },
    "SPACY": {
    ...
    }
  }

[fromDate] (optional): forces the date from which documents are selected (based on the last_modified Solr field). The documents in question are those of the intermediary Solr collection of the job you want to run. The expected date format is either "yyyy-MM-dd HH:mm" or "yyyy-MM-dd".

It may be convenient to force this date if you want to update documents from a date earlier than the last execution date of the job (which is the default behavior). Another good reason to force this date is to avoid a full crawl after a previous crash of the job. In this particular case, you will need to change the job status to DONE in the “atomicUpdateLastExec“ file, otherwise the full crawl will still be executed. For more details, see https://datafari.atlassian.net/wiki/spaces/DATAFARI/pages/2939387906/Atomic+Update+Management#When-the-job-is-done and https://datafari.atlassian.net/wiki/spaces/DATAFARI/pages/2939387906/Atomic+Update+Management#What-happens-when-a-job-fails%3F.

Specify "full" (not case sensitive) to force full crawl. A Full crawl may be necessary if for some reason, one or more of your jobs in the Atomic Update chain has run a full crawl causing the overwriting of fields to be updated by Atomic Update.

When the job is launched without “fromDate”, a full crawl is done if it is the first run. Subsequent runs select documents modified since the last execution time of the job (its start time).

Example:

Code Block
bash atomic-updates-launcher.sh SPACY
or
bash atomic-updates-launcher.sh SPACY full
or
bash atomic-updates-launcher.sh SPACY "2023-02-05"

It is possible to run several Atomic Update jobs at the same time, for example the OCR and Spacy jobs, given that, a priori, these jobs will not update the same Solr fields.
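For example, to launch both jobs concurrently from a shell (a simple sketch; each job writes its own status to the shared “atomicUpdateLastExec” file):

Code Block
# run both jobs in the background, then wait for both to finish
bash atomic-updates-launcher.sh OCR &
bash atomic-updates-launcher.sh SPACY &
wait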

To schedule the service, use the cron tool. Enter the command:

Code Block
crontab -e

Then add the following line (replace [DATAFARI_HOME] with your Datafari location):

Code Block
15  1  *  *  * [DATAFARI_HOME]/bin/atomicupdates/atomic-updates-launcher.sh <job_name>
│   │  │  │  │
│   │  │  │  └─────────  Day of Week (0 – 6) (0 is Sunday, or use names)
│   │  │  └────────────  Month (1 – 12), * means every month
│   │  └───────────────  Day of Month (1 – 31), * means every day
│   └──────────────────  Hour (0 – 23), * means every hour
└──────────────────────  Minute (0 – 59), * means every minute

In this example, the Atomic Update runs every day at 1:15 AM.

Here is another example for a weekly job, each Monday at 4 AM:

Code Block
0  4  *  *  1 /opt/datafari/bin/atomicupdates/atomic-updates-launcher.sh OCR

When the job is done

At the end of the job, its status and execution date are written to the file “atomicUpdateLastExec” (in [DATAFARI_HOME]/bin/atomicupdates). This file is created after the first execution of Atomic Update and is shared by all configured jobs. Example for the OCR and SPACY jobs:

Code Block
#
#Tue Mar 13 15:51:41 CET 2024
OCR.STATUS=DONE (or FAILED if the job failed)
OCR.LAST_EXEC=2024-03-12T14\:00\:00.428Z
SPACY.STATUS=DONE
SPACY.LAST_EXEC=2024-03-13T15\:30\:00.523Z

What happens when a job fails?

When a job fails, the default behavior is to run a full crawl at the next execution. As soon as the FAILED status appears in the “atomicUpdateLastExec“ file, a full crawl is forced. We made this choice with scheduled jobs in mind, which might fail for a temporary reason. By forcing a full crawl at the next run, we give a scheduled job a chance to return to a safe state by itself, with a guarantee of data integrity.

A temporary failure may occur, for instance, when a system operator restarts the Solr server of the final collection. If an Atomic Update job was running, it loses access to the Solr collection to update, and the job ends with a FAILED status.

If you don't think a full crawl is necessary, you can change this behavior by setting the job status to DONE in the “atomicUpdateLastExec“ file and running the job manually, specifying the document date from which to resume (the “fromDate” parameter explained above).
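As a sketch of this recovery procedure (assuming the default /opt/datafari install path, a failed OCR job, and a user allowed to modify these files, see section 3):

Code Block
# mark the failed job as DONE so the next run does not force a full crawl
sed -i 's/^OCR.STATUS=FAILED/OCR.STATUS=DONE/' /opt/datafari/bin/atomicupdates/atomicUpdateLastExec
# then resume manually from the chosen document date
bash atomic-updates-launcher.sh OCR "2024-03-12 14:00"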

3.3. Logs

The log file is [DATAFARI_HOME]/logs/atomic-update.log.

On each line of the file, the job name is specified, as this log file is common to all Atomic Update jobs.

It contains at least one line specifying that the job has started.

If everything goes well, lines report the number of documents processed after each successful batch of updates.

If no error occurs during the run, the final line reports the job status as "DONE". Otherwise, the final line reports the “FAILED” status, and the total number of documents processed appears on the previous line.

Code Block
 INFO 2024-03-12T15:17:48Z (main) - Datafari-jar|Atomic Update|francelabs.datafari.SolrAtomicUpdateLauncher|OCR Job started !
 INFO 2024-03-12T15:17:48Z (main) - Datafari-jar|Atomic Update|datafari.save.JobSaver|The Last execution file doesn't exists yet. It will be created by the first Atomic Update Job execution.
 INFO 2024-03-12T15:17:48Z (main) - Datafari-jar|Atomic Update|francelabs.datafari.SolrAtomicUpdateLauncher|OCR Job: Select documents modified from: null (null indicates a full crawl))
 INFO 2024-03-12T15:17:48Z (main) - Datafari-jar|Atomic Update|datafari.save.JobSaver|Job OCR RUNNING
 INFO 2024-03-12T15:17:49Z (main) - Datafari-jar|Atomic Update|solrj.impl.ZkClientClusterStateProvider|Cluster at localhost:2181 ready
 INFO 2024-03-12T15:17:49Z (main) - Datafari-jar|Atomic Update|solrj.impl.ZkClientClusterStateProvider|Cluster at localhost:2181 ready
 INFO 2024-03-12T15:17:50Z (main) - Datafari-jar|Atomic Update|francelabs.datafari.SolrAtomicUpdateLauncher|OCR Docs processed : 1000
 INFO 2024-03-12T15:17:53Z (main) - Datafari-jar|Atomic Update|francelabs.datafari.SolrAtomicUpdateLauncher|OCR Docs processed : 2000
 INFO 2024-03-12T15:17:56Z (main) - Datafari-jar|Atomic Update|francelabs.datafari.SolrAtomicUpdateLauncher|OCR Docs processed : 3000
 INFO 2024-03-12T15:17:59Z (main) - Datafari-jar|Atomic Update|francelabs.datafari.SolrAtomicUpdateLauncher|OCR Docs processed : 3900
 INFO 2024-03-12T15:17:59Z (main) - Datafari-jar|Atomic Update|datafari.save.JobSaver|Job OCR DONE

Example of a first run where the job failed:

Code Block
 INFO 2024-03-12T14:58:24Z (main) - Datafari-jar|Atomic Update|francelabs.datafari.SolrAtomicUpdateLauncher|OCR Job started !
 INFO 2024-03-12T14:58:24Z (main) - Datafari-jar|Atomic Update|datafari.save.JobSaver|The Last execution file doesn't exists yet. It will be created by the first Atomic Update Job execution.
 INFO 2024-03-12T14:58:24Z (main) - Datafari-jar|Atomic Update|francelabs.datafari.SolrAtomicUpdateLauncher|OCR Job: Select documents modified from: null (null indicates a full crawl))
 INFO 2024-03-12T14:58:24Z (main) - Datafari-jar|Atomic Update|datafari.save.JobSaver|Job OCR RUNNING
 INFO 2024-03-12T14:58:24Z (main) - Datafari-jar|Atomic Update|solrj.impl.ZkClientClusterStateProvider|Cluster at localhost:2181 ready
ERROR 2024-03-12T14:58:25Z (main) - Datafari-jar|Atomic Update|francelabs.datafari.SolrAtomicUpdateLauncher|OCR Job: Total number of documents processed: 0
java.lang.NullPointerException: null
	at com.francelabs.datafari.solraccessors.DocumentsUpdator.createSolrDocToUpdate(DocumentsUpdator.java:114) ~[datafari-solr-atomic-update-6.1-dev-Community.jar:?]
	at com.francelabs.datafari.solraccessors.DocumentsUpdator.updateDocuments(DocumentsUpdator.java:62) ~[datafari-solr-atomic-update-6.1-dev-Community.jar:?]
	at com.francelabs.datafari.SolrAtomicUpdateLauncher.main(SolrAtomicUpdateLauncher.java:133) [datafari-solr-atomic-update-6.1-dev-Community.jar:?]
 INFO 2024-03-12T14:58:25Z (main) - Datafari-jar|Atomic Update|datafari.save.JobSaver|Job OCR FAILED

Here are the first lines of the log at the next run after the job failed:

Code Block
 INFO 2024-03-12T17:06:59Z (main) - Datafari-jar|Atomic Update|francelabs.datafari.SolrAtomicUpdateLauncher|OCR Job started !
 INFO 2024-03-12T17:06:59Z (main) - Datafari-jar|Atomic Update|francelabs.datafari.SolrAtomicUpdateLauncher|OCR Job: Select documents modified from: 2024-03-12T14:28:32.428Z (null indicates a full crawl))
 INFO 2024-03-12T17:06:59Z (main) - Datafari-jar|Atomic Update|francelabs.datafari.SolrAtomicUpdateLauncher|OCR Job: Last state was FAILED, so a full crawl is done for this run.
 INFO 2024-03-12T17:06:59Z (main) - Datafari-jar|Atomic Update|datafari.save.JobSaver|Job OCR RUNNING

 and so on...