...

Diagram: Atomic Update process principle

...

3. Run Atomic Update Service

The Atomic Update Service is available as an executable file with a configuration file used to set up the update jobs. You will need to launch one Atomic Update job per source collection used to update the destination collection. Using the same example as above, you will launch one Atomic Update job for the Spacy collection and another for the OCR collection, as shown below.
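For instance, with the two intermediary collections above, you end up running the launcher once per source collection (a minimal sketch; the job names OCR and SPACY are the ones defined in the configuration file described below):

Code Block
bash atomic-updates-launcher.sh OCR
bash atomic-updates-launcher.sh SPACY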

...

atomic-updates-launcher.sh: to run Atomic Update (more details below)
atomicUpdate-cfg.json: to configure your job(s) (examples of values supplied)
atomicUpdate-example.json: the configuration file explained
atomicUpdate-log4j2.xml: the log configuration file (you do not need to modify this file)
datafari-solr-atomic-update-.jar: the executable file


...

To modify these files you need to be datafari user or belong to root group.
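For instance, to edit the job configuration as the datafari user (a sketch assuming sudo rights and that the files sit in the atomicupdates directory mentioned above):

Code Block
sudo -u datafari nano [DATAFARI_HOME]/bin/atomicupdates/atomicUpdate-cfg.json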

3.1. Configuration

For each Atomic Update job, you must configure the following:

...

Code Block
languagejson
{
  "logConfigFile": "", // Specify the log configuration file location if it is different from the one provided in the "atomicupdates"
                          directory, or if you want to move it to another location.
  "jobs": {
    "JOB_1": {  // Put the name you want for this configuration
      "source": { // Define the intermediary Solr; OCR or Spacy in our example above
        "baseUrl":  Solr Cloud base URL for the source collection used to update the target collection.
                    You can specify Solr or Zookeeper hosts, but prefer the Zookeeper host, as Datafari uses it to dispatch to all the Solr hosts you have.
                      (For information) The syntax for a Solr host is: "http://datafari_domain:8983/solr", e.g. "http(s)://localhost:8983/solr" ; you need to specify all Solr hosts.
                      The syntax for Zookeeper is: "datafari_domain:2181", e.g. "localhost:2181" ; no http prefix because it is another protocol.
                    Whatever the host type, you can define several servers by separating URLs with a comma, but with Zookeeper there is only one server. Example with Solr hosts: "http://solr1:8983/solr, http://solr2:8983/solr,...".
        "solrCollection": the Solr source collection for JOB_1. Example: "Spacy".
      },
      "destination": { // Define the final Solr; FileShare in our example above.
        "baseUrl": Solr Cloud base URL for the target collection. The syntax is the same as in the "source" block.
        "solrCollection": the Solr target collection for JOB_1. Example: "FileShare".
      },
      "fieldsOperation": { // the fields of the source collection and the Atomic Update operation to apply: set, add, remove, etc.
                           // the "set" operation is the most appropriate value for most cases, as it replaces the target value with the source value.
                           // see more about the available operations here: https://solr.apache.org/guide/solr/9_5/indexing-guide/partial-document-updates.html#atomic-updates
        "field_1": "set",
        "field_2": "add",
        "field_3": "add-distinct",
        "field_4": "set"
      },
      "nbDocsPerBatch": The documents are selected and updated per batch. This is the number of documents per batch fetched from the intermediary Solr
                        ("solrCollection" in the "source" parameter, for instance the Spacy collection) and pushed to the final Solr
                        ("solrCollection" in the "destination" parameter, the FileShare collection in our illustration above).
                        Each batch is stored in RAM, so this number depends on the size of the data retrieved (i.e. the fields and their content).
                        You can try, for instance, the values 1000 for OCR sources and 2000 for Spacy sources, and play with them to optimise your Atomic Update performance.
      "fieldsMapping": { // Optional: to specify a mapping between source and destination collection fields
        "field_3": "dest_field_1",
        "field_2": "dest_field_2"
      }
    },
    "JOB_2": {  // Put the name you want for this configuration
      ...
    }
  }
}

The atomicUpdate-example.json file pretty much repeats what has been said here.

Example:

Code Block
languagejson
{
  "logConfigFile": "/opt/datafari/tomcat/conf/atomicUpdate-log4j2.xml",
  "jobs": {
    "OCR": {
      "source": {
        "baseUrl": "dev.datafari.com:2181",
        "solrCollection": "OCR"
      },
      "destination": {
        "baseUrl": "dev.datafari.com:2181",
        "solrCollection": "FileShare"
      },
      "fieldsOperation": {
        "previewContent": "set"
      },
      "nbDocsPerBatch": 1000
    },
    "SPACY": {
      "source": {
        "baseUrl": "dev.datafari.com:2181",
        "solrCollection": "Spacy"
      },
      "destination": {
        "baseUrl": "dev.datafari.com:2181",
        "solrCollection": "FileShare"
      },
      "nbDocsPerBatch": 2000,
      "fieldsOperation": {
        "entity_product": "set",
        "entity_loc": "set",
        "last_author": "add-distinct"
      },
      "fieldsMapping": {
        "last_author": "author"
      }
    }
  }
}
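With the "fieldsOperation" and "fieldsMapping" above, the service ends up sending standard Solr atomic update documents to the destination collection. For reference, a sketch of what such an update looks like in Solr's JSON update format (the document id and field values are hypothetical):

Code Block
languagejson
[
  {
    "id": "file://share/report.pdf",
    "previewContent": {"set": "text extracted by OCR"},
    "author": {"add-distinct": "John Doe"}
  }
]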


If there is a syntax error in the configuration file, the job will not start and no log is generated. Otherwise, at least one log line is written to indicate that the job has started.

If there are configuration data errors, the job fails and you can check errors in the log file (see more in https://datafari.atlassian.net/wiki/spaces/DATAFARI/pages/2939387906/Atomic+Update+Management#4.3.-Logs).
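Since a syntax error prevents the job from starting without leaving any log, it can be convenient to check the configuration file before launching a job. A minimal sketch, assuming the jq tool is installed:

Code Block
jq empty atomicUpdate-cfg.json && echo "JSON syntax OK"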

3.2. Run

You can manually launch a job with the following command (no specific permission is needed): bash atomic-updates-launcher.sh <job_name> [fromDate]

<job_name> (required): the name (case sensitive) of a single job as defined in the configuration file “atomicUpdate-cfg.json”:

Code Block
languagejson
"jobs": {
    "OCR": {
    ...
    },
    "SPACY": {
    ...
    }
  }

[fromDate] (optional): forces the date from which documents are selected (based on the last_modified Solr field). The documents in question are those of the intermediary Solr of the job you want to run. The expected date format is either "yyyy-MM-dd HH:mm" or "yyyy-MM-dd". It may be convenient to force this date if you want to update documents from a date earlier than the last execution date of the job (which is the default behavior). Specify "full" (not case sensitive) to force a full crawl. A full crawl may be necessary if, for some reason, one or more of the jobs in your Atomic Update chain has run a full crawl, overwriting fields that should have been updated by Atomic Update.

When the job is launched without “fromDate”, a full crawl is done on the first run. Subsequent runs select documents from the last execution (start) time of the job.

Example:

Code Block
bash atomic-updates-launcher.sh SPACY
or
bash atomic-updates-launcher.sh SPACY full
or
bash atomic-updates-launcher.sh SPACY "2023/02/05"

It is possible to run several Atomic Update jobs at the same time, for example the OCR and Spacy jobs, given that a priori, these jobs will not update the same Solr fields.
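For example, a minimal shell sketch launching the two example jobs concurrently (the job names come from the example configuration above):

Code Block
bash atomic-updates-launcher.sh OCR &
bash atomic-updates-launcher.sh SPACY &
wait  # wait for both jobs to finish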

To create a scheduled service, use the cron tool. Enter the command:

Code Block
crontab -e

Then add the following line (replace [DATAFARI_HOME] with your Datafari installation location):

Code Block
15  1  *  *  * [DATAFARI_HOME]/bin/atomicupdates/atomic-updates-launcher.sh <job_name>
│   │  │  │  │
│   │  │  │  └───  Day of week (0 – 6) (0 is Sunday, or use names)
│   │  │  └──────  Month (1 – 12), * means every month
│   │  └─────────  Day of month (1 – 31), * means every day
│   └────────────  Hour (0 – 23), * means every hour
└────────────────  Minute (0 – 59), * means every minute

In this example, the Atomic Update runs every day at 1:15 AM.
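With the two example jobs, the crontab could contain, for instance (the schedules and the [DATAFARI_HOME] value are illustrative):

Code Block
15  1  *  *  * /opt/datafari/bin/atomicupdates/atomic-updates-launcher.sh OCR
45  1  *  *  * /opt/datafari/bin/atomicupdates/atomic-updates-launcher.sh SPACY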

3.3. Logs

The log file is [DATAFARI_HOME]/logs/atomic-update.log.
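To follow a job while it runs, you can simply tail this file:

Code Block
tail -f [DATAFARI_HOME]/logs/atomic-update.log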

...