Valid from Datafari 6.0

1. Reminder: what is Atomic Update?

...

{
  // Location of the log configuration file. Set it only if the file differs from the one provided
  // in the "atomicupdates" directory, or if you moved it to another location.
  "logConfigFile": "",
  "jobs": {
    "JOB_1": {  // Put the name you want for this configuration
      "source": { // Define the Intermediary Solr, OCR or Spacy in our example above
        // Solr Cloud base URL of the source collection used to update the target collection.
        // You can specify either Solr or Zookeeper hosts, but prefer Zookeeper, as Datafari uses it to dispatch to all your Solr hosts.
        //   Solr host syntax (for information): "http://datafari_domain:8983/solr", ex: "http(s)://localhost:8983/solr"; you need to specify all Solr hosts. Do not use the https URL here, as that one is the proxy-based URL.
        //   Zookeeper syntax: "datafari_domain:2181", ex: "localhost:2181"; no http prefix, because it is another protocol.
        // With either host type, you can list several servers separated by commas, although with Zookeeper there is only one server.
        // Example with Solr hosts: "http://solr1:8983/solr, http://solr2:8983/solr,...".
        "baseUrl": "localhost:2181",
        // The source Solr collection for JOB_1. Example: "Spacy".
        "solrCollection": "Spacy"
      },
      "destination": { // Define the final Solr, FileShare in our example above.
        // Solr Cloud base URL of the target collection. The syntax is the same as in the "source" block.
        "baseUrl": "localhost:2181",
        // The target Solr collection for JOB_1. Example: "FileShare".
        "solrCollection": "FileShare"
      },
      "fieldsOperation": { // Fields of the source collection and the Atomic Update operation to apply to each one: set, add, remove, etc.
                           // "set" is the most appropriate operation in most cases, as it replaces the target value with the source value.
                           // See the available operations here: https://solr.apache.org/guide/solr/9_5/indexing-guide/partial-document-updates.html#atomic-updates
        "field_1": "set",
        "field_2": "add",
        "field_3": "add-distinct",
        "field_4": "set"
      },
      // Documents are selected and updated in batches. This is the number of documents per batch
      // fetched from the intermediary Solr ("solrCollection" in the "source" parameter, for instance the Spacy Solr collection)
      // and sent to the final Solr ("solrCollection" in the "destination" parameter, the FileShare collection in our illustration above).
      // Each batch is held in RAM, so the right value depends on the size of the data retrieved (i.e. the fields and their content).
      // As a starting point, try 1000 for OCR sources and 2000 for Spacy sources, then adjust to optimise atomic update performance.
      "nbDocsPerBatch": 2000,
      "fieldsMapping": { // Optional: maps source fields to destination fields. If you don't need a mapping, leave the block empty: "fieldsMapping": {}
        "field_3": "dest_field_1",
        "field_2": "dest_field_2"
      }
    },
    "JOB_2": {  // Put the name you want for this configuration
      ...
    }
  }
}
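
To make the roles of "fieldsOperation", "fieldsMapping" and "nbDocsPerBatch" more concrete, here is a minimal Python sketch of how one job configuration could translate into Solr atomic update documents. It is an illustration only, not Datafari's actual implementation: the function names build_atomic_update and in_batches are hypothetical, and the sketch assumes the unique key field is named "id". Only the {"field": {"operation": value}} shape comes from the Solr atomic update syntax.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List


def build_atomic_update(
    source_doc: Dict[str, Any],
    fields_operation: Dict[str, str],
    fields_mapping: Dict[str, str],
    id_field: str = "id",  # assumption: the unique key field is "id"
) -> Dict[str, Any]:
    """Turn one source document into a Solr atomic update document."""
    update: Dict[str, Any] = {id_field: source_doc[id_field]}
    for field, operation in fields_operation.items():
        if field not in source_doc:
            continue  # nothing to propagate for this field
        # Use the mapped destination name if there is one, else the same name.
        target_field = fields_mapping.get(field, field)
        update[target_field] = {operation: source_doc[field]}
    return update


def in_batches(docs: Iterable[Any], nb_docs_per_batch: int) -> Iterator[List[Any]]:
    """Yield lists of at most nb_docs_per_batch items, as with "nbDocsPerBatch"."""
    iterator = iter(docs)
    while True:
        batch = list(islice(iterator, nb_docs_per_batch))
        if not batch:
            return
        yield batch


# Values taken from the JOB_1 template above; the source document is made up.
fields_operation = {
    "field_1": "set",
    "field_2": "add",
    "field_3": "add-distinct",
    "field_4": "set",
}
fields_mapping = {"field_3": "dest_field_1", "field_2": "dest_field_2"}
source_doc = {"id": "doc-1", "field_1": "title", "field_3": ["Paris", "Lyon"]}

example_update = build_atomic_update(source_doc, fields_operation, fields_mapping)
# example_update is:
# {"id": "doc-1",
#  "field_1": {"set": "title"},
#  "dest_field_1": {"add-distinct": ["Paris", "Lyon"]}}
```

In this sketch, "field_3" is written to "dest_field_1" because of the mapping, while "field_1" keeps its name. Fields that are absent from the source document are skipped, which is a choice of the sketch rather than a documented Datafari behaviour.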

...