...

This can be particularly useful in Datafari when we need to add a type of metadata to documents that takes a long time to extract. First we index documents by extracting only the simple metadata, so that these documents are quickly available for search. Then we can extract the more complex metadata to complete the documents' information and improve results accuracy.

...

Without Atomic Update, if we want to enrich a Solr collection with OCR data using the current Annotator functionality, we create a dedicated OCR job targeting that collection. This job overwrites the collection with its data, so it must be kept up to date to guarantee fresh document content. If we also want to add Spacy data to the same collection (a Spacy server allows us to extract natural language entities using Hugging Face AI models), we have to add it to the same job as the OCR one. Indeed, if we were to create a dedicated Spacy job, it would completely overwrite the collection with its data; and since it has no OCR-related data, the OCR-related field(s) would be emptied (and vice versa when the OCR job passes by). One disadvantage is that, this combined job being sequential, it is only as fast as the slowest computation step (either OCR or Spacy in our case). We cannot parallelize the computations.

We therefore have to be very careful when designing the job sequencing if we don't want the fresh content of documents to be overwritten by the Spacy or OCR job.

Using Datafari’s Atomic Update Service solves this problem. We can update documents with job-specific fields only. The operating principle is somewhat similar to processing without Atomic Update. Taking the previous example, we can now do it in 3 separate jobs: one fast job to index the collection to be enriched, one for OCR and one for Spacy. One difference compared to the current mechanism is that the OCR and Spacy jobs must index the documents into another collection. Then, the Atomic Update Service can be called. It will retrieve the Spacy and OCR specific fields to update the first, global collection.

Gliffy diagram: Atomic Update process principle
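
To illustrate the principle, here is a minimal sketch of the kind of update the service sends to the target collection, using Solr's standard atomic update modifiers (set, add-distinct, etc.): only the document id and the job-specific fields are sent, so the rest of the document is left untouched. The field names ocr_text and spacy_entities below are hypothetical examples, not actual Datafari field names.

Code Block
languagejson
[
  {
    "id": "doc_42",
    "ocr_text": { "set": "Text extracted by the OCR job" },
    "spacy_entities": { "add-distinct": ["Datafari", "Solr"] }
  }
]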

...

For each Atomic Update job, you must configure the following:

  • the source collection used to update the target collection, that is, the location and name of the Spacy collection, for example.

  • the target collection

  • the source collection fields used to update the destination collection fields

  • (Optional) the mapping between source fields and destination fields, when they are not the same.

...

Code Block
languagejson
{
  "logConfigFile": atomicUpdate-log4j2.xml loggin file location,
  "jobs": {
    "JOB_1": {  // Put the name you want for this configuration
      "source": {
        "baseUrl":  Solr Cloud base Url for the source Collection used to update target Collection. You can specify Solr or Zookeeper host.
                    The syntax for Solr host is: "http://datafari_domain:8983/solr", ex: "http://localhost:8983/solr" ; you need to specify all Solr hosts.
                    The syntax for Zookeeper is: "datafari_domain:2181", ex: "localhost:2181" ; No http prefix because it's another protocol.
                    Whatever host type, you can define several severs by separating URLs with comma: "http://solr1:8983/solr, http://solr2:8983/solr,...".
        "solrCollection": the Solr source Collection for JOB_1. Exemple "Spacy".
      },
      "destination": {
        "baseUrl": Solr Cloud base Url for the target Collection. The syntax is the same as in "source" blocblock.
        "solrCollection": the Solr target Collection for JOB_1. Exemple "FileShare".
      },
      "fieldsOperation": { // the fields of the source collection and Atomic Update operation like: set, add, remove, etc...
        "field_1": "set",
        "field_2": "add",
        "field_3": "add-distinct",
        "field_4": "set"
      },
      "nbDocsPerBatch": The documents are selected and updated per batches. Each batch is stored in RAM so this number depends on the data size retrieved (i.e fields and their content).
      "fieldsMapping": { // Optional: to specify a mapping between source and destination collections
        "field_3": "dest_field_1",
        "field_2": "dest_field_2"
      }
    },
    "JOB_2": {  // Put the name you want for this configuration
      ...
    }
  }
}
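
As an illustration only, a filled-in configuration for a single Spacy job could look like the sketch below. The log file path, host, field names and batch size are hypothetical examples; the collection names reuse the "Spacy" and "FileShare" examples from the template above.

Code Block
languagejson
{
  "logConfigFile": "/opt/datafari/conf/atomicUpdate-log4j2.xml",
  "jobs": {
    "SPACY_JOB": {
      "source": {
        "baseUrl": "http://localhost:8983/solr",
        "solrCollection": "Spacy"
      },
      "destination": {
        "baseUrl": "http://localhost:8983/solr",
        "solrCollection": "FileShare"
      },
      "fieldsOperation": {
        "spacy_entities": "add-distinct",
        "spacy_summary": "set"
      },
      "nbDocsPerBatch": 500,
      "fieldsMapping": {
        "spacy_summary": "summary"
      }
    }
  }
}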

...