...
Given that in Datafari all the fields we wish to update are indexed and stored, we opted for the Atomic Update approach.
We also use the Optimistic Concurrency approach in combination, so the document must exist to be updated. We imposed this constraint on ourselves to avoid ending up with new, incomplete documents containing only the partial fields: an atomic update should only update Solr documents that have been created by the “standard” crawl. Otherwise, consider the case where a full crawl creates a Solr document and an incremental standard crawl later deletes it. If the atomic update runs after that deletion while the document is still present in its intermediary Solr collection, it would create a new Solr document with incomplete data. Optimistic concurrency in Solr gives us the ability to update only documents that already exist.
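As an illustration of how the two mechanisms combine, here is a minimal sketch of the JSON payload such an update could send to Solr. It relies on the documented Solr convention that `"_version_": 1` means "the document must already exist". The function name, the document id and the `ocr_text` field are hypothetical and are not taken from Datafari's code.

```python
import json

# Solr optimistic-concurrency convention: "_version_": 1 means
# "the document must already exist"; otherwise the update is rejected.
MUST_EXIST = 1


def build_atomic_update(doc_id, partial_fields, operation="set"):
    """Build one atomic-update document for Solr's JSON update handler.

    Each field value is wrapped in a modifier such as {"set": value},
    so only the listed fields change on the existing document.
    """
    update = {"id": doc_id, "_version_": MUST_EXIST}
    for name, value in partial_fields.items():
        update[name] = {operation: value}
    return update


# Hypothetical document id and field name, for illustration only.
payload = [build_atomic_update("file://share/report.pdf", {"ocr_text": "scanned text"})]
print(json.dumps(payload, indent=2))
```

If the target document has been deleted in the meantime, Solr refuses the update with a version conflict instead of creating a partial document.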
2. Processing principle
Without Atomic Update, if we want to enrich a Solr collection with OCR data using the current Annotator functionality, we create a dedicated OCR job targeting the same collection. This job overwrites the collection with its data, and it must be kept up to date to guarantee fresh document content. If we want to add Spacy data to the same collection (a Spacy server allows us to extract natural language entities using Hugging Face AI models), we have to add it to the same job as the OCR one. Indeed, if we were to create a dedicated Spacy job, it would completely overwrite the collection with its data; and since it has no OCR-related data, the OCR-related field(s) would be emptied (and vice versa when the OCR job runs). Since this job is sequential, it can only be as fast as the slowest computation step (either OCR or Spacy in our case). We cannot parallelize the computations.
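The overwrite problem can be reproduced with a toy in-memory model. This is only a sketch: a Python dictionary stands in for the Solr collection, and the field names `ocr_text` and `entities` are hypothetical.

```python
def full_overwrite(collection, job_docs):
    """Model a classic (non-atomic) indexing pass.

    Each document sent by the job replaces the stored one entirely,
    so any field the job does not provide is lost.
    """
    for doc in job_docs:
        collection[doc["id"]] = dict(doc)


collection = {}

# A dedicated OCR job indexes its documents...
full_overwrite(collection, [{"id": "doc1", "ocr_text": "scanned text"}])

# ...then a dedicated Spacy job indexes the same documents.
full_overwrite(collection, [{"id": "doc1", "entities": ["Paris"]}])

# The OCR field is gone: the Spacy job had no OCR data to send.
print(collection["doc1"])
```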
...
Using Datafari’s Atomic Update Service solves this problem. We can update documents with job-specific fields only. The operating principle is somewhat similar to processing without Atomic Update, except that the work can now be split into several separate jobs instead of a single Annotator job. In the scenario mentioned earlier, we would have one fast job to index the collection to be enriched, one job for the OCR and one for Spacy. Unlike in the previous mechanism, the OCR and Spacy jobs must index their documents in separate collections (see the illustration below). The Atomic Update Service can then be called: it retrieves the Spacy-specific and OCR-specific fields and uses them to update the global collection.
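The flow described above can be sketched with the same kind of toy model. This is not the service's implementation: dictionaries stand in for the Solr collections, only the "set" operation is modelled, and the collection contents and field names are hypothetical.

```python
def apply_atomic_updates(target, source, fields):
    """Copy job-specific fields from an intermediary collection into the target.

    Only documents that already exist in the target are updated, which
    models the "document must exist" constraint. Returns the number of
    documents updated.
    """
    updated = 0
    for doc_id, src_doc in source.items():
        if doc_id not in target:
            continue  # never create a partial document
        for field in fields:
            if field in src_doc:
                target[doc_id][field] = src_doc[field]  # "set" semantics
        updated += 1
    return updated


# Global collection filled by the fast, standard job.
file_share = {"doc1": {"id": "doc1", "title": "Report"}}

# Intermediary collections filled by the OCR and Spacy jobs.
ocr = {"doc1": {"id": "doc1", "ocr_text": "scanned text"},
       "doc2": {"id": "doc2", "ocr_text": "orphan"}}  # doc2 was deleted from file_share
spacy = {"doc1": {"id": "doc1", "entities": ["Paris"]}}

apply_atomic_updates(file_share, ocr, ["ocr_text"])
apply_atomic_updates(file_share, spacy, ["entities"])
print(file_share)
```

Because each pass touches only its own fields, the OCR and Spacy passes can run in any order without erasing each other's data.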
...
    {
      // Location of the log configuration file. Set it only if it differs from the one provided in the "atomicupdates" directory, or if you want to move it to another location.
      "logConfigFile": "",
      "jobs": {
        "JOB_1": { // Put the name you want for this configuration
          "source": {
            // Solr Cloud base URL for the source collection used to update the target collection.
            // You can specify Solr or Zookeeper hosts, but prefer the Zookeeper host, as Datafari uses it to dispatch to all the Solr hosts you have.
            // (For information) Syntax for a Solr host: "http://datafari_domain:8983/solr", e.g. "http(s)://localhost:8983/solr"; you need to specify all your Solr hosts.
            // Syntax for Zookeeper: "datafari_domain:2181", e.g. "localhost:2181"; no http prefix, because it is another protocol.
            // Whatever the host type, you can define several servers by separating URLs with commas, but with Zookeeper there is only one server.
            // Example with Solr hosts: "http://solr1:8983/solr, http://solr2:8983/solr,..."
            "baseUrl": "localhost:2181",
            // The Solr source collection for JOB_1, for example "Spacy".
            "solrCollection": "Spacy"
          },
          "destination": {
            // Solr Cloud base URL for the target collection. The syntax is the same as in the "source" block.
            "baseUrl": "localhost:2181",
            // The Solr target collection for JOB_1, for example "FileShare".
            "solrCollection": "FileShare"
          },
          "fieldsOperation": {
            // The fields of the source collection, each with its Atomic Update operation: set, add, remove, etc.
            // The "set" operation is the most appropriate value in most cases, as it replaces the target value with the source value.
            // See more about the available operations here:
            // https://solr.apache.org/guide/solr/9_5/indexing-guide/partial-document-updates.html#atomic-updates
            "field_1": "set",
            "field_2": "add",
            "field_3": "add-distinct",
            "field_4": "set"
          },
          // Number of documents per batch fetched from the intermediary Solr (for instance the Spacy collection) and sent to the final Solr (the FileShare collection in our illustration above).
          // Documents are selected and updated in batches. Each batch is stored in RAM, so this number depends on the size of the data retrieved (i.e. the fields and their content).
          // You can start, for instance, with 1000 for OCR sources and 2000 for Spacy sources: we observed good performance with these values. Adjust them to optimise your Atomic Update performance.
          "nbDocsPerBatch": 2000,
          "fieldsMapping": { // Optional: to specify a mapping between source and destination collections
            "field_3": "dest_field_1",
            "field_2": "dest_field_2"
          }
        },
        "JOB_2": { // Put the name you want for this configuration
          ...
        }
      }
    }
...
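To make the roles of "fieldsOperation", "fieldsMapping" and "nbDocsPerBatch" more concrete, here is a minimal sketch of how one job entry could be turned into batched Solr atomic-update documents. It assumes that the keys of "fieldsMapping" are source field names and its values are destination field names, as the comments in the configuration suggest. The function names are hypothetical and this is not the service's actual code.

```python
from itertools import islice


def batches(docs, size):
    """Yield successive lists of at most `size` documents."""
    iterator = iter(docs)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def to_update(src_doc, job):
    """Build one atomic-update document from a source document.

    Fields listed in "fieldsOperation" are wrapped in their operation
    and renamed when "fieldsMapping" provides a destination name.
    "_version_": 1 asks Solr to update only an existing document.
    """
    mapping = job.get("fieldsMapping", {})
    update = {"id": src_doc["id"], "_version_": 1}
    for field, operation in job["fieldsOperation"].items():
        if field in src_doc:
            update[mapping.get(field, field)] = {operation: src_doc[field]}
    return update


# Job entry mirroring the structure of the configuration file.
job = {
    "fieldsOperation": {"field_1": "set", "field_3": "add-distinct"},
    "fieldsMapping": {"field_3": "dest_field_1"},
    "nbDocsPerBatch": 2,
}

source_docs = [{"id": str(i), "field_1": i, "field_3": ["x"]} for i in range(5)]

for batch in batches(source_docs, job["nbDocsPerBatch"]):
    updates = [to_update(doc, job) for doc in batch]
    print(len(updates), updates[0])
```

Only one batch is held in memory at a time, which is why a smaller batch size suits sources with large field contents such as OCR text.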