
This documentation is valid from Datafari 6.0 upwards.

Since Datafari 6.0, the simplified jobs offer a new option, “Create a side Spacy NER job”:

...

In the “Spacy Fastapi” tab, you will notice that the Spacy endpoint is set by default to /split_detect_and_process/ to avoid problems with the documents, but you can change it if you want:

...

The prefix for the metadata names requires a matching Solr dynamic field, which you need to create first. For example, if we enter entity_ as the prefix, we need to add this configuration to Solr:

...
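As a rough illustration of what a dynamic field for the entity_ prefix could look like, here is a minimal Python sketch that builds an "add-dynamic-field" command in the format accepted by the Solr Schema API. It is not the configuration referenced above: the field type "string", the indexed/stored/multiValued attributes, and the helper name build_add_dynamic_field_command are assumptions made for the example, and your Datafari installation may expect the schema to be edited in a different way.

```python
import json


def build_add_dynamic_field_command(prefix, field_type="string"):
    """Build a Solr Schema API 'add-dynamic-field' command for a metadata prefix.

    The field type and the attributes below are assumptions for this sketch;
    adapt them to the field types defined in your own Solr schema.
    """
    if not prefix:
        raise ValueError("prefix must not be empty")
    return {
        "add-dynamic-field": {
            # A trailing wildcard makes the rule match entity_PERSON, entity_ORG, ...
            "name": f"{prefix}*",
            "type": field_type,
            "indexed": True,
            "stored": True,
            # Assumed: a document can contain several entities of the same kind.
            "multiValued": True,
        }
    }


if __name__ == "__main__":
    # Print the JSON body; it could then be POSTed to the collection's
    # /schema endpoint if your Solr setup allows Schema API changes.
    print(json.dumps(build_add_dynamic_field_command("entity_"), indent=2))
```

The important part is the trailing wildcard: the dynamic field name must start with exactly the prefix entered in the job configuration, otherwise Solr will not recognize the generated metadata fields.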

Once you are satisfied with the configuration of the Spacy job, make sure that its crawling time window occurs AFTER the crawling time window of the corresponding non-Spacy job. Otherwise, the Spacy-extracted entities will be deleted by the non-Spacy job crawl. Note also that two jobs running at the same time on an MCF node interfere with each other, because MCF has only one processing queue for documents. MCF will therefore queue documents from the standard job and the Spacy job in no particular order, which lengthens the processing time of both jobs. More importantly, some documents may be processed by the Spacy job BEFORE the standard job. In that case, the Spacy-extracted entities are lost, because the last version of the document to be indexed is the one without the extracted entities.
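The scheduling rule above can be checked before you save the two schedules. The following Python sketch models each crawling time window as a start time plus a maximum duration and verifies that the Spacy window starts only once the standard window has ended. The names CrawlWindow and spacy_window_is_safe are hypothetical and exist only for this example; they are not part of Datafari or MCF.

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class CrawlWindow(NamedTuple):
    """A crawling time window: when the job may start and for how long it may run."""
    start: datetime
    max_duration: timedelta

    @property
    def end(self) -> datetime:
        return self.start + self.max_duration


def spacy_window_is_safe(standard: CrawlWindow, spacy: CrawlWindow) -> bool:
    """Return True if the Spacy window begins at or after the end of the standard window.

    This only compares the configured windows. It assumes that the standard
    job actually stops when its window closes.
    """
    return spacy.start >= standard.end


if __name__ == "__main__":
    standard = CrawlWindow(datetime(2024, 1, 1, 20, 0), timedelta(hours=4))
    overlapping = CrawlWindow(datetime(2024, 1, 1, 22, 0), timedelta(hours=3))
    sequential = CrawlWindow(datetime(2024, 1, 2, 1, 0), timedelta(hours=3))

    print(spacy_window_is_safe(standard, overlapping))  # False: both jobs would share the queue
    print(spacy_window_is_safe(standard, sequential))   # True: Spacy job runs after the standard job
```

A window that passes this check avoids both problems described above: the two jobs never compete for the single processing queue, and the Spacy job always processes a document after the standard job has indexed it.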

...