Spacy NER on simplified jobs

Valid from version 6.0

This documentation is valid from Datafari 6.0 upwards

Since Datafari 6.0, simplified jobs offer a new option: “Create a side Spacy NER job”.

This feature creates a clone of the simplified job currently being declared, configured to exclusively perform a Spacy NER extraction on files through a dedicated Spacy FastAPI server that you must provide. You therefore need to set up that server first; see “Setting up a server to host Spacy for Named Entity Recognition”.

When you check the checkbox, two additional parameters appear, and both MUST be set:

Spacy connector name: the name given to the Spacy NER connector that will be created using the provided Spacy server address
Spacy server address: the full address of the Spacy server (leveraging FastAPI) to use. This parameter must contain the protocol, the IP address, and the port (e.g. http://192.168.0.1:5000)
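As a rough illustration of the address format described above, the following Python sketch checks that an address contains a protocol, a host, and a port before you paste it into the form. The function name `check_spacy_server_address` is hypothetical, and accepting both http and https is an assumption; this is not the validation performed by Datafari itself.

```python
from urllib.parse import urlsplit


def check_spacy_server_address(address):
    """Return a list of problems found in the address (empty if it looks complete).

    Hypothetical helper: it only checks the three parts the documentation
    asks for (protocol, host, port) and does not contact the server.
    """
    problems = []
    parts = urlsplit(address.strip())

    # Assumption: only http and https are meaningful protocols here.
    if parts.scheme not in ("http", "https"):
        problems.append("missing or unsupported protocol")

    if not parts.hostname:
        problems.append("missing host or IP address")

    try:
        port = parts.port  # raises ValueError if the port is not a valid number
    except ValueError:
        problems.append("invalid port")
    else:
        if port is None:
            problems.append("missing port")

    return problems
```

For example, `check_spacy_server_address("http://192.168.0.1:5000")` returns an empty list, while an address without the protocol or without the port returns at least one problem.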

Note that if you set a Spacy connector name that already exists in MCF, the connector corresponding to that name will be used and no new one will be created. In that case, the Spacy server address will be ignored.

After you save a simplified job with the side Spacy NER job enabled, the list of jobs shows two jobs: one with the name you entered as sourcename in the simplified UI, and one with the same name but containing the “NER” prefix.

If you edit the NER job, the connection tab shows the one difference from the other created job: the Spacy connector you specified in the “Spacy connector name” parameter has been added to the pipeline as the last transformation connector.

In the “Spacy Fastapi” tab, the Spacy endpoint is set by default to /split_detect_and_process/ to avoid any problems with the documents, but you can change it if you want.
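To make the relationship between the server address and the endpoint concrete, here is a minimal Python sketch that joins the two into a full URL. It assumes the endpoint is simply appended to the server address, which is a plausible reading but is not confirmed by this page; the function name `spacy_endpoint_url` is hypothetical.

```python
DEFAULT_ENDPOINT = "/split_detect_and_process/"


def spacy_endpoint_url(server_address, endpoint=DEFAULT_ENDPOINT):
    """Join a server address and an endpoint path into one URL.

    Assumption: the endpoint path is appended to the server address.
    Slashes are normalized so that exactly one separates the two parts
    and the result keeps a trailing slash, like the default endpoint.
    """
    base = server_address.rstrip("/")
    path = endpoint.strip("/")
    return f"{base}/{path}/" if path else f"{base}/"
```

With the example address from this page, `spacy_endpoint_url("http://192.168.0.1:5000")` gives `http://192.168.0.1:5000/split_detect_and_process/`.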

Once you are satisfied with the configuration of the Spacy NER job, schedule it in a time window different from that of the original job if you plan to run the two jobs on the same MCF node. The best approach is to create the NER job on a dedicated MCF node, so that you can run it whenever you want.

If you run two jobs at the same time on one MCF node, they will interfere with each other, because MCF has only one processing queue for documents. MCF will therefore queue documents from the standard job and the Spacy NER job in no particular order, which lengthens the processing time of both jobs. More importantly, some documents may be processed by the Spacy job BEFORE the standard job, and in that case the extracted entities WILL BE LOST, because the last version of the document to be indexed will be the one without the entities.
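When both jobs share one MCF node, it can help to check on paper that their time windows never overlap. The Python sketch below does this for daily windows expressed as hypothetical `(start_hour, end_hour)` pairs, where the end hour is exclusive and a window may wrap past midnight. It only illustrates the reasoning and does not read or write any MCF schedule.

```python
def window_hours(start_hour, end_hour):
    """Return the set of hours (0-23) covered by a daily window.

    The window is [start_hour, end_hour) and wraps past midnight when
    end_hour is smaller than start_hour. Equal values mean an empty window.
    """
    length = (end_hour - start_hour) % 24
    return {(start_hour + i) % 24 for i in range(length)}


def windows_overlap(window_a, window_b):
    """Return True if the two daily windows share at least one hour."""
    return bool(window_hours(*window_a) & window_hours(*window_b))
```

For instance, a standard job running from 20:00 to 02:00 and an NER job running from 02:00 to 06:00 do not overlap, whereas starting the NER job at 01:00 would create a one-hour overlap.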