The spacy transformation connector uses spacy models exposed through a fastapi web API to extract entities and add them to the documents' metadata.
Let's take an example to use as a guideline throughout this documentation: we want to automatically extract person names from the documents we are indexing and add this information to the index. This requires three things: a spacy web service with a model able to extract this information, a job configured with the transformation connector that calls the web service, and a dedicated field in Solr to store the information.
1. Setting up the fastapi server with a spacy model
We advise running spacy and the fastapi interface on a dedicated server, as most NER models require several gigabytes of memory to run smoothly. Each model has its own requirements and limitations, so test the model you plan to use beforehand and make sure that the server it will run on can handle them.
The folder https://gitlab.datafari.com/sandboxes/spacy_ner_2021/-/tree/master/fastapi contains a set of basic configuration files to set up a fastapi server serving spacy models.
This basic configuration contains models that are able to extract person names, so we don’t need to change the configuration in our case.
Note that the transformation connector uses the /process/ endpoint of the fastapi configuration presented above to process documents, and the /models endpoint to check if the server is alive and has models loaded. So it is important to leave these endpoints active and not change their behavior.
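The liveness check described above can be reproduced by hand before configuring the connector. The following is only a minimal sketch using the Python standard library: the function names are hypothetical, the server address is whatever you deployed, and it assumes that /models answers with a non-empty JSON body when models are loaded (the actual response format is not specified here, so adapt the check to what your server returns).

```python
import json
from urllib.request import urlopen


def endpoint_url(base_url, path):
    """Join the server address and an endpoint path with exactly one slash."""
    return base_url.rstrip("/") + "/" + path.lstrip("/")


def server_is_alive(base_url, timeout=5.0, opener=urlopen):
    """Return True if GET /models answers HTTP 200 with a non-empty JSON body.

    Assumption: a non-empty JSON answer means at least one model is loaded.
    """
    try:
        with opener(endpoint_url(base_url, "/models"), timeout=timeout) as response:
            if response.status != 200:
                return False
            return bool(json.load(response))
    except (OSError, ValueError):
        # Connection failures (URLError is an OSError) and invalid JSON
        # are both treated as "server not usable".
        return False
```

Calling server_is_alive with the address of your fastapi server should return True once the service is up. The opener parameter exists only so the function can be exercised without a network connection.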
2. Using the Transformation Connector
Now that the spacy web service is running, we need to configure a job with a transformation connector to call this service.
2.1 Create a job
Using Datafari’s admin interface, create a job for the source you want to extract entities from, WITHOUT checking the box to start the job immediately.
2.2 Defining a new transformation connector
In the MCF administration interface, go to the output → list transformation connections menu entry
Click on the Add a new transformation connection button at the bottom
Name the connector and click on the type tab
Choose datafari spacy fastapi connector and click continue
Then go to the Spacy Fastapi tab and fill in the address of your spacy fastapi server
Finally, click save. If your fastapi server is running and accessible, you should see a connection working statement on the next screen.
2.3 Add the connector to the job
Go to your jobs list and select edit on the job you want to add entity extraction to
In the connection tab, on the bottom row labeled “Transformation” with a select dropdown, select the transformation connector you just created
Then click Insert transformation before on the output line right above
You should end up with the transformation connector inserted before the output connection in the job's pipeline
Click on the Spacy Fastapi tab at the top and fill in the name of the model to be used as well as the prefix for the entities field (using a prefix is optional but strongly recommended)
In this example, en_core_web_sm is spacy’s small model for English. It is part of the default configuration and is able to extract several entity types, including person names, which is what interests us. We chose entities_ as the prefix, so the metadata created by the transformation connector for each entity type is named entities_ followed by the type of the entity. The content of each metadata field is the list of entities of this type that were detected.
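To make the naming scheme concrete, here is a small sketch of how detected entities map to prefixed metadata fields. It is an illustration of the convention described above, not the connector's actual code: the function name and the sample entities and labels are hypothetical.

```python
from collections import defaultdict


def entities_to_metadata(entities, prefix="entities_"):
    """Group (text, label) pairs into one metadata list per entity type.

    Each metadata name is the prefix followed by the entity type.
    """
    metadata = defaultdict(list)
    for text, label in entities:
        metadata[prefix + label].append(text)
    return dict(metadata)


# Hypothetical sample of entities detected in one document.
detected = [
    ("Ada Lovelace", "PER"),
    ("London", "LOC"),
    ("Alan Turing", "PER"),
]

print(entities_to_metadata(detected))
# {'entities_PER': ['Ada Lovelace', 'Alan Turing'], 'entities_LOC': ['London']}
```

With an empty prefix the metadata would simply be named after the entity types, which is why a prefix is recommended: it keeps entity metadata apart from other document metadata and lets a single dynamic field match all of them.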
3. Store the information in Solr
For the entities to be stored in Solr, you need to create fields with the same names as the metadata created in MCF containing your entities. If you configured a prefix for your metadata, you can use one dynamic field in Solr to catch all the entities at once.
The field must have the following properties:
Be multivalued
Store strings unaltered (no tokenization or other analysis), as the values are entities
We recommend that values are stored and indexed.
In our case, we want to store information about the detected person names. To do so, we create a field entities_PER with the above parameters so that this entity type is stored in Solr; the suffix must match the entity type label produced by your model. If you want to keep all entity types, you can create a dynamic field entities_* instead.
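As an illustration of the field properties listed above, the sketch below builds the JSON commands that Solr's Schema API accepts for adding a field or a dynamic field. It rests on several assumptions: that your collection's schema can be modified through the Schema API (otherwise the equivalent declaration goes in the schema file), that a field type named string exists in your schema, and the helper name schema_command is hypothetical.

```python
import json


def schema_command(name, dynamic=False):
    """Build a Schema API command for a multivalued, stored, indexed string field."""
    command = "add-dynamic-field" if dynamic else "add-field"
    return {
        command: {
            "name": name,
            "type": "string",      # assumed field type: values are kept unaltered
            "multiValued": True,   # one document can contain several entities
            "stored": True,
            "indexed": True,
        }
    }


# One field for person names only, or one dynamic field for every entity type.
print(json.dumps(schema_command("entities_PER"), indent=2))
print(json.dumps(schema_command("entities_*", dynamic=True), indent=2))
```

Each printed payload could then be sent as the body of a POST request to the schema endpoint of your collection (/solr/<collection>/schema).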