Spacy Transformation Connector

Valid from Datafari 5.1

The spaCy transformation connector calls spaCy models exposed through a FastAPI web API to extract entities and add them to the documents' metadata.

Let's take an example to use as a guideline throughout this documentation: we want to automatically extract person names from the documents we are indexing and add this information to the index. This requires setting up a spaCy web service with a model able to extract this information, configuring a job with the transformation connector that calls the web service, and creating a dedicated field in Solr to store the information.

1. Setting up the FastAPI server with a spaCy model

We advise running spaCy and the FastAPI interface on a dedicated server, as most NER models require several gigabytes of memory to run smoothly. Each model has its own requirements and limitations, so test the model you plan to use beforehand and make sure that the server that will run it can handle them.

You can find information on how to deploy a fastapi webservice hosting some spacy models here: https://datafari.atlassian.net/wiki/spaces/DATAFARI/pages/2657517573

If you want to develop your own web service for entity extraction, note that this connector expects your endpoints to work in a specific way. If you need more details about this, ask us by posting an issue on our GitLab: https://gitlab.datafari.com/

2. Using the Transformation Connector

Now that the spacy web service is running, we need to configure a job with a transformation connector to call this service.

2.1 Create a job

  • Using Datafari’s admin interface, create a job for the source you want to extract entities from, WITHOUT checking the box to start the job immediately

2.2 Defining a new transformation connector

  • In the MCF administration interface, go to the output → list transformation connections menu entry

  • Click on the Add a new transformation connection button at the bottom

  • Name the connector and click on the type tab

  • Choose datafari spacy fastapi connector and click continue

  • Then go to the Spacy Fastapi tab and fill in the address of your spacy fastapi server

  • Finally, click save. If your FastAPI server is running and accessible, you should see a connection working statement on the next screen

2.3 Add the connector to the job

  • Go to your jobs list and select edit on the job you want to add entity extraction to

  • In the connection tab, on the bottom row stating “Transformation” with a select dropdown, select the transformation connector you just created

  • Then click Insert transformation before on the output line right above

  • You should end up with this

  • Click on the Spacy Fastapi tab at the top and fill in the required information:

  • Name of the model: If a specific model name should be included in the query sent to the endpoint you are using, specify it here. Leave it blank if no model name needs to be provided.

  • Endpoint: The endpoint of the entity extraction web service that your requests are sent to. Defaults to /process/ if none is provided. We recommend using /split_detect_and_process/ to avoid out-of-memory (OOM) errors on large documents, and to benefit from automatic language detection to pick the appropriate model. spaCy will analyze the file content and identify named entities. See the example below for /split_detect_and_process/.

Example of text in document:

"My work at France Labs brought me to London."

Produced JSON:

{
  "result": {
    "ents": [
      {
        "text": "France Labs",
        "label": "ORG",
        "start": 11,
        "end": 22
      },
      {
        "text": "London",
        "label": "GPE",
        "start": 37,
        "end": 43
      }
    ],
    "lang": "en"
  }
}

  • Prefix: The prefix for the names of the metadata added to the document for the entities. Metadata are named [prefix][entity_label]. There is no default value, so we strongly recommend setting one, for example entities_.

You need to create the dynamic field in Solr before launching the job.
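To illustrate how the prefix and the entity labels combine into metadata names, here is a minimal Python sketch. It assumes the response shape shown in the example above. The helper name entities_to_metadata, the entities_ prefix and the choice to skip duplicate values are illustrative assumptions; the connector's actual implementation may differ.

```python
import json
from collections import defaultdict

# Response copied from the /split_detect_and_process/ example above.
RESPONSE = """
{"result": {"ents": [
  {"text": "France Labs", "label": "ORG", "start": 11, "end": 22},
  {"text": "London", "label": "GPE", "start": 37, "end": 43}
], "lang": "en"}}
"""


def entities_to_metadata(response_json, prefix):
    """Group entity texts under metadata names of the form [prefix][entity_label].

    Hypothetical helper written for illustration; it is not the connector's code.
    """
    ents = json.loads(response_json)["result"]["ents"]
    metadata = defaultdict(list)
    for ent in ents:
        name = prefix + ent["label"]
        # Skipping duplicate values is an assumption made for this sketch.
        if ent["text"] not in metadata[name]:
            metadata[name].append(ent["text"])
    return dict(metadata)


metadata = entities_to_metadata(RESPONSE, "entities_")
print(metadata)
```

With this prefix, each entity label in the response maps to its own metadata name (entities_ORG and entities_GPE here), which is why a single dynamic field in Solr can catch all of them.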

3. Store the information in Solr

For the entities to be stored in Solr, you need to create fields with the same names as the metadata created in MCF that contain your entities. If you configured a prefix for your metadata, you can use one dynamic field in Solr to catch all the entities at once.

The field must have the following properties:

  • Be multivalued

  • Store strings unaltered (no tokenization or other analysis), since the values are entities

We recommend that values are stored and indexed.

In our case, we want to store information about the detected person names. To do so, assuming the prefix is set to entities_, we create a field entities_PER with the above parameters so that this entity type is stored in Solr (if you want to keep all entity types, you can create a dynamic field entities_* instead).
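One possible way to create such a dynamic field is through the Solr Schema API. The Python sketch below builds the corresponding add-dynamic-field command. The Solr URL, the collection name FileShare and the string field type are assumptions to adapt to your installation, and your Datafari setup may require authentication or a different procedure for schema changes.

```python
import json
import urllib.request

# Assumptions: adjust the Solr URL and the collection name to your installation.
SOLR_URL = "http://localhost:8983/solr"
COLLECTION = "FileShare"


def build_dynamic_field_command(pattern):
    """Build a Schema API command adding a multivalued, unanalyzed string field."""
    return {
        "add-dynamic-field": {
            "name": pattern,
            "type": "string",  # assumes a non-tokenized "string" field type exists
            "multiValued": True,
            "indexed": True,
            "stored": True,
        }
    }


def send_schema_command(command):
    """POST the command to the Schema API of the collection."""
    request = urllib.request.Request(
        f"{SOLR_URL}/{COLLECTION}/schema",
        data=json.dumps(command).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


command = build_dynamic_field_command("entities_*")
print(json.dumps(command, indent=2))
# Uncomment to apply the change on a running Solr instance:
# print(send_schema_command(command))
```

To store only person names, the same command shape can be used with add-field and the name entities_PER instead of the dynamic pattern.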