Setting up a server to host spaCy for Named Entity Recognition

Datafari can be set up to extract named entities through a web-service, as described at https://datafari.atlassian.net/wiki/spaces/DATAFARI/pages/2469920769 . To do so, however, you must first set up a web-service that serves spaCy models and can be queried for entities. This page explains how.

Resources Needed

We tested the web-service on a machine with an 8-core CPU and 32 GB of RAM, running sentiment analysis with spacytextblob and keyword extraction with KeyBERT. The spacytextblob library used the following language models, depending on the language detected in each document:

"en": "en_core_web_trf",
"fr": "fr_dep_news_trf",
"de": "de_dep_news_trf",
"xx": "xx_ent_wiki_sm"

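The mapping above can be read as a lookup keyed by the detected language code. Here is a minimal sketch, assuming that languages without a dedicated model fall back to the multilingual "xx" entry. The fallback rule and the function name select_model_name are assumptions made for illustration, not taken from the web-service code.

```python
# Language-to-model mapping as listed above.
MODELS = {
    "en": "en_core_web_trf",
    "fr": "fr_dep_news_trf",
    "de": "de_dep_news_trf",
    "xx": "xx_ent_wiki_sm",
}


def select_model_name(lang_code: str) -> str:
    """Return the model name for a detected language code.

    Assumption: unknown languages fall back to the multilingual "xx" entry.
    """
    return MODELS.get(lang_code.lower(), MODELS["xx"])


print(select_model_name("fr"))  # fr_dep_news_trf
print(select_model_name("it"))  # xx_ent_wiki_sm
```
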
Resource consumption and requirements vary with the tasks and models you use. We recommend reading the spaCy documentation for the models you plan to use to get an idea of their requirements, then testing the web-service before integrating it into an indexing pipeline to make sure it runs smoothly.
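
One way to run such a test is to time a batch of sample requests and count the failures. The sketch below is a rough illustration only. The helper name measure_latency is hypothetical, and fake_call stands in for a real request to the web-service, whose endpoint and payload are described in its readme and are not assumed here.

```python
import statistics
import time
from typing import Callable, Iterable


def measure_latency(call: Callable[[str], object], texts: Iterable[str]) -> dict:
    """Time each call and summarise the latencies in seconds."""
    durations = []
    failures = 0
    for text in texts:
        start = time.perf_counter()
        try:
            call(text)
        except Exception:
            # Count the failure but keep timing the remaining documents.
            failures += 1
        durations.append(time.perf_counter() - start)
    return {
        "count": len(durations),
        "failures": failures,
        "mean_s": statistics.mean(durations) if durations else 0.0,
        "max_s": max(durations, default=0.0),
    }


# Hypothetical stand-in for a real request to the web-service.
def fake_call(text: str) -> dict:
    if not text:
        raise ValueError("empty document")
    return {"entities": []}


report = measure_latency(fake_call, ["Paris is in France.", "", "Berlin"])
print(report["count"], report["failures"])
```

Replacing fake_call with a function that posts a document to your own deployment gives a first idea of throughput before real indexing traffic is sent.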

Some tasks and models may run better on a GPU when one is available.
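
spaCy provides spacy.prefer_gpu(), which activates a GPU if one is usable and otherwise leaves processing on the CPU. It needs to be called before any model is loaded. The sketch below is only an illustration. Whether and where the web-service makes this call is not stated here, and the wrapper name enable_gpu_if_available is hypothetical.

```python
def enable_gpu_if_available() -> bool:
    """Ask spaCy to use a GPU when possible; return whether one was activated."""
    try:
        import spacy
    except ImportError:
        # spaCy is not installed in this environment: nothing to activate.
        return False
    # prefer_gpu() returns True if a GPU was activated, False otherwise.
    return bool(spacy.prefer_gpu())


# Call this before spacy.load(...) so the models are placed on the GPU.
using_gpu = enable_gpu_if_available()
print("GPU enabled" if using_gpu else "Running on CPU")
```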

Getting the Web-service

We developed a first version of a web-service that meets the requirements described at https://datafari.atlassian.net/wiki/spaces/DATAFARI/pages/2469920769 . It is available here: https://gitlab.datafari.com/sandboxespublic/spacy-webservice

The readme gives extensive information on how to install, configure and use the web-service.

You can also extend the capabilities of the web-service if needed. It is built with FastAPI, a Python library that makes it easy to build a web API.

Keep in mind that this is a work in progress. The current API does not support pools of models or document queues. As a result, if documents are sent faster than the models can process them, the web-service returns an error for some documents, and those documents end up with no entities attached.
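
If you call the web-service from your own client code, one possible mitigation is to retry failed requests after a short, growing delay. The sketch below illustrates that idea only. It is not part of the web-service or of Datafari; send_with_retry is a hypothetical helper, and flaky_send simulates a busy service instead of making a real HTTP call.

```python
import time
from typing import Callable, Optional


def send_with_retry(
    send: Callable[[str], dict],
    text: str,
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[dict]:
    """Call send(text), retrying with exponential backoff on failure.

    Returns the response, or None if every attempt failed.
    """
    for attempt in range(max_attempts):
        try:
            return send(text)
        except Exception:
            if attempt == max_attempts - 1:
                return None
            # Wait base_delay, then twice that, and so on, before retrying.
            sleep(base_delay * (2 ** attempt))
    return None


# Hypothetical stand-in for the HTTP call: fails twice, then succeeds.
calls = {"n": 0}


def flaky_send(text: str) -> dict:
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("service busy")
    return {"entities": []}


delays = []
result = send_with_retry(flaky_send, "Paris is in France.", sleep=delays.append)
print(result, delays)
```

The sleep function is passed in as a parameter so the delays can be recorded instead of waited for during testing. In real use, leave the default time.sleep.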