Valid from v6
This documentation is valid from Datafari v6 onwards.
Introduction
In the context of data privacy concerns and the strict requirements of the General Data Protection Regulation (GDPR), enterprises need robust tools to manage personal data responsibly. In this article, I would like to show you a way to identify and extract personal data from your file set or your website using Datafari. In this demonstration, we will extract names, phone numbers and email addresses from an SMB file set.
This demonstration uses Datafari 6 and two of its components. The “Regex Entity Connector” extracts phone numbers and email addresses using regular expressions, and the “Spacy FastAPI Connector” calls a Spacy API to find names in documents using Named Entity Recognition (NER).
Datafari installation
The first thing you’ll need is a working Datafari instance. You can install either the Community Edition or the Enterprise Edition.
Here is the documentation for installing Datafari – Enterprise Edition:
/wiki/spaces/DATAFARI/pages/1878654981
And here is the one for Datafari – Community Edition:
Install Datafari - Community Edition
Once the installation is over, see the next step to install a Spacy API.
Spacy API installation
Spacy is an open-source library for Natural Language Processing. It can recognize many elements in a text, such as people’s names, languages, dates, prices and organizations. In this context, we will create an API using Spacy to extract people’s names from our file system.
You can host your API either on the same server as Datafari or on another machine. A GPU is not a hard requirement, but it is recommended as it can improve indexing performance. To set up this API, you can use our turnkey project here, following the instructions in the README.md file:
https://gitlab.datafari.com/sandboxespublic/spacy-webservice
Make sure that your API is running when you launch your crawler job.
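Before launching a crawl, it can help to confirm that something is actually listening at the API address. The following is a minimal Python sketch, not part of Datafari or of the spacy-webservice project: the helper name is_api_reachable is hypothetical. It only checks that a TCP connection can be opened to the host and port of the URL; it does not verify that the service answering is the Spacy API.

```python
import socket
from urllib.parse import urlsplit


def is_api_reachable(url: str, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to the URL's host and port succeeds."""
    parts = urlsplit(url)
    host = parts.hostname
    if host is None:
        # The string did not contain a usable host (for example, no scheme).
        return False
    # Fall back to the scheme's usual port when the URL does not give one.
    port = parts.port or (443 if parts.scheme == "https" else 80)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

For an API hosted on the Datafari server as described in this article, the call would be is_api_reachable("http://localhost:5000"); adjust the URL if the API runs on another machine or port.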
Create a simplified job
Now that you have installed Datafari and the Spacy API, let’s set up your crawler jobs. The easiest way to do so is to use the simplified job creation. For that, you need to be logged in as an administrator and go to the administration UI.
Get to “Connectors” > “Data Crawler Simplified Mode”
Select the source type you want to crawl. In my case, I want to crawl an SMB server, so I will choose “Create a Job Filer”.
Server: the URL of your SMB file system.
User & password: credentials to access your file system.
Path: path to the repository you want to crawl.
Make sure you check the “Create a Spacy NER job” option and set your API URL. If your API is hosted on your Datafari server, it should be http://localhost:5000.
Hit the confirm button. If the job is correctly created, go to the ManifoldCF UI:
“Connectors” > “Data Crawler Expert Mode”
First you need to check your repository connector is working. To do so, go to “Repositories” > “List Repository Connections”. Click on “View” in the line related to your File System. The “Connection status” should indicate “Connection working”. Otherwise, you may need to edit the connector to get it working.
Configure the job
The Repository Connector is ready. Now it is time to set up your job. In the ManifoldCF UI, go to “Jobs” > “List all Jobs”. You should see two jobs. The one we want to use contains “NER” in its name. Click its “Edit” button.
In our demo, we want to use the Spacy Connector and the Regex Entity Connector to tag personal data in our Solr index: names, phone numbers and email addresses. The names will be identified by our Spacy FastAPI Connector, which calls the API set up earlier. Phone numbers and email addresses will be found by our Regex Entity Connector.
If you checked the Spacy option during the Simplified Job Creation, you should see the SpacyConnector in the “Connection” tab of the job editing page (arrow 1 in the screenshot below). However, you need to add the RegexEntityConnector manually. To do so, select “RegexEntityConnector” in the “Transformation” line at the end of the table (arrow 2), then click the “Insert transformation before” button on the SpacyConnector line (arrow 3).
When it’s done, you should see the RegexEntityConnector in the table, and a new “Regex Entity” tab. The next step is to configure those two connectors. We want them to feed the following metadata into the Solr index:
entity_person should contain the names (provided by SpacyConnector)
entity_phone should contain the phone numbers (provided by RegexEntityConnector)
entity_email should contain the email addresses (provided by RegexEntityConnector)
Configure Spacy Connector
On the job editing page, go to the “Spacy FastAPI” tab. You have three parameters to set.
Model: you can leave this blank to use the default model, which is configured in the model.json file.
Endpoint: use the default one, “/split_detect_and_process/”.
Prefix: this defines how your metadata will be named. In our example, we will set it to “entity_” as recommended, so we can use the existing field “entity_person”.
See our documentation for the Spacy FastAPI Connector here:
Spacy Transformation Connector
Configure Regex Connector
Once your Spacy Connector is ready, go to the “Regex Entity” tab to configure the Regex Entity Connector. Its function is to store any regex matches in Solr metadata. In this situation, we want to extract phone numbers to “entity_phone”, and email addresses to “entity_email”.
To do so, we need to add two lines, one per regular expression. For each line, set the parameters as follows.
Source field: set it to “content” to look for data in the file content.
Regular expression:
Email address: ([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})
European phone numbers: (\+)([\s.\(\)]*\d{1}){8,13}(-)?(\d{1,5})
These are only examples. Feel free to use your own regular expressions to find phone numbers or any other formatted data.
Destination field:
Email addresses: entity_email
Phone numbers: entity_phone
Leave other fields empty.
Don’t forget to save your job so you don’t lose your modifications.
See our documentation for Regex Entity Connector here:
Regex Entity Connector
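To get a feel for what the two example expressions capture before running a full crawl, they can be tried on a sample string. The sketch below is only an approximation of the connector: it uses Python's re module, whereas the connector runs inside ManifoldCF (a Java application) and presumably evaluates the expressions with Java's regex engine, so edge cases may differ. The function name extract_entities and the sample text are hypothetical; the patterns and destination field names are the ones given above.

```python
import re

# Patterns copied from this article, keyed by their destination field.
PATTERNS = {
    "entity_email": re.compile(
        r"([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})"
    ),
    "entity_phone": re.compile(
        r"(\+)([\s.\(\)]*\d{1}){8,13}(-)?(\d{1,5})"
    ),
}


def extract_entities(content: str) -> dict[str, list[str]]:
    """Map each destination field to the full regex matches found in content."""
    return {
        field: [match.group(0) for match in pattern.finditer(content)]
        for field, pattern in PATTERNS.items()
    }


# Hypothetical sample text standing in for a document's "content" field.
sample = "Contact Jane at jane.doe@example.com or +33 1 23 45 67 89 for details"
print(extract_entities(sample))
```

Note that the phone expression requires a leading “+”, so numbers written in national format (without a country prefix) are not matched by this example.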
Create Solr fields
Congratulations! Your job is now ready. However, there is one last important step before launching it: you need to create the new Solr fields so the job can write data into them.
To create a metadata field in the FileShare collection, add the following line to the file $DATAFARI_HOME/solr/solrcloud/FileShare/conf/customs_schema/custom_fields.incl.
You don’t need to add “entity_phone” or “entity_person”, as those already exist by default in Solr.
{ "name":"entity_email", "type":"string", "stored":true, "multiValued":true }
Then execute the following script:
/opt/datafari/solr/solrcloud/FileShare/conf/addCustomSchemaInfo.sh
From the Administration UI, go to “Search Engine Administration” > “Solr Administration”. In the dropdown menu on the left, select the “FileShare” collection, then click on “Schema”.
The new Solr fields should appear in the dropdown menu.
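The same check can be scripted with Solr's Schema API, which exposes one field definition per URL under /schema/fields/. The following is a minimal sketch under stated assumptions: the helper names are hypothetical, and the Solr base URL (for example http://localhost:8983/solr, Solr's usual default) is an assumption that may not match how your Datafari instance exposes Solr, which may also require authentication.

```python
import json
import urllib.error
import urllib.request


def schema_field_url(solr_base: str, collection: str, field: str) -> str:
    """Build the Schema API URL describing one field of a collection."""
    return f"{solr_base.rstrip('/')}/{collection}/schema/fields/{field}"


def fetch_field(solr_base: str, collection: str, field: str):
    """Return the field definition as a dict, or None if Solr answers 404."""
    url = schema_field_url(solr_base, collection, field)
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return json.load(response).get("field")
    except urllib.error.HTTPError as err:
        if err.code == 404:
            # Solr does not know this field in the given collection.
            return None
        raise
```

With these assumptions, fetch_field("http://localhost:8983/solr", "FileShare", "entity_email") should return a definition whose "multiValued" entry is true once the script above has been executed.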
Launch the job
Your crawler job and the Solr schema should now be ready. Make sure your Spacy API is running. Now you can start indexing. In the ManifoldCF UI, go to “Jobs” > “Status and Job Management”, and start your job.
Warning: the job to launch is the one with “NER” in its name.
This operation may take some time, depending on the size of your file set and your server performance. Documents should start appearing in the Datafari Search UI.
Custom facets
This last step can be done while indexing is running. We want to create new facets on the Search UI to exploit our new metadata.
You can find the documentation here:
Customizing DatafariUI Facets
Here are the facets we want to set up for our example:
These facets let you filter files containing names, phone numbers or email addresses, and search for documents containing a specific value.
To add these facets, all you have to do is modify a file on your Datafari server:
/opt/datafari/www/ui-config.json
In the “left” section, add the following JSON:
{
  "type": "QueryFacet",
  "title": "GDPR",
  "queries": ["entity_phone:*", "entity_email:*", "entity_person:*"],
  "labels": ["Phone number", "Email address", "Person"],
  "id": "gdpr_facet",
  "minShow": 5
},
{
  "type": "FieldFacet",
  "title": "People",
  "field": "entity_person",
  "op": "OR",
  "variant": "autocomplete",
  "minShow": 3,
  "maxShow": 15,
  "show": true
},
{
  "type": "FieldFacet",
  "title": "Phone numbers",
  "field": "entity_phone",
  "op": "OR",
  "variant": "autocomplete",
  "minShow": 3,
  "maxShow": 15,
  "show": true
},
{
  "type": "FieldFacet",
  "title": "Email",
  "field": "entity_email",
  "op": "OR",
  "variant": "autocomplete",
  "minShow": 3,
  "maxShow": 15,
  "show": true
}
There is no need to restart the server: the facets should now appear on the Search UI.
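If the facets do not show up, a syntax slip in the edited file (such as a missing comma between two facet objects) is a likely cause. The sketch below is a hypothetical helper, not a Datafari tool: it only checks that a text parses as JSON and reports where parsing failed. The two sample strings are illustrative and assume, without confirmation from the Datafari documentation, that the “left” section is a JSON array.

```python
import json


def check_json(text: str) -> tuple[bool, str]:
    """Return (True, "") if text parses as JSON, else (False, error location)."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        return False, f"line {err.lineno}, column {err.colno}: {err.msg}"
    return True, ""


# Illustrative fragments: the first one lacks the comma between two facets.
broken = '{"left": [{"type": "QueryFacet"} {"type": "FieldFacet"}]}'
fixed = '{"left": [{"type": "QueryFacet"}, {"type": "FieldFacet"}]}'
print(check_json(broken))
print(check_json(fixed))
```

To check the real file, read the contents of /opt/datafari/www/ui-config.json and pass them to check_json.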