...
Select the source type you want to crawl. In my case, I want to crawl an SMB server, so I will choose “Create a Job Filer”. Before that, don’t forget to follow the Add the JCIFS-NG Connector to Datafari - Community Edition documentation in order to install the JCIFS-NG library (not pre-installed because it is under an LGPL licence).
Server: the URL of your SMB file system.
User & password: credentials to access your file system.
Path: path to the repository you want to crawl.
...
When it’s done, you should see the RegexEntityConnector in the table, and a new “Regex Entity” tab. The next step is the configuration of those two connectors. We want them to feed metadata into the Solr index.
entity_person should contain the names (provided by SpacyConnector)
entity_phone should contain the phone numbers (provided by RegexEntityConnector)
entity_email should contain the email addresses (provided by RegexEntityConnector)
...
Model: You can leave this blank to use the default model, which is configured in the model.json file. As of November 2023, the default model is en_core_web_trf.
Endpoint: use the default one, “/split_detect_and_process/”.
Prefix: This defines how your metadata will be named. In our example, set it to “entity_” as recommended, so we can use the existing field “entity_person”.
...
See our documentation for the FastAPI Connector here:
Spacy Transformation Connector
...
Once your Spacy Connector is ready, go to the “Regex Entity” tab to configure the Regex Entity Connector. Its function is to store any regex matches in Solr metadata. In this situation, we want to extract phone numbers to “entity_phone”, and email addresses to “entity_email”.
To do so, we need to add two lines, one per type of entity. For each line, set the parameters as follows.
...
Destination field:
Email addresses: entity_email
Phone numbers: entity_phone
Leave other fields empty.
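To illustrate what the two lines above achieve, here is a minimal Python sketch of regex-based extraction into the two destination fields. The patterns are deliberately simple assumptions and not the expressions shipped with or recommended by Datafari. The `extract_entities` function is hypothetical too, and only the destination field names come from this page.

```python
import re

# Hypothetical, simplified patterns for illustration only.
# Real-world phone and email formats usually need stricter expressions.
PATTERNS = {
    "entity_email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "entity_phone": re.compile(r"\+?\d[\d .-]{7,}\d"),
}


def extract_entities(text):
    """Map each destination field to its matches, without duplicates."""
    result = {}
    for field, pattern in PATTERNS.items():
        matches = []
        for match in pattern.findall(text):
            if match not in matches:  # mimic a multi-valued field without repeats
                matches.append(match)
        result[field] = matches
    return result


sample = "Contact Jane at jane.doe@example.com or +33 1 23 45 67 89."
print(extract_entities(sample))
```

Trying candidate expressions on a few sample documents this way, before pasting them into the connector, can help spot patterns that are too greedy or too strict.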
...
To create a metadata field in the FileShare collection, add the following to the file $DATAFARI_HOME/solr/solrcloud/FileShare/conf/customs_schema/custom_fields.incl.
You don’t need to add “entity_phone” or “entity_person”, as those already exist by default in Solr.
Code Block |
---|
{ "name":"entity_email", "type":"string", "stored":true, "multiValued":true } |
...
Note |
---|
Warning: remember that the one you want to launch is the one with “NER” in its name. |
This operation may take some time, depending on the size of your file set and your server performance. Documents should start appearing in the Datafari Search UI.
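To check that the three fields are actually being populated, one option is to query Solr directly. The Python sketch below only builds the query URLs, using Solr's standard `select` handler parameters (`q`, `fl`, `rows`, `wt`). The base URL `http://localhost:8983/solr` and the `build_check_url` helper are assumptions for illustration: your Datafari installation may expose Solr on a different host or port, or behind authentication.

```python
from urllib.parse import urlencode


def build_check_url(base_url, field, rows=5):
    """Build a Solr select URL returning documents where `field` has a value."""
    params = {
        "q": f"{field}:*",  # any document with at least one value in the field
        "fl": field,        # only return that field
        "rows": rows,
        "wt": "json",
    }
    # "FileShare" is the collection named on this page.
    return f"{base_url}/FileShare/select?{urlencode(params)}"


# Assumed base URL; adjust to your own installation.
for field in ("entity_person", "entity_phone", "entity_email"):
    print(build_check_url("http://localhost:8983/solr", field))
```

Opening one of the printed URLs in a browser, or fetching it with any HTTP client, should show a non-zero `numFound` once matching documents have been indexed.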
...
Code Block |
---|
{ "type": "QueryFacet", "title": "GDPR", "queries": [ "entity_phone:*", "entity_email:*", "entity_person:*" ], "labels": [ "Phone number", "Email address", "Person" ], "id": "gdpr_facet", "minShow": 5 }, { "type": "FieldFacet", "title": "People", "field": "entity_person", "op": "OR", "variant": "autocomplete", "minShow": 3, "maxShow": 15, "show": true }, { "type": "FieldFacet", "title": "Phone numbers", "field": "entity_phone", "op": "OR", "variant": "autocomplete", "minShow": 3, "maxShow": 15, "show": true }, { "type": "FieldFacet", "title": "Email", "field": "entity_email", "op": "OR", "variant": "autocomplete", "minShow": 3, "maxShow": 15, "show": true } |
...