Solr ingester crawler connector

Valid from Datafari X.X

The goal of this connector is to ingest data from Solr.

Why would we need to crawl data from Solr? Usually because a piece of software natively embeds Solr for its search feature, and it is easier to index the data directly from that Solr instance than from the software itself.

The alternative would be to plug our own Solr directly into the software, but the software expects its own specific schema and the indexing frequency cannot be configured.

So the Solr ingester connector is very useful: we can control the field mappings between the source and the target, control the indexing frequency and the incremental indexing, and also rely on an MCF authority connector to manage security.

User documentation

  • Configure the Solr ingester repository connector

Connection type: choose Solr ingester

In the Solr ingester tab, you have the following parameters:

- URL Solr: the URL of the Solr instance you want to index (for example: http://localhost:8983/solr)

- Connection timeout: 60,000 ms by default

- Socket timeout: 180,000 ms by default

  • Configure the job related to the Solr ingester repository connector

The tabs that are specific to this connector are:

Security and Parameters

- Security:
In this tab you can choose whether to take security into account. If you check the "security activated" checkbox, you must fill in the "Security field" textbox: it tells the connector the name of the field in the source Solr that holds the security information. These values are then stored in the MCF-related field allow_token_document.

- Parameters:

Collection name (mandatory): enter the name of the collection you want to index

Field mappings (optional): define the mappings between the fields extracted from the source Solr and the fields expected by your output repository connector

ID field (mandatory): the unique key field of the source Solr

Date field (mandatory): the field that stores the modification date (NB: it is used for incremental indexing. If your source Solr does not have a date field, add one to its schema and give it a default value like "NOW")

Content field (mandatory): the field that contains the "content"

Filter condition (optional): you can filter the documents to crawl from the source Solr. For now only one condition is supported, and the syntax must be "field:value", e.g. "inStock:true"
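As a hedged illustration of how a single "field:value" filter condition can be validated and turned into a Solr fq (filter query) parameter, here is a Python sketch; the actual connector is written in Java and its helper names may differ.

```python
def build_filter_query(condition):
    """Turn a "field:value" filter condition into a Solr fq parameter.

    Only one condition is supported; anything else raises an error.
    """
    if not condition:
        return {}  # no filtering: crawl every document
    field, sep, value = condition.partition(":")
    if not sep or not field or not value:
        raise ValueError('filter condition must use the "field:value" syntax')
    return {"fq": f"{field}:{value}"}

print(build_filter_query("inStock:true"))   # {'fq': 'inStock:true'}
print(build_filter_query(""))               # {}
```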

Technical documentation

  • AddSeedDocuments

In the addSeedDocuments method we perform a global query on the source Solr (using cursorMark pagination) to gather all the document ids, optionally filtered by the condition entered by the user.
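The cursorMark pagination used to gather every document id can be sketched as follows (a Python sketch, with a stubbed query function standing in for the real Solr round trip; the cursorMark/sort parameter names follow the Solr cursor API, everything else is illustrative, not the connector's actual Java code):

```python
def collect_seed_ids(query_solr, q="*:*", fq=None, rows=2):
    """Iterate over the whole result set with Solr cursorMark pagination.

    query_solr(params) stands in for the HTTP call to the source Solr;
    it must honour the cursorMark/sort contract of the real Solr API
    (sorting must include the uniqueKey field).
    """
    ids, cursor = [], "*"
    while True:
        params = {"q": q, "rows": rows, "sort": "id asc", "cursorMark": cursor}
        if fq:
            params["fq"] = fq
        response = query_solr(params)
        ids.extend(doc["id"] for doc in response["docs"])
        next_cursor = response["nextCursorMark"]
        if next_cursor == cursor:  # Solr repeats the cursor on the last page
            break
        cursor = next_cursor
    return ids

# stub returning three documents spread over two pages
def fake_solr(params):
    pages = {"*": (["a", "b"], "p1"), "p1": (["c"], "p1")}
    docs, nxt = pages[params["cursorMark"]]
    return {"docs": [{"id": i} for i in docs], "nextCursorMark": nxt}

print(collect_seed_ids(fake_solr))  # ['a', 'b', 'c']
```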

  • ProcessDocuments

For each document identifier, we perform a Solr query on the source and check whether the document is present in the MCF database.

The document version is created from the id and the modification date of the document in the source Solr.

If the document is no longer present in the source Solr: delete it.

If it is present in the MCF database but the date has changed: reindex it.

If it is not present in the MCF database: index it.

To index a document, we reuse the response of the Solr query already performed: we retrieve the security field, apply the field mappings, and finally send the document to the output repository connector.
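The decision logic described above can be sketched like this (a simplified Python model of the Java connector, with illustrative names; the real MCF API works with version strings and activity objects):

```python
def decide(source_doc, stored_version):
    """Return the action for one document id during processDocuments.

    source_doc: the document fetched from the source Solr, or None if
    it no longer exists there.
    stored_version: the "id|date" version string kept by MCF from a
    previous crawl, or None if the document was never indexed.
    """
    if source_doc is None:
        return "delete"                      # gone from the source Solr
    version = f"{source_doc['id']}|{source_doc['date']}"
    if stored_version is None:
        return "index"                       # never seen before
    if stored_version != version:
        return "reindex"                     # modification date changed
    return "skip"                            # unchanged since the last crawl

print(decide(None, "doc1|2024-01-01"))                      # delete
print(decide({"id": "doc1", "date": "2024-02-01"}, None))   # index
print(decide({"id": "doc1", "date": "2024-02-01"},
             "doc1|2024-01-01"))                            # reindex
```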