Database crawling simplified configuration

This documentation is valid from Datafari EE v6.0

Datafari EE v6 and above users

Starting from Datafari EE v6, a new “Duplicates detection” option is available in the simplified job creation! Please refer to the Detect duplicates configuration documentation for more information!

1. Prerequisites

Before using the simplified configuration UI for database crawling, be aware that it is only compatible with databases that provide a JDBC driver. If the database you intend to crawl does not provide a JDBC driver, you will not be able to crawl it with the existing MCF JDBC connector and therefore cannot use the simplified configuration described here.

The other important thing to consider is that not all databases are currently compatible with the MCF JDBC connector, even if they provide a JDBC driver. At the moment, the compatible databases are:

  • Oracle

  • Sybase

  • MySQL

  • Postgres

  • Microsoft SQL Server

  • MariaDB

If your database is not in that list, you will not be able to crawl it, and you will have to open an issue on our GitLab to ask for an update to support your database!

So, the first thing you need to do is download the jar file of the JDBC driver corresponding to your database type and version, and install it by following this documentation: Connector - Add a JDBC connector (MySQL, Oracle, etc.)

2. Create the job through the simplified UI

Once you have installed the JDBC driver and restarted Datafari, go to the Data Crawlers Simplified Mode UI, located in the “Connectors” tab of the Datafari admin section, and select “Create Database job” as the source type.

You will need to fill in at least all the mandatory parameters to be able to save and create the database job.

Here is a description of each parameter:

  • Database type: Type of database to connect to. If your database is not listed here, you will not be able to crawl it with this connector. In that case, check whether a specific connector exists for your database; if not, you will have to open an issue on our GitLab to ask for an update to support it!

  • Database host: Host of the database. You must specify the port if it is not the default one; when the port is specified, it must be separated from the host by a ‘:’ character. For example, if you want to crawl a MySQL database that is running locally on its default port 3306, you only need to specify ‘localhost’ here. If you have changed the default port of the MySQL database to 0690 (as an example), you will need to set this parameter to ‘localhost:0690’.

  • Database name: Name of the Database to connect to by default

  • Database raw connection string: This parameter is optional and corresponds to a raw connection string that you may want to use to connect to your database. For example: "Server=127.0.0.1;Database=test;User Id=foo; Password=bar"

  • User: Username of an account that can connect to the database, can perform all the configured queries without restriction, and has full read access to the data you want to crawl

  • Password: Password of the provided user

  • Seeding query: Query that must return the IDs of all documents to crawl. The query must return the id field and map it to the $(IDCOLUMN) parameter; $(IDCOLUMN) is used by MCF to identify the id field and gather the IDs of all documents. As a starting point, you can rely on the provided demo query: “SELECT idfield AS $(IDCOLUMN) FROM documenttable WHERE modifydatefield > $(STARTTIME) AND modifydatefield <= $(ENDTIME)”. The $(STARTTIME) and $(ENDTIME) mapping parameters are optional and are used for delta crawls. You can find more details about the seeding query and delta crawls in this documentation: DB Connector Best Practices. The IDs gathered by this query are then used in the other queries, through the $(IDLIST) parameter, to limit their results (a worked example covering the four queries is sketched after this list).

  • Version query: This query must return, for each document ID coming from the seeding query, a hash/value field representing the version of the document. The version field must be mapped to the $(VERSIONCOLUMN) parameter, which MCF uses to identify and gather the version of every document ID coming from the ‘Seeding query’. The id field must also be returned and mapped to the $(IDCOLUMN) parameter so that MCF can identify the IDs. As a starting point, you can rely on the provided demo query: “SELECT idfield AS $(IDCOLUMN), versionfield AS $(VERSIONCOLUMN) FROM documenttable WHERE idfield IN $(IDLIST)”. The $(IDLIST) parameter is mandatory: MCF replaces it with the list of IDs coming from the seeding query (not all IDs at once; the query is executed several times with different sets of IDs, so the results are somewhat ‘paginated’). For more explanations about the role of the version query during delta crawls, please consult this documentation: DB Connector Best Practices

  • Access Token query: Query that must return a field representing an access token of a document. An access token is a string value representing a right/group allowed to access the document. The access token field must be mapped to the $(TOKENCOLUMN) parameter so that MCF can identify it. The id field must also be returned and mapped to the $(IDCOLUMN) parameter so that MCF can identify the IDs. As a starting point, you can rely on the provided demo query: “SELECT docidfield AS $(IDCOLUMN), aclfield AS $(TOKENCOLUMN) FROM acltable WHERE docidfield IN $(IDLIST)”. As for the version query, the $(IDLIST) parameter is used to limit and ‘paginate’ results. This query is optional, since the crawl will still work without it, but it is MANDATORY when you want to implement security for the crawled documents. When security is activated, during a search the access tokens of documents are compared with the access tokens of the user performing the search, and only documents having at least one access token matching the user’s access token list are displayed in the search results. For more explanations about this query and how security works with database documents, please refer to this documentation: https://datafari.atlassian.net/wiki/x/D4C9oQ

  • Data query: Query that must return the main content field of the documents. The content type must be either binary or text, and the content field must be mapped to the $(DATACOLUMN) parameter so that MCF can identify it and retrieve the content of each document. The query must also return the id field mapped to the $(IDCOLUMN) parameter, as well as a field representing the unique URL of each document, mapped to the $(URLCOLUMN) parameter. As a starting point, you can rely on the provided demo query: “SELECT idfield AS $(IDCOLUMN), urlfield AS $(URLCOLUMN), datafield AS $(DATACOLUMN) FROM documenttable WHERE idfield IN $(IDLIST)”. As for the previous queries, the $(IDLIST) parameter is used to limit and ‘paginate’ the results

  • Source name: The provided source name will be used as a facet to enable result filtering by source/job

  • Repository name: The provided repository name will be used to name the repository connector and the job in MCF. This name MUST be unique: if another repository has the same name, the job creation will fail!

  • Security: Enable ACL enforcement for database documents. Requires the Access Token query to be set AND a corresponding authority connector to be configured (refer to this documentation: https://datafari.atlassian.net/wiki/x/D4C9oQ). If this box is not checked, all the crawled documents will be publicly accessible, so be careful!

  • Start the job once created: If this option is enabled, the job will be started automatically once created and the crawl will begin.
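
To make the four queries more concrete, here is a minimal sketch assuming a hypothetical MySQL schema: a ‘documents’ table holding the content (columns ‘id’, ‘last_modified’, ‘url’, ‘content’) and a ‘documents_acl’ table holding the permissions (columns ‘doc_id’, ‘allowed_group’). These table and column names are purely illustrative and must be adapted to your own schema:

    -- Seeding query: return the IDs of the documents to crawl
    -- ($(STARTTIME) and $(ENDTIME) are optional and enable delta crawls)
    SELECT id AS $(IDCOLUMN)
    FROM documents
    WHERE last_modified > $(STARTTIME) AND last_modified <= $(ENDTIME)

    -- Version query: return a value that changes whenever the document changes
    SELECT id AS $(IDCOLUMN), last_modified AS $(VERSIONCOLUMN)
    FROM documents
    WHERE id IN $(IDLIST)

    -- Access Token query (optional, but required when security is enabled):
    -- return the groups allowed to read each document
    SELECT doc_id AS $(IDCOLUMN), allowed_group AS $(TOKENCOLUMN)
    FROM documents_acl
    WHERE doc_id IN $(IDLIST)

    -- Data query: return the main content and the unique URL of each document
    SELECT id AS $(IDCOLUMN), url AS $(URLCOLUMN), content AS $(DATACOLUMN)
    FROM documents
    WHERE id IN $(IDLIST)

Note that the $(…) placeholders are MCF mapping parameters, not SQL syntax: MCF substitutes them at crawl time ($(IDLIST) in particular is replaced by successive batches of IDs coming from the seeding query), so they must be left as-is in the queries.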

Once all parameters are correctly filled in, you can save the job and manage it in the MCF admin UI!

From Datafari v6.2, some newer JDBC drivers may require valid certificates to access your database. When checking your job creation in the MCF admin UI, if you see errors such as “PKIX error” or “General OpenSslEngine Problem error”, it means your driver requires certificates to complete the database connection. In this interface you can bypass this requirement. See more here: https://datafari.atlassian.net/wiki/x/D4C9oQ