Info |
---|
Valid from 4.0The documentation below is valid from Datafari v4.0.0 upwards |
1. Repository/Authority connector configuration
When configuring either a JDBC authority connector or a JDBC repository connector, in the ‘Database Type’ tab you have to select an ‘Access Method’. The access method serves to find the columns in the resultsets by name or by label, . “by name” will search for original column name whereas the “by label” option will search for the column label. A column label is by default the original column name if there is no mapping (e.g ‘SELECT id AS doc_id’ is a mapping of the original column name ‘id’ to the mapping name ‘doc_id’), if a mapping is defined, then the label will be the mapping name. In our case we definitely want to use labels instead of original column names, so ALWAYS USE the “by label” access method !
...
2. Job queries
When creating a DB job in the ManifoldCF interface, four sql queries are expected by the job as part of the parameters. We detail two of them below because they may be challenging to configure :
a. Seeding query
When defining the "Seeding query" in a job involving a JDBC connector, MCF introduces two useful variables to deal with delta crawls:
$STARTTIME : Which represents the timestamp in MILLISECONDS of the start time of the last FINISHED run of the job. It means that the first time you will run the job, this variable will be equal to 0 and will have a value only once the job has run (and is finished with a "done" status) at least once.
$ENDTIME : Which represents the timestamp in MILLISECONDS of the end time of the last FINISHED run of the job. It means that the first time you will run the job, this variable will be equal to 0 and will have a value only once the job has run (and is finished with a "done" status) at least once.
Note |
---|
Timestamp formatsUsually, these variables are very useful to compare them to the timestamp of the last modified date of a document. But be careful, the timestamp provided is expressed in milliseconds and not in seconds like a standard Unix timestamp ! Keep it in mind because in most cases you will need to adjust either the timestamp on the Database side or divide the timestamp provided by MCF by 1000 to convert it into a Unix format !format. |
b. Version check query
This query is very important because it will generate a VERSIONCOLUMN variable for each crawled document that will be associated to the doc ID and will be used by MCF to determine during a delta crawl if the document needs to be re-indexed or not. Basically the best practice is to bind the VERSIONCOLUMN to a table column of a document that is updated each time a modification is performed on the document. If you cannot have such a column in your DB schema, what we advise is to create a function on Database side that will generate a hash for a document, based on each column value. With such a function, if any of the values of a DB document changes, the hash will also change. Now if you set this DB function as the VERSIONCOLUMN (via a standard SQL function call), MCF will know the document has to be re-indexed.
c. Data query
This query is used to gather document main content but also other metadata (if metadata are available in the same table as the content). three MCF parameters are required in the SELECT clause:
$(IDCOLUMN): which must be mapped (as label) with the id of the document in the queried table. This parameter is only used within MCF as internal document id but WILL NOT BE USED AS “id” OF THE DOCUMENT IN Datafari !
$(URLCOLUMN): MCF will use the column mapped to this parameter as id for the document ! This is where you need to be very careful with Datafari as the name of this parameter is misleading. IT WILL NOT DETERMINE the “url” filed in Datafari but the “id” field ! The “url” field of a document in Datafari is automatically filled with the “id” field if the document does not already have an “url” metadata !
$(DATACOLUMN): This parameter must be mapped (as label) to the table column representing the main content of a document
Those three parameters are required by MCF but the SELECT clause can retrieve as many columns from the table as you want ! Each column selected by the SELECT clause will constitute a metadata of the final document sent to Datafari with the column name (or the label if the column is labelled) as metadata name.
For example, considering the following table named “docs” :
ID | document_id | url | document_title | content |
---|---|---|---|---|
1 | doc_1 | https://mycompanie.com/doc_1 | advertisement.pdf | This is an advertise |
2 | doc_2 | january_report.doc | The January sales are good |
If we define the following data query:
Code Block | ||
---|---|---|
| ||
SELECT ID AS $(IDCOLUMN), document_id AS $(URLCOLUMN), content AS $(DATACOLUMN), url, document_title AS title FROM docs WHERE ID IN $(IDLIST) |
Then, once crawled we will retrieve the following documents in Datafari:
Code Block | ||
---|---|---|
| ||
[
{
"id": "doc_1",
"content": [
"This is an advertise"
],
"url": "https://mycompanie.com/doc_1",
"title": [
"advertisement.pdf"
]
},
{
"id": "doc_2",
"content": [
"The January sales are good"
],
"url": "https://mycompanie.com/doc_2",
"title": [
"january_report.doc"
]
}
] |
Notice that the $(URLCOLUMN)
parameter became the “id” of the document, the “title” field corresponds to the document_title
column as it has been labelled “title” in the query, and the “url” field corresponds to the url
column !
If we would not have selected the url
column in the WHERE
clause, then the documents in Datafari would have been like that:
Code Block | ||
---|---|---|
| ||
[
{
"id": "doc_1",
"content": [
"This is an advertise"
],
"url": "doc_1",
"title": [
"advertisement.pdf"
]
},
{
"id": "doc_2",
"content": [
"The January sales are good"
],
"url": "doc_2",
"title": [
"january_report.doc"
]
}
] |
Notice that the “url” field now corresponds to the “id” field. This is because as the documents don’t have an “url” metadata unlike the previous time, Datafari automatically sets the “id” field value to the “url” field value.