Valid from Datafari v6 onwards
This documentation is valid from Datafari v6 onwards
The CSV JDBC driver for the JDBC connector differs a little bit from the other drivers in the way it should be configured to properly work
1. Database connection
Unlike the other drivers, it does not require a database host and/or port, a database name, a user and a password, even if, unfortunately, those parameters are mandatory for the JDBC connector. So you must set something for those parameters anyway, whatever you want BUT an empty string !
The only parameter that is really important and that matters, is the “Database raw connection string”. The CSV JDBC driver must connect to a local folder (local on the machine where the job using the CSV driver will run), containing CSV files you want to crawl. So the “Database raw connection string” must be set with the absolute path of the local folder containing the CSV files to crawl, for example ‘/home/francelabs/csv’. The specified folder and all the files it contains must also have read permissions for the user running the MCF instance, which is ‘datafari’ by default on a standard Datafari installation. We recommend the files to be at the root of the specified folder, we have not tested the behavior with subfolders.
2. CSV column separator
Currently, the separator used for columns in the CSV files is not configurable, IT MUST BE a comma char: ,
3. Queries
Concerning the queries, there are some things to understand in order to build working ones.
First, the tables names MUST correspond to the CSV files names in the folder that you have configured in the database connection. For example, considering you have the following files in the folder configured in section 1:
accounts.csv persons.csv projects.csv
And assuming the persons.csv file contains a column named ‘id’ and you want to build a query that selects this column. Then you will build the following query:
SELECT id FROM persons;
As long as the persons.csv file is present in the folder configured as database connection and that the id column is referenced within the file, it will work.
Next, obviously, the files you want to perform queries on must be readable for the user that is running the MCF instance that runs the job. By default in Datafari it is the ‘datafari’ user.
Third, each CSV file that you want to query MUST contain as first line the column descriptions like this:
id,title,content,url 1,title 1,This is the content of document 1,/home/francelabs/test_csv/test.csv 2,title 2,This is the content of document 2,/home/francelabs/test_csv/test.csv
The column description line will be used by the driver to determine the columns names and you will be able to use them in your queries.
Last, if you set labels for columns in your query, the driver will use the labels for the WHERE clause. So if you label a column with another name, and want to set a WHERE clause on the same column, you will need to use its label in the WHERE clause. For example, if we take the previous CSV file content and want to build a query that use the label ‘doc_id’ for the ‘id’ column and define a WHERE clause to only get the id which is equal to '1', then we will build it like this:
SELECT id as doc_id WHERE doc_id=1;
Notice that we directly used the ‘doc_id’ label in the WHERE clause.
This is really important, specifically when you need to use MCF variables as labels in the different types of queries specified in the Database crawling simplified configuration documentation ! The Version query, Access Token query and Data query are affected by this behavior as they use the “idfield” in the where clause !
For example, here is the default data query:
SELECT idfield AS $(IDCOLUMN), urlfield AS $(URLCOLUMN), datafield AS $(DATACOLUMN) FROM documenttable WHERE idfield IN $(IDLIST)
Assuming we have a table named “documenttable” with “idfield”, “urlfield” and “datafield” as columns, this data query will not work as it is ! because the “idfield” is labeled $(IDCOLUMN)
and the WHERE
clause refers to “idfield”, not $(IDCOLUMN)
. This query will cause the following type of error during execution:
Caused by: java.sql.SQLException: Invalid column name: idfield
Instead, the correct data query would be:
SELECT idfield AS $(IDCOLUMN), urlfield AS $(URLCOLUMN), datafield AS $(DATACOLUMN) FROM documenttable WHERE $(IDCOLUMN) IN $(IDLIST)