How to crawl easily a database on Solr without DIH - DIH alternative

Solr does not contain anymore Data Import Handler (DIH) package anymore starting from Solr 9.

One of the main goals of DIH was to index databases easily into Solr.

In this page, we are going to explain how to do it with the open source solution Datafari which is an open source alternative of DIH. Indeed Datafari contains out of the box :

Apache ManifoldCF which is a framework of connectors (ManifoldCF is coded in Java and the output connection to Solr is configured out of the box into Datafari in SolrCloud mode so it is a smart client to index very quickly millions of documents into Solr)
Solr 8 to index data crawled from MCF
and the webapp Datafari with a user interface to search into the data

There are three main steps that we will detail :

Download and install Datafari
(optional) Add the JDBC driver that corresponds to your database (indeed we do not have the right to include it in Datafari due to licence issues for Oracle or MySQL for example) unless if you crawl a PostgreSQL database : the JDBC driver is already included)
NB : into the next version of Datafari (5.3), we will inclued by defaut the JDBC driver for Microsoft SQL server too because it has a MIT license
Create your crawl job

Right after these steps, you can begin to search into your data !

Let’s explain these steps :

For this tutorial, we use two instances on Scaleway :

GPS1 8x86 64 bit 32 Go RAM, 300 Go NVME with Debian 11

IP Datafari server : 51.158.69.126

IP MySQL server : 163.172.184.196

We install Datafari into one instance and MySQL server 8 into another instance. The installation of MySQL and the ingestion of a dataset into it are detailed into the annex of the page because it is out of scope of this page.

Download and install Datafari

Connect into SSH to your server

Download latest stable version of Datafari :

wget https://www.datafari.com/files/debian/datafari.deb

Install the dependencies of Datafari, to do that download the convenient script to install them automatically :

wget https://www.datafari.com/files/scripts_init_datafari/init_server_datafari_5_debian_10_plus.sh

Then execute the init script :

source init_server_datafari_5_debian_10_plus.sh

Install Datafari :

dpkg -i datafari.deb

init Datafari

cd /opt/datafari/bin
bash init-datafari.sh

Voila Datafari is installed and functional. You can connect to https://$IP_OF_YOUR_DATAFARI_SERVER/datafariui

In our example : https://51.158.69.126/datafariui

For more information see this page : Install Datafari - Community Edition

2. Install the JDBC driver into Datafari

In our example, we want to crawl a database into MySQL version 8.0.29.

We download the JDBC driver from Maven repository. The latest stable version is MySQL connector/J 8.0.29 : https://repo1.maven.org/maven2/mysql/mysql-connector-java/8.0.29/mysql-connector-java-8.0.29.jar

cd /opt/datafari/mcf/mcf_home/connector-lib-proprietary
wget https://repo1.maven.org/maven2/mysql/mysql-connector-java/8.0.29/mysql-connector-java-8.0.29.jar
chmod 775 /opt/datafari/mcf/mcf_home/connector-lib-proprietary/mysql*
chown datafari /opt/datafari/mcf/mcf_home/connector-lib-proprietary/mysql*
cp /opt/datafari/mcf/mcf_home/connector-lib-proprietary/mysql* /opt/datafari/tomcat-mcf/lib/
chmod 775 /opt/datafari/tomcat-mcf/lib/mysql*
chown datafari /opt/datafari/tomcat-mcf/lib/mysql*

Edit the file /opt/datafari/mcf/mcf_home/options.env.unix

nano /opt/datafari/mcf/mcf_home/options.env.unix

Add the path to the new lib in the -cp parameter line :

connector-lib-proprietary/mysql-connector-java-8.0.29.jar

The entire file (into Datafari 5.2) is this one :

-Xms512m
-Xmx512m
-Dorg.apache.manifoldcf.configfile=./properties.xml
-Duser.timezone=UTC
-Djute.maxbuffer=10000000
-Dorg.apache.manifoldcf.processid=A
-Djava.io.tmpdir=/opt/datafari/tmp
-cp .:lib/commons-codec-1.10.jar:connector-lib-proprietary/mysql-connector-java-8.0.29.jar:lib/commons-collections-3.2.2.jar:lib/mariadb-java-client-3.0.6.jar:/lib/commons-collections4-4.4.jar:lib/commons-el-1.0.jar:lib/commons-exec-1.3.jar:lib/commons-fileupload-1.3.3.jar:lib/commons-io-2.11.0.jar:lib/commons-lang-2.6.jar:lib/commons-lang3-3.12.0.jar:lib/commons-logging-1.2.jar:lib/ecj-4.3.1.jar:lib/gson-2.8.0.jar:lib/guava-25.1-jre.jar:lib/hsqldb-2.3.2.jar:lib/httpclient-4.5.13.jar:lib/httpcore-4.4.15.jar:lib/jasper-6.0.35.jar:lib/jasper-el-6.0.35.jar:lib/javax.activation-1.2.0.jar:lib/javax.activation-api-1.2.0.jar:lib/javax.mail-1.6.2.jar:lib/javax.servlet-api-3.1.0.jar:lib/jetty-client-9.4.25.v20191220.jar:lib/jetty-continuation-9.4.25.v20191220.jar:lib/jetty-http-9.4.25.v20191220.jar:lib/jetty-io-9.4.25.v20191220.jar:lib/jetty-jndi-9.4.25.v20191220.jar:lib/jetty-jsp-9.2.30.v20200428.jar:lib/jetty-jsp-jdt-2.3.3.jar:lib/jetty-plus-9.4.25.v20191220.jar:lib/jetty-schemas-3.1.M0.jar:lib/jetty-security-9.4.25.v20191220.jar:lib/jetty-server-9.4.25.v20191220.jar:lib/jetty-servlet-9.4.25.v20191220.jar:lib/jetty-util-9.4.25.v20191220.jar:lib/jetty-webapp-9.4.25.v20191220.jar:lib/jetty-xml-9.4.25.v20191220.jar:lib/jna-5.10.0.jar:lib/jna-platform-5.10.0.jar:lib/json-simple-1.1.1.jar:lib/jsp-api-2.1-glassfish-2.1.v20091210.jar:lib/juli-6.0.35.jar:lib/log4j-1.2-api-2.17.1.jar:lib/log4j-api-2.17.1.jar:lib/log4j-core-2.17.1.jar:lib/log4j-jcl-2.17.1.jar:lib/log4j-jul-2.17.1.jar:lib/log4j-slf4j-impl-2.17.1.jar:lib/mcf-agents.jar:lib/mcf-core.jar:lib/mcf-jetty-runner.jar:lib/mcf-pull-agent.jar:lib/mcf-script-engine.jar:lib/mcf-ui-core.jar:lib/postgresql-42.2.5.jar:lib/serializer-2.7.1.jar:lib/slf4j-api-1.7.36.jar:lib/velocity-engine-core-2.3.jar:lib/xalan-2.7.1.jar:lib/xercesImpl-2.10.0.jar:lib/xml-apis-1.4.01.jar:lib/zookeeper-3.4.10.jar

For more information, see this page : Connector - Add a JDBC connector (MySQL, Oracle)

Restart Datafari

You can do it either by the Datafari admin UI :

Go to the main server of Datafari, then click on Services Administration and Restart.

or you can also restart Datafari into SSH :

cd /opt/datafari/bin
bash stop-datafari.sh
bash start-datafari.sh

For more information about restart of Datafari, see this page : /wiki/spaces/DATAFARI/pages/111903130

The final step is to configure the crawling job.

3. Configure the job

Go to the admin Datafari UI then go to Connectors → Data Crawler Simplified mode and select Create Database job.

You will need to fill all the mandatory parameters to configure the job.

To get some information about the different fields see this page : Database crawling simplified configuration

In our example, we crawl a dump of Wikipedia pages.

This is the structure :

And some data :

The configuration is this one :

Database type : MySQL
Database host : 163.172.184.196
Database name : wiki
user : root
Password : admin
Seeding query :
```
SELECT page_id AS $(IDCOLUMN) FROM page
```

Version query

SELECT page_id AS $(IDCOLUMN), page_id AS $(VERSIONCOLUMN) FROM page WHERE page_id IN $(IDLIST)

Data query

SELECT page_id AS $(IDCOLUMN), page_id AS $(URLCOLUMN), page_title AS $(DATACOLUMN) FROM page WHERE page_id IN $(IDLIST)

Source name : db
Repository name : msqylrepo
Start he job once created : check the box
Finally click on Save button

Then you can check if all is ok into MCF : https://$IP_OF_YOUR_DATAFARI_SERVER/datafari-mcf-crawler-ui/

In our example : https://51.158.69.126/datafari-mcf-crawler-ui/

Depending the number of raws into your database, the job will be first in ‘starting up’ mode to get all the ids of the documents to crawl. Then the real job of crawling will start, the job will be in running mode.

If you need to do some modifications into your job, simply edit it then relaunch the job.

Finally go to Datafari and search into your data :

Go to https://$IP_OF_DATAFARI_SERVER/datafariui

In our example : https://51.158.69.126/datafariui

You have a complete search solution in alternative to DIH.

Annexes

We detail here the steps we did to install MySQL server into the Scaleway instance.

Connect into SSH into the instance

Get MySQL Server 8 :

apt update
wget https://dev.mysql.com/get/mysql-apt-config_0.8.22-1_all.deb
apt install ./mysql-apt-config_0.8.22-1_all.deb
 apt update
 apt install mysql-server

Check if MySQL is well started :
```
service mysql status
```

Get the SQL dump of English Wikipedia pages :

wget http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-page.sql.gz

Uncompress it :
```
gzip -d enwiki-latest-page.sql.gz
```
Create the database and change encoding :

mysql -uroot -p
CREATE DATABASE wiki; 
USE wiki;
ALTER DATABASE wiki CHARACTER SET = utf8mb4 COLLATE = utf8mb4_unicode_ci;

Import the data into the database :

mysql -u root -p wiki_en < enwiki-latest-page.sql

Change the configuration into MySQL to allow remote connection, to do so edit the file /etc/mysql/mysql.conf.d/mysqld.cnf :
```
nano /etc/mysql/mysql.conf.d/mysqld.cnf
```

And add this line at the end of the file :

bind-address            = 0.0.0.0

ie here we allow any IP to connect to the MySQL database, it is just for demo purpose here !

Create a new user into the database :

mysql -u root -p
CREATE USER 'datafari'@'51.158.69.126' IDENTIFIED BY 'admin';
GRANT ALL PRIVILEGES ON * . * TO 'datafari'@'51.158.69.126';

In this example, the name of the user is datafari and the password is admin. We allow datafari user to connect to MySQL database from the location of our Datafari server : 51.158.69.126. We granted all privileges to datafari user, once again it is just for demo purpose.