This tutorial is based on Datafari 5.2, but the same procedure can be applied to more recent versions.

Starting from Solr 9, Solr no longer contains the Data Import Handler (DIH) package. Yet it had a large user base, which is now looking for alternatives.

One of the main goals of DIH was to index databases easily into Solr.

In this page, we explain how to do it with Datafari Community Edition, the open source version of Datafari Enterprise Search, which is an open source alternative to DIH. Thanks to several years of work, Datafari seamlessly integrates much more than what you will need to replace DIH, but we focus here on just replacing DIH in a scenario with Solr. Out of the box, and focusing on our DIH replacement scenario, Datafari 5.2 contains :

  • Apache ManifoldCF which is a framework of connectors (ManifoldCF is coded in Java and the output connection to Solr is configured out of the box into Datafari in SolrCloud mode so it is a smart client to index very quickly millions of documents into Solr)

  • Solr 8 to index the data crawled by MCF

  • and the Datafari webapp, with a user interface to search the data (as a bonus, since it is not needed in the DIH replacement scenario)

There are three main steps that we will detail :

  1. Download and install Datafari

  2. (optional) Add the JDBC driver that corresponds to your database. Due to licence issues, we do not have the right to include some drivers in Datafari (MariaDB or MySQL for example). This step is not needed if you are crawling a PostgreSQL database, in which case the JDBC driver is already included. NB : the next version of Datafari (5.3) will include by default the JDBC drivers for Microsoft SQL Server and Oracle.

  3. Create your crawl job

Right after these steps, you can begin to search your data !

...

For this tutorial, we use two instances on Scaleway :

  • GPS1 8 x86 64-bit, 32 GB RAM, 300 GB NVMe, with Debian 11

  • IP Datafari server : 51.158.69.126

  • IP MySQL server : 163.172.184.196

We will install Datafari on one instance, and MySQL Server 8 on the other one. Note that the installation of MySQL and the ingestion of a dataset into it are out of the scope of this tutorial, and as such they are detailed in an annex.

1. Download and install Datafari

  • Connect to your server via SSH

  • Download the latest stable version of Datafari :

    Code Block
    wget https://www.datafari.com/files/debian/datafari.deb
  • To install the dependencies of Datafari, download our convenient script, which installs them automatically :

    Code Block
    wget https://www.datafari.com/files/scripts_init_datafari/init_server_datafari_5_debian_10_plus.sh
  • Now execute the init script :

    Code Block
    source init_server_datafari_5_debian_10_plus.sh
  • Install Datafari :

    Code Block
    dpkg -i datafari.deb
  • Initialize Datafari :

    Code Block
    cd /opt/datafari/bin
    bash init-datafari.sh

And voilà! Datafari is installed and functional. You can connect to https://$IP_OF_YOUR_DATAFARI_SERVER/datafariui

In our example : https://51.158.69.126/datafariui

For more information see this page : Install Datafari - Community Edition
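If you prefer to confirm from the command line that the installation answers, a minimal sketch along the following lines could be used. It assumes curl is available on your machine and that a fresh install serves a certificate curl cannot verify (hence the -k option). The function names are ours and are not part of Datafari.

```shell
# Build the DatafariUI URL for a given server IP or host name.
datafari_ui_url() {
  printf 'https://%s/datafariui\n' "$1"
}

# Return 0 if the UI answers with an HTTP status below 400.
# -k skips certificate verification (assumption: self-signed certificate).
check_datafari_ui() {
  curl -k -s -f -o /dev/null --max-time 10 "$(datafari_ui_url "$1")"
}

# Usage, with the IP of this tutorial:
#   check_datafari_ui 51.158.69.126 && echo "Datafari UI is up"
```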

2. Install the JDBC driver into Datafari

In our example, we want to crawl a MySQL database, version 8.0.29.

...

Code Block
cd /opt/datafari/mcf/mcf_home/connector-lib-proprietary
wget https://repo1.maven.org/maven2/mysql/mysql-connector-java/8.0.29/mysql-connector-java-8.0.29.jar
chmod 775 /opt/datafari/mcf/mcf_home/connector-lib-proprietary/mysql*
chown datafari /opt/datafari/mcf/mcf_home/connector-lib-proprietary/mysql*
cp /opt/datafari/mcf/mcf_home/connector-lib-proprietary/mysql* /opt/datafari/tomcat-mcf/lib/
chmod 775 /opt/datafari/tomcat-mcf/lib/mysql*
chown datafari /opt/datafari/tomcat-mcf/lib/mysql*
  • Edit the file /opt/datafari/mcf/mcf_home/options.env.unix

Code Block
nano /opt/datafari/mcf/mcf_home/options.env.unix

Add the path to the new lib in the -cp parameter line :

Code Block
connector-lib-proprietary/mysql-connector-java-8.0.29.jar

...

For more information, see this page : Connector - Add a JDBC connector (MySQL, Oracle, etc)
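Before restarting, it can be worth checking that the driver is present in both locations and referenced in options.env.unix. The following is only a sketch of such a check: the function name is ours, the default paths are the ones used in this tutorial, and the test on options.env.unix is a plain substring search that makes no assumption about the exact layout of the -cp line.

```shell
# Check that the JDBC driver jar is where the tutorial puts it.
# $1: MCF home, $2: Tomcat MCF lib directory, $3: jar file name
# (all optional, defaults taken from this tutorial).
check_jdbc_driver() {
  mcf_home=${1:-/opt/datafari/mcf/mcf_home}
  tomcat_lib=${2:-/opt/datafari/tomcat-mcf/lib}
  jar=${3:-mysql-connector-java-8.0.29.jar}
  status=0
  if [ ! -f "$mcf_home/connector-lib-proprietary/$jar" ]; then
    echo "missing: $mcf_home/connector-lib-proprietary/$jar"
    status=1
  fi
  if [ ! -f "$tomcat_lib/$jar" ]; then
    echo "missing: $tomcat_lib/$jar"
    status=1
  fi
  # Substring search only: the jar path must appear somewhere in the file.
  if ! grep -q -F "connector-lib-proprietary/$jar" "$mcf_home/options.env.unix" 2>/dev/null; then
    echo "not referenced in: $mcf_home/options.env.unix"
    status=1
  fi
  return "$status"
}

# Usage on the Datafari server:
#   check_jdbc_driver && echo "JDBC driver looks correctly installed"
```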

  • Restart Datafari. You have 2 options to do it :

    • Option 1 - Via the Datafari admin UI : go to the main server of Datafari, then click on Services Administration and Restart.

...

    • Option 2 - By restarting Datafari via SSH :

Code Block
cd /opt/datafari/bin
bash stop-datafari.sh
bash start-datafari.sh

For more information about restarting Datafari, see this page : /wiki/spaces/DATAFARI/pages/111903130
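A restart takes some time, so a script that chains the restart with the next step may want to wait until the services answer again. Here is a minimal sketch of a generic retry helper. The function name, the number of attempts and the delay are our own choices, and probing the MCF crawler UI with curl is an assumption about what "being back up" means for your setup.

```shell
# Retry a command until it succeeds.
# $1: maximum number of attempts (at least one attempt is always made),
# $2: delay in seconds between attempts,
# remaining arguments: the command to run.
wait_until() {
  max_tries=$1
  delay=$2
  shift 2
  attempt=1
  while :; do
    "$@" && return 0
    [ "$attempt" -ge "$max_tries" ] && return 1
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# Usage, with the IP of this tutorial (30 attempts, 10 seconds apart):
#   wait_until 30 10 curl -k -s -f -o /dev/null \
#     https://51.158.69.126/datafari-mcf-crawler-ui/
```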

The final step is to configure the crawling job.

3. Configure the job

Go to the Datafari admin UI, then go to Connectors → Data Crawler Simplified mode and select Create Database job.

...

You will need to fill in all the mandatory parameters to configure the job.

...

  • Database type : MySQL

  • Database host : 163.172.184.196

  • Database name : wiki

  • User : root

  • Password : admin

  • Seeding query :

    Code Block
    SELECT page_id AS $(IDCOLUMN) FROM page
  • Version query :

    Code Block
    SELECT page_id AS $(IDCOLUMN), page_id AS $(VERSIONCOLUMN) FROM page WHERE page_id IN $(IDLIST)
  • Data query :

    Code Block
    SELECT page_id AS $(IDCOLUMN), page_id AS $(URLCOLUMN), page_title AS $(DATACOLUMN) FROM page WHERE page_id IN $(IDLIST)
  • Source name : db

  • Repository name : msqylrepo

  • Start the job once created : check the box

  • Finally, click on the Save button
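The $(IDCOLUMN), $(VERSIONCOLUMN), $(URLCOLUMN), $(DATACOLUMN) and $(IDLIST) placeholders are substituted by ManifoldCF at crawl time, so the queries above cannot be pasted as-is into a MySQL client. To try them by hand before saving the job, a sketch like the following could be used. The function name and the replacement aliases (id_col, version_col, url_col, data_col) are arbitrary choices of ours for manual testing, not the names ManifoldCF uses internally, and the id list is a sample you provide.

```shell
# Replace the ManifoldCF placeholders of a query so it can be run by hand.
# $1: query containing the placeholders
# $2: sample id list, for example "(10,12)" (defaults to "(1)")
expand_mcf_query() {
  ids=${2:-"(1)"}
  printf '%s\n' "$1" | sed \
    -e 's/\$(IDCOLUMN)/id_col/g' \
    -e 's/\$(VERSIONCOLUMN)/version_col/g' \
    -e 's/\$(URLCOLUMN)/url_col/g' \
    -e 's/\$(DATACOLUMN)/data_col/g' \
    -e 's/\$(IDLIST)/'"$ids"'/g'
}

# Usage (host and database of this tutorial, user created in the annex):
#   q='SELECT page_id AS $(IDCOLUMN), page_id AS $(VERSIONCOLUMN) FROM page WHERE page_id IN $(IDLIST)'
#   mysql -h 163.172.184.196 -u datafari -p wiki -e "$(expand_mcf_query "$q" '(10,12)')"
```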

Then you can check that everything is OK in your MCF : https://$IP_OF_YOUR_DATAFARI_SERVER/datafari-mcf-crawler-ui/

In our example : https://51.158.69.126/datafari-mcf-crawler-ui/

...

Depending on the number of rows in your database, the job will first be in ‘starting up’ mode to get all the ids of the documents to crawl. Only then will the “real” act of crawling start, and the job will be in “running” mode.

If you need to modify your job, simply edit it then relaunch the job.

Finally, go to Datafari and search your data (this is optional : you do not need DatafariUI, and you can keep doing what you did before when you were combining DIH and Solr) :

Go to https://$IP_OF_YOUR_DATAFARI_SERVER/datafariui

In our example : https://51.158.69.126/datafariui
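If you would rather query Solr directly, as with a DIH setup, the standard Solr select handler can be called with curl. The sketch below only builds the URL. The host, port and collection name are placeholders that you must replace with the values of your own deployment (they are not given in this tutorial), and Solr may only be reachable from the Datafari server itself.

```shell
# Build a Solr select URL.
# $1: Solr host and port, $2: collection name,
# $3: query, already URL-encoded if needed, $4: number of rows (default 10)
solr_select_url() {
  printf 'http://%s/solr/%s/select?q=%s&rows=%s\n' "$1" "$2" "$3" "${4:-10}"
}

# Usage (hypothetical host, port and collection name):
#   curl -s "$(solr_select_url localhost:8983 MyCollection '*:*' 5)"
```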

...

You now have a complete search solution as an alternative to DIH, based on Apache ManifoldCF and Apache Solr.

Annexes

We detail here the steps we followed to install MySQL Server on the Scaleway instance.

  • Connect to the instance via SSH

  • Get MySQL Server 8 :

    Code Block
    apt update
    wget https://dev.mysql.com/get/mysql-apt-config_0.8.22-1_all.deb
    apt install ./mysql-apt-config_0.8.22-1_all.deb
    apt update
    apt install mysql-server
  • Check that MySQL started correctly :

    Code Block
    service mysql status
  • Get the SQL dump of English Wikipedia pages :

    Code Block
    wget http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-page.sql.gz
  • Uncompress it :

    Code Block
    gzip -d enwiki-latest-page.sql.gz
  • Create the database and change encoding :

    Code Block
    mysql -uroot -p
    CREATE DATABASE wiki; 
    USE wiki;
    ALTER DATABASE wiki CHARACTER SET = utf8mb4 COLLATE = utf8mb4_unicode_ci;

...

  • Import the data into the database :

    Code Block
    mysql -u root -p wiki < enwiki-latest-page.sql
  • Change the MySQL configuration to allow remote connections. To do so, edit the file /etc/mysql/mysql.conf.d/mysqld.cnf :

    Code Block
    nano /etc/mysql/mysql.conf.d/mysqld.cnf

...

  • Create a new user in the database :

    Code Block
    mysql -u root -p
    CREATE USER 'datafari'@'51.158.69.126' IDENTIFIED BY 'admin';
    GRANT ALL PRIVILEGES ON *.* TO 'datafari'@'51.158.69.126';

In this example, the name of the user is datafari and the password is admin. We allow the datafari user to connect to the MySQL database from the location of our Datafari server : 51.158.69.126. We granted all privileges to the datafari user; once again, this is for demo purposes only.
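Outside of a demo, a narrower grant is usually preferable. Assuming the crawl only needs to read the database (the three queries of the job are all SELECT statements), a sketch like this one prints a read-only grant that could replace the GRANT ALL above. The function name is ours.

```shell
# Print a read-only GRANT statement.
# $1: MySQL user, $2: host allowed to connect, $3: database name
print_readonly_grant() {
  printf "GRANT SELECT ON %s.* TO '%s'@'%s';\n" "$3" "$1" "$2"
}

# Usage on the MySQL server, with the values of this tutorial:
#   print_readonly_grant datafari 51.158.69.126 wiki | mysql -u root -p
```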