Info

This tutorial is based on Datafari 5.2, but the same procedure can be applied to more recent versions.

Starting with Solr 9, Solr no longer includes the Data Import Handler (DIH) package. Furthermore, as of July 2022, no one has really committed to maintaining and updating it regularly. Yet it had a large user base, which is now looking for alternatives.

...

There are three main steps :

  1. Download and install Datafari

  2. (optional - not needed for PostgreSQL, whose JDBC driver is already included) Add the JDBC driver that corresponds to your database. We do not have the right to bundle some drivers with Datafari due to licence issues, for example those for MariaDB or MySQL.
    NB : The next version of Datafari (5.3) will include by default the JDBC drivers for Microsoft SQL Server and Oracle.

  3. Create your crawl job

Right after these steps, you can start searching your data!

...

  • Connect to your server via SSH

  • Download the latest stable version of Datafari :

    Code Block
    wget https://www.datafari.com/files/debian/datafari.deb
  • To install the dependencies of Datafari, download our convenient script to install them automatically :

    Code Block
    wget https://www.datafari.com/files/scripts_init_datafari/init_server_datafari_5_debian_10_plus.sh
  • Now execute the init script :

    Code Block
    source init_server_datafari_5_debian_10_plus.sh
  • Install Datafari :

    Code Block
    dpkg -i datafari.deb
  • Initialize Datafari :

    Code Block
    cd /opt/datafari/bin
    bash init-datafari.sh

And voilà! Datafari is installed and functional. You can connect to https://$IP_OF_YOUR_DATAFARI_SERVER/datafariui

In our example : https://51.158.69.126/datafariui

For more information see this page : Install Datafari - Community Edition

...

Code Block
cd /opt/datafari/mcf/mcf_home/connector-lib-proprietary
wget https://repo1.maven.org/maven2/mysql/mysql-connector-java/8.0.29/mysql-connector-java-8.0.29.jar
chmod 775 /opt/datafari/mcf/mcf_home/connector-lib-proprietary/mysql*
chown datafari /opt/datafari/mcf/mcf_home/connector-lib-proprietary/mysql*
cp /opt/datafari/mcf/mcf_home/connector-lib-proprietary/mysql* /opt/datafari/tomcat-mcf/lib/
chmod 775 /opt/datafari/tomcat-mcf/lib/mysql*
chown datafari /opt/datafari/tomcat-mcf/lib/mysql*
  • Edit the file /opt/datafari/mcf/mcf_home/options.env.unix

Code Block
nano /opt/datafari/mcf/mcf_home/options.env.unix

Add the path to the new lib in the -cp parameter line :

Code Block
connector-lib-proprietary/mysql-connector-java-8.0.29.jar
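
Before restarting, it can be worth confirming that the driver is where MCF expects it. The following is a minimal sketch, assuming bash: check_jdbc_jar is a hypothetical helper (not part of Datafari) that checks that the jar exists in both directories used by the copy commands above and that its path appears somewhere in options.env.unix. It does not validate the exact layout of the -cp line.

```shell
#!/usr/bin/env bash
# Sketch: confirm that a JDBC driver jar is in place and referenced by MCF.
# check_jdbc_jar is a hypothetical helper, not something shipped with Datafari.
check_jdbc_jar() {
  local jar="$1"                                        # e.g. mysql-connector-java-8.0.29.jar
  local mcf_home="${2:-/opt/datafari/mcf/mcf_home}"     # default taken from this tutorial
  local tomcat_lib="${3:-/opt/datafari/tomcat-mcf/lib}" # default taken from this tutorial
  local dir status=0

  # The jar must exist in both locations used by the copy commands.
  for dir in "$mcf_home/connector-lib-proprietary" "$tomcat_lib"; do
    if [ ! -f "$dir/$jar" ]; then
      echo "missing: $dir/$jar"
      status=1
    fi
  done

  # The jar path must also be mentioned in options.env.unix.
  if ! grep -qF "connector-lib-proprietary/$jar" "$mcf_home/options.env.unix" 2>/dev/null; then
    echo "not referenced in: $mcf_home/options.env.unix"
    status=1
  fi

  return "$status"
}
```

After sourcing the function, calling `check_jdbc_jar mysql-connector-java-8.0.29.jar` prints one line per problem found and returns a non-zero status if anything is missing.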

...

For more information, see this page : Connector - Add a JDBC connector (MySQL, Oracle, etc)

  • Restart Datafari. There are two ways to do it :

    • Option 1 - Via the Datafari admin UI : Go to the main server of Datafari, then click on Services Administration and Restart.

    • Option 2 - By restarting Datafari via SSH :

...

  • Database type : MySQL

  • Database host : 163.172.184.196

  • Database name : wiki

  • user : root

  • Password : admin

  • Seeding query :

    Code Block
    SELECT page_id AS $(IDCOLUMN) FROM page
  • Version query

    Code Block
    SELECT page_id AS $(IDCOLUMN), page_id AS $(VERSIONCOLUMN) FROM page WHERE page_id IN $(IDLIST)
  • Data query

    Code Block
    SELECT page_id AS $(IDCOLUMN), page_id AS $(URLCOLUMN), page_title AS $(DATACOLUMN) FROM page WHERE page_id IN $(IDLIST)
  • Source name : db

  • Repository name : msqylrepo

  • Start the job once created : check the box

  • Finally, click on the Save button
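
The three queries above rely on placeholders such as $(IDCOLUMN) and $(IDLIST) that the connector fills in at crawl time, so they cannot be pasted into a MySQL session as they are. The following is a minimal sketch, assuming bash and sed: expand_query is a hypothetical helper that swaps the placeholders for plain aliases and a literal ID list so that each query can be tried by hand first. The alias names (id_col, version_col, url_col, data_col) and the sample ID list are illustrative assumptions, not the values MCF actually substitutes.

```shell
#!/usr/bin/env bash
# Sketch: expand the connector placeholders so a query can be tested by hand.
# expand_query and the alias names are hypothetical, chosen for illustration.
expand_query() {
  local query="$1"    # query text containing $(...) placeholders
  local idlist="$2"   # literal ID list, e.g. "(1,2,3)"; assumed to contain no "/" or "&"
  printf '%s\n' "$query" | sed \
    -e 's/\$(IDCOLUMN)/id_col/g' \
    -e 's/\$(VERSIONCOLUMN)/version_col/g' \
    -e 's/\$(URLCOLUMN)/url_col/g' \
    -e 's/\$(DATACOLUMN)/data_col/g' \
    -e "s/\\\$(IDLIST)/${idlist}/g"
}

# Example: print a runnable form of the version query for three sample IDs.
expand_query 'SELECT page_id AS $(IDCOLUMN), page_id AS $(VERSIONCOLUMN) FROM page WHERE page_id IN $(IDLIST)' '(1,2,3)'
```

The printed statement can then be run in a mysql session on the wiki database to check that it returns rows before the job is saved.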

Then you can check that everything is OK in MCF : https://$IP_OF_YOUR_DATAFARI_SERVER/datafari-mcf-crawler-ui/

In our example : https://51.158.69.126/datafari-mcf-crawler-ui/

...

Finally, go to Datafari and search your data. This step is optional: you do not need DatafariUI, and you can keep querying Solr as you did when combining it with DIH.

Go to https://$IP_OF_YOUR_DATAFARI_SERVER/datafariui

In our example : https://51.158.69.126/datafariui

...

  • Connect to the instance via SSH

  • Get MySQL Server 8 :

    Code Block
    apt update
    wget https://dev.mysql.com/get/mysql-apt-config_0.8.22-1_all.deb
    apt install ./mysql-apt-config_0.8.22-1_all.deb
    apt update
    apt install mysql-server
  • Check that MySQL has started correctly :

    Code Block
    service mysql status
  • Get the SQL dump of English Wikipedia pages :

    Code Block
    wget http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-page.sql.gz
  • Uncompress it :

    Code Block
    gzip -d enwiki-latest-page.sql.gz
  • Create the database and change encoding :

    Code Block
    mysql -uroot -p
    CREATE DATABASE wiki; 
    USE wiki;
    ALTER DATABASE wiki CHARACTER SET = utf8mb4 COLLATE = utf8mb4_unicode_ci;

...

  • Import the data into the database :

    Code Block
    mysql -u root -p wiki < enwiki-latest-page.sql
  • Change the MySQL configuration to allow remote connections. To do so, edit the file /etc/mysql/mysql.conf.d/mysqld.cnf :

    Code Block
    nano /etc/mysql/mysql.conf.d/mysqld.cnf

...

  • Create a new user in the database :

    Code Block
    mysql -u root -p
    CREATE USER 'datafari'@'51.158.69.126' IDENTIFIED BY 'admin';
    GRANT ALL PRIVILEGES ON *.* TO 'datafari'@'51.158.69.126';

In this example, the user is named datafari and its password is admin. The datafari user is allowed to connect to the MySQL database only from the address of our Datafari server : 51.158.69.126. We granted all privileges to the datafari user; once again, this is for demo purposes only.
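
Outside of a demo, a narrower grant is usually preferable. The following is a minimal sketch, assuming bash and assuming that the crawl only needs to read from the database: readonly_grant_sql is a hypothetical helper that prints the SQL for an account limited to SELECT on a single database. The function name and its parameters are illustrative, and the generated SQL should be reviewed before being run.

```shell
#!/usr/bin/env bash
# Sketch: print SQL for a read-only crawl account instead of ALL PRIVILEGES.
# readonly_grant_sql is a hypothetical helper; values are assumed to contain
# no quote characters, since no escaping is performed.
readonly_grant_sql() {
  local user="$1" host="$2" password="$3" database="$4"
  printf "CREATE USER '%s'@'%s' IDENTIFIED BY '%s';\n" "$user" "$host" "$password"
  printf "GRANT SELECT ON %s.* TO '%s'@'%s';\n" "$database" "$user" "$host"
}

# Example with the values used in this tutorial (prints the SQL only).
readonly_grant_sql datafari 51.158.69.126 admin wiki
```

The printed statements can be pasted into the mysql session opened with `mysql -u root -p`, in place of the CREATE USER and GRANT ALL PRIVILEGES statements.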