[DEPRECATED] Analytic Stack (Apache Zeppelin)

Valid from Datafari 5.3 up to 5.5


 

Before starting to use Apache Zeppelin notebooks, be sure to read section 3, “Apache Zeppelin Notebooks trap”.

Datafari 5.3 shifted from the Open Distro stack to Apache Zeppelin, as the latter is much less demanding in terms of resource consumption.

1. How does it work?

Where Open Distro required indexing the analytics data of Datafari into Elasticsearch, we can now index it in dedicated Solr indexes (they can be queried directly, as sketched after the list below):

  • Crawl: all logs related to the crawls (Enterprise Edition only)

  • Logs: all component logs (Cassandra, Solr, Tomcat, etc.) (Enterprise Edition only)

  • Access: all logs related to connections to the search UI

  • Statistics: all logs concerning searches performed by users

  • Monitoring: all logs concerning the corpus of documents (number of docs, file types, etc.)
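Since these are regular Solr indexes, they can be queried with standard Solr requests. As a purely illustrative sketch (the collection name “statistics”, the host and the port 8983 are assumptions that depend on your installation), the content of the statistics index could be inspected like this:

  # Hypothetical example: fetch the first 10 documents of the statistics index
  curl "http://localhost:8983/solr/statistics/select?q=*:*&rows=10"

This is a standard Solr select request; replace the collection name with the index you want to inspect (crawl, logs, access, statistics or monitoring).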

By indexing these logs into Solr instead of Elasticsearch, we cut the resource consumption and got rid of the security complexity!

We kept Logstash as the log shipper because Logstash OSS is open source, lightweight, and we already had all of the log parsers configured. We simply replaced the Elasticsearch output in the configuration with a Solr output, thanks to a Solr output plugin.
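For illustration only, the change boils down to swapping the output section of the Logstash pipeline. The sketch below assumes the community logstash-output-solr_http plugin and a hypothetical “logs” collection; the actual plugin and collection names bundled with Datafari may differ:

  output {
    # Former Elasticsearch output, now removed:
    # elasticsearch { hosts => ["localhost:9200"] }

    # Push the parsed log events into a dedicated Solr collection instead
    solr_http {
      solr_url => "http://localhost:8983/solr/logs"
    }
  }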

As for the Kibana dashboards, they have been migrated to Apache Zeppelin: each dashboard has been replaced with a notebook.

2. How to access and use Apache Zeppelin in Datafari

By default, Apache Zeppelin is automatically started/stopped when Datafari is started/stopped, unless you disabled it during the install phase by answering “no” to the question “Do you want to enable analytic stack (yes/no) [yes] ?”, or disabled it in the configuration file [DATAFARI_HOME]/tomcat/conf/datafari.properties by setting the parameter “AnalyticsActivation” to false.

If the analytics are disabled in an already installed Datafari, enable them by setting the parameter “AnalyticsActivation” to “true” in the configuration file [DATAFARI_HOME]/tomcat/conf/datafari.properties:

#Analytics
AnalyticsActivation=true

Then restart Datafari.
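For example, on a standard Linux installation (assuming the default [DATAFARI_HOME] of /opt/datafari and the usual start/stop scripts of your Datafari version), the restart could look like:

  cd /opt/datafari/bin
  bash stop-datafari.sh
  bash start-datafari.sh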

Once the analytics are enabled, Apache Zeppelin will be started and stopped synchronously with Datafari, and the notebooks that replaced the old dashboards will be accessible through the Datafari admin UI via the admin menu:

  • Usage Analysis → Corpus Analysis

  • Usage Analysis → Queries Analysis

  • System Analysis → Check problematic files (Enterprise Only)

  • System Analysis → Logs Analysis (Enterprise Only)

Once you are connected to one of the Apache Zeppelin notebooks, you can navigate through the different notebooks available thanks to the “Notebook” header menu:

3. Apache Zeppelin Notebooks trap

Unlike Kibana, Apache Zeppelin does not automatically refresh the notebook data. Please note that when a user accesses a notebook for the very first time, no data will be displayed. To refresh the data of a notebook, you MUST do it manually by clicking on the “Run all paragraph” button, located next to the notebook name at the top of the notebook:

Clicking on that button refreshes all of the notebook data. You must perform this operation the first time you open a notebook, and each time you want to refresh its data.

You can also refresh one paragraph at a time (a visualization is called a paragraph in Apache Zeppelin) by clicking on the “Run this paragraph” button in the top right corner of the paragraph:

 

4. Filtering the data of a paragraph

Unlike Kibana, a notebook does not let you filter data simply by clicking on a value. Instead, you need to directly modify the query of the paragraph (a visualization is called a paragraph in Apache Zeppelin), which is displayed above the visualization:

But you will need to be familiar with both the Apache Zeppelin Solr plugin syntax AND the Solr query syntax, which requires some skills! Also, every modification of those queries is applied and saved when you click on the “Run this paragraph” or “Run all paragraph” button: the original query is overwritten, and in case of a mistake the visualization, or even the whole notebook, may end up in an error state!
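As a purely illustrative example (the field names user and date below are hypothetical and depend on the index targeted by the paragraph), restricting a paragraph to the activity of one user over the last 7 days typically means adding standard Solr filter clauses to the existing query, for instance:

  q=*:*
  fq=user:jdoe
  fq=date:[NOW-7DAYS TO NOW]

The fq (filter query) and date range syntax are standard Solr query syntax; how exactly they are passed depends on the Apache Zeppelin Solr plugin syntax used in the paragraph, so keep a copy of the original query before editing it.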