Connecting to CDH-Cloudera

Zoomdata can connect to CDH (Cloudera's Distribution including Apache Hadoop), Cloudera's open source Apache Hadoop platform. CDH provides unified batch processing, interactive SQL, interactive search, and role-based access controls. In addition, it offers enterprise-grade continuous availability. Specifically, Zoomdata connects to CDH's fault-tolerant storage system, the Hadoop Distributed File System (HDFS). Keep in mind that to connect to CDH, you first need to enable Spark in Zoomdata.

The HDFS connector is compatible with CDH versions 4 and 5.

Before setting up the CDH-Cloudera connection, you must configure an embedded Spark server in Zoomdata.

After you have enabled Spark, you can configure the HDFS connector.

To learn more about Spark functionality and how it is used and enabled in Zoomdata, see How Zoomdata Uses Apache Spark.

CONFIGURING THE HDFS CONNECTOR

To configure the connector, perform the following steps:

  1. Log in to Zoomdata.
    Administrators and users with appropriate access privileges can connect data sources in Zoomdata.
  2. Click the Sources menu item.

Figure 1

  3. Click the HDFS connector icon.
  4. Specify the name of your source and add a description (if desired).

Figure 2

  5. Click Next.
  6. On the File Path page, specify the path to the remote file that you want to upload into Zoomdata.
    To use the first row of your data source as the column names, select the Read Headers checkbox.
    In the corresponding field, specify the value separator used in your data source. Standard separators include commas (,) and semicolons (;).
    Click Preview. From the Entries list, select the number of entries to display in the preview.

Figure 3

  7. In the Preview section, you can configure field properties. Click Next.
    After you click Next, loading the dataset into memory can take from several minutes to over half an hour, depending on its size.
  8. On the Fields page, create unique label names, as needed, for each Label field. If necessary, change the Type and Default options and select the checkboxes in the Distinct Count column. If you do not want to use specific fields from the data source, clear their checkboxes in the Visible column. Configure the Filter Display settings for the required fields.
    You can also add calculations in the Calculations section.
    Click Next.

Figure 4

  9. On the Refresh page, you can schedule asynchronous jobs to refresh fields in your data source. Refer to the Using the Zoomdata Scheduler article for more information.
  10. On the Charts page, you can enable the charts that will be available for the data source and edit their settings.
    For example, you can select the styles that will be available for the data source and change the global default settings.
    Learn more about how to customize a chart.
    Click Finish to save your changes.

Figure 5
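
To check a file locally before pointing the connector at it, the effect of the Read Headers checkbox, the value separator, and the Entries preview limit, along with the kind of information that informs the Type and Distinct Count choices on the Fields page, can be approximated with a short Python sketch. This is only an illustration under stated assumptions and is not how Zoomdata itself parses files: the function names preview_rows, guess_type, and profile_columns are hypothetical, the sample data is made up, and the type guessing is deliberately coarse.

```python
import csv
import io


def preview_rows(text, separator=",", read_headers=True, entries=10):
    """Parse delimited text and return (column_names, first `entries` rows)."""
    reader = csv.reader(io.StringIO(text), delimiter=separator)
    rows = [row for row in reader if row]  # skip blank lines
    if not rows:
        return [], []
    if read_headers:
        # First row supplies the column names, as with Read Headers selected.
        names, data = rows[0], rows[1:]
    else:
        # Hypothetical fallback: generic positional names.
        names = [f"column_{i + 1}" for i in range(len(rows[0]))]
        data = rows
    return names, data[:entries]


def guess_type(values):
    """Guess a coarse field type ("integer", "number", "text") from strings."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "text"
    for caster, label in ((int, "integer"), (float, "number")):
        try:
            for v in non_empty:
                caster(v)
            return label
        except ValueError:
            continue
    return "text"


def profile_columns(names, rows):
    """Return {column: (guessed_type, distinct_count)} for the given rows."""
    profile = {}
    for i, name in enumerate(names):
        values = [row[i] for row in rows if i < len(row)]
        profile[name] = (guess_type(values), len(set(values)))
    return profile


if __name__ == "__main__":
    # Made-up sample using a semicolon separator and a header row.
    sample = "city;sales;price\nParis;10;2.5\nRome;7;3.0\nParis;12;2.5\n"
    names, rows = preview_rows(sample, separator=";", read_headers=True, entries=2)
    print(names)
    print(rows)
    print(profile_columns(names, rows))
```

If the previewed column names or row values look wrong in a check like this (for example, every row collapses into a single column), the separator or header setting is the likely cause, and the same correction would apply on the File Path page.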