
Configuring a Connection to a Standalone Spark Server

By default, Zoomdata installs a Spark Proxy service along with an embedded Spark instance that has a small default configuration. However, you can disable this default setup and instead connect to your own Spark server. Connecting to a standalone Spark server is recommended if you have large datasets (gigabytes and larger) to connect to and explore. This way, Zoomdata can leverage the memory, cores, and more robust configuration that may be available on your Spark server.

This article lists the prerequisites and other considerations to review before configuring the connection, provides step-by-step instructions to guide you through the configuration process, and explains how to monitor your Spark connection.

Data Flow in Zoomdata When Connected to a Standalone Spark Server

When you connect Zoomdata to a standalone Spark server, the embedded Spark instance is automatically disabled. Instead, the Spark Proxy service (the component of the Zoomdata server that handles user requests sent from the Zoomdata service) connects to your standalone Spark server, which then handles the requests from Zoomdata (as shown in Figure 1).

Figure 1


Prerequisites

Keep in mind the following prerequisites:

  • The Zoomdata Server supports Spark server versions 1.3 to 1.5.

  • The Spark cluster should be running on version 8 of the JVM (Java Virtual Machine).

  • The Network Time Protocol (NTP) should be running on all nodes of the Spark cluster.
  • Ensure that all the ports needed for communication between Zoomdata's node and the Node Manager are opened in your firewall rules.
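For example, on a Linux host using firewalld, opening the standalone Spark master port (7077, the port used later in this article) might look like the following sketch; adjust the port list to match your own deployment:

```shell
# Open the Spark master port (7077/tcp) in the permanent firewall
# configuration, then reload firewalld to apply the change.
sudo firewall-cmd --permanent --add-port=7077/tcp
sudo firewall-cmd --reload
```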

Other Considerations

Configure the Default Spark Proxy's Properties

The Spark Proxy properties include port information, log outputs, and core usage, and are contained in the spark-proxy.properties configuration file. This file is located in the /etc/zoomdata directory. To view the properties and instructions for changing these settings, refer to the article Changing the Default Configuration for an Embedded Spark Server.

Public IP Address

If your Zoomdata instance does not reside in the same local network as your Spark cluster, then the server hosting Zoomdata needs a public IP address, and the server containing the Spark cluster must be able to reach it. The following environment variables need to be set to the public DNS name of the Zoomdata server in the zoomdata.env file (Table 1):

Property            Default Value   Set To
SPARK_LOCAL_IP      (not set)       Your_Zoomdata_IP_Address
SPARK_PUBLIC_DNS    localhost       Your_Zoomdata_IP_Address

Table 1

The zoomdata.env file is located in the /etc/zoomdata directory. To access it, follow the steps provided in the article Managing Configurations in Zoomdata.
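For example, assuming the Zoomdata server's public DNS name is zoomdata.example.com (a hypothetical value), the relevant lines in /etc/zoomdata/zoomdata.env might look like this:

```shell
# /etc/zoomdata/zoomdata.env (excerpt)
# Replace zoomdata.example.com with the public DNS name or IP address
# of your Zoomdata server.
SPARK_LOCAL_IP=zoomdata.example.com
SPARK_PUBLIC_DNS=zoomdata.example.com
```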

Cores Utilization

By default, Zoomdata utilizes all available cores in the Spark server when a job is run; in other words, only one Spark application (SparkContext instance) may run at a time. If you want to explicitly specify the number of cores used for a SparkContext job, edit the 'Max Cores Per Cluster' parameter in the spark-proxy.properties file. Instructions for making this change are covered in the Spark Setup steps provided below.
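As a sketch, capping Zoomdata at eight cores (an arbitrary example value) would look like the following line in /etc/zoomdata/spark-proxy.properties:

```properties
# Limit each SparkContext launched by the Spark Proxy to 8 cores,
# leaving the remaining cluster cores free for other applications.
spark.cores.max=8
```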

Spark Monitoring

Zoomdata lets you monitor the status of your Spark connections. In order to do this, port 4041 should be opened on the Zoomdata server. Refer to the Spark Monitoring section below.

Testing Ports Availability

You can test the ports in your network environment to verify that they are open to Zoomdata. Zoomdata recommends using telnet from the Zoomdata instance to the Spark port on the server containing the Spark cluster:

telnet server.hostname port
Replace server.hostname with the hostname or IP address of the server containing the Spark cluster, and port with the specific port number you are testing.
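If telnet is not installed, a small bash function using the shell's built-in /dev/tcp pseudo-device can perform the same check. This is a sketch, not part of Zoomdata's tooling; the hostname below is a placeholder:

```shell
# check_port HOST PORT
# Returns 0 (success) if a TCP connection to HOST:PORT succeeds
# within 2 seconds; non-zero otherwise.
check_port() {
  timeout 2 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# Example: verify the Spark master port is reachable
# (replace the hostname with your Spark server).
check_port spark-master.example.com 7077 && echo "open" || echo "closed"
```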

Connecting Zoomdata to a Spark Proxy Service Running on a Separate Node

If you need to connect the Zoomdata server to a Spark Proxy running on a separate node, you need to modify two property files: spark-proxy.properties and zoomdata.properties. For instructions, refer to the article Changing the Default Configuration for an Embedded Spark Server.

Data Sources That Use Spark in Zoomdata

Zoomdata leverages Spark to serve as both a processing engine and a caching service. For example, Zoomdata uses Spark to filter and aggregate datasets when those operations are not available in the data source (such as a NoSQL or Search source). In addition, Zoomdata will load S3, HDFS, or SQL-based datasets into a Spark cache in order to run analytical queries at greater speed and efficiency. Data sources connected in Zoomdata that can be 'sparked' (have the ability to leverage Spark processing) include the following:

(Sparked) Data Source   Compatible Version
Cloudera Impala         1.4.2+
Google Analytics        Core Reporting API Version 3.0
Hive on EMR             Hive 1.0.x - 1.2.1
Hive on TEZ             Hive 1.0.x - 1.2.1
Marketo                 REST API Version 1
MemSQL                  MemSQL 3.2
MySQL                   5.6.13+
Oracle                  11g Release 2+
PostgreSQL              9.3.3+
Salesforce              Metadata API Version 34
SendGrid                REST API Version 3
SQL Server              2012
Zendesk                 REST API Version 2


Spark Setup

You will need to set the following Spark configurations: (1) Spark Master URL, (2) memory, and (3) cores, using the spark-proxy.properties configuration file. This file is located in the /etc/zoomdata directory.

To access the properties file and configure the options, perform the following steps:

  1. Log out of Zoomdata, if you are still in the program, and close the browser.
  2. From your terminal, open a command line session.
  3. Via a command prompt, connect to your Zoomdata Server.
  4. Stop the Spark Proxy service:
    sudo service zoomdata-spark-proxy stop
  5. Use the following command to access and open the configuration file:
    sudo vi /etc/zoomdata/spark-proxy.properties
  6. Set the Spark Master URL of the standalone Spark server:
    spark.master=spark://address-of-master-node:7077
  7. Allocate the required amount of memory and number of cores:
    spark.executor.memory=1g
    spark.cores.max=all-cores-available-in-the-spark-cluster
  8. Restart the Spark Proxy service:
    sudo service zoomdata-spark-proxy restart
  9. Make sure that there are no error messages in the spark-proxy.log file and that there is a record that the SparkContext has been started and validated. This log can be found in Zoomdata's /opt/zoomdata/logs directory.
  10. To validate the configuration, try creating an S3 data source with an example dataset provided in the Amazon S3 guide.
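The command-line portion of the steps above can be sketched as follows. The paths and service name are taken from this article; the grep patterns are illustrative assumptions, not exact log strings:

```shell
# Stop the Spark Proxy before editing its configuration.
sudo service zoomdata-spark-proxy stop

# Edit spark.master, spark.executor.memory, and spark.cores.max.
sudo vi /etc/zoomdata/spark-proxy.properties

# Restart the service so the new settings take effect.
sudo service zoomdata-spark-proxy restart

# Scan the log for errors and for SparkContext startup
# (patterns are illustrative).
grep -i "error" /opt/zoomdata/logs/spark-proxy.log
grep -i "sparkcontext" /opt/zoomdata/logs/spark-proxy.log
```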
If you want to connect Zoomdata to a Spark Cluster that is on an earlier version of Spark (like v1.3.1), you need to take additional configuration steps. Refer to the article Connecting to a Spark Cluster v.1.3.1 for more information.


Spark Monitoring

Zoomdata offers the ability to monitor Spark connections. You can access the Spark monitoring tool to check the connection to your datasets and identify issues as they arise. Specifically, this tool helps you identify, measure, and evaluate the performance of the Spark connection, and provides the means to isolate and rectify issues or delays as they occur.

By default, port 4041 is used to access the Spark monitoring tool. The tool becomes available only after the following actions are taken:

  1. A data source is connected to Spark
  2. You have selected a chart style for the Sparked data source

After the two above criteria are met, you will be able to access the Spark monitoring tool by entering the following URL:

http://IP_Address_of_Zoomdata_Server:4041/stages/

Replace IP_Address_of_Zoomdata_Server with the IP address or hostname of your Zoomdata server.
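As a quick sanity check that the monitoring port is reachable (a sketch; substitute your server's address for the placeholder), you can request the page with curl:

```shell
# Print the HTTP status code for the monitoring UI;
# expect 200 if the Spark monitoring tool is up.
curl -s -o /dev/null -w "%{http_code}\n" http://IP_Address_of_Zoomdata_Server:4041/stages/
```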

Figure 2

For help with issues you come across, see the article Common Spark-It Validation and Troubleshooting Steps.

In addition, the following articles about Spark in Zoomdata are available: