
Changing the Default Configuration for an Embedded Spark Server

During installation, Zoomdata sets up a Spark Proxy service along with an embedded Spark instance using a small default configuration (that is, minimal memory and core usage).

Because of this minimal configuration, the embedded Spark server is best suited for demo, testing, and evaluation purposes with smaller datasets. This Spark setup is not scalable.

If needed, you can edit this default configuration. This article explains how to change the default parameters for the local instance of Spark. Keep in mind that any configuration you set should not exceed the specifications (available memory and cores) of the machine where Zoomdata resides.
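Before choosing values, it helps to check what the host actually has available. The standard Linux tools below report total memory and the core count:

```shell
# Check the resources available on the Zoomdata host before sizing Spark
free -h    # total and available RAM
nproc      # number of CPU cores
```

Use these figures as the upper bound for the memory and core settings described in the rest of this article.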

Zoomdata uses a Spark Proxy service to handle user requests sent from the Zoomdata service. This data flow is illustrated in Figure 1.

Figure 1

Default Configuration for the Embedded Spark Server

By default, Zoomdata sets up the embedded Spark instance with the following configuration:

  • 3 GB of RAM
  • Access to all the available cores on the machine where Zoomdata runs

This default configuration lets you connect to and explore the data sources that depend on Spark, including:
(Select a data source link below for guidance on setting up that particular data source.)

In addition, the following data sources can be set to leverage Spark:

Changing the Default Spark Configuration

You can change the following default Spark configurations:

  • Memory, JVM options, folder paths, and the Hadoop distribution via the zoomdata.env configuration file
  • Port information, log output, and core usage via the spark-proxy.properties configuration file

Configuring Spark Settings Using zoomdata.env

The zoomdata.env file is located in the /etc/zoomdata directory. To access it, follow the steps provided in the article Managing Configurations in Zoomdata. Table 1 details the properties that can be edited.

SPARK_PROXY_MAX_MEMORY (default: "3g")
  Allocates memory for the Spark Proxy JVM; increase it if you need to load more than 1 GB of data.

SPARK_PROXY_JAVA_OPTS (default: "-Xms1g -Xmx$SPARK_PROXY_MAX_MEMORY -XX:OnOutOfMemoryError=\"kill -9 %p\"")
  Sets specific JVM arguments, such as -XX:+UseG1GC -verbose:gc -XX:+PrintGCDetails.

(default: "/opt/zoomdata/temp")
  Specifies the path to the folder where temporary files are stored.

SPARK_PROXY_LOGDIR (default: "/opt/zoomdata/logs")
  Specifies the path to the folder where log files are stored.

SPARK_PROXY_HADOOP_DISTRO (default: cdh5)
  Uses custom distributions, such as CDH4 with Hadoop 1.x.

Table 1
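As an illustration, a zoomdata.env that raises the proxy memory and enables GC logging might contain lines like the following. The values here are examples only, not recommendations; size them to your host:

```shell
# /etc/zoomdata/zoomdata.env -- example values only; do not exceed the host's capacity
SPARK_PROXY_MAX_MEMORY="6g"
SPARK_PROXY_JAVA_OPTS="-Xms1g -Xmx$SPARK_PROXY_MAX_MEMORY -XX:OnOutOfMemoryError=\"kill -9 %p\" -verbose:gc"
SPARK_PROXY_LOGDIR="/opt/zoomdata/logs"
```

Note that SPARK_PROXY_MAX_MEMORY must be assigned before SPARK_PROXY_JAVA_OPTS so that the -Xmx value expands correctly.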

Configuring the Default Spark Proxy's Properties

The spark-proxy.properties file is located in the /etc/zoomdata directory. Table 2 details the Spark Proxy properties that can be edited.

rmi.host (default: localhost)
  The hostname or IP address to which the RMI server is bound.

rmi.port (default: 9292)
  The Spark Proxy's RMI service registry port.

rmi.service.port (default: 9293)
  The Spark Proxy's RMI service port.

should.redirect.sysout (default: true)
  Defines the file (spark-proxy.log or syslog) to which messages written to standard output are logged.

logs.level (default: INFO)
  Sets the logging level for the Spark Proxy's components.

logs.root.level (default: WARN)
  Sets the logging level for all components, including third-party components.

file.log.level (default: ALL)
  Sets the logging level of the messages written to spark-proxy.log.

syslog.log.level (default: OFF), syslog.port (default: 514), syslog.suffix (default: local)
  The syslog.* properties configure the SyslogAppender if you want to store log messages externally. For more information, see the examples.

spark.master (default: local[*]), spark.ui.port (default: 4041), spark.local.dir (default: "/opt/zoomdata/temp")
  The spark.* properties set SparkContext-specific settings. For more information, refer to the Apache Spark documentation.

spark.sql.thriftServer.incrementalCollect (default: false)
  Controls whether the Spark driver fetches result sets sequentially (true) or in parallel (false). Set this to true if you want to fetch large result sets.

Table 2
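For example, a spark-proxy.properties file that raises the logging level and enables incremental result fetching might look like the following. These values are illustrative only:

```properties
# /etc/zoomdata/spark-proxy.properties -- example values only
logs.level=DEBUG
spark.ui.port=4041
spark.sql.thriftServer.incrementalCollect=true
```

Only the properties you want to override need to appear in the file; anything omitted keeps its default value from Table 2.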

To access the properties file and configure the options, perform the following steps:

  1. Log out of Zoomdata (if you are still in the program) and close the browser.
  2. From your terminal, open a command line session.
  3. Via a command prompt, connect to your Zoomdata Server.
  4. Stop the Zoomdata Server service:
    sudo service zoomdata stop
  5. Stop the Spark Proxy service:
    sudo service zoomdata-spark-proxy stop
  6. Use the following command to access and open the configuration file:
    sudo vi /etc/zoomdata/spark-proxy.properties

If the properties file does not exist, this command will create it. By default, the properties are not listed in the file. You have to manually add them and assign the required values.

  7. Add the new variable(s) to the file on a new line, or edit an existing variable, as needed.
  8. Save and exit the properties file.
  9. Restart the Spark Proxy service:
    sudo service zoomdata-spark-proxy restart
  10. Restart the Zoomdata service:
    sudo service zoomdata restart
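As a sketch, the file edit in the steps above can also be scripted. The helper below sets or updates a key=value line in a properties file without creating duplicates; it operates on a file in the current directory for illustration, so point PROPS at /etc/zoomdata/spark-proxy.properties (with appropriate privileges) on a real installation:

```shell
#!/bin/sh
# Sketch: idempotently set key=value lines in a properties file.
PROPS="spark-proxy.properties"   # typically /etc/zoomdata/spark-proxy.properties

set_prop() {
  key="$1"; value="$2"
  if grep -q "^${key}=" "$PROPS" 2>/dev/null; then
    # Key already present: replace its value in place
    sed -i "s|^${key}=.*|${key}=${value}|" "$PROPS"
  else
    # Key absent (the file starts out empty): append it
    echo "${key}=${value}" >> "$PROPS"
  fi
}

set_prop logs.level DEBUG
set_prop spark.ui.port 4042
set_prop spark.ui.port 4043   # a second call updates the line rather than duplicating it
```

After running a script like this, restart the Spark Proxy and Zoomdata services as shown in the steps above for the changes to take effect.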

To validate your configuration, you can create an S3 source using the example dataset provided in the Connecting to Amazon S3 article.

Connecting Zoomdata to a Spark Proxy Service Running on a Separate Node

If you need to connect the Zoomdata server to a Spark Proxy running on a separate node, you must modify two property files: spark-proxy.properties and zoomdata.properties.

To access the spark-proxy.properties file, review the topic Configuring the Default Spark Proxy's Properties (above). To access the zoomdata.properties file, follow the steps provided in the article Managing Configurations in Zoomdata.

In each of the files, add the following settings:

  • in spark-proxy.properties, set:
    • rmi.host=ip/dns_of_server_where_spark_proxy_runs

  • in zoomdata.properties, set:
    • spark.proxy.host=ip/dns_of_server_where_spark_proxy_runs
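For example, if the Spark Proxy node were reachable at spark-proxy.example.com (a hypothetical hostname used here only for illustration), the two files would carry matching entries:

```properties
# spark-proxy.properties (on the Spark Proxy node)
rmi.host=spark-proxy.example.com

# zoomdata.properties (on the Zoomdata server)
spark.proxy.host=spark-proxy.example.com
```

The two values must refer to the same host, or the Zoomdata server will not be able to reach the Spark Proxy's RMI services.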


The following additional articles about Spark in Zoomdata are available: