
Connecting Zoomdata to a Spark on Yarn Server

OVERVIEW

This article explains how to set up and connect Zoomdata to a Spark cluster running on a YARN server. There are a few prerequisites to review and verify before you can make the connection between Zoomdata and Spark on YARN; review the Prerequisites section below as your first step.

The SparkProxy works with Spark v1.3 through v1.5.x.

Check the sections below to view the steps to configure the SparkProxy for CDH and for HDP.

PREREQUISITES

Before you connect Zoomdata to Spark, make sure that the following prerequisites are met in your environment:

  1. The Hadoop YARN cluster in your environment is running on Java 8 (if not, upgrade to Java 8).
  2. All the ports needed for communication between the Zoomdata node and YARN's Node Managers are open in your firewall rules:
spark.blockManager.port=19001
spark.broadcast.port=19002
spark.driver.port=19003
spark.executor.port=19004
spark.fileserver.port=19005
spark.replClassServer.port=19006
spark.yarn.am.port=19007
spark.shuffle.service.port=19008

For additional information about configuring your ports for network security, refer to the Apache Spark article Spark Security. A sample command sequence for opening these ports is shown after this list.

  3. The Spark proxy server must be stopped and disabled from starting automatically with the Zoomdata server:

sudo service zoomdata-spark-proxy stop
# for systems with systemd:
sudo systemctl mask zoomdata-spark-proxy.service
# for systems with SysVinit, set the SPARK_PROXY_SERVICE variable to 'false' in /etc/zoomdata/zoomdata.env
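
If your nodes run firewalld, the ports listed above can be opened with a short loop like the one below (shown for the Zoomdata node; repeat on the YARN nodes as your topology requires). This is a minimal sketch that assumes firewalld is the active firewall; adapt it to iptables or whatever security tooling your environment uses.

zoomdata-server>
# open the Spark communication ports listed above (assumes firewalld is in use)
for port in 19001 19002 19003 19004 19005 19006 19007 19008; do
  sudo firewall-cmd --permanent --add-port=${port}/tcp
done
sudo firewall-cmd --reload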

SETTING UP A SPARK PROXY

Steps for CDH

  1. Copy the Spark 1.5.1 binary distribution to the Zoomdata server:
    zoomdata-server>
    sudo mkdir /opt/zoomdata/spark-distro
    sudo wget -P /opt/zoomdata/spark-distro/ https://archive.apache.org/dist/spark/spark-1.5.1/spark-1.5.1-bin-hadoop2.6.tgz
    sudo tar -xvf /opt/zoomdata/spark-distro/spark-1.5.1-bin-hadoop2.6.tgz -C /opt/zoomdata/spark-distro/
  2. Copy Hadoop configuration files from the CDH node into the spark-distro folder on the Zoomdata server.
    To complete this step, you must set up passwordless SSH access from the YARN Resource Manager node to the Zoomdata node (see How to set passwordless SSH below).
    Keep in mind that the hadoop-conf folder can be located in either the /usr/lib/hadoop-yarn/etc/hadoop/ or the /etc/hadoop/conf.cloudera.yarn/ folder.
    zoomdata-server>
    sudo mkdir /opt/zoomdata/spark-distro/hadoop-conf
    sudo chmod 777 /opt/zoomdata/spark-distro/hadoop-conf
    yarn-rm-node>
    scp /usr/lib/hadoop-yarn/etc/hadoop/* <user>@<zoomdata-server>:/opt/zoomdata/spark-distro/hadoop-conf
  3. In the /opt/zoomdata/spark-distro/hadoop-conf/core-site.xml file, set the value of the net.topology.script.file.name property to /opt/zoomdata/spark-distro/hadoop-conf/topology.py
    zoomdata-server>
    TOP_PY_PATH=$(find /opt/zoomdata/spark-distro/hadoop-conf -name "topology*.py" | sed 's/\//\\\//g')
    sed -i.bak "s/\/.*topology.*\.py/$TOP_PY_PATH/g" /opt/zoomdata/spark-distro/hadoop-conf/core-site.xml
  4. Create an hdfs user on the Zoomdata node so that YARN applications run on behalf of this user:
    zoomdata-server>
    sudo ln -s /opt/zoomdata/spark-distro/hadoop-conf /var/lib/hadoop-hdfs
    sudo useradd -d /var/lib/hadoop-hdfs hdfs
  5. Verify that Spark on YARN is configured correctly by running the following commands:
    sudo su hdfs
    cd $HOME
    export HADOOP_CONF_DIR=/opt/zoomdata/spark-distro/hadoop-conf
    export JAVA_HOME=/opt/zoomdata/jre
    export SPARK_LOCAL_IP=$(hostname -i)
    /opt/zoomdata/spark-distro/spark-1.5.1-bin-hadoop2.6/bin/spark-submit \
    --class org.apache.spark.examples.sql.hive.HiveFromSpark --master yarn-client \
    --conf spark.ui.port=4041 --num-executors 1 --driver-memory 2g --executor-memory 2g --executor-cores 2 \
    --driver-java-options "-Djavax.jdo.option.ConnectionURL=jdbc:derby:;databaseName=/tmp/hive_metastore;create=true \
    -Dhive.metastore.warehouse.dir=/tmp/hive_warehouse" \
    /opt/zoomdata/spark-distro/spark-1.5.1-bin-hadoop2.6/lib/spark-examples-1.5.1-hadoop2.6.0.jar
  6. Start SparkProxy on YARN on the Zoomdata server by running the following commands:
    zoomdata-server>
    sudo su hdfs
    cd $HOME
    export HADOOP_CONF_DIR=/opt/zoomdata/spark-distro/hadoop-conf
    export JAVA_HOME=/opt/zoomdata/jre
    export SPARK_LOCAL_IP=$(hostname -i)
    export SPARK_PROXY_JAR=/opt/zoomdata/services/spark-proxy.jar
    export CLASSPATH=$SPARK_PROXY_JAR:$LAUNCHER_CLASSPATH
    /opt/zoomdata/spark-distro/spark-1.5.1-bin-hadoop2.6/bin/spark-submit --class com.zoomdata.spark.proxy.SparkProxy \
    --conf spark.ui.port=4041 --conf spark.driver.extraClassPath=$SPARK_PROXY_JAR \
    --driver-java-options "-Drmi.host=localhost -Dhadoop.distro=provided -Djava.io.tmpdir=/tmp -Dlogs.dir=/tmp" \
    --master yarn-client --num-executors 1 --driver-memory 3g --executor-memory 3g --executor-cores 2 \
    --queue default --supervise $SPARK_PROXY_JAR

  7. To validate the configuration, try creating an S3 data source with the example dataset provided in the Connecting to Amazon S3 article.
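
You can also confirm from the cluster side that the SparkProxy driver registered with YARN. The following check is a minimal sketch; it assumes the yarn client is available where you run it and that the application name contains "SparkProxy" (adjust the grep pattern to match the name shown in the ResourceManager UI).

yarn-rm-node>
# list running YARN applications and look for the SparkProxy driver
yarn application -list -appStates RUNNING | grep -i sparkproxy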

Steps for HDP

  1. Copy the Spark 1.5.1 binary distribution to the Zoomdata server:
    zoomdata-server>
    sudo mkdir /opt/zoomdata/spark-distro
    sudo wget -P /opt/zoomdata/spark-distro/ https://archive.apache.org/dist/spark/spark-1.5.1/spark-1.5.1-bin-hadoop2.6.tgz
    sudo tar -xvf /opt/zoomdata/spark-distro/spark-1.5.1-bin-hadoop2.6.tgz -C /opt/zoomdata/spark-distro/
  2. Copy Hadoop configuration files from the HDP node into the spark-distro folder on the Zoomdata server.
    To complete this step, you must set up passwordless SSH access from the YARN Resource Manager node to the Zoomdata node (see How to set passwordless SSH below).
    zoomdata-server>
    sudo mkdir /opt/zoomdata/spark-distro/hadoop-conf
    sudo chmod 777 /opt/zoomdata/spark-distro/hadoop-conf

    yarn-rm-node>
    scp /usr/hdp/current/hadoop-client/conf/* <user>@<zoomdata-server>:/opt/zoomdata/spark-distro/hadoop-conf
  3. In the /opt/zoomdata/spark-distro/hadoop-conf/core-site.xml file, set the value of the net.topology.script.file.name property to /opt/zoomdata/spark-distro/hadoop-conf/topology.py
    zoomdata-server>
    TOP_PY_PATH=$(find /opt/zoomdata/spark-distro/hadoop-conf -name "topology*.py" | sed 's/\//\\\//g')
    sed -i.bak "s/\/.*topology.*\.py/$TOP_PY_PATH/g" /opt/zoomdata/spark-distro/hadoop-conf/core-site.xml
  4. Create an hdfs user on the Zoomdata node so that YARN applications run on behalf of this user:
    zoomdata-server>
    sudo ln -s /opt/zoomdata/spark-distro/hadoop-conf /var/lib/hadoop-hdfs
    sudo useradd -d /var/lib/hadoop-hdfs hdfs
  5. Verify that Spark on YARN is configured correctly by running the following commands:
    sudo su hdfs
    cd $HOME
    export HADOOP_CONF_DIR=/opt/zoomdata/spark-distro/hadoop-conf
    export JAVA_HOME=/opt/zoomdata/jre
    export SPARK_LOCAL_IP=$(hostname -i)
    /opt/zoomdata/spark-distro/spark-1.5.1-bin-hadoop2.6/bin/spark-submit \
    --class org.apache.spark.examples.sql.hive.HiveFromSpark --master yarn-client \
    --conf spark.yarn.am.extraJavaOptions="-Dhdp.version=<hdp.version>" \
    --conf spark.ui.port=4041 --num-executors 1 --driver-memory 2g --executor-memory 2g --executor-cores 2 \
    --driver-java-options "-Dhdp.version=<hdp.version> \
    -Djavax.jdo.option.ConnectionURL=jdbc:derby:;databaseName=/tmp/hive_metastore;create=true \
    -Dhive.metastore.warehouse.dir=/tmp/hive_warehouse" \
    /opt/zoomdata/spark-distro/spark-1.5.1-bin-hadoop2.6/lib/spark-examples-1.5.1-hadoop2.6.0.jar
  6. Start SparkProxy on YARN on the Zoomdata server by running the following commands:
    zoomdata-server>
    sudo su hdfs
    cd $HOME
    export HADOOP_CONF_DIR=/opt/zoomdata/spark-distro/hadoop-conf
    export JAVA_HOME=/opt/zoomdata/jre
    export SPARK_LOCAL_IP=$(hostname -i)
    export SPARK_PROXY_JAR=/opt/zoomdata/services/spark-proxy.jar
    export CLASSPATH=$SPARK_PROXY_JAR:$LAUNCHER_CLASSPATH
    /opt/zoomdata/spark-distro/spark-1.5.1-bin-hadoop2.6/bin/spark-submit --class com.zoomdata.spark.proxy.SparkProxy \
    --conf spark.yarn.am.extraJavaOptions="-Dhdp.version=<hdp.version>" \
    --conf spark.ui.port=4041 --conf spark.driver.extraClassPath=$SPARK_PROXY_JAR \
    --driver-java-options "-Dhdp.version=<hdp.version> -Drmi.host=localhost -Dhadoop.distro=provided -Djava.io.tmpdir=/tmp -Dlogs.dir=/tmp" \
    --master yarn-client --num-executors 1 --driver-memory 3g --executor-memory 3g --executor-cores 2 \
    --queue default --supervise $SPARK_PROXY_JAR

  7. To validate the configuration, try creating an S3 data source with the example dataset provided in the Connecting to Amazon S3 article.
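
If the proxy does not start, the YARN application logs are the first place to look. The following is a minimal sketch; the application id is a placeholder (find it with yarn application -list or in the ResourceManager UI), and aggregated logs are only available if log aggregation is enabled or once the application has finished.

yarn-rm-node>
# fetch the aggregated logs for the SparkProxy application
yarn logs -applicationId <application-id> | less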

To find the correct value for the hdp.version parameter (for example, 2.3.2.0-2950), run the following command on the YARN Resource Manager node:
hdp-select status hadoop-client
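
If you prefer not to edit the submit command by hand, the version string can be captured into a shell variable and substituted. This is a minimal sketch, assuming the usual hdp-select output format of "hadoop-client - <version>"; run the capture on the Resource Manager node and reuse the value on the Zoomdata server.

yarn-rm-node>
# capture the HDP version (e.g. 2.3.2.0-2950)
HDP_VERSION=$(hdp-select status hadoop-client | awk '{print $3}')
echo $HDP_VERSION
# then substitute it into the spark-submit options, for example:
#   --conf spark.yarn.am.extraJavaOptions="-Dhdp.version=$HDP_VERSION"
#   --driver-java-options "-Dhdp.version=$HDP_VERSION ..."
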
If your YARN cluster does not run on JRE 8 but JRE 8 is installed on every Node Manager, you can point Spark at it by adding the following parameters to the spark-submit command:
--conf spark.yarn.appMasterEnv.JAVA_HOME=<path_to_jre8>
--conf spark.executorEnv.JAVA_HOME=<path_to_jre8>
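
For example, added to the HDP SparkProxy submit command from step 6 (with the environment variables from that step already exported), the two options slot in alongside the other --conf flags. This is a sketch; <path_to_jre8> and <hdp.version> are placeholders for the JRE 8 location on your nodes and your HDP version.

/opt/zoomdata/spark-distro/spark-1.5.1-bin-hadoop2.6/bin/spark-submit --class com.zoomdata.spark.proxy.SparkProxy \
--conf spark.yarn.appMasterEnv.JAVA_HOME=<path_to_jre8> \
--conf spark.executorEnv.JAVA_HOME=<path_to_jre8> \
--conf spark.yarn.am.extraJavaOptions="-Dhdp.version=<hdp.version>" \
--conf spark.ui.port=4041 --conf spark.driver.extraClassPath=$SPARK_PROXY_JAR \
--driver-java-options "-Dhdp.version=<hdp.version> -Drmi.host=localhost -Dhadoop.distro=provided -Djava.io.tmpdir=/tmp -Dlogs.dir=/tmp" \
--master yarn-client --num-executors 1 --driver-memory 3g --executor-memory 3g --executor-cores 2 \
--queue default --supervise $SPARK_PROXY_JAR
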
  • num-executors specifies the number of executor JVM processes allocated to the Spark application. By default, it is set to 1.
  • executor-memory allotment (specified in gigabytes) should be calculated based on the size of your data source or raw file. To calculate the needed RAM, use the following formula: RAM = 1.92 * size of dataset (see the short calculation sketch after this list).

For example, if you have a dataset that is 10 GB, you would need about 19.2 GB of RAM. Also keep in mind that the amount of memory should not exceed the memory available to the Spark Server (otherwise the Spark connector will not be able to connect to the Spark Server).

  • executor-cores specifies the number of processing cores to be used by the Spark Proxy. By default, it is set to 1 core.
  • driver-memory allotment is set to 3 GB by default, but it should be adjusted for your specific use case.
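
As a quick way to apply the executor-memory formula above, the required allotment can be computed in the shell; a minimal sketch with the dataset size (in GB) supplied as a variable:

# estimate executor memory for a dataset of a given size (GB): RAM = 1.92 * size
awk -v size=10 'BEGIN { printf "executor-memory: %.1f GB\n", 1.92 * size }'
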
For additional information, refer to the Apache Spark article Running Spark on YARN.

How to set passwordless SSH

To set up passwordless SSH between the YARN Resource Manager node and the Zoomdata node, run the following commands:

yarn-rm-node>
ssh-keygen -t rsa
#copy result of the command below to the clipboard
cat /home/centos/.ssh/id_rsa.pub

zoomdata-server>
mkdir -p ~/.ssh

# add public key from the clipboard into ~/.ssh/authorized_keys
# try to open SSH session from YARN Resource Manager's node to Zoomdata's node
yarn-rm-node> ssh <user>@<zoomdata-server>
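
Alternatively, if ssh-copy-id is available on the Resource Manager node, the manual copy-and-paste of the public key can be replaced with a single command (the user and host below are placeholders):

yarn-rm-node>
# install the public key on the Zoomdata node, then verify the connection
ssh-copy-id <user>@<zoomdata-server>
ssh <user>@<zoomdata-server>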

Monitoring the connection

The Spark Proxy should now be set up and running. You can also monitor the status of the Spark driver using one of the following interfaces:

  • ResourceManager UI: http://<master-node-hostname>:8088/cluster/apps/RUNNING
  • SparkDriver UI: http://<master-node-hostname>:8088/proxy/<application-id>/stages
To use the SparkDriver UI, you will need to open the ResourceManager UI first in order to identify the <application-id>.
  • Spark History Server UI: http://<master-node-hostname>:18088/
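
If you prefer a command-line check, the ResourceManager also exposes a REST API that returns the same application list. A minimal sketch (the hostname is a placeholder):

# list RUNNING applications via the YARN ResourceManager REST API
curl "http://<master-node-hostname>:8088/ws/v1/cluster/apps?states=RUNNING"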

Additional resources

If you would like to learn more about Spark as implemented in Zoomdata, the following articles are available:

To learn more about Apache Spark and Yarn, refer to the following Cloudera articles: