How Zoomdata Uses Apache Spark

Overview

Apache Spark is an open-source cluster computing system for data analytics. It is part of the Hadoop open-source ecosystem and can run on top of HDFS. Spark is an optimized engine that supports MapReduce-style processing but is not limited to it, and because it executes in memory it can be up to 100x faster than Hadoop MapReduce for certain applications. Spark also provides improved support for iterative algorithms and interactive data mining.

Refer to the Apache Spark website for in-depth information.

Zoomdata is certified for integration with Apache Spark by Databricks. Details of the certified applications program can be found on the Databricks website. Zoomdata integrates the capabilities of Apache Spark in two distinct ways:

  1. As a processing engine and caching tool
  2. As a data source (that is, connecting Zoomdata to a source)

Using Spark as a Processing Engine and Caching Tool

Zoomdata uses Apache Spark as a complementary data processing layer within the Zoomdata server (as shown in Figure 1). Because Zoomdata pushes queries to the original data source, operations such as aggregation, filtering, and calculations are performed close to where the data is stored. When the aggregated, filtered result sets are retrieved from the source, they are cached within Spark as data frames (which are built on resilient distributed datasets, or RDDs). When you submit new requests for data, Zoomdata retrieves the data from the Spark result set cache whenever possible.

Zoomdata also uses cached result sets when the user sorts or crosstabs the data, or performs any other interaction that can be satisfied without going back to the original source. For more in-depth information, refer to the article An Overview of Spark as a Processing Engine and Caching Service in Zoomdata.
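
The caching behavior described above can be illustrated with a minimal sketch. This is not Zoomdata's implementation and does not use Spark; it is a plain Python stand-in that assumes a result set can be keyed by its query text. The names ResultSetCache, fake_source, and the sample query are hypothetical and exist only for illustration.

```python
from typing import Callable, Dict, List, Tuple

Row = Tuple[str, int]  # (group label, aggregated value)


class ResultSetCache:
    """Toy stand-in for a result set cache (hypothetical, not Zoomdata code)."""

    def __init__(self, run_on_source: Callable[[str], List[Row]]) -> None:
        self._run_on_source = run_on_source
        self._cache: Dict[str, List[Row]] = {}
        self.source_hits = 0  # how many times the original source was queried

    def query(self, sql: str) -> List[Row]:
        # Push the query to the source only when no cached result set exists.
        if sql not in self._cache:
            self.source_hits += 1
            self._cache[sql] = self._run_on_source(sql)
        return list(self._cache[sql])

    def query_sorted(self, sql: str, descending: bool = False) -> List[Row]:
        # Sorting is derived from the cached rows, so a cached query
        # does not require another trip to the source.
        return sorted(self.query(sql), key=lambda row: row[1], reverse=descending)


def fake_source(sql: str) -> List[Row]:
    # Assumption: stands in for the original data source, which has
    # already aggregated and filtered the data.
    return [("east", 30), ("west", 10), ("north", 20)]


cache = ResultSetCache(fake_source)
query_text = "SELECT region, SUM(sales) FROM orders GROUP BY region"
cache.query(query_text)                          # first request goes to the source
cache.query_sorted(query_text, descending=True)  # served from the cache
print(cache.source_hits)                         # prints 1
```

In this sketch the second request is a sort, which is answered from the cached rows, so the source is queried only once.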


Figure 1

By default, Zoomdata provides a local (embedded) Spark instance with a small configuration (that is, minimal memory and core usage). However, you can bypass this setup and instead connect to either a standalone Spark Server or a Spark on YARN Server, depending on what is available in your network environment.
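
For orientation, the three deployment options correspond to different Spark master URL formats in Spark 1.x. The sketch below only builds those URL strings. The function name spark_master_url, the mode labels, and the default of two local cores are assumptions made for illustration; they are not Zoomdata settings, and the actual Zoomdata configuration is covered in the articles referenced in this section.

```python
def spark_master_url(mode: str, host: str = "", port: int = 7077, cores: int = 2) -> str:
    """Return a Spark 1.x master URL for a deployment mode (illustrative helper)."""
    if mode == "embedded":
        # Local mode: Spark runs inside the same JVM with a fixed number of cores.
        # The core count here is an assumed "small" default, not Zoomdata's value.
        return f"local[{cores}]"
    if mode == "standalone":
        if not host:
            raise ValueError("standalone mode needs the master host name")
        # 7077 is the default port of a standalone Spark master.
        return f"spark://{host}:{port}"
    if mode == "yarn":
        # Spark 1.x addresses YARN in client mode with the "yarn-client" master.
        return "yarn-client"
    raise ValueError(f"unknown mode: {mode}")


print(spark_master_url("embedded"))
print(spark_master_url("standalone", host="spark-master.example.com"))
print(spark_master_url("yarn"))
```

The host name spark-master.example.com is a placeholder; substitute the address of the Spark master in your own environment.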

To use Zoomdata's embedded Spark Server but change its default configuration, refer to the article Configuring an Embedded Spark Server.

Alternatively, if you prefer to connect Zoomdata to your own standalone Spark Server or Spark on YARN Server, note the supported Spark versions:

  • When using Zoomdata's embedded Spark instance, the supported Spark version is v1.5.1.
  • When connecting to a standalone Spark Server or Spark on YARN Server, the supported Spark versions range from v1.3 to v1.5.
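
The version rules above can be expressed as a small check. This is a sketch under one stated assumption: for external servers, any patch release within the v1.3 to v1.5 range (for example, v1.4.1) is treated as supported, because only major and minor numbers are given. The helper names parse_version and is_supported are hypothetical.

```python
from typing import Tuple

EMBEDDED_VERSION = (1, 5, 1)  # embedded instance: exactly v1.5.1
EXTERNAL_MIN = (1, 3)         # external servers: v1.3 ...
EXTERNAL_MAX = (1, 5)         # ... through v1.5


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a string such as 'v1.5.1' or '1.4' into a tuple of integers."""
    return tuple(int(part) for part in version.lstrip("v").split("."))


def is_supported(version: str, embedded: bool) -> bool:
    parsed = parse_version(version)
    if embedded:
        return parsed == EMBEDDED_VERSION
    # Assumption: compare only major.minor, so patch releases are accepted.
    return EXTERNAL_MIN <= parsed[:2] <= EXTERNAL_MAX


print(is_supported("v1.5.1", embedded=True))   # prints True
print(is_supported("v1.6.0", embedded=False))  # prints False
```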

Connecting Zoomdata to a Spark Data Source

You can connect Zoomdata to a Spark data source using the SparkSQL connector. For guidance on configuring this connection in Zoomdata, refer to the article Connecting to SparkSQL.