
How Zoomdata Uses Apache Spark

Overview

Spark is an open-source cluster computing system for data analytics. It is part of the Hadoop open-source ecosystem and can run on top of HDFS. Spark is an optimized engine that supports the MapReduce model but is not limited to it; because it can execute in memory, it can perform up to 100x faster than Hadoop MapReduce for certain applications. Spark also provides improved support for iterative algorithms and interactive data mining.

See the Apache Spark website for in-depth information.

Zoomdata is certified for integration with Apache Spark by Databricks. Details of the certified applications program can be found on the Databricks website.

Zoomdata leverages Spark in three scenarios:

  1. As Zoomdata's internal mechanism for result set caching and as a processing engine (for functionality such as calculations),
  2. As a data source (connecting to a Spark cluster using the SparkSQL connector), and
  3. Optionally, to ingest 'raw data' using Zoomdata's proprietary SparkIt functionality.
    This SparkIt functionality cannot be externalized to another Spark cluster.

Using Spark as Zoomdata's Processing Engine and Result Set Cache

Zoomdata uses Apache Spark as a complementary data processing layer within the Zoomdata server (as shown in Figure 1). Because Zoomdata pushes queries down to the original data source, processes such as aggregation, filtering, and calculations are performed close to where the data is stored. When the aggregated, filtered result sets are retrieved from the source, they are cached within Spark as data frames (which are built on resilient distributed datasets, or RDDs). When you submit new requests for data, Zoomdata retrieves the data from the Spark result set cache whenever possible.

Zoomdata also uses cached result sets when the user sorts or crosstabs the data, or performs any other interaction that can be completed without going back to the original source. For more in-depth information, see An Overview of Spark as a Processing Engine and Caching Service in Zoomdata.
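
The cache-first behavior described above can be illustrated with a small sketch. This is not Zoomdata's implementation and does not use Spark; it is a plain Python model, under stated assumptions, of the idea that a result set is fetched from the source once and that later requests, including sorts, are served from the cached copy. The class name ResultSetCache, the fake_source function, and the sample query and rows are all hypothetical.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


class ResultSetCache:
    """Illustrative cache-first lookup (hypothetical, not Zoomdata's code)."""

    def __init__(self, run_on_source: Callable[[str], List[Row]]) -> None:
        # run_on_source stands in for pushing a query down to the data source.
        self._run_on_source = run_on_source
        self._cache: Dict[str, List[Row]] = {}
        self.source_hits = 0  # how many times the source was actually queried

    def fetch(self, query: str) -> List[Row]:
        # Go to the source only when this query has not been cached yet.
        if query not in self._cache:
            self.source_hits += 1
            self._cache[query] = self._run_on_source(query)
        return list(self._cache[query])

    def fetch_sorted(self, query: str, column: str, descending: bool = False) -> List[Row]:
        # Sorting is done on the cached rows, so no new source query is needed.
        return sorted(self.fetch(query), key=lambda row: row[column], reverse=descending)


def fake_source(query: str) -> List[Row]:
    # Hypothetical aggregated result set returned by the data source.
    return [
        {"region": "East", "sales": 120},
        {"region": "West", "sales": 310},
        {"region": "North", "sales": 75},
    ]


QUERY = "SELECT region, SUM(sales) AS sales FROM orders GROUP BY region"

cache = ResultSetCache(fake_source)
rows = cache.fetch(QUERY)                                     # queries the source
by_sales = cache.fetch_sorted(QUERY, "sales", descending=True)  # served from cache
print(cache.source_hits, [row["region"] for row in by_sales])
```

In this sketch the second call re-sorts the cached rows instead of issuing a new query, which mirrors the sort and crosstab interactions mentioned above.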


Figure 1

By default, Zoomdata provides a local (embedded) Spark instance with a small configuration (that is, minimal memory and core usage).

When using Zoomdata's embedded Spark instance, the supported Spark version is v1.5.1.

Connecting Zoomdata to a Spark Data Source

You can connect Zoomdata to a Spark data source using the SparkSQL connector. For guidance on configuring this connection in Zoomdata, see Connecting to SparkSQL.
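
As background for that connection, Spark's Thrift JDBC/ODBC server speaks the HiveServer2 protocol, so JDBC clients generally address it with a jdbc:hive2:// URL, and the server listens on port 10000 by default. The sketch below only assembles such a URL. Whether Zoomdata's connector expects exactly this form is an assumption, and the function name spark_sql_jdbc_url and the host name are hypothetical; follow the Connecting to SparkSQL article for the actual settings.

```python
from urllib.parse import quote


def spark_sql_jdbc_url(host: str, port: int = 10000, database: str = "default") -> str:
    """Build a HiveServer2-style JDBC URL for a Spark Thrift server.

    Hypothetical helper for illustration; 10000 is the Thrift server's
    default port, and "default" is the default database name.
    """
    if not host:
        raise ValueError("host must not be empty")
    if not 0 < port < 65536:
        raise ValueError("port must be between 1 and 65535")
    # Percent-encode the database name so unusual characters stay valid in the URL.
    return f"jdbc:hive2://{host}:{port}/{quote(database, safe='')}"


# Example with a placeholder host name.
url = spark_sql_jdbc_url("spark-host.example.com")
print(url)
```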

Using SparkIt to Ingest Raw Data

An article covering this functionality is under development. If you have any questions about it, please contact Zoomdata Technical Support.

