Overview of Apache Spark as a Processing Engine and Caching Service in Zoomdata
Zoomdata leverages Apache Spark to serve as both a processing engine and a caching service. Specifically, Zoomdata uses Spark's capabilities to:
- Load static data into data frames (which Zoomdata refers to as 'SparkIt')
- Hold result set caches in data frames
- Persist Spark data frames in memory or on disk
- Perform calculations, totals and pivots on results
- Execute Fusion joins (a unique feature in Zoomdata that joins disparate, connected data sources to become a new data source)
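The Fusion join in the last item can be pictured as a keyed join across records from two otherwise unrelated sources. The following is a minimal conceptual sketch in plain Python, not Spark or Zoomdata code; the function name `fuse`, the field names, and the sample records are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def fuse(left_rows, right_rows, key):
    """Inner-join two lists of dict records on a shared key field."""
    # Index the right-hand source by key so each left row is matched quickly.
    index = defaultdict(list)
    for row in right_rows:
        index[row[key]].append(row)

    fused = []
    for left in left_rows:
        for right in index.get(left[key], []):
            merged = dict(right)
            merged.update(left)  # left-hand values win on field-name clashes
            fused.append(merged)
    return fused


# Hypothetical records from two disparate sources sharing an account_id field.
crm_accounts = [
    {"account_id": 1, "name": "Acme"},
    {"account_id": 2, "name": "Globex"},
]
support_tickets = [
    {"account_id": 1, "open_tickets": 4},
    {"account_id": 3, "open_tickets": 9},
]

# The joined rows behave like a new, single data source.
fused_source = fuse(crm_accounts, support_tickets, "account_id")
```

Only the account present in both inputs survives the join, which is the behavior of an inner join; whether Zoomdata's Fusion uses inner or outer semantics is not stated here.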
Zoomdata is able to supplement a data source's API when analytic functionality is missing. Spark can be leveraged to perform filtering and aggregation of the datasets. For example, a NoSQL or Search source may not natively support aggregation, but Zoomdata can perform that in Spark.
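To make the idea concrete, the sketch below shows the kind of filter-then-aggregate step that would have to happen outside a source that can only return raw records. It is plain Python for illustration, not Spark code and not Zoomdata's implementation; `filter_and_sum` and the sample documents are hypothetical.

```python
from collections import defaultdict


def filter_and_sum(records, predicate, group_field, value_field):
    """Keep records matching predicate, then sum value_field per group."""
    totals = defaultdict(float)
    for record in records:
        if predicate(record):
            totals[record[group_field]] += record[value_field]
    return dict(totals)


# Hypothetical raw documents, as a search or NoSQL source might return them.
documents = [
    {"region": "east", "status": "closed", "amount": 120.0},
    {"region": "east", "status": "open", "amount": 80.0},
    {"region": "west", "status": "closed", "amount": 50.0},
    {"region": "east", "status": "closed", "amount": 30.0},
]

# Filter to closed records, then total the amount per region.
closed_by_region = filter_and_sum(
    documents,
    lambda doc: doc["status"] == "closed",
    "region",
    "amount",
)
```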
In addition, Zoomdata leverages Spark to load datasets from S3, HDFS, or SQL-based data sources so that analytical queries run faster and more efficiently.
Zoomdata employs a 'Spark Proxy' component (or service) that runs as a separate process on one of the Zoomdata nodes. Spark Proxy can be configured either to connect to an external Spark cluster (YARN or Standalone) or to use the default embedded Spark server. Once the Spark Proxy service is enabled, you can also use Spark as a cache (referred to as 'SparkIt'), which makes sources such as S3, HDFS, and the Cloud Connectors (Google Analytics, Marketo, Salesforce, SendGrid, and Zendesk) available.
Figure 1 illustrates the data flow for data sources using the Spark Proxy service (enabled via either Embedded or Standalone Spark Cluster) and with the 'SparkIt' cache enabled for data sources.
When using Zoomdata's embedded Spark instance, the supported Spark version is v1.5.1.
If you connect to a standalone Spark server or to Spark on YARN, the supported Spark versions range from v1.3 to v1.5.
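The version rules above can be expressed as a small pre-flight check. This is a hedged sketch in plain Python: the helper names are hypothetical, and it assumes that "v1.3 to v1.5" covers any patch release within the 1.3, 1.4, and 1.5 lines, which the text does not state explicitly.

```python
EMBEDDED_VERSION = (1, 5, 1)  # embedded Spark instance, per the text above
EXTERNAL_MIN = (1, 3)         # lowest supported external major.minor
EXTERNAL_MAX = (1, 5)         # highest supported external major.minor


def parse_version(text):
    """Turn a string such as 'v1.5.1' or '1.4' into a tuple of ints."""
    return tuple(int(part) for part in text.lstrip("v").split("."))


def is_supported(version_text, embedded):
    """Return True if the Spark version matches the documented support range."""
    version = parse_version(version_text)
    if embedded:
        return version == EMBEDDED_VERSION
    # Assumption: only major.minor matters for external clusters.
    return EXTERNAL_MIN <= version[:2] <= EXTERNAL_MAX
```

For example, `is_supported("1.4.1", embedded=False)` passes the check, while `is_supported("1.6.0", embedded=False)` does not.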
Data sources connected in Zoomdata that can be 'sparked' (have the ability to leverage Spark processing) include the following:
|(Sparked) Data Source|Compatible Version|
|Google Analytics|Core Reporting API Version 3.0|
|Hive on EMR|Hive 1.0.x - 1.2.1|
|Hive on Tez|Hive 1.0.x - 1.2.1|
|Marketo|REST API Version 1|
|Oracle|11g Release 2+|
|Salesforce|Metadata API Version 34|
|SendGrid|REST API Version 3|
|Zendesk|REST API Version 2|
Keep in mind that the Spark instance Zoomdata connects to must have enough memory to hold the dataset.
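A rough back-of-the-envelope check can help when sizing memory for a dataset. The sketch below is an illustration in plain Python only; the function name, the per-row size, and the `overhead_factor` are placeholder assumptions, not values published by Zoomdata or Spark, and real in-memory size depends on data types and serialization.

```python
def fits_in_memory(row_count, avg_row_bytes, available_bytes, overhead_factor=2.0):
    """Estimate whether a dataset fits in the memory available to Spark.

    overhead_factor is a placeholder multiplier for in-memory object
    overhead; tune it from measurements on your own data.
    """
    estimated_bytes = row_count * avg_row_bytes * overhead_factor
    return estimated_bytes <= available_bytes


GIB = 1024 ** 3

# Hypothetical dataset: 10 million rows of about 200 bytes, 8 GiB available.
small_ok = fits_in_memory(10_000_000, 200, 8 * GIB)

# Ten times as many rows no longer fits under the same assumptions.
large_ok = fits_in_memory(100_000_000, 200, 8 * GIB)
```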
By default, Spark Proxy uses the embedded Spark server if you have not specified a connection to an external Spark cluster. The embedded Spark server is best suited to demo or testing purposes and is not recommended for production. You can configure this local instance of Spark; refer to the article Configuring an Embedded Spark Server for guidance.
Zoomdata recommends that you configure Spark Proxy to connect to your external Spark cluster. For more information, refer to the following articles:
DATA SOURCES THAT REQUIRE A SPARK SERVER
For the following data sources, a Spark server is required (before you can configure a connection to that source):
DATA SOURCES THAT CAN BE SPARKED
Zoomdata enables several types of data sources for use with Spark. To activate the Spark function, toggle the ON/OFF switch during the connection process or at any time on the 'Tables' page of the supported data source. Select any of the sources below for connection instructions and guidance on enabling 'SparkIt':
Deploying Spark in a Highly Available Environment
When you deploy Zoomdata in a highly available environment with several Zoomdata servers, the Spark cluster's resources can be shared across all of the servers. To do this efficiently, Zoomdata employs a Spark Proxy component that runs as a separate process on one of the nodes in the Zoomdata cluster. Spark Proxy can be configured either to connect to an external Spark cluster (YARN or Standalone) or to use the embedded Spark server. For assistance, please contact Zoomdata support.