How Zoomdata Uses Apache Spark
Zoomdata leverages Spark in the following ways:
- As a mechanism for result set caching
- As a processing engine
Zoomdata can also use Spark as a data source by connecting to a Spark cluster through the SparkSQL connector. For setup steps, see Managing SparkSQL Connectors.
Using Spark as a Processing Engine and a Result Set Cache
Zoomdata uses Apache Spark as a processing layer for calculations, totals, and pivots on results, and to execute Fusion joins. Because Zoomdata pushes queries down to the original data source, aggregation, filtering, and calculations are performed close to where the data is stored. When the aggregated, filtered result sets are retrieved from the source, they are cached in Spark as data frames (which are built on resilient distributed datasets, or RDDs). When you submit a new request for data, Zoomdata serves it from the Spark result set cache whenever possible.
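The caching flow described above can be illustrated with a minimal sketch in plain Python. It is not Zoomdata's implementation and does not use Spark. The class ResultSetCache, the function run_on_source, and the SALES sample rows are all hypothetical names introduced for illustration. The sketch only shows the general idea: an aggregated query goes to the source once, and a repeated request with the same shape is answered from the cache.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]
# Hypothetical query key: (table, group-by columns, equality filters)
QueryKey = Tuple[str, Tuple[str, ...], Tuple[Tuple[str, object], ...]]

# Hypothetical sample data standing in for the original data source.
SALES: List[Row] = [
    {"region": "East", "amount": 10},
    {"region": "West", "amount": 5},
    {"region": "East", "amount": 7},
]


def run_on_source(key: QueryKey) -> List[Row]:
    """Stand-in for a pushed-down query: filter, then sum `amount` per group."""
    _table, group_by, filters = key  # the table name is unused in this toy source
    totals: Dict[Tuple[object, ...], int] = {}
    for row in SALES:
        if all(row[col] == value for col, value in filters):
            group = tuple(row[col] for col in group_by)
            totals[group] = totals.get(group, 0) + row["amount"]
    return [dict(zip(group_by, group), total=total) for group, total in totals.items()]


class ResultSetCache:
    """Hypothetical cache of aggregated result sets, keyed by query shape."""

    def __init__(self, query_source: Callable[[QueryKey], List[Row]]) -> None:
        self._query_source = query_source
        self._cache: Dict[QueryKey, List[Row]] = {}
        self.source_calls = 0  # counts how often the source was actually queried

    def fetch(self, key: QueryKey) -> List[Row]:
        # Only go back to the source when this exact query has not been seen.
        if key not in self._cache:
            self.source_calls += 1
            self._cache[key] = self._query_source(key)
        return list(self._cache[key])


cache = ResultSetCache(run_on_source)
key: QueryKey = ("sales", ("region",), ())
first = cache.fetch(key)   # pushed down to the source
second = cache.fetch(key)  # served from the cache
```

After both calls, cache.source_calls is 1, because the second request never reaches the source.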
Zoomdata also uses cached result sets when a user sorts or crosstabs the data, or performs another interaction that can be satisfied without going back to the original source. If a data source lacks a capability needed to execute a request, Spark is used to fill that gap.
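The routing decision described above can be sketched as a small function. This is an assumption-laden illustration, not Zoomdata's actual logic: the function plan_operation, the operation names, and the capability set are hypothetical. It shows one plausible order of preference: reuse the cached result set for interactions that need no new data, push the work to the source when the source supports it, and otherwise fall back to Spark.

```python
from typing import Dict, List, Set

Row = Dict[str, object]

# Hypothetical set of interactions that can be answered from a cached result set.
LOCAL_ON_CACHE: Set[str] = {"sort", "crosstab"}


def plan_operation(operation: str, source_capabilities: Set[str], cached: bool) -> str:
    """Return where a hypothetical operation would run: cache, source, or spark."""
    if cached and operation in LOCAL_ON_CACHE:
        return "cache"   # no round trip to the original source is needed
    if operation in source_capabilities:
        return "source"  # push the work down, close to where the data is stored
    return "spark"       # the source lacks the capability, so Spark fills the gap


def sort_cached(rows: List[Row], column: str, descending: bool = False) -> List[Row]:
    """Sort an already cached result set without querying the source again."""
    return sorted(rows, key=lambda row: row[column], reverse=descending)


cached_rows: List[Row] = [
    {"region": "West", "total": 5},
    {"region": "East", "total": 17},
]
by_total = sort_cached(cached_rows, "total", descending=True)
```

For example, plan_operation("percentile", {"group_by", "filter"}, cached=False) returns "spark" in this sketch, because the hypothetical source does not list that capability.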
By default, Zoomdata provides an embedded Spark server that uses Spark version 2.2.