Apache Spark Data Stream Processing

Need Speed? Meet Apache Spark.

How Does Spark Streaming Work?

Zoomdata uses Apache Spark data stream processing as a complementary processing layer within the Zoomdata server. Wherever possible, Zoomdata pushes query processing down to the original sources so that aggregation, filtering, and calculations run close to where the data is stored. As the aggregated, filtered result sets come back from those sources, Zoomdata caches them as Spark DataFrames, which are built on top of Spark's resilient distributed datasets (RDDs). When users submit new requests for data, Zoomdata serves them from the Spark result set cache whenever possible. It also uses cached result sets when a user sorts or crosstabs the data, or performs any other interaction that can be satisfied without going back to the original source.
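The caching behavior described above can be pictured with a small sketch. This is not Zoomdata's implementation: plain Python lists stand in for Spark DataFrames, and the names `ResultSetCache`, `run_on_source`, and `fake_source` are hypothetical, introduced only for illustration.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]
# Hypothetical cache key: (source name, group-by fields, filter expression).
QueryKey = Tuple[str, Tuple[str, ...], str]


class ResultSetCache:
    """Keeps aggregated result sets so repeat requests skip the source."""

    def __init__(self, run_on_source: Callable[[QueryKey], List[Row]]) -> None:
        self._run_on_source = run_on_source  # assumed pushdown hook
        self._cache: Dict[QueryKey, List[Row]] = {}
        self.source_calls = 0

    def get(self, key: QueryKey) -> List[Row]:
        # Only a cache miss triggers a query against the original source.
        if key not in self._cache:
            self.source_calls += 1
            self._cache[key] = self._run_on_source(key)
        return self._cache[key]

    def get_sorted(self, key: QueryKey, field: str, descending: bool = False) -> List[Row]:
        # Sorting is answered from the cached rows, not from the source.
        return sorted(self.get(key), key=lambda row: row[field], reverse=descending)


def fake_source(key: QueryKey) -> List[Row]:
    # Stand-in for a source that has already aggregated and filtered the data.
    return [{"region": "east", "sales": 120}, {"region": "west", "sales": 300}]


cache = ResultSetCache(fake_source)
key = ("orders", ("region",), "year = 2016")
first = cache.get(key)
by_sales = cache.get_sorted(key, "sales", descending=True)
```

In this sketch the second request (the sort) is served entirely from the cache, so the source is queried only once.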

[Figure: Spark architecture with connectors]

Apache Spark Capabilities

Zoomdata leverages Spark stream processing in a similar way to perform calculations on top of cached results. For example, when a user defines a new calculation based on metrics that already exist in a visualization, Zoomdata executes the calculation using Spark and the DataFrames that contain the cached data.
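As a rough illustration of a calculation defined on top of cached metrics, the sketch below derives a new metric from rows that are already in memory. Plain Python dictionaries stand in for DataFrame rows, and `add_calculation`, `revenue`, `cost`, and `margin` are assumed names, not part of Zoomdata or Spark.

```python
from typing import Callable, Dict, List

Row = Dict[str, float]


def add_calculation(rows: List[Row], name: str, formula: Callable[[Row], float]) -> List[Row]:
    """Return new rows with one extra metric computed from existing metrics."""
    # The cached rows are left untouched; each output row is a fresh dict.
    return [{**row, name: formula(row)} for row in rows]


# Stand-in for a cached, already aggregated result set.
cached_rows = [
    {"revenue": 200.0, "cost": 150.0},
    {"revenue": 90.0, "cost": 30.0},
]

with_margin = add_calculation(
    cached_rows,
    "margin",
    lambda r: (r["revenue"] - r["cost"]) / r["revenue"],
)
```

The point of the sketch is that the new metric needs no further query against the original source, because every input it depends on is already cached.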

Spark also powers Zoomdata Fusion. As with single-source queries, Zoomdata pushes as much query processing as possible to the original sources. But when Zoomdata needs to fuse multiple sources, it retrieves an aggregated result set from each source, and Spark performs the join across those result sets.
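The fusion step can be sketched as a join over result sets that each source has already aggregated. The example below is a plain-Python hash join, not Spark code; the inner-join semantics, the `fuse` function, and the sample field names are assumptions made for illustration, since the text does not specify the join type.

```python
from typing import Dict, List

Row = Dict[str, object]


def fuse(left: List[Row], right: List[Row], key: str) -> List[Row]:
    """Inner hash join of two aggregated result sets on a shared key."""
    # Build a lookup table from the right-hand result set.
    index: Dict[object, List[Row]] = {}
    for row in right:
        index.setdefault(row[key], []).append(row)

    # Probe the table with each left-hand row and merge the matches.
    fused: List[Row] = []
    for row in left:
        for match in index.get(row[key], []):
            fused.append({**row, **match})
    return fused


# Each list stands in for an aggregated result set returned by one source.
sales_by_region = [{"region": "east", "sales": 120}, {"region": "west", "sales": 300}]
visits_by_region = [{"region": "east", "visits": 40}, {"region": "north", "visits": 15}]

fused = fuse(sales_by_region, visits_by_region, "region")
```

Because each source returns an aggregated result set rather than raw records, the join in this sketch operates on a handful of rows per source.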

[Figure: Fusion on Spark with connectors]

Finally, Zoomdata leverages Spark to accelerate slow data sources via the SparkIt feature. Some sources, like flat files and S3 buckets, do not provide query capabilities. SparkIt provides a way to load big datasets from these sources into Spark, where they become fully interactive, queryable datasets.
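To make the idea concrete, here is a minimal sketch of turning a flat file with no query capability into an in-memory dataset that can be aggregated. It uses Python's standard `csv` module instead of Spark, and `load_flat_file`, `sum_by`, and the sample data are hypothetical; it does not describe how SparkIt itself is built.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, List


def load_flat_file(text: str) -> List[Dict[str, str]]:
    """Parse CSV text into in-memory rows keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def sum_by(rows: List[Dict[str, str]], group_field: str, metric_field: str) -> Dict[str, float]:
    """Group rows by one field and sum a numeric field within each group."""
    totals: Dict[str, float] = defaultdict(float)
    for row in rows:
        totals[row[group_field]] += float(row[metric_field])
    return dict(totals)


# Stand-in for the contents of a flat file or an object in an S3 bucket.
raw = "region,sales\neast,100\nwest,250\neast,20\n"

dataset = load_flat_file(raw)
totals = sum_by(dataset, "region", "sales")
```

Once the rows are loaded, any number of group-by queries can run against them without rereading the file.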


By implementing these capabilities with Apache Spark, Zoomdata taps into the broad open-source community that supports and enhances Spark's scale-out, in-memory technology. In addition, for maximum scalability, Zoomdata provides the option to deploy the Spark caching, big data analytics, and fusion layer in an external Spark cluster.



