Apache Spark Analytics

What is Apache Spark?

Apache Spark is an open-source engine for large-scale data processing built around resilient distributed datasets (RDDs): fault-tolerant collections of records that are partitioned across a cluster and can be cached in memory. RDDs facilitate the implementation of iterative algorithms, such as those for machine learning, which visit their data set multiple times in a loop. They also support exploratory data analysis, with repeated database-style queries of the same data. Applications built this way keep latency low because repeated passes read from memory rather than disk.
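The access pattern that RDD caching serves can be seen in a plain-Python sketch (this is not the Spark API; the toy data and learning rate are made up for illustration). An iterative algorithm revisits the same in-memory data set on every pass:

```python
# Plain-Python sketch (not Spark API) of the access pattern RDDs serve:
# an iterative algorithm that re-reads the same cached data set in a loop.
points = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]  # toy (x, y) pairs, held in memory

w = 0.0  # single-weight linear model: y ~ w * x
for _ in range(100):                # each iteration visits the full data set
    grad = sum(2 * x * (w * x - y) for x, y in points) / len(points)
    w -= 0.05 * grad                # gradient-descent update

print(round(w, 2))  # → 2.04 (the least-squares slope for these points)
```

In Spark, `points` would be an RDD cached across the cluster, so each of the 100 passes reads from memory instead of re-fetching from storage.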

You can run Spark as a standalone application or on Hadoop, Mesos, or in the cloud, while accessing diverse data sources including the Hadoop distributed file system (HDFS), Cassandra, HBase, and S3.

How Does Spark Streaming Work?

Zoomdata leverages Apache Spark data stream processing as a complementary processing layer within the Zoomdata server. Remember that as much as possible Zoomdata pushes query processing to original sources so that aggregation, filtering, and calculations are performed close to where data is stored. But as aggregated, filtered result sets are retrieved from their original sources, Zoomdata caches this data as Spark DataFrames.

When users submit new requests for data, Zoomdata retrieves the data from the Spark Streaming result set cache whenever possible. Zoomdata also uses cached result sets if the user sorts or crosstabs the data, or performs some kind of interaction that can be achieved without going back to the original source.
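The caching behavior described above can be sketched in a few lines of plain Python (the function names and row layout are illustrative, not Zoomdata internals; in Zoomdata the cached result would be a Spark DataFrame). A result set is fetched from the source once, and a later sort is served entirely from the cache:

```python
# Hedged sketch of result-set caching; names and structure are illustrative.
cache = {}

def fetch_aggregated(source, query):
    """Stand-in for a round-trip to the original source."""
    return [{"region": "NA", "revenue": 120}, {"region": "EU", "revenue": 95}]

def get_result(source, query):
    key = (source, query)
    if key not in cache:                 # only a cache miss hits the source
        cache[key] = fetch_aggregated(source, query)
    return cache[key]

rows = get_result("sales_db", "revenue by region")          # source round-trip
resorted = sorted(rows, key=lambda r: r["revenue"])         # served from cache
```

The second interaction (the sort) never touches the original source, which is the point of keeping aggregated result sets cached.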

Spark Core

Spark Core is the foundation of Apache Spark. It handles dispatching, scheduling, and I/O functions, which are exposed through the API.

Spark SQL

Spark SQL is a component on top of Spark Core that introduces a data abstraction layer called DataFrames. DataFrames provide support for structured and semi-structured data.

For example, when a user defines a new calculation based on metrics that already exist in a visualization, Zoomdata executes the calculation using Spark and the DataFrames that contain the cached data.

Spark also powers Zoomdata Fusion. As with single-source queries, Zoomdata pushes as much query processing as possible to the original sources. But when Zoomdata needs to fuse multiple sources, it retrieves an aggregated result set from each source and Spark performs the join between them.

Finally, Zoomdata leverages Spark to accelerate slow data sources via the SparkIt feature. Some sources, like flat files and S3 buckets, do not provide query capabilities. SparkIt provides a way to load big data sets from these sources into Spark, where they become fully interactive, queryable data sets.
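The fusion step amounts to joining two already-aggregated result sets on a shared key. A minimal sketch in plain Python (in Zoomdata the join itself is performed by Spark; lists of dicts stand in for DataFrames, and the column names are invented):

```python
# Illustrative data fusion: one aggregated result set per source,
# joined on the shared "customer" key.
crm = [{"customer": "Acme", "region": "NA"},
       {"customer": "Globex", "region": "EU"}]
billing = [{"customer": "Acme", "revenue": 120},
           {"customer": "Globex", "revenue": 95}]

by_customer = {row["customer"]: row for row in billing}     # index one side
fused = [{**c, **by_customer[c["customer"]]}                # merge matching rows
         for c in crm if c["customer"] in by_customer]
```

Because each source already returned a small aggregated result set, the join is cheap relative to the work each source did locally.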

By implementing these capabilities with Apache Spark, Zoomdata taps into the broad open-source community that supports and enhances its scale-out, in-memory technology. In addition, for maximum scalability, Zoomdata provides the option to deploy the Spark caching, big data analytics, and fusion layer in an external Spark cluster.

Spark Architecture with Connectors

Why Spark for Big Data Analytics?

Many organizations have adopted Spark for big data processing and big data analytics, including companies like Comcast, Bloomberg, and Capital One. Why? Because Spark is great for big data analytics and large-scale data science use cases. For example, Comcast has used Spark, Spark MLlib, and machine learning to detect the issues behind anomalies in its 30 million cable boxes -- boxes that generate more than one billion data points every day. Comcast runs Apache Spark on a 400-node cluster with nearly a terabyte of RAM and eight petabytes of storage.

Bloomberg uses Spark for its low-latency, cloud-based analytics platform, which delivers financial information to its clients. The company uses the Spark DataFrame concept for its Spark applications. As more and more enterprises look for data analysis capabilities, Spark has become a virtual single toolbox for data scientists.

Spark is gaining immense popularity because it:

  • Features an advanced directed acyclic graph (DAG) execution engine that supports in-memory computing
  • Holds data in memory -- making it up to 100 times faster than disk-based MapReduce for certain applications
  • Supports multi-stage in-memory primitives, which make it faster than Hadoop's two-stage, disk-based MapReduce model
  • Offers a convenient, unified programming model for developers, supporting SQL, stream processing, machine learning, and graph analytics
  • Allows user programs to load big data into a Spark cluster's memory and query it repeatedly, making it well suited to machine learning algorithms
  • Offers a scalable machine learning library via Spark MLlib
  • Provides support for graphs and graph-parallel computation with its GraphX API
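The DAG execution model in the first bullet can be illustrated with a toy dependency graph (the stage names are made up; Spark's real scheduler also pipelines narrow transformations and plans shuffle boundaries, which this sketch ignores). Each stage runs only after the stages it depends on:

```python
# Toy DAG: each key lists the stages that must finish before it can run.
from graphlib import TopologicalSorter

dag = {
    "load": set(),
    "filter": {"load"},
    "aggregate": {"filter"},
    "join": {"filter", "load"},
    "report": {"aggregate", "join"},
}
order = list(TopologicalSorter(dag).static_order())  # a valid execution order
```

Because the graph is acyclic, a valid execution order always exists, and independent stages (here, `aggregate` and `join`) can in principle run in parallel.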

In addition, Apache Spark is compatible with several resource managers, such as YARN and Mesos, and can also run in standalone mode. It's also easy to use, offering APIs in programming languages like Scala, Java, and Python, in addition to Spark SQL. Spark does not include its own system for organizing files, which is one reason many big data projects run Spark on top of a Hadoop cluster using a distribution like Hortonworks or Cloudera. Most companies find that both Apache Spark and Hadoop are necessary for a robust analytics ecosystem.

Zoomdata and Apache Spark


Zoomdata integrates with and leverages Apache Spark for big data analytics in multiple ways. If you use Spark to manage data directly, Zoomdata can access and visualize Spark data via Spark SQL. Zoomdata connects to Spark and makes DataFrames available for fast visual analytics on big data, leveraging your existing Spark cluster by pushing Spark SQL queries to the source.

Because Spark is a powerful unified environment for structured queries, machine learning, and graph analytics, you can combine these frameworks and visualize the results in Zoomdata. For example, build a machine learning model in Spark, such as a customer value score, and add the score to your Spark DataFrame, where Zoomdata can use the new field like any other attribute.
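The pattern of adding a model-derived score as a new column can be sketched in plain Python (the `customer_value` field, the scoring rule, and the row layout are all hypothetical; in practice the score would come from a trained Spark MLlib model and the column would live in a Spark DataFrame):

```python
# Hedged sketch: a model-derived score becomes a new column, then is
# queryable like any other attribute.
customers = [{"name": "Acme", "orders": 12, "avg_order": 80.0},
             {"name": "Initech", "orders": 2, "avg_order": 40.0}]

def value_score(row):                     # stand-in for a real trained model
    return row["orders"] * row["avg_order"]

for row in customers:
    row["customer_value"] = value_score(row)   # the new "DataFrame column"

high_value = [r["name"] for r in customers if r["customer_value"] > 500]
```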

Zoomdata can also integrate with Spark Streaming so that users can interact with live streams of real-time data.

In addition, Zoomdata makes special use of Spark as an embedded technology. Working with any data source, Zoomdata leverages Spark for result set caching, data blending, and additional calculations on top of what is available from the source.

SparkIt: Visualize Data from Flat Files, S3, JSON and More

Zoomdata is designed to push query processing to the source as much as possible.

Common analytic processing includes:

  • Selection or filtering -- displaying only a subset of data that meets a condition, e.g. customers from North America
  • Aggregation -- calculating a value across many data elements, e.g. counting users or summing revenue across many customers
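The two operations above can be sketched in plain Python over a small row set (the column names are illustrative):

```python
# Selection and aggregation over a tiny in-memory "table".
rows = [{"customer": "Acme", "region": "NA", "revenue": 120},
        {"customer": "Initech", "region": "NA", "revenue": 60},
        {"customer": "Globex", "region": "EU", "revenue": 95}]

na_rows = [r for r in rows if r["region"] == "NA"]     # selection / filtering
na_revenue = sum(r["revenue"] for r in na_rows)        # aggregation: sum
na_customers = len(na_rows)                            # aggregation: count
```

A capable source (a database, for instance) performs both steps itself and returns only the small result; a raw file system cannot, which is the gap SparkIt fills.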

But some sources do not support even these basic types of analytic processing. If you are working with raw files in a file system or in Amazon Web Services S3, the storage layer itself performs no filtering or aggregation; every query would require reading and processing the raw data in full.

Zoomdata includes the ability to read these raw files into Spark, where they become fast, interactive, and queryable. When establishing a connection, users can choose to “SparkIt” and preload data into a Spark DataFrame for interactive use through Zoomdata. This capability is available for common file formats such as CSV files, tab-delimited files, JSON and XML files.
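The idea behind SparkIt can be sketched with the standard library (real SparkIt loads into a Spark DataFrame across a cluster; here the `csv` module and a list of dicts stand in, and the file content is invented). Raw file contents are parsed once into an in-memory structure that then supports interactive queries:

```python
# Minimal sketch: parse a raw flat file once, then query it in memory.
import csv
import io

raw = "region,revenue\nNA,120\nEU,95\nNA,60\n"   # pretend flat-file content
table = [dict(r) for r in csv.DictReader(io.StringIO(raw))]  # one-time load

# Once loaded, the "file" is queryable like any other source.
eu_total = sum(int(r["revenue"]) for r in table if r["region"] == "EU")
```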


SparkIt can also be used for sources other than raw files. Even relational data from Oracle, SQL Server, MySQL, or data from any “slow” source can be loaded into Zoomdata’s Spark layer to convert it to a fast, queryable, interactive source.
