RDDs facilitate the implementation of iterative algorithms such as those for machine learning, which visit their data set multiple times in a loop. They also support exploratory data analysis, with repeated database-style queries of data. Applications built in this way improve real-time performance by reducing latency.
You can run Spark in standalone mode, on Hadoop or Mesos, or in the cloud, and access diverse data sources including the Hadoop Distributed File System (HDFS), Cassandra, HBase, and S3.
Zoomdata leverages Apache Spark data stream processing as a complementary processing layer within the Zoomdata server. Remember that Zoomdata pushes query processing to the original sources as much as possible, so that aggregation, filtering, and calculations are performed close to where the data is stored. As aggregated, filtered result sets are retrieved from those sources, however, Zoomdata caches them as Spark DataFrames.
When users submit new requests for data, Zoomdata retrieves the data from the Spark Streaming result set cache whenever possible. Zoomdata also uses cached result sets if the user sorts or crosstabs the data, or performs some kind of interaction that can be achieved without going back to the original source.
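The push-down-then-cache pattern described above can be sketched in a few lines. This is only an illustration, not Zoomdata's implementation: an in-memory SQLite database stands in for the original source, a plain dictionary stands in for the Spark DataFrame cache, and the `ResultSetCache` class, the `sales` table, and its columns are all hypothetical names.

```python
import sqlite3


class ResultSetCache:
    """Caches result sets by query text (hypothetical helper, not a Zoomdata API)."""

    def __init__(self, source):
        self.source = source      # connection to the original data source
        self.results = {}         # query text -> cached result set
        self.source_hits = 0      # number of round trips to the source

    def fetch(self, query):
        """Return the result set, querying the source only on a cache miss."""
        if query not in self.results:
            self.source_hits += 1
            self.results[query] = self.source.execute(query).fetchall()
        return self.results[query]

    def fetch_sorted(self, query, column, descending=False):
        """Sort the cached result set locally; the sort never reaches the source."""
        return sorted(self.fetch(query), key=lambda row: row[column], reverse=descending)


# Stand-in for an original source that can run queries.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE sales (region TEXT, amount REAL)")
source.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 10.0), ("west", 25.0), ("east", 5.0), ("north", 7.5)],
)

# The aggregation is pushed down: the source performs the GROUP BY.
QUERY = "SELECT region, SUM(amount) FROM sales GROUP BY region"

cache = ResultSetCache(source)
totals = cache.fetch(QUERY)                                       # cache miss
by_total = cache.fetch_sorted(QUERY, column=1, descending=True)   # from cache
by_region = cache.fetch_sorted(QUERY, column=0)                   # from cache

print(by_total)
print("source hits:", cache.source_hits)
```

Only the first request reaches the source. Both sorts are answered from the cached, already aggregated rows, which is the behavior the paragraph above describes for sorting and crosstab interactions.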
Spark Core is the foundation of Apache Spark. It handles dispatching, scheduling, and I/O functions, which are exposed through the API.
Spark SQL is a component on top of Spark Core that introduces a data abstraction layer called DataFrames. DataFrames provide support for structured and semi-structured data.
For example, when a user defines a new calculation based on metrics that already exist in a visualization, Zoomdata executes the calculation using Spark and the DataFrames that contain the cached data. Spark also powers Zoomdata Fusion. As with single-source queries, Zoomdata pushes as much query processing as possible to the original sources. But when Zoomdata needs to fuse multiple sources, it retrieves an aggregated result set from each source and Spark performs the join across them. Finally, Zoomdata leverages Spark to accelerate slow data sources via the SparkIt feature. Some sources, like flat files and S3 buckets, do not provide query capabilities. SparkIt provides a way to load big data sets from these sources into Spark, where they become fully interactive, queryable data sets.
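To make the fusion step concrete, here is a minimal plain-Python sketch of joining two aggregated result sets on a shared key and then deriving a new metric from the joined rows. It assumes each result set has one row per key, as it would after a group-by at the source. The `fuse` function and the sample fields (`region`, `revenue`, `visits`) are hypothetical, and in Zoomdata the join itself is performed by Spark rather than by code like this.

```python
def fuse(left, right, key):
    """Inner-join two aggregated result sets (lists of dicts) on a shared key.

    Assumes the right-hand result set has at most one row per key value.
    """
    right_by_key = {row[key]: row for row in right}
    fused = []
    for row in left:
        match = right_by_key.get(row[key])
        if match is not None:
            fused.append({**row, **match})
    return fused


# Hypothetical aggregated result sets, one retrieved from each source.
sales_by_region = [
    {"region": "east", "revenue": 15.0},
    {"region": "west", "revenue": 25.0},
]
visits_by_region = [
    {"region": "east", "visits": 300},
    {"region": "west", "visits": 500},
    {"region": "north", "visits": 120},
]

fused = fuse(sales_by_region, visits_by_region, key="region")

# A derived calculation on the fused rows, computed without returning to either source.
for row in fused:
    row["revenue_per_visit"] = row["revenue"] / row["visits"]

print(fused)
```

Because each source returns only its aggregated rows, the join works on small result sets rather than on the raw data, which is the point of pushing the aggregation down first.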
By implementing these capabilities with Apache Spark, Zoomdata taps into the broad open-source community that supports and enhances its scale-out, in-memory technology. In addition, for maximum scalability, Zoomdata provides the option to deploy the Spark caching, big data analytics, and fusion layer in an external Spark cluster.
Many organizations, including Comcast, Bloomberg, and Capital One, have adopted Spark for big data processing and big data analytics. Why? Because Spark is well suited to big data analytics and large-scale data science use cases. For example, Comcast has used Spark, Spark MLlib, and machine learning to detect the issues behind anomalies in its 30 million cable boxes, which generate more than one billion data points every day. Comcast runs Apache Spark on a 400-node cluster with nearly a terabyte of RAM and eight petabytes of storage.
Bloomberg uses Spark for its low-latency, cloud-based analytics platform, which delivers financial information to its clients. The company builds its Spark applications around the Spark DataFrame concept. As more and more enterprises look for data analysis capabilities, Spark has become something close to a single toolbox for data scientists.
Spark is gaining immense popularity because it:
In addition, unlike Hadoop, Apache Spark is compatible with several resource managers, such as YARN and Mesos. It is also easy to use, offering APIs in programming languages like Scala, Java, and Python, in addition to Spark SQL. Spark does not include its own system for organizing files, which is one reason many big data projects run Spark on top of a Hadoop cluster using a distribution like Hortonworks or Cloudera. Most companies find that both Apache Spark and Hadoop are necessary for a robust analytics ecosystem.
Zoomdata integrates with and leverages Apache Spark for big data analytics in multiple ways. If you use Spark to manage data directly, Zoomdata can access and visualize Spark data via Spark SQL. Zoomdata connects to Spark and makes DataFrames available for fast visual analytics on big data, leveraging your existing Spark cluster by pushing Spark SQL queries to the source.
Because Spark is a powerful unified environment for structured queries, machine learning, and graph analytics, you can combine these frameworks and visualize the results in Zoomdata. For example, you can build a machine learning model in Spark, such as a customer value score, and add its output to your Spark DataFrame, where Zoomdata can use the new field like any other attribute.
Zoomdata can also integrate with Spark Streaming so that users can interact with live streams of real-time data.
In addition, Zoomdata makes special use of Spark as an embedded technology. Working with any data source, Zoomdata leverages Spark for result set caching, data blending, and additional calculations on top of what is available from the source.
Zoomdata is designed to push query processing to the source as much as possible.
Common analytic processing includes:
But some sources do not support even these basic types of data analytics. If you are working with raw data in a file system or in Amazon Web Services S3, the file system will not support any analytic processing.
Zoomdata includes the ability to read these raw files into Spark, where they become fast, interactive, and queryable. When establishing a connection, users can choose to “SparkIt” and preload data into a Spark DataFrame for interactive use through Zoomdata. This capability is available for common file formats such as CSV, tab-delimited, JSON, and XML files.
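The idea of turning a raw file into a queryable data set can be illustrated with Python's standard library alone. In this sketch an in-memory SQLite table stands in for the Spark DataFrame that SparkIt would populate. The `load_csv` function, the sample CSV content, and the `sales` table are assumptions made for illustration and are not part of Zoomdata or Spark.

```python
import csv
import io
import sqlite3

# Hypothetical raw CSV content; in practice this would come from a file or an S3 object.
RAW_CSV = """region,amount
east,10.0
west,25.0
east,5.0
north,7.5
"""


def load_csv(text, table):
    """Load CSV text into an in-memory SQLite table and return the connection.

    The two-column schema is hard-coded to match the sample data above.
    """
    rows = list(csv.DictReader(io.StringIO(text)))
    conn = sqlite3.connect(":memory:")
    conn.execute(f"CREATE TABLE {table} (region TEXT, amount REAL)")
    conn.executemany(f"INSERT INTO {table} VALUES (:region, :amount)", rows)
    return conn


conn = load_csv(RAW_CSV, "sales")

# Once loaded, the data supports filtering, grouping, and sorting.
result = conn.execute(
    "SELECT region, SUM(amount) FROM sales "
    "WHERE amount > 6 GROUP BY region ORDER BY region"
).fetchall()

print(result)
```

The raw text on its own supports none of these operations. After the one-time load, filters and aggregations run against the in-memory copy, which is the trade SparkIt makes at a much larger scale.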
SparkIt can also be used for sources other than raw files. Even relational data from Oracle, SQL Server, MySQL, or data from any “slow” source can be loaded into Zoomdata’s Spark layer to convert it to a fast, queryable, interactive source.
Zoomdata leverages Apache Spark data processing as a complementary processing layer within the Zoomdata server.