There are many sources of streaming data such as a Twitter feed, a sensor network, connected IoT devices, or buy-sell orders from a financial marketplace. To visualize and analyze streaming data, you need a streaming engine and a high-performance data store in which to land it.
Unlike traditional streaming visualization and analytics tools, Zoomdata’s visualization and analytics are independent of the underlying streaming infrastructure. This means you are free to use any streaming engine, including Apache Kafka, Apache Spark Streaming, Apache Storm, Apache Apex, Apache NiFi, and Amazon Kinesis.
Zoomdata’s streaming analytics are based on the principle of landing the data in a high-performance data store, or data sink. Zoomdata’s streaming analytics engine then continuously queries that data store in near real time and dynamically updates visualizations and analytics, typically with sub-second stream-to-screen response (depending on the performance of the landing data store being used). A standard WebSocket connection carries the real-time data between the Zoomdata server and the web browser.
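The land-then-query pattern described above can be sketched in a few lines. This is only an illustration under stated assumptions, not Zoomdata's implementation: an in-memory SQLite database stands in for the high-performance data store, and the table name `events` and the helper functions `land` and `poll_new_rows` are hypothetical names chosen for the example.

```python
import sqlite3
import time

# Assumption: in-memory SQLite stands in for the high-performance landing store.
store = sqlite3.connect(":memory:")
store.execute(
    "CREATE TABLE events (id INTEGER PRIMARY KEY, sensor TEXT, reading REAL, ts REAL)"
)

def land(sensor, reading):
    """Role of the streaming engine: write each incoming record to the store."""
    store.execute(
        "INSERT INTO events (sensor, reading, ts) VALUES (?, ?, ?)",
        (sensor, reading, time.time()),
    )
    store.commit()

def poll_new_rows(last_seen_id):
    """Role of the analytics side: fetch only rows landed since the last poll."""
    rows = store.execute(
        "SELECT id, sensor, reading FROM events WHERE id > ? ORDER BY id",
        (last_seen_id,),
    ).fetchall()
    new_last = rows[-1][0] if rows else last_seen_id
    return rows, new_last

land("s1", 20.5)
land("s2", 21.0)
first_batch, cursor = poll_new_rows(0)        # sees the first two records
land("s1", 20.7)
second_batch, cursor = poll_new_rows(cursor)  # sees only the newly landed record
print(len(first_batch), len(second_batch))
```

In a real deployment the poll would run on a short timer and push each new batch to the browser; here it is called by hand to keep the sketch self-contained.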
Zoomdata Streaming Architecture
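Apache Kafka™ is a distributed streaming platform. Streaming platforms let you publish and subscribe to streams of records, store those streams durably and in a fault-tolerant way, and process them as they occur.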
Kafka runs as a cluster on one or more servers. The cluster stores streams of records in categories called topics. Each record consists of a key, value, and timestamp. Kafka also has four APIs: producer, consumer, streams, and connector. Kafka is horizontally scalable, fault-tolerant, and fast.
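To make the record and topic vocabulary concrete, here is a toy in-memory model. It is a sketch for illustration only and does not use or imitate the real Kafka client API; the class names `Record` and `ToyCluster` and their methods are hypothetical.

```python
import time
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Record:
    """A record as described above: a key, a value, and a timestamp."""
    key: str
    value: str
    timestamp: float = field(default_factory=time.time)

class ToyCluster:
    """Hypothetical stand-in for a cluster: one append-only log per topic."""

    def __init__(self):
        self._topics = defaultdict(list)

    def produce(self, topic, record):
        """Append a record to the topic and return its position (offset)."""
        log = self._topics[topic]
        log.append(record)
        return len(log) - 1

    def consume(self, topic, offset=0):
        """Return every record in the topic from the given offset onward."""
        return self._topics[topic][offset:]

cluster = ToyCluster()
cluster.produce("orders", Record("ACME", "buy 100"))
second_offset = cluster.produce("orders", Record("ACME", "sell 40"))
latest = cluster.consume("orders", offset=second_offset)
```

Because the log is append-only, a consumer that remembers its last offset can resume from where it stopped without missing or re-reading records.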
Apache Spark is an open-source cluster-computing framework. Originally developed at the University of California, Berkeley's AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since. Spark provides an interface for programming entire clusters with implicit data parallelism and fault-tolerance.
Apache Spark provides programmers with an API centered on the resilient distributed dataset (RDD), a read-only multiset of data items distributed over a cluster of machines and maintained in a fault-tolerant way. Spark's RDDs function as a working set for distributed programs, offering a (deliberately) restricted form of distributed shared memory. RDDs make it easier to implement iterative algorithms, which visit their dataset multiple times in a loop, and interactive or exploratory data analysis, that is, the repeated database-style querying of data.
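The idea of a read-only, partitioned working set that an iterative algorithm revisits on every pass can be illustrated without a cluster. The sketch below is plain Python, not the Spark API: immutable tuples stand in for partitions, and `map_reduce` is a hypothetical helper that mimics a map followed by a reduce across partitions.

```python
from functools import reduce

# Assumption: immutable tuples stand in for partitions held on different machines.
partitions = ((1.0, 2.0, 3.0), (4.0, 5.0), (6.0,))

def map_reduce(parts, map_fn, reduce_fn):
    """Apply map_fn within each partition, then combine the partial results."""
    partials = [reduce(reduce_fn, map(map_fn, part)) for part in parts]
    return reduce(reduce_fn, partials)

def add(a, b):
    return a + b

count = map_reduce(partitions, lambda x: 1, add)

# Iterative algorithm: gradient descent toward the mean of the data.
# Every iteration re-reads the same unmodified partitions.
estimate = 0.0
for _ in range(50):
    gradient = map_reduce(partitions, lambda x: estimate - x, add) / count
    estimate -= 0.5 * gradient

print(count, round(estimate, 6))
```

The dataset is never modified inside the loop; only the small `estimate` value changes between passes, which is the access pattern that keeping the working set in memory is meant to serve.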
Apache Storm is a free, open source distributed real-time computation system that can reliably process unbounded streams of data. Storm does for real-time processing what Hadoop did for batch processing. Storm is simple and can be used with any programming language.
Storm has many use cases: real-time analytics, online machine learning, continuous computation, distributed remote procedure calls (RPCs), extract, transform, and load (ETL), and more. Storm is fast: a benchmark clocked it at over a million tuples processed per second per node, where a tuple is a finite ordered list of elements. It is scalable, fault-tolerant, and easy to set up and operate.
Storm integrates with common queuing and database technologies. A Storm topology consumes streams of data and processes those streams in arbitrarily complex ways, repartitioning the streams between each stage of the computation as needed.
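The shape of a topology, with a stream repartitioned between stages, can be sketched in plain Python. This is not the Storm API: the function names are hypothetical, borrowing Storm's terms "spout" for a stream source and "bolt" for a processing step, and the hash-based routing only imitates the idea of grouping a stream by a field.

```python
import zlib
from collections import Counter

def sentence_spout():
    """Source stage: emits a small, fixed stream of sentences."""
    yield from ["the cow jumped", "the moon", "cow moon moon"]

def split_bolt(sentences):
    """First processing stage: turn each sentence into a stream of words."""
    for sentence in sentences:
        yield from sentence.split()

def repartition_by_word(words, n_tasks):
    """Route each word by a stable hash so equal words reach the same task."""
    tasks = [[] for _ in range(n_tasks)]
    for word in words:
        tasks[zlib.crc32(word.encode()) % n_tasks].append(word)
    return tasks

def count_bolt(words):
    """Second processing stage: count the words routed to one task."""
    return Counter(words)

tasks = repartition_by_word(split_bolt(sentence_spout()), n_tasks=2)
partial_counts = [count_bolt(task_words) for task_words in tasks]
totals = sum(partial_counts, Counter())
```

Repartitioning by word before counting means no two tasks ever hold counts for the same word, so the partial results can be merged by simple addition.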
Apache Apex is a Hadoop YARN-native platform with a unified stream processing architecture that is suitable for both real-time and batch processing. It processes big data in motion in a way that is highly scalable, performant, fault-tolerant, stateful, secure, distributed, and easily operable. Apex offers a simple API that enables developers to write or reuse generic Java code, lowering the expertise needed to write big data applications.
Apex supports fine-grained, incremental recovery that resets only the portion of a topology affected by a failure, and it lets you alter the topology and operator properties of running applications. The platform comes with Malhar, an open source operator and codec library that can be used to build streaming applications. Malhar also includes many connectors for messaging systems, databases, and files.
Apache NiFi supports powerful and scalable directed graphs of data routing, transformation, and system mediation logic. It can be used for:
NiFi is not meant for distributed computation, complex event processing, or joins/complex rolling window operations.
Amazon Kinesis can cost-effectively collect, process, and analyze real-time streaming data. It can ingest real-time streaming data such as application logs, website clickstreams, and IoT telemetry data into databases, data lakes, and data warehouses. Amazon Kinesis features three components: Kinesis Streams, Kinesis Firehose, and Kinesis Analytics.
What are big data sources?
Big data sources are repositories of large volumes of data. Using modern data frameworks and analytics tools, users can quickly connect to and derive value from these sources. Examples of big data sources are Amazon Redshift, HP Vertica, and MongoDB. Other big data sources – such as analytic/columnar data stores, NoSQL, and Hadoop data repositories – are also gaining popularity.
What is streaming data?
Streaming data is data that flows continuously, often at high speed, like water from a tap. A data stream is a sequence of digitally encoded, coherent signals (data packets) used to transmit information.
What’s the difference between real-time and streaming data?
Real-time data is data produced in the moment, such as changing prices in the stock market. A system is called “real time” if it can react to the data within seconds or milliseconds. If a stock trade is placed and executed within milliseconds, that is a real-time execution system. Streaming, by contrast, is about how data is delivered and acted on. Real-time data can be streamed, that is, made to flow to a system as it’s generated. But historical data can also be streamed by replaying it as a continuous flow.
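The point that historical data can be streamed is easy to demonstrate. In this minimal sketch the trade prices are made-up sample values and `stream` is a hypothetical helper; the consumer processes one record at a time and cannot tell whether the records were generated just now or recorded long ago.

```python
import time

def stream(records, delay_seconds=0.0):
    """Yield stored records one at a time, optionally paced like a live feed."""
    for record in records:
        if delay_seconds:
            time.sleep(delay_seconds)
        yield record

# Assumption: invented sample data standing in for recorded trades.
historical_trades = [("ACME", 10.0), ("ACME", 10.5), ("ACME", 9.5)]

running_max = None
maxima = []
for symbol, price in stream(historical_trades):
    # The consumer keeps only a running result, never the full history.
    running_max = price if running_max is None else max(running_max, price)
    maxima.append(running_max)

print(maxima)
```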
What is real-time stream processing?
When a stream of data is processed, it is by definition being processed in real time. But the phrase usually refers to processing a stream of real-time data as it’s created. Of course, there is always some latency between data creation and data processing, even if only milliseconds. The definition of real time varies depending on the use case, but, regardless of the use case, real-time processing is very different from batch-oriented processing.
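One way to see the difference from batch-oriented processing is to compute the same statistic both ways. The sketch below uses hypothetical names and made-up readings: the batch function needs the complete dataset before it can answer, while the streaming version updates its answer as each value arrives.

```python
def batch_mean(values):
    """Batch style: requires the whole dataset up front."""
    return sum(values) / len(values)

class StreamingMean:
    """Stream style: keeps a running mean, updated one value at a time."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value):
        self.count += 1
        # Incremental update: shift the mean toward the new value.
        self.mean += (value - self.mean) / self.count
        return self.mean

readings = [2.0, 4.0, 6.0]  # assumption: invented sample readings

running = StreamingMean()
intermediate = [running.update(r) for r in readings]

print(intermediate, batch_mean(readings))
```

Both approaches end at the same value, but only the streaming version has a usable answer after the first reading.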