Making Streaming Analytics Work for You With Engines Like Spark Streaming And Twitter’s Heron
Watch this video to get an idea of what streaming analytics might be able to do for your organization, and the challenges you may encounter along the way.
First, you’ll have choices to make depending on the use case you have in mind. One of the first is the kind of streaming engine you need. For example, some streaming engines, like Spark Streaming, actually use a form of very fast batch processing with micro-batches. Other engines, like Twitter’s Heron, offer pure, one-event-at-a-time streaming.
So, how do you make streaming work for you? If you're implementing a streaming application for the first time, the choices can be a bit intimidating, because there are a lot of different ways to parse, analyze, and sample events.
The Streaming Engine
For one thing, there's the type of streaming engine. Some streaming engines operate by doing very fast batch processing on what we call micro-batches. In other words, you take batches of events and process them so fast that the results seem real-time. That's what Spark Streaming does.
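The micro-batch idea can be sketched in plain Python. To be clear, this is a toy illustration of the concept, not actual Spark Streaming code; the batch size and the per-batch average are invented for the example:

```python
from typing import Iterable, Iterator, List

def micro_batches(events: Iterable[float], batch_size: int) -> Iterator[List[float]]:
    """Group a continuous stream of events into fixed-size micro-batches."""
    batch: List[float] = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch       # hand a full batch to the processing step
            batch = []
    if batch:                 # flush any final partial batch
        yield batch

# Because each micro-batch is processed as a unit, batch-style analytics
# (averages, model scoring, joins) are straightforward to apply per batch.
stream = [3.0, 5.0, 7.0, 9.0, 11.0]
averages = [sum(b) / len(b) for b in micro_batches(stream, batch_size=2)]
print(averages)  # [4.0, 8.0, 11.0]
```

A real micro-batch engine does the same grouping on a time interval rather than a count, and distributes the per-batch work across a cluster.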
At the other end of the spectrum is what we call pure streaming, where you're processing a single event at a time. That includes engines such as Storm, Heron, and Apex, and there are a number of others out there.
And the differences are not academic; it's not that one is better than the other. Micro-batching is good for more complex types of analytics, where you can start to apply, say, some machine learning or some modeling. Real-time streaming, on the other hand, is really about low latency and acting in the moment.
So the operations you perform there will, of necessity, be a lot simpler: filtering, say, or some very simple aggregations.
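A pure streaming pipeline touches each event the moment it arrives. Here is a minimal sketch of the filter-plus-simple-aggregation pattern described above; the temperature field, threshold, and event shape are all made up for the example:

```python
from typing import Dict, Iterable

def process_stream(events: Iterable[Dict[str, float]], threshold: float) -> Dict[str, float]:
    """Filter events and maintain a running aggregate, one event at a time."""
    stats = {"count": 0.0, "total": 0.0}
    for event in events:            # each event is handled as it arrives
        reading = event["temp"]
        if reading < threshold:     # simple filter: drop uninteresting events
            continue
        stats["count"] += 1         # simple running aggregation
        stats["total"] += reading
    return stats

readings = [{"temp": 70.0}, {"temp": 102.0}, {"temp": 68.0}, {"temp": 100.0}]
print(process_stream(readings, threshold=99.0))
# {'count': 2.0, 'total': 202.0}
```

Because each event is processed in isolation, latency stays low, but there is no natural batch boundary over which to run heavier analytics.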
Then there's the question of how many times you process each event that comes through. Again, it may sound academic at first, but there are real use cases tied to each of these delivery guarantees.
You could process events at most once, which means it's acceptable to miss some events, but you make sure nothing is processed more than once. You could process events at least once, which guarantees you capture everything, no matter how many times you capture it. And then the most stringent guarantee is exactly once, which is analogous to online transaction systems. For instance, when you withdraw money at an ATM, you want to make sure the bank does not debit your account twice for a single withdrawal.
In a stream processing context, this applies to, say, online capital markets trading systems, and especially algorithmic trading, where the frequency of trades really matters: if you capture something more than once by mistake, it could throw off your algorithm. So that's a case where the requirements are very stringent.
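Exactly-once behavior is commonly achieved by making processing idempotent: track which event IDs have already been applied, so a redelivered event has no effect. This toy sketch illustrates the idea with the ATM scenario above; the event IDs and the in-memory `seen` set are stand-ins for what would be durable, transactional state in a real system:

```python
class ExactlyOnceAccount:
    """Apply each withdrawal event at most one time, even if it is redelivered."""

    def __init__(self, balance: float) -> None:
        self.balance = balance
        self.seen: set = set()        # would be a durable store in a real system

    def withdraw(self, event_id: str, amount: float) -> None:
        if event_id in self.seen:     # duplicate delivery: ignore it
            return
        self.seen.add(event_id)
        self.balance -= amount

account = ExactlyOnceAccount(balance=500.0)
account.withdraw("txn-1", 100.0)
account.withdraw("txn-1", 100.0)  # the network redelivered the same event
print(account.balance)  # 400.0 -- debited once, not twice
```

Under at-least-once delivery, the engine may hand you the same event again after a failure; deduplicating by event ID is what upgrades that to effectively exactly-once processing.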
So again, when you start to implement streaming, as with any new type of application, the choices can be intimidating, but there is a logic to them. The fact is, no approach is better than another; what matters is which is more appropriate to the use case.