Misconceptions About and Differences Between Streaming Data and Database Data
Watch this video to find out how the differences between streaming data and database data change how data is managed.
For a long time, enterprise data and database data were synonymous. For example, an application writes transactions into a database. When we want to use the data, we read it out of the database, or we extract it and put it in another database for business intelligence purposes. The model is very different for streaming data. For one thing, there's no place to retrieve your data. It's a push model rather than a pull model. And that creates a lot of changes to the architecture you use to collect and persist that data.
I'm Mark Madsen, and the topic I'm going to talk about is misconceptions and differences between streaming data and the data we're used to that comes out of databases.
In the Beginning: Database Data
To start with, when you think about most of the enterprise data we've been dealing with throughout IT history, it has been data recorded in databases. It is transactions that an application writes into a database, and then when we want to use that data, we read it out of a database, or we extract it into another database for query purposes, and that's the end of it.
Streaming Data: Push not Pull
When you think about streaming data, one of the first things that's different is obviously that it's a push model rather than a pull model. There's no there, there. There's no place to go and retrieve your data. You have data that's flowing across the network, and you have to tap into that. It comes to you; you don't go to it. That's fine when the data flows are predictable, but if you have any sort of spiky workload, which you'll see a lot in streaming data, where suddenly there's 10 times as much data flowing, you then need 10 times as much resource to process that data. The old ideas of accumulating data and pulling it in don't work, so you have to think about how you're going to capture and persist that data regardless of how its volume spikes and falls. And that creates a lot of changes to the architecture one uses to collect and persist that data.
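The push model described above can be sketched in a few lines of code. This is a minimal, hypothetical illustration (the class and method names are invented, not from any real streaming library): the collector never asks for data; events arrive on their own schedule, and a buffer absorbs spikes while the persistence layer drains at its own pace.

```python
from collections import deque

class PushIngestor:
    """Minimal sketch of push-model collection: there is no store to
    pull from, so every event must be captured the moment it arrives."""

    def __init__(self):
        self.buffer = deque()   # absorbs bursts before persistence drains them
        self.persisted = []     # stand-in for a durable store

    def on_event(self, event):
        # Called by the network layer whenever data arrives: push, not pull.
        self.buffer.append(event)

    def drain(self):
        # The persistence layer drains at its own pace; the buffer is what
        # lets a 10x spike pass through without dropping data.
        while self.buffer:
            self.persisted.append(self.buffer.popleft())

ingestor = PushIngestor()
for i in range(10):              # a burst of pushed events
    ingestor.on_event({"seq": i})
ingestor.drain()
```

In a real system the buffer would be a durable log rather than an in-memory queue, but the shape of the problem is the same: capacity has to be sized for the spikes, not the average.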
Monitoring Streaming Data
And when you look at the use of that kind of data, you have another set of problems. You've got data that is flowing on a network, and maybe you need to watch and monitor that data, but most of the time, people don't stare at squiggly lines on a screen. You only need to know when something weird is happening: there's an anomaly, there's a deviation, something triggers an alert that tells you to pay attention. So, you've got a stream of data flowing on the network for which there's some metric, and when an alert fires that says go look at this, you might go look at that screen and see what's going on. And that's part of your flowing data.
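The "only alert when something weird happens" idea can be made concrete with a deliberately simple sketch. This is a hypothetical stand-in for real anomaly detection, with a fixed expected value and threshold chosen for illustration:

```python
def detect_anomalies(values, expected, threshold):
    """Return the indices where a metric deviates from its expected
    value by more than `threshold` -- the deviations that would
    trigger an alert rather than require a human to watch the screen."""
    return [i for i, v in enumerate(values) if abs(v - expected) > threshold]

# A mostly flat metric with one spike at index 3:
readings = [1.0, 1.1, 0.9, 4.2, 1.0]
alerts = detect_anomalies(readings, expected=1.0, threshold=1.0)
```

Production systems use rolling statistics or learned baselines instead of a fixed expected value, but the monitoring pattern is the same: the human only looks when the detector flags something.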
But, often, the event that triggered the alert and your response happen outside of that window, so you need to go back and look at the persisted history and see what happened. Very often, when there's a deviation or an anomaly, things start to squiggle, they get a little more squiggly, and then they get worse. You're seeing the current state, but you need to see the buildup to it, and that means replaying the stream: you go back to a persistence layer, pull the history, and play it forward up to the current moment, which is the live data. So this idea that a stream is somehow separate from persisted data is a bit of a misconception, because in fact it's the same data. It's just either here on the network and live, or in a cache, or in a long-term persisted store. But it's one continuous set of data. And that leads you into different tool architectures for front-end monitoring and different processing architectures on the back end.
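The "one continuous set of data" point can be shown in a tiny sketch: replaying persisted history and then continuing with the live feed is just consuming one sequence. The store and feed here are hypothetical in-memory stand-ins for a real persistence layer and network tap:

```python
import itertools

def replay_then_live(persisted_store, live_feed):
    """Play the persisted history forward, then keep consuming live
    events -- history and the live stream treated as one sequence."""
    return itertools.chain(persisted_store, live_feed)

# Persisted history of a metric, followed by a live anomaly:
history = [{"t": 0, "v": 1.0}, {"t": 1, "v": 1.2}]
live = iter([{"t": 2, "v": 5.0}])
events = list(replay_then_live(history, live))
```

The consumer never has to care which side of the persistence boundary an event came from, which is the architectural consequence of treating stream and store as the same data.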
Uses for Streaming Data
And then when you think about the uses of data and how one might apply BI models to this, you know, I mentioned that people don't just sit and watch squiggly lines on screens. People see things, get alerts, look at it, maybe go back into the past, but then often need to look into more details to set context around the event and see what's happening and why it's happening. And that is an exploration often of persisted data or contextual data that's related to what's going on or what just happened that triggered that alert.
And then there's another level beyond that, which is that when you start talking about embedded systems, whether they are demand-pricing systems that change prices online or recommendation engines, which we've been doing for a long time now, the stream and the data flow is not what you monitor. It flows through a data pipeline, an algorithm does something with it, and that gives you the next song on your music list, or the next video on YouTube, or the next product recommendation to show to somebody who's on a retail site. That recommender is running and doing its thing with an embedded pipeline, and typically, you might just watch that. But the anomalies come up when the model starts to decay or change.
Meta-Level Metrics
And so, the recommender isn't recommending very well anymore. What you're really watching are meta-level metrics: you're watching the recommendations the recommender makes and whether those recommendations are accepted. So, you're actually looking at an outcome variable, not the data flow. The data flow generates these events: the recommender makes a recommendation, and eventually the person skips the song because they didn't like it, or whatever happened.
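A minimal sketch of that outcome variable, with hypothetical event fields invented for illustration: the metric being monitored is the acceptance rate of recommendations, not the underlying data flowing through the pipeline.

```python
def acceptance_rate(events):
    """Meta-level metric: the fraction of recommendations that were
    accepted (e.g. the song was played rather than skipped). A drop
    in this rate is the signal that the model is decaying."""
    recs = [e for e in events if e["type"] == "recommendation"]
    accepted = [e for e in recs if e["accepted"]]
    return len(accepted) / len(recs) if recs else 0.0

events = [
    {"type": "recommendation", "accepted": True},
    {"type": "recommendation", "accepted": False},  # the skipped song
    {"type": "recommendation", "accepted": True},
]
rate = acceptance_rate(events)
```

Alerting would then sit on top of this derived rate, for example firing when it falls below a historical baseline, rather than on the raw event stream.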
And so, secondary metrics are what you really monitor, not the primary data that sits below them. It's a very different kind of approach. In the BI world, we wouldn't really think about things this way, and that BI approach to streaming tends to miss the fact that, often, you are watching an embedded system, not the data that flows through it.