What's bigger than big data? A lot of analysts and companies developing analytic software might say the internet of things (IoT). The IoT is certainly a huge source of big data, and as more devices come online, that source will continue to grow.
In a recent webcast, Mark Madsen of Third Nature and Ian Fyfe of Zoomdata gave their thoughts on the significance of IoT data — its challenges and opportunities.
Streaming Data and the IoT
Mark started by making a basic distinction between IoT data and other types of data. IoT data describes events; it's not static data at rest but data that flows across a network. Events have a location and a time span in which they occur, which could range from milliseconds to something relatively slow, like a reading every sixty seconds. This is true whether you're talking about a “smart” blender in your home, a sensor in your car, a networked machine in a factory, or a piece of software running infrastructure. Every event is a data point, and in many cases the data is used within the device that created it.
But when event data is propagated by devices across a network, it can be persisted in a log file, a database, or a Hadoop cluster, then used by other systems and analyzed for a particular purpose. It may generate an alert or trigger some type of action. The action could be taken automatically by a machine, or it could be human intervention. The action itself is another event.
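As a rough illustration, here's a minimal Python sketch of that flow: an event record with a device, a location, and a timestamp that gets appended to a log file and can trigger an alert when a threshold is crossed. The device name, threshold, and file path are all hypothetical.

```python
import json
import time
from dataclasses import dataclass, asdict

# A minimal IoT event: a reading with a device, a location, and a timestamp.
@dataclass
class SensorEvent:
    device_id: str     # hypothetical device name
    location: str
    timestamp: float   # epoch seconds
    value: float       # e.g., temperature in degrees C

def handle(event: SensorEvent, log_path: str = "events.log") -> None:
    # Persist the event to an append-only log file.
    with open(log_path, "a") as log:
        log.write(json.dumps(asdict(event)) + "\n")
    # The event may also trigger an action; here, a simple threshold alert.
    # The action itself is another event.
    if event.value > 30.0:   # hypothetical threshold
        print(f"ALERT: {event.device_id} at {event.location} read {event.value}")

handle(SensorEvent("sensor-42", "factory-floor", time.time(), 31.5))
```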
How much of the data you persist, and for how long, is really decided by the use case. For example, there are smart building systems that constantly record temperature readings to keep the building at a stable temperature. Does the system need to record every individual reading, or can it average the readings over a span of time and record that instead? It's rare to need to persist every event in an event stream.
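A minimal sketch of that idea, assuming simple (timestamp, value) readings: collapse each fixed window of readings into one averaged record instead of persisting every event.

```python
from statistics import mean

def average_per_window(readings, window_seconds=60):
    """Collapse (timestamp, value) readings into one averaged
    record per tumbling window instead of persisting every event."""
    windows = {}
    for ts, value in readings:
        bucket = int(ts // window_seconds)   # which window this reading falls in
        windows.setdefault(bucket, []).append(value)
    # One persisted record per window: (window start time, average reading)
    return [(bucket * window_seconds, mean(vals))
            for bucket, vals in sorted(windows.items())]

readings = [(0, 20.1), (15, 20.4), (45, 20.2), (70, 20.9), (95, 21.1)]
print(average_per_window(readings))  # [(0, 20.23...), (60, 21.0)]
```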
A more complex example might be a group of wind turbines, each of which has many sensors that control the pitch of the blades. Some aggregated data would be persisted for analysis of total power output of the group and other metrics. Individual blade data might be allowed to expire. Various levels of monitoring determine the need for data persistence and aggregation, especially when humans are involved.
IoT Data and Data Architecture
Streaming data makes its own demands on data architecture. There is the streaming, or live, component. There is in-memory caching, and there are very low-latency databases like Cassandra, a distributed data store with extremely fast reads and writes. And then there is more persistent storage, which can range from very low latency to very slow. Database latencies are quite fast in transaction-processing terms but not fast at enormous scale.
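As a rough sketch of the low-latency tier, the snippet below writes a raw reading to Cassandra with a time-to-live so the detailed data expires on its own, while aggregates would be written elsewhere with no TTL. It assumes the DataStax cassandra-driver package and a hypothetical iot keyspace with a raw_readings table already in place.

```python
from datetime import datetime
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])   # local node, for illustration only
session = cluster.connect("iot")   # hypothetical keyspace

# Raw readings expire after 24 hours (86400 seconds); hourly aggregates
# would go to a separate table, or a slower store, without a TTL.
session.execute(
    """
    INSERT INTO raw_readings (device_id, ts, value)
    VALUES (%s, %s, %s) USING TTL 86400
    """,
    ("turbine-7-blade-2", datetime(2017, 1, 1), 12.5),
)
```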
So, the architecture must support live streams and all the variations of persisted and aggregated data. And then there are the uses for this data, which generally fall into two categories.
There is the business intelligence (BI) model, which is decision-action: information goes to a human who makes decisions, like whether to add more turbines at the wind farm. Humans set the requirements and the goals and monitor the stream of events.
The other category is continuous, automated monitoring: a streaming dashboard fed by different event streams combined much like a join in a database, which is tricky to carry out when you're writing code. And then there's machine monitoring, with machines talking to each other via software. This is usually about anomaly detection (financial fraud, for example) or decision automation in areas like insurance quote generation.
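To make the "join in a database" analogy concrete, here's a toy sketch of one common approach: match events from two streams on a shared key, but only when their timestamps fall into the same window. The stream contents are invented for illustration.

```python
from collections import defaultdict

def windowed_join(stream_a, stream_b, window_seconds=5):
    """Join two event streams on a shared key, matching only events
    whose timestamps fall in the same window -- roughly what a database
    join looks like when both 'tables' are still arriving."""
    buckets = defaultdict(list)
    for key, ts, payload in stream_a:
        buckets[(key, int(ts // window_seconds))].append(payload)
    joined = []
    for key, ts, payload in stream_b:
        for match in buckets.get((key, int(ts // window_seconds)), []):
            joined.append((key, match, payload))
    return joined

power = [("turbine-7", 1.0, {"kw": 1500}), ("turbine-8", 2.0, {"kw": 90})]
pitch = [("turbine-7", 2.5, {"deg": 12}), ("turbine-8", 3.0, {"deg": 2})]
print(windowed_join(power, pitch))
```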
There are many more examples in both categories, but the takeaway here is that your data architecture must be built to accommodate how you’re consuming streaming data, how you’re persisting it, and what your analytical goals are.
IoT Data Sources in Practice
After Mark's presentation, Ian went into greater depth on the explosive growth of IoT data and what it takes to put this data to work. For example, Gartner has estimated that there are already more than six billion devices connected to the internet and that by 2020 there are likely to be around 21 billion; Business Insider's prediction is around 24 billion. No matter how you look at it, that's a lot of devices, and a lot of data. Business and government are trying to figure out how to make the most of this data, and analytical tools are a big part of the equation there, along with data stores.
When you've got data streaming in real time, some of it needs to be landed and persisted so it can be used for historical comparisons. That requires a very scalable, high-performance data store, not a traditional database; traditional databases aren't designed to handle data arriving in real time or real-time queries. That's where modern data stores like MemSQL, Kudu, and Google BigQuery come into play.
Then on the front end, you need tools that let you interact with data: not just submit a query and wait for a response, but play back, rewind, and fast-forward data streams on the fly. You also need the ability to combine various data sources, such as streaming data with historical reference data that might live in a traditional database like Oracle or in a CRM application, and to visualize the results almost at once.
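Here's a hedged sketch of that kind of combination: enriching live events with reference data that, in practice, might be pulled from Oracle or a CRM, but is stubbed here as an in-memory lookup keyed by device.

```python
# Reference data stub; in practice this might be loaded from Oracle or a CRM.
reference = {
    "sensor-42": {"customer": "Acme Corp", "install_date": "2015-03-01"},
    "sensor-43": {"customer": "Globex", "install_date": "2016-07-15"},
}

def enrich(stream):
    # Attach the historical context to each live reading as it arrives.
    for event in stream:
        context = reference.get(event["device_id"], {})
        yield {**event, **context}

live = [{"device_id": "sensor-42", "value": 20.4}]
for row in enrich(live):
    print(row)
```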
Zoomdata delivers this capability via micro query technology, which breaks large queries into tiny queries that run very fast even with huge data volumes. This enables visualizations to display and sharpen in real time as more of the micro queries return results. All this requires a robust streaming engine.
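Zoomdata's actual engine is proprietary, but the general idea can be sketched like this: split one large aggregate into many small per-partition queries and refine a running result as each one returns, so a chart can redraw before the full answer is in.

```python
def micro_query(partitions):
    """Illustrative sketch of the micro-query idea (not Zoomdata's actual
    implementation): run one large aggregate as many small per-partition
    queries and yield a sharper running estimate after each one."""
    total, count = 0.0, 0
    for i, partition in enumerate(partitions, 1):
        total += sum(partition)   # one tiny query's result
        count += len(partition)
        # The visualization can redraw with this refined estimate now,
        # instead of waiting for the full query to finish.
        yield i, total / count

partitions = [[10, 12, 11], [9, 14], [13, 12, 12, 10]]
for done, running_avg in micro_query(partitions):
    print(f"after {done} micro queries: avg = {running_avg:.2f}")
```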
Legacy BI tools were not cut out for this kind of data ecosystem. They were built to analyze static data stores that might update overnight or, at most, a few times a day.
To hear more from Mark and Ian, check out the entire webcast.