Ah, the Lambda architecture. So data’s getting faster and streaming in real time? Great! Oh, your database can’t accept a continuous stream of INSERTs and also respond to SELECTs from users at high scale at the same time? Behold the Lambda architecture:
Use a Fast Data Sink, Not a Lambda Architecture for Real-Time Analytics
Basically, the idea is to keep the fast stuff fast and the slow stuff slow. I wrote a paper 14 years ago on the challenges of real-time data warehousing. Fortunately, the data streaming, database, and BI layers have all evolved significantly since then, and there now exist databases and other data storage engines that support the trinity of features needed to do both real-time and historical analytics right, without a Lambda architecture:
- Accept real-time streams of data at high rates
- Simultaneously respond to large volumes of queries, including on the most recently added data
- Store all the history needed for analysis
We call these engines “fast data sinks” and there are four main groups of them today:
- In-memory or GPU databases: databases such as SAP Hana, MemSQL, and Kinetica
- Search engines: Elasticsearch and Solr
- Cutting-edge Hadoop: Kudu, a storage engine that runs on the Cloudera Hadoop stack
- Some cloud databases: Google BigQuery, Snowflake
Some people try to use key-value stores and document datastores such as HBase and MongoDB for this type of use case. That works at lower scales, but as data and query volumes increase, these stores often become too slow to be useful.
Zoomdata operates together with a fast data sink to allow interactivity and visualization on near real-time data. Streaming data can be sent directly to the fast data sink, or to Zoomdata, which immediately puts it into the fast data sink.
When users visualize data in real-time, Zoomdata runs lots of tiny queries directly on the fast data sink, effectively “tailing” the data. But these queries generally include some amount of micro-aggregation, so the raw data does not need to pass through the Zoomdata engine. This allows us to leverage the power of the fast data sink, instead of processing the data multiple times or storing it in multiple places.
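The “tailing” pattern can be sketched roughly as follows. This is only an illustration, not Zoomdata’s actual query logic: an in-memory SQLite database stands in for the fast data sink, and the `events` table, its columns, and the function names are all hypothetical. The point is that each small query asks the sink for aggregates over rows newer than the last timestamp seen, so only the aggregated result leaves the sink.

```python
import sqlite3


def make_sink():
    # Assumption: in-memory SQLite stands in for a real fast data sink.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (ts INTEGER, region TEXT, amount REAL)")
    return conn


def ingest(conn, rows):
    # Streaming writes land directly in the sink.
    conn.executemany(
        "INSERT INTO events (ts, region, amount) VALUES (?, ?, ?)", rows
    )
    conn.commit()


def tail_aggregate(conn, since_ts):
    """Aggregate only the rows newer than since_ts, inside the sink.

    Returns (per-region aggregates, new high-water-mark timestamp).
    """
    rows = conn.execute(
        "SELECT region, COUNT(*), SUM(amount), MAX(ts) "
        "FROM events WHERE ts > ? GROUP BY region ORDER BY region",
        (since_ts,),
    ).fetchall()
    # If nothing new arrived, keep the previous mark.
    new_mark = max((r[3] for r in rows), default=since_ts)
    return [(r[0], r[1], r[2]) for r in rows], new_mark
```

A client would call `tail_aggregate` repeatedly, passing back the returned mark each time, so every query touches only the rows that arrived since the previous one.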
A Lambda architecture, on the other hand, keeps the real-time data separate from the historical data. This is only needed if they can’t be kept together; there is no other benefit to separating “now” from history. The theory was that some analytics need to be done on fresh data, while other, perhaps more complex, analysis doesn’t need the most recent data. In reality, however, you almost always want the freshest data. Even if you aren’t analyzing what happened in the last few seconds or minutes, you would certainly want your analysis to include any historical data that had recently been updated or corrected.
When the Lambda architecture was originally conceived, the idea was that another layer would seamlessly union the data across the “speed” and “batch” layers on behalf of the users. But that type of tool-level unioning hasn’t happened, and even if it had, it wouldn’t support some types of analysis that need the raw data from both layers, such as distinct counts or histogram/binning operations.
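A small example shows why distinct counts resist this kind of unioning. The user IDs and function names below are made up for illustration. If each layer returns only its own distinct count, adding the two double-counts anyone who appears in both layers; the correct answer needs the raw IDs from both.

```python
def naive_distinct(batch_ids, speed_ids):
    # What a union layer gets if each layer returns only its own aggregate.
    return len(set(batch_ids)) + len(set(speed_ids))


def true_distinct(batch_ids, speed_ids):
    # The correct answer needs the raw values from both layers.
    return len(set(batch_ids) | set(speed_ids))


# Hypothetical user IDs seen by each layer; "carol" appears in both.
batch_ids = ["alice", "bob", "carol"]
speed_ids = ["carol", "dave", "dave"]

naive = naive_distinct(batch_ids, speed_ids)   # 3 + 2 = 5
actual = true_distinct(batch_ids, speed_ids)   # 4 distinct users
```

With all the data in one store, a single distinct-count query returns the correct figure directly.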
The net is that some of today’s newest databases and data systems meet the three requirements listed above and so qualify as fast data sinks. They are also becoming inexpensive enough to procure and deploy that they can hold lots of history as well. So there really is no longer a reason to consider a Lambda architecture to handle real-time data. It can all be done in one platform, as long as that platform can act as a fast data sink.
To try Zoomdata for yourself, check out our interactive demos.