Query performance has always been an issue in business intelligence (BI), and many BI users would be happy to have their reports load and render more quickly. Traditionally, the best way to achieve this performance (short of buying a bigger database) has been to build and maintain aggregate tables at various levels of granularity, intercepting common groups of queries so they don't repeatedly re-scan the same raw data. Many BI tools also pull data out of databases into their own memory, into “cubes” of some sort, and run analysis off those extracts.
Downsides of Aggregates and Cubes
Both of these approaches share a major downside: the aggregate or cube must be maintained as new data arrives. In the past that was a daily event, but most warehouses are now stream-fed in near real time. It's not practical to continuously rebuild aggregate tables or in-memory cubes every time a new row arrives or a historical row is updated.
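To see why maintenance gets painful, here is a minimal sketch of hand-maintained aggregation. The table names and schema are purely illustrative, not from any real system. Additive new rows are cheap, but a restated historical row forces a correction that requires knowing the old value (or rebuilding the bucket from raw data):

```python
from collections import defaultdict

# Hypothetical daily sales aggregate, keyed by (date, region).
aggregate = defaultdict(float)

def ingest(row):
    """Additive facts are easy: just add the new amount to its bucket."""
    aggregate[(row["date"], row["region"])] += row["amount"]

def restate(row, old_amount):
    """A restated historical row forces a delta correction: we must know
    the old value to subtract it, or else rebuild from the raw data."""
    key = (row["date"], row["region"])
    aggregate[key] += row["amount"] - old_amount

ingest({"date": "2024-01-01", "region": "EU", "amount": 100.0})
ingest({"date": "2024-01-01", "region": "EU", "amount": 50.0})
# A correction arrives: the first row should have been 80, not 100.
restate({"date": "2024-01-01", "region": "EU", "amount": 80.0}, old_amount=100.0)
print(aggregate[("2024-01-01", "EU")])  # 130.0
```

Multiply this bookkeeping across many aggregates, non-additive measures (distinct counts, percentiles), and a continuous stream of corrections, and the maintenance burden becomes clear.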
Enter Hadoop. Problem solved? No.
The problem of query performance became even more pressing when Hadoop and data lakes emerged as common stores of big data. While the query engines of these systems have improved over time, they cannot always deliver acceptable response times, particularly as data volumes grow and query concurrency increases. Also, with the exception of Kudu, Hadoop and data lakes have not evolved clean ways to ingest real-time streams of data in a way that makes them instantly queryable, and they particularly struggle with late-arriving or restated historical data.
Enter Proprietary Middleware. Problem solved? Still no.
In the world of big data, many companies try to address the query response and concurrency problems by introducing custom middleware that sits between a BI tool and a Hadoop system or data lake. Products like Datameer, Platfora, AtScale, Jethro, and others were designed to improve discrete query performance, but they introduce yet another layer in the stack that you have to select, buy, manage, and maintain. You also have to tune and worry about these products separately from everything else in your stack (the database below them and the BI cache above them).
Here at Zoomdata we have done significant work to optimize queries against Hadoop and data lake systems to make them performant without any middleware. And our patented microquery and Data Sharpening™ technologies work together so regular business users can perform stream-of-thought analysis on massive amounts of data in a powerful and visually compelling user experience.
Modern Data Warehouses + Modern BI to the Rescue
More recently there has been rapid adoption of cloud-native data warehouses, such as Amazon Redshift, Google BigQuery, Qubole, and Snowflake. Nearly everyone we talk to in the data warehousing space these days is using, testing, or thinking about trying a cloud-native database for some of their data warehousing workloads.
What does this mean for BI and query performance? Well, it makes things easier to some extent: these cloud-native systems speak fluent SQL and behave much like traditional data warehouses, but with vastly better speed, scalability, and cost profiles than either traditional MPP warehouses or most Hadoop-based systems. Some of these cloud systems rely heavily on partitioning for performance, including micro-partitioning, which matches up well with Zoomdata's microquery and Data Sharpening support for big data. Other databases available as cloud services take a primarily in-memory approach, such as SAP HANA and MemSQL, or are repackaged versions of on-premises database technologies.
Modern Materialized Views
While microqueries and micro-partitions provide a lot of performance boost for interactive data exploration, the classic Monday Morning problem can still arise when thousands of dashboard consumers all ask for a similar (but not identical) dashboard at the same time. In this case aggregate tables can still be very useful to deflect some of that load from the raw underlying data.
But as discussed above, persisted aggregate tables are hard to maintain in the face of continuously arriving data. To address this, some of these cloud data stores, as well as most traditional data warehouse systems, support materialized views. Traditional databases have offered materialized views for quite some time, and many of the recommendations below apply to those back ends as well, but the recent innovation is that these views can be defined with a single SQL command and then maintained automatically by the system, even as new data arrives continuously at extreme scale and speed.
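As a sketch of what that single command looks like, here is a materialized view over a hypothetical `sales` table (the table and column names are illustrative; exact syntax and supported aggregates vary by database):

```sql
-- One statement replaces a hand-maintained aggregate table.
-- The database keeps it current as rows arrive or are restated.
CREATE MATERIALIZED VIEW daily_sales_mv AS
SELECT sale_date,
       region,
       SUM(amount) AS total_amount,
       COUNT(*)    AS order_count
FROM sales
GROUP BY sale_date, region;
```

Note that some engines restrict what a materialized view may contain (for example, single-table queries with simple aggregates), so check your database's documentation before leaning on joins or complex expressions.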
Even if data arrives late or out of order, or historical data is updated or restated, the modern materialized view can handle it. Even under high rates of inserts and updates, with huge data volumes and virtually unlimited query concurrency, it keeps on trucking.
How to Decide which Materialized Views to Create
Whether you use a cloud-native database, an in-memory database, or a more traditional on-premises data warehouse, materialized views carry some maintenance cost and take up some storage space (though typically far less than the underlying full-detail raw data). So it's best to look at your most commonly used dashboards, see which tables they hit and which joins they perform, and build materialized views that can serve those dashboards. The frequently used Monday Morning dashboards can then run off the materialized views, while still linking to more detailed dashboards and reports for users who want to dive into the details.
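One way to find those candidate tables is to tally which tables your dashboard queries actually hit. The sketch below uses a hypothetical in-memory query log and a deliberately crude regex; in practice you would pull from your warehouse's query-history facility and use a real SQL parser:

```python
import re
from collections import Counter

# Hypothetical dashboard query log; table and column names are illustrative.
query_log = [
    "SELECT region, SUM(amount) FROM sales GROUP BY region",
    "SELECT sale_date, SUM(amount) FROM sales GROUP BY sale_date",
    "SELECT s.region, c.segment, SUM(s.amount) FROM sales s "
    "JOIN customers c ON s.cust_id = c.id GROUP BY s.region, c.segment",
    "SELECT * FROM inventory",
]

def tables_hit(sql):
    """Crudely extract table names following FROM or JOIN keywords."""
    return re.findall(r"(?:FROM|JOIN)\s+([A-Za-z_]\w*)", sql, re.IGNORECASE)

# Count how often each table appears across the logged queries.
usage = Counter(t for q in query_log for t in tables_hit(q))
print(usage.most_common())  # 'sales' tops the list with 3 hits
```

Tables that dominate this tally (here, `sales`) are the natural first candidates for materialized views, since they absorb the most repeated load.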
What you don’t need, however, is a bunch of YAMS (yet another middleware system) sitting between your BI tool and your database. Let modern BI such as Zoomdata and your modern data warehouse work directly together to provide super-fast reports and dashboards, even for thousands of users querying petabytes of data arriving continuously in near real time. So let’s all can the YAMS, keep calm, and query on!