What's the Trade-Off Between Speed and Scalability?
Watch this video to find out the trade-offs between speed and scalability that all analytical tools must balance.
For big data analytics, some tools emphasize speed. Others place a priority on scalability. Tools that focus on speed typically use an "extract and query" approach, while tools that prioritize scalability query the source data directly.
Hi, everyone. I'm here today to talk to you about big data analytics and, more specifically, the kinds of tools you need to query big data. Now, one of the things you need to look for in a BI or analytics tool is how well it optimizes or balances scalability and performance. Every BI and analytical tool out there has had to make an architectural trade-off between speed and scalability.
Trade-Offs Between Speed and Scalability When Querying Big Data
So, if they're going to emphasize speed, they take what I call an extract and query approach: they extract data from the source, move it down into the BI or analytical tool, and store it there in memory or on disk. Now, there are a couple of advantages to this. One is that you can model that data, put it in memory, and get consistently fast queries, which is what users demand. The downside, however, is that you only get a subset of the data, and usually that's the summary data. The data also may not be fresh; it's only as fresh as it was when it was extracted.
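To make the pattern concrete, here's a minimal sketch of extract and query in Python, using an in-memory SQLite database as a stand-in for the BI tool's local store. The `sales` table, its columns, and its values are hypothetical illustrations, not any particular vendor's schema:

```python
import sqlite3

# Simulated source system: a sales table with detailed rows.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE sales (region TEXT, amount REAL)")
source.executemany("INSERT INTO sales VALUES (?, ?)",
                   [("east", 100.0), ("east", 50.0), ("west", 75.0)])

# Extract step: pull a summary subset into the BI tool's local store.
extract = sqlite3.connect(":memory:")
extract.execute("CREATE TABLE sales_summary (region TEXT, total REAL)")
rows = source.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region").fetchall()
extract.executemany("INSERT INTO sales_summary VALUES (?, ?)", rows)

# Query step: users hit the fast local copy, not the source.
print(dict(extract.execute("SELECT region, total FROM sales_summary")))
# Note: results are only as fresh as the last extract, and only the
# summary subset made it across -- the detail rows stayed behind.
```

The two connections mirror the two tiers: the source keeps every row, while the tool's store holds only the modeled summary it extracted.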
Emphasizing Scalability -- Direct Queries
Now, if you're not emphasizing speed, you're emphasizing scalability: giving users direct access to all the data. That's what I call a direct query approach; you're querying the source data directly. The benefit is that you get access to all the data, not just a subset of it, and you get access to the freshest data possible. The downside, especially in a big data environment, is that your query performance may suffer, especially if you're trying to query across terabytes or petabytes of data.
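A minimal sketch of the direct query approach, again using an in-memory SQLite database as a hypothetical source system, shows the freshness benefit: every query goes straight to the source, so new rows are visible immediately:

```python
import sqlite3

# Simulated source system with live, detailed data.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE sales (region TEXT, amount REAL)")
source.execute("INSERT INTO sales VALUES ('east', 100.0)")

def total(region):
    # Direct query: every request goes to the source, so it always
    # sees the freshest data -- at the cost of source-side load.
    row = source.execute(
        "SELECT SUM(amount) FROM sales WHERE region = ?", (region,)).fetchone()
    return row[0]

print(total("east"))   # reflects the data loaded so far
source.execute("INSERT IN" "TO sales VALUES ('east', 25.0)")
print(total("east"))   # the new row is visible on the very next query
```

The trade-off is that every user query lands on the source; across terabytes, that is where performance suffers.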
Compensating for Lack of Scalability
So, every BI tool, no matter which side of the spectrum it starts on, tries to compensate for its inherent weaknesses. If you're emphasizing speed and using the extract and query approach, you're going to compensate by adopting a variety of scalability strategies to give you the best of both worlds.
One option with the extract and query approach, if you want to support larger databases and extract more data, is to increase the hardware: the memory and CPU power of your machines. Of course, that gets pretty expensive after a while, and your company may not want you to do that. Another is to create not just one extract database but multiple extract databases and daisy-chain those together. We used to do that in the old OLAP days. Of course, this gets to be a nightmare to maintain, keeping all those databases in sync.
Another approach is to extract just the summary data that users are going to query 80 percent of the time, and when they want the detailed data, you set up drill-through or drill-to-detail links. That works pretty well, except you have to set those links up, sometimes they have to be coded, and they're kind of hard-wired and a little brittle.
So, examples of this type of approach are OLAP databases and, as I mentioned, a lot of BI tools that use in-memory databases.
Compensating for Weak Performance
On the other side, if your tool starts from a scalability position, it's going to want to adopt a variety of performance strategies to ensure fast, consistent performance across large volumes of data. One way to do that is not to use generic APIs to databases, but to craft interfaces to the native API of each database to extract all the performance that database has to offer.
The second is to use caching: run the common queries once at the beginning of the day, then let users hit the cache for those common queries and hit the database directly for less common ones. Of course, caching means that you're not getting the freshest data possible.
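Here's a minimal sketch of that caching strategy, with an in-memory SQLite database standing in for the source and the SQL text used as the cache key (the table and the `warm_cache`/`query` helpers are hypothetical names for illustration):

```python
import sqlite3

# Simulated source system.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE sales (region TEXT, amount REAL)")
source.executemany("INSERT INTO sales VALUES (?, ?)",
                   [("east", 100.0), ("west", 75.0)])

cache = {}

def warm_cache(queries):
    # Run the common queries once, e.g. at the start of the day.
    for sql in queries:
        cache[sql] = source.execute(sql).fetchall()

def query(sql):
    # Common queries hit the cache; everything else goes direct.
    if sql in cache:
        return cache[sql]   # fast, but only as fresh as the warm-up run
    return source.execute(sql).fetchall()

warm_cache(["SELECT SUM(amount) FROM sales"])
print(query("SELECT SUM(amount) FROM sales"))   # served from the cache
print(query("SELECT * FROM sales WHERE region = 'west'"))  # direct hit
```

The staleness trade-off is visible in the code: anything inserted into `sales` after `warm_cache` runs won't show up in cached answers until the next warm-up.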
A third and more novel approach is to use projections: regressions that estimate the result set in real time while your query is running, giving you answers based on the fraction of the results resolved up to that point. This is something, by the way, that Zoomdata does; they call it their data sharpening technique, which they've patented. It's actually one of the coolest things I've seen in our space in quite a while, and I encourage you to check it out in one of the videos on their website.
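To convey the flavor of progressive estimation, here's a toy sketch that scales a partial scan up to an estimate of the full result. This is only a simple extrapolation over assumed random data, not Zoomdata's patented data sharpening technique:

```python
import random

random.seed(0)
# Assumed dataset: 100,000 uniformly random values standing in for a fact table.
rows = [random.uniform(0, 100) for _ in range(100_000)]

def progressive_sum(rows, checkpoints=(0.01, 0.1, 1.0)):
    # At each checkpoint, estimate the final SUM from the fraction of
    # rows scanned so far, refining the answer as more data resolves.
    estimates = []
    for frac in checkpoints:
        n = int(len(rows) * frac)
        partial = sum(rows[:n])
        estimates.append(partial / frac)   # scale partial sum to full size
    return estimates

for est in progressive_sum(rows):
    print(round(est, 1))   # each estimate converges toward the true sum
```

At the final checkpoint (100 percent of rows scanned) the estimate equals the exact answer; earlier checkpoints trade accuracy for an immediate response.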
Of course, this direct query approach has also been used for many years by real-time dashboarding tools, which, if you set them up properly, can actually "twinkle" with real-time data as it comes in.
So, make sure when you’re looking for a big data analytics tool that you understand its architecture and how it compensates for the deficiencies of its approach and optimizes both scalability and performance.
So, this is Wayne Eckerson signing off.