Organizing Data for Query at Scale
It’s great to be able to work with large scale data in distributed systems. But this video explains why the more data you have, the more important it is to optimize data stores for queries at scale. Appropriate partitioning schemes can help although every scheme has inherent limitations.
Likewise, data layout can also substantially reduce seek times for your analytic systems. Hadoop and SQL-on-Hadoop will accommodate a variety of layout options beyond traditional row oriented. Columnar formats like Parquet are more suitable for denormalized data than the row-oriented formats used in relational databases. Parquet functions effectively regardless of the data processing framework, data model, or programming language.
A lot of really smart people have been making it easier to work with distributed systems, and that’s great because nobody wants to figure out all the intricacies of shuffle sort and everything else that has to do with a distributed system. So, it’s resulted in the ability to use common languages like SQL or other APIs that are much easier to use than others to manipulate and work with and interact with data in this new paradigm. You get the ability to scale and move on data that is much, much larger than you’ve ever dealt with before, but there are certain downsides in kind of working with these paradigms, and you need to understand what ways to work with them.
And one of the most common ways to work with the data is to implement a partitioning scheme. Implementing partitioning gives you tons of benefits in terms of allowing these engines to isolate and identify only the piece of data that you need at the time to answer your question and are a real key to scalability.
The downside is that you need to figure out your partitioning scheme when you decide to create your table, identify your collections or do any kind of--set up any of your data structures with your data. Now, you could take a very simple partitioning scheme such as using date. So, if I partition my day, I get really fast query performance if I’m looking at yesterday’s sales figures, because I may be only accessing one node and a very small set of data that’s isolated to that single day. If I’m trying to aggregate the year’s sales figures, I might have other challenges in that partitioning scheme.
The next thing you need to take into account is how your data is laid out. There are a number of different ways to lay out data, especially in the Hadoop and SQL- on-Hadoop kind of paradigms. So, one of the ways to lay out data is just in a row by row level layout, like a CSV or tabbed delimited format. The downside with that is that for any kind of operation you want to do to pluck out a particular column you’re going to have to read in the data for the entire row. Now, that might be okay if you’re dealing with very highly normalized schema in your relational database world. That’s not great when you’re dealing with a heavily denormalized schema where you may have hundreds of attributes in your rows. Well, a much better approach would be to utilize a column oriented format such as Parquet or ORC that provides these new engines the ability to just pluck out the columns that they need, look at only the data that’s relevant to the query that they’re trying to answer and dramatically reduce the seek times of those systems.