Data Modeling for Distributed versus Traditional Systems
In this video, you’ll learn that the root cause of data analytics performance issues can often be traced to differences between traditional business intelligence/data warehousing (BIDW) environments and today’s distributed computing environments.
In terms of data modeling, what worked for BIDW won't fly in the distributed compute world. For one thing, network throughput wasn't much of an issue with BIDW, but it becomes critical when compute resources and data stores reside on different machines strung together via a network.
In a distributed scenario, it makes sense to denormalize data because every join operation generates heavy network traffic, and it also pays to focus on query optimization.
I walk into a lot of implementations where there are performance issues, and the first thing I identify is that there's a definite skills gap. That gap usually lies in translating what used to work in the traditional BIDW world into what does and doesn't work in a distributed world.
The Traditional BI Data Warehouse World
And so, let me talk about the differences between those two paradigms for a little bit. In your traditional BIDW world you usually had a single box, a big hefty machine connected to a really fast disk, and all your query engine had to deal with was how many compute cores it had, how much RAM, and how many fetches it made to that disk. I didn't have to worry about network throughput. I didn't have to worry about which node was responsible for which data, or any of those concerns. I did have to be concerned with the amount of redundancy I had, because my disk was really, really expensive, and I wanted to reduce the amount of data I actually put on that disk.
The Distributed Paradigm
In the distributed world my paradigm is different. My disk is cheap. My storage is cheap. My compute is distributed across multiple individual machines, and they're all strung together with a network. So, with every operation I do, I need to take the network into account. I need to think about how to use the network efficiently, or avoid it altogether, by pointing my queries at the node that holds the data I'm interested in.
The Impact on Data Modeling
So, what does that have to do with data modeling? It has everything to do with data modeling: if I take a highly normalized schema, like what I might have built in my traditional BIDW world, and translate it straight into the distributed world, I'm going to end up with a poorly performing system. Every join operation is going to incur a whole bunch of network traffic that slows my system down and makes it unusable to my users. So what I'm going to do in the new distributed world is denormalize my data, recognizing that a join is still the most expensive operation you can incur in the distributed world.
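To make the denormalization trade-off concrete, here is a minimal Python sketch with hypothetical customer and order data. The in-memory dictionary lookup stands in for a distributed join that would ship rows across the network; the denormalized version copies customer attributes onto each order row at write time, trading cheap storage for join-free reads. The table and field names are invented for illustration.

```python
# Normalized (BIDW-style): customer attributes live in a separate table,
# so answering "orders by region" requires a join. In a distributed store,
# that join moves rows between nodes over the network.
customers = {101: {"name": "Acme", "region": "EMEA"}}
orders = [{"order_id": 1, "customer_id": 101, "amount": 250.0}]

def orders_by_region_normalized(region):
    # The lookup into `customers` stands in for the distributed join.
    return [o for o in orders
            if customers[o["customer_id"]]["region"] == region]

# Denormalized (distributed-style): customer attributes are copied onto
# each order row when it is written. Storage is cheap, so the duplication
# is acceptable; reads never need a join.
orders_denormalized = [
    {"order_id": 1, "customer_id": 101, "customer_name": "Acme",
     "region": "EMEA", "amount": 250.0},
]

def orders_by_region_denormalized(region):
    # A single scan of one table: no join, no cross-node traffic.
    return [o for o in orders_denormalized if o["region"] == region]
```

Both functions return the same orders; the difference is that the denormalized read touches only one table, which is exactly what you want when a join means network traffic.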
The next thing I'm going to do is look at how I'm fashioning my queries and find ways to optimize them. If I can give my SQL engine in the distributed world any kind of hint to say, "Go to node number one rather than node number two," because I know the data is on node number one, I'm obviously going to get much better performance than saying, "I don't know what node this data is on; go and look at all of it." And that is where a lot of your physical data modeling comes into play.
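The point about pointing queries at the right node can be sketched in a few lines of Python. This is a toy hash-partitioned store, not any particular engine's API: real systems express the same idea through distribution or partition keys, but the effect is the same, a well-modeled key lets a lookup touch one node instead of all of them.

```python
NUM_NODES = 4

# Each "node" holds the rows whose partition key hashes to it.
nodes = {n: {} for n in range(NUM_NODES)}

def node_for(key):
    # Simplified hash partitioning on the key chosen at modeling time.
    return hash(key) % NUM_NODES

def write(key, row):
    nodes[node_for(key)][key] = row

def lookup_with_pruning(key):
    # The engine knows which node owns the key, so it contacts exactly one.
    return nodes[node_for(key)].get(key), 1  # (row, nodes contacted)

def lookup_broadcast(key):
    # Without partition awareness, every node must be asked.
    found, contacted = None, 0
    for shard in nodes.values():
        contacted += 1
        if key in shard:
            found = shard[key]
    return found, contacted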