Modern Data Architecture: Production, Collection, Distribution, Consumption
In this video, you learn about the various complexities involved in data architecture and why it should not be confused with data modeling.
Data architecture is about how, where, and why you position data. Data architecture doesn't assume data is in a relational database although our past experience has led us to think that way. Data architecture involves solving the design problems that either support or impede an effective data supply chain. A data supply chain has four components: production, collection, distribution, and consumption. And for various use cases in data science and analytics, each stage has design problems that need to be solved.
I'm Mark Madsen, and I'd like to talk about modern data architecture. Data architecture is one of those things that's kind of a complicated topic because people often confuse it with data modeling. They immediately go from data architecture, which is more about how and where and why you position data and store data and manage it, and turn it into a relational data modeling problem. And data architecture doesn't assume data is in a relational database, but we've been trained to think that way because that's all we've had for so long now, and it's really changing.
The Data Warehouse
And when we think of a data warehouse, people then immediately come up with data lake as a different place to put different data. But, they're both part of the same environment, and you have to think about the full environment because you're using data from both of these places. One doesn't replace the other.
And so, when we go back and look at the model of a data warehouse, it's one place. It's building the Death Star. It takes six months, nine months, 12 months, and you've got this great big intergalactic database. And of course, it takes a long time to build because you have to model the data, you have to clean the data, and you have to make sure that all of it matches and works perfectly together.
Not All Data Use Cases Require the Same Quality Level
But, not all uses of data require the same level of quality. Some uses of data are idiosyncratic. We only need this little chunk of data over here for these three people in this one department, and it never has any bearing on anybody else. But, in our methodologies, we have to put it all in the same place and tie it all together. It's not naturally a distributed architecture, but everything in the IT environment is now a distributed architecture. That's what cloud essentially is, and client-server was already taking us there.
Along Comes Hadoop
So, this idea that we have one thing is, you know, something that's changing. The irony is that along comes Hadoop and the idea of a data lake, and the idea is that it's just a place where you dump all of your data, and we're gonna forego the modeling pieces. But, it's still the same data architecture: one single centralized place for everything. And the problem is any time you have one place for everything to come in and everything to go back out, you create bottlenecks and challenges in the architecture, and you have to rethink this.
Defining Design Problems
The way I approach it is to carve up the problem into design problems. And one of the ones that you find a lot in streaming architectures is if you're trying to persist that data, it's really incompatible with data warehousing methodologies, and it's often incompatible with relational databases as well, because the high and fluctuating rates of incoming data that hit a database while you're simultaneously trying to query it are essentially why we separated transaction processing from data warehousing in the first place. So, really, data collection is itself one challenge. And if one were to design a system, one wouldn't necessarily blend all of this stuff in and do one thing. You'd want to carve it up.
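One way to picture carving collection off from querying is a buffer that takes events as fast as they arrive and hands them on in fixed-size batches. This is only a sketch under assumptions: the `CollectionBuffer` class, the `event_id` field, and the batch size are hypothetical names chosen for illustration, not something described in the talk.

```python
from collections import deque


class CollectionBuffer:
    """Hypothetical in-memory buffer that decouples ingestion from loading."""

    def __init__(self):
        self._pending = deque()

    def ingest(self, event):
        # Ingestion only appends; it never waits on a downstream query engine.
        self._pending.append(event)

    def drain(self, batch_size):
        # Hand back at most batch_size events, oldest first.
        batch = []
        while self._pending and len(batch) < batch_size:
            batch.append(self._pending.popleft())
        return batch


buffer = CollectionBuffer()
for i in range(10):  # a burst of ten events arrives at once
    buffer.ingest({"event_id": i})

batches = []
while True:
    batch = buffer.drain(batch_size=4)
    if not batch:
        break
    batches.append(batch)
```

The downstream side sees steady batches of at most four events no matter how bursty the arrival was, which is the separation the collection stage is meant to provide.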
A Variety of Data Problems
And then you throw into it the variety of data problems. Well, I'm sourcing video, or I'm sourcing event streams which have variable structure, things which are also incompatible with relational databases. Then you start to see, well, maybe I do need something that's filesystem based or object-store based that I'm going to collect data in, but I'm just collecting it there. And if I can collect it quickly, and I can deal with the vagaries of spiky workloads where suddenly website traffic is 200X what it was yesterday and absorb that without breaking the rest of my infrastructure, that's a good thing.
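As a rough illustration of collecting variable-structure data in something file based, the sketch below appends each event as one JSON line and enforces no schema at collection time. The event fields are made up, and `io.StringIO` stands in for a real file or object-store key, so treat all of it as an assumption rather than a prescribed design.

```python
import io
import json


def collect(events, sink):
    """Append each event as one JSON line; no schema is enforced here."""
    count = 0
    for event in events:
        sink.write(json.dumps(event, sort_keys=True) + "\n")
        count += 1
    return count


def read_back(sink):
    """Re-read everything that was collected, one event per line."""
    sink.seek(0)
    return [json.loads(line) for line in sink if line.strip()]


# Hypothetical events whose structure varies from one record to the next.
events = [
    {"type": "click", "page": "/home", "customer": "0042"},
    {"type": "video", "uri": "clip-001", "duration_s": 12.5},
    {"type": "click", "page": "/cart"},
]

sink = io.StringIO()  # stands in for a file or an object-store key
written = collect(events, sink)
stored = read_back(sink)
```

Because nothing is validated on the way in, a record with missing or extra fields costs nothing to collect. The work of making it consistent is deferred to a later stage.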
Separate Collection and Management
And so, you carve off collection and management as separate domains. And so, get your data in, store it in immutable data structures, nothing changes, it's just in, it's recorded, it's done. And then from there, the pipeline might carry it forward for other uses, or if you don't really use that data very often, you leave it where it is. So, it stays in what we'd call the lake, and it's just there, and it's done.
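The "it's in, it's recorded, it's done" idea can be sketched as an append-only log of immutable records. The `Record` and `AppendOnlyLog` names are hypothetical, and this is a minimal in-memory stand-in rather than a real storage layer.

```python
from dataclasses import dataclass, FrozenInstanceError
from types import MappingProxyType


@dataclass(frozen=True)
class Record:
    """One collected event; frozen so its fields cannot be reassigned."""
    sequence: int
    payload: MappingProxyType  # read-only view over a private copy


class AppendOnlyLog:
    """Hypothetical store that offers append and read, but no update or delete."""

    def __init__(self):
        self._records = []

    def append(self, payload):
        # Copy the payload so later changes by the caller cannot leak in.
        record = Record(len(self._records), MappingProxyType(dict(payload)))
        self._records.append(record)
        return record

    def records(self):
        # Return a tuple so callers cannot reorder or remove entries.
        return tuple(self._records)


log = AppendOnlyLog()
first = log.append({"type": "click", "page": "/home"})
log.append({"type": "click", "page": "/cart"})

try:
    first.sequence = 99  # attempting to change a stored record
    mutated = True
except FrozenInstanceError:
    mutated = False
```

Downstream pipelines read from the log and write their own derived copies, so the collected data stays exactly as it arrived.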
But, if you need to match that data--say it's clicks off of a, you know, clickstream, fairly easy data we've been dealing with for years, maybe you want to link that to product sales or customer lifetime value calculations. Well, then you need to do a little bit of work to make sure that keys match. And so data set by data set, you treat the keys, and you start thinking about how do I match these keys up, and you start to apply some level of data quality. Maybe you start typing data so that it's all typed the same way so that customer numbers are numbers across each data set. But, you don't try to build the intergalactic data set. You focus on management. So, your second zone of the data architecture is just how do I manage the data so that it can be linked to other data for other purposes.
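To make the key-matching step concrete, here is a small sketch that types customer numbers the same way in two data sets and then links clicks to sales. The field names (`customer`, `customer_no`, `amount`) and the sample rows are assumptions for illustration only.

```python
def normalize_customer_id(raw):
    """Coerce a customer number to int; return None when it is not numeric."""
    text = str(raw).strip()
    return int(text) if text.isdecimal() else None


# Hypothetical clickstream rows: customer numbers arrive as untidy strings.
clicks = [
    {"customer": "0042", "page": "/home"},
    {"customer": " 7 ", "page": "/cart"},
    {"customer": "guest", "page": "/home"},
]

# Hypothetical sales rows: customer numbers are already integers.
sales = [
    {"customer_no": 42, "amount": 19.99},
    {"customer_no": 7, "amount": 5.00},
]

# Index sales by the normalized key.
sales_by_customer = {}
for sale in sales:
    key = normalize_customer_id(sale["customer_no"])
    sales_by_customer.setdefault(key, []).append(sale["amount"])

# Link each click to sales where the keys match; keep the rest aside.
linked = []
unmatched = []
for click in clicks:
    key = normalize_customer_id(click["customer"])
    if key in sales_by_customer:
        linked.append({
            "customer": key,
            "page": click["page"],
            "total_sales": sum(sales_by_customer[key]),
        })
    else:
        unmatched.append(click)
```

Only the keys are treated here. Nothing else about either data set is remodeled, which is the difference between managing data for linkage and building the one big integrated model.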
And then the third part is really specific to consumption. Star schemas, data warehouse models, they're all about consumption. They're not really about data management or data collection. And so, they are purpose built for that use: query, reporting, and dashboards. But, what happens when you come along with an analytics problem? You're doing statistical modeling. Those data structures in a star schema are not necessarily the ones that I need.
Reforming Data for Data Science and Analytics
And the whole problem of data science and analytics is reforming the data. And oftentimes, what you did to get it into a particular level of quality in the star schema or the third normal form model creates challenges that mean that you can't use that data. And so, you need to do different positionings of data.
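As one hedged example of what reforming can look like, the sketch below takes a tiny star schema (one fact table and one dimension) and reshapes it into a one-row-per-customer feature matrix of the kind a statistical model usually expects. The table contents and column names are invented for illustration.

```python
from collections import defaultdict

# Hypothetical star schema: a product dimension and a sales fact table.
dim_product = {
    1: {"category": "grocery"},
    2: {"category": "household"},
}
fact_sales = [
    {"customer_id": 42, "product_id": 1, "amount": 10.0},
    {"customer_id": 42, "product_id": 2, "amount": 4.0},
    {"customer_id": 7, "product_id": 1, "amount": 3.0},
    {"customer_id": 42, "product_id": 1, "amount": 2.0},
]

# Column order for the feature matrix: one column per product category.
categories = sorted({product["category"] for product in dim_product.values()})

# Aggregate spend per customer per category.
spend = defaultdict(lambda: defaultdict(float))
for row in fact_sales:
    category = dim_product[row["product_id"]]["category"]
    spend[row["customer_id"]][category] += row["amount"]

# One row per customer, one column per category, zero where there is no spend.
customers = sorted(spend)
feature_matrix = [[spend[c][cat] for cat in categories] for c in customers]
```

The same facts now sit in a wide, denormalized shape. That is a different positioning of the data for a different kind of consumer, not a replacement for the star schema.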
Or, maybe you're doing graph work. You're trying to calculate betweenness metrics. And if you're trying to do those calculations, you need a separate engine to do them, so you're going to extract the data and move it forward. Well, that's just a different positioning for consumption of data. You may as well manage those independently of the actual data management area.
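For the graph case, a sketch of "extract the data and move it forward" might look like this: rows are pulled out as an edge list, turned into an adjacency structure, and handed to a graph calculation. The calculation shown is unnormalized betweenness centrality using Brandes' algorithm for an undirected, unweighted graph. The edge rows and function names are hypothetical, and a real workload would more likely use a dedicated graph engine or library.

```python
from collections import defaultdict, deque


def build_adjacency(edges):
    """Turn extracted (a, b) rows into an undirected adjacency mapping."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return dict(adjacency)


def betweenness(adjacency):
    """Unnormalized betweenness centrality (Brandes), undirected and unweighted."""
    scores = {node: 0.0 for node in adjacency}
    for source in adjacency:
        order = []  # nodes in the order the breadth-first search visits them
        predecessors = {node: [] for node in adjacency}
        path_count = {node: 0 for node in adjacency}
        distance = {node: -1 for node in adjacency}
        path_count[source] = 1
        distance[source] = 0
        queue = deque([source])
        while queue:
            node = queue.popleft()
            order.append(node)
            for neighbor in adjacency[node]:
                if distance[neighbor] < 0:  # first time this node is reached
                    distance[neighbor] = distance[node] + 1
                    queue.append(neighbor)
                if distance[neighbor] == distance[node] + 1:
                    # node lies on a shortest path from source to neighbor
                    path_count[neighbor] += path_count[node]
                    predecessors[neighbor].append(node)
        # Accumulate dependencies from the farthest nodes back to the source.
        dependency = {node: 0.0 for node in adjacency}
        for node in reversed(order):
            for pred in predecessors[node]:
                share = path_count[pred] / path_count[node]
                dependency[pred] += share * (1.0 + dependency[node])
            if node != source:
                scores[node] += dependency[node]
    # Each unordered pair was counted from both endpoints, so halve the totals.
    return {node: score / 2.0 for node, score in scores.items()}


# Hypothetical extracted rows, e.g. who referred whom.
edges = [("ann", "bob"), ("bob", "cat"), ("bob", "dan")]
centrality = betweenness(build_adjacency(edges))
```

In this small example every shortest path between two other people runs through `bob`, so `bob` gets the only non-zero score.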
The Data Supply Chain
So, really, you're treating it like a supply chain. You know, you have one area where you're collecting up things and you're getting the raw materials, and you're storing them and managing them. You have distribution facilities where it's packaged up and it's a bit more organized so that when somebody says I need this you can go to that place in the warehouse, package it up and ship it out.
And when you consider something like the retail models, you can have big box grocery stores and small grocery-only stores, and you can have convenience stores that have a very small set of products in them. They're all different retail formats with different assortments and different ways of organizing the product and different products in them, and they're all fed from distribution facilities.
Four Elements of Data Architecture
So, if you think of data architecture, you can think of it as a production and collection problem on one side, a distribution problem in the middle and consumption problems on the other side. So, the data architecture really has multiple places for the same data to live in different levels of cleanliness and different levels of structuring from least to most structured and from untreated to highly treated data. And so, a modern data architecture is a lot more complex than it used to be where we just said here's this one place and we'll put all the data right there.