The Difference Between Business Intelligence and Data Science
This video explains why data repeatability is one of the major differences between business intelligence and data science.
With BI, we’re used to repeatability in the data: we design data models that serve many purposes, from star schemas and normalized data models to query reports and dashboards. All of these are forms of the same thing, and their purpose is data access, answering a set of questions that don’t change. But data science often can’t use a repeatable data schema. The questions are constantly changing, and you can’t design a data model that adapts to constantly changing questions.
I'm Mark Madsen, and the topic I'd like to talk about is the difference between BI and data science and how that affects the systems that we're building.
Business Intelligence and Repeatability
When we think of BI, we think of repeatability around data, and so we design a data model that can be used for many things. We have star schemas or normalized data models, and then we have query reporting and dashboards, which are all really just forms of the same thing. And the point of that is data access. We've built data marts and data warehouses and BI tools to get access to data. And so, we sum the data, we count the data, we look at the data, we get factual answers to questions we know in advance. We don't know how many widgets were sold, but we know the question. And so, the entire modeling methodology for anything in the database world is: know the questions first, so that you can construct a model that stores the data in a way that answers those questions. Which means it's all about repeatability.
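To make the "known question first, schema second" idea concrete, here is a minimal sketch of that kind of repeatable star-schema query, using SQLite and hypothetical table and column names (the video doesn't specify any schema):

```python
import sqlite3

# Hypothetical star schema: a sales fact table joined to a product dimension.
# The question ("how many widgets were sold per category?") is known in
# advance, so the schema is designed to answer it over and over.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE fact_sales  (product_id INTEGER, quantity INTEGER);
    INSERT INTO dim_product VALUES (1, 'widgets'), (2, 'gadgets');
    INSERT INTO fact_sales  VALUES (1, 10), (1, 5), (2, 7);
""")

# The repeatable part: the same model serves this same query every day.
rows = conn.execute("""
    SELECT p.category, SUM(f.quantity)
    FROM fact_sales f JOIN dim_product p USING (product_id)
    GROUP BY p.category
""").fetchall()
print(rows)
```

The point is that the schema was built because the question was known; the query can run unchanged against tomorrow's data.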
Data Science Often Can’t Use Repeatable Data Schema
And then you have analytic problems. Whether it's a really simple analysis that doesn't require any higher-order math, something that requires correlation and a little bit of statistics, or something out in the machine learning realm, you're doing something a little bit different here. Those problems are narrower, in that there's a specific goal: to answer this question using a particular technique in a particular way, and it's very narrowly framed. But whether it's statistics or machine learning or any other technique, the data has to be structured for that particular technique, that algorithm. And that means there's no repeatability in the schema. You can't design a data model that supports 15 different machine learning algorithms, because each one is going to have its own special way to treat data.
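A small sketch of why one schema can't serve every algorithm: the same hypothetical customer records need different shapes for a logistic regression than for k-means clustering. The field names, the one-hot encoding for the regression, and the min-max scaling for clustering are all assumptions for illustration, not anything specified in the video:

```python
# Hypothetical raw records: the "stored" form of the data.
records = [
    {"age": 25, "plan": "basic",   "churned": 0},
    {"age": 40, "plan": "premium", "churned": 1},
    {"age": 31, "plan": "basic",   "churned": 0},
]

# Shape 1: a logistic regression wants purely numeric features, so the
# categorical "plan" column gets one-hot encoded (assumed preprocessing).
plans = sorted({r["plan"] for r in records})
X_logreg = [
    [r["age"]] + [1 if r["plan"] == p else 0 for p in plans]
    for r in records
]

# Shape 2: k-means wants features on comparable scales, so "age" is
# min-max scaled and the label column is dropped entirely.
ages = [r["age"] for r in records]
lo, hi = min(ages), max(ages)
X_kmeans = [[(a - lo) / (hi - lo)] for a in ages]

print(X_logreg)  # three rows of [age, is_basic, is_premium]
print(X_kmeans)  # three rows of [scaled_age]
```

Two algorithms, two incompatible input shapes, one underlying dataset: that is the lack of schema repeatability the transcript describes.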
Repeatability in Storing Data Versus Modeling Data
And so, when an analyst approaches a problem and decides to use a classifier of a particular type, or says, "I'm going to do a logistic regression, I'm going to do k-means clustering, I'm going to do topological analysis," each of those is going to require that they take the data, pull out the subset that's of interest to them, and then structure it for that particular technique.
And that means that the point of repeatability is not in the schema. So, you can't approach this as a data modeling problem where you get your data and map it into a repeatable structure because each one is slightly different. And so, the repeatability has to be in storing the data so that it can be provisioned for the different techniques that different analysts want to use. And furthermore, each time they come up with a different question or a new question arises, they have to model the data slightly differently. It's called feature engineering, and it's a key part of the process.
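To illustrate feature engineering in that sense, here is a minimal sketch: the stored event log is the repeatable part, and each new question derives its own features from it. The event fields and the two questions are hypothetical examples, not from the video:

```python
# Hypothetical raw event log: the repeatable part is how the events are
# stored, not any single derived schema.
events = [
    {"customer": "a", "amount": 20.0},
    {"customer": "a", "amount": 35.0},
    {"customer": "b", "amount": 5.0},
]

def features_for_churn(events):
    """Question 1: a churn model wants per-customer purchase counts."""
    counts = {}
    for e in events:
        counts[e["customer"]] = counts.get(e["customer"], 0) + 1
    return counts

def features_for_value(events):
    """Question 2: a customer-value model wants total spend per customer.
    Same stored events, a different derived structure."""
    totals = {}
    for e in events:
        totals[e["customer"]] = totals.get(e["customer"], 0.0) + e["amount"]
    return totals

print(features_for_churn(events))
print(features_for_value(events))
```

Each new question repeats this step with a different derivation, which is why the repeatability has to live in the stored events rather than in any one feature table.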
The Happy Middle Ground
People like to focus on this and say, if only we had tools that could make this problem go away, we would solve it. But it's the nature of analytics. It's the nature of data science that torturing the data into a particular shape, and dealing with problems in the data, whether they be outliers or gaps and missing data, is fundamental to the process. And that means that our historical approaches of the last 20 or 30 years to managing data and making it accessible don't work for that class of problem. And so you have the idea of the data lake, which in part makes a lot of sense: just dump the data in, and the analysts will figure out how to form and structure it themselves later.
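The data-lake approach is often called schema-on-read: store raw records with no upfront model and project out only what each analysis needs at read time. A minimal sketch, with hypothetical event fields and an in-memory stand-in for lake storage:

```python
import io
import json

# A data-lake-style store: raw events dumped as JSON lines, no upfront
# schema. Note the second event has an extra field; nothing breaks.
raw = io.StringIO()
for event in [{"user": "a", "page": "/home", "ms": 120},
              {"user": "b", "page": "/buy", "ms": 340, "referrer": "ad"}]:
    raw.write(json.dumps(event) + "\n")

# Schema-on-read: each analyst projects only the fields their technique
# needs, shaping the data at query time rather than at load time.
raw.seek(0)
latencies = [json.loads(line)["ms"] for line in raw]
print(latencies)
```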
Unfortunately, the data lake doesn't go far enough, and the data warehouse and BI methodologies go too far. There's a happy middle ground somewhere for managing data for data science, one that applies minimal structure to the data but still enables access for that class of problem. And that's one of the reasons data architectures are so important now, and why we're rethinking how we choose to store and manage data.