From Audio to Images to Text, Anything Can Become Data
Watch this video to learn how we’ve arrived at the point where more data exists outside databases than inside them.
Big data has made us rethink our very narrow view of data management, especially in the business intelligence and data warehousing field. We’re accustomed to thinking of data as records that are all the same type and form. But this is no longer true. The limits of analytical engines have actually limited what we consider data to be. Even in the early days of Hadoop, a lot of projects dealt with data that could fit in a database.
I'm Mark Madsen, and I'm here to talk about how anything is data. One of the things that I think is a big change in the big data market is that we've gone from a period when transactions were recorded in databases to a point where more of the data exists outside of databases than exists inside databases.
Data is More than Records
And so, you know, we think of data in a very narrow view in the data management field, in the BI data warehousing field. We think of data as records that are stored, and they're cleaned, and they're all the same type, and they're all the same form, and that's not the case any more. It never really was, but that's just--the engine has constrained our thinking to the point where we're not even aware of it.
Early Hadoop Projects
Some of the early projects that I did involving streaming data or Hadoop, things like that, they involved data that you could fit into a database like clicks on a website. But, then people started to ask questions that led to needing data that wasn't in that structure, and it wasn't in the database. One of the first ones was people asking questions around landing pages, so we're doing email marketing, and in the email, you click the link, and it takes you to a landing page. And then the question is did we have colors on those links, or were they static links. Were they underlined? Were they images?
Web Pages Become Data
Well, nobody coded that. Nobody wrote down fields in the database. And so, we had to go through and look at all the templates and pull all this data out and then shove that stuff into a table. We did that once for thousands upon thousands of templates all formatted in HTML. And then they asked the next question - well, what about link color. Well, we have to go back and spend another two weeks doing that, and that's just silly.
So, we wrote a bunch of code that could format this. We stuffed all of these things into a column in HBase instead of using a relational database, because HBase doesn't care whether it's a number or a character string or a date. An HTML page is a data type, so we just shoved HTML pages in there. And then any time you had a new question, we could write a little bit of code that sat in between you and the query, and then you could query out "show me response rates based on colored links and sort them by link color."
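To make that "little bit of code" concrete, here is a minimal sketch of pulling link attributes out of a stored HTML template at query time. It is not the code from the project: the class and function names (LinkFeatureParser, link_features) are hypothetical, it uses only Python's standard html.parser module, and it assumes link styling lives in inline style attributes, which real email templates may not do.

```python
from html.parser import HTMLParser


def _style_value(style, prop):
    """Return the value of one CSS property from a whitespace-free style string."""
    for declaration in style.split(";"):
        name, _, value = declaration.partition(":")
        if name == prop:
            return value or None
    return None


class LinkFeatureParser(HTMLParser):
    """Collects simple presentation features for each <a> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []       # one dict per completed link
        self._current = None  # the link that is currently open, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a":
            # Assumption: styling is inline, e.g. style="color: red".
            style = (attrs.get("style") or "").lower().replace(" ", "")
            self._current = {
                "href": attrs.get("href"),
                "color": _style_value(style, "color"),
                "underlined": "text-decoration:underline" in style,
                "is_image": False,
            }
        elif tag == "img" and self._current is not None:
            # An image nested inside the link means the link is an image link.
            self._current["is_image"] = True

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self.links.append(self._current)
            self._current = None


def link_features(html):
    """Parse one stored HTML page and return a list of per-link feature dicts."""
    parser = LinkFeatureParser()
    parser.feed(html)
    parser.close()
    return parser.links
```

The point of the design is that the raw page stays in storage untouched. When the next question comes along (say, link font size), only the extraction function changes, and nothing has to be reloaded into a table.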
And the web page becomes data in the same way that PDFs can be data. Think of working with appraisal reports, which are just big documents that you would normally extract the data out of and put into a bunch of tables in the database. You don't actually need to do that now. You can just leave them in PDFs and query them directly. You may want to put them into a database if you query them a lot, but you don't have to.
Working with Images
Probably, the most interesting project that I've done is working with images. We had questions like how does product packaging color influence sales? And you think about a grocery shelf, and you look at all the colors of the packaging, and package color changes over time. It changes more frequently than product colors do.
So, in clothing, you might have product colors, and you just encode them, and they're in the database, and you have a field, and you can query that. But, when you have things that are packaged, very often, package color is never recorded. It's not even in the description. Sometimes, you're lucky and it says bag of chips orange, bag of chips blue, but it doesn't actually really get recorded. So, you either have to do text analytics against this or, in the worst case, which a lot of companies are in, you go back to product management repositories and content management systems, and they're full of pictures of product and product packaging.
And somebody says, what about package colors. A human being has to go and look at every single image and pull out the package color. But, you can do this programmatically if you have the images because the image is just a data structure. And so, we took thousands of images and threw them out there and then wrote a little bit of code that would pull out what the package color was and surface that as a field. And then you set up a Hive table, and it says package color, which doesn't exist anywhere except as a color in an image, and then you extract that out of that. You can either do that in batch and store it or you can just do that live, which is what we did.
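As an illustration of treating an image as a data structure, the sketch below derives a single "package color" value from pixel data. It is an assumption-laden stand-in, not the project's code: the image is assumed to be already decoded into a list of (r, g, b) tuples, the NAMED_COLORS reference set is hypothetical, and the method (label each pixel with its nearest named color, then take the most frequent label) is just one simple way to do it.

```python
from collections import Counter

# Hypothetical reference set; a real one would be larger and tuned to the catalog.
NAMED_COLORS = {
    "red": (255, 0, 0),
    "orange": (255, 165, 0),
    "yellow": (255, 255, 0),
    "green": (0, 128, 0),
    "blue": (0, 0, 255),
    "white": (255, 255, 255),
    "black": (0, 0, 0),
}


def nearest_color_name(rgb):
    """Name of the reference color closest to rgb (squared Euclidean distance)."""
    return min(
        NAMED_COLORS,
        key=lambda name: sum((a - b) ** 2 for a, b in zip(rgb, NAMED_COLORS[name])),
    )


def package_color(pixels):
    """Most frequent named color among the pixels of one image."""
    if not pixels:
        raise ValueError("image has no pixels")
    counts = Counter(nearest_color_name(pixel) for pixel in pixels)
    return counts.most_common(1)[0][0]
```

A function like this could be run in batch, with the result stored as a column, or called at query time so the field is computed live from the image.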
And then somebody comes along and asks the next logical question, which is what about color palettes, because there's not necessarily one color on a package, there are multiple, and how does color palette influence behavior. And so, you pull out the color palette. Well, that's even more complex because that is a set of colors, and it takes a graphic artist or somebody who actually understands what they're doing to look at color palettes. And that would be even more months of work for a human being, and so we write more annotation code that strips color palettes out of images, which is really, really easy and fast, and we can then surface that and give you this.
And more interesting is that you can also surface the image collection: "show me all images that have this color palette" becomes a possibility. So, now you can do queries based on things like color palette in a where clause, which is not something you can do in a relational database unless you were to encode all of these as values and do all kinds of ETL and model this stuff. You don't have to do that now.
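Here is a rough sketch of what a palette extraction and a palette-based filter might look like. Everything in it is an assumption for illustration: images are lists of (r, g, b) tuples keyed by a made-up name, the palette is simply the most frequent colors after coarse bucketing (a real system might use clustering instead), and images_with_palette plays the role of the where clause.

```python
from collections import Counter


def _bucket(rgb, step):
    """Snap each channel down to a multiple of step so similar shades group together."""
    return tuple((channel // step) * step for channel in rgb)


def color_palette(pixels, size=3, step=64):
    """The `size` most frequent bucketed colors in one image, most frequent first."""
    buckets = Counter(_bucket(pixel, step) for pixel in pixels)
    return [color for color, _ in buckets.most_common(size)]


def images_with_palette(catalog, wanted, size=3, step=64):
    """Names of images whose palette contains every wanted color.

    catalog: dict mapping an image name to its list of (r, g, b) pixels.
    wanted:  iterable of (r, g, b) colors; they are bucketed the same way.
    """
    wanted_buckets = {_bucket(color, step) for color in wanted}
    return [
        name
        for name, pixels in catalog.items()
        if wanted_buckets <= set(color_palette(pixels, size, step))
    ]
```

For example, with a catalog of two hypothetical images, images_with_palette(catalog, [(255, 165, 0)]) returns only the images whose top colors include an orange shade, with no color column ever having been modeled or loaded.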
So, there are ways to use all sorts of things.
Anything Can Become Data
So, anything can become data from audio to images to text. And I think that is one of the reasons you're seeing such a renaissance in various types of analytics is that you do these analytics to strip information out of things that aren't naturally data. So, what we consider data should now be an attribute of some other thing.