Picking the Right Data Store: A History Lesson
In this video, you’ll find out how we got from the invention of magnetic storage and fast computers to today’s complex world of modern data platforms and streaming analytics.
There were a lot of steps along the way, starting with mainframes and hierarchical databases. Progress has occurred in every area: database design, query languages, scalable systems, and distributed architectures. Now open source projects like Hadoop and AI technologies like machine learning continue to advance the way organizations consume, store, and analyze data.
So, let's have a quick history lesson. How did we get to where we are today, in a world that's so complicated? Well, people have been trying to answer that question for a while. Luckily for us, somebody invented magnetic storage, which let us save our data somewhere we could access it quickly. And people invented fast computers to help us query that data.
Mainframes and Hierarchical Databases
We ended up with mainframes, with hierarchical databases, and with transactional systems that could count the number of widgets we sold and tell us how our business was doing. We created standards around that. We created the SQL language. A series of companies, like Oracle, became huge around the SQL database. We came to understand the difference between transactional systems, where we collected every sale, and data warehouses, or OLAP systems, where we reorganized that data so we could ask questions by grouping information, rolling things up, drilling down, and expanding.
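The grouping, roll-up, and drill-down ideas mentioned above can be sketched in a few lines of plain Python. This is an illustrative toy, not any particular OLAP engine; the sales records and dimension names are made up for the example.

```python
from collections import defaultdict

# Toy transactional records (OLTP-style): one row per sale.
sales = [
    {"region": "East", "product": "widget", "amount": 120},
    {"region": "East", "product": "gadget", "amount": 80},
    {"region": "West", "product": "widget", "amount": 200},
    {"region": "West", "product": "widget", "amount": 50},
]

# "Drill down": totals at the finest grain, per (region, product) pair.
by_region_product = defaultdict(int)
for row in sales:
    by_region_product[(row["region"], row["product"])] += row["amount"]

# "Roll up": collapse the product dimension to get totals per region.
by_region = defaultdict(int)
for (region, _product), total in by_region_product.items():
    by_region[region] += total
```

A warehouse precomputes and stores aggregates like these so analysts can move between levels of detail without rescanning every raw sale.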
ETL, Data Mining and Scalable Systems
We created systems to translate those transactional data sets from the OLTP (online transaction processing) systems to the OLAP systems. That was ETL: extract, transform, load.
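A minimal sketch of that extract-transform-load flow, using invented row data and a hypothetical warehouse shape purely for illustration:

```python
from datetime import date

# Extract: raw rows as they might come out of an OLTP table (all strings).
oltp_rows = [
    ("2024-01-03", "widget", "3", "9.99"),
    ("2024-01-03", "widget", "1", "9.99"),
    ("2024-01-04", "gadget", "2", "24.50"),
]

# Transform: parse types and derive a per-line revenue measure.
def transform(row):
    day, product, qty, price = row
    return {
        "day": date.fromisoformat(day),
        "product": product,
        "revenue": int(qty) * float(price),
    }

# Load: aggregate into a warehouse-friendly fact, daily revenue per product.
warehouse = {}
for row in oltp_rows:
    fact = transform(row)
    key = (fact["day"], fact["product"])
    warehouse[key] = warehouse.get(key, 0.0) + fact["revenue"]
```

Real ETL tools add scheduling, error handling, and incremental loads, but the extract/transform/load shape is the same.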
We invented big, expensive boxes that were really good and fast at that, and that helped us do things like data mining. That's where companies like Teradata came in. Doesn't mining sound like hard work? I think it was.
We invented more scalable systems: MPP databases, with massively parallel processing architectures. We invented better ways of storing data, like columnar databases, things like Sybase IQ and Vertica, making it even faster to answer more complex questions about your data.
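The columnar idea is easy to see in miniature. Below, the same toy table is laid out row-wise and column-wise; an aggregate over one column only has to touch that column in the columnar layout (and in a real system that column also compresses much better). This is a conceptual sketch, not how Vertica or Sybase IQ are actually implemented.

```python
# Row store: each record is kept together; an aggregate scans whole records.
rows = [
    {"id": 1, "customer": "a", "amount": 10},
    {"id": 2, "customer": "b", "amount": 20},
    {"id": 3, "customer": "a", "amount": 30},
]

# Column store: each column is kept contiguously; an aggregate over
# "amount" reads only the "amount" column.
columns = {
    "id": [1, 2, 3],
    "customer": ["a", "b", "a"],
    "amount": [10, 20, 30],
}

row_total = sum(r["amount"] for r in rows)   # touches every field of every row
col_total = sum(columns["amount"])           # touches one column only
```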
We invented in-memory databases to make it faster still, things like SAP HANA. We looked at other technologies, like GPUs, for even faster access and querying of that data.
From Transactional Data to Observational Data
We realized that, as we moved from transactional data to interactional data to observational data, we needed bigger systems, which meant scaling out. We had to move from systems holding a terabyte of data that could easily fit on one server, to many terabytes spread over many servers, to many petabytes, which were, quite honestly, the domain of very big and very clever companies.
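One common way to spread data over many servers is hash partitioning: route each record to a shard by hashing its key. A minimal sketch, with a made-up server count and made-up customer IDs:

```python
import hashlib

# Assumed cluster size for the example.
NUM_SERVERS = 4

def server_for(key: str) -> int:
    # Hash the key with a stable hash so routing is deterministic
    # across processes and runs.
    digest = hashlib.sha256(key.encode()).digest()
    return digest[0] % NUM_SERVERS

# Every record lands on exactly one server; no single machine
# has to hold the whole dataset.
shards = {i: [] for i in range(NUM_SERVERS)}
for customer_id in ("c-100", "c-101", "c-102", "c-103", "c-104"):
    shards[server_for(customer_id)].append(customer_id)
```

Real systems layer replication and rebalancing on top, but the routing idea is this simple.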
This brings us to about 10 years ago, when we had scaled out our systems, when we had moved very much from transactional to interactional systems, and when we were starting to build observational systems. We realized we should collect every piece of data we could, even if we didn't yet know what questions we were going to use it to answer.
Open Source Projects
We came up with all sorts of great open source projects, like Hadoop, to help us leverage cheap systems in a scale-out way to answer more and more questions. We made huge strides in running things in the cloud. No longer did I have to buy a server, wait for it to be delivered, install some software on it, put it in my data center, and then start running the software and asking my questions. I could simply click a button and have it spun up in Amazon.
We did a lot of work to understand how to ingest data quickly. We realized that, to query data as it's coming in, at such large volumes, we need to ingest it continuously rather than waiting two days to process it. That's where things like Kafka and modern data pipelines come from.
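The ingest-as-it-arrives idea can be sketched with a plain in-process queue: producers append events as they happen, and a consumer processes them continuously instead of waiting for a nightly batch. This is only a stand-in for the pattern, not the Kafka API; the event shape and sentinel are invented for the example.

```python
import queue
import threading

# A stand-in for a Kafka-style event stream.
events = queue.Queue()
processed = []

def consumer():
    # Process events continuously as they arrive.
    while True:
        event = events.get()
        if event is None:          # sentinel: stream closed
            break
        processed.append(event["clicks"])

t = threading.Thread(target=consumer)
t.start()

# Producer side: events stream in one at a time.
for n in (3, 1, 4):
    events.put({"clicks": n})
events.put(None)
t.join()

total_clicks = sum(processed)
```

Kafka adds durability, partitioning, and replay on top of this, which is what makes the pattern work at petabyte scale.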
We realized that machine learning is a way to answer a lot of the harder questions without needing somebody to pose the question. So, rather than Charles Darwin asking specifically about dogs and dinosaurs, he could just say: find me some patterns related to teeth and other features across all these different animal skeletons, and see if there's something interesting.
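That "find me some patterns" style can be illustrated with the simplest unsupervised method: clustering. Below, a tiny one-dimensional k-means groups made-up tooth-length measurements into two clusters without ever being told what the groups mean. The numbers and the two-cluster choice are assumptions for the sketch.

```python
# Toy measurements (say, tooth lengths from different skeletons).
lengths = [1.0, 1.2, 0.9, 6.8, 7.1, 7.0]

# 1-D k-means with k=2: start the centroids at the extremes.
centroids = [min(lengths), max(lengths)]
for _ in range(10):
    groups = [[], []]
    for x in lengths:
        # Assign each measurement to its nearest centroid.
        nearest = min((0, 1), key=lambda i: abs(x - centroids[i]))
        groups[nearest].append(x)
    # Move each centroid to the mean of its group.
    centroids = [sum(g) / len(g) for g in groups]
```

The two groups that emerge (small teeth vs. large teeth) are a pattern nobody had to ask for in advance.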
All of these things have become possible. So, the question is: how can you use this information to be successful?