From Data to Knowledge: Spark, Data Lakes, and the Return of SQL
In this video, you’ll learn how organizations are getting from data to insight.
Modern data platforms have lowered the cost of storing and processing data, which means organizations can afford to keep more data and more varieties of data, including unstructured data. Of course, now they need to do something with it. From data to knowledge to actionable insight is the path to wisdom. For querying large volumes of data, SQL on Hadoop has emerged, along with technologies like Spark that have moved Hadoop beyond batch processing. Likewise, data lakes have become a way to create a unified repository that serves multiple analytic use cases, but they’ve increased concerns about data governance.
We're talking about modern data platforms, and we talked about the fact that the lowering cost of storing and processing data has enabled companies to store more data than they ever have before, particularly unstructured data, which was perhaps previously somewhat ignored.
Gaining Knowledge from Data
It's important to acknowledge, though, that just storing data is only one thing. It's what you do with it that counts. Going from data to knowledge to insight, by actually bringing those data points together, is what enables organizations to generate wisdom and get value from storing that data, which is fundamentally what this has to be about.
And the way that we've seen organizations do that is through querying data, which most organizations have traditionally done with SQL. Hence, we've seen SQL coming to NoSQL environments, and also SQL on Hadoop.
The Return of SQL
Now, we see some people talking a lot about the return of SQL as if it went away and came back. Actually, nobody stopped using SQL. For a while, people were perhaps more interested in some emerging platforms, but SQL itself never went away. So, really, what we have here is about bringing the way in which people want to query and analyze their data to the environments in which that data is now stored.
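To make that idea a little more concrete, here is a minimal sketch of asking a plain SQL question of raw, semi-structured records. It uses Python's built-in sqlite3 module purely as a stand-in for a SQL-on-Hadoop engine, so that it runs anywhere; it is not how Hadoop or Spark work internally. The table name, field names, and sample records are all made up for illustration.

```python
import json
import sqlite3

# Hypothetical raw records, as they might land on a data platform as JSON lines.
RAW_EVENTS = [
    '{"user": "ana", "action": "view", "ms": 120}',
    '{"user": "ben", "action": "view", "ms": 340}',
    '{"user": "ana", "action": "buy", "ms": 95}',
    '{"user": "ana", "action": "view", "ms": 80}',
]


def load_events(lines):
    """Parse JSON lines into an in-memory SQL table (stand-in for a real engine)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (user TEXT, action TEXT, ms INTEGER)")
    rows = [(r["user"], r["action"], r["ms"]) for r in map(json.loads, lines)]
    conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)
    return conn


def views_per_user(conn):
    """The kind of question an analyst would ask in plain SQL."""
    query = """
        SELECT user, COUNT(*) AS views, AVG(ms) AS avg_ms
        FROM events
        WHERE action = 'view'
        GROUP BY user
        ORDER BY user
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    for row in views_per_user(load_events(RAW_EVENTS)):
        print(row)
```

The point of the sketch is that the analyst's query stays ordinary SQL; only the engine underneath, and the place where the data sits, would change on a real platform.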
One of the technologies that a lot of people are using to perform SQL on Hadoop querying is Apache Spark. Spark is not the only such project, but it's a very interesting one in the evolution of Hadoop beyond just batch processing.
And we certainly see that Spark is going to be fundamental to the use cases that people are bringing to their Hadoop environments in the coming years. We see it as an in-memory, high-performance processing layer that analytics providers are also taking advantage of in order to provide high-performance analytics on top of large volumes of data.
If we think about the evolution of Hadoop, as well, we can't ignore data lakes. Now, "data lake" is a term that has sprung up in the last couple of years. It really describes a unified data repository for serving multiple analytic use cases, mostly, but not necessarily only, on Hadoop. It could involve cloud storage or even potentially a relational database.
I think the key thing here is that unified repository. In order to build one, of course, you can't just put everything in a Hadoop environment and expect to get the right answers out when you come to query it. So, we increasingly see that organizations are focusing on some traditional, somewhat boring issues like data governance and data management to actually provide an understanding of what is in that environment. And having done so, they can then open up access to that environment to business users and data analysts. So, that focus on data governance actually enables a self-service environment.
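As a rough illustration of how governance can enable self-service, here is a toy sketch of a data catalog. Everything in it is hypothetical: the class names, the fields, and the role names are assumptions made for this example, and no particular governance product works exactly this way. The idea is simply that a dataset must be documented before it is registered, and that business users can then discover for themselves what they are allowed to query.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetEntry:
    """Hypothetical catalog record describing one dataset in the lake."""
    name: str
    owner: str
    description: str
    allowed_roles: set = field(default_factory=set)


class Catalog:
    """Toy governance layer: only documented datasets can be opened up."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        # Refuse undocumented data: governance means knowing what is in the lake.
        if not entry.owner or not entry.description:
            raise ValueError(f"{entry.name}: owner and description are required")
        self._entries[entry.name] = entry

    def can_read(self, role, name):
        """True only if the dataset is registered and the role is allowed."""
        entry = self._entries.get(name)
        return entry is not None and role in entry.allowed_roles

    def discover(self, role):
        """Self-service view: names of the datasets this role may query."""
        return sorted(n for n, e in self._entries.items() if role in e.allowed_roles)


if __name__ == "__main__":
    catalog = Catalog()
    catalog.register(DatasetEntry("sales", "finance", "Daily sales totals", {"analyst"}))
    catalog.register(DatasetEntry("clickstream", "web", "Raw site events", {"engineer"}))
    print(catalog.discover("analyst"))
```

Rejecting entries without an owner or description is the "boring" part; the `discover` method is what that discipline buys, since analysts can find usable data without asking anyone.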