Modern Data Sources: HDFS, Kudu, Impala, and Elasticsearch
Watch this video to learn why it’s important for organizations to have the ability to run analytics on top of modern data sources.
Data schemas also have a role to play. Fixed schemas limit the range of queries you can make. Schemaless data supports fast queries on the fly. But for this, you need an abstraction layer so you can choose the most appropriate data source for the problem at hand. You also need a way to deal with heterogeneous data sources. They're just a fact of life in every organization.
My name is Mike McCarty. I'm a senior software engineer, focused on big data application and visualizations.
We hear the term "modern data sources" a lot in the industry, referring to sources like HDFS, Kudu, Impala, and Elasticsearch. These sources go beyond traditional relational databases. We're also seeing more and more real-time data streams, so it's important for organizations to be able to run analytics on top of these streams and compare them to historical trends.
Fixed Data Schemas Limit Flexibility
Additionally, fixed schemas lock you into a limited range of questions you can ask. A schemaless approach, however, gives you the flexibility to ask questions on the fly. This is where an abstraction layer comes into play. With an abstraction layer, you can use the most appropriate source for your problem. Otherwise, you're locked into an analysis that's coupled to your data source. That means if you want to change sources, you have to run the analysis through QA again, which increases your maintenance effort.
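To make the abstraction-layer idea concrete, here is a minimal sketch in Python. All the class and function names are hypothetical, and the two backends are stand-ins (plain in-memory lists) rather than real Elasticsearch or Kudu/Impala clients; the point is only that the analysis code depends on the interface, not on any one engine.

```python
from abc import ABC, abstractmethod

class DataSource(ABC):
    """Minimal interface every backend implements; callers never
    touch a concrete engine directly."""
    @abstractmethod
    def query(self, filters: dict) -> list:
        ...

class SearchIndexSource(DataSource):
    # Stand-in for a full-text engine such as Elasticsearch:
    # matches when the filter value appears in the field.
    def __init__(self, docs):
        self._docs = docs
    def query(self, filters):
        return [d for d in self._docs
                if all(v in d.get(k, "") for k, v in filters.items())]

class ColumnarSource(DataSource):
    # Stand-in for an analytic store such as Kudu/Impala:
    # matches on exact equality.
    def __init__(self, rows):
        self._rows = rows
    def query(self, filters):
        return [r for r in self._rows
                if all(r.get(k) == v for k, v in filters.items())]

def run_analysis(source: DataSource, filters: dict) -> int:
    # The analysis sees only the abstraction, so swapping the
    # backend requires no change here.
    return len(source.query(filters))
```

Because `run_analysis` accepts any `DataSource`, you could point it at a search index for ad hoc text questions and at a columnar store for aggregations, without rewriting or re-testing the analysis itself.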
Heterogeneous Data Is a Fact of Life
Companies have always had to deal with heterogeneous data sources. It's just a fact of life. Traditionally, data warehousing has been used to blend data. Unfortunately, this approach is expensive, slow, and can yield stale results. A better method is to virtualize the blended data on the fly. This results in lower cost, increased speed, and the freshest data possible. In the near future, I think this will be improved even further by pre-generating blended aggregate data based on the user's past usage. With this usage data, the system will automatically tune itself to provide the fastest possible results.
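The contrast between warehousing and virtualization can be sketched in a few lines. In this hypothetical example (the field names `orders`, `customers`, and `customer_id` are mine, not from the talk), the blend is computed at query time from whatever the sources currently hold, so there is no copied, potentially stale warehouse table to maintain.

```python
def blended_view(orders, customers, key="customer_id"):
    """Join two live sources at query time instead of copying
    them into a warehouse. Each call reads the sources as they
    are now, so results are always as fresh as the sources."""
    index = {c[key]: c for c in customers}  # build lookup on the join key
    for order in orders:
        customer = index.get(order[key])
        if customer is not None:
            # Merge the matching rows into one blended record.
            yield {**order, **customer}
```

A real data-virtualization layer would push filters down to each engine and cache hot aggregates (the pre-generated blends mentioned above), but the essential trade is the same: compute the join on demand rather than materializing it in advance.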