Picking the Right Data Store: Powerful Software
In this installment, you’ll find out the role of software in analyzing what you have in your data store.
When it comes to query engines for non-relational data, you have quite a few choices. The current generation of query engines is powerful, but these engines also take expertise to configure. So DIY may not be the way you want to go. There’s a lot to be said for pre-configured services such as those on AWS or Google BigQuery. And no matter what route you choose, the issues of security and scalability also figure into the equation.
So, let's just assume for now that I've made a smart decision, and I've put all my data in the cloud somewhere. I've organized it in a way that makes it easy to query, but I still need to query it, and for that I need very powerful software that helps me answer my questions.
So, this is where some of the latest generation of query engines come in. They're designed for distributed systems, they're designed for data stored in the cloud, and they're designed for different data types, or, if you like, polyglot persistence. They're not just built for relational data. They also handle document data, key-value data, graph data and so on. Some examples here are things like Impala, Presto, and Drill. They're all very powerful. They're all complicated. Setting them up, understanding them, and configuring them requires a lot of expertise. But it may be the best way to set up a system to answer your questions.
A Pre-Configured System
Another approach is to have somebody else configure and run that workload for you and offer you a system, whether it's Redshift, Amazon Athena, Snowflake, or BigQuery, where most of the heavy lifting is done. You simply upload your data, pose your questions, and set a few configuration parameters.
So, really, the question is often DIY (do it yourself), with open source or not, versus a cloud solution. With a cloud solution, you might pay to store the data, per gigabyte stored, maybe a couple of cents a month, or you might pay per transaction. Typically, for some of these systems like Athena and BigQuery, you're paying for how much data is scanned to answer your question. So, you're really paying for the scalable compute that is used to answer your question.
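To make the pay-for-storage versus pay-per-scan distinction concrete, here is a minimal back-of-the-envelope sketch. The prices are hypothetical placeholders chosen for illustration (two cents per gigabyte-month echoes the "couple of cents" above; the per-terabyte scan price is an assumption), so check your provider's current price list before relying on any number.

```python
# Rough monthly cost estimate for a scan-priced query service.
# All prices below are hypothetical placeholders, not real quotes.

STORAGE_PRICE_PER_GB_MONTH = 0.02   # assumed: "a couple of cents" per GB-month
SCAN_PRICE_PER_TB = 5.00            # assumed: price per terabyte scanned


def storage_cost(gb_stored: float) -> float:
    """Monthly cost of keeping gb_stored gigabytes in the store."""
    return gb_stored * STORAGE_PRICE_PER_GB_MONTH


def scan_cost(gb_scanned_per_query: float, queries_per_month: int) -> float:
    """Monthly cost of the data scanned to answer the queries."""
    tb_scanned = gb_scanned_per_query * queries_per_month / 1000.0
    return tb_scanned * SCAN_PRICE_PER_TB


def monthly_cost(gb_stored: float, gb_scanned_per_query: float,
                 queries_per_month: int) -> float:
    """Storage plus scan charges for one month."""
    return storage_cost(gb_stored) + scan_cost(gb_scanned_per_query,
                                               queries_per_month)


if __name__ == "__main__":
    # Same 1 TB data set, 30 queries a month: scanning everything each
    # time versus scanning only a tenth of it (for example, by organizing
    # the data so each query touches less of it).
    full = monthly_cost(1000, 1000, 30)
    pruned = monthly_cost(1000, 100, 30)
    print(f"full scans:   ${full:.2f}")
    print(f"pruned scans: ${pruned:.2f}")
```

Under these assumed prices, the scan charge dominates as soon as queries read the whole data set, which is why organizing the data so that each query scans less of it matters for cost as well as speed.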
So, the software that knows how to crawl through your data and get you the answers really quickly is something that you pay extra for. That's a tradeoff among data access time, performance, capability, and how you want to pay. Do you need to model everything you want upfront and find the best price, or do you need something flexible for your solution?
There are also cost factors beyond performance, such as availability. Do you need the system up 24/7?
Scalability and Security
Scalability: you may have picked a system that is good for your current data, but do you need to grow by a factor of 100? If you're a startup and you hope to grow, is the solution you pick now going to scale with you?
And what about security? For some systems, security is simply, "yes, please, don’t look at my data." For other systems, it's, "this user can see this row but not this column; I've got social security numbers in my data set, and that's PII, so only certain people can see it; I'm storing data on behalf of other people." There's a lot of complexity associated with security, and you can be sure each of those requirements has a cost. So performance, availability, scaling, and security are each additional costs.
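The "this row but not this column" idea can be sketched in a few lines. This is an illustration only, not how any particular product implements access control: the `Policy` class, the `apply_policy` helper, and the sample records are all hypothetical names invented for the example.

```python
# Illustrative sketch of row-level and column-level filtering.
# Policy, apply_policy, and the sample data are hypothetical.
from dataclasses import dataclass
from typing import Any, Callable, Dict, FrozenSet, List

Row = Dict[str, Any]


@dataclass(frozen=True)
class Policy:
    allowed_columns: FrozenSet[str]      # columns this user may see
    row_filter: Callable[[Row], bool]    # rows this user may see


def apply_policy(rows: List[Row], policy: Policy) -> List[Row]:
    """Drop rows the user may not see, then drop hidden columns."""
    return [
        {col: val for col, val in row.items() if col in policy.allowed_columns}
        for row in rows
        if policy.row_filter(row)
    ]


RECORDS: List[Row] = [
    {"customer": "acme", "name": "Ada", "ssn": "000-00-0001"},
    {"customer": "acme", "name": "Bob", "ssn": "000-00-0002"},
    {"customer": "globex", "name": "Cy", "ssn": "000-00-0003"},
]

# An analyst for one customer: only that customer's rows, and no SSN column.
ANALYST = Policy(
    allowed_columns=frozenset({"customer", "name"}),
    row_filter=lambda row: row["customer"] == "acme",
)

if __name__ == "__main__":
    for visible in apply_policy(RECORDS, ANALYST):
        print(visible)
```

In a managed system this filtering happens inside the query engine rather than in application code, which is part of what you are paying for.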
Back to Darwin
So, bringing it back to Charles Darwin, you may ask why I started with him. Well, the answer is that it's all about evolution. Data stores and data query engines are all evolving very rapidly. The kinds of questions you want to ask today versus tomorrow are evolving rapidly too. So, really, picking the right data store for your analytical questions involves thinking about evolution: what your questions are, what you need to ask, and which query engine, and thus which data storage system, can best support those questions. Don't forget evolution.