Cloud Managed Services with Amazon EMR, S3, and Hive with Facebook’s Presto
Watch this video to find out how cloud managed services can be used to query large amounts of data at low cost.
A good example is the use of Amazon EMR, S3, and Hive with Facebook’s Presto. Presto was designed for interactive analytics and approaches the speed of commercial data warehouses while scaling to the size of organizations like Facebook. Facebook uses Presto for interactive queries against several internal data stores, including a 300PB data warehouse. Airbnb and Dropbox also use Presto.
I'm Alexander Whip, and I have backgrounds in computer science, psychology and business analytics. I've been a full stack engineer for about five years, and I've been in companies that do ed tech, consulting and big data visualization. I'm here to tell you how you can utilize cloud managed services to query large amounts of data and keep your costs low.
Amazon EMR, or Elastic MapReduce, is a managed cluster platform that simplifies running big data frameworks such as Apache Hadoop and Spark on AWS to process and analyze vast amounts of data. I'll be discussing two other technologies. The first is Facebook's Presto, an open source distributed SQL query engine, which allows users to quickly aggregate large amounts of data in memory. The second is Amazon S3, scalable distributed storage in the cloud, which allows companies to store large amounts of data at relatively low cost.
Now, how do these three technologies come together?
Presto on Elastic MapReduce
Presto on EMR with S3 is essentially a system that lets you deploy a clustered Presto server and point it at Hive, which is backed by files in S3. That way, petabytes of flat files, whether ORC, JSON, or any other format Hive can represent, can be queried and aggregated by Presto.
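To make the wiring concrete, here is a minimal sketch of the two SQL statements involved: a Hive external table defined over files in S3, and a Presto aggregation that reads it through the Hive catalog. The table name page_views, its columns, the bucket s3://example-bucket/page_views/, and the helper function names are all hypothetical and chosen only for illustration. The sketch only builds the SQL text; running it would require a Hive metastore and a Presto cluster.

```python
def hive_external_table_ddl(table, columns, s3_location, file_format="ORC"):
    """Build a Hive CREATE EXTERNAL TABLE statement over files stored in S3."""
    cols = ",\n  ".join(f"{name} {col_type}" for name, col_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        f"STORED AS {file_format}\n"
        f"LOCATION '{s3_location}'"
    )


def presto_aggregate_query(table, group_by, catalog="hive", schema="default"):
    """Build a Presto query that counts rows per group in a Hive-backed table."""
    return (
        f"SELECT {group_by}, count(*) AS events\n"
        f"FROM {catalog}.{schema}.{table}\n"
        f"GROUP BY {group_by}\n"
        f"ORDER BY events DESC"
    )


# Hypothetical table, columns, and bucket, used only for illustration.
ddl = hive_external_table_ddl(
    "page_views",
    [("user_id", "BIGINT"), ("country", "STRING"), ("view_time", "TIMESTAMP")],
    "s3://example-bucket/page_views/",
)
query = presto_aggregate_query("page_views", "country")

print(ddl)
print()
print(query)
```

The external table only records where the files live and how to read them, so dropping it leaves the data in S3 untouched. Presto then addresses the same table as catalog.schema.table, here hive.default.page_views.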
This allows you to run SQL queries directly from Presto against Hive and quickly get aggregated results. It gives you two benefits: first, it keeps compute costs low because the query cluster only needs to be up when you need it, and second, overall storage costs are low because the data is stored in S3.
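The cost argument can be sketched as simple arithmetic. The rates below are placeholders, not actual AWS prices, and the function name and usage pattern (two hours a day on 22 working days) are assumptions made purely for illustration. The storage bill is the same either way; only the cluster hours change.

```python
def monthly_cost(cluster_hourly_rate, hours_up, stored_tb, storage_rate_per_tb_month):
    """Total monthly cost = cluster time actually used + storage held in S3."""
    compute = cluster_hourly_rate * hours_up
    storage = stored_tb * storage_rate_per_tb_month
    return compute + storage


# Placeholder rates and sizes (NOT real AWS pricing).
CLUSTER_RATE = 10.0   # cost per hour for the whole query cluster
STORAGE_RATE = 20.0   # cost per TB per month of stored data
STORED_TB = 50        # data kept in S3

# Cluster left running all month versus started only when queries are needed.
always_on = monthly_cost(CLUSTER_RATE, 24 * 30, STORED_TB, STORAGE_RATE)
on_demand = monthly_cost(CLUSTER_RATE, 2 * 22, STORED_TB, STORAGE_RATE)

print(f"always on: {always_on:.2f}")
print(f"on demand: {on_demand:.2f}")
```

Because the data stays in S3 rather than on the cluster's own disks, shutting the cluster down between query sessions loses nothing, which is what makes the on-demand pattern possible.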