Most MPP databases leverage a "shared-nothing architecture" where each server operates independently and controls its disk and memory. They distribute data onto dedicated disk or solid-state drive (SSD) storage units connected to each server in the appliance. (A data warehouse appliance includes an integrated set of servers, storage, operating systems, and databases.) This allows them to resolve an SQL query by scanning data on each server in a parallel way. This divide-and-conquer approach scales linearly as new servers are added into the architecture and delivers high performance.
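The divide-and-conquer pattern can be sketched in a few lines of Python. This is a hypothetical illustration, not any vendor's implementation: each "node" is just a list of rows it exclusively owns (shared-nothing), each scans its own partition independently, and a coordinator merges the partial results, much as an MPP engine would answer `SELECT SUM(amount) FROM sales WHERE region = 'EU'`.

```python
from concurrent.futures import ThreadPoolExecutor

def scan_partition(partition, region):
    """Filter and aggregate locally, using only this node's own data."""
    return sum(row["amount"] for row in partition if row["region"] == region)

def mpp_sum(partitions, region):
    """Coordinator: fan the scan out to all nodes, then combine partials."""
    # A real MPP system runs each scan on a separate server; threads stand
    # in for those servers in this sketch.
    with ThreadPoolExecutor() as pool:
        partials = pool.map(scan_partition, partitions,
                            [region] * len(partitions))
    return sum(partials)
```

Because each partition is scanned independently, adding another node adds another parallel scanner, which is why the approach scales roughly linearly.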
For querying large data sets, column-store databases are more efficient than row-store databases.
Traditional relational database management systems designed for transaction processing store data by row, as this is the most efficient layout for inserting, deleting, or updating an individual row. In contrast, a column-oriented database stores data tables by column instead of by row. Both column and row databases can use traditional query languages like SQL to load data and perform queries.
Nevertheless, a major bottleneck in handling big data is disk access. Columnar databases boost performance by reducing the amount of data that needs to be read from disk by efficiently compressing similar columnar data and reading only the data necessary to answer the query.
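One reason columnar data compresses so well is that a column holds values of a single type, often with long runs of repeats once sorted. A minimal sketch of one such scheme, run-length encoding (columnar engines use this among several others; the code is illustrative, not any product's format):

```python
from itertools import groupby

def rle_encode(column):
    """Collapse runs of identical values into (value, count) pairs."""
    return [(value, len(list(run))) for value, run in groupby(column)]

def rle_decode(pairs):
    """Expand (value, count) pairs back into the original column."""
    return [value for value, count in pairs for _ in range(count)]
```

A region column with thousands of consecutive "US" rows shrinks to a single `("US", n)` pair, so far fewer bytes cross the disk to answer a query.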
By storing data in columns rather than rows, the database can more quickly access and aggregate the data it needs to answer a query rather than scanning and discarding unwanted data in rows. As a result, the kinds of aggregate queries typically used when analyzing very large data sets run dramatically faster.
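The difference is easy to see with the same toy table stored both ways (a schematic sketch, not a real storage engine):

```python
# Row store: whole rows sit together on disk.
rows = [
    {"id": 1, "region": "EU", "amount": 10.0},
    {"id": 2, "region": "US", "amount": 5.0},
]

# Column store: each column sits together on disk.
columns = {
    "id": [1, 2],
    "region": ["EU", "US"],
    "amount": [10.0, 5.0],
}

# SELECT SUM(amount): the row store must scan every row and discard the
# unwanted id and region fields along the way...
total_row_store = sum(row["amount"] for row in rows)

# ...while the column store reads only the amount column.
total_col_store = sum(columns["amount"])
```

Both produce the same answer, but the columnar layout touched one column out of three; on a wide fact table with hundreds of columns, that gap is what makes aggregate queries dramatically faster.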
When most people think about big data, Hadoop and MapReduce, as well as other modern processing frameworks, come to mind. So, does that mean Hadoop is a replacement for MPP databases? No. In fact, there are striking similarities in the way they work.
With Hadoop, MapReduce breaks large chunks of data into smaller batches that are processed separately across a cluster of computing nodes. This is distributed, parallel processing, which is exactly what MPP does.
MPP is often deployed on expensive, specialized hardware tuned for CPU, storage, and network performance. MapReduce and Hadoop typically run on clusters of commodity servers and disks. It is usually less expensive to scale a Hadoop/MapReduce deployment than an MPP appliance.
Also, MapReduce logic is implemented via Java code, whereas MPP products are queried with SQL. Of course, Hive offers an SQL abstraction over MapReduce. But natively, they are different at the code level. Nevertheless, MPP and Hadoop/MapReduce can both be looked at as big data technologies.
While Zoomdata excels at big data use cases, which for most people are synonymous with Hadoop, Zoomdata is not only for big data. Data-driven enterprises have significant data assets in relational databases, data warehouses, and other traditional systems. For this reason, Zoomdata also enables data discovery with traditional SQL-based sources such as Oracle, SQL Server, PostgreSQL, MySQL, and Amazon Aurora, as well as MPP column-store databases like Vertica, Teradata, and others.
Zoomdata features certified support for the following SQL databases, and more:
Enterprise data architectures almost always contain a combination of SQL database sources as well as modern sources like Apache Hadoop and Apache Spark. Zoomdata enables exploration across all these data sources with the ability to blend data on the fly using Zoomdata Fusion.
The unique features of Zoomdata that support visual analytics for big data can also be applied to traditional data. For example, Zoomdata Fusion can combine modern and traditional sources without having to move data to a common data store. Data Sharpening, micro-queries, result set caching, and Data DVR all add extra performance and capabilities for visual analytics on traditional SQL data sources.