There are many movers and shakers in the world of data warehousing, search, and data management, and choosing a data store can be a difficult decision. Below is a list of twelve tools we have found to be the most essential for residents of the fast-moving world of BI. Hopefully, you will benefit from some of this research as you search for the best product to optimize your data.
This search engine has specialized in turning search queries into results in the blink of an eye since 2008. Powered by the capable Lucene Java search library, Solr pages through keyword indexes instead of entire records to call up results quickly.
Solr provides modern search features like priority listing, hit highlighting, and spell checking for both structured and text searches. Searches are conducted through inverted indexing. Faceting, another prized feature, provides an ability to categorize and filter results. Solr handles well when working on external servers.
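The two ideas behind that speed, inverted indexing and faceting, are easy to see in miniature. The sketch below is not Solr itself, just a toy Python illustration with invented documents and field names: the index maps each term to the documents containing it, and faceting counts the matches per category.

```python
from collections import defaultdict, Counter

# Invented sample documents; in Solr these would live in a collection.
docs = {
    1: {"text": "red wool sweater", "category": "clothing"},
    2: {"text": "red ceramic mug", "category": "kitchen"},
    3: {"text": "blue wool scarf", "category": "clothing"},
}

# Build the inverted index once, up front: term -> set of doc ids.
index = defaultdict(set)
for doc_id, doc in docs.items():
    for term in doc["text"].split():
        index[term].add(doc_id)

def search(term):
    """Look up matching doc ids without scanning every record."""
    return sorted(index.get(term, set()))

def facet(doc_ids, field):
    """Count results per value of `field` -- faceting in miniature."""
    return Counter(docs[d][field] for d in doc_ids)

hits = search("red")              # doc ids 1 and 2
counts = facet(hits, "category")  # one clothing hit, one kitchen hit
```

Because the index is built ahead of time, a query is a dictionary lookup rather than a scan, which is why keyword indexes answer "in the blink of an eye."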
Running Java processes can be very taxing on server resources. This can take its toll on ill-equipped systems. Installation and setup is not easy, despite good documentation.
Despite the various modern features that make searching easier, its faceting and inverted indexing capabilities are what draw many customers. Solr is a solid search engine; that said, it has been slowly fading into the shadow of its more popular cousin, Elasticsearch.
Known for its multi-tenancy, or ability to satisfy many user group searches simultaneously, Elasticsearch was initially released in 2010 after much fundraising. In January 2016, it finally superseded Solr as ‘the most popular enterprise search engine.’
Fast and easy to set up and install, Elasticsearch includes features like hit highlighting and spell checking thanks to the Lucene library at its core. It speaks JSON over HTTP and offers clients for common languages like Perl, Python, and many more. AWS, IBM, and other companies provide professional support. Required disk space is low compared to Solr.
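Since everything in Elasticsearch is a JSON request, a query is just a nested document. Here is a hedged sketch of a request body combining a full-text match with hit highlighting; the index and field names (`articles`, `title`) are invented, while the query-DSL keys (`match`, `highlight`) are standard Elasticsearch syntax.

```python
import json

# Hypothetical field names; "match" and "highlight" are real
# Elasticsearch query-DSL constructs.
query = {
    "query": {"match": {"title": "data warehouse"}},
    "highlight": {"fields": {"title": {}}},  # wrap matched terms in the response
    "size": 10,                              # return at most ten hits
}
body = json.dumps(query)
# This body would be POSTed to an endpoint such as
# http://localhost:9200/articles/_search
```

The same shape works from any language with a JSON library, which is a large part of Elasticsearch's cross-language appeal.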
Its documentation, however, is not very detailed.
Though users of Solr may struggle to see the benefits of a switch due to similarities, Elasticsearch is a fast, popular, scalable, distributed open source alternative.
Potential was seen in giving Apache Hive a surgical transplant, replacing the original MapReduce framework with the more optimal Apache Tez powerhouse. The result is an impressive platform for SQL-based Hadoop analytics called Hive on Tez.
The additional power and flexibility of Hive on Tez come from Tez's ability to simplify task execution and reduce the number of disk writes that slow a job down.
Short or simple queries often run faster on other engines like Impala.
Hive on Tez is reliable regardless of the length or complexity of the task, tending to excel most during longer, complicated jobs because of its speed and stability. For more complicated analytics, though, consider Impala and Spark.
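Switching an existing Hive deployment onto Tez is typically a one-line session setting. The engine property below is real Hive configuration; the table and query are invented for illustration, shown here as a HiveQL string.

```python
# hive.execution.engine is a genuine Hive setting; the sales table
# and query are hypothetical.
hiveql = """
SET hive.execution.engine=tez;

SELECT region, SUM(revenue) AS total
FROM sales
GROUP BY region;
""".strip()
```

Existing Hive queries run unchanged; only the execution engine underneath them is swapped.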
Nasdaq, Nokia, and Pinterest are a few of the many names that have found a home with this petabyte-scale data warehouse. Released in 2013, Redshift has attracted companies with its fast and affordable cloud-based analysis platform.
At roughly $1,000 per terabyte per year, Redshift allows storage of over a petabyte of data in the cloud. Redshift is SQL-friendly, works with standard analytics tools, and node configuration is easy thanks to its elastic nature. Clients can choose from four node types to optimize for either high throughput or low latency.
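To put the quoted rate in perspective, a quick back-of-envelope calculation shows what petabyte-scale storage costs at $1,000 per terabyte per year (actual pricing varies by node type, region, and reserved-instance terms):

```python
# Illustrative arithmetic only, using the article's quoted rate.
rate_per_tb_year = 1_000   # dollars per terabyte per year
stored_tb = 1_000          # roughly one petabyte

annual_cost = rate_per_tb_year * stored_tb
# A full petabyte works out to about a million dollars per year.
```

That figure is what made Redshift look inexpensive next to on-premises appliances when it launched.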
Though modifications can be made, concurrency limits can be a problem for some analytic tools that have higher query requirements.
Redshift gives competitors a run for their money when showcasing as an economic, efficient, and stable place for data. If you need to supply petabyte-scale access to unlimited users with all the benefits of storing in the cloud, Redshift is for you.
Bringing much to the table as a Massive Parallel Processing (MPP) SQL query engine, Impala satisfies many appetites. It provides low-cost, low-latency, high-throughput service to SQL queries in a Hadoop-friendly platform. Publicized in mid-2013, this open-source Apache project wasted no time impressing potential customers with its low cost statistics.
It runs up to 70 times faster at half the cost of similar platforms like Apache Hive, results attributed to its columnar storage and Massive Parallel Processing architecture. Daemon processes run continuously after boot, making Impala always ready to fulfill work requests with minimal delay. Impala also integrates well with MapReduce, Hive, and common Hadoop file formats.
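The columnar-storage advantage is worth seeing concretely. An analytic aggregate touches one column; in a row layout, reaching that column means visiting every full record. The toy sketch below, with invented data, pivots the same rows into per-column lists the way a columnar engine stores them:

```python
# Invented sample data.
rows = [
    {"id": 1, "region": "EU", "revenue": 120},
    {"id": 2, "region": "US", "revenue": 340},
    {"id": 3, "region": "EU", "revenue": 200},
]

# Row layout: summing revenue means walking every whole record.
row_total = sum(r["revenue"] for r in rows)

# Columnar layout: the same data pivoted into one list per column,
# so the aggregate reads a single contiguous column.
columns = {key: [r[key] for r in rows] for key in rows[0]}
col_total = sum(columns["revenue"])
# Both totals are 660; the columnar scan simply reads far less data.
```

On terabytes of data, skipping the unused columns is where much of Impala's speed advantage comes from.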
Impala is not fault tolerant: if a node fails mid-query, the query must be rerun from the start.
Due to the lack of fault tolerance, Impala is not recommended for jobs that require a longer runtime. If you’re looking for a solid, analytic database that handles well in the cloud and is seamless in its interactions with Hadoop, you can’t go wrong with Impala, especially if Redshift appears costly.
In 2008, after two employees from LinkedIn were driven to find a more efficient ingest for Hadoop, Apache Kafka appeared on the market as an open source distributed streaming platform. Companies like Ebay, Netflix, and Paypal were quickly attracted by its low latency charm and ability to publish, process, and store data streams in a fault-tolerant environment.
Kafka eliminates message-handling redundancies and unnecessary index storage, enabling it to process nearly 1 trillion events per day. Log compaction, another benefit, saves the time and energy required to set up cron jobs for log removal, and allows old log entries to be retained and accessed for future use.
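Log compaction is easy to demonstrate: Kafka retains at least the latest value for each key, so a compacted topic stays bounded while still serving as a complete snapshot. The sketch below captures that retention rule in plain Python, with invented keys and values:

```python
# Invented records: (key, value) pairs in arrival order.
log = [
    ("user-1", "alice@old.example"),
    ("user-2", "bob@example"),
    ("user-1", "alice@new.example"),  # supersedes the first record
]

def compact(records):
    """Keep only the most recent value per key, as compaction does."""
    latest = {}
    for key, value in records:
        latest[key] = value           # later writes overwrite earlier ones
    return latest

snapshot = compact(log)
# Three records compact to two: the stale user-1 value is dropped,
# yet every key's current state survives.
```

This is why compacted topics double as durable, replayable state stores rather than just transient queues.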
Traditional message brokers do hold one advantage over Kafka: they take additional steps to acknowledge message handoffs, giving stronger per-message delivery guarantees.
Website tracking, log aggregation, and an ability to handle data transactions in large quantities are just some of the reasons large scale companies have flocked to Kafka. Kafka is a durable, efficient, and fast messaging system capable of handling stream processing with ease.
MemSQL is a high-performance, resilient, closed-source relational database management system (RDBMS) that compiles SQL. Released in mid-2013, this database is composed of aggregator and leaf nodes and can store data in both column and row formats.
MemSQL is a high-performance in-memory database that can process billions of queries in under two hours. It uses lock-free data structures to keep workloads stable, and transactions do not rely on disk, cutting I/O overhead. The result is that real-time data becomes available for analysis almost instantaneously.
It requires a significant amount of cached data, and therefore memory, which can be costly.
Consider MemSQL if you need efficient performance and lock-free, trouble-free analytics. It also pairs well with Apache Spark, as MemSQL makes streaming ingest simple.
Seasoned by competition with legacy companies like IBM and Oracle after its introduction in the 1970s, Teradata remains famed worldwide when it comes to data warehousing. Its shared-nothing framework allows the Teradata RDBMS to grow through the installation of new servers.
Teradata offers a diverse platform of either public cloud, private cloud, on-premises installation, or a combination thereof. In recent initiatives, Teradata has expanded its domain by allowing software and tools to be used for analytics regardless of where data might be located.
Teradata's inventory of companion tools, however, is considered relatively weak.
Teradata is fast, but a bit pricey; however, many claim the customer service mitigates the cost. Teradata is ideal concerning large datasets and ad hoc queries, but may have too much overhead for smaller tasks.
Vertica is HP's RDBMS, capable of providing analytics and storage in the cloud, on Hadoop, or on premises. A columnar storage database created in 2005, Vertica competes with the best of them as an efficient, low-cost alternative for BI needs.
Log mining is easier for system administrators who use Vertica's log-file text search to monitor for cyberattacks, unauthorized intruders, and system failures. Vertica's compression algorithms also deliver an impressive reduction in disk space usage, lowering costs and increasing performance.
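Columnar compression works so well because a sorted column is full of repeated values. Run-length encoding, one of the classic column-store compression schemes, collapses each run into a (value, count) pair. A minimal sketch with invented data:

```python
from itertools import groupby

# Invented column: sorted region values, as a column store would hold them.
column = ["EU"] * 4 + ["US"] * 3 + ["APAC"] * 2

# Run-length encode: each run of identical values becomes (value, count).
encoded = [(value, len(list(group))) for value, group in groupby(column)]
# Nine cells shrink to three pairs.

# Decoding expands each pair back out, so the compression is lossless.
decoded = [value for value, count in encoded for _ in range(count)]
```

This is an illustration of the general technique rather than Vertica's exact implementation; Vertica chooses among several such encodings per column.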
The documentation on installation and setup has room for improvement.
Vertica is positioned as a well tuned platform for data warehousing ready to serve as an economic solution whether storing in cloud, on Hadoop, or on premises. If you need an economically priced solution and don't mind fighting through some initial configuration and setup struggles, Vertica may work well.
Apache Kudu is a storage engine developed to ease troubleshooting in Hadoop clusters. It possesses many desirable tools that assist administrators in tracking, tracing, and latency outlier identification of active datasets to find problems and solutions quickly and efficiently.
Kudu was built as a columnar storage system for live data. Read, write, and reformat operations are less taxing on column-structured data held in RAM than on row-formatted databases, so search queries over terabytes of data complete in seconds. System stability and reliability come from Kudu's use of the Raft consensus algorithm, which replicates each operation across multiple nodes and requires a majority of them to acknowledge it. Should failures occur, recoveries complete in seconds.
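The heart of that Raft-style durability is the majority-quorum rule: a write counts as committed only once more than half the replicas have acknowledged it. A minimal sketch of the rule, with an invented cluster size:

```python
def committed(acks, cluster_size):
    """A write is durable once a strict majority of nodes confirm it."""
    return acks >= cluster_size // 2 + 1

# With five replicas, three acknowledgements are enough to commit,
# which is why the cluster survives the loss of two nodes.
five_node_commit = committed(3, 5)      # True
five_node_short = committed(2, 5)       # False: no majority yet
```

This is why Kudu (and Raft systems generally) are deployed with odd replica counts: a five-node group tolerates two failures, while a six-node group still tolerates only two.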
Kudu shows its youth through instability when too much data is stored in individual columns. It lacks some security features. Limited customer support exists at this time.
Kudu was crafted to sit in the middle of the road between high-throughput sequential access and low-latency random access, and is best used for large, frequently updated datasets. It works well through its Java and C++ APIs.
Centered around use of a Resilient Distributed Dataset, Spark keeps up with the need for speed. Originating in 2009 at Berkeley, this processing engine made headlines by breaking scale sorting world records in 2014. Today, Spark is still advertised to run programs up to 100 times faster than Hadoop in memory and 10 times faster on disk, keeping users satisfied while remaining open source.
Spark is easy to use, allowing users to write applications on Windows and UNIX systems in common languages like Java, Python, and Scala.
Spark handles read/write actions by holding data transparently in memory. This approach optimizes for read actions; random access for disk writes, inserts, and deletes is therefore less efficient.
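The payoff of that read-optimized design is that a derived dataset is computed once and then served repeatedly from memory. The sketch below imitates RDD persistence in plain Python; it is not the Spark API, just the caching idea with invented data:

```python
# Counter lets us observe how many times the "expensive" work runs.
calls = {"count": 0}

def expensive_transform(data):
    calls["count"] += 1            # stands in for a full recomputation
    return [x * x for x in data]

source = [1, 2, 3, 4]
cache = None

def read():
    """First read materializes the result; later reads hit the cache."""
    global cache
    if cache is None:
        cache = expensive_transform(source)
    return cache

first = read()
second = read()
# The transform ran exactly once; the second read was free.
```

In Spark proper this is what `persist()`/`cache()` on an RDD or DataFrame buys you: repeated actions over the same data skip the recomputation.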
If looking for a powerful open source general processing engine that works well with a variety of data sources including Cassandra, Hbase and Hadoop in a fault tolerant manner, Spark is hard to beat. Its parallel framework allows for multiple applications including batch and streaming to be built and operated handily alongside each other.
Spark Streaming is an extension developed to simplify the number of independent steps and systems needed to ingest, process, and convert raw data from static and streaming sources into organized storage containers. Uber and Netflix are among its users.
Use of this extension reduces the setup and maintenance workloads that would be otherwise required of system administrators to upkeep multiple systems purposed with the same job. It is fault tolerant, has a high throughput, and is easy to use.
Spark Streaming supports Java, Python, and Scala; for Ruby, Clojure, and other languages, Apache Storm may be a better option.
Spark Streaming can be a great aid regardless of the originating stream or static source.
It works well with Kafka, Flume, Amazon Kinesis, Cassandra, and MySQL.
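Under the hood, Spark Streaming's core idea is the micro-batch: chop an unbounded stream into small batches and run the same batch computation on each. Here is that idea in miniature, as plain Python with invented events rather than the Spark API:

```python
def micro_batches(stream, batch_size):
    """Group an event stream into fixed-size batches, as Spark
    Streaming groups events into time-sliced micro-batches."""
    batch = []
    for event in stream:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:                 # flush a final partial batch
        yield batch

events = [3, 1, 4, 1, 5, 9, 2, 6]

# The same per-batch computation (here a sum) runs on every slice.
per_batch_sums = [sum(batch) for batch in micro_batches(events, 3)]
```

Because each slice is an ordinary batch job, the batch and streaming worlds share one programming model, which is the simplification the extension was built for.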
In the world of BI, your choice of database will either hinder you or propel you to success. As the list above shows, there are many different platforms, each with its own highlights, and it can be hard not to get lost in the mix. Make sure you know which features contribute most to your company's needs before deciding. We hope the list above helps you get started.