
How Zoomdata Caches the Data

Zoomdata caches data in several ways:

  1. SparkIt caches raw data.
  2. Zoomdata Cache caches aggregated result sets.
  3. MongoDB stores metadata and streaming data.

Zoomdata uses SparkIt to improve chart and dashboard performance for your connected data sources.

At this time, not all of Zoomdata's data connectors can use SparkIt. Refer to the article Data Sources Quick Reference Sheet, Table 3, for a list of data connectors that use SparkIt. In addition, certain data connectors require SparkIt to be enabled before configuration. For details, refer to the article Configuring Embedded Spark Server.

Caching with SparkIt Enabled

To optimize performance with your charts and dashboards, Zoomdata recommends that you enable SparkIt in the following cases:

  1. Your data source is a slow legacy database, where running a query and retrieving the results takes a long time.
  2. You have storage that does not support analytical queries (for example, S3, HDFS, and SaaS sources). These types of data sources require loading your data into SparkIt.
  3. You want to work with the results of data processing (for example, working with data sources like Hive on EMR or Hive on Tez).

You can cache data with SparkIt by enabling the toggle switch on the setup page of any data source that supports this feature. By default, installing the Zoomdata Server also installs an embedded Spark Server / Spark Proxy. If your Spark Server runs on the same server as the Zoomdata instance, the default connection settings are established during installation. You can change these settings in the zoomdata.properties configuration file.

You enable SparkIt caching for a data source during the connection process. While setting up the data connector, set the SparkIt toggle on the Tables tab (as shown in Figure 1).


Figure 1

If you create a Custom SQL query with a GROUP BY clause while SparkIt is enabled, SparkIt holds the aggregated data, not just raw data.

Data Flow with SparkIt and Zoomdata Cache Enabled

  1. After you connect to a data source, the data begins loading into SparkIt.
  2. Once a chart is created, a request is sent to Zoomdata Cache.
  3. If the requested data is not found in Zoomdata Cache, the request is sent to SparkIt.
  4. The retrieved data is sent to Zoomdata Cache and stored there.
  5. The chart displays the requested data.


Figure 2
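
The lookup order in the steps above follows a pattern often called cache-aside. The Python sketch below illustrates that pattern only. It is not Zoomdata code: `ResultCache`, `fake_sparkit_query`, and the request string are hypothetical names chosen for illustration, and the returned rows are made-up sample values.

```python
from typing import Callable, Dict, List

Rows = List[dict]


class ResultCache:
    """Hypothetical stand-in for Zoomdata Cache: stores aggregated results by request."""

    def __init__(self, backend: Callable[[str], Rows]) -> None:
        self._backend = backend            # where cache misses are sent
        self._store: Dict[str, Rows] = {}  # request text -> cached result
        self.hits = 0
        self.misses = 0

    def get(self, request: str) -> Rows:
        # Steps 2-3: check the cache first; on a miss, forward to the backend.
        if request in self._store:
            self.hits += 1
            return self._store[request]
        self.misses += 1
        rows = self._backend(request)
        # Step 4: store the retrieved result so later requests can reuse it.
        self._store[request] = rows
        return rows


backend_calls: List[str] = []


def fake_sparkit_query(request: str) -> Rows:
    """Hypothetical backend that pretends to aggregate raw data held in SparkIt."""
    backend_calls.append(request)
    return [{"region": "east", "total": 10}, {"region": "west", "total": 7}]


cache = ResultCache(fake_sparkit_query)
first = cache.get("sum(sales) by region")   # miss: answered by the backend
second = cache.get("sum(sales) by region")  # hit: answered from the cache
```

In this sketch the second identical request never reaches the backend. Passing a function that queries the data source directly, instead of `fake_sparkit_query`, models the same lookup order for the case where SparkIt is disabled.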

Data Flow with SparkIt Disabled and Zoomdata Cache Enabled

Zoomdata Cache stores the results of all aggregated requests to your data source (Figure 3). In this scenario, when a chart is created, the request is first sent to Zoomdata Cache (1). If the required results are found there, they are displayed on your chart (2).


Figure 3

Otherwise, the data flow will be as follows (Figure 4):

  1. The request is sent to Zoomdata Cache.
  2. If the required results are not found in Zoomdata Cache, the request is sent to your data source.
  3. From your data source the results are sent to Zoomdata Cache and stored there.
  4. The chart displays the requested data.


Figure 4

For additional information about how Zoomdata implements Spark, refer to the article How Zoomdata Uses Apache Spark.

Caching with Zoomdata Cache Disabled

Zoomdata Cache is temporary storage for aggregated data from your data sources. By default, caching is enabled for all data sources. However, you can disable it in the following cases: your data source is constantly being updated, you do not want to allocate the required RAM, or your data source performs well enough that you do not need to store the results of aggregated queries.

Data Flow with SparkIt Enabled and Zoomdata Cache Disabled

  1. After you connect to a data source, the data begins loading into SparkIt.
  2. Once a chart is created, a request is sent to SparkIt.
  3. The chart displays the requested data.


Figure 5

Data Flow with SparkIt and Zoomdata Cache Disabled

If you disable both caching options, chart requests are sent directly to the data source, as shown in Figure 6:


Figure 6
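
The four data flows described in this article can be summarized as a small routing function. This is a minimal sketch for orientation only: `request_path` and its string labels are hypothetical and do not correspond to any Zoomdata API or setting name.

```python
from typing import List


def request_path(sparkit_enabled: bool, cache_enabled: bool) -> List[str]:
    """Return, in order, the components a chart request may visit (illustrative labels)."""
    path: List[str] = []
    if cache_enabled:
        # Zoomdata Cache is checked first; a hit ends the lookup here.
        path.append("Zoomdata Cache")
    # On a cache miss, or with caching disabled, the request goes to
    # SparkIt if it is enabled, otherwise to the data source itself.
    path.append("SparkIt" if sparkit_enabled else "data source")
    return path


# Print the path for every combination of the two toggles.
for sparkit in (True, False):
    for cache in (True, False):
        print(f"SparkIt={sparkit}, cache={cache}: {' -> '.join(request_path(sparkit, cache))}")
```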