
Data Sharpening in Zoomdata

Overview

Data sharpening is Zoomdata's patented technique to deliver fast and responsive visualizations for large volumes of data. Conceptually, data sharpening is similar to the way large image files or streaming video files display in a browser. When you start to load the image file, you see a blurry approximation of the image. But as the file loads in the background, the image sharpens until the entire image eventually comes into clear focus.

When you create or modify a chart for a large dataset in Zoomdata, data sharpening can immediately display a partial or approximate rendering of the data. Zoomdata continuously updates the chart with more and more data until the fully sharpened result is available (as demonstrated in Figure 1).


Figure 1

Figure 1 shows an example of data sharpening in a dashboard built from a Cloudera Impala data source. Data sharpening is indicated in the upper left corner of each chart, which displays a percentage that steadily increases over time.

While a chart sharpens, you can continue to interact with it, zooming into more detail or changing the group-by, without waiting for the entire query to return. You can continue exploring your data without waiting for long-running queries over billions of rows to complete; Zoomdata adjusts on the fly based on your input.

Keep in mind that data sharpening is not always needed when visualizing your data. If Zoomdata can complete its query quickly (within a few seconds), you simply see the final result rendered in your selected chart. Data sharpening is used when the chart may not render immediately because of the volume of data being queried.

How Data Sharpening Works

Zoomdata connects to and runs queries in your original data source. To sharpen data, it breaks a query into smaller microqueries that are sent to all the available nodes in your data source. These microqueries incrementally return results to Zoomdata, which then joins the result sets together. This allows you to view data while the query is executing instead of waiting for the entire result set to process. Figure 2 presents a conceptual diagram of data sharpening in action.


Figure 2
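
The idea can be illustrated with a minimal Python sketch. It is not Zoomdata's implementation: the function names (split_range, sharpen), the in-memory list standing in for the data source, and the choice to split the query by time window are all assumptions made for illustration, and the microqueries run one after another here rather than across nodes.

```python
from collections import Counter
from datetime import datetime, timedelta


def split_range(start, end, step):
    """Yield consecutive [lo, hi) windows of length `step` covering [start, end)."""
    lo = start
    while lo < end:
        hi = min(lo + step, end)
        yield lo, hi
        lo = hi


def sharpen(rows, start, end, step):
    """Yield (percent_complete, running_counts) after each microquery.

    `rows` is a hypothetical in-memory stand-in for the data source:
    a list of (timestamp, group) pairs.
    """
    windows = list(split_range(start, end, step))
    totals = Counter()
    for i, (lo, hi) in enumerate(windows, start=1):
        # Microquery: aggregate only the rows that fall inside this window.
        partial = Counter(group for ts, group in rows if lo <= ts < hi)
        # Join the partial result with everything received so far.
        totals.update(partial)
        yield round(100 * i / len(windows)), dict(totals)


rows = [
    (datetime(2013, 11, 10, 22, 0), "east"),
    (datetime(2013, 11, 10, 23, 30), "west"),
    (datetime(2013, 11, 11, 1, 15), "east"),
    (datetime(2013, 11, 11, 2, 45), "east"),
]
snapshots = list(
    sharpen(rows, datetime(2013, 11, 10, 22), datetime(2013, 11, 11, 4), timedelta(hours=2))
)
for percent, counts in snapshots:
    print(percent, counts)
```

Each yielded snapshot corresponds to one refresh of the chart: the counts are approximate at first and match the full query result once the percentage reaches 100.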

Prerequisites for Data Sharpening

For Zoomdata to perform data sharpening with a data source, a "playable" time field is needed. Zoomdata attempts to detect this playable field automatically when the source is created; Table 1 lists what makes a time attribute "playable" for each data source. An appropriate time attribute must then be specified in the data source's global default settings page. The granularity of this time field is referred to below as the "driving time field granularity" (DTFG). It plays an important role in determining whether and how data sharpening is executed, as described in the topic "Determining When Data Sharpening is Executed." The topic "Data Sharpening Setup and Process" in this article walks through enabling data sharpening for your connected data sources.

The setup for data sharpening differs slightly depending on the data source. Every source needs a playable time field for sharpening to occur, but what qualifies as one depends on the data source. Table 1 below details the time field requirements for the different data sources supported in Zoomdata.

Data Source | Time Field Requirement
Amazon Redshift | Sort key (only the first sort key is selected)
Cloudera Impala | Partitioned time field. The partitioned time field needs to be configured from the "Fields" page (see Figure 3). A single partitioned column is needed for data sharpening to work in Impala sources.
Search-based sources (Cloudera Search, ElasticSearch, Solr) | Indexed time field*. Zoomdata automatically detects indices.
SQL-based sources (MySQL, Oracle, PostgreSQL, SQL Server) | Indexed time field*. Zoomdata automatically detects indices.

Table 1

*The indexed time field should already be set in the data source, so no additional configuration is needed in Zoomdata.

Determining When Data Sharpening is Executed

When you create a chart, Zoomdata determines whether data sharpening is necessary based on the chart style selected and the time attribute parameters that are set for it.

For non-trend visuals (like bars, donuts, and heat maps), the granularity of the driving time field must be less than 10% of the range that is set in the time bar (determined by the MIN/MAX range set in the data source). Refer to Figures 3 and 4 for an example. Figure 3 shows the Cloudera Impala source with the time attribute 'Record Min' as the partitioned time field. The granularity is set to 'Minutes'. The MIN and MAX (time) range is set from Nov. 10, 2013 10:00 pm to Nov. 11, 2013 9:59 pm. One minute is far less than 10% of this roughly 24-hour range, so the setup meets the 10% criterion and data sharpening will execute, as shown in Figure 4.

The minimum granularity used by Zoomdata is always minute. Thus, even if your DTFG is second, Zoomdata still uses minute when performing this "10% rule" calculation.


Figure 3


Figure 4

INFO: In contrast, if your Impala source contains data from the past 5 years and that range is set in the time bar, then as long as the time granularity is set to months, weeks, or finer units, data sharpening should occur. But if the time granularity is set to years, data sharpening will not occur since the granularity is above the 10% threshold.
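
The 10% rule for non-trend visuals can be written out as a short Python check. This is a sketch of the rule as described in this article, not Zoomdata's internal code: the function name sharpening_expected, the UNIT table, and the nominal unit lengths (a 30-day month, a 365-day year) are assumptions.

```python
from datetime import datetime, timedelta

# Nominal unit lengths. Treating a month as 30 days and a year as
# 365 days is an assumption made for this illustration.
UNIT = {
    "second": timedelta(seconds=1),
    "minute": timedelta(minutes=1),
    "hour": timedelta(hours=1),
    "day": timedelta(days=1),
    "week": timedelta(weeks=1),
    "month": timedelta(days=30),
    "year": timedelta(days=365),
}


def sharpening_expected(granularity, range_start, range_end, threshold=0.10):
    """Return True when one granularity unit is less than `threshold` of the time range."""
    # The rule never uses a unit finer than a minute.
    unit = max(UNIT[granularity], UNIT["minute"])
    return unit / (range_end - range_start) < threshold


# Figure 3 setup: minute granularity over a range of about 24 hours.
print(sharpening_expected("minute", datetime(2013, 11, 10, 22, 0), datetime(2013, 11, 11, 21, 59)))  # True

# Five years of data: months pass the check, years do not.
print(sharpening_expected("month", datetime(2010, 1, 1), datetime(2015, 1, 1)))  # True
print(sharpening_expected("year", datetime(2010, 1, 1), datetime(2015, 1, 1)))   # False
```

With a five-year range, one year is about 20% of the range, which is why the check fails for yearly granularity.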

For trend visuals (like Line and Bars Trend and Line Trend: Attribute Values), Zoomdata executes an internal check to determine whether data sharpening should execute. As with non-trend visuals, a 10% criterion is used, but it is slightly modified for the trend visual scenario. If the granularity of the driving time field for the source is less than 10% of the time granularity set for the particular trend chart, then data sharpening will execute.
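
For trend visuals the comparison is between two granularities rather than a granularity and a range. The Python sketch below shows that comparison under stated assumptions: the function name trend_sharpening_expected and the nominal unit lengths are hypothetical, and because the article does not say whether the one-minute floor also applies to trend charts, the sketch leaves it out.

```python
from datetime import timedelta

# Nominal unit lengths (assumption: 30-day months, 365-day years).
UNIT = {
    "second": timedelta(seconds=1),
    "minute": timedelta(minutes=1),
    "hour": timedelta(hours=1),
    "day": timedelta(days=1),
    "week": timedelta(weeks=1),
    "month": timedelta(days=30),
    "year": timedelta(days=365),
}


def trend_sharpening_expected(dtfg, chart_granularity, threshold=0.10):
    """Return True when the driving time field granularity (DTFG) is less
    than `threshold` of the granularity used by the trend chart."""
    return UNIT[dtfg] / UNIT[chart_granularity] < threshold


print(trend_sharpening_expected("minute", "hour"))  # True: 1/60 is under 10%
print(trend_sharpening_expected("day", "month"))    # True: 1/30 is under 10%
print(trend_sharpening_expected("day", "week"))     # False: 1/7 is over 10%
```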

The bottom line is that Zoomdata tries to perform data sharpening when warranted based on the size of the dataset, the time attributes available, and the time granularity that is set. If Zoomdata ascertains that results can be rendered in the chart quickly without data sharpening, it renders them directly. Otherwise, Zoomdata attempts to use data sharpening to return near-instantaneous result sets that are refined over time until the query completes.

Data Sharpening Setup and Process

To enable the data sharpening feature for a data source connected in Zoomdata, you need to enable the time settings in the data source's settings page. Specifically, the 'playable time field' needs to be set in 'Charts' > 'Global Default Settings' for the data source. Cloudera Impala sources require additional configuration, as detailed in the next topic, Data Sharpening on Cloudera Impala Sources.

Follow the steps below to enable the time settings:

  1. Log in to Zoomdata (either as the admin or as a user with edit rights to data sources).
  2. Select the Sources menu item, as shown in Figure 5.


Figure 5

  3. Select your data source (listed under the My Data Sources section, as shown in Figure 6).


Figure 6

  4. Navigate to the Charts page.
  5. Select the Time Attribute in Global Default Settings, as shown in Figure 7.


Figure 7

  6. Save your changes.

Data Sharpening on Cloudera Impala Sources

Data sharpening works with certain partitioned Impala sources. The partitioned field should be a time-based attribute and in a supported time format (for example, yyyy-MM-dd). Follow the steps below to set up Impala for data sharpening. An example scenario is provided to illustrate when data sharpening would occur when you explore a large dataset.

Steps to set up your Zoomdata connection to Cloudera Impala for data sharpening:

  1. Log in to Zoomdata (either as the admin or as a user with edit rights to data sources).
  2. Select the Sources menu item, as shown in Figure 8.


Figure 8

  3. Access your Cloudera Impala source from the My Data Sources list, as shown in Figure 9.


Figure 9

  4. Navigate to the Fields page.
  5. Identify the partitioned time attribute that will be enabled, and change the setting from the Partitions column.
    (Figure 10 provides an example for illustration purposes only. Your partition options will be different.)


Figure 10

  6. From the Configure column, select an appropriate time granularity, as shown in Figure 11.

    Consider the 10% rule to ensure data sharpening execution.


Figure 11

  7. Select a related Time Field from the drop-down list (see Figure 12). This time field will serve as the driving time field and is the time field that needs to be specified in the global default settings.


Figure 12

  8. Select the Charts page.
  9. Select the Time Attribute in Global Default Settings, as shown in Figure 13.


Figure 13

  10. Save your changes.

The scenario below illustrates setting up Cloudera Impala for data sharpening.

Scenario

  • You have 3 years of historical data on Cloudera Impala.
  • Your data is partitioned by month (another column, Order_Date_Month, contains the data from column Order_Date truncated to month).
  • The time stamp in your data provides granularity to the day level (column Order_Date).

Partition Steps

  1. Determine whether there are sub-folders in Impala. If so, the Label must include the full date format (for example, month=201501, which is in time format 'yyyyMM').
  2. Configure the Impala source in the Fields page:
    • For the Order_Date field, make sure "Day" granularity is selected.
    • For the Order_Date_Month field:
      • In the Partitions column, set the partitioned time field to Date (or verify that it is selected).
      • In the Default column, set the option to Pattern and enter the appropriate time format (for this example, the time format is 'yyyyMM').
      • Set the granularity to Month (make sure the time granularity of the partitioned column is coarser than the granularity of the linked time field).
      • Link the partition to the field. Select a Time Field from the drop-down list (for this example, 'Order_Date' is selected).
  3. Continue to the Charts page and, in the Global Default Settings option, select Order_Date from the drop-down menu list.
  4. Save your work!
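
The relationship between the two columns in this scenario can be sketched in Python. The helper names (month_partition, parse_month_partition, partition_is_coarser) are hypothetical and are not part of Zoomdata or Impala; the sketch only shows how a day-level Order_Date maps to a 'yyyyMM' value in Order_Date_Month, and how the granularity requirement from step 2 could be checked.

```python
from datetime import date, datetime

# Granularities ordered from finest to coarsest.
ORDER = ["second", "minute", "hour", "day", "week", "month", "year"]


def month_partition(order_date):
    """Truncate a day-level date to its 'yyyyMM' partition value."""
    return order_date.strftime("%Y%m")


def parse_month_partition(label):
    """Parse a label such as 'month=201501' to the first day of that month."""
    value = label.split("=", 1)[-1]
    return datetime.strptime(value, "%Y%m").date()


def partition_is_coarser(partition_granularity, field_granularity):
    """Return True when the partition granularity is coarser than the linked field's."""
    return ORDER.index(partition_granularity) > ORDER.index(field_granularity)


print(month_partition(date(2015, 1, 17)))     # 201501
print(parse_month_partition("month=201501"))  # 2015-01-01
print(partition_is_coarser("month", "day"))   # True
```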

For sharpening to work in this example, the time range should be at least 10 times greater than the time interval for the selected chart. So if the chart shows one month's worth of data and the time granularity is set to Day, data sharpening will execute: one day is about 3.3% of the 30 days in April, which is under the 10% threshold.

Please note that if your Impala partition breaks out time attributes into separate fields, data sharpening will not be available. For example, if YEAR, MONTH, and DAY are all separate partitioned fields, they would need to be combined into one field in order for data sharpening to be available.
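
As an illustration of what "combined into one field" means, the Python sketch below builds a single 'yyyy-MM-dd' value from separate YEAR, MONTH, and DAY values. The function name combined_partition_value is hypothetical, and in practice the combined column would have to exist in the Impala table itself; the sketch only shows the target value.

```python
from datetime import date


def combined_partition_value(year, month, day):
    """Combine separate YEAR, MONTH, and DAY values into one 'yyyy-MM-dd' string.

    Building a `date` first also rejects impossible combinations
    such as month 13 or February 30.
    """
    return date(int(year), int(month), int(day)).isoformat()


print(combined_partition_value(2015, 1, 7))        # 2015-01-07
print(combined_partition_value("2015", "04", "30"))  # 2015-04-30
```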

Notes and Caveats

When Zoomdata connects to your data source for the first time, the application runs an initial query that returns a sample of the dataset (approximately 100 rows) to provide an initial time range. Meanwhile, Zoomdata continues to run a comprehensive query to obtain the actual MIN/MAX range based on the entire dataset. Until that query finishes, data sharpening may not activate, because the time range and time granularity derived from the sample will most likely fall short of the criteria for sharpening to execute. Once Zoomdata completes the full query, data sharpening works as expected as long as the correct parameters have been applied and the time criteria are met.

Depending on the size of your dataset, the comprehensive query may take a few minutes to complete. Other factors include the size and type of the data source, competing resources on the database, and resource and performance limitations.
