Data Sharpening in Zoomdata
Data sharpening is Zoomdata's patented technique to deliver fast and responsive visualizations for large volumes of data. Conceptually, data sharpening is similar to the way large image files or streaming video files display in a browser. When you start to load the image file, you see a blurry approximation of the image. But as the file loads in the background, the image sharpens until the entire image eventually comes into clear focus.
When you create or modify a chart for a large data set in Zoomdata, data sharpening can immediately display a partial or approximate rendering of the data. Zoomdata continuously updates the chart with more and more data until the fully sharpened result is available.
While a chart sharpens, you can continue to interact with it, zooming into more detail or changing the group-by, without waiting for the entire query to return. In other words, you can continue your big data exploration without waiting for long-running queries over billions of rows to complete; Zoomdata adjusts on the fly based on your input. Keep in mind that data sharpening is not always needed when visualizing your data. If Zoomdata can complete its query quickly (within a few seconds), you simply see the final result rendered in your selected chart. Data sharpening is used when the chart might not render immediately because of the volume of data being queried.
How Data Sharpening Works
Keep in mind that Zoomdata connects to your original data source and runs its queries there, which can be resource intensive. The full query runs in the background at the same time as a series of microqueries that sample data across partitions and refine estimates.
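The idea of refining an estimate as more partitions are read can be sketched in a few lines of Python. This is only an illustration under simple assumptions, not Zoomdata's actual algorithm: the function name `sharpen_sum` and the scale-up-by-fraction-seen estimator are hypothetical, and the partitions are plain in-memory lists.

```python
from typing import Iterator, List


def sharpen_sum(partitions: List[List[float]]) -> Iterator[float]:
    """Yield a progressively refined estimate of the total across all partitions.

    Hypothetical illustration: after each partition is scanned, the partial
    sum is scaled up by the fraction of partitions seen so far.
    """
    total_partitions = len(partitions)
    running = 0.0
    for seen, rows in enumerate(partitions, start=1):
        running += sum(rows)
        # Early estimates are approximate; the last one equals the exact total.
        yield running * total_partitions / seen


partitions = [[10.0, 20.0], [30.0], [15.0, 15.0], [40.0]]
estimates = list(sharpen_sum(partitions))
print(estimates)  # [120.0, 120.0, 120.0, 130.0]
```

Each intermediate value plays the role of the "blurry" chart, and the final value is the fully sharpened result.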
For Zoomdata to perform data sharpening with a data source, a "playable" time field is needed. Zoomdata attempts to detect this playable field automatically when the source is created; refer to Table 1 for what makes a time attribute "playable." Afterward, an appropriate time attribute must be specified on the data source's global default settings page. The granularity of this time field is referred to from here on as the "driving time field granularity" (DTFG), and it plays an important role in determining whether and how data sharpening is executed (this is elaborated in the next section). The section "Data Sharpening Setup and Process" later in this topic walks through enabling data sharpening for your connected data sources.
The setup for data sharpening differs slightly depending on the data source. A playable time field is always required for sharpening to occur, but what qualifies as one depends on the data source. The table below details the time field requirements for the data sources supported in Zoomdata.
|Data Source||Time Field Requirement|
|Amazon Redshift||Sort Key (only the first sort key is selected)|
|Cloudera Impala||Partitioned time field (The time field that is partitioned needs to be configured from the "Fields" page. A single partitioned column is needed for data sharpening to work in Impala sources.)|
|Cloudera Search, Elasticsearch, Apache Solr||Indexed time field. Zoomdata automatically detects indices.|
|MySQL, Oracle, PostgreSQL, SQL Server||Indexed time field. Zoomdata automatically detects indices.|
Determining When Data Sharpening is Executed
When you create a chart, Zoomdata determines whether data sharpening is necessary based on the chart style selected and the time attribute parameters that are set for it.
For non-trend visuals (like bars, donuts, and heat maps), data sharpening executes only if the granularity of the driving time field is less than 10% of the range that is set in the time bar (determined by the MIN/MAX range set in the data source). The smallest granularity Zoomdata uses in this calculation is minute. Thus, even if your DTFG is second, Zoomdata still uses minute when performing this "10% rule" calculation.
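The non-trend check described above can be expressed as a small Python sketch. It is a hedged illustration of the stated rule only: the function name `non_trend_rule_met` and its parameters are hypothetical, and Zoomdata's internal implementation may differ.

```python
from datetime import datetime, timedelta

# The text states that minute is the smallest granularity used in the calculation.
MIN_GRANULARITY = timedelta(minutes=1)


def non_trend_rule_met(dtfg: timedelta, range_min: datetime, range_max: datetime) -> bool:
    """Return True if the driving time field granularity is under 10% of the range."""
    effective = max(dtfg, MIN_GRANULARITY)  # a second-level DTFG is treated as minute
    time_range = range_max - range_min
    # "granularity < 10% of range", written with integer math to avoid float rounding
    return effective * 10 < time_range


start = datetime(2015, 4, 1)
print(non_trend_rule_met(timedelta(seconds=1), start, start + timedelta(minutes=5)))  # False
print(non_trend_rule_met(timedelta(seconds=1), start, start + timedelta(hours=1)))    # True
print(non_trend_rule_met(timedelta(days=1), start, start + timedelta(days=30)))       # True
```

The first call fails the rule because the second-level granularity is rounded up to one minute, which is 20% of a five-minute range.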
For trend visuals (like Line and Bars Trend and Line Trend: Attribute Values), Zoomdata executes an internal check to determine whether data sharpening should execute. As with non-trend visuals, a 10% criterion is used, but it is slightly modified for trend visuals: if the granularity of the driving time field for the source is less than 10% of the time granularity set for the particular trend chart, data sharpening executes.
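For trend visuals, the comparison is between two granularities rather than a granularity and a range. The sketch below assumes approximate durations for each named granularity (a 30-day month and a 365-day year are illustration-only assumptions), and the names `GRANULARITY` and `trend_rule_met` are hypothetical. The text does not say whether the minute floor applies here, so it is not modeled.

```python
from datetime import timedelta

# Approximate durations; the month and year lengths are assumptions for illustration.
GRANULARITY = {
    "minute": timedelta(minutes=1),
    "hour": timedelta(hours=1),
    "day": timedelta(days=1),
    "week": timedelta(weeks=1),
    "month": timedelta(days=30),
    "year": timedelta(days=365),
}


def trend_rule_met(dtfg: str, chart_granularity: str) -> bool:
    """Return True if the DTFG is less than 10% of the trend chart's granularity."""
    return GRANULARITY[dtfg] * 10 < GRANULARITY[chart_granularity]


print(trend_rule_met("minute", "hour"))  # True: 1 minute is under 10% of 60 minutes
print(trend_rule_met("day", "month"))    # True: 1 day is under 10% of about 30 days
print(trend_rule_met("day", "week"))     # False: 1 day is more than 10% of 7 days
```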
The bottom line is that Zoomdata will try to perform data sharpening when warranted based on the size of the data set, the time attributes available, and the time granularity that is set. If Zoomdata ascertains that results can be rendered in the chart quickly without data sharpening, it will do so. Otherwise, Zoomdata will attempt to use data sharpening to return near instantaneous result sets that are refined over time until the query completes.
Data Sharpening Setup and Process
To enable data sharpening for a data source connected in Zoomdata, you need to enable the time settings on the data source's settings page. Specifically, the playable time field needs to be set under Charts > Global Default Settings for the data source. Cloudera Impala sources require additional configuration, as detailed in the next topic, Data Sharpening on Cloudera Impala Sources.
Follow the steps below to enable the time settings:
- Log in to Zoomdata (either as the admin or as a user with edit rights to data sources).
- Select the Sources menu item.
- Select your data source.
- Select the Charts tab.
- Select the Time Attribute in Global Default Settings.
- Save your changes.
Data Sharpening on Cloudera Impala Sources
Data sharpening works with certain partitioned Impala sources. The partitioned field should be a time-based attribute and in a supported time format (for example, yyyy-MM-dd). Follow the steps below to set up Impala for data sharpening. An example scenario is provided to illustrate when data sharpening would occur when you explore a large data set.
Steps to set up your Zoomdata connection to Cloudera Impala for data sharpening:
- Log in to Zoomdata (either as the admin or as a user with edit rights to data sources).
- Select the Sources menu item.
- Access your Cloudera Impala source from the My Data Sources list.
- Navigate to the Fields tab.
- Identify the partitioned time attribute that will be enabled, and change the setting in the
- In the Configure column, select an appropriate time granularity. Consider the 10% rule to ensure data sharpening execution.
- Select a related Time Field from the drop-down list. This time field will serve as the driving time field and is the time field that needs to be specified in the global default settings.
- Select the Charts page.
- Select the Time Attribute in Global Default Settings.
- Save your changes.
The scenario below illustrates setting up Cloudera Impala for data sharpening.
- You have 3 years of historical data on Cloudera Impala.
- Your data is partitioned by month (another column, Order_Date_Month, contains the data from column Order_Date truncated to month).
- The time stamp in your data provides granularity to the day level (column Order_Date).
- Determine whether there are sub-folders in Impala. If so, the Label must include the full date format (for example, month=201501, which is in time format 'yyyyMM').
- Configure the Impala source in the
- For the Order_Date field, make sure "Day" granularity is selected.
- For the Order_Date_Month field:
  - In the Partitions column, set the partitioned time field to Date (or verify that it is selected).
  - In the Default column, set the option to Pattern and enter the appropriate time format (for this example, the time format is 'yyyyMM').
  - Set the granularity to Month (make sure the time granularity of the partitioned column is coarser than the granularity of the linked time field).
  - Link the partition to the field. Select a Time Field from the drop-down list (for this example, 'Order_Date' is selected).
- Continue to the Charts page and, in the Global Default Settings option, select Order_Date from the drop-down menu list.
- Save your work!
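To make the relationship between the partition label and the linked time field concrete, here is a minimal Python sketch. It assumes the 'yyyyMM' pattern from the scenario corresponds to Python's `%Y%m` format, and the helper names `parse_partition_label` and `truncate_to_month` are hypothetical; it only illustrates how an Order_Date value maps to its month partition.

```python
from datetime import date, datetime


def parse_partition_label(label: str, strptime_format: str = "%Y%m") -> date:
    """Parse a label such as 'month=201501' into the first day of that month."""
    key, separator, value = label.partition("=")
    if not separator:
        raise ValueError(f"expected 'name=value', got {label!r}")
    return datetime.strptime(value, strptime_format).date()


def truncate_to_month(d: date) -> date:
    """Truncate a day-level date to month granularity."""
    return d.replace(day=1)


order_date = date(2015, 1, 17)                        # day-level value, as in Order_Date
partition_month = parse_partition_label("month=201501")
print(partition_month)                                   # 2015-01-01
print(truncate_to_month(order_date) == partition_month)  # True
```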
For sharpening to work in this example, the time range should be at least 10 times greater than the time interval for the selected chart. So if the chart shows one month's worth of data and the time granularity is set to Day, data sharpening will execute: April has 30 days, so a single day is about 3.3% of the range, which meets the 10% threshold.
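The arithmetic in this example can be checked with a short Python sketch. The year 2015 and the helper name `intervals_in_range` are assumptions for illustration; the Week case is added only as a contrast and does not come from the scenario.

```python
import calendar


def intervals_in_range(range_days: int, interval_days: int) -> float:
    """Return how many chart intervals fit into the selected time range."""
    return range_days / interval_days


days_in_april = calendar.monthrange(2015, 4)[1]      # number of days in April
ratio_day = intervals_in_range(days_in_april, 1)     # Day granularity
ratio_week = intervals_in_range(days_in_april, 7)    # Week granularity, for contrast

print(days_in_april)     # 30
print(ratio_day >= 10)   # True: the range is 30 times the interval
print(ratio_week >= 10)  # False: the range is only about 4.3 times the interval
```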
Please note that if your Impala partition breaks out time attributes into separate fields, data sharpening will not be available. For example, if YEAR, MONTH, and DAY are all separate partitioned fields, they would need to be combined into one field in order for data sharpening to be available.
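As a rough illustration of what "combined into one field" means, the Python sketch below joins separate year, month, and day values into a single string in the 'yyyy-MM-dd' format mentioned earlier. The function name `combine_partition_fields` is hypothetical, and in practice the combination would be done in the data source itself rather than in client code.

```python
from datetime import date


def combine_partition_fields(year: int, month: int, day: int) -> str:
    """Combine separate YEAR, MONTH, and DAY values into one 'yyyy-MM-dd' string."""
    combined = date(year, month, day)  # raises ValueError for impossible dates
    return combined.isoformat()


print(combine_partition_fields(2015, 1, 5))  # 2015-01-05
```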
Notes and Caveats
When Zoomdata connects to your data source for the first time, the application runs an initial query that returns a sample of the data set (approximately 100 rows) to provide an initial time range. Meanwhile, Zoomdata continues to run a comprehensive query to obtain the actual MIN/MAX range for the entire data set. During this period, data sharpening may not activate, because the time range and time granularity derived from the sample will most likely fall short of the criteria for sharpening to execute. Once Zoomdata completes the full query, data sharpening works as expected, as long as the correct parameters have been applied and the time criteria are met. Depending on the size of your data set, the comprehensive query may take a few minutes to complete; other factors include the size and type of the data source, competing workloads on the database, and resource and performance limitations.