Source Connection Workflow
The steps to connect Zoomdata to your data source follow a fairly standard workflow that is applicable across most supported data sources. This article walks through this general workflow to help prepare you to connect Zoomdata to your data source(s) successfully. Before starting, Zoomdata recommends you review the capabilities and limitations quick reference sheet to understand the features and options available to you based on the data source you are connecting.
Prior to connecting your data sources, review the following resources to ensure that Zoomdata is able to connect to your source:
- Data Sources Quick Reference Sheet: Capabilities and Limitations
- Date and Time Formats Supported By Zoomdata
- How Zoomdata Caches the Data
- Overview of the Data Sources Page
The number of steps that you have to complete varies slightly depending on the type of data source being connected. Figure 1 illustrates the general steps:
- General Page
When the connector server is set up by the Zoomdata Administrator, certain parameters on the Connectors page may be customized. The instructions below assumes that the default configuration parameters were kept. If this is not the case, then your Connection page may differ slightly from the screen captures and references provided.
- Tables Page
- Fields Page
- Refresh Page
- Charts Page
Select the button to open the Data Sources page.
To connect a new source to Zoomdata, select the appropriate icon from the list of supported data sources (see Figure 2).
Navigate to the first screen in the connection process: the General page. Review each page listed below to help you understand the necessary configurations and settings needed.
First, create a unique name to identify this data source name and, optionally, add a description. The Source Name field is required (see Figure 3).
In this step, specify the connection details and, if applicable, authentication credentials in order to connect Zoomdata to the data source. The fields will vary depending on the selected data source. Once all the information has been supplied, select the Validate button to substantiate the connection (see Figure 4). If the connection is successful, you will be able to continue to the next page in the connection process. Save your settings.
Optionally, if you have access to already-validated connections, you can use one of those (see Figure 5).
The settings on this page differ depending on the data source you want to add. The table below lists the options available for each connector.
|Data Source||Schema||Object type||Custom SQL Support||Additional Filters|
|Hive on Tez||✔||Table||✔||-|
|Presto for Cassandra||✔||Table||✔||-|
In this step, you have to select the table from the available collection in the data source:
—in this list, all the available schemas for the data source are displayed. When you select a schema, all the corresponding tables that fit these criteria will display. If there are many results, use search to quickly locate the desired table (see Figure 6).
Table (index or collection)
—select the desired table (see Figure 7).
—available for certain data sources; if you want to use specific fields from the table, you can run SQL query to get them listed.
Zoomdata wraps your SQL query into a SELECT statement. If specific statements inside the wrapped query are not supported by your data source, the query will not be executed (see Figure 8).
—for specific data sources (for example, SendGrid or Google Analytics), you can specify the time period for which you want to use the data (see Figure 9).
—when you select a table, the list of fields and their types (as defined in the database) is displayed. By default, all of them are selected. You can clear the checkboxes near the specific fields to exclude them (see Figure 10).
Records to Sample —if it is not possible to define the fields types from the data source, you can define them via sampling. Select a table and specify the number of records from the data source to be analyzed. The type of data in each field will be defined. By default, 100 records are sampled. Change this value as needed.
—this section contains the preview of your data set (first 10 records ) (see Figure 11).
- For search-based data sources (for example, Solr), there is additional option to configure a Request Handler : a Solr-specific plug-in that defines a logic when executing a search request (available in both Cloudera Search and Solr sources).
—if Spark It feature is configured for your data source, you can enable it.
By default, SparkIt is disabled. Access the article
How Zoomdata Uses Apache Spark
(in the Admin Guide) for more information about this option
—you can enable or disable caching of aggregated query results for your data source.
If this option is enabled, the results will be stored in Spark. A ccess the article How Zoomdata Caches the Data (in the Admin Guide) for more information about this option.
—if you enable this option, Zoomdata will search the distinct values in the metadata store (PostgreSQL). This will speed up retrieving the values for filtering.
You can configure a scheduled job to refresh the distinct values in the metadata store.
Otherwise, each time Zoomdata will query the data source directly.
On this page, you can configure the settings for the fields from the tables (indices or collections) that you have selected in the previous step. These fields will be used as attributes and metrics on your chart.
In this step, Zoomdata samples the dataset (1,000 records) in order to define the fields' metadata (such as min and max values, cardinality, and more).
You can configure the following:
- Visible —by default, all the fields are visible. This means that you can visualize the data from these fields on your charts. If you do not want to use specific fields, clear the corresponding checkboxes.
- Label —by default, the names of the fields from your data source are used as labels. You change them in the Label column.
- Type —when you select tables (indices or collections), each field type is defined by Zoomdata. The default data types are displayed in the Type column. You can change them, if required.
FieldsTypeDetectorproperty in the application.properties file as required.
- The Partitioned column shows if the field is partitioned. The following connectors support partitioning: Cloudera Impala, Hive on Tez, Spark SQL, Apache Drill, and BigQuery
- For metrics, you can define the default value: SUM, AVG, MAX, MIN
Set the time pattern and granularity for the fields of the
type. If required, you can define the time and date pattern by selecting the
item from the list and specifying it in the corresponding field. Otherwise, you can proceed with the
You can also set the default time granularity for the selected field. To do this, select the required item (Second, Minute, Hour, Day, Week, Month, and Year) from the corresponding list.
'Time' fields containing granularities down to the hours, minutes and seconds may also apply a time zone label. To apply a time zone: The time zone label will be displayed in charts and dashboards for this data source.
- Select the current label (most likely, UTC (UTC+00:00)) to access the Time Zone pane.
- Then you can either (a) manually input the desired time zone in the text field or (2) select the drop down to choose one from the provided list.
- Tick the checkbox if this label should be applied to all time fields for the data source.
- Select the Apply button.
- Distinct Count — enable or disable distinct count for the fields.
—you can set the range of data that will be
available in the filter
for all data types.
For fields that are of type ATTRIBUTE, you can select the
Only Allow Custom Values
If this checkbox is selected, you will have to enter the names of the attributes manually while working with the Filters pane , since the list of attributes will not be displayed (see Figure 14).
- For fields that are of type ATTRIBUTE, you can select the Only Allow Custom Values checkbox.
To set the range of values that will be available in the filter for the INTEGER, MONEY, and NUMBER types, click
and specify 'Min' and 'Max' values (see Figure 15).
To set the range of values that will be available in the filter of the TIME type, click
and specify 'From' and 'To' time range (see Figure 16).
—this column will be available for all data sources except created via upload API and flat files when you edit the data source details.
It contains the Refresh button. You can refresh the metadata of the selected field by clicking this button (see Figure 17).
While editing the data source settings, click
refresh the list of fields available from the data source (that is, add new fields). Keep in mind that the metadata of the fields won't be refreshed (see Figure 18).
icon to view additional info about the data element. The Info pane will be displayed listing the settings (see Figure 19).
In this step, you can configure the data refresh settings for your data source.
As the last step of your data source configuration, define the default settings for each chart type . When you visualize the data from a data source, the default settings are applied to a chart.
Saving Your Connection
Once you have completed the process, click Save . You will receive a confirmation that the connection was successfully saved and be taken back out to the main Data Sources page. You should see your new connection listed in the My Data Sources section of the page.