Zoomdata Version

Optimizing Joins

Data fusion joins are processed in the order in which they are specified in the UI. This affects the resulting data and the performance of the join. In addition, the type of join you select affects whether fusion processing time is optimized.

Joins are usually performed in-memory. However, when join processing can be pushed down to the data connectors to perform, fusion processing time is greatly reduced. Zoomdata supports pushdown join processing in the following ways.

  • If a data connector supports pushdown joins and if the data to be joined comes from the same data source connection, Zoomdata pushes the join operation to the underlying data connectors and allows them to join the data instead. This capability is currently supported only for Impala and Hive data stores. Several examples are given later.

  • If the type of join is an inner join and aggregate functions SUM, MIN, MAX, or COUNT are used in the data, the Zoomdata engine intelligently pushes the aggregate queries to the underlying data connectors, thus reducing the amount of data that needs to be processed. In these cases, the aggregation is performed first before the data is joined. This aggregate pushdown occurs when joining data from the same or from different data sources.

Because most joins are performed in-memory, a configurable limit has been placed on the number of records that can be processed from each joined source. This limit is initially set at 1,000,000 records per joined data source and can be configured by your Zoomdata administrator or supervisor using the qe.zengine.edc.rows.limit property in the query-engine.properties file. See Managing the Zoomdata Query Engine. When this threshold is exceeded, no data is shown on the charts containing the fused data and a message appears indicating that the threshold (maximum row number) is exceeded. If you find you are hitting this limit, use filters on the chart or dashboard to reduce the number of records processed and shown.

Examples

Example 1

In the following two fusion data sources, Fusion Data Source 1 will be pushed to the Impala connector to perform, whereas Fusion Data Source 2 will be performed in-memory because the two data sources use different Impala connections.

Fusion Data Source 1 join:
    Impala-Data-Source1-using-Impala-Connection-1
    Impala-Data-Source2-using-Impala-Connection-1
Fusion Data Source 2 join:
    Impala-Data-Source1-using-Impala-Connection-1
    Impala-Data-Source3-using-Impala-Connection-2

Example 2

The following multisource fusion example has more than one join defined. Assuming both joins are inner joins, join 1 will be performed by the Impala connector and join 2 will be performed in-memory.

Fusion Data Source
    inner join 1:
	Impala-Data-Source1-using-Impala-Connection-1
    	Impala-Data-Source2-using-Impala-Connection-1
    inner join 2:
       Impala-Data-Source1-using-Impala-Connection-1
       Elasticsearch-Data-Source1-using Elasticsearch-Connection-1

The following diagram depicts the relationship of the joins in the fused data source:

Example 3

If the join order from Example 2 is switched as shown below and if the first join is changed to a left join, neither join can be performed by data connectors. They are both performed in-memory.

Fusion Data Source
    left join 1:
   	Elasticsearch-Data-Source1-using-Elasticsearch-Connection-1
    	Impala-Data-Source1-using-Impala-Connection-1
    inner join 2:
       Impala-Data-Source1-using-Impala-Connection-1
       Impala-Data-Source2-using-Impala-Connection-1

The following diagram depicts the relationship of the joins in the fused data source:

Was this topic helpful?