Query Engine Microservice

The Zoomdata query engine sits between the web application (or a client application built with the Zoomdata JavaScript SDK) and the Zoomdata data connectors.

The query engine has three primary roles:

  1. It deconstructs and converts your query requests into distributed execution plans.

  2. It optimizes the execution plans based on data platform capabilities, in-memory cached results, and the query engine capabilities.

  3. It executes data functions that include:

    • Communicating with Zoomdata data connectors to execute push-down queries
    • Retrieving data from in-memory cached results, as appropriate
    • Using in-memory processing to combine, append, or manipulate one or more data sets to produce only the values needed to fulfill your request

The key capabilities of the query engine include push-down processing for select data sources, adaptive caching, data playback and live mode, and multisource analysis.

Push-Down Processing

Zoomdata is built so users can interact directly with data down to the atomic row-level detail. Push-down processing is necessary to support this ad-hoc, interactive user experience on fresh data. As users explore the data and zoom down to lower levels of detail, Zoomdata continues to push processing down as new queries. The data store returns only the values that the query engine needs to populate the user’s visualizations. The Zoomdata push-down architecture also avoids scaling up the query engine unnecessarily when complex processing can be better executed on high-performance database engines or scalable data platforms.

Zoomdata’s default processing strategy is to push down as much work to the underlying data sources as possible, for as many data sources as possible. The query engine contains a query optimizer that evaluates each end-user request and determines whether to submit all or part of the request to the target data stores. The work that can be pushed down includes filtering criteria, derived fields, custom metrics, and offset, limit, sort, and time bucketing operations.
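To illustrate what pushing a request down can look like, the following minimal sketch translates a simplified visualization request into a single SQL statement, so that filtering, grouping, sorting, limit, and offset all run in the data store. The `Request` shape, the `build_pushdown_sql` function, and the table and column names are hypothetical; they are not part of Zoomdata's API, and the real optimizer is more involved.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    """Hypothetical, simplified visualization request."""
    table: str
    group_by: str
    metric: str                                   # e.g. "SUM(amount)"
    filters: list = field(default_factory=list)   # (column, operator, value) triples
    sort_desc: bool = True
    limit: int = 10
    offset: int = 0


def build_pushdown_sql(req: Request):
    """Return (sql, params) so that the data store does the heavy lifting."""
    where = ""
    params = []
    if req.filters:
        clauses = []
        for column, op, value in req.filters:
            clauses.append(f"{column} {op} ?")    # values travel as parameters
            params.append(value)
        where = " WHERE " + " AND ".join(clauses)
    direction = "DESC" if req.sort_desc else "ASC"
    sql = (
        f"SELECT {req.group_by}, {req.metric} AS value "
        f"FROM {req.table}{where} "
        f"GROUP BY {req.group_by} "
        f"ORDER BY value {direction} "
        f"LIMIT {req.limit} OFFSET {req.offset}"
    )
    return sql, params
```

Only the grouped, sorted, and limited rows come back over the network, which is the point of the push-down strategy described above.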

The ability to push down filters means that the data platform engine doesn’t need to scan large data sets unnecessarily. It also reduces the amount of data transferred over the network from the data source to Zoomdata. Zoomdata can push down all filters that a user requests in the UI.

Push-down of derived fields and custom metrics optimizes performance for the most resource-intensive operations. Zoomdata always pushes down custom metric aggregations: min, max, sum, avg, count, distinct count, last value, and percentiles. Where advantageous, the query engine combines several simpler aggregates to compute more complex metrics.
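One plausible reading of combining simpler aggregates is computing an overall average from pushed-down sums and counts. The sketch below assumes each partial result arrives as a `(sum, count)` pair; the function name and the data are illustrative only and are not taken from Zoomdata.

```python
def combine_avg(partials):
    """Compute an overall average from pushed-down (sum, count) pairs.

    Averaging the per-partition averages would be wrong when partitions
    differ in size, so the simpler aggregates are combined instead.
    """
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count if count else None


# Two hypothetical partial results, each as (sum, count).
partials = [(300.0, 3), (50.0, 1)]
overall = combine_avg(partials)   # 350.0 / 4
```

Here the naive average of the two partition averages (100.0 and 50.0) would be 75.0, while the correctly combined value is 87.5.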

Zoomdata offers automatic time bucketing, which allows you to group and filter data by time categories such as current or prior week, month and year, rolling time periods, and so on. There is no need to pre-aggregate or model time buckets, which frees technical personnel for other work. All that is needed is a date-time field. The query engine does all the work of interpreting and converting user requests to one or more queries and pushing the whole operation to the data store.
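Conceptually, time bucketing maps each timestamp to the start of the period that contains it. The sketch below shows that mapping in plain Python. The function name, the supported granularities, and the Monday week start are assumptions made for illustration; in a push-down the equivalent truncation would run inside the data store.

```python
from datetime import datetime, timedelta


def bucket_start(ts: datetime, granularity: str) -> datetime:
    """Truncate a timestamp to the start of its time bucket."""
    day = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    if granularity == "day":
        return day
    if granularity == "week":
        # Weeks start on Monday here; that is an assumption.
        return day - timedelta(days=day.weekday())
    if granularity == "month":
        return day.replace(day=1)
    if granularity == "year":
        return day.replace(month=1, day=1)
    raise ValueError(f"unsupported granularity: {granularity}")
```

Grouping rows by `bucket_start(row_time, "month")`, for example, yields one group per calendar month without any pre-modeled time dimension.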

The benefits of the Zoomdata live-connect approach with push-down processing are:

  • You always have access to fresh data from the data store
  • Computational resources are scaled and managed where they make the most sense
  • Network bandwidth is conserved
  • It works very well for hybrid-cloud deployments, since there’s no need for massive data movement between systems

Zoomdata’s implementation of push-down processing also allows users to explore down to the atomic, row-level detail. In this way, Zoomdata is unique in its ability to make the full breadth and depth of big data environments available for exploration.

Adaptive Caching

The query engine optionally accelerates performance through adaptive caching. The cache eviction algorithm prefers the most frequently used data, with consideration of cache size and expiration times. Users with the appropriate privileges can configure cache sizes and time-to-live (TTL) schedules, and can forcibly clear caches.
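The exact eviction algorithm is not documented here. As a rough mental model only, the toy cache below drops expired entries first and then evicts the least frequently used entry once the size limit is reached. The class name, its parameters, and the injectable `clock` are all hypothetical, and `max_entries` is assumed to be at least 1.

```python
import time


class FrequencyCache:
    """Toy cache: evicts expired entries first, then the least frequently used."""

    def __init__(self, max_entries, ttl_seconds, clock=time.monotonic):
        self.max_entries = max_entries      # assumed to be >= 1
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._entries = {}                  # key -> [value, expires_at, hit_count]

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        if entry[1] <= self.clock():
            del self._entries[key]          # entry outlived its TTL
            return None
        entry[2] += 1                       # record the hit for eviction decisions
        return entry[0]

    def put(self, key, value):
        now = self.clock()
        # Drop expired entries before considering frequency.
        for k in [k for k, e in self._entries.items() if e[1] <= now]:
            del self._entries[k]
        if key not in self._entries and len(self._entries) >= self.max_entries:
            coldest = min(self._entries, key=lambda k: self._entries[k][2])
            del self._entries[coldest]
        self._entries[key] = [value, now + self.ttl_seconds, 0]

    def clear(self):
        """Forcibly empty the cache."""
        self._entries.clear()
```

The `max_entries`, `ttl_seconds`, and `clear` members loosely mirror the cache size, TTL, and forced-clear controls mentioned above.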

Zoomdata uses its cache to enhance performance in scenarios where large numbers of users are concurrently viewing the same shared visualizations.

By default, data caching is enabled for all data sources. The Zoomdata cache stores the results of aggregated requests from your data source. When a chart is created, the request is first sent to the Zoomdata cache. If the required results are found in the cache, they are visualized on your chart.
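The cache-first lookup can be pictured as a read-through pattern. In the sketch below, the fallback to the data source on a cache miss, the dictionary used as the cache, and the `fetch_results` and `query_source` names are assumptions for illustration rather than documented Zoomdata behavior.

```python
def fetch_results(key, cache, query_source):
    """Return (rows, served_from_cache) using a read-through lookup."""
    if key in cache:
        return cache[key], True
    rows = query_source()     # assumed fallback: ask the data source
    cache[key] = rows         # keep the result for later requests
    return rows, False
```

A second request with the same key is then served without touching the data source.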

Each visualization that is cached has a key that uniquely identifies its content and how it can be reused. The key is calculated from both the request and the user. The request provides the information related to data grouping and content, such as the data attributes and metrics. The user provides contextual information, such as per-user security filters and user attributes. This way, all the security settings are taken into account when storing and using the cached visualization. Information from the cached visualization is shared between users only when they have the same data access permissions and security context.
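A minimal sketch of such a key, assuming the request and the user's security context can each be represented as a JSON-serializable dictionary. The `cache_key` function, the dictionary fields, and the choice of SHA-256 are illustrative assumptions, not Zoomdata's actual scheme.

```python
import hashlib
import json


def cache_key(request: dict, security_context: dict) -> str:
    """Derive a key from the request and the user's security context."""
    payload = json.dumps(
        {"request": request, "security": security_context},
        sort_keys=True,              # dictionary order must not change the key
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


request = {"group_by": "region", "metric": "sum(amount)"}
alice = {"row_filter": "region = 'EU'", "hidden_fields": []}
bob = {"hidden_fields": [], "row_filter": "region = 'EU'"}     # same context as alice
carol = {"row_filter": "region = 'US'", "hidden_fields": []}   # different row filter
```

With these inputs, `alice` and `bob` produce the same key and can share a cached result, while `carol` gets a different key because her security context differs.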

The data cache optimizes data processing and avoids unnecessarily expensive queries. To avoid an expensive query, all or parts of multiple interactive data caches can be used to fulfill different user requests. Data retrieved from data sources and data previously processed by the query engine can be stored as materialized views and used to improve the performance of calculations, multisource analysis, and other analytical functionality.

The cache uses normalized requests for its cache entry key. This key allows data caches to be safely used by multiple users with different security contexts and visualization types. To serve requests that require different data, the query engine applies a number of possible transformations to the cached data. For example, columns can be dropped to enforce field-based security, additional filters can be applied to enforce row-level security, time windows can shift, or the data can be rolled up to higher levels of aggregation. Because there may be several ways to get the same result using different cache entries, the query engine evaluates different transformation approaches for complexity, and selects the simplest approach.
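As an illustration of these transformations, the sketch below answers a coarser request from finer-grained cached rows: it applies an extra row filter, keeps only the requested columns, and rolls the metric up to a higher level of aggregation. The row shape, the `reuse_cached` function, and the restriction to additive metrics are assumptions made for this example.

```python
from collections import defaultdict


def reuse_cached(rows, group_by, metric, row_filter=None):
    """Answer a coarser request from finer-grained cached rows.

    rows       -- cached result rows as dicts (hypothetical shape)
    group_by   -- the single attribute the new request groups on
    metric     -- name of an additive metric (for example a sum or count)
    row_filter -- optional predicate standing in for row-level security
    """
    totals = defaultdict(float)
    for row in rows:
        if row_filter is not None and not row_filter(row):
            continue                           # extra filter applied to cached data
        totals[row[group_by]] += row[metric]   # roll up; other columns are dropped
    return [{group_by: key, metric: value} for key, value in sorted(totals.items())]


# Cached at city level, reused for a country-level chart.
cached = [
    {"country": "DE", "city": "Berlin", "sales": 10.0},
    {"country": "DE", "city": "Munich", "sales": 5.0},
    {"country": "FR", "city": "Paris", "sales": 7.0},
]
by_country = reuse_cached(cached, "country", "sales")
```

Roll-up of this kind only works directly for additive metrics; a metric such as distinct count cannot be summed across finer-grained rows.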

Not all data requests are cached. By default, cache entries are stored in the metadata repository, where they are retained after a service restart.
