One of the Zoomdata Query Engine's optimization strategies is to accelerate performance through adaptive caching. To accomplish this, we adopted a layered caching approach, much like the memory hierarchy in a PC architecture. The key difference is that we optimize for the time spent processing data for visualization and for the round trips required to fetch data during interactive analysis.
The Zoomdata Query Engine uses two caching layers: the visualization cache and the interactive cache.
The Visualization Cache
The visualization cache is the first caching layer in Zoomdata. It minimizes latency on 100 percent cache hit requests and avoids repetitive queries by holding the information in a format that does not require processing before sending it back to the visualization. For example, a bar chart that depicts quarterly sales for one year will have exactly four values in its visualization cache, representing one value for each quarter.
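To make "a format that does not require processing" concrete, here is a minimal sketch, assuming a hypothetical cache key name and payload shape (not Zoomdata's actual internals): the cached entry is already shaped for the chart, so a hit returns it directly.

```python
# Hypothetical sketch: the visualization cache stores results already
# shaped for rendering, so a 100 percent cache hit needs no further work.
visualization_cache = {
    "quarterly_sales_2023": [
        {"label": "Q1", "value": 120},
        {"label": "Q2", "value": 95},
        {"label": "Q3", "value": 140},
        {"label": "Q4", "value": 165},
    ],
}

def render_payload(cache_key: str):
    """Return the ready-to-render chart payload, or None on a miss."""
    return visualization_cache.get(cache_key)
```

The quarterly bar chart example maps to exactly four entries in the cached payload, one per bar.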
The visualization cache can provide significant performance improvements when a large number of concurrent users with equivalent security credentials view the same visualization(s) at approximately the same time, the typical "Monday morning problem." Why go back to the database to calculate quarterly sales over and over again? Just pick up the four values from the visualization cache and render the chart.
How does security work with the visualization cache? Each visualization cache entry has a key that identifies its content and how it can be reused. The key is calculated based on the request and the user. The "request" provides the information related to data groupings and content, such as data attributes and metrics (sum, average, etc.). The "user" provides contextual information, like per-user security filters, user attributes, etc. This way, all security settings are taken into account when storing and using the visualization cache. Information from the visualization cache is shared between users only when they have the same data access permissions and security context.
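A key derived from both the request and the user's security context could be sketched like this. The field names and the use of canonical JSON plus SHA-256 are illustrative assumptions, not Zoomdata's actual key schema:

```python
import hashlib
import json

def visualization_cache_key(request: dict, user: dict) -> str:
    """Sketch: combine request shape and user security context into
    one cache key, so entries are only shared across users whose
    security context is identical."""
    key_material = {
        # Request side: data groupings and content
        "attributes": sorted(request.get("attributes", [])),
        "metrics": sorted(request.get("metrics", [])),
        # User side: per-user security filters and attributes
        "security_filters": sorted(user.get("security_filters", [])),
        "user_attributes": sorted(user.get("user_attributes", [])),
    }
    # Canonical JSON so equivalent requests hash to the same key
    canonical = json.dumps(key_material, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Two users with the same filters and attributes produce the same key and can share the cached entry; a user with a different row-level filter produces a different key and gets a separate entry.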
(Note that we say visualizations and not dashboards. This is because the visualization cache is based on the query behind the visualization and can be leveraged across multiple dashboards that contain the same visualization / query configuration.)
The Interactive Cache
The interactive data cache is the second cache layer in Zoomdata. It optimizes data processing and avoids the unnecessary execution of expensive queries.
To avoid an expensive query, all or parts of multiple interactive data caches can be used to fulfill different user requests. Data retrieved from data sources and data previously processed by the Query Engine can be cached to improve the performance of calculations, multisource analysis, and other analytical functionality.
For example, let's go back to the very simple example of the quarterly sales bar chart. There are four values in the visualization cache used to quickly render the bars in the chart; there are also four values in a separate interactive cache. If you add a KPI visualization to the dashboard, the four quarterly values in the interactive cache can be summed to deliver the yearly total for the new KPI visualization.
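The quarterly-to-yearly reuse above can be sketched as follows. The cache layout and function name are hypothetical; the point is that the yearly KPI is computed from cached data rather than by re-querying the source:

```python
# Hypothetical sketch of interactive-cache reuse: four cached quarterly
# values are rolled up to answer a new yearly-total request without
# going back to the data source.
interactive_cache = {
    ("sales", "by_quarter"): {"Q1": 120, "Q2": 95, "Q3": 140, "Q4": 165},
}

def yearly_total(metric: str) -> int:
    """Serve a yearly KPI by summing the cached quarterly breakdown."""
    quarterly = interactive_cache[(metric, "by_quarter")]
    return sum(quarterly.values())
```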
The interactive data cache uses normalized requests for its cache entry key. This key allows data caches to be safely used for multiple users with different security contexts and visualization types. To serve requests that require different data, the Query Engine can apply a number of possible transformations to the interactive data caches. For example, columns can be dropped to enforce field-based security, additional filters could be applied to enforce row-level security, time windows can shift, or data can be rolled up to higher levels of aggregation. As there may be several ways to get the same result using different cache entries, the Query Engine evaluates different transformation approaches for complexity and selects the simplest approach.
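Choosing the simplest transformation path can be sketched with a toy cost model. The transformation names and integer costs below are assumptions for illustration, not the Query Engine's actual planner:

```python
from dataclasses import dataclass

@dataclass
class Transformation:
    """One step that adapts a cached entry to a new request,
    e.g. dropping columns, applying row filters, or rolling up."""
    name: str
    cost: int  # relative complexity; lower is simpler

def cheapest_plan(candidates: list[list[Transformation]]) -> list[Transformation]:
    """Each candidate is a chain of transformations that yields the
    requested result from some cache entry; pick the cheapest chain."""
    return min(candidates, key=lambda chain: sum(t.cost for t in chain))
```

For instance, if one cache entry only needs columns dropped while another needs both a row filter and a time rollup, the single-step plan wins.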
Flexibility and Scalability
The cache eviction algorithm targets the least frequently used data, with consideration of cache size and expiration times. Administrators with the appropriate privileges can configure cache sizes and time-to-live (TTL) schedules, and can forcibly clear caches.
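A minimal sketch of least-frequently-used eviction combined with a TTL, assuming a fixed entry count rather than byte-based sizing (the real cache is configured by size and TTL schedules, as described above):

```python
import time

class LFUCacheSketch:
    """Illustrative LFU cache with per-entry TTL and a clear() hook."""

    def __init__(self, max_entries: int, ttl_seconds: float):
        self.max_entries = max_entries
        self.ttl = ttl_seconds
        self.store = {}  # key -> (value, hit_count, stored_at)

    def get(self, key):
        entry = self.store.get(key)
        if entry is None:
            return None
        value, hits, stored_at = entry
        if time.time() - stored_at > self.ttl:
            del self.store[key]  # expired by TTL
            return None
        self.store[key] = (value, hits + 1, stored_at)
        return value

    def put(self, key, value):
        if len(self.store) >= self.max_entries and key not in self.store:
            # Evict the least frequently used entry
            lfu_key = min(self.store, key=lambda k: self.store[k][1])
            del self.store[lfu_key]
        self.store[key] = (value, 0, time.time())

    def clear(self):
        """An administrator can forcibly clear the cache."""
        self.store.clear()
```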
Caching is optional and can be turned off at the data source level, and not all data requests are cached. By default, cache entries are stored in the main memory of the Query Engine, which can be deployed in standalone mode or leverage a modern resource manager such as YARN or Kubernetes (coming soon!) for horizontal scalability. If memory fills up, cached data can be spilled to disk and reloaded into main memory when it is needed again.
In our next post, we’ll take on streaming data, historical playback, and live mode.