Overview of How Zoomdata Handles Data Fusion
Data Fusion is the concept of tying together data sources residing in disparate databases and presenting as a single data source for exploration and analysis. Once the Zoomdata Server joins together data from various sources, you can then visualize the combined results like any standard data source in a visualization or dashboard (see Figure 1 for a Concept Diagram on Zoomdata’s approach to Data Fusion).
Zoomdata is able to join all of its available pre-built data connectors (from big data sources to flat files, see Figure 2).
Such data fusion capabilities are available through Zoomdata’s familiar and intuitive user interface. Data fusion sources can be established using our standard connection interface. Step-by-step instructions are provided in the article Data Fusion Setup .
These fused datasets are then stored as a new type of data source - Data Fusion ( ) - that can be accessed and interacted with like any other standard data source. You can visualize the fused dataset on our chart canvas using either any of the standard styles available or a customized style (created from our Chart Studio).
The resulting data fusion can help generate more meaningful insights than what may be available from a single data source. For example, if a data warehouse stores ticketing sales and events data in three different databases (as shown in Figure 1):
- Information about buyers and sellers are stored in Redshift
- Events data are stored in Elastic Search
Ticket sales are stored in Cloudera Impala
… data fusion can join these databases together to allow for greater exploration and analysis.
Using Data Fusion, Zoomdata is able to take advantage of a variety of table structures that may reside in data repositories. These include lookup tables, fact tables, star and snowflake schema structures. The next section reviews these different table structures in greater detail.
A lookup table contains description fields that describe dimensional data. For example, a lookup table may contain a variety of information about sellers including username, first name, last name and contact information. But this table may not contain pertinent sales data that is linked to the seller. This table may need to be joined with a fact table to provide the insightful information needed (see Figure 3 for an example of a lookup table).
A fact table houses measurements, metrics and other analytical information for businesses. A fact table typically contains two types of columns - fact columns that contain the measures to analyze and dimensional keys to analyze the facts by different attribute contexts.
For example, analyzing quantities sold (fact) and price of tickets (fact) by event would be one context. Another context might be to analyze the same facts by seller. Fact tables may be missing the descriptions needed for categorizing and analyzing the data. Combining fact tables with lookup tables may be necessary to conduct proper data analysis.
Figure 4 illustrates an example fact table containing Sales information including the identifiers for each sales transaction along with the related event (and other information).
A star schema consists of one or more fact tables that have references to dimension tables. Dimension tables store descriptions for key identifiers in the fact table as well as parent attributes that can allow aggregation of metric data to higher levels. This type of table structure is useful for rolling up and analyzing metric data using various dimensional attributes (see Figure 5 for an example star schema).
A snowflake schema consists of one or more fact tables that have references to multiple dimension tables which in turn may connect to additional dimension tables. The logical arrangement of tables in a multidimensional database when displayed in a diagram resembles a snowflake shape. Like a star schema, dimension tables store descriptions for key identifiers in the fact table as well as parent attributes that can allow aggregation of metric data to higher levels. However, the difference is that the dimension table themselves may connect to additional dimension tables providing further information (see Figure 6 for an example snowflake schema).
How Zoomdata Fuses Disparate Datasets
The Zoomdata Server utilizes Apache Spark’s in-memory technology to join disparate data sources together. Figure 7 presents a data fusion conceptual diagram that illustrates disparate data sources test together within Zoomdata to produce a single dashboard where charts can display data coming from multiple sources.
With data fusion, Zoomdata is able to perform Group Bys using fields that are available across tables. For example, if a table in Amazon Redshift contains a Sellers field holding IDs and another table in MySQL contains the names of sellers, these disparate fields can be fused into one form Sellers with the two fields joined and accessible (as shown in Figure 8).
Data fusion allows for the joining of disparate data sources that is connected to Zoomdata. In order to fuse these disparate sources together, you need to join matching fields from the different data sources in the Fusion Editor (as shown in Figure 9). The Fusion Editor identifies the attributes to be used in the fusion. You can simply drag-and-drop attributes from the data source list into the work area to build your fused source. Instructions for using the Editor is provided in the article Setting Up Data Fusion in Zoomdata .
Once you have joined the necessary fields from the datasets, you are able to visualize and explore this fused dataset like any other data source. The key step in the process is to identify and specify the fields to be joined across the targeted data sources.
Data Fusion Use Cases
Zoomdata supports the following data fusion use cases:
- Looking up descriptions from Lookup Tables
- Aggregating data using dimensional tables in a Star or Snowflake Schema
- Joining multiple fact tables on Common Keys
We’ll explore each use case in greater detail and then demonstrate how you can combine these different scenarios together to build unique fusion datasets for your research and analysis.
The lookup description is the simplest use case since it is joining fact table to lookup table using a common identity key. In this scenario, the description from a lookup table can be visualized along with metrics information from a fact table. Figure 10 illustrates an example of a join between the event (lookup table) and the price paid. The common field between the two disparate tables is event_id.
Fig ur e 10
Figure 11 shows the resulting visualization of the fused dataset with the attributes provided from the (MySQL) lookup table while the metrics come from the (Cloudera Impala) fact table. Any additional group-by attributes that are available from the lookup table is displayed as well.
Star Schema Table
A star schema table extends the capability of the lookup description use case to dimensional tables that contain attribute descriptions and higher level group-by attributes than on fact table. The fact table can be joined to several dimensional tables using common ID keys (limited to one key per join). This allows you to visualize the fused dataset using group-bys and filtering on dimensional attributes sourced from the dimension table in conjunction with metrics from the fact table.
For example, figure 12 illustrates a star schema table. The Sales table is the fused dataset resulting from the following disparate data sources: Seller details from S3, Event information from MySQL, Dates from Cloudera Impala, and Listing details from MongoDB.
Figure 13 shows an example visualization of a star schema configuration with one group-by using an attribute from a fact table and additional attributes are sourced from dimension tables linked to the fact table. All available attributes in the fused dataset are displayed in the Attribute list.
Zoomdata also supports the capability to join different fact tables (or metrics) together. Similar to the lookup description use case, multiple fact tables are joined using a common key. As a result, metrics from different tables can be displayed on the same visualization with common keys as the group-bys. For example, figure 14 shows two disparate fact tables being joined under the common attribute - Seller_ID.
Figure 15 shows an example visualization of attributes and metrics from two different datasets - one from Cloudera Impala and the second from an S3 file. A Bars: Multiple Metrics visual shows the quantity and commission for tickets sold (from Cloudera Impala dataset) grouped by city (from S3 file).
Also shown in figure 15, custom calculations using metrics from different datasets can be created.
You also have the flexibility to fuse data using a combination of these use cases. For example, you can create a lookup to fact to lookup join (as shown in Figure 16) to explore the sellers, their ticket sales and event information all on one visualization. In this example, seller information is located in Amazon Redshift, Sales data is housed in Cloudera Impala, and Event details are stored in MySQL. The common keys connecting these datasets are UserID, SellerID and EventID.
Data Fusion Filtering Capabilities
Fused data can also be filtered in visualizations and dashboards. Filtering fused datasets is possible in the following scenarios:
- Different visualizations contain metrics from different fact tables but have common attributes.
- Filtering on a common attribute in one data fusion visualization can be applied to other visualizations in the dashboard.
Figure 17 shows a dashboard demonstrating filtering using one attribute from one dataset against a second attribute that is sourced from a second table.
The other limitation to note concerns account management controls available to Zoomdata Administrators. For Groups, the forced filters functionality does not apply to a fused data source. We are working on providing this capability in an upcoming release, so stay tuned for updates.