Implementing Data Source Creation

Overview

When a user requests to create a data source using a connection to a data store, the connector must be able to complete the following tasks.

  1. Validate the data store
  2. Describe the data store and connector features
  3. Describe schemas, collections, and fields

Validating the Data Store

When a user creates a data source using a connection to a data store, Zoomdata sends a ValidateSourceRequest to the connector server to ensure that the connection is valid before allowing the user to proceed. The request contains RequestInfo, which in turn contains DataSourceInfo. DataSourceInfo holds the parameters that the user provided for connecting to the data store. The connector server uses these parameters to connect to the data store and validate the connection.

For our example MyDB data store, which uses a driver and simple security requiring only a username and password, the steps might be as follows.

  1. Create a connection to the MyDB data store using the connection string, username, and password provided.
  2. Run a simple query or command to validate the connection. This test should not be run against a specific schema or collection. For example, a simple validation SQL query might be select count(1). Note that there is no specific collection or schema information in the query. The query should work on any valid data store for that connector.
    If there is a particular permission required to perform some function for Zoomdata, such as performing aggregations, the validation query or command should also test those permissions.
    For example, many HDFS stores may allow a user to connect but not to run MapReduce or Tez jobs, which are required to perform aggregations for Zoomdata. In a Zoomdata Hive on Tez connector, the validation query must also trigger a Tez job to ensure that the connecting user has the appropriate permissions.
  3. Return a success message if the query or command ran without problems.
    In response to the ValidateSourceRequest, the connector server sends a ValidateSourceResponse, which should be successful only if the user has permissions to execute queries in the provided data store.
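
The three steps above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the SDK's API: the dataclasses are hypothetical stand-ins that mirror the nesting described earlier (request, RequestInfo, DataSourceInfo, parameters), the parameter names connection_string, username, and password are invented for the example, and SQLite stands in for the MyDB driver.

```python
import sqlite3
from dataclasses import dataclass, field
from typing import Any, Callable, Dict


# Hypothetical stand-ins for the SDK's request/response types. Only the nesting
# (request -> RequestInfo -> DataSourceInfo -> parameters) comes from the text.
@dataclass
class DataSourceInfo:
    params: Dict[str, str] = field(default_factory=dict)


@dataclass
class RequestInfo:
    data_source_info: DataSourceInfo


@dataclass
class ValidateSourceRequest:
    request_info: RequestInfo


@dataclass
class ValidateSourceResponse:
    success: bool
    message: str = ""


def validate_source(
    request: ValidateSourceRequest,
    connect: Callable[[Dict[str, str]], Any],
    validation_query: str = "select count(1)",
) -> ValidateSourceResponse:
    """Connect with the user's parameters, run one schema-free query, report the result."""
    params = request.request_info.data_source_info.params

    # Step 1: open a connection using the user-supplied parameters.
    try:
        conn = connect(params)
    except Exception as exc:
        return ValidateSourceResponse(False, f"connection failed: {exc}")

    # Step 2: run a query that names no schema or collection.
    try:
        cursor = conn.cursor()
        cursor.execute(validation_query)
        cursor.fetchall()
    except Exception as exc:
        return ValidateSourceResponse(False, f"validation query failed: {exc}")
    finally:
        conn.close()

    # Step 3: report success only when the query ran without problems.
    return ValidateSourceResponse(True, "connection is valid")


def sqlite_connect(params: Dict[str, str]):
    """Stand-in for a MyDB driver; SQLite ignores the username and password."""
    return sqlite3.connect(params["connection_string"])
```

Passing the connection factory and the validation query as arguments keeps the sketch independent of any one driver. A connector that must also prove aggregation permissions, as in the Hive on Tez case, could supply a heavier validation query in the same place.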

Describing Data Store and Connector Features

After validating a source, the next preliminary step a connector must undertake is to describe itself and its data store to Zoomdata. The description is provided by responding to a ServerInfoRequest. The response, ServerInfoResponse, is a series of string key/value pairs that indicate how your connector server communicates, the features that it supports, and any specific limitations it may have. The connector server should construct this list of keys based on the list of connector info keys included with this guide. In many cases the key values are static and predetermined, so creating the list may not need any communication with the data store. Some keys are required. The sample CrateDB connector provided with the SDK includes such a hard-coded set of keys.

Our hypothetical MyDB connector is a brand new connector to a simple data store, so it only supports a few capabilities and might return a list of the following features.

  • REQUEST.SEND_METADATA (required by Zoomdata)
  • REQUEST.TYPE (required by Zoomdata)
  • FEATURE.DISTINCT_COUNT
  • FEATURE.GROUP_BY_TIME
  • FEATURE.GROUP_BY_TIME.GROUP_BY_UNIX_TIME
  • FEATURE.LIVE_SOURCE
  • FEATURE.MULTI_GROUP_SUPPORT

In future versions of our hypothetical MyDB connector, we may implement more of the features of the MyDB data store.
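
As a rough illustration, a hard-coded set of keys like the one above might be assembled as a plain string-to-string map. The key names come from the list above; everything else is an assumption made for this sketch: the helper build_server_info is hypothetical, supported features are reported as the string "true", and the values for the two required REQUEST keys are left to the caller, because the valid values are defined in the connector info keys reference rather than here.

```python
from typing import Dict, Iterable, Mapping

# Keys the list above marks as required by Zoomdata.
REQUIRED_KEYS = ("REQUEST.SEND_METADATA", "REQUEST.TYPE")

# Feature keys from the hypothetical MyDB connector's list.
MYDB_FEATURES = (
    "FEATURE.DISTINCT_COUNT",
    "FEATURE.GROUP_BY_TIME",
    "FEATURE.GROUP_BY_TIME.GROUP_BY_UNIX_TIME",
    "FEATURE.LIVE_SOURCE",
    "FEATURE.MULTI_GROUP_SUPPORT",
)


def build_server_info(
    request_settings: Mapping[str, str], features: Iterable[str]
) -> Dict[str, str]:
    """Build the string key/value pairs for a server info response."""
    # Every value is stored as a string, matching the "string key/value pairs" format.
    info = {key: str(value) for key, value in request_settings.items()}
    for feature in features:
        # Assumption: a supported feature is reported with the value "true".
        info[feature] = "true"
    # Refuse to build a response that lacks a required key.
    missing = [key for key in REQUIRED_KEYS if key not in info]
    if missing:
        raise ValueError("missing required keys: " + ", ".join(missing))
    return info
```

Because the values are static, a map like this can be built once when the connector server starts and returned for every ServerInfoRequest without contacting the data store.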

Describing Schemas, Collections, and Fields

After the connector describes its features and limitations to Zoomdata, Zoomdata expects the connector to identify the schemas, collections, and fields in its metadata. Zoomdata requests this information using a series of calls: MetaSchemasRequest, MetaCollectionsRequest, and MetaDescribeRequest. Your connector must respond with the corresponding MetaSchemasResponse, MetaCollectionsResponse, and MetaDescribeResponse.
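
One way to picture the three calls is as successive lookups that narrow from schemas to collections to fields. The sketch below runs them against an in-memory catalog and applies the rules detailed in the list that follows: system schemas are hidden, inaccessible schemas are excluded, and schema prefixes are stripped from collection names. The catalog contents, the function names, and the per-schema access model are all hypothetical. A real connector would query the data store and use the SDK's request and response types.

```python
from typing import Dict, List, Set

# Hypothetical catalog: schema name -> stored collection name -> {field: native type}.
CATALOG: Dict[str, Dict[str, Dict[str, str]]] = {
    "hr": {"hr.employees": {"key_field": "long", "last_name": "varchar(200)"}},
    "system": {"system.tables": {"name": "varchar(200)"}},
}

# Assumption: schemas that hold system metadata and are never shown to users.
SYSTEM_SCHEMAS = {"system"}


def meta_schemas(allowed_schemas: Set[str]) -> List[str]:
    """Schemas visible to the querying user, minus system schemas."""
    return sorted(
        name for name in CATALOG
        if name in allowed_schemas and name not in SYSTEM_SCHEMAS
    )


def meta_collections(schema: str, allowed_schemas: Set[str]) -> List[str]:
    """Collection names in one schema, with any schema prefix stripped."""
    if schema not in meta_schemas(allowed_schemas):
        return []
    prefix = schema + "."
    return sorted(
        name[len(prefix):] if name.startswith(prefix) else name
        for name in CATALOG[schema]
    )


def meta_describe(schema: str, collection: str, allowed_schemas: Set[str]) -> Dict[str, str]:
    """Field names and native types for one collection."""
    if collection not in meta_collections(schema, allowed_schemas):
        raise KeyError(f"unknown or inaccessible collection: {schema}.{collection}")
    stored = CATALOG[schema]
    qualified = f"{schema}.{collection}"
    # Collections may be stored with or without the schema prefix.
    return dict(stored[qualified] if qualified in stored else stored[collection])
```

Here each later call reuses the access check of the earlier one, so a collection in a hidden schema cannot be described even when its name is guessed.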

  • The MetaSchemasResponse details the schemas or schema-like objects that the data store uses to group collections, if it does so.
    • If the data store does not use schemas or schema-like objects such as catalogs or namespaces, FEATURE.SUPPORT_SCHEMA should be set to false in the ServerInfoResponse that the connector sends to Zoomdata. In this case, the connector should return an error to any MetaSchemasRequest that it receives.
    • The connector is responsible for excluding from its response any schemas to which the querying user should not have access, including any system schemas such as system metadata not normally intended for users.
  • The MetaCollectionsResponse details the collections or collection-like objects that the data store uses to group data.
    • Collections should be returned as a list of strings representing the collection names.
    • The connector should remove any schema prefix attached to collection names before sending the list of collection names to Zoomdata.
    • This response provides a list of collections to Zoomdata for use during source creation. If the connector supports schemas, the MetaCollectionsRequest will include the name of the schema to be queried. Only collections found within that schema should be returned in the response.
    • The connector is responsible for excluding from its response any collections to which the querying user should not have access.
  • The MetaDescribeResponse details a list of fields with their associated metadata for a given collection. If the data store uses schemas or schema-like objects, the schema name is provided in the request as well.
    • The connector is responsible for retrieving the list of fields, mapping them to Zoomdata Thrift field types, and setting their metadata with any additional applicable information. See the full metadata reference for more details.
    • It is the responsibility of the connector to assess how fields map to Zoomdata Thrift types and what additional flags may be added to their parameters.
    Take for example our hypothetical MyDB, which includes a hypothetical collection using the following fields and field types.
    Field name            MyDB field type  Mapped Zoomdata type  Notes
    --------------------  ---------------  --------------------  -----
    key_field             long             integer               Indexed
    salary                float            double
    last_name             varchar(200)     string
    date_hired_indexed    timestamp        date                  Considered indexed, should be marked PLAYABLE
    date_hired_unix_time  bigint           integer
    complex_object        mydb_object      unknown               Unknown types are designated as unknown and treated as RAW_DATA_ONLY.
    Note the following about the example:
    • Most of the types map directly to Zoomdata’s Thrift types.
    • The complex_object of type mydb_object is a user defined type that Zoomdata cannot use, so it is mapped as unknown. Fields of unknown type are listed by Zoomdata as RAW_DATA_ONLY, meaning they can’t be queried, filtered, grouped, etc.
    • There are two indexed fields. One, key_field, is a MyDB primary key. The other, date_hired_indexed, is an indexed timestamp. Because MyDB indexes are considered fast and can be filtered quickly, the indexed timestamp field should have the PLAYABLE flag added to its parameters to indicate that the field can be used to enable playback.
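
The mapping in the example can be sketched as a small lookup with a fallback to unknown. This is a minimal illustration, assuming type and flag names are represented as plain strings rather than the SDK's Thrift types. FieldDescription and describe_field are hypothetical names, and the sketch records RAW_DATA_ONLY itself only to mirror how Zoomdata treats unknown fields.

```python
import re
from dataclasses import dataclass, field
from typing import Set

# Native MyDB type -> Zoomdata type, taken from the example table.
TYPE_MAP = {
    "long": "integer",
    "bigint": "integer",
    "float": "double",
    "varchar": "string",
    "timestamp": "date",
}


@dataclass
class FieldDescription:
    name: str
    zoomdata_type: str
    flags: Set[str] = field(default_factory=set)


def describe_field(name: str, native_type: str, indexed: bool = False) -> FieldDescription:
    """Map one MyDB field to a Zoomdata type plus any extra flags."""
    # Drop a trailing length or precision such as "(200)" before the lookup.
    base_type = re.sub(r"\(.*\)$", "", native_type.strip().lower())
    zoomdata_type = TYPE_MAP.get(base_type, "unknown")

    flags: Set[str] = set()
    if zoomdata_type == "unknown":
        # Unknown types cannot be queried, filtered, or grouped.
        flags.add("RAW_DATA_ONLY")
    elif zoomdata_type == "date" and indexed:
        # An indexed timestamp filters quickly enough to drive playback.
        flags.add("PLAYABLE")
    return FieldDescription(name, zoomdata_type, flags)
```

Under these assumptions, describe_field("date_hired_indexed", "timestamp", indexed=True) yields a date field flagged PLAYABLE, while the indexed key_field stays an unflagged integer because PLAYABLE applies only to time fields in this sketch.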

Next Steps

After you have implemented functionality for adding the connector and creating a source on it, you must implement functionality to respond to requests for data from the data store. For more information about responding to requests for data, see Responding to Requests for Data.