Coral: A SQL translation, analysis, and rewrite engine for modern data lakehouses


Figure 2: View authoring and creation process

When creating views in the Dali Catalog, a deployment process associates the view definition with necessary UDFs that are required to evaluate the view using table properties. Two designated table properties, “dependencies” and “functions,” carry necessary information about the UDFs used in the view. The “dependencies” property references a list of artifact coordinates for UDF jars. Conventionally, UDF jars are self-contained, i.e., a single jar contains all dependencies for a certain UDF. The “functions” property references a map from UDF names to UDF classes. The UDF name is prepended with the view name so that if different versions of the UDF are used in nested views, the names do not collide.

Let us say we have two example tables, Member and Company. Their schemas are as follows:

Member
Id         BIGINT
Name       VARCHAR
CompanyID  BIGINT
Position   VARCHAR

Company
Id BIGINT
Name VARCHAR

For the sake of this example, let us assume that in the Member table, member positions are taken directly from the user’s input. For example, members may express a Senior Engineer position as “Senior Engineer,” “Sr Engineer,” or “Sr Eng.” A standardization UDF, standardize(), standardizes the position to one canonical form, e.g., “Senior Engineer.” Below is the view SQL definition for a view that returns the number of engineers at each level in every company.

SELECT Company.Name, STANDARDIZE(Position) AS StdPosition, COUNT(*) AS Cnt
FROM Member JOIN Company ON Member.CompanyId = Company.Id
WHERE INSTR(STANDARDIZE(Position), ‘Engineer’) > 0
GROUP BY Company.Name, STANDARDIZE(Position)

To create the above view using a Gradle project, users leverage a Dali Gradle plugin that exposes Gradle extensions to help users link UDF references in the view text, e.g., standardize(), to UDF artifacts in artifact repository management systems. For example, in their build scripts, users declare UDF dependencies as follows:

dependencies { 
 udf(‘com.linkedin.standardization:standardization-udfs:1.0.0’) { 
   functions[STANDARDIZE] = ‘com.linkedin.standardization.Standardize’ 
 } 

In the above snippet, ‘com.linkedin.standardization.Standardize’ is the class name of the class that contains the definition of the STANDARDIZE UDF, and ‘com.linkedin.standardization:standardization-udfs:1.0.0’ is the Maven coordinate for the artifact that contains that class.

Once the user’s project is checked in, a view creation pipeline uses the published artifacts and creates the view in the Dali Catalog. Information about the view UDFs is kept alongside the view definition. The following is the equivalent CREATE VIEW statement to the above user logic:

CREATE VIEW CompanyEngLevels
TBLPROPERTIES(‘functions’ = ‘STANDARDIZE:com.linkedin.standardization.Standardize’,
‘dependencies’ = ‘com.linkedin.standardization:standardization-udfs:1.0.0’)
AS SELECT Company.Name, STANDARDIZE(Position) AS StdPosition, COUNT(*) AS Cnt
FROM Member JOIN Company ON Member.CompanyId = Company.Id
WHERE INSTR(STANDARDIZE(Position), ‘Engineer’) > 0
GROUP BY Company.Name, STANDARDIZE(Position)

Coral overview

In the following sections, we describe how Coral achieves view portability by accessing views in the Dali Catalog and making those view definitions accessible in a number of other engines, such as Presto, Spark, and Pig. The discussion revolves around three main aspects:

  • View virtualization: the process of accessing the view definition and converting it to an internal representation, called Coral IR (Intermediate Representation). This step also involves inferring view-level metadata, such as field nullability, case preservation, and non-SQL standard data types, that are leverageable by some engines.
  • View translation and rewrite: the process of rewriting the Coral IR to a representation that is suitable to a target engine (e.g., Presto, Spark, Pig), so the engine can use its native APIs to query the Dali view with UDFs.
  • Integration of Coral to target engines: necessary APIs and integration points with various engines to make Coral fit into the overall compute engine architecture, and its various read APIs.

View virtualization

Dali view definitions are currently stored in the Hive Metastore. The Dali Catalog uses Coral Hive to interface with it. This Coral module has two main purposes:

  • Accessing database, table, and view information: The Hive Metastore organizes tables and views in databases. Each table or view has a schema, which describes the column name and types for that table or view. One of the responsibilities of the Coral Hive module is to access table or view information such as database names, table or view names within a database, table or view schemas, and HiveQL view definitions, as well as their UDF properties, e.g., “functions” and “dependencies.” This information is necessary to enable view parsing, validation, and conversion to intermediate representation.
  • Parsing, validation, and conversion to intermediate representation of HiveQL view definitions: The same module houses a parser (based on the HiveQL parser), a validator, and a converter to intermediate representation. The parser converts HiveQL to an Abstract Syntax Tree (AST), and the validator validates the AST and ensures its semantic correctness. The intermediate representation converter converts the AST to Coral IR, which is based on Apache Calcite’s Relational Algebra Expressions. A relational algebra expression is an expression that operates on relations in a database. It consists of standard operators such as scans, filters, projections, joins, unions, aggregates, etc.

Figure 3 depicts the view virtualization process, converting a HiveQL view definition to Coral IR.



Source link