In this post, I explain the Open Table Format (OTF) support in VantageCloud Lake and AI Unlimited and the main considerations for using data stored in OTFs in your analytical ecosystem.
What the Open Table Format is
Open Table Formats (OTFs) are open-source, standard table formats that provide a layer of abstraction on top of the files in a data lake. They make it easier to store and analyze data quickly and efficiently, and they are accessible and interoperable across a wide range of data processing and analytics tools.
The OTF pioneers include Netflix with Apache Iceberg, Databricks with Delta Lake, and Uber with Apache Hudi.
Open Table Format and VantageCloud Lake
The Open Table Formats provide open and connected interoperability for cloud analytics and AI use cases. Consequently, your workloads on VantageCloud Lake and AI Unlimited benefit from accessing the OTFs in your organization, reducing or eliminating data silos.
Additionally, open catalogues provide direct access to the data and eliminate vendor lock-in.
To better leverage OTFs, Teradata supports multi-cloud reads/writes, cross-catalogue reads/writes, and cross-Lake reads/writes for OTF. That is, the OTF datasets (Parquet files and catalogue) can be stored on AWS S3 or Azure ADLS Gen2, and the Lake instance can run on AWS or Azure. From an AWS Lake instance, a user can access data stored in OTF in S3 or ADLS Gen2, and vice versa. Furthermore, you can join an Iceberg table in S3, registered in a Glue or Hive catalogue, with an Iceberg or a Delta table in Azure ADLS Gen2 registered in a Unity catalogue.
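As an illustration, the sketch below joins an Iceberg table referenced through a Glue-backed DATALAKE object with a Delta table referenced through a Unity-backed DATALAKE object, using the three-level dot notation described later in this post. The DATALAKE, database, table, and column names are hypothetical.

-- Hypothetical DATALAKE objects: glue_dl (Iceberg on S3) and unity_dl (Delta on ADLS Gen2)
SELECT s.customer_id, s.total_amount, c.segment
FROM glue_dl.sales_db.orders s
JOIN unity_dl.crm_db.customers c
  ON s.customer_id = c.customer_id;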
Check with Teradata which Open Table Formats they support, along with the supported catalogues, the operations supported on each catalogue, the file formats for WRITE operations, the supported object stores, compression formats, and OTF versions.
When to use Iceberg vs Delta Lake
Delta Lake
Delta Lake is beneficial in the following scenarios:
- When you need to enhance the current Data Lake to support reading Delta Lake tables in VantageCloud Lake.
- When you need connection and authentication mechanisms to connect to Databricks / Unity Catalog and the object store.
- When you require the ability to query the catalogue to retrieve metadata information and support SELECTs with projection lists and predicate pushdown.
- If you need support for all Delta Lake data types and Time Travel queries.
- When you want to create a Delta Lake table with Uniform Iceberg support, enabling the creation of Iceberg metadata and snapshots.
Iceberg
The key points to consider when to use Apache Iceberg are:
- Large Datasets: Ideal for managing vast data sets efficiently.
- Slowly Changing Data: Suited for scenarios where data changes infrequently.
- Fast Read Performance: Provides fast read performance for large datasets.
- Historical Data Retention: Allows querying historical data without deletion.
- Schema Evolution: Supports changing table structure without affecting old snapshots.
Note that Iceberg is not suitable for high-frequency transactional workloads.
Separately, Iceberg offers time travel, improved performance, and open standards. Furthermore, the Iceberg API is available in Java and Python.
Choosing a Catalogue Type
The catalogues that Teradata supports are:
- AWS Glue — If you are already using AWS, Glue may be a good choice.
- Apache Hive — However, if your priority is adhering to open standards, you should use Apache Hive.
- Unity Catalog — If you plan to use both Iceberg and Delta Lake and multiple clouds such as AWS and Azure, Unity Catalog is a good choice.
Relational Properties of the Data Stored in Open Table Formats
Delta Lake Objects
Data stored in Delta Lake objects behaves like any other table or view: you have the SELECT right on it, and you can access it via SQL from either the Primary Cluster or a Compute Cluster. You can join a normal table (in BFS, OFS or NOS) with data in a Delta Lake object.
Iceberg Objects
An Iceberg table is a collection of Parquet and/or Avro files with metadata stored in a Hive database. These files are not “relational”, as they allow concurrent writes of objects via snapshots.
Note that the first requirement of “relational” is the unambiguous identification of a single row:
- Rule 1 — The information rule: All information in a relational database is represented explicitly at the logical level and in exactly one way — by table values.
- Rule 2 — The guaranteed access rule: Each datum (atomic value) in a relational database is guaranteed to be logically accessible by combining table name, primary key value, and column name.
Thus, Iceberg data is not relational, since Iceberg does not let you identify a single row unambiguously.
How to Access Data Stored in OTF from Lake
Installation and Set Up
The OTF read and write capabilities are enabled within Lake; no particular installation or set-up is required for the feature.
Separately, the user environment should consist of an Iceberg or Delta Lake data lake with the proper credentials and access to the catalogue and object store.
DATALAKE Object
The DATALAKE object encapsulates all the information needed to connect to an OTF data lake, including the connection details and the Authorization information required to access the catalogue and the object store.
The CREATE DATALAKE statement creates a DATALAKE object, and all DATALAKEs are created in the TD_SERVER_DB database. Note that the Authorization information for connecting to the catalogue and object store is specified in a dedicated clause of the CREATE DATALAKE statement.
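A minimal sketch of what such a statement can look like is shown below. The DATALAKE name, the Authorization object names, the USING options, and the clause wording are hypothetical placeholders; take the exact syntax from the CREATE DATALAKE documentation for your release.

-- Hypothetical sketch: an Iceberg DATALAKE backed by AWS Glue and S3
CREATE DATALAKE glue_dl
EXTERNAL SECURITY CATALOG user1.catalog_auth,
EXTERNAL SECURITY STORAGE user1.storage_auth
USING
  catalog_type ('glue')
  catalog_location ('us-west-2')
TABLE FORMAT iceberg;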
Authorizations in the DATALAKE Object
Teradata Authorization objects store the credentials for the Catalog and Object Store accounts and control access to a DATALAKE object. The online documentation provides several examples of creating Authorization and DATALAKE objects.
You can use GRANT and REVOKE statements to issue or revoke EXECUTE privileges on the Authorization objects.
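For example, with a hypothetical Authorization object user1.otf_auth and a hypothetical user analyst_user, the statements follow the usual Teradata GRANT/REVOKE pattern (verify the exact form in the documentation):

-- Hypothetical names; grant and revoke the EXECUTE privilege on an Authorization object
GRANT EXECUTE ON user1.otf_auth TO analyst_user;
REVOKE EXECUTE ON user1.otf_auth FROM analyst_user;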
Teradata supports simplified Authorization objects, which do not require users to define security constraints (such as INVOKER or DEFINER).
The Storage and Catalog credentials can be the same:
-- Create a simplified authorization
CREATE AUTHORIZATION user1.iceberg_simplified_auth
USER '<user name>'
PASSWORD '<password>';
The credentials for accessing the Catalog and the Object Store can also differ. In this case, you should define two separate Teradata Authorization objects.
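A sketch with two separate objects, following the same pattern as the simplified authorization above (the object names and placeholder credentials are hypothetical):

-- Hypothetical Authorization object for the catalogue account
CREATE AUTHORIZATION user1.catalog_auth
USER '<catalog user name>'
PASSWORD '<catalog password>';

-- Hypothetical Authorization object for the object store account
CREATE AUTHORIZATION user1.storage_auth
USER '<storage user name>'
PASSWORD '<storage password>';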
Check with Teradata which catalogue and storage credentials you should use to authenticate access to the OTF objects.
Querying an OTF Table
Queries that access data stored in Open Table Formats use the 3-level dot notation to reference the data:
<datalake_name.database_name.table_name>
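For example, with a hypothetical DATALAKE object glue_dl, a database sales_db and a table orders:

SELECT order_id, order_date, total_amount
FROM glue_dl.sales_db.orders
WHERE order_date >= DATE '2024-01-01';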
Table Metadata Queries
Invoking system functions allows you to retrieve table metadata, such as table history, snapshots, manifests and partition information.
Time Travel Queries
With Time Travel, you can query a table at a particular point in time based on a timestamp or snapshot ID.
Time Travel can show how the data has changed over time, including what rows were inserted, deleted, or updated in the table and any changes to the table schema.
Teradata supports Time Travel queries based on a snapshot ID or a timestamp/date literal. If a timestamp or date is specified, the query uses the snapshot that was current AT or BEFORE that point in time. Note that you can get the snapshot IDs for a table using the TD_SNAPSHOTS() function.
JOINs
As long as there is an entry in DBC.TVM for the tables and in DBC.TVFields for the columns, JOINs between tables can be performed. This is true regardless of where the table is located (BFS, OFS, data stored in OTFs, or NOS data).
If the JOIN involves a foreign table (OTF, NOS), the applicable pushdown operations are performed, depending on the foreign table’s location and the underlying technology, and the qualifying rows are written to Spool. Once the data is in Spool, Lake performs the JOIN between the Spool files.
So you could JOIN an Iceberg table with an external or internal table or a Delta Lake table with an external table.
Bear in mind that if you SELECT from an external table and there is no partition pruning, all its data is written into Spool, no matter how large the external table is. Thus, when you JOIN an OTF table with any other table, Lake retrieves the data from the OTF table, stores it in a Spool file, and then JOINs the tables.
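As an illustration, the following sketch joins an OTF table with a native Lake table. The names are hypothetical, and the filter on the OTF table helps limit the rows that are pushed down and spooled:

-- Hypothetical join between an Iceberg table (via a DATALAKE object) and a native table
SELECT o.order_id, o.total_amount, p.product_name
FROM glue_dl.sales_db.orders o
JOIN mydb.products p
  ON o.product_id = p.product_id
WHERE o.order_date >= DATE '2024-01-01';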
Technical Documentation
Further details are in Apache Iceberg Open Table Format on VantageCloud Lake – Getting Started. This document describes the steps to set up, query, and update Iceberg tables using VantageCloud Lake.
For Delta Lake tables, the following documents are available: Using Delta Lake Manifest Files and CREATE FOREIGN TABLE — Example: Creating Delta Lake Table.
Encryption
TLS 1.2 encryption (HTTPS protocol)
All data is transmitted between the VantageCloud Lake platform and the external object storage or catalogue using TLS 1.2 encryption (HTTPS), regardless of whether it is encrypted at rest in external object storage.
Only AWS/Azure-managed encryption keys are supported
As of July 2024, Teradata does not support customer-managed encryption keys (CMEK) when accessing catalogues and object stores.
Recommendations for Network Configuration
- Set up Private Links to enable secure traffic between the data in your VantageCloud Lake instance and your environment.
- All outgoing network connections in Lake go through Valtix (verification and system validation) by default. This becomes a performance bottleneck when reading large objects from ADLS2 in parallel from all AMPs. To avoid this path and improve performance, the Teradata Product team recommends adding a virtual network rule at the storage-account level that links the Lake subnet, so that Lake and your Storage account are on the same network. To add a virtual network endpoint on your ADLS2 Storage account, you need the VNET/subnet information of the Lake tenant and run the following command in the Azure CLI: az storage account network-rule add --resource-group "your-resourcegroup" --account-name "your-storageaccount" --subnet $Lake_subnetid
- You will avoid egress costs if your VantageCloud Lake tenants are in the same region as your storage accounts.
- For Lake to access the catalogue, you must add non-well-known URIs, such as the Hive metastore, to the Valtix whitelist. Other URIs, such as AWS Glue, AWS S3, Azure ADLS2 and Databricks, are well-known and already included in the whitelist.
As an additional security consideration for network traffic when accessing OTFs from VantageCloud Lake, Teradata secures the API calls to NOS Buckets and OTFs, both in object storage. The API calls run within the Cloud Service Provider backbone when the bucket is in the same region as the Lake account. Only the API calls in Google Cloud are protected when the bucket is in another region.
The Network Configuration posts for AWS, Azure, and GCP provide details on connecting to VantageCloud Lake.
Performance
Teradata recommends using Compute Clusters to run queries on data stored in OTFs, for workload isolation and autoscaling, even though you can execute queries on both the Primary Cluster and Compute Clusters. You don’t want to overload the Primary Cluster.
Furthermore, queries in the Primary Cluster run at medium priority (Workload Management configuration).
On a separate note, if you need cross-cloud access to an OTF table, i.e. to read an OTF table hosted in a bucket in a different Cloud Service Provider from the one hosting your Lake instance, the NOS and OTF requests are currently routed through the Valtix-controlled egress/ingress, which may represent a bottleneck.