Teradata’s Cloud-Native Database

I wrote this post the same week Teradata announced VantageCloud Lake’s release. Since then, I have learned more about Lake and written other posts that better explain its architecture and current capabilities. For a summary of the critical VantageCloud Lake elements and the basis for using them, please check the infographic in VantageCloud Lake in a Nutshell instead. You may also want to read the posts named in the update notes at the end of this article.

Teradata announced this week that they were making Teradata VantageCloud Lake available. It is a long name, so let’s call it Lake.

First, Teradata Lake is a cloud-native data lakehouse, i.e., a central repository of structured, unstructured, textual, and analogue/IoT data with an analytical infrastructure that allows us to read and understand the data in the lakehouse. This last bit, the analytical infrastructure, differentiates a data lakehouse from a data lake. If you want to know more about architecting data lakehouses, read Bill Inmon’s book “Building the Data Lakehouse”. Incidentally, Bill Inmon is the father of the Data Warehouse.

Additionally, Teradata Lake allows applying advanced analytics to the data it manages, which broadens the use cases its design supports.

Moreover, Teradata offers the same analytics capabilities in both the existing Vantage offering in the cloud and the new cloud-native database. The main difference is that the new Lake is re-engineered from the ground up to make better use of the Cloud’s features. Consequently, it opens the door to new, more agile, cloud-oriented features, such as a new approach to Disaster Recovery.

For now, the Lake is available on AWS, but Teradata plans to roll it out on the other leading Cloud platforms.

In this post, I discuss the Lake’s new architecture and some of its features.

Teradata Lake’s Architecture

The diagram below shows the Lake’s high-level architecture.

Teradata VantageCloud Lake - Architecture

The Lake database must have a Primary Cluster. The nodes in the Primary Cluster use space in the Block File System (BFS), which has to be on Persistent Block Storage (in the case of AWS, EBS disks).

The Lake stores some internal data in BFS. You may also create your tables in BFS. In addition, the Primary Cluster can access the Object File System (OFS), a Teradata-managed, proprietary file system that resides in Native Object Storage (S3 for AWS).

Since both BFS and OFS are managed file systems, it is transparent to users whether they are accessing BFS or OFS. Users query the tables in both file systems with SQL. Furthermore, tables are ACID-compliant whether they reside in BFS or in OFS.

In my opinion, OFS is the more desirable place to store data in the Cloud because:

  1. It is more cost-effective, and
  2. It offers the Time Travel feature (not available in BFS).

Speaking of which, the Time Travel feature allows access to the data in a table as it looked at a previous time.
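
To make the idea concrete, here is a minimal Python sketch of Time Travel as a read against past snapshots. It is a toy model under my own assumptions, not how Lake implements the feature: the class VersionedTable, its methods commit and as_of, and the integer timestamps are hypothetical names chosen only for illustration.

```python
from bisect import bisect_right
from dataclasses import dataclass, field


@dataclass
class VersionedTable:
    """Toy table that keeps every committed snapshot (hypothetical, not a Lake API)."""
    _times: list = field(default_factory=list)      # commit timestamps, ascending
    _snapshots: list = field(default_factory=list)  # full row list for each commit

    def commit(self, ts, rows):
        # Timestamps must strictly increase so the binary search below stays valid.
        if self._times and ts <= self._times[-1]:
            raise ValueError("commit timestamps must be strictly increasing")
        self._times.append(ts)
        self._snapshots.append(list(rows))

    def as_of(self, ts):
        # Return the rows of the last snapshot committed at or before ts.
        i = bisect_right(self._times, ts)
        if i == 0:
            return []  # nothing had been committed yet at that time
        return list(self._snapshots[i - 1])


table = VersionedTable()
table.commit(10, [("a", 1)])
table.commit(20, [("a", 1), ("b", 2)])
table.commit(30, [("b", 2)])  # row "a" was deleted in this commit

rows_at_25 = table.as_of(25)  # the table as it looked between commits 20 and 30
```

Reading as of time 25 still shows row "a", even though a later commit removed it. That is the behaviour the feature describes: the query sees the table as it looked at the chosen time.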

Apart from the Primary Cluster, a user has the option to create:

  1. Compute Clusters (in grey in my diagram), and
  2. Compute Groups, which group Compute Clusters and let the user adjust the number of nodes and clusters for different timeframes (e.g., to use less power at night).

None of the above has local storage, but all can access the data in OFS.
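
The timeframe-based sizing of a Compute Group can be pictured with a small Python sketch. Everything in it is an assumption for illustration: the Window class, the capacity_at function, and the cluster and node counts are hypothetical and do not correspond to Lake’s console settings or API.

```python
from dataclasses import dataclass
from datetime import time


@dataclass(frozen=True)
class Window:
    """Hypothetical sizing rule: how large the group is during a daily window."""
    start: time
    end: time
    clusters: int
    nodes_per_cluster: int

    def covers(self, t):
        if self.start <= self.end:
            return self.start <= t < self.end
        # The window wraps past midnight (e.g., 20:00 to 08:00).
        return t >= self.start or t < self.end


def capacity_at(windows, t):
    """Total nodes the group runs at time t; 0 if no window applies."""
    for w in windows:
        if w.covers(t):
            return w.clusters * w.nodes_per_cluster
    return 0


# Invented example: more power during the working day, less at night.
reporting_group = [
    Window(time(8), time(20), clusters=2, nodes_per_cluster=4),
    Window(time(20), time(8), clusters=1, nodes_per_cluster=2),
]

day_nodes = capacity_at(reporting_group, time(12))
night_nodes = capacity_at(reporting_group, time(23))
```

With these invented numbers the group runs eight nodes at noon and two at 23:00, which is the kind of adjustment the Compute Group option is meant to allow.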

How do all Nodes work?

The nodes in the Primary Cluster handle the connection, create the execution plan, and execute tactical queries.

Regarding the Compute Clusters, their function is to isolate the workload for specific areas (e.g., you can create a cluster for a particular department or application), so you can bound its consumption and manage its expenses.

On a separate note, since nodes are detached from the storage, you can add or remove Compute Clusters quickly and without data redistribution.
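
A toy comparison shows why detaching compute from storage avoids redistribution. The sketch assumes a coupled design that places each row on a node by hashing its key modulo the node count; the functions owner and rows_moved_when_resizing are hypothetical helpers, and nothing here models Lake’s internals.

```python
import zlib


def owner(key, node_count):
    # Deterministic placement: hash the key, then take it modulo the node count.
    return zlib.crc32(key.encode("utf-8")) % node_count


def rows_moved_when_resizing(keys, old_nodes, new_nodes):
    # In a coupled design, a row must move whenever its owning node changes.
    return sum(1 for k in keys if owner(k, old_nodes) != owner(k, new_nodes))


keys = [f"row-{i}" for i in range(1000)]

# Coupled storage: growing from 4 to 5 nodes relocates every row whose owner changed.
moved_coupled = rows_moved_when_resizing(keys, 4, 5)

# Shared object storage: compute nodes own no rows, so resizing relocates nothing.
moved_decoupled = 0
```

In the coupled case, a large share of the rows changes owner when a single node is added. When the data stays in shared storage, no rows have to move at all.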

How do users connect to Lake?

Despite having specialized clusters and storage, the architecture is transparent for the users. They connect their applications to the Primary Cluster, which orchestrates the work.

Furthermore, a Self-Service Console simplifies many management tasks, such as security, monitoring, and launching resources.

The Self-Service Console also allows users to configure QueryGrid easily. QueryGrid then becomes the fabric that links several Vantage databases across different environments.

Ingestion in Teradata Lake

The diagram below shows the options that were available when Lake was released to ingest data into the Lake and place it in the right file system. Read the post Considerations To Load Data Into VantageCloud Lake for up-to-date recommendations.

Teradata VantageCloud Lake - Data Ingestion

Airflow, Kafka, and AWS Kinesis are examples of tools you can use to load data into the Lake.

Knowing More about Teradata Lake

An excellent Orange Book discusses the Lake architecture, and another one provides practical guidelines for getting started. Remember that Teradata only makes Orange Books available to customers and employees.

Additionally, you can consult the Lake documentation to get started, take a visual tour of the console, review the types of nodes, or check anything while you work with Lake.


This article was amended on 2 January 2023 to replace Teradata VantageCloud Lake Edition with Teradata VantageCloud Lake, as Teradata renamed their database.

This article was updated again on 27 March 2023 to include the term Native Disk Storage (NDS) as it is currently preferred over TDFS, even though you may find both in Teradata documentation. Additionally, I added a reference to the “Using VantageCloud™ Lake Architecture – A Practical Guide” Orange Book.

I updated this post again on 6 November 2023 to include a note at the beginning explaining that it is dated, and it is better to read the post VantageCloud Lake Architecture.

I edited this post on 10 November 2023 to include a link to the post Considerations To Load Data Into VantageCloud Lake.

This post was updated on 16 November 2023 to include a link to the post VantageCloud Lake: Autoscaling the Compute Clusters.

I updated this post on 23 November 2023 to add a link to the post Time Travel in VantageCloud Lake.

I updated this post again on 30 November 2023 to include a link to the post Session Manager in Lake: The Key to High Availability.

I included a link to the post The Path of a Query in VantageCloud Lake on 5 December 2023.

This post was again updated with the link to the post VantageCloud Lake in a Nutshell on 9 December 2023.

This article was amended on 16 April 2024 to add the links to the posts about network configuration for VantageCloud Lake on AWS and Azure.

I updated this article on 25 July 2024 to add the link to the post about network configuration for VantageCloud Lake on GCP.

On 29 July 2024, I added the link to the post Compute Clusters in VantageCloud Lake.

I modified this post on 30 July 2024 to include links to the posts Impact of Scaling in VantageCloud Lake and Open Table Format in VantageCloud Lake.
