
EU Tech Legislation and the Worldwide Data & AI Landscape

Since 2022, the European Union has been transforming its regulatory strategy for digital technologies. The goal is an EU digital market that fosters fair competition, protects consumers and data, opens new opportunities for companies and citizens, and supports the EU’s green transition towards climate neutrality by 2050.

To that end, the EU is developing a broad set of technology-related laws and proposals that will affect customers and vendors in various ways.

Clifford Chance – EU Tech Legislation to Watch in 2024 as seen on 7 October 2024.

So far, I have identified the laws and legal proposals in the sections below as directly impacting analytics use cases from a technological perspective.

EU Tech Legislation Directly Impacting the Analytical Ecosystem

Data Act

The Data Act focuses on making product data and related-service data available to the users of connected products and related services. It also connects data holders with data recipients, public sector bodies, and the relevant EU bodies and authorities.

The key elements to know from this act are:

  • Accessibility and transparency: Products and services must be designed to make data accessible to users by default. Users must also be provided with certain transparent information about data before purchasing.
  • Data portability: Product users are granted the right to request that data holders make all data generated by products available to third parties of their choice.
  • Data-sharing agreements with small and medium-sized enterprises: The Data Act protects small and medium-sized enterprises against unfair contract clauses in data-sharing agreements with more powerful market players.
  • Switching cloud services: Cloud service providers must remove obstacles restricting customers from entering into contracts with new providers and porting over data, applications, and other digital assets to the new provider.

    • In other words, the European Union’s recently enacted Data Act might eventually force all public cloud service providers operating in Europe to stop charging for egress. The Data Act requires the gradual wind-down of switching charges, including charges for data egress, within the next two years.
    • It simplifies exit plans for leaving accounts in a cloud service provider and moving to another or returning to on-prem. It also democratises multi-cloud architectures.
    • Incidentally, AWS and Google Cloud have already announced that they won’t charge egress fees when moving data to another cloud, although customers must meet specific requirements and obtain approval. Note that these announcements apply to egress costs for these two cloud service providers worldwide, not only in the EU.

  • Rules for international transfer of non-personal data: The Data Act proposes new restrictions, similar to those found in the General Data Protection Regulation (GDPR) and the Schrems rulings, applicable to international transfers of non-personal data held in the EU.
  • Exclusion for database rights: The Data Act specifies that the database rights created by the EU Database Directive do not apply to databases containing data from or generated by a connected device.

The Data Act has been in force since 11 January 2024, and most of its provisions will apply from September 2025.

Data Governance Act

The Data Governance Act (DGA) aims to improve data-sharing across sectors and EU countries, particularly by facilitating more comprehensive reuse of data held by public sector bodies. For example, it contemplates supporting data-driven innovation using health, mobility, environmental, agricultural, and public administration data. To achieve this aim, it introduces four types of measures:

  • Facilitating the reuse of public sector data not currently accessible to third parties.
  • Ensuring trust in data intermediaries.
  • Supporting individuals and businesses in making their data available for the benefit of society.
  • Facilitating data-sharing across sectors and borders and ensuring the correct data is found for the proper purpose.

The Data Governance Act has been applicable since 24 September 2023.

Digital Operational Resilience Act (DORA) – Financial sector

DORA provides the legal framework for financial institutions (banks, insurance companies, and investment firms) to adopt the latest technology (AI, cloud, blockchain, etc.) while strengthening their IT security and ensuring that the EU financial sector stays resilient in the event of severe operational disruption.

One consequence of this act is that financial institutions are focusing on Disaster Recovery solutions and redesigning existing ones. For example, they want the primary and secondary database instances in different regions of the same cloud service provider, and they are abandoning architectures where the primary and secondary systems sit in different availability zones within the same region.

DORA is in force and will apply from 17 January 2025.

Artificial Intelligence Act

The Artificial Intelligence Act is a regulation governing AI systems in the EU and across the EU’s single market. It has two key aims: to maintain trust in the AI systems used in the EU and on the EU market, and to create an ecosystem of excellence for AI in the EU. It proposes to achieve these aims by addressing the risks of specific uses of AI, categorising them into four risk levels (unacceptable, high, limited, and minimal) and regulating the systems that fall into each category accordingly.

The Artificial Intelligence Act is in force. The Commission expects to finalise the Code of Practice by April 2025, with application to follow later.

AI Liability Directive

The AI Liability Directive is a proposed legal framework for the targeted harmonisation of national civil liability rules for AI. By enabling victims of AI-related damage to obtain compensation without burdensome evidentiary hurdles, the directive aims to boost consumer confidence in interacting with emerging technologies. It achieves this by alleviating the burden of proof concerning damage caused by AI systems, establishing broader protection for victims, and fostering the AI sector by increasing guarantees. It complements the Product Liability Directive, which covers a producer’s strict liability for defective products, and the AI Act.

The AI Liability Directive has two key features:

  • Presumption of causation. It creates a rebuttable presumption of causation when certain criteria are met.
  • Preservation of evidence. Regarding high-risk AI systems, it empowers courts to order specific measures to preserve—or enable access to—evidence that could prove a causal link.

The AI Liability Directive is still at the proposal stage.

Conclusions

The EU’s digital regulatory transformation reflects a concern for ensuring that technology truly benefits the citizens and organisations operating within its territory. It tries to maintain a balance among the different stakeholders without harming any of them, especially EU citizens. Furthermore, it treats environmental matters as part of the desired outcome.

As a matter of fact, these considerations are not exclusive to the European Union, as many other territories and countries are already legislating on or debating some of these areas.

Consequently, the strategy the European Union is applying to technology legislation should not be seen as restricted to the EU. On the contrary, it reflects a worldwide tendency to identify who is involved in developing a solution and who receives its outcome in order to protect all interests, creating new scenarios that require architecting more advanced solutions.

So, changes in technological laws, such as in the EU, impact technology delivered to the market worldwide. For example, eliminating egress costs when leaving a cloud service provider benefits individuals and organisations anywhere by opening new opportunities to enhance their solutions.

Note that some legislation is not exclusive to the EU; DORA, for example, has equivalents in other countries, such as the UK and Colombia. Furthermore, European legislation inspires laws in regions and countries outside the EU, and the European Union also looks at other territories to shape its own legal texts and proposals.

So, I recommend:

  • Even if your company doesn’t need to comply with EU tech legislation, you may benefit from keeping an eye on the changes it imposes on tech vendors so you can take advantage of new features, solutions, or discounts.
  • Assess which new and proposed laws may apply to your business and what they mean for it. In this post, I only covered the pieces of legislation where I clearly see an impact on analytical ecosystems.

    • You may need, among others, to adopt or modify policies and processes or adapt your services.

  • Adopt a holistic approach to compliance and to how you introduce new solutions in your company, as it will cost you less time and money. Additionally, it will integrate better with your data architecture and strategy.

    • For example, the AI Act, the proposed Cyber Resilience Act, and the GDPR require goods and services to meet prescribed security standards.
    • As an example of the integration with your architecture and requirements, consider moving data from point A to point B. You must meet specific security requirements, such as the data being encrypted with Customer-Managed Encryption Keys (CMEK). When moving the data, you need to decrypt it with the source key to read it and encrypt it with a different key when writing it to the target system. So, you must ensure that your solution supports this and that you manage the encryption keys according to your security requirements.

  • Allocate appropriate resources.
  • Consider indirect impacts.

Finally, please comment below on any other law impacting your Data and AI architecture, or discuss my reasoning. I will benefit from your experience. Thank you.


Open Table Format in VantageCloud Lake

In this post, I aim to explain the Open Table Format (OTF) in VantageCloud Lake and AI Unlimited and the significant considerations for using data stored in OTFs in your analytical ecosystem.

What the Open Table Format is

Open Table Formats (OTF) are open-source, standard table formats that provide a layer of abstraction on top of the files in a data lake. They make it easier to store and analyse data quickly and efficiently, and they are accessible and interoperable across various data processing and analytics tools.

Open Table Formats

Mind you, the OTF pioneers include Netflix with Apache Iceberg, Databricks with Delta Lake, and Uber with Apache Hudi.

Open Table Format and VantageCloud Lake

OTF and VantageCloud Lake

The Open Table Formats provide open and connected interoperability for cloud analytics and AI use cases. Consequently, your workloads on VantageCloud Lake and AI Unlimited benefit from accessing the OTFs in your organization, reducing or eliminating data silos.

Additionally, open catalogues provide direct access to the data and eliminate vendor lock-in.

So, to better leverage OTFs, Teradata supports multi-cloud reads/writes, cross-catalogue reads/writes, and cross-Lake reads/writes for OTF. That is, the OTF datasets (Parquet files and catalogue) can be stored on AWS S3 or Azure ADLS Gen2, and the Lake instance can run on AWS or Azure. From an AWS Lake instance, a user can access data stored in OTF on S3 or ADLS Gen2, and vice versa. Furthermore, you can join an Iceberg table in S3, in a Glue or Hive catalogue, with an Iceberg or Delta table in Azure ADLS Gen2 in a Unity catalogue.

Supported OTFs in VantageCloud Lake
As of July 2024

You should check with Teradata which Open Table Formats they support, along with the catalogues, the operations supported on each catalogue, the file formats for WRITE operations, the object stores, the compression formats, and the OTF versions.

When to use Iceberg vs Delta Lake

Delta Lake

Delta Lake is beneficial in the following scenarios:

  • When you need to enhance the current Data Lake to support reading Delta Lake tables in VantageCloud Lake.
  • To provide connection and authentication mechanisms to connect to Databricks / Unity Catalog and the object store.
  • When you require the ability to query the catalogue to retrieve metadata information and support SELECTs with projection lists and predicate pushdown.
  • If you need support for all Delta Lake data types and Time Travel queries.
  • When you want to create a Delta Lake table with Uniform Iceberg support, enabling the creation of Iceberg metadata and snapshots.

Iceberg

The key points to consider when to use Apache Iceberg are:

  • Large Datasets: Ideal for managing vast data sets efficiently.
  • Slowly Changing Data: Suited for scenarios where data changes infrequently.
  • Fast Read Performance: Provides fast read performance for large datasets.
  • Historical Data Retention: Allows querying historical data without deletion.
  • Schema Evolution: Supports changing table structure without affecting old snapshots.

Note that Iceberg is not suitable for high-frequency transactional workloads.

Separately, Iceberg offers time travel, improved performance, and open standards. Furthermore, the Iceberg API is available in Java and Python.

Choosing a Catalogue Type

The catalogues that Teradata supports are:

  • AWS Glue — If you are already using AWS, Glue may be a good choice.
  • Apache Hive — If your priority is adhering to open standards, you should use Apache Hive.
  • Unity Catalog — If you plan to use both Iceberg and Delta Lake tables and multiple clouds, such as AWS and Azure, Unity Catalog is a good choice.

Relational Properties of the Data Stored in Open Table Formats

Delta Lake Objects

Data stored in Delta Lake objects behaves like any other table or view: you have SELECT rights on it, and you can access it via SQL from either the Primary Cluster or a Compute Cluster. You can join a normal table (in BFS, OFS, or NOS) with data in a Delta Lake object.

Iceberg Objects

An Iceberg table is a collection of Parquet and/or Avro files with metadata stored in a Hive database. These files are not “relational”, as they allow concurrent writes of objects via snapshots.

Note that the first rule of “relational” is the identification of a single row without ambiguity:

  • Rule 1 — The information rule: All information in a relational database is represented explicitly at the logical level and in exactly one way — by table values.
  • Rule 2 — The guaranteed access rule: Each datum (atomic value) in a relational database is guaranteed to be logically accessible by combining table name, primary key value, and column name.

Thus, Iceberg is not relational, since it does not allow you to identify a single row.

How to Access Data Stored in OTF from Lake

Installation and Set Up

The OTF read and write capabilities are enabled within Lake; no particular installation or set-up is required for the feature.

Separately, the user environment should consist of an Iceberg or Delta Lake data lake with the proper credentials and access to the catalogue and object store.

DATALAKE Object

The DATALAKE object encapsulates all the information needed to connect to an OTF data lake, including the Authorization objects required to connect to the catalogue and the object store, as well as the connection details.

So, the CREATE DATALAKE statement creates a DATALAKE object, and all DATALAKEs are created in the TD_SERVER_DB database. Note that the Authorization information for connecting to the catalogue and the object store is specified in a dedicated clause of the CREATE DATALAKE statement.
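
As an illustration only, a CREATE DATALAKE statement can take roughly the shape sketched below. The data lake, authorization, and option names are placeholders, and the exact clause and option names depend on your catalogue and object store, so check the CREATE DATALAKE reference in the Teradata documentation before using it.

-- Illustrative sketch only: names are placeholders, and clause/option names may differ.
-- It references two Authorization objects, one for the catalogue and one for the object store.
CREATE DATALAKE my_datalake
EXTERNAL SECURITY CATALOG user1.glue_catalog_auth,
EXTERNAL SECURITY STORAGE user1.s3_storage_auth
USING
      catalog_type ('glue')
      storage_region ('us-west-2')
TABLE FORMAT iceberg;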

Authorizations in the DATALAKE Object

Teradata Authorization objects store the credentials for the Catalog and Object Store accounts and control access to a DATALAKE object. The online documentation provides several examples of creating Authorization and DATALAKE objects.​

You can use GRANT and REVOKE statements to issue or revoke EXECUTE privileges on the Authorization objects.
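
For instance, a sketch of granting and then revoking that privilege could look like the following; the user and Authorization object names are placeholders, and you should confirm the exact GRANT syntax for Authorization objects in the Teradata documentation.

-- Sketch: let another user execute (use) the Authorization object, then withdraw it.
-- User and object names are placeholders.
GRANT EXECUTE ON user1.iceberg_simplified_auth TO analytics_user;
REVOKE EXECUTE ON user1.iceberg_simplified_auth FROM analytics_user;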

Teradata supports simplified Authorization objects, which do not require users to define security constraints (such as INVOKER or DEFINER).

The Storage and Catalog credentials can be the same:

-- Create a simplified authorization (same credentials for the catalogue and the object store)
CREATE AUTHORIZATION user1.iceberg_simplified_auth
USER '<user name>'
PASSWORD '<password>';

Credentials can also be different when accessing the Catalog and the Object Store. In this case, you should define two separate Teradata Authorization objects.
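
For instance, following the same pattern as the simplified authorization above, you could create one Authorization object for the catalogue and another for the object store; the names and credential placeholders below are illustrative.

-- Sketch: separate credentials for the catalogue and for the object store.
CREATE AUTHORIZATION user1.iceberg_catalog_auth
USER '<catalogue user name>'
PASSWORD '<catalogue password>';

CREATE AUTHORIZATION user1.iceberg_storage_auth
USER '<object store user name or access key>'
PASSWORD '<object store password or secret>';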

Catalogue and Storage Credentials in OTF - VantageCloud Lake
As of July 2024

You should check with Teradata which catalogue and storage credentials to use to authenticate against the OTF objects.

Querying an OTF Table

Queries that access data stored in Open Table Formats use the 3-level dot notation to reference the data:

<datalake_name.database_name.table_name>
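
For example, assuming a DATALAKE object called my_datalake that exposes a sales_db database with an orders table (hypothetical names), a query would look like this:

-- Query an OTF table through the DATALAKE object using the 3-level dot notation.
-- my_datalake, sales_db, and orders are hypothetical names.
SELECT order_id, order_date, total_amount
FROM my_datalake.sales_db.orders
WHERE order_date >= DATE '2024-01-01';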

Table Metadata Queries

Invoking system functions allows you to retrieve table metadata, such as table history, snapshots, manifests and partition information.

Time Travel Queries

With Time Travel, you can query a table at a particular point in time based on a timestamp or snapshot ID.

Time Travel can show how the data has changed over time, including what rows were inserted, deleted, or updated in the table and any changes to the table schema.

Teradata supports Time Travel queries based on a snapshot ID or a timestamp/date literal. If a timestamp or date is specified, the query uses the snapshot AT or BEFORE that point in time. Note that you can get the snapshot ID for a table using the TD_SNAPSHOTS() function.
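
Purely as an illustration of the idea, a Time Travel query could take a shape like the sketch below. The TD_SNAPSHOTS() invocation and the AT SNAPSHOT clause shown here are assumptions (only the function name and the AT/BEFORE keywords come from the documentation), and the table names and snapshot ID are made up, so check the exact syntax in the Teradata documentation.

-- Illustrative sketch only: invocation shapes are assumptions; names and IDs are made up.
-- 1. Look up the available snapshots for the OTF table.
SELECT * FROM TD_SNAPSHOTS (my_datalake.sales_db.orders);
-- 2. Query the table as of a given snapshot.
SELECT * FROM my_datalake.sales_db.orders AT SNAPSHOT 8231996572632418707;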

JOINs

As long as there is an entry in DBC.TVM for the tables and in DBC.TVFields for the columns, JOINs between tables can be performed. This is true regardless of where the table is located (BFS, OFS, data stored in OTFs, or NOS data).

If the JOIN involves a foreign table (OTF, NOS), any applicable pushdown operations are performed and the results are stored in Spool, depending on the foreign table’s location and the underlying technology. Once the data is in Spool, Lake performs the JOIN between the Spool files.

So you could JOIN an Iceberg table with an external or internal table or a Delta Lake table with an external table.

Bear in mind that if you SELECT from an external table and there is no partition pruning, all the data is written into Spool, no matter how large the external table is. Thus, when you JOIN an OTF table with any other table, Lake retrieves all the data from the OTF table, stores it in a Spool file, and then JOINs the tables.
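
For instance, a JOIN between an OTF table referenced through a DATALAKE object and a regular in-database table could look like this (all names are hypothetical); remember that, as described above, the OTF side is first retrieved into Spool unless partition pruning reduces it.

-- Join an OTF table (via the DATALAKE object) with an in-database table.
-- my_datalake, sales_db, orders, crm_db, and customers are hypothetical names.
SELECT c.customer_name, SUM(o.total_amount) AS total_spent
FROM my_datalake.sales_db.orders AS o
JOIN crm_db.customers AS c
  ON o.customer_id = c.customer_id
GROUP BY c.customer_name;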

Technical Documentation

Further details are in Apache Iceberg Open Table Format on VantageCloud Lake – Getting Started. This document describes the steps to set up, query, and update Iceberg tables using VantageCloud Lake.

For Delta Lake tables, the following documents are available: Using Delta Lake Manifest Files and CREATE FOREIGN TABLE — Example: Creating Delta Lake Table.

Encryption

TLS1.2 encryption – HTTPS protocol

All data is transmitted between the VantageCloud Lake platform and the external object storage or catalogue using TLS1.2 encryption (HTTPS), regardless of whether it is encrypted at rest in external object storage.

Only AWS/Azure-managed encryption keys are supported

As of July 2024, Teradata doesn’t support CMEK when accessing catalogues and object stores.

Recommendations for Network Configuration

  • Set up Private Links to enable secure traffic between the data in your VantageCloud Lake instance and your environment.
  • All outgoing network connections in Lake go through Valtix (verification and system validation) by default. This becomes a performance bottleneck when reading large objects from ADLS Gen2 in parallel from all AMPs. To avoid this path and improve performance, the Teradata Product team recommends an Azure virtual network at the storage level, linking the Lake subnet so that Lake and your Storage account are on the same network. To add a virtual network endpoint on your ADLS Gen2 Storage account, you need the VNET/subnet information of the Lake tenant and then run the following command in the Azure CLI:
    az storage account network-rule add --resource-group "your-resourcegroup" --account-name "your-storageaccount" --subnet $Lake_subnetid
  • You will avoid egress costs if your VantageCloud Lake tenants are in the same region as your accounts.
  • For Lake to access the catalogue, you must add non-well-known URIs, such as the Hive metastore, to the Valtix whitelist. Other URIs, such as AWS Glue, AWS S3, Azure ADLS Gen2 and Databricks, are well-known and already included in the whitelist.

As an additional security consideration for network traffic when accessing OTFs from VantageCloud Lake, Teradata secures the API calls to NOS Buckets and OTFs, both in object storage. The API calls run within the Cloud Service Provider backbone when the bucket is in the same region as the Lake account. Only the API calls in Google Cloud are protected when the bucket is in another region. The table below summarises how Teradata configures the NOS API calls.

Securing API calls to NOS Buckets and OTFs in VantageCloud Lake

The Network Configuration posts for AWS, Azure, and GCP provide details on connecting to VantageCloud Lake.

Performance

Teradata recommends using Compute Clusters to run queries on data stored in OTFs for workload isolation and autoscaling, even though you can execute queries in the Primary and the Compute Clusters. You don’t want to overload the Primary Cluster.

Furthermore, queries in the Primary Cluster run at medium priority (Workload Management configuration).

On a separate note, if you need cross-cloud access to an OTF table (i.e., reading an OTF table hosted in a bucket in a different Cloud Service Provider from the one hosting your Lake instance), NOS and OTF requests currently pass through the Valtix-controlled egress/ingress gateway, which may represent a bottleneck.


Impact of Scaling in VantageCloud Lake

I have extensively discussed VantageCloud Lake elements. They provide Cloud-native features to your analytical ecosystem, including the flexibility that scaling provides. In this post, I present two cheat sheets to summarise the impact of scaling, one for the Compute Clusters and the other for the Primary ones.

Scaling the Compute Cluster

Scaling Compute Clusters in VantageCloud Lake
You can download a high-resolution cheat sheet for scaling the Compute Clusters. I keep it in my GitHub account‘s “VantageCloud Lake infographics” repository.

Teradata designed the Compute Clusters to quickly and cost-effectively adapt to workload demands, among other features. Thus, the Compute Cluster is the preferred element for scaling in the Lake architecture.

As you can see in the cheat sheet above, you can scale out and in (change the number of nodes) in the Compute Cluster either by Autoscaling or by changing the Compute Cluster size (T-shirt size) through the Compute Profile. Either way, you can scale any Compute Cluster in a live VantageCloud Lake instance without an outage in your workload.

However, as of July 2024, the Compute Clusters do not scale up or down (change the node type).

Scaling the Primary Cluster

Scaling Primary Clusters in VantageCloud Lake
You can download a high-resolution cheat sheet for scaling the Primary Clusters. I keep it in my GitHub account‘s “VantageCloud Lake infographics” repository.

Teradata designed VantageCloud Lake to run only the tactical queries (the ones that access a row through the Primary Index) and some internal processes in the Primary Cluster. You should move the rest of your workload to the Compute Clusters.

Even though the workload in the Primary Cluster is expected to be relatively stable, you may need to scale it occasionally, either because your workload increased or because you migrated the workload in phases to the Compute Cluster and fewer queries are running in the Primary Cluster.

So, when needed, you can scale up and down (change the instance type) in the Primary Cluster through an in-place mechanism, which requires minimum downtime. The Session Manager will safeguard the active queries and sessions.

Finally, as of October 2024, the Primary Clusters do not scale out or in (change the number of nodes), nor can they be expanded (change the number of AMPs and PEs).


This article was amended on 21 October 2024 to correct the information about the Primary Clusters. As of this date, Teradata supports only scaling up/down on the Primary Cluster; it doesn’t yet support scaling out/in, as the article previously stated. I also corrected the infographic.


Compute Clusters in VantageCloud Lake

The Compute Clusters are additional units of compute power in VantageCloud Lake. They can support all kinds of workloads (analytics, ad hoc queries, loads, reporting, etc.) to provide additional isolated capability to execute those tasks.

In this post, I’ll explain their main characteristics, how they help you manage your costs in your analytical environment, the design considerations for Compute Groups and a method to plan how to scale, and the different options to map applications with Compute Groups.

Compute Clusters’ Main Characteristics

The Compute Cluster’s three main characteristics are:

  1. There are several cluster types which support different workloads. At the time of writing this post in July 2024, the cluster types are:

    • Standard – for a variety of applications as well as in-database data engineering and analytics,
    • Analytics – for Data Science exploration, and
    • Analytic GPU – for Machine Learning, Deep Learning, and Large Language Models.

  2. Dynamic Autoscaling.

    • To quickly and cost-effectively adapt to the workload demands.

  3. They allow you to isolate workloads at the compute level while staying connected to the same data layer.

    • You can isolate workloads from the Primary Cluster and from workloads running in other Compute Clusters. This allows you to separate individual groups of workloads in an easy-to-manage way and to run exploratory workloads.

Compute Group Components

Teradata VantageCloud Lake - Compute Groups

The Compute Group is a set of Compute Profiles, ultimately associated with Compute Clusters. A user must have privileges on a Compute Group to use the Compute Clusters associated with it.

As for the Compute Profile, it defines the policy of the Compute Clusters within the Compute Group, i.e., the size, type, and number of clusters. Additionally, it specifies the timeframe when the Compute Cluster should be active.

To suspend and resume a Compute Profile, you can run commands from SQL or the Console.
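
As an illustration of the SQL route, the sketch below shows what suspending and resuming a Compute Profile can look like. The statement shape is an assumption based on the compute-management commands described here, and the profile and group names are placeholders, so check the exact syntax in the Teradata documentation.

-- Sketch: suspend a Compute Profile outside business hours and resume it later.
-- Statement shape is an assumption; profile and group names are placeholders.
SUSPEND COMPUTE FOR COMPUTE PROFILE analytics_day_profile IN COMPUTE GROUP analytics_group;
RESUME COMPUTE FOR COMPUTE PROFILE analytics_day_profile IN COMPUTE GROUP analytics_group;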

Bear in mind that there can be several Compute Profiles within a Compute Group, but only one is active at a given time. So, the combination of Compute Profiles lets you decide how much you want to scale out or in.

Furthermore, if the user who submitted the query has the privilege to use the Compute Group, steps within a query will be executed in a Compute Group’s Clusters. The Compute Router balances work across all the active Clusters within the Compute Group.

Autoscaling automatically adds or removes Clusters in response to demand. Note that Autoscaling only scales out/in. So, for example, a Compute Profile with a specific T-shirt size, a minimum of one cluster, and a maximum of three adds or removes clusters between those limits.
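
To make this concrete, here is a hypothetical sketch of a Compute Profile with Autoscaling limits. The statement shape, option names, and values are assumptions rather than verified syntax, so check the CREATE COMPUTE PROFILE reference in the Teradata documentation.

-- Sketch only: option names and values are assumptions, not verified syntax.
-- A small Standard profile that autoscales between one and three clusters.
CREATE COMPUTE PROFILE analytics_day_profile IN analytics_group,
  INSTANCE = TD_COMPUTE_SMALL,
  INSTANCE TYPE = STANDARD
USING
  MIN_COMPUTE_COUNT (1)
  MAX_COMPUTE_COUNT (3)
  INITIALLY_SUSPENDED ('FALSE');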

It is essential to realise that the Compute Clusters do not scale up or down, i.e., they do not change the node type.

Compute Cluster Size and Impact on Query Execution Time

Teradata uses the term T-shirt sizes to refer to the Compute Cluster sizes.

The table below shows the T-shirt sizes for the Compute Clusters, the number of nodes each size provides to the Cluster, the power increase relative to the Small size, and an example of the query execution time if the same query runs on the different Cluster sizes.

Cost Management

To optimise the cost in your Compute Clusters, you should keep in mind the following considerations:

  1. Larger T-shirt size, better execution time.

    • Increasing the T-shirt size can improve workload execution time. On the contrary, a smaller cluster size means lower costs for workloads without a tight SLA.

  2. Scale Compute Clusters live.

    • While queries run, you can modify a Compute Cluster on a live Lake instance.

  3. Schedule processes and reduce costs.

    • Scheduling workloads lowers costs through batch operations.

  4. Limit Autoscaling.

    • Autoscaling provides resources on demand up to the limit you set up.

Analytics Compute Cluster

The Analytics Compute Cluster is meant for Data Science exploration.

Teradata VantageCloud Lake - Analytics Compute Cluster

These Clusters play a crucial role in your business, positioning Data Science right next to the data. This setup enables you to utilise in-database analytics functions written in SQL, allowing you to use SQL for tasks like exploratory work and understanding data topology.

Additionally, you can benefit from the Open Analytics Framework, which allows you to use Python and R scripts and makes them easy to administer. Furthermore, the Python and R scripts run in containers, allowing you to run any algorithm on VantageCloud Lake.

Furthermore, the Analytics Cluster gives processes more horsepower, so your queries and code get more resources (CPU, memory, etc.) in an Analytics cluster than in a Standard one. Analytics Clusters are versatile and capable of handling a variety of workloads, such as UDFs and high-CPU-consuming analytic functions like nPath and Python, making them suitable for a wide range of tasks.

Finally, you can isolate the analytics workload for those working on it and keep the Analytics Cluster(s) up during their working hours.

Analytic GPU Compute Cluster

The Analytic GPU Compute Cluster is for fast Machine Learning, Deep Learning and Large Language Model inferences.

These Clusters provide a convenient way to use open-source generative AI/ML models. For instance, you can download a pre-trained Hugging Face model, install it in the Analytic GPU cluster, and interact with it through Python. You can also move your data in and out of the model, as it is already hosted in Lake, making your workflow easier to run and more efficient. To make it work:

  1. Obtain open-source permissible models from Hugging Face.
  2. Create and instantiate an Analytic GPU Compute Profile in VantageCloud Lake.
  3. Create an Open Analytics Framework user environment to load Python packages, models and Python model inference scripts.
  4. Call Open Analytics APIs to run the Large Language Model inference on the Analytic GPU Compute Cluster.

How to Design Your Compute Clusters

Background

The Compute Profile is the key to implementing your Compute Cluster design. It permits you to scale as needed and adjust the Compute Cluster size to your workload needs and budget. You should become familiar with Compute Profiles and their parameters.

By defining several Compute Profiles, you enable multiple time windows, each with a different level of processing power, allowing you to plan for various workload scenarios.

Each Compute Profile must be scaled independently because it has different processing power from the others. You can use Autoscaling to save money, since it automatically adjusts the number of Compute Clusters.

So, Compute Profiles and Autoscaling introduce more processing flexibility but less precision when sizing Compute Clusters.

VantageCloud Lake - Day and Night Compute Clusters
Compute Clusters in VantageCloud Lake depending on the time window
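
As a rough sketch of the day-and-night pattern in the figure above, you could define two Compute Profiles in the same Compute Group: a daytime one with Autoscaling for interactive demand and a night one with a single small cluster for batch work. The names, instance sizes and the cron-style START_TIME/END_TIME format below are assumptions; check the exact CREATE COMPUTE PROFILE options for your release.

    -- A sketch only: verify option names and the schedule format in the Teradata documentation.
    CREATE COMPUTE GROUP finance_cg USING QUERY_STRATEGY ('STANDARD');

    -- Daytime window: more clusters available through Autoscaling.
    CREATE COMPUTE PROFILE finance_day IN finance_cg,
      INSTANCE = TD_COMPUTE_MEDIUM,
      INSTANCE TYPE = STANDARD
    USING
      MIN_COMPUTE_COUNT (1)
      MAX_COMPUTE_COUNT (3)             -- Autoscaling ceiling, i.e. the cost limit
      START_TIME ('0 8 * * MON-FRI')    -- assumed cron-style schedule
      END_TIME ('0 20 * * MON-FRI');

    -- Night window: one small cluster for batch loads, no Autoscaling.
    CREATE COMPUTE PROFILE finance_night IN finance_cg,
      INSTANCE = TD_COMPUTE_SMALL,
      INSTANCE TYPE = STANDARD
    USING
      MIN_COMPUTE_COUNT (1)
      MAX_COMPUTE_COUNT (1)
      START_TIME ('0 20 * * MON-FRI')
      END_TIME ('0 8 * * TUE-SAT');

Because only one Compute Profile in a group is active at a time, the two windows together cover the full day with the processing power each period actually needs.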

Additionally, you can’t access BFS (Block File System) tables from the Compute Clusters, only object storage (Object File System or OFS, Open File Format or OFF, Open Table Format or OTF). The consumption patterns will differ from those of BFS because object storage has a different architecture from block storage.

Factors that Determine the Compute Cluster Design

The fundamental goal of the Compute Group is the driving consideration when designing your Compute Clusters. Query performance, throughput, and cost then refine the design.

Query Performance

Larger Compute Clusters provide more compute, I/O, memory, spool, and parallelism per query, resulting in shorter query elapsed time. Note that here, I mean larger Compute Clusters as a general term, with no specific reference to T-shirt size, number of nodes, etc.

You want large Compute Clusters for workloads that are very demanding in terms of resources and AMP parallelism and that use large data volumes.

If the workload has relatively stable arrival rates, i.e., a steady volume of queries coming in, and you want to minimise their elapsed and response times, you also need a large Compute Cluster.

Performance factor to design Compute Clusters in VantageCloud Lake

Throughput

Small to moderate-sized Compute Clusters with Autoscaling result in more concurrent query capacity and better responsiveness to variability in arrival rates.

If you expect your arrival rate and workload to vary over time, you need maximum throughput when there is a high query arrival rate. However, you don’t need a large Compute Cluster always available. In this scenario, you should have small to moderate-sized clusters and then set up Autoscaling with as many additional clusters as you need for periods of higher workload.

Throughput factor to design Compute Clusters in VantageCloud Lake

Cost

Moderate to small Compute Clusters with little or no Autoscaling allow for clear cost limits by placing a ceiling on resource utilisation, but at the expense of throughput and/or performance.

If you must keep your expenses within a threshold, you can choose a small Compute Cluster and Autoscale it only a little, keeping a low limit. In this case, some queries may be delayed, but you will ensure that costs stay under the limit.

Cost factor to design Compute Clusters in VantageCloud Lake

Considerations to Map Applications, Workloads or Departments to a Compute Group

Which Applications?

Decide which application(s) a Compute Group will support.

Which application to map applications to a Compute Group in VantageCloud Lake

Different Time Windows?

Consider the time windows that the combined applications will need. Save costs by reducing the cluster size or shutting down during low-demand windows.

Time Window to map applications to a Compute Group in VantageCloud Lake

Importance of the Work?

Identify the importance of the work running in a Compute Group to the business in each time window.

Importance of the work to map applications to a Compute Group in VantageCloud Lake

Summary: Design Considerations for Compute Clusters

  1. Design the Compute Clusters for price-performance, i.e., keep a balance between cost and performance when setting up the Compute Groups:

    • Maximum resources available in response to the arriving workload,
    • At a cost that is comfortable for you, and
    • Without leaving resources unused.

  2. Decide which applications, workloads or departments to map to a Compute Group.

    • It is best to do this before deciding how much your clusters will scale.

Design Steps to Define How Much To Scale

Steps to Take with Known Applications

1. Analyse Resource Consumption

You should analyse resource consumption (CPU, spool, I/O) and concurrency from the Teradata Enterprise instance for the combined applications you’ll map to a particular Compute Group. Focus on peaks and valleys in resource demand.

In fact, ResUsage and DBQL provide the information you need for the resource analysis.

The resource consumption analysis is the starting point for designing your Compute Profiles and deciding how to scale your Compute Groups. Once you have a first configuration, repeat the analysis and adjust until the configuration fits your workload.

Resource Consumption Report
Resource Consumption Report
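
As a starting point for this analysis, a DBQL query along the following lines could produce an hourly consumption profile similar to the report above. The DBC.QryLogV column names are the ones I recall, and the user list is hypothetical; adapt both to your environment and make sure DBQL detail logging is enabled.

    -- Hourly CPU, I/O and spool profile for the applications you plan to map to one Compute Group.
    SELECT
        CAST(StartTime AS DATE)        AS LogDate,
        EXTRACT(HOUR FROM StartTime)   AS LogHour,
        COUNT(*)                       AS QueryCount,
        SUM(AMPCPUTime)                AS AMPCPUSeconds,
        SUM(TotalIOCount)              AS LogicalIOs,
        MAX(SpoolUsage) / 1024**3      AS MaxSpoolGB
    FROM DBC.QryLogV
    WHERE UserName IN ('FIN_ETL', 'FIN_BI')          -- hypothetical application users
      AND CAST(StartTime AS DATE) >= CURRENT_DATE - 30
    GROUP BY 1, 2
    ORDER BY 1, 2;

Plotting the peaks and valleys from this output shows which time windows need larger clusters or Autoscaling and which can run small or stay suspended.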

2. Design Compute Profiles

Within each Compute Group, you must design Compute Profiles to reflect time windows that will support differing cycles of application demand.

Furthermore, you should allocate only the processing power required for each time window.

Then, you must determine when and how much Autoscaling is needed. You may have time windows when it is not required.

You can also start with no Autoscaling and adjust slowly over time by analysing the consumption patterns and arrival rates.

Design Compute Profiles in VantageCloud Lake

3. Convert Enterprise Consumption to Lake

You should match the resource consumption of the applications mapped to the Compute Group with what those applications consume on the Enterprise instance, then determine the cluster type, the node type, and the number of nodes that will satisfy that demand on Lake.

Convert VantageCloud Enterprise Consumption to Lake
Convert VantageCloud Enterprise Consumption to Lake

4. Validate on a Test instance

Finally, you could implement the scaling decisions for a Compute Group on a test environment as a validation and a starting point.

Validate on a VantageCloud Lake Test Instance your Compute Profile Design
Validate on a VantageCloud Lake Test Instance your Compute Profile Design

Steps to Take with Unknown Applications

To design how to scale for unknown applications, follow the same steps as for known ones, but expect less accuracy. Replace the resource analysis with estimates based on the application characteristics, and expect more trial and error during validation.

Influence of Storage Type and Table Design on How the Compute Clusters Scale

The list below details the different storage and table design factors that influence how the Compute Clusters scale, and why.

  • NoPI (No Primary Index) Table Access: NoPI OFS (Object File System) and OFF (Open File Format) table data are distributed randomly amongst objects, requiring more table scans for access.
  • Co-Location of Data: Lack of co-location within OFS and OFF data leads to more redistributions when preparing for a join. In BFS, joining two tables that share the same Primary Index co-locates their rows on the same AMP.
  • ORDER BY Clauses: A well-designed ORDER BY clause on OFS tables can significantly reduce the number of objects being read.
  • Fewer Indexing Options: OFS and OFF tables have fewer indexing options than BFS tables, impacting query performance and consumption.
  • Column vs. Row Format: OFS and OFF tables are recommended to be columnar, while BFS tables are typically row-based.
  • Objects Assigned to AMPs: Small OFS tables (with fewer objects than Compute Cluster AMPs) will result in some skewing since some AMPs will end up having no objects to process.
  • Less Control Over Set-Up: The organisation of external data in OFF storage may have been independently determined and not optimised for Teradata access.
  • External Format of Data: When reading OFF data, Parquet, CSV, and JSON file formats have different performance characteristics.
  • Path Filtering Benefits: A well-designed path filtering pattern can significantly reduce the number of OFF objects being read (object filtering) and, thus, the resources required to satisfy the request.

Example: Resource Analysis and Scaling Policy

The chart below shows an example of how you could map the CPU Utilisation of the combined applications you want to run in a Compute Cluster with the scaling policy you need to implement in your Compute Profile to adjust to your price-performance requirements.

Example Resource Analysis and Scaling Policy to Design Compute Profiles in Vantage Cloud Lake

Getting Started Using Compute Groups

Considerations When Setting Up Compute Groups

Reasons to Create a Compute Group

I list below the reasons to create a Compute Group:

  1. Budget Management

    • Financial governance factor: An individual application or department wants to manage its budget and resources.

  2. Workload Isolation

    • To confine resource-intensive, less optimised work (exploratory, Data Labs, Data Science projects, etc.) that disrupts other applications, a Compute Group will fence these noisy neighbour applications off and prevent them from damaging other workloads running in other Clusters. Additionally, it protects the Primary Cluster.

  3. Control over Resource Availability

    • To control the level of computing and elasticity of different applications or departments.

  4. Elasticity

    • Design Compute Groups to scale out and have enough resources when demand spikes. For example, the Finance department closes its books and needs additional resources during the last five days of a quarter.

  5. Service Level

    • If you have a high service level for an application at a specific time (e.g., the end-of-month processing needs to finish in three days), you could create a Compute Group for that application when needed, assign as many resources as necessary to finish on time, and then tear it down when done, so you only pay for those three days of processing power (a hedged sketch of this scenario follows the list).

  6. Common Pool

    • Staging area for modernised work until the application justifies its own Compute Group.
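
To make the Service Level scenario above concrete, here is a hedged sketch of a Compute Group that exists only for the month-end run. The names are illustrative and the DROP statements follow the syntax I have seen documented; verify them for your release before relying on them.

    -- Create a dedicated group for the month-end application.
    CREATE COMPUTE GROUP month_end_cg USING QUERY_STRATEGY ('STANDARD');

    -- Add a Compute Profile sized to finish within the three-day window
    -- (see the earlier CREATE COMPUTE PROFILE sketch), run the workload, and then:

    -- Tear everything down so you stop paying for it.
    DROP COMPUTE PROFILE month_end_profile IN month_end_cg;
    DROP COMPUTE GROUP month_end_cg;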

Modernisation Process in Phases

You can modernise your workload in Lake with Compute Clusters all at once. However, it will be easier if you divide it into phases. The diagram below shows an example of how to plan it.

Modernization Process for the Compute Clusters in Phases - Teradata VantageCloud Lake

Compute Profile Settings Define Cluster Characteristics

The following Compute Profile settings define the Compute Cluster characteristics:

  1. Cluster size

    • Provide enough parallelism within the cluster to satisfy the most resource-consuming critical queries.
    • Smaller clusters enable more modest computing increases and decreases with Autoscaling.
    • Ensure overall resource utilisation within the clusters is acceptable, minimising waste.
    • The Compute Group is where you specify cluster type (Standard, Analytics, Analytic GPU).
    • Once you specify the cluster type, all clusters within that Compute Group in a particular Compute Profile will use the same type of nodes.

  2. Minimum and Maximum Cluster Count

    • Autoscaling can reduce costs at low-demand times but increase costs at busy times.
    • Ideally, Autoscaling should be designed to match the demand at peak processing times.
    • Cost can be controlled by limiting or removing Autoscale in certain Compute Profiles.

  3. Start and End Time

    • Understand how Compute Group resource consumption varies over time.
    • Set up Compute Profile parameters so unused resources are minimised.
    • Deactivate all Compute Profiles at times of no activity (see the sketch after this list).
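
For the Start and End Time point, the sketch below shows how a Compute Profile could be suspended during a no-activity period and resumed later. It follows the SUSPEND/RESUME COMPUTE statement shape I have seen documented, with illustrative names; confirm the exact syntax for your release.

    -- Stop the clusters of a profile (and their cost) without deleting its definition.
    SUSPEND COMPUTE FOR COMPUTE PROFILE finance_night IN finance_cg;

    -- Bring it back when the batch window approaches again.
    RESUME COMPUTE FOR COMPUTE PROFILE finance_night IN finance_cg;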

Mapping Applications, Workloads or Departments to Compute Groups

Options to Map Applications to Compute Groups

In broad terms, there are three main options for mapping applications to Compute Groups, as shown below. However, you can design your Compute Groups to fall between these options.

Options to Map Applications to Compute Groups in VantageCloud Lake

These options have pros and cons, as shown below.

One-to-One Application Mapping to Compute Clusters in VantageCloud Lake

Several-to-One Application Mapping to Compute Clusters in VantageCloud Lake

All-to-One Application Mapping to Compute Clusters in VantageCloud Lake
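
Whichever of the three options you pick, the mapping itself comes down to pointing the application’s database users at a Compute Group. The sketch below is heavily hedged: SET SESSION COMPUTE GROUP is the statement I have seen used to route a session’s work, while the GRANT and MODIFY USER forms are assumptions about how the privilege and the default group are assigned; check all of them in the VantageCloud Lake documentation.

    -- Assumed syntax: give the application user the privilege to use the group
    -- and make it the user's default Compute Group.
    GRANT COMPUTE GROUP finance_cg TO fin_etl_user;
    MODIFY USER fin_etl_user AS COMPUTE GROUP = finance_cg;

    -- Within a session, work can also be routed explicitly.
    SET SESSION COMPUTE GROUP finance_cg;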

Considerations for Mapping Workloads to Compute Groups

The list below explains the considerations you should keep in mind when mapping workloads to specific Compute Groups.

  • Compute Profile Settings: Do the Compute Profile settings for Autoscale and processing power match the needs of all workloads targeted for the same Compute Group?
  • Cluster Isolation Needs: Do all workloads exhibit similar query characteristics and resource demands? Will isolation benefit one workload at the expense of others?
  • OFS Cache Effectiveness: Do workloads targeted to the same Compute Group access the same object storage data to optimise the use of the Object File System cache?
  • Financial Governance: Can combining multiple workloads into a single Compute Group achieve fuller resource utilisation? Will budget monitoring granularity be acceptable?
  • Similar Query Profiles: Will different workloads in the same Compute Group equitably share resources and concurrency slots so that a single workload does not dominate memory, CPU, and I/O?
  • Optimal Elastic Resource: Will Autoscale needs be similar during the same time windows for all workloads mapped to a given Compute Group?
  • Simple Workload Management: When combined, will the Lake default workload management options support the Compute Group mix of work?
  • Ease of Initial Migration: How simple will it be to set up the intended number of Compute Groups? Will it make the initial migration easier or more complicated?
  • Ongoing Admin Overhead: Will Compute Group setup decisions simplify administrative overhead going forward, or will they add extra monitoring and tuning activities over time?

How the Three Application Mapping Options Compare

The table below shows how the three application mapping options compare when designing your Compute Groups.

How the Three Application Mapping Options Compare in the Compute Clusters in Teradata VantageCloud Lake

Closing Comment

While you can run all your workloads in the Primary Cluster, Teradata designed VantageCloud Lake to run only the tactical queries (the ones that access a row through the Primary Index) and some internal processes in the Primary Cluster. You should move the rest of your workload to the Compute Clusters for the most efficient balance between your platform’s performance, flexibility and cost, leveraging Cloud-native capabilities.


I modified this post on 30 July 2024 to include a link to the post Impact of Scaling in VantageCloud Lake.

VantageCloud Lake on GCP: Network configuration https://celiamuriel.com/vantagecloud-lake-on-gcp-network-configuration/ Cheat sheet with the key network elements you need to connect with your Teradata VantageCloud Lake on Google Cloud and a detailed explanation. Celia Cloud Thu, 25 Jul 2024 12:58:17 +0100

VantageCloud Lake on GCP: Network configuration

This post contains a cheat sheet summarising the key elements of the network setup you must choose to connect your Teradata VantageCloud Lake instance on Google Cloud with your account. It also describes the GCP services Teradata supports for the connections. Below, you can find an explanation of all of them.

Cheat Sheet for the Network options for VantageCloud Lake on GCP

Teradata VantageCloud Lake on GCP - Cheat Sheet for Networking options
VantageCloud Lake on GCP cheat sheet for Networking options

You can download a high-resolution network cheat sheet for VantageCloud Lake on GCP from a repository in my GitHub account.

Network components in detail

Google connectivity Options Teradata supports for VaaS

On-Prem-to-Cloud connection

Google Cloud Interconnect

Cloud Interconnect extends an on-premises network to Google’s network through a highly available, low-latency connection. So it is Teradata’s recommended option. Furthermore, its performance is more predictable than that of a Virtual Private Network (VPN).

There are two different flavours of Interconnect:

  • Dedicated Interconnect provides direct physical connections between on-premises and Google’s networks.
  • Partner Interconnect connects on-premises and Google’s networks through a supported service provider.

Note that you, not Teradata, must procure and own the Cloud Interconnect.

Furthermore, you must use the Interconnect with either VPC Network Peering or VPN for the Cloud-to-Cloud connection, as Google does not support cross-tenant Cloud Interconnect.

Note that Teradata supports a Dedicated Interconnect between your on-prem site and the VaaS account.

VPN

Cloud VPN securely connects an on-premises network to a Virtual Private Cloud (VPC) network through an IPsec (Internet Protocol Security) encrypted VPN tunnel in a single region. A VPN Gateway encrypts the traffic between the two, and another VPN Gateway decrypts it. Thus, the VPN protects your data as it travels over the internet.

Actually, the VPN is the only connectivity method that is encrypted at the route level by default.

You can use a VPN for the on-premises-to-cloud connection instead. However, remember that it has lower bandwidth than an Interconnect and its performance is less predictable.

Besides connecting on-premises networks to the cloud, a VPN can also connect different VPCs within the cloud, different cloud service providers, and two instances of Cloud VPN to each other.

You may also configure application-level encryption on top of the IPsec route-level encryption (e.g., TLS 1.2 encryption for TTU drivers).

On a separate note, a VPN allows bidirectional traffic. You can secure your network by firewalling your site (on-prem data centre or GCP account).

Cloud-to-Cloud connection

We call the communication between the Teradata Lake account and your GCP account the “handshake”. It is your means to access your data, load more, and consume it. Thus, this connection must:

  1. Be secure,
  2. Be fast and handle large amounts of data, and
  3. Preserve the built-in parallelism in Vantage when possible. Letting the database handle the workload will improve its performance.

Private Service Connect

The Private Service Connect provides private connectivity between VPCs, GCP services, and on-premises networks without exposing traffic to the public internet or opening networks to one another.

Also, in the case of Private Service Connect, the traffic between the virtual network and the service travels over Google’s backbone network.

The Private Service Connect lets you securely connect a VPC to Teradata VantageCloud Lake with a uni-directional traffic pattern. Thus, session traffic is initiated only from your side of the link, and VantageCloud Lake can’t start a connection back into your network.

Consequently, there is no need to configure firewall rules or special routing tables since the two sides of the network are not directly joined.

As for security, Private Service Connect does not encrypt traffic because it is a private circuit — data does not flow on the public Internet. You can configure application-level encryption to encrypt data over a Private Service Connect (e.g., TLS 1.2 encryption for TTU drivers).

On a separate note, if you want to use LDAP, which requires bi-directional traffic between your GCP account and the Vantage one, you must create a reverse Private Service Connect and manually set up the LDAP configuration in the Session Manager. Together, these allow VantageCloud Lake to initiate traffic into your network.

QueryGrid and Data Copy also need bidirectional traffic. They require two separate Private Service Connects, one of them for the reverse traffic.

Additionally, if you need your Lake instance to contact one of your object stores with OTFs, you will require a reverse Private Service Connect. Note that if you require cross-cloud access, NOS and OTF requests are currently routed through the Valtix-controlled egress and ingress gateway, which may represent a bottleneck.

Public Internet

Notably, you can connect third-party applications and Viewpoint through the public internet. However, other solutions, such as QueryGrid or LDAP, must communicate with VantageCloud Lake through Private Service Connect.

To access your VantageCloud Lake account through the public internet, you must provide the source CIDR block to whitelist access in the Console. Note that Lake doesn’t allow whitelisting the 0.0.0.0/0 CIDR block.

Bear in mind that you won’t need to perform additional network changes as long as the allowed source can reach the public internet.

However, if you are accessing Lake outside your company network, you should use your company’s VPN to route traffic from the allowed CIDR block.

API Calls to NOS Buckets and OTFs

You can read from (and write to) Google Cloud Storage with VantageCloud Lake through a Teradata feature called NOS (Native Object Storage) Reads (and Writes). Additionally, Teradata supports reading and writing data stored in Open Table Format (OTF).

To access Cloud Storage, you have to use its APIs. By default, the Cloud Storage API calls run through the public internet, protected with HTTPS.

To secure Cloud Storage connections, Teradata leverages Private Google Access. Teradata configures the database’s virtual machines with internal IP addresses only (they don’t have public IPs) and has Private Google Access enabled.

Private Google Access means that when you call the Cloud Storage API from VantageCloud Lake, the connection uses the Cloud Storage API endpoint over Google’s private network backbone. Thus, data never transfers through the public internet when you use NOS Reads and Writes or OTFs from Teradata VantageCloud Lake. Note that Private Google Access protects the Cloud Storage API calls whether you access a Cloud Storage bucket in the same region as Lake or in a different one. However, if you access a bucket in another region, you will incur egress costs and will likely see higher latency than when accessing a bucket in the same region.

For your information, Google offers multiple options to route Cloud Storage traffic through its backbone instead of through the public internet; Private Google Access is the one Teradata uses with VantageCloud.
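
For reference, a NOS read from Cloud Storage looks roughly like the sketch below. The bucket, prefix and authorization object are hypothetical, and while the READ_NOS table operator and the /gs/storage.googleapis.com/... LOCATION form are the ones I have seen documented, you should confirm the exact syntax, and how the AUTHORIZATION object is created, for your release.

    -- Read the first few records from objects under a Cloud Storage prefix (names are hypothetical).
    SELECT TOP 10 *
    FROM READ_NOS (
      USING
        LOCATION ('/gs/storage.googleapis.com/my-analytics-bucket/sales/2024/')
        AUTHORIZATION (my_gcs_auth)       -- created beforehand with CREATE AUTHORIZATION
        RETURNTYPE ('NOSREAD_RECORD')     -- return the object records themselves
    ) AS nos_data;

Because Private Google Access keeps these API calls on Google’s backbone, the query reads the bucket without the data ever crossing the public internet.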

Main security elements in VantageCloud architecture

In terms of security, there are five key aspects of VantageCloud Lake architecture:

  • Teradata configures the Compute Engine virtual machines that run the VantageCloud Lake instances with internal IP addresses only, i.e., they don’t have public IPs.
  • The VPC networks of these virtual machines have Private Google Access enabled.
  • NOS traffic initiated from these virtual machines goes to a Cloud Storage API endpoint.
  • With Private Service Connect enabled, the Cloud Storage bucket will use an internal IP provided by the Teradata account, as long as Lake and the bucket are in the same region.
  • Teradata configures VantageCloud Lake to use the HTTPS call by default.

The NOS Orange Book has all the details for reading and writing data in NOS storage.

Network & Encryption

All Teradata network connections can be encrypted. Some examples of connectivity encryption options are:

  • Protocol transit encryption for SQLe (Quality of Protection or TTU v17.10 + TLS).
  • All HTTP interfaces will be HTTPS (such as Viewpoint).
  • By default, all the SQLe traffic is encrypted with a Teradata-provided cipher on port 1025 and a TLS 1.2 certificate on port 443.

Teradata Clients (Teradata Tools and Utilities) are not specific to any Cloud Service Provider. However, when you consider the security of your connections, you can enable TLS on TTUs 17.10 and above. Moreover, you can enable Teradata generic encryption (256 bits) for TTUs 16.20 onwards.

Other Cloud Service Providers

You can also find posts and cheat sheets on connecting VantageCloud Lake on AWS and Azure in this blog.
