The Compute Clusters are additional units of compute power in VantageCloud Lake. They can support all kinds of workloads (analytics, ad hoc queries, loads, reporting, etc.) to provide additional isolated capability to execute those tasks.
In this post, I’ll explain their main characteristics, how they help you manage your costs in your analytical environment, the design considerations for Compute Groups and a method to plan how to scale, and the different options to map applications with Compute Groups.
Compute Clusters’ Main Characteristics
The Compute Cluster’s three main characteristics are:
- There are several cluster types, which support different workloads. At the time of writing this post in July 2024, the cluster types are:
  - Standard – for a variety of applications as well as in-database data engineering and analytics,
  - Analytics – for Data Science exploration, and
  - Analytic GPU – for Machine Learning, Deep Learning, and Large Language Models.
- Dynamic Autoscaling.
  - It adapts quickly and cost-effectively to workload demands.
- They isolate workloads at the compute level while staying connected to the same data layer.
  - You can isolate workloads from the Primary Cluster and from workloads running in other Compute Clusters. This lets you separate individual groups of workloads in an easy-to-manage way and run exploratory workloads.
Compute Group Components
The Compute Group is a set of Compute Profiles, ultimately associated with Compute Clusters. A user must have privileges on a Compute Group to use the Compute Clusters associated with it.
As for the Compute Profile, it defines the policy of the Compute Clusters within the Compute Group, i.e., the size, type, and number of clusters. Additionally, it specifies the timeframe when the Compute Cluster should be active.
To suspend and resume a Compute Profile, you can run commands from SQL or the Console.
Bear in mind that there can be several Compute Profiles within a Compute Group, but only one is active at a given time. So, the combination of Compute Profiles lets you decide how much you want to scale out or in.
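To make the "only one active profile at a time" rule concrete, here is a minimal Python sketch. It is not the Lake API: the class `ComputeProfile`, its fields, the profile names and the size labels are all hypothetical, and the sketch assumes that profiles in one Compute Group are distinguished by non-overlapping hourly windows.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ComputeProfile:
    """Hypothetical model of a Compute Profile (not the Lake API)."""
    name: str
    size: str        # T-shirt size label (illustrative values only)
    start_hour: int  # inclusive, 0-23
    end_hour: int    # exclusive, 1-24


def active_profile(profiles: list[ComputeProfile], hour: int) -> Optional[ComputeProfile]:
    """Return the single profile whose window covers `hour`, or None."""
    matches = [p for p in profiles if p.start_hour <= hour < p.end_hour]
    if len(matches) > 1:
        raise ValueError("Profiles overlap: only one may be active at a time")
    return matches[0] if matches else None


# One Compute Group with two profiles covering different time windows.
group = [
    ComputeProfile("business_hours", "Medium", 8, 18),
    ComputeProfile("nightly_batch", "Large", 22, 24),
]
```

Calling `active_profile(group, 9)` returns the `business_hours` profile, while `active_profile(group, 20)` returns `None`, i.e., no Compute Clusters are running in that window.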
Furthermore, if the user who submitted a query has the privilege to use the Compute Group, the steps within that query are executed in the Compute Group’s Clusters. The Compute Router balances work across all the active Clusters within the Compute Group.
Autoscaling automatically adds or removes Clusters in response to demand. Note that Autoscaling only scales out or in, i.e., it changes the number of clusters, not their size. For example, a Compute Profile with a specific T-shirt size, a minimum of 1 and a maximum of 3 runs between one and three clusters of that size, depending on demand.
It is essential to realise that the Compute Clusters do not scale up or down, i.e., they do not change the node type.
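The scale-out/scale-in behaviour can be sketched as a simple clamp between the minimum and maximum cluster count. This is only an illustration: the post does not describe the real Autoscaling trigger, so the `queries_per_cluster` capacity figure and the function name are assumptions.

```python
import math


def clusters_needed(concurrent_queries: int, queries_per_cluster: int,
                    min_clusters: int, max_clusters: int) -> int:
    """Clamp the demand-driven cluster count to the profile's [min, max] range.

    `queries_per_cluster` is an assumed capacity figure, not a Lake setting.
    The cluster size (T-shirt size) never changes here; only the count does.
    """
    if queries_per_cluster <= 0:
        raise ValueError("queries_per_cluster must be positive")
    if not 1 <= min_clusters <= max_clusters:
        raise ValueError("need 1 <= min_clusters <= max_clusters")
    wanted = math.ceil(concurrent_queries / queries_per_cluster)
    return max(min_clusters, min(max_clusters, wanted))
```

With a minimum of 1 and a maximum of 3, an idle system stays at one cluster and a heavily loaded one stops at three, however high the demand goes.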
Compute Cluster Size and Impact on Query Execution Time
Teradata uses the term T-shirt sizes to refer to the Compute Cluster sizes.
The table below shows the T-shirt sizes for the Compute Clusters, the number of nodes each size provides to the Cluster, the power increase relative to the Small size, and an example of the execution time of the same query on the different Cluster sizes.
Cost Management
To optimise the cost in your Compute Clusters, you should keep in mind the following considerations:
- Larger T-shirt size, better execution time.
  - Increasing the T-shirt size can improve workload execution time. Conversely, a smaller cluster size means lower costs for workloads without a tight SLA.
- Scale Compute Clusters live.
  - You can modify a Compute Cluster on a live Lake instance while queries run.
- Schedule processes and reduce costs.
  - Scheduling workloads lowers costs through batch operations.
- Limit Autoscaling.
  - Autoscaling provides resources on demand up to the limit you set.
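As a rough illustration of why scheduling and Autoscaling limits matter, the sketch below compares cluster-hours for two daily schedules of the same cluster size. Cluster-hours are used here only as a proxy for cost; actual Lake pricing is not modelled, and the schedules are made-up numbers.

```python
def cluster_hours(schedule: list[tuple[int, int]]) -> int:
    """Sum hours x active clusters over a list of (hours, clusters) windows."""
    return sum(hours * clusters for hours, clusters in schedule)


# Hypothetical daily schedules for one cluster size.
always_on = cluster_hours([(24, 3)])                   # three clusters all day
scheduled = cluster_hours([(10, 1), (4, 3), (10, 0)])  # scale out at peak, off at night

print(always_on, scheduled)  # 72 22
```

In this made-up case, running three clusters only during a four-hour peak and shutting down overnight uses 22 cluster-hours instead of 72.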
Analytics Compute Cluster
The Analytics Compute Cluster is meant for Data Science exploration.
These Clusters play a crucial role in your business, positioning Data Science right next to the data. This setup enables you to utilise in-database analytics functions written in SQL, allowing you to use SQL for tasks like exploratory work and understanding data topology.
Additionally, you can benefit from the Open Analytics Framework, which lets you run Python and R scripts and makes them easy to administer. The scripts run in containers, allowing you to run any algorithm on VantageCloud Lake.
The Analytics Cluster also gives each process more resources: your queries and code get more CPU, memory, etc. in an Analytics cluster than in a Standard one.
Analytics Clusters are versatile and capable of handling a variety of workloads such as UDFs and high-CPU-consuming analytic functions like nPath and Python, making them suitable for a wide range of tasks.
Finally, you can isolate the analytics workload for those working on it and keep the Analytics Cluster(s) up during their working hours.
Analytic GPU Compute Cluster
The Analytic GPU Compute Cluster is for fast Machine Learning, Deep Learning and Large Language Model inferences.
These Clusters provide a convenient way to use open-source generative AI/ML models. For instance, you can download a pre-trained Hugging Face model, install it in the Analytic GPU cluster, and interact with it through Python. Because your data is already hosted in Lake, you can also move it in and out of the model, which makes your workflow easier to run and more efficient. To make it work:
- Obtain permissively licensed open-source models from Hugging Face.
- Create and instantiate an Analytic GPU Compute Profile in VantageCloud Lake.
- Create an Open Analytics Framework user environment to load Python packages, models and Python model inference scripts.
- Call Open Analytics APIs to run the Large Language Model inference on the Analytic GPU Compute Cluster.
How to Design Your Compute Clusters
Background
The Compute Profile is the key to implementing your Compute Cluster design. It permits you to scale as needed and adjust the Compute Cluster size to your workload needs and budget. You should become familiar with Compute Profiles and their parameters.
By defining several Compute Profiles, you enable multiple time windows, each with a different level of processing power, allowing you to plan for various workload scenarios.
Each Compute Profile must be scaled independently because it has a different processing power than the others. You can use Autoscaling to save money since it allows you to adjust the Compute Cluster size automatically.
So, Compute Profiles and Autoscaling introduce more processing flexibility but less precision when sizing Compute Clusters.
Additionally, you can’t access BFS (Block File System) tables from the Compute Clusters, only object storage (Object File System or OFS, Open File Format or OFF, Open Table Format or OTF). Consumption patterns will differ from those of BFS because object storage has a different architecture from block storage.
Factors that Determine the Compute Cluster Design
The fundamental goal of the Compute Group is the driving consideration when designing your Compute Clusters. Then, query performance, throughput, and cost will be used to polish the Compute Cluster design.
Query Performance
Larger Compute Clusters provide more compute, I/O, memory, spool, and parallelism per query, resulting in shorter query elapsed time. Note that here, I mean larger Compute Clusters as a general term, with no specific reference to T-shirt size, number of nodes, etc.
You want large Compute Clusters for workloads that are very demanding regarding resources and AMP parallelism and use large data volumes.
If the workload has relatively stable arrival rates, i.e., a steady volume of queries coming in, and you want to minimise their elapsed and response times, you also need a large Compute Cluster.
Throughput
Small to moderate-sized Compute Clusters with Autoscaling result in more concurrent query capacity and better responsiveness to variability in arrival rates.
If you expect your arrival rate and workload to vary over time, you need maximum throughput when there is a high query arrival rate. However, you don’t need a large Compute Cluster always available. In this scenario, you should have small to moderate size clusters and then set up Autoscaling with as many additional clusters as you need when you receive a higher workload.
Cost
Moderate to small Compute Clusters with little or no Autoscaling allow for clear cost limits by placing a ceiling on resource utilisation but at the expense of throughput and/ or performance.
If you must keep your expenses within a threshold, you can choose a small Compute Cluster and Autoscale it a little while keeping a low limit. In this case, some queries may be delayed, but you will ensure that costs stay under a limit.
Considerations to Map Applications, Workloads or Departments to a Compute Group
Which Applications?
Decide which application(s) a Compute Group will support.
Different Time Windows?
Consider the time windows that the combined applications will need. Save costs by reducing the cluster size or shutting down during low-demand windows.
Importance of the Work?
Identify the importance of the work running in a Compute Group to the business in each time window.
Summary: Design Considerations for Compute Clusters
- Design the Compute Clusters for price-performance, i.e., keep a balance between cost and performance when setting up the Compute Groups:
  - Maximum resources available in response to the arriving workload,
  - At a cost that is comfortable for you, and
  - Without leaving resources unused.
- Decide which applications, workloads or departments to map to a Compute Group.
  - It is best to do this before deciding how much your clusters will scale.
Design Steps to Define How Much To Scale
Steps to Take with Known Applications
1. Analyse Resource Consumption
You should analyse resource consumption (CPU, spool, I/O) and concurrency from the Teradata Enterprise instance for the combined applications you’ll map to a particular Compute Group. Focus on peaks and valleys in resource demand.
ResUsage and DBQL provide the information you need for the resource analysis.
The resource consumption analysis is the starting point for designing your Compute Profiles and deciding how to scale your Compute Groups. Once you have done it, repeat it as needed until your configuration is properly adjusted.
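A minimal sketch of the peaks-and-valleys analysis follows. The input is a hypothetical list of `(hour_of_day, cpu_seconds)` samples in the shape you might export from DBQL or ResUsage; no real table or column names are used, and the 1.5 threshold factor is an arbitrary assumption you would tune.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (hour_of_day, cpu_seconds) samples for the combined applications.
samples = [(9, 1200.0), (9, 1500.0), (10, 1800.0),
           (14, 900.0), (2, 100.0), (2, 140.0)]


def hourly_cpu(rows):
    """Total CPU seconds per hour of day."""
    totals = defaultdict(float)
    for hour, cpu in rows:
        totals[hour] += cpu
    return dict(totals)


def peaks_and_valleys(totals, factor=1.5):
    """Hours above factor x mean are peaks; hours below mean / factor are valleys."""
    avg = mean(totals.values())
    peaks = sorted(h for h, v in totals.items() if v > avg * factor)
    valleys = sorted(h for h, v in totals.items() if v < avg / factor)
    return peaks, valleys


totals = hourly_cpu(samples)
peaks, valleys = peaks_and_valleys(totals)
print(peaks, valleys)  # [9] [2, 14]
```

The same grouping could be applied to spool, I/O or concurrency figures; the point is to see where demand rises and falls before you design the time windows.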
2. Design Compute Profiles
Within each Compute Group, you must design Compute Profiles to reflect time windows that will support differing cycles of application demand.
Furthermore, you should allocate only the processing power required for each time window.
Then, you must determine when and how much Autoscaling is needed. You may have time windows when it is not required.
You can also start with no Autoscaling and adjust slowly over time by analysing the consumption patterns and arrival rates.
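One way to turn an hourly demand picture into candidate time windows is to merge consecutive hours with the same demand level. The sketch below assumes you have already labelled each hour of the day as "low", "medium" or "high"; the labels and the function name are illustrative, not Lake terminology.

```python
from itertools import groupby


def time_windows(hourly_level: list[str]) -> list[tuple[int, int, str]]:
    """Collapse per-hour demand levels into (start_hour, end_hour, level) windows."""
    windows = []
    hour = 0
    for level, run in groupby(hourly_level):
        length = len(list(run))
        windows.append((hour, hour + length, level))
        hour += length
    return windows


# Hypothetical day: quiet night, busy working hours, moderate evening.
levels = ["low"] * 8 + ["high"] * 10 + ["medium"] * 6
windows = time_windows(levels)
print(windows)  # [(0, 8, 'low'), (8, 18, 'high'), (18, 24, 'medium')]
```

Each resulting window is a candidate for its own Compute Profile, with processing power and Autoscaling limits chosen for that level of demand.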
3. Convert Enterprise Consumption to Lake
You should match the resource consumption in the Compute Group applications to that on the Enterprise instance. You must determine the cluster type, the node type, and the number of nodes that will satisfy that demand on Lake.
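The arithmetic behind this step can be sketched as follows. Both inputs are assumptions: `peak_demand` and `capacity_per_node` are in whatever common unit you choose (for example CPU seconds per hour), and the real conversion between Enterprise and Lake node types is not covered in this post. The 20% headroom is also just an example value.

```python
import math


def nodes_needed(peak_demand: float, capacity_per_node: float,
                 headroom: float = 0.2) -> int:
    """Smallest node count whose capacity covers peak demand plus headroom."""
    if capacity_per_node <= 0:
        raise ValueError("capacity_per_node must be positive")
    return math.ceil(peak_demand * (1 + headroom) / capacity_per_node)


# Hypothetical figures: demand of 100 units, 32 units per node, 20% headroom.
print(nodes_needed(100, 32))  # 4
```

You would then pick the T-shirt size and cluster count that provide at least that many nodes in the relevant time window.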
4. Validate on a Test instance
Finally, you could implement the scaling decisions for a Compute Group on a test environment as a validation and a starting point.
Steps to Take with Unknown Applications
To design how to scale unknown applications, follow the same steps as with known ones, but expect less accuracy. You must replace the resource analysis with estimates based on the application characteristics, and you should expect more trial and error during validation.
Influence of Storage Type and Table Design on How the Compute Clusters Scale
The table below details the different storage and table design factors that influence how the Compute Clusters scale.
Storage or Table Design Factor | Why It Impacts How the Compute Clusters Scale |
---|---|
NoPI (No Primary Index) Table Access | NoPI OFS (Open File System) and OFF (Open File Format) table data are distributed randomly amongst objects, requiring more table scans for access. |
Co-Location of Data | Lack of co-location within OFS and OFF data leads to more redistributions when preparing for a join. In BFS, by contrast, you can co-locate the rows of two joined tables on the same AMP by choosing the same Primary Index. |
ORDER BY Clauses | A well-designed ORDER BY clause on OFS tables can significantly reduce the number of objects being read. |
Fewer Indexing Options | OFS and OFF tables have fewer indexing options than BFS tables, impacting query performance and consumption. |
Column vs. Row Format | OFS and OFF tables are recommended to be columnar, while BFS tables are typically row-based. |
Objects Assigned to AMPs | Small OFS tables (with fewer objects than Compute Cluster AMPs) will result in some skewing since some AMPs will end up having no objects to process. |
Less Control Over Set-Up | The organisation of external data in OFF storage may have been independently determined and not optimised for Teradata access. |
External Format of Data | When reading OFF data, Parquet, CSV, and JSON file formats have different performance characteristics. |
Path Filtering Benefits | A well-designed path filtering pattern can significantly reduce the number of OFF objects being read (object filtering) and, thus, the resources required to satisfy the request. |
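The "Objects Assigned to AMPs" row in the table above can be illustrated with a small sketch. It assumes objects are spread across AMPs as evenly as possible, which is a simplification; the actual assignment logic is not described in this post.

```python
def objects_per_amp(num_objects: int, num_amps: int) -> list[int]:
    """Spread objects across AMPs as evenly as possible (simplifying assumption)."""
    base, extra = divmod(num_objects, num_amps)
    return [base + 1 if amp < extra else base for amp in range(num_amps)]


def idle_amps(num_objects: int, num_amps: int) -> int:
    """Number of AMPs left with no objects to process."""
    return objects_per_amp(num_objects, num_amps).count(0)


print(objects_per_amp(5, 8))  # [1, 1, 1, 1, 1, 0, 0, 0]
print(idle_amps(5, 8))        # 3
```

With five objects and eight AMPs, three AMPs do no work for that table, so a larger cluster adds nodes without shortening the scan.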
Example: Resource Analysis and Scaling Policy
The chart below shows an example of how you could map the CPU Utilisation of the combined applications you want to run in a Compute Cluster with the scaling policy you need to implement in your Compute Profile to adjust to your price-performance requirements.
Getting Started Using Compute Groups
Considerations When Setting Up Compute Groups
Reasons to Create a Compute Group
I list below the reasons to create a Compute Group:
- Budget Management
  - Financial governance: an individual application or department wants to manage its own budget and resources.
- Workload Isolation
  - A Compute Group confines resource-intensive, less optimised work (exploratory, Data Labs, Data Science projects, etc.) that would otherwise disrupt other applications. It fences off these noisy-neighbour applications so they cannot harm workloads running in other Clusters, and it also protects the Primary Cluster.
- Control over Resource Availability
  - To control the level of compute and elasticity available to different applications or departments.
- Elasticity
  - Design Compute Groups to scale out so they have enough resources when demand spikes. For example, the Finance department closes its books and needs additional resources during the last five days of a quarter.
- Service Level
  - If an application has a demanding service level at a specific time (e.g., the end-of-month processing needs to finish in three days), you could create a Compute Group for that application when needed, assign as many resources as necessary to finish on time, and then tear it down when done, so you only pay for those three days of processing power.
- Common Pool
  - Staging area for modernised work until the application justifies its own Compute Group.
Modernisation Process in Phases
You can modernise your workload in Lake with Compute Clusters all at once. However, it will be easier if you divide it into phases. The diagram below shows an example of how to plan it.
Compute Profile Settings Define Cluster Characteristics
The following Compute Profile settings define the Compute Cluster characteristics:
- Cluster size
  - Provide enough parallelism within the cluster to satisfy the most consuming critical queries.
  - Smaller clusters enable more modest computing increases and decreases with Autoscaling.
  - Ensure overall resource utilisation within the clusters is acceptable, minimising waste.
  - The Compute Group is where you specify cluster type (Standard, Analytics, Analytic GPU).
  - Once you specify the cluster type, all clusters within that Compute Group in a particular Compute Profile will use the same type of nodes.
- Minimum and Maximum Cluster Count
  - Autoscaling can reduce costs at low-demand times but increase costs at busy times.
  - Ideally, Autoscaling should be designed to match the demand at peak processing times.
  - Cost can be controlled by limiting or removing Autoscale in certain Compute Profiles.
- Start and End Time
  - Understand how Compute Group resource consumption varies over time.
  - Set up Compute Profile parameters so unused resources are minimised.
  - Deactivate all Compute Profiles at times of no activity.
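The settings listed above can be captured in a small validation sketch. The class `ProfileSettings` and its field names are hypothetical and do not mirror the actual Compute Profile syntax; the checks only encode the constraints implied by the text (a minimum no larger than the maximum, and a start time before the end time).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProfileSettings:
    """Hypothetical container for Compute Profile settings (not the Lake API)."""
    size: str          # T-shirt size label
    min_clusters: int
    max_clusters: int
    start_hour: int    # inclusive, 0-23
    end_hour: int      # exclusive, 1-24

    def __post_init__(self):
        if not 1 <= self.min_clusters <= self.max_clusters:
            raise ValueError("need 1 <= min_clusters <= max_clusters")
        if not 0 <= self.start_hour < self.end_hour <= 24:
            raise ValueError("need 0 <= start_hour < end_hour <= 24")

    @property
    def autoscaling_enabled(self) -> bool:
        """Autoscaling has room to act only when max exceeds min."""
        return self.max_clusters > self.min_clusters

    @property
    def max_cluster_hours(self) -> int:
        """Upper bound on cluster-hours consumed in one daily window."""
        return (self.end_hour - self.start_hour) * self.max_clusters


day = ProfileSettings("Small", min_clusters=1, max_clusters=3, start_hour=8, end_hour=18)
print(day.autoscaling_enabled, day.max_cluster_hours)  # True 30
```

Setting `min_clusters` equal to `max_clusters` removes Autoscaling for that window, which gives a hard ceiling on consumption at the expense of elasticity.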
Mapping Applications, Workloads or Departments to Compute Groups
Options to Map Applications to Compute Groups
In broad terms, there are three main options for mapping applications to Compute Groups, as shown below. However, you can design your Compute Groups to fall between these options.
These options have pros and cons, as shown below.
Considerations for Mapping Workloads to Compute Groups
The table in this section explains the considerations you should keep in mind when mapping workloads to specific Compute Groups.
Consideration when mapping workloads to Compute Groups | Explanation |
---|---|
Compute Profile Settings | Do the Compute Profile settings for Autoscale and processing power match the needs of all workloads targeted for the same Compute Group? |
Cluster Isolation Needs | Do all workloads exhibit similar query characteristics and resource demands? Will isolation benefit one workload at the expense of others? |
OFS Cache Effectiveness | Do workloads targeted to the same Compute Group access the same object storage data to optimise the use of the Object File System cache? |
Financial Governance | Can combining multiple workloads into a single Compute Group achieve fuller resource utilisation? Will budget monitoring granularity be acceptable? |
Similar Query Profiles | Will different workloads in the same Compute Group equitably share resources and concurrency slots so that a single workload does not dominate memory, CPU, and I/O? |
Optimal Elastic Resource | Will Autoscale needs be similar during the same time windows for all workloads mapping to a given Compute Group? |
Simple Workload Management | When combined, will the Lake default workload management options support the Compute Group mix of work? |
Ease of Initial Migration | How simple will it be to set up the intended number of Compute Groups? Will it make the initial migration easier or more complicated? |
Ongoing Admin Overhead | Will Compute Group setup decisions simplify administrative overhead going forward, or will they add extra monitoring and tuning activities over time? |
How the Three Application Mapping Options Compare
The table below shows how the three application mapping options compare when designing your Compute Groups.
Closing Comment
While you can run all your workloads in the Primary Cluster, Teradata designed VantageCloud Lake to run only the tactical queries (the ones that access a row through the Primary Index) and some internal processes in the Primary Cluster. You should move the rest of your workload to the Compute Clusters for the most efficient balance between your platform’s performance, flexibility and cost, leveraging Cloud-native capabilities.
I modified this post on 30 July 2024 to include a link to the post Impact of Scaling in VantageCloud Lake.