There are many courses to learn about Azure Synapse Analytics (I list a few at the bottom of this post), all of which cover similar topics. This post discusses information I missed when I started working with Azure SQL Data Warehouse (now Synapse).
Microsoft Proprietary Transactional Database Portfolio
Microsoft offers plenty of databases within Azure, such as CosmosDB (NoSQL database); Azure Database for MySQL, MariaDB or PostgreSQL (managed services for these databases); Azure PostgreSQL Hyperscale (Massive Parallel Processing system, aka MPP, which is a version of PostgreSQL), etc.
Within that wide offering, Microsoft’s proprietary transactional databases are shown below. They originated from the SQL Server database engine but evolved separately. These databases have similar engines, with significant differences. Consequently, they use SQL codes with large divergences and disparate features, and the database objects are different. Thus, you must undergo a full migration (moving data and translating code) if your data is on a particular database and you want to use another.
SQL Server
Symmetric Multi-Processing (SMP) system, and the only database Microsoft offers on-prem. It runs on Windows and Linux but with less functionality and features on Linux.
Many customers wanted to keep the database flavour and avoid a database migration when moving to the cloud. Hence, Microsoft also offers the SQL Server Infrastructure as a service (IaaS), which is a virtual machine on Azure), and SQL Server Managed Instance (a virtual machine in Azure, but Microsoft manages the Operating System).
Azure SQL Database (SQL DB)
Fully managed Platform as a Service (PaaS) database engine. It handles most database management functions such as upgrading, patching, backups, and monitoring without user involvement.
It is an SMP system used for transactional databases and small Data Warehouses.
Azure Synapse Analytics
MPP system that started as a cloud-based Data Warehouse. Service customers with large volumes of relational data. Recently, it has added some functionality to connect to Hadoop and work natively with ML algorithms and Spark pipelines.
DATAllegro – The seed to Azure Synapse Analytics
Stuart Frost founded DATAllegro in 2003. A DATAllegro database had a proprietary database software that ran on an MPP cluster made up of commodity hardware and used SUSE Linux as an operating system. Netezza and Teradata were their main competitors.
Then, in 2008, Microsoft bought DATAllegro. They used the SQL Server database engine and adapted it to run on an MPP to become their Data Warehouse solution. Over the years, Microsoft made its MPP offering evolve into Azure Synapse Analytics.
What is Azure Synapse Analytics now?
Microsoft defines Azure Synapse Analytics as:
“An analytics service with access to data warehouse and big data systems”.
As the diagram above shows, Azure Synapse Analytics includes:
- SQL technologies used in enterprise data warehousing, such as:
- Built-in streaming capabilities, and
- Machine Learning functions
- Spark technologies for big data
- Apache Spark, not Databricks,
- Managed cluster, and
- Autoscaling
- Data Explorer for log and time-series analytics (in preview)
- Pipelines for data integration and ETL/ELT (E = Extract; T = Transform; L = Load)
- It contains the Azure Data Factory engine,
- Supports over 90 sources,
- Azure Data Factory is a code-free ETL solution, and
- Notebooks, Spark jobs, SQL scripts, etc.
- Integration with other Azure services, e.g. Power BI, CosmosDB, AzureML.
Azure Synapse Architecture
Let’s break down Synapse architecture, especially the SQL engine, to better understand how it works and Synapse’s key features as a Data Warehouse.
Node Components
Synapse is an MPP system, i.e. made up of several nodes, based on SQL Server. Below is a diagram of the SQL Server architecture, which is also the architecture of a Synapse node. The architecture shows the main components of the nodes.
Synapse SQL Architecture
Before discussing the Synapse SQL architecture, let’s discuss the Data Warehouse Units or DWUs. Microsoft bundles CPU, memory and IO into units of compute scale called Data Warehouse Units. A DWU represents an abstract, normalized measure of computing resources and performance for Azure Synapse SQL. Billing is based on DWUs.
Back to the Synapse SQL architecture, Microsoft offers Synapse SQL in two different flavours:
- A Dedicated SQL Pool (formerly SQL Data Warehouse or SQL DW) allows you to choose the ideal number of DWUs to optimize price and performance and how to change the number of units. It has one Control Node, and 1-60 Compute ones, depending on our configured compute resources (DWUs).
- The Serverless SQL Pool (aka SQL On-Demand Pool) is a pay-per-query service that doesn’t require picking the right size (DWUs). The system automatically scales based on your query resource requirements. E.g., the image above shows a serverless SQL pool utilizing 4 compute nodes to execute a query.
As the diagram above shows, Synapse SQL separates the compute power from storage, so computing and storage scale independently. For this reason, Microsoft charges separately for storage consumption.
For a Serverless SQL Pool, scaling is done automatically. It always runs in Microsoft’s tenant, and you can’t run it inside your tenant.
As for a Dedicated SQL Pool, you can:
- Grow or shrink compute power within a dedicated SQL pool, without moving data.
- Pause computing capacity during non-operational hours. Only pay for storage.
The two different SQL engines, Dedicated SQL Pool and Serverless have different code bases and, hence, different SQL syntax and feature functions.
Control Nodes
Regarding the Control Nodes, they orchestrate data loaded into Compute Nodes (worker nodes). They also manage DMS and HDFS bridge controls for SQL queries and storage allocations.
Additionally, applications connect and issue T-SQL commands to a Control Node, the single point of entry for Synapse SQL.
On another note, the Control Node utilizes a distributed query engine to optimize queries for parallel processing and then passes operations to Compute nodes to do their work in parallel.
The Serverless SQL Pool Control Node utilizes the Distributed Query Processing (DQP) engine to split queries into smaller queries that will be executed on Compute Nodes. Each small query is called a “task” and represents a distributed execution unit.
In contrast, the Dedicated SQL Pool uses the Data Movement Service (DMS) – an internal, system-level service – to move data across the nodes as necessary to run queries in parallel.
Storage
By default, the tables in a Provisioned SQL Pool are columnar, and Synapse stores them in Blob Storage. If you define a table as row-based, the system automatically keeps it in a Premium SSD disk.
In contrast, the SQL On-Demand and Apache Spark Pools are engines that process data from other services without associated storage.
Integration
Azure Synapse has embedded ADF (Azure Data Factory) capabilities to load data from many sources. We can also build ADF pipelines in an external service. However, Azure Synapse can run T-SQL on several services, so it doesn’t need ADF.
The services where Azure Synapse can run T-SQL are:
- Provisioned SQL Pools:
- All Microsoft transactional databases (SQL Server, SQL DB, Synapse SQL) through a database feature common to them called PolyBase.
- Azure Data Lake Storage Gen 2
- Azure Blob Storage
- HDInsight
- SQL On-Demand Pools
- Spark Tables
- CosmosDB
- Azure Data Lake Storage Gen 2
- Azure Blob Storage
- HDInsight
Distribution during query execution in Azure Synapse
Dedicated SQL Pool (aka SQL DW)
The Dedicated SQL Pool splits a query’s work into 60 smaller queries that run in parallel. Each of the 60 smaller queries runs on one of the data distributions, and each Compute Node manages one or more of the 60 distributions.
For instance, if it has the maximum compute resources (DWUs), the instance has 60 Compute Nodes. Each Compute node manages one distribution. However, if the Dedicated SQL Pool instance has the minimum DWUs, it has just one Compute Node, and all the distributions run on it.
As already stated, there is a fixed number (60) of data distributions. Synapse remaps the data slices to nodes without moving the data. One of the consequences of this design is that the Dedicated SQL Pool scales out quickly. Another side effect is that the compute node count has to be evenly divided from 60. So you can scale from 20 to 30 nodes, but not to anything in between. You can also scale from 30 to 60, but nothing in between.
Serverless SQL Pool
The Distributed Query Processing (DQP) engine optimizes and orchestrates the execution of a query. It divides the query into smaller ones and sends them to the Compute Nodes as planned.
Table Distribution in the Dedicated SQL Pool
It supports three sharding/ distribution patterns:
- Hash
- A hash-distributed table distributes rows among the distributions based on the value in the distribution column.
- You can only distribute based on one column. It has to be NOT NULL.
- A hash-distributed table delivers the highest query performance for joins and aggregations on large tables.
- A hash-distributed table distributes rows among the distributions based on the value in the distribution column.
- Round Robin (default)
- A round-robin table distributes table rows evenly and randomly across all data distributions.
- Loading data is fast.
- Queries can require more data movement.
- Replicate
- A replicated table has a full copy of the table available on every Compute node.
- Queries run fast on replicated tables because joins don’t require data movement.
- Replication requires extra storage.
- The Teradata optimizer already replicates small tables automatically in spool when convenient.
Concurrent Sessions and Queries in Azure Synapse
Dedicated SQL Pool
The Dedicated SQL Pool’s maximum number of concurrent open sessions is 1024 (checked on August 29th, 2022). Bear in mind that the number of concurrent open sessions varies based on the selected DWU, so the maximum number of open sessions for DWU1000c and above is 1024, while the maximum number of open sessions for DWU500c and below is 512.
On a separate note, the maximum number of concurrent queries is 128* per data distribution (checked on August 29th, 2022). There are 60 data distributions.
Consequently, the maximum number of concurrent, active queries in a Dedicated SQL pool is 7,680. The diagram below illustrates the calculation.
So when the concurrency limit is exceeded, the request goes into an internal queue where it waits to be processed. To workaround this limitation, you must create a Hub-and-Spoke architecture.
Finally, concurrent queries can decrease when users are assigned to higher resource classes or when the DWUs are lowered.
Serverless SQL Pool
Maximum concurrency is not limited to the Serverless SQL Pool. It depends on the query complexity and the amount of data scanned.
Other Azure Synapse’s Features
Azure Synapse Analytics have many other features, and I don’t mean to make an exhaustive list of all of them. You can find all of them in Microsoft’s documentation. Additionally, you will find a comprehensive discussion about the overall capabilities in the courses the platforms below offer.
Formal Courses on Azure Synapse
Finally, if you want to know more about Azure Synapse Analytics, and even prepare for the Microsoft Data Engineer certification, you will find the appropriate, orthodox training in the sites below:
Coursera – Official Microsoft training.
Cloud Academy – I particularly liked the course Introduction to Azure Synapse Analytics.
Leave a Reply