In this article, I will explain the basics of the Delta Lake architecture, its key features, and the write process, which is essential to how Delta tables work. This information is relevant to:
- Understand what Delta Lake OTF tables actually are,
- Access and use them,
- Troubleshoot workloads,
- Identify errors in the data and maintain its quality, and
- Integrate a Data Fabric based on Delta Lake tables with the rest of the analytical ecosystem.
Delta Lake’s Origin
Delta Lake is an open-source storage layer that supports ACID transactions. It also provides scalable metadata handling and unifies streaming and batch data processing. In other words, Delta Lake is an open table format (OTF).
In 2017, Michael Armbrust had the initial idea of bringing transactional reliability to data lakes, which led to the Tahoe project at Databricks. Just so you know, Michael is a committer/PMC member of Apache Spark™; a Delta Lake maintainer; one of the original creators of Spark SQL, Structured Streaming, and Delta Lake; and a distinguished software engineer at Databricks.
Eventually, Tahoe became Delta Lake in early 2018. Then, in 2019, Databricks donated it to the Linux Foundation as open source.
Delta Lake was initially designed to work with Apache Spark and to address the lack of transactional guarantees in Spark's file-based writes. Specifically, Delta Lake aimed to handle large-scale data operations and provide robust transactional support. This required a scalable transaction log that could handle massive data volumes and complex operations.
In plain English, we have traditionally used the Lambda architecture to design ecosystems that process data continuously and incrementally with Spark (and other engines) as new data arrives, in a cost-efficient way. However, the Lambda architecture presents several issues: performance bottlenecks when processing many small files, difficulties in repartitioning and compacting tables, and delays in publishing data. Delta Lake was created to overcome these issues.

Delta Lake Architecture
Delta Lake was designed to enhance the Hive table format so that the transaction logs adapt better to continuous and incremental workloads. For this reason, the Delta Lake architecture is basically the Hive one with a modified metadata layer:
- You can use Unity Catalog instead of the Hive Metastore, which provides more functionality, and
- It has a transaction log that Hive doesn't include.
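Conceptually, the transaction log (the `_delta_log` directory) is a sequence of numbered JSON commit files holding actions such as `add` and `remove`, and a reader reconstructs the current table state by replaying them in order. The sketch below simulates that replay with hand-written commits; the file naming and action shapes loosely follow the Delta log layout, but this is a stdlib illustration, not a Delta client:

```python
import json
import os
import tempfile

# Simulate a Delta-style _delta_log: each commit is a zero-padded,
# numbered JSON file whose lines are actions like {"add": ...} or {"remove": ...}.
log_dir = os.path.join(tempfile.mkdtemp(), "_delta_log")
os.makedirs(log_dir)

commits = [
    [{"add": {"path": "part-0000.parquet"}}],                   # commit 0: initial write
    [{"add": {"path": "part-0001.parquet"}}],                   # commit 1: append
    [{"remove": {"path": "part-0000.parquet"}},                 # commit 2: rewrite a file
     {"add": {"path": "part-0002.parquet"}}],
]
for version, actions in enumerate(commits):
    with open(os.path.join(log_dir, f"{version:020d}.json"), "w") as f:
        f.write("\n".join(json.dumps(a) for a in actions))

def current_files(log_dir: str) -> set:
    """Replay the commits in version order to get the live data files."""
    live = set()
    for name in sorted(os.listdir(log_dir)):
        with open(os.path.join(log_dir, name)) as f:
            for line in f:
                action = json.loads(line)
                if "add" in action:
                    live.add(action["add"]["path"])
                elif "remove" in action:
                    live.discard(action["remove"]["path"])
    return live

print(current_files(log_dir))  # {'part-0001.parquet', 'part-0002.parquet'}
```

Replaying up to an earlier commit instead of the whole log is, in essence, how Time Travel resolves an older table version.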
The infographic below describes the Delta Lake architecture.

Incidentally, the metastore or catalogue in a Delta Lake implementation maintains only part of the metadata needed to support Delta capabilities. Organisations still need an enterprise-level data catalogue in their ecosystem to ingest metadata (business, technical, operational) from workloads running on OTFs in general, and on Delta Lake in particular.
Delta Lake Key Features
Delta Lake, along with a Structured Streaming process, allows organisations to implement Delta architectures with Spark, whose objective is to ingest data continuously and incrementally. Its advantages over the Lambda architecture are:
- Unify batch and streaming with a continuous data flow model.
- Infinite retention to replay or reprocess historical events as required.
- Independent, elastic compute and storage to scale while balancing costs.
Notably, when rows are deleted, Delta Lake creates a new file (or files) containing the rows that don't change, rather than modifying the existing Parquet files. The reason is that creating files in the object store is faster than deleting them. Since Delta Lake creates new files with the unchanged data, it also gains the following features:
- Multiversion Concurrency Control (MVCC): a database optimisation technique that keeps copies of the data, so users can read and update data in Delta Lake safely and concurrently.
- Time Travel.
- Atomicity.
- Speed.
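The copy-on-write delete behind these features can be sketched in a few lines: instead of editing a Parquet file in place, the writer materialises the surviving rows into a new file, and the old file is left untouched so that earlier versions remain readable. In this simulation, plain Python lists stand in for Parquet files; it illustrates the mechanism, not Delta's actual code:

```python
# Copy-on-write delete: the original "files" are never modified, which is
# what keeps old table versions (Time Travel) readable after a delete.
table_version_0 = {"part-0000": [{"id": 1}, {"id": 2}, {"id": 3}]}

def delete_rows(files: dict, predicate) -> dict:
    """Return a new table version: files with matching rows are rewritten."""
    new_files = {}
    counter = len(files)
    for path, rows in files.items():
        survivors = [r for r in rows if not predicate(r)]
        if len(survivors) == len(rows):
            new_files[path] = rows               # untouched file carried over as-is
        elif survivors:
            new_path = f"part-{counter:04d}"     # survivors rewritten into a new file
            new_files[new_path] = survivors
            counter += 1
        # else: every row matched, so the file is dropped from this version
    return new_files

table_version_1 = delete_rows(table_version_0, lambda r: r["id"] == 2)
print(table_version_1)  # {'part-0001': [{'id': 1}, {'id': 3}]}
print(table_version_0)  # old version is intact: {'part-0000': [...]}
```

Because both versions coexist, concurrent readers of version 0 are never affected by the delete, which is the essence of MVCC.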
On another front, when a Delta Lake table changes its schema (through the feature known as Schema Evolution), Delta first modifies the schema in the Hive Metastore or Unity Catalog, and only later updates the table definition to the new schema. Internally, Delta tables automatically use the latest schema from the Hive Metastore or Unity Catalog, regardless of the schema stored in the table definition. However, some engines read the schema from the table definition, so they don't see the updated schema until the table definition is modified. VantageCloud, Snowflake and AWS Redshift Spectrum, among others, take the schema from the table definition.
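At its core, schema evolution is a merge of the existing schema with an incoming one: new columns are appended, existing columns keep their types, and a conflicting type change is rejected. A simplified sketch using dicts of column names to type names (real Delta schemas also cover nullability, nested fields, and type-widening rules, none of which are modelled here):

```python
# Schema evolution as a merge: append new columns, keep existing types.
# This sketch only handles flat column-name -> type-name mappings; it is
# an illustration of the idea, not Delta's schema-merging implementation.
def merge_schemas(current: dict, incoming: dict) -> dict:
    merged = dict(current)
    for column, dtype in incoming.items():
        if column not in merged:
            merged[column] = dtype  # new column: appended to the schema
        elif merged[column] != dtype:
            raise TypeError(
                f"incompatible change for '{column}': "
                f"{merged[column]} -> {dtype}")
    return merged

table_schema = {"id": "long", "name": "string"}
batch_schema = {"id": "long", "email": "string"}
print(merge_schemas(table_schema, batch_schema))
# {'id': 'long', 'name': 'string', 'email': 'string'}
```

An engine that caches the table definition would keep answering queries with `{"id", "name"}` until that definition is refreshed, which is exactly the lag described above.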
Separately, Delta Universal Format, or UniForm, simplifies the interoperability among Delta Lake, Apache Iceberg, and Apache Hudi. Note that all OTF flavours are composed of metadata and data (typically in Parquet file format). However, each OTF flavour creates, manages, and maintains the metadata differently. Delta UniForm concurrently generates Iceberg and Hudi metadata with the Delta format. Thus, Delta, Iceberg, and Hudi clients can read the data stored in Delta Lake because all of their APIs can understand the metadata.
As for the Delta Kernel, it simplifies the development of connectors by abstracting away the protocol details, so connectors do not need to understand them. The Kernel itself implements the Delta transaction log specification, so developers only need to build their connectors against the Kernel library.
Regarding storage, organisations can run Delta Lake on the object storage of the main cloud service providers (AWS S3, Azure Blob Storage, Google Cloud Storage) or on-prem, where Spark can run on a distributed file system.
As for the compute engines and how to connect them to Delta Lake, APIs are available in Java, and Delta also offers part of its functionality in Python and Scala.
Delta Lake Write Process
As we said in the previous section, when a user deletes rows, Delta Lake creates a new file (or files) with the rows that don't change, rather than modifying the existing Parquet files. This way of performing writes improves performance and enables Time Travel, among other advantages.
Below, you have an infographic that explains the Delta Lake write process at a high level.

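The atomicity of the write process rests on one primitive: the commit file for version N either appears in the log completely or not at all, and only one writer can create it. The sketch below simulates that optimistic commit using Python's exclusive-creation file mode (`open(..., "x")`) as a stand-in for an object store's put-if-absent operation; a real Delta writer would flush its Parquet data files first and only then attempt the commit:

```python
import json
import os
import tempfile

# Simulated log directory; in a real table this lives under <table>/_delta_log.
log_dir = os.path.join(tempfile.mkdtemp(), "_delta_log")
os.makedirs(log_dir)

def try_commit(log_dir: str, version: int, actions: list) -> bool:
    """Atomically publish commit `version`; fail if another writer got there first."""
    path = os.path.join(log_dir, f"{version:020d}.json")
    try:
        # Mode "x" creates the file only if it does not exist yet,
        # mirroring the put-if-absent semantics optimistic commits rely on.
        with open(path, "x") as f:
            f.write("\n".join(json.dumps(a) for a in actions))
        return True
    except FileExistsError:
        return False  # conflict: re-read the log, rebase the changes, retry

# Writer A commits version 0; writer B loses the race for the same version.
print(try_commit(log_dir, 0, [{"add": {"path": "part-0000.parquet"}}]))  # True
print(try_commit(log_dir, 0, [{"add": {"path": "part-0001.parquet"}}]))  # False
```

The losing writer's data files are simply never referenced by any commit, so readers never see a partial write.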

