OTF Features to Integrate in Analytical Ecosystems: Evolution and Time Travel

Open Table Formats (OTF) belong to the Data Fabric within a Data Lakehouse, Warehouse or Lake. Put simply, they are Open File Formats (OFF) with metadata.

Data Lakehouse - Functional Elements
Functional Elements in a Data Lakehouse.
There is a high-resolution version of this infographic in my GitHub account, OTF repository.

The main differences between Open File Formats and Open Table Formats from the perspective of the integration and usage within an analytical ecosystem are:

  1. The different OTF flavours support a limited number of file formats.
  2. The metadata in OTF lets compute engines and users know:
    • Where the files for a table are located,
    • How the table is partitioned,
    • The table version, or
    • How to improve access and query performance, by using statistics, indexes and other features.
  3. OTF offer features that simplify and improve tasks on the data. With OFF, organisations would have to develop all that functionality from scratch.
  4. Depending on the Compute Engine, OTF can provide faster access than OFF, as data in OTF is structured in tabular format.
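To make the role of the metadata concrete, here is a minimal sketch of the kind of information an OTF records about a table. The structure and field names are illustrative only, not Iceberg's or Delta Lake's actual metadata layout:

```python
# Hypothetical sketch of the questions OTF metadata answers for an engine.
# Field names are illustrative, not any real format's metadata schema.

table_metadata = {
    "table": "orders",
    "version": 3,                             # current table version
    "partition_spec": ["month(order_date)"],  # how the table is partitioned
    "files": [                                # where the table's files live
        "s3://lake/orders/month=2025-07/part-001.parquet",
        "s3://lake/orders/month=2025-08/part-002.parquet",
    ],
}

def describe(metadata):
    """Answer the questions an engine asks before planning a query."""
    return (
        f"table {metadata['table']} at version {metadata['version']}, "
        f"partitioned by {', '.join(metadata['partition_spec'])}, "
        f"{len(metadata['files'])} data files"
    )

print(describe(table_metadata))
```

With plain OFF files, none of this information exists anywhere except in the engine's own conventions, which is exactly the gap OTF metadata fills.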

The different OTF flavours offer excellent features to improve performance, execution plans, access, data quality, compliance, and reliability on the data they keep (we analysed how Apache Iceberg and Delta Lake are designed, and their main characteristics, in previous posts).

In this article, I analyse the main attributes that simplify the usage of OTF within a Data Fabric and represent the key differentiators for choosing OTF over OFF.

Note that most features I describe in this post are common to Apache Iceberg and Delta Lake, and I will make clear when a feature applies to only one of them. These characteristics work quite similarly in both formats, although there may be differences in how they work due to each OTF's internal architecture and the options it offers to users. To write this article, I leaned towards the Apache Iceberg description of the features, as it was the most complete documentation I found.

Schema Evolution

Let’s think for a moment about a scenario where I keep my data in plain files, such as Open File Formats. Now, assume I decide to change the structure of the data stored in the file; for example, I add a column. To do it, I need to copy all the data to a new file with the new structure that includes the added column. It would be quite a cumbersome situation, with delays in query executions and locks on the data.

Consequently, to simplify changes in the data structure, OTF have a feature called Schema Evolution. It allows dynamic changes (i.e., SQL commands) to OTF table structures, such as adding, dropping, renaming, or modifying columns, without disrupting existing data or queries. This flexibility ensures that companies can adapt to changing business requirements without the need for extensive data migration processes.

To implement Schema Evolution, the schema changes are recorded in the Catalogue, so no data files need to be rewritten.

Both Apache Iceberg and Delta Lake OTF tables provide the Schema Evolution feature.
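A minimal sketch of the mechanism, using hypothetical structures rather than either format's real metadata layout: adding a column only updates the schema recorded in the Catalogue, and files written under the old schema remain readable, with the new column filled as NULL:

```python
# Hypothetical sketch: schema evolution as a metadata-only change.
# No data files are rewritten; old files simply lack the new column.

catalogue = {
    "schema": ["id", "name"],   # current schema
    "files": [
        {"path": "f1.parquet", "columns": ["id", "name"]},
    ],
}

def add_column(cat, column):
    """Evolve the schema in the catalogue only; data files stay untouched."""
    cat["schema"] = cat["schema"] + [column]

def read_row(cat, file_row, file_columns):
    """Project a stored row onto the current schema, NULL-filling new columns."""
    stored = dict(zip(file_columns, file_row))
    return {col: stored.get(col) for col in cat["schema"]}

add_column(catalogue, "email")

# A row written before the evolution is still readable under the new schema:
row = read_row(catalogue, (1, "Ada"), ["id", "name"])
print(row)   # {'id': 1, 'name': 'Ada', 'email': None}
```

In the real formats the projection is done by the engine at read time; the point is that the change is a cheap metadata operation rather than a rewrite.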

The diagram below illustrates Schema Evolution, where an engine seamlessly moves from reading one file to reading another, as the OTF updates the metadata to point to the new file.

OTF Schema Evolution - One Engine
Schema Evolution. There is only one engine reading from the Data Fabric.
There is a high-resolution version of this infographic in my GitHub account, OTF repository.

Naturally, Schema Evolution also supports more complex architectures, where more than one engine accesses the data without having to worry about the underlying files and their structures.

OTF Schema Evolution - Three Engines
Schema Evolution. There are three engines reading from the Data Fabric.
There is a high-resolution version of this infographic in my GitHub account, OTF repository.

Partition Evolution

In both Apache Iceberg and Delta Lake, the partitioning occurs physically, and it affects where the partitions are stored. Additionally, Apache Iceberg offers an optional transform function on the partitioning columns.

Partitioning is thus an optimisation feature. However, if the partitioning changes, the data is stored differently, so modifying the partition definition would impact the files and their locations. Consequently, it would take some time for the system to update the data, which would jeopardise query response times and may cause locks.

To prevent such a scenario, the Partition Evolution feature allows dynamic changes (i.e., SQL commands) to the OTF table partition schema to improve performance. To achieve it, the changes in the partitioning are performed in the Catalogue.

This feature ensures that companies can change their tables as their queries require it without having to rewrite entire tables to change the partition schema.

Both Apache Iceberg and Delta Lake OTF tables provide the Partition Evolution feature.

The diagram below shows a table initially partitioned by month. Eventually, the table evolved to be partitioned by day from a particular date onwards. The previously written data remains in month partitions, while the new data is in day partitions. When a query spans both, the engine prepares an execution plan for each date range based on the partition scheme that applies to it.

OTF Partition Evolution
Partition Evolution.
There is a high-resolution version of this infographic in my GitHub account, OTF repository.
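The scenario in the diagram can be sketched as follows, with hypothetical structures rather than either format's real metadata: each data file is tagged with the partition spec it was written under, and the engine builds one sub-plan per spec:

```python
# Hypothetical sketch: partition evolution as metadata recording which
# partition spec applies to which files. Old data stays in month partitions;
# data written after the change lands in day partitions.

partition_specs = [
    {"spec_id": 0, "transform": "month"},   # original spec
    {"spec_id": 1, "transform": "day"},     # spec after the evolution
]

files = [
    {"path": "orders/month=2025-06/f1.parquet", "spec_id": 0},
    {"path": "orders/month=2025-07/f2.parquet", "spec_id": 0},
    {"path": "orders/day=2025-08-01/f3.parquet", "spec_id": 1},
    {"path": "orders/day=2025-08-02/f4.parquet", "spec_id": 1},
]

def plan_query(files):
    """Group files by partition spec, one sub-plan per scheme."""
    plans = {}
    for f in files:
        plans.setdefault(f["spec_id"], []).append(f["path"])
    return plans

for spec_id, paths in plan_query(files).items():
    transform = partition_specs[spec_id]["transform"]
    print(f"sub-plan for {transform} partitions: {len(paths)} files")
```

No file moves or rewrites are needed; the evolution is entirely a change in the Catalogue, and the planner adapts per date range.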

Sort Order Evolution

Sort Order Evolution is not a feature that simplifies integration and operations in the analytical ecosystem; it is purely a performance feature. However, I included it in this post to keep a complete view of all the Evolution features.

Data in sorted OTF tables is ordered by at least one column. The data is then written to storage in data files sorted by that column or columns.

In the example illustrated below, let’s first look at the unsorted data. The minimum and maximum values are checked in the Iceberg manifest files. Since they are not sorted and the custkey we’re looking for is 111029, all three files are read. However, in the sorted data, only file 2 needs to be read. Thus, sorted OTF tables can significantly improve query execution times. Furthermore, it can lead to a decrease in costs for cloud object storage if companies keep their Data Fabric in the cloud.
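The example above can be sketched with illustrative min/max statistics (the values below are made up; in Iceberg the real statistics live in the manifest files):

```python
# Hypothetical min/max custkey statistics per data file.
# Unsorted table: the ranges overlap, so no file can be ruled out.
unsorted_files = {
    "file1": (4,       149_999),
    "file2": (1,       149_998),
    "file3": (2,       150_000),
}
# Sorted table: the ranges are disjoint, so most files can be skipped.
sorted_files = {
    "file1": (1,        50_000),
    "file2": (50_001,  120_000),
    "file3": (120_001, 150_000),
}

def files_to_read(stats, value):
    """Keep only the files whose [min, max] range can contain `value`."""
    return [name for name, (lo, hi) in stats.items() if lo <= value <= hi]

print(files_to_read(unsorted_files, 111_029))  # all three files
print(files_to_read(sorted_files, 111_029))    # only file2
```

The pruning logic is the same in both cases; sorting is what makes the statistics selective enough to skip files.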

Apache Iceberg (but not Delta Lake) provides the Sort Order Evolution feature.

So, Iceberg allows for updating the sort order in an existing table, through a feature called Sort Order Evolution. When a user evolves a sort order, the old data written with an earlier order remains unchanged. Engines can choose to write data in the latest sort order or unsorted when sorting is prohibitively expensive.

To apply a new sort order to existing data, the data needs to be sorted inside the data files; in other words, those files have to be rewritten.

Sort Order Evolution.
There is a high-resolution version of this infographic in my GitHub account, OTF repository.

Time Travel

Apache Iceberg and Delta Lake can take immutable snapshots of a table, so a user can query the table’s historical state at any point in time for which a snapshot is available. This feature is called Time Travel, and it is based on the table’s snapshot metadata rather than on versioning in the object store.

OTF Time Travel
Time Travel.
There is a high-resolution version of this infographic in my GitHub account, OTF repository.
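A minimal sketch of the idea, using hypothetical snapshot structures rather than either format's real metadata: each snapshot records which data files made up the table at a point in time, and time travel simply selects the right snapshot:

```python
from datetime import datetime

# Hypothetical sketch: each snapshot records the set of data files that
# made up the table when the snapshot was taken.

snapshots = [
    {"ts": datetime(2025, 8, 1), "files": ["f1.parquet"]},
    {"ts": datetime(2025, 8, 5), "files": ["f1.parquet", "f2.parquet"]},
    {"ts": datetime(2025, 8, 9), "files": ["f3.parquet"]},  # after a rewrite
]

def snapshot_as_of(snaps, when):
    """Return the latest snapshot taken at or before `when`."""
    eligible = [s for s in snaps if s["ts"] <= when]
    if not eligible:
        raise ValueError("no snapshot exists at or before that time")
    return max(eligible, key=lambda s: s["ts"])

# Query the table as it looked on 2025-08-06:
print(snapshot_as_of(snapshots, datetime(2025, 8, 6))["files"])
# ['f1.parquet', 'f2.parquet']
```

Because old snapshots keep pointing at the old files, a historical query never touches the metadata or files of the current table version.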

The Time Travel feature is an invaluable asset for backups. Nevertheless, it cannot provide a complete Data Protection solution on its own.

In OTF, the metadata is as important as the data itself: without the metadata, it is impossible to know where the files for a table are located, where the partitions are, and which version of the table we are reading. So, when backing up their data in OTF, creating a Disaster Recovery environment, or migrating to a new platform, organisations need to plan how they copy and restore both the data and the metadata.

Furthermore, if you look at the first diagram in this article, where we show the functional elements in a Data Lakehouse, you will notice all the aspects an organisation needs to evaluate to design a Data Protection solution. If they must recover from a loss of service or migrate the ecosystem elsewhere, organisations have to keep all the data in the Data Fabric, but they must also replicate all the engines and their configurations, the metadata, security, connections, etc.


References

Apache Iceberg. (n.d.). Apache Iceberg online documentation on the Apache Iceberg official website. Retrieved in August 2025, from https://iceberg.apache.org/docs/nightly/

Delta Lake. (2025). Delta Lake online documentation on the Delta Lake official website. Retrieved in August 2025, from https://docs.delta.io/latest/index.html

Lee, D., Wentling, T., Haines, S., and Babu, P. (2024). Delta Lake: The Definitive Guide. Modern Data Lakehouse Architectures with Data Lakes. O’Reilly Media, Inc. Retrieved in August 2025, from https://delta.io/pdfs/dldg_databricks.pdf

Nats, T. (2023). Improving performance with Iceberg sorted tables on the Starburst blog. Retrieved in September 2025, from https://www.starburst.io/blog/improving-performance-with-iceberg-sorted-tables/

Shiran, T., Hughes, J. and Merced, A. (2024). Apache Iceberg: The Definitive Guide. Data Lakehouse Functionality, Performance, and Scalability on the Data Lake. O’Reilly Media, Inc.

