[kernel-spark][Part 1] CDC streaming offset management (initial snapshot) by zikangh · Pull Request #6075 · delta-io/delta

zikangh · 2026-02-19T01:50:41Z

🥞 Stacked PR

Use this link to review incremental changes.

stack/cdf1 [Files changed]
- stack/cdf2 [Files changed]
  - stack/cdf2.5 [Files changed]
    - stack/cdf3 [Files changed]
      - stack/cdf4 [Files changed]
        
        - stack/cdf5 [Files changed]
        - stack/cdf6 [Files changed]
        - stack/cdf-outofrange [Files changed]
        - stack/cdf7 [Files changed]

Which Delta project/connector is this regarding?

Description

Adds initial snapshot write-time-CDC support to the DSv2 streaming read path (SparkMicroBatchStream), bringing it closer to DSv1 feature parity.

How was this patch tested?

Does this PR introduce any user-facing changes?

murali-db

[AI] The refactoring of IndexedFile with static factory methods and the extraction of applyBoundaryFiltering are clean improvements. The DSv1-vs-DSv2 parameterized tests provide good confidence in parity.

A few items to address:

- PR description is empty. The "Description", "How was this patch tested?", and "Does this PR introduce any user-facing changes?" sections are all blank. Per project conventions, please fill these in: what does this change do, and what breaks without it?
- Snapshot cache doesn't account for CDC mode; see inline comment on getSnapshotFiles. This is the main code concern.
- Minor nits inline on naming and doc accuracy.
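
To illustrate the snapshot cache concern, here is a minimal sketch with hypothetical names; it is not the code in this PR, and Python is used only for brevity. The assumption is that cached snapshot files are looked up by version alone, so a CDC read could be served the listing that was cached for a non-CDC read of the same version. Keying the cache on both the version and the CDC mode avoids that. `SnapshotFileCache`, `get_snapshot_files`, and `fake_loader` are all invented for this example.

```python
from typing import Callable, Dict, List, Tuple


class SnapshotFileCache:
    """Hypothetical cache of snapshot file listings, keyed by (version, cdc_mode)."""

    def __init__(self, loader: Callable[[int, bool], List[str]]) -> None:
        # loader is an assumed callback that lists files for a version and mode.
        self._loader = loader
        self._entries: Dict[Tuple[int, bool], List[str]] = {}

    def get_snapshot_files(self, version: int, cdc_mode: bool) -> List[str]:
        # Including cdc_mode in the key keeps CDC and non-CDC listings apart.
        key = (version, cdc_mode)
        if key not in self._entries:
            self._entries[key] = self._loader(version, cdc_mode)
        return self._entries[key]


def fake_loader(version: int, cdc_mode: bool) -> List[str]:
    # Stand-in for the real listing; returns distinguishable names per mode.
    suffix = "cdc" if cdc_mode else "data"
    return [f"v{version}-{suffix}-0", f"v{version}-{suffix}-1"]


cache = SnapshotFileCache(fake_loader)
plain = cache.get_snapshot_files(5, cdc_mode=False)
cdc = cache.get_snapshot_files(5, cdc_mode=True)
```

With a version-only key, the second call would have returned the `plain` listing. Whether the actual cache behaves this way depends on the inline comment on getSnapshotFiles, which is not reproduced here.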

zikangh · 2026-03-23T18:42:42Z

Range-diff: master (7a4bdb7 -> f9ee471)

.github/CODEOWNERS

@@ -0,0 +1,12 @@
+diff --git a/.github/CODEOWNERS b/.github/CODEOWNERS
+--- a/.github/CODEOWNERS
++++ b/.github/CODEOWNERS
+ /project/                       @tdas
+ /version.sbt                    @tdas
+ 
++# Spark V2 and Unified modules
++/spark/v2/                      @tdas @huan233usc @TimothyW553 @raveeram-db @murali-db
++/spark-unified/                 @tdas @huan233usc @TimothyW553 @raveeram-db @murali-db
++
+ # All files in the root directory
+ /*                              @tdas
\ No newline at end of file

.github/workflows/iceberg_test.yaml

@@ -0,0 +1,16 @@
+diff --git a/.github/workflows/iceberg_test.yaml b/.github/workflows/iceberg_test.yaml
+--- a/.github/workflows/iceberg_test.yaml
++++ b/.github/workflows/iceberg_test.yaml
+           # the above directories when we use the key for the first time. After that, each run will
+           # just use the cache. The cache is immutable so we need to use a new key when trying to
+           # cache new stuff.
+-          key: delta-sbt-cache-spark3.2-scala${{ matrix.scala }}
++          key: delta-sbt-cache-spark4.0-scala${{ matrix.scala }}
+       - name: Install Job dependencies
+         run: |
+           sudo apt-get update
+       - name: Run Scala/Java and Python tests
+         # when changing TEST_PARALLELISM_COUNT make sure to also change it in spark_master_test.yaml
+         run: |
+-          TEST_PARALLELISM_COUNT=4 pipenv run python run-tests.py --group iceberg
++          TEST_PARALLELISM_COUNT=4 pipenv run python run-tests.py --group iceberg --spark-version 4.0
\ No newline at end of file

.github/workflows/spark_examples_test.yaml

@@ -0,0 +1,54 @@
+diff --git a/.github/workflows/spark_examples_test.yaml b/.github/workflows/spark_examples_test.yaml
+--- a/.github/workflows/spark_examples_test.yaml
++++ b/.github/workflows/spark_examples_test.yaml
+         # Spark versions are dynamically generated - released versions only
+         spark_version: ${{ fromJson(needs.generate-matrix.outputs.spark_versions) }}
+         # These Scala versions must match those in the build.sbt
+-        scala: [2.13.16]
++        scala: [2.13.17]
+     env:
+       SCALA_VERSION: ${{ matrix.scala }}
+-      SPARK_VERSION: ${{ matrix.spark_version }}
+     steps:
+       - uses: actions/checkout@v3
+       - name: Get Spark version details
+         id: spark-details
+         run: |
+-          # Get JVM version, package suffix, iceberg support for this Spark version
++          # Get JVM version, package suffix, iceberg support, and full version for this Spark version
+           JVM_VERSION=$(python3 project/scripts/get_spark_version_info.py --get-field "${{ matrix.spark_version }}" targetJvm | jq -r)
+           SPARK_PACKAGE_SUFFIX=$(python3 project/scripts/get_spark_version_info.py --get-field "${{ matrix.spark_version }}" packageSuffix | jq -r)
+           SUPPORT_ICEBERG=$(python3 project/scripts/get_spark_version_info.py --get-field "${{ matrix.spark_version }}" supportIceberg | jq -r)
++          SPARK_FULL_VERSION=$(python3 project/scripts/get_spark_version_info.py --get-field "${{ matrix.spark_version }}" fullVersion | jq -r)
+           echo "jvm_version=$JVM_VERSION" >> $GITHUB_OUTPUT
+           echo "spark_package_suffix=$SPARK_PACKAGE_SUFFIX" >> $GITHUB_OUTPUT
+           echo "support_iceberg=$SUPPORT_ICEBERG" >> $GITHUB_OUTPUT
+-          echo "Using JVM $JVM_VERSION for Spark ${{ matrix.spark_version }}, package suffix: '$SPARK_PACKAGE_SUFFIX', support iceberg: '$SUPPORT_ICEBERG'"
++          echo "spark_full_version=$SPARK_FULL_VERSION" >> $GITHUB_OUTPUT
++          echo "Using JVM $JVM_VERSION for Spark $SPARK_FULL_VERSION, package suffix: '$SPARK_PACKAGE_SUFFIX', support iceberg: '$SUPPORT_ICEBERG'"
+       - name: install java
+         uses: actions/setup-java@v3
+         with:
+       - name: Run Delta Spark Local Publishing and Examples Compilation
+         # examples/scala/build.sbt will compile against the local Delta release version (e.g. 3.2.0-SNAPSHOT).
+         # Thus, we need to publishM2 first so those jars are locally accessible.
+-        # The SPARK_PACKAGE_SUFFIX env var tells examples/scala/build.sbt which artifact naming to use.
++        # -DsparkVersion is for the Delta project's publishM2 (which Spark version to compile Delta against).
++        # SPARK_VERSION/SPARK_PACKAGE_SUFFIX/SUPPORT_ICEBERG are for examples/scala/build.sbt (dependency resolution).
+         env:
+           SPARK_PACKAGE_SUFFIX: ${{ steps.spark-details.outputs.spark_package_suffix }}
+           SUPPORT_ICEBERG: ${{ steps.spark-details.outputs.support_iceberg }}
++          SPARK_VERSION: ${{ steps.spark-details.outputs.spark_full_version }}
+         run: |
+           build/sbt clean
+-          build/sbt -DsparkVersion=${{ matrix.spark_version }} publishM2
++          build/sbt -DsparkVersion=${{ steps.spark-details.outputs.spark_full_version }} publishM2
+           cd examples/scala && build/sbt "++ $SCALA_VERSION compile"
++      - name: Run UC Delta Integration Test
++        # Verifies that delta-spark resolved from Maven local includes all kernel module
++        # dependencies transitively by running a real UC-backed Delta workload.
++        env:
++          SPARK_PACKAGE_SUFFIX: ${{ steps.spark-details.outputs.spark_package_suffix }}
++          SPARK_VERSION: ${{ steps.spark-details.outputs.spark_full_version }}
++        run: |
++          cd examples/scala && build/sbt "++ $SCALA_VERSION runMain example.UnityCatalogQuickstart"
\ No newline at end of file

.github/workflows/spark_test.yaml

@@ -0,0 +1,27 @@
+diff --git a/.github/workflows/spark_test.yaml b/.github/workflows/spark_test.yaml
+--- a/.github/workflows/spark_test.yaml
++++ b/.github/workflows/spark_test.yaml
+         # These Scala versions must match those in the build.sbt
+         scala: [2.13.16]
+         # Important: This list of shards must be [0..NUM_SHARDS - 1]
+-        shard: [0, 1, 2, 3]
++        shard: [0, 1, 2, 3, 4, 5, 6, 7]
+     env:
+       SCALA_VERSION: ${{ matrix.scala }}
+       SPARK_VERSION: ${{ matrix.spark_version }}
+       # Important: This must be the same as the length of shards in matrix
+-      NUM_SHARDS: 4
++      NUM_SHARDS: 8
+     steps:
+       - uses: actions/checkout@v3
+       - name: Get Spark version details
+         # when changing TEST_PARALLELISM_COUNT make sure to also change it in spark_python_test.yaml
+         run: |
+           TEST_PARALLELISM_COUNT=4 pipenv run python run-tests.py --group spark --shard ${{ matrix.shard }} --spark-version ${{ matrix.spark_version }}
++      - name: Upload test reports
++        if: always()
++        uses: actions/upload-artifact@v4
++        with:
++          name: test-reports-spark${{ matrix.spark_version }}-shard${{ matrix.shard }}
++          path: "**/target/test-reports/*.xml"
++          retention-days: 7
\ No newline at end of file

PROTOCOL.md

@@ -0,0 +1,537 @@
+diff --git a/PROTOCOL.md b/PROTOCOL.md
+--- a/PROTOCOL.md
++++ b/PROTOCOL.md
+   - [Writer Requirements for Variant Type](#writer-requirements-for-variant-type)
+   - [Reader Requirements for Variant Data Type](#reader-requirements-for-variant-data-type)
+   - [Compatibility with other Delta Features](#compatibility-with-other-delta-features)
++- [Catalog-managed tables](#catalog-managed-tables)
++  - [Terminology: Commits](#terminology-commits)
++  - [Terminology: Delta Client](#terminology-delta-client)
++  - [Terminology: Catalogs](#terminology-catalogs)
++  - [Catalog Responsibilities](#catalog-responsibilities)
++  - [Reading Catalog-managed Tables](#reading-catalog-managed-tables)
++  - [Commit Protocol](#commit-protocol)
++  - [Getting Ratified Commits from the Catalog](#getting-ratified-commits-from-the-catalog)
++  - [Publishing Commits](#publishing-commits)
++  - [Maintenance Operations on Catalog-managed Tables](#maintenance-operations-on-catalog-managed-tables)
++  - [Creating and Dropping Catalog-managed Tables](#creating-and-dropping-catalog-managed-tables)
++  - [Catalog-managed Table Enablement](#catalog-managed-table-enablement)
++  - [Writer Requirements for Catalog-managed tables](#writer-requirements-for-catalog-managed-tables)
++  - [Reader Requirements for Catalog-managed tables](#reader-requirements-for-catalog-managed-tables)
++  - [Table Discovery](#table-discovery)
++  - [Sample Catalog Client API](#sample-catalog-client-api)
+ - [Requirements for Writers](#requirements-for-writers)
+   - [Creation of New Log Entries](#creation-of-new-log-entries)
+   - [Consistency Between Table Metadata and Data Files](#consistency-between-table-metadata-and-data-files)
+ __(1)__ `preimage` is the value before the update, `postimage` is the value after the update.
+ 
+ ### Delta Log Entries
+-Delta files are stored as JSON in a directory at the root of the table named `_delta_log`, and together with checkpoints make up the log of all changes that have occurred to a table.
+ 
+-Delta files are the unit of atomicity for a table, and are named using the next available version number, zero-padded to 20 digits.
++Delta Log Entries, also known as Delta files, are JSON files stored in the `_delta_log`
++directory at the root of the table. Together with checkpoints, they make up the log of all changes
++that have occurred to a table. Delta files are the unit of atomicity for a table, and are named
++using the next available version number, zero-padded to 20 digits.
+ 
+ For example:
+ 
+ ```
+ ./_delta_log/00000000000000000000.json
+ ```
+-Delta files use new-line delimited JSON format, where every action is stored as a single line JSON document.
+-A delta file, `n.json`, contains an atomic set of [_actions_](#Actions) that should be applied to the previous table state, `n-1.json`, in order to the construct `n`th snapshot of the table.
+-An action changes one aspect of the table's state, for example, adding or removing a file.
++
++Delta files use newline-delimited JSON format, where every action is stored as a single-line
++JSON document. A Delta file, corresponding to version `v`, contains an atomic set of
++[_actions_](#actions) that should be applied to the previous table state corresponding to version
++`v-1`, in order to construct the `v`th snapshot of the table. An action changes one aspect of the
++table's state, for example, adding or removing a file.
++
++**Note:** If the [catalogManaged table feature](#catalog-managed-tables) is enabled on the table,
++recently [ratified commits](#ratified-commit) may not yet be published to the `_delta_log` directory as normal Delta
++files - they may be stored directly by the catalog or reside in the `_delta_log/_staged_commits`
++directory. Delta clients must contact the table's managing catalog in order to find the information
++about these [ratified, potentially-unpublished commits](#publishing-commits).
++
++The `_delta_log/_staged_commits` directory is the staging area for [staged](#staged-commit)
++commits. Delta files in this directory have a UUID embedded into them and follow the pattern
++`<version>.<uuid>.json`, where the version corresponds to the proposed commit version, zero-padded
++to 20 digits.
++
++For example:
++
++```
++./_delta_log/_staged_commits/00000000000000000000.3a0d65cd-4056-49b8-937b-95f9e3ee90e5.json
++./_delta_log/_staged_commits/00000000000000000001.7d17ac10-5cc3-401b-bd1a-9c82dd2ea032.json
++./_delta_log/_staged_commits/00000000000000000001.016ae953-37a9-438e-8683-9a9a4a79a395.json
++./_delta_log/_staged_commits/00000000000000000002.3ae45b72-24e1-865a-a211-34987ae02f2a.json
++```
++
++NOTE: The (proposed) version number of a staged commit is authoritative - file
++`00000000000000000100.<uuid>.json` always corresponds to a commit attempt for version 100. Besides
++simplifying implementations, it also acknowledges the fact that commit files cannot safely be reused
++for multiple commit attempts. For example, resolving conflicts in a table with [row
++tracking](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#row-tracking) enabled requires
++rewriting all file actions to update their `baseRowId` field.
++
++The [catalog](#terminology-catalogs) is the source of truth about which staged commit files in
++the `_delta_log/_staged_commits` directory correspond to ratified versions, and Delta clients should
++not attempt to directly interpret the contents of that directory. Refer to
++[catalog-managed tables](#catalog-managed-tables) for more details.
+ 
+ ### Checkpoints
+ Checkpoints are also stored in the `_delta_log` directory, and can be created at any time, for any committed version of the table.
+ ### Commit Provenance Information
+ A delta file can optionally contain additional provenance information about what higher-level operation was being performed as well as who executed it.
+ 
++When the `catalogManaged` table feature is enabled, the `commitInfo` action must have a field
++`txnId` that stores a unique transaction identifier string.
++
+ Implementations are free to store any valid JSON-formatted data via the `commitInfo` action.
+ 
+ When [In-Commit Timestamps](#in-commit-timestamps) are enabled, writers are required to include a `commitInfo` action with every commit, which must include the `inCommitTimestamp` field. Also, the `commitInfo` action must be first action in the commit.
+  - A single `protocol` action
+  - A single `metaData` action
+  - A collection of `txn` actions with unique `appId`s
+- - A collection of `domainMetadata` actions with unique `domain`s.
++ - A collection of `domainMetadata` actions with unique `domain`s, excluding tombstones (i.e. actions with `removed=true`).
+  - A collection of `add` actions with unique path keys, corresponding to the newest (path, deletionVector.uniqueId) pair encountered for each path.
+  - A collection of `remove` actions with unique `(path, deletionVector.uniqueId)` keys. The intersection of the primary keys in the `add` collection and `remove` collection must be empty. That means a logical file cannot exist in both the `remove` and `add` collections at the same time; however, the same *data file* can exist with *different* DVs in the `remove` collection, as logically they represent different content. The `remove` actions act as _tombstones_, and only exist for the benefit of the VACUUM command. Snapshot reads only return `add` actions on the read path.
+  
+      - write a `metaData` action to add the `delta.columnMapping.mode` table property.
+  - Write data files by using the _physical name_ that is chosen for each column. The physical name of the column is static and can be different than the _display name_ of the column, which is changeable.
+  - Write the 32 bit integer column identifier as part of the `field_id` field of the `SchemaElement` struct in the [Parquet Thrift specification](https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift).
+- - Track partition values and column level statistics with the physical name of the column in the transaction log.
++ - Track partition values, column level statistics, and [clustering column](#clustered-table) names with the physical name of the column in the transaction log.
+  - Assign a globally unique identifier as the physical name for each new column that is added to the schema. This is especially important for supporting cheap column deletions in `name` mode. In addition, column identifiers need to be assigned to each column. The maximum id that is assigned to a column is tracked as the table property `delta.columnMapping.maxColumnId`. This is an internal table property that cannot be configured by users. This value must increase monotonically as new columns are introduced and committed to the table alongside the introduction of the new columns to the schema.
+ 
+ ## Reader Requirements for Column Mapping
+ ## Writer Requirement for Deletion Vectors
+ When adding a logical file with a deletion vector, then that logical file must have correct `numRecords` information for the data file in the `stats` field.
+ 
++# Catalog-managed tables
++
++With this feature enabled, the [catalog](#terminology-catalogs) that manages the table becomes the
++source of truth for whether a given commit attempt succeeded.
++
++The table feature defines the parts of the [commit protocol](#commit-protocol) that directly impact
++the Delta table (e.g. atomicity requirements, publishing, etc). The Delta client and catalog
++together are responsible for implementing the Delta-specific aspects of commit as defined by this
++spec, but are otherwise free to define their own APIs and protocols for communication with each
++other.
++
++**NOTE**: Filesystem-based access to catalog-managed tables is not supported. Delta clients are
++expected to discover and access catalog-managed tables through the managing catalog, not by direct
++listing in the filesystem. This feature is primarily designed to warn filesystem-based readers that
++might attempt to access a catalog-managed table's storage location without going through the catalog
++first, and to block filesystem-based writers who could otherwise corrupt both the table and the
++catalog by failing to commit through the catalog.
++
++Before we can go into details of this protocol feature, we must first align our terminology.
++
++## Terminology: Commits
++
++A commit is a set of [actions](#actions) that transform a Delta table from version `v - 1` to `v`.
++It contains the same kind of content as is stored in a [Delta file](#delta-log-entries).
++
++A commit may be stored in the file system as a Delta file - either _published_ or _staged_ - or
++stored _inline_ in the managing catalog, using whatever format the catalog prefers.
++
++There are several types of commits:
++
++1. **Proposed commit**:  A commit that a Delta client has proposed for the next version of the
++   table. It could be _staged_ or _inline_. It will either become _ratified_ or be rejected.
++
++2. <a name="staged-commit">**Staged commit**</a>: A commit that is written to disk at
++   `_delta_log/_staged_commits/<v>.<uuid>.json`. It has the same content and format as a published
++   Delta file.
++    - Here, the `uuid` is a random UUID that is generated for each commit and `v` is the version
++      which is proposed to be committed, zero-padded to 20 digits.
++    - The mere existence of a staged commit does not mean that the file has been ratified or even
++      proposed. It might correspond to a failed or in-progress commit attempt.
++    - The catalog is the source of truth around which staged commits are ratified.
++    - The catalog stores only the location, not the content, of a staged (and ratified) commit.
++
++3. <a name="inline-commit">**Inline commit**</a>: A proposed commit that is not written to disk but
++   rather has its content sent to the catalog for the catalog to store directly.
++
++4. <a name="ratified-commit">**Ratified commit**</a>: A proposed commit that a catalog has
++   determined has won the commit at the desired version of the table.
++    - The catalog must store ratified commits (that is, the staged commit's location or the inline
++      commit's content) until they are published to the `_delta_log` directory.
++    - A ratified commit may or may not yet be published.
++    - A ratified commit may or may not even be stored by the catalog at all - the catalog may
++      have just atomically published it to the filesystem directly, relying on PUT-if-absent
++      primitives to facilitate the ratification and publication all in one step.
++
++5. <a name="published-commit">**Published commit**</a>: A ratified commit that has been copied into
++   the `_delta_log` as a normal Delta file, i.e. `_delta_log/<v>.json`.
++    - Here, the `v` is the version which is being committed, zero-padded to 20 digits.
++    - The existence of a `<v>.json` file proves that the corresponding version `v` is ratified,
++      regardless of whether the table is catalog-managed or filesystem-based. The catalog is allowed
++      to return information about published commits, but Delta clients can also use filesystem
++      listing operations to directly discover them.
++    - Published commits do not need to be stored by the catalog.
++
++## Terminology: Delta Client
++
++This is the component that implements support for reading and writing Delta tables, and implements
++the logic required by the `catalogManaged` table feature. Among other things, it
++- triggers the filesystem listing, if needed, to discover published commits
++- generates the commit content (the set of [actions](#actions))
++- works together with the query engine to trigger the commit process and invoke the client-side
++  catalog component with the commit content
++
++The Delta client is also responsible for defining the client-side API that catalogs should target.
++That is, there must be _some_ API that the [catalog client](#catalog-client) can use to communicate
++to the Delta client the subset of catalog-managed information that the Delta client cares about.
++This protocol feature is concerned with what information Delta cares about, but leaves to Delta
++clients the design of the API they use to obtain that information from catalog clients.
++
++## Terminology: Catalogs
++
++1. **Catalog**: A catalog is an entity which manages a Delta table, including its creation, writes,
++   reads, and eventual deletion.
++    - It could be backed by a database, a filesystem, or any other persistence mechanism.
++    - Each catalog has its own spec around how catalog clients should interact with them, and how
++      they perform a commit.
++
++2. <a name="catalog-client">**Catalog Client**</a>: The catalog always has a client-side component
++   which the Delta client interacts with directly. This client-side component has two primary
++   responsibilities:
++    - implement any client-side catalog-specific logic (such as staging or
++      [publishing](#publishing-commits) commits)
++    - communicate with the Catalog Server, if any
++
++3. **Catalog Server**: The catalog may also involve a server-side component which the client-side
++   component would be responsible to communicate with.
++    - This server is responsible for coordinating commits and potentially persisting table metadata
++      and enforcing authorization policies.
++    - Not all catalogs require a server; some may be entirely client-side, e.g. filesystem-backed
++      catalogs, or they may make use of a generic database server and implement all of the catalog's
++      business logic client-side.
++
++**NOTE**: This specification outlines the responsibilities and actions that catalogs must implement.
++This spec does its best not to assume any specific catalog _implementation_, though it does call out
++likely client-side and server-side responsibilities. Nonetheless, what a given catalog does
++client-side or server-side is up to each catalog implementation to decide for itself.
++
++## Catalog Responsibilities
++
++When the `catalogManaged` table feature is enabled, a catalog performs commits to the table on behalf
++of the Delta client.
++
++As stated above, the Delta spec does not mandate any particular client-server design or API for
++catalogs that manage Delta tables. However, the catalog does need to provide certain capabilities
++for reading and writing Delta tables:
++
++- Atomically commit a version `v` with a given set of `actions`. This is explained in detail in the
++  [commit protocol](#commit-protocol) section.
++- Retrieve information about recent ratified commits and the latest ratified version on the table.
++  This is explained in detail in the [Getting Ratified Commits from the Catalog](#getting-ratified-commits-from-the-catalog) section.
++- Though not required, it is encouraged that catalogs also return the latest table-level metadata,
++  such as the latest Protocol and Metadata actions, for the table. This can provide significant
++  performance advantages to conforming Delta clients, who may forgo log replay and instead trust
++  the information provided by the catalog during query planning.
++
++## Reading Catalog-managed Tables
++
++A catalog-managed table can have a mix of (a) published and (b) ratified but non-published commits.
++The catalog is the source of truth for ratified commits. Also recall that ratified commits can be
++[staged commits](#staged-commit) that are persisted to the `_delta_log/_staged_commits` directory,
++or [inline commits](#inline-commit) whose content the catalog stores directly.
++
++For example, suppose the `_delta_log` directory contains the following files:
++
++```
++00000000000000000000.json
++00000000000000000001.json
++00000000000000000002.checkpoint.parquet
++00000000000000000002.json
++00000000000000000003.00000000000000000005.compacted.json
++00000000000000000003.json
++00000000000000000004.json
++00000000000000000005.json
++00000000000000000006.json
++00000000000000000007.json
++_staged_commits/00000000000000000007.016ae953-37a9-438e-8683-9a9a4a79a395.json // ratified and published
++_staged_commits/00000000000000000008.7d17ac10-5cc3-401b-bd1a-9c82dd2ea032.json // ratified
++_staged_commits/00000000000000000008.b91807ba-fe18-488c-a15e-c4807dbd2174.json // rejected
++_staged_commits/00000000000000000010.0f707846-cd18-4e01-b40e-84ee0ae987b0.json // not yet ratified
++_staged_commits/00000000000000000010.7a980438-cb67-4b89-82d2-86f73239b6d6.json // partial file
++```
++
++Further, suppose the catalog stores the following ratified commits:
++```
++{
++  7  -> "00000000000000000007.016ae953-37a9-438e-8683-9a9a4a79a395.json",
++  8  -> "00000000000000000008.7d17ac10-5cc3-401b-bd1a-9c82dd2ea032.json",
++  9  -> <inline commit: content stored by the catalog directly>
++}
++```
++
++Some things to note are:
++- the catalog isn't aware that commit 7 was already published - perhaps the response from the
++  filesystem was dropped
++- commit 9 is an inline commit
++- neither of the two staged commits for version 10 have been ratified
++
++To read such tables, Delta clients must first contact the catalog to get the ratified commits. This
++informs the Delta client of commits [7, 9] as well as the latest ratified version, 9.
++
++If this information is insufficient to construct a complete snapshot of the table, Delta clients
++must LIST the `_delta_log` directory to get information about the published commits. For commits
++that are both returned by the catalog and already published, Delta clients must treat the catalog's
++version as authoritative and read the commit returned by the catalog. Additionally, Delta clients
++must ignore any files with versions greater than the latest ratified commit version returned by the
++catalog.
++
++Combining these two sets of files and commits enables Delta clients to generate a snapshot at the
++latest version of the table.
++
++**NOTE**: This spec prescribes the _minimum_ required interactions between Delta clients and
++catalogs for commits. Catalogs may very well expose APIs and work with Delta clients to be
++informed of other non-commit [file types](#file-types), such as checkpoint, log
++compaction, and version checksum files. This would allow catalogs to return additional
++information to Delta clients during query and scan planning, potentially allowing Delta
++clients to avoid LISTing the filesystem altogether.
++
++## Commit Protocol
++
++To start, Delta Clients send the desired actions to be committed to the client-side component of the
++catalog.
++
++This component then has several options for proposing, ratifying, and publishing the commit,
++detailed below.
++
++- Option 1: Write the actions (likely client-side) to a [staged commit file](#staged-commit) in the
++  `_delta_log/_staged_commits` directory and then ratify the staged commit (likely server-side) by
++  atomically recording (in persistent storage of some kind) that the file corresponds to version `v`.
++- Option 2: Treat this as an [inline commit](#inline-commit) (i.e. likely that the client-side
++  component sends the contents to the server-side component) and atomically record (in persistent
++  storage of some kind) the content of the commit as version `v` of the table.
++- Option 3: Catalog implementations that use PUT-if-absent (client- or server-side) can ratify and
++  publish all-in-one by atomically writing a [published commit file](#published-commit)
++  in the `_delta_log` directory. Note that this commit will be considered to have succeeded as soon
++  as the file becomes visible in the filesystem, regardless of when or whether the catalog is made
++  aware of the successful publish. The catalog does not need to store these files.
++
++A catalog must not ratify version `v` until it has ratified version `v - 1`, and it must ratify
++version `v` at most once.
++
++The catalog must store both flavors of ratified commits (staged or inline) and make them available
++to readers until they are [published](#publishing-commits).
++
++For performance reasons, Delta clients are encouraged to establish an API contract where the catalog
++provides the latest ratified commit information whenever a commit fails due to version conflict.
++
++## Getting Ratified Commits from the Catalog
++
++Even after a commit is ratified, it is not discoverable through filesystem operations until it is
++[published](#publishing-commits).
++
++The catalog-client is responsible to implement an API (defined by the Delta client) that Delta clients can
++use to retrieve the latest ratified commit version (authoritative), as well as the set of ratified
++commits the catalog is still storing for the table. If some commits needed to complete the snapshot
++are not stored by the catalog, as they are already published, Delta clients can issue a filesystem
++LIST operation to retrieve them.
++
++Delta clients must establish an API contract where the catalog provides ratified commit information
++as part of the standard table resolution process performed at query planning time.
++
++## Publishing Commits
++
++Publishing is the process of copying the ratified commit with version `<v>` to
++`_delta_log/<v>.json`. The ratified commit may be a staged commit located in
++`_delta_log/_staged_commits/<v>.<uuid>.json`, or it may be an inline commit whose content the
++catalog stores itself. Because the content of a ratified commit is immutable, it does not matter
++whether the client-side, server-side, or both catalog components initiate publishing.
++
++Implementations are strongly encouraged to publish commits promptly. This reduces the number of
++commits the catalog needs to store internally (and serve up to readers).
++
++Commits must be published _in order_. That is, version `v - 1` must be published _before_ version
++`v`.
++
++**NOTE**: Because commit publishing can happen at any time after the commit succeeds, the file
++modification timestamp of the published file will not accurately reflect the original commit time.
++For this reason, catalog-managed tables must use [in-commit-timestamps](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#in-commit-timestamps)
++to ensure stability of time travel reads. Refer to the [Writer Requirements for Catalog-managed Tables](#writer-requirements-for-catalog-managed-tables)
++section for more details.
++
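The ratification and publishing rules in this hunk (ratify `v` only after `v - 1`, ratify each version at most once, publish strictly in order) can be illustrated with a small in-memory model. This is only a sketch: `RatificationTracker` and its methods are hypothetical names invented here and are not part of the Delta protocol or of any Delta or catalog library. A real catalog would also need durable, atomic state, which this sketch does not attempt.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory model of a catalog's bookkeeping; not a real Delta API.
final class RatificationTracker {
    private long latestRatified;  // -1 means no version has been ratified yet
    private long latestPublished; // -1 means no version has been published yet

    RatificationTracker(long latestRatified, long latestPublished) {
        if (latestPublished > latestRatified) {
            throw new IllegalArgumentException("published version cannot exceed ratified version");
        }
        this.latestRatified = latestRatified;
        this.latestPublished = latestPublished;
    }

    // Ratifies `version` only if it directly follows the latest ratified version.
    // Since the latest ratified version then advances, no version is ratified twice.
    boolean tryRatify(long version) {
        if (version != latestRatified + 1) {
            return false; // version conflict or gap
        }
        latestRatified = version;
        return true;
    }

    // Marks `version` as published only if it is ratified and version - 1 is already published.
    // Returns false when this call did not advance the published version.
    boolean tryPublish(long version) {
        if (version > latestRatified || version != latestPublished + 1) {
            return false;
        }
        latestPublished = version;
        return true;
    }

    // Ratified versions that are not yet published; the catalog must keep serving these.
    List<Long> unpublishedRatified() {
        List<Long> versions = new ArrayList<>();
        for (long v = latestPublished + 1; v <= latestRatified; v++) {
            versions.add(v);
        }
        return versions;
    }
}
```

In this model a second writer attempting the same version gets `false` from `tryRatify`, which is the point where a catalog could return the latest ratified commit information to the losing client.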
++## Maintenance Operations on Catalog-managed Tables
++
++[Checkpoints](#checkpoints-1) and [Log Compaction Files](#log-compaction-files) can only be created
++for versions that are already published in the `_delta_log`. In other words, in order to checkpoint
++version `v` or produce a log compaction file for commit range `x <= v <= y`, `_delta_log/<v>.json`
++must exist.
++
++Notably, the [Version Checksum File](#version-checksum-file) for version `v` _can_ be created in the
++`_delta_log` even if the commit for version `v` is not published.
++
++By default, maintenance operations are prohibited unless the managing catalog explicitly permits
++the client to run them. The only exceptions are checkpoints, log compaction, and version checksum,
++as they are essential for all basic table operations (e.g. reads and writes) to operate reliably.
++All other maintenance operations such as the following are not allowed by default.
++- [Log and other metadata files clean up](#metadata-cleanup).
++- Data files cleanup, for example VACUUM.
++- Data layout changes, for example OPTIMIZE and REORG.
++
++## Creating and Dropping Catalog-managed Tables
++
++The catalog and query engine ultimately dictate how to create and drop catalog-managed tables.
++
++As one example, table creation often works in three phases:
++
++1. An initial catalog operation to obtain a unique storage location which serves as an unnamed
++   "staging" table
++2. A table operation that physically initializes a new `catalogManaged`-enabled table at the staging
++   location.
++3. A final catalog operation that registers the new table with its intended name.
++
++Delta clients would primarily be involved with the second step, but an implementation could choose
++to combine the second and third steps so that a single catalog call registers the table as part of
++the table's first commit.
++
++As another example, dropping a table can be as simple as removing its name from the catalog (a "soft
++delete"), followed at some later point by a "hard delete" that physically purges the data. The Delta
++client would not be involved at all in this process, because no commits are made to the table.
++
++## Catalog-managed Table Enablement
++
++The `catalogManaged` table feature is supported and active when:
++- The table is on Reader Version 3 and Writer Version 7.
++- The table has a `protocol` action with `readerFeatures` and `writerFeatures` both containing the
++  feature `catalogManaged`.
++
++## Writer Requirements for Catalog-managed tables
++
++When supported and active:
++
++- Writers must discover and access the table using catalog calls, which happens _before_ the table's
++  protocol is known. See [Table Discovery](#table-discovery) for more details.
++- The [in-commit-timestamps](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#in-commit-timestamps)
++  table feature must be supported and active.
++- The `commitInfo` action must also contain a field `txnId` that stores a unique transaction
++  identifier string.
++- Writers must follow the catalog's [commit protocol](#commit-protocol) and must not perform
++  ordinary filesystem-based commits against the table.
++- Writers must follow the catalog's [maintenance operation protocol](#maintenance-operations-on-catalog-managed-tables).
++
++## Reader Requirements for Catalog-managed tables
++
++When supported and active:
++
++- Readers must discover the table using catalog calls, which happens before the table's protocol
++  is known. See [Table Discovery](#table-discovery) for more details.
++- Readers must contact the catalog for information about unpublished ratified commits.
++- Readers must follow the rules described in the [Reading Catalog-managed Tables](#reading-catalog-managed-tables)
++  section above. Notably:
++  - If the catalog says `v` is the latest version, clients must ignore any later versions that may
++    have been published.
++  - When the catalog returns a ratified commit for version `v`, readers must use that
++    catalog-supplied commit and ignore any published Delta file for version `v` that might also be
++    present.
++
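The two reader rules listed above (ignore anything newer than the catalog's latest version, and prefer a catalog-supplied commit over a published file for the same version) can be sketched as a small resolution function. The names `CommitResolver`, `resolve`, and `publishedPath` are hypothetical and exist only for this illustration. The sketch also assumes the published file name is the 20-digit zero-padded version, as in the golden-table paths elsewhere in this diff, and it represents each catalog commit as an opaque string.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical helper; not part of Delta Kernel or any catalog client.
final class CommitResolver {

    // Chooses, for each version in [startVersion, latestVersion], where to read the commit from.
    // catalogCommits: version -> staged path or inline content returned by the catalog.
    // publishedVersions: versions found by a filesystem LIST of _delta_log.
    static Map<Long, String> resolve(
            long startVersion,
            long latestVersion,
            Map<Long, String> catalogCommits,
            Set<Long> publishedVersions) {
        Map<Long, String> sources = new LinkedHashMap<>();
        // Versions above latestVersion are never visited, so later published files are ignored.
        for (long v = startVersion; v <= latestVersion; v++) {
            String fromCatalog = catalogCommits.get(v);
            if (fromCatalog != null) {
                // A catalog-supplied commit wins even if a published file also exists.
                sources.put(v, fromCatalog);
            } else if (publishedVersions.contains(v)) {
                sources.put(v, publishedPath(v));
            } else {
                throw new IllegalStateException("No commit available for version " + v);
            }
        }
        return sources;
    }

    // Assumed naming: _delta_log/<20-digit zero-padded version>.json
    static String publishedPath(long version) {
        return String.format("_delta_log/%020d.json", version);
    }
}
```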
++## Table Discovery
++
++The requirements above state that readers and writers must discover and access the table using
++catalog calls, which occurs _before_ the table's protocol is known. This raises an important
++question: how can a client discover a `catalogManaged` Delta table without first knowing that it
++_is_, in fact, `catalogManaged` (according to the protocol)?
++
++To solve this, first note that, in practice, catalog-integrated engines already ask the catalog to
++resolve a table name to its storage location during the name resolution step. This protocol
++therefore encourages that the same name resolution step also indicate whether the table is
++catalog-managed. Surfacing this at the very moment the catalog returns the path imposes no extra
++round-trips, yet it lets the client decide — early and unambiguously — whether to follow the
++`catalogManaged` read and write rules.
++
++## Sample Catalog Client API
++
++The following is an example of a possible API which a Java-based Delta client might require catalog
++implementations to target:
++
++```scala
++
++trait CatalogManagedTable {
++    /**
++     * Commits the given set of `actions` to the given commit `version`.
++     *
++     * @param version The version we want to commit.
++     * @param actions Actions that need to be committed.
++     *
++     * @return CommitResponse which has details around the new committed delta file.
++     */
++    def commit(
++        version: Long,
++        actions: Iterator[String]): CommitResponse
++
++    /**
++     * Retrieves a (possibly empty) suffix of ratified commits in the range [startVersion,
++     * endVersion] for this table.
++     * 
++     * Some of these ratified commits may already have been published. Some of them may be staged,
++     * in which case the staged commit file path is returned; others may be inline, in which case
++     * the inline commit content is returned.
++     * 
++     * The returned commits are sorted in ascending version number and are contiguous.
++     *
++     * If neither start nor end version is specified, the catalog will return all available ratified
++     * commits (possibly empty, if all commits have been published).
++     *
++     * In all cases, the response also includes the table's latest ratified commit version.
++     *
++     * @return GetCommitsResponse which contains an ordered list of ratified commits
++     *         stored by the catalog, as well as table's latest commit version.
++     */
++    def getRatifiedCommits(
++        startVersion: Option[Long],
++        endVersion: Option[Long]): GetCommitsResponse
++}
++```
++
++Note that the above is only one example of a possible Catalog Client API. It is also _NOT_ a catalog
++API (no table discovery, ACL, create/drop, etc). The Delta protocol is agnostic to API details, and
++the API surface Delta clients define should only cover the specific catalog capabilities that Delta
++client needs to correctly read and write catalog-managed tables.
++
+ # Iceberg Compatibility V1
+ 
+ This table feature (`icebergCompatV1`) ensures that Delta tables can be converted to Apache Iceberg™ format, though this table feature does not implement or specify that conversion.
+  * Files that have been [added](#Add-File-and-Remove-File) and not yet removed
+  * Files that were recently [removed](#Add-File-and-Remove-File) and have not yet expired
+  * [Transaction identifiers](#Transaction-Identifiers)
+- * [Domain Metadata](#Domain-Metadata)
++ * [Domain Metadata](#Domain-Metadata) that have not been removed (i.e. excluding tombstones with `removed=true`)
+  * [Checkpoint Metadata](#checkpoint-metadata) - Requires [V2 checkpoints](#v2-spec)
+  * [Sidecar File](#sidecar-files) - Requires [V2 checkpoints](#v2-spec)
+ 
+ 1. Identify a threshold (in days) up to which we want to preserve the deltaLog. Let's refer to
+ midnight UTC of that day as `cutOffTimestamp`. The newest commit not newer than the `cutOffTimestamp` is
+ the `cutOffCommit`, because a commit exactly at midnight is an acceptable cutoff. We want to retain everything including and after the `cutOffCommit`.
+-2. Identify the newest checkpoint that is not newer than the `cutOffCommit`. A checkpoint at the `cutOffCommit` is ideal, but an older one will do. Lets call it `cutOffCheckpoint`.
+-We need to preserve the `cutOffCheckpoint` (both the checkpoint file and the JSON commit file at that version) and all commits after it. The JSON commit file at the `cutOffCheckpoint` version must be preserved because checkpoints do not preserve [commit provenance information](#commit-provenance-information) (e.g., `commitInfo` actions), which may be required by table features such as [In-Commit Timestamps](#in-commit-timestamps). All commits after `cutOffCheckpoint` must be preserved to enable time travel for commits between `cutOffCheckpoint` and the next available checkpoint.
+-3. Delete all [delta log entries](#delta-log-entries) and [checkpoint files](#checkpoints) before the
+-`cutOffCheckpoint` checkpoint. Also delete all the [log compaction files](#log-compaction-files) having startVersion <= `cutOffCheckpoint`'s version.
++2. Identify the newest checkpoint that is not newer than the `cutOffCommit`. A checkpoint at the `cutOffCommit` is ideal, but an older one will do. Let's call it `cutOffCheckpoint`.
++We need to preserve the `cutOffCheckpoint` (both the checkpoint file and the JSON commit file at that version) and all published commits after it. The JSON commit file at the `cutOffCheckpoint` version must be preserved because checkpoints do not preserve [commit provenance information](#commit-provenance-information) (e.g., `commitInfo` actions), which may be required by table features such as [In-Commit Timestamps](#in-commit-timestamps). All published commits after `cutOffCheckpoint` must be preserved to enable time travel for commits between `cutOffCheckpoint` and the next available checkpoint.
++    - If no `cutOffCheckpoint` can be found, do not proceed with metadata cleanup as there is
++      nothing to clean up.
++3. Delete all [delta log entries](#delta-log-entries), [checkpoint files](#checkpoints), and
++   [version checksum files](#version-checksum-file) before the `cutOffCheckpoint` checkpoint. Also delete all the [log compaction files](#log-compaction-files)
++   having startVersion <= `cutOffCheckpoint`'s version.
++    - Also delete all the [staged commit files](#staged-commit) having version <=
++      `cutOffCheckpoint`'s version from the `_delta_log/_staged_commits` directory.
+ 4. Now read all the available [checkpoints](#checkpoints-1) in the _delta_log directory and identify
+ the corresponding [sidecar files](#sidecar-files). These sidecar files need to be protected.
+ 5. List all the files in `_delta_log/_sidecars` directory, preserve files that are less than a day
+ [Timestamp without Timezone](#timestamp-without-timezone-timestampNtz) | `timestampNtz` | Readers and writers
+ [Domain Metadata](#domain-metadata) | `domainMetadata` | Writers only
+ [V2 Checkpoint](#v2-checkpoint-table-feature) | `v2Checkpoint` | Readers and writers
++[Catalog-managed Tables](#catalog-managed-tables) | `catalogManaged` | Readers and writers
+ [Iceberg Compatibility V1](#iceberg-compatibility-v1) | `icebergCompatV1` | Writers only
+ [Iceberg Compatibility V2](#iceberg-compatibility-v2) | `icebergCompatV2` | Writers only
+ [Clustered Table](#clustered-table) | `clustering` | Writers only
\ No newline at end of file
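
The revised metadata cleanup steps in the PROTOCOL.md hunk above (pick `cutOffCommit`, then the newest checkpoint not newer than it, then delete older entries) might be sketched as follows. `MetadataCleanupPlanner` and its method names are hypothetical, and the sketch assumes commit timestamps never decrease as versions increase. It covers only the cutoff selection and the two version comparisons, not sidecar or log compaction handling.

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.NavigableSet;
import java.util.OptionalLong;

// Hypothetical planner for illustration; not an API from Delta.
final class MetadataCleanupPlanner {

    // commitTimestamps: version -> commit timestamp (millis), assumed non-decreasing by version.
    // Returns the cutOffCheckpoint version, or empty if cleanup must not proceed.
    static OptionalLong findCutOffCheckpoint(
            NavigableMap<Long, Long> commitTimestamps,
            NavigableSet<Long> checkpointVersions,
            long cutOffTimestamp) {
        Long cutOffCommit = null;
        // Walk versions in ascending order and keep the newest commit not newer than the cutoff.
        for (Map.Entry<Long, Long> entry : commitTimestamps.entrySet()) {
            if (entry.getValue() <= cutOffTimestamp) {
                cutOffCommit = entry.getKey();
            } else {
                break;
            }
        }
        if (cutOffCommit == null) {
            return OptionalLong.empty();
        }
        // Newest checkpoint that is not newer than the cutOffCommit.
        Long checkpoint = checkpointVersions.floor(cutOffCommit);
        return checkpoint == null ? OptionalLong.empty() : OptionalLong.of(checkpoint);
    }

    // Published commits, checkpoints, and checksum files strictly before the checkpoint may go.
    static boolean canDeletePublishedFile(long version, long cutOffCheckpoint) {
        return version < cutOffCheckpoint;
    }

    // Staged commit files at or before the checkpoint version may go.
    static boolean canDeleteStagedCommit(long version, long cutOffCheckpoint) {
        return version <= cutOffCheckpoint;
    }
}
```

The asymmetry between `<` and `<=` mirrors the text: the published JSON commit at the `cutOffCheckpoint` version is kept, while the staged copy of that same version is no longer needed.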

README.md

@@ -0,0 +1,10 @@
+diff --git a/README.md b/README.md
+--- a/README.md
++++ b/README.md
+ ## Building
+ 
+ Delta Lake is compiled using [SBT](https://www.scala-sbt.org/1.x/docs/Command-Line-Reference.html).
++Ensure that your Java version is at least 17 (you can verify with `java -version`).
+ 
+ To compile, run
+ 
\ No newline at end of file

build.sbt

@@ -0,0 +1,218 @@
+diff --git a/build.sbt b/build.sbt
+--- a/build.sbt
++++ b/build.sbt
+       allMappings.distinct
+     },
+ 
+-    // Exclude internal modules from published POM
++    // Exclude internal modules from published POM and add kernel dependencies.
++    // Kernel modules are transitive through sparkV2 (an internal module), so they
++    // are lost when sparkV2 is filtered out. We re-add them explicitly here.
+     pomPostProcess := { node =>
+       val internalModules = internalModuleNames.value
++      val ver = version.value
+       import scala.xml._
+       import scala.xml.transform._
++
++      def kernelDependencyNode(artifactId: String): Elem = {
++        <dependency>
++          <groupId>io.delta</groupId>
++          <artifactId>{artifactId}</artifactId>
++          <version>{ver}</version>
++        </dependency>
++      }
++
++      val kernelDeps = Seq(
++        kernelDependencyNode("delta-kernel-api"),
++        kernelDependencyNode("delta-kernel-defaults"),
++        kernelDependencyNode("delta-kernel-unitycatalog")
++      )
++
+       new RuleTransformer(new RewriteRule {
+         override def transform(n: Node): Seq[Node] = n match {
+-          case e: Elem if e.label == "dependency" =>
+-            val artifactId = (e \ "artifactId").text
+-            // Check if artifactId starts with any internal module name
+-            // (e.g., "delta-spark-v1_4.1_2.13" starts with "delta-spark-v1")
+-            val isInternal = internalModules.exists(module => artifactId.startsWith(module))
+-            if (isInternal) Seq.empty else Seq(n)
++          case e: Elem if e.label == "dependencies" =>
++            val filtered = e.child.filter {
++              case child: Elem if child.label == "dependency" =>
++                val artifactId = (child \ "artifactId").text
++                !internalModules.exists(module => artifactId.startsWith(module))
++              case _ => true
++            }
++            Seq(e.copy(child = filtered ++ kernelDeps))
+           case _ => Seq(n)
+         }
+       }).transform(node).head
+     commonSettings,
+     scalaStyleSettings,
+     releaseSettings,
+-    CrossSparkVersions.sparkDependentModuleName(sparkVersion),
++    // Set sparkVersion directly (not sparkDependentModuleName) so that
++    // runOnlyForReleasableSparkModules discovers this module, but without adding a Spark
++    // suffix to the artifact name. delta-contribs is only published as delta-contribs_2.13.
++    sparkVersion := CrossSparkVersions.getSparkVersion(),
+     Compile / packageBin / mappings := (Compile / packageBin / mappings).value ++
+       listPythonFiles(baseDirectory.value.getParentFile / "python"),
+ 
+   ).configureUnidoc()
+ 
+ 
+-val unityCatalogVersion = "0.3.1"
++val unityCatalogVersion = "0.4.0"
+ val sparkUnityCatalogJacksonVersion = "2.15.4" // We are using Spark 4.0's Jackson version 2.15.x, to override Unity Catalog 0.3.0's version 2.18.x
+ 
+ lazy val sparkUnityCatalog = (project in file("spark/unitycatalog"))
+     libraryDependencies ++= Seq(
+       "org.apache.spark" %% "spark-sql" % sparkVersion.value % "provided",
+ 
+-      "io.delta" %% "delta-sharing-client" % "1.3.9",
++      "io.delta" %% "delta-sharing-client" % "1.3.10",
+ 
+       // Test deps
+       "org.scalatest" %% "scalatest" % scalaTestVersion % "test",
+ 
+       // Test Deps
+       "org.scalatest" %% "scalatest" % scalaTestVersion % "test",
++      // Jackson datatype module needed for UC SDK tests (excluded from main compile scope)
++      "com.fasterxml.jackson.datatype" % "jackson-datatype-jsr310" % "2.15.4" % "test",
+     ),
+ 
+     // Unidoc settings
+     commonSettings,
+     scalaStyleSettings,
+     releaseSettings,
+-    CrossSparkVersions.sparkDependentModuleName(sparkVersion),
++    // Set sparkVersion directly (not sparkDependentModuleName) so that
++    // runOnlyForReleasableSparkModules discovers this module, but without adding a Spark
++    // suffix to the artifact name. delta-iceberg is only published as delta-iceberg_2.13.
++    sparkVersion := CrossSparkVersions.getSparkVersion(),
+     libraryDependencies ++= {
+       if (supportIceberg) {
+         Seq(
+           "org.xerial" % "sqlite-jdbc" % "3.45.0.0" % "test",
+           "org.apache.httpcomponents.core5" % "httpcore5" % "5.2.4" % "test",
+           "org.apache.httpcomponents.client5" % "httpclient5" % "5.3.1" % "test",
+-          "org.apache.iceberg" %% icebergSparkRuntimeArtifactName % "1.10.0" % "provided"
++          "org.apache.iceberg" %% icebergSparkRuntimeArtifactName % "1.10.0" % "provided",
++          // For FixedGcsAccessTokenProvider (GCS server-side planning credentials)
++          "com.google.cloud.bigdataoss" % "util-hadoop" % "hadoop3-2.2.26" % "provided"
+         )
+       } else {
+         Seq.empty
+   )
+ // scalastyle:on println
+ 
+-val icebergShadedVersion = "1.10.0"
++val icebergShadedVersion = "1.10.1"
+ lazy val icebergShaded = (project in file("icebergShaded"))
+   .dependsOn(spark % "provided")
+   .disablePlugins(JavaFormatterPlugin, ScalafmtPlugin)
+     commonSettings,
+     scalaStyleSettings,
+     releaseSettings,
+-    CrossSparkVersions.sparkDependentSettings(sparkVersion),
+-    libraryDependencies ++= Seq(
+-      "org.apache.hudi" % "hudi-java-client" % "0.15.0" % "compile" excludeAll(
+-        ExclusionRule(organization = "org.apache.hadoop"),
+-        ExclusionRule(organization = "org.apache.zookeeper"),
+-      ),
+-      "org.apache.spark" %% "spark-avro" % sparkVersion.value % "test" excludeAll ExclusionRule(organization = "org.apache.hadoop"),
+-      "org.apache.parquet" % "parquet-avro" % "1.12.3" % "compile"
+-    ),
++    // Set sparkVersion directly (not sparkDependentModuleName) so that
++    // runOnlyForReleasableSparkModules discovers this module, but without adding a Spark
++    // suffix to the artifact name. delta-hudi is only published as delta-hudi_2.13.
++    sparkVersion := CrossSparkVersions.getSparkVersion(),
++    libraryDependencies ++= {
++      if (supportHudi) {
++        Seq(
++          "org.apache.hudi" % "hudi-java-client" % "0.15.0" % "compile" excludeAll(
++            ExclusionRule(organization = "org.apache.hadoop"),
++            ExclusionRule(organization = "org.apache.zookeeper"),
++          ),
++          "org.apache.spark" %% "spark-avro" % sparkVersion.value % "test" excludeAll ExclusionRule(organization = "org.apache.hadoop"),
++          "org.apache.parquet" % "parquet-avro" % "1.12.3" % "compile"
++        )
++      } else {
++        Seq.empty
++      }
++    },
++    // Skip compilation and publishing when supportHudi is false
++    Compile / skip := !supportHudi,
++    Test / skip := !supportHudi,
++    publish / skip := !supportHudi,
++    publishLocal / skip := !supportHudi,
++    publishM2 / skip := !supportHudi,
+     assembly / assemblyJarName := s"${name.value}-assembly_${scalaBinaryVersion.value}-${version.value}.jar",
+     assembly / logLevel := Level.Info,
+     assembly / test := {},
+       // crossScalaVersions must be set to Nil on the aggregating project
+       crossScalaVersions := Nil,
+       publishArtifact := false,
+-      publish / skip := false,
++      publish / skip := true,
+     )
+ }
+ 
+       // crossScalaVersions must be set to Nil on the aggregating project
+       crossScalaVersions := Nil,
+       publishArtifact := false,
+-      publish / skip := false,
++      publish / skip := true,
+     )
+ }
+ 
+     // crossScalaVersions must be set to Nil on the aggregating project
+     crossScalaVersions := Nil,
+     publishArtifact := false,
+-    publish / skip := false,
++    publish / skip := true,
+     unidocSourceFilePatterns := {
+       (kernelApi / unidocSourceFilePatterns).value.scopeToProject(kernelApi) ++
+       (kernelDefaults / unidocSourceFilePatterns).value.scopeToProject(kernelDefaults)
+     // crossScalaVersions must be set to Nil on the aggregating project
+     crossScalaVersions := Nil,
+     publishArtifact := false,
+-    publish / skip := false,
++    publish / skip := true,
+   )
+ 
+ /*
+     sys.env.getOrElse("SONATYPE_USERNAME", ""),
+     sys.env.getOrElse("SONATYPE_PASSWORD", "")
+   ),
++  credentials += Credentials(
++    "Sonatype Nexus Repository Manager",
++    "central.sonatype.com",
++    sys.env.getOrElse("SONATYPE_USERNAME", ""),
++    sys.env.getOrElse("SONATYPE_PASSWORD", "")
++  ),
+   publishTo := {
+     val ossrhBase = "https://ossrh-staging-api.central.sonatype.com/"
++    val centralSnapshots = "https://central.sonatype.com/repository/maven-snapshots/"
+     if (isSnapshot.value) {
+-      Some("snapshots" at ossrhBase + "content/repositories/snapshots")
++      Some("snapshots" at centralSnapshots)
+     } else {
+       Some("releases"  at ossrhBase + "service/local/staging/deploy/maven2")
+     }
+ // Looks like some of release settings should be set for the root project as well.
+ publishArtifact := false  // Don't release the root project
+ publish / skip := true
+-publishTo := Some("snapshots" at "https://ossrh-staging-api.central.sonatype.com/content/repositories/snapshots")
++publishTo := Some("snapshots" at "https://central.sonatype.com/repository/maven-snapshots/")
+ releaseCrossBuild := false  // Don't use sbt-release's cross facility
+ releaseProcess := Seq[ReleaseStep](
+   checkSnapshotDependencies,
+   setReleaseVersion,
+   commitReleaseVersion,
+   tagRelease
+-) ++ CrossSparkVersions.crossSparkReleaseSteps("+publishSigned") ++ Seq[ReleaseStep](
++) ++ CrossSparkVersions.crossSparkReleaseSteps("publishSigned") ++ Seq[ReleaseStep](
+ 
+   // Do NOT use `sonatypeBundleRelease` - it will actually release to Maven! We want to do that
+   // manually.
\ No newline at end of file
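
The `pomPostProcess` change in the build.sbt hunk above filters internal modules out of the dependency list by artifact-id prefix and then appends the kernel artifacts. The same logic, reduced to plain lists of artifact ids, could look like the sketch below. `PomDependencyFilter` and `rewrite` are hypothetical names; the real build operates on `scala.xml` nodes rather than strings.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical string-level model of the POM rewrite; the actual build uses scala.xml.
final class PomDependencyFilter {

    // Drops every artifact id that starts with an internal module name,
    // then re-adds the kernel artifacts that were only reachable through an internal module.
    static List<String> rewrite(
            List<String> artifactIds,
            List<String> internalModules,
            List<String> kernelArtifacts) {
        List<String> result = new ArrayList<>();
        for (String artifactId : artifactIds) {
            boolean isInternal = internalModules.stream().anyMatch(artifactId::startsWith);
            if (!isInternal) {
                result.add(artifactId);
            }
        }
        result.addAll(kernelArtifacts);
        return result;
    }
}
```

For example, given `delta-spark-v1_4.1_2.13` and `spark-sql_2.13` with `delta-spark-v1` as an internal module, only `spark-sql_2.13` survives the filter before the kernel artifacts are appended.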

connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/.00000000000000000000.json.crc

@@ -0,0 +1,3 @@
+diff --git a/connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/.00000000000000000000.json.crc b/connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/.00000000000000000000.json.crc
+new file mode 100644
+Binary files /dev/null and b/connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/.00000000000000000000.json.crc differ
\ No newline at end of file

connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/00000000000000000000.crc

@@ -0,0 +1,5 @@
+diff --git a/connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/00000000000000000000.crc b/connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/00000000000000000000.crc
+new file mode 100644
+--- /dev/null
++++ b/connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/00000000000000000000.crc
++{"txnId":"6132e880-0f3a-4db4-b882-1da039bffbad","tableSizeBytes":0,"numFiles":0,"numMetadata":1,"numProtocol":1,"setTransactions":[],"domainMetadata":[],"metadata":{"id":"0eb3e007-b3cc-40e4-bca1-a5970d86b5a6","format":{"provider":"parquet","options":{}},"schemaString":"{\"type\":\"struct\",\"fields\":[{\"name\":\"id\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}},{\"name\":\"utf8_binary_col\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"utf8_lcase_col\",\"type\":\"string\",\"nullable\":true,\"metadata\":{\"__COLLATIONS\":{\"utf8_lcase_col\":\"spark.UTF8_LCASE\"}}},{\"name\":\"unicode_col\",\"type\":\"string\",\"nullable\":true,\"metadata\":{\"__COLLATIONS\":{\"unicode_col\":\"icu.UNICODE\"}}}]}","partitionColumns":[],"configuration":{},"createdTime":1773779518731},"protocol":{"minReaderVersion":1,"minWriterVersion":7,"writerFeatures":["domainMetadata","collations-preview","appendOnly","invariants"]},"histogramOpt":{"sortedBinBoundaries":[0,8192,16384,32768,65536,131072,262144,524288,1048576,2097152,4194304,8388608,12582912,16777216,20971520,25165824,29360128,33554432,37748736,41943040,50331648,58720256,67108864,75497472,83886080,92274688,100663296,109051904,117440512,125829120,130023424,134217728,138412032,142606336,146800640,150994944,167772160,184549376,201326592,218103808,234881024,251658240,268435456,285212672,301989888,318767104,335544320,352321536,369098752,385875968,402653184,419430400,436207616,452984832,469762048,486539264,503316480,520093696,536870912,553648128,570425344,587202560,603979776,671088640,738197504,805306368,872415232,939524096,1006632960,1073741824,1140850688,1207959552,1275068416,1342177280,1409286144,1476395008,1610612736,1744830464,1879048192,2013265920,2147483648,2415919104,2684354560,2952790016,3221225472,3489660928,3758096384,4026531840,4294967296,8589934592,17179869184,34359738368,68719476736,137438953472,274877906944],"fileCounts":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],"totalBytes":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]},"allFiles":[]}
\ No newline at end of file

... (truncated, output exceeded 60000 bytes)

_{Reproduce locally: git range-diff f1f8e5f..7a4bdb7 d1139d2..f9ee471 | Disable: git config gitstack.push-range-diff false}

zikangh · 2026-03-23T20:58:25Z

Hi @murali-db, I have addressed the AI comments. PTAL.

zikangh · 2026-03-23T21:16:25Z

Range-diff: master (f9ee471 -> 2001a72)

.github/CODEOWNERS

@@ -0,0 +1,12 @@
+diff --git a/.github/CODEOWNERS b/.github/CODEOWNERS
+--- a/.github/CODEOWNERS
++++ b/.github/CODEOWNERS
+ /project/                       @tdas
+ /version.sbt                    @tdas
+ 
++# Spark V2 and Unified modules
++/spark/v2/                      @tdas @huan233usc @TimothyW553 @raveeram-db @murali-db
++/spark-unified/                 @tdas @huan233usc @TimothyW553 @raveeram-db @murali-db
++
+ # All files in the root directory
+ /*                              @tdas
\ No newline at end of file

.github/workflows/iceberg_test.yaml

@@ -0,0 +1,16 @@
+diff --git a/.github/workflows/iceberg_test.yaml b/.github/workflows/iceberg_test.yaml
+--- a/.github/workflows/iceberg_test.yaml
++++ b/.github/workflows/iceberg_test.yaml
+           # the above directories when we use the key for the first time. After that, each run will
+           # just use the cache. The cache is immutable so we need to use a new key when trying to
+           # cache new stuff.
+-          key: delta-sbt-cache-spark3.2-scala${{ matrix.scala }}
++          key: delta-sbt-cache-spark4.0-scala${{ matrix.scala }}
+       - name: Install Job dependencies
+         run: |
+           sudo apt-get update
+       - name: Run Scala/Java and Python tests
+         # when changing TEST_PARALLELISM_COUNT make sure to also change it in spark_master_test.yaml
+         run: |
+-          TEST_PARALLELISM_COUNT=4 pipenv run python run-tests.py --group iceberg
++          TEST_PARALLELISM_COUNT=4 pipenv run python run-tests.py --group iceberg --spark-version 4.0
\ No newline at end of file

.github/workflows/spark_examples_test.yaml

@@ -0,0 +1,54 @@
+diff --git a/.github/workflows/spark_examples_test.yaml b/.github/workflows/spark_examples_test.yaml
+--- a/.github/workflows/spark_examples_test.yaml
++++ b/.github/workflows/spark_examples_test.yaml
+         # Spark versions are dynamically generated - released versions only
+         spark_version: ${{ fromJson(needs.generate-matrix.outputs.spark_versions) }}
+         # These Scala versions must match those in the build.sbt
+-        scala: [2.13.16]
++        scala: [2.13.17]
+     env:
+       SCALA_VERSION: ${{ matrix.scala }}
+-      SPARK_VERSION: ${{ matrix.spark_version }}
+     steps:
+       - uses: actions/checkout@v3
+       - name: Get Spark version details
+         id: spark-details
+         run: |
+-          # Get JVM version, package suffix, iceberg support for this Spark version
++          # Get JVM version, package suffix, iceberg support, and full version for this Spark version
+           JVM_VERSION=$(python3 project/scripts/get_spark_version_info.py --get-field "${{ matrix.spark_version }}" targetJvm | jq -r)
+           SPARK_PACKAGE_SUFFIX=$(python3 project/scripts/get_spark_version_info.py --get-field "${{ matrix.spark_version }}" packageSuffix | jq -r)
+           SUPPORT_ICEBERG=$(python3 project/scripts/get_spark_version_info.py --get-field "${{ matrix.spark_version }}" supportIceberg | jq -r)
++          SPARK_FULL_VERSION=$(python3 project/scripts/get_spark_version_info.py --get-field "${{ matrix.spark_version }}" fullVersion | jq -r)
+           echo "jvm_version=$JVM_VERSION" >> $GITHUB_OUTPUT
+           echo "spark_package_suffix=$SPARK_PACKAGE_SUFFIX" >> $GITHUB_OUTPUT
+           echo "support_iceberg=$SUPPORT_ICEBERG" >> $GITHUB_OUTPUT
+-          echo "Using JVM $JVM_VERSION for Spark ${{ matrix.spark_version }}, package suffix: '$SPARK_PACKAGE_SUFFIX', support iceberg: '$SUPPORT_ICEBERG'"
++          echo "spark_full_version=$SPARK_FULL_VERSION" >> $GITHUB_OUTPUT
++          echo "Using JVM $JVM_VERSION for Spark $SPARK_FULL_VERSION, package suffix: '$SPARK_PACKAGE_SUFFIX', support iceberg: '$SUPPORT_ICEBERG'"
+       - name: install java
+         uses: actions/setup-java@v3
+         with:
+       - name: Run Delta Spark Local Publishing and Examples Compilation
+         # examples/scala/build.sbt will compile against the local Delta release version (e.g. 3.2.0-SNAPSHOT).
+         # Thus, we need to publishM2 first so those jars are locally accessible.
+-        # The SPARK_PACKAGE_SUFFIX env var tells examples/scala/build.sbt which artifact naming to use.
++        # -DsparkVersion is for the Delta project's publishM2 (which Spark version to compile Delta against).
++        # SPARK_VERSION/SPARK_PACKAGE_SUFFIX/SUPPORT_ICEBERG are for examples/scala/build.sbt (dependency resolution).
+         env:
+           SPARK_PACKAGE_SUFFIX: ${{ steps.spark-details.outputs.spark_package_suffix }}
+           SUPPORT_ICEBERG: ${{ steps.spark-details.outputs.support_iceberg }}
++          SPARK_VERSION: ${{ steps.spark-details.outputs.spark_full_version }}
+         run: |
+           build/sbt clean
+-          build/sbt -DsparkVersion=${{ matrix.spark_version }} publishM2
++          build/sbt -DsparkVersion=${{ steps.spark-details.outputs.spark_full_version }} publishM2
+           cd examples/scala && build/sbt "++ $SCALA_VERSION compile"
++      - name: Run UC Delta Integration Test
++        # Verifies that delta-spark resolved from Maven local includes all kernel module
++        # dependencies transitively by running a real UC-backed Delta workload.
++        env:
++          SPARK_PACKAGE_SUFFIX: ${{ steps.spark-details.outputs.spark_package_suffix }}
++          SPARK_VERSION: ${{ steps.spark-details.outputs.spark_full_version }}
++        run: |
++          cd examples/scala && build/sbt "++ $SCALA_VERSION runMain example.UnityCatalogQuickstart"
\ No newline at end of file

.github/workflows/spark_test.yaml

@@ -0,0 +1,27 @@
+diff --git a/.github/workflows/spark_test.yaml b/.github/workflows/spark_test.yaml
+--- a/.github/workflows/spark_test.yaml
++++ b/.github/workflows/spark_test.yaml
+         # These Scala versions must match those in the build.sbt
+         scala: [2.13.16]
+         # Important: This list of shards must be [0..NUM_SHARDS - 1]
+-        shard: [0, 1, 2, 3]
++        shard: [0, 1, 2, 3, 4, 5, 6, 7]
+     env:
+       SCALA_VERSION: ${{ matrix.scala }}
+       SPARK_VERSION: ${{ matrix.spark_version }}
+       # Important: This must be the same as the length of shards in matrix
+-      NUM_SHARDS: 4
++      NUM_SHARDS: 8
+     steps:
+       - uses: actions/checkout@v3
+       - name: Get Spark version details
+         # when changing TEST_PARALLELISM_COUNT make sure to also change it in spark_python_test.yaml
+         run: |
+           TEST_PARALLELISM_COUNT=4 pipenv run python run-tests.py --group spark --shard ${{ matrix.shard }} --spark-version ${{ matrix.spark_version }}
++      - name: Upload test reports
++        if: always()
++        uses: actions/upload-artifact@v4
++        with:
++          name: test-reports-spark${{ matrix.spark_version }}-shard${{ matrix.shard }}
++          path: "**/target/test-reports/*.xml"
++          retention-days: 7
\ No newline at end of file

PROTOCOL.md

@@ -0,0 +1,537 @@
+diff --git a/PROTOCOL.md b/PROTOCOL.md
+--- a/PROTOCOL.md
++++ b/PROTOCOL.md
+   - [Writer Requirements for Variant Type](#writer-requirements-for-variant-type)
+   - [Reader Requirements for Variant Data Type](#reader-requirements-for-variant-data-type)
+   - [Compatibility with other Delta Features](#compatibility-with-other-delta-features)
++- [Catalog-managed tables](#catalog-managed-tables)
++  - [Terminology: Commits](#terminology-commits)
++  - [Terminology: Delta Client](#terminology-delta-client)
++  - [Terminology: Catalogs](#terminology-catalogs)
++  - [Catalog Responsibilities](#catalog-responsibilities)
++  - [Reading Catalog-managed Tables](#reading-catalog-managed-tables)
++  - [Commit Protocol](#commit-protocol)
++  - [Getting Ratified Commits from the Catalog](#getting-ratified-commits-from-the-catalog)
++  - [Publishing Commits](#publishing-commits)
++  - [Maintenance Operations on Catalog-managed Tables](#maintenance-operations-on-catalog-managed-tables)
++  - [Creating and Dropping Catalog-managed Tables](#creating-and-dropping-catalog-managed-tables)
++  - [Catalog-managed Table Enablement](#catalog-managed-table-enablement)
++  - [Writer Requirements for Catalog-managed tables](#writer-requirements-for-catalog-managed-tables)
++  - [Reader Requirements for Catalog-managed tables](#reader-requirements-for-catalog-managed-tables)
++  - [Table Discovery](#table-discovery)
++  - [Sample Catalog Client API](#sample-catalog-client-api)
+ - [Requirements for Writers](#requirements-for-writers)
+   - [Creation of New Log Entries](#creation-of-new-log-entries)
+   - [Consistency Between Table Metadata and Data Files](#consistency-between-table-metadata-and-data-files)
+ __(1)__ `preimage` is the value before the update, `postimage` is the value after the update.
+ 
+ ### Delta Log Entries
+-Delta files are stored as JSON in a directory at the root of the table named `_delta_log`, and together with checkpoints make up the log of all changes that have occurred to a table.
+ 
+-Delta files are the unit of atomicity for a table, and are named using the next available version number, zero-padded to 20 digits.
++Delta Log Entries, also known as Delta files, are JSON files stored in the `_delta_log`
++directory at the root of the table. Together with checkpoints, they make up the log of all changes
++that have occurred to a table. Delta files are the unit of atomicity for a table, and are named
++using the next available version number, zero-padded to 20 digits.
+ 
+ For example:
+ 
+ ```
+ ./_delta_log/00000000000000000000.json
+ ```
+-Delta files use new-line delimited JSON format, where every action is stored as a single line JSON document.
+-A delta file, `n.json`, contains an atomic set of [_actions_](#Actions) that should be applied to the previous table state, `n-1.json`, in order to the construct `n`th snapshot of the table.
+-An action changes one aspect of the table's state, for example, adding or removing a file.
++
++Delta files use newline-delimited JSON format, where every action is stored as a single-line
++JSON document. A Delta file, corresponding to version `v`, contains an atomic set of
++[_actions_](#actions) that should be applied to the previous table state corresponding to version
++`v-1`, in order to construct the `v`th snapshot of the table. An action changes one aspect of the
++table's state, for example, adding or removing a file.
++
++**Note:** If the [catalogManaged table feature](#catalog-managed-tables) is enabled on the table,
++recently [ratified commits](#ratified-commit) may not yet be published to the `_delta_log` directory as normal Delta
++files - they may be stored directly by the catalog or reside in the `_delta_log/_staged_commits`
++directory. Delta clients must contact the table's managing catalog in order to find the information
++about these [ratified, potentially-unpublished commits](#publishing-commits).
++
++The `_delta_log/_staged_commits` directory is the staging area for [staged](#staged-commit)
++commits. Delta files in this directory have a UUID embedded into them and follow the pattern
++`<version>.<uuid>.json`, where the version corresponds to the proposed commit version, zero-padded
++to 20 digits.
++
++For example:
++
++```
++./_delta_log/_staged_commits/00000000000000000000.3a0d65cd-4056-49b8-937b-95f9e3ee90e5.json
++./_delta_log/_staged_commits/00000000000000000001.7d17ac10-5cc3-401b-bd1a-9c82dd2ea032.json
++./_delta_log/_staged_commits/00000000000000000001.016ae953-37a9-438e-8683-9a9a4a79a395.json
++./_delta_log/_staged_commits/00000000000000000002.3ae45b72-24e1-865a-a211-34987ae02f2a.json
++```
++
++NOTE: The (proposed) version number of a staged commit is authoritative - file
++`00000000000000000100.<uuid>.json` always corresponds to a commit attempt for version 100. Besides
++simplifying implementations, it also acknowledges the fact that commit files cannot safely be reused
++for multiple commit attempts. For example, resolving conflicts in a table with [row
++tracking](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#row-tracking) enabled requires
++rewriting all file actions to update their `baseRowId` field.
++
++The [catalog](#terminology-catalogs) is the source of truth about which staged commit files in
++the `_delta_log/_staged_commits` directory correspond to ratified versions, and Delta clients should
++not attempt to directly interpret the contents of that directory. Refer to
++[catalog-managed tables](#catalog-managed-tables) for more details.
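
The naming rules above (20-digit zero-padded versions, and `<version>.<uuid>.json` for staged commits) can be sketched as follows. This is only an illustration: the helper names `published_name`, `staged_name`, and `parse_staged_name` are hypothetical and not part of any Delta client API, and the UUID pattern assumes the lowercase hyphenated form used in the examples.

```python
import re

# Assumed pattern: 20-digit version, a lowercase hyphenated UUID, then ".json".
STAGED_RE = re.compile(
    r"^(\d{20})\.([0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})\.json$"
)


def published_name(version: int) -> str:
    """File name of a published commit, relative to _delta_log."""
    return f"{version:020d}.json"


def staged_name(version: int, commit_uuid: str) -> str:
    """File name of a staged commit, relative to _delta_log."""
    return f"_staged_commits/{version:020d}.{commit_uuid}.json"


def parse_staged_name(file_name: str):
    """Return (proposed_version, uuid) for a staged commit file name, else None."""
    match = STAGED_RE.match(file_name)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)
```

Parsing a name only recovers the proposed version; whether that file was ratified still has to come from the catalog.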
+ 
+ ### Checkpoints
+ Checkpoints are also stored in the `_delta_log` directory, and can be created at any time, for any committed version of the table.
+ ### Commit Provenance Information
+ A delta file can optionally contain additional provenance information about what higher-level operation was being performed as well as who executed it.
+ 
++When the `catalogManaged` table feature is enabled, the `commitInfo` action must have a field
++`txnId` that stores a unique transaction identifier string.
++
+ Implementations are free to store any valid JSON-formatted data via the `commitInfo` action.
+ 
+ When [In-Commit Timestamps](#in-commit-timestamps) are enabled, writers are required to include a `commitInfo` action with every commit, which must include the `inCommitTimestamp` field. Also, the `commitInfo` action must be the first action in the commit.
+  - A single `protocol` action
+  - A single `metaData` action
+  - A collection of `txn` actions with unique `appId`s
+- - A collection of `domainMetadata` actions with unique `domain`s.
++ - A collection of `domainMetadata` actions with unique `domain`s, excluding tombstones (i.e. actions with `removed=true`).
+  - A collection of `add` actions with unique path keys, corresponding to the newest (path, deletionVector.uniqueId) pair encountered for each path.
+  - A collection of `remove` actions with unique `(path, deletionVector.uniqueId)` keys. The intersection of the primary keys in the `add` collection and `remove` collection must be empty. That means a logical file cannot exist in both the `remove` and `add` collections at the same time; however, the same *data file* can exist with *different* DVs in the `remove` collection, as logically they represent different content. The `remove` actions act as _tombstones_, and only exist for the benefit of the VACUUM command. Snapshot reads only return `add` actions on the read path.
+  
+      - write a `metaData` action to add the `delta.columnMapping.mode` table property.
+  - Write data files by using the _physical name_ that is chosen for each column. The physical name of the column is static and can be different than the _display name_ of the column, which is changeable.
+  - Write the 32 bit integer column identifier as part of the `field_id` field of the `SchemaElement` struct in the [Parquet Thrift specification](https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift).
+- - Track partition values and column level statistics with the physical name of the column in the transaction log.
++ - Track partition values, column level statistics, and [clustering column](#clustered-table) names with the physical name of the column in the transaction log.
+  - Assign a globally unique identifier as the physical name for each new column that is added to the schema. This is especially important for supporting cheap column deletions in `name` mode. In addition, column identifiers need to be assigned to each column. The maximum id that is assigned to a column is tracked as the table property `delta.columnMapping.maxColumnId`. This is an internal table property that cannot be configured by users. This value must increase monotonically as new columns are introduced and committed to the table alongside the introduction of the new columns to the schema.
+ 
+ ## Reader Requirements for Column Mapping
+ ## Writer Requirement for Deletion Vectors
+ When adding a logical file with a deletion vector, then that logical file must have correct `numRecords` information for the data file in the `stats` field.
+ 
++# Catalog-managed tables
++
++With this feature enabled, the [catalog](#terminology-catalogs) that manages the table becomes the
++source of truth for whether a given commit attempt succeeded.
++
++The table feature defines the parts of the [commit protocol](#commit-protocol) that directly impact
++the Delta table (e.g. atomicity requirements, publishing, etc). The Delta client and catalog
++together are responsible for implementing the Delta-specific aspects of commit as defined by this
++spec, but are otherwise free to define their own APIs and protocols for communication with each
++other.
++
++**NOTE**: Filesystem-based access to catalog-managed tables is not supported. Delta clients are
++expected to discover and access catalog-managed tables through the managing catalog, not by direct
++listing in the filesystem. This feature is primarily designed to warn filesystem-based readers that
++might attempt to access a catalog-managed table's storage location without going through the catalog
++first, and to block filesystem-based writers who could otherwise corrupt both the table and the
++catalog by failing to commit through the catalog.
++
++Before we can go into details of this protocol feature, we must first align our terminology.
++
++## Terminology: Commits
++
++A commit is a set of [actions](#actions) that transform a Delta table from version `v - 1` to `v`.
++It contains the same kind of content as is stored in a [Delta file](#delta-log-entries).
++
++A commit may be stored in the file system as a Delta file - either _published_ or _staged_ - or
++stored _inline_ in the managing catalog, using whatever format the catalog prefers.
++
++There are several types of commits:
++
++1. **Proposed commit**: A commit that a Delta client has proposed for the next version of the
++   table. It could be _staged_ or _inline_. It will either become _ratified_ or be rejected.
++
++2. <a name="staged-commit">**Staged commit**</a>: A commit that is written to disk at
++   `_delta_log/_staged_commits/<v>.<uuid>.json`. It has the same content and format as a published
++   Delta file.
++    - Here, the `uuid` is a random UUID that is generated for each commit and `v` is the version
++      which is proposed to be committed, zero-padded to 20 digits.
++    - The mere existence of a staged commit does not mean that the file has been ratified or even
++      proposed. It might correspond to a failed or in-progress commit attempt.
++    - The catalog is the source of truth for which staged commits are ratified.
++    - The catalog stores only the location, not the content, of a staged (and ratified) commit.
++
++3. <a name="inline-commit">**Inline commit**</a>: A proposed commit that is not written to disk but
++   rather has its content sent to the catalog for the catalog to store directly.
++
++4. <a name="ratified-commit">**Ratified commit**</a>: A proposed commit that a catalog has
++   determined has won the commit at the desired version of the table.
++    - The catalog must store ratified commits (that is, the staged commit's location or the inline
++      commit's content) until they are published to the `_delta_log` directory.
++    - A ratified commit may or may not yet be published.
++    - A ratified commit may or may not even be stored by the catalog at all - the catalog may
++      have just atomically published it to the filesystem directly, relying on PUT-if-absent
++      primitives to facilitate the ratification and publication all in one step.
++
++5. <a name="published-commit">**Published commit**</a>: A ratified commit that has been copied into
++   the `_delta_log` as a normal Delta file, i.e. `_delta_log/<v>.json`.
++    - Here, the `v` is the version which is being committed, zero-padded to 20 digits.
++    - The existence of a `<v>.json` file proves that the corresponding version `v` is ratified,
++      regardless of whether the table is catalog-managed or filesystem-based. The catalog is allowed
++      to return information about published commits, but Delta clients can also use filesystem
++      listing operations to directly discover them.
++    - Published commits do not need to be stored by the catalog.
++
++## Terminology: Delta Client
++
++This is the component that implements support for reading and writing Delta tables, and implements
++the logic required by the `catalogManaged` table feature. Among other things, it
++- triggers the filesystem listing, if needed, to discover published commits
++- generates the commit content (the set of [actions](#actions))
++- works together with the query engine to trigger the commit process and invoke the client-side
++  catalog component with the commit content
++
++The Delta client is also responsible for defining the client-side API that catalogs should target.
++That is, there must be _some_ API that the [catalog client](#catalog-client) can use to communicate
++to the Delta client the subset of catalog-managed information that the Delta client cares about.
++This protocol feature is concerned with what information Delta cares about, but leaves to Delta
++clients the design of the API they use to obtain that information from catalog clients.
++
++## Terminology: Catalogs
++
++1. **Catalog**: A catalog is an entity which manages a Delta table, including its creation, writes,
++   reads, and eventual deletion.
++    - It could be backed by a database, a filesystem, or any other persistence mechanism.
++    - Each catalog has its own spec around how catalog clients should interact with it, and how
++      they perform a commit.
++
++2. <a name="catalog-client">**Catalog Client**</a>: The catalog always has a client-side component
++   which the Delta client interacts with directly. This client-side component has two primary
++   responsibilities:
++    - implement any client-side catalog-specific logic (such as staging or
++      [publishing](#publishing-commits) commits)
++    - communicate with the Catalog Server, if any
++
++3. **Catalog Server**: The catalog may also involve a server-side component with which the
++   client-side component is responsible for communicating.
++    - This server is responsible for coordinating commits and potentially persisting table metadata
++      and enforcing authorization policies.
++    - Not all catalogs require a server; some may be entirely client-side, e.g. filesystem-backed
++      catalogs, or they may make use of a generic database server and implement all of the catalog's
++      business logic client-side.
++
++**NOTE**: This specification outlines the responsibilities and actions that catalogs must implement.
++This spec does its best not to assume any specific catalog _implementation_, though it does call out
++likely client-side and server-side responsibilities. Nonetheless, what a given catalog does
++client-side or server-side is up to each catalog implementation to decide for itself.
++
++## Catalog Responsibilities
++
++When the `catalogManaged` table feature is enabled, a catalog performs commits to the table on behalf
++of the Delta client.
++
++As stated above, the Delta spec does not mandate any particular client-server design or API for
++catalogs that manage Delta tables. However, the catalog does need to provide certain capabilities
++for reading and writing Delta tables:
++
++- Atomically commit a version `v` with a given set of `actions`. This is explained in detail in the
++  [commit protocol](#commit-protocol) section.
++- Retrieve information about recent ratified commits and the latest ratified version on the table.
++  This is explained in detail in the [Getting Ratified Commits from the Catalog](#getting-ratified-commits-from-the-catalog) section.
++- Though not required, it is encouraged that catalogs also return the latest table-level metadata,
++  such as the latest Protocol and Metadata actions, for the table. This can provide significant
++  performance advantages to conforming Delta clients, who may forgo log replay and instead trust
++  the information provided by the catalog during query planning.
++
++## Reading Catalog-managed Tables
++
++A catalog-managed table can have a mix of (a) published and (b) ratified but non-published commits.
++The catalog is the source of truth for ratified commits. Also recall that ratified commits can be
++[staged commits](#staged-commit) that are persisted to the `_delta_log/_staged_commits` directory,
++or [inline commits](#inline-commit) whose content the catalog stores directly.
++
++For example, suppose the `_delta_log` directory contains the following files:
++
++```
++00000000000000000000.json
++00000000000000000001.json
++00000000000000000002.checkpoint.parquet
++00000000000000000002.json
++00000000000000000003.00000000000000000005.compacted.json
++00000000000000000003.json
++00000000000000000004.json
++00000000000000000005.json
++00000000000000000006.json
++00000000000000000007.json
++_staged_commits/00000000000000000007.016ae953-37a9-438e-8683-9a9a4a79a395.json // ratified and published
++_staged_commits/00000000000000000008.7d17ac10-5cc3-401b-bd1a-9c82dd2ea032.json // ratified
++_staged_commits/00000000000000000008.b91807ba-fe18-488c-a15e-c4807dbd2174.json // rejected
++_staged_commits/00000000000000000010.0f707846-cd18-4e01-b40e-84ee0ae987b0.json // not yet ratified
++_staged_commits/00000000000000000010.7a980438-cb67-4b89-82d2-86f73239b6d6.json // partial file
++```
++
++Further, suppose the catalog stores the following ratified commits:
++```
++{
++  7  -> "00000000000000000007.016ae953-37a9-438e-8683-9a9a4a79a395.json",
++  8  -> "00000000000000000008.7d17ac10-5cc3-401b-bd1a-9c82dd2ea032.json",
++  9  -> <inline commit: content stored by the catalog directly>
++}
++```
++
++Some things to note are:
++- the catalog isn't aware that commit 7 was already published - perhaps the response from the
++  filesystem was dropped
++- commit 9 is an inline commit
++- neither of the two staged commits for version 10 has been ratified
++
++To read such tables, Delta clients must first contact the catalog to get the ratified commits. This
++informs the Delta client of commits 7 through 9 as well as the latest ratified version, 9.
++
++If this information is insufficient to construct a complete snapshot of the table, Delta clients
++must LIST the `_delta_log` directory to get information about the published commits. For commits
++that are both returned by the catalog and already published, Delta clients must treat the catalog's
++version as authoritative and read the commit returned by the catalog. Additionally, Delta clients
++must ignore any files with versions greater than the latest ratified commit version returned by the
++catalog.
++
++Combining these two sets of files and commits enables Delta clients to generate a snapshot at the
++latest version of the table.
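
The reading rules above can be sketched roughly as below. This is a simplified illustration rather than how any particular Delta client implements it: `resolve_commits` is a hypothetical helper, checkpoints and log compaction files are ignored, and the stray published version 11 is invented here only to exercise the "ignore later versions" rule.

```python
def resolve_commits(published, catalog_commits, latest_ratified):
    """Pick the source of each commit needed for a snapshot at latest_ratified.

    published:       versions found by LISTing _delta_log (as <v>.json files)
    catalog_commits: {version: location or inline marker} returned by the catalog
    latest_ratified: latest ratified version reported by the catalog
    """
    resolved = {}
    for version in published:
        # Ignore files newer than the catalog's latest ratified version.
        if version <= latest_ratified:
            resolved[version] = f"_delta_log/{version:020d}.json"
    # For versions known to both, the catalog's commit is authoritative.
    for version, source in catalog_commits.items():
        resolved[version] = source
    return dict(sorted(resolved.items()))


published = list(range(0, 8)) + [11]  # 11 is a hypothetical stray file
catalog_commits = {
    7: "_delta_log/_staged_commits/00000000000000000007.016ae953-37a9-438e-8683-9a9a4a79a395.json",
    8: "_delta_log/_staged_commits/00000000000000000008.7d17ac10-5cc3-401b-bd1a-9c82dd2ea032.json",
    9: "<inline commit>",
}
snapshot_commits = resolve_commits(published, catalog_commits, latest_ratified=9)
```

With the example state above, versions 0 through 6 come from published files, versions 7 through 9 come from the catalog, and version 11 is dropped.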
++
++**NOTE**: This spec prescribes the _minimum_ required interactions between Delta clients and
++catalogs for commits. Catalogs may very well expose APIs and work with Delta clients to be
++informed of other non-commit [file types](#file-types), such as checkpoint, log
++compaction, and version checksum files. This would allow catalogs to return additional
++information to Delta clients during query and scan planning, potentially allowing Delta
++clients to avoid LISTing the filesystem altogether.
++
++## Commit Protocol
++
++To start, Delta Clients send the desired actions to be committed to the client-side component of the
++catalog.
++
++This component then has several options for proposing, ratifying, and publishing the commit,
++detailed below.
++
++- Option 1: Write the actions (likely client-side) to a [staged commit file](#staged-commit) in the
++  `_delta_log/_staged_commits` directory and then ratify the staged commit (likely server-side) by
++  atomically recording (in persistent storage of some kind) that the file corresponds to version `v`.
++- Option 2: Treat this as an [inline commit](#inline-commit) (i.e. the client-side component
++  likely sends the contents to the server-side component) and atomically record (in persistent
++  storage of some kind) the content of the commit as version `v` of the table.
++- Option 3: Catalog implementations that use PUT-if-absent (client- or server-side) can ratify and
++  publish all-in-one by atomically writing a [published commit file](#published-commit)
++  in the `_delta_log` directory. Note that this commit will be considered to have succeeded as soon
++  as the file becomes visible in the filesystem, regardless of when or whether the catalog is made
++  aware of the successful publish. The catalog does not need to store these files.
++
++A catalog must not ratify version `v` until it has ratified version `v - 1`, and it must ratify
++version `v` at most once.
++
++The catalog must store both flavors of ratified commits (staged or inline) and make them available
++to readers until they are [published](#publishing-commits).
++
++For performance reasons, Delta clients are encouraged to establish an API contract where the catalog
++provides the latest ratified commit information whenever a commit fails due to version conflict.
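
The ratification rules in this section (ratify `v` only after `v - 1`, at most once, and report the latest ratified version on conflict) can be illustrated with a toy in-memory catalog. `ToyCatalog` and its `ratify` method are hypothetical names; a real catalog would back this with persistent storage and its own API, and the lock here merely stands in for whatever atomic primitive that storage provides.

```python
import threading


class ToyCatalog:
    """In-memory stand-in for the server-side ratification step."""

    def __init__(self):
        self._lock = threading.Lock()
        self._ratified = {}        # version -> staged location or inline content
        self.latest_ratified = -1  # no versions ratified yet

    def ratify(self, version, commit):
        """Try to ratify `commit` as `version`.

        Returns (True, version) on success, or (False, latest_ratified) when
        the version is not exactly latest_ratified + 1.
        """
        with self._lock:
            if version != self.latest_ratified + 1:
                # Either a gap (v - 1 not ratified) or v was already ratified.
                return False, self.latest_ratified
            self._ratified[version] = commit
            self.latest_ratified = version
            return True, version
```

Returning the latest ratified version with a rejection lets the client rebase and retry without an extra round-trip, which is the contract the last paragraph encourages.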
++
++## Getting Ratified Commits from the Catalog
++
++Even after a commit is ratified, it is not discoverable through filesystem operations until it is
++[published](#publishing-commits).
++
++The catalog client is responsible for implementing an API (defined by the Delta client) that Delta clients can
++use to retrieve the latest ratified commit version (authoritative), as well as the set of ratified
++commits the catalog is still storing for the table. If some commits needed to complete the snapshot
++are not stored by the catalog, as they are already published, Delta clients can issue a filesystem
++LIST operation to retrieve them.
++
++Delta clients must establish an API contract where the catalog provides ratified commit information
++as part of the standard table resolution process performed at query planning time.
++
++## Publishing Commits
++
++Publishing is the process of copying the ratified commit with version `<v>` to
++`_delta_log/<v>.json`. The ratified commit may be a staged commit located in
++`_delta_log/_staged_commits/<v>.<uuid>.json`, or it may be an inline commit whose content the
++catalog stores itself. Because the content of a ratified commit is immutable, it does not matter
++whether the client-side, server-side, or both catalog components initiate publishing.
++
++Implementations are strongly encouraged to publish commits promptly. This reduces the number of
++commits the catalog needs to store internally (and serve up to readers).
++
++Commits must be published _in order_. That is, version `v - 1` must be published _before_ version
++`v`.
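
A minimal sketch of in-order publishing, assuming a local filesystem where exclusive-create (`open(..., "xb")`) stands in for a PUT-if-absent primitive. `publish_in_order` is a hypothetical helper, and ratified commit content is passed in as bytes for simplicity; a real publisher would also need the write itself to be atomic.

```python
import os


def publish_in_order(log_dir, ratified, last_published):
    """Publish ratified commits in ascending order, stopping at the first gap.

    ratified:       {version: commit content as bytes}
    last_published: highest version already present as <v>.json
    Returns the new highest published version.
    """
    version = last_published + 1
    while version in ratified:
        target = os.path.join(log_dir, f"{version:020d}.json")
        try:
            # Exclusive create: fails if the file already exists.
            with open(target, "xb") as f:
                f.write(ratified[version])
        except FileExistsError:
            # Someone else already published it; ratified content is immutable.
            pass
        last_published = version
        version += 1
    return last_published
```

Because the loop stops at the first missing version, version `v` is never written before `v - 1`.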
++
++**NOTE**: Because commit publishing can happen at any time after the commit succeeds, the file
++modification timestamp of the published file will not accurately reflect the original commit time.
++For this reason, catalog-managed tables must use [in-commit-timestamps](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#in-commit-timestamps)
++to ensure stability of time travel reads. Refer to the [Writer Requirements for Catalog-managed Tables](#writer-requirements-for-catalog-managed-tables)
++section for more details.
++
++## Maintenance Operations on Catalog-managed Tables
++
++[Checkpoints](#checkpoints-1) and [Log Compaction Files](#log-compaction-files) can only be created
++for versions that are already published in the `_delta_log`. In other words, in order to checkpoint
++version `v` or produce a log compaction file for commit range `x <= v <= y`, `_delta_log/<v>.json`
++must exist.
++
++Notably, the [Version Checksum File](#version-checksum-file) for version `v` _can_ be created in the
++`_delta_log` even if the commit for version `v` is not published.
++
++By default, maintenance operations are prohibited unless the managing catalog explicitly permits
++the client to run them. The only exceptions are checkpoints, log compaction, and version checksum,
++as they are essential for all basic table operations (e.g. reads and writes) to operate reliably.
++All other maintenance operations such as the following are not allowed by default.
++- [Log and other metadata files clean up](#metadata-cleanup).
++- Data files cleanup, for example VACUUM.
++- Data layout changes, for example OPTIMIZE and REORG.
++
++## Creating and Dropping Catalog-managed Tables
++
++The catalog and query engine ultimately dictate how to create and drop catalog-managed tables.
++
++As one example, table creation often works in three phases:
++
++1. An initial catalog operation to obtain a unique storage location which serves as an unnamed
++   "staging" table
++2. A table operation that physically initializes a new `catalogManaged`-enabled table at the staging
++   location.
++3. A final catalog operation that registers the new table with its intended name.
++
++Delta clients would primarily be involved with the second step, but an implementation could choose
++to combine the second and third steps so that a single catalog call registers the table as part of
++the table's first commit.
++
++As another example, dropping a table can be as simple as removing its name from the catalog (a "soft
++delete"), followed at some later point by a "hard delete" that physically purges the data. The Delta
++client would not be involved at all in this process, because no commits are made to the table.
++
++## Catalog-managed Table Enablement
++
++The `catalogManaged` table feature is supported and active when:
++- The table is on Reader Version 3 and Writer Version 7.
++- The table has a `protocol` action with `readerFeatures` and `writerFeatures` both containing the
++  feature `catalogManaged`.
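
The enablement condition can be expressed as a small check over a parsed `protocol` action. The function name `is_catalog_managed` is hypothetical; the field names follow the `protocol` action as described elsewhere in PROTOCOL.md.

```python
def is_catalog_managed(protocol: dict) -> bool:
    """True when the protocol action satisfies both enablement conditions."""
    reader_features = protocol.get("readerFeatures") or []
    writer_features = protocol.get("writerFeatures") or []
    return (
        protocol.get("minReaderVersion") == 3
        and protocol.get("minWriterVersion") == 7
        and "catalogManaged" in reader_features
        and "catalogManaged" in writer_features
    )
```

Note that a client normally learns that a table is catalog-managed from the catalog during name resolution, before it can read the protocol, so a check like this serves as confirmation rather than discovery.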
++
++## Writer Requirements for Catalog-managed tables
++
++When supported and active:
++
++- Writers must discover and access the table using catalog calls, which happens _before_ the table's
++  protocol is known. See [Table Discovery](#table-discovery) for more details.
++- The [in-commit-timestamps](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#in-commit-timestamps)
++  table feature must be supported and active.
++- The `commitInfo` action must also contain a field `txnId` that stores a unique transaction
++  identifier string.
++- Writers must follow the catalog's [commit protocol](#commit-protocol) and must not perform
++  ordinary filesystem-based commits against the table.
++- Writers must follow the catalog's [maintenance operation protocol](#maintenance-operations-on-catalog-managed-tables).
++
++## Reader Requirements for Catalog-managed tables
++
++When supported and active:
++
++- Readers must discover the table using catalog calls, which happens before the table's protocol
++  is known. See [Table Discovery](#table-discovery) for more details.
++- Readers must contact the catalog for information about unpublished ratified commits.
++- Readers must follow the rules described in the [Reading Catalog-managed Tables](#reading-catalog-managed-tables)
++  section above. Notably:
++  - If the catalog reports `v` as the latest version, clients must ignore any later versions that
++    may have been published.
++  - When the catalog returns a ratified commit for version `v`, readers must use that
++    catalog-supplied commit and ignore any published Delta file for version `v` that might also be
++    present.
++
++## Table Discovery
++
++The requirements above state that readers and writers must discover and access the table using
++catalog calls, which occurs _before_ the table's protocol is known. This raises an important
++question: how can a client discover a `catalogManaged` Delta table without first knowing that it
++_is_, in fact, `catalogManaged` (according to the protocol)?
++
++To solve this, first note that, in practice, catalog-integrated engines already ask the catalog to
++resolve a table name to its storage location during the name resolution step. This protocol
++therefore encourages that the same name resolution step also indicate whether the table is
++catalog-managed. Surfacing this at the very moment the catalog returns the path imposes no extra
++round-trips, yet it lets the client decide — early and unambiguously — whether to follow the
++`catalogManaged` read and write rules.
++
++## Sample Catalog Client API
++
++The following is an example of a possible API which a JVM-based Delta client might require catalog
++implementations to target:
++
++```scala
++
++trait CatalogManagedTable {
++    /**
++     * Commits the given set of `actions` to the given commit `version`.
++     *
++     * @param version The version we want to commit.
++     * @param actions Actions that need to be committed.
++     *
++     * @return CommitResponse which has details around the new committed delta file.
++     */
++    def commit(
++        version: Long,
++        actions: Iterator[String]): CommitResponse
++
++    /**
++     * Retrieves a (possibly empty) suffix of ratified commits in the range [startVersion,
++     * endVersion] for this table.
++     * 
++     * Some of these ratified commits may already have been published. Some of them may be staged,
++     * in which case the staged commit file path is returned; others may be inline, in which case
++     * the inline commit content is returned.
++     * 
++     * The returned commits are sorted in ascending version number and are contiguous.
++     *
++     * If neither start nor end version is specified, the catalog will return all available ratified
++     * commits (possibly empty, if all commits have been published).
++     *
++     * In all cases, the response also includes the table's latest ratified commit version.
++     *
++     * @return GetCommitsResponse which contains an ordered list of ratified commits
++     *         stored by the catalog, as well as the table's latest ratified commit version.
++     */
++    def getRatifiedCommits(
++        startVersion: Option[Long],
++        endVersion: Option[Long]): GetCommitsResponse
++}
++```
++
++Note that the above is only one example of a possible Catalog Client API. It is also _NOT_ a catalog
++API (no table discovery, ACL, create/drop, etc.). The Delta protocol is agnostic to API details, and
++the API surface that Delta clients define should cover only the specific catalog capabilities that a
++Delta client needs to correctly read and write catalog-managed tables.
++
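
Reviewer aid (not part of the diff): the two calls in the sample API above can be pictured with a small in-memory stand-in. This is only a sketch under stated assumptions. `InMemoryCatalogTable`, `mark_published`, and the rule that only the next version in sequence is ratified are hypothetical illustrations, not something the protocol text or any Delta library defines.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class GetCommitsResponse:
    versions: List[int]   # ratified commits still tracked by the catalog, ascending
    latest_version: int   # latest ratified version, or -1 for an empty table


@dataclass
class InMemoryCatalogTable:
    """Hypothetical in-memory catalog; every name here is illustrative."""
    unpublished: Dict[int, List[str]] = field(default_factory=dict)
    latest_version: int = -1

    def commit(self, version: int, actions: List[str]) -> int:
        # Assumption: the catalog ratifies only the next version in sequence.
        if version != self.latest_version + 1:
            raise ValueError(
                f"expected version {self.latest_version + 1}, got {version}")
        self.unpublished[version] = list(actions)
        self.latest_version = version
        return version

    def mark_published(self, up_to_version: int) -> None:
        # Assumption: a published commit no longer needs catalog tracking.
        for v in [v for v in self.unpublished if v <= up_to_version]:
            del self.unpublished[v]

    def get_ratified_commits(self,
                             start_version: Optional[int] = None,
                             end_version: Optional[int] = None) -> GetCommitsResponse:
        # Filter the tracked commits to the requested inclusive range.
        versions = sorted(
            v for v in self.unpublished
            if (start_version is None or v >= start_version)
            and (end_version is None or v <= end_version)
        )
        return GetCommitsResponse(versions, self.latest_version)


table = InMemoryCatalogTable()
for v in range(3):
    table.commit(v, [f"action-for-version-{v}"])
table.mark_published(0)
resp = table.get_ratified_commits()
print(resp.versions, resp.latest_version)
```

Even when every commit has been published and the returned list is empty, the response still carries the latest ratified version, matching the doc comment on `getRatifiedCommits`.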
+ # Iceberg Compatibility V1
+ 
+ This table feature (`icebergCompatV1`) ensures that Delta tables can be converted to Apache Iceberg™ format, though this table feature does not implement or specify that conversion.
+  * Files that have been [added](#Add-File-and-Remove-File) and not yet removed
+  * Files that were recently [removed](#Add-File-and-Remove-File) and have not yet expired
+  * [Transaction identifiers](#Transaction-Identifiers)
+- * [Domain Metadata](#Domain-Metadata)
++ * [Domain Metadata](#Domain-Metadata) that have not been removed (i.e. excluding tombstones with `removed=true`)
+  * [Checkpoint Metadata](#checkpoint-metadata) - Requires [V2 checkpoints](#v2-spec)
+  * [Sidecar File](#sidecar-files) - Requires [V2 checkpoints](#v2-spec)
+ 
+ 1. Identify a threshold (in days) up to which we want to preserve the deltaLog. Let's refer to
+ midnight UTC of that day as `cutOffTimestamp`. The newest commit not newer than the `cutOffTimestamp` is
+ the `cutOffCommit`, because a commit exactly at midnight is an acceptable cutoff. We want to retain everything including and after the `cutOffCommit`.
+-2. Identify the newest checkpoint that is not newer than the `cutOffCommit`. A checkpoint at the `cutOffCommit` is ideal, but an older one will do. Lets call it `cutOffCheckpoint`.
+-We need to preserve the `cutOffCheckpoint` (both the checkpoint file and the JSON commit file at that version) and all commits after it. The JSON commit file at the `cutOffCheckpoint` version must be preserved because checkpoints do not preserve [commit provenance information](#commit-provenance-information) (e.g., `commitInfo` actions), which may be required by table features such as [In-Commit Timestamps](#in-commit-timestamps). All commits after `cutOffCheckpoint` must be preserved to enable time travel for commits between `cutOffCheckpoint` and the next available checkpoint.
+-3. Delete all [delta log entries](#delta-log-entries) and [checkpoint files](#checkpoints) before the
+-`cutOffCheckpoint` checkpoint. Also delete all the [log compaction files](#log-compaction-files) having startVersion <= `cutOffCheckpoint`'s version.
++2. Identify the newest checkpoint that is not newer than the `cutOffCommit`. A checkpoint at the `cutOffCommit` is ideal, but an older one will do. Let's call it `cutOffCheckpoint`.
++We need to preserve the `cutOffCheckpoint` (both the checkpoint file and the JSON commit file at that version) and all published commits after it. The JSON commit file at the `cutOffCheckpoint` version must be preserved because checkpoints do not preserve [commit provenance information](#commit-provenance-information) (e.g., `commitInfo` actions), which may be required by table features such as [In-Commit Timestamps](#in-commit-timestamps). All published commits after `cutOffCheckpoint` must be preserved to enable time travel for commits between `cutOffCheckpoint` and the next available checkpoint.
++    - If no `cutOffCheckpoint` can be found, do not proceed with metadata cleanup as there is
++      nothing to clean up.
++3. Delete all [delta log entries](#delta-log-entries), [checkpoint files](#checkpoints), and
++   [version checksum files](#version-checksum-file) before the `cutOffCheckpoint` checkpoint. Also delete all the [log compaction files](#log-compaction-files)
++   having startVersion <= `cutOffCheckpoint`'s version.
++    - Also delete all the [staged commit files](#staged-commit) having version <=
++      `cutOffCheckpoint`'s version from the `_delta_log/_staged_commits` directory.
+ 4. Now read all the available [checkpoints](#checkpoints-1) in the _delta_log directory and identify
+ the corresponding [sidecar files](#sidecar-files). These sidecar files need to be protected.
+ 5. List all the files in `_delta_log/_sidecars` directory, preserve files that are less than a day
+ [Timestamp without Timezone](#timestamp-without-timezone-timestampNtz) | `timestampNtz` | Readers and writers
+ [Domain Metadata](#domain-metadata) | `domainMetadata` | Writers only
+ [V2 Checkpoint](#v2-checkpoint-table-feature) | `v2Checkpoint` | Readers and writers
++[Catalog-managed Tables](#catalog-managed-tables) | `catalogManaged` | Readers and writers
+ [Iceberg Compatibility V1](#iceberg-compatibility-v1) | `icebergCompatV1` | Writers only
+ [Iceberg Compatibility V2](#iceberg-compatibility-v2) | `icebergCompatV2` | Writers only
+ [Clustered Table](#clustered-table) | `clustering` | Writers only
\ No newline at end of file
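
Reviewer aid (not part of the diff): steps 1 to 3 of the metadata cleanup procedure in the PROTOCOL.md hunk above can be sketched as a small planning function. This is a minimal sketch. The function name, the dictionary-based inputs, and the assumption that commit timestamps never decrease as versions increase are all illustrative. Returning `None` when no commit is old enough is also an assumption, since the text only covers the case where no `cutOffCheckpoint` exists.

```python
from typing import Dict, List, Optional


def plan_log_cleanup(commit_timestamps: Dict[int, int],
                     checkpoint_versions: List[int],
                     cutoff_timestamp: int) -> Optional[dict]:
    """Pick the cutoff checkpoint and the commit versions that may be deleted.

    Returns None when there is nothing to clean up.
    """
    # Step 1: the newest commit that is not newer than the cutoff timestamp.
    eligible = [v for v, ts in commit_timestamps.items() if ts <= cutoff_timestamp]
    if not eligible:
        return None  # assumption: no commit is old enough, so nothing is deleted
    cutoff_commit = max(eligible)

    # Step 2: the newest checkpoint that is not newer than the cutoff commit.
    candidates = [v for v in checkpoint_versions if v <= cutoff_commit]
    if not candidates:
        return None  # no cutOffCheckpoint: do not proceed with cleanup
    cutoff_checkpoint = max(candidates)

    # Step 3: only entries strictly before the cutoff checkpoint are deletable;
    # the checkpoint version itself and everything after it are preserved.
    deletable = sorted(v for v in commit_timestamps if v < cutoff_checkpoint)
    return {
        "cutoff_commit": cutoff_commit,
        "cutoff_checkpoint": cutoff_checkpoint,
        "deletable_versions": deletable,
    }


plan = plan_log_cleanup(
    commit_timestamps={0: 100, 1: 200, 2: 300, 3: 400, 4: 500},
    checkpoint_versions=[1, 3],
    cutoff_timestamp=350,
)
print(plan)
```

In this example the checkpoint at version 3 is newer than the cutoff commit (version 2), so the older checkpoint at version 1 becomes the cutoff and only version 0 is deletable. The sketch leaves out log compaction files and staged commits, which use `<=` against the `cutOffCheckpoint` version rather than a strict `<`.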

README.md

@@ -0,0 +1,10 @@
+diff --git a/README.md b/README.md
+--- a/README.md
++++ b/README.md
+ ## Building
+ 
+ Delta Lake is compiled using [SBT](https://www.scala-sbt.org/1.x/docs/Command-Line-Reference.html).
++Ensure that your Java version is at least 17 (you can verify with `java -version`).
+ 
+ To compile, run
+ 
\ No newline at end of file

build.sbt

@@ -0,0 +1,218 @@
+diff --git a/build.sbt b/build.sbt
+--- a/build.sbt
++++ b/build.sbt
+       allMappings.distinct
+     },
+ 
+-    // Exclude internal modules from published POM
++    // Exclude internal modules from published POM and add kernel dependencies.
++    // Kernel modules are transitive through sparkV2 (an internal module), so they
++    // are lost when sparkV2 is filtered out. We re-add them explicitly here.
+     pomPostProcess := { node =>
+       val internalModules = internalModuleNames.value
++      val ver = version.value
+       import scala.xml._
+       import scala.xml.transform._
++
++      def kernelDependencyNode(artifactId: String): Elem = {
++        <dependency>
++          <groupId>io.delta</groupId>
++          <artifactId>{artifactId}</artifactId>
++          <version>{ver}</version>
++        </dependency>
++      }
++
++      val kernelDeps = Seq(
++        kernelDependencyNode("delta-kernel-api"),
++        kernelDependencyNode("delta-kernel-defaults"),
++        kernelDependencyNode("delta-kernel-unitycatalog")
++      )
++
+       new RuleTransformer(new RewriteRule {
+         override def transform(n: Node): Seq[Node] = n match {
+-          case e: Elem if e.label == "dependency" =>
+-            val artifactId = (e \ "artifactId").text
+-            // Check if artifactId starts with any internal module name
+-            // (e.g., "delta-spark-v1_4.1_2.13" starts with "delta-spark-v1")
+-            val isInternal = internalModules.exists(module => artifactId.startsWith(module))
+-            if (isInternal) Seq.empty else Seq(n)
++          case e: Elem if e.label == "dependencies" =>
++            val filtered = e.child.filter {
++              case child: Elem if child.label == "dependency" =>
++                val artifactId = (child \ "artifactId").text
++                !internalModules.exists(module => artifactId.startsWith(module))
++              case _ => true
++            }
++            Seq(e.copy(child = filtered ++ kernelDeps))
+           case _ => Seq(n)
+         }
+       }).transform(node).head
+     commonSettings,
+     scalaStyleSettings,
+     releaseSettings,
+-    CrossSparkVersions.sparkDependentModuleName(sparkVersion),
++    // Set sparkVersion directly (not sparkDependentModuleName) so that
++    // runOnlyForReleasableSparkModules discovers this module, but without adding a Spark
++    // suffix to the artifact name. delta-contribs is only published as delta-contribs_2.13.
++    sparkVersion := CrossSparkVersions.getSparkVersion(),
+     Compile / packageBin / mappings := (Compile / packageBin / mappings).value ++
+       listPythonFiles(baseDirectory.value.getParentFile / "python"),
+ 
+   ).configureUnidoc()
+ 
+ 
+-val unityCatalogVersion = "0.3.1"
++val unityCatalogVersion = "0.4.0"
+ val sparkUnityCatalogJacksonVersion = "2.15.4" // We are using Spark 4.0's Jackson version 2.15.x, to override Unity Catalog 0.3.0's version 2.18.x
+ 
+ lazy val sparkUnityCatalog = (project in file("spark/unitycatalog"))
+     libraryDependencies ++= Seq(
+       "org.apache.spark" %% "spark-sql" % sparkVersion.value % "provided",
+ 
+-      "io.delta" %% "delta-sharing-client" % "1.3.9",
++      "io.delta" %% "delta-sharing-client" % "1.3.10",
+ 
+       // Test deps
+       "org.scalatest" %% "scalatest" % scalaTestVersion % "test",
+ 
+       // Test Deps
+       "org.scalatest" %% "scalatest" % scalaTestVersion % "test",
++      // Jackson datatype module needed for UC SDK tests (excluded from main compile scope)
++      "com.fasterxml.jackson.datatype" % "jackson-datatype-jsr310" % "2.15.4" % "test",
+     ),
+ 
+     // Unidoc settings
+     commonSettings,
+     scalaStyleSettings,
+     releaseSettings,
+-    CrossSparkVersions.sparkDependentModuleName(sparkVersion),
++    // Set sparkVersion directly (not sparkDependentModuleName) so that
++    // runOnlyForReleasableSparkModules discovers this module, but without adding a Spark
++    // suffix to the artifact name. delta-iceberg is only published as delta-iceberg_2.13.
++    sparkVersion := CrossSparkVersions.getSparkVersion(),
+     libraryDependencies ++= {
+       if (supportIceberg) {
+         Seq(
+           "org.xerial" % "sqlite-jdbc" % "3.45.0.0" % "test",
+           "org.apache.httpcomponents.core5" % "httpcore5" % "5.2.4" % "test",
+           "org.apache.httpcomponents.client5" % "httpclient5" % "5.3.1" % "test",
+-          "org.apache.iceberg" %% icebergSparkRuntimeArtifactName % "1.10.0" % "provided"
++          "org.apache.iceberg" %% icebergSparkRuntimeArtifactName % "1.10.0" % "provided",
++          // For FixedGcsAccessTokenProvider (GCS server-side planning credentials)
++          "com.google.cloud.bigdataoss" % "util-hadoop" % "hadoop3-2.2.26" % "provided"
+         )
+       } else {
+         Seq.empty
+   )
+ // scalastyle:on println
+ 
+-val icebergShadedVersion = "1.10.0"
++val icebergShadedVersion = "1.10.1"
+ lazy val icebergShaded = (project in file("icebergShaded"))
+   .dependsOn(spark % "provided")
+   .disablePlugins(JavaFormatterPlugin, ScalafmtPlugin)
+     commonSettings,
+     scalaStyleSettings,
+     releaseSettings,
+-    CrossSparkVersions.sparkDependentSettings(sparkVersion),
+-    libraryDependencies ++= Seq(
+-      "org.apache.hudi" % "hudi-java-client" % "0.15.0" % "compile" excludeAll(
+-        ExclusionRule(organization = "org.apache.hadoop"),
+-        ExclusionRule(organization = "org.apache.zookeeper"),
+-      ),
+-      "org.apache.spark" %% "spark-avro" % sparkVersion.value % "test" excludeAll ExclusionRule(organization = "org.apache.hadoop"),
+-      "org.apache.parquet" % "parquet-avro" % "1.12.3" % "compile"
+-    ),
++    // Set sparkVersion directly (not sparkDependentModuleName) so that
++    // runOnlyForReleasableSparkModules discovers this module, but without adding a Spark
++    // suffix to the artifact name. delta-hudi is only published as delta-hudi_2.13.
++    sparkVersion := CrossSparkVersions.getSparkVersion(),
++    libraryDependencies ++= {
++      if (supportHudi) {
++        Seq(
++          "org.apache.hudi" % "hudi-java-client" % "0.15.0" % "compile" excludeAll(
++            ExclusionRule(organization = "org.apache.hadoop"),
++            ExclusionRule(organization = "org.apache.zookeeper"),
++          ),
++          "org.apache.spark" %% "spark-avro" % sparkVersion.value % "test" excludeAll ExclusionRule(organization = "org.apache.hadoop"),
++          "org.apache.parquet" % "parquet-avro" % "1.12.3" % "compile"
++        )
++      } else {
++        Seq.empty
++      }
++    },
++    // Skip compilation and publishing when supportHudi is false
++    Compile / skip := !supportHudi,
++    Test / skip := !supportHudi,
++    publish / skip := !supportHudi,
++    publishLocal / skip := !supportHudi,
++    publishM2 / skip := !supportHudi,
+     assembly / assemblyJarName := s"${name.value}-assembly_${scalaBinaryVersion.value}-${version.value}.jar",
+     assembly / logLevel := Level.Info,
+     assembly / test := {},
+       // crossScalaVersions must be set to Nil on the aggregating project
+       crossScalaVersions := Nil,
+       publishArtifact := false,
+-      publish / skip := false,
++      publish / skip := true,
+     )
+ }
+ 
+       // crossScalaVersions must be set to Nil on the aggregating project
+       crossScalaVersions := Nil,
+       publishArtifact := false,
+-      publish / skip := false,
++      publish / skip := true,
+     )
+ }
+ 
+     // crossScalaVersions must be set to Nil on the aggregating project
+     crossScalaVersions := Nil,
+     publishArtifact := false,
+-    publish / skip := false,
++    publish / skip := true,
+     unidocSourceFilePatterns := {
+       (kernelApi / unidocSourceFilePatterns).value.scopeToProject(kernelApi) ++
+       (kernelDefaults / unidocSourceFilePatterns).value.scopeToProject(kernelDefaults)
+     // crossScalaVersions must be set to Nil on the aggregating project
+     crossScalaVersions := Nil,
+     publishArtifact := false,
+-    publish / skip := false,
++    publish / skip := true,
+   )
+ 
+ /*
+     sys.env.getOrElse("SONATYPE_USERNAME", ""),
+     sys.env.getOrElse("SONATYPE_PASSWORD", "")
+   ),
++  credentials += Credentials(
++    "Sonatype Nexus Repository Manager",
++    "central.sonatype.com",
++    sys.env.getOrElse("SONATYPE_USERNAME", ""),
++    sys.env.getOrElse("SONATYPE_PASSWORD", "")
++  ),
+   publishTo := {
+     val ossrhBase = "https://ossrh-staging-api.central.sonatype.com/"
++    val centralSnapshots = "https://central.sonatype.com/repository/maven-snapshots/"
+     if (isSnapshot.value) {
+-      Some("snapshots" at ossrhBase + "content/repositories/snapshots")
++      Some("snapshots" at centralSnapshots)
+     } else {
+       Some("releases"  at ossrhBase + "service/local/staging/deploy/maven2")
+     }
+ // Looks like some of release settings should be set for the root project as well.
+ publishArtifact := false  // Don't release the root project
+ publish / skip := true
+-publishTo := Some("snapshots" at "https://ossrh-staging-api.central.sonatype.com/content/repositories/snapshots")
++publishTo := Some("snapshots" at "https://central.sonatype.com/repository/maven-snapshots/")
+ releaseCrossBuild := false  // Don't use sbt-release's cross facility
+ releaseProcess := Seq[ReleaseStep](
+   checkSnapshotDependencies,
+   setReleaseVersion,
+   commitReleaseVersion,
+   tagRelease
+-) ++ CrossSparkVersions.crossSparkReleaseSteps("+publishSigned") ++ Seq[ReleaseStep](
++) ++ CrossSparkVersions.crossSparkReleaseSteps("publishSigned") ++ Seq[ReleaseStep](
+ 
+   // Do NOT use `sonatypeBundleRelease` - it will actually release to Maven! We want to do that
+   // manually.
\ No newline at end of file

connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/.00000000000000000000.json.crc

@@ -0,0 +1,3 @@
+diff --git a/connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/.00000000000000000000.json.crc b/connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/.00000000000000000000.json.crc
+new file mode 100644
+Binary files /dev/null and b/connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/.00000000000000000000.json.crc differ
\ No newline at end of file

connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/00000000000000000000.crc

@@ -0,0 +1,5 @@
+diff --git a/connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/00000000000000000000.crc b/connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/00000000000000000000.crc
+new file mode 100644
+--- /dev/null
++++ b/connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/00000000000000000000.crc
++{"txnId":"6132e880-0f3a-4db4-b882-1da039bffbad","tableSizeBytes":0,"numFiles":0,"numMetadata":1,"numProtocol":1,"setTransactions":[],"domainMetadata":[],"metadata":{"id":"0eb3e007-b3cc-40e4-bca1-a5970d86b5a6","format":{"provider":"parquet","options":{}},"schemaString":"{\"type\":\"struct\",\"fields\":[{\"name\":\"id\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}},{\"name\":\"utf8_binary_col\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"utf8_lcase_col\",\"type\":\"string\",\"nullable\":true,\"metadata\":{\"__COLLATIONS\":{\"utf8_lcase_col\":\"spark.UTF8_LCASE\"}}},{\"name\":\"unicode_col\",\"type\":\"string\",\"nullable\":true,\"metadata\":{\"__COLLATIONS\":{\"unicode_col\":\"icu.UNICODE\"}}}]}","partitionColumns":[],"configuration":{},"createdTime":1773779518731},"protocol":{"minReaderVersion":1,"minWriterVersion":7,"writerFeatures":["domainMetadata","collations-preview","appendOnly","invariants"]},"histogramOpt":{"sortedBinBoundaries":[0,8192,16384,32768,65536,131072,262144,524288,1048576,2097152,4194304,8388608,12582912,16777216,20971520,25165824,29360128,33554432,37748736,41943040,50331648,58720256,67108864,75497472,83886080,92274688,100663296,109051904,117440512,125829120,130023424,134217728,138412032,142606336,146800640,150994944,167772160,184549376,201326592,218103808,234881024,251658240,268435456,285212672,301989888,318767104,335544320,352321536,369098752,385875968,402653184,419430400,436207616,452984832,469762048,486539264,503316480,520093696,536870912,553648128,570425344,587202560,603979776,671088640,738197504,805306368,872415232,939524096,1006632960,1073741824,1140850688,1207959552,1275068416,1342177280,1409286144,1476395008,1610612736,1744830464,1879048192,2013265920,2147483648,2415919104,2684354560,2952790016,3221225472,3489660928,3758096384,4026531840,4294967296,8589934592,17179869184,34359738368,68719476736,137438953472,274877906944],"fileCounts":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],"totalBytes":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]},"allFiles":[]}
\ No newline at end of file

... (truncated, output exceeded 60000 bytes)

_{Reproduce locally: git range-diff e8cffee..f9ee471 d1139d2..2001a72 | Disable: git config gitstack.push-range-diff false}

huan233usc · 2026-04-01T17:56:26Z

+    validateCDFEnabledOnTable();
+    CloseableIterator<IndexedFile> result;
+    if (isInitialSnapshot) {
+      Snapshot snapshot = snapshotManager.loadSnapshotAt(fromVersion);


Can we reuse the initialSnapshot here instead of loading a fresh one?

This is a good catch, thank you! We don't need a fresh snapshot.

huan233usc

One more question, and please add CDCDataFileTest.java as well.

zikangh · 2026-04-01T20:57:53Z

Range-diff: master (a13ae25 -> 9f42cb1)

.github/CODEOWNERS

@@ -0,0 +1,12 @@
+diff --git a/.github/CODEOWNERS b/.github/CODEOWNERS
+--- a/.github/CODEOWNERS
++++ b/.github/CODEOWNERS
+ /project/                       @tdas
+ /version.sbt                    @tdas
+ 
++# Spark V2 and Unified modules
++/spark/v2/                      @tdas @huan233usc @TimothyW553 @raveeram-db @murali-db
++/spark-unified/                 @tdas @huan233usc @TimothyW553 @raveeram-db @murali-db
++
+ # All files in the root directory
+ /*                              @tdas
\ No newline at end of file

.github/workflows/iceberg_test.yaml

@@ -0,0 +1,16 @@
+diff --git a/.github/workflows/iceberg_test.yaml b/.github/workflows/iceberg_test.yaml
+--- a/.github/workflows/iceberg_test.yaml
++++ b/.github/workflows/iceberg_test.yaml
+           # the above directories when we use the key for the first time. After that, each run will
+           # just use the cache. The cache is immutable so we need to use a new key when trying to
+           # cache new stuff.
+-          key: delta-sbt-cache-spark3.2-scala${{ matrix.scala }}
++          key: delta-sbt-cache-spark4.0-scala${{ matrix.scala }}
+       - name: Install Job dependencies
+         run: |
+           sudo apt-get update
+       - name: Run Scala/Java and Python tests
+         # when changing TEST_PARALLELISM_COUNT make sure to also change it in spark_master_test.yaml
+         run: |
+-          TEST_PARALLELISM_COUNT=4 pipenv run python run-tests.py --group iceberg
++          TEST_PARALLELISM_COUNT=4 pipenv run python run-tests.py --group iceberg --spark-version 4.0
\ No newline at end of file

.github/workflows/spark_examples_test.yaml

@@ -0,0 +1,54 @@
+diff --git a/.github/workflows/spark_examples_test.yaml b/.github/workflows/spark_examples_test.yaml
+--- a/.github/workflows/spark_examples_test.yaml
++++ b/.github/workflows/spark_examples_test.yaml
+         # Spark versions are dynamically generated - released versions only
+         spark_version: ${{ fromJson(needs.generate-matrix.outputs.spark_versions) }}
+         # These Scala versions must match those in the build.sbt
+-        scala: [2.13.16]
++        scala: [2.13.17]
+     env:
+       SCALA_VERSION: ${{ matrix.scala }}
+-      SPARK_VERSION: ${{ matrix.spark_version }}
+     steps:
+       - uses: actions/checkout@v3
+       - name: Get Spark version details
+         id: spark-details
+         run: |
+-          # Get JVM version, package suffix, iceberg support for this Spark version
++          # Get JVM version, package suffix, iceberg support, and full version for this Spark version
+           JVM_VERSION=$(python3 project/scripts/get_spark_version_info.py --get-field "${{ matrix.spark_version }}" targetJvm | jq -r)
+           SPARK_PACKAGE_SUFFIX=$(python3 project/scripts/get_spark_version_info.py --get-field "${{ matrix.spark_version }}" packageSuffix | jq -r)
+           SUPPORT_ICEBERG=$(python3 project/scripts/get_spark_version_info.py --get-field "${{ matrix.spark_version }}" supportIceberg | jq -r)
++          SPARK_FULL_VERSION=$(python3 project/scripts/get_spark_version_info.py --get-field "${{ matrix.spark_version }}" fullVersion | jq -r)
+           echo "jvm_version=$JVM_VERSION" >> $GITHUB_OUTPUT
+           echo "spark_package_suffix=$SPARK_PACKAGE_SUFFIX" >> $GITHUB_OUTPUT
+           echo "support_iceberg=$SUPPORT_ICEBERG" >> $GITHUB_OUTPUT
+-          echo "Using JVM $JVM_VERSION for Spark ${{ matrix.spark_version }}, package suffix: '$SPARK_PACKAGE_SUFFIX', support iceberg: '$SUPPORT_ICEBERG'"
++          echo "spark_full_version=$SPARK_FULL_VERSION" >> $GITHUB_OUTPUT
++          echo "Using JVM $JVM_VERSION for Spark $SPARK_FULL_VERSION, package suffix: '$SPARK_PACKAGE_SUFFIX', support iceberg: '$SUPPORT_ICEBERG'"
+       - name: install java
+         uses: actions/setup-java@v3
+         with:
+       - name: Run Delta Spark Local Publishing and Examples Compilation
+         # examples/scala/build.sbt will compile against the local Delta release version (e.g. 3.2.0-SNAPSHOT).
+         # Thus, we need to publishM2 first so those jars are locally accessible.
+-        # The SPARK_PACKAGE_SUFFIX env var tells examples/scala/build.sbt which artifact naming to use.
++        # -DsparkVersion is for the Delta project's publishM2 (which Spark version to compile Delta against).
++        # SPARK_VERSION/SPARK_PACKAGE_SUFFIX/SUPPORT_ICEBERG are for examples/scala/build.sbt (dependency resolution).
+         env:
+           SPARK_PACKAGE_SUFFIX: ${{ steps.spark-details.outputs.spark_package_suffix }}
+           SUPPORT_ICEBERG: ${{ steps.spark-details.outputs.support_iceberg }}
++          SPARK_VERSION: ${{ steps.spark-details.outputs.spark_full_version }}
+         run: |
+           build/sbt clean
+-          build/sbt -DsparkVersion=${{ matrix.spark_version }} publishM2
++          build/sbt -DsparkVersion=${{ steps.spark-details.outputs.spark_full_version }} publishM2
+           cd examples/scala && build/sbt "++ $SCALA_VERSION compile"
++      - name: Run UC Delta Integration Test
++        # Verifies that delta-spark resolved from Maven local includes all kernel module
++        # dependencies transitively by running a real UC-backed Delta workload.
++        env:
++          SPARK_PACKAGE_SUFFIX: ${{ steps.spark-details.outputs.spark_package_suffix }}
++          SPARK_VERSION: ${{ steps.spark-details.outputs.spark_full_version }}
++        run: |
++          cd examples/scala && build/sbt "++ $SCALA_VERSION runMain example.UnityCatalogQuickstart"
\ No newline at end of file

.github/workflows/spark_test.yaml

@@ -0,0 +1,27 @@
+diff --git a/.github/workflows/spark_test.yaml b/.github/workflows/spark_test.yaml
+--- a/.github/workflows/spark_test.yaml
++++ b/.github/workflows/spark_test.yaml
+         # These Scala versions must match those in the build.sbt
+         scala: [2.13.16]
+         # Important: This list of shards must be [0..NUM_SHARDS - 1]
+-        shard: [0, 1, 2, 3]
++        shard: [0, 1, 2, 3, 4, 5, 6, 7]
+     env:
+       SCALA_VERSION: ${{ matrix.scala }}
+       SPARK_VERSION: ${{ matrix.spark_version }}
+       # Important: This must be the same as the length of shards in matrix
+-      NUM_SHARDS: 4
++      NUM_SHARDS: 8
+     steps:
+       - uses: actions/checkout@v3
+       - name: Get Spark version details
+         # when changing TEST_PARALLELISM_COUNT make sure to also change it in spark_python_test.yaml
+         run: |
+           TEST_PARALLELISM_COUNT=4 pipenv run python run-tests.py --group spark --shard ${{ matrix.shard }} --spark-version ${{ matrix.spark_version }}
++      - name: Upload test reports
++        if: always()
++        uses: actions/upload-artifact@v4
++        with:
++          name: test-reports-spark${{ matrix.spark_version }}-shard${{ matrix.shard }}
++          path: "**/target/test-reports/*.xml"
++          retention-days: 7
\ No newline at end of file

PROTOCOL.md

@@ -0,0 +1,537 @@
+diff --git a/PROTOCOL.md b/PROTOCOL.md
+--- a/PROTOCOL.md
++++ b/PROTOCOL.md
+   - [Writer Requirements for Variant Type](#writer-requirements-for-variant-type)
+   - [Reader Requirements for Variant Data Type](#reader-requirements-for-variant-data-type)
+   - [Compatibility with other Delta Features](#compatibility-with-other-delta-features)
++- [Catalog-managed tables](#catalog-managed-tables)
++  - [Terminology: Commits](#terminology-commits)
++  - [Terminology: Delta Client](#terminology-delta-client)
++  - [Terminology: Catalogs](#terminology-catalogs)
++  - [Catalog Responsibilities](#catalog-responsibilities)
++  - [Reading Catalog-managed Tables](#reading-catalog-managed-tables)
++  - [Commit Protocol](#commit-protocol)
++  - [Getting Ratified Commits from the Catalog](#getting-ratified-commits-from-the-catalog)
++  - [Publishing Commits](#publishing-commits)
++  - [Maintenance Operations on Catalog-managed Tables](#maintenance-operations-on-catalog-managed-tables)
++  - [Creating and Dropping Catalog-managed Tables](#creating-and-dropping-catalog-managed-tables)
++  - [Catalog-managed Table Enablement](#catalog-managed-table-enablement)
++  - [Writer Requirements for Catalog-managed tables](#writer-requirements-for-catalog-managed-tables)
++  - [Reader Requirements for Catalog-managed tables](#reader-requirements-for-catalog-managed-tables)
++  - [Table Discovery](#table-discovery)
++  - [Sample Catalog Client API](#sample-catalog-client-api)
+ - [Requirements for Writers](#requirements-for-writers)
+   - [Creation of New Log Entries](#creation-of-new-log-entries)
+   - [Consistency Between Table Metadata and Data Files](#consistency-between-table-metadata-and-data-files)
+ __(1)__ `preimage` is the value before the update, `postimage` is the value after the update.
+ 
+ ### Delta Log Entries
+-Delta files are stored as JSON in a directory at the root of the table named `_delta_log`, and together with checkpoints make up the log of all changes that have occurred to a table.
+ 
+-Delta files are the unit of atomicity for a table, and are named using the next available version number, zero-padded to 20 digits.
++Delta Log Entries, also known as Delta files, are JSON files stored in the `_delta_log`
++directory at the root of the table. Together with checkpoints, they make up the log of all changes
++that have occurred to a table. Delta files are the unit of atomicity for a table, and are named
++using the next available version number, zero-padded to 20 digits.
+ 
+ For example:
+ 
+ ```
+ ./_delta_log/00000000000000000000.json
+ ```
+-Delta files use new-line delimited JSON format, where every action is stored as a single line JSON document.
+-A delta file, `n.json`, contains an atomic set of [_actions_](#Actions) that should be applied to the previous table state, `n-1.json`, in order to the construct `n`th snapshot of the table.
+-An action changes one aspect of the table's state, for example, adding or removing a file.
++
++Delta files use newline-delimited JSON format, where every action is stored as a single-line
++JSON document. A Delta file, corresponding to version `v`, contains an atomic set of
++[_actions_](#actions) that should be applied to the previous table state corresponding to version
++`v-1`, in order to construct the `v`th snapshot of the table. An action changes one aspect of the
++table's state, for example, adding or removing a file.
++
++**Note:** If the [catalogManaged table feature](#catalog-managed-tables) is enabled on the table,
++recently [ratified commits](#ratified-commit) may not yet be published to the `_delta_log` directory as normal Delta
++files - they may be stored directly by the catalog or reside in the `_delta_log/_staged_commits`
++directory. Delta clients must contact the table's managing catalog in order to find the information
++about these [ratified, potentially-unpublished commits](#publishing-commits).
++
++The `_delta_log/_staged_commits` directory is the staging area for [staged](#staged-commit)
++commits. Delta files in this directory have a UUID embedded into them and follow the pattern
++`<version>.<uuid>.json`, where the version corresponds to the proposed commit version, zero-padded
++to 20 digits.
++
++For example:
++
++```
++./_delta_log/_staged_commits/00000000000000000000.3a0d65cd-4056-49b8-937b-95f9e3ee90e5.json
++./_delta_log/_staged_commits/00000000000000000001.7d17ac10-5cc3-401b-bd1a-9c82dd2ea032.json
++./_delta_log/_staged_commits/00000000000000000001.016ae953-37a9-438e-8683-9a9a4a79a395.json
++./_delta_log/_staged_commits/00000000000000000002.3ae45b72-24e1-865a-a211-34987ae02f2a.json
++```
++
++**Note:** The (proposed) version number of a staged commit is authoritative - file
++`00000000000000000100.<uuid>.json` always corresponds to a commit attempt for version 100. Besides
++simplifying implementations, it also acknowledges the fact that commit files cannot safely be reused
++for multiple commit attempts. For example, resolving conflicts in a table with [row
++tracking](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#row-tracking) enabled requires
++rewriting all file actions to update their `baseRowId` field.
++
++The [catalog](#terminology-catalogs) is the source of truth about which staged commit files in
++the `_delta_log/_staged_commits` directory correspond to ratified versions, and Delta clients should
++not attempt to directly interpret the contents of that directory. Refer to
++[catalog-managed tables](#catalog-managed-tables) for more details.
+ 
+ ### Checkpoints
+ Checkpoints are also stored in the `_delta_log` directory, and can be created at any time, for any committed version of the table.
+ ### Commit Provenance Information
+ A delta file can optionally contain additional provenance information about what higher-level operation was being performed as well as who executed it.
+ 
++When the `catalogManaged` table feature is enabled, the `commitInfo` action must have a field
++`txnId` that stores a unique transaction identifier string.
++
+ Implementations are free to store any valid JSON-formatted data via the `commitInfo` action.
+ 
+ When [In-Commit Timestamps](#in-commit-timestamps) are enabled, writers are required to include a `commitInfo` action with every commit, which must include the `inCommitTimestamp` field. Also, the `commitInfo` action must be the first action in the commit.
+  - A single `protocol` action
+  - A single `metaData` action
+  - A collection of `txn` actions with unique `appId`s
+- - A collection of `domainMetadata` actions with unique `domain`s.
++ - A collection of `domainMetadata` actions with unique `domain`s, excluding tombstones (i.e. actions with `removed=true`).
+  - A collection of `add` actions with unique path keys, corresponding to the newest (path, deletionVector.uniqueId) pair encountered for each path.
+  - A collection of `remove` actions with unique `(path, deletionVector.uniqueId)` keys. The intersection of the primary keys in the `add` collection and `remove` collection must be empty. That means a logical file cannot exist in both the `remove` and `add` collections at the same time; however, the same *data file* can exist with *different* DVs in the `remove` collection, as logically they represent different content. The `remove` actions act as _tombstones_, and only exist for the benefit of the VACUUM command. Snapshot reads only return `add` actions on the read path.
+  
+      - write a `metaData` action to add the `delta.columnMapping.mode` table property.
+  - Write data files by using the _physical name_ that is chosen for each column. The physical name of the column is static and can be different than the _display name_ of the column, which is changeable.
+  - Write the 32 bit integer column identifier as part of the `field_id` field of the `SchemaElement` struct in the [Parquet Thrift specification](https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift).
+- - Track partition values and column level statistics with the physical name of the column in the transaction log.
++ - Track partition values, column level statistics, and [clustering column](#clustered-table) names with the physical name of the column in the transaction log.
+  - Assign a globally unique identifier as the physical name for each new column that is added to the schema. This is especially important for supporting cheap column deletions in `name` mode. In addition, column identifiers need to be assigned to each column. The maximum id that is assigned to a column is tracked as the table property `delta.columnMapping.maxColumnId`. This is an internal table property that cannot be configured by users. This value must increase monotonically as new columns are introduced and committed to the table alongside the introduction of the new columns to the schema.
+ 
+ ## Reader Requirements for Column Mapping
+ ## Writer Requirement for Deletion Vectors
+ When adding a logical file with a deletion vector, then that logical file must have correct `numRecords` information for the data file in the `stats` field.
+ 
++# Catalog-managed tables
++
++With this feature enabled, the [catalog](#terminology-catalogs) that manages the table becomes the
++source of truth for whether a given commit attempt succeeded.
++
++The table feature defines the parts of the [commit protocol](#commit-protocol) that directly impact
++the Delta table (e.g. atomicity requirements, publishing, etc). The Delta client and catalog
++together are responsible for implementing the Delta-specific aspects of commit as defined by this
++spec, but are otherwise free to define their own APIs and protocols for communication with each
++other.
++
++**NOTE**: Filesystem-based access to catalog-managed tables is not supported. Delta clients are
++expected to discover and access catalog-managed tables through the managing catalog, not by direct
++listing in the filesystem. This feature is primarily designed to warn filesystem-based readers that
++might attempt to access a catalog-managed table's storage location without going through the catalog
++first, and to block filesystem-based writers who could otherwise corrupt both the table and the
++catalog by failing to commit through the catalog.
++
++Before we can go into details of this protocol feature, we must first align our terminology.
++
++## Terminology: Commits
++
++A commit is a set of [actions](#actions) that transform a Delta table from version `v - 1` to `v`.
++It contains the same kind of content as is stored in a [Delta file](#delta-log-entries).
++
++A commit may be stored in the file system as a Delta file - either _published_ or _staged_ - or
++stored _inline_ in the managing catalog, using whatever format the catalog prefers.
++
++There are several types of commits:
++
++1. **Proposed commit**:  A commit that a Delta client has proposed for the next version of the
++   table. It could be _staged_ or _inline_. It will either become _ratified_ or be rejected.
++
++2. <a name="staged-commit">**Staged commit**</a>: A commit that is written to disk at
++   `_delta_log/_staged_commits/<v>.<uuid>.json`. It has the same content and format as a published
++   Delta file.
++    - Here, the `uuid` is a random UUID that is generated for each commit and `v` is the version
++      which is proposed to be committed, zero-padded to 20 digits.
++    - The mere existence of a staged commit does not mean that the file has been ratified or even
++      proposed. It might correspond to a failed or in-progress commit attempt.
++    - The catalog is the source of truth around which staged commits are ratified.
++    - The catalog stores only the location, not the content, of a staged (and ratified) commit.
++
++3. <a name="inline-commit">**Inline commit**</a>: A proposed commit that is not written to disk but
++   rather has its content sent to the catalog for the catalog to store directly.
++
++4. <a name="ratified-commit">**Ratified commit**</a>: A proposed commit that a catalog has
++   determined has won the commit at the desired version of the table.
++    - The catalog must store ratified commits (that is, the staged commit's location or the inline
++      commit's content) until they are published to the `_delta_log` directory.
++    - A ratified commit may or may not yet be published.
++    - A ratified commit may or may not even be stored by the catalog at all - the catalog may
++      have just atomically published it to the filesystem directly, relying on PUT-if-absent
++      primitives to facilitate the ratification and publication all in one step.
++
++5. <a name="published-commit">**Published commit**</a>: A ratified commit that has been copied into
++   the `_delta_log` as a normal Delta file, i.e. `_delta_log/<v>.json`.
++    - Here, the `v` is the version which is being committed, zero-padded to 20 digits.
++    - The existence of a `<v>.json` file proves that the corresponding version `v` is ratified,
++      regardless of whether the table is catalog-managed or filesystem-based. The catalog is allowed
++      to return information about published commits, but Delta clients can also use filesystem
++      listing operations to directly discover them.
++    - Published commits do not need to be stored by the catalog.
++
++## Terminology: Delta Client
++
++This is the component that implements support for reading and writing Delta tables, and implements
++the logic required by the `catalogManaged` table feature. Among other things, it
++- triggers the filesystem listing, if needed, to discover published commits
++- generates the commit content (the set of [actions](#actions))
++- works together with the query engine to trigger the commit process and invoke the client-side
++  catalog component with the commit content
++
++The Delta client is also responsible for defining the client-side API that catalogs should target.
++That is, there must be _some_ API that the [catalog client](#catalog-client) can use to communicate
++to the Delta client the subset of catalog-managed information that the Delta client cares about.
++This protocol feature is concerned with what information Delta cares about, but leaves to Delta
++clients the design of the API they use to obtain that information from catalog clients.
++
++## Terminology: Catalogs
++
++1. **Catalog**: A catalog is an entity which manages a Delta table, including its creation, writes,
++   reads, and eventual deletion.
++    - It could be backed by a database, a filesystem, or any other persistence mechanism.
++    - Each catalog has its own spec around how catalog clients should interact with it, and how
++      it performs a commit.
++
++2. <a name="catalog-client">**Catalog Client**</a>: The catalog always has a client-side component
++   which the Delta client interacts with directly. This client-side component has two primary
++   responsibilities:
++    - implement any client-side catalog-specific logic (such as staging or
++      [publishing](#publishing-commits) commits)
++    - communicate with the Catalog Server, if any
++
++3. **Catalog Server**: The catalog may also involve a server-side component with which the
++   client-side component is responsible for communicating.
++    - This server is responsible for coordinating commits and potentially persisting table metadata
++      and enforcing authorization policies.
++    - Not all catalogs require a server; some may be entirely client-side, e.g. filesystem-backed
++      catalogs, or they may make use of a generic database server and implement all of the catalog's
++      business logic client-side.
++
++**NOTE**: This specification outlines the responsibilities and actions that catalogs must implement.
++This spec does its best not to assume any specific catalog _implementation_, though it does call out
++likely client-side and server-side responsibilities. Nonetheless, what a given catalog does
++client-side or server-side is up to each catalog implementation to decide for itself.
++
++## Catalog Responsibilities
++
++When the `catalogManaged` table feature is enabled, a catalog performs commits to the table on behalf
++of the Delta client.
++
++As stated above, the Delta spec does not mandate any particular client-server design or API for
++catalogs that manage Delta tables. However, the catalog does need to provide certain capabilities
++for reading and writing Delta tables:
++
++- Atomically commit a version `v` with a given set of `actions`. This is explained in detail in the
++  [commit protocol](#commit-protocol) section.
++- Retrieve information about recent ratified commits and the latest ratified version on the table.
++  This is explained in detail in the [Getting Ratified Commits from the Catalog](#getting-ratified-commits-from-the-catalog) section.
++- Though not required, it is encouraged that catalogs also return the latest table-level metadata,
++  such as the latest Protocol and Metadata actions, for the table. This can provide significant
++  performance advantages to conforming Delta clients, who may forgo log replay and instead trust
++  the information provided by the catalog during query planning.
++
++## Reading Catalog-managed Tables
++
++A catalog-managed table can have a mix of (a) published and (b) ratified but non-published commits.
++The catalog is the source of truth for ratified commits. Also recall that ratified commits can be
++[staged commits](#staged-commit) that are persisted to the `_delta_log/_staged_commits` directory,
++or [inline commits](#inline-commit) whose content the catalog stores directly.
++
++For example, suppose the `_delta_log` directory contains the following files:
++
++```
++00000000000000000000.json
++00000000000000000001.json
++00000000000000000002.checkpoint.parquet
++00000000000000000002.json
++00000000000000000003.00000000000000000005.compacted.json
++00000000000000000003.json
++00000000000000000004.json
++00000000000000000005.json
++00000000000000000006.json
++00000000000000000007.json
++_staged_commits/00000000000000000007.016ae953-37a9-438e-8683-9a9a4a79a395.json // ratified and published
++_staged_commits/00000000000000000008.7d17ac10-5cc3-401b-bd1a-9c82dd2ea032.json // ratified
++_staged_commits/00000000000000000008.b91807ba-fe18-488c-a15e-c4807dbd2174.json // rejected
++_staged_commits/00000000000000000010.0f707846-cd18-4e01-b40e-84ee0ae987b0.json // not yet ratified
++_staged_commits/00000000000000000010.7a980438-cb67-4b89-82d2-86f73239b6d6.json // partial file
++```
++
++Further, suppose the catalog stores the following ratified commits:
++```
++{
++  7  -> "00000000000000000007.016ae953-37a9-438e-8683-9a9a4a79a395.json",
++  8  -> "00000000000000000008.7d17ac10-5cc3-401b-bd1a-9c82dd2ea032.json",
++  9  -> <inline commit: content stored by the catalog directly>
++}
++```
++
++Some things to note are:
++- the catalog isn't aware that commit 7 was already published - perhaps the response from the
++  filesystem was dropped
++- commit 9 is an inline commit
++- neither of the two staged commits for version 10 has been ratified
++
++To read such tables, Delta clients must first contact the catalog to get the ratified commits. This
++informs the Delta client of commits [7, 9] as well as the latest ratified version, 9.
++
++If this information is insufficient to construct a complete snapshot of the table, Delta clients
++must LIST the `_delta_log` directory to get information about the published commits. For commits
++that are both returned by the catalog and already published, Delta clients must treat the catalog's
++version as authoritative and read the commit returned by the catalog. Additionally, Delta clients
++must ignore any files with versions greater than the latest ratified commit version returned by the
++catalog.
++
++Combining these two sets of files and commits enables Delta clients to generate a snapshot at the
++latest version of the table.
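
The resolution rules in this section can be sketched in Python as follows (illustrative only, not part of the diff). The function name `resolve_commits` and its dictionary-based inputs are assumptions made for the example. A real client would typically start from a checkpoint rather than version 0.

```python
def resolve_commits(published, catalog_commits, latest_ratified, start_version=0):
    """Pick the source of each commit needed for a snapshot (hypothetical helper).

    published:        {version: file name} found by LISTing _delta_log
    catalog_commits:  {version: staged path or inline content} returned by the catalog
    latest_ratified:  latest ratified version reported by the catalog (authoritative)
    """
    resolved = {}
    # Versions above latest_ratified are never visited, so they are ignored.
    for version in range(start_version, latest_ratified + 1):
        if version in catalog_commits:
            # The catalog's copy wins even if the commit is also published.
            resolved[version] = ("catalog", catalog_commits[version])
        elif version in published:
            resolved[version] = ("published", published[version])
        else:
            raise ValueError(f"no commit available for version {version}")
    return resolved


# The example from the text: versions 0..7 are published, the catalog stores 7..9.
published = {v: f"{v:020d}.json" for v in range(8)}
catalog_commits = {
    7: "00000000000000000007.016ae953-37a9-438e-8683-9a9a4a79a395.json",
    8: "00000000000000000008.7d17ac10-5cc3-401b-bd1a-9c82dd2ea032.json",
    9: "<inline commit content>",
}
resolved = resolve_commits(published, catalog_commits, latest_ratified=9)
```

With these inputs, versions 0 through 6 resolve to published files and versions 7 through 9 resolve to the catalog's commits. The unratified staged files for version 10 never enter the result.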
++
++**NOTE**: This spec prescribes the _minimum_ required interactions between Delta clients and
++catalogs for commits. Catalogs may very well expose APIs and work with Delta clients to be
++informed of other non-commit [file types](#file-types), such as checkpoint, log
++compaction, and version checksum files. This would allow catalogs to return additional
++information to Delta clients during query and scan planning, potentially allowing Delta
++clients to avoid LISTing the filesystem altogether.
++
++## Commit Protocol
++
++To start, Delta Clients send the desired actions to be committed to the client-side component of the
++catalog.
++
++This component then has several options for proposing, ratifying, and publishing the commit,
++detailed below.
++
++- Option 1: Write the actions (likely client-side) to a [staged commit file](#staged-commit) in the
++  `_delta_log/_staged_commits` directory and then ratify the staged commit (likely server-side) by
++  atomically recording (in persistent storage of some kind) that the file corresponds to version `v`.
++- Option 2: Treat this as an [inline commit](#inline-commit) (i.e. the client-side component
++  likely sends the contents to the server-side component) and atomically record (in persistent
++  storage of some kind) the content of the commit as version `v` of the table.
++- Option 3: Catalog implementations that use PUT-if-absent (client- or server-side) can ratify and
++  publish all-in-one by atomically writing a [published commit file](#published-commit)
++  in the `_delta_log` directory. Note that this commit will be considered to have succeeded as soon
++  as the file becomes visible in the filesystem, regardless of when or whether the catalog is made
++  aware of the successful publish. The catalog does not need to store these files.
++
++A catalog must not ratify version `v` until it has ratified version `v - 1`, and it must ratify
++version `v` at most once.
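
A minimal, single-threaded Python sketch of this ordering rule (illustrative only, not part of the diff). The class `InMemoryCatalog` is hypothetical. A real catalog would need an atomic compare-and-set against persistent storage instead of an in-memory dictionary.

```python
class InMemoryCatalog:
    """Toy catalog that ratifies versions strictly in order (hypothetical)."""

    def __init__(self, latest_ratified=-1):
        self.latest_ratified = latest_ratified
        self.ratified = {}  # version -> staged path or inline content

    def ratify(self, version, commit):
        """Ratify `commit` as `version`; return False if the attempt is rejected."""
        # Only the next version may be ratified. This enforces both rules:
        # v - 1 must already be ratified, and v can be ratified at most once.
        if version != self.latest_ratified + 1:
            return False
        self.ratified[version] = commit
        self.latest_ratified = version
        return True


catalog = InMemoryCatalog()
results = [
    catalog.ratify(0, "commit-a"),  # accepted: first version
    catalog.ratify(0, "commit-b"),  # rejected: version 0 already ratified
    catalog.ratify(2, "commit-c"),  # rejected: version 1 not yet ratified
    catalog.ratify(1, "commit-d"),  # accepted: next version
]
```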
++
++The catalog must store both flavors of ratified commits (staged or inline) and make them available
++to readers until they are [published](#publishing-commits).
++
++For performance reasons, Delta clients are encouraged to establish an API contract where the catalog
++provides the latest ratified commit information whenever a commit fails due to version conflict.
++
++## Getting Ratified Commits from the Catalog
++
++Even after a commit is ratified, it is not discoverable through filesystem operations until it is
++[published](#publishing-commits).
++
++The catalog client is responsible for implementing an API (defined by the Delta client) that Delta clients can
++use to retrieve the latest ratified commit version (authoritative), as well as the set of ratified
++commits the catalog is still storing for the table. If some commits needed to complete the snapshot
++are not stored by the catalog, as they are already published, Delta clients can issue a filesystem
++LIST operation to retrieve them.
++
++Delta clients must establish an API contract where the catalog provides ratified commit information
++as part of the standard table resolution process performed at query planning time.
++
++## Publishing Commits
++
++Publishing is the process of copying the ratified commit with version `<v>` to
++`_delta_log/<v>.json`. The ratified commit may be a staged commit located in
++`_delta_log/_staged_commits/<v>.<uuid>.json`, or it may be an inline commit whose content the
++catalog stores itself. Because the content of a ratified commit is immutable, it does not matter
++whether the client-side, server-side, or both catalog components initiate publishing.
++
++Implementations are strongly encouraged to publish commits promptly. This reduces the number of
++commits the catalog needs to store internally (and serve up to readers).
++
++Commits must be published _in order_. That is, version `v - 1` must be published _before_ version
++`v`.
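
The in-order requirement can be sketched like this (illustrative Python, not part of the diff). The names `publish_in_order` and `copy_to_delta_log` are hypothetical. The callback stands in for whatever copy or PUT-if-absent operation an implementation actually uses.

```python
def publish_in_order(ratified, highest_published, copy_to_delta_log):
    """Publish ratified commits in ascending order, stopping at the first gap.

    ratified:           {version: staged path or inline content} known to the publisher
    highest_published:  highest version already present as _delta_log/<v>.json
    copy_to_delta_log:  callback(version, commit) performing the actual copy
    """
    published = []
    version = highest_published + 1
    # Never skip ahead: version v is only published after v - 1.
    while version in ratified:
        copy_to_delta_log(version, ratified[version])
        published.append(version)
        version += 1
    return published


calls = []
done = publish_in_order(
    {8: "staged-8", 9: "inline-9", 11: "staged-11"},
    highest_published=7,
    copy_to_delta_log=lambda v, c: calls.append((v, c)),
)
```

In this run versions 8 and 9 are published, while version 11 is held back because version 10 is not available to the publisher.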
++
++**NOTE**: Because commit publishing can happen at any time after the commit succeeds, the file
++modification timestamp of the published file will not accurately reflect the original commit time.
++For this reason, catalog-managed tables must use [in-commit-timestamps](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#in-commit-timestamps)
++to ensure stability of time travel reads. Refer to [Writer Requirements for Catalog-managed Tables](#writer-requirements-for-catalog-managed-tables)
++section for more details.
++
++## Maintenance Operations on Catalog-managed Tables
++
++[Checkpoints](#checkpoints-1) and [Log Compaction Files](#log-compaction-files) can only be created
++for versions that are already published in the `_delta_log`. In other words, in order to checkpoint
++version `v` or produce a log compaction file for commit range `x <= v <= y`, `_delta_log/<v>.json`
++must exist.
++
++Notably, the [Version Checksum File](#version-checksum-file) for version `v` _can_ be created in the
++`_delta_log` even if the commit for version `v` is not published.
++
++By default, maintenance operations are prohibited unless the managing catalog explicitly permits
++the client to run them. The only exceptions are checkpoints, log compaction, and version checksum,
++as they are essential for all basic table operations (e.g. reads and writes) to operate reliably.
++All other maintenance operations such as the following are not allowed by default.
++- [Log and other metadata files clean up](#metadata-cleanup).
++- Data files cleanup, for example VACUUM.
++- Data layout changes, for example OPTIMIZE and REORG.
++
++## Creating and Dropping Catalog-managed Tables
++
++The catalog and query engine ultimately dictate how to create and drop catalog-managed tables.
++
++As one example, table creation often works in three phases:
++
++1. An initial catalog operation to obtain a unique storage location which serves as an unnamed
++   "staging" table
++2. A table operation that physically initializes a new `catalogManaged`-enabled table at the staging
++   location.
++3. A final catalog operation that registers the new table with its intended name.
++
++Delta clients would primarily be involved with the second step, but an implementation could choose
++to combine the second and third steps so that a single catalog call registers the table as part of
++the table's first commit.
++
++As another example, dropping a table can be as simple as removing its name from the catalog (a "soft
++delete"), followed at some later point by a "hard delete" that physically purges the data. The Delta
++client would not be involved at all in this process, because no commits are made to the table.
++
++## Catalog-managed Table Enablement
++
++The `catalogManaged` table feature is supported and active when:
++- The table is on Reader Version 3 and Writer Version 7.
++- The table has a `protocol` action with `readerFeatures` and `writerFeatures` both containing the
++  feature `catalogManaged`.
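
The two enablement conditions translate into a simple check on the `protocol` action (illustrative Python, not part of the diff). The function name `catalog_managed_active` is hypothetical. The field names `minReaderVersion`, `minWriterVersion`, `readerFeatures` and `writerFeatures` are those of the Delta `protocol` action, represented here as a plain dictionary.

```python
def catalog_managed_active(protocol: dict) -> bool:
    """Return True if the protocol action enables the catalogManaged feature."""
    reader_features = protocol.get("readerFeatures") or []
    writer_features = protocol.get("writerFeatures") or []
    return (
        protocol.get("minReaderVersion") == 3
        and protocol.get("minWriterVersion") == 7
        # The feature must be listed for both readers and writers.
        and "catalogManaged" in reader_features
        and "catalogManaged" in writer_features
    )
```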
++
++## Writer Requirements for Catalog-managed tables
++
++When supported and active:
++
++- Writers must discover and access the table using catalog calls, which happens _before_ the table's
++  protocol is known. See [Table Discovery](#table-discovery) for more details.
++- The [in-commit-timestamps](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#in-commit-timestamps)
++  table feature must be supported and active.
++- The `commitInfo` action must also contain a field `txnId` that stores a unique transaction
++  identifier string
++- Writers must follow the catalog's [commit protocol](#commit-protocol) and must not perform
++  ordinary filesystem-based commits against the table.
++- Writers must follow the catalog's [maintenance operation protocol](#maintenance-operations-on-catalog-managed-tables)
++
++## Reader Requirements for Catalog-managed tables
++
++When supported and active:
++
++- Readers must discover the table using catalog calls, which happens before the table's protocol
++  is known. See [Table Discovery](#table-discovery) for more details.
++- Readers must contact the catalog for information about unpublished ratified commits.
++- Readers must follow the rules described in the [Reading Catalog-managed Tables](#reading-catalog-managed-tables)
++  section above. Notably
++  - If the catalog said `v` is the latest version, clients must ignore any later versions that may
++    have been published
++  - When the catalog returns a ratified commit for version `v`, readers must use that
++    catalog-supplied commit and ignore any published Delta file for version `v` that might also be
++    present.
++
++## Table Discovery
++
++The requirements above state that readers and writers must discover and access the table using
++catalog calls, which occurs _before_ the table's protocol is known. This raises an important
++question: how can a client discover a `catalogManaged` Delta table without first knowing that it
++_is_, in fact, `catalogManaged` (according to the protocol)?
++
++To solve this, first note that, in practice, catalog-integrated engines already ask the catalog to
++resolve a table name to its storage location during the name resolution step. This protocol
++therefore encourages that the same name resolution step also indicate whether the table is
++catalog-managed. Surfacing this at the very moment the catalog returns the path imposes no extra
++round-trips, yet it lets the client decide — early and unambiguously — whether to follow the
++`catalogManaged` read and write rules.
++
++## Sample Catalog Client API
++
++The following is an example of a possible API which a Java-based Delta client might require catalog
++implementations to target:
++
++```scala
++
++trait CatalogManagedTable {
++    /**
++     * Commits the given set of `actions` to the given commit `version`.
++     *
++     * @param version The version we want to commit.
++     * @param actions Actions that need to be committed.
++     *
++     * @return CommitResponse which has details around the new committed delta file.
++     */
++    def commit(
++        version: Long,
++        actions: Iterator[String]): CommitResponse
++
++    /**
++     * Retrieves a (possibly empty) suffix of ratified commits in the range [startVersion,
++     * endVersion] for this table.
++     * 
++     * Some of these ratified commits may already have been published. Some of them may be staged,
++     * in which case the staged commit file path is returned; others may be inline, in which case
++     * the inline commit content is returned.
++     * 
++     * The returned commits are sorted in ascending version number and are contiguous.
++     *
++     * If neither start nor end version is specified, the catalog will return all available ratified
++     * commits (possibly empty, if all commits have been published).
++     *
++     * In all cases, the response also includes the table's latest ratified commit version.
++     *
++     * @return GetCommitsResponse which contains an ordered list of ratified commits
++     *         stored by the catalog, as well as the table's latest ratified commit version.
++     */
++    def getRatifiedCommits(
++        startVersion: Option[Long],
++        endVersion: Option[Long]): GetCommitsResponse
++}
++```
++
++Note that the above is only one example of a possible Catalog Client API. It is also _NOT_ a catalog
++API (no table discovery, ACL, create/drop, etc). The Delta protocol is agnostic to API details, and
++the API surface Delta clients define should only cover the specific catalog capabilities that the
++Delta client needs to correctly read and write catalog-managed tables.
++
+ # Iceberg Compatibility V1
+ 
+ This table feature (`icebergCompatV1`) ensures that Delta tables can be converted to Apache Iceberg™ format, though this table feature does not implement or specify that conversion.
+  * Files that have been [added](#Add-File-and-Remove-File) and not yet removed
+  * Files that were recently [removed](#Add-File-and-Remove-File) and have not yet expired
+  * [Transaction identifiers](#Transaction-Identifiers)
+- * [Domain Metadata](#Domain-Metadata)
++ * [Domain Metadata](#Domain-Metadata) that have not been removed (i.e. excluding tombstones with `removed=true`)
+  * [Checkpoint Metadata](#checkpoint-metadata) - Requires [V2 checkpoints](#v2-spec)
+  * [Sidecar File](#sidecar-files) - Requires [V2 checkpoints](#v2-spec)
+ 
+ 1. Identify a threshold (in days) up to which we want to preserve the deltaLog. Let's refer to
+ midnight UTC of that day as `cutOffTimestamp`. The newest commit not newer than the `cutOffTimestamp` is
+ the `cutoffCommit`, because a commit exactly at midnight is an acceptable cutoff. We want to retain everything including and after the `cutoffCommit`.
+-2. Identify the newest checkpoint that is not newer than the `cutOffCommit`. A checkpoint at the `cutOffCommit` is ideal, but an older one will do. Lets call it `cutOffCheckpoint`.
+-We need to preserve the `cutOffCheckpoint` (both the checkpoint file and the JSON commit file at that version) and all commits after it. The JSON commit file at the `cutOffCheckpoint` version must be preserved because checkpoints do not preserve [commit provenance information](#commit-provenance-information) (e.g., `commitInfo` actions), which may be required by table features such as [In-Commit Timestamps](#in-commit-timestamps). All commits after `cutOffCheckpoint` must be preserved to enable time travel for commits between `cutOffCheckpoint` and the next available checkpoint.
+-3. Delete all [delta log entries](#delta-log-entries) and [checkpoint files](#checkpoints) before the
+-`cutOffCheckpoint` checkpoint. Also delete all the [log compaction files](#log-compaction-files) having startVersion <= `cutOffCheckpoint`'s version.
++2. Identify the newest checkpoint that is not newer than the `cutOffCommit`. A checkpoint at the `cutOffCommit` is ideal, but an older one will do. Let's call it `cutOffCheckpoint`.
++We need to preserve the `cutOffCheckpoint` (both the checkpoint file and the JSON commit file at that version) and all published commits after it. The JSON commit file at the `cutOffCheckpoint` version must be preserved because checkpoints do not preserve [commit provenance information](#commit-provenance-information) (e.g., `commitInfo` actions), which may be required by table features such as [In-Commit Timestamps](#in-commit-timestamps). All published commits after `cutOffCheckpoint` must be preserved to enable time travel for commits between `cutOffCheckpoint` and the next available checkpoint.
++    - If no `cutOffCheckpoint` can be found, do not proceed with metadata cleanup as there is
++      nothing to clean up.
++3. Delete all [delta log entries](#delta-log-entries), [checkpoint files](#checkpoints), and
++   [version checksum files](#version-checksum-file) before the `cutOffCheckpoint` checkpoint. Also delete all the [log compaction files](#log-compaction-files)
++   having startVersion <= `cutOffCheckpoint`'s version.
++    - Also delete all the [staged commit files](#staged-commit) having version <=
++      `cutOffCheckpoint`'s version from the `_delta_log/_staged_commits` directory.
+ 4. Now read all the available [checkpoints](#checkpoints-1) in the _delta_log directory and identify
+ the corresponding [sidecar files](#sidecar-files). These sidecar files need to be protected.
+ 5. List all the files in `_delta_log/_sidecars` directory, preserve files that are less than a day
+ [Timestamp without Timezone](#timestamp-without-timezone-timestampNtz) | `timestampNtz` | Readers and writers
+ [Domain Metadata](#domain-metadata) | `domainMetadata` | Writers only
+ [V2 Checkpoint](#v2-checkpoint-table-feature) | `v2Checkpoint` | Readers and writers
++[Catalog-managed Tables](#catalog-managed-tables) | `catalogManaged` | Readers and writers
+ [Iceberg Compatibility V1](#iceberg-compatibility-v1) | `icebergCompatV1` | Writers only
+ [Iceberg Compatibility V2](#iceberg-compatibility-v2) | `icebergCompatV2` | Writers only
+ [Clustered Table](#clustered-table) | `clustering` | Writers only
\ No newline at end of file

README.md

@@ -0,0 +1,10 @@
+diff --git a/README.md b/README.md
+--- a/README.md
++++ b/README.md
+ ## Building
+ 
+ Delta Lake is compiled using [SBT](https://www.scala-sbt.org/1.x/docs/Command-Line-Reference.html).
++Ensure that your Java version is at least 17 (you can verify with `java -version`).
+ 
+ To compile, run
+ 
\ No newline at end of file

build.sbt

@@ -0,0 +1,218 @@
+diff --git a/build.sbt b/build.sbt
+--- a/build.sbt
++++ b/build.sbt
+       allMappings.distinct
+     },
+ 
+-    // Exclude internal modules from published POM
++    // Exclude internal modules from published POM and add kernel dependencies.
++    // Kernel modules are transitive through sparkV2 (an internal module), so they
++    // are lost when sparkV2 is filtered out. We re-add them explicitly here.
+     pomPostProcess := { node =>
+       val internalModules = internalModuleNames.value
++      val ver = version.value
+       import scala.xml._
+       import scala.xml.transform._
++
++      def kernelDependencyNode(artifactId: String): Elem = {
++        <dependency>
++          <groupId>io.delta</groupId>
++          <artifactId>{artifactId}</artifactId>
++          <version>{ver}</version>
++        </dependency>
++      }
++
++      val kernelDeps = Seq(
++        kernelDependencyNode("delta-kernel-api"),
++        kernelDependencyNode("delta-kernel-defaults"),
++        kernelDependencyNode("delta-kernel-unitycatalog")
++      )
++
+       new RuleTransformer(new RewriteRule {
+         override def transform(n: Node): Seq[Node] = n match {
+-          case e: Elem if e.label == "dependency" =>
+-            val artifactId = (e \ "artifactId").text
+-            // Check if artifactId starts with any internal module name
+-            // (e.g., "delta-spark-v1_4.1_2.13" starts with "delta-spark-v1")
+-            val isInternal = internalModules.exists(module => artifactId.startsWith(module))
+-            if (isInternal) Seq.empty else Seq(n)
++          case e: Elem if e.label == "dependencies" =>
++            val filtered = e.child.filter {
++              case child: Elem if child.label == "dependency" =>
++                val artifactId = (child \ "artifactId").text
++                !internalModules.exists(module => artifactId.startsWith(module))
++              case _ => true
++            }
++            Seq(e.copy(child = filtered ++ kernelDeps))
+           case _ => Seq(n)
+         }
+       }).transform(node).head
+     commonSettings,
+     scalaStyleSettings,
+     releaseSettings,
+-    CrossSparkVersions.sparkDependentModuleName(sparkVersion),
++    // Set sparkVersion directly (not sparkDependentModuleName) so that
++    // runOnlyForReleasableSparkModules discovers this module, but without adding a Spark
++    // suffix to the artifact name. delta-contribs is only published as delta-contribs_2.13.
++    sparkVersion := CrossSparkVersions.getSparkVersion(),
+     Compile / packageBin / mappings := (Compile / packageBin / mappings).value ++
+       listPythonFiles(baseDirectory.value.getParentFile / "python"),
+ 
+   ).configureUnidoc()
+ 
+ 
+-val unityCatalogVersion = "0.3.1"
++val unityCatalogVersion = "0.4.0"
+ val sparkUnityCatalogJacksonVersion = "2.15.4" // We are using Spark 4.0's Jackson version 2.15.x, to override Unity Catalog 0.3.0's version 2.18.x
+ 
+ lazy val sparkUnityCatalog = (project in file("spark/unitycatalog"))
+     libraryDependencies ++= Seq(
+       "org.apache.spark" %% "spark-sql" % sparkVersion.value % "provided",
+ 
+-      "io.delta" %% "delta-sharing-client" % "1.3.9",
++      "io.delta" %% "delta-sharing-client" % "1.3.10",
+ 
+       // Test deps
+       "org.scalatest" %% "scalatest" % scalaTestVersion % "test",
+ 
+       // Test Deps
+       "org.scalatest" %% "scalatest" % scalaTestVersion % "test",
++      // Jackson datatype module needed for UC SDK tests (excluded from main compile scope)
++      "com.fasterxml.jackson.datatype" % "jackson-datatype-jsr310" % "2.15.4" % "test",
+     ),
+ 
+     // Unidoc settings
+     commonSettings,
+     scalaStyleSettings,
+     releaseSettings,
+-    CrossSparkVersions.sparkDependentModuleName(sparkVersion),
++    // Set sparkVersion directly (not sparkDependentModuleName) so that
++    // runOnlyForReleasableSparkModules discovers this module, but without adding a Spark
++    // suffix to the artifact name. delta-iceberg is only published as delta-iceberg_2.13.
++    sparkVersion := CrossSparkVersions.getSparkVersion(),
+     libraryDependencies ++= {
+       if (supportIceberg) {
+         Seq(
+           "org.xerial" % "sqlite-jdbc" % "3.45.0.0" % "test",
+           "org.apache.httpcomponents.core5" % "httpcore5" % "5.2.4" % "test",
+           "org.apache.httpcomponents.client5" % "httpclient5" % "5.3.1" % "test",
+-          "org.apache.iceberg" %% icebergSparkRuntimeArtifactName % "1.10.0" % "provided"
++          "org.apache.iceberg" %% icebergSparkRuntimeArtifactName % "1.10.0" % "provided",
++          // For FixedGcsAccessTokenProvider (GCS server-side planning credentials)
++          "com.google.cloud.bigdataoss" % "util-hadoop" % "hadoop3-2.2.26" % "provided"
+         )
+       } else {
+         Seq.empty
+   )
+ // scalastyle:on println
+ 
+-val icebergShadedVersion = "1.10.0"
++val icebergShadedVersion = "1.10.1"
+ lazy val icebergShaded = (project in file("icebergShaded"))
+   .dependsOn(spark % "provided")
+   .disablePlugins(JavaFormatterPlugin, ScalafmtPlugin)
+     commonSettings,
+     scalaStyleSettings,
+     releaseSettings,
+-    CrossSparkVersions.sparkDependentSettings(sparkVersion),
+-    libraryDependencies ++= Seq(
+-      "org.apache.hudi" % "hudi-java-client" % "0.15.0" % "compile" excludeAll(
+-        ExclusionRule(organization = "org.apache.hadoop"),
+-        ExclusionRule(organization = "org.apache.zookeeper"),
+-      ),
+-      "org.apache.spark" %% "spark-avro" % sparkVersion.value % "test" excludeAll ExclusionRule(organization = "org.apache.hadoop"),
+-      "org.apache.parquet" % "parquet-avro" % "1.12.3" % "compile"
+-    ),
++    // Set sparkVersion directly (not sparkDependentModuleName) so that
++    // runOnlyForReleasableSparkModules discovers this module, but without adding a Spark
++    // suffix to the artifact name. delta-hudi is only published as delta-hudi_2.13.
++    sparkVersion := CrossSparkVersions.getSparkVersion(),
++    libraryDependencies ++= {
++      if (supportHudi) {
++        Seq(
++          "org.apache.hudi" % "hudi-java-client" % "0.15.0" % "compile" excludeAll(
++            ExclusionRule(organization = "org.apache.hadoop"),
++            ExclusionRule(organization = "org.apache.zookeeper"),
++          ),
++          "org.apache.spark" %% "spark-avro" % sparkVersion.value % "test" excludeAll ExclusionRule(organization = "org.apache.hadoop"),
++          "org.apache.parquet" % "parquet-avro" % "1.12.3" % "compile"
++        )
++      } else {
++        Seq.empty
++      }
++    },
++    // Skip compilation and publishing when supportHudi is false
++    Compile / skip := !supportHudi,
++    Test / skip := !supportHudi,
++    publish / skip := !supportHudi,
++    publishLocal / skip := !supportHudi,
++    publishM2 / skip := !supportHudi,
+     assembly / assemblyJarName := s"${name.value}-assembly_${scalaBinaryVersion.value}-${version.value}.jar",
+     assembly / logLevel := Level.Info,
+     assembly / test := {},
+       // crossScalaVersions must be set to Nil on the aggregating project
+       crossScalaVersions := Nil,
+       publishArtifact := false,
+-      publish / skip := false,
++      publish / skip := true,
+     )
+ }
+ 
+       // crossScalaVersions must be set to Nil on the aggregating project
+       crossScalaVersions := Nil,
+       publishArtifact := false,
+-      publish / skip := false,
++      publish / skip := true,
+     )
+ }
+ 
+     // crossScalaVersions must be set to Nil on the aggregating project
+     crossScalaVersions := Nil,
+     publishArtifact := false,
+-    publish / skip := false,
++    publish / skip := true,
+     unidocSourceFilePatterns := {
+       (kernelApi / unidocSourceFilePatterns).value.scopeToProject(kernelApi) ++
+       (kernelDefaults / unidocSourceFilePatterns).value.scopeToProject(kernelDefaults)
+     // crossScalaVersions must be set to Nil on the aggregating project
+     crossScalaVersions := Nil,
+     publishArtifact := false,
+-    publish / skip := false,
++    publish / skip := true,
+   )
+ 
+ /*
+     sys.env.getOrElse("SONATYPE_USERNAME", ""),
+     sys.env.getOrElse("SONATYPE_PASSWORD", "")
+   ),
++  credentials += Credentials(
++    "Sonatype Nexus Repository Manager",
++    "central.sonatype.com",
++    sys.env.getOrElse("SONATYPE_USERNAME", ""),
++    sys.env.getOrElse("SONATYPE_PASSWORD", "")
++  ),
+   publishTo := {
+     val ossrhBase = "https://ossrh-staging-api.central.sonatype.com/"
++    val centralSnapshots = "https://central.sonatype.com/repository/maven-snapshots/"
+     if (isSnapshot.value) {
+-      Some("snapshots" at ossrhBase + "content/repositories/snapshots")
++      Some("snapshots" at centralSnapshots)
+     } else {
+       Some("releases"  at ossrhBase + "service/local/staging/deploy/maven2")
+     }
+ // Looks like some of release settings should be set for the root project as well.
+ publishArtifact := false  // Don't release the root project
+ publish / skip := true
+-publishTo := Some("snapshots" at "https://ossrh-staging-api.central.sonatype.com/content/repositories/snapshots")
++publishTo := Some("snapshots" at "https://central.sonatype.com/repository/maven-snapshots/")
+ releaseCrossBuild := false  // Don't use sbt-release's cross facility
+ releaseProcess := Seq[ReleaseStep](
+   checkSnapshotDependencies,
+   setReleaseVersion,
+   commitReleaseVersion,
+   tagRelease
+-) ++ CrossSparkVersions.crossSparkReleaseSteps("+publishSigned") ++ Seq[ReleaseStep](
++) ++ CrossSparkVersions.crossSparkReleaseSteps("publishSigned") ++ Seq[ReleaseStep](
+ 
+   // Do NOT use `sonatypeBundleRelease` - it will actually release to Maven! We want to do that
+   // manually.
\ No newline at end of file
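
The `pomPostProcess` change in the build.sbt hunk above now matches the `<dependencies>` element as a whole: it drops children whose `artifactId` starts with an internal module name, then appends the three kernel artifacts that would otherwise be lost. Below is a minimal Python sketch of that filter-then-append step, using only the standard library. The function names, the sample POM, the internal module names, and the version string are made up for illustration and are not taken from the build; real POMs also carry an XML namespace, which this sketch ignores.

```python
import xml.etree.ElementTree as ET

# Artifact names copied from the build.sbt hunk.
KERNEL_ARTIFACTS = [
    "delta-kernel-api",
    "delta-kernel-defaults",
    "delta-kernel-unitycatalog",
]

# Hypothetical, namespace-free POM used only for this illustration.
SAMPLE_POM = """
<project>
  <dependencies>
    <dependency>
      <groupId>io.delta</groupId>
      <artifactId>delta-spark-v1_4.1_2.13</artifactId>
      <version>0.0.0</version>
    </dependency>
    <dependency>
      <groupId>org.example</groupId>
      <artifactId>some-lib</artifactId>
      <version>1.0</version>
    </dependency>
  </dependencies>
</project>
"""


def rewrite_dependencies(pom_xml, internal_modules, version):
    """Drop internal-module dependencies, then re-add the kernel ones explicitly."""
    root = ET.fromstring(pom_xml.strip())
    # Materialize the matches first so the tree is not mutated while iterating.
    for deps in list(root.iter("dependencies")):
        for dep in list(deps):
            artifact_id = dep.findtext("artifactId", default="")
            # Prefix match, so "delta-spark-v1_4.1_2.13" counts as "delta-spark-v1".
            is_internal = any(artifact_id.startswith(m) for m in internal_modules)
            if dep.tag == "dependency" and is_internal:
                deps.remove(dep)
        for artifact in KERNEL_ARTIFACTS:
            node = ET.SubElement(deps, "dependency")
            ET.SubElement(node, "groupId").text = "io.delta"
            ET.SubElement(node, "artifactId").text = artifact
            ET.SubElement(node, "version").text = version
    return root


def artifact_ids(root):
    """List the artifactId of every dependency, in document order."""
    return [dep.findtext("artifactId") for dep in root.iter("dependency")]


if __name__ == "__main__":
    rewritten = rewrite_dependencies(
        SAMPLE_POM, ["delta-spark-v1", "delta-spark-v2"], "0.0.0"
    )
    print(artifact_ids(rewritten))
```

Matching the parent element rather than each `dependency` is what makes the append possible: a per-dependency rule can only keep or drop a node, while a rule on `dependencies` can also add siblings.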

connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/.00000000000000000000.json.crc

@@ -0,0 +1,3 @@
+diff --git a/connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/.00000000000000000000.json.crc b/connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/.00000000000000000000.json.crc
+new file mode 100644
+Binary files /dev/null and b/connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/.00000000000000000000.json.crc differ
\ No newline at end of file

connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/00000000000000000000.crc

@@ -0,0 +1,5 @@
+diff --git a/connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/00000000000000000000.crc b/connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/00000000000000000000.crc
+new file mode 100644
+--- /dev/null
++++ b/connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/00000000000000000000.crc
++{"txnId":"6132e880-0f3a-4db4-b882-1da039bffbad","tableSizeBytes":0,"numFiles":0,"numMetadata":1,"numProtocol":1,"setTransactions":[],"domainMetadata":[],"metadata":{"id":"0eb3e007-b3cc-40e4-bca1-a5970d86b5a6","format":{"provider":"parquet","options":{}},"schemaString":"{\"type\":\"struct\",\"fields\":[{\"name\":\"id\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}},{\"name\":\"utf8_binary_col\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"utf8_lcase_col\",\"type\":\"string\",\"nullable\":true,\"metadata\":{\"__COLLATIONS\":{\"utf8_lcase_col\":\"spark.UTF8_LCASE\"}}},{\"name\":\"unicode_col\",\"type\":\"string\",\"nullable\":true,\"metadata\":{\"__COLLATIONS\":{\"unicode_col\":\"icu.UNICODE\"}}}]}","partitionColumns":[],"configuration":{},"createdTime":1773779518731},"protocol":{"minReaderVersion":1,"minWriterVersion":7,"writerFeatures":["domainMetadata","collations-preview","appendOnly","invariants"]},"histogramOpt":{"sortedBinBoundaries":[0,8192,16384,32768,65536,131072,262144,524288,1048576,2097152,4194304,8388608,12582912,16777216,20971520,25165824,29360128,33554432,37748736,41943040,50331648,58720256,67108864,75497472,83886080,92274688,100663296,109051904,117440512,125829120,130023424,134217728,138412032,142606336,146800640,150994944,167772160,184549376,201326592,218103808,234881024,251658240,268435456,285212672,301989888,318767104,335544320,352321536,369098752,385875968,402653184,419430400,436207616,452984832,469762048,486539264,503316480,520093696,536870912,553648128,570425344,587202560,603979776,671088640,738197504,805306368,872415232,939524096,1006632960,1073741824,1140850688,1207959552,1275068416,1342177280,1409286144,1476395008,1610612736,1744830464,1879048192,2013265920,2147483648,2415919104,2684354560,2952790016,3221225472,3489660928,3758096384,4026531840,4294967296,8589934592,17179869184,34359738368,68719476736,137438953472,274877906944],"fileCounts":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],"totalBytes":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]},"allFiles":[]}
\ No newline at end of file

... (truncated, output exceeded 60000 bytes)

Reproduce locally: `git range-diff e8cffee..a13ae25 d1139d2..9f42cb1` | Disable: `git config gitstack.push-range-diff false`

zikangh · 2026-04-01T21:06:58Z

Range-diff: master (9f42cb1 -> 320b3ac)

.github/CODEOWNERS

@@ -0,0 +1,12 @@
+diff --git a/.github/CODEOWNERS b/.github/CODEOWNERS
+--- a/.github/CODEOWNERS
++++ b/.github/CODEOWNERS
+ /project/                       @tdas
+ /version.sbt                    @tdas
+ 
++# Spark V2 and Unified modules
++/spark/v2/                      @tdas @huan233usc @TimothyW553 @raveeram-db @murali-db
++/spark-unified/                 @tdas @huan233usc @TimothyW553 @raveeram-db @murali-db
++
+ # All files in the root directory
+ /*                              @tdas
\ No newline at end of file

.github/workflows/iceberg_test.yaml

@@ -0,0 +1,16 @@
+diff --git a/.github/workflows/iceberg_test.yaml b/.github/workflows/iceberg_test.yaml
+--- a/.github/workflows/iceberg_test.yaml
++++ b/.github/workflows/iceberg_test.yaml
+           # the above directories when we use the key for the first time. After that, each run will
+           # just use the cache. The cache is immutable so we need to use a new key when trying to
+           # cache new stuff.
+-          key: delta-sbt-cache-spark3.2-scala${{ matrix.scala }}
++          key: delta-sbt-cache-spark4.0-scala${{ matrix.scala }}
+       - name: Install Job dependencies
+         run: |
+           sudo apt-get update
+       - name: Run Scala/Java and Python tests
+         # when changing TEST_PARALLELISM_COUNT make sure to also change it in spark_master_test.yaml
+         run: |
+-          TEST_PARALLELISM_COUNT=4 pipenv run python run-tests.py --group iceberg
++          TEST_PARALLELISM_COUNT=4 pipenv run python run-tests.py --group iceberg --spark-version 4.0
\ No newline at end of file

.github/workflows/spark_examples_test.yaml

@@ -0,0 +1,54 @@
+diff --git a/.github/workflows/spark_examples_test.yaml b/.github/workflows/spark_examples_test.yaml
+--- a/.github/workflows/spark_examples_test.yaml
++++ b/.github/workflows/spark_examples_test.yaml
+         # Spark versions are dynamically generated - released versions only
+         spark_version: ${{ fromJson(needs.generate-matrix.outputs.spark_versions) }}
+         # These Scala versions must match those in the build.sbt
+-        scala: [2.13.16]
++        scala: [2.13.17]
+     env:
+       SCALA_VERSION: ${{ matrix.scala }}
+-      SPARK_VERSION: ${{ matrix.spark_version }}
+     steps:
+       - uses: actions/checkout@v3
+       - name: Get Spark version details
+         id: spark-details
+         run: |
+-          # Get JVM version, package suffix, iceberg support for this Spark version
++          # Get JVM version, package suffix, iceberg support, and full version for this Spark version
+           JVM_VERSION=$(python3 project/scripts/get_spark_version_info.py --get-field "${{ matrix.spark_version }}" targetJvm | jq -r)
+           SPARK_PACKAGE_SUFFIX=$(python3 project/scripts/get_spark_version_info.py --get-field "${{ matrix.spark_version }}" packageSuffix | jq -r)
+           SUPPORT_ICEBERG=$(python3 project/scripts/get_spark_version_info.py --get-field "${{ matrix.spark_version }}" supportIceberg | jq -r)
++          SPARK_FULL_VERSION=$(python3 project/scripts/get_spark_version_info.py --get-field "${{ matrix.spark_version }}" fullVersion | jq -r)
+           echo "jvm_version=$JVM_VERSION" >> $GITHUB_OUTPUT
+           echo "spark_package_suffix=$SPARK_PACKAGE_SUFFIX" >> $GITHUB_OUTPUT
+           echo "support_iceberg=$SUPPORT_ICEBERG" >> $GITHUB_OUTPUT
+-          echo "Using JVM $JVM_VERSION for Spark ${{ matrix.spark_version }}, package suffix: '$SPARK_PACKAGE_SUFFIX', support iceberg: '$SUPPORT_ICEBERG'"
++          echo "spark_full_version=$SPARK_FULL_VERSION" >> $GITHUB_OUTPUT
++          echo "Using JVM $JVM_VERSION for Spark $SPARK_FULL_VERSION, package suffix: '$SPARK_PACKAGE_SUFFIX', support iceberg: '$SUPPORT_ICEBERG'"
+       - name: install java
+         uses: actions/setup-java@v3
+         with:
+       - name: Run Delta Spark Local Publishing and Examples Compilation
+         # examples/scala/build.sbt will compile against the local Delta release version (e.g. 3.2.0-SNAPSHOT).
+         # Thus, we need to publishM2 first so those jars are locally accessible.
+-        # The SPARK_PACKAGE_SUFFIX env var tells examples/scala/build.sbt which artifact naming to use.
++        # -DsparkVersion is for the Delta project's publishM2 (which Spark version to compile Delta against).
++        # SPARK_VERSION/SPARK_PACKAGE_SUFFIX/SUPPORT_ICEBERG are for examples/scala/build.sbt (dependency resolution).
+         env:
+           SPARK_PACKAGE_SUFFIX: ${{ steps.spark-details.outputs.spark_package_suffix }}
+           SUPPORT_ICEBERG: ${{ steps.spark-details.outputs.support_iceberg }}
++          SPARK_VERSION: ${{ steps.spark-details.outputs.spark_full_version }}
+         run: |
+           build/sbt clean
+-          build/sbt -DsparkVersion=${{ matrix.spark_version }} publishM2
++          build/sbt -DsparkVersion=${{ steps.spark-details.outputs.spark_full_version }} publishM2
+           cd examples/scala && build/sbt "++ $SCALA_VERSION compile"
++      - name: Run UC Delta Integration Test
++        # Verifies that delta-spark resolved from Maven local includes all kernel module
++        # dependencies transitively by running a real UC-backed Delta workload.
++        env:
++          SPARK_PACKAGE_SUFFIX: ${{ steps.spark-details.outputs.spark_package_suffix }}
++          SPARK_VERSION: ${{ steps.spark-details.outputs.spark_full_version }}
++        run: |
++          cd examples/scala && build/sbt "++ $SCALA_VERSION runMain example.UnityCatalogQuickstart"
\ No newline at end of file

.github/workflows/spark_test.yaml

@@ -0,0 +1,27 @@
+diff --git a/.github/workflows/spark_test.yaml b/.github/workflows/spark_test.yaml
+--- a/.github/workflows/spark_test.yaml
++++ b/.github/workflows/spark_test.yaml
+         # These Scala versions must match those in the build.sbt
+         scala: [2.13.16]
+         # Important: This list of shards must be [0..NUM_SHARDS - 1]
+-        shard: [0, 1, 2, 3]
++        shard: [0, 1, 2, 3, 4, 5, 6, 7]
+     env:
+       SCALA_VERSION: ${{ matrix.scala }}
+       SPARK_VERSION: ${{ matrix.spark_version }}
+       # Important: This must be the same as the length of shards in matrix
+-      NUM_SHARDS: 4
++      NUM_SHARDS: 8
+     steps:
+       - uses: actions/checkout@v3
+       - name: Get Spark version details
+         # when changing TEST_PARALLELISM_COUNT make sure to also change it in spark_python_test.yaml
+         run: |
+           TEST_PARALLELISM_COUNT=4 pipenv run python run-tests.py --group spark --shard ${{ matrix.shard }} --spark-version ${{ matrix.spark_version }}
++      - name: Upload test reports
++        if: always()
++        uses: actions/upload-artifact@v4
++        with:
++          name: test-reports-spark${{ matrix.spark_version }}-shard${{ matrix.shard }}
++          path: "**/target/test-reports/*.xml"
++          retention-days: 7
\ No newline at end of file
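
The spark_test.yaml hunk above raises the shard count from 4 to 8, and its comments require the `shard` matrix to be exactly `[0..NUM_SHARDS - 1]`, with `NUM_SHARDS` equal to the length of that list. A small Python sketch of that invariant as a standalone check follows. The function name is hypothetical, and nothing in the diff shows such a check existing in the workflow or in `run-tests.py`.

```python
def shard_matrix_is_valid(shards, num_shards):
    """Return True when shards is exactly 0, 1, ..., num_shards - 1, in order."""
    return list(shards) == list(range(num_shards))


if __name__ == "__main__":
    # Values from the updated workflow: eight shards, NUM_SHARDS set to 8.
    print(shard_matrix_is_valid([0, 1, 2, 3, 4, 5, 6, 7], 8))
    # Updating only one of the two settings breaks the invariant.
    print(shard_matrix_is_valid([0, 1, 2, 3], 8))
```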

PROTOCOL.md

@@ -0,0 +1,537 @@
+diff --git a/PROTOCOL.md b/PROTOCOL.md
+--- a/PROTOCOL.md
++++ b/PROTOCOL.md
+   - [Writer Requirements for Variant Type](#writer-requirements-for-variant-type)
+   - [Reader Requirements for Variant Data Type](#reader-requirements-for-variant-data-type)
+   - [Compatibility with other Delta Features](#compatibility-with-other-delta-features)
++- [Catalog-managed tables](#catalog-managed-tables)
++  - [Terminology: Commits](#terminology-commits)
++  - [Terminology: Delta Client](#terminology-delta-client)
++  - [Terminology: Catalogs](#terminology-catalogs)
++  - [Catalog Responsibilities](#catalog-responsibilities)
++  - [Reading Catalog-managed Tables](#reading-catalog-managed-tables)
++  - [Commit Protocol](#commit-protocol)
++  - [Getting Ratified Commits from the Catalog](#getting-ratified-commits-from-the-catalog)
++  - [Publishing Commits](#publishing-commits)
++  - [Maintenance Operations on Catalog-managed Tables](#maintenance-operations-on-catalog-managed-tables)
++  - [Creating and Dropping Catalog-managed Tables](#creating-and-dropping-catalog-managed-tables)
++  - [Catalog-managed Table Enablement](#catalog-managed-table-enablement)
++  - [Writer Requirements for Catalog-managed tables](#writer-requirements-for-catalog-managed-tables)
++  - [Reader Requirements for Catalog-managed tables](#reader-requirements-for-catalog-managed-tables)
++  - [Table Discovery](#table-discovery)
++  - [Sample Catalog Client API](#sample-catalog-client-api)
+ - [Requirements for Writers](#requirements-for-writers)
+   - [Creation of New Log Entries](#creation-of-new-log-entries)
+   - [Consistency Between Table Metadata and Data Files](#consistency-between-table-metadata-and-data-files)
+ __(1)__ `preimage` is the value before the update, `postimage` is the value after the update.
+ 
+ ### Delta Log Entries
+-Delta files are stored as JSON in a directory at the root of the table named `_delta_log`, and together with checkpoints make up the log of all changes that have occurred to a table.
+ 
+-Delta files are the unit of atomicity for a table, and are named using the next available version number, zero-padded to 20 digits.
++Delta Log Entries, also known as Delta files, are JSON files stored in the `_delta_log`
++directory at the root of the table. Together with checkpoints, they make up the log of all changes
++that have occurred to a table. Delta files are the unit of atomicity for a table, and are named
++using the next available version number, zero-padded to 20 digits.
+ 
+ For example:
+ 
+ ```
+ ./_delta_log/00000000000000000000.json
+ ```
+-Delta files use new-line delimited JSON format, where every action is stored as a single line JSON document.
+-A delta file, `n.json`, contains an atomic set of [_actions_](#Actions) that should be applied to the previous table state, `n-1.json`, in order to the construct `n`th snapshot of the table.
+-An action changes one aspect of the table's state, for example, adding or removing a file.
++
++Delta files use newline-delimited JSON format, where every action is stored as a single-line
++JSON document. A Delta file, corresponding to version `v`, contains an atomic set of
++[_actions_](#actions) that should be applied to the previous table state corresponding to version
++`v-1`, in order to construct the `v`th snapshot of the table. An action changes one aspect of the
++table's state, for example, adding or removing a file.
++
++**Note:** If the [catalogManaged table feature](#catalog-managed-tables) is enabled on the table,
++recently [ratified commits](#ratified-commit) may not yet be published to the `_delta_log` directory as normal Delta
++files - they may be stored directly by the catalog or reside in the `_delta_log/_staged_commits`
++directory. Delta clients must contact the table's managing catalog in order to find the information
++about these [ratified, potentially-unpublished commits](#publishing-commits).
++
++The `_delta_log/_staged_commits` directory is the staging area for [staged](#staged-commit)
++commits. Delta files in this directory have a UUID embedded into them and follow the pattern
++`<version>.<uuid>.json`, where the version corresponds to the proposed commit version, zero-padded
++to 20 digits.
++
++For example:
++
++```
++./_delta_log/_staged_commits/00000000000000000000.3a0d65cd-4056-49b8-937b-95f9e3ee90e5.json
++./_delta_log/_staged_commits/00000000000000000001.7d17ac10-5cc3-401b-bd1a-9c82dd2ea032.json
++./_delta_log/_staged_commits/00000000000000000001.016ae953-37a9-438e-8683-9a9a4a79a395.json
++./_delta_log/_staged_commits/00000000000000000002.3ae45b72-24e1-865a-a211-34987ae02f2a.json
++```
++
++NOTE: The (proposed) version number of a staged commit is authoritative - file
++`00000000000000000100.<uuid>.json` always corresponds to a commit attempt for version 100. Besides
++simplifying implementations, it also acknowledges the fact that commit files cannot safely be reused
++for multiple commit attempts. For example, resolving conflicts in a table with [row
++tracking](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#row-tracking) enabled requires
++rewriting all file actions to update their `baseRowId` field.
++
++The [catalog](#terminology-catalogs) is the source of truth about which staged commit files in
++the `_delta_log/_staged_commits` directory correspond to ratified versions, and Delta clients should
++not attempt to directly interpret the contents of that directory. Refer to
++[catalog-managed tables](#catalog-managed-tables) for more details.
+ 
+ ### Checkpoints
+ Checkpoints are also stored in the `_delta_log` directory, and can be created at any time, for any committed version of the table.
+ ### Commit Provenance Information
+ A delta file can optionally contain additional provenance information about what higher-level operation was being performed as well as who executed it.
+ 
++When the `catalogManaged` table feature is enabled, the `commitInfo` action must have a field
++`txnId` that stores a unique transaction identifier string.
++
+ Implementations are free to store any valid JSON-formatted data via the `commitInfo` action.
+ 
+ When [In-Commit Timestamps](#in-commit-timestamps) are enabled, writers are required to include a `commitInfo` action with every commit, which must include the `inCommitTimestamp` field. Also, the `commitInfo` action must be first action in the commit.
+  - A single `protocol` action
+  - A single `metaData` action
+  - A collection of `txn` actions with unique `appId`s
+- - A collection of `domainMetadata` actions with unique `domain`s.
++ - A collection of `domainMetadata` actions with unique `domain`s, excluding tombstones (i.e. actions with `removed=true`).
+  - A collection of `add` actions with unique path keys, corresponding to the newest (path, deletionVector.uniqueId) pair encountered for each path.
+  - A collection of `remove` actions with unique `(path, deletionVector.uniqueId)` keys. The intersection of the primary keys in the `add` collection and `remove` collection must be empty. That means a logical file cannot exist in both the `remove` and `add` collections at the same time; however, the same *data file* can exist with *different* DVs in the `remove` collection, as logically they represent different content. The `remove` actions act as _tombstones_, and only exist for the benefit of the VACUUM command. Snapshot reads only return `add` actions on the read path.
+  
+      - write a `metaData` action to add the `delta.columnMapping.mode` table property.
+  - Write data files by using the _physical name_ that is chosen for each column. The physical name of the column is static and can be different than the _display name_ of the column, which is changeable.
+  - Write the 32 bit integer column identifier as part of the `field_id` field of the `SchemaElement` struct in the [Parquet Thrift specification](https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift).
+- - Track partition values and column level statistics with the physical name of the column in the transaction log.
++ - Track partition values, column level statistics, and [clustering column](#clustered-table) names with the physical name of the column in the transaction log.
+  - Assign a globally unique identifier as the physical name for each new column that is added to the schema. This is especially important for supporting cheap column deletions in `name` mode. In addition, column identifiers need to be assigned to each column. The maximum id that is assigned to a column is tracked as the table property `delta.columnMapping.maxColumnId`. This is an internal table property that cannot be configured by users. This value must increase monotonically as new columns are introduced and committed to the table alongside the introduction of the new columns to the schema.
+ 
+ ## Reader Requirements for Column Mapping
+ ## Writer Requirement for Deletion Vectors
+ When adding a logical file with a deletion vector, then that logical file must have correct `numRecords` information for the data file in the `stats` field.
+ 
++# Catalog-managed tables
++
++With this feature enabled, the [catalog](#terminology-catalogs) that manages the table becomes the
++source of truth for whether a given commit attempt succeeded.
++
++The table feature defines the parts of the [commit protocol](#commit-protocol) that directly impact
++the Delta table (e.g. atomicity requirements, publishing, etc). The Delta client and catalog
++together are responsible for implementing the Delta-specific aspects of commit as defined by this
++spec, but are otherwise free to define their own APIs and protocols for communication with each
++other.
++
++**NOTE**: Filesystem-based access to catalog-managed tables is not supported. Delta clients are
++expected to discover and access catalog-managed tables through the managing catalog, not by direct
++listing in the filesystem. This feature is primarily designed to warn filesystem-based readers that
++might attempt to access a catalog-managed table's storage location without going through the catalog
++first, and to block filesystem-based writers who could otherwise corrupt both the table and the
++catalog by failing to commit through the catalog.
++
++Before we can go into details of this protocol feature, we must first align our terminology.
++
++## Terminology: Commits
++
++A commit is a set of [actions](#actions) that transform a Delta table from version `v - 1` to `v`.
++It contains the same kind of content as is stored in a [Delta file](#delta-log-entries).
++
++A commit may be stored in the file system as a Delta file - either _published_ or _staged_ - or
++stored _inline_ in the managing catalog, using whatever format the catalog prefers.
++
++There are several types of commits:
++
++1. **Proposed commit**:  A commit that a Delta client has proposed for the next version of the
++   table. It could be _staged_ or _inline_. It will either become _ratified_ or be rejected.
++
++2. <a name="staged-commit">**Staged commit**</a>: A commit that is written to disk at
++   `_delta_log/_staged_commits/<v>.<uuid>.json`. It has the same content and format as a published
++   Delta file.
++    - Here, the `uuid` is a random UUID that is generated for each commit and `v` is the version
++      which is proposed to be committed, zero-padded to 20 digits.
++    - The mere existence of a staged commit does not mean that the file has been ratified or even
++      proposed. It might correspond to a failed or in-progress commit attempt.
++    - The catalog is the source of truth around which staged commits are ratified.
++    - The catalog stores only the location, not the content, of a staged (and ratified) commit.
++
++3. <a name="inline-commit">**Inline commit**</a>: A proposed commit that is not written to disk but
++   rather has its content sent to the catalog for the catalog to store directly.
++
++4. <a name="ratified-commit">**Ratified commit**</a>: A proposed commit that a catalog has
++   determined has won the commit at the desired version of the table.
++    - The catalog must store ratified commits (that is, the staged commit's location or the inline
++      commit's content) until they are published to the `_delta_log` directory.
++    - A ratified commit may or may not yet be published.
++    - A ratified commit may or may not even be stored by the catalog at all - the catalog may
++      have just atomically published it to the filesystem directly, relying on PUT-if-absent
++      primitives to facilitate the ratification and publication all in one step.
++
++5. <a name="published-commit">**Published commit**</a>: A ratified commit that has been copied into
++   the `_delta_log` as a normal Delta file, i.e. `_delta_log/<v>.json`.
++    - Here, the `v` is the version which is being committed, zero-padded to 20 digits.
++    - The existence of a `<v>.json` file proves that the corresponding version `v` is ratified,
++      regardless of whether the table is catalog-managed or filesystem-based. The catalog is allowed
++      to return information about published commits, but Delta clients can also use filesystem
++      listing operations to directly discover them.
++    - Published commits do not need to be stored by the catalog.
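
Editorial sketch (not part of the diff): the naming rules for staged and published commits above can be illustrated in Java. The class and method names here are hypothetical; only the path layouts `_delta_log/<v>.json` and `_delta_log/_staged_commits/<v>.<uuid>.json`, with `v` zero-padded to 20 digits, come from the spec text.

```java
import java.util.Locale;
import java.util.UUID;

// Hypothetical helper: builds commit file paths as described in the spec text.
final class CommitFileNames {
    // Published commit: _delta_log/<v>.json, with v zero-padded to 20 digits.
    static String publishedPath(long version) {
        return String.format(Locale.ROOT, "_delta_log/%020d.json", version);
    }

    // Staged commit: _delta_log/_staged_commits/<v>.<uuid>.json.
    // The existence of this file alone says nothing about ratification.
    static String stagedPath(long version, UUID uuid) {
        return String.format(
            Locale.ROOT, "_delta_log/_staged_commits/%020d.%s.json", version, uuid);
    }
}
```

A writer would typically call `stagedPath(v, UUID.randomUUID())` once per commit attempt, so that concurrent attempts at the same version never collide on a file name.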
++
++## Terminology: Delta Client
++
++This is the component that implements support for reading and writing Delta tables, and implements
++the logic required by the `catalogManaged` table feature. Among other things, it
++- triggers the filesystem listing, if needed, to discover published commits
++- generates the commit content (the set of [actions](#actions))
++- works together with the query engine to trigger the commit process and invoke the client-side
++  catalog component with the commit content
++
++The Delta client is also responsible for defining the client-side API that catalogs should target.
++That is, there must be _some_ API that the [catalog client](#catalog-client) can use to communicate
++to the Delta client the subset of catalog-managed information that the Delta client cares about.
++This protocol feature is concerned with what information Delta cares about, but leaves to Delta
++clients the design of the API they use to obtain that information from catalog clients.
++
++## Terminology: Catalogs
++
++1. **Catalog**: A catalog is an entity which manages a Delta table, including its creation, writes,
++   reads, and eventual deletion.
++    - It could be backed by a database, a filesystem, or any other persistence mechanism.
++    - Each catalog has its own spec for how catalog clients should interact with it, and how
++      commits are performed.
++
++2. <a name="catalog-client">**Catalog Client**</a>: The catalog always has a client-side component
++   which the Delta client interacts with directly. This client-side component has two primary
++   responsibilities:
++    - implement any client-side catalog-specific logic (such as staging or
++      [publishing](#publishing-commits) commits)
++    - communicate with the Catalog Server, if any
++
++3. **Catalog Server**: The catalog may also involve a server-side component, which the client-side
++   component is responsible for communicating with.
++    - This server is responsible for coordinating commits and potentially persisting table metadata
++      and enforcing authorization policies.
++    - Not all catalogs require a server; some may be entirely client-side, e.g. filesystem-backed
++      catalogs, or they may make use of a generic database server and implement all of the catalog's
++      business logic client-side.
++
++**NOTE**: This specification outlines the responsibilities and actions that catalogs must implement.
++This spec does its best not to assume any specific catalog _implementation_, though it does call out
++likely client-side and server-side responsibilities. Nonetheless, what a given catalog does
++client-side or server-side is up to each catalog implementation to decide for itself.
++
++## Catalog Responsibilities
++
++When the `catalogManaged` table feature is enabled, a catalog performs commits to the table on behalf
++of the Delta client.
++
++As stated above, the Delta spec does not mandate any particular client-server design or API for
++catalogs that manage Delta tables. However, the catalog does need to provide certain capabilities
++for reading and writing Delta tables:
++
++- Atomically commit a version `v` with a given set of `actions`. This is explained in detail in the
++  [commit protocol](#commit-protocol) section.
++- Retrieve information about recent ratified commits and the latest ratified version on the table.
++  This is explained in detail in the [Getting Ratified Commits from the Catalog](#getting-ratified-commits-from-the-catalog) section.
++- Though not required, it is encouraged that catalogs also return the latest table-level metadata,
++  such as the latest Protocol and Metadata actions, for the table. This can provide significant
++  performance advantages to conforming Delta clients, who may forgo log replay and instead trust
++  the information provided by the catalog during query planning.
++
++## Reading Catalog-managed Tables
++
++A catalog-managed table can have a mix of (a) published and (b) ratified but non-published commits.
++The catalog is the source of truth for ratified commits. Also recall that ratified commits can be
++[staged commits](#staged-commit) that are persisted to the `_delta_log/_staged_commits` directory,
++or [inline commits](#inline-commit) whose content the catalog stores directly.
++
++For example, suppose the `_delta_log` directory contains the following files:
++
++```
++00000000000000000000.json
++00000000000000000001.json
++00000000000000000002.checkpoint.parquet
++00000000000000000002.json
++00000000000000000003.00000000000000000005.compacted.json
++00000000000000000003.json
++00000000000000000004.json
++00000000000000000005.json
++00000000000000000006.json
++00000000000000000007.json
++_staged_commits/00000000000000000007.016ae953-37a9-438e-8683-9a9a4a79a395.json // ratified and published
++_staged_commits/00000000000000000008.7d17ac10-5cc3-401b-bd1a-9c82dd2ea032.json // ratified
++_staged_commits/00000000000000000008.b91807ba-fe18-488c-a15e-c4807dbd2174.json // rejected
++_staged_commits/00000000000000000010.0f707846-cd18-4e01-b40e-84ee0ae987b0.json // not yet ratified
++_staged_commits/00000000000000000010.7a980438-cb67-4b89-82d2-86f73239b6d6.json // partial file
++```
++
++Further, suppose the catalog stores the following ratified commits:
++```
++{
++  7  -> "00000000000000000007.016ae953-37a9-438e-8683-9a9a4a79a395.json",
++  8  -> "00000000000000000008.7d17ac10-5cc3-401b-bd1a-9c82dd2ea032.json",
++  9  -> <inline commit: content stored by the catalog directly>
++}
++```
++
++Some things to note are:
++- the catalog isn't aware that commit 7 was already published - perhaps the response from the
++  filesystem was dropped
++- commit 9 is an inline commit
++- neither of the two staged commits for version 10 has been ratified
++
++To read such tables, Delta clients must first contact the catalog to get the ratified commits. This
++informs the Delta client of commits 7 through 9 as well as the latest ratified version, 9.
++
++If this information is insufficient to construct a complete snapshot of the table, Delta clients
++must LIST the `_delta_log` directory to get information about the published commits. For commits
++that are both returned by the catalog and already published, Delta clients must treat the catalog's
++version as authoritative and read the commit returned by the catalog. Additionally, Delta clients
++must ignore any files with versions greater than the latest ratified commit version returned by the
++catalog.
++
++Combining these two sets of files and commits enables Delta clients to generate a snapshot at the
++latest version of the table.
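
Editorial sketch (not part of the diff): the merge rule described above can be written as a small Java function. This is a minimal illustration under stated assumptions, not how any particular Delta client implements it. The class name, parameter names, and the `"catalog"`/`"published"` labels are hypothetical, and checkpoints and log compaction files are ignored for brevity.

```java
import java.util.SortedSet;
import java.util.TreeMap;

// Hypothetical helper: decides, per version, which copy of a commit to read.
final class SnapshotCommitResolver {
    static TreeMap<Long, String> resolve(
            SortedSet<Long> publishedVersions,  // versions found by LISTing _delta_log
            SortedSet<Long> catalogVersions,    // ratified commits returned by the catalog
            long latestRatifiedVersion) {       // authoritative, from the catalog
        TreeMap<Long, String> sources = new TreeMap<>();
        for (long v : publishedVersions) {
            // Ignore files beyond the latest ratified version reported by the catalog.
            if (v <= latestRatifiedVersion) {
                sources.put(v, "published");
            }
        }
        for (long v : catalogVersions) {
            // The catalog's commit is authoritative, even if the version is also published.
            sources.put(v, "catalog");
        }
        return sources;
    }
}
```

With the example listing above (published versions 0 through 7, catalog commits 7 through 9, latest ratified version 9), versions 0 through 6 resolve to the published files and versions 7 through 9 resolve to the catalog's commits.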
++
++**NOTE**: This spec prescribes the _minimum_ required interactions between Delta clients and
++catalogs for commits. Catalogs may very well expose APIs and work with Delta clients to be
++informed of other non-commit [file types](#file-types), such as checkpoint, log
++compaction, and version checksum files. This would allow catalogs to return additional
++information to Delta clients during query and scan planning, potentially allowing Delta
++clients to avoid LISTing the filesystem altogether.
++
++## Commit Protocol
++
++To start, Delta clients send the desired actions to be committed to the client-side component of the
++catalog.
++
++This component then has several options for proposing, ratifying, and publishing the commit,
++detailed below.
++
++- Option 1: Write the actions (likely client-side) to a [staged commit file](#staged-commit) in the
++  `_delta_log/_staged_commits` directory and then ratify the staged commit (likely server-side) by
++  atomically recording (in persistent storage of some kind) that the file corresponds to version `v`.
++- Option 2: Treat this as an [inline commit](#inline-commit) (i.e. the client-side component
++  likely sends the contents to the server-side component) and atomically record (in persistent
++  storage of some kind) the content of the commit as version `v` of the table.
++- Option 3: Catalog implementations that use PUT-if-absent (client- or server-side) can ratify and
++  publish all-in-one by atomically writing a [published commit file](#published-commit)
++  in the `_delta_log` directory. Note that this commit will be considered to have succeeded as soon
++  as the file becomes visible in the filesystem, regardless of when or whether the catalog is made
++  aware of the successful publish. The catalog does not need to store these files.
++
++A catalog must not ratify version `v` until it has ratified version `v - 1`, and it must ratify
++version `v` at most once.
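
Editorial sketch (not part of the diff): the ordering rule above amounts to a compare-and-advance on the latest ratified version. The following in-memory Java class is a hypothetical illustration only. A real catalog would persist this state atomically, and a table need not start at version 0 as assumed here.

```java
// Hypothetical in-memory stand-in for the catalog's ratification state.
final class InMemoryRatifier {
    // Assumption for this sketch: -1 means no version has been ratified yet.
    private long latestRatifiedVersion = -1;

    // Ratifies `version` only if it directly follows the latest ratified version.
    // Returns false for a repeated version (already ratified) or a skipped version.
    synchronized boolean tryRatify(long version) {
        if (version != latestRatifiedVersion + 1) {
            return false;
        }
        latestRatifiedVersion = version;
        return true;
    }

    synchronized long latestRatifiedVersion() {
        return latestRatifiedVersion;
    }
}
```

Because the check and the update happen under one lock, two writers racing for the same version cannot both win, which is the "at most once" half of the rule.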
++
++The catalog must store both flavors of ratified commits (staged or inline) and make them available
++to readers until they are [published](#publishing-commits).
++
++For performance reasons, Delta clients are encouraged to establish an API contract where the catalog
++provides the latest ratified commit information whenever a commit fails due to version conflict.
++
++## Getting Ratified Commits from the Catalog
++
++Even after a commit is ratified, it is not discoverable through filesystem operations until it is
++[published](#publishing-commits).
++
++The catalog client is responsible for implementing an API (defined by the Delta client) that Delta clients can
++use to retrieve the latest ratified commit version (authoritative), as well as the set of ratified
++commits the catalog is still storing for the table. If some commits needed to complete the snapshot
++are not stored by the catalog, as they are already published, Delta clients can issue a filesystem
++LIST operation to retrieve them.
++
++Delta clients must establish an API contract where the catalog provides ratified commit information
++as part of the standard table resolution process performed at query planning time.
++
++## Publishing Commits
++
++Publishing is the process of copying the ratified commit with version `<v>` to
++`_delta_log/<v>.json`. The ratified commit may be a staged commit located in
++`_delta_log/_staged_commits/<v>.<uuid>.json`, or it may be an inline commit whose content the
++catalog stores itself. Because the content of a ratified commit is immutable, it does not matter
++whether the client-side, server-side, or both catalog components initiate publishing.
++
++Implementations are strongly encouraged to publish commits promptly. This reduces the number of
++commits the catalog needs to store internally (and serve up to readers).
++
++Commits must be published _in order_. That is, version `v - 1` must be published _before_ version
++`v`.
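
Editorial sketch (not part of the diff): one way to respect the in-order rule is to plan publishing as a contiguous run starting right after the latest published version. The Java helper below is hypothetical and only computes which versions are safe to publish next. The actual copy into `_delta_log/<v>.json` is out of scope here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;

// Hypothetical helper: picks the ratified versions that can be published now, in order.
final class PublishPlanner {
    static List<Long> versionsToPublish(
            long latestPublishedVersion, SortedSet<Long> ratifiedVersions) {
        List<Long> plan = new ArrayList<>();
        long next = latestPublishedVersion + 1;
        for (long v : ratifiedVersions) {
            if (v < next) {
                continue; // already published, nothing to do
            }
            if (v != next) {
                break;    // gap: version `next` must be published first
            }
            plan.add(v);
            next++;
        }
        return plan;
    }
}
```

In the earlier example, with version 7 already published and the catalog holding commits 7 through 9, the plan is versions 8 and then 9.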
++
++**NOTE**: Because commit publishing can happen at any time after the commit succeeds, the file
++modification timestamp of the published file will not accurately reflect the original commit time.
++For this reason, catalog-managed tables must use [in-commit-timestamps](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#in-commit-timestamps)
++to ensure stability of time travel reads. Refer to [Writer Requirements for Catalog-managed Tables](#writer-requirements-for-catalog-managed-tables)
++section for more details.
++
++## Maintenance Operations on Catalog-managed Tables
++
++[Checkpoints](#checkpoints-1) and [Log Compaction Files](#log-compaction-files) can only be created
++for versions that are already published in the `_delta_log`. In other words, in order to checkpoint
++version `v` or produce a log compaction file for commit range `x <= v <= y`, `_delta_log/<v>.json`
++must exist.
++
++Notably, the [Version Checksum File](#version-checksum-file) for version `v` _can_ be created in the
++`_delta_log` even if the commit for version `v` is not published.
++
++By default, maintenance operations are prohibited unless the managing catalog explicitly permits
++the client to run them. The only exceptions are checkpoints, log compaction, and version checksum,
++as they are essential for all basic table operations (e.g. reads and writes) to operate reliably.
++All other maintenance operations such as the following are not allowed by default.
++- [Log and other metadata files clean up](#metadata-cleanup).
++- Data files cleanup, for example VACUUM.
++- Data layout changes, for example OPTIMIZE and REORG.
++
++## Creating and Dropping Catalog-managed Tables
++
++The catalog and query engine ultimately dictate how to create and drop catalog-managed tables.
++
++As one example, table creation often works in three phases:
++
++1. An initial catalog operation to obtain a unique storage location which serves as an unnamed
++   "staging" table
++2. A table operation that physically initializes a new `catalogManaged`-enabled table at the staging
++   location.
++3. A final catalog operation that registers the new table with its intended name.
++
++Delta clients would primarily be involved with the second step, but an implementation could choose
++to combine the second and third steps so that a single catalog call registers the table as part of
++the table's first commit.
++
++As another example, dropping a table can be as simple as removing its name from the catalog (a "soft
++delete"), followed at some later point by a "hard delete" that physically purges the data. The Delta
++client would not be involved at all in this process, because no commits are made to the table.
++
++## Catalog-managed Table Enablement
++
++The `catalogManaged` table feature is supported and active when:
++- The table is on Reader Version 3 and Writer Version 7.
++- The table has a `protocol` action with `readerFeatures` and `writerFeatures` both containing the
++  feature `catalogManaged`.
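
Editorial sketch (not part of the diff): the enablement condition above can be expressed as a single predicate. The Java class, method, and parameter names are hypothetical. Only the version numbers and the feature name come from the spec text.

```java
import java.util.Set;

// Hypothetical helper: checks the enablement condition stated in the spec text.
final class CatalogManagedCheck {
    static final String FEATURE = "catalogManaged";

    static boolean isCatalogManaged(
            int readerVersion,
            int writerVersion,
            Set<String> readerFeatures,
            Set<String> writerFeatures) {
        // Reader Version 3 and Writer Version 7, with the feature listed on both sides.
        return readerVersion == 3
            && writerVersion == 7
            && readerFeatures.contains(FEATURE)
            && writerFeatures.contains(FEATURE);
    }
}
```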
++
++## Writer Requirements for Catalog-managed tables
++
++When supported and active:
++
++- Writers must discover and access the table using catalog calls, which happens _before_ the table's
++  protocol is known. See [Table Discovery](#table-discovery) for more details.
++- The [in-commit-timestamps](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#in-commit-timestamps)
++  table feature must be supported and active.
++- The `commitInfo` action must also contain a field `txnId` that stores a unique transaction
++  identifier string.
++- Writers must follow the catalog's [commit protocol](#commit-protocol) and must not perform
++  ordinary filesystem-based commits against the table.
++- Writers must follow the catalog's [maintenance operation protocol](#maintenance-operations-on-catalog-managed-tables).
++
++## Reader Requirements for Catalog-managed tables
++
++When supported and active:
++
++- Readers must discover the table using catalog calls, which happens before the table's protocol
++  is known. See [Table Discovery](#table-discovery) for more details.
++- Readers must contact the catalog for information about unpublished ratified commits.
++- Readers must follow the rules described in the [Reading Catalog-managed Tables](#reading-catalog-managed-tables)
++  section above. Notably:
++  - If the catalog says `v` is the latest version, clients must ignore any later versions that may
++    have been published.
++  - When the catalog returns a ratified commit for version `v`, readers must use that
++    catalog-supplied commit and ignore any published Delta file for version `v` that might also be
++    present.
++
++## Table Discovery
++
++The requirements above state that readers and writers must discover and access the table using
++catalog calls, which occurs _before_ the table's protocol is known. This raises an important
++question: how can a client discover a `catalogManaged` Delta table without first knowing that it
++_is_, in fact, `catalogManaged` (according to the protocol)?
++
++To solve this, first note that, in practice, catalog-integrated engines already ask the catalog to
++resolve a table name to its storage location during the name resolution step. This protocol
++therefore encourages that the same name resolution step also indicate whether the table is
++catalog-managed. Surfacing this at the very moment the catalog returns the path imposes no extra
++round-trips, yet it lets the client decide — early and unambiguously — whether to follow the
++`catalogManaged` read and write rules.
++
++## Sample Catalog Client API
++
++The following is an example of a possible API which a JVM-based Delta client might require catalog
++implementations to target:
++
++```scala
++
++trait CatalogManagedTable {
++    /**
++     * Commits the given set of `actions` to the given commit `version`.
++     *
++     * @param version The version we want to commit.
++     * @param actions Actions that need to be committed.
++     *
++     * @return CommitResponse which has details around the new committed delta file.
++     */
++    def commit(
++        version: Long,
++        actions: Iterator[String]): CommitResponse
++
++    /**
++     * Retrieves a (possibly empty) suffix of ratified commits in the range [startVersion,
++     * endVersion] for this table.
++     * 
++     * Some of these ratified commits may already have been published. Some of them may be staged,
++     * in which case the staged commit file path is returned; others may be inline, in which case
++     * the inline commit content is returned.
++     * 
++     * The returned commits are sorted in ascending version number and are contiguous.
++     *
++     * If neither start nor end version is specified, the catalog will return all available ratified
++     * commits (possibly empty, if all commits have been published).
++     *
++     * In all cases, the response also includes the table's latest ratified commit version.
++     *
++     * @return GetCommitsResponse which contains an ordered list of ratified commits
++     *         stored by the catalog, as well as table's latest commit version.
++     */
++    def getRatifiedCommits(
++        startVersion: Option[Long],
++        endVersion: Option[Long]): GetCommitsResponse
++}
++```
++
++Note that the above is only one example of a possible Catalog Client API. It is also _NOT_ a catalog
++API (no table discovery, ACL, create/drop, etc). The Delta protocol is agnostic to API details, and
++the API surface Delta clients define should only cover the specific catalog capabilities that a Delta
++client needs to correctly read and write catalog-managed tables.
++
+ # Iceberg Compatibility V1
+ 
+ This table feature (`icebergCompatV1`) ensures that Delta tables can be converted to Apache Iceberg™ format, though this table feature does not implement or specify that conversion.
+  * Files that have been [added](#Add-File-and-Remove-File) and not yet removed
+  * Files that were recently [removed](#Add-File-and-Remove-File) and have not yet expired
+  * [Transaction identifiers](#Transaction-Identifiers)
+- * [Domain Metadata](#Domain-Metadata)
++ * [Domain Metadata](#Domain-Metadata) that have not been removed (i.e. excluding tombstones with `removed=true`)
+  * [Checkpoint Metadata](#checkpoint-metadata) - Requires [V2 checkpoints](#v2-spec)
+  * [Sidecar File](#sidecar-files) - Requires [V2 checkpoints](#v2-spec)
+ 
+ 1. Identify a threshold (in days) up to which we want to preserve the deltaLog. Let's refer to
+ midnight UTC of that day as `cutOffTimestamp`. The newest commit not newer than the `cutOffTimestamp` is
+ the `cutOffCommit`, because a commit exactly at midnight is an acceptable cutoff. We want to retain everything including and after the `cutOffCommit`.
+-2. Identify the newest checkpoint that is not newer than the `cutOffCommit`. A checkpoint at the `cutOffCommit` is ideal, but an older one will do. Lets call it `cutOffCheckpoint`.
+-We need to preserve the `cutOffCheckpoint` (both the checkpoint file and the JSON commit file at that version) and all commits after it. The JSON commit file at the `cutOffCheckpoint` version must be preserved because checkpoints do not preserve [commit provenance information](#commit-provenance-information) (e.g., `commitInfo` actions), which may be required by table features such as [In-Commit Timestamps](#in-commit-timestamps). All commits after `cutOffCheckpoint` must be preserved to enable time travel for commits between `cutOffCheckpoint` and the next available checkpoint.
+-3. Delete all [delta log entries](#delta-log-entries) and [checkpoint files](#checkpoints) before the
+-`cutOffCheckpoint` checkpoint. Also delete all the [log compaction files](#log-compaction-files) having startVersion <= `cutOffCheckpoint`'s version.
++2. Identify the newest checkpoint that is not newer than the `cutOffCommit`. A checkpoint at the `cutOffCommit` is ideal, but an older one will do. Let's call it `cutOffCheckpoint`.
++We need to preserve the `cutOffCheckpoint` (both the checkpoint file and the JSON commit file at that version) and all published commits after it. The JSON commit file at the `cutOffCheckpoint` version must be preserved because checkpoints do not preserve [commit provenance information](#commit-provenance-information) (e.g., `commitInfo` actions), which may be required by table features such as [In-Commit Timestamps](#in-commit-timestamps). All published commits after `cutOffCheckpoint` must be preserved to enable time travel for commits between `cutOffCheckpoint` and the next available checkpoint.
++    - If no `cutOffCheckpoint` can be found, do not proceed with metadata cleanup as there is
++      nothing to clean up.
++3. Delete all [delta log entries](#delta-log-entries), [checkpoint files](#checkpoints), and
++   [version checksum files](#version-checksum-file) before the `cutOffCheckpoint` checkpoint. Also delete all the [log compaction files](#log-compaction-files)
++   having startVersion <= `cutOffCheckpoint`'s version.
++    - Also delete all the [staged commit files](#staged-commit) having version <=
++      `cutOffCheckpoint`'s version from the `_delta_log/_staged_commits` directory.
+ 4. Now read all the available [checkpoints](#checkpoints-1) in the _delta_log directory and identify
+ the corresponding [sidecar files](#sidecar-files). These sidecar files need to be protected.
+ 5. List all the files in `_delta_log/_sidecars` directory, preserve files that are less than a day
+ [Timestamp without Timezone](#timestamp-without-timezone-timestampNtz) | `timestampNtz` | Readers and writers
+ [Domain Metadata](#domain-metadata) | `domainMetadata` | Writers only
+ [V2 Checkpoint](#v2-checkpoint-table-feature) | `v2Checkpoint` | Readers and writers
++[Catalog-managed Tables](#catalog-managed-tables) | `catalogManaged` | Readers and writers
+ [Iceberg Compatibility V1](#iceberg-compatibility-v1) | `icebergCompatV1` | Writers only
+ [Iceberg Compatibility V2](#iceberg-compatibility-v2) | `icebergCompatV2` | Writers only
+ [Clustered Table](#clustered-table) | `clustering` | Writers only
\ No newline at end of file

README.md

@@ -0,0 +1,10 @@
+diff --git a/README.md b/README.md
+--- a/README.md
++++ b/README.md
+ ## Building
+ 
+ Delta Lake is compiled using [SBT](https://www.scala-sbt.org/1.x/docs/Command-Line-Reference.html).
++Ensure that your Java version is at least 17 (you can verify with `java -version`).
+ 
+ To compile, run
+ 
\ No newline at end of file

build.sbt

@@ -0,0 +1,218 @@
+diff --git a/build.sbt b/build.sbt
+--- a/build.sbt
++++ b/build.sbt
+       allMappings.distinct
+     },
+ 
+-    // Exclude internal modules from published POM
++    // Exclude internal modules from published POM and add kernel dependencies.
++    // Kernel modules are transitive through sparkV2 (an internal module), so they
++    // are lost when sparkV2 is filtered out. We re-add them explicitly here.
+     pomPostProcess := { node =>
+       val internalModules = internalModuleNames.value
++      val ver = version.value
+       import scala.xml._
+       import scala.xml.transform._
++
++      def kernelDependencyNode(artifactId: String): Elem = {
++        <dependency>
++          <groupId>io.delta</groupId>
++          <artifactId>{artifactId}</artifactId>
++          <version>{ver}</version>
++        </dependency>
++      }
++
++      val kernelDeps = Seq(
++        kernelDependencyNode("delta-kernel-api"),
++        kernelDependencyNode("delta-kernel-defaults"),
++        kernelDependencyNode("delta-kernel-unitycatalog")
++      )
++
+       new RuleTransformer(new RewriteRule {
+         override def transform(n: Node): Seq[Node] = n match {
+-          case e: Elem if e.label == "dependency" =>
+-            val artifactId = (e \ "artifactId").text
+-            // Check if artifactId starts with any internal module name
+-            // (e.g., "delta-spark-v1_4.1_2.13" starts with "delta-spark-v1")
+-            val isInternal = internalModules.exists(module => artifactId.startsWith(module))
+-            if (isInternal) Seq.empty else Seq(n)
++          case e: Elem if e.label == "dependencies" =>
++            val filtered = e.child.filter {
++              case child: Elem if child.label == "dependency" =>
++                val artifactId = (child \ "artifactId").text
++                !internalModules.exists(module => artifactId.startsWith(module))
++              case _ => true
++            }
++            Seq(e.copy(child = filtered ++ kernelDeps))
+           case _ => Seq(n)
+         }
+       }).transform(node).head
+     commonSettings,
+     scalaStyleSettings,
+     releaseSettings,
+-    CrossSparkVersions.sparkDependentModuleName(sparkVersion),
++    // Set sparkVersion directly (not sparkDependentModuleName) so that
++    // runOnlyForReleasableSparkModules discovers this module, but without adding a Spark
++    // suffix to the artifact name. delta-contribs is only published as delta-contribs_2.13.
++    sparkVersion := CrossSparkVersions.getSparkVersion(),
+     Compile / packageBin / mappings := (Compile / packageBin / mappings).value ++
+       listPythonFiles(baseDirectory.value.getParentFile / "python"),
+ 
+   ).configureUnidoc()
+ 
+ 
+-val unityCatalogVersion = "0.3.1"
++val unityCatalogVersion = "0.4.0"
+ val sparkUnityCatalogJacksonVersion = "2.15.4" // We are using Spark 4.0's Jackson version 2.15.x, to override Unity Catalog 0.3.0's version 2.18.x
+ 
+ lazy val sparkUnityCatalog = (project in file("spark/unitycatalog"))
+     libraryDependencies ++= Seq(
+       "org.apache.spark" %% "spark-sql" % sparkVersion.value % "provided",
+ 
+-      "io.delta" %% "delta-sharing-client" % "1.3.9",
++      "io.delta" %% "delta-sharing-client" % "1.3.10",
+ 
+       // Test deps
+       "org.scalatest" %% "scalatest" % scalaTestVersion % "test",
+ 
+       // Test Deps
+       "org.scalatest" %% "scalatest" % scalaTestVersion % "test",
++      // Jackson datatype module needed for UC SDK tests (excluded from main compile scope)
++      "com.fasterxml.jackson.datatype" % "jackson-datatype-jsr310" % "2.15.4" % "test",
+     ),
+ 
+     // Unidoc settings
+     commonSettings,
+     scalaStyleSettings,
+     releaseSettings,
+-    CrossSparkVersions.sparkDependentModuleName(sparkVersion),
++    // Set sparkVersion directly (not sparkDependentModuleName) so that
++    // runOnlyForReleasableSparkModules discovers this module, but without adding a Spark
++    // suffix to the artifact name. delta-iceberg is only published as delta-iceberg_2.13.
++    sparkVersion := CrossSparkVersions.getSparkVersion(),
+     libraryDependencies ++= {
+       if (supportIceberg) {
+         Seq(
+           "org.xerial" % "sqlite-jdbc" % "3.45.0.0" % "test",
+           "org.apache.httpcomponents.core5" % "httpcore5" % "5.2.4" % "test",
+           "org.apache.httpcomponents.client5" % "httpclient5" % "5.3.1" % "test",
+-          "org.apache.iceberg" %% icebergSparkRuntimeArtifactName % "1.10.0" % "provided"
++          "org.apache.iceberg" %% icebergSparkRuntimeArtifactName % "1.10.0" % "provided",
++          // For FixedGcsAccessTokenProvider (GCS server-side planning credentials)
++          "com.google.cloud.bigdataoss" % "util-hadoop" % "hadoop3-2.2.26" % "provided"
+         )
+       } else {
+         Seq.empty
+   )
+ // scalastyle:on println
+ 
+-val icebergShadedVersion = "1.10.0"
++val icebergShadedVersion = "1.10.1"
+ lazy val icebergShaded = (project in file("icebergShaded"))
+   .dependsOn(spark % "provided")
+   .disablePlugins(JavaFormatterPlugin, ScalafmtPlugin)
+     commonSettings,
+     scalaStyleSettings,
+     releaseSettings,
+-    CrossSparkVersions.sparkDependentSettings(sparkVersion),
+-    libraryDependencies ++= Seq(
+-      "org.apache.hudi" % "hudi-java-client" % "0.15.0" % "compile" excludeAll(
+-        ExclusionRule(organization = "org.apache.hadoop"),
+-        ExclusionRule(organization = "org.apache.zookeeper"),
+-      ),
+-      "org.apache.spark" %% "spark-avro" % sparkVersion.value % "test" excludeAll ExclusionRule(organization = "org.apache.hadoop"),
+-      "org.apache.parquet" % "parquet-avro" % "1.12.3" % "compile"
+-    ),
++    // Set sparkVersion directly (not sparkDependentModuleName) so that
++    // runOnlyForReleasableSparkModules discovers this module, but without adding a Spark
++    // suffix to the artifact name. delta-hudi is only published as delta-hudi_2.13.
++    sparkVersion := CrossSparkVersions.getSparkVersion(),
++    libraryDependencies ++= {
++      if (supportHudi) {
++        Seq(
++          "org.apache.hudi" % "hudi-java-client" % "0.15.0" % "compile" excludeAll(
++            ExclusionRule(organization = "org.apache.hadoop"),
++            ExclusionRule(organization = "org.apache.zookeeper"),
++          ),
++          "org.apache.spark" %% "spark-avro" % sparkVersion.value % "test" excludeAll ExclusionRule(organization = "org.apache.hadoop"),
++          "org.apache.parquet" % "parquet-avro" % "1.12.3" % "compile"
++        )
++      } else {
++        Seq.empty
++      }
++    },
++    // Skip compilation and publishing when supportHudi is false
++    Compile / skip := !supportHudi,
++    Test / skip := !supportHudi,
++    publish / skip := !supportHudi,
++    publishLocal / skip := !supportHudi,
++    publishM2 / skip := !supportHudi,
+     assembly / assemblyJarName := s"${name.value}-assembly_${scalaBinaryVersion.value}-${version.value}.jar",
+     assembly / logLevel := Level.Info,
+     assembly / test := {},
+       // crossScalaVersions must be set to Nil on the aggregating project
+       crossScalaVersions := Nil,
+       publishArtifact := false,
+-      publish / skip := false,
++      publish / skip := true,
+     )
+ }
+ 
+       // crossScalaVersions must be set to Nil on the aggregating project
+       crossScalaVersions := Nil,
+       publishArtifact := false,
+-      publish / skip := false,
++      publish / skip := true,
+     )
+ }
+ 
+     // crossScalaVersions must be set to Nil on the aggregating project
+     crossScalaVersions := Nil,
+     publishArtifact := false,
+-    publish / skip := false,
++    publish / skip := true,
+     unidocSourceFilePatterns := {
+       (kernelApi / unidocSourceFilePatterns).value.scopeToProject(kernelApi) ++
+       (kernelDefaults / unidocSourceFilePatterns).value.scopeToProject(kernelDefaults)
+     // crossScalaVersions must be set to Nil on the aggregating project
+     crossScalaVersions := Nil,
+     publishArtifact := false,
+-    publish / skip := false,
++    publish / skip := true,
+   )
+ 
+ /*
+     sys.env.getOrElse("SONATYPE_USERNAME", ""),
+     sys.env.getOrElse("SONATYPE_PASSWORD", "")
+   ),
++  credentials += Credentials(
++    "Sonatype Nexus Repository Manager",
++    "central.sonatype.com",
++    sys.env.getOrElse("SONATYPE_USERNAME", ""),
++    sys.env.getOrElse("SONATYPE_PASSWORD", "")
++  ),
+   publishTo := {
+     val ossrhBase = "https://ossrh-staging-api.central.sonatype.com/"
++    val centralSnapshots = "https://central.sonatype.com/repository/maven-snapshots/"
+     if (isSnapshot.value) {
+-      Some("snapshots" at ossrhBase + "content/repositories/snapshots")
++      Some("snapshots" at centralSnapshots)
+     } else {
+       Some("releases"  at ossrhBase + "service/local/staging/deploy/maven2")
+     }
+ // Looks like some of release settings should be set for the root project as well.
+ publishArtifact := false  // Don't release the root project
+ publish / skip := true
+-publishTo := Some("snapshots" at "https://ossrh-staging-api.central.sonatype.com/content/repositories/snapshots")
++publishTo := Some("snapshots" at "https://central.sonatype.com/repository/maven-snapshots/")
+ releaseCrossBuild := false  // Don't use sbt-release's cross facility
+ releaseProcess := Seq[ReleaseStep](
+   checkSnapshotDependencies,
+   setReleaseVersion,
+   commitReleaseVersion,
+   tagRelease
+-) ++ CrossSparkVersions.crossSparkReleaseSteps("+publishSigned") ++ Seq[ReleaseStep](
++) ++ CrossSparkVersions.crossSparkReleaseSteps("publishSigned") ++ Seq[ReleaseStep](
+ 
+   // Do NOT use `sonatypeBundleRelease` - it will actually release to Maven! We want to do that
+   // manually.
\ No newline at end of file

connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/.00000000000000000000.json.crc

@@ -0,0 +1,3 @@
+diff --git a/connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/.00000000000000000000.json.crc b/connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/.00000000000000000000.json.crc
+new file mode 100644
+Binary files /dev/null and b/connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/.00000000000000000000.json.crc differ
\ No newline at end of file

connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/00000000000000000000.crc

@@ -0,0 +1,5 @@
+diff --git a/connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/00000000000000000000.crc b/connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/00000000000000000000.crc
+new file mode 100644
+--- /dev/null
++++ b/connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/00000000000000000000.crc
++{"txnId":"6132e880-0f3a-4db4-b882-1da039bffbad","tableSizeBytes":0,"numFiles":0,"numMetadata":1,"numProtocol":1,"setTransactions":[],"domainMetadata":[],"metadata":{"id":"0eb3e007-b3cc-40e4-bca1-a5970d86b5a6","format":{"provider":"parquet","options":{}},"schemaString":"{\"type\":\"struct\",\"fields\":[{\"name\":\"id\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}},{\"name\":\"utf8_binary_col\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"utf8_lcase_col\",\"type\":\"string\",\"nullable\":true,\"metadata\":{\"__COLLATIONS\":{\"utf8_lcase_col\":\"spark.UTF8_LCASE\"}}},{\"name\":\"unicode_col\",\"type\":\"string\",\"nullable\":true,\"metadata\":{\"__COLLATIONS\":{\"unicode_col\":\"icu.UNICODE\"}}}]}","partitionColumns":[],"configuration":{},"createdTime":1773779518731},"protocol":{"minReaderVersion":1,"minWriterVersion":7,"writerFeatures":["domainMetadata","collations-preview","appendOnly","invariants"]},"histogramOpt":{"sortedBinBoundaries":[0,8192,16384,32768,65536,131072,262144,524288,1048576,2097152,4194304,8388608,12582912,16777216,20971520,25165824,29360128,33554432,37748736,41943040,50331648,58720256,67108864,75497472,83886080,92274688,100663296,109051904,117440512,125829120,130023424,134217728,138412032,142606336,146800640,150994944,167772160,184549376,201326592,218103808,234881024,251658240,268435456,285212672,301989888,318767104,335544320,352321536,369098752,385875968,402653184,419430400,436207616,452984832,469762048,486539264,503316480,520093696,536870912,553648128,570425344,587202560,603979776,671088640,738197504,805306368,872415232,939524096,1006632960,1073741824,1140850688,1207959552,1275068416,1342177280,1409286144,1476395008,1610612736,1744830464,1879048192,2013265920,2147483648,2415919104,2684354560,2952790016,3221225472,3489660928,3758096384,4026531840,4294967296,8589934592,17179869184,34359738368,68719476736,137438953472,274877906944],"fileCounts":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],"totalBytes":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]},"allFiles":[]}
\ No newline at end of file

... (truncated, output exceeded 60000 bytes)

Reproduce locally: `git range-diff e8cffee..9f42cb1 d1139d2..320b3ac` | Disable: `git config gitstack.push-range-diff false`

zikangh · 2026-04-02T17:22:53Z

Range-diff: master (320b3ac -> 3398dde)

.github/CODEOWNERS

@@ -0,0 +1,12 @@
+diff --git a/.github/CODEOWNERS b/.github/CODEOWNERS
+--- a/.github/CODEOWNERS
++++ b/.github/CODEOWNERS
+ /project/                       @tdas
+ /version.sbt                    @tdas
+ 
++# Spark V2 and Unified modules
++/spark/v2/                      @tdas @huan233usc @TimothyW553 @raveeram-db @murali-db
++/spark-unified/                 @tdas @huan233usc @TimothyW553 @raveeram-db @murali-db
++
+ # All files in the root directory
+ /*                              @tdas
\ No newline at end of file

.github/workflows/iceberg_test.yaml

@@ -0,0 +1,16 @@
+diff --git a/.github/workflows/iceberg_test.yaml b/.github/workflows/iceberg_test.yaml
+--- a/.github/workflows/iceberg_test.yaml
++++ b/.github/workflows/iceberg_test.yaml
+           # the above directories when we use the key for the first time. After that, each run will
+           # just use the cache. The cache is immutable so we need to use a new key when trying to
+           # cache new stuff.
+-          key: delta-sbt-cache-spark3.2-scala${{ matrix.scala }}
++          key: delta-sbt-cache-spark4.0-scala${{ matrix.scala }}
+       - name: Install Job dependencies
+         run: |
+           sudo apt-get update
+       - name: Run Scala/Java and Python tests
+         # when changing TEST_PARALLELISM_COUNT make sure to also change it in spark_master_test.yaml
+         run: |
+-          TEST_PARALLELISM_COUNT=4 pipenv run python run-tests.py --group iceberg
++          TEST_PARALLELISM_COUNT=4 pipenv run python run-tests.py --group iceberg --spark-version 4.0
\ No newline at end of file

.github/workflows/spark_examples_test.yaml

@@ -0,0 +1,54 @@
+diff --git a/.github/workflows/spark_examples_test.yaml b/.github/workflows/spark_examples_test.yaml
+--- a/.github/workflows/spark_examples_test.yaml
++++ b/.github/workflows/spark_examples_test.yaml
+         # Spark versions are dynamically generated - released versions only
+         spark_version: ${{ fromJson(needs.generate-matrix.outputs.spark_versions) }}
+         # These Scala versions must match those in the build.sbt
+-        scala: [2.13.16]
++        scala: [2.13.17]
+     env:
+       SCALA_VERSION: ${{ matrix.scala }}
+-      SPARK_VERSION: ${{ matrix.spark_version }}
+     steps:
+       - uses: actions/checkout@v3
+       - name: Get Spark version details
+         id: spark-details
+         run: |
+-          # Get JVM version, package suffix, iceberg support for this Spark version
++          # Get JVM version, package suffix, iceberg support, and full version for this Spark version
+           JVM_VERSION=$(python3 project/scripts/get_spark_version_info.py --get-field "${{ matrix.spark_version }}" targetJvm | jq -r)
+           SPARK_PACKAGE_SUFFIX=$(python3 project/scripts/get_spark_version_info.py --get-field "${{ matrix.spark_version }}" packageSuffix | jq -r)
+           SUPPORT_ICEBERG=$(python3 project/scripts/get_spark_version_info.py --get-field "${{ matrix.spark_version }}" supportIceberg | jq -r)
++          SPARK_FULL_VERSION=$(python3 project/scripts/get_spark_version_info.py --get-field "${{ matrix.spark_version }}" fullVersion | jq -r)
+           echo "jvm_version=$JVM_VERSION" >> $GITHUB_OUTPUT
+           echo "spark_package_suffix=$SPARK_PACKAGE_SUFFIX" >> $GITHUB_OUTPUT
+           echo "support_iceberg=$SUPPORT_ICEBERG" >> $GITHUB_OUTPUT
+-          echo "Using JVM $JVM_VERSION for Spark ${{ matrix.spark_version }}, package suffix: '$SPARK_PACKAGE_SUFFIX', support iceberg: '$SUPPORT_ICEBERG'"
++          echo "spark_full_version=$SPARK_FULL_VERSION" >> $GITHUB_OUTPUT
++          echo "Using JVM $JVM_VERSION for Spark $SPARK_FULL_VERSION, package suffix: '$SPARK_PACKAGE_SUFFIX', support iceberg: '$SUPPORT_ICEBERG'"
+       - name: install java
+         uses: actions/setup-java@v3
+         with:
+       - name: Run Delta Spark Local Publishing and Examples Compilation
+         # examples/scala/build.sbt will compile against the local Delta release version (e.g. 3.2.0-SNAPSHOT).
+         # Thus, we need to publishM2 first so those jars are locally accessible.
+-        # The SPARK_PACKAGE_SUFFIX env var tells examples/scala/build.sbt which artifact naming to use.
++        # -DsparkVersion is for the Delta project's publishM2 (which Spark version to compile Delta against).
++        # SPARK_VERSION/SPARK_PACKAGE_SUFFIX/SUPPORT_ICEBERG are for examples/scala/build.sbt (dependency resolution).
+         env:
+           SPARK_PACKAGE_SUFFIX: ${{ steps.spark-details.outputs.spark_package_suffix }}
+           SUPPORT_ICEBERG: ${{ steps.spark-details.outputs.support_iceberg }}
++          SPARK_VERSION: ${{ steps.spark-details.outputs.spark_full_version }}
+         run: |
+           build/sbt clean
+-          build/sbt -DsparkVersion=${{ matrix.spark_version }} publishM2
++          build/sbt -DsparkVersion=${{ steps.spark-details.outputs.spark_full_version }} publishM2
+           cd examples/scala && build/sbt "++ $SCALA_VERSION compile"
++      - name: Run UC Delta Integration Test
++        # Verifies that delta-spark resolved from Maven local includes all kernel module
++        # dependencies transitively by running a real UC-backed Delta workload.
++        env:
++          SPARK_PACKAGE_SUFFIX: ${{ steps.spark-details.outputs.spark_package_suffix }}
++          SPARK_VERSION: ${{ steps.spark-details.outputs.spark_full_version }}
++        run: |
++          cd examples/scala && build/sbt "++ $SCALA_VERSION runMain example.UnityCatalogQuickstart"
\ No newline at end of file

.github/workflows/spark_test.yaml

@@ -0,0 +1,27 @@
+diff --git a/.github/workflows/spark_test.yaml b/.github/workflows/spark_test.yaml
+--- a/.github/workflows/spark_test.yaml
++++ b/.github/workflows/spark_test.yaml
+         # These Scala versions must match those in the build.sbt
+         scala: [2.13.16]
+         # Important: This list of shards must be [0..NUM_SHARDS - 1]
+-        shard: [0, 1, 2, 3]
++        shard: [0, 1, 2, 3, 4, 5, 6, 7]
+     env:
+       SCALA_VERSION: ${{ matrix.scala }}
+       SPARK_VERSION: ${{ matrix.spark_version }}
+       # Important: This must be the same as the length of shards in matrix
+-      NUM_SHARDS: 4
++      NUM_SHARDS: 8
+     steps:
+       - uses: actions/checkout@v3
+       - name: Get Spark version details
+         # when changing TEST_PARALLELISM_COUNT make sure to also change it in spark_python_test.yaml
+         run: |
+           TEST_PARALLELISM_COUNT=4 pipenv run python run-tests.py --group spark --shard ${{ matrix.shard }} --spark-version ${{ matrix.spark_version }}
++      - name: Upload test reports
++        if: always()
++        uses: actions/upload-artifact@v4
++        with:
++          name: test-reports-spark${{ matrix.spark_version }}-shard${{ matrix.shard }}
++          path: "**/target/test-reports/*.xml"
++          retention-days: 7
\ No newline at end of file
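
The comments in the `spark_test.yaml` hunk above state two invariants: the `shard` matrix must be exactly `[0..NUM_SHARDS - 1]`, and `NUM_SHARDS` must equal the length of that list. A minimal sketch of how that could be checked, assuming a hypothetical helper `shards_consistent` that is not part of this PR or of `run-tests.py`:

```python
def shards_consistent(shards, num_shards):
    """Return True when shards is exactly [0, 1, ..., num_shards - 1].

    Hypothetical helper for illustration only; the workflow itself
    relies on the two values being kept in sync by hand.
    """
    return list(shards) == list(range(num_shards))


# Values taken from the updated spark_test.yaml in this range-diff.
print(shards_consistent([0, 1, 2, 3, 4, 5, 6, 7], 8))

# A matrix that is too short for NUM_SHARDS fails the check.
print(shards_consistent([0, 1, 2, 3], 8))
```

Comparing against `range(num_shards)` covers both invariants at once: a list of the wrong length or one that does not start at 0 is rejected by the same equality test.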

PROTOCOL.md

@@ -0,0 +1,537 @@
+diff --git a/PROTOCOL.md b/PROTOCOL.md
+--- a/PROTOCOL.md
++++ b/PROTOCOL.md
+   - [Writer Requirements for Variant Type](#writer-requirements-for-variant-type)
+   - [Reader Requirements for Variant Data Type](#reader-requirements-for-variant-data-type)
+   - [Compatibility with other Delta Features](#compatibility-with-other-delta-features)
++- [Catalog-managed tables](#catalog-managed-tables)
++  - [Terminology: Commits](#terminology-commits)
++  - [Terminology: Delta Client](#terminology-delta-client)
++  - [Terminology: Catalogs](#terminology-catalogs)
++  - [Catalog Responsibilities](#catalog-responsibilities)
++  - [Reading Catalog-managed Tables](#reading-catalog-managed-tables)
++  - [Commit Protocol](#commit-protocol)
++  - [Getting Ratified Commits from the Catalog](#getting-ratified-commits-from-the-catalog)
++  - [Publishing Commits](#publishing-commits)
++  - [Maintenance Operations on Catalog-managed Tables](#maintenance-operations-on-catalog-managed-tables)
++  - [Creating and Dropping Catalog-managed Tables](#creating-and-dropping-catalog-managed-tables)
++  - [Catalog-managed Table Enablement](#catalog-managed-table-enablement)
++  - [Writer Requirements for Catalog-managed tables](#writer-requirements-for-catalog-managed-tables)
++  - [Reader Requirements for Catalog-managed tables](#reader-requirements-for-catalog-managed-tables)
++  - [Table Discovery](#table-discovery)
++  - [Sample Catalog Client API](#sample-catalog-client-api)
+ - [Requirements for Writers](#requirements-for-writers)
+   - [Creation of New Log Entries](#creation-of-new-log-entries)
+   - [Consistency Between Table Metadata and Data Files](#consistency-between-table-metadata-and-data-files)
+ __(1)__ `preimage` is the value before the update, `postimage` is the value after the update.
+ 
+ ### Delta Log Entries
+-Delta files are stored as JSON in a directory at the root of the table named `_delta_log`, and together with checkpoints make up the log of all changes that have occurred to a table.
+ 
+-Delta files are the unit of atomicity for a table, and are named using the next available version number, zero-padded to 20 digits.
++Delta Log Entries, also known as Delta files, are JSON files stored in the `_delta_log`
++directory at the root of the table. Together with checkpoints, they make up the log of all changes
++that have occurred to a table. Delta files are the unit of atomicity for a table, and are named
++using the next available version number, zero-padded to 20 digits.
+ 
+ For example:
+ 
+ ```
+ ./_delta_log/00000000000000000000.json
+ ```
+-Delta files use new-line delimited JSON format, where every action is stored as a single line JSON document.
+-A delta file, `n.json`, contains an atomic set of [_actions_](#Actions) that should be applied to the previous table state, `n-1.json`, in order to the construct `n`th snapshot of the table.
+-An action changes one aspect of the table's state, for example, adding or removing a file.
++
++Delta files use newline-delimited JSON format, where every action is stored as a single-line
++JSON document. A Delta file, corresponding to version `v`, contains an atomic set of
++[_actions_](#actions) that should be applied to the previous table state corresponding to version
++`v-1`, in order to construct the `v`th snapshot of the table. An action changes one aspect of the
++table's state, for example, adding or removing a file.
++
++**Note:** If the [catalogManaged table feature](#catalog-managed-tables) is enabled on the table,
++recently [ratified commits](#ratified-commit) may not yet be published to the `_delta_log` directory as normal Delta
++files - they may be stored directly by the catalog or reside in the `_delta_log/_staged_commits`
++directory. Delta clients must contact the table's managing catalog in order to find the information
++about these [ratified, potentially-unpublished commits](#publishing-commits).
++
++The `_delta_log/_staged_commits` directory is the staging area for [staged](#staged-commit)
++commits. Delta files in this directory have a UUID embedded into them and follow the pattern
++`<version>.<uuid>.json`, where the version corresponds to the proposed commit version, zero-padded
++to 20 digits.
++
++For example:
++
++```
++./_delta_log/_staged_commits/00000000000000000000.3a0d65cd-4056-49b8-937b-95f9e3ee90e5.json
++./_delta_log/_staged_commits/00000000000000000001.7d17ac10-5cc3-401b-bd1a-9c82dd2ea032.json
++./_delta_log/_staged_commits/00000000000000000001.016ae953-37a9-438e-8683-9a9a4a79a395.json
++./_delta_log/_staged_commits/00000000000000000002.3ae45b72-24e1-865a-a211-34987ae02f2a.json
++```
++
++NOTE: The (proposed) version number of a staged commit is authoritative - file
++`00000000000000000100.<uuid>.json` always corresponds to a commit attempt for version 100. Besides
++simplifying implementations, it also acknowledges the fact that commit files cannot safely be reused
++for multiple commit attempts. For example, resolving conflicts in a table with [row
++tracking](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#row-tracking) enabled requires
++rewriting all file actions to update their `baseRowId` field.
++
++The [catalog](#terminology-catalogs) is the source of truth about which staged commit files in
++the `_delta_log/_staged_commits` directory correspond to ratified versions, and Delta clients should
++not attempt to directly interpret the contents of that directory. Refer to
++[catalog-managed tables](#catalog-managed-tables) for more details.
+ 
+ ### Checkpoints
+ Checkpoints are also stored in the `_delta_log` directory, and can be created at any time, for any committed version of the table.
+ ### Commit Provenance Information
+ A delta file can optionally contain additional provenance information about what higher-level operation was being performed as well as who executed it.
+ 
++When the `catalogManaged` table feature is enabled, the `commitInfo` action must have a field
++`txnId` that stores a unique transaction identifier string.
++
+ Implementations are free to store any valid JSON-formatted data via the `commitInfo` action.
+ 
+ When [In-Commit Timestamps](#in-commit-timestamps) are enabled, writers are required to include a `commitInfo` action with every commit, which must include the `inCommitTimestamp` field. Also, the `commitInfo` action must be first action in the commit.
+  - A single `protocol` action
+  - A single `metaData` action
+  - A collection of `txn` actions with unique `appId`s
+- - A collection of `domainMetadata` actions with unique `domain`s.
++ - A collection of `domainMetadata` actions with unique `domain`s, excluding tombstones (i.e. actions with `removed=true`).
+  - A collection of `add` actions with unique path keys, corresponding to the newest (path, deletionVector.uniqueId) pair encountered for each path.
+  - A collection of `remove` actions with unique `(path, deletionVector.uniqueId)` keys. The intersection of the primary keys in the `add` collection and `remove` collection must be empty. That means a logical file cannot exist in both the `remove` and `add` collections at the same time; however, the same *data file* can exist with *different* DVs in the `remove` collection, as logically they represent different content. The `remove` actions act as _tombstones_, and only exist for the benefit of the VACUUM command. Snapshot reads only return `add` actions on the read path.
+  
+      - write a `metaData` action to add the `delta.columnMapping.mode` table property.
+  - Write data files by using the _physical name_ that is chosen for each column. The physical name of the column is static and can be different than the _display name_ of the column, which is changeable.
+  - Write the 32 bit integer column identifier as part of the `field_id` field of the `SchemaElement` struct in the [Parquet Thrift specification](https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift).
+- - Track partition values and column level statistics with the physical name of the column in the transaction log.
++ - Track partition values, column level statistics, and [clustering column](#clustered-table) names with the physical name of the column in the transaction log.
+  - Assign a globally unique identifier as the physical name for each new column that is added to the schema. This is especially important for supporting cheap column deletions in `name` mode. In addition, column identifiers need to be assigned to each column. The maximum id that is assigned to a column is tracked as the table property `delta.columnMapping.maxColumnId`. This is an internal table property that cannot be configured by users. This value must increase monotonically as new columns are introduced and committed to the table alongside the introduction of the new columns to the schema.
+ 
+ ## Reader Requirements for Column Mapping
+ ## Writer Requirement for Deletion Vectors
+ When adding a logical file with a deletion vector, then that logical file must have correct `numRecords` information for the data file in the `stats` field.
+ 
++# Catalog-managed tables
++
++With this feature enabled, the [catalog](#terminology-catalogs) that manages the table becomes the
++source of truth for whether a given commit attempt succeeded.
++
++The table feature defines the parts of the [commit protocol](#commit-protocol) that directly impact
++the Delta table (e.g. atomicity requirements, publishing, etc). The Delta client and catalog
++together are responsible for implementing the Delta-specific aspects of commit as defined by this
++spec, but are otherwise free to define their own APIs and protocols for communication with each
++other.
++
++**NOTE**: Filesystem-based access to catalog-managed tables is not supported. Delta clients are
++expected to discover and access catalog-managed tables through the managing catalog, not by direct
++listing in the filesystem. This feature is primarily designed to warn filesystem-based readers that
++might attempt to access a catalog-managed table's storage location without going through the catalog
++first, and to block filesystem-based writers who could otherwise corrupt both the table and the
++catalog by failing to commit through the catalog.
++
++Before we can go into details of this protocol feature, we must first align our terminology.
++
++## Terminology: Commits
++
++A commit is a set of [actions](#actions) that transform a Delta table from version `v - 1` to `v`.
++It contains the same kind of content as is stored in a [Delta file](#delta-log-entries).
++
++A commit may be stored in the file system as a Delta file - either _published_ or _staged_ - or
++stored _inline_ in the managing catalog, using whatever format the catalog prefers.
++
++There are several types of commits:
++
++1. **Proposed commit**:  A commit that a Delta client has proposed for the next version of the
++   table. It could be _staged_ or _inline_. It will either become _ratified_ or be rejected.
++
++2. <a name="staged-commit">**Staged commit**</a>: A commit that is written to disk at
++   `_delta_log/_staged_commits/<v>.<uuid>.json`. It has the same content and format as a published
++   Delta file.
++    - Here, the `uuid` is a random UUID that is generated for each commit and `v` is the version
++      which is proposed to be committed, zero-padded to 20 digits.
++    - The mere existence of a staged commit does not mean that the file has been ratified or even
++      proposed. It might correspond to a failed or in-progress commit attempt.
++    - The catalog is the source of truth around which staged commits are ratified.
++    - The catalog stores only the location, not the content, of a staged (and ratified) commit.
++
++3. <a name="inline-commit">**Inline commit**</a>: A proposed commit that is not written to disk but
++   rather has its content sent to the catalog for the catalog to store directly.
++
++4. <a name="ratified-commit">**Ratified commit**</a>: A proposed commit that a catalog has
++   determined has won the commit at the desired version of the table.
++    - The catalog must store ratified commits (that is, the staged commit's location or the inline
++      commit's content) until they are published to the `_delta_log` directory.
++    - A ratified commit may or may not yet be published.
++    - A ratified commit may or may not even be stored by the catalog at all - the catalog may
++      have just atomically published it to the filesystem directly, relying on PUT-if-absent
++      primitives to facilitate the ratification and publication all in one step.
++
++5. <a name="published-commit">**Published commit**</a>: A ratified commit that has been copied into
++   the `_delta_log` as a normal Delta file, i.e. `_delta_log/<v>.json`.
++    - Here, the `v` is the version which is being committed, zero-padded to 20 digits.
++    - The existence of a `<v>.json` file proves that the corresponding version `v` is ratified,
++      regardless of whether the table is catalog-managed or filesystem-based. The catalog is allowed
++      to return information about published commits, but Delta clients can also use filesystem
++      listing operations to directly discover them.
++    - Published commits do not need to be stored by the catalog.
++
++## Terminology: Delta Client
++
++This is the component that implements support for reading and writing Delta tables, and implements
++the logic required by the `catalogManaged` table feature. Among other things, it
++- triggers the filesystem listing, if needed, to discover published commits
++- generates the commit content (the set of [actions](#actions))
++- works together with the query engine to trigger the commit process and invoke the client-side
++  catalog component with the commit content
++
++The Delta client is also responsible for defining the client-side API that catalogs should target.
++That is, there must be _some_ API that the [catalog client](#catalog-client) can use to communicate
++to the Delta client the subset of catalog-managed information that the Delta client cares about.
++This protocol feature is concerned with what information Delta cares about, but leaves to Delta
++clients the design of the API they use to obtain that information from catalog clients.
++
++## Terminology: Catalogs
++
++1. **Catalog**: A catalog is an entity which manages a Delta table, including its creation, writes,
++   reads, and eventual deletion.
++    - It could be backed by a database, a filesystem, or any other persistence mechanism.
++    - Each catalog has its own spec around how catalog clients should interact with them, and how
++      they perform a commit.
++
++2. <a name="catalog-client">**Catalog Client**</a>: The catalog always has a client-side component
++   which the Delta client interacts with directly. This client-side component has two primary
++   responsibilities:
++    - implement any client-side catalog-specific logic (such as staging or
++      [publishing](#publishing-commits) commits)
++    - communicate with the Catalog Server, if any
++
++3. **Catalog Server**: The catalog may also involve a server-side component which the client-side
++   component would be responsible to communicate with.
++    - This server is responsible for coordinating commits and potentially persisting table metadata
++      and enforcing authorization policies.
++    - Not all catalogs require a server; some may be entirely client-side, e.g. filesystem-backed
++      catalogs, or they may make use of a generic database server and implement all of the catalog's
++      business logic client-side.
++
++**NOTE**: This specification outlines the responsibilities and actions that catalogs must implement.
++This spec does its best not to assume any specific catalog _implementation_, though it does call out
++likely client-side and server-side responsibilities. Nonetheless, what a given catalog does
++client-side or server-side is up to each catalog implementation to decide for itself.
++
++## Catalog Responsibilities
++
++When the `catalogManaged` table feature is enabled, a catalog performs commits to the table on behalf
++of the Delta client.
++
++As stated above, the Delta spec does not mandate any particular client-server design or API for
++catalogs that manage Delta tables. However, the catalog does need to provide certain capabilities
++for reading and writing Delta tables:
++
++- Atomically commit a version `v` with a given set of `actions`. This is explained in detail in the
++  [commit protocol](#commit-protocol) section.
++- Retrieve information about recent ratified commits and the latest ratified version on the table.
++  This is explained in detail in the [Getting Ratified Commits from the Catalog](#getting-ratified-commits-from-the-catalog) section.
++- Though not required, it is encouraged that catalogs also return the latest table-level metadata,
++  such as the latest Protocol and Metadata actions, for the table. This can provide significant
++  performance advantages to conforming Delta clients, who may forgo log replay and instead trust
++  the information provided by the catalog during query planning.
++
++## Reading Catalog-managed Tables
++
++A catalog-managed table can have a mix of (a) published and (b) ratified but non-published commits.
++The catalog is the source of truth for ratified commits. Also recall that ratified commits can be
++[staged commits](#staged-commit) that are persisted to the `_delta_log/_staged_commits` directory,
++or [inline commits](#inline-commit) whose content the catalog stores directly.
++
++For example, suppose the `_delta_log` directory contains the following files:
++
++```
++00000000000000000000.json
++00000000000000000001.json
++00000000000000000002.checkpoint.parquet
++00000000000000000002.json
++00000000000000000003.00000000000000000005.compacted.json
++00000000000000000003.json
++00000000000000000004.json
++00000000000000000005.json
++00000000000000000006.json
++00000000000000000007.json
++_staged_commits/00000000000000000007.016ae953-37a9-438e-8683-9a9a4a79a395.json // ratified and published
++_staged_commits/00000000000000000008.7d17ac10-5cc3-401b-bd1a-9c82dd2ea032.json // ratified
++_staged_commits/00000000000000000008.b91807ba-fe18-488c-a15e-c4807dbd2174.json // rejected
++_staged_commits/00000000000000000010.0f707846-cd18-4e01-b40e-84ee0ae987b0.json // not yet ratified
++_staged_commits/00000000000000000010.7a980438-cb67-4b89-82d2-86f73239b6d6.json // partial file
++```
++
++Further, suppose the catalog stores the following ratified commits:
++```
++{
++  7  -> "00000000000000000007.016ae953-37a9-438e-8683-9a9a4a79a395.json",
++  8  -> "00000000000000000008.7d17ac10-5cc3-401b-bd1a-9c82dd2ea032.json",
++  9  -> <inline commit: content stored by the catalog directly>
++}
++```
++
++Some things to note are:
++- the catalog isn't aware that commit 7 was already published - perhaps the response from the
++  filesystem was dropped
++- commit 9 is an inline commit
++- neither of the two staged commits for version 10 have been ratified
++
++To read such tables, Delta clients must first contact the catalog to get the ratified commits. This
++informs the Delta client of commits [7, 9] as well as the latest ratified version, 9.
++
++If this information is insufficient to construct a complete snapshot of the table, Delta clients
++must LIST the `_delta_log` directory to get information about the published commits. For commits
++that are both returned by the catalog and already published, Delta clients must treat the catalog's
++version as authoritative and read the commit returned by the catalog. Additionally, Delta clients
++must ignore any files with versions greater than the latest ratified commit version returned by the
++catalog.
++
++Combining these two sets of files and commits enables Delta clients to generate a snapshot at the
++latest version of the table.
++
++**NOTE**: This spec prescribes the _minimum_ required interactions between Delta clients and
++catalogs for commits. Catalogs may very well expose APIs and work with Delta clients to be
++informed of other non-commit [file types](#file-types), such as checkpoint, log
++compaction, and version checksum files. This would allow catalogs to return additional
++information to Delta clients during query and scan planning, potentially allowing Delta
++clients to avoid LISTing the filesystem altogether.
++
++## Commit Protocol
++
++To start, Delta Clients send the desired actions to be committed to the client-side component of the
++catalog.
++
++This component then has several options for proposing, ratifying, and publishing the commit,
++detailed below.
++
++- Option 1: Write the actions (likely client-side) to a [staged commit file](#staged-commit) in the
++  `_delta_log/_staged_commits` directory and then ratify the staged commit (likely server-side) by
++  atomically recording (in persistent storage of some kind) that the file corresponds to version `v`.
++- Option 2: Treat this as an [inline commit](#inline-commit) (i.e. likely that the client-side
++  component sends the contents to the server-side component) and atomically record (in persistent
++  storage of some kind) the content of the commit as version `v` of the table.
++- Option 3: Catalog implementations that use PUT-if-absent (client- or server-side) can ratify and
++  publish all-in-one by atomically writing a [published commit file](#published-commit)
++  in the `_delta_log` directory. Note that this commit will be considered to have succeeded as soon
++  as the file becomes visible in the filesystem, regardless of when or whether the catalog is made
++  aware of the successful publish. The catalog does not need to store these files.
++
++A catalog must not ratify version `v` until it has ratified version `v - 1`, and it must ratify
++version `v` at most once.
++
++The catalog must store both flavors of ratified commits (staged or inline) and make them available
++to readers until they are [published](#publishing-commits).
++
++For performance reasons, Delta clients are encouraged to establish an API contract where the catalog
++provides the latest ratified commit information whenever a commit fails due to version conflict.
++
++## Getting Ratified Commits from the Catalog
++
++Even after a commit is ratified, it is not discoverable through filesystem operations until it is
++[published](#publishing-commits).
++
++The catalog-client is responsible for implementing an API (defined by the Delta client) that Delta clients can
++use to retrieve the latest ratified commit version (authoritative), as well as the set of ratified
++commits the catalog is still storing for the table. If some commits needed to complete the snapshot
++are not stored by the catalog, as they are already published, Delta clients can issue a filesystem
++LIST operation to retrieve them.
++
++Delta clients must establish an API contract where the catalog provides ratified commit information
++as part of the standard table resolution process performed at query planning time.
++
++## Publishing Commits
++
++Publishing is the process of copying the ratified commit with version `<v>` to
++`_delta_log/<v>.json`. The ratified commit may be a staged commit located in
++`_delta_log/_staged_commits/<v>.<uuid>.json`, or it may be an inline commit whose content the
++catalog stores itself. Because the content of a ratified commit is immutable, it does not matter
++whether the client-side, server-side, or both catalog components initiate publishing.
++
++Implementations are strongly encouraged to publish commits promptly. This reduces the number of
++commits the catalog needs to store internally (and serve up to readers).
++
++Commits must be published _in order_. That is, version `v - 1` must be published _before_ version
++`v`.
++
++**NOTE**: Because commit publishing can happen at any time after the commit succeeds, the file
++modification timestamp of the published file will not accurately reflect the original commit time.
++For this reason, catalog-managed tables must use [in-commit-timestamps](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#in-commit-timestamps)
++to ensure stability of time travel reads. Refer to [Writer Requirements for Catalog-managed Tables](#writer-requirements-for-catalog-managed-tables)
++section for more details.
++
++## Maintenance Operations on Catalog-managed Tables
++
++[Checkpoints](#checkpoints-1) and [Log Compaction Files](#log-compaction-files) can only be created
++for versions that are already published in the `_delta_log`. In other words, in order to checkpoint
++version `v` or produce a log compaction file for commit range `x <= v <= y`, `_delta_log/<v>.json`
++must exist.
++
++Notably, the [Version Checksum File](#version-checksum-file) for version `v` _can_ be created in the
++`_delta_log` even if the commit for version `v` is not published.
++
++By default, maintenance operations are prohibited unless the managing catalog explicitly permits
++the client to run them. The only exceptions are checkpoints, log compaction, and version checksum,
++as they are essential for all basic table operations (e.g. reads and writes) to operate reliably.
++All other maintenance operations such as the following are not allowed by default.
++- [Log and other metadata files cleanup](#metadata-cleanup).
++- Data files cleanup, for example VACUUM.
++- Data layout changes, for example OPTIMIZE and REORG.
++
++## Creating and Dropping Catalog-managed Tables
++
++The catalog and query engine ultimately dictate how to create and drop catalog-managed tables.
++
++As one example, table creation often works in three phases:
++
++1. An initial catalog operation to obtain a unique storage location which serves as an unnamed
++   "staging" table
++2. A table operation that physically initializes a new `catalogManaged`-enabled table at the staging
++   location.
++3. A final catalog operation that registers the new table with its intended name.
++
++Delta clients would primarily be involved with the second step, but an implementation could choose
++to combine the second and third steps so that a single catalog call registers the table as part of
++the table's first commit.
++
++As another example, dropping a table can be as simple as removing its name from the catalog (a "soft
++delete"), followed at some later point by a "hard delete" that physically purges the data. The Delta
++client would not be involved at all in this process, because no commits are made to the table.
++
++## Catalog-managed Table Enablement
++
++The `catalogManaged` table feature is supported and active when:
++- The table is on Reader Version 3 and Writer Version 7.
++- The table has a `protocol` action with `readerFeatures` and `writerFeatures` both containing the
++  feature `catalogManaged`.
++
++## Writer Requirements for Catalog-managed Tables
++
++When supported and active:
++
++- Writers must discover and access the table using catalog calls, which happens _before_ the table's
++  protocol is known. See [Table Discovery](#table-discovery) for more details.
++- The [in-commit-timestamps](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#in-commit-timestamps)
++  table feature must be supported and active.
++- The `commitInfo` action must also contain a field `txnId` that stores a unique transaction
++  identifier string.
++- Writers must follow the catalog's [commit protocol](#commit-protocol) and must not perform
++  ordinary filesystem-based commits against the table.
++- Writers must follow the catalog's [maintenance operation protocol](#maintenance-operations-on-catalog-managed-tables).
++
++## Reader Requirements for Catalog-managed Tables
++
++When supported and active:
++
++- Readers must discover the table using catalog calls, which happens before the table's protocol
++  is known. See [Table Discovery](#table-discovery) for more details.
++- Readers must contact the catalog for information about unpublished ratified commits.
++- Readers must follow the rules described in the [Reading Catalog-managed Tables](#reading-catalog-managed-tables)
++  section above. Notably:
++  - If the catalog said `v` is the latest version, clients must ignore any later versions that may
++    have been published.
++  - When the catalog returns a ratified commit for version `v`, readers must use that
++    catalog-supplied commit and ignore any published Delta file for version `v` that might also be
++    present.
++
++## Table Discovery
++
++The requirements above state that readers and writers must discover and access the table using
++catalog calls, which occurs _before_ the table's protocol is known. This raises an important
++question: how can a client discover a `catalogManaged` Delta table without first knowing that it
++_is_, in fact, `catalogManaged` (according to the protocol)?
++
++To solve this, first note that, in practice, catalog-integrated engines already ask the catalog to
++resolve a table name to its storage location during the name resolution step. This protocol
++therefore encourages that the same name resolution step also indicate whether the table is
++catalog-managed. Surfacing this at the very moment the catalog returns the path imposes no extra
++round-trips, yet it lets the client decide — early and unambiguously — whether to follow the
++`catalogManaged` read and write rules.
++
++## Sample Catalog Client API
++
++The following is an example of a possible API which a Java-based Delta client might require catalog
++implementations to target:
++
++```scala
++
++interface CatalogManagedTable {
++    /**
++     * Commits the given set of `actions` to the given commit `version`.
++     *
++     * @param version The version we want to commit.
++     * @param actions Actions that need to be committed.
++     *
++     * @return CommitResponse which has details around the new committed delta file.
++     */
++    def commit(
++        version: Long,
++        actions: Iterator[String]): CommitResponse
++
++    /**
++     * Retrieves a (possibly empty) suffix of ratified commits in the range [startVersion,
++     * endVersion] for this table.
++     * 
++     * Some of these ratified commits may already have been published. Some of them may be staged,
++     * in which case the staged commit file path is returned; others may be inline, in which case
++     * the inline commit content is returned.
++     * 
++     * The returned commits are sorted in ascending version number and are contiguous.
++     *
++     * If neither start nor end version is specified, the catalog will return all available ratified
++     * commits (possibly empty, if all commits have been published).
++     *
++     * In all cases, the response also includes the table's latest ratified commit version.
++     *
++     * @return GetCommitsResponse which contains an ordered list of ratified commits
++     *         stored by the catalog, as well as table's latest commit version.
++     */
++    def getRatifiedCommits(
++        startVersion: Option[Long],
++        endVersion: Option[Long]): GetCommitsResponse
++}
++```
++
++Note that the above is only one example of a possible Catalog Client API. It is also _NOT_ a catalog
++API (no table discovery, ACL, create/drop, etc). The Delta protocol is agnostic to API details, and
++the API surface Delta clients define should only cover the specific catalog capabilities that Delta
++client needs to correctly read and write catalog-managed tables.
++
+ # Iceberg Compatibility V1
+ 
+ This table feature (`icebergCompatV1`) ensures that Delta tables can be converted to Apache Iceberg™ format, though this table feature does not implement or specify that conversion.
+  * Files that have been [added](#Add-File-and-Remove-File) and not yet removed
+  * Files that were recently [removed](#Add-File-and-Remove-File) and have not yet expired
+  * [Transaction identifiers](#Transaction-Identifiers)
+- * [Domain Metadata](#Domain-Metadata)
++ * [Domain Metadata](#Domain-Metadata) that have not been removed (i.e. excluding tombstones with `removed=true`)
+  * [Checkpoint Metadata](#checkpoint-metadata) - Requires [V2 checkpoints](#v2-spec)
+  * [Sidecar File](#sidecar-files) - Requires [V2 checkpoints](#v2-spec)
+ 
+ 1. Identify a threshold (in days) up to which we want to preserve the deltaLog. Let's refer to
+ midnight UTC of that day as `cutOffTimestamp`. The newest commit not newer than the `cutOffTimestamp` is
+ the `cutOffCommit`, because a commit exactly at midnight is an acceptable cutoff. We want to retain everything including and after the `cutOffCommit`.
+-2. Identify the newest checkpoint that is not newer than the `cutOffCommit`. A checkpoint at the `cutOffCommit` is ideal, but an older one will do. Lets call it `cutOffCheckpoint`.
+-We need to preserve the `cutOffCheckpoint` (both the checkpoint file and the JSON commit file at that version) and all commits after it. The JSON commit file at the `cutOffCheckpoint` version must be preserved because checkpoints do not preserve [commit provenance information](#commit-provenance-information) (e.g., `commitInfo` actions), which may be required by table features such as [In-Commit Timestamps](#in-commit-timestamps). All commits after `cutOffCheckpoint` must be preserved to enable time travel for commits between `cutOffCheckpoint` and the next available checkpoint.
+-3. Delete all [delta log entries](#delta-log-entries) and [checkpoint files](#checkpoints) before the
+-`cutOffCheckpoint` checkpoint. Also delete all the [log compaction files](#log-compaction-files) having startVersion <= `cutOffCheckpoint`'s version.
++2. Identify the newest checkpoint that is not newer than the `cutOffCommit`. A checkpoint at the `cutOffCommit` is ideal, but an older one will do. Let's call it `cutOffCheckpoint`.
++We need to preserve the `cutOffCheckpoint` (both the checkpoint file and the JSON commit file at that version) and all published commits after it. The JSON commit file at the `cutOffCheckpoint` version must be preserved because checkpoints do not preserve [commit provenance information](#commit-provenance-information) (e.g., `commitInfo` actions), which may be required by table features such as [In-Commit Timestamps](#in-commit-timestamps). All published commits after `cutOffCheckpoint` must be preserved to enable time travel for commits between `cutOffCheckpoint` and the next available checkpoint.
++    - If no `cutOffCheckpoint` can be found, do not proceed with metadata cleanup as there is
++      nothing to clean up.
++3. Delete all [delta log entries](#delta-log-entries), [checkpoint files](#checkpoints), and
++   [version checksum files](#version-checksum-file) before the `cutOffCheckpoint` checkpoint. Also delete all the [log compaction files](#log-compaction-files)
++   having startVersion <= `cutOffCheckpoint`'s version.
++    - Also delete all the [staged commit files](#staged-commit) having version <=
++      `cutOffCheckpoint`'s version from the `_delta_log/_staged_commits` directory.
+ 4. Now read all the available [checkpoints](#checkpoints-1) in the _delta_log directory and identify
+ the corresponding [sidecar files](#sidecar-files). These sidecar files need to be protected.
+ 5. List all the files in `_delta_log/_sidecars` directory, preserve files that are less than a day
+ [Timestamp without Timezone](#timestamp-without-timezone-timestampNtz) | `timestampNtz` | Readers and writers
+ [Domain Metadata](#domain-metadata) | `domainMetadata` | Writers only
+ [V2 Checkpoint](#v2-checkpoint-table-feature) | `v2Checkpoint` | Readers and writers
++[Catalog-managed Tables](#catalog-managed-tables) | `catalogManaged` | Readers and writers
+ [Iceberg Compatibility V1](#iceberg-compatibility-v1) | `icebergCompatV1` | Writers only
+ [Iceberg Compatibility V2](#iceberg-compatibility-v2) | `icebergCompatV2` | Writers only
+ [Clustered Table](#clustered-table) | `clustering` | Writers only
\ No newline at end of file
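
The read rules for catalog-managed tables in the PROTOCOL.md diff above (the catalog's commits are authoritative; files newer than the latest ratified version are ignored) can be sketched roughly as follows. This is a minimal illustration only, not code from this PR or from Delta: `CatalogManagedLogResolver`, `resolve`, and the `published:` / `catalog:` string tags are hypothetical names, and checkpoints, log compaction files, and contiguity checks are deliberately left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper: picks one commit source per version for a catalog-managed table. */
final class CatalogManagedLogResolver {

    /**
     * @param published      version -> published commit file name, as found by LISTing _delta_log
     * @param ratified       version -> staged file name or inline marker, as returned by the catalog
     * @param latestRatified latest ratified version reported by the catalog
     * @return the chosen commit source for each version, in ascending version order
     */
    static List<String> resolve(
            Map<Long, String> published,
            Map<Long, String> ratified,
            long latestRatified) {
        TreeMap<Long, String> merged = new TreeMap<>();

        // Start from published commits, ignoring any version newer than the
        // latest ratified version returned by the catalog.
        for (Map.Entry<Long, String> e : published.entrySet()) {
            if (e.getKey() <= latestRatified) {
                merged.put(e.getKey(), "published:" + e.getValue());
            }
        }

        // The catalog is authoritative: a ratified commit replaces a published
        // commit file at the same version.
        for (Map.Entry<Long, String> e : ratified.entrySet()) {
            merged.put(e.getKey(), "catalog:" + e.getValue());
        }

        return new ArrayList<>(merged.values());
    }
}
```

With the example from the spec text (published versions 0 through 7, catalog commits 7 through 9, latest ratified version 9), such a helper would select the published files for versions 0 through 6 and the catalog-supplied commits for versions 7, 8, and 9.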

README.md

@@ -0,0 +1,10 @@
+diff --git a/README.md b/README.md
+--- a/README.md
++++ b/README.md
+ ## Building
+ 
+ Delta Lake is compiled using [SBT](https://www.scala-sbt.org/1.x/docs/Command-Line-Reference.html).
++Ensure that your Java version is at least 17 (you can verify with `java -version`).
+ 
+ To compile, run
+ 
\ No newline at end of file

build.sbt

@@ -0,0 +1,218 @@
+diff --git a/build.sbt b/build.sbt
+--- a/build.sbt
++++ b/build.sbt
+       allMappings.distinct
+     },
+ 
+-    // Exclude internal modules from published POM
++    // Exclude internal modules from published POM and add kernel dependencies.
++    // Kernel modules are transitive through sparkV2 (an internal module), so they
++    // are lost when sparkV2 is filtered out. We re-add them explicitly here.
+     pomPostProcess := { node =>
+       val internalModules = internalModuleNames.value
++      val ver = version.value
+       import scala.xml._
+       import scala.xml.transform._
++
++      def kernelDependencyNode(artifactId: String): Elem = {
++        <dependency>
++          <groupId>io.delta</groupId>
++          <artifactId>{artifactId}</artifactId>
++          <version>{ver}</version>
++        </dependency>
++      }
++
++      val kernelDeps = Seq(
++        kernelDependencyNode("delta-kernel-api"),
++        kernelDependencyNode("delta-kernel-defaults"),
++        kernelDependencyNode("delta-kernel-unitycatalog")
++      )
++
+       new RuleTransformer(new RewriteRule {
+         override def transform(n: Node): Seq[Node] = n match {
+-          case e: Elem if e.label == "dependency" =>
+-            val artifactId = (e \ "artifactId").text
+-            // Check if artifactId starts with any internal module name
+-            // (e.g., "delta-spark-v1_4.1_2.13" starts with "delta-spark-v1")
+-            val isInternal = internalModules.exists(module => artifactId.startsWith(module))
+-            if (isInternal) Seq.empty else Seq(n)
++          case e: Elem if e.label == "dependencies" =>
++            val filtered = e.child.filter {
++              case child: Elem if child.label == "dependency" =>
++                val artifactId = (child \ "artifactId").text
++                !internalModules.exists(module => artifactId.startsWith(module))
++              case _ => true
++            }
++            Seq(e.copy(child = filtered ++ kernelDeps))
+           case _ => Seq(n)
+         }
+       }).transform(node).head
+     commonSettings,
+     scalaStyleSettings,
+     releaseSettings,
+-    CrossSparkVersions.sparkDependentModuleName(sparkVersion),
++    // Set sparkVersion directly (not sparkDependentModuleName) so that
++    // runOnlyForReleasableSparkModules discovers this module, but without adding a Spark
++    // suffix to the artifact name. delta-contribs is only published as delta-contribs_2.13.
++    sparkVersion := CrossSparkVersions.getSparkVersion(),
+     Compile / packageBin / mappings := (Compile / packageBin / mappings).value ++
+       listPythonFiles(baseDirectory.value.getParentFile / "python"),
+ 
+   ).configureUnidoc()
+ 
+ 
+-val unityCatalogVersion = "0.3.1"
++val unityCatalogVersion = "0.4.0"
+ val sparkUnityCatalogJacksonVersion = "2.15.4" // We are using Spark 4.0's Jackson version 2.15.x, to override Unity Catalog 0.3.0's version 2.18.x
+ 
+ lazy val sparkUnityCatalog = (project in file("spark/unitycatalog"))
+     libraryDependencies ++= Seq(
+       "org.apache.spark" %% "spark-sql" % sparkVersion.value % "provided",
+ 
+-      "io.delta" %% "delta-sharing-client" % "1.3.9",
++      "io.delta" %% "delta-sharing-client" % "1.3.10",
+ 
+       // Test deps
+       "org.scalatest" %% "scalatest" % scalaTestVersion % "test",
+ 
+       // Test Deps
+       "org.scalatest" %% "scalatest" % scalaTestVersion % "test",
++      // Jackson datatype module needed for UC SDK tests (excluded from main compile scope)
++      "com.fasterxml.jackson.datatype" % "jackson-datatype-jsr310" % "2.15.4" % "test",
+     ),
+ 
+     // Unidoc settings
+     commonSettings,
+     scalaStyleSettings,
+     releaseSettings,
+-    CrossSparkVersions.sparkDependentModuleName(sparkVersion),
++    // Set sparkVersion directly (not sparkDependentModuleName) so that
++    // runOnlyForReleasableSparkModules discovers this module, but without adding a Spark
++    // suffix to the artifact name. delta-iceberg is only published as delta-iceberg_2.13.
++    sparkVersion := CrossSparkVersions.getSparkVersion(),
+     libraryDependencies ++= {
+       if (supportIceberg) {
+         Seq(
+           "org.xerial" % "sqlite-jdbc" % "3.45.0.0" % "test",
+           "org.apache.httpcomponents.core5" % "httpcore5" % "5.2.4" % "test",
+           "org.apache.httpcomponents.client5" % "httpclient5" % "5.3.1" % "test",
+-          "org.apache.iceberg" %% icebergSparkRuntimeArtifactName % "1.10.0" % "provided"
++          "org.apache.iceberg" %% icebergSparkRuntimeArtifactName % "1.10.0" % "provided",
++          // For FixedGcsAccessTokenProvider (GCS server-side planning credentials)
++          "com.google.cloud.bigdataoss" % "util-hadoop" % "hadoop3-2.2.26" % "provided"
+         )
+       } else {
+         Seq.empty
+   )
+ // scalastyle:on println
+ 
+-val icebergShadedVersion = "1.10.0"
++val icebergShadedVersion = "1.10.1"
+ lazy val icebergShaded = (project in file("icebergShaded"))
+   .dependsOn(spark % "provided")
+   .disablePlugins(JavaFormatterPlugin, ScalafmtPlugin)
+     commonSettings,
+     scalaStyleSettings,
+     releaseSettings,
+-    CrossSparkVersions.sparkDependentSettings(sparkVersion),
+-    libraryDependencies ++= Seq(
+-      "org.apache.hudi" % "hudi-java-client" % "0.15.0" % "compile" excludeAll(
+-        ExclusionRule(organization = "org.apache.hadoop"),
+-        ExclusionRule(organization = "org.apache.zookeeper"),
+-      ),
+-      "org.apache.spark" %% "spark-avro" % sparkVersion.value % "test" excludeAll ExclusionRule(organization = "org.apache.hadoop"),
+-      "org.apache.parquet" % "parquet-avro" % "1.12.3" % "compile"
+-    ),
++    // Set sparkVersion directly (not sparkDependentModuleName) so that
++    // runOnlyForReleasableSparkModules discovers this module, but without adding a Spark
++    // suffix to the artifact name. delta-hudi is only published as delta-hudi_2.13.
++    sparkVersion := CrossSparkVersions.getSparkVersion(),
++    libraryDependencies ++= {
++      if (supportHudi) {
++        Seq(
++          "org.apache.hudi" % "hudi-java-client" % "0.15.0" % "compile" excludeAll(
++            ExclusionRule(organization = "org.apache.hadoop"),
++            ExclusionRule(organization = "org.apache.zookeeper"),
++          ),
++          "org.apache.spark" %% "spark-avro" % sparkVersion.value % "test" excludeAll ExclusionRule(organization = "org.apache.hadoop"),
++          "org.apache.parquet" % "parquet-avro" % "1.12.3" % "compile"
++        )
++      } else {
++        Seq.empty
++      }
++    },
++    // Skip compilation and publishing when supportHudi is false
++    Compile / skip := !supportHudi,
++    Test / skip := !supportHudi,
++    publish / skip := !supportHudi,
++    publishLocal / skip := !supportHudi,
++    publishM2 / skip := !supportHudi,
+     assembly / assemblyJarName := s"${name.value}-assembly_${scalaBinaryVersion.value}-${version.value}.jar",
+     assembly / logLevel := Level.Info,
+     assembly / test := {},
+       // crossScalaVersions must be set to Nil on the aggregating project
+       crossScalaVersions := Nil,
+       publishArtifact := false,
+-      publish / skip := false,
++      publish / skip := true,
+     )
+ }
+ 
+       // crossScalaVersions must be set to Nil on the aggregating project
+       crossScalaVersions := Nil,
+       publishArtifact := false,
+-      publish / skip := false,
++      publish / skip := true,
+     )
+ }
+ 
+     // crossScalaVersions must be set to Nil on the aggregating project
+     crossScalaVersions := Nil,
+     publishArtifact := false,
+-    publish / skip := false,
++    publish / skip := true,
+     unidocSourceFilePatterns := {
+       (kernelApi / unidocSourceFilePatterns).value.scopeToProject(kernelApi) ++
+       (kernelDefaults / unidocSourceFilePatterns).value.scopeToProject(kernelDefaults)
+     // crossScalaVersions must be set to Nil on the aggregating project
+     crossScalaVersions := Nil,
+     publishArtifact := false,
+-    publish / skip := false,
++    publish / skip := true,
+   )
+ 
+ /*
+     sys.env.getOrElse("SONATYPE_USERNAME", ""),
+     sys.env.getOrElse("SONATYPE_PASSWORD", "")
+   ),
++  credentials += Credentials(
++    "Sonatype Nexus Repository Manager",
++    "central.sonatype.com",
++    sys.env.getOrElse("SONATYPE_USERNAME", ""),
++    sys.env.getOrElse("SONATYPE_PASSWORD", "")
++  ),
+   publishTo := {
+     val ossrhBase = "https://ossrh-staging-api.central.sonatype.com/"
++    val centralSnapshots = "https://central.sonatype.com/repository/maven-snapshots/"
+     if (isSnapshot.value) {
+-      Some("snapshots" at ossrhBase + "content/repositories/snapshots")
++      Some("snapshots" at centralSnapshots)
+     } else {
+       Some("releases"  at ossrhBase + "service/local/staging/deploy/maven2")
+     }
+ // Looks like some of release settings should be set for the root project as well.
+ publishArtifact := false  // Don't release the root project
+ publish / skip := true
+-publishTo := Some("snapshots" at "https://ossrh-staging-api.central.sonatype.com/content/repositories/snapshots")
++publishTo := Some("snapshots" at "https://central.sonatype.com/repository/maven-snapshots/")
+ releaseCrossBuild := false  // Don't use sbt-release's cross facility
+ releaseProcess := Seq[ReleaseStep](
+   checkSnapshotDependencies,
+   setReleaseVersion,
+   commitReleaseVersion,
+   tagRelease
+-) ++ CrossSparkVersions.crossSparkReleaseSteps("+publishSigned") ++ Seq[ReleaseStep](
++) ++ CrossSparkVersions.crossSparkReleaseSteps("publishSigned") ++ Seq[ReleaseStep](
+ 
+   // Do NOT use `sonatypeBundleRelease` - it will actually release to Maven! We want to do that
+   // manually.
\ No newline at end of file

connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/.00000000000000000000.json.crc

@@ -0,0 +1,3 @@
+diff --git a/connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/.00000000000000000000.json.crc b/connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/.00000000000000000000.json.crc
+new file mode 100644
+Binary files /dev/null and b/connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/.00000000000000000000.json.crc differ
\ No newline at end of file

connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/00000000000000000000.crc

@@ -0,0 +1,5 @@
+diff --git a/connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/00000000000000000000.crc b/connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/00000000000000000000.crc
+new file mode 100644
+--- /dev/null
++++ b/connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/00000000000000000000.crc
++{"txnId":"6132e880-0f3a-4db4-b882-1da039bffbad","tableSizeBytes":0,"numFiles":0,"numMetadata":1,"numProtocol":1,"setTransactions":[],"domainMetadata":[],"metadata":{"id":"0eb3e007-b3cc-40e4-bca1-a5970d86b5a6","format":{"provider":"parquet","options":{}},"schemaString":"{\"type\":\"struct\",\"fields\":[{\"name\":\"id\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}},{\"name\":\"utf8_binary_col\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"utf8_lcase_col\",\"type\":\"string\",\"nullable\":true,\"metadata\":{\"__COLLATIONS\":{\"utf8_lcase_col\":\"spark.UTF8_LCASE\"}}},{\"name\":\"unicode_col\",\"type\":\"string\",\"nullable\":true,\"metadata\":{\"__COLLATIONS\":{\"unicode_col\":\"icu.UNICODE\"}}}]}","partitionColumns":[],"configuration":{},"createdTime":1773779518731},"protocol":{"minReaderVersion":1,"minWriterVersion":7,"writerFeatures":["domainMetadata","collations-preview","appendOnly","invariants"]},"histogramOpt":{"sortedBinBoundaries":[0,8192,16384,32768,65536,131072,262144,524288,1048576,2097152,4194304,8388608,12582912,16777216,20971520,25165824,29360128,33554432,37748736,41943040,50331648,58720256,67108864,75497472,83886080,92274688,100663296,109051904,117440512,125829120,130023424,134217728,138412032,142606336,146800640,150994944,167772160,184549376,201326592,218103808,234881024,251658240,268435456,285212672,301989888,318767104,335544320,352321536,369098752,385875968,402653184,419430400,436207616,452984832,469762048,486539264,503316480,520093696,536870912,553648128,570425344,587202560,603979776,671088640,738197504,805306368,872415232,939524096,1006632960,1073741824,1140850688,1207959552,1275068416,1342177280,1409286144,1476395008,1610612736,1744830464,1879048192,2013265920,2147483648,2415919104,2684354560,2952790016,3221225472,3489660928,3758096384,4026531840,4294967296,8589934592,17179869184,34359738368,68719476736,137438953472,274877906944],"fileCounts":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],"totalBytes":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]},"allFiles":[]}
\ No newline at end of file

... (truncated, output exceeded 60000 bytes)

_{Reproduce locally: git range-diff e8cffee..320b3ac d1139d2..3398dde | Disable: git config gitstack.push-range-diff false}

zikangh · 2026-04-02T20:25:08Z

Range-diff: master (3398dde -> 41bc0a2)

.github/CODEOWNERS

@@ -0,0 +1,12 @@
+diff --git a/.github/CODEOWNERS b/.github/CODEOWNERS
+--- a/.github/CODEOWNERS
++++ b/.github/CODEOWNERS
+ /project/                       @tdas
+ /version.sbt                    @tdas
+ 
++# Spark V2 and Unified modules
++/spark/v2/                      @tdas @huan233usc @TimothyW553 @raveeram-db @murali-db
++/spark-unified/                 @tdas @huan233usc @TimothyW553 @raveeram-db @murali-db
++
+ # All files in the root directory
+ /*                              @tdas
\ No newline at end of file

.github/workflows/iceberg_test.yaml

@@ -0,0 +1,16 @@
+diff --git a/.github/workflows/iceberg_test.yaml b/.github/workflows/iceberg_test.yaml
+--- a/.github/workflows/iceberg_test.yaml
++++ b/.github/workflows/iceberg_test.yaml
+           # the above directories when we use the key for the first time. After that, each run will
+           # just use the cache. The cache is immutable so we need to use a new key when trying to
+           # cache new stuff.
+-          key: delta-sbt-cache-spark3.2-scala${{ matrix.scala }}
++          key: delta-sbt-cache-spark4.0-scala${{ matrix.scala }}
+       - name: Install Job dependencies
+         run: |
+           sudo apt-get update
+       - name: Run Scala/Java and Python tests
+         # when changing TEST_PARALLELISM_COUNT make sure to also change it in spark_master_test.yaml
+         run: |
+-          TEST_PARALLELISM_COUNT=4 pipenv run python run-tests.py --group iceberg
++          TEST_PARALLELISM_COUNT=4 pipenv run python run-tests.py --group iceberg --spark-version 4.0
\ No newline at end of file

.github/workflows/spark_examples_test.yaml

@@ -0,0 +1,54 @@
+diff --git a/.github/workflows/spark_examples_test.yaml b/.github/workflows/spark_examples_test.yaml
+--- a/.github/workflows/spark_examples_test.yaml
++++ b/.github/workflows/spark_examples_test.yaml
+         # Spark versions are dynamically generated - released versions only
+         spark_version: ${{ fromJson(needs.generate-matrix.outputs.spark_versions) }}
+         # These Scala versions must match those in the build.sbt
+-        scala: [2.13.16]
++        scala: [2.13.17]
+     env:
+       SCALA_VERSION: ${{ matrix.scala }}
+-      SPARK_VERSION: ${{ matrix.spark_version }}
+     steps:
+       - uses: actions/checkout@v3
+       - name: Get Spark version details
+         id: spark-details
+         run: |
+-          # Get JVM version, package suffix, iceberg support for this Spark version
++          # Get JVM version, package suffix, iceberg support, and full version for this Spark version
+           JVM_VERSION=$(python3 project/scripts/get_spark_version_info.py --get-field "${{ matrix.spark_version }}" targetJvm | jq -r)
+           SPARK_PACKAGE_SUFFIX=$(python3 project/scripts/get_spark_version_info.py --get-field "${{ matrix.spark_version }}" packageSuffix | jq -r)
+           SUPPORT_ICEBERG=$(python3 project/scripts/get_spark_version_info.py --get-field "${{ matrix.spark_version }}" supportIceberg | jq -r)
++          SPARK_FULL_VERSION=$(python3 project/scripts/get_spark_version_info.py --get-field "${{ matrix.spark_version }}" fullVersion | jq -r)
+           echo "jvm_version=$JVM_VERSION" >> $GITHUB_OUTPUT
+           echo "spark_package_suffix=$SPARK_PACKAGE_SUFFIX" >> $GITHUB_OUTPUT
+           echo "support_iceberg=$SUPPORT_ICEBERG" >> $GITHUB_OUTPUT
+-          echo "Using JVM $JVM_VERSION for Spark ${{ matrix.spark_version }}, package suffix: '$SPARK_PACKAGE_SUFFIX', support iceberg: '$SUPPORT_ICEBERG'"
++          echo "spark_full_version=$SPARK_FULL_VERSION" >> $GITHUB_OUTPUT
++          echo "Using JVM $JVM_VERSION for Spark $SPARK_FULL_VERSION, package suffix: '$SPARK_PACKAGE_SUFFIX', support iceberg: '$SUPPORT_ICEBERG'"
+       - name: install java
+         uses: actions/setup-java@v3
+         with:
+       - name: Run Delta Spark Local Publishing and Examples Compilation
+         # examples/scala/build.sbt will compile against the local Delta release version (e.g. 3.2.0-SNAPSHOT).
+         # Thus, we need to publishM2 first so those jars are locally accessible.
+-        # The SPARK_PACKAGE_SUFFIX env var tells examples/scala/build.sbt which artifact naming to use.
++        # -DsparkVersion is for the Delta project's publishM2 (which Spark version to compile Delta against).
++        # SPARK_VERSION/SPARK_PACKAGE_SUFFIX/SUPPORT_ICEBERG are for examples/scala/build.sbt (dependency resolution).
+         env:
+           SPARK_PACKAGE_SUFFIX: ${{ steps.spark-details.outputs.spark_package_suffix }}
+           SUPPORT_ICEBERG: ${{ steps.spark-details.outputs.support_iceberg }}
++          SPARK_VERSION: ${{ steps.spark-details.outputs.spark_full_version }}
+         run: |
+           build/sbt clean
+-          build/sbt -DsparkVersion=${{ matrix.spark_version }} publishM2
++          build/sbt -DsparkVersion=${{ steps.spark-details.outputs.spark_full_version }} publishM2
+           cd examples/scala && build/sbt "++ $SCALA_VERSION compile"
++      - name: Run UC Delta Integration Test
++        # Verifies that delta-spark resolved from Maven local includes all kernel module
++        # dependencies transitively by running a real UC-backed Delta workload.
++        env:
++          SPARK_PACKAGE_SUFFIX: ${{ steps.spark-details.outputs.spark_package_suffix }}
++          SPARK_VERSION: ${{ steps.spark-details.outputs.spark_full_version }}
++        run: |
++          cd examples/scala && build/sbt "++ $SCALA_VERSION runMain example.UnityCatalogQuickstart"
\ No newline at end of file

.github/workflows/spark_test.yaml

@@ -0,0 +1,27 @@
+diff --git a/.github/workflows/spark_test.yaml b/.github/workflows/spark_test.yaml
+--- a/.github/workflows/spark_test.yaml
++++ b/.github/workflows/spark_test.yaml
+         # These Scala versions must match those in the build.sbt
+         scala: [2.13.16]
+         # Important: This list of shards must be [0..NUM_SHARDS - 1]
+-        shard: [0, 1, 2, 3]
++        shard: [0, 1, 2, 3, 4, 5, 6, 7]
+     env:
+       SCALA_VERSION: ${{ matrix.scala }}
+       SPARK_VERSION: ${{ matrix.spark_version }}
+       # Important: This must be the same as the length of shards in matrix
+-      NUM_SHARDS: 4
++      NUM_SHARDS: 8
+     steps:
+       - uses: actions/checkout@v3
+       - name: Get Spark version details
+         # when changing TEST_PARALLELISM_COUNT make sure to also change it in spark_python_test.yaml
+         run: |
+           TEST_PARALLELISM_COUNT=4 pipenv run python run-tests.py --group spark --shard ${{ matrix.shard }} --spark-version ${{ matrix.spark_version }}
++      - name: Upload test reports
++        if: always()
++        uses: actions/upload-artifact@v4
++        with:
++          name: test-reports-spark${{ matrix.spark_version }}-shard${{ matrix.shard }}
++          path: "**/target/test-reports/*.xml"
++          retention-days: 7
\ No newline at end of file

PROTOCOL.md

@@ -0,0 +1,537 @@
+diff --git a/PROTOCOL.md b/PROTOCOL.md
+--- a/PROTOCOL.md
++++ b/PROTOCOL.md
+   - [Writer Requirements for Variant Type](#writer-requirements-for-variant-type)
+   - [Reader Requirements for Variant Data Type](#reader-requirements-for-variant-data-type)
+   - [Compatibility with other Delta Features](#compatibility-with-other-delta-features)
++- [Catalog-managed tables](#catalog-managed-tables)
++  - [Terminology: Commits](#terminology-commits)
++  - [Terminology: Delta Client](#terminology-delta-client)
++  - [Terminology: Catalogs](#terminology-catalogs)
++  - [Catalog Responsibilities](#catalog-responsibilities)
++  - [Reading Catalog-managed Tables](#reading-catalog-managed-tables)
++  - [Commit Protocol](#commit-protocol)
++  - [Getting Ratified Commits from the Catalog](#getting-ratified-commits-from-the-catalog)
++  - [Publishing Commits](#publishing-commits)
++  - [Maintenance Operations on Catalog-managed Tables](#maintenance-operations-on-catalog-managed-tables)
++  - [Creating and Dropping Catalog-managed Tables](#creating-and-dropping-catalog-managed-tables)
++  - [Catalog-managed Table Enablement](#catalog-managed-table-enablement)
++  - [Writer Requirements for Catalog-managed tables](#writer-requirements-for-catalog-managed-tables)
++  - [Reader Requirements for Catalog-managed tables](#reader-requirements-for-catalog-managed-tables)
++  - [Table Discovery](#table-discovery)
++  - [Sample Catalog Client API](#sample-catalog-client-api)
+ - [Requirements for Writers](#requirements-for-writers)
+   - [Creation of New Log Entries](#creation-of-new-log-entries)
+   - [Consistency Between Table Metadata and Data Files](#consistency-between-table-metadata-and-data-files)
+ __(1)__ `preimage` is the value before the update, `postimage` is the value after the update.
+ 
+ ### Delta Log Entries
+-Delta files are stored as JSON in a directory at the root of the table named `_delta_log`, and together with checkpoints make up the log of all changes that have occurred to a table.
+ 
+-Delta files are the unit of atomicity for a table, and are named using the next available version number, zero-padded to 20 digits.
++Delta Log Entries, also known as Delta files, are JSON files stored in the `_delta_log`
++directory at the root of the table. Together with checkpoints, they make up the log of all changes
++that have occurred to a table. Delta files are the unit of atomicity for a table, and are named
++using the next available version number, zero-padded to 20 digits.
+ 
+ For example:
+ 
+ ```
+ ./_delta_log/00000000000000000000.json
+ ```
+-Delta files use new-line delimited JSON format, where every action is stored as a single line JSON document.
+-A delta file, `n.json`, contains an atomic set of [_actions_](#Actions) that should be applied to the previous table state, `n-1.json`, in order to the construct `n`th snapshot of the table.
+-An action changes one aspect of the table's state, for example, adding or removing a file.
++
++Delta files use newline-delimited JSON format, where every action is stored as a single-line
++JSON document. A Delta file, corresponding to version `v`, contains an atomic set of
++[_actions_](#actions) that should be applied to the previous table state corresponding to version
++`v-1`, in order to construct the `v`th snapshot of the table. An action changes one aspect of the
++table's state, for example, adding or removing a file.
++
++**Note:** If the [catalogManaged table feature](#catalog-managed-tables) is enabled on the table,
++recently [ratified commits](#ratified-commit) may not yet be published to the `_delta_log` directory as normal Delta
++files - they may be stored directly by the catalog or reside in the `_delta_log/_staged_commits`
++directory. Delta clients must contact the table's managing catalog in order to find the information
++about these [ratified, potentially-unpublished commits](#publishing-commits).
++
++The `_delta_log/_staged_commits` directory is the staging area for [staged](#staged-commit)
++commits. Delta files in this directory have a UUID embedded into them and follow the pattern
++`<version>.<uuid>.json`, where the version corresponds to the proposed commit version, zero-padded
++to 20 digits.
++
++For example:
++
++```
++./_delta_log/_staged_commits/00000000000000000000.3a0d65cd-4056-49b8-937b-95f9e3ee90e5.json
++./_delta_log/_staged_commits/00000000000000000001.7d17ac10-5cc3-401b-bd1a-9c82dd2ea032.json
++./_delta_log/_staged_commits/00000000000000000001.016ae953-37a9-438e-8683-9a9a4a79a395.json
++./_delta_log/_staged_commits/00000000000000000002.3ae45b72-24e1-865a-a211-34987ae02f2a.json
++```
++
++NOTE: The (proposed) version number of a staged commit is authoritative - file
++`00000000000000000100.<uuid>.json` always corresponds to a commit attempt for version 100. Besides
++simplifying implementations, it also acknowledges the fact that commit files cannot safely be reused
++for multiple commit attempts. For example, resolving conflicts in a table with [row
++tracking](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#row-tracking) enabled requires
++rewriting all file actions to update their `baseRowId` field.
++
++The [catalog](#terminology-catalogs) is the source of truth about which staged commit files in
++the `_delta_log/_staged_commits` directory correspond to ratified versions, and Delta clients should
++not attempt to directly interpret the contents of that directory. Refer to
++[catalog-managed tables](#catalog-managed-tables) for more details.
+ 
+ ### Checkpoints
+ Checkpoints are also stored in the `_delta_log` directory, and can be created at any time, for any committed version of the table.
+ ### Commit Provenance Information
+ A delta file can optionally contain additional provenance information about what higher-level operation was being performed as well as who executed it.
+ 
++When the `catalogManaged` table feature is enabled, the `commitInfo` action must have a field
++`txnId` that stores a unique transaction identifier string.
++
+ Implementations are free to store any valid JSON-formatted data via the `commitInfo` action.
+ 
+ When [In-Commit Timestamps](#in-commit-timestamps) are enabled, writers are required to include a `commitInfo` action with every commit, which must include the `inCommitTimestamp` field. Also, the `commitInfo` action must be first action in the commit.
+  - A single `protocol` action
+  - A single `metaData` action
+  - A collection of `txn` actions with unique `appId`s
+- - A collection of `domainMetadata` actions with unique `domain`s.
++ - A collection of `domainMetadata` actions with unique `domain`s, excluding tombstones (i.e. actions with `removed=true`).
+  - A collection of `add` actions with unique path keys, corresponding to the newest (path, deletionVector.uniqueId) pair encountered for each path.
+  - A collection of `remove` actions with unique `(path, deletionVector.uniqueId)` keys. The intersection of the primary keys in the `add` collection and `remove` collection must be empty. That means a logical file cannot exist in both the `remove` and `add` collections at the same time; however, the same *data file* can exist with *different* DVs in the `remove` collection, as logically they represent different content. The `remove` actions act as _tombstones_, and only exist for the benefit of the VACUUM command. Snapshot reads only return `add` actions on the read path.
+  
+      - write a `metaData` action to add the `delta.columnMapping.mode` table property.
+  - Write data files by using the _physical name_ that is chosen for each column. The physical name of the column is static and can be different than the _display name_ of the column, which is changeable.
+  - Write the 32 bit integer column identifier as part of the `field_id` field of the `SchemaElement` struct in the [Parquet Thrift specification](https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift).
+- - Track partition values and column level statistics with the physical name of the column in the transaction log.
++ - Track partition values, column level statistics, and [clustering column](#clustered-table) names with the physical name of the column in the transaction log.
+  - Assign a globally unique identifier as the physical name for each new column that is added to the schema. This is especially important for supporting cheap column deletions in `name` mode. In addition, column identifiers need to be assigned to each column. The maximum id that is assigned to a column is tracked as the table property `delta.columnMapping.maxColumnId`. This is an internal table property that cannot be configured by users. This value must increase monotonically as new columns are introduced and committed to the table alongside the introduction of the new columns to the schema.
+ 
+ ## Reader Requirements for Column Mapping
+ ## Writer Requirement for Deletion Vectors
+ When adding a logical file with a deletion vector, then that logical file must have correct `numRecords` information for the data file in the `stats` field.
+ 
++# Catalog-managed tables
++
++With this feature enabled, the [catalog](#terminology-catalogs) that manages the table becomes the
++source of truth for whether a given commit attempt succeeded.
++
++The table feature defines the parts of the [commit protocol](#commit-protocol) that directly impact
++the Delta table (e.g. atomicity requirements, publishing, etc). The Delta client and catalog
++together are responsible for implementing the Delta-specific aspects of commit as defined by this
++spec, but are otherwise free to define their own APIs and protocols for communication with each
++other.
++
++**NOTE**: Filesystem-based access to catalog-managed tables is not supported. Delta clients are
++expected to discover and access catalog-managed tables through the managing catalog, not by direct
++listing in the filesystem. This feature is primarily designed to warn filesystem-based readers that
++might attempt to access a catalog-managed table's storage location without going through the catalog
++first, and to block filesystem-based writers who could otherwise corrupt both the table and the
++catalog by failing to commit through the catalog.
++
++Before we can go into details of this protocol feature, we must first align our terminology.
++
++## Terminology: Commits
++
++A commit is a set of [actions](#actions) that transform a Delta table from version `v - 1` to `v`.
++It contains the same kind of content as is stored in a [Delta file](#delta-log-entries).
++
++A commit may be stored in the file system as a Delta file - either _published_ or _staged_ - or
++stored _inline_ in the managing catalog, using whatever format the catalog prefers.
++
++There are several types of commits:
++
++1. **Proposed commit**:  A commit that a Delta client has proposed for the next version of the
++   table. It could be _staged_ or _inline_. It will either become _ratified_ or be rejected.
++
++2. <a name="staged-commit">**Staged commit**</a>: A commit that is written to disk at
++   `_delta_log/_staged_commits/<v>.<uuid>.json`. It has the same content and format as a published
++   Delta file.
++    - Here, the `uuid` is a random UUID that is generated for each commit and `v` is the version
++      which is proposed to be committed, zero-padded to 20 digits.
++    - The mere existence of a staged commit does not mean that the file has been ratified or even
++      proposed. It might correspond to a failed or in-progress commit attempt.
++    - The catalog is the source of truth about which staged commits are ratified.
++    - The catalog stores only the location, not the content, of a staged (and ratified) commit.
++
++3. <a name="inline-commit">**Inline commit**</a>: A proposed commit that is not written to disk but
++   rather has its content sent to the catalog for the catalog to store directly.
++
++4. <a name="ratified-commit">**Ratified commit**</a>: A proposed commit that a catalog has
++   determined has won the commit at the desired version of the table.
++    - The catalog must store ratified commits (that is, the staged commit's location or the inline
++      commit's content) until they are published to the `_delta_log` directory.
++    - A ratified commit may or may not yet be published.
++    - A ratified commit may or may not even be stored by the catalog at all - the catalog may
++      have just atomically published it to the filesystem directly, relying on PUT-if-absent
++      primitives to facilitate the ratification and publication all in one step.
++
++5. <a name="published-commit">**Published commit**</a>: A ratified commit that has been copied into
++   the `_delta_log` as a normal Delta file, i.e. `_delta_log/<v>.json`.
++    - Here, the `v` is the version which is being committed, zero-padded to 20 digits.
++    - The existence of a `<v>.json` file proves that the corresponding version `v` is ratified,
++      regardless of whether the table is catalog-managed or filesystem-based. The catalog is allowed
++      to return information about published commits, but Delta clients can also use filesystem
++      listing operations to directly discover them.
++    - Published commits do not need to be stored by the catalog.
++
++## Terminology: Delta Client
++
++This is the component that implements support for reading and writing Delta tables, and implements
++the logic required by the `catalogManaged` table feature. Among other things, it
++- triggers the filesystem listing, if needed, to discover published commits
++- generates the commit content (the set of [actions](#actions))
++- works together with the query engine to trigger the commit process and invoke the client-side
++  catalog component with the commit content
++
++The Delta client is also responsible for defining the client-side API that catalogs should target.
++That is, there must be _some_ API that the [catalog client](#catalog-client) can use to communicate
++to the Delta client the subset of catalog-managed information that the Delta client cares about.
++This protocol feature is concerned with what information Delta cares about, but leaves to Delta
++clients the design of the API they use to obtain that information from catalog clients.
++
++## Terminology: Catalogs
++
++1. **Catalog**: A catalog is an entity which manages a Delta table, including its creation, writes,
++   reads, and eventual deletion.
++    - It could be backed by a database, a filesystem, or any other persistence mechanism.
++    - Each catalog has its own spec for how catalog clients should interact with it, and how
++      it performs a commit.
++
++2. <a name="catalog-client">**Catalog Client**</a>: The catalog always has a client-side component
++   which the Delta client interacts with directly. This client-side component has two primary
++   responsibilities:
++    - implement any client-side catalog-specific logic (such as staging or
++      [publishing](#publishing-commits) commits)
++    - communicate with the Catalog Server, if any
++
++3. **Catalog Server**: The catalog may also involve a server-side component that the client-side
++   component is responsible for communicating with.
++    - This server is responsible for coordinating commits and potentially persisting table metadata
++      and enforcing authorization policies.
++    - Not all catalogs require a server; some may be entirely client-side, e.g. filesystem-backed
++      catalogs, or they may make use of a generic database server and implement all of the catalog's
++      business logic client-side.
++
++**NOTE**: This specification outlines the responsibilities and actions that catalogs must implement.
++This spec does its best not to assume any specific catalog _implementation_, though it does call out
++likely client-side and server-side responsibilities. Nonetheless, what a given catalog does
++client-side or server-side is up to each catalog implementation to decide for itself.
++
++## Catalog Responsibilities
++
++When the `catalogManaged` table feature is enabled, a catalog performs commits to the table on behalf
++of the Delta client.
++
++As stated above, the Delta spec does not mandate any particular client-server design or API for
++catalogs that manage Delta tables. However, the catalog does need to provide certain capabilities
++for reading and writing Delta tables:
++
++- Atomically commit a version `v` with a given set of `actions`. This is explained in detail in the
++  [commit protocol](#commit-protocol) section.
++- Retrieve information about recent ratified commits and the latest ratified version on the table.
++  This is explained in detail in the [Getting Ratified Commits from the Catalog](#getting-ratified-commits-from-the-catalog) section.
++- Though not required, it is encouraged that catalogs also return the latest table-level metadata,
++  such as the latest Protocol and Metadata actions, for the table. This can provide significant
++  performance advantages to conforming Delta clients, who may forgo log replay and instead trust
++  the information provided by the catalog during query planning.
++
++## Reading Catalog-managed Tables
++
++A catalog-managed table can have a mix of (a) published and (b) ratified but non-published commits.
++The catalog is the source of truth for ratified commits. Also recall that ratified commits can be
++[staged commits](#staged-commit) that are persisted to the `_delta_log/_staged_commits` directory,
++or [inline commits](#inline-commit) whose content the catalog stores directly.
++
++For example, suppose the `_delta_log` directory contains the following files:
++
++```
++00000000000000000000.json
++00000000000000000001.json
++00000000000000000002.checkpoint.parquet
++00000000000000000002.json
++00000000000000000003.00000000000000000005.compacted.json
++00000000000000000003.json
++00000000000000000004.json
++00000000000000000005.json
++00000000000000000006.json
++00000000000000000007.json
++_staged_commits/00000000000000000007.016ae953-37a9-438e-8683-9a9a4a79a395.json // ratified and published
++_staged_commits/00000000000000000008.7d17ac10-5cc3-401b-bd1a-9c82dd2ea032.json // ratified
++_staged_commits/00000000000000000008.b91807ba-fe18-488c-a15e-c4807dbd2174.json // rejected
++_staged_commits/00000000000000000010.0f707846-cd18-4e01-b40e-84ee0ae987b0.json // not yet ratified
++_staged_commits/00000000000000000010.7a980438-cb67-4b89-82d2-86f73239b6d6.json // partial file
++```
++
++Further, suppose the catalog stores the following ratified commits:
++```
++{
++  7  -> "00000000000000000007.016ae953-37a9-438e-8683-9a9a4a79a395.json",
++  8  -> "00000000000000000008.7d17ac10-5cc3-401b-bd1a-9c82dd2ea032.json",
++  9  -> <inline commit: content stored by the catalog directly>
++}
++```
++
++Some things to note are:
++- the catalog isn't aware that commit 7 was already published - perhaps the response from the
++  filesystem was dropped
++- commit 9 is an inline commit
++- neither of the two staged commits for version 10 has been ratified
++
++To read such tables, Delta clients must first contact the catalog to get the ratified commits. This
++informs the Delta client of commits 7 through 9 as well as the latest ratified version, 9.
++
++If this information is insufficient to construct a complete snapshot of the table, Delta clients
++must LIST the `_delta_log` directory to get information about the published commits. For commits
++that are both returned by the catalog and already published, Delta clients must treat the catalog's
++version as authoritative and read the commit returned by the catalog. Additionally, Delta clients
++must ignore any files with versions greater than the latest ratified commit version returned by the
++catalog.
++
++Combining these two sets of files and commits enables Delta clients to generate a snapshot at the
++latest version of the table.
++
++**NOTE**: This spec prescribes the _minimum_ required interactions between Delta clients and
++catalogs for commits. Catalogs may very well expose APIs and work with Delta clients to be
++informed of other non-commit [file types](#file-types), such as checkpoint, log
++compaction, and version checksum files. This would allow catalogs to return additional
++information to Delta clients during query and scan planning, potentially allowing Delta
++clients to avoid LISTing the filesystem altogether.
++
++## Commit Protocol
++
++To start, Delta Clients send the desired actions to be committed to the client-side component of the
++catalog.
++
++This component then has several options for proposing, ratifying, and publishing the commit,
++detailed below.
++
++- Option 1: Write the actions (likely client-side) to a [staged commit file](#staged-commit) in the
++  `_delta_log/_staged_commits` directory and then ratify the staged commit (likely server-side) by
++  atomically recording (in persistent storage of some kind) that the file corresponds to version `v`.
++- Option 2: Treat this as an [inline commit](#inline-commit) (i.e. likely that the client-side
++  component sends the contents to the server-side component) and atomically record (in persistent
++  storage of some kind) the content of the commit as version `v` of the table.
++- Option 3: Catalog implementations that use PUT-if-absent (client- or server-side) can ratify and
++  publish all-in-one by atomically writing a [published commit file](#published-commit)
++  in the `_delta_log` directory. Note that this commit will be considered to have succeeded as soon
++  as the file becomes visible in the filesystem, regardless of when or whether the catalog is made
++  aware of the successful publish. The catalog does not need to store these files.
++
++A catalog must not ratify version `v` until it has ratified version `v - 1`, and it must ratify
++version `v` at most once.
++
++The catalog must store both flavors of ratified commits (staged or inline) and make them available
++to readers until they are [published](#publishing-commits).
++
++For performance reasons, Delta clients are encouraged to establish an API contract where the catalog
++provides the latest ratified commit information whenever a commit fails due to version conflict.
++
++## Getting Ratified Commits from the Catalog
++
++Even after a commit is ratified, it is not discoverable through filesystem operations until it is
++[published](#publishing-commits).
++
++The catalog client is responsible for implementing an API (defined by the Delta client) that Delta clients can
++use to retrieve the latest ratified commit version (authoritative), as well as the set of ratified
++commits the catalog is still storing for the table. If some commits needed to complete the snapshot
++are not stored by the catalog, as they are already published, Delta clients can issue a filesystem
++LIST operation to retrieve them.
++
++Delta clients must establish an API contract where the catalog provides ratified commit information
++as part of the standard table resolution process performed at query planning time.
++
++## Publishing Commits
++
++Publishing is the process of copying the ratified commit with version `<v>` to
++`_delta_log/<v>.json`. The ratified commit may be a staged commit located in
++`_delta_log/_staged_commits/<v>.<uuid>.json`, or it may be an inline commit whose content the
++catalog stores itself. Because the content of a ratified commit is immutable, it does not matter
++whether the client-side, server-side, or both catalog components initiate publishing.
++
++Implementations are strongly encouraged to publish commits promptly. This reduces the number of
++commits the catalog needs to store internally (and serve up to readers).
++
++Commits must be published _in order_. That is, version `v - 1` must be published _before_ version
++`v`.
++
++**NOTE**: Because commit publishing can happen at any time after the commit succeeds, the file
++modification timestamp of the published file will not accurately reflect the original commit time.
++For this reason, catalog-managed tables must use [in-commit-timestamps](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#in-commit-timestamps)
++to ensure stability of time travel reads. Refer to the [Writer Requirements for Catalog-managed Tables](#writer-requirements-for-catalog-managed-tables)
++section for more details.
++
++## Maintenance Operations on Catalog-managed Tables
++
++[Checkpoints](#checkpoints-1) and [Log Compaction Files](#log-compaction-files) can only be created
++for versions that are already published in the `_delta_log`. In other words, in order to checkpoint
++version `v` or produce a log compaction file for commit range `x <= v <= y`, `_delta_log/<v>.json`
++must exist.
++
++Notably, the [Version Checksum File](#version-checksum-file) for version `v` _can_ be created in the
++`_delta_log` even if the commit for version `v` is not published.
++
++By default, maintenance operations are prohibited unless the managing catalog explicitly permits
++the client to run them. The only exceptions are checkpoints, log compaction, and version checksum,
++as they are essential for all basic table operations (e.g. reads and writes) to operate reliably.
++All other maintenance operations such as the following are not allowed by default.
++- [Log and other metadata files clean up](#metadata-cleanup).
++- Data files cleanup, for example VACUUM.
++- Data layout changes, for example OPTIMIZE and REORG.
++
++## Creating and Dropping Catalog-managed Tables
++
++The catalog and query engine ultimately dictate how to create and drop catalog-managed tables.
++
++As one example, table creation often works in three phases:
++
++1. An initial catalog operation to obtain a unique storage location which serves as an unnamed
++   "staging" table
++2. A table operation that physically initializes a new `catalogManaged`-enabled table at the staging
++   location.
++3. A final catalog operation that registers the new table with its intended name.
++
++Delta clients would primarily be involved with the second step, but an implementation could choose
++to combine the second and third steps so that a single catalog call registers the table as part of
++the table's first commit.
++
++As another example, dropping a table can be as simple as removing its name from the catalog (a "soft
++delete"), followed at some later point by a "hard delete" that physically purges the data. The Delta
++client would not be involved at all in this process, because no commits are made to the table.
++
++## Catalog-managed Table Enablement
++
++The `catalogManaged` table feature is supported and active when:
++- The table is on Reader Version 3 and Writer Version 7.
++- The table has a `protocol` action with `readerFeatures` and `writerFeatures` both containing the
++  feature `catalogManaged`.
++
++## Writer Requirements for Catalog-managed tables
++
++When supported and active:
++
++- Writers must discover and access the table using catalog calls, which happens _before_ the table's
++  protocol is known. See [Table Discovery](#table-discovery) for more details.
++- The [in-commit-timestamps](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#in-commit-timestamps)
++  table feature must be supported and active.
++- The `commitInfo` action must also contain a field `txnId` that stores a unique transaction
++  identifier string
++- Writers must follow the catalog's [commit protocol](#commit-protocol) and must not perform
++  ordinary filesystem-based commits against the table.
++- Writers must follow the catalog's [maintenance operation protocol](#maintenance-operations-on-catalog-managed-tables)
++
++## Reader Requirements for Catalog-managed tables
++
++When supported and active:
++
++- Readers must discover the table using catalog calls, which happens before the table's protocol
++  is known. See [Table Discovery](#table-discovery) for more details.
++- Readers must contact the catalog for information about unpublished ratified commits.
++- Readers must follow the rules described in the [Reading Catalog-managed Tables](#reading-catalog-managed-tables)
++  section above. Notably
++  - If the catalog said `v` is the latest version, clients must ignore any later versions that may
++    have been published
++  - When the catalog returns a ratified commit for version `v`, readers must use that
++    catalog-supplied commit and ignore any published Delta file for version `v` that might also be
++    present.
++
++## Table Discovery
++
++The requirements above state that readers and writers must discover and access the table using
++catalog calls, which occurs _before_ the table's protocol is known. This raises an important
++question: how can a client discover a `catalogManaged` Delta table without first knowing that it
++_is_, in fact, `catalogManaged` (according to the protocol)?
++
++To solve this, first note that, in practice, catalog-integrated engines already ask the catalog to
++resolve a table name to its storage location during the name resolution step. This protocol
++therefore encourages that the same name resolution step also indicate whether the table is
++catalog-managed. Surfacing this at the very moment the catalog returns the path imposes no extra
++round-trips, yet it lets the client decide — early and unambiguously — whether to follow the
++`catalogManaged` read and write rules.
++
++## Sample Catalog Client API
++
++The following is an example of a possible API which a Java-based Delta client might require catalog
++implementations to target:
++
++```scala
++
++interface CatalogManagedTable {
++    /**
++     * Commits the given set of `actions` to the given commit `version`.
++     *
++     * @param version The version we want to commit.
++     * @param actions Actions that need to be committed.
++     *
++     * @return CommitResponse which has details around the new committed delta file.
++     */
++    def commit(
++        version: Long,
++        actions: Iterator[String]): CommitResponse
++
++    /**
++     * Retrieves a (possibly empty) suffix of ratified commits in the range [startVersion,
++     * endVersion] for this table.
++     * 
++     * Some of these ratified commits may already have been published. Some of them may be staged,
++     * in which case the staged commit file path is returned; others may be inline, in which case
++     * the inline commit content is returned.
++     * 
++     * The returned commits are sorted in ascending version number and are contiguous.
++     *
++     * If neither start nor end version is specified, the catalog will return all available ratified
++     * commits (possibly empty, if all commits have been published).
++     *
++     * In all cases, the response also includes the table's latest ratified commit version.
++     *
++     * @return GetCommitsResponse which contains an ordered list of ratified commits
++     *         stored by the catalog, as well as table's latest commit version.
++     */
++    def getRatifiedCommits(
++        startVersion: Option[Long],
++        endVersion: Option[Long]): GetCommitsResponse
++}
++```
++
++Note that the above is only one example of a possible Catalog Client API. It is also _NOT_ a catalog
++API (no table discovery, ACL, create/drop, etc). The Delta protocol is agnostic to API details, and
++the API surface Delta clients define should only cover the specific catalog capabilities that Delta
++client needs to correctly read and write catalog-managed tables.
++
+ # Iceberg Compatibility V1
+ 
+ This table feature (`icebergCompatV1`) ensures that Delta tables can be converted to Apache Iceberg™ format, though this table feature does not implement or specify that conversion.
+  * Files that have been [added](#Add-File-and-Remove-File) and not yet removed
+  * Files that were recently [removed](#Add-File-and-Remove-File) and have not yet expired
+  * [Transaction identifiers](#Transaction-Identifiers)
+- * [Domain Metadata](#Domain-Metadata)
++ * [Domain Metadata](#Domain-Metadata) that have not been removed (i.e. excluding tombstones with `removed=true`)
+  * [Checkpoint Metadata](#checkpoint-metadata) - Requires [V2 checkpoints](#v2-spec)
+  * [Sidecar File](#sidecar-files) - Requires [V2 checkpoints](#v2-spec)
+ 
+ 1. Identify a threshold (in days) uptil which we want to preserve the deltaLog. Let's refer to
+ midnight UTC of that day as `cutOffTimestamp`. The newest commit not newer than the `cutOffTimestamp` is
+ the `cutoffCommit`, because a commit exactly at midnight is an acceptable cutoff. We want to retain everything including and after the `cutoffCommit`.
+-2. Identify the newest checkpoint that is not newer than the `cutOffCommit`. A checkpoint at the `cutOffCommit` is ideal, but an older one will do. Lets call it `cutOffCheckpoint`.
+-We need to preserve the `cutOffCheckpoint` (both the checkpoint file and the JSON commit file at that version) and all commits after it. The JSON commit file at the `cutOffCheckpoint` version must be preserved because checkpoints do not preserve [commit provenance information](#commit-provenance-information) (e.g., `commitInfo` actions), which may be required by table features such as [In-Commit Timestamps](#in-commit-timestamps). All commits after `cutOffCheckpoint` must be preserved to enable time travel for commits between `cutOffCheckpoint` and the next available checkpoint.
+-3. Delete all [delta log entries](#delta-log-entries) and [checkpoint files](#checkpoints) before the
+-`cutOffCheckpoint` checkpoint. Also delete all the [log compaction files](#log-compaction-files) having startVersion <= `cutOffCheckpoint`'s version.
++2. Identify the newest checkpoint that is not newer than the `cutOffCommit`. A checkpoint at the `cutOffCommit` is ideal, but an older one will do. Let's call it `cutOffCheckpoint`.
++We need to preserve the `cutOffCheckpoint` (both the checkpoint file and the JSON commit file at that version) and all published commits after it. The JSON commit file at the `cutOffCheckpoint` version must be preserved because checkpoints do not preserve [commit provenance information](#commit-provenance-information) (e.g., `commitInfo` actions), which may be required by table features such as [In-Commit Timestamps](#in-commit-timestamps). All published commits after `cutOffCheckpoint` must be preserved to enable time travel for commits between `cutOffCheckpoint` and the next available checkpoint.
++    - If no `cutOffCheckpoint` can be found, do not proceed with metadata cleanup as there is
++      nothing to cleanup.
++3. Delete all [delta log entries](#delta-log-entries), [checkpoint files](#checkpoints), and
++   [version checksum files](#version-checksum-file) before the `cutOffCheckpoint` checkpoint. Also delete all the [log compaction files](#log-compaction-files)
++   having startVersion <= `cutOffCheckpoint`'s version.
++    - Also delete all the [staged commit files](#staged-commit) having version <=
++      `cutOffCheckpoint`'s version from the `_delta_log/_staged_commits` directory.
+ 4. Now read all the available [checkpoints](#checkpoints-1) in the _delta_log directory and identify
+ the corresponding [sidecar files](#sidecar-files). These sidecar files need to be protected.
+ 5. List all the files in `_delta_log/_sidecars` directory, preserve files that are less than a day
+ [Timestamp without Timezone](#timestamp-without-timezone-timestampNtz) | `timestampNtz` | Readers and writers
+ [Domain Metadata](#domain-metadata) | `domainMetadata` | Writers only
+ [V2 Checkpoint](#v2-checkpoint-table-feature) | `v2Checkpoint` | Readers and writers
++[Catalog-managed Tables](#catalog-managed-tables) | `catalogManaged` | Readers and writers
+ [Iceberg Compatibility V1](#iceberg-compatibility-v1) | `icebergCompatV1` | Writers only
+ [Iceberg Compatibility V2](#iceberg-compatibility-v2) | `icebergCompatV2` | Writers only
+ [Clustered Table](#clustered-table) | `clustering` | Writers only
\ No newline at end of file
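
The reader requirements in the PROTOCOL.md diff above say that a reader must ignore any version later than the one the catalog reports as latest, and must prefer a catalog-supplied ratified commit over a published Delta file for the same version. A minimal sketch of that merge rule follows. It assumes commits can be represented as a map from version to a location string. The class `CatalogManagedCommitResolver` and its `resolve` method are hypothetical names for illustration only; they are not part of Delta Kernel or of this PR.

```java
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical helper for illustration; not an API from Delta Kernel or this PR.
final class CatalogManagedCommitResolver {
    private CatalogManagedCommitResolver() {}

    /**
     * Merges published Delta files with catalog-supplied ratified commits.
     *
     * @param publishedFiles        version -> published file found by listing _delta_log
     * @param ratifiedCommits       version -> commit returned by the catalog (staged path or inline)
     * @param latestRatifiedVersion latest version reported by the catalog
     * @return version -> commit source to read, in ascending version order
     */
    static SortedMap<Long, String> resolve(
            Map<Long, String> publishedFiles,
            Map<Long, String> ratifiedCommits,
            long latestRatifiedVersion) {
        SortedMap<Long, String> resolved = new TreeMap<>();
        // Start from published files, dropping anything newer than the catalog's latest version.
        for (Map.Entry<Long, String> e : publishedFiles.entrySet()) {
            if (e.getKey() <= latestRatifiedVersion) {
                resolved.put(e.getKey(), e.getValue());
            }
        }
        // Catalog-supplied commits overwrite published files at the same version.
        for (Map.Entry<Long, String> e : ratifiedCommits.entrySet()) {
            if (e.getKey() <= latestRatifiedVersion) {
                resolved.put(e.getKey(), e.getValue());
            }
        }
        return resolved;
    }
}
```

The sketch does not check that the resulting versions are contiguous, which a real reader would likely also need to verify before building a snapshot.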

README.md

@@ -0,0 +1,10 @@
+diff --git a/README.md b/README.md
+--- a/README.md
++++ b/README.md
+ ## Building
+ 
+ Delta Lake is compiled using [SBT](https://www.scala-sbt.org/1.x/docs/Command-Line-Reference.html).
++Ensure that your Java version is at least 17 (you can verify with `java -version`).
+ 
+ To compile, run
+ 
\ No newline at end of file

build.sbt

@@ -0,0 +1,218 @@
+diff --git a/build.sbt b/build.sbt
+--- a/build.sbt
++++ b/build.sbt
+       allMappings.distinct
+     },
+ 
+-    // Exclude internal modules from published POM
++    // Exclude internal modules from published POM and add kernel dependencies.
++    // Kernel modules are transitive through sparkV2 (an internal module), so they
++    // are lost when sparkV2 is filtered out. We re-add them explicitly here.
+     pomPostProcess := { node =>
+       val internalModules = internalModuleNames.value
++      val ver = version.value
+       import scala.xml._
+       import scala.xml.transform._
++
++      def kernelDependencyNode(artifactId: String): Elem = {
++        <dependency>
++          <groupId>io.delta</groupId>
++          <artifactId>{artifactId}</artifactId>
++          <version>{ver}</version>
++        </dependency>
++      }
++
++      val kernelDeps = Seq(
++        kernelDependencyNode("delta-kernel-api"),
++        kernelDependencyNode("delta-kernel-defaults"),
++        kernelDependencyNode("delta-kernel-unitycatalog")
++      )
++
+       new RuleTransformer(new RewriteRule {
+         override def transform(n: Node): Seq[Node] = n match {
+-          case e: Elem if e.label == "dependency" =>
+-            val artifactId = (e \ "artifactId").text
+-            // Check if artifactId starts with any internal module name
+-            // (e.g., "delta-spark-v1_4.1_2.13" starts with "delta-spark-v1")
+-            val isInternal = internalModules.exists(module => artifactId.startsWith(module))
+-            if (isInternal) Seq.empty else Seq(n)
++          case e: Elem if e.label == "dependencies" =>
++            val filtered = e.child.filter {
++              case child: Elem if child.label == "dependency" =>
++                val artifactId = (child \ "artifactId").text
++                !internalModules.exists(module => artifactId.startsWith(module))
++              case _ => true
++            }
++            Seq(e.copy(child = filtered ++ kernelDeps))
+           case _ => Seq(n)
+         }
+       }).transform(node).head
+     commonSettings,
+     scalaStyleSettings,
+     releaseSettings,
+-    CrossSparkVersions.sparkDependentModuleName(sparkVersion),
++    // Set sparkVersion directly (not sparkDependentModuleName) so that
++    // runOnlyForReleasableSparkModules discovers this module, but without adding a Spark
++    // suffix to the artifact name. delta-contribs is only published as delta-contribs_2.13.
++    sparkVersion := CrossSparkVersions.getSparkVersion(),
+     Compile / packageBin / mappings := (Compile / packageBin / mappings).value ++
+       listPythonFiles(baseDirectory.value.getParentFile / "python"),
+ 
+   ).configureUnidoc()
+ 
+ 
+-val unityCatalogVersion = "0.3.1"
++val unityCatalogVersion = "0.4.0"
+ val sparkUnityCatalogJacksonVersion = "2.15.4" // We are using Spark 4.0's Jackson version 2.15.x, to override Unity Catalog 0.3.0's version 2.18.x
+ 
+ lazy val sparkUnityCatalog = (project in file("spark/unitycatalog"))
+     libraryDependencies ++= Seq(
+       "org.apache.spark" %% "spark-sql" % sparkVersion.value % "provided",
+ 
+-      "io.delta" %% "delta-sharing-client" % "1.3.9",
++      "io.delta" %% "delta-sharing-client" % "1.3.10",
+ 
+       // Test deps
+       "org.scalatest" %% "scalatest" % scalaTestVersion % "test",
+ 
+       // Test Deps
+       "org.scalatest" %% "scalatest" % scalaTestVersion % "test",
++      // Jackson datatype module needed for UC SDK tests (excluded from main compile scope)
++      "com.fasterxml.jackson.datatype" % "jackson-datatype-jsr310" % "2.15.4" % "test",
+     ),
+ 
+     // Unidoc settings
+     commonSettings,
+     scalaStyleSettings,
+     releaseSettings,
+-    CrossSparkVersions.sparkDependentModuleName(sparkVersion),
++    // Set sparkVersion directly (not sparkDependentModuleName) so that
++    // runOnlyForReleasableSparkModules discovers this module, but without adding a Spark
++    // suffix to the artifact name. delta-iceberg is only published as delta-iceberg_2.13.
++    sparkVersion := CrossSparkVersions.getSparkVersion(),
+     libraryDependencies ++= {
+       if (supportIceberg) {
+         Seq(
+           "org.xerial" % "sqlite-jdbc" % "3.45.0.0" % "test",
+           "org.apache.httpcomponents.core5" % "httpcore5" % "5.2.4" % "test",
+           "org.apache.httpcomponents.client5" % "httpclient5" % "5.3.1" % "test",
+-          "org.apache.iceberg" %% icebergSparkRuntimeArtifactName % "1.10.0" % "provided"
++          "org.apache.iceberg" %% icebergSparkRuntimeArtifactName % "1.10.0" % "provided",
++          // For FixedGcsAccessTokenProvider (GCS server-side planning credentials)
++          "com.google.cloud.bigdataoss" % "util-hadoop" % "hadoop3-2.2.26" % "provided"
+         )
+       } else {
+         Seq.empty
+   )
+ // scalastyle:on println
+ 
+-val icebergShadedVersion = "1.10.0"
++val icebergShadedVersion = "1.10.1"
+ lazy val icebergShaded = (project in file("icebergShaded"))
+   .dependsOn(spark % "provided")
+   .disablePlugins(JavaFormatterPlugin, ScalafmtPlugin)
+     commonSettings,
+     scalaStyleSettings,
+     releaseSettings,
+-    CrossSparkVersions.sparkDependentSettings(sparkVersion),
+-    libraryDependencies ++= Seq(
+-      "org.apache.hudi" % "hudi-java-client" % "0.15.0" % "compile" excludeAll(
+-        ExclusionRule(organization = "org.apache.hadoop"),
+-        ExclusionRule(organization = "org.apache.zookeeper"),
+-      ),
+-      "org.apache.spark" %% "spark-avro" % sparkVersion.value % "test" excludeAll ExclusionRule(organization = "org.apache.hadoop"),
+-      "org.apache.parquet" % "parquet-avro" % "1.12.3" % "compile"
+-    ),
++    // Set sparkVersion directly (not sparkDependentModuleName) so that
++    // runOnlyForReleasableSparkModules discovers this module, but without adding a Spark
++    // suffix to the artifact name. delta-hudi is only published as delta-hudi_2.13.
++    sparkVersion := CrossSparkVersions.getSparkVersion(),
++    libraryDependencies ++= {
++      if (supportHudi) {
++        Seq(
++          "org.apache.hudi" % "hudi-java-client" % "0.15.0" % "compile" excludeAll(
++            ExclusionRule(organization = "org.apache.hadoop"),
++            ExclusionRule(organization = "org.apache.zookeeper"),
++          ),
++          "org.apache.spark" %% "spark-avro" % sparkVersion.value % "test" excludeAll ExclusionRule(organization = "org.apache.hadoop"),
++          "org.apache.parquet" % "parquet-avro" % "1.12.3" % "compile"
++        )
++      } else {
++        Seq.empty
++      }
++    },
++    // Skip compilation and publishing when supportHudi is false
++    Compile / skip := !supportHudi,
++    Test / skip := !supportHudi,
++    publish / skip := !supportHudi,
++    publishLocal / skip := !supportHudi,
++    publishM2 / skip := !supportHudi,
+     assembly / assemblyJarName := s"${name.value}-assembly_${scalaBinaryVersion.value}-${version.value}.jar",
+     assembly / logLevel := Level.Info,
+     assembly / test := {},
+       // crossScalaVersions must be set to Nil on the aggregating project
+       crossScalaVersions := Nil,
+       publishArtifact := false,
+-      publish / skip := false,
++      publish / skip := true,
+     )
+ }
+ 
+       // crossScalaVersions must be set to Nil on the aggregating project
+       crossScalaVersions := Nil,
+       publishArtifact := false,
+-      publish / skip := false,
++      publish / skip := true,
+     )
+ }
+ 
+     // crossScalaVersions must be set to Nil on the aggregating project
+     crossScalaVersions := Nil,
+     publishArtifact := false,
+-    publish / skip := false,
++    publish / skip := true,
+     unidocSourceFilePatterns := {
+       (kernelApi / unidocSourceFilePatterns).value.scopeToProject(kernelApi) ++
+       (kernelDefaults / unidocSourceFilePatterns).value.scopeToProject(kernelDefaults)
+     // crossScalaVersions must be set to Nil on the aggregating project
+     crossScalaVersions := Nil,
+     publishArtifact := false,
+-    publish / skip := false,
++    publish / skip := true,
+   )
+ 
+ /*
+     sys.env.getOrElse("SONATYPE_USERNAME", ""),
+     sys.env.getOrElse("SONATYPE_PASSWORD", "")
+   ),
++  credentials += Credentials(
++    "Sonatype Nexus Repository Manager",
++    "central.sonatype.com",
++    sys.env.getOrElse("SONATYPE_USERNAME", ""),
++    sys.env.getOrElse("SONATYPE_PASSWORD", "")
++  ),
+   publishTo := {
+     val ossrhBase = "https://ossrh-staging-api.central.sonatype.com/"
++    val centralSnapshots = "https://central.sonatype.com/repository/maven-snapshots/"
+     if (isSnapshot.value) {
+-      Some("snapshots" at ossrhBase + "content/repositories/snapshots")
++      Some("snapshots" at centralSnapshots)
+     } else {
+       Some("releases"  at ossrhBase + "service/local/staging/deploy/maven2")
+     }
+ // Looks like some of release settings should be set for the root project as well.
+ publishArtifact := false  // Don't release the root project
+ publish / skip := true
+-publishTo := Some("snapshots" at "https://ossrh-staging-api.central.sonatype.com/content/repositories/snapshots")
++publishTo := Some("snapshots" at "https://central.sonatype.com/repository/maven-snapshots/")
+ releaseCrossBuild := false  // Don't use sbt-release's cross facility
+ releaseProcess := Seq[ReleaseStep](
+   checkSnapshotDependencies,
+   setReleaseVersion,
+   commitReleaseVersion,
+   tagRelease
+-) ++ CrossSparkVersions.crossSparkReleaseSteps("+publishSigned") ++ Seq[ReleaseStep](
++) ++ CrossSparkVersions.crossSparkReleaseSteps("publishSigned") ++ Seq[ReleaseStep](
+ 
+   // Do NOT use `sonatypeBundleRelease` - it will actually release to Maven! We want to do that
+   // manually.
\ No newline at end of file

connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/.00000000000000000000.json.crc

@@ -0,0 +1,3 @@
+diff --git a/connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/.00000000000000000000.json.crc b/connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/.00000000000000000000.json.crc
+new file mode 100644
+Binary files /dev/null and b/connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/.00000000000000000000.json.crc differ
\ No newline at end of file

connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/00000000000000000000.crc

@@ -0,0 +1,5 @@
+diff --git a/connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/00000000000000000000.crc b/connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/00000000000000000000.crc
+new file mode 100644
+--- /dev/null
++++ b/connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/00000000000000000000.crc
+{"txnId":"6132e880-0f3a-4db4-b882-1da039bffbad","tableSizeBytes":0,"numFiles":0,"numMetadata":1,"numProtocol":1,"setTransactions":[],"domainMetadata":[],"metadata":{"id":"0eb3e007-b3cc-40e4-bca1-a5970d86b5a6","format":{"provider":"parquet","options":{}},"schemaString":"{\"type\":\"struct\",\"fields\":[{\"name\":\"id\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}},{\"name\":\"utf8_binary_col\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"utf8_lcase_col\",\"type\":\"string\",\"nullable\":true,\"metadata\":{\"__COLLATIONS\":{\"utf8_lcase_col\":\"spark.UTF8_LCASE\"}}},{\"name\":\"unicode_col\",\"type\":\"string\",\"nullable\":true,\"metadata\":{\"__COLLATIONS\":{\"unicode_col\":\"icu.UNICODE\"}}}]}","partitionColumns":[],"configuration":{},"createdTime":1773779518731},"protocol":{"minReaderVersion":1,"minWriterVersion":7,"writerFeatures":["domainMetadata","collations-preview","appendOnly","invariants"]},"histogramOpt":{"sortedBinBoundaries":[0,8192,16384,32768,65536,131072,262144,524288,1048576,2097152,4194304,8388608,12582912,16777216,20971520,25165824,29360128,33554432,37748736,41943040,50331648,58720256,67108864,75497472,83886080,92274688,100663296,109051904,117440512,125829120,130023424,134217728,138412032,142606336,146800640,150994944,167772160,184549376,201326592,218103808,234881024,251658240,268435456,285212672,301989888,318767104,335544320,352321536,369098752,385875968,402653184,419430400,436207616,452984832,469762048,486539264,503316480,520093696,536870912,553648128,570425344,587202560,603979776,671088640,738197504,805306368,872415232,939524096,1006632960,1073741824,1140850688,1207959552,1275068416,1342177280,1409286144,1476395008,1610612736,1744830464,1879048192,2013265920,2147483648,2415919104,2684354560,2952790016,3221225472,3489660928,3758096384,4026531840,4294967296,8589934592,17179869184,34359738368,68719476736,137438953472,274877906944],"fileCounts":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],"totalBytes":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]},"allFiles":[]}
\ No newline at end of file

... (truncated, output exceeded 60000 bytes)

Reproduce locally: `git range-diff e8cffee..3398dde d1139d2..41bc0a2` | Disable: `git config gitstack.push-range-diff false`

zikangh · 2026-04-02T21:52:50Z

Range-diff: master (41bc0a2 -> 2123524)

.github/CODEOWNERS

@@ -0,0 +1,12 @@
+diff --git a/.github/CODEOWNERS b/.github/CODEOWNERS
+--- a/.github/CODEOWNERS
++++ b/.github/CODEOWNERS
+ /project/                       @tdas
+ /version.sbt                    @tdas
+ 
++# Spark V2 and Unified modules
++/spark/v2/                      @tdas @huan233usc @TimothyW553 @raveeram-db @murali-db
++/spark-unified/                 @tdas @huan233usc @TimothyW553 @raveeram-db @murali-db
++
+ # All files in the root directory
+ /*                              @tdas
\ No newline at end of file

.github/workflows/iceberg_test.yaml

@@ -0,0 +1,16 @@
+diff --git a/.github/workflows/iceberg_test.yaml b/.github/workflows/iceberg_test.yaml
+--- a/.github/workflows/iceberg_test.yaml
++++ b/.github/workflows/iceberg_test.yaml
+           # the above directories when we use the key for the first time. After that, each run will
+           # just use the cache. The cache is immutable so we need to use a new key when trying to
+           # cache new stuff.
+-          key: delta-sbt-cache-spark3.2-scala${{ matrix.scala }}
++          key: delta-sbt-cache-spark4.0-scala${{ matrix.scala }}
+       - name: Install Job dependencies
+         run: |
+           sudo apt-get update
+       - name: Run Scala/Java and Python tests
+         # when changing TEST_PARALLELISM_COUNT make sure to also change it in spark_master_test.yaml
+         run: |
+-          TEST_PARALLELISM_COUNT=4 pipenv run python run-tests.py --group iceberg
++          TEST_PARALLELISM_COUNT=4 pipenv run python run-tests.py --group iceberg --spark-version 4.0
\ No newline at end of file

.github/workflows/spark_examples_test.yaml

@@ -0,0 +1,54 @@
+diff --git a/.github/workflows/spark_examples_test.yaml b/.github/workflows/spark_examples_test.yaml
+--- a/.github/workflows/spark_examples_test.yaml
++++ b/.github/workflows/spark_examples_test.yaml
+         # Spark versions are dynamically generated - released versions only
+         spark_version: ${{ fromJson(needs.generate-matrix.outputs.spark_versions) }}
+         # These Scala versions must match those in the build.sbt
+-        scala: [2.13.16]
++        scala: [2.13.17]
+     env:
+       SCALA_VERSION: ${{ matrix.scala }}
+-      SPARK_VERSION: ${{ matrix.spark_version }}
+     steps:
+       - uses: actions/checkout@v3
+       - name: Get Spark version details
+         id: spark-details
+         run: |
+-          # Get JVM version, package suffix, iceberg support for this Spark version
++          # Get JVM version, package suffix, iceberg support, and full version for this Spark version
+           JVM_VERSION=$(python3 project/scripts/get_spark_version_info.py --get-field "${{ matrix.spark_version }}" targetJvm | jq -r)
+           SPARK_PACKAGE_SUFFIX=$(python3 project/scripts/get_spark_version_info.py --get-field "${{ matrix.spark_version }}" packageSuffix | jq -r)
+           SUPPORT_ICEBERG=$(python3 project/scripts/get_spark_version_info.py --get-field "${{ matrix.spark_version }}" supportIceberg | jq -r)
++          SPARK_FULL_VERSION=$(python3 project/scripts/get_spark_version_info.py --get-field "${{ matrix.spark_version }}" fullVersion | jq -r)
+           echo "jvm_version=$JVM_VERSION" >> $GITHUB_OUTPUT
+           echo "spark_package_suffix=$SPARK_PACKAGE_SUFFIX" >> $GITHUB_OUTPUT
+           echo "support_iceberg=$SUPPORT_ICEBERG" >> $GITHUB_OUTPUT
+-          echo "Using JVM $JVM_VERSION for Spark ${{ matrix.spark_version }}, package suffix: '$SPARK_PACKAGE_SUFFIX', support iceberg: '$SUPPORT_ICEBERG'"
++          echo "spark_full_version=$SPARK_FULL_VERSION" >> $GITHUB_OUTPUT
++          echo "Using JVM $JVM_VERSION for Spark $SPARK_FULL_VERSION, package suffix: '$SPARK_PACKAGE_SUFFIX', support iceberg: '$SUPPORT_ICEBERG'"
+       - name: install java
+         uses: actions/setup-java@v3
+         with:
+       - name: Run Delta Spark Local Publishing and Examples Compilation
+         # examples/scala/build.sbt will compile against the local Delta release version (e.g. 3.2.0-SNAPSHOT).
+         # Thus, we need to publishM2 first so those jars are locally accessible.
+-        # The SPARK_PACKAGE_SUFFIX env var tells examples/scala/build.sbt which artifact naming to use.
++        # -DsparkVersion is for the Delta project's publishM2 (which Spark version to compile Delta against).
++        # SPARK_VERSION/SPARK_PACKAGE_SUFFIX/SUPPORT_ICEBERG are for examples/scala/build.sbt (dependency resolution).
+         env:
+           SPARK_PACKAGE_SUFFIX: ${{ steps.spark-details.outputs.spark_package_suffix }}
+           SUPPORT_ICEBERG: ${{ steps.spark-details.outputs.support_iceberg }}
++          SPARK_VERSION: ${{ steps.spark-details.outputs.spark_full_version }}
+         run: |
+           build/sbt clean
+-          build/sbt -DsparkVersion=${{ matrix.spark_version }} publishM2
++          build/sbt -DsparkVersion=${{ steps.spark-details.outputs.spark_full_version }} publishM2
+           cd examples/scala && build/sbt "++ $SCALA_VERSION compile"
++      - name: Run UC Delta Integration Test
++        # Verifies that delta-spark resolved from Maven local includes all kernel module
++        # dependencies transitively by running a real UC-backed Delta workload.
++        env:
++          SPARK_PACKAGE_SUFFIX: ${{ steps.spark-details.outputs.spark_package_suffix }}
++          SPARK_VERSION: ${{ steps.spark-details.outputs.spark_full_version }}
++        run: |
++          cd examples/scala && build/sbt "++ $SCALA_VERSION runMain example.UnityCatalogQuickstart"
\ No newline at end of file

.github/workflows/spark_test.yaml

@@ -0,0 +1,27 @@
+diff --git a/.github/workflows/spark_test.yaml b/.github/workflows/spark_test.yaml
+--- a/.github/workflows/spark_test.yaml
++++ b/.github/workflows/spark_test.yaml
+         # These Scala versions must match those in the build.sbt
+         scala: [2.13.16]
+         # Important: This list of shards must be [0..NUM_SHARDS - 1]
+-        shard: [0, 1, 2, 3]
++        shard: [0, 1, 2, 3, 4, 5, 6, 7]
+     env:
+       SCALA_VERSION: ${{ matrix.scala }}
+       SPARK_VERSION: ${{ matrix.spark_version }}
+       # Important: This must be the same as the length of shards in matrix
+-      NUM_SHARDS: 4
++      NUM_SHARDS: 8
+     steps:
+       - uses: actions/checkout@v3
+       - name: Get Spark version details
+         # when changing TEST_PARALLELISM_COUNT make sure to also change it in spark_python_test.yaml
+         run: |
+           TEST_PARALLELISM_COUNT=4 pipenv run python run-tests.py --group spark --shard ${{ matrix.shard }} --spark-version ${{ matrix.spark_version }}
++      - name: Upload test reports
++        if: always()
++        uses: actions/upload-artifact@v4
++        with:
++          name: test-reports-spark${{ matrix.spark_version }}-shard${{ matrix.shard }}
++          path: "**/target/test-reports/*.xml"
++          retention-days: 7
\ No newline at end of file

PROTOCOL.md

@@ -0,0 +1,537 @@
+diff --git a/PROTOCOL.md b/PROTOCOL.md
+--- a/PROTOCOL.md
++++ b/PROTOCOL.md
+   - [Writer Requirements for Variant Type](#writer-requirements-for-variant-type)
+   - [Reader Requirements for Variant Data Type](#reader-requirements-for-variant-data-type)
+   - [Compatibility with other Delta Features](#compatibility-with-other-delta-features)
++- [Catalog-managed tables](#catalog-managed-tables)
++  - [Terminology: Commits](#terminology-commits)
++  - [Terminology: Delta Client](#terminology-delta-client)
++  - [Terminology: Catalogs](#terminology-catalogs)
++  - [Catalog Responsibilities](#catalog-responsibilities)
++  - [Reading Catalog-managed Tables](#reading-catalog-managed-tables)
++  - [Commit Protocol](#commit-protocol)
++  - [Getting Ratified Commits from the Catalog](#getting-ratified-commits-from-the-catalog)
++  - [Publishing Commits](#publishing-commits)
++  - [Maintenance Operations on Catalog-managed Tables](#maintenance-operations-on-catalog-managed-tables)
++  - [Creating and Dropping Catalog-managed Tables](#creating-and-dropping-catalog-managed-tables)
++  - [Catalog-managed Table Enablement](#catalog-managed-table-enablement)
++  - [Writer Requirements for Catalog-managed tables](#writer-requirements-for-catalog-managed-tables)
++  - [Reader Requirements for Catalog-managed tables](#reader-requirements-for-catalog-managed-tables)
++  - [Table Discovery](#table-discovery)
++  - [Sample Catalog Client API](#sample-catalog-client-api)
+ - [Requirements for Writers](#requirements-for-writers)
+   - [Creation of New Log Entries](#creation-of-new-log-entries)
+   - [Consistency Between Table Metadata and Data Files](#consistency-between-table-metadata-and-data-files)
+ __(1)__ `preimage` is the value before the update, `postimage` is the value after the update.
+ 
+ ### Delta Log Entries
+-Delta files are stored as JSON in a directory at the root of the table named `_delta_log`, and together with checkpoints make up the log of all changes that have occurred to a table.
+ 
+-Delta files are the unit of atomicity for a table, and are named using the next available version number, zero-padded to 20 digits.
++Delta Log Entries, also known as Delta files, are JSON files stored in the `_delta_log`
++directory at the root of the table. Together with checkpoints, they make up the log of all changes
++that have occurred to a table. Delta files are the unit of atomicity for a table, and are named
++using the next available version number, zero-padded to 20 digits.
+ 
+ For example:
+ 
+ ```
+ ./_delta_log/00000000000000000000.json
+ ```
+-Delta files use new-line delimited JSON format, where every action is stored as a single line JSON document.
+-A delta file, `n.json`, contains an atomic set of [_actions_](#Actions) that should be applied to the previous table state, `n-1.json`, in order to the construct `n`th snapshot of the table.
+-An action changes one aspect of the table's state, for example, adding or removing a file.
++
++Delta files use newline-delimited JSON format, where every action is stored as a single-line
++JSON document. A Delta file, corresponding to version `v`, contains an atomic set of
++[_actions_](#actions) that should be applied to the previous table state corresponding to version
++`v-1`, in order to construct the `v`th snapshot of the table. An action changes one aspect of the
++table's state, for example, adding or removing a file.
++
++**Note:** If the [catalogManaged table feature](#catalog-managed-tables) is enabled on the table,
++recently [ratified commits](#ratified-commit) may not yet be published to the `_delta_log` directory as normal Delta
++files - they may be stored directly by the catalog or reside in the `_delta_log/_staged_commits`
++directory. Delta clients must contact the table's managing catalog in order to find the information
++about these [ratified, potentially-unpublished commits](#publishing-commits).
++
++The `_delta_log/_staged_commits` directory is the staging area for [staged](#staged-commit)
++commits. Delta files in this directory have a UUID embedded into them and follow the pattern
++`<version>.<uuid>.json`, where the version corresponds to the proposed commit version, zero-padded
++to 20 digits.
++
++For example:
++
++```
++./_delta_log/_staged_commits/00000000000000000000.3a0d65cd-4056-49b8-937b-95f9e3ee90e5.json
++./_delta_log/_staged_commits/00000000000000000001.7d17ac10-5cc3-401b-bd1a-9c82dd2ea032.json
++./_delta_log/_staged_commits/00000000000000000001.016ae953-37a9-438e-8683-9a9a4a79a395.json
++./_delta_log/_staged_commits/00000000000000000002.3ae45b72-24e1-865a-a211-34987ae02f2a.json
++```
++
++NOTE: The (proposed) version number of a staged commit is authoritative - file
++`00000000000000000100.<uuid>.json` always corresponds to a commit attempt for version 100. Besides
++simplifying implementations, it also acknowledges the fact that commit files cannot safely be reused
++for multiple commit attempts. For example, resolving conflicts in a table with [row
++tracking](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#row-tracking) enabled requires
++rewriting all file actions to update their `baseRowId` field.
++
++The [catalog](#terminology-catalogs) is the source of truth about which staged commit files in
++the `_delta_log/_staged_commits` directory correspond to ratified versions, and Delta clients should
++not attempt to directly interpret the contents of that directory. Refer to
++[catalog-managed tables](#catalog-managed-tables) for more details.
+ 
+ ### Checkpoints
+ Checkpoints are also stored in the `_delta_log` directory, and can be created at any time, for any committed version of the table.
+ ### Commit Provenance Information
+ A delta file can optionally contain additional provenance information about what higher-level operation was being performed as well as who executed it.
+ 
++When the `catalogManaged` table feature is enabled, the `commitInfo` action must have a field
++`txnId` that stores a unique transaction identifier string.
++
+ Implementations are free to store any valid JSON-formatted data via the `commitInfo` action.
+ 
+ When [In-Commit Timestamps](#in-commit-timestamps) are enabled, writers are required to include a `commitInfo` action with every commit, which must include the `inCommitTimestamp` field. Also, the `commitInfo` action must be the first action in the commit.
+  - A single `protocol` action
+  - A single `metaData` action
+  - A collection of `txn` actions with unique `appId`s
+- - A collection of `domainMetadata` actions with unique `domain`s.
++ - A collection of `domainMetadata` actions with unique `domain`s, excluding tombstones (i.e. actions with `removed=true`).
+  - A collection of `add` actions with unique path keys, corresponding to the newest (path, deletionVector.uniqueId) pair encountered for each path.
+  - A collection of `remove` actions with unique `(path, deletionVector.uniqueId)` keys. The intersection of the primary keys in the `add` collection and `remove` collection must be empty. That means a logical file cannot exist in both the `remove` and `add` collections at the same time; however, the same *data file* can exist with *different* DVs in the `remove` collection, as logically they represent different content. The `remove` actions act as _tombstones_, and only exist for the benefit of the VACUUM command. Snapshot reads only return `add` actions on the read path.
+  
+      - write a `metaData` action to add the `delta.columnMapping.mode` table property.
+  - Write data files by using the _physical name_ that is chosen for each column. The physical name of the column is static and can be different than the _display name_ of the column, which is changeable.
+  - Write the 32 bit integer column identifier as part of the `field_id` field of the `SchemaElement` struct in the [Parquet Thrift specification](https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift).
+- - Track partition values and column level statistics with the physical name of the column in the transaction log.
++ - Track partition values, column level statistics, and [clustering column](#clustered-table) names with the physical name of the column in the transaction log.
+  - Assign a globally unique identifier as the physical name for each new column that is added to the schema. This is especially important for supporting cheap column deletions in `name` mode. In addition, column identifiers need to be assigned to each column. The maximum id that is assigned to a column is tracked as the table property `delta.columnMapping.maxColumnId`. This is an internal table property that cannot be configured by users. This value must increase monotonically as new columns are introduced and committed to the table alongside the introduction of the new columns to the schema.
+ 
+ ## Reader Requirements for Column Mapping
+ ## Writer Requirement for Deletion Vectors
+ When adding a logical file with a deletion vector, then that logical file must have correct `numRecords` information for the data file in the `stats` field.
+ 
++# Catalog-managed tables
++
++With this feature enabled, the [catalog](#terminology-catalogs) that manages the table becomes the
++source of truth for whether a given commit attempt succeeded.
++
++The table feature defines the parts of the [commit protocol](#commit-protocol) that directly impact
++the Delta table (e.g. atomicity requirements, publishing, etc). The Delta client and catalog
++together are responsible for implementing the Delta-specific aspects of commit as defined by this
++spec, but are otherwise free to define their own APIs and protocols for communication with each
++other.
++
++**NOTE**: Filesystem-based access to catalog-managed tables is not supported. Delta clients are
++expected to discover and access catalog-managed tables through the managing catalog, not by direct
++listing in the filesystem. This feature is primarily designed to warn filesystem-based readers that
++might attempt to access a catalog-managed table's storage location without going through the catalog
++first, and to block filesystem-based writers who could otherwise corrupt both the table and the
++catalog by failing to commit through the catalog.
++
++Before we can go into details of this protocol feature, we must first align our terminology.
++
++## Terminology: Commits
++
++A commit is a set of [actions](#actions) that transform a Delta table from version `v - 1` to `v`.
++It contains the same kind of content as is stored in a [Delta file](#delta-log-entries).
++
++A commit may be stored in the file system as a Delta file - either _published_ or _staged_ - or
++stored _inline_ in the managing catalog, using whatever format the catalog prefers.
++
++There are several types of commits:
++
++1. **Proposed commit**:  A commit that a Delta client has proposed for the next version of the
++   table. It could be _staged_ or _inline_. It will either become _ratified_ or be rejected.
++
++2. <a name="staged-commit">**Staged commit**</a>: A commit that is written to disk at
++   `_delta_log/_staged_commits/<v>.<uuid>.json`. It has the same content and format as a published
++   Delta file.
++    - Here, the `uuid` is a random UUID that is generated for each commit and `v` is the version
++      which is proposed to be committed, zero-padded to 20 digits.
++    - The mere existence of a staged commit does not mean that the file has been ratified or even
++      proposed. It might correspond to a failed or in-progress commit attempt.
++    - The catalog is the source of truth around which staged commits are ratified.
++    - The catalog stores only the location, not the content, of a staged (and ratified) commit.
++
++3. <a name="inline-commit">**Inline commit**</a>: A proposed commit that is not written to disk but
++   rather has its content sent to the catalog for the catalog to store directly.
++
++4. <a name="ratified-commit">**Ratified commit**</a>: A proposed commit that a catalog has
++   determined has won the commit at the desired version of the table.
++    - The catalog must store ratified commits (that is, the staged commit's location or the inline
++      commit's content) until they are published to the `_delta_log` directory.
++    - A ratified commit may or may not yet be published.
++    - A ratified commit may or may not even be stored by the catalog at all - the catalog may
++      have just atomically published it to the filesystem directly, relying on PUT-if-absent
++      primitives to facilitate the ratification and publication all in one step.
++
++5. <a name="published-commit">**Published commit**</a>: A ratified commit that has been copied into
++   the `_delta_log` as a normal Delta file, i.e. `_delta_log/<v>.json`.
++    - Here, the `v` is the version which is being committed, zero-padded to 20 digits.
++    - The existence of a `<v>.json` file proves that the corresponding version `v` is ratified,
++      regardless of whether the table is catalog-managed or filesystem-based. The catalog is allowed
++      to return information about published commits, but Delta clients can also use filesystem
++      listing operations to directly discover them.
++    - Published commits do not need to be stored by the catalog.
++
++## Terminology: Delta Client
++
++This is the component that implements support for reading and writing Delta tables, and implements
++the logic required by the `catalogManaged` table feature. Among other things, it
++- triggers the filesystem listing, if needed, to discover published commits
++- generates the commit content (the set of [actions](#actions))
++- works together with the query engine to trigger the commit process and invoke the client-side
++  catalog component with the commit content
++
++The Delta client is also responsible for defining the client-side API that catalogs should target.
++That is, there must be _some_ API that the [catalog client](#catalog-client) can use to communicate
++to the Delta client the subset of catalog-managed information that the Delta client cares about.
++This protocol feature is concerned with what information Delta cares about, but leaves to Delta
++clients the design of the API they use to obtain that information from catalog clients.
++
++## Terminology: Catalogs
++
++1. **Catalog**: A catalog is an entity which manages a Delta table, including its creation, writes,
++   reads, and eventual deletion.
++    - It could be backed by a database, a filesystem, or any other persistence mechanism.
++    - Each catalog has its own spec around how catalog clients should interact with them, and how
++      they perform a commit.
++
++2. <a name="catalog-client">**Catalog Client**</a>: The catalog always has a client-side component
++   which the Delta client interacts with directly. This client-side component has two primary
++   responsibilities:
++    - implement any client-side catalog-specific logic (such as staging or
++      [publishing](#publishing-commits) commits)
++    - communicate with the Catalog Server, if any
++
++3. **Catalog Server**: The catalog may also involve a server-side component with which the
++   client-side component is responsible for communicating.
++    - This server is responsible for coordinating commits and potentially persisting table metadata
++      and enforcing authorization policies.
++    - Not all catalogs require a server; some may be entirely client-side, e.g. filesystem-backed
++      catalogs, or they may make use of a generic database server and implement all of the catalog's
++      business logic client-side.
++
++**NOTE**: This specification outlines the responsibilities and actions that catalogs must implement.
++This spec does its best not to assume any specific catalog _implementation_, though it does call out
++likely client-side and server-side responsibilities. Nonetheless, what a given catalog does
++client-side or server-side is up to each catalog implementation to decide for itself.
++
++## Catalog Responsibilities
++
++When the `catalogManaged` table feature is enabled, a catalog performs commits to the table on behalf
++of the Delta client.
++
++As stated above, the Delta spec does not mandate any particular client-server design or API for
++catalogs that manage Delta tables. However, the catalog does need to provide certain capabilities
++for reading and writing Delta tables:
++
++- Atomically commit a version `v` with a given set of `actions`. This is explained in detail in the
++  [commit protocol](#commit-protocol) section.
++- Retrieve information about recent ratified commits and the latest ratified version on the table.
++  This is explained in detail in the [Getting Ratified Commits from the Catalog](#getting-ratified-commits-from-the-catalog) section.
++- Though not required, it is encouraged that catalogs also return the latest table-level metadata,
++  such as the latest Protocol and Metadata actions, for the table. This can provide significant
++  performance advantages to conforming Delta clients, who may forgo log replay and instead trust
++  the information provided by the catalog during query planning.
++
++## Reading Catalog-managed Tables
++
++A catalog-managed table can have a mix of (a) published and (b) ratified but non-published commits.
++The catalog is the source of truth for ratified commits. Also recall that ratified commits can be
++[staged commits](#staged-commit) that are persisted to the `_delta_log/_staged_commits` directory,
++or [inline commits](#inline-commit) whose content the catalog stores directly.
++
++For example, suppose the `_delta_log` directory contains the following files:
++
++```
++00000000000000000000.json
++00000000000000000001.json
++00000000000000000002.checkpoint.parquet
++00000000000000000002.json
++00000000000000000003.00000000000000000005.compacted.json
++00000000000000000003.json
++00000000000000000004.json
++00000000000000000005.json
++00000000000000000006.json
++00000000000000000007.json
++_staged_commits/00000000000000000007.016ae953-37a9-438e-8683-9a9a4a79a395.json // ratified and published
++_staged_commits/00000000000000000008.7d17ac10-5cc3-401b-bd1a-9c82dd2ea032.json // ratified
++_staged_commits/00000000000000000008.b91807ba-fe18-488c-a15e-c4807dbd2174.json // rejected
++_staged_commits/00000000000000000010.0f707846-cd18-4e01-b40e-84ee0ae987b0.json // not yet ratified
++_staged_commits/00000000000000000010.7a980438-cb67-4b89-82d2-86f73239b6d6.json // partial file
++```
++
++Further, suppose the catalog stores the following ratified commits:
++```
++{
++  7  -> "00000000000000000007.016ae953-37a9-438e-8683-9a9a4a79a395.json",
++  8  -> "00000000000000000008.7d17ac10-5cc3-401b-bd1a-9c82dd2ea032.json",
++  9  -> <inline commit: content stored by the catalog directly>
++}
++```
++
++Some things to note are:
++- the catalog isn't aware that commit 7 was already published - perhaps the response from the
++  filesystem was dropped
++- commit 9 is an inline commit
++- neither of the two staged commits for version 10 has been ratified
++
++To read such tables, Delta clients must first contact the catalog to get the ratified commits. This
++informs the Delta client of commits [7, 9] as well as the latest ratified version, 9.
++
++If this information is insufficient to construct a complete snapshot of the table, Delta clients
++must LIST the `_delta_log` directory to get information about the published commits. For commits
++that are both returned by the catalog and already published, Delta clients must treat the catalog's
++version as authoritative and read the commit returned by the catalog. Additionally, Delta clients
++must ignore any files with versions greater than the latest ratified commit version returned by the
++catalog.
++
++Combining these two sets of files and commits enables Delta clients to generate a snapshot at the
++latest version of the table.
++
++**NOTE**: This spec prescribes the _minimum_ required interactions between Delta clients and
++catalogs for commits. Catalogs may very well expose APIs and work with Delta clients to be
++informed of other non-commit [file types](#file-types), such as checkpoint, log
++compaction, and version checksum files. This would allow catalogs to return additional
++information to Delta clients during query and scan planning, potentially allowing Delta
++clients to avoid LISTing the filesystem altogether.
++
++## Commit Protocol
++
++To start, Delta Clients send the desired actions to be committed to the client-side component of the
++catalog.
++
++This component then has several options for proposing, ratifying, and publishing the commit,
++detailed below.
++
++- Option 1: Write the actions (likely client-side) to a [staged commit file](#staged-commit) in the
++  `_delta_log/_staged_commits` directory and then ratify the staged commit (likely server-side) by
++  atomically recording (in persistent storage of some kind) that the file corresponds to version `v`.
++- Option 2: Treat this as an [inline commit](#inline-commit) (i.e. the client-side component
++  likely sends the contents to the server-side component) and atomically record (in persistent
++  storage of some kind) the content of the commit as version `v` of the table.
++- Option 3: Catalog implementations that use PUT-if-absent (client- or server-side) can ratify and
++  publish all-in-one by atomically writing a [published commit file](#published-commit)
++  in the `_delta_log` directory. Note that this commit will be considered to have succeeded as soon
++  as the file becomes visible in the filesystem, regardless of when or whether the catalog is made
++  aware of the successful publish. The catalog does not need to store these files.
++
++A catalog must not ratify version `v` until it has ratified version `v - 1`, and it must ratify
++version `v` at most once.
++
++The catalog must store both flavors of ratified commits (staged or inline) and make them available
++to readers until they are [published](#publishing-commits).
++
++For performance reasons, Delta clients are encouraged to establish an API contract where the catalog
++provides the latest ratified commit information whenever a commit fails due to version conflict.
++
++## Getting Ratified Commits from the Catalog
++
++Even after a commit is ratified, it is not discoverable through filesystem operations until it is
++[published](#publishing-commits).
++
++The catalog client is responsible for implementing an API (defined by the Delta client) that Delta clients can
++use to retrieve the latest ratified commit version (authoritative), as well as the set of ratified
++commits the catalog is still storing for the table. If some commits needed to complete the snapshot
++are not stored by the catalog, as they are already published, Delta clients can issue a filesystem
++LIST operation to retrieve them.
++
++Delta clients must establish an API contract where the catalog provides ratified commit information
++as part of the standard table resolution process performed at query planning time.
++
++## Publishing Commits
++
++Publishing is the process of copying the ratified commit with version `<v>` to
++`_delta_log/<v>.json`. The ratified commit may be a staged commit located in
++`_delta_log/_staged_commits/<v>.<uuid>.json`, or it may be an inline commit whose content the
++catalog stores itself. Because the content of a ratified commit is immutable, it does not matter
++whether the client-side, server-side, or both catalog components initiate publishing.
++
++Implementations are strongly encouraged to publish commits promptly. This reduces the number of
++commits the catalog needs to store internally (and serve up to readers).
++
++Commits must be published _in order_. That is, version `v - 1` must be published _before_ version
++`v`.
++
++**NOTE**: Because commit publishing can happen at any time after the commit succeeds, the file
++modification timestamp of the published file will not accurately reflect the original commit time.
++For this reason, catalog-managed tables must use [in-commit-timestamps](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#in-commit-timestamps)
++to ensure stability of time travel reads. Refer to the [Writer Requirements for Catalog-managed Tables](#writer-requirements-for-catalog-managed-tables)
++section for more details.
++
++## Maintenance Operations on Catalog-managed Tables
++
++[Checkpoints](#checkpoints-1) and [Log Compaction Files](#log-compaction-files) can only be created
++for versions that are already published in the `_delta_log`. In other words, in order to checkpoint
++version `v` or produce a log compaction file for commit range `x <= v <= y`, `_delta_log/<v>.json`
++must exist.
++
++Notably, the [Version Checksum File](#version-checksum-file) for version `v` _can_ be created in the
++`_delta_log` even if the commit for version `v` is not published.
++
++By default, maintenance operations are prohibited unless the managing catalog explicitly permits
++the client to run them. The only exceptions are checkpoints, log compaction, and version checksum,
++as they are essential for all basic table operations (e.g. reads and writes) to operate reliably.
++All other maintenance operations such as the following are not allowed by default.
++- [Log and other metadata files clean up](#metadata-cleanup).
++- Data files cleanup, for example VACUUM.
++- Data layout changes, for example OPTIMIZE and REORG.
++
++## Creating and Dropping Catalog-managed Tables
++
++The catalog and query engine ultimately dictate how to create and drop catalog-managed tables.
++
++As one example, table creation often works in three phases:
++
++1. An initial catalog operation to obtain a unique storage location which serves as an unnamed
++   "staging" table
++2. A table operation that physically initializes a new `catalogManaged`-enabled table at the staging
++   location.
++3. A final catalog operation that registers the new table with its intended name.
++
++Delta clients would primarily be involved with the second step, but an implementation could choose
++to combine the second and third steps so that a single catalog call registers the table as part of
++the table's first commit.
++
++As another example, dropping a table can be as simple as removing its name from the catalog (a "soft
++delete"), followed at some later point by a "hard delete" that physically purges the data. The Delta
++client would not be involved at all in this process, because no commits are made to the table.
++
++## Catalog-managed Table Enablement
++
++The `catalogManaged` table feature is supported and active when:
++- The table is on Reader Version 3 and Writer Version 7.
++- The table has a `protocol` action with `readerFeatures` and `writerFeatures` both containing the
++  feature `catalogManaged`.
++
++## Writer Requirements for Catalog-managed tables
++
++When supported and active:
++
++- Writers must discover and access the table using catalog calls, which happens _before_ the table's
++  protocol is known. See [Table Discovery](#table-discovery) for more details.
++- The [in-commit-timestamps](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#in-commit-timestamps)
++  table feature must be supported and active.
++- The `commitInfo` action must also contain a field `txnId` that stores a unique transaction
++  identifier string.
++- Writers must follow the catalog's [commit protocol](#commit-protocol) and must not perform
++  ordinary filesystem-based commits against the table.
++- Writers must follow the catalog's [maintenance operation protocol](#maintenance-operations-on-catalog-managed-tables).
++
++## Reader Requirements for Catalog-managed tables
++
++When supported and active:
++
++- Readers must discover the table using catalog calls, which happens before the table's protocol
++  is known. See [Table Discovery](#table-discovery) for more details.
++- Readers must contact the catalog for information about unpublished ratified commits.
++- Readers must follow the rules described in the [Reading Catalog-managed Tables](#reading-catalog-managed-tables)
++  section above. Notably:
++  - If the catalog says `v` is the latest version, clients must ignore any later versions that may
++    have been published.
++  - When the catalog returns a ratified commit for version `v`, readers must use that
++    catalog-supplied commit and ignore any published Delta file for version `v` that might also be
++    present.
++
++## Table Discovery
++
++The requirements above state that readers and writers must discover and access the table using
++catalog calls, which occurs _before_ the table's protocol is known. This raises an important
++question: how can a client discover a `catalogManaged` Delta table without first knowing that it
++_is_, in fact, `catalogManaged` (according to the protocol)?
++
++To solve this, first note that, in practice, catalog-integrated engines already ask the catalog to
++resolve a table name to its storage location during the name resolution step. This protocol
++therefore encourages that the same name resolution step also indicate whether the table is
++catalog-managed. Surfacing this at the very moment the catalog returns the path imposes no extra
++round-trips, yet it lets the client decide — early and unambiguously — whether to follow the
++`catalogManaged` read and write rules.
++
++## Sample Catalog Client API
++
++The following is an example of a possible API which a Java-based Delta client might require catalog
++implementations to target:
++
++```scala
++
++trait CatalogManagedTable {
++    /**
++     * Commits the given set of `actions` to the given commit `version`.
++     *
++     * @param version The version we want to commit.
++     * @param actions Actions that need to be committed.
++     *
++     * @return CommitResponse which has details around the new committed delta file.
++     */
++    def commit(
++        version: Long,
++        actions: Iterator[String]): CommitResponse
++
++    /**
++     * Retrieves a (possibly empty) suffix of ratified commits in the range [startVersion,
++     * endVersion] for this table.
++     * 
++     * Some of these ratified commits may already have been published. Some of them may be staged,
++     * in which case the staged commit file path is returned; others may be inline, in which case
++     * the inline commit content is returned.
++     * 
++     * The returned commits are sorted in ascending version number and are contiguous.
++     *
++     * If neither start nor end version is specified, the catalog will return all available ratified
++     * commits (possibly empty, if all commits have been published).
++     *
++     * In all cases, the response also includes the table's latest ratified commit version.
++     *
++     * @return GetCommitsResponse which contains an ordered list of ratified commits
++     *         stored by the catalog, as well as the table's latest ratified commit version.
++     */
++    def getRatifiedCommits(
++        startVersion: Option[Long],
++        endVersion: Option[Long]): GetCommitsResponse
++}
++```
++
++Note that the above is only one example of a possible Catalog Client API. It is also _NOT_ a catalog
++API (no table discovery, ACL, create/drop, etc). The Delta protocol is agnostic to API details, and
++the API surface Delta clients define should only cover the specific catalog capabilities that the
++Delta client needs to correctly read and write catalog-managed tables.
++
+ # Iceberg Compatibility V1
+ 
+ This table feature (`icebergCompatV1`) ensures that Delta tables can be converted to Apache Iceberg™ format, though this table feature does not implement or specify that conversion.
+  * Files that have been [added](#Add-File-and-Remove-File) and not yet removed
+  * Files that were recently [removed](#Add-File-and-Remove-File) and have not yet expired
+  * [Transaction identifiers](#Transaction-Identifiers)
+- * [Domain Metadata](#Domain-Metadata)
++ * [Domain Metadata](#Domain-Metadata) that have not been removed (i.e. excluding tombstones with `removed=true`)
+  * [Checkpoint Metadata](#checkpoint-metadata) - Requires [V2 checkpoints](#v2-spec)
+  * [Sidecar File](#sidecar-files) - Requires [V2 checkpoints](#v2-spec)
+ 
+ 1. Identify a threshold (in days) up to which we want to preserve the deltaLog. Let's refer to
+ midnight UTC of that day as `cutOffTimestamp`. The newest commit not newer than the `cutOffTimestamp` is
+ the `cutOffCommit`, because a commit exactly at midnight is an acceptable cutoff. We want to retain everything including and after the `cutOffCommit`.
+-2. Identify the newest checkpoint that is not newer than the `cutOffCommit`. A checkpoint at the `cutOffCommit` is ideal, but an older one will do. Lets call it `cutOffCheckpoint`.
+-We need to preserve the `cutOffCheckpoint` (both the checkpoint file and the JSON commit file at that version) and all commits after it. The JSON commit file at the `cutOffCheckpoint` version must be preserved because checkpoints do not preserve [commit provenance information](#commit-provenance-information) (e.g., `commitInfo` actions), which may be required by table features such as [In-Commit Timestamps](#in-commit-timestamps). All commits after `cutOffCheckpoint` must be preserved to enable time travel for commits between `cutOffCheckpoint` and the next available checkpoint.
+-3. Delete all [delta log entries](#delta-log-entries) and [checkpoint files](#checkpoints) before the
+-`cutOffCheckpoint` checkpoint. Also delete all the [log compaction files](#log-compaction-files) having startVersion <= `cutOffCheckpoint`'s version.
++2. Identify the newest checkpoint that is not newer than the `cutOffCommit`. A checkpoint at the `cutOffCommit` is ideal, but an older one will do. Let's call it `cutOffCheckpoint`.
++We need to preserve the `cutOffCheckpoint` (both the checkpoint file and the JSON commit file at that version) and all published commits after it. The JSON commit file at the `cutOffCheckpoint` version must be preserved because checkpoints do not preserve [commit provenance information](#commit-provenance-information) (e.g., `commitInfo` actions), which may be required by table features such as [In-Commit Timestamps](#in-commit-timestamps). All published commits after `cutOffCheckpoint` must be preserved to enable time travel for commits between `cutOffCheckpoint` and the next available checkpoint.
++    - If no `cutOffCheckpoint` can be found, do not proceed with metadata cleanup as there is
++      nothing to cleanup.
++3. Delete all [delta log entries](#delta-log-entries), [checkpoint files](#checkpoints), and
++   [version checksum files](#version-checksum-file) before the `cutOffCheckpoint` checkpoint. Also delete all the [log compaction files](#log-compaction-files)
++   having startVersion <= `cutOffCheckpoint`'s version.
++    - Also delete all the [staged commit files](#staged-commit) having version <=
++      `cutOffCheckpoint`'s version from the `_delta_log/_staged_commits` directory.
+ 4. Now read all the available [checkpoints](#checkpoints-1) in the _delta_log directory and identify
+ the corresponding [sidecar files](#sidecar-files). These sidecar files need to be protected.
+ 5. List all the files in `_delta_log/_sidecars` directory, preserve files that are less than a day
+ [Timestamp without Timezone](#timestamp-without-timezone-timestampNtz) | `timestampNtz` | Readers and writers
+ [Domain Metadata](#domain-metadata) | `domainMetadata` | Writers only
+ [V2 Checkpoint](#v2-checkpoint-table-feature) | `v2Checkpoint` | Readers and writers
++[Catalog-managed Tables](#catalog-managed-tables) | `catalogManaged` | Readers and writers
+ [Iceberg Compatibility V1](#iceberg-compatibility-v1) | `icebergCompatV1` | Writers only
+ [Iceberg Compatibility V2](#iceberg-compatibility-v2) | `icebergCompatV2` | Writers only
+ [Clustered Table](#clustered-table) | `clustering` | Writers only
\ No newline at end of file
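
Reviewer note: the cutoff selection described in metadata cleanup steps 1-3 of the PROTOCOL.md hunk above can be sketched roughly as below. This is a minimal illustration, not Delta's implementation. The names `pick_cutoff` and `is_deletable`, the `(version, timestamp_ms)` input shape, and the assumption that commit timestamps are non-decreasing in version order are all hypothetical choices made for this example.

```python
from bisect import bisect_right
from typing import Optional, Sequence, Tuple


def pick_cutoff(
    commits: Sequence[Tuple[int, int]],  # (version, timestamp_ms), sorted by version
    checkpoint_versions: Sequence[int],  # sorted ascending
    cutoff_timestamp_ms: int,
) -> Optional[Tuple[int, int]]:
    """Return (cutoff_commit, cutoff_checkpoint), or None if there is nothing to clean up.

    Assumes (for this sketch only) that timestamps never decrease as versions grow.
    """
    timestamps = [ts for _, ts in commits]
    # Newest commit that is not newer than the cutoff; a commit exactly at the cutoff qualifies.
    commit_idx = bisect_right(timestamps, cutoff_timestamp_ms) - 1
    if commit_idx < 0:
        return None
    cutoff_commit = commits[commit_idx][0]
    # Newest checkpoint that is not newer than the cutoff commit.
    checkpoint_idx = bisect_right(checkpoint_versions, cutoff_commit) - 1
    if checkpoint_idx < 0:
        # No cutOffCheckpoint found: do not proceed with cleanup.
        return None
    return cutoff_commit, checkpoint_versions[checkpoint_idx]


def is_deletable(kind: str, version: int, cutoff_checkpoint: int) -> bool:
    """Apply the step 3 rules; the `kind` labels are hypothetical.

    For "compaction", `version` is the file's startVersion.
    """
    if kind in ("commit", "checkpoint", "checksum"):
        return version < cutoff_checkpoint  # strictly before the cutoff checkpoint
    if kind in ("compaction", "staged_commit"):
        return version <= cutoff_checkpoint
    return False
```

Note the asymmetry the spec text implies: the published JSON commit at the `cutOffCheckpoint` version is kept, while a staged commit file at that same version is eligible for deletion.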

README.md

@@ -0,0 +1,10 @@
+diff --git a/README.md b/README.md
+--- a/README.md
++++ b/README.md
+ ## Building
+ 
+ Delta Lake is compiled using [SBT](https://www.scala-sbt.org/1.x/docs/Command-Line-Reference.html).
++Ensure that your Java version is at least 17 (you can verify with `java -version`).
+ 
+ To compile, run
+ 
\ No newline at end of file

build.sbt

@@ -0,0 +1,218 @@
+diff --git a/build.sbt b/build.sbt
+--- a/build.sbt
++++ b/build.sbt
+       allMappings.distinct
+     },
+ 
+-    // Exclude internal modules from published POM
++    // Exclude internal modules from published POM and add kernel dependencies.
++    // Kernel modules are transitive through sparkV2 (an internal module), so they
++    // are lost when sparkV2 is filtered out. We re-add them explicitly here.
+     pomPostProcess := { node =>
+       val internalModules = internalModuleNames.value
++      val ver = version.value
+       import scala.xml._
+       import scala.xml.transform._
++
++      def kernelDependencyNode(artifactId: String): Elem = {
++        <dependency>
++          <groupId>io.delta</groupId>
++          <artifactId>{artifactId}</artifactId>
++          <version>{ver}</version>
++        </dependency>
++      }
++
++      val kernelDeps = Seq(
++        kernelDependencyNode("delta-kernel-api"),
++        kernelDependencyNode("delta-kernel-defaults"),
++        kernelDependencyNode("delta-kernel-unitycatalog")
++      )
++
+       new RuleTransformer(new RewriteRule {
+         override def transform(n: Node): Seq[Node] = n match {
+-          case e: Elem if e.label == "dependency" =>
+-            val artifactId = (e \ "artifactId").text
+-            // Check if artifactId starts with any internal module name
+-            // (e.g., "delta-spark-v1_4.1_2.13" starts with "delta-spark-v1")
+-            val isInternal = internalModules.exists(module => artifactId.startsWith(module))
+-            if (isInternal) Seq.empty else Seq(n)
++          case e: Elem if e.label == "dependencies" =>
++            val filtered = e.child.filter {
++              case child: Elem if child.label == "dependency" =>
++                val artifactId = (child \ "artifactId").text
++                !internalModules.exists(module => artifactId.startsWith(module))
++              case _ => true
++            }
++            Seq(e.copy(child = filtered ++ kernelDeps))
+           case _ => Seq(n)
+         }
+       }).transform(node).head
+     commonSettings,
+     scalaStyleSettings,
+     releaseSettings,
+-    CrossSparkVersions.sparkDependentModuleName(sparkVersion),
++    // Set sparkVersion directly (not sparkDependentModuleName) so that
++    // runOnlyForReleasableSparkModules discovers this module, but without adding a Spark
++    // suffix to the artifact name. delta-contribs is only published as delta-contribs_2.13.
++    sparkVersion := CrossSparkVersions.getSparkVersion(),
+     Compile / packageBin / mappings := (Compile / packageBin / mappings).value ++
+       listPythonFiles(baseDirectory.value.getParentFile / "python"),
+ 
+   ).configureUnidoc()
+ 
+ 
+-val unityCatalogVersion = "0.3.1"
++val unityCatalogVersion = "0.4.0"
+ val sparkUnityCatalogJacksonVersion = "2.15.4" // We are using Spark 4.0's Jackson version 2.15.x, to override Unity Catalog 0.3.0's version 2.18.x
+ 
+ lazy val sparkUnityCatalog = (project in file("spark/unitycatalog"))
+     libraryDependencies ++= Seq(
+       "org.apache.spark" %% "spark-sql" % sparkVersion.value % "provided",
+ 
+-      "io.delta" %% "delta-sharing-client" % "1.3.9",
++      "io.delta" %% "delta-sharing-client" % "1.3.10",
+ 
+       // Test deps
+       "org.scalatest" %% "scalatest" % scalaTestVersion % "test",
+ 
+       // Test Deps
+       "org.scalatest" %% "scalatest" % scalaTestVersion % "test",
++      // Jackson datatype module needed for UC SDK tests (excluded from main compile scope)
++      "com.fasterxml.jackson.datatype" % "jackson-datatype-jsr310" % "2.15.4" % "test",
+     ),
+ 
+     // Unidoc settings
+     commonSettings,
+     scalaStyleSettings,
+     releaseSettings,
+-    CrossSparkVersions.sparkDependentModuleName(sparkVersion),
++    // Set sparkVersion directly (not sparkDependentModuleName) so that
++    // runOnlyForReleasableSparkModules discovers this module, but without adding a Spark
++    // suffix to the artifact name. delta-iceberg is only published as delta-iceberg_2.13.
++    sparkVersion := CrossSparkVersions.getSparkVersion(),
+     libraryDependencies ++= {
+       if (supportIceberg) {
+         Seq(
+           "org.xerial" % "sqlite-jdbc" % "3.45.0.0" % "test",
+           "org.apache.httpcomponents.core5" % "httpcore5" % "5.2.4" % "test",
+           "org.apache.httpcomponents.client5" % "httpclient5" % "5.3.1" % "test",
+-          "org.apache.iceberg" %% icebergSparkRuntimeArtifactName % "1.10.0" % "provided"
++          "org.apache.iceberg" %% icebergSparkRuntimeArtifactName % "1.10.0" % "provided",
++          // For FixedGcsAccessTokenProvider (GCS server-side planning credentials)
++          "com.google.cloud.bigdataoss" % "util-hadoop" % "hadoop3-2.2.26" % "provided"
+         )
+       } else {
+         Seq.empty
+   )
+ // scalastyle:on println
+ 
+-val icebergShadedVersion = "1.10.0"
++val icebergShadedVersion = "1.10.1"
+ lazy val icebergShaded = (project in file("icebergShaded"))
+   .dependsOn(spark % "provided")
+   .disablePlugins(JavaFormatterPlugin, ScalafmtPlugin)
+     commonSettings,
+     scalaStyleSettings,
+     releaseSettings,
+-    CrossSparkVersions.sparkDependentSettings(sparkVersion),
+-    libraryDependencies ++= Seq(
+-      "org.apache.hudi" % "hudi-java-client" % "0.15.0" % "compile" excludeAll(
+-        ExclusionRule(organization = "org.apache.hadoop"),
+-        ExclusionRule(organization = "org.apache.zookeeper"),
+-      ),
+-      "org.apache.spark" %% "spark-avro" % sparkVersion.value % "test" excludeAll ExclusionRule(organization = "org.apache.hadoop"),
+-      "org.apache.parquet" % "parquet-avro" % "1.12.3" % "compile"
+-    ),
++    // Set sparkVersion directly (not sparkDependentModuleName) so that
++    // runOnlyForReleasableSparkModules discovers this module, but without adding a Spark
++    // suffix to the artifact name. delta-hudi is only published as delta-hudi_2.13.
++    sparkVersion := CrossSparkVersions.getSparkVersion(),
++    libraryDependencies ++= {
++      if (supportHudi) {
++        Seq(
++          "org.apache.hudi" % "hudi-java-client" % "0.15.0" % "compile" excludeAll(
++            ExclusionRule(organization = "org.apache.hadoop"),
++            ExclusionRule(organization = "org.apache.zookeeper"),
++          ),
++          "org.apache.spark" %% "spark-avro" % sparkVersion.value % "test" excludeAll ExclusionRule(organization = "org.apache.hadoop"),
++          "org.apache.parquet" % "parquet-avro" % "1.12.3" % "compile"
++        )
++      } else {
++        Seq.empty
++      }
++    },
++    // Skip compilation and publishing when supportHudi is false
++    Compile / skip := !supportHudi,
++    Test / skip := !supportHudi,
++    publish / skip := !supportHudi,
++    publishLocal / skip := !supportHudi,
++    publishM2 / skip := !supportHudi,
+     assembly / assemblyJarName := s"${name.value}-assembly_${scalaBinaryVersion.value}-${version.value}.jar",
+     assembly / logLevel := Level.Info,
+     assembly / test := {},
+       // crossScalaVersions must be set to Nil on the aggregating project
+       crossScalaVersions := Nil,
+       publishArtifact := false,
+-      publish / skip := false,
++      publish / skip := true,
+     )
+ }
+ 
+       // crossScalaVersions must be set to Nil on the aggregating project
+       crossScalaVersions := Nil,
+       publishArtifact := false,
+-      publish / skip := false,
++      publish / skip := true,
+     )
+ }
+ 
+     // crossScalaVersions must be set to Nil on the aggregating project
+     crossScalaVersions := Nil,
+     publishArtifact := false,
+-    publish / skip := false,
++    publish / skip := true,
+     unidocSourceFilePatterns := {
+       (kernelApi / unidocSourceFilePatterns).value.scopeToProject(kernelApi) ++
+       (kernelDefaults / unidocSourceFilePatterns).value.scopeToProject(kernelDefaults)
+     // crossScalaVersions must be set to Nil on the aggregating project
+     crossScalaVersions := Nil,
+     publishArtifact := false,
+-    publish / skip := false,
++    publish / skip := true,
+   )
+ 
+ /*
+     sys.env.getOrElse("SONATYPE_USERNAME", ""),
+     sys.env.getOrElse("SONATYPE_PASSWORD", "")
+   ),
++  credentials += Credentials(
++    "Sonatype Nexus Repository Manager",
++    "central.sonatype.com",
++    sys.env.getOrElse("SONATYPE_USERNAME", ""),
++    sys.env.getOrElse("SONATYPE_PASSWORD", "")
++  ),
+   publishTo := {
+     val ossrhBase = "https://ossrh-staging-api.central.sonatype.com/"
++    val centralSnapshots = "https://central.sonatype.com/repository/maven-snapshots/"
+     if (isSnapshot.value) {
+-      Some("snapshots" at ossrhBase + "content/repositories/snapshots")
++      Some("snapshots" at centralSnapshots)
+     } else {
+       Some("releases"  at ossrhBase + "service/local/staging/deploy/maven2")
+     }
+ // Looks like some of release settings should be set for the root project as well.
+ publishArtifact := false  // Don't release the root project
+ publish / skip := true
+-publishTo := Some("snapshots" at "https://ossrh-staging-api.central.sonatype.com/content/repositories/snapshots")
++publishTo := Some("snapshots" at "https://central.sonatype.com/repository/maven-snapshots/")
+ releaseCrossBuild := false  // Don't use sbt-release's cross facility
+ releaseProcess := Seq[ReleaseStep](
+   checkSnapshotDependencies,
+   setReleaseVersion,
+   commitReleaseVersion,
+   tagRelease
+-) ++ CrossSparkVersions.crossSparkReleaseSteps("+publishSigned") ++ Seq[ReleaseStep](
++) ++ CrossSparkVersions.crossSparkReleaseSteps("publishSigned") ++ Seq[ReleaseStep](
+ 
+   // Do NOT use `sonatypeBundleRelease` - it will actually release to Maven! We want to do that
+   // manually.
\ No newline at end of file
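
Reviewer note: the `pomPostProcess` change in the build.sbt hunk above rewrites each `<dependencies>` element as a whole: it drops dependencies whose `artifactId` starts with an internal module name, then appends the kernel dependencies explicitly. The same idea is sketched below in Python with the standard library's `xml.etree.ElementTree`. This only illustrates the transformation and is not the build's code. The function name `rewrite_dependencies` is hypothetical, and the sketch assumes a POM without an XML namespace, which real published POMs normally have.

```python
import xml.etree.ElementTree as ET
from typing import Sequence


def rewrite_dependencies(
    pom_xml: str,
    internal_modules: Sequence[str],
    kernel_artifacts: Sequence[str],
    version: str,
) -> str:
    """Filter internal-module dependencies and re-add kernel dependencies."""
    root = ET.fromstring(pom_xml)
    # Materialize the matches first so the tree is not mutated while iterating it.
    for deps in list(root.iter("dependencies")):
        for dep in list(deps):
            if dep.tag != "dependency":
                continue
            artifact_id = dep.findtext("artifactId", default="")
            # Prefix match, e.g. "delta-spark-v1_4.1_2.13" starts with "delta-spark-v1".
            if any(artifact_id.startswith(m) for m in internal_modules):
                deps.remove(dep)
        # Re-add the kernel modules that were only reachable through a filtered module.
        for artifact in kernel_artifacts:
            node = ET.SubElement(deps, "dependency")
            ET.SubElement(node, "groupId").text = "io.delta"
            ET.SubElement(node, "artifactId").text = artifact
            ET.SubElement(node, "version").text = version
    return ET.tostring(root, encoding="unicode")
```

Rewriting at the `<dependencies>` level, rather than matching each `<dependency>` node, is what makes appending new entries possible in the same pass.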

connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/.00000000000000000000.json.crc

@@ -0,0 +1,3 @@
+diff --git a/connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/.00000000000000000000.json.crc b/connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/.00000000000000000000.json.crc
+new file mode 100644
+Binary files /dev/null and b/connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/.00000000000000000000.json.crc differ
\ No newline at end of file

connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/00000000000000000000.crc

@@ -0,0 +1,5 @@
+diff --git a/connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/00000000000000000000.crc b/connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/00000000000000000000.crc
+new file mode 100644
+--- /dev/null
++++ b/connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/00000000000000000000.crc
++{"txnId":"6132e880-0f3a-4db4-b882-1da039bffbad","tableSizeBytes":0,"numFiles":0,"numMetadata":1,"numProtocol":1,"setTransactions":[],"domainMetadata":[],"metadata":{"id":"0eb3e007-b3cc-40e4-bca1-a5970d86b5a6","format":{"provider":"parquet","options":{}},"schemaString":"{\"type\":\"struct\",\"fields\":[{\"name\":\"id\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}},{\"name\":\"utf8_binary_col\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"utf8_lcase_col\",\"type\":\"string\",\"nullable\":true,\"metadata\":{\"__COLLATIONS\":{\"utf8_lcase_col\":\"spark.UTF8_LCASE\"}}},{\"name\":\"unicode_col\",\"type\":\"string\",\"nullable\":true,\"metadata\":{\"__COLLATIONS\":{\"unicode_col\":\"icu.UNICODE\"}}}]}","partitionColumns":[],"configuration":{},"createdTime":1773779518731},"protocol":{"minReaderVersion":1,"minWriterVersion":7,"writerFeatures":["domainMetadata","collations-preview","appendOnly","invariants"]},"histogramOpt":{"sortedBinBoundaries":[0,8192,16384,32768,65536,131072,262144,524288,1048576,2097152,4194304,8388608,12582912,16777216,20971520,25165824,29360128,33554432,37748736,41943040,50331648,58720256,67108864,75497472,83886080,92274688,100663296,109051904,117440512,125829120,130023424,134217728,138412032,142606336,146800640,150994944,167772160,184549376,201326592,218103808,234881024,251658240,268435456,285212672,301989888,318767104,335544320,352321536,369098752,385875968,402653184,419430400,436207616,452984832,469762048,486539264,503316480,520093696,536870912,553648128,570425344,587202560,603979776,671088640,738197504,805306368,872415232,939524096,1006632960,1073741824,1140850688,1207959552,1275068416,1342177280,1409286144,1476395008,1610612736,1744830464,1879048192,2013265920,2147483648,2415919104,2684354560,2952790016,3221225472,3489660928,3758096384,4026531840,4294967296,8589934592,17179869184,34359738368,68719476736,137438953472,274877906944],"fileCounts":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],"totalBytes":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]},"allFiles":[]}
\ No newline at end of file

... (truncated, output exceeded 60000 bytes)

_{Reproduce locally: git range-diff e8cffee..41bc0a2 d1139d2..2123524 | Disable: git config gitstack.push-range-diff false}

zikangh · 2026-04-02T21:55:22Z

Range-diff: master (2123524 -> 18d8afb)

.claude/scheduled_tasks.lock

@@ -0,0 +1,6 @@
+diff --git a/.claude/scheduled_tasks.lock b/.claude/scheduled_tasks.lock
+new file mode 100644
+--- /dev/null
++++ b/.claude/scheduled_tasks.lock
++{"sessionId":"dd5422c1-eec5-4618-877c-bc933a699925","pid":3420592,"acquiredAt":1775161689602}
+\ No newline at end of file
\ No newline at end of file

.github/CODEOWNERS

@@ -0,0 +1,12 @@
+diff --git a/.github/CODEOWNERS b/.github/CODEOWNERS
+--- a/.github/CODEOWNERS
++++ b/.github/CODEOWNERS
+ /project/                       @tdas
+ /version.sbt                    @tdas
+ 
++# Spark V2 and Unified modules
++/spark/v2/                      @tdas @huan233usc @TimothyW553 @raveeram-db @murali-db
++/spark-unified/                 @tdas @huan233usc @TimothyW553 @raveeram-db @murali-db
++
+ # All files in the root directory
+ /*                              @tdas
\ No newline at end of file

.github/workflows/iceberg_test.yaml

@@ -0,0 +1,16 @@
+diff --git a/.github/workflows/iceberg_test.yaml b/.github/workflows/iceberg_test.yaml
+--- a/.github/workflows/iceberg_test.yaml
++++ b/.github/workflows/iceberg_test.yaml
+           # the above directories when we use the key for the first time. After that, each run will
+           # just use the cache. The cache is immutable so we need to use a new key when trying to
+           # cache new stuff.
+-          key: delta-sbt-cache-spark3.2-scala${{ matrix.scala }}
++          key: delta-sbt-cache-spark4.0-scala${{ matrix.scala }}
+       - name: Install Job dependencies
+         run: |
+           sudo apt-get update
+       - name: Run Scala/Java and Python tests
+         # when changing TEST_PARALLELISM_COUNT make sure to also change it in spark_master_test.yaml
+         run: |
+-          TEST_PARALLELISM_COUNT=4 pipenv run python run-tests.py --group iceberg
++          TEST_PARALLELISM_COUNT=4 pipenv run python run-tests.py --group iceberg --spark-version 4.0
\ No newline at end of file

.github/workflows/spark_examples_test.yaml

@@ -0,0 +1,54 @@
+diff --git a/.github/workflows/spark_examples_test.yaml b/.github/workflows/spark_examples_test.yaml
+--- a/.github/workflows/spark_examples_test.yaml
++++ b/.github/workflows/spark_examples_test.yaml
+         # Spark versions are dynamically generated - released versions only
+         spark_version: ${{ fromJson(needs.generate-matrix.outputs.spark_versions) }}
+         # These Scala versions must match those in the build.sbt
+-        scala: [2.13.16]
++        scala: [2.13.17]
+     env:
+       SCALA_VERSION: ${{ matrix.scala }}
+-      SPARK_VERSION: ${{ matrix.spark_version }}
+     steps:
+       - uses: actions/checkout@v3
+       - name: Get Spark version details
+         id: spark-details
+         run: |
+-          # Get JVM version, package suffix, iceberg support for this Spark version
++          # Get JVM version, package suffix, iceberg support, and full version for this Spark version
+           JVM_VERSION=$(python3 project/scripts/get_spark_version_info.py --get-field "${{ matrix.spark_version }}" targetJvm | jq -r)
+           SPARK_PACKAGE_SUFFIX=$(python3 project/scripts/get_spark_version_info.py --get-field "${{ matrix.spark_version }}" packageSuffix | jq -r)
+           SUPPORT_ICEBERG=$(python3 project/scripts/get_spark_version_info.py --get-field "${{ matrix.spark_version }}" supportIceberg | jq -r)
++          SPARK_FULL_VERSION=$(python3 project/scripts/get_spark_version_info.py --get-field "${{ matrix.spark_version }}" fullVersion | jq -r)
+           echo "jvm_version=$JVM_VERSION" >> $GITHUB_OUTPUT
+           echo "spark_package_suffix=$SPARK_PACKAGE_SUFFIX" >> $GITHUB_OUTPUT
+           echo "support_iceberg=$SUPPORT_ICEBERG" >> $GITHUB_OUTPUT
+-          echo "Using JVM $JVM_VERSION for Spark ${{ matrix.spark_version }}, package suffix: '$SPARK_PACKAGE_SUFFIX', support iceberg: '$SUPPORT_ICEBERG'"
++          echo "spark_full_version=$SPARK_FULL_VERSION" >> $GITHUB_OUTPUT
++          echo "Using JVM $JVM_VERSION for Spark $SPARK_FULL_VERSION, package suffix: '$SPARK_PACKAGE_SUFFIX', support iceberg: '$SUPPORT_ICEBERG'"
+       - name: install java
+         uses: actions/setup-java@v3
+         with:
+       - name: Run Delta Spark Local Publishing and Examples Compilation
+         # examples/scala/build.sbt will compile against the local Delta release version (e.g. 3.2.0-SNAPSHOT).
+         # Thus, we need to publishM2 first so those jars are locally accessible.
+-        # The SPARK_PACKAGE_SUFFIX env var tells examples/scala/build.sbt which artifact naming to use.
++        # -DsparkVersion is for the Delta project's publishM2 (which Spark version to compile Delta against).
++        # SPARK_VERSION/SPARK_PACKAGE_SUFFIX/SUPPORT_ICEBERG are for examples/scala/build.sbt (dependency resolution).
+         env:
+           SPARK_PACKAGE_SUFFIX: ${{ steps.spark-details.outputs.spark_package_suffix }}
+           SUPPORT_ICEBERG: ${{ steps.spark-details.outputs.support_iceberg }}
++          SPARK_VERSION: ${{ steps.spark-details.outputs.spark_full_version }}
+         run: |
+           build/sbt clean
+-          build/sbt -DsparkVersion=${{ matrix.spark_version }} publishM2
++          build/sbt -DsparkVersion=${{ steps.spark-details.outputs.spark_full_version }} publishM2
+           cd examples/scala && build/sbt "++ $SCALA_VERSION compile"
++      - name: Run UC Delta Integration Test
++        # Verifies that delta-spark resolved from Maven local includes all kernel module
++        # dependencies transitively by running a real UC-backed Delta workload.
++        env:
++          SPARK_PACKAGE_SUFFIX: ${{ steps.spark-details.outputs.spark_package_suffix }}
++          SPARK_VERSION: ${{ steps.spark-details.outputs.spark_full_version }}
++        run: |
++          cd examples/scala && build/sbt "++ $SCALA_VERSION runMain example.UnityCatalogQuickstart"
\ No newline at end of file

.github/workflows/spark_test.yaml

@@ -0,0 +1,27 @@
+diff --git a/.github/workflows/spark_test.yaml b/.github/workflows/spark_test.yaml
+--- a/.github/workflows/spark_test.yaml
++++ b/.github/workflows/spark_test.yaml
+         # These Scala versions must match those in the build.sbt
+         scala: [2.13.16]
+         # Important: This list of shards must be [0..NUM_SHARDS - 1]
+-        shard: [0, 1, 2, 3]
++        shard: [0, 1, 2, 3, 4, 5, 6, 7]
+     env:
+       SCALA_VERSION: ${{ matrix.scala }}
+       SPARK_VERSION: ${{ matrix.spark_version }}
+       # Important: This must be the same as the length of shards in matrix
+-      NUM_SHARDS: 4
++      NUM_SHARDS: 8
+     steps:
+       - uses: actions/checkout@v3
+       - name: Get Spark version details
+         # when changing TEST_PARALLELISM_COUNT make sure to also change it in spark_python_test.yaml
+         run: |
+           TEST_PARALLELISM_COUNT=4 pipenv run python run-tests.py --group spark --shard ${{ matrix.shard }} --spark-version ${{ matrix.spark_version }}
++      - name: Upload test reports
++        if: always()
++        uses: actions/upload-artifact@v4
++        with:
++          name: test-reports-spark${{ matrix.spark_version }}-shard${{ matrix.shard }}
++          path: "**/target/test-reports/*.xml"
++          retention-days: 7
\ No newline at end of file

PROTOCOL.md

@@ -0,0 +1,537 @@
+diff --git a/PROTOCOL.md b/PROTOCOL.md
+--- a/PROTOCOL.md
++++ b/PROTOCOL.md
+   - [Writer Requirements for Variant Type](#writer-requirements-for-variant-type)
+   - [Reader Requirements for Variant Data Type](#reader-requirements-for-variant-data-type)
+   - [Compatibility with other Delta Features](#compatibility-with-other-delta-features)
++- [Catalog-managed tables](#catalog-managed-tables)
++  - [Terminology: Commits](#terminology-commits)
++  - [Terminology: Delta Client](#terminology-delta-client)
++  - [Terminology: Catalogs](#terminology-catalogs)
++  - [Catalog Responsibilities](#catalog-responsibilities)
++  - [Reading Catalog-managed Tables](#reading-catalog-managed-tables)
++  - [Commit Protocol](#commit-protocol)
++  - [Getting Ratified Commits from the Catalog](#getting-ratified-commits-from-the-catalog)
++  - [Publishing Commits](#publishing-commits)
++  - [Maintenance Operations on Catalog-managed Tables](#maintenance-operations-on-catalog-managed-tables)
++  - [Creating and Dropping Catalog-managed Tables](#creating-and-dropping-catalog-managed-tables)
++  - [Catalog-managed Table Enablement](#catalog-managed-table-enablement)
++  - [Writer Requirements for Catalog-managed tables](#writer-requirements-for-catalog-managed-tables)
++  - [Reader Requirements for Catalog-managed tables](#reader-requirements-for-catalog-managed-tables)
++  - [Table Discovery](#table-discovery)
++  - [Sample Catalog Client API](#sample-catalog-client-api)
+ - [Requirements for Writers](#requirements-for-writers)
+   - [Creation of New Log Entries](#creation-of-new-log-entries)
+   - [Consistency Between Table Metadata and Data Files](#consistency-between-table-metadata-and-data-files)
+ __(1)__ `preimage` is the value before the update, `postimage` is the value after the update.
+ 
+ ### Delta Log Entries
+-Delta files are stored as JSON in a directory at the root of the table named `_delta_log`, and together with checkpoints make up the log of all changes that have occurred to a table.
+ 
+-Delta files are the unit of atomicity for a table, and are named using the next available version number, zero-padded to 20 digits.
++Delta Log Entries, also known as Delta files, are JSON files stored in the `_delta_log`
++directory at the root of the table. Together with checkpoints, they make up the log of all changes
++that have occurred to a table. Delta files are the unit of atomicity for a table, and are named
++using the next available version number, zero-padded to 20 digits.
+ 
+ For example:
+ 
+ ```
+ ./_delta_log/00000000000000000000.json
+ ```
+-Delta files use new-line delimited JSON format, where every action is stored as a single line JSON document.
+-A delta file, `n.json`, contains an atomic set of [_actions_](#Actions) that should be applied to the previous table state, `n-1.json`, in order to the construct `n`th snapshot of the table.
+-An action changes one aspect of the table's state, for example, adding or removing a file.
++
++Delta files use newline-delimited JSON format, where every action is stored as a single-line
++JSON document. A Delta file, corresponding to version `v`, contains an atomic set of
++[_actions_](#actions) that should be applied to the previous table state corresponding to version
++`v-1`, in order to construct the `v`th snapshot of the table. An action changes one aspect of the
++table's state, for example, adding or removing a file.
++
++**Note:** If the [catalogManaged table feature](#catalog-managed-tables) is enabled on the table,
++recently [ratified commits](#ratified-commit) may not yet be published to the `_delta_log` directory as normal Delta
++files - they may be stored directly by the catalog or reside in the `_delta_log/_staged_commits`
++directory. Delta clients must contact the table's managing catalog in order to find the information
++about these [ratified, potentially-unpublished commits](#publishing-commits).
++
++The `_delta_log/_staged_commits` directory is the staging area for [staged](#staged-commit)
++commits. Delta files in this directory have a UUID embedded into them and follow the pattern
++`<version>.<uuid>.json`, where the version corresponds to the proposed commit version, zero-padded
++to 20 digits.
++
++For example:
++
++```
++./_delta_log/_staged_commits/00000000000000000000.3a0d65cd-4056-49b8-937b-95f9e3ee90e5.json
++./_delta_log/_staged_commits/00000000000000000001.7d17ac10-5cc3-401b-bd1a-9c82dd2ea032.json
++./_delta_log/_staged_commits/00000000000000000001.016ae953-37a9-438e-8683-9a9a4a79a395.json
++./_delta_log/_staged_commits/00000000000000000002.3ae45b72-24e1-865a-a211-34987ae02f2a.json
++```
++
++NOTE: The (proposed) version number of a staged commit is authoritative - file
++`00000000000000000100.<uuid>.json` always corresponds to a commit attempt for version 100. Besides
++simplifying implementations, it also acknowledges the fact that commit files cannot safely be reused
++for multiple commit attempts. For example, resolving conflicts in a table with [row
++tracking](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#row-tracking) enabled requires
++rewriting all file actions to update their `baseRowId` field.
++
++The [catalog](#terminology-catalogs) is the source of truth about which staged commit files in
++the `_delta_log/_staged_commits` directory correspond to ratified versions, and Delta clients should
++not attempt to directly interpret the contents of that directory. Refer to
++[catalog-managed tables](#catalog-managed-tables) for more details.
+ 
+ ### Checkpoints
+ Checkpoints are also stored in the `_delta_log` directory, and can be created at any time, for any committed version of the table.
+ ### Commit Provenance Information
+ A delta file can optionally contain additional provenance information about what higher-level operation was being performed as well as who executed it.
+ 
++When the `catalogManaged` table feature is enabled, the `commitInfo` action must have a field
++`txnId` that stores a unique transaction identifier string.
++
+ Implementations are free to store any valid JSON-formatted data via the `commitInfo` action.
+ 
+ When [In-Commit Timestamps](#in-commit-timestamps) are enabled, writers are required to include a `commitInfo` action with every commit, which must include the `inCommitTimestamp` field. Also, the `commitInfo` action must be first action in the commit.
+  - A single `protocol` action
+  - A single `metaData` action
+  - A collection of `txn` actions with unique `appId`s
+- - A collection of `domainMetadata` actions with unique `domain`s.
++ - A collection of `domainMetadata` actions with unique `domain`s, excluding tombstones (i.e. actions with `removed=true`).
+  - A collection of `add` actions with unique path keys, corresponding to the newest (path, deletionVector.uniqueId) pair encountered for each path.
+  - A collection of `remove` actions with unique `(path, deletionVector.uniqueId)` keys. The intersection of the primary keys in the `add` collection and `remove` collection must be empty. That means a logical file cannot exist in both the `remove` and `add` collections at the same time; however, the same *data file* can exist with *different* DVs in the `remove` collection, as logically they represent different content. The `remove` actions act as _tombstones_, and only exist for the benefit of the VACUUM command. Snapshot reads only return `add` actions on the read path.
+  
+      - write a `metaData` action to add the `delta.columnMapping.mode` table property.
+  - Write data files by using the _physical name_ that is chosen for each column. The physical name of the column is static and can be different than the _display name_ of the column, which is changeable.
+  - Write the 32 bit integer column identifier as part of the `field_id` field of the `SchemaElement` struct in the [Parquet Thrift specification](https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift).
+- - Track partition values and column level statistics with the physical name of the column in the transaction log.
++ - Track partition values, column level statistics, and [clustering column](#clustered-table) names with the physical name of the column in the transaction log.
+  - Assign a globally unique identifier as the physical name for each new column that is added to the schema. This is especially important for supporting cheap column deletions in `name` mode. In addition, column identifiers need to be assigned to each column. The maximum id that is assigned to a column is tracked as the table property `delta.columnMapping.maxColumnId`. This is an internal table property that cannot be configured by users. This value must increase monotonically as new columns are introduced and committed to the table alongside the introduction of the new columns to the schema.
+ 
+ ## Reader Requirements for Column Mapping
+ ## Writer Requirement for Deletion Vectors
+ When adding a logical file with a deletion vector, that logical file must have correct `numRecords` information for the data file in the `stats` field.
+ 
++# Catalog-managed tables
++
++With this feature enabled, the [catalog](#terminology-catalogs) that manages the table becomes the
++source of truth for whether a given commit attempt succeeded.
++
++The table feature defines the parts of the [commit protocol](#commit-protocol) that directly impact
++the Delta table (e.g. atomicity requirements, publishing, etc). The Delta client and catalog
++together are responsible for implementing the Delta-specific aspects of commit as defined by this
++spec, but are otherwise free to define their own APIs and protocols for communication with each
++other.
++
++**NOTE**: Filesystem-based access to catalog-managed tables is not supported. Delta clients are
++expected to discover and access catalog-managed tables through the managing catalog, not by direct
++listing in the filesystem. This feature is primarily designed to warn filesystem-based readers that
++might attempt to access a catalog-managed table's storage location without going through the catalog
++first, and to block filesystem-based writers who could otherwise corrupt both the table and the
++catalog by failing to commit through the catalog.
++
++Before we can go into details of this protocol feature, we must first align our terminology.
++
++## Terminology: Commits
++
++A commit is a set of [actions](#actions) that transform a Delta table from version `v - 1` to `v`.
++It contains the same kind of content as is stored in a [Delta file](#delta-log-entries).
++
++A commit may be stored in the file system as a Delta file - either _published_ or _staged_ - or
++stored _inline_ in the managing catalog, using whatever format the catalog prefers.
++
++There are several types of commits:
++
++1. **Proposed commit**:  A commit that a Delta client has proposed for the next version of the
++   table. It could be _staged_ or _inline_. It will either become _ratified_ or be rejected.
++
++2. <a name="staged-commit">**Staged commit**</a>: A commit that is written to disk at
++   `_delta_log/_staged_commits/<v>.<uuid>.json`. It has the same content and format as a published
++   Delta file.
++    - Here, the `uuid` is a random UUID that is generated for each commit and `v` is the version
++      which is proposed to be committed, zero-padded to 20 digits.
++    - The mere existence of a staged commit does not mean that the file has been ratified or even
++      proposed. It might correspond to a failed or in-progress commit attempt.
++    - The catalog is the source of truth for which staged commits are ratified.
++    - The catalog stores only the location, not the content, of a staged (and ratified) commit.
++
++3. <a name="inline-commit">**Inline commit**</a>: A proposed commit that is not written to disk but
++   rather has its content sent to the catalog for the catalog to store directly.
++
++4. <a name="ratified-commit">**Ratified commit**</a>: A proposed commit that a catalog has
++   determined has won the commit at the desired version of the table.
++    - The catalog must store ratified commits (that is, the staged commit's location or the inline
++      commit's content) until they are published to the `_delta_log` directory.
++    - A ratified commit may or may not yet be published.
++    - A ratified commit may or may not even be stored by the catalog at all - the catalog may
++      have just atomically published it to the filesystem directly, relying on PUT-if-absent
++      primitives to facilitate the ratification and publication all in one step.
++
++5. <a name="published-commit">**Published commit**</a>: A ratified commit that has been copied into
++   the `_delta_log` as a normal Delta file, i.e. `_delta_log/<v>.json`.
++    - Here, the `v` is the version which is being committed, zero-padded to 20 digits.
++    - The existence of a `<v>.json` file proves that the corresponding version `v` is ratified,
++      regardless of whether the table is catalog-managed or filesystem-based. The catalog is allowed
++      to return information about published commits, but Delta clients can also use filesystem
++      listing operations to directly discover them.
++    - Published commits do not need to be stored by the catalog.
++
++## Terminology: Delta Client
++
++This is the component that implements support for reading and writing Delta tables, and implements
++the logic required by the `catalogManaged` table feature. Among other things, it
++- triggers the filesystem listing, if needed, to discover published commits
++- generates the commit content (the set of [actions](#actions))
++- works together with the query engine to trigger the commit process and invoke the client-side
++  catalog component with the commit content
++
++The Delta client is also responsible for defining the client-side API that catalogs should target.
++That is, there must be _some_ API that the [catalog client](#catalog-client) can use to communicate
++to the Delta client the subset of catalog-managed information that the Delta client cares about.
++This protocol feature is concerned with what information Delta cares about, but leaves to Delta
++clients the design of the API they use to obtain that information from catalog clients.
++
++## Terminology: Catalogs
++
++1. **Catalog**: A catalog is an entity which manages a Delta table, including its creation, writes,
++   reads, and eventual deletion.
++    - It could be backed by a database, a filesystem, or any other persistence mechanism.
++    - Each catalog has its own spec for how catalog clients should interact with it, and how
++      it performs a commit.
++
++2. <a name="catalog-client">**Catalog Client**</a>: The catalog always has a client-side component
++   which the Delta client interacts with directly. This client-side component has two primary
++   responsibilities:
++    - implement any client-side catalog-specific logic (such as staging or
++      [publishing](#publishing-commits) commits)
++    - communicate with the Catalog Server, if any
++
++3. **Catalog Server**: The catalog may also involve a server-side component which the client-side
++   component is responsible for communicating with.
++    - This server is responsible for coordinating commits and potentially persisting table metadata
++      and enforcing authorization policies.
++    - Not all catalogs require a server; some may be entirely client-side, e.g. filesystem-backed
++      catalogs, or they may make use of a generic database server and implement all of the catalog's
++      business logic client-side.
++
++**NOTE**: This specification outlines the responsibilities and actions that catalogs must implement.
++This spec does its best not to assume any specific catalog _implementation_, though it does call out
++likely client-side and server-side responsibilities. Nonetheless, what a given catalog does
++client-side or server-side is up to each catalog implementation to decide for itself.
++
++## Catalog Responsibilities
++
++When the `catalogManaged` table feature is enabled, a catalog performs commits to the table on behalf
++of the Delta client.
++
++As stated above, the Delta spec does not mandate any particular client-server design or API for
++catalogs that manage Delta tables. However, the catalog does need to provide certain capabilities
++for reading and writing Delta tables:
++
++- Atomically commit a version `v` with a given set of `actions`. This is explained in detail in the
++  [commit protocol](#commit-protocol) section.
++- Retrieve information about recent ratified commits and the latest ratified version on the table.
++  This is explained in detail in the [Getting Ratified Commits from the Catalog](#getting-ratified-commits-from-the-catalog) section.
++- Though not required, it is encouraged that catalogs also return the latest table-level metadata,
++  such as the latest Protocol and Metadata actions, for the table. This can provide significant
++  performance advantages to conforming Delta clients, who may forgo log replay and instead trust
++  the information provided by the catalog during query planning.
++
++## Reading Catalog-managed Tables
++
++A catalog-managed table can have a mix of (a) published and (b) ratified but non-published commits.
++The catalog is the source of truth for ratified commits. Also recall that ratified commits can be
++[staged commits](#staged-commit) that are persisted to the `_delta_log/_staged_commits` directory,
++or [inline commits](#inline-commit) whose content the catalog stores directly.
++
++For example, suppose the `_delta_log` directory contains the following files:
++
++```
++00000000000000000000.json
++00000000000000000001.json
++00000000000000000002.checkpoint.parquet
++00000000000000000002.json
++00000000000000000003.00000000000000000005.compacted.json
++00000000000000000003.json
++00000000000000000004.json
++00000000000000000005.json
++00000000000000000006.json
++00000000000000000007.json
++_staged_commits/00000000000000000007.016ae953-37a9-438e-8683-9a9a4a79a395.json // ratified and published
++_staged_commits/00000000000000000008.7d17ac10-5cc3-401b-bd1a-9c82dd2ea032.json // ratified
++_staged_commits/00000000000000000008.b91807ba-fe18-488c-a15e-c4807dbd2174.json // rejected
++_staged_commits/00000000000000000010.0f707846-cd18-4e01-b40e-84ee0ae987b0.json // not yet ratified
++_staged_commits/00000000000000000010.7a980438-cb67-4b89-82d2-86f73239b6d6.json // partial file
++```
++
++Further, suppose the catalog stores the following ratified commits:
++```
++{
++  7  -> "00000000000000000007.016ae953-37a9-438e-8683-9a9a4a79a395.json",
++  8  -> "00000000000000000008.7d17ac10-5cc3-401b-bd1a-9c82dd2ea032.json",
++  9  -> <inline commit: content stored by the catalog directly>
++}
++```
++
++Some things to note are:
++- the catalog isn't aware that commit 7 was already published - perhaps the response from the
++  filesystem was dropped
++- commit 9 is an inline commit
++- neither of the two staged commits for version 10 has been ratified
++
++To read such tables, Delta clients must first contact the catalog to get the ratified commits. This
++informs the Delta client of commits [7, 9] as well as the latest ratified version, 9.
++
++If this information is insufficient to construct a complete snapshot of the table, Delta clients
++must LIST the `_delta_log` directory to get information about the published commits. For commits
++that are both returned by the catalog and already published, Delta clients must treat the catalog's
++version as authoritative and read the commit returned by the catalog. Additionally, Delta clients
++must ignore any files with versions greater than the latest ratified commit version returned by the
++catalog.
++
++Combining these two sets of files and commits enables Delta clients to generate a snapshot at the
++latest version of the table.
++
++**NOTE**: This spec prescribes the _minimum_ required interactions between Delta clients and
++catalogs for commits. Catalogs may very well expose APIs and work with Delta clients to be
++informed of other non-commit [file types](#file-types), such as checkpoint, log
++compaction, and version checksum files. This would allow catalogs to return additional
++information to Delta clients during query and scan planning, potentially allowing Delta
++clients to avoid LISTing the filesystem altogether.
++
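As a hedged illustration of the read rules above (not part of the patch or the spec), the following Java sketch merges a filesystem listing of published commits with the catalog's response. The class `SnapshotCommitResolver`, its `resolve` method, and the `published:` / `catalog:` tags are hypothetical names chosen for this example; a real client would also take checkpoints and log compaction files into account.

```java
import java.util.Map;
import java.util.SortedMap;
import java.util.SortedSet;
import java.util.TreeMap;

// Hypothetical helper: decides, per version, where a reader takes the commit from.
final class SnapshotCommitResolver {

    /**
     * @param publishedVersions     versions found as <v>.json by LISTing _delta_log
     * @param catalogCommits        ratified commits the catalog still stores (version -> location or content)
     * @param latestRatifiedVersion latest ratified version reported by the catalog
     * @return version -> tagged source, in ascending version order
     */
    static SortedMap<Long, String> resolve(
            SortedSet<Long> publishedVersions,
            Map<Long, String> catalogCommits,
            long latestRatifiedVersion) {
        SortedMap<Long, String> resolved = new TreeMap<>();

        // Published files are usable, except anything beyond the latest ratified version.
        for (long v : publishedVersions) {
            if (v <= latestRatifiedVersion) {
                resolved.put(v, "published:" + String.format("%020d.json", v));
            }
        }

        // Catalog-supplied commits are authoritative: they override a published
        // file for the same version.
        for (Map.Entry<Long, String> e : catalogCommits.entrySet()) {
            if (e.getKey() <= latestRatifiedVersion) {
                resolved.put(e.getKey(), "catalog:" + e.getValue());
            }
        }
        return resolved;
    }
}
```

With the example listing above (published versions 0 through 7, catalog holding 7 through 9, latest ratified version 9), this sketch resolves versions 0 to 6 from published files and versions 7 to 9 from the catalog.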
++## Commit Protocol
++
++To start, Delta clients send the desired actions to be committed to the client-side component of the
++catalog.
++
++This component then has several options for proposing, ratifying, and publishing the commit,
++detailed below.
++
++- Option 1: Write the actions (likely client-side) to a [staged commit file](#staged-commit) in the
++  `_delta_log/_staged_commits` directory and then ratify the staged commit (likely server-side) by
++  atomically recording (in persistent storage of some kind) that the file corresponds to version `v`.
++- Option 2: Treat this as an [inline commit](#inline-commit) (likely the client-side
++  component sends the contents to the server-side component) and atomically record (in persistent
++  storage of some kind) the content of the commit as version `v` of the table.
++- Option 3: Catalog implementations that use PUT-if-absent (client- or server-side) can ratify and
++  publish all-in-one by atomically writing a [published commit file](#published-commit)
++  in the `_delta_log` directory. Note that this commit will be considered to have succeeded as soon
++  as the file becomes visible in the filesystem, regardless of when or whether the catalog is made
++  aware of the successful publish. The catalog does not need to store these files.
++
++A catalog must not ratify version `v` until it has ratified version `v - 1`, and it must ratify
++version `v` at most once.
++
++The catalog must store both flavors of ratified commits (staged or inline) and make them available
++to readers until they are [published](#publishing-commits).
++
++For performance reasons, Delta clients are encouraged to establish an API contract where the catalog
++provides the latest ratified commit information whenever a commit fails due to version conflict.
++
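The ratification rule in this section (version `v` only after `v - 1`, and at most once) can be sketched as a compare-and-advance check. This is a minimal in-memory illustration, assuming a single-process catalog; `InMemoryCatalogTable` and `tryRatify` are hypothetical names, and a real catalog would perform the same check atomically against persistent storage.

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical in-memory stand-in for the catalog's per-table commit state.
final class InMemoryCatalogTable {
    // Ratified commits the catalog still stores (staged location or inline content).
    private final TreeMap<Long, String> ratified = new TreeMap<>();
    // -1 means that no version has been ratified yet.
    private long latest = -1L;

    /**
     * Ratifies the proposed commit only if it targets exactly latest + 1.
     * A second proposal for an already ratified version, or a proposal that
     * skips a version, is rejected.
     */
    synchronized boolean tryRatify(long version, String locationOrContent) {
        if (version != latest + 1) {
            return false;
        }
        ratified.put(version, locationOrContent);
        latest = version;
        return true;
    }

    synchronized long latestRatifiedVersion() {
        return latest;
    }

    /** Returns a copy of the ratified commits that the catalog still stores. */
    synchronized SortedMap<Long, String> storedCommits() {
        return new TreeMap<>(ratified);
    }
}
```

A rejected `tryRatify` call corresponds to a version conflict; at that point a catalog could return `latestRatifiedVersion()` and `storedCommits()` so the client can rebase and retry.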
++## Getting Ratified Commits from the Catalog
++
++Even after a commit is ratified, it is not discoverable through filesystem operations until it is
++[published](#publishing-commits).
++
++The catalog client is responsible for implementing an API (defined by the Delta client) that Delta clients can
++use to retrieve the latest ratified commit version (authoritative), as well as the set of ratified
++commits the catalog is still storing for the table. If some commits needed to complete the snapshot
++are not stored by the catalog, as they are already published, Delta clients can issue a filesystem
++LIST operation to retrieve them.
++
++Delta clients must establish an API contract where the catalog provides ratified commit information
++as part of the standard table resolution process performed at query planning time.
++
++## Publishing Commits
++
++Publishing is the process of copying the ratified commit with version `<v>` to
++`_delta_log/<v>.json`. The ratified commit may be a staged commit located in
++`_delta_log/_staged_commits/<v>.<uuid>.json`, or it may be an inline commit whose content the
++catalog stores itself. Because the content of a ratified commit is immutable, it does not matter
++whether the client-side, server-side, or both catalog components initiate publishing.
++
++Implementations are strongly encouraged to publish commits promptly. This reduces the number of
++commits the catalog needs to store internally (and serve up to readers).
++
++Commits must be published _in order_. That is, version `v - 1` must be published _before_ version
++`v`.
++
++**NOTE**: Because commit publishing can happen at any time after the commit succeeds, the file
++modification timestamp of the published file will not accurately reflect the original commit time.
++For this reason, catalog-managed tables must use [in-commit-timestamps](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#in-commit-timestamps)
++to ensure stability of time travel reads. Refer to [Writer Requirements for Catalog-managed Tables](#writer-requirements-for-catalog-managed-tables)
++section for more details.
++
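The naming and ordering rules for publishing can be sketched as follows. This is an illustrative Java fragment under stated assumptions, not the PR's implementation: `CommitPublisher` and its methods are hypothetical names, and the actual copy into `_delta_log` is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;

// Hypothetical helper for file naming and in-order publish planning.
final class CommitPublisher {

    /** File name of a published commit: the version zero-padded to 20 digits. */
    static String publishedFileName(long version) {
        return String.format("%020d.json", version);
    }

    /** File name of a staged commit inside _delta_log/_staged_commits. */
    static String stagedFileName(long version, String uuid) {
        return String.format("%020d.%s.json", version, uuid);
    }

    /**
     * Returns the versions that may be published now, in order. Publishing
     * stops at the first gap, because version v - 1 must be published before v.
     */
    static List<Long> publishableVersions(SortedSet<Long> ratifiedVersions, long lastPublishedVersion) {
        List<Long> plan = new ArrayList<>();
        long next = lastPublishedVersion + 1;
        for (long v : ratifiedVersions) {
            if (v < next) {
                continue; // already published
            }
            if (v != next) {
                break; // gap: an earlier version is not available for publishing
            }
            plan.add(v);
            next++;
        }
        return plan;
    }
}
```

For the earlier example, where version 7 is already published and the catalog holds versions 7 through 9, `publishableVersions` yields 8 and then 9.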
++## Maintenance Operations on Catalog-managed Tables
++
++[Checkpoints](#checkpoints-1) and [Log Compaction Files](#log-compaction-files) can only be created
++for versions that are already published in the `_delta_log`. In other words, in order to checkpoint
++version `v` or produce a log compaction file for commit range `x <= v <= y`, `_delta_log/<v>.json`
++must exist.
++
++Notably, the [Version Checksum File](#version-checksum-file) for version `v` _can_ be created in the
++`_delta_log` even if the commit for version `v` is not published.
++
++By default, maintenance operations are prohibited unless the managing catalog explicitly permits
++the client to run them. The only exceptions are checkpoints, log compaction, and version checksum,
++as they are essential for all basic table operations (e.g. reads and writes) to operate reliably.
++All other maintenance operations such as the following are not allowed by default.
++- [Log and other metadata file cleanup](#metadata-cleanup).
++- Data file cleanup, for example VACUUM.
++- Data layout changes, for example OPTIMIZE and REORG.
++
++## Creating and Dropping Catalog-managed Tables
++
++The catalog and query engine ultimately dictate how to create and drop catalog-managed tables.
++
++As one example, table creation often works in three phases:
++
++1. An initial catalog operation to obtain a unique storage location which serves as an unnamed
++   "staging" table
++2. A table operation that physically initializes a new `catalogManaged`-enabled table at the staging
++   location.
++3. A final catalog operation that registers the new table with its intended name.
++
++Delta clients would primarily be involved with the second step, but an implementation could choose
++to combine the second and third steps so that a single catalog call registers the table as part of
++the table's first commit.
++
++As another example, dropping a table can be as simple as removing its name from the catalog (a "soft
++delete"), followed at some later point by a "hard delete" that physically purges the data. The Delta
++client would not be involved at all in this process, because no commits are made to the table.
++
++## Catalog-managed Table Enablement
++
++The `catalogManaged` table feature is supported and active when:
++- The table is on Reader Version 3 and Writer Version 7.
++- The table has a `protocol` action with `readerFeatures` and `writerFeatures` both containing the
++  feature `catalogManaged`.
++
++## Writer Requirements for Catalog-managed tables
++
++When supported and active:
++
++- Writers must discover and access the table using catalog calls, which happens _before_ the table's
++  protocol is known. See [Table Discovery](#table-discovery) for more details.
++- The [in-commit-timestamps](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#in-commit-timestamps)
++  table feature must be supported and active.
++- The `commitInfo` action must also contain a field `txnId` that stores a unique transaction
++  identifier string.
++- Writers must follow the catalog's [commit protocol](#commit-protocol) and must not perform
++  ordinary filesystem-based commits against the table.
++- Writers must follow the catalog's [maintenance operation protocol](#maintenance-operations-on-catalog-managed-tables).
++
++## Reader Requirements for Catalog-managed tables
++
++When supported and active:
++
++- Readers must discover the table using catalog calls, which happens before the table's protocol
++  is known. See [Table Discovery](#table-discovery) for more details.
++- Readers must contact the catalog for information about unpublished ratified commits.
++- Readers must follow the rules described in the [Reading Catalog-managed Tables](#reading-catalog-managed-tables)
++  section above. Notably:
++  - If the catalog reports `v` as the latest version, clients must ignore any later versions that
++    may have been published.
++  - When the catalog returns a ratified commit for version `v`, readers must use that
++    catalog-supplied commit and ignore any published Delta file for version `v` that might also be
++    present.
++
++## Table Discovery
++
++The requirements above state that readers and writers must discover and access the table using
++catalog calls, which occurs _before_ the table's protocol is known. This raises an important
++question: how can a client discover a `catalogManaged` Delta table without first knowing that it
++_is_, in fact, `catalogManaged` (according to the protocol)?
++
++To solve this, first note that, in practice, catalog-integrated engines already ask the catalog to
++resolve a table name to its storage location during the name resolution step. This protocol
++therefore encourages that the same name resolution step also indicate whether the table is
++catalog-managed. Surfacing this at the very moment the catalog returns the path imposes no extra
++round-trips, yet it lets the client decide — early and unambiguously — whether to follow the
++`catalogManaged` read and write rules.
++
++## Sample Catalog Client API
++
++The following is an example of a possible API which a JVM-based Delta client might require catalog
++implementations to target:
++
++```scala
++
++trait CatalogManagedTable {
++    /**
++     * Commits the given set of `actions` to the given commit `version`.
++     *
++     * @param version The version we want to commit.
++     * @param actions Actions that need to be committed.
++     *
++     * @return CommitResponse which has details around the new committed delta file.
++     */
++    def commit(
++        version: Long,
++        actions: Iterator[String]): CommitResponse
++
++    /**
++     * Retrieves a (possibly empty) suffix of ratified commits in the range [startVersion,
++     * endVersion] for this table.
++     * 
++     * Some of these ratified commits may already have been published. Some of them may be staged,
++     * in which case the staged commit file path is returned; others may be inline, in which case
++     * the inline commit content is returned.
++     * 
++     * The returned commits are sorted in ascending version number and are contiguous.
++     *
++     * If neither start nor end version is specified, the catalog will return all available ratified
++     * commits (possibly empty, if all commits have been published).
++     *
++     * In all cases, the response also includes the table's latest ratified commit version.
++     *
++     * @return GetCommitsResponse which contains an ordered list of ratified commits
++     *         stored by the catalog, as well as the table's latest ratified commit version.
++     */
++    def getRatifiedCommits(
++        startVersion: Option[Long],
++        endVersion: Option[Long]): GetCommitsResponse
++}
++```
++
++Note that the above is only one example of a possible Catalog Client API. It is also _NOT_ a catalog
++API (no table discovery, ACL, create/drop, etc). The Delta protocol is agnostic to API details, and
++the API surface Delta clients define should only cover the specific catalog capabilities that the
++Delta client needs to correctly read and write catalog-managed tables.
++
+ # Iceberg Compatibility V1
+ 
+ This table feature (`icebergCompatV1`) ensures that Delta tables can be converted to Apache Iceberg™ format, though this table feature does not implement or specify that conversion.
+  * Files that have been [added](#Add-File-and-Remove-File) and not yet removed
+  * Files that were recently [removed](#Add-File-and-Remove-File) and have not yet expired
+  * [Transaction identifiers](#Transaction-Identifiers)
+- * [Domain Metadata](#Domain-Metadata)
++ * [Domain Metadata](#Domain-Metadata) that have not been removed (i.e. excluding tombstones with `removed=true`)
+  * [Checkpoint Metadata](#checkpoint-metadata) - Requires [V2 checkpoints](#v2-spec)
+  * [Sidecar File](#sidecar-files) - Requires [V2 checkpoints](#v2-spec)
+ 
+ 1. Identify a threshold (in days) up to which we want to preserve the deltaLog. Let's refer to
+ midnight UTC of that day as `cutOffTimestamp`. The newest commit not newer than the `cutOffTimestamp` is
+ the `cutOffCommit`, because a commit exactly at midnight is an acceptable cutoff. We want to retain everything including and after the `cutOffCommit`.
+-2. Identify the newest checkpoint that is not newer than the `cutOffCommit`. A checkpoint at the `cutOffCommit` is ideal, but an older one will do. Lets call it `cutOffCheckpoint`.
+-We need to preserve the `cutOffCheckpoint` (both the checkpoint file and the JSON commit file at that version) and all commits after it. The JSON commit file at the `cutOffCheckpoint` version must be preserved because checkpoints do not preserve [commit provenance information](#commit-provenance-information) (e.g., `commitInfo` actions), which may be required by table features such as [In-Commit Timestamps](#in-commit-timestamps). All commits after `cutOffCheckpoint` must be preserved to enable time travel for commits between `cutOffCheckpoint` and the next available checkpoint.
+-3. Delete all [delta log entries](#delta-log-entries) and [checkpoint files](#checkpoints) before the
+-`cutOffCheckpoint` checkpoint. Also delete all the [log compaction files](#log-compaction-files) having startVersion <= `cutOffCheckpoint`'s version.
++2. Identify the newest checkpoint that is not newer than the `cutOffCommit`. A checkpoint at the `cutOffCommit` is ideal, but an older one will do. Let's call it `cutOffCheckpoint`.
++We need to preserve the `cutOffCheckpoint` (both the checkpoint file and the JSON commit file at that version) and all published commits after it. The JSON commit file at the `cutOffCheckpoint` version must be preserved because checkpoints do not preserve [commit provenance information](#commit-provenance-information) (e.g., `commitInfo` actions), which may be required by table features such as [In-Commit Timestamps](#in-commit-timestamps). All published commits after `cutOffCheckpoint` must be preserved to enable time travel for commits between `cutOffCheckpoint` and the next available checkpoint.
++    - If no `cutOffCheckpoint` can be found, do not proceed with metadata cleanup as there is
++      nothing to clean up.
++3. Delete all [delta log entries](#delta-log-entries), [checkpoint files](#checkpoints), and
++   [version checksum files](#version-checksum-file) before the `cutOffCheckpoint` checkpoint. Also delete all the [log compaction files](#log-compaction-files)
++   having startVersion <= `cutOffCheckpoint`'s version.
++    - Also delete all the [staged commit files](#staged-commit) having version <=
++      `cutOffCheckpoint`'s version from the `_delta_log/_staged_commits` directory.
+ 4. Now read all the available [checkpoints](#checkpoints-1) in the _delta_log directory and identify
+ the corresponding [sidecar files](#sidecar-files). These sidecar files need to be protected.
+ 5. List all the files in `_delta_log/_sidecars` directory, preserve files that are less than a day
+ [Timestamp without Timezone](#timestamp-without-timezone-timestampNtz) | `timestampNtz` | Readers and writers
+ [Domain Metadata](#domain-metadata) | `domainMetadata` | Writers only
+ [V2 Checkpoint](#v2-checkpoint-table-feature) | `v2Checkpoint` | Readers and writers
++[Catalog-managed Tables](#catalog-managed-tables) | `catalogManaged` | Readers and writers
+ [Iceberg Compatibility V1](#iceberg-compatibility-v1) | `icebergCompatV1` | Writers only
+ [Iceberg Compatibility V2](#iceberg-compatibility-v2) | `icebergCompatV2` | Writers only
+ [Clustered Table](#clustered-table) | `clustering` | Writers only
\ No newline at end of file

README.md

@@ -0,0 +1,10 @@
+diff --git a/README.md b/README.md
+--- a/README.md
++++ b/README.md
+ ## Building
+ 
+ Delta Lake is compiled using [SBT](https://www.scala-sbt.org/1.x/docs/Command-Line-Reference.html).
++Ensure that your Java version is at least 17 (you can verify with `java -version`).
+ 
+ To compile, run
+ 
\ No newline at end of file

build.sbt

@@ -0,0 +1,218 @@
+diff --git a/build.sbt b/build.sbt
+--- a/build.sbt
++++ b/build.sbt
+       allMappings.distinct
+     },
+ 
+-    // Exclude internal modules from published POM
++    // Exclude internal modules from published POM and add kernel dependencies.
++    // Kernel modules are transitive through sparkV2 (an internal module), so they
++    // are lost when sparkV2 is filtered out. We re-add them explicitly here.
+     pomPostProcess := { node =>
+       val internalModules = internalModuleNames.value
++      val ver = version.value
+       import scala.xml._
+       import scala.xml.transform._
++
++      def kernelDependencyNode(artifactId: String): Elem = {
++        <dependency>
++          <groupId>io.delta</groupId>
++          <artifactId>{artifactId}</artifactId>
++          <version>{ver}</version>
++        </dependency>
++      }
++
++      val kernelDeps = Seq(
++        kernelDependencyNode("delta-kernel-api"),
++        kernelDependencyNode("delta-kernel-defaults"),
++        kernelDependencyNode("delta-kernel-unitycatalog")
++      )
++
+       new RuleTransformer(new RewriteRule {
+         override def transform(n: Node): Seq[Node] = n match {
+-          case e: Elem if e.label == "dependency" =>
+-            val artifactId = (e \ "artifactId").text
+-            // Check if artifactId starts with any internal module name
+-            // (e.g., "delta-spark-v1_4.1_2.13" starts with "delta-spark-v1")
+-            val isInternal = internalModules.exists(module => artifactId.startsWith(module))
+-            if (isInternal) Seq.empty else Seq(n)
++          case e: Elem if e.label == "dependencies" =>
++            val filtered = e.child.filter {
++              case child: Elem if child.label == "dependency" =>
++                val artifactId = (child \ "artifactId").text
++                !internalModules.exists(module => artifactId.startsWith(module))
++              case _ => true
++            }
++            Seq(e.copy(child = filtered ++ kernelDeps))
+           case _ => Seq(n)
+         }
+       }).transform(node).head
+     commonSettings,
+     scalaStyleSettings,
+     releaseSettings,
+-    CrossSparkVersions.sparkDependentModuleName(sparkVersion),
++    // Set sparkVersion directly (not sparkDependentModuleName) so that
++    // runOnlyForReleasableSparkModules discovers this module, but without adding a Spark
++    // suffix to the artifact name. delta-contribs is only published as delta-contribs_2.13.
++    sparkVersion := CrossSparkVersions.getSparkVersion(),
+     Compile / packageBin / mappings := (Compile / packageBin / mappings).value ++
+       listPythonFiles(baseDirectory.value.getParentFile / "python"),
+ 
+   ).configureUnidoc()
+ 
+ 
+-val unityCatalogVersion = "0.3.1"
++val unityCatalogVersion = "0.4.0"
+ val sparkUnityCatalogJacksonVersion = "2.15.4" // We are using Spark 4.0's Jackson version 2.15.x, to override Unity Catalog 0.3.0's version 2.18.x
+ 
+ lazy val sparkUnityCatalog = (project in file("spark/unitycatalog"))
+     libraryDependencies ++= Seq(
+       "org.apache.spark" %% "spark-sql" % sparkVersion.value % "provided",
+ 
+-      "io.delta" %% "delta-sharing-client" % "1.3.9",
++      "io.delta" %% "delta-sharing-client" % "1.3.10",
+ 
+       // Test deps
+       "org.scalatest" %% "scalatest" % scalaTestVersion % "test",
+ 
+       // Test Deps
+       "org.scalatest" %% "scalatest" % scalaTestVersion % "test",
++      // Jackson datatype module needed for UC SDK tests (excluded from main compile scope)
++      "com.fasterxml.jackson.datatype" % "jackson-datatype-jsr310" % "2.15.4" % "test",
+     ),
+ 
+     // Unidoc settings
+     commonSettings,
+     scalaStyleSettings,
+     releaseSettings,
+-    CrossSparkVersions.sparkDependentModuleName(sparkVersion),
++    // Set sparkVersion directly (not sparkDependentModuleName) so that
++    // runOnlyForReleasableSparkModules discovers this module, but without adding a Spark
++    // suffix to the artifact name. delta-iceberg is only published as delta-iceberg_2.13.
++    sparkVersion := CrossSparkVersions.getSparkVersion(),
+     libraryDependencies ++= {
+       if (supportIceberg) {
+         Seq(
+           "org.xerial" % "sqlite-jdbc" % "3.45.0.0" % "test",
+           "org.apache.httpcomponents.core5" % "httpcore5" % "5.2.4" % "test",
+           "org.apache.httpcomponents.client5" % "httpclient5" % "5.3.1" % "test",
+-          "org.apache.iceberg" %% icebergSparkRuntimeArtifactName % "1.10.0" % "provided"
++          "org.apache.iceberg" %% icebergSparkRuntimeArtifactName % "1.10.0" % "provided",
++          // For FixedGcsAccessTokenProvider (GCS server-side planning credentials)
++          "com.google.cloud.bigdataoss" % "util-hadoop" % "hadoop3-2.2.26" % "provided"
+         )
+       } else {
+         Seq.empty
+   )
+ // scalastyle:on println
+ 
+-val icebergShadedVersion = "1.10.0"
++val icebergShadedVersion = "1.10.1"
+ lazy val icebergShaded = (project in file("icebergShaded"))
+   .dependsOn(spark % "provided")
+   .disablePlugins(JavaFormatterPlugin, ScalafmtPlugin)
+     commonSettings,
+     scalaStyleSettings,
+     releaseSettings,
+-    CrossSparkVersions.sparkDependentSettings(sparkVersion),
+-    libraryDependencies ++= Seq(
+-      "org.apache.hudi" % "hudi-java-client" % "0.15.0" % "compile" excludeAll(
+-        ExclusionRule(organization = "org.apache.hadoop"),
+-        ExclusionRule(organization = "org.apache.zookeeper"),
+-      ),
+-      "org.apache.spark" %% "spark-avro" % sparkVersion.value % "test" excludeAll ExclusionRule(organization = "org.apache.hadoop"),
+-      "org.apache.parquet" % "parquet-avro" % "1.12.3" % "compile"
+-    ),
++    // Set sparkVersion directly (not sparkDependentModuleName) so that
++    // runOnlyForReleasableSparkModules discovers this module, but without adding a Spark
++    // suffix to the artifact name. delta-hudi is only published as delta-hudi_2.13.
++    sparkVersion := CrossSparkVersions.getSparkVersion(),
++    libraryDependencies ++= {
++      if (supportHudi) {
++        Seq(
++          "org.apache.hudi" % "hudi-java-client" % "0.15.0" % "compile" excludeAll(
++            ExclusionRule(organization = "org.apache.hadoop"),
++            ExclusionRule(organization = "org.apache.zookeeper"),
++          ),
++          "org.apache.spark" %% "spark-avro" % sparkVersion.value % "test" excludeAll ExclusionRule(organization = "org.apache.hadoop"),
++          "org.apache.parquet" % "parquet-avro" % "1.12.3" % "compile"
++        )
++      } else {
++        Seq.empty
++      }
++    },
++    // Skip compilation and publishing when supportHudi is false
++    Compile / skip := !supportHudi,
++    Test / skip := !supportHudi,
++    publish / skip := !supportHudi,
++    publishLocal / skip := !supportHudi,
++    publishM2 / skip := !supportHudi,
+     assembly / assemblyJarName := s"${name.value}-assembly_${scalaBinaryVersion.value}-${version.value}.jar",
+     assembly / logLevel := Level.Info,
+     assembly / test := {},
+       // crossScalaVersions must be set to Nil on the aggregating project
+       crossScalaVersions := Nil,
+       publishArtifact := false,
+-      publish / skip := false,
++      publish / skip := true,
+     )
+ }
+ 
+       // crossScalaVersions must be set to Nil on the aggregating project
+       crossScalaVersions := Nil,
+       publishArtifact := false,
+-      publish / skip := false,
++      publish / skip := true,
+     )
+ }
+ 
+     // crossScalaVersions must be set to Nil on the aggregating project
+     crossScalaVersions := Nil,
+     publishArtifact := false,
+-    publish / skip := false,
++    publish / skip := true,
+     unidocSourceFilePatterns := {
+       (kernelApi / unidocSourceFilePatterns).value.scopeToProject(kernelApi) ++
+       (kernelDefaults / unidocSourceFilePatterns).value.scopeToProject(kernelDefaults)
+     // crossScalaVersions must be set to Nil on the aggregating project
+     crossScalaVersions := Nil,
+     publishArtifact := false,
+-    publish / skip := false,
++    publish / skip := true,
+   )
+ 
+ /*
+     sys.env.getOrElse("SONATYPE_USERNAME", ""),
+     sys.env.getOrElse("SONATYPE_PASSWORD", "")
+   ),
++  credentials += Credentials(
++    "Sonatype Nexus Repository Manager",
++    "central.sonatype.com",
++    sys.env.getOrElse("SONATYPE_USERNAME", ""),
++    sys.env.getOrElse("SONATYPE_PASSWORD", "")
++  ),
+   publishTo := {
+     val ossrhBase = "https://ossrh-staging-api.central.sonatype.com/"
++    val centralSnapshots = "https://central.sonatype.com/repository/maven-snapshots/"
+     if (isSnapshot.value) {
+-      Some("snapshots" at ossrhBase + "content/repositories/snapshots")
++      Some("snapshots" at centralSnapshots)
+     } else {
+       Some("releases"  at ossrhBase + "service/local/staging/deploy/maven2")
+     }
+ // Looks like some of release settings should be set for the root project as well.
+ publishArtifact := false  // Don't release the root project
+ publish / skip := true
+-publishTo := Some("snapshots" at "https://ossrh-staging-api.central.sonatype.com/content/repositories/snapshots")
++publishTo := Some("snapshots" at "https://central.sonatype.com/repository/maven-snapshots/")
+ releaseCrossBuild := false  // Don't use sbt-release's cross facility
+ releaseProcess := Seq[ReleaseStep](
+   checkSnapshotDependencies,
+   setReleaseVersion,
+   commitReleaseVersion,
+   tagRelease
+-) ++ CrossSparkVersions.crossSparkReleaseSteps("+publishSigned") ++ Seq[ReleaseStep](
++) ++ CrossSparkVersions.crossSparkReleaseSteps("publishSigned") ++ Seq[ReleaseStep](
+ 
+   // Do NOT use `sonatypeBundleRelease` - it will actually release to Maven! We want to do that
+   // manually.
\ No newline at end of file
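
For readers skimming the `build.sbt` range-diff above: the `publishTo` hunk sends snapshot builds to the Central snapshots repository and leaves releases on the OSSRH staging API. The sketch below mirrors only that branch in Python. The function name `publish_target` is hypothetical, and the two URLs are copied from the diff; this is an illustration, not the build's actual code.

```python
from typing import Tuple

# URLs copied verbatim from the build.sbt hunk above.
OSSRH_BASE = "https://ossrh-staging-api.central.sonatype.com/"
CENTRAL_SNAPSHOTS = "https://central.sonatype.com/repository/maven-snapshots/"


def publish_target(is_snapshot: bool) -> Tuple[str, str]:
    """Return (repository name, URL) the way the publishTo setting selects it.

    Hypothetical helper: it only mirrors the if/else in the diff.
    """
    if is_snapshot:
        # After the change, snapshots no longer derive their URL from OSSRH_BASE.
        return ("snapshots", CENTRAL_SNAPSHOTS)
    # Releases are unchanged and still go through the OSSRH staging endpoint.
    return ("releases", OSSRH_BASE + "service/local/staging/deploy/maven2")
```

Note that the root-project `publishTo` line in the same diff uses the same snapshots URL, so both settings agree after the change.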

connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/.00000000000000000000.json.crc

@@ -0,0 +1,3 @@
+diff --git a/connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/.00000000000000000000.json.crc b/connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/.00000000000000000000.json.crc
+new file mode 100644
+Binary files /dev/null and b/connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/.00000000000000000000.json.crc differ
\ No newline at end of file

connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/00000000000000000000.crc

@@ -0,0 +1,5 @@
+diff --git a/connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/00000000000000000000.crc b/connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/00000000000000000000.crc
+new file mode 100644
+--- /dev/null
++++ b/connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/00000000000000000000.crc
++{"txnId":"6132e880-0f3a-4db4-b882-1da039bffbad","tableSizeBytes":0,"numFiles":0,"numMetadata":1,"numProtocol":1,"setTransactions":[],"domainMetadata":[],"metadata":{"id":"0eb3e007-b3cc-40e4-bca1-a5970d86b5a6","format":{"provider":"parquet","options":{}},"schemaString":"{\"type\":\"struct\",\"fields\":[{\"name\":\"id\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}},{\"name\":\"utf8_binary_col\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"utf8_lcase_col\",\"type\":\"string\",\"nullable\":true,\"metadata\":{\"__COLLATIONS\":{\"utf8_lcase_col\":\"spark.UTF8_LCASE\"}}},{\"name\":\"unicode_col\",\"type\":\"string\",\"nullable\":true,\"metadata\":{\"__COLLATIONS\":{\"unicode_col\":\"icu.UNICODE\"}}}]}","partitionColumns":[],"configuration":{},"createdTime":1773779518731},"protocol":{"minReaderVersion":1,"minWriterVersion":7,"writerFeatures":["domainMetadata","collations-preview","appendOnly","invariants"]},"histogramOpt":{"sortedBinBoundaries":[0,8192,16384,32768,65536,131072,262144,524288,1048576,2097152,4194304,8388608,12582912,16777216,20971520,25165824,29360128,33554432,37748736,41943040,50331648,58720256,67108864,75497472,83886080,92274688,100663296,109051904,117440512,125829120,130023424,134217728,138412032,142606336,146800640,150994944,167772160,184549376,201326592,218103808,234881024,251658240,268435456,285212672,301989888,318767104,335544320,352321536,369098752,385875968,402653184,419430400,436207616,452984832,469762048,486539264,503316480,520093696,536870912,553648128,570425344,587202560,603979776,671088640,738197504,805306368,872415232,939524096,1006632960,1073741824,1140850688,1207959552,1275068416,1342177280,1409286144,1476395008,1610612736,1744830464,1879048192,2013265920,2147483648,2415919104,2684354560,2952790016,3221225472,3489660928,3758096384,4026531840,4294967296,8589934592,17179869184,34359738368,68719476736,137438953472,274877906944],"fileCounts":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],"totalBytes":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]},"allFiles":[]}
\ No newline at end of file

... (truncated, output exceeded 60000 bytes)

_{Reproduce locally: git range-diff e8cffee..2123524 d1139d2..18d8afb | Disable: git config gitstack.push-range-diff false}

zikangh · 2026-04-02T21:55:42Z

Range-diff: master (18d8afb -> 07859c6)

.github/CODEOWNERS

@@ -0,0 +1,12 @@
+diff --git a/.github/CODEOWNERS b/.github/CODEOWNERS
+--- a/.github/CODEOWNERS
++++ b/.github/CODEOWNERS
+ /project/                       @tdas
+ /version.sbt                    @tdas
+ 
++# Spark V2 and Unified modules
++/spark/v2/                      @tdas @huan233usc @TimothyW553 @raveeram-db @murali-db
++/spark-unified/                 @tdas @huan233usc @TimothyW553 @raveeram-db @murali-db
++
+ # All files in the root directory
+ /*                              @tdas
\ No newline at end of file

.github/workflows/iceberg_test.yaml

@@ -0,0 +1,16 @@
+diff --git a/.github/workflows/iceberg_test.yaml b/.github/workflows/iceberg_test.yaml
+--- a/.github/workflows/iceberg_test.yaml
++++ b/.github/workflows/iceberg_test.yaml
+           # the above directories when we use the key for the first time. After that, each run will
+           # just use the cache. The cache is immutable so we need to use a new key when trying to
+           # cache new stuff.
+-          key: delta-sbt-cache-spark3.2-scala${{ matrix.scala }}
++          key: delta-sbt-cache-spark4.0-scala${{ matrix.scala }}
+       - name: Install Job dependencies
+         run: |
+           sudo apt-get update
+       - name: Run Scala/Java and Python tests
+         # when changing TEST_PARALLELISM_COUNT make sure to also change it in spark_master_test.yaml
+         run: |
+-          TEST_PARALLELISM_COUNT=4 pipenv run python run-tests.py --group iceberg
++          TEST_PARALLELISM_COUNT=4 pipenv run python run-tests.py --group iceberg --spark-version 4.0
\ No newline at end of file

.github/workflows/spark_examples_test.yaml

@@ -0,0 +1,54 @@
+diff --git a/.github/workflows/spark_examples_test.yaml b/.github/workflows/spark_examples_test.yaml
+--- a/.github/workflows/spark_examples_test.yaml
++++ b/.github/workflows/spark_examples_test.yaml
+         # Spark versions are dynamically generated - released versions only
+         spark_version: ${{ fromJson(needs.generate-matrix.outputs.spark_versions) }}
+         # These Scala versions must match those in the build.sbt
+-        scala: [2.13.16]
++        scala: [2.13.17]
+     env:
+       SCALA_VERSION: ${{ matrix.scala }}
+-      SPARK_VERSION: ${{ matrix.spark_version }}
+     steps:
+       - uses: actions/checkout@v3
+       - name: Get Spark version details
+         id: spark-details
+         run: |
+-          # Get JVM version, package suffix, iceberg support for this Spark version
++          # Get JVM version, package suffix, iceberg support, and full version for this Spark version
+           JVM_VERSION=$(python3 project/scripts/get_spark_version_info.py --get-field "${{ matrix.spark_version }}" targetJvm | jq -r)
+           SPARK_PACKAGE_SUFFIX=$(python3 project/scripts/get_spark_version_info.py --get-field "${{ matrix.spark_version }}" packageSuffix | jq -r)
+           SUPPORT_ICEBERG=$(python3 project/scripts/get_spark_version_info.py --get-field "${{ matrix.spark_version }}" supportIceberg | jq -r)
++          SPARK_FULL_VERSION=$(python3 project/scripts/get_spark_version_info.py --get-field "${{ matrix.spark_version }}" fullVersion | jq -r)
+           echo "jvm_version=$JVM_VERSION" >> $GITHUB_OUTPUT
+           echo "spark_package_suffix=$SPARK_PACKAGE_SUFFIX" >> $GITHUB_OUTPUT
+           echo "support_iceberg=$SUPPORT_ICEBERG" >> $GITHUB_OUTPUT
+-          echo "Using JVM $JVM_VERSION for Spark ${{ matrix.spark_version }}, package suffix: '$SPARK_PACKAGE_SUFFIX', support iceberg: '$SUPPORT_ICEBERG'"
++          echo "spark_full_version=$SPARK_FULL_VERSION" >> $GITHUB_OUTPUT
++          echo "Using JVM $JVM_VERSION for Spark $SPARK_FULL_VERSION, package suffix: '$SPARK_PACKAGE_SUFFIX', support iceberg: '$SUPPORT_ICEBERG'"
+       - name: install java
+         uses: actions/setup-java@v3
+         with:
+       - name: Run Delta Spark Local Publishing and Examples Compilation
+         # examples/scala/build.sbt will compile against the local Delta release version (e.g. 3.2.0-SNAPSHOT).
+         # Thus, we need to publishM2 first so those jars are locally accessible.
+-        # The SPARK_PACKAGE_SUFFIX env var tells examples/scala/build.sbt which artifact naming to use.
++        # -DsparkVersion is for the Delta project's publishM2 (which Spark version to compile Delta against).
++        # SPARK_VERSION/SPARK_PACKAGE_SUFFIX/SUPPORT_ICEBERG are for examples/scala/build.sbt (dependency resolution).
+         env:
+           SPARK_PACKAGE_SUFFIX: ${{ steps.spark-details.outputs.spark_package_suffix }}
+           SUPPORT_ICEBERG: ${{ steps.spark-details.outputs.support_iceberg }}
++          SPARK_VERSION: ${{ steps.spark-details.outputs.spark_full_version }}
+         run: |
+           build/sbt clean
+-          build/sbt -DsparkVersion=${{ matrix.spark_version }} publishM2
++          build/sbt -DsparkVersion=${{ steps.spark-details.outputs.spark_full_version }} publishM2
+           cd examples/scala && build/sbt "++ $SCALA_VERSION compile"
++      - name: Run UC Delta Integration Test
++        # Verifies that delta-spark resolved from Maven local includes all kernel module
++        # dependencies transitively by running a real UC-backed Delta workload.
++        env:
++          SPARK_PACKAGE_SUFFIX: ${{ steps.spark-details.outputs.spark_package_suffix }}
++          SPARK_VERSION: ${{ steps.spark-details.outputs.spark_full_version }}
++        run: |
++          cd examples/scala && build/sbt "++ $SCALA_VERSION runMain example.UnityCatalogQuickstart"
\ No newline at end of file

.github/workflows/spark_test.yaml

@@ -0,0 +1,27 @@
+diff --git a/.github/workflows/spark_test.yaml b/.github/workflows/spark_test.yaml
+--- a/.github/workflows/spark_test.yaml
++++ b/.github/workflows/spark_test.yaml
+         # These Scala versions must match those in the build.sbt
+         scala: [2.13.16]
+         # Important: This list of shards must be [0..NUM_SHARDS - 1]
+-        shard: [0, 1, 2, 3]
++        shard: [0, 1, 2, 3, 4, 5, 6, 7]
+     env:
+       SCALA_VERSION: ${{ matrix.scala }}
+       SPARK_VERSION: ${{ matrix.spark_version }}
+       # Important: This must be the same as the length of shards in matrix
+-      NUM_SHARDS: 4
++      NUM_SHARDS: 8
+     steps:
+       - uses: actions/checkout@v3
+       - name: Get Spark version details
+         # when changing TEST_PARALLELISM_COUNT make sure to also change it in spark_python_test.yaml
+         run: |
+           TEST_PARALLELISM_COUNT=4 pipenv run python run-tests.py --group spark --shard ${{ matrix.shard }} --spark-version ${{ matrix.spark_version }}
++      - name: Upload test reports
++        if: always()
++        uses: actions/upload-artifact@v4
++        with:
++          name: test-reports-spark${{ matrix.spark_version }}-shard${{ matrix.shard }}
++          path: "**/target/test-reports/*.xml"
++          retention-days: 7
\ No newline at end of file
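
The two "Important" comments in the `spark_test.yaml` hunk above state an invariant: the `shard` list must be exactly `[0..NUM_SHARDS - 1]`, and `NUM_SHARDS` must equal its length. A minimal sketch of that check follows. The function name `check_shard_matrix` is hypothetical, and nothing here claims to reflect how `run-tests.py` actually assigns tests to shards.

```python
from typing import Sequence


def check_shard_matrix(shards: Sequence[int], num_shards: int) -> None:
    """Raise ValueError unless shards is exactly [0, 1, ..., num_shards - 1].

    Hypothetical validation helper mirroring the workflow comments.
    """
    if len(shards) != num_shards:
        raise ValueError(
            f"NUM_SHARDS={num_shards} but the matrix lists {len(shards)} shards"
        )
    expected = list(range(num_shards))
    if list(shards) != expected:
        raise ValueError(f"shard list {list(shards)} is not {expected}")


# The values after this change: eight shards, NUM_SHARDS: 8.
check_shard_matrix([0, 1, 2, 3, 4, 5, 6, 7], 8)
```

Updating only one of the two values (for example, bumping `NUM_SHARDS` to 8 while leaving `shard: [0, 1, 2, 3]`) would fail this check.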

.gitignore

@@ -0,0 +1,7 @@
+diff --git a/.gitignore b/.gitignore
+--- a/.gitignore
++++ b/.gitignore
+ spark/unitycatalog/etc/
+ .scala-build/
+ 
++.claude/
\ No newline at end of file

PROTOCOL.md

@@ -0,0 +1,537 @@
+diff --git a/PROTOCOL.md b/PROTOCOL.md
+--- a/PROTOCOL.md
++++ b/PROTOCOL.md
+   - [Writer Requirements for Variant Type](#writer-requirements-for-variant-type)
+   - [Reader Requirements for Variant Data Type](#reader-requirements-for-variant-data-type)
+   - [Compatibility with other Delta Features](#compatibility-with-other-delta-features)
++- [Catalog-managed tables](#catalog-managed-tables)
++  - [Terminology: Commits](#terminology-commits)
++  - [Terminology: Delta Client](#terminology-delta-client)
++  - [Terminology: Catalogs](#terminology-catalogs)
++  - [Catalog Responsibilities](#catalog-responsibilities)
++  - [Reading Catalog-managed Tables](#reading-catalog-managed-tables)
++  - [Commit Protocol](#commit-protocol)
++  - [Getting Ratified Commits from the Catalog](#getting-ratified-commits-from-the-catalog)
++  - [Publishing Commits](#publishing-commits)
++  - [Maintenance Operations on Catalog-managed Tables](#maintenance-operations-on-catalog-managed-tables)
++  - [Creating and Dropping Catalog-managed Tables](#creating-and-dropping-catalog-managed-tables)
++  - [Catalog-managed Table Enablement](#catalog-managed-table-enablement)
++  - [Writer Requirements for Catalog-managed tables](#writer-requirements-for-catalog-managed-tables)
++  - [Reader Requirements for Catalog-managed tables](#reader-requirements-for-catalog-managed-tables)
++  - [Table Discovery](#table-discovery)
++  - [Sample Catalog Client API](#sample-catalog-client-api)
+ - [Requirements for Writers](#requirements-for-writers)
+   - [Creation of New Log Entries](#creation-of-new-log-entries)
+   - [Consistency Between Table Metadata and Data Files](#consistency-between-table-metadata-and-data-files)
+ __(1)__ `preimage` is the value before the update, `postimage` is the value after the update.
+ 
+ ### Delta Log Entries
+-Delta files are stored as JSON in a directory at the root of the table named `_delta_log`, and together with checkpoints make up the log of all changes that have occurred to a table.
+ 
+-Delta files are the unit of atomicity for a table, and are named using the next available version number, zero-padded to 20 digits.
++Delta Log Entries, also known as Delta files, are JSON files stored in the `_delta_log`
++directory at the root of the table. Together with checkpoints, they make up the log of all changes
++that have occurred to a table. Delta files are the unit of atomicity for a table, and are named
++using the next available version number, zero-padded to 20 digits.
+ 
+ For example:
+ 
+ ```
+ ./_delta_log/00000000000000000000.json
+ ```
+-Delta files use new-line delimited JSON format, where every action is stored as a single line JSON document.
+-A delta file, `n.json`, contains an atomic set of [_actions_](#Actions) that should be applied to the previous table state, `n-1.json`, in order to the construct `n`th snapshot of the table.
+-An action changes one aspect of the table's state, for example, adding or removing a file.
++
++Delta files use newline-delimited JSON format, where every action is stored as a single-line
++JSON document. A Delta file, corresponding to version `v`, contains an atomic set of
++[_actions_](#actions) that should be applied to the previous table state corresponding to version
++`v-1`, in order to construct the `v`th snapshot of the table. An action changes one aspect of the
++table's state, for example, adding or removing a file.
++
++**Note:** If the [catalogManaged table feature](#catalog-managed-tables) is enabled on the table,
++recently [ratified commits](#ratified-commit) may not yet be published to the `_delta_log` directory as normal Delta
++files - they may be stored directly by the catalog or reside in the `_delta_log/_staged_commits`
++directory. Delta clients must contact the table's managing catalog in order to find the information
++about these [ratified, potentially-unpublished commits](#publishing-commits).
++
++The `_delta_log/_staged_commits` directory is the staging area for [staged](#staged-commit)
++commits. Delta files in this directory have a UUID embedded into them and follow the pattern
++`<version>.<uuid>.json`, where the version corresponds to the proposed commit version, zero-padded
++to 20 digits.
++
++For example:
++
++```
++./_delta_log/_staged_commits/00000000000000000000.3a0d65cd-4056-49b8-937b-95f9e3ee90e5.json
++./_delta_log/_staged_commits/00000000000000000001.7d17ac10-5cc3-401b-bd1a-9c82dd2ea032.json
++./_delta_log/_staged_commits/00000000000000000001.016ae953-37a9-438e-8683-9a9a4a79a395.json
++./_delta_log/_staged_commits/00000000000000000002.3ae45b72-24e1-865a-a211-34987ae02f2a.json
++```
++
++NOTE: The (proposed) version number of a staged commit is authoritative - file
++`00000000000000000100.<uuid>.json` always corresponds to a commit attempt for version 100. Besides
++simplifying implementations, it also acknowledges the fact that commit files cannot safely be reused
++for multiple commit attempts. For example, resolving conflicts in a table with [row
++tracking](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#row-tracking) enabled requires
++rewriting all file actions to update their `baseRowId` field.
++
++The [catalog](#terminology-catalogs) is the source of truth about which staged commit files in
++the `_delta_log/_staged_commits` directory correspond to ratified versions, and Delta clients should
++not attempt to directly interpret the contents of that directory. Refer to
++[catalog-managed tables](#catalog-managed-tables) for more details.
+ 
+ ### Checkpoints
+ Checkpoints are also stored in the `_delta_log` directory, and can be created at any time, for any committed version of the table.
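
The Delta Log Entries hunk above specifies two file-name shapes: published Delta files named with the version zero-padded to 20 digits, and staged commits named `<version>.<uuid>.json` under `_delta_log/_staged_commits`. The sketch below formats and parses those names. All function names are hypothetical, and the regular expression assumes the UUID is written in the usual lowercase 8-4-4-4-12 hex form, which the spec text does not state explicitly.

```python
import re
import uuid
from typing import Optional, Tuple

# Assumption: UUIDs appear in canonical lowercase 8-4-4-4-12 hex form.
_STAGED_NAME = re.compile(
    r"^(\d{20})\."
    r"([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})"
    r"\.json$"
)


def published_commit_name(version: int) -> str:
    """File name of a published Delta file: version zero-padded to 20 digits."""
    if version < 0:
        raise ValueError("version must be non-negative")
    return f"{version:020d}.json"


def staged_commit_name(version: int, commit_uuid: Optional[uuid.UUID] = None) -> str:
    """File name of a staged commit: <version>.<uuid>.json."""
    if version < 0:
        raise ValueError("version must be non-negative")
    # A fresh random UUID per commit attempt, unless one is supplied.
    commit_uuid = commit_uuid or uuid.uuid4()
    return f"{version:020d}.{commit_uuid}.json"


def parse_staged_commit_name(name: str) -> Optional[Tuple[int, str]]:
    """Return (proposed version, uuid string), or None if the name does not match."""
    match = _STAGED_NAME.match(name)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)
```

Parsing a name only recovers the proposed version. Per the spec text, whether that staged file was ratified is something only the catalog can answer, so a client should not infer it from the directory contents.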
+ ### Commit Provenance Information
+ A delta file can optionally contain additional provenance information about what higher-level operation was being performed as well as who executed it.
+ 
++When the `catalogManaged` table feature is enabled, the `commitInfo` action must have a field
++`txnId` that stores a unique transaction identifier string.
++
+ Implementations are free to store any valid JSON-formatted data via the `commitInfo` action.
+ 
+ When [In-Commit Timestamps](#in-commit-timestamps) are enabled, writers are required to include a `commitInfo` action with every commit, which must include the `inCommitTimestamp` field. Also, the `commitInfo` action must be first action in the commit.
+  - A single `protocol` action
+  - A single `metaData` action
+  - A collection of `txn` actions with unique `appId`s
+- - A collection of `domainMetadata` actions with unique `domain`s.
++ - A collection of `domainMetadata` actions with unique `domain`s, excluding tombstones (i.e. actions with `removed=true`).
+  - A collection of `add` actions with unique path keys, corresponding to the newest (path, deletionVector.uniqueId) pair encountered for each path.
+  - A collection of `remove` actions with unique `(path, deletionVector.uniqueId)` keys. The intersection of the primary keys in the `add` collection and `remove` collection must be empty. That means a logical file cannot exist in both the `remove` and `add` collections at the same time; however, the same *data file* can exist with *different* DVs in the `remove` collection, as logically they represent different content. The `remove` actions act as _tombstones_, and only exist for the benefit of the VACUUM command. Snapshot reads only return `add` actions on the read path.
+  
+      - write a `metaData` action to add the `delta.columnMapping.mode` table property.
+  - Write data files by using the _physical name_ that is chosen for each column. The physical name of the column is static and can be different than the _display name_ of the column, which is changeable.
+  - Write the 32 bit integer column identifier as part of the `field_id` field of the `SchemaElement` struct in the [Parquet Thrift specification](https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift).
+- - Track partition values and column level statistics with the physical name of the column in the transaction log.
++ - Track partition values, column level statistics, and [clustering column](#clustered-table) names with the physical name of the column in the transaction log.
+  - Assign a globally unique identifier as the physical name for each new column that is added to the schema. This is especially important for supporting cheap column deletions in `name` mode. In addition, column identifiers need to be assigned to each column. The maximum id that is assigned to a column is tracked as the table property `delta.columnMapping.maxColumnId`. This is an internal table property that cannot be configured by users. This value must increase monotonically as new columns are introduced and committed to the table alongside the introduction of the new columns to the schema.
+ 
+ ## Reader Requirements for Column Mapping
+ ## Writer Requirement for Deletion Vectors
+ When adding a logical file with a deletion vector, then that logical file must have correct `numRecords` information for the data file in the `stats` field.
+ 
++# Catalog-managed tables
++
++With this feature enabled, the [catalog](#terminology-catalogs) that manages the table becomes the
++source of truth for whether a given commit attempt succeeded.
++
++The table feature defines the parts of the [commit protocol](#commit-protocol) that directly impact
++the Delta table (e.g. atomicity requirements, publishing, etc). The Delta client and catalog
++together are responsible for implementing the Delta-specific aspects of commit as defined by this
++spec, but are otherwise free to define their own APIs and protocols for communication with each
++other.
++
++**NOTE**: Filesystem-based access to catalog-managed tables is not supported. Delta clients are
++expected to discover and access catalog-managed tables through the managing catalog, not by direct
++listing in the filesystem. This feature is primarily designed to warn filesystem-based readers that
++might attempt to access a catalog-managed table's storage location without going through the catalog
++first, and to block filesystem-based writers who could otherwise corrupt both the table and the
++catalog by failing to commit through the catalog.
++
++Before we can go into details of this protocol feature, we must first align our terminology.
++
++## Terminology: Commits
++
++A commit is a set of [actions](#actions) that transform a Delta table from version `v - 1` to `v`.
++It contains the same kind of content as is stored in a [Delta file](#delta-log-entries).
++
++A commit may be stored in the file system as a Delta file - either _published_ or _staged_ - or
++stored _inline_ in the managing catalog, using whatever format the catalog prefers.
++
++There are several types of commits:
++
++1. **Proposed commit**:  A commit that a Delta client has proposed for the next version of the
++   table. It could be _staged_ or _inline_. It will either become _ratified_ or be rejected.
++
++2. <a name="staged-commit">**Staged commit**</a>: A commit that is written to disk at
++   `_delta_log/_staged_commits/<v>.<uuid>.json`. It has the same content and format as a published
++   Delta file.
++    - Here, the `uuid` is a random UUID that is generated for each commit and `v` is the version
++      which is proposed to be committed, zero-padded to 20 digits.
++    - The mere existence of a staged commit does not mean that the file has been ratified or even
++      proposed. It might correspond to a failed or in-progress commit attempt.
++    - The catalog is the source of truth around which staged commits are ratified.
++    - The catalog stores only the location, not the content, of a staged (and ratified) commit.
++
++3. <a name="inline-commit">**Inline commit**</a>: A proposed commit that is not written to disk but
++   rather has its content sent to the catalog for the catalog to store directly.
++
++4. <a name="ratified-commit">**Ratified commit**</a>: A proposed commit that a catalog has
++   determined has won the commit at the desired version of the table.
++    - The catalog must store ratified commits (that is, the staged commit's location or the inline
++      commit's content) until they are published to the `_delta_log` directory.
++    - A ratified commit may or may not yet be published.
++    - A ratified commit may or may not even be stored by the catalog at all - the catalog may
++      have just atomically published it to the filesystem directly, relying on PUT-if-absent
++      primitives to facilitate the ratification and publication all in one step.
++
++5. <a name="published-commit">**Published commit**</a>: A ratified commit that has been copied into
++   the `_delta_log` as a normal Delta file, i.e. `_delta_log/<v>.json`.
++    - Here, the `v` is the version which is being committed, zero-padded to 20 digits.
++    - The existence of a `<v>.json` file proves that the corresponding version `v` is ratified,
++      regardless of whether the table is catalog-managed or filesystem-based. The catalog is allowed
++      to return information about published commits, but Delta clients can also use filesystem
++      listing operations to directly discover them.
++    - Published commits do not need to be stored by the catalog.
++
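
The commit terminology above can be read as a small state model: a proposed commit is either staged or inline, may become ratified, and may later be published. The sketch below encodes only what the list states, namely that the catalog must keep a ratified commit until it is published and that published commits live at `_delta_log/<v>.json`. The class and function names are hypothetical and do not correspond to any API in the spec or in Delta.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Storage(Enum):
    STAGED = "staged"   # written to _delta_log/_staged_commits/<v>.<uuid>.json
    INLINE = "inline"   # content sent to the catalog, which stores it directly


@dataclass
class Commit:
    version: int
    storage: Storage
    ratified: bool = False
    published: bool = False


def catalog_must_retain(commit: Commit) -> bool:
    """True while the catalog has to keep the staged location or inline content.

    Ratified commits must be stored until published; published commits
    do not need to be stored by the catalog.
    """
    return commit.ratified and not commit.published


def published_path(commit: Commit) -> Optional[str]:
    """Path of the published Delta file, or None if not yet published."""
    if not commit.published:
        return None
    return f"_delta_log/{commit.version:020d}.json"
```

A catalog that ratifies and publishes in one step (the PUT-if-absent case mentioned above) would go straight to `ratified=True, published=True`, for which `catalog_must_retain` returns `False`.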
++## Terminology: Delta Client
++
++This is the component that implements support for reading and writing Delta tables, and implements
++the logic required by the `catalogManaged` table feature. Among other things, it
++- triggers the filesystem listing, if needed, to discover published commits
++- generates the commit content (the set of [actions](#actions))
++- works together with the query engine to trigger the commit process and invoke the client-side
++  catalog component with the commit content
++
++The Delta client is also responsible for defining the client-side API that catalogs should target.
++That is, there must be _some_ API that the [catalog client](#catalog-client) can use to communicate
++to the Delta client the subset of catalog-managed information that the Delta client cares about.
++This protocol feature is concerned with what information Delta cares about, but leaves to Delta
++clients the design of the API they use to obtain that information from catalog clients.
++
++## Terminology: Catalogs
++
++1. **Catalog**: A catalog is an entity which manages a Delta table, including its creation, writes,
++   reads, and eventual deletion.
++    - It could be backed by a database, a filesystem, or any other persistence mechanism.
++    - Each catalog has its own spec around how catalog clients should interact with them, and how
++      they perform a commit.
++
++2. <a name="catalog-client">**Catalog Client**</a>: The catalog always has a client-side component
++   which the Delta client interacts with directly. This client-side component has two primary
++   responsibilities:
++    - implement any client-side catalog-specific logic (such as staging or
++      [publishing](#publishing-commits) commits)
++    - communicate with the Catalog Server, if any
++
++3. **Catalog Server**: The catalog may also involve a server-side component which the client-side
++   component would be responsible to communicate with.
++    - This server is responsible for coordinating commits and potentially persisting table metadata
++      and enforcing authorization policies.
++    - Not all catalogs require a server; some may be entirely client-side, e.g. filesystem-backed
++      catalogs, or they may make use of a generic database server and implement all of the catalog's
++      business logic client-side.
++
++**NOTE**: This specification outlines the responsibilities and actions that catalogs must implement.
++This spec does its best not to assume any specific catalog _implementation_, though it does call out
++likely client-side and server-side responsibilities. Nonetheless, what a given catalog does
++client-side or server-side is up to each catalog implementation to decide for itself.
++
++## Catalog Responsibilities
++
++When the `catalogManaged` table feature is enabled, a catalog performs commits to the table on behalf
++of the Delta client.
++
++As stated above, the Delta spec does not mandate any particular client-server design or API for
++catalogs that manage Delta tables. However, the catalog does need to provide certain capabilities
++for reading and writing Delta tables:
++
++- Atomically commit a version `v` with a given set of `actions`. This is explained in detail in the
++  [commit protocol](#commit-protocol) section.
++- Retrieve information about recent ratified commits and the latest ratified version on the table.
++  This is explained in detail in the [Getting Ratified Commits from the Catalog](#getting-ratified-commits-from-the-catalog) section.
++- Though not required, it is encouraged that catalogs also return the latest table-level metadata,
++  such as the latest Protocol and Metadata actions, for the table. This can provide significant
++  performance advantages to conforming Delta clients, who may forgo log replay and instead trust
++  the information provided by the catalog during query planning.
++
++## Reading Catalog-managed Tables
++
++A catalog-managed table can have a mix of (a) published and (b) ratified but non-published commits.
++The catalog is the source of truth for ratified commits. Also recall that ratified commits can be
++[staged commits](#staged-commit) that are persisted to the `_delta_log/_staged_commits` directory,
++or [inline commits](#inline-commit) whose content the catalog stores directly.
++
++For example, suppose the `_delta_log` directory contains the following files:
++
++```
++00000000000000000000.json
++00000000000000000001.json
++00000000000000000002.checkpoint.parquet
++00000000000000000002.json
++00000000000000000003.00000000000000000005.compacted.json
++00000000000000000003.json
++00000000000000000004.json
++00000000000000000005.json
++00000000000000000006.json
++00000000000000000007.json
++_staged_commits/00000000000000000007.016ae953-37a9-438e-8683-9a9a4a79a395.json // ratified and published
++_staged_commits/00000000000000000008.7d17ac10-5cc3-401b-bd1a-9c82dd2ea032.json // ratified
++_staged_commits/00000000000000000008.b91807ba-fe18-488c-a15e-c4807dbd2174.json // rejected
++_staged_commits/00000000000000000010.0f707846-cd18-4e01-b40e-84ee0ae987b0.json // not yet ratified
++_staged_commits/00000000000000000010.7a980438-cb67-4b89-82d2-86f73239b6d6.json // partial file
++```
++
++Further, suppose the catalog stores the following ratified commits:
++```
++{
++  7  -> "00000000000000000007.016ae953-37a9-438e-8683-9a9a4a79a395.json",
++  8  -> "00000000000000000008.7d17ac10-5cc3-401b-bd1a-9c82dd2ea032.json",
++  9  -> <inline commit: content stored by the catalog directly>
++}
++```
++
++Some things to note are:
++- the catalog isn't aware that commit 7 was already published - perhaps the response from the
++  filesystem was dropped
++- commit 9 is an inline commit
++- neither of the two staged commits for version 10 has been ratified
++
++To read such tables, Delta clients must first contact the catalog to get the ratified commits. This
++informs the Delta client of commits [7, 9] as well as the latest ratified version, 9.
++
++If this information is insufficient to construct a complete snapshot of the table, Delta clients
++must LIST the `_delta_log` directory to get information about the published commits. For commits
++that are both returned by the catalog and already published, Delta clients must treat the catalog's
++version as authoritative and read the commit returned by the catalog. Additionally, Delta clients
++must ignore any files with versions greater than the latest ratified commit version returned by the
++catalog.
++
++Combining these two sets of files and commits enables Delta clients to generate a snapshot at the
++latest version of the table.
++
++**NOTE**: This spec prescribes the _minimum_ required interactions between Delta clients and
++catalogs for commits. Catalogs may very well expose APIs and work with Delta clients to be
++informed of other non-commit [file types](#file-types), such as checkpoint, log
++compaction, and version checksum files. This would allow catalogs to return additional
++information to Delta clients during query and scan planning, potentially allowing Delta
++clients to avoid LISTing the filesystem altogether.
++
++## Commit Protocol
++
++To start, Delta clients send the desired actions to be committed to the client-side component of the
++catalog.
++
++This component then has several options for proposing, ratifying, and publishing the commit,
++detailed below.
++
++- Option 1: Write the actions (likely client-side) to a [staged commit file](#staged-commit) in the
++  `_delta_log/_staged_commits` directory and then ratify the staged commit (likely server-side) by
++  atomically recording (in persistent storage of some kind) that the file corresponds to version `v`.
++- Option 2: Treat this as an [inline commit](#inline-commit) (i.e. likely that the client-side
++  component sends the contents to the server-side component) and atomically record (in persistent
++  storage of some kind) the content of the commit as version `v` of the table.
++- Option 3: Catalog implementations that use PUT-if-absent (client- or server-side) can ratify and
++  publish all-in-one by atomically writing a [published commit file](#published-commit)
++  in the `_delta_log` directory. Note that this commit will be considered to have succeeded as soon
++  as the file becomes visible in the filesystem, regardless of when or whether the catalog is made
++  aware of the successful publish. The catalog does not need to store these files.
++
++A catalog must not ratify version `v` until it has ratified version `v - 1`, and it must ratify
++version `v` at most once.
++
++The catalog must store both flavors of ratified commits (staged or inline) and make them available
++to readers until they are [published](#publishing-commits).
++
++For performance reasons, Delta clients are encouraged to establish an API contract where the catalog
++provides the latest ratified commit information whenever a commit fails due to version conflict.
++
++## Getting Ratified Commits from the Catalog
++
++Even after a commit is ratified, it is not discoverable through filesystem operations until it is
++[published](#publishing-commits).
++
++The catalog client is responsible for implementing an API (defined by the Delta client) that Delta clients can
++use to retrieve the latest ratified commit version (authoritative), as well as the set of ratified
++commits the catalog is still storing for the table. If some commits needed to complete the snapshot
++are not stored by the catalog, as they are already published, Delta clients can issue a filesystem
++LIST operation to retrieve them.
++
++Delta clients must establish an API contract where the catalog provides ratified commit information
++as part of the standard table resolution process performed at query planning time.
++
++## Publishing Commits
++
++Publishing is the process of copying the ratified commit with version `<v>` to
++`_delta_log/<v>.json`. The ratified commit may be a staged commit located in
++`_delta_log/_staged_commits/<v>.<uuid>.json`, or it may be an inline commit whose content the
++catalog stores itself. Because the content of a ratified commit is immutable, it does not matter
++whether the client-side, server-side, or both catalog components initiate publishing.
++
++Implementations are strongly encouraged to publish commits promptly. This reduces the number of
++commits the catalog needs to store internally (and serve up to readers).
++
++Commits must be published _in order_. That is, version `v - 1` must be published _before_ version
++`v`.
++
++**NOTE**: Because commit publishing can happen at any time after the commit succeeds, the file
++modification timestamp of the published file will not accurately reflect the original commit time.
++For this reason, catalog-managed tables must use [in-commit-timestamps](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#in-commit-timestamps)
++to ensure stability of time travel reads. Refer to [Writer Requirements for Catalog-managed Tables](#writer-requirements-for-catalog-managed-tables)
++section for more details.
++
++## Maintenance Operations on Catalog-managed Tables
++
++[Checkpoints](#checkpoints-1) and [Log Compaction Files](#log-compaction-files) can only be created
++for versions that are already published in the `_delta_log`. In other words, in order to checkpoint
++version `v` or produce a log compaction file for commit range `x <= v <= y`, `_delta_log/<v>.json`
++must exist.
++
++Notably, the [Version Checksum File](#version-checksum-file) for version `v` _can_ be created in the
++`_delta_log` even if the commit for version `v` is not published.
++
++By default, maintenance operations are prohibited unless the managing catalog explicitly permits
++the client to run them. The only exceptions are checkpoints, log compaction, and version checksum,
++as they are essential for all basic table operations (e.g. reads and writes) to operate reliably.
++All other maintenance operations such as the following are not allowed by default.
++- [Log and other metadata files clean up](#metadata-cleanup).
++- Data files cleanup, for example VACUUM.
++- Data layout changes, for example OPTIMIZE and REORG.
++
++## Creating and Dropping Catalog-managed Tables
++
++The catalog and query engine ultimately dictate how to create and drop catalog-managed tables.
++
++As one example, table creation often works in three phases:
++
++1. An initial catalog operation to obtain a unique storage location which serves as an unnamed
++   "staging" table
++2. A table operation that physically initializes a new `catalogManaged`-enabled table at the staging
++   location.
++3. A final catalog operation that registers the new table with its intended name.
++
++Delta clients would primarily be involved with the second step, but an implementation could choose
++to combine the second and third steps so that a single catalog call registers the table as part of
++the table's first commit.
++
++As another example, dropping a table can be as simple as removing its name from the catalog (a "soft
++delete"), followed at some later point by a "hard delete" that physically purges the data. The Delta
++client would not be involved at all in this process, because no commits are made to the table.
++
++## Catalog-managed Table Enablement
++
++The `catalogManaged` table feature is supported and active when:
++- The table is on Reader Version 3 and Writer Version 7.
++- The table has a `protocol` action with `readerFeatures` and `writerFeatures` both containing the
++  feature `catalogManaged`.
++
++## Writer Requirements for Catalog-managed tables
++
++When supported and active:
++
++- Writers must discover and access the table using catalog calls, which happens _before_ the table's
++  protocol is known. See [Table Discovery](#table-discovery) for more details.
++- The [in-commit-timestamps](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#in-commit-timestamps)
++  table feature must be supported and active.
++- The `commitInfo` action must also contain a field `txnId` that stores a unique transaction
++  identifier string
++- Writers must follow the catalog's [commit protocol](#commit-protocol) and must not perform
++  ordinary filesystem-based commits against the table.
++- Writers must follow the catalog's [maintenance operation protocol](#maintenance-operations-on-catalog-managed-tables)
++
++## Reader Requirements for Catalog-managed tables
++
++When supported and active:
++
++- Readers must discover the table using catalog calls, which happens before the table's protocol
++  is known. See [Table Discovery](#table-discovery) for more details.
++- Readers must contact the catalog for information about unpublished ratified commits.
++- Readers must follow the rules described in the [Reading Catalog-managed Tables](#reading-catalog-managed-tables)
++  section above. Notably:
++  - If the catalog reports `v` as the latest version, clients must ignore any later versions that may
++    have been published.
++  - When the catalog returns a ratified commit for version `v`, readers must use that
++    catalog-supplied commit and ignore any published Delta file for version `v` that might also be
++    present.
++
++## Table Discovery
++
++The requirements above state that readers and writers must discover and access the table using
++catalog calls, which occurs _before_ the table's protocol is known. This raises an important
++question: how can a client discover a `catalogManaged` Delta table without first knowing that it
++_is_, in fact, `catalogManaged` (according to the protocol)?
++
++To solve this, first note that, in practice, catalog-integrated engines already ask the catalog to
++resolve a table name to its storage location during the name resolution step. This protocol
++therefore encourages that the same name resolution step also indicate whether the table is
++catalog-managed. Surfacing this at the very moment the catalog returns the path imposes no extra
++round-trips, yet it lets the client decide — early and unambiguously — whether to follow the
++`catalogManaged` read and write rules.
++
++## Sample Catalog Client API
++
++The following is an example of a possible API which a Java-based Delta client might require catalog
++implementations to target:
++
++```scala
++
++trait CatalogManagedTable {
++    /**
++     * Commits the given set of `actions` to the given commit `version`.
++     *
++     * @param version The version we want to commit.
++     * @param actions Actions that need to be committed.
++     *
++     * @return CommitResponse which has details around the new committed delta file.
++     */
++    def commit(
++        version: Long,
++        actions: Iterator[String]): CommitResponse
++
++    /**
++     * Retrieves a (possibly empty) suffix of ratified commits in the range [startVersion,
++     * endVersion] for this table.
++     * 
++     * Some of these ratified commits may already have been published. Some of them may be staged,
++     * in which case the staged commit file path is returned; others may be inline, in which case
++     * the inline commit content is returned.
++     * 
++     * The returned commits are sorted in ascending version number and are contiguous.
++     *
++     * If neither start nor end version is specified, the catalog will return all available ratified
++     * commits (possibly empty, if all commits have been published).
++     *
++     * In all cases, the response also includes the table's latest ratified commit version.
++     *
++     * @return GetCommitsResponse which contains an ordered list of ratified commits
++     *         stored by the catalog, as well as the table's latest ratified commit version.
++     */
++    def getRatifiedCommits(
++        startVersion: Option[Long],
++        endVersion: Option[Long]): GetCommitsResponse
++}
++```
++
++Note that the above is only one example of a possible Catalog Client API. It is also _NOT_ a catalog
++API (no table discovery, ACL, create/drop, etc). The Delta protocol is agnostic to API details, and
++the API surface Delta clients define should only cover the specific catalog capabilities that the
++Delta client needs to correctly read and write catalog-managed tables.
++
+ # Iceberg Compatibility V1
+ 
+ This table feature (`icebergCompatV1`) ensures that Delta tables can be converted to Apache Iceberg™ format, though this table feature does not implement or specify that conversion.
+  * Files that have been [added](#Add-File-and-Remove-File) and not yet removed
+  * Files that were recently [removed](#Add-File-and-Remove-File) and have not yet expired
+  * [Transaction identifiers](#Transaction-Identifiers)
+- * [Domain Metadata](#Domain-Metadata)
++ * [Domain Metadata](#Domain-Metadata) that have not been removed (i.e. excluding tombstones with `removed=true`)
+  * [Checkpoint Metadata](#checkpoint-metadata) - Requires [V2 checkpoints](#v2-spec)
+  * [Sidecar File](#sidecar-files) - Requires [V2 checkpoints](#v2-spec)
+ 
+ 1. Identify a threshold (in days) up to which we want to preserve the deltaLog. Let's refer to
+ midnight UTC of that day as `cutOffTimestamp`. The newest commit not newer than the `cutOffTimestamp` is
+ the `cutOffCommit`, because a commit exactly at midnight is an acceptable cutoff. We want to retain everything including and after the `cutOffCommit`.
+-2. Identify the newest checkpoint that is not newer than the `cutOffCommit`. A checkpoint at the `cutOffCommit` is ideal, but an older one will do. Lets call it `cutOffCheckpoint`.
+-We need to preserve the `cutOffCheckpoint` (both the checkpoint file and the JSON commit file at that version) and all commits after it. The JSON commit file at the `cutOffCheckpoint` version must be preserved because checkpoints do not preserve [commit provenance information](#commit-provenance-information) (e.g., `commitInfo` actions), which may be required by table features such as [In-Commit Timestamps](#in-commit-timestamps). All commits after `cutOffCheckpoint` must be preserved to enable time travel for commits between `cutOffCheckpoint` and the next available checkpoint.
+-3. Delete all [delta log entries](#delta-log-entries) and [checkpoint files](#checkpoints) before the
+-`cutOffCheckpoint` checkpoint. Also delete all the [log compaction files](#log-compaction-files) having startVersion <= `cutOffCheckpoint`'s version.
++2. Identify the newest checkpoint that is not newer than the `cutOffCommit`. A checkpoint at the `cutOffCommit` is ideal, but an older one will do. Let's call it `cutOffCheckpoint`.
++We need to preserve the `cutOffCheckpoint` (both the checkpoint file and the JSON commit file at that version) and all published commits after it. The JSON commit file at the `cutOffCheckpoint` version must be preserved because checkpoints do not preserve [commit provenance information](#commit-provenance-information) (e.g., `commitInfo` actions), which may be required by table features such as [In-Commit Timestamps](#in-commit-timestamps). All published commits after `cutOffCheckpoint` must be preserved to enable time travel for commits between `cutOffCheckpoint` and the next available checkpoint.
++    - If no `cutOffCheckpoint` can be found, do not proceed with metadata cleanup as there is
++      nothing to clean up.
++3. Delete all [delta log entries](#delta-log-entries), [checkpoint files](#checkpoints), and
++   [version checksum files](#version-checksum-file) before the `cutOffCheckpoint` checkpoint. Also delete all the [log compaction files](#log-compaction-files)
++   having startVersion <= `cutOffCheckpoint`'s version.
++    - Also delete all the [staged commit files](#staged-commit) having version <=
++      `cutOffCheckpoint`'s version from the `_delta_log/_staged_commits` directory.
+ 4. Now read all the available [checkpoints](#checkpoints-1) in the _delta_log directory and identify
+ the corresponding [sidecar files](#sidecar-files). These sidecar files need to be protected.
+ 5. List all the files in `_delta_log/_sidecars` directory, preserve files that are less than a day
+ [Timestamp without Timezone](#timestamp-without-timezone-timestampNtz) | `timestampNtz` | Readers and writers
+ [Domain Metadata](#domain-metadata) | `domainMetadata` | Writers only
+ [V2 Checkpoint](#v2-checkpoint-table-feature) | `v2Checkpoint` | Readers and writers
++[Catalog-managed Tables](#catalog-managed-tables) | `catalogManaged` | Readers and writers
+ [Iceberg Compatibility V1](#iceberg-compatibility-v1) | `icebergCompatV1` | Writers only
+ [Iceberg Compatibility V2](#iceberg-compatibility-v2) | `icebergCompatV2` | Writers only
+ [Clustered Table](#clustered-table) | `clustering` | Writers only
\ No newline at end of file
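
The reader rules in the "Reading Catalog-managed Tables" part of the PROTOCOL.md diff above can be sketched in Java as follows. This is a minimal illustration and not kernel or Delta Spark code: the class `CatalogManagedReadSketch`, the `Source` enum, and the `resolve` method are hypothetical names, and versions are modelled as plain `long` values rather than real file listings.

```java
import java.util.Set;
import java.util.SortedMap;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

public class CatalogManagedReadSketch {
    /** Where a reader should load a given version from (hypothetical type). */
    enum Source { PUBLISHED, CATALOG }

    /**
     * Builds a per-version read plan from a LIST of published commits and the
     * ratified commits returned by the catalog.
     */
    static SortedMap<Long, Source> resolve(
            SortedSet<Long> publishedVersions,
            Set<Long> catalogRatifiedVersions,
            long latestRatifiedVersion) {
        SortedMap<Long, Source> plan = new TreeMap<>();
        // Published commits are usable only up to the latest ratified version;
        // anything newer that shows up in the listing is ignored.
        for (long v : publishedVersions) {
            if (v <= latestRatifiedVersion) {
                plan.put(v, Source.PUBLISHED);
            }
        }
        // A commit returned by the catalog is authoritative, so it overrides a
        // published file for the same version. The bound check is defensive.
        for (long v : catalogRatifiedVersions) {
            if (v <= latestRatifiedVersion) {
                plan.put(v, Source.CATALOG);
            }
        }
        return plan;
    }

    public static void main(String[] args) {
        // Mirrors the spec example: versions 0..7 are published, the catalog
        // returns 7, 8 and 9, and the latest ratified version is 9.
        SortedSet<Long> published = new TreeSet<>();
        for (long v = 0; v <= 7; v++) {
            published.add(v);
        }
        published.add(10L); // hypothetical stray file beyond the ratified version
        System.out.println(resolve(published, Set.of(7L, 8L, 9L), 9L));
    }
}
```

With these inputs, versions 0 through 6 are read from published files, versions 7 through 9 come from the catalog, and the stray version 10 is left out of the plan. A real client would additionally verify that the resulting versions are contiguous before building a snapshot.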

README.md

@@ -0,0 +1,10 @@
+diff --git a/README.md b/README.md
+--- a/README.md
++++ b/README.md
+ ## Building
+ 
+ Delta Lake is compiled using [SBT](https://www.scala-sbt.org/1.x/docs/Command-Line-Reference.html).
++Ensure that your Java version is at least 17 (you can verify with `java -version`).
+ 
+ To compile, run
+ 
\ No newline at end of file

build.sbt

@@ -0,0 +1,218 @@
+diff --git a/build.sbt b/build.sbt
+--- a/build.sbt
++++ b/build.sbt
+       allMappings.distinct
+     },
+ 
+-    // Exclude internal modules from published POM
++    // Exclude internal modules from published POM and add kernel dependencies.
++    // Kernel modules are transitive through sparkV2 (an internal module), so they
++    // are lost when sparkV2 is filtered out. We re-add them explicitly here.
+     pomPostProcess := { node =>
+       val internalModules = internalModuleNames.value
++      val ver = version.value
+       import scala.xml._
+       import scala.xml.transform._
++
++      def kernelDependencyNode(artifactId: String): Elem = {
++        <dependency>
++          <groupId>io.delta</groupId>
++          <artifactId>{artifactId}</artifactId>
++          <version>{ver}</version>
++        </dependency>
++      }
++
++      val kernelDeps = Seq(
++        kernelDependencyNode("delta-kernel-api"),
++        kernelDependencyNode("delta-kernel-defaults"),
++        kernelDependencyNode("delta-kernel-unitycatalog")
++      )
++
+       new RuleTransformer(new RewriteRule {
+         override def transform(n: Node): Seq[Node] = n match {
+-          case e: Elem if e.label == "dependency" =>
+-            val artifactId = (e \ "artifactId").text
+-            // Check if artifactId starts with any internal module name
+-            // (e.g., "delta-spark-v1_4.1_2.13" starts with "delta-spark-v1")
+-            val isInternal = internalModules.exists(module => artifactId.startsWith(module))
+-            if (isInternal) Seq.empty else Seq(n)
++          case e: Elem if e.label == "dependencies" =>
++            val filtered = e.child.filter {
++              case child: Elem if child.label == "dependency" =>
++                val artifactId = (child \ "artifactId").text
++                !internalModules.exists(module => artifactId.startsWith(module))
++              case _ => true
++            }
++            Seq(e.copy(child = filtered ++ kernelDeps))
+           case _ => Seq(n)
+         }
+       }).transform(node).head
+     commonSettings,
+     scalaStyleSettings,
+     releaseSettings,
+-    CrossSparkVersions.sparkDependentModuleName(sparkVersion),
++    // Set sparkVersion directly (not sparkDependentModuleName) so that
++    // runOnlyForReleasableSparkModules discovers this module, but without adding a Spark
++    // suffix to the artifact name. delta-contribs is only published as delta-contribs_2.13.
++    sparkVersion := CrossSparkVersions.getSparkVersion(),
+     Compile / packageBin / mappings := (Compile / packageBin / mappings).value ++
+       listPythonFiles(baseDirectory.value.getParentFile / "python"),
+ 
+   ).configureUnidoc()
+ 
+ 
+-val unityCatalogVersion = "0.3.1"
++val unityCatalogVersion = "0.4.0"
+ val sparkUnityCatalogJacksonVersion = "2.15.4" // We are using Spark 4.0's Jackson version 2.15.x, to override Unity Catalog 0.3.0's version 2.18.x
+ 
+ lazy val sparkUnityCatalog = (project in file("spark/unitycatalog"))
+     libraryDependencies ++= Seq(
+       "org.apache.spark" %% "spark-sql" % sparkVersion.value % "provided",
+ 
+-      "io.delta" %% "delta-sharing-client" % "1.3.9",
++      "io.delta" %% "delta-sharing-client" % "1.3.10",
+ 
+       // Test deps
+       "org.scalatest" %% "scalatest" % scalaTestVersion % "test",
+ 
+       // Test Deps
+       "org.scalatest" %% "scalatest" % scalaTestVersion % "test",
++      // Jackson datatype module needed for UC SDK tests (excluded from main compile scope)
++      "com.fasterxml.jackson.datatype" % "jackson-datatype-jsr310" % "2.15.4" % "test",
+     ),
+ 
+     // Unidoc settings
+     commonSettings,
+     scalaStyleSettings,
+     releaseSettings,
+-    CrossSparkVersions.sparkDependentModuleName(sparkVersion),
++    // Set sparkVersion directly (not sparkDependentModuleName) so that
++    // runOnlyForReleasableSparkModules discovers this module, but without adding a Spark
++    // suffix to the artifact name. delta-iceberg is only published as delta-iceberg_2.13.
++    sparkVersion := CrossSparkVersions.getSparkVersion(),
+     libraryDependencies ++= {
+       if (supportIceberg) {
+         Seq(
+           "org.xerial" % "sqlite-jdbc" % "3.45.0.0" % "test",
+           "org.apache.httpcomponents.core5" % "httpcore5" % "5.2.4" % "test",
+           "org.apache.httpcomponents.client5" % "httpclient5" % "5.3.1" % "test",
+-          "org.apache.iceberg" %% icebergSparkRuntimeArtifactName % "1.10.0" % "provided"
++          "org.apache.iceberg" %% icebergSparkRuntimeArtifactName % "1.10.0" % "provided",
++          // For FixedGcsAccessTokenProvider (GCS server-side planning credentials)
++          "com.google.cloud.bigdataoss" % "util-hadoop" % "hadoop3-2.2.26" % "provided"
+         )
+       } else {
+         Seq.empty
+   )
+ // scalastyle:on println
+ 
+-val icebergShadedVersion = "1.10.0"
++val icebergShadedVersion = "1.10.1"
+ lazy val icebergShaded = (project in file("icebergShaded"))
+   .dependsOn(spark % "provided")
+   .disablePlugins(JavaFormatterPlugin, ScalafmtPlugin)
+     commonSettings,
+     scalaStyleSettings,
+     releaseSettings,
+-    CrossSparkVersions.sparkDependentSettings(sparkVersion),
+-    libraryDependencies ++= Seq(
+-      "org.apache.hudi" % "hudi-java-client" % "0.15.0" % "compile" excludeAll(
+-        ExclusionRule(organization = "org.apache.hadoop"),
+-        ExclusionRule(organization = "org.apache.zookeeper"),
+-      ),
+-      "org.apache.spark" %% "spark-avro" % sparkVersion.value % "test" excludeAll ExclusionRule(organization = "org.apache.hadoop"),
+-      "org.apache.parquet" % "parquet-avro" % "1.12.3" % "compile"
+-    ),
++    // Set sparkVersion directly (not sparkDependentModuleName) so that
++    // runOnlyForReleasableSparkModules discovers this module, but without adding a Spark
++    // suffix to the artifact name. delta-hudi is only published as delta-hudi_2.13.
++    sparkVersion := CrossSparkVersions.getSparkVersion(),
++    libraryDependencies ++= {
++      if (supportHudi) {
++        Seq(
++          "org.apache.hudi" % "hudi-java-client" % "0.15.0" % "compile" excludeAll(
++            ExclusionRule(organization = "org.apache.hadoop"),
++            ExclusionRule(organization = "org.apache.zookeeper"),
++          ),
++          "org.apache.spark" %% "spark-avro" % sparkVersion.value % "test" excludeAll ExclusionRule(organization = "org.apache.hadoop"),
++          "org.apache.parquet" % "parquet-avro" % "1.12.3" % "compile"
++        )
++      } else {
++        Seq.empty
++      }
++    },
++    // Skip compilation and publishing when supportHudi is false
++    Compile / skip := !supportHudi,
++    Test / skip := !supportHudi,
++    publish / skip := !supportHudi,
++    publishLocal / skip := !supportHudi,
++    publishM2 / skip := !supportHudi,
+     assembly / assemblyJarName := s"${name.value}-assembly_${scalaBinaryVersion.value}-${version.value}.jar",
+     assembly / logLevel := Level.Info,
+     assembly / test := {},
+       // crossScalaVersions must be set to Nil on the aggregating project
+       crossScalaVersions := Nil,
+       publishArtifact := false,
+-      publish / skip := false,
++      publish / skip := true,
+     )
+ }
+ 
+       // crossScalaVersions must be set to Nil on the aggregating project
+       crossScalaVersions := Nil,
+       publishArtifact := false,
+-      publish / skip := false,
++      publish / skip := true,
+     )
+ }
+ 
+     // crossScalaVersions must be set to Nil on the aggregating project
+     crossScalaVersions := Nil,
+     publishArtifact := false,
+-    publish / skip := false,
++    publish / skip := true,
+     unidocSourceFilePatterns := {
+       (kernelApi / unidocSourceFilePatterns).value.scopeToProject(kernelApi) ++
+       (kernelDefaults / unidocSourceFilePatterns).value.scopeToProject(kernelDefaults)
+     // crossScalaVersions must be set to Nil on the aggregating project
+     crossScalaVersions := Nil,
+     publishArtifact := false,
+-    publish / skip := false,
++    publish / skip := true,
+   )
+ 
+ /*
+     sys.env.getOrElse("SONATYPE_USERNAME", ""),
+     sys.env.getOrElse("SONATYPE_PASSWORD", "")
+   ),
++  credentials += Credentials(
++    "Sonatype Nexus Repository Manager",
++    "central.sonatype.com",
++    sys.env.getOrElse("SONATYPE_USERNAME", ""),
++    sys.env.getOrElse("SONATYPE_PASSWORD", "")
++  ),
+   publishTo := {
+     val ossrhBase = "https://ossrh-staging-api.central.sonatype.com/"
++    val centralSnapshots = "https://central.sonatype.com/repository/maven-snapshots/"
+     if (isSnapshot.value) {
+-      Some("snapshots" at ossrhBase + "content/repositories/snapshots")
++      Some("snapshots" at centralSnapshots)
+     } else {
+       Some("releases"  at ossrhBase + "service/local/staging/deploy/maven2")
+     }
+ // Looks like some of release settings should be set for the root project as well.
+ publishArtifact := false  // Don't release the root project
+ publish / skip := true
+-publishTo := Some("snapshots" at "https://ossrh-staging-api.central.sonatype.com/content/repositories/snapshots")
++publishTo := Some("snapshots" at "https://central.sonatype.com/repository/maven-snapshots/")
+ releaseCrossBuild := false  // Don't use sbt-release's cross facility
+ releaseProcess := Seq[ReleaseStep](
+   checkSnapshotDependencies,
+   setReleaseVersion,
+   commitReleaseVersion,
+   tagRelease
+-) ++ CrossSparkVersions.crossSparkReleaseSteps("+publishSigned") ++ Seq[ReleaseStep](
++) ++ CrossSparkVersions.crossSparkReleaseSteps("publishSigned") ++ Seq[ReleaseStep](
+ 
+   // Do NOT use `sonatypeBundleRelease` - it will actually release to Maven! We want to do that
+   // manually.
\ No newline at end of file
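
The `pomPostProcess` change in the build.sbt diff above moves the filtering from individual `<dependency>` nodes to the enclosing `<dependencies>` element, so that the kernel artifacts can be appended after the internal modules are dropped. The Java sketch below models only that list logic, under the assumption that each dependency is represented by its artifact ID string. `PomDependencyFilterSketch` and `rewriteDependencies` are hypothetical names, and the real build operates on `scala.xml` nodes, not on string lists.

```java
import java.util.ArrayList;
import java.util.List;

public class PomDependencyFilterSketch {
    // Kernel artifact IDs named in the build.sbt diff.
    static final List<String> KERNEL_ARTIFACTS = List.of(
        "delta-kernel-api", "delta-kernel-defaults", "delta-kernel-unitycatalog");

    /**
     * Drops every artifact whose ID starts with an internal module name, then
     * appends the kernel artifacts that would otherwise be lost with them.
     */
    static List<String> rewriteDependencies(
            List<String> artifactIds, List<String> internalModules) {
        List<String> result = new ArrayList<>();
        for (String artifactId : artifactIds) {
            // Prefix match, e.g. "delta-spark-v1_4.1_2.13" starts with "delta-spark-v1".
            boolean isInternal = internalModules.stream().anyMatch(artifactId::startsWith);
            if (!isInternal) {
                result.add(artifactId);
            }
        }
        result.addAll(KERNEL_ARTIFACTS);
        return result;
    }

    public static void main(String[] args) {
        // "delta-storage" and "delta-spark-v2" are illustrative inputs only.
        System.out.println(rewriteDependencies(
            List.of("delta-spark-v1_4.1_2.13", "delta-storage", "spark-sql_2.13"),
            List.of("delta-spark-v1", "delta-spark-v2")));
    }
}
```

Doing the append at the level of the whole dependency list is what lets the kernel entries be added once per list, which a per-dependency rewrite rule could not express as directly.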

connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/.00000000000000000000.json.crc

@@ -0,0 +1,3 @@
+diff --git a/connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/.00000000000000000000.json.crc b/connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/.00000000000000000000.json.crc
+new file mode 100644
+Binary files /dev/null and b/connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/.00000000000000000000.json.crc differ
\ No newline at end of file

connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/00000000000000000000.crc

@@ -0,0 +1,5 @@
+diff --git a/connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/00000000000000000000.crc b/connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/00000000000000000000.crc
+new file mode 100644
+--- /dev/null
++++ b/connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/00000000000000000000.crc
++{"txnId":"6132e880-0f3a-4db4-b882-1da039bffbad","tableSizeBytes":0,"numFiles":0,"numMetadata":1,"numProtocol":1,"setTransactions":[],"domainMetadata":[],"metadata":{"id":"0eb3e007-b3cc-40e4-bca1-a5970d86b5a6","format":{"provider":"parquet","options":{}},"schemaString":"{\"type\":\"struct\",\"fields\":[{\"name\":\"id\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}},{\"name\":\"utf8_binary_col\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"utf8_lcase_col\",\"type\":\"string\",\"nullable\":true,\"metadata\":{\"__COLLATIONS\":{\"utf8_lcase_col\":\"spark.UTF8_LCASE\"}}},{\"name\":\"unicode_col\",\"type\":\"string\",\"nullable\":true,\"metadata\":{\"__COLLATIONS\":{\"unicode_col\":\"icu.UNICODE\"}}}]}","partitionColumns":[],"configuration":{},"createdTime":1773779518731},"protocol":{"minReaderVersion":1,"minWriterVersion":7,"writerFeatures":["domainMetadata","collations-preview","appendOnly","invariants"]},"histogramOpt":{"sortedBinBoundaries":[0,8192,16384,32768,65536,131072,262144,524288,1048576,2097152,4194304,8388608,12582912,16777216,20971520,25165824,29360128,33554432,37748736,41943040,50331648,58720256,67108864,75497472,83886080,92274688,100663296,109051904,117440512,125829120,130023424,134217728,138412032,142606336,146800640,150994944,167772160,184549376,201326592,218103808,234881024,251658240,268435456,285212672,301989888,318767104,335544320,352321536,369098752,385875968,402653184,419430400,436207616,452984832,469762048,486539264,503316480,520093696,536870912,553648128,570425344,587202560,603979776,671088640,738197504,805306368,872415232,939524096,1006632960,1073741824,1140850688,1207959552,1275068416,1342177280,1409286144,1476395008,1610612736,1744830464,1879048192,2013265920,2147483648,2415919104,2684354560,2952790016,3221225472,3489660928,3758096384,4026531840,4294967296,8589934592,17179869184,34359738368,68719476736,137438953472,274877906944],"fileCounts":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],"totalBytes":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]},"allFiles":[]}
\ No newline at end of file

... (truncated, output exceeded 60000 bytes)

Reproduce locally: git range-diff e8cffee..18d8afb d1139d2..07859c6 | Disable: git config gitstack.push-range-diff false

zikangh · 2026-04-03T17:27:52Z

Range-diff: master (07859c6 -> 7a4827d)

.github/CODEOWNERS

@@ -0,0 +1,12 @@
+diff --git a/.github/CODEOWNERS b/.github/CODEOWNERS
+--- a/.github/CODEOWNERS
++++ b/.github/CODEOWNERS
+ /project/                       @tdas
+ /version.sbt                    @tdas
+ 
++# Spark V2 and Unified modules
++/spark/v2/                      @tdas @huan233usc @TimothyW553 @raveeram-db @murali-db
++/spark-unified/                 @tdas @huan233usc @TimothyW553 @raveeram-db @murali-db
++
+ # All files in the root directory
+ /*                              @tdas
\ No newline at end of file

.github/workflows/iceberg_test.yaml

@@ -0,0 +1,16 @@
+diff --git a/.github/workflows/iceberg_test.yaml b/.github/workflows/iceberg_test.yaml
+--- a/.github/workflows/iceberg_test.yaml
++++ b/.github/workflows/iceberg_test.yaml
+           # the above directories when we use the key for the first time. After that, each run will
+           # just use the cache. The cache is immutable so we need to use a new key when trying to
+           # cache new stuff.
+-          key: delta-sbt-cache-spark3.2-scala${{ matrix.scala }}
++          key: delta-sbt-cache-spark4.0-scala${{ matrix.scala }}
+       - name: Install Job dependencies
+         run: |
+           sudo apt-get update
+       - name: Run Scala/Java and Python tests
+         # when changing TEST_PARALLELISM_COUNT make sure to also change it in spark_master_test.yaml
+         run: |
+-          TEST_PARALLELISM_COUNT=4 pipenv run python run-tests.py --group iceberg
++          TEST_PARALLELISM_COUNT=4 pipenv run python run-tests.py --group iceberg --spark-version 4.0
\ No newline at end of file

.github/workflows/spark_examples_test.yaml

@@ -0,0 +1,54 @@
+diff --git a/.github/workflows/spark_examples_test.yaml b/.github/workflows/spark_examples_test.yaml
+--- a/.github/workflows/spark_examples_test.yaml
++++ b/.github/workflows/spark_examples_test.yaml
+         # Spark versions are dynamically generated - released versions only
+         spark_version: ${{ fromJson(needs.generate-matrix.outputs.spark_versions) }}
+         # These Scala versions must match those in the build.sbt
+-        scala: [2.13.16]
++        scala: [2.13.17]
+     env:
+       SCALA_VERSION: ${{ matrix.scala }}
+-      SPARK_VERSION: ${{ matrix.spark_version }}
+     steps:
+       - uses: actions/checkout@v3
+       - name: Get Spark version details
+         id: spark-details
+         run: |
+-          # Get JVM version, package suffix, iceberg support for this Spark version
++          # Get JVM version, package suffix, iceberg support, and full version for this Spark version
+           JVM_VERSION=$(python3 project/scripts/get_spark_version_info.py --get-field "${{ matrix.spark_version }}" targetJvm | jq -r)
+           SPARK_PACKAGE_SUFFIX=$(python3 project/scripts/get_spark_version_info.py --get-field "${{ matrix.spark_version }}" packageSuffix | jq -r)
+           SUPPORT_ICEBERG=$(python3 project/scripts/get_spark_version_info.py --get-field "${{ matrix.spark_version }}" supportIceberg | jq -r)
++          SPARK_FULL_VERSION=$(python3 project/scripts/get_spark_version_info.py --get-field "${{ matrix.spark_version }}" fullVersion | jq -r)
+           echo "jvm_version=$JVM_VERSION" >> $GITHUB_OUTPUT
+           echo "spark_package_suffix=$SPARK_PACKAGE_SUFFIX" >> $GITHUB_OUTPUT
+           echo "support_iceberg=$SUPPORT_ICEBERG" >> $GITHUB_OUTPUT
+-          echo "Using JVM $JVM_VERSION for Spark ${{ matrix.spark_version }}, package suffix: '$SPARK_PACKAGE_SUFFIX', support iceberg: '$SUPPORT_ICEBERG'"
++          echo "spark_full_version=$SPARK_FULL_VERSION" >> $GITHUB_OUTPUT
++          echo "Using JVM $JVM_VERSION for Spark $SPARK_FULL_VERSION, package suffix: '$SPARK_PACKAGE_SUFFIX', support iceberg: '$SUPPORT_ICEBERG'"
+       - name: install java
+         uses: actions/setup-java@v3
+         with:
+       - name: Run Delta Spark Local Publishing and Examples Compilation
+         # examples/scala/build.sbt will compile against the local Delta release version (e.g. 3.2.0-SNAPSHOT).
+         # Thus, we need to publishM2 first so those jars are locally accessible.
+-        # The SPARK_PACKAGE_SUFFIX env var tells examples/scala/build.sbt which artifact naming to use.
++        # -DsparkVersion is for the Delta project's publishM2 (which Spark version to compile Delta against).
++        # SPARK_VERSION/SPARK_PACKAGE_SUFFIX/SUPPORT_ICEBERG are for examples/scala/build.sbt (dependency resolution).
+         env:
+           SPARK_PACKAGE_SUFFIX: ${{ steps.spark-details.outputs.spark_package_suffix }}
+           SUPPORT_ICEBERG: ${{ steps.spark-details.outputs.support_iceberg }}
++          SPARK_VERSION: ${{ steps.spark-details.outputs.spark_full_version }}
+         run: |
+           build/sbt clean
+-          build/sbt -DsparkVersion=${{ matrix.spark_version }} publishM2
++          build/sbt -DsparkVersion=${{ steps.spark-details.outputs.spark_full_version }} publishM2
+           cd examples/scala && build/sbt "++ $SCALA_VERSION compile"
++      - name: Run UC Delta Integration Test
++        # Verifies that delta-spark resolved from Maven local includes all kernel module
++        # dependencies transitively by running a real UC-backed Delta workload.
++        env:
++          SPARK_PACKAGE_SUFFIX: ${{ steps.spark-details.outputs.spark_package_suffix }}
++          SPARK_VERSION: ${{ steps.spark-details.outputs.spark_full_version }}
++        run: |
++          cd examples/scala && build/sbt "++ $SCALA_VERSION runMain example.UnityCatalogQuickstart"
\ No newline at end of file

.github/workflows/spark_test.yaml

@@ -0,0 +1,27 @@
+diff --git a/.github/workflows/spark_test.yaml b/.github/workflows/spark_test.yaml
+--- a/.github/workflows/spark_test.yaml
++++ b/.github/workflows/spark_test.yaml
+         # These Scala versions must match those in the build.sbt
+         scala: [2.13.16]
+         # Important: This list of shards must be [0..NUM_SHARDS - 1]
+-        shard: [0, 1, 2, 3]
++        shard: [0, 1, 2, 3, 4, 5, 6, 7]
+     env:
+       SCALA_VERSION: ${{ matrix.scala }}
+       SPARK_VERSION: ${{ matrix.spark_version }}
+       # Important: This must be the same as the length of shards in matrix
+-      NUM_SHARDS: 4
++      NUM_SHARDS: 8
+     steps:
+       - uses: actions/checkout@v3
+       - name: Get Spark version details
+         # when changing TEST_PARALLELISM_COUNT make sure to also change it in spark_python_test.yaml
+         run: |
+           TEST_PARALLELISM_COUNT=4 pipenv run python run-tests.py --group spark --shard ${{ matrix.shard }} --spark-version ${{ matrix.spark_version }}
++      - name: Upload test reports
++        if: always()
++        uses: actions/upload-artifact@v4
++        with:
++          name: test-reports-spark${{ matrix.spark_version }}-shard${{ matrix.shard }}
++          path: "**/target/test-reports/*.xml"
++          retention-days: 7
\ No newline at end of file

PROTOCOL.md

@@ -0,0 +1,537 @@
+diff --git a/PROTOCOL.md b/PROTOCOL.md
+--- a/PROTOCOL.md
++++ b/PROTOCOL.md
+   - [Writer Requirements for Variant Type](#writer-requirements-for-variant-type)
+   - [Reader Requirements for Variant Data Type](#reader-requirements-for-variant-data-type)
+   - [Compatibility with other Delta Features](#compatibility-with-other-delta-features)
++- [Catalog-managed tables](#catalog-managed-tables)
++  - [Terminology: Commits](#terminology-commits)
++  - [Terminology: Delta Client](#terminology-delta-client)
++  - [Terminology: Catalogs](#terminology-catalogs)
++  - [Catalog Responsibilities](#catalog-responsibilities)
++  - [Reading Catalog-managed Tables](#reading-catalog-managed-tables)
++  - [Commit Protocol](#commit-protocol)
++  - [Getting Ratified Commits from the Catalog](#getting-ratified-commits-from-the-catalog)
++  - [Publishing Commits](#publishing-commits)
++  - [Maintenance Operations on Catalog-managed Tables](#maintenance-operations-on-catalog-managed-tables)
++  - [Creating and Dropping Catalog-managed Tables](#creating-and-dropping-catalog-managed-tables)
++  - [Catalog-managed Table Enablement](#catalog-managed-table-enablement)
++  - [Writer Requirements for Catalog-managed tables](#writer-requirements-for-catalog-managed-tables)
++  - [Reader Requirements for Catalog-managed tables](#reader-requirements-for-catalog-managed-tables)
++  - [Table Discovery](#table-discovery)
++  - [Sample Catalog Client API](#sample-catalog-client-api)
+ - [Requirements for Writers](#requirements-for-writers)
+   - [Creation of New Log Entries](#creation-of-new-log-entries)
+   - [Consistency Between Table Metadata and Data Files](#consistency-between-table-metadata-and-data-files)
+ __(1)__ `preimage` is the value before the update, `postimage` is the value after the update.
+ 
+ ### Delta Log Entries
+-Delta files are stored as JSON in a directory at the root of the table named `_delta_log`, and together with checkpoints make up the log of all changes that have occurred to a table.
+ 
+-Delta files are the unit of atomicity for a table, and are named using the next available version number, zero-padded to 20 digits.
++Delta Log Entries, also known as Delta files, are JSON files stored in the `_delta_log`
++directory at the root of the table. Together with checkpoints, they make up the log of all changes
++that have occurred to a table. Delta files are the unit of atomicity for a table, and are named
++using the next available version number, zero-padded to 20 digits.
+ 
+ For example:
+ 
+ ```
+ ./_delta_log/00000000000000000000.json
+ ```
+-Delta files use new-line delimited JSON format, where every action is stored as a single line JSON document.
+-A delta file, `n.json`, contains an atomic set of [_actions_](#Actions) that should be applied to the previous table state, `n-1.json`, in order to the construct `n`th snapshot of the table.
+-An action changes one aspect of the table's state, for example, adding or removing a file.
++
++Delta files use newline-delimited JSON format, where every action is stored as a single-line
++JSON document. A Delta file, corresponding to version `v`, contains an atomic set of
++[_actions_](#actions) that should be applied to the previous table state corresponding to version
++`v-1`, in order to construct the `v`th snapshot of the table. An action changes one aspect of the
++table's state, for example, adding or removing a file.
++
++**Note:** If the [catalogManaged table feature](#catalog-managed-tables) is enabled on the table,
++recently [ratified commits](#ratified-commit) may not yet be published to the `_delta_log` directory as normal Delta
++files - they may be stored directly by the catalog or reside in the `_delta_log/_staged_commits`
++directory. Delta clients must contact the table's managing catalog in order to find the information
++about these [ratified, potentially-unpublished commits](#publishing-commits).
++
++The `_delta_log/_staged_commits` directory is the staging area for [staged](#staged-commit)
++commits. Delta files in this directory have a UUID embedded into them and follow the pattern
++`<version>.<uuid>.json`, where the version corresponds to the proposed commit version, zero-padded
++to 20 digits.
++
++For example:
++
++```
++./_delta_log/_staged_commits/00000000000000000000.3a0d65cd-4056-49b8-937b-95f9e3ee90e5.json
++./_delta_log/_staged_commits/00000000000000000001.7d17ac10-5cc3-401b-bd1a-9c82dd2ea032.json
++./_delta_log/_staged_commits/00000000000000000001.016ae953-37a9-438e-8683-9a9a4a79a395.json
++./_delta_log/_staged_commits/00000000000000000002.3ae45b72-24e1-865a-a211-34987ae02f2a.json
++```
++
++NOTE: The (proposed) version number of a staged commit is authoritative - file
++`00000000000000000100.<uuid>.json` always corresponds to a commit attempt for version 100. Besides
++simplifying implementations, it also acknowledges the fact that commit files cannot safely be reused
++for multiple commit attempts. For example, resolving conflicts in a table with [row
++tracking](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#row-tracking) enabled requires
++rewriting all file actions to update their `baseRowId` field.
++
++The [catalog](#terminology-catalogs) is the source of truth about which staged commit files in
++the `_delta_log/_staged_commits` directory correspond to ratified versions, and Delta clients should
++not attempt to directly interpret the contents of that directory. Refer to
++[catalog-managed tables](#catalog-managed-tables) for more details.
+ 
+ ### Checkpoints
+ Checkpoints are also stored in the `_delta_log` directory, and can be created at any time, for any committed version of the table.
+ ### Commit Provenance Information
+ A delta file can optionally contain additional provenance information about what higher-level operation was being performed as well as who executed it.
+ 
++When the `catalogManaged` table feature is enabled, the `commitInfo` action must have a field
++`txnId` that stores a unique transaction identifier string.
++
+ Implementations are free to store any valid JSON-formatted data via the `commitInfo` action.
+ 
+ When [In-Commit Timestamps](#in-commit-timestamps) are enabled, writers are required to include a `commitInfo` action with every commit, which must include the `inCommitTimestamp` field. Also, the `commitInfo` action must be first action in the commit.
+  - A single `protocol` action
+  - A single `metaData` action
+  - A collection of `txn` actions with unique `appId`s
+- - A collection of `domainMetadata` actions with unique `domain`s.
++ - A collection of `domainMetadata` actions with unique `domain`s, excluding tombstones (i.e. actions with `removed=true`).
+  - A collection of `add` actions with unique path keys, corresponding to the newest (path, deletionVector.uniqueId) pair encountered for each path.
+  - A collection of `remove` actions with unique `(path, deletionVector.uniqueId)` keys. The intersection of the primary keys in the `add` collection and `remove` collection must be empty. That means a logical file cannot exist in both the `remove` and `add` collections at the same time; however, the same *data file* can exist with *different* DVs in the `remove` collection, as logically they represent different content. The `remove` actions act as _tombstones_, and only exist for the benefit of the VACUUM command. Snapshot reads only return `add` actions on the read path.
+  
+      - write a `metaData` action to add the `delta.columnMapping.mode` table property.
+  - Write data files by using the _physical name_ that is chosen for each column. The physical name of the column is static and can be different than the _display name_ of the column, which is changeable.
+  - Write the 32 bit integer column identifier as part of the `field_id` field of the `SchemaElement` struct in the [Parquet Thrift specification](https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift).
+- - Track partition values and column level statistics with the physical name of the column in the transaction log.
++ - Track partition values, column level statistics, and [clustering column](#clustered-table) names with the physical name of the column in the transaction log.
+  - Assign a globally unique identifier as the physical name for each new column that is added to the schema. This is especially important for supporting cheap column deletions in `name` mode. In addition, column identifiers need to be assigned to each column. The maximum id that is assigned to a column is tracked as the table property `delta.columnMapping.maxColumnId`. This is an internal table property that cannot be configured by users. This value must increase monotonically as new columns are introduced and committed to the table alongside the introduction of the new columns to the schema.
+ 
+ ## Reader Requirements for Column Mapping
+ ## Writer Requirement for Deletion Vectors
+ When adding a logical file with a deletion vector, then that logical file must have correct `numRecords` information for the data file in the `stats` field.
+ 
++# Catalog-managed tables
++
++With this feature enabled, the [catalog](#terminology-catalogs) that manages the table becomes the
++source of truth for whether a given commit attempt succeeded.
++
++The table feature defines the parts of the [commit protocol](#commit-protocol) that directly impact
++the Delta table (e.g. atomicity requirements, publishing, etc). The Delta client and catalog
++together are responsible for implementing the Delta-specific aspects of commit as defined by this
++spec, but are otherwise free to define their own APIs and protocols for communication with each
++other.
++
++**NOTE**: Filesystem-based access to catalog-managed tables is not supported. Delta clients are
++expected to discover and access catalog-managed tables through the managing catalog, not by direct
++listing in the filesystem. This feature is primarily designed to warn filesystem-based readers that
++might attempt to access a catalog-managed table's storage location without going through the catalog
++first, and to block filesystem-based writers who could otherwise corrupt both the table and the
++catalog by failing to commit through the catalog.
++
++Before we can go into details of this protocol feature, we must first align our terminology.
++
++## Terminology: Commits
++
++A commit is a set of [actions](#actions) that transform a Delta table from version `v - 1` to `v`.
++It contains the same kind of content as is stored in a [Delta file](#delta-log-entries).
++
++A commit may be stored in the file system as a Delta file - either _published_ or _staged_ - or
++stored _inline_ in the managing catalog, using whatever format the catalog prefers.
++
++There are several types of commits:
++
++1. **Proposed commit**:  A commit that a Delta client has proposed for the next version of the
++   table. It could be _staged_ or _inline_. It will either become _ratified_ or be rejected.
++
++2. <a name="staged-commit">**Staged commit**</a>: A commit that is written to disk at
++   `_delta_log/_staged_commits/<v>.<uuid>.json`. It has the same content and format as a published
++   Delta file.
++    - Here, the `uuid` is a random UUID that is generated for each commit and `v` is the version
++      which is proposed to be committed, zero-padded to 20 digits.
++    - The mere existence of a staged commit does not mean that the file has been ratified or even
++      proposed. It might correspond to a failed or in-progress commit attempt.
++    - The catalog is the source of truth around which staged commits are ratified.
++    - The catalog stores only the location, not the content, of a staged (and ratified) commit.
++
++3. <a name="inline-commit">**Inline commit**</a>: A proposed commit that is not written to disk but
++   rather has its content sent to the catalog for the catalog to store directly.
++
++4. <a name="ratified-commit">**Ratified commit**</a>: A proposed commit that a catalog has
++   determined has won the commit at the desired version of the table.
++    - The catalog must store ratified commits (that is, the staged commit's location or the inline
++      commit's content) until they are published to the `_delta_log` directory.
++    - A ratified commit may or may not yet be published.
++    - A ratified commit may or may not even be stored by the catalog at all - the catalog may
++      have just atomically published it to the filesystem directly, relying on PUT-if-absent
++      primitives to facilitate the ratification and publication all in one step.
++
++5. <a name="published-commit">**Published commit**</a>: A ratified commit that has been copied into
++   the `_delta_log` as a normal Delta file, i.e. `_delta_log/<v>.json`.
++    - Here, the `v` is the version which is being committed, zero-padded to 20 digits.
++    - The existence of a `<v>.json` file proves that the corresponding version `v` is ratified,
++      regardless of whether the table is catalog-managed or filesystem-based. The catalog is allowed
++      to return information about published commits, but Delta clients can also use filesystem
++      listing operations to directly discover them.
++    - Published commits do not need to be stored by the catalog.
++
++## Terminology: Delta Client
++
++This is the component that implements support for reading and writing Delta tables, and implements
++the logic required by the `catalogManaged` table feature. Among other things, it
++- triggers the filesystem listing, if needed, to discover published commits
++- generates the commit content (the set of [actions](#actions))
++- works together with the query engine to trigger the commit process and invoke the client-side
++  catalog component with the commit content
++
++The Delta client is also responsible for defining the client-side API that catalogs should target.
++That is, there must be _some_ API that the [catalog client](#catalog-client) can use to communicate
++to the Delta client the subset of catalog-managed information that the Delta client cares about.
++This protocol feature is concerned with what information Delta cares about, but leaves to Delta
++clients the design of the API they use to obtain that information from catalog clients.
++
++## Terminology: Catalogs
++
++1. **Catalog**: A catalog is an entity which manages a Delta table, including its creation, writes,
++   reads, and eventual deletion.
++    - It could be backed by a database, a filesystem, or any other persistence mechanism.
++    - Each catalog has its own spec around how catalog clients should interact with them, and how
++      they perform a commit.
++
++2. <a name="catalog-client">**Catalog Client**</a>: The catalog always has a client-side component
++   which the Delta client interacts with directly. This client-side component has two primary
++   responsibilities:
++    - implement any client-side catalog-specific logic (such as staging or
++      [publishing](#publishing-commits) commits)
++    - communicate with the Catalog Server, if any
++
++3. **Catalog Server**: The catalog may also involve a server-side component which the client-side
++   component would be responsible to communicate with.
++    - This server is responsible for coordinating commits and potentially persisting table metadata
++      and enforcing authorization policies.
++    - Not all catalogs require a server; some may be entirely client-side, e.g. filesystem-backed
++      catalogs, or they may make use of a generic database server and implement all of the catalog's
++      business logic client-side.
++
++**NOTE**: This specification outlines the responsibilities and actions that catalogs must implement.
++This spec does its best not to assume any specific catalog _implementation_, though it does call out
++likely client-side and server-side responsibilities. Nonetheless, what a given catalog does
++client-side or server-side is up to each catalog implementation to decide for itself.
++
++## Catalog Responsibilities
++
++When the `catalogManaged` table feature is enabled, a catalog performs commits to the table on behalf
++of the Delta client.
++
++As stated above, the Delta spec does not mandate any particular client-server design or API for
++catalogs that manage Delta tables. However, the catalog does need to provide certain capabilities
++for reading and writing Delta tables:
++
++- Atomically commit a version `v` with a given set of `actions`. This is explained in detail in the
++  [commit protocol](#commit-protocol) section.
++- Retrieve information about recent ratified commits and the latest ratified version on the table.
++  This is explained in detail in the [Getting Ratified Commits from the Catalog](#getting-ratified-commits-from-the-catalog) section.
++- Though not required, it is encouraged that catalogs also return the latest table-level metadata,
++  such as the latest Protocol and Metadata actions, for the table. This can provide significant
++  performance advantages to conforming Delta clients, who may forgo log replay and instead trust
++  the information provided by the catalog during query planning.
++
++## Reading Catalog-managed Tables
++
++A catalog-managed table can have a mix of (a) published and (b) ratified but non-published commits.
++The catalog is the source of truth for ratified commits. Also recall that ratified commits can be
++[staged commits](#staged-commit) that are persisted to the `_delta_log/_staged_commits` directory,
++or [inline commits](#inline-commit) whose content the catalog stores directly.
++
++For example, suppose the `_delta_log` directory contains the following files:
++
++```
++00000000000000000000.json
++00000000000000000001.json
++00000000000000000002.checkpoint.parquet
++00000000000000000002.json
++00000000000000000003.00000000000000000005.compacted.json
++00000000000000000003.json
++00000000000000000004.json
++00000000000000000005.json
++00000000000000000006.json
++00000000000000000007.json
++_staged_commits/00000000000000000007.016ae953-37a9-438e-8683-9a9a4a79a395.json // ratified and published
++_staged_commits/00000000000000000008.7d17ac10-5cc3-401b-bd1a-9c82dd2ea032.json // ratified
++_staged_commits/00000000000000000008.b91807ba-fe18-488c-a15e-c4807dbd2174.json // rejected
++_staged_commits/00000000000000000010.0f707846-cd18-4e01-b40e-84ee0ae987b0.json // not yet ratified
++_staged_commits/00000000000000000010.7a980438-cb67-4b89-82d2-86f73239b6d6.json // partial file
++```
++
++Further, suppose the catalog stores the following ratified commits:
++```
++{
++  7  -> "00000000000000000007.016ae953-37a9-438e-8683-9a9a4a79a395.json",
++  8  -> "00000000000000000008.7d17ac10-5cc3-401b-bd1a-9c82dd2ea032.json",
++  9  -> <inline commit: content stored by the catalog directly>
++}
++```
++
++Some things to note are:
++- the catalog isn't aware that commit 7 was already published - perhaps the response from the
++  filesystem was dropped
++- commit 9 is an inline commit
++- neither of the two staged commits for version 10 have been ratified
++
++To read such tables, Delta clients must first contact the catalog to get the ratified commits. This
++informs the Delta client of commits [7, 9] as well as the latest ratified version, 9.
++
++If this information is insufficient to construct a complete snapshot of the table, Delta clients
++must LIST the `_delta_log` directory to get information about the published commits. For commits
++that are both returned by the catalog and already published, Delta clients must treat the catalog's
++version as authoritative and read the commit returned by the catalog. Additionally, Delta clients
++must ignore any files with versions greater than the latest ratified commit version returned by the
++catalog.
++
++Combining these two sets of files and commits enables Delta clients to generate a snapshot at the
++latest version of the table.
++
++**NOTE**: This spec prescribes the _minimum_ required interactions between Delta clients and
++catalogs for commits. Catalogs may very well expose APIs and work with Delta clients to be
++informed of other non-commit [file types](#file-types), such as checkpoint, log
++compaction, and version checksum files. This would allow catalogs to return additional
++information to Delta clients during query and scan planning, potentially allowing Delta
++clients to avoid LISTing the filesystem altogether.
++
++## Commit Protocol
++
++To start, Delta Clients send the desired actions to be committed to the client-side component of the
++catalog.
++
++This component then has several options for proposing, ratifying, and publishing the commit,
++detailed below.
++
++- Option 1: Write the actions (likely client-side) to a [staged commit file](#staged-commit) in the
++  `_delta_log/_staged_commits` directory and then ratify the staged commit (likely server-side) by
++  atomically recording (in persistent storage of some kind) that the file corresponds to version `v`.
++- Option 2: Treat this as an [inline commit](#inline-commit) (i.e. likely that the client-side
++  component sends the contents to the server-side component) and atomically record (in persistent
++  storage of some kind) the content of the commit as version `v` of the table.
++- Option 3: Catalog implementations that use PUT-if-absent (client- or server-side) can ratify and
++  publish all-in-one by atomically writing a [published commit file](#published-commit)
++  in the `_delta_log` directory. Note that this commit will be considered to have succeeded as soon
++  as the file becomes visible in the filesystem, regardless of when or whether the catalog is made
++  aware of the successful publish. The catalog does not need to store these files.
++
++A catalog must not ratify version `v` until it has ratified version `v - 1`, and it must ratify
++version `v` at most once.
++
++The catalog must store both flavors of ratified commits (staged or inline) and make them available
++to readers until they are [published](#publishing-commits).
++
++For performance reasons, Delta clients are encouraged to establish an API contract where the catalog
++provides the latest ratified commit information whenever a commit fails due to version conflict.
++
++## Getting Ratified Commits from the Catalog
++
++Even after a commit is ratified, it is not discoverable through filesystem operations until it is
++[published](#publishing-commits).
++
++The catalog-client is responsible to implement an API (defined by the Delta client) that Delta clients can
++use to retrieve the latest ratified commit version (authoritative), as well as the set of ratified
++commits the catalog is still storing for the table. If some commits needed to complete the snapshot
++are not stored by the catalog, as they are already published, Delta clients can issue a filesystem
++LIST operation to retrieve them.
++
++Delta clients must establish an API contract where the catalog provides ratified commit information
++as part of the standard table resolution process performed at query planning time.
++
++## Publishing Commits
++
++Publishing is the process of copying the ratified commit with version `<v>` to
++`_delta_log/<v>.json`. The ratified commit may be a staged commit located in
++`_delta_log/_staged_commits/<v>.<uuid>.json`, or it may be an inline commit whose content the
++catalog stores itself. Because the content of a ratified commit is immutable, it does not matter
++whether the client-side, server-side, or both catalog components initiate publishing.
++
++Implementations are strongly encouraged to publish commits promptly. This reduces the number of
++commits the catalog needs to store internally (and serve up to readers).
++
++Commits must be published _in order_. That is, version `v - 1` must be published _before_ version
++`v`.
++
++**NOTE**: Because commit publishing can happen at any time after the commit succeeds, the file
++modification timestamp of the published file will not accurately reflect the original commit time.
++For this reason, catalog-managed tables must use [in-commit-timestamps](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#in-commit-timestamps)
++to ensure stability of time travel reads. Refer to [Writer Requirements for Catalog-managed Tables](#writer-requirements-for-catalog-managed-tables)
++section for more details.
++
++## Maintenance Operations on Catalog-managed Tables
++
++[Checkpoints](#checkpoints-1) and [Log Compaction Files](#log-compaction-files) can only be created
++for versions that are already published in the `_delta_log`. In other words, in order to checkpoint
++version `v` or produce a log compaction file for commit range `x <= v <= y`, `_delta_log/<v>.json`
++must exist.
++
++Notably, the [Version Checksum File](#version-checksum-file) for version `v` _can_ be created in the
++`_delta_log` even if the commit for version `v` is not published.
++
++By default, maintenance operations are prohibited unless the managing catalog explicitly permits
++the client to run them. The only exceptions are checkpoints, log compaction, and version checksum,
++as they are essential for all basic table operations (e.g. reads and writes) to operate reliably.
++All other maintenance operations such as the following are not allowed by default.
++- [Log and other metadata files clean up](#metadata-cleanup).
++- Data files cleanup, for example VACUUM.
++- Data layout changes, for example OPTIMIZE and REORG.
++
++## Creating and Dropping Catalog-managed Tables
++
++The catalog and query engine ultimately dictate how to create and drop catalog-managed tables.
++
++As one example, table creation often works in three phases:
++
++1. An initial catalog operation to obtain a unique storage location which serves as an unnamed
++   "staging" table
++2. A table operation that physically initializes a new `catalogManaged`-enabled table at the staging
++   location.
++3. A final catalog operation that registers the new table with its intended name.
++
++Delta clients would primarily be involved with the second step, but an implementation could choose
++to combine the second and third steps so that a single catalog call registers the table as part of
++the table's first commit.
++
++As another example, dropping a table can be as simple as removing its name from the catalog (a "soft
++delete"), followed at some later point by a "hard delete" that physically purges the data. The Delta
++client would not be involved at all in this process, because no commits are made to the table.
++
++## Catalog-managed Table Enablement
++
++The `catalogManaged` table feature is supported and active when:
++- The table is on Reader Version 3 and Writer Version 7.
++- The table has a `protocol` action with `readerFeatures` and `writerFeatures` both containing the
++  feature `catalogManaged`.
++
++## Writer Requirements for Catalog-managed tables
++
++When supported and active:
++
++- Writers must discover and access the table using catalog calls, which happens _before_ the table's
++  protocol is known. See [Table Discovery](#table-discovery) for more details.
++- The [in-commit-timestamps](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#in-commit-timestamps)
++  table feature must be supported and active.
++- The `commitInfo` action must also contain a field `txnId` that stores a unique transaction
++  identifier string
++- Writers must follow the catalog's [commit protocol](#commit-protocol) and must not perform
++  ordinary filesystem-based commits against the table.
++- Writers must follow the catalog's [maintenance operation protocol](#maintenance-operations-on-catalog-managed-tables)
++
++## Reader Requirements for Catalog-managed tables
++
++When supported and active:
++
++- Readers must discover the table using catalog calls, which happens before the table's protocol
++  is known. See [Table Discovery](#table-discovery) for more details.
++- Readers must contact the catalog for information about unpublished ratified commits.
++- Readers must follow the rules described in the [Reading Catalog-managed Tables](#reading-catalog-managed-tables)
++  section above. Notably
++  - If the catalog said `v` is the latest version, clients must ignore any later versions that may
++    have been published
++  - When the catalog returns a ratified commit for version `v`, readers must use that
++    catalog-supplied commit and ignore any published Delta file for version `v` that might also be
++    present.
++
++## Table Discovery
++
++The requirements above state that readers and writers must discover and access the table using
++catalog calls, which occurs _before_ the table's protocol is known. This raises an important
++question: how can a client discover a `catalogManaged` Delta table without first knowing that it
++_is_, in fact, `catalogManaged` (according to the protocol)?
++
++To solve this, first note that, in practice, catalog-integrated engines already ask the catalog to
++resolve a table name to its storage location during the name resolution step. This protocol
++therefore encourages that the same name resolution step also indicate whether the table is
++catalog-managed. Surfacing this at the very moment the catalog returns the path imposes no extra
++round-trips, yet it lets the client decide — early and unambiguously — whether to follow the
++`catalogManaged` read and write rules.
++
++## Sample Catalog Client API
++
++The following is an example of a possible API which a Java-based Delta client might require catalog
++implementations to target:
++
++```scala
++
++trait CatalogManagedTable {
++    /**
++     * Commits the given set of `actions` to the given commit `version`.
++     *
++     * @param version The version we want to commit.
++     * @param actions Actions that need to be committed.
++     *
++     * @return CommitResponse which has details around the new committed delta file.
++     */
++    def commit(
++        version: Long,
++        actions: Iterator[String]): CommitResponse
++
++    /**
++     * Retrieves a (possibly empty) suffix of ratified commits in the range [startVersion,
++     * endVersion] for this table.
++     * 
++     * Some of these ratified commits may already have been published. Some of them may be staged,
++     * in which case the staged commit file path is returned; others may be inline, in which case
++     * the inline commit content is returned.
++     * 
++     * The returned commits are sorted in ascending version number and are contiguous.
++     *
++     * If neither start nor end version is specified, the catalog will return all available ratified
++     * commits (possibly empty, if all commits have been published).
++     *
++     * In all cases, the response also includes the table's latest ratified commit version.
++     *
++     * @return GetCommitsResponse which contains an ordered list of ratified commits
++     *         stored by the catalog, as well as table's latest commit version.
++     */
++    def getRatifiedCommits(
++        startVersion: Option[Long],
++        endVersion: Option[Long]): GetCommitsResponse
++}
++```
++
++Note that the above is only one example of a possible Catalog Client API. It is also _NOT_ a catalog
++API (no table discovery, ACL, create/drop, etc). The Delta protocol is agnostic to API details, and
++the API surface Delta clients define should only cover the specific catalog capabilities that Delta
++client needs to correctly read and write catalog-managed tables.
++
+ # Iceberg Compatibility V1
+ 
+ This table feature (`icebergCompatV1`) ensures that Delta tables can be converted to Apache Iceberg™ format, though this table feature does not implement or specify that conversion.
+  * Files that have been [added](#Add-File-and-Remove-File) and not yet removed
+  * Files that were recently [removed](#Add-File-and-Remove-File) and have not yet expired
+  * [Transaction identifiers](#Transaction-Identifiers)
+- * [Domain Metadata](#Domain-Metadata)
++ * [Domain Metadata](#Domain-Metadata) that have not been removed (i.e. excluding tombstones with `removed=true`)
+  * [Checkpoint Metadata](#checkpoint-metadata) - Requires [V2 checkpoints](#v2-spec)
+  * [Sidecar File](#sidecar-files) - Requires [V2 checkpoints](#v2-spec)
+ 
+ 1. Identify a threshold (in days) up to which we want to preserve the deltaLog. Let's refer to
+ midnight UTC of that day as `cutOffTimestamp`. The newest commit not newer than the `cutOffTimestamp` is
+ the `cutoffCommit`, because a commit exactly at midnight is an acceptable cutoff. We want to retain everything including and after the `cutoffCommit`.
+-2. Identify the newest checkpoint that is not newer than the `cutOffCommit`. A checkpoint at the `cutOffCommit` is ideal, but an older one will do. Lets call it `cutOffCheckpoint`.
+-We need to preserve the `cutOffCheckpoint` (both the checkpoint file and the JSON commit file at that version) and all commits after it. The JSON commit file at the `cutOffCheckpoint` version must be preserved because checkpoints do not preserve [commit provenance information](#commit-provenance-information) (e.g., `commitInfo` actions), which may be required by table features such as [In-Commit Timestamps](#in-commit-timestamps). All commits after `cutOffCheckpoint` must be preserved to enable time travel for commits between `cutOffCheckpoint` and the next available checkpoint.
+-3. Delete all [delta log entries](#delta-log-entries) and [checkpoint files](#checkpoints) before the
+-`cutOffCheckpoint` checkpoint. Also delete all the [log compaction files](#log-compaction-files) having startVersion <= `cutOffCheckpoint`'s version.
++2. Identify the newest checkpoint that is not newer than the `cutOffCommit`. A checkpoint at the `cutOffCommit` is ideal, but an older one will do. Let's call it `cutOffCheckpoint`.
++We need to preserve the `cutOffCheckpoint` (both the checkpoint file and the JSON commit file at that version) and all published commits after it. The JSON commit file at the `cutOffCheckpoint` version must be preserved because checkpoints do not preserve [commit provenance information](#commit-provenance-information) (e.g., `commitInfo` actions), which may be required by table features such as [In-Commit Timestamps](#in-commit-timestamps). All published commits after `cutOffCheckpoint` must be preserved to enable time travel for commits between `cutOffCheckpoint` and the next available checkpoint.
++    - If no `cutOffCheckpoint` can be found, do not proceed with metadata cleanup as there is
++      nothing to cleanup.
++3. Delete all [delta log entries](#delta-log-entries), [checkpoint files](#checkpoints), and
++   [version checksum files](#version-checksum-file) before the `cutOffCheckpoint` checkpoint. Also delete all the [log compaction files](#log-compaction-files)
++   having startVersion <= `cutOffCheckpoint`'s version.
++    - Also delete all the [staged commit files](#staged-commit) having version <=
++      `cutOffCheckpoint`'s version from the `_delta_log/_staged_commits` directory.
+ 4. Now read all the available [checkpoints](#checkpoints-1) in the _delta_log directory and identify
+ the corresponding [sidecar files](#sidecar-files). These sidecar files need to be protected.
+ 5. List all the files in `_delta_log/_sidecars` directory, preserve files that are less than a day
+ [Timestamp without Timezone](#timestamp-without-timezone-timestampNtz) | `timestampNtz` | Readers and writers
+ [Domain Metadata](#domain-metadata) | `domainMetadata` | Writers only
+ [V2 Checkpoint](#v2-checkpoint-table-feature) | `v2Checkpoint` | Readers and writers
++[Catalog-managed Tables](#catalog-managed-tables) | `catalogManaged` | Readers and writers
+ [Iceberg Compatibility V1](#iceberg-compatibility-v1) | `icebergCompatV1` | Writers only
+ [Iceberg Compatibility V2](#iceberg-compatibility-v2) | `icebergCompatV2` | Writers only
+ [Clustered Table](#clustered-table) | `clustering` | Writers only
\ No newline at end of file
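
The reader rules in the PROTOCOL.md hunk above (ignore published versions later than the catalog's latest ratified version, and prefer the catalog-supplied commit when both exist for a version) can be sketched as a small merge step. This is only an illustration under stated assumptions: `CommitResolution`, `resolve`, and the string labels are hypothetical names, not part of Delta Kernel or any catalog API, and real commit files use the `_delta_log/<v>.json` and `_delta_log/_staged_commits/<v>.<uuid>.json` paths described in the spec.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not an API from Delta Kernel or any catalog client.
class CommitResolution {

    /**
     * Builds the ordered list of commits a reader would use for a snapshot.
     *
     * @param published commits found by a filesystem LIST, keyed by version
     * @param ratified  commits returned by the catalog, keyed by version
     * @param latestRatifiedVersion the authoritative latest version from the catalog
     */
    static List<String> resolve(
            Map<Long, String> published,
            Map<Long, String> ratified,
            long latestRatifiedVersion) {
        TreeMap<Long, String> merged = new TreeMap<>();

        // Published files later than the catalog's latest version are ignored.
        for (Map.Entry<Long, String> e : published.entrySet()) {
            if (e.getKey() <= latestRatifiedVersion) {
                merged.put(e.getKey(), e.getValue());
            }
        }

        // The catalog-supplied commit replaces any published file for the same version.
        for (Map.Entry<Long, String> e : ratified.entrySet()) {
            if (e.getKey() <= latestRatifiedVersion) {
                merged.put(e.getKey(), e.getValue());
            }
        }

        // TreeMap iterates in ascending version order.
        return new ArrayList<>(merged.values());
    }

    public static void main(String[] args) {
        List<String> commits = resolve(
            Map.of(0L, "published-0", 1L, "published-1", 2L, "published-2", 4L, "published-4"),
            Map.of(2L, "ratified-2", 3L, "ratified-3"),
            3L);
        System.out.println(commits);
    }
}
```

In the `main` example, version 4 is dropped because the catalog reports 3 as the latest ratified version, and version 2 resolves to the catalog's copy even though a published file also exists.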

README.md

@@ -0,0 +1,10 @@
+diff --git a/README.md b/README.md
+--- a/README.md
++++ b/README.md
+ ## Building
+ 
+ Delta Lake is compiled using [SBT](https://www.scala-sbt.org/1.x/docs/Command-Line-Reference.html).
++Ensure that your Java version is at least 17 (you can verify with `java -version`).
+ 
+ To compile, run
+ 
\ No newline at end of file

build.sbt

@@ -0,0 +1,218 @@
+diff --git a/build.sbt b/build.sbt
+--- a/build.sbt
++++ b/build.sbt
+       allMappings.distinct
+     },
+ 
+-    // Exclude internal modules from published POM
++    // Exclude internal modules from published POM and add kernel dependencies.
++    // Kernel modules are transitive through sparkV2 (an internal module), so they
++    // are lost when sparkV2 is filtered out. We re-add them explicitly here.
+     pomPostProcess := { node =>
+       val internalModules = internalModuleNames.value
++      val ver = version.value
+       import scala.xml._
+       import scala.xml.transform._
++
++      def kernelDependencyNode(artifactId: String): Elem = {
++        <dependency>
++          <groupId>io.delta</groupId>
++          <artifactId>{artifactId}</artifactId>
++          <version>{ver}</version>
++        </dependency>
++      }
++
++      val kernelDeps = Seq(
++        kernelDependencyNode("delta-kernel-api"),
++        kernelDependencyNode("delta-kernel-defaults"),
++        kernelDependencyNode("delta-kernel-unitycatalog")
++      )
++
+       new RuleTransformer(new RewriteRule {
+         override def transform(n: Node): Seq[Node] = n match {
+-          case e: Elem if e.label == "dependency" =>
+-            val artifactId = (e \ "artifactId").text
+-            // Check if artifactId starts with any internal module name
+-            // (e.g., "delta-spark-v1_4.1_2.13" starts with "delta-spark-v1")
+-            val isInternal = internalModules.exists(module => artifactId.startsWith(module))
+-            if (isInternal) Seq.empty else Seq(n)
++          case e: Elem if e.label == "dependencies" =>
++            val filtered = e.child.filter {
++              case child: Elem if child.label == "dependency" =>
++                val artifactId = (child \ "artifactId").text
++                !internalModules.exists(module => artifactId.startsWith(module))
++              case _ => true
++            }
++            Seq(e.copy(child = filtered ++ kernelDeps))
+           case _ => Seq(n)
+         }
+       }).transform(node).head
+     commonSettings,
+     scalaStyleSettings,
+     releaseSettings,
+-    CrossSparkVersions.sparkDependentModuleName(sparkVersion),
++    // Set sparkVersion directly (not sparkDependentModuleName) so that
++    // runOnlyForReleasableSparkModules discovers this module, but without adding a Spark
++    // suffix to the artifact name. delta-contribs is only published as delta-contribs_2.13.
++    sparkVersion := CrossSparkVersions.getSparkVersion(),
+     Compile / packageBin / mappings := (Compile / packageBin / mappings).value ++
+       listPythonFiles(baseDirectory.value.getParentFile / "python"),
+ 
+   ).configureUnidoc()
+ 
+ 
+-val unityCatalogVersion = "0.3.1"
++val unityCatalogVersion = "0.4.0"
+ val sparkUnityCatalogJacksonVersion = "2.15.4" // We are using Spark 4.0's Jackson version 2.15.x, to override Unity Catalog 0.3.0's version 2.18.x
+ 
+ lazy val sparkUnityCatalog = (project in file("spark/unitycatalog"))
+     libraryDependencies ++= Seq(
+       "org.apache.spark" %% "spark-sql" % sparkVersion.value % "provided",
+ 
+-      "io.delta" %% "delta-sharing-client" % "1.3.9",
++      "io.delta" %% "delta-sharing-client" % "1.3.10",
+ 
+       // Test deps
+       "org.scalatest" %% "scalatest" % scalaTestVersion % "test",
+ 
+       // Test Deps
+       "org.scalatest" %% "scalatest" % scalaTestVersion % "test",
++      // Jackson datatype module needed for UC SDK tests (excluded from main compile scope)
++      "com.fasterxml.jackson.datatype" % "jackson-datatype-jsr310" % "2.15.4" % "test",
+     ),
+ 
+     // Unidoc settings
+     commonSettings,
+     scalaStyleSettings,
+     releaseSettings,
+-    CrossSparkVersions.sparkDependentModuleName(sparkVersion),
++    // Set sparkVersion directly (not sparkDependentModuleName) so that
++    // runOnlyForReleasableSparkModules discovers this module, but without adding a Spark
++    // suffix to the artifact name. delta-iceberg is only published as delta-iceberg_2.13.
++    sparkVersion := CrossSparkVersions.getSparkVersion(),
+     libraryDependencies ++= {
+       if (supportIceberg) {
+         Seq(
+           "org.xerial" % "sqlite-jdbc" % "3.45.0.0" % "test",
+           "org.apache.httpcomponents.core5" % "httpcore5" % "5.2.4" % "test",
+           "org.apache.httpcomponents.client5" % "httpclient5" % "5.3.1" % "test",
+-          "org.apache.iceberg" %% icebergSparkRuntimeArtifactName % "1.10.0" % "provided"
++          "org.apache.iceberg" %% icebergSparkRuntimeArtifactName % "1.10.0" % "provided",
++          // For FixedGcsAccessTokenProvider (GCS server-side planning credentials)
++          "com.google.cloud.bigdataoss" % "util-hadoop" % "hadoop3-2.2.26" % "provided"
+         )
+       } else {
+         Seq.empty
+   )
+ // scalastyle:on println
+ 
+-val icebergShadedVersion = "1.10.0"
++val icebergShadedVersion = "1.10.1"
+ lazy val icebergShaded = (project in file("icebergShaded"))
+   .dependsOn(spark % "provided")
+   .disablePlugins(JavaFormatterPlugin, ScalafmtPlugin)
+     commonSettings,
+     scalaStyleSettings,
+     releaseSettings,
+-    CrossSparkVersions.sparkDependentSettings(sparkVersion),
+-    libraryDependencies ++= Seq(
+-      "org.apache.hudi" % "hudi-java-client" % "0.15.0" % "compile" excludeAll(
+-        ExclusionRule(organization = "org.apache.hadoop"),
+-        ExclusionRule(organization = "org.apache.zookeeper"),
+-      ),
+-      "org.apache.spark" %% "spark-avro" % sparkVersion.value % "test" excludeAll ExclusionRule(organization = "org.apache.hadoop"),
+-      "org.apache.parquet" % "parquet-avro" % "1.12.3" % "compile"
+-    ),
++    // Set sparkVersion directly (not sparkDependentModuleName) so that
++    // runOnlyForReleasableSparkModules discovers this module, but without adding a Spark
++    // suffix to the artifact name. delta-hudi is only published as delta-hudi_2.13.
++    sparkVersion := CrossSparkVersions.getSparkVersion(),
++    libraryDependencies ++= {
++      if (supportHudi) {
++        Seq(
++          "org.apache.hudi" % "hudi-java-client" % "0.15.0" % "compile" excludeAll(
++            ExclusionRule(organization = "org.apache.hadoop"),
++            ExclusionRule(organization = "org.apache.zookeeper"),
++          ),
++          "org.apache.spark" %% "spark-avro" % sparkVersion.value % "test" excludeAll ExclusionRule(organization = "org.apache.hadoop"),
++          "org.apache.parquet" % "parquet-avro" % "1.12.3" % "compile"
++        )
++      } else {
++        Seq.empty
++      }
++    },
++    // Skip compilation and publishing when supportHudi is false
++    Compile / skip := !supportHudi,
++    Test / skip := !supportHudi,
++    publish / skip := !supportHudi,
++    publishLocal / skip := !supportHudi,
++    publishM2 / skip := !supportHudi,
+     assembly / assemblyJarName := s"${name.value}-assembly_${scalaBinaryVersion.value}-${version.value}.jar",
+     assembly / logLevel := Level.Info,
+     assembly / test := {},
+       // crossScalaVersions must be set to Nil on the aggregating project
+       crossScalaVersions := Nil,
+       publishArtifact := false,
+-      publish / skip := false,
++      publish / skip := true,
+     )
+ }
+ 
+       // crossScalaVersions must be set to Nil on the aggregating project
+       crossScalaVersions := Nil,
+       publishArtifact := false,
+-      publish / skip := false,
++      publish / skip := true,
+     )
+ }
+ 
+     // crossScalaVersions must be set to Nil on the aggregating project
+     crossScalaVersions := Nil,
+     publishArtifact := false,
+-    publish / skip := false,
++    publish / skip := true,
+     unidocSourceFilePatterns := {
+       (kernelApi / unidocSourceFilePatterns).value.scopeToProject(kernelApi) ++
+       (kernelDefaults / unidocSourceFilePatterns).value.scopeToProject(kernelDefaults)
+     // crossScalaVersions must be set to Nil on the aggregating project
+     crossScalaVersions := Nil,
+     publishArtifact := false,
+-    publish / skip := false,
++    publish / skip := true,
+   )
+ 
+ /*
+     sys.env.getOrElse("SONATYPE_USERNAME", ""),
+     sys.env.getOrElse("SONATYPE_PASSWORD", "")
+   ),
++  credentials += Credentials(
++    "Sonatype Nexus Repository Manager",
++    "central.sonatype.com",
++    sys.env.getOrElse("SONATYPE_USERNAME", ""),
++    sys.env.getOrElse("SONATYPE_PASSWORD", "")
++  ),
+   publishTo := {
+     val ossrhBase = "https://ossrh-staging-api.central.sonatype.com/"
++    val centralSnapshots = "https://central.sonatype.com/repository/maven-snapshots/"
+     if (isSnapshot.value) {
+-      Some("snapshots" at ossrhBase + "content/repositories/snapshots")
++      Some("snapshots" at centralSnapshots)
+     } else {
+       Some("releases"  at ossrhBase + "service/local/staging/deploy/maven2")
+     }
+ // Looks like some of release settings should be set for the root project as well.
+ publishArtifact := false  // Don't release the root project
+ publish / skip := true
+-publishTo := Some("snapshots" at "https://ossrh-staging-api.central.sonatype.com/content/repositories/snapshots")
++publishTo := Some("snapshots" at "https://central.sonatype.com/repository/maven-snapshots/")
+ releaseCrossBuild := false  // Don't use sbt-release's cross facility
+ releaseProcess := Seq[ReleaseStep](
+   checkSnapshotDependencies,
+   setReleaseVersion,
+   commitReleaseVersion,
+   tagRelease
+-) ++ CrossSparkVersions.crossSparkReleaseSteps("+publishSigned") ++ Seq[ReleaseStep](
++) ++ CrossSparkVersions.crossSparkReleaseSteps("publishSigned") ++ Seq[ReleaseStep](
+ 
+   // Do NOT use `sonatypeBundleRelease` - it will actually release to Maven! We want to do that
+   // manually.
\ No newline at end of file
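
The `pomPostProcess` change in the build.sbt hunk above amounts to a list transformation on the `<dependencies>` element: drop every dependency whose `artifactId` starts with an internal module name, then append the kernel dependencies that would otherwise be lost. A rough sketch of that logic in plain Java, outside sbt and `scala.xml`, follows. `PomDependencyFilter` and `rewrite` are hypothetical names, and the internal module name `delta-spark-v2` and the artifact `delta-storage` are assumed here purely for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the XML rewrite rule; it operates on artifact IDs only.
class PomDependencyFilter {

    /**
     * Removes dependencies that belong to internal modules (prefix match on the
     * artifact ID) and appends the kernel dependencies at the end.
     */
    static List<String> rewrite(
            List<String> artifactIds,
            List<String> internalModules,
            List<String> kernelDeps) {
        List<String> result = new ArrayList<>();
        for (String id : artifactIds) {
            // e.g. "delta-spark-v1_4.1_2.13" starts with "delta-spark-v1"
            boolean isInternal = internalModules.stream().anyMatch(id::startsWith);
            if (!isInternal) {
                result.add(id);
            }
        }
        // Re-add the kernel modules that were only reachable through an internal module.
        result.addAll(kernelDeps);
        return result;
    }

    public static void main(String[] args) {
        List<String> rewritten = rewrite(
            List.of("delta-spark-v1_4.1_2.13", "delta-storage", "delta-spark-v2_2.13"),
            List.of("delta-spark-v1", "delta-spark-v2"),
            List.of("delta-kernel-api", "delta-kernel-defaults", "delta-kernel-unitycatalog"));
        System.out.println(rewritten);
    }
}
```

Matching on the enclosing `dependencies` element rather than on each `dependency`, as the diff does, is what makes the append step possible: the rule sees the whole child list at once.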

connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/.00000000000000000000.json.crc

@@ -0,0 +1,3 @@
+diff --git a/connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/.00000000000000000000.json.crc b/connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/.00000000000000000000.json.crc
+new file mode 100644
+Binary files /dev/null and b/connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/.00000000000000000000.json.crc differ
\ No newline at end of file

connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/00000000000000000000.crc

@@ -0,0 +1,5 @@
+diff --git a/connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/00000000000000000000.crc b/connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/00000000000000000000.crc
+new file mode 100644
+--- /dev/null
++++ b/connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/00000000000000000000.crc
++{"txnId":"6132e880-0f3a-4db4-b882-1da039bffbad","tableSizeBytes":0,"numFiles":0,"numMetadata":1,"numProtocol":1,"setTransactions":[],"domainMetadata":[],"metadata":{"id":"0eb3e007-b3cc-40e4-bca1-a5970d86b5a6","format":{"provider":"parquet","options":{}},"schemaString":"{\"type\":\"struct\",\"fields\":[{\"name\":\"id\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}},{\"name\":\"utf8_binary_col\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"utf8_lcase_col\",\"type\":\"string\",\"nullable\":true,\"metadata\":{\"__COLLATIONS\":{\"utf8_lcase_col\":\"spark.UTF8_LCASE\"}}},{\"name\":\"unicode_col\",\"type\":\"string\",\"nullable\":true,\"metadata\":{\"__COLLATIONS\":{\"unicode_col\":\"icu.UNICODE\"}}}]}","partitionColumns":[],"configuration":{},"createdTime":1773779518731},"protocol":{"minReaderVersion":1,"minWriterVersion":7,"writerFeatures":["domainMetadata","collations-preview","appendOnly","invariants"]},"histogramOpt":{"sortedBinBoundaries":[0,8192,16384,32768,65536,131072,262144,524288,1048576,2097152,4194304,8388608,12582912,16777216,20971520,25165824,29360128,33554432,37748736,41943040,50331648,58720256,67108864,75497472,83886080,92274688,100663296,109051904,117440512,125829120,130023424,134217728,138412032,142606336,146800640,150994944,167772160,184549376,201326592,218103808,234881024,251658240,268435456,285212672,301989888,318767104,335544320,352321536,369098752,385875968,402653184,419430400,436207616,452984832,469762048,486539264,503316480,520093696,536870912,553648128,570425344,587202560,603979776,671088640,738197504,805306368,872415232,939524096,1006632960,1073741824,1140850688,1207959552,1275068416,1342177280,1409286144,1476395008,1610612736,1744830464,1879048192,2013265920,2147483648,2415919104,2684354560,2952790016,3221225472,3489660928,3758096384,4026531840,4294967296,8589934592,17179869184,34359738368,68719476736,137438953472,274877906944],"fileCounts":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],"totalBytes":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]},"allFiles":[]}
\ No newline at end of file

... (truncated, output exceeded 60000 bytes)

_{Reproduce locally: git range-diff e8cffee..07859c6 d1139d2..7a4827d | Disable: git config gitstack.push-range-diff false}

zikangh · 2026-04-06T17:30:58Z

Range-diff: master (7a4827d -> 6b26f5b)

.github/CODEOWNERS

@@ -0,0 +1,12 @@
+diff --git a/.github/CODEOWNERS b/.github/CODEOWNERS
+--- a/.github/CODEOWNERS
++++ b/.github/CODEOWNERS
+ /project/                       @tdas
+ /version.sbt                    @tdas
+ 
++# Spark V2 and Unified modules
++/spark/v2/                      @tdas @huan233usc @TimothyW553 @raveeram-db @murali-db
++/spark-unified/                 @tdas @huan233usc @TimothyW553 @raveeram-db @murali-db
++
+ # All files in the root directory
+ /*                              @tdas
\ No newline at end of file

.github/workflows/iceberg_test.yaml

@@ -0,0 +1,16 @@
+diff --git a/.github/workflows/iceberg_test.yaml b/.github/workflows/iceberg_test.yaml
+--- a/.github/workflows/iceberg_test.yaml
++++ b/.github/workflows/iceberg_test.yaml
+           # the above directories when we use the key for the first time. After that, each run will
+           # just use the cache. The cache is immutable so we need to use a new key when trying to
+           # cache new stuff.
+-          key: delta-sbt-cache-spark3.2-scala${{ matrix.scala }}
++          key: delta-sbt-cache-spark4.0-scala${{ matrix.scala }}
+       - name: Install Job dependencies
+         run: |
+           sudo apt-get update
+       - name: Run Scala/Java and Python tests
+         # when changing TEST_PARALLELISM_COUNT make sure to also change it in spark_master_test.yaml
+         run: |
+-          TEST_PARALLELISM_COUNT=4 pipenv run python run-tests.py --group iceberg
++          TEST_PARALLELISM_COUNT=4 pipenv run python run-tests.py --group iceberg --spark-version 4.0
\ No newline at end of file

.github/workflows/spark_examples_test.yaml

@@ -0,0 +1,54 @@
+diff --git a/.github/workflows/spark_examples_test.yaml b/.github/workflows/spark_examples_test.yaml
+--- a/.github/workflows/spark_examples_test.yaml
++++ b/.github/workflows/spark_examples_test.yaml
+         # Spark versions are dynamically generated - released versions only
+         spark_version: ${{ fromJson(needs.generate-matrix.outputs.spark_versions) }}
+         # These Scala versions must match those in the build.sbt
+-        scala: [2.13.16]
++        scala: [2.13.17]
+     env:
+       SCALA_VERSION: ${{ matrix.scala }}
+-      SPARK_VERSION: ${{ matrix.spark_version }}
+     steps:
+       - uses: actions/checkout@v3
+       - name: Get Spark version details
+         id: spark-details
+         run: |
+-          # Get JVM version, package suffix, iceberg support for this Spark version
++          # Get JVM version, package suffix, iceberg support, and full version for this Spark version
+           JVM_VERSION=$(python3 project/scripts/get_spark_version_info.py --get-field "${{ matrix.spark_version }}" targetJvm | jq -r)
+           SPARK_PACKAGE_SUFFIX=$(python3 project/scripts/get_spark_version_info.py --get-field "${{ matrix.spark_version }}" packageSuffix | jq -r)
+           SUPPORT_ICEBERG=$(python3 project/scripts/get_spark_version_info.py --get-field "${{ matrix.spark_version }}" supportIceberg | jq -r)
++          SPARK_FULL_VERSION=$(python3 project/scripts/get_spark_version_info.py --get-field "${{ matrix.spark_version }}" fullVersion | jq -r)
+           echo "jvm_version=$JVM_VERSION" >> $GITHUB_OUTPUT
+           echo "spark_package_suffix=$SPARK_PACKAGE_SUFFIX" >> $GITHUB_OUTPUT
+           echo "support_iceberg=$SUPPORT_ICEBERG" >> $GITHUB_OUTPUT
+-          echo "Using JVM $JVM_VERSION for Spark ${{ matrix.spark_version }}, package suffix: '$SPARK_PACKAGE_SUFFIX', support iceberg: '$SUPPORT_ICEBERG'"
++          echo "spark_full_version=$SPARK_FULL_VERSION" >> $GITHUB_OUTPUT
++          echo "Using JVM $JVM_VERSION for Spark $SPARK_FULL_VERSION, package suffix: '$SPARK_PACKAGE_SUFFIX', support iceberg: '$SUPPORT_ICEBERG'"
+       - name: install java
+         uses: actions/setup-java@v3
+         with:
+       - name: Run Delta Spark Local Publishing and Examples Compilation
+         # examples/scala/build.sbt will compile against the local Delta release version (e.g. 3.2.0-SNAPSHOT).
+         # Thus, we need to publishM2 first so those jars are locally accessible.
+-        # The SPARK_PACKAGE_SUFFIX env var tells examples/scala/build.sbt which artifact naming to use.
++        # -DsparkVersion is for the Delta project's publishM2 (which Spark version to compile Delta against).
++        # SPARK_VERSION/SPARK_PACKAGE_SUFFIX/SUPPORT_ICEBERG are for examples/scala/build.sbt (dependency resolution).
+         env:
+           SPARK_PACKAGE_SUFFIX: ${{ steps.spark-details.outputs.spark_package_suffix }}
+           SUPPORT_ICEBERG: ${{ steps.spark-details.outputs.support_iceberg }}
++          SPARK_VERSION: ${{ steps.spark-details.outputs.spark_full_version }}
+         run: |
+           build/sbt clean
+-          build/sbt -DsparkVersion=${{ matrix.spark_version }} publishM2
++          build/sbt -DsparkVersion=${{ steps.spark-details.outputs.spark_full_version }} publishM2
+           cd examples/scala && build/sbt "++ $SCALA_VERSION compile"
++      - name: Run UC Delta Integration Test
++        # Verifies that delta-spark resolved from Maven local includes all kernel module
++        # dependencies transitively by running a real UC-backed Delta workload.
++        env:
++          SPARK_PACKAGE_SUFFIX: ${{ steps.spark-details.outputs.spark_package_suffix }}
++          SPARK_VERSION: ${{ steps.spark-details.outputs.spark_full_version }}
++        run: |
++          cd examples/scala && build/sbt "++ $SCALA_VERSION runMain example.UnityCatalogQuickstart"
\ No newline at end of file

.github/workflows/spark_test.yaml

@@ -0,0 +1,27 @@
+diff --git a/.github/workflows/spark_test.yaml b/.github/workflows/spark_test.yaml
+--- a/.github/workflows/spark_test.yaml
++++ b/.github/workflows/spark_test.yaml
+         # These Scala versions must match those in the build.sbt
+         scala: [2.13.16]
+         # Important: This list of shards must be [0..NUM_SHARDS - 1]
+-        shard: [0, 1, 2, 3]
++        shard: [0, 1, 2, 3, 4, 5, 6, 7]
+     env:
+       SCALA_VERSION: ${{ matrix.scala }}
+       SPARK_VERSION: ${{ matrix.spark_version }}
+       # Important: This must be the same as the length of shards in matrix
+-      NUM_SHARDS: 4
++      NUM_SHARDS: 8
+     steps:
+       - uses: actions/checkout@v3
+       - name: Get Spark version details
+         # when changing TEST_PARALLELISM_COUNT make sure to also change it in spark_python_test.yaml
+         run: |
+           TEST_PARALLELISM_COUNT=4 pipenv run python run-tests.py --group spark --shard ${{ matrix.shard }} --spark-version ${{ matrix.spark_version }}
++      - name: Upload test reports
++        if: always()
++        uses: actions/upload-artifact@v4
++        with:
++          name: test-reports-spark${{ matrix.spark_version }}-shard${{ matrix.shard }}
++          path: "**/target/test-reports/*.xml"
++          retention-days: 7
\ No newline at end of file
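
The two "Important" comments in this workflow describe one invariant: the `shard` matrix must be exactly `[0..NUM_SHARDS - 1]`. A minimal sketch of that check, assuming the two values have already been read out of the YAML (the helper name `check_shard_matrix` is hypothetical and not part of the repository):

```python
def check_shard_matrix(shards, num_shards):
    """Return True when `shards` is exactly [0, 1, ..., num_shards - 1]."""
    return list(shards) == list(range(num_shards))


# Values after this change: eight shards and NUM_SHARDS = 8.
print(check_shard_matrix([0, 1, 2, 3, 4, 5, 6, 7], 8))  # True

# Bumping NUM_SHARDS without extending the matrix breaks the invariant.
print(check_shard_matrix([0, 1, 2, 3], 8))  # False
```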

PROTOCOL.md

@@ -0,0 +1,537 @@
+diff --git a/PROTOCOL.md b/PROTOCOL.md
+--- a/PROTOCOL.md
++++ b/PROTOCOL.md
+   - [Writer Requirements for Variant Type](#writer-requirements-for-variant-type)
+   - [Reader Requirements for Variant Data Type](#reader-requirements-for-variant-data-type)
+   - [Compatibility with other Delta Features](#compatibility-with-other-delta-features)
++- [Catalog-managed tables](#catalog-managed-tables)
++  - [Terminology: Commits](#terminology-commits)
++  - [Terminology: Delta Client](#terminology-delta-client)
++  - [Terminology: Catalogs](#terminology-catalogs)
++  - [Catalog Responsibilities](#catalog-responsibilities)
++  - [Reading Catalog-managed Tables](#reading-catalog-managed-tables)
++  - [Commit Protocol](#commit-protocol)
++  - [Getting Ratified Commits from the Catalog](#getting-ratified-commits-from-the-catalog)
++  - [Publishing Commits](#publishing-commits)
++  - [Maintenance Operations on Catalog-managed Tables](#maintenance-operations-on-catalog-managed-tables)
++  - [Creating and Dropping Catalog-managed Tables](#creating-and-dropping-catalog-managed-tables)
++  - [Catalog-managed Table Enablement](#catalog-managed-table-enablement)
++  - [Writer Requirements for Catalog-managed tables](#writer-requirements-for-catalog-managed-tables)
++  - [Reader Requirements for Catalog-managed tables](#reader-requirements-for-catalog-managed-tables)
++  - [Table Discovery](#table-discovery)
++  - [Sample Catalog Client API](#sample-catalog-client-api)
+ - [Requirements for Writers](#requirements-for-writers)
+   - [Creation of New Log Entries](#creation-of-new-log-entries)
+   - [Consistency Between Table Metadata and Data Files](#consistency-between-table-metadata-and-data-files)
+ __(1)__ `preimage` is the value before the update, `postimage` is the value after the update.
+ 
+ ### Delta Log Entries
+-Delta files are stored as JSON in a directory at the root of the table named `_delta_log`, and together with checkpoints make up the log of all changes that have occurred to a table.
+ 
+-Delta files are the unit of atomicity for a table, and are named using the next available version number, zero-padded to 20 digits.
++Delta Log Entries, also known as Delta files, are JSON files stored in the `_delta_log`
++directory at the root of the table. Together with checkpoints, they make up the log of all changes
++that have occurred to a table. Delta files are the unit of atomicity for a table, and are named
++using the next available version number, zero-padded to 20 digits.
+ 
+ For example:
+ 
+ ```
+ ./_delta_log/00000000000000000000.json
+ ```
+-Delta files use new-line delimited JSON format, where every action is stored as a single line JSON document.
+-A delta file, `n.json`, contains an atomic set of [_actions_](#Actions) that should be applied to the previous table state, `n-1.json`, in order to the construct `n`th snapshot of the table.
+-An action changes one aspect of the table's state, for example, adding or removing a file.
++
++Delta files use newline-delimited JSON format, where every action is stored as a single-line
++JSON document. A Delta file, corresponding to version `v`, contains an atomic set of
++[_actions_](#actions) that should be applied to the previous table state corresponding to version
++`v-1`, in order to construct the `v`th snapshot of the table. An action changes one aspect of the
++table's state, for example, adding or removing a file.
++
++**Note:** If the [catalogManaged table feature](#catalog-managed-tables) is enabled on the table,
++recently [ratified commits](#ratified-commit) may not yet be published to the `_delta_log` directory as normal Delta
++files - they may be stored directly by the catalog or reside in the `_delta_log/_staged_commits`
++directory. Delta clients must contact the table's managing catalog in order to find the information
++about these [ratified, potentially-unpublished commits](#publishing-commits).
++
++The `_delta_log/_staged_commits` directory is the staging area for [staged](#staged-commit)
++commits. Delta files in this directory have a UUID embedded into them and follow the pattern
++`<version>.<uuid>.json`, where the version corresponds to the proposed commit version, zero-padded
++to 20 digits.
++
++For example:
++
++```
++./_delta_log/_staged_commits/00000000000000000000.3a0d65cd-4056-49b8-937b-95f9e3ee90e5.json
++./_delta_log/_staged_commits/00000000000000000001.7d17ac10-5cc3-401b-bd1a-9c82dd2ea032.json
++./_delta_log/_staged_commits/00000000000000000001.016ae953-37a9-438e-8683-9a9a4a79a395.json
++./_delta_log/_staged_commits/00000000000000000002.3ae45b72-24e1-865a-a211-34987ae02f2a.json
++```
++
++NOTE: The (proposed) version number of a staged commit is authoritative - file
++`00000000000000000100.<uuid>.json` always corresponds to a commit attempt for version 100. Besides
++simplifying implementations, it also acknowledges the fact that commit files cannot safely be reused
++for multiple commit attempts. For example, resolving conflicts in a table with [row
++tracking](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#row-tracking) enabled requires
++rewriting all file actions to update their `baseRowId` field.
++
++The [catalog](#terminology-catalogs) is the source of truth about which staged commit files in
++the `_delta_log/_staged_commits` directory correspond to ratified versions, and Delta clients should
++not attempt to directly interpret the contents of that directory. Refer to
++[catalog-managed tables](#catalog-managed-tables) for more details.
+ 
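
The staged-commit naming pattern `<version>.<uuid>.json` described above can be sketched as follows. This is an illustration only: the helper names are hypothetical, and the regular expression assumes lowercase hexadecimal UUIDs in the usual 8-4-4-4-12 layout, which matches the examples but is not something the spec text states.

```python
import re
import uuid

# Assumption: UUIDs are lowercase hex in 8-4-4-4-12 form, as in the examples.
STAGED_COMMIT_RE = re.compile(
    r"^(\d{20})\."
    r"([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})"
    r"\.json$"
)


def staged_commit_name(version, commit_uuid=None):
    """Build `<version>.<uuid>.json` with the version zero-padded to 20 digits."""
    commit_uuid = commit_uuid or uuid.uuid4()
    return f"{version:020d}.{commit_uuid}.json"


def parse_staged_commit_name(name):
    """Return (proposed_version, uuid_string), or None if the name does not match."""
    match = STAGED_COMMIT_RE.match(name)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)


name = staged_commit_name(1, "7d17ac10-5cc3-401b-bd1a-9c82dd2ea032")
print(name)  # 00000000000000000001.7d17ac10-5cc3-401b-bd1a-9c82dd2ea032.json
print(parse_staged_commit_name(name))
```

Parsing a name only recovers the proposed version; whether that file was ratified is still a question for the catalog.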
+ ### Checkpoints
+ Checkpoints are also stored in the `_delta_log` directory, and can be created at any time, for any committed version of the table.
+ ### Commit Provenance Information
+ A delta file can optionally contain additional provenance information about what higher-level operation was being performed as well as who executed it.
+ 
++When the `catalogManaged` table feature is enabled, the `commitInfo` action must have a field
++`txnId` that stores a unique transaction identifier string.
++
+ Implementations are free to store any valid JSON-formatted data via the `commitInfo` action.
+ 
+ When [In-Commit Timestamps](#in-commit-timestamps) are enabled, writers are required to include a `commitInfo` action with every commit, which must include the `inCommitTimestamp` field. Also, the `commitInfo` action must be the first action in the commit.
+  - A single `protocol` action
+  - A single `metaData` action
+  - A collection of `txn` actions with unique `appId`s
+- - A collection of `domainMetadata` actions with unique `domain`s.
++ - A collection of `domainMetadata` actions with unique `domain`s, excluding tombstones (i.e. actions with `removed=true`).
+  - A collection of `add` actions with unique path keys, corresponding to the newest (path, deletionVector.uniqueId) pair encountered for each path.
+  - A collection of `remove` actions with unique `(path, deletionVector.uniqueId)` keys. The intersection of the primary keys in the `add` collection and `remove` collection must be empty. That means a logical file cannot exist in both the `remove` and `add` collections at the same time; however, the same *data file* can exist with *different* DVs in the `remove` collection, as logically they represent different content. The `remove` actions act as _tombstones_, and only exist for the benefit of the VACUUM command. Snapshot reads only return `add` actions on the read path.
+  
+      - write a `metaData` action to add the `delta.columnMapping.mode` table property.
+  - Write data files by using the _physical name_ that is chosen for each column. The physical name of the column is static and can be different than the _display name_ of the column, which is changeable.
+  - Write the 32 bit integer column identifier as part of the `field_id` field of the `SchemaElement` struct in the [Parquet Thrift specification](https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift).
+- - Track partition values and column level statistics with the physical name of the column in the transaction log.
++ - Track partition values, column level statistics, and [clustering column](#clustered-table) names with the physical name of the column in the transaction log.
+  - Assign a globally unique identifier as the physical name for each new column that is added to the schema. This is especially important for supporting cheap column deletions in `name` mode. In addition, column identifiers need to be assigned to each column. The maximum id that is assigned to a column is tracked as the table property `delta.columnMapping.maxColumnId`. This is an internal table property that cannot be configured by users. This value must increase monotonically as new columns are introduced and committed to the table alongside the introduction of the new columns to the schema.
+ 
+ ## Reader Requirements for Column Mapping
+ ## Writer Requirement for Deletion Vectors
+ When adding a logical file with a deletion vector, then that logical file must have correct `numRecords` information for the data file in the `stats` field.
+ 
++# Catalog-managed tables
++
++With this feature enabled, the [catalog](#terminology-catalogs) that manages the table becomes the
++source of truth for whether a given commit attempt succeeded.
++
++The table feature defines the parts of the [commit protocol](#commit-protocol) that directly impact
++the Delta table (e.g. atomicity requirements, publishing, etc). The Delta client and catalog
++together are responsible for implementing the Delta-specific aspects of commit as defined by this
++spec, but are otherwise free to define their own APIs and protocols for communication with each
++other.
++
++**NOTE**: Filesystem-based access to catalog-managed tables is not supported. Delta clients are
++expected to discover and access catalog-managed tables through the managing catalog, not by direct
++listing in the filesystem. This feature is primarily designed to warn filesystem-based readers that
++might attempt to access a catalog-managed table's storage location without going through the catalog
++first, and to block filesystem-based writers who could otherwise corrupt both the table and the
++catalog by failing to commit through the catalog.
++
++Before we can go into details of this protocol feature, we must first align our terminology.
++
++## Terminology: Commits
++
++A commit is a set of [actions](#actions) that transform a Delta table from version `v - 1` to `v`.
++It contains the same kind of content as is stored in a [Delta file](#delta-log-entries).
++
++A commit may be stored in the file system as a Delta file - either _published_ or _staged_ - or
++stored _inline_ in the managing catalog, using whatever format the catalog prefers.
++
++There are several types of commits:
++
++1. **Proposed commit**:  A commit that a Delta client has proposed for the next version of the
++   table. It could be _staged_ or _inline_. It will either become _ratified_ or be rejected.
++
++2. <a name="staged-commit">**Staged commit**</a>: A commit that is written to disk at
++   `_delta_log/_staged_commits/<v>.<uuid>.json`. It has the same content and format as a published
++   Delta file.
++    - Here, the `uuid` is a random UUID that is generated for each commit and `v` is the version
++      which is proposed to be committed, zero-padded to 20 digits.
++    - The mere existence of a staged commit does not mean that the file has been ratified or even
++      proposed. It might correspond to a failed or in-progress commit attempt.
++    - The catalog is the source of truth around which staged commits are ratified.
++    - The catalog stores only the location, not the content, of a staged (and ratified) commit.
++
++3. <a name="inline-commit">**Inline commit**</a>: A proposed commit that is not written to disk but
++   rather has its content sent to the catalog for the catalog to store directly.
++
++4. <a name="ratified-commit">**Ratified commit**</a>: A proposed commit that a catalog has
++   determined has won the commit at the desired version of the table.
++    - The catalog must store ratified commits (that is, the staged commit's location or the inline
++      commit's content) until they are published to the `_delta_log` directory.
++    - A ratified commit may or may not yet be published.
++    - A ratified commit may or may not even be stored by the catalog at all - the catalog may
++      have just atomically published it to the filesystem directly, relying on PUT-if-absent
++      primitives to facilitate the ratification and publication all in one step.
++
++5. <a name="published-commit">**Published commit**</a>: A ratified commit that has been copied into
++   the `_delta_log` as a normal Delta file, i.e. `_delta_log/<v>.json`.
++    - Here, the `v` is the version which is being committed, zero-padded to 20 digits.
++    - The existence of a `<v>.json` file proves that the corresponding version `v` is ratified,
++      regardless of whether the table is catalog-managed or filesystem-based. The catalog is allowed
++      to return information about published commits, but Delta clients can also use filesystem
++      listing operations to directly discover them.
++    - Published commits do not need to be stored by the catalog.
++
++## Terminology: Delta Client
++
++This is the component that implements support for reading and writing Delta tables, and implements
++the logic required by the `catalogManaged` table feature. Among other things, it
++- triggers the filesystem listing, if needed, to discover published commits
++- generates the commit content (the set of [actions](#actions))
++- works together with the query engine to trigger the commit process and invoke the client-side
++  catalog component with the commit content
++
++The Delta client is also responsible for defining the client-side API that catalogs should target.
++That is, there must be _some_ API that the [catalog client](#catalog-client) can use to communicate
++to the Delta client the subset of catalog-managed information that the Delta client cares about.
++This protocol feature is concerned with what information Delta cares about, but leaves to Delta
++clients the design of the API they use to obtain that information from catalog clients.
++
++## Terminology: Catalogs
++
++1. **Catalog**: A catalog is an entity which manages a Delta table, including its creation, writes,
++   reads, and eventual deletion.
++    - It could be backed by a database, a filesystem, or any other persistence mechanism.
++    - Each catalog has its own spec for how catalog clients should interact with it, and how
++      they perform a commit.
++
++2. <a name="catalog-client">**Catalog Client**</a>: The catalog always has a client-side component
++   which the Delta client interacts with directly. This client-side component has two primary
++   responsibilities:
++    - implement any client-side catalog-specific logic (such as staging or
++      [publishing](#publishing-commits) commits)
++    - communicate with the Catalog Server, if any
++
++3. **Catalog Server**: The catalog may also involve a server-side component, which the client-side
++   component is responsible for communicating with.
++    - This server is responsible for coordinating commits and potentially persisting table metadata
++      and enforcing authorization policies.
++    - Not all catalogs require a server; some may be entirely client-side, e.g. filesystem-backed
++      catalogs, or they may make use of a generic database server and implement all of the catalog's
++      business logic client-side.
++
++**NOTE**: This specification outlines the responsibilities and actions that catalogs must implement.
++This spec does its best not to assume any specific catalog _implementation_, though it does call out
++likely client-side and server-side responsibilities. Nonetheless, what a given catalog does
++client-side or server-side is up to each catalog implementation to decide for itself.
++
++## Catalog Responsibilities
++
++When the `catalogManaged` table feature is enabled, a catalog performs commits to the table on behalf
++of the Delta client.
++
++As stated above, the Delta spec does not mandate any particular client-server design or API for
++catalogs that manage Delta tables. However, the catalog does need to provide certain capabilities
++for reading and writing Delta tables:
++
++- Atomically commit a version `v` with a given set of `actions`. This is explained in detail in the
++  [commit protocol](#commit-protocol) section.
++- Retrieve information about recent ratified commits and the latest ratified version on the table.
++  This is explained in detail in the [Getting Ratified Commits from the Catalog](#getting-ratified-commits-from-the-catalog) section.
++- Though not required, it is encouraged that catalogs also return the latest table-level metadata,
++  such as the latest Protocol and Metadata actions, for the table. This can provide significant
++  performance advantages to conforming Delta clients, who may forgo log replay and instead trust
++  the information provided by the catalog during query planning.
++
++## Reading Catalog-managed Tables
++
++A catalog-managed table can have a mix of (a) published and (b) ratified but non-published commits.
++The catalog is the source of truth for ratified commits. Also recall that ratified commits can be
++[staged commits](#staged-commit) that are persisted to the `_delta_log/_staged_commits` directory,
++or [inline commits](#inline-commit) whose content the catalog stores directly.
++
++For example, suppose the `_delta_log` directory contains the following files:
++
++```
++00000000000000000000.json
++00000000000000000001.json
++00000000000000000002.checkpoint.parquet
++00000000000000000002.json
++00000000000000000003.00000000000000000005.compacted.json
++00000000000000000003.json
++00000000000000000004.json
++00000000000000000005.json
++00000000000000000006.json
++00000000000000000007.json
++_staged_commits/00000000000000000007.016ae953-37a9-438e-8683-9a9a4a79a395.json // ratified and published
++_staged_commits/00000000000000000008.7d17ac10-5cc3-401b-bd1a-9c82dd2ea032.json // ratified
++_staged_commits/00000000000000000008.b91807ba-fe18-488c-a15e-c4807dbd2174.json // rejected
++_staged_commits/00000000000000000010.0f707846-cd18-4e01-b40e-84ee0ae987b0.json // not yet ratified
++_staged_commits/00000000000000000010.7a980438-cb67-4b89-82d2-86f73239b6d6.json // partial file
++```
++
++Further, suppose the catalog stores the following ratified commits:
++```
++{
++  7  -> "00000000000000000007.016ae953-37a9-438e-8683-9a9a4a79a395.json",
++  8  -> "00000000000000000008.7d17ac10-5cc3-401b-bd1a-9c82dd2ea032.json",
++  9  -> <inline commit: content stored by the catalog directly>
++}
++```
++
++Some things to note are:
++- the catalog isn't aware that commit 7 was already published - perhaps the response from the
++  filesystem was dropped
++- commit 9 is an inline commit
++- neither of the two staged commits for version 10 has been ratified
++
++To read such tables, Delta clients must first contact the catalog to get the ratified commits. This
++informs the Delta client of commits [7, 9] as well as the latest ratified version, 9.
++
++If this information is insufficient to construct a complete snapshot of the table, Delta clients
++must LIST the `_delta_log` directory to get information about the published commits. For commits
++that are both returned by the catalog and already published, Delta clients must treat the catalog's
++version as authoritative and read the commit returned by the catalog. Additionally, Delta clients
++must ignore any files with versions greater than the latest ratified commit version returned by the
++catalog.
++
++Combining these two sets of files and commits enables Delta clients to generate a snapshot at the
++latest version of the table.
++
++**NOTE**: This spec prescribes the _minimum_ required interactions between Delta clients and
++catalogs for commits. Catalogs may very well expose APIs and work with Delta clients to be
++informed of other non-commit [file types](#file-types), such as checkpoint, log
++compaction, and version checksum files. This would allow catalogs to return additional
++information to Delta clients during query and scan planning, potentially allowing Delta
++clients to avoid LISTing the filesystem altogether.
++
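
The reading rules above can be sketched as a small merge step, using the example table from this section. This is a simplified illustration under stated assumptions: `resolve_commits` is a hypothetical helper, the catalog response is modeled as a plain dict, and checkpoints and log compaction files are ignored.

```python
def resolve_commits(published_versions, catalog_commits, latest_ratified):
    """Map each commit version to the source a reader should load it from."""
    resolved = {}
    for v in published_versions:
        # Ignore published files newer than the latest ratified version.
        if v <= latest_ratified:
            resolved[v] = f"_delta_log/{v:020d}.json"
    # Commits returned by the catalog are authoritative, even if also published.
    resolved.update(catalog_commits)
    return dict(sorted(resolved.items()))


staged = "_delta_log/_staged_commits/"
catalog_commits = {
    7: staged + "00000000000000000007.016ae953-37a9-438e-8683-9a9a4a79a395.json",
    8: staged + "00000000000000000008.7d17ac10-5cc3-401b-bd1a-9c82dd2ea032.json",
    9: "<inline commit content>",
}

# Versions 0..7 are published; the catalog reports 9 as the latest ratified version.
resolved = resolve_commits(range(8), catalog_commits, latest_ratified=9)
for version, source in resolved.items():
    print(version, source)
```

Versions 0 through 6 come from the published files, while 7 through 9 come from the catalog; the unratified staged files for version 10 never enter the result because the client does not interpret `_staged_commits` on its own.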
++## Commit Protocol
++
++To start, Delta Clients send the desired actions to be committed to the client-side component of the
++catalog.
++
++This component then has several options for proposing, ratifying, and publishing the commit,
++detailed below.
++
++- Option 1: Write the actions (likely client-side) to a [staged commit file](#staged-commit) in the
++  `_delta_log/_staged_commits` directory and then ratify the staged commit (likely server-side) by
++  atomically recording (in persistent storage of some kind) that the file corresponds to version `v`.
++- Option 2: Treat this as an [inline commit](#inline-commit) (i.e. likely that the client-side
++  component sends the contents to the server-side component) and atomically record (in persistent
++  storage of some kind) the content of the commit as version `v` of the table.
++- Option 3: Catalog implementations that use PUT-if-absent (client- or server-side) can ratify and
++  publish all-in-one by atomically writing a [published commit file](#published-commit)
++  in the `_delta_log` directory. Note that this commit will be considered to have succeeded as soon
++  as the file becomes visible in the filesystem, regardless of when or whether the catalog is made
++  aware of the successful publish. The catalog does not need to store these files.
++
++A catalog must not ratify version `v` until it has ratified version `v - 1`, and it must ratify
++version `v` at most once.
++
++The catalog must store both flavors of ratified commits (staged or inline) and make them available
++to readers until they are [published](#publishing-commits).
++
++For performance reasons, Delta clients are encouraged to establish an API contract where the catalog
++provides the latest ratified commit information whenever a commit fails due to version conflict.
++
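
The ordering rule (ratify `v` only after `v - 1`, and at most once) can be sketched with a toy in-memory catalog. Everything here is hypothetical: the class and method names are invented for illustration, it assumes a new table whose first commit is version 0, and a real catalog would need an atomic, persistent store rather than a dict.

```python
class ToyCatalog:
    """Hypothetical in-memory catalog illustrating the ratification rule."""

    def __init__(self):
        self.ratified = {}        # version -> staged path or inline content
        self.latest_version = -1  # assumes the first commit is version 0

    def ratify(self, version, commit):
        """Try to ratify `commit` as `version`; return (succeeded, latest_version)."""
        if version != self.latest_version + 1:
            # Either v - 1 is not ratified yet, or v was already ratified.
            # Returning the latest version lets the caller rebase and retry.
            return False, self.latest_version
        self.ratified[version] = commit
        self.latest_version = version
        return True, self.latest_version


catalog = ToyCatalog()
print(catalog.ratify(0, "commit-a"))  # (True, 0)
print(catalog.ratify(0, "commit-b"))  # (False, 0): version 0 is already ratified
print(catalog.ratify(2, "commit-c"))  # (False, 0): version 1 is not ratified yet
```

Returning the latest ratified version on failure mirrors the encouragement above to hand conflict information back to the client.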
++## Getting Ratified Commits from the Catalog
++
++Even after a commit is ratified, it is not discoverable through filesystem operations until it is
++[published](#publishing-commits).
++
++The catalog client is responsible for implementing an API (defined by the Delta client) that Delta clients can
++use to retrieve the latest ratified commit version (authoritative), as well as the set of ratified
++commits the catalog is still storing for the table. If some commits needed to complete the snapshot
++are not stored by the catalog, as they are already published, Delta clients can issue a filesystem
++LIST operation to retrieve them.
++
++Delta clients must establish an API contract where the catalog provides ratified commit information
++as part of the standard table resolution process performed at query planning time.
++
++## Publishing Commits
++
++Publishing is the process of copying the ratified commit with version `<v>` to
++`_delta_log/<v>.json`. The ratified commit may be a staged commit located in
++`_delta_log/_staged_commits/<v>.<uuid>.json`, or it may be an inline commit whose content the
++catalog stores itself. Because the content of a ratified commit is immutable, it does not matter
++whether the client-side, server-side, or both catalog components initiate publishing.
++
++Implementations are strongly encouraged to publish commits promptly. This reduces the number of
++commits the catalog needs to store internally (and serve up to readers).
++
++Commits must be published _in order_. That is, version `v - 1` must be published _before_ version
++`v`.
++
++**NOTE**: Because commit publishing can happen at any time after the commit succeeds, the file
++modification timestamp of the published file will not accurately reflect the original commit time.
++For this reason, catalog-managed tables must use [in-commit-timestamps](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#in-commit-timestamps)
++to ensure stability of time travel reads. Refer to the [Writer Requirements for Catalog-managed Tables](#writer-requirements-for-catalog-managed-tables)
++section for more details.
++
++## Maintenance Operations on Catalog-managed Tables
++
++[Checkpoints](#checkpoints-1) and [Log Compaction Files](#log-compaction-files) can only be created
++for versions that are already published in the `_delta_log`. In other words, in order to checkpoint
++version `v` or produce a log compaction file for commit range `x <= v <= y`, `_delta_log/<v>.json`
++must exist.
++
++Notably, the [Version Checksum File](#version-checksum-file) for version `v` _can_ be created in the
++`_delta_log` even if the commit for version `v` is not published.
++
++By default, maintenance operations are prohibited unless the managing catalog explicitly permits
++the client to run them. The only exceptions are checkpoints, log compaction, and version checksum,
++as they are essential for all basic table operations (e.g. reads and writes) to operate reliably.
++All other maintenance operations, such as the following, are not allowed by default:
++- [Log and other metadata files clean up](#metadata-cleanup).
++- Data files cleanup, for example VACUUM.
++- Data layout changes, for example OPTIMIZE and REORG.
++
++## Creating and Dropping Catalog-managed Tables
++
++The catalog and query engine ultimately dictate how to create and drop catalog-managed tables.
++
++As one example, table creation often works in three phases:
++
++1. An initial catalog operation to obtain a unique storage location which serves as an unnamed
++   "staging" table
++2. A table operation that physically initializes a new `catalogManaged`-enabled table at the staging
++   location.
++3. A final catalog operation that registers the new table with its intended name.
++
++Delta clients would primarily be involved with the second step, but an implementation could choose
++to combine the second and third steps so that a single catalog call registers the table as part of
++the table's first commit.
++
++As another example, dropping a table can be as simple as removing its name from the catalog (a "soft
++delete"), followed at some later point by a "hard delete" that physically purges the data. The Delta
++client would not be involved at all in this process, because no commits are made to the table.
++
++## Catalog-managed Table Enablement
++
++The `catalogManaged` table feature is supported and active when:
++- The table is on Reader Version 3 and Writer Version 7.
++- The table has a `protocol` action with `readerFeatures` and `writerFeatures` both containing the
++  feature `catalogManaged`.
++
++## Writer Requirements for Catalog-managed tables
++
++When supported and active:
++
++- Writers must discover and access the table using catalog calls, which happens _before_ the table's
++  protocol is known. See [Table Discovery](#table-discovery) for more details.
++- The [in-commit-timestamps](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#in-commit-timestamps)
++  table feature must be supported and active.
++- The `commitInfo` action must also contain a field `txnId` that stores a unique transaction
++  identifier string.
++- Writers must follow the catalog's [commit protocol](#commit-protocol) and must not perform
++  ordinary filesystem-based commits against the table.
++- Writers must follow the catalog's [maintenance operation protocol](#maintenance-operations-on-catalog-managed-tables).
++
++## Reader Requirements for Catalog-managed tables
++
++When supported and active:
++
++- Readers must discover the table using catalog calls, which happens before the table's protocol
++  is known. See [Table Discovery](#table-discovery) for more details.
++- Readers must contact the catalog for information about unpublished ratified commits.
++- Readers must follow the rules described in the [Reading Catalog-managed Tables](#reading-catalog-managed-tables)
++  section above. Notably:
++  - If the catalog says `v` is the latest version, clients must ignore any later versions that may
++    have been published.
++  - When the catalog returns a ratified commit for version `v`, readers must use that
++    catalog-supplied commit and ignore any published Delta file for version `v` that might also be
++    present.
++
++## Table Discovery
++
++The requirements above state that readers and writers must discover and access the table using
++catalog calls, which occurs _before_ the table's protocol is known. This raises an important
++question: how can a client discover a `catalogManaged` Delta table without first knowing that it
++_is_, in fact, `catalogManaged` (according to the protocol)?
++
++To solve this, first note that, in practice, catalog-integrated engines already ask the catalog to
++resolve a table name to its storage location during the name resolution step. This protocol
++therefore encourages that the same name resolution step also indicate whether the table is
++catalog-managed. Surfacing this at the very moment the catalog returns the path imposes no extra
++round-trips, yet it lets the client decide — early and unambiguously — whether to follow the
++`catalogManaged` read and write rules.
++
++## Sample Catalog Client API
++
++The following is an example of a possible API which a Java-based Delta client might require catalog
++implementations to target:
++
++```scala
++
++trait CatalogManagedTable {
++    /**
++     * Commits the given set of `actions` to the given commit `version`.
++     *
++     * @param version The version we want to commit.
++     * @param actions Actions that need to be committed.
++     *
++     * @return CommitResponse which has details around the new committed delta file.
++     */
++    def commit(
++        version: Long,
++        actions: Iterator[String]): CommitResponse
++
++    /**
++     * Retrieves a (possibly empty) suffix of ratified commits in the range [startVersion,
++     * endVersion] for this table.
++     * 
++     * Some of these ratified commits may already have been published. Some of them may be staged,
++     * in which case the staged commit file path is returned; others may be inline, in which case
++     * the inline commit content is returned.
++     * 
++     * The returned commits are sorted in ascending version number and are contiguous.
++     *
++     * If neither start nor end version is specified, the catalog will return all available ratified
++     * commits (possibly empty, if all commits have been published).
++     *
++     * In all cases, the response also includes the table's latest ratified commit version.
++     *
++     * @return GetCommitsResponse which contains an ordered list of ratified commits
++     *         stored by the catalog, as well as the table's latest commit version.
++     */
++    def getRatifiedCommits(
++        startVersion: Option[Long],
++        endVersion: Option[Long]): GetCommitsResponse
++}
++```
++
++Note that the above is only one example of a possible Catalog Client API. It is also _NOT_ a catalog
++API (no table discovery, ACL, create/drop, etc). The Delta protocol is agnostic to API details, and
++the API surface Delta clients define should only cover the specific catalog capabilities that the
++Delta client needs to correctly read and write catalog-managed tables.
++
+ # Iceberg Compatibility V1
+ 
+ This table feature (`icebergCompatV1`) ensures that Delta tables can be converted to Apache Iceberg™ format, though this table feature does not implement or specify that conversion.
+  * Files that have been [added](#Add-File-and-Remove-File) and not yet removed
+  * Files that were recently [removed](#Add-File-and-Remove-File) and have not yet expired
+  * [Transaction identifiers](#Transaction-Identifiers)
+- * [Domain Metadata](#Domain-Metadata)
++ * [Domain Metadata](#Domain-Metadata) that have not been removed (i.e. excluding tombstones with `removed=true`)
+  * [Checkpoint Metadata](#checkpoint-metadata) - Requires [V2 checkpoints](#v2-spec)
+  * [Sidecar File](#sidecar-files) - Requires [V2 checkpoints](#v2-spec)
+ 
+ 1. Identify a threshold (in days) up to which we want to preserve the deltaLog. Let's refer to
+ midnight UTC of that day as `cutOffTimestamp`. The newest commit not newer than the `cutOffTimestamp` is
+ the `cutOffCommit`, because a commit exactly at midnight is an acceptable cutoff. We want to retain everything including and after the `cutOffCommit`.
+-2. Identify the newest checkpoint that is not newer than the `cutOffCommit`. A checkpoint at the `cutOffCommit` is ideal, but an older one will do. Lets call it `cutOffCheckpoint`.
+-We need to preserve the `cutOffCheckpoint` (both the checkpoint file and the JSON commit file at that version) and all commits after it. The JSON commit file at the `cutOffCheckpoint` version must be preserved because checkpoints do not preserve [commit provenance information](#commit-provenance-information) (e.g., `commitInfo` actions), which may be required by table features such as [In-Commit Timestamps](#in-commit-timestamps). All commits after `cutOffCheckpoint` must be preserved to enable time travel for commits between `cutOffCheckpoint` and the next available checkpoint.
+-3. Delete all [delta log entries](#delta-log-entries) and [checkpoint files](#checkpoints) before the
+-`cutOffCheckpoint` checkpoint. Also delete all the [log compaction files](#log-compaction-files) having startVersion <= `cutOffCheckpoint`'s version.
++2. Identify the newest checkpoint that is not newer than the `cutOffCommit`. A checkpoint at the `cutOffCommit` is ideal, but an older one will do. Let's call it `cutOffCheckpoint`.
++We need to preserve the `cutOffCheckpoint` (both the checkpoint file and the JSON commit file at that version) and all published commits after it. The JSON commit file at the `cutOffCheckpoint` version must be preserved because checkpoints do not preserve [commit provenance information](#commit-provenance-information) (e.g., `commitInfo` actions), which may be required by table features such as [In-Commit Timestamps](#in-commit-timestamps). All published commits after `cutOffCheckpoint` must be preserved to enable time travel for commits between `cutOffCheckpoint` and the next available checkpoint.
++    - If no `cutOffCheckpoint` can be found, do not proceed with metadata cleanup as there is
++      nothing to clean up.
++3. Delete all [delta log entries](#delta-log-entries), [checkpoint files](#checkpoints), and
++   [version checksum files](#version-checksum-file) before the `cutOffCheckpoint` checkpoint. Also delete all the [log compaction files](#log-compaction-files)
++   having startVersion <= `cutOffCheckpoint`'s version.
++    - Also delete all the [staged commit files](#staged-commit) having version <=
++      `cutOffCheckpoint`'s version from the `_delta_log/_staged_commits` directory.
+ 4. Now read all the available [checkpoints](#checkpoints-1) in the _delta_log directory and identify
+ the corresponding [sidecar files](#sidecar-files). These sidecar files need to be protected.
+ 5. List all the files in `_delta_log/_sidecars` directory, preserve files that are less than a day
+ [Timestamp without Timezone](#timestamp-without-timezone-timestampNtz) | `timestampNtz` | Readers and writers
+ [Domain Metadata](#domain-metadata) | `domainMetadata` | Writers only
+ [V2 Checkpoint](#v2-checkpoint-table-feature) | `v2Checkpoint` | Readers and writers
++[Catalog-managed Tables](#catalog-managed-tables) | `catalogManaged` | Readers and writers
+ [Iceberg Compatibility V1](#iceberg-compatibility-v1) | `icebergCompatV1` | Writers only
+ [Iceberg Compatibility V2](#iceberg-compatibility-v2) | `icebergCompatV2` | Writers only
+ [Clustered Table](#clustered-table) | `clustering` | Writers only
\ No newline at end of file
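
The metadata cleanup steps in the PROTOCOL.md hunk above (choosing `cutOffCommit`, then `cutOffCheckpoint`, then deciding what to delete) can be sketched roughly as follows. This is a minimal Python illustration, not code from this PR. The function names `find_cutoff_checkpoint` and `plan_deletions`, the input layout, and the assumption that commit timestamps increase with version are all hypothetical. Sidecar protection (step 4 onward) is not covered.

```python
from typing import Dict, List, Optional, Tuple


def find_cutoff_checkpoint(
    commit_timestamps: Dict[int, int],
    checkpoint_versions: List[int],
    cutoff_timestamp: int,
) -> Optional[int]:
    """Return the cutOffCheckpoint version, or None if cleanup should be skipped."""
    # cutOffCommit: newest commit that is not newer than cutOffTimestamp.
    # Assumption: timestamps increase monotonically with version.
    eligible = [v for v, ts in commit_timestamps.items() if ts <= cutoff_timestamp]
    if not eligible:
        return None
    cutoff_commit = max(eligible)
    # cutOffCheckpoint: newest checkpoint that is not newer than cutOffCommit.
    candidates = [v for v in checkpoint_versions if v <= cutoff_commit]
    return max(candidates) if candidates else None


def plan_deletions(
    published_versions: List[int],
    checkpoint_versions: List[int],
    staged_versions: List[int],
    compactions: List[Tuple[int, int]],
    cutoff_checkpoint: int,
) -> Dict[str, list]:
    """List what step 3 would delete; everything not listed is retained."""
    return {
        # Published commits and version checksum files strictly before the checkpoint.
        "commits_and_checksums": [v for v in published_versions if v < cutoff_checkpoint],
        "checkpoints": [v for v in checkpoint_versions if v < cutoff_checkpoint],
        # Log compaction files are modelled as (startVersion, endVersion) pairs.
        "compactions": [c for c in compactions if c[0] <= cutoff_checkpoint],
        # Staged commits at or below the checkpoint version are removable too.
        "staged_commits": [v for v in staged_versions if v <= cutoff_checkpoint],
    }
```

Note the asymmetry the spec text implies: the published JSON commit at the `cutOffCheckpoint` version is kept (strict `<`), while staged commit files and log compaction files at that version are deleted (`<=`).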

README.md

@@ -0,0 +1,10 @@
+diff --git a/README.md b/README.md
+--- a/README.md
++++ b/README.md
+ ## Building
+ 
+ Delta Lake is compiled using [SBT](https://www.scala-sbt.org/1.x/docs/Command-Line-Reference.html).
++Ensure that your Java version is at least 17 (you can verify with `java -version`).
+ 
+ To compile, run
+ 
\ No newline at end of file

build.sbt

@@ -0,0 +1,218 @@
+diff --git a/build.sbt b/build.sbt
+--- a/build.sbt
++++ b/build.sbt
+       allMappings.distinct
+     },
+ 
+-    // Exclude internal modules from published POM
++    // Exclude internal modules from published POM and add kernel dependencies.
++    // Kernel modules are transitive through sparkV2 (an internal module), so they
++    // are lost when sparkV2 is filtered out. We re-add them explicitly here.
+     pomPostProcess := { node =>
+       val internalModules = internalModuleNames.value
++      val ver = version.value
+       import scala.xml._
+       import scala.xml.transform._
++
++      def kernelDependencyNode(artifactId: String): Elem = {
++        <dependency>
++          <groupId>io.delta</groupId>
++          <artifactId>{artifactId}</artifactId>
++          <version>{ver}</version>
++        </dependency>
++      }
++
++      val kernelDeps = Seq(
++        kernelDependencyNode("delta-kernel-api"),
++        kernelDependencyNode("delta-kernel-defaults"),
++        kernelDependencyNode("delta-kernel-unitycatalog")
++      )
++
+       new RuleTransformer(new RewriteRule {
+         override def transform(n: Node): Seq[Node] = n match {
+-          case e: Elem if e.label == "dependency" =>
+-            val artifactId = (e \ "artifactId").text
+-            // Check if artifactId starts with any internal module name
+-            // (e.g., "delta-spark-v1_4.1_2.13" starts with "delta-spark-v1")
+-            val isInternal = internalModules.exists(module => artifactId.startsWith(module))
+-            if (isInternal) Seq.empty else Seq(n)
++          case e: Elem if e.label == "dependencies" =>
++            val filtered = e.child.filter {
++              case child: Elem if child.label == "dependency" =>
++                val artifactId = (child \ "artifactId").text
++                !internalModules.exists(module => artifactId.startsWith(module))
++              case _ => true
++            }
++            Seq(e.copy(child = filtered ++ kernelDeps))
+           case _ => Seq(n)
+         }
+       }).transform(node).head
+     commonSettings,
+     scalaStyleSettings,
+     releaseSettings,
+-    CrossSparkVersions.sparkDependentModuleName(sparkVersion),
++    // Set sparkVersion directly (not sparkDependentModuleName) so that
++    // runOnlyForReleasableSparkModules discovers this module, but without adding a Spark
++    // suffix to the artifact name. delta-contribs is only published as delta-contribs_2.13.
++    sparkVersion := CrossSparkVersions.getSparkVersion(),
+     Compile / packageBin / mappings := (Compile / packageBin / mappings).value ++
+       listPythonFiles(baseDirectory.value.getParentFile / "python"),
+ 
+   ).configureUnidoc()
+ 
+ 
+-val unityCatalogVersion = "0.3.1"
++val unityCatalogVersion = "0.4.0"
+ val sparkUnityCatalogJacksonVersion = "2.15.4" // We are using Spark 4.0's Jackson version 2.15.x, to override Unity Catalog 0.3.0's version 2.18.x
+ 
+ lazy val sparkUnityCatalog = (project in file("spark/unitycatalog"))
+     libraryDependencies ++= Seq(
+       "org.apache.spark" %% "spark-sql" % sparkVersion.value % "provided",
+ 
+-      "io.delta" %% "delta-sharing-client" % "1.3.9",
++      "io.delta" %% "delta-sharing-client" % "1.3.10",
+ 
+       // Test deps
+       "org.scalatest" %% "scalatest" % scalaTestVersion % "test",
+ 
+       // Test Deps
+       "org.scalatest" %% "scalatest" % scalaTestVersion % "test",
++      // Jackson datatype module needed for UC SDK tests (excluded from main compile scope)
++      "com.fasterxml.jackson.datatype" % "jackson-datatype-jsr310" % "2.15.4" % "test",
+     ),
+ 
+     // Unidoc settings
+     commonSettings,
+     scalaStyleSettings,
+     releaseSettings,
+-    CrossSparkVersions.sparkDependentModuleName(sparkVersion),
++    // Set sparkVersion directly (not sparkDependentModuleName) so that
++    // runOnlyForReleasableSparkModules discovers this module, but without adding a Spark
++    // suffix to the artifact name. delta-iceberg is only published as delta-iceberg_2.13.
++    sparkVersion := CrossSparkVersions.getSparkVersion(),
+     libraryDependencies ++= {
+       if (supportIceberg) {
+         Seq(
+           "org.xerial" % "sqlite-jdbc" % "3.45.0.0" % "test",
+           "org.apache.httpcomponents.core5" % "httpcore5" % "5.2.4" % "test",
+           "org.apache.httpcomponents.client5" % "httpclient5" % "5.3.1" % "test",
+-          "org.apache.iceberg" %% icebergSparkRuntimeArtifactName % "1.10.0" % "provided"
++          "org.apache.iceberg" %% icebergSparkRuntimeArtifactName % "1.10.0" % "provided",
++          // For FixedGcsAccessTokenProvider (GCS server-side planning credentials)
++          "com.google.cloud.bigdataoss" % "util-hadoop" % "hadoop3-2.2.26" % "provided"
+         )
+       } else {
+         Seq.empty
+   )
+ // scalastyle:on println
+ 
+-val icebergShadedVersion = "1.10.0"
++val icebergShadedVersion = "1.10.1"
+ lazy val icebergShaded = (project in file("icebergShaded"))
+   .dependsOn(spark % "provided")
+   .disablePlugins(JavaFormatterPlugin, ScalafmtPlugin)
+     commonSettings,
+     scalaStyleSettings,
+     releaseSettings,
+-    CrossSparkVersions.sparkDependentSettings(sparkVersion),
+-    libraryDependencies ++= Seq(
+-      "org.apache.hudi" % "hudi-java-client" % "0.15.0" % "compile" excludeAll(
+-        ExclusionRule(organization = "org.apache.hadoop"),
+-        ExclusionRule(organization = "org.apache.zookeeper"),
+-      ),
+-      "org.apache.spark" %% "spark-avro" % sparkVersion.value % "test" excludeAll ExclusionRule(organization = "org.apache.hadoop"),
+-      "org.apache.parquet" % "parquet-avro" % "1.12.3" % "compile"
+-    ),
++    // Set sparkVersion directly (not sparkDependentModuleName) so that
++    // runOnlyForReleasableSparkModules discovers this module, but without adding a Spark
++    // suffix to the artifact name. delta-hudi is only published as delta-hudi_2.13.
++    sparkVersion := CrossSparkVersions.getSparkVersion(),
++    libraryDependencies ++= {
++      if (supportHudi) {
++        Seq(
++          "org.apache.hudi" % "hudi-java-client" % "0.15.0" % "compile" excludeAll(
++            ExclusionRule(organization = "org.apache.hadoop"),
++            ExclusionRule(organization = "org.apache.zookeeper"),
++          ),
++          "org.apache.spark" %% "spark-avro" % sparkVersion.value % "test" excludeAll ExclusionRule(organization = "org.apache.hadoop"),
++          "org.apache.parquet" % "parquet-avro" % "1.12.3" % "compile"
++        )
++      } else {
++        Seq.empty
++      }
++    },
++    // Skip compilation and publishing when supportHudi is false
++    Compile / skip := !supportHudi,
++    Test / skip := !supportHudi,
++    publish / skip := !supportHudi,
++    publishLocal / skip := !supportHudi,
++    publishM2 / skip := !supportHudi,
+     assembly / assemblyJarName := s"${name.value}-assembly_${scalaBinaryVersion.value}-${version.value}.jar",
+     assembly / logLevel := Level.Info,
+     assembly / test := {},
+       // crossScalaVersions must be set to Nil on the aggregating project
+       crossScalaVersions := Nil,
+       publishArtifact := false,
+-      publish / skip := false,
++      publish / skip := true,
+     )
+ }
+ 
+       // crossScalaVersions must be set to Nil on the aggregating project
+       crossScalaVersions := Nil,
+       publishArtifact := false,
+-      publish / skip := false,
++      publish / skip := true,
+     )
+ }
+ 
+     // crossScalaVersions must be set to Nil on the aggregating project
+     crossScalaVersions := Nil,
+     publishArtifact := false,
+-    publish / skip := false,
++    publish / skip := true,
+     unidocSourceFilePatterns := {
+       (kernelApi / unidocSourceFilePatterns).value.scopeToProject(kernelApi) ++
+       (kernelDefaults / unidocSourceFilePatterns).value.scopeToProject(kernelDefaults)
+     // crossScalaVersions must be set to Nil on the aggregating project
+     crossScalaVersions := Nil,
+     publishArtifact := false,
+-    publish / skip := false,
++    publish / skip := true,
+   )
+ 
+ /*
+     sys.env.getOrElse("SONATYPE_USERNAME", ""),
+     sys.env.getOrElse("SONATYPE_PASSWORD", "")
+   ),
++  credentials += Credentials(
++    "Sonatype Nexus Repository Manager",
++    "central.sonatype.com",
++    sys.env.getOrElse("SONATYPE_USERNAME", ""),
++    sys.env.getOrElse("SONATYPE_PASSWORD", "")
++  ),
+   publishTo := {
+     val ossrhBase = "https://ossrh-staging-api.central.sonatype.com/"
++    val centralSnapshots = "https://central.sonatype.com/repository/maven-snapshots/"
+     if (isSnapshot.value) {
+-      Some("snapshots" at ossrhBase + "content/repositories/snapshots")
++      Some("snapshots" at centralSnapshots)
+     } else {
+       Some("releases"  at ossrhBase + "service/local/staging/deploy/maven2")
+     }
+ // Looks like some of release settings should be set for the root project as well.
+ publishArtifact := false  // Don't release the root project
+ publish / skip := true
+-publishTo := Some("snapshots" at "https://ossrh-staging-api.central.sonatype.com/content/repositories/snapshots")
++publishTo := Some("snapshots" at "https://central.sonatype.com/repository/maven-snapshots/")
+ releaseCrossBuild := false  // Don't use sbt-release's cross facility
+ releaseProcess := Seq[ReleaseStep](
+   checkSnapshotDependencies,
+   setReleaseVersion,
+   commitReleaseVersion,
+   tagRelease
+-) ++ CrossSparkVersions.crossSparkReleaseSteps("+publishSigned") ++ Seq[ReleaseStep](
++) ++ CrossSparkVersions.crossSparkReleaseSteps("publishSigned") ++ Seq[ReleaseStep](
+ 
+   // Do NOT use `sonatypeBundleRelease` - it will actually release to Maven! We want to do that
+   // manually.
\ No newline at end of file
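
The `pomPostProcess` change in the build.sbt hunk above moves the match from each `dependency` element to the enclosing `dependencies` element, so that internal modules can be filtered out and the kernel artifacts appended in one pass. The same idea can be sketched in Python with the standard library's `xml.etree.ElementTree`. This is only an illustration under stated assumptions: the function name `rewrite_dependencies` is hypothetical, and the input is assumed to be namespace-free XML (a real POM normally declares the Maven namespace, which would need to be included in the tag names).

```python
import xml.etree.ElementTree as ET
from typing import List


def rewrite_dependencies(
    pom_xml: str,
    internal_modules: List[str],
    kernel_artifacts: List[str],
    version: str,
) -> str:
    """Drop internal-module dependencies and append kernel dependencies."""
    root = ET.fromstring(pom_xml)
    # Snapshot the matches first, since the tree is mutated inside the loop.
    for deps in list(root.iter("dependencies")):
        for dep in list(deps.findall("dependency")):
            artifact_id = dep.findtext("artifactId", default="")
            # Prefix match, e.g. "delta-spark-v1_4.1_2.13" starts with "delta-spark-v1".
            if any(artifact_id.startswith(m) for m in internal_modules):
                deps.remove(dep)
        # Re-add the kernel modules that were only reachable through an internal module.
        for artifact in kernel_artifacts:
            node = ET.SubElement(deps, "dependency")
            ET.SubElement(node, "groupId").text = "io.delta"
            ET.SubElement(node, "artifactId").text = artifact
            ET.SubElement(node, "version").text = version
    return ET.tostring(root, encoding="unicode")
```

As in the Scala rule, every `dependencies` element is rewritten, so the kernel entries are appended wherever such an element appears.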

connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/.00000000000000000000.json.crc

@@ -0,0 +1,3 @@
+diff --git a/connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/.00000000000000000000.json.crc b/connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/.00000000000000000000.json.crc
+new file mode 100644
+Binary files /dev/null and b/connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/.00000000000000000000.json.crc differ
\ No newline at end of file

connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/00000000000000000000.crc

@@ -0,0 +1,5 @@
+diff --git a/connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/00000000000000000000.crc b/connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/00000000000000000000.crc
+new file mode 100644
+--- /dev/null
++++ b/connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/00000000000000000000.crc
++{"txnId":"6132e880-0f3a-4db4-b882-1da039bffbad","tableSizeBytes":0,"numFiles":0,"numMetadata":1,"numProtocol":1,"setTransactions":[],"domainMetadata":[],"metadata":{"id":"0eb3e007-b3cc-40e4-bca1-a5970d86b5a6","format":{"provider":"parquet","options":{}},"schemaString":"{\"type\":\"struct\",\"fields\":[{\"name\":\"id\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}},{\"name\":\"utf8_binary_col\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"utf8_lcase_col\",\"type\":\"string\",\"nullable\":true,\"metadata\":{\"__COLLATIONS\":{\"utf8_lcase_col\":\"spark.UTF8_LCASE\"}}},{\"name\":\"unicode_col\",\"type\":\"string\",\"nullable\":true,\"metadata\":{\"__COLLATIONS\":{\"unicode_col\":\"icu.UNICODE\"}}}]}","partitionColumns":[],"configuration":{},"createdTime":1773779518731},"protocol":{"minReaderVersion":1,"minWriterVersion":7,"writerFeatures":["domainMetadata","collations-preview","appendOnly","invariants"]},"histogramOpt":{"sortedBinBoundaries":[0,8192,16384,32768,65536,131072,262144,524288,1048576,2097152,4194304,8388608,12582912,16777216,20971520,25165824,29360128,33554432,37748736,41943040,50331648,58720256,67108864,75497472,83886080,92274688,100663296,109051904,117440512,125829120,130023424,134217728,138412032,142606336,146800640,150994944,167772160,184549376,201326592,218103808,234881024,251658240,268435456,285212672,301989888,318767104,335544320,352321536,369098752,385875968,402653184,419430400,436207616,452984832,469762048,486539264,503316480,520093696,536870912,553648128,570425344,587202560,603979776,671088640,738197504,805306368,872415232,939524096,1006632960,1073741824,1140850688,1207959552,1275068416,1342177280,1409286144,1476395008,1610612736,1744830464,1879048192,2013265920,2147483648,2415919104,2684354560,2952790016,3221225472,3489660928,3758096384,4026531840,4294967296,8589934592,17179869184,34359738368,68719476736,137438953472,274877906944],"fileCounts":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],"totalBytes":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]},"allFiles":[]}
\ No newline at end of file

... (truncated, output exceeded 60000 bytes)

Reproduce locally: `git range-diff e8cffee..7a4827d d1139d2..6b26f5b` | Disable: `git config gitstack.push-range-diff false`

zikangh · 2026-04-06T20:33:29Z

Range-diff: master (6b26f5b -> da38d13)

.claude/scheduled_tasks.lock

@@ -0,0 +1,6 @@
+diff --git a/.claude/scheduled_tasks.lock b/.claude/scheduled_tasks.lock
+new file mode 100644
+--- /dev/null
++++ b/.claude/scheduled_tasks.lock
++{"sessionId":"dd5422c1-eec5-4618-877c-bc933a699925","pid":3420592,"acquiredAt":1775161689602}
+\ No newline at end of file
\ No newline at end of file

.github/CODEOWNERS

@@ -0,0 +1,12 @@
+diff --git a/.github/CODEOWNERS b/.github/CODEOWNERS
+--- a/.github/CODEOWNERS
++++ b/.github/CODEOWNERS
+ /project/                       @tdas
+ /version.sbt                    @tdas
+ 
++# Spark V2 and Unified modules
++/spark/v2/                      @tdas @huan233usc @TimothyW553 @raveeram-db @murali-db
++/spark-unified/                 @tdas @huan233usc @TimothyW553 @raveeram-db @murali-db
++
+ # All files in the root directory
+ /*                              @tdas
\ No newline at end of file

.github/workflows/iceberg_test.yaml

@@ -0,0 +1,16 @@
+diff --git a/.github/workflows/iceberg_test.yaml b/.github/workflows/iceberg_test.yaml
+--- a/.github/workflows/iceberg_test.yaml
++++ b/.github/workflows/iceberg_test.yaml
+           # the above directories when we use the key for the first time. After that, each run will
+           # just use the cache. The cache is immutable so we need to use a new key when trying to
+           # cache new stuff.
+-          key: delta-sbt-cache-spark3.2-scala${{ matrix.scala }}
++          key: delta-sbt-cache-spark4.0-scala${{ matrix.scala }}
+       - name: Install Job dependencies
+         run: |
+           sudo apt-get update
+       - name: Run Scala/Java and Python tests
+         # when changing TEST_PARALLELISM_COUNT make sure to also change it in spark_master_test.yaml
+         run: |
+-          TEST_PARALLELISM_COUNT=4 pipenv run python run-tests.py --group iceberg
++          TEST_PARALLELISM_COUNT=4 pipenv run python run-tests.py --group iceberg --spark-version 4.0
\ No newline at end of file

.github/workflows/spark_examples_test.yaml

@@ -0,0 +1,54 @@
+diff --git a/.github/workflows/spark_examples_test.yaml b/.github/workflows/spark_examples_test.yaml
+--- a/.github/workflows/spark_examples_test.yaml
++++ b/.github/workflows/spark_examples_test.yaml
+         # Spark versions are dynamically generated - released versions only
+         spark_version: ${{ fromJson(needs.generate-matrix.outputs.spark_versions) }}
+         # These Scala versions must match those in the build.sbt
+-        scala: [2.13.16]
++        scala: [2.13.17]
+     env:
+       SCALA_VERSION: ${{ matrix.scala }}
+-      SPARK_VERSION: ${{ matrix.spark_version }}
+     steps:
+       - uses: actions/checkout@v3
+       - name: Get Spark version details
+         id: spark-details
+         run: |
+-          # Get JVM version, package suffix, iceberg support for this Spark version
++          # Get JVM version, package suffix, iceberg support, and full version for this Spark version
+           JVM_VERSION=$(python3 project/scripts/get_spark_version_info.py --get-field "${{ matrix.spark_version }}" targetJvm | jq -r)
+           SPARK_PACKAGE_SUFFIX=$(python3 project/scripts/get_spark_version_info.py --get-field "${{ matrix.spark_version }}" packageSuffix | jq -r)
+           SUPPORT_ICEBERG=$(python3 project/scripts/get_spark_version_info.py --get-field "${{ matrix.spark_version }}" supportIceberg | jq -r)
++          SPARK_FULL_VERSION=$(python3 project/scripts/get_spark_version_info.py --get-field "${{ matrix.spark_version }}" fullVersion | jq -r)
+           echo "jvm_version=$JVM_VERSION" >> $GITHUB_OUTPUT
+           echo "spark_package_suffix=$SPARK_PACKAGE_SUFFIX" >> $GITHUB_OUTPUT
+           echo "support_iceberg=$SUPPORT_ICEBERG" >> $GITHUB_OUTPUT
+-          echo "Using JVM $JVM_VERSION for Spark ${{ matrix.spark_version }}, package suffix: '$SPARK_PACKAGE_SUFFIX', support iceberg: '$SUPPORT_ICEBERG'"
++          echo "spark_full_version=$SPARK_FULL_VERSION" >> $GITHUB_OUTPUT
++          echo "Using JVM $JVM_VERSION for Spark $SPARK_FULL_VERSION, package suffix: '$SPARK_PACKAGE_SUFFIX', support iceberg: '$SUPPORT_ICEBERG'"
+       - name: install java
+         uses: actions/setup-java@v3
+         with:
+       - name: Run Delta Spark Local Publishing and Examples Compilation
+         # examples/scala/build.sbt will compile against the local Delta release version (e.g. 3.2.0-SNAPSHOT).
+         # Thus, we need to publishM2 first so those jars are locally accessible.
+-        # The SPARK_PACKAGE_SUFFIX env var tells examples/scala/build.sbt which artifact naming to use.
++        # -DsparkVersion is for the Delta project's publishM2 (which Spark version to compile Delta against).
++        # SPARK_VERSION/SPARK_PACKAGE_SUFFIX/SUPPORT_ICEBERG are for examples/scala/build.sbt (dependency resolution).
+         env:
+           SPARK_PACKAGE_SUFFIX: ${{ steps.spark-details.outputs.spark_package_suffix }}
+           SUPPORT_ICEBERG: ${{ steps.spark-details.outputs.support_iceberg }}
++          SPARK_VERSION: ${{ steps.spark-details.outputs.spark_full_version }}
+         run: |
+           build/sbt clean
+-          build/sbt -DsparkVersion=${{ matrix.spark_version }} publishM2
++          build/sbt -DsparkVersion=${{ steps.spark-details.outputs.spark_full_version }} publishM2
+           cd examples/scala && build/sbt "++ $SCALA_VERSION compile"
++      - name: Run UC Delta Integration Test
++        # Verifies that delta-spark resolved from Maven local includes all kernel module
++        # dependencies transitively by running a real UC-backed Delta workload.
++        env:
++          SPARK_PACKAGE_SUFFIX: ${{ steps.spark-details.outputs.spark_package_suffix }}
++          SPARK_VERSION: ${{ steps.spark-details.outputs.spark_full_version }}
++        run: |
++          cd examples/scala && build/sbt "++ $SCALA_VERSION runMain example.UnityCatalogQuickstart"
\ No newline at end of file

.github/workflows/spark_test.yaml

@@ -0,0 +1,27 @@
+diff --git a/.github/workflows/spark_test.yaml b/.github/workflows/spark_test.yaml
+--- a/.github/workflows/spark_test.yaml
++++ b/.github/workflows/spark_test.yaml
+         # These Scala versions must match those in the build.sbt
+         scala: [2.13.16]
+         # Important: This list of shards must be [0..NUM_SHARDS - 1]
+-        shard: [0, 1, 2, 3]
++        shard: [0, 1, 2, 3, 4, 5, 6, 7]
+     env:
+       SCALA_VERSION: ${{ matrix.scala }}
+       SPARK_VERSION: ${{ matrix.spark_version }}
+       # Important: This must be the same as the length of shards in matrix
+-      NUM_SHARDS: 4
++      NUM_SHARDS: 8
+     steps:
+       - uses: actions/checkout@v3
+       - name: Get Spark version details
+         # when changing TEST_PARALLELISM_COUNT make sure to also change it in spark_python_test.yaml
+         run: |
+           TEST_PARALLELISM_COUNT=4 pipenv run python run-tests.py --group spark --shard ${{ matrix.shard }} --spark-version ${{ matrix.spark_version }}
++      - name: Upload test reports
++        if: always()
++        uses: actions/upload-artifact@v4
++        with:
++          name: test-reports-spark${{ matrix.spark_version }}-shard${{ matrix.shard }}
++          path: "**/target/test-reports/*.xml"
++          retention-days: 7
\ No newline at end of file

PROTOCOL.md

@@ -0,0 +1,537 @@
+diff --git a/PROTOCOL.md b/PROTOCOL.md
+--- a/PROTOCOL.md
++++ b/PROTOCOL.md
+   - [Writer Requirements for Variant Type](#writer-requirements-for-variant-type)
+   - [Reader Requirements for Variant Data Type](#reader-requirements-for-variant-data-type)
+   - [Compatibility with other Delta Features](#compatibility-with-other-delta-features)
++- [Catalog-managed tables](#catalog-managed-tables)
++  - [Terminology: Commits](#terminology-commits)
++  - [Terminology: Delta Client](#terminology-delta-client)
++  - [Terminology: Catalogs](#terminology-catalogs)
++  - [Catalog Responsibilities](#catalog-responsibilities)
++  - [Reading Catalog-managed Tables](#reading-catalog-managed-tables)
++  - [Commit Protocol](#commit-protocol)
++  - [Getting Ratified Commits from the Catalog](#getting-ratified-commits-from-the-catalog)
++  - [Publishing Commits](#publishing-commits)
++  - [Maintenance Operations on Catalog-managed Tables](#maintenance-operations-on-catalog-managed-tables)
++  - [Creating and Dropping Catalog-managed Tables](#creating-and-dropping-catalog-managed-tables)
++  - [Catalog-managed Table Enablement](#catalog-managed-table-enablement)
++  - [Writer Requirements for Catalog-managed tables](#writer-requirements-for-catalog-managed-tables)
++  - [Reader Requirements for Catalog-managed tables](#reader-requirements-for-catalog-managed-tables)
++  - [Table Discovery](#table-discovery)
++  - [Sample Catalog Client API](#sample-catalog-client-api)
+ - [Requirements for Writers](#requirements-for-writers)
+   - [Creation of New Log Entries](#creation-of-new-log-entries)
+   - [Consistency Between Table Metadata and Data Files](#consistency-between-table-metadata-and-data-files)
+ __(1)__ `preimage` is the value before the update, `postimage` is the value after the update.
+ 
+ ### Delta Log Entries
+-Delta files are stored as JSON in a directory at the root of the table named `_delta_log`, and together with checkpoints make up the log of all changes that have occurred to a table.
+ 
+-Delta files are the unit of atomicity for a table, and are named using the next available version number, zero-padded to 20 digits.
++Delta Log Entries, also known as Delta files, are JSON files stored in the `_delta_log`
++directory at the root of the table. Together with checkpoints, they make up the log of all changes
++that have occurred to a table. Delta files are the unit of atomicity for a table, and are named
++using the next available version number, zero-padded to 20 digits.
+ 
+ For example:
+ 
+ ```
+ ./_delta_log/00000000000000000000.json
+ ```
+-Delta files use new-line delimited JSON format, where every action is stored as a single line JSON document.
+-A delta file, `n.json`, contains an atomic set of [_actions_](#Actions) that should be applied to the previous table state, `n-1.json`, in order to the construct `n`th snapshot of the table.
+-An action changes one aspect of the table's state, for example, adding or removing a file.
++
++Delta files use newline-delimited JSON format, where every action is stored as a single-line
++JSON document. A Delta file, corresponding to version `v`, contains an atomic set of
++[_actions_](#actions) that should be applied to the previous table state corresponding to version
++`v-1`, in order to construct the `v`th snapshot of the table. An action changes one aspect of the
++table's state, for example, adding or removing a file.
++
++**Note:** If the [catalogManaged table feature](#catalog-managed-tables) is enabled on the table,
++recently [ratified commits](#ratified-commit) may not yet be published to the `_delta_log` directory as normal Delta
++files - they may be stored directly by the catalog or reside in the `_delta_log/_staged_commits`
++directory. Delta clients must contact the table's managing catalog in order to find the information
++about these [ratified, potentially-unpublished commits](#publishing-commits).
++
++The `_delta_log/_staged_commits` directory is the staging area for [staged](#staged-commit)
++commits. Delta files in this directory have a UUID embedded into them and follow the pattern
++`<version>.<uuid>.json`, where the version corresponds to the proposed commit version, zero-padded
++to 20 digits.
++
++For example:
++
++```
++./_delta_log/_staged_commits/00000000000000000000.3a0d65cd-4056-49b8-937b-95f9e3ee90e5.json
++./_delta_log/_staged_commits/00000000000000000001.7d17ac10-5cc3-401b-bd1a-9c82dd2ea032.json
++./_delta_log/_staged_commits/00000000000000000001.016ae953-37a9-438e-8683-9a9a4a79a395.json
++./_delta_log/_staged_commits/00000000000000000002.3ae45b72-24e1-865a-a211-34987ae02f2a.json
++```
++
++NOTE: The (proposed) version number of a staged commit is authoritative - file
++`00000000000000000100.<uuid>.json` always corresponds to a commit attempt for version 100. Besides
++simplifying implementations, it also acknowledges the fact that commit files cannot safely be reused
++for multiple commit attempts. For example, resolving conflicts in a table with [row
++tracking](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#row-tracking) enabled requires
++rewriting all file actions to update their `baseRowId` field.
++
++The [catalog](#terminology-catalogs) is the source of truth about which staged commit files in
++the `_delta_log/_staged_commits` directory correspond to ratified versions, and Delta clients should
++not attempt to directly interpret the contents of that directory. Refer to
++[catalog-managed tables](#catalog-managed-tables) for more details.
+ 
+ ### Checkpoints
+ Checkpoints are also stored in the `_delta_log` directory, and can be created at any time, for any committed version of the table.
+ ### Commit Provenance Information
+ A delta file can optionally contain additional provenance information about what higher-level operation was being performed as well as who executed it.
+ 
++When the `catalogManaged` table feature is enabled, the `commitInfo` action must have a field
++`txnId` that stores a unique transaction identifier string.
++
+ Implementations are free to store any valid JSON-formatted data via the `commitInfo` action.
+ 
+ When [In-Commit Timestamps](#in-commit-timestamps) are enabled, writers are required to include a `commitInfo` action with every commit, which must include the `inCommitTimestamp` field. Also, the `commitInfo` action must be the first action in the commit.
+  - A single `protocol` action
+  - A single `metaData` action
+  - A collection of `txn` actions with unique `appId`s
+- - A collection of `domainMetadata` actions with unique `domain`s.
++ - A collection of `domainMetadata` actions with unique `domain`s, excluding tombstones (i.e. actions with `removed=true`).
+  - A collection of `add` actions with unique path keys, corresponding to the newest (path, deletionVector.uniqueId) pair encountered for each path.
+  - A collection of `remove` actions with unique `(path, deletionVector.uniqueId)` keys. The intersection of the primary keys in the `add` collection and `remove` collection must be empty. That means a logical file cannot exist in both the `remove` and `add` collections at the same time; however, the same *data file* can exist with *different* DVs in the `remove` collection, as logically they represent different content. The `remove` actions act as _tombstones_, and only exist for the benefit of the VACUUM command. Snapshot reads only return `add` actions on the read path.
+  
+      - write a `metaData` action to add the `delta.columnMapping.mode` table property.
+  - Write data files by using the _physical name_ that is chosen for each column. The physical name of the column is static and can be different than the _display name_ of the column, which is changeable.
+  - Write the 32 bit integer column identifier as part of the `field_id` field of the `SchemaElement` struct in the [Parquet Thrift specification](https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift).
+- - Track partition values and column level statistics with the physical name of the column in the transaction log.
++ - Track partition values, column level statistics, and [clustering column](#clustered-table) names with the physical name of the column in the transaction log.
+  - Assign a globally unique identifier as the physical name for each new column that is added to the schema. This is especially important for supporting cheap column deletions in `name` mode. In addition, column identifiers need to be assigned to each column. The maximum id that is assigned to a column is tracked as the table property `delta.columnMapping.maxColumnId`. This is an internal table property that cannot be configured by users. This value must increase monotonically as new columns are introduced and committed to the table alongside the introduction of the new columns to the schema.
+ 
+ ## Reader Requirements for Column Mapping
+ ## Writer Requirement for Deletion Vectors
+ When adding a logical file with a deletion vector, then that logical file must have correct `numRecords` information for the data file in the `stats` field.
+ 
++# Catalog-managed tables
++
++With this feature enabled, the [catalog](#terminology-catalogs) that manages the table becomes the
++source of truth for whether a given commit attempt succeeded.
++
++The table feature defines the parts of the [commit protocol](#commit-protocol) that directly impact
++the Delta table (e.g. atomicity requirements, publishing, etc). The Delta client and catalog
++together are responsible for implementing the Delta-specific aspects of commit as defined by this
++spec, but are otherwise free to define their own APIs and protocols for communication with each
++other.
++
++**NOTE**: Filesystem-based access to catalog-managed tables is not supported. Delta clients are
++expected to discover and access catalog-managed tables through the managing catalog, not by direct
++listing in the filesystem. This feature is primarily designed to warn filesystem-based readers that
++might attempt to access a catalog-managed table's storage location without going through the catalog
++first, and to block filesystem-based writers who could otherwise corrupt both the table and the
++catalog by failing to commit through the catalog.
++
++Before we can go into details of this protocol feature, we must first align our terminology.
++
++## Terminology: Commits
++
++A commit is a set of [actions](#actions) that transform a Delta table from version `v - 1` to `v`.
++It contains the same kind of content as is stored in a [Delta file](#delta-log-entries).
++
++A commit may be stored in the file system as a Delta file - either _published_ or _staged_ - or
++stored _inline_ in the managing catalog, using whatever format the catalog prefers.
++
++There are several types of commits:
++
++1. **Proposed commit**:  A commit that a Delta client has proposed for the next version of the
++   table. It could be _staged_ or _inline_. It will either become _ratified_ or be rejected.
++
++2. <a name="staged-commit">**Staged commit**</a>: A commit that is written to disk at
++   `_delta_log/_staged_commits/<v>.<uuid>.json`. It has the same content and format as a published
++   Delta file.
++    - Here, the `uuid` is a random UUID that is generated for each commit and `v` is the version
++      which is proposed to be committed, zero-padded to 20 digits.
++    - The mere existence of a staged commit does not mean that the file has been ratified or even
++      proposed. It might correspond to a failed or in-progress commit attempt.
++    - The catalog is the source of truth around which staged commits are ratified.
++    - The catalog stores only the location, not the content, of a staged (and ratified) commit.
++
++3. <a name="inline-commit">**Inline commit**</a>: A proposed commit that is not written to disk but
++   rather has its content sent to the catalog for the catalog to store directly.
++
++4. <a name="ratified-commit">**Ratified commit**</a>: A proposed commit that a catalog has
++   determined has won the commit at the desired version of the table.
++    - The catalog must store ratified commits (that is, the staged commit's location or the inline
++      commit's content) until they are published to the `_delta_log` directory.
++    - A ratified commit may or may not yet be published.
++    - A ratified commit may or may not even be stored by the catalog at all - the catalog may
++      have just atomically published it to the filesystem directly, relying on PUT-if-absent
++      primitives to facilitate the ratification and publication all in one step.
++
++5. <a name="published-commit">**Published commit**</a>: A ratified commit that has been copied into
++   the `_delta_log` as a normal Delta file, i.e. `_delta_log/<v>.json`.
++    - Here, the `v` is the version which is being committed, zero-padded to 20 digits.
++    - The existence of a `<v>.json` file proves that the corresponding version `v` is ratified,
++      regardless of whether the table is catalog-managed or filesystem-based. The catalog is allowed
++      to return information about published commits, but Delta clients can also use filesystem
++      listing operations to directly discover them.
++    - Published commits do not need to be stored by the catalog.
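The naming rules for staged and published commits above can be sketched in Java as follows. This is only an illustration under stated assumptions: the class `CommitPaths` and its method names are hypothetical and are not part of the spec or of any Delta client.

```java
import java.util.Locale;
import java.util.UUID;

// Hypothetical helper, not part of the protocol: builds and parses commit file names.
final class CommitPaths {
    // Zero-pads a version to 20 digits, as required for both staged and published commits.
    static String paddedVersion(long version) {
        return String.format(Locale.ROOT, "%020d", version);
    }

    // _delta_log/_staged_commits/<v>.<uuid>.json
    static String stagedCommitPath(long version, UUID uuid) {
        return "_delta_log/_staged_commits/" + paddedVersion(version) + "." + uuid + ".json";
    }

    // _delta_log/<v>.json
    static String publishedCommitPath(long version) {
        return "_delta_log/" + paddedVersion(version) + ".json";
    }

    // Reads the (proposed) version back from a staged or published commit file name.
    static long versionOf(String path) {
        String name = path.substring(path.lastIndexOf('/') + 1);
        return Long.parseLong(name.substring(0, name.indexOf('.')));
    }
}
```

Note that `versionOf` only recovers the proposed version from the file name. Whether a staged file is ratified still has to come from the catalog.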
++
++## Terminology: Delta Client
++
++This is the component that implements support for reading and writing Delta tables, and implements
++the logic required by the `catalogManaged` table feature. Among other things, it
++- triggers the filesystem listing, if needed, to discover published commits
++- generates the commit content (the set of [actions](#actions))
++- works together with the query engine to trigger the commit process and invoke the client-side
++  catalog component with the commit content
++
++The Delta client is also responsible for defining the client-side API that catalogs should target.
++That is, there must be _some_ API that the [catalog client](#catalog-client) can use to communicate
++to the Delta client the subset of catalog-managed information that the Delta client cares about.
++This protocol feature is concerned with what information Delta cares about, but leaves to Delta
++clients the design of the API they use to obtain that information from catalog clients.
++
++## Terminology: Catalogs
++
++1. **Catalog**: A catalog is an entity which manages a Delta table, including its creation, writes,
++   reads, and eventual deletion.
++    - It could be backed by a database, a filesystem, or any other persistence mechanism.
++    - Each catalog has its own spec around how catalog clients should interact with them, and how
++      they perform a commit.
++
++2. <a name="catalog-client">**Catalog Client**</a>: The catalog always has a client-side component
++   which the Delta client interacts with directly. This client-side component has two primary
++   responsibilities:
++    - implement any client-side catalog-specific logic (such as staging or
++      [publishing](#publishing-commits) commits)
++    - communicate with the Catalog Server, if any
++
++3. **Catalog Server**: The catalog may also involve a server-side component that the client-side
++   component is responsible for communicating with.
++    - This server is responsible for coordinating commits and potentially persisting table metadata
++      and enforcing authorization policies.
++    - Not all catalogs require a server; some may be entirely client-side, e.g. filesystem-backed
++      catalogs, or they may make use of a generic database server and implement all of the catalog's
++      business logic client-side.
++
++**NOTE**: This specification outlines the responsibilities and actions that catalogs must implement.
++This spec does its best not to assume any specific catalog _implementation_, though it does call out
++likely client-side and server-side responsibilities. Nonetheless, what a given catalog does
++client-side or server-side is up to each catalog implementation to decide for itself.
++
++## Catalog Responsibilities
++
++When the `catalogManaged` table feature is enabled, a catalog performs commits to the table on behalf
++of the Delta client.
++
++As stated above, the Delta spec does not mandate any particular client-server design or API for
++catalogs that manage Delta tables. However, the catalog does need to provide certain capabilities
++for reading and writing Delta tables:
++
++- Atomically commit a version `v` with a given set of `actions`. This is explained in detail in the
++  [commit protocol](#commit-protocol) section.
++- Retrieve information about recent ratified commits and the latest ratified version on the table.
++  This is explained in detail in the [Getting Ratified Commits from the Catalog](#getting-ratified-commits-from-the-catalog) section.
++- Though not required, it is encouraged that catalogs also return the latest table-level metadata,
++  such as the latest Protocol and Metadata actions, for the table. This can provide significant
++  performance advantages to conforming Delta clients, who may forgo log replay and instead trust
++  the information provided by the catalog during query planning.
++
++## Reading Catalog-managed Tables
++
++A catalog-managed table can have a mix of (a) published and (b) ratified but non-published commits.
++The catalog is the source of truth for ratified commits. Also recall that ratified commits can be
++[staged commits](#staged-commit) that are persisted to the `_delta_log/_staged_commits` directory,
++or [inline commits](#inline-commit) whose content the catalog stores directly.
++
++For example, suppose the `_delta_log` directory contains the following files:
++
++```
++00000000000000000000.json
++00000000000000000001.json
++00000000000000000002.checkpoint.parquet
++00000000000000000002.json
++00000000000000000003.00000000000000000005.compacted.json
++00000000000000000003.json
++00000000000000000004.json
++00000000000000000005.json
++00000000000000000006.json
++00000000000000000007.json
++_staged_commits/00000000000000000007.016ae953-37a9-438e-8683-9a9a4a79a395.json // ratified and published
++_staged_commits/00000000000000000008.7d17ac10-5cc3-401b-bd1a-9c82dd2ea032.json // ratified
++_staged_commits/00000000000000000008.b91807ba-fe18-488c-a15e-c4807dbd2174.json // rejected
++_staged_commits/00000000000000000010.0f707846-cd18-4e01-b40e-84ee0ae987b0.json // not yet ratified
++_staged_commits/00000000000000000010.7a980438-cb67-4b89-82d2-86f73239b6d6.json // partial file
++```
++
++Further, suppose the catalog stores the following ratified commits:
++```
++{
++  7  -> "00000000000000000007.016ae953-37a9-438e-8683-9a9a4a79a395.json",
++  8  -> "00000000000000000008.7d17ac10-5cc3-401b-bd1a-9c82dd2ea032.json",
++  9  -> <inline commit: content stored by the catalog directly>
++}
++```
++
++Some things to note are:
++- the catalog isn't aware that commit 7 was already published - perhaps the response from the
++  filesystem was dropped
++- commit 9 is an inline commit
++- neither of the two staged commits for version 10 has been ratified
++
++To read such tables, Delta clients must first contact the catalog to get the ratified commits. This
++informs the Delta client of commits [7, 9] as well as the latest ratified version, 9.
++
++If this information is insufficient to construct a complete snapshot of the table, Delta clients
++must LIST the `_delta_log` directory to get information about the published commits. For commits
++that are both returned by the catalog and already published, Delta clients must treat the catalog's
++version as authoritative and read the commit returned by the catalog. Additionally, Delta clients
++must ignore any files with versions greater than the latest ratified commit version returned by the
++catalog.
++
++Combining these two sets of files and commits enables Delta clients to generate a snapshot at the
++latest version of the table.
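A minimal sketch of the combination step described above, assuming the client already holds the published versions found by LIST and the catalog's response. The class `SnapshotCommitResolver`, the method `resolve`, and the `"published"` / `"catalog:"` labels are hypothetical names chosen for illustration. Checkpoints and log compaction files are ignored here.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of the protocol.
final class SnapshotCommitResolver {
    /**
     * Decides, per version, which commit a reader should load.
     *
     * @param publishedVersions     versions found as _delta_log/<v>.json via LIST
     * @param catalogCommits        ratified commits returned by the catalog
     *                              (version -> staged file name or inline marker)
     * @param latestRatifiedVersion latest ratified version reported by the catalog
     * @return version -> source, sorted by version
     */
    static TreeMap<Long, String> resolve(
            List<Long> publishedVersions,
            Map<Long, String> catalogCommits,
            long latestRatifiedVersion) {
        TreeMap<Long, String> result = new TreeMap<>();
        for (long v : publishedVersions) {
            // Files beyond the latest ratified version must be ignored.
            if (v <= latestRatifiedVersion) {
                result.put(v, "published");
            }
        }
        // For versions known to both, the catalog's commit is authoritative,
        // so it overwrites the published entry.
        for (Map.Entry<Long, String> e : catalogCommits.entrySet()) {
            result.put(e.getKey(), "catalog:" + e.getValue());
        }
        return result;
    }
}
```

With the example above (published versions 0 through 7, catalog commits 7 through 9, latest ratified version 9), versions 0 through 6 resolve to published files and versions 7 through 9 resolve to the catalog's commits.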
++
++**NOTE**: This spec prescribes the _minimum_ required interactions between Delta clients and
++catalogs for commits. Catalogs may very well expose APIs and work with Delta clients to be
++informed of other non-commit [file types](#file-types), such as checkpoint, log
++compaction, and version checksum files. This would allow catalogs to return additional
++information to Delta clients during query and scan planning, potentially allowing Delta
++clients to avoid LISTing the filesystem altogether.
++
++## Commit Protocol
++
++To start, Delta clients send the desired actions to be committed to the client-side component of the
++catalog.
++
++This component then has several options for proposing, ratifying, and publishing the commit,
++detailed below.
++
++- Option 1: Write the actions (likely client-side) to a [staged commit file](#staged-commit) in the
++  `_delta_log/_staged_commits` directory and then ratify the staged commit (likely server-side) by
++  atomically recording (in persistent storage of some kind) that the file corresponds to version `v`.
++- Option 2: Treat this as an [inline commit](#inline-commit) (i.e. the client-side component
++  likely sends the contents to the server-side component) and atomically record (in persistent
++  storage of some kind) the content of the commit as version `v` of the table.
++- Option 3: Catalog implementations that use PUT-if-absent (client- or server-side) can ratify and
++  publish all-in-one by atomically writing a [published commit file](#published-commit)
++  in the `_delta_log` directory. Note that this commit will be considered to have succeeded as soon
++  as the file becomes visible in the filesystem, regardless of when or whether the catalog is made
++  aware of the successful publish. The catalog does not need to store these files.
++
++A catalog must not ratify version `v` until it has ratified version `v - 1`, and it must ratify
++version `v` at most once.
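The ordering and at-most-once rule above can be illustrated with a toy in-memory catalog. This is a sketch only: `InMemoryCatalog` and `tryRatify` are hypothetical names, and a real catalog would need durable, atomic storage rather than a synchronized map.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical toy catalog, not part of the protocol.
final class InMemoryCatalog {
    private final Map<Long, String> ratified = new HashMap<>();
    private long latestRatifiedVersion;

    // For a brand-new table, pass -1 so that version 0 is the first ratifiable version.
    InMemoryCatalog(long latestRatifiedVersion) {
        this.latestRatifiedVersion = latestRatifiedVersion;
    }

    // Ratifies `commit` as `version` only if it is exactly the next version.
    // A second attempt at the same version, or an attempt that skips a version, is rejected.
    synchronized boolean tryRatify(long version, String commit) {
        if (version != latestRatifiedVersion + 1) {
            return false;
        }
        ratified.put(version, commit);
        latestRatifiedVersion = version;
        return true;
    }

    synchronized long latestRatifiedVersion() {
        return latestRatifiedVersion;
    }
}
```

Because the check and the update happen under one lock, two concurrent proposals for the same version cannot both win.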
++
++The catalog must store both flavors of ratified commits (staged or inline) and make them available
++to readers until they are [published](#publishing-commits).
++
++For performance reasons, Delta clients are encouraged to establish an API contract where the catalog
++provides the latest ratified commit information whenever a commit fails due to version conflict.
++
++## Getting Ratified Commits from the Catalog
++
++Even after a commit is ratified, it is not discoverable through filesystem operations until it is
++[published](#publishing-commits).
++
++The catalog client is responsible for implementing an API (defined by the Delta client) that Delta clients can
++use to retrieve the latest ratified commit version (authoritative), as well as the set of ratified
++commits the catalog is still storing for the table. If some commits needed to complete the snapshot
++are not stored by the catalog, as they are already published, Delta clients can issue a filesystem
++LIST operation to retrieve them.
++
++Delta clients must establish an API contract where the catalog provides ratified commit information
++as part of the standard table resolution process performed at query planning time.
++
++## Publishing Commits
++
++Publishing is the process of copying the ratified commit with version `<v>` to
++`_delta_log/<v>.json`. The ratified commit may be a staged commit located in
++`_delta_log/_staged_commits/<v>.<uuid>.json`, or it may be an inline commit whose content the
++catalog stores itself. Because the content of a ratified commit is immutable, it does not matter
++whether the client-side, server-side, or both catalog components initiate publishing.
++
++Implementations are strongly encouraged to publish commits promptly. This reduces the number of
++commits the catalog needs to store internally (and serve up to readers).
++
++Commits must be published _in order_. That is, version `v - 1` must be published _before_ version
++`v`.
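One way to respect the in-order rule is to compute the publish order before copying any file. The sketch below assumes the caller knows the latest published version and the ratified commits still stored by the catalog. `PublishPlanner` and `publishOrder` are hypothetical names, and the actual file copy is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;

// Hypothetical helper, not part of the protocol.
final class PublishPlanner {
    // Returns the versions to publish, ascending, starting right after the latest
    // published version. Stops at the first gap so that no version is published
    // before its predecessor.
    static List<Long> publishOrder(
            long latestPublishedVersion, SortedMap<Long, String> ratifiedCommits) {
        List<Long> order = new ArrayList<>();
        long next = latestPublishedVersion + 1;
        for (long v : ratifiedCommits.tailMap(next).keySet()) {
            if (v != next) {
                break;
            }
            order.add(v);
            next++;
        }
        return order;
    }
}
```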
++
++**NOTE**: Because commit publishing can happen at any time after the commit succeeds, the file
++modification timestamp of the published file will not accurately reflect the original commit time.
++For this reason, catalog-managed tables must use [in-commit-timestamps](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#in-commit-timestamps)
++to ensure stability of time travel reads. Refer to the [Writer Requirements for Catalog-managed Tables](#writer-requirements-for-catalog-managed-tables)
++section for more details.
++
++## Maintenance Operations on Catalog-managed Tables
++
++[Checkpoints](#checkpoints-1) and [Log Compaction Files](#log-compaction-files) can only be created
++for versions that are already published in the `_delta_log`. In other words, in order to checkpoint
++version `v` or produce a log compaction file for commit range `x <= v <= y`, `_delta_log/<v>.json`
++must exist.
++
++Notably, the [Version Checksum File](#version-checksum-file) for version `v` _can_ be created in the
++`_delta_log` even if the commit for version `v` is not published.
++
++By default, maintenance operations are prohibited unless the managing catalog explicitly permits
++the client to run them. The only exceptions are checkpoints, log compaction, and version checksum,
++as they are essential for all basic table operations (e.g. reads and writes) to operate reliably.
++All other maintenance operations such as the following are not allowed by default.
++- [Log and other metadata files clean up](#metadata-cleanup).
++- Data files cleanup, for example VACUUM.
++- Data layout changes, for example OPTIMIZE and REORG.
++
++## Creating and Dropping Catalog-managed Tables
++
++The catalog and query engine ultimately dictate how to create and drop catalog-managed tables.
++
++As one example, table creation often works in three phases:
++
++1. An initial catalog operation to obtain a unique storage location which serves as an unnamed
++   "staging" table
++2. A table operation that physically initializes a new `catalogManaged`-enabled table at the staging
++   location.
++3. A final catalog operation that registers the new table with its intended name.
++
++Delta clients would primarily be involved with the second step, but an implementation could choose
++to combine the second and third steps so that a single catalog call registers the table as part of
++the table's first commit.
++
++As another example, dropping a table can be as simple as removing its name from the catalog (a "soft
++delete"), followed at some later point by a "hard delete" that physically purges the data. The Delta
++client would not be involved at all in this process, because no commits are made to the table.
++
++## Catalog-managed Table Enablement
++
++The `catalogManaged` table feature is supported and active when:
++- The table is on Reader Version 3 and Writer Version 7.
++- The table has a `protocol` action with `readerFeatures` and `writerFeatures` both containing the
++  feature `catalogManaged`.
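As a sketch, the enablement conditions above reduce to a simple predicate over the fields of the `protocol` action. The class and method names below are hypothetical. The parameters mirror the `minReaderVersion`, `minWriterVersion`, `readerFeatures`, and `writerFeatures` fields of the `protocol` action.

```java
import java.util.Set;

// Hypothetical helper, not part of the protocol.
final class CatalogManagedCheck {
    // True when the table is on Reader Version 3 and Writer Version 7 and
    // both feature sets list `catalogManaged`.
    static boolean isCatalogManaged(
            int minReaderVersion,
            int minWriterVersion,
            Set<String> readerFeatures,
            Set<String> writerFeatures) {
        return minReaderVersion == 3
            && minWriterVersion == 7
            && readerFeatures.contains("catalogManaged")
            && writerFeatures.contains("catalogManaged");
    }
}
```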
++
++## Writer Requirements for Catalog-managed tables
++
++When supported and active:
++
++- Writers must discover and access the table using catalog calls, which happens _before_ the table's
++  protocol is known. See [Table Discovery](#table-discovery) for more details.
++- The [in-commit-timestamps](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#in-commit-timestamps)
++  table feature must be supported and active.
++- The `commitInfo` action must also contain a field `txnId` that stores a unique transaction
++  identifier string.
++- Writers must follow the catalog's [commit protocol](#commit-protocol) and must not perform
++  ordinary filesystem-based commits against the table.
++- Writers must follow the catalog's [maintenance operation protocol](#maintenance-operations-on-catalog-managed-tables).
++
++## Reader Requirements for Catalog-managed tables
++
++When supported and active:
++
++- Readers must discover the table using catalog calls, which happens before the table's protocol
++  is known. See [Table Discovery](#table-discovery) for more details.
++- Readers must contact the catalog for information about unpublished ratified commits.
++- Readers must follow the rules described in the [Reading Catalog-managed Tables](#reading-catalog-managed-tables)
++  section above. Notably:
++  - If the catalog said `v` is the latest version, clients must ignore any later versions that may
++    have been published.
++  - When the catalog returns a ratified commit for version `v`, readers must use that
++    catalog-supplied commit and ignore any published Delta file for version `v` that might also be
++    present.
++
++## Table Discovery
++
++The requirements above state that readers and writers must discover and access the table using
++catalog calls, which occurs _before_ the table's protocol is known. This raises an important
++question: how can a client discover a `catalogManaged` Delta table without first knowing that it
++_is_, in fact, `catalogManaged` (according to the protocol)?
++
++To solve this, first note that, in practice, catalog-integrated engines already ask the catalog to
++resolve a table name to its storage location during the name resolution step. This protocol
++therefore encourages that the same name resolution step also indicate whether the table is
++catalog-managed. Surfacing this at the very moment the catalog returns the path imposes no extra
++round-trips, yet it lets the client decide — early and unambiguously — whether to follow the
++`catalogManaged` read and write rules.
++
++## Sample Catalog Client API
++
++The following is an example of a possible API which a Java-based Delta client might require catalog
++implementations to target:
++
++```scala
++
++trait CatalogManagedTable {
++    /**
++     * Commits the given set of `actions` to the given commit `version`.
++     *
++     * @param version The version we want to commit.
++     * @param actions Actions that need to be committed.
++     *
++     * @return CommitResponse which has details around the new committed delta file.
++     */
++    def commit(
++        version: Long,
++        actions: Iterator[String]): CommitResponse
++
++    /**
++     * Retrieves a (possibly empty) suffix of ratified commits in the range [startVersion,
++     * endVersion] for this table.
++     * 
++     * Some of these ratified commits may already have been published. Some of them may be staged,
++     * in which case the staged commit file path is returned; others may be inline, in which case
++     * the inline commit content is returned.
++     * 
++     * The returned commits are sorted in ascending version number and are contiguous.
++     *
++     * If neither start nor end version is specified, the catalog will return all available ratified
++     * commits (possibly empty, if all commits have been published).
++     *
++     * In all cases, the response also includes the table's latest ratified commit version.
++     *
++     * @return GetCommitsResponse which contains an ordered list of ratified commits
++     *         stored by the catalog, as well as table's latest commit version.
++     */
++    def getRatifiedCommits(
++        startVersion: Option[Long],
++        endVersion: Option[Long]): GetCommitsResponse
++}
++```
++
++Note that the above is only one example of a possible Catalog Client API. It is also _NOT_ a catalog
++API (no table discovery, ACL, create/drop, etc). The Delta protocol is agnostic to API details, and
++the API surface Delta clients define should only cover the specific catalog capabilities that Delta
++client needs to correctly read and write catalog-managed tables.
++
+ # Iceberg Compatibility V1
+ 
+ This table feature (`icebergCompatV1`) ensures that Delta tables can be converted to Apache Iceberg™ format, though this table feature does not implement or specify that conversion.
+  * Files that have been [added](#Add-File-and-Remove-File) and not yet removed
+  * Files that were recently [removed](#Add-File-and-Remove-File) and have not yet expired
+  * [Transaction identifiers](#Transaction-Identifiers)
+- * [Domain Metadata](#Domain-Metadata)
++ * [Domain Metadata](#Domain-Metadata) that have not been removed (i.e. excluding tombstones with `removed=true`)
+  * [Checkpoint Metadata](#checkpoint-metadata) - Requires [V2 checkpoints](#v2-spec)
+  * [Sidecar File](#sidecar-files) - Requires [V2 checkpoints](#v2-spec)
+ 
+ 1. Identify a threshold (in days) up to which we want to preserve the deltaLog. Let's refer to
+ midnight UTC of that day as `cutOffTimestamp`. The newest commit not newer than the `cutOffTimestamp` is
+ the `cutOffCommit`, because a commit exactly at midnight is an acceptable cutoff. We want to retain everything including and after the `cutOffCommit`.
+-2. Identify the newest checkpoint that is not newer than the `cutOffCommit`. A checkpoint at the `cutOffCommit` is ideal, but an older one will do. Lets call it `cutOffCheckpoint`.
+-We need to preserve the `cutOffCheckpoint` (both the checkpoint file and the JSON commit file at that version) and all commits after it. The JSON commit file at the `cutOffCheckpoint` version must be preserved because checkpoints do not preserve [commit provenance information](#commit-provenance-information) (e.g., `commitInfo` actions), which may be required by table features such as [In-Commit Timestamps](#in-commit-timestamps). All commits after `cutOffCheckpoint` must be preserved to enable time travel for commits between `cutOffCheckpoint` and the next available checkpoint.
+-3. Delete all [delta log entries](#delta-log-entries) and [checkpoint files](#checkpoints) before the
+-`cutOffCheckpoint` checkpoint. Also delete all the [log compaction files](#log-compaction-files) having startVersion <= `cutOffCheckpoint`'s version.
++2. Identify the newest checkpoint that is not newer than the `cutOffCommit`. A checkpoint at the `cutOffCommit` is ideal, but an older one will do. Let's call it `cutOffCheckpoint`.
++We need to preserve the `cutOffCheckpoint` (both the checkpoint file and the JSON commit file at that version) and all published commits after it. The JSON commit file at the `cutOffCheckpoint` version must be preserved because checkpoints do not preserve [commit provenance information](#commit-provenance-information) (e.g., `commitInfo` actions), which may be required by table features such as [In-Commit Timestamps](#in-commit-timestamps). All published commits after `cutOffCheckpoint` must be preserved to enable time travel for commits between `cutOffCheckpoint` and the next available checkpoint.
++    - If no `cutOffCheckpoint` can be found, do not proceed with metadata cleanup as there is
++      nothing to clean up.
++3. Delete all [delta log entries](#delta-log-entries), [checkpoint files](#checkpoints), and
++   [version checksum files](#version-checksum-file) before the `cutOffCheckpoint` checkpoint. Also delete all the [log compaction files](#log-compaction-files)
++   having startVersion <= `cutOffCheckpoint`'s version.
++    - Also delete all the [staged commit files](#staged-commit) having version <=
++      `cutOffCheckpoint`'s version from the `_delta_log/_staged_commits` directory.
+ 4. Now read all the available [checkpoints](#checkpoints-1) in the _delta_log directory and identify
+ the corresponding [sidecar files](#sidecar-files). These sidecar files need to be protected.
+ 5. List all the files in `_delta_log/_sidecars` directory, preserve files that are less than a day
+ [Timestamp without Timezone](#timestamp-without-timezone-timestampNtz) | `timestampNtz` | Readers and writers
+ [Domain Metadata](#domain-metadata) | `domainMetadata` | Writers only
+ [V2 Checkpoint](#v2-checkpoint-table-feature) | `v2Checkpoint` | Readers and writers
++[Catalog-managed Tables](#catalog-managed-tables) | `catalogManaged` | Readers and writers
+ [Iceberg Compatibility V1](#iceberg-compatibility-v1) | `icebergCompatV1` | Writers only
+ [Iceberg Compatibility V2](#iceberg-compatibility-v2) | `icebergCompatV2` | Writers only
+ [Clustered Table](#clustered-table) | `clustering` | Writers only
\ No newline at end of file

README.md

@@ -0,0 +1,10 @@
+diff --git a/README.md b/README.md
+--- a/README.md
++++ b/README.md
+ ## Building
+ 
+ Delta Lake is compiled using [SBT](https://www.scala-sbt.org/1.x/docs/Command-Line-Reference.html).
++Ensure that your Java version is at least 17 (you can verify with `java -version`).
+ 
+ To compile, run
+ 
\ No newline at end of file

build.sbt

@@ -0,0 +1,218 @@
+diff --git a/build.sbt b/build.sbt
+--- a/build.sbt
++++ b/build.sbt
+       allMappings.distinct
+     },
+ 
+-    // Exclude internal modules from published POM
++    // Exclude internal modules from published POM and add kernel dependencies.
++    // Kernel modules are transitive through sparkV2 (an internal module), so they
++    // are lost when sparkV2 is filtered out. We re-add them explicitly here.
+     pomPostProcess := { node =>
+       val internalModules = internalModuleNames.value
++      val ver = version.value
+       import scala.xml._
+       import scala.xml.transform._
++
++      def kernelDependencyNode(artifactId: String): Elem = {
++        <dependency>
++          <groupId>io.delta</groupId>
++          <artifactId>{artifactId}</artifactId>
++          <version>{ver}</version>
++        </dependency>
++      }
++
++      val kernelDeps = Seq(
++        kernelDependencyNode("delta-kernel-api"),
++        kernelDependencyNode("delta-kernel-defaults"),
++        kernelDependencyNode("delta-kernel-unitycatalog")
++      )
++
+       new RuleTransformer(new RewriteRule {
+         override def transform(n: Node): Seq[Node] = n match {
+-          case e: Elem if e.label == "dependency" =>
+-            val artifactId = (e \ "artifactId").text
+-            // Check if artifactId starts with any internal module name
+-            // (e.g., "delta-spark-v1_4.1_2.13" starts with "delta-spark-v1")
+-            val isInternal = internalModules.exists(module => artifactId.startsWith(module))
+-            if (isInternal) Seq.empty else Seq(n)
++          case e: Elem if e.label == "dependencies" =>
++            val filtered = e.child.filter {
++              case child: Elem if child.label == "dependency" =>
++                val artifactId = (child \ "artifactId").text
++                !internalModules.exists(module => artifactId.startsWith(module))
++              case _ => true
++            }
++            Seq(e.copy(child = filtered ++ kernelDeps))
+           case _ => Seq(n)
+         }
+       }).transform(node).head
+     commonSettings,
+     scalaStyleSettings,
+     releaseSettings,
+-    CrossSparkVersions.sparkDependentModuleName(sparkVersion),
++    // Set sparkVersion directly (not sparkDependentModuleName) so that
++    // runOnlyForReleasableSparkModules discovers this module, but without adding a Spark
++    // suffix to the artifact name. delta-contribs is only published as delta-contribs_2.13.
++    sparkVersion := CrossSparkVersions.getSparkVersion(),
+     Compile / packageBin / mappings := (Compile / packageBin / mappings).value ++
+       listPythonFiles(baseDirectory.value.getParentFile / "python"),
+ 
+   ).configureUnidoc()
+ 
+ 
+-val unityCatalogVersion = "0.3.1"
++val unityCatalogVersion = "0.4.0"
+ val sparkUnityCatalogJacksonVersion = "2.15.4" // We are using Spark 4.0's Jackson version 2.15.x, to override Unity Catalog 0.3.0's version 2.18.x
+ 
+ lazy val sparkUnityCatalog = (project in file("spark/unitycatalog"))
+     libraryDependencies ++= Seq(
+       "org.apache.spark" %% "spark-sql" % sparkVersion.value % "provided",
+ 
+-      "io.delta" %% "delta-sharing-client" % "1.3.9",
++      "io.delta" %% "delta-sharing-client" % "1.3.10",
+ 
+       // Test deps
+       "org.scalatest" %% "scalatest" % scalaTestVersion % "test",
+ 
+       // Test Deps
+       "org.scalatest" %% "scalatest" % scalaTestVersion % "test",
++      // Jackson datatype module needed for UC SDK tests (excluded from main compile scope)
++      "com.fasterxml.jackson.datatype" % "jackson-datatype-jsr310" % "2.15.4" % "test",
+     ),
+ 
+     // Unidoc settings
+     commonSettings,
+     scalaStyleSettings,
+     releaseSettings,
+-    CrossSparkVersions.sparkDependentModuleName(sparkVersion),
++    // Set sparkVersion directly (not sparkDependentModuleName) so that
++    // runOnlyForReleasableSparkModules discovers this module, but without adding a Spark
++    // suffix to the artifact name. delta-iceberg is only published as delta-iceberg_2.13.
++    sparkVersion := CrossSparkVersions.getSparkVersion(),
+     libraryDependencies ++= {
+       if (supportIceberg) {
+         Seq(
+           "org.xerial" % "sqlite-jdbc" % "3.45.0.0" % "test",
+           "org.apache.httpcomponents.core5" % "httpcore5" % "5.2.4" % "test",
+           "org.apache.httpcomponents.client5" % "httpclient5" % "5.3.1" % "test",
+-          "org.apache.iceberg" %% icebergSparkRuntimeArtifactName % "1.10.0" % "provided"
++          "org.apache.iceberg" %% icebergSparkRuntimeArtifactName % "1.10.0" % "provided",
++          // For FixedGcsAccessTokenProvider (GCS server-side planning credentials)
++          "com.google.cloud.bigdataoss" % "util-hadoop" % "hadoop3-2.2.26" % "provided"
+         )
+       } else {
+         Seq.empty
+   )
+ // scalastyle:on println
+ 
+-val icebergShadedVersion = "1.10.0"
++val icebergShadedVersion = "1.10.1"
+ lazy val icebergShaded = (project in file("icebergShaded"))
+   .dependsOn(spark % "provided")
+   .disablePlugins(JavaFormatterPlugin, ScalafmtPlugin)
+     commonSettings,
+     scalaStyleSettings,
+     releaseSettings,
+-    CrossSparkVersions.sparkDependentSettings(sparkVersion),
+-    libraryDependencies ++= Seq(
+-      "org.apache.hudi" % "hudi-java-client" % "0.15.0" % "compile" excludeAll(
+-        ExclusionRule(organization = "org.apache.hadoop"),
+-        ExclusionRule(organization = "org.apache.zookeeper"),
+-      ),
+-      "org.apache.spark" %% "spark-avro" % sparkVersion.value % "test" excludeAll ExclusionRule(organization = "org.apache.hadoop"),
+-      "org.apache.parquet" % "parquet-avro" % "1.12.3" % "compile"
+-    ),
++    // Set sparkVersion directly (not sparkDependentModuleName) so that
++    // runOnlyForReleasableSparkModules discovers this module, but without adding a Spark
++    // suffix to the artifact name. delta-hudi is only published as delta-hudi_2.13.
++    sparkVersion := CrossSparkVersions.getSparkVersion(),
++    libraryDependencies ++= {
++      if (supportHudi) {
++        Seq(
++          "org.apache.hudi" % "hudi-java-client" % "0.15.0" % "compile" excludeAll(
++            ExclusionRule(organization = "org.apache.hadoop"),
++            ExclusionRule(organization = "org.apache.zookeeper"),
++          ),
++          "org.apache.spark" %% "spark-avro" % sparkVersion.value % "test" excludeAll ExclusionRule(organization = "org.apache.hadoop"),
++          "org.apache.parquet" % "parquet-avro" % "1.12.3" % "compile"
++        )
++      } else {
++        Seq.empty
++      }
++    },
++    // Skip compilation and publishing when supportHudi is false
++    Compile / skip := !supportHudi,
++    Test / skip := !supportHudi,
++    publish / skip := !supportHudi,
++    publishLocal / skip := !supportHudi,
++    publishM2 / skip := !supportHudi,
+     assembly / assemblyJarName := s"${name.value}-assembly_${scalaBinaryVersion.value}-${version.value}.jar",
+     assembly / logLevel := Level.Info,
+     assembly / test := {},
+       // crossScalaVersions must be set to Nil on the aggregating project
+       crossScalaVersions := Nil,
+       publishArtifact := false,
+-      publish / skip := false,
++      publish / skip := true,
+     )
+ }
+ 
+       // crossScalaVersions must be set to Nil on the aggregating project
+       crossScalaVersions := Nil,
+       publishArtifact := false,
+-      publish / skip := false,
++      publish / skip := true,
+     )
+ }
+ 
+     // crossScalaVersions must be set to Nil on the aggregating project
+     crossScalaVersions := Nil,
+     publishArtifact := false,
+-    publish / skip := false,
++    publish / skip := true,
+     unidocSourceFilePatterns := {
+       (kernelApi / unidocSourceFilePatterns).value.scopeToProject(kernelApi) ++
+       (kernelDefaults / unidocSourceFilePatterns).value.scopeToProject(kernelDefaults)
+     // crossScalaVersions must be set to Nil on the aggregating project
+     crossScalaVersions := Nil,
+     publishArtifact := false,
+-    publish / skip := false,
++    publish / skip := true,
+   )
+ 
+ /*
+     sys.env.getOrElse("SONATYPE_USERNAME", ""),
+     sys.env.getOrElse("SONATYPE_PASSWORD", "")
+   ),
++  credentials += Credentials(
++    "Sonatype Nexus Repository Manager",
++    "central.sonatype.com",
++    sys.env.getOrElse("SONATYPE_USERNAME", ""),
++    sys.env.getOrElse("SONATYPE_PASSWORD", "")
++  ),
+   publishTo := {
+     val ossrhBase = "https://ossrh-staging-api.central.sonatype.com/"
++    val centralSnapshots = "https://central.sonatype.com/repository/maven-snapshots/"
+     if (isSnapshot.value) {
+-      Some("snapshots" at ossrhBase + "content/repositories/snapshots")
++      Some("snapshots" at centralSnapshots)
+     } else {
+       Some("releases"  at ossrhBase + "service/local/staging/deploy/maven2")
+     }
+ // Looks like some of release settings should be set for the root project as well.
+ publishArtifact := false  // Don't release the root project
+ publish / skip := true
+-publishTo := Some("snapshots" at "https://ossrh-staging-api.central.sonatype.com/content/repositories/snapshots")
++publishTo := Some("snapshots" at "https://central.sonatype.com/repository/maven-snapshots/")
+ releaseCrossBuild := false  // Don't use sbt-release's cross facility
+ releaseProcess := Seq[ReleaseStep](
+   checkSnapshotDependencies,
+   setReleaseVersion,
+   commitReleaseVersion,
+   tagRelease
+-) ++ CrossSparkVersions.crossSparkReleaseSteps("+publishSigned") ++ Seq[ReleaseStep](
++) ++ CrossSparkVersions.crossSparkReleaseSteps("publishSigned") ++ Seq[ReleaseStep](
+ 
+   // Do NOT use `sonatypeBundleRelease` - it will actually release to Maven! We want to do that
+   // manually.
\ No newline at end of file
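
Reviewer aid for the `pomPostProcess` change in the build.sbt diff above: the new rule rewrites the whole `<dependencies>` element, dropping internal-module entries and appending the kernel artifacts, instead of filtering one `<dependency>` at a time. The Python sketch below mirrors that logic so it can be tried outside sbt. It is an illustration only, not the build's code: the `INTERNAL_MODULES` values are placeholders (the real list comes from `internalModuleNames`), and it assumes a POM without XML namespaces.

```python
import xml.etree.ElementTree as ET

# Placeholder values for illustration; the real build reads internalModuleNames.value.
INTERNAL_MODULES = ["delta-spark-v1", "delta-spark-v2"]
# Artifact ids taken from the kernelDeps list in the diff.
KERNEL_ARTIFACTS = [
    "delta-kernel-api",
    "delta-kernel-defaults",
    "delta-kernel-unitycatalog",
]


def kernel_dependency(artifact_id, version):
    """Build a <dependency> element for an io.delta kernel artifact."""
    dep = ET.Element("dependency")
    ET.SubElement(dep, "groupId").text = "io.delta"
    ET.SubElement(dep, "artifactId").text = artifact_id
    ET.SubElement(dep, "version").text = version
    return dep


def post_process_pom(pom_xml, version):
    """Drop internal-module dependencies, then re-add the kernel ones."""
    root = ET.fromstring(pom_xml)
    # Materialize the matches first so the tree is not mutated mid-iteration.
    for deps in list(root.iter("dependencies")):
        for dep in list(deps):
            if dep.tag != "dependency":
                continue
            artifact_id = dep.findtext("artifactId", default="")
            # Prefix match, e.g. "delta-spark-v1_4.1_2.13" starts with "delta-spark-v1".
            if any(artifact_id.startswith(m) for m in INTERNAL_MODULES):
                deps.remove(dep)
        for artifact in KERNEL_ARTIFACTS:
            deps.append(kernel_dependency(artifact, version))
    return root
```

Working at the `<dependencies>` level is what makes the re-add possible: a per-`<dependency>` rule can only keep or drop the node it is given, so it has no place to append the kernel entries that disappear together with the filtered internal module.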

connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/.00000000000000000000.json.crc

@@ -0,0 +1,3 @@
+diff --git a/connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/.00000000000000000000.json.crc b/connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/.00000000000000000000.json.crc
+new file mode 100644
+Binary files /dev/null and b/connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/.00000000000000000000.json.crc differ
\ No newline at end of file

connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/00000000000000000000.crc

@@ -0,0 +1,5 @@
+diff --git a/connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/00000000000000000000.crc b/connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/00000000000000000000.crc
+new file mode 100644
+--- /dev/null
++++ b/connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/00000000000000000000.crc
++{"txnId":"6132e880-0f3a-4db4-b882-1da039bffbad","tableSizeBytes":0,"numFiles":0,"numMetadata":1,"numProtocol":1,"setTransactions":[],"domainMetadata":[],"metadata":{"id":"0eb3e007-b3cc-40e4-bca1-a5970d86b5a6","format":{"provider":"parquet","options":{}},"schemaString":"{\"type\":\"struct\",\"fields\":[{\"name\":\"id\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}},{\"name\":\"utf8_binary_col\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"utf8_lcase_col\",\"type\":\"string\",\"nullable\":true,\"metadata\":{\"__COLLATIONS\":{\"utf8_lcase_col\":\"spark.UTF8_LCASE\"}}},{\"name\":\"unicode_col\",\"type\":\"string\",\"nullable\":true,\"metadata\":{\"__COLLATIONS\":{\"unicode_col\":\"icu.UNICODE\"}}}]}","partitionColumns":[],"configuration":{},"createdTime":1773779518731},"protocol":{"minReaderVersion":1,"minWriterVersion":7,"writerFeatures":["domainMetadata","collations-preview","appendOnly","invariants"]},"histogramOpt":{"sortedBinBoundaries":[0,8192,16384,32768,65536,131072,262144,524288,1048576,2097152,4194304,8388608,12582912,16777216,20971520,25165824,29360128,33554432,37748736,41943040,50331648,58720256,67108864,75497472,83886080,92274688,100663296,109051904,117440512,125829120,130023424,134217728,138412032,142606336,146800640,150994944,167772160,184549376,201326592,218103808,234881024,251658240,268435456,285212672,301989888,318767104,335544320,352321536,369098752,385875968,402653184,419430400,436207616,452984832,469762048,486539264,503316480,520093696,536870912,553648128,570425344,587202560,603979776,671088640,738197504,805306368,872415232,939524096,1006632960,1073741824,1140850688,1207959552,1275068416,1342177280,1409286144,1476395008,1610612736,1744830464,1879048192,2013265920,2147483648,2415919104,2684354560,2952790016,3221225472,3489660928,3758096384,4026531840,4294967296,8589934592,17179869184,34359738368,68719476736,137438953472,274877906944],"fileCounts":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],"totalBytes":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]},"allFiles":[]}
\ No newline at end of file

... (truncated, output exceeded 60000 bytes)

Reproduce locally: git range-diff e8cffee..6b26f5b d1139d2..da38d13 | Disable: git config gitstack.push-range-diff false

zikangh · 2026-04-06T21:38:29Z

Range-diff: master (da38d13 -> ab609f8)

.github/CODEOWNERS

@@ -0,0 +1,12 @@
+diff --git a/.github/CODEOWNERS b/.github/CODEOWNERS
+--- a/.github/CODEOWNERS
++++ b/.github/CODEOWNERS
+ /project/                       @tdas
+ /version.sbt                    @tdas
+ 
++# Spark V2 and Unified modules
++/spark/v2/                      @tdas @huan233usc @TimothyW553 @raveeram-db @murali-db
++/spark-unified/                 @tdas @huan233usc @TimothyW553 @raveeram-db @murali-db
++
+ # All files in the root directory
+ /*                              @tdas
\ No newline at end of file

.github/workflows/iceberg_test.yaml

@@ -0,0 +1,16 @@
+diff --git a/.github/workflows/iceberg_test.yaml b/.github/workflows/iceberg_test.yaml
+--- a/.github/workflows/iceberg_test.yaml
++++ b/.github/workflows/iceberg_test.yaml
+           # the above directories when we use the key for the first time. After that, each run will
+           # just use the cache. The cache is immutable so we need to use a new key when trying to
+           # cache new stuff.
+-          key: delta-sbt-cache-spark3.2-scala${{ matrix.scala }}
++          key: delta-sbt-cache-spark4.0-scala${{ matrix.scala }}
+       - name: Install Job dependencies
+         run: |
+           sudo apt-get update
+       - name: Run Scala/Java and Python tests
+         # when changing TEST_PARALLELISM_COUNT make sure to also change it in spark_master_test.yaml
+         run: |
+-          TEST_PARALLELISM_COUNT=4 pipenv run python run-tests.py --group iceberg
++          TEST_PARALLELISM_COUNT=4 pipenv run python run-tests.py --group iceberg --spark-version 4.0
\ No newline at end of file

.github/workflows/spark_examples_test.yaml

@@ -0,0 +1,54 @@
+diff --git a/.github/workflows/spark_examples_test.yaml b/.github/workflows/spark_examples_test.yaml
+--- a/.github/workflows/spark_examples_test.yaml
++++ b/.github/workflows/spark_examples_test.yaml
+         # Spark versions are dynamically generated - released versions only
+         spark_version: ${{ fromJson(needs.generate-matrix.outputs.spark_versions) }}
+         # These Scala versions must match those in the build.sbt
+-        scala: [2.13.16]
++        scala: [2.13.17]
+     env:
+       SCALA_VERSION: ${{ matrix.scala }}
+-      SPARK_VERSION: ${{ matrix.spark_version }}
+     steps:
+       - uses: actions/checkout@v3
+       - name: Get Spark version details
+         id: spark-details
+         run: |
+-          # Get JVM version, package suffix, iceberg support for this Spark version
++          # Get JVM version, package suffix, iceberg support, and full version for this Spark version
+           JVM_VERSION=$(python3 project/scripts/get_spark_version_info.py --get-field "${{ matrix.spark_version }}" targetJvm | jq -r)
+           SPARK_PACKAGE_SUFFIX=$(python3 project/scripts/get_spark_version_info.py --get-field "${{ matrix.spark_version }}" packageSuffix | jq -r)
+           SUPPORT_ICEBERG=$(python3 project/scripts/get_spark_version_info.py --get-field "${{ matrix.spark_version }}" supportIceberg | jq -r)
++          SPARK_FULL_VERSION=$(python3 project/scripts/get_spark_version_info.py --get-field "${{ matrix.spark_version }}" fullVersion | jq -r)
+           echo "jvm_version=$JVM_VERSION" >> $GITHUB_OUTPUT
+           echo "spark_package_suffix=$SPARK_PACKAGE_SUFFIX" >> $GITHUB_OUTPUT
+           echo "support_iceberg=$SUPPORT_ICEBERG" >> $GITHUB_OUTPUT
+-          echo "Using JVM $JVM_VERSION for Spark ${{ matrix.spark_version }}, package suffix: '$SPARK_PACKAGE_SUFFIX', support iceberg: '$SUPPORT_ICEBERG'"
++          echo "spark_full_version=$SPARK_FULL_VERSION" >> $GITHUB_OUTPUT
++          echo "Using JVM $JVM_VERSION for Spark $SPARK_FULL_VERSION, package suffix: '$SPARK_PACKAGE_SUFFIX', support iceberg: '$SUPPORT_ICEBERG'"
+       - name: install java
+         uses: actions/setup-java@v3
+         with:
+       - name: Run Delta Spark Local Publishing and Examples Compilation
+         # examples/scala/build.sbt will compile against the local Delta release version (e.g. 3.2.0-SNAPSHOT).
+         # Thus, we need to publishM2 first so those jars are locally accessible.
+-        # The SPARK_PACKAGE_SUFFIX env var tells examples/scala/build.sbt which artifact naming to use.
++        # -DsparkVersion is for the Delta project's publishM2 (which Spark version to compile Delta against).
++        # SPARK_VERSION/SPARK_PACKAGE_SUFFIX/SUPPORT_ICEBERG are for examples/scala/build.sbt (dependency resolution).
+         env:
+           SPARK_PACKAGE_SUFFIX: ${{ steps.spark-details.outputs.spark_package_suffix }}
+           SUPPORT_ICEBERG: ${{ steps.spark-details.outputs.support_iceberg }}
++          SPARK_VERSION: ${{ steps.spark-details.outputs.spark_full_version }}
+         run: |
+           build/sbt clean
+-          build/sbt -DsparkVersion=${{ matrix.spark_version }} publishM2
++          build/sbt -DsparkVersion=${{ steps.spark-details.outputs.spark_full_version }} publishM2
+           cd examples/scala && build/sbt "++ $SCALA_VERSION compile"
++      - name: Run UC Delta Integration Test
++        # Verifies that delta-spark resolved from Maven local includes all kernel module
++        # dependencies transitively by running a real UC-backed Delta workload.
++        env:
++          SPARK_PACKAGE_SUFFIX: ${{ steps.spark-details.outputs.spark_package_suffix }}
++          SPARK_VERSION: ${{ steps.spark-details.outputs.spark_full_version }}
++        run: |
++          cd examples/scala && build/sbt "++ $SCALA_VERSION runMain example.UnityCatalogQuickstart"
\ No newline at end of file

.github/workflows/spark_test.yaml

@@ -0,0 +1,27 @@
+diff --git a/.github/workflows/spark_test.yaml b/.github/workflows/spark_test.yaml
+--- a/.github/workflows/spark_test.yaml
++++ b/.github/workflows/spark_test.yaml
+         # These Scala versions must match those in the build.sbt
+         scala: [2.13.16]
+         # Important: This list of shards must be [0..NUM_SHARDS - 1]
+-        shard: [0, 1, 2, 3]
++        shard: [0, 1, 2, 3, 4, 5, 6, 7]
+     env:
+       SCALA_VERSION: ${{ matrix.scala }}
+       SPARK_VERSION: ${{ matrix.spark_version }}
+       # Important: This must be the same as the length of shards in matrix
+-      NUM_SHARDS: 4
++      NUM_SHARDS: 8
+     steps:
+       - uses: actions/checkout@v3
+       - name: Get Spark version details
+         # when changing TEST_PARALLELISM_COUNT make sure to also change it in spark_python_test.yaml
+         run: |
+           TEST_PARALLELISM_COUNT=4 pipenv run python run-tests.py --group spark --shard ${{ matrix.shard }} --spark-version ${{ matrix.spark_version }}
++      - name: Upload test reports
++        if: always()
++        uses: actions/upload-artifact@v4
++        with:
++          name: test-reports-spark${{ matrix.spark_version }}-shard${{ matrix.shard }}
++          path: "**/target/test-reports/*.xml"
++          retention-days: 7
\ No newline at end of file
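
The two "Important" comments in the spark_test.yaml diff above state an invariant that is easy to break when changing the shard count: the `shard` matrix must be exactly `[0..NUM_SHARDS - 1]`, and `NUM_SHARDS` must equal its length. A small check like the following could catch a mismatch. This is a hypothetical helper written for illustration, not something the repository is known to contain.

```python
def check_shard_matrix(shards, num_shards):
    """Return a list of problems; an empty list means the matrix is consistent."""
    problems = []
    if len(shards) != num_shards:
        problems.append(
            f"NUM_SHARDS={num_shards} but the matrix lists {len(shards)} shards"
        )
    if sorted(shards) != list(range(len(shards))):
        problems.append(
            "shard ids must be exactly 0..len(shards)-1 with no gaps or repeats"
        )
    return problems


# Values from the updated workflow: eight shards, NUM_SHARDS: 8.
print(check_shard_matrix([0, 1, 2, 3, 4, 5, 6, 7], 8))
```

Updating only one of the two settings, for example leaving `shard: [0, 1, 2, 3]` while setting `NUM_SHARDS: 8`, is the failure this check is meant to flag.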

PROTOCOL.md

@@ -0,0 +1,537 @@
+diff --git a/PROTOCOL.md b/PROTOCOL.md
+--- a/PROTOCOL.md
++++ b/PROTOCOL.md
+   - [Writer Requirements for Variant Type](#writer-requirements-for-variant-type)
+   - [Reader Requirements for Variant Data Type](#reader-requirements-for-variant-data-type)
+   - [Compatibility with other Delta Features](#compatibility-with-other-delta-features)
++- [Catalog-managed tables](#catalog-managed-tables)
++  - [Terminology: Commits](#terminology-commits)
++  - [Terminology: Delta Client](#terminology-delta-client)
++  - [Terminology: Catalogs](#terminology-catalogs)
++  - [Catalog Responsibilities](#catalog-responsibilities)
++  - [Reading Catalog-managed Tables](#reading-catalog-managed-tables)
++  - [Commit Protocol](#commit-protocol)
++  - [Getting Ratified Commits from the Catalog](#getting-ratified-commits-from-the-catalog)
++  - [Publishing Commits](#publishing-commits)
++  - [Maintenance Operations on Catalog-managed Tables](#maintenance-operations-on-catalog-managed-tables)
++  - [Creating and Dropping Catalog-managed Tables](#creating-and-dropping-catalog-managed-tables)
++  - [Catalog-managed Table Enablement](#catalog-managed-table-enablement)
++  - [Writer Requirements for Catalog-managed tables](#writer-requirements-for-catalog-managed-tables)
++  - [Reader Requirements for Catalog-managed tables](#reader-requirements-for-catalog-managed-tables)
++  - [Table Discovery](#table-discovery)
++  - [Sample Catalog Client API](#sample-catalog-client-api)
+ - [Requirements for Writers](#requirements-for-writers)
+   - [Creation of New Log Entries](#creation-of-new-log-entries)
+   - [Consistency Between Table Metadata and Data Files](#consistency-between-table-metadata-and-data-files)
+ __(1)__ `preimage` is the value before the update, `postimage` is the value after the update.
+ 
+ ### Delta Log Entries
+-Delta files are stored as JSON in a directory at the root of the table named `_delta_log`, and together with checkpoints make up the log of all changes that have occurred to a table.
+ 
+-Delta files are the unit of atomicity for a table, and are named using the next available version number, zero-padded to 20 digits.
++Delta Log Entries, also known as Delta files, are JSON files stored in the `_delta_log`
++directory at the root of the table. Together with checkpoints, they make up the log of all changes
++that have occurred to a table. Delta files are the unit of atomicity for a table, and are named
++using the next available version number, zero-padded to 20 digits.
+ 
+ For example:
+ 
+ ```
+ ./_delta_log/00000000000000000000.json
+ ```
+-Delta files use new-line delimited JSON format, where every action is stored as a single line JSON document.
+-A delta file, `n.json`, contains an atomic set of [_actions_](#Actions) that should be applied to the previous table state, `n-1.json`, in order to the construct `n`th snapshot of the table.
+-An action changes one aspect of the table's state, for example, adding or removing a file.
++
++Delta files use newline-delimited JSON format, where every action is stored as a single-line
++JSON document. A Delta file, corresponding to version `v`, contains an atomic set of
++[_actions_](#actions) that should be applied to the previous table state corresponding to version
++`v-1`, in order to construct the `v`th snapshot of the table. An action changes one aspect of the
++table's state, for example, adding or removing a file.
++
++**Note:** If the [catalogManaged table feature](#catalog-managed-tables) is enabled on the table,
++recently [ratified commits](#ratified-commit) may not yet be published to the `_delta_log` directory as normal Delta
++files - they may be stored directly by the catalog or reside in the `_delta_log/_staged_commits`
++directory. Delta clients must contact the table's managing catalog in order to find the information
++about these [ratified, potentially-unpublished commits](#publishing-commits).
++
++The `_delta_log/_staged_commits` directory is the staging area for [staged](#staged-commit)
++commits. Delta files in this directory have a UUID embedded into them and follow the pattern
++`<version>.<uuid>.json`, where the version corresponds to the proposed commit version, zero-padded
++to 20 digits.
++
++For example:
++
++```
++./_delta_log/_staged_commits/00000000000000000000.3a0d65cd-4056-49b8-937b-95f9e3ee90e5.json
++./_delta_log/_staged_commits/00000000000000000001.7d17ac10-5cc3-401b-bd1a-9c82dd2ea032.json
++./_delta_log/_staged_commits/00000000000000000001.016ae953-37a9-438e-8683-9a9a4a79a395.json
++./_delta_log/_staged_commits/00000000000000000002.3ae45b72-24e1-865a-a211-34987ae02f2a.json
++```
++
++NOTE: The (proposed) version number of a staged commit is authoritative - file
++`00000000000000000100.<uuid>.json` always corresponds to a commit attempt for version 100. Besides
++simplifying implementations, it also acknowledges the fact that commit files cannot safely be reused
++for multiple commit attempts. For example, resolving conflicts in a table with [row
++tracking](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#row-tracking) enabled requires
++rewriting all file actions to update their `baseRowId` field.
++
++The [catalog](#terminology-catalogs) is the source of truth about which staged commit files in
++the `_delta_log/_staged_commits` directory correspond to ratified versions, and Delta clients should
++not attempt to directly interpret the contents of that directory. Refer to
++[catalog-managed tables](#catalog-managed-tables) for more details.
+ 
+ ### Checkpoints
+ Checkpoints are also stored in the `_delta_log` directory, and can be created at any time, for any committed version of the table.
+ ### Commit Provenance Information
+ A delta file can optionally contain additional provenance information about what higher-level operation was being performed as well as who executed it.
+ 
++When the `catalogManaged` table feature is enabled, the `commitInfo` action must have a field
++`txnId` that stores a unique transaction identifier string.
++
+ Implementations are free to store any valid JSON-formatted data via the `commitInfo` action.
+ 
+ When [In-Commit Timestamps](#in-commit-timestamps) are enabled, writers are required to include a `commitInfo` action with every commit, which must include the `inCommitTimestamp` field. Also, the `commitInfo` action must be first action in the commit.
+  - A single `protocol` action
+  - A single `metaData` action
+  - A collection of `txn` actions with unique `appId`s
+- - A collection of `domainMetadata` actions with unique `domain`s.
++ - A collection of `domainMetadata` actions with unique `domain`s, excluding tombstones (i.e. actions with `removed=true`).
+  - A collection of `add` actions with unique path keys, corresponding to the newest (path, deletionVector.uniqueId) pair encountered for each path.
+  - A collection of `remove` actions with unique `(path, deletionVector.uniqueId)` keys. The intersection of the primary keys in the `add` collection and `remove` collection must be empty. That means a logical file cannot exist in both the `remove` and `add` collections at the same time; however, the same *data file* can exist with *different* DVs in the `remove` collection, as logically they represent different content. The `remove` actions act as _tombstones_, and only exist for the benefit of the VACUUM command. Snapshot reads only return `add` actions on the read path.
+  
+      - write a `metaData` action to add the `delta.columnMapping.mode` table property.
+  - Write data files by using the _physical name_ that is chosen for each column. The physical name of the column is static and can be different than the _display name_ of the column, which is changeable.
+  - Write the 32 bit integer column identifier as part of the `field_id` field of the `SchemaElement` struct in the [Parquet Thrift specification](https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift).
+- - Track partition values and column level statistics with the physical name of the column in the transaction log.
++ - Track partition values, column level statistics, and [clustering column](#clustered-table) names with the physical name of the column in the transaction log.
+  - Assign a globally unique identifier as the physical name for each new column that is added to the schema. This is especially important for supporting cheap column deletions in `name` mode. In addition, column identifiers need to be assigned to each column. The maximum id that is assigned to a column is tracked as the table property `delta.columnMapping.maxColumnId`. This is an internal table property that cannot be configured by users. This value must increase monotonically as new columns are introduced and committed to the table alongside the introduction of the new columns to the schema.
+ 
+ ## Reader Requirements for Column Mapping
+ ## Writer Requirement for Deletion Vectors
+ When adding a logical file with a deletion vector, then that logical file must have correct `numRecords` information for the data file in the `stats` field.
+ 
++# Catalog-managed tables
++
++With this feature enabled, the [catalog](#terminology-catalogs) that manages the table becomes the
++source of truth for whether a given commit attempt succeeded.
++
++The table feature defines the parts of the [commit protocol](#commit-protocol) that directly impact
++the Delta table (e.g. atomicity requirements, publishing, etc). The Delta client and catalog
++together are responsible for implementing the Delta-specific aspects of commit as defined by this
++spec, but are otherwise free to define their own APIs and protocols for communication with each
++other.
++
++**NOTE**: Filesystem-based access to catalog-managed tables is not supported. Delta clients are
++expected to discover and access catalog-managed tables through the managing catalog, not by direct
++listing in the filesystem. This feature is primarily designed to warn filesystem-based readers that
++might attempt to access a catalog-managed table's storage location without going through the catalog
++first, and to block filesystem-based writers who could otherwise corrupt both the table and the
++catalog by failing to commit through the catalog.
++
++Before we can go into details of this protocol feature, we must first align our terminology.
++
++## Terminology: Commits
++
++A commit is a set of [actions](#actions) that transform a Delta table from version `v - 1` to `v`.
++It contains the same kind of content as is stored in a [Delta file](#delta-log-entries).
++
++A commit may be stored in the file system as a Delta file - either _published_ or _staged_ - or
++stored _inline_ in the managing catalog, using whatever format the catalog prefers.
++
++There are several types of commits:
++
++1. **Proposed commit**:  A commit that a Delta client has proposed for the next version of the
++   table. It could be _staged_ or _inline_. It will either become _ratified_ or be rejected.
++
++2. <a name="staged-commit">**Staged commit**</a>: A commit that is written to disk at
++   `_delta_log/_staged_commits/<v>.<uuid>.json`. It has the same content and format as a published
++   Delta file.
++    - Here, the `uuid` is a random UUID that is generated for each commit and `v` is the version
++      which is proposed to be committed, zero-padded to 20 digits.
++    - The mere existence of a staged commit does not mean that the file has been ratified or even
++      proposed. It might correspond to a failed or in-progress commit attempt.
++    - The catalog is the source of truth around which staged commits are ratified.
++    - The catalog stores only the location, not the content, of a staged (and ratified) commit.
++
++3. <a name="inline-commit">**Inline commit**</a>: A proposed commit that is not written to disk but
++   rather has its content sent to the catalog for the catalog to store directly.
++
++4. <a name="ratified-commit">**Ratified commit**</a>: A proposed commit that a catalog has
++   determined has won the commit at the desired version of the table.
++    - The catalog must store ratified commits (that is, the staged commit's location or the inline
++      commit's content) until they are published to the `_delta_log` directory.
++    - A ratified commit may or may not yet be published.
++    - A ratified commit may or may not even be stored by the catalog at all - the catalog may
++      have just atomically published it to the filesystem directly, relying on PUT-if-absent
++      primitives to facilitate the ratification and publication all in one step.
++
++5. <a name="published-commit">**Published commit**</a>: A ratified commit that has been copied into
++   the `_delta_log` as a normal Delta file, i.e. `_delta_log/<v>.json`.
++    - Here, the `v` is the version which is being committed, zero-padded to 20 digits.
++    - The existence of a `<v>.json` file proves that the corresponding version `v` is ratified,
++      regardless of whether the table is catalog-managed or filesystem-based. The catalog is allowed
++      to return information about published commits, but Delta clients can also use filesystem
++      listing operations to directly discover them.
++    - Published commits do not need to be stored by the catalog.
++
++## Terminology: Delta Client
++
++This is the component that implements support for reading and writing Delta tables, and implements
++the logic required by the `catalogManaged` table feature. Among other things, it
++- triggers the filesystem listing, if needed, to discover published commits
++- generates the commit content (the set of [actions](#actions))
++- works together with the query engine to trigger the commit process and invoke the client-side
++  catalog component with the commit content
++
++The Delta client is also responsible for defining the client-side API that catalogs should target.
++That is, there must be _some_ API that the [catalog client](#catalog-client) can use to communicate
++to the Delta client the subset of catalog-managed information that the Delta client cares about.
++This protocol feature is concerned with what information Delta cares about, but leaves to Delta
++clients the design of the API they use to obtain that information from catalog clients.
++
++## Terminology: Catalogs
++
++1. **Catalog**: A catalog is an entity which manages a Delta table, including its creation, writes,
++   reads, and eventual deletion.
++    - It could be backed by a database, a filesystem, or any other persistence mechanism.
++    - Each catalog has its own spec around how catalog clients should interact with them, and how
++      they perform a commit.
++
++2. <a name="catalog-client">**Catalog Client**</a>: The catalog always has a client-side component
++   which the Delta client interacts with directly. This client-side component has two primary
++   responsibilities:
++    - implement any client-side catalog-specific logic (such as staging or
++      [publishing](#publishing-commits) commits)
++    - communicate with the Catalog Server, if any
++
++3. **Catalog Server**: The catalog may also involve a server-side component, which the client-side
++   component is responsible for communicating with.
++    - This server is responsible for coordinating commits and potentially persisting table metadata
++      and enforcing authorization policies.
++    - Not all catalogs require a server; some may be entirely client-side, e.g. filesystem-backed
++      catalogs, or they may make use of a generic database server and implement all of the catalog's
++      business logic client-side.
++
++**NOTE**: This specification outlines the responsibilities and actions that catalogs must implement.
++This spec does its best not to assume any specific catalog _implementation_, though it does call out
++likely client-side and server-side responsibilities. Nonetheless, what a given catalog does
++client-side or server-side is up to each catalog implementation to decide for itself.
++
++## Catalog Responsibilities
++
++When the `catalogManaged` table feature is enabled, a catalog performs commits to the table on behalf
++of the Delta client.
++
++As stated above, the Delta spec does not mandate any particular client-server design or API for
++catalogs that manage Delta tables. However, the catalog does need to provide certain capabilities
++for reading and writing Delta tables:
++
++- Atomically commit a version `v` with a given set of `actions`. This is explained in detail in the
++  [commit protocol](#commit-protocol) section.
++- Retrieve information about recent ratified commits and the latest ratified version on the table.
++  This is explained in detail in the [Getting Ratified Commits from the Catalog](#getting-ratified-commits-from-the-catalog) section.
++- Though not required, it is encouraged that catalogs also return the latest table-level metadata,
++  such as the latest Protocol and Metadata actions, for the table. This can provide significant
++  performance advantages to conforming Delta clients, who may forgo log replay and instead trust
++  the information provided by the catalog during query planning.
++
++## Reading Catalog-managed Tables
++
++A catalog-managed table can have a mix of (a) published and (b) ratified but non-published commits.
++The catalog is the source of truth for ratified commits. Also recall that ratified commits can be
++[staged commits](#staged-commit) that are persisted to the `_delta_log/_staged_commits` directory,
++or [inline commits](#inline-commit) whose content the catalog stores directly.
++
++For example, suppose the `_delta_log` directory contains the following files:
++
++```
++00000000000000000000.json
++00000000000000000001.json
++00000000000000000002.checkpoint.parquet
++00000000000000000002.json
++00000000000000000003.00000000000000000005.compacted.json
++00000000000000000003.json
++00000000000000000004.json
++00000000000000000005.json
++00000000000000000006.json
++00000000000000000007.json
++_staged_commits/00000000000000000007.016ae953-37a9-438e-8683-9a9a4a79a395.json // ratified and published
++_staged_commits/00000000000000000008.7d17ac10-5cc3-401b-bd1a-9c82dd2ea032.json // ratified
++_staged_commits/00000000000000000008.b91807ba-fe18-488c-a15e-c4807dbd2174.json // rejected
++_staged_commits/00000000000000000010.0f707846-cd18-4e01-b40e-84ee0ae987b0.json // not yet ratified
++_staged_commits/00000000000000000010.7a980438-cb67-4b89-82d2-86f73239b6d6.json // partial file
++```
++
++Further, suppose the catalog stores the following ratified commits:
++```
++{
++  7  -> "00000000000000000007.016ae953-37a9-438e-8683-9a9a4a79a395.json",
++  8  -> "00000000000000000008.7d17ac10-5cc3-401b-bd1a-9c82dd2ea032.json",
++  9  -> <inline commit: content stored by the catalog directly>
++}
++```
++
++Some things to note are:
++- the catalog isn't aware that commit 7 was already published - perhaps the response from the
++  filesystem was dropped
++- commit 9 is an inline commit
++- neither of the two staged commits for version 10 has been ratified
++
++To read such tables, Delta clients must first contact the catalog to get the ratified commits. This
++informs the Delta client of commits [7, 9] as well as the latest ratified version, 9.
++
++If this information is insufficient to construct a complete snapshot of the table, Delta clients
++must LIST the `_delta_log` directory to get information about the published commits. For commits
++that are both returned by the catalog and already published, Delta clients must treat the catalog's
++version as authoritative and read the commit returned by the catalog. Additionally, Delta clients
++must ignore any files with versions greater than the latest ratified commit version returned by the
++catalog.
++
++Combining these two sets of files and commits enables Delta clients to generate a snapshot at the
++latest version of the table.
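
The reading rules above can be sketched as a small pure function. This is only a sketch under stated assumptions: the type and method names are hypothetical, `firstVersion` is assumed to be the first commit needed after whatever checkpoint the client starts from, and the catalog response and the LIST result are modelled as plain sets of versions.

```scala
// Hypothetical model, for illustration only; not a Delta client API.
object SnapshotCommitResolution {
  sealed trait CommitSource { def version: Long }
  final case class FromCatalog(version: Long) extends CommitSource
  final case class FromListing(version: Long) extends CommitSource

  def resolve(
      firstVersion: Long,
      latestRatified: Long,
      catalogVersions: Set[Long],
      publishedVersions: Set[Long]): Either[String, List[CommitSource]] = {
    // Versions above the catalog's latest ratified version are never considered.
    val wanted = (firstVersion to latestRatified).toList

    // The catalog's copy is authoritative when a commit is both ratified and published.
    def sourceFor(v: Long): CommitSource =
      if (catalogVersions.contains(v)) FromCatalog(v) else FromListing(v)

    val missing =
      wanted.filterNot(v => catalogVersions.contains(v) || publishedVersions.contains(v))
    if (missing.nonEmpty) {
      val list = missing.mkString(", ")
      Left(s"no commit available for versions: $list")
    } else {
      Right(wanted.map(sourceFor))
    }
  }
}
```

With the example above (published versions 0 through 7, catalog commits 7 through 9, latest ratified version 9) and reading from version 3, versions 3 through 6 resolve to the listing and versions 7 through 9 to the catalog. The staged files for version 10 are ignored because 10 is greater than the latest ratified version.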
++
++**NOTE**: This spec prescribes the _minimum_ required interactions between Delta clients and
++catalogs for commits. Catalogs may very well expose APIs and work with Delta clients to be
++informed of other non-commit [file types](#file-types), such as checkpoint, log
++compaction, and version checksum files. This would allow catalogs to return additional
++information to Delta clients during query and scan planning, potentially allowing Delta
++clients to avoid LISTing the filesystem altogether.
++
++## Commit Protocol
++
++To start, Delta Clients send the desired actions to be committed to the client-side component of the
++catalog.
++
++This component then has several options for proposing, ratifying, and publishing the commit,
++detailed below.
++
++- Option 1: Write the actions (likely client-side) to a [staged commit file](#staged-commit) in the
++  `_delta_log/_staged_commits` directory and then ratify the staged commit (likely server-side) by
++  atomically recording (in persistent storage of some kind) that the file corresponds to version `v`.
++- Option 2: Treat this as an [inline commit](#inline-commit) (i.e. likely that the client-side
++  component sends the contents to the server-side component) and atomically record (in persistent
++  storage of some kind) the content of the commit as version `v` of the table.
++- Option 3: Catalog implementations that use PUT-if-absent (client- or server-side) can ratify and
++  publish all-in-one by atomically writing a [published commit file](#published-commit)
++  in the `_delta_log` directory. Note that this commit will be considered to have succeeded as soon
++  as the file becomes visible in the filesystem, regardless of when or whether the catalog is made
++  aware of the successful publish. The catalog does not need to store these files.
++
++A catalog must not ratify version `v` until it has ratified version `v - 1`, and it must ratify
++version `v` at most once.
++
++The catalog must store both flavors of ratified commits (staged or inline) and make them available
++to readers until they are [published](#publishing-commits).
++
++For performance reasons, Delta clients are encouraged to establish an API contract where the catalog
++provides the latest ratified commit information whenever a commit fails due to version conflict.
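
The ratification rules in this section can be illustrated with a toy in-memory catalog. This is a sketch, not a real catalog: all names are hypothetical, `synchronized` stands in for whatever atomic, persistent mechanism a catalog actually uses, and the commit is represented by an opaque string (a staged file name or inline content).

```scala
import scala.collection.mutable

// Hypothetical in-memory stand-in for a catalog's commit coordination.
object RatificationSketch {
  sealed trait RatifyResult
  case object Ratified extends RatifyResult
  // On conflict, report the latest ratified version so the client can rebase and retry.
  final case class Conflict(latestRatified: Long) extends RatifyResult

  final class InMemoryCatalogTable(initialLatest: Long) {
    private var latest: Long = initialLatest
    private val stored = mutable.Map.empty[Long, String]

    // Ratifies `version` only if it directly follows the latest ratified version.
    // This enforces both "v - 1 before v" and "each version at most once".
    def ratify(version: Long, commitRef: String): RatifyResult = synchronized {
      if (version == latest + 1) {
        stored(version) = commitRef
        latest = version
        Ratified
      } else {
        Conflict(latest)
      }
    }

    def latestRatified: Long = synchronized(latest)
  }
}
```

Two writers racing for the same version both call `ratify` with that version; exactly one receives `Ratified`, and the other receives `Conflict` carrying the latest ratified version.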
++
++## Getting Ratified Commits from the Catalog
++
++Even after a commit is ratified, it is not discoverable through filesystem operations until it is
++[published](#publishing-commits).
++
++The catalog client is responsible for implementing an API (defined by the Delta client) that Delta clients can
++use to retrieve the latest ratified commit version (authoritative), as well as the set of ratified
++commits the catalog is still storing for the table. If some commits needed to complete the snapshot
++are not stored by the catalog, as they are already published, Delta clients can issue a filesystem
++LIST operation to retrieve them.
++
++Delta clients must establish an API contract where the catalog provides ratified commit information
++as part of the standard table resolution process performed at query planning time.
++
++## Publishing Commits
++
++Publishing is the process of copying the ratified commit with version `<v>` to
++`_delta_log/<v>.json`. The ratified commit may be a staged commit located in
++`_delta_log/_staged_commits/<v>.<uuid>.json`, or it may be an inline commit whose content the
++catalog stores itself. Because the content of a ratified commit is immutable, it does not matter
++whether the client-side, server-side, or both catalog components initiate publishing.
++
++Implementations are strongly encouraged to publish commits promptly. This reduces the number of
++commits the catalog needs to store internally (and serve up to readers).
++
++Commits must be published _in order_. That is, version `v - 1` must be published _before_ version
++`v`.
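
The in-order requirement can be sketched as follows. The names are hypothetical; the function only decides which versions are eligible to be published next, and leaves the actual copy into `_delta_log` to the implementation.

```scala
// Hypothetical helper, for illustration only.
object PublishOrder {
  // Returns the ratified-but-unpublished versions that may be published now,
  // in the order they must be published. Stops at the first gap.
  def nextToPublish(lastPublished: Long, ratifiedUnpublished: Set[Long]): List[Long] =
    Iterator
      .iterate(lastPublished + 1)(_ + 1)
      .takeWhile(v => ratifiedUnpublished.contains(v))
      .toList
}
```

If version 7 is the last published version and the catalog still stores 7, 8, and 9, the function returns 8 and then 9; if version 8 were absent, nothing after 7 would be eligible.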
++
++**NOTE**: Because commit publishing can happen at any time after the commit succeeds, the file
++modification timestamp of the published file will not accurately reflect the original commit time.
++For this reason, catalog-managed tables must use [in-commit-timestamps](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#in-commit-timestamps)
++to ensure stability of time travel reads. Refer to [Writer Requirements for Catalog-managed Tables](#writer-requirements-for-catalog-managed-tables)
++section for more details.
++
++## Maintenance Operations on Catalog-managed Tables
++
++[Checkpoints](#checkpoints-1) and [Log Compaction Files](#log-compaction-files) can only be created
++for versions that are already published in the `_delta_log`. In other words, in order to checkpoint
++version `v` or produce a log compaction file for commit range `x <= v <= y`, `_delta_log/<v>.json`
++must exist.
++
++Notably, the [Version Checksum File](#version-checksum-file) for version `v` _can_ be created in the
++`_delta_log` even if the commit for version `v` is not published.
++
++By default, maintenance operations are prohibited unless the managing catalog explicitly permits
++the client to run them. The only exceptions are checkpoints, log compaction, and version checksum,
++as they are essential for all basic table operations (e.g. reads and writes) to operate reliably.
++All other maintenance operations, such as the following, are not allowed by default:
++- [Log and other metadata files clean up](#metadata-cleanup).
++- Data files cleanup, for example VACUUM.
++- Data layout changes, for example OPTIMIZE and REORG.
++
++## Creating and Dropping Catalog-managed Tables
++
++The catalog and query engine ultimately dictate how to create and drop catalog-managed tables.
++
++As one example, table creation often works in three phases:
++
++1. An initial catalog operation to obtain a unique storage location which serves as an unnamed
++   "staging" table
++2. A table operation that physically initializes a new `catalogManaged`-enabled table at the staging
++   location.
++3. A final catalog operation that registers the new table with its intended name.
++
++Delta clients would primarily be involved with the second step, but an implementation could choose
++to combine the second and third steps so that a single catalog call registers the table as part of
++the table's first commit.
++
++As another example, dropping a table can be as simple as removing its name from the catalog (a "soft
++delete"), followed at some later point by a "hard delete" that physically purges the data. The Delta
++client would not be involved at all in this process, because no commits are made to the table.
++
++## Catalog-managed Table Enablement
++
++The `catalogManaged` table feature is supported and active when:
++- The table is on Reader Version 3 and Writer Version 7.
++- The table has a `protocol` action with `readerFeatures` and `writerFeatures` both containing the
++  feature `catalogManaged`.
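
A minimal sketch of this check, assuming a simplified `Protocol` model in which both feature lists are always present as sets (the object and method names are hypothetical):

```scala
// Hypothetical, simplified model of the `protocol` action; for illustration only.
object CatalogManagedEnablement {
  final case class Protocol(
      minReaderVersion: Int,
      minWriterVersion: Int,
      readerFeatures: Set[String],
      writerFeatures: Set[String])

  val FeatureName = "catalogManaged"

  // Supported and active: Reader Version 3, Writer Version 7, and the feature
  // listed in both readerFeatures and writerFeatures.
  def isCatalogManaged(p: Protocol): Boolean =
    p.minReaderVersion == 3 &&
      p.minWriterVersion == 7 &&
      p.readerFeatures.contains(FeatureName) &&
      p.writerFeatures.contains(FeatureName)
}
```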
++
++## Writer Requirements for Catalog-managed tables
++
++When supported and active:
++
++- Writers must discover and access the table using catalog calls, which happens _before_ the table's
++  protocol is known. See [Table Discovery](#table-discovery) for more details.
++- The [in-commit-timestamps](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#in-commit-timestamps)
++  table feature must be supported and active.
++- The `commitInfo` action must also contain a field `txnId` that stores a unique transaction
++  identifier string.
++- Writers must follow the catalog's [commit protocol](#commit-protocol) and must not perform
++  ordinary filesystem-based commits against the table.
++- Writers must follow the catalog's [maintenance operation protocol](#maintenance-operations-on-catalog-managed-tables).
++
++## Reader Requirements for Catalog-managed tables
++
++When supported and active:
++
++- Readers must discover the table using catalog calls, which happens before the table's protocol
++  is known. See [Table Discovery](#table-discovery) for more details.
++- Readers must contact the catalog for information about unpublished ratified commits.
++- Readers must follow the rules described in the [Reading Catalog-managed Tables](#reading-catalog-managed-tables)
++  section above. Notably:
++  - If the catalog says `v` is the latest version, clients must ignore any later versions that may
++    have been published.
++  - When the catalog returns a ratified commit for version `v`, readers must use that
++    catalog-supplied commit and ignore any published Delta file for version `v` that might also be
++    present.
++
++## Table Discovery
++
++The requirements above state that readers and writers must discover and access the table using
++catalog calls, which occurs _before_ the table's protocol is known. This raises an important
++question: how can a client discover a `catalogManaged` Delta table without first knowing that it
++_is_, in fact, `catalogManaged` (according to the protocol)?
++
++To solve this, first note that, in practice, catalog-integrated engines already ask the catalog to
++resolve a table name to its storage location during the name resolution step. This protocol
++therefore encourages that the same name resolution step also indicate whether the table is
++catalog-managed. Surfacing this at the very moment the catalog returns the path imposes no extra
++round-trips, yet it lets the client decide — early and unambiguously — whether to follow the
++`catalogManaged` read and write rules.
++
++## Sample Catalog Client API
++
++The following is an example of a possible API which a Java-based Delta client might require catalog
++implementations to target:
++
++```scala
++
++trait CatalogManagedTable {
++    /**
++     * Commits the given set of `actions` to the given commit `version`.
++     *
++     * @param version The version we want to commit.
++     * @param actions Actions that need to be committed.
++     *
++     * @return CommitResponse which has details around the new committed delta file.
++     */
++    def commit(
++        version: Long,
++        actions: Iterator[String]): CommitResponse
++
++    /**
++     * Retrieves a (possibly empty) suffix of ratified commits in the range [startVersion,
++     * endVersion] for this table.
++     * 
++     * Some of these ratified commits may already have been published. Some of them may be staged,
++     * in which case the staged commit file path is returned; others may be inline, in which case
++     * the inline commit content is returned.
++     * 
++     * The returned commits are sorted in ascending version number and are contiguous.
++     *
++     * If neither start nor end version is specified, the catalog will return all available ratified
++     * commits (possibly empty, if all commits have been published).
++     *
++     * In all cases, the response also includes the table's latest ratified commit version.
++     *
++     * @return GetCommitsResponse which contains an ordered list of ratified commits
++     *         stored by the catalog, as well as the table's latest ratified commit version.
++     */
++    def getRatifiedCommits(
++        startVersion: Option[Long],
++        endVersion: Option[Long]): GetCommitsResponse
++}
++```
++
++Note that the above is only one example of a possible Catalog Client API. It is also _NOT_ a catalog
++API (no table discovery, ACL, create/drop, etc). The Delta protocol is agnostic to API details, and
++the API surface Delta clients define should only cover the specific catalog capabilities that the
++Delta client needs to correctly read and write catalog-managed tables.
++
+ # Iceberg Compatibility V1
+ 
+ This table feature (`icebergCompatV1`) ensures that Delta tables can be converted to Apache Iceberg™ format, though this table feature does not implement or specify that conversion.
+  * Files that have been [added](#Add-File-and-Remove-File) and not yet removed
+  * Files that were recently [removed](#Add-File-and-Remove-File) and have not yet expired
+  * [Transaction identifiers](#Transaction-Identifiers)
+- * [Domain Metadata](#Domain-Metadata)
++ * [Domain Metadata](#Domain-Metadata) that have not been removed (i.e. excluding tombstones with `removed=true`)
+  * [Checkpoint Metadata](#checkpoint-metadata) - Requires [V2 checkpoints](#v2-spec)
+  * [Sidecar File](#sidecar-files) - Requires [V2 checkpoints](#v2-spec)
+ 
+ 1. Identify a threshold (in days) up to which we want to preserve the deltaLog. Let's refer to
+ midnight UTC of that day as `cutOffTimestamp`. The newest commit not newer than the `cutOffTimestamp` is
+ the `cutOffCommit`, because a commit exactly at midnight is an acceptable cutoff. We want to retain everything including and after the `cutOffCommit`.
+-2. Identify the newest checkpoint that is not newer than the `cutOffCommit`. A checkpoint at the `cutOffCommit` is ideal, but an older one will do. Lets call it `cutOffCheckpoint`.
+-We need to preserve the `cutOffCheckpoint` (both the checkpoint file and the JSON commit file at that version) and all commits after it. The JSON commit file at the `cutOffCheckpoint` version must be preserved because checkpoints do not preserve [commit provenance information](#commit-provenance-information) (e.g., `commitInfo` actions), which may be required by table features such as [In-Commit Timestamps](#in-commit-timestamps). All commits after `cutOffCheckpoint` must be preserved to enable time travel for commits between `cutOffCheckpoint` and the next available checkpoint.
+-3. Delete all [delta log entries](#delta-log-entries) and [checkpoint files](#checkpoints) before the
+-`cutOffCheckpoint` checkpoint. Also delete all the [log compaction files](#log-compaction-files) having startVersion <= `cutOffCheckpoint`'s version.
++2. Identify the newest checkpoint that is not newer than the `cutOffCommit`. A checkpoint at the `cutOffCommit` is ideal, but an older one will do. Let's call it `cutOffCheckpoint`.
++We need to preserve the `cutOffCheckpoint` (both the checkpoint file and the JSON commit file at that version) and all published commits after it. The JSON commit file at the `cutOffCheckpoint` version must be preserved because checkpoints do not preserve [commit provenance information](#commit-provenance-information) (e.g., `commitInfo` actions), which may be required by table features such as [In-Commit Timestamps](#in-commit-timestamps). All published commits after `cutOffCheckpoint` must be preserved to enable time travel for commits between `cutOffCheckpoint` and the next available checkpoint.
++    - If no `cutOffCheckpoint` can be found, do not proceed with metadata cleanup as there is
++      nothing to clean up.
++3. Delete all [delta log entries](#delta-log-entries), [checkpoint files](#checkpoints), and
++   [version checksum files](#version-checksum-file) before the `cutOffCheckpoint` checkpoint. Also delete all the [log compaction files](#log-compaction-files)
++   having startVersion <= `cutOffCheckpoint`'s version.
++    - Also delete all the [staged commit files](#staged-commit) having version <=
++      `cutOffCheckpoint`'s version from the `_delta_log/_staged_commits` directory.
+ 4. Now read all the available [checkpoints](#checkpoints-1) in the _delta_log directory and identify
+ the corresponding [sidecar files](#sidecar-files). These sidecar files need to be protected.
+ 5. List all the files in `_delta_log/_sidecars` directory, preserve files that are less than a day
+ [Timestamp without Timezone](#timestamp-without-timezone-timestampNtz) | `timestampNtz` | Readers and writers
+ [Domain Metadata](#domain-metadata) | `domainMetadata` | Writers only
+ [V2 Checkpoint](#v2-checkpoint-table-feature) | `v2Checkpoint` | Readers and writers
++[Catalog-managed Tables](#catalog-managed-tables) | `catalogManaged` | Readers and writers
+ [Iceberg Compatibility V1](#iceberg-compatibility-v1) | `icebergCompatV1` | Writers only
+ [Iceberg Compatibility V2](#iceberg-compatibility-v2) | `icebergCompatV2` | Writers only
+ [Clustered Table](#clustered-table) | `clustering` | Writers only
\ No newline at end of file

README.md

@@ -0,0 +1,10 @@
+diff --git a/README.md b/README.md
+--- a/README.md
++++ b/README.md
+ ## Building
+ 
+ Delta Lake is compiled using [SBT](https://www.scala-sbt.org/1.x/docs/Command-Line-Reference.html).
++Ensure that your Java version is at least 17 (you can verify with `java -version`).
+ 
+ To compile, run
+ 
\ No newline at end of file

build.sbt

@@ -0,0 +1,218 @@
+diff --git a/build.sbt b/build.sbt
+--- a/build.sbt
++++ b/build.sbt
+       allMappings.distinct
+     },
+ 
+-    // Exclude internal modules from published POM
++    // Exclude internal modules from published POM and add kernel dependencies.
++    // Kernel modules are transitive through sparkV2 (an internal module), so they
++    // are lost when sparkV2 is filtered out. We re-add them explicitly here.
+     pomPostProcess := { node =>
+       val internalModules = internalModuleNames.value
++      val ver = version.value
+       import scala.xml._
+       import scala.xml.transform._
++
++      def kernelDependencyNode(artifactId: String): Elem = {
++        <dependency>
++          <groupId>io.delta</groupId>
++          <artifactId>{artifactId}</artifactId>
++          <version>{ver}</version>
++        </dependency>
++      }
++
++      val kernelDeps = Seq(
++        kernelDependencyNode("delta-kernel-api"),
++        kernelDependencyNode("delta-kernel-defaults"),
++        kernelDependencyNode("delta-kernel-unitycatalog")
++      )
++
+       new RuleTransformer(new RewriteRule {
+         override def transform(n: Node): Seq[Node] = n match {
+-          case e: Elem if e.label == "dependency" =>
+-            val artifactId = (e \ "artifactId").text
+-            // Check if artifactId starts with any internal module name
+-            // (e.g., "delta-spark-v1_4.1_2.13" starts with "delta-spark-v1")
+-            val isInternal = internalModules.exists(module => artifactId.startsWith(module))
+-            if (isInternal) Seq.empty else Seq(n)
++          case e: Elem if e.label == "dependencies" =>
++            val filtered = e.child.filter {
++              case child: Elem if child.label == "dependency" =>
++                val artifactId = (child \ "artifactId").text
++                !internalModules.exists(module => artifactId.startsWith(module))
++              case _ => true
++            }
++            Seq(e.copy(child = filtered ++ kernelDeps))
+           case _ => Seq(n)
+         }
+       }).transform(node).head
+     commonSettings,
+     scalaStyleSettings,
+     releaseSettings,
+-    CrossSparkVersions.sparkDependentModuleName(sparkVersion),
++    // Set sparkVersion directly (not sparkDependentModuleName) so that
++    // runOnlyForReleasableSparkModules discovers this module, but without adding a Spark
++    // suffix to the artifact name. delta-contribs is only published as delta-contribs_2.13.
++    sparkVersion := CrossSparkVersions.getSparkVersion(),
+     Compile / packageBin / mappings := (Compile / packageBin / mappings).value ++
+       listPythonFiles(baseDirectory.value.getParentFile / "python"),
+ 
+   ).configureUnidoc()
+ 
+ 
+-val unityCatalogVersion = "0.3.1"
++val unityCatalogVersion = "0.4.0"
+ val sparkUnityCatalogJacksonVersion = "2.15.4" // We are using Spark 4.0's Jackson version 2.15.x, to override Unity Catalog 0.3.0's version 2.18.x
+ 
+ lazy val sparkUnityCatalog = (project in file("spark/unitycatalog"))
+     libraryDependencies ++= Seq(
+       "org.apache.spark" %% "spark-sql" % sparkVersion.value % "provided",
+ 
+-      "io.delta" %% "delta-sharing-client" % "1.3.9",
++      "io.delta" %% "delta-sharing-client" % "1.3.10",
+ 
+       // Test deps
+       "org.scalatest" %% "scalatest" % scalaTestVersion % "test",
+ 
+       // Test Deps
+       "org.scalatest" %% "scalatest" % scalaTestVersion % "test",
++      // Jackson datatype module needed for UC SDK tests (excluded from main compile scope)
++      "com.fasterxml.jackson.datatype" % "jackson-datatype-jsr310" % "2.15.4" % "test",
+     ),
+ 
+     // Unidoc settings
+     commonSettings,
+     scalaStyleSettings,
+     releaseSettings,
+-    CrossSparkVersions.sparkDependentModuleName(sparkVersion),
++    // Set sparkVersion directly (not sparkDependentModuleName) so that
++    // runOnlyForReleasableSparkModules discovers this module, but without adding a Spark
++    // suffix to the artifact name. delta-iceberg is only published as delta-iceberg_2.13.
++    sparkVersion := CrossSparkVersions.getSparkVersion(),
+     libraryDependencies ++= {
+       if (supportIceberg) {
+         Seq(
+           "org.xerial" % "sqlite-jdbc" % "3.45.0.0" % "test",
+           "org.apache.httpcomponents.core5" % "httpcore5" % "5.2.4" % "test",
+           "org.apache.httpcomponents.client5" % "httpclient5" % "5.3.1" % "test",
+-          "org.apache.iceberg" %% icebergSparkRuntimeArtifactName % "1.10.0" % "provided"
++          "org.apache.iceberg" %% icebergSparkRuntimeArtifactName % "1.10.0" % "provided",
++          // For FixedGcsAccessTokenProvider (GCS server-side planning credentials)
++          "com.google.cloud.bigdataoss" % "util-hadoop" % "hadoop3-2.2.26" % "provided"
+         )
+       } else {
+         Seq.empty
+   )
+ // scalastyle:on println
+ 
+-val icebergShadedVersion = "1.10.0"
++val icebergShadedVersion = "1.10.1"
+ lazy val icebergShaded = (project in file("icebergShaded"))
+   .dependsOn(spark % "provided")
+   .disablePlugins(JavaFormatterPlugin, ScalafmtPlugin)
+     commonSettings,
+     scalaStyleSettings,
+     releaseSettings,
+-    CrossSparkVersions.sparkDependentSettings(sparkVersion),
+-    libraryDependencies ++= Seq(
+-      "org.apache.hudi" % "hudi-java-client" % "0.15.0" % "compile" excludeAll(
+-        ExclusionRule(organization = "org.apache.hadoop"),
+-        ExclusionRule(organization = "org.apache.zookeeper"),
+-      ),
+-      "org.apache.spark" %% "spark-avro" % sparkVersion.value % "test" excludeAll ExclusionRule(organization = "org.apache.hadoop"),
+-      "org.apache.parquet" % "parquet-avro" % "1.12.3" % "compile"
+-    ),
++    // Set sparkVersion directly (not sparkDependentModuleName) so that
++    // runOnlyForReleasableSparkModules discovers this module, but without adding a Spark
++    // suffix to the artifact name. delta-hudi is only published as delta-hudi_2.13.
++    sparkVersion := CrossSparkVersions.getSparkVersion(),
++    libraryDependencies ++= {
++      if (supportHudi) {
++        Seq(
++          "org.apache.hudi" % "hudi-java-client" % "0.15.0" % "compile" excludeAll(
++            ExclusionRule(organization = "org.apache.hadoop"),
++            ExclusionRule(organization = "org.apache.zookeeper"),
++          ),
++          "org.apache.spark" %% "spark-avro" % sparkVersion.value % "test" excludeAll ExclusionRule(organization = "org.apache.hadoop"),
++          "org.apache.parquet" % "parquet-avro" % "1.12.3" % "compile"
++        )
++      } else {
++        Seq.empty
++      }
++    },
++    // Skip compilation and publishing when supportHudi is false
++    Compile / skip := !supportHudi,
++    Test / skip := !supportHudi,
++    publish / skip := !supportHudi,
++    publishLocal / skip := !supportHudi,
++    publishM2 / skip := !supportHudi,
+     assembly / assemblyJarName := s"${name.value}-assembly_${scalaBinaryVersion.value}-${version.value}.jar",
+     assembly / logLevel := Level.Info,
+     assembly / test := {},
+       // crossScalaVersions must be set to Nil on the aggregating project
+       crossScalaVersions := Nil,
+       publishArtifact := false,
+-      publish / skip := false,
++      publish / skip := true,
+     )
+ }
+ 
+       // crossScalaVersions must be set to Nil on the aggregating project
+       crossScalaVersions := Nil,
+       publishArtifact := false,
+-      publish / skip := false,
++      publish / skip := true,
+     )
+ }
+ 
+     // crossScalaVersions must be set to Nil on the aggregating project
+     crossScalaVersions := Nil,
+     publishArtifact := false,
+-    publish / skip := false,
++    publish / skip := true,
+     unidocSourceFilePatterns := {
+       (kernelApi / unidocSourceFilePatterns).value.scopeToProject(kernelApi) ++
+       (kernelDefaults / unidocSourceFilePatterns).value.scopeToProject(kernelDefaults)
+     // crossScalaVersions must be set to Nil on the aggregating project
+     crossScalaVersions := Nil,
+     publishArtifact := false,
+-    publish / skip := false,
++    publish / skip := true,
+   )
+ 
+ /*
+     sys.env.getOrElse("SONATYPE_USERNAME", ""),
+     sys.env.getOrElse("SONATYPE_PASSWORD", "")
+   ),
++  credentials += Credentials(
++    "Sonatype Nexus Repository Manager",
++    "central.sonatype.com",
++    sys.env.getOrElse("SONATYPE_USERNAME", ""),
++    sys.env.getOrElse("SONATYPE_PASSWORD", "")
++  ),
+   publishTo := {
+     val ossrhBase = "https://ossrh-staging-api.central.sonatype.com/"
++    val centralSnapshots = "https://central.sonatype.com/repository/maven-snapshots/"
+     if (isSnapshot.value) {
+-      Some("snapshots" at ossrhBase + "content/repositories/snapshots")
++      Some("snapshots" at centralSnapshots)
+     } else {
+       Some("releases"  at ossrhBase + "service/local/staging/deploy/maven2")
+     }
+ // Looks like some of release settings should be set for the root project as well.
+ publishArtifact := false  // Don't release the root project
+ publish / skip := true
+-publishTo := Some("snapshots" at "https://ossrh-staging-api.central.sonatype.com/content/repositories/snapshots")
++publishTo := Some("snapshots" at "https://central.sonatype.com/repository/maven-snapshots/")
+ releaseCrossBuild := false  // Don't use sbt-release's cross facility
+ releaseProcess := Seq[ReleaseStep](
+   checkSnapshotDependencies,
+   setReleaseVersion,
+   commitReleaseVersion,
+   tagRelease
+-) ++ CrossSparkVersions.crossSparkReleaseSteps("+publishSigned") ++ Seq[ReleaseStep](
++) ++ CrossSparkVersions.crossSparkReleaseSteps("publishSigned") ++ Seq[ReleaseStep](
+ 
+   // Do NOT use `sonatypeBundleRelease` - it will actually release to Maven! We want to do that
+   // manually.
\ No newline at end of file

connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/.00000000000000000000.json.crc

@@ -0,0 +1,3 @@
+diff --git a/connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/.00000000000000000000.json.crc b/connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/.00000000000000000000.json.crc
+new file mode 100644
+Binary files /dev/null and b/connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/.00000000000000000000.json.crc differ
\ No newline at end of file

connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/00000000000000000000.crc

@@ -0,0 +1,5 @@
+diff --git a/connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/00000000000000000000.crc b/connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/00000000000000000000.crc
+new file mode 100644
+--- /dev/null
++++ b/connectors/golden-tables/src/main/resources/golden/collations-preview-table/_delta_log/00000000000000000000.crc
++{"txnId":"6132e880-0f3a-4db4-b882-1da039bffbad","tableSizeBytes":0,"numFiles":0,"numMetadata":1,"numProtocol":1,"setTransactions":[],"domainMetadata":[],"metadata":{"id":"0eb3e007-b3cc-40e4-bca1-a5970d86b5a6","format":{"provider":"parquet","options":{}},"schemaString":"{\"type\":\"struct\",\"fields\":[{\"name\":\"id\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}},{\"name\":\"utf8_binary_col\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"utf8_lcase_col\",\"type\":\"string\",\"nullable\":true,\"metadata\":{\"__COLLATIONS\":{\"utf8_lcase_col\":\"spark.UTF8_LCASE\"}}},{\"name\":\"unicode_col\",\"type\":\"string\",\"nullable\":true,\"metadata\":{\"__COLLATIONS\":{\"unicode_col\":\"icu.UNICODE\"}}}]}","partitionColumns":[],"configuration":{},"createdTime":1773779518731},"protocol":{"minReaderVersion":1,"minWriterVersion":7,"writerFeatures":["domainMetadata","collations-preview","appendOnly","invariants"]},"histogramOpt":{"sortedBinBoundaries":[0,8192,16384,32768,65536,131072,262144,524288,1048576,2097152,4194304,8388608,12582912,16777216,20971520,25165824,29360128,33554432,37748736,41943040,50331648,58720256,67108864,75497472,83886080,92274688,100663296,109051904,117440512,125829120,130023424,134217728,138412032,142606336,146800640,150994944,167772160,184549376,201326592,218103808,234881024,251658240,268435456,285212672,301989888,318767104,335544320,352321536,369098752,385875968,402653184,419430400,436207616,452984832,469762048,486539264,503316480,520093696,536870912,553648128,570425344,587202560,603979776,671088640,738197504,805306368,872415232,939524096,1006632960,1073741824,1140850688,1207959552,1275068416,1342177280,1409286144,1476395008,1610612736,1744830464,1879048192,2013265920,2147483648,2415919104,2684354560,2952790016,3221225472,3489660928,3758096384,4026531840,4294967296,8589934592,17179869184,34359738368,68719476736,137438953472,274877906944],"fileCounts":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],"totalBytes":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]},"allFiles":[]}
\ No newline at end of file

... (truncated, output exceeded 60000 bytes)

Reproduce locally: `git range-diff e8cffee..da38d13 d1139d2..ab609f8` | Disable: `git config gitstack.push-range-diff false`

…hot) (delta-io#6075)

## 🥞 Stacked PR

Use this [link](https://github.com/delta-io/delta/pull/6075/files) to review incremental changes.

- [**stack/cdf1**](delta-io#6075) [[Files changed](https://github.com/delta-io/delta/pull/6075/files)]
- [stack/cdf2](delta-io#6076) [[Files changed](https://github.com/delta-io/delta/pull/6076/files/ab609f8e2185d3ba863c00304195c24b60f7d04b..301cf056c94305a2ca5d96dc3dcdfd88d5dbc37b)]
- [stack/cdf2.5](delta-io#6391) [[Files changed](https://github.com/delta-io/delta/pull/6391/files/301cf056c94305a2ca5d96dc3dcdfd88d5dbc37b..d2ce6b47593a08ed491742e024056b8f565dd33f)]
- [stack/cdf3](delta-io#6336) [[Files changed](https://github.com/delta-io/delta/pull/6336/files/70544ed3a42014a51dd0d8d97e7c4f2333e6a221..fb5a4f6e478f68410739ce68e9a3fd92b2091be1)]
- [stack/cdf4](delta-io#6359) [[Files changed](https://github.com/delta-io/delta/pull/6359/files/fb5a4f6e478f68410739ce68e9a3fd92b2091be1..e52ab30e7fe3923ba302e489368231b836ff4314)]
- [stack/cdf5](delta-io#6362) [[Files changed](https://github.com/delta-io/delta/pull/6362/files/e52ab30e7fe3923ba302e489368231b836ff4314..45bc4f043f996ad56d6937f6f0b5d1876fe1130a)]
- [stack/cdf6](delta-io#6363) [[Files changed](https://github.com/delta-io/delta/pull/6363/files/45bc4f043f996ad56d6937f6f0b5d1876fe1130a..7d8e01ec6211eba6712b4c54e08eb68f45e542b6)]
- [stack/cdf-outofrange](delta-io#6388) [[Files changed](https://github.com/delta-io/delta/pull/6388/files/216a484e5f6e7b20fa26d428703ca05dcbbb6b5a..fee69f12b67cc91d1aeed01a759f64940198405e)]
- [stack/cdf7](delta-io#6370) [[Files changed](https://github.com/delta-io/delta/pull/6370/files/fee69f12b67cc91d1aeed01a759f64940198405e..7f5fc84dee6dc3386a7b18db5a0d1d7533024842)]

---------

#### Which Delta project/connector is this regarding?

- [x] Spark
- [ ] Standalone
- [ ] Flink
- [x] Kernel
- [ ] Other (fill in here)

## Description

Adds initial snapshot write-time-CDC support to the DSv2 streaming read path (SparkMicroBatchStream), bringing it closer to DSv1 feature parity.

## How was this patch tested?

## Does this PR introduce _any_ user-facing changes?

zikangh mentioned this pull request Feb 19, 2026

[kernel-spark][Part 2] CDC commit processing (convert delta log actions to IndexedFiles) #6076

Merged

5 tasks

zikangh force-pushed the stack/cdf1 branch 4 times, most recently from 626af46 to 2f3d3db Compare February 24, 2026 21:27

zikangh force-pushed the stack/cdf1 branch from 2f3d3db to ac7ed07 Compare March 12, 2026 17:37

zikangh requested review from TimothyW553, huan233usc, murali-db, raveeram-db and tdas as code owners March 12, 2026 17:37

zikangh force-pushed the stack/cdf1 branch 5 times, most recently from e89755b to ae0540c Compare March 12, 2026 20:55

zikangh self-assigned this Mar 12, 2026

murali-db reviewed Mar 12, 2026


Comment thread spark/v2/src/main/java/io/delta/spark/internal/v2/read/SparkMicroBatchStream.java Outdated

Comment thread spark/v2/src/main/java/io/delta/spark/internal/v2/read/CDCDataFile.java

Comment thread spark/v2/src/main/java/io/delta/spark/internal/v2/read/SparkMicroBatchStream.java Outdated

zikangh mentioned this pull request Mar 20, 2026

[kernel-spark][Part 3] Finish wiring up CDC streaming offset management #6336

Merged

5 tasks

zikangh force-pushed the stack/cdf1 branch 2 times, most recently from 79f2695 to 7a4bdb7 Compare March 21, 2026 00:44

zikangh changed the title ~~[kernel-spark] Add initial snapshot CDC support to SparkMicroBatchStream~~ [kernel-spark] CDC streaming offset management (1/3: initial snapshot) Mar 21, 2026

zikangh changed the title ~~[kernel-spark] CDC streaming offset management (1/3: initial snapshot)~~ [kernel-spark][Part 1] CDC streaming offset management (initial snapshot) Mar 21, 2026

zikangh requested a review from murali-db March 23, 2026 17:03

zikangh force-pushed the stack/cdf1 branch from 7a4bdb7 to f9ee471 Compare March 23, 2026 18:42

zikangh mentioned this pull request Mar 23, 2026

[kernel-spark][Part 4] CDC data reading: ReadFunc decorator, schema coordination, and reader factory wiring #6359

Merged

5 tasks

zikangh requested a review from johanl-db March 23, 2026 22:03

huan233usc reviewed Apr 1, 2026


zikangh force-pushed the stack/cdf1 branch from a13ae25 to 9f42cb1 Compare April 1, 2026 20:57

zikangh force-pushed the stack/cdf1 branch from 9f42cb1 to 320b3ac Compare April 1, 2026 21:06

zikangh requested a review from huan233usc April 1, 2026 21:10

johanl-db approved these changes Apr 2, 2026


huan233usc approved these changes Apr 2, 2026


zikangh force-pushed the stack/cdf1 branch from 320b3ac to 3398dde Compare April 2, 2026 17:22

zikangh force-pushed the stack/cdf1 branch from 3398dde to 41bc0a2 Compare April 2, 2026 20:24

zikangh force-pushed the stack/cdf1 branch from 41bc0a2 to 2123524 Compare April 2, 2026 21:52

zikangh force-pushed the stack/cdf1 branch from 2123524 to 18d8afb Compare April 2, 2026 21:55

zikangh force-pushed the stack/cdf1 branch from 18d8afb to 07859c6 Compare April 2, 2026 21:55

zikangh force-pushed the stack/cdf1 branch from 07859c6 to 7a4827d Compare April 3, 2026 17:27

zikangh force-pushed the stack/cdf1 branch from 7a4827d to 6b26f5b Compare April 6, 2026 17:30

zikangh force-pushed the stack/cdf1 branch from 6b26f5b to da38d13 Compare April 6, 2026 20:33

[kernel-spark] Add initial snapshot CDC support to SparkMicroBatchStream

ab609f8

zikangh force-pushed the stack/cdf1 branch from da38d13 to ab609f8 Compare April 6, 2026 21:38

huan233usc merged commit 8999f83 into delta-io:master Apr 6, 2026
29 checks passed

Conversation

zikangh commented Feb 19, 2026 • edited

murali-db left a comment


zikangh commented Mar 23, 2026

zikangh commented Mar 23, 2026

zikangh commented Mar 23, 2026

huan233usc Apr 1, 2026

zikangh Apr 1, 2026

huan233usc left a comment

zikangh commented Apr 1, 2026

zikangh commented Apr 1, 2026

zikangh commented Apr 2, 2026

zikangh commented Apr 2, 2026

zikangh commented Apr 2, 2026

zikangh commented Apr 2, 2026

zikangh commented Apr 2, 2026

zikangh commented Apr 3, 2026

zikangh commented Apr 6, 2026

zikangh commented Apr 6, 2026

zikangh commented Apr 6, 2026


4 participants
