[AURON #2375] Fix Iceberg changelog scan field-id projection by lyne7-sc · Pull Request #2376 · apache/auron

lyne7-sc · 2026-07-02T13:32:54Z

Which issue does this PR close?

Rationale for this change

The regular native Iceberg scan path already passes Iceberg field IDs to the native reader, which makes top-level schema evolution such as column rename and drop-then-add safe for Parquet files.

The newer insert-only Iceberg changelog scan path also reads the underlying Parquet data files through the native reader, but it does not pass the same field-id mapping into the native scan plan yet. As a result, native Parquet schema matching falls back to column names on the changelog path.

This can return wrong results after Iceberg schema evolution. For example, after RENAME COLUMN, the renamed column may read as null from files written before the rename; after DROP COLUMN followed by ADD COLUMN with the same name, the newly added column may read data that belongs to the old dropped column.
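
To make the two failure modes concrete, here is a minimal, self-contained sketch of name-based versus field-id-based column resolution. It is a toy model rather than Auron's or Iceberg's actual code: `FileColumn`, `resolveByName`, `resolveById`, and the sample schemas are all hypothetical names introduced for illustration.

```scala
object FieldIdMatchingSketch {
  // Toy stand-in for a Parquet column: it remembers the Iceberg field id
  // that the column had when the file was written.
  final case class FileColumn(name: String, fieldId: Int, values: Seq[Option[Int]])

  // Name-based matching: what a reader falls back to when no field ids are supplied.
  def resolveByName(file: Seq[FileColumn], wanted: String): Option[FileColumn] =
    file.find(_.name == wanted)

  // Field-id matching: translate the requested name to its current field id,
  // then look for that id in the file, ignoring the name stored in the file.
  def resolveById(
      file: Seq[FileColumn],
      wanted: String,
      currentIds: Map[String, Int]): Option[FileColumn] =
    currentIds.get(wanted).flatMap(id => file.find(_.fieldId == id))

  // A data file written while the table schema was (id: 1, value: 2).
  val oldFile: Seq[FileColumn] = Seq(
    FileColumn("id", 1, Seq(Some(1))),
    FileColumn("value", 2, Seq(Some(10))))

  // Current ids after RENAME COLUMN value TO amount: the id stays 2.
  val idsAfterRename: Map[String, Int] = Map("id" -> 1, "amount" -> 2)

  // Current ids after DROP COLUMN value and ADD COLUMN value: the new column gets id 3.
  val idsAfterDropAdd: Map[String, Int] = Map("id" -> 1, "value" -> 3)
}
```

In this model, `resolveByName(oldFile, "amount")` finds nothing (the rename case reads as null), while `resolveByName(oldFile, "value")` after drop-then-add finds the column with the stale field id 2. The id-based lookup returns the old data for the rename and nothing for the re-added column, which is the behavior the description above calls correct.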

What changes are included in this PR?

Extract field IDs from SparkChangelogScan's expected Iceberg schema.
Reuse the existing Iceberg rename/drop detection for changelog scans.
Pass changelog field IDs into IcebergScanPlan instead of Map.empty.
Keep nested rename/drop unsupported and make ORC changelog scans fall back after top-level rename/drop, consistent with the regular Iceberg scan path.
Add changelog scan integration tests for:
- renamed columns resolved by field-id;
- drop-then-add columns with the same name not reusing the dropped field-id.

Are there any user-facing changes?

Yes. Insert-only Iceberg changelog scans on renamed or drop-then-added Parquet columns now return correct results under the native scan. Unsupported cases continue to fall back to Spark. No API change.

How was this patch tested?

Added cases to AuronIcebergIntegrationSuite.

weiqingy

Thanks for taking this on — closing the field-id gap on the changelog path is a real correctness fix, and the two new tests are honest regression tests (each fails on the pre-fix Map.empty behavior and passes after), mirroring the existing file-scan coverage. A few questions inline.

weiqingy · 2026-07-03T23:40:27Z

    }
    val (fileSchema, partitionSchema) = schemas.get

+    val fieldIdsByName =


This block — field-id extraction, rename/drop detection, the two asserts, and the supportedFormat line at :284 — now repeats about 25 lines from planFileScan (:121-144, :187-191), differing only in which util method each calls. Worth noting this bug itself came from the two paths drifting apart: the changelog path shipped with Map.empty while the file-scan path already passed field-ids. Would it be worth factoring the shared portion into one helper parameterized by the two scan → … lookups, so a future field-id change can't miss one path again? Genuinely open — if you'd rather keep the two paths explicit for readability, that's a reasonable call too.
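
One possible shape for such a helper, as a rough sketch only: the two scan-specific lookups are passed in as functions, so both paths share the same planning step. Every name here (`FieldIdPlan`, `planFieldIds`, `FakeFileScan`, `FakeChangelogScan`) is hypothetical and stands in for the real types in IcebergScanSupport.scala, which carry considerably more state.

```scala
object SharedFieldIdPlanningSketch {
  // Hypothetical result of the shared portion of planning.
  final case class FieldIdPlan(
      fieldIdsByName: Map[String, Int],
      hasTopLevelRenameOrDrop: Boolean)

  // One helper parameterized by the two "scan => ..." lookups, so a future
  // field-id change is made once and applies to both scan paths.
  def planFieldIds[S](scan: S)(
      expectedFieldIds: S => Map[String, Int],
      detectRenameOrDrop: S => Boolean): FieldIdPlan =
    FieldIdPlan(expectedFieldIds(scan), detectRenameOrDrop(scan))

  // Stand-ins for the two scan kinds; the real ones are Iceberg Spark classes.
  final case class FakeFileScan(ids: Map[String, Int], evolved: Boolean)
  final case class FakeChangelogScan(ids: Map[String, Int], evolved: Boolean)

  val filePlan: FieldIdPlan =
    planFieldIds(FakeFileScan(Map("id" -> 1), evolved = false))(_.ids, _.evolved)

  val changelogPlan: FieldIdPlan =
    planFieldIds(FakeChangelogScan(Map("id" -> 1, "amount" -> 2), evolved = true))(_.ids, _.evolved)
}
```

The trade-off is the one already raised: a shared helper removes the chance of the two paths drifting, at the cost of one more level of indirection when reading either path.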

weiqingy · 2026-07-03T23:40:27Z


+  def expectedFieldIdsForChangelogScan(scan: AnyRef): Map[String, Int] = {
+    val expectedSchema =
+      FieldUtils.readField(scan, "expectedSchema", true).asInstanceOf[org.apache.iceberg.Schema]


The file-scan path reads the schema and table through the public expectedSchema() / table() methods (:38, :49), while the changelog path reaches them by string-keyed reflection into SparkChangelogScan (FieldUtils.readField(scan, "expectedSchema", true) here and "table" at :54). Since those are Iceberg-internal field names with no compile-time check, an Iceberg version that renames or restructures the field would slip through silently. The caller does guard both reads with try/NonFatal → return None (IcebergScanSupport.scala:214-230), so the worst case is a quiet fallback rather than a crash — which is a good safety net. Does SparkChangelogScan expose any public accessor we could use instead, or is reflection the only door in? If it's the only option, would a one-line note on the field-name assumption help whoever does the next Iceberg bump?
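
For readers unfamiliar with the pattern, the sketch below shows a guarded private-field read that degrades to `None` when the field name no longer exists. It is an illustration under assumptions, not the PR's code: it uses plain `java.lang.reflect` instead of `FieldUtils.readField` to stay dependency-free (and, unlike `FieldUtils`, does not search superclasses), and `FakeScan` and `readPrivateField` are hypothetical names.

```scala
import scala.util.control.NonFatal

object GuardedReflectionSketch {
  // Hypothetical stand-in for a class whose state is only reachable by reflection.
  final class FakeScan {
    private val expectedSchema: String = "schema-v2"
  }

  // Read a private field by its string name. A renamed or removed field
  // surfaces as NoSuchFieldException, which NonFatal turns into None,
  // i.e. a quiet fallback instead of a crash.
  def readPrivateField[T](target: AnyRef, fieldName: String): Option[T] =
    try {
      val field = target.getClass.getDeclaredField(fieldName)
      field.setAccessible(true)
      Option(field.get(target).asInstanceOf[T])
    } catch {
      case NonFatal(_) => None
    }
}
```

Calling `readPrivateField[String](new FakeScan, "expectedSchema")` yields the value, while a name that does not exist yields `None`. Nothing in the compiler ties the string to the field, which is why a short note next to the field-name literal can help during a dependency upgrade.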

weiqingy · 2026-07-03T23:40:27Z


    val format = formats.headOption.getOrElse(FileFormat.PARQUET)
-    if (format != FileFormat.PARQUET && format != FileFormat.ORC) {
+    val supportedFormat =


Small consistency thing: the file-scan copy of this supportedFormat line has a comment just above it (:185-186) explaining why a top-level rename/drop makes older ORC files unsafe for native matching, but that comment didn't come across to the changelog copy. Worth mirroring it here so the ORC branch reads the same on both paths?
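
As a rough illustration of the rule being discussed, the sketch below encodes "Parquet is always eligible, ORC only when there has been no top-level rename/drop, anything else falls back". The `Format` type and `supportsNativeScan` are hypothetical names; the real code uses Iceberg's `FileFormat`, and the wording of the inline comment is an assumption rather than the comment from the file-scan path.

```scala
object SupportedFormatSketch {
  // Hypothetical stand-in for Iceberg's FileFormat enum.
  sealed trait Format
  case object Parquet extends Format
  case object Orc extends Format
  case object Avro extends Format

  def supportsNativeScan(format: Format, hasTopLevelRenameOrDrop: Boolean): Boolean =
    format match {
      case Parquet => true
      // Assumed rationale: after a top-level rename/drop, older ORC files are
      // not safe for native matching, so the scan falls back to Spark.
      case Orc => !hasTopLevelRenameOrDrop
      case _ => false
    }
}
```

Keeping the explanatory comment next to the ORC branch on both paths means a reader of either copy sees why the extra condition exists.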

lyne7-sc and others added 3 commits July 1, 2026 21:45

Iceberg changelog scan field-id projectio

9645619

enhance iceberg suites

cebbb71

lint

276bfb9

github-actions Bot added the thirdparty-iceberg label Jul 2, 2026

weiqingy reviewed Jul 3, 2026

lyne7-sc wants to merge 3 commits into apache:master from lyne7-sc:fix/iceberg_changelogscan_fieldid

2 participants
