Skip to content

[AURON #2375] Fix Iceberg changelog scan field-id projection#2376

Open
lyne7-sc wants to merge 3 commits into
apache:masterfrom
lyne7-sc:fix/iceberg_changelogscan_fieldid
Open

[AURON #2375] Fix Iceberg changelog scan field-id projection#2376
lyne7-sc wants to merge 3 commits into
apache:masterfrom
lyne7-sc:fix/iceberg_changelogscan_fieldid

Conversation

@lyne7-sc

@lyne7-sc lyne7-sc commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

Which issue does this PR close?

Closes #2375

Rationale for this change

The regular native Iceberg scan path already passes Iceberg field IDs to the native reader, which makes top-level schema evolution such as column rename and drop-then-add safe for Parquet files.

The newer insert-only Iceberg changelog scan path also reads the underlying Parquet data files through the native reader, but it does not pass the same field-id mapping into the native scan plan yet. As a result, native Parquet schema matching falls back to column names on the changelog path.

This can return wrong results after Iceberg schema evolution. For example, after RENAME COLUMN, pre-rename files may read as null; after DROP + ADD of the same name, the newly added column may read data from the old dropped column.

What changes are included in this PR?

  • Extract field IDs from SparkChangelogScan's expected Iceberg schema.
  • Reuse the existing Iceberg rename/drop detection for changelog scans.
  • Pass changelog field IDs into IcebergScanPlan instead of Map.empty.
  • Keep nested rename/drop unsupported and make ORC changelog scans fall back after top-level rename/drop, consistent with the regular Iceberg scan path.
  • Add changelog scan integration tests for:
    • renamed columns resolved by field-id;
    • drop-then-add columns with the same name not reusing the dropped field-id.

Are there any user-facing changes?

Yes. Insert-only Iceberg changelog scans on renamed or drop-then-added Parquet columns now return correct results under the native scan. Unsupported cases continue to fall back to Spark. No API change.

How was this patch tested?

Added cases to AuronIcebergIntegrationSuite.

@weiqingy weiqingy left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking this on — closing the field-id gap on the changelog path is a real correctness fix, and the two new tests are honest regression tests (each fails on the pre-fix Map.empty behavior and passes after), mirroring the existing file-scan coverage. A few questions inline.

}
val (fileSchema, partitionSchema) = schemas.get

val fieldIdsByName =

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This block — field-id extraction, rename/drop detection, the two asserts, and the supportedFormat line at :284 — now repeats about 25 lines from planFileScan (:121-144, :187-191), differing only in which util method each calls. Worth noting this bug itself came from the two paths drifting apart: the changelog path shipped with Map.empty while the file-scan path already passed field-ids. Would it be worth factoring the shared portion into one helper parameterized by the two scan → … lookups, so a future field-id change can't miss one path again? Genuinely open — if you'd rather keep the two paths explicit for readability, that's a reasonable call too.


def expectedFieldIdsForChangelogScan(scan: AnyRef): Map[String, Int] = {
val expectedSchema =
FieldUtils.readField(scan, "expectedSchema", true).asInstanceOf[org.apache.iceberg.Schema]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The file-scan path reads the schema and table through the public expectedSchema() / table() methods (:38, :49), while the changelog path reaches them by string-keyed reflection into SparkChangelogScan (FieldUtils.readField(scan, "expectedSchema", true) here and "table" at :54). Since those are Iceberg-internal field names with no compile-time check, an Iceberg version that renames or restructures the field would slip through silently. The caller does guard both reads with try/NonFatal → return None (IcebergScanSupport.scala:214-230), so the worst case is a quiet fallback rather than a crash — which is a good safety net. Does SparkChangelogScan expose any public accessor we could use instead, or is reflection the only door in? If it's the only option, would a one-line note on the field-name assumption help whoever does the next Iceberg bump?


val format = formats.headOption.getOrElse(FileFormat.PARQUET)
if (format != FileFormat.PARQUET && format != FileFormat.ORC) {
val supportedFormat =

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small consistency thing: the file-scan copy of this supportedFormat line has a comment just above it (:185-186) explaining why a top-level rename/drop makes older ORC files unsafe for native matching, but that comment didn't come across to the changelog copy. Worth mirroring it here so the ORC branch reads the same on both paths?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Iceberg changelog scan returns wrong data after column rename / drop-then-add

2 participants