Parquet: Variant shredding follow-ups from PR #14297 #16818

Merged
huaxingao merged 4 commits into apache:main from nssalian:variant-shredding-followup on Jun 16, 2026

@nssalian nssalian (Collaborator) commented Jun 14, 2026

Changes

Follow-ups from the PR #14297 summary thread, plus some additional cleanup:

  1. Added javadoc to TIE_BREAK_PRIORITY so readers know what the constant is for. Updated the class-level javadoc to remove the stale TreeMap reference and link to TIE_BREAK_PRIORITY.
  2. Moved PathNode.objectChildren from TreeMap to HashMap. Alphabetical schema field order is preserved by sorting once in createObjectTypedValue at schema build time.
  3. Added debug logging in ParquetFormatModel.buildShreddedAppender that records the buffer size at construction and, per inference flush, the buffered row count and the inferred shredded field count.
  4. Reused the existing isDecimalType helper in observe and dropped the now-unused VariantPrimitive import.
  5. Cached the result of getMostCommonType per FieldInfo. The schema build calls it once at the root and again per node, and each call previously rebuilt the same family aggregation.
  6. Used a min-heap of size MAX_SHREDDED_FIELDS for the field cap, replacing the full sort and its n-sized intermediate ArrayList and HashSet allocations.
  7. Used an int[] for FieldInfo.typeCounts indexed by PhysicalType.ordinal(), replacing Map<PhysicalType, Integer>. This removes the lambda capture that Map.compute allocated on every observation, along with the Integer boxing.
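
For readers unfamiliar with the bounded min-heap approach in change 6, here is a minimal sketch of the general technique. It is not the code from this PR: the class TopFieldsSketch, the FieldCount holder, and the topFields method are hypothetical names, and the tie-break rule (alphabetically earlier names win on equal counts) is assumed from the test plan's description.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical illustration only; names do not come from the Iceberg code base.
final class TopFieldsSketch {
  private TopFieldsSketch() {}

  // Simple holder for a field name and how often it was observed.
  private static final class FieldCount {
    final String name;
    final int count;

    FieldCount(String name, int count) {
      this.name = name;
      this.count = count;
    }
  }

  // "Worst first" ordering: the heap head is the entry to evict next.
  // Lower count is worse; on equal counts the alphabetically later name is worse.
  private static final Comparator<FieldCount> WORST_FIRST =
      (a, b) -> {
        int byCount = Integer.compare(a.count, b.count);
        return byCount != 0 ? byCount : b.name.compareTo(a.name);
      };

  // Returns the names of the k best fields, sorted alphabetically.
  static List<String> topFields(Map<String, Integer> counts, int k) {
    List<String> names = new ArrayList<>();
    if (k <= 0) {
      return names;
    }
    PriorityQueue<FieldCount> heap = new PriorityQueue<>(k, WORST_FIRST);
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      FieldCount candidate = new FieldCount(e.getKey(), e.getValue());
      if (heap.size() < k) {
        heap.add(candidate);
      } else if (WORST_FIRST.compare(candidate, heap.peek()) > 0) {
        // Candidate beats the current worst entry, so replace it.
        heap.poll();
        heap.add(candidate);
      }
    }
    for (FieldCount fc : heap) {
      names.add(fc.name);
    }
    Collections.sort(names);
    return names;
  }

  public static void main(String[] args) {
    Map<String, Integer> counts = new HashMap<>();
    counts.put("d", 5);
    counts.put("b", 5);
    counts.put("a", 5);
    counts.put("c", 5);
    counts.put("e", 9);
    System.out.println(topFields(counts, 3));
  }
}
```

The heap never holds more than k entries, so the pass costs O(n log k) time and O(k) extra space instead of sorting all n candidates. Because the comparator defines a total order, the result does not depend on the iteration order of the input map.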

Test plan

  • testIntermediateFieldCapLimitsTrackedFields extended with three assertions verifying the min-heap retains the alphabetically-earliest fields when counts are tied.
  • New testUuidFieldIsTrackedAndShredded exercises int[] indexing for a high-ordinal PhysicalType value.
  • The build and TestVariantShreddingAnalyzer both passed locally.
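
The ordinal-indexed counting from change 7, the cached lookup from change 5, and the high-ordinal case that the UUID test targets can be illustrated with a small sketch. Everything here is an assumption for illustration: the Kind enum is a stand-in for PhysicalType (its members and order are invented), TypeCountsSketch is a hypothetical class, and the "lowest ordinal wins on ties" rule is a placeholder rather than the analyzer's actual TIE_BREAK_PRIORITY.

```java
// Hypothetical illustration only; not the FieldInfo implementation from this PR.
final class TypeCountsSketch {
  // Stand-in for PhysicalType; UUID is deliberately the last (highest-ordinal) value.
  enum Kind { BOOLEAN, INT32, INT64, DOUBLE, STRING, UUID }

  // One slot per enum constant, so the highest ordinal is always in range.
  private final int[] typeCounts = new int[Kind.values().length];

  // Cached answer; null means "not computed yet" or "invalidated by observe".
  private Kind cachedMostCommon;

  // Increments a primitive counter: no boxing and no lambda allocation.
  void observe(Kind kind) {
    typeCounts[kind.ordinal()]++;
    cachedMostCommon = null;
  }

  int count(Kind kind) {
    return typeCounts[kind.ordinal()];
  }

  // Returns the most frequently observed kind, or null if nothing was observed.
  // Assumed tie-break for this sketch: the lowest ordinal wins.
  Kind mostCommon() {
    if (cachedMostCommon == null) {
      Kind[] kinds = Kind.values();
      int best = -1;
      for (int i = 0; i < typeCounts.length; i++) {
        if (typeCounts[i] > 0 && (best < 0 || typeCounts[i] > typeCounts[best])) {
          best = i;
        }
      }
      cachedMostCommon = best < 0 ? null : kinds[best];
    }
    return cachedMostCommon;
  }

  public static void main(String[] args) {
    TypeCountsSketch info = new TypeCountsSketch();
    info.observe(Kind.UUID);
    info.observe(Kind.UUID);
    info.observe(Kind.STRING);
    System.out.println(info.mostCommon() + " seen " + info.count(Kind.UUID) + " times");
  }
}
```

Sizing the array from Kind.values().length is what keeps the last enum constant in bounds; a hard-coded length would fail only for high-ordinal values, which is the kind of regression a UUID-specific test can catch.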

@nssalian nssalian marked this pull request as ready for review June 16, 2026 00:29
@nssalian nssalian requested review from huaxingao and pvary June 16, 2026 00:29

@huaxingao huaxingao (Contributor) left a comment

LGTM

@nssalian nssalian requested a review from huaxingao June 16, 2026 05:54
@huaxingao huaxingao merged commit 85ffa19 into apache:main Jun 16, 2026
53 checks passed
@huaxingao huaxingao (Contributor) commented

Thanks @nssalian for the PR!

@nssalian nssalian deleted the variant-shredding-followup branch June 16, 2026 21:09
@nssalian nssalian added this to the Iceberg 1.12.0 milestone Jun 23, 2026