Parquet: Variant shredding follow-ups from PR #14297#16818
Merged
Conversation
huaxingao
reviewed
Jun 16, 2026
Contributor
|
Thanks @nssalian for the PR! |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Changes
Follow-ups from the PR #14297 summary thread and some additional cleanup
TIE_BREAK_PRIORITYso readers know what the constant is for. Updated the class-level javadoc to remove the staleTreeMapreference and link toTIE_BREAK_PRIORITY.PathNode.objectChildrenfromTreeMaptoHashMap. Alphabetical schema field order is preserved by sorting once increateObjectTypedValueat schema build time.ParquetFormatModel.buildShreddedAppenderthat records the buffer size at construction and, per inference flush, the buffered row count and the inferred shredded field count.isDecimalTypehelper inobserveand dropped the now-unusedVariantPrimitiveimport.getMostCommonTypeperFieldInfo. The schema build calls it once at the root and again per node, currently rebuilding the same family-aggregation every time.MAX_SHREDDED_FIELDSfor the field cap, replacing the full sort plus n-sized intermediateArrayListplusHashSetallocation.int[]forFieldInfo.typeCountskeyed onPhysicalType.ordinal(), replacingMap<PhysicalType, Integer>. Removed the lambda capture allocated byMap.computeon every observation and the boxing ofInteger.Test plan
testIntermediateFieldCapLimitsTrackedFieldsextended with three assertions verifying the min-heap retains the alphabetically-earliest fields when counts are tied.testUuidFieldIsTrackedAndShreddedexercisesint[]indexing for a high-ordinalPhysicalTypevalue.TestVariantShreddingAnalyzeras well.