PARQUET-2249: Add IEEE-754 total order and nan count for floating types#3393
Conversation
c01b3f3 to
4b7e86b
Compare
133fb4b to
07a4d77
Compare
|
@wgtmac thank you for adding the interop test! 🙏 In arrow-rs we've made the decision to only write the new column order for floats, so I can't reproduce the total order columns. Some things I think need to be added are negative NaNs, as well as examples where the min and/or max are 0. The latter is to make sure that the old rules regarding 0 min being set to -0 and 0 max set +0 are no longer followed with the new ordering. |
Did you mean arrow-rs will no longer write floats with the legacy TypeDefinedOrder? From the perspective of interoperability test, I think this is fine if it does not fail when reading files produced by other writers.
That's a good suggestion! I have updated the floating-point interop coverage to add explicit ZERO_MIN and ZERO_MAX cases, so we now verify that IEEE-754 total order no longer rewrites +0 min to -0 or -0 max to +0. I also expanded the NaN coverage to include both negative and positive NaN patterns. While debugging the test, I found that the Java implementation uses |
|
Thank you. I hope to have the rust tests done today. |
|
@shangxinli @etseidl Do you want to take a look again? I think now everything is ready on my end. cc @gszadovszky @Fokko |
gszadovszky
left a comment
There was a problem hiding this comment.
Added a couple of comments.
Also, it would be nice to add coverage for predicate filtering with signed NaN values. Since NaN value are excluded from the statistics, this is only for the value level (ValueInspector). Can be either covered unit or integration level.
|
Thanks @gszadovszky for the review! I think I've addressed all your comments. Let me know what you think. |
|
Thank you, @wgtmac. One more thing. This PR also changes how If we move to that direction, we also need to fix the dictionary filter. We use a boxed |
|
@gszadovszky That's a good question! I think the main behavior change is that now original NaN bits are preserved in not only dictionary but also encoded values. I would regard this as a benign bug fix. For dictionary filter and bloom filter on the read path, they are not aware of column order. Introducing IEEE754 total order is anyway a breaking change to them because raw bits of NaNs must be preserved as is. Update: I've sent https://lists.apache.org/thread/m6j8lzc09ytyd45wt6pdcyn5qy95f0vt to discuss this. |
|
I updated the patch so the pruning filters and record-level filters have separate semantics:
I still think that preserving raw bits of NaN values is a bug fix and benign breaking change. Let me know what you think. @gszadovszky |
|
Thanks @wgtmac for the updates. I think one issue remains with |
|
That's my oversight and thanks @gszadovszky for catching this! Now it should be fixed as well. |
gszadovszky
left a comment
There was a problem hiding this comment.
Thank you, @wgtmac. LGTM.
|
Merged. Thanks @etseidl @shangxinli @gszadovszky for the review! |
Rationale for this change
This implements the parquet-format IEEE 754 total order column-order work from apache/parquet-format#514 in parquet-java. The existing type-defined order remains the default.
While adding the new order, this also fixes floating-point NaN handling so parquet-java preserves the exact FLOAT, DOUBLE, and FLOAT16 bit patterns supplied by applications, including NaN sign and payload bits. Filters are updated to avoid false negatives when NaN semantics cannot be answered safely from metadata.
What changes are included in this PR?
Are these changes tested?
Yes. This PR adds focused unit and end-to-end tests for the new column order, NaN counts, NaN filtering behavior, raw-bit preservation, and interop with parquet-testing files.
Are there any user-facing changes?
Yes. Users can opt into IEEE_754_TOTAL_ORDER for floating columns. The default column order is unchanged, and existing files remain readable. Filtering around NaN values may be more conservative to avoid dropping data that can still match.
Closes #406