feat: implement merge multiple DVs#708
Conversation
There was a problem hiding this comment.
Pull request overview
Implements support for merging multiple deletion vectors (DVs) that reference the same data file by rewriting them into merged Puffin DV blobs, enabling format v3 workflows that previously returned NotImplemented.
Changes:
- Add DV-merge pipeline in
MergingSnapshotUpdatethat unions multiple DV bitmaps per referenced data file and writes merged DV blobs into a Puffin container. - Introduce
TransactionContext::NewDataLocation(...)to generate table-consistent data-file output locations for merged DV Puffin files. - Promote Puffin +
RoaringPositionBitmapcomponents fromiceberg_datainto the coreiceberglibrary (exports + build wiring) and add a test covering DV merging.
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| src/iceberg/update/merging_snapshot_update.h | Adds DV merge APIs and tracking for merged DV outputs. |
| src/iceberg/update/merging_snapshot_update.cc | Implements DV merging via Puffin reader/writer and bitmap union; adds cleanup handling for merged DV outputs. |
| src/iceberg/transaction.h | Adds TransactionContext::NewDataLocation API. |
| src/iceberg/transaction.cc | Implements NewDataLocation via LocationProvider. |
| src/iceberg/test/merging_snapshot_update_test.cc | Adds an integration-style test validating duplicate DV merging behavior and correctness. |
| src/iceberg/puffin/puffin_writer.h | Moves export macro to core ICEBERG_EXPORT. |
| src/iceberg/puffin/puffin_reader.h | Moves export macro to core ICEBERG_EXPORT. |
| src/iceberg/puffin/puffin_format.h | Moves export macro to core ICEBERG_EXPORT. |
| src/iceberg/puffin/json_serde_internal.h | Moves export macro to core ICEBERG_EXPORT. |
| src/iceberg/puffin/file_metadata.h | Moves export macro to core ICEBERG_EXPORT. |
| src/iceberg/deletes/roaring_position_bitmap.h | Moves export macro to core ICEBERG_EXPORT. |
| src/iceberg/meson.build | Moves Puffin + bitmap sources/deps into iceberg lib; adjusts croaring linkage. |
| src/iceberg/CMakeLists.txt | Moves Puffin + bitmap sources into iceberg lib; adjusts croaring linkage propagation. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
284b077 to
f25cbe1
Compare
| 'catalog/session_catalog.cc', | ||
| 'catalog/session_context.cc', | ||
| 'delete_file_index.cc', | ||
| 'deletes/roaring_position_bitmap.cc', |
There was a problem hiding this comment.
Perhaps we need to move it out of the deletes subdirectory.
| ICEBERG_ASSIGN_OR_RAISE(auto blob, reader->ReadBlob(*blob_it)); | ||
| ICEBERG_ASSIGN_OR_RAISE( | ||
| auto dv_bitmap, | ||
| RoaringPositionBitmap::Deserialize(std::string_view( |
There was a problem hiding this comment.
Java/spec-compliant deletion-vector-v1 blobs are not raw Roaring payloads: they include the 4-byte length, magic bytes, bitmap, and CRC, and Java validates CRC/cardinality during deserialize. PuffinReader::ReadBlob returns the stored blob bytes unchanged, so passing them directly to RoaringPositionBitmap::Deserialize() will reject or misread DVs written by Java. The writer below has the symmetric issue because bitmap.Serialize() writes only the raw bitmap. Please add DV-level framing/parsing here.
|
|
||
| ICEBERG_ASSIGN_OR_RAISE( | ||
| auto blob_metadata, | ||
| writer.Write(puffin::Blob{ |
There was a problem hiding this comment.
This Puffin blob metadata does not match Java or the Puffin DV spec. Java BaseDVFileWriter writes fields=[ROW_POSITION], snapshot-id=-1, sequence-number=-1, and referenced-data-file/cardinality properties. Writing empty fields, concrete snapshot/sequence values, and no properties produces invalid DV footer metadata for other implementations. Please mirror the Java DV writer metadata.
|
@zhaoxuan1994 is working on adding puffin reader/writer for v3 DVs. Let's wait for it and then rebase on it later. |
No description provided.