Skip to content

feat: add metadata bridge for Python CUDA export#8604

Open
0ax1 wants to merge 1 commit into
developfrom
ad/pycudf
Open

feat: add metadata bridge for Python CUDA export#8604
0ax1 wants to merge 1 commit into
developfrom
ad/pycudf

Conversation

@0ax1

@0ax1 0ax1 commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

Add a private metadata-only bridge on PyVortex arrays so the optional CUDA extension can reconstruct arrays through its own local Vortex session instead of passing Rust ArrayRef values across Python extension modules.

The CUDA extension now parses the metadata tree, calls local array plugins for deserialization, and exports bufferless arrays through the Arrow C Device capsule path. Physical buffer handoff remains a follow-up.

@0ax1 0ax1 requested a review from a team June 26, 2026 13:43
@0ax1 0ax1 added the changelog/feature A new feature label Jun 26, 2026
@codspeed-hq

codspeed-hq Bot commented Jun 26, 2026

Copy link
Copy Markdown

Merging this PR will not alter performance

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

⚡ 5 improved benchmarks
❌ 1 regressed benchmark
✅ 1583 untouched benchmarks
⏩ 4 skipped benchmarks1

Warning

Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

Mode Benchmark BASE HEAD Efficiency
Simulation chunked_varbinview_into_canonical[(1000, 10)] 169 µs 205.7 µs -17.87%
Simulation chunked_varbinview_canonical_into[(100, 100)] 259.7 µs 224.6 µs +15.65%
Simulation bitwise_not_vortex_buffer_mut[128] 244.4 ns 215.3 ns +13.55%
Simulation chunked_varbinview_into_canonical[(100, 100)] 306.5 µs 271.3 µs +12.97%
Simulation bitwise_not_vortex_buffer_mut[1024] 304.7 ns 275.6 ns +10.58%
Simulation encode_varbin[(1000, 2)] 158.4 µs 143.5 µs +10.34%

Tip

Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.


Comparing ad/pycudf (f827d8a) with develop (f172d58)

Open in CodSpeed

Footnotes

  1. 4 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

Add a private metadata-only bridge on PyVortex arrays so the optional CUDA extension can reconstruct arrays through its own local Vortex session instead of passing Rust ArrayRef values across Python extension modules.

The CUDA extension now parses the metadata tree, calls local array plugins for deserialization, and exports bufferless arrays through the Arrow C Device capsule path. The capsule ownership code handles live and consumed capsule names and releases Arrow resources on error paths. Physical buffer handoff remains a follow-up.

Signed-off-by: "Alexander Droste" <alexander.droste@protonmail.com>

Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
@0ax1

0ax1 commented Jun 26, 2026

Copy link
Copy Markdown
Contributor Author

@claude review in depth

@github-actions

github-actions Bot commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

Claude finished @0ax1's task in 3m 35s —— View job


Claude PR Review — metadata bridge for Python CUDA export

I reviewed the full diff (vortex-python/src/arrays/mod.rs, the new vortex-python-cuda/src/lib.rs, Python glue, and tests) against origin/develop. Overall this is careful, high-quality FFI code — the capsule lifetime/ownership handling, error-path resource release, and post-deserialize validation are all well done. I did not find any memory-safety or correctness bugs. Findings below are mostly minor / design notes.

Tasks

  • Gather context & read changed files
  • Review vortex-python/src/arrays/mod.rs metadata bridge
  • Review vortex-python-cuda/src/lib.rs (capsule/FFI, unsafe)
  • Review Python glue + tests
  • Post review feedback

What's good

  • Capsule ownership is correct on every path. In export_device_array, if device_array_capsule fails after schema_capsule succeeded, the dropped schema Bound runs release_schema_capsule, so no leak. The schema_capsule/device_array_capsule constructors also reclaim the Box and call the Arrow release callback if PyCapsule::new_* itself fails. The check_requested_schema error path calls release_exported for both structs. Double-free is avoided because the underlying release_schema/release_device_array are guarded by if let Some(release) and rely on the standard Arrow contract of nulling the callback.
  • deserialize_metadata_tree validates aggressively — length, dtype, and is_supported_encoding after deserialization, plus a per-node buffer_count != 0 rejection. I confirmed nbuffers() is per-node (there's a separate nbuffers_recursive() in vortex-array/src/array/erased.rs), so the per-node check is semantically correct and recursion catches buffers in any child.
  • Concurrency is fine. static RUNTIME: CurrentThreadRuntime wraps Arc<smol::Executor<'static>>; block_on(executor.run(fut)) supports concurrent runners, so sharing it across py.detach calls from multiple Python threads is safe. The unsafe impl Send for ExportedDeviceArray is justified and clearly documented.
  • Tuple shape (len == 6) is validated on the Rust side and matches the producer in array_metadata_tuple and the .pyi annotation.

Findings / suggestions

1. Debug helper exercises a different session than the real export (vortex-python-cuda/src/lib.rs).
_debug_array_metadata_dtype deserializes through METADATA_SESSION (plain default), while export_device_array uses cuda_session(). As a debugging aid for the export path this is slightly misleading — an array that round-trips through _debug_array_metadata_dtype could still fail under the CUDA session if encoding registries diverge. Consider documenting that the debug helper validates only the metadata-bridge decode, not the CUDA export path. Fix this →

2. Cross-session encoding registry mismatch is a real behavioral limitation (by design).
The whole point of the bridge is reconstructing arrays via the extension's own local session. That means any encoding registered only in the base vortex-data session (e.g. a user-registered Python encoding) will fail in the CUDA extension with Unknown array encoding: .... This is the intended trade-off, but worth a sentence in the __vortex_array_metadata__ docstring (vortex-python/src/arrays/mod.rs) so consumers understand only encodings known to the local session are reconstructable.

3. reject_unsupported_kwargs silently accepts key=None.
Any unknown kwarg whose value is None is ignored rather than rejected, so a typo like requsted_schema=None (note the misspelling) would be silently swallowed instead of raising. If that leniency is intentional (matching the Arrow capsule protocol), a one-line comment would help; otherwise consider rejecting unknown names regardless of value. Fix this →

4. New public surface in vortex-cuda (vortex-cuda/src/arrow/mod.rs).
release_schema and release_device_array went from private to pub. They're needed across the crate boundary, so pub is justified, but these are low-level FFI release primitives that are easy to misuse on arbitrary structs. Consider #[doc(hidden)] or a clearly-internal module path to signal they're not part of the supported public API.

5. Test coverage (acknowledged in comments). The added tests cover the metadata round-trip, the buffer-handoff rejection, and the no-CUDA error path well. The capsule destructor and used_* rename paths can't run in CI without a CUDA device + Arrow Device consumer — the code comments call this out. No action required, but a follow-up that exercises the schema-capsule destructor against a synthetic FFI_ArrowSchema (CPU-only) would close part of that gap if it's cheap to add.

Nits

  • dtype_from_metadata clones metadata.dtype into a ByteBuffer on every call; negligible, mentioning only for completeness.
  • parse_array_metadata / array_metadata_tuple recurse per node — fine in practice, theoretically stack-bound on pathological depth.

CI / checks

This is Python-binding + PyO3 Rust work. Per CLAUDE.md the relevant checks are cargo +nightly fmt --check -p vortex-python (and vortex-python-cuda), cargo clippy --all-targets --all-features, uv run basedpyright vortex-python, and ruff. I did not run these (read-only review). The CodSpeed regression on chunked_varbinview_into_canonical[(1000, 10)] is unrelated to this diff (no changes to that path) and looks like Walltime-runner noise per the CodSpeed warning.
· branch ad/pycudf

@AdamGS AdamGS left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this makes sense to me

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

changelog/feature A new feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants