[format] Support writing and reading Arrow schema metadata for file formats by lxy-9602 · Pull Request #8321 · apache/paimon

lxy-9602 · 2026-06-22T10:57:42Z

Purpose

This PR is a sub-PR for shared-shredding.

Shared-shredding needs to attach dictionary and other field-level metadata before closing data files. To support that flow, this PR adds a generic metadata write/read path for file formats, so upper layers can provide raw key-value metadata during writing, and readers can parse the stored metadata back when opening files.

The metadata representation follows Arrow Parquet's existing schema metadata convention. Arrow stores the serialized original schema under the ARROW:schema file metadata key. On read, it base64-decodes the value and deserializes it with the Arrow IPC schema reader. See Apache Arrow's Parquet schema implementation, where kArrowSchemaKey is defined as ARROW:schema and the value is base64-decoded before being passed to ReadSchema.
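For illustration, the first half of that read path (key lookup plus base64 decoding) can be sketched with the JDK alone. This is a minimal sketch, not the code in this PR: the class name ArrowSchemaKeySketch and the method serializedSchema are hypothetical, and it assumes the footer metadata has already been loaded as a Map<String, String>. Deserializing the decoded bytes with an Arrow IPC schema reader is deliberately left out.

```java
import java.util.Base64;
import java.util.Map;
import java.util.Optional;

/** Hypothetical helper for illustration only; not part of this PR's API. */
final class ArrowSchemaKeySketch {

    /** File metadata key that Arrow's Parquet integration uses for the serialized schema. */
    static final String ARROW_SCHEMA_KEY = "ARROW:schema";

    private ArrowSchemaKeySketch() {}

    /**
     * Looks up the ARROW:schema entry and base64-decodes it.
     * Assumption: footer metadata is exposed as string key-value pairs.
     * The returned bytes would still have to be passed to an Arrow IPC schema reader.
     */
    static Optional<byte[]> serializedSchema(Map<String, String> footerMetadata) {
        String encoded = footerMetadata.get(ARROW_SCHEMA_KEY);
        if (encoded == null) {
            // Files written without Arrow schema metadata simply have no such key.
            return Optional.empty();
        }
        return Optional.of(Base64.getDecoder().decode(encoded));
    }
}
```

Returning an Optional keeps files without the key readable, which is one plausible way to handle data written before this change.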

Related design:
https://cwiki.apache.org/confluence/display/PAIMON/PIP-43%3A+Columnar+Storage+Optimization+for+MAP+Type+in+Paimon

Brief change log

Add SupportsWriterMetadata so format writers can accept raw Map<String, byte[]> metadata before file close.
Add SupportsReaderArrowSchema so format readers can return the stored Arrow schema metadata.
Add FormatMetadataUtils for:
- base64 encoding metadata values before storing them in format footers;
- base64 decoding stored metadata values;
- reading the fixed ARROW:schema key into an Arrow Schema;
- extracting field-level metadata as Map<String, Map<String, String>>.
Support metadata writing for Parquet and ORC writers.
Support Arrow schema metadata reading from Parquet and ORC readers.
Add Parquet/ORC tests covering metadata write/read and field-level Arrow metadata roundtrip.
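The write/read flow in the change log can be sketched as follows. This is a hedged, in-memory stand-in rather than the actual SupportsWriterMetadata or FormatMetadataUtils code: the class InMemoryMetadataWriterSketch and its methods addMetadata, close, and decode are hypothetical names, and the "footer" is just a map instead of a real Parquet or ORC footer.

```java
import java.util.Base64;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for a format writer that accepts raw metadata before close. */
final class InMemoryMetadataWriterSketch {

    private final Map<String, byte[]> pending = new LinkedHashMap<>();
    private Map<String, String> footer; // stays null until close() is called

    /** Accepts raw key-value metadata; only valid before the file is closed. */
    void addMetadata(Map<String, byte[]> metadata) {
        if (footer != null) {
            throw new IllegalStateException("writer already closed");
        }
        // Copy the values so later changes by the caller do not leak into the file.
        metadata.forEach((key, value) -> pending.put(key, value.clone()));
    }

    /** Base64-encodes every raw value and stores it in the simulated footer. */
    Map<String, String> close() {
        if (footer == null) {
            Map<String, String> encoded = new LinkedHashMap<>();
            pending.forEach(
                    (key, value) -> encoded.put(key, Base64.getEncoder().encodeToString(value)));
            footer = encoded;
        }
        return footer;
    }

    /** Reader side: decodes the stored base64 values back into raw bytes. */
    static Map<String, byte[]> decode(Map<String, String> storedFooter) {
        Map<String, byte[]> raw = new LinkedHashMap<>();
        storedFooter.forEach((key, value) -> raw.put(key, Base64.getDecoder().decode(value)));
        return raw;
    }
}
```

Encoding at close time rather than at addMetadata time mirrors the description above, where upper layers hand over raw bytes and the format layer owns the stored representation.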

Compatibility

The metadata value is stored using Arrow-compatible base64 encoding. For Arrow schema metadata, the key is ARROW:schema, matching Arrow Parquet's convention.

Tests

mvn -pl paimon-format -am -Pfast-build -DfailIfNoTests=false \
  -Dtest=ParquetFormatReadWriteTest#testWriteMetadata,OrcFormatReadWriteTest#testWriteMetadata,FormatMetadataUtilsTest test

[format] Support writing and reading Arrow schema metadata for file f…

85740fb

lxy-9602 wants to merge 1 commit into apache:master from lxy-9602:add-format-meta
