[format] Support writing and reading Arrow schema metadata for file formats #8321

Open
lxy-9602 wants to merge 1 commit into apache:master from lxy-9602:add-format-meta

Purpose

This PR is a sub-PR for shared-shredding.

Shared-shredding needs to attach dictionary and other field-level metadata before closing data files. To support that flow, this PR adds a generic metadata write/read path for file formats, so upper layers can provide raw key-value metadata during writing, and readers can parse the stored metadata back when opening files.
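
The write-side flow described above, where an upper layer hands raw metadata to the writer before the file is closed, could look roughly like the sketch below. Only the interface name SupportsWriterMetadata comes from this PR; the method name `addMetadata`, the `InMemoryWriter` class, and the rule that metadata is rejected after close are assumptions made for illustration, not the PR's actual code.

```java
import java.util.Base64;
import java.util.LinkedHashMap;
import java.util.Map;

/** Interface name taken from the PR; the method signature is an assumption. */
interface SupportsWriterMetadata {
    void addMetadata(Map<String, byte[]> metadata);
}

/** Hypothetical stand-in for a format writer; it only models the footer. */
class InMemoryWriter implements SupportsWriterMetadata, AutoCloseable {
    // Raw metadata collected while the file is still open.
    private final Map<String, byte[]> pending = new LinkedHashMap<>();
    // String-valued footer metadata; stays null until close() is called.
    private Map<String, String> footer;

    @Override
    public void addMetadata(Map<String, byte[]> metadata) {
        if (footer != null) {
            throw new IllegalStateException("writer already closed");
        }
        pending.putAll(metadata);
    }

    @Override
    public void close() {
        if (footer != null) {
            return; // closing twice is a no-op
        }
        footer = new LinkedHashMap<>();
        // Base64-encode each raw value so it fits a string-valued footer entry.
        for (Map.Entry<String, byte[]> e : pending.entrySet()) {
            footer.put(e.getKey(), Base64.getEncoder().encodeToString(e.getValue()));
        }
    }

    /** Returns the footer metadata, or null if the writer is still open. */
    Map<String, String> footer() {
        return footer;
    }
}
```

In this sketch the metadata is buffered and only materialized at close, which is one way to let callers attach dictionaries that are not known until all rows have been written.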

The metadata representation follows Arrow Parquet's existing schema metadata convention: Arrow serializes the original schema, base64-encodes it, and stores it under the ARROW:schema key in the file's key-value metadata. On read, it base64-decodes the value and deserializes it with the Arrow IPC schema reader. See Apache Arrow's Parquet schema implementation, where kArrowSchemaKey is ARROW:schema and the value is base64-decoded before ReadSchema.

Related design:
https://cwiki.apache.org/confluence/display/PAIMON/PIP-43%3A+Columnar+Storage+Optimization+for+MAP+Type+in+Paimon

Brief change log

  • Add SupportsWriterMetadata so format writers can accept raw Map<String, byte[]> metadata before file close.
  • Add SupportsReaderArrowSchema so format readers can return the stored Arrow schema metadata.
  • Add FormatMetadataUtils for:
    • base64 encoding metadata values before storing them in format footers;
    • base64 decoding stored metadata values;
    • reading the fixed ARROW:schema key into an Arrow Schema;
    • extracting field-level metadata as Map<String, Map<String, String>>.
  • Support metadata writing for Parquet and ORC writers.
  • Support Arrow schema metadata reading from Parquet and ORC readers.
  • Add Parquet/ORC tests covering metadata write/read and field-level Arrow metadata roundtrip.
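
The base64 helpers in the list above can be illustrated with a minimal, JDK-only sketch. The class and method names (`MetadataBase64Sketch`, `encodeValues`, `decodeValues`) are hypothetical and are not the PR's FormatMetadataUtils API; the sketch also assumes that footers hold string key-value pairs and that the encoding is standard padded base64 (RFC 4648).

```java
import java.util.Arrays;
import java.util.Base64;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper for illustration; not the PR's FormatMetadataUtils. */
class MetadataBase64Sketch {

    /** Encodes each raw value as base64 so it can be stored as a footer string. */
    static Map<String, String> encodeValues(Map<String, byte[]> raw) {
        Map<String, String> encoded = new LinkedHashMap<>();
        for (Map.Entry<String, byte[]> e : raw.entrySet()) {
            encoded.put(e.getKey(), Base64.getEncoder().encodeToString(e.getValue()));
        }
        return encoded;
    }

    /** Decodes stored footer strings back into the original raw bytes. */
    static Map<String, byte[]> decodeValues(Map<String, String> stored) {
        Map<String, byte[]> decoded = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : stored.entrySet()) {
            decoded.put(e.getKey(), Base64.getDecoder().decode(e.getValue()));
        }
        return decoded;
    }

    public static void main(String[] args) {
        Map<String, byte[]> raw = new LinkedHashMap<>();
        // Bytes that are not valid UTF-8 still survive a string-valued footer.
        raw.put("example.key", new byte[] {0, (byte) 0xFF, 42});
        Map<String, byte[]> back = decodeValues(encodeValues(raw));
        System.out.println(Arrays.equals(raw.get("example.key"), back.get("example.key")));
    }
}
```

Encoding the values rather than storing raw bytes keeps arbitrary binary payloads, such as a serialized Arrow schema, safe in footers that only accept strings.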

Compatibility

The metadata value is stored using Arrow-compatible base64 encoding. For Arrow schema metadata, the key is ARROW:schema, matching Arrow Parquet's convention.

Tests

mvn -pl paimon-format -am -Pfast-build -DfailIfNoTests=false \
  -Dtest=ParquetFormatReadWriteTest#testWriteMetadata,OrcFormatReadWriteTest#testWriteMetadata,FormatMetadataUtilsTest test
