[spark] Record the write operation type in snapshot properties#8236
[spark] Record the write operation type in snapshot properties#8236Zouxxyy wants to merge 4 commits into
Conversation
|
I think adding operation as a dedicated nullable field in Snapshot is a better direction than storing it in properties. Compatibility should also be fine:
I would suggest modeling it as a first-class nullable enum or string field, for example Snapshot.Operation, rather than putting it into properties. commitKind describes the physical snapshot change, while operation describes the logical user operation, so both feel like core snapshot metadata. This would also avoid introducing a generic withCommitProperties API just for one standard field, and avoids potential conflicts around the "operation" property key. |
8f80251 to
1c14391
Compare
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1c14391 to
2705c1c
Compare
2705c1c to
87c8a42
Compare
| latest.properties(), | ||
| nextRowId); | ||
| nextRowId, | ||
| latest.operation()); |
There was a problem hiding this comment.
replaceManifestList is used by maintenance/repair flows such as row-id reassignment and removing broken manifests, but this copies the previous snapshot's logical operation into the new maintenance snapshot. That makes the $snapshots.operation value look like the previous user operation even though this snapshot was produced by a manifest rewrite. I think this should stay null (or use a dedicated maintenance operation) instead of inheriting latest.operation(); the same concern applies to the manifest compaction snapshot below.
…ntenance snapshots, revert pypaimon doc Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
| @JsonProperty(FIELD_PROPERTIES) @Nullable Map<String, String> properties, | ||
| @JsonProperty(FIELD_NEXT_ROW_ID) @Nullable Long nextRowId) { | ||
| @JsonProperty(FIELD_NEXT_ROW_ID) @Nullable Long nextRowId, | ||
| @JsonProperty(FIELD_OPERATION) @Nullable Operation operation) { |
There was a problem hiding this comment.
Thanks for restoring the no-version constructor overload. The versioned public constructor is still source/binary incompatible because the old Snapshot(int version, ..., Long nextRowId) signature was replaced by the new one that requires operation. Since Snapshot is @Public, could we also keep that old versioned overload and delegate to this constructor with operation = null?
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Purpose
Add a first-class
operationfield (Snapshot.Operationenum) toSnapshot, recording the logical operation type that produced it. This complements the physicalCommitKind(APPEND/COMPACT/OVERWRITE/ANALYZE) and lets downstream tooling distinguish, e.g., an APPEND from INSERT vs. one from MERGE.Design:
Snapshot.Operationenum:WRITE,OVERWRITE,DELETE,TRUNCATE,UPDATE,MERGE,CREATE_TABLE_AS_SELECT,REPLACE_TABLE_AS_SELECT,CREATE_OR_REPLACE_TABLE_AS_SELECT@JsonInclude(NON_NULL)— old snapshots deserialize asnull, old readers ignore the unknown field via@JsonIgnoreProperties(ignoreUnknown = true)BatchTableCommit.withOperation(Operation)—defaultmethod, no breaking change for existing implementationsFileStoreCommit.withOperation(Operation)— internal API, propagated toSnapshotconstruction inFileStoreCommitImplTRUNCATEis automatically set byTableCommitImpl.truncateTable()/truncatePartitions()in core, so callers don't need to handle itSpark coverage (both v1 and v2 write paths):
WRITEOVERWRITEDELETETRUNCATETRUNCATEUPDATEMERGECREATE_TABLE_AS_SELECTREPLACE_TABLE_AS_SELECT/CREATE_OR_REPLACE_TABLE_AS_SELECTTests
SnapshotTest.testSnapshotWithOperation(paimon-core): JSON serialization round-trip, backward compatibility with old snapshotsSnapshotOperationTest(paimon-spark-ut): all operations under bothuse-v2-write=true/false, including CTAS/RTAS and truncate paths