Fix region-group cleanup: recover-safe submission + retry that actually re-runs the delete#18097
Merged
Merged
Conversation
…ly re-runs the delete Follow-up to #18033. The submit-and-return region-group cleanup had two recovery/correctness defects, both found by adversarial review: 1. The `!isStateDeserialized()` guard around submitting the cleanup RemoveRegionGroupProcedures (in DeleteDatabaseProcedure and CreateRegionGroupsProcedure) could permanently skip the submission. The executor persists a procedure AT a state before that state's body runs, so a leader switch/restart landing on that state means the submission has not happened yet -- the guard then skips it forever while the next state drops the partition table, orphaning every region's peers/data on disk with no record. Removed the guard: always submit. Re-submitting on recovery is safe (each RemoveRegionGroupProcedure gets a fresh procId and does an idempotent delete), whereas skipping leaks. 2. RemoveRegionGroupProcedure's "retry until deleted" was a no-op after any DataNode-side FAIL. The DataNode dedups delete tasks by taskId (which was getProcId()) and caches the terminal result forever, so every same-taskId retry returned the cached FAIL without ever re-running the delete -- an infinite silent loop with the region data already orphaned. Each genuine re-attempt now uses a fresh DataNode-side taskId so the DataNode actually re-executes the idempotent delete. The taskId is derived deterministically from the (replicated) procId and a persisted monotonic sequence, packed into the negative i64 space: negative taskIds are disjoint from all procIds (>= 0), which add/remove-peer tasks use as their taskId against the same DataNode task map, so there is no collision; and because it depends only on replicated/persisted state it is stable and never regresses across a leader change. The in-flight taskId is persisted so a leader change re-polls the same task instead of double-submitting. RemoveRegionGroupProcedure now retries a replica indefinitely (a group delete must not leave data on disk) instead of failing after a fixed budget.
|
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #18097 +/- ##
=========================================
Coverage 41.65% 41.65%
Complexity 318 318
=========================================
Files 5296 5296
Lines 371663 371685 +22
Branches 48088 48090 +2
=========================================
+ Hits 154819 154836 +17
- Misses 216844 216849 +5 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.



Summary
Follow-up to #18033. The submit-and-return region-group cleanup path had two recovery/correctness defects (both surfaced by adversarial review of the merged change):
1.
!isStateDeserialized()guard could permanently skip cleanup submission → orphaned region dataDeleteDatabaseProcedure(stateDELETE_DATABASE_SCHEMA) andCreateRegionGroupsProcedure(stateSHUNT_REGION_REPLICAS) submitted the cleanupRemoveRegionGroupProcedures only when!isStateDeserialized().The executor persists a procedure at a state before that state's body runs (it advances the state on the previous cycle, then can stop at the inter-state boundary on a leader switch — see
ProcedureExecutor#executeProcedure). So a leader switch / restart that recovers onto that state means the submission has not happened yet — butisStateDeserialized()istrue, so the guard skips it forever.DeleteDatabaseProcedurethen advances and drops the partition table, leaving every region's consensus peer and TsFile data orphaned on the DataNodes with no record of where they live.Fix: remove the guard; always submit. Re-submitting on recovery is safe — each
RemoveRegionGroupProceduregets a fresh procId and performs an idempotent delete, so a duplicate is harmless, whereas a skip leaks. (CreateRegionGroupsProcedurerecomputes the cleanups from the serializedfailedRegionReplicaSets;DeleteDatabaseProcedurere-queriesgetAllReplicaSetsbefore the table is dropped — both deterministic on replay.)2. "Retry until deleted" was an infinite no-op after any DataNode-side FAIL
RemoveRegionGroupProcedureusedgetProcId()as the DataNode task id. The DataNode dedups region-maintain tasks by task id and caches the terminal result forever (RegionMigrateService.taskResultMap, cleared only on DataNode restart). So after any DataNode-side delete FAIL, every same-id retry returned the cached FAIL without ever re-running the delete — an infinite 5s loop, region data already orphaned (partition table dropped by the caller), a consensus write every cycle.Fix: each genuine re-attempt uses a fresh DataNode task id so the DataNode actually re-executes the (idempotent) delete. The id is derived deterministically from the (consensus-replicated)
procIdand a persisted monotonic sequence, packed into the negativei64space:procIds (always>= 0), whichAddRegionPeer/RemoveRegionPeeruse as their task id against the same DataNode task map — so no cross-procedure collision / silent dedup.RemoveRegionGroupProcedurenow retries a replica indefinitely (a group delete must not leave data on disk) instead of failing after a fixed retry budget.Tests
RemoveRegionGroupProcedureTest#serDeTest— round-trips the new persisteddeleteTaskSeq/deleteTaskSubmittedcursor.RemoveRegionGroupProcedureTest#deleteTaskIdIsNegativeAndUnique— asserts the derived task id is always negative and injective over(procId, seq)across the full bit budget.spotless:check+ compile clean.