Replace the export partition local manifests lock by a multiversion container#1912
Replace the export partition local manifests lock by a multiversion container#1912arthurpassos wants to merge 19 commits into
Conversation
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ef56768b4f
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
|
||
| std::vector<ReplicatedPartitionExportInfo> infos; | ||
| infos.reserve(snapshots.size()); | ||
| std::vector<ReplicatedPartitionExportInfo> infos(model->size()); |
There was a problem hiding this comment.
Reserve the export info vector instead of pre-sizing it
When the read model contains N exports, this constructs N default ReplicatedPartitionExportInfo rows and the loop then appends the N real rows with emplace_back. As soon as any export exists, system.replicated_partition_exports will expose extra blank rows and report twice as many entries; this should reserve capacity rather than resize the vector.
Useful? React with 👍 / 👎.
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 46f5b11775
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
|
||
| std::vector<ReplicatedPartitionExportInfo> infos; | ||
| infos.reserve(snapshots.size()); | ||
| std::vector<ReplicatedPartitionExportInfo> infos(model->size()); |
There was a problem hiding this comment.
Avoid pre-sizing
infos before appending rows
When there is at least one published export entry, this constructs model->size() default ReplicatedPartitionExportInfo objects and then the loop appends another row for each real entry with emplace_back. As a result, system.replicated_partition_exports returns one blank/default row per real export in addition to the correct rows, with fields like destination/table/status empty and some numeric fields default/uninitialized. This should reserve capacity instead of changing the vector size.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
This seems to be a forgotten thing to be fixed
There was a problem hiding this comment.
Oh wow, i did fix it yesterday before pinging you. But I think I forgot to push it.
mkmkme
left a comment
There was a problem hiding this comment.
Requesting changes since the comment found by Codex looks like one worth fixing. The rest of my comments is open for discussions and not necessarily an urge for action.
| { | ||
| const auto & key = it->getCompositeKey(); | ||
| const auto & key = entry.getCompositeKey(); | ||
| if (zk_children.contains(key)) |
There was a problem hiding this comment.
Since this loop got split into two separate loops, this body looks quite simple. It can be even simpler if done as
for (const auto & entry : entries_by_key)
{
const auto & key = entry.getCompositeKey();
if (!zk_children.contains(key))
stale_keys.emplace_back(key, entry.manifest.transaction_id);
}IMHO this looks a bit more straightforward and to the point of what we actually expect from the loop.
But the bigger question is: was there any justification for splitting the loop in the first place? I agree it does look simpler now, but now we have two iterations instead of one (and introduces another entries_by_key.find() operation that wasn't needed in the previous version). And IIUC both loops are under one larger lock (IOW there's no reduce in lock scope). So I'm a bit puzzled here
There was a problem hiding this comment.
I am sorry I let that pass. Splitting the loop was garbage from the previous rw attempt where I would only lock when really needed. Tbh, it probably didn't even make sense back then, ai slop. Reverted.
| local_status_changes.pop(); | ||
| continue; | ||
| } | ||
| const std::string transaction_id = it->manifest.transaction_id; |
There was a problem hiding this comment.
This could be moved closer to line 711 where it's actually needed. Also this could be a const-ref.
And since it's used only once, maybe we don't need it as a local var at all and can continue calling killExportPart with it->manifest.transaction_id directly
| #include <Common/EventNotifier.h> | ||
| #include <Common/MultiVersion.h> | ||
| #include <Common/ProfileEventsScope.h> | ||
| #include <Common/SharedMutex.h> |
There was a problem hiding this comment.
Is this include needed here? Don't see any code changes in the header that would require it
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 027578b030
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if (had_changes) | ||
| storage.export_read_model.set(std::move(working_model)); |
There was a problem hiding this comment.
Preserve processed status changes on handler errors
With the new copy-on-write model, the status changes applied above do not become visible until this single publish. If a later item in local_status_changes throws after earlier items have already been popped, the catch block only requeues the unpopped tail, so those earlier terminal transitions are neither reprocessed nor published; their one-shot watches have also been consumed. In a batch with multiple status events followed by a ZooKeeper/session exception or the existing failpoint, system.replicated_partition_exports can stay stale and keep part references until the periodic poll repairs it. Publish incrementally or requeue the whole unpublished batch on error.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Requeued the events in case of an exception - 6249c11.
| if (has_local_entry) | ||
| { | ||
| ProfileEvents::increment(ProfileEvents::ExportPartitionZooKeeperRequests); | ||
| ProfileEvents::increment(ProfileEvents::ExportPartitionZooKeeperGet); | ||
|
|
||
| auto status_watch_callback = std::make_shared<Coordination::WatchCallback>([weak_manifest_updater, key](const Coordination::WatchResponse &) | ||
| zk->tryGet(fs::path(entry_path) / "status", status_string); | ||
| } |
There was a problem hiding this comment.
Re-arm status updates when cleanup changes local entries
For an entry that already exists locally, this branch only performs a plain status read. When this same poll holds exports_cleanup_lock, tryCleanup below can move /status from PENDING to KILLED for a timeout, but the code still republishes the pre-cleanup PENDING value and no fresh watch is installed. If the old watch was lost, for example after a ZooKeeper session reset, the local export is not cancelled by killExportPart until the next periodic poll, so a task that was just killed can continue exporting for another interval. Re-arm the watch for cleanup-lock polls or have tryCleanup return the terminal status so it can be applied before publishing.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
I think we can live with it, not a problem
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6249c116d6
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| /// Hold the published snapshot for the whole pass and iterate it directly (sorted by | ||
| /// create_time). It is immutable and the shared_ptr copy never blocks the writer. The scheduler | ||
| /// is a pure reader; status converges via the status watch -> handleStatusChanges and poll(). | ||
| const auto model = storage.export_read_model.get(); |
There was a problem hiding this comment.
Serialize KILL with export scheduling
Since run now holds only a lock-free snapshot, it can pass the /status PENDING check and then race with killExportPartition setting the same task to KILLED. If the status handler cancels existing work before this pass reaches exportPartToTable, the new export manifest is inserted after cancellation and there is no later status change to stop it, so a killed export can still write files and potentially commit; please re-check status after taking the part lock or keep KILL/status cancellation serialized with scheduling.
Useful? React with 👍 / 👎.
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Yet another attempt to reduce lock contention observed while slow writes / commits to iceberg backed by
ice-rest-catalog. Uses aMultiVersioncontainer instead of locking. Follow up of #1813. Closes #1920Documentation entry for user-facing changes
...
CI/CD Options
Exclude tests:
Regression jobs to run: