Add LCOW live migration support across the controller stacks #2790
Open
rawahars wants to merge 1 commit into
a33ffa9 to 3ae59a0
Implement end-to-end live migration support for LCOW pods so a running guest and its workloads can be moved between hosts. Every controller in the stack (VM, pod, Linux container, process, network, and SCSI/Plan9/VPCI devices) gains a save/import lifecycle plus the destination-side patch and resume plumbing needed to rebind local resources and bring the workload back online.

Migration lifecycle:
- Source: Save serializes a controller's state into a self-describing protobuf envelope and freezes the controller (StateSourceMigrating) so no mutating ops race the transfer; Resume rolls the freeze back, and a finalize Stop or VM teardown terminates it.
- Destination: Import rehydrates a controller into a migrating state, Patch repoints saved resources (layer VHDs, process IO/bundle, network namespace) at the destination host, and Resume binds the live VM, guest, and devices and republishes events so containerd treats the task as locally running. AbortMigrated discards an imported-but-never-resumed controller and emits synthetic exits so Delete can proceed.

VM controller:
- Add the source/destination migrating states and the full HCS migration lifecycle: InitializeLiveMigrationOnSource, StartLiveMigrationOnSource, StartLiveMigrationTransfer, FinalizeLiveMigration, plus StartWithMigrationOptions on the destination.
- Exchange the opaque compatibility blob, retain the final HCS document so the destination can recreate an identical VM, and recover the GCS port/bridge-id allocator floors so reissued ids cannot collide.
- Make SCSI initialization lazy (built on first use from the HCS document) and handle never-started/destination teardown paths, including the already-stopped HCS error.

Controller-specific changes:
- SCSI controller switches to an RWMutex and rejects all ops while migrating; ReserveForRootfs now carries the full disk config.
- Process, network, container, and VM state machines document and enforce the new migrating states and transitions.
- Pod gains a migrating guard, AbortMigrated fan-out, and routes new containers through lazy SCSI init.

Includes accompanying unit tests for the new save/import/patch/resume paths across all controllers.

Signed-off-by: Harsh Rawat <harshrawat@microsoft.com>
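For reviewers who want a concrete picture of the source-side freeze described under "Migration lifecycle", here is a minimal sketch. It is not the code in this PR. Only the state name StateSourceMigrating and the Save/Resume verbs come from the description. The Controller type, the Attach and Disks methods, ErrMigrating, and the use of a plain string slice in place of the protobuf envelope are assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// State is a hypothetical controller state. Only the name
// StateSourceMigrating is taken from the PR description.
type State int

const (
	StateRunning State = iota
	StateSourceMigrating
)

// ErrMigrating is an illustrative sentinel error, not an identifier from the PR.
var ErrMigrating = errors.New("controller is migrating")

// Controller is a stand-in for any controller that gains a save/resume lifecycle.
type Controller struct {
	mu    sync.RWMutex
	state State
	disks []string
}

// Attach is a mutating op: it is rejected while the controller is frozen.
func (c *Controller) Attach(path string) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.state == StateSourceMigrating {
		return ErrMigrating
	}
	c.disks = append(c.disks, path)
	return nil
}

// Save copies the current state and freezes the controller so that no
// mutating op can race the transfer. The slice stands in for the envelope.
func (c *Controller) Save() ([]string, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.state == StateSourceMigrating {
		return nil, ErrMigrating
	}
	snapshot := append([]string(nil), c.disks...)
	c.state = StateSourceMigrating
	return snapshot, nil
}

// Resume rolls the freeze back so mutating ops are accepted again.
func (c *Controller) Resume() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.state = StateRunning
}

// Disks returns a copy of the attached disks under the read lock.
func (c *Controller) Disks() []string {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return append([]string(nil), c.disks...)
}

func main() {
	c := &Controller{}
	_ = c.Attach("a.vhdx")
	snapshot, _ := c.Save()
	fmt.Println(snapshot, c.Attach("b.vhdx")) // Attach fails while frozen
	c.Resume()
	fmt.Println(c.Attach("b.vhdx"), c.Disks()) // Attach succeeds after Resume
}
```

In this sketch reads are still allowed during the freeze. The description says the SCSI controller rejects all ops while migrating, so a closer model of that controller would apply the same state check to the read path as well.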
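The "allocator floors" item under "VM controller" can be illustrated the same way. The sketch below assumes a simple monotonically increasing allocator. IDAllocator, Allocate, and RecoverFloor are hypothetical names, and the real GCS port and bridge-id allocators may work differently. The idea shown is that after import, the next id to be issued is raised above every id already in use, so a reissued id cannot collide with a restored one.

```go
package main

import "fmt"

// IDAllocator hands out monotonically increasing ids.
// It is a hypothetical type used only for this illustration.
type IDAllocator struct {
	next uint32
}

// Allocate returns the next free id and advances the allocator.
func (a *IDAllocator) Allocate() uint32 {
	id := a.next
	a.next++
	return id
}

// RecoverFloor raises the allocator's floor above every id found in the
// restored state, so ids issued afterwards cannot collide with them.
// Overflow at the maximum uint32 value is not handled in this sketch.
func (a *IDAllocator) RecoverFloor(inUse []uint32) {
	for _, id := range inUse {
		if id >= a.next {
			a.next = id + 1
		}
	}
}

func main() {
	a := &IDAllocator{}
	a.RecoverFloor([]uint32{3, 7, 5}) // ids found in the imported state
	fmt.Println(a.Allocate())         // prints 8
}
```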