feat(cli): add GPU count requests#1812
Conversation
|
🌿 Preview your docs: https://nvidia-preview-pr-1812.docs.buildwithfern.com/openshell |
|
/ok to test abe5b79 |
|
/ok to test abe5b79 |
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
|
Landing #1815 first should simplify the changes here. |
abe5b79 to
06c69dd
Compare
|
Label |
PR Review StatusValidation: this is maintainer-authored, project-valid GPU CLI/API/runtime work that aligns GPU sandbox intent with structured resource requirements and the related resource-requirements RFC direction. Review findings:
Docs: Fern docs were updated under Next state: |
Re-check After CI UpdateI re-evaluated latest head Disposition: partially resolved. Remaining items:
Next state: |
Re-check After CI UpdateI re-evaluated latest head Disposition: partially resolved. Remaining items:
Next state: |
06c69dd to
87c9a8c
Compare
Re-check After Author UpdateI re-evaluated latest head Disposition: not resolved. Remaining items:
Next state: |
Re-check After CI UpdateI re-evaluated latest head Disposition: partially resolved. Remaining items:
Next state: |
The breaking proto change is intentional. However, I will defer to @drew on whether we should rather reserve the previous |
Re-check After Author UpdateI re-evaluated latest head Disposition: not resolved. Remaining items:
Next state: |
87c9a8c to
da6fbd8
Compare
Re-check After Author UpdateI re-evaluated latest head Disposition: not resolved. Remaining items:
Next state: |
Re-check After CI and Independent ReviewI re-evaluated latest head Disposition: not resolved. Remaining items:
Next state: |
|
Responding to the two blocking findings from gator:
Given that the API is still alpha, we do not want to carry legacy GPU-specific reserved or transitional fields forward into the proto shape we intend to stabilize. I will update RFC 0004 in this branch to reflect that decision, since the current RFC text still describes reserving the old GPU fields. |
c7cf9d6 to
b681234
Compare
b681234 to
60a7ccd
Compare
Re-check After Author UpdateI re-evaluated latest head Disposition: not resolved. Remaining items:
Checks: the latest forced-push head is still early in CI; DCO and required status publication are pending at the time of this re-check. The review findings above must be addressed before gator can move this PR to pipeline watch. Next state: |
Re-check After CI UpdateI re-evaluated latest head Disposition: not resolved. Remaining items:
Next state: |
Re-check After CI and Independent ReviewI re-evaluated latest head Disposition: not resolved. Remaining items:
Independent review summary: no additional code-level blocker was found in this bounded pass beyond the items above. Next state: |
35c7ef2 to
8e108bb
Compare
8e108bb to
cad3745
Compare
cad3745 to
82f88d8
Compare
Re-check After Author UpdateI re-evaluated latest head Disposition: partially resolved, but not ready to leave review. Remaining items:
Checks: Next state: |
BREAKING CHANGE: SandboxSpec.gpu and DriverSandboxSpec.gpu were replaced with resource_requirements.gpu. This intentionally reuses protobuf field 9, changing it from a bool to a message for both public and driver APIs. Signed-off-by: Evan Lezar <elezar@nvidia.com>
Signed-off-by: Evan Lezar <elezar@nvidia.com>
82f88d8 to
3a244be
Compare
Re-check After Author UpdateI re-evaluated latest head Disposition: partially resolved, but not ready to leave review. Remaining items:
Independent review summary: no additional code-level blocker was found in this focused pass beyond the items above. Next state: |
Re-check After CI UpdateI re-evaluated latest head Disposition: partially resolved, but not ready to leave review. Remaining items:
Next state: |
|
On the 2 items marked by the gator:
|
|
Thanks for the review @pimlock. The gator comment kept coming up even though I explicitly stated that the breaking change was acceptable at this stage.
I suppose the behaviour depends on what an on-wire On the python API. Thanks for the link. |
Monitoring CompleteMonitoring is complete because this PR has merged. Final status: the PR merged with an active I removed the active |
Summary
Adds structured GPU resource requirements for sandbox creation and updates the
CLI/API/runtime path so
openshell sandbox create --gpu [COUNT]records GPUintent in
ResourceRequirements.gpu.This is an intentional alpha API break:
SandboxSpec.gpuandDriverSandboxSpec.gpuare replaced byresource_requirements.gpu. Theprotobuf field number is reused intentionally for that replacement, changing
field 9 from a bool to a message in both public and driver APIs. Existing live
or persisted legacy GPU intent is not migrated; callers should use a matching
OpenShell CLI/API version and recreate GPU sandboxes when they need the new
typed shape. RFC 0004 is updated to document that decision.
Related Issue
Part of #1444. Related to #1338, #1156, #1360, and #1492. Follow-up GPU support
preflight semantics are tracked in #1807.
Changes
ResourceRequirements.gpu.countto the public and compute-driver protos.A present GPU requirement with omitted count means one GPU;
count = 0isrejected.
--gpufor one GPU and--gpu COUNTfor counted requests.
translation, provisioning timeout messages, and driver helper APIs.
nvidia.com/gpulimits from GPU requirements.driver_config: Docker andPodman use
cdi_devices, and VM usesgpu_device_ids.IDs are rejected, a single exact device works with default
--gpu, andmulti-device exact lists require
--gpu COUNTmatching the list length.selector refreshes CDI inventory before validation/create, picks from the
normalized NVIDIA CDI inventory in round-robin order, fails when count exceeds
selectable devices, and treats WSL2
nvidia.com/gpu=allfallback as oneselectable device.
Kubernetes driver READMEs.
Testing
mise run pre-commit/Users/elezar/.local/bin/mise exec -- cargo check -p openshell-core -p openshell-driver-docker -p openshell-driver-podman -p openshell-driver-vm -p openshell-driver-kubernetes/Users/elezar/.local/bin/mise exec -- cargo test -p openshell-core -p openshell-driver-docker -p openshell-driver-podman -p openshell-driver-vm -p openshell-driver-kubernetes gpu --lib/Users/elezar/.local/bin/mise exec -- cargo clippy -p openshell-core -p openshell-driver-docker -p openshell-driver-podman -p openshell-driver-vm -p openshell-driver-kubernetes --all-targets -- -D warningsCI is running for the rebased head. The PR has the
test:e2e-gpulabel so therequired Docker GPU E2E gate runs in CI.
Checklist