Add GPT-OSS 20B recipes #507
Open
kunal-vaishnavi wants to merge 5 commits into
Pull request overview
Adds a new set of Olive recipe configurations for optimizing OpenAI GPT-OSS 20B across CPU, CUDA, and WebGPU execution providers, replacing an older one-off CUDA graph-capture script layout with a more standardized per-EP recipe structure.
Changes:
- Added per-EP (cpu/cuda/webgpu) folders containing Olive `ModelBuilder` JSON configs for INT4 QMoE variants (including `k_quant_mixed` and INT8-expert options).
- Added per-EP READMEs and requirements files for running the new recipes.
- Removed the previous `int4_cuda_int4_qmoe` script-based flow and the top-level `gpt-oss-20b/requirements.txt`.
Reviewed changes
Copilot reviewed 25 out of 25 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| gpt-oss-20b/webgpu/requirements.txt | Adds Python deps for WebGPU recipe execution. |
| gpt-oss-20b/webgpu/README.md | Documents available WebGPU recipes and setup steps. |
| gpt-oss-20b/webgpu/info.yaml | Adds recipe metadata for WebGPU entrypoints. |
| gpt-oss-20b/webgpu/gpt-oss-20b_webgpu_int4_int4_qmoe_default.json | Default INT4/INT4 WebGPU ModelBuilder recipe. |
| gpt-oss-20b/webgpu/gpt-oss-20b_webgpu_int4_int4_qmoe_k_quant_mixed.json | WebGPU INT4/INT4 recipe using k_quant_mixed. |
| gpt-oss-20b/webgpu/gpt-oss-20b_webgpu_int4_int8_qmoe_default.json | WebGPU INT4 with INT8 expert weights recipe. |
| gpt-oss-20b/webgpu/gpt-oss-20b_webgpu_int4_int8_qmoe_k_quant_mixed.json | WebGPU INT4+INT8-expert with k_quant_mixed recipe. |
| gpt-oss-20b/requirements.txt | Removes old top-level requirements file. |
| gpt-oss-20b/int4_cuda_int4_qmoe/README.md | Removes legacy CUDA capture-onnx-graph documentation. |
| gpt-oss-20b/int4_cuda_int4_qmoe/info.yml | Removes legacy scanner metadata for the old script recipe. |
| gpt-oss-20b/int4_cuda_int4_qmoe/gpt-oss-20b.sh | Removes legacy CUDA capture-onnx-graph script. |
| gpt-oss-20b/cuda/requirements.txt | Adds Python deps for CUDA recipe execution. |
| gpt-oss-20b/cuda/README.md | Documents available CUDA recipes and setup steps. |
| gpt-oss-20b/cuda/info.yaml | Adds recipe metadata for CUDA entrypoints. |
| gpt-oss-20b/cuda/gpt-oss-20b_cuda_int4_int4_qmoe_k_quant_mixed.json | CUDA INT4/INT4 recipe using k_quant_mixed. |
| gpt-oss-20b/cuda/gpt-oss-20b_cuda_int4_int8_qmoe_k_quant_mixed.json | CUDA INT4+INT8-expert with k_quant_mixed recipe. |
| gpt-oss-20b/cuda/gpt-oss-20b_cuda_int4_int8_qmoe_default.json | CUDA INT4 with INT8 expert weights recipe. |
| gpt-oss-20b/cuda/gpt-oss-20b_cuda_int4_int4_qmoe_default.json | Default INT4/INT4 CUDA ModelBuilder recipe. |
| gpt-oss-20b/cpu/requirements.txt | Adds Python deps for CPU recipe execution. |
| gpt-oss-20b/cpu/README.md | Documents available CPU recipes and setup steps. |
| gpt-oss-20b/cpu/info.yaml | Adds recipe metadata for CPU entrypoints. |
| gpt-oss-20b/cpu/gpt-oss-20b_cpu_int4_int8_qmoe_k_quant_mixed.json | CPU INT4+INT8-expert with k_quant_mixed recipe. |
| gpt-oss-20b/cpu/gpt-oss-20b_cpu_int4_int8_qmoe_default.json | CPU INT4 with INT8 expert weights recipe. |
| gpt-oss-20b/cpu/gpt-oss-20b_cpu_int4_int4_qmoe_k_quant_mixed.json | CPU INT4/INT4 recipe using k_quant_mixed. |
| gpt-oss-20b/cpu/gpt-oss-20b_cpu_int4_int4_qmoe_default.json | Default INT4/INT4 CPU ModelBuilder recipe. |
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
tianleiwu reviewed Jun 15, 2026
Comment on lines +23 to +27:

    "precision": "int4",
    "int4_op_types_to_quantize": [
        "MatMul",
        "Gather"
    ]

This setting needs an accuracy test.
I think it could have accuracy problems if lm_head is quantized to 4 bits.
tianleiwu reviewed Jun 15, 2026
Comment on lines +23 to +27:

    "precision": "int4",
    "int4_op_types_to_quantize": [
        "MatMul",
        "Gather"
    ],

Same here. It could have accuracy problems if lm_head is quantized to 4 bits.
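To see which recipes the lm_head concern applies to, it may help to list every `int4_op_types_to_quantize` entry in a recipe file. The following is a minimal sketch, not part of the PR: the helper name `find_int4_op_types` and the outer `passes`/`builder` keys in the sample are hypothetical, and only the `precision` and `int4_op_types_to_quantize` fields come from the snippets quoted in the review.

```python
import json


def find_int4_op_types(node, path="$"):
    """Recursively collect (json_path, op_types) for each int4_op_types_to_quantize key."""
    hits = []
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}"
            if key == "int4_op_types_to_quantize":
                hits.append((child, list(value)))
            else:
                hits.extend(find_int4_op_types(value, child))
    elif isinstance(node, list):
        for index, item in enumerate(node):
            hits.extend(find_int4_op_types(item, f"{path}[{index}]"))
    return hits


# Hypothetical recipe fragment: the nesting under "passes"/"builder" is an
# assumption for illustration; the two inner fields mirror the reviewed lines.
recipe = json.loads("""
{
  "passes": {
    "builder": {
      "precision": "int4",
      "int4_op_types_to_quantize": ["MatMul", "Gather"]
    }
  }
}
""")

hits = find_int4_op_types(recipe)
for json_path, op_types in hits:
    print(json_path, op_types)
```

Because the helper walks the whole document, it does not depend on where the field sits in the real recipe layout. Replacing the inline sample with `json.load(open(path))` would apply it to an actual file.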
Description
This PR adds recipes for OpenAI's GPT-OSS 20B on the CPU EP, CUDA EP, and WebGPU EP.
Motivation and Context
The recipes were originally created and documented here. Recent changes to the `QMoE` op in ORT now allow block-wise quantization to work for all EPs.
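For readers unfamiliar with the term, block-wise quantization stores one scale per small block of weights rather than one per tensor. The sketch below is a generic, pure-Python illustration of symmetric block-wise quantization and makes no claim about how the `QMoE` op in ORT implements it. The function names, the block size of 32, and the random test weights are all assumptions chosen for the example.

```python
import random


def quantize_blockwise(values, block_size=32, bits=4):
    """Quantize then dequantize `values`, using one symmetric scale per block."""
    qmax = 2 ** (bits - 1) - 1  # 7 for int4, 127 for int8
    restored = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        # One scale per block; fall back to 1.0 for an all-zero block.
        scale = max(abs(v) for v in block) / qmax or 1.0
        for v in block:
            q = max(-qmax - 1, min(qmax, round(v / scale)))
            restored.append(q * scale)
    return restored


def max_abs_error(original, restored):
    """Largest element-wise reconstruction error."""
    return max(abs(a - b) for a, b in zip(original, restored))


# Synthetic weights, seeded so the run is repeatable (illustrative data only).
rng = random.Random(0)
weights = [rng.gauss(0.0, 1.0) for _ in range(256)]

err_int4 = max_abs_error(weights, quantize_blockwise(weights, bits=4))
err_int8 = max_abs_error(weights, quantize_blockwise(weights, bits=8))
print(f"int4 max error: {err_int4:.4f}")
print(f"int8 max error: {err_int8:.4f}")
```

The int8 run reconstructs the weights more closely than the int4 run. That is the trade-off behind the INT8-expert variants and behind the review comments asking for accuracy tests on 4-bit settings.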