
Add GPT-OSS 20B recipes#507

Open
kunal-vaishnavi wants to merge 5 commits into main from kvaishnavi/gpt-oss
@kunal-vaishnavi

Contributor

Description

This PR adds recipes for OpenAI's GPT-OSS 20B on the CPU EP, CUDA EP, and WebGPU EP.

Motivation and Context

The recipes were originally created and documented here. Recent changes to the QMoE op in ORT now allow block-wise quantization to work for all EPs.
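For readers unfamiliar with the term, block-wise quantization stores one scale per fixed-size block of weights rather than one scale for the whole tensor. The following is a minimal pure-Python sketch of that idea, assuming symmetric signed INT4 and a block size of 32. The function names, the block size, and the symmetric scheme are illustrative assumptions and do not describe the actual QMoE kernel in ORT.

```python
from typing import List, Tuple


def quantize_blockwise(
    weights: List[float], block_size: int = 32, bits: int = 4
) -> Tuple[List[int], List[float]]:
    """Symmetric block-wise quantization: one scale per block of weights.

    Hypothetical helper for illustration only; not an ORT or Olive API.
    """
    qmax = 2 ** (bits - 1) - 1   # 7 for INT4
    qmin = -(2 ** (bits - 1))    # -8 for INT4
    quantized: List[int] = []
    scales: List[float] = []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        max_abs = max(abs(w) for w in block)
        # Guard against an all-zero block to avoid dividing by zero.
        scale = max_abs / qmax if max_abs > 0 else 1.0
        scales.append(scale)
        quantized.extend(max(qmin, min(qmax, round(w / scale))) for w in block)
    return quantized, scales


def dequantize_blockwise(
    quantized: List[int], scales: List[float], block_size: int = 32
) -> List[float]:
    """Map each integer back to a float using the scale of its block."""
    return [q * scales[i // block_size] for i, q in enumerate(quantized)]


if __name__ == "__main__":
    weights = [0.1 * i for i in range(-32, 32)]
    q, scales = quantize_blockwise(weights)
    restored = dequantize_blockwise(q, scales)
    worst = max(abs(w - r) for w, r in zip(weights, restored))
    print(f"blocks={len(scales)} worst_abs_error={worst:.4f}")
```

Because each block gets its own scale, an outlier weight only coarsens the quantization grid of the block it sits in, not of the whole tensor.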

Copilot AI review requested due to automatic review settings June 15, 2026 23:30

Copilot AI left a comment

Contributor


Pull request overview

Adds a new set of Olive recipe configurations for optimizing OpenAI GPT-OSS 20B across CPU, CUDA, and WebGPU execution providers, replacing an older one-off CUDA graph-capture script layout with a more standardized per-EP recipe structure.

Changes:

  • Added per-EP (cpu/cuda/webgpu) folders containing Olive ModelBuilder JSON configs for INT4 QMoE variants (including k_quant_mixed and INT8-expert options).
  • Added per-EP READMEs and requirements files for running the new recipes.
  • Removed the previous int4_cuda_int4_qmoe script-based flow and the top-level gpt-oss-20b/requirements.txt.

Reviewed changes

Copilot reviewed 25 out of 25 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| gpt-oss-20b/webgpu/requirements.txt | Adds Python deps for WebGPU recipe execution. |
| gpt-oss-20b/webgpu/README.md | Documents available WebGPU recipes and setup steps. |
| gpt-oss-20b/webgpu/info.yaml | Adds recipe metadata for WebGPU entrypoints. |
| gpt-oss-20b/webgpu/gpt-oss-20b_webgpu_int4_int4_qmoe_default.json | Default INT4/INT4 WebGPU ModelBuilder recipe. |
| gpt-oss-20b/webgpu/gpt-oss-20b_webgpu_int4_int4_qmoe_k_quant_mixed.json | WebGPU INT4/INT4 recipe using k_quant_mixed. |
| gpt-oss-20b/webgpu/gpt-oss-20b_webgpu_int4_int8_qmoe_default.json | WebGPU INT4 with INT8 expert weights recipe. |
| gpt-oss-20b/webgpu/gpt-oss-20b_webgpu_int4_int8_qmoe_k_quant_mixed.json | WebGPU INT4+INT8-expert with k_quant_mixed recipe. |
| gpt-oss-20b/requirements.txt | Removes old top-level requirements file. |
| gpt-oss-20b/int4_cuda_int4_qmoe/README.md | Removes legacy CUDA capture-onnx-graph documentation. |
| gpt-oss-20b/int4_cuda_int4_qmoe/info.yml | Removes legacy scanner metadata for the old script recipe. |
| gpt-oss-20b/int4_cuda_int4_qmoe/gpt-oss-20b.sh | Removes legacy CUDA capture-onnx-graph script. |
| gpt-oss-20b/cuda/requirements.txt | Adds Python deps for CUDA recipe execution. |
| gpt-oss-20b/cuda/README.md | Documents available CUDA recipes and setup steps. |
| gpt-oss-20b/cuda/info.yaml | Adds recipe metadata for CUDA entrypoints. |
| gpt-oss-20b/cuda/gpt-oss-20b_cuda_int4_int4_qmoe_k_quant_mixed.json | CUDA INT4/INT4 recipe using k_quant_mixed. |
| gpt-oss-20b/cuda/gpt-oss-20b_cuda_int4_int8_qmoe_k_quant_mixed.json | CUDA INT4+INT8-expert with k_quant_mixed recipe. |
| gpt-oss-20b/cuda/gpt-oss-20b_cuda_int4_int8_qmoe_default.json | CUDA INT4 with INT8 expert weights recipe. |
| gpt-oss-20b/cuda/gpt-oss-20b_cuda_int4_int4_qmoe_default.json | Default INT4/INT4 CUDA ModelBuilder recipe. |
| gpt-oss-20b/cpu/requirements.txt | Adds Python deps for CPU recipe execution. |
| gpt-oss-20b/cpu/README.md | Documents available CPU recipes and setup steps. |
| gpt-oss-20b/cpu/info.yaml | Adds recipe metadata for CPU entrypoints. |
| gpt-oss-20b/cpu/gpt-oss-20b_cpu_int4_int8_qmoe_k_quant_mixed.json | CPU INT4+INT8-expert with k_quant_mixed recipe. |
| gpt-oss-20b/cpu/gpt-oss-20b_cpu_int4_int8_qmoe_default.json | CPU INT4 with INT8 expert weights recipe. |
| gpt-oss-20b/cpu/gpt-oss-20b_cpu_int4_int4_qmoe_k_quant_mixed.json | CPU INT4/INT4 recipe using k_quant_mixed. |
| gpt-oss-20b/cpu/gpt-oss-20b_cpu_int4_int4_qmoe_default.json | Default INT4/INT4 CPU ModelBuilder recipe. |
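
The recipe files listed above appear to follow a regular naming scheme: one JSON per combination of execution provider, expert-weight precision, and quantization algorithm. As a rough sketch, the twelve recipe paths can be enumerated as below. The `recipe_paths` helper and the constant names are hypothetical and only mirror the file names in this PR; they are not part of Olive.

```python
from itertools import product
from typing import List

# Values read off the file names in this PR.
EXECUTION_PROVIDERS = ("cpu", "cuda", "webgpu")
EXPERT_PRECISIONS = ("int4", "int8")
ALGORITHMS = ("default", "k_quant_mixed")


def recipe_paths(model: str = "gpt-oss-20b") -> List[str]:
    """Build the expected recipe paths (hypothetical helper, not an Olive API)."""
    return [
        f"{model}/{ep}/{model}_{ep}_int4_{expert}_qmoe_{algo}.json"
        for ep, expert, algo in product(
            EXECUTION_PROVIDERS, EXPERT_PRECISIONS, ALGORITHMS
        )
    ]


if __name__ == "__main__":
    for path in recipe_paths():
        print(path)
```

A list like this could be compared against the checked-in files to catch a missing or misnamed recipe.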


Comment thread gpt-oss-20b/cpu/info.yaml
Comment thread gpt-oss-20b/cuda/info.yaml
Comment thread gpt-oss-20b/webgpu/info.yaml
Comment thread gpt-oss-20b/cpu/requirements.txt
Comment thread gpt-oss-20b/cuda/requirements.txt
Comment thread gpt-oss-20b/webgpu/requirements.txt
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Comment on lines +23 to +27
"precision": "int4",
"int4_op_types_to_quantize": [
"MatMul",
"Gather"
]


This setting needs an accuracy test.
I think there could be an accuracy problem if lm_head is quantized to 4 bits.
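
One way to get a feel for this concern is to compare top-1 predictions from a full-precision output projection against a quantized copy. The sketch below does that on a small random matrix, assuming symmetric per-row quantization. The matrix, its sizes, and the helper names are made up for illustration; a real check would need an evaluation of the actual GPT-OSS 20B recipe output.

```python
import random
from typing import List

Matrix = List[List[float]]


def quantize_row(row: List[float], bits: int) -> List[float]:
    """Quantize one row symmetrically and return the dequantized values."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in row)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return [round(w / scale) * scale for w in row]


def logits(weight: Matrix, hidden: List[float]) -> List[float]:
    """Dot each vocabulary row with the hidden state."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in weight]


def top1_agreement(ref: Matrix, test: Matrix, inputs: Matrix) -> float:
    """Fraction of inputs where both matrices pick the same top token."""
    hits = 0
    for hidden in inputs:
        ref_logits = logits(ref, hidden)
        test_logits = logits(test, hidden)
        hits += ref_logits.index(max(ref_logits)) == test_logits.index(max(test_logits))
    return hits / len(inputs)


# Toy stand-in for lm_head: 64 "tokens" with hidden size 32 (assumed sizes).
rng = random.Random(0)
lm_head = [[rng.gauss(0, 1) for _ in range(32)] for _ in range(64)]
inputs = [[rng.gauss(0, 1) for _ in range(32)] for _ in range(50)]

for bits in (8, 4):
    quantized = [quantize_row(row, bits) for row in lm_head]
    print(f"int{bits} top-1 agreement: {top1_agreement(lm_head, quantized, inputs):.2f}")
```

Random weights are not representative of a trained lm_head, so the printed numbers only show how such a comparison could be set up, not what the accuracy impact would be.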

Comment on lines +23 to +27
"precision": "int4",
"int4_op_types_to_quantize": [
"MatMul",
"Gather"
],


Same here. There could be an accuracy problem if lm_head is quantized to 4 bits.


3 participants