[reliability] Daily Reliability Review - 2026-06-18

### Executive Summary

Telemetry is flowing for `gh-aw` (Sentry org `github`, project `gh-aw`): the **spans** dataset has fresh data through 2026-06-18T23:25Z. Over the last 24h I confirmed **16 distinct workflow runs with `gh-aw.run.status:failure`** and at least 19 successful runs. The success figure is a lower bound: the success-span query hit the 100-span cap, so the true count is likely higher. **No `timed_out` or `cancelled` run-status values were present**; failing runs are recorded as `failure`, not as timeouts.

The single highest-signal recurring problem is **Smoke Copilot (4 failed runs in 24h)**. The failure localizes to the **agent phase**, not export/infra: trace continuity is intact and the OTLP exporter is healthy. The **errors** and **logs** datasets are **empty** for the window, and three core attributes the standard review playbook expects (`gen_ai.response.finish_reasons`, native `span.status`, span-level `github.run_id`) are **not searchable in Sentry's spans dataset** despite being emitted — a confirmed instrumentation/mapping gap that blocks truncation analysis and native failure filtering.
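
To put the two counts in proportion, here is a minimal sketch that bounds the failing share of runs. It assumes the 16 failed runs are a complete count and treats 19 as a lower bound on successes, so the result is an upper bound rather than a measured rate. `failureShareUpperBound` is a hypothetical helper, not part of gh-aw.

```javascript
// Hypothetical helper: upper bound on the share of runs that failed,
// given a complete failure count and a lower bound on successes.
// More successes can only lower the share, so this is a ceiling.
function failureShareUpperBound(failedRuns, minSuccessfulRuns) {
  const valid =
    Number.isInteger(failedRuns) &&
    Number.isInteger(minSuccessfulRuns) &&
    failedRuns >= 0 &&
    minSuccessfulRuns >= 0;
  if (!valid) {
    throw new RangeError("counts must be non-negative integers");
  }
  const total = failedRuns + minSuccessfulRuns;
  return total === 0 ? 0 : failedRuns / total;
}

// Counts taken from this review: 16 failed runs, at least 19 successful runs.
const bound = failureShareUpperBound(16, 19);
console.log(`failing share <= ${(bound * 100).toFixed(1)}%`); // 16 / 35
```

Because the success query was capped, the real failing share is probably well below this ceiling.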

### Top Reliability Findings

| Priority | Workflow | Problem | Evidence | Next Action |
| --- | --- | --- | --- | --- |
| 1 | Smoke Copilot | Recurring agent-phase failure (4 runs/24h) | `gh-aw.run.status:failure` on runs `27784259295`, `27738605733`, `27737664591`, `27726924811`; representative trace `0c0034be...` shows `gh-aw.agent.conclusion` = 630.2 s with status `failure` | Inspect Copilot agent-step logs on run `27784259295` |
| 2 | Matt Pocock Skills Reviewer | Repeated failure (2 runs/24h) | failure on runs `27755558806`, `27737630181` | Compare against Smoke Copilot agent-phase cause |
| 3 | 8 other workflows | One-off failures each | failure runs incl. PR Sous Chef `27780036106`, Smoke Claude `27737636941`, Daily Regulatory Report Generator `27791592821`, Test Quality Sentinel `27756657067` (+3 `[Filtered]` names) | Triage individually; likely not systemic |
| 4 | (fleet) | Truncation / runaway-token outcome unverifiable | `has:gen_ai.response.finish_reasons` → 0 spans/24h; max output_tokens=92,805 (model `small`, trace `8b49cd03...`) — inconclusive | Restore finish-reason searchability (see Rec. 3) |
| 5 | (fleet) | Native `span.status` not populated in Sentry | `span.status:ok` and `span.status:internal_error` both → 0 results/24h | Triage by `gh-aw.run.status`; see Rec. 1 & 4 |

**Not a failure (separated):** the slowest `gh-aw.agent.conclusion` spans (up to **3,011,599 ms ≈ 50 min**, e.g. Daily AW Cross-Repo Compile Check `27751740756`) are all `status:success` — expected long-running daily jobs, not timeouts.
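
The ranking in the table above can be reproduced from raw span rows with a small grouping step. The sketch below is illustrative only: it assumes each row is a flat map of span attributes (a hypothetical shape, not the actual Sentry response format), uses the emitted keys `gh-aw.workflow.name`, `gh-aw.run.status`, and `gh-aw.run.id`, and counts distinct run IDs so that several spans from one run are not double-counted. `rankFailingWorkflows` and `sampleSpans` are names invented for this example.

```javascript
// Hypothetical span rows: one flat attribute map per span.
// Run IDs are taken from this review; the row shape is an assumption.
const sampleSpans = [
  { "gh-aw.workflow.name": "Smoke Copilot", "gh-aw.run.status": "failure", "gh-aw.run.id": "27784259295" },
  // A second span from the same run must not be counted twice.
  { "gh-aw.workflow.name": "Smoke Copilot", "gh-aw.run.status": "failure", "gh-aw.run.id": "27784259295" },
  { "gh-aw.workflow.name": "Smoke Copilot", "gh-aw.run.status": "failure", "gh-aw.run.id": "27738605733" },
  { "gh-aw.workflow.name": "Smoke Copilot", "gh-aw.run.status": "failure", "gh-aw.run.id": "27737664591" },
  { "gh-aw.workflow.name": "Smoke Copilot", "gh-aw.run.status": "failure", "gh-aw.run.id": "27726924811" },
  { "gh-aw.workflow.name": "Matt Pocock Skills Reviewer", "gh-aw.run.status": "failure", "gh-aw.run.id": "27755558806" },
  { "gh-aw.workflow.name": "Matt Pocock Skills Reviewer", "gh-aw.run.status": "failure", "gh-aw.run.id": "27737630181" },
  { "gh-aw.workflow.name": "PR Sous Chef", "gh-aw.run.status": "failure", "gh-aw.run.id": "27780036106" },
  { "gh-aw.workflow.name": "Daily AW Cross-Repo Compile Check", "gh-aw.run.status": "success", "gh-aw.run.id": "27751740756" },
];

// Group failed spans by workflow, count distinct runs, and sort by count (descending).
function rankFailingWorkflows(spans) {
  const runsByWorkflow = new Map();
  for (const span of spans) {
    if (span["gh-aw.run.status"] !== "failure") continue;
    const name = span["gh-aw.workflow.name"] ?? "(unknown)";
    if (!runsByWorkflow.has(name)) runsByWorkflow.set(name, new Set());
    runsByWorkflow.get(name).add(String(span["gh-aw.run.id"]));
  }
  return [...runsByWorkflow.entries()]
    .map(([workflow, runs]) => ({ workflow, failedRuns: runs.size, runIds: [...runs].sort() }))
    .sort((a, b) => b.failedRuns - a.failedRuns || a.workflow.localeCompare(b.workflow));
}

for (const row of rankFailingWorkflows(sampleSpans)) {
  console.log(`${row.failedRuns} failed run(s): ${row.workflow}`);
}
```

Filtering on `gh-aw.run.status` rather than a native status field also keeps long but successful runs, such as the compile check above, out of the failure ranking.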

### Representative Traces

<details>
<summary>View representative traces</summary>

**Confirmed failure — Smoke Copilot, run `27784259295`, trace `0c0034be80c6343dff5c8b5e5734fd26`**
- Continuity intact across `gh-aw.pre_activation` → `activation` → `agent` → `push_experiments_state` (all share the trace).
- Failure localizes to `gh-aw.agent.conclusion` = **630,186 ms (~10.5 min), `gh-aw.run.status:failure`** and `gh-aw.agent.agent` = 436,491 ms.
- Inside the agent phase: a `gateway.backend.execute` / `mcp.tool_call` ran **76,156 ms (~76 s)** — a candidate contributing factor.
- Exporter healthy; this is an agent-execution failure, not an export or auth failure.
- Trace: https://github.sentry.io/explore/traces/trace/0c0034be80c6343dff5c8b5e5734fd26

**Latency outlier (healthy) — Daily AW Cross-Repo Compile Check, run `27751740756`, trace `83630186ba94b071ed242dbdf7776ca6`**
- `gh-aw.agent.conclusion` = 3,011,599 ms with `gh-aw.run.status:success`. Long but expected; not a reliability defect.

**Token outlier (inconclusive) — trace `8b49cd03b942b6a0c5dce166a460f6a0`**
- `gen_ai.usage.output_tokens` = 92,805, `gen_ai.request.model:small`. No `finish_reasons` present, so cannot confirm truncation vs. legitimate large output.

</details>

### Recommendations

1. **Triage by `gh-aw.run.status`, not `span.status`** (no code change). Update the Sentry saved query/playbook to the emitted keys — `gh-aw.workflow.name`, `gh-aw.run.status`, `gh-aw.run.id` — since the playbook's `gh_aw.workflow_name` / `span.status` / `github.run_id` return false negatives in Sentry's spans dataset.
2. **Investigate Smoke Copilot's recurring agent-phase failure** (4/24h). Start from run `27784259295` agent-step logs and the 76 s MCP tool call in trace `0c0034be...`.
3. **Restore finish-reason searchability.** `send_otlp_span.cjs:2143-2146` emits `gen_ai.response.finish_reasons` as an **array** attribute, but Sentry's spans dataset returns 0 spans for `has:gen_ai.response.finish_reasons` over 24h. Emit an additional **scalar** `gen_ai.response.finish_reason` alongside the array so truncation/length-stop is queryable.
4. **Surface failures via native span status, or document the canonical field.** Failures already set OTLP `status.code=2` (`send_otlp_span.cjs:2024,2060`), yet Sentry's `span.status` shows neither `ok` nor `internal_error`. Verify the OTLP→Sentry status mapping, or document `gh-aw.run.status` as the canonical outcome field for dashboards/alerts.
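
As an illustration of recommendation 3, the sketch below builds both the existing array attribute and the proposed scalar one in OTLP/JSON key-value form. It is not the code in `send_otlp_span.cjs`: `finishReasonAttributes` is a hypothetical helper, and using the first finish reason as the scalar value is an assumption that should be checked against how the reasons are actually ordered.

```javascript
// Hypothetical helper; the real attribute-building code in send_otlp_span.cjs may differ.
// Returns OTLP/JSON attributes for a list of finish reasons, or [] when there are none.
function finishReasonAttributes(finishReasons) {
  const reasons = Array.isArray(finishReasons)
    ? finishReasons.filter((r) => typeof r === "string" && r.length > 0)
    : [];
  if (reasons.length === 0) return [];

  return [
    // Array form, as already emitted.
    {
      key: "gen_ai.response.finish_reasons",
      value: { arrayValue: { values: reasons.map((r) => ({ stringValue: r })) } },
    },
    // Proposed scalar form. Assumption: the first reason is the one worth querying.
    {
      key: "gen_ai.response.finish_reason",
      value: { stringValue: reasons[0] },
    },
  ];
}

console.log(JSON.stringify(finishReasonAttributes(["length"]), null, 2));
```

Keeping the array attribute unchanged means existing consumers are unaffected, while the scalar gives the spans dataset a plain string field to filter on.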

### Notes

<details>
<summary>View notes</summary>

- **MCP build limitation:** this Sentry MCP exposes only `list_events`/`list_issue_events` — no `search_events` or `get_trace_details`. Trace continuity was verified via `list_events` filtered by `trace:<id>`. `list_events` caps `limit` at 100 and renders only a fixed field set plus explicitly-requested attributes.
- **Datasets:** the `errors` and `logs` (and `ourlogs`) datasets returned **0 events/24h**, so no error-event or log correlation is available and failures are observable only as span attributes. This is an observability gap, not a clean bill of health.
- **Emit vs. Sentry mapping (cross-checked in `actions/setup/js/send_otlp_span.cjs`):**
  - `github.run_id`, `github.run_attempt`, `service.version` are emitted as **resource** attributes (`:360`, `:423-424`) and are **not** exposed as searchable span fields in Sentry — their query "absence" is a backend mapping artifact, not an emit bug. Sentry's `release` (its mapping of `service.version`) **is** present.
  - OTLP `status.code` defaults to OK=1 (`:329`) and is set to ERROR=2 on agent-non-OK/failure (`:2024`, `:2060`); Sentry does not surface this as `span.status`.
- **PII scrubbing:** three failing runs render with `[Filtered]` workflow names (Sentry data scrubbing), which weakens attribution. Affected run IDs: `27738606387`, `27737665241`, `27726925449`.
- **Inconclusive vs. confirmed:** failures are **confirmed** via the `gh-aw.run.status` attribute + verified trace continuity. Truncation/runaway-token outcomes are **inconclusive** (no `finish_reasons` in Sentry). No timeouts were claimed — no `timed_out` status exists in the data.
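
Before investigating the Sentry side of the status mapping, it may help to confirm that the exported payload itself is consistent. The sketch below checks a single span in OTLP/JSON form, where `status.code` is an integer (0 = UNSET, 1 = OK, 2 = ERROR). The rule it enforces, that `gh-aw.run.status` of `failure` should pair with ERROR and `success` should not, is an assumption drawn from the notes above; `checkStatusConsistency` is a hypothetical helper.

```javascript
// OTLP status codes as defined by the OpenTelemetry trace protocol.
const OTLP_STATUS_CODE = Object.freeze({ UNSET: 0, OK: 1, ERROR: 2 });

// Read a string attribute from an OTLP/JSON span ({ key, value: { stringValue } }).
function getStringAttribute(span, key) {
  const attr = (span.attributes ?? []).find((a) => a.key === key);
  return attr?.value?.stringValue;
}

// Hypothetical check: does status.code agree with gh-aw.run.status?
// Assumed rule: "failure" pairs with ERROR, "success" must not be ERROR.
function checkStatusConsistency(span) {
  const runStatus = getStringAttribute(span, "gh-aw.run.status");
  const code = span.status?.code ?? OTLP_STATUS_CODE.UNSET;
  if (runStatus === undefined) {
    return { ok: true, reason: "no gh-aw.run.status attribute to compare" };
  }
  const isError = code === OTLP_STATUS_CODE.ERROR;
  if (runStatus === "failure" && !isError) {
    return { ok: false, reason: `run failed but status.code is ${code}` };
  }
  if (runStatus === "success" && isError) {
    return { ok: false, reason: "run succeeded but status.code is ERROR" };
  }
  return { ok: true, reason: "status.code agrees with gh-aw.run.status" };
}

const exampleSpan = {
  name: "gh-aw.agent.conclusion",
  status: { code: 2 },
  attributes: [{ key: "gh-aw.run.status", value: { stringValue: "failure" } }],
};
console.log(checkStatusConsistency(exampleSpan));
```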

**References:**
- [§27784259295](https://github.com/github/gh-aw/actions/runs/27784259295) — Smoke Copilot (recurring failure, rep. trace)
- [§27791592821](https://github.com/github/gh-aw/actions/runs/27791592821) — Daily Regulatory Report Generator (failure)
- [§27755558806](https://github.com/github/gh-aw/actions/runs/27755558806) — Matt Pocock Skills Reviewer (repeated failure)

</details>







> Generated by [🚨 Daily Reliability Review](https://github.com/github/gh-aw/actions/runs/27795432593) · 175.5 AIC · ⌖ 12.6 AIC · ⊞ 5.4K · [◷](https://github.com/search?q=repo%3Agithub%2Fgh-aw+is%3Aissue+%22gh-aw-workflow-call-id%3A+github%2Fgh-aw%2Fdaily-reliability-review%22&type=issues)
> - [x] expires on Jun 20, 2026, 3:33 PM UTC-08:00
