WIP - PlatformAudio instability investigation by alan-george-lk · Pull Request #180 · livekit/client-sdk-cpp

alan-george-lk · 2026-06-23T03:25:58Z

No description provided.

Drop nightly.yml (superseded by platform-audio-triage.yml). Make the triage workflow the single focused, mac-only crash-hunting tool:

- Remove the temporary pull_request trigger (workflow_dispatch only) so it stops doing a ~20-minute build on every PR push.
- Cache the Rust submodule build (Swatinem/rust-cache) to skip the cold build on re-runs.
- Raise dispatch defaults to repeat=500 / pin_iterations=200 now that the test loop is confirmed cheap relative to the build.

Co-authored-by: Cursor <cursoragent@cursor.com>

xianshijing-lk · 2026-06-23T05:20:05Z

Could you please describe the problems this investigation is addressing?

Instrument both triage arms with a background sampler that records RSS, thread count, fd count, and mach-port count of the integration-test process over time. Mach-port growth is the tell for a CoreAudio HAL client leak across ADM dispose/recreate cycles. CSVs are uploaded as artifacts and first/last deltas are surfaced in the job summary.

Also lower the default repeat from 500 to 200 so Arm A finishes and yields a full leak curve instead of timing out.

Co-authored-by: Cursor <cursoragent@cursor.com>
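As a rough illustration of the first/last delta that the job summary surfaces, here is a minimal sketch. The CSV column names and the sample rows are made up for the example; the real sampler's header and values may differ.

```python
import csv
import io

# Assumed column names; the actual sampler CSV may use different ones.
METRICS = ("rss_kb", "threads", "fds", "mach_ports")


def first_last_deltas(csv_text):
    """Return {metric: last - first} for a sampler CSV with one row per tick."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if len(rows) < 2:
        return {}  # not enough samples to compute a delta
    first, last = rows[0], rows[-1]
    return {m: int(last[m]) - int(first[m]) for m in METRICS}


# Illustrative rows only, not measurements from the PR.
sample = """elapsed_s,rss_kb,threads,fds,mach_ports
0,120000,24,40,310
5,180000,24,40,310
10,260000,25,40,312
"""

deltas = first_last_deltas(sample)
for metric, delta in deltas.items():
    print(f"{metric}: {delta:+d}")
```

A steadily positive mach_ports delta would point toward the HAL client leak described above, while a flat one would point elsewhere.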

The resource sampler matched its own command line (argv contains the process pattern) and picked it via head -1, so it measured an idle 1-thread shell instead of the test binary. Select the matching PID with the largest RSS instead, excluding the sampler's own PID, so it tracks the real instrumented binary.

Drop --gtest_break_on_failure from the triage arms: it converted ordinary EXPECT failures (e.g. "no platform audio frames received") into SIGTRAP core dumps (a misleading ~2GB artifact) and halted the repeat loop before the sampler could capture the full curve.

Add Arm C, which runs only PlatformAudioFramesReachRemote with a small repeat, to distinguish "frame flow dead on a fresh ADM" from "frame flow only dies after prior teardown/recreate cycles churn the ADM".

Co-authored-by: Cursor <cursoragent@cursor.com>
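The PID selection fix can be sketched as a pure function over a process listing. This is an assumption-laden model, not the workflow's actual shell code: the `Proc` record, the binary name `livekit_integration_tests`, and the sample processes are all hypothetical.

```python
from typing import NamedTuple, Optional


class Proc(NamedTuple):
    pid: int
    rss_kb: int
    command: str


def pick_target_pid(procs, pattern, self_pid) -> Optional[int]:
    """Pick the matching process with the largest RSS, skipping the sampler itself."""
    candidates = [p for p in procs if pattern in p.command and p.pid != self_pid]
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.rss_kb).pid


# Hypothetical listing: the sampler and a grep both carry the pattern in argv.
procs = [
    Proc(101, 1800, "sh sampler.sh livekit_integration_tests"),
    Proc(102, 2100, "grep livekit_integration_tests"),
    Proc(200, 950000, "./livekit_integration_tests --gtest_repeat=200"),
]

target = pick_target_pid(procs, "livekit_integration_tests", self_pid=101)
print(target)
```

Choosing by largest RSS is a heuristic: it also filters out other small helpers (such as the grep above) that happen to mention the pattern, which a first-match pick would not.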

The instability reproduces on Apple Silicon too (arm64 integration tests have been seen to SIGSEGV, exit 139), not just Intel x64. Replace the single-runner input with a matrix that fans "all" out across one Intel (macos-15-large) and one arm64 (macos-15) runner by default, while still allowing a single runner to be targeted. Artifact names are suffixed with the runner so the parallel arms don't collide on upload. Co-authored-by: Cursor <cursoragent@cursor.com>
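The fan-out and artifact naming can be modelled in a few lines. The real logic lives in the GitHub Actions workflow; this Python sketch only mirrors the behaviour described above, and the function names and the `resource-csv` base name are hypothetical.

```python
# Runner labels come from the commit message; the "all" default set is as described there.
DEFAULT_RUNNERS = ["macos-15-large", "macos-15"]  # Intel, arm64


def expand_runners(selection):
    """Fan "all" out to both runners; otherwise target the single named runner."""
    if selection == "all":
        return list(DEFAULT_RUNNERS)
    return [selection]


def artifact_name(base, runner):
    """Suffix the artifact with the runner so parallel arms don't collide on upload."""
    return f"{base}-{runner}"


names = [artifact_name("resource-csv", r) for r in expand_runners("all")]
print(names)
```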

The corrected resource sampler showed RSS growing unbounded (to ~4.9 GB on Intel before it crashed) while threads, fds, and mach ports stayed flat, and the growth reproduced even with the ADM pinned -- i.e. a heap leak in the per-room publish/subscribe cycle, not an ADM-teardown or handle leak.

Add Arm D, which runs the pinned-cycle reproducer under macOS `leaks --atExit` with MallocStackLogging so each still-allocated block is reported with its allocating backtrace (symbol + file:line). This names the leaking call site (C++ SDK vs Rust FFI) directly. The report is uploaded as an artifact and a small leak_iterations input keeps the stack-logging overhead bounded.

Co-authored-by: Cursor <cursoragent@cursor.com>
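A minimal sketch of how an Arm D style invocation could be assembled. `leaks --atExit -- <command>` and the `MallocStackLogging=1` environment variable are standard macOS tooling; the binary path, the test filter, and the mapping of leak_iterations onto `--gtest_repeat` are assumptions made for this example only.

```python
import os


def leaks_at_exit_command(test_binary, gtest_filter, iterations):
    """Build argv and env for running a test binary under `leaks --atExit`."""
    # MallocStackLogging=1 makes malloc record allocation backtraces.
    env = dict(os.environ, MallocStackLogging="1")
    argv = [
        "leaks", "--atExit", "--",
        test_binary,
        f"--gtest_filter={gtest_filter}",
        # Assumption: the iteration count is passed through as a gtest repeat.
        f"--gtest_repeat={iterations}",
    ]
    return argv, env


# Hypothetical binary and filter names.
argv, env = leaks_at_exit_command("./integration_tests", "PlatformAudio*", 5)
print(" ".join(argv))
```

On a macOS runner the pair could be handed to `subprocess.run(argv, env=env)`; the sketch stops short of that so it stays runnable on any platform.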

The leak is reachable retention, not a lost-pointer leak: `leaks` reports 0 on both architectures because the growing memory is still referenced. `leaks --atExit` also can't see it (cleanup reclaims at shutdown). Code review ruled out the obvious C++ suspects -- the FFI response buffer handle is already dropped via an FfiHandle guard in sendRequest, and Room deregisters its FfiClient listener on disconnect/destruction.

Add Arm E, which runs the dispose+recreate path (the worst leaker) under MallocStackLogging and samples the LIVE heap mid-run via `heap` and `malloc_history` (new heap_snapshots.sh). Diffing successive heap summaries plus the malloc_history stacks names the growing allocation type and its call site so we can localize the retention to C++ SDK vs Rust FFI vs WebRTC.

Co-authored-by: Cursor <cursoragent@cursor.com>
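The diffing step can be sketched as follows, assuming each heap summary has already been parsed into a mapping from allocation type to live bytes. Parsing the `heap` tool's output is not shown, and the type names and byte counts are placeholders rather than results from this investigation.

```python
def heap_growth(before, after, top=3):
    """Rank allocation types by growth in live bytes between two snapshots."""
    names = set(before) | set(after)
    growth = {n: after.get(n, 0) - before.get(n, 0) for n in names}
    # Largest growth first; ties broken by name for a stable order.
    ranked = sorted(growth.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(n, d) for n, d in ranked if d > 0][:top]


# Placeholder snapshots, not real measurements.
before = {"TypeA": 1000, "TypeB": 500, "TypeC": 200}
after = {"TypeA": 5000, "TypeB": 520, "TypeC": 200, "TypeD": 300}

top_growers = heap_growth(before, after)
for name, delta in top_growers:
    print(f"{name}: +{delta} bytes")
```

Types that keep appearing at the top across successive pairs of snapshots are the ones worth looking up in the malloc_history stacks.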

First run plateaued before the snapshot window opened, so all heap samples were identical and malloc_history (gated to the last ticks) never fired. The steady-state heap was already telling: dominated by webrtc::Codec copies and StatsReport entries in liblivekit_ffi.dylib.

Snapshot every 10s (not 25s), allow more ticks, raise the dispose-path repeat to 60 so the process keeps churning across the window, and capture malloc_history (-allBySize | head) on every tick so we always get the allocating backtraces even if the process exits or hangs early.

Co-authored-by: Cursor <cursoragent@cursor.com>

Resolve submodule conflict by keeping the latest bugfix/ffi_handle_cleanup commit (6881168d) instead of main's dynacast bump. Co-authored-by: Cursor <cursoragent@cursor.com>

alan-george-lk and others added 2 commits June 22, 2026 21:25

New changes to catch audio bug

bbed188

alan-george-lk and others added 9 commits June 23, 2026 09:19

Possible race in local_participant

13f8b11

Maybe fix shutdown

3980fb4

Merge origin/main; keep client-sdk-rust at bugfix/ffi_handle_cleanup

44df5f7


alan-george-lk wants to merge 11 commits into main from feature/platform-audio-stability
