WIP - PlatformAudio instability investigation #180
Draft
alan-george-lk wants to merge 11 commits into
Drop nightly.yml (superseded by platform-audio-triage.yml).

Make the triage workflow the single focused, mac-only crash-hunting tool:

- Remove the temporary pull_request trigger (workflow_dispatch only) so it stops doing a ~20-minute build on every PR push.
- Cache the Rust submodule build (Swatinem/rust-cache) to skip the cold build on re-runs.
- Raise dispatch defaults to repeat=500 / pin_iterations=200 now that the test loop is confirmed cheap relative to the build.

Co-authored-by: Cursor <cursoragent@cursor.com>
Collaborator
Could you please describe the problems this investigation is targeting?
Instrument both triage arms with a background sampler that records RSS, thread count, fd count, and mach-port count of the integration-test process over time. Mach-port growth is the tell for a CoreAudio HAL client leak across ADM dispose/recreate cycles. CSVs are uploaded as artifacts and first/last deltas are surfaced in the job summary.

Also lower the default repeat from 500 to 200 so Arm A finishes and yields a full leak curve instead of timing out.

Co-authored-by: Cursor <cursoragent@cursor.com>
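As a rough illustration of the first/last deltas mentioned above, the following Python sketch computes them from a sampler CSV. The column names (`rss_kb`, `threads`, `fds`, `mach_ports`) and the sample rows are assumptions made up for the example; the real sampler's header and values may differ.

```python
import csv
import io

# Hypothetical column names; the actual sampler CSV header may differ.
METRICS = ("rss_kb", "threads", "fds", "mach_ports")


def first_last_deltas(csv_text):
    """Return {metric: last - first} for each metric column in the CSV."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if len(rows) < 2:
        return {}  # not enough samples to form a delta
    first, last = rows[0], rows[-1]
    return {m: int(last[m]) - int(first[m]) for m in METRICS}


# Made-up samples, for illustration only (not data from any run).
sample = """\
elapsed_s,rss_kb,threads,fds,mach_ports
0,120000,14,32,210
10,180000,14,32,260
20,250000,14,33,315
"""

deltas = first_last_deltas(sample)
for metric, delta in deltas.items():
    print(f"{metric}: {delta:+d}")
```

A steadily positive `mach_ports` delta with flat threads and fds would point at the HAL client leak described above; a flat one points elsewhere.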
The resource sampler matched its own command line (argv contains the process pattern) and picked it via head -1, so it measured an idle 1-thread shell instead of the test binary. Select the matching PID with the largest RSS instead, excluding the sampler's own PID, so it tracks the real instrumented binary.

Drop --gtest_break_on_failure from the triage arms: it converted ordinary EXPECT failures (e.g. "no platform audio frames received") into SIGTRAP core dumps (a misleading ~2GB artifact) and halted the repeat loop before the sampler could capture the full curve.

Add Arm C, which runs only PlatformAudioFramesReachRemote with a small repeat, to distinguish "frame flow dead on a fresh ADM" from "frame flow only dies after prior teardown/recreate cycles churn the ADM".

Co-authored-by: Cursor <cursoragent@cursor.com>
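The PID-selection fix can be sketched in Python as follows. This is not the sampler's actual code (which is a shell script); the `Proc` record, the `integration_tests` binary name, and the process table are hypothetical and only illustrate the rule "largest RSS among matches, excluding the sampler itself".

```python
from typing import NamedTuple, Optional


class Proc(NamedTuple):
    pid: int
    rss_kb: int
    command: str


def pick_target(procs, pattern, self_pid) -> Optional[int]:
    """Pick the matching PID with the largest RSS, skipping the sampler's own PID."""
    candidates = [
        p for p in procs
        if pattern in p.command and p.pid != self_pid
    ]
    if not candidates:
        return None  # test binary not started yet, or already exited
    return max(candidates, key=lambda p: p.rss_kb).pid


# Hypothetical process table: the sampler's own argv contains the pattern too,
# which is why taking the first match measured the wrong process.
procs = [
    Proc(100, 1_200, "sh sampler.sh integration_tests"),
    Proc(101, 3_500, "sh -c ./integration_tests --gtest_repeat=200"),
    Proc(102, 480_000, "./integration_tests --gtest_repeat=200"),
]

target = pick_target(procs, "integration_tests", self_pid=100)
print(target)
```

Choosing by RSS also skips any thin wrapper shell whose command line happens to contain the binary name.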
The instability reproduces on Apple Silicon too (arm64 integration tests have been seen to SIGSEGV, exit 139), not just Intel x64. Replace the single-runner input with a matrix that fans "all" out across one Intel (macos-15-large) and one arm64 (macos-15) runner by default, while still allowing a single runner to be targeted. Artifact names are suffixed with the runner so the parallel arms don't collide on upload.

Co-authored-by: Cursor <cursoragent@cursor.com>
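The fan-out and artifact-naming logic described above amounts to something like the Python sketch below. The runner labels come from the commit message; the function names and the `sampler-csv` artifact base name are hypothetical, and the real workflow expresses this in its matrix configuration rather than in Python.

```python
# Runner labels named in the commit message: one Intel, one arm64.
ALL_RUNNERS = ["macos-15-large", "macos-15"]


def resolve_runners(runner_input):
    """Expand "all" to every runner; otherwise target the single named runner."""
    if runner_input == "all":
        return list(ALL_RUNNERS)
    return [runner_input]


def artifact_name(base, runner):
    """Suffix the artifact with the runner so parallel uploads stay distinct."""
    return f"{base}-{runner}"


# "sampler-csv" is a made-up artifact base name for illustration.
names = [artifact_name("sampler-csv", r) for r in resolve_runners("all")]
print(names)
```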
The corrected resource sampler showed RSS growing unbounded (to ~4.9 GB on Intel before it crashed) while threads, fds, and mach ports stay flat, and the growth reproduces even with the ADM pinned -- i.e. a heap leak in the per-room publish/subscribe cycle, not an ADM-teardown or handle leak.

Add Arm D, which runs the pinned-cycle reproducer under macOS `leaks --atExit` with MallocStackLogging so each still-allocated block is reported with its allocating backtrace (symbol + file:line). This names the leaking call site (C++ SDK vs Rust FFI) directly. The report is uploaded as an artifact and a small leak_iterations input keeps the stack-logging overhead bounded.

Co-authored-by: Cursor <cursoragent@cursor.com>
The leak is reachable retention, not a lost-pointer leak: `leaks` reports 0 on both arches because the growing memory is still referenced. `leaks --atExit` also can't see it (cleanup reclaims at shutdown). Code review ruled out the obvious C++ suspects -- the FFI response buffer handle is already dropped via an FfiHandle guard in sendRequest, and Room deregisters its FfiClient listener on disconnect/destruction.

Add Arm E, which runs the dispose+recreate path (the worst leaker) under MallocStackLogging and samples the LIVE heap mid-run via `heap` and `malloc_history` (new heap_snapshots.sh). Diffing successive heap summaries plus the malloc_history stacks names the growing allocation type and its call site so we can localize the retention to C++ SDK vs Rust FFI vs WebRTC.

Co-authored-by: Cursor <cursoragent@cursor.com>
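The "diff successive heap summaries" step can be illustrated with a small Python sketch. It assumes each snapshot has already been parsed into a mapping from allocation type to live bytes; the parsing itself, the type names, and the numbers are all hypothetical and do not reflect real `heap` output.

```python
def heap_growth(before, after, top=5):
    """Return up to `top` (type, byte_growth) pairs, largest growth first."""
    growth = {name: after[name] - before.get(name, 0) for name in after}
    grown = [(name, delta) for name, delta in growth.items() if delta > 0]
    return sorted(grown, key=lambda item: item[1], reverse=True)[:top]


# Made-up snapshots taken at two ticks, for illustration only.
snap_t0 = {"ExampleTypeA": 4_000_000, "ExampleTypeB": 1_000_000, "ExampleTypeC": 250_000}
snap_t1 = {"ExampleTypeA": 9_500_000, "ExampleTypeB": 1_000_000, "ExampleTypeC": 300_000,
           "ExampleTypeD": 120_000}

growing = heap_growth(snap_t0, snap_t1)
for name, delta in growing:
    print(f"{name}: +{delta} bytes")
```

Types that grow on every tick are the candidates to look up in the `malloc_history` stacks; types that stay flat, like `ExampleTypeB` here, drop out of the report.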
First run plateaued before the snapshot window opened, so all heap samples were identical and malloc_history (gated to the last ticks) never fired. The steady-state heap was already telling: dominated by webrtc::Codec copies and StatsReport entries in liblivekit_ffi.dylib.

Snapshot every 10s (not 25s), allow more ticks, raise the dispose-path repeat to 60 so the process keeps churning across the window, and capture malloc_history (-allBySize | head) on every tick so we always get the allocating backtraces even if the process exits or hangs early.

Co-authored-by: Cursor <cursoragent@cursor.com>
Resolve submodule conflict by keeping the latest bugfix/ffi_handle_cleanup commit (6881168d) instead of main's dynacast bump. Co-authored-by: Cursor <cursoragent@cursor.com>
No description provided.