[experiment] On-device decompressed AssemblyStore cache (CoreCLR) by simonrozsival · Pull Request #11967 · dotnet/android

simonrozsival · 2026-07-03T07:33:57Z

Warning

Experimental prototype — not ready to merge. No MSBuild opt-in, no assembly-store version stamp, CoreCLR only. Opening as a draft to gather feedback and run CI.

Builds on top of the merged Zstd AssemblyStore compression (#11730). This prototype explores caching decompressed assemblies on-device so subsequent launches skip Zstd decompression and load assemblies via a file-backed mmap (clean, shareable pages) instead of dirty anonymous memory.

What it does

On-device decompression cache (src/native/clr/host/assembly-store.cc)

On a cache miss, a single background thread atomically writes the decompressed bytes to <codeCacheDir>/decompressed-assembly-cache/<Assembly.dll> (temp → fsync → rename).
On the next launch the file is mmap-ed (MAP_PRIVATE, COW) and decompression is skipped.
Per-assembly: only assemblies actually touched get cached.
Staleness guarded by an 8-byte footer holding an xxhash of the compressed payload.
Plumbs codeCacheDir (Context.getCodeCacheDir()) through Java initInternal → appDirs[3] → AndroidSystem, so Android auto-wipes a stale cache on app/platform update.
Runtime A/B toggle via the debug.net.asmcache system property (and XA_DISABLE_ASSEMBLY_CACHE env var).
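
The cache-hit path described above (map the file `MAP_PRIVATE`, then trust it only if the 8-byte footer matches) can be sketched as follows. This is a minimal illustration, not the code in assembly-store.cc: `MappedAssembly`, `try_map_cached`, and `kFooterSize` are hypothetical names, the token is assumed to be stored in native byte order, and the hash computation itself is out of scope (the caller passes in the expected value).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

// Assumption: the footer is a single 64-bit token in native byte order.
constexpr size_t kFooterSize = sizeof(uint64_t);

// Hypothetical result type; base == nullptr means "cache miss".
struct MappedAssembly {
    void*  base;          // start of the mapping
    size_t map_size;      // payload plus footer, needed for munmap
    size_t payload_size;  // bytes of the decompressed assembly image
};

// Maps a cached image and validates its footer against `expected_token`,
// which stands in for the hash of the compressed payload from the store.
inline MappedAssembly try_map_cached(const char* path, uint64_t expected_token)
{
    const MappedAssembly miss{nullptr, 0, 0};
    assert(path != nullptr);

    int fd = open(path, O_RDONLY | O_CLOEXEC);
    if (fd < 0) return miss;  // no cache file yet

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size <= static_cast<off_t>(kFooterSize)) {
        close(fd);
        return miss;  // unreadable or too short to hold payload + footer
    }
    const size_t size = static_cast<size_t>(st.st_size);

    // MAP_PRIVATE: any write by the runtime is copy-on-write and never
    // reaches the file, so untouched pages stay clean and file-backed.
    void* base = mmap(nullptr, size, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping remains valid after the descriptor is closed
    if (base == MAP_FAILED) return miss;

    uint64_t stored = 0;
    std::memcpy(&stored, static_cast<const uint8_t*>(base) + (size - kFooterSize), kFooterSize);
    if (stored != expected_token) {
        munmap(base, size);  // stale entry: fall back to decompression
        return miss;
    }
    return MappedAssembly{base, size, size - kFooterSize};
}
```

A caller would fall back to the normal decompression path whenever `base` is `nullptr`, which covers both a missing file and a footer mismatch.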

Max Zstd compression level (22)

Decompression speed is independent of the Zstd level, so max compression only costs a few extra seconds of build time (fine for Release) in exchange for the smallest store (~15% smaller than the default level 3, ~29% smaller than LZ4 on a sample MAUI app).

Benchmarks (cache on/off)

Measured on #11730 while prototyping on this branch. MauiBench (dotnet new maui, blank + --sample-content), Release / CoreCLR / default partial-R2R + PGO / android-arm64, built at max Zstd level (22) with extractNativeLibs=false (what Google Play ships for .aab on API 26+). Device: Samsung Galaxy A16 (SM-A165F), Android 16 (API 36). Settled harness: am start -S -W TotalTime, force-stop + 10 s between every launch, order-balanced interleaving, n=20/cell. Cache toggled via adb shell setprop debug.net.asmcache 0|1.

Warm-start latency — cache OFF vs ON (WARM is the trustworthy metric; see caveats):

| app | warm OFF (mean) | warm ON (mean) | Δ (Welch t) |
| --- | --- | --- | --- |
| `maui` blank | 1062.8 ms | 1035.0 ms | −27.8 ms (t = −2.55, sig) |
| `maui --sample-content` | 2205.6 ms | 2156.6 ms | −49.0 ms (t = −5.91, sig) |

The cache's warm benefit scales with store size (bigger store → more decompression avoided per launch). An earlier prototype run (Zstd L3, extractNativeLibs=true) showed the same direction: sample-content warm mean 2264 ms (OFF) → 2226 ms (ON), −38 ms (t = −3.3, sig), vs 2204 ms for LZ4 on main, i.e. the cache recovers ~2/3 of Zstd's decompression penalty vs LZ4 but does not quite reach LZ4 parity (~+22 ms).
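
For readers who want to recompute the significance column from the raw launch times, a Welch t statistic can be sketched as below. This is an illustrative helper, not the harness used for the numbers above: `SampleStats`, `summarize`, and `welch_t` are hypothetical names, and the sign convention (first sample minus second) is chosen so that `welch_t(on, off)` is negative when the cache is faster.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Mean, unbiased sample variance, and count of one benchmark cell.
struct SampleStats {
    double mean;
    double variance;
    size_t n;
};

inline SampleStats summarize(const std::vector<double>& xs)
{
    assert(xs.size() >= 2);  // variance needs at least two samples
    double sum = 0.0;
    for (double x : xs) sum += x;
    const double mean = sum / static_cast<double>(xs.size());

    double squared_dev = 0.0;
    for (double x : xs) squared_dev += (x - mean) * (x - mean);
    return SampleStats{mean, squared_dev / static_cast<double>(xs.size() - 1), xs.size()};
}

// Welch's t: difference of means divided by the unpooled standard error.
// It does not assume equal variances in the two groups.
inline double welch_t(const std::vector<double>& a, const std::vector<double>& b)
{
    const SampleStats sa = summarize(a);
    const SampleStats sb = summarize(b);
    const double standard_error =
        std::sqrt(sa.variance / static_cast<double>(sa.n) +
                  sb.variance / static_cast<double>(sb.n));
    return (sa.mean - sb.mean) / standard_error;
}
```

Deciding whether a given t is significant additionally requires the Welch-Satterthwaite degrees of freedom and a t-distribution lookup, which this sketch leaves out.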

Store / download size is unaffected by the cache (it only trades on-device disk for warm-start CPU). Level 22 is the actionable size lever, independent of the cache (sample-content app, arm64):

| | LZ4 (`main`) | Zstd L3 | Zstd L22 |
| --- | --- | --- | --- |
| store `.so` | 9.21 MB | 7.65 MB | 6.52 MB |
| vs LZ4 | — | −16.9% | −29.2% |
| vs Zstd L3 | — | — | −14.8% |

With extractNativeLibs=false the store is Stored (uncompressed) in the APK, so the APK delta ≈ the store delta: L22 takes ~2.7 MB off the sample app's download vs LZ4 (~1.1 MB vs Zstd L3).
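
The percentages and MB deltas quoted here follow directly from the three store sizes. A small sketch of that arithmetic, with `percent_change` as a hypothetical helper name:

```cpp
#include <cassert>
#include <cmath>

// Relative size change in percent when going from `from_mb` to `to_mb`.
// Negative values mean the store got smaller.
constexpr double percent_change(double from_mb, double to_mb)
{
    return (to_mb - from_mb) / from_mb * 100.0;
}

// Store sizes from the table (sample-content app, arm64), in MB.
constexpr double kLz4Mb    = 9.21;
constexpr double kZstdL3Mb = 7.65;
constexpr double kZstdL22Mb = 6.52;

// Absolute savings of level 22 against each baseline, in MB.
constexpr double kSavedVsLz4Mb    = kLz4Mb - kZstdL22Mb;     // about 2.7 MB
constexpr double kSavedVsZstdL3Mb = kZstdL3Mb - kZstdL22Mb;  // about 1.1 MB
```

Evaluating `percent_change` on the three pairs gives roughly −16.9% (LZ4 to L3), −29.2% (LZ4 to L22), and −14.8% (L3 to L22).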

Caveats / honest read

The warm-start effect is small (tens of ms on a ~1–2 s startup, ~1–3%). Decompression isn't the MAUI startup bottleneck (CLR + framework init + first render dominate), so the latency case for the cache is weak.
Ignore the COLD column. On cold the cache is empty (pm clear before each launch), so ON can only do more work; the observed cold swings (both directions across apps in the same session) are a CPU-governor artifact of the background writer thread, SoC-specific and not portable.
Compression level and the cache are orthogonal. Zstd decode speed is level-invariant, so L22 is a ~29% size win at zero runtime cost without the cache — "the cache lets us crank compression" is not the real relationship.
Not yet measured: RAM/PSS. Converting the ~13 MB dirty-anonymous decompression buffer into clean file-backed mmap pages is likely the stronger justification than latency; that measurement is still outstanding.

Full write-ups and raw numbers: #11730 (comment) and #11730 (comment).

Notes

This branch was rebased onto latest main after the Zstd compression work merged; the Microsoft.Android.Build.Tasks / CompressAssemblies refactor is now part of main and no longer appears here.

Prototype exploring caching decompressed assemblies on-device so that subsequent launches skip zstd decompression and load the data via a file-backed mmap instead of dirty anonymous memory.

- assembly-store.cc: on a decompression cache miss, a single background thread atomically writes the decompressed bytes to <codeCacheDir>/decompressed-assembly-cache/<Assembly.dll> (temp -> fsync -> rename). On the next launch the file is mmap'd (MAP_PRIVATE, COW) and decompression is skipped. Per-assembly, only assemblies actually touched are cached. Staleness guarded by an 8-byte footer holding an xxhash of the compressed payload.
- Plumb codeCacheDir (Context.getCodeCacheDir()) through Java initInternal -> appDirs[3] -> AndroidSystem, so a stale cache is auto-wiped by Android on app/platform update.
- Runtime A/B toggle via `debug.net.asmcache` system property (and XA_DISABLE_ASSEMBLY_CACHE env var).

Experimental only: no MSBuild opt-in, no assembly-store version stamp, CoreCLR only.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The background writer thread read directly from the shared uncompressed_assemblies_data_buffer, but on a cache miss that same buffer is handed to the runtime once the decompress lock is released, and the runtime may write into the assembly image (the reason the cache-hit path maps the file MAP_PRIVATE / COW). Concurrent writes could persist a torn or post-mutation image; since the staleness footer only hashes the *compressed* payload, that corrupt image would then be reloaded from cache as if pristine on the next launch.

Take a private snapshot of the decompressed bytes in enqueue_write, while the caller still holds assembly_decompress_mutex and before the buffer is exposed to the runtime, so the writer only ever touches immutable memory it owns. On allocation failure we skip caching that assembly rather than aborting.

Trade-off: this adds one memcpy per newly-cached assembly on the first-launch (cache-miss) path and holds the queued snapshots (up to the touched working set) transiently until the writer drains them. Subsequent launches hit the mmap path and never enqueue.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
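
The snapshot-at-enqueue idea from this commit can be sketched roughly as follows. Only the name `enqueue_write` comes from the commit message; `CacheWriterQueue`, `try_dequeue`, the `WriteRequest` fields, and the use of `std::deque` are assumptions made for illustration, and the real code in assembly-store.cc may be structured quite differently.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <deque>
#include <memory>
#include <mutex>
#include <new>
#include <string>
#include <utility>

// Hypothetical request: the writer owns `snapshot`, so later mutation of the
// shared decompression buffer by the runtime cannot affect what is persisted.
struct WriteRequest {
    std::string file_name;
    std::unique_ptr<uint8_t[]> snapshot;
    size_t size;
};

class CacheWriterQueue {
public:
    // Intended to be called while the caller still holds the decompression
    // lock, i.e. before the shared buffer is exposed to the runtime.
    // Returns false when the snapshot could not be allocated; the assembly
    // is then simply not cached.
    bool enqueue_write(std::string file_name, const uint8_t* data, size_t size)
    {
        assert(data != nullptr || size == 0);
        std::unique_ptr<uint8_t[]> copy(new (std::nothrow) uint8_t[size]);
        if (!copy) return false;
        if (size != 0) std::memcpy(copy.get(), data, size);

        WriteRequest request;
        request.file_name = std::move(file_name);
        request.snapshot = std::move(copy);
        request.size = size;

        std::lock_guard<std::mutex> lock(queue_mutex_);
        pending_.push_back(std::move(request));
        return true;
    }

    // Writer-thread side: takes the oldest request, if any.
    bool try_dequeue(WriteRequest& out)
    {
        std::lock_guard<std::mutex> lock(queue_mutex_);
        if (pending_.empty()) return false;
        out = std::move(pending_.front());
        pending_.pop_front();
        return true;
    }

private:
    std::mutex queue_mutex_;
    std::deque<WriteRequest> pending_;
};
```

The cost is the one `memcpy` per newly cached assembly that the commit message calls out; in exchange the writer never reads memory that another thread may be modifying.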

Tidy up the file-writing path without pulling <fstream>/iostreams into the runtime .so (only the build-time pinvoke-table generator uses those; the runtime deliberately sticks to raw syscalls to keep the library small and startup cheap).

- Lay out the full on-disk image ([payload][8-byte token footer]) in the snapshot buffer at enqueue time, so the writer emits it in a single contiguous write. This drops the separate footer write (and with it a bug: that write didn't handle EINTR/partial writes) and lets WriteRequest lose its token field.
- Extract a write_fully() helper for the EINTR/partial-write retry loop, leaving writer_loop as open -> write_fully -> fsync -> close -> rename.

No behavior change: the cache file format is identical, so existing cache files remain valid.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
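
The `[payload][8-byte token footer]` layout can be illustrated with a short sketch. The helper names `build_cache_image` and `read_footer_token` are hypothetical, `std::vector` stands in for whatever snapshot buffer the runtime actually uses, and the token is assumed to be stored in native byte order (reasonable for a cache that never leaves the device, but not confirmed by the commit message).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

constexpr size_t kTokenSize = sizeof(uint64_t);

// Builds the complete on-disk image in one contiguous buffer so that the
// writer can emit it with a single write call sequence.
inline std::vector<uint8_t> build_cache_image(const uint8_t* payload,
                                              size_t payload_size,
                                              uint64_t token)
{
    assert(payload != nullptr || payload_size == 0);
    std::vector<uint8_t> image(payload_size + kTokenSize);
    if (payload_size != 0) std::memcpy(image.data(), payload, payload_size);
    std::memcpy(image.data() + payload_size, &token, kTokenSize);
    return image;
}

// Reads the token back from the last 8 bytes of an image.
// Returns false when the image is too short to contain a footer.
inline bool read_footer_token(const uint8_t* image, size_t image_size, uint64_t& token_out)
{
    if (image == nullptr || image_size < kTokenSize) return false;
    std::memcpy(&token_out, image + (image_size - kTokenSize), kTokenSize);
    return true;
}
```

Because the footer sits at the end, the payload starts at offset 0 of the file, so a mapped cache file can be handed to the runtime without any offset adjustment.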

Move the open/write/fsync/close/rename ceremony into a dedicated write_cache_file() method so writer_loop() only owns the concurrency concerns (waiting on the queue, dequeuing under the lock) and delegates the actual persistence. No behavior change. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
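
The open -> write_fully -> fsync -> close -> rename sequence named in these commits can be sketched with raw POSIX calls as below. The function names `write_fully` and `write_cache_file` come from the commit messages, but their signatures here, the `.tmp` suffix, and the file mode are assumptions for illustration only.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

// Writes all `size` bytes, retrying on EINTR and continuing after
// partial writes. Returns false on any other error.
inline bool write_fully(int fd, const uint8_t* data, size_t size)
{
    size_t written = 0;
    while (written < size) {
        const ssize_t n = write(fd, data + written, size - written);
        if (n < 0) {
            if (errno == EINTR) continue;  // interrupted: retry the same range
            return false;
        }
        written += static_cast<size_t>(n);
    }
    return true;
}

// Persists `image` atomically: readers see either the old file (or no file)
// or the complete new file, never a partially written one.
inline bool write_cache_file(const std::string& final_path, const uint8_t* image, size_t size)
{
    assert(image != nullptr || size == 0);
    const std::string tmp_path = final_path + ".tmp";  // hypothetical temp naming

    const int fd = open(tmp_path.c_str(), O_WRONLY | O_CREAT | O_TRUNC | O_CLOEXEC, 0600);
    if (fd < 0) return false;

    // fsync before rename so the data is durable before the name becomes visible.
    bool ok = write_fully(fd, image, size) && fsync(fd) == 0;
    ok = (close(fd) == 0) && ok;

    if (ok && std::rename(tmp_path.c_str(), final_path.c_str()) == 0) return true;

    unlink(tmp_path.c_str());  // best-effort cleanup of the temp file
    return false;
}
```

`rename` within one directory replaces the target atomically on POSIX file systems, which is what makes a crash mid-write leave at worst a stray temp file rather than a truncated cache entry.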

simonrozsival force-pushed the dev/simonrozsival/assembly-store-decompression-cache branch from 7cac3de to 9d831a5 on July 3, 2026 08:08

simonrozsival force-pushed the dev/simonrozsival/assembly-store-decompression-cache branch from 9d831a5 to f1de798 on July 3, 2026 08:12

simonrozsival and others added 2 commits July 3, 2026 10:18

simonrozsival added the copilot label (`copilot-cli` or other AIs were used to author this) on Jul 3, 2026

simonrozsival wants to merge 4 commits into main from dev/simonrozsival/assembly-store-decompression-cache
