Skip to content

Stop datasets auto-saving on construction (fixes region-scope corruption)#450

Merged
anth-volk merged 1 commit into
mainfrom
fix-dataset-autosave-corruption
Jul 2, 2026
Merged

Stop datasets auto-saving on construction (fixes region-scope corruption)#450
anth-volk merged 1 commit into
mainfrom
fix-dataset-autosave-corruption

Conversation

@anth-volk

Copy link
Copy Markdown
Contributor

Fixes #449

Problem

Constructing a PolicyEngineUSDataset / PolicyEngineUKDataset with in-memory data= auto-saved to self.filepath (via model_post_init). Region scoping in run() built its filtered copy with filepath=<the source file>, so constructing the scoped copy truncated and overwrote the shared source dataset with a single region's households.

In the simulation API this corrupted the baked single-year dataset file inside reused Modal containers: after a state-scoped run, the shared file held only that state's rows, so the next request on the same container either crashed (ValueError: No households found matching state_fips=N) or, for a national request, silently computed a "national" result on ~1–2% of the population — fast and wrong. This drove ~50% state-run failure rates in 51-state fan-outs (breaking downstream congressional-district breakdowns) and a confusing mix of fast-but-wrong vs. slow-but-correct national runs depending on container reuse.

Fix

  • Remove the auto-save branch from model_post_init in both country dataset classes; keep auto-load. Persistence is now always explicit via save().
  • run() loads the input dataset only when the caller did not already supply in-memory data (if dataset.data is None: dataset.load()) instead of unconditionally — this is what made the autosave load-bearing, and the change keeps in-memory / notebook workflows working without a disk round trip.
  • Dataset.filepath becomes Optional[str]; region-scoped copies are constructed with filepath=None; save() raises a clear error when filepath is None.
  • Removes the redundant double-writes of create_datasets outputs and simulation output datasets.

Why this is safe / why the autosave existed

The autosave (PR #180, 2025-11) had one inline rationale — "synchronised between in-memory and storage." No documentation promises that construction persists, and no issue tracked it. It was only load-bearing because run() re-read the input dataset from disk every time; making that read conditional removes the need to persist on construction.

Org-wide search: only policyengine-api-v2 (consumes via ensure_datasets/load_datasets — benefits, no code change) and policyengine-api-v2-alpha (household.py constructs with data=, kept working by the conditional run() load) are external consumers. No org .ipynb constructs these classes directly; populace, policyengine-us-data, policyengine-uk-data, and the country model packages are unaffected.

Tests

New tests/test_dataset_persistence.py:

  • test__construct_with_data__does_not_write_to_filepath — construction with data= leaves an existing file's bytes unchanged (fails under the old behavior).
  • test__sequential_region_scopes__leave_shared_source_intact — the production incident, at the dataset level: scope state 6 then state 34 against a shared file; both succeed and the file is unchanged.
  • save()-without-filepath raises (US + UK); load-on-construction still works; no-filepath/no-data construction is inert.

Existing _write_us_h5 fixture updated to call save() explicitly (it previously relied on the construction side effect).

Validation

ruff format + ruff check clean. Directly-affected suites green: tests/test_dataset_persistence.py, tests/test_us_long_term_datasets.py, tests/test_filtering.py, tests/test_scoping_strategy.py (46 passed). The full suite (incl. microsim run() paths) was still finishing at PR-open time; CI is the source of truth. Draft until CI is green.

…ion)

Constructing a PolicyEngineUSDataset/PolicyEngineUKDataset with in-memory
data auto-saved to self.filepath. Region scoping built its filtered copy
with filepath=<source file>, so constructing the scoped copy truncated and
overwrote the shared source dataset with one region's households. In the
simulation API this corrupted the baked single-year file inside reused
Modal containers: subsequent state runs crashed ("No households found
matching state_fips=N") and national runs silently computed on one state's
data.

- Remove the auto-save branch from model_post_init in both country dataset
  classes; keep auto-load. Persistence is now always explicit via save().
- run() loads the input dataset only when the caller did not already supply
  in-memory data, so in-memory datasets are runnable without a disk round
  trip (this is what made the autosave load-bearing).
- Make Dataset.filepath Optional; build region-scoped copies with
  filepath=None; save() raises a clear error when filepath is None.

Also removes the redundant double-writes of create_datasets outputs and
simulation output datasets.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Dataset construction auto-saves to filepath, corrupting shared dataset files during region scoping

1 participant