Stop datasets auto-saving on construction (fixes region-scope corruption)#450
Merged
Conversation
…ion)
Constructing a PolicyEngineUSDataset/PolicyEngineUKDataset with in-memory
data auto-saved to self.filepath. Region scoping built its filtered copy
with filepath=<source file>, so constructing the scoped copy truncated and
overwrote the shared source dataset with one region's households. In the
simulation API this corrupted the baked single-year file inside reused
Modal containers: subsequent state runs crashed ("No households found
matching state_fips=N") and national runs silently computed on one state's
data.
- Remove the auto-save branch from model_post_init in both country dataset
classes; keep auto-load. Persistence is now always explicit via save().
- run() loads the input dataset only when the caller did not already supply
in-memory data, so in-memory datasets are runnable without a disk round
trip (this is what made the autosave load-bearing).
- Make Dataset.filepath Optional; build region-scoped copies with
filepath=None; save() raises a clear error when filepath is None.
Also removes the redundant double-writes of create_datasets outputs and
simulation output datasets.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
This was referenced Jul 2, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes #449
Problem
Constructing a
PolicyEngineUSDataset/PolicyEngineUKDatasetwith in-memorydata=auto-saved toself.filepath(viamodel_post_init). Region scoping inrun()built its filtered copy withfilepath=<the source file>, so constructing the scoped copy truncated and overwrote the shared source dataset with a single region's households.In the simulation API this corrupted the baked single-year dataset file inside reused Modal containers: after a state-scoped run, the shared file held only that state's rows, so the next request on the same container either crashed (
ValueError: No households found matching state_fips=N) or, for a national request, silently computed a "national" result on ~1–2% of the population — fast and wrong. This drove ~50% state-run failure rates in 51-state fan-outs (breaking downstream congressional-district breakdowns) and a confusing mix of fast-but-wrong vs. slow-but-correct national runs depending on container reuse.Fix
model_post_initin both country dataset classes; keep auto-load. Persistence is now always explicit viasave().run()loads the input dataset only when the caller did not already supply in-memory data (if dataset.data is None: dataset.load()) instead of unconditionally — this is what made the autosave load-bearing, and the change keeps in-memory / notebook workflows working without a disk round trip.Dataset.filepathbecomesOptional[str]; region-scoped copies are constructed withfilepath=None;save()raises a clear error whenfilepathisNone.create_datasetsoutputs and simulation output datasets.Why this is safe / why the autosave existed
The autosave (PR #180, 2025-11) had one inline rationale — "synchronised between in-memory and storage." No documentation promises that construction persists, and no issue tracked it. It was only load-bearing because
run()re-read the input dataset from disk every time; making that read conditional removes the need to persist on construction.Org-wide search: only
policyengine-api-v2(consumes viaensure_datasets/load_datasets— benefits, no code change) andpolicyengine-api-v2-alpha(household.pyconstructs withdata=, kept working by the conditionalrun()load) are external consumers. No org.ipynbconstructs these classes directly;populace,policyengine-us-data,policyengine-uk-data, and the country model packages are unaffected.Tests
New
tests/test_dataset_persistence.py:test__construct_with_data__does_not_write_to_filepath— construction withdata=leaves an existing file's bytes unchanged (fails under the old behavior).test__sequential_region_scopes__leave_shared_source_intact— the production incident, at the dataset level: scope state 6 then state 34 against a shared file; both succeed and the file is unchanged.save()-without-filepath raises (US + UK); load-on-construction still works; no-filepath/no-data construction is inert.Existing
_write_us_h5fixture updated to callsave()explicitly (it previously relied on the construction side effect).Validation
ruff format+ruff checkclean. Directly-affected suites green:tests/test_dataset_persistence.py,tests/test_us_long_term_datasets.py,tests/test_filtering.py,tests/test_scoping_strategy.py(46 passed). The full suite (incl. microsimrun()paths) was still finishing at PR-open time; CI is the source of truth. Draft until CI is green.