feat(#1425): strict_provenance config flag for runtime enforcement#1474
feat(#1425): strict_provenance config flag for runtime enforcement#1474dimitri-yatsenko wants to merge 1 commit into
Conversation
Implements T2.2.c of the provenance trinity, completing the trio (Diagram.trace → self.upstream → strict_provenance). When dj.config["strict_provenance"] = True, runtime gates enforce the upstream-only convention inside make(): - Reads must target a table in the active trace's allowed set (declared ancestors + self + self's Parts). - Writes must target self or self's Parts. - Inserted rows' PK columns that overlap with the current key must equal the key's values (key-consistency rule). Default is False. Existing make() bodies are unaffected. Branch stacked on feat/1424-self-upstream (#1473) → feat/1423-diagram-trace (#1471) → fix/1429-cascade-part-part-renamed-fk (#1468). Will rebase onto master after the chain merges. What's added: - src/datajoint/provenance.py (new): the runtime context module. - `_active_strict_make` ContextVar holding (target, allowed_tables, key) for the currently-executing make() invocation. ContextVar chosen over threading.local to propagate correctly across contextvars-aware concurrency boundaries. - `push_strict_make_context` / `pop_strict_make_context` — context lifecycle managed by `_populate_one`'s try/finally. - `assert_read_allowed(query_expression)` — read gate. Recursively discovers base tables via the QueryExpression's `_support` chain and checks each against the allowed set. - `assert_write_allowed(target_table, rows)` — write gate. Verifies the target is self or one of self's Part tables, and checks the key-consistency rule on each dict row. - src/datajoint/settings.py: new `strict_provenance: bool` field on Config (default False), env-var `DJ_STRICT_PROVENANCE`, ENV_VAR_MAPPING entry. - src/datajoint/autopopulate.py: in `_populate_one`, push the strict context (when the flag is on) just before the make() invocation block. The allowed table set = trace's ancestor nodes ∪ {self.full_table_name} ∪ {self's Parts}. Pop in the existing `finally` block. - src/datajoint/expression.py: `QueryExpression.cursor` now calls `assert_read_allowed(self)` before issuing SQL. No-op outside make(). - src/datajoint/table.py: `Table.insert` calls `assert_write_allowed(self, rows)` after the existing `_allow_insert` check. No-op outside make(). Part-table detection uses class `__dict__` traversal (filtered to Part subclasses) instead of `dir/getattr` to avoid triggering the `_JobsDescriptor` (which would lazy-declare ~~table inside the populate transaction — caught by the first test iteration). Documented limitation (deferred): the read gate does not distinguish reads that came through `self.upstream` from reads of the same ancestor via a direct expression. Both are allowed if the table is in the allowed set. The intent is to catch reads from *undeclared* dependencies; tightening the "must come through self.upstream" path requires propagating an attribution marker through QueryExpression composition and is left for a follow-up release. Tests in tests/integration/test_strict_provenance.py (6 new): - test_strict_compliant_make_passes — make() reading via self.upstream and writing self.insert1 with matching key runs cleanly under strict. - test_strict_blocks_read_from_undeclared_table — read from an unrelated table raises with "strict_provenance ... undeclared" message. - test_strict_blocks_write_to_other_table — insert into a non-self, non-Part target raises "not permitted". - test_strict_blocks_write_with_mismatched_key — row PK that disagrees with the current key raises "does not match the current make() key". - test_strict_writes_to_part_table_pass — self.PartName.insert(...) works. - test_strict_off_by_default_no_change — default-off regression check; the canonical "direct (Ancestor & key).fetch1()" pattern still works when strict_provenance is unset. Regression: 17/17 autopopulate tests pass with strict_provenance unset (default). 6/6 new strict tests pass with strict_provenance=True. 8/8 trace tests + 9/9 cascade tests unaffected. Slated for DataJoint 2.3.
9aa7784 to
f60495b
Compare
86b1ae7 to
7e4130f
Compare
A note on adoption: two patterns strict mode will break, and what that tells us we're missingI've spent time with this PR and the spec, and I want to say up front: the provenance trinity is a real evolution for DataJoint. What I want to do here is think through adoption. There are a couple of patterns that are fairly commonly used in practice today that this feature will break. Both of them are, honestly, bad patterns — they're provenance bugs, and strict mode is right to catch them. But they're bad patterns that a lot of working pipelines lean on, and I think if we turn this on without a story for how those pipelines migrate, we take DataJoint to the next level and leave a chunk of our users behind. So this is really an argument for the next thing we should build, using these two patterns to show why. Pattern 1 — reading "sideways" to a table that isn't a declared ancestorThe common case looks like this: a computation needs a piece of reference information that lives off to the side of its own lineage. For example, class SpikeDetection(dj.Computed):
definition = "-> Recording"
def make(self, key):
signal = (Recording & key).fetch1("signal") # declared ancestor — fine
region = (BrainRegion & (Surgery & key)).fetch1("region") # NOT an ancestor
baseline = BASELINES[region]
...
Worth noting the boundary: the very common "parameter table" case ( Pattern 2 — one
|
| Durable domain model | Records where the data came from | Stable when ingestion changes | Keeps DataJoint orchestration | |
|---|---|---|---|---|
| A. Landing table (file is upstream) | ✗ input-driven | ~ but as a fake dependency | ✗ new source ⇒ schema change | ✓ |
| B. Roots + external orchestration | ✓ | ~ but outside DataJoint | ✓ | ✗ |
C. Fat make() (what strict mode blocks) |
✓ | ✗ severed | ✓ | ✓ |
Every row loses at least one thing that matters.
Why the "landing table" option is worse than it first looks
Option A is the one people will reach for, so it's worth being clear about its deepest problem: it bakes a transient operational fact into the durable data model.
How an entity physically arrives is not durable. A Session might come from a file today, an API call next quarter, a direct database sync the year after. But Session belongs to Subject — that's the real, invariant truth of the model, and it holds no matter how the data got there. These two facts live on completely different timescales: the domain relationships change on the scale of scientific redesign (years), while ingestion mechanics change on the scale of infrastructure (months).
The foreign-key graph is supposed to encode the durable, invariant relationships. The moment you put "came from this file" into it, you've married the slow-moving graph to the fast-moving one. Every time ingestion changes, you're doing a schema migration; and your historical rows permanently carry whatever ingestion ancestor happened to exist when they were created. So the principled strict-mode option is actually the least stable one over the life of a pipeline — enforcing provenance this way makes the schema churn with your infrastructure. That's a genuinely counterintuitive result, and I think it's the crux.
Off versus on, stated carefully
Here's the asymmetry that worries me. With strict_provenance off, a workable option exists — the fat make() — that keeps a clean, durable domain model and keeps DataJoint orchestration. Its only casualty is provenance, which the framework wasn't guaranteeing anyway. With it on, that option is gone, and there's no remaining option that records the file→entity relationship truthfully — as "how it arrived," rather than as a fabricated (and non-durable) dependency in the FK graph — while keeping orchestration. So turning strict mode on doesn't just close a loophole; for this class of pipeline it takes away the one option whose only cost was something already soft, and leaves you choosing between corrupting the durable model (A) or giving up the framework's orchestration (B).
What this actually points to — two things worth building
I don't think the takeaway is "don't ship strict mode." I think it's that strict mode is the feature that makes two follow-on capabilities necessary — it's the alarm, and these are the fire exits:
-
Provenance-first schema evolution. The reason both patterns are so painful to fix is that the remediation is always "restructure your schema," and we have no tooling for that — so people fall back on drop-and-repopulate, which destroys history. A proper schema-evolution tool would use the trinity's own primitives —
traceto find what's upstream of a change,cascadeto find what it invalidates — to change dependency structure and recompute safely. It's not alembic; it's evolution that manages the invalidation cascade. And it's built from exactly what this PR ships, which is what makes it a natural next step rather than a detour. -
A first-class notion of where data came from, kept separate from the FK graph. Everything above keeps circling one gap: DataJoint has one graph, and it uses it for both "what is derived from what" (durable, invariant) and, when forced, "how this row arrived" (transient, operational). Those need to be different things. Ingestion lineage should be recorded — append-only, allowing many sources over time — so a
Sessioncan carry a "came from file X" record today and an "came from API Y" record later, with no schema change at all. The FK graph structurally can't do that, and forcing it to (option A) is exactly what corrupts the model.
Neither of these is a 2.3 concern, and I'm not suggesting they gate this release. But I'd love for us to decide the sequencing deliberately, so we're not shipping the alarm well ahead of the exits. Happy to write these up properly as their own proposals if that's useful.
There was a problem hiding this comment.
I did a close read of the enforcement path and want to flag three correctness issues before this ships, so I'm requesting changes. To be clear up front: the scaffolding is right — the ContextVar push/pop with token reset, the exception-path cleanup, transaction rollback, Part detection via __dict__ (nicely avoiding the _JobsDescriptor), and the flag-off no-op are all correct. The problems are in the two gate functions themselves. None of this is about the broader strict-mode design conversation; it's purely implementation correctness.
1. [blocking] Write gate can silently drop all rows — data loss on compliant code
assert_write_allowed(self, rows) runs in Table.insert (table.py:804) before insert materializes rows, and when a strict context is active its non-dict branch does for row in rows. For a one-shot iterable (generator, iter(...), map(...)), that exhausts it; the post-gate consumer — _insert_rows → list(… for row in rows) at table.py:864, or the chunked iter(rows) path at :836 — then sees an empty iterator and inserts zero rows, with no error:
def make(self, key):
self.insert(dict(**key, x=i) for i in range(n)) # strict on → inserts nothing, silentlyTwo precisions, because they make this a genuine blocker rather than an edge case:
- It's scoped to one-shot iterables. A
list/tupleof dicts is safe — it's re-iterable, so the gate walks it andinsertre-walks it fine. Soself.insert([...])is unaffected; onlyself.insert(<generator>)breaks. But passing a generator toinsert()is a documented, idiomatic pattern (the chunked-insert path is literally built arounditer(rows)), so this isn't exotic input. - It strikes compliant code. The target check passes for a normal
self.insert(<generator>)into the table being populated — the gate then consumes the generator anyway. So a correct, strict-compliantmake()that inserts many computed rows (very common for Part tables) silently loses all of them. This isn't "you did something forbidden and got a confusing failure"; it's "you wrote correct code and it discarded your data."
Fix direction: don't consume rows in the gate — do the target-only check first (it doesn't need the rows), and defer the key-consistency check to where insert has already materialized rows into a concrete list.
Separately, insert(some_query_expression) isn't data loss but is double-executed: the gate iterates it (running the source SELECT client-side and firing the read gate on it), then insert re-runs it as INSERT … SELECT. Worth fixing in the same pass.
2. [blocking] Read gate misses len() / bool() / in
Only QueryExpression.cursor calls assert_read_allowed. But __len__ (expression.py:1149), __bool__ (:1178), and __contains__ (:1195, via bool(self & item)) each issue connection.query(...) directly and never touch cursor. So inside make():
len(Unrelated & key) # ungated
bool(Unrelated & cond) # ungated
key in Unrelated # ungated (→ __bool__)all read a forbidden table with no check. Aggregation.__len__/__bool__ and Union.__len__/__bool__ have the same direct-query bypass.
3. [blocking] Read gate is blind to restriction-by-table
_base_tables (provenance.py:56) only walks full_table_name and _support. A semijoin restriction Ancestor & Unrelated renders Unrelated as an IN (subquery) in the WHERE clause — it never enters _support — so _base_tables returns {Ancestor} and the read of Unrelated passes. (Joins are caught, because join extends _support; restrictions are not, and restrict-by-table is at least as common.)
Net on 2 + 3: the gate catches UndeclaredTable.fetch() but not the everyday idioms (restrict-by, existence check, count), so an undeclared dependency stays readable through several common paths. Since the whole value here is a guarantee, I think these need to close before merge — a gate that blocks the naive case while passing the common ones invites false confidence, which is worse than an honest "best-effort."
Tests
The 6 tests exercise the naive paths and would all pass with the three bugs above present. Worth adding: join-with-a-disallowed-operand (the one read path that actually works — untested, so a regression would be invisible), restriction-by-table, len/in/bool, insert-from-query, generator rows, reentrant make() context restore, and exception-path context cleanup.
Summary
T2.2.c of the provenance trinity — completes the trio.
When `dj.config["strict_provenance"] = True`, runtime gates enforce the upstream-only convention inside `make()`:
Default is False. Existing `make()` bodies are unaffected.
Closes #1425. Slated for DataJoint 2.3.
Branch stack
Will rebase onto master after the chain merges in order.
What's added
Implementation notes
Documented limitation (deferred)
The read gate does not distinguish reads that came through `self.upstream` from reads of the same ancestor via a direct expression. Both are allowed if the table is in the allowed set. The intent is to catch reads from undeclared dependencies; tightening the "must come through self.upstream" path requires propagating an attribution marker through QueryExpression composition and is left for a follow-up.
Tests
Test plan