igerber · igerber · Jun 30, 2026 · Jun 30, 2026 · Jun 30, 2026 · Jun 30, 2026
diff --git a/TODO.md b/TODO.md
@@ -34,7 +34,6 @@ The `Origin` column (Actionable tables) and the `PR` column (Deferred tables) bo
 | `SyntheticControl` conformal (CWZ 2021) extensions: (a) one-sided / signed-`t` variants (§7); (b) covariates in the conformal proxy (`X_jt`, eqs 4/6 — current proxy is outcomes-only); (c) AR / innovation-permutation path (Lemmas 5-7) for time-series proxies. The joint test, pointwise CIs, and average-effect CI have landed. | `conformal.py`, `synthetic_control_results.py` | CWZ-2021 | Heavy | Low |
 | `ContinuousDiD` CGBS-2024 extensions (matches R `contdid` v0.1.0 deferral set): (a) `covariates=` kwarg; (b) discrete-treatment saturated regression (integer dose currently warned, not routed to per-level coefficients); (c) lowest-dose-as-control per Remark 3.1 when `P(D=0)=0`. REGISTRY `## ContinuousDiD` → Implementation Checklist marks these `[ ]`. | `continuous_did.py` | CGBS-2024 | Heavy | Low |
 | `EfficientDiD` survey-weighted Silverman bandwidth in conditional Omega* — `_silverman_bandwidth()` uses unweighted mean/std; survey-weighted statistics better reflect the population distribution (second-order refinement). | `efficient_did_covariates.py` | — | Quick | Low |
-| Survey sandwich SE is not exactly invariant to zero-weight (subpopulation / padded) rows: `_compute_stratified_psu_meat`'s finite-sample correction counts zero-weight units as PSUs, so padding shifts the SE ~2e-4 relative. Point estimate is exactly invariant. Fix: count only positive-weight PSUs in the correction (cross-cutting across all survey-enabled estimators). | `survey.py` (`_compute_stratified_psu_meat`) | PR-B | Mid | Low |
 | `ImputationDiD` LOO conservative-variance refinement (BJS 2024 Supp. Appendix A.9) — a finite-sample improvement to the auxiliary-model residuals reducing overfit of `tau_tilde_g` to `epsilon`. Asymptotic Theorem-3 variance is implemented and matches R `didimputation` (which also omits LOO by default). | `imputation.py` | imputation-validation | Mid | Low |
 | `TwoWayFixedEffects(vcov_type in {hc2, hc2_bm})` with replicate-weight designs raises `NotImplementedError` (`twfe.py:~233`). The replicate path re-demeans per replicate, which doesn't compose with the full-dummy HC2/HC2-BM build — a correct impl needs per-replicate full-dummy refit. Workaround: `hc1` for replicate-weight CR1. | `twfe.py::fit` | follow-up | Heavy | Low |
 | TWFE's HC2/HC2-BM inline full-dummy build (`twfe.py:280-315`) duplicates the dummy-construction logic in `DifferenceInDifferences(fixed_effects=...)` (`estimators.py:478-486`). Extract a shared helper, or delegate TWFE's HC2/HC2-BM path to DiD's `fixed_effects=` branch (with TWFE-specific cluster-default threading), to reduce drift risk on FE naming / survey behavior / result-surface conventions. Substantive refactor — touches both estimators. | `twfe.py::fit`, `estimators.py::DifferenceInDifferences.fit` | follow-up | Heavy | Low |
@@ -153,6 +152,7 @@ Doable in principle, but no current caller and/or explicitly out of paper scope.
 | **R-script-per-test consolidation has no CI impact.** No CI workflow installs R, so every R-parity test skips in CI behind a per-file availability gate — consolidating `Rscript` spawns yields zero CI speedup. `test_methodology_twfe.py` already session-caches its R fits. The only residual is a LOCAL-dev micro-opt for `test_methodology_continuous_did.py` / `test_methodology_callaway.py` (re-spawn `library(...)` per call). Low value; retained as a local-dev note. | `tests/test_methodology_continuous_did.py`, `tests/test_methodology_callaway.py` | #139 / 2026-06-07 |
 | **`HeterogeneousAdoptionDiD` mass-point IV bread is non-symmetric — `_rank_guarded_inv` inapplicable.** The structural rank-guard sweep (continuous_did / two_stage / spillover / conley) excluded `had.py`'s `ZtWX = Zd'WX` ([1, instrument]' × [1, endogenous]): it is a non-symmetric 2×2 Wald-IV bread (`V = ZtWX_inv @ Omega @ ZtWX_inv.T`), and `_rank_guarded_inv` assumes a **symmetric PSD** Gram (symmetric `D=diag(A)` equilibration + eigendecomposition), so applying it would be methodologically wrong. The existing fallback already returns NaN SE on a singular bread; an IV-appropriate near-singular guard would need a different mechanism. | `had.py` | structural-rank-guard / 2026-06-28 |
 | **`ImputationDiD` SE vcov is already rank-guarded upstream.** Excluded from the structural rank-guard sweep: the lead/effect vcov comes from `solve_ols(..., return_vcov=True, rank_deficient_action=...)` at the OLS fit (`imputation.py:~2316`), which already drops rank-deficient columns. The only raw inverse (`solve(V_gamma, gamma)`, `imputation.py:~2530`) is the pretrends **Wald F-test statistic** with a safe `NaN` fallback — a test statistic, not a sandwich bread — so there is no garbage-SE exposure. No structural rank-guard needed. | `imputation.py` | structural-rank-guard / 2026-06-28 |
+| **Survey TSL SE intentionally counts genuine-subpopulation zero-weight PSUs (matches R, NOT a bug).** Re-examined the former "count only positive-weight PSUs in the correction" item (origin PR-B). `_compute_stratified_psu_meat`'s finite-sample correction `(1-f_h)·n_{PSU,h}/(n_{PSU,h}-1)` and PSU-mean centering keep zero-weight PSUs — this is the **full-design domain-estimation convention** (Lumley 2004 §3.4; R `survey::svyrecvar(subset())`), already documented in REGISTRY § "Subpopulation Analysis" (the survey-vcov path deliberately differs from the positive-weight invariance applied *outside* it). The ATT is exactly invariant; the SE is intentionally NOT invariant to genuine-subpopulation zeroing (it *should* differ from a naive physical subset — that is the whole point of `subpopulation()`). Repro (`scratchpad`): zeroing a full PSU vs physically dropping it differs ~5e-3 rel — the Lumley-correct gap, and R's `svyrecvar(subset())` produces the matching SE (only `df` differs; see the § "Subpopulation Analysis" Deviation note). The only truly invariance-violating shape — appending *synthetic new* all-zero PSUs — does not arise in any estimator path (real padding reuses existing PSU labels and is already bit-invariant, or is genuine domain estimation via `prep.py`'s zero-padded full-design cell variance). "Fixing" the meat to positive-weight-only would break the documented Lumley/R parity. Regression-locked by `tests/test_survey.py::TestZeroWeightPsuConventionWaiver`. | `survey.py` (`_compute_stratified_psu_meat`) | PR-B / 2026-06-30 |
 
 ---
 

diff --git a/docs/methodology/REGISTRY.md b/docs/methodology/REGISTRY.md
@@ -4536,6 +4536,25 @@ Domain estimation preserving full design structure.
   paths use positive-weight count for df adjustments, ensuring zero-weight
   padding is inference-invariant outside the survey vcov path. DEFF
   effective-n also uses positive-weight count.
+- **Note:** The TSL meat itself follows the same full-design convention:
+  `_compute_stratified_psu_meat`'s per-stratum finite-sample correction
+  `(1 - f_h)·n_{PSU,h}/(n_{PSU,h}-1)` and PSU-mean centering count
+  zero-weight PSUs (a genuine-subpopulation PSU with all members outside the
+  domain contributes a zero PSU-score `0` that is centered to `-\bar{z}_h`
+  and still increments `n_{PSU,h}`). This is the Lumley (2004 §3.4) /
+  R `survey::svyrecvar(subset())` domain estimator — so the survey SE is
+  deliberately **not** invariant to genuine-subpopulation zeroing: it
+  differs from the SE of a naive physical subset, which is the whole point
+  of preserving the design (`subpopulation()` is correct *because* it does
+  not equal the naive subset). Zero-weight rows that reuse an existing PSU
+  label are inert (their weighted score is `0`, so the PSU-score sum is
+  unchanged), so padding that preserves PSU membership is bit-invariant;
+  only adding *new* synthetic all-zero PSUs would shift the SE, and no
+  estimator path does so (domain padding goes through the zero-padded
+  full-design cell variance in `prep.py`, which retains the real PSU layout).
+  A former TODO proposed counting only positive-weight PSUs to force SE
+  invariance; that was waived (TODO § "Won't-fix / waived") because it would
+  break this documented R parity.
 - **Deviation from R:** `subpopulation()` preserves all strata in df
   computation even when a stratum has no positive-weight observations,
   while R's `subset()` drops empty strata from `survey::degf()`. For

diff --git a/tests/test_survey.py b/tests/test_survey.py
@@ -15,6 +15,7 @@
 from diff_diff.linalg import LinearRegression, compute_robust_vcov, solve_ols
 from diff_diff.survey import (
     ResolvedSurveyDesign,
+    _compute_stratified_psu_meat,
     compute_survey_metadata,
     compute_survey_vcov,
 )
@@ -2146,6 +2147,175 @@ def test_zero_weights_accepted(self):
         assert coef is not None
 
 
+class TestZeroWeightPsuConventionWaiver:
+    """Lock the waived decision on the survey TSL finite-sample correction.
+
+    ``_compute_stratified_psu_meat`` intentionally keeps genuine-subpopulation
+    zero-weight PSUs in its per-stratum correction
+    ``(1 - f_h)*n_PSU_h/(n_PSU_h - 1)`` and PSU-mean centering — the full-design
+    domain estimator of Lumley (2004 §3.4) / R ``survey::svyrecvar(subset())``.
+    A former TODO proposed "count only positive-weight PSUs" to force the survey
+    SE to be invariant to zero-weight rows; that was **waived** (TODO
+    § "Won't-fix / waived"; REGISTRY § "Subpopulation Analysis") because it would
+    break the documented R parity. These tests trip if that change is ever made:
+    the proposed fix would collapse the subpopulation SE onto the naive-subset SE.
+    """
+
+    @staticmethod
+    def _panel():
+        rng = np.random.default_rng(20260630)
+        rows = []
+        uid = 0
+        for h in range(3):  # strata
+            for p in range(4):  # PSUs per stratum (>=2 so a drop leaves the path live)
+                psu_id = f"s{h}_p{p}"
+                for _ in range(2):  # units per PSU
+                    treated = int(rng.random() < 0.5)
+                    for t in (0, 1):  # periods
+                        y = 2.0 * treated * t + 0.5 * h + 0.3 * p + rng.normal()
+                        rows.append(
+                            dict(
+                                unit=uid,
+                                stratum=h,
+                                psu=psu_id,
+                                treated=treated,
+                                post=t,
+                                outcome=y,
+                                w=rng.uniform(0.5, 2.0),
+                            )
+                        )
+                    uid += 1
+        return pd.DataFrame(rows)
+
+    def _fit(self, df):
+        sd = SurveyDesign(weights="w", strata="stratum", psu="psu")
+        res = DifferenceInDifferences().fit(
+            df, outcome="outcome", treatment="treated", time="post", survey_design=sd
+        )
+        return res.att, res.se
+
+    def test_inert_padding_reusing_existing_psu_is_bit_invariant(self):
+        """Zero-weight rows under an EXISTING PSU label are inert: ATT and SE
+        bit-identical. (Their weighted score is 0, so the PSU-score sum and the
+        PSU count are both unchanged.)"""
+        base = self._panel()
+        att0, se0 = self._fit(base)
+
+        pad = []
+        for _ in range(8):
+            for t in (0, 1):
+                pad.append(
+                    dict(
+                        unit=8000,
+                        stratum=0,
+                        psu="s0_p1",  # reuse an existing PSU
+                        treated=0,
+                        post=t,
+                        outcome=99.0,  # arbitrary; weight 0 makes it inert
+                        w=0.0,
+                    )
+                )
+        padded = pd.concat([base, pd.DataFrame(pad)], ignore_index=True)
+        att1, se1 = self._fit(padded)
+
+        np.testing.assert_allclose(att1, att0, atol=1e-12)
+        np.testing.assert_allclose(se1, se0, atol=1e-12)
+
+    def test_subpopulation_zeroing_keeps_full_psu_structure(self):
+        """Genuine subpopulation (zero a full PSU's weights) is NOT a naive
+        physical subset. The WLS fit is identical (zero-weight rows drop out of
+        X'WX), so the ATT is exactly invariant; but the TSL SE intentionally
+        differs from the dropped-rows SE because the zeroed PSU still counts in
+        the finite-sample correction (Lumley full-design convention).
+
+        The proposed "count only positive-weight PSUs" change would make the
+        zeroed SE equal the dropped SE; this assertion guards against it.
+        """
+        base = self._panel()
+        victim = "s0_p0"
+
+        zeroed = base.copy()
+        zeroed.loc[zeroed["psu"] == victim, "w"] = 0.0
+        att_zero, se_zero = self._fit(zeroed)
+
+        dropped = base[base["psu"] != victim].copy()
+        att_drop, se_drop = self._fit(dropped)
+
+        # ATT is exactly invariant: zero-weight rows contribute nothing to the
+        # weighted normal equations, so the point estimate matches the subset.
+        np.testing.assert_allclose(att_zero, att_drop, atol=1e-10)
+
+        # SE is deliberately NOT invariant: the zeroed PSU stays in n_PSU_h and
+        # the centering (the Lumley domain gap, ~5e-3 rel here). A positive-
+        # weight-only correction would drive this to ~0.
+        rel_gap = abs(se_zero - se_drop) / se_drop
+        assert rel_gap > 1e-3, (
+            f"survey SE became invariant to subpopulation zeroing "
+            f"(rel_gap={rel_gap:.2e}); the full-design Lumley convention "
+            f"(TODO § Won't-fix / waived) appears to have been reverted."
+        )
+
+    def test_stratified_meat_counts_zero_score_psu_in_full_design(self):
+        """Direct unit test on ``_compute_stratified_psu_meat``: an all-zero-score
+        PSU (a fully zeroed subpopulation PSU) stays in ``n_PSU_h``, the stratum
+        PSU-mean, and the centering. Asserts the **exact** full-design meat and
+        that it is NOT the positive-weight-only meat (which would drop the zero
+        PSU). This guards against partial edits the SE-level test could miss —
+        e.g. changing only the finite-sample denominator while still centering
+        over the zero PSU.
+        """
+        # Two strata; stratum 0 has an all-zero PSU (psu 2 = two zeroed rows).
+        scores = np.array(
+            [
+                [1.5, -0.5],  # stratum 0, psu 0
+                [0.5, 2.0],  # stratum 0, psu 1
+                [0.0, 0.0],  # stratum 0, psu 2 (all-zero subpop PSU), row 1
+                [0.0, 0.0],  # stratum 0, psu 2, row 2
+                [-1.0, 0.7],  # stratum 1, psu 3
+                [0.3, -1.2],  # stratum 1, psu 4
+            ]
+        )
+        strata = np.array([0, 0, 0, 0, 1, 1])
+        psu = np.array([0, 1, 2, 2, 3, 4])
+        # PSU 2's rows carry zero weight as well as zero score, so the fixture
+        # models a *true* fully zero-weight subpopulation PSU. The current meat
+        # ignores `weights` (it operates on scores), so this does not change the
+        # expected value — but it hardens the lock against a future denominator-
+        # only edit that reads `resolved.weights` to drop positive-weight PSUs.
+        resolved = ResolvedSurveyDesign(
+            weights=np.array([1.0, 1.0, 0.0, 0.0, 1.0, 1.0]),
+            weight_type="pweight",
+            strata=strata,
+            psu=psu,
+            fpc=None,
+            n_strata=2,
+            n_psu=5,
+            lonely_psu="remove",
+        )
+
+        meat, variance_computed, _ = _compute_stratified_psu_meat(scores, resolved)
+        assert variance_computed
+
+        def _stratum_meat(psu_scores):
+            # Full-design per-stratum meat with fpc=None: (n_h/(n_h-1)) * S'S.
+            n_h = psu_scores.shape[0]
+            centered = psu_scores - psu_scores.mean(axis=0, keepdims=True)
+            return (n_h / (n_h - 1)) * (centered.T @ centered)
+
+        # Full-design reference INCLUDES the all-zero PSU in stratum 0 (n_h=3).
+        s0_full = np.array([[1.5, -0.5], [0.5, 2.0], [0.0, 0.0]])
+        s1 = np.array([[-1.0, 0.7], [0.3, -1.2]])
+        expected_full = _stratum_meat(s0_full) + _stratum_meat(s1)
+        np.testing.assert_allclose(meat, expected_full, atol=1e-12)
+
+        # The positive-weight-only meat would DROP the zero PSU from stratum 0
+        # (n_h=2, mean over the two nonzero PSUs) — exactly the change the waived
+        # TODO proposed. The full-design meat must NOT equal it.
+        s0_pos = np.array([[1.5, -0.5], [0.5, 2.0]])
+        positive_only = _stratum_meat(s0_pos) + _stratum_meat(s1)
+        assert not np.allclose(meat, positive_only, atol=1e-8)
+
+
 class TestRound8Fixes:
     """Tests for round-8 review fixes (PR #218)."""