Skip to content

docs: add engine-native record selection plan#792

Open
nabinchha wants to merge 1 commit into
mainfrom
codex/790-record-selection-plan
Open

docs: add engine-native record selection plan#792
nabinchha wants to merge 1 commit into
mainfrom
codex/790-record-selection-plan

Conversation

@nabinchha

Copy link
Copy Markdown
Contributor

📋 Summary

Adds the source-of-truth design plan for engine-native record selection, allowing users to request an exact number of rows that satisfy a declared boolean criterion in one bounded, resumable DataDesigner run. The plan defines V1 as the complete user-facing feature and leaves concurrent batches and early cancellation as benchmark-driven optimizations.

🔗 Related Issue

Related to #790. The issue remains open to track implementation of the approved design.

🔄 Changes

  • Add the engine-native record selection plan.
  • Define the proposed RecordSelectionConfig API and exact accepted-row semantics.
  • Separate logical candidate-batch progress from physical scheduler row groups.
  • Specify durable checkpoints, resume behavior, exhaustion handling, metadata, processing order, and artifact cleanup.
  • Document implementation phases, regression coverage motivated by feat: add repeat until workflow stages #773, risks, open questions, and the V1 definition of done.

🧪 Testing

  • git diff --check origin/main..HEAD
  • Markdown code and Mermaid fence balance validated
  • Unit tests: N/A — design-plan-only change with no executable logic
  • E2E tests: N/A — design-plan-only change with no executable logic

✅ Checklist

  • Follows commit message conventions
  • Commits are signed off (DCO)
  • Architecture docs updated

Document the V1 API, engine architecture, resume model, implementation phases, and validation strategy for exact accepted-row targets.

Refs #790

Signed-off-by: Nabin Mulepati <nmulepati@nvidia.com>
@nabinchha nabinchha requested a review from a team as a code owner July 1, 2026 20:53
@greptile-apps

greptile-apps Bot commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR adds a source-of-truth design plan for engine-native record selection, allowing users to declare a boolean predicate column and request an exact number of accepted output rows from a single bounded DataDesigner.create() call. The plan covers the full V1 feature: RecordSelectionConfig API, AcceptanceController internals, immutable candidate-batch execution, durable checkpoints, resume behavior, artifact layout, processor interactions, and a phased implementation roadmap.

  • Defines RecordSelectionConfig with predicate_column, max_candidate_records, and on_exhausted, along with exact semantics for null, non-boolean, overshoot, trimming, exhaustion, and resume.
  • Introduces an AcceptanceController that tracks candidate and accepted progress as separate coordinate spaces, persisting a batch marker (even for zero-acceptance batches) as the durable commit point.
  • Separates V1 (sequential candidate batches, deterministic output) from potential v2 optimizations (concurrent batches, early predicate cancellation), with V2 gated on benchmark evidence.

Confidence Score: 4/5

Documentation-only change with no executable logic; safe to merge, with three design gaps worth resolving before implementation begins.

The plan is thorough and internally consistent across its API, artifact layout, controller design, and test matrix. Three gaps are worth resolving: the media artifact strategy is explicitly required for V1 but left undecided; the resume state machine omits the IF_POSSIBLE + fingerprint-mismatch transition; and the metadata accounting invariant silently breaks when rows fail before predicate evaluation. None of these block merging the design document itself, but each will surface as an implementation ambiguity if left unaddressed.

The single changed file plans/790/engine-native-record-selection.md contains all three gaps; the media-artifact and resume state machine sections deserve the most attention before implementation kicks off.

Important Files Changed

Filename Overview
plans/790/engine-native-record-selection.md New 985-line design plan for engine-native record selection; three design gaps identified: unresolved V1 media artifact strategy, incomplete resume state machine for the IF_POSSIBLE + fingerprint-mismatch path, and a metadata accounting invariant that breaks when rows fail before predicate evaluation.

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant U as User
    participant I as DataDesigner.create
    participant C as AcceptanceController
    participant S as AsyncTaskScheduler
    participant D as Column DAG
    participant St as ArtifactStorage

    U->>I: "create(builder, num_records=X)"
    I->>C: "initialize(target=X, cap=M)"

    loop "while accepted < X and candidates < M"
        C->>C: "next_candidate_batch(size=min(buffer_size, target, remaining_budget))"
        C->>S: run(row_group_id, candidate_offset, batch_size)
        S->>D: generate full column DAG for candidate batch
        D-->>S: completed rows + predicate column
        S->>C: select(dataframe)
        Note over C: Validate bool/null, build accepted mask, trim to remaining target
        C-->>S: SelectionDecision(accepted_indices, trimmed_count)
        S->>St: write accepted rows to parquet-files/batch_N.parquet
        S->>St: commit batch marker to selection-checkpoints/batch_N.json
        St-->>C: record_checkpoint(batch, decision)
    end

    alt "accepted == X"
        I-->>U: DatasetCreationResults (exact)
    else cap exhausted + return_partial
        I-->>U: DatasetCreationResults (partial)
    else cap exhausted + raise
        I-->>U: RecordSelectionExhaustedError
    end
Loading
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
sequenceDiagram
    participant U as User
    participant I as DataDesigner.create
    participant C as AcceptanceController
    participant S as AsyncTaskScheduler
    participant D as Column DAG
    participant St as ArtifactStorage

    U->>I: "create(builder, num_records=X)"
    I->>C: "initialize(target=X, cap=M)"

    loop "while accepted < X and candidates < M"
        C->>C: "next_candidate_batch(size=min(buffer_size, target, remaining_budget))"
        C->>S: run(row_group_id, candidate_offset, batch_size)
        S->>D: generate full column DAG for candidate batch
        D-->>S: completed rows + predicate column
        S->>C: select(dataframe)
        Note over C: Validate bool/null, build accepted mask, trim to remaining target
        C-->>S: SelectionDecision(accepted_indices, trimmed_count)
        S->>St: write accepted rows to parquet-files/batch_N.parquet
        S->>St: commit batch marker to selection-checkpoints/batch_N.json
        St-->>C: record_checkpoint(batch, decision)
    end

    alt "accepted == X"
        I-->>U: DatasetCreationResults (exact)
    else cap exhausted + return_partial
        I-->>U: DatasetCreationResults (partial)
    else cap exhausted + raise
        I-->>U: RecordSelectionExhaustedError
    end
Loading
Prompt To Fix All With AI
Fix the following 3 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 3
plans/790/engine-native-record-selection.md:694-699
**Unresolved V1 media artifact strategy**

The section states "V1 must choose and document one of these approaches," but then presents three options without selecting one. Open question 5 also defers the decision. Because the artifact layout section fully specifies every other path, leaving the media strategy unresolved means implementers could pick different approaches, which would make the artifacts, test cases, and cleanup behavior inconsistent. The three options have meaningfully different ownership implications for `MediaStorage`: option 1 requires candidate-scoped staging directories that don't exist today; option 2 requires per-row path tracking in the scheduler; option 3 adds a GC pass at cleanup time. A decision here should be made before implementation of Phase 2 begins.

### Issue 2 of 3
plans/790/engine-native-record-selection.md:634-657
**Resume state machine missing `IF_POSSIBLE` fingerprint-mismatch path**

The state machine only shows `LoadConfig --> Incompatible` for fingerprint mismatch when `resume=ALWAYS`. When checkpoints exist but the fingerprint has changed and `resume=IF_POSSIBLE`, there is no transition shown. Standard resume semantics would fall back to `Fresh`, discarding committed candidate progress, but that branch is absent. An implementer who reads only this diagram could silently re-run from scratch without any signal to the caller that prior work was abandoned, or could raise an error when the mode was intended to be lenient.

### Issue 3 of 3
plans/790/engine-native-record-selection.md:586-612
**Metadata accounting invariant breaks when rows fail before predicate evaluation**

The plan states "Failed/dropped candidate slots count against `max_candidate_records`; they consumed an attempt and potentially model cost," but no `failed_generation_records` counter appears in the proposed `record_selection` metadata block. The example JSON validates as `accepted (4 217) + rejected (7 769) + null_predicate (14) == candidate_records_generated (12 000)`, which holds only when all candidates reach the predicate step. If a batch has partial generation failures, the invariant silently breaks and the discrepancy becomes invisible to callers reading `metadata.json` or the batch markers.

Reviews (1): Last reviewed commit: "docs: add record selection design plan" | Re-trigger Greptile

Comment on lines +694 to +699
V1 must choose and document one of these approaches:

1. **Preferred:** write candidate media under a candidate-batch-scoped temporary directory and commit only accepted paths.
2. Track media paths produced per row and delete those associated exclusively with rejected rows.
3. Explicitly allow orphan media until run cleanup, then garbage-collect paths not referenced by accepted output.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Unresolved V1 media artifact strategy

The section states "V1 must choose and document one of these approaches," but then presents three options without selecting one. Open question 5 also defers the decision. Because the artifact layout section fully specifies every other path, leaving the media strategy unresolved means implementers could pick different approaches, which would make the artifacts, test cases, and cleanup behavior inconsistent. The three options have meaningfully different ownership implications for MediaStorage: option 1 requires candidate-scoped staging directories that don't exist today; option 2 requires per-row path tracking in the scheduler; option 3 adds a GC pass at cleanup time. A decision here should be made before implementation of Phase 2 begins.

Prompt To Fix With AI
This is a comment left during a code review.
Path: plans/790/engine-native-record-selection.md
Line: 694-699

Comment:
**Unresolved V1 media artifact strategy**

The section states "V1 must choose and document one of these approaches," but then presents three options without selecting one. Open question 5 also defers the decision. Because the artifact layout section fully specifies every other path, leaving the media strategy unresolved means implementers could pick different approaches, which would make the artifacts, test cases, and cleanup behavior inconsistent. The three options have meaningfully different ownership implications for `MediaStorage`: option 1 requires candidate-scoped staging directories that don't exist today; option 2 requires per-row path tracking in the scheduler; option 3 adds a GC pass at cleanup time. A decision here should be made before implementation of Phase 2 begins.

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +634 to +657
```mermaid
stateDiagram-v2
[*] --> LoadConfig
LoadConfig --> Fresh: no selection checkpoints
LoadConfig --> Reconstruct: checkpoints exist and fingerprint matches
LoadConfig --> Incompatible: fingerprint mismatch with resume=ALWAYS

Reconstruct --> ValidateMarkers
ValidateMarkers --> Corrupt: missing or invalid committed artifact
ValidateMarkers --> Satisfied: accepted >= target
ValidateMarkers --> Exhausted: candidates >= cap and accepted < target
ValidateMarkers --> GenerateNext: accepted < target and budget remains

Fresh --> GenerateNext
GenerateNext --> CommitBatch
CommitBatch --> Satisfied: accepted >= target
CommitBatch --> Exhausted: candidate cap reached
CommitBatch --> GenerateNext: more candidates needed

Satisfied --> [*]
Exhausted --> [*]
Incompatible --> [*]
Corrupt --> [*]
```

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Resume state machine missing IF_POSSIBLE fingerprint-mismatch path

The state machine only shows LoadConfig --> Incompatible for fingerprint mismatch when resume=ALWAYS. When checkpoints exist but the fingerprint has changed and resume=IF_POSSIBLE, there is no transition shown. Standard resume semantics would fall back to Fresh, discarding committed candidate progress, but that branch is absent. An implementer who reads only this diagram could silently re-run from scratch without any signal to the caller that prior work was abandoned, or could raise an error when the mode was intended to be lenient.

Prompt To Fix With AI
This is a comment left during a code review.
Path: plans/790/engine-native-record-selection.md
Line: 634-657

Comment:
**Resume state machine missing `IF_POSSIBLE` fingerprint-mismatch path**

The state machine only shows `LoadConfig --> Incompatible` for fingerprint mismatch when `resume=ALWAYS`. When checkpoints exist but the fingerprint has changed and `resume=IF_POSSIBLE`, there is no transition shown. Standard resume semantics would fall back to `Fresh`, discarding committed candidate progress, but that branch is absent. An implementer who reads only this diagram could silently re-run from scratch without any signal to the caller that prior work was abandoned, or could raise an error when the mode was intended to be lenient.

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +586 to +612

```json
{
"target_num_records": 5000,
"actual_num_records": 4217,
"record_selection": {
"predicate_column": "meets_criteria",
"max_candidate_records": 20000,
"on_exhausted": "raise",
"candidate_records_generated": 12000,
"candidate_batches_completed": 12,
"accepted_records": 4217,
"rejected_records": 7769,
"null_predicate_records": 14,
"trimmed_accepted_records": 0,
"acceptance_rate": 0.3514167,
"selection_satisfied": false,
"selection_exhausted": false,
"next_candidate_batch_id": 12,
"next_candidate_offset": 12000
}
}
```

The candidate-batch marker directory is the filesystem source of truth. Global metadata is a convenient summary and may lag
by one checkpoint during a crash, just as current metadata may lag parquet writes.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Metadata accounting invariant breaks when rows fail before predicate evaluation

The plan states "Failed/dropped candidate slots count against max_candidate_records; they consumed an attempt and potentially model cost," but no failed_generation_records counter appears in the proposed record_selection metadata block. The example JSON validates as accepted (4 217) + rejected (7 769) + null_predicate (14) == candidate_records_generated (12 000), which holds only when all candidates reach the predicate step. If a batch has partial generation failures, the invariant silently breaks and the discrepancy becomes invisible to callers reading metadata.json or the batch markers.

Prompt To Fix With AI
This is a comment left during a code review.
Path: plans/790/engine-native-record-selection.md
Line: 586-612

Comment:
**Metadata accounting invariant breaks when rows fail before predicate evaluation**

The plan states "Failed/dropped candidate slots count against `max_candidate_records`; they consumed an attempt and potentially model cost," but no `failed_generation_records` counter appears in the proposed `record_selection` metadata block. The example JSON validates as `accepted (4 217) + rejected (7 769) + null_predicate (14) == candidate_records_generated (12 000)`, which holds only when all candidates reach the predicate step. If a batch has partial generation failures, the invariant silently breaks and the discrepancy becomes invisible to callers reading `metadata.json` or the batch markers.

How can I resolve this? If you propose a fix, please make it concise.

@github-actions

github-actions Bot commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

Code Review: PR #792docs: add engine-native record selection plan

Summary

This PR adds a single 985-line design document, plans/790/engine-native-record-selection.md,
proposing an engine-native record-selection feature: users declare a boolean predicate column and
request an exact number of accepted rows from one DataDesigner.create() call. The engine
generates bounded, immutable candidate batches, evaluates the predicate, checkpoints only accepted
rows, and supports durable resume.

Per the review instructions for plans/-only changes, this review focuses on plan completeness,
feasibility, and alignment with the existing architecture
(verified against the codebase), not
linting or code style.

Overall this is a strong, unusually thorough design. It respects the interface → engine → config
dependency direction, correctly separates candidate coordinates from accepted-output coordinates,
and is grounded in real prior art (PR #773). The findings below are gaps and inconsistencies that
should be resolved before implementation, plus one significant reuse/altitude miss where the plan
proposes building infrastructure that already exists.

Findings

1. Proposed CandidateBatchPlan duplicates the existing ExplicitRowGroupPlan (reuse / altitude)

§2 (lines 397–410) proposes adding "a plan type capable of preserving explicit start offsets," a
new CandidateBatchPlan dataclass with __iter__ yielding (row_group_id, size) and a
row_group_start_offset(row_group) method. Phase 2 (line 812) lists row_group_plan.py — explicit candidate batch plan with start offset as new work.

This capability already exists. packages/.../dataset_builders/row_group_plan.py:244
defines ExplicitRowGroupPlan, which conforms to the RowGroupPlanLike protocol
(row_group_plan.py:11), materializes explicit _start_offsets per row group in __post_init__,
implements __iter__, and exposes row_group_start_offset() (line 289). The scheduler already
consumes this via _get_rg_start_offset() (async_scheduler.py:2148) and threads it through the
current_row_group_start_offset ContextVar (async_scheduler.py:1593). normalize_row_group_plan()
(line 300) is the existing entry point.

Since v1 maps each candidate batch to exactly one fresh row group of a known size at a known offset,
a candidate batch can be scheduled by constructing an ExplicitRowGroupPlan((row_group_id, size),)
with the desired base offset — no new plan type is needed. The plan should either reuse
ExplicitRowGroupPlan or justify why its interface is insufficient. As written it risks a parallel,
redundant implementation of already-tested offset infrastructure.

2. CandidateBatch and CandidateBatchPlan are near-identical DTOs (simplification)

The CandidateBatch dataclass (lines 343–348) and the proposed CandidateBatchPlan (lines 398–403)
carry the same four fields: candidate_batch_id, row_group_id, start_offset, size. Collapsing
these (and, per finding #1, folding scheduling onto ExplicitRowGroupPlan) removes a redundant type
and a source of drift between two structures that must stay in sync.

3. Batch-marker accepted_records vs trimmed_accepted_records semantics are undefined for resume

The batch marker (lines 567–579) records both accepted_records: 137 and
trimmed_accepted_records: 0, and resume "reconstruct[s] accepted and candidate progress from
committed batch markers" (line 625) by summing across markers. But the plan never states whether
accepted_records is the pre-trim count or the post-trim count actually written to parquet.

This matters on the overshoot path: the final candidate batch is trimmed to the remaining target
(lines 185, 373). If accepted_records is pre-trim and reconstruction sums accepted_records
across markers, the controller will compute a total larger than the rows actually persisted, and a
resume of a satisfied run could mis-detect the terminal state — or, worse, a crash between the
trimmed parquet write and marker commit could leave the reconstructed count inconsistent with the
parquet row count that "Verify every marker … points to a readable file with the expected accepted
count" (line 626) checks against. Define precisely: which field equals the parquet row count, and
which quantity the controller sums to decide accepted >= target.

4. Media-artifact cleanup is required for v1 but left undecided (completeness gap)

Line 694 states "V1 must choose and document one of these approaches" for rejected-row media
(orphaned image/audio/video/trace artifacts), and line 700 says "Do not silently accumulate unbounded
rejected media." Yet the decision is deferred to open question #5 (line 948), the "Recommended
decisions for v1" section (954–967) says nothing about media, and the Definition of Done (969–982)
omits it entirely. This is an internal contradiction: a concern the plan itself labels mandatory for
v1 has no committed decision and is absent from the completion criteria. Either commit to one
strategy (the plan already marks option 1, candidate-scoped staging, as "Preferred") and add it to
the DoD, or explicitly scope media-producing columns out of v1 selection.

5. buffer_size drives candidate batch sizing but is excluded from the fingerprint — resume interaction unaddressed

Candidate batch size derives from min(run_config.buffer_size, target, remaining_budget) (lines
422–428), and the plan places batch sizing in RunConfig deliberately (line 170). However
RunConfig/buffer_size is not part of DataDesignerConfig.fingerprint() — the fingerprint's
identity-relevant fields are columns, models, tools, seed, constraints, processors
(fingerprint.py:81–89), and only RecordSelectionConfig is added (line 169). Meanwhile the
existing resume contract already warns "buffer_size must match the original run"
(data_designer.py:232).

Consequence: a resume with a changed buffer_size will produce differently-sized future candidate
batches while past offsets are reconstructed from markers. Contiguity of candidate offsets is
preserved (each new batch starts where the last ended), so this may be acceptable — but the plan
never states it. Add an explicit note on whether buffer_size may change across a resumed selection
run and what the offset/sizing guarantee is. The resume test list (898–907) should cover it.

6. Phase decomposition omits the interface-layer num_records reinterpretation and run-boundary validation

The plan's central contract is that num_records changes meaning to "accepted output records" when
selection is configured (lines 21, 176). The create()-boundary validations
(max_candidate_records >= num_records, predicate existence, boolean output type — lines 190–197)
are correctly placed at the run boundary rather than in config (config cannot see the runtime
num_records). But the four implementation phases assign this nowhere: Phase 1 is config, Phase 2/3
are engine, Phase 4 is "Results, docs, and examples." The interface plumbing that routes
num_records as an accepted-target into DatasetBuilder.build(target_accepted=X) (per the sequence
diagram, line 283) and performs the run-boundary validation has no phase. Add an explicit
interface-layer deliverable so this isn't dropped.

7. Preview scope is left fuzzy relative to the Definition of Done

The Preview section (772–782) requires that preview(num_records=N) return accepted rows and honor
max_candidate_records, but offers two options (reuse controller in-memory vs. reject with a clear
error) without committing, and open question #4 restates the indecision. preview() today defaults
num_records=DEFAULT_NUM_RECORDS (data_designer.py:384), so an un-guarded preview under selection
would silently reinterpret that default as an accepted target and could run unbounded up to
max_candidate_records. The DoD (969–982) doesn't mention preview at all. Recommend committing to
the "reject clearly in v1" option in the DoD if preview integration isn't in scope, so the default
preview path can't become accidentally unbounded.

8. RecordSelectionExhaustedError is raised inside the engine builder loop but its home layer is an open question

The exhaustion example raises RecordSelectionExhaustedError(...) directly (lines 710–715), which by
the sequence diagram executes inside the engine DatasetBuilder — yet open question #8 (line 951)
still asks whether the error belongs "under interface errors or normalize[d as] an engine error at
the interface boundary." Per the project's "errors normalize at boundaries" invariant (AGENTS.md),
this should be resolved in the plan: define the error in the engine and re-expose/normalize it at the
interface, matching existing typed early-shutdown errors (which the plan references at line 727).
Leaving the layer undecided while already showing engine-side raise risks an implementation that
leaks an engine-internal type to callers.

Verdict

Approve with revisions requested (non-blocking for a design proposal).

This is a high-quality, implementable design that correctly respects the package layering and the
declarative-config contract. Before implementation begins, the author should resolve the reuse of
ExplicitRowGroupPlan (finding #1 — the most impactful, as it prevents building redundant offset
infrastructure), pin down the marker count semantics for resume (finding #3), and either commit to a
media-cleanup strategy or scope media out of v1 (finding #4). Findings #5#8 are smaller completeness
gaps that would benefit from a sentence or a DoD/phase edit each. None of these block landing the
plan as a status: proposal document, but they should be addressed as the design moves toward
"approved."

@andreatgretel andreatgretel left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The engine-native direction makes sense to me: exact accepted-row selection is a stage-local data-plane operation, while workflow chaining remains responsible for separate generate, judge, enrich, and transform stages.

The main design points I think need resolving before implementation are:

  1. preserving explicit candidate offsets through the current row-group-plan normalization;
  2. keeping selection checkpoint artifacts valid after after-generation processing; and
  3. defining a successful zero-row return_partial path through the interface and profiler.

The remaining comments are API and lifecycle clarifications rather than objections to the overall direction.

...
```

Add a plan type capable of preserving explicit start offsets:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CandidateBatchPlan as sketched will lose its explicit start_offset in the current scheduler wiring. _prepare_async_run() accepts RowGroupInput, and normalize_row_group_plan() preserves only CompactRowGroupPlan and ExplicitRowGroupPlan. Any other iterable is wrapped in ExplicitRowGroupPlan, which recomputes offsets starting from zero.

Because each candidate plan contains one row group, every candidate batch would expose offset zero to the ordered seed generator and replay the beginning of the seed dataset.

Could the plan specify either a complete RowGroupPlanLike implementation that is preserved by the normalizer, or a dedicated candidate-offset parameter? The regression test should assert that candidate batch 1 starts after candidate batch 0 rather than only checking output counts.

For v1:

1. Reject known row-count-changing after-generation processors at compile/runtime setup.
2. Run allowed after-generation processors once over accepted output.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The resume model treats candidate-batch markers and their referenced parquet files as the source of truth. However, ProcessorRunner.run_after_generation() currently deletes parquet-files/ and re-chunks the combined dataset. Even a row-count-preserving processor can therefore remove marker paths or change their expected row counts, making a completed selection run appear corrupt on resume.

Could the plan explicitly separate immutable accepted-candidate partitions from the published postprocessed dataset? Selection markers could reference the immutable partitions, while terminal metadata points to the final processed output.

)
```

For `return_partial`, finalize the accepted output and record `selection_satisfied=false` and

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return_partial is specified to complete successfully even when zero candidates pass, but DataDesigner.create() currently rejects every zero-row dataset before profiling. Since zero-acceptance batches intentionally write no parquet, the all-rejected path cannot currently return a DatasetCreationResults.

Could the plan explicitly require terminal empty-output handling, including bypassing the ordinary zero-row failure guard, materializing a schema-bearing empty dataset, and defining empty profiling behavior? Otherwise the empty-partial test described below will not be implementable.


- It participates in normal DAG dependency discovery.
- It can be previewed and debugged like any other column.
- It can be generated by expressions, plugins, validators, or future generator types.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we clarify the scope of custom or plugin-generated predicates? A full-column predicate can perform batch-global filtering, but it cannot safely implement run-global deduplication, quotas, or ranking unless its cross-batch state is durable across resume. Checkpointed accepted rows also cannot later be revoked, which excludes selectors such as global top-N.

It may be worth stating that V1 predicates must be row-local or otherwise monotonic and resume-safe.

builder.add_column(
dd.ExpressionColumnConfig(
name="meets_criteria",
expr="{{ quality.score >= 0.8 }}",

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this predicate path is incomplete for LLMJudgeColumnConfig. Judge results are nested under the configured score name, so with Score(name="answer_quality", ...) the expression would be something like quality.answer_quality.score >= 0.8. quality.score does not match the current structured judge output.

Could the example define a concrete Score and use its full nested path?

self._handle_selection_completion(controller)
```

Initialize generator instances once and reuse them across candidate batches. In particular:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A fresh scheduler per candidate batch currently means calling _prepare_async_run() per batch, and that method invokes every generator's log_pre_generation(). To satisfy the stated once-per-logical-build behavior, the implementation will need to move those calls outside scheduler preparation and invoke them before the candidate loop.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants