feature: Add maintain command for automated runway provisioning #30

Merged
codybswaney merged 8 commits into master from feature/pgslice-maintain
Jul 1, 2026

Conversation

@codybswaney

Summary

maintain is the scheduler entry point: it discovers every partitioned table that carries a pgslice settings comment and extends them all in one run.

  • Per-period horizons: --future-daily / --future-weekly / --future-monthly / --future-yearly (defaults 90/26/6/1, respectively). Each also reads a PGSLICE_FUTURE_* environment variable (precedence: flag > env > default), so a scheduled job can be configured through its environment.
  • Per-table isolation — one table's failure is recorded without aborting the rest of the fleet; the process exits non-zero if any table failed or is replication-unsafe.
  • Replica-identity guard — read-only check that every leaf has a usable replica identity for logical replication; surfaces problems, never mutates.
  • Structured JSONL logging — one record per line, each stamped with a per-run jobId and the target host + database.
  • Fix: AdvisoryLock.withLock no longer lets a lock-release failure in its finally block mask the handler's real error.

Discovery filtering (--schema), grant inheritance, and the composite PK handling reuse the existing add_partitions machinery — maintain simply adds the fleet-wide discovery/iteration plus the logging and additional guard layers.

Test plan

  • npm test (vitest against PostgreSQL 13–18): discovery + filtering, per-period extension, idempotent re-run, per-table failure isolation + non-zero exit, the replica-identity guard, the JSONL record shape (host / db / jobId, no-op vs. extended, error surfacing), and env/flag precedence.
  • Validated end-to-end against ephemeral database copies via a scheduled-job runner; see this page for WorkOS-internal validation against real tables.

For INFRA-5546

codybswaney and others added 7 commits July 1, 2026 11:52
…guard)

Discovers every partitioned parent carrying a valid pgslice settings
comment and extends each with add_partitions in one run, with per-table
failure isolation and a read-only CDC-readiness guard (every leaf must
have a replica identity usable for logical replication). Exits non-zero
if any table failed or is CDC-unsafe.
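The per-table failure isolation described above can be sketched roughly as follows. This is an illustration only, not the PR's code: `TableResult`, `maintainAll`, and the `extend` callback (a stand-in for the add_partitions call) are hypothetical names, and the CDC-readiness guard is left out.

```typescript
// Hypothetical result shape; the PR's actual types are not shown in this thread.
interface TableResult {
  table: string;
  success: boolean;
  error?: string;
}

// Extend every table, recording a failure instead of letting it abort the loop.
async function maintainAll(
  tables: string[],
  extend: (table: string) => Promise<void>, // stand-in for the add_partitions call
): Promise<{ results: TableResult[]; exitCode: number }> {
  const results: TableResult[] = [];
  for (const table of tables) {
    try {
      await extend(table);
      results.push({ table, success: true });
    } catch (err) {
      // Keep the cause for the summary, then move on to the next table.
      const message = err instanceof Error ? err.message : String(err);
      results.push({ table, success: false, error: message });
    }
  }
  // Any failed table turns the whole run into a non-zero exit.
  const exitCode = results.every((r) => r.success) ? 0 : 1;
  return { results, exitCode };
}
```

The loop is sequential on purpose in this sketch, so one table's error cannot interleave with another table's work; whether the real command runs tables sequentially is not stated in the thread.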

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

A single --future N means very different runway for weekly vs monthly vs
yearly tables (3 weeks vs 3 months vs 3 years). Replace it with per-period
--future-daily / --future-weekly / --future-monthly / --future-yearly
(defaults 90 / 26 / 6 / 1); discovery returns each table's period, and each
table is extended by the horizon for its own period, so the fleet gets
comparable forward coverage.
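A minimal sketch of the per-period lookup, assuming discovery yields a period label for each table. The `Period` labels, `DiscoveredTable`, and `futureFor` are hypothetical names chosen to mirror the flag names; only the default values come from the commit message.

```typescript
// Hypothetical period labels mirroring the --future-* flag names.
type Period = "daily" | "weekly" | "monthly" | "yearly";
type Horizons = Record<Period, number>;

// Defaults taken from the commit message: 90 / 26 / 6 / 1.
const DEFAULT_HORIZONS: Horizons = { daily: 90, weekly: 26, monthly: 6, yearly: 1 };

// Hypothetical discovery result: a table name plus its partitioning period.
interface DiscoveredTable {
  name: string;
  period: Period;
}

// Each table is extended by the horizon configured for its own period.
function futureFor(table: DiscoveredTable, horizons: Horizons = DEFAULT_HORIZONS): number {
  return horizons[table.period];
}
```

With these defaults, a weekly table gets 26 future partitions and a yearly table gets 1, rather than both receiving the same count.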

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

maintain now writes one JSON object per line so a log collector can extract
the keys as attributes:
  - a start record (target db + per-period horizons),
  - one record per table extended (msg, level, target, partition counts,
    success),
  - a final summary (succeeded/failed counts and table lists).

Only info and error levels are used, and the partitioning model is left out
of the logs. The command signals failure through its exit code (now honored
by BaseCommand.execute) instead of a plain-text error line, keeping stdout
pure JSONL.
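To illustrate the one-object-per-line shape, here is a small sketch. The field names beyond `level` and `msg`, the message strings, and the `toJsonlLine` helper are assumptions made for illustration; the real record layout is not reproduced in this thread.

```typescript
// Only info and error levels are used, per the commit message.
type Level = "info" | "error";

// Hypothetical record shape: level and msg plus arbitrary structured attributes.
interface LogRecord {
  level: Level;
  msg: string;
  [key: string]: unknown;
}

// JSON.stringify escapes newlines inside strings, so each record stays on one line.
function toJsonlLine(record: LogRecord): string {
  return JSON.stringify(record) + "\n";
}

// Illustrative run: a start record, per-table records, and a final summary.
const exampleRun: LogRecord[] = [
  { level: "info", msg: "maintain started", target: { database: "app" } },
  { level: "info", msg: "table extended", table: "public.events", success: true },
  { level: "error", msg: "table failed", table: "public.audit", success: false, error: "relation\nmissing" },
  { level: "info", msg: "maintain finished", succeeded: 1, failed: 1 },
];

const output: string = exampleRun.map(toJsonlLine).join("");
```

Because failures are carried as records and through the exit code, nothing other than JSON lines needs to reach stdout.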

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Add the endpoint host (taken from the connection URL, host only, no credentials) to
every maintain log record's target alongside the database name, so a run
identifies which host and DB it extended. Distinguish a table that needed no new
partitions ("Table already up to date; no extension needed") from one that was
extended, and drop the deployment-specific rationale from the code comments.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Generate one UUID per maintain run and stamp it on the start, per-table, and
final records so all logs from a single invocation can be correlated. Generated
with crypto.randomUUID; injectable via options for deterministic tests.
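A sketch of the injectable job ID, assuming an options object carries the generator. `MaintainOptions`, `generateJobId`, and `stampRecords` are hypothetical names; only the use of crypto.randomUUID comes from the commit message.

```typescript
import { randomUUID } from "node:crypto";

// Hypothetical options shape: tests can inject a fixed generator.
interface MaintainOptions {
  generateJobId?: () => string;
}

// Generate one ID per run and stamp it on every record from that run.
function stampRecords<T extends object>(
  records: T[],
  options: MaintainOptions = {},
): Array<T & { jobId: string }> {
  const generate = options.generateJobId ?? (() => randomUUID());
  const jobId = generate(); // called once, so all records share the same ID
  return records.map((record) => ({ ...record, jobId }));
}
```

A test can pass `{ generateJobId: () => "fixed-id" }` and assert on exact output, while production runs fall back to a random UUID.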

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

AdvisoryLock.withLock released the session-level lock in a finally block. When
the handler aborts the transaction (e.g. a failed partition CREATE), the unlock
query itself errors with "current transaction is aborted", and that finally
exception replaced the handler's real error — so maintain logged an unhelpful
message instead of the cause. Release now only propagates its own failure when
the handler succeeded; otherwise the handler's error wins. The test harness runs
with advisory locks disabled, so this only surfaced against a live database.
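The error-precedence rule in this commit can be sketched generically. This is not the AdvisoryLock implementation: `acquire`, `release`, and `handler` are hypothetical callbacks standing in for the lock queries and the command body.

```typescript
// Run handler under a lock; a release failure never replaces the handler's error.
async function withLock<T>(
  acquire: () => Promise<void>,
  release: () => Promise<void>,
  handler: () => Promise<T>,
): Promise<T> {
  await acquire();
  let result: T;
  try {
    result = await handler();
  } catch (handlerError) {
    // Best-effort release: if it also fails (for example because the
    // transaction is aborted), ignore that and rethrow the real cause.
    try {
      await release();
    } catch {
      // intentionally ignored so the handler's error wins
    }
    throw handlerError;
  }
  // The handler succeeded, so a release failure is the only error and propagates.
  await release();
  return result;
}
```

Compared with a plain `finally`, the release call is split across the two paths so that an exception thrown during cleanup can only surface when there is no earlier error to hide.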

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Give each --future-* flag a clipanion env fallback (PGSLICE_FUTURE_DAILY /
_WEEKLY / _MONTHLY / _YEARLY), so a scheduled job can configure the horizons via
environment — as the Terrace ScheduledProcess will — without templating the
command args. Precedence is flag > env > baked default, matching how --url
already falls back to PGSLICE_URL.
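The flag > env > default rule can be spelled out by hand as below. In the actual command the fallback is handled by clipanion; `resolveHorizon` is a hypothetical helper, and treating an empty environment value as unset is an assumption of this sketch.

```typescript
// Resolve one horizon with precedence: flag > env > baked default.
function resolveHorizon(
  flag: number | undefined,
  envValue: string | undefined,
  fallback: number,
): number {
  if (flag !== undefined) {
    return flag; // an explicit flag always wins
  }
  if (envValue !== undefined && envValue !== "") {
    const parsed = Number.parseInt(envValue, 10);
    if (Number.isNaN(parsed)) {
      throw new Error(`invalid horizon value: ${envValue}`);
    }
    return parsed; // environment variable, e.g. PGSLICE_FUTURE_WEEKLY
  }
  return fallback; // baked default, e.g. 26 for weekly tables
}
```

For a weekly horizon, `resolveHorizon(undefined, "40", 26)` yields 40, and passing a flag value of 12 yields 12 regardless of the environment.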

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@greptile-apps

greptile-apps Bot commented Jul 1, 2026

Greptile Summary

This PR adds a fleet-wide maintain command for pgslice partition runway management. The main changes are:

  • New maintain CLI command with per-period future horizons.
  • Catalog discovery for managed partitioned tables.
  • Per-table isolation with structured JSONL run logs.
  • Replica-identity checks across partition leaves.
  • Transaction-scoped advisory locking for command-scoped operations.

Confidence Score: 5/5

This looks safe to merge.

No blocking issues found in the changed code.

No files need attention.

T-Rex Logs

What T-Rex did

  • The maintain CLI was checked out at base and an initial maintain test run was attempted, but there were no test files and the run exited with code 1.
  • The maintain CLI was checked out at head and a subsequent maintain test run was attempted, but it failed because the PGSLICE_URL environment variable must be set for tests.
  • The CLI help was inspected, showing the top-level pgslice maintain command and options such as --future-daily, --future-weekly, --future-monthly, --future-yearly, and --schema.
  • The replica identity guard base side tests were blocked because PGSLICE_URL was not set.
  • The replica identity guard head side tests were blocked because PGSLICE_URL was not set.
  • The maintain API base checkout/build/import showed hasMaintain: undefined and no src/maintain.test.ts.
  • The maintain API head checkout/build/import showed hasMaintain: function, but running tests failed with ECONNREFUSED to PostgreSQL.
  • The advisory lock tests on the base code showed OBSERVED_ERROR_MESSAGE=CURRENT_TX_ABORTED_RELEASE_FAILURE and the preservation assertion failing.
  • The advisory lock tests on the head code showed OBSERVED_ERROR_MESSAGE=TREX_REAL_HANDLER_ERROR and both targeted tests passed.
  • The advisory lock repro source advisory-lock-error-repro.test.ts contains the executed test scenario and the contention check.

Ran code and verified through T-Rex

Important Files Changed

| Filename | Overview |
| --- | --- |
| src/advisory-lock.ts | withLock now uses transaction-scoped advisory locks, while long-running generator paths keep the session-scoped lock path. |
| src/table.ts | unsafeReplicaIdentityPartitions now checks true partition-tree leaves for CDC-unsafe replica identity settings. |
| src/pgslice.ts | Adds the maintain orchestration flow with discovery, per-table execution, logging, and replica-identity readiness results. |
| src/commands/maintain.ts | Adds the CLI wrapper for maintain, including option parsing, host metadata, JSONL output, and non-zero exits for failed tables. |

Reviews (2): Last reviewed commit: "fix: Address advisory-lock leak and nest..."

Comment thread src/advisory-lock.ts Outdated
Comment thread src/table.ts Outdated

- withLock takes a transaction-scoped advisory lock (pg_try_advisory_xact_lock)
  instead of a session lock released in a finally. It's freed when the enclosing
  transaction commits or rolls back, so a handler that aborts the transaction
  can't leak the lock into the pooled connection — and there's no unlock query
  left to fail on an aborted transaction and mask the handler's error (this
  supersedes the earlier swallow-the-error workaround). The session-based
  acquire() is kept for fill/synchronize, which hold a lock across batches.
- unsafeReplicaIdentityPartitions walks the whole partition tree via
  pg_partition_tree(...) filtered to leaves, so a leaf beneath a sub-partitioned
  child is inspected too — the previous direct-children query could report a
  table CDC-ready while a nested leaf had an unusable replica identity.

Adds a leak-regression test (an aborted handler no longer blocks a second
session) and a nested-leaf CDC test.
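The leaf-level check can be sketched as a catalog query plus a pure classifier. Both are assumptions for illustration, not the code in src/table.ts: the query text, `LeafRow`, `isReplicaIdentityUsable`, and `unsafeLeaves` are hypothetical, and "usable" is read here as "UPDATE and DELETE can be published", which is one plausible interpretation.

```typescript
// Illustrative query: walk the whole tree and keep only leaves.
// pg_partition_tree returns (relid, parentrelid, isleaf, level).
const LEAF_REPLICA_IDENTITY_SQL = `
  SELECT c.oid::regclass::text AS leaf,
         c.relreplident,
         EXISTS (SELECT 1 FROM pg_index i
                 WHERE i.indrelid = c.oid AND i.indisprimary) AS has_primary_key,
         EXISTS (SELECT 1 FROM pg_index i
                 WHERE i.indrelid = c.oid AND i.indisreplident) AS has_replident_index
  FROM pg_partition_tree($1::regclass) t
  JOIN pg_class c ON c.oid = t.relid
  WHERE t.isleaf
`;

// pg_class.relreplident: d = default (primary key), n = nothing, f = full, i = index.
type ReplicaIdentity = "d" | "n" | "f" | "i";

interface LeafRow {
  leaf: string;
  relreplident: ReplicaIdentity;
  hasPrimaryKey: boolean;
  hasReplidentIndex: boolean;
}

function isReplicaIdentityUsable(row: LeafRow): boolean {
  switch (row.relreplident) {
    case "f":
      return true; // the full old row is logged
    case "d":
      return row.hasPrimaryKey; // default relies on a primary key existing
    case "i":
      return row.hasReplidentIndex; // the chosen index must still exist
    case "n":
      return false; // no old-row identity is logged
  }
}

// Read-only: report the problem leaves, never change them.
function unsafeLeaves(rows: LeafRow[]): string[] {
  return rows.filter((row) => !isReplicaIdentityUsable(row)).map((row) => row.leaf);
}
```

Filtering on `isleaf` rather than on direct children is what lets a leaf beneath a sub-partitioned child show up in the result.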

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

codybswaney requested review from a team and ryanmcilmoyl July 1, 2026 20:09
codybswaney merged commit d4b15e0 into master Jul 1, 2026
7 checks passed
codybswaney deleted the feature/pgslice-maintain branch July 1, 2026 20:26