Skip to content

fix(hub): surface silently swallowed errors in hubsync hook and replication loop#114

Open
omergk28 wants to merge 2 commits into
ActiveMemory:mainfrom
omergk28:fix/100-hub-silent-error-suppression
Open

fix(hub): surface silently swallowed errors in hubsync hook and replication loop#114
omergk28 wants to merge 2 commits into
ActiveMemory:mainfrom
omergk28:fix/100-hub-silent-error-suppression

Conversation

@omergk28

Copy link
Copy Markdown
Contributor

Summary

Two hub code paths swallowed errors with no logging surface — the session-start hubsync hook (internal/cli/system/core/hubsync) and the cluster replication loop (internal/hub/replicate.go). Operators couldn't tell a healthy quiet sync from one that never reached the network.

This PR wires every silent return through the established warn sink with format constants in config/warn, and un-conflates the sync-error check from the empty-result check (only the error warns now). Both functions keep their signatures and never-block contracts — logging is the only behavior change.

Closes #100.

Already landed

Two of #100's sub-items were already in main before this PR (the conn.Close defer warn and the store.Append warn + keep-consuming, via CloseHubClient/HubReplicateAppend). This PR covers the rest and adds the regression tests the issue asked for — including pinning the already-landed keep-consuming behavior, which previously had no coverage.

One deliberate deviation from the issue's proposed code

The issue's sketch warns on every RecvMsg error — but io.EOF is the normal end of every sync stream, so that would emit a warning once per ReplicateInterval on perfectly healthy replication. The receive path here stays silent for eof() (the hub package's own end-of-stream helper, same as the other two RecvMsg sites) and for caller shutdown (ctx.Err() != nil), and warns on everything else. TestReplicateOnce_CleanReplicationDoesNotWarn pins this.

Tests (mapped to the issue's Tests Required)

Issue ask Test
Warns on load error TestSync_WarnsOnLoadError
Warns on dial error TestSync_WarnsOnDialError — empirically, the only eager grpc.NewClient failure is a control character failing URL parse; every other malformed target (://invalid…) defers to first use. TestSync_WarnsOnPullError covers the lazy path (closed port → Sync RPC fails)
No warn on empty result (the un-conflation) TestSync_NoWarnOnEmptyResult — real in-process hub, zero entries, asserts an empty warn buffer
Replication warns on dial error TestReplicateOnce_WarnsOnDialError + TestReplicateOnce_WarnsOnTransportError
Keeps consuming after append error TestReplicateOnce_KeepsConsumingAfterAppendError — read-only follower dir, asserts exactly 2 append warnings (root/Windows skip guards per the rc/require_test.go precedent)

All use the existing warn.SetSink seam; the hubsync package previously had zero tests.

Verification

  • go test -race -count=2 on both packages — clean.
  • Mutation checks: three deliberate regressions each fail their pinning test — (1) re-conflating syncErr with the empty check fails TestSync_NoWarnOnEmptyResult; (2) removing the EOF guard fails TestReplicateOnce_CleanReplicationDoesNotWarn; (3) dropping the dial warn fails TestSync_WarnsOnDialError.
  • make audit — all checks passed.

🤖 Generated with Claude Code

…cation loop

The session-start hubsync hook returned "" on config-load, dial,
sync-RPC, and write failures with no trace, and conflated a real
sync error with the genuine "nothing new" result. The follower
replication loop returned silently on dial, stream-open, send, and
close-send failures and on every receive error. Operators could not
distinguish a healthy quiet sync from one that never reached the
network.

Wire every silent return through the warn sink with format
constants in config/warn. Behavior is otherwise unchanged: both
functions keep their signatures and never-block contracts.

- Sync now warns per failure site and stays silent only for a
  genuine zero-entry result (the un-conflation from ActiveMemory#100).
- replicateOnce warns per failure site; the receive path stays
  silent for the eof() end-of-stream (the one deliberate deviation
  from the issue's proposed code, which would have warned once per
  replication cycle) and for caller shutdown.
- Two ActiveMemory#100 sub-items (close-defer warn, append warn + keep
  consuming) had already landed; this covers the rest and adds the
  regression tests the issue asked for, including pinning
  keep-consuming-after-append-error and clean-EOF-no-warn.

Closes ActiveMemory#100.

Spec: specs/fix-hub-silent-error-suppression.md
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Omer Kocaoglu <omergk28@gmail.com>
@omergk28 omergk28 requested a review from josealekhine as a code owner June 12, 2026 01:15
@josealekhine

Copy link
Copy Markdown
Member

The fix is correct, and the key insight is right

Some nits below:

The real bug was the conflation of "error" with "nothing new":

// internal/cli/system/core/hubsync/sync.go — before
entries, syncErr := client.Sync(context.Background(), cfg.Types, 0)
if syncErr != nil || len(entries) == 0 {
    return ""
}

The PR un-conflates them so a real error warns while a genuine empty result stays silent:

// after
if syncErr != nil {
    logWarn.Warn(cfgWarn.HubSyncPull, cfg.HubAddr, syncErr)
    return ""
}
if len(entries) == 0 {
    // Genuine empty result: not an error, no warning.
    return ""
}

Every other silent return "" now routes through the logWarn sink, and both functions keep their signatures and non-blocking contract (still return "" everywhere). The replication loop's recv suppression is the right call — silent on io.EOF (normal stream end) and ctx.Err() != nil (shutdown), warn on anything else:

// internal/hub/replicate.go
if recvErr := stream.RecvMsg(msg); recvErr != nil {
    if !eof(recvErr) && ctx.Err() == nil {
        logWarn.Warn(cfgWarn.HubReplicateRecv, masterAddr, recvErr)
    }
    return
}

…and the spec documents this as a deliberate deviation from issue #100's proposed code (which warned on every recv error and would have made clean cycles noisy).

Things I checked that hold up

  • All 9 new warn format/arg arities match. go vet would not catch a mismatch here — Warn(format string, args ...any) isn't a recognized printf-wrapper — so I verified each call by hand (e.g. HubSyncWrite = "hubsync: write %d entries: %v"Warn(..., len(entries), writeErr)).
  • No secret leakage. The token never appears in any warn format; only the hub address + error.
  • Tests are genuinely excellent. Each error path asserts its warning; TestSync_NoWarnOnEmptyResult pins the un-conflation; TestReplicateOnce_CleanReplicationDoesNotWarn pins the EOF suppression (zero warnings on a real round-trip + lastSeq == 2); TestReplicateOnce_KeepsConsumingAfterAppendError pins continue-on-append-failure (chmod-0 dir → exactly 2 append warnings). The control-char ("\x00") trick to force an eager grpc.NewClient failure is a nice touch given the resolver is otherwise lazy.
  • Noise is bounded. The live hook (checkhubsync.Run) is gated by check.DailyThrottled, so Sync and its warnings fire at most once per day, not every session start.
  • Replication path is dormant. replicateOnce is reserved dead code today (var _ = startReplication, "called from Server.Start when a follower peer is configured"), so those warnings won't fire until cluster mode is wired.

Findings (all minor / adjacent — none are blockers)

1. (adjacent, pre-existing) The "must not block the session" contract isn't actually guaranteed

The doc update reaffirms the contract:

// internal/cli/system/core/hubsync/doc.go
// Every error is surfaced as a stderr warning via the warn sink,
// but never propagates: the hook must not block the session start.

But the underlying RPC has no deadline:

// sync.go — context.Background(), no timeout
entries, syncErr := client.Sync(context.Background(), cfg.Types, 0)

// internal/hub/client.go — lazy NewClient, fail-fast (not WaitForReady), no dial timeout
conn, dialErr := grpc.NewClient(addr, grpc.WithTransportCredentials(insecure.NewCredentials()), ...)

A hub that accepts the TCP connection but never responds (hung server / black-hole proxy) makes RecvMsg inside Sync block indefinitely → the session-start hook hangs, contradicting the contract this PR re-documents. Not introduced here (the context.Background() line is unchanged), and the daily throttle limits exposure to once/day — but since the PR draws attention to this exact contract, wrapping the Sync RPC in a context.WithTimeout would make "never block" actually true. This is the one finding with real teeth; worth a follow-up rather than blocking the merge.

2. (forward-looking) Replication warning cadence when wired

replicateOnce is dead code today, but when startReplication is wired (#96 territory), a persistently-down-but-reachable master would warn every ReplicateInterval (5s, per internal/config/hub) via HubReplicateRecv with no dedup/backoff — the EOF+ctx guard only suppresses normal completion and shutdown, not a recurring transport error. Consider warn-once-per-state-transition when that path goes live. Noted now because the warning constants and call sites ship in this PR.

3. (trivial) eof() uses ==, not errors.Is

// internal/hub/eof.go (pre-existing helper)
func eof(err error) bool { return err == stdio.EOF }

Correct for gRPC's unwrapped io.EOF (the clean-replication test confirms it in practice), just marginally less defensive than errors.Is(err, io.EOF) if any layer ever wraps it. Pre-existing, but this PR newly relies on it for the suppression branch.

…ory#114 review)

Review of ActiveMemory#100 surfaced that the re-documented "must not block the
session start" contract was not actually guaranteed: the hubsync pull
RPC ran under context.Background() with no deadline, so a hub that
accepts the connection but never responds (hung server, black-hole
proxy) would hang the session-start hook indefinitely.

- Bound the pull with context.WithTimeout(HubSyncTimeout) — new 10s
  constant in internal/config/hub. An exceeded deadline surfaces via
  the existing HubSyncPull warning and the hook returns "" like any
  other failure; the daily throttle plus next-session retry covers a
  cut-off pull. A test-overridable syncTimeout var (export_test.go
  seam) lets TestSync_WarnsOnHungHub drive a black-hole listener in
  milliseconds instead of the full interval.
- Harden internal/hub/eof.go to errors.Is(err, io.EOF) so the
  replication receive-suppression branch stays correct if any layer
  ever wraps EOF (gRPC delivers it unwrapped today).
- Record the forward-looking finding (dedup/backoff for the
  per-cycle HubReplicateRecv warning once startReplication is wired,
  ActiveMemory#96) as a task rather than guessing now; replicateOnce is dead
  code until then.

Spec: specs/fix-hub-silent-error-suppression.md
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Omer Kocaoglu <omergk28@gmail.com>
@omergk28

Copy link
Copy Markdown
Contributor Author

Thanks for the thorough review — addressed in 3cfdfd14:

  • F1 (the one with teeth) — Sync could block on a hung hub. Bounded the pull with context.WithTimeout(context.Background(), HubSyncTimeout) (new 10s constant in internal/config/hub). An exceeded deadline surfaces through the existing HubSyncPull warning and the hook still returns "" like any other failure; the daily throttle + next-session retry cover a cut-off pull. Made syncTimeout a test-overridable var (export_test.go seam) and added TestSync_WarnsOnHungHub, which drives a black-hole TCP listener (accepts the connection, never speaks gRPC) and asserts Sync returns rather than hanging. Sanity check: with the deadline reverted to context.Background(), that test fails at its 5s guard — so it has teeth.
  • F3 — eof() now uses errors.Is(err, io.EOF) instead of ==.
  • F2 (replication warn cadence) — deferred as you suggested. Recorded as a task tied to the Hub status: surface cluster leadership (IsLeader, LeaderAddr) in Status RPC + ctx hub status #96 wiring (with a spec Out-of-Scope note) rather than guessed at now, since replicateOnce is dead code until then.

The deadline (F1) is the only behavior change; the never-block contract the PR re-documents is now actually true.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Hub: silent error suppression in hubsync hook + replication loop

2 participants