Skip to content

feat(core): OIDC sign-in via device flow (RFC 8628)#52

Open
glasstiger wants to merge 83 commits into
mainfrom
ia_oidc_device_flow
Open

feat(core): OIDC sign-in via device flow (RFC 8628)#52
glasstiger wants to merge 83 commits into
mainfrom
ia_oidc_device_flow

Conversation

@glasstiger

@glasstiger glasstiger commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

Adds interactive OIDC sign-in to the Java client using the OAuth 2.0 Device Authorization Grant (RFC 8628). A process with no local browser — a remote notebook kernel, a container, a headless job — can sign a human in against QuestDB Enterprise: the user authorizes on any device (laptop or phone) while the process only makes outbound calls to the identity provider.

On first use it prints a verification URL and a short code (and, by default, also tries to open the URL in a local browser); once the user authorizes, the token is cached in memory and refreshed silently on later calls.

import io.questdb.client.Sender;
import io.questdb.client.cutlass.auth.OidcDeviceAuth;

// Discover the client id, scope and endpoints from the QuestDB server's /settings:
try (OidcDeviceAuth auth = OidcDeviceAuth.fromQuestDB("https://questdb.example.com:9000")) {
    auth.signIn(); // sign in once: prompts on first use, then caches and refreshes

    // Pass a token provider, not a fixed string: the sender pulls a freshly refreshed token on each
    // request, so a long-lived sender keeps working as the token rotates. getToken() refreshes
    // silently and never prompts on the flush path.
    try (Sender sender = Sender.builder(Sender.Transport.HTTP)
            .address("questdb.example.com:9000")
            .enableTls()
            .httpTokenProvider(auth::getToken)
            .build()) {
        sender.table("trades")
                .symbol("symbol", "ETH-USD")
                .doubleColumn("price", 2615.54)
                .atNow();
    }
}

What's new

OidcDeviceAuth (io.questdb.client.cutlass.auth) — runs the flow and owns the token:

  • OidcDeviceAuth.fromQuestDB(url) discovers the client id, scope, audience and IdP endpoints from the server's unauthenticated /settings; OidcDeviceAuth.fromQuestDB(url, DiscoveryOptions) adds an identity-provider pin (.issuer(...)), a TLS config, an allowInsecureTransport opt-in, and the prompt (see Discovery and trust below); OidcDeviceAuth.builder() configures the identity provider explicitly.
  • signIn() signs in interactively on first use, then serves a cached token and refreshes it silently; getToken() never prompts and never waits behind an interactive sign-in (safe on a request/flush path); getAuthorizationHeaderValue() returns the full Bearer … value; clearCache() drops the cached token so the next signIn() re-signs-in; close() cancels an in-flight sign-in (observed between polls, so it can take up to one HTTP request timeout to return). Calls are serialized by a ReentrantLock; getToken() uses tryLock and fails fast rather than wait behind an interactive sign-in. Token state is in-memory only by default; pass a TokenStore to persist it across restarts (see Token persistence below).

Sender integration — new HttpTokenProvider interface and Sender.builder(...).httpTokenProvider(auth::getToken). The sender pulls a freshly refreshed token on every request, so a long-lived sender keeps working as the token rotates — unlike a fixed httpToken(...), which is captured once and eventually starts returning 401s. Mutually exclusive with httpToken/httpUsernamePassword. Supported over HTTP and WebSocket transport (a WebSocket sender re-queries the provider on every (re)connect/upgrade); rejected for TCP and UDP. The two transports diverge on a sustained token outage: over HTTP a failed pull leaves the request token-pending and is retried on the next row, whereas over WebSocket a pull that keeps failing past the sender's reconnect budget terminates the sender for good, like any persistent reconnect failure. The first pull is deferred off the build path to the first row, so the documented construct → signIn() → send ordering works and a provider that throws leaves the request retriable instead of corrupting the sender.

QWP egress query clientQwpQueryClient.withBearerTokenProvider(HttpTokenProvider) accepts the same on-demand provider, so OidcDeviceAuth::getToken plugs into the egress query path as well as ingress. The provider is queried at every WebSocket upgrade — the initial connect() and each failover reconnect — so a long-lived query client follows token rotation; each returned token is validated before it reaches the header, and a provider that throws fails that connection attempt (matching the ingress sender). Mutually exclusive with withBearerToken/withBasicAuth.

DeviceCodePrompt / DeviceAuthorizationChallenge — how the verification URL and user code are shown. The default, DeviceCodePrompt.openBrowser(), prints the instructions to System.out and also tries to open the verification URL in the local default browser; the browser open is best-effort (skipped on a headless JVM, without the java.desktop module, or for a non-http(s) URL, and disabled by -Dquestdb.client.oidc.open.browser=false) and never blocks or fails sign-in. Use DeviceCodePrompt.SYSTEM_OUT to print only, or supply your own to render a clickable link or a QR code, e.g. in a notebook.

audiencebuilder().audience(...) / discovered from acl.oidc.audience. When set, the audience parameter is sent on the device-authorization and refresh requests, for providers that require it to stamp the aud claim QuestDB expects.

The token can be presented to QuestDB over any auth path the server already validates:

  • REST / ingestionAuthorization: Bearer <token>.
  • PG-wire — connect as _sso with the token as the password (requires acl.oidc.pg.token.as.password.enabled=true on the server).

Discovery and trust

fromQuestDB(...) takes the IdP endpoints from the server's unauthenticated /settings, so by default it trusts that server to designate where the user signs in: a spoofed, compromised, or man-in-the-middled server could otherwise redirect the sign-in — and the long-lived refresh token — to an attacker-controlled identity provider. An optional DiscoveryOptions.issuer(...) pin addresses this, and also covers servers that do not advertise a device-authorization endpoint. The pin separates two sources of endpoints and trusts them differently — an endpoint the untrusted /settings advertised is constrained to the issuer, while an endpoint read from the identity provider's own .well-known is trusted wherever the provider hosts it:

  • .well-known discovery fallback. Current servers do not advertise the device-authorization endpoint. When it (and/or the token endpoint) is missing, a pinned issuer reads it from {issuer}/.well-known/openid-configuration. The discovery origin comes only from the caller-supplied issuer, never from a /settings-supplied value, so a tampered /settings cannot choose where discovery — and the credential POSTs it resolves — are aimed. Without a pin, discovery is refused rather than guessed.
  • Co-location pin. validateEndpointOrigins, enforced on every construction path (discovery and the explicit builder()), requires the token and device-authorization endpoints to share one origin (RFC 8628 co-locates them on a single authorization server), so a tampered /settings or discovery document cannot siphon one of the two credential POSTs off to a different origin.
  • /settings-advertised endpoints are pinned to the issuer. An endpoint the untrusted /settings response supplied must sit on the pinned issuer's origin, and — when the issuer has a path — under that path (compared segment by segment, rejecting ./.., percent-encoded traversal, and a percent-encoded path separator such as %2f or %5c, at every decode level). The path check matters for a path-based provider that shares one origin per tenant (e.g. a Keycloak realm path /realms/<realm>), where the origin check alone cannot stop a tampered /settings from steering credentials to a sibling tenant. The issuer is supplied out of band and cannot be forged.
  • IdP-discovered endpoints are trusted where the issuer hosts them. An endpoint read from the issuer's own .well-known is neither origin-pinned nor path-scoped: that document is fetched from the pinned issuer origin and is authoritative for wherever the provider hosts its endpoints. This is deliberate — some providers (e.g. Google, Azure AD) serve their token and device endpoints from a different origin or path than the issuer, and discovery against them signs in normally. The co-location pin above still applies.
  • Plaintext-channel pin. A /settings response fetched over plaintext http to a non-loopback host (only reachable with allowInsecureTransport) is MITM-able, so its advertised endpoints are not trusted to route credentials without an issuer pin.

Without a pin, the behaviour against an https server that advertises its endpoints is unchanged: that server is trusted, as before.

Security

  • https is required by default for both the QuestDB server and the IdP endpoints; http is rejected unless the caller opts in with allowInsecureTransport(true). That opt-in relaxes only the QuestDB /settings link — the IdP device-authorization and token endpoints always require https (loopback excepted), so the device code and refresh token never cross the network in cleartext (matching the Python client).
  • Tokens never leak into logs or exceptions. Only the HTTP status of a token/device response is captured; the body — which carries access, id and refresh tokens — is never retained or surfaced in a message. The captured status is validated to be exactly three digits, so a malformed or hostile status line cannot splice ANSI/control bytes into a later [httpStatus=…] echo, nor can a short all-digit status (2, 5) be misread as a 2xx/5xx class — a malformed-length status falls through to the terminal reject path.
  • Untrusted IdP text is sanitized before it is shown in a prompt or an exception message: per-code-point stripping of control characters, ANSI escapes, CR/LF, and bidirectional / zero-width / Unicode-format characters (including supplementary-plane "tag" characters that arrive as surrogate pairs, and unpaired surrogates), so an attacker-influenced field cannot reorder, hide, or forge what a human reads — e.g. a right-to-left override that makes the displayed verification URL differ from the one the browser opens. A user code or verification URL that is non-empty on the wire but sanitizes to nothing is rejected (or, for verification_uri_complete, treated as absent) rather than shown as a blank line or handed to the browser launcher.
  • The token itself is validated before use. The token QuestDB will actually receive — the id token when the server encodes groups in the token, the access token otherwise — is rejected if it carries a control or non-ASCII character (outside 0x200x7e) before it is cached, placed in the Authorization: Bearer header, or used as the PG-wire password — so a tampered or hostile identity provider cannot smuggle a CR/LF into the request the client then sends to the trusted QuestDB server. Only the served kind is checked; a stray character in the unused token kind, which never reaches the wire, no longer aborts an otherwise usable grant. (The JsonLexer change below decodes JSON escapes, which is what turns a \r/\n in a token into a real byte rather than two literal characters.)
  • URLs are validated up front. Endpoint.parse rejects control characters, whitespace and display-unsafe code points anywhere in the url (so a tampered endpoint cannot inject a CR/LF into the request line or a bidi char into a log line), rejects bracketed IPv6 literals rather than mis-parsing them, rejects userinfo (user@host) — which the HTTP layer would otherwise try to connect to literally — terminates the authority at the first /, ? or # so a query or fragment is never folded into the host, and range-checks the port to 1..65535.
  • Bounded against a hostile or stalled server. Response reads are capped by a 4 MiB byte limit and a wall-clock deadline that bounds the whole read — covering both a chunked response whose chunk-size line is dribbled a byte at a time and a Content-Length body dribbled through stalled TLS records, either of which previously could keep a single read running well past the deadline. After such a bounded-read abort the half-read poll connection is dropped so the next poll reconnects on a clean socket, rather than the loop spinning on the stalled response's leftover bytes until the device code expires. The device-code lifetime, the poll interval, and the token TTL are all clamped (defaults applied for absent/zero values, hard caps for absurd ones). A 429 with no OAuth error is treated as a transient back-off (a 429 that also carries a terminal error such as access_denied still aborts on the error); a transient transport failure or 5xx during polling keeps polling until the device-code deadline rather than failing the sign-in (RFC 8628; matches the Python client), while a definitive OAuth error or a terminal 4xx aborts immediately. The trade-off is that a persistently flaky network is no longer cut short by a separate error budget — it polls to the device-code deadline.

Token persistence (opt-in)

By default token state is in-memory only, so a restarted process re-runs the interactive device flow. Passing a TokenStore persists it, so the restarted process resumes from the saved refresh token (one silent token-endpoint round-trip) instead of re-prompting — getToken() then even works as the first call, with no explicit signIn().

  • TokenStore SPI (io.questdb.client.cutlass.auth) — load/save/clear keyed by a non-secret TokenStoreKey (endpoints, client id, scope, audience, groups-in-token mode), plus an optional inLock hook for cross-process coordination. Wire it in with builder().tokenStore(...) or DiscoveryOptions.tokenStore(...). Persistence is best-effort: a store failure warns to System.err and the in-memory token is used regardless.
  • FileTokenStore (the default) — one plaintext JSON file per identity under ${user.home}/.questdb/oidc-tokens/ (override with questdb.client.oidc.token.store.dir), the refresh token protected at rest by file permissions (0600 file, 0700 directory on POSIX) rather than encryption — the same approach gcloud, aws and gh take. The file name is a SHA-256 of the identity, so it leaks neither endpoint nor client id and several identities coexist. FileTokenStore.atDefaultLocation() / FileTokenStore.at(dir).
  • Treated as untrusted on load. A persisted file is attacker-writable, so on load it is size-bounded, parsed defensively (a corrupt/oversized/garbage file is ignored, not fatal), its fingerprint re-checked against the live config (a token minted for one identity is never served for another), and the served token re-validated for control/non-ASCII characters before it can reach a header — the same CR/LF / non-ASCII rejection the device flow applies to IdP responses. A tampered far-future expiry is clamped, not trusted. A bad file degrades to a refresh or an interactive sign-in.
  • Integrity and cross-process coordination. Each update is written to a temp file then atomically renamed, so a concurrent reader never sees a half-written credential. When the IdP rotates the refresh token on each refresh, the read-refresh-write is serialized across processes with an O_CREAT|O_EXCL lock file (not an OS advisory lock, which Java FileLock and Python flock cannot share); a process that cannot acquire the lock degrades to a lock-free refresh rather than stall.
  • The on-disk format is a frozen cross-language contract (design/oidc-token-persistence.md): the file name, JSON schema, atomic-write and lock-file protocols are specified so the Python client (and others) can share one file.

Supporting changes

  • JsonLexer now resolves JSON string escape sequences (\", \\, \/, \b \f \n \r \t, \uXXXX; lenient on malformed input), so string values arrive fully decoded. This also reaches the existing ILP error-response parser, which now sees decoded message/code/line/errorId fields.
  • Response.recv(int timeout) (both the Content-Length and chunked implementations) bounds the whole read to the timeout in total, not per socket read, so a server that dribbles the body — the chunk-size line of a chunked response, or Content-Length bytes behind stalled TLS records — cannot keep a single read running past the caller's deadline. A non-positive timeout keeps the legacy unbounded behaviour. The existing ILP flush reads go through the no-arg recv(), which passes the configured client timeout (positive by default), so they now bound the whole body read to that timeout instead of re-arming it per socket read: a response that completes within the timeout is unaffected, while one that legitimately dribbles a single fragment for longer than the timeout — previously tolerated as long as each socket read made progress — now aborts.
  • AbstractLineHttpSender plumbs the token provider through with a deferred, retriable per-request pull (a throwing or blank-returning provider leaves the request token-pending for the next row instead of corrupting the half-built request), so the very first send already carries a provider-sourced token. Its error rendering now routes every untrusted server-supplied string through putAsPrintable — the decoded JSON error body, the [http-status=…] field, and the line-protocol-version detection probe body — escaping control and Unicode format characters (bidi overrides, zero-width joiners, the BOM), so a hostile or proxied endpoint cannot reorder, hide, or forge the text shown in a LineSenderException (or spliced into a log line or terminal).
  • QwpWebSocketSender sources its auth header from the token provider too, re-querying it on every connect/reconnect so a rotating token keeps a long-lived WebSocket sender authenticated.

Tradeoffs and limitations

  • The origin pin behaves differently on the two construction paths. fromQuestDB(...) discovery trusts an endpoint read from the issuer's .well-known wherever the provider hosts it, so an off-origin provider (e.g. Google) signs in normally through a pinned issuer; only an endpoint the /settings response itself advertised is held to the issuer's origin and path. The explicit builder().issuer(...) pin is stricter — a plain sanity check that both supplied endpoints sit on the issuer origin — so an off-origin provider configured that way must have its issuer omitted, or its endpoints supplied to match. This matches the Python client.
  • A failed token pull is handled differently per transport: over HTTP it leaves the request token-pending and retries on the next row, but over WebSocket a pull that keeps failing past the sender's reconnect budget terminates the sender, like any persistent reconnect failure. A getToken() provider that fails only transiently recovers on HTTP; a long outage ends a WebSocket sender.
  • The co-location check requires the token and device-authorization endpoints to share an origin. A test that previously pointed the token endpoint at a dead second port to simulate an unreachable endpoint was reworked to drop a co-located connection instead (MockOidcServer.dropConnection()).
  • The plaintext-channel pin's firing path is exercised end to end by reaching the loopback mock through a short-form 127.x address that the loopback classifier deliberately rejects as non-loopback. That trick relies on the OS resolver expanding the short form (BSD inet_aton, on Linux/macOS), which Windows getaddrinfo does not do, so that one end-to-end test is skipped on Windows; the loopback classifier itself is covered cross-platform.
  • Persistence writes a long-lived refresh token to disk in plaintext, protected only by file permissions — anyone who can read the file holds a credential until the IdP expires or revokes it. This is why persistence is opt-in; for at-rest encryption, supply a TokenStore backed by an OS keychain or a secrets manager instead of FileTokenStore. On Windows POSIX permissions cannot be enforced, so the file currently relies on the user-profile directory's default ACL (owner-only ACL hardening is a follow-up); the client warns once to System.err.
  • A store-coordinated getToken() may briefly wait to acquire the cross-process lock before a silent refresh (a few seconds at most for FileTokenStore — the acquire budget is capped — then it proceeds without the lock). It still never waits behind an interactive sign-in; this is a quick silent refresh, not an interactive wait.
  • The Response.recv(int) whole-read bound (see Supporting changes) also tightens existing, non-OIDC ILP flushes. The no-arg recv() now bounds the whole flush-response body read to the configured client timeout instead of re-arming it per socket read, so a response that legitimately dribbles a single fragment for longer than that timeout — previously tolerated as long as each socket read made progress — now aborts. A healthy response is unaffected: the abort surfaces as a retryable HttpClientException that flush0 catches and retries like any transport failure, so the only observable effect is that a pathologically slow single fragment is retried rather than tolerated indefinitely, marginally widening the pre-existing ILP-over-HTTP at-least-once window.

Tests & docs

OidcDeviceAuthTest (~107 cases) + MockOidcServer, BrowserLauncherTest, LineHttpSenderTokenProviderTest, WebSocketTokenProviderTest + TestWebSocketServer, SenderBuilderErrorApiTest, JsonLexerTest, LineHttpSenderErrorResponseTest, DisplaySafeTest, ChunkedResponseTest/ResponseTest, QwpQueryClientTokenProviderTest, FileTokenStoreTest, OidcDeviceAuthPersistenceTest; runnable OidcDeviceFlowExample / OIDCAuthExample; and README "OIDC Sign-In (Device Flow)" and "Persisting the Token Across Restarts" sections. Coverage includes:

  • .well-known discovery via a pinned issuer; a discovery document that omits the device-authorization endpoint
  • the co-location check rejecting split-origin endpoints; the builder().issuer(...) origin pin rejecting off-origin endpoints; a /settings-advertised endpoint rejected when off the issuer origin; an endpoint discovered from the issuer's .well-known accepted even when off the issuer origin (the Google case)
  • issuer path scoping: endpoints under the issuer path accepted; a sibling realm, percent-encoded traversal, an encoded path separator (%2f), and a percent-encoded backslash (%5c) rejected
  • the plaintext-channel pin requiring an issuer pin for advertised endpoints over http (firing path skipped on Windows)
  • the audience parameter discovered from /settings and sent on the device and refresh requests
  • token-provider support over HTTP and WebSocket (initial upgrade and re-queried per reconnect; static token / username-password still work over WebSocket); rejected for TCP/UDP; mutual exclusion with other auth; null/empty provider token rejected; deferred build-time pull; no sender corruption when the provider throws after a flush; the QWP egress query client's token provider — header synthesis, per-resolve re-query, token validation, and mutual exclusion
  • a token with a control or non-ASCII character rejected rather than sent (the served kind validated); tokens never echoed in messages
  • bounded reads: a stalled body and an oversized body aborting on the deadline / 4 MiB cap; a chunked body read aborting when the chunk-size line is dribbled; a bounded-read abort dropping the dirty poll connection so the next poll reconnects and signs in
  • the poll model: a 429 and a transient 5xx/transport failure keep polling to the deadline; a terminal 4xx and an OAuth error fail fast (including a 429 that also carries a terminal error); slow_down growth and the 60 s interval clamp; device-code-lifetime and clock-skew clamps
  • Endpoint.parse rejecting a malformed url: userinfo (user@host), a bracketed IPv6 literal, an out-of-range port, and control/whitespace/display-unsafe characters
  • a malformed HTTP status code rejected on both the device-authorization and the token-poll path: a non-numeric status, and a short all-digit status (2, 5) that must not be read as a 2xx/5xx class
  • display sanitizing: bidi/zero-width, lone-surrogate and supplementary-plane format characters stripped from the challenge and OAuth error; a server's JSON error body, its HTTP status field, and the protocol-version probe body with such characters escaped, not rendered raw; a user code or verification URL that sanitizes to empty rejected, and a verification_uri_complete that sanitizes to empty treated as absent
  • JsonLexer escape decoding, including a \uXXXX escape split across two parse fragments, and the lenient/exotic escape arms
  • getToken() failing fast while another thread holds the lock in an interactive sign-in or a silent refresh; native-memory cleanup on the error/rejection construction paths
  • token persistence: round-trip save/load with 0600/0700 permissions and control-char JSON escaping; a corrupt, empty, oversized, schema-version-mismatched, or per-field fingerprint-mismatched file ignored; a tampered far-future expiry clamped and a CR/LF served token rejected on load; a restart serving a valid persisted token (or silently refreshing an expired one) without re-running the device flow; rotating vs non-rotating refresh write behaviour; a swallowed save not replaying a revoked token; the cross-process lock-file protocol (acquire, mutual exclusion, stale-steal, degrade); getToken() degrading to a lock-free refresh when a peer holds the lock; the HTTP-timeout cap; the file-name hash pinned as a cross-language contract

Tandem

OSS: questdb/questdb#7331
Ent: https://github.com/questdb/questdb-enterprise/pull/1090

@glasstiger glasstiger changed the title feat(core): OIDC device flow feat(core): OIDC sign-in via device flow (RFC 8628) Jun 17, 2026
glasstiger and others added 19 commits June 17, 2026 19:31
Two fixes for the OIDC device flow in the Java client.

M2 - Bidi / zero-width Unicode bypassed the display sanitizer.
sanitizeForDisplay and OidcAuthException.putSanitized filtered only on
Character.isISOControl, which covers C0/C1 and DEL but not the
bidirectional overrides (U+202A-202E), isolates (U+2066-2069), marks
(U+200E/200F), zero-width characters or the BOM (U+FEFF). Those fields
- user_code, verification_uri(_complete), error and error_description -
all come from the IdP/settings boundary and reach System.out and the
exception messages, so a hostile or MITM'd IdP could embed a
right-to-left override and spoof the verification URL a human reads and
then opens. The JSON lexer's \uXXXX decoding widens the vector, since an
escaped override decodes to the real character before display.

Both sanitizers now share OidcAuthException.isUnsafeForDisplay, which
also strips the Unicode format category (Cf) plus the explicit bidi/BOM
set. The predicate uses hex int literals rather than char escapes,
keeping the source strictly ASCII so the file carries none of the
characters it guards against.

M3 - httpTokenProvider forced a successful sign-in before build().
createLineSender eagerly rebuilt the pending request when a provider was
set, calling getToken() at build time. With the documented
.httpTokenProvider(auth::getTokenSilently), that threw unless the caller
had already signed in, so the natural "construct the sender, sign in,
then send" ordering was impossible.

The first token pull is now deferred off the build path to the first row
(table()). The provider is wired at build but not queried; the initial
request is stamped with a token when the first row starts, and the
pending flag is cleared only after the pull succeeds, so a
not-yet-signed-in provider that throws leaves the stamp pending for a
retry. The Sender.httpTokenProvider Javadoc now states the provider is
not called at build time.

Tests: new bidi/zero-width cases for the challenge fields and the oauth
error message (fed as JSON \uXXXX escapes so they exercise the
decode-then-display path), and a new LineHttpSenderTokenProviderTest
covering the deferred pull and a lazily signing-in provider. Each test
was confirmed to fail without its fix.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Three robustness fixes in the OIDC device-flow parser.

m3 - A JSON null arrives from the lexer as the literal "null". The
token and device parsers used putValue, which stored it verbatim, so
"access_token": null became the 4-char token "null" and "error": null
was read as an OAuth error code "null". Merged putValue with
SettingsDiscoveryParser's null-guarding putNonNull into one shared
helper used by all three parsers, so a JSON null is treated as absent
everywhere.

m4 - Endpoint.parse did not range-check the port, so host:0, host:-1
and host:99999 parsed and flowed to the transport. Added a 1..65535
guard that rejects them with a clear message.

m5 - The token-response expires_in was not clamped, unlike the
device-auth value, so a TTL near Integer.MAX_VALUE cached the token for
~68 years. storeTokens now applies the same boundedSeconds clamp (the
default for a non-positive value, capped at MAX_EXPIRES_IN_SECONDS). The
server still enforces the real expiry; this only bounds how long the
client trusts its cached copy.

Tests: null access_token and null error are rejected/ignored,
out-of-range ports are rejected at build, and a clamped token expiry
forces a fresh sign-in (observed via a clock-skew margin set above the
clamp). Each test was confirmed to fail without its fix.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The post-flush reset() eagerly rebuilt the next request and pulled the
provider token via httpTokenProvider.getToken() after the current batch
had already been sent and accepted. If that pull threw (e.g.
OidcDeviceAuth::getTokenSilently when a silent refresh fails) it turned
an already-successful flush into a thrown exception and left the shared
Request half-built (contentStart == -1, no withContent()), so the next
row's data went into the header region - a malformed request, lost rows
and a permanently corrupted sender.

Route every request's token pull through the same deferred, retriable
path the initial request already used: newRequest() no longer pulls the
provider token (it marks the request token-pending and builds a valid
token-less request), and stampTokenIfPending() pulls it lazily when the
first row of a request starts. A failed pull leaves the flag set and the
sender untouched, so the next row re-runs the stamp and fully rebuilds
the request. Per-request token rotation is unchanged.

Rename isInitialTokenPending/stampInitialTokenIfPending to
isTokenPending/stampTokenIfPending since the deferral now covers every
request, and stamp the token in putRawMessage() too. Add a regression
test that fails at the first, successful flush without the fix.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
getToken() and getTokenSilently() were both synchronized on the instance
monitor, and getToken() holds it for the entire interactive device flow -
up to the device-code lifetime, clamped to one hour. A long-lived Sender
wired with httpTokenProvider(auth::getTokenSilently) therefore stalled on
the flush path for up to an hour whenever another thread ran an
interactive sign-in (e.g. a re-auth after the refresh token died). The
javadoc claimed the opposite ("safe on a request/flush path").

Replace the synchronized methods with a ReentrantLock. getToken() and
clearCache() still acquire it blocking, but getTokenSilently() now uses
tryLock() and fails fast with an OidcAuthException instead of waiting:
while a sign-in is in progress there is no token to serve anyway, so the
caller gets a prompt, retriable exception rather than a wedged flush. The
interactive flow still holds the lock for its whole duration and close()
still sets the volatile cancellation flag before acquiring the lock, so
the no-use-after-free guarantee is unchanged.

Correct the class and getTokenSilently() javadocs, and add a regression
test that fails (getTokenSilently blocks ~10s behind an in-flight
sign-in) without the fix.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
isUnsafeForDisplay() inspected one UTF-16 code unit at a time, so a
supplementary-plane (>= U+10000) format or control character - an
invisible U+E00xx "tag" char, for instance - arrived as a surrogate
pair whose halves are each neither a control nor category Cf and so
passed the filter unstripped. Because the JSON lexer reassembles such
😀-style escapes, a hostile or man-in-the-middled identity
provider could smuggle invisible/spoofing characters into a user_code,
a verification_uri, or an error_description and on into the terminal
prompt and exception messages.

Judge a Unicode code point instead: isUnsafeForDisplay() takes an int,
and both sanitizers (putSanitized for exception messages,
sanitizeForDisplay for the prompt) walk the text by code point with
Character.codePointAt/charCount, so Character.getType classifies a
supplementary char as one character. A legitimate astral character
(an emoji) is still preserved.

Make the assertNoUnsafeDisplayChars test helper code-point-aware too -
it shared the blind spot - and add a regression test that fails (the
U+E0001 tag char survives) without the fix.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
pollOnce() checked for a token before the HTTP status and the OAuth
error field, so a response that carried a token alongside an error, or
under a non-2xx status, was cached as a valid grant. tryRefresh() had
the same flaw: it accepted the refreshed token on token presence alone.
Both contradict RFC 6749 - 5.1 makes a grant a 2xx response carrying a
token, and 5.2 says an error response must not be treated as a grant.

Handle the OAuth error first in pollOnce(), so a token smuggled
alongside an error never counts, and accept a token only when the
status is 2xx; a token under a non-2xx status goes to the transport-
error budget instead of being trusted. Guard tryRefresh() the same way:
cache the refreshed token only from a clean 2xx response with no error,
otherwise fall back to the interactive flow.

The happy path and the existing pending/slow_down/access_denied/empty-
body outcomes are unchanged. Add regression tests for a token alongside
an error, a token under a non-2xx status, and a refresh that smuggles a
token with an error - each fails without the fix.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
newRequest() passed the token from httpTokenProvider.getToken() straight
to authToken(), which does not null- or empty-check it. A provider that
returned null, "", or whitespace therefore produced a malformed
"Authorization: Bearer " header that the server only answered with a 401
far from the cause - no client-side error at all. The HttpTokenProvider
contract forbids such a return but nothing enforced it, and httpToken()
already rejects a blank token, so the provider path was the weaker spot.

Validate the pulled token with Chars.isBlank (as httpToken does) and
throw a clear LineSenderException instead. The check sits inside the
deferred pull, so a rejected token leaves the stamp pending and the next
row retries cleanly, just like a throwing provider does. OidcDeviceAuth
never returns a blank token, so this guards custom providers.

Add tests that a null, an empty, and a whitespace-only provider token is
rejected at first use - each fails without the fix.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
JsonLexer.getCharSequence rescanned every decoded value and name from
the start to look for a backslash, even though the parse loop already
detects one when it sets ignoreNext. Record that in a sawEscape flag
(carried across parse() fragments) and resolve escapes only when it is
set, so the common no-escape value returns the assembled sink without a
second pass.

OidcDeviceAuth.Endpoint.parse now rejects a host that contains control
characters or whitespace - a smuggled CR/LF would otherwise flow into
the outbound Host header.

Add the tests these paths lacked: a cross-fragment escape; the lexer's
lenient and exotic escape arms (surrogate pairs, \b/\f, unknown and
malformed escapes, lone surrogates); the version-probe settings parser
reading an escaped key through unescape; HTTP-token-provider rejection
for UDP and WebSocket (not just TCP); and the control-character host
cases above.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Port the issuer feature from py-questdb-client (PR #133) onto
OidcDeviceAuth, so the device flow keeps working against servers that
do not advertise their device-authorization endpoint, and so the
device code and refresh token are only sent where the caller pins.

The issuer plays three roles:

- Discovery fallback: when /settings omits the device (and/or token)
  endpoint, fromQuestDB(url, issuer) reads it from the issuer's
  .well-known/openid-configuration document. The discovery origin comes
  only from the out-of-band issuer (or an explicit discoveryUrl), never
  from a /settings-supplied value, so a tampered /settings cannot
  redirect discovery. Without a pin, discovery is refused.

- Plaintext-channel pin: a /settings response fetched over plaintext
  http to a non-loopback host (only reachable with
  allowInsecureTransport) cannot route credentials to its advertised
  endpoints without a pin.

- Endpoint-origin pin: validateEndpointOrigins, enforced in
  Builder.build() on every construction path, requires the token and
  device endpoints to share one origin (RFC 8628 co-location) and, when
  an issuer is set, to belong to it.

Config surface: Builder.issuer(...); new fromQuestDB overloads
(url, issuer), (url, issuer, allowInsecure), and a 5-arg master taking
issuer, discoveryUrl and a TLS config.

Tradeoffs:

- The co-location check makes the token and device endpoints share an
  origin. testPersistentTransportFailureDuringPollingAborts simulated
  an unreachable token endpoint with a dead second port; it now uses a
  new MockOidcServer.dropConnection() against a co-located path.

- The origin pin compares scheme/host/port and ignores the path, so an
  identity provider that hosts its endpoints on a different origin than
  its issuer must be configured without an issuer. This matches the
  Python client.

- allowInsecureTransport still relaxes the identity provider endpoints
  too (unchanged); the Python client always forces https/loopback for
  the IdP. Left as-is to avoid changing settled transport behavior.

Adds 7 tests and updates the README OIDC section.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Endpoint.parse now rejects control characters and whitespace anywhere
in the url before splitting it. The host was already checked, but the
path was not, so a tampered /settings or discovery document could carry
a CR/LF in an endpoint path that the JSON lexer decodes and postForm
writes verbatim onto the request line via .url(endpoint.path) - a
header-injection / request-smuggling vector that the origin pin (which
compares scheme/host/port only) does not catch. Validating the whole
url up front also keeps it safe to echo in the parse error messages.

fromQuestDB now derives the pin origin from a caller-supplied
discoveryUrl when no issuer was resolved. Previously a discoveryUrl pin
only took effect when discovery actually ran (an endpoint missing from
/settings); when /settings advertised both endpoints the discovery
branch was skipped and validateEndpointOrigins ran with a null issuer,
so a compromised server could advertise both endpoints at an attacker
origin and slip past the pin. The discoveryUrl pin now behaves like the
issuer pin on every construction path.

Adds regression tests for both: a CR/LF-injected advertised endpoint,
path and query cases in Endpoint.parse, and discoveryUrl-pin accept and
reject against on- and off-origin endpoints.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Endpoint.parse already rejected control characters and whitespace in
the url, which kept it safe to echo into the exception messages once it
passed validation. That scan did not catch bidi, zero-width or other
format characters (U+202E, U+200B, U+FEFF, the Cf category, and the
supplementary-plane tag characters), so a tampered /settings or
discovery endpoint url could still smuggle one into an OidcAuthException
message and reorder, hide or forge the log line it lands in.

The url scan now runs per code point and also rejects anything
isUnsafeForDisplay flags, so an OIDC url may carry no control,
whitespace or display-unsafe character. Every raw url echo in
Endpoint.parse, requireSecureTransport and fromQuestDB is therefore
safe on screen as well as on the wire, and the rejection message
sanitizes the url it reports.

Adds a regression test covering a right-to-left override, a zero-width
space, the BOM and a supplementary-plane tag character.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
isUnsafeForDisplay now treats an unpaired UTF-16 surrogate as unsafe,
so a lone surrogate half - which JsonLexer emits verbatim for a single
backslash-u-XXXX escape and which codePointAt surfaces as a SURROGATE
code point - is stripped from a user_code, verification_uri or error
string before it reaches a terminal or a log line. A valid high+low
pair is still reassembled by codePointAt and judged on its real
category, so a legitimate emoji survives. The method comment is
corrected too: codePointAt in the callers reassembles pairs, not the
lexer.

close() and the class Javadoc no longer claim an in-flight sign-in is
cancelled "promptly". The cancel flag is observed between polls (within
about 100ms) but a poll request already in flight is not interrupted,
so close() can take up to one HTTP request timeout to return - still
far short of the device-code lifetime. The docs now say so.

Adds tests: lone high and low surrogates are stripped from the device
challenge while an emoji survives; and the private isLoopbackHost
classifier (which gates the plaintext-channel MITM pin) is pinned for
localhost and the 127.0.0.0/8 block, and against non-loopback and
spoofing hosts such as 127.evil.com, localhost.evil.com, 127.1 and
127.0.0.256.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The poll loop now clamps the slow_down-inflated interval to the same
MAX_POLL_INTERVAL_SECONDS cap the initial interval already respects, so
repeated slow_down responses from the identity provider cannot grow the
wait without bound.

The device-authorization, token and well-known parsers now reset their
current field to FIELD_NONE after each value, matching
SettingsDiscoveryParser. The parsers are not currently confusable - in
well-formed JSON a name event always sets the field before the next
value, array elements arrive as EVT_ARRAY_VALUE, and nested values are
filtered by the depth check - so this is a defensive consistency fix
that removes a latent field-confusion foot-gun rather than a behavior
change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
JsonLexer.unescape no longer re-scans the value from the start to
re-find the backslash the lexer already flagged via hasEscape; it walks
the value once, copying plain characters and resolving escapes in place.
That drops the now-dead "no escapes" early return and the separate
prefix copy, so an escaped value is traversed about twice (decode then
unescape) instead of three times. parseHex4 looks the hex digit up in
the shared Numbers.hexNumbers table instead of Character.digit, keeping
the same -1-on-non-hex contract. All of this is on the cold
error/discovery/auth parse path, never on ingestion.

Reorders pollForToken ahead of pollOnce so the private methods stay in
alphabetical order; no behavior change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@glasstiger glasstiger added enhancement New feature or request security labels Jun 19, 2026
glasstiger and others added 7 commits June 19, 2026 15:42
The 4 MiB response-body cap (MAX_RESPONSE_BODY_BYTES) that bounds the
OIDC device flow against a hostile or MITM'd server streaming an
endless body had no test coverage on the parseBody path.

Add an oversizedJson() mode to MockOidcServer that streams a chunked,
mostly-whitespace body past the cap, and a test that drives discovery
against it and asserts the bounded read aborts with the size-limit
error - which also confirms the token-bearing body never reaches the
message. The body is whitespace so the lexer keeps consuming until the
byte cap trips, instead of hitting its per-value length limit first.

Verified both ways: the test passes with the 4 MiB cap and fails when
the cap is disabled, where the full body is read and parsing fails with
"Unterminated object" instead.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Three small fixes to the OIDC device authorization flow, all in
OidcDeviceAuth:

- runDeviceFlow now rejects a non-2xx device authorization response.
  Previously it trusted any body that carried device_code/user_code/
  verification_uri and no OAuth error, so a non-2xx response would
  prompt the user and start polling. It now applies the same 2xx gate
  pollOnce and tryRefresh already use before trusting a body.

- pollForToken checks the device-code deadline at the top of the loop
  and never sleeps past it, so an expiry that elapses during a sleep
  times out promptly instead of after one more wasted poll and up to a
  full extra poll interval.

- tryRefresh drops an unreachable branch that rethrew on an OAuth
  error. postForm only throws on a parse failure here, and a real
  OAuth error arrives in tokenParser.error (handled by the
  hasRequiredToken check), so the branch was dead. No behaviour change.

Add testNonSuccessDeviceAuthorizationResponseRejected covering the new
2xx gate; it fails without the check (the 403 is accepted, the user is
prompted, and polling fails later instead).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A discoveryUrl pins the identity provider, yet fromQuestDB adopted the
issuer the discovery document declared about itself and validated the
token and device endpoints against that, never against the pinned
discoveryUrl origin. A document served at the pinned url could therefore
name an attacker issuer, co-locate both endpoints under it, and route the
device code and the long-lived refresh token there while the co-location
and issuer checks passed trivially - so the discoveryUrl pin did not in
fact pin the provider, contradicting its documented guarantee.

Reject a document whose own issuer sits on a different origin than the
pinned discoveryUrl (RFC 8414 section 3.3), and derive the endpoint pin
from the discoveryUrl origin rather than the document's self-declared
issuer. An identity provider that serves its discovery document on a
different origin than its endpoints must instead be configured with
explicit endpoints via OidcDeviceAuth.builder().

The issuer-pinned path is unchanged: it already binds the endpoints to
the caller-supplied issuer. testFromQuestDbDiscoveryUrlPinRejectsForeign
IssuerInDocument covers the new rejection and fails without the fix.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
readResponse copied the response status code into a sink that later
appears in OidcAuthException messages. A well-formed status code is bare
digits, but the HTTP header parser keeps the status-line token verbatim
apart from SP/CR/LF, so a hostile or MITM'd identity provider could
splice ESC or other control bytes into it - smuggling ANSI sequences
into a log or terminal, or fabricating a leading digit that passes the
2xx success gate.

Validate the status code as it is captured: on any non-digit byte, drain
the body so the keep-alive connection stays usable, then reject the
response with a message that echoes none of its bytes. A clean status is
copied digit by digit, so every later [httpStatus=...] echo is bare digits.

testNonNumericStatusCodeRejected drives a status code with a spliced ANSI
reset and asserts the rejection; it fails without the fix. The new
MockOidcServer.raw() helper writes a verbatim response so a test can craft
a malformed status line.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
JsonLexer now resolves JSON string escapes, so the message and errorId
fields a QuestDB endpoint returns in a JSON error body arrive at the
sender fully decoded. The JSON error parser put them into the
LineSenderException verbatim, so a hostile or proxied endpoint could
inject real control characters or ANSI escapes that forge a log line or
rewrite a terminal when the exception text is printed.

Render the server-supplied message, id, code and line through
putAsPrintable - the same escaping the column-name errors in this class
already use - so a decoded control byte arrives escaped.

LineHttpSenderErrorResponseTest flushes against a server returning a
chunked JSON error whose message and errorId carry an ESC and a newline,
and asserts they reach the exception escaped; it fails without the fix.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The plaintext-channel pin refuses /settings-supplied OIDC endpoints
fetched over a non-loopback http channel unless the identity provider is
pinned out of band, so a tampered response cannot route the device code
and refresh token to an attacker. Only its loopback exemption was
exercised end to end, because the test mock binds to 127.0.0.1; the
firing branch had no integration coverage.

Reach the loopback mock through "127.1": the OS resolver expands the
short form to 127.0.0.1 so the mock answers, but the loopback classifier
deliberately rejects the short form, so the server host is non-loopback
and the pin fires. Assert that a plaintext /settings advertising both
endpoints without a pin is refused, and that pinning the issuer over the
same channel is accepted - proving the pin, not an unrelated rejection,
is the gate. The test fails if the firing check is removed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Skip testPlaintextSettingsWithAdvertisedEndpointsRequiresPin on
Windows: it reaches the loopback mock through the "127.1" short-form
address, which Linux/macOS getaddrinfo expands to 127.0.0.1 but Windows
getaddrinfo rejects, so discovery cannot connect there. No host string
is both reachable at the loopback mock and classified non-loopback on
Windows, so the end-to-end firing path cannot run there; the classifier
stays covered cross-platform by
testLoopbackHostClassifierRejectsNonLoopbackAndSpoofing.

Wrap every OidcDeviceAuth construction in try-with-resources so the
native JSON lexer and HTTP clients are always released, including the
rejection paths where build()/fromQuestDB() throws.

Also replace manual StringBuilder fills with String.repeat, switch
index loops to enhanced-for, and collapse the split-value test helper
to a single lexer cache-limit parameter.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@glasstiger

Copy link
Copy Markdown
Contributor Author

Code review — OIDC device flow (independent level-3 pass)

Reviewed at 1b7fecd (base 84da57d), 36 files, +8546/−129. Independent 9-dimension adversarial pass (correctness, concurrency, resources, cross-context callers, performance ×2, tests, code quality, metadata, plus a fresh-context "find what's wrong" agent), every finding verified against working-tree source — not a re-confirmation of the prior passes.

Process note for anyone re-reviewing: gh pr diff 52 currently serves a stale cached diff for this PR — it shows a removed auth::getTokenSilently reference and older Javadoc wording that do not exist at the real head 1b7fecd. The authoritative diff is git diff $(git merge-base HEAD origin/main)...HEAD; read source from the checked-out head. Findings below are against the real head.

Verdict: approve. No Critical or Moderate issues.

The three highest-blast-radius changes all touch the existing ILP path and were each verified safe at their out-of-diff callers (incl. a cross-module compile of OSS questdb/core against the changed client + 522 passing tests across the affected suites):

  • Utf16Sink.putAsPrintable — for 0x00–0x1F/0x7F the new \uXXXX output is byte-identical to the old \u00XX; only C1/FORMAT/surrogate chars change from raw to escaped (strictly safer). Legitimate non-ASCII letters/emoji still pass raw. No existing caller (ConfStringParser, column-name errors, ILP error rendering) depended on the old format.
  • JsonLexer escape decoding — the new hasEscape fast-path returns the raw sink unchanged for escape-free values (the common case for the ILP JsonErrorParser); decoded values are re-escaped via putAsPrintable before display; numeric fields carry no backslash and are unaffected.
  • Response.recv(int) whole-call bound — active on the ILP flush (30s default, re-armed per recv() call, so a large body across many recv()s is unaffected), bites only a single fragment read that stalls ~30s, and surfaces as a retryable HttpClientException caught in flush0.

This is an unusually well-defended PR: CRLF/header injection, bidi/ANSI terminal spoofing, MITM endpoint redirection (origin + path pinning with multi-decode traversal defense), token-kind confusion, use-after-free on close(), and unbounded reads were each anticipated and hold up under adversarial tracing.

Minor (none blocking)

  1. Tautological native-leak guardOidcDeviceAuthTest.testMalformedEndpointDoesNotLeakNativeMemory (~:1993). "not-a-url" throws in Endpoint.parse before the native JsonLexer is allocated (it's the last ctor statement), so the RSS-equality assertion can never fail — it doesn't guard the construct-then-validate ordering it claims to. Production ordering is safe. Fix: use a parseable-but-rejected endpoint, e.g. deviceAuthorizationEndpoint("http://idp.example/device") + allowInsecureTransport(false). (Still open from the prior passes — the one item most worth doing before merge.)

  2. Test gap: short all-digit status on the device-authorization path. The 3-digit isHttpStatusSuccess() guard is tested on the token-poll path but not at the device-auth call site (runDeviceFlow, :1195); the device-auth status tests use a non-numeric status, a different code path. Add a raw("HTTP/1.1 2 OK") device-step case.

  3. Test gap: throwing httpTokenProvider on a WebSocket ingest sender reconnect. Only the QWP QueryClient initial-connect throw and the WS happy-reconnect re-query are covered. Production handling is correct — the throw (OidcAuthException/LineSenderException, both RuntimeException) is caught by catch (Exception e) at QwpWebSocketSender.java:2500, the client is closed, and it retries within the reconnect budget — but a regression narrowing that catch would ship green. (The :2456 "caught below" comment is accurate, just non-specific — not a defect.)

  4. Test gap: ILP flush recv() whole-call timeout end-to-end. ResponseTest/ChunkedResponseTest cover the unit-level bound well, but no integration test drives a real Sender.flush() against a dribbling/stalled server to confirm the abort fires and is classified retryable.

  5. OIDC JSON parsers track object depth only, ignoring array nesting. All four inner parsers bump depth on EVT_OBJ_START/END but ignore EVT_ARRAY_START/END, so array-wrapping ({"config":[{…}]} / [{…}]) extracts fields from RFC-malformed shapes, subverting the depth-based "only top-level config trusted" intent. No practical security/correctness impact: the /settings body is already fully attacker-controlled in the fromQuestDB threat model (array-wrapping grants no new capability), endpoints stay origin/path-pinned regardless of parse shape, a preferences value still can't be read as config, and a real QuestDB server never returns arrays there. Robustness/defense-in-depth nit — optionally track array depth or reject a non-object top level.

  6. Cold-path perf nits (optional). tryRefresh re-encodes refresh_token per silent refresh (invariant per token lifetime — could pre-encode in storeTokens); the issuer-path check runs two independent multi-pass percent-decodes over the same path (endpointPathHasEncodedSeparator then decodePathSegments — fusible); percentDecodeOnce allocates a fresh StringSink per call. All run once per construction/refresh, off the data path. The per-row/per-flush path is allocation-clean and the Authorization: Bearer write is allocation-free.

  7. PR description: disclose the ILP-flush timeout change under "Tradeoffs", not "Supporting changes". The Response.recv(int) whole-call bound affects existing non-OIDC senders; the repo convention is "regressions as prominently as improvements." Worth one line in Tradeoffs noting a flush aborted by the tighter deadline is subject to normal retry (the pre-existing ILP-over-HTTP at-least-once window, marginally widened). Labels enhancement+security are correct (the ILP/REST API/Core labels don't exist in this submodule repo).

  8. Naming/comment nits (non-actionable). groupsInToken/allowInsecureTransport lack the is/has prefix but follow the fluent-builder idiom (mirroring scope()/audience()); the :103 clock-skew comment names only getToken() though signIn() shares the logic (accurate but incomplete).

Checked and dismissed (so they don't get re-raised)

  • Utf16Sink.putAsPrintable "misses supplementary-plane tag chars" — false: the CharSequence overload scans by code point via DisplaySafe and does strip them.
  • Use-after-free of the native jsonLexer on close() — refuted: close() sets the closed volatile then lock.lock(), blocking until any in-flight parse releases the lock; the lexer is never freed mid-parse.
  • OOB read on a trailing backslash in JsonLexer.unescape — refuted: the lexer's ignoreNext guarantees a char follows every \ before the closing quote; the i+1>=n guard is dead-but-correct.
  • OidcAuthException static-method ordering — correct (public-static before package-private-static, then alphabetical).

Summary

~8 Minor items, none blocking: 4 test-coverage/quality gaps (#1#4), 1 robustness nit (#5), 1 optional cold-path perf cleanup (#6), 1 PR-description prominence fix (#7), naming/comment nits (#8). ~6 draft findings dropped as false positives after source verification. Every confirmed finding is in-diff (tests / new parser code / PR body); the high-blast-radius out-of-diff callers (ILP error rendering, JsonLexer consumers, recv()/createLineSender callers, getAuthorizationHeaderForTest) were walked and verified safe — 0 out-of-diff defects. The single item most worth doing before merge is #1.

🤖 Generated with Claude Code

glasstiger and others added 8 commits June 29, 2026 18:03
Address the moderate findings from the OIDC token-persistence review.

getToken() runs on the latency-sensitive flush and reconnect path,
but persistence routed it through FileTokenStore's cross-process
lock, which could spin up to the acquire budget before a silent
refresh. The advanced FileTokenStore constructor accepted an
unbounded budget, so a misconfiguration could stall a flush. Cap the
acquire budget at 30s, and reconcile the HttpTokenProvider SPI and
the OidcDeviceAuth class docs, which still claimed getToken() never
blocks: it never waits behind an interactive sign-in, but a
coordinated silent refresh may briefly wait for the store lock.

Add the missing tests on the security-critical persistence branches
(CI had not covered these files): schema-version mismatch, per-field
fingerprint mismatch (including groups_in_token and the audience
nullableEquals path), the httpTimeoutMillis cap, a control-char JSON
round-trip, the 0600 lock-file permissions, getToken() degrading to
a lock-free refresh while a peer holds the lock, and a golden-hash
anchor pinning the cross-language on-disk file-name contract. Also
assert the previously-unchecked load/clear/lock store counters, the
load-once guarantee, and clearCache() not reloading a just-cleared
entry.

FileTokenStoreTest 19 -> 26, OidcDeviceAuthPersistenceTest 14 -> 18;
all green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
FileTokenStore released its cross-process lock by bare path, so a hold
that outran lockStaleMillis (and was therefore stolen and recreated by a
peer) deleted the peer's live lock on release, admitting a third acquirer
alongside it and breaking the mutual exclusion the lock guards. The
release now verifies ownership: acquireLock stamps a unique nonce
(pid@host + timestamp + UUID) into the lock and returns it, and
releaseLock deletes the lock only when it still carries that nonce. A
stamp that cannot be written degrades to a lock-free refresh rather than
holding an unverifiable lock. The frozen cross-language contract in
design/oidc-token-persistence.md now documents the owner stamp and the
ownership-verified release so other clients mirror it.

OidcDeviceAuth.adopt() capped a loaded expiry but never floored it, so a
tampered expires_at_millis near Long.MIN_VALUE made the validity check
(now < expiresAtMillis - skew) underflow to a large positive and serve a
garbage-expiry token as valid forever. adopt() now clamps the expiry to
[0, now + maxLife], keeping the check underflow-safe while an
already-expired entry still falls through to a refresh.

Add tests: a peer-stolen lock survives our release; a tampered far-past
expiry is refreshed rather than served; the parseLast() truncated-
document reject; and a real FileTokenStore.save rename failure throws
OidcAuthException and leaves no .tmp file. Both regression guards were
proven to fail with their fix reverted.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
FileTokenStore.load read the file with Files.readAllBytes after a
separate Files.size check. Because the file is attacker-writable, a
concurrent grow between the two would make readAllBytes allocate
gigabytes and throw OutOfMemoryError - an Error that the best-effort
RuntimeException guard in OidcDeviceAuth.maybeLoadFromStore does not
catch, so a bad file aborted sign-in instead of degrading. A new
readBounded reads through a FileChannel into a buffer capped at the
reported size (already bounded by MAX_FILE_BYTES) plus one byte, so a
file that grew past its reported size is rejected, never allocated.

TokenFileParser tracked only object depth, so an array-wrapped object
([ {..} ]) extracted its fields through the depth gate. The parser now
marks any array as malformed and parseAndVerify rejects a shape that is
not a single flat JSON object, keeping the on-disk format strict.

TokenStoreKey now rejects a null clientId, tokenEndpoint,
deviceAuthorizationEndpoint or scope with a clear OidcAuthException
instead of surfacing a raw NullPointerException deep in a store's
serialize/fingerprint path, and normalises an empty audience to null so
getAudience(), hash() and the save/load round-trip agree that an absent
audience is null (an empty audience previously broke its own round-trip).

parseAndVerify now bulk-copies the file bytes into native memory via a
new DirectUtf8Sink.put(byte[], int, int) instead of a byte-by-byte loop.

Add tests for the array-wrapped reject, the empty-audience
normalisation, and the null-required-field rejection; each was proven to
fail with its fix reverted.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sanitize the warnPersistence detail. An IO error from the store can
carry the operator-supplied store path (questdb.client.oidc.token.store.dir
or user.home), which could itself hold terminal-spoofing characters, so
route cause.getMessage() through the null-safe sanitizeForDisplay before
printing to System.err - matching how every other untrusted display
string is sanitized.

Reap orphan temp files. A crash between createTempFile and the atomic
rename leaves a <hash>*.tmp holding a valid-at-the-time refresh token;
nothing ever steals it, so they would accumulate across crashes. save()
now best-effort sweeps <hash>*.tmp older than lockStaleMillis, leaving a
concurrent writer's fresh temp (recent mtime) untouched - so the random
per-writer suffix that keeps concurrent saves from colliding stays.

Chmod the store directory only on drift. ensureDirectory re-tightens a
pre-existing directory on every save and every inLock; read the
permissions first and only setPosixFilePermissions when they actually
differ, dropping a redundant write syscall in the common case while
keeping the re-tighten defense and the non-POSIX fallback.

Compare the schema version as a long. The parser narrowed it to an int,
so a tampered "v" of 1 + 2^32 truncated to SCHEMA_VERSION and passed the
gate; keep the full long and compare as a long.

Add tests for the temp-file sweep and the version-overflow reject; each
was proven to fail with its fix reverted.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Run clear() under the cross-process lock. A bare deleteIfExists let a
peer mid-refresh atomically rename a fresh file in just after the delete,
resurrecting the entry; clear() now deletes inside inLock (which cleans
up its own lock and degrades to lock-free if it cannot acquire one). A
Files.isDirectory guard keeps clear() a true no-op on a never-used store
rather than creating the directory just to run the locked delete. Cross-
process clear stays best-effort: a peer holding a live in-memory token
may legitimately re-persist afterwards.

Derive the adopted token lifetime from the absolute expiry. adopt()
clamped expires_at_millis and token_ttl_millis from the file
independently, so a tampered file could make them disagree and throw off
effectiveSkewMillis (which caps the skew at half the lifetime). Derive
tokenTtlMillis from the clamped expiry instead, so the skew basis always
matches the authoritative remaining lifetime; a legitimate file is
unaffected.

Document the cross-process contract more precisely. The README now notes
that lock-file staleness is judged by modification time, so coordination
assumes a shared clock (a single machine or synchronized clocks; NFS
clock skew can mis-judge it), and that clearCache() is best-effort across
processes. The frozen on-disk contract now states the document must be a
single flat JSON object, which the parser already enforces by rejecting
an array-wrapped or non-object shape - so the Python client mirrors it.

Order parseAndVerify before parseLongOrZero to match the alphabetical
member ordering.

Add tests for the expiry-derived lifetime and the clear-on-empty-store
no-op; both were proven to fail with their fix reverted.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@glasstiger

Copy link
Copy Markdown
Contributor Author

Code review — OIDC token persistence (level-3, persistence-layer focus)

Reviewed at aba8399 (base 660a06b). This is a focused pass on the token-persistence layer — commit a4e928f plus the four hardening commits after it (71e01f2aba8399), i.e. everything added after the three prior level-3 reviews (all at 1b7fecd, before persistence existed). New surface: FileTokenStore.java (797 lines), TokenStore/TokenStoreKey/PersistedToken, the +240 persistence integration in OidcDeviceAuth.java, DirectUtf8Sink.put(byte[],int,int), the frozen design/oidc-token-persistence.md, and the two new test suites. Authoritative diff taken from merge-base...HEAD and the working tree (the gh pr diff cache is stale for this branch). Findings verified against source; the highest-stakes logic was traced both ways.

Verdict: approve. No Critical or Moderate issues.

The two highest-stakes properties hold up under adversarial tracing:

  • Rotating-refresh-token coordination never replays a revoked token. Verified single-process (the lastPersistedRefreshToken != refreshToken guard in refreshUnderLock keeps the newer in-memory token after a best-effort save failure) and cross-process (a peer's rotated token is adopted from disk under the lock before any network call). The O_CREAT|O_EXCL lock-file acquire/steal/release protocol admits at most one stamped holder; releaseLock's nonce check never deletes a peer's live lock on the happy path.
  • An attacker-writable file cannot inject a credential or serve a wrong identity. The 1 MiB size cap, the iterative lexer, array→malformed, the in-file fingerprint re-check (Chars.equals on all six identity fields + nullableEquals for audience + groupsInToken !=), the served-token char gate, and the expiry clamp [0, now+maxLife] (underflow-safe at Long.MIN_VALUE, overflow-safe at Long.MAX_VALUE) all hold.

Feature coverage is strong (54 new tests, all assertMemoryLeak; golden cross-language SHA-256 values pinned; dedicated tests for revoked-replay, far-future/far-past expiry, and CR/LF-token rejection). The canonicalEndpoint ↔ on-disk-fingerprint round-trip is verified end-to-end through a real FileTokenStore, not just the fake.

Minor (none blocking)

1. Long.MIN_VALUE long-field serializes as a bare JSON null — violates the frozen cross-language contract and silently round-trips to 0. (in-diff)
FileTokenStore.putLongMember (FileTokenStore.java:359-363) writes via sink.put(value)Numbers.append(sink, value, /*checkNaN=*/true) (Numbers.java:112-126), which emits the literal null for Long.MIN_VALUE. This affects expires_at_millis and token_ttl_millis. It contradicts both serialize()'s own comment (FileTokenStore.java:490-494) and the byte-frozen spec (design/oidc-token-persistence.md:334-339: "a field whose value is null … is omitted entirely, not written as JSON null") — yet here a present field is written as a bare null. On read, parseLongOrZero("null") throws NumericException and returns 0, so Long.MIN_VALUE → 0 silently; a Python client mirroring the contract would parse None and diverge.
Reachability: not via OidcDeviceAuthstoreTokens always sets sane positive values and adopt re-clamps to [0, now+maxLife] before re-persisting. Only a direct FileTokenStore.save(key, new PersistedToken(…, Long.MIN_VALUE, …)) hits it (both public API), and Long.MIN_VALUE is a nonsensical timestamp — so real-world impact is low, but it's a genuine defect in a format documented as frozen to the byte.
Fix: serialize the two long fields with Numbers.append(sink, value, /*checkNaN=*/false) via a small helper (writes the digits, never the NULL sentinel), or normalize/reject Long.MIN_VALUE in serialize.

2. acquireLock deletes the lock by bare path on the nonce-stamp-failure cleanup, skipping the ownership check the rest of the protocol performs. (in-diff)
FileTokenStore.java:568-572: when the exclusive create wins but writeLockHolder fails, the cleanup is Files.deleteIfExists(lock) with no nonce check — unlike releaseLock (:459-462), which was specifically hardened to delete only its own stamp. If a peer stole the just-created empty file and stamped its own nonce in between, this removes the peer's live lock and can admit two holders.
Reachability: only under lockStaleMillis misconfigured near zero (the constructor enforces only > 0); under the 600 s default the microsecond window can never look stale, and that same misconfig already breaks mutual exclusion more directly. Defensive, not a live bug.
Fix: guard with Files.size(lock) == 0 before deleting (a stamped peer's lock is non-empty), and/or raise the lockStaleMillis constructor floor.

3. Test gap: the writeLockHolder-returns-false → "drop the unverifiable lock" branch (FileTokenStore.java:562-573) is untested. (test)
A correctness/security branch (it exists so the process never holds a lock releaseLock can't prove it owns), not a best-effort no-op, and reachable via a read-only-lock-file seam. The other ~50 uncovered FileTokenStore lines are acceptable gaps — they need a non-POSIX FS (the AtomicMoveNotSupportedException fallback, the UnsupportedOperationException POSIX-perm fallbacks) or an inherently racy condition (readBounded's read > size concurrent-grow guard, the inner acquireLock/releaseLock IOException degradations).

4. Defense-in-depth: the loaded refresh token is not char-validated in adopt(). (in-diff)
adopt validates only the served kind (hasOnlyTokenChars); the refreshToken is assigned unchecked. Safe today because its only egress (tryRefresh) URL-encodes it — but the served kinds are validated at the source, making them robust to a new sink, whereas the refresh token is safe-by-egress, not safe-by-construction. A future change adding a non-URL-encoded consumer of the loaded refresh token would turn this into a CR/LF sink. Optional: validate the refresh token's chars in adopt() too.

5. Nits. (in-diff)

  • DirectUtf8Sink.put(byte[], int, int) (:112-121) does no bounds check before the native copyMemory; the only caller passes (bytes, 0, bytes.length), so it's not reachable, but it's a new public method on a shared sink — a cheap assert lo >= 0 && hi <= src.length && lo <= hi would fence the footgun.
  • TokenFileParser resolves duplicate JSON keys last-wins (:788-795); harmless for this client (the fingerprint re-check makes a mismatch fall back), but a first-wins/reject-duplicate peer parser would disagree on the frozen contract.
  • testSaveFailureThenRefreshDoesNotReplayRevokedToken uses Thread.sleep(1_200) against a 1 s-TTL token — the most flake-prone sleep in the suite (low risk given the 1.2 s vs 1.0 s margin and the 30 s timeout). The other timing-based concurrency tests are written so a stall only widens the assertion margin.

Checked and dismissed (so they don't get re-raised)

  • Identity-confusion (token for A served to B) — refuted: fingerprint re-check + nullableEquals(audience) + groupsInToken != are airtight; canonicalEndpoint can't collapse two origins; the filename hash is only a bucketing key.
  • DoS via a hostile file (unbounded alloc / OOM / infinite loop / stack overflow) — refuted: 1 MiB readBounded cap, iterative (non-recursive) lexer, any array marks malformed, parseLongOrZero is O(n) and overflow-safe.
  • Numeric-tamper bypass (_/+/overflow in v/expiry/ttl) — refuted: version compared as a full long (no 32-bit narrowing); Numbers.parseLong requires _ to follow a digit and rejects +0; expiry clamped and underflow-safe.
  • Writer/reader escaping asymmetry — refuted: symmetric; the writer never emits \/ or \uXXXX for ≥ 0x20, and the lexer's lenient arms are never exercised by writer output.
  • Path traversal via the filename — refuted: key.hash() is always [0-9a-f]{64}; no caller bytes enter the file/lock/temp names.
  • Terminal-spoofing via the warning paths — refuted: warnNoPosixPermsOnce is a fixed ASCII literal; warnPersistence runs the IO-error detail through sanitizeForDisplay; neither prints token bytes.
  • ByteBuffer.flip() Java-8 covariant trap (FileTokenStore.java:444) — not a defect for the shipped artifact: the release compiles on JDK 8, where it binds to Buffer.flip(). Optional ((Buffer) buffer).flip() hardening. No other Java 9+ API/syntax found in any changed file.
  • Native-memory leak on parseAndVerify / constructor error paths — refuted: try-with-resources frees the DirectUtf8Sink + JsonLexer on every path including JsonException; storeKey is built before the lexer alloc; save's moved-flag finally cleans the temp.

Summary

~14 draft findings verified, ~9 dropped as false positives / by-design. All 5 confirmed findings are in-diff; 0 out-of-diff defects — the out-of-diff callers of the changed shared symbols (Numbers.append, JsonLexer, DirectByteSink, the Sender/AbstractLineHttpSender wire egress) were walked and verified safe. Nothing blocks merge. The single item most worth doing is #1 (stop Long.MIN_VALUE from serializing as a bare JSON null — a one-line checkNaN=false fix that keeps the frozen cross-language format honest); #2 and #4 are cheap defense-in-depth follow-ups. The native-leak test guard the prior reviews flagged (testMalformedEndpointDoesNotLeakNativeMemory feeding "not-a-url", which throws before the lexer alloc) is in the non-persistence surface and still open.

🤖 Generated with Claude Code

glasstiger and others added 3 commits June 30, 2026 01:17
build() now rejects a FileTokenStore whose lockStaleMillis is below
4x httpTimeoutMillis. The cross-process lock presumes a holder has
crashed once its lock outlives that staleness window and steals it; if
the window is shorter than the worst-case time a live refresh holds the
lock (send + await + parse + body drain, each bounded by
httpTimeoutMillis), a peer can steal a live holder's lock mid-refresh
and reopen the rotating-refresh-token race the lock exists to prevent.
The store cannot see the client's timeout, so build() enforces the
invariant where both values are known. The default store (600s) and any
non-coordinating TokenStore are unaffected.

Also add the two load-path security tests this subsystem lacked:
- a groups-in-token instance rejects a persisted entry whose served id
  token carries CR/LF (adopt() must validate the id token, not the
  access token, in that mode) and falls back to the device flow;
- a persisted served token carrying a non-ASCII char (> 0x7e), not just
  a control char, is rejected on load.

Each new test was confirmed to fail without its fix.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
acquireLock stole an abandoned lock by deleting it by bare path: two
acquirers contending over one stale lock could both delete it and a
third could delete a peer's freshly created live lock, admitting two
holders into the read-refresh-write at once. stealIfStale now captures
the lock with an atomic rename to a unique name (so exactly one stealer
wins), verifies the captured owner stamp matches the one it judged
stale, and on a mismatch restores the peer's lock with a non-clobbering
move rather than stealing it. This mirrors releaseLock's own-stamp
check and shrinks the race from the whole isStale->delete gap to the
gap between two renames.

releaseLock read the lock file with an unbounded Files.readAllBytes.
The lock file shares the attacker-writable token-store directory, so an
inflated lock could throw OutOfMemoryError - an Error the best-effort
RuntimeException guards on the getToken()/signIn() path do not catch,
aborting the sign-in. readLockHolder now reads it with the same hard
cap readBounded applies to the token file and treats an oversized lock
as unreadable.

Add a concurrent steal-contention test asserting mutual exclusion and
no capture-temp leak, and a test that an oversized stale lock is stolen
rather than wedging acquisition.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The FileTokenStore lock-file protocol had four edge cases where the
cross-process mutual exclusion could degrade. Each degrades gracefully
(at worst a re-prompt on a rotating-refresh-token IdP, never a torn or
forged credential), but the fixes are cheap and the comments overstated
the guarantees.

sweepStaleTempFiles now skips names containing ".lock.": a steal
captures a stale lock by renaming it to <hash>.lock.<uuid>.tmp, and
ATOMIC_MOVE preserves the stale mtime, so a concurrent save's temp sweep
would judge the in-flight capture old and delete it - destroying a lock
the stealer may be about to restore to its live owner.

stealIfStale now reclaims an empty/unstamped lock after a short grace
(EMPTY_LOCK_STEAL_GRACE_MILLIS) instead of the full staleness window. A
holder that crashes between the exclusive create and the stamp leaves an
empty lock with a fresh mtime that the staleness check protected for the
whole window, wedging peers into lock-free refreshes. The grace dwarfs
the create-to-stamp gap, so a peer mid-stamp is never pre-empted (which
would steal a lock its rightful owner is about to hold). isStale becomes
the parameterized isOlderThan.

The restore branch now documents honestly that a multi-actor race can
momentarily admit two holders - inherent to stealing with a lock file,
since a filesystem offers no atomic compare-and-delete.

The comments in FileTokenStore, OidcDeviceAuth, and the design doc
claimed the under-lock refresh hold is bounded by 4x httpTimeoutMillis;
they now note the connection phase (DNS + TCP connect + TLS handshake)
is not bounded by httpTimeoutMillis, so lockStaleMillis must clear that
on top of the refresh I/O. The default 600s window already absorbs it.

Adds regression tests for the empty-lock grace steal and the
sweep-skips-capture behaviour; both fail on the unfixed code.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@glasstiger

Copy link
Copy Markdown
Contributor Author

Code review — focused re-review of ea7c1a82d (Level 3)

This is a re-review scoped to the one new commit since the prior pass (b016dd0): ea7c1a82d "Harden token-store cross-process lock edge cases", which addresses the four Layer-2 cross-process lock edge cases flagged earlier. The delta is two code changes + two doc-honesty changes across FileTokenStore.java, OidcDeviceAuth.java, FileTokenStoreTest.java, and the design doc.

All four changed symbols (isOlderThan, stealIfStale, sweepStaleTempFiles, the new constant) are private to FileTokenStore; OidcDeviceAuth is comments-only (no logic change). Zero out-of-diff callsites — legitimately so.

Critical

None.

Moderate

None blocking. The four previously-flagged edge cases are correctly resolved:

  • Fix 1 — sweep skips captures (sweepStaleTempFiles). Closes a real lock-loss hole: a concurrent save()'s temp-sweep could delete an in-flight steal capture (<hash>.lock.<uuid>.tmp) because ATOMIC_MOVE preserves the stale mtime. The .contains(".lock.") filter is precisehash() is lowercase-hex SHA-256 (no dots), so a <hash><digits>.tmp write-temp can never contain .lock..
  • Fix 2 — empty-lock grace (stealIfStale). Closes the wedge where a crash in the createLockFilewriteLockHolder gap left an empty lock un-stealable for the full window. min(5s, lockStaleMillis) preserves the invariant that an empty lock is never less stealable than a stamped one, and the grace dwarfs the microsecond create→stamp gap so a live holder mid-stamp is not pre-empted.
  • Fix 3/4 — honest comments. The two-holder restore residual and the connection-phase caveat are now documented accurately.

Minor (non-blocking, both documented)

  1. Crash-orphaned captures are now never reclaimed. Since the sweep skips .lock. names, a <hash>.lock.<uuid>.tmp orphaned by a crash in the microsecond window between ATOMIC_MOVE and the restore/delete persists indefinitely (~hundreds of bytes each). Accumulation is negligible and the comment acknowledges it; the mtime can't distinguish an in-flight capture from an orphan, so no cheap age-based reclaim exists. Acceptable.
  2. Python client must still mirror the empty-lock grace. The design doc states this as a MUST; until then a not-yet-updated peer degrades gracefully (wedges slightly longer on a crashed Java holder's empty lock).

Checked and dismissed

  • "Empty-lock grace could steal a live lock"false. A live lock is stamped (before != null → full-window path); the grace path only triggers on an empty/unreadable lock, whose owner is crashed or in the microsecond stamp gap.
  • "Weaker before==null capture-verify deletes the wrong lock"graceful only. It can only affect empty (create-gap/orphan) locks; a stamped lock has after != null → restore. Worst case = one lock-free degrade + one holder = documented double-refresh.
  • "Fix 4 doc claim is wrong"verified true. HttpClient.connect() does blocking getAddrInfo (DNS) + blocking connectAddrInfo (TCP) and only switches to non-blocking after connect; the timeout arg governs only post-connect ioWait. DNS+TCP connect are genuinely unbounded by httpTimeoutMillis, so the prior docs overstated the ~480s bound.
  • "build() floor should add a connection-stall allowance"can't and needn't. A black-holed DNS/connect stall is unbounded, so no finite floor closes it; documenting it + a generous default (600s) is the sound choice and the degradation is graceful.

Verification performed

  • Regression guards proven both ways: both new tests pass on the fixed code and both fail on reverted production code with the exact assertions ("the temp-sweep must not delete an in-flight steal-captured lock", "an empty (unstamped) lock past the grace must be stolen..."). Real guards, not trivially-passing.
  • No regression: full FileTokenStoreTest green (39/39) — the stealIfStale change doesn't break existing steal/lock tests.
  • Member ordering preserved, no Java 9+ APIs, no LOG changes, assertMemoryLeak() used.

Verdict: approve

The commit cleanly resolves all four lock edge cases — two with correctly-scoped code fixes, two with accurate documentation — backed by genuine regression tests verified both ways and a green suite. Every residual race degrades gracefully to a re-prompt on a rotating-refresh-token IdP, never to a torn or forged credential (Layer-1 integrity is independent).

glasstiger and others added 11 commits July 1, 2026 03:04
Address three moderate findings from the device-flow review.

adopt() no longer nulls a live in-memory refresh token when a re-read
store entry carries a valid served token but no refresh_token (a
cross-language peer that never received one, or a tampered file).
Nulling it made tryRefresh() call urlEncode(null) and throw an uncaught
NullPointerException that aborted getToken()/signIn() instead of
degrading to a refresh or an interactive sign-in. adopt() now keeps the
current refresh token and tracks what the file actually carried;
tryRefresh() also guards against a null refresh token defensively.

DisplaySafe.isDisplaySafe() now rejects U+2028 LINE SEPARATOR and U+2029
PARAGRAPH SEPARATOR, which are neither ISO control nor Cf format chars
yet break a rendered log line in ECMAScript/GUI/JSON log consumers. One
classifier change closes the gap across every sanitizer that shares it
(sanitizeForDisplay, OidcAuthException.putSanitized, and
Utf16Sink.putAsPrintable).

The httpTokenProvider Javadoc now documents that a failed HTTP flush
preserves the buffered request and its baked token and re-sends it
verbatim on retry rather than re-pulling, so a flush that keeps failing
until the pulled token expires is then rejected (for example a 401)
until the caller discards the buffered rows.

Add DisplaySafeTest coverage for U+2028/U+2029 and an
OidcDeviceAuthPersistenceTest regression that reproduces the refresh NPE
via a peer entry omitting the refresh token (verified both ways).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Endpoint.parse now rejects an endpoint URL carrying a fragment (#...).
pathOnly() stripped the fragment before the issuer-path pin while
postForm sent endpoint.path verbatim on the wire, so a "#/../other/token"
payload from a tampered /settings could steer the credential POST to a
path the pin never validated on a server that normalizes '..'. Rejecting
the fragment makes the validated path and the sent path identical.

The token-store lock's stamp write and its stamp-failure cleanup now
verify ownership the way release and steal already do. writeLockHolder
refuses to overwrite a lock a peer stamped while this holder was
pre-empted in the create->stamp gap (a long pause or cross-machine clock
skew wider than the empty-lock grace), and the cleanup drops only a
still-empty lock, never a peer's live one by bare path. The residual
degrades to a re-prompt on a rotating refresh token, never a torn or
forged credential; the design doc and the overstated grace comment are
corrected to say so. The on-disk lock protocol is unchanged.

Add end-to-end tests proving the httpTokenProvider token reaches the
Authorization: Bearer header on the wire and rotates per request, that a
failed flush re-sends the same baked token without re-pulling, and that
adopt() re-saves a kept refresh token the file omitted. The endpoint
fragment and re-save guards are verified both ways.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
QwpWebSocketSender.buildAndConnect resolved the Authorization header
inside its per-endpoint walk, so a token provider was queried once per
endpoint per reconnect round. A throwing provider (a failed silent
refresh, or not signed in) was then caught as a per-endpoint transport
error, retried across every endpoint, and surfaced as "all endpoints
unreachable" - masking the real auth failure and re-hammering the token
endpoint with the same dead credential.

Resolve the header once per (re)connect round, before the endpoint walk,
and reuse it across a failover (a token is cluster-wide). A provider
throw now propagates to connectWithRetry, which retries it within the
reconnect budget - the documented streaming model - and surfaces the
provider's own message. A close that races the round aborts before the
possibly blocking token pull. This mirrors QwpQueryClient, which
likewise resolves the credential once before its walk.

Add a WebSocketTokenProviderTest case that proves the provider is
queried once (not per endpoint) and its error surfaces instead of
"all endpoints unreachable"; it fails against the pre-fix code.

Also correct the OidcDeviceAuth.getToken() javadoc: the silent refresh
is not bounded by httpTimeoutMillis end to end. That timeout bounds the
send, response wait and body parse, but the preceding connection phase
(DNS, TCP connect, TLS handshake) is bounded by the OS, so an
unreachable token endpoint can stall the refresh for the OS TCP-connect
timeout (~2 minutes), not 30s. The class already documented this for
the token-store lock; the getToken() contract now matches.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A review flagged several minor issues; this commit addresses the
actionable ones.

Endpoint.parse now rejects a non-ASCII host. String.equalsIgnoreCase
folds several non-ASCII letters onto ASCII (U+0130 -> i, U+212A -> k,
...), so a homoglyph host advertised by a tampered /settings could
otherwise pass the origin pin against a pinned issuer; a non-ASCII host
would not resolve anyway (the HTTP layer sends it to the OS resolver as
raw UTF-8, no IDNA). Endpoint.parse also lower-cases the URL scheme
before matching so a case-insensitive HTTPS/Http (RFC 3986) builds,
matching the browser launcher; toLowerCase folds only ASCII, so a
homoglyph scheme stays rejected. Renamed sameOrigin to isSameOrigin per
the is/has convention.

FileTokenStore.save now retries the atomic rename on a transient Windows
AccessDeniedException (a sharing violation from a concurrent reader), so
a routine read/write overlap does not needlessly degrade best-effort
persistence. POSIX rename over an open file is unaffected.

Smaller nits: DirectUtf8Sink.put(byte[],int,int) asserts its range;
AbstractLineHttpSender.close() nulls jsonErrorParser after freeing it;
Utf16Sink.putAsPrintable skips a redundant charAt on the BMP fast path;
and a duplicated comment in TokenStoreKey is removed.

Adds OidcDeviceAuthTest cases: a non-ASCII host and an uppercase scheme
(both proven to fail against the pre-fix code), and the issuer-pin
accept path for host casing and an implicit vs explicit 443 port.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
testConcurrentStealContentionPreservesMutualExclusion asserted a hard
mutual-exclusion invariant (overlaps == 0, maxInside == 1) that the
FileTokenStore cross-process lock does not provide. stealIfStale is
best-effort by design and documents a three-actor residual - a peer
recreating the lock while a second captures that fresh live lock and a
third claims the freed path - under which two holders briefly run at
once, degrading only to one extra token refresh (Layer-1 atomic rename
keeps the credential intact). Under four-way contention that residual is
reachable, so the test failed intermittently on CI.

A stress harness confirmed the mechanism: 8 contenders overlapped in 12
of 700 iterations (maxInside 2), while 2 contenders never overlapped in
800 iterations.

Split the one over-strict test in two:

- testConcurrentStealContentionTwoWayPreservesMutualExclusion runs two
  contenders and keeps the strict overlaps == 0 / maxInside == 1
  assertions. The three-actor residual cannot arise with two threads, so
  this is deterministic and still pins the exclusion the atomic capture
  guarantees.

- testConcurrentStealContentionDegradesCleanly runs four contenders and
  asserts only the benign invariants - every contender runs its section
  and no capture temp leaks - tolerating the documented residual.

No production change. Full FileTokenStoreTest is 40/40 green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Follow-up fixes from a review of the OIDC device-flow PR.

Security
- Endpoint.parse rejects a URL query (?...), matching its existing
  fragment (#) rejection: pathOnly() strips the query before the
  issuer-path pin, yet postForm sends endpoint.path verbatim, so a
  tampered /settings could route an unvalidated query to the IdP.

Correctness
- tryRefresh falls back to the interactive flow when a refreshed
  token carries a control/non-ASCII char, instead of letting
  validateTokenChars propagate past the fallback.
- JsonLexer.unescape keeps the backslash on an unknown or malformed
  escape rather than dropping it, so non-conformant text survives.
- DirectUtf8Sink.put(byte[]) range-checks and throws instead of an
  assert that is a no-op in client apps run without -ea.
- FileTokenStore warns once about missing POSIX perms via an
  AtomicBoolean compareAndSet, closing a benign double-warn race.

Performance
- AbstractLineHttpSender precomputes the User-Agent header once
  instead of concatenating it on every newRequest.
- OidcDeviceAuth reuses build()'s parsed endpoints in the
  constructor instead of re-parsing the raw strings.

Docs and tests
- DeviceCodePrompt javadoc says "display-safe text", not "plain ASCII".
- oidc-token-persistence.md reflects the shipped implementation.
- Add direct DirectUtf8Sink.put(byte[]) range/bounds tests and a
  short 1-digit 4xx/5xx status rejection test.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
FileTokenStore.putLongMember wrote a long field via sink.put(long),
which routes through Numbers.append(..., checkNaN=true) and renders
Long.MIN_VALUE as the literal JSON null. The two long fields
(expires_at_millis, token_ttl_millis) are present, non-nullable
integers, so a bare null there breaks the frozen cross-language format
(serialize() omits absent fields rather than writing null, so a null is
indistinguishable from absent) and round-trips back to 0 on read via
parseLongOrZero -- a silent corruption a Python peer would also diverge
on.

putLongMember now calls Numbers.append(sink, value, false), which emits
the full number, so every long value round-trips verbatim. The path is
reachable only through the public FileTokenStore.save(...) SPI;
OidcDeviceAuth clamps both fields to a non-negative range before
persisting, so the OIDC flow never hits it.

Add testLongFieldsSerializeAsDigitsNotBareNull: it saves Long.MIN_VALUE
in both long fields, asserts the on-disk JSON carries the digits rather
than a bare null, and that load() round-trips them instead of
collapsing to 0. The test fails on the pre-fix code and passes with the
fix.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Address minor code-review findings on the OIDC device flow: fill test
gaps around the token provider and tidy comments and dead code. No
production behavior change beyond simplifying unreachable branches.

Tests:
- Add testRefreshedTokenWithControlCharFallsBackToInteractiveFlow: a
  refresh whose 200 response carries a served token with an escaped
  control char (JsonLexer now decodes it into a real CR) is rejected by
  validateTokenChars, and tryRefresh must fall back to the interactive
  flow rather than propagate. Guards the tryRefresh storeTokens catch;
  verified to fail without it.
- Add testThrowingProviderOnReconnectIsRetriedAndRecovers and
  testPersistentlyThrowingProviderOnReconnectTerminatesTheSender: a
  token provider that throws on a WebSocket reconnect (the background
  I/O thread) is retried within the reconnect budget and recovers on a
  transient failure, or terminates the sender once the budget is
  exhausted on a persistent one.
- Replace the vacuous testMalformedEndpointDoesNotLeakNativeMemory
  (which fed "not-a-url", rejected before the lexer is allocated) with
  testRejectedBuildDoesNotLeakNativeMemory (a parseable-but-rejected
  endpoint) and testSuccessfulBuildAndCloseDoNotLeakNativeMemory - the
  path that actually allocates and frees the native lexer. The new
  guard fails if close() stops freeing the lexer.
- Add @test(timeout=30_000) to
  testShortAllDigitStatusNotTreatedAsTransientOrTerminal so a guard
  regression fails fast instead of polling to the device-code deadline.

Comments and dead code:
- Reword the writeLockHolder and acquireLock comments to match
  releaseLock's honest framing: the read-then-write and
  size-check-then-delete narrow, but do not close, the peer-stamp
  clobber window; the residual degrades to a double-refresh, never a
  torn credential.
- Reduce pathOnly() to return Endpoint.parse(url).path and drop the
  now-unreachable ?/# arms in Endpoint.parse's authority terminator and
  path construction (Endpoint.parse rejects ? and # up front).
- Expand the refreshUnderLock comment: the equals(refreshToken,
  lastPersistedRefreshToken) guard no longer strictly means "unsaved
  newer token" once adopt() keeps a live token the file lacked.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
adopt() set tokenTtlMillis to the remaining span (expiresAtMillis -
now), while storeTokens() sets it to the full issued lifetime.
effectiveSkewMillis() caps the 30s clock-skew margin at
tokenTtlMillis / 2 - a guard meant only for a genuinely short-issued
(< 60s) token - so a TokenStore-loaded token in its final minute had
that margin collapse toward zero, and getToken() served it on the
flush path instead of refreshing. Under client/server clock drift the
served token could then be rejected mid-request, and on the ILP HTTP
sender that 401 is not auto-recovered.

adopt() now takes the skew basis from the file's stored issued TTL
(clamped to [0, maxLife]), which snapshot() already persists, so
adopt() and storeTokens() give tokenTtlMillis one meaning. A tampered
ttl can only shrink the skew, never inflate it past CLOCK_SKEW_MILLIS,
and the server still enforces the real expiry, so trusting the stored
value is no less safe than deriving it from the remaining span.

Replace testAdoptDerivesTtlFromExpiry, which pinned the old
remaining-span behaviour, with testAdoptTrustsStoredIssuedTtlNot-
RemainingSpan, and add testAdoptedTokenNearExpiryStillRefreshesOn-
FlushPath as a regression guard. Both fail on the reverted production
line and pass with the fix.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Address the minor hardening findings from the device-flow review.

endpointPathHasEncodedSeparator() out-decoded the endpoint path with a
byte-oriented percentDecodeOnce and scanned each level for %2f/%5c/%25,
but missed encodings it does not resolve - an overlong-UTF-8 %c0%ae or
an IIS-style %u002e that a permissive server decodes to '.'/'/'. Such a
segment, sitting past the issuer prefix, slipped the path scope. A real
OIDC endpoint path is plain ASCII, so reject any '%' or '\' in it
outright; a provider that encodes its path must be configured
explicitly with builder(), which pins the origin only.

adopt() accepted an empty served token (hasOnlyTokenChars("") is
vacuously true) and would serve it as a blank "Bearer " header that
only draws a 401. Reject an empty served token too, so a corrupt or
tampered entry falls through to a refresh or an interactive sign-in.

Endpoint.parse() let Integer.parseInt read a ":+443" port as 443,
slipping the range check, and accepted a backslash in the host that the
WHATWG URL spec folds to '/'. Reject a leading '+' on the port and a
backslash in the host.

Correct the JsonLexer.unescape() default-case comment: a '\' before a
recognized escape letter is still decoded, so the "Windows path is not
dropped" note held only for genuinely unknown escapes.

Add regression tests for each - overlong-UTF-8 / %u endpoint paths, an
empty served token, and the +port / backslash-host cases; every one
fails on the reverted production line and passes with the fix.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@mtopolnik

Copy link
Copy Markdown
Contributor

[PR Coverage check]

😍 pass : 1572 / 1729 (90.92%)

file detail

path covered line new line coverage
🔵 io/questdb/client/cutlass/auth/TokenStore.java 0 1 00.00%
🔵 io/questdb/client/cutlass/auth/BrowserLauncher.java 11 21 52.38%
🔵 io/questdb/client/cutlass/auth/FileTokenStore.java 336 399 84.21%
🔵 io/questdb/client/cutlass/qwp/client/QwpWebSocketSender.java 7 8 87.50%
🔵 io/questdb/client/cutlass/auth/DeviceCodePrompt.java 22 24 91.67%
🔵 io/questdb/client/cutlass/qwp/client/QwpQueryClient.java 21 23 91.30%
🔵 io/questdb/client/cutlass/auth/OidcDeviceAuth.java 885 959 92.28%
🔵 io/questdb/client/cutlass/auth/TokenStoreKey.java 37 39 94.87%
🔵 io/questdb/client/Sender.java 29 30 96.67%
🔵 io/questdb/client/cutlass/line/http/AbstractLineHttpSender.java 38 39 97.44%
🔵 io/questdb/client/std/str/DirectUtf8Sink.java 9 9 100.00%
🔵 io/questdb/client/std/str/Utf16Sink.java 12 12 100.00%
🔵 io/questdb/client/cutlass/auth/DeviceAuthorizationChallenge.java 12 12 100.00%
🔵 io/questdb/client/cutlass/http/client/AbstractChunkedResponse.java 8 8 100.00%
🔵 io/questdb/client/cutlass/json/JsonLexer.java 65 65 100.00%
🔵 io/questdb/client/cutlass/auth/PersistedToken.java 12 12 100.00%
🔵 io/questdb/client/cutlass/auth/OidcAuthException.java 33 33 100.00%
🔵 io/questdb/client/cutlass/http/client/AbstractResponse.java 8 8 100.00%
🔵 io/questdb/client/std/str/DisplaySafe.java 20 20 100.00%
🔵 io/questdb/client/HttpTokenProvider.java 7 7 100.00%

@glasstiger

Copy link
Copy Markdown
Contributor Author

Code review — OIDC device flow (level-3, client-only)

Reviewed at head 4b385ee (base 660a06b), 45 files, +13,199/−131. Full mission-critical pass: 9 parallel dimension agents (correctness/security, concurrency, resources, performance ×2, tests, code quality, metadata, cross-context) plus an exhaustive independent from-source read of every security-critical function. Every finding verified against source; the authoritative diff was taken from merge-base…HEAD (byte-identical to gh pr diff now — the earlier stale-cache warning no longer applies).

Verdict: approve. No Critical or Moderate issues.

An unusually well-defended change — CR/LF/header injection, terminal spoofing (bidi/ANSI/zero-width/supplementary-plane tag chars), MITM endpoint redirection (origin + multi-decode path pinning), token-kind confusion, unbounded reads, use-after-free on close(), cross-process credential replay, and identity confusion were each anticipated and hold up under adversarial tracing. The three shared-infra changes that touch the existing ILP path (putAsPrintableDisplaySafe, JsonLexer escape decoding, Response.recv(int) bounding) were verified safe at every out-of-diff caller — 0 out-of-diff defects.

The prior review passes' three actionable items are all fixed at this head (re-verified): putLongMember now uses Numbers.append(…, checkNaN=false) so Long.MIN_VALUE serializes as digits not null; acquireLock cleanup now guards Files.size(lock)==0 before deleting; and the tautological native-leak test was replaced by a parseable-but-rejected-endpoint test plus a lexer-allocating complement.

Minor (none blocking) — all in-diff

The highest-value items are test-coverage gaps on correct-but-unguarded security branches.

  1. Raw (unencoded) ../. dot-segment issuer-path reject has no regression testOidcDeviceAuth.java:782-786. Every traversal test uses encoded forms (%2e%2e, %252f, %c0%ae, %u002e, \evil), all caught earlier by endpointPathHasEncodedSeparator's %/\ gate at :776, so they never reach the dot-segment loop; the #/../ and ?x=/../ tests hit fragment/query rejection in Endpoint.parse. A literal https://idp/realms/acme/../evil/token parses cleanly, passes the % gate, and is rejected only by lines 782-786 — a load-bearing anti-tenant-hopping check that would ship green if deleted. Add a raw-.. case via invokeIsEndpointUnderIssuerPath / fromQuestDB. Most worth doing before merge.

  2. JSON array-nesting is unguarded in the four OidcDeviceAuth network parsersSettingsDiscoveryParser (:2078), WellKnownDiscoveryParser (:2257), DeviceAuthorizationResponseParser (:1843), TokenResponseParser (:2186) track object depth only. {"config":[{"acl.oidc.token.endpoint":"…"}]} surfaces the wrapped object's members at depth==2 && isInConfig, so the field is extracted. Not exploitable (traced): extracted endpoints stay origin+path-pinned, tokens stay validateTokenChars-checked, preferences still can't feed discovery, responses come from a pinned/TLS origin. But FileTokenStore.TokenFileParser does reject arrays (malformed=true) — an asymmetry. Cheap defense-in-depth fix: mirror that array-reject in the two untrusted-input discovery parsers.

  3. parseHex4 non-ASCII \u guard untestedJsonLexer.java:303. The c < 128 ? hexNumbers[c] : -1 guard is present and correct (prevents an OOB index), but no test drives a \u whose window holds a raw non-ASCII byte (an AIOOBE there escapes parseBody's JsonException catch).

  4. QWP query-client per-reconnect re-query / bad-token rejection asserted via a @TestOnly hook (getAuthorizationHeaderForTest()), not a real reconnect socket — unlike the HTTP (…RotatesPerFlush) and WebSocket (…RequeriedOnEveryReconnect) paths, which capture real rotated headers.

  5. The recv(int) whole-call bound is never exercised end-to-end from the existing ILP flush path — only OIDC drives a real stalling socket; the ILP delegation is unit-tested via a hand-rolled subclass that never runs through flush0. (All five ILP callsites were traced to catch the resulting retryable HttpClientException, so this is a coverage gap, not a bug.)

  6. Two native-allocating test classes lack assertMemoryLeakQwpQueryClientTokenProviderTest (10 tests) and DirectUtf8SinkTest (10 tests, exercises the new put(byte[],int,int)). AbstractTest.tearDown() only logs, so a net native leak passes silently — inconsistent with the module convention (OidcDeviceAuthTest/FileTokenStoreTest wrap correctly).

  7. Device-auth path has no short-all-digit status test — the "2"/"5" traps drive only the poll path; the shared isHttpStatusSuccess() length()==3 gate is exercised there, so low value.

  8. groupsInToken field lacks the is/has prefixOidcDeviceAuth.java:175, TokenStoreKey.java:51. The fluent groupsInToken(boolean) setter is the accepted exception; renaming the field's usage has low cost but the public builder API rename has cost — low priority.

  9. close() javadoc overstates its wait bound — says "one HTTP request timeout," but a store-coordinated getToken() refresh can hold the lock for lockAcquireBudgetMillis (≤3s) + a round-trip. Doc tweak only.

Informational — documented tradeoffs / residuals (not defects)

  • Double newRequest() per flush under a token provider (AbstractLineHttpSender): a deliberate crash-safety choice (defer the throwable getToken() off the reset path via isTokenPending); negligible cost, per-flush cadence, non-provider senders unaffected.
  • Cross-process lock double-hold is reachable only under a multi-minute TCP-connect black-hole (the staleMillis ≥ 4×httpTimeout floor covers send+await+parse but not the OS connect phase). Bounded to a double-refresh / re-prompt — never a torn, forged, or replayed credential. Documented residual.
  • save()'s inner finally can mask the original IOException if deleteIfExists(tmp) itself throws, orphaning a .tmp — self-healing via sweepStaleTempFiles; an OS file, not a native leak.
  • Metadata: one commit carries a feat(core): prefix (commit titles shouldn't) + a few 51-52-char / body-less trivial commits — moot after squash-merge. Title/labels/tone/TODO checks pass; the ILP-flush recv regression is prominently disclosed under Tradeoffs.

Checked and cleared (so they don't get re-raised)

  • Java 8 floor: clean. readAllBytes is all Files.readAllBytes (Java 7; the production hits are comments); .isBlank is Chars.isBlank (own); buffer.flip() are bare statements binding Buffer.flip(); URLEncoder.encode(v,"UTF-8") is the legacy charset-name overload; ManagementFactory (not ProcessHandle) for PID. Exhaustive 43-file scan found no Java 9+ API/syntax.
  • Origin/path-pinning bypass: none — encoded/overlong/backslash/dot-segment/matrix-;/trailing-slash/IPv6/userinfo/#/?/non-ASCII-homoglyph-host all rejected; segment compare is exact and case-sensitive; co-location enforced on both construction paths; .well-known target derived only from the out-of-band issuer.
  • CR/LF / non-ASCII token injection: blocked on all three egress paths (device flow, silent refresh, file load), validated on the decoded value; refresh_token/device_code reach the wire only URL-encoded.
  • HTTP status validation: readResponse rejects the first non-digit byte with a constant message before storing, so no ANSI/control byte can splice into a later [httpStatus=…] echo; all three classifiers require exactly 3 digits (a short all-digit "2"/"5" fails closed).
  • Use-after-free on close(): refuted — sets the closed volatile then acquires the lock before Misc.free; every lexer use is under the lock.
  • Identity confusion in FileTokenStore: refuted — the 6-field fingerprint re-check (clientId, both endpoints, scope, audience via nullableEquals, groupsInToken) makes serving A's token to B impossible.
  • Concurrency / resources: every mutable token field is read/written only under the ReentrantLock; closed is the only field touched outside it and is volatile; no native leak or double-free on any happy/error/cancellation path; fetchJson/parseAndVerify free on every path.

Summary

~9 draft findings verified, ~7 false positives dropped (Java-8 breaks, ByteBuffer.flip, method ordering, and every claimed pinning/injection/UAF/identity bypass). Every confirmed finding is in-diff; 0 out-of-diff defects — the cross-context pass walked every out-of-diff caller of the four changed shared-infra symbols and cleared each. Nothing blocks merge; #1 (raw .. dot-segment test) and #2 (array-reject in the two discovery parsers) are the two worth doing first.

🤖 Generated with Claude Code

@glasstiger

Copy link
Copy Markdown
Contributor Author

Code review — OIDC device flow (level-3, client-only)

Reviewed at head 4b385ee (base 660a06b), 45 files, +13,199/-131. Full mission-critical pass: 10 parallel dimension agents (correctness/security, concurrency, resources, cross-context callers, performance x2, tests, code quality, metadata, plus a fresh-context adversarial agent) + an independent from-source read of every production file. Every finding verified against source; #1 and #2 below were proven by deleting the guard and re-running the suite. This is a client-only review — the linked OSS #7331 / Ent #1090 were not examined.

Verdict: approve. No Critical or Moderate issues.

Three independent adversarial passes converged on the same result: no functional or security bug. The security core fails closed everywhere — served-kind token char-validation on every egress path, origin + multi-decode-path issuer pinning, Endpoint.parse hardening, the RFC 8628 poll state machine, use-after-free-safe close(), and cross-process lock races that only ever degrade to a re-prompt (never a torn/forged credential). The cross-context pass cleared every out-of-diff caller of all 7 changed shared symbols (putAsPrintable->DisplaySafe, JsonLexer escape decoding, Response.recv(int), the token-provider plumbing, Sender/QWP auth, DirectUtf8Sink.put(byte[])) — 0 out-of-diff defects.

Minor (all in-diff, none blocking)

The two highest-value items are load-bearing guards with no regression test — both proven by build+test runs.

1. Raw (unencoded) ../. dot-segment issuer-path reject is untestedOidcDeviceAuth.java:782-786. Every traversal test embeds a % or \ (caught earlier by the %-gate at :776) or a #/? (caught by Endpoint.parse); none feeds a bare https://idp/realms/acme/../evil/token, which is rejected only here (the server would normalize it to a sibling realm). Proven: deleting L782-786 leaves all 113 OidcDeviceAuthTest tests green. Most worth doing before merge.

2. JsonLexer.parseHex4's non-ASCII \u-window guard is untested — its removal is a real crashJsonLexer.java:303 (c < 128 ? Numbers.hexNumbers[c] : -1, table is int[128]). Proven: with the guard removed, a \u escape whose window holds a UTF-8 é throws ArrayIndexOutOfBoundsException: Index 233 out of bounds for length 128, propagating as an uncaught RuntimeException (not JsonException) through unescape->getCharSequence->parse — and the OIDC callers only catch JsonException, so a malformed \u in any untrusted field (id_token claim, discovery doc, /settings) would abort sign-in with an unhandled AIOOBE. The guard is present and correct in the shipped code, so this is a test gap, not a live bug — but a high-value one.

3. Array-wrapped JSON accepted by the four discovery/response parsers (defense-in-depth, not exploitable) — SettingsDiscoveryParser/WellKnownDiscoveryParser/DeviceAuthorizationResponseParser/TokenResponseParser ignore EVT_ARRAY_START, so {"config":[{...}]} surfaces fields at the gated depth. Non-exploitable (/settings is already attacker-controlled in the threat model; preferences still can't feed discovery; everything stays origin/path-pinned + char-validated). The sibling FileTokenStore.TokenFileParser does reject arrays (:897, tested non-trivially) — an inconsistency. Mirror that malformed=true guard here.

4. adopt() serves a whitespace-only persisted tokenOidcDeviceAuth.java:1029 guards servedToken.isEmpty(), but a " " token passes (isEmpty() false, hasOnlyTokenChars allows 0x20) and is served as Bearer -> 401, instead of falling through to a refresh/re-sign-in as the code's own comment (:1025-1027) intends. Inconsistent with HttpTokenProvider.validateToken, which uses Chars.isBlank. Low impact (a blank token can't authenticate), but use Chars.isBlank to honor the stated intent; storeTokens/validateTokenChars share the same blind spot.

5. HttpTokenProvider.getToken() "must return promptly" contract vs. a ~2-min stall (doc) — the interface javadoc (:48-60) promises "prompt ... brief, bounded wait," but OidcDeviceAuth::getToken can block for the OS TCP-connect timeout (~2 min) on a black-holed token endpoint (the connect phase isn't bounded by httpTimeoutMillis). Documented on OidcDeviceAuth.getToken (:476-482) and in the PR Tradeoffs, but the interface wording doesn't cross-reference it. One-line doc fix.

6. Native-leak test-guard gapsAbstractTest.tearDown() only logs (no leak assertion). Missing assertMemoryLeak: DirectUtf8SinkTest (incl. the new put(byte[],int,int)), QwpQueryClientTokenProviderTest (real connect() -> native WS buffers), ResponseTest/ChunkedResponseTest (raw Unsafe.malloc/free), and — most subtly — JsonLexerTest's shared-LEXER tests that force extendCache() realloc (a mis-freed intermediate buffer is invisible to @AfterClass). Note assertMemoryLeak tracks only native memory, not the threads/sockets MockOidcServer spawns — that leak class is carried entirely by try-with-resources discipline (which is applied consistently).

7. Trivial-pass teststestTruncatedJsonReturnsNull (FileTokenStoreTest.java:997) and testOversizedFileReturnsNull (:733) leave the fingerprint/version fields at defaults, so the schema-version / fingerprint check returns null before the guard they claim to test (parseLast() truncation, MAX_FILE_BYTES) ever decides — deleting those guards wouldn't fail the tests. Same for testOversizedStaleLockIsStolen (steal is driven by mtime, not the size cap). (testArrayWrappedJsonReturnsNull is correctly non-trivial — the contrast to match.)

8. Other coverage/quality gaps (low) — recv(int) whole-read bound never exercised end-to-end from a real flush() against a dribbling socket; FileTokenStore.load()'s generic-IOException path untested (asymmetric with the tested save side); no test pins the segment-exact prefix property (/realms/acme vs /realms/acmeevil); expiry clamp tested only at Long.MIN/MAX, not the 0/now+maxLife boundary; refresh-token chars unvalidated on load (safe by URL-encoding); 4 tests use setAccessible reflection instead of the module's @TestOnly idiom; two hard-ceiling timing assertions (elapsed < 2000ms, [500,5000)ms) carry two-sided CI-flake risk.

9. Code-quality nitsDeviceCodePrompt.SYSTEM_OUT prints challenge fields via plain put() not putAsPrintable (safe only because runDeviceFlow pre-sanitizes at :1386-1396 — an undocumented cross-file invariant); OIDCAuthExample.java:70 prints a raw server message to System.err, teaching the anti-pattern the PR defends against; sanitizeForDisplay/putSanitized duplication and use of OidcAuthException.isUnsafeForDisplay over DisplaySafe; OidcAuthException(Throwable) leaves getMessage() empty unless .put(...) is chained; verbatim-duplicated guard/deadline/read blocks (Sender auth guards, recv deadline, readBounded/readLockHolder — already drifted); a few booleans without is/has; MockOidcServer.close() (:115) closes serverSocket unguarded before its try/catch teardown.

Informational — documented tradeoffs / residuals (not defects)

  • Double newRequest() per flush under a provider (crash-safety choice; ~100 bytes rewritten, amortized over 75k rows, zero alloc; token written once).
  • Per-(re)connect "Bearer " + token allocation — cold path (once per connect round; verified, corroborated by two agents + WebSocketTokenProviderTest), acceptable by the zero-GC-on-data-paths standard; optional direct-buffer-write cleanup.
  • Issuer-path check runs 3 linear passes where 1 would do — cold, once per construction.
  • Response.recv(int) tightens existing ILP flushes — prominently disclosed under Tradeoffs; abort is a retryable HttpClientException caught by flush0, no data loss.
  • Enterprise ACL: N/A — the client presents the token over auth paths the server already validates; no new server SQL / state mutation.

Checked and dismissed (so they don't get re-raised)

  • "Per-connect Bearer concat is Moderate/warm" -> cold-path Minor (traced once-per-connect-round; 3 sources agree).
  • "Fixed httpToken/withBearerToken skip validateToken" -> benign (caller-supplied static strings, pre-existing; OIDC tokens validated at the OidcDeviceAuth layer before reaching .httpToken(...)).
  • "TokenStoreKey lacks equals/hashCode" -> not used as a hash key; identity is the String hash() + per-field fingerprint compare.
  • "Device-auth path lacks short-status coverage" / "throwing-provider WS-reconnect untested" -> both actually covered: short status via the shared classifier through a real socket; testThrowing/PersistentlyThrowingProviderOnReconnect... drive a real TestWebSocketServer drop-and-reconnect asserting the wire-level header.
  • Java 8 floor / unclosed LOG -> clean across all 45 files (this client uses System.out/err + SLF4J, not the builder-chain framework).

Summary

All confirmed findings are in-diff; 0 out-of-diff defects (the cross-context pass walked every caller of the 7 changed shared symbols and cleared each). ~24 draft findings verified, ~5 dropped as false-positive/benign. Nothing blocks merge. The two items most worth doing first are #1 (raw-.. dot-segment regression test) and #2 (non-ASCII-\u-window test) — both proven load-bearing crash/security guards with zero coverage today; #3 (array-reject mirror) and #4 (isBlank fix) are cheap follow-ups.

🤖 Generated with Claude Code

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request security tandem

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants