
Fix race condition in IsRunningStartupCheckStrategy with stale cached container state (#11860) #11861

Open: vpelikh wants to merge 2 commits into testcontainers:main from vpelikh:GH-11860

vpelikh commented Jun 30, 2026

Problem

In CI environments (Kubernetes, Jenkins), PostgreSQL (and potentially other) containers fail with:

Wait strategy failed. Container is removed

Root cause: IsRunningStartupCheckStrategy had a short-circuit optimization that read container.getContainerInfo().getState(), which returns the state cached by the preceding port-mapping check rather than a live value. If the container exits or crashes between the port-mapping check and the startup check, the following happens:

  1. Port-mapping check caches containerInfo with state "running"
  2. Container exits
  3. IsRunningStartupCheckStrategy reads stale "running" state → startup passes incorrectly
  4. Wait strategy starts (docker logs --follow) on crashed/removed container → NotFoundException
  5. User sees "Container is removed" with no indication of the actual failure
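
The sequence above can be reproduced in a small stand-alone model. This is only an illustrative sketch: FakeDaemon and FakeContainer are hypothetical stand-ins invented for this example, not Testcontainers or docker-java classes, and the state strings are simplified.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical model of the race; none of these types exist in Testcontainers. */
final class StaleCacheRaceSketch {

    /** Stand-in for the Docker daemon's authoritative view of the container. */
    static final class FakeDaemon {
        private final AtomicReference<String> state = new AtomicReference<>("running");

        String inspect() {
            return state.get();
        }

        void crash() {
            state.set("exited");
        }
    }

    /** Stand-in for a container object that remembers its last inspect result. */
    static final class FakeContainer {
        private final FakeDaemon daemon;
        private String cachedState;

        FakeContainer(FakeDaemon daemon) {
            this.daemon = daemon;
        }

        /** Step 1: the port-mapping check inspects the container and caches the result. */
        void portMappingCheck() {
            cachedState = daemon.inspect();
        }

        String cachedState() {
            return cachedState;
        }

        String liveState() {
            return daemon.inspect();
        }
    }

    /** Returns {cached state, live state} after the container crashes between the two checks. */
    static String[] run() {
        FakeDaemon daemon = new FakeDaemon();
        FakeContainer container = new FakeContainer(daemon);
        container.portMappingCheck(); // step 1: cache says "running"
        daemon.crash();               // step 2: container exits
        return new String[] { container.cachedState(), container.liveState() };
    }

    public static void main(String[] args) {
        String[] states = run();
        System.out.println("cached=" + states[0] + ", live=" + states[1]);
    }
}
```

A startup check that consults only the cached value would see "running" here, which corresponds to step 3 of the list.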

Fix

Don't blindly trust the cached state: verify it with one live Docker inspect before declaring success. The cached state from the preceding port-mapping check is used as a hint (fast path), but is confirmed via a single checkStartupState call. If the live inspect confirms the container is running, we return success immediately. If the live inspect shows a different state (stale cache), we fall through to full rate-limited polling. If the live inspect fails or times out (e.g., Docker unresponsive on slow CI), we gracefully fall back to trusting the cached state as the best available information.
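
The decision logic described above might look roughly like the following. This is a simplified sketch rather than the code in this PR: the class, the Outcome enum, and the Callable-based liveInspect parameter are hypothetical, and the behavior when the cache does not report "running" (go straight to polling) is an assumption.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.TimeoutException;

/** Hypothetical, simplified model of the verify-then-trust check. */
final class HybridStartupCheckSketch {

    enum Outcome { SUCCESS, FALL_THROUGH_TO_POLLING }

    /**
     * @param cachedRunning whether the state cached by the port-mapping check was "running"
     * @param liveInspect   one live inspect; returns true if the container is running,
     *                      and may throw if Docker fails or times out
     */
    static Outcome check(boolean cachedRunning, Callable<Boolean> liveInspect) {
        if (!cachedRunning) {
            // Assumption: without a "running" hint there is no fast path to take.
            return Outcome.FALL_THROUGH_TO_POLLING;
        }
        try {
            // Confirm the hint with a single live inspect.
            return liveInspect.call() ? Outcome.SUCCESS : Outcome.FALL_THROUGH_TO_POLLING;
        } catch (Exception e) {
            // Inspect failed or timed out: the cache is the best information available.
            return Outcome.SUCCESS;
        }
    }

    public static void main(String[] args) {
        System.out.println(check(true, () -> true));
        System.out.println(check(true, () -> false));
        System.out.println(check(true, () -> { throw new TimeoutException("inspect hung"); }));
    }
}
```

The trade-off in the last branch is deliberate: when the live inspect is unavailable, the stale-state bug can still occur, but the check does not turn a slow Docker daemon into a startup failure.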

Additional improvements:

  • Include containerId in the "Container is removed" error message for better diagnostics
  • Wrap getLogs() and stop() in individual try-catch blocks during cleanup to prevent cascading failures (e.g., NotFoundException when the container is already removed) from suppressing the original ContainerLaunchException
  • Re-enable testCommandQuickExitFailure (was @Disabled due to flakiness from the cached-state race)
  • Add testQuickExitWithDifferentExitCode to validate any non-zero exit code is detected
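
To illustrate the cleanup change in the second bullet, here is a minimal sketch of running each cleanup step in its own try-catch so the original failure stays the primary exception. The Step interface and method names are hypothetical, and recording cleanup errors with addSuppressed is an assumption made for this example; the PR may simply log them.

```java
/** Hypothetical sketch of failure-tolerant cleanup; not the GenericContainer code. */
final class SafeCleanupSketch {

    /** A cleanup action that may fail, e.g. fetching logs or stopping the container. */
    interface Step {
        void run() throws Exception;
    }

    /**
     * Runs both cleanup steps independently and returns the original exception,
     * so a failure in one step neither skips the other nor replaces the root cause.
     */
    static RuntimeException cleanUpAfterFailure(RuntimeException original, Step fetchLogs, Step stop) {
        for (Step step : new Step[] { fetchLogs, stop }) {
            try {
                step.run();
            } catch (Exception e) {
                // Assumption: keep the cleanup error attached for diagnostics.
                original.addSuppressed(e);
            }
        }
        return original;
    }

    public static void main(String[] args) {
        RuntimeException original = new RuntimeException("container failed to start");
        RuntimeException result = cleanUpAfterFailure(
            original,
            () -> { throw new IllegalStateException("no such container"); }, // logs unavailable
            () -> System.out.println("stop attempted")
        );
        System.out.println(result.getMessage() + ", suppressed=" + result.getSuppressed().length);
    }
}
```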

Design Rationale

The hybrid approach (option 3) was chosen over two alternatives (options 1 and 2):

  1. Remove cached-state shortcut entirely: The simplest fix for #11860 ([Bug]: PostgreSQL container intermittently fails to start with "Wait strategy failed. Container is removed" (TimeoutException) in CI environment), and my original implementation. It caused consistent CI failures on CircleCI, where docker inspect hangs, making every startup check time out.
  2. Keep the cached-state shortcut as-is: Would pass CI but doesn't fix the stale-state bug.
  3. Hybrid (implemented): Use the cached state as a hint, verify it with one live Docker inspect, and on timeout fall back to trusting the cache. This fixes the stale-state bug when Docker is responsive, while remaining resilient to the Docker inspect timeouts observed on CircleCI's machine executor.

Changes

  • IsRunningStartupCheckStrategy.java: Replaced blind cached-state shortcut with verify-then-trust hybrid — uses cached state as hint but confirms with one live Docker inspect before returning success
  • GenericContainer.java: Better error message; NotFoundException-safe cleanup
  • IsRunningStartupCheckStrategyTest.java: Re-enabled flaky test; added test for non-zero exit code

@vpelikh vpelikh requested a review from a team as a code owner June 30, 2026 12:55
@vpelikh vpelikh force-pushed the GH-11860 branch 3 times, most recently from 13ac364 to e18592b Compare June 30, 2026 13:45
… container state (testcontainers#11860)

IsRunningStartupCheckStrategy used container.getContainerInfo().getState() which returns stale cached state from the port-mapping check. If the container exits between the port-mapping check and the startup check, the stale 'running' state caused the startup check to pass prematurely, and the wait strategy would start on a crashed/removed container.

Fix by using the cached state as a hint (fast path) but verifying it with a single live Docker inspect. If the live inspect confirms the container is running, return success immediately. If it shows a different state (stale cache), fall through to rate-limited polling. If the live inspect fails or times out (e.g., Docker unresponsive on slow CI), gracefully fall back to trusting the cached state as the best available information.

Also improve diagnostics:
- Include containerId in 'Container is removed' error message
- Handle NotFoundException gracefully when retrieving logs from a removed container during cleanup
- Wrap stop() in try-catch to prevent cleanup failures from suppressing the original ContainerLaunchException

Re-enable testCommandQuickExitFailure which was disabled due to this race, and add testQuickExitWithDifferentExitCode.
… container state (testcontainers#11860)

IsRunningStartupCheckStrategy used container.getContainerInfo().getState() which returns stale cached state from the port-mapping check. If the container exits between the port-mapping check and the startup check, the stale 'running' state caused the startup check to pass prematurely, and the wait strategy would start on a crashed/removed container.

Fix by always doing a live Docker inspect via the base class polling loop instead of short-circuiting with cached state.

Also improve diagnostics:
- Include containerId in 'Container is removed' error message
- Handle NotFoundException gracefully when retrieving logs from a removed container during cleanup
- Wrap stop() in try-catch to prevent cleanup failures from suppressing the original ContainerLaunchException

Re-enable testCommandQuickExitFailure which was disabled due to this race, and add testQuickExitWithDifferentExitCode.