Fix race condition in IsRunningStartupCheckStrategy with stale cached container state (#11860)#11861
Open
vpelikh wants to merge 2 commits into
Open
Fix race condition in IsRunningStartupCheckStrategy with stale cached container state (#11860)#11861vpelikh wants to merge 2 commits into
vpelikh wants to merge 2 commits into
Conversation
13ac364 to
e18592b
Compare
… container state (testcontainers#11860) IsRunningStartupCheckStrategy used container.getContainerInfo().getState() which returns stale cached state from the port-mapping check. If the container exits between the port-mapping check and the startup check, the stale 'running' state caused the startup check to pass prematurely, and the wait strategy would start on a crashed/removed container. Fix by using the cached state as a hint (fast path) but verifying it with a single live Docker inspect. If the live inspect confirms the container is running, return success immediately. If it shows a different state (stale cache), fall through to rate-limited polling. If the live inspect fails/timeout (e.g., Docker unresponsive on slow CI), gracefully fall back to trusting the cached state as the best available information. Also improve diagnostics: - Include containerId in 'Container is removed' error message - Handle NotFoundException gracefully when retrieving logs from a removed container during cleanup - Wrap stop() in try-catch to prevent cleanup failures from suppressing the original ContainerLaunchException Re-enable testCommandQuickExitFailure which was disabled due to this race, and add testQuickExitWithDifferentExitCode.
… container state (testcontainers#11860) IsRunningStartupCheckStrategy used container.getContainerInfo().getState() which returns stale cached state from the port-mapping check. If the container exits between the port-mapping check and the startup check, the stale 'running' state caused the startup check to pass prematurely, and the wait strategy would start on a crashed/removed container. Fix by always doing a live Docker inspect via the base class polling loop instead of short-circuiting with cached state. Also improve diagnostics: - Include containerId in 'Container is removed' error message - Handle NotFoundException gracefully when retrieving logs from a removed container during cleanup - Wrap stop() in try-catch to prevent cleanup failures from suppressing the original ContainerLaunchException Re-enable testCommandQuickExitFailure which was disabled due to this race, and add testQuickExitWithDifferentExitCode.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
In CI environments (Kubernetes, Jenkins), PostgreSQL (and potentially other) containers fail with:
Root cause:
IsRunningStartupCheckStrategyhad a short-circuit optimization that usedcontainer.getContainerInfo().getState()— which returns stale cached state from the preceding port-mapping check. If the container exits/crashes between the port-mapping check and the startup check:containerInfowith state "running"IsRunningStartupCheckStrategyreads stale "running" state → startup passes incorrectlydocker logs --follow) on crashed/removed container →NotFoundExceptionFix
Don't blindly trust the cached state — verify it with one live Docker inspect before declaring success. The cached state from the preceding port-mapping check is used as a hint (fast path), but is confirmed via a single
checkStartupStatecall. If the live inspect confirms the container is running, we return success immediately. If the live detect shows a different state (stale cache), we fall through to full rate-limited polling. If the live inspect fails/timeout (e.g., Docker unresponsive on slow CI), we gracefully fall back to trusting the cached state as the best available information.Additional improvements:
containerIdin the "Container is removed" error message for better diagnosticsgetLogs()andstop()in individual try-catch blocks during cleanup to prevent cascading failures (e.g.,NotFoundExceptionwhen the container is already removed) from suppressing the originalContainerLaunchExceptiontestCommandQuickExitFailure(was@Disableddue to flakiness from the cached-state race)testQuickExitWithDifferentExitCodeto validate any non-zero exit code is detectedDesign Rationale
The hybrid approach was chosen over two alternatives:
docker inspecthangs, making every startup check timeout.Changes
IsRunningStartupCheckStrategy.java: Replaced blind cached-state shortcut with verify-then-trust hybrid — uses cached state as hint but confirms with one live Docker inspect before returning successGenericContainer.java: Better error message; NotFoundException-safe cleanupIsRunningStartupCheckStrategyTest.java: Re-enabled flaky test; added test for non-zero exit code