HDDS-13434. Intermittent failure in TestOzoneDebugShell#testChunkInfoVerifyPathsAreDifferent#10600
Conversation
| AtomicInteger exitCode = new AtomicInteger(1); | ||
| AtomicInteger lastPathCount = new AtomicInteger(-1); | ||
| AtomicReference<Throwable> lastError = new AtomicReference<>(); |
There was a problem hiding this comment.
Thanks @chihsuan for working on this. I don't think these three variables need to be atomic. Could you walk me through the reasoning behind using atomic here?
There was a problem hiding this comment.
Thanks for the review! @chungen0126 Yeah, these aren't for thread-safety. I picked Atomic* specifically because the lambda has to write exitCode/diagnostics back out, and a plain int/Throwable can't be reassigned inside a lambda so they need to be mutable holders.
Besides, I observed that Atomic* is the existing idiom for waitFor holders in the repo (e.g. TestSnapshotBackgroundServices).
Open to changing it if you have any suggestions. Thanks!
There was a problem hiding this comment.
Thanks @chihsuan for the explanation! Since it aligns with the existing idiom in the repo, keeping it as Atomic* looks good to me. Approved!
Created HDDS-15709. |
|
Thanks @chihsuan for the patch, @chungen0126 for the review. |
What changes were proposed in this pull request?
Test-only change.
TestOzoneDebugShell.testChunkInfoVerifyPathsAreDifferentintermittently failed withexpected: <3> but was: <2>. A RATIS THREE write is acknowledged on a Ratis majority, so right after the key is written one replica may not have applied the write yet. Whenozone debug replicas chunk-inforuns in that window,ReadContaineron the lagging datanode fails, the datanode is dropped from the response, and the test sees only 2 distinct block file paths instead of 3.Fix. Wait until all three datanodes report the block (three distinct paths) before asserting, using
GenericTestUtils.waitFor, instead of asserting on a single read taken immediately after the write. A persistent error or an unexpected path count is reported via the wait's last observed state, so genuine regressions still surface with context rather than an opaque timeout.What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-13434
How was this patch tested?
BucketLayoutvalues.flaky-test-checkon the fork,TestOzoneIntegrationNonHAALL, 10x10: theOzoneDebugShellcases (including this test) passed in all 100 iterations. https://github.com/chihsuan/ozone/actions/runs/28101143264TestListKeysWithFSO(BUCKET_ALREADY_EXISTSin test setup) appeared in a single split; it is not touched by this change.