Skip to content

PHOENIX-7876 Improve EXPLAIN#2556

Open
apurtell wants to merge 27 commits into
apache:masterfrom
apurtell:PHOENIX-7876
Open

PHOENIX-7876 Improve EXPLAIN#2556
apurtell wants to merge 27 commits into
apache:masterfrom
apurtell:PHOENIX-7876

Conversation

@apurtell

Copy link
Copy Markdown
Contributor

Phoenix's EXPLAIN [WITH REGIONS] output is today incomplete and easy to misread for both query analysis and performance investigation. Non-trivial joins render with almost no detail. The cost vector advertises three dimensions but only compares IO, and estimates/statistics are unreliable. Some optimizations are incorrectly categorized, for example some shown as row-eliminating predicates are actually column-projection optimizations. The operator trees of complex queries are flattened to a single ident level. Many planner facts are not surfaced at all, such as which rule chooses an index, which indexes the optimizer considered but rejected, what query rewrites took place (e.g. subquery decorrelation, star-join detection, right-to-left normalization, HAVING lift, RVC-offset translation, reverse-scan substitution, UNION ORDER BY, index expression substitution, and more), the specific hash-join strategy chosen, salt bucket counts, local vs. global vs. uncovered-global index distinctions, the particular flavor of atomic upsert chosen and server-side atomic update expressions, multi tenant context, CDC scope, transaction provider, projection lists, predicate to filter attribution, hints honored vs ignored, and the structure of the JSON/BSON/array path expressions evaluated server-side.

This proposal closes all of those gaps. Details provided in the design document

This work was implemented on the dev branch PHOENIX-7876-feature. All of the individual commits made on the dev branch are preserved as independent commits for this PR.

The bulk of the changes are to unit tests and integration tests where they formerly dumped raw EXPLAIN text and did exact string matching against the result. Now every test uses a fluent assertion API against ExplainPlanAttributes. The coverage of the tests has not changed. Their maintainability has improved 1000% (imho). The EXPLAIN features themselves have significantly more test coverage now. Everything I touched I added tests for. Most of them are connectionless tests that don't require a minicluster.

PHOENIX_EXPLAIN_MIGRATION_GUIDE.md is a migration guide for anyone scraping today's EXPLAIN PLAN output.

PHOENIX_EXPLAIN_EXAMPLES.md is a companion document for the reviewer's guide that illustrates the new capabilities with ~60 examples that exercise the new and modified EXPLAIN PLAIN features.

The following tests are known to have pre-existing failure or flake conditions and flake or fail on current master:

org.apache.phoenix.end2end.BackwardCompatibilityIT
org.apache.phoenix.end2end.GlobalConnectionTenantTableIT
org.apache.phoenix.end2end.index.IndexTwoPhaseCreateIT
org.apache.phoenix.end2end.index.LocalIndexIT
org.apache.phoenix.end2end.IndexScrutinyWithMaxLookbackIT
org.apache.phoenix.end2end.transform.TransformIT
org.apache.phoenix.end2end.transform.TransformMonitorIT
org.apache.phoenix.end2end.transform.TransformToolIT
org.apache.phoenix.jdbc.SecureUserConnectionsIT
org.apache.phoenix.query.MaxConcurrentConnectionsIT
org.apache.phoenix.rpc.UpdateCacheIT

I have root caused these failures and have either already filed a JIRA for them or will will shortly.

Co-Authored-By: Claude Opus 4.8[1m] noreply@anthropic.com

apurtell and others added 27 commits June 27, 2026 22:08
…zation compatibility (apache#2495)

Co-authored-by: Claude Opus 4.8[1m] <noreply@anthropic.com>
…pache#2503)

Co-authored-by: Claude Opus 4.8[1m] <noreply@anthropic.com>
Co-authored-by: Claude Opus 4.8[1m] <noreply@anthropic.com>
…apache#2507)

Co-authored-by: Claude Opus 4.8[1m] <noreply@anthropic.com>
Co-authored-by: Claude Opus 4.8[1m] <noreply@anthropic.com>
…ters (apache#2511)

Co-authored-by: Claude Opus 4.8[1m] <noreply@anthropic.com>
…pache#2513)

Co-authored-by: Claude Opus 4.8[1m] <noreply@anthropic.com>
…ILD decorations (apache#2514)

Co-authored-by: Claude Opus 4.8[1m] <noreply@anthropic.com>
Co-authored-by: Claude Opus 4.8[1m] <noreply@anthropic.com>
)

Co-authored-by: Claude Opus 4.8[1m] <noreply@anthropic.com>
…IN (apache#2522)

Co-authored-by: Claude Opus 4.8[1m] <noreply@anthropic.com>
)

Co-authored-by: Claude Opus 4.8[1m] <noreply@anthropic.com>
…pache#2524)

Co-authored-by: Claude Opus 4.8[1m] <noreply@anthropic.com>
Co-authored-by: Claude Opus 4.8[1m] <noreply@anthropic.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Claude Opus 4.8[1m] <noreply@anthropic.com>
Co-authored-by: Claude Opus 4.8[1m] <noreply@anthropic.com>
…pache#2528)

Co-authored-by: Claude Opus 4.8[1m] <noreply@anthropic.com>
…e#2529)

Co-authored-by: Claude Opus 4.8[1m] <noreply@anthropic.com>
Co-authored-by: Claude Opus 4.8[1m] <noreply@anthropic.com>
Co-authored-by: Claude Opus 4.8[1m] <noreply@anthropic.com>
…tOptimizerDecision (apache#2532)

Co-authored-by: Claude Opus 4.8[1m] <noreply@anthropic.com>
Co-authored-by: Claude Opus 4.8[1m] <noreply@anthropic.com>
…e#2537)

Co-authored-by: Claude Opus 4.8[1m] <noreply@anthropic.com>
…ache#2538)

Co-authored-by: Claude Opus 4.8[1m] <noreply@anthropic.com>
…xts (apache#2544)

Co-authored-by: Claude Opus 4.8[1m] <noreply@anthropic.com>
…inner plans (apache#2545)

Co-authored-by: Claude Opus 4.8[1m] <noreply@anthropic.com>
…pache#2546)

Co-authored-by: Claude Opus 4.8[1m] <noreply@anthropic.com>
@virajjasani

Copy link
Copy Markdown
Contributor

@apurtell

apurtell commented Jun 29, 2026

Copy link
Copy Markdown
Contributor Author

The precommit build https://ci-hadoop.apache.org/job/Phoenix/job/Phoenix-PreCommit-GitHub-PR/job/PR-2556/ had four failures. All are not related to this change.
/cc @virajjasani

  1. org.apache.phoenix.end2end.CDCQueryIT.testSelectWithTimeRange

java.lang.IllegalArgumentException: bound must be positive
at java.util.Random.nextInt(Random.java:388)
at org.apache.phoenix.end2end.CDCQueryIT.testSelectWithTimeRange(CDCQueryIT.java:786)

CDCQueryIT is flaky because it does not always provide a positive bound to Random.nextInt(). This is unrelated to the EXPLAIN changes.

  1. org.apache.phoenix.end2end.MetadataServerConnectionsIT.testViewCreationAndServerConnections

org.apache.hadoop.hbase.DoNotRetryIOException: Not allowed
at org.apache.phoenix.end2end.MetadataServerConnectionsIT$TestMetaDataEndpointImpl.createTable(MetadataServerConnectionsIT.java:183)
at org.apache.phoenix.coprocessor.generated.MetaDataProtos$MetaDataService.callMethod(MetaDataProtos.java:17539)
at org.apache.hadoop.hbase.regionserver.HRegion.execService(HRegion.java:7992)

The code in MetadataServerConnectionsIT calls Thread.activeCount() to get a thread count, allocates an array of that size, then calls Thread.enumerate() to fill it. However, Thread.enumerate() can return fewer threads than the array size if threads exit between the two calls, leaving null slots in the array. When the stream tries to call getName() on these null entries, it throws an NPE. This is unrelated to the EXPLAIN changes.

  1. org.apache.phoenix.end2end.UpsertSelectIT.testUpsertSelectOnDescToAsc

java.lang.AssertionError: expected:<0> but was:<1>
at org.junit.Assert.fail(Assert.java:89)
at org.junit.Assert.failNotEquals(Assert.java:835)
at org.junit.Assert.assertEquals(Assert.java:647)
at org.junit.Assert.assertEquals(Assert.java:633)
at org.apache.phoenix.end2end.UpsertSelectIT.assertNoConnLeak(UpsertSelectIT.java:113)

The connection leak is a preexisting intermittent connection cleanup race in the client-side UPSERT SELECT parallel mutating iterator path. This PR does not modify any of the files that govern this connection lifecycle. The PR's UpsertCompiler edits only add a returningRow field, restructure the getExplainPlan() text, and pass a pre-built StatementContext.forRewrite(...) through SubselectRewriter.flatten / SubqueryRewriter.transform. None of those code paths runs during executeUpdate(). They only run when the user (or test) asks for an explicit EXPLAIN. This is unrelated to the EXPLAIN changes.

  1. org.apache.phoenix.monitoring.PhoenixTableLevelMetricsIT.testHistogramMetricsForQueries

java.lang.AssertionError
at org.junit.Assert.fail(Assert.java:87)
at org.junit.Assert.assertTrue(Assert.java:42)
at org.junit.Assert.assertTrue(Assert.java:53)
at org.apache.phoenix.monitoring.PhoenixTableLevelMetricsIT.assertHistogramMetricsForQueries(PhoenixTableLevelMetricsIT.java:1595)

The issue is that the SelfHealingTask running on a background thread is making its own queries to system tables between when the test resets the global metrics and when it runs its own query, so the global query time ends up capturing both the background task's work and the test's work, causing the assertion to fail because the recorded max latency value is larger than expected. (TaskRegionObserver schedules SelfHealingTask via scheduleWithFixedDelay.) This is unrelated to the EXPLAIN changes.


I think this change is good to merge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants