Skip to content

Isolate staging Cloud Run service and add backend response header (migration PR 4, stage 1)#3720

Draft
anth-volk wants to merge 2 commits into
masterfrom
feat/migration-pr4-stage1-staging-isolation
Draft

Isolate staging Cloud Run service and add backend response header (migration PR 4, stage 1)#3720
anth-volk wants to merge 2 commits into
masterfrom
feat/migration-pr4-stage1-staging-isolation

Conversation

@anth-volk

Copy link
Copy Markdown
Collaborator

Summary

Stage 1 of the API host cutover ("PR 4" of the unified API v2-alpha migration plan): before a load balancer can front the production Cloud Run service, staging deploys must stop promoting to it.

  • Isolate staging Cloud Run onto policyengine-api-staging. deploy-cloud-run-staging and promote-cloud-run-staging previously used the cloud_run_env.sh default CLOUD_RUN_SERVICE=policyengine-api, so every push promoted a staging-configured revision to 100% of the production service URL for the duration of the pipeline. Both jobs now pin CLOUD_RUN_SERVICE: policyengine-api-staging at the job level. The first staging deploy after merge auto-creates the new service.
  • Decouple the image name from the service name. CLOUD_RUN_IMAGE_URI embedded ${CLOUD_RUN_SERVICE}; production reuses the staging-built image, so the image path now uses a fixed CLOUD_RUN_IMAGE_NAME (default policyengine-api) regardless of service override.
  • Tag every API response with X-PolicyEngine-Backend (app_engine/cloud_run, from the existing API_HOST_BACKEND migration flag). Set in the shared Flask after_request hook (covers App Engine and Cloud Run's Flask-mounted routes) and in the ASGI middleware for native FastAPI routes. This is the in-band signal for verifying traffic-weight changes and rollbacks during the ramp with sampled requests; the helper never raises, falling back to the default backend on an invalid flag.

Test plan

  • pytest tests/unit/test_cloud_run_deploy_scripts.py tests/unit/routes/test_migration_context_logging.py tests/unit/test_migration_flags.py tests/unit/test_asgi_factory.py — 72 passed, including new invariants: staging jobs pin the staging service, production job never references it, image URI is service-independent, deploy/promote dry-runs target the overridden service, and backend-header behavior on Flask, native FastAPI, and hook-less fallback paths.
  • ruff format --check clean on changed files.
  • After merge (stage 1 exit gates): observe one full push cycle — gcloud run services describe policyengine-api shows only stage3-prod-* tags at 100% throughout; policyengine-api-staging healthy and passing the staging integration suite; header present on GAE prod, CR staging, and CR prod URLs.

🤖 Generated with Claude Code

Migration PR 4 stage 1. The staging Cloud Run track previously promoted
staging-configured revisions to 100% of the production policyengine-api
service URL on every push; once a load balancer fronts that service this
becomes a public incident per deploy. Staging now deploys and promotes
its own policyengine-api-staging service, with the image name decoupled
from the service name because production reuses the staging-built image.

Every API response now carries X-PolicyEngine-Backend (app_engine or
cloud_run) so traffic-weight changes and rollbacks during the host
cutover can be verified in-band with sampled requests.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@codecov

codecov Bot commented Jul 2, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 79.82%. Comparing base (63bb169) to head (eb9b79f).
⚠️ Report is 14 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #3720      +/-   ##
==========================================
+ Coverage   78.81%   79.82%   +1.00%     
==========================================
  Files          71       70       -1     
  Lines        4320     4336      +16     
  Branches      804      808       +4     
==========================================
+ Hits         3405     3461      +56     
+ Misses        702      655      -47     
- Partials      213      220       +7     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Pin CLOUD_RUN_SERVICE explicitly on the production Cloud Run job and add
workflow invariants: every service-targeting Cloud Run job must pin a
service, and only the production job may promote the production service.
The cloud_run_env.sh default silently targeting production was the
mechanism behind the original staging-promotes-to-prod bug.

Drop the dead try/except around the backend header assignment;
get_api_host_backend() is fail-safe by design, so the except could only
mask real header-assignment bugs, and the ASGI path never had one.

Document the one-time staging service bootstrap (gcloud rejects
--no-traffic when creating a new service, and the deploy service account
cannot set the public invoker binding) in a Stage 1 runbook.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant