feat: add rubygems ecosystem to bq-dataset-ingest (CM-1296) #4307

Open
themarolt wants to merge 13 commits into main from feat/add-rubygems-ingest-CM-1296
Conversation

themarolt commented Jul 3, 2026

Contributor

Summary

Adds RubyGems as a first-class ecosystem to the bq-dataset-ingest (osspckgs) pipeline, and folds in two operational improvements developed alongside it: resumable package_dependencies ingests and removal of the index drop/recreate step. RubyGems is manifest-sourced (deps.dev Scenario B) — packages/versions/repos/advisories come from the standard *Latest views; dependencies come from RubyGemsRequirementsLatest.

Changes

  • RubyGems ecosystem — added RUBYGEMS to the systems filter, deps SQL (full + incremental via RubyGemsRequirements, RuntimeDependencies only), createVersionsLookup, triggerBootstrap CLI, and the monitor's known-ecosystems list.
  • Resumable deps ingest — new getResumeExport activity + getIngestJobForResume DAL query let a partially-merged package_dependencies job resume by id, reusing its GCS export and chunk boundaries (skips the multi-hour BQ export on retry).
  • Removed index drop/recreate: dropped the manage{PackageDeps,Versions}{Indexes,Constraints} activities and the drop→load→rebuild flow for all job kinds. The full ingest now merges against live indexes via ON CONFLICT DO NOTHING, which makes each merge idempotent. The day-long rebuild was only needed for the original NPM onboarding.
  • Monitor — buffered redraw to eliminate flicker; shows the merging phase step during the merge loop.

Type of change

  • Bug fix
  • New feature
  • Refactor / cleanup
  • Performance improvement
  • Chore / dependency update
  • Documentation

JIRA ticket

https://linuxfoundation.atlassian.net/browse/CM-1296


Note

Medium Risk
Changes affect billion-row package_dependencies/versions load paths (removing index drops may lengthen full merges) and resume logic must match original export settings; RubyGems is additive BQ surface area.

Overview
RubyGems is wired through the osspckgs pipeline: default ecosystem lists, BQ deps SQL (full from RubyGemsRequirementsLatest / incremental from RubyGemsRequirements with runtime deps only), versions lookup validation, CLI help, and the monitor.

Resumable package_dependencies skips a new BQ export when --resume-job <id> is used. getIngestJobForResume / getResumeExport load the prior GCS prefix and file progress. The workflow validates the job status, restores the stored meta:ecosystems and meta:fill (new isFill on bqExportToGcs), and restarts the chunk loop with overlap-safe merges. Bootstrap only allows resume with kinds=[package_dependencies].
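
A minimal sketch of how a resumed run could re-derive the same chunk boundaries and pick a restart point. The names `ResumeState`, `ChunkPlan`, and `planResume` are hypothetical and not taken from the PR, and rounding down to a chunk boundary is an assumption about what "overlap-safe" means here.

```typescript
// Hypothetical shapes for illustration only; they are not the PR's actual types.
interface ResumeState {
  fileNames: string[] // files re-listed from the prior job's GCS prefix
  filesPerChunk: number // assumed to match the value used by the original run
  progressDone: number // files recorded as loaded before the failure
}

interface ChunkPlan {
  totalChunks: number
  startChunkIndex: number
  chunks: string[][]
}

function planResume(state: ResumeState): ChunkPlan {
  if (state.filesPerChunk <= 0) throw new Error('filesPerChunk must be positive')
  // Sort so the chunk boundaries do not depend on listing order.
  const sorted = [...state.fileNames].sort()
  const totalChunks = Math.ceil(sorted.length / state.filesPerChunk)
  const chunks: string[][] = []
  for (let i = 0; i < totalChunks; i++) {
    chunks.push(sorted.slice(i * state.filesPerChunk, (i + 1) * state.filesPerChunk))
  }
  // Round down to a chunk boundary: a partially loaded chunk is replayed in full,
  // which is only safe when the merge step is idempotent.
  const startChunkIndex = Math.min(
    Math.floor(state.progressDone / state.filesPerChunk),
    totalChunks,
  )
  return { totalChunks, startChunkIndex, chunks }
}

const plan = planResume({
  fileNames: ['f-003', 'f-001', 'f-002', 'f-005', 'f-004'],
  filesPerChunk: 2,
  progressDone: 3,
})
console.log(plan.totalChunks, plan.startChunkIndex, plan.chunks[plan.startChunkIndex])
```

With five files, two files per chunk, and three files already loaded, the plan has three chunks and restarts at chunk index 1, replaying `f-003` alongside `f-004`.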

Full ingest no longer drops indexes/FK constraints on versions and package_dependencies — the four manage*Indexes/Constraints activities are removed. Full and incremental both merge into live tables via DISTINCT ON + ON CONFLICT DO NOTHING (fill still uses upsert where version_constraint is NULL). Merge activity timeout for deps is raised to 4h.
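
To illustrate why replaying a chunk is harmless under this merge strategy, here is an in-memory sketch of the DISTINCT ON + ON CONFLICT DO NOTHING semantics. The row shape and the `(root, dep)` key are simplified assumptions, and `distinctOn` / `mergeChunk` are hypothetical helpers rather than code from the PR. In PostgreSQL the surviving row per key is decided by ORDER BY, whereas this sketch keeps the first row seen.

```typescript
// Simplified, illustrative row shape; the real table has more columns.
interface DepRow {
  root: string
  dep: string
  versionConstraint: string | null
}

const keyOf = (r: DepRow): string => `${r.root}\u0000${r.dep}`

// Mirrors DISTINCT ON (root, dep): keep one row per key (here, the first seen).
function distinctOn(rows: DepRow[]): DepRow[] {
  const seen = new Map<string, DepRow>()
  for (const r of rows) {
    if (!seen.has(keyOf(r))) seen.set(keyOf(r), r)
  }
  return Array.from(seen.values())
}

// Mirrors INSERT ... ON CONFLICT DO NOTHING: existing keys are left untouched.
// Returns the number of rows actually inserted.
function mergeChunk(table: Map<string, DepRow>, staging: DepRow[]): number {
  let inserted = 0
  for (const r of distinctOn(staging)) {
    const k = keyOf(r)
    if (!table.has(k)) {
      table.set(k, r)
      inserted++
    }
  }
  return inserted
}

const table = new Map<string, DepRow>()
const staging: DepRow[] = [
  { root: 'a', dep: 'b', versionConstraint: '>= 1' },
  { root: 'a', dep: 'b', versionConstraint: '>= 2' }, // duplicate key within the chunk
  { root: 'a', dep: 'c', versionConstraint: null },
]
const first = mergeChunk(table, staging)
const second = mergeChunk(table, staging) // replaying the same chunk inserts nothing
console.log(first, second, table.size) // prints 2 0 2
```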

The osspckgs monitor buffers frames to reduce flicker, refines merge/total ETAs, and surfaces the merging step during chunk loops.
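
One common way to get flicker-free terminal redraws is to assemble the full frame in memory and emit it in a single write. The sketch below assumes that approach; `renderFrame` and the `Write` sink type are hypothetical, and the PR's actual monitor code may differ.

```typescript
// Hypothetical renderer. `write` stands in for whatever sink the monitor uses
// (under Node this could be process.stdout.write bound to process.stdout).
type Write = (chunk: string) => void

const CURSOR_HOME = '\x1b[H' // ANSI: move the cursor to the top-left corner
const CLEAR_LINE_END = '\x1b[K' // ANSI: erase from the cursor to the end of the line
const CLEAR_BELOW = '\x1b[J' // ANSI: erase from the cursor to the end of the screen

function renderFrame(lines: string[], write: Write): void {
  // Build the whole frame first, then emit it with one write so the terminal
  // never displays a half-drawn screen.
  const body = lines.map((line) => line + CLEAR_LINE_END).join('\n')
  write(CURSOR_HOME + body + '\n' + CLEAR_BELOW)
}

// Usage: capture the output instead of printing it.
const writes: string[] = []
renderFrame(['job 42  step: merging', 'chunk 3/10'], (chunk) => writes.push(chunk))
console.log(writes.length) // prints 1
```

Overwriting in place and clearing only line tails avoids the blank frame that a full-screen clear followed by many small writes can produce.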

Reviewed by Cursor Bugbot for commit ac7eee9.

Copilot AI review requested due to automatic review settings July 3, 2026 15:54
Comment thread services/apps/packages_worker/src/scripts/triggerBootstrap.ts Outdated

Copilot AI left a comment


Pull request overview

This PR extends the osspckgs bq-dataset-ingest pipeline to treat RubyGems as a first-class ecosystem, while also changing how large versions / package_dependencies full loads are merged (no longer dropping/rebuilding indexes & constraints) and adding a resume-by-job-id path for partially merged package_dependencies ingests.

Changes:

  • Add RUBYGEMS to ecosystem/system filters, deps.dev dependency SQL (full + incremental), versions lookup creation, CLI bootstrap trigger, and the monitor’s known ecosystem list.
  • Add resumable package_dependencies ingest support via --resume-job <id> (reuses a prior job’s GCS export and recorded progress to skip re-exporting from BigQuery).
  • Remove the drop → load → rebuild index/FK workflow for versions and package_dependencies, using ON CONFLICT merges against live constraints instead; improve monitor rendering/ETA behavior.
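
The resume restriction described in this PR (resume is only allowed when the requested kinds are exactly package_dependencies) could be checked along these lines. `BootstrapArgs` and `validateResume` are hypothetical names, the job id is assumed to be a string, and the error messages are illustrative, not the CLI's real output.

```typescript
// Hypothetical argument shape; the real CLI in triggerBootstrap.ts may differ.
interface BootstrapArgs {
  kinds: string[]
  resumeJobId?: string // assumed to be passed through as a string
}

// Returns an error message, or null when the arguments are acceptable.
function validateResume(args: BootstrapArgs): string | null {
  if (args.resumeJobId == null) return null // not a resume run, nothing to check
  if (args.resumeJobId.trim() === '') return '--resume-job requires a job id'
  const onlyDeps = args.kinds.length === 1 && args.kinds[0] === 'package_dependencies'
  return onlyDeps ? null : '--resume-job is only supported with kinds=[package_dependencies]'
}

console.log(validateResume({ kinds: ['package_dependencies'], resumeJobId: '123' }))
console.log(validateResume({ kinds: ['versions', 'package_dependencies'], resumeJobId: '123' }))
```

The first call returns null and the second returns the kinds error, because resume only makes sense for the one job kind that records chunked merge progress.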

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| services/libs/data-access-layer/src/osspckgs/ingestJobs.ts | Adds DAL query to fetch resume-relevant ingest job fields (gcsPrefix + progress + merged rows). |
| services/apps/packages_worker/src/scripts/triggerBootstrap.ts | Adds RUBYGEMS to CLI and introduces --resume-job argument validation + propagation. |
| services/apps/packages_worker/src/scripts/monitorOsspckgs.ts | Buffered redraw to reduce flicker; improves status/ETA logic; adds rubygems to known ecosystems. |
| services/apps/packages_worker/src/deps-dev/workflows/ingestVersions.ts | Removes drop/rebuild flow; merges with ON CONFLICT DO NOTHING against live unique index. |
| services/apps/packages_worker/src/deps-dev/workflows/ingestDependencies.ts | Adds resume flow and removes drop/rebuild; increases merge timeout; sets step to merging. |
| services/apps/packages_worker/src/deps-dev/workflows/bootstrapOsspckgs.ts | Threads resumeJobId through bootstrap and skips watermark checks in resume mode. |
| services/apps/packages_worker/src/deps-dev/queries/systems.ts | Adds RUBYGEMS to default and valid system lists. |
| services/apps/packages_worker/src/deps-dev/queries/depsSql.ts | Adds RubyGems full+incremental dependency extraction branches. |
| services/apps/packages_worker/src/deps-dev/activities/manage.ts | Deletes index/constraint management activities (no longer used). |
| services/apps/packages_worker/src/deps-dev/activities/index.ts | Removes exports of deleted manage* activities; exports new getResumeExport. |
| services/apps/packages_worker/src/deps-dev/activities/getResumeExport.ts | Adds activity to fetch resume metadata for a prior ingest job. |
| services/apps/packages_worker/src/deps-dev/activities/createVersionsLookup.ts | Allows rubygems ecosystem in versions lookup creation. |

Comment on lines +106 to +109
// Fill-constraints variant: UNIQUE constraint stays in place, so ON CONFLICT is valid. Upserts
// version_constraint only for rows where it is currently NULL — safe to run against a table already
// populated by --deps-table-b (which sets version_constraint = NULL for all rows). DISTINCT ON
// resolves duplicate (root, dep) pairs from BQ (same root/dep, different to_version) before the upsert.
Comment on lines +42 to 44
// Merge against the live UNIQUE index — ON CONFLICT DO NOTHING makes every chunk idempotent, so the
// table's indexes/keys/constraints are never dropped. Both full and incremental use this path.
const MERGE_SQL = `
Comment on lines 23 to 26
const { mergeStagingToTable } = proxyActivities<typeof depsDevActivities>({
  startToCloseTimeout: '1 hour',
  retry: { maximumAttempts: 1 },
})
Comment on lines +125 to +130
const totalChunks = Math.ceil(fileNames.length / filesPerChunk)
let priorRowsAffected = 0
let priorStagingRows = 0
const priorTableRowCounts: Record<string, number> = {}

for (let chunkIndex = 0; chunkIndex < totalChunks; chunkIndex++) {
Comment on lines +79 to +82
// Resume mode reuses a prior job's export, so there is no fresh BQ export to validate. Skip the
// incremental watermark/partition checks below — the resumed partition may not match `today`.
const resume = opts.resumeJobId != null

Signed-off-by: Uroš Marolt <uros@marolt.me>

cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Reviewed by Cursor Bugbot for commit dfafcbe.

Signed-off-by: Uroš Marolt <uros@marolt.me>
Copilot AI review requested due to automatic review settings July 3, 2026 17:22

Copilot AI left a comment

Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated 2 comments.

Comment on lines +89 to +95
// Returns an existing exported job for the given GCS prefix so callers can
// skip re-running the BQ export when retrying a failed workflow.
// Loads the fields needed to resume a partially-merged chunked job by id: the exact GCS export
// path (so the same parquet files — and therefore the same chunk boundaries — are re-listed) plus
// the file-level load progress and rows-merged-so-far. progressDone/progressTotal come from the
// table_row_counts JSONB written by updateLoadingProgress. Returns null if the job id is unknown.
export async function getIngestJobForResume(
Comment on lines +25 to +26
  // NuGet groups deps by TargetFramework — flatten all groups, dedup handled downstream
- // by DISTINCT ON in MERGE_SQL_FULL and ON CONFLICT in MERGE_SQL.
+ // by ON CONFLICT in MERGE_SQL (and DISTINCT ON in the fill-constraints variant).
2 participants