feat: add rubygems ecosystem to bq-dataset-ingest (CM-1296) #4307
themarolt wants to merge 13 commits into
Signed-off-by: Uroš Marolt <uros@marolt.me>
Pull request overview
This PR extends the osspckgs bq-dataset-ingest pipeline to treat RubyGems as a first-class ecosystem, while also changing how large versions / package_dependencies full loads are merged (no longer dropping/rebuilding indexes & constraints) and adding a resume-by-job-id path for partially merged package_dependencies ingests.
Changes:
- Add RUBYGEMS to ecosystem/system filters, deps.dev dependency SQL (full + incremental), versions lookup creation, CLI bootstrap trigger, and the monitor’s known ecosystem list.
- Add resumable `package_dependencies` ingest support via `--resume-job <id>` (reuses a prior job’s GCS export and recorded progress to skip re-exporting from BigQuery).
- Remove the drop → load → rebuild index/FK workflow for `versions` and `package_dependencies`, using `ON CONFLICT` merges against live constraints instead; improve monitor rendering/ETA behavior.
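
To illustrate why `ON CONFLICT DO NOTHING` lets a chunk be merged more than once without harm, here is a minimal in-memory sketch. It is not the PR's SQL: the `DepRow` shape, the `(root, dep)` key, and the `mergeChunk` helper are assumptions made for this example only.

```typescript
// Hypothetical in-memory stand-in for a table with a UNIQUE key on (root, dep).
type DepRow = { root: string; dep: string; versionConstraint: string | null }

// Build a collision-free key for the assumed unique pair.
const keyOf = (r: DepRow): string => JSON.stringify([r.root, r.dep])

// Models INSERT ... ON CONFLICT DO NOTHING: rows whose key already exists are skipped.
// Returns the number of rows actually inserted.
function mergeChunk(table: Map<string, DepRow>, chunk: DepRow[]): number {
  let inserted = 0
  for (const row of chunk) {
    const key = keyOf(row)
    if (!table.has(key)) {
      table.set(key, row)
      inserted++
    }
  }
  return inserted
}

const depsTable = new Map<string, DepRow>()
const sampleChunk: DepRow[] = [
  { root: 'rails', dep: 'activesupport', versionConstraint: '>= 7.0' },
  { root: 'rails', dep: 'rack', versionConstraint: '~> 2.2' },
]

// Merging the same chunk twice leaves the table unchanged after the first run.
const firstRunInserted = mergeChunk(depsTable, sampleChunk)
const secondRunInserted = mergeChunk(depsTable, sampleChunk)
```

The second call inserts nothing, which is the property that makes a retried or resumed chunk safe against a live unique index.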
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| services/libs/data-access-layer/src/osspckgs/ingestJobs.ts | Adds DAL query to fetch resume-relevant ingest job fields (gcsPrefix + progress + merged rows). |
| services/apps/packages_worker/src/scripts/triggerBootstrap.ts | Adds RUBYGEMS to CLI and introduces --resume-job argument validation + propagation. |
| services/apps/packages_worker/src/scripts/monitorOsspckgs.ts | Buffered redraw to reduce flicker; improves status/ETA logic; adds rubygems to known ecosystems. |
| services/apps/packages_worker/src/deps-dev/workflows/ingestVersions.ts | Removes drop/rebuild flow; merges with ON CONFLICT DO NOTHING against live unique index. |
| services/apps/packages_worker/src/deps-dev/workflows/ingestDependencies.ts | Adds resume flow and removes drop/rebuild; increases merge timeout; sets step to merging. |
| services/apps/packages_worker/src/deps-dev/workflows/bootstrapOsspckgs.ts | Threads resumeJobId through bootstrap and skips watermark checks in resume mode. |
| services/apps/packages_worker/src/deps-dev/queries/systems.ts | Adds RUBYGEMS to default and valid system lists. |
| services/apps/packages_worker/src/deps-dev/queries/depsSql.ts | Adds RubyGems full+incremental dependency extraction branches. |
| services/apps/packages_worker/src/deps-dev/activities/manage.ts | Deletes index/constraint management activities (no longer used). |
| services/apps/packages_worker/src/deps-dev/activities/index.ts | Removes exports of deleted manage* activities; exports new getResumeExport. |
| services/apps/packages_worker/src/deps-dev/activities/getResumeExport.ts | Adds activity to fetch resume metadata for a prior ingest job. |
| services/apps/packages_worker/src/deps-dev/activities/createVersionsLookup.ts | Allows rubygems ecosystem in versions lookup creation. |
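
The `--resume-job` validation mentioned for `triggerBootstrap.ts` could look roughly like the sketch below. The `BootstrapArgs` shape, the `validateResumeArgs` name, and the error strings are hypothetical. The only rule taken from this PR is that resume is allowed solely with `kinds=[package_dependencies]`.

```typescript
// Hypothetical argument shape; the real CLI's parsed options are not shown on this page.
type BootstrapArgs = { kinds: string[]; resumeJobId?: string }

// Returns an error message, or null when the combination is acceptable.
function validateResumeArgs(args: BootstrapArgs): string | null {
  if (args.resumeJobId === undefined) return null // not a resume run
  if (args.resumeJobId.trim() === '') return '--resume-job requires a job id'
  const onlyDeps = args.kinds.length === 1 && args.kinds[0] === 'package_dependencies'
  return onlyDeps ? null : '--resume-job is only valid with kinds=[package_dependencies]'
}

const okResume = validateResumeArgs({ kinds: ['package_dependencies'], resumeJobId: '42' })
const badResume = validateResumeArgs({
  kinds: ['versions', 'package_dependencies'],
  resumeJobId: '42',
})
const noResume = validateResumeArgs({ kinds: ['versions'] })
```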
// Fill-constraints variant: UNIQUE constraint stays in place, so ON CONFLICT is valid. Upserts
// version_constraint only for rows where it is currently NULL - safe to run against a table already
// populated by --deps-table-b (which sets version_constraint = NULL for all rows). DISTINCT ON
// resolves duplicate (root, dep) pairs from BQ (same root/dep, different to_version) before the upsert.
// Merge against the live UNIQUE index - ON CONFLICT DO NOTHING makes every chunk idempotent, so the
// table's indexes/keys/constraints are never dropped. Both full and incremental use this path.
const MERGE_SQL = `

const { mergeStagingToTable } = proxyActivities<typeof depsDevActivities>({
  startToCloseTimeout: '1 hour',
  retry: { maximumAttempts: 1 },
})

const totalChunks = Math.ceil(fileNames.length / filesPerChunk)
let priorRowsAffected = 0
let priorStagingRows = 0
const priorTableRowCounts: Record<string, number> = {}

for (let chunkIndex = 0; chunkIndex < totalChunks; chunkIndex++) {
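
As a rough sketch of how a resumed run might pick its starting chunk from file-level progress, consider the following. The `planResume` and `filesForChunk` helpers and the example file names are assumptions, not the workflow's actual code. The starting chunk is rounded down, so some already-merged files may be merged again, which is tolerable only because the merge is idempotent.

```typescript
type ResumePlan = { totalChunks: number; startChunk: number }

// Derive the chunk to restart from, given how many files were already loaded.
function planResume(fileCount: number, filesPerChunk: number, filesDone: number): ResumePlan {
  if (filesPerChunk <= 0) throw new Error('filesPerChunk must be positive')
  const totalChunks = Math.ceil(fileCount / filesPerChunk)
  // Round down: restart at the chunk that contains the first unfinished file.
  const startChunk = Math.min(Math.floor(filesDone / filesPerChunk), totalChunks)
  return { totalChunks, startChunk }
}

// Same file list + same chunk size => same chunk boundaries as the original run.
function filesForChunk(fileNames: string[], filesPerChunk: number, chunkIndex: number): string[] {
  const start = chunkIndex * filesPerChunk
  return fileNames.slice(start, start + filesPerChunk)
}

// Hypothetical export of 10 files, 4 per chunk, 6 files recorded as done.
const exampleFiles = Array.from({ length: 10 }, (_, i) => `part-${i}.parquet`)
const resumePlan = planResume(exampleFiles.length, 4, 6)
const resumedChunk = filesForChunk(exampleFiles, 4, resumePlan.startChunk)
```

In this example the run restarts at chunk 1 (files 4 to 7), re-merging files 4 and 5 that were already loaded.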
// Resume mode reuses a prior job's export, so there is no fresh BQ export to validate. Skip the
// incremental watermark/partition checks below - the resumed partition may not match `today`.
const resume = opts.resumeJobId != null

Cursor Bugbot has reviewed your changes and found 2 potential issues.
Reviewed by Cursor Bugbot for commit dfafcbe.
// Returns an existing exported job for the given GCS prefix so callers can
// skip re-running the BQ export when retrying a failed workflow.
// Loads the fields needed to resume a partially-merged chunked job by id: the exact GCS export
// path (so the same parquet files - and therefore the same chunk boundaries - are re-listed) plus
// the file-level load progress and rows-merged-so-far. progressDone/progressTotal come from the
// table_row_counts JSONB written by updateLoadingProgress. Returns null if the job id is unknown.
export async function getIngestJobForResume(
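
The function body is not shown on this page. As a hedged sketch of the mapping the comment describes, the code below turns a raw job row into the resume fields and returns null for an unknown id. The row shape, the column names (`gcs_prefix`, `rows_merged`), and the JSONB keys are assumptions for illustration only.

```typescript
// Hypothetical row shape; the actual columns are not visible in this PR page.
type IngestJobRow = {
  gcs_prefix: string
  rows_merged: number | null
  table_row_counts: Record<string, unknown> | null
}

type ResumeInfo = {
  gcsPrefix: string
  progressDone: number
  progressTotal: number
  rowsMerged: number
}

// Treat anything that is not a finite number as "no progress recorded".
const asCount = (v: unknown): number =>
  typeof v === 'number' && Number.isFinite(v) ? v : 0

// Maps a fetched row (or undefined when the id matched nothing) to resume metadata.
function toResumeInfo(row: IngestJobRow | undefined): ResumeInfo | null {
  if (row === undefined) return null
  const counts = row.table_row_counts ?? {}
  return {
    gcsPrefix: row.gcs_prefix,
    // Assumed JSONB keys, named after the fields mentioned in the comment above.
    progressDone: asCount(counts['progressDone']),
    progressTotal: asCount(counts['progressTotal']),
    rowsMerged: row.rows_merged ?? 0,
  }
}

const knownJob = toResumeInfo({
  gcs_prefix: 'gs://example-bucket/exports/job-42/',
  rows_merged: 1200,
  table_row_counts: { progressDone: 6, progressTotal: 10 },
})
const unknownJob = toResumeInfo(undefined)
```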
// NuGet groups deps by TargetFramework - flatten all groups, dedup handled downstream
// by DISTINCT ON in MERGE_SQL_FULL and ON CONFLICT in MERGE_SQL.
// by ON CONFLICT in MERGE_SQL (and DISTINCT ON in the fill-constraints variant).
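
The flatten-then-dedup idea in this comment can be sketched in memory as follows. This is not the pipeline's SQL: the `DependencyGroup` shape and both helpers are hypothetical, and keeping the first row seen per `(root, dep)` pair only approximates what a database-side dedup would do.

```typescript
// Hypothetical shape for NuGet-style dependency groups keyed by target framework.
type DependencyGroup = {
  targetFramework: string
  dependencies: { name: string; requirement: string }[]
}
type FlatDep = { root: string; dep: string; requirement: string }

// Flatten every framework group into plain (root, dep) rows; duplicates are expected.
function flattenGroups(root: string, groups: DependencyGroup[]): FlatDep[] {
  const rows: FlatDep[] = []
  for (const group of groups) {
    for (const d of group.dependencies) {
      rows.push({ root, dep: d.name, requirement: d.requirement })
    }
  }
  return rows
}

// Keep the first row seen for each (root, dep) pair.
function dedupeByPair(rows: FlatDep[]): FlatDep[] {
  const seen = new Map<string, FlatDep>()
  for (const row of rows) {
    const key = JSON.stringify([row.root, row.dep])
    if (!seen.has(key)) seen.set(key, row)
  }
  return Array.from(seen.values())
}

const flatRows = flattenGroups('Example.Package', [
  { targetFramework: 'net6.0', dependencies: [{ name: 'Newtonsoft.Json', requirement: '>= 13.0.1' }] },
  {
    targetFramework: 'netstandard2.0',
    dependencies: [
      { name: 'Newtonsoft.Json', requirement: '>= 13.0.1' },
      { name: 'System.Memory', requirement: '>= 4.5.0' },
    ],
  },
])
const uniqueRows = dedupeByPair(flatRows)
```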

Summary
Adds RubyGems as a first-class ecosystem to the `bq-dataset-ingest` (osspckgs) pipeline, and folds in two operational improvements developed alongside it: resumable `package_dependencies` ingests and removal of the index drop/recreate step. RubyGems is manifest-sourced (deps.dev Scenario B): packages/versions/repos/advisories come from the standard `*Latest` views; dependencies come from `RubyGemsRequirementsLatest`.
Changes
- Adds `RUBYGEMS` to the systems filter, deps SQL (full + incremental via `RubyGemsRequirements`, RuntimeDependencies only), `createVersionsLookup`, the `triggerBootstrap` CLI, and the monitor's known-ecosystems list.
- The `getResumeExport` activity + `getIngestJobForResume` DAL query let a partially-merged `package_dependencies` job resume by id, reusing its GCS export and chunk boundaries (skips the multi-hour BQ export on retry).
- Removes the `manage{PackageDeps,Versions}{Indexes,Constraints}` activities and the drop→load→rebuild flow for all job kinds. `full` now merges against live indexes via `ON CONFLICT DO NOTHING` (idempotent). The day-long rebuild was only needed for the original NPM onboarding.
- Sets a `merging` phase step during the merge loop.
Type of change
JIRA ticket
https://linuxfoundation.atlassian.net/browse/CM-1296
Note
Medium Risk
Changes affect billion-row `package_dependencies`/`versions` load paths (removing index drops may lengthen full merges) and resume logic must match original export settings; RubyGems is additive BQ surface area.
Overview
RubyGems is wired through the osspckgs pipeline: default ecosystem lists, BQ deps SQL (full from `RubyGemsRequirementsLatest` / incremental from `RubyGemsRequirements` with runtime deps only), versions lookup validation, CLI help, and the monitor.
Resumable `package_dependencies` skips a new BQ export when `--resume-job <id>` is used: `getIngestJobForResume`/`getResumeExport` load the prior GCS prefix and file progress; the workflow validates status, restores stored `meta:ecosystems` and `meta:fill` (new `isFill` on `bqExportToGcs`), restarts the chunk loop with overlap-safe merges, and bootstrap only allows resume with `kinds=[package_dependencies]`.
Full ingest no longer drops indexes/FK constraints on `versions` and `package_dependencies`; the four `manage*Indexes`/`Constraints` activities are removed. Full and incremental both merge into live tables via `DISTINCT ON` + `ON CONFLICT DO NOTHING` (fill still uses upsert where `version_constraint` is NULL). Merge activity timeout for deps is raised to 4h.
The osspckgs monitor buffers frames to reduce flicker, refines merge/total ETAs, and surfaces the `merging` step during chunk loops.
Reviewed by Cursor Bugbot for commit ac7eee9.
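
The monitor's ETA logic is not shown on this page. As a minimal sketch of one common approach, the hypothetical `etaSeconds` helper below extrapolates linearly from the average time per completed unit and returns null when there is nothing to extrapolate from. The monitor's real formula may differ.

```typescript
// Linear ETA: average seconds per completed unit, times units remaining.
// Returns null when no progress has been made yet (nothing to extrapolate from).
function etaSeconds(done: number, total: number, elapsedSeconds: number): number | null {
  if (done <= 0 || total <= 0 || elapsedSeconds <= 0) return null
  if (done >= total) return 0
  const secondsPerUnit = elapsedSeconds / done
  return Math.round(secondsPerUnit * (total - done))
}

// 30 of 100 chunks merged after 600 s: 20 s per chunk, 70 chunks left.
const midRunEta = etaSeconds(30, 100, 600)
const notStartedEta = etaSeconds(0, 100, 0)
const finishedEta = etaSeconds(100, 100, 2000)
```

Returning null rather than zero or infinity before the first unit completes lets a display layer show a placeholder instead of a misleading number.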