feat: Add arc ce slurm CI test #7

Open
aspiringmind-code wants to merge 18 commits into DIRACGrid:main from aspiringmind-code:add-arc-ce-slurm-ci

@aspiringmind-code

Add ARC CE + SLURM integration test (GitHub Actions)

Summary

Adds a self-contained Docker image that runs a NorduGrid ARC7 Compute Element wired to a single-node SLURM batch system, plus a GitHub Actions workflow that builds the image, boots it, and runs a full submit → monitor → retrieve integration test against it using the real ARC client tools (arcsub, arcstat, arcget).

What's included

  • docker/Dockerfile — AlmaLinux 9 image with munge, SLURM (slurmctld/slurmd), ARC7 (nordugrid-arc7-arex/-client/-arcctl), and dbus-broker, running systemd as PID 1 so the packages' own unit files and arcctl work as designed.
  • docker/slurm.conf, docker/cgroup.conf, docker/arc.conf — single-node SLURM cluster and ARC CE config (LRMS backend = SLURM, REST interface on :443).
  • docker/bootstrap.sh + docker/arc-bootstrap.service — one-shot startup: waits for munge/SLURM, mints an ARC Test-CA host cert bound to the container's runtime hostname, starts A-REX, and issues a test client certificate.
  • docker/healthcheck.sh — Docker HEALTHCHECK gating readiness on the bootstrap sequence actually completing.
  • test/job.xrsl, test/run.sh, test/run_integration_test.sh — the integration test itself: submits a job, polls until Finished, retrieves output, asserts its contents.
  • .github/workflows/integration-test.yml — builds the image, runs it --privileged (required for the systemd/cgroup setup), waits for health, runs the test, uploads ARC/SLURM logs as artifacts on every run (pass or fail).
  • docker-compose.yml + README.md — local reproduction of the same flow, and a write-up of the design decisions below.
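
As a rough illustration of the readiness gating described for docker/healthcheck.sh, the sketch below assumes that bootstrap.sh writes a marker file as its very last step. The marker path and the `bootstrap_ready` helper are hypothetical and are not taken from the scripts in this PR.

```shell
#!/usr/bin/env bash
# Hypothetical readiness check: succeed only once bootstrap has finished.
# Assumption: bootstrap.sh creates the marker file as its final step, so the
# file cannot exist while munge, SLURM, or A-REX are still coming up.
MARKER="${ARC_BOOTSTRAP_MARKER:-/run/arc-bootstrap.done}"  # assumed path

bootstrap_ready() {
    # A regular file at the given path means the whole sequence completed.
    [ -f "$1" ]
}

# Intended use inside a healthcheck script (exit 0 = healthy, 1 = not ready):
#   bootstrap_ready "$MARKER" || exit 1
```

Checking a marker written at the end of the sequence, instead of only checking that individual daemons are running, avoids reporting healthy while certificates are still being issued.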

Notable fixes baked into this config (found via iterative debugging)

  • SLURM's EPEL package doesn't create its own system user — created explicitly.
  • munge.key isn't auto-generated in this build environment — generated explicitly with correct ownership.
  • SLURM's NodeName must match the container's actual runtime hostname or slurmd fails immediately with "Unable to determine this slurmd's NodeName".
  • arc.conf needed corrections against ARC7's actual schema: no x509_user_key/x509_user_cert in [common], allowaccess only valid in [arex/ws/jobs] (not [arex/ws]), and [infosys]/[infosys/glue2] are mandatory blocks without which A-REX's info provider fails on every cycle.
  • slurmd needs a working D-Bus connection for its cgroup "extern step" scope (SLURM 21+, independent of ProctrackType/CgroupAutomount) — added dbus-broker.
  • slurm_use_sacct = yes silently stalls job-completion detection with no slurmdbd configured — switched to squeue/scontrol-based scanning.
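
The NodeName fix above can be sketched as rendering slurm.conf from a template at container start, once the runtime hostname is known. The `@NODE@` placeholder, the `render_slurm_conf` helper, and the file paths in the usage comment are assumptions for illustration; the actual mechanism in this PR may differ.

```shell
#!/usr/bin/env bash
# render_slurm_conf TEMPLATE NODE
# Prints TEMPLATE with every @NODE@ placeholder replaced by NODE.
# Both the placeholder and this helper are hypothetical.
render_slurm_conf() {
    local template="$1" node="$2"
    # Hostnames contain only letters, digits, hyphens, and dots, so they are
    # safe to splice into the sed replacement text.
    sed "s/@NODE@/${node}/g" "$template"
}

# Possible use at startup (paths are assumed, not taken from the PR):
#   render_slurm_conf /etc/slurm/slurm.conf.in "$(hostname -s)" \
#       > /etc/slurm/slurm.conf
```

Substituting the same value into both the NodeName line and the partition's Nodes list keeps the two consistent, which is what slurmd needs to identify itself.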

CI platform note

Originally built against GitLab CI, but the available GitLab runner is a locked-down Kubernetes executor (Kyverno registry allowlist, no privileged pods), whereas this setup fundamentally needs privileged containers (systemd as PID 1, --privileged for cgroup/D-Bus access). Moved to GitHub Actions, where hosted runners are plain VMs with Docker preinstalled and unrestricted --privileged support, so no runner or cluster admin changes are required. .gitlab-ci.yml has been removed accordingly.

Testing

Verified locally via docker compose up --build and in GitHub Actions (.github/workflows/integration-test.yml) — full pipeline builds, boots, and passes submit/monitor/retrieve end to end.
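
The monitor step of the pipeline amounts to a bounded polling loop. A minimal sketch of that pattern follows; `poll_until` is a hypothetical helper written for illustration and is not copied from test/run.sh.

```shell
#!/usr/bin/env bash
# poll_until WANT ATTEMPTS DELAY CMD [ARGS...]
# Runs CMD repeatedly until it prints exactly WANT, or ATTEMPTS is exhausted.
# Hypothetical helper; the real test script may be structured differently.
poll_until() {
    local want="$1" attempts="$2" delay="$3"
    shift 3
    local i=1 state=""
    while [ "$i" -le "$attempts" ]; do
        # A failing probe counts as "no state yet" instead of aborting.
        state="$("$@" 2>/dev/null)" || state=""
        if [ "$state" = "$want" ]; then
            return 0
        fi
        sleep "$delay"
        i=$((i + 1))
    done
    echo "timed out waiting for '$want' (last seen: '$state')" >&2
    return 1
}
```

In the integration test, CMD would be a small wrapper that extracts the job state from arcstat for the submitted job; that wrapper is omitted here because its parsing depends on the arcstat output format. Bounding the number of attempts makes a stalled job fail the CI run instead of hanging it.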
