Allow for compiler+accelerator specific MPI overrides #231

Open
ocaisa wants to merge 6 commits into EESSI:main from ocaisa:additional_rpath_fallbacks
@ocaisa ocaisa commented May 14, 2026

Member

Alternative to #230 where we focus only on the potential need for CUDA/ROCm variants.

This also opens the door to other types of variants (but the options here would be multiplicative so I haven't included that until we hit a need for it).

@ocaisa

ocaisa commented May 14, 2026

Member Author

Example of the output

# Set things up
ocaisa@~/EESSI/software-layer-scripts(additional_rpath_fallbacks)$ export EESSI_ACCELERATOR_TARGET_OVERRIDE=accel/nvidia/cc86
ocaisa@~/EESSI/software-layer-scripts(additional_rpath_fallbacks)$ module load EESSI/2025.06
Module for EESSI/2025.06 loaded successfully
{EESSI/2025.06} ocaisa@~/EESSI/software-layer-scripts(additional_rpath_fallbacks)$ echo $MODULEPATH
/cvmfs/software.eessi.io/host_injections/2025.06/software/linux/aarch64/neoverse_n1/accel/nvidia/cc80/modules/all:/cvmfs/software.eessi.io/host_injections/2025.06/software/linux/aarch64/neoverse_n1/modules/all:/cvmfs/software.eessi.io/versions/2025.06/software/linux/aarch64/neoverse_n1/accel/nvidia/cc80/modules/all:/cvmfs/software.eessi.io/versions/2025.06/software/linux/aarch64/neoverse_n1/modules/all:/cvmfs/software.eessi.io/init/modules
{EESSI/2025.06} ocaisa@~/EESSI/software-layer-scripts(additional_rpath_fallbacks)$ module load EESSI-extend
-- Using /tmp/$USER as a temporary working directory for installations, you can override this by setting the environment variable WORKING_DIR and reloading the module (e.g., /dev/shm is a common option)
Configuring for use of EESSI_USER_INSTALL under /home/ocaisa/eessi
-- To create installations for EESSI, you _must_ have write permissions to /home/ocaisa/eessi/versions/2025.06/software/linux/aarch64/neoverse_n1
-- You may wish to configure a sources directory for EasyBuild (for example, via setting the environment variable EASYBUILD_SOURCEPATH) to allow you to reuse existing sources for packages.

# Pretend to want to do a build
{EESSI/2025.06} ocaisa@~/EESSI/software-layer-scripts(additional_rpath_fallbacks)$ eb OSU-Micro-Benchmarks-7.5.1-gompi-2025b-CUDA-12.9.1.eb --stop prepare --rebuild --hooks=./eb_hooks.py
== Temporary log file in case of crash /tmp/eb-uflhewm6/easybuild-0ha7tv9j.log
== found valid index for /cvmfs/software.eessi.io/versions/2025.06/software/linux/aarch64/neoverse_n1/software/EasyBuild/5.3.0/easybuild/easyconfigs, so using it...
== Running parse hook for OSU-Micro-Benchmarks-7.5.1-gompi-2025b-CUDA-12.9.1.eb...
== found valid index for /cvmfs/software.eessi.io/versions/2025.06/software/linux/aarch64/neoverse_n1/software/EasyBuild/5.3.0/easybuild/easyconfigs, so using it...
== Running parse hook for gompi-2025b.eb...
...
== Running parse hook for lfbf-2025b.eb...
== processing EasyBuild easyconfig
/cvmfs/software.eessi.io/versions/2025.06/software/linux/aarch64/neoverse_n1/software/EasyBuild/5.3.0/easybuild/easyconfigs/o/OSU-Micro-Benchmarks/OSU-Micro-Benchmarks-7.5.1-gompi-2025b-CUDA-12.9.1.eb
== building and installing OSU-Micro-Benchmarks/7.5.1-gompi-2025b-CUDA-12.9.1...
  >> installation prefix: /home/ocaisa/eessi/versions/2025.06/software/linux/aarch64/neoverse_n1/software/OSU-Micro-Benchmarks/7.5.1-gompi-2025b-CUDA-12.9.1
== fetching files and verifying checksums...
== Running pre-fetch hook...
  >> sources:
  >> /tmp/ocaisa/easybuild/sources/o/OSU-Micro-Benchmarks/osu-micro-benchmarks-7.5.1.tar.gz [SHA256: 160d0d5e3c3cb022520ecb247e9875bb0973b1d3cadccd6c17624f8407c52e22]
== ... (took < 1 sec)
== creating build dir, resetting environment...
  >> build dir: /tmp/ocaisa/easybuild/build/OSUMicroBenchmarks/7.5.1/gompi-2025b-CUDA-12.9.1
== Running post-ready hook...

WARNING: Deprecated functionality, will no longer work in EasyBuild v6.0: Easyconfig parameter 'parallel' is deprecated, use 'max_parallel' or the parallel property instead.; see
https://docs.easybuild.io/deprecated-functionality/ for more information

== ... (took < 1 sec)
== unpacking...
  >> running shell command:
        tar xzf /tmp/ocaisa/easybuild/sources/o/OSU-Micro-Benchmarks/osu-micro-benchmarks-7.5.1.tar.gz
        [started at: 2026-05-14 16:03:37]
        [working dir: /tmp/ocaisa/easybuild/build/OSUMicroBenchmarks/7.5.1/gompi-2025b-CUDA-12.9.1]
        [output and state saved to /tmp/eb-uflhewm6/run-shell-cmd-output/tar-gfx7xw93]
  >> command completed: exit 0, ran in < 1s
== ... (took < 1 sec)
== patching...
== ... (took < 1 sec)
== preparing...
== Running pre-prepare hook...
== Updated rpath_override_dirs (to allow overriding MPI family OpenMPI):
/cvmfs/software.eessi.io/host_injections/2025.06/software/linux/aarch64/neoverse_n1/rpath_overrides/OpenMPI/system-CUDA-12.9.1/lib:/cvmfs/software.eessi.io/host_injections/2025.06/software/linux/aarch64/neoverse_n1/rpath_overrides/OpenMPI/system-CUDA-12.9.1/lib64:/cvmfs/software.eessi.io/host_injections/2025.06/software/linux/aarch64/neoverse_n1/rpath_overrides/OpenMPI/system/lib:/cvmfs/software.eessi.io/host_injections/2025.06/software/linux/aarch64/neoverse_n1/rpath_overrides/OpenMPI/system/lib64
  >> loading toolchain module: gompi/2025b
== ... (took < 1 sec)
...

@ocaisa ocaisa changed the title from "Allow for accelerator-specific MPI overrides" to "Allow for compiler+accelerator specific MPI overrides" May 14, 2026
@ocaisa

ocaisa commented May 14, 2026

Member Author

Increased the complexity a bit but it might be necessary:

== Updated rpath_override_dirs (to allow overriding MPI family OpenMPI):
/cvmfs/software.eessi.io/host_injections/2025.06/software/linux/aarch64/neoverse_n1/rpath_overrides/OpenMPI/system-GCC-CUDA-12.9.1/lib:/cvmfs/software.eessi.io/host_injections/2025.06/software/linux/aarch64/neoverse_n1/rpath_overrides/OpenMPI/system-GCC-CUDA-12.9.1/lib64:/cvmfs/software.eessi.io/host_injections/2025.06/software/linux/aarch64/neoverse_n1/rpath_overrides/OpenMPI/system-GCC/lib:/cvmfs/software.eessi.io/host_injections/2025.06/software/linux/aarch64/neoverse_n1/rpath_overrides/OpenMPI/system-GCC/lib64:/cvmfs/software.eessi.io/host_injections/2025.06/software/linux/aarch64/neoverse_n1/rpath_overrides/OpenMPI/system/lib:/cvmfs/software.eessi.io/host_injections/2025.06/software/linux/aarch64/neoverse_n1/rpath_overrides/OpenMPI/system/lib64
  >> loading toolchain module: gompi/2025b
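
The ordering in the output above (compiler+accelerator first, then compiler only, then the plain system fallback, each with lib and lib64) can be sketched roughly as follows. This is a minimal illustration, not the code from eb_hooks.py: the function name `rpath_override_dirs` and its parameters are hypothetical, and only the directory layout is taken from the logged paths.

```python
import os


def rpath_override_dirs(base, mpi_family, compiler_family=None, gpu_dep=None):
    """Return candidate override directories, most specific first.

    Hypothetical helper: the name and signature are assumptions made for
    illustration; only the resulting path layout mirrors the PR output.
    """
    stubs = []
    if compiler_family and gpu_dep:
        name, version = gpu_dep
        # Most specific: compiler family plus accelerator name and version
        stubs.append(f"system-{compiler_family}-{name}-{version}")
    if compiler_family:
        # Next: compiler family only
        stubs.append(f"system-{compiler_family}")
    # Least specific: generic system override
    stubs.append("system")

    dirs = []
    for stub in stubs:
        for libdir in ("lib", "lib64"):
            dirs.append(os.path.join(base, "rpath_overrides", mpi_family, stub, libdir))
    return dirs


base = "/cvmfs/software.eessi.io/host_injections/2025.06/software/linux/aarch64/neoverse_n1"
dirs = rpath_override_dirs(base, "OpenMPI", "GCC", ("CUDA", "12.9.1"))
print(":".join(dirs))
```

Because the dynamic linker searches RPATH entries in order, putting the most specific directory first means a CUDA-aware override wins over a generic one whenever both exist.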

Comment thread eb_hooks.py
@laraPPr

laraPPr commented Jun 9, 2026

Collaborator

@ocaisa #230 is dependent on this one and also 2026. Can you let us know if you're still looking at this or if you need input?

@ocaisa

ocaisa commented Jun 9, 2026

Member Author

No feedback to date; I'm waiting on someone to review it.

@laraPPr

laraPPr commented Jun 9, 2026

Collaborator

@TopRichard can you review it?

@casparvl

Contributor

Discussed in support meeting: @TopRichard said he already tested this for CUDA and that it works. We agreed he'll add a review here, including the steps taken by him to test it. I can then try to mimic that for ROCm and validate that it also works there.

@TopRichard

Collaborator

I have tested this locally by integrating the changes introduced in this PR into test_eb_hooks.py and running EasyBuild with the --hooks=test_eb_hooks.py option. Below is a sample result for CUDA-enabled software, from readelf -d /cluster/installations/eessi/default/eessi_local/aarch64-2025.06/software/PyTorch/2.9.1-foss-2025b-CUDA-12.9.1/lib/python3.13/site-packages/torch/lib/libtorch.so | less:

Dynamic section at offset 0x233b0 contains 25 entries:
  Tag        Type                         Name/Value
 0x0000000000000001 (NEEDED)             Shared library: [/cvmfs/software.eessi.io/versions/2025.06/software/linux/aarch64/nvidia/grace/accel/nvidia/cc90/software/CUDA/12.9.1/lib64/libnvrtc.so.12]
 0x0000000000000001 (NEEDED)             Shared library: [libtorch_cpu.so]
 0x0000000000000001 (NEEDED)             Shared library: [libtorch_cuda.so]
 0x000000000000000e (SONAME)             Library soname: [libtorch.so]
 0x000000000000000f (RPATH)              Library rpath: [/cvmfs/software.eessi.io/host_injections/2025.06/software/linux/aarch64/nvidia/grace/rpath_overrides/OpenMPI/system-GCC-CUDA-12.9.1/lib:/cvmfs/software.eessi.io/host_injections/2025.06/software/linux/aarch64/nvidia/grace/rpath_overrides/OpenMPI/system-GCC-CUDA-12.9.1/lib64:/cvmfs/software.eessi.io/host_injections/2025.06/software/linux/aarch64/nvidia/grace/rpath_overrides/OpenMPI/system-GCC/lib:/cvmfs/software.eessi.io/host_injections/2025.06/software/linux/aarch64/nvidia/grace/rpath_overrides/OpenMPI/system-GCC/lib64:/cvmfs/software.eessi.io/host_injections/2025.06/software/linux/aarch64/nvidia/grace/rpath_overrides/OpenMPI/system/lib:/cvmfs/software.eessi.io/host_injections/2025.06/software/linux/aarch64/nvidia/grace/rpath_overrides/OpenMPI/system/lib64:/cluster/installations/eessi/default/eessi_local/aarch64-2025.06/software/PyTorch/2.9.1-foss-2025b-CUDA-12.9.1/lib:/cluster/installations/eessi/default/eessi_local/aarch64-2025.06/software/PyTorch/2.9.1-foss-2025b-CUDA-12.9.1/lib64:$ORIGIN:$ORIGIN/../lib:$ORIGIN/../lib64:/cvmfs/software.eessi.io/versions/2025.06/software/linux/aarch64/nvidia/grace/software/ScaLAPACK/2.2.2-gompi-2025b-fb/lib64...
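
A check like the one above can be automated. The sketch below parses the RPATH (or RUNPATH) line of `readelf -d` output and verifies that every `rpath_overrides` entry precedes all other entries. The helper names and the shortened sample paths are hypothetical and only for illustration; the real output is much longer.

```python
import re


def rpath_entries(readelf_output):
    """Extract the colon-separated RPATH/RUNPATH entries from `readelf -d` text."""
    match = re.search(
        r"\((?:RPATH|RUNPATH)\)\s+Library (?:rpath|runpath): \[(.*)\]",
        readelf_output,
    )
    return match.group(1).split(":") if match else []


def overrides_come_first(entries, marker="/rpath_overrides/"):
    """True if at least one override entry exists and none follows a regular entry."""
    flags = [marker in entry for entry in entries]
    first_other = flags.index(False) if False in flags else len(flags)
    return any(flags) and not any(flags[first_other:])


# Shortened, made-up sample in the same format as the readelf output above
sample = (
    " 0x000000000000000f (RPATH)              Library rpath: "
    "[/host_injections/rpath_overrides/OpenMPI/system-GCC-CUDA-12.9.1/lib:"
    "/host_injections/rpath_overrides/OpenMPI/system/lib:"
    "/opt/example/lib:$ORIGIN]"
)
entries = rpath_entries(sample)
print(overrides_come_first(entries))
```

In practice the input string would come from running `readelf -d` on the library in question, for example via `subprocess.run(..., capture_output=True, text=True)`.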

@casparvl

casparvl commented Jun 24, 2026

Contributor

For clarity: this is currently blocked by #228 for the ROCm side of things.

Comment thread eb_hooks.py Outdated
if dep[0] in top_level_accelerator_packages:
    # Store the dependency as a property for later potential use
    # (e.g., accelerator-specific MPI RPATH overrides)
    ec.eessi_gpu_dependency = dep

Contributor

Just as a reminder: this will need to be done for the ROCm side of things after #228 gets merged as well (you'll need to merge main into this feature branch, resolve any potential conflicts because they both touch this same part of the code, then add some ec.eessi_gpu_dependency = ... to the ROCm side of things).
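
As a rough sketch of what recording the accelerator dependency could look like once both the CUDA and ROCm sides are handled in one place: the names `TOP_LEVEL_ACCELERATOR_PACKAGES` and `record_gpu_dependency`, the package names in the tuple, and the `SimpleNamespace` stand-in for the easyconfig object are all assumptions for illustration; the actual list and hook structure live in eb_hooks.py and may differ.

```python
from types import SimpleNamespace

# Assumed package names; the real list in eb_hooks.py may differ.
TOP_LEVEL_ACCELERATOR_PACKAGES = ("CUDA", "ROCM")


def record_gpu_dependency(ec, dependencies):
    """Store the first accelerator dependency on the easyconfig-like object.

    `dependencies` is a list of (name, version) tuples. Returns the stored
    dependency, or None if no accelerator package was found.
    """
    for dep in dependencies:
        if dep[0] in TOP_LEVEL_ACCELERATOR_PACKAGES:
            # Keep the dependency for later use, e.g. when building
            # accelerator-specific MPI RPATH override paths
            ec.eessi_gpu_dependency = dep
            return dep
    return None


# Stand-in object; in a real hook this would be the parsed easyconfig
ec = SimpleNamespace()
record_gpu_dependency(ec, [("SomeLib", "1.0"), ("ROCM", "6.4.1")])
print(ec.eessi_gpu_dependency)
```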

@zerefwayne

Contributor

#228 is merged.

@casparvl casparvl left a comment

Contributor

I'm getting:

== Updated rpath_override_dirs (to allow overriding MPI family OpenMPI):
/cvmfs/software.eessi.io/host_injections/2025.06/software/linux/x86_64/amd/zen2/rpath_overrides/OpenMPI/system-ROCm-ROCM-6.4.1/lib:/cvmfs/software.eessi.io/host_injections/2025.06/software/linux/x86_64/amd/zen2/rpath_overrides/OpenMPI/system-ROCm-ROCM-6.4.1/lib64:/cvmfs/software.eessi.io/host_injections/2025.06/software/linux/x86_64/amd/zen2/rpath_overrides/OpenMPI/system-ROCm/lib:/cvmfs/software.eessi.io/host_injections/2025.06/software/linux/x86_64/amd/zen2/rpath_overrides/OpenMPI/system-ROCm/lib64:/cvmfs/software.eessi.io/host_injections/2025.06/software/linux/x86_64/amd/zen2/rpath_overrides/OpenMPI/system/lib:/cvmfs/software.eessi.io/host_injections/2025.06/software/linux/x86_64/amd/zen2/rpath_overrides/OpenMPI/system/lib64

I have the feeling that the system-ROCm-ROCM-6.4.1 should really be system-GCC-ROCM-6.4.1.

Other than that, it seems that the most specific path (i.e. including the ROCm version) comes first, which is good.

For CUDA, I do see:

/cvmfs/software.eessi.io/host_injections/2025.06/software/linux/x86_64/amd/zen2/rpath_overrides/OpenMPI/system-GCC-CUDA-12.9.1/lib:/cvmfs/software.eessi.io/host_injections/2025.06/software/linux/x86_64/amd/zen2/rpath_overrides/OpenMPI/system-GCC-CUDA-12.9.1/lib64:/cvmfs/software.eessi.io/host_injections/2025.06/software/linux/x86_64/amd/zen2/rpath_overrides/OpenMPI/system-GCC/lib:/cvmfs/software.eessi.io/host_injections/2025.06/software/linux/x86_64/amd/zen2/rpath_overrides/OpenMPI/system-GCC/lib64:/cvmfs/software.eessi.io/host_injections/2025.06/software/linux/x86_64/amd/zen2/rpath_overrides/OpenMPI/system/lib:/cvmfs/software.eessi.io/host_injections/2025.06/software/linux/x86_64/amd/zen2/rpath_overrides/OpenMPI/system/lib64

Which has the system-GCC-CUDA-<cudaver> as expected.

@casparvl

Contributor

Ah, maybe this does make sense, because rompi is a ROCm-based toolchain, whereas the CUDA OSU uses a GCC-based toolchain. Since you construct the suffix as:

gpu_stub = f"{self.toolchain.COMPILER_FAMILY}-{self.cfg.eessi_gpu_dependency[0]}-{self.cfg.eessi_gpu_dependency[1]}"

self.toolchain.COMPILER_FAMILY probably resolves to ROCm for rompi?
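
To make the effect of that f-string concrete, here is a small standalone sketch. It assumes, as guessed above, that the compiler family resolves to "GCC" for gompi and "ROCm" for rompi, and that the dependency names are "CUDA" and "ROCM" (inferred from the logged paths, not confirmed against EasyBuild). The free function `gpu_stub` is a hypothetical stand-in for the expression in the hook.

```python
def gpu_stub(compiler_family, gpu_dependency):
    """Build the '<compiler>-<accelerator>-<version>' part of the override path.

    Mirrors the f-string quoted above, with the toolchain and easyconfig
    attributes replaced by plain arguments for illustration.
    """
    return f"{compiler_family}-{gpu_dependency[0]}-{gpu_dependency[1]}"


# Assumed values: compiler family per toolchain is a guess from the logs
cuda_stub = gpu_stub("GCC", ("CUDA", "12.9.1"))
rocm_stub = gpu_stub("ROCm", ("ROCM", "6.4.1"))
print(f"system-{cuda_stub}")
print(f"system-{rocm_stub}")
```

Under these assumptions the ROCm case naturally yields the doubled `ROCm-ROCM` stub seen in the output, since the compiler family and the accelerator package both carry the ROCm name.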

@casparvl

Contributor

I guess what I'm wondering about is... what should this string refer to?

I guess system-GCC-CUDA-<cudaver> basically means: any MPI lib provided in this directory has been built with the system GCC, and a (system-provided) CUDA with version <cudaver>?

But what does system-ROCm-ROCm-<rocmver> then mean? The second part is likely "built with a ROCm with version <rocmver>", but the first... it's almost certainly not built with a system ROCm. For compatibility, it may need to be compiled with a system LLVM (that's ROCm-enabled?)? I'm really not sure what the compatibility requirement is here...
