Running GEOS on Prism's GH200
Basics
- Go to
gh
nodessh
intoprism
ssh
intogpulogin1
salloc
withsalloc -p grace --nodes=1 --gpus-per-node=1
- Use preset modules at `/explore/nobackup/people/fgdeconi/work/modules/module
⚠️ Always salloc
on a GH node - login nodes are x86 but GH are ARM64 ⚠️
⚠️ Loads module after ssh in the GH box! ⚠️
Building GEOS
Module dsl/build_geos
loads all required modules for building GEOS:
- Compilers are gcc/g++/gfortran
nvidia
suite is not loaded on purpose, or else thempicc/c++/f90
will default to Nvidia's compiler which are not capable of compiling GEOS
Running GEOS
Once GEOS is built, running GEOS + NDSL requires to fiddle with the compilers.
First we load & install the stack w/ proper nvidia support:
- load
nvidia/12.8
: your default compilers are now theNVHPC
suite (nvc/nvc++/nvcc). Those are faulty for C/C++, we will need to enforce GNU - load/create a conda environment with python
3.9
. Once the stack is installed, re-installed MPI4Py with the following
# To link the correct MPI to be linked in
CC=nvc CFLAGS="-noswitcherror" pip install --force --no-cache-dir --no-binary=mpi4py mpi4py
We can then run GEOS with the following env modifications:
- In your
gcm_run.j
(or before running) export as follows to force compiler & fix UCX interface
For GPU backends
#CSH-style
setenv CC gcc # restore C compiler to GNU
setenv CXX g++ # restore C++ compiler to GNU
setenv CUDA_HOST_CXX g++ # force nvcc host compiler to GNU (GT4Py specific)
setenv CUDA_HOME $NVHPC_ROOT/cuda # help GT4Py found the nvcc binary
setenv UCX_NET_DEVICES mlx5_2:1
For CPU backends, the same but comment out the CUDA_HOST_CXX
.
To run nsys
with a low overhead you can prefix the mpirun
call with
nsys profile --output report_%h_%p.nsys-rep \ # Output file name, unique
--trace="cuda,nvtx" \ # Trace only the cuda and nvtx event
--cuda-event-trace=false \ # Deactivate to reduce overhead
--sample=none \ # No CPU sampling (overhead--)
--cpuctxsw=none \ # No process switch tracking (overhead--)
--stats=true \ # Optional, bigger files but print some stats
Common issues
-
I hit a
@GLIBCXX_3.4.32 cannot load library
error: your GCC module is the x86 one, you need to reloadmodule reload gcc/14.2.0
-
I hit a
--ccbin unknown options
: you haveCUDA_HOST_CXX
set in your env, and GT4Py is misdirecting the GPU host linker flag onto GCC. Unset variable.