GPU Jobs
ARCH offers several GPU-equipped partitions for compute-intensive and AI/ML workloads. This page lists each partition, the CPU-per-GPU billing ratios, access requirements, and submission examples.
Available GPU partitions
| Partition | GPUs / node | CPU cores billed per GPU | Typical use-case |
|---|---|---|---|
| l40s | 8 × NVIDIA L40S (48 GB) | 14 | Large-memory image / data analytics |
| a100 | 8 × NVIDIA A100 (40 GB) | 10 | Mixed HPC + DL |
| nvl | 4 × NVIDIA H100 (96 GB) | 30 | Highest-end training / inference |
| h100 | 4 × NVIDIA H100 (80 GB) | 30 | Same hardware as nvl; kept separate for scheduling |
The "CPU cores billed per GPU" value is the partition's DefCpuPerGPU setting, as reported by scontrol show partition; Slurm charges that many CPU cores per GPU for each elapsed hour.
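As a rough illustration of the billing note above, the sketch below pulls the DefCpuPerGPU value out of `scontrol show partition` output and multiplies it into billed core-hours. The function names `def_cpu_per_gpu` and `billed_core_hours` are made up for this example, and the formula (GPUs × cores billed per GPU × elapsed hours) is a simplified reading of the billing rule, not an official accounting method.

```shell
# Hypothetical helper: extract DefCpuPerGPU from `scontrol show partition`
# text supplied on stdin. Prints nothing if the field is absent.
def_cpu_per_gpu() {
    grep -o 'DefCpuPerGPU=[0-9][0-9]*' | head -n 1 | cut -d= -f2
}

# Hypothetical helper: billed core-hours, assuming the charge is simply
# GPUs x cores billed per GPU x elapsed hours.
billed_core_hours() {
    gpus=$1
    cores_per_gpu=$2
    hours=$3
    echo $(( gpus * cores_per_gpu * hours ))
}
```

For instance, `scontrol show partition a100 | def_cpu_per_gpu` should print that partition's ratio, and `billed_core_hours 2 10 24` prints 480 (two A100 GPUs at 10 cores each for 24 h).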
GPU usage limits
QoS limits are enforced cluster-wide. Most projects can have up to 18 GPUs in use simultaneously. The limit applies both per account and per user; whichever is reached first takes effect.
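To see how close you are to the cap, one option is to add up the GPUs held by your running jobs. The sketch below is only an approximation: `sum_gpus` and `gpus_remaining` are hypothetical helper names, the parser assumes GRES strings shaped like `gres/gpu:2` or `gpu:a100:4`, and it assumes single-node jobs (the `%b` field of `squeue` reports GPUs per node).

```shell
# Hypothetical helper: sum GPU counts from per-job GRES strings on stdin.
# Lines that do not mention "gpu" (for example "N/A") are skipped.
sum_gpus() {
    awk '
        /gpu/ {
            n = 1    # a bare "gpu" entry with no count means one GPU
            if (match($0, /:[0-9]+$/)) n = substr($0, RSTART + 1) + 0
            total += n
        }
        END { print total + 0 }
    '
}

# Hypothetical helper: headroom left under a GPU cap (default 18, per this page).
gpus_remaining() {
    in_use=$1
    cap=${2:-18}
    echo $(( cap - in_use ))
}
```

A possible invocation is `squeue -u "$USER" -t RUNNING -h -o "%b" | sum_gpus`. This covers the per-user count only; replacing `-u "$USER"` with `-A <account>` would give the account-wide figure.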
Submitting a GPU batch job
Example – 2 × A100 GPUs for 24 h:
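#!/bin/bash
#SBATCH --partition=a100
#SBATCH --qos=qos_gpu
#SBATCH --account=jsmith123_gpu   # replace with your own GPU account
#SBATCH --gres=gpu:2              # 2 GPUs
#SBATCH --cpus-per-task=24        # 12 cores / GPU × 2
#SBATCH --time=24:00:00           # 24 h wall-time

# Load the CUDA toolkit, then launch training inside the allocation
module load cuda/12.3
srun python train.py --epochs 90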
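Before a long training run starts, it can be worth checking that the job really received the GPUs it asked for. Slurm sets `CUDA_VISIBLE_DEVICES` for jobs that request GPUs via `--gres`, so a minimal check might look like the sketch below. The helper names `count_visible_gpus` and `require_gpus` are hypothetical and not part of any ARCH tooling.

```shell
# Hypothetical helper: count the GPUs listed in CUDA_VISIBLE_DEVICES
# (a comma-separated list such as "0,1"). Prints 0 if the variable is unset or empty.
count_visible_gpus() {
    devices=${CUDA_VISIBLE_DEVICES:-}
    if [ -z "$devices" ]; then
        echo 0
    else
        echo "$devices" | awk -F',' '{ print NF }'
    fi
}

# Hypothetical helper: return non-zero if fewer GPUs are visible than expected.
require_gpus() {
    expected=$1
    found=$(count_visible_gpus)
    if [ "$found" -lt "$expected" ]; then
        echo "expected $expected GPUs, found $found" >&2
        return 1
    fi
}
```

With these functions defined in the batch script, a line such as `require_gpus 2 || exit 1` placed before the `srun` command would stop the job early instead of spending billed hours on a misconfigured allocation.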
Monitoring GPUs
List GPU nodes & load:
sinfo -p l40s,a100,nvl,h100 -N -o "%N %G %T %m"
Per-job utilisation:
jobstats <jobid>
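The node listing above can be narrowed to nodes that might have free GPUs. The sketch below filters on the state column; `available_gpu_nodes` is a hypothetical name, and it assumes the four-column layout produced by the `"%N %G %T %m"` format (node, GRES, state, memory). A node in the `mixed` state has some CPUs allocated, so its GPUs may or may not be free.

```shell
# Hypothetical helper: from "node gres state memory" lines on stdin,
# print the node name and GRES of nodes whose state is idle or mixed.
available_gpu_nodes() {
    awk '$3 == "idle" || $3 == "mixed" { print $1, $2 }'
}
```

It could be combined with the command above as `sinfo -p l40s,a100,nvl,h100 -h -N -o "%N %G %T %m" | available_gpu_nodes`, where `-h` suppresses the header line.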
Troubleshooting
QOSMaxGRESPerAccount → you’ve hit the GPU cap; wait or cancel other runs.
AssocGrpGRES → wrong account/QoS pair.
Resources → the job is waiting for free resources; request fewer GPUs or a shorter wall-time so it can be back-filled.
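The pending reasons above appear in the reason column of `squeue`. As a small convenience, the sketch below maps a reason string to the advice on this page; `pending_advice` is a hypothetical helper, and it covers only the three reasons listed here.

```shell
# Hypothetical helper: translate a Slurm pending reason into the advice
# given in the troubleshooting list on this page.
pending_advice() {
    case "$1" in
        QOSMaxGRESPerAccount) echo "GPU cap reached: wait or cancel other runs" ;;
        AssocGrpGRES)         echo "check the account/QoS pair" ;;
        Resources)            echo "request fewer GPUs or a shorter wall-time" ;;
        *)                    echo "no advice listed for: $1" ;;
    esac
}
```

One way to use it is `pending_advice "$(squeue -j <jobid> -h -o "%r")"`, where `%r` prints the reason a job is pending.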
Need help? Open a ticket or e-mail help@arch.jhu.edu.