GPU Jobs

ARCH offers several GPU-equipped partitions for compute-intensive and AI/ML workloads. This page lists each partition, the CPU-per-GPU billing ratios, access requirements, and submission examples.

Available GPU partitions

Partition   GPUs / node                CPU cores billed per GPU   Typical use-case
l40s        8 × NVIDIA L40S (48 GB)    14                         Large-memory image / data analytics
a100        8 × NVIDIA A100 (40 GB)    10                         Mixed HPC + DL
nvl         4 × NVIDIA H100 (96 GB)    30                         Highest-end training / inference
h100        4 × NVIDIA H100 (80 GB)    30                         Same hardware as nvl; kept separate for scheduling

The CPU cores billed per GPU figure is the DefCpuPerGPU value reported by scontrol show partition; Slurm charges this many CPU cores per GPU for each elapsed hour.
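As a rough illustration of the billing rule above, the sketch below multiplies the GPU count, the per-GPU core ratio, and the elapsed hours. The function name billed_core_hours is hypothetical, and the formula assumes billing is exactly GPUs × cores billed per GPU × hours; actual charges may also depend on extra CPUs or memory requested.

```shell
#!/bin/sh
# Hypothetical helper: estimate billed core-hours for a GPU job.
# Assumes billing = GPUs x cores-billed-per-GPU x elapsed hours.
billed_core_hours() (
    gpus=$1
    cores_per_gpu=$2
    hours=$3
    echo $(( gpus * cores_per_gpu * hours ))
)

# The per-partition ratio can be read from Slurm on a login node, e.g.:
#   scontrol show partition a100 | grep -o 'DefCpuPerGPU=[0-9]*'

# 2 GPUs on a100 (10 cores billed per GPU) for 24 hours
billed_core_hours 2 10 24   # prints 480
```

Under the same assumption, a full nvl node (4 GPUs at 30 cores each) would be billed as 120 core-hours per elapsed hour.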

GPU usage limits

QoS limits are enforced cluster-wide. Most projects can have up to 18 GPUs in use simultaneously. The limit applies both per account and per user, and whichever is reached first takes effect.
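To see how close you are to the cap, one option is to total the GPUs held by your running jobs. The sketch below is a minimal example with a hypothetical helper, sum_gpus. It assumes squeue reports per-node GRES in a form ending in ":<count>" (for example "gres/gpu:2"); the exact string varies with Slurm version and with how the job requested its GPUs, so treat the parsing as an assumption to verify on the cluster.

```shell
#!/bin/sh
# Hypothetical helper: sum GPUs from "NODES GRES" lines on stdin.
# Assumes the GRES field ends in ":<count>", e.g. "gres/gpu:2" or "gpu:a100:4";
# entries without a numeric count (such as "N/A") are skipped.
sum_gpus() {
    awk '{
        n = split($2, parts, ":")
        if (parts[n] ~ /^[0-9]+$/) total += $1 * parts[n]
    } END { print total + 0 }'
}

GPU_LIMIT=18   # cap stated above for most projects

# On a login node, feed it your running jobs (%D = node count, %b = GRES per node):
#   in_use=$(squeue -h -t RUNNING -u "$USER" -o "%D %b" | sum_gpus)
#   echo "GPUs in use: $in_use of $GPU_LIMIT"

# Stand-alone demonstration with sample input:
printf '1 gres/gpu:2\n2 gpu:a100:4\n1 N/A\n' | sum_gpus   # prints 10
```

Replacing -u "$USER" with -A and an account name gives the per-account total instead.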

Submitting a GPU batch job

Example – 2 × A100 GPUs for 24 h:

#!/bin/bash
#SBATCH --partition=a100
#SBATCH --qos=qos_gpu
#SBATCH --account=jsmith123_gpu
#SBATCH --gres=gpu:2
#SBATCH --cpus-per-task=20     # 10 cores / GPU × 2 (a100 ratio from the table above)
#SBATCH --time=24:00:00

module load cuda/12.3
srun python train.py --epochs 90
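It can help to confirm, early in the job, that the expected number of GPUs is actually visible. The sketch below is one way to do that; count_visible_gpus and check_gpus are hypothetical helpers, and the approach assumes Slurm exports CUDA_VISIBLE_DEVICES for jobs submitted with --gres=gpu, which depends on the cluster's GRES configuration.

```shell
#!/bin/sh
# Hypothetical helper: count the GPUs listed in CUDA_VISIBLE_DEVICES.
count_visible_gpus() {
    if [ -z "${CUDA_VISIBLE_DEVICES:-}" ]; then
        echo 0
        return
    fi
    # One device ID per line, then count the non-empty lines
    printf '%s\n' "$CUDA_VISIBLE_DEVICES" | tr ',' '\n' | grep -c .
}

# Hypothetical helper: succeed only if the expected number of GPUs is visible.
check_gpus() {
    expected=$1
    found=$(count_visible_gpus)
    if [ "$found" -ne "$expected" ]; then
        echo "expected $expected GPUs, found $found" >&2
        return 1
    fi
    echo "OK: $found GPUs visible"
}

# In the batch script above, before the srun line:
#   check_gpus 2 || exit 1
```

Failing fast this way avoids spending billed hours on a run that silently falls back to CPU.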

Monitoring GPUs

List GPU nodes & load:

sinfo -p l40s,a100,nvl,h100 -N -o "%N %G %T %m"
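In that format string, %N is the node name, %G the generic resources (GPUs), %T the node state, and %m the memory in MB. The sketch below narrows the listing to nodes that may still have free capacity; available_nodes is a hypothetical helper and the node names in the demonstration are made up.

```shell
#!/bin/sh
# Hypothetical helper: keep nodes whose state (3rd column) is idle or mixed.
# Expects "NODE GRES STATE MEMORY" lines, as produced by the -o format above.
available_nodes() {
    awk '$3 == "idle" || $3 == "mixed"'
}

# On a login node (-h suppresses the header line):
#   sinfo -h -p l40s,a100,nvl,h100 -N -o "%N %G %T %m" | available_nodes

# Stand-alone demonstration with made-up node names:
printf 'gpu01 gpu:a100:8 mixed 1000000\ngpu02 gpu:a100:8 allocated 1000000\n' | available_nodes
```

A node in the mixed state is partly allocated, so it may or may not have a free GPU; the listing is a starting point rather than a guarantee.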

Per-job utilisation:

jobstats <jobid>

Troubleshooting

  • QOSMaxGRESPerAccount → you’ve hit the GPU cap; wait or cancel other runs.

  • AssocGrpGRES → wrong account/QoS pair.

  • Resources → the job is waiting for free resources; request fewer GPUs or a shorter wall-time so it can be back-filled.
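The reasons above appear in the reason column of squeue for a pending job. As a small convenience, the sketch below maps a reason string to the advice listed here; explain_reason is a hypothetical helper and covers only these three reasons.

```shell
#!/bin/sh
# Hypothetical helper: map a Slurm pending reason to the advice listed above.
explain_reason() {
    case "$1" in
        QOSMaxGRESPerAccount) echo "GPU cap reached; wait or cancel other runs" ;;
        AssocGrpGRES)         echo "check the account/QoS pair" ;;
        Resources)            echo "request fewer GPUs or a shorter wall-time" ;;
        *)                    echo "no advice listed for: $1" ;;
    esac
}

# On a login node, for a pending job (replace <jobid>; %r prints the reason):
#   explain_reason "$(squeue -h -j <jobid> -o "%r")"

# Stand-alone demonstration:
explain_reason Resources
```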

Need help? Open a ticket or e-mail help@arch.jhu.edu.