Available Partitions
####################

Slurm divides resources into **partitions**, sometimes called **queues**. Each partition targets specific hardware or workloads.

.. list-table:: **DSAI Partition Summary**
   :header-rows: 1
   :widths: 12 10 12 14 12 12 38

   * - **Partition**
     - **# Nodes**
     - **CPU cores / node**
     - **Memory / core (MB)**
     - **GPUs / node**
     - **Time limit (hh:mm:ss)**
     - **Key features**
   * - ``cpu``
     - 80
     - 108
     - 4 000
     - — (N/A)
     - 72:00:00
     - Intel Xeon Platinum 8480+ (56-core) dual-socket nodes
   * - ``interactive``
     - 4
     - 88
     - 10 000
     - 8 × NVIDIA A100 80 GB
     - 72:00:00
     - AMD EPYC 7443 (24-core) + A100 GPUs
   * - ``l40s``
     - 8
     - 124
     - 6 000
     - 8 × NVIDIA L40S 48 GB
     - 72:00:00
     - AMD EPYC 9534 (64-core) + high-mem L40S GPUs
   * - ``a100``
     - 11
     - 88
     - 10 000
     - 8 × NVIDIA A100 80 GB
     - 72:00:00
     - AMD EPYC 7443 (24-core) + A100 GPUs
   * - ``h100``
     - 16
     - 124
     - 12 000
     - 4 × NVIDIA H100 80 GB
     - 72:00:00
     - AMD EPYC 9534 (64-core) + H100 GPUs
   * - ``nvl``
     - 16
     - 124
     - 12 000
     - 4 × NVIDIA H100-NVL 96 GB
     - 72:00:00
     - AMD EPYC 9534 (64-core) + H100-NVL GPUs

Partition Descriptions
------------------------

cpu
~~~

* **No GPUs** – ideal for CPU only jobs.

interactive
~~~~~~~~~~~

* Interactive, short, hands-on debugging or exploratory runs (not for long production jobs).
* Up to **1 node** per job; **MaxTime = 3 days (72:00:00)**.
* Runs on A100 nodes (``c012–c015``), same chassis as the ``a100`` partition.

l40s
~~~~

* **8 × L40 S 48 GB** per node.   

a100
~~~~

* **8 × A100 80 GB** per node.

h100
~~~~

* **4 × H100 80 GB** per node.
* Connected via Mellanox NDR and may give good performance for parallel GPU jobs.

nvl
~~~

* **4 × H100-NVL 96 GB** per node.
* Connected via Mellanox NDR and may give good performance for parallel GPU jobs. 

GPU core-billing ratios
-----------------------

.. list-table::
   :header-rows: 1
   :widths: 18 18

   * - **Partition**
     - **Billed CPU cores per GPU**
   * - ``l40s``
     - 14
   * - ``a100``
     - 10
   * - ``h100`` / ``nvl``
     - 30

Only request the GPUs you truly need—extra GPUs multiply your billed
core-hours and may increase queue time.

Viewing Partition Configuration
--------------------------------

You can view details about any partition with the `scontrol` command. This is helpful to check limits, available nodes, default memory settings, and which QoS values are allowed or denied.

- Use `scontrol show partition` without any arguments to see **all** partitions.
- To find which QoS values are allowed or blocked in a partition, look at `QoS=` and `DenyQos=`.

Example:

.. code-block:: console

   scontrol show partition=h100

Sample Output:

.. code-block:: console

   PartitionName=h100
      AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
      AllocNodes=ALL Default=NO QoS=N/A
      DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
      MaxNodes=3 MaxTime=3-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=128 MaxCPUsPerSocket=UNLIMITED
      Nodes=h[01-16]
      PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=YES:4
      OverTimeLimit=NONE PreemptMode=REQUEUE
      State=UP TotalCPUs=2048 TotalNodes=16 SelectTypeParameters=NONE
      JobDefaults=DefCpuPerGPU=32
      DefMemPerCPU=12000 MaxMemPerCPU=12000
      TRES=cpu=1984,mem=24752000M,node=16,billing=24320,gres/gpu=64,gres/gpu:h100=64
      TRESBillingWeights=CPU=10,Mem=0.83G,GRES/gpu=380

Key Fields to Note
------------------

- **MaxTime**: The maximum wall-clock time allowed for jobs in this partition.
- **DefMemPerCPU**: The default memory available per core (can be overridden with `--mem` or `--mem-per-cpu`).
- **Nodes**: The physical nodes available for this partition.
- **OverSubscribe**: Indicates if jobs can share nodes.
- **DenyQos**: QOS values that are explicitly blocked from this partition.
- **TRES**: Total Resources (CPUs, memory, nodes) assigned to this partition.

Helpful Tips
-------------

- You can view the current load on each partition with:

  .. code-block:: console

    [root@dsailogin ~]$ sinfo -s
    PARTITION AVAIL  TIMELIMIT   NODES(A/I/O/T) NODELIST
    l40s*        up 3-00:00:00          7/1/0/8 l[01-08]
    a100         up 3-00:00:00        14/0/1/15 c[001-015]
    nvl          up 3-00:00:00        14/2/0/16 n[01-16]
    h100         up 3-00:00:00        16/0/0/16 h[01-16]
    cpu          up 3-00:00:00       2/62/16/80 cpu[001-080]
    Secondary    up 3-00:00:00          3/1/0/4 c015,h16,l08,n16  

  This provides a summary view of each partition’s usage and availability.

- To see the list of available partitions and their state:

  .. code-block:: console

     sinfo -o "%P %.5D %.10t %.10l %.6c %.10m"

  This will output:
  
  - Partition name
  - Node count
  - State (idle/alloc/mix)
  - Max time
  - CPUs per node
  - Memory

Partition Best Practices
-------------------------

- Use `\-\-partition=` to explicitly request a partition in your batch script.
- Avoid defaulting to GPU partitions unless required — this helps ensure fair usage.
- Read memory policies carefully (e.g., shared nodes have 4 GB/core).
- Always pair GPU partitions with the appropriate QOS and allocation account.