Job States and Reason Codes in Slurm (PERUN)

Understanding job statuses is essential for debugging and managing workloads on the PERUN Supercomputer.
Slurm provides two types of codes:

  • Job State Codes — describe the current state of a job
  • Job Reason Codes — explain why a job is in a particular state

1. Job State Codes

What Are Job States?

These codes reflect what your job is currently doing — running, pending, failed, etc.

Job States Table

Status         Code  Explanation
CANCELLED      CA    Job was cancelled by the user or an administrator.
COMPLETED      CD    Job finished successfully (exit code zero).
COMPLETING     CG    Job is finishing; some processes are still active.
DEADLINE       DL    Job was terminated because it reached its deadline.
FAILED         F     Job ended with a non-zero exit code or another failure condition.
NODE_FAIL      NF    One or more allocated nodes failed.
OUT_OF_MEMORY  OOM   Job ran out of memory.
PENDING        PD    Job is queued and waiting for resources to be allocated.
PREEMPTED      PR    Job was terminated to make room for a higher-priority job.
RUNNING        R     Job is running on its assigned nodes.
SUSPENDED      S     Job execution is suspended; its CPU cores are released to other jobs.
STOPPED        ST    Job execution is stopped; its CPU cores stay reserved for the job.
TIMEOUT        TO    Job reached its time limit.

Tip — Check Jobs Efficiently

Use:

squeue -u $USER
to list all your jobs with their state codes.
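
When many jobs are queued, a per-state count can be easier to read than the full list. The following is a minimal sketch, not a PERUN-provided tool: summarize_states is a hypothetical helper name, and the squeue call is skipped on machines where squeue is not installed.

```shell
# Count jobs per compact state code read from stdin (one code per line).
summarize_states() {
    sort | uniq -c | awk '{ printf "%s %d\n", $2, $1 }'
}

# Only query the scheduler where squeue is available.
# -h drops the header line, -o "%t" prints just the compact state code.
if command -v squeue >/dev/null 2>&1; then
    squeue -u "$USER" -h -o "%t" | summarize_states
fi
```

Each output line has the form "<state code> <number of jobs>", for example one line for PD and one for R.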


2. Job Reason Codes

Why Is My Job Pending?

A pending job (PD) always has a reason code explaining why it cannot start.

Reason Codes Table

Reason Code              Explanation
Priority                 One or more higher-priority jobs are ahead of yours.
Dependency               Job is waiting for a job it depends on to finish.
Resources                Required resources are currently unavailable.
InvalidAccount           The account used is invalid; cancel and resubmit.
InvalidQOS               The QoS setting is invalid; fix and resubmit.
QOSGrpCpuLimit           All CPUs allowed under the QoS are in use.
QOSGrpMaxJobsLimit       Maximum number of jobs for the QoS reached.
QOSGrpNodeLimit          All nodes allowed under the QoS are in use.
PartitionCpuLimit        CPUs in the partition are fully allocated.
PartitionMaxJobsLimit    Maximum number of jobs for the partition reached.
PartitionNodeLimit       The number of nodes requested is outside the partition's limits, or required nodes are down or drained.
AssociationCpuLimit      CPUs for your account/association are in use.
AssociationMaxJobsLimit  Maximum number of jobs for your association reached.
AssociationNodeLimit     All nodes assigned to the association are in use.

Example — Job Pending Because of Resources

$ squeue -j 123456 -o "%i %t %r"
JOBID ST REASON
123456 PD Resources
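
To review the reasons for all of your pending jobs at once, the reason codes can be grouped into a few broad actions. This is a rough sketch under stated assumptions: reason_hint is a hypothetical helper, its hint wording is illustrative rather than official Slurm text, and the squeue call only runs where squeue is installed.

```shell
# Map a reason code to a short, illustrative hint.
reason_hint() {
    case "$1" in
        Priority|Resources|Dependency)
            echo "waiting; no action needed" ;;
        Invalid*)
            echo "cancel and resubmit with corrected settings" ;;
        *Limit)
            echo "a QoS, partition, or association limit applies" ;;
        *)
            echo "see the squeue documentation" ;;
    esac
}

# List pending jobs (-t PENDING) as "jobid reason", without a header (-h).
if command -v squeue >/dev/null 2>&1; then
    squeue -u "$USER" -h -t PENDING -o "%i %r" |
    while read -r jobid reason; do
        printf '%s %s: %s\n' "$jobid" "$reason" "$(reason_hint "$reason")"
    done
fi
```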

3. Debugging Failed Jobs

Critical — Job Failed

A job that ends in state F (FAILED), OOM (OUT_OF_MEMORY), or NF (NODE_FAIL) did not complete its work and needs investigation.

Enable Debug Mode in Job Script

Add -x to the bash shebang:

#!/bin/bash -x
#SBATCH -p cpu

This makes bash print each command to standard error before executing it, so the job's output shows the last command that ran before the failure.
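
Tracing an entire script can produce a lot of output. As an alternative sketch, assuming only one section of the script is suspect, tracing can be switched on and off with set -x and set +x. Here run_step is a hypothetical helper name and the echo command is a placeholder for a real workload step.

```shell
#!/bin/bash
#SBATCH -p cpu

# Run one command with tracing enabled, then restore quiet mode.
run_step() {
    set -x                                   # start printing commands to stderr
    "$@"
    { local status=$?; set +x; } 2>/dev/null # stop tracing without logging this line
    return "$status"                         # preserve the command's exit status
}

run_step echo "preprocessing done"
```

Because run_step returns the exit status of the traced command, it can still be combined with checks such as `run_step my_command || exit 1`.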

Tip — Check Job Summary

seff <jobid>
Shows CPU usage, memory usage, and job efficiency.
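
For per-step detail, sacct (listed in the summary) reports the state and exit code of each job step. The sketch below is an illustration, not a site-provided script: failed_steps is a hypothetical helper, 123456 is the placeholder job ID used in the example above, and the sacct call is skipped where sacct is not installed.

```shell
# From pipe-delimited "JobID|State|ExitCode" lines on stdin,
# print the steps that did not finish successfully.
failed_steps() {
    awk -F'|' '$2 != "COMPLETED" && $2 != "RUNNING" && $2 != "PENDING" { print $1, $2, $3 }'
}

jobid=123456   # placeholder job ID

# -n drops the header, -P gives pipe-delimited output.
if command -v sacct >/dev/null 2>&1; then
    sacct -j "$jobid" -n -P --format=JobID,State,ExitCode | failed_steps
fi
```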


4. Summary

  • Job states tell you what is happening.
  • Reason codes tell you why it’s happening.
  • Use squeue, sacct, and seff for monitoring and debugging.
  • Use debug mode (#!/bin/bash -x) for troubleshooting complex job scripts.