Job States and Reason Codes in Slurm (PERUN)
Understanding job statuses is essential for debugging and managing workloads on the PERUN Supercomputer.
Slurm provides two types of codes:
- Job State Codes — describe the current state of a job
- Job Reason Codes — explain why a job is in a particular state
1. Job State Codes
What Are Job States?
These codes reflect what your job is currently doing — running, pending, failed, etc.
Job States Table
| Status | Code | Explanation |
|---|---|---|
| CANCELLED | CA | Job was cancelled by the user or an administrator. |
| COMPLETED | CD | Job finished successfully. |
| COMPLETING | CG | Job is finishing; some processes still active. |
| DEADLINE | DL | Job terminated because it reached its deadline. |
| FAILED | F | Job ended with a non-zero exit code. |
| NODE_FAIL | NF | One or more allocated nodes failed. |
| OUT_OF_MEMORY | OOM | Job ran out of memory. |
| PENDING | PD | Job is queued and waiting for resource allocation. |
| PREEMPTED | PR | Job was terminated to make room for a higher-priority job. |
| RUNNING | R | Job is running on assigned nodes. |
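| SUSPENDED | S | Job has an allocation but execution is suspended; its CPU cores are released to other jobs. |
| STOPPED | ST | Job has an allocation but execution was stopped with SIGSTOP; its CPU cores stay reserved for the job. |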
| TIMEOUT | TO | Job reached its time limit. |
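
The compact codes in the table above are what `squeue` prints with the `%t` format specifier. The sketch below shows one way to expand them into full state names. The function name `state_name` is a hypothetical helper introduced for illustration, not part of Slurm, and the `squeue` call only runs where Slurm is installed.

```shell
#!/bin/bash
# Hypothetical helper: translate a compact Slurm state code into its full name.
# The mapping follows the Job States Table above.
state_name() {
    case "$1" in
        CA)  echo "CANCELLED" ;;
        CD)  echo "COMPLETED" ;;
        CG)  echo "COMPLETING" ;;
        DL)  echo "DEADLINE" ;;
        F)   echo "FAILED" ;;
        NF)  echo "NODE_FAIL" ;;
        OOM) echo "OUT_OF_MEMORY" ;;
        PD)  echo "PENDING" ;;
        PR)  echo "PREEMPTED" ;;
        R)   echo "RUNNING" ;;
        S)   echo "SUSPENDED" ;;
        ST)  echo "STOPPED" ;;
        TO)  echo "TIMEOUT" ;;
        *)   echo "UNKNOWN" ;;
    esac
}

# Only runs where Slurm is installed (for example, on a login node).
# %i = job ID, %t = compact state code, -h = no header line.
if command -v squeue >/dev/null 2>&1; then
    squeue -u "$USER" -h -o "%i %t" | while read -r jobid code; do
        echo "$jobid $(state_name "$code")"
    done
fi
```

If only the full names are needed, `squeue -o "%i %T"` prints the extended state directly, so the lookup function is mainly useful when parsing output that already contains compact codes.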
2. Job Reason Codes
Why Is My Job Pending?
A pending job (PD) always has a reason code explaining why it has not started yet. `squeue` shows it in the NODELIST(REASON) column.
Reason Codes Table
| Reason Code | Explanation |
|---|---|
| Priority | A higher-priority job is ahead of yours. |
| Dependency | Waiting for another job to finish. |
| Resources | Required resources are currently unavailable. |
| InvalidAccount | The account used is invalid — cancel and resubmit. |
| InvalidQoS | The QoS setting is invalid — fix and resubmit. |
| QOSGrpCpuLimit | All CPUs allowed under the QoS are in use. |
| QOSGrpMaxJobsLimit | Max number of jobs for the QoS reached. |
| QOSGrpNodeLimit | All nodes allocated under this QoS are occupied. |
| PartitionCpuLimit | CPUs in the partition are fully allocated. |
| PartitionMaxJobsLimit | Max jobs for the partition reached. |
| PartitionNodeLimit | All nodes in the partition are busy. |
| AssociationCpuLimit | CPUs for your account/association are in use. |
| AssociationMaxJobsLimit | Max jobs for your association reached. |
| AssociationNodeLimit | All nodes assigned to the association are in use. |
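
The reason codes above fall roughly into three groups: reasons that clear on their own, reasons that need a corrected submission, and reasons caused by a usage limit. As a minimal sketch, the hypothetical helper `reason_hint` below encodes that grouping. It is an illustration based on the table, not an official Slurm classification, and reasons not listed in the table map to `unknown`.

```shell
#!/bin/bash
# Hypothetical helper: suggest a next step for a pending-job reason code.
# The grouping mirrors the Reason Codes Table; it is an assumption made
# for illustration, not something Slurm itself reports.
reason_hint() {
    case "$1" in
        Priority|Resources|Dependency)
            echo "wait" ;;
        InvalidAccount|InvalidQoS)
            echo "fix and resubmit" ;;
        QOS*Limit|Partition*Limit|Association*Limit)
            echo "limit reached" ;;
        *)
            echo "unknown" ;;
    esac
}

# Only runs where Slurm is installed.
# -t PD restricts output to pending jobs; %i = job ID, %r = reason code.
if command -v squeue >/dev/null 2>&1; then
    squeue -u "$USER" -t PD -h -o "%i %r" | while read -r jobid reason; do
        echo "$jobid $reason -> $(reason_hint "$reason")"
    done
fi
```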
3. Debugging Failed Jobs
Critical — Job Failed
A job that ends in state F, OOM, or NF did not complete its work and should be investigated before it is resubmitted.
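A reasonable first step is to look at the job's accounting record. `sacct` reports an `ExitCode` field in the form `exit:signal`, where the second number is non-zero if a signal terminated the job. The sketch below turns that field into a readable message. `describe_exit` is a hypothetical helper name, and the job ID is assumed to be passed as the script's first argument.

```shell
#!/bin/bash
# Hypothetical helper: describe sacct's ExitCode field ("exit:signal").
describe_exit() {
    local exit_status="${1%%:*}"   # part before the colon
    local signal="${1##*:}"        # part after the colon
    if [ "$signal" != "0" ]; then
        echo "killed by signal $signal"
    elif [ "$exit_status" != "0" ]; then
        echo "exited with status $exit_status"
    else
        echo "exited cleanly"
    fi
}

# Only runs where Slurm is installed and a job ID was given.
# -X = allocation line only, -n = no header, -P = pipe-delimited output.
if command -v sacct >/dev/null 2>&1 && [ -n "${1:-}" ]; then
    sacct -j "$1" -X -n -P --format=JobID,State,ExitCode |
        while IFS='|' read -r jobid state exitcode; do
            echo "$jobid $state: $(describe_exit "$exitcode")"
        done
fi
```

Saved under a name of your choice (for example `check_job.sh`, a hypothetical file name), it could be run as `bash check_job.sh <jobid>`.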
Enable Debug Mode in Job Script
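Add `-x` to the bash shebang, so that the first line of the job script reads `#!/bin/bash -x`.
Bash then prints each command to stderr before executing it, which shows exactly where the script fails.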
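As a minimal sketch, a job script with tracing enabled could look like the following. The `#SBATCH` values and the variable names are hypothetical placeholders. The explicit `set -x` is an addition to the shebang flag: it has the same effect and also applies when the script is started as `bash script.sh`, in which case the shebang line is ignored.

```shell
#!/bin/bash -x
#SBATCH --job-name=debug-demo   # hypothetical job name
#SBATCH --ntasks=1              # hypothetical resource request
#SBATCH --time=00:05:00         # hypothetical time limit

# Same effect as the -x flag in the shebang; kept here so tracing is
# active even if the script is launched as "bash script.sh".
set -x

# SLURM_NTASKS is set by Slurm inside a job; fall back to 1 elsewhere.
n_tasks="${SLURM_NTASKS:-1}"
total=$(( n_tasks * 2 ))
echo "tasks=$n_tasks total=$total"

# Turn tracing off again once the part under investigation has run.
set +x
```

Traced commands are written to stderr with a leading `+`. Unless the script redirects stderr elsewhere, Slurm writes it to the same output file as stdout, so the trace appears next to the job's normal output.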
4. Summary
- Job states tell you what is happening.
- Reason codes tell you why it’s happening.
- Use `squeue`, `sacct`, and `seff` for monitoring and debugging.
- Use debug mode (`#!/bin/bash -x`) for troubleshooting complex job scripts.