Submitting Jobs on PERUN – Complete Guide with Automatic Scratch¶
This guide provides practical Slurm batch scripts for the PERUN supercomputer and explains the automatic scratch system.
What's New
- Automatic scratch management (prolog/epilog)
- Fast I/O on Lustre instead of slow NFS
- Automatic data staging and result synchronization
- One-line activation - just add source .activate_scratch
Table of Contents¶
- How Automatic Scratch Works
- Basic Job Templates
- Advanced Examples
- Troubleshooting
- Slurm Basics
- Environment Variables
- Best Practices
- Migration Guide
- FAQ
- Complete Working Example
1. How Automatic Scratch Works¶
The Three Phases¶
Phase 1: PROLOG (Before Your Job)¶
Automatic Execution
Runs automatically; you don't need to do anything.
1. Creates /lustre/scratch/$USER/job_$JOBID/
2. Copies your ENTIRE submit directory to scratch
3. Creates .activate_scratch helper file
What gets copied:
Files Included in Copy
Included:
- All .py, .sh, .txt files
- Subdirectories (data/, models/, etc.)
- Configuration files

Excluded:
- Hidden files (.git/, .venv/)
- Output files (*.out, *.err)
- __pycache__/ directories
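For intuition, the prolog's copy step behaves roughly like the rsync below, using the excludes listed above (an illustrative sketch, not the actual prolog script):

rsync -a \
    --exclude='.git/' --exclude='.venv/' \
    --exclude='*.out' --exclude='*.err' \
    --exclude='__pycache__/' \
    "$SLURM_SUBMIT_DIR"/ "/lustre/scratch/$USER/job_$SLURM_JOB_ID"/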
Phase 2: YOUR JOB (Your Code)¶
One Line Addition
You add ONE line:

source .activate_scratch

This does:
cd /lustre/scratch/$USER/job_$JOBID
export SCRATCH_DIR="$(pwd)"
export DATA_DIR="$SCRATCH_DIR/data"
export TMPDIR="$SCRATCH_DIR/tmp"
export RESULTS_DIR="$SCRATCH_DIR/results"
Now your job runs in fast Lustre scratch instead of slow NFS home.
Phase 3: EPILOG (After Your Job)¶
Automatic Cleanup
Runs automatically; you don't need to do anything.
1. Syncs EVERYTHING from scratch → ~/results_job_$JOBID/
2. Creates job summary file
3. Cleans up scratch automatically
Result
All your outputs, checkpoints, logs safely in your home directory.
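Conceptually, the epilog's sync-and-clean step looks roughly like this (an illustrative sketch, not the actual epilog script):

# Sync everything home, then remove the scratch directory
rsync -a "/lustre/scratch/$USER/job_$SLURM_JOB_ID"/ "$HOME/results_job_$SLURM_JOB_ID"/
rm -rf "/lustre/scratch/$USER/job_$SLURM_JOB_ID"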
2. Basic Job Templates¶
2.1 Single GPU Training¶
#!/bin/bash
#SBATCH --job-name=train_model
#SBATCH --output=%x_%j.out
#SBATCH --error=%x_%j.err
#SBATCH --partition=GPU
#SBATCH --account=perun2501234
#SBATCH --qos=perun2501234
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=8
#SBATCH --mem=48G
#SBATCH --time=24:00:00
# Activate scratch (ONE LINE!)
source .activate_scratch
# Your training code
python3 train.py \
    --data data/dataset.csv \
    --checkpoint checkpoints/ \
    --output results/
# Done! Epilog automatically syncs:
# - checkpoints/ → ~/results_job_XXXXX/checkpoints/
# - results/ → ~/results_job_XXXXX/results/
# - logs/ → ~/results_job_XXXXX/logs/
2.2 Multi-GPU Training (DDP)¶
#!/bin/bash
#SBATCH --job-name=train_ddp
#SBATCH --output=%x_%j.out
#SBATCH --partition=GPU
#SBATCH --account=perun2501234
#SBATCH --qos=perun2501234
#SBATCH --gres=gpu:4
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=8
#SBATCH --mem=128G
#SBATCH --time=48:00:00
# Activate scratch
source .activate_scratch
# DDP training: launch torchrun once; it spawns one process per GPU.
# (Running it under srun with --ntasks=4 would start 4 duplicate launchers.)
python3 -m torch.distributed.run \
    --standalone \
    --nproc_per_node=4 \
    train_ddp.py
# Results automatically synced!
2.3 CPU-Only Job¶
#!/bin/bash
#SBATCH --job-name=preprocess
#SBATCH --output=%x_%j.out
#SBATCH --partition=CPU
#SBATCH --account=perun2501234
#SBATCH --qos=perun2501234
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --mem=64G
#SBATCH --time=04:00:00
# Activate scratch
source .activate_scratch
# CPU preprocessing
python3 preprocess_data.py \
    --input data/raw/ \
    --output data/processed/
# Processed data automatically synced!
3. Advanced Examples¶
3.1 Multi-Node DDP Training¶
#!/bin/bash
#SBATCH --job-name=ddp_multinode
#SBATCH --output=%x_%j.out
#SBATCH --partition=GPU
#SBATCH --account=perun2501234
#SBATCH --qos=perun2501234
#SBATCH --nodes=4
#SBATCH --gres=gpu:4
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=8
#SBATCH --time=96:00:00
# Activate scratch
source .activate_scratch
# Setup master node
MASTER_ADDR=$(scontrol show hostnames "$SLURM_NODELIST" | head -n 1)
MASTER_PORT=29500
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
echo "Training on $SLURM_NNODES nodes, $SLURM_NTASKS GPUs total"
echo "Master: $MASTER_ADDR:$MASTER_PORT"
# Launch distributed training: one torchrun launcher per node,
# each spawning 4 processes (one per GPU)
srun --ntasks=$SLURM_NNODES --ntasks-per-node=1 python3 -m torch.distributed.run \
    --nnodes=$SLURM_NNODES \
    --nproc_per_node=4 \
    --rdzv_backend=c10d \
    --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
    train_ddp.py
# Results from all nodes synced!
3.2 Hyperparameter Search (Job Array)¶
#!/bin/bash
#SBATCH --job-name=hparam_search
# NOTE: create the logs/ directory before submitting, or no output file is written
#SBATCH --output=logs/search_%A_%a.out
#SBATCH --partition=GPU
#SBATCH --account=perun2501234
#SBATCH --qos=perun2501234
#SBATCH --gres=gpu:1
#SBATCH --array=0-99%10
#SBATCH --time=12:00:00
# Activate scratch
source .activate_scratch
# Each array task gets different hyperparameters
SEED=$SLURM_ARRAY_TASK_ID
# Log-spaced learning-rate sweep from 1e-4 to 1e-2 across the 100 tasks
LR=$(python3 -c "print(10 ** (-4 + 2 * $SEED / 99))")
python3 train.py \
    --seed $SEED \
    --lr $LR \
    --output results/seed_${SEED}/
# Each task's results synced separately!
3.3 Checkpoint Resume¶
#!/bin/bash
#SBATCH --job-name=resume_training
#SBATCH --output=%x_%j.out
#SBATCH --partition=GPU
#SBATCH --account=perun2501234
#SBATCH --qos=perun2501234
#SBATCH --gres=gpu:2
#SBATCH --time=48:00:00
# Activate scratch
source .activate_scratch
# Copy a previous checkpoint into scratch (job ID 3500 is an example)
mkdir -p checkpoints
if [ -f "$HOME/results_job_3500/checkpoints/best_model.pt" ]; then
    cp "$HOME/results_job_3500/checkpoints/best_model.pt" checkpoints/
    echo "Resumed from previous checkpoint"
fi
# Continue training
python3 train.py --resume checkpoints/best_model.pt
# New checkpoints automatically synced!
3.4 Custom Sync Strategy (Advanced)¶
#!/bin/bash
#SBATCH --job-name=custom_sync
#SBATCH --output=%x_%j.out
#SBATCH --partition=GPU
#SBATCH --account=perun2501234
#SBATCH --qos=perun2501234
#SBATCH --gres=gpu:1
#SBATCH --time=24:00:00
# Activate scratch
source .activate_scratch
echo "Running in: $SCRATCH_DIR"
# Long training with intermediate syncs
python3 train.py &
TRAIN_PID=$!
# Sync critical checkpoints every hour (while training runs)
mkdir -p "$HOME/backup_checkpoints"
while kill -0 $TRAIN_PID 2>/dev/null; do
    sleep 3600
    # Manual sync of critical files
    if [ -f checkpoints/latest.pt ]; then
        rsync -a checkpoints/latest.pt "$HOME/backup_checkpoints/"
        echo "$(date): Synced intermediate checkpoint"
    fi
done
wait $TRAIN_PID
# Epilog still syncs everything at the end!
4. Troubleshooting¶
4.1 Job Failed Immediately¶
Symptom
Job exits with "Permission denied" or "No such file"
Solution
Make sure you added source .activate_scratch:
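The line belongs right after the #SBATCH directives, before any command that touches your files (placement illustrative):

#SBATCH --time=24:00:00

source .activate_scratch    # <-- must come before your code runs

python3 train.py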
4.2 Output Files Not Found After Job¶
Symptom
Can't find results after job completes
Solution
Check the results directory:
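ls -lh ~/results_job_XXXXX/    # replace XXXXX with your job ID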
All outputs are synced here, not in the original submit directory!
4.3 Large Checkpoints Missing¶
Symptom
Some checkpoints didn't sync
Possible causes:
- Job hit time limit before epilog completed
- Disk quota exceeded
- Checkpoint was too large (>100GB needs more time)
Solution
# Check epilog log
ssh root@gpu01 'tail -100 /var/log/slurm/prolog-epilog/epilog-job*XXXXX*.log'
# Manually recover from scratch (if still exists)
rsync -avP /lustre/scratch/$USER/job_XXXXX/ ~/recovered_results/
4.4 Job Slower Than Expected¶
Symptom
Training is slow despite using scratch
Diagnostics:
# Check if actually running in scratch
squeue -j $JOBID -o "%i %Z" # WorkDir should be /lustre/scratch/...
# Check I/O wait
ssh gpu01 'iostat -x 1 5'
# Check if data is actually in scratch
ls -lh /lustre/scratch/$USER/job_$JOBID/data/
4.5 Monitoring Live Progress¶
View live output:
# From login node
tail -f ~/results_job_XXXXX/train_model_XXXXX.out
# Or from scratch (while running)
ssh gpu01 'tail -f /lustre/scratch/$USER/job_XXXXX/*.out'
Monitor prolog/epilog:
# Watch real-time sync progress
ssh root@gpu01 'tail -f /var/log/slurm/prolog-epilog/*.log | grep RSYNC'
5. Slurm Basics (Cheat Sheet)¶
Essential Commands¶
# Submit job
sbatch job.sh
# Check queue
squeue -u $USER
# Job details
scontrol show job XXXXX
# Cancel job
scancel XXXXX
# Job history
sacct -j XXXXX --format=JobID,State,Elapsed,MaxRSS,ReqMem
Common SBATCH Directives¶
#SBATCH --job-name=my_job # Job name
#SBATCH --output=%x_%j.out # Output file (%x=name, %j=jobid)
#SBATCH --error=%x_%j.err # Error file
#SBATCH --partition=GPU # Queue: CPU or GPU
#SBATCH --account=perun2501234 # Project account
#SBATCH --qos=perun2501234 # Quality of Service
#SBATCH --gres=gpu:2 # Request 2 GPUs
#SBATCH --nodes=1 # Number of nodes
#SBATCH --ntasks=1 # Number of processes
#SBATCH --cpus-per-task=8 # CPUs per process
#SBATCH --mem=64G # Memory
#SBATCH --time=24:00:00 # Time limit (HH:MM:SS)
Account and QoS
Replace perun2501234 with your actual project number. You can check your available accounts and QoS limits with:
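sacctmgr show user $USER withassoc format=account,qos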
Output Filename Patterns¶
| Pattern | Meaning | Example |
|---|---|---|
| %x | Job name | train_model |
| %j | Job ID | 3928 |
| %A | Array job ID | 4000 |
| %a | Array task ID | 5 |
| %N | Node name | gpu01 |
Recommended Pattern
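#SBATCH --output=%x_%j.out    # e.g. train_model_3928.out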
6. Environment Variables¶
Available in Jobs¶
$SLURM_JOB_ID # Job ID
$SLURM_JOB_NAME # Job name
$SLURM_SUBMIT_DIR # Directory where sbatch was called
$SLURM_CPUS_PER_TASK # CPUs requested
$SLURM_NTASKS # Total tasks
$SLURM_NNODES # Number of nodes
$SLURM_NODELIST # List of nodes
$CUDA_VISIBLE_DEVICES # Visible GPUs (set by Slurm)
After source .activate_scratch¶
$SCRATCH_DIR # /lustre/scratch/$USER/job_$JOBID
$DATA_DIR # $SCRATCH_DIR/data
$TMPDIR # $SCRATCH_DIR/tmp
$RESULTS_DIR # $SCRATCH_DIR/results
Use in Python
import os
scratch = os.environ['SCRATCH_DIR']
checkpoint_dir = os.path.join(scratch, 'checkpoints')
os.makedirs(checkpoint_dir, exist_ok=True)
7. Best Practices¶
DO¶
Recommended Practices
- Always use scratch for training - 40x faster I/O
- Use %x_%j.out for output files - easier to track
- Always set --account and --qos - required for tracked jobs
- Request appropriate resources - don't over-request
- Set realistic time limits - helps the scheduler
- Test with short jobs first - debug before long runs
- Monitor job progress - use tail -f on the output file
- Keep code in Git - the submit directory gets copied
DON'T¶
Avoid These Mistakes
- Don't write large files to home - use scratch!
- Don't request 8 GPUs if you use 1 - wastes resources
- Don't use --nodelist in production - reduces flexibility
- Don't forget source .activate_scratch - defeats the purpose
- Don't store results only in scratch - the epilog syncs them home
- Don't manually clean scratch - epilog does it
- Don't run interactive jobs 24/7 - use batch jobs
8. Migration Guide¶
If You Have Existing Jobs¶
Before (manual scratch):
#!/bin/bash
#SBATCH --partition=GPU
SCRATCH="/lustre/scratch/$USER/job_$SLURM_JOB_ID"
mkdir -p "$SCRATCH"
rsync -a "$SLURM_SUBMIT_DIR"/ "$SCRATCH"/
cd "$SCRATCH"
python3 train.py
rsync -a output/ "$SLURM_SUBMIT_DIR/output/"
rm -rf "$SCRATCH"
After (automatic):
#!/bin/bash
#SBATCH --partition=GPU
#SBATCH --account=perun2501234
#SBATCH --qos=perun2501234
source .activate_scratch
python3 train.py
# That's it!
Changes Needed
- Add source .activate_scratch
- Add --account and --qos directives
- Remove manual mkdir, rsync, cd, and cleanup
- Results will be in ~/results_job_XXXXX/ instead of the submit directory
- Update any hardcoded paths if needed
9. FAQ¶
Do I need to change my Python code?
No! Your code runs in scratch automatically. Paths stay the same.
What if my dataset is 5TB?
Don't copy it! Keep large datasets in /lustre/datasets/ and reference them directly.
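For example (dataset path illustrative):

python3 train.py --data /lustre/datasets/my_corpus/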
Can I submit from any directory?
Yes! Prolog copies from wherever you run sbatch.
What if I need specific files NOT to sync?
Create .rsyncignore (advanced) or exclude in epilog config.
How long does sync take?
~1-2 seconds per GB. A 10GB checkpoint = ~15 seconds.
Can I monitor sync progress?
Yes! ssh root@NODE 'tail -f /var/log/slurm/prolog-epilog/epilog*.log | grep RSYNC'
What if job is killed mid-training?
Epilog still runs! Results are synced even for failed jobs.
Can I disable automatic scratch?
Yes, just don't add source .activate_scratch. Job runs in submit directory.
How do I find my account and QoS?
Run: sacctmgr show user $USER withassoc format=account,qos
10. Complete Working Example¶
Here's a complete, production-ready training script:
#!/bin/bash
################################################################################
# BERT-Large Fine-tuning on SKQuAD Dataset
# Expected runtime: ~2 hours on 2x H200 GPUs
# Results: ~/results_job_XXXXX/
################################################################################
#SBATCH --job-name=skquad_bert
#SBATCH --output=%x_%j.out
#SBATCH --error=%x_%j.err
#SBATCH --partition=GPU
#SBATCH --account=perun2501234
#SBATCH --qos=perun2501234
#SBATCH --gres=gpu:2
#SBATCH --cpus-per-task=16
#SBATCH --mem=96G
#SBATCH --time=04:00:00
# Activate automatic scratch
source .activate_scratch
# Environment setup
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export CUDA_LAUNCH_BLOCKING=0
# Print job info
echo "═══════════════════════════════════════════════════════════"
echo "Job ID: $SLURM_JOB_ID"
echo "Job Name: $SLURM_JOB_NAME"
echo "Node: $(hostname)"
echo "GPUs: $CUDA_VISIBLE_DEVICES"
echo "Scratch: $SCRATCH_DIR"
echo "═══════════════════════════════════════════════════════════"
echo
# Verify GPU access
nvidia-smi --query-gpu=name,memory.total --format=csv
echo
# Start training
python3 train_bert.py \
    --model_name bert-large-uncased \
    --dataset skquad \
    --output_dir checkpoints/ \
    --num_train_epochs 3 \
    --per_device_train_batch_size 16 \
    --learning_rate 3e-5 \
    --warmup_steps 500 \
    --save_steps 1000 \
    --logging_steps 100 \
    --fp16
echo
echo "═══════════════════════════════════════════════════════════"
echo "Training complete! Results syncing to ~/results_job_$SLURM_JOB_ID/"
echo "═══════════════════════════════════════════════════════════"
# Epilog automatically:
# 1. Syncs checkpoints/ → ~/results_job_XXXXX/checkpoints/
# 2. Syncs logs → ~/results_job_XXXXX/logs/
# 3. Creates job summary
# 4. Cleans up scratch
Submit the Job
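sbatch skquad_bert.sh    # script filename illustrative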
Monitor Progress
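squeue -u $USER
tail -f skquad_bert_XXXXX.out    # replace XXXXX with the job ID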
After Completion
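ls -lh ~/results_job_XXXXX/    # checkpoints, logs, and the job summary land here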