Submitting Jobs on PERUN – Complete Guide with Automatic Scratch¶
This guide provides practical Slurm batch scripts for the PERUN supercomputer and explains the automatic scratch system.
What's New
- Automatic scratch management (prolog/epilog)
- Fast I/O on Lustre instead of slow NFS
- Automatic data staging and result synchronization
- One-line activation - just add source .activate_scratch
Table of Contents¶
- How Automatic Scratch Works
- Basic Job Templates
- Advanced Examples
- Troubleshooting
- Slurm Basics
- Environment Variables
- Best Practices
- Migration Guide
- FAQ
- Complete Working Example
1. How Automatic Scratch Works¶
The Three Phases¶
Phase 1: PROLOG (Before Your Job)¶
Automatic Execution
Runs automatically; you don't need to do anything.
1. Creates /lustre/scratch/$USER/job_$JOBID/
2. Copies your ENTIRE submit directory to scratch
3. Creates .activate_scratch helper file
What gets copied:
Files Included in Copy
Included:
- All .py, .sh, .txt files
- Subdirectories (data/, models/, etc.)
- Configuration files

Excluded:
- Hidden files (.git/, .venv/)
- Output files (*.out, *.err)
- __pycache__/ directories
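For intuition, the prolog's copy step behaves roughly like the rsync below, using the excludes listed above (an illustrative sketch, not the actual prolog script):

rsync -a \
    --exclude='.git/' --exclude='.venv/' \
    --exclude='*.out' --exclude='*.err' \
    --exclude='__pycache__/' \
    "$SLURM_SUBMIT_DIR"/ "/lustre/scratch/$USER/job_$SLURM_JOB_ID"/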
Phase 2: YOUR JOB (Your Code)¶
One Line Addition
You add ONE line:

source .activate_scratch

This does:
cd /lustre/scratch/$USER/job_$JOBID
export SCRATCH_DIR="$(pwd)"
export DATA_DIR="$SCRATCH_DIR/data"
export TMPDIR="$SCRATCH_DIR/tmp"
export RESULTS_DIR="$SCRATCH_DIR/results"
Now your job runs in fast Lustre scratch instead of slow NFS home.
Phase 3: EPILOG (After Your Job)¶
Automatic Cleanup
Runs automatically; you don't need to do anything.
1. Syncs EVERYTHING from scratch → ~/results_job_$JOBID/
2. Creates job summary file
3. Cleans up scratch automatically
Result
All your outputs, checkpoints, logs safely in your home directory.
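Conceptually, the epilog's sync-and-clean step looks roughly like this (an illustrative sketch, not the actual epilog script):

# Sync everything home, then remove the scratch directory
rsync -a "/lustre/scratch/$USER/job_$SLURM_JOB_ID"/ "$HOME/results_job_$SLURM_JOB_ID"/
rm -rf "/lustre/scratch/$USER/job_$SLURM_JOB_ID"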
2. Basic Job Templates¶
2.1 Single GPU Training¶
#!/bin/bash
#SBATCH --job-name=train_model
#SBATCH --output=%x_%j.out
#SBATCH --error=%x_%j.err
#SBATCH --partition=GPU
#SBATCH --account=perun2501234
#SBATCH --qos=perun2501234
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=8
#SBATCH --mem=48G
#SBATCH --time=24:00:00
# Activate scratch (ONE LINE!)
source .activate_scratch
# Your training code
python3 train.py \
    --data data/dataset.csv \
    --checkpoint checkpoints/ \
    --output results/
# Done! Epilog automatically syncs:
# - checkpoints/ → ~/results_job_XXXXX/checkpoints/
# - results/ → ~/results_job_XXXXX/results/
# - logs/ → ~/results_job_XXXXX/logs/
2.2 Multi-GPU Training (DDP)¶
#!/bin/bash
#SBATCH --job-name=train_ddp
#SBATCH --output=%x_%j.out
#SBATCH --partition=GPU
#SBATCH --account=perun2501234
#SBATCH --qos=perun2501234
#SBATCH --gres=gpu:4
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=8
#SBATCH --mem=128G
#SBATCH --time=48:00:00
# Activate scratch
source .activate_scratch
# DDP training: launch torchrun once; it spawns one process per GPU.
# (Running it under srun with --ntasks=4 would start 4 duplicate launchers.)
python3 -m torch.distributed.run \
    --standalone \
    --nproc_per_node=4 \
    train_ddp.py
# Results automatically synced!
2.3 CPU-Only Job¶
#!/bin/bash
#SBATCH --job-name=preprocess
#SBATCH --output=%x_%j.out
#SBATCH --partition=CPU
#SBATCH --account=perun2501234
#SBATCH --qos=perun2501234
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --mem=64G
#SBATCH --time=04:00:00
# Activate scratch
source .activate_scratch
# CPU preprocessing
python3 preprocess_data.py \
    --input data/raw/ \
    --output data/processed/
# Processed data automatically synced!
3. Advanced Examples¶
3.1 Multi-Node DDP Training¶
#!/bin/bash
#SBATCH --job-name=ddp_multinode
#SBATCH --output=%x_%j.out
#SBATCH --partition=GPU
#SBATCH --account=perun2501234
#SBATCH --qos=perun2501234
#SBATCH --nodes=4
#SBATCH --gres=gpu:4
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=8
#SBATCH --time=96:00:00
# Activate scratch
source .activate_scratch
# Setup master node
MASTER_ADDR=$(scontrol show hostnames "$SLURM_NODELIST" | head -n 1)
MASTER_PORT=29500
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
echo "Training on $SLURM_NNODES nodes, $SLURM_NTASKS GPUs total"
echo "Master: $MASTER_ADDR:$MASTER_PORT"
# Launch distributed training: one torchrun launcher per node,
# each spawning 4 processes (one per GPU)
srun --ntasks=$SLURM_NNODES --ntasks-per-node=1 python3 -m torch.distributed.run \
    --nnodes=$SLURM_NNODES \
    --nproc_per_node=4 \
    --rdzv_backend=c10d \
    --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
    train_ddp.py
# Results from all nodes synced!
3.2 Hyperparameter Search (Job Array)¶
#!/bin/bash
#SBATCH --job-name=hparam_search
# NOTE: create the logs/ directory before submitting, or no output file is written
#SBATCH --output=logs/search_%A_%a.out
#SBATCH --partition=GPU
#SBATCH --account=perun2501234
#SBATCH --qos=perun2501234
#SBATCH --gres=gpu:1
#SBATCH --array=0-99%10
#SBATCH --time=12:00:00
# Activate scratch
source .activate_scratch
# Each array task gets different hyperparameters
SEED=$SLURM_ARRAY_TASK_ID
# Log-spaced learning-rate sweep from 1e-4 to 1e-2 across the 100 tasks
LR=$(python3 -c "print(10 ** (-4 + 2 * $SEED / 99))")
python3 train.py \
    --seed $SEED \
    --lr $LR \
    --output results/seed_${SEED}/
# Each task's results synced separately!
3.3 Checkpoint Resume¶
#!/bin/bash
#SBATCH --job-name=resume_training
#SBATCH --output=%x_%j.out
#SBATCH --partition=GPU
#SBATCH --account=perun2501234
#SBATCH --qos=perun2501234
#SBATCH --gres=gpu:2
#SBATCH --time=48:00:00
# Activate scratch
source .activate_scratch
# Copy a previous checkpoint into scratch (job ID 3500 is an example)
mkdir -p checkpoints
if [ -f "$HOME/results_job_3500/checkpoints/best_model.pt" ]; then
    cp "$HOME/results_job_3500/checkpoints/best_model.pt" checkpoints/
    echo "Resumed from previous checkpoint"
fi
# Continue training
python3 train.py --resume checkpoints/best_model.pt
# New checkpoints automatically synced!
3.4 Custom Sync Strategy (Advanced)¶
#!/bin/bash
#SBATCH --job-name=custom_sync
#SBATCH --output=%x_%j.out
#SBATCH --partition=GPU
#SBATCH --account=perun2501234
#SBATCH --qos=perun2501234
#SBATCH --gres=gpu:1
#SBATCH --time=24:00:00
# Activate scratch
source .activate_scratch
echo "Running in: $SCRATCH_DIR"
# Long training with intermediate syncs
python3 train.py &
TRAIN_PID=$!
# Sync critical checkpoints every hour (while training runs)
mkdir -p "$HOME/backup_checkpoints"
while kill -0 $TRAIN_PID 2>/dev/null; do
    sleep 3600
    # Manual sync of critical files
    if [ -f checkpoints/latest.pt ]; then
        rsync -a checkpoints/latest.pt "$HOME/backup_checkpoints/"
        echo "$(date): Synced intermediate checkpoint"
    fi
done
wait $TRAIN_PID
# Epilog still syncs everything at the end!
4. Troubleshooting¶
4.1 Job Failed Immediately¶
Symptom
Job exits with "Permission denied" or "No such file"
Solution
Make sure you added source .activate_scratch:
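The line belongs right after the #SBATCH directives, before any command that touches your files (placement illustrative):

#SBATCH --time=24:00:00

source .activate_scratch    # <-- must come before your code runs

python3 train.py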
4.2 Output Files Not Found After Job¶
Symptom
Can't find results after job completes
Solution
Check the results directory:
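ls -lh ~/results_job_XXXXX/    # replace XXXXX with your job ID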
All outputs are synced here, not in the original submit directory!
4.3 Large Checkpoints Missing¶
Symptom
Some checkpoints didn't sync
Possible causes:
- Job hit time limit before epilog completed
- Disk quota exceeded
- Checkpoint was too large (>100GB needs more time)
Solution
# Check epilog log
ssh root@gpu01 'tail -100 /var/log/slurm/prolog-epilog/epilog-job*XXXXX*.log'
# Manually recover from scratch (if still exists)
rsync -avP /lustre/scratch/$USER/job_XXXXX/ ~/recovered_results/
4.4 Job Slower Than Expected¶
Symptom
Training is slow despite using scratch
Diagnostics:
# Check if actually running in scratch
squeue -j $JOBID -o "%i %Z" # WorkDir should be /lustre/scratch/...
# Check I/O wait
ssh gpu01 'iostat -x 1 5'
# Check if data is actually in scratch
ls -lh /lustre/scratch/$USER/job_$JOBID/data/
4.5 Monitoring Live Progress¶
View live output:
# From login node
tail -f ~/results_job_XXXXX/train_model_XXXXX.out
# Or from scratch (while running)
ssh gpu01 'tail -f /lustre/scratch/$USER/job_XXXXX/*.out'
Monitor prolog/epilog:
# Watch real-time sync progress
ssh root@gpu01 'tail -f /var/log/slurm/prolog-epilog/*.log | grep RSYNC'
5. Slurm Basics (Cheat Sheet)¶
Essential Commands¶
# Submit job
sbatch job.sh
# Check queue
squeue -u $USER
# Job details
scontrol show job XXXXX
# Cancel job
scancel XXXXX
# Job history
sacct -j XXXXX --format=JobID,State,Elapsed,MaxRSS,ReqMem
Common SBATCH Directives¶
#SBATCH --job-name=my_job # Job name
#SBATCH --output=%x_%j.out # Output file (%x=name, %j=jobid)
#SBATCH --error=%x_%j.err # Error file
#SBATCH --partition=GPU # Queue: CPU or GPU
#SBATCH --account=perun2501234 # Project account
#SBATCH --qos=perun2501234 # Quality of Service
#SBATCH --gres=gpu:2 # Request 2 GPUs
#SBATCH --nodes=1 # Number of nodes
#SBATCH --ntasks=1 # Number of processes
#SBATCH --cpus-per-task=8 # CPUs per process
#SBATCH --mem=64G # Memory
#SBATCH --time=24:00:00 # Time limit (HH:MM:SS)
Account and QoS
Replace perun2501234 with your actual project number. You can check your available accounts and QoS limits with:
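sacctmgr show user $USER withassoc format=account,qos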
Output Filename Patterns¶
| Pattern | Meaning | Example |
|---|---|---|
| %x | Job name | train_model |
| %j | Job ID | 3928 |
| %A | Array job ID | 4000 |
| %a | Array task ID | 5 |
| %N | Node name | gpu01 |
Recommended Pattern
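#SBATCH --output=%x_%j.out    # e.g. train_model_3928.out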
6. Environment Variables¶
Available in Jobs¶
$SLURM_JOB_ID # Job ID
$SLURM_JOB_NAME # Job name
$SLURM_SUBMIT_DIR # Directory where sbatch was called
$SLURM_CPUS_PER_TASK # CPUs requested
$SLURM_NTASKS # Total tasks
$SLURM_NNODES # Number of nodes
$SLURM_NODELIST # List of nodes
$CUDA_VISIBLE_DEVICES # Visible GPUs (set by Slurm)
After source .activate_scratch¶
$SCRATCH_DIR # /lustre/scratch/$USER/job_$JOBID
$DATA_DIR # $SCRATCH_DIR/data
$TMPDIR # $SCRATCH_DIR/tmp
$RESULTS_DIR # $SCRATCH_DIR/results
Use in Python
import os
scratch = os.environ['SCRATCH_DIR']
checkpoint_dir = os.path.join(scratch, 'checkpoints')
os.makedirs(checkpoint_dir, exist_ok=True)
7. Best Practices¶
DO¶
Recommended Practices
- Always use scratch for training - 40x faster I/O
- Use %x_%j.out for output files - easier to track
- Always set --account and --qos - required for tracked jobs
- Request appropriate resources - don't over-request
- Set realistic time limits - helps the scheduler
- Test with short jobs first - debug before long runs
- Monitor job progress - use tail -f on the output file
- Keep code in Git - the submit directory gets copied
DON'T¶
Avoid These Mistakes
- Don't write large files to home - use scratch!
- Don't request 8 GPUs if you use 1 - wastes resources
- Don't use --nodelist in production - reduces flexibility
- Don't forget source .activate_scratch - defeats the purpose
- Don't store results only in scratch - the epilog syncs them home
- Don't manually clean scratch - epilog does it
- Don't run interactive jobs 24/7 - use batch jobs
8. Migration Guide¶
If You Have Existing Jobs¶
Before (manual scratch):
#!/bin/bash
#SBATCH --partition=GPU
SCRATCH="/lustre/scratch/$USER/job_$SLURM_JOB_ID"
mkdir -p "$SCRATCH"
rsync -a "$SLURM_SUBMIT_DIR"/ "$SCRATCH"/
cd "$SCRATCH"
python3 train.py
rsync -a output/ "$SLURM_SUBMIT_DIR/output/"
rm -rf "$SCRATCH"
After (automatic):
#!/bin/bash
#SBATCH --partition=GPU
#SBATCH --account=perun2501234
#SBATCH --qos=perun2501234
source .activate_scratch
python3 train.py
# That's it!
Changes Needed
- Add source .activate_scratch
- Add --account and --qos directives
- Remove manual mkdir, rsync, cd, and cleanup
- Results will be in ~/results_job_XXXXX/ instead of the submit directory
- Update any hardcoded paths if needed
9. FAQ¶
Do I need to change my Python code?
No! Your code runs in scratch automatically. Paths stay the same.
What if my dataset is 5TB?
Don't copy it! Keep large datasets in /lustre/datasets/ and reference them directly.
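For example (dataset path illustrative):

python3 train.py --data /lustre/datasets/my_corpus/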
Can I submit from any directory?
Yes! Prolog copies from wherever you run sbatch.
What if I need specific files NOT to sync?
Create .rsyncignore (advanced) or exclude in epilog config.
How long does sync take?
~1-2 seconds per GB. A 10GB checkpoint = ~15 seconds.
Can I monitor sync progress?
Yes! ssh root@NODE 'tail -f /var/log/slurm/prolog-epilog/epilog*.log | grep RSYNC'
What if job is killed mid-training?
Epilog still runs! Results are synced even for failed jobs.
Can I disable automatic scratch?
Yes, just don't add source .activate_scratch. Job runs in submit directory.
How do I find my account and QoS?
Run: sacctmgr show user $USER withassoc format=account,qos
10. Complete Working Example¶
Here's a complete, production-ready training script:
#!/bin/bash
################################################################################
# BERT-Large Fine-tuning on SKQuAD Dataset
# Expected runtime: ~2 hours on 2x H200 GPUs
# Results: ~/results_job_XXXXX/
################################################################################
#SBATCH --job-name=skquad_bert
#SBATCH --output=%x_%j.out
#SBATCH --error=%x_%j.err
#SBATCH --partition=GPU
#SBATCH --account=perun2501234
#SBATCH --qos=perun2501234
#SBATCH --gres=gpu:2
#SBATCH --cpus-per-task=16
#SBATCH --mem=96G
#SBATCH --time=04:00:00
# Activate automatic scratch
source .activate_scratch
# Environment setup
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export CUDA_LAUNCH_BLOCKING=0
# Print job info
echo "═══════════════════════════════════════════════════════════"
echo "Job ID: $SLURM_JOB_ID"
echo "Job Name: $SLURM_JOB_NAME"
echo "Node: $(hostname)"
echo "GPUs: $CUDA_VISIBLE_DEVICES"
echo "Scratch: $SCRATCH_DIR"
echo "═══════════════════════════════════════════════════════════"
echo
# Verify GPU access
nvidia-smi --query-gpu=name,memory.total --format=csv
echo
# Start training
python3 train_bert.py \
    --model_name bert-large-uncased \
    --dataset skquad \
    --output_dir checkpoints/ \
    --num_train_epochs 3 \
    --per_device_train_batch_size 16 \
    --learning_rate 3e-5 \
    --warmup_steps 500 \
    --save_steps 1000 \
    --logging_steps 100 \
    --fp16
echo
echo "═══════════════════════════════════════════════════════════"
echo "Training complete! Results syncing to ~/results_job_$SLURM_JOB_ID/"
echo "═══════════════════════════════════════════════════════════"
# Epilog automatically:
# 1. Syncs checkpoints/ → ~/results_job_XXXXX/checkpoints/
# 2. Syncs logs → ~/results_job_XXXXX/logs/
# 3. Creates job summary
# 4. Cleans up scratch
Submit the Job
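sbatch skquad_bert.sh    # script filename illustrative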
Monitor Progress
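squeue -u $USER
tail -f skquad_bert_XXXXX.out    # replace XXXXX with the job ID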
After Completion
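ls -lh ~/results_job_XXXXX/    # checkpoints, logs, and the job summary land here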