CUDA-Q¶
What is CUDA-Q?¶
CUDA-Q is NVIDIA's platform for hybrid quantum-classical computing that enables seamless integration of quantum processing units (QPUs), GPUs, and CPUs. It provides a unified programming model for quantum circuit simulation and algorithm development with support for both Python and C++.
GPU Acceleration
CUDA-Q provides GPU-accelerated quantum circuit simulation for significantly improved performance on large qubit counts.
Python Environment
CUDA-Q is installed via conda/pip in user environments. Check https://wiki.perun.tuke.sk/env/conda/ for instructions on how to set up conda.
Installation¶
A single installation works for both CPU and GPU partitions:
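For example, in a fresh conda environment (the environment name cudaq matches the job scripts below; the Python version is just one reasonable choice):

```bash
# Create and activate a dedicated environment
conda create -y -n cudaq python=3.11
conda activate cudaq

# Install the CUDA 12 build of CUDA-Q (see the tip below)
pip install cuda-quantum-cu12
```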
CUDA Version
Use cuda-quantum-cu12 (not cudaq) to match PERUN's CUDA 12.9 drivers.
Example Script¶
example.py
Bell state quantum circuit
```python
import sys

import cudaq

print(f"Running on target {cudaq.get_target().name}")

# Qubit count from the first command-line argument (default: 2)
qubit_count = int(sys.argv[1]) if len(sys.argv) > 1 else 2


@cudaq.kernel
def kernel(qubit_count: int):
    qubits = cudaq.qvector(qubit_count)
    # Superposition on the first qubit, then entangle the rest with it
    h(qubits[0])
    for i in range(1, qubit_count):
        x.ctrl(qubits[0], qubits[i])
    # Measure all qubits in the computational basis
    mz(qubits)


result = cudaq.sample(kernel, qubit_count)
print(result)
```
SLURM Job Script (CPU)¶
cudaq_cpu.sh
CPU job
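A minimal sketch; the partition name CPU is an assumption, so substitute your site's CPU partition name and adjust the resource requests to your workload:

```bash
#!/bin/bash
#SBATCH --job-name=cudaq_cpu
#SBATCH --partition=CPU        # assumption: replace with PERUN's CPU partition name
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=16G
#SBATCH --time=01:00:00
#SBATCH --output=cudaq_cpu_%j.out

source ~/miniconda3/bin/activate
conda activate cudaq

python example.py
```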
SLURM Job Script (GPU)¶
cudaq_gpu.sh
GPU job
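A minimal sketch modeled on the MPI script further below, reduced to a single task and a single GPU:

```bash
#!/bin/bash
#SBATCH --job-name=cudaq_gpu
#SBATCH --partition=GPU
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
#SBATCH --gres=gpu:1
#SBATCH --mem=16G
#SBATCH --time=01:00:00
#SBATCH --output=cudaq_gpu_%j.out

source ~/miniconda3/bin/activate
conda activate cudaq

python example.py
```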
Expected Results¶
CPU Partition¶
CPU execution
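Submit the job script:

```bash
sbatch cudaq_cpu.sh
```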
Check output:
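Once the job finishes, read the output file (replace <jobid> with the job ID reported by sbatch):

```bash
cat cudaq_cpu_<jobid>.out
```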
Expected output:
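Something similar to the following for the default two-qubit run; the exact measurement counts fluctuate between runs (cudaq.sample draws 1000 shots by default):

```text
Running on target qpp-cpu
{ 00:493 11:507 }
```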
The qpp-cpu target indicates CPU-only simulation using OpenMP parallelization.
GPU Partition¶
GPU execution
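Submit the job script:

```bash
sbatch cudaq_gpu.sh
```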
Check output:
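Once the job finishes, read the output file:

```bash
cat cudaq_gpu_<jobid>.out
```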
Expected output:
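Something like the following for the default two-qubit run (counts vary between runs):

```text
Running on target nvidia
{ 00:502 11:498 }
```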
The nvidia target indicates GPU-accelerated simulation. CUDA-Q automatically detects and uses available GPUs.
Parallel Simulations with MPI¶
CUDA-Q supports MPI for running multiple independent simulations in parallel. This is useful for parameter sweeps, Monte Carlo studies, or batch processing multiple circuits.
Installation¶
Install OpenMPI and mpi4py in your environment:
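One route is via conda-forge, sketched below; a site-provided MPI module with an mpi4py built against it also works:

```bash
conda activate cudaq
conda install -y -c conda-forge openmpi mpi4py
```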
Example: Parallel Parameter Sweep¶
mpi_sweep.py
Parameter sweep across MPI ranks
```python
import cudaq
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Each rank tests a different qubit count
qubit_count = 10 + rank


@cudaq.kernel
def kernel(qubit_count: int):
    qubits = cudaq.qvector(qubit_count)
    # GHZ-style circuit: superposition plus a chain of controlled-X gates
    h(qubits[0])
    for i in range(1, qubit_count):
        x.ctrl(qubits[0], qubits[i])
    mz(qubits)


result = cudaq.sample(kernel, qubit_count)
print(f"Rank {rank} ({qubit_count} qubits): {result}")
```
SLURM Job Script¶
cudaq_mpi.sh
MPI parallel job
```bash
#!/bin/bash
#SBATCH --job-name=cudaq_mpi
#SBATCH --partition=GPU
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=2
#SBATCH --gres=gpu:4
#SBATCH --mem=16G
#SBATCH --time=01:00:00
#SBATCH --output=cudaq_mpi_%j.out

source ~/miniconda3/bin/activate
conda activate cudaq

# Use srun with explicit GPU assignment
srun bash -c 'export CUDA_VISIBLE_DEVICES=$SLURM_LOCALID; python mpi_sweep.py'
```
Resource Allocation
- `--ntasks=N`: number of parallel simulations
- `--gres=gpu:N`: request one GPU per task for best performance
- Each MPI rank is assigned to a separate GPU via `CUDA_VISIBLE_DEVICES`
- Use `srun` to properly integrate with SLURM's GPU allocation
Multiple Tasks Per GPU
To run more tasks than GPUs (e.g., 32 tasks on 4 GPUs):
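A sketch of the relevant changes to cudaq_mpi.sh; the modulo divisor must match the number of GPUs requested via --gres:

```bash
#SBATCH --ntasks=32
#SBATCH --gres=gpu:4

# Map each local task to a GPU round-robin (task ID modulo GPU count)
srun bash -c 'export CUDA_VISIBLE_DEVICES=$((SLURM_LOCALID % 4)); python mpi_sweep.py'
```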
This distributes tasks round-robin across GPUs. Multiple tasks sharing the same GPU will time-share GPU resources through CUDA's built-in scheduling.