CUDA-Q¶
What is CUDA-Q?¶
CUDA-Q is NVIDIA's platform for hybrid quantum-classical computing that enables seamless integration of quantum processing units (QPUs), GPUs, and CPUs. It provides a unified programming model for quantum circuit simulation and algorithm development with support for both Python and C++.
GPU Acceleration
CUDA-Q provides GPU-accelerated quantum circuit simulation for significantly improved performance on large qubit counts.
Python Environment
CUDA-Q is installed via conda/pip in user environments. Check https://wiki.perun.tuke.sk/env/conda/ for instructions on how to set up conda.
Installation¶
A single installation works for both CPU and GPU partitions:
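For example, in a fresh conda environment (the environment name cudaq matches the job scripts below; the Python version is just one reasonable choice):

```bash
# Create and activate a dedicated environment
conda create -y -n cudaq python=3.11
conda activate cudaq

# Install the CUDA 12 build of CUDA-Q (see the tip below)
pip install cuda-quantum-cu12
```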
CUDA Version
Use cuda-quantum-cu12 (not cudaq) to match PERUN's CUDA 12.9 drivers.
Example Script¶
example.py
Bell state quantum circuit
```python
import sys

import cudaq

print(f"Running on target {cudaq.get_target().name}")

# Qubit count from the first command-line argument (default: 2)
qubit_count = int(sys.argv[1]) if len(sys.argv) > 1 else 2


@cudaq.kernel
def kernel(qubit_count: int):
    qubits = cudaq.qvector(qubit_count)
    # Superposition on the first qubit, then entangle the rest with it
    h(qubits[0])
    for i in range(1, qubit_count):
        x.ctrl(qubits[0], qubits[i])
    # Measure all qubits in the computational basis
    mz(qubits)


result = cudaq.sample(kernel, qubit_count)
print(result)
```
SLURM Job Script (CPU)¶
cudaq_cpu.sh
CPU job
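A minimal sketch; the partition name CPU is an assumption, so substitute your site's CPU partition name and adjust the resource requests to your workload:

```bash
#!/bin/bash
#SBATCH --job-name=cudaq_cpu
#SBATCH --partition=CPU        # assumption: replace with PERUN's CPU partition name
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=16G
#SBATCH --time=01:00:00
#SBATCH --output=cudaq_cpu_%j.out

source ~/miniconda3/bin/activate
conda activate cudaq

python example.py
```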
SLURM Job Script (GPU)¶
cudaq_gpu.sh
GPU job
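A minimal sketch modeled on the MPI script further below, reduced to a single task and a single GPU:

```bash
#!/bin/bash
#SBATCH --job-name=cudaq_gpu
#SBATCH --partition=GPU
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
#SBATCH --gres=gpu:1
#SBATCH --mem=16G
#SBATCH --time=01:00:00
#SBATCH --output=cudaq_gpu_%j.out

source ~/miniconda3/bin/activate
conda activate cudaq

python example.py
```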
Expected Results¶
CPU Partition¶
CPU execution
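Submit the job script:

```bash
sbatch cudaq_cpu.sh
```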
Check output:
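Once the job finishes, read the output file (replace <jobid> with the job ID reported by sbatch):

```bash
cat cudaq_cpu_<jobid>.out
```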
Expected output:
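Something similar to the following for the default two-qubit run; the exact measurement counts fluctuate between runs (cudaq.sample draws 1000 shots by default):

```text
Running on target qpp-cpu
{ 00:493 11:507 }
```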
The qpp-cpu target indicates CPU-only simulation using OpenMP parallelization.
GPU Partition¶
GPU execution
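Submit the job script:

```bash
sbatch cudaq_gpu.sh
```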
Check output:
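Once the job finishes, read the output file:

```bash
cat cudaq_gpu_<jobid>.out
```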
Expected output:
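Something like the following for the default two-qubit run (counts vary between runs):

```text
Running on target nvidia
{ 00:502 11:498 }
```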
The nvidia target indicates GPU-accelerated simulation. CUDA-Q automatically detects and uses available GPUs.
Parallel Simulations with MPI¶
CUDA-Q supports MPI for running multiple independent simulations in parallel. This is useful for parameter sweeps, Monte Carlo studies, or batch processing multiple circuits.
Installation¶
Install OpenMPI and mpi4py in your environment:
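One route is via conda-forge, sketched below; a site-provided MPI module with an mpi4py built against it also works:

```bash
conda activate cudaq
conda install -y -c conda-forge openmpi mpi4py
```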
Example: Parallel Parameter Sweep¶
mpi_sweep.py
Parameter sweep across MPI ranks
```python
import cudaq
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Each rank tests a different qubit count
qubit_count = 10 + rank


@cudaq.kernel
def kernel(qubit_count: int):
    qubits = cudaq.qvector(qubit_count)
    # GHZ-style circuit: superposition plus a chain of controlled-X gates
    h(qubits[0])
    for i in range(1, qubit_count):
        x.ctrl(qubits[0], qubits[i])
    mz(qubits)


result = cudaq.sample(kernel, qubit_count)
print(f"Rank {rank} ({qubit_count} qubits): {result}")
```
SLURM Job Script¶
cudaq_mpi.sh
MPI parallel job
```bash
#!/bin/bash
#SBATCH --job-name=cudaq_mpi
#SBATCH --partition=GPU
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=2
#SBATCH --gres=gpu:4
#SBATCH --mem=16G
#SBATCH --time=01:00:00
#SBATCH --output=cudaq_mpi_%j.out

source ~/miniconda3/bin/activate
conda activate cudaq

# Use srun with explicit GPU assignment
srun bash -c 'export CUDA_VISIBLE_DEVICES=$SLURM_LOCALID; python mpi_sweep.py'
```
Resource Allocation
- `--ntasks=N`: number of parallel simulations
- `--gres=gpu:N`: request one GPU per task for best performance
- Each MPI rank is assigned to a separate GPU via `CUDA_VISIBLE_DEVICES`
- Use `srun` to properly integrate with SLURM's GPU allocation
Multiple Tasks Per GPU
To run more tasks than GPUs (e.g., 32 tasks on 4 GPUs):
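A sketch of the relevant changes to cudaq_mpi.sh; the modulo divisor must match the number of GPUs requested via --gres:

```bash
#SBATCH --ntasks=32
#SBATCH --gres=gpu:4

# Map each local task to a GPU round-robin (task ID modulo GPU count)
srun bash -c 'export CUDA_VISIBLE_DEVICES=$((SLURM_LOCALID % 4)); python mpi_sweep.py'
```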
This distributes tasks round-robin across GPUs. Multiple tasks sharing the same GPU will time-share GPU resources through CUDA's built-in scheduling.