
Ollama

What is Ollama?

Ollama is a lightweight framework for running large language models (LLMs) locally on your own hardware. It provides a simple CLI and REST API for downloading and interacting with open-source models such as Llama, Mistral, Qwen, Gemma, and many others — without sending data to external servers.

Base Environment

Ollama is provided via the environment modules system.

GPU Nodes Only

Ollama is only available on GPU nodes.

You must first start an interactive GPU session or submit a GPU job before loading the module:

srun -p GPU --gres=gpu:1 --pty bash

Loading Ollama

Load module

module load ollama/0.17.7

Best Practice

Always verify the loaded version:

ollama --version

Available Models

To see all models currently available on the cluster:

List available models

ollama list

Browse all models available for download at https://ollama.com/search.

Pulling a Model

Pull a model

ollama pull llama3.2

Specify a variant

Many models have multiple size variants. You can specify one explicitly:

ollama pull llama3.2:1b
ollama pull qwen2.5:7b

Running a Model

Interactive mode

Start an interactive chat session

ollama run llama3.2

Type /bye to exit the session.

Single prompt (non-interactive)

Run a single prompt

ollama run llama3.2 "Explain what a SLURM job scheduler does"

Using the REST API

Ollama exposes a REST API at http://localhost:11434. This is the recommended way to interact with Ollama programmatically.

Basic API request

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Explain what a SLURM job scheduler does",
  "stream": false
}'

Chat API

The /api/chat endpoint accepts a list of chat messages. (Ollama also exposes an OpenAI-compatible endpoint at http://localhost:11434/v1/chat/completions for use with OpenAI client libraries.)

curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [
    { "role": "user", "content": "Explain what a SLURM job scheduler does" }
  ],
  "stream": false
}'

Use the API over the CLI for scripting

The CLI (ollama run) emits terminal control sequences (progress spinners, cursor movement) that show up as garbled text when its output is captured in scripts or logs. The REST API returns clean JSON and is preferred for any programmatic use.
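For example, a short Python script can call the generate endpoint using only the standard library. This is a minimal sketch: the helper functions and the model/prompt values are illustrative, not part of the Ollama client.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_payload(model, prompt):
    """Serialize a non-streaming /api/generate request to JSON bytes."""
    return json.dumps(
        {"model": model, "prompt": prompt, "stream": False}
    ).encode("utf-8")

def extract_text(response_body):
    """Pull the generated text out of a /api/generate JSON response."""
    return json.loads(response_body)["response"]

def generate(model, prompt):
    """POST the prompt to the local Ollama server and return the reply text."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=build_payload(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return extract_text(resp.read())

if __name__ == "__main__":
    # Requires a running Ollama server on the same node
    print(generate("llama3.2", "Explain what a SLURM job scheduler does"))
```

Because the response is plain JSON, the output can be logged or post-processed without any terminal-escape cleanup.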

Using Ollama in Python

GPU Nodes Only

Ollama is only available on GPU nodes. Python scripts using Ollama must be submitted via sbatch with the GPU partition.

Install the Python client

pip install ollama

example.py

Basic Python usage

import ollama

response = ollama.chat(
    model='llama3.2',
    messages=[
        {'role': 'user', 'content': 'Explain what a SLURM job scheduler does'}
    ]
)

print(response['message']['content'])
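The client also supports streaming replies by passing stream=True, which yields the response chunk by chunk instead of waiting for the full answer. A minimal sketch; the accumulation helper assumes each chunk carries a message.content field, matching the response shape used above:

```python
def collect_stream(chunks):
    """Concatenate streamed chat chunks (dicts with message.content) into one string."""
    return "".join(chunk["message"]["content"] for chunk in chunks)

if __name__ == "__main__":
    import ollama  # requires: pip install ollama, and a running Ollama server

    stream = ollama.chat(
        model="llama3.2",
        messages=[
            {"role": "user", "content": "Explain what a SLURM job scheduler does"}
        ],
        stream=True,  # yield partial responses as they are generated
    )
    for chunk in stream:
        # Print each fragment as it arrives, without waiting for completion
        print(chunk["message"]["content"], end="", flush=True)
```

Streaming is useful in interactive tools; for batch jobs that only need the final answer, the non-streaming form above is simpler.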

ollama_job.sh

Example SLURM job

#!/bin/bash
#SBATCH --job-name=ollama_test
#SBATCH --partition=GPU
#SBATCH --nodes=1
#SBATCH --gres=gpu:1
#SBATCH --mem=16G
#SBATCH --time=01:00:00
#SBATCH --output=ollama_%j.out

module load ollama/0.17.7

source ~/miniconda3/bin/activate
conda activate myenv  # change to your environment name, or remove if using base

python example.py

Submit job

sbatch ollama_job.sh
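On some systems the module only places the ollama binary on your PATH and does not start the background server, in which case client or API calls fail with a connection-refused error. If that happens on your cluster, a hedged variant of the job script that launches the server itself (the sleep duration is illustrative; adjust as needed):

```shell
#!/bin/bash
#SBATCH --job-name=ollama_test
#SBATCH --partition=GPU
#SBATCH --nodes=1
#SBATCH --gres=gpu:1
#SBATCH --mem=16G
#SBATCH --time=01:00:00
#SBATCH --output=ollama_%j.out

module load ollama/0.17.7

# Start the Ollama server in the background and give it time to bind its port
ollama serve &
SERVER_PID=$!
sleep 10

source ~/miniconda3/bin/activate
conda activate myenv  # change to your environment name, or remove if using base

python example.py

# Shut the server down once the job's work is done
kill "$SERVER_PID"
```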

Model Information

To inspect details about a specific model (architecture, parameters, quantization, capabilities):

Show model details

ollama show llama3.2 --verbose

Capabilities

Models on Ollama may support different capabilities:

Capability    Description
completion    Standard text generation
tools         Function/tool calling support
vision        Image input support
thinking      Extended reasoning/chain-of-thought
embedding     Text embedding generation

More Information

Documentation

Official Ollama documentation:

https://docs.ollama.com

Available models:

https://ollama.com/search