
Ollama

What is Ollama?

Ollama is a lightweight framework for running large language models (LLMs) locally on your own hardware. It provides a simple CLI and REST API for downloading and interacting with open-source models such as Llama, Mistral, Qwen, Gemma, and many others — without sending data to external servers.

Base Environment

Ollama is provided via the environment modules system.

GPU Nodes Only

Ollama is only available on GPU nodes.

You must first start an interactive GPU session or submit a GPU job before loading the module:

srun -p GPU --gres=gpu:1 --pty bash

Loading Ollama

Load module

module load ollama/0.17.7

Best Practice

Always verify the loaded version:

ollama --version

Available Models

To see all models currently available on the cluster:

List available models

ollama list

Browse all models available for download at https://ollama.com/search.

Pulling a Model

Pull a model

ollama pull llama3.2

Specify a variant

Many models have multiple size variants. You can specify one explicitly:

ollama pull llama3.2:1b
ollama pull qwen2.5:7b

Running a Model

Interactive mode

Start an interactive chat session

ollama run llama3.2

Type /bye to exit the session.

Single prompt (non-interactive)

Run a single prompt

ollama run llama3.2 "Explain what a SLURM job scheduler does"

Using the REST API

Ollama exposes a REST API at http://localhost:11434. This is the recommended way to interact with Ollama programmatically.

Basic API request

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Explain what a SLURM job scheduler does",
  "stream": false
}'

Chat API

The /api/chat endpoint accepts a list of chat messages. (Ollama also exposes an OpenAI-compatible endpoint at http://localhost:11434/v1/chat/completions for use with OpenAI client libraries.)

curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [
    { "role": "user", "content": "Explain what a SLURM job scheduler does" }
  ],
  "stream": false
}'

Use the API over the CLI for scripting

The CLI (ollama run) emits terminal control sequences (progress spinners, cursor movement) that show up as garbled text when its output is captured in scripts or logs. The REST API returns clean JSON and is preferred for any programmatic use.
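For example, a short Python script can call the generate endpoint using only the standard library. This is a minimal sketch: the helper functions and the model/prompt values are illustrative, not part of the Ollama client.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_payload(model, prompt):
    """Serialize a non-streaming /api/generate request to JSON bytes."""
    return json.dumps(
        {"model": model, "prompt": prompt, "stream": False}
    ).encode("utf-8")

def extract_text(response_body):
    """Pull the generated text out of a /api/generate JSON response."""
    return json.loads(response_body)["response"]

def generate(model, prompt):
    """POST the prompt to the local Ollama server and return the reply text."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=build_payload(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return extract_text(resp.read())

if __name__ == "__main__":
    # Requires a running Ollama server on the same node
    print(generate("llama3.2", "Explain what a SLURM job scheduler does"))
```

Because the response is plain JSON, the output can be logged or post-processed without any terminal-escape cleanup.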

Using Ollama in Python

GPU Nodes Only

Ollama is only available on GPU nodes. Python scripts using Ollama must be submitted via sbatch with the GPU partition.

Install the Python client

pip install ollama

example.py

Basic Python usage

import ollama

response = ollama.chat(
    model='llama3.2',
    messages=[
        {'role': 'user', 'content': 'Explain what a SLURM job scheduler does'}
    ]
)

print(response['message']['content'])
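The client also supports streaming replies by passing stream=True, which yields the response chunk by chunk instead of waiting for the full answer. A minimal sketch; the accumulation helper assumes each chunk carries a message.content field, matching the response shape used above:

```python
def collect_stream(chunks):
    """Concatenate streamed chat chunks (dicts with message.content) into one string."""
    return "".join(chunk["message"]["content"] for chunk in chunks)

if __name__ == "__main__":
    import ollama  # requires: pip install ollama, and a running Ollama server

    stream = ollama.chat(
        model="llama3.2",
        messages=[
            {"role": "user", "content": "Explain what a SLURM job scheduler does"}
        ],
        stream=True,  # yield partial responses as they are generated
    )
    for chunk in stream:
        # Print each fragment as it arrives, without waiting for completion
        print(chunk["message"]["content"], end="", flush=True)
```

Streaming is useful in interactive tools; for batch jobs that only need the final answer, the non-streaming form above is simpler.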

ollama_job.sh

Example SLURM job

#!/bin/bash
#SBATCH --job-name=ollama_test
#SBATCH --partition=GPU
#SBATCH --nodes=1
#SBATCH --gres=gpu:1
#SBATCH --mem=16G
#SBATCH --time=01:00:00
#SBATCH --output=ollama_%j.out

module load ollama/0.17.7

source ~/miniconda3/bin/activate
conda activate myenv  # change to your environment name, or remove if using base

python example.py

Submit job

sbatch ollama_job.sh
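On some systems the module only places the ollama binary on your PATH and does not start the background server, in which case client or API calls fail with a connection-refused error. If that happens on your cluster, a hedged variant of the job script that launches the server itself (the sleep duration is illustrative; adjust as needed):

```shell
#!/bin/bash
#SBATCH --job-name=ollama_test
#SBATCH --partition=GPU
#SBATCH --nodes=1
#SBATCH --gres=gpu:1
#SBATCH --mem=16G
#SBATCH --time=01:00:00
#SBATCH --output=ollama_%j.out

module load ollama/0.17.7

# Start the Ollama server in the background and give it time to bind its port
ollama serve &
SERVER_PID=$!
sleep 10

source ~/miniconda3/bin/activate
conda activate myenv  # change to your environment name, or remove if using base

python example.py

# Shut the server down once the job's work is done
kill "$SERVER_PID"
```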

Model Information

To inspect details about a specific model (architecture, parameters, quantization, capabilities):

Show model details

ollama show llama3.2 --verbose

Capabilities

Models on Ollama may support different capabilities:

Capability    Description
completion    Standard text generation
tools         Function/tool calling support
vision        Image input support
thinking      Extended reasoning/chain-of-thought
embedding     Text embedding generation

More Information

Documentation

Official Ollama documentation:

https://docs.ollama.com

Available models:

https://ollama.com/search