How to Install Ollama on Linux (2026): Complete Setup Guide
Installing Ollama on Linux is the foundation for production-grade local AI. Linux provides unmatched flexibility, performance, and GPU support for deploying language models at scale. This guide covers installation on all major distributions, GPU driver setup, and everything you need to run Ollama in production.
Why Ollama on Linux?
Linux is where AI inference reaches peak performance. Native GPU drivers, multi-GPU support, containerization, and production observability make Linux the obvious choice for deploying Ollama at scale.
Linux advantages for Ollama:
- Native driver support: Direct CUDA and ROCm integration without intermediate abstraction layers
- Multi-GPU scaling: Easily distribute models across all available GPUs
- Systemd service: Ollama starts automatically at boot and restarts on failure
- Container deployment: Docker support for reproducible setups across environments
- Cost efficiency: Run on cheap cloud instances (AWS, GCP, Linode) or on-premise servers
- Production monitoring: Full logging, metrics, and observability
System Requirements
Minimum: Any modern Linux distro (Ubuntu 20.04+, CentOS 7+, Debian 11+), 16GB RAM, 20GB disk
GPU (recommended): NVIDIA (compute capability 5.0+) or AMD (RDNA/CDNA)
Check available resources:
free -h # RAM
df -h / # Disk space
lspci | grep -i nvidia # NVIDIA GPU
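Before pulling large models, it helps to estimate how much VRAM a model will need. The sketch below encodes a common rule of thumb (an assumption, not an official Ollama sizing guide): quantized weights take quant_bits/8 bytes per parameter, plus roughly 20% overhead for the KV cache and runtime buffers.

```python
def estimate_vram_gb(params_billion, quant_bits=4, overhead=1.2):
    """Rule-of-thumb VRAM estimate: bytes per parameter times parameter
    count, plus ~20% overhead. A rough planning number only."""
    return params_billion * (quant_bits / 8) * overhead

# A 7B model at 4-bit quantization lands around 4.2 GB
print(round(estimate_vram_gb(7), 1))
```

Actual usage varies with context length and quantization scheme, so treat this as a lower bound when choosing hardware.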
Step 1: Install Ollama on Linux
The official installer handles everything, including systemd service setup:
curl -fsSL https://ollama.com/install.sh | sh
This script:
- Downloads Ollama binary
- Installs to /usr/local/bin/
- Creates systemd service
- Starts Ollama automatically
Verify installation:
ollama --version
Step 2: Install GPU Drivers (NVIDIA CUDA)
For NVIDIA GPUs:
Ubuntu/Debian:
sudo apt update
sudo apt install -y nvidia-driver-545
nvidia-smi # Verify driver installation
CentOS/RHEL (enable NVIDIA's CUDA package repository first; the driver packages are not in the stock repos):
sudo yum install -y nvidia-driver-latest
nvidia-smi
The nvidia-smi command should display your GPU model and available VRAM.
Step 3: Install ROCm for AMD GPUs
For AMD Radeon GPUs:
wget -q -O - https://repo.radeon.com/rocm/rocm.gpg.key | sudo apt-key add -
echo 'deb [arch=amd64] https://repo.radeon.com/rocm/apt/debian focal main' | sudo tee /etc/apt/sources.list.d/rocm.list
sudo apt update
sudo apt install -y rocm-dkms
Note: apt-key is deprecated on Ubuntu 22.04 and later, and the release name (focal) and package names vary by ROCm version. Check AMD's current ROCm documentation for the recommended amdgpu-install method on your distribution.
Add your user to the video group for GPU access:
sudo usermod -a -G video,render $USER # Recent ROCm releases also require the render group
newgrp video # Applies the group change in the current shell only; log out and back in for all sessions
Verify ROCm installation:
rocm-smi
Step 4: Verify Ollama Service Is Running
Check systemd service status:
sudo systemctl status ollama
Expected output: Active: active (running)
If not running, start it:
sudo systemctl start ollama
sudo systemctl enable ollama # Auto-start on reboot
Step 5: Download Your First Model
Pull Llama 2 7B:
ollama pull llama2
Monitor progress and storage:
du -sh ~/.ollama/models/
First-time downloads take 5–15 minutes depending on your connection and model size.
Step 6: Test Ollama Locally
Run an interactive session:
ollama run llama2
Type a test prompt:
>>> What are the best practices for Linux system administration?
[Llama 2 responds with detailed best practices...]
Type "/bye" or press Ctrl+D to quit.
Step 7: Access Ollama via REST API
The Ollama service listens on localhost:11434. Query models programmatically:
curl -X POST http://localhost:11434/api/generate \
-H "Content-Type: application/json" \
-d '{
"model": "llama2",
"prompt": "Explain containerization in 50 words",
"stream": false
}'
Python integration:
import requests

response = requests.post('http://localhost:11434/api/generate', json={
    'model': 'llama2',
    'prompt': 'What is the difference between Docker and Kubernetes?',
    'stream': False,
})
print(response.json()['response'])
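The example above disables streaming. With "stream": true, the API instead returns one JSON object per line as tokens are generated. A minimal sketch of assembling the streamed chunks, based on the documented response/done fields (with requests you would feed it response.iter_lines()):

```python
import json

def collect_stream(lines):
    """Join the 'response' field of each NDJSON chunk until 'done' is true."""
    parts = []
    for line in lines:
        if not line:  # skip any blank keep-alive lines
            continue
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)

# With a live server:
#   r = requests.post(url, json={..., "stream": True}, stream=True)
#   text = collect_stream(r.iter_lines())
```

Streaming lets a UI display tokens as they arrive instead of waiting for the full completion.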
Advanced Linux Configuration
Configure systemd service with custom options:
sudo systemctl edit ollama
Add custom environment variables:
[Service]
Environment="OLLAMA_NUM_GPU=2"
Environment="OLLAMA_MAX_VRAM=24000000000"
# Bind to all interfaces to allow remote connections.
# (systemd does not support trailing comments on Environment= lines, so keep notes on their own lines.)
Environment="OLLAMA_HOST=0.0.0.0:11434"
Restart the service:
sudo systemctl restart ollama
Monitor Ollama logs:
journalctl -u ollama -n 50 -f # Follow logs in real-time
Multi-GPU setup:
Verify all GPUs are detected:
nvidia-smi # List all GPUs
export CUDA_VISIBLE_DEVICES=0,1 # Use GPUs 0 and 1; for the Ollama service, set this via Environment= in the systemd unit, since a shell export does not reach the service
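To confirm programmatically which GPUs the host exposes, nvidia-smi's CSV query mode is easy to parse. A small sketch (the sample output below is illustrative, not from a real machine):

```python
def parse_gpu_csv(output):
    """Parse `nvidia-smi --query-gpu=index,name,memory.total --format=csv,noheader`."""
    gpus = []
    for line in output.strip().splitlines():
        index, name, memory = [field.strip() for field in line.split(",")]
        gpus.append({"index": int(index), "name": name, "memory": memory})
    return gpus

sample = "0, NVIDIA GeForce RTX 3090, 24576 MiB\n1, NVIDIA GeForce RTX 3090, 24576 MiB"
print(len(parse_gpu_csv(sample)))  # number of GPUs detected
```

In a deployment script you would replace the sample string with the actual command output (e.g. via subprocess.run).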
Deploy Ollama in Docker
For reproducible, isolated deployments:
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
This creates a containerized Ollama instance with full GPU access, accessible at localhost:11434. The -v flag persists downloaded models in a named volume so they survive container removal, and GPU passthrough requires the NVIDIA Container Toolkit on the host.
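The same deployment can be expressed as a Docker Compose file, which keeps the port, volume, and GPU settings in version control. A sketch assuming the NVIDIA Container Toolkit is installed on the host (the volume name ollama is arbitrary):

```yaml
services:
  ollama:
    image: ollama/ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama:/root/.ollama   # persist downloaded models across restarts
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
volumes:
  ollama:
```

Start it with docker compose up -d, then pull models inside the container: docker exec -it <container> ollama pull llama2.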
Performance Monitoring and Optimization
Monitor real-time GPU usage:
watch -n 1 nvidia-smi # NVIDIA
watch -n 1 rocm-smi # AMD
Monitor CPU and memory:
top
Benchmark model performance (tokens/sec):
ollama run llama2 --verbose "Generate 200 tokens about artificial intelligence" # --verbose prints timing statistics, including the eval rate in tokens/sec
Typical performance: 10–20 tokens/sec on consumer GPUs, 50+ tokens/sec on high-end GPUs.
Explore Available Models
Qwen2.5 14B: Best for balanced performance and quality
ollama pull qwen2.5:14b # The bare qwen2.5 tag pulls the smaller 7B default
Phi 4: Most efficient; excellent for edge/lower VRAM
ollama pull phi4
Mistral 7B: Fastest inference
ollama pull mistral
Llama 2 70B: Highest quality reasoning (the 4-bit download is roughly 39GB; full GPU offload needs 40GB+ VRAM, and 24GB cards can only partially offload)
ollama pull llama2:70b
Expose Ollama to Remote Clients
By default, Ollama listens only on localhost. To allow remote access, edit the systemd service:
sudo systemctl edit ollama
Add:
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Restart:
sudo systemctl restart ollama
Now remote machines can query your Ollama server:
curl -X POST http://[server-ip]:11434/api/generate -d '{...}'
Warning: Ollama has no built-in authentication, so restrict the exposed port with a firewall or place an authenticating reverse proxy (with TLS) in front of it before opening it to a network.
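The same remote call from Python can be kept tidy by assembling the request first. build_generate_request is a hypothetical helper (not part of any Ollama SDK), and 192.0.2.10 is a placeholder address:

```python
def build_generate_request(host, model, prompt):
    """Assemble the URL and JSON body for Ollama's /api/generate endpoint."""
    url = f"http://{host}/api/generate"
    payload = {"model": model, "prompt": prompt, "stream": False}
    return url, payload

url, payload = build_generate_request("192.0.2.10:11434", "llama2", "Hello")
# Then, e.g.: requests.post(url, json=payload, timeout=120).json()["response"]
```

Keeping the host configurable makes it trivial to point the same client at localhost during development and at the server in production.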
Troubleshooting Linux Installation
Issue: GPU not detected / model runs on CPU
Solution: Update the GPU drivers through your distribution's package manager (the same apt/yum commands from Steps 2-3), then restart the systemd service: sudo systemctl restart ollama
Issue: Permission denied accessing models
Solution: sudo chown -R $USER:$USER ~/.ollama
Issue: Out of memory errors
Solution: Reduce the context size, e.g. with /set parameter num_ctx 2048 inside an ollama run session, or PARAMETER num_ctx 2048 in a Modelfile. Or use a smaller model.
Issue: Ollama fails to start
Solution: Check logs: journalctl -u ollama -n 100. Common causes: GPU driver issues, disk space, or permission problems.
Production Deployment Patterns
Once Ollama runs reliably, scale to production:
- Load balancing: Deploy multiple Ollama servers behind a reverse proxy
- Model serving: Use Kubernetes to orchestrate Ollama containers
- Monitoring: Export metrics to Prometheus; visualize in Grafana
- Caching: Cache common inference results to reduce compute
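The caching bullet above can be sketched in a few lines. InferenceCache is a hypothetical helper, and generate_fn stands in for whatever function actually calls the Ollama API:

```python
import hashlib

class InferenceCache:
    """Cache completed generations keyed by (model, prompt)."""

    def __init__(self):
        self._store = {}

    def _key(self, model, prompt):
        # Hash the pair so arbitrarily long prompts make compact keys
        return hashlib.sha256(f"{model}\x00{prompt}".encode()).hexdigest()

    def get_or_generate(self, model, prompt, generate_fn):
        key = self._key(model, prompt)
        if key not in self._store:
            self._store[key] = generate_fn(model, prompt)
        return self._store[key]
```

An in-memory dict only helps a single process; for a fleet of Ollama servers behind a load balancer, the same idea applies with a shared store such as Redis. Note that caching only pays off for deterministic requests (e.g. temperature 0).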
Conclusion
Ollama on Linux is the gold standard for local AI deployment. From development on a single machine to production clusters running across dozens of servers, Linux and Ollama provide the flexibility, performance, and control needed for enterprise-grade AI inference.
Install Ollama today. Unlock your infrastructure's AI potential.