How to Install Ollama on Linux (2026): Complete Setup Guide
Installing Ollama on Linux is the foundation for production-grade local AI. Linux provides unmatched flexibility, performance, and GPU support for deploying language models at scale. This guide covers installation on all major distributions, GPU driver setup, and everything you need to run Ollama in production.
Why Ollama on Linux?
Linux is where AI inference reaches peak performance. Native GPU drivers, multi-GPU support, containerization, and production observability make Linux the obvious choice for deploying Ollama at scale.
Linux advantages for Ollama:
- Native driver support: Direct CUDA and ROCm integration without intermediate abstraction layers
- Multi-GPU scaling: Easily distribute models across all available GPUs
- Systemd service: Ollama starts automatically at boot and restarts on failure
- Container deployment: Docker support for reproducible setups across environments
- Cost efficiency: Run on cheap cloud instances (AWS, GCP, Linode) or on-premise servers
- Production monitoring: Full logging, metrics, and observability
System Requirements
Minimum: Any modern Linux distro (Ubuntu 20.04+, CentOS 7+, Debian 11+), 16GB RAM, 20GB disk
GPU (recommended): NVIDIA (compute capability 5.0+) or AMD (RDNA/CDNA)
Check available resources:
free -h # RAM
df -h / # Disk space
lspci | grep -i nvidia # NVIDIA GPU
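Before pulling large models, it helps to estimate how much VRAM a model will need. The sketch below encodes a common rule of thumb (an assumption, not an official Ollama sizing guide): quantized weights take quant_bits/8 bytes per parameter, plus roughly 20% overhead for the KV cache and runtime buffers.

```python
def estimate_vram_gb(params_billion, quant_bits=4, overhead=1.2):
    """Rule-of-thumb VRAM estimate: bytes per parameter times parameter
    count, plus ~20% overhead. A rough planning number only."""
    return params_billion * (quant_bits / 8) * overhead

# A 7B model at 4-bit quantization lands around 4.2 GB
print(round(estimate_vram_gb(7), 1))
```

Actual usage varies with context length and quantization scheme, so treat this as a lower bound when choosing hardware.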
Step 1: Install Ollama on Linux
The official installer handles everything, including systemd service setup:
curl -fsSL https://ollama.com/install.sh | sh
This script:
- Downloads Ollama binary
- Installs to /usr/local/bin/
- Creates systemd service
- Starts Ollama automatically
Verify installation:
ollama --version
Step 2: Install GPU Drivers (NVIDIA CUDA)
For NVIDIA GPUs:
Ubuntu/Debian:
sudo apt update
sudo apt install -y nvidia-driver-545
nvidia-smi # Verify driver installation
CentOS/RHEL (enable NVIDIA's CUDA package repository first; the driver packages are not in the stock repos):
sudo yum install -y nvidia-driver-latest
nvidia-smi
The nvidia-smi command should display your GPU model and available VRAM.
Step 3: Install ROCm for AMD GPUs
For AMD Radeon GPUs:
wget -q -O - https://repo.radeon.com/rocm/rocm.gpg.key | sudo apt-key add -
echo 'deb [arch=amd64] https://repo.radeon.com/rocm/apt/debian focal main' | sudo tee /etc/apt/sources.list.d/rocm.list
sudo apt update
sudo apt install -y rocm-dkms
Note: apt-key is deprecated on Ubuntu 22.04 and later, and the release name (focal) and package names vary by ROCm version. Check AMD's current ROCm documentation for the recommended amdgpu-install method on your distribution.
Add your user to the video group for GPU access:
sudo usermod -a -G video,render $USER # Recent ROCm releases also require the render group
newgrp video # Applies the group change in the current shell only; log out and back in for all sessions
Verify ROCm installation:
rocm-smi
Step 4: Verify Ollama Service Is Running
Check systemd service status:
sudo systemctl status ollama
Expected output: Active: active (running)
If not running, start it:
sudo systemctl start ollama
sudo systemctl enable ollama # Auto-start on reboot
Step 5: Download Your First Model
Pull Llama 2 7B:
ollama pull llama2
Monitor progress and storage:
du -sh ~/.ollama/models/
First-time downloads take 5–15 minutes depending on your connection and model size.
Step 6: Test Ollama Locally
Run an interactive session:
ollama run llama2
Type a test prompt:
>>> What are the best practices for Linux system administration?
[Llama 2 responds with detailed best practices...]
Type "/bye" or press Ctrl+D to quit.
Step 7: Access Ollama via REST API
The Ollama service listens on localhost:11434. Query models programmatically:
curl -X POST http://localhost:11434/api/generate \
-H "Content-Type: application/json" \
-d '{
"model": "llama2",
"prompt": "Explain containerization in 50 words",
"stream": false
}'
Python integration:
import requests

response = requests.post('http://localhost:11434/api/generate', json={
    'model': 'llama2',
    'prompt': 'What is the difference between Docker and Kubernetes?',
    'stream': False,
})
print(response.json()['response'])
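The example above disables streaming. With "stream": true, the API instead returns one JSON object per line as tokens are generated. A minimal sketch of assembling the streamed chunks, based on the documented response/done fields (with requests you would feed it response.iter_lines()):

```python
import json

def collect_stream(lines):
    """Join the 'response' field of each NDJSON chunk until 'done' is true."""
    parts = []
    for line in lines:
        if not line:  # skip any blank keep-alive lines
            continue
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)

# With a live server:
#   r = requests.post(url, json={..., "stream": True}, stream=True)
#   text = collect_stream(r.iter_lines())
```

Streaming lets a UI display tokens as they arrive instead of waiting for the full completion.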
Advanced Linux Configuration
Configure systemd service with custom options:
sudo systemctl edit ollama
Add custom environment variables:
[Service]
Environment="OLLAMA_NUM_GPU=2"
Environment="OLLAMA_MAX_VRAM=24000000000"
# Bind to all interfaces to allow remote connections.
# (systemd does not support trailing comments on Environment= lines, so keep notes on their own lines.)
Environment="OLLAMA_HOST=0.0.0.0:11434"
Restart the service:
sudo systemctl restart ollama
Monitor Ollama logs:
journalctl -u ollama -n 50 -f # Follow logs in real-time
Multi-GPU setup:
Verify all GPUs are detected:
nvidia-smi # List all GPUs
export CUDA_VISIBLE_DEVICES=0,1 # Use GPUs 0 and 1; for the Ollama service, set this via Environment= in the systemd unit, since a shell export does not reach the service
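To confirm programmatically which GPUs the host exposes, nvidia-smi's CSV query mode is easy to parse. A small sketch (the sample output below is illustrative, not from a real machine):

```python
def parse_gpu_csv(output):
    """Parse `nvidia-smi --query-gpu=index,name,memory.total --format=csv,noheader`."""
    gpus = []
    for line in output.strip().splitlines():
        index, name, memory = [field.strip() for field in line.split(",")]
        gpus.append({"index": int(index), "name": name, "memory": memory})
    return gpus

sample = "0, NVIDIA GeForce RTX 3090, 24576 MiB\n1, NVIDIA GeForce RTX 3090, 24576 MiB"
print(len(parse_gpu_csv(sample)))  # number of GPUs detected
```

In a deployment script you would replace the sample string with the actual command output (e.g. via subprocess.run).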
Deploy Ollama in Docker
For reproducible, isolated deployments:
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
This creates a containerized Ollama instance with full GPU access, accessible at localhost:11434. The -v flag persists downloaded models in a named volume so they survive container removal, and GPU passthrough requires the NVIDIA Container Toolkit on the host.
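The same deployment can be expressed as a Docker Compose file, which keeps the port, volume, and GPU settings in version control. A sketch assuming the NVIDIA Container Toolkit is installed on the host (the volume name ollama is arbitrary):

```yaml
services:
  ollama:
    image: ollama/ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama:/root/.ollama   # persist downloaded models across restarts
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
volumes:
  ollama:
```

Start it with docker compose up -d, then pull models inside the container: docker exec -it <container> ollama pull llama2.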
Performance Monitoring and Optimization
Monitor real-time GPU usage:
watch -n 1 nvidia-smi # NVIDIA
watch -n 1 rocm-smi # AMD
Monitor CPU and memory:
top
Benchmark model performance (tokens/sec):
ollama run llama2 --verbose "Generate 200 tokens about artificial intelligence" # --verbose prints timing statistics, including the eval rate in tokens/sec
Typical performance: 10–20 tokens/sec on consumer GPUs, 50+ tokens/sec on high-end GPUs.
Explore Available Models
Qwen2.5 14B: Best for balanced performance and quality
ollama pull qwen2.5:14b # The bare qwen2.5 tag pulls the smaller 7B default
Phi 4: Most efficient; excellent for edge/lower VRAM
ollama pull phi4
Mistral 7B: Fastest inference
ollama pull mistral
Llama 2 70B: Highest quality reasoning (the 4-bit download is roughly 39GB; full GPU offload needs 40GB+ VRAM, and 24GB cards can only partially offload)
ollama pull llama2:70b
Expose Ollama to Remote Clients
By default, Ollama listens only on localhost. To allow remote access, edit the systemd service:
sudo systemctl edit ollama
Add:
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Restart:
sudo systemctl restart ollama
Now remote machines can query your Ollama server:
curl -X POST http://[server-ip]:11434/api/generate -d '{...}'
Warning: Ollama has no built-in authentication, so restrict the exposed port with a firewall or place an authenticating reverse proxy (with TLS) in front of it before opening it to a network.
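The same remote call from Python can be kept tidy by assembling the request first. build_generate_request is a hypothetical helper (not part of any Ollama SDK), and 192.0.2.10 is a placeholder address:

```python
def build_generate_request(host, model, prompt):
    """Assemble the URL and JSON body for Ollama's /api/generate endpoint."""
    url = f"http://{host}/api/generate"
    payload = {"model": model, "prompt": prompt, "stream": False}
    return url, payload

url, payload = build_generate_request("192.0.2.10:11434", "llama2", "Hello")
# Then, e.g.: requests.post(url, json=payload, timeout=120).json()["response"]
```

Keeping the host configurable makes it trivial to point the same client at localhost during development and at the server in production.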
Troubleshooting Linux Installation
Issue: GPU not detected / model runs on CPU
Solution: Update the GPU drivers through your distribution's package manager (the same apt/yum commands from Steps 2-3), then restart the systemd service: sudo systemctl restart ollama
Issue: Permission denied accessing models
Solution: sudo chown -R $USER:$USER ~/.ollama
Issue: Out of memory errors
Solution: Reduce the context size, e.g. with /set parameter num_ctx 2048 inside an ollama run session, or PARAMETER num_ctx 2048 in a Modelfile. Or use a smaller model.
Issue: Ollama fails to start
Solution: Check logs: journalctl -u ollama -n 100. Common causes: GPU driver issues, disk space, or permission problems.
Production Deployment Patterns
Once Ollama runs reliably, scale to production:
- Load balancing: Deploy multiple Ollama servers behind a reverse proxy
- Model serving: Use Kubernetes to orchestrate Ollama containers
- Monitoring: Export metrics to Prometheus; visualize in Grafana
- Caching: Cache common inference results to reduce compute
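The caching bullet above can be sketched in a few lines. InferenceCache is a hypothetical helper, and generate_fn stands in for whatever function actually calls the Ollama API:

```python
import hashlib

class InferenceCache:
    """Cache completed generations keyed by (model, prompt)."""

    def __init__(self):
        self._store = {}

    def _key(self, model, prompt):
        # Hash the pair so arbitrarily long prompts make compact keys
        return hashlib.sha256(f"{model}\x00{prompt}".encode()).hexdigest()

    def get_or_generate(self, model, prompt, generate_fn):
        key = self._key(model, prompt)
        if key not in self._store:
            self._store[key] = generate_fn(model, prompt)
        return self._store[key]
```

An in-memory dict only helps a single process; for a fleet of Ollama servers behind a load balancer, the same idea applies with a shared store such as Redis. Note that caching only pays off for deterministic requests (e.g. temperature 0).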
Conclusion
Ollama on Linux is the gold standard for local AI deployment. From development on a single machine to production clusters running across dozens of servers, Linux and Ollama provide the flexibility, performance, and control needed for enterprise-grade AI inference.
Install Ollama today. Unlock your infrastructure's AI potential.