How to Run Phi 4 on Linux: Step-by-Step Guide (2026)
Linux is the ideal platform for running Phi 4 in production. With unmatched performance, flexibility, and GPU support across NVIDIA and AMD hardware, Linux enables you to deploy powerful local language models at scale. This guide covers everything from initial installation through production-grade optimization.
Why Phi 4 on Linux?
Phi 4 is Microsoft's efficient 14B parameter language model, and Linux is where it truly shines. Linux servers—whether cloud-hosted or on-premise—offer superior GPU utilization, multi-GPU support, and containerization options compared to other platforms.
Key advantages for Linux users:
- Native GPU drivers: Optimal CUDA and ROCm support without middleman abstractions
- Multi-GPU scaling: Easily distribute inference across multiple GPUs
- Container deployment: Docker support for reproducible, portable setups
- Production-grade observability: Full logging, metrics, and monitoring capabilities
- Cost efficiency: Run on cheap commodity hardware or cloud VMs
System Requirements for Linux
Minimum: Any modern Linux distribution (Ubuntu 20.04+, Debian 11+, RHEL/Rocky Linux 8+), 16GB RAM, 20GB disk space
GPU (recommended): NVIDIA (CUDA compute capability 5.0+) or AMD (supported RDNA/CDNA cards via ROCm)
Check your Linux GPU:
lspci | grep -i nvidia # For NVIDIA cards
lspci | grep -i amd # For AMD cards
Verify available RAM:
free -h
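The checks above can be combined into a short preflight script; a minimal sketch, with the 16GB/20GB thresholds mirroring the minimums listed (adjust to taste):

```shell
#!/bin/sh
# Preflight check against the minimum requirements above (sketch).
ram_kb=$(awk '/MemTotal/ {print $2}' /proc/meminfo)   # total RAM in kB
disk_kb=$(df -Pk / | awk 'NR==2 {print $4}')          # free space on / in kB
echo "RAM: $((ram_kb / 1024 / 1024)) GB, free disk on /: $((disk_kb / 1024 / 1024)) GB"
[ "$ram_kb" -ge $((16 * 1024 * 1024)) ] || echo "WARNING: less than 16GB RAM"
[ "$disk_kb" -ge $((20 * 1024 * 1024)) ] || echo "WARNING: less than 20GB free disk"
```

Note that the free-space check looks at the root filesystem; if your models will live on another mount, point `df` at that path instead.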
Step 1: Install Ollama on Linux
Install Ollama using the official installation script:
curl -fsSL https://ollama.com/install.sh | sh
This installs Ollama system-wide and starts the service automatically. Verify installation:
ollama --version
Check service status:
systemctl status ollama
Expected output: Active: active (running)
Step 2: Configure GPU Drivers
For NVIDIA (CUDA):
Install the NVIDIA driver (Ollama bundles its own CUDA runtime, so the full CUDA toolkit and cuDNN are not required; the kernel driver is):
# Ubuntu/Debian
sudo apt update
sudo apt install nvidia-driver-545 # or latest available version
# Verify installation
nvidia-smi
nvidia-smi should display your GPU's memory and compute capability.
For AMD (ROCm):
Install ROCm drivers:
sudo mkdir -p /etc/apt/keyrings
wget -q -O - https://repo.radeon.com/rocm/rocm.gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/rocm.gpg > /dev/null
echo 'deb [arch=amd64 signed-by=/etc/apt/keyrings/rocm.gpg] https://repo.radeon.com/rocm/apt/debian focal main' | sudo tee /etc/apt/sources.list.d/rocm.list
sudo apt update
sudo apt install rocm-dkms
(apt-key is deprecated on current Debian and Ubuntu releases, so the key is stored as a keyring instead; check AMD's ROCm documentation for the package name and release string matching your distribution.)
Add your user to the video and render groups, then log out and back in for the change to take effect:
sudo usermod -a -G video,render $USER
Step 3: Download Phi 4
Pull the Phi 4 model using Ollama:
ollama pull phi4
This downloads ~8.5GB; ollama pull shows its own progress bar, and on a typical server connection it takes a few minutes. To check how much disk space the models occupy:
du -sh ~/.ollama/models/
(With the default systemd service install, models live under /usr/share/ollama/.ollama/models/ instead.)
Step 4: Start the Ollama Server
Ollama runs as a systemd service on Linux. Restart it to ensure GPU acceleration is enabled:
sudo systemctl restart ollama
sudo systemctl status ollama
The service listens on http://localhost:11434. Test it:
curl http://localhost:11434/api/tags
Expected response: JSON listing installed models including phi4.
Step 5: Run Phi 4 Locally
For interactive testing, run Phi 4 directly:
ollama run phi4
This opens a chat session where you type prompts and receive immediate responses accelerated by your GPU:
>>> What is the fastest way to learn Rust?
[Phi 4 responds with detailed learning path...]
Type "/bye" or press Ctrl+D to quit.
Step 6: Integrate Phi 4 via API for Production
Access Phi 4 programmatically from any language through Ollama's HTTP JSON API:
curl -X POST http://localhost:11434/api/generate \
-H "Content-Type: application/json" \
-d '{
"model": "phi4",
"prompt": "Explain distributed systems in 100 words",
"stream": false
}'
Python example:
import requests
response = requests.post('http://localhost:11434/api/generate', json={
'model': 'phi4',
'prompt': 'List five best practices for API design',
'stream': False
})
print(response.json()['response'])
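Both examples above set "stream" to false and wait for the whole answer. With "stream": true, /api/generate instead returns newline-delimited JSON, one object per generated fragment, with the final object marked "done": true. A minimal sketch of reassembling such a stream (the sample lines below are illustrative, not real model output):

```python
import json

def join_stream(ndjson_lines):
    """Reassemble the full answer from an Ollama streaming response.

    Each line is a JSON object carrying a fragment of the answer in
    "response"; the final object is marked "done": true.
    """
    parts = []
    for line in ndjson_lines:
        if not line.strip():
            continue  # skip blank lines
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)

# Simulated stream; in practice, iterate the lines of the HTTP response
# (e.g. requests.post(..., stream=True).iter_lines()):
sample = [
    '{"model":"phi4","response":"Dis","done":false}',
    '{"model":"phi4","response":"tributed systems scale by...","done":true}',
]
print(join_stream(sample))  # prints: Distributed systems scale by...
```

Streaming is what you want for interactive UIs, since tokens can be shown to the user as they arrive instead of after the full generation finishes.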
Advanced Linux Configuration
Multi-GPU Setup:
If your server has multiple GPUs, Ollama can split a model across them when it does not fit in a single card's VRAM; a model that fits on one GPU stays on one. Verify which GPUs are detected with:
nvidia-smi # Shows all detected GPUs
Systemd Service with Custom Environment:
Use a systemd drop-in to customize environment variables for the service (this creates an override file rather than editing the unit directly):
sudo systemctl edit ollama
Add environment variables under a [Service] section, for example:
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_NUM_PARALLEL=2"
Environment="CUDA_VISIBLE_DEVICES=0,1"
OLLAMA_HOST exposes the API beyond localhost, OLLAMA_NUM_PARALLEL controls concurrent requests per model, and CUDA_VISIBLE_DEVICES restricts which GPUs Ollama uses. Apply the change with sudo systemctl restart ollama.
Docker Deployment:
For reproducible, isolated deployments:
docker run -d --gpus all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
docker exec -it ollama ollama run phi4
The first command starts the Ollama server in a container with GPU access (the host needs the NVIDIA Container Toolkit for --gpus to work); the second pulls and runs Phi 4 inside it. The named volume keeps downloaded models across container restarts.
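For longer-lived deployments, the same setup can be expressed as Compose configuration. A sketch, assuming the NVIDIA Container Toolkit is installed on the host (the volume name is arbitrary):

```yaml
# docker-compose.yml (sketch)
services:
  ollama:
    image: ollama/ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama   # persist pulled models across restarts
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
volumes:
  ollama_data:
```

Start it with docker compose up -d, then load the model with docker compose exec ollama ollama run phi4.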
Performance Monitoring
Monitor Phi 4's performance on Linux with standard tools:
watch -n 1 nvidia-smi # GPU utilization (NVIDIA)
top # CPU and memory
nethogs # Network I/O
For production monitoring, integrate with Prometheus/Grafana; Ollama does not expose metrics itself, so typical setups scrape host and GPU exporters and instrument the gateway sitting in front of the API.
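For Prometheus, host and GPU metrics are usually collected via exporters. A scrape-config sketch, assuming node_exporter on its default port 9100 and NVIDIA's DCGM exporter on its default 9400:

```yaml
# prometheus.yml excerpt (sketch)
scrape_configs:
  - job_name: "node"   # host CPU, memory, and disk metrics
    static_configs:
      - targets: ["localhost:9100"]
  - job_name: "dcgm"   # GPU utilization and VRAM metrics
    static_configs:
      - targets: ["localhost:9400"]
```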
Troubleshooting on Linux
Issue: "GPU not available" or model runs on CPU
Solution: Verify GPU drivers with nvidia-smi or rocm-smi. Restart Ollama service: sudo systemctl restart ollama
Issue: Permission denied when pulling models
Solution: Ensure user permissions: sudo chown -R $USER:$USER ~/.ollama
Issue: OOM (out of memory) errors
Solution: Reduce context or batch size, or add swap: sudo fallocate -l 16G /swapfile && sudo chmod 600 /swapfile && sudo mkswap /swapfile && sudo swapon /swapfile
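Swap created this way lasts only until reboot. To make it permanent, add a line for the /swapfile path used above to /etc/fstab:

```
/swapfile none swap sw 0 0
```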
Production Deployment Patterns
Now that Phi 4 runs on Linux, scale to production workflows:
- High-availability clusters: Deploy Phi 4 across multiple Linux servers with load balancing
- Inference API gateway: Wrap Ollama with authentication, rate limiting, and caching
- Vector databases: Pair Phi 4 with Chroma, Milvus, or Qdrant for RAG pipelines
- Batch processing: Leverage Linux parallelism for offline document processing at scale
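The batch-processing pattern can be sketched as a small concurrent client against the local API, stdlib only; the summarize_all helper, prompt template, and worker count here are illustrative, not part of Ollama:

```python
import concurrent.futures
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def _post(url, payload):
    """POST JSON to the Ollama API and return the decoded reply."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

def summarize_all(texts, workers=4, post=_post):
    """Summarize documents concurrently; `post` is injectable for testing."""
    def one(text):
        reply = post(OLLAMA_URL, {
            "model": "phi4",
            "prompt": "Summarize in two sentences:\n" + text,
            "stream": False,
        })
        return reply["response"]
    # Threads are fine here: the work is I/O-bound HTTP requests.
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(one, texts))
```

Keep the worker count in line with what the server can actually run in parallel; requests beyond the server's capacity simply queue.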
Conclusion
Linux is the platform where Phi 4 reaches its full potential. With native GPU support, container orchestration, and production-grade monitoring, you can run state-of-the-art language models locally at any scale—whether you're prototyping on a laptop or deploying enterprise inference clusters. Start with a single machine, then scale horizontally as your workload grows.
Local, powerful, and fully under your control.