How to Run Phi 4 on Linux: Step-by-Step Guide (2026)
Linux is the ideal platform for running Phi 4 in production. With unmatched performance, flexibility, and GPU support across NVIDIA and AMD hardware, Linux enables you to deploy powerful local language models at scale. This guide covers everything from initial installation through production-grade optimization.
Why Phi 4 on Linux?
Phi 4 is Microsoft's efficient 14B parameter language model, and Linux is where it truly shines. Linux servers—whether cloud-hosted or on-premise—offer superior GPU utilization, multi-GPU support, and containerization options compared to other platforms.
Key advantages for Linux users:
- Native GPU drivers: Optimal CUDA and ROCm support without middleman abstractions
- Multi-GPU scaling: Easily distribute inference across multiple GPUs
- Container deployment: Docker support for reproducible, portable setups
- Production-grade observability: Full logging, metrics, and monitoring capabilities
- Cost efficiency: Run on cheap commodity hardware or cloud VMs
System Requirements for Linux
Minimum: Any modern Linux distribution (Ubuntu 20.04+, Debian 11+, RHEL/Rocky Linux 8+), 16GB RAM, 20GB disk space
GPU (recommended): NVIDIA (CUDA compute capability 5.0+) or AMD (supported RDNA/CDNA cards via ROCm)
Check your Linux GPU:
lspci | grep -i nvidia # For NVIDIA cards
lspci | grep -i amd # For AMD cards
Verify available RAM:
free -h
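The checks above can be combined into a short preflight script; a minimal sketch, with the 16GB/20GB thresholds mirroring the minimums listed (adjust to taste):

```shell
#!/bin/sh
# Preflight check against the minimum requirements above (sketch).
ram_kb=$(awk '/MemTotal/ {print $2}' /proc/meminfo)   # total RAM in kB
disk_kb=$(df -Pk / | awk 'NR==2 {print $4}')          # free space on / in kB
echo "RAM: $((ram_kb / 1024 / 1024)) GB, free disk on /: $((disk_kb / 1024 / 1024)) GB"
[ "$ram_kb" -ge $((16 * 1024 * 1024)) ] || echo "WARNING: less than 16GB RAM"
[ "$disk_kb" -ge $((20 * 1024 * 1024)) ] || echo "WARNING: less than 20GB free disk"
```

Note that the free-space check looks at the root filesystem; if your models will live on another mount, point `df` at that path instead.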
Step 1: Install Ollama on Linux
Install Ollama using the official installation script:
curl -fsSL https://ollama.com/install.sh | sh
This installs Ollama system-wide and starts the service automatically. Verify installation:
ollama --version
Check service status:
systemctl status ollama
Expected output: Active: active (running)
Step 2: Configure GPU Drivers
For NVIDIA (CUDA):
Install the NVIDIA driver (Ollama bundles its own CUDA runtime, so the full CUDA toolkit and cuDNN are not required; the kernel driver is):
# Ubuntu/Debian
sudo apt update
sudo apt install nvidia-driver-545 # or latest available version
# Verify installation
nvidia-smi
nvidia-smi should display your GPU's memory and compute capability.
For AMD (ROCm):
Install ROCm drivers:
sudo mkdir -p /etc/apt/keyrings
wget -q -O - https://repo.radeon.com/rocm/rocm.gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/rocm.gpg > /dev/null
echo 'deb [arch=amd64 signed-by=/etc/apt/keyrings/rocm.gpg] https://repo.radeon.com/rocm/apt/debian focal main' | sudo tee /etc/apt/sources.list.d/rocm.list
sudo apt update
sudo apt install rocm-dkms
(apt-key is deprecated on current Debian and Ubuntu releases, so the key is stored as a keyring instead; check AMD's ROCm documentation for the package name and release string matching your distribution.)
Add your user to the video and render groups, then log out and back in for the change to take effect:
sudo usermod -a -G video,render $USER
Step 3: Download Phi 4
Pull the Phi 4 model using Ollama:
ollama pull phi4
This downloads ~8.5GB; ollama pull shows its own progress bar, and on a typical server connection it takes a few minutes. To check how much disk space the models occupy:
du -sh ~/.ollama/models/
(With the default systemd service install, models live under /usr/share/ollama/.ollama/models/ instead.)
Step 4: Start the Ollama Server
Ollama runs as a systemd service on Linux. Restart it to ensure GPU acceleration is enabled:
sudo systemctl restart ollama
sudo systemctl status ollama
The service listens on http://localhost:11434. Test it:
curl http://localhost:11434/api/tags
Expected response: JSON listing installed models including phi4.
Step 5: Run Phi 4 Locally
For interactive testing, run Phi 4 directly:
ollama run phi4
This opens a chat session where you type prompts and receive immediate responses accelerated by your GPU:
>>> What is the fastest way to learn Rust?
[Phi 4 responds with detailed learning path...]
Type "/bye" or press Ctrl+D to quit.
Step 6: Integrate Phi 4 via API for Production
Access Phi 4 programmatically from any language through Ollama's HTTP JSON API:
curl -X POST http://localhost:11434/api/generate \
-H "Content-Type: application/json" \
-d '{
"model": "phi4",
"prompt": "Explain distributed systems in 100 words",
"stream": false
}'
Python example:
import requests
response = requests.post('http://localhost:11434/api/generate', json={
'model': 'phi4',
'prompt': 'List five best practices for API design',
'stream': False
})
print(response.json()['response'])
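Both examples above set "stream" to false and wait for the whole answer. With "stream": true, /api/generate instead returns newline-delimited JSON, one object per generated fragment, with the final object marked "done": true. A minimal sketch of reassembling such a stream (the sample lines below are illustrative, not real model output):

```python
import json

def join_stream(ndjson_lines):
    """Reassemble the full answer from an Ollama streaming response.

    Each line is a JSON object carrying a fragment of the answer in
    "response"; the final object is marked "done": true.
    """
    parts = []
    for line in ndjson_lines:
        if not line.strip():
            continue  # skip blank lines
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)

# Simulated stream; in practice, iterate the lines of the HTTP response
# (e.g. requests.post(..., stream=True).iter_lines()):
sample = [
    '{"model":"phi4","response":"Dis","done":false}',
    '{"model":"phi4","response":"tributed systems scale by...","done":true}',
]
print(join_stream(sample))  # prints: Distributed systems scale by...
```

Streaming is what you want for interactive UIs, since tokens can be shown to the user as they arrive instead of after the full generation finishes.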
Advanced Linux Configuration
Multi-GPU Setup:
If your server has multiple GPUs, Ollama can split a model across them when it does not fit in a single card's VRAM; a model that fits on one GPU stays on one. Verify which GPUs are detected with:
nvidia-smi # Shows all detected GPUs
Systemd Service with Custom Environment:
Use a systemd drop-in to customize environment variables for the service (this creates an override file rather than editing the unit directly):
sudo systemctl edit ollama
Add environment variables under a [Service] section, for example:
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_NUM_PARALLEL=2"
Environment="CUDA_VISIBLE_DEVICES=0,1"
OLLAMA_HOST exposes the API beyond localhost, OLLAMA_NUM_PARALLEL controls concurrent requests per model, and CUDA_VISIBLE_DEVICES restricts which GPUs Ollama uses. Apply the change with sudo systemctl restart ollama.
Docker Deployment:
For reproducible, isolated deployments:
docker run -d --gpus all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
docker exec -it ollama ollama run phi4
The first command starts the Ollama server in a container with GPU access (the host needs the NVIDIA Container Toolkit for --gpus to work); the second pulls and runs Phi 4 inside it. The named volume keeps downloaded models across container restarts.
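For longer-lived deployments, the same setup can be expressed as Compose configuration. A sketch, assuming the NVIDIA Container Toolkit is installed on the host (the volume name is arbitrary):

```yaml
# docker-compose.yml (sketch)
services:
  ollama:
    image: ollama/ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama   # persist pulled models across restarts
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
volumes:
  ollama_data:
```

Start it with docker compose up -d, then load the model with docker compose exec ollama ollama run phi4.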
Performance Monitoring
Monitor Phi 4's performance on Linux with standard tools:
watch -n 1 nvidia-smi # GPU utilization (NVIDIA)
top # CPU and memory
nethogs # Network I/O
For production monitoring, integrate with Prometheus/Grafana; Ollama does not expose metrics itself, so typical setups scrape host and GPU exporters and instrument the gateway sitting in front of the API.
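For Prometheus, host and GPU metrics are usually collected via exporters. A scrape-config sketch, assuming node_exporter on its default port 9100 and NVIDIA's DCGM exporter on its default 9400:

```yaml
# prometheus.yml excerpt (sketch)
scrape_configs:
  - job_name: "node"   # host CPU, memory, and disk metrics
    static_configs:
      - targets: ["localhost:9100"]
  - job_name: "dcgm"   # GPU utilization and VRAM metrics
    static_configs:
      - targets: ["localhost:9400"]
```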
Troubleshooting on Linux
Issue: "GPU not available" or model runs on CPU
Solution: Verify GPU drivers with nvidia-smi or rocm-smi. Restart Ollama service: sudo systemctl restart ollama
Issue: Permission denied when pulling models
Solution: Ensure user permissions: sudo chown -R $USER:$USER ~/.ollama
Issue: OOM (out of memory) errors
Solution: Reduce context or batch size, or add swap: sudo fallocate -l 16G /swapfile && sudo chmod 600 /swapfile && sudo mkswap /swapfile && sudo swapon /swapfile
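Swap created this way lasts only until reboot. To make it permanent, add a line for the /swapfile path used above to /etc/fstab:

```
/swapfile none swap sw 0 0
```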
Production Deployment Patterns
Now that Phi 4 runs on Linux, scale to production workflows:
- High-availability clusters: Deploy Phi 4 across multiple Linux servers with load balancing
- Inference API gateway: Wrap Ollama with authentication, rate limiting, and caching
- Vector databases: Pair Phi 4 with Chroma, Milvus, or Qdrant for RAG pipelines
- Batch processing: Leverage Linux parallelism for offline document processing at scale
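The batch-processing pattern can be sketched as a small concurrent client against the local API, stdlib only; the summarize_all helper, prompt template, and worker count here are illustrative, not part of Ollama:

```python
import concurrent.futures
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def _post(url, payload):
    """POST JSON to the Ollama API and return the decoded reply."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

def summarize_all(texts, workers=4, post=_post):
    """Summarize documents concurrently; `post` is injectable for testing."""
    def one(text):
        reply = post(OLLAMA_URL, {
            "model": "phi4",
            "prompt": "Summarize in two sentences:\n" + text,
            "stream": False,
        })
        return reply["response"]
    # Threads are fine here: the work is I/O-bound HTTP requests.
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(one, texts))
```

Keep the worker count in line with what the server can actually run in parallel; requests beyond the server's capacity simply queue.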
Conclusion
Linux is the platform where Phi 4 reaches its full potential. With native GPU support, container orchestration, and production-grade monitoring, you can run state-of-the-art language models locally at any scale—whether you're prototyping on a laptop or deploying enterprise inference clusters. Start with a single machine, then scale horizontally as your workload grows.
Local, powerful, and fully under your control.