How to Install Ollama and Run Local LLMs on Linux

Install Ollama on Ubuntu 24.04, pull an open-source LLM, and run it completely offline on your own hardware — no cloud account or API key required.

Why Run LLMs Locally

Running large language models on your own machine gives you two things you cannot get from cloud APIs:

1. Privacy. Your prompts, documents, and conversation history stay on your hardware. No data is sent to OpenAI, Google, or any third party. This matters when working with proprietary code, personal notes, or sensitive research.

2. No usage caps or API costs. Once the model is downloaded, inference is free. You can query it as often as you like without rate limits or token-based billing.

Local LLMs are not a replacement for GPT-4 or Claude on complex tasks, but for everyday Q&A, brainstorming, code review, and document summarisation a 7B–8B model running on a consumer GPU is surprisingly capable.

Hardware Requirements

Before installing, check that your system meets these minimums:

Component	Minimum	Recommended
CPU	x86-64 with AVX2 (2013+)	8-core modern CPU
RAM	8 GB	16–32 GB
GPU	Optional (CPU-only works for 7B)	NVIDIA with 6 GB+ VRAM
Storage	10 GB free	50 GB NVMe SSD
OS	Linux (Ubuntu 22.04+), macOS 13+, or Windows 11	Ubuntu 24.04 LTS

This guide targets Ubuntu 24.04. If you are on a different OS, the ollama commands are identical — only the initial install method differs.

What is Ollama

Ollama is a free, open-source inference engine that simplifies running LLMs on local hardware. It handles model downloading, quantisation, GPU acceleration, and exposes an OpenAI-compatible REST API. You can think of it as Docker for language models — you pull a model by name and run it with a single command.

Step-by-Step Installation

Step 1: Install Ollama

1# Ubuntu / Debian
2curl -fsSL https://ollama.com/install.sh | sh

For macOS, download the installer from ollama.com. For Windows, download from ollama.com and run the installer.

After installation, verify it worked:

1ollama --version

Step 2: Pull a Model

Ollama serves models from its library. Start with a lightweight model that runs on most hardware:

1# Pull Llama 3.1 8B (approx 4.7 GB download)
2ollama pull llama3.1:8b
3
4# Or for older CPUs without AVX2:
5ollama pull phi3:mini

The download may take a few minutes depending on your connection. Models are stored in ~/.ollama/models/ by default.

Step 3: Run the Model

Start a chat session:

1ollama run llama3.1:8b

You should see a >>> prompt. Try a query:

>>> What is the difference between RAM and VRAM for running LLMs?

Type /bye to exit.

Step 4: Serve the Model as an API

Ollama runs a background REST API on http://localhost:11434. Start it explicitly:

1ollama serve

In another terminal, test it:

1curl http://localhost:11434/api/generate -d '{
2  "model": "llama3.1:8b",
3  "prompt": "What is Ollama?"
4}'

This API is compatible with the OpenAI chat format, so any tool that supports OpenAI can point to http://localhost:11434/v1 instead.

Choosing the Right Model

Model	Parameters	VRAM Needed	Best For
Phi-3 Mini	3.8 B	3 GB Q4	Low-end hardware, CPU-only
Llama 3.1 8B	8 B	6 GB Q4	General purpose — best starting point
Mistral 7B	7 B	5 GB Q4	Code, reasoning
Qwen 2.5 14B	14 B	10 GB Q4	Strong general, requires a GPU
Llama 3 70B	70 B	42 GB Q4	High-quality output, needs 2× GPUs

The “Q4” column refers to 4-bit quantisation — a size/quality trade-off that Ollama applies automatically. Q4 is a sensible default that balances capability with hardware requirements.

Using a Web Interface: Open WebUI

The terminal works, but for a polished experience install Open WebUI — a self-hosted web interface for Ollama:

1# Using Docker (easiest)
2docker run -d -p 3000:8080 \
3  --add-host=host.docker.internal:host-gateway \
4  -v open-webui:/app/backend/data \
5  --name open-webui \
6  --restart always \
7  ghcr.io/open-webui/open-webui:main

Then open http://localhost:3000. It supports multi-user chat, document upload, and model switching — a much better daily driver than the terminal.

For a full walkthrough of Open WebUI features, see my dedicated Open WebUI guide.

Other Local LLM Tools Worth Knowing

LM Studio — polished GUI for downloading and running models on Windows/macOS
Text Generation WebUI — powerful but more complex; good for non-AVX CPUs
GPT4All — beginner-friendly, runs well on CPU

For a detailed comparison of all these tools, see the local LLM ecosystem overview. Once you have a model running, the hyperparameter guide explains how to tune context window, temperature, and quantisation for your hardware.

Conclusion

You now have Ollama running on your own hardware with a local model that responds to queries without sending data to any third party. The setup takes about 10 minutes and the ongoing cost is zero.

Start with Llama 3.1 8B, then experiment with larger models if you have the GPU memory, or smaller ones like Phi-3 if you are CPU-only. Add Open WebUI for a browser-based interface, and you have a fully self-hosted AI stack that competes with cloud APIs on convenience while keeping your data private.

Need More Power

Running large 70B models locally requires significant GPU hardware. If you only need that scale occasionally, renting cloud GPUs is more economical than buying:

Try DigitalOcean GPU Droplets — H100 and A100 GPUs by the hour, with free starter credits.

A used RTX 3090 (24 GB VRAM) remains the best value for serious local LLM work if you prefer to buy rather than rent.

If you have questions or suggestions, let me know in the comments. Next up: connecting Ollama to Open WebUI for a polished browser interface.