Install Ollama on Ubuntu 24.04, pull an open-source LLM, and run it completely offline on your own hardware — no cloud account or API key required.
Why Run LLMs Locally
Running large language models on your own machine gives you two things you cannot get from cloud APIs:
1. Privacy. Your prompts, documents, and conversation history stay on your hardware. No data is sent to OpenAI, Google, or any third party. This matters when working with proprietary code, personal notes, or sensitive research.
2. No usage caps or API costs. Once the model is downloaded, inference is free. You can query it as often as you like without rate limits or token-based billing.
Local LLMs are not a replacement for GPT-4 or Claude on complex tasks, but for everyday Q&A, brainstorming, code review, and document summarisation a 7B–8B model running on a consumer GPU is surprisingly capable.
Hardware Requirements
Before installing, check that your system meets these minimums:
| Component | Minimum | Recommended |
|---|---|---|
| CPU | x86-64 with AVX2 (2013+) | 8-core modern CPU |
| RAM | 8 GB | 16–32 GB |
| GPU | Optional (CPU-only works for 7B) | NVIDIA with 6 GB+ VRAM |
| Storage | 10 GB free | 50 GB NVMe SSD |
| OS | Linux (Ubuntu 22.04+), macOS 13+, or Windows 11 | Ubuntu 24.04 LTS |
This guide targets Ubuntu 24.04. If you are on a different OS, the ollama commands are identical — only the initial install method differs.
What is Ollama
Ollama is a free, open-source inference engine that simplifies running LLMs on local hardware. It handles model downloading, quantisation, GPU acceleration, and exposes an OpenAI-compatible REST API. You can think of it as Docker for language models — you pull a model by name and run it with a single command.
Step-by-Step Installation
Step 1: Install Ollama
1# Ubuntu / Debian
2curl -fsSL https://ollama.com/install.sh | sh
For macOS, download the installer from ollama.com. For Windows, download from ollama.com and run the installer.
After installation, verify it worked:
1ollama --version
Step 2: Pull a Model
Ollama serves models from its library. Start with a lightweight model that runs on most hardware:
1# Pull Llama 3.1 8B (approx 4.7 GB download)
2ollama pull llama3.1:8b
3
4# Or for older CPUs without AVX2:
5ollama pull phi3:mini
The download may take a few minutes depending on your connection. Models are stored in ~/.ollama/models/ by default.
Step 3: Run the Model
Start a chat session:
1ollama run llama3.1:8b
You should see a >>> prompt. Try a query:
>>> What is the difference between RAM and VRAM for running LLMs?
Type /bye to exit.
Step 4: Serve the Model as an API
Ollama runs a background REST API on http://localhost:11434. Start it explicitly:
1ollama serve
In another terminal, test it:
1curl http://localhost:11434/api/generate -d '{
2 "model": "llama3.1:8b",
3 "prompt": "What is Ollama?"
4}'
This API is compatible with the OpenAI chat format, so any tool that supports OpenAI can point to http://localhost:11434/v1 instead.
Choosing the Right Model
| Model | Parameters | VRAM Needed | Best For |
|---|---|---|---|
| Phi-3 Mini | 3.8 B | 3 GB Q4 | Low-end hardware, CPU-only |
| Llama 3.1 8B | 8 B | 6 GB Q4 | General purpose — best starting point |
| Mistral 7B | 7 B | 5 GB Q4 | Code, reasoning |
| Qwen 2.5 14B | 14 B | 10 GB Q4 | Strong general, requires a GPU |
| Llama 3 70B | 70 B | 42 GB Q4 | High-quality output, needs 2× GPUs |
The “Q4” column refers to 4-bit quantisation — a size/quality trade-off that Ollama applies automatically. Q4 is a sensible default that balances capability with hardware requirements.
Using a Web Interface: Open WebUI
The terminal works, but for a polished experience install Open WebUI — a self-hosted web interface for Ollama:
1# Using Docker (easiest)
2docker run -d -p 3000:8080 \
3 --add-host=host.docker.internal:host-gateway \
4 -v open-webui:/app/backend/data \
5 --name open-webui \
6 --restart always \
7 ghcr.io/open-webui/open-webui:main
Then open http://localhost:3000. It supports multi-user chat, document upload, and model switching — a much better daily driver than the terminal.
For a full walkthrough of Open WebUI features, see my dedicated Open WebUI guide.
Other Local LLM Tools Worth Knowing
- LM Studio — polished GUI for downloading and running models on Windows/macOS
- Text Generation WebUI — powerful but more complex; good for non-AVX CPUs
- GPT4All — beginner-friendly, runs well on CPU
For a detailed comparison of all these tools, see the local LLM ecosystem overview. Once you have a model running, the hyperparameter guide explains how to tune context window, temperature, and quantisation for your hardware.
Conclusion
You now have Ollama running on your own hardware with a local model that responds to queries without sending data to any third party. The setup takes about 10 minutes and the ongoing cost is zero.
Start with Llama 3.1 8B, then experiment with larger models if you have the GPU memory, or smaller ones like Phi-3 if you are CPU-only. Add Open WebUI for a browser-based interface, and you have a fully self-hosted AI stack that competes with cloud APIs on convenience while keeping your data private.
Need More Power
Running large 70B models locally requires significant GPU hardware. If you only need that scale occasionally, renting cloud GPUs is more economical than buying:
Try DigitalOcean GPU Droplets — H100 and A100 GPUs by the hour, with free starter credits.
A used RTX 3090 (24 GB VRAM) remains the best value for serious local LLM work if you prefer to buy rather than rent.
If you have questions or suggestions, let me know in the comments. Next up: connecting Ollama to Open WebUI for a polished browser interface.















