CodeIntermediateOllama

Ollama: How to Run Open-Source LLMs Locally on Your Own Machine

Ollama is the free, open-source tool for running Llama, Mistral, Qwen and more directly on your computer — private, offline, and free. A hands-on guide to how it works, picking a model for your hardware, and calling it from your own code.

June 8, 2026·5 min read
Share:
Ollama Run open-source LLMs on your own machine — private and offline $ ollama run llama3.2 >>> Ready. Everything runs here. Llama Mistral Qwen Gemma BITSMINDS.COM

For two years the default way to use a capable AI model was to send your prompts to someone else's servers and pay per token. Ollama flips that: it is a free, open-source tool that downloads and runs open-weight large language models — Llama, Mistral, Qwen, Gemma, DeepSeek and dozens more — entirely on your own computer. No API key, no usage meter, no data leaving the machine. If your laptop can run it, you can chat with it on a plane.

This guide covers what Ollama is, how it works under the hood, the two commands you need to get started, how to choose a model your hardware can actually handle, and how to call it from your own code. It assumes you can open a terminal; everything else it explains.

What Ollama is — and why local matters

Ollama is a small runtime that wraps the hard parts of running a model locally — downloading weights, loading them into memory, and serving them — behind a simple command line and a local HTTP API. It builds on the llama.cpp engine and the GGUF model format, so it runs efficiently on ordinary CPUs and is accelerated by Apple Silicon, NVIDIA, and AMD GPUs when available. The appeal is concrete: privacy (your data never leaves the device), cost (no per-token billing), offline use, and control over exactly which model and version you run.

How it works under the hood

When you install Ollama it runs a small background server on localhost:11434. The first time you ask for a model, Ollama pulls its weights once from the registry and stores them on disk; after that everything is local. Every interface — the CLI, your own scripts, a desktop chat app — talks to that same local server, which loads the requested model into memory and streams back tokens.

How Ollama runs models locally Model registry ollama.com/library pull (once) prompt load You CLI: ollama run Your app via REST Ollama server localhost:11434 loads and runs models Models on disk Llama, Qwen, Mistral GGUF, quantized BITSMINDS.COM
You talk to a local server on port 11434; it loads quantized GGUF models from disk, downloaded once from the registry. Nothing leaves your machine at inference time. Diagram: BitsMinds

Getting started in two commands

Installation is a single step on macOS, Linux, or Windows. Then one command pulls a model and drops you into an interactive chat:

# macOS / Linux (Windows: download the installer from ollama.com)
curl -fsSL https://ollama.com/install.sh | sh

# pull a model and start chatting
ollama run llama3.2

That is the whole onboarding. A few other commands round out daily use: ollama pull qwen2.5 downloads a model without running it, ollama list shows what you have installed, and ollama rm llama3.2 frees the disk space. Models live in a local store, so switching between them is instant once downloaded.

Choosing a model for your hardware

The single most common beginner mistake is grabbing the biggest model and watching it crawl or fail to load. The binding constraint is memory: a model needs roughly enough RAM (or GPU VRAM) to hold its weights. Quantization — shipping weights at lower precision, commonly 4-bit (Q4) — cuts that requirement by about half with only minor quality loss, which is why most default Ollama tags are already quantized.

Pick a size for your machine Small - 1 to 3B ~8 GB RAM Fast, light tasks, runs on most laptops. Llama 3.2 3B, Phi Medium - 7 to 9B ~16 GB RAM Best all-round quality vs speed. Llama 3.1 8B, Qwen Large - 70B+ 48 GB+ RAM Max quality, needs a strong GPU or Mac. Llama 70B, Mixtral Quantized builds (Q4) cut memory roughly in half with minor quality loss. BITSMINDS.COM
Match the model size to your available memory. When in doubt, start medium (7-9B) and only scale up if quality demands it. Diagram: BitsMinds

Using Ollama from your own code

Because the server speaks HTTP, any language can drive it. The native REST endpoint takes a model and a prompt:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Explain MCP in one sentence.",
  "stream": false
}'

Even handier for existing projects: Ollama also exposes an OpenAI-compatible endpoint at http://localhost:11434/v1. Point any OpenAI SDK at that base URL with a dummy API key and your code talks to a local model with no other changes — the fastest way to swap a paid API for a private one during development.

Customizing with a Modelfile

A Modelfile lets you bake a system prompt and parameters into a reusable variant, much like a Dockerfile for models:

FROM llama3.2

SYSTEM "You are a terse assistant. Answer in one sentence."
PARAMETER temperature 0.3

Run ollama create terse -f Modelfile and you have a new model named terse that always behaves that way. It is the simplest path to a purpose-built local assistant without fine-tuning anything.

Tips and good habits

Start with a medium model and only scale up if the output quality genuinely requires it — speed and responsiveness matter more than a benchmark point or two for everyday work. Keep an eye on disk space, since model files run from a couple of gigabytes to dozens. Leave the server bound to localhost unless you deliberately need network access, and if you do expose it, put authentication in front. And treat downloaded models like any dependency: pull from the official library, and note the license, since open-weight does not always mean unrestricted commercial use.