Local AI Without the Overhead

Building a practical AI sandbox with Ollama and Open WebUI


Most conversations about running AI locally start with privacy or cost. Both matter, but neither is why I actually built a local stack.

I wanted to test ideas without friction. Not one idea carefully scoped against an API budget, but a dozen half-formed ones thrown at different models to see what sticks. The cloud is great for production workloads, but for the messy, exploratory phase of figuring out what AI can actually do for a project, every small barrier compounds. Register for an API key. Set up billing. Configure authentication. Monitor rate limits. Each step is minor individually. Together, they create just enough resistance that you don’t bother testing the quick thought you had at 9pm. A local stack removes all of that. Pull a model, ask a question. That’s it.

The whole setup took roughly two hours, and most of that was watching Debian packages download.

What We’re Working With

Two pieces of open-source software make this practical.

Ollama is a local model runner, but calling it that undersells what it actually does. Think of it as an orchestration layer for large language models. You pull models the way you’d pull Docker images (ollama pull llama3.2 grabs Meta’s Llama 3.2, ollama pull mistral gets Mistral) and Ollama exposes them all through a single REST API on port 11434. The model becomes a parameter of the query, not a separate endpoint. Want to compare how Llama and Mistral handle the same prompt? Same API call, different model parameter.
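The model-as-parameter idea is easy to see in code. A minimal sketch against Ollama's /api/generate endpoint, assuming Ollama is listening on its default port and that llama3.2 and mistral have already been pulled:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(model: str, prompt: str) -> bytes:
    # Same endpoint for every model; only the "model" field changes.
    return json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,  # one JSON object back instead of a token stream
    }).encode("utf-8")

def ask(model: str, prompt: str) -> str:
    req = urllib.request.Request(
        OLLAMA_URL,
        data=build_request(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Same call, different model parameter:
#   for model in ("llama3.2", "mistral"):
#       print(model, "->", ask(model, "Explain RAG in one sentence."))
```

Swapping providers never touches the request shape, which is exactly what makes side-by-side comparison cheap.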

That’s a bigger deal than it sounds. Azure offers something similar with the AI Foundry Model Router, a single endpoint that routes queries across multiple models. But setting that up means provisioning Azure resources and configuring deployments before you can run your first query. Ollama gives you the same architectural pattern on your own hardware, with no configuration beyond pulling the models you want.

Ollama’s model library includes over 100 models, ranging from lightweight 2B-parameter options that run on a laptop to 70B+ models that need serious hardware. For most experimentation, the 7B-8B range hits a sweet spot between quality and speed.

Open WebUI is the interface layer. It provides a browser-based chat experience similar to ChatGPT or Claude, but pointed at your local Ollama instance. Open WebUI does for your AI sandbox what a pre-built admin panel does for a new web app. You could build your own chat interface against Ollama’s API. You’d spend days on conversation management, model switching, document upload, and prompt templating before you ever got to the thing you actually wanted to test. Open WebUI gives you all of that out of the gate, with over 280 million downloads and an active community building extensions on top. It also supports RAG, letting you upload documents and query against them directly.

The key insight is that Open WebUI isn’t the frontend for your projects. It’s your test bed. It lets you focus on figuring out what to build instead of building the tooling to figure it out.

The Virtualization Layer (Brief Detour)

My setup runs on Proxmox VE, an open-source virtualization platform. If you’re running a homelab, you probably already know it. If you’re not, the short version: Proxmox lets you spin up virtual machines and lightweight Linux containers (LXC) on bare metal. LXC containers give you isolated environments with near-native performance and minimal overhead, ideal for running services like Ollama that benefit from direct hardware access without the weight of a full VM. I won’t go deep on Proxmox here because that’s its own article.

You don’t need Proxmox or a homelab to run this stack. Ollama installs directly on macOS, Linux, and Windows, and Open WebUI runs anywhere Docker does. The Open WebUI documentation walks through standalone installation if you want to run everything on your local machine.

The Proxmox community maintains a library of one-click installation scripts for common services. Open WebUI and Ollama have one, and it handles roughly 90% of the setup. That matters because the goal here is a working AI sandbox, not a systems administration exercise.

Setting It Up

The actual installation is almost anticlimactic. From the Proxmox host:

bash -c "$(curl -fsSL https://raw.githubusercontent.com/community-scripts/ProxmoxVE/main/ct/openwebui.sh)"

This script creates an LXC container with Open WebUI installed and asks if you want Ollama bundled in. Say yes.

If your default storage (local-lvm) is tight on space, the installer lets you redirect to a different storage target. I’d recommend at least 200GB of dedicated space. Individual model files range from 4-5GB for an 8B model to 40GB+ for larger ones, and you’ll want room to pull several models for comparison testing. The space adds up faster than you’d expect.

GPU Passthrough (Optional but Meaningful)

Without a GPU, Ollama runs inference on CPU. It works, and for quick tests on smaller models it’s fine. But for anything beyond casual prompting, CPU inference means 30-40 second response times for queries that would be nearly instant on a cloud provider. That performance gap is the honest trade-off of running locally without dedicated hardware.

If you have an NVIDIA card available, GPU passthrough to the LXC container makes a dramatic difference. The process involves installing matching NVIDIA drivers on both the Proxmox host and inside the container (using the --no-kernel-module flag for the latter), then mapping the /dev/nvidia* devices in the LXC configuration file at /etc/pve/lxc/<ID>.conf.
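For reference, the passthrough entries in the container config look roughly like this. Device major numbers vary by driver version, so treat this as a template, not a copy-paste: check yours with ls -l /dev/nvidia* on the host and adjust.

```
# /etc/pve/lxc/<ID>.conf
# Allow the container to access NVIDIA character devices
lxc.cgroup2.devices.allow: c 195:* rwm
lxc.cgroup2.devices.allow: c 510:* rwm
# Bind-mount the device nodes into the container
lxc.mount.entry: /dev/nvidia0 dev/nvidia0 none bind,optional,create=file
lxc.mount.entry: /dev/nvidiactl dev/nvidiactl none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-uvm dev/nvidia-uvm none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-uvm-tools dev/nvidia-uvm-tools none bind,optional,create=file
```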

It’s not complicated, but it’s fiddly. Wrong driver version inside the container? Silent failures. Missing device mapping? Ollama falls back to CPU without telling you. I’d recommend getting the basic CPU setup working first and verifying everything functions before adding GPU passthrough as a separate step.

Accessing Your New Stack

Once the container is running, it picks up an IP via DHCP like any other device on your network. Open WebUI will be available at port 8080 and Ollama’s API at port 11434. That’s enough to get started.
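A quick sanity check once the container is up: Ollama's /api/tags endpoint returns the models you've pulled. A small sketch, where the container IP and the sample response are placeholders for whatever your setup reports:

```python
import json
import urllib.request

def model_names(tags_response: dict) -> list:
    # /api/tags returns {"models": [{"name": "llama3.2:latest", ...}, ...]}
    return [m["name"] for m in tags_response.get("models", [])]

# Against a live instance (replace the IP with your container's address):
#   with urllib.request.urlopen("http://192.168.1.50:11434/api/tags") as resp:
#       print(model_names(json.load(resp)))

# Illustrative response shape:
sample = {"models": [{"name": "llama3.2:latest"}, {"name": "mistral:latest"}]}
```

If the list comes back empty, the container is up but no models have been pulled yet.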

If you want cleaner access, you can add internal DNS entries and route them through a reverse proxy like nginx. I set up ai.internal pointing to Open WebUI and ollama.internal pointing to Ollama’s API, each with self-signed certificates. Having a named endpoint for Ollama becomes particularly useful when you start pointing other tools at it, but it’s not required for the basic setup.
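If you go the reverse-proxy route, the nginx server block is short. A sketch for the Open WebUI side, where ai.internal, the container IP, and the certificate paths are placeholders from my setup; the websocket headers matter because Open WebUI streams chat responses over websockets:

```
server {
    listen 443 ssl;
    server_name ai.internal;

    ssl_certificate     /etc/nginx/certs/ai.internal.crt;
    ssl_certificate_key /etc/nginx/certs/ai.internal.key;

    location / {
        proxy_pass http://192.168.1.50:8080;  # Open WebUI container
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        # Websocket support for streaming chat responses
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }
}
```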

How the Pieces Fit Together

  Explore & Test                        Build & Integrate
  Your Browser:                         Your Apps & Tools:
  chat, compare models,                 your next great idea,
  upload docs                           service, or system
        |                                      |
        v                                      |
  Open WebUI :8080                             |
  (your administrative dashboard               | OpenAI-compatible
   and control center)                         | API
        |                                      |
        +------------------+------------------+
                           v
                    Ollama :11434
      single endpoint: model is a parameter, not a destination
                           |
                           v  inference
        Local models: Llama, Mistral, Qwen, etc.

Open WebUI is your test bed. Your apps talk directly to Ollama.

The Honest Trade-offs

Running AI locally isn’t free, even when the software is.

Performance is the biggest gap. Without a dedicated GPU, response times on anything beyond a 7B model are slow enough to change how you work. A query that takes 2 seconds on Claude or GPT-4 might take 30-40 seconds locally on CPU. That’s workable for batch testing or background processing, but it makes interactive conversation feel sluggish.

Model quality is the other consideration. Local models have improved dramatically, and something like an 8B Llama 3.1 is genuinely impressive for its size. But it doesn’t match frontier models on complex reasoning tasks, and pretending otherwise helps no one. The value isn’t in replacing cloud AI for everything. It’s in having a zero-friction environment for the majority of use cases where a good-enough model with no overhead beats a perfect model behind a setup wall.

I’ve written before about the hidden costs of AI strategy and the build-vs-buy decision. Local AI adds a third option: build your own sandbox for experimentation while using cloud services for production workloads.

What’s Next

The setup described above gets you a working chat interface. That’s table stakes. Where it gets interesting is using Ollama as a service that powers other things.

Because Ollama exposes an OpenAI-compatible API, any tool that integrates with that protocol can be pointed at your local instance instead. Code editors, automation scripts, document processing pipelines. The Ollama endpoint becomes a drop-in replacement for api.openai.com in most configurations.
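Ollama's OpenAI-compatible endpoint lives at /v1 on the same port, so the swap is a base-URL change. A sketch using the standard chat-completions request shape (localhost and the model name are placeholders for your own instance):

```python
import json
import urllib.request

# Drop-in swap: same request shape as api.openai.com, different base URL.
BASE_URL = "http://localhost:11434/v1"

def chat_payload(model: str, user_message: str) -> dict:
    # Standard OpenAI chat-completions body; Ollama accepts the same shape.
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }

def chat(model: str, user_message: str) -> str:
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(chat_payload(model, user_message)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# With the official openai client, the swap is one constructor argument:
#   client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
# (Ollama ignores the key, but the client requires a non-empty value.)
```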

I’m currently exploring a couple of directions with this setup. Using Ollama as a backend for the conversational NPCs in our game engine’s development is the most immediately practical. Beyond that, I’m digging into code review workflows and batch evaluations that run the same tasks against different models to compare output quality.
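The batch-evaluation idea is little more than a loop. Sketched here with the request function injected so the harness stays model-agnostic; the model names and the stub are illustrative:

```python
import time

def compare_models(models, prompt, ask):
    """Run one prompt against several models, timing each response.

    `ask` is any callable (model, prompt) -> str, e.g. a wrapper
    around Ollama's /api/generate endpoint.
    """
    results = []
    for model in models:
        start = time.perf_counter()
        answer = ask(model, prompt)
        results.append({
            "model": model,
            "seconds": round(time.perf_counter() - start, 2),
            "answer": answer,
        })
    return results

# Example with a stub in place of a live Ollama call:
fake_ask = lambda model, prompt: f"{model} says hi"
# compare_models(["llama3.2", "mistral"], "Summarize this.", fake_ask)
```

Swap the stub for a real client and the same harness produces side-by-side quality and latency numbers for any set of pulled models.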

The API-first approach matters here more than it might seem. Because Ollama follows an established protocol, the investment in building against it isn’t wasted if you later switch models or move some workloads back to cloud providers. The integration layer stays the same.

I’m still figuring out the right balance between local and cloud for different workloads. The cloud handles production. The homelab handles learning. Having both options available, with a clean API boundary between them, feels like the right architecture for this phase.

What would you explore first with a zero-friction AI sandbox on your own network?