GPU Local — Remote Ollama Delegation.
Master-worker setup that lets Claude Code on a local PC delegate cheap, high-volume inference to an Ollama server on a rented Vast.ai GPU over an SSH tunnel.
- Year: 2026
- Role: Solo Engineer
- Domain: AI Infrastructure
What it does
This project is the infrastructure that lets a local Claude Code session on a Windows PC offload high-volume, low-skill inference work to a far more powerful Ollama server running on a rented Vast.ai GPU. The local machine is the commander; the remote GPU is the workhorse. Cost is the motivation: for routine code generation, summarisation and boilerplate work, running a self-hosted Qwen Coder on a rented 24 GB GPU is dramatically cheaper than pushing the same volume through a frontier API, and the SSH tunnel is fast enough to keep latency acceptable. The deliverable is a small set of scripts (PowerShell on the local side, a shell script on the GPU), a JSON address book, a teaching-mode CLAUDE.md that explains the architecture in plain language, and a documented setup that I can repeat whenever I spin up a new GPU instance, including the exact troubleshooting steps for the Vast.ai authentication proxy, which is where most of a new user’s debugging time goes. The source repository is private.
How it’s structured
The repository is organised around three small scripts and one configuration file. start_claude_worker.sh runs on the GPU side and brings Ollama up cleanly, making sure the chosen models are loaded and ready to serve. quick-setup.ps1 runs on the local machine, builds the SSH tunnel, and verifies end-to-end that the local port can reach the remote Ollama instance. delegate-to-remote.ps1 is the delegation script itself: it accepts a task, picks the right model, sends the request through the tunnel, and returns the answer to the calling Claude Code session. connection-config.json is the address book: it stores the IP of the current Vast.ai instance, the ports, and the list of models available on the GPU. The repository also carries the teaching-mode CLAUDE.md and a hard-won troubleshooting guide, written after a Vast.ai authentication problem ate several hours of debugging.
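To make the address-book idea concrete, here is a minimal Python sketch of how a file like connection-config.json could drive the tunnel command. It is not the repository's PowerShell code: the field names, the example IP (a reserved documentation address), the port numbers and the model names are all assumptions made for illustration. Only the `ssh` flags (`-N`, `-L`, `-p`) are standard OpenSSH options.

```python
import json

# Hypothetical shape for connection-config.json. Every field name and value
# here is an assumption for illustration, not the private repository's schema.
EXAMPLE_CONFIG = """
{
  "host": "203.0.113.10",
  "ssh_port": 40022,
  "ssh_user": "root",
  "local_port": 11434,
  "remote_ollama_port": 11434,
  "models": ["example-coder-model", "example-general-model"]
}
"""


def build_tunnel_command(config: dict) -> list[str]:
    """Build an OpenSSH local port forward to the GPU's internal Ollama port."""
    # -L local_port:localhost:remote_port makes the remote service reachable
    # on the local machine, so requests never touch the public proxied port.
    forward = f"{config['local_port']}:localhost:{config['remote_ollama_port']}"
    return [
        "ssh",
        "-N",  # open the tunnel only, run no remote command
        "-L", forward,
        "-p", str(config["ssh_port"]),
        f"{config['ssh_user']}@{config['host']}",
    ]


config = json.loads(EXAMPLE_CONFIG)
command = build_tunnel_command(config)
print(" ".join(command))
```

Keeping the instance details in one file means that moving to a new rented GPU should only require editing the address book, not the scripts that read it.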
How it works
When the user wants to delegate, the local Claude Code session calls delegate-to-remote.ps1 with a task and a target model. The script forwards the request through the SSH tunnel to the GPU’s Ollama instance, which runs the inference and streams the answer back. One critical detail took longer than expected to discover: Vast.ai puts a Caddy reverse proxy in front of every external port, and that proxy demands HTTP basic authentication. A request to Ollama on the proxied external port returns a 401, which made it look as if the model was misconfigured when the proxy was simply doing its job. The fix is to talk to Ollama on its internal localhost port through the SSH tunnel, bypassing Caddy entirely, which is why every connection in the project goes over SSH rather than the public Vast.ai URL. Models (Qwen Coder, DeepSeek and Qwen Notool variants) are pre-pulled on the GPU side so that delegation does not pay a download cost on every task.
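The delegation step can be sketched in a few lines of Python. This is an illustration of the request path, not a port of delegate-to-remote.ps1: it assumes the tunnel exposes Ollama on local port 11434 (Ollama's default), and the function names are hypothetical. It uses Ollama's `/api/generate` endpoint with `stream` set to false for simplicity, whereas the setup described above streams the answer back.

```python
import json
import urllib.request

# Assumption: the SSH tunnel forwards this local port to Ollama on the GPU.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(model: str, prompt: str,
                           url: str = OLLAMA_URL) -> urllib.request.Request:
    """Build a non-streaming generate request for Ollama's HTTP API."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def delegate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send one task through the tunnel and return the model's full answer."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    # With stream=False, Ollama returns one JSON object whose "response"
    # field holds the generated text.
    return body["response"]
```

With the tunnel up, a call such as `delegate("example-coder-model", "Write a unit test for parse_date")` would return the generated text; the model name here is a placeholder for whichever model has been pre-pulled on the GPU.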
What I learned
The biggest lesson was that “rent a GPU and connect to it” sounds trivial until you meet the proxy in front of every external port. Discovering Caddy as an unwanted middleman taught me to start network debugging from the inside out: talk to the service on its own host before trusting any reverse proxy in between, and read the actual response code rather than assume the model is at fault. The master-worker pattern itself turned out to be the right shape for cost optimisation: the frontier model stays in charge of planning and review, while the cheap self-hosted model on the rented GPU handles bulk generation, which is exactly the work a coder model is good at. I also learned how much value a small, focused troubleshooting document delivers. The markdown notes captured during the Caddy debugging session paid for themselves the first time I rebuilt the setup on a new GPU instance, and they are exactly the kind of artefact a teaching-mode CLAUDE.md is meant to make discoverable.
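The "read the actual response code" habit can be captured as a small probe. The Python sketch below is an assumption-laden illustration rather than part of the repository: the function names and the wording of the diagnoses are hypothetical, and the mapping only covers the cases discussed above.

```python
import urllib.error
import urllib.request


def explain_status(status: int) -> str:
    """Map an HTTP status code to the most likely layer responsible."""
    if status == 200:
        return "service reachable"
    if status == 401:
        # This is the Caddy case: the request was rejected for missing
        # credentials before it ever reached Ollama.
        return "an authenticating proxy rejected the request"
    if status == 404:
        return "something answered, but not at this path"
    return f"unexpected HTTP status {status}"


def probe(url: str, timeout: float = 5.0) -> str:
    """Request a URL and describe what the response says about the path."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return explain_status(response.status)
    except urllib.error.HTTPError as error:
        # urllib raises HTTPError for 4xx and 5xx; the code is still the clue.
        return explain_status(error.code)
    except OSError as error:
        # Covers refused connections, timeouts and DNS failures.
        return f"no HTTP response at all: {error}"
```

Running `probe` first against the tunnelled localhost port and only then against the public URL follows the inside-out order: if the inner check succeeds and the outer one reports a 401, the proxy is the culprit, not the model.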