This page is a practical reference for agentic LLM inference tuning (temperature, top_p, top_k, penalties, and how they interact in multi-step and tool-heavy workflows).
Soon you are juggling vLLM, llama.cpp, and more—each stack on its own port. Everything downstream still wants one /v1 base URL; otherwise you keep shuffling ports, profiles, and one-off scripts. llama-swap is the /v1 proxy before those stacks.
I have tested how OpenCode works with several locally hosted on Ollama and llama.cpp LLMs,
and for comparison added some Free models from OpenCode Zen.
I keep coming back to llama.cpp for local inference—it gives you control that Ollama and others abstract away, and it just works. Easy to run GGUF models interactively with llama-cli or expose an OpenAI-compatible HTTP API with llama-server.
Strategic guide to hosting large language models locally with Ollama, llama.cpp, vLLM, or in the cloud. Compare tools, performance trade-offs, and cost considerations.