LLM Hosting

Unload All llama.cpp Router Models Without Restarting

llama.cpp router mode is one of the most useful changes to llama-server in years. It finally gives local LLM operators something close to the model management experience people expect from Ollama, while keeping the raw performance and low-level control that make llama.cpp worth using in the first place.

Llama-Server Router Mode - Dynamic Model Switching Without Restarts

For a long time, llama.cpp had a glaring limitation: you could only serve one model per process, and switching meant a restart.

Vane (Perplexica 2.0) Quickstart With Ollama and llama.cpp

Vane is one of the more pragmatic entries in the “AI search with citations” space: a self-hosted answering engine that mixes live web retrieval with local or cloud LLMs, while keeping the whole stack under your control.

TGI - Text Generation Inference - Install, Config, Troubleshoot

Text Generation Inference (TGI) has a very specific energy. It is not the newest kid on the inference street, but it is the one that has already learned how production breaks -

Remote Ollama access via Tailscale or WireGuard, no public ports

Ollama is at its happiest when it is treated like a local daemon: the CLI and your apps talk to a loopback HTTP API, and the rest of the network never finds out it exists.

Ollama in Docker Compose with GPU and Persistent Model Storage

Ollama works great on bare metal. It gets even more interesting when you treat it like a service: a stable endpoint, pinned versions, persistent storage, and a GPU that is either available or it is not.

Ollama behind a reverse proxy with Caddy or Nginx for HTTPS streaming

Running Ollama behind a reverse proxy is the simplest way to get HTTPS, optional access control, and predictable streaming behaviour.

SGLang QuickStart: Install, Configure, and Serve LLMs via OpenAI API

SGLang is a high-performance serving framework for large language models and multimodal models, built to deliver low-latency and high-throughput inference across everything from a single GPU to distributed clusters.

llama-swap Model Switcher Quickstart for OpenAI-Compatible Local LLMs

Soon you are juggling vLLM, llama.cpp, and more, each stack on its own port. Everything downstream still wants one /v1 base URL; otherwise you keep shuffling ports, profiles, and one-off scripts. llama-swap is the /v1 proxy that sits in front of those stacks.

LocalAI QuickStart: Run OpenAI-Compatible LLMs Locally

LocalAI is a self-hosted, local-first inference server designed to behave like a drop-in OpenAI API for running AI workloads on your own hardware (laptop, workstation, or on-prem server).

llama.cpp Quickstart with CLI and Server

I keep coming back to llama.cpp for local inference: it gives you control that Ollama and others abstract away, and it just works. It is easy to run GGUF models interactively with llama-cli or to expose an OpenAI-compatible HTTP API with llama-server.

LLM Self-Hosting and AI Sovereignty

Self-hosting LLMs keeps data, models, and inference under your control: a practical path to AI sovereignty for teams, enterprises, and nations.

Open WebUI: Self-Hosted LLM Interface

Open WebUI is a powerful, extensible, and feature-rich self-hosted web interface for interacting with large language models.

vLLM Quickstart: High-Performance LLM Serving - in 2026

vLLM is a high-throughput, memory-efficient inference and serving engine for Large Language Models (LLMs) developed by UC Berkeley’s Sky Computing Lab.

Ollama vs vLLM vs LM Studio: Best Way to Run LLMs Locally in 2026?

Running LLMs locally is now practical for developers, startups, and even enterprise teams.
But choosing the right tool — Ollama, vLLM, LM Studio, LocalAI or others — depends on your goals:

Docker Model Runner: Context Size Config Guide

Configuring context sizes in Docker Model Runner is more complex than it should be.
