Self-Hosting

Vane (Perplexica 2.0) Quickstart With Ollama and llama.cpp

Self-hosted AI search with local LLMs

Vane is one of the more pragmatic entries in the “AI search with citations” space: a self-hosted answering engine that mixes live web retrieval with local or cloud LLMs, while keeping the whole stack under your control.

llama-swap Model Switcher Quickstart for OpenAI-Compatible Local LLMs

Hot-swap local LLMs without changing clients.

Before long you are juggling vLLM, llama.cpp, and more, each stack on its own port. Everything downstream still expects a single /v1 base URL, so you end up shuffling ports, profiles, and one-off scripts. llama-swap sits in front of those stacks as that single /v1 proxy.
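The idea is that each backend is declared once and the proxy starts it on demand when a request names its model. As a rough sketch (model names, paths, and the exact schema here are illustrative, not copied from the llama-swap docs; check its README for the real field names), a config might look like:

```yaml
# Illustrative llama-swap-style config -- names and paths are placeholders.
# Each entry maps an OpenAI "model" field value to the command that serves it;
# the proxy launches the matching backend and routes /v1 traffic to it.
models:
  "qwen2.5-7b":
    cmd: llama-server --port ${PORT} -m /models/qwen2.5-7b.gguf
  "llama3-8b-vllm":
    cmd: vllm serve /models/llama3-8b --port ${PORT}
```

Clients then keep one base URL (e.g. `http://localhost:8080/v1`) and switch backends simply by changing the `model` string in an otherwise ordinary OpenAI-compatible request.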