Llama-Server Router Mode - Dynamic Model Switching Without Restarts
Serve and swap LLMs without restarts.
For a long time, llama.cpp had a glaring limitation:
you could only serve one model per process, and switching meant a restart.
Serve and swap LLMs without restarts.
For a long time, llama.cpp had a glaring limitation:
you could only serve one model per process, and switching meant a restart.
Build Claude Skills that survive real work
Most teams misuse Claude Skills in one of two ways. They either turn SKILL.md into a dumping ground, or they never graduate from giant copy-pasted prompts.
Profile-first Hermes setups for serious workloads
Hermes AI assistant, officially documented as Hermes Agent, is not positioned as a simple chat wrapper.
The skills worth keeping, and the ones to skip
OpenClaw has two extension stories, and they are easy to mix up.
Plugins extend the runtime. Skills extend the agent’s behavior.
Plugins first. Skills naming in brief.
This article is about OpenClaw plugins — native gateway packages that add channels, model providers, tools, speech, memory, media, web search, and other runtime surfaces.
How real OpenClaw systems are actually structured
OpenClaw looks simple in demos. In production, it becomes a system.
Claude subscriptions no longer power agents
The quiet loophole that powered a wave of agent experimentation is now closed.
Self-hosted AI search with local LLMs
Vane is one of the more pragmatic entries in the “AI search with citations” space: a self-hosted answering engine that mixes live web retrieval with local or cloud LLMs, while keeping the whole stack under your control.
Agentic coding, now with local model backends.
Claude Code is not autocomplete with better marketing. It is an agentic coding tool: it reads your codebase, edits files, runs commands, and integrates with your development tools.
Hermes Agent install and quickstart for devs
Hermes Agent is a self-hosted, model-agnostic AI assistant that runs on a local machine or low-cost VPS, works through terminal and messaging interfaces, and improves over time by turning repeated tasks into reusable skills.
Install TGI, ship fast, debug faster
Text Generation Inference (TGI) has a very specific energy. It is not the newest kid in the inference street, but it is the one that already learned how production breaks -
llama.cpp token speed on 16 GB VRAM (tables).
Here I am comparing speed of several LLMs running on GPU with 16GB of VRAM, and choosing the best one for self-hosting.
RTX 5090 in AU is scarce and overpriced
Australia has RTX 5090 stock. Barely. And if you find one, you will pay a premium that feels detached from reality.
Remote Ollama access without public ports
Ollama is at its happiest when it is treated like a local daemon: the CLI and your apps talk to a loopback HTTP API, and the rest of the network never finds out it exists.
Compose-first Ollama server with GPU and persistence.
Ollama works great on bare metal. It gets even more interesting when you treat it like a service: a stable endpoint, pinned versions, persistent storage, and a GPU that is either available or it is not.
HTTPS Ollama without breaking streaming responses.
Running Ollama behind a reverse proxy is the simplest way to get HTTPS, optional access control, and predictable streaming behaviour.