Qwen 3.6 27B and 35B MTP vs Standard on 16GB GPU
MTP vs standard decoding on RTX 4080 — real benchmarks
I tested speculative decoding (Multi-Token Prediction, MTP) performance in Qwen 3.6 27B and 35B on an RTX 4080 with 16 GB VRAM.
Free VRAM without killing llama-server.
llama.cpp router mode is one of the most useful changes to llama-server in years. It finally gives local LLM operators something close to the model management experience people expect from Ollama, while keeping the raw performance and low-level control that make llama.cpp worth using in the first place.
Search is not knowledge structure
Most modern knowledge systems optimize retrieval, and that is understandable. Search is visible, easy to demo, and feels magical when it works. Type a question, get an answer.
Compiled knowledge for AI systems
The premise is simple: compiled knowledge is more reusable than retrieved fragments. RAG became the default answer to a straightforward question: how do I give an LLM access to external knowledge?
A map of modern knowledge systems
PKM, RAG, wikis, and AI memory systems are often discussed as if they solve the same problem. They do not. They all deal with knowledge, but they operate at different layers:
Notes are storage. A second brain is computation.
Information overload is less about sheer volume than about unresolved inputs. Modern knowledge work leaves a trail of tabs, chat threads, docs, highlights, snippets, transcripts, screenshots, and half-written notes.
Stop parsing vibes. Validate contracts.
Most LLM “structured output” tutorials are unserious. They teach you to ask for JSON politely and then hope the model behaves. That is not validation. That is optimism with braces.
Agentic LLM tuning reference
This page is a practical reference for agentic LLM inference tuning (temperature, top_p, top_k, penalties, and how they interact in multi-step and tool-heavy workflows).
Stop duplicate side effects
Idempotency in distributed systems is the property that saves you after the network lies, the queue retries, the client panics, and the operator hits replay. In production systems, duplicate delivery is normal. Duplicate side effects are the bug.
Talk to Hermes from your phone
You already chat with Hermes Agent by text from your phone. Now you want to talk to it directly and get spoken replies back. That is usually the right move, especially if you already use Hermes as a persistent self-hosted assistant. Typing long prompts on a small screen is slow and error-prone.
Control Hermes Kanban load on your self-hosted LLM.
Hermes Agent ships with a Kanban-style board and the Hermes Gateway, which can saturate your self-hosted LLM if too many tasks are dispatched at once.
Author Hermes skills that load fast and behave reliably
Hermes Agent treats skills as the default way to teach repeatable workflows. Official documentation describes them as on-demand knowledge documents aligned with the open agentskills.io shape, loaded through progressive disclosure so the model sees a small index first and only pulls full instructions when a task actually needs them.
Shell and TUI commands for self-hosted Hermes Agent.
Hermes Agent from Nous Research is a model-agnostic, tool-using assistant you run locally or on a VPS.
MinIO CE is effectively end of life in 2026.
MinIO Community Edition is no longer a safe default for new production systems.
Run OpenClaw safely with NemoClaw
Most AI agent stacks still treat security as a post-demo fix. NemoClaw starts from the opposite assumption and makes isolation, policy, and routing day-zero defaults.
Eight pluggable backends for persistent agent memory.
Modern assistants still forget everything when you close the tab unless something persists beyond the context window. Agent memory providers are services or libraries that hold facts and summaries across sessions — often wired in as plugins so the framework stays thin while memory scales.