Speculative Decoding: 20-50% Faster LLM Inference
Faster LLM inference without quality loss: a practical guide
A 70B model generates one token per forward pass, and each pass reloads the weights from VRAM, computes attention across the context, and synchronizes memory. Because every token depends on the one before it, these passes cannot overlap, so the GPU's compute units sit largely idle, waiting on memory traffic instead of doing arithmetic.
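To put a rough number on that bottleneck, here is a back-of-the-envelope sketch of the per-token latency floor set by weight loading alone. The 2 bytes per parameter (FP16) and the 2,000 GB/s aggregate memory bandwidth are illustrative assumptions, not figures from a specific deployment, and the estimate ignores KV-cache reads, attention compute, and any multi-GPU communication.

```python
def min_ms_per_token(params_billion: float,
                     bytes_per_param: float,
                     bandwidth_gb_s: float) -> float:
    """Lower bound on per-token latency when every weight is read once per pass.

    All inputs are assumptions supplied by the caller; this models only
    weight traffic at batch size 1.
    """
    weight_gb = params_billion * bytes_per_param  # 1e9 params * bytes each = GB
    seconds = weight_gb / bandwidth_gb_s          # time to stream the weights once
    return seconds * 1000.0


# Assumed setup: 70B parameters, FP16 weights, 2,000 GB/s aggregate bandwidth.
latency_ms = min_ms_per_token(70, 2, 2000)
tokens_per_s = 1000.0 / latency_ms
print(f"at least {latency_ms:.0f} ms per token, at most {tokens_per_s:.1f} tokens/s")
```

Under these assumptions the floor is about 70 ms per token no matter how fast the arithmetic units are. That is why producing more than one accepted token per pass over the weights is the lever worth pulling.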