The Economics of Speculative Decoding (fergusfinn.com) AI

The article argues that speculative decoding remains a key inference performance win, but that newer model architectures, especially mixture-of-experts (MoE) layers and compressed attention/KV-cache techniques, make speculative tokens less “free” by pushing attention and feed-forward operations closer to the compute-bound regime. MoE routing changes the memory/compute roofline, so some speculative tokens become costly to verify, especially at low batch sizes, and compressed attention can remove the slack that speculation previously exploited. Given these updated costs, the article proposes choosing speculation lengths more conservatively, based on acceptance likelihood, because rejected speculative tokens are no longer zero-cost.
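
The trade-off between acceptance likelihood and verification cost can be illustrated with a small sketch. It is not taken from the article. It assumes each drafted token is accepted independently with probability `alpha` (a common simplification), and it models the cost of a step as one baseline decode plus a per-token overhead. The names `marginal_cost`, `draft_cost`, and `best_speculation_length` are hypothetical, chosen only for illustration.

```python
def expected_tokens(alpha: float, k: int) -> float:
    """Expected tokens emitted per verification step when k tokens are drafted.

    Assumes each draft token is accepted independently with probability alpha;
    the step always yields at least one token from the target model.
    """
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def step_cost(k: int, marginal_cost: float, draft_cost: float = 0.0) -> float:
    """Cost of one step relative to a plain single-token decode (cost 1.0).

    marginal_cost: assumed extra verification cost per speculative token.
    draft_cost: assumed cost of drafting one token.
    Both are illustrative parameters, not measured values.
    """
    return 1.0 + k * (marginal_cost + draft_cost)


def best_speculation_length(alpha: float, marginal_cost: float,
                            draft_cost: float = 0.0, max_k: int = 16):
    """Return (k, tokens per unit cost) maximizing throughput; k = 0 means no speculation."""
    best_k, best_rate = 0, 1.0
    for k in range(1, max_k + 1):
        rate = expected_tokens(alpha, k) / step_cost(k, marginal_cost, draft_cost)
        if rate > best_rate:
            best_k, best_rate = k, rate
    return best_k, best_rate


if __name__ == "__main__":
    for cost in (0.0, 0.1, 0.5):
        k, rate = best_speculation_length(alpha=0.8, marginal_cost=cost)
        print(f"marginal_cost={cost}: best k={k}, tokens per unit cost={rate:.2f}")
```

Under this toy model, a zero marginal cost always favors the longest allowed draft, while any nonzero cost produces a finite optimum that shrinks as the acceptance probability falls or the per-token cost rises.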

June 11, 2026 22:15 Source: Hacker News