Speculative KV coding: losslessly compressing KV cache by up to ~4× (fergusfinn.com)

The post proposes “Speculative KV coding,” a lossless method for compressing an LLM’s KV cache. A cheaper predictor model estimates the value and uncertainty of each KV scalar, and the exact target cache is then arithmetic-coded against those estimates, so the better the predictor fits, the fewer bits each value costs. Experiments on Qwen3 suggest up to ~4× lossless compression (on top of ~8× from FP8 cache compression).
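
To illustrate why a good predictor translates into fewer bits, here is a minimal Python sketch of the ideal code length an arithmetic coder would approach. It is not the post's implementation. It assumes each cache value is an 8-bit symbol (integers 0..255 standing in for FP8 codes) and that the predictor's output is a Gaussian with mean `mu` and standard deviation `sigma`. The names `symbol_probs`, `ideal_bits`, and `compression_ratio` are hypothetical, and the coder itself is omitted: only the cost of -log2(p) bits per symbol is computed.

```python
import math


def symbol_probs(mu, sigma, n_symbols=256):
    """Discretized Gaussian over integer symbols 0..n_symbols-1.

    A small probability floor keeps every symbol codable, which is what
    makes the scheme lossless even when the predictor is badly wrong.
    (The Gaussian model and the floor value are assumptions of this sketch.)
    """
    def cdf(x):
        return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

    raw = [cdf(k + 0.5) - cdf(k - 0.5) for k in range(n_symbols)]
    floor = 1e-6
    total = sum(raw) + floor * n_symbols
    return [(p + floor) / total for p in raw]


def ideal_bits(symbol, mu, sigma, n_symbols=256):
    """Ideal arithmetic-coding cost of one symbol: -log2 of its probability."""
    return -math.log2(symbol_probs(mu, sigma, n_symbols)[symbol])


def compression_ratio(symbols, predictions, raw_bits=8):
    """Raw size divided by ideal coded size for a sequence of symbols.

    `predictions` holds one (mu, sigma) pair per symbol.
    """
    coded = sum(ideal_bits(s, mu, sigma) for s, (mu, sigma) in zip(symbols, predictions))
    return raw_bits * len(symbols) / coded


if __name__ == "__main__":
    # Toy data: the predictor is close to the true values with low uncertainty.
    symbols = [100, 101, 99, 103]
    predictions = [(100.0, 2.0)] * len(symbols)
    print(f"ratio: {compression_ratio(symbols, predictions):.2f}")
```

A symbol close to a confident prediction costs far fewer than 8 bits, while a symbol far from the prediction costs more than 8 bits but can still be encoded exactly, so the ratio depends entirely on how well the predictor fits the target cache.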

June 07, 2026 08:03 Source: Hacker News