Speculative KV coding: losslessly compressing KV cache by up to ~4× (fergusfinn.com) AI
The post proposes “Speculative KV coding,” a lossless method for compressing an LLM’s KV cache. A cheaper predictor model estimates the value and uncertainty of each KV scalar, and the exact target cache is then arithmetic-coded against those estimates, so the better the predictor fits, the fewer bits each value costs. Experiments on Qwen3 suggest up to ~4× lossless compression (on top of ~8× from FP8 cache compression).
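To make the coding step concrete, here is a minimal sketch of how a predictor's mean and uncertainty translate into bits. It is not the post's implementation: it assumes each KV scalar has already been quantized to an 8-bit symbol (0..255), that the predictor's estimate is modeled as a discretized Gaussian, and that an arithmetic coder spends roughly -log2(p) bits on a symbol of probability p. The names `symbol_probs`, `ideal_bits`, and `compression_ratio`, and the probability floor, are hypothetical choices made for illustration.

```python
import math
import random


def gaussian_cdf(x, mu, sigma):
    """CDF of a normal distribution N(mu, sigma^2) evaluated at x."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def symbol_probs(mu, sigma, levels=256, floor=1e-6):
    """Discretize N(mu, sigma^2) into probabilities for symbols 0..levels-1.

    The small floor (an assumption of this sketch) keeps every symbol codable,
    so the scheme stays lossless even when the predictor is badly wrong.
    """
    probs = []
    for s in range(levels):
        mass = gaussian_cdf(s + 0.5, mu, sigma) - gaussian_cdf(s - 0.5, mu, sigma)
        probs.append(mass + floor)
    total = sum(probs)
    return [p / total for p in probs]


def ideal_bits(symbol, mu, sigma):
    """Approximate arithmetic-coding cost of one symbol: -log2 of its probability."""
    return -math.log2(symbol_probs(mu, sigma)[symbol])


def compression_ratio(targets, predictions, sigma, raw_bits=8):
    """Raw size divided by ideal coded size for a sequence of symbols."""
    coded = sum(ideal_bits(t, m, sigma) for t, m in zip(targets, predictions))
    return raw_bits * len(targets) / coded


if __name__ == "__main__":
    # Synthetic data only: the "predictor" is the true value plus Gaussian noise.
    rng = random.Random(0)
    targets = [rng.randrange(20, 236) for _ in range(500)]
    predictions = [t + rng.gauss(0.0, 2.0) for t in targets]
    print(f"ratio: {compression_ratio(targets, predictions, sigma=2.0):.2f}")
```

In this sketch a confident, accurate prediction costs only a few bits, while a value far from the predicted mean costs more than the raw 8 bits but can still be encoded exactly. A real coder would also need the decoder to reproduce the same predictions, which this example does not model.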
June 07, 2026 08:03
Source: Hacker News