From 300KB to 69KB per Token: How LLM Architectures Solve the KV Cache Problem (news.future-shock.ai) AI
The article explains how a transformer's KV cache lets the model "remember" earlier tokens in an ongoing conversation by holding their key/value vectors in GPU memory, and why the per-token byte cost forces constant memory management. It compares several architecture changes—like grouped-query attention, compressed latent caches, and sliding-window attention—that shrink per-token cache size, and contrasts this short-lived working memory with long-term "memory" features that rely on separate systems such as retrieval and stored facts. It also discusses what happens when the cache is evicted or grows too large, including lossy compaction and the resulting need for external memory tools.
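As a rough illustration of why grouped-query attention cuts per-token cache cost, here is a minimal sketch of the standard KV-cache size formula. The model parameters below (32 layers, 128-dim heads, fp16) are hypothetical and chosen for round numbers, not taken from the article or any specific model:

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int,
                       head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes of KV cache one token occupies.

    Each layer stores a key vector and a value vector (factor of 2)
    of size n_kv_heads * head_dim, at dtype_bytes per element.
    """
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes

# Full multi-head attention: every query head has its own KV head.
mha = kv_bytes_per_token(n_layers=32, n_kv_heads=32, head_dim=128)
# Grouped-query attention: many query heads share 8 KV heads,
# so the cache shrinks by the same 4x ratio.
gqa = kv_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)

print(mha)  # 524288 bytes (512 KiB) per token
print(gqa)  # 131072 bytes (128 KiB) per token
```

Sliding-window attention attacks the other factor: it caps the number of cached tokens rather than the bytes per token, bounding total cache size at `window_length * kv_bytes_per_token(...)`.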
March 31, 2026 17:33
Source: Hacker News