Do Transformers Need Three Projections? Systematic Study of QKV Variants (arxiv.org) AI
The arXiv paper “Do Transformers Need Three Projections?” systematically tests transformer attention variants that share or tie the Q, K, and V projections (including Q=K=V, Q-K=V, and Q=K-V) and reports that the resulting models often match, and sometimes exceed, the performance of standard QKV attention. In language modeling experiments, the Q-K=V variant achieves large KV cache reductions with only a small perplexity degradation, and it combines effectively with head sharing (GQA/MQA) for further memory savings relevant to on-device inference.
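To make the memory argument concrete, here is a minimal NumPy sketch. It assumes that Q-K=V means the queries keep their own projection while keys and values come from a single shared projection; the summary does not spell out the paper's exact formulation, so treat this as one plausible reading. The names `shared_kv_attention`, `cache_bytes`, and `w_kv`, and the model dimensions in the usage lines, are hypothetical and chosen only for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def shared_kv_attention(x, w_q, w_kv):
    """Single-head causal attention with one projection for both K and V.

    Assumption: this is an illustrative reading of "Q-K=V", not the
    paper's reference implementation.
    """
    q = x @ w_q                      # (seq, d_head) queries
    kv = x @ w_kv                    # (seq, d_head) used as keys AND values
    scores = q @ kv.T / np.sqrt(q.shape[-1])
    # Causal mask: position i may not attend to positions j > i.
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    out = softmax(scores) @ kv
    return out, kv                   # kv is the only tensor that needs caching

def cache_bytes(n_layers, n_kv_heads, d_head, seq_len,
                tensors_per_token=2, bytes_per_elem=2):
    """Cache size: 2 tensors per token (K and V) by default, 1 if K=V."""
    return (n_layers * n_kv_heads * d_head * seq_len
            * tensors_per_token * bytes_per_elem)

# Hypothetical model: 32 layers, 32 heads of width 128, 4096-token context, fp16.
standard = cache_bytes(32, 32, 128, 4096, tensors_per_token=2)
shared = cache_bytes(32, 32, 128, 4096, tensors_per_token=1)
shared_gqa = cache_bytes(32, 8, 128, 4096, tensors_per_token=1)  # 8 KV heads
```

Under these assumptions, sharing K and V halves the cache because one tensor is stored per token instead of two, and reducing the number of KV heads (as in GQA/MQA) shrinks it by a further factor equal to the head-sharing ratio. The two savings multiply because they act on independent terms of the size formula.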
June 04, 2026 23:20
Source: Hacker News