Gram Newton-Schulz: A Fast, Hardware-Aware Newton-Schulz Algorithm for Muon (tridao.me) AI
The article proposes “Gram Newton-Schulz”, used in an optimizer called GramMuon, to speed up the Newton-Schulz orthogonalization step in Muon. Instead of iterating on the full rectangular weight matrix, it iterates on the smaller symmetric Gram matrix (XXᵀ), which allows faster symmetric matrix-multiplication kernels and cuts orthogonalization runtime by about 40–50%. The article also studies numerical instability in the naive Gram form (e.g., spurious negative eigenvalues in half precision) and introduces a “restarting” strategy that stabilizes it while preserving optimization quality (within ~0.01 validation perplexity). The authors report up to ~50% reduction in optimizer time on large MoE models and release implementation code and custom GPU kernels.
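To make the Gram idea concrete, here is a minimal NumPy sketch, not the authors' implementation. It assumes the classic cubic Newton-Schulz polynomial (1.5, -0.5) rather than Muon's tuned coefficients, runs in float64, and omits the restarting strategy and custom kernels entirely. The function names are hypothetical. The point it illustrates is that each step multiplies X by a polynomial in XXᵀ, so the loop can run on the small n×n Gram matrix and touch the rectangular matrix only once at the end.

```python
import numpy as np

def newton_schulz(X, steps=40):
    """Baseline: cubic Newton-Schulz applied to the full rectangular matrix."""
    X = X / np.linalg.norm(X)  # Frobenius scaling keeps singular values <= 1
    for _ in range(steps):
        X = 1.5 * X - 0.5 * (X @ X.T) @ X
    return X

def gram_newton_schulz(X, steps=40):
    """Same iteration, carried out on the n x n Gram matrix (assumes n <= m).

    Illustrative only: no restarting, no half-precision safeguards.
    """
    X = X / np.linalg.norm(X)
    n = X.shape[0]
    I = np.eye(n)
    G = X @ X.T        # small symmetric Gram matrix
    Q = np.eye(n)      # accumulated polynomial, so that X_k = Q @ X_0
    for _ in range(steps):
        P = 1.5 * I - 0.5 * G   # one step is X <- P @ X
        Q = P @ Q               # accumulate the left factor
        G = P @ G @ P           # Gram matrix of the updated X (P is symmetric)
    return Q @ X       # single multiply with the rectangular matrix

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 8))   # hypothetical small "weight" matrix
A = newton_schulz(W)
B = gram_newton_schulz(W)
```

In exact arithmetic the two functions produce the same iterates. The summary's point about instability concerns what happens to the Gram form in half precision, which this float64 sketch does not reproduce.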
June 11, 2026 22:15
Source: Hacker News