I made a kernel 2.2x faster. It made my training loop 3x slower (kyrieblunders.bearblog.dev) AI

A developer building a Dr. GRPO RL post-training loop for Qwen2.5-0.5B on GSM8K reports that a custom fused decode-attention kernel runs 2.2× faster than the SDPA baseline in a microbenchmark, yet makes the overall Hugging Face generate decode step nearly 3× slower, because swapping it in breaks an auto-compile path that the baseline had been benefiting from.
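The gap between the two numbers is easier to see as arithmetic. The sketch below is only an illustration: every timing in it is hypothetical, chosen so the ratios land near the reported 2.2× and 3×, and the split into "kernel" and "rest of the step" is an assumed simplification, not the author's measurement.

```python
def speedup(before_ms: float, after_ms: float) -> float:
    """Return how many times faster `after_ms` is than `before_ms` (>1 means faster)."""
    return before_ms / after_ms


# Hypothetical per-step timings in milliseconds (illustrative, not measured).
kernel_baseline_ms = 2.2   # attention kernel on the baseline path
kernel_custom_ms = 1.0     # attention kernel with the custom fused version
rest_compiled_ms = 7.8     # rest of the decode step while the compile path applies
rest_eager_ms = 29.0       # rest of the decode step once that path no longer applies

# A decode step is modeled as the kernel plus everything around it.
step_baseline_ms = kernel_baseline_ms + rest_compiled_ms
step_custom_ms = kernel_custom_ms + rest_eager_ms

kernel_speedup = speedup(kernel_baseline_ms, kernel_custom_ms)
step_speedup = speedup(step_baseline_ms, step_custom_ms)

print(f"kernel in isolation: {kernel_speedup:.2f}x")
print(f"whole decode step:   {step_speedup:.2f}x")
```

Under these assumed numbers the kernel wins in isolation while the step as a whole loses, because the time saved inside the kernel is much smaller than the time added around it.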

June 05, 2026 03:30 Source: Hacker News