Bringing Up DeepSeek-V4-Flash on AMD MI300X (fergusfinn.com) AI

A Doubleword worklog describes bringing up DeepSeek-V4-Flash on AMD MI300X. The main blockers were FP8 dialect support (“fnuz” versus OCP), missing or buggy AITER tuned-kernel paths for specific sparse attention shapes on the MI300X's gfx942 architecture, and the need to use HIP graphs carefully so that captured execution stays static. After fixing correctness issues and optimizing sparse MLA decode and MXFP4 MoE bookkeeping, the authors report a modest uplift on a benchmark (+8.6%, to 2699 output tokens/s per GPU). They argue that MI300X can be cost-effective despite remaining software gaps, which are expected to improve on newer AMD parts.
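
To make the FP8 dialect blocker concrete, the sketch below decodes the same byte under both E4M3 conventions. It is an illustration only, not code from the worklog: the function name `decode_e4m3` is hypothetical, and it assumes the usual definitions (OCP E4M3 with exponent bias 7 and NaN at S.1111.111; the fnuz variant with bias 8, no negative zero, and 0x80 as the only NaN).

```python
import math

def decode_e4m3(byte: int, fnuz: bool) -> float:
    """Decode one 8-bit E4M3 value (1 sign, 4 exponent, 3 mantissa bits).

    Hypothetical helper for illustration; not taken from the worklog.
    """
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0xF
    man = byte & 0x7
    if fnuz:
        # fnuz: bias 8, no negative zero; 0x80 is the single NaN encoding.
        if byte == 0x80:
            return math.nan
        bias = 8
    else:
        # OCP E4M3: bias 7; S.1111.111 is NaN, and there are no infinities.
        if exp == 0xF and man == 0x7:
            return math.nan
        bias = 7
    if exp == 0:
        # Subnormal: no implicit leading 1.
        return sign * (man / 8.0) * 2.0 ** (1 - bias)
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - bias)

# The same bit pattern means different things in each dialect.
for b in (0x38, 0x7E, 0x7F, 0x80):
    print(hex(b), decode_e4m3(b, fnuz=False), decode_e4m3(b, fnuz=True))
```

Under these assumptions, a normal-range byte written in the OCP dialect and read as fnuz comes out at half its intended value, and the NaN and negative-zero encodings do not line up, which is why a checkpoint in one dialect cannot simply be reinterpreted in the other.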

June 02, 2026 19:05 Source: Hacker News