DeepSeek-V4-Flash (official FP8) running across 2x DGX Spark (forums.developer.nvidia.com) AI

An NVIDIA developer forum post shares a working end-to-end setup for running DeepSeek-V4-Flash in official FP8 across two DGX Spark nodes (tensor parallel, TP=2) with MTP and a 200K context window. It includes specific build and run steps, key recipe flags, and reported throughput and TTFT numbers (e.g., ~44 tok/s warm decode, ~2 s TTFT on short prompts, and a ~6-minute cold start). The thread also documents the issues encountered, especially cross-node NCCL/version “pinning,” slow long-context prefill, and a SparkRun benchmark hang caused by tokenizer resolution in the benchmark harness. Follow-up results point to further performance tuning, including a later test with a 256K context window and updated benchmark figures.

June 03, 2026 00:30 Source: Hacker News