Playing with Vision Embeddings (prestonbjensen.com) AI

The post explores how DINOv3 vision transformer embeddings, each a single vector of 384 numbers, encode image information. It first generates images via gradient optimization, then uses a sparse autoencoder to learn thousands of more interpretable “feature directions” that can decompose or recombine embeddings. Examples include identifying the features behind scenes such as the Golden Gate Bridge, demonstrating feature superposition, and showing how adding or interpolating features blends or juxtaposes visual concepts.
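
To make the decompose-and-recombine idea concrete, here is a minimal sketch of a sparse autoencoder's forward pass in plain Python. It is not the post's code. The feature count (2048), the random untrained weights, the shared encoder/decoder directions, and the bias value are all assumptions chosen for illustration; only the 384-number embedding size comes from the summary. A real sparse autoencoder would learn these weights by minimizing reconstruction error plus a sparsity penalty.

```python
import random

EMBED_DIM = 384      # embedding size stated in the summary
NUM_FEATURES = 2048  # hypothetical; the summary only says "thousands"

rng = random.Random(0)

# Random stand-ins for trained weights: one row per feature direction.
# Assumption: encoder and decoder share the same directions (tied weights).
directions = [
    [rng.gauss(0.0, 1.0) / EMBED_DIM ** 0.5 for _ in range(EMBED_DIM)]
    for _ in range(NUM_FEATURES)
]
# Hypothetical negative bias so that only a minority of features fire.
bias = [-2.0] * NUM_FEATURES


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def encode(embedding):
    """ReLU(W x + b): non-negative activations, one per feature."""
    return [max(0.0, dot(d, embedding) + b) for d, b in zip(directions, bias)]


def decode(activations):
    """Rebuild an embedding as a weighted sum of the active feature directions."""
    out = [0.0] * EMBED_DIM
    for a, d in zip(activations, directions):
        if a > 0.0:
            for i, v in enumerate(d):
                out[i] += a * v
    return out


def add_feature(embedding, feature_index, strength):
    """Recombine: push an embedding along one feature direction."""
    d = directions[feature_index]
    return [e + strength * v for e, v in zip(embedding, d)]


# A random vector stands in for a real image embedding.
embedding = [rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)]
activations = encode(embedding)          # decompose into feature activations
reconstruction = decode(activations)     # recombine into a 384-number vector
steered = add_feature(embedding, feature_index=7, strength=3.0)
```

With untrained weights the reconstruction will not resemble the input; the sketch only shows the shape of the computation, where an embedding becomes a mostly-zero list of feature activations and adding a direction raises that feature's pre-activation.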

June 08, 2026 06:05 Source: Hacker News