Finding Optimal Tokenizers (blog.aqnichol.com) AI

A blog post describes an approach to computing provably optimal tokenizers: tokenization is formulated as an integer linear program, and cutting-plane techniques are used to push the relaxed LP solution toward an integral optimum. The author reports that, although theory suggests optimal tokenization is intractable, they found optimal vocabularies for toy problems (including a tokenizer with a vocabulary size of 512 for Pride and Prejudice). The post also discusses limitations: reliance on a pretokenizer, the observation that existing methods are already close to optimal, and concerns about generalization.

June 12, 2026 00:15 Source: Hacker News