What Are Tokens in LLMs? (bearisland.dev) AI

The article explains that LLMs do not operate on raw characters or words: text is first converted into model-specific integer token IDs by a tokenizer built with an algorithm such as (byte-level) BPE. It walks through how BPE builds a vocabulary incrementally by repeatedly merging the most frequent adjacent pair of tokens, with an example showing how a common word like “cat” can end up as a single token. It also clarifies the “strawberry” effect: different models may split the same word differently because their vocabularies differ. Finally, it notes that byte-level tokenization avoids out-of-vocabulary characters by starting from UTF-8 bytes.
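
As a rough illustration of the merge loop summarized above, here is a minimal sketch of byte-level BPE in Python. It is not the article's code or any particular library's implementation; the function names (`train_bpe`, `merge_pair`, `encode`), the toy corpus, and the choice to assign new token IDs starting at 256 are all assumptions made for this example.

```python
from collections import Counter


def merge_pair(symbols, pair, new_id):
    """Replace every adjacent occurrence of `pair` in `symbols` with `new_id`."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges from a list of words (hypothetical helper)."""
    # Every word starts as a tuple of UTF-8 byte values, so the base
    # vocabulary is the 256 possible bytes (IDs 0-255).
    words = Counter(tuple(word.encode("utf-8")) for word in corpus)
    merges = {}  # (left_id, right_id) -> new_id, kept in learned order
    next_id = 256
    for _ in range(num_merges):
        # Count adjacent pairs, weighted by how often each word occurs.
        pair_counts = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single token
        best = max(pair_counts, key=pair_counts.get)
        merges[best] = next_id
        # Rewrite the corpus with the new merged token.
        new_words = Counter()
        for symbols, freq in words.items():
            new_words[merge_pair(symbols, best, next_id)] += freq
        words = new_words
        next_id += 1
    return merges


def encode(text, merges):
    """Turn text into token IDs by replaying the merges in learned order."""
    symbols = tuple(text.encode("utf-8"))
    for pair, new_id in merges.items():
        symbols = merge_pair(symbols, pair, new_id)
    return list(symbols)


# Toy corpus (an assumption for this sketch): "cat" is the most common word.
corpus = ["cat"] * 5 + ["car"] * 2 + ["hat"]
merges = train_bpe(corpus, num_merges=2)

print(encode("cat", merges))  # [257]           frequent word, one token
print(encode("car", merges))  # [256, 114]      shares the "ca" merge
print(encode("hat", merges))  # [104, 97, 116]  falls back to raw bytes
print(encode("é", merges))    # [195, 169]      unseen character, still encodable
```

With this toy corpus the first merge joins the bytes for "c" and "a", and the second joins that new token with "t", so "cat" becomes a single ID. A tokenizer trained on a different corpus would learn different merges and split the same word differently, and any character absent from the training data still encodes as its UTF-8 bytes rather than as an unknown token.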

June 07, 2026 21:05 Source: Hacker News