MisoTTS Emotive Speech Model (misolabs.ai) AI

Miso Labs has released MisoTTS, an open-source (weights on Hugging Face) 8-billion-parameter text-to-speech model designed to generate more natural, expressive speech. The model uses a hierarchical residual vector quantization (RVQ) transformer conditioned on both text and audio context; it reportedly pairs a 7.7B backbone with a 300M decoder to predict 32 codebook indices per audio token. The company says this approach addresses a limitation of standard TTS systems, which condition only on text and struggle to cover the wide variety of human speech sounds. It also acknowledges current limits such as half-duplex audio, and lists turn-taking and full-duplex conversation as future work.
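For readers unfamiliar with RVQ, the following is a minimal sketch of the general technique, not MisoTTS's actual code. It shows how one vector can be described by 32 codebook indices, with each stage quantizing whatever the previous stages left over. Only the count of 32 comes from the summary above; the codebook size, the vector dimension, the random codebooks, and all function names are hypothetical choices made for illustration.

```python
import random

NUM_CODEBOOKS = 32   # from the summary: 32 codebook indices per audio token
CODEBOOK_SIZE = 16   # hypothetical, kept small for illustration
DIM = 4              # hypothetical latent dimension


def make_codebooks(seed=0):
    """Build random codebooks whose scale shrinks per stage (illustrative only)."""
    rng = random.Random(seed)
    books = []
    for stage in range(NUM_CODEBOOKS):
        scale = 0.8 ** stage
        books.append([[rng.uniform(-1, 1) * scale for _ in range(DIM)]
                      for _ in range(CODEBOOK_SIZE)])
    return books


def nearest(residual, codebook):
    """Return the index of the codeword closest to the residual (squared distance)."""
    def dist(codeword):
        return sum((r - c) ** 2 for r, c in zip(residual, codeword))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def rvq_encode(vector, codebooks):
    """Greedy RVQ: pick one index per stage, then pass the remainder onward."""
    residual = list(vector)
    indices = []
    for codebook in codebooks:
        idx = nearest(residual, codebook)
        indices.append(idx)
        # Subtract the chosen codeword; the next stage quantizes what is left.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices


def rvq_decode(indices, codebooks):
    """Reconstruct the vector by summing the chosen codeword from each stage."""
    out = [0.0] * DIM
    for idx, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


if __name__ == "__main__":
    books = make_codebooks()
    indices = rvq_encode([0.3, -0.5, 0.1, 0.7], books)
    print(len(indices), "indices:", indices)
    print("reconstruction:", rvq_decode(indices, books))
```

In a trained codec the codebooks are learned rather than random, and a model like the one described would predict the indices instead of computing them from a known vector; the sketch only illustrates why one audio token maps to a stack of indices.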

June 04, 2026 05:01 Source: Hacker News