How modern speech systems are built and trained — LLM-based TTS, audio LLMs, and the neural codecs underneath them — ending with concrete recipes for training your own.
How modern speech systems are built and trained — LLM-based TTS, audio LLMs, and the neural codecs underneath them — ending with concrete recipes for training your own.
Classical text-to-speech (TTS) models have long excelled at voice cloning and speech synthesis. They generally follow a two-stage process: first, a model like Tacotron converts text into an intermediate acoustic representation (such as a spectrogram), and then a vocoder (for example, WaveGlow or HiFi-GAN) transforms that representation into waveform audio. While these systems are capable of producing lifelike voices, their primary focus has been on replicating a given speaker's sound, with limited capacity to engage in dynamic, context-aware conversations.
The advent of large language models (LLMs) offers a compelling opportunity to enhance these systems. By incorporating LLMs into TTS pipelines, we can leverage their sophisticated reasoning and contextual understanding to create truly conversational speech systems. Instead of merely cloning a voice, these enhanced systems can interpret context, adapt to dialogue flows, and generate responses that feel both natural and interactive. Essentially, LLMs open up a new dimension where synthesis isn’t only about producing sound—it’s about enabling intelligent, context-aware conversation.
One practical way to integrate these capabilities is through a cascaded approach in a speech-to-speech system, which typically involves three distinct modules:
This cascaded method combines the strengths of each specialized component. However, this approach is not without its limitations. One major challenge is that the LLM does not capture the full richness of the speech input. Speech carries subtle cues—intonation, rhythm, emotion, and prosodic nuance—that are often lost in the conversion process to text. As a result, when an LLM processes transcribed text, it receives a significantly distilled representation of the original audio. This loss of detail can limit the model’s ability to produce responses that fully mirror the expressive qualities of the initial speech, potentially resulting in synthetic output that feels less dynamic or contextually aware.
Integrating speech directly with an LLM could solve this challenge but it also presents significant difficulties. Unlike text, speech is a continuous, high-dimensional signal. LLMs are designed to work with discrete tokens, so converting speech into a format that these models can process requires additional steps. Existing methods address this gap in two main ways:
This playbook is about the second path — the end-to-end one, where a single model ingests or produces speech directly rather than relaying it through a chain of black boxes. Everything downstream depends on one prerequisite the cascaded pipeline sidesteps entirely: an LLM only speaks in discrete tokens, so before anything else we need a way to turn a continuous waveform into a sequence of tokens, and those tokens back into sound. That is where we begin.
This post is a playbook, and it builds toward a concrete goal. We build up the modern speech stack roughly in the order you would learn it: first the neural codec that turns waveforms into tokens (the next chapter), then how to make an LLM understand audio (Qwen2-Audio, Voxtral), how to make it speak (LLaSA, Orpheus, CSM), and how those combine into real-time conversational systems — studying not just each architecture but how it is actually trained, and the two pillars every builder runs into along the way: data and evaluation. By the end, you should have a working recipe for all three: an LLM-based TTS model, a neural audio codec, and an audio LLM. The selection here is opinionated — models chosen because each teaches a distinct lesson. For an exhaustive map of the field, the ACL 2025 SpeechLM survey
Before we look at a single model, we have to solve the problem they all share. An LLM is a machine for predicting the next item in a sequence of discrete symbols from a fixed vocabulary. Speech is the opposite of that: a continuous waveform, tens of thousands of real-valued samples every second, with no natural alphabet. Three consequences follow, and together they explain almost every design decision in the rest of this playbook.
The device that takes on the first two problems directly is the neural audio codec: a learned compressor that turns a waveform into a short sequence of discrete tokens, and those tokens back into audio. Get it right and an LLM can treat speech as just another language to model; get it wrong and no amount of LLM scale will rescue you. It is the foundation everything else here is built on — so it is where we start.
A neural codec typically consists of an encoder that compresses raw audio into a compact latent representation, a quantizer that maps each latent frame to one or more discrete codebook indices, and a decoder that reconstructs the waveform from those indices. The key variables are:
EnCodec (Meta, 2022) established the modern neural-codec paradigm with a 24 kHz, 1.5–6 kbps RVQGAN architecture
RVQ is easiest to understand by watching it work. The first codebook stores a coarse sketch of each frame; every subsequent codebook encodes only the residual — the part of the signal the previous levels missed. Drag the slider to add codebooks and watch the reconstruction converge while the token rate and bitrate climb:
And here is the same trade-off with your ears instead of your eyes: the same speech clip passed through EnCodec at three codebook depths. At 2 codebooks the voice is intelligible but robotic; each doubling restores texture — and doubles the LLM's token bill.
facebook/encodec_24khz in transformers by varying the bandwidth argument.There is a second axis that matters even more than codebook count once an LLM enters the picture: what the tokens actually encode. Pure compression codecs like EnCodec and DAC are trained for a single goal — reconstruct the waveform as faithfully as possible — so their tokens are purely acoustic: they capture how the audio sounds, but an individual token has no clean correspondence to the phonemes being spoken. That is ideal for storage and terrible for an LLM, which is at heart a model of linguistic structure. Asking it to predict acoustic tokens is asking it to predict reverb and microphone hiss alongside the actual words.
The fix is to bake linguistic structure into the tokens themselves. A semantic token is trained — usually distilled from a self-supervised speech model such as HuBERT or Wav2Vec2-BERT — to correlate with phonetic content rather than raw acoustics. Codecs built for SpeechLMs increasingly carry both: semantic tokens for "what is being said," acoustic tokens for "how it sounds." Mimi distills semantics into its first codebook and leaves the rest acoustic; XCodec2 and NeuCodec fuse a semantic encoder's features into the quantizer before emitting their single stream. This is the biggest reason recent SpeechLMs are so much more intelligible than the first codec-token experiments: when the first token the LLM predicts is already about meaning, next-token prediction becomes a linguistic act rather than an acoustic one. Keep this acoustic-vs-semantic split in mind for every codec below.
Descript Audio Codec (DAC) extends EnCodec to 44.1 kHz stereo audio and introduces improved RVQGAN training with larger codebooks, achieving higher perceptual quality at 6–8 kbps
SNAC (Multi-Scale Neural Audio Codec) takes a different approach by quantizing at multiple temporal resolutions simultaneously—e.g., separate codebooks for coarse and fine structure
XCodec2 (HKUST, used by LLaSA) opts for a single large codebook of 65,536 entries at 50 Hz. By eliminating the hierarchical RVQ structure, XCodec2 simplifies the LLM's prediction task to straightforward next-token generation, reportedly utilizing ~99% of the codebook. The trade-off is that a single token must encode a 20 ms frame, so the model relies on the massive capacity of the codebook itself to preserve nuance. The design is catching on: Neuphonic\u2019s NeuCodec (the codec behind NeuTTS Air) extends XCodec2 with Finite Scalar Quantization, keeping the single 50 tokens/s stream while cutting the bitrate to 0.8 kbps and pairing a Wav2Vec2-BERT semantic encoder with the acoustic one.
Mimi (Kyutai, 2024) is the codec behind Moshi. It runs at 12.5 Hz with a bitrate of only 1.1 kbps, yet combines semantic tokens (from a self-supervised speech encoder) with acoustic tokens in a single stream. This hybrid design means the LLM receives both high-level linguistic content and low-level timbre information at an extremely low token rate, which is crucial for real-time full-duplex dialogue. Mimi's decoder is also heavily optimized for streaming, running in under ten milliseconds on a single CPU core.
| Codec | Token Rate | RVQ Depth | Typical Bitrate | Used By |
|---|---|---|---|---|
| EnCodec | 75 Hz | 32 | 1.5–6 kbps | Early SpeechLLMs, audiobook pipelines |
| DAC | ~75 Hz | 9 | 6–8 kbps | OuteTTS, high-fidelity TTS |
| SNAC | Multi-scale | 3 levels | ~2 kbps | Orpheus 3B |
| XCodec2 | 50 Hz | 1 (flat) | ~2 kbps | LLaSA |
| NeuCodec | 50 Hz | 1 (FSQ, flat) | 0.8 kbps | NeuTTS Air |
| Mimi | 12.5 Hz | 8 (1 semantic + 7 acoustic) | 1.1 kbps | Moshi |
The choice of codec is not merely an implementation detail: it shapes the LLM's input distribution, the achievable latency, and the fidelity ceiling. A flat single-codebook design like XCodec2 simplifies training and inference, but demands the LLM to carry more of the acoustic modeling burden. A multi-scale or semantic+acoustic design like Mimi pushes complexity into the codec, letting the LLM focus on high-level dialogue and reasoning. As the field matures, we expect to see more hybrid codecs that balance these axes for specific deployment constraints—on-device, low-latency, or studio-quality.
To see why this matters so much for LLMs, look at what one second of speech actually costs in tokens. A text LLM spends roughly 3–4 tokens to represent a second of spoken English; a naively flattened DAC stream spends nearly 800. This token bill determines how much conversation history fits in the context window — and therefore how long a dialogue your SpeechLM can remember:
Here is the payoff, and the shape of the rest of this playbook. Once we can move freely between waveforms and tokens, an LLM can do two mirror-image things with audio: consume it as input (recognition and understanding) or produce it as output (speech synthesis) — the same next-token objective, run in two directions. The two directions don't always use the same representation: as we'll see, a model that only needs to understand audio often feeds the LLM continuous encoder features instead of discrete codec tokens, while any model that generates speech must emit discrete tokens a codec can turn back into sound. The next chapters follow exactly that arc — first teaching an LLM to listen, then to speak, then combining both into real-time conversation.
Before the parade of models begins, here is a single mental model to carry through all of it. Almost every speech LLM is pinned down by just two questions: how does audio get into the LLM, and what does the LLM produce? The first axis is the continuous-vs-discrete choice we just drew — encoder features (rich, but read-only) versus codec tokens (discrete, and writable). The second is simply text out (understanding) versus speech out (generation). Place those two axes at right angles and the entire field falls into a handful of cells.
A few things jump out once it's laid out this way. Understanding models cluster in one cell — continuous features in, text out — because if you never need to generate audio, discrete tokens only cost you fidelity; that is the entire "encoder-plus-LLM" recipe, and Qwen2-Audio, Voxtral, SALMONN, Audio Flamingo 3, and Granite Speech all live there. Pure TTS sits in the opposite corner — text in, codec tokens out — where models like LLaSA and Orpheus need no encoder at all. And the genuinely hard, interesting systems are the ones that span the whole grid: a model that takes audio in and emits speech out is a conversational agent, and how it bridges the two — a separate Talker head (Qwen2.5-Omni), a hybrid input (Kimi-Audio), or one unified token stream (Moshi) — is the defining design choice of the final chapter. As each model comes up, it helps to ask first: which cell is it in, and how does it move between them?
Qwen2-Audio employs a two-component architecture featuring an audio encoder and a large language model (LLM). The audio encoder is initialized using weights from the Whisper-large-v3 model
The training involves a three-stage process aimed at maximizing the probability of the next text token, conditioned on the audio representations and preceding text tokens:
This architectural and training methodology highlights a relatively straightforward approach to building powerful audio-language models. It demonstrates the feasibility of effectively combining strong, pre-existing unimodal models – like a capable audio encoder (Whisper) and a robust LLM (Qwen) – and then adapting them through targeted fine-tuning stages (SFT, DPO). This process of "plugging in" a modality-specific encoder and then fine-tuning the combined system mirrors common practices in multimodal LLMs, particularly analogous to how vision capabilities are often integrated into large language models.
Qwen2-Audio's blueprint has since become the standard pattern, and the transformers library has been steadily absorbing its descendants — each with one distinctive twist worth knowing:
The takeaway: "make an LLM understand audio" is no longer a research problem but a design menu — pick an encoder, pick a fusion mechanism (projector, Q-former, in-place replacement), and decide how you will protect the text weights (data replay, frozen stages, or Granite's audio-gated LoRA).
The first TTS LLM-powered model we examine is LLaSA, developed by researchers at HKUST and Microsoft. LLaSA directly tackles the question: How far can we push large language model (LLM) scaling principles when applied to speech synthesis? While many text-to-speech (TTS) systems use hybrid pipelines or multiple models, LLaSA adopts a minimalist, LLM-style design—one transformer, one stage, one codebook—and scales it up massively across model size and data.
LLaSA’s design philosophy is simplicity and reuse. It starts from a pretrained Llama model and extends both the tokenizer and LM head to include 65,536 new audio tokens on top of the original text vocabulary. These audio tokens come from XCodec2, a flat vector quantizer operating at 50 Hz – i.e. one token corresponds to 20 ms of audio. This design yields 50 tokens per second of audio, all from one codebook. By using a very large codebook, the codec achieves high fidelity with a single token stream (reportedly ~99% codebook usage, meaning it effectively utilizes the full range of tokens for nuanced encoding).
The model is then fully fine-tuned on sequences that combine both text and audio tokens, allowing the transformer to treat speech as a natural continuation of text. No architecture changes are made to the transformer itself—LLaSA simply learns to model speech the same way Llama models language: as a next-token prediction task over a unified vocabulary of text and audio tokens.
The advantage of this single-level approach is its simplicity for the LLM – the model just treats audio tokens like another language with a 65k vocabulary, not unlike how a wordpiece tokenizer might have 50k tokens for text. Training LLaSA thus becomes very similar to training a standard LLM: they convert all audio in the training set into long sequences of XCodec2 tokens and concatenate with the corresponding text transcriptions. The transformer learns to predict the next token, whether that next token is part of the text or the audio. The authors note this compatibility means they can directly apply techniques like data parallel scaling, model compression, or acceleration from the NLP world to this TTS model.
LLaSA’s training of 250k hours is one of the largest in TTS, spanning diverse speech in English and Chinese, it also has a multilingual version for the 1B model and a 3B model. The resulting models are correspondingly powerful. The largest 8B model in particular demonstrates remarkable naturalness and prosody. According to the paper, increasing model size consistently improved speech quality – bigger models produced more accurate and complex prosody patterns and sounded more natural. This is analogous to how in text LLMs, going from 1B to 7B to 70B yields more fluent and context-aware language; here it yields more human-like intonation and rhythm. Even the 1B model, while less expressive, still functions for basic speech and is extremely lightweight to run.
It can also do zero-shot voice cloning by taking a speech prompt. If you feed a short recording of a speaker (converted to tokens via XCodec2) followed by the text, LLaSA will generate the continuation in that voice. This works because the model has effectively learned to continue in the style provided by preceding audio tokens. Additionally, because it’s bilingual, you can prompt it with a Chinese voice and have it speak English in that voice or vice-versa, enabling cross-lingual voice transfer – a very useful feature for voice assistants in multilingual settings.
One interesting contribution of the LLaSA work is in scaling inference-time compute . The authors experimented with using speech understanding models as verifiers during generation . In practice, this means when sampling audio tokens, they would involve a pretrained model (like a speech recognition or speaker ID network) to guide the choice of tokens, re-ranking or biasing the outputs toward those that make the verifier happy. For example, a ASR verifier ensures the content is pronounced clearly (improving word accuracy), while a speaker encoder verifier ensures the voice timbre stays consistent, and a classifier might ensure the emotion matches some target. They found that by spending more computation at inference in this way, they could significantly improve aspects like emotional expressiveness, speaker consistency, and content accuracy in the generated speech. This is akin to how some text LLMs use external tools or rerankers to improve outputs post-hoc. While such techniques increase inference cost, they show a pathway to higher quality without retraining – useful for customizing style or ensuring correctness in critical applications. In terms of output quality, LLaSA is state-of-the-art on traditional TTS metrics. The paper reports that the codec can reconstruct 16 kHz speech with a MOS (Mean Opinion Score) around 4.1 (out of 5) on test data, which is near the ground truth quality. The transformer does introduce some modeling imperfections, but the massive training helps mitigate that. For prosody, the 8B model especially was noted to produce more lifelike intonation, handling even tricky sentences with complex emphasis better than smaller models. The bilingual nature means it learned to control tone appropriate to each language – for example, using the correct cadence for a Chinese question vs an English question. It can also mix languages to an extent (e.g., speaking an English sentence with a Chinese accent or vice versa, if prompted), reflecting the multilingual data.
While LLaSA pushed the boundaries with its massive scale and single-codebook simplicity, other models emerged concurrently, exploring similar LLM-driven TTS principles but often opting for different neural audio codec strategies. Notable examples include Canopy Labs' Orpheus 3B and OuteAI's OuteTTS.
A key difference lies in their choice of codec. Instead of XCodec2's single large codebook, both Orpheus and OuteTTS use codecs based on Residual Vector Quantization (RVQ). RVQ codecs like SNAC (Multi-Scale Neural Audio Codec)
Built upon a Llama-3B backbone, Orpheus pairs the LLM with the SNAC. SNAC is an advanced RVQ codec that captures audio information across different temporal resolutions, aiming for efficient compression and detailed reconstruction. To manage the multiple token streams from SNAC within a standard LLM framework, Orpheus employs a strategy of generating a flattened sequence of tokens (7 tokens per audio frame sequentially). It achieves low-latency streaming (~200ms to ~25-50 ms with input streaming of text into the KV cache) suitable for real-time applications by using an optimized decoding process involving a sliding window technique on the SNAC decoder, ensuring smooth audio output without pops. Orpheus particularly emphasizes generating expressive, emotive speech and supports zero-shot voice cloning from short audio prompts.
| Model | Base LLM | Audio Codec | Key Features |
|---|---|---|---|
| OuteAI/Llama-OuteTTS-1.0-1B | meta-llama/Llama-3.2-1B | ibm-research/DAC.speech.v1.0 |
- One-Shot Voice Cloning - Multilingual - Trained on ~60k hours of audio |
| SparkAudio/Spark-TTS-0.5B | Qwen/Qwen2.5-0.5B | BiCodec |
- Controllable TTS with prompt - Trained on 100k hours of open source audio - Bilingual capabilities |
| nari-labs/Dia-1.6B | Custom 1.6B transformer (trained from scratch) | Descript Audio Codec (DAC) |
- Multi-speaker dialogue generation ([S1]/[S2] tags) - Non-verbal sounds (laughter, coughs, sighs) - Voice cloning from an audio prompt |
| bosonai/higgs-audio-v2-generation-3B-base | meta-llama/Llama-3.2-3B | Unified tokenizer (semantic + acoustic, in-house) |
- DualFFN: separate FFN path for audio tokens - 10M hours pretraining (AudioVerse) - Expressive multi-speaker audio, zero post-training |
| moonshotai/Kimi-Audio-7B-Instruct | Qwen 2.5-7B | Hybrid (continuous + discrete) |
- Universal audio foundation model - ASR, TTS, audio QA, SER, SEC - 13M+ hours pre-training |
| Qwen/Qwen2.5-Omni-7B | Qwen 2.5 | Thinker-Talker (built-in) |
- Multimodal: text, image, audio, video - Real-time voice & video chat - Selectable voices (Chelsie, Ethan) |
| mistralai/Voxtral-Mini-4B-Realtime | Mistral 7B | Voxtral Codec |
- Streaming ASR - 13 languages - 480ms delay |
| mistralai/Voxtral-4B-TTS | Mistral 7B | Voxtral Codec (VQ-FSQ) |
- Multilingual TTS - Voice cloning from 3s audio - Flow-matching decoder |
Moshi is an open 7-billion-parameter speech–text foundation model that aims to make synthetic voices feel as immediate and interruptible as ordinary conversation. Every 80 ms the model’s large Temporal Transformer rolls the entire dialogue history—both text and previously generated audio tokens—into a single context vector. A tiny linear head turns that vector into one semantic token, after which a compact Depth Transformer fills in seven acoustic tokens that refine prosody and timbre. These eight tokens drive the causal Mimi codec, so the first fragment of sound can leave the speaker roughly 160–200 ms after the user finishes talking
Because Moshi writes its own audio tokens and reads the user’s tokens in the same sequence, it can back-channel (“mm-hm”), pause mid-sentence when interrupted, or pick up the thread again—all without external VAD, ASR, or turn-taking heuristics. The entire stack is Apache-2.0 licensed, and the Mimi decoder runs in under ten milliseconds on a single CPU core, making on-device streaming practical for mobile hardware.
Sesame AI’s Conversational Speech Model (CSM-1B) attacks the same problem from the TTS side. Rather than a full-duplex agent like Moshi, CSM is a context-aware speech generator: text and audio tokens from the whole conversation so far form one interleaved sequence, and the model generates the next turn’s audio conditioned on all of it. Because the prosody of a reply depends on what was just said — and how it was said — feeding the model raw conversational context lets it get intonation, emotion, and timing right where an isolated TTS call would sound flat
Architecturally, CSM is two transformers operating on Mimi codec tokens (12.5 Hz, one semantic + N−1 acoustic codebooks per frame). A Llama-style backbone processes the interleaved text-and-audio history and predicts the semantic codebook for each frame; a much smaller audio decoder then models the remaining acoustic codebooks to reconstruct high-fidelity speech. Both stages are autoregressive — the split exists so the expensive backbone runs only once per frame while the cheap decoder fills in the detail, keeping latency low enough for real-time use.
The training corpus is roughly 1 million hours of transcribed, diarized, predominantly English audio, trained for five epochs at a 2,048-token sequence length (about two minutes of dialogue). Sesame trained three sizes — 1B, 3B, and 8B backbones — and released the 1B model openly. In their evaluations, listeners rated isolated CSM utterances on par with human recordings; only when the conversational context was shown did humans retain a clear edge — evidence that naturalness is largely solved and contextual appropriateness is the remaining frontier.
Moving towards truly interactive conversational speech, models need to inherently understand and adapt to the flow of dialogue. Sesame AI's Conversational Speech Model (CSM) represents a significant step in this direction, explicitly designed to leverage context for more natural and coherent speech synthesis, as detailed in their research "Crossing the Uncanny Valley of Voice".
Kimi-Audio from Moonshot AI is a 7-billion-parameter open-source audio foundation model that unifies audio understanding, generation, and conversation within a single framework
The model introduces a hybrid audio input architecture that combines continuous acoustic features with discrete semantic tokens, allowing the LLM core to ingest audio at multiple levels of abstraction. For output, Kimi-Audio employs parallel heads for both text and audio token generation, making it capable of producing text transcriptions, natural language answers, or synthesized speech from the same forward pass. A chunk-wise streaming detokenizer based on flow matching keeps audio output latency low enough for real-time interaction.
Pre-training consumed over 13 million hours of diverse audio data spanning speech, music, and environmental sounds, making Kimi-Audio competitive across a wide spectrum of benchmarks. It achieves strong results on ASR (Common Voice, LibriSpeech), audio question answering (MMAU), speech emotion recognition, and sound event classification, while also supporting end-to-end speech conversation. The model is released in two checkpoints: Kimi-Audio-7B (base) and Kimi-Audio-7B-Instruct (fine-tuned for dialogue).
Alibaba's Qwen2.5-Omni pushes the boundary of multimodal LLMs by simultaneously perceiving text, images, audio, and video while generating both text and natural speech responses in a streaming manner
The architecture splits responsibilities into a Thinker–Talker design. The Thinker is a transformer backbone that processes all modalities; the Talker is a lightweight decoder head that converts hidden states into speech tokens. A novel position-embedding scheme called TMRoPE (Time-aligned Multimodal RoPE) synchronizes video timestamps with audio, so lip movements and spoken words stay aligned in the model's internal representation. This is especially important for tasks like video-based question answering or real-time video chat.
Qwen2.5-Omni comes in 3B and 7B variants. In audio-only benchmarks it rivals or surpasses Qwen2-Audio and Whisper-large-v3, while in speech generation it achieves competitive speaker similarity and content consistency on the SEED evaluation suite. On the OmniBench multimodal benchmark, the 7B model reaches state-of-the-art performance among open models. The model also supports selectable output voices—Chelsie (female, warm) and Ethan (male, upbeat)—giving developers control over persona without extra training.
Mistral AI's Voxtral family is a comprehensive audio suite that spans understanding, generation, and real-time transcription
Voxtral Chat (Mini and Small) is a multimodal audio chat model that comprehends both spoken audio and text documents. A 32K context window lets it handle audio files up to 40 minutes and sustain long multi-turn conversations. Voxtral Small outperforms several closed-source models on audio benchmarks while remaining compact enough to run locally. The model is trained on diverse audio tasks including transcription, audio question answering, and cross-modal reasoning.
Voxtral TTS is an expressive multilingual text-to-speech model that generates natural speech from as little as 3 seconds of reference audio. It adopts a hybrid architecture: an autoregressive Transformer generates semantic speech tokens, and a flow-matching decoder turns those tokens into acoustic features via the custom Voxtral Codec—a speech tokenizer with hybrid VQ-FSQ quantization. In native-speaker evaluations, Voxtral TTS achieved a 68.4% win rate over ElevenLabs Flash v2.5 on multilingual voice cloning, demonstrating strong naturalness and expressivity.
Voxtral Realtime is a natively streaming automatic speech recognition model that matches offline transcription quality at sub-second latency. Unlike chunking or sliding-window adaptations of offline models, Voxtral Realtime is trained end-to-end for streaming with explicit alignment between audio and text streams. It builds on the Delayed Streams Modeling framework, introducing a causal audio encoder and Ada RMS-Norm for improved delay conditioning. At a 480 ms delay, it achieves performance on par with Whisper, the most widely deployed offline transcription system, while scaling to 13 languages.
All three Voxtral models are available on the Hugging Face Hub and can be loaded through the standard transformers library, making them accessible to any developer already familiar with the Hugging Face ecosystem.
Reading the model cards, these systems can look like magic: one checkpoint that listens, reasons, and speaks. But every model in this chapter follows essentially the same three-phase curriculum, directly inherited from text LLMs: pretraining teaches the modality, mid-training teaches the conversational format, and post-training teaches the behavior. Explore the actual pipelines below — the details differ, but the skeleton is remarkably consistent.
It is worth saying plainly: for most teams the hard part of a SpeechLM is not the model, it is the data. Each of the three phases above assumes a corpus of speech — usually paired with text — and assembling one at the right scale and quality is where the real effort goes. The numbers are sobering: a usable single-speaker voice can emerge from a few thousand hours, but the models in this playbook were trained on far more (LLaSA on 250k hours, CSM on ~1M, Kimi-Audio on 13M). The good news is that the open ecosystem has largely caught up, so you rarely start from zero.
| Dataset | ~Hours | Languages | Style | Notes |
|---|---|---|---|---|
| LibriSpeech | 1k | English | read audiobooks | the classic ASR benchmark; clean, CC BY 4.0 |
| LibriHeavy / LibriTTS-R | 50k / 0.6k | English | read | LibriHeavy keeps punctuation & casing; LibriTTS-R is 24 kHz TTS-grade |
| Multilingual LibriSpeech | 50k | 8 | read | the multilingual workhorse; CC BY 4.0 |
| Granary | 1M | 25 (European) | pseudo-labeled, in-the-wild | NVIDIA, 2025; ~650k h ASR + ~350k h speech translation |
| GigaSpeech | 10k | English | audiobooks, podcasts, web | varied speaking style, transcribed |
| Common Voice | 30k+ | 100+ | read, crowdsourced | unmatched language coverage; CC0 |
| Emilia / Emilia-YODAS | 100k+ / 200k+ | 6 | in-the-wild (podcasts, talk shows) | the modern default for expressive TTS; check license terms |
| People's Speech | 30k | English | diverse / spontaneous | large and permissively licensed (CC-BY) |
| Loquacious Set | 25k | English | read, spontaneous, talks; clean & noisy | curated from 6 corpora for research and commercial use |
Raw audio is never training-ready. The standard preparation pipeline — essentially what Voxtral, Moshi, and every model here run — is: segment long recordings with voice-activity detection, diarize to separate speakers, pseudo-label untranscribed audio with a strong ASR model (Whisper is the workhorse), filter on length, language ID, and quality, and finally encode each clip to codec tokens. This pipeline is increasingly productized: NVIDIA's Granary is essentially a million hours of audio run through exactly these steps with their NeMo tooling, released as ready-to-train data across 25 languages. That last step hides a practical trick worth knowing: encode the whole corpus to tokens once and store the tokens, not the waveforms. Neuphonic did exactly this to release Emilia-YODAS pre-encoded with NeuCodec, shrinking it from 1.7 TB to 41 GB — training then streams cheap integer tokens instead of decoding audio on the fly.
One caveat that catches newcomers: "open" does not always mean "commercially usable." Much of the largest in-the-wild speech data carries research-only or source-derived licenses, so check the terms before building a product on top of it. This is exactly the gap the Loquacious Set targets — 25k hours curated from six commercially-usable corpora into one diverse English set, assembled so you do not have to reconcile seven licenses yourself.
No one trains a SpeechLM from scratch on audio. Every system starts from a pretrained text LLM — Helium for Moshi, Qwen2.5-7B for Kimi-Audio, Mistral for Voxtral — because the linguistic knowledge, world knowledge, and reasoning live in the text weights. Audio pretraining is therefore really continued pretraining: expose the model to enormous amounts of audio (7M hours for Moshi
The most important design decision is what the audio-text training sample looks like. Voxtral
The repetition pattern is essentially ASR formatted as language modeling, and it drives transcription accuracy. The continuation pattern — predict the transcript of the next chunk from the current audio — is the one that produces understanding: the model can only continue speech it has actually comprehended, which is exactly the skill spoken QA and dialogue require. Kimi-Audio reaches the same conclusion independently: of its seven pretraining tasks, three are interleaving variants (audio→semantic-token, audio→text, and joint audio+text prediction), trained over a budget of 585B audio tokens and 585B text tokens. If you remember one thing about audio pretraining, make it this: interleaved data is what turns an ASR system into an audio language model.
A small but recurring trick: freeze first, unfreeze later. Voxtral's first pass over the data trains only the adapter between the frozen audio encoder and the frozen LLM; Kimi-Audio keeps its Whisper feature extractor frozen for the first ~20% of pretraining tokens. Aligning a randomly initialized adapter against two stable representations is much easier than letting everything drift at once.
A model pretrained on single-stream audio can continue speech, but it has never seen a conversation — two voices, overlapping, taking turns. Moshi's recipe makes this phase explicit. First, it applies speaker diarization to its unsupervised corpus to simulate two streams (the target speaker's waveform on one channel, everyone else on the other) and trains for 100k steps. Only then does it use real two-channel telephone conversations — the Fisher corpus, 2,000 hours of phone calls recorded with one channel per participant — to learn genuine turn-taking, interruptions, and back-channels. The lesson generalizes: scarce gold data (real two-channel dialogue) is used last, after cheap simulated data has done the heavy lifting.
Mid-training is also where the engineering constraints of audio bite hardest, because audio sequences are long and each position can carry many codebooks. CSM's answer is compute amortization
Supervised fine-tuning for speech has a unique problem text never had: the assistant needs a voice, and your SFT data defines it. Moshi generates 20k+ hours of synthetic dialogues — transcripts written by its own text LLM, then synthesized with a TTS engine conditioned on one voice actor across 70+ speaking styles — while augmenting the user stream with gain changes, background noise, echo and reverb so the model stays robust to real microphones. Kimi-Audio records a professional voice actor across 20+ styles and emotion intensities, then uses voice conversion to expand coverage. And Voxtral contributes an important negative result: SFT data synthesized purely with TTS generalizes poorly to accented human speech — you need real spoken queries in the mix.
The final ingredient — and the fastest-moving — is reinforcement learning. The first wave was DPO, used for two distinct ends:
Through 2025, speech post-training caught the same wave that reshaped reasoning LLMs: a shift from preference pairs (DPO) to GRPO with verifiable rewards (R − mean) / std pushes probability toward the better-than-average samples — with a KL penalty anchoring the model to its starting point. Crucially, the reward needs no human labels and no trained reward model: synthesize the speech, run it back through an ASR model, and score the transcript against the target text. The ASR-in-the-loop verifiers that LLaSA originally used to rerank outputs at inference time (back in the TTS section) are now baked into the weights.
The hard-won lesson across every one of these papers is that a single reward is a trap. Optimize for ASR-measured word error alone and intelligibility climbs while the cloned voice quietly drifts and prosody flattens — the model learns to enunciate, not to sound right. So the recipes converge on composites: the GRPO-TTS paper blends character error with the ASR model's log-likelihood (a softer signal that catches subtle mispronunciations), cutting CosyVoice2's Chinese CER from 1.41 to 1.07 while lifting naturalness MOS from 4.42 to 4.58
This is also a domain where you do not need a frontier-scale rig: applying GRPO to LLaSA-1B — with a composite of ASR word error and model confidence — measurably improved error rates and naturalness on a single A100
Put together, the pattern across these systems is hard to miss. Pretraining = text LLM + massive interleaved audio + anti-forgetting replay. Mid-training = conversational structure (multi-stream, long context) + compute tricks. Post-training = voice-consistent SFT, then GRPO with a composite of verifiable (ASR-in-the-loop) and identity rewards. The final section of this playbook turns this into a checklist.
You cannot improve what you cannot measure, and speech is measured along three largely independent axes. A model can be perfectly intelligible in the wrong voice, or sound gorgeous while saying the wrong words — so you track all three at once (plus latency, if you stream). Crucially, these are the same quantities the RL chapter used as rewards: evaluation metrics and training rewards are the same thing seen from two sides.
The crucial limitation echoes the lesson from reinforcement learning: these objective metrics are necessary but not sufficient. WER says nothing about whether a question sounds like a question; speaker similarity says nothing about emotion. Prosody, expressiveness, and affect still largely escape automatic measurement — which is why human MOS endures, and why the learned, generative reward models from the last chapter (GSRM and kin) are an active frontier for evaluation just as much as for training.
| If you're building… | Measure above all | With |
|---|---|---|
| A TTS / voice-cloning model | intelligibility + voice match + naturalness | WER (Whisper), SECS (WavLM), UTMOS + a human MOS study |
| An ASR / understanding model | accuracy | WER/CER; task scores on AudioBench / VoiceBench |
| A neural codec | reconstruction + modelability | ViSQOL / PESQ / Mel-distance; plus the WER of a small LM trained on its tokens |
| A conversational agent | all of the above + latency | the above + time-to-first-audio and turn-taking metrics |
You rarely report numbers in a vacuum — the field has converged on shared benchmarks: Seed-TTS-eval for zero-shot TTS (WER + speaker similarity on deliberately hard sentences), the Open ASR Leaderboard for recognition, and VoiceBench, AudioBench, and AIR-Bench for audio understanding and spoken dialogue. Note the codec row above: a codec can reconstruct beautifully and still be hard for an LLM to model, so the downstream test — train a small model on its tokens and measure the resynthesized speech — matters more for our purposes than raw reconstruction scores alone.
Everything in this post is a transformers model. Whisper, Voxtral, CSM, Qwen2.5-Omni and friends all live on the Hub with native library support, which means three familiar abstractions cover the whole speech stack: the pipeline() one-liner for transcription, the processor + chat template pattern (with {"type": "audio"} content parts) for audio understanding, and generate() with audio output for models that speak. And the roster keeps growing — recent merges include Audio Flamingo 3, Granite Speech, Higgs Audio V2, VibeVoice ASR, and Music Flamingo, all following the same idioms. Pick a task:
transformers idioms. Each snippet is self-contained — swap the checkpoint to try a different model of the same family.
A few practical notes. Load in bfloat16 with device_map="auto" — speech models are no different from text LLMs here, and the 3–8B models in this post fit comfortably on a single 24 GB GPU. The chat-template pattern is the one worth internalizing: because audio is just another content type in the conversation list, the same code path handles audio + text questions, multi-turn dialogue, and (for omni models) images and video. And when you outgrow these snippets — streaming microphone input, sub-second response latency, interruption handling — the models' own repos (Moshi, CSM, Qwen2.5-Omni) ship dedicated real-time serving stacks that wrap the same checkpoints.
Everything above condenses into three concrete recipes. None of them is hypothetical — every step below is something one of the surveyed models actually does, with the source noted so you can go deeper.
[text tokens] → [audio tokens]. LLaSA used 250k hours; useful models emerge at a few thousand hours.The models surveyed here trace a clear trajectory. First, LLMs learned to listen: Qwen2-Audio and Voxtral Chat showed that pairing a strong audio encoder with a pretrained text model is enough for serious audio understanding. Then they learned to speak: LLaSA and Orpheus demonstrated that TTS can be reduced to next-token prediction over codec tokens, inheriting the entire scaling playbook of text LLMs. Finally, they are learning to converse: Moshi, CSM, Kimi-Audio, and Qwen2.5-Omni close the loop with full-duplex, streaming architectures where listening and speaking happen in the same model, in real time. And despite their architectural differences, they all train the same way: continued pretraining on interleaved audio-text with heavy text replay, mid-training for conversational structure, then voice-consistent SFT and DPO.
Underneath all of it sit the neural audio codecs, whose token rate, codebook structure, and semantic grounding quietly determine what the model above them can achieve. If there is one practical takeaway, it is this: when evaluating or building a SpeechLM, look at the codec first — the choice between a flat single codebook, a hierarchical RVQ stack, or a semantically distilled hybrid shapes the latency, quality, and modeling difficulty of everything downstream. The gap between synthetic and human conversation is closing fast, and it is being closed as much by better tokenizers as by bigger transformers.
For academic attribution, please cite this work as:
"SpeechLMs: LLM-Powered Text-to-Speech and Neural Audio Codecs Explored", 2025.
BibTeX citation
@misc{speechlms_explained,
title={SpeechLMs: LLM-Powered Text-to-Speech and Neural Audio Codecs Explored},
author={Steven Zheng},
year={2025}
}