The SpeechLLM Playbook

How modern speech systems are built and trained — LLM-based TTS, audio LLMs, and the neural codecs underneath them — ending with concrete recipes for training your own.

Beyond Voice Cloning: From Classical TTS to Conversational Speech Systems

Classical text-to-speech (TTS) models have long excelled at voice cloning and speech synthesis. They generally follow a two-stage process: first, a model like Tacotron converts text into an intermediate acoustic representation (such as a spectrogram), and then a vocoder (for example, WaveGlow or HiFi-GAN) transforms that representation into waveform audio. While these systems are capable of producing lifelike voices, their primary focus has been on replicating a given speaker's sound, with limited capacity to engage in dynamic, context-aware conversations.

Audio sample from the Kokoro-TTS model.

The advent of large language models (LLMs) offers a compelling opportunity to enhance these systems. By incorporating LLMs into TTS pipelines, we can leverage their sophisticated reasoning and contextual understanding to create truly conversational speech systems. Instead of merely cloning a voice, these enhanced systems can interpret context, adapt to dialogue flows, and generate responses that feel both natural and interactive. Essentially, LLMs open up a new dimension where synthesis isn’t only about producing sound—it’s about enabling intelligent, context-aware conversation.

One practical way to integrate these capabilities is through a cascaded approach in a speech-to-speech system, which typically involves three distinct modules:

Hover over elements for details

This cascaded method combines the strengths of each specialized component. However, this approach is not without its limitations. One major challenge is that the LLM does not capture the full richness of the speech input. Speech carries subtle cues—intonation, rhythm, emotion, and prosodic nuance—that are often lost in the conversion process to text. As a result, when an LLM processes transcribed text, it receives a significantly distilled representation of the original audio. This loss of detail can limit the model’s ability to produce responses that fully mirror the expressive qualities of the initial speech, potentially resulting in synthetic output that feels less dynamic or contextually aware.

Integrating speech directly with an LLM could solve this challenge but it also presents significant difficulties. Unlike text, speech is a continuous, high-dimensional signal. LLMs are designed to work with discrete tokens, so converting speech into a format that these models can process requires additional steps. Existing methods address this gap in two main ways:

Despite these innovations, each method comes with trade-offs. Audio encoders must balance the preservation of critical information with the need for compact, discrete representations. Neural codecs, meanwhile, face challenges related to token rate—since speech typically generates far more tokens per second than text—and the potential loss of fine-grained acoustic details during quantization.

In summary, while classical TTS models provide a strong foundation for effective voice cloning and speech synthesis, integrating LLM reasoning significantly expands the potential use cases by enabling contextual, conversational interactions. The cascaded STT–LLM–TTS pipeline is a practical approach to achieve this integration, yet it carries inherent challenges such as error propagation between modules and difficulties in capturing the full richness of the speech signal. Advances in audio encoders and neural codecs are crucial for overcoming these hurdles, paving the way for next-generation conversational speech systems that seamlessly combine natural language understanding with high-fidelity audio synthesis.

This post is a playbook, and it builds toward a concrete goal. We will walk through the three families of systems that make up the modern speech stack — models that understand audio (Qwen2-Audio, Voxtral), models that speak (LLaSA, Orpheus, CSM), and the neural codecs that connect waveforms to tokens — and study not just their architectures but how each one is actually trained. By the end, you should have a working recipe for all three: an LLM-based TTS model, a neural audio codec, and an audio LLM. The selection here is opinionated — models chosen because each teaches a distinct lesson. For an exhaustive map of the field, the ACL 2025 SpeechLM survey and the full-duplex spoken dialogue survey are excellent companions; the timeline at the top of this page shows where the models we cover sit in that larger story (AudioLM and VALL-E started the codec-token paradigm, SpeechGPT and SALMONN the encoder-plus-LLM one).

Making LLMs understand speech and audio

Qwen2-Audio

Qwen2-Audio employs a two-component architecture featuring an audio encoder and a large language model (LLM). The audio encoder is initialized using weights from the Whisper-large-v3 model . Leveraging powerful pre-trained models like Whisper as the audio encoder is an approach also seen in other audio-language models; for instance, the Ultravox's models similarly use a Whisper encoder as part of its architecture. For Qwen2-Audio, the input audio (resampled to 16kHz) is converted into a 128-channel mel-spectrogram (25ms window, 10ms hop), which is then processed by a pooling layer, resulting in each encoder frame representing about 40ms of audio. The LLM component is the Qwen-7B model, bringing the total parameter count for Qwen2-Audio to 8.2 billion.

The training involves a three-stage process aimed at maximizing the probability of the next text token, conditioned on the audio representations and preceding text tokens:

This architectural and training methodology highlights a relatively straightforward approach to building powerful audio-language models. It demonstrates the feasibility of effectively combining strong, pre-existing unimodal models – like a capable audio encoder (Whisper) and a robust LLM (Qwen) – and then adapting them through targeted fine-tuning stages (SFT, DPO). This process of "plugging in" a modality-specific encoder and then fine-tuning the combined system mirrors common practices in multimodal LLMs, particularly analogous to how vision capabilities are often integrated into large language models.

The encoder-plus-LLM recipe, everywhere

Qwen2-Audio's blueprint has since become the standard pattern, and the transformers library has been steadily absorbing its descendants — each with one distinctive twist worth knowing:

The takeaway: "make an LLM understand audio" is no longer a research problem but a design menu — pick an encoder, pick a fusion mechanism (projector, Q-former, in-place replacement), and decide how you will protect the text weights (data replay, frozen stages, or Granite's audio-gated LoRA).

Scaling TTS with LLMs

LLaSA: a simple approach to scaling TTS with LLMs

The first TTS LLM-powered model we examine is LLaSA, developed by researchers at HKUST and Microsoft. LLaSA directly tackles the question: How far can we push large language model (LLM) scaling principles when applied to speech synthesis? While many text-to-speech (TTS) systems use hybrid pipelines or multiple models, LLaSA adopts a minimalist, LLM-style design—one transformer, one stage, one codebook—and scales it up massively across model size and data.

LLaSA’s design philosophy is simplicity and reuse. It starts from a pretrained Llama model and extends both the tokenizer and LM head to include 65,536 new audio tokens on top of the original text vocabulary. These audio tokens come from XCodec2, a flat vector quantizer operating at 50 Hz – i.e. one token corresponds to 20 ms of audio. This design yields 50 tokens per second of audio, all from one codebook. By using a very large codebook, the codec achieves high fidelity with a single token stream (reportedly ~99% codebook usage, meaning it effectively utilizes the full range of tokens for nuanced encoding).

The model is then fully fine-tuned on sequences that combine both text and audio tokens, allowing the transformer to treat speech as a natural continuation of text. No architecture changes are made to the transformer itself—LLaSA simply learns to model speech the same way Llama models language: as a next-token prediction task over a unified vocabulary of text and audio tokens.

The advantage of this single-level approach is its simplicity for the LLM – the model just treats audio tokens like another language with a 65k vocabulary, not unlike how a wordpiece tokenizer might have 50k tokens for text. Training LLaSA thus becomes very similar to training a standard LLM: they convert all audio in the training set into long sequences of XCodec2 tokens and concatenate with the corresponding text transcriptions. The transformer learns to predict the next token, whether that next token is part of the text or the audio. The authors note this compatibility means they can directly apply techniques like data parallel scaling, model compression, or acceleration from the NLP world to this TTS model.

LLaSA’s training of 250k hours is one of the largest in TTS, spanning diverse speech in English and Chinese, it also has a multilingual version for the 1B model and a 3B model. The resulting models are correspondingly powerful. The largest 8B model in particular demonstrates remarkable naturalness and prosody. According to the paper, increasing model size consistently improved speech quality – bigger models produced more accurate and complex prosody patterns and sounded more natural. This is analogous to how in text LLMs, going from 1B to 7B to 70B yields more fluent and context-aware language; here it yields more human-like intonation and rhythm. Even the 1B model, while less expressive, still functions for basic speech and is extremely lightweight to run.

It can also do zero-shot voice cloning by taking a speech prompt. If you feed a short recording of a speaker (converted to tokens via XCodec2) followed by the text, LLaSA will generate the continuation in that voice. This works because the model has effectively learned to continue in the style provided by preceding audio tokens. Additionally, because it’s bilingual, you can prompt it with a Chinese voice and have it speak English in that voice or vice-versa, enabling cross-lingual voice transfer – a very useful feature for voice assistants in multilingual settings.

More details about scaling inference-time compute with LLaSA

One interesting contribution of the LLaSA work is in scaling inference-time compute . The authors experimented with using speech understanding models as verifiers during generation . In practice, this means when sampling audio tokens, they would involve a pretrained model (like a speech recognition or speaker ID network) to guide the choice of tokens, re-ranking or biasing the outputs toward those that make the verifier happy. For example, a ASR verifier ensures the content is pronounced clearly (improving word accuracy), while a speaker encoder verifier ensures the voice timbre stays consistent, and a classifier might ensure the emotion matches some target. They found that by spending more computation at inference in this way, they could significantly improve aspects like emotional expressiveness, speaker consistency, and content accuracy in the generated speech. This is akin to how some text LLMs use external tools or rerankers to improve outputs post-hoc. While such techniques increase inference cost, they show a pathway to higher quality without retraining – useful for customizing style or ensuring correctness in critical applications. In terms of output quality, LLaSA is state-of-the-art on traditional TTS metrics. The paper reports that the codec can reconstruct 16 kHz speech with a MOS (Mean Opinion Score) around 4.1 (out of 5) on test data, which is near the ground truth quality. The transformer does introduce some modeling imperfections, but the massive training helps mitigate that. For prosody, the 8B model especially was noted to produce more lifelike intonation, handling even tricky sentences with complex emphasis better than smaller models. The bilingual nature means it learned to control tone appropriate to each language – for example, using the correct cadence for a Chinese question vs an English question. It can also mix languages to an extent (e.g., speaking an English sentence with a Chinese accent or vice versa, if prompted), reflecting the multilingual data.

Text tokens
Llama tokenizer (~128k vocab)
Audio tokens
XCodec2 · 1 codebook × 65,536 codes · 50 tokens/s
Single Llama transformer
next-token prediction over the unified text + audio vocabulary — no extra adapters, no second stage
LLaSA-1B
Llama-3.2-1B
multilingual variant
LLaSA-3B
Llama-3.2-3B
EN + ZH
LLaSA-8B
Llama-3.1-8B
best prosody & naturalness
Figure 3: LLaSA variants and their single-codebook (XCodec2) token stream.

Concurrent Approaches

While LLaSA pushed the boundaries with its massive scale and single-codebook simplicity, other models emerged concurrently, exploring similar LLM-driven TTS principles but often opting for different neural audio codec strategies. Notable examples include Canopy Labs' Orpheus 3B and OuteAI's OuteTTS.

A key difference lies in their choice of codec. Instead of XCodec2's single large codebook, both Orpheus and OuteTTS use codecs based on Residual Vector Quantization (RVQ). RVQ codecs like SNAC (Multi-Scale Neural Audio Codec) and DAC (Descript Audio Codec) represent audio hierarchically, using multiple quantizers (codebooks) where each layer progressively refines the audio representation encoded by the previous layers. This can potentially offer higher fidelity at lower token rates compared to single-codebook approaches, though it may require the LLM to predict multiple token streams or handle a flattened representation.

Orpheus 3B

Built upon a Llama-3B backbone, Orpheus pairs the LLM with the SNAC. SNAC is an advanced RVQ codec that captures audio information across different temporal resolutions, aiming for efficient compression and detailed reconstruction. To manage the multiple token streams from SNAC within a standard LLM framework, Orpheus employs a strategy of generating a flattened sequence of tokens (7 tokens per audio frame sequentially). It achieves low-latency streaming (~200ms to ~25-50 ms with input streaming of text into the KV cache) suitable for real-time applications by using an optimized decoding process involving a sliding window technique on the SNAC decoder, ensuring smooth audio output without pops. Orpheus particularly emphasizes generating expressive, emotive speech and supports zero-shot voice cloning from short audio prompts.

Figure 4: Comparison between traditional Residual Vector Quantization (RVQ) and SNAC.

More models

Model Base LLM Audio Codec Key Features
OuteAI/Llama-OuteTTS-1.0-1B meta-llama/Llama-3.2-1B ibm-research/DAC.speech.v1.0 - One-Shot Voice Cloning
- Multilingual
- Trained on ~60k hours of audio
SparkAudio/Spark-TTS-0.5B Qwen/Qwen2.5-0.5B BiCodec - Controllable TTS with prompt
- Trained on 100k hours of open source audio
- Bilingual capabilities
nari-labs/Dia-1.6B Custom 1.6B transformer (trained from scratch) Descript Audio Codec (DAC) - Multi-speaker dialogue generation ([S1]/[S2] tags)
- Non-verbal sounds (laughter, coughs, sighs)
- Voice cloning from an audio prompt
bosonai/higgs-audio-v2-generation-3B-base meta-llama/Llama-3.2-3B Unified tokenizer (semantic + acoustic, in-house) - DualFFN: separate FFN path for audio tokens
- 10M hours pretraining (AudioVerse)
- Expressive multi-speaker audio, zero post-training
moonshotai/Kimi-Audio-7B-Instruct Qwen 2.5-7B Hybrid (continuous + discrete) - Universal audio foundation model
- ASR, TTS, audio QA, SER, SEC
- 13M+ hours pre-training
Qwen/Qwen2.5-Omni-7B Qwen 2.5 Thinker-Talker (built-in) - Multimodal: text, image, audio, video
- Real-time voice & video chat
- Selectable voices (Chelsie, Ethan)
mistralai/Voxtral-Mini-4B-Realtime Mistral 7B Voxtral Codec - Streaming ASR
- 13 languages
- 480ms delay
mistralai/Voxtral-4B-TTS Mistral 7B Voxtral Codec (VQ-FSQ) - Multilingual TTS
- Voice cloning from 3s audio
- Flow-matching decoder

Beyond LLMs and Audio Codecs: Toward a conversational speech LLM system

Moshi: Real-Time Full-Duplex Speech Generation

Moshi is an open 7-billion-parameter speech–text foundation model that aims to make synthetic voices feel as immediate and interruptible as ordinary conversation. Every 80 ms the model’s large Temporal Transformer rolls the entire dialogue history—both text and previously generated audio tokens—into a single context vector. A tiny linear head turns that vector into one semantic token, after which a compact Depth Transformer fills in seven acoustic tokens that refine prosody and timbre. These eight tokens drive the causal Mimi codec, so the first fragment of sound can leave the speaker roughly 160–200 ms after the user finishes talking .

Because Moshi writes its own audio tokens and reads the user’s tokens in the same sequence, it can back-channel (“mm-hm”), pause mid-sentence when interrupted, or pick up the thread again—all without external VAD, ASR, or turn-taking heuristics. The entire stack is Apache-2.0 licensed, and the Mimi decoder runs in under ten milliseconds on a single CPU core, making on-device streaming practical for mobile hardware.

Moshi architecture: Temporal Transformer, Depth Transformer and Mimi codec
Figure 5: A single context vector becomes one semantic and seven acoustic tokens per 80 ms frame.

CSM-1B: Context-Aware Conversational TTS from Sesame

Sesame AI’s Conversational Speech Model (CSM-1B) attacks the same problem from the TTS side. Rather than a full-duplex agent like Moshi, CSM is a context-aware speech generator: text and audio tokens from the whole conversation so far form one interleaved sequence, and the model generates the next turn’s audio conditioned on all of it. Because the prosody of a reply depends on what was just said — and how it was said — feeding the model raw conversational context lets it get intonation, emotion, and timing right where an isolated TTS call would sound flat .

Architecturally, CSM is two transformers operating on Mimi codec tokens (12.5 Hz, one semantic + N−1 acoustic codebooks per frame). A Llama-style backbone processes the interleaved text-and-audio history and predicts the semantic codebook for each frame; a much smaller audio decoder then models the remaining acoustic codebooks to reconstruct high-fidelity speech. Both stages are autoregressive — the split exists so the expensive backbone runs only once per frame while the cheap decoder fills in the detail, keeping latency low enough for real-time use.

The training corpus is roughly 1 million hours of transcribed, diarized, predominantly English audio, trained for five epochs at a 2,048-token sequence length (about two minutes of dialogue). Sesame trained three sizes — 1B, 3B, and 8B backbones — and released the 1B model openly. In their evaluations, listeners rated isolated CSM utterances on par with human recordings; only when the conversational context was shown did humans retain a clear edge — evidence that naturalness is largely solved and contextual appropriateness is the remaining frontier.

Sesame CSM-1B block diagram
Figure 6: CSM-1B’s single-stage pipeline. The orange arrow shows audio tokens fed back into the same sequence that will soon contain the model’s next reply.

Moving towards truly interactive conversational speech, models need to inherently understand and adapt to the flow of dialogue. Sesame AI's Conversational Speech Model (CSM) represents a significant step in this direction, explicitly designed to leverage context for more natural and coherent speech synthesis, as detailed in their research "Crossing the Uncanny Valley of Voice".

Kimi-Audio: A Universal Audio Foundation Model

Kimi-Audio from Moonshot AI is a 7-billion-parameter open-source audio foundation model that unifies audio understanding, generation, and conversation within a single framework . Built on a Qwen 2.5-7B backbone, it represents one of the most ambitious attempts to create a universal audio model rather than a task-specific TTS or ASR system.

The model introduces a hybrid audio input architecture that combines continuous acoustic features with discrete semantic tokens, allowing the LLM core to ingest audio at multiple levels of abstraction. For output, Kimi-Audio employs parallel heads for both text and audio token generation, making it capable of producing text transcriptions, natural language answers, or synthesized speech from the same forward pass. A chunk-wise streaming detokenizer based on flow matching keeps audio output latency low enough for real-time interaction.

Pre-training consumed over 13 million hours of diverse audio data spanning speech, music, and environmental sounds, making Kimi-Audio competitive across a wide spectrum of benchmarks. It achieves strong results on ASR (Common Voice, LibriSpeech), audio question answering (MMAU), speech emotion recognition, and sound event classification, while also supporting end-to-end speech conversation. The model is released in two checkpoints: Kimi-Audio-7B (base) and Kimi-Audio-7B-Instruct (fine-tuned for dialogue).

Qwen-Omni: A Multimodal LLM for Conversational Speech

Alibaba's Qwen2.5-Omni pushes the boundary of multimodal LLMs by simultaneously perceiving text, images, audio, and video while generating both text and natural speech responses in a streaming manner . Unlike audio-only models, Qwen2.5-Omni is a true any-to-any system, making it suitable for rich voice-and-video assistants.

The architecture splits responsibilities into a Thinker–Talker design. The Thinker is a transformer backbone that processes all modalities; the Talker is a lightweight decoder head that converts hidden states into speech tokens. A novel position-embedding scheme called TMRoPE (Time-aligned Multimodal RoPE) synchronizes video timestamps with audio, so lip movements and spoken words stay aligned in the model's internal representation. This is especially important for tasks like video-based question answering or real-time video chat.

Qwen2.5-Omni comes in 3B and 7B variants. In audio-only benchmarks it rivals or surpasses Qwen2-Audio and Whisper-large-v3, while in speech generation it achieves competitive speaker similarity and content consistency on the SEED evaluation suite. On the OmniBench multimodal benchmark, the 7B model reaches state-of-the-art performance among open models. The model also supports selectable output voices—Chelsie (female, warm) and Ethan (male, upbeat)—giving developers control over persona without extra training.

Voxtral: Mistral AI's Audio Suite

Mistral AI's Voxtral family is a comprehensive audio suite that spans understanding, generation, and real-time transcription . Rather than a single monolithic model, Voxtral is split into three complementary systems, each optimized for a different audio modality and all released under permissive licenses.

Voxtral Chat (Mini and Small) is a multimodal audio chat model that comprehends both spoken audio and text documents. A 32K context window lets it handle audio files up to 40 minutes and sustain long multi-turn conversations. Voxtral Small outperforms several closed-source models on audio benchmarks while remaining compact enough to run locally. The model is trained on diverse audio tasks including transcription, audio question answering, and cross-modal reasoning.

Voxtral TTS is an expressive multilingual text-to-speech model that generates natural speech from as little as 3 seconds of reference audio. It adopts a hybrid architecture: an autoregressive Transformer generates semantic speech tokens, and a flow-matching decoder turns those tokens into acoustic features via the custom Voxtral Codec—a speech tokenizer with hybrid VQ-FSQ quantization. In native-speaker evaluations, Voxtral TTS achieved a 68.4% win rate over ElevenLabs Flash v2.5 on multilingual voice cloning, demonstrating strong naturalness and expressivity.

Voxtral Realtime is a natively streaming automatic speech recognition model that matches offline transcription quality at sub-second latency. Unlike chunking or sliding-window adaptations of offline models, Voxtral Realtime is trained end-to-end for streaming with explicit alignment between audio and text streams. It builds on the Delayed Streams Modeling framework, introducing a causal audio encoder and Ada RMS-Norm for improved delay conditioning. At a 480 ms delay, it achieves performance on par with Whisper, the most widely deployed offline transcription system, while scaling to 13 languages.

All three Voxtral models are available on the Hugging Face Hub and can be loaded through the standard transformers library, making them accessible to any developer already familiar with the Hugging Face ecosystem.

How are these systems trained? Pretraining, mid-training and post-training

Reading the model cards, these systems can look like magic: one checkpoint that listens, reasons, and speaks. But every model in this chapter follows essentially the same three-phase curriculum, directly inherited from text LLMs: pretraining teaches the modality, mid-training teaches the conversational format, and post-training teaches the behavior. Explore the actual pipelines below — the details differ, but the skeleton is remarkably consistent.

Click a model to compare its training pipeline. Stage details are taken from each paper / tech report.
Figure 7: Training pipelines of five conversational speech systems, normalized to the same pretraining → mid-training → post-training skeleton.

Pretraining: teaching the LLM a new modality

No one trains a SpeechLM from scratch on audio. Every system starts from a pretrained text LLM — Helium for Moshi, Qwen2.5-7B for Kimi-Audio, Mistral for Voxtral — because the linguistic knowledge, world knowledge, and reasoning live in the text weights. Audio pretraining is therefore really continued pretraining: expose the model to enormous amounts of audio (7M hours for Moshi , 13M hours for Kimi-Audio ) while actively defending the text abilities you started with. Moshi spends half of all pretraining batches on text-only data; Kimi-Audio's task mixture gives text-only data a sampling weight of 7 versus 1 for audio-only. Catastrophic forgetting, not audio modeling, is the main enemy at this stage.

The most important design decision is what the audio-text training sample looks like. Voxtral makes this beautifully explicit with two interleaving patterns, balanced 50/50 and disambiguated by a special token:

Figure 8: Voxtral's two pretraining patterns over the same chunked audio (A) and transcript (T) stream. Toggle to compare.

The repetition pattern is essentially ASR formatted as language modeling, and it drives transcription accuracy. The continuation pattern — predict the transcript of the next chunk from the current audio — is the one that produces understanding: the model can only continue speech it has actually comprehended, which is exactly the skill spoken QA and dialogue require. Kimi-Audio reaches the same conclusion independently: of its seven pretraining tasks, three are interleaving variants (audio→semantic-token, audio→text, and joint audio+text prediction), trained over a budget of 585B audio tokens and 585B text tokens. If you remember one thing about audio pretraining, make it this: interleaved data is what turns an ASR system into an audio language model.

A small but recurring trick: freeze first, unfreeze later. Voxtral's first pass over the data trains only the adapter between the frozen audio encoder and the frozen LLM; Kimi-Audio keeps its Whisper feature extractor frozen for the first ~20% of pretraining tokens. Aligning a randomly initialized adapter against two stable representations is much easier than letting everything drift at once.

Mid-training: from monologue to conversation

A model pretrained on single-stream audio can continue speech, but it has never seen a conversation — two voices, overlapping, taking turns. Moshi's recipe makes this phase explicit. First, it applies speaker diarization to its unsupervised corpus to simulate two streams (the target speaker's waveform on one channel, everyone else on the other) and trains for 100k steps. Only then does it use real two-channel telephone conversations — the Fisher corpus, 2,000 hours of phone calls recorded with one channel per participant — to learn genuine turn-taking, interruptions, and back-channels. The lesson generalizes: scarce gold data (real two-channel dialogue) is used last, after cheap simulated data has done the heavy lifting.

Mid-training is also where the engineering constraints of audio bite hardest, because audio sequences are long and each position can carry many codebooks. CSM's answer is compute amortization : the backbone predicts the semantic codebook on every frame, but the acoustic decoder trains on only a random 1/16th of frames — with no measurable difference in decoder loss. Qwen2.5-Omni instead splits the problem architecturally (the Talker trains separately from the Thinker) and adds a dedicated long-sequence stage so the model can attend over very long audio and video .

Post-training: SFT and reinforcement learning

Supervised fine-tuning for speech has a unique problem text never had: the assistant needs a voice, and your SFT data defines it. Moshi generates 20k+ hours of synthetic dialogues — transcripts written by its own text LLM, then synthesized with a TTS engine conditioned on one voice actor across 70+ speaking styles — while augmenting the user stream with gain changes, background noise, echo and reverb so the model stays robust to real microphones. Kimi-Audio records a professional voice actor across 20+ styles and emotion intensities, then uses voice conversion to expand coverage. And Voxtral contributes an important negative result: SFT data synthesized purely with TTS generalizes poorly to accented human speech — you need real spoken queries in the mix.

The final ingredient — and the fastest-moving — is reinforcement learning. The first wave was DPO, used for two distinct ends:

The GRPO turn: verifiable rewards for speech

Through 2025, speech post-training caught the same wave that reshaped reasoning LLMs: a shift from preference pairs (DPO) to GRPO with verifiable rewards . The mechanism is a clean fit for TTS. For each input text the policy samples a group of G candidate utterances, each is scored by a reward function, and the group-relative advantage (R − mean) / std pushes probability toward the better-than-average samples — with a KL penalty anchoring the model to its starting point. Crucially, the reward needs no human labels and no trained reward model: synthesize the speech, run it back through an ASR model, and score the transcript against the target text. The ASR-in-the-loop verifiers that LLaSA originally used to rerank outputs at inference time (back in the TTS section) are now baked into the weights.

Toggle reward components to see what each one optimizes — and what it can quietly break.
Figure 12: GRPO for TTS. The policy samples G candidates, an ASR-in-the-loop reward scores each with no human labels, and the group-relative advantage updates the policy. Toggle reward components to see the central tension.

The hard-won lesson across every one of these papers is that a single reward is a trap. Optimize for ASR-measured word error alone and intelligibility climbs while the cloned voice quietly drifts and prosody flattens — the model learns to enunciate, not to sound right. So the recipes converge on composites: the GRPO-TTS paper blends character error with the ASR model's log-likelihood (a softer signal that catches subtle mispronunciations), cutting CosyVoice2's Chinese CER from 1.41 to 1.07 while lifting naturalness MOS from 4.42 to 4.58 ; Multi-Reward GRPO adds an explicit speaker-similarity term and prosody proxies to keep voice identity intact at scale ; GLM-TTS optimizes pronunciation, timbre, and naturalness together ; and IndexTTS 2.5 reports GRPO trimming English WER from 1.89% to 1.73% with speaker similarity held stable. The pattern is always content-reward + identity-reward, plus guardrails like a duration penalty so the model can't game WER by slowing down.

This is also a domain where you do not need a frontier-scale rig: applying GRPO to LLaSA-1B — with a composite of ASR word error and model confidence — measurably improved error rates and naturalness on a single A100 , a reminder that the expensive part of a SpeechLM is the pretraining, not the alignment. The open frontier is rewards themselves: rule-based metrics are blind to prosody, emotion, and expressiveness, so learned generative reward models like GSRM are emerging to judge the perceptual qualities WER and cosine similarity miss — speech catching up to the trained-reward-model stage that text RLHF reached years ago .

Put together, the pattern across these systems is hard to miss. Pretraining = text LLM + massive interleaved audio + anti-forgetting replay. Mid-training = conversational structure (multi-stream, long context) + compute tricks. Post-training = voice-consistent SFT, then GRPO with a composite of verifiable (ASR-in-the-loop) and identity rewards. The final section of this playbook turns this into a checklist.

Neural Audio Codecs: The Backbone of High-Fidelity Speech

Every speech LLM discussed so far relies on a neural audio codec to bridge the gap between continuous waveforms and discrete tokens. Understanding how these codecs work—and how they differ—is essential for appreciating the trade-offs each model makes.

A neural codec typically consists of an encoder that compresses raw audio into a compact latent representation, a quantizer that maps each latent frame to one or more discrete codebook indices, and a decoder that reconstructs the waveform from those indices. The key variables are:

EnCodec (Meta, 2022) established the modern neural-codec paradigm with a 24 kHz, 1.5–6 kbps RVQGAN architecture . It uses a convolutional encoder/decoder and a residual quantizer with 32 codebooks at 75 Hz. EnCodec powers many early speech-language models and remains a robust baseline.

RVQ is easiest to understand by watching it work. The first codebook stores a coarse sketch of each frame; every subsequent codebook encodes only the residual — the part of the signal the previous levels missed. Drag the slider to add codebooks and watch the reconstruction converge while the token rate and bitrate climb:

1
Figure 9: Residual Vector Quantization, conceptually. Each added codebook encodes what the previous ones missed — fidelity rises, but so does the number of tokens the LLM must generate (EnCodec-style 10-bit codes at 75 Hz shown).

And here is the same trade-off with your ears instead of your eyes: the same speech clip passed through EnCodec at three codebook depths. At 2 codebooks the voice is intelligible but robotic; each doubling restores texture — and doubles the LLM's token bill.

2 codebooks · 1.5 kbps · 150 tok/s
8 codebooks · 6 kbps · 600 tok/s
32 codebooks · 24 kbps · 2,400 tok/s
Original · uncompressed
Figure 9b: The same clip reconstructed by EnCodec (24 kHz) with 2, 8, and 32 codebooks. Generated with facebook/encodec_24khz in transformers by varying the bandwidth argument.

Descript Audio Codec (DAC) extends EnCodec to 44.1 kHz stereo audio and introduces improved RVQGAN training with larger codebooks, achieving higher perceptual quality at 6–8 kbps . DAC is used by OuteTTS and other production TTS pipelines where full-band audio fidelity is required.

SNAC (Multi-Scale Neural Audio Codec) takes a different approach by quantizing at multiple temporal resolutions simultaneously—e.g., separate codebooks for coarse and fine structure . This lets an LLM predict the most important long-range features first, then refine details, which is why Orpheus 3B employs SNAC for its streaming pipeline.

XCodec2 (HKUST, used by LLaSA) opts for a single large codebook of 65,536 entries at 50 Hz. By eliminating the hierarchical RVQ structure, XCodec2 simplifies the LLM's prediction task to straightforward next-token generation, reportedly utilizing ~99% of the codebook. The trade-off is that a single token must encode a 20 ms frame, so the model relies on the massive capacity of the codebook itself to preserve nuance. The design is catching on: Neuphonic\u2019s NeuCodec (the codec behind NeuTTS Air) extends XCodec2 with Finite Scalar Quantization, keeping the single 50 tokens/s stream while cutting the bitrate to 0.8 kbps and pairing a Wav2Vec2-BERT semantic encoder with the acoustic one.

Mimi (Kyutai, 2024) is the codec behind Moshi. It runs at 12.5 Hz with a bitrate of only 1.1 kbps, yet combines semantic tokens (from a self-supervised speech encoder) with acoustic tokens in a single stream. This hybrid design means the LLM receives both high-level linguistic content and low-level timbre information at an extremely low token rate, which is crucial for real-time full-duplex dialogue. Mimi's decoder is also heavily optimized for streaming, running in under ten milliseconds on a single CPU core.

Codec Token Rate RVQ Depth Typical Bitrate Used By
EnCodec 75 Hz 32 1.5–6 kbps Early SpeechLLMs, audiobook pipelines
DAC ~75 Hz 9 6–8 kbps OuteTTS, high-fidelity TTS
SNAC Multi-scale 3 levels ~2 kbps Orpheus 3B
XCodec2 50 Hz 1 (flat) ~2 kbps LLaSA
NeuCodec 50 Hz 1 (FSQ, flat) 0.8 kbps NeuTTS Air
Mimi 12.5 Hz 8 (1 semantic + 7 acoustic) 1.1 kbps Moshi

The choice of codec is not merely an implementation detail: it shapes the LLM's input distribution, the achievable latency, and the fidelity ceiling. A flat single-codebook design like XCodec2 simplifies training and inference, but demands the LLM to carry more of the acoustic modeling burden. A multi-scale or semantic+acoustic design like Mimi pushes complexity into the codec, letting the LLM focus on high-level dialogue and reasoning. As the field matures, we expect to see more hybrid codecs that balance these axes for specific deployment constraints—on-device, low-latency, or studio-quality.

To see why this matters so much for LLMs, look at what one second of speech actually costs in tokens. A text LLM spends roughly 3–4 tokens to represent a second of spoken English; a naively flattened DAC stream spends nearly 800. This token bill determines how much conversation history fits in the context window — and therefore how long a dialogue your SpeechLM can remember:

Figure 10: The LLM-side cost of each codec, assuming flattened single-stream generation. Architectures like Moshi's depth transformer or CSM's parallel heads exist precisely to pay this bill differently — only the 12.5 Hz backbone positions occupy the context, not every codebook token.

Using Speech LLMs with ⌘ Transformers

Everything in this post is a transformers model. Whisper, Voxtral, CSM, Qwen2.5-Omni and friends all live on the Hub with native library support, which means three familiar abstractions cover the whole speech stack: the pipeline() one-liner for transcription, the processor + chat template pattern (with {"type": "audio"} content parts) for audio understanding, and generate() with audio output for models that speak. And the roster keeps growing — recent merges include Audio Flamingo 3, Granite Speech, Higgs Audio V2, VibeVoice ASR, and Music Flamingo, all following the same idioms. Pick a task:

Figure 11: The four speech tasks and their transformers idioms. Each snippet is self-contained — swap the checkpoint to try a different model of the same family.

A few practical notes. Load in bfloat16 with device_map="auto" — speech models are no different from text LLMs here, and the 3–8B models in this post fit comfortably on a single 24 GB GPU. The chat-template pattern is the one worth internalizing: because audio is just another content type in the conversation list, the same code path handles audio + text questions, multi-turn dialogue, and (for omni models) images and video. And when you outgrow these snippets — streaming microphone input, sub-second response latency, interruption handling — the models' own repos (Moshi, CSM, Qwen2.5-Omni) ship dedicated real-time serving stacks that wrap the same checkpoints.

The Playbook: Training Your Own

Everything above condenses into three concrete recipes. None of them is hypothetical — every step below is something one of the surveyed models actually does, with the source noted so you can go deeper.

Recipe 1: An LLM-based TTS model (LLaSA / Orpheus style)

  1. Pick your codec first. This is the decision everything else inherits. A flat single codebook (XCodec2: 50 Hz, 65,536 codes) gives you the simplest possible LLM training; a hierarchical RVQ codec (SNAC, DAC) gives better fidelity per token but forces you to flatten streams or add prediction heads. Prefer a codec with semantic grounding — intelligibility comes from the codec more than from the LLM.
  2. Start from a pretrained text LLM (Llama 1B–8B class) and extend its tokenizer and LM head with the codec's vocabulary. No architecture changes.
  3. Build the data pipeline: collect speech, transcribe what is unlabeled with a strong ASR model, encode the audio to codec tokens, and format samples as [text tokens] → [audio tokens]. LLaSA used 250k hours; useful models emerge at a few thousand hours.
  4. Train with plain next-token prediction. All the text-LLM infrastructure (data parallelism, sequence packing, compression, serving stacks) applies unchanged — this is the entire point of the design.
  5. Get voice cloning for free: condition on a short audio-token prompt of the target speaker and the model continues in that voice. No special training needed beyond diverse speakers in the data.
  6. Align with GRPO and verifiable rewards. This is now the standard finishing step. Sample a group of candidates per prompt and score each with a composite reward: an ASR-in-the-loop term for content accuracy (CER/WER, optionally the ASR model's log-likelihood) plus a speaker-encoder term for voice consistency, with a duration guardrail. Never optimize word error alone — it improves intelligibility while eroding the cloned voice. This works at modest scale (LLaSA-1B + GRPO fits on a single A100), and learned reward models are the emerging way to also reward prosody and emotion.

Recipe 2: A neural audio codec

  1. Architecture: a convolutional encoder that downsamples the waveform to a low frame rate, a quantizer, and a mirror-image decoder. The frame rate you choose (12.5–75 Hz) is the single most consequential number: it fixes the sequence length of every model trained on top.
  2. Choose the quantizer: RVQ (Encodec, DAC) for hierarchical fidelity, a single large codebook (XCodec2) for LLM-friendliness, or FSQ variants (Voxtral Codec) to sidestep codebook-collapse issues entirely. Monitor codebook usage — a healthy codec uses nearly all its codes (LLaSA reports ~99%).
  3. Train with the RVQGAN loss stack: multi-scale spectrogram reconstruction losses, adversarial discriminators (the main driver of perceptual quality), feature-matching, and commitment losses. DAC's recipe is the reference implementation.
  4. Add semantic distillation if LLMs will consume your tokens: distill a self-supervised speech model (WavLM-style) into the first codebook (Mimi) or fuse a speech-understanding encoder's features before quantization (XCodec2). This is the difference between a compression codec and an LLM tokenizer.
  5. Make it causal and streamable if conversation is the target: causal convolutions, chunk-wise decoding, and a decoder fast enough for real time (Mimi decodes in under 10 ms on one CPU core).
  6. Evaluate on three axes: reconstruction quality (MOS, ViSQOL) vs. bitrate, downstream LLM modelability (train a small LM on your tokens, measure phone/word error of resynthesized speech), and latency.

Recipe 3: An audio LLM (understanding + conversation)

  1. Start from the strongest text LLM you can serve, and attach a pretrained audio encoder (Whisper-class) through a small adapter. Optionally add a discrete semantic-token stream alongside the continuous features (Kimi-Audio's hybrid input).
  2. Warm up the adapter alone — encoder and LLM frozen — for the first pass over the data (Voxtral), or keep the feature extractor frozen for the first ~20% of tokens (Kimi-Audio).
  3. Pretrain on interleaved audio-text: mix the ASR-style repetition pattern with the cross-modal continuation pattern (~50/50, Voxtral), plus ASR, TTS, and audio-only tasks (Kimi-Audio). Keep a large share of text-only data — up to half of all batches — to prevent forgetting.
  4. Mid-train for the target format: long-context extension for 40-minute audio, and multi-stream / two-channel data (diarization-simulated first, real conversations like Fisher last) if you want full-duplex dialogue rather than turn-based QA.
  5. SFT with a consistent assistant voice (one voice actor + voice conversion, or a TTS engine with fixed conditioning) and diverse, augmented user audio. Include real human spoken queries — TTS-only SFT data fails on accented speech.
  6. Align with DPO: rank candidate responses with a text reward model over transcripts for response quality, and with ASR-based WER rewards for speech stability. Watch for capability taxes — check ASR benchmarks before and after.
  7. If speaking is required, bolt on generation last: either emit codec tokens directly (Moshi's depth transformer) or attach a Talker head over the LLM's hidden states with a streaming detokenizer (Qwen2.5-Omni, Kimi-Audio's flow-matching + vocoder).

Conclusion

The models surveyed here trace a clear trajectory. First, LLMs learned to listen: Qwen2-Audio and Voxtral Chat showed that pairing a strong audio encoder with a pretrained text model is enough for serious audio understanding. Then they learned to speak: LLaSA and Orpheus demonstrated that TTS can be reduced to next-token prediction over codec tokens, inheriting the entire scaling playbook of text LLMs. Finally, they are learning to converse: Moshi, CSM, Kimi-Audio, and Qwen2.5-Omni close the loop with full-duplex, streaming architectures where listening and speaking happen in the same model, in real time. And despite their architectural differences, they all train the same way: continued pretraining on interleaved audio-text with heavy text replay, mid-training for conversational structure, then voice-consistent SFT and DPO.

Underneath all of it sit the neural audio codecs, whose token rate, codebook structure, and semantic grounding quietly determine what the model above them can achieve. If there is one practical takeaway, it is this: when evaluating or building a SpeechLM, look at the codec first — the choice between a flat single codebook, a hierarchical RVQ stack, or a semantically distilled hybrid shapes the latency, quality, and modeling difficulty of everything downstream. The gap between synthetic and human conversation is closing fast, and it is being closed as much by better tokenizers as by bigger transformers.

Citation

For academic attribution, please cite this work as:

"SpeechLMs: LLM-Powered Text-to-Speech and Neural Audio Codecs Explored", 2025.

BibTeX citation

@misc{speechlms_explained,
    title={SpeechLMs: LLM-Powered Text-to-Speech and Neural Audio Codecs Explored},
    author={Steven Zheng},
    year={2025}
}