The SpeechLLM Playbook

How modern speech systems are built and trained — LLM-based TTS, audio LLMs, and the neural codecs underneath them — ending with concrete recipes for training your own.

Beyond Voice Cloning: From Classical TTS to Conversational Speech Systems

Classical text-to-speech (TTS) models have long excelled at voice cloning and speech synthesis. They generally follow a two-stage process: first, a model like Tacotron converts text into an intermediate acoustic representation (such as a spectrogram), and then a vocoder (for example, WaveGlow or HiFi-GAN) transforms that representation into waveform audio. While these systems are capable of producing lifelike voices, their primary focus has been on replicating a given speaker's sound, with limited capacity to engage in dynamic, context-aware conversations.

Audio sample from the Kokoro-TTS model.

The advent of large language models (LLMs) offers a compelling opportunity to enhance these systems. By incorporating LLMs into TTS pipelines, we can leverage their sophisticated reasoning and contextual understanding to create truly conversational speech systems. Instead of merely cloning a voice, these enhanced systems can interpret context, adapt to dialogue flows, and generate responses that feel both natural and interactive. Essentially, LLMs open up a new dimension where synthesis isn’t only about producing sound—it’s about enabling intelligent, context-aware conversation.

One practical way to integrate these capabilities is through a cascaded approach in a speech-to-speech system, which typically involves three distinct modules:

Speech-to-Text (STT): Converts the incoming speech into text.
Large Language Model (LLM): Processes and reasons on the transcribed text to understand context and generate a conversational response.
Text-to-Speech (TTS): Synthesizes the LLM-generated text back into natural-sounding speech.

Hover over elements for details

This cascaded method combines the strengths of each specialized component. However, this approach is not without its limitations. One major challenge is that the LLM does not capture the full richness of the speech input. Speech carries subtle cues—intonation, rhythm, emotion, and prosodic nuance—that are often lost in the conversion process to text. As a result, when an LLM processes transcribed text, it receives a significantly distilled representation of the original audio. This loss of detail can limit the model’s ability to produce responses that fully mirror the expressive qualities of the initial speech, potentially resulting in synthetic output that feels less dynamic or contextually aware.

Integrating speech directly with an LLM could solve this challenge but it also presents significant difficulties. Unlike text, speech is a continuous, high-dimensional signal. LLMs are designed to work with discrete tokens, so converting speech into a format that these models can process requires additional steps. Existing methods address this gap in two main ways:

Audio encoders (continuous): A dedicated encoder (often a Whisper-style network) maps the waveform to a sequence of continuous feature vectors, which a small adapter projects into the LLM's embedding space. This preserves fine acoustic detail with no quantization loss — but the LLM can only read these features, not produce them, so this path is for understanding audio, not generating it.
Neural codecs (discrete): A codec such as EnCodec, DAC, or XCodec quantizes the waveform into a sequence of discrete tokens that behave just like text tokens — and can be turned back into audio. Because the LLM can both read and write them, this is the path that unlocks speech generation, and it is the one this playbook leans on most.

This playbook is about the second path — the end-to-end one, where a single model ingests or produces speech directly rather than relaying it through a chain of black boxes. Everything downstream depends on one prerequisite the cascaded pipeline sidesteps entirely: an LLM only speaks in discrete tokens, so before anything else we need a way to turn a continuous waveform into a sequence of tokens, and those tokens back into sound. That is where we begin.

This post is a playbook, and it builds toward a concrete goal. We build up the modern speech stack roughly in the order you would learn it: first the neural codec that turns waveforms into tokens (the next chapter), then how to make an LLM understand audio (Qwen2-Audio, Voxtral), how to make it speak (LLaSA, Orpheus, CSM), and how those combine into real-time conversational systems — studying not just each architecture but how it is actually trained, and the two pillars every builder runs into along the way: data and evaluation. By the end, you should have a working recipe for all three: an LLM-based TTS model, a neural audio codec, and an audio LLM. The selection here is opinionated — models chosen because each teaches a distinct lesson. For an exhaustive map of the field, the ACL 2025 SpeechLM survey and the full-duplex spoken dialogue survey are excellent companions; the timeline at the top of this page shows where the models we cover sit in that larger story (AudioLM and VALL-E started the codec-token paradigm, SpeechGPT and SALMONN the encoder-plus-LLM one).

Neural Audio Codecs: The Backbone of High-Fidelity Speech

Before we look at a single model, we have to solve the problem they all share. An LLM is a machine for predicting the next item in a sequence of discrete symbols from a fixed vocabulary. Speech is the opposite of that: a continuous waveform, tens of thousands of real-valued samples every second, with no natural alphabet. Three consequences follow, and together they explain almost every design decision in the rest of this playbook.

Speech is continuous, not symbolic. Text arrives pre-segmented into a few tens of thousands of word-pieces; audio is just a signal. Something has to quantize it into a finite vocabulary before an LLM can read or write it at all.
Speech is expensive. A second of English reads as roughly 3–4 text tokens — but the same second of audio can be hundreds of tokens. Sequences get long fast, and this token rate becomes the central budget every architecture ends up fighting over.
Speech is one-to-many. A sentence has one spelling but infinitely many valid spoken realizations: different voices, pacing, emphasis, emotion, room acoustics. There is no single "correct" output the way there is in translation. This is why speech models sample instead of taking the argmax, why they can drift or hallucinate, and — as the training chapter will show — why reinforcement learning turns out to matter so much.

The device that takes on the first two problems directly is the neural audio codec: a learned compressor that turns a waveform into a short sequence of discrete tokens, and those tokens back into audio. Get it right and an LLM can treat speech as just another language to model; get it wrong and no amount of LLM scale will rescue you. It is the foundation everything else here is built on — so it is where we start.

A neural codec typically consists of an encoder that compresses raw audio into a compact latent representation, a quantizer that maps each latent frame to one or more discrete codebook indices, and a decoder that reconstructs the waveform from those indices. The key variables are:

Token rate (Hz): how many discrete tokens are emitted per second of audio. Lower rates mean the LLM has fewer positions to model, but may lose fine detail.
Codebook depth (RVQ levels): how many parallel codebooks are used. Residual Vector Quantization (RVQ) stacks multiple quantizers, each refining the reconstruction of the previous layer.
Bitrate (kbps): total information carried per second. Higher bitrates generally yield better fidelity but require more bandwidth and storage.

EnCodec (Meta, 2022) established the modern neural-codec paradigm with a 24 kHz, 1.5–6 kbps RVQGAN architecture . It uses a convolutional encoder/decoder and a residual quantizer with 32 codebooks at 75 Hz. EnCodec powers many early speech-language models and remains a robust baseline.

RVQ is easiest to understand by watching it work. The first codebook stores a coarse sketch of each frame; every subsequent codebook encodes only the residual — the part of the signal the previous levels missed. Drag the slider to add codebooks and watch the reconstruction converge while the token rate and bitrate climb:

Figure 1: Residual Vector Quantization, conceptually. Each added codebook encodes what the previous ones missed — fidelity rises, but so does the number of tokens the LLM must generate (EnCodec-style 10-bit codes at 75 Hz shown).

And here is the same trade-off with your ears instead of your eyes: the same speech clip passed through EnCodec at three codebook depths. At 2 codebooks the voice is intelligible but robotic; each doubling restores texture — and doubles the LLM's token bill.

2 codebooks · 1.5 kbps · 150 tok/s

8 codebooks · 6 kbps · 600 tok/s

32 codebooks · 24 kbps · 2,400 tok/s

Original · uncompressed

Figure 2: The same clip reconstructed by EnCodec (24 kHz) with 2, 8, and 32 codebooks. Generated with facebook/encodec_24khz in transformers by varying the bandwidth argument.

Two kinds of token: acoustic vs. semantic

There is a second axis that matters even more than codebook count once an LLM enters the picture: what the tokens actually encode. Pure compression codecs like EnCodec and DAC are trained for a single goal — reconstruct the waveform as faithfully as possible — so their tokens are purely acoustic: they capture how the audio sounds, but an individual token has no clean correspondence to the phonemes being spoken. That is ideal for storage and terrible for an LLM, which is at heart a model of linguistic structure. Asking it to predict acoustic tokens is asking it to predict reverb and microphone hiss alongside the actual words.

The fix is to bake linguistic structure into the tokens themselves. A semantic token is trained — usually distilled from a self-supervised speech model such as HuBERT or Wav2Vec2-BERT — to correlate with phonetic content rather than raw acoustics. Codecs built for SpeechLMs increasingly carry both: semantic tokens for "what is being said," acoustic tokens for "how it sounds." Mimi distills semantics into its first codebook and leaves the rest acoustic; XCodec2 and NeuCodec fuse a semantic encoder's features into the quantizer before emitting their single stream. This is the biggest reason recent SpeechLMs are so much more intelligible than the first codec-token experiments: when the first token the LLM predicts is already about meaning, next-token prediction becomes a linguistic act rather than an acoustic one. Keep this acoustic-vs-semantic split in mind for every codec below.

Descript Audio Codec (DAC) extends EnCodec to 44.1 kHz stereo audio and introduces improved RVQGAN training with larger codebooks, achieving higher perceptual quality at 6–8 kbps . DAC is used by OuteTTS and other production TTS pipelines where full-band audio fidelity is required.

SNAC (Multi-Scale Neural Audio Codec) takes a different approach by quantizing at multiple temporal resolutions simultaneously—e.g., separate codebooks for coarse and fine structure . This lets an LLM predict the most important long-range features first, then refine details, which is why Orpheus 3B employs SNAC for its streaming pipeline.

XCodec2 (HKUST, used by LLaSA) opts for a single large codebook of 65,536 entries at 50 Hz. By eliminating the hierarchical RVQ structure, XCodec2 simplifies the LLM's prediction task to straightforward next-token generation, reportedly utilizing ~99% of the codebook. The trade-off is that a single token must encode a 20 ms frame, so the model relies on the massive capacity of the codebook itself to preserve nuance. The design is catching on: Neuphonic\u2019s NeuCodec (the codec behind NeuTTS Air) extends XCodec2 with Finite Scalar Quantization, keeping the single 50 tokens/s stream while cutting the bitrate to 0.8 kbps and pairing a Wav2Vec2-BERT semantic encoder with the acoustic one.

Mimi (Kyutai, 2024) is the codec behind Moshi. It runs at 12.5 Hz with a bitrate of only 1.1 kbps, yet combines semantic tokens (from a self-supervised speech encoder) with acoustic tokens in a single stream. This hybrid design means the LLM receives both high-level linguistic content and low-level timbre information at an extremely low token rate, which is crucial for real-time full-duplex dialogue. Mimi's decoder is also heavily optimized for streaming, running in under ten milliseconds on a single CPU core.

Codec	Token Rate	RVQ Depth	Typical Bitrate	Used By
EnCodec	75 Hz	32	1.5–6 kbps	Early SpeechLLMs, audiobook pipelines
DAC	~75 Hz	9	6–8 kbps	OuteTTS, high-fidelity TTS
SNAC	Multi-scale	3 levels	~2 kbps	Orpheus 3B
XCodec2	50 Hz	1 (flat)	~2 kbps	LLaSA
NeuCodec	50 Hz	1 (FSQ, flat)	0.8 kbps	NeuTTS Air
Mimi	12.5 Hz	8 (1 semantic + 7 acoustic)	1.1 kbps	Moshi

The choice of codec is not merely an implementation detail: it shapes the LLM's input distribution, the achievable latency, and the fidelity ceiling. A flat single-codebook design like XCodec2 simplifies training and inference, but demands the LLM to carry more of the acoustic modeling burden. A multi-scale or semantic+acoustic design like Mimi pushes complexity into the codec, letting the LLM focus on high-level dialogue and reasoning. As the field matures, we expect to see more hybrid codecs that balance these axes for specific deployment constraints—on-device, low-latency, or studio-quality.

To see why this matters so much for LLMs, look at what one second of speech actually costs in tokens. A text LLM spends roughly 3–4 tokens to represent a second of spoken English; a naively flattened DAC stream spends nearly 800. This token bill determines how much conversation history fits in the context window — and therefore how long a dialogue your SpeechLM can remember:

Figure 3: The LLM-side cost of each codec, assuming the naive approach of flattening every codebook into one stream. Much of the architectural cleverness in later chapters — Moshi's depth transformer, CSM's parallel heads — exists precisely to avoid paying this bill in full.

Here is the payoff, and the shape of the rest of this playbook. Once we can move freely between waveforms and tokens, an LLM can do two mirror-image things with audio: consume it as input (recognition and understanding) or produce it as output (speech synthesis) — the same next-token objective, run in two directions. The two directions don't always use the same representation: as we'll see, a model that only needs to understand audio often feeds the LLM continuous encoder features instead of discrete codec tokens, while any model that generates speech must emit discrete tokens a codec can turn back into sound. The next chapters follow exactly that arc — first teaching an LLM to listen, then to speak, then combining both into real-time conversation.

One map for the whole field

Before the parade of models begins, here is a single mental model to carry through all of it. Almost every speech LLM is pinned down by just two questions: how does audio get into the LLM, and what does the LLM produce? The first axis is the continuous-vs-discrete choice we just drew — encoder features (rich, but read-only) versus codec tokens (discrete, and writable). The second is simply text out (understanding) versus speech out (generation). Place those two axes at right angles and the entire field falls into a handful of cells.

Figure 4: The design space. Hover any model to see how it is wired. The three columns are the input choices from the previous section; the two rows are the two directions of the ASR↔TTS duality.

A few things jump out once it's laid out this way. Understanding models cluster in one cell — continuous features in, text out — because if you never need to generate audio, discrete tokens only cost you fidelity; that is the entire "encoder-plus-LLM" recipe, and Qwen2-Audio, Voxtral, SALMONN, Audio Flamingo 3, and Granite Speech all live there. Pure TTS sits in the opposite corner — text in, codec tokens out — where models like LLaSA and Orpheus need no encoder at all. And the genuinely hard, interesting systems are the ones that span the whole grid: a model that takes audio in and emits speech out is a conversational agent, and how it bridges the two — a separate Talker head (Qwen2.5-Omni), a hybrid input (Kimi-Audio), or one unified token stream (Moshi) — is the defining design choice of the final chapter. As each model comes up, it helps to ask first: which cell is it in, and how does it move between them?

Making LLMs understand speech and audio

Qwen2-Audio

Qwen2-Audio employs a two-component architecture featuring an audio encoder and a large language model (LLM). The audio encoder is initialized using weights from the Whisper-large-v3 model . Leveraging powerful pre-trained models like Whisper as the audio encoder is an approach also seen in other audio-language models; for instance, the Ultravox's models similarly use a Whisper encoder as part of its architecture. For Qwen2-Audio, the input audio (resampled to 16kHz) is converted into a 128-channel mel-spectrogram (25ms window, 10ms hop), which is then processed by a pooling layer, resulting in each encoder frame representing about 40ms of audio. The LLM component is the Qwen-7B model, bringing the total parameter count for Qwen2-Audio to 8.2 billion.

The training involves a three-stage process aimed at maximizing the probability of the next text token, conditioned on the audio representations and preceding text tokens:

Pre-training: This stage replaces the hierarchical tags used previously with natural language prompts for various tasks and data types. This approach was found to enhance generalization and instruction-following abilities. The volume of pre-training data was also significantly expanded compared to earlier models.
Fine-tuning (SFT): Building on the pre-trained model's audio understanding, instruction-based fine-tuning is performed using meticulously curated, high-quality SFT data to improve alignment with human intent and enable interactive chat capabilities. Both the 'Audio Analysis' and 'Voice Chat' interaction modes are jointly trained for seamless integration, removing the need for users to switch modes explicitly.
Preference Optimization (DPO) : DPO is employed as a final stage to further align the model with human preferences. This involves using a dataset of triplets, each containing an input (audio + text prompt), a preferred (good) response, and a rejected (bad) response, to optimize the model. This stage helps improve the factuality and adherence to desired behaviors in the model's outputs.

This architectural and training methodology highlights a relatively straightforward approach to building powerful audio-language models. It demonstrates the feasibility of effectively combining strong, pre-existing unimodal models – like a capable audio encoder (Whisper) and a robust LLM (Qwen) – and then adapting them through targeted fine-tuning stages (SFT, DPO). This process of "plugging in" a modality-specific encoder and then fine-tuning the combined system mirrors common practices in multimodal LLMs, particularly analogous to how vision capabilities are often integrated into large language models.

The encoder-plus-LLM recipe, everywhere

Qwen2-Audio's blueprint has since become the standard pattern, and the transformers library has been steadily absorbing its descendants — each with one distinctive twist worth knowing:

Audio Flamingo 3 (NVIDIA) pairs a Whisper-style encoder with a Qwen2 LLM and uses one unified encoder across speech, environmental sound, and music — the same checkpoint transcribes a meeting, names a bird call, and analyzes a chord progression. It handles up to 10 minutes of audio by windowing into 30-second chunks and replacing placeholder tokens in place, so fusion never changes the sequence length.
Granite Speech (IBM) stacks a Conformer CTC encoder and a Q-former projector on a Granite LLM, with an elegant answer to the catastrophic-forgetting problem from the training section: a modality-specific LoRA adapter that activates only when audio is present. Text-only prompts run through the unmodified LLM, so text performance is untouched by construction rather than defended with replay data.
Gemma 3n (Google) folds the same idea into a small open omni model: audio joins text, images, and video as just another input modality, showing the encoder-plus-LLM recipe scaled down for on-device use.

The takeaway: "make an LLM understand audio" is no longer a research problem but a design menu — pick an encoder, pick a fusion mechanism (projector, Q-former, in-place replacement), and decide how you will protect the text weights (data replay, frozen stages, or Granite's audio-gated LoRA).

Scaling TTS with LLMs

LLaSA: a simple approach to scaling TTS with LLMs

The first TTS LLM-powered model we examine is LLaSA, developed by researchers at HKUST and Microsoft. LLaSA directly tackles the question: How far can we push large language model (LLM) scaling principles when applied to speech synthesis? While many text-to-speech (TTS) systems use hybrid pipelines or multiple models, LLaSA adopts a minimalist, LLM-style design—one transformer, one stage, one codebook—and scales it up massively across model size and data.

LLaSA’s design philosophy is simplicity and reuse. It starts from a pretrained Llama model and extends both the tokenizer and LM head to include 65,536 new audio tokens on top of the original text vocabulary. These audio tokens come from XCodec2, a flat vector quantizer operating at 50 Hz – i.e. one token corresponds to 20 ms of audio. This design yields 50 tokens per second of audio, all from one codebook. By using a very large codebook, the codec achieves high fidelity with a single token stream (reportedly ~99% codebook usage, meaning it effectively utilizes the full range of tokens for nuanced encoding).

The model is then fully fine-tuned on sequences that combine both text and audio tokens, allowing the transformer to treat speech as a natural continuation of text. No architecture changes are made to the transformer itself—LLaSA simply learns to model speech the same way Llama models language: as a next-token prediction task over a unified vocabulary of text and audio tokens.

The advantage of this single-level approach is its simplicity for the LLM – the model just treats audio tokens like another language with a 65k vocabulary, not unlike how a wordpiece tokenizer might have 50k tokens for text. Training LLaSA thus becomes very similar to training a standard LLM: they convert all audio in the training set into long sequences of XCodec2 tokens and concatenate with the corresponding text transcriptions. The transformer learns to predict the next token, whether that next token is part of the text or the audio. The authors note this compatibility means they can directly apply techniques like data parallel scaling, model compression, or acceleration from the NLP world to this TTS model.

LLaSA’s training of 250k hours is one of the largest in TTS, spanning diverse speech in English and Chinese, it also has a multilingual version for the 1B model and a 3B model. The resulting models are correspondingly powerful. The largest 8B model in particular demonstrates remarkable naturalness and prosody. According to the paper, increasing model size consistently improved speech quality – bigger models produced more accurate and complex prosody patterns and sounded more natural. This is analogous to how in text LLMs, going from 1B to 7B to 70B yields more fluent and context-aware language; here it yields more human-like intonation and rhythm. Even the 1B model, while less expressive, still functions for basic speech and is extremely lightweight to run.

It can also do zero-shot voice cloning by taking a speech prompt. If you feed a short recording of a speaker (converted to tokens via XCodec2) followed by the text, LLaSA will generate the continuation in that voice. This works because the model has effectively learned to continue in the style provided by preceding audio tokens. Additionally, because it’s bilingual, you can prompt it with a Chinese voice and have it speak English in that voice or vice-versa, enabling cross-lingual voice transfer – a very useful feature for voice assistants in multilingual settings.

More details about scaling inference-time compute with LLaSA

One interesting contribution of the LLaSA work is in scaling inference-time compute . The authors experimented with using speech understanding models as verifiers during generation . In practice, this means when sampling audio tokens, they would involve a pretrained model (like a speech recognition or speaker ID network) to guide the choice of tokens, re-ranking or biasing the outputs toward those that make the verifier happy. For example, a ASR verifier ensures the content is pronounced clearly (improving word accuracy), while a speaker encoder verifier ensures the voice timbre stays consistent, and a classifier might ensure the emotion matches some target. They found that by spending more computation at inference in this way, they could significantly improve aspects like emotional expressiveness, speaker consistency, and content accuracy in the generated speech. This is akin to how some text LLMs use external tools or rerankers to improve outputs post-hoc. While such techniques increase inference cost, they show a pathway to higher quality without retraining – useful for customizing style or ensuring correctness in critical applications. In terms of output quality, LLaSA is state-of-the-art on traditional TTS metrics. The paper reports that the codec can reconstruct 16 kHz speech with a MOS (Mean Opinion Score) around 4.1 (out of 5) on test data, which is near the ground truth quality. The transformer does introduce some modeling imperfections, but the massive training helps mitigate that. For prosody, the 8B model especially was noted to produce more lifelike intonation, handling even tricky sentences with complex emphasis better than smaller models. The bilingual nature means it learned to control tone appropriate to each language – for example, using the correct cadence for a Chinese question vs an English question. It can also mix languages to an extent (e.g., speaking an English sentence with a Chinese accent or vice versa, if prompted), reflecting the multilingual data.

Text tokens
Llama tokenizer (~128k vocab)

Audio tokens
XCodec2 · 1 codebook × 65,536 codes · 50 tokens/s

↓

Single Llama transformer
next-token prediction over the unified text + audio vocabulary — no extra adapters, no second stage

LLaSA-1B
Llama-3.2-1B
multilingual variant

LLaSA-3B
Llama-3.2-3B
EN + ZH

LLaSA-8B
Llama-3.1-8B
best prosody & naturalness

Figure 5: LLaSA variants and their single-codebook (XCodec2) token stream.

Concurrent Approaches

While LLaSA pushed the boundaries with its massive scale and single-codebook simplicity, other models emerged concurrently, exploring similar LLM-driven TTS principles but often opting for different neural audio codec strategies. Notable examples include Canopy Labs' Orpheus 3B and OuteAI's OuteTTS.

A key difference lies in their choice of codec. Instead of XCodec2's single large codebook, both Orpheus and OuteTTS use codecs based on Residual Vector Quantization (RVQ). RVQ codecs like SNAC (Multi-Scale Neural Audio Codec) and DAC (Descript Audio Codec) represent audio hierarchically, using multiple quantizers (codebooks) where each layer progressively refines the audio representation encoded by the previous layers. This can potentially offer higher fidelity at lower token rates compared to single-codebook approaches, though it may require the LLM to predict multiple token streams or handle a flattened representation.

Orpheus 3B

Built upon a Llama-3B backbone, Orpheus pairs the LLM with the SNAC. SNAC is an advanced RVQ codec that captures audio information across different temporal resolutions, aiming for efficient compression and detailed reconstruction. To manage the multiple token streams from SNAC within a standard LLM framework, Orpheus employs a strategy of generating a flattened sequence of tokens (7 tokens per audio frame sequentially). It achieves low-latency streaming (~200ms to ~25-50 ms with input streaming of text into the KV cache) suitable for real-time applications by using an optimized decoding process involving a sliding window technique on the SNAC decoder, ensuring smooth audio output without pops. Orpheus particularly emphasizes generating expressive, emotive speech and supports zero-shot voice cloning from short audio prompts.

Figure 6: Comparison between traditional Residual Vector Quantization (RVQ) and SNAC.

More models

Model	Base LLM	Audio Codec	Key Features
OuteAI/Llama-OuteTTS-1.0-1B	meta-llama/Llama-3.2-1B	ibm-research/DAC.speech.v1.0	- One-Shot Voice Cloning - Multilingual - Trained on ~60k hours of audio
SparkAudio/Spark-TTS-0.5B	Qwen/Qwen2.5-0.5B	BiCodec	- Controllable TTS with prompt - Trained on 100k hours of open source audio - Bilingual capabilities
nari-labs/Dia-1.6B	Custom 1.6B transformer (trained from scratch)	Descript Audio Codec (DAC)	- Multi-speaker dialogue generation ([S1]/[S2] tags) - Non-verbal sounds (laughter, coughs, sighs) - Voice cloning from an audio prompt
bosonai/higgs-audio-v2-generation-3B-base	meta-llama/Llama-3.2-3B	Unified tokenizer (semantic + acoustic, in-house)	- DualFFN: separate FFN path for audio tokens - 10M hours pretraining (AudioVerse) - Expressive multi-speaker audio, zero post-training
moonshotai/Kimi-Audio-7B-Instruct	Qwen 2.5-7B	Hybrid (continuous + discrete)	- Universal audio foundation model - ASR, TTS, audio QA, SER, SEC - 13M+ hours pre-training
Qwen/Qwen2.5-Omni-7B	Qwen 2.5	Thinker-Talker (built-in)	- Multimodal: text, image, audio, video - Real-time voice & video chat - Selectable voices (Chelsie, Ethan)
mistralai/Voxtral-Mini-4B-Realtime	Mistral 7B	Voxtral Codec	- Streaming ASR - 13 languages - 480ms delay
mistralai/Voxtral-4B-TTS	Mistral 7B	Voxtral Codec (VQ-FSQ)	- Multilingual TTS - Voice cloning from 3s audio - Flow-matching decoder

Beyond LLMs and Audio Codecs: Toward a conversational speech LLM system

Moshi: Real-Time Full-Duplex Speech Generation

Moshi is an open 7-billion-parameter speech–text foundation model that aims to make synthetic voices feel as immediate and interruptible as ordinary conversation. Every 80 ms the model’s large Temporal Transformer rolls the entire dialogue history—both text and previously generated audio tokens—into a single context vector. A tiny linear head turns that vector into one semantic token, after which a compact Depth Transformer fills in seven acoustic tokens that refine prosody and timbre. These eight tokens drive the causal Mimi codec, so the first fragment of sound can leave the speaker roughly 160–200 ms after the user finishes talking .

Because Moshi writes its own audio tokens and reads the user’s tokens in the same sequence, it can back-channel (“mm-hm”), pause mid-sentence when interrupted, or pick up the thread again—all without external VAD, ASR, or turn-taking heuristics. The entire stack is Apache-2.0 licensed, and the Mimi decoder runs in under ten milliseconds on a single CPU core, making on-device streaming practical for mobile hardware.

Moshi architecture: Temporal Transformer, Depth Transformer and Mimi codec — Figure 7: A single context vector becomes one semantic and seven acoustic tokens per 80 ms frame.

CSM-1B: Context-Aware Conversational TTS from Sesame

Sesame AI’s Conversational Speech Model (CSM-1B) attacks the same problem from the TTS side. Rather than a full-duplex agent like Moshi, CSM is a context-aware speech generator: text and audio tokens from the whole conversation so far form one interleaved sequence, and the model generates the next turn’s audio conditioned on all of it. Because the prosody of a reply depends on what was just said — and how it was said — feeding the model raw conversational context lets it get intonation, emotion, and timing right where an isolated TTS call would sound flat .

Architecturally, CSM is two transformers operating on Mimi codec tokens (12.5 Hz, one semantic + N−1 acoustic codebooks per frame). A Llama-style backbone processes the interleaved text-and-audio history and predicts the semantic codebook for each frame; a much smaller audio decoder then models the remaining acoustic codebooks to reconstruct high-fidelity speech. Both stages are autoregressive — the split exists so the expensive backbone runs only once per frame while the cheap decoder fills in the detail, keeping latency low enough for real-time use.

The training corpus is roughly 1 million hours of transcribed, diarized, predominantly English audio, trained for five epochs at a 2,048-token sequence length (about two minutes of dialogue). Sesame trained three sizes — 1B, 3B, and 8B backbones — and released the 1B model openly. In their evaluations, listeners rated isolated CSM utterances on par with human recordings; only when the conversational context was shown did humans retain a clear edge — evidence that naturalness is largely solved and contextual appropriateness is the remaining frontier.

Sesame CSM-1B block diagram — Figure 8: CSM-1B’s single-stage pipeline. The orange arrow shows audio tokens fed back into the same sequence that will soon contain the model’s next reply.

Moving towards truly interactive conversational speech, models need to inherently understand and adapt to the flow of dialogue. Sesame AI's Conversational Speech Model (CSM) represents a significant step in this direction, explicitly designed to leverage context for more natural and coherent speech synthesis, as detailed in their research "Crossing the Uncanny Valley of Voice".

Kimi-Audio: A Universal Audio Foundation Model

Kimi-Audio from Moonshot AI is a 7-billion-parameter open-source audio foundation model that unifies audio understanding, generation, and conversation within a single framework . Built on a Qwen 2.5-7B backbone, it represents one of the most ambitious attempts to create a universal audio model rather than a task-specific TTS or ASR system.

The model introduces a hybrid audio input architecture that combines continuous acoustic features with discrete semantic tokens, allowing the LLM core to ingest audio at multiple levels of abstraction. For output, Kimi-Audio employs parallel heads for both text and audio token generation, making it capable of producing text transcriptions, natural language answers, or synthesized speech from the same forward pass. A chunk-wise streaming detokenizer based on flow matching keeps audio output latency low enough for real-time interaction.

Pre-training consumed over 13 million hours of diverse audio data spanning speech, music, and environmental sounds, making Kimi-Audio competitive across a wide spectrum of benchmarks. It achieves strong results on ASR (Common Voice, LibriSpeech), audio question answering (MMAU), speech emotion recognition, and sound event classification, while also supporting end-to-end speech conversation. The model is released in two checkpoints: Kimi-Audio-7B (base) and Kimi-Audio-7B-Instruct (fine-tuned for dialogue).

Qwen-Omni: A Multimodal LLM for Conversational Speech

Alibaba's Qwen2.5-Omni pushes the boundary of multimodal LLMs by simultaneously perceiving text, images, audio, and video while generating both text and natural speech responses in a streaming manner . Unlike audio-only models, Qwen2.5-Omni is a true any-to-any system, making it suitable for rich voice-and-video assistants.

The architecture splits responsibilities into a Thinker–Talker design. The Thinker is a transformer backbone that processes all modalities; the Talker is a lightweight decoder head that converts hidden states into speech tokens. A novel position-embedding scheme called TMRoPE (Time-aligned Multimodal RoPE) synchronizes video timestamps with audio, so lip movements and spoken words stay aligned in the model's internal representation. This is especially important for tasks like video-based question answering or real-time video chat.

Qwen2.5-Omni comes in 3B and 7B variants. In audio-only benchmarks it rivals or surpasses Qwen2-Audio and Whisper-large-v3, while in speech generation it achieves competitive speaker similarity and content consistency on the SEED evaluation suite. On the OmniBench multimodal benchmark, the 7B model reaches state-of-the-art performance among open models. The model also supports selectable output voices—Chelsie (female, warm) and Ethan (male, upbeat)—giving developers control over persona without extra training.

Voxtral: Mistral AI's Audio Suite

Mistral AI's Voxtral family is a comprehensive audio suite that spans understanding, generation, and real-time transcription . Rather than a single monolithic model, Voxtral is split into three complementary systems, each optimized for a different audio modality and all released under permissive licenses.

Voxtral Chat (Mini and Small) is a multimodal audio chat model that comprehends both spoken audio and text documents. A 32K context window lets it handle audio files up to 40 minutes and sustain long multi-turn conversations. Voxtral Small outperforms several closed-source models on audio benchmarks while remaining compact enough to run locally. The model is trained on diverse audio tasks including transcription, audio question answering, and cross-modal reasoning.

Voxtral TTS is an expressive multilingual text-to-speech model that generates natural speech from as little as 3 seconds of reference audio. It adopts a hybrid architecture: an autoregressive Transformer generates semantic speech tokens, and a flow-matching decoder turns those tokens into acoustic features via the custom Voxtral Codec—a speech tokenizer with hybrid VQ-FSQ quantization. In native-speaker evaluations, Voxtral TTS achieved a 68.4% win rate over ElevenLabs Flash v2.5 on multilingual voice cloning, demonstrating strong naturalness and expressivity.

Voxtral Realtime is a natively streaming automatic speech recognition model that matches offline transcription quality at sub-second latency. Unlike chunking or sliding-window adaptations of offline models, Voxtral Realtime is trained end-to-end for streaming with explicit alignment between audio and text streams. It builds on the Delayed Streams Modeling framework, introducing a causal audio encoder and Ada RMS-Norm for improved delay conditioning. At a 480 ms delay, it achieves performance on par with Whisper, the most widely deployed offline transcription system, while scaling to 13 languages.

All three Voxtral models are available on the Hugging Face Hub and can be loaded through the standard transformers library, making them accessible to any developer already familiar with the Hugging Face ecosystem.

How these systems are trained

Reading the model cards, these systems can look like magic: one checkpoint that listens, reasons, and speaks. But every model in this chapter follows essentially the same three-phase curriculum, directly inherited from text LLMs: pretraining teaches the modality, mid-training teaches the conversational format, and post-training teaches the behavior. Explore the actual pipelines below — the details differ, but the skeleton is remarkably consistent.

Figure 9: Training pipelines of five conversational speech systems, normalized to the same pretraining → mid-training → post-training skeleton.

Data: the real bottleneck

It is worth saying plainly: for most teams the hard part of a SpeechLM is not the model, it is the data. Each of the three phases above assumes a corpus of speech — usually paired with text — and assembling one at the right scale and quality is where the real effort goes. The numbers are sobering: a usable single-speaker voice can emerge from a few thousand hours, but the models in this playbook were trained on far more (LLaSA on 250k hours, CSM on ~1M, Kimi-Audio on 13M). The good news is that the open ecosystem has largely caught up, so you rarely start from zero.

Dataset	~Hours	Languages	Style	Notes
LibriSpeech	1k	English	read audiobooks	the classic ASR benchmark; clean, CC BY 4.0
LibriHeavy / LibriTTS-R	50k / 0.6k	English	read	LibriHeavy keeps punctuation & casing; LibriTTS-R is 24 kHz TTS-grade
Multilingual LibriSpeech	50k	8	read	the multilingual workhorse; CC BY 4.0
Granary	1M	25 (European)	pseudo-labeled, in-the-wild	NVIDIA, 2025; ~650k h ASR + ~350k h speech translation
GigaSpeech	10k	English	audiobooks, podcasts, web	varied speaking style, transcribed
Common Voice	30k+	100+	read, crowdsourced	unmatched language coverage; CC0
Emilia / Emilia-YODAS	100k+ / 200k+	6	in-the-wild (podcasts, talk shows)	the modern default for expressive TTS; check license terms
People's Speech	30k	English	diverse / spontaneous	large and permissively licensed (CC-BY)
Loquacious Set	25k	English	read, spontaneous, talks; clean & noisy	curated from 6 corpora for research and commercial use

Raw audio is never training-ready. The standard preparation pipeline — essentially what Voxtral, Moshi, and every model here run — is: segment long recordings with voice-activity detection, diarize to separate speakers, pseudo-label untranscribed audio with a strong ASR model (Whisper is the workhorse), filter on length, language ID, and quality, and finally encode each clip to codec tokens. This pipeline is increasingly productized: NVIDIA's Granary is essentially a million hours of audio run through exactly these steps with their NeMo tooling, released as ready-to-train data across 25 languages. That last step hides a practical trick worth knowing: encode the whole corpus to tokens once and store the tokens, not the waveforms. Neuphonic did exactly this to release Emilia-YODAS pre-encoded with NeuCodec, shrinking it from 1.7 TB to 41 GB — training then streams cheap integer tokens instead of decoding audio on the fly.

One caveat that catches newcomers: "open" does not always mean "commercially usable." Much of the largest in-the-wild speech data carries research-only or source-derived licenses, so check the terms before building a product on top of it. This is exactly the gap the Loquacious Set targets — 25k hours curated from six commercially-usable corpora into one diverse English set, assembled so you do not have to reconcile seven licenses yourself.

Pretraining: teaching the LLM a new modality

No one trains a SpeechLM from scratch on audio. Every system starts from a pretrained text LLM — Helium for Moshi, Qwen2.5-7B for Kimi-Audio, Mistral for Voxtral — because the linguistic knowledge, world knowledge, and reasoning live in the text weights. Audio pretraining is therefore really continued pretraining: expose the model to enormous amounts of audio (7M hours for Moshi , 13M hours for Kimi-Audio ) while actively defending the text abilities you started with. Moshi spends half of all pretraining batches on text-only data; Kimi-Audio's task mixture gives text-only data a sampling weight of 7 versus 1 for audio-only. Catastrophic forgetting, not audio modeling, is the main enemy at this stage.

The most important design decision is what the audio-text training sample looks like. Voxtral makes this beautifully explicit with two interleaving patterns, balanced 50/50 and disambiguated by a special token:

Figure 10: Voxtral's two pretraining patterns over the same chunked audio (A) and transcript (T) stream. Toggle to compare.

The repetition pattern is essentially ASR formatted as language modeling, and it drives transcription accuracy. The continuation pattern — predict the transcript of the next chunk from the current audio — is the one that produces understanding: the model can only continue speech it has actually comprehended, which is exactly the skill spoken QA and dialogue require. Kimi-Audio reaches the same conclusion independently: of its seven pretraining tasks, three are interleaving variants (audio→semantic-token, audio→text, and joint audio+text prediction), trained over a budget of 585B audio tokens and 585B text tokens. If you remember one thing about audio pretraining, make it this: interleaved data is what turns an ASR system into an audio language model.

A small but recurring trick: freeze first, unfreeze later. Voxtral's first pass over the data trains only the adapter between the frozen audio encoder and the frozen LLM; Kimi-Audio keeps its Whisper feature extractor frozen for the first ~20% of pretraining tokens. Aligning a randomly initialized adapter against two stable representations is much easier than letting everything drift at once.

Mid-training: from monologue to conversation

A model pretrained on single-stream audio can continue speech, but it has never seen a conversation — two voices, overlapping, taking turns. Moshi's recipe makes this phase explicit. First, it applies speaker diarization to its unsupervised corpus to simulate two streams (the target speaker's waveform on one channel, everyone else on the other) and trains for 100k steps. Only then does it use real two-channel telephone conversations — the Fisher corpus, 2,000 hours of phone calls recorded with one channel per participant — to learn genuine turn-taking, interruptions, and back-channels. The lesson generalizes: scarce gold data (real two-channel dialogue) is used last, after cheap simulated data has done the heavy lifting.

Mid-training is also where the engineering constraints of audio bite hardest, because audio sequences are long and each position can carry many codebooks. CSM's answer is compute amortization : the backbone predicts the semantic codebook on every frame, but the acoustic decoder trains on only a random 1/16th of frames — with no measurable difference in decoder loss. Qwen2.5-Omni instead splits the problem architecturally (the Talker trains separately from the Thinker) and adds a dedicated long-sequence stage so the model can attend over very long audio and video .

Post-training: SFT and reinforcement learning

Supervised fine-tuning for speech has a unique problem text never had: the assistant needs a voice, and your SFT data defines it. Moshi generates 20k+ hours of synthetic dialogues — transcripts written by its own text LLM, then synthesized with a TTS engine conditioned on one voice actor across 70+ speaking styles — while augmenting the user stream with gain changes, background noise, echo and reverb so the model stays robust to real microphones. Kimi-Audio records a professional voice actor across 20+ styles and emotion intensities, then uses voice conversion to expand coverage. And Voxtral contributes an important negative result: SFT data synthesized purely with TTS generalizes poorly to accented human speech — you need real spoken queries in the mix.

The final ingredient — and the fastest-moving — is reinforcement learning. The first wave was DPO, used for two distinct ends:

RL for response quality (the text-LLM playbook, ported): Voxtral samples two candidate answers per spoken question, replaces the audio with its transcript, and lets an ordinary text reward model rank the pair for DPO . Online DPO beat offline DPO on their speech-understanding judge — but also regressed ASR accuracy slightly, which is why Voxtral Small ships the SFT checkpoint. Alignment gains in one capability can tax another.
RL for speech stability: Qwen2.5-Omni's Talker is pretrained on transcripts that inevitably contain label noise, so it sometimes hallucinates, repeats, or garbles words when speaking. The fix is DPO with preference pairs ranked by a reward tied to word error rate: generate speech, transcribe it back with ASR, and prefer the sample whose transcription matches the intended text. RL here is not about helpfulness — it is a closed-loop correctness check on the audio itself.

The GRPO turn: verifiable rewards for speech

Through 2025, speech post-training caught the same wave that reshaped reasoning LLMs: a shift from preference pairs (DPO) to GRPO with verifiable rewards . The mechanism is a clean fit for TTS. For each input text the policy samples a group of G candidate utterances, each is scored by a reward function, and the group-relative advantage (R − mean) / std pushes probability toward the better-than-average samples — with a KL penalty anchoring the model to its starting point. Crucially, the reward needs no human labels and no trained reward model: synthesize the speech, run it back through an ASR model, and score the transcript against the target text. The ASR-in-the-loop verifiers that LLaSA originally used to rerank outputs at inference time (back in the TTS section) are now baked into the weights.

Figure 11: GRPO for TTS. The policy samples G candidates, an ASR-in-the-loop reward scores each with no human labels, and the group-relative advantage updates the policy. Toggle reward components to see the central tension.

The hard-won lesson across every one of these papers is that a single reward is a trap. Optimize for ASR-measured word error alone and intelligibility climbs while the cloned voice quietly drifts and prosody flattens — the model learns to enunciate, not to sound right. So the recipes converge on composites: the GRPO-TTS paper blends character error with the ASR model's log-likelihood (a softer signal that catches subtle mispronunciations), cutting CosyVoice2's Chinese CER from 1.41 to 1.07 while lifting naturalness MOS from 4.42 to 4.58 ; Multi-Reward GRPO adds an explicit speaker-similarity term and prosody proxies to keep voice identity intact at scale ; GLM-TTS optimizes pronunciation, timbre, and naturalness together ; and IndexTTS 2.5 reports GRPO trimming English WER from 1.89% to 1.73% with speaker similarity held stable. The pattern is always content-reward + identity-reward, plus guardrails like a duration penalty so the model can't game WER by slowing down.

This is also a domain where you do not need a frontier-scale rig: applying GRPO to LLaSA-1B — with a composite of ASR word error and model confidence — measurably improved error rates and naturalness on a single A100 , a reminder that the expensive part of a SpeechLM is the pretraining, not the alignment. The open frontier is rewards themselves: rule-based metrics are blind to prosody, emotion, and expressiveness, so learned generative reward models like GSRM are emerging to judge the perceptual qualities WER and cosine similarity miss — speech catching up to the trained-reward-model stage that text RLHF reached years ago .

Put together, the pattern across these systems is hard to miss. Pretraining = text LLM + massive interleaved audio + anti-forgetting replay. Mid-training = conversational structure (multi-stream, long context) + compute tricks. Post-training = voice-consistent SFT, then GRPO with a composite of verifiable (ASR-in-the-loop) and identity rewards. The final section of this playbook turns this into a checklist.

Evaluation: how do you know it's good?

You cannot improve what you cannot measure, and speech is measured along three largely independent axes. A model can be perfectly intelligible in the wrong voice, or sound gorgeous while saying the wrong words — so you track all three at once (plus latency, if you stream). Crucially, these are the same quantities the RL chapter used as rewards: evaluation metrics and training rewards are the same thing seen from two sides.

Intelligibility — did it say the right words? Transcribe the generated speech with an ASR model and compute WER/CER against the target text. Cheap, objective, fully automatable — and exactly the ASR-in-the-loop signal from the GRPO section.
Speaker similarity — is it the right voice? Embed the generated and reference clips with a speaker-verification model (typically WavLM-based) and take the cosine similarity, usually reported as SECS (speaker-encoder cosine similarity). The standard counterweight to a pure intelligibility score in voice cloning.
Naturalness — does it sound human? The gold standard is a subjective MOS (mean opinion score, 1–5) from human listeners: accurate but slow and costly. The practical substitute is a predicted MOS from a network trained on human ratings — UTMOS, DNSMOS, NISQA — cheap and automatable, but only an approximation that is easy to over-trust.

The crucial limitation echoes the lesson from reinforcement learning: these objective metrics are necessary but not sufficient. WER says nothing about whether a question sounds like a question; speaker similarity says nothing about emotion. Prosody, expressiveness, and affect still largely escape automatic measurement — which is why human MOS endures, and why the learned, generative reward models from the last chapter (GSRM and kin) are an active frontier for evaluation just as much as for training.

If you're building…	Measure above all	With
A TTS / voice-cloning model	intelligibility + voice match + naturalness	WER (Whisper), SECS (WavLM), UTMOS + a human MOS study
An ASR / understanding model	accuracy	WER/CER; task scores on AudioBench / VoiceBench
A neural codec	reconstruction + modelability	ViSQOL / PESQ / Mel-distance; plus the WER of a small LM trained on its tokens
A conversational agent	all of the above + latency	the above + time-to-first-audio and turn-taking metrics

You rarely report numbers in a vacuum — the field has converged on shared benchmarks: Seed-TTS-eval for zero-shot TTS (WER + speaker similarity on deliberately hard sentences), the Open ASR Leaderboard for recognition, and VoiceBench, AudioBench, and AIR-Bench for audio understanding and spoken dialogue. Note the codec row above: a codec can reconstruct beautifully and still be hard for an LLM to model, so the downstream test — train a small model on its tokens and measure the resynthesized speech — matters more for our purposes than raw reconstruction scores alone.

Using Speech LLMs with ⌘ Transformers

Everything in this post is a transformers model. Whisper, Voxtral, CSM, Qwen2.5-Omni and friends all live on the Hub with native library support, which means three familiar abstractions cover the whole speech stack: the pipeline() one-liner for transcription, the processor + chat template pattern (with {"type": "audio"} content parts) for audio understanding, and generate() with audio output for models that speak. And the roster keeps growing — recent merges include Audio Flamingo 3, Granite Speech, Higgs Audio V2, VibeVoice ASR, and Music Flamingo, all following the same idioms. Pick a task:

Figure 12: The four speech tasks and their transformers idioms. Each snippet is self-contained — swap the checkpoint to try a different model of the same family.

A few practical notes. Load in bfloat16 with device_map="auto" — speech models are no different from text LLMs here, and the 3–8B models in this post fit comfortably on a single 24 GB GPU. The chat-template pattern is the one worth internalizing: because audio is just another content type in the conversation list, the same code path handles audio + text questions, multi-turn dialogue, and (for omni models) images and video. And when you outgrow these snippets — streaming microphone input, sub-second response latency, interruption handling — the models' own repos (Moshi, CSM, Qwen2.5-Omni) ship dedicated real-time serving stacks that wrap the same checkpoints.

The Playbook: Training Your Own

Everything above condenses into three concrete recipes. None of them is hypothetical — every step below is something one of the surveyed models actually does, with the source noted so you can go deeper.

Recipe 1: An LLM-based TTS model (LLaSA / Orpheus style)

Pick your codec first. This is the decision everything else inherits. A flat single codebook (XCodec2: 50 Hz, 65,536 codes) gives you the simplest possible LLM training; a hierarchical RVQ codec (SNAC, DAC) gives better fidelity per token but forces you to flatten streams or add prediction heads. Prefer a codec with semantic grounding — intelligibility comes from the codec more than from the LLM.
Start from a pretrained text LLM (Llama 1B–8B class) and extend its tokenizer and LM head with the codec's vocabulary. No architecture changes.
Build the data pipeline: collect speech, transcribe what is unlabeled with a strong ASR model, encode the audio to codec tokens, and format samples as [text tokens] → [audio tokens]. LLaSA used 250k hours; useful models emerge at a few thousand hours.
Train with plain next-token prediction. All the text-LLM infrastructure (data parallelism, sequence packing, compression, serving stacks) applies unchanged — this is the entire point of the design.
Get voice cloning for free: condition on a short audio-token prompt of the target speaker and the model continues in that voice. No special training needed beyond diverse speakers in the data.
Align with GRPO and verifiable rewards. This is now the standard finishing step. Sample a group of candidates per prompt and score each with a composite reward: an ASR-in-the-loop term for content accuracy (CER/WER, optionally the ASR model's log-likelihood) plus a speaker-encoder term for voice consistency, with a duration guardrail. Never optimize word error alone — it improves intelligibility while eroding the cloned voice. This works at modest scale (LLaSA-1B + GRPO fits on a single A100), and learned reward models are the emerging way to also reward prosody and emotion.

Recipe 2: A neural audio codec

Architecture: a convolutional encoder that downsamples the waveform to a low frame rate, a quantizer, and a mirror-image decoder. The frame rate you choose (12.5–75 Hz) is the single most consequential number: it fixes the sequence length of every model trained on top.
Choose the quantizer: RVQ (Encodec, DAC) for hierarchical fidelity, a single large codebook (XCodec2) for LLM-friendliness, or FSQ variants (Voxtral Codec) to sidestep codebook-collapse issues entirely. Monitor codebook usage — a healthy codec uses nearly all its codes (LLaSA reports ~99%).
Train with the RVQGAN loss stack: multi-scale spectrogram reconstruction losses, adversarial discriminators (the main driver of perceptual quality), feature-matching, and commitment losses. DAC's recipe is the reference implementation.
Add semantic distillation if LLMs will consume your tokens: distill a self-supervised speech model (WavLM-style) into the first codebook (Mimi) or fuse a speech-understanding encoder's features before quantization (XCodec2). This is the difference between a compression codec and an LLM tokenizer.
Make it causal and streamable if conversation is the target: causal convolutions, chunk-wise decoding, and a decoder fast enough for real time (Mimi decodes in under 10 ms on one CPU core).
Evaluate on three axes: reconstruction quality (MOS, ViSQOL) vs. bitrate, downstream LLM modelability (train a small LM on your tokens, measure phone/word error of resynthesized speech), and latency.

Recipe 3: An audio LLM (understanding + conversation)

Start from the strongest text LLM you can serve, and attach a pretrained audio encoder (Whisper-class) through a small adapter. Optionally add a discrete semantic-token stream alongside the continuous features (Kimi-Audio's hybrid input).
Warm up the adapter alone — encoder and LLM frozen — for the first pass over the data (Voxtral), or keep the feature extractor frozen for the first ~20% of tokens (Kimi-Audio).
Pretrain on interleaved audio-text: mix the ASR-style repetition pattern with the cross-modal continuation pattern (~50/50, Voxtral), plus ASR, TTS, and audio-only tasks (Kimi-Audio). Keep a large share of text-only data — up to half of all batches — to prevent forgetting.
Mid-train for the target format: long-context extension for 40-minute audio, and multi-stream / two-channel data (diarization-simulated first, real conversations like Fisher last) if you want full-duplex dialogue rather than turn-based QA.
SFT with a consistent assistant voice (one voice actor + voice conversion, or a TTS engine with fixed conditioning) and diverse, augmented user audio. Include real human spoken queries — TTS-only SFT data fails on accented speech.
Align with DPO: rank candidate responses with a text reward model over transcripts for response quality, and with ASR-based WER rewards for speech stability. Watch for capability taxes — check ASR benchmarks before and after.
If speaking is required, bolt on generation last: either emit codec tokens directly (Moshi's depth transformer) or attach a Talker head over the LLM's hidden states with a streaming detokenizer (Qwen2.5-Omni, Kimi-Audio's flow-matching + vocoder).

Conclusion

The models surveyed here trace a clear trajectory. First, LLMs learned to listen: Qwen2-Audio and Voxtral Chat showed that pairing a strong audio encoder with a pretrained text model is enough for serious audio understanding. Then they learned to speak: LLaSA and Orpheus demonstrated that TTS can be reduced to next-token prediction over codec tokens, inheriting the entire scaling playbook of text LLMs. Finally, they are learning to converse: Moshi, CSM, Kimi-Audio, and Qwen2.5-Omni close the loop with full-duplex, streaming architectures where listening and speaking happen in the same model, in real time. And despite their architectural differences, they all train the same way: continued pretraining on interleaved audio-text with heavy text replay, mid-training for conversational structure, then voice-consistent SFT and DPO.

Underneath all of it sit the neural audio codecs, whose token rate, codebook structure, and semantic grounding quietly determine what the model above them can achieve. If there is one practical takeaway, it is this: when evaluating or building a SpeechLM, look at the codec first — the choice between a flat single codebook, a hierarchical RVQ stack, or a semantically distilled hybrid shapes the latency, quality, and modeling difficulty of everything downstream. The gap between synthetic and human conversation is closing fast, and it is being closed as much by better tokenizers as by bigger transformers.

Citation

For academic attribution, please cite this work as:

"SpeechLMs: LLM-Powered Text-to-Speech and Neural Audio Codecs Explored", 2025.

BibTeX citation

@misc{speechlms_explained,
    title={SpeechLMs: LLM-Powered Text-to-Speech and Neural Audio Codecs Explored},
    author={Steven Zheng},
    year={2025}
}