How modern speech systems are built and trained — LLM-based TTS, audio LLMs, and the neural codecs underneath them — ending with concrete recipes for training your own.
How modern speech systems are built and trained — LLM-based TTS, audio LLMs, and the neural codecs underneath them — ending with concrete recipes for training your own.
Classical text-to-speech (TTS) models have long excelled at voice cloning and speech synthesis. They generally follow a two-stage process: first, a model like Tacotron converts text into an intermediate acoustic representation (such as a spectrogram), and then a vocoder (for example, WaveGlow or HiFi-GAN) transforms that representation into waveform audio. While these systems are capable of producing lifelike voices, their primary focus has been on replicating a given speaker's sound, with limited capacity to engage in dynamic, context-aware conversations.
The advent of large language models (LLMs) offers a compelling opportunity to enhance these systems. By incorporating LLMs into TTS pipelines, we can leverage their sophisticated reasoning and contextual understanding to create truly conversational speech systems. Instead of merely cloning a voice, these enhanced systems can interpret context, adapt to dialogue flows, and generate responses that feel both natural and interactive. Essentially, LLMs open up a new dimension where synthesis isn’t only about producing sound—it’s about enabling intelligent, context-aware conversation.
One practical way to integrate these capabilities is through a cascaded approach in a speech-to-speech system, which typically involves three distinct modules:
This cascaded method combines the strengths of each specialized component. However, this approach is not without its limitations. One major challenge is that the LLM does not capture the full richness of the speech input. Speech carries subtle cues—intonation, rhythm, emotion, and prosodic nuance—that are often lost in the conversion process to text. As a result, when an LLM processes transcribed text, it receives a significantly distilled representation of the original audio. This loss of detail can limit the model’s ability to produce responses that fully mirror the expressive qualities of the initial speech, potentially resulting in synthetic output that feels less dynamic or contextually aware.
Integrating speech directly with an LLM could solve this challenge but it also presents significant difficulties. Unlike text, speech is a continuous, high-dimensional signal. LLMs are designed to work with discrete tokens, so converting speech into a format that these models can process requires additional steps. Existing methods address this gap in two main ways:
Despite these innovations, each method comes with trade-offs. Audio encoders must balance the preservation of critical information with the need for compact, discrete representations. Neural codecs, meanwhile, face challenges related to token rate—since speech typically generates far more tokens per second than text—and the potential loss of fine-grained acoustic details during quantization.
In summary, while classical TTS models provide a strong foundation for effective voice cloning and speech synthesis, integrating LLM reasoning significantly expands the potential use cases by enabling contextual, conversational interactions. The cascaded STT–LLM–TTS pipeline is a practical approach to achieve this integration, yet it carries inherent challenges such as error propagation between modules and difficulties in capturing the full richness of the speech signal. Advances in audio encoders and neural codecs are crucial for overcoming these hurdles, paving the way for next-generation conversational speech systems that seamlessly combine natural language understanding with high-fidelity audio synthesis.
This post is a playbook, and it builds toward a concrete goal. We will walk through the three families of systems that make up the modern speech stack — models that understand audio (Qwen2-Audio, Voxtral), models that speak (LLaSA, Orpheus, CSM), and the neural codecs that connect waveforms to tokens — and study not just their architectures but how each one is actually trained. By the end, you should have a working recipe for all three: an LLM-based TTS model, a neural audio codec, and an audio LLM. The selection here is opinionated — models chosen because each teaches a distinct lesson. For an exhaustive map of the field, the ACL 2025 SpeechLM survey
Qwen2-Audio employs a two-component architecture featuring an audio encoder and a large language model (LLM). The audio encoder is initialized using weights from the Whisper-large-v3 model
The training involves a three-stage process aimed at maximizing the probability of the next text token, conditioned on the audio representations and preceding text tokens:
This architectural and training methodology highlights a relatively straightforward approach to building powerful audio-language models. It demonstrates the feasibility of effectively combining strong, pre-existing unimodal models – like a capable audio encoder (Whisper) and a robust LLM (Qwen) – and then adapting them through targeted fine-tuning stages (SFT, DPO). This process of "plugging in" a modality-specific encoder and then fine-tuning the combined system mirrors common practices in multimodal LLMs, particularly analogous to how vision capabilities are often integrated into large language models.
Qwen2-Audio's blueprint has since become the standard pattern, and the transformers library has been steadily absorbing its descendants — each with one distinctive twist worth knowing:
The takeaway: "make an LLM understand audio" is no longer a research problem but a design menu — pick an encoder, pick a fusion mechanism (projector, Q-former, in-place replacement), and decide how you will protect the text weights (data replay, frozen stages, or Granite's audio-gated LoRA).
The first TTS LLM-powered model we examine is LLaSA, developed by researchers at HKUST and Microsoft. LLaSA directly tackles the question: How far can we push large language model (LLM) scaling principles when applied to speech synthesis? While many text-to-speech (TTS) systems use hybrid pipelines or multiple models, LLaSA adopts a minimalist, LLM-style design—one transformer, one stage, one codebook—and scales it up massively across model size and data.
LLaSA’s design philosophy is simplicity and reuse. It starts from a pretrained Llama model and extends both the tokenizer and LM head to include 65,536 new audio tokens on top of the original text vocabulary. These audio tokens come from XCodec2, a flat vector quantizer operating at 50 Hz – i.e. one token corresponds to 20 ms of audio. This design yields 50 tokens per second of audio, all from one codebook. By using a very large codebook, the codec achieves high fidelity with a single token stream (reportedly ~99% codebook usage, meaning it effectively utilizes the full range of tokens for nuanced encoding).
The model is then fully fine-tuned on sequences that combine both text and audio tokens, allowing the transformer to treat speech as a natural continuation of text. No architecture changes are made to the transformer itself—LLaSA simply learns to model speech the same way Llama models language: as a next-token prediction task over a unified vocabulary of text and audio tokens.
The advantage of this single-level approach is its simplicity for the LLM – the model just treats audio tokens like another language with a 65k vocabulary, not unlike how a wordpiece tokenizer might have 50k tokens for text. Training LLaSA thus becomes very similar to training a standard LLM: they convert all audio in the training set into long sequences of XCodec2 tokens and concatenate with the corresponding text transcriptions. The transformer learns to predict the next token, whether that next token is part of the text or the audio. The authors note this compatibility means they can directly apply techniques like data parallel scaling, model compression, or acceleration from the NLP world to this TTS model.
LLaSA’s training of 250k hours is one of the largest in TTS, spanning diverse speech in English and Chinese, it also has a multilingual version for the 1B model and a 3B model. The resulting models are correspondingly powerful. The largest 8B model in particular demonstrates remarkable naturalness and prosody. According to the paper, increasing model size consistently improved speech quality – bigger models produced more accurate and complex prosody patterns and sounded more natural. This is analogous to how in text LLMs, going from 1B to 7B to 70B yields more fluent and context-aware language; here it yields more human-like intonation and rhythm. Even the 1B model, while less expressive, still functions for basic speech and is extremely lightweight to run.
It can also do zero-shot voice cloning by taking a speech prompt. If you feed a short recording of a speaker (converted to tokens via XCodec2) followed by the text, LLaSA will generate the continuation in that voice. This works because the model has effectively learned to continue in the style provided by preceding audio tokens. Additionally, because it’s bilingual, you can prompt it with a Chinese voice and have it speak English in that voice or vice-versa, enabling cross-lingual voice transfer – a very useful feature for voice assistants in multilingual settings.
One interesting contribution of the LLaSA work is in scaling inference-time compute . The authors experimented with using speech understanding models as verifiers during generation . In practice, this means when sampling audio tokens, they would involve a pretrained model (like a speech recognition or speaker ID network) to guide the choice of tokens, re-ranking or biasing the outputs toward those that make the verifier happy. For example, a ASR verifier ensures the content is pronounced clearly (improving word accuracy), while a speaker encoder verifier ensures the voice timbre stays consistent, and a classifier might ensure the emotion matches some target. They found that by spending more computation at inference in this way, they could significantly improve aspects like emotional expressiveness, speaker consistency, and content accuracy in the generated speech. This is akin to how some text LLMs use external tools or rerankers to improve outputs post-hoc. While such techniques increase inference cost, they show a pathway to higher quality without retraining – useful for customizing style or ensuring correctness in critical applications. In terms of output quality, LLaSA is state-of-the-art on traditional TTS metrics. The paper reports that the codec can reconstruct 16 kHz speech with a MOS (Mean Opinion Score) around 4.1 (out of 5) on test data, which is near the ground truth quality. The transformer does introduce some modeling imperfections, but the massive training helps mitigate that. For prosody, the 8B model especially was noted to produce more lifelike intonation, handling even tricky sentences with complex emphasis better than smaller models. The bilingual nature means it learned to control tone appropriate to each language – for example, using the correct cadence for a Chinese question vs an English question. It can also mix languages to an extent (e.g., speaking an English sentence with a Chinese accent or vice versa, if prompted), reflecting the multilingual data.
While LLaSA pushed the boundaries with its massive scale and single-codebook simplicity, other models emerged concurrently, exploring similar LLM-driven TTS principles but often opting for different neural audio codec strategies. Notable examples include Canopy Labs' Orpheus 3B and OuteAI's OuteTTS.
A key difference lies in their choice of codec. Instead of XCodec2's single large codebook, both Orpheus and OuteTTS use codecs based on Residual Vector Quantization (RVQ). RVQ codecs like SNAC (Multi-Scale Neural Audio Codec)
Built upon a Llama-3B backbone, Orpheus pairs the LLM with the SNAC. SNAC is an advanced RVQ codec that captures audio information across different temporal resolutions, aiming for efficient compression and detailed reconstruction. To manage the multiple token streams from SNAC within a standard LLM framework, Orpheus employs a strategy of generating a flattened sequence of tokens (7 tokens per audio frame sequentially). It achieves low-latency streaming (~200ms to ~25-50 ms with input streaming of text into the KV cache) suitable for real-time applications by using an optimized decoding process involving a sliding window technique on the SNAC decoder, ensuring smooth audio output without pops. Orpheus particularly emphasizes generating expressive, emotive speech and supports zero-shot voice cloning from short audio prompts.
| Model | Base LLM | Audio Codec | Key Features |
|---|---|---|---|
| OuteAI/Llama-OuteTTS-1.0-1B | meta-llama/Llama-3.2-1B | ibm-research/DAC.speech.v1.0 |
- One-Shot Voice Cloning - Multilingual - Trained on ~60k hours of audio |
| SparkAudio/Spark-TTS-0.5B | Qwen/Qwen2.5-0.5B | BiCodec |
- Controllable TTS with prompt - Trained on 100k hours of open source audio - Bilingual capabilities |
| nari-labs/Dia-1.6B | Custom 1.6B transformer (trained from scratch) | Descript Audio Codec (DAC) |
- Multi-speaker dialogue generation ([S1]/[S2] tags) - Non-verbal sounds (laughter, coughs, sighs) - Voice cloning from an audio prompt |
| bosonai/higgs-audio-v2-generation-3B-base | meta-llama/Llama-3.2-3B | Unified tokenizer (semantic + acoustic, in-house) |
- DualFFN: separate FFN path for audio tokens - 10M hours pretraining (AudioVerse) - Expressive multi-speaker audio, zero post-training |
| moonshotai/Kimi-Audio-7B-Instruct | Qwen 2.5-7B | Hybrid (continuous + discrete) |
- Universal audio foundation model - ASR, TTS, audio QA, SER, SEC - 13M+ hours pre-training |
| Qwen/Qwen2.5-Omni-7B | Qwen 2.5 | Thinker-Talker (built-in) |
- Multimodal: text, image, audio, video - Real-time voice & video chat - Selectable voices (Chelsie, Ethan) |
| mistralai/Voxtral-Mini-4B-Realtime | Mistral 7B | Voxtral Codec |
- Streaming ASR - 13 languages - 480ms delay |
| mistralai/Voxtral-4B-TTS | Mistral 7B | Voxtral Codec (VQ-FSQ) |
- Multilingual TTS - Voice cloning from 3s audio - Flow-matching decoder |
Moshi is an open 7-billion-parameter speech–text foundation model that aims to make synthetic voices feel as immediate and interruptible as ordinary conversation. Every 80 ms the model’s large Temporal Transformer rolls the entire dialogue history—both text and previously generated audio tokens—into a single context vector. A tiny linear head turns that vector into one semantic token, after which a compact Depth Transformer fills in seven acoustic tokens that refine prosody and timbre. These eight tokens drive the causal Mimi codec, so the first fragment of sound can leave the speaker roughly 160–200 ms after the user finishes talking
Because Moshi writes its own audio tokens and reads the user’s tokens in the same sequence, it can back-channel (“mm-hm”), pause mid-sentence when interrupted, or pick up the thread again—all without external VAD, ASR, or turn-taking heuristics. The entire stack is Apache-2.0 licensed, and the Mimi decoder runs in under ten milliseconds on a single CPU core, making on-device streaming practical for mobile hardware.
Sesame AI’s Conversational Speech Model (CSM-1B) attacks the same problem from the TTS side. Rather than a full-duplex agent like Moshi, CSM is a context-aware speech generator: text and audio tokens from the whole conversation so far form one interleaved sequence, and the model generates the next turn’s audio conditioned on all of it. Because the prosody of a reply depends on what was just said — and how it was said — feeding the model raw conversational context lets it get intonation, emotion, and timing right where an isolated TTS call would sound flat
Architecturally, CSM is two transformers operating on Mimi codec tokens (12.5 Hz, one semantic + N−1 acoustic codebooks per frame). A Llama-style backbone processes the interleaved text-and-audio history and predicts the semantic codebook for each frame; a much smaller audio decoder then models the remaining acoustic codebooks to reconstruct high-fidelity speech. Both stages are autoregressive — the split exists so the expensive backbone runs only once per frame while the cheap decoder fills in the detail, keeping latency low enough for real-time use.
The training corpus is roughly 1 million hours of transcribed, diarized, predominantly English audio, trained for five epochs at a 2,048-token sequence length (about two minutes of dialogue). Sesame trained three sizes — 1B, 3B, and 8B backbones — and released the 1B model openly. In their evaluations, listeners rated isolated CSM utterances on par with human recordings; only when the conversational context was shown did humans retain a clear edge — evidence that naturalness is largely solved and contextual appropriateness is the remaining frontier.
Moving towards truly interactive conversational speech, models need to inherently understand and adapt to the flow of dialogue. Sesame AI's Conversational Speech Model (CSM) represents a significant step in this direction, explicitly designed to leverage context for more natural and coherent speech synthesis, as detailed in their research "Crossing the Uncanny Valley of Voice".
Kimi-Audio from Moonshot AI is a 7-billion-parameter open-source audio foundation model that unifies audio understanding, generation, and conversation within a single framework
The model introduces a hybrid audio input architecture that combines continuous acoustic features with discrete semantic tokens, allowing the LLM core to ingest audio at multiple levels of abstraction. For output, Kimi-Audio employs parallel heads for both text and audio token generation, making it capable of producing text transcriptions, natural language answers, or synthesized speech from the same forward pass. A chunk-wise streaming detokenizer based on flow matching keeps audio output latency low enough for real-time interaction.
Pre-training consumed over 13 million hours of diverse audio data spanning speech, music, and environmental sounds, making Kimi-Audio competitive across a wide spectrum of benchmarks. It achieves strong results on ASR (Common Voice, LibriSpeech), audio question answering (MMAU), speech emotion recognition, and sound event classification, while also supporting end-to-end speech conversation. The model is released in two checkpoints: Kimi-Audio-7B (base) and Kimi-Audio-7B-Instruct (fine-tuned for dialogue).
Alibaba's Qwen2.5-Omni pushes the boundary of multimodal LLMs by simultaneously perceiving text, images, audio, and video while generating both text and natural speech responses in a streaming manner
The architecture splits responsibilities into a Thinker–Talker design. The Thinker is a transformer backbone that processes all modalities; the Talker is a lightweight decoder head that converts hidden states into speech tokens. A novel position-embedding scheme called TMRoPE (Time-aligned Multimodal RoPE) synchronizes video timestamps with audio, so lip movements and spoken words stay aligned in the model's internal representation. This is especially important for tasks like video-based question answering or real-time video chat.
Qwen2.5-Omni comes in 3B and 7B variants. In audio-only benchmarks it rivals or surpasses Qwen2-Audio and Whisper-large-v3, while in speech generation it achieves competitive speaker similarity and content consistency on the SEED evaluation suite. On the OmniBench multimodal benchmark, the 7B model reaches state-of-the-art performance among open models. The model also supports selectable output voices—Chelsie (female, warm) and Ethan (male, upbeat)—giving developers control over persona without extra training.
Mistral AI's Voxtral family is a comprehensive audio suite that spans understanding, generation, and real-time transcription
Voxtral Chat (Mini and Small) is a multimodal audio chat model that comprehends both spoken audio and text documents. A 32K context window lets it handle audio files up to 40 minutes and sustain long multi-turn conversations. Voxtral Small outperforms several closed-source models on audio benchmarks while remaining compact enough to run locally. The model is trained on diverse audio tasks including transcription, audio question answering, and cross-modal reasoning.
Voxtral TTS is an expressive multilingual text-to-speech model that generates natural speech from as little as 3 seconds of reference audio. It adopts a hybrid architecture: an autoregressive Transformer generates semantic speech tokens, and a flow-matching decoder turns those tokens into acoustic features via the custom Voxtral Codec—a speech tokenizer with hybrid VQ-FSQ quantization. In native-speaker evaluations, Voxtral TTS achieved a 68.4% win rate over ElevenLabs Flash v2.5 on multilingual voice cloning, demonstrating strong naturalness and expressivity.
Voxtral Realtime is a natively streaming automatic speech recognition model that matches offline transcription quality at sub-second latency. Unlike chunking or sliding-window adaptations of offline models, Voxtral Realtime is trained end-to-end for streaming with explicit alignment between audio and text streams. It builds on the Delayed Streams Modeling framework, introducing a causal audio encoder and Ada RMS-Norm for improved delay conditioning. At a 480 ms delay, it achieves performance on par with Whisper, the most widely deployed offline transcription system, while scaling to 13 languages.
All three Voxtral models are available on the Hugging Face Hub and can be loaded through the standard transformers library, making them accessible to any developer already familiar with the Hugging Face ecosystem.
Reading the model cards, these systems can look like magic: one checkpoint that listens, reasons, and speaks. But every model in this chapter follows essentially the same three-phase curriculum, directly inherited from text LLMs: pretraining teaches the modality, mid-training teaches the conversational format, and post-training teaches the behavior. Explore the actual pipelines below — the details differ, but the skeleton is remarkably consistent.
No one trains a SpeechLM from scratch on audio. Every system starts from a pretrained text LLM — Helium for Moshi, Qwen2.5-7B for Kimi-Audio, Mistral for Voxtral — because the linguistic knowledge, world knowledge, and reasoning live in the text weights. Audio pretraining is therefore really continued pretraining: expose the model to enormous amounts of audio (7M hours for Moshi
The most important design decision is what the audio-text training sample looks like. Voxtral
The repetition pattern is essentially ASR formatted as language modeling, and it drives transcription accuracy. The continuation pattern — predict the transcript of the next chunk from the current audio — is the one that produces understanding: the model can only continue speech it has actually comprehended, which is exactly the skill spoken QA and dialogue require. Kimi-Audio reaches the same conclusion independently: of its seven pretraining tasks, three are interleaving variants (audio→semantic-token, audio→text, and joint audio+text prediction), trained over a budget of 585B audio tokens and 585B text tokens. If you remember one thing about audio pretraining, make it this: interleaved data is what turns an ASR system into an audio language model.
A small but recurring trick: freeze first, unfreeze later. Voxtral's first pass over the data trains only the adapter between the frozen audio encoder and the frozen LLM; Kimi-Audio keeps its Whisper feature extractor frozen for the first ~20% of pretraining tokens. Aligning a randomly initialized adapter against two stable representations is much easier than letting everything drift at once.
A model pretrained on single-stream audio can continue speech, but it has never seen a conversation — two voices, overlapping, taking turns. Moshi's recipe makes this phase explicit. First, it applies speaker diarization to its unsupervised corpus to simulate two streams (the target speaker's waveform on one channel, everyone else on the other) and trains for 100k steps. Only then does it use real two-channel telephone conversations — the Fisher corpus, 2,000 hours of phone calls recorded with one channel per participant — to learn genuine turn-taking, interruptions, and back-channels. The lesson generalizes: scarce gold data (real two-channel dialogue) is used last, after cheap simulated data has done the heavy lifting.
Mid-training is also where the engineering constraints of audio bite hardest, because audio sequences are long and each position can carry many codebooks. CSM's answer is compute amortization
Supervised fine-tuning for speech has a unique problem text never had: the assistant needs a voice, and your SFT data defines it. Moshi generates 20k+ hours of synthetic dialogues — transcripts written by its own text LLM, then synthesized with a TTS engine conditioned on one voice actor across 70+ speaking styles — while augmenting the user stream with gain changes, background noise, echo and reverb so the model stays robust to real microphones. Kimi-Audio records a professional voice actor across 20+ styles and emotion intensities, then uses voice conversion to expand coverage. And Voxtral contributes an important negative result: SFT data synthesized purely with TTS generalizes poorly to accented human speech — you need real spoken queries in the mix.
The final ingredient — and the fastest-moving — is reinforcement learning. The first wave was DPO, used for two distinct ends:
Through 2025, speech post-training caught the same wave that reshaped reasoning LLMs: a shift from preference pairs (DPO) to GRPO with verifiable rewards (R − mean) / std pushes probability toward the better-than-average samples — with a KL penalty anchoring the model to its starting point. Crucially, the reward needs no human labels and no trained reward model: synthesize the speech, run it back through an ASR model, and score the transcript against the target text. The ASR-in-the-loop verifiers that LLaSA originally used to rerank outputs at inference time (back in the TTS section) are now baked into the weights.
The hard-won lesson across every one of these papers is that a single reward is a trap. Optimize for ASR-measured word error alone and intelligibility climbs while the cloned voice quietly drifts and prosody flattens — the model learns to enunciate, not to sound right. So the recipes converge on composites: the GRPO-TTS paper blends character error with the ASR model's log-likelihood (a softer signal that catches subtle mispronunciations), cutting CosyVoice2's Chinese CER from 1.41 to 1.07 while lifting naturalness MOS from 4.42 to 4.58
This is also a domain where you do not need a frontier-scale rig: applying GRPO to LLaSA-1B — with a composite of ASR word error and model confidence — measurably improved error rates and naturalness on a single A100
Put together, the pattern across these systems is hard to miss. Pretraining = text LLM + massive interleaved audio + anti-forgetting replay. Mid-training = conversational structure (multi-stream, long context) + compute tricks. Post-training = voice-consistent SFT, then GRPO with a composite of verifiable (ASR-in-the-loop) and identity rewards. The final section of this playbook turns this into a checklist.
Every speech LLM discussed so far relies on a neural audio codec to bridge the gap between continuous waveforms and discrete tokens. Understanding how these codecs work—and how they differ—is essential for appreciating the trade-offs each model makes.
A neural codec typically consists of an encoder that compresses raw audio into a compact latent representation, a quantizer that maps each latent frame to one or more discrete codebook indices, and a decoder that reconstructs the waveform from those indices. The key variables are:
EnCodec (Meta, 2022) established the modern neural-codec paradigm with a 24 kHz, 1.5–6 kbps RVQGAN architecture
RVQ is easiest to understand by watching it work. The first codebook stores a coarse sketch of each frame; every subsequent codebook encodes only the residual — the part of the signal the previous levels missed. Drag the slider to add codebooks and watch the reconstruction converge while the token rate and bitrate climb:
And here is the same trade-off with your ears instead of your eyes: the same speech clip passed through EnCodec at three codebook depths. At 2 codebooks the voice is intelligible but robotic; each doubling restores texture — and doubles the LLM's token bill.
facebook/encodec_24khz in transformers by varying the bandwidth argument.
Descript Audio Codec (DAC) extends EnCodec to 44.1 kHz stereo audio and introduces improved RVQGAN training with larger codebooks, achieving higher perceptual quality at 6–8 kbps
SNAC (Multi-Scale Neural Audio Codec) takes a different approach by quantizing at multiple temporal resolutions simultaneously—e.g., separate codebooks for coarse and fine structure
XCodec2 (HKUST, used by LLaSA) opts for a single large codebook of 65,536 entries at 50 Hz. By eliminating the hierarchical RVQ structure, XCodec2 simplifies the LLM's prediction task to straightforward next-token generation, reportedly utilizing ~99% of the codebook. The trade-off is that a single token must encode a 20 ms frame, so the model relies on the massive capacity of the codebook itself to preserve nuance. The design is catching on: Neuphonic\u2019s NeuCodec (the codec behind NeuTTS Air) extends XCodec2 with Finite Scalar Quantization, keeping the single 50 tokens/s stream while cutting the bitrate to 0.8 kbps and pairing a Wav2Vec2-BERT semantic encoder with the acoustic one.
Mimi (Kyutai, 2024) is the codec behind Moshi. It runs at 12.5 Hz with a bitrate of only 1.1 kbps, yet combines semantic tokens (from a self-supervised speech encoder) with acoustic tokens in a single stream. This hybrid design means the LLM receives both high-level linguistic content and low-level timbre information at an extremely low token rate, which is crucial for real-time full-duplex dialogue. Mimi's decoder is also heavily optimized for streaming, running in under ten milliseconds on a single CPU core.
| Codec | Token Rate | RVQ Depth | Typical Bitrate | Used By |
|---|---|---|---|---|
| EnCodec | 75 Hz | 32 | 1.5–6 kbps | Early SpeechLLMs, audiobook pipelines |
| DAC | ~75 Hz | 9 | 6–8 kbps | OuteTTS, high-fidelity TTS |
| SNAC | Multi-scale | 3 levels | ~2 kbps | Orpheus 3B |
| XCodec2 | 50 Hz | 1 (flat) | ~2 kbps | LLaSA |
| NeuCodec | 50 Hz | 1 (FSQ, flat) | 0.8 kbps | NeuTTS Air |
| Mimi | 12.5 Hz | 8 (1 semantic + 7 acoustic) | 1.1 kbps | Moshi |
The choice of codec is not merely an implementation detail: it shapes the LLM's input distribution, the achievable latency, and the fidelity ceiling. A flat single-codebook design like XCodec2 simplifies training and inference, but demands the LLM to carry more of the acoustic modeling burden. A multi-scale or semantic+acoustic design like Mimi pushes complexity into the codec, letting the LLM focus on high-level dialogue and reasoning. As the field matures, we expect to see more hybrid codecs that balance these axes for specific deployment constraints—on-device, low-latency, or studio-quality.
To see why this matters so much for LLMs, look at what one second of speech actually costs in tokens. A text LLM spends roughly 3–4 tokens to represent a second of spoken English; a naively flattened DAC stream spends nearly 800. This token bill determines how much conversation history fits in the context window — and therefore how long a dialogue your SpeechLM can remember:
Everything in this post is a transformers model. Whisper, Voxtral, CSM, Qwen2.5-Omni and friends all live on the Hub with native library support, which means three familiar abstractions cover the whole speech stack: the pipeline() one-liner for transcription, the processor + chat template pattern (with {"type": "audio"} content parts) for audio understanding, and generate() with audio output for models that speak. And the roster keeps growing — recent merges include Audio Flamingo 3, Granite Speech, Higgs Audio V2, VibeVoice ASR, and Music Flamingo, all following the same idioms. Pick a task:
transformers idioms. Each snippet is self-contained — swap the checkpoint to try a different model of the same family.
A few practical notes. Load in bfloat16 with device_map="auto" — speech models are no different from text LLMs here, and the 3–8B models in this post fit comfortably on a single 24 GB GPU. The chat-template pattern is the one worth internalizing: because audio is just another content type in the conversation list, the same code path handles audio + text questions, multi-turn dialogue, and (for omni models) images and video. And when you outgrow these snippets — streaming microphone input, sub-second response latency, interruption handling — the models' own repos (Moshi, CSM, Qwen2.5-Omni) ship dedicated real-time serving stacks that wrap the same checkpoints.
Everything above condenses into three concrete recipes. None of them is hypothetical — every step below is something one of the surveyed models actually does, with the source noted so you can go deeper.
[text tokens] → [audio tokens]. LLaSA used 250k hours; useful models emerge at a few thousand hours.The models surveyed here trace a clear trajectory. First, LLMs learned to listen: Qwen2-Audio and Voxtral Chat showed that pairing a strong audio encoder with a pretrained text model is enough for serious audio understanding. Then they learned to speak: LLaSA and Orpheus demonstrated that TTS can be reduced to next-token prediction over codec tokens, inheriting the entire scaling playbook of text LLMs. Finally, they are learning to converse: Moshi, CSM, Kimi-Audio, and Qwen2.5-Omni close the loop with full-duplex, streaming architectures where listening and speaking happen in the same model, in real time. And despite their architectural differences, they all train the same way: continued pretraining on interleaved audio-text with heavy text replay, mid-training for conversational structure, then voice-consistent SFT and DPO.
Underneath all of it sit the neural audio codecs, whose token rate, codebook structure, and semantic grounding quietly determine what the model above them can achieve. If there is one practical takeaway, it is this: when evaluating or building a SpeechLM, look at the codec first — the choice between a flat single codebook, a hierarchical RVQ stack, or a semantically distilled hybrid shapes the latency, quality, and modeling difficulty of everything downstream. The gap between synthetic and human conversation is closing fast, and it is being closed as much by better tokenizers as by bigger transformers.
For academic attribution, please cite this work as:
"SpeechLMs: LLM-Powered Text-to-Speech and Neural Audio Codecs Explored", 2025.
BibTeX citation
@misc{speechlms_explained,
title={SpeechLMs: LLM-Powered Text-to-Speech and Neural Audio Codecs Explored},
author={Steven Zheng},
year={2025}
}