A deep dive into recent SpeechLM models like Orpheus, LLaSA, and CSM – understanding architectures, neural audio codecs, and decoding strategies.
A deep dive into recent SpeechLM models like Orpheus, LLaSA, and CSM – understanding architectures, neural audio codecs, and decoding strategies.
Classical text-to-speech (TTS) models have long excelled at voice cloning and speech synthesis. They generally follow a two-stage process: first, a model like Tacotron converts text into an intermediate acoustic representation (such as a spectrogram), and then a vocoder (for example, WaveGlow or HiFi-GAN) transforms that representation into waveform audio. While these systems are capable of producing lifelike voices, their primary focus has been on replicating a given speaker's sound, with limited capacity to engage in dynamic, context-aware conversations.
The advent of large language models (LLMs) offers a compelling opportunity to enhance these systems. By incorporating LLMs into TTS pipelines, we can leverage their sophisticated reasoning and contextual understanding to create truly conversational speech systems. Instead of merely cloning a voice, these enhanced systems can interpret context, adapt to dialogue flows, and generate responses that feel both natural and interactive. Essentially, LLMs open up a new dimension where synthesis isn’t only about producing sound—it’s about enabling intelligent, context-aware conversation.
One practical way to integrate these capabilities is through a cascaded approach in a speech-to-speech system, which typically involves three distinct modules:
This cascaded method combines the strengths of each specialized component. However, this approach is not without its limitations. One major challenge is that the LLM does not capture the full richness of the speech input. Speech carries subtle cues—intonation, rhythm, emotion, and prosodic nuance—that are often lost in the conversion process to text. As a result, when an LLM processes transcribed text, it receives a significantly distilled representation of the original audio. This loss of detail can limit the model’s ability to produce responses that fully mirror the expressive qualities of the initial speech, potentially resulting in synthetic output that feels less dynamic or contextually aware.
Integrating speech directly with an LLM could solve this challenge but it also presents significant difficulties. Unlike text, speech is a continuous, high-dimensional signal. LLMs are designed to work with discrete tokens, so converting speech into a format that these models can process requires additional steps. Existing methods address this gap in two main ways:
Despite these innovations, each method comes with trade-offs. Audio encoders must balance the preservation of critical information with the need for compact, discrete representations. Neural codecs, meanwhile, face challenges related to token rate—since speech typically generates far more tokens per second than text—and the potential loss of fine-grained acoustic details during quantization.
In summary, while classical TTS models provide a strong foundation for effective voice cloning and speech synthesis, integrating LLM reasoning significantly expands the potential use cases by enabling contextual, conversational interactions. The cascaded STT–LLM–TTS pipeline is a practical approach to achieve this integration, yet it carries inherent challenges such as error propagation between modules and difficulties in capturing the full richness of the speech signal. Advances in audio encoders and neural codecs are crucial for overcoming these hurdles, paving the way for next-generation conversational speech systems that seamlessly combine natural language understanding with high-fidelity audio synthesis.
            Qwen2-Audio employs a two-component architecture featuring an audio encoder and a large language model (LLM). The audio encoder is initialized using weights from the Whisper-large-v3 model 
The training involves a three-stage process aimed at maximizing the probability of the next text token, conditioned on the audio representations and preceding text tokens:
This architectural and training methodology highlights a relatively straightforward approach to building powerful audio-language models. It demonstrates the feasibility of effectively combining strong, pre-existing unimodal models – like a capable audio encoder (Whisper) and a robust LLM (Qwen) – and then adapting them through targeted fine-tuning stages (SFT, DPO). This process of "plugging in" a modality-specific encoder and then fine-tuning the combined system mirrors common practices in multimodal LLMs, particularly analogous to how vision capabilities are often integrated into large language models.
The first TTS LLM-powered model we examine is LLaSA, developed by researchers at HKUST and Microsoft. LLaSA directly tackles the question: How far can we push large language model (LLM) scaling principles when applied to speech synthesis? While many text-to-speech (TTS) systems use hybrid pipelines or multiple models, LLaSA adopts a minimalist, LLM-style design—one transformer, one stage, one codebook—and scales it up massively across model size and data.
LLaSA’s design philosophy is simplicity and reuse. It starts from a pretrained Llama model and extends both the tokenizer and LM head to include 65,536 new audio tokens on top of the original text vocabulary. These audio tokens come from XCodec2, a flat vector quantizer operating at 50 Hz – i.e. one token corresponds to 20 ms of audio. This design yields 50 tokens per second of audio, all from one codebook. By using a very large codebook, the codec achieves high fidelity with a single token stream (reportedly ~99% codebook usage, meaning it effectively utilizes the full range of tokens for nuanced encoding).
The model is then fully fine-tuned on sequences that combine both text and audio tokens, allowing the transformer to treat speech as a natural continuation of text. No architecture changes are made to the transformer itself—LLaSA simply learns to model speech the same way Llama models language: as a next-token prediction task over a unified vocabulary of text and audio tokens.
The advantage of this single-level approach is its simplicity for the LLM – the model just treats audio tokens like another language with a 65k vocabulary, not unlike how a wordpiece tokenizer might have 50k tokens for text. Training LLaSA thus becomes very similar to training a standard LLM: they convert all audio in the training set into long sequences of XCodec2 tokens and concatenate with the corresponding text transcriptions. The transformer learns to predict the next token, whether that next token is part of the text or the audio. The authors note this compatibility means they can directly apply techniques like data parallel scaling, model compression, or acceleration from the NLP world to this TTS model.
LLaSA’s training of 250k hours is one of the largest in TTS, spanning diverse speech in English and Chinese, it also has a multilingual version for the 1B model and a 3B model. The resulting models are correspondingly powerful. The largest 8B model in particular demonstrates remarkable naturalness and prosody. According to the paper, increasing model size consistently improved speech quality – bigger models produced more accurate and complex prosody patterns and sounded more natural. This is analogous to how in text LLMs, going from 1B to 7B to 70B yields more fluent and context-aware language; here it yields more human-like intonation and rhythm. Even the 1B model, while less expressive, still functions for basic speech and is extremely lightweight to run.
It can also do zero-shot voice cloning by taking a speech prompt. If you feed a short recording of a speaker (converted to tokens via XCodec2) followed by the text, LLaSA will generate the continuation in that voice. This works because the model has effectively learned to continue in the style provided by preceding audio tokens. Additionally, because it’s bilingual, you can prompt it with a Chinese voice and have it speak English in that voice or vice-versa, enabling cross-lingual voice transfer – a very useful feature for voice assistants in multilingual settings.
One interesting contribution of the LLaSA work is in scaling inference-time compute . The authors experimented with using speech understanding models as verifiers during generation . In practice, this means when sampling audio tokens, they would involve a pretrained model (like a speech recognition or speaker ID network) to guide the choice of tokens, re-ranking or biasing the outputs toward those that make the verifier happy. For example, a ASR verifier ensures the content is pronounced clearly (improving word accuracy), while a speaker encoder verifier ensures the voice timbre stays consistent, and a classifier might ensure the emotion matches some target. They found that by spending more computation at inference in this way, they could significantly improve aspects like emotional expressiveness, speaker consistency, and content accuracy in the generated speech. This is akin to how some text LLMs use external tools or rerankers to improve outputs post-hoc. While such techniques increase inference cost, they show a pathway to higher quality without retraining – useful for customizing style or ensuring correctness in critical applications. In terms of output quality, LLaSA is state-of-the-art on traditional TTS metrics. The paper reports that the codec can reconstruct 16 kHz speech with a MOS (Mean Opinion Score) around 4.1 (out of 5) on test data, which is near the ground truth quality. The transformer does introduce some modeling imperfections, but the massive training helps mitigate that. For prosody, the 8B model especially was noted to produce more lifelike intonation, handling even tricky sentences with complex emphasis better than smaller models. The bilingual nature means it learned to control tone appropriate to each language – for example, using the correct cadence for a Chinese question vs an English question. It can also mix languages to an extent (e.g., speaking an English sentence with a Chinese accent or vice versa, if prompted), reflecting the multilingual data.
While LLaSA pushed the boundaries with its massive scale and single-codebook simplicity, other models emerged concurrently, exploring similar LLM-driven TTS principles but often opting for different neural audio codec strategies. Notable examples include Canopy Labs' Orpheus 3B
            A key difference lies in their choice of codec. Instead of XCodec2's single large codebook, both Orpheus and OuteTTS use codecs based on Residual Vector Quantization (RVQ). RVQ codecs like SNAC  (Multi-Scale Neural Audio Codec) 
Built upon a Llama-3B backbone, Orpheus pairs the LLM with the SNAC. SNAC is an advanced RVQ codec that captures audio information across different temporal resolutions, aiming for efficient compression and detailed reconstruction. To manage the multiple token streams from SNAC within a standard LLM framework, Orpheus employs a strategy of generating a flattened sequence of tokens (7 tokens per audio frame sequentially). It achieves low-latency streaming (~200ms to ~25-50 ms with input streaming of text into the KV cache) suitable for real-time applications by using an optimized decoding process involving a sliding window technique on the SNAC decoder, ensuring smooth audio output without pops. Orpheus particularly emphasizes generating expressive, emotive speech and supports zero-shot voice cloning from short audio prompts.
| Model | Base LLM | Audio Codec | Key Features | 
|---|---|---|---|
| OuteAI/Llama-OuteTTS-1.0-1B | meta-llama/Llama-3.2-1B | ibm-research/DAC.speech.v1.0 | - One-Shot Voice Cloning - Multilingual - Trained on ~60k hours of audio | 
| SparkAudio/Spark-TTS-0.5B | Qwen/Qwen2.5-0.5B | BiCodec | - Controllable TTS with prompt - Trained on 100k hours of open source audio - Bilingual capabilities | 
| nari-labs/Dia-1.6B | meta-llama/Llama-3.2-1B | XCodec2 | - Lightweight deployment - Multilingual support - Basic TTS functionality | 
        Moshi is an open 7-billion-parameter speech–text foundation model that aims to make synthetic voices feel as immediate and interruptible as ordinary conversation.  Every 80 ms the model’s large Temporal Transformer rolls the entire dialogue history—both text and previously generated audio tokens—into a single context vector.  A tiny linear head turns that vector into one semantic token, after which a compact Depth Transformer fills in seven acoustic tokens that refine prosody and timbre.  These eight tokens drive the causal Mimi codec, so the first fragment of sound can leave the speaker roughly 160–200 ms after the user finishes talking 
Because Moshi writes its own audio tokens and reads the user’s tokens in the same sequence, it can back-channel (“mm-hm”), pause mid-sentence when interrupted, or pick up the thread again—all without external VAD, ASR, or turn-taking heuristics. The entire stack is Apache-2.0 licensed, and the Mimi decoder runs in under ten milliseconds on a single CPU core, making on-device streaming practical for mobile hardware.
 
        
            Sesame AI’s Conversational Speech Model (CSM-1B) pursues the same low-latency goal but collapses Moshi’s two-stage hierarchy into a single multimodal Transformer.  Text tokens, the assistant’s outgoing audio tokens, and the user’s incoming audio tokens form one interleaved sequence.  This means the model literally “listens” and “speaks” in the same forward pass, letting it synchronise intonation, emotion, and interruption timing straight from raw context 
For each 80 ms frame the backbone emits a hidden state that already encodes minutes of dialogue. A lightweight head selects the semantic code, while six parallel heads output the remaining residual-vector-quantised acoustic codes in rapid succession—no inner autoregressive loop is required. Removing that extra loop shaves roughly one frame of latency, so measured start-to-speech time drops to about 120 ms on an A100 GPU. Human evaluators preferred CSM-1B’s conversational warmth to recorded ground truth 58 % of the time, suggesting that richer context can beat raw signal fidelity alone.
CSM-1B is trained on roughly 80 k hours of mixed scripted and spontaneous dialogue, then aligned with reinforcement learning from human feedback to fine-tune politeness, interruption timing, and persona consistency. In deployment the same model can transcribe, reason, and reply without the brittle module boundaries of cascaded STT–LLM–TTS pipelines.
 
        Moving towards truly interactive conversational speech, models need to inherently understand and adapt to the flow of dialogue. Sesame AI's Conversational Speech Model (CSM) represents a significant step in this direction, explicitly designed to leverage context for more natural and coherent speech synthesis, as detailed in their research "Crossing the Uncanny Valley of Voice".
For academic attribution, please cite this work as:
"SpeechLMs: LLM-Powered Text-to-Speech and Neural Audio Codecs Explored", 2025.
BibTeX citation
@misc{speechlms_explained,
    title={SpeechLMs: LLM-Powered Text-to-Speech and Neural Audio Codecs Explored},
    author={Your Name},
    year={2025}
}