A deep dive into recent SpeechLM models like Orpheus, LLaSA, and CSM – understanding architectures, neural audio codecs, and decoding strategies.
Classical text-to-speech (TTS) models have long excelled at voice cloning and speech synthesis. They generally follow a two-stage process: first, a model like Tacotron converts text into an intermediate acoustic representation (such as a spectrogram), and then a vocoder (for example, WaveGlow or HiFi-GAN) transforms that representation into waveform audio. While these systems are capable of producing lifelike voices, their primary focus has been on replicating a given speaker's sound, with limited capacity to engage in dynamic, context-aware conversations.
The advent of large language models (LLMs) offers a compelling opportunity to enhance these systems. By incorporating LLMs into TTS pipelines, we can leverage their sophisticated reasoning and contextual understanding to create truly conversational speech systems. Instead of merely cloning a voice, these enhanced systems can interpret context, adapt to dialogue flows, and generate responses that feel both natural and interactive. Essentially, LLMs open up a new dimension where synthesis isn’t only about producing sound—it’s about enabling intelligent, context-aware conversation.
One practical way to integrate these capabilities is through a cascaded approach in a speech-to-speech system, which typically involves three distinct modules:
1. A speech-to-text (STT) module that transcribes the user's spoken input into text.
2. An LLM that reasons over the transcription and generates a textual response.
3. A text-to-speech (TTS) module that synthesizes the response back into audio.
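To make the data flow concrete, here is a minimal sketch of such a pipeline, assuming openai-whisper for transcription and a small instruction-tuned model served through transformers for the reply. The model names and the `synthesize_speech` placeholder are illustrative assumptions, not part of any specific system.

```python
# Minimal sketch of a cascaded speech-to-speech pipeline (STT -> LLM -> TTS).
# STT and LLM use real libraries; synthesize_speech is a hypothetical placeholder
# standing in for any TTS model (LLaSA, Orpheus, OuteTTS, ...).
import whisper
from transformers import pipeline

stt = whisper.load_model("base")
llm = pipeline("text-generation", model="Qwen/Qwen2.5-0.5B-Instruct")

def synthesize_speech(text: str) -> bytes:
    """Hypothetical TTS stage: swap in any speech synthesizer here."""
    raise NotImplementedError

def respond(audio_path: str) -> bytes:
    # 1. Speech -> text: prosody, emotion, and speaker identity are lost at this step.
    transcript = stt.transcribe(audio_path)["text"]
    # 2. Text -> text: the LLM reasons only over the distilled transcript.
    reply = llm(f"User said: {transcript}\nAssistant:",
                max_new_tokens=64, return_full_text=False)[0]["generated_text"]
    # 3. Text -> speech.
    return synthesize_speech(reply)
```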
This cascaded method combines the strengths of each specialized component. However, this approach is not without its limitations. One major challenge is that the LLM does not capture the full richness of the speech input. Speech carries subtle cues—intonation, rhythm, emotion, and prosodic nuance—that are often lost in the conversion process to text. As a result, when an LLM processes transcribed text, it receives a significantly distilled representation of the original audio. This loss of detail can limit the model’s ability to produce responses that fully mirror the expressive qualities of the initial speech, potentially resulting in synthetic output that feels less dynamic or contextually aware.
Integrating speech directly with an LLM could solve this challenge, but it also presents significant difficulties. Unlike text, speech is a continuous, high-dimensional signal. LLMs are designed to work with discrete tokens, so converting speech into a format that these models can process requires additional steps. Existing methods address this gap in two main ways:
- Audio encoders, which map the waveform into continuous embeddings that are projected into the LLM's input space (as in Whisper-style encoders coupled to an LLM).
- Neural audio codecs, which quantize the waveform into sequences of discrete tokens that can be treated as an extension of the LLM's vocabulary (as in XCodec2, SNAC, or DAC, discussed below).
Despite these innovations, each method comes with trade-offs. Audio encoders must balance the preservation of critical information with the need for compact, discrete representations. Neural codecs, meanwhile, face challenges related to token rate—since speech typically generates far more tokens per second than text—and the potential loss of fine-grained acoustic details during quantization.
In summary, while classical TTS models provide a strong foundation for effective voice cloning and speech synthesis, integrating LLM reasoning significantly expands the potential use cases by enabling contextual, conversational interactions. The cascaded STT–LLM–TTS pipeline is a practical approach to achieve this integration, yet it carries inherent challenges such as error propagation between modules and difficulties in capturing the full richness of the speech signal. Advances in audio encoders and neural codecs are crucial for overcoming these hurdles, paving the way for next-generation conversational speech systems that seamlessly combine natural language understanding with high-fidelity audio synthesis.
Qwen2-Audio employs a two-component architecture featuring an audio encoder and a large language model (LLM). The audio encoder is initialized using weights from the Whisper-large-v3 model, while the language model component is initialized from a Qwen LLM.
The training involves a three-stage process aimed at maximizing the probability of the next text token, conditioned on the audio representations and preceding text tokens:
1. Multi-task pretraining on large-scale audio–text data.
2. Supervised fine-tuning (SFT) on curated instruction-following and dialogue data.
3. Direct preference optimization (DPO) to align outputs with human preferences.
This architectural and training methodology highlights a relatively straightforward approach to building powerful audio-language models. It demonstrates the feasibility of effectively combining strong, pre-existing unimodal models – like a capable audio encoder (Whisper) and a robust LLM (Qwen) – and then adapting them through targeted fine-tuning stages (SFT, DPO). This process of "plugging in" a modality-specific encoder and then fine-tuning the combined system mirrors common practices in multimodal LLMs, particularly analogous to how vision capabilities are often integrated into large language models.
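As a rough schematic of this "plug in an encoder" recipe (not the exact Qwen2-Audio code), the sketch below projects Whisper encoder states into an LLM's embedding space and feeds them as a prefix to the text embeddings. The specific checkpoints and the single linear projector are simplifying assumptions.

```python
# Schematic of coupling a pretrained audio encoder to a decoder-only LLM:
# continuous audio embeddings are projected into the LLM's embedding space
# and prepended to the text embeddings.
import torch
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          WhisperFeatureExtractor, WhisperModel)

audio_encoder = WhisperModel.from_pretrained("openai/whisper-large-v3").encoder
feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-large-v3")
llm = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-7B")   # any decoder-only LLM
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B")

# Trainable adapter between the two pretrained models.
projector = torch.nn.Linear(audio_encoder.config.d_model, llm.config.hidden_size)

def forward(waveform, sampling_rate, prompt):
    feats = feature_extractor(waveform, sampling_rate=sampling_rate, return_tensors="pt")
    audio_states = audio_encoder(feats.input_features).last_hidden_state   # (1, T_audio, d_audio)
    audio_embeds = projector(audio_states)                                 # (1, T_audio, d_llm)

    text_ids = tokenizer(prompt, return_tensors="pt").input_ids
    text_embeds = llm.get_input_embeddings()(text_ids)                     # (1, T_text, d_llm)

    inputs_embeds = torch.cat([audio_embeds, text_embeds], dim=1)          # audio prefix + text
    return llm(inputs_embeds=inputs_embeds)                                # next-token logits
```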
The first LLM-powered TTS model we examine is LLaSA, developed by researchers at HKUST and Microsoft. LLaSA directly tackles the question: how far can we push large language model (LLM) scaling principles when applied to speech synthesis? While many text-to-speech (TTS) systems use hybrid pipelines or multiple models, LLaSA adopts a minimalist, LLM-style design (one transformer, one stage, one codebook) and scales it up massively across model size and data.
LLaSA’s design philosophy is simplicity and reuse. It starts from a pretrained Llama model and extends both the tokenizer and LM head to include 65,536 new audio tokens on top of the original text vocabulary. These audio tokens come from XCodec2, a flat vector quantizer operating at 50 Hz – i.e. one token corresponds to 20 ms of audio. This design yields 50 tokens per second of audio, all from one codebook. By using a very large codebook, the codec achieves high fidelity with a single token stream (reportedly ~99% codebook usage, meaning it effectively utilizes the full range of tokens for nuanced encoding).
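Below is a minimal sketch of what this vocabulary extension looks like in practice, assuming a Hugging Face Llama checkpoint; the `<|speech_i|>` token names are an illustrative convention, not LLaSA's actual ones.

```python
# Illustrative sketch: add 65,536 audio codes from a single-codebook codec
# as new tokens on top of a pretrained Llama vocabulary.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")

audio_tokens = [f"<|speech_{i}|>" for i in range(65536)]   # one entry per XCodec2 code
tokenizer.add_tokens(audio_tokens)

# Grow the input embeddings and the LM head to cover the enlarged vocabulary.
model.resize_token_embeddings(len(tokenizer))
```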
The model is then fully fine-tuned on sequences that combine both text and audio tokens, allowing the transformer to treat speech as a natural continuation of text. No architecture changes are made to the transformer itself—LLaSA simply learns to model speech the same way Llama models language: as a next-token prediction task over a unified vocabulary of text and audio tokens.
The advantage of this single-level approach is its simplicity for the LLM – the model just treats audio tokens like another language with a 65k vocabulary, not unlike how a wordpiece tokenizer might have 50k tokens for text. Training LLaSA thus becomes very similar to training a standard LLM: they convert all audio in the training set into long sequences of XCodec2 tokens and concatenate with the corresponding text transcriptions. The transformer learns to predict the next token, whether that next token is part of the text or the audio. The authors note this compatibility means they can directly apply techniques like data parallel scaling, model compression, or acceleration from the NLP world to this TTS model.
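The sketch below shows how one training example could be assembled under this setup. The codec handle, the `audio_token_offset` mapping, and the loss masking of the text span are assumptions for illustration, not the paper's exact recipe.

```python
# Sketch of building one training example: text tokens followed by the XCodec2
# codes of the spoken rendition, trained with the standard causal LM loss.
import torch

def build_example(text, waveform, tokenizer, codec, audio_token_offset):
    text_ids = tokenizer(text).input_ids
    codes = codec.encode(waveform)                       # e.g. 50 codes per second of audio
    audio_ids = [audio_token_offset + c for c in codes]  # map codec codes into the new vocab range
    input_ids = torch.tensor(text_ids + audio_ids)

    labels = input_ids.clone()
    labels[: len(text_ids)] = -100                       # optionally learn only on the audio span
    return input_ids, labels

# loss = model(input_ids=input_ids[None], labels=labels[None]).loss
```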
LLaSA's training corpus of 250k hours is one of the largest in TTS, spanning diverse speech in English and Chinese; multilingual versions of the 1B and 3B models are also available. The resulting models are correspondingly powerful. The largest 8B model in particular demonstrates remarkable naturalness and prosody. According to the paper, increasing model size consistently improved speech quality: bigger models produced more accurate and complex prosody patterns and sounded more natural. This is analogous to how, in text LLMs, going from 1B to 7B to 70B yields more fluent and context-aware language; here it yields more human-like intonation and rhythm. Even the 1B model, while less expressive, still functions for basic speech and is extremely lightweight to run.
It can also do zero-shot voice cloning by taking a speech prompt. If you feed a short recording of a speaker (converted to tokens via XCodec2) followed by the text, LLaSA will generate the continuation in that voice. This works because the model has effectively learned to continue in the style provided by preceding audio tokens. Additionally, because it’s bilingual, you can prompt it with a Chinese voice and have it speak English in that voice or vice-versa, enabling cross-lingual voice transfer – a very useful feature for voice assistants in multilingual settings.
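A hedged sketch of that prompting scheme: encode the reference clip into codec tokens, place them after the combined transcript and target text, and let the model continue the audio stream in the same voice. The exact prompt template is an assumption; the released LLaSA inference code defines its own format.

```python
# Zero-shot voice cloning by prompt continuation (format is an assumption).
def build_cloning_prompt(ref_waveform, ref_text, target_text,
                         tokenizer, codec, audio_token_offset):
    # Transcript of the reference clip followed by the text we want spoken.
    text_ids = tokenizer(ref_text + " " + target_text).input_ids
    # Reference speech as codec tokens; generation continues this audio stream.
    ref_audio_ids = [audio_token_offset + c for c in codec.encode(ref_waveform)]
    return text_ids + ref_audio_ids

# prompt = build_cloning_prompt(ref_wav, "Reference transcript.", "Hello there!",
#                               tokenizer, codec, offset)
# generated = model.generate(torch.tensor([prompt]), max_new_tokens=1500)
# waveform = codec.decode([t - offset for t in generated[0, len(prompt):].tolist()])
```

Because the model is bilingual, `target_text` can be in a different language than `ref_text`, which is what enables the cross-lingual voice transfer described above.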
One interesting contribution of the LLaSA work is in scaling inference-time compute. The authors experimented with using speech understanding models as verifiers during generation. In practice, this means that when sampling audio tokens, they involve a pretrained model (such as a speech recognition or speaker identification network) to guide the choice of tokens, re-ranking or biasing the outputs toward those that satisfy the verifier. For example, an ASR verifier ensures the content is pronounced clearly (improving word accuracy), a speaker-encoder verifier ensures the voice timbre stays consistent, and a classifier can ensure the emotion matches a target. They found that by spending more computation at inference in this way, they could significantly improve aspects like emotional expressiveness, speaker consistency, and content accuracy in the generated speech. This is akin to how some text LLMs use external tools or rerankers to improve outputs post hoc. While such techniques increase inference cost, they show a pathway to higher quality without retraining, which is useful for customizing style or ensuring correctness in critical applications.

In terms of output quality, LLaSA is state-of-the-art on traditional TTS metrics. The paper reports that the codec can reconstruct 16 kHz speech with a MOS (Mean Opinion Score) around 4.1 (out of 5) on test data, which is near ground-truth quality. The transformer does introduce some modeling imperfections, but the massive training helps mitigate that. For prosody, the 8B model in particular was noted to produce more lifelike intonation, handling even tricky sentences with complex emphasis better than smaller models. The bilingual nature means it learned to control tone appropriately for each language, for example using the correct cadence for a Chinese question versus an English question. It can also mix languages to an extent (e.g., speaking an English sentence with a Chinese accent or vice versa, if prompted), reflecting the multilingual data.
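Returning to the verifier idea, a simple way to approximate it without touching the sampling loop is best-of-N reranking: sample several complete outputs and keep the one an external model scores highest. The sketch below uses openai-whisper as an ASR verifier and jiwer for word error rate; it is a simplified stand-in for the paper's token-level guidance, and `tts_sample_fn` is a hypothetical hook into whatever TTS sampler you use.

```python
# Best-of-N reranking with an ASR "verifier": keep the sample whose transcript
# best matches the target text.
import jiwer
import whisper

asr = whisper.load_model("base")

def best_of_n(tts_sample_fn, target_text, n=8):
    """tts_sample_fn() is assumed to return the path of one sampled waveform."""
    candidates = [tts_sample_fn() for _ in range(n)]
    scored = [(jiwer.wer(target_text.lower(), asr.transcribe(path)["text"].lower()), path)
              for path in candidates]
    return min(scored)[1]   # lowest word error rate wins
```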
While LLaSA pushed the boundaries with its massive scale and single-codebook simplicity, other models emerged concurrently, exploring similar LLM-driven TTS principles but often opting for different neural audio codec strategies. Notable examples include Canopy Labs' Orpheus 3B and OuteTTS.
A key difference lies in their choice of codec. Instead of XCodec2's single large codebook, both Orpheus and OuteTTS use codecs based on Residual Vector Quantization (RVQ). RVQ codecs like SNAC (Multi-Scale Neural Audio Codec) and DAC (Descript Audio Codec) describe each audio frame with several smaller codebooks applied in sequence, where each stage quantizes the residual error left behind by the previous stages. The result is a handful of parallel token streams per frame rather than a single token drawn from one huge codebook.
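For intuition, here is a toy NumPy implementation of the RVQ idea itself, unrelated to the actual SNAC or DAC code: each stage picks the nearest codeword to whatever residual the previous stages left behind.

```python
# Toy residual vector quantization: one code per codebook per frame.
import numpy as np

def rvq_encode(frame, codebooks):
    """frame: (d,) embedding; codebooks: list of (K, d) arrays. Returns one code per stage."""
    residual = frame.copy()
    codes = []
    for cb in codebooks:
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))  # nearest codeword
        codes.append(idx)
        residual = residual - cb[idx]                                # quantize the leftover next
    return codes

def rvq_decode(codes, codebooks):
    return sum(cb[idx] for idx, cb in zip(codes, codebooks))         # sum of selected codewords
```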
Built upon a Llama-3B backbone, Orpheus pairs the LLM with SNAC, an advanced RVQ codec that captures audio information across different temporal resolutions, aiming for efficient compression and detailed reconstruction. To manage the multiple token streams from SNAC within a standard LLM framework, Orpheus generates a flattened sequence of tokens, emitting the 7 tokens of each audio frame sequentially. It achieves low-latency streaming suitable for real-time applications (roughly 200 ms, dropping to around 25-50 ms when input text is streamed into the KV cache) by using an optimized decoding process with a sliding-window technique on the SNAC decoder, ensuring smooth audio output without pops. Orpheus particularly emphasizes expressive, emotive speech and supports zero-shot voice cloning from short audio prompts.
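To see why this flattening yields 7 tokens per frame, recall that SNAC's 24 kHz configuration uses three codebooks at a 1:2:4 temporal ratio, so one coarse frame carries 1 + 2 + 4 codes. The interleaving order below is one plausible choice for illustration, not necessarily the exact order Orpheus emits.

```python
# Flatten SNAC's hierarchical codes into 7 tokens per coarse frame so a
# decoder-only LLM can emit them autoregressively, and invert the mapping
# before handing codes back to the SNAC decoder.
def flatten_frame(c0, c1_pair, c2_quad):
    """c0: coarse code, c1_pair: 2 mid codes, c2_quad: 4 fine codes for one coarse frame."""
    return [c0, c1_pair[0], c2_quad[0], c2_quad[1], c1_pair[1], c2_quad[2], c2_quad[3]]

def unflatten_frame(tokens):
    assert len(tokens) == 7
    c0 = tokens[0]
    c1_pair = [tokens[1], tokens[4]]
    c2_quad = [tokens[2], tokens[3], tokens[5], tokens[6]]
    return c0, c1_pair, c2_quad
```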
This model integrates a smaller Llama-1B transformer with DAC. More specifically, they used a fine-tuned version of the DAC codec.
Moving towards truly interactive conversational speech, models need to inherently understand and adapt to the flow of dialogue. Sesame AI's Conversational Speech Model (CSM) represents a significant step in this direction, explicitly designed to leverage context for more natural and coherent speech synthesis, as detailed in their research "Crossing the Uncanny Valley of Voice".
For academic attribution, please cite this work as:
"SpeechLMs: LLM-Powered Text-to-Speech and Neural Audio Codecs Explored", 2025.
BibTeX citation
@misc{speechlms_explained,
  title={SpeechLMs: LLM-Powered Text-to-Speech and Neural Audio Codecs Explored},
  author={Your Name},
  year={2025}
}