Bark and Fully Generative Audio
Bark is a transformer-based model that generates speech, music, and nonverbal audio from text by autoregressively predicting discrete audio codec tokens, without any phoneme pipeline or continuous acoustic model.
Traditional text-to-speech systems decompose neatly into stages: a linguistic front-end that converts text to phonemes, an acoustic model that predicts spectral features, and a vocoder that converts those features to waveform. Each stage has a well-understood failure mode. Bark discards this pipeline entirely. Given the string "Hello, [laughter] that was unexpected!", it produces a waveform with no phoneme table, no mel-spectrogram intermediate, and no spectrogram-driven vocoder: every intermediate representation is a sequence of discrete tokens predicted by a language model. The question worth asking is: how does collapsing the classical modules into one token-based generative stack actually work, and what does "fully generative audio" mean precisely?
From Spectrograms to Discrete Tokens
The enabling technology is the neural audio codec. Whereas classical vocoders decoded from mel-spectrograms (continuous floating-point tensors with fixed frequency resolution), neural codecs compress arbitrary audio into sequences of small integers drawn from a learned codebook.
EnCodec (Défossez et al., 2022) exemplifies this approach. A convolutional encoder maps a raw waveform to a compressed continuous embedding, then a residual vector quantiser (RVQ) discretises it:
waveform → encoder → z ∈ R^(T × D)
↓ RVQ with K codebooks
(c_1, c_2, …, c_K) each c_i ∈ {0, …, N-1}
With K=8 codebooks of size N=1024 operating at 75 Hz frame rate, a 13-second clip becomes roughly 975 frames times 8 integers - about 7,800 tokens total. The first codebook captures coarse acoustic structure; subsequent codebooks refine fine-grained detail. This ordering matters enormously for what comes next.
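The arithmetic above can be checked with a short sketch. The function name `codec_token_count` is made up for illustration; the frame rate, codebook count, and codebook size are the figures quoted in the text.

```python
import math
from typing import Tuple


def codec_token_count(duration_s: float,
                      frame_rate_hz: int = 75,
                      num_codebooks: int = 8) -> Tuple[int, int]:
    """Return (frames, total_tokens) for a clip of the given duration."""
    frames = round(duration_s * frame_rate_hz)
    return frames, frames * num_codebooks


frames, tokens = codec_token_count(13)
print(frames, tokens)  # 975 7800

# Each token indexes a 1024-entry codebook, so it carries log2(1024) = 10 bits.
bits_per_token = int(math.log2(1024))
bits_per_second = 75 * 8 * bits_per_token
print(bits_per_second)  # 6000
```

The second figure shows why the token stream is so compact: 8 codebooks at 75 Hz amount to about 6 kbit/s, far below the bitrate of raw 24 kHz PCM.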
The key insight: a decoder trained on (c_1, …, c_K) reconstructs perceptually high-quality audio. Audio generation therefore reduces to the question of how to generate these integer sequences.
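To make the residual quantisation step concrete, here is a minimal sketch of RVQ on a single two-dimensional frame. It is a toy under stated assumptions: the codebooks are hand-picked rather than learned, there are only two of them with four entries each, and the helper names (`nearest`, `rvq_encode`, `rvq_decode`) are illustrative, not EnCodec's API.

```python
from typing import List, Sequence


def nearest(codebook: Sequence[Sequence[float]], v: Sequence[float]) -> int:
    """Index of the codebook entry closest to v (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(codebook[i], v)))


def rvq_encode(v: Sequence[float], codebooks) -> List[int]:
    """Quantise v stage by stage; each stage encodes what the previous one missed."""
    residual = list(v)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes


def rvq_decode(codes: Sequence[int], codebooks) -> List[float]:
    """Sum the selected entries; fewer codes give a coarser reconstruction."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Hand-picked toy codebooks: a coarse grid, then small corrections.
codebooks = [
    [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[-0.25, -0.25], [0.25, -0.25], [-0.25, 0.25], [0.25, 0.25]],
]
frame = [0.8, 0.3]
codes = rvq_encode(frame, codebooks)
print(codes)                             # [1, 2]
print(rvq_decode(codes[:1], codebooks))  # [1.0, 0.0]   coarse level only
print(rvq_decode(codes, codebooks))      # [0.75, 0.25] both levels
```

Decoding from the first code alone already lands near the input, and the second code only nudges it closer. That is the sense in which early codebooks carry coarse structure and later ones carry detail.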
AudioLM: The Three-Stage Hierarchy
Before Bark, Google's AudioLM (Borsos et al., 2022) established the architectural template. It separates generation into three sequential language models, each operating on a different token type:
| Stage | Token type | Model task |
|---|---|---|
| 1 | Semantic (discretised w2v-BERT activations) | Linguistic content, long-range coherence, prosody |
| 2 | Coarse acoustic (first RVQ levels) | Speaker identity and spectral envelope, conditioned on semantics |
| 3 | Fine acoustic (remaining RVQ levels) | Perceptual detail conditioned on coarse |
Stage 1 runs a transformer over semantic tokens extracted from a self-supervised audio model. These tokens carry linguistic and prosodic meaning but not fine acoustic detail - they represent "what is being said and how", not "exactly how it sounds". Stages 2 and 3 then ground this semantic skeleton in actual audio by predicting codec tokens.
The separation solves a genuine problem. If you tried to generate all K codebook sequences jointly in one left-to-right transformer, the model would need to attend across very long sequences with mixed semantics. By staging coarse-to-fine, each model has a narrower job.
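A rough sketch of the sequence-length argument, using dummy codes and the 13-second, 75 Hz, K=8 figures quoted earlier. The choice of two coarse levels is an assumption borrowed from Bark's split, and the quadratic cost estimate assumes full self-attention.

```python
from typing import List


def flatten_codebooks(frames: List[List[int]]) -> List[int]:
    """Interleave per-frame codes [c_1, ..., c_K] into one left-to-right sequence."""
    return [code for frame in frames for code in frame]


num_frames, num_codebooks = 975, 8
# Dummy codes standing in for real codec output.
frames = [[(t + k) % 1024 for k in range(num_codebooks)] for t in range(num_frames)]

flat_all = flatten_codebooks(frames)                       # one model, all levels
flat_coarse = flatten_codebooks([f[:2] for f in frames])   # coarse stage only

print(len(flat_all), len(flat_coarse))           # 7800 1950
# Full self-attention cost grows with the square of sequence length.
print((len(flat_all) / len(flat_coarse)) ** 2)   # 16.0
```

Under these assumptions the autoregressive stage handles a sequence four times shorter, which cuts attention cost by roughly a factor of sixteen before counting any benefit from the narrower modelling task.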
VALL-E (Wang et al., 2023) pursued a related idea specifically for zero-shot voice cloning: given a 3-second speaker prompt plus a transcript, predict the codec token sequence for a new utterance in the same voice. Trained on 60,000 hours of English speech (Libri-Light), VALL-E demonstrated that a codec language model trained at scale could transfer voice characteristics at inference time, with no speaker-specific fine-tuning.
Bark's Architecture
Bark (Suno, 2023) adopts the AudioLM hierarchy but makes several practical changes that shift it from "audio continuation" to "text-to-audio generation":
Text conditioning replaces audio priming. Instead of continuing an existing audio clip, Bark takes a text prompt and an optional voice preset (a short stored prompt of audio tokens that anchors speaker identity). The semantic model is conditioned on the tokenised text; Bark's reference implementation tokenises with a multilingual BERT (WordPiece) tokeniser.
Three GPT-style transformers in sequence:
- Text-to-semantic model - maps text tokens to semantic audio tokens (approximately w2v-BERT-like codebook indices).
- Semantic-to-coarse model - autoregressively predicts RVQ levels 1 and 2 from the semantic sequence.
- Coarse-to-fine model - predicts RVQ levels 3 through 8 from levels 1-2, but non-autoregressively using a parallel decoding strategy for speed.
The final codec decoder (EnCodec's decoder) then reconstructs the waveform from all 8 codebook levels.
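The difference between the autoregressive coarse stage and the parallel fine stage shows up in how many model calls each needs. The sketch below uses a hypothetical `CountingStub` that emits random codes in place of a real transformer. It simplifies heavily (no conditioning, no windowing) and is not Bark's implementation.

```python
import random
from typing import List


class CountingStub:
    """Hypothetical stand-in for a transformer: emits random codes, counts calls."""

    def __init__(self, vocab_size: int = 1024, seed: int = 0) -> None:
        self.calls = 0
        self.vocab_size = vocab_size
        self.rng = random.Random(seed)

    def next_token(self, prefix: List[int]) -> int:
        """Autoregressive step: one call yields one token given the prefix."""
        self.calls += 1
        return self.rng.randrange(self.vocab_size)

    def fill_level(self, levels: List[List[int]]) -> List[int]:
        """Parallel step: one call yields a whole codebook level for all frames."""
        self.calls += 1
        return [self.rng.randrange(self.vocab_size) for _ in levels[0]]


def decode_coarse_autoregressive(model: CountingStub, num_frames: int,
                                 coarse_levels: int = 2) -> List[List[int]]:
    flat: List[int] = []
    for _ in range(num_frames * coarse_levels):
        flat.append(model.next_token(flat))
    # De-interleave [c1_t0, c2_t0, c1_t1, ...] into one list per level.
    return [flat[k::coarse_levels] for k in range(coarse_levels)]


def decode_fine_parallel(model: CountingStub, coarse: List[List[int]],
                         total_levels: int = 8) -> List[List[int]]:
    levels = [list(level) for level in coarse]
    while len(levels) < total_levels:
        levels.append(model.fill_level(levels))
    return levels


coarse_model, fine_model = CountingStub(seed=1), CountingStub(seed=2)
coarse = decode_coarse_autoregressive(coarse_model, num_frames=75)  # 1 s at 75 Hz
codes = decode_fine_parallel(fine_model, coarse)
print(coarse_model.calls, fine_model.calls)  # 150 6
```

For one second of audio the coarse stage needs 150 sequential calls, while the fine stage needs six, one per remaining codebook level. That asymmetry is why making the last stage non-autoregressive pays off.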
Text cues for non-speech audio. Bark responds to bracketed cues in the prompt such as [laughter], [sighs], [music], and [applause], to ♪ around lyrics, and to capitalisation, which the model has learned to associate with emphasis. These are neither rule-based insertions nor dedicated vocabulary entries: they are tokenised like any other text, and the model has encountered enough examples during training that it tends to generate the appropriate acoustic events when they appear.
A rough pseudo-trace for a short generation:
text = "Hello world [laughter]"
# Stage 1: text -> semantic tokens (~50 tokens/sec)
semantic = text_to_semantic_model(tokenize(text), voice_preset)  # autoregressive
# Stage 2: semantic -> coarse codec tokens (RVQ 1-2, 75 frames/sec)
coarse = semantic_to_coarse_model(semantic, voice_preset)  # autoregressive
# Stage 3: coarse -> fine codec tokens (RVQ 3-8)
fine = coarse_to_fine_model(coarse, voice_preset)  # parallel across frames
# Decode all 8 levels to a waveform
waveform = encodec_decoder(coarse, fine)  # 24 kHz output
Inference on a single A100 runs at roughly 1-2x real time for the 13-second generation ceiling.
Why "Fully Generative" Matters
Classical TTS systems have a fundamental rigidity: they model audio as a deterministic function of text plus speaker embedding. Prosody variation is either absent or injected via a learned but constrained pitch/energy predictor.
A fully generative model treats audio generation as sampling from a distribution over all plausible renderings. Two calls with identical inputs produce different outputs. This has three practical consequences:
- Expressiveness. The model can produce laughs, whispers, hesitations, and singing that would require hand-crafted modules in a pipeline system.
- Variance. A given prompt does not reliably produce the same voice, pacing, or emotional register across runs. For production systems expecting determinism, this is a serious problem.
- Failure diversity. Mistakes are not systematic (like the attention failures in Tacotron 2 that cause repeated words). They are stochastic: a generation might simply be wrong in an unpredictable way.
The generative framing also means there is no explicit alignment between input text and output audio. Bark does not guarantee every word is spoken; it generates audio that is plausible given the text, which is not the same thing.
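The stochastic behaviour described in this section comes down to sampling each token from a predicted distribution. Below is a minimal sketch of temperature sampling over fixed toy logits. The logits, the `generate` helper, and the seeds are all illustrative assumptions; a real model would recompute logits at every step.

```python
import math
import random
from typing import List, Sequence


def sample_token(logits: Sequence[float], temperature: float,
                 rng: random.Random) -> int:
    """Draw one token index from softmax(logits / temperature)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def generate(num_steps: int, temperature: float, seed: int,
             logits: Sequence[float] = (2.0, 1.0, 0.5, 0.0)) -> List[int]:
    """Toy generator: samples num_steps tokens from the same fixed logits."""
    rng = random.Random(seed)
    return [sample_token(logits, temperature, rng) for _ in range(num_steps)]


run_a = generate(50, temperature=1.0, seed=0)
run_b = generate(50, temperature=1.0, seed=1)
replay = generate(50, temperature=1.0, seed=0)
near_greedy = generate(50, temperature=0.01, seed=1)

print(run_a == replay)   # True: same seed, same samples
print(run_a == run_b)    # different seeds almost surely diverge
print(set(near_greedy))  # {0}: very low temperature collapses to the argmax
```

Fixing the seed restores reproducibility, and lowering the temperature trades expressiveness for predictability. Neither adds the text-audio alignment that the model lacks.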
When It Falls Down
Hard length ceiling. Bark generates at most around 13 seconds per pass. Longer speech requires chunking the text and stitching outputs, which introduces audible discontinuities in prosody and background noise levels. There is no built-in continuity mechanism.
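A common workaround is to split the text at sentence boundaries and crossfade the resulting clips. The sketch below illustrates that idea and is not part of Bark. The 30-word budget is an assumption (about 13 seconds at a typical speaking rate), and `chunk_text` and `crossfade` are hypothetical helpers operating on plain Python lists of samples.

```python
import re
from typing import List


def chunk_text(text: str, max_words: int = 30) -> List[str]:
    """Greedily pack whole sentences into chunks of at most max_words words.

    A single sentence longer than max_words is kept as its own oversized chunk.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks: List[str] = []
    current: List[str] = []
    for sentence in sentences:
        used = sum(len(s.split()) for s in current)
        if current and used + len(sentence.split()) > max_words:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


def crossfade(a: List[float], b: List[float], overlap: int) -> List[float]:
    """Join two clips, linearly fading a out and b in over `overlap` samples."""
    overlap = min(overlap, len(a), len(b))
    if overlap <= 0:
        return a + b
    head, tail = a[:-overlap], a[-overlap:]
    mixed = []
    for i, (x, y) in enumerate(zip(tail, b[:overlap])):
        w = (i + 1) / (overlap + 1)  # weight of the incoming clip
        mixed.append(x * (1 - w) + y * w)
    return head + mixed + b[overlap:]


print(chunk_text("One two three. Four five six. Seven eight nine ten.", max_words=6))
# ['One two three. Four five six.', 'Seven eight nine ten.']
joined = crossfade([1.0] * 4, [0.0] * 4, overlap=2)
print(len(joined))  # 6
```

Crossfading only hides clicks at the seams. It does nothing about the prosody and noise-floor mismatches between independently sampled chunks.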
Speaker instability. Voice presets anchor tone and accent loosely. Across chunks, the generated voice drifts. Even within a single 13-second clip the model occasionally shifts to a different speaker register mid-utterance. This is not a bug to patch; it reflects the model having no explicit speaker identity representation beyond the short prompt context.
Phoneme-level unreliability. Because there is no phoneme alignment step, the model can drop syllables, merge words, or mispronounce low-frequency proper nouns and technical terms. Classical pipeline TTS with a grapheme-to-phoneme module and an explicit alignment or duration model is far less prone to these errors, and when it does fail, it fails reproducibly.
VRAM requirements. Running all three transformers at full precision requires roughly 12 GB of VRAM. The smaller "small" model variants reduce this to around 2-4 GB but with measurably lower voice naturalness and higher word error rate.
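As a configuration sketch: the Bark repository documents two environment variables for reducing memory use, and the snippet below sets them. Treat the exact memory savings as dependent on your hardware and Bark version. The import is left commented out because it requires the `bark` package to be installed.

```python
import os

# Bark reads these when its modules are first imported, so set them before
# importing the package.
os.environ["SUNO_USE_SMALL_MODELS"] = "True"  # load the smaller checkpoints
os.environ["SUNO_OFFLOAD_CPU"] = "True"       # keep idle sub-models on the CPU

# from bark import generate_audio, preload_models  # requires the bark package
# preload_models()
```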
Not a production voice cloning tool. The voice preset mechanism provides accent and general timbre matching, not identity-preserving cloning. For high-fidelity voice cloning where speaker similarity must be verified, models like YourTTS, which conditions on explicit speaker embeddings, or VALL-E, which conditions on a short acoustic prompt of the target speaker, are more appropriate.
Evaluation is hard. Mean opinion scores (MOS) do not capture the variance across samples. A model that produces one excellent and one completely broken sample can receive the same mean score as a model that produces two mediocre samples. Reporting MOS without reporting variance or word error rate gives a misleading picture of usability.
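A small worked example of the point about averages. The scores and transcripts are made up for illustration, and `word_error_rate` is a plain word-level edit distance that assumes a transcript of the generated audio is already available (for instance from an ASR system).

```python
from statistics import mean, pstdev


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)


# Hypothetical listener scores: one excellent and one broken sample
# versus two mediocre samples.
model_a = [5.0, 1.0]
model_b = [3.0, 3.0]
print(mean(model_a), mean(model_b))      # 3.0 3.0
print(pstdev(model_a), pstdev(model_b))  # 2.0 0.0

# A dropped word is invisible to the mean score but shows up in WER.
print(word_error_rate("the quick brown fox", "the brown fox"))  # 0.25
```

Reporting the spread alongside the mean separates the two models immediately, and WER catches dropped words that listeners rating naturalness may not penalise.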