```mermaid
flowchart LR
  MaskGIT["MaskGIT (2022)<br/>image-domain, confidence unmask"]
  D3PM["D3PM (2021)<br/>foundational math of discrete diffusion"]
  SEDD["SEDD (2024)<br/>concrete score / ratio matching"]
  MDLM["MDLM (2024)<br/>reduces to BERT training"]
  LLaDA["LLaDA (2025)<br/>8B scale, practical sampler"]
  Dream["Dream (2025)<br/>competing model"]
  D3PM --> MDLM
  D3PM --> SEDD
  MDLM --> LLaDA
  MDLM --> Dream
  MaskGIT -.-> LLaDA
  MaskGIT -.-> Dream
```
Overview of Diffusion Language Models
Diffusion Language Models (DLLMs) bring the ideas behind the diffusion models that succeeded in image generation into language modeling. Whereas autoregressive (AR) LLMs generate text left-to-right, one token at a time, a DLLM treats the entire sequence in parallel and produces text by iteratively refining a sequence of [MASK] tokens. This book organizes the key references needed to understand modern DLLMs, covering their formulation, sampling strategies, and the correspondence with continuous diffusion models.
What is a DLLM?
Difference from AR LLMs
AR LLMs factorize a token sequence as \(p_\theta(x) = \prod_i p_\theta(x_i \mid x_{<i})\) and generate one token at a time from left to right. In contrast, a DLLM learns the joint distribution \(p_\theta(x)\) over the entire sequence as a gradual denoising process. Concretely,
- Forward process: Gradually adds noise to a clean sequence \(x_0\) until every position becomes [MASK]
- Reverse process: Starts from a fully masked sequence and predicts the [MASK] tokens to fill them in step by step
Unlike AR, the generation order need not be sequential — it can be parallel and can leverage context from both sides.
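For concreteness, under the linear schedule used later in this chapter (each token masked independently with probability \(t\)), the forward process can be written as a per-position kernel:
\[q(x_t^i \mid x_0^i) = (1 - t)\,\delta_{x_0^i}(x_t^i) + t\,\delta_{\texttt{[MASK]}}(x_t^i), \qquad t \in [0, 1]\]
At \(t = 0\) the sequence is clean; at \(t = 1\) every position is [MASK].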
Why DLLMs are attracting attention
- Parallel generation: Multiple tokens can be decided simultaneously per step, reducing the number of inference steps
- Bidirectional context: Each token can be predicted using context from both the left and the right
- Natural formulation of editing and infilling: Filling in [MASK] at arbitrary positions maps naturally to infilling and editing tasks
- Room for inference-time intervention: Since the model is queried at every step, guidance and control mechanisms can be layered on more flexibly than in AR
Lineage of formulations
The main formulations of DLLMs have developed along the masked / absorbing discrete diffusion track.
D3PM (Austin+ 2021) provided the foundational mathematics of discrete diffusion, and MDLM (Sahoo+ 2024) condensed it into an extremely concise objective — weighted BERT training. LLaDA (Nie+ 2025) scaled this formulation to 8B parameters and presented practical sampling strategies.
→ Details: MDLM: Masked Diffusion Language Models
→ Details: LLaDA: Large-Scale Masked DLM and Sampling
The core idea of MDLM
The crux of MDLM is that, when one considers a continuous-time \(t \in [0,1]\) forward process that independently replaces each token with [MASK] with probability \(t\), the ELBO simplifies to a masked cross-entropy weighted by \(1/t\).
\[\mathcal{L} = -\,\mathbb{E}_{t,\, x_t} \left[ \frac{1}{t} \sum_i \mathbf{1}[x_t^i = \texttt{[MASK]}] \log p_\theta(x_0^i \mid x_t) \right]\]
This is a continuous-time generalization of BERT’s random masked prediction, and expresses “what training a DLLM amounts to” in a single line.
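To make the objective concrete, here is a minimal PyTorch-style sketch of one training step under this linear schedule. The names are assumptions for illustration: `model` stands for any bidirectional Transformer that returns per-position logits over the vocabulary, and `mask_id` is the index of the [MASK] token. This is a sketch, not the reference MDLM implementation.

```python
import torch
import torch.nn.functional as F

def mdlm_loss(model, x0, mask_id):
    """Weighted masked cross-entropy (continuous-time MDLM objective, linear schedule)."""
    B, L = x0.shape
    # Sample one noise level t ~ U(0, 1] per sequence (clamped away from 0 to avoid the 1/t blow-up).
    t = torch.rand(B, 1, device=x0.device).clamp(min=1e-3)
    # Forward process: independently replace each token with [MASK] with probability t.
    is_masked = torch.rand(B, L, device=x0.device) < t
    xt = torch.where(is_masked, torch.full_like(x0, mask_id), x0)
    # Predict the clean token at every position from the corrupted sequence.
    logits = model(xt)  # (B, L, vocab_size)
    ce = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), x0.reshape(-1), reduction="none"
    ).reshape(B, L)
    # Keep only the masked positions and weight by 1/t, as in the loss above.
    return ((ce * is_masked) / t).sum(dim=1).mean()
```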
→ Details: MDLM: Masked Diffusion Language Models
Sampling strategies
Modern DLLM samplers represented by LLaDA roughly follow the loop below.
- Initialize all positions to [MASK]
- At each step, produce predictions for every position
- Unmask the top-\(k\) positions by confidence; keep the remaining ones masked or re-mask them
- Repeat until all positions are filled
The prototype of this confidence-based unmasking traces back to MaskGIT (Chang+ 2022) in image generation. Practical refinements such as low-confidence remasking and semi-autoregressive sampling (generating block by block) have since been proposed.
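The loop above can be sketched as follows, under the same assumptions as the training snippet (a hypothetical `model` returning per-position logits and a `mask_id`). It implements plain confidence-based unmasking only, without the remasking or block-wise variants just mentioned.

```python
import torch

@torch.no_grad()
def sample(model, length, mask_id, steps=16, device="cpu"):
    """Confidence-based iterative unmasking (MaskGIT/LLaDA-style, simplified)."""
    x = torch.full((1, length), mask_id, dtype=torch.long, device=device)
    for step in range(steps):
        num_masked = int((x == mask_id).sum())
        if num_masked == 0:
            break
        probs = model(x).softmax(dim=-1)   # (1, L, vocab_size)
        conf, pred = probs.max(dim=-1)     # per-position confidence and argmax token
        # Only still-masked positions compete for unmasking.
        conf = conf.masked_fill(x != mask_id, float("-inf"))
        # Reveal the k most confident positions; the rest stay masked for later steps.
        k = max(1, num_masked // (steps - step))
        idx = conf.topk(k, dim=-1).indices
        x.scatter_(1, idx, pred.gather(1, idx))
    return x
```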
→ Details: LLaDA: Large-Scale Masked DLM and Sampling
→ Details: MaskGIT: Origin of Confidence-based Iterative Unmasking
Correspondence with continuous diffusion models
Continuous diffusion models developed in the image domain (DDPM, SBM, VP-SDE, etc.) and MDLM correspond structurally, but they operate on different mathematical objects.
- Where they correspond: the forward-adds-noise / reverse-removes-noise structure, the derivation of the loss from the ELBO, SNR weighting, and the framework of guidance
- Where they don’t: the score function \(\nabla_x \log p(x)\), SDE / probability flow ODE, and the VE/VP distinction
Holding the continuous-diffusion intuition as a template makes MDLM’s formulas easy to read at a glance — just remember that on the discrete side, the same objective is achieved with \(x_0\)-prediction cross-entropy rather than scores.
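As a one-line reminder of that correspondence (using standard DDPM-style \(\epsilon\)-prediction notation on the continuous side, which this chapter does not otherwise define):
\[\mathcal{L}_{\text{cont}} = \mathbb{E}_{t,\, x_0,\, \epsilon}\!\left[\left\lVert \epsilon - \epsilon_\theta(x_t, t) \right\rVert^2\right] \quad\text{vs.}\quad \mathcal{L}_{\text{disc}} = -\,\mathbb{E}_{t,\, x_t}\!\left[ \frac{1}{t} \sum_i \mathbf{1}[x_t^i = \texttt{[MASK]}] \log p_\theta(x_0^i \mid x_t) \right]\]
Both are denoising objectives: the continuous one regresses the noise (equivalently, the score), while the discrete one classifies the clean token at each masked position.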
→ Details: Continuous vs Discrete Diffusion: Bridging the Two
Maturity of the field
DLLMs are far less established than AR LLMs in many areas.
- Training recipes: mask schedules and noise design are still under research
- Sampling: confidence-based unmasking, remasking, semi-AR, etc. are still evolving
- Inference-time intervention: porting AR-side techniques such as guidance, constrained decoding, and editing-style interventions to DLLMs is yet to take off in earnest
- Evaluation: DLLM-specific evaluation axes (e.g., quality per denoising step) are largely unexplored
- Theory: expressiveness, convergence, and the correspondence with AR are at an early stage
Merely translating techniques established for AR LLMs into the DLLM setting already constitutes a wide-open research space.
→ Details: Open Problems in the DLLM Field
How to read this book
We recommend reading the chapters in numerical order. Each chapter is also self-contained, but the structure is designed to build from abstraction to concreteness and then to a broader view: formulation (01 MDLM) → state of the art in implementation (02 LLaDA) → correspondence with continuous diffusion (03) → another lineage of discrete diffusion (04) → origin of the sampler (05) → state of the field (06).
In particular, the first three chapters (01 → 02 → 03) form the core. Once you have the training formulation in 01, the implementation in 02, and the mathematical correspondence with continuous diffusion in 03, the modern DLLM landscape comes into three-dimensional view. Chapters 04–06 are more specialized and can be read selectively depending on your interests.
If you already know continuous diffusion (DDPM, SBM, VP-SDE, etc.), it helps to skim Continuous vs Discrete Diffusion: Bridging the Two in parallel while reading 01 MDLM — you will immediately see “ah, this is the discrete version of \(x_0\)-prediction.” Just remember that on the discrete side, the same goal is achieved with cross-entropy rather than scores.
This book is aimed at understanding the key references, and does not cover:
- Code-level walkthroughs of individual implementations (refer to the official repositories and the paper appendices)
- Step-by-step guides for building DLLM-based applications
- Exhaustive coverage of the latest preprints (we limit ourselves to the main papers available at the time of writing)
The book aims to provide a map of the field, and does not attempt exhaustive benchmark reproduction or rigorous theorem proofs for every paper.
Three to start with
Pointers to specific papers and articles appear in the individual chapters, but here are the three to start with.
- MDLM: Sahoo et al. “Simple and Effective Masked Diffusion Language Models” arXiv:2406.07524
- LLaDA: Nie et al. “Large Language Diffusion Models” arXiv:2502.09992
- Sander Dieleman’s blog (sander.ai): “Diffusion language models” and related overview posts that are particularly useful for bridging continuous and discrete diffusion