```mermaid
flowchart LR
  MaskGIT["MaskGIT (2022)<br/>image-domain, confidence unmask"]
  D3PM["D3PM (2021)<br/>foundational math of discrete diffusion"]
  SEDD["SEDD (2024)<br/>concrete score / ratio matching"]
  MDLM["MDLM (2024)<br/>reduces to BERT training"]
  LLaDA["LLaDA (2025)<br/>8B scale, practical sampler"]
  Dream["Dream (2025)<br/>competing model"]
  D3PM --> MDLM
  D3PM --> SEDD
  MDLM --> LLaDA
  MDLM --> Dream
  MaskGIT -.-> LLaDA
  MaskGIT -.-> Dream
```
Overview of Diffusion Language Models
Diffusion Language Models (DLLMs) bring the ideas behind the diffusion models that succeeded in image generation into language modeling. Whereas autoregressive (AR) LLMs generate text left-to-right, one token at a time, a DLLM treats the entire sequence in parallel and produces text by iteratively refining a sequence of [MASK] tokens. This book organizes the key references needed to understand modern DLLMs, covering their formulation, sampling strategies, and the correspondence with continuous diffusion models.
What is a DLLM?
Difference from AR LLMs
AR LLMs factorize a token sequence as \(p_\theta(x) = \prod_i p_\theta(x_i \mid x_{<i})\) and generate one token at a time from left to right. In contrast, a DLLM learns the joint distribution \(p_\theta(x)\) over the entire sequence as a gradual denoising process. Concretely,
- Forward process: Gradually adds noise to a clean sequence \(x_0\) until every position becomes [MASK]
- Reverse process: Starts from a fully masked sequence and predicts the [MASK] tokens to fill them in step by step
Unlike AR, the generation order need not be sequential — it can be parallel and can leverage context from both sides.
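For concreteness, under the linear schedule used later in this chapter (each token masked independently with probability \(t\)), the forward process can be written as a per-position kernel:
\[q(x_t^i \mid x_0^i) = (1 - t)\,\delta_{x_0^i}(x_t^i) + t\,\delta_{\texttt{[MASK]}}(x_t^i), \qquad t \in [0, 1]\]
At \(t = 0\) the sequence is clean; at \(t = 1\) every position is [MASK].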
Why DLLMs are attracting attention
- Parallel generation: Multiple tokens can be decided simultaneously per step, reducing the number of inference steps
- Bidirectional context: Each token can be predicted using context from both the left and the right
- Natural formulation of editing and infilling: Filling in [MASK] at arbitrary positions maps naturally to infilling and editing tasks
- Room for inference-time intervention: Since the model is queried at every step, guidance and control mechanisms can be layered on more flexibly than in AR
Lineage of formulations
The main formulations of DLLMs have developed along the masked / absorbing discrete diffusion track.
D3PM (Austin+ 2021) provided the foundational mathematics of discrete diffusion, and MDLM (Sahoo+ 2024) condensed it into an extremely concise objective — weighted BERT training. LLaDA (Nie+ 2025) scaled this formulation to 8B parameters and presented practical sampling strategies.
→ Details: MDLM: Masked Diffusion Language Models
→ Details: LLaDA: Large-Scale Masked DLM and Sampling
The core idea of MDLM
The crux of MDLM is that, when one considers a continuous-time \(t \in [0,1]\) forward process that independently replaces each token with [MASK] with probability \(t\), the ELBO simplifies to a masked cross-entropy weighted by \(1/t\).
\[\mathcal{L} = -\,\mathbb{E}_{t,\, x_t} \left[ \frac{1}{t} \sum_i \mathbf{1}[x_t^i = \texttt{[MASK]}] \log p_\theta(x_0^i \mid x_t) \right]\]
This is a continuous-time generalization of BERT’s random masked prediction, and expresses “what training a DLLM amounts to” in a single line.
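To make the objective concrete, here is a minimal PyTorch-style sketch of one training step under this linear schedule. The names are assumptions for illustration: `model` stands for any bidirectional Transformer that returns per-position logits over the vocabulary, and `mask_id` is the index of the [MASK] token. This is a sketch, not the reference MDLM implementation.

```python
import torch
import torch.nn.functional as F

def mdlm_loss(model, x0, mask_id):
    """Weighted masked cross-entropy (continuous-time MDLM objective, linear schedule)."""
    B, L = x0.shape
    # Sample one noise level t ~ U(0, 1] per sequence (clamped away from 0 to avoid the 1/t blow-up).
    t = torch.rand(B, 1, device=x0.device).clamp(min=1e-3)
    # Forward process: independently replace each token with [MASK] with probability t.
    is_masked = torch.rand(B, L, device=x0.device) < t
    xt = torch.where(is_masked, torch.full_like(x0, mask_id), x0)
    # Predict the clean token at every position from the corrupted sequence.
    logits = model(xt)  # (B, L, vocab_size)
    ce = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), x0.reshape(-1), reduction="none"
    ).reshape(B, L)
    # Keep only the masked positions and weight by 1/t, as in the loss above.
    return ((ce * is_masked) / t).sum(dim=1).mean()
```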
→ Details: MDLM: Masked Diffusion Language Models
Sampling strategies
Modern DLLM samplers represented by LLaDA roughly follow the loop below.
- Initialize all positions to [MASK]
- At each step, produce predictions for every position
- Unmask the top-\(k\) positions by confidence; keep the remaining ones masked or re-mask them
- Repeat until all positions are filled
The prototype of this confidence-based unmasking traces back to MaskGIT (Chang+ 2022) in image generation. Practical refinements such as low-confidence remasking and semi-autoregressive sampling (generating block by block) have since been proposed.
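The loop above can be sketched as follows, under the same assumptions as the training snippet (a hypothetical `model` returning per-position logits and a `mask_id`). It implements plain confidence-based unmasking only, without the remasking or block-wise variants just mentioned.

```python
import torch

@torch.no_grad()
def sample(model, length, mask_id, steps=16, device="cpu"):
    """Confidence-based iterative unmasking (MaskGIT/LLaDA-style, simplified)."""
    x = torch.full((1, length), mask_id, dtype=torch.long, device=device)
    for step in range(steps):
        num_masked = int((x == mask_id).sum())
        if num_masked == 0:
            break
        probs = model(x).softmax(dim=-1)   # (1, L, vocab_size)
        conf, pred = probs.max(dim=-1)     # per-position confidence and argmax token
        # Only still-masked positions compete for unmasking.
        conf = conf.masked_fill(x != mask_id, float("-inf"))
        # Reveal the k most confident positions; the rest stay masked for later steps.
        k = max(1, num_masked // (steps - step))
        idx = conf.topk(k, dim=-1).indices
        x.scatter_(1, idx, pred.gather(1, idx))
    return x
```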
→ Details: LLaDA: Large-Scale Masked DLM and Sampling
→ Details: MaskGIT: Origin of Confidence-based Iterative Unmasking
Correspondence with continuous diffusion models
Continuous diffusion models developed in the image domain (DDPM, SBM, VP-SDE, etc.) and MDLM correspond structurally, but they operate on different mathematical objects.
- Where they correspond: the forward-adds-noise / reverse-removes-noise structure, the derivation of the loss from the ELBO, SNR weighting, and the framework of guidance
- Where they don’t: the score function \(\nabla_x \log p(x)\), SDE / probability flow ODE, and the VE/VP distinction
Holding the continuous-diffusion intuition as a template makes MDLM’s formulas easy to read at a glance — just remember that on the discrete side, the same objective is achieved with \(x_0\)-prediction cross-entropy rather than scores.
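As a one-line reminder of that correspondence (using standard DDPM-style \(\epsilon\)-prediction notation on the continuous side, which this chapter does not otherwise define):
\[\mathcal{L}_{\text{cont}} = \mathbb{E}_{t,\, x_0,\, \epsilon}\!\left[\left\lVert \epsilon - \epsilon_\theta(x_t, t) \right\rVert^2\right] \quad\text{vs.}\quad \mathcal{L}_{\text{disc}} = -\,\mathbb{E}_{t,\, x_t}\!\left[ \frac{1}{t} \sum_i \mathbf{1}[x_t^i = \texttt{[MASK]}] \log p_\theta(x_0^i \mid x_t) \right]\]
Both are denoising objectives: the continuous one regresses the noise (equivalently, the score), while the discrete one classifies the clean token at each masked position.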
→ Details: Continuous vs Discrete Diffusion: Bridging the Two
Maturity of the field
DLLMs are far less established than AR LLMs in many areas.
- Training recipes: mask schedules and noise design are still under research
- Sampling: confidence-based unmasking, remasking, semi-AR, etc. are still evolving
- Inference-time intervention: porting AR-side techniques such as guidance, constrained decoding, and editing-style interventions to DLLMs is yet to take off in earnest
- Evaluation: DLLM-specific evaluation axes (e.g., quality per denoising step) are largely unexplored
- Theory: expressiveness, convergence, and the correspondence with AR are at an early stage
Merely translating techniques established for AR LLMs into the DLLM setting already constitutes a wide-open research space.
→ Details: Open Problems in the DLLM Field
How to read this book
We recommend reading the chapters in numerical order. Each chapter is also self-contained, but the structure is designed to build from abstraction to concreteness and then to a broader view: formulation (01 MDLM) → state of the art in implementation (02 LLaDA) → correspondence with continuous diffusion (03) → another lineage of discrete diffusion (04) → origin of the sampler (05) → state of the field (06).
In particular, the first three chapters (01 → 02 → 03) form the core. Once you have the training formulation in 01, the implementation in 02, and the mathematical correspondence with continuous diffusion in 03, the modern DLLM landscape comes into three-dimensional view. Chapters 04–06 are more specialized and can be read selectively depending on your interests.
If you already know continuous diffusion (DDPM, SBM, VP-SDE, etc.), it helps to skim Continuous vs Discrete Diffusion: Bridging the Two in parallel while reading 01 MDLM — you will immediately see “ah, this is the discrete version of \(x_0\)-prediction.” Just remember that on the discrete side, the same goal is achieved with cross-entropy rather than scores.
This book is aimed at understanding the key references, and does not cover:
- Code-level walkthroughs of individual implementations (refer to the official repositories and the paper appendices)
- Step-by-step guides for building DLLM-based applications
- Exhaustive coverage of the latest preprints (we limit ourselves to the main papers available at the time of writing)
The book aims to provide a map of the field, and does not attempt exhaustive benchmark reproduction or rigorous theorem proofs for every paper.
Three to start with
Pointers to specific papers and articles appear in the individual chapters, but here are the three to start with.
- MDLM: Sahoo et al. “Simple and Effective Masked Diffusion Language Models” arXiv:2406.07524
- LLaDA: Nie et al. “Large Language Diffusion Models” arXiv:2502.09992
- Sander Dieleman’s blog (sander.ai): “Diffusion language models” and related overview posts that are particularly useful for bridging continuous and discrete diffusion