A Survey on Diffusion Language Models: A Map of Li et al. 2025

This book surveys the state of the art in the formulation and implementation of Diffusion Language Models (DLMs) by drilling down vertically into one major reference at a time: MDLM, LLaDA, MaskGIT, Block Diffusion, and so on. In contrast, the survey by Li et al., “A Survey on Diffusion Language Models” (Li et al. 2025), scans the same field horizontally, consolidating in a single article the explosive growth in paper counts, the timeline of leading models, and task-by-task applications. This chapter is devoted to that survey and makes it function as a map of the entire book. Even without reading the survey directly, you should be able to rapidly grasp (1) which areas it covers, (2) which areas the existing chapters of this book drill into, and (3) which areas only the survey touches on.

Basic information on the survey

| Item | Content |
|---|---|
| Title | A Survey on Diffusion Language Models |
| Authors | Tianyi Li, Mingda Chen, Bowei Guo, Zhiqiang Shen |
| Publication | arXiv:2508.10875 (August 2025) |
| GitHub | Awesome-DLMs |

The survey is organized into eight sections: after the introduction (§1), §2 covers the taxonomy, §3 training, §4 inference, §5 multimodality, §6 performance comparisons, §7 applications, and §8 challenges and future directions. The companion GitHub repository maintains a corresponding paper list, which is also useful for catching up on new papers.

The scale of the research field

What the survey’s introduction emphasizes is the explosive growth of DLM research. Counting, among papers that cite D3PM (the foundational paper on discrete diffusion), those whose title or abstract contains “language” as “discrete-DLM-related papers,” the number of publications grows exponentially from 2024 into 2025. Continuous-embedding diffusion (continuous DLMs) has been studied since early on, but from 2024 onwards discrete diffusion is overwhelmingly more active.
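The survey does not publish its counting script, but the criterion it describes (papers citing D3PM that mention “language” in the title or abstract) is easy to approximate. The following is a minimal sketch, assuming the public Semantic Scholar Graph API; the endpoint, response fields, and page size are assumptions on my part and not taken from the survey, so treat the output as a rough reproduction rather than the survey’s exact count.

```python
from collections import Counter
import requests

# D3PM = "Structured Denoising Diffusion Models in Discrete State-Spaces" (arXiv:2107.03006).
# We page through its citing papers via the Semantic Scholar Graph API (assumed endpoint/fields)
# and keep those whose title or abstract mentions "language", mirroring the survey's criterion.
D3PM_ID = "arXiv:2107.03006"
URL = f"https://api.semanticscholar.org/graph/v1/paper/{D3PM_ID}/citations"

def count_dlm_papers_by_year(keyword: str = "language") -> Counter:
    counts, offset = Counter(), 0
    while True:
        resp = requests.get(
            URL, params={"fields": "title,abstract,year", "limit": 500, "offset": offset}
        )
        resp.raise_for_status()
        batch = resp.json().get("data", [])
        if not batch:
            break
        for item in batch:
            paper = item["citingPaper"]
            text = f'{paper.get("title") or ""} {paper.get("abstract") or ""}'.lower()
            if keyword in text and paper.get("year"):
                counts[paper["year"]] += 1
        offset += len(batch)
    return counts

if __name__ == "__main__":
    for year, n in sorted(count_dlm_papers_by_year().items()):
        print(year, n)  # yearly counts of "discrete-DLM-related" papers (approximate)
```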

Figure 1: Annual trend of DLM-related papers. Discrete DLMs grew explosively in 2024–2025. Source: (Li et al. 2025)

Figure 1 explains why this book devotes more chapters to the discrete-diffusion side, including MDLM and LLaDA. There are several reasons the center of gravity of research shifted to the discrete side. First, MDLM distilled the formulation into an extremely concise objective: a weighted BERT-style masked-prediction loss. Second, LLaDA demonstrated, at the 8B scale, performance on par with AR LLMs of comparable size. Third, commercial-grade systems such as Mercury and Gemini Diffusion appeared, making the practical utility visible.
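The “weighted BERT training” objective can be stated very compactly. The following is a minimal PyTorch-style sketch of an MDLM-type masked-diffusion loss under the linear noise schedule, where the per-example ELBO weight reduces to \(1/t\); the function signature, tensor shapes, and `mask_id` argument are illustrative assumptions, not code from any of the cited papers.

```python
import torch
import torch.nn.functional as F

def masked_diffusion_loss(model, x0, mask_id):
    """One training step of an MDLM-style objective (linear schedule, weight 1/t).

    x0: (batch, seq_len) clean token ids; model(xt) is assumed to return
    logits of shape (batch, seq_len, vocab) from a bidirectional Transformer.
    """
    b, L = x0.shape
    t = torch.rand(b, device=x0.device).clamp(min=1e-3)           # mask rate per example, t ~ U(0, 1]
    mask = torch.rand(b, L, device=x0.device) < t[:, None]        # each token masked independently w.p. t
    xt = torch.where(mask, torch.full_like(x0, mask_id), x0)      # corrupt: replace masked tokens by [MASK]

    logits = model(xt)                                            # no causal mask; all positions attend freely
    ce = F.cross_entropy(logits.transpose(1, 2), x0, reduction="none")  # (batch, seq_len) token-level CE
    ce = (ce * mask).sum(dim=1)                                   # loss only on masked positions, as in BERT

    return ((1.0 / t) * ce).mean()                                # ELBO weighting 1/t under the linear schedule
```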

A timeline map of the leading models

The survey organizes the DLM timeline by color into three lineages: continuous embedding (continuous), discrete, and multimodal.

Figure 2: Development timeline of DLMs. Color-coded into three lineages: continuous (blue), discrete (orange), and multimodal (purple). Source: (Li et al. 2025)

The main currents readable from Figure 2 are as follows.

  • Pre-2022: Continuous-embedding diffusion such as Diffusion-LM, SED, and CDCD dominated. An era of trying to import the image-diffusion framework directly into text
  • 2023–2024: Starting from D3PM (2021), the formulations of discrete diffusion — SEDD, MDLM, MD4, RADD, etc. — matured rapidly. Absorbing diffusion using the [MASK] token became the de facto standard
  • 2025: LLaDA-8B shows AR-equivalent performance from scratch, and Dream-7B obtains competitive results via adaptation from an AR model. Multimodal extensions such as LLaDA-V / MMaDA / Dimple also advance at once
  • Commercialization: Mercury (Inception Labs) and Gemini Diffusion (DeepMind) appear with practical speeds (thousands of tokens per second)

→ More: Recent Discrete DLMs

→ More: Multimodal DLM

The survey’s taxonomy

At the core of the survey is a four-axis taxonomy.

Figure 3: Taxonomy of DLMs. Four axes: Paradigms / Training / Inference / Multimodal & Applications. Source: (Li et al. 2025)

The four axes of Figure 3 are summarized below.

  • Paradigms (§2): Where the diffusion takes place. A three-way split into continuous / discrete / hybrid AR-Diffusion
  • Training Strategies (§3): Pre-training (from scratch / AR adaptation / image-diffusion adaptation) and post-training (SFT, GRPO-family RL, preference optimization) along two axes
  • Inference & Optimization (§4): Parallel decoding, unmasking/remasking, guidance, efficiency (KV cache, feature cache, step distillation); a minimal decoding sketch follows this list
  • Multimodal & Applications (§5, §7): Multimodal DLMs, conventional NLP, code generation, computational biology, robotics
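To make the inference axis concrete, the following is a minimal sketch of confidence-based parallel unmasking, the MaskGIT/LLaDA-style loop behind “parallel decoding” and “unmasking/remasking.” The model interface, the fixed-step commitment schedule, and the function name are illustrative assumptions; practical systems layer KV/feature caching, guidance, and step distillation on top of a loop like this.

```python
import torch

@torch.no_grad()
def parallel_unmask(model, prompt, gen_len, mask_id, num_steps=8):
    """Confidence-based parallel decoding sketch: start fully masked, commit the most
    confident predictions each step, and keep re-predicting (remasking) the rest."""
    x = torch.cat([prompt, torch.full((gen_len,), mask_id, device=prompt.device)])
    for step in range(num_steps):
        still_masked = (x == mask_id)
        if not still_masked.any():
            break
        probs = model(x.unsqueeze(0)).softmax(-1).squeeze(0)      # assumed output: (seq_len, vocab) logits
        conf, pred = probs.max(-1)                                # per-position confidence and argmax token
        conf[~still_masked] = -1.0                                # only masked positions compete for commitment
        # Commit a growing fraction of the remaining masked positions (simple linear schedule).
        k = max(1, int(still_masked.sum().item() * (step + 1) / num_steps))
        commit = conf.topk(k).indices
        x[commit] = pred[commit]                                  # committed tokens stay; others remain [MASK]
    return x
```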

Within this taxonomy, this book drills down vertically into Paradigms and the Pre-training portion of Training (MDLM, LLaDA, etc.), and adds new chapters to complement the rest.

A correspondence table between this book and the survey

The subsections of the survey and their corresponding chapters in this book are shown in @tbl-mapping. Use it as an entry point when reading this book, and as an index for “jumping from the survey into the deep dives of this book.”

Table 1: Correspondence between survey sections and chapters in this book

| Survey section | Main topics | Corresponding chapter | Coverage |
|---|---|---|---|
| §2.1 Continuous DLMs | Diffusion-LM, SED, CDCD, Plaid, TESS | Embedding-space Diffusion | Covered in existing chapter |
| §2.2 Discrete DLMs | D3PM, SEDD, MDLM, LLaDA, Dream, RADD, DFM, GIDD | MDLM / LLaDA / D3PM and SEDD / Recent Discrete DLMs | Existing + new chapters |
| §2.3 Hybrid AR-Diffusion | SSD-LM, AR-Diffusion, BD3-LM, CtrlDiff, SpecDiff, SDAR, TiDAR, SDLM | Block Diffusion / Hybrid AR-Diffusion | Existing + new chapters |
| §3.1 Pre-training | from scratch / AR adaptation / image-diffusion adaptation | LLaDA / AR→DLM Adaptation | Existing + new chapters |
| §3.2 Post-training | DoT, DCoLT, diffu-GRPO, UniGRPO, VRPO family | Post-training (RL) | New chapter |
| §4 Inference | parallel decoding / KV cache / feature cache / step distillation | Inference Acceleration | New chapter |
| §4.3 Guidance | A-CFG, Freecache, DINGO | Guidance | New chapter |
| §5 Multimodal DLMs | LLaDA-V, MMaDA, Dimple, LaViDa, Fudoki, Muddit | Multimodal DLM | New chapter |
| §6 Performance Study | Cross-benchmark comparisons | (none) | Refer directly to Figure 6 of the survey |
| §7 Applications | Code, Bio, Robotics, NLP task families | Applications | New chapter |
| §8 Challenges | parallelism trade-off, infrastructure, long-context, scalability | Open Problems | Covered in existing chapter |

All linked targets are chapters that exist in this book. If a region of the survey catches your interest, jumping to the corresponding chapter lets you dig into the details of formulation and implementation.

The significance of the three-paradigm split

The survey adopts the three-way split into continuous / discrete / hybrid AR-Diffusion based on the space in which the diffusion takes place. Its significance lies in adding a third, hybrid axis to the binary split between token-space discrete diffusion and continuous-embedding diffusion used in this book’s overview.qmd.

In this book’s overview, we emphasized the MDLM-rooted “iterative refinement that fills [MASK]” as the core of modern DLMs and positioned the Diffusion-LM lineage of continuous-embedding diffusion as a separate stream. This is an appropriate introduction for understanding MDLM/LLaDA. However, adding the survey’s three-way split reveals a wider structure.

Concretely, the view emerges that AR and DLMs should be understood as a continuum. Pure AR is the extreme case of “block length 1 and causal,” pure DLM is the opposite extreme of “generating all positions in parallel simultaneously,” and hybrid AR-Diffusion models such as Block Diffusion (BD3-LM), SDAR, and TiDAR realize block-wise semi-autoregression in between. By drawing this three-way split explicitly, the survey naturally positions this book’s Block Diffusion chapter not as a one-off derivative but as one point inside the AR–DLM continuum.
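The continuum can be made concrete with a short sketch that reuses the `parallel_unmask` function from the taxonomy section above as the inner fill: an outer autoregressive loop over blocks, with a diffusion-style fill inside each block. The block size and loop structure are illustrative assumptions, not the BD3-LM implementation.

```python
def block_semi_autoregressive(model, prompt, gen_len, mask_id, block_len=16):
    """Hybrid AR-Diffusion sketch: blocks are generated left to right (AR over blocks),
    while tokens inside each block are filled by iterative parallel unmasking (diffusion within a block).
    block_len=1 degenerates to token-by-token AR; block_len=gen_len degenerates to a pure DLM."""
    x = prompt
    remaining = gen_len
    while remaining > 0:
        cur = min(block_len, remaining)
        # Fill the next `cur` positions given everything generated so far as bidirectional context.
        x = parallel_unmask(model, x, cur, mask_id)
        remaining -= cur
    return x
```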

Once aware of this hybrid axis, one can predict that work combining AR’s advantages with DLM’s advantages — such as SpecDiff (incorporating DLM into speculative decoding) and TiDAR (parallel generation by DLM while outputting AR-style) — will continue to grow. This is why the survey dedicates a section to the related papers.

→ More: Block Diffusion

→ More: Hybrid AR-Diffusion

Key facts emphasized in the survey

The important observations repeatedly highlighted throughout the survey are summarized below. Each is treated in detail in the corresponding chapter of this book, but having them in mind as a survey-level overview sharpens the positioning of individual papers as you read them.

1. Discrete DLMs vastly outnumber continuous DLMs (as of 2025)

Continuous-embedding diffusion was the early mainstream from around 2022, with Diffusion-LM and SED, but discrete diffusion using [MASK] turned out to be more concise in formulation (MDLM’s weighted BERT-style loss) and easier to scale (LLaDA-8B), so the center of research shifted to the discrete side. The trend in Figure 1 reflects this.

2. LLaDA-8B was the first to show “AR-equivalent at comparable scale from scratch”

Prior DLMs lagged behind AR on benchmarks, but LLaDA-8B (Nie et al. 2025) reached parity with LLaMA3-8B given equivalent pre-training data. This was a milestone that overturned the then-implicit assumption that “DLMs are essentially weaker than AR.” The survey positions LLaDA as a turning point.

3. Multimodal DLMs are a new frontier

LLaDA-V, MMaDA, Dimple, LaViDa, and others appeared in concentrated fashion in 2025. Unlike AR-family VLMs (LLaVA, etc.), DLMs’ bidirectional context is naturally suited to cross-modal reasoning and to unified “understanding + generation” models. This is why §5 of the survey is treated as a standalone chapter.

4. The arrival of commercial-grade systems (Mercury, Gemini Diffusion)

Inception Labs’ Mercury (Labs et al. 2025) and Google DeepMind’s Gemini Diffusion (Google DeepMind 2024) arrived in 2024–2025, achieving inference speeds of thousands of tokens per second. While academic DLMs remain around the 8B scale, on the commercial side, services running at practical speeds are already in operation — a suggestive contrast.

5. Possibly higher data efficiency than AR

Citing multiple papers, the survey points out a tendency for DLMs to use data more effectively than AR under multi-epoch training. One proposed explanation is the training objective itself: each pass samples a fresh mask rate \(t\) for the same sequence and weights the masked-token loss by the ELBO term \(1/t\), so repeated epochs expose the same data under genuinely different corruptions. In an era when data is becoming the bottleneck for LLM training, this is a non-negligible advantage.
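For reference, the continuous-time NELBO used by MDLM-style models under the linear schedule \(\alpha_t = 1 - t\) makes the \(1/t\) weighting explicit (this is a restatement of the objective discussed in the MDLM chapter, not a derivation taken from the survey):

$$
\mathcal{L}_{\text{NELBO}}
= \mathbb{E}_{t \sim U(0,1)}\,
  \mathbb{E}_{x_t \sim q(x_t \mid x_0)}
  \left[
    \frac{1}{t}
    \sum_{i \,:\, x_t^i = \texttt{[MASK]}}
    -\log p_\theta\!\left(x_0^i \mid x_t\right)
  \right]
$$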

Regions mentioned in the survey but not drilled into in this book

To make this chapter function as a map, we explicitly mark the regions that this book “leaves to the survey.”

Performance Study (§6)

The survey provides figures comparing DLMs and AR side by side on major benchmarks such as GSM8K, HumanEval, and MMLU. This book does not attempt comprehensive benchmark reproduction of this kind (representative numbers are cited in each chapter, but for side-by-side comparison, viewing Figure 6 of the survey directly is faster).

Figure 4: Comparison of DLMs and AR on major benchmarks. Source: (Li et al. 2025)

Cross-cutting comparisons like Figure 4 are best consulted directly in Figure 6 of the survey. Each chapter of this book treats the performance of individual models, but the overall side-by-side picture is left to the survey side.

Fine-grained tasks in conventional NLP

§7 of the survey enumerates applications such as DiffusionNER, DiffuSum, EdiText, PoetryDiffusion, and XDLM. The Applications chapter of this book focuses on Code / Biology / Robotics, and only outlines applications to individual conventional-NLP tasks. For exhaustive coverage of task-specific papers, refer to §7 of the survey.

References

Google DeepMind. 2024. Gemini Diffusion. Product page. https://deepmind.google/technologies/gemini-diffusion/.
Labs, Inception, Samar Khanna, Siddhant Kharbanda, et al. 2025. “Mercury: Ultra-Fast Language Models Based on Diffusion.” arXiv Preprint arXiv:2506.17298. https://arxiv.org/abs/2506.17298.
Li, Tianyi, Mingda Chen, Bowei Guo, and Zhiqiang Shen. 2025. “A Survey on Diffusion Language Models.” arXiv Preprint arXiv:2508.10875. https://arxiv.org/abs/2508.10875.
Nie, Shen, Fengqi Zhu, Zebin You, et al. 2025. “Large Language Diffusion Models.” arXiv Preprint arXiv:2502.09992. https://arxiv.org/abs/2502.09992.