Recent Discrete DLMs: Latest Developments in Discrete Diffusion Language Models

D3PM (Discrete Denoising Diffusion Probabilistic Models) (Austin et al. 2021), SEDD (Score Entropy Discrete Diffusion) (Lou et al. 2024), MDLM (Masked Diffusion Language Model) (Sahoo et al. 2024), and LLaDA (Large Language Diffusion with mAsking) (Nie et al. 2025) have already been covered in dedicated chapters as the core of this book. These papers played their respective roles as “the foundational mathematics of discrete diffusion,” “an alternative formulation using the concrete score,” “a concise objective function based on the absorbing transition + masked cross-entropy (CE),” and “the first full-scale 8B Diffusion Language Model (DLM).”

However, since then, papers have appeared in rapid succession that (i) re-derive and simplify the MDLM formulation from a different angle, (ii) reexamine alternative formulations beyond absorbing, and (iii) build on LLaDA toward long-context, MoE (Mixture-of-Experts), and commercial-class large-scale deployments. This chapter organizes these works around §2.3 of the survey by Li et al. (Li et al. 2025) and consolidates the “what came next” of modern discrete DLMs into a single map.

Overall picture

The main lines of development can be divided into three axes.

flowchart TD
    D3PM["D3PM (2021)<br/>foundations of discrete diffusion"]
    SEDD["SEDD (2024)<br/>concrete score"]
    DBERT["DiffusionBERT (2023)<br/>BERT x diffusion"]
    MDLM["MDLM (2024)<br/>absorbing + masked CE"]
    MD4["MD4 (2024)<br/>continuous-time CE"]
    RDM["RDM (2024)<br/>weighted CE / flexible decoding"]
    RADD["RADD (2024)<br/>factorization of concrete score"]
    DFM["DFM (2024)<br/>discrete flow matching"]
    DDPD["DDPD (2024)<br/>planner-denoiser separation"]
    MGDM["MGDM (2024)<br/>token reweighting"]
    GIDD["GIDD (2025)<br/>mask + uniform interpolation"]
    LLaDA["LLaDA (2025)<br/>8B from-scratch"]
    DLLaMA["DiffuLLaMA (2025)<br/>AR-7B -> DLM"]
    Dream["Dream-7B (2025)<br/>Qwen2.5 -> DLM"]
    LongLLaDA["LongLLaDA (2025)<br/>NTK RoPE extrapolation"]
    Ultra["UltraLLaDA (2025)<br/>128K post-training"]
    MoE["LLaDA-MoE (2025)<br/>sparse MoE, 20T tokens"]
    Seed["Seed Diffusion (2025)<br/>commercial-class open-source"]

    D3PM --> SEDD
    D3PM --> MDLM
    D3PM --> DBERT
    MDLM --> MD4
    MDLM --> RDM
    MDLM --> RADD
    D3PM --> DFM
    MDLM --> DDPD
    MDLM --> MGDM
    MDLM --> GIDD
    MDLM --> LLaDA
    LLaDA --> DLLaMA
    LLaDA --> Dream
    LLaDA --> LongLLaDA
    LongLLaDA --> Ultra
    LLaDA --> MoE
    LLaDA --> Seed
Figure 1: Genealogy of discrete DLMs after D3PM. This chapter covers the papers in the yellow box.

To summarize, the developments covered in this chapter can be grouped along three axes.

  • MDLM-aligned refinements: lines that, within the same “absorbing + masked CE” framework, re-derive the objective or refine the schedule (RDM, MD4, DiffusionBERT)
  • Alternative formulations: lines that reexamine options other than absorbing (DFM, RADD, DDPD, MGDM, GIDD)
  • Scale-up: lines starting from LLaDA that pursue larger scale, longer context, MoE, and commercial deployment (DiffuLLaMA, Dream, LongLLaDA, UltraLLaDA, LLaDA-MoE, Seed Diffusion)

These axes are not independent; they influence each other. For example, the training objective of LLaDA is the weighted masked CE that RDM, MD4, and MDLM independently arrived at, and the MoE backbone sits on top of that same objective.

→ More: MDLM: Masked Diffusion Language Models

→ More: D3PM and SEDD: Alternative Formulations of Discrete Diffusion

MDLM-aligned refinements

Within the family that adopts absorbing masking (the absorbing transition, where a token never returns once it becomes [MASK]), there are multiple formulations that are equivalent to MDLM or precede it. In fact, MDLM, MD4, and RDM are three works that independently arrived at the same “weighted masked CE,” and they are now regarded as essentially equivalent formulations.

RDM: simplification to weighted CE

RDM (Reparameterized Discrete diffusion Models) (Zheng et al. 2024) showed in 2023, earlier than MDLM, that reparameterizing the reverse process of discrete diffusion reduces the objective to a weighted cross-entropy. Concretely, the reverse posterior is rewritten as a mixture of “fill the masked token with \(x_0\)” and “keep it as [MASK],” after which the ELBO (Evidence Lower Bound) can be optimized simply by applying a time-dependent weight to the cross-entropy of the \(x_0\) prediction.

RDM’s contributions can be summarized in the following two points.

  • It replaced the training objective from “a sum of KL (Kullback-Leibler) terms of the ELBO” with “weighted CE,” making the implementation simpler
  • It enabled decoding-time schedule choices (greedy / stochastic / arbitrary order) to be treated separately from the theory

The latter in particular forms the theoretical basis for LLaDA’s low-confidence remasking and semi-autoregressive sampling. The separation that “minimizing the ELBO” and “choosing a sampling strategy” are problems on different layers became the standard from RDM onward.
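A minimal sketch of this decoding-side freedom follows: the same trained predictor can commit positions greedily by confidence or in a random order, while the training objective is untouched. The function name, tensor shapes, and the confidence heuristic are illustrative assumptions, not RDM’s reference implementation.

import torch

def choose_positions(logits, mask, k, strategy="greedy"):
    """Pick k masked positions to commit in this decoding step.

    The trained objective (weighted masked CE) is unchanged; only this decoding
    rule varies, which is the separation RDM makes explicit.
    logits: (L, V) model outputs; mask: (L,) True at [MASK] positions.
    """
    conf = torch.softmax(logits, dim=-1).max(dim=-1).values   # per-position confidence
    conf = conf.masked_fill(~mask, float("-inf"))             # only masked positions compete
    if strategy == "greedy":        # commit the most confident positions first
        return conf.topk(k).indices
    if strategy == "stochastic":    # commit k masked positions in a random order
        idx = mask.nonzero(as_tuple=True)[0]
        return idx[torch.randperm(idx.numel(), device=idx.device)[:k]]
    raise ValueError(strategy)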

MD4: continuous-time variational objective

MD4 (Simplified and Generalized Masked Diffusion) (Shi et al. 2024) independently, in the same year as MDLM, reformulated masked diffusion concisely as a continuous-time variational objective. Considering a forward process that masks each token with probability \(1 - \alpha_t\) at time \(t \in [0,1]\), MD4 showed that the continuous-time negative ELBO can be written as a weighted integral of the masked CE:

\[ \mathcal{L}_\text{MD4} = \mathbb{E}_{t \sim \mathcal{U}(0,1)} \, \mathbb{E}_{x_t} \left[ \frac{\alpha'_t}{1 - \alpha_t} \sum_i \mathbf{1}[x_t^i = \texttt{[MASK]}] \log p_\theta(x_0^i \mid x_t) \right] \]

Here \(\alpha_t\) is an arbitrary monotonically decreasing schedule with \(\alpha_0 = 1\) and \(\alpha_1 = 0\). With the linear schedule \(\alpha_t = 1 - t\), we have \(\alpha'_t = -1\) and hence \(\alpha'_t / (1 - \alpha_t) = -1/t\); since \(\log p_\theta\) is non-positive, the objective reduces to a \(1/t\)-weighted masked cross-entropy, matching MDLM’s \(1/t\) weight.

MD4 is broader than MDLM in the following two respects.

  • It handles arbitrary noise schedules (cosine, polynomial, etc.) in a unified way
  • It naturally extends to generalized masking (designs that mask a subset of the vocabulary, state-dependent masking, etc.)

In practice, the same model can be trained with any of the MDLM, MD4, or RDM implementations, and the three are now often collectively called the “MDLM-family”. LLaDA, Dream, and DiffuLLaMA all adopt this family’s objective function.
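For reference, a minimal sketch of this shared objective under the linear schedule (so the weight is \(1/t\)) is shown below. The batching, the mask-token id, and the model interface are assumptions made for illustration, not any of the papers’ released code.

import torch
import torch.nn.functional as F

def masked_diffusion_loss(model, x0, mask_id, eps=1e-3):
    """Weighted masked CE shared by MDLM / MD4 / RDM (linear schedule).

    x0: (B, L) clean token ids; model(x_t) -> (B, L, V) logits.
    """
    B, L = x0.shape
    t = torch.rand(B, device=x0.device).clamp(min=eps)       # t ~ U(0,1), avoid divide-by-zero
    mask = torch.rand(B, L, device=x0.device) < t[:, None]   # mask each token with probability t
    x_t = torch.where(mask, torch.full_like(x0, mask_id), x0)
    logits = model(x_t)                                       # (B, L, V)
    ce = F.cross_entropy(logits.transpose(1, 2), x0, reduction="none")  # (B, L)
    # 1/t weight: the continuous-time NELBO under alpha_t = 1 - t
    return ((ce * mask) / t[:, None]).sum(dim=1).mean()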

DiffusionBERT: bridging to BERT pretraining

DiffusionBERT (Z. He et al. 2023) is an approach that uses BERT’s pretrained weights as initialization for masked diffusion training. The technical core is a mask schedule called the spindle noise schedule, which takes token informativeness into account.

Standard masked diffusion masks each position with uniform probability, but DiffusionBERT varies the mask probability according to the token’s information content (estimated by unigram frequency, TF-IDF, etc.). Specifically,

  • low-information tokens (e.g., stop words) are fixed early (less likely to be masked)
  • high-information tokens (proper nouns, low-frequency words) are fixed later

This ordering more closely resembles how humans allocate attention when reading.
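A toy version of such an informativeness-aware schedule is sketched below; the surprisal-based measure and the \(t(1-t)\) modulation are stand-ins chosen for illustration and do not reproduce the paper’s exact spindle schedule.

import torch

def spindle_style_mask(x0, unigram_logprob, t, scale=0.5):
    """Toy informativeness-aware masking: rarer tokens get a higher mask
    probability at the same forward time t, so they are decoded later in reverse.

    unigram_logprob: (V,) log unigram probabilities (assumed precomputed).
    """
    surprisal = -unigram_logprob[x0]                              # (B, L) per-token information
    z = (surprisal - surprisal.mean(dim=1, keepdim=True)) / (
        surprisal.std(dim=1, keepdim=True) + 1e-6)                # relative informativeness
    p_mask = (t + scale * t * (1 - t) * z).clamp(0.0, 1.0)        # spindle-like modulation around t
    return torch.rand_like(p_mask) < p_mask                       # (B, L) boolean mask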

DiffusionBERT is on the 110M scale, but in terms of reusing existing BERT weights, it is positioned as a precursor to the AR-to-DLM adaptation line (DiffuLLaMA, Dream). It is worth referencing as the first paper to explicitly demonstrate a natural connection between pretrained MLMs (Masked Language Models) and masked diffusion.

Alternative formulations

While the MDLM-family established “absorbing transition + masked CE” as the standard, papers that reexamine options other than absorbing masking have developed in parallel. Below we cover the five that are featured in the survey.

DFM: Discrete Flow Matching

DFM (Discrete Flow Matching) (Gat et al. 2024) is a framework that extends Flow Matching, originally for continuous data, to discrete variables. Whereas continuous Flow Matching “learns a flow along a probability path from a source distribution to a target distribution,” DFM learns a discrete counterpart called the probability velocity.

DFM’s forward process takes the following form.

\[ p_t(x_t \mid x_0, x_1) = (1 - t) \delta_{x_0}(x_t) + t \delta_{x_1}(x_t) \]

Here \(x_0\) is the source (typically a [MASK] sequence or a uniform distribution), and \(x_1\) is the target data. At each time \(t\), the structure is to choose “stay at \(x_0\) or move to \(x_1\)” via a Bernoulli draw. The learning target is the probability velocity, the instantaneous transition rate at time \(t\), which is the discrete analog of the vector field on the continuous side.
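A minimal sketch of drawing \(x_t\) from this path is given below, fixing the source to an all-[MASK] sequence (one common choice); the interface is an assumption for illustration, not the DFM codebase.

import torch

def sample_dfm_path(x1, mask_id, t):
    """Draw x_t from p_t(x_t | x0, x1) = (1 - t) * delta_{x0} + t * delta_{x1},
    with the source x0 fixed to an all-[MASK] sequence.
    Each position independently 'jumps' to the data token x1 with probability t.
    """
    x0 = torch.full_like(x1, mask_id)                     # source sequence
    jump = torch.rand_like(x1, dtype=torch.float) < t     # Bernoulli(t) per position
    return torch.where(jump, x1, x0)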

DFM’s contributions are the following two points.

  • It allows tools developed on the continuous side of Flow Matching (rectified flow, OT-based paths, etc.) to be brought into the discrete setting
  • It provides a general framework that unifies absorbing and uniform

In terms of scale, DFM was trained at 1.7B parameters on 2.5T tokens, demonstrating that the gap with AR LLMs can be substantially closed. Unlike the MDLM-family, it does not restrict the source distribution to [MASK], so it remains an option when one wants to keep open a path other than absorbing masking.

RADD: factorization of the concrete score

RADD (Reparameterized Absorbing Discrete Diffusion) (Ou et al. 2024) is a paper that applies SEDD’s concrete-score formulation to the absorbing transition and then analytically factorizes the structure of the score. The central result is the following theorem.

The concrete score \(s_\theta(x_t)_y = p(y) / p(x_t)\) in absorbing diffusion can be factorized into a time-independent conditional probability \(p(y \mid x_t^{\text{unmasked}})\) and an analytically computable time-dependent scalar \(c(t)\).

\[ s_\theta(x_t, t)_y = c(t) \cdot p(y \mid x_t^{\text{unmasked}}) \]

The implications of this factorization are significant. Since the time dependence is concentrated in a scalar, the neural network only needs to learn the time-independent function of “predicting the value at a masked position from the unmasked context.” This is the same structure as MDLM’s \(x_0\)-prediction and AR’s next-token prediction, and it forms a framework that unifies absorbing diffusion, any-order AR, and MDLM.

RADD further shows the following.

  • The training objective of absorbing discrete diffusion and that of any-order AR are equivalent under appropriate weighting
  • SEDD’s score entropy loss and MDLM’s masked CE loss share the same optimum in the absorbing setting

This result is also important in practice: “the weighted CE of the MDLM-family is computing SEDD’s score analog under the hood.” RADD makes explicit that the difference between learning the concrete score and predicting \(x_0\) is only a surface notational difference.
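Structurally, the factorization can be sketched as below. The exact, analytically known form of \(c(t)\) depends on the noise schedule and is left abstract here, so this is an illustration of the shape of the result rather than RADD’s implementation.

import torch

def concrete_score_absorbing(denoiser, x_t, t, c):
    """RADD-style factorized score at masked positions.

    denoiser(x_t) -> (B, L, V) logits: a time-independent predictor of the clean
    token given the unmasked context. c(t) is the analytically known,
    schedule-dependent scalar; its exact form is not reproduced here.
    """
    p_clean = torch.softmax(denoiser(x_t), dim=-1)   # p(y | unmasked context)
    return c(t) * p_clean                            # s_theta(x_t, t)_y = c(t) * p(y | ...)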

DDPD: separation of planner and denoiser

DDPD (Discrete Diffusion with Planned Denoising) (S. Liu et al. 2025) is an approach that separates the generation process into two models: a planner and a denoiser. In the MDLM-family, a single model handles the strategy of “looking at all [MASK] positions and fixing the top \(k\) by confidence,” but DDPD divides this responsibility.

  • Planner: looking at the current sequence \(x_t\), predicts which position is the most corrupted (the one to refine next)
  • Denoiser: predicts the concrete token value for the position chosen by the planner

One step of the sampling loop can be written as the following PyTorch-flavored pseudocode (the planner, the denoiser, and the per-step budget schedule are assumed given).

import torch

# x: current sequence of token ids (some [MASK], some already fixed)
for t in range(T, 0, -1):
    # 1. The planner scores how corrupted each position currently is
    corruption_scores = planner(x)                            # (L,)
    k_t = schedule(t)   # number of positions to fix this step (schedule is user-supplied)
    target_positions = torch.topk(corruption_scores, k_t).indices

    # 2. The denoiser produces token predictions for those positions
    logits = denoiser(x)                                      # (L, V)
    probs = torch.softmax(logits[target_positions], dim=-1)
    x[target_positions] = torch.multinomial(probs, 1).squeeze(-1)

The advantages of DDPD can be summarized as follows.

  • “Where to fix” and “what to fix it to” can be optimized independently
  • Even for positions already unmasked, if the planner judges them to be highly corrupted, an operation equivalent to remasking is possible (a generalization of low-confidence remasking)
  • The refinement strength can be adjusted at inference time simply by changing the planner’s temperature

Whereas LLaDA’s low-confidence remasking embeds an implicit planner (positions are selected from the same model’s confidence), DDPD can be read as separating that role out into an explicit model.

MGDM: addressing subgoal imbalance

MGDM (Mask-Guided Discrete Diffusion, also called Multi-Granularity Diffusion Modeling in the paper) (Ye, Gao, et al. 2024) was proposed to address the subgoal imbalance that becomes apparent in complex reasoning tasks.

The motivating problem is as follows. In reasoning tasks such as math and programming, the tokens to be generated mix “trivial connectives” with “decisive computation results,” and training that masks both uniformly dilutes the learning signal for the latter. AR LLMs mitigate this problem by computing the loss at every position via teacher forcing, but DLMs compute the loss only at masked positions at each step, so learning the difficult subgoals tends to be unstable.

MGDM’s solution is token-level reweighting, which dynamically adjusts the loss weight according to each token’s prediction difficulty. By assigning larger weights to difficult tokens (positions the model cannot confidently predict), it mitigates the subgoal imbalance.
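A simplified sketch of such difficulty-based reweighting is shown below, using a focal-loss-style weight as a stand-in; MGDM’s actual weighting scheme differs in its details.

import torch
import torch.nn.functional as F

def reweighted_masked_ce(logits, x0, mask, gamma=1.0):
    """Up-weight hard (low-confidence) masked tokens.

    A simplified stand-in for MGDM's token-level reweighting, not the paper's
    exact scheme. logits: (B, L, V); x0: (B, L); mask: (B, L) bool.
    """
    ce = F.cross_entropy(logits.transpose(1, 2), x0, reduction="none")       # (B, L)
    with torch.no_grad():
        p_true = torch.softmax(logits, dim=-1).gather(-1, x0.unsqueeze(-1)).squeeze(-1)
        weight = (1.0 - p_true) ** gamma          # harder token => larger weight
    return (weight * ce * mask).sum() / mask.sum().clamp(min=1)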

MGDM’s contributions can be summarized as follows.

  • It empirically identified a weakness of DLMs in complex reasoning (math, puzzles, etc.)
  • It demonstrated a clear improvement over standard MDLM training via token-level reweighting
  • It provides the foundation for subsequent reasoning-oriented research such as Diffusion-of-Thought (Ye, Gong, et al. 2024) and d1 (Zhao et al. 2025)

GIDD: interpolation of mask and uniform

GIDD (Generalized Interpolating Discrete Diffusion) (Rütte et al. 2025) was proposed to solve the fundamental weakness of masked diffusion that “it cannot correct errors”.

The generation process of the MDLM-family is a one-way operation of “filling [MASK] with non-[MASK],” and there is no legitimate means in the formulation for later correcting a position that has been unmasked. LLaDA’s low-confidence remasking is a heuristic that fills this gap on the implementation side, but its consistency with the training distribution is not theoretically guaranteed.

GIDD’s forward process is defined by an interpolation of masking and uniform noise.

\[ q(x_t \mid x_0) = (1 - \beta_t - \gamma_t) \delta_{x_0} + \beta_t \delta_\texttt{[MASK]} + \gamma_t \, \text{Uniform}(V) \]

Here \(\beta_t\) is the rate of transition to mask and \(\gamma_t\) is the rate of transition to uniform. The key idea of GIDD is that including uniform noise (transitions to incorrect tokens) in the learning signal at training time lets the model naturally make decisions at inference such as “currently at token \(y\), but this is wrong and should be corrected to \(y'\).”
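Sampling from this forward kernel is straightforward; the sketch below is a minimal illustration with an assumed interface, not the paper’s code.

import torch

def gidd_forward_sample(x0, mask_id, beta_t, gamma_t, vocab_size):
    """Sample x_t from the GIDD forward kernel
    (1 - beta_t - gamma_t) * delta_{x0} + beta_t * delta_[MASK] + gamma_t * Uniform(V)."""
    u = torch.rand_like(x0, dtype=torch.float)
    x_t = x0.clone()
    x_t[u < beta_t] = mask_id                                        # mask with probability beta_t
    to_uniform = (u >= beta_t) & (u < beta_t + gamma_t)              # corrupt with probability gamma_t
    x_t[to_uniform] = torch.randint_like(x0, vocab_size)[to_uniform]
    return x_t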

The implications of GIDD are as follows.

  • Self-correction capability is built into the formulation
  • It unifies D3PM’s uniform transition and absorbing transition as a continuous interpolation along the schedule
  • Operations other than “filling [MASK]” (rewriting existing tokens) become legitimately defined reverse steps at inference

Empirically, GIDD outperforms the MDLM-family especially on long-form generation and tasks involving complex edits. As a representative example of a “design that compensates for the weaknesses of absorbing masking,” it is likely to become a reference baseline going forward.

Summary of alternative formulations

Table 1: Comparison of alternative formulations
| Method | Source distribution | Learning target | Main functional difference |
|---|---|---|---|
| MDLM-family | [MASK] only | \(x_0\)-prediction CE | Absorbing-based (standard) |
| DFM | Arbitrary (mask or uniform) | Probability velocity | Tools of flow matching |
| RADD | [MASK] only | Time-independent factor of the concrete score | Unification of absorbing and any-order AR |
| DDPD | [MASK] only | Separate planner + denoiser | Explicit modeling of “where to fix” |
| MGDM | [MASK] only | \(x_0\)-prediction + token reweighting | Addressing subgoal imbalance in reasoning tasks |
| GIDD | [MASK] + uniform | \(x_0\)-prediction CE (including uniform) | Self-correction built into the formulation |

Looking at Table 1, one can see that the value of alternative formulations lies less in “outperforming MDLM” than in “adding functionality MDLM does not have”. RADD adds theoretical unification, DDPD adds explicit planning, MGDM adds reasoning performance, and GIDD adds self-correction.

LLaDA-origin scale-up

Once LLaDA-8B showed that “DLMs can scale on par with AR LLMs,” 2025 saw derivative work along all axes — scale, long context, MoE, and commercial deployment — fall into place. These share the MDLM-family objective while differentiating themselves through their training starting point and architecture choices.

DiffuLLaMA / Dream-7B: adaptation from AR

DiffuLLaMA (Gong et al. 2025) (along with DiffuGPT from the same paper) proposed an adaptation method that reuses pretrained AR LLM weights, specifically LLaMA-7B, as masked diffusion weights. By replacing the AR model’s causal mask with bidirectional attention and continual pretraining with a masked CE loss, it significantly reduces the compute required for from-scratch training.
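The architectural change itself is small, as the minimal sketch below illustrates: the attention computation is unchanged except that the causal mask is dropped. The function is a generic illustration, not DiffuLLaMA’s code.

import torch

def attention(q, k, v, causal):
    """Scaled dot-product attention; the AR-to-DLM adaptation keeps everything
    identical and simply calls this with causal=False, so masked positions can
    attend to context on both sides."""
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    if causal:  # AR pretraining: each position sees only its left context
        L = q.size(-2)
        future = torch.ones(L, L, device=q.device).triu(1).bool()
        scores = scores.masked_fill(future, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v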

Dream-7B (Ye et al. 2025) is a model that applies the same adaptation strategy to Qwen2.5-7B, and with only 580B tokens of additional training it achieved performance on par with or better than LLaDA-8B (2.3T tokens, from-scratch). This sends a strong message: “you do not need trillions of tokens of from-scratch training every time to make a DLM.”

The details of both are left to a separate chapter, but in the context of this chapter the following points suffice.

  • Adaptation lines use the MDLM objective as is (no new mathematical contribution)
  • Pretrained AR weights serve as a good initialization for masked diffusion training
  • The data required for continual pretraining is roughly 1/5 to 1/10 of what from-scratch training needs

→ More: AR-to-DLM Adaptation

LongLLaDA: analysis and extrapolation of long-context capability

LongLLaDA (X. Liu et al. 2025) is the first systematic analysis of DLMs’ long-context capability. While the extrapolation behavior of RoPE (Rotary Position Embedding) has been extensively studied for AR LLMs, this had been untouched for DLMs.

LongLLaDA’s main findings are as follows.

  • DLMs maintain stable perplexity under direct context extrapolation (remaining well-behaved in ranges where AR models degrade sharply)
  • They also exhibit stable behavior on retrieval tasks (e.g., needle-in-a-haystack)
  • NTK (Neural Tangent Kernel)-based RoPE extrapolation established for AR also works for DLMs

Particularly important is the demonstration that training-free RoPE extrapolation works on DLMs as is. This suggests that know-how accumulated for AR models transfers directly to DLMs, and it became the foothold for UltraLLaDA’s subsequent post-training scaling.
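For context, the NTK-aware adjustment used in the AR world rescales the RoPE base so that the low-frequency dimensions stretch to cover a longer context. The exponent convention below is one common variant and is stated as an assumption rather than LongLLaDA’s exact recipe.

def ntk_scaled_rope_base(base, scale, head_dim):
    """NTK-aware RoPE: enlarge the rotary base so low-frequency dimensions
    cover roughly `scale` times the original context length.
    One common convention; exact exponents vary across implementations."""
    return base * scale ** (head_dim / (head_dim - 2))

# e.g. extending a model trained with base 10000 to 4x its context (illustrative)
new_base = ntk_scaled_rope_base(10000.0, 4.0, 128)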

UltraLLaDA: post-training scaling to 128K context

UltraLLaDA (G. He et al. 2025) goes beyond LongLLaDA’s training-free extrapolation and achieves a 128K context window by combining diffusion-aware NTK RoPE scaling with lightweight long-context post-training.

UltraLLaDA’s technical choices are as follows.

  • Measures the limits of training-free extrapolation (quality degrades once the context grows to the several-K to tens-of-K token range)
  • Adjusts NTK scaling coefficients to account for the mask distribution specific to diffusion training
  • Lightly fine-tunes on long-context post-training data (full training is unnecessary)

As a result, on both retrieval and perplexity it substantially exceeds training-free extrapolation, showing that DLMs can possess long-context capability on par with AR LLMs.

LLaDA-MoE: efficiency via sparse MoE

LLaDA-MoE (Zhu et al. 2025) is the first paper to integrate sparse Mixture-of-Experts (MoE) into a DLM. With a configuration of 7B total parameters and 1.4B active parameters at inference, it was trained from scratch on a large-scale 20T-token dataset.

The significance of MoE integration can be organized as follows.

  • DLMs run a forward pass over all positions at each step, so reducing active parameters is more important than for AR
  • The configuration of 7B total with 1.4B active substantially lowers inference cost
  • On benchmarks, it is in the same range as Qwen2.5-3B-Instruct (knowledge, coding, reasoning)

LLaDA-MoE demonstrated that the MDLM objective combines straightforwardly with an MoE backbone. MoE is likely to become the standard configuration for future large-scale DLMs.
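A back-of-the-envelope comparison makes the first bullet above concrete. The numbers and the FLOPs rule of thumb (about \(2 \times\) active parameters per token per forward pass) are illustrative assumptions, not measurements from the paper.

def ar_decode_flops(params, new_tokens):
    """AR with a KV cache: one forward pass per generated token, ~2 * params FLOPs each."""
    return 2 * params * new_tokens

def dlm_decode_flops(params_active, seq_len, steps):
    """DLM: every denoising step is a full forward pass over all seq_len positions."""
    return 2 * params_active * seq_len * steps

# Illustrative: generating 1024 tokens
ar_cost  = ar_decode_flops(7e9, new_tokens=1024)               # dense 7B AR model
dlm_cost = dlm_decode_flops(1.4e9, seq_len=1024, steps=256)    # 7B-total / 1.4B-active MoE DLM
# The per-step full-sequence pass is why cutting active parameters pays off more for DLMs.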

Seed Diffusion: commercial-class open-source DLM

Seed Diffusion (Song et al. 2025) is a large-scale DLM by the ByteDance Seed team, released as an open-source model with commercial-class inference speed and performance. Detailed specifications are deferred to the paper, but the significance in this chapter’s context is as follows.

  • Combines commercial-grade high-speed inference with open-source release
  • Demonstrates that MDLM-family training at large scale scales beyond lab-level efforts
  • Positioned in the lineage of “commercial DLMs” alongside Mercury (Labs et al. 2025) and Gemini Diffusion (Google DeepMind 2024)

With these large-scale open-source releases, the DLM ecosystem has reached a level that can keep pace with the Llama / Qwen line of AR LLMs as of the second half of 2025.

Comparison of major models

The discussion so far is summarized in a single table. We restructured Table 1 of the survey (Li et al. 2025) focusing on this chapter’s context.

Table 2: Major discrete DLMs after D3PM / SEDD
| Model | Year | Starting point | Parameters | Training tokens | Main contribution |
|---|---|---|---|---|---|
| DiffusionBERT (Z. He et al. 2023) | 2023 | BERT-110M | 110M | 16B | BERT weight reuse, spindle schedule |
| RDM (Zheng et al. 2024) | 2024 | from-scratch | ~170M | medium | weighted CE, flexible decoding |
| MD4 (Shi et al. 2024) | 2024 | from-scratch | ~170M | medium | continuous-time variational objective |
| MDLM (Sahoo et al. 2024) | 2024 | from-scratch | 110M | 622B | Rao-Blackwellized weighted masked CE |
| DFM (Gat et al. 2024) | 2024 | from-scratch | 1.7B | 2.5T | discrete flow matching |
| RADD (Ou et al. 2024) | 2024 | from-scratch | ~170M | medium | time-independent factorization of the concrete score |
| DDPD (S. Liu et al. 2025) | 2024 | from-scratch | ~200M | medium | planner-denoiser separation |
| MGDM (Ye, Gao, et al. 2024) | 2024 | from-scratch | ~200M | medium | token-level reweighting for reasoning |
| LLaDA (Nie et al. 2025) | 2025 | from-scratch | 1B / 8B | 2.3T | first full-scale 8B DLM |
| DiffuLLaMA (Gong et al. 2025) | 2025 | LLaMA-7B adaptation | 7B | 65B | AR-to-DLM adaptation |
| Dream-7B (Ye et al. 2025) | 2025 | Qwen2.5-7B adaptation | 7B | 580B | adaptation from Qwen2.5 |
| GIDD (Rütte et al. 2025) | 2025 | from-scratch | ~170M | medium | mask + uniform interpolation, self-correction |
| LongLLaDA (X. Liu et al. 2025) | 2025 | LLaDA-8B | 8B | 2.3T | training-free NTK RoPE extrapolation |
| UltraLLaDA (G. He et al. 2025) | 2025 | LLaDA-8B | 8B | 2.3T + long-context data | 128K-context post-training |
| LLaDA-MoE (Zhu et al. 2025) | 2025 | from-scratch | 7B total (1.4B active) | 20T | sparse MoE integration |
| Seed Diffusion (Song et al. 2025) | 2025 | from-scratch | large | large | commercial-class open-source, fast inference |

The structure visible in Table 2 is summarized in the next section.

Converging directions

The developments seen so far are superficially diverse but have several points of convergence.

1. Masked diffusion (absorbing) is the de facto standard

The alternative formulations (DFM, GIDD) each have their own contributions, but the MDLM-family is dominant in terms of training, scaling, and implementation ease. RADD gave this choice theoretical backing, showing that “absorbing diffusion, any-order AR, and the MDLM-family share the same optimum.” This is strong evidence that absorbing masking is not an incidental choice, but a sufficiently universal one.

2. Division of labor in scaling-up techniques

At 8B-and-above scales, four axes are developing in parallel: from-scratch training, adaptation from AR, MoE integration, and long-context post-training. Typical examples of each are as follows.

  • from-scratch: LLaDA-8B (2.3T tokens), LLaDA-MoE (20T tokens)
  • AR adaptation: DiffuLLaMA, Dream-7B
  • MoE: LLaDA-MoE
  • long-context post-training: UltraLLaDA (128K)

These are not mutually exclusive and can be combined (for example, a long-context MoE in the vein of LLaDA-MoE + UltraLLaDA is likely to appear at the next stage).

3. Alternative formulations derive value from “adding functionality” rather than “outperforming”

DFM, DDPD, MGDM, and GIDD derive value not so much by outperforming the MDLM-family but by adding functionality MDLM does not have.

  • DFM: freedom in the source distribution
  • DDPD: explicit planner (refinement control)
  • MGDM: reasoning performance
  • GIDD: self-correction

This is structurally similar to how features such as “base model + chain-of-thought prompting + tool use” have been stacked up around AR LLMs, and it suggests that even for DLMs, the choice of base formulation remains an axis for “functional differentiation”.

4. The emergence of commercial-class open-source

With the arrival in 2025 of large-scale, high-speed-inference DLMs such as Mercury, Seed Diffusion, and Gemini Diffusion, DLMs have moved beyond the laboratory prototype phase to be positioned as a practical alternative to AR LLMs. Their existence marks a turning point at which subsequent research shifts focus from “can it scale” to “what can be done at scale.”

References

Austin, Jacob, Daniel D. Johnson, Jonathan Ho, Daniel Tarlow, and Rianne van den Berg. 2021. “Structured Denoising Diffusion Models in Discrete State-Spaces.” Advances in Neural Information Processing Systems. https://openreview.net/forum?id=h7-XixPCAL.
Gat, Itai, Tal Remez, Neta Shaul, et al. 2024. “Discrete Flow Matching.” Advances in Neural Information Processing Systems. https://arxiv.org/abs/2407.15595.
Gong, Shansan, Shivam Agarwal, Yizhe Zhang, et al. 2025. “Scaling Diffusion Language Models via Adaptation from Autoregressive Models.” International Conference on Learning Representations. https://arxiv.org/abs/2410.17891.
Google DeepMind. 2024. Gemini Diffusion. Product page. https://deepmind.google/technologies/gemini-diffusion/.
He, Gengfeng, Shen Nie, Fengqi Zhu, et al. 2025. “UltraLLaDA: Scaling the Context Length to 128K for Diffusion Large Language Models.” arXiv Preprint arXiv:2510.10481. https://arxiv.org/abs/2510.10481.
He, Zhengfu, Tianxiang Sun, Qiong Tang, Kuanning Wang, Xuanjing Huang, and Xipeng Qiu. 2023. “DiffusionBERT: Improving Generative Masked Language Models with Diffusion Models.” Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics. https://arxiv.org/abs/2211.15029.
Labs, Inception, Samar Khanna, Siddhant Kharbanda, et al. 2025. “Mercury: Ultra-Fast Language Models Based on Diffusion.” arXiv Preprint arXiv:2506.17298. https://arxiv.org/abs/2506.17298.
Li, Tianyi, Mingda Chen, Bowei Guo, and Zhiqiang Shen. 2025. “A Survey on Diffusion Language Models.” arXiv Preprint arXiv:2508.10875. https://arxiv.org/abs/2508.10875.
Liu, Sulin, Juno Nam, Andrew Campbell, et al. 2025. “Think While You Generate: Discrete Diffusion with Planned Denoising.” International Conference on Learning Representations. https://arxiv.org/abs/2410.06264.
Liu, Xiaoran, Zhigeng Liu, Zengyi Gao, Qiao He, Xiang Ao, and Xinyu Qiu. 2025. “LongLLaDA: Unlocking Long Context Capabilities in Diffusion LLMs.” arXiv Preprint arXiv:2506.14429. https://arxiv.org/abs/2506.14429.
Lou, Aaron, Chenlin Meng, and Stefano Ermon. 2024. “Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution.” Proceedings of the 41st International Conference on Machine Learning. https://arxiv.org/abs/2310.16834.
Nie, Shen, Fengqi Zhu, Zebin You, et al. 2025. “Large Language Diffusion Models.” arXiv Preprint arXiv:2502.09992. https://arxiv.org/abs/2502.09992.
Ou, Jingyang, Shen Nie, Kaiwen Xue, et al. 2024. “Your Absorbing Discrete Diffusion Secretly Models the Conditional Distributions of Clean Data.” arXiv Preprint arXiv:2406.03736. https://arxiv.org/abs/2406.03736.
Rütte, Dimitri von, Janis Fluri, Yuhui Ding, Antonio Orvieto, Bernhard Schölkopf, and Thomas Hofmann. 2025. “Generalized Interpolating Discrete Diffusion.” arXiv Preprint arXiv:2503.04482. https://arxiv.org/abs/2503.04482.
Sahoo, Subham Sekhar, Marianne Arriola, Yair Schiff, et al. 2024. “Simple and Effective Masked Diffusion Language Models.” Advances in Neural Information Processing Systems. https://openreview.net/forum?id=L4uaAR4ArM.
Shi, Jiaxin, Kehang Han, Zhe Wang, Arnaud Doucet, and Michalis K. Titsias. 2024. “Simplified and Generalized Masked Diffusion for Discrete Data.” Advances in Neural Information Processing Systems. https://openreview.net/forum?id=xcqSOfHt4g.
Song, Yuxuan, Zheng Zhang, Cheng Luo, et al. 2025. “Seed Diffusion: A Large-Scale Diffusion Language Model with High-Speed Inference.” arXiv Preprint arXiv:2508.02193. https://arxiv.org/abs/2508.02193.
Ye, Jiacheng, Jiahui Gao, Shansan Gong, et al. 2024. “Beyond Autoregression: Discrete Diffusion for Complex Reasoning and Planning.” arXiv Preprint arXiv:2410.14157. https://arxiv.org/abs/2410.14157.
Ye, Jiacheng, Shansan Gong, Liheng Chen, et al. 2024. “Diffusion of Thought: Chain-of-Thoughts Reasoning in Diffusion Language Models.” Advances in Neural Information Processing Systems. https://arxiv.org/abs/2402.07754.
Ye, Jiacheng, Zhihui Xie, Lin Zheng, et al. 2025. Dream 7B. Blog post. https://hkunlp.github.io/blog/2025/dream/.
Zhao, Siyan, Devaansh Gupta, Qinqing Zheng, and Aditya Grover. 2025. “D1: Scaling Reasoning in Diffusion Large Language Models via Reinforcement Learning.” arXiv Preprint arXiv:2504.12216. https://arxiv.org/abs/2504.12216.
Zheng, Lin, Jianbo Yuan, Lei Yu, and Lingpeng Kong. 2024. “A Reparameterized Discrete Diffusion Model for Text Generation.” First Conference on Language Modeling. https://arxiv.org/abs/2302.05737.
Zhu, Fengqi, Zebin You, Yipeng Xing, et al. 2025. “LLaDA-MoE: A Sparse MoE Diffusion Language Model.” arXiv Preprint arXiv:2509.24389. https://arxiv.org/abs/2509.24389.