```mermaid
flowchart TB
    subgraph Continuous["Continuous diffusion (DDPM / VP-SDE)"]
        C0["x_0 (clean image)"] --> C1["x_t = √ᾱ_t x_0 + √(1-ᾱ_t) ε"]
        C1 --> C2["x_T ~ N(0, I) (pure noise)"]
        C2 -.reverse SDE / ODE.-> C0
    end
    subgraph Discrete["Discrete diffusion (MDLM)"]
        D0["x_0 (clean sequence)"] --> D1["x_t: each position MASK with prob t"]
        D1 --> D2["x_1: all positions MASK"]
        D2 -.denoising loop.-> D0
    end
    Continuous -.same ELBO structure.-> Discrete
```
Continuous vs Discrete Diffusion: Bridging the Two
The continuous diffusion models developed for image generation (DDPM (Ho et al. 2020), Score-Based Models (Song et al. 2021), VP-SDE, etc.) and the discrete diffusion models developed for language (masked / absorbing diffusion exemplified by MDLM (Sahoo et al. 2024)) correspond strongly in structure, but their mathematical objects differ. This chapter organizes the correspondence between the two and clarifies “how far the knowledge from the continuous side can be carried over to the discrete side” and “from where a translation becomes necessary.”
Conclusion
Let us state the conclusion up front. The relationship between continuous diffusion and discrete diffusion (masked / absorbing type) can be summarized as follows.
Parts with strong correspondence:
- The structure of adding noise in the forward and removing it in the reverse
- The flow of deriving the loss from the ELBO
- The simplification of the loss via SNR-like weighting
- The framework of guidance (classifier-free guidance)
Parts that do not correspond:
- The score function \(\nabla_x \log p(x)\) — it cannot be naturally defined for discrete variables
- SDE / probability flow ODE — mathematics that presupposes continuous variables
- The distinction between VE / VP (variance exploding / variance preserving) — the concept of variance does not directly correspond in the absorbing process
If you keep the knowledge of continuous diffusion as a “template,” the formulas of MDLM can be read in a single pass. However, the correct translation is to understand that on the discrete side, the same objective is achieved by the cross-entropy of \(x_0\)-prediction rather than by the score.
“Use the knowledge of continuous diffusion as the template for MDLM’s formulas. But do not import the score-centric viewpoint too much.”
Correspondence Table
Lining up the main concepts of the two sides yields the following table.
| Concept | Continuous diffusion | MDLM (masked / absorbing discrete diffusion) |
|---|---|---|
| State space | \(\mathbf{x} \in \mathbb{R}^d\) | Token sequence \(x \in \mathcal{V}^L\) (discrete) |
| Time | \(t \in [0, T]\) (continuous) | \(t \in [0, 1]\) (continuous) |
| Forward process | VP-SDE: \(d\mathbf{x}_t = -\tfrac{1}{2}\beta(t)\mathbf{x}_t\,dt + \sqrt{\beta(t)}\,d\mathbf{w}_t\) | Each token independently absorbed into [MASK] (masked with probability \(t\)) |
| Marginal distribution | \(\mathbf{x}_t = \sqrt{\bar\alpha_t}\mathbf{x}_0 + \sqrt{1-\bar\alpha_t}\boldsymbol{\epsilon}\) | \(x_t^i = \texttt{[MASK]}\) w.p. \(t\), else \(x_0^i\) |
| Signal ratio | \(\text{SNR}(t) = \bar\alpha_t / (1-\bar\alpha_t)\) | \(\text{SNR}(t) \propto (1-t)/t\) |
| Training objective | DSM: \(\mathbb{E}[\lambda(t)\|\mathbf{s}_\theta(\mathbf{x}_t,t) - \nabla\log p(\mathbf{x}_t\mid \mathbf{x}_0)\|^2]\) | Masked CE with weight \(1/t\) |
| Parametrization | \(\epsilon\)-pred / \(\mathbf{x}_0\)-pred / score-pred | \(x_0\)-prediction |
| Reverse process | Reverse-time SDE or probability flow ODE | Discrete-time denoising loop |
| Acceleration | DDIM (deterministic ODE) | semi-AR sampling, block-parallel unmask |
| Guidance | Classifier / Classifier-Free Guidance | Classifier-Free Guidance directly transferable |
| Latent formulation | LDM (Stable Diffusion) | Non-standard in DLLM (not yet established) |
What Table 1 makes clear is that only the way the forward is set up differs, while the rest of the skeleton follows the same flow. The continuous side destroys the state with Gaussian noise; the discrete side erases information with the [MASK] absorbing state. Either way, the framework of “gradually erase information → restore in reverse” is shared.
Viewed as a flow, the structural correspondence between the two processes is shown in Figure 1.
Forward Process Correspondence in Detail
The mathematical objects of the two forwards differ, but the design philosophy of “gradually erase information” matches completely. Here we look at the correspondence at the formula level.
The forward of continuous diffusion (VP-SDE) is
\[ d\mathbf{x}_t = -\tfrac{1}{2}\beta(t) \mathbf{x}_t \, dt + \sqrt{\beta(t)} \, d\mathbf{w}_t \]
and the discretized DDPM transition is
\[ q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}(\mathbf{x}_t; \sqrt{1-\beta_t} \mathbf{x}_{t-1}, \beta_t \mathbf{I}). \]
The marginal at an arbitrary time can be written in closed form as \(\mathbf{x}_t = \sqrt{\bar\alpha_t}\mathbf{x}_0 + \sqrt{1-\bar\alpha_t}\boldsymbol{\epsilon}\).
On the other hand, the forward of MDLM is the simple setting where each token is independently replaced by [MASK] with probability \(t\):
\[ q(x_t^i \mid x_0^i) = \begin{cases} \delta_{x_0^i} & \text{w.p. } 1 - t \\ \delta_{\texttt{[MASK]}} & \text{w.p. } t \end{cases} \]
What both share is:
- A complete data state at \(t=0\) and a fully information-destroyed state at \(t=1\) (or \(T\))
- The marginal at any time can be written in closed form (it suffices to sample a single step at training time)
- The forward is a fixed process with no learnable parameters
In particular, the “closed form” property is the essential reason training becomes efficient on both sides. On the continuous side it is the Gaussian reparametrization; on the discrete side it is the independence at each position — the tools differ, but the purpose is the same.
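To make the correspondence concrete, here is a minimal sketch of one-shot forward sampling on both sides (assuming PyTorch; `mask_id` and the shape conventions are placeholder assumptions, not taken from any particular codebase):

```python
import math
import torch

def forward_continuous(x0, alpha_bar_t):
    # Gaussian reparametrization: x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps
    eps = torch.randn_like(x0)
    return math.sqrt(alpha_bar_t) * x0 + math.sqrt(1.0 - alpha_bar_t) * eps

def forward_discrete(x0_tokens, t, mask_id):
    # Absorbing forward: each position is independently replaced by [MASK] with probability t
    absorb = torch.rand(x0_tokens.shape) < t
    return torch.where(absorb, torch.full_like(x0_tokens, mask_id), x0_tokens)
```

In both cases a single call reaches an arbitrary \(t\), which is exactly the “closed form” property discussed above.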
Four Strongly Corresponding Points
Here we narrow down the most important “strong correspondences” of Table 1 to four points and look at them in detail.
Derivation Structure from ELBO to the Simplified Loss
In continuous diffusion, it is known that rewriting the ELBO in terms of the SNR (signal-to-noise ratio) transforms it into DSM (denoising score matching, a weighted L2) (Kingma et al. 2021). Specifically, the terms of the ELBO across timesteps cancel each other out, ultimately reducing to
\[ \mathcal{L}_{\text{cont}} = \mathbb{E}_{t, \mathbf{x}_0, \boldsymbol{\epsilon}} \left[ w(t) \, \| \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) - \boldsymbol{\epsilon} \|^2 \right] \]
a weighted squared error for noise prediction.
In MDLM, the discrete version of the same ELBO transforms into a masked cross-entropy with weight \(1/t\).
\[ \mathcal{L}_{\text{MDLM}} = \mathbb{E}_{t, x_t} \left[ \frac{1}{t} \sum_i \mathbf{1}[x_t^i = \texttt{[MASK]}] \left( -\log p_\theta(x_0^i \mid x_t) \right) \right] \]
The weight \(1/t\) corresponds to the discrete version of the SNR-based weight. The loss is weighted heavily as \(t \to 0\) (almost no masks, large SNR) and lightly as \(t \to 1\) (almost fully masked, small SNR), which also parallels the continuous side.
The lesson here is that the continuous-side intuition that “depending on the parametrization, term cancellation in the ELBO yields a clean loss” carries over directly to MDLM. For someone who has been through continuous diffusion once, the surprise of “why does it reduce to such a concise formula?” lands as déjà vu.
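As a rough illustration of how this objective might look in code, the following sketch assumes a placeholder `model(x_t)` that returns per-position logits over the vocabulary; the clamping of \(t\) and the absence of length normalization are simplifications for illustration, not MDLM's exact implementation:

```python
import torch
import torch.nn.functional as F

def mdlm_loss(model, x0_tokens, mask_id):
    b, L = x0_tokens.shape
    t = torch.rand(b, 1).clamp(min=1e-3)        # one masking level per example; avoid 1/t blowing up
    masked = torch.rand(b, L) < t               # closed-form forward: mask each position w.p. t
    x_t = torch.where(masked, torch.full_like(x0_tokens, mask_id), x0_tokens)

    logits = model(x_t)                         # (b, L, vocab); placeholder denoiser
    nll = F.cross_entropy(logits.transpose(1, 2), x0_tokens, reduction="none")  # (b, L)

    # Masked cross-entropy with the 1/t weight: only masked positions contribute.
    return ((nll * masked.float()).sum(dim=1) / t.squeeze(1)).mean()
```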
“Unifying Different Formulations” via SNR
On the continuous side, it is known that Score-Based Models (VE) and DDPM (VP) can be unified via SNR. The two differ in how the forward is written, but viewed through the single quantity \(\text{SNR}(t)\), they can be treated as the same family (Kingma et al. 2021).
In the same spirit, on the discrete side we can view the following in parallel as “differences in the SNR schedule of the forward”:
- D3PM’s uniform transition: replacement by a random other token rather than by mask
- MDLM’s absorbing transition: absorption into the [MASK] state
- SEDD’s continuous-time discrete score: learning probability ratios in continuous time
These differ only in how the forward is set up (where the information escapes to, and at what rate), but their skeleton is the same. Just as the perspective that abstracts SBM and DDPM by one level is useful in continuous diffusion, taking a similarly elevated perspective in discrete diffusion makes the overview easier.
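As a small numerical illustration of this shared perspective, the snippet below compares \(\text{SNR}(t)\) for a continuous VP forward (using a cosine schedule purely as an illustrative choice) with the \((1-t)/t\) of the absorbing forward; both decrease monotonically from \(t=0\) to \(t=1\):

```python
import math

def snr_vp_cosine(t):
    # Continuous VP with an illustrative cosine schedule: abar_t = cos(pi * t / 2)^2
    abar = math.cos(0.5 * math.pi * t) ** 2
    return abar / (1.0 - abar)

def snr_absorbing(t):
    # Masking / absorbing forward: the "signal" is the unmasked fraction 1 - t
    return (1.0 - t) / t

for t in (0.1, 0.5, 0.9):
    print(f"t={t}: VP-cosine SNR={snr_vp_cosine(t):.3f}, absorbing SNR={snr_absorbing(t):.3f}")
```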
\(x_0\)-prediction Parametrization
In continuous diffusion, the three options
- predict \(\epsilon\) (noise prediction)
- predict \(\mathbf{x}_0\) (data prediction)
- predict the score \(\nabla \log p\) (score prediction)
are known to be equivalent up to weighting. Implementation-wise, \(\epsilon\)-prediction is widely used, while \(\mathbf{x}_0\)-prediction has the advantage of interpretability at sampling time.
MDLM can be read as a straightforward discrete version of the \(x_0\)-prediction among these three. For masked positions, it predicts “what was the correct token before being masked.” It focuses solely on \(x_0\), without considering quantities corresponding to \(\epsilon\) or the score.
As a consequence, the loss reduces to cross-entropy — the loss most familiar to language models. For readers who have the intuition for taking the \(\mathbf{x}_0\)-pred parametrization on the continuous side, MDLM appears natural.
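To see why focusing on \(x_0\) loses nothing on the continuous side, recall that the three parametrizations are tied together by the marginal; a minimal sketch of the \(\epsilon\)-pred to \(x_0\)-pred conversion, with symbols following the formulas above:

```python
import math

def x0_from_eps(x_t, eps_hat, alpha_bar_t):
    # Invert x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps to get the implied x_0 estimate
    return (x_t - math.sqrt(1.0 - alpha_bar_t) * eps_hat) / math.sqrt(alpha_bar_t)
```

On the discrete side there is nothing to convert: the network's softmax over the vocabulary is already the \(x_0\) prediction.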
Shape of the Inference Loop
The appearance of the reverse process also corresponds strongly between the two.
| Continuous diffusion (DDPM) | Discrete diffusion (MDLM) |
|---|---|
| \(\mathbf{x}_T \to \mathbf{x}_{T-1} \to \cdots \to \mathbf{x}_0\) | All [MASK] \(\to\) partially unmasked \(\to \cdots \to\) fully unmasked |
| Proceed via score / noise prediction | Proceed via \(x_0\)-prediction |
| Number of steps \(T\) is a hyperparameter trading off quality vs compute | Same |
| Deterministic (DDIM) vs stochastic | Deterministic (greedy unmask) vs stochastic |
As shown in Table 2, the trade-off structure “more steps gives better quality but more compute” is shared by both. Both also offer a choice between deterministic and stochastic progression, and share the empirical observation that making the process deterministic speeds it up while limiting quality degradation.
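A minimal sketch of the discrete-side loop, assuming a placeholder `model(x)` that returns per-position logits; the linear time grid and the single-sequence batch are simplifications for illustration:

```python
import torch

@torch.no_grad()
def mdlm_sample(model, seq_len, mask_id, num_steps=32):
    x = torch.full((1, seq_len), mask_id, dtype=torch.long)   # start from all [MASK]
    ts = torch.linspace(1.0, 0.0, num_steps + 1)              # walk t: 1 -> 0
    for t, s in zip(ts[:-1], ts[1:]):
        probs = model(x).softmax(dim=-1)                      # (1, L, vocab); placeholder denoiser
        proposal = torch.distributions.Categorical(probs).sample()
        still_masked = x == mask_id
        # Reveal each still-masked position with probability (t - s) / t, else keep it masked.
        reveal = still_masked & (torch.rand(x.shape) < (t - s) / t)
        x = torch.where(reveal, proposal, x)
    return x
```

Increasing `num_steps` plays the same role as increasing \(T\) on the continuous side; replacing the stochastic reveal by a confidence-based choice gives the deterministic variant discussed later.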
Parts That Do Not Correspond / Require Translation
From here, we look at points where trying to import the knowledge of continuous diffusion directly to the discrete side fails.
The Score Function Cannot Be Naturally Defined for Discrete Variables
At the core of continuous diffusion is the score function \(\nabla_\mathbf{x} \log p(\mathbf{x})\). It is the vector field representing “in which direction to move the current data point so as to increase density,” and the sampling reverse SDE / ODE is essentially written using this score.
However, for discrete variables, the very concept of \(\nabla_x\) is meaningless to begin with. When \(x\) is a token ID (a categorical variable), “differentiation” cannot be defined.
How discrete diffusion models respond to this issue is where they branch out significantly.
- MDLM’s choice: Give up on the score and write it as the cross-entropy of \(x_0\)-prediction (concise, practical)
- SEDD’s choice: Learn the probability ratio \(p(y)/p(x)\), called the “concrete score,” via ratio matching
The fact that MDLM reduces to “BERT’s random mask prediction” emerges directly as a consequence of avoiding the score function. This is a degree of freedom not present on the continuous side, and one could say “the discreteness is precisely what makes it concise.”
SEDD is an attempt to define a score-like quantity even for discrete variables, learning a probability ratio. It has a mathematical structure close to the continuous-side score but is more complex to implement and train than MDLM. MDLM swings to the opposite direction and chose “not to use a score.” For a comparison of the two, see D3PM and SEDD: Alternative Formulations of Discrete Diffusion.
SDE / Probability Flow ODE Are Stories of Continuous Variables
The reverse of continuous diffusion can be written as a reverse-time SDE (Anderson 1982) or a probability flow ODE (Song et al. 2021). These are mathematical duals of the forward written as an SDE, and the sampling algorithms (Euler-Maruyama, Heun’s method, etc.) are also borrowed from SDE/ODE numerical solvers.
In contrast, the discrete absorbing process is a continuous-time Markov chain (CTMC), which is a jump process rather than an SDE. The state transitions by discrete jumps. The reverse-time SDE of Anderson 1982 cannot be used as is.
As an alternative, one writes down the discrete reverse rate directly. D3PM (§3) and MDLM (§2) take the form of deriving the reverse transition probability from Bayes’ rule applied to the forward transition matrix. The look of the formulas differs, but the structure that “mathematics exists for inverting the forward” is the same.
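Concretely, for the absorbing forward defined earlier, Bayes’ rule gives the reverse posterior for a masked position at times \(s < t\):
\[ q(x_s^i \mid x_t^i = \texttt{[MASK]}, x_0^i) = \begin{cases} \delta_{\texttt{[MASK]}} & \text{w.p. } s/t \\ \delta_{x_0^i} & \text{w.p. } (t-s)/t \end{cases} \]
Substituting the model’s prediction \(p_\theta(x_0^i \mid x_t)\) for the unknown \(x_0^i\) yields the transition used in the denoising loop; positions that are already unmasked are simply carried over.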
The SDE/ODE chapters of continuous diffusion do not apply directly to the discrete side. As a reader, it suffices to carry over the sense that “math for inverting the forward exists”, and one needs to separately follow the discrete version for the specific formulas.
The VE / VP Distinction Cannot Be Carried Over
Continuous diffusion has an important classification axis between VE (Variance Exploding) and VP (Variance Preserving).
- VE: \(\mathbf{x}_t = \mathbf{x}_0 + \sigma(t) \boldsymbol{\epsilon}\), the variance blows up over time
- VP: \(\mathbf{x}_t = \sqrt{\bar\alpha_t}\mathbf{x}_0 + \sqrt{1-\bar\alpha_t}\boldsymbol{\epsilon}\), the variance is kept within a fixed range
However, the absorbing process has no axis of “variance increasing / preserved.” This is because all discrete absorbing processes are a family where “information disappears in one direction.”
The classification axis that is meaningful on the discrete side is rather where the probability mass escapes to when a token is corrupted.
- uniform: transition to any token with equal probability (one choice in D3PM)
- absorbing: absorption into the special [MASK] state (MDLM’s choice)
- Gaussian-like: Gaussian-like diffusion in the embedding space (discretized Gaussian, another choice in D3PM)
Discard the VE/VP axis of the continuous side and translate it into the different axis of “where the information escapes to” — this is the correct way to bridge.
| Concept on the continuous side | Correspondence on the discrete side | Type of bridging |
|---|---|---|
| Score function \(\nabla_x \log p(x)\) | \(x_0\)-prediction CE (MDLM) or probability ratio (SEDD) | Translation required |
| SDE / probability flow ODE | CTMC (jump process) | Translation required |
| VE / VP | uniform / absorbing / discretized Gaussian | Translation required |
| SNR schedule | SNR \(\propto (1-t)/t\) | Directly usable |
| ELBO | ELBO | Directly usable |
| Classifier-Free Guidance | Classifier-Free Guidance | Directly usable |
The top three rows of Table 3 (score, SDE/ODE, VE/VP) require “translation,” while the bottom three rows (SNR, ELBO, CFG) are “directly usable.”
Correspondence in Sampling Acceleration
Both continuous and discrete diffusion share the common motivation of “wanting to accelerate by reducing the number of steps.” The techniques developed on each side look different on the surface, but they share common principles.
Continuous Side: DDIM and Probability Flow ODE
The representative accelerations on the continuous side are DDIM (Song et al. 2020) and the probability flow ODE (Song et al. 2021).
- DDIM: while keeping the same forward marginal, parametrically deform the reverse to enable a deterministic transition. High-quality generation is possible even with 10–50 steps
- probability flow ODE: an ODE with the same marginal as the SDE. It has deterministic trajectories, and ODE solver acceleration techniques (DPM-Solver, etc.) can be applied
Both are based on the insight that “removing the stochastic noise injection and making the trajectory deterministic allows one to track it even with few steps.”
Discrete Side: semi-AR Sampling and Block-Parallel Unmask
Representative accelerations on the discrete side are as follows.
- Greedy / confidence-based unmask: at each step, fix the top-\(k\) tokens with highest confidence. A strategy originating from MaskGIT
- Semi-AR sampling: divide the sequence into blocks, unmask in parallel within a block, and progress left-to-right between blocks
- Low-confidence remasking: even positions once unmasked are re-masked and rewritten if their confidence is low
These look different from DDIM on the continuous side, but they share the strategy of “deterministically fix the most confident parts first.” Whereas DDIM on the continuous side “removes the noise and draws a smooth trajectory,” greedy unmask on the discrete side “fixes the most confident tokens first” — each is a determinization tailored to its mathematical object.
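A minimal sketch of one confidence-based unmask step in the MaskGIT spirit (assuming a placeholder `model(x)` that returns per-position logits; `k` must not exceed the number of currently masked positions):

```python
import torch

@torch.no_grad()
def confidence_unmask_step(model, x, mask_id, k):
    probs = model(x).softmax(dim=-1)              # (1, L, vocab); placeholder denoiser
    conf, proposal = probs.max(dim=-1)            # per-position confidence and argmax token
    conf = conf.masked_fill(x != mask_id, -1.0)   # never reconsider already-unmasked positions
    idx = conf.topk(k, dim=-1).indices            # the k most confident masked positions
    return x.scatter(1, idx, proposal.gather(1, idx))
```

Semi-AR sampling applies this kind of step block by block from left to right; low-confidence remasking additionally allows revealed positions to be flipped back to [MASK].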
| Axis | Continuous-side acceleration | Discrete-side acceleration |
|---|---|---|
| Determinization | DDIM, probability flow ODE | greedy unmask, top-\(k\) confidence |
| Parallelization | Batch-dimension parallelism | Position parallelism within a sequence (intra-block) |
| Distillation | progressive distillation | step distillation (DLLM version) |
| Solver | DPM-Solver, Heun | semi-AR scheduler |
As shown in Table 4, the two can be organized as “responding to the same problem (reducing the number of steps) with tools tailored to their respective mathematical objects.”
Guidance Is Common to Both
Classifier-Free Guidance (CFG) is a method established in continuous diffusion, but it can be directly transferred to discrete diffusion.
On the continuous side, the conditional prediction \(\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, c)\) and the unconditional prediction \(\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, \emptyset)\) are extrapolated with guidance scale \(w\):
\[ \tilde{\boldsymbol{\epsilon}} = (1 + w) \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, c) - w \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, \emptyset) \]
The same framework works on the discrete side. The conditional / unconditional extrapolation is applied to the predicted distribution over \(x_0\) in log-probability (logit) space:
\[ \log \tilde{p}(x_0 \mid x_t) = (1 + w) \log p_\theta(x_0 \mid x_t, c) - w \log p_\theta(x_0 \mid x_t, \emptyset) \]
This is logit-space guidance, and it naturally connects to logit-bias methods in AR LLMs. Anyone who has understood guidance in continuous diffusion may carry that intuition over to the discrete side as is.
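A minimal sketch of the discrete-side CFG combination, assuming the conditional and unconditional logits come from the same placeholder model evaluated with and without the condition:

```python
import torch

def cfg_logits(logits_cond, logits_uncond, w):
    # Extrapolate the conditional prediction past the unconditional one in logit space;
    # the softmax applied afterwards takes care of renormalization.
    return (1.0 + w) * logits_cond - w * logits_uncond

# e.g. guided_probs = cfg_logits(model(x_t, c), model(x_t, None), w=2.0).softmax(dim=-1)
```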
Classifier Guidance (the scheme using gradients of an external classifier) added the classifier’s gradient \(\nabla_\mathbf{x} \log p(c \mid \mathbf{x}_t)\) to the score on the continuous side. On the discrete side, the gradient cannot be defined, so a direct translation is not possible, but alternative forms adding the classifier’s score to the logits have been considered. In practice, CFG is simpler, and CFG is the mainstream on the discrete side as well.
Status of Latent-Space Formulation (Latent Diffusion)
One of the major practical breakthroughs in continuous diffusion is the Latent Diffusion Model (LDM). Widely known through Stable Diffusion, this method runs diffusion not in pixel space but in the latent space learned by a VAE, drastically reducing computational cost.
On the DLLM side, this “latent formulation” is not currently established as a standard method. Several reasons can be considered.
- A token sequence is already a “compressed” representation in discrete form, so the motivation to further latentize is weak
- The meaning of language is combinatorial, and smooth interpolation in latent space may not function as well as it does for images
- In the world of AR LLMs, latent formulations are not mainstream, and research comparison points are scarce
Research on DLLMs that run diffusion in latent space does exist (such as diffusion over latent embeddings), but none of it has yet become as decisively dominant as LDM did on the image side.
While the LDM on the continuous diffusion side took shape via the step “pixel space → latent space,” DLLMs already operate in discrete token space from the outset and are designed not to require an additional “latent formulation.” This is not a drawback but a consequence of having different goals.
Practical Recommendations (A Reading Guide)
For readers with knowledge of continuous diffusion entering the DLLM literature, the efficient reading order is as follows.
- Go through the standard formulation of continuous diffusion once: DDPM, DSM, the SNR-unified perspective (reading the VDM paper is efficient)
- Do not chase SDE/ODE chapters too deeply; focus on score matching and guidance
- Read Sander Dieleman’s blog post “Diffusion language models” to grasp the bridge between discrete and continuous
- Enter MDLM. Use continuous knowledge as a “template” and understand that the discrete side achieves the same purpose with a different tool (\(x_0\)-prediction CE)
Reading in this order, when you look at the MDLM paper, phenomena such as “the ELBO terms cancel and become concise,” “an SNR-like weight emerges,” and “it becomes a cross-entropy of \(x_0\)-prediction” all fall into place as discrete versions of known continuous-side patterns.
Conversely, reading only MDLM without going through continuous diffusion risks making “why the weight \(1/t\)” and “why the ELBO simplifies so cleanly” appear as if handed down from above. The continuous-side intuition has value as scaffolding for reading the discrete-side formulas “without surprise.”
Once you have gone through “ELBO → DSM simplification” on the continuous side, the MDLM “ELBO → weighted masked CE” reads at once as the discrete version of the same phenomenon. The score / SDE / VE-VP cannot be carried over to the discrete side, so they must be replaced by different tools there.
Recommendation of Sander Dieleman’s Blog Posts
The blog of DeepMind researcher Sander Dieleman (sander.ai) is the single most efficient source for acquiring the mathematical intuition behind diffusion models in general. Grasping the meta-level overview of “what is being done” from these articles before chasing the formulas in papers makes subsequent paper reading dramatically faster.
Listed in order of relevance to DLLMs:
- “Diffusion language models”: organizes the relationship between discrete diffusion and AR, and the philosophy of whether diffusion makes sense for language. The article closest to the theme of this book
- “Diffusion is spectral autoregression”: the view that diffusion and AR are not opposites but on a continuum. While on the continuous side, it is also suggestive when thinking about the relationship between discrete DLLMs and AR
- “The geometry of diffusion guidance”: a geometric understanding of guidance in continuous diffusion. Provides intuition that can be applied to the discrete side as well
These are blog posts rather than papers, but for the purpose of efficiently acquiring the “template” of diffusion models, they are often superior to papers.
Common Misconceptions
When entering DLLMs from continuous diffusion, three particularly easy traps:
Misconception 1: “MDLM is exactly a discrete version of DDPM”
Partially correct, but the treatment of the score function is essentially different. DDPM has a formulation equivalent to score matching, while MDLM writes its objective with the CE of \(x_0\)-prediction without using a score. When calling MDLM “a discrete version of DDPM,” one should keep it at the level of correspondence in the forward structure and the ELBO structure.
Misconception 2: “DDIM can be directly imported into MDLM”
The philosophy of deterministic sampling is shared, but the specific formula of DDIM (the ODE form with \(\eta=0\)) presupposes continuous variables. On the discrete side, “determinization” means greedy unmask or top-\(k\) confidence selection. They are different algorithms and cannot be ported at the code level.
Misconception 3: “The theory of SDE / ODE automatically carries over to the discrete side”
Both the reverse-time SDE (Anderson 1982) and the probability flow ODE are mathematics of continuous variables. The reverse on the discrete side is different mathematics derived from the CTMC forward, and the various theorems of SDE/ODE (convergence, acceleration methods) cannot be automatically carried over. Discrete-version corresponding theorems are required.
A Different Perspective: The Continuity Between Diffusion and AR
Sander Dieleman’s “Diffusion is spectral autoregression” presents a view of continuous diffusion and AR as not opposites but on a continuum. The forward of continuous diffusion can be seen, in the frequency domain, as “a process of erasing information from high frequencies to low frequencies,” which can be interpreted as a kind of “spatial AR.”
This view also suggests something for DLLMs.
- AR LLM: fix tokens one by one from left to right (AR along the time axis)
- DLLM: while holding the entire sequence, gradually fix from positions with high confidence (AR along the confidence axis)
Viewing the two as a difference in “along which axis to AR,” DLLM is positioned as a relative of the AR LLM. The configuration corresponding to the relationship between continuous diffusion and image AR models exists on the discrete side as well.
```mermaid
flowchart LR
    A["AR LLM<br/>Axis: time (left→right)"] --> B["DLLM<br/>Axis: confidence (high→low)"]
    B --> C["Continuous Diffusion<br/>Axis: SNR (high→low)"]
    C --> D["Spectral AR<br/>Axis: frequency (high→low)"]
```
Lined up as in Figure 2, AR and diffusion appear on a continuum as differences in “along which axis to sequentially decide.”
Summary
Continuous diffusion and discrete diffusion (the MDLM family) correspond strongly in structure but differ in mathematical objects.
- The derivation structure “ELBO → simplified loss” is shared
- The unified perspective of formulations via SNR is shared
- The \(x_0\)-prediction parametrization and the shape of the inference loop are shared
- The framework of guidance is shared
On the other hand,
- The score function cannot be naturally defined for discrete variables → MDLM avoids it with \(x_0\)-prediction CE
- SDE / ODE presuppose continuous variables → the discrete side uses CTMC (jump process)
- The VE / VP distinction cannot be carried over → on the discrete side, the classification axis is “where to let the information escape”
There is great value in keeping the knowledge of continuous diffusion as a “template.” However, fixating on the score-centric view will cause one to miss the conciseness of the discrete side. MDLM on the discrete side is most naturally read as arriving at cross-entropy — the inherent loss of language models — by giving up on the score.