```mermaid
flowchart TD
    Start["all positions [MASK]"] --> Pred["predict at all positions"]
    Pred --> Q1{"how many to unmask?"}
    Q1 --> Q2{"which positions to unmask?<br/>(confidence? random?)"}
    Q2 --> Q3{"remask or not?<br/>if so, what?"}
    Q3 --> Q4{"semi-AR?<br/>block width?"}
    Q4 --> Check{"all positions fixed?"}
    Check -- "No" --> Pred
    Check -- "Yes" --> End["output"]
```
# Open Problems in the DLLM Field
This chapter serves as the conclusion of the book. Compared to autoregressive (AR) LLMs, diffusion language models (DLLMs) remain a field in which many areas are still unsettled and there is substantial room for research. By contrast, AR LLMs already have an enormous body of techniques systematized into procedures, and the ecosystem is mature. In this chapter we compare the two situations domain by domain, and organize the open problems and research directions on the DLLM side.
## “There is room to maneuver” is a correct perception
DLLMs are still developing both theoretically and in terms of implementation. The concise formulation of MDLM (Sahoo+ 2024) and the scale-up by LLaDA (Nie+ 2025) have established “the skeleton of the modern DLLM,” but the surrounding areas remain open space: no standards have been firmly established in training recipes, sampling, inference-time intervention, evaluation, or theory.
While this situation is an opportunity for researchers, for practitioners it carries a cost: if you adopt DLLMs today, you must assemble the recipe yourself. Even merely translating the toolkit that already exists on the AR LLM side (scaling laws, instruction tuning, inference-time intervention, an established suite of evaluation benchmarks) over to DLLMs generates a considerable number of research topics.
The statement that “DLLMs still have room to maneuver” is not an irresponsible observation, but rather a correct characterization of the current research landscape. For readers who have worked through this book, this chapter plays the role of mapping where this room exists.
What is important here is to map this room domain by domain: “training,” “sampling,” “inference-time intervention,” “evaluation,” “theory,” and “architecture.” Many discussions lump everything together as “DLLMs are still immature” without distinguishing domains, but in fact maturity differs by domain. For example, training recipes have reached a practical baseline with LLaDA, whereas inference-time intervention is almost untouched. The comparison table and the subsequent sections make this distinction explicit.
## Domain-by-domain comparison
The maturity of AR LLMs and DLLMs in major domains is summarized in Table 1.
| Domain | AR LLM status | DLLM status |
|---|---|---|
| Main baselines | GPT-4 / Claude / Llama are the de facto standard | LLaDA / Dream have only just appeared |
| Training recipes | Scaling laws, instruction tuning, RLHF/DPO etc. systematized | Not yet established; mask schedule and choice of noise are under investigation |
| Sampling | top-p / top-k / temperature / contrastive decoding etc. mature | confidence-based unmasking, remasking, semi-AR etc. still developing |
| Inference-time intervention | self-consistency, CoT, ToT, MBR, verification etc. mature | Yet to take off in earnest |
| Evaluation benchmarks | MMLU, GSM8K, MATH, HumanEval etc. established | The same benchmarks are reused, but DLLM-specific axes of evaluation are unexplored |
| Theory | scaling laws, in-context learning theory etc. are becoming mature | Expressiveness of mask diffusion, correspondence with AR, convergence etc. still in early stages |
| Architecture | decoder-only is the de facto standard | encoder-decoder, bidirectional attention, hybrid etc. still in flux |
This table is not meant to be read as “DLLMs are inferior to AR.” DLLMs are a framework that attempts language generation with a different factorization from AR, and a comparison between the two is more appropriately captured as “mapping different terrains” rather than as “time differences in the same race.” The work of formally reproducing what has been established in AR within DLLMs, and the work of drawing out DLLM-specific strengths (parallelism, bidirectionality, naturalness of editing) are each progressing independently.
Below, we organize the situation and open questions for each domain individually.
## Training recipes
For AR LLMs, scaling laws (Kaplan, Chinchilla), instruction tuning (FLAN, SuperNI family), and preference optimization such as RLHF or DPO are published as recipes that are essentially systematized as procedures. They have a level of maturity such that reading the papers brings you close to reproduction, and even newcomers are in a position where they can build on top of existing recipes.
The DLLM side has not reached this point. The MDLM loss function itself has an extremely clear form as “weighted BERT training,” but the surrounding recipes for growing that equation into a model of practical quality (time-step distribution, data mixing, warmup, evaluation loops) are not yet settled. The LLaDA paper describes many settings, but at present it is not possible to distinguish whether they were adopted “because they are optimal” or “because they worked.”
The following can be listed as open questions.
- Mask schedule: From what distribution should the time \(t\) be sampled at training time? Uniform, cosine, logarithmic, or task-dependent? On the continuous-diffusion side, SNR-aware weighting has become standard practice, but what is the corresponding choice on the discrete side?
- Choice of noise: Can transitions other than absorbing (uniform transition, hybrid) win at scale, or is absorbing dominantly advantageous? The D3PM framework permits diverse choices, but a map of relative advantage by scale is not organized
- Data efficiency: Compared to AR models of the same size, is data efficiency better or worse, and how should the trade-off be measured? Even looking at the same number of tokens, DLLMs have the aspect that they can use multiple mask patterns from a single sample as learning signals
- Instruction following: At SFT time, should the mask rate be the same distribution as at training time, or should it be adjusted task-dependently? The impact of choices such as not masking instruction text / masking only outputs is also not yet organized
- Preference optimization: How should preference optimization corresponding to DPO / GRPO on the AR side be constructed in DLLMs? Should comparisons be at the trajectory level, or step by step?
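To make the mask-schedule question concrete, the following toy sketch (plain Python, not from any published recipe) contrasts three candidate distributions for the training time \(t\) and applies absorbing noise with the \(1/t\) weight that the MDLM derivation yields under the linear schedule \(\alpha_t = 1 - t\). The function names and the specific schedules are illustrative assumptions.

```python
import math
import random

def sample_t(schedule="uniform"):
    """Sample a diffusion time t in (0, 1] under one of three candidate
    distributions. Which distribution is optimal is exactly the open
    question in the text; these are illustrative, not established recipes."""
    u = random.random()
    if schedule == "uniform":
        t = u
    elif schedule == "cosine":        # biases t toward 1 (heavily masked)
        t = math.sin(u * math.pi / 2)
    elif schedule == "log":           # biases t toward 0 (lightly masked)
        t = math.exp(u * math.log(1e-3))
    else:
        raise ValueError(schedule)
    return max(t, 1e-3)               # clip away t = 0 (weight would blow up)

def mask_and_weight(tokens, t, mask_id=-1):
    """Apply absorbing noise at level t: each token independently becomes
    [MASK] with probability t. Under the linear schedule alpha_t = 1 - t,
    the masked cross-entropy terms receive the MDLM-style weight 1/t."""
    noisy = [mask_id if random.random() < t else tok for tok in tokens]
    return noisy, 1.0 / t
```

Small \(t\) means few masked positions, each carrying a large weight; large \(t\) approaches BERT-style full masking with weight near 1. Sweeping `schedule` in an otherwise fixed training loop is the cheapest way to probe the question empirically.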
In particular, mask schedule and noise design are about as important as the learning rate schedule on the AR side, but at present they are selected only empirically. If theoretical guidelines emerge, there is room for training efficiency to move significantly from that alone.
## Sampling
For AR LLMs, sampling strategies are essentially exhaustively researched. Choices such as top-p, top-k, temperature, typical sampling, contrastive decoding, and speculative decoding are used stably, and their behavior is organized both theoretically and empirically.
In contrast, DLLM samplers are still developing. Figure 1 shows the points in the DLLM sampling loop where decisions are required.
The open questions corresponding to each decision point are as follows.
- Optimal number of steps: How is the trade-off curve between quality and computational cost drawn? What is the natural unit corresponding to “1 token = 1 forward” on the AR side? Can a lower bound on NFE for reaching the same quality be obtained theoretically?
- Dynamic scheduling: Can the appropriate number of steps be predicted per input? Can simple inputs be run with few steps and complex inputs with many steps adaptively? Can a criterion for early stopping be obtained from the internal state?
- Remasking decision: When and where should one remask? Confidence-based, or some other signal (entropy, margin, external verifier)? The low-confidence remasking used in LLaDA (Nie et al. 2025) is powerful, but comparisons with other strategies are insufficient. GIDD (Rütte et al. 2025) shows that mixing non-mask noise (uniform) during training gives the model a self-correction ability — a training-time choice that changes what remasking strategies remain meaningful at inference
- Trajectory diversity: How can one avoid mode collapse in low-temperature sampling? What should the “trajectory-level diversity control” corresponding to AR’s top-p look like?
- Hybrid with AR: How does the optimal block width of semi-AR (block-wise semi-autoregressive) depend on task and model scale? It can also be viewed as a continuous parameter that degenerates to AR at block width 1 and to fully parallel at block width equal to sequence length. BD3-LMs (Arriola et al. 2025) promote block structure to a first-class citizen by introducing it at training time (see the Block Diffusion chapter)
- Theoretical optimality of the schedule: Whether the noise schedule at inference time (which \(t\) to pass through and in what order) has optimization room independent of the training-time schedule
Sampling directly determines “the performance that can be drawn from the same trained model,” so it is a research area where the effect is large at low cost. Using publicly trained models (such as LLaDA) as raw material, there is room to make contributions worth a paper just through inference-time algorithms.
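To ground these decision points, here is a minimal confidence-based unmasking loop in the MaskGIT/LLaDA style, written against a stand-in `logits_fn` rather than a real model. The per-step unmask count, the greedy prediction, and the absence of remasking are deliberately the simplest choices; each corresponds to one of the open questions above.

```python
import numpy as np

MASK = -1  # sentinel for a still-masked position

def dllm_sample(logits_fn, length, num_steps):
    """Minimal confidence-based unmasking loop.

    logits_fn(tokens) stands in for the model forward pass and must return
    an array of shape (length, vocab). Simplest-possible choices throughout:
    greedy prediction, equal unmask budget per step, no remasking."""
    tokens = np.full(length, MASK)
    for step in range(num_steps):
        masked = tokens == MASK
        if not masked.any():
            break                                  # all positions fixed
        logits = logits_fn(tokens)
        probs = np.exp(logits - logits.max(-1, keepdims=True))
        probs /= probs.sum(-1, keepdims=True)
        pred = probs.argmax(-1)                    # greedy prediction everywhere
        conf = probs.max(-1)                       # confidence = top probability
        # "how many to unmask": an equal share of what remains, per step
        k = max(1, int(np.ceil(masked.sum() / (num_steps - step))))
        # "which positions": the k most confident still-masked ones
        conf[~masked] = -np.inf
        chosen = np.argsort(conf)[-k:]
        tokens[chosen] = pred[chosen]              # no remasking in this sketch
    return tokens
```

Every commented line is a knob: swap greedy for sampling, confidence for entropy, the fixed budget for a schedule, or add a remask pass, and you have a different sampler for the same trained model.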
## Inference-time intervention
For AR LLMs, the toolkit for inference-time intervention is rich.
- Self-consistency: Take multiple samples and take a majority vote
- Chain of Thought (CoT): Have the model write out step-by-step reasoning as intermediate tokens
- Tree of Thoughts (ToT): Search with branching
- Minimum Bayes Risk (MBR): Choose the minimum risk within a hypothesis set
- Verification / Process Reward Model: Run a verifier against intermediate states
- Constrained decoding: Constrained generation, attribute control
These are designed to fit well with the AR property of “extending sequentially.” In DLLMs, naively applying the same interventions does not work, or requires transformation into a different form.
The following can be listed as open areas.
- DLLM version of CoT: How should stepwise reasoning be implemented in DLLMs? Should it be expanded across the whole sequence at once (writing thinking and answer in a single parallel unmask), or sequentially in block units (block 1 = thinking, block 2 = answer)? In AR, the “left to right” structure fit well with CoT, but in DLLMs the structure changes. The training-time block structure used in BD3-LMs (Arriola et al. 2025) is plausibly aligned with block-wise CoT expansion
- Classifier guidance / CFG family: Discrete-side versions of the classifier guidance and classifier-free guidance that have become standard practice in continuous diffusion. On the embedding-diffusion side, Diffusion-LM (Li et al. 2022) brought classifier guidance to text early on (see the Embedding-space Diffusion chapter); on the discrete side, Nisonoff et al. (2025) gave a general CTMC-based framework, and Schiff et al. (2024) organized simple implementations specifically for masked diffusion. LLaDA (Nie et al. 2025) also implements CFG. This area could become the standard tool for attribute control, style control, and conditional generation, but optimal guidance schedules and strengths are not yet settled
- Constrained decoding: Generation under structural constraints such as grammars, JSON Schema, or regular expressions. AR has developed WFST-style per-token constraint mechanisms, but in bidirectional unmask the way constraints are satisfied changes. The treatment of intermediate states that transiently violate constraints during the loop is also not yet organized. The discrete guidance frameworks (Nisonoff et al. 2025; Schiff et al. 2024) can be reused when the constraint can be expressed as a classifier
- Editing / fill-in interventions: Center-filling (fill-in-the-middle), token replacement at arbitrary positions, structured rewriting, and other areas where DLLMs are structurally stronger than AR. LLaDA (Nie et al. 2025) can accept arbitrary mask arrangements, so naive infilling is possible, but standard APIs, evaluation sets, and benchmarks are not in place, so “design of DLLM-native editing interfaces” itself can become a research target. The encoder-decoder form of DiffuSeq (Gong et al. 2023) provides a reference point for seq2seq-style editing
- Verifier / reward model as guidance: Methods for placing PRMs or reward models as a classifier-guidance-like signal at each step of the DLLM loop. Nisonoff et al. (2025) and Schiff et al. (2024) give general guidance frameworks on the discrete side, on top of which treating a verifier as a classifier-like signal is conceivable. Meanwhile, gradient-based guidance cannot be transferred to the discrete side in the same form as on the continuous side, so different techniques (ratio injection, logit re-weighting, etc.) are needed
- Allocation of test-time compute: A DLLM version of the direction of raising performance by increasing inference-time compute (which made great progress on the AR side with the o1 family). Unlike AR’s “increase the number of samples and take the best,” DLLMs have multiple knobs in parallel (step count, guidance strength, block partition (Arriola et al. 2025), remask strategy, etc.), so the allocation problem becomes high-dimensional
DLLMs have the structural advantage that they can intervene at each step, so the flexibility of intervention is inherently higher than for AR. The current situation is that recipes for drawing out that flexibility have not yet been established.
Figure 2 shows typical places in the DLLM loop where intervention is possible. Whereas in AR the places where intervention is possible are concentrated in “the distribution of the next token,” in DLLMs multiple intervention points exist in parallel at each step.
```mermaid
flowchart LR
    State["current state<br/>(partially fixed)"] --> Model["DLLM forward"]
    Model --> Logits["logits at all positions"]
    Logits -.-> G1["guidance:<br/>add to logits"]
    Logits --> Sample["sampling"]
    Sample -.-> G2["constrained<br/>decoding"]
    Sample --> Confidence["confidence computation"]
    Confidence -.-> G3["re-evaluate with<br/>verifier"]
    Confidence --> Unmask["unmask / remask"]
    Unmask --> NextState["next state"]
```
Because each dashed location can be used as an independent intervention layer, the expressiveness of intervention is higher than “tweaking temperature or top-p” on the AR side. The issue is the recipe and evaluation methodology for utilizing that expressiveness.
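The dashed “add to logits” hook can be made concrete with a small composition function. This is a sketch under assumptions: the CFG rule \((1+w)\,\text{cond} - w\,\text{uncond}\) mirrors the continuous-diffusion convention, the classifier term is added as a scaled log-probability, and hard constraints are imposed by setting disallowed logits to \(-\infty\); the combination order and the scales are not established practice.

```python
import numpy as np

def apply_guidance(cond_logits, uncond_logits=None, classifier_logp=None,
                   cfg_scale=0.0, cls_scale=0.0, allowed=None):
    """Compose logit-level interventions at one DLLM step (arrays are
    shaped (positions, vocab)). Three independent hooks:
      * classifier-free guidance: (1 + w) * cond - w * uncond
      * classifier guidance: add an external model's scaled log-probs
      * hard constraints: disallowed tokens get probability zero
    """
    logits = np.asarray(cond_logits, dtype=float)
    if uncond_logits is not None and cfg_scale != 0.0:
        logits = (1.0 + cfg_scale) * logits - cfg_scale * np.asarray(uncond_logits)
    if classifier_logp is not None and cls_scale != 0.0:
        logits = logits + cls_scale * np.asarray(classifier_logp)
    if allowed is not None:
        logits = np.where(allowed, logits, -np.inf)
    return logits
```

Because every step exposes full-sequence logits, the same hook applies at every iteration, which is exactly the flexibility the AR side lacks; what schedule to run `cfg_scale` on across steps is one of the open questions above.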
## Evaluation
The challenges on the evaluation side are twofold. First, current DLLM papers compare performance on the same benchmarks as AR (MMLU, GSM8K, MATH, HumanEval, etc.), which is necessary as a side-by-side reference point but does not measure the characteristics of DLLMs. Second, evaluation axes that measure DLLM-specific properties have not yet been proposed.
Among DLLM-specific evaluation axes, the following have room for development.
- Performance per NFE (Number of Function Evaluations): A standard metric for continuous diffusion models, directly showing the trade-off between computational cost and quality. Bringing this to the language side would make comparisons with AR’s “cost of producing the same number of tokens” transparent
- Step-quality curve: At how many steps does quality saturate? How does this depend on the task?
- Editability / Controllability: Advantage on editing tasks (filling arbitrary positions, replacing specific tokens, constrained rewriting). This is an area where AR is essentially weak and DLLM strengths should show, but there is no standard evaluation set
- Bidirectional knowledge usage: Measurement in settings that use context in both left and right directions (bidirectional cloze, middle-filling). This is structurally difficult in AR
Existing benchmarks such as MMLU and GSM8K are implicitly designed on the premise of “AR-style generation.” DLLM advantages should lie in “parallelism,” “bidirectionality,” and “naturalness of editing,” but these do not readily show up in MMLU scores. Even if a DLLM appears weaker than AR on the same benchmark, that is not necessarily an essential weakness of DLLM.
Conversely, even if one designs a new evaluation axis where DLLM advantages appear, in order for it to carry the persuasive force of not being viewed as “an arbitrary selection of an evaluation where DLLM is advantageous,” the evaluation axis itself needs practical value (naturalness as a task, industrial use context, usefulness for humans). The design of evaluation axes itself is an area where it is recognized as a research contribution.
Another challenge on the evaluation side is the standardization of sampling settings. In AR, the choices boil down to roughly “greedy or temperature=1,” but in DLLMs the combinations of number of steps, remask strategy, block size, etc. are enormous, and how these are fixed for comparison varies from paper to paper. For reproducibility and comparability, standardization of evaluation protocols will also become necessary going forward.
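A step-quality curve and its saturation point can be measured with a harness as small as the following; `generate` and `score` are placeholders for any DLLM sampler and any task metric, and equating NFE with the step count assumes one forward pass per step (samplers with CFG or verifier calls cost more).

```python
def step_quality_curve(generate, score, step_counts):
    """Evaluate one trained model at several step budgets.
    generate(num_steps) -> output; score(output) -> float (higher is better).
    With one forward pass per step, NFE equals the step count."""
    return [(steps, score(generate(steps))) for steps in step_counts]

def saturation_point(curve, tol=0.01):
    """Smallest NFE whose quality is within tol of the best observed."""
    best = max(q for _, q in curve)
    for nfe, q in sorted(curve):
        if q >= best - tol:
            return nfe
```

Reporting this curve alongside the usual single-number benchmark score would make the cost side of DLLM comparisons transparent.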
## Theory
Theoretical understanding of DLLMs is in its early stages. On the continuous diffusion side, the SDE / ODE correspondence, convergence analysis of score matching, discussions of expressiveness, and so on are progressing, but on the discrete side, ad hoc results are still scattered.
The following can be cited as questions with large room for development.
- Expressiveness: Are DLLMs equivalent to AR, stronger, or weaker? Under what conditions can they express arbitrary probability distributions? AR has the form of approximating each factor of the chain rule \(p(x) = \prod_i p(x_i \mid x_{<i})\) with a neural network, and in terms of expressiveness it is sufficient to have the set of conditional distributions. DLLMs have a different decomposition (the denoising chain), and the equivalence or gap in expressiveness between the two is not obvious
- Convergence: How is convergence of the iterative refinement loop theoretically guaranteed? Does it approach the true distribution as the number of steps increases, or does it plateau somewhere? Convergence analysis of SDE / ODE has progressed on the continuous diffusion side, but the corresponding organization is thin on the discrete side
- Correspondence with AR: What is the mathematical correspondence for translating AR LLM techniques (speculative decoding, KV cache, long-context optimization, context window extension) into DLLMs? Among AR methods, there are some that strongly depend on the “left to right” structure and others that do not. The DLLM translation of the latter is relatively easy, but the translation of the former requires new structure
- Scaling law: Is there a scaling law specific to DLLMs? Does it have the same form as AR (\(L \propto N^{-\alpha}\)), or a different form that takes the number of steps into account? Since performance moves with inference-time NFE even at the same parameter count, there are elements that cannot be captured by an AR-form law
- Sample complexity: Theoretical bounds on the amount of data needed for training. How does the fact that multiple mask patterns can be generated from the same sequence as learning signals factor into the data efficiency argument?
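The first empirical check for the scaling-law question is whether DLLM losses fit the AR form at all. Fitting \(L = a N^{-\alpha}\) by least squares in log-log space takes a few lines; the numbers in the test below are synthetic, purely for illustration.

```python
import numpy as np

def fit_power_law(N, L):
    """Fit L = a * N^(-alpha) by least squares in log-log space.
    Returns (a, alpha). A systematic misfit, e.g. residuals that move
    with inference-time NFE, would signal that the AR form is wrong."""
    slope, intercept = np.polyfit(np.log(N), np.log(L), 1)
    return float(np.exp(intercept)), float(-slope)
```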
In particular, “correspondence with AR” is the language-side version of the “bridge between continuous diffusion and discrete diffusion” repeatedly touched on in this book. There is not yet a dictionary that maps results established in AR over to DLLMs. There is also research room in the reverse direction of translation (how findings on the DLLM side feed back to AR).
## Architecture
For AR LLMs, the decoder-only Transformer has become the de facto standard, and architecture selection has essentially converged. Positional encoding has stabilized with RoPE, and the other choices (attention scheme, normalization, activation function, etc.) have largely settled as well. The DLLM side is still in flux.
The points of contention can be organized as follows.
- Attention scheme: Fully bidirectional (BERT family), preserving causal masking, or hybrid? LLaDA is bidirectional, and the Dream lineage may take different choices. Going bidirectional makes weight sharing with AR difficult, and bootstrapping from AR pretraining becomes hard to use
- Encoder-decoder: Whether to separate input (condition) and output (generation target). This can be a natural choice for conditional generation. There is the advantage of inheriting the T5 lineage’s experience, and the disadvantage of losing the simplicity of decoder-only
- Positional encoding: In bidirectional architectures, which of RoPE / ALiBi / learned is optimal? Conclusions established in AR cannot necessarily be carried over as is. In particular, behavior with long contexts or variable-length mask arrangements has room for re-examination
- Long-context handling: Because DLLMs hold the entire sequence in memory, constraints may be tighter than for AR with long contexts. BD3-LMs (Arriola et al. 2025) partition the sequence into blocks and offload completed blocks to a KV cache, which loosens some of the long-context constraints; combinations with sparse attention or hierarchy are also conceivable, but a standard design answer is not yet settled
- Initialization from AR: The route of using an existing AR pretrained model as initialization for a DLLM is attractive, but it must be reconciled with bidirectionalization. There are attempts in this direction in the Dream-family papers (Ye et al. 2025)
Architecture choice is inseparable from training recipes, and there is the complexity that both must be explored simultaneously. On the AR side, the two are independent enough to be explored separately, but in DLLMs they are more tightly coupled.
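The attention-scheme axis can be stated precisely as a choice of attention mask. The sketch below builds the three patterns under discussion; the block-causal variant (full attention within a block, causal across blocks) is the shape BD3-LMs-style hybrids use, with the block size as a free parameter.

```python
import numpy as np

def attention_mask(length, scheme="bidirectional", block=None):
    """Build a boolean attention mask: entry (i, j) is True when
    position i may attend to position j."""
    i = np.arange(length)[:, None]
    j = np.arange(length)[None, :]
    if scheme == "bidirectional":          # BERT-family / LLaDA
        return np.ones((length, length), dtype=bool)
    if scheme == "causal":                 # AR decoder-only
        return j <= i
    if scheme == "block_causal":           # semi-AR / block diffusion
        return (j // block) <= (i // block)
    raise ValueError(scheme)
```

Sliding `block` from 1 to the sequence length interpolates between the causal and fully bidirectional extremes, which is one way to explore the attention axis without committing to a new architecture.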
## Research directions
The following are themes that are accessible to start working on when beginning DLLM research.
- Translating to DLLMs the AR-side interventions that do not strongly depend on left-to-right sequential structure (guidance, constrained decoding, editing-style interventions, infilling)
- Designing new interventions that directly leverage the structural features of DLLMs (parallelism, bidirectionality, natural editing)
- Improving sampling strategies (new schedules, new remask strategies, dynamic step allocation)
- Proposing DLLM-specific evaluation axes and re-evaluating existing models, especially developing benchmarks for editing and fill-in tasks
- Small-scale theoretical analysis (expressiveness, convergence, correspondence with AR, the shape of the scaling law)
In particular, the two axes of “translating what has been established in AR into DLLMs” and “drawing out DLLM-specific advantages” both contain many untouched items, and each piece of the AR-side knowledge stock has room for re-examination on the DLLM side.
These themes often do not require massive computational resources. Sampler improvements and proposals of evaluation axes can be verified on top of existing trained models (such as LLaDA), and theoretical analysis can be discussed with small-scale synthetic data. “There is room to maneuver” also means that there is room for entry even for researchers with limited computational resources.
Conversely, simply by exhaustively grasping techniques established on the AR LLM side and making a “DLLM translation list” of them, one can put together a research program for the near term. The literature covered in each chapter of this book is the foundation for that translation work, and in particular the MDLM loss function and the LLaDA sampler are valuable as “skeletons that become starting points for translation.”
### An example of translation
As a concrete example, consider translating AR’s chain-of-thought (CoT) into DLLMs. In AR, the structure of “writing out a reasoning trace as intermediate tokens from left to right, then producing the answer” aligned naturally with sequential generation. The moment one tries to do the same thing in a DLLM, multiple design axes coexist.
- Expand the whole sequence in a single parallel unmask (writing thinking and answer together), or expand block by block (block 1 = thinking, block 2 = answer)?
- Use different mask schedules or remask strategies for the thinking block versus the answer block?
- Can the length of the thinking block be decided dynamically, or is it fixed-length?
- Is there room for asymmetric design — semi-AR for the thinking block, parallel for the answer block?
- Should there be early-stop criteria on the intermediate state (e.g., advance to the answer once the thinking is sufficiently fixed)?
Even the one-line idea “do CoT in a DLLM” already has a clearly wider implementation design space than its AR counterpart. This is the general pattern that appears whenever AR-side established methods are translated into DLLMs — the same kinds of design choices line up for guidance, constrained decoding, and editing-style interventions. One established procedure on the AR side opens up, on the DLLM side, into a family of methods that carry their own design space. That is the structural shape of DLLM research at present.
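One point in that design space, block-wise expansion with asymmetric step budgets, can be written down in a few lines. `unmask_block` is a hypothetical stand-in for a DLLM sampler that fills `length` masked positions conditioned on a prefix; the two-block split and the fixed thinking length are the simplest choices, with dynamic length and early stopping left open.

```python
def blockwise_cot(unmask_block, think_len, answer_len,
                  think_steps, answer_steps):
    """Semi-AR chain of thought: block 1 = thinking, block 2 = answer.
    Each block gets its own step budget, so the expensive iterative
    refinement can be spent on the reasoning while the answer is fixed
    cheaply once the thinking is in place (or vice versa)."""
    thinking = unmask_block(prefix=[], length=think_len, steps=think_steps)
    answer = unmask_block(prefix=thinking, length=answer_len, steps=answer_steps)
    return thinking, answer
```

The same skeleton generalizes to more blocks, to per-block remask strategies, or to a verifier gate between the two calls, which is exactly the "family of methods" phenomenon described above.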
## Will DLLMs replace AR?
At the end of this chapter, let us touch on the often-asked question of “will DLLMs replace AR?”
In the short term, AR and DLLMs are likely to coexist.
- AR strengths: Established ecosystem, quality on long contexts, inference cost structure (KV cache), streaming generation
- DLLM strengths: Parallel generation, bidirectional context, naturalness of editing / fill-in, flexibility of inference-time intervention
A realistic future is one where they are used differently depending on the task and required characteristics. A scenario in which AR becomes “the standard for long-context generation and dialogue” and DLLMs become “the standard for editing, control, and structured generation” is sufficiently plausible.
In the medium to long term, there is also the possibility that hybrids combining the good points of both (semi-AR, block diffusion, AR backbone + diffusion head, etc.) become mainstream. As of the time of writing of this book, it is not yet determined which scenario is dominant.
Neither the strong claim that DLLMs will completely replace AR nor the strong denial that DLLMs will not be practical can be supported by the current evidence. The intermediate “different uses” and “integration” are the realistic answers for the time being.
From the researcher’s standpoint, this very uncertainty is an opportunity. Periods when the dominant model has not yet been determined are also the periods when newcomers are most likely to move the frontier. Compared to the current state of AR LLMs (giant companies own the training recipes, and external researchers can only do tricks at inference time), DLLMs are still a field where there is room to contribute to the training recipes themselves.
## Summary of this book
DLLMs are a field still developing both theoretically and in implementation. The concise formulation of MDLM and the scale-up by LLaDA have been established, but the other areas (sampler design, inference-time intervention, evaluation, theory, architecture) are still open space. The work of translating the enormous technical stack established for AR LLMs into DLLMs is likely to be the main research frontier over the next several years.
The body of literature treated in this book corresponds to the “scaffolding” for this open space. MDLM provides the training scaffold, LLaDA provides the scale and practical-sampler scaffold, MaskGIT provides the origin of confidence-based unmasking, D3PM / SEDD provide other choices of discrete diffusion, and the continuous/discrete bridge provides a translation dictionary for reusing the known knowledge of the continuous-diffusion side. These can be read independently of each other, but only reading them together brings “what a DLLM is” into full relief.
By leveraging the knowledge of continuous diffusion as a “template” while switching to the toolkit specific to the discrete side (a cross-entropy-based objective, \(x_0\)-prediction, confidence-based sampling), the literature treated in this book falls into place as a coherent whole. After that, each reader can build research and implementations on top of this scaffolding according to their own interests.
What I want to emphasize at the end is that DLLM research is not a project to “make an evolved version of AR LLMs,” but a project to “try a factorization different from AR for the problem of language generation.” The dominance of AR owes less to any performance necessity than to historical path dependence. The structural features of DLLMs (parallelism, bidirectionality, naturalness of editing) are properties that are essentially hard to obtain in AR, and as designs that draw on them head-on accumulate, a lineage of language models with strengths different from AR could be established.
I hope that this book functions as a starting point for that.