Tree Search and MCTS

Treating Chain-of-Thought (CoT) not as a single sequence but as a tree structure allows multiple continuations to be expanded and selected at each node. Research that brings classical search algorithms such as Monte Carlo Tree Search (MCTS), Tree-of-Thoughts (ToT), and beam search into Large Language Models (LLMs) inherits the lineage going back to AlphaGo, and went through another major update in 2025–2026.

This chapter organizes the tree-search methods that appeared in 2025–2026 along seven axes.

  • Representative MCTS methods: the lineage of AlphaMath / rStar-Math
  • Dynamic choice between wider vs deeper: the branching control opened up by AB-MCTS
  • Verification granularity control: the continuous spectrum between beam search and Best-of-N shown by VG-Search
  • Uncertainty-based branching: the shift from point estimates of the value function to distributional estimates, and on to policy entropy
  • Verifier-free / lightweight verifier lines: self-evaluation approaches that reduce reliance on PRMs
  • Faithfulness warnings: the gap between a “deep trace” and the “actual decision”
  • Efficiency lines: implementation optimizations that are conscious of the compute budget

Finally, we also touch on the movement to recast tree search as probabilistic inference.

Treating CoT as a Tree

The motivation for treating LLM reasoning as a tree is simple. Even if we raise the sampling temperature and draw many single chains, traces that continue a redundant reasoning on top of a wrong premise cannot be rescued. With a tree structure, once an error is noticed, one can switch to a different branch.

NoteBasic Elements of Tree Search

LLM tree-search algorithms can be roughly described by three elements (Wei et al. 2025).

  • Search Mechanism: which node to expand (MCTS’s UCB, beam search’s top-k, A*’s heuristic, etc.)
  • Reward Formulation: how to assign value to each node or trajectory (PRM, self-evaluation, consistency, etc.)
  • Transition Function: the unit of transition from a node to the next (token, step, thought type, etc.)

The survey (Wei et al. 2025) classifies existing methods along these three axes and provides a conceptual map that unifies test-time scaling and self-improvement.

Node Semantics Vary Across Papers

Among the three axes in the callout above, the Transition Function axis (the unit of transition from one node to the next) is actually the biggest source of confusion when reading LLM tree-search papers across the literature. Even when papers use the same words “tree” and “node,” what a node represents varies widely. The main methods covered in this book fall into the following seven systems.

Table 1: Seven systems of node semantics in LLM tree-search papers
System What a node represents How boundaries are drawn Representative examples
(α) Reasoning step A single reasoning step (sentence or paragraph) Newlines / step splitters ToT (Yao et al. 2023), MITS (J. Li et al. 2025)
(β) Semantic marker Chunk up to the next transition marker Vocabulary like “Wait” / “Therefore” Beyond the Last Answer (Hammoud et al. 2025)
(γ) Fixed-length window \(L\) tokens or a fractional position \(\tau\) Fixed length or ratio Path-Consistency (Zhu et al. 2024), PoLR (Jindal et al. 2026), Prefix Consistency (Iwase et al. 2026)
(δ) MDP state State in problem space Domain-specific encoder Reasoning via Planning (RAP) and other planning systems
(ε) Whole candidate solution One full reasoning trace or one code candidate Node = whole solution AB-MCTS (Inoue et al. 2025) (Code Contest, etc.)
(ζ) Aggregator node Virtual node that aggregates child statistics Non-leaf itself carries meaning The GEN node of AB-MCTS (Inoue et al. 2025)
(η) Question / sub-problem A reformulation of the problem or a sub-question LLM-emitted sub-question Decomposed Prompting line

When the meaning of a node differs, (i) the same “branching factor” or “depth” become incomparable, (ii) the granularity required by a verifier differs, and (iii) the unit of compute budget (LLM call count vs token count) changes, making fair comparison difficult. When reading each method in this chapter, paying attention to “what is a node here” greatly improves clarity. In particular, when comparing the prefix-based methods of Self-Consistency and Weighted Majority Voting (which sit in the γ system) with the (α) Reasoning step systems covered in this chapter, note that the referents of “node” differ between the two.

ToT and the Choice of Search Algorithm

The foundational LLM tree-search paper is Tree of Thoughts (ToT) (Yao et al. 2023), which most subsequent work inherits. ToT is described by four components.

  • Thought decomposition: split the problem into units of “1 thought = 1 intermediate step”
  • Thought generator \(G(p_\theta, s, k)\): sample \(k\) candidate thoughts from state \(s\)
  • State evaluator \(V(p_\theta, S)\): ask the LLM itself “is this state promising?” and obtain a scalar
  • Search algorithm: explore the tree with BFS / DFS / Beam Search

That is, ToT carries a self-evaluation structure where “LLM = policy + value function,” and an explicit LLM evaluator drives the search. The original paper switches search algorithm by task.

Table 2: ToT’s per-task choice of search algorithm in the original paper
Task Search algorithm Setting
Game of 24 BFS with top-\(b\) keep \(b = 5\) (effectively Beam Search)
Creative Writing BFS \(b = 1\) (nearly greedy)
5×5 Mini Crosswords DFS with backtracking

In other words, ToT is a framework rather than a single algorithm. Subsequent papers use compound names such as ToT-BS (Beam Search), ToT-DFS, and ToT-BFS to make the search algorithm explicit. When reading the literature, even if a paper simply writes “ToT,” one needs to check the experiment section to learn which search algorithm is actually being used.

MCTS vs ToT

ToT is not MCTS. The two are different families of search algorithms, and the methods covered in this chapter also split largely along this line. MCTS classically runs a 4-stage cycle.

  1. Selection: descend from existing nodes via UCB / Thompson sampling, etc.
  2. Expansion: when reaching a leaf, add a new child
  3. Simulation (rollout): random rollout from the new child until a final outcome
  4. Backpropagation: propagate the rollout outcome back to the root, updating all ancestors’ statistics

ToT has neither rollout nor backpropagation. It calls the LLM evaluator once at each node to obtain a scalar, then advances along the high-scoring branch via BFS / DFS / Beam. The differences are summarized below.

Table 3: Algorithmic differences between MCTS and ToT
Item MCTS ToT
Selection rule UCB / Thompson sampling (statistical) LLM evaluator scalar (heuristic)
Rollout Yes (random simulation down to a leaf) No (1 node = 1 LLM evaluation)
Backpropagation Yes (rollout outcome propagates to all ancestors) No (each node independent)
Value update Iteratively update \(N(v), W(v)\), etc. Single LLM evaluation only

The LLM tree-search papers covered in this chapter split into these two camps.

When reading the discussions of VG-Search (H. M. Chen et al. 2025) or UATS (Song et al. 2026) / UVM (Yu et al. 2025) later, keeping track of which camp each method belongs to helps the picture come together.

That the way the reward is given dominates tree-search performance has become increasingly clear with the skepticism toward PRMs discussed later (Cinquin et al. 2025) and the rise of verifier-free lines (M. Wu et al. 2025; Rakhsha et al. 2025).

Representative MCTS Methods

A widely referenced starting point in the LLM × MCTS lineage is AlphaMath (G. Chen et al. 2024). It uses the ground-truth match rate of MCTS rollouts as value to learn an annotation-free Process Reward Model (PRM), and alternately improves the policy and the value. By bringing an AlphaGo-style self-play structure into math problems, it has become a reference point for subsequent MCTS-for-reasoning research. The overall pipeline is shown in Figure 1.

Figure 1: AlphaMath’s MCTS-based self-evolution pipeline. MCTS rollouts assign values to correct/incorrect nodes, filtered solutions are used to update the policy-value model in a K-round loop. The ability to learn a PRM without annotation decisively influenced subsequent research. Source: (G. Chen et al. 2024)

rStar-Math (Guan et al. 2025) pushes this lineage further at the scale of a 7B small language model (SLM), achieving 53.3% on AIME 2024 (8/15 problems correct, surpassing o1-preview). The method overview is shown in Figure 2.

Figure 2: rStar-Math’s three-step design. (a) MCTS-driven deep thinking where the SLM and PPM verify at each step, (b) Q-value filtering to build per-step preference pairs, (c) four rounds of self-evolution that alternately strengthen the SLM and PPM. Source: (Guan et al. 2025)

The self-evolution in Figure 2 can be read as extending AlphaMath’s policy-value alternating update (Figure 1) to both MCTS and PPM. Figure 3 shows the pass@K performance when increasing the number of MCTS samples. As the sample count grows, performance scales logarithmically across AIME / MATH / Olympiad / College Math, reaching over 60% on AIME with 64 samples.

Figure 3: rStar-Math’s pass@K MCTS scaling curves. As the number of samples (horizontal axis) increases, performance monotonically improves across AIME / MATH / Olympiad / College Math. rStar-Qwen2.5-Math-7B scales up to about 63% on AIME even at 64 samples, showing the strong effect of test-time compute. Source: (Guan et al. 2025)
ImportantWhy rStar-Math Matters

The contributions of rStar-Math are threefold.

  • Code-augmented CoT data synthesis: trajectories obtained from MCTS rollouts are verified by Python execution and used as training data
  • Process Preference Model (PPM) training without score annotation: rather than numerical scores, the PPM is trained pairwise from the relative order of Q-values from rollouts
  • Self-evolution of the policy and PPM: four rounds of iterative training lift performance

The result that a 7B SLM surpasses o1-preview shows that by embedding MCTS into the data generation pipeline, deep reasoning can be learned in an almost verifier-free manner, and directly influenced subsequent work such as DeepSearch (F. Wu et al. 2025).

SC-MCTS* (Gao et al. 2024) combines an interpretable reward model based on the principle of contrastive decoding with speculative decoding for speedup, reporting that Llama-3.1-70B + SC-MCTS* exceeds o1-mini by an average of +17.4% on the Blocksworld planning task. It is an example of simultaneously improving reward and inference speed, two of the three MCTS elements.

Dynamic Choice Between Wider vs Deeper

Determining the branching factor of MCTS has classically been a difficult problem. In the LLM context, one needs to decide online whether to “sample again from the same node (wider) or descend into a child node (deeper).”

AB-MCTS (Adaptive Branching MCTS) (Inoue et al. 2025) answers this question with Thompson sampling. For each node, an “expansion arm” that generates a new child and a “continuation arm” that selects an existing child are handled as a Bernoulli–Gaussian mixture Bayesian posterior, and the optimal branching is automatically determined per node. NeurIPS 2025 Spotlight. Figure 4 contrasts it with existing methods.

Figure 4: Contrast of repeated sampling / sequential refinement / standard MCTS / AB-MCTS. Repeated sampling is wide only, sequential refinement is deep only, and standard MCTS expands with a fixed branching factor, while AB-MCTS dynamically decides wider or deeper at each node via Thompson sampling. Source: (Inoue et al. 2025)

As the bottom right of Figure 4 shows, AB-MCTS switches per-node between “generating an additional child from the same node” and “deepening an existing child” through Thompson sampling on a Bayesian posterior.

NoteUncertainty Lines as Derivatives

AB-MCTS’s wider vs deeper control is a framework that decides “where to place the next sample” by uncertainty. UATS (Song et al. 2026) and UVM (Yu et al. 2025) discussed later can be interpreted as generalizing this idea by carrying it into the uncertainty of the value function itself.

Verification Granularity Control

If wider vs deeper is the problem of “where to place the sample,” then verification granularity is the problem of “how frequently to call the verifier.” Beam search verifies at every step, Best-of-N verifies only at the terminal. The two have been treated as separate algorithms, but in fact they are only the two ends of a continuous spectrum.

VG-Search (H. M. Chen et al. 2025) introduces a granularity parameter \(g\) and defines a general form in which the PRM screens candidates every \(g\) steps. With \(g=1\) we recover beam search, with \(g=L\) (the full length of the reasoning path) we recover Best-of-N, and \(1<g<L\) lies in between. Figure 5 shows an overview.

Figure 5: Overview of VG-Search. Through a verification granularity \(g\), beam search (\(g=1\), PRM verification at every step) and Best-of-N (\(g=L\), only at the terminal) are unified into a single algorithm. Each cycle generates \(B_1 \times B_2\) candidates and the PRM keeps the top \(B_1\); the next \(g-1\) steps proceed without verification. Source: (H. M. Chen et al. 2025)

The key finding of VG-Search is that the optimal \(g\) varies with the task, the compute budget, and the model capability.

  • Harder problems require denser verification (smaller \(g\)). The probability of erring at each step is high, so early trajectory correction is needed
  • Larger compute budgets suffice with coarser verification (larger \(g\)). Even when verifier calls are reduced, accuracy is maintained
  • Stronger generators match coarser verification. If the generator emits stable steps, frequent PRM calls are wasted

On MATH-500 (Qwen2.5-Math-7B + Skywork-o1-1.5B verifier), adaptive \(g\) selection achieves up to +3.1% accuracy improvement over beam search and +3.6% over Best-of-N, while at the same time reducing FLOPs by more than 52%. For instance, \(g=3\) reaches about \(2^{13}\) FLOPs / about 88% accuracy, whereas \(g=1\) (beam) requires about \(2^{15}\) FLOPs and stays around 87.5%. There exists a regime where lowering the verification frequency improves both accuracy and efficiency.

Uncertainty-Based Branching

The trend of replacing the point estimate of a PRM (a single numerical score) with an uncertainty distribution of the value function strengthened in 2025–2026.

UATS (Adaptive Uncertainty-Aware Tree Search) (Song et al. 2026) estimates the epistemic uncertainty of the PRM via Monte Carlo Dropout and dynamically allocates compute budget per node via a Reinforcement Learning (RL) controller. Theoretically, while standard search incurs linear regret, uncertainty-aware search can achieve sublinear regret. It is reported to mitigate PRM blow-ups on Out-Of-Distribution (OOD) reasoning paths. Figure 6 succinctly captures the motivation of UATS.

Figure 6: UATS motivation. Even with the same PRM scores (0.85 vs 0.76), an OOD step with a high variance in the PRM predictive distribution is unreliable and the wrong branch ends up being chosen, while a low-variance ID step can be evaluated correctly. A typical example showing that a point-estimate PRM is insufficient. Source: (Song et al. 2026)

In the OOD step on the left of Figure 6, the PRM’s mean score is high at 0.85 but its variance \(\sigma^2=0.032\) is large, and as a result the wrong branch (Wrong) is taken. The ID step on the right, conversely, has a lower score of 0.76 but with \(\sigma^2=0.001\) can confidently select the correct branch (Correct). A typical case where the point-estimate PRM diverges from the node’s “true value.”

Robust Search with Uncertainty-aware Value Models (Yu et al. 2025) proposes the Uncertainty-aware Value Model (UVM), which replaces a single-point value estimate with a value distribution, together with Group Thompson Sampling that selects the optimal candidate from that distribution. In addition to in-distribution evaluation on GSM8K and MATH, it substantially mitigates verifier failure on OOD settings such as AIME25 and Minerva.

MITS (J. Li et al. 2025) adopts an entropy-based dynamic sampling design that “concentrates compute on uncertain steps,” and avoids expensive look-ahead simulation by using pointwise mutual information (PMI) as a step-wise score. The final prediction is a weighted majority vote of PMI and consensus.

Similarly using entropy as a branching gate, Entropy-Gated Branching (X. Li et al. 2025) uses the entropy of the next-token prediction distribution directly as a branching signal, instead of a value function. The design is to generate multiple continuations only at steps whose entropy exceeds a threshold, and to proceed along a single path otherwise: “branch only where the model is hesitant.” Figure 7 contrasts Normal decoding / Beam search / Entropy-Gated Branching.

Figure 7: Contrast of Normal decoding / Beam search / Entropy-Gated Branching. Beam search expands to a fixed width at every step and verifies with the PRM, whereas Entropy-Gated branches only at high-entropy (hesitant) steps and does not branch at confident steps. Low-quality branches are pruned by a lightweight verifier. Source: (X. Li et al. 2025)

It reports +22.6% accuracy improvement over standard decoding and 31–75% speedup over uniform beam search on math reasoning benchmarks. The idea of concentrating compute on “uncertain steps” follows the same spirit as MITS’s entropy + PMI, but extracts uncertainty from the policy-side distribution rather than from the value side.

Representative approaches to uncertainty-based branching of the value function are summarized in Table 4.

Table 4: Main methods of uncertainty-based branching
Method Representation of uncertainty Control target
AB-MCTS (Inoue et al. 2025) Bayesian posterior over branching arms wider vs deeper
UATS (Song et al. 2026) PRM epistemic uncertainty via MC Dropout per-node budget (RL)
UVM (Yu et al. 2025) Value distribution + Group Thompson candidate selection
MITS (J. Li et al. 2025) Step entropy + PMI sample count
Entropy-Gated (X. Li et al. 2025) Entropy of next-token prediction distribution whether to branch (gate)

Verifier-Free / Lightweight Verifier Lines

As the limits of PRM-guided search (Cinquin et al. 2025) became recognized, methods that control search without an external reward model have been growing.

SELT (Self-Evaluation Tree Search) (M. Wu et al. 2025) is an MCTS that redefines UCB through the LLM’s intrinsic self-evaluation, without using any external reward model. It decomposes each node into atomic subtasks and applies semantic clustering. Improvements in robustness and suppression of hallucinations are reported on MMLU and Seal-Tools.

W2S-AlignTree (D. Zhang et al. 2026) proposes weak-to-strong inference-time alignment, using the step-level signal of a weak model as MCTS guidance for a strong model. With an Entropy-Aware PUCT, a UCB refinement, it secures trajectory diversity and lifts Llama3-8B summarization quality from 1.89 to 2.19 (+15.9%). AAAI 2026 Oral.

Majority of the Bests (MoB) (Rakhsha et al. 2025) is grounded in the observation that “even if the reward model is imperfect, the most frequent mode can be trusted.” It builds bootstrap subsets from a Best-of-N (BoN) candidate set, takes the argmax of each subset, and majority-votes them. Improvements on 25/30 settings out of 5 benchmarks × 3 LLMs × 2 RMs. NeurIPS 2025 Poster.

TipMoB as a Philosophical Kin

MoB’s stance that “even if the reward is unreliable, the most frequent mode is trustworthy” is essentially the same philosophy as self-consistency in Self-Consistency and Weighted Majority Voting (X. Wang et al. 2023). The idea of using “not individual scores but the collective mode” as a signal source is spreading as a main axis of verifier-free design in tree-search reward design as well.

CMCTS (Lin et al. 2025) restricts the action space to five disjoint kinds—understanding / planning / reflection / coding / summary—and prevents non-sensical transitions through a PRM and partial-order rules. Zero-shot, a 7B model is reported to surpass a 72B baseline by an average of +4.8%. Rather than making the verifier itself lightweight, this approach reduces dependence on the PRM by tightening the search space.

Probabilistic Inference Framing

The movement to recast tree search not as “search” but as “probabilistic inference” surfaced at COLM 2025.

Particle-based Monte Carlo (Puri et al. 2025) views inference-time scaling as typical-set sampling of a state-space model and implements it with a particle filter (Sequential Monte Carlo, SMC). It has a side effect of avoiding reward hacking, and reports Qwen2.5-Math-1.5B surpassing GPT-4o with 4 rollouts and 7B reaching o1-class with 32 rollouts. Figure 8 visualizes each step of SMC.

Figure 8: SMC implementation of Particle-based MC. At each step, multiple particles generate next steps and are resampled in proportion to their reward score (PRM). Low-score particles (e.g., score 0.0133, 0.0392) are culled and high-score particles (0.9883, 0.9883) survive. The final answer is the particle with the highest reward. A good example showing that MCTS’s rollout/resample corresponds to SMC’s proposal/resample. Source: (Puri et al. 2025)

PRM-guided GFlowNets (Younsi et al. 2025) automatically learns a PRM via MCTS and achieves reward-proportional diverse sampling via a GFlowNet. It complements the greediness of MCTS and raises path coverage with reward-proportional sampling. Reports generalization improvements of +2.59% on MATH Level 5 and +9.4% on SAT MATH.

NoteReading Tree Search as SMC

From the perspective of Particle-based MC, MCTS’s operation of “rolling out and resampling by reward” corresponds to SMC’s proposal-likelihood-resample. Once tree search can be re-described as an inference algorithm, the tools of the Bayesian context (importance sampling, posterior approximation, the conflation issue between reward and likelihood) become directly applicable.

Faithfulness Warnings

A warning bell is starting to ring on whether the reasoning trace obtained from tree search really reflects “deep computation.”

WarningExtracting Search Trees (S. Chen et al. 2026)

An analytical paper that extracts a search tree from an LLM’s reasoning trace in the Four-in-a-row game. The findings are threefold.

  • LLMs search more shallowly than humans
  • Performance is predicted by breadth rather than depth
  • Important: even when deep nodes are expanded in the trace, the actual move selection can be explained by a myopic model that ignores deep nodes

In other words, it has been quantitatively shown that LLMs may “pretend to do deep search while actually deciding shallowly.” This is an important observation in the lineage of the chain-of-thought faithfulness discussion (Lanham et al. 2023).

Figure 9 shows an ablation experiment that removes branches of a specific depth from the trace and feeds it back to the player.

Figure 9: Pruning experiment from Extracting Search Trees. The reasoning trace is labeled by an LLM judge into “preamble / depth-1 branch / depth-2 branch / final decision” and pruned traces with each part removed are re-input to the player. The move change rate remains at only 3.7% when depth-1 branches are kept and depth-2 branches are removed, showing that depth-2 exploration barely contributes to the actual move selection (= myopic decision). Source: (S. Chen et al. 2026)

The right panel of Figure 9 is the quantitative evidence. While removing the final decision and the whole branch tied to it changes 32% of the moves, removing all depth-2 nodes while keeping only depth-1 changes the move rate by only 3.7% on the chosen branch side. That is, the LLM expands deep nodes but actually picks moves using only shallow information.

This finding implies that evaluation designs that depend on the apparent depth of tree search (designs that equate trace length or the number of expanded nodes with “amount of thought”) need to be reconsidered. The naive scaling view that “more rollouts is better” may also be invalid computation piled on top of shallow decisions.

The Lineage of Tree-of-Thoughts and Graph-of-Thoughts

Outside of pure MCTS, RL post-training of ToT, graph extensions, and a direction that analyzes existing long CoT post-hoc as a tree have also appeared.

LCoT2Tree (Jiang et al. 2025) is a framework that automatically converts long CoT into a hierarchical tree and extracts structural patterns with a Graph Neural Network (GNN). Figure 10 shows the conversion pipeline. The five stages—sketch extraction → step segmentation → assigning steps to sketch nodes → labeling each step’s function (exploration / verification / backtrack / continuation) → constructing the tree—yield a reasoning tree with node features (depth, out-degree, number of siblings) and edge features (four kinds of functions).

Figure 10: Conversion pipeline of LCoT2Tree. From a long CoT, a sketch is extracted, steps are segmented and assigned to sketch nodes, and each step’s function (continuous logic / exploration / backtrack / verification) is labeled to construct a reasoning tree. Source: (Jiang et al. 2025)

GNN-based analysis yields three observations: (1) exploration count, backtrack count, and verification count correlate strongly with answer accuracy, (2) tree-structural features predict accuracy better than simple token length, and (3) over-branching is identified as a typical failure pattern. This is a warning against the naive scaling view that “more branching is better,” and can be read as empirical support for the “branch only where the model is hesitant” designs of AB-MCTS (Inoue et al. 2025) and Entropy-Gated (X. Li et al. 2025) covered in this chapter.

ToTRL (H. Wu et al. 2025) is based on the observation that “long CoT tends to become redundant introspection and trial-and-error,” and learns ToT strategies from puzzle games via on-policy RL. ToTQwen3-8B is reported to improve both accuracy and efficiency on complex reasoning.

Adaptive Graph of Thoughts (AGoT) (Pandey et al. 2025) recursively decomposes a complex query into subproblems and builds, at test time, a dynamic directed acyclic graph (DAG) that expands only the necessary parts. It integrates the strengths of CoT / ToT / GoT and achieves +46.2% on GPQA training-free.

Forest-of-Thought (Bi et al. 2024) takes a forest structure that ensembles multiple independent ToTs, combining sparse activation for selecting relevant paths, dynamic self-correction, and consensus-guided decision. A design that balances tree-level diversity and consensus voting.

Policy Guided Tree Search (PGTS) (Y. Li 2025) is an RL-based tree search in which a learned policy dynamically selects among expand / branch / backtrack / terminate. It achieves state-of-the-art on math / logic / planning while significantly cutting cost.

Efficiency Lines

To bring the tree-search methods discussed in this chapter and the Self-Consistency lines from the previous chapter into production use, compute budget and efficiency are decisive.

FETCH (A. Wang et al. 2025) decomposes the failures of tree search into two pitfalls.

  • Over-exploration: semantically equivalent states are redundantly expanded
  • Under-exploration: high variance of verifier scoring causes frequent trajectory switching

As remedies, it merges equivalent states with agglomerative clustering on SimCSE embeddings, and suppresses variance with TD(λ) and a verifier ensemble. It simultaneously improved multiple tree-search algorithms on GSM8K / GSM-Plus / MATH (ACL 2025 long).

DPTS (Dynamic Parallel Tree Search) (Ding et al. 2025) decomposes the slowness of ToT into “frequent switching of reasoning focus” and “redundant suboptimal exploration,” and achieves 2–4x speedup with a Parallelism Streamline and a Search-and-Transition Mechanism. It improves throughput on Qwen-2.5 / Llama-3 while maintaining accuracy on MATH500 / GSM8K.

A*-Decoding (Chatziveroglou 2025) views language-model generation as a structural search over the state space of partial solutions and applies A*. It matches the performance of BoN and particle filtering with 1/3 the tokens and 30% fewer PRM passes, and shows an example where Llama-3.2-1B keeps up with the 70x larger Llama-3.1-70B on MATH500 / AIME24.

PRM-BAS (PRM-guided Beam Annealing Search) (Hu et al. 2025) combines an annealing schedule that shrinks beam width as the search depth grows with a PRM dense reward. It improves MathVista / MathVision / ChartQA and achieves the same quality as a large fixed beam at half the compute.

AdaSearch / AdaBeam (Quamar et al. 2025) is based on the hypothesis that “early tokens are decisive for alignment” and proposes a block-wise search that concentrates the compute budget upfront with a decaying schedule. Harmlessness +10%, sentiment +33%, math reasoning +24% (vs. BoN).

CATS (Cost-Augmented MCTS) (Z. Zhang et al. 2025) embeds explicit cost-awareness into MCTS for planning under a budget constraint, and uniformly compares DFS / BFS / MCTS / bidirectional. While a GPT-4.1 baseline fails under a tight budget, CATS balances accuracy and cost.

The main designs of the efficiency line are summarized in Table 5.

Table 5: Comparison of efficiency-oriented tree-search methods
Method Main idea Effect
FETCH (A. Wang et al. 2025) Suppress over-exploration via semantic clustering Simultaneously improves multiple tree searches
DPTS (Ding et al. 2025) Parallel KV and transition mechanism 2–4x speedup
A*-Decoding (Chatziveroglou 2025) A* heuristic 1/3 tokens, 30% fewer PRM passes
PRM-BAS (Hu et al. 2025) Beam-width annealing Same quality at half compute
AdaSearch (Quamar et al. 2025) Concentrate budget on initial blocks Large gains (math +24%)
CATS (Z. Zhang et al. 2025) Explicit cost-awareness Maintains accuracy under tight budget

Extension to Self-Improvement

The movement of recycling tree-search results back into training is also notable. DeepSearch (F. Wu et al. 2025) identifies the cause of RLVR plateauing within a few thousand steps as sparse rollout exploration, and integrates MCTS into the training loop. With global frontier selection, entropy-based supervision, and a replay buffer with caching, it achieves an average accuracy of 62.95% and a 5.7x GPU-hour reduction.

Against the controversy of “can RLVR surpass the base” treated in RLVR Theory and Limits, DeepSearch shows that MCTS-augmented RLVR can break the plateau by supplying new exploration signals. Together with rStar-Math (Guan et al. 2025) covered in this chapter, it represents a direction that blurs the boundary between test-time search and training-time exploration.

Chapter Summary

Tree-search research in 2025–2026 advanced along five main axes.

  • Skepticism toward PRMs drove the verifier-free / uncertainty-aware lines: starting from Limits of PRM-Guided (Cinquin et al. 2025), SELT (M. Wu et al. 2025), MoB (Rakhsha et al. 2025), UATS (Song et al. 2026), and UVM (Yu et al. 2025) independently presented designs that “reduce dependence on external verifiers”
  • Dynamic control of wider vs deeper became standard: the Thompson sampling framework of AB-MCTS (Inoue et al. 2025) directly propagated into the subsequent uncertainty-based methods (Song et al. 2026; Yu et al. 2025)
  • Verification granularity was organized as a continuous spectrum: VG-Search (H. M. Chen et al. 2025) connected beam search and Best-of-N through granularity \(g\) and showed that the optimal granularity varies with task, budget, and model
  • Entropy-based branching spread: Entropy-Gated Branching (X. Li et al. 2025) and MITS (J. Li et al. 2025) established the design of “branching only at hesitant steps,” and LCoT2Tree (Jiang et al. 2025) provided empirical support by identifying over-branching as a failure pattern via GNN analysis
  • Faithfulness warnings accumulated: from Lanham et al.’s Early Answering (Lanham et al. 2023) to Extracting Search Trees (S. Chen et al. 2026), observations that “the deep traces shown by LLMs can be explained by myopic decisions” and that “answers do not change much when the CoT is cut midway” urge a reconsideration of evaluation designs that naively read trace length or expansion count as “amount of thought”

In parallel, the movement to recast tree search as probabilistic inference (Puri et al. 2025) and the movement to embed it into self-improvement (Guan et al. 2025; F. Wu et al. 2025) are also accelerating. The three axes “Search Mechanism × Reward Formulation × Transition Function” shown by the survey (Wei et al. 2025) remain effective as a framework for coherently surveying the close-to-30 papers covered in this chapter.

See Process Reward Models for discussions of the PRM line, Self-Consistency and Weighted Majority Voting for self-consistency and verifier-free voting, and Test-Time Compute Scaling for the topic of compute-budget allocation.

References

Bi, Zhenni, Kai Han, Chuanjian Liu, Yehui Tang, and Yunhe Wang. 2024. “Forest-of-Thought: Scaling Test-Time Compute for Enhancing LLM Reasoning.” arXiv Preprint arXiv:2412.09078. https://arxiv.org/abs/2412.09078.
Chatziveroglou, Giannis. 2025. A*-Decoding: Token-Efficient Inference Scaling.” arXiv Preprint arXiv:2505.13672. https://arxiv.org/abs/2505.13672.
Chen, Guoxin, Minpeng Liao, Chengxi Li, and Kai Fan. 2024. AlphaMath Almost Zero: Process Supervision Without Process.” Advances in Neural Information Processing Systems. https://arxiv.org/abs/2405.03553.
Chen, Hao Mark, Guanxi Lu, Yasuyuki Okoshi, Zhiwen Mo, Masato Motomura, and Hongxiang Fan. 2025. “Rethinking Optimal Verification Granularity for Compute-Efficient Test-Time Scaling.” Advances in Neural Information Processing Systems. https://arxiv.org/abs/2505.11730.
Chen, Sixing, Ji-An Li, Saner Cakir, Sinan Akcali, Kayla Lee, and Marcelo G. Mattar. 2026. “Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning.” arXiv Preprint arXiv:2605.06840. https://arxiv.org/abs/2605.06840.
Cinquin, Tristan, Geoff Pleiss, and Agustinus Kristiadi. 2025. “Limits of PRM-Guided Tree Search for Mathematical Reasoning with LLMs.” arXiv Preprint arXiv:2510.20272. https://arxiv.org/abs/2510.20272.
Ding, Yifu, Wentao Jiang, Shunyu Liu, et al. 2025. “Dynamic Parallel Tree Search for Efficient LLM Reasoning.” Annual Meeting of the Association for Computational Linguistics. https://arxiv.org/abs/2502.16235.
Gao, Zitian, Boye Niu, Yuzhen Huang, and Wen Wang. 2024. SC-MCTS*: Interpretable Contrastive Monte Carlo Tree Search Reasoning.” arXiv Preprint arXiv:2410.01707. https://arxiv.org/abs/2410.01707.
Guan, Xinyu, Li Lyna Zhang, Yifei Liu, et al. 2025. rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking.” arXiv Preprint arXiv:2501.04519. https://arxiv.org/abs/2501.04519.
Hammoud, Hasan Abed Al Kader, Hani Itani, and Bernard Ghanem. 2025. “Beyond the Last Answer: Your Reasoning Trace Uncovers More Than You Think.” arXiv Preprint arXiv:2504.20708. https://arxiv.org/abs/2504.20708.
Hu, Pengfei et al. 2025. PRM-BAS: Enhancing Multimodal Reasoning Through PRM-Guided Beam Annealing Search.” arXiv Preprint arXiv:2504.10222. https://arxiv.org/abs/2504.10222.
Inoue, Yuichi, Kou Misaki, Yuki Imajuku, So Kuroki, Taishi Nakamura, and Takuya Akiba. 2025. “Wider or Deeper? Scaling LLM Inference-Time Compute with Adaptive Branching Tree Search.” Advances in Neural Information Processing Systems. https://arxiv.org/abs/2503.04412.
Iwase, Naoto, Shotaro Ichihara, Ahmed Quamar, and Junpei Komiyama. 2026. “Reliable Chain-of-Thought via Prefix Consistency.” arXiv Preprint arXiv:2605.07654. https://arxiv.org/abs/2605.07654.
Jiang, Gangwei, Yahui Liu, Zhaoyi Li, et al. 2025. “What Makes a Good Reasoning Chain? Uncovering Structural Patterns in Long Chain-of-Thought Reasoning.” Conference on Empirical Methods in Natural Language Processing. https://arxiv.org/abs/2505.22148.
Jindal, Ishan, Sai Prashanth Akuthota, Jayant Taneja, and SACHIN DEV SHARMA. 2026. THE PATH OF LEAST RESISTANCE: GUIDING LLM REASONING TRAJECTORIES WITH PREFIX CONSENSUS.” The Fourteenth International Conference on Learning Representations. https://openreview.net/forum?id=hrnSqERgPn.
Lanham, Tamera, Anna Chen, Ansh Radhakrishnan, et al. 2023. “Measuring Faithfulness in Chain-of-Thought Reasoning.” arXiv Preprint arXiv:2307.13702. https://arxiv.org/abs/2307.13702.
Li, Jiaxi, Yucheng Shi, Jin Lu, and Ninghao Liu. 2025. MITS: Enhanced Tree Search Reasoning for LLMs via Pointwise Mutual Information.” arXiv Preprint arXiv:2510.03632. https://arxiv.org/abs/2510.03632.
Li, Xianzhi, Ethan Callanan, Abdellah Ghassel, and Xiaodan Zhu. 2025. “Entropy-Gated Branching for Efficient Test-Time Reasoning.” arXiv Preprint arXiv:2503.21961. https://arxiv.org/abs/2503.21961.
Li, Yang. 2025. “Policy Guided Tree Search for Enhanced LLM Reasoning.” arXiv Preprint arXiv:2502.06813. https://arxiv.org/abs/2502.06813.
Lin, Qingwen et al. 2025. CMCTS: A Constrained Monte Carlo Tree Search Framework for Mathematical Reasoning in LLM.” Applied Intelligence (Springer); arXiv Preprint arXiv:2502.11169. https://arxiv.org/abs/2502.11169.
Pandey, Tushar, Ara Ghukasyan, Oktay Goktas, and Santosh Kumar Radha. 2025. “Adaptive Graph of Thoughts: Test-Time Adaptive Reasoning Unifying Chain, Tree, and Graph Structures.” arXiv Preprint arXiv:2502.05078. https://arxiv.org/abs/2502.05078.
Puri, Isha, Shivchander Sudalairaj, Guangxuan Xu, Kai Xu, and Akash Srivastava. 2025. “A Probabilistic Inference Approach to Inference-Time Scaling of LLMs Using Particle-Based Monte Carlo Methods.” Conference on Language Modeling. https://arxiv.org/abs/2502.01618.
Quamar, Atif et al. 2025. “Adaptive Blockwise Search: Inference-Time Alignment for LLMs.” arXiv Preprint arXiv:2510.23334. https://arxiv.org/abs/2510.23334.
Rakhsha, Amin, Kanika Madan, Tianyu Zhang, Amir-massoud Farahmand, and Amir Khasahmadi. 2025. “Majority of the Bests: Improving Best-of-N via Bootstrapping.” Advances in Neural Information Processing Systems. https://arxiv.org/abs/2511.18630.
Song, Zeen, Zihao Ma, Wenwen Qiang, Changwen Zheng, and Gang Hua. 2026. “Adaptive Uncertainty-Aware Tree Search for Robust Reasoning.” arXiv Preprint arXiv:2602.06493. https://arxiv.org/abs/2602.06493.
Wang, Ante, Linfeng Song, Ye Tian, et al. 2025. “Don’t Get Lost in the Trees: Streamlining LLM Reasoning by Overcoming Tree Search Exploration Pitfalls.” Annual Meeting of the Association for Computational Linguistics. https://arxiv.org/abs/2502.11183.
Wang, Xuezhi, Jason Wei, Dale Schuurmans, et al. 2023. “Self-Consistency Improves Chain of Thought Reasoning in Language Models.” International Conference on Learning Representations. https://openreview.net/forum?id=1PL1NIMMrw.
Wei, Jiaqi, Xiang Zhang, Yuejin Yang, et al. 2025. “Unifying Tree Search Algorithm and Reward Design for LLM Reasoning: A Survey.” arXiv Preprint arXiv:2510.09988. https://arxiv.org/abs/2510.09988.
Wu, Fang et al. 2025. DeepSearch: Overcome the Bottleneck of RLVR via Monte Carlo Tree Search.” arXiv Preprint arXiv:2509.25454. https://arxiv.org/abs/2509.25454.
Wu, Haoyang, Chenwei Chen, et al. 2025. ToTRL: Unlock LLM Tree-of-Thoughts Reasoning Potential Through Puzzles Solving.” arXiv Preprint arXiv:2505.12717. https://arxiv.org/abs/2505.12717.
Wu, Mengsong, Di Zhang, Yuqiang Li, Dongzhan Zhou, and Wenliang Chen. 2025. SELT: Self-Evaluation Tree Search for LLMs with Task Decomposition.” arXiv Preprint arXiv:2506.07557. https://arxiv.org/abs/2506.07557.
Yao, Shunyu, Dian Yu, Jeffrey Zhao, et al. 2023. “Tree of Thoughts: Deliberate Problem Solving with Large Language Models.” Advances in Neural Information Processing Systems. https://arxiv.org/abs/2305.10601.
Younsi, Adam, Abdalgader Abubaker, Mohamed El Amine Seddik, Hakim Hacid, and Salem Lahlou. 2025. “Accurate and Diverse LLM Mathematical Reasoning via Automated PRM-Guided GFlowNets.” arXiv Preprint arXiv:2504.19981. https://arxiv.org/abs/2504.19981.
Yu, Fei, Yingru Li, and Benyou Wang. 2025. “Robust Search with Uncertainty-Aware Value Models for Language Model Reasoning.” arXiv Preprint arXiv:2502.11155. https://arxiv.org/abs/2502.11155.
Zhang, Dongyu et al. 2026. W2S-AlignTree: Weak-to-Strong Inference-Time Alignment via Monte Carlo Tree Search.” AAAI Conference on Artificial Intelligence. https://arxiv.org/abs/2511.11518.
Zhang, Zihao, Kunlun Liu, et al. 2025. “Cost-Augmented MCTS for LLM-Assisted Planning.” arXiv Preprint arXiv:2505.14656. https://arxiv.org/abs/2505.14656.
Zhu, Jiace, Yuanzhe Huang, Yingtao Shen, Jie Zhao, and An Zou. 2024. Path-Consistency with Prefix Enhancement for Efficient Inference in LLMs. arXiv preprint arXiv:2409.01281. https://arxiv.org/abs/2409.01281.