Reliable Reasoning: Signals and Methods for Trustworthy LLM Reasoning
What Is Happening in 2025–2026
In the eighteen months since DeepSeek-R1 and OpenAI o1 were released, research on Large Language Model (LLM) reasoning has entered a new phase: a transition from adding methods that work to questioning what was thought to work. This book traces that transition through more than 190 leading works from ICLR 2026, ACL 2026, ICML 2026, NeurIPS 2025, EMNLP 2025 and related venues, organized along three axes — training-side signals, inference-side signals, and structural approaches.
The findings most often rediscovered by independent groups during this period are the kind that shake the assumptions of reasoning research itself.
- Reinforcement Learning with Verifiable Rewards (RLVR) does not appear to confer new reasoning capabilities. After Yue et al. (Yang Yue et al. 2025a) reported that the base model overtakes the RL-trained model in pass@K at large K, Path Not Taken (Hanqing et al. 2025), Sparse but Critical (Meng et al. 2026), Post-Training as Reweighting (Bu et al. 2025), Reshaping Reasoning (X. Chen et al. 2025), and GRPO Transcend (Ni et al. 2025) independently arrived at the same conclusion: RLVR re-weights outcomes inside the base distribution.
- The confidence of reasoning models is structurally broken. Decoupling Reasoning (Ma et al. 2026), Taming Overconfidence (Leng et al. 2024), Reasoning about Uncertainty (Mei et al. 2025), DINCO (V. Wang and Stengel-Eskin 2025), and Wired for Overconfidence (T. Zhao et al. 2026) each independently reported severe miscalibration after RLHF/RLVR.
- Traces that look “deeply considered” are in fact myopic. Extracting Search Trees (S. Chen et al. 2026) extracted a search tree from an LLM’s reasoning trace and showed quantitatively that even when deep nodes are expanded, the actual move selection is determined by shallow information. Reasoning Horizon (D. Ye et al. 2026) showed that the last 70–85% of a chain-of-thought (CoT) has virtually no causal influence on the answer.
- Process Reward Model (PRM)-guided search does not consistently beat naive Best-of-N. Limits of PRM-Guided (Cinquin et al. 2025) and Hard2Verify (Pandit et al. 2025) demonstrated that the generalization of existing PRMs collapses on frontier-level problems.
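The pass@K comparison in the first finding above can be made concrete with the combinatorial unbiased pass@k estimator that is commonly used for this metric. The sketch below is illustrative only: the per-problem correct counts are invented toy numbers, not results from any cited paper, and `mean_pass_at_k` is a helper name introduced here.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if k > n:
        raise ValueError("k must not exceed the number of samples n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts, n, k):
    """Average pass@k over a set of problems."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Toy numbers (assumption): correct samples out of n = 100 on four problems.
base_counts = [5, 3, 2, 1]   # base model: rarely right, but never at zero
rl_counts = [60, 50, 0, 0]   # RL model: sharp on two problems, zero on two

for k in (1, 100):
    print(k, mean_pass_at_k(base_counts, 100, k), mean_pass_at_k(rl_counts, 100, k))
```

In this toy setting the RL model wins at K = 1 and the base model wins at K = 100, which is the shape of the crossover that the re-weighting view predicts.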
Technical improvements continue, but foundational research into what is actually working is now proceeding with at least the same intensity as the visible efficiency gains.
In parallel, there is one more observation that runs through this book. Four research lines that rarely interact — training, inference, search, and faithfulness probing — have independently arrived at the same operation (cut the CoT at a prefix and measure something). We treat this in detail in the cross-cutting observations later in the chapter.
Three Questions That Run Through This Book
- Q1. Training side: Does RLVR genuinely expand what the base model can do, or is it merely a re-weighting of already-existing capabilities?
- Q2. Inference side: How can we estimate the correctness of a reasoning trace produced by the model when no ground truth is available?
- Q3. Compute budget: Where (length, number of samples, search) should the limited inference compute be spent?
These three questions sit at the intersection where multiple independently developed research lines have begun to overlap rapidly during 2025–2026. The book is a map of that intersection.
Scope of the Nine Chapters
Theory and Limits of RLVR
The basic question of what RLVR is doing developed into the field’s largest debate in 2025–2026. While five-plus groups independently converged on the re-weighting view, work built on alternative metrics such as CoT-Pass@K (Wen et al. 2026) pushed back, arguing that RLVR does genuinely exceed the base model. We follow the debate through to the reconciliation proposed by the Two-Stage Dynamic View (Yang Yue et al. 2025b), in which the effect of RL switches dynamically as a function of training stage and problem difficulty.
→ Detail: Theory and Limits of RLVR
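One standard way to make "re-weighting inside the base distribution" precise is the closed-form optimum of KL-regularized reward maximization. This is offered as a sketch under assumptions, not as the formulation used by the cited papers: $r(x, y)$ is taken to be the verifiable reward, $\beta > 0$ a KL coefficient, and practical RLVR recipes may weaken or drop the KL term.

```latex
\pi^{*}(y \mid x)
  = \frac{1}{Z(x)}\, \pi_{\text{base}}(y \mid x)\, \exp\!\big(r(x, y)/\beta\big),
\qquad
Z(x) = \sum_{y'} \pi_{\text{base}}(y' \mid x)\, \exp\!\big(r(x, y')/\beta\big)
```

Because $\pi^{*}$ multiplies $\pi_{\text{base}}$ by a strictly positive factor, any trace with $\pi_{\text{base}}(y \mid x) = 0$ keeps probability zero. Under this idealization RL can only redistribute mass among traces the base model already produces.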
GRPO and Reward Design
Group Relative Policy Optimization (GRPO) is now the baseline, and current discussion typically takes one of DAPO (Q. Yu et al. 2025), Dr. GRPO (Z. Liu et al. 2025), GSPO (Zheng et al. 2025), or VAPO (Yu Yue et al. 2025) as its starting point. A broader trend is the family of methods that turn the policy’s own confidence or consistency into the reward (TTRL, Intuitor, PPPO, PRIME), a family that connects directly to the prefix observation that runs through the book. Meanwhile, Spurious Rewards (Shao et al. 2025) delivered the unsettling result that random rewards still raise Qwen’s accuracy, effectively imposing a cross-model robustness obligation on any new reward proposal.
→ Detail: GRPO and Reward Design
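As a reference point for the GRPO family, the sketch below computes the group-relative advantage: each sampled response is scored against the mean reward of its own group, with no learned value model. It is a minimal illustration and not any paper's implementation. The function name, the `normalize_std` flag, and the use of the population standard deviation are assumptions made here; setting `normalize_std=False` drops the per-group scaling, which to our understanding is one of the changes Dr. GRPO argues for.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, normalize_std=True, eps=1e-6):
    """Advantage of each response relative to the other samples for the same prompt."""
    mu = mean(rewards)
    centered = [r - mu for r in rewards]
    if not normalize_std:
        return centered
    sigma = pstdev(rewards)  # population std; the choice of estimator is an assumption
    return [a / (sigma + eps) for a in centered]


# Eight sampled responses to one prompt, binary verifiable reward.
rewards = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
advantages = group_relative_advantages(rewards)
print(advantages)

# A group where every response gets the same reward carries no learning signal.
print(group_relative_advantages([0.0, 0.0, 0.0, 0.0]))
```

The second call shows why prompts that are always solved or never solved contribute nothing to the update: every advantage is zero.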
Process Reward Models
The scarcity of step labels had been the biggest barrier to widespread PRM use. Now a wave of label-free PRMs draws on five independent signal sources: Monte Carlo (MC) rollouts, log-likelihood ratios, pseudo-labels back-propagated from the outcome, next-token probability, and completer agreement. In parallel, generative PRMs have shifted from scalar heads to CoT-verbalized verifiers, enabling test-time compute scaling on the verifier side as well. At the same time, difficulty-extended benchmarks like Hard2Verify (Pandit et al. 2025) cause existing PRMs to lose substantial accuracy, exposing the limits of PRM generalization.
→ Detail: Process Reward Models
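Of the signal sources listed above, MC rollouts are the easiest to sketch: the soft label for a step is the fraction of completions sampled from that step's prefix that reach a correct final answer. The code below is a minimal illustration, not a specific paper's pipeline. `complete`, `is_correct`, and the toy completer are hypothetical stand-ins for a real model call and answer checker.

```python
import random
from typing import Callable, List, Sequence


def mc_step_labels(
    steps: Sequence[str],
    complete: Callable[[Sequence[str]], str],
    is_correct: Callable[[str], bool],
    n_rollouts: int = 8,
) -> List[float]:
    """Soft label for step t: share of rollouts from steps[:t] that end correct."""
    labels = []
    for t in range(1, len(steps) + 1):
        prefix = steps[:t]
        hits = sum(is_correct(complete(prefix)) for _ in range(n_rollouts))
        labels.append(hits / n_rollouts)
    return labels


# Toy completer (assumption): mostly right until the prefix contains an error.
rng = random.Random(0)


def toy_complete(prefix: Sequence[str]) -> str:
    p_correct = 0.1 if any("ERROR" in s for s in prefix) else 0.9
    return "42" if rng.random() < p_correct else "41"


steps = ["define x", "expand the square", "ERROR: drop a sign", "solve for x"]
labels = mc_step_labels(steps, toy_complete, lambda a: a == "42", n_rollouts=200)
print(labels)
```

The estimated label drops sharply at the first erroneous step, which is the signal a label-free PRM of this kind would be trained on.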
Self-Consistency and Weighted Majority Voting
Work in 2025–2026 extended Wang et al.’s Self-Consistency (X. Wang et al. 2023) in four directions: (i) weighting, with DeepConf (Fu et al. 2025), CISC (Taubenfeld et al. 2025), CER (Razghandi et al. 2025), Self-Certainty (Z. Kang et al. 2025), and IEW (Sharma and Chopra 2025); (ii) prefix exploitation, with PoLR (Jindal et al. 2026), Path-Consistency (Zhu et al. 2024), Prefix-Confidence Scaling (Otth et al. 2025), ST-BoN (Y. Wang et al. 2025), Beyond the Last Answer (Hammoud et al. 2025), and Prefix Consistency (Iwase et al. 2026); (iii) theoretical support, with the MAP-optimality of weighted majority vote (Kuang et al. 2025); (iv) adaptive sampling, with BEACON (Wan et al. 2025) and ReASC (Kim et al. 2026). The prefix-exploitation family is the convergence case that most symbolizes this book: six-plus groups independently arrived at the same operation.
→ Detail: Self-Consistency and Weighted Majority Voting
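The weighting direction can be illustrated with a few lines: plain self-consistency counts one vote per sampled answer, while weighted majority voting adds up a per-trace weight. The sketch below is generic and does not reproduce the weighting scheme of any cited method. The answers and confidence weights are invented toy values.

```python
from collections import defaultdict


def weighted_majority_vote(answers, weights=None):
    """Return the answer with the largest total weight (uniform weights by default)."""
    if weights is None:
        weights = [1.0] * len(answers)
    if len(weights) != len(answers):
        raise ValueError("answers and weights must have the same length")
    totals = defaultdict(float)
    for answer, weight in zip(answers, weights):
        totals[answer] += weight
    return max(totals, key=totals.get)


# Toy traces (assumption): three low-confidence votes for "12", two confident votes for "15".
answers = ["12", "15", "15", "12", "12"]
confidences = [0.3, 0.9, 0.95, 0.2, 0.4]

print(weighted_majority_vote(answers))               # plain majority vote
print(weighted_majority_vote(answers, confidences))  # confidence-weighted vote
```

Here the unweighted vote selects "12" and the weighted vote selects "15", so the outcome depends entirely on how trustworthy the weights are.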
Confidence and Uncertainty
Of the three classical routes for measuring “does the model know it is right?” (logit-based, verbalized, and sampling-based), the first two were shown in 2026 to be no longer trustworthy for reasoning models. Negative evidence converged from three angles: at the circuit level (Wired for Overconfidence (T. Zhao et al. 2026)), in decision-theoretic terms (Faithful? (Jiawei Wang et al. 2026)), and in direct comparison (DINCO (V. Wang and Stengel-Eskin 2025)). Sampling-based signals are now effectively the default.
→ Detail: Confidence and Uncertainty
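Miscalibration of the kind discussed above is usually quantified with the expected calibration error (ECE): predictions are binned by stated confidence, and the gap between mean confidence and accuracy is averaged across bins. The sketch below uses equal-width bins, which is one common convention and an assumption here. The two confidence profiles are toy data.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins: weighted mean of |mean confidence - accuracy| per bin."""
    if any(p < 0.0 or p > 1.0 for p in confidences):
        raise ValueError("confidences must lie in [0, 1]")
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(confidences, correct):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))
    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(p for p, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += len(members) / n * abs(avg_conf - accuracy)
    return ece


# Toy data (assumption): the same 60% accuracy reported with two confidence profiles.
correct = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
overconfident = [0.99] * 10
calibrated = [0.6] * 10

print(expected_calibration_error(overconfident, correct))
print(expected_calibration_error(calibrated, correct))
```

Accuracy is identical in both cases. Only the stated confidence differs, and the overconfident profile is the one that yields a large ECE.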
Test-Time Compute Scaling
That “investing more compute at inference time raises accuracy” is now an established fact. The question has shifted from whether it rises to where and how to invest. We cover Budget Forcing (Muennighoff et al. 2025) and adaptive allocation (CaTS (C. Huang et al. 2025), T1 (M. Kang et al. 2025), Fractional Reasoning (S. Liu et al. 2025)), system-side optimization (ThinKV (Ramachandran et al. 2025), SpecReason (Pan et al. 2025), Sleep-time Compute (Lin et al. 2025)), latent reasoning (Coconut (Hao et al. 2024)), and Markovian Thinker (Aghajohari et al. 2025) for linear scaling. We also cover the non-monotone finding that a CoT can be both too long and too short, and the domain dependence (X. Huang et al. 2025) showing that math-optimized operating points do not automatically transfer to medical reasoning.
→ Detail: Test-Time Compute Scaling
Tree Search and MCTS
The MCTS lineage of AlphaMath (G. Chen et al. 2024) and rStar-Math (Guan et al. 2025), the dynamic wider vs. deeper selection of AB-MCTS (Inoue et al. 2025), the verification-granularity spectrum that VG-Search (H. M. Chen et al. 2025) uses to unify beam search and Best-of-N, the uncertainty-aware line (UATS (Song et al. 2026), UVM (F. Yu et al. 2025)), and the verifier-free line (SELT (M. Wu et al. 2025), MoB (Rakhsha et al. 2025)) all run in parallel. The decisive observation of this chapter is the faithfulness warning of Extracting Search Trees (S. Chen et al. 2026): the deep traces shown by LLMs can be explained by myopic decisions.
→ Detail: Tree Search and MCTS
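For readers new to the MCTS lineage, the textbook UCT selection rule is a useful reference point. The cited systems use their own selection rules and value estimates, so the formula below is an assumption-level sketch and not a description of any of them. In a reasoning tree, $s$ is a CoT prefix, $a$ a candidate next step, $Q(s, a)$ the mean value observed below that step, $N(s)$ and $N(s, a)$ visit counts, and $c$ an exploration constant.

```latex
a^{*} = \arg\max_{a} \left[ Q(s, a) + c \sqrt{\frac{\ln N(s)}{N(s, a)}} \right]
```

The first term favors going deeper along steps that have paid off, and the second favors going wider into rarely visited steps. This is the same wider-versus-deeper trade-off that adaptive variants tune dynamically.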
Reasoning-Structure Analysis
A fourth signal axis — independent of aggregation, confidence, and search — emerged in 2025–2026: reading the structure of the CoT itself. Reasoning Horizon (D. Ye et al. 2026) uses intervention experiments to show that the last 70–85% of a trace is causally empty. FSF (Feng et al. 2025) converts the CoT into a reasoning graph and demonstrates that the failed-step fraction is a stronger predictor of correctness than length or review ratio. CRV (Z. Zhao et al. 2025) predicts the correctness of each step with AUROC 92 from the internal attribution graph, and Four Habits of STaRs (Gandhi et al. 2025) identifies four cognitive patterns shared by self-improving models.
→ Detail: Reasoning-Structure Analysis
Reasoning in Diffusion LLMs
In masked diffusion LLMs such as LLaDA (Nie et al. 2025), Dream (J. Ye et al. 2025), and MMaDA (Yang et al. 2025), the time axis is no longer the autoregressive (AR) sequence but the denoising trajectory. Prophet (Li et al. 2025) exploits the phenomenon that the answer is determined early on the same trajectory; Time-is-a-Feature (W. Wang et al. 2025) applies majority vote across denoising steps within a single trajectory; I-DLM (Y. Yu et al. 2026) proposes a self-verify mechanism within a single forward pass. Prefix-based aggregation developed for AR models is being reinvented in diffusion LLMs (DLLMs) with the denoising step as the time axis.
→ Detail: Reasoning in Diffusion LLMs
Cross-Cutting Observations
Each chapter can be read standalone, but several cross-chapter patterns form the core of this book’s argument.
Observation 1: Four Research Lines Have Independently Converged on the Same “Prefix” Operation
Four research lines that rarely interact — training, inference, search, and faithfulness probing — have all arrived at the same operation: cut the CoT at a prefix and measure something.
- Training side: GRPO-VPS (Jingyi Wang et al. 2026), which probes per-segment correctness probability and feeds it into the reward; PACR (Yoon et al. 2025), which encodes the monotonicity of correctness probability during reasoning into the reward; PPPO (Sun et al. 2025), which estimates the value of a prefix as an MDP state.
- Inference side: PoLR (Jindal et al. 2026), Prefix-Confidence Scaling (Otth et al. 2025), Path-Consistency (Zhu et al. 2024), ST-BoN (Y. Wang et al. 2025), Beyond the Last Answer (Hammoud et al. 2025), Prefix Consistency (Iwase et al. 2026) — all sample many continuations from a short prefix and aggregate them.
- Search side: MCTS-style methods that expand prefixes as nodes (AlphaMath (G. Chen et al. 2024), rStar-Math (Guan et al. 2025)).
- Faithfulness side: Early Answering (Lanham et al. 2023), which truncates the CoT and asks the model to answer in order to measure faithfulness. Mechanically this is the same operation as resampling, but with the opposite purpose.
That this simultaneous convergence happens without mutual reference strongly suggests that the prefix is a load-bearing unit of reasoning. This is the single most important observation that runs through the book.
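The shared operation, cutting the CoT at a prefix and measuring something, can be sketched generically. The code below truncates a trace at several fractions, samples answers from each prefix, and records the majority answer and its agreement rate. It is not the procedure of any single cited method. `answer_from_prefix` is a hypothetical stand-in for a model call, and the toy version is deterministic so that the example is reproducible.

```python
import math
from collections import Counter
from typing import Callable, List, Sequence, Tuple


def prefix_answer_curve(
    steps: Sequence[str],
    answer_from_prefix: Callable[[Sequence[str]], str],
    fractions: Sequence[float] = (0.25, 0.5, 0.75, 1.0),
    n_samples: int = 8,
) -> List[Tuple[float, str, float]]:
    """For each cut point, return (fraction, majority answer, agreement rate)."""
    curve = []
    for f in fractions:
        cut = max(1, math.ceil(f * len(steps)))  # keep at least one step
        prefix = steps[:cut]
        answers = [answer_from_prefix(prefix) for _ in range(n_samples)]
        top, count = Counter(answers).most_common(1)[0]
        curve.append((f, top, count / n_samples))
    return curve


# Toy model (assumption): the answer is fixed once the prefix contains "x = 7".
def toy_answer(prefix: Sequence[str]) -> str:
    return "7" if "x = 7" in prefix else "unknown"


steps = ["read problem", "set up equation", "x = 7", "double-check", "restate", "final answer 7"]
curve = prefix_answer_curve(steps, toy_answer)
print(curve)
```

Read as a faithfulness probe, the curve shows where the answer stops changing. Read as an inference method, it shows how short a prefix can be while still supporting a confident vote.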
Observation 2: Systematic Suspicion of Signals That Were “Thought to Work”
Four negative findings accumulated independently during 2025–2026 are shaking the assumptions of reasoning research.
- Suspicion of RLVR’s capability-expansion claim: five independent groups from Yue et al. (Yang Yue et al. 2025a) onward confirmed the re-weighting view (Theory and Limits of RLVR).
- Suspicion of verbalized confidence: five independent groups confirmed severe miscalibration (Confidence and Uncertainty).
- Suspicion of faithfulness: the gap between deep traces and the actual decision (Tree Search and MCTS, Reasoning-Structure Analysis).
- Suspicion of PRM-guided search: it does not consistently outperform naive Best-of-N (Process Reward Models).
These doubts are the driving force behind an explosion in the search for new signal sources — verifier-free, uncertainty-aware, sampling-based confidence, and reasoning-structure analysis. The convergence in Observation 1 is the flip side: that exploration is independently landing in the same place.
Observation 3: At ICLR 2026, Adaptive Allocation Became the Default
Fixed-K self-consistency and fixed-token-budget CoT have receded into legacy-baseline territory. Methods that allocate compute dynamically based on problem difficulty or confidence, including CaTS (C. Huang et al. 2025), T1 (M. Kang et al. 2025), Fractional Reasoning (S. Liu et al. 2025), ThinKV (Ramachandran et al. 2025), DiffAdapt (X. Liu et al. 2025), BEACON (Wan et al. 2025), and ReASC (Kim et al. 2026), were accepted together at ICLR 2026, which symbolizes the community’s shift to adaptive allocation as the default.
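The contrast with fixed-K sampling can be shown with the simplest possible adaptive rule: keep sampling until the leading answer holds a large enough share of the votes, then stop. This agreement threshold is an illustrative assumption and not the stopping criterion of BEACON, ReASC, or any other cited method. `sample_answer` is a hypothetical zero-argument model call.

```python
from collections import Counter
from itertools import cycle
from typing import Callable, Tuple


def adaptive_self_consistency(
    sample_answer: Callable[[], str],
    max_samples: int = 32,
    min_samples: int = 4,
    threshold: float = 0.8,
) -> Tuple[str, int]:
    """Sample until the top answer's vote share reaches threshold; return (answer, samples used)."""
    if max_samples < 1:
        raise ValueError("max_samples must be at least 1")
    answers = []
    for _ in range(max_samples):
        answers.append(sample_answer())
        if len(answers) >= min_samples:
            top, count = Counter(answers).most_common(1)[0]
            if count / len(answers) >= threshold:
                return top, len(answers)  # early stop: the votes already agree
    top, _ = Counter(answers).most_common(1)[0]
    return top, len(answers)


# Toy samplers (assumption): an easy problem always yields "9", a hard one keeps disagreeing.
easy = adaptive_self_consistency(lambda: "9")
hard_stream = cycle(["1", "2", "3"])
hard = adaptive_self_consistency(lambda: next(hard_stream))
print(easy, hard)
```

The easy problem stops after the minimum of 4 samples while the hard one consumes the full budget of 32, which is the allocation behavior that a fixed K cannot express.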
Observation 4: A New Signal Axis That Reads Trace Structure
Alongside aggregation (consensus across multiple traces), confidence (token distribution of a single trace), and test-time compute (budget allocation over number and length of traces), a fourth signal axis emerged independently: reading the structure of the CoT itself. Reasoning Horizon’s causal vacuum, FSF’s failed-step fraction, CRV’s internal attribution graph, and Four Habits’ cognitive patterns all rest on the same intuition: a predictive signal lives in structural features that a single length or confidence value cannot capture. The integration of external aggregation and internal observation is an open question for 2026.
Observation 5: Post-RLVR Miscalibration Connects Q1 and Q2
The phenomenon that RLHF/RLVR-trained reasoning models lose confidence calibration directly links Q1 (what is RLVR learning?) with Q2 (what signals work at inference?). Two findings may be two sides of the same “sharpening” phenomenon: the picture of RLVR as a re-weighting that sharpens an already high-probability distribution (the first item of Observation 2), and the observation that confidence, broadly distributed in the 0.7–0.9 band after SFT, collapses to near 1.0 after RLHF (Leng et al. 2024). A mechanistic link between the two is still rare in the literature and remains an open question.
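A toy calculation shows why a sharpening operation alone could produce this pattern. It is purely illustrative and makes no claim about the actual mechanism, which the text above leaves open. The answer distribution and the temperature are invented, and `sharpen` is a helper name introduced here.

```python
def sharpen(probs, temperature):
    """Raise each probability to the power 1/temperature and renormalize."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    powered = [p ** (1.0 / temperature) for p in probs]
    total = sum(powered)
    return [p / total for p in powered]


# Toy answer distribution (assumption): the model prefers answer 0 with probability 0.6.
before = [0.6, 0.3, 0.1]
after = sharpen(before, temperature=0.25)

print(max(before), max(after))
print(before.index(max(before)) == after.index(max(after)))
```

The preferred answer is unchanged, so greedy accuracy on this item is unchanged, but the probability attached to it moves from 0.6 to roughly 0.94. Sharpening inflates confidence without adding information.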
Observation 6: The Nature of Reasoning Depends on the Domain
Many of the inference-time methods this book covers are implicitly optimized for mathematics. In knowledge-intensive domains such as medicine, the KI (Knowledge Index)–accuracy correlation dominates the InfoGain–accuracy correlation (Knowledge or Reasoning? (J. Wu et al. 2025)), and the optimal thinking budget saturates around 4K tokens (m1 (X. Huang et al. 2025)); budget forcing may overturn an initially correct answer. The transferability of math-derived findings is an open question that cuts across chapters.
How to Read This Book
- If you want to know one area: jump straight to the relevant chapter — every chapter is written to stand alone.
- If you want a panoramic view: start with Self-Consistency and Weighted Majority Voting (the inference-side signal space) and Theory and Limits of RLVR (the training-side signal space).
- If you are starting new research: use any one of the six observations above as an entry point — each cuts across multiple areas. In particular, Observation 1 (prefix convergence) and Observation 2 (systematic suspicion) are the book’s strongest claims; keeping them in mind while reading each chapter makes the cross-chapter patterns easier to see.