Reliable Reasoning: Signals and Methods for Trustworthy LLM Reasoning

What Is Happening in 2025–2026

In the eighteen months since DeepSeek-R1 and OpenAI o1 were released, research on Large Language Model (LLM) reasoning has entered a new phase: a transition from adding methods that work to questioning what was thought to work. This book traces that transition through more than 190 leading works from ICLR 2026, ACL 2026, ICML 2026, NeurIPS 2025, EMNLP 2025 and related venues, organized along three axes — training-side signals, inference-side signals, and structural approaches.

The findings most often rediscovered by independent groups in this period are the kind that shake the assumptions of reasoning research itself.

Reinforcement Learning with Verifiable Rewards (RLVR) does not appear to give models new reasoning capabilities. After Yue et al. (Yang Yue et al. 2025a) reported that the base model overtakes the RL-trained model in pass@K at large K, Path Not Taken (Hanqing et al. 2025), Sparse but Critical (Meng et al. 2026), Post-Training as Reweighting (Bu et al. 2025), Reshaping Reasoning (X. Chen et al. 2025), and GRPO Transcend (Ni et al. 2025) independently arrived at the same conclusion: RLVR re-weights probability mass inside the base model’s distribution.
The confidence of reasoning models is structurally broken. Decoupling Reasoning (Ma et al. 2026), Taming Overconfidence (Leng et al. 2024), Reasoning about Uncertainty (Mei et al. 2025), DINCO (V. Wang and Stengel-Eskin 2025), and Wired for Overconfidence (T. Zhao et al. 2026) each independently reported severe miscalibration after RLHF/RLVR.
The traces that look “deeply considered” are in fact myopic. Extracting Search Trees (S. Chen et al. 2026) extracted a search tree from an LLM’s reasoning trace and showed quantitatively that even when deep nodes are expanded, the actual move selection is determined by shallow information. Reasoning Horizon (D. Ye et al. 2026) showed that the last 70–85% of a CoT has virtually no causal influence on the answer.
Process Reward Model (PRM)-guided search does not consistently beat naive Best-of-N. Limits of PRM-Guided (Cinquin et al. 2025) and Hard2Verify (Pandit et al. 2025) demonstrated that the generalization of existing PRMs collapses on frontier-level problems.
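
The pass@K comparison behind the first finding can be made concrete with the widely used unbiased pass@k estimator. The sketch below is illustrative only: the function names and the per-problem counts are hypothetical and are not taken from any of the cited papers. It shows how a model with thin but broad coverage can trail at small K and lead at large K.

```python
from math import comb
from typing import Sequence


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: Sequence[int], n: int, k: int) -> float:
    """Average pass@k over problems, given the correct-sample count per problem."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical counts of correct samples out of n = 100 on four problems.
BASE_COUNTS = [5, 3, 2, 1]   # assumed: rarely right, but never at zero
RL_COUNTS = [60, 50, 0, 0]   # assumed: sharp on two problems, zero on two

if __name__ == "__main__":
    for k in (1, 10, 100):
        base = mean_pass_at_k(BASE_COUNTS, 100, k)
        rl = mean_pass_at_k(RL_COUNTS, 100, k)
        print(f"k={k:>3}  base={base:.3f}  rl={rl:.3f}")
```

With these made-up counts the RL-style profile wins at K = 1 and the base-style profile wins at K = 100, which is the shape of crossover the re-weighting view predicts.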

Technical improvements continue. But the foundational research into what is really working is now proceeding with at least the same intensity as the visible efficiency gains.

In parallel, there is one more observation that runs through this book. Four research lines that rarely interact — training, inference, search, and faithfulness probing — have independently arrived at the same operation (cut the CoT at a prefix and measure something). We treat this in detail in the cross-cutting observations later in the chapter.

Three Questions That Run Through This Book

Central problems

Q1. Training side: Does RLVR genuinely expand what the base model can do, or is it merely a re-weighting of already-existing capabilities?
Q2. Inference side: How can we estimate the correctness of a reasoning trace produced by the model when no ground truth is available?
Q3. Compute budget: Where (length, number of samples, search) should the limited inference compute be spent?

These three questions sit at the intersection where multiple independently developed research lines have begun to overlap rapidly during 2025–2026. The book is a map of that intersection.

Scope of the Nine Chapters

Theory and Limits of RLVR

The basic question of what RLVR is doing developed into the field’s largest debate in 2025–2026. While five or more groups independently converged on the re-weighting view, alternative metrics like CoT-Pass@K (Wen et al. 2026) pushed back, arguing that RLVR does genuinely exceed the base model. We follow the debate through to the reconciliation proposed by the Two-Stage Dynamic View (Yang Yue et al. 2025b), in which the effect of RL switches dynamically as a function of training stage and problem difficulty.

→ Detail: Theory and Limits of RLVR

GRPO and Reward Design

Group Relative Policy Optimization (GRPO) is now the baseline; the discussion assumes one of DAPO (Q. Yu et al. 2025), Dr. GRPO (Z. Liu et al. 2025), GSPO (Zheng et al. 2025), or VAPO (Yu Yue et al. 2025). A broader trend is the family of methods that turn the policy’s own confidence or consistency into the reward (TTRL, Intuitor, PPPO, PRIME); this family connects directly to the prefix observation that runs through the book. Meanwhile, Spurious Rewards (Shao et al. 2025) delivered the unsettling result that random rewards still raise Qwen’s accuracy, effectively imposing a cross-model robustness obligation on any new reward proposal.
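
For readers new to GRPO, the core of the method is that each sampled response is scored relative to the other responses drawn for the same prompt. The following is a minimal sketch of that group-relative advantage, assuming mean and standard-deviation normalization with a population standard deviation and a small `eps`; the function name is hypothetical, and individual variants change these details (some drop the standard-deviation term altogether).

```python
from statistics import fmean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward against the group sampled for the same prompt."""
    if not rewards:
        raise ValueError("need at least one reward")
    mu = fmean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations differ
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Binary verifiable rewards for four responses to one prompt.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
    # A group where every response gets the same reward carries no signal.
    print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))
```

The second call illustrates a practical consequence of the design: when all responses in a group receive the same reward, every advantage is zero, so that prompt contributes nothing to the update.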

→ Detail: GRPO and Reward Design

Process Reward Models

The scarcity of step labels had been the biggest barrier to widespread PRM use. Now five independent signal sources — MC rollouts, log-likelihood ratios, pseudo-labels back-propagated from the outcome, next-token probability, and completer agreement — have produced a wave of label-free PRMs. In parallel, generative PRMs have shifted from scalar heads to CoT-verbalized verifiers, enabling test-time compute scaling on the verifier side as well. At the same time, difficulty-extended benchmarks like Hard2Verify (Pandit et al. 2025) cause existing PRMs to lose substantial accuracy, exposing the limits of PRM generalization.
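
Of the signal sources listed above, MC rollouts are the easiest to sketch: the label of a step is estimated as the fraction of sampled continuations from that step's prefix that end in the correct answer. The code below is a hedged illustration, not any specific paper's pipeline. `mc_step_labels`, the `complete` callable, and the deterministic toy completer are all hypothetical stand-ins for a real sampler.

```python
from typing import Callable, List, Sequence

# A completer takes a prefix of steps and a seed, and returns a final answer.
Completer = Callable[[List[str], int], str]


def mc_step_labels(steps: Sequence[str], complete: Completer,
                   gold: str, n_rollouts: int = 8) -> List[float]:
    """Soft label per step: share of rollouts from its prefix that reach `gold`."""
    labels = []
    for t in range(1, len(steps) + 1):
        prefix = list(steps[:t])
        hits = sum(complete(prefix, seed) == gold for seed in range(n_rollouts))
        labels.append(hits / n_rollouts)
    return labels


def toy_complete(prefix: List[str], seed: int) -> str:
    """Toy stand-in for an LLM: a bad step makes recovery rare."""
    if "BAD" in prefix:
        return "42" if seed % 4 == 0 else "0"
    return "42"


if __name__ == "__main__":
    trace = ["s1", "s2", "BAD", "s4"]
    print(mc_step_labels(trace, toy_complete, gold="42"))
```

The drop in the estimated label at the third step is the kind of signal a label-free PRM can be trained on, with no human step annotation involved.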

→ Detail: Process Reward Models

Self-Consistency and Weighted Majority Voting

Starting from Wang et al.’s Self-Consistency (X. Wang et al. 2023), the years 2025–2026 extended this framework in four directions: (i) weighting — DeepConf (Fu et al. 2025), CISC (Taubenfeld et al. 2025), CER (Razghandi et al. 2025), Self-Certainty (Z. Kang et al. 2025), IEW (Sharma and Chopra 2025); (ii) prefix exploitation — PoLR (Jindal et al. 2026), Path-Consistency (Zhu et al. 2024), Prefix-Confidence Scaling (Otth et al. 2025), ST-BoN (Y. Wang et al. 2025), Beyond the Last Answer (Hammoud et al. 2025), Prefix Consistency (Iwase et al. 2026); (iii) theoretical support — MAP-optimality of weighted majority vote (Kuang et al. 2025); (iv) adaptive sampling — BEACON (Wan et al. 2025), ReASC (Kim et al. 2026). The prefix-exploitation family is the convergence case that most symbolizes this book: six-plus groups independently arrived at the same operation.
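
The weighting direction (i) differs from plain Self-Consistency only in how each vote is counted. A minimal sketch, assuming each sampled trace comes with a final answer and an optional scalar weight; how that weight is computed is exactly where the cited methods differ, so the weights here are hypothetical placeholders.

```python
from collections import defaultdict
from typing import Hashable, Optional, Sequence


def weighted_majority_vote(answers: Sequence[Hashable],
                           weights: Optional[Sequence[float]] = None) -> Hashable:
    """Return the answer with the largest total weight (unit weights = plain voting)."""
    if not answers:
        raise ValueError("need at least one answer")
    if weights is None:
        weights = [1.0] * len(answers)
    if len(weights) != len(answers):
        raise ValueError("answers and weights must have the same length")
    totals = defaultdict(float)
    for answer, weight in zip(answers, weights):
        totals[answer] += weight
    # Ties resolve to the answer that appeared first among the samples.
    return max(totals, key=totals.get)


if __name__ == "__main__":
    answers = ["12", "15", "12", "15", "15"]
    confidences = [0.9, 0.2, 0.8, 0.3, 0.1]  # hypothetical per-trace confidence
    print(weighted_majority_vote(answers))               # unweighted vote
    print(weighted_majority_vote(answers, confidences))  # confidence-weighted vote
```

In this example the unweighted vote and the weighted vote disagree, which is the situation where the choice of weighting signal matters.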

→ Detail: Self-Consistency and Weighted Majority Voting

Confidence and Uncertainty

Three classical routes measure “does the model know it is right?”: logit-based, verbalized, and sampling-based. By 2026 it was established that the logit-based and verbalized signals are no longer trustworthy for reasoning models. Negative evidence converged from three angles: at the circuit level (Wired for Overconfidence (T. Zhao et al. 2026)), in decision-theoretic terms (Faithful? (Jiawei Wang et al. 2026)), and in direct comparison (DINCO (V. Wang and Stengel-Eskin 2025)). The shift to sampling-based signals is now effectively the default.
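
Two quantities recur in this discussion: how miscalibrated a confidence signal is, and what a sampling-based confidence looks like. The sketch below assumes the common equal-width-bin form of expected calibration error (ECE) and uses answer agreement across samples as the sampling-based signal. Function names and the example numbers are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence, Tuple


def expected_calibration_error(confidences: Sequence[float],
                               correct: Sequence[int], n_bins: int = 10) -> float:
    """ECE with equal-width bins over [0, 1]; `correct` holds 0/1 outcomes."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("need equally sized, non-empty inputs")
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, hit))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += len(members) / total * abs(avg_conf - accuracy)
    return ece


def agreement_confidence(answers: Sequence[Hashable]) -> Tuple[Hashable, float]:
    """Sampling-based confidence: share of samples that agree with the top answer."""
    top, count = Counter(answers).most_common(1)[0]
    return top, count / len(answers)


if __name__ == "__main__":
    # A model that always says 0.99 but is right half the time.
    print(expected_calibration_error([0.99] * 4, [1, 1, 0, 0]))
    print(agreement_confidence(["a", "a", "b", "a"]))
```

The first call shows the overconfident pattern described above: stated confidence sits near 1.0 regardless of correctness, so the gap to accuracy is large.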

→ Detail: Confidence and Uncertainty

Test-Time Compute Scaling

That “investing more compute at inference time raises accuracy” is now established fact. The question has shifted from whether it rises to where and how to invest. We cover Budget Forcing (Muennighoff et al. 2025) and adaptive allocation (CaTS (C. Huang et al. 2025), T1 (M. Kang et al. 2025), Fractional Reasoning (S. Liu et al. 2025)), system-side optimization (ThinKV (Ramachandran et al. 2025), SpecReason (Pan et al. 2025), Sleep-time Compute (Lin et al. 2025)), latent reasoning (Coconut (Hao et al. 2024)), and Markovian Thinker (Aghajohari et al. 2025) for linear scaling. We also organize the non-monotone “CoT can be both too long and too short” finding and the domain-dependence (X. Huang et al. 2025) showing that math-optimized operating points do not automatically transfer to medical reasoning.

→ Detail: Test-Time Compute Scaling

Tree Search and MCTS

The MCTS lineage of AlphaMath (G. Chen et al. 2024) and rStar-Math (Guan et al. 2025), the dynamic wider vs. deeper selection of AB-MCTS (Inoue et al. 2025), the verification-granularity spectrum that VG-Search (H. M. Chen et al. 2025) uses to unify beam search and Best-of-N, the uncertainty-aware line (UATS (Song et al. 2026), UVM (F. Yu et al. 2025)), and the verifier-free line (SELT (M. Wu et al. 2025), MoB (Rakhsha et al. 2025)) all run in parallel. The decisive observation of this chapter is the faithfulness warning of Extracting Search Trees (S. Chen et al. 2026): the deep traces shown by LLMs can be explained by myopic decisions.

→ Detail: Tree Search and MCTS

Reasoning-Structure Analysis

A fourth signal axis — independent of aggregation, confidence, and search — emerged in 2025–2026: reading the structure of the CoT itself. Reasoning Horizon (D. Ye et al. 2026) uses intervention experiments to show that the last 70–85% of a trace is causally empty. FSF (Feng et al. 2025) converts the CoT into a reasoning graph and demonstrates that the failed-step fraction is a stronger predictor of correctness than length or review ratio. CRV (Z. Zhao et al. 2025) predicts the correctness of each step with AUROC 92 from the internal attribution graph, and Four Habits of STaRs (Gandhi et al. 2025) identifies four cognitive patterns shared by self-improving models.

→ Detail: Reasoning-Structure Analysis

Reasoning in Diffusion LLMs

In masked diffusion LLMs (DLLMs) such as LLaDA (Nie et al. 2025), Dream (J. Ye et al. 2025), and MMaDA (Yang et al. 2025), the time axis is no longer the autoregressive (AR) token sequence but the denoising trajectory. Prophet (Li et al. 2025) exploits the phenomenon that the answer is determined early in the denoising trajectory; Time-is-a-Feature (W. Wang et al. 2025) applies majority vote across denoising steps within a single trajectory; I-DLM (Y. Yu et al. 2026) proposes a self-verify mechanism within a single forward pass. Prefix-based aggregation developed for AR models is being reinvented in DLLMs with the denoising step as the time axis.

→ Detail: Reasoning in Diffusion LLMs

Cross-Cutting Observations

Each chapter can be read standalone, but several cross-chapter patterns form the core of this book’s argument.

Observation 1: Four Research Lines Have Independently Converged on the Same “Prefix” Operation

Four research lines that rarely interact — training, inference, search, and faithfulness probing — have all arrived at the same operation: cut the CoT at a prefix and measure something.

Training side: GRPO-VPS (Jingyi Wang et al. 2026), which probes per-segment correctness probability and feeds it into the reward; PACR (Yoon et al. 2025), which encodes the monotonicity of correctness probability during reasoning into the reward; PPPO (Sun et al. 2025), which estimates the value of a prefix as an MDP state.
Inference side: PoLR (Jindal et al. 2026), Prefix-Confidence Scaling (Otth et al. 2025), Path-Consistency (Zhu et al. 2024), ST-BoN (Y. Wang et al. 2025), Beyond the Last Answer (Hammoud et al. 2025), Prefix Consistency (Iwase et al. 2026) — all sample many continuations from a short prefix and aggregate them.
Search side: MCTS-style methods that expand prefixes as nodes (AlphaMath (G. Chen et al. 2024), rStar-Math (Guan et al. 2025)).
Faithfulness side: Early Answering (Lanham et al. 2023), which truncates the CoT and asks the model to answer in order to measure faithfulness. Mechanically this is the same operation as resampling from a prefix, but with the opposite purpose.

That this simultaneous convergence happens without mutual reference strongly suggests that the prefix is a load-bearing unit of reasoning. This is the single most important observation that runs through the book.
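
The shared operation can be written down in a few lines, which is part of why it keeps being rediscovered. The sketch below is a generic illustration rather than any one cited method: `probe_prefix`, the `continue_fn` callable, and the toy continuation model are hypothetical. It cuts a trace at a fraction of its length, samples continuations, and measures answer agreement.

```python
import math
from collections import Counter
from typing import Callable, List, Sequence, Tuple

# A continuation function takes a prefix of steps and a seed, returns a final answer.
ContinueFn = Callable[[List[str], int], str]


def probe_prefix(steps: Sequence[str], frac: float, continue_fn: ContinueFn,
                 n_samples: int = 8) -> Tuple[str, float]:
    """Cut the trace at `frac` of its length, resample, return (top answer, agreement)."""
    cut = max(1, min(len(steps), math.ceil(frac * len(steps))))
    prefix = list(steps[:cut])
    answers = [continue_fn(prefix, seed) for seed in range(n_samples)]
    top, count = Counter(answers).most_common(1)[0]
    return top, count / n_samples


def toy_continue(prefix: List[str], seed: int) -> str:
    """Toy stand-in for an LLM: the answer is settled once two steps are fixed."""
    if len(prefix) >= 2:
        return "42"
    return "42" if seed % 2 == 0 else "17"


if __name__ == "__main__":
    trace = ["s1", "s2", "s3", "s4"]
    for frac in (0.25, 0.5, 1.0):
        print(frac, probe_prefix(trace, frac, toy_continue))
```

What differs across the four lines is only what is done with the measurement: it becomes a reward on the training side, a vote on the inference side, a node value in search, and a faithfulness diagnostic in probing.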

Observation 2: Systematic Suspicion of Signals That Were “Thought to Work”

Four negative findings accumulated independently during 2025–2026 are shaking the assumptions of reasoning research.

Suspicion of RLVR’s capability-expansion claim: five independent groups from Yue et al. (Yang Yue et al. 2025a) onward confirmed the re-weighting view (Theory and Limits of RLVR).
Suspicion of verbalized confidence: five independent groups confirmed severe miscalibration (Confidence and Uncertainty).
Suspicion of faithfulness: the gap between deep traces and the actual decision (Tree Search and MCTS, Reasoning-Structure Analysis).
Suspicion of PRM-guided search: it does not consistently outperform naive Best-of-N (Process Reward Models).

These doubts are the driving force behind an explosion in the search for new signal sources — verifier-free, uncertainty-aware, sampling-based confidence, and reasoning-structure analysis. The convergence in Observation 1 is the flip side: that exploration is independently landing in the same place.

Observation 3: At ICLR 2026, Adaptive Allocation Became the Default

Fixed-K self-consistency and fixed-token-budget CoT have receded into legacy-baseline territory. Methods that allocate compute dynamically based on problem difficulty or confidence were accepted together at ICLR 2026: CaTS (C. Huang et al. 2025), T1 (M. Kang et al. 2025), Fractional Reasoning (S. Liu et al. 2025), ThinKV (Ramachandran et al. 2025), DiffAdapt (X. Liu et al. 2025), BEACON (Wan et al. 2025), and ReASC (Kim et al. 2026). This signals the community’s shift to deciding adaptively as the default.
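
To make the contrast with fixed-K voting concrete, here is a deliberately simple adaptive sampler: it keeps drawing answers until the leading answer's share passes a threshold or the budget runs out. This is a stand-in stopping rule chosen for brevity, not the rule used by BEACON, ReASC, or any other cited method; `adaptive_vote` and its parameters are hypothetical.

```python
from collections import Counter
from typing import Callable, Tuple


def adaptive_vote(sample_fn: Callable[[int], str], min_samples: int = 3,
                  max_samples: int = 16, threshold: float = 0.8) -> Tuple[str, int]:
    """Sample until the top answer's share reaches `threshold`; return (answer, samples used)."""
    counts = Counter()
    for i in range(max_samples):
        counts[sample_fn(i)] += 1
        drawn = i + 1
        top, top_count = counts.most_common(1)[0]
        if drawn >= min_samples and top_count / drawn >= threshold:
            return top, drawn  # confident enough: stop early
    return counts.most_common(1)[0][0], max_samples  # budget exhausted


if __name__ == "__main__":
    easy = lambda i: "7"                 # toy problem where every sample agrees
    hard = lambda i: ("7", "9")[i % 2]   # toy problem where samples keep splitting
    print(adaptive_vote(easy))
    print(adaptive_vote(hard))
```

The easy toy problem stops at the minimum sample count while the hard one consumes the full budget, which is the behavior that makes fixed K look wasteful on one and insufficient on the other.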

Observation 4: A New Signal Axis That Reads Trace Structure

Alongside aggregation (consensus across multiple traces), confidence (token distribution of a single trace), and test-time compute (budget allocation over number and length of traces), a fourth signal axis emerged independently: reading the structure of the CoT itself. Reasoning Horizon’s causal vacuum, FSF’s failed-step fraction, CRV’s internal attribution graph, and Four Habits’ cognitive patterns all rest on the same intuition: a predictive signal lives in structural features that a single length or confidence value cannot capture. The integration of external aggregation and internal observation is an open question for 2026.

Observation 5: Post-RLVR Miscalibration Connects Q1 and Q2

RLHF- and RLVR-trained reasoning models lose confidence calibration, and this phenomenon directly links Q1 (what is RLVR learning?) with Q2 (what signals work at inference?). Two results may be two sides of the same “sharpening” phenomenon: the RLVR picture in which re-weighting sharpens an already-high-probability distribution (the first item of Observation 2), and the observation that confidence, broadly distributed in the 0.7–0.9 band after SFT, collapses to near 1.0 after RLHF (Leng et al. 2024). A mechanistic account linking the two is still rare in the literature and remains an open question.

Observation 6: The Nature of Reasoning Depends on the Domain

Many of the inference-time methods this book covers are implicitly optimized for mathematics. In knowledge-intensive domains such as medicine, the KI (Knowledge Index)–accuracy correlation dominates the InfoGain–accuracy correlation (Knowledge or Reasoning? (J. Wu et al. 2025)), and the optimal thinking budget saturates around 4K tokens (m1 (X. Huang et al. 2025)); budget forcing may overturn an initially correct answer. The transferability of math-derived findings is an open question that cuts across chapters.

How to Read This Book

If you want to know one area: jump straight to the relevant chapter — every chapter is written to stand alone.
If you want a panoramic view: start with Self-Consistency and Weighted Majority Voting (the inference-side signal space) and Theory and Limits of RLVR (the training-side signal space).
If you are starting new research: use any one of the six observations above as an entry point — each cuts across multiple areas. In particular, Observation 1 (prefix convergence) and Observation 2 (systematic suspicion) are the book’s strongest claims; keeping them in mind while reading each chapter makes the cross-chapter patterns easier to see.

References

Aghajohari, Milad, Kamran Chitsaz, Amirhossein Kazemnejad, et al. 2025. “The Markovian Thinker: Architecture-Agnostic Linear Scaling of Reasoning.” arXiv Preprint arXiv:2510.06557. https://arxiv.org/abs/2510.06557.

Bu, Dake, Wei Huang, Andi Han, et al. 2025. “Post-Training as Reweighting: A Stochastic View of Reasoning Trajectories in Language Models.” arXiv Preprint arXiv:2511.07368. https://arxiv.org/abs/2511.07368.

Chen, Guoxin, Minpeng Liao, Chengxi Li, and Kai Fan. 2024. “AlphaMath Almost Zero: Process Supervision Without Process.” Advances in Neural Information Processing Systems. https://arxiv.org/abs/2405.03553.

Chen, Hao Mark, Guanxi Lu, Yasuyuki Okoshi, Zhiwen Mo, Masato Motomura, and Hongxiang Fan. 2025. “Rethinking Optimal Verification Granularity for Compute-Efficient Test-Time Scaling.” Advances in Neural Information Processing Systems. https://arxiv.org/abs/2505.11730.

Chen, Sixing, Ji-An Li, Saner Cakir, Sinan Akcali, Kayla Lee, and Marcelo G. Mattar. 2026. “Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning.” arXiv Preprint arXiv:2605.06840. https://arxiv.org/abs/2605.06840.

Chen, Xingwu, Tianle Li, and Difan Zou. 2025. “Reshaping Reasoning in LLMs: A Theoretical Analysis of RL Training Dynamics Through Pattern Selection.” arXiv Preprint arXiv:2506.04695. https://arxiv.org/abs/2506.04695.

Cinquin, Tristan, Geoff Pleiss, and Agustinus Kristiadi. 2025. “Limits of PRM-Guided Tree Search for Mathematical Reasoning with LLMs.” arXiv Preprint arXiv:2510.20272. https://arxiv.org/abs/2510.20272.

Feng, Yunzhen, Julia Kempe, Cheng Zhang, Parag Jain, and Anthony Hartshorn. 2025. “What Characterizes Effective Reasoning? Revisiting Length, Review, and Structure of CoT.” arXiv Preprint arXiv:2509.19284. https://arxiv.org/abs/2509.19284.

Fu, Yichao, Xuewei Wang, Yuandong Tian, and Jiawei Zhao. 2025. “Deep Think with Confidence.” arXiv Preprint arXiv:2508.15260. https://arxiv.org/abs/2508.15260.

Gandhi, Kanishk, Ayush Chakravarthy, Anikait Singh, Nathan Lile, and Noah D. Goodman. 2025. “Cognitive Behaviors That Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs.” arXiv Preprint arXiv:2503.01307. https://arxiv.org/abs/2503.01307.

Guan, Xinyu, Li Lyna Zhang, Yifei Liu, et al. 2025. “rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking.” arXiv Preprint arXiv:2501.04519. https://arxiv.org/abs/2501.04519.

Hammoud, Hasan Abed Al Kader, Hani Itani, and Bernard Ghanem. 2025. “Beyond the Last Answer: Your Reasoning Trace Uncovers More Than You Think.” arXiv Preprint arXiv:2504.20708. https://arxiv.org/abs/2504.20708.

Hanqing et al. 2025. “The Path Not Taken: RLVR Provably Learns Off the Principals.” arXiv Preprint arXiv:2511.08567. https://arxiv.org/abs/2511.08567.

Hao, Shibo, Sainbayar Sukhbaatar, DiJia Su, et al. 2024. “Training Large Language Models to Reason in a Continuous Latent Space.” International Conference on Learning Representations. https://arxiv.org/abs/2412.06769.

Huang, Chengsong, Langlin Huang, Jixuan Leng, Jiacheng Liu, and Jiaxin Huang. 2025. “Efficient Test-Time Scaling via Self-Calibration.” International Conference on Learning Representations. https://arxiv.org/abs/2503.00031.

Huang, Xiaoke, Juncheng Wu, Hui Liu, Xianfeng Tang, and Yuyin Zhou. 2025. “m1: Unleash the Potential of Test-Time Scaling for Medical Reasoning with Large Language Models.” arXiv Preprint arXiv:2504.00869. https://arxiv.org/abs/2504.00869.

Inoue, Yuichi, Kou Misaki, Yuki Imajuku, So Kuroki, Taishi Nakamura, and Takuya Akiba. 2025. “Wider or Deeper? Scaling LLM Inference-Time Compute with Adaptive Branching Tree Search.” Advances in Neural Information Processing Systems. https://arxiv.org/abs/2503.04412.

Iwase, Naoto, Yuki Ichihara, Mohammad Atif Quamar, and Junpei Komiyama. 2026. “Reliable Chain-of-Thought via Prefix Consistency.” arXiv Preprint arXiv:2605.07654. https://arxiv.org/abs/2605.07654.

Jindal, Ishan, Sai Prashanth Akuthota, Jayant Taneja, and Sachin Dev Sharma. 2026. “The Path of Least Resistance: Guiding LLM Reasoning Trajectories with Prefix Consensus.” The Fourteenth International Conference on Learning Representations. https://openreview.net/forum?id=hrnSqERgPn.

Kang, Minki, Jongwon Jeong, and Jaewoong Cho. 2025. “T1: Tool-Integrated Self-Verification for Test-Time Compute Scaling in Small Language Models.” International Conference on Learning Representations. https://arxiv.org/abs/2504.04718.

Kang, Zhewei, Xuandong Zhao, and Dawn Song. 2025. “Scalable Best-of-N Selection for Large Language Models via Self-Certainty.” Advances in Neural Information Processing Systems. https://openreview.net/forum?id=29FRqmVQK8.

Kim, Junseok, Nakyeong Yang, Kyungmin Min, and Kyomin Jung. 2026. “Reliability-Aware Adaptive Self-Consistency for Efficient Sampling in LLM Reasoning.” arXiv Preprint arXiv:2601.02970. https://arxiv.org/abs/2601.02970.

Kuang, Peng, Yanli Wang, Xiaoyu Han, Yaowenqi Liu, Kaidi Xu, and Haohan Wang. 2025. “Optimal Aggregation of LLM and PRM Signals for Efficient Test-Time Scaling.” International Conference on Learning Representations. https://arxiv.org/abs/2510.13918.

Lanham, Tamera, Anna Chen, Ansh Radhakrishnan, et al. 2023. “Measuring Faithfulness in Chain-of-Thought Reasoning.” arXiv Preprint arXiv:2307.13702. https://arxiv.org/abs/2307.13702.

Leng, Jixuan, Chengsong Huang, Banghua Zhu, and Jiaxin Huang. 2024. “Taming Overconfidence in LLMs: Reward Calibration in RLHF.” arXiv Preprint arXiv:2410.09724. https://arxiv.org/abs/2410.09724.

Li, Pengxiang, Yefan Zhou, Dilxat Muhtar, et al. 2025. “Diffusion Language Models Know the Answer Before Decoding.” arXiv Preprint arXiv:2508.19982. https://arxiv.org/abs/2508.19982.

Lin, Kevin, Charlie Snell, Yu Wang, et al. 2025. “Sleep-Time Compute: Beyond Inference Scaling at Test-Time.” arXiv Preprint arXiv:2504.13171. https://arxiv.org/abs/2504.13171.

Liu, Sheng, Tianlang Chen, Pan Lu, et al. 2025. “Fractional Reasoning via Latent Steering Vectors Improves Inference Time Compute.” International Conference on Learning Representations. https://arxiv.org/abs/2506.15882.

Liu, Xiang, Xuming Hu, Xiaowen Chu, and Eunsol Choi. 2025. “DiffAdapt: Difficulty-Adaptive Reasoning for Token-Efficient LLM Inference.” International Conference on Learning Representations. https://arxiv.org/abs/2510.19669.

Liu, Zichen, Changyu Chen, Wenjun Li, et al. 2025. “Understanding R1-Zero-Like Training: A Critical Perspective.” Conference on Language Modeling (COLM). https://arxiv.org/abs/2503.20783.

Ma, Zhengzhao, Xueru Wen, Boxi Cao, et al. 2026. “Decoupling Reasoning and Confidence: Resurrecting Calibration in Reinforcement Learning from Verifiable Rewards.” arXiv Preprint arXiv:2603.09117. https://arxiv.org/abs/2603.09117.

Mei, Zhiting, Christina Zhang, Tenny Yin, Justin Lidard, Ola Shorinwa, and Anirudha Majumdar. 2025. “Reasoning about Uncertainty: Do Reasoning Models Know When They Don’t Know?” arXiv Preprint arXiv:2506.18183. https://arxiv.org/abs/2506.18183.

Meng, Haoming, Kexin Huang, Shaohang Wei, et al. 2026. “Sparse but Critical: A Token-Level Analysis of Distributional Shifts in RLVR Fine-Tuning of LLMs.” International Conference on Learning Representations. https://arxiv.org/abs/2603.22446.

Muennighoff, Niklas, Zitong Yang, Weijia Shi, et al. 2025. “s1: Simple Test-Time Scaling.” In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, edited by Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng. Association for Computational Linguistics. https://doi.org/10.18653/v1/2025.emnlp-main.1025.

Ni, Kangqi, Zhen Tan, Zijie Liu, Pingzhi Li, and Tianlong Chen. 2025. “Can GRPO Help LLMs Transcend Their Pretraining Origin?” arXiv Preprint arXiv:2510.15990. https://arxiv.org/abs/2510.15990.

Nie, Shen, Fengqi Zhu, Zebin You, et al. 2025. “Large Language Diffusion Models.” arXiv Preprint arXiv:2502.09992. https://arxiv.org/abs/2502.09992.

Otth, Matthias, Jonas Hübotter, Ido Hakimi, and Andreas Krause. 2025. “Maximizing Prefix-Confidence at Test-Time Efficiently Improves Mathematical Reasoning.” arXiv Preprint arXiv:2507.18122. https://arxiv.org/abs/2507.18122.

Pan, Rui, Yinwei Dai, Zhihao Zhang, Gabriele Oliaro, Zhihao Jia, and Ravi Netravali. 2025. “SpecReason: Fast and Accurate Inference-Time Compute via Speculative Reasoning.” arXiv Preprint arXiv:2504.07891. https://arxiv.org/abs/2504.07891.

Pandit, Shrey, Austin Xu, Xuan-Phi Nguyen, Yifei Ming, Caiming Xiong, and Shafiq Joty. 2025. “Hard2Verify: A Step-Level Verification Benchmark for Open-Ended Frontier Math.” arXiv Preprint arXiv:2510.13744. https://arxiv.org/abs/2510.13744.

Rakhsha, Amin, Kanika Madan, Tianyu Zhang, Amir-massoud Farahmand, and Amir Khasahmadi. 2025. “Majority of the Bests: Improving Best-of-N via Bootstrapping.” Advances in Neural Information Processing Systems. https://arxiv.org/abs/2511.18630.

Ramachandran, Akshat, Marina Neseem, Charbel Sakr, Rangharajan Venkatesan, Brucek Khailany, and Tushar Krishna. 2025. “ThinKV: Thought-Adaptive KV Cache Compression for Efficient Reasoning Models.” International Conference on Learning Representations. https://arxiv.org/abs/2510.01290.

Razghandi, Ali, Seyed Mohammad Hadi Hosseini, and Mahdieh Soleymani Baghshah. 2025. “CER: Confidence Enhanced Reasoning in LLMs.” Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics. https://arxiv.org/abs/2502.14634.

Shao, Rulin, Shuyue Stella Li, Rui Xin, et al. 2025. “Spurious Rewards: Rethinking Training Signals in RLVR.” arXiv Preprint arXiv:2506.10947. https://arxiv.org/abs/2506.10947.

Sharma, Aman, and Paras Chopra. 2025. “The Sequential Edge: Inverse-Entropy Voting Beats Parallel Self-Consistency at Matched Compute.” arXiv Preprint arXiv:2511.02309. https://arxiv.org/abs/2511.02309.

Song, Zeen, Zihao Ma, Wenwen Qiang, Changwen Zheng, and Gang Hua. 2026. “Adaptive Uncertainty-Aware Tree Search for Robust Reasoning.” arXiv Preprint arXiv:2602.06493. https://arxiv.org/abs/2602.06493.

Sun, Yiliu, Zicheng Zhao, Yang Wei, Yanfang Zhang, and Chen Gong. 2025. “Well Begun, Half Done: Reinforcement Learning with Prefix Optimization for LLM Reasoning.” arXiv Preprint arXiv:2512.15274. https://arxiv.org/abs/2512.15274.

Taubenfeld, Amir, Tom Sheffer, Eran Ofek, et al. 2025. “Confidence Improves Self-Consistency in LLMs.” In Findings of the Association for Computational Linguistics: ACL 2025, edited by Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar. Association for Computational Linguistics. https://doi.org/10.18653/v1/2025.findings-acl.1030.

Wan, Guangya, Zixin Stephen Xu, Sasa Zorc, et al. 2025. “BEACON: Bayesian Optimal Stopping for Efficient LLM Sampling.” arXiv Preprint arXiv:2510.15945. https://arxiv.org/abs/2510.15945.

Wang, Jiawei, Yanfei Zhou, Siddartha Devic, and Deqing Fu. 2026. “Are LLM Decisions Faithful to Verbal Confidence?” arXiv Preprint arXiv:2601.07767. https://arxiv.org/abs/2601.07767.

Wang, Jingyi, Lei Zhu, Tengjin Weng, et al. 2026. “GRPO-VPS: Enhancing Group Relative Policy Optimization with Verifiable Process Supervision.” International Conference on Learning Representations. https://arxiv.org/abs/2604.20659.

Wang, Victor, and Elias Stengel-Eskin. 2025. “Calibrating Verbalized Confidence with Self-Generated Distractors.” International Conference on Learning Representations. https://arxiv.org/abs/2509.25532.

Wang, Wen, Bozhen Fang, Chenchen Jing, et al. 2025. “Time Is a Feature: Exploiting Temporal Dynamics in Diffusion Language Models.” arXiv Preprint arXiv:2508.09138. https://arxiv.org/abs/2508.09138.

Wang, Xuezhi, Jason Wei, Dale Schuurmans, et al. 2023. “Self-Consistency Improves Chain of Thought Reasoning in Language Models.” International Conference on Learning Representations. https://openreview.net/forum?id=1PL1NIMMrw.

Wang, Yiming, Pei Zhang, Siyuan Huang, et al. 2025. “Sampling-Efficient Test-Time Scaling: Self-Estimating the Best-of-N Sampling in Early Decoding.” Advances in Neural Information Processing Systems. https://arxiv.org/abs/2503.01422.

Wen, Xumeng, Zihan Liu, Shun Zheng, et al. 2026. “Reinforcement Learning with Verifiable Rewards Implicitly Incentivizes Correct Reasoning in Base LLMs.” The Fourteenth International Conference on Learning Representations. https://openreview.net/forum?id=jGbRWwIidy.

Wu, Juncheng, Sheng Liu, Haoqin Tu, et al. 2025. “Knowledge or Reasoning? A Close Look at How LLMs Think Across Domains.” arXiv Preprint arXiv:2506.02126. https://arxiv.org/abs/2506.02126.

Wu, Mengsong, Di Zhang, Yuqiang Li, Dongzhan Zhou, and Wenliang Chen. 2025. “SELT: Self-Evaluation Tree Search for LLMs with Task Decomposition.” arXiv Preprint arXiv:2506.07557. https://arxiv.org/abs/2506.07557.

Yang, Ling, Ye Tian, Bowen Li, et al. 2025. “MMaDA: Multimodal Large Diffusion Language Models.” Advances in Neural Information Processing Systems. https://arxiv.org/abs/2505.15809.

Ye, Donald, Max Loffgren, Om Kotadia, and Linus Wong. 2026. “Mechanistic Evidence for Faithfulness Decay in Chain-of-Thought Reasoning.” arXiv Preprint arXiv:2602.11201. https://arxiv.org/abs/2602.11201.

Ye, Jiacheng, Zhihui Xie, Lin Zheng, et al. 2025. “Dream 7B: Diffusion Large Language Models.” arXiv Preprint arXiv:2508.15487. https://arxiv.org/abs/2508.15487.

Yoon, Eunseop, Hee Suk Yoon, Jaehyun Jang, et al. 2025. “PACR: Progressively Ascending Confidence Reward for LLM Reasoning.” arXiv Preprint arXiv:2510.22255. https://arxiv.org/abs/2510.22255.

Yu, Fei, Yingru Li, and Benyou Wang. 2025. “Robust Search with Uncertainty-Aware Value Models for Language Model Reasoning.” arXiv Preprint arXiv:2502.11155. https://arxiv.org/abs/2502.11155.

Yu, Qiying, Zheng Zhang, Ruofei Zhu, et al. 2025. “DAPO: An Open-Source LLM Reinforcement Learning System at Scale.” arXiv Preprint arXiv:2503.14476. https://arxiv.org/abs/2503.14476.

Yu, Yifan, et al. 2026. “Introspective Diffusion Language Models.” arXiv Preprint arXiv:2604.11035. https://arxiv.org/abs/2604.11035.

Yue, Yang, Zhiqi Chen, Rui Lu, et al. 2025a. “Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?” NeurIPS 2025 Workshop on Efficient Reasoning. https://arxiv.org/abs/2504.13837.

Yue, Yang, Zhiqi Chen, Rui Lu, et al. 2025b. “The Debate on RLVR Reasoning Capability Boundary: Shrinkage, Expansion, or Both? A Two-Stage Dynamic View.” arXiv Preprint arXiv:2510.04028. https://arxiv.org/abs/2510.04028.

Yue, Yu, Yufeng Yuan, Qiying Yu, et al. 2025. “VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks.” arXiv Preprint arXiv:2504.05118. https://arxiv.org/abs/2504.05118.

Zhao, Tianyi, Yinhan He, Wendy Zheng, Yujie Zhang, and Chen Chen. 2026. “Wired for Overconfidence: A Mechanistic Perspective on Inflated Verbalized Confidence in LLMs.” arXiv Preprint arXiv:2604.01457. https://arxiv.org/abs/2604.01457.

Zhao, Zheng, Yeskendir Koishekenov, Xianjun Yang, Naila Murray, and Nicola Cancedda. 2025. “Verifying Chain-of-Thought Reasoning via Its Computational Graph.” arXiv Preprint arXiv:2510.09312. https://arxiv.org/abs/2510.09312.

Zheng, Chujie, Shixuan Liu, Mingze Li, et al. 2025. “Group Sequence Policy Optimization.” arXiv Preprint arXiv:2507.18071. https://arxiv.org/abs/2507.18071.

Zhu, Jiace, Yuanzhe Huang, Yingtao Shen, Jie Zhao, and An Zou. 2024. “Path-Consistency with Prefix Enhancement for Efficient Inference in LLMs.” arXiv Preprint arXiv:2409.01281. https://arxiv.org/abs/2409.01281.