PTRM

Probabilistic Tiny Recursive Model (PTRM) is a test-time scaling framework released in May 2026 by Sghaier and Parviz at Mila Québec AI Institute / École de Technologie Supérieure (ETS), together with Jolicoeur-Martineau, the original author of Tiny Recursive Model (TRM) (Sghaier et al. 2026). In one sentence, PTRM lets TRM escape the bad latent basin that its deterministic recursion falls into: it adds Gaussian noise of scale \(\sigma\) to the latent at every deep recursion step, runs \(K\) trajectories in parallel, and reuses the Q head that TRM already learned for adaptive halting as a verifier to pick the best trajectory. No retraining and no task-specific augmentation are required, and it lifts TRM from 87.4 % to 98.75 % on Sudoku-Extreme and from 62.6 % to 91.2 % on Pencil Puzzle Bench (PPBench). Around the same time, a mechanistic analysis paper by Efstathiou & Balwani (Efstathiou and Balwani 2026), presented at the ICLR 2026 Workshop on Latent and Implicit Thinking, independently arrived at the same attractor hypothesis, and this chapter treats the two papers as a pair. Because PTRM is a test-time extension of TRM, the ARC Prize 2025 Paper Award winner, this book places it as “the third strategy that puts a trained model directly to work as is”.

Figure 1: Performance comparison of PTRM. On the PPBench puzzle set (left), it lifts TRM from 62.6 % to 91.2 %, beating the ensemble of the top 7 LLMs (assuming a perfect verifier) at 55.1 % by 36 pp. On Sudoku-Extreme (right), where Chain-of-Thought style frontier LLMs score 0 %, it reaches state-of-the-art at 98.75 %. Source: (Sghaier et al. 2026)

TRM Failure Modes: Trapped in a Bad Basin

The mechanistic analysis in Section 3 of the PTRM paper is the starting point. The authors recorded TRM’s latent trajectories on PPBench, for 100 puzzles at a time and at the granularity of one supervision step, and projected them onto the principal plane via Principal Component Analysis (PCA). The trajectories fall into three qualitative modes.

  • Quick success: within a few steps the trajectory moves into a convergent region and stays there. The Q value (halting logit) and the cell accuracy (the fraction of predicted cells that are correct) rise in sync and saturate at the same supervision step.
  • Delayed success: the trajectory oscillates for many steps inside a bounded region, and then at some step suddenly escapes to a different region and converges. The Q value and the cell accuracy spike at the same moment of escape.
  • Failure: the trajectory keeps oscillating in the bounded region, the Q value stays negative throughout (below 0.5 after sigmoid), and the cell accuracy never reaches 100 %.
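
The three modes can be told apart mechanically from a per-step cell-accuracy trace. The sketch below is illustrative only: the function name `classify_trajectory`, the cutoff `quick_steps=3`, and the example traces are assumptions made here, not definitions or data from the paper, which describes the modes qualitatively.

```python
def classify_trajectory(cell_acc, quick_steps=3):
    """Label a trajectory from its per-supervision-step cell accuracy.

    cell_acc: list of floats in [0, 1], one per supervision step.
    quick_steps: hypothetical cutoff separating quick from delayed success.
    """
    # First step at which every cell is correct, or None if that never happens.
    solved_at = next((t for t, acc in enumerate(cell_acc) if acc >= 1.0), None)
    if solved_at is None:
        return "failure"
    if solved_at < quick_steps:
        return "quick success"
    return "delayed success"


# Made-up traces, one per mode.
quick = [0.70, 0.95, 1.00, 1.00, 1.00, 1.00]
delayed = [0.70, 0.72, 0.69, 0.71, 1.00, 1.00]
failed = [0.70, 0.72, 0.69, 0.71, 0.70, 0.72]

for name, trace in [("quick", quick), ("delayed", delayed), ("failed", failed)]:
    print(name, "->", classify_trajectory(trace))
```

Note that the delayed and failed traces are indistinguishable over their first four steps, which is the point the paper makes about the early phase.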

From these observations, the authors introduce the vocabulary of good basin / bad basin. A good basin is a latent region in which the cell accuracy stays high, and a bad basin is one in which it does not. The key finding is that failure and delayed success behave identically in the early phase. Both are trapped in the same bad basin, and only the latter happens to escape through some stochastic fluctuation and reaches the correct answer. PTRM’s diagnosis is that “failure is not a lack of capability but a lack of an escape mechanism”.

Figure 2: The three modes of TRM trajectories. The top row shows the latent trajectories projected onto the principal plane via PCA (light shades are early steps, dark shades are later steps, circles are starts, squares are ends), and the bottom row shows the Q value (solid line, left axis) and the cell accuracy (dashed line, right axis) over supervision steps. Failure and delayed success oscillate in the same bounded region in the early phase, and only the latter escapes. Source: (Sghaier et al. 2026)

The Q Head Tracks Trajectory Quality

The second finding that anchors the PTRM design is in Section 3.2. TRM’s Q head was originally trained as the halting signal of Adaptive Computation Time (ACT) (Graves 2016), with a binary cross-entropy on whether “the current prediction matches the ground truth”. Aggregating over 100 PPBench validation puzzles, the authors find that the Q head’s logit \(\hat q\) tracks the cell accuracy almost perfectly at every supervision step.

At convergence, the two groups separate sharply, with \(\hat q \approx +6\) (sigmoid about 1) on correct trajectories and \(\hat q \approx -6\) (sigmoid about 0) on incorrect ones. The core interpretation of PTRM is that “as a by-product of the ACT auxiliary loss, TRM had effectively learned a verifier inside itself”. The downstream test-time procedure reuses this Q head as a rollout selector without any additional training.
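
To make the logit scale concrete, the sketch below converts halting logits to probabilities and evaluates the binary cross-entropy that the Q head is trained with. The helper names are chosen here for illustration; the target is 1 when the current prediction matches the ground truth and 0 otherwise.

```python
import math


def sigmoid(logit):
    return 1.0 / (1.0 + math.exp(-logit))


def bce_with_logit(logit, target):
    """Binary cross-entropy for one logit; target is 1.0 (prediction correct) or 0.0."""
    p = sigmoid(logit)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


for q_hat in (6.0, 0.0, -6.0):
    print(f"q_hat={q_hat:+.1f}  sigmoid={sigmoid(q_hat):.4f}")

# A confident verdict is cheap when it is right and expensive when it is wrong.
print(bce_with_logit(6.0, 1.0))
print(bce_with_logit(6.0, 0.0))
```

A logit of 0 corresponds to a probability of exactly 0.5, so "the Q value stays negative" and "below 0.5 after sigmoid" are the same statement.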

Figure 3: The Q value tracks the cell accuracy step by step. Aggregated over 100 PPBench validation puzzles, split into two groups by their final correctness. On correct trajectories (green), the Q value and the cell accuracy rise almost in perfect synchrony, while on incorrect ones (red) the Q value stays negative throughout. Source: (Sghaier et al. 2026)

PTRM Method

PTRM assembles the two findings of the previous section into a single inference-time procedure. At every deep recursion step, Gaussian noise of scale \(\sigma\) is added to the latent \(z\). This is run as \(K\) parallel rollouts, and the best trajectory is chosen by the terminal Q value of each rollout. TRM’s architecture is left untouched, and the checkpoint is used as is. The intervention is purely at test time on a trained model.

Algorithm

We reproduce Algorithm 1 of the paper here. rec is TRM’s deep recursion step (one round of latent recursion), \(f_O\) is the output head, and \(f_Q\) is the Q head. \(D\) is the number of supervision steps.

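import torch

@torch.no_grad()                                 # inference only, no gradients needed
def ptrm_inference(x, K, D, sigma, z0, y0, rec, f_O, f_Q):
    # rec, f_O, f_Q: the trained TRM's recursion step, output head, and Q head
    candidates = []
    for _ in range(K):                           # K rollouts (independent, so batchable)
        z, y = z0, y0
        for _ in range(D):                       # D supervision steps
            eps = sigma * torch.randn_like(z)    # fresh Gaussian noise at every step
            z = z + eps
            z, y = rec(x, z, y)                  # standard TRM deep recursion step
        y_hat = f_O(y).argmax(dim=-1)            # decoded answer
        q_hat = f_Q(y).item()                    # terminal Q logit (single puzzle assumed)
        candidates.append((y_hat, q_hat))
    k_star = max(range(K), key=lambda k: candidates[k][1])   # argmax over terminal Q
    return candidates[k_star][0]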

What deserves attention is that the noise injection happens at every supervision step, not only at the first one of a rollout. As discussed below, this is consistent with GRAM’s (Baek et al. 2026) ablation that “adding noise only to the initial \(z\) does not work”, and shows that independent stochasticity is needed at each step of the trajectory.
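
The following toy is a minimal sketch of the same loop on a one-dimensional latent, intended only to show the mechanics of noisy rollouts plus Q-based selection. Everything in it is an assumption made for illustration and not part of TRM or the paper: the recursion `toy_rec` is `tanh(2z)`, which has one attracting fixed point on each side of zero, the negative side plays the bad basin, and `toy_q` is a linear stand-in for the Q head.

```python
import math
import random


def toy_rec(z):
    # Hypothetical stand-in for TRM's recursion step: attractors near -0.96 and +0.96.
    return math.tanh(2.0 * z)


def toy_q(z):
    # Hypothetical stand-in for the Q head: positive logit in the good (positive) basin.
    return 6.0 * z


def rollout(z0, D, sigma, rng):
    z = z0
    for _ in range(D):
        z = toy_rec(z + rng.gauss(0.0, sigma))  # fresh noise before every step
    return z


def toy_ptrm(z0, K, D, sigma, seed=0):
    rng = random.Random(seed)
    finals = [rollout(z0, D, sigma, rng) for _ in range(K)]
    best = max(finals, key=toy_q)               # select by terminal Q value
    escaped = sum(z > 0 for z in finals)        # rollouts that ended in the good basin
    return best, escaped


deterministic, _ = toy_ptrm(z0=-0.9, K=1, D=16, sigma=0.0)
best, escaped = toy_ptrm(z0=-0.9, K=100, D=16, sigma=0.4)
print("deterministic final latent:", round(deterministic, 3))
print("best-Q final latent:", round(best, 3), "| rollouts that escaped:", escaped)
```

With these settings most rollouts typically remain on the negative side, so a majority vote would return the bad answer, while selection by the Q stand-in returns an escaped rollout whenever one exists.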

Figure 4: The PTRM mechanism. (a) Standard TRM runs a single deterministic rollout of depth \(D\) along the depth axis. (b) PTRM runs \(K\) rollouts in parallel at the same depth and injects Gaussian noise \(\epsilon\) into the latent at every deep recursion step. The Q head picks the best trajectory among the \(K\) candidates (width axis \(K\)). Source: (Sghaier et al. 2026)

Empirical Demonstration of Bad-Basin Escape

Section 4.1 takes a single failure puzzle from Figure 2 and runs \(K=100\) rollouts on it. Of the 100, 92 stay in the same bad basin, and only 8 escape to a distinct region and reach the correct answer. The number of escapes increases monotonically with \(K\), from 0 at \(K=5\), to 1 at \(K=25\), to 8 at \(K=100\). This quantitatively confirms that Gaussian noise generates a per-rollout escape probability.
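
A back-of-the-envelope reading of these counts, assuming each rollout escapes independently with a fixed probability \(p\) (an assumption made here, not a claim from the paper): the chance that at least one of \(K\) rollouts escapes is \(1 - (1-p)^K\), and the expected number of escapes is \(Kp\). The sketch below plugs in the empirical rate \(p = 8/100\) from this single puzzle.

```python
def p_any_escape(p, K):
    """Probability that at least one of K independent rollouts escapes."""
    return 1.0 - (1.0 - p) ** K


p = 8 / 100  # empirical escape rate on the one puzzle above, used as a rough estimate
for K in (5, 25, 100):
    print(f"K={K:3d}  expected escapes={K * p:.1f}  P(at least one)={p_any_escape(p, K):.3f}")
```

Under this estimate the expected counts are 0.4, 2, and 8, which is in line with the observed 0, 1, and 8.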

Figure 5: Principal-plane projection of 100 rollouts that PTRM generates on the same failure puzzle. 92 rollouts (red) remain in the bad basin, while 8 (green) escape to a different region and reach the correct answer. Source: (Sghaier et al. 2026)

Width Scaling and the Verifier Quality of the Q Head

Section 4.2 lays out PTRM’s most practical contribution. The authors distinguish three aggregation metrics.

  • pass@\(K\): the probability that any of the \(K\) rollouts is correct (oracle upper bound)
  • best-Q@\(K\): the probability that the rollout with the highest Q value is correct (the practical metric)
  • mode@\(K\): the probability that the most frequent answer is correct (voting)

On PPBench validation, as \(K\) increases from 1 to 100, both pass@\(K\) and best-Q@\(K\) rise from 76.4 % to 89.5 %, and the gap between them stays below 1 pp throughout. This means that the Q head functions almost as well as an oracle verifier. By contrast, mode@\(K\) rises by only 1.3 pp. The gains from width scaling come from Q head selection, not from voting.
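
The three metrics differ only in how the \(K\) rollouts of one puzzle are reduced to a single verdict. A minimal sketch, assuming each rollout is stored as a tuple `(answer, q_hat, is_correct)` with a hashable answer; the data layout, the function name, and the example rollouts are hypothetical.

```python
from collections import Counter


def width_metrics(puzzles):
    """puzzles: list of puzzles, each a list of K rollouts (answer, q_hat, is_correct)."""
    n = len(puzzles)
    pass_k = best_q = mode = 0
    for rollouts in puzzles:
        # pass@K: oracle, any rollout is correct.
        pass_k += any(ok for _, _, ok in rollouts)
        # best-Q@K: the rollout with the highest Q value is correct.
        best_q += max(rollouts, key=lambda r: r[1])[2]
        # mode@K: the most frequent answer is correct.
        top_answer, _ = Counter(a for a, _, _ in rollouts).most_common(1)[0]
        mode += any(ok for a, _, ok in rollouts if a == top_answer)
    return {"pass@K": pass_k / n, "best-Q@K": best_q / n, "mode@K": mode / n}


# Made-up rollouts for two puzzles.
puzzles = [
    [("A", -5.8, False), ("A", -6.1, False), ("B", 5.9, True)],   # one escape, outvoted
    [("C", 6.0, True), ("C", 5.7, True), ("D", -4.0, False)],     # easy puzzle
]
print(width_metrics(puzzles))   # {'pass@K': 1.0, 'best-Q@K': 1.0, 'mode@K': 0.5}
```

The first puzzle is the case the text describes: the single escaped rollout is outvoted, so voting fails while Q-based selection succeeds.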

The comparison with the depth axis is also in Section 4.2. With \(K=1\) fixed, increasing depth \(D\) from 16 to 48 lifts PPBench validation from 76.4 % to only 79.5 % (+3.1 pp). With \(D=16\) fixed, increasing \(K\) from 1 to 100 lifts it by +13 pp. Combined with the computational property that rollouts are independent and parallelizable while depth is sequential, width becomes the primary test-time scaling axis.

Figure 6: Dependence of pass@\(K\), best-Q@\(K\), and mode@\(K\) on \(K\) on PPBench validation. pass@\(K\) and best-Q@\(K\) rise almost in sync with \(K\) (gap < 1 pp), showing that the Q head works as a strong verifier. mode@\(K\) stays nearly flat, indicating that voting yields no gain. Source: (Sghaier et al. 2026)

Main Results

Section 5 of the paper evaluates PTRM on three benchmarks. The shared protocol is to “use the public TRM checkpoint as is and only change the inference settings \((K, D, \sigma)\)”.

PPBench

PPBench (Waugh 2026) is a collection of constraint satisfaction puzzles, with 62,231 puzzles across 94 puzzle types, and is the main benchmark of PTRM. The authors focus on 6 types (sudoku, lightup, nurikabe, shakashaka, heyawake, tapa) and measure per-puzzle accuracy on the 300-puzzle golden set defined by Waugh. The aggregate covers the other 5 types and excludes shakashaka, which deterministic TRM already solves at 100 %.

The main result is that PTRM (\(K=100, D=48, \sigma=0.2\)) reaches an aggregate of 91.2 %, broken down as sudoku 97.8 %, lightup 100 %, nurikabe 88.9 %, heyawake 85.7 %, and tapa 80.0 %. This is +28.6 pp above deterministic TRM (\(K=1, D=16\)) at 62.6 %. TRM with depth alone increased (\(K=1, D=48\)) reaches only 66.0 % (+3.4 pp), confirming that most of the gain is from width.

The comparison with frontier LLMs is taken on the same golden set. With direct prompting, gemini-3.1-pro reaches 24.5 %, gpt-5.2@xhigh 24.5 %, and claude-opus-4-6@thinking 34.7 %. The ensemble of the top 7 LLMs with multi-step agentic prompting (assuming a perfect verifier) reaches 55.1 %. PTRM beats this by 36 pp at 91.2 %, and the cost per puzzle is $0.001, three orders of magnitude cheaper than the LLM ensemble at $2.66.

Sudoku-Extreme, Maze-Hard, ARC-AGI-2

Section 5.3 evaluates PTRM on the same three benchmarks used by the original TRM paper. On Sudoku-Extreme, accuracy rises from 87.28 % with TRM to 98.75 % with PTRM (\(K=100, D=64, \sigma=0.3\)), which is state-of-the-art. On Maze-Hard, it rises from 83.80 % to 86.73 % (\(K=100, D=16, \sigma=1.0\)). On ARC-AGI-2, pass@1 goes from 7.36 % to 8.47 % and pass@100 from 14.31 % to 15.97 % (\(K=25, D=16, \sigma=0.2\), with pass@2 at 9.72 %, roughly on par).

Sudoku-Extreme is the benchmark on which PTRM most dramatically eliminates the basin trap that was deterministic TRM’s weakness, and PTRM reaches state-of-the-art there. On ARC-AGI-2, however, it does not reach the large LLMs in the Grok-4 family, and on Maze-Hard best-Q@\(K\) falls well short of pass@\(K\) (see the ablation below). One of the limits of PTRM, acknowledged by the paper itself, is that there are tasks where “the verifier quality of the Q head becomes the ceiling”.

Noise Ablation

Section 5.4 and Appendix B sweep \(\sigma\) on the three benchmarks. The optimal \(\sigma\) is task-dependent.

  • Sudoku-Extreme: hits the ceiling around \(\sigma \approx 0.1\) and stays roughly flat up to \(\sigma = 1.0\). best-Q@\(K\) is 98.5 %, within 1 pp of pass@\(K\) at 99.3 %.
  • Maze-Hard: pass@\(K\) keeps rising up to \(\sigma \approx 1.0\), from 83.8 % to about 96 %. best-Q@\(K\) stays around 86 %, and the gap between the two opens to about 10 pp. A large verifier headroom remains.
  • ARC-AGI-2: peaks around \(\sigma \approx 0.6\). The gap between best-Q@\(K\) and pass@\(K\) is also not negligible.

The wide gaps on Maze-Hard and ARC-AGI-2 show that what limits PTRM is not “search capability” but “verifier capability”. The central direction of the paper’s Future Work is that a stronger verifier, trained or attached, could pick up the rollouts that the current Q head is missing.
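
The verifier headroom is simply the difference between the two curves. A small sketch using the figures quoted in the ablation above; the function name is chosen here, and the Maze-Hard pair uses the approximate values given in the text.

```python
def verifier_headroom(pass_at_k, best_q_at_k):
    """Percentage points lost because a correct rollout existed but was not selected."""
    return pass_at_k - best_q_at_k


# Figures in percent, as quoted in the ablation; the Maze-Hard pair is approximate.
print(f"Sudoku-Extreme: {verifier_headroom(99.3, 98.5):.1f} pp")
print(f"Maze-Hard:      {verifier_headroom(96.0, 86.0):.1f} pp")
```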

Figure 7: \(\sigma\) ablation on the three benchmarks. Horizontal axis is the noise scale \(\sigma\), and the lines are pass@\(K\) (blue), best-Q@\(K\) (orange), and mode@\(K\) (green). Sudoku-Extreme hits the ceiling at \(\sigma = 0.1\) and stays flat. Maze-Hard’s pass@\(K\) keeps rising up to \(\sigma = 1.0\) but the gap with best-Q@\(K\) widens. ARC-AGI-2 peaks at \(\sigma = 0.6\). The shaded region is the verifier headroom. Source: (Sghaier et al. 2026)
Note. A Negative Result: Langevin Sampling Did Not Help

In Appendix C, the authors try a Langevin dynamics update that uses the Q head’s gradient to push the latent toward a high-Q region, \(z \leftarrow z - \eta \nabla_z E(z) + \sqrt{2\eta}\, \xi\) (with \(E(z) = -\log \mathrm{sigmoid}(f_Q(z))\) and \(\xi \sim \mathcal{N}(0, I)\)). On PCA plots, \(\nabla_z f_Q\) does point toward the good basin, so the idea looked promising, but the accuracy matched a noise-only ablation in which the gradient term was set to zero. All of the gain came from the noise term, and the gradient guidance added nothing. Based on this, the authors dropped the Langevin variant and converged to the simpler PTRM. This is a methodological finding that test-time gradient guidance on a trained model, at least through TRM’s Q head, did not work.
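
For concreteness, the update can be sketched on a toy latent with a linear Q head \(f_Q(z) = w \cdot z + b\), an assumption made here purely so that the gradient has a closed form; TRM's actual Q head is a learned network and would need automatic differentiation. With \(E(z) = -\log \mathrm{sigmoid}(f_Q(z))\), the chain rule gives \(\nabla_z E(z) = -(1 - \mathrm{sigmoid}(f_Q(z)))\, w\).

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def langevin_step(z, w, b, eta, rng, use_gradient=True):
    """One update z <- z - eta * grad E(z) + sqrt(2 * eta) * xi for a linear toy Q head."""
    q = sum(wi * zi for wi, zi in zip(w, z)) + b
    # Gradient of E(z) = -log sigmoid(q) with q = w.z + b is -(1 - sigmoid(q)) * w.
    scale = -(1.0 - sigmoid(q)) if use_gradient else 0.0   # 0.0 gives the noise-only variant
    grad = [scale * wi for wi in w]
    return [
        zi - eta * gi + math.sqrt(2.0 * eta) * rng.gauss(0.0, 1.0)
        for zi, gi in zip(z, grad)
    ]


rng = random.Random(0)
z = [-1.0, 0.5]
w, b = [2.0, -1.0], 0.0       # hypothetical Q-head weights
z_next = langevin_step(z, w, b, eta=0.01, rng=rng)
print(z_next)
```

Calling the function with `use_gradient=False` corresponds to the noise-only ablation described above, in which the gradient term is set to zero and only the random perturbation remains.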

Limits and Contributions in This Book

Section 7 / Conclusion of the PTRM paper acknowledges three limits. First, the validation is mainly on grid-shaped puzzles, and the gains on ARC-AGI-2 and Heyawake are small. Second, the verification quality of the Q head is the ceiling, and on Maze-Hard a large gap between best-Q@\(K\) and pass@\(K\) remains. Third, the development of a stronger verifier is left as future work.

In the context of this book, PTRM’s contributions can be organized into three points.

First, it diagnoses “the deterministic limit of TRM” at the mechanism level. The vocabulary of three trajectory modes on the PCA plot, and of bad / good basins, becomes a shared language for talking about the behavior of the TRM lineage. Combined with Efstathiou & Balwani’s attractor hypothesis, the dynamics interpretation of small recursive models like HRM / TRM enters a new stage.

Second, it establishes width scaling as a test-time compute axis. Depth is sequential, cannot be parallelized, and from some point onward starts to incur overfitting cost (consistent with TRM’s Table 4, which already showed that “deeper is not always better”). Width does not have these properties: rollouts are independent and parallelizable, and accuracy keeps rising with the number of rollouts (+13 pp on PPBench validation from \(K=1\) to \(K=100\)). As one answer to the question of “what to invest test-time compute in”, width scaling is now an established option.

Third, the finding that a trained Q head can be reused as a verifier carries over to recursive reasoning models (RRMs) in general. A head trained with an auxiliary loss for adaptive halting also works as a trajectory selector. If future RRM designs build in the assumption that halting signals can “double as a verifier”, the efficiency of test-time scaling becomes easier to raise.

In this book, as a continuation of the “subtraction” performed in TRM (peeling off the HRM story through ablations), PTRM presents the minimal intervention that puts a trained TRM directly to work at test time. Lined up with GRAM (the “addition” that introduces stochasticity at training time) and LDT (which “rounds up” the latent via lattice projection), the three form a series in which the intervention onto the TRM core grows progressively larger. PTRM is the lightest of the three, and by changing only the test-time procedure while keeping the checkpoint untouched, it sits at the position that first asks “what is gained and what is lost” relative to the heavier modifications that follow.

References

Bae, Sangmin, Yujin Kim, Reza Bayat, et al. 2025. “Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation.” arXiv Preprint arXiv:2507.10524. https://arxiv.org/abs/2507.10524.
Baek, Junyeob, Mingyu Jo, Minsu Kim, Mengye Ren, Yoshua Bengio, and Sungjin Ahn. 2026. “Generative Recursive Reasoning.” arXiv Preprint arXiv:2605.19376. https://arxiv.org/abs/2605.19376.
Bansal, Arpit, Avi Schwarzschild, Eitan Borgnia, et al. 2022. “End-to-End Algorithm Synthesis with Recurrent Networks: Logical Extrapolation Without Overthinking.” Advances in Neural Information Processing Systems 35: 20232–42. https://arxiv.org/abs/2202.05826.
Bear, Jay, Adam Prügel-Bennett, and Jonathon Hare. 2024. “Rethinking Deep Thinking: Stable Learning of Algorithms Using Lipschitz Constraints.” Advances in Neural Information Processing Systems 37: 97027–52. https://arxiv.org/abs/2410.23451.
Blayney, Hugh, Álvaro Arroyo, Johan Obando-Ceron, et al. 2026. “A Mechanistic Analysis of Looped Reasoning Language Models.” arXiv Preprint arXiv:2604.11791. https://arxiv.org/abs/2604.11791.
Efstathiou, Andreas, and Aishwarya Balwani. 2026. “Recursive Reasoning as Attractor Landscape Search: Mechanistic Dynamics of the Tiny Recursive Model.” Workshop on Latent and Implicit Thinking – Going Beyond CoT Reasoning, ICLR 2026. https://openreview.net/forum?id=kKps9W1K7n.
Graves, Alex. 2016. “Adaptive Computation Time for Recurrent Neural Networks.” arXiv Preprint arXiv:1603.08983. https://arxiv.org/abs/1603.08983.
Hakimi, Navid. 2026. “Form Follows Function: Recursive Stem Model.” arXiv Preprint arXiv:2603.15641. https://arxiv.org/abs/2603.15641.
Ren, Zirui, and Ziming Liu. 2026. “Are Your Reasoning Models Reasoning or Guessing? A Mechanistic Analysis of Hierarchical Reasoning Models.” arXiv Preprint arXiv:2601.10679. https://arxiv.org/abs/2601.10679.
Schwarzschild, Avi, Eitan Borgnia, Arjun Gupta, et al. 2021. “Can You Learn an Algorithm? Generalizing from Easy to Hard Problems with Recurrent Networks.” Advances in Neural Information Processing Systems 34: 6695–706. https://arxiv.org/abs/2106.04537.
Sghaier, Amin, Ali Parviz, and Alexia Jolicoeur-Martineau. 2026. “Probabilistic Tiny Recursive Model.” arXiv Preprint arXiv:2605.19943. https://arxiv.org/abs/2605.19943.
Wang, Minghan, Ye Bai, Thuy-Trang Vu, Ehsan Shareghi, and Gholamreza Haffari. 2026. “GTS: Inference-Time Scaling of Latent Reasoning with a Learnable Gaussian Thought Sampler.” arXiv Preprint arXiv:2602.14077. https://arxiv.org/abs/2602.14077.
Waugh, Justin. 2026. “Pencil Puzzle Bench: A Benchmark for Multi-Step Verifiable Reasoning.” arXiv Preprint arXiv:2603.02119. https://arxiv.org/abs/2603.02119.
You, Runyang, Yongqi Li, Meng Liu, Wenjie Wang, Liqiang Nie, and Wenjie Li. 2025. “Parallel Test-Time Scaling for Latent Reasoning Models.” arXiv Preprint arXiv:2510.07745. https://arxiv.org/abs/2510.07745.