Technical Reference

How AI Systems Learn: From Prompts to Verifiable Rewards

A literature-backed guide to the training methodology spectrum. Understand why reinforcement learning with verifiable rewards produces more reliable AI, and how these methods stack.

The Spectrum

Five Ways to Improve AI

Every AI improvement method sits on a spectrum from lightweight to heavyweight. Each adds a different kind of capability.

Prompt Engineering (PE)
Few-Shot Prompting (Few-Shot)
Supervised Fine-Tuning (SFT)
Reinforcement Learning from Human Feedback (RLHF)
Reinforcement Learning with Verifiable Rewards (RLVR)

Ordered from less investment to more investment and more reliability.

The Key Insight

Same Prompt, Better Foundation, Better Results

Training methods don't replace each other. They stack on each other.

Chart (Your Prompt to Your Evals), cumulative score by training stage:

Foundation (+40%): 40%
Foundation (+40%) + SFT (+15%): 55%
Foundation (+40%) + SFT (+15%) + RLHF (+10%): 65%
Foundation (+40%) + SFT (+15%) + RLHF (+10%) + RLVR (+20%): 85%

Illustrative, based on patterns across multiple papers. Exact gains vary by domain and benchmark.
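The stacking arithmetic can be written out as a running total. This is a minimal sketch that treats the illustrative gains above as additive percentage points; the `stages` list and the `cumulative_scores` helper are names made up for this example, and the numbers are the page's illustrative figures, not measured results.

```python
# Illustrative stage gains in percentage points (not measured results).
stages = [("Foundation", 40), ("SFT", 15), ("RLHF", 10), ("RLVR", 20)]


def cumulative_scores(stage_gains):
    """Return (stage name, running total) pairs, treating gains as additive."""
    total = 0
    totals = []
    for name, gain in stage_gains:
        total += gain
        totals.append((name, total))
    return totals


for name, score in cumulative_scores(stages):
    print(f"after {name}: {score}%")
```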

Where Tacit Fits

We Generate the Verification Signal

Most organizations sit at the bottom of the training ladder: prompt engineering, maybe some fine-tuning. Moving up to RLVR requires verifiable environments and expert signal, and that's what we build.

We don't replace your AI. We generate the verification signal that lets any AI improve through RLVR. Your organization's expert judgment becomes the ground truth that the model practices against.

Discovery completeness: 78%
Process quality: 72%
Decision accuracy: 85%
Scope compliance: 95%
Outcome metrics: 80%

This is what the research calls 'the critical innovation': domain-specific verifiers.

We make them creatable by domain experts, not ML engineers. A single well-designed scenario contains all the ground truth needed to score a conversation.
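One way to picture a scenario that carries its own ground truth is sketched below. Everything here is hypothetical: the `Scenario` and `Transcript` classes, their fields, and the three scoring rules are assumptions chosen to mirror some of the dimensions listed above, not a description of an actual product schema. The sketch also assumes the conversation has already been reduced to structured facts, which is the hard part in practice.

```python
from dataclasses import dataclass, field


@dataclass
class Scenario:
    """Hypothetical scenario: ground truth a domain expert writes once."""
    required_facts: set[str]
    correct_decision: str
    out_of_scope_topics: set[str] = field(default_factory=set)


@dataclass
class Transcript:
    """Hypothetical structured summary of one conversation."""
    facts_uncovered: set[str]
    decision: str
    topics_raised: set[str]


def score(scenario: Scenario, transcript: Transcript) -> dict[str, float]:
    """Score a conversation against the scenario's ground truth."""
    found = scenario.required_facts & transcript.facts_uncovered
    if scenario.required_facts:
        discovery = len(found) / len(scenario.required_facts)
    else:
        discovery = 1.0  # nothing to discover counts as complete
    decision = 1.0 if transcript.decision == scenario.correct_decision else 0.0
    violated = scenario.out_of_scope_topics & transcript.topics_raised
    scope = 0.0 if violated else 1.0
    return {
        "discovery_completeness": discovery,
        "decision_accuracy": decision,
        "scope_compliance": scope,
    }


scenario = Scenario(
    required_facts={"budget", "deadline", "stakeholders", "constraints"},
    correct_decision="escalate",
    out_of_scope_topics={"pricing"},
)
transcript = Transcript(
    facts_uncovered={"budget", "deadline", "constraints"},
    decision="escalate",
    topics_raised={"timeline"},
)
result = score(scenario, transcript)
print(result)
```

Because every score is computed from the scenario alone, the same scenario can grade any number of model attempts without a human in the loop.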

Reliability

RLVR Makes AI More Reliable, Not Just More Accurate (Conceptual)

The most important finding from the RLVR literature is not that models get smarter. It is that they get more reliable.

Chart legend: Base model vs. After RLVR training

Pass@k: capability

At least one of k attempts is correct. Given enough retries, the base model catches up: it covers more problems because it never pruned an approach.

[Chart: pass@k from 0% to 100% at k = 1, 2, 4, 8, 16, 64, 256]

Worst-of-k: reliability

Every one of k attempts is correct. The base model's lucky solves stop repeating; the RLVR model holds its score.

[Chart: worst-of-k from 0% to 100% at k = 1, 2, 4, 8, 16, 64, 256]

Conceptual chart. Same two models in both panels: more attempts close the capability gap on the left and widen the reliability gap on the right. In deployment every interaction is another attempt, and worst-of-k is the regime you live in.
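The two panels can be reproduced numerically. The sketch below estimates both metrics from n recorded attempts per problem, c of which were correct, by asking what happens when k of those attempts are drawn without replacement. The pass@k form is the standard combinatorial estimator; the worst-of-k counterpart and the attempt counts for the two models are assumptions made up for this example, not figures from any paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Chance that at least one of k attempts is correct, drawing k of
    n recorded attempts (c of them correct) without replacement."""
    return 1.0 - comb(n - c, k) / comb(n, k)


def worst_of_k(n: int, c: int, k: int) -> float:
    """Chance that every one of the k drawn attempts is correct."""
    return comb(c, k) / comb(n, k)


def mean_over_problems(metric, correct_counts, n, k):
    """Average a per-problem metric over a small benchmark."""
    return sum(metric(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical correct counts out of N = 16 attempts on four problems.
N = 16
BASE = [8, 4, 2, 1]         # solves everything sometimes, nothing reliably
TRAINED = [16, 16, 15, 0]   # solves three problems almost always, one never

for k in (1, 8):
    for label, counts in (("base", BASE), ("trained", TRAINED)):
        p = mean_over_problems(pass_at_k, counts, N, k)
        w = mean_over_problems(worst_of_k, counts, N, k)
        print(f"k={k} {label:8s} pass@k={p:.2f} worst-of-k={w:.2f}")
```

With these made-up counts the trained model leads on pass@1, the base model overtakes it on pass@8, and the trained model stays far ahead on worst-of-8, which is the pattern the two panels describe.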

RLVR converts a model that can solve a problem in 256 tries into one that solves it reliably on the first try.

Yue et al. (Tsinghua), NeurIPS 2025 Best Paper Runner-Up

The Compression View

RLVR narrows the model's focus to reliable reasoning paths. It trades breadth for reliability. At high k, the base model covers more problems because it hasn't pruned any approaches.

The Expansion View

When you measure reasoning correctness (not just final answers), RLVR models show expanded capabilities at all k values. Standard pass@k was giving false credit to lucky guesses.

Both Are Happening

Early training compresses (exploitation phase). Extended training expands (exploration phase). The model first learns to be reliable, then learns to be capable.
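The gap between the two views comes down to what counts as a correct attempt. The sketch below is a simplified reading of that idea: answer-only pass@k credits any attempt with the right final answer, while the stricter variant credits an attempt only when its reasoning is also judged correct. The sample labels are invented for illustration, and how reasoning correctness gets judged is outside the scope of this sketch.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Chance that at least one of k attempts drawn from n (c correct) counts."""
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical attempts on one problem: (answer_correct, reasoning_correct).
samples = [
    (True, True),
    (True, False),   # right answer reached by flawed reasoning
    (True, False),
    (False, False),
    (False, False),
    (False, False),
    (False, False),
    (False, False),
]

n = len(samples)
answer_hits = sum(1 for answer_ok, _ in samples if answer_ok)
reasoned_hits = sum(1 for answer_ok, cot_ok in samples if answer_ok and cot_ok)

k = 4
answer_only = pass_at_k(n, answer_hits, k)
reasoning_checked = pass_at_k(n, reasoned_hits, k)
print(f"answer-only pass@{k}: {answer_only:.3f}")
print(f"reasoning-checked pass@{k}: {reasoning_checked:.3f}")
```

The two lucky guesses inflate the answer-only score; once they are excluded, the same attempts look much less capable.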

RLHF vs RLVR

Why Verifiable Rewards Change the Game

RLHF trains models to match human preferences. In specialized domains, this creates models that sound expert-like but are wrong in ways only a domain expert would catch.

RLHF: Learn What People Prefer

Signal: "This one sounds better to the reviewer."

Failure mode: confidently wrong, expert-sounding nonsense.

RLVR: Learn What Actually Works

Signal: "This one IS correct, verified against ground truth."

Advantage: reliable accuracy backed by verifiable evidence.
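In code, a verifiable reward is just a check against ground truth that returns the same score no matter who runs it. The sketch below assumes the simplest case, an exact match after light normalization; the function name and the normalization rule are choices made for this example, and real verifiers are usually more domain-specific.

```python
def verifiable_reward(model_answer: str, ground_truth: str) -> float:
    """Return 1.0 if the answer matches ground truth after normalization, else 0.0.

    Normalization here is an assumption for the example: trim, lowercase,
    and collapse runs of whitespace.
    """
    def normalize(text: str) -> str:
        return " ".join(text.strip().lower().split())

    return 1.0 if normalize(model_answer) == normalize(ground_truth) else 0.0


print(verifiable_reward("  Metformin 500 MG ", "metformin 500 mg"))
print(verifiable_reward("Metformin 850 mg", "metformin 500 mg"))
```

A fluent but wrong answer scores zero here, which is exactly the case a preference signal tends to miss.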

RLVR Is Breaking Out of Math and Code

Med-RLVR
Medicine

+8 pts out-of-distribution

The first medical RLVR study, using a 3B-parameter model. It matched SFT in-distribution and gained 8 percentage points on health benchmarks it was never trained on.

K2V: Knowledge-to-Verification
Agriculture

Small model outperforms 72B

A dramatically smaller model outperformed Qwen2.5-72B-Instruct on domain-specific benchmarks by converting domain knowledge into verification signal.

Rubrics as Rewards
Medicine + Science

+31% relative improvement

Extended RLVR beyond binary verification using structured rubrics as reward signal. 31% relative improvement on HealthBench over Likert-based baselines.
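A rubric turns a single yes/no check into a graded score. The sketch below assumes one plausible aggregation, a weighted fraction of satisfied criteria, with the judging step passed in as a function. The criteria, weights, and the keyword-matching judge are stand-ins invented for this example, not the method of any specific paper.

```python
from typing import Callable

Rubric = list[tuple[str, float]]  # (criterion, weight)


def rubric_reward(
    response: str,
    rubric: Rubric,
    judge: Callable[[str, str], bool],
) -> float:
    """Return the weighted fraction of rubric criteria the response satisfies."""
    total_weight = sum(weight for _, weight in rubric)
    if total_weight == 0:
        return 0.0
    earned = sum(weight for criterion, weight in rubric if judge(response, criterion))
    return earned / total_weight


def keyword_judge(response: str, criterion: str) -> bool:
    """Stand-in judge: a criterion is met if its keyword appears in the response."""
    return criterion in response.lower()


rubric = [("dosage", 3.0), ("contraindication", 2.0), ("follow-up", 1.0)]
response = "Recommended dosage is 5 mg; schedule a follow-up in two weeks."
reward = rubric_reward(response, rubric, keyword_judge)
print(f"{reward:.3f}")
```

Swapping `keyword_judge` for an expert-written check or a model-based grader changes the quality of the signal without changing how the reward is assembled.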

Literature

Cited Research

Backed by peer-reviewed research and published technical reports.

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via RL

2025

DeepSeek-AI (Guo et al.), published in Nature

Matched OpenAI o1 using GRPO with verifiable rewards. AIME 2024 pass@1: 79.8%. MATH-500: 97.3%.

arXiv:2501.12948

DeepSeekMath: Pushing the Limits of Mathematical Reasoning

2024

DeepSeek-AI

Introduced GRPO (Group Relative Policy Optimization). Eliminated the critic model, cutting RL compute roughly in half vs PPO.

arXiv:2402.03300

Tulu 3: Pushing Frontiers in Open Language Model Post-Training

2024

Allen Institute for AI (Ai2)

First open pipeline demonstrating SFT + DPO + RLVR stacking. RLVR improved performance on tasks it was never trained on.

arXiv:2411.15124

Does RL Really Incentivize Reasoning Capacity Beyond the Base Model?

2025

Yue et al. (Tsinghua University)

NeurIPS 2025 Best Paper Runner-Up. Showed RLVR as 'search compression' where pass@1 improves while pass@256 decreases.

arXiv:2504.13837

RLVR Implicitly Incentivizes Correct Reasoning in Base LLMs

2025

Wen et al. (Microsoft Research Asia, Peking University)

Introduced CoT-Pass@K showing RLVR extends reasoning boundaries when measuring reasoning correctness, not just final answers.

arXiv:2506.14245

The RLVR Reasoning Capability Boundary: Shrinkage, Expansion, or Both?

2025

Yao et al.

Reconciled the compression vs expansion debate: both happen in sequence. Exploitation phase first, exploration phase follows.

arXiv:2510.04028

Med-RLVR: Reinforcement Learning with Verifiable Rewards for Medicine

2025

Zhang et al. (Microsoft Research)

First medical RLVR. 3B model matched SFT in-distribution, +8 pts out-of-distribution on health benchmarks.

arXiv:2502.19655

K2V: Knowledge-to-Verification for RLVR

2025

Yuan et al.

Small model outperformed Qwen2.5-72B-Instruct on agriculture domain benchmark using domain knowledge as verification signal.

arXiv:2605.18261

Rubrics as Rewards: Extending RLVR Beyond Binary Verification

2025

Gunjal et al.

31% relative improvement on HealthBench over Likert-based baselines. Also +7% relative on GPQA-Diamond for science reasoning.

arXiv:2507.17746

Direct Preference Optimization

2023

Rafailov et al. (Stanford)

Showed DPO matches or exceeds RLHF-PPO while eliminating the reward model entirely. Simplified preference learning.

arXiv:2305.18290

Reinforcement Learning via Self-Distillation

2026

Hübotter et al. (ETH Zurich, MIT)

Introduced SDPO (Self-Distillation Policy Optimization): the model learns from its own feedback-conditioned predictions, with no external teacher or reward model. Matched best-of-k discovery on hard binary-reward tasks with 3x fewer attempts.

arXiv:2601.20802

Restoring the Sweet Spot: Pass-Rate Weighted Self-Distillation

2026

Liu et al.

Weighted SDPO's self-distillation by pass rate, an implicit curriculum that tracks model competence. +3.2 mean@16 on Qwen3-8B and +1.8 on OLMo-3-7B across reasoning benchmarks.

arXiv:2605.27765

Qwen3 Technical Report

2025

Qwen Team (Alibaba)

4-stage pipeline: CoT cold start, GRPO, fusion, general RL. Only 3,995 query-verifier pairs needed for the RLVR stage.

arXiv:2505.09388

GRPO Amplifies Success Probability

2025

Mroueh (IBM Research)

Mathematical proof that GRPO systematically amplifies the probability of successful reasoning paths.

arXiv:2503.06639

Absolute Zero: Reinforced Self-Play Reasoning with Zero Data

2025

Zhao et al. (Tsinghua University)

The model proposes its own tasks and learns from a code executor's verifiable rewards, with zero external training data. Reached state-of-the-art coding and math reasoning in the zero-data setting.

arXiv:2505.03335

One Sample to Rule Them All: Extreme Data Efficiency in RL Scaling

2026

Li et al.

A single strategically designed training sample lifted reasoning across physics, chemistry, and biology. Structural quality and skill diversity of the data matter more than volume.

arXiv:2601.03111

Evaluating Parameter Efficient Methods for RLVR

2025

Yin et al.

First comprehensive benchmark of 12+ parameter-efficient fine-tuning methods under RLVR. Structural LoRA variants (DoRA, AdaLoRA) beat standard LoRA; SVD-based methods misalign with RL optimization.

arXiv:2512.23165

Learning Rate Matters: Vanilla LoRA May Suffice for LLM Fine-Tuning

2026

Lee et al.

Across nine LoRA variants, properly tuned learning rates bring all methods within 1-2% of the same peak. Reported gains from variant methods are largely tuning artifacts.

arXiv:2602.04998

AlphaEvolve: A Coding Agent for Scientific and Algorithmic Discovery

2025

Novikov et al. (Google DeepMind)

LLM-driven evolution over code, scored by automated evaluators. Found a 48-multiplication procedure for 4x4 complex matrix multiplication, the first improvement in this setting in 56 years.

arXiv:2506.13131

Efficiently Scaling Transformer Inference

2022

Pope et al. (Google)

The engineering playbook for serving 500B+ parameter transformers: partitioning strategies and multi-query attention that trade latency against throughput at scale.

arXiv:2211.05102

Fast Inference from Transformers via Speculative Decoding

2023

Leviathan et al. (Google)

A small draft model proposes tokens that the large model verifies in parallel, yielding identical outputs 2-3x faster with no retraining.

arXiv:2211.17192
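Several entries above rely on GRPO, which the DeepSeekMath entry describes as eliminating the critic model. The core idea can be sketched in a few lines: sample a group of responses to the same prompt, score each with the verifier, and use the group's own mean and spread as the baseline. This is a minimal sketch of the advantage computation only; the use of population standard deviation and the `eps` stabilizer are assumptions made for this example, and the policy update itself is omitted.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each reward against its own group instead of a learned critic.

    rewards: verifier scores for several responses to the same prompt.
    eps: small constant (an assumption here) that avoids division by zero
         when every response in the group gets the same reward.
    """
    baseline = mean(rewards)
    spread = pstdev(rewards)
    return [(reward - baseline) / (spread + eps) for reward in rewards]


# Hypothetical binary verifier rewards for four sampled responses.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)

# A group where every response scores the same carries no learning signal.
print(group_relative_advantages([1.0, 1.0, 1.0]))
```

Correct responses get positive advantages and incorrect ones negative, so the verifier's signal does the work a separate value model would otherwise do.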

The Method Is Public. The Signal Is Yours.

Every paper on this page is published. The verification signal that makes RLVR work in your domain has to come from your experts, and that is the part we build with you.