Technical Reference
How AI Systems Learn: From Prompts to Verifiable Rewards
A literature-backed guide to the training methodology spectrum. Understand why reinforcement learning with verifiable rewards produces more reliable AI, and how these methods stack.
The Spectrum
Five Ways to Improve AI
Every AI improvement method sits on a spectrum from lightweight to heavyweight. Each adds a different kind of capability.
The Key Insight
Same Prompt, Better Foundation, Better Results
Training methods don't replace each other. They multiply each other.
Illustrative chart, based on patterns across multiple papers. Exact gains vary by domain and benchmark.
Where Tacit Fits
We Generate the Verification Signal
Most organizations sit at the bottom of the training ladder: prompt engineering, maybe some fine-tuning. Moving up to RLVR requires verifiable environments and expert signal, and that's what we build.
We don't replace your AI. We generate the verification signal that lets any AI improve through RLVR. Your organization's expert judgment becomes the ground truth that the model practices against.
This is what the research calls 'the critical innovation': domain-specific verifiers.
We make these verifiers something domain experts can create, without needing ML engineers. A single well-designed scenario contains all the ground truth needed to score a conversation.
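As a rough illustration of how one scenario can carry the ground truth needed to score a conversation, the sketch below uses a hypothetical `Scenario` record holding required and forbidden phrases. The class, its field names, and the substring-matching rule are assumptions made for illustration only; they do not describe any actual product or a method from the cited papers.

```python
from dataclasses import dataclass, field


@dataclass
class Scenario:
    """Hypothetical scenario: expert-authored ground truth for one conversation."""
    prompt: str
    required: list[str] = field(default_factory=list)   # facts a correct reply must state
    forbidden: list[str] = field(default_factory=list)  # claims an expert would reject


def score_reply(scenario: Scenario, reply: str) -> float:
    """Return a reward in [0, 1]; any forbidden claim zeroes the score."""
    text = reply.lower()
    if any(bad.lower() in text for bad in scenario.forbidden):
        return 0.0
    if not scenario.required:
        return 1.0
    hits = sum(1 for fact in scenario.required if fact.lower() in text)
    return hits / len(scenario.required)


# Illustrative scenario (assumed content, not taken from the source).
scenario = Scenario(
    prompt="A customer asks whether they can return an opened item.",
    required=["30 days", "receipt"],
    forbidden=["no returns"],
)
good = score_reply(scenario, "Yes, within 30 days if you have the receipt.")
partial = score_reply(scenario, "Yes, within 30 days.")
bad = score_reply(scenario, "Sorry, we accept no returns, even within 30 days with a receipt.")
```

Substring matching is the crudest possible check; the point is only that the scenario, not the model, holds the definition of a correct answer.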
Reliability
RLVR Makes AI More Reliable, Not Just More Accurate (Conceptual)
The most important finding from the RLVR literature is not that models get smarter. It is that they get more reliable.
Pass@k: capability
At least one of k attempts is correct. Given enough retries, the base model catches up: it covers more problems because it never pruned an approach.
Worst-of-k: reliability
Every one of k attempts is correct. The base model's lucky solves stop repeating; the RLVR model holds its score.
Conceptual chart. Same two models in both panels: more attempts close the capability gap on the left and widen the reliability gap on the right. In deployment every interaction is another attempt, and worst-of-k is the regime you live in.
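The two metrics can be computed from the same sampling run. The sketch below assumes n attempts were sampled per problem and c of them were correct, and asks what happens when k of those attempts are drawn without replacement. `pass_at_k` is the standard combinatorial estimator; `all_of_k` is one reasonable reading of "worst-of-k" (every one of the k attempts is correct) and is an assumption about how the conceptual chart would be measured. The counts in the example are made up.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """P(at least one of k attempts, drawn from n with c correct, is correct)."""
    if not 0 < k <= n:
        raise ValueError("k must be between 1 and n")
    # comb returns 0 when fewer than k incorrect attempts exist.
    return 1.0 - comb(n - c, k) / comb(n, k)


def all_of_k(n: int, c: int, k: int) -> float:
    """P(every one of k attempts, drawn from n with c correct, is correct)."""
    if not 0 < k <= n:
        raise ValueError("k must be between 1 and n")
    # comb returns 0 when fewer than k correct attempts exist.
    return comb(c, k) / comb(n, k)


# Hypothetical counts: a base model correct on 4 of 16 samples,
# a tuned model correct on 15 of 16 samples.
n = 16
table = {
    k: {
        "base_pass": pass_at_k(n, 4, k),
        "tuned_pass": pass_at_k(n, 15, k),
        "base_all": all_of_k(n, 4, k),
        "tuned_all": all_of_k(n, 15, k),
    }
    for k in (1, 2, 4, 8)
}
```

With these made-up counts the base model's pass@k climbs toward 1 as k grows, while its all-of-k score falls to 0 once k exceeds its 4 correct samples; the tuned model keeps a nonzero all-of-k score throughout.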
RLVR converts a model that can solve a problem in 256 tries into one that solves it reliably on the first try.
Yue et al. (Tsinghua), NeurIPS 2025 Best Paper Runner-Up
The Compression View
RLVR narrows the model's focus to reliable reasoning paths. It trades breadth for reliability. At high k, the base model covers more problems because it hasn't pruned any approaches.
The Expansion View
When you measure reasoning correctness (not just final answers), RLVR models show expanded capabilities at all k values. Standard pass@k was giving false credit to lucky guesses.
Both Are Happening
Early training compresses (exploitation phase). Extended training expands (exploration phase). The model first learns to be reliable, then learns to be capable.
RLHF vs RLVR
Why Verifiable Rewards Change the Game
RLHF trains models to match human preferences. In specialized domains, this creates models that sound expert-like but are wrong in ways only a domain expert would catch.
RLHF: Learn What People Prefer
Signal: "This one sounds better to the reviewer"
Failure mode: Confidently wrong, expert-sounding nonsense
RLVR: Learn What Actually Works
Signal: "This one IS correct, verified against ground truth"
Advantage: Reliable accuracy backed by verifiable evidence
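The verifiable signal can be made concrete with a minimal sketch: each sampled answer is checked against ground truth to get a binary reward, and the rewards within one group of samples are then normalized in the style of GRPO (reward minus group mean, divided by group standard deviation). The exact-match check, the use of population standard deviation, the epsilon, and all function names are illustrative assumptions; real verifiers are domain-specific, and the policy update itself is omitted.

```python
from statistics import mean, pstdev


def verify(answer: str, ground_truth: str) -> float:
    """Binary verifiable reward: 1.0 on a normalized exact match, else 0.0."""
    return 1.0 if answer.strip().lower() == ground_truth.strip().lower() else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one group of samples for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population standard deviation
    # eps avoids division by zero when every sample gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one prompt whose ground truth is "42".
samples = ["42", "41", " 42 ", "forty-two"]
rewards = [verify(s, "42") for s in samples]
advantages = group_relative_advantages(rewards)
```

Correct samples receive positive advantages and incorrect ones negative. When every sample in a group earns the same reward, all advantages are zero and that group contributes no learning signal.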
RLVR Is Breaking Out of Math and Code
+8 pts out-of-distribution
First medical RLVR study, using a 3B-parameter model. Matched SFT in-distribution and gained 8 percentage points on health benchmarks it was never trained on.
Small model outperforms 72B
A dramatically smaller model outperformed Qwen2.5-72B-Instruct on domain-specific benchmarks by converting domain knowledge into verification signal.
+31% relative improvement
Extended RLVR beyond binary verification using structured rubrics as reward signal. 31% relative improvement on HealthBench over Likert-based baselines.
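One way to picture a rubric-based reward is as a weighted checklist whose score is the fraction of weight earned. The sketch below is an illustration under stated assumptions, not the method from the cited paper: the `Criterion` type, the weights, and the keyword-based checks are all hypothetical, and in practice each criterion would more likely be judged by an expert or a grader model than by string matching.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Criterion:
    """Hypothetical rubric item: a description, a weight, and a pass/fail check."""
    description: str
    weight: float
    check: Callable[[str], bool]


def rubric_reward(reply: str, rubric: list[Criterion]) -> float:
    """Return the weighted fraction of rubric criteria the reply satisfies."""
    total = sum(c.weight for c in rubric)
    if total <= 0:
        raise ValueError("rubric weights must sum to a positive value")
    earned = sum(c.weight for c in rubric if c.check(reply))
    return earned / total


# Illustrative rubric with made-up criteria and weights.
rubric = [
    Criterion("Recommends seeing a clinician", 3.0, lambda r: "clinician" in r.lower()),
    Criterion("Mentions warning signs", 2.0, lambda r: "warning sign" in r.lower()),
    Criterion("Avoids giving a dosage", 1.0, lambda r: "mg" not in r.lower()),
]
reward = rubric_reward("Watch for warning signs and see a clinician soon.", rubric)
```

Unlike a binary verifier, this yields graded rewards between 0 and 1, so partially correct replies still produce a usable signal.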
Literature
Cited Research
Backed by peer-reviewed research and published technical reports.
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via RL
2025 · DeepSeek-AI (Guo et al.), published in Nature
Matched OpenAI o1 using GRPO with verifiable rewards. AIME 2024 pass@1: 79.8%. MATH-500: 97.3%.
arXiv:2501.12948
DeepSeekMath: Pushing the Limits of Mathematical Reasoning
2024 · DeepSeek-AI
Introduced GRPO (Group Relative Policy Optimization). Eliminated the critic model, cutting RL compute roughly in half vs PPO.
arXiv:2402.03300
Tulu 3: Pushing Frontiers in Open Language Model Post-Training
2024 · Allen AI
First open pipeline demonstrating SFT + DPO + RLVR stacking. RLVR improved performance on tasks it was never trained on.
arXiv:2411.15124
Does RL Really Incentivize Reasoning Capacity Beyond the Base Model?
2025 · Yue et al. (Tsinghua University)
NeurIPS 2025 Best Paper Runner-Up. Characterized RLVR as 'search compression': pass@1 improves while pass@256 falls below the base model's.
arXiv:2504.13837
RLVR Implicitly Incentivizes Correct Reasoning in Base LLMs
2025 · Wen et al. (Microsoft Research Asia, Peking University)
Introduced CoT-Pass@K showing RLVR extends reasoning boundaries when measuring reasoning correctness, not just final answers.
arXiv:2506.14245
The RLVR Reasoning Capability Boundary: Shrinkage, Expansion, or Both?
2025 · Yao et al.
Reconciled the compression vs expansion debate: both happen in sequence. Exploitation phase first, exploration phase follows.
arXiv:2510.04028
Med-RLVR: Reinforcement Learning with Verifiable Rewards for Medicine
2025 · Zhang et al. (Microsoft Research)
First medical RLVR. 3B model matched SFT in-distribution, +8 pts out-of-distribution on health benchmarks.
arXiv:2502.19655
K2V: Knowledge-to-Verification for RLVR
2025 · Yuan et al.
Small model outperformed Qwen2.5-72B-Instruct on agriculture domain benchmark using domain knowledge as verification signal.
arXiv:2605.18261
Rubrics as Rewards: Extending RLVR Beyond Binary Verification
2025 · Gunjal et al.
31% relative improvement on HealthBench over Likert-based baselines. Also +7% relative on GPQA-Diamond for science reasoning.
arXiv:2507.17746
Direct Preference Optimization
2023 · Rafailov et al. (Stanford)
Showed DPO matches or exceeds RLHF-PPO while eliminating the reward model entirely. Simplified preference learning.
arXiv:2305.18290
Reinforcement Learning via Self-Distillation
2026 · Hübotter et al. (ETH Zurich, MIT)
Introduced SDPO (Self-Distillation Policy Optimization): the model learns from its own feedback-conditioned predictions, with no external teacher or reward model. Matched best-of-k discovery on hard binary-reward tasks with 3x fewer attempts.
arXiv:2601.20802
Restoring the Sweet Spot: Pass-Rate Weighted Self-Distillation
2026 · Liu et al.
Weighted SDPO's self-distillation by pass rate, an implicit curriculum that tracks model competence. +3.2 mean@16 on Qwen3-8B and +1.8 on OLMo-3-7B across reasoning benchmarks.
arXiv:2605.27765
Qwen3 Technical Report
2025 · Qwen Team (Alibaba)
4-stage pipeline: CoT cold start, GRPO, fusion, general RL. Only 3,995 query-verifier pairs needed for the RLVR stage.
arXiv:2505.09388
GRPO Amplifies Success Probability
2025 · Mroueh (IBM Research)
Mathematical proof that GRPO systematically amplifies the probability of successful reasoning paths.
arXiv:2503.06639
Absolute Zero: Reinforced Self-Play Reasoning with Zero Data
2025 · Zhao et al. (Tsinghua University)
The model proposes its own tasks and learns from a code executor's verifiable rewards, with zero external training data. Reached state-of-the-art coding and math reasoning in the zero-data setting.
arXiv:2505.03335
One Sample to Rule Them All: Extreme Data Efficiency in RL Scaling
2026 · Li et al.
A single strategically designed training sample lifted reasoning across physics, chemistry, and biology. Structural quality and skill diversity of the data matter more than volume.
arXiv:2601.03111
Evaluating Parameter Efficient Methods for RLVR
2025 · Yin et al.
First comprehensive benchmark of 12+ parameter-efficient fine-tuning methods under RLVR. Structural LoRA variants (DoRA, AdaLoRA) beat standard LoRA; SVD-based methods misalign with RL optimization.
arXiv:2512.23165
Learning Rate Matters: Vanilla LoRA May Suffice for LLM Fine-Tuning
2026 · Lee et al.
Across nine LoRA variants, properly tuned learning rates bring all methods within 1-2% of the same peak. Reported gains from variant methods are largely tuning artifacts.
arXiv:2602.04998
AlphaEvolve: A Coding Agent for Scientific and Algorithmic Discovery
2025 · Novikov et al. (Google DeepMind)
LLM-driven evolution over code, scored by automated evaluators. Found a 48-multiplication procedure for 4x4 complex matrix multiplication, the first improvement in this setting in 56 years.
arXiv:2506.13131
Efficiently Scaling Transformer Inference
2022 · Pope et al. (Google)
The engineering playbook for serving 500B+ parameter transformers: partitioning strategies and multiquery attention that trade latency against throughput at scale.
arXiv:2211.05102
Fast Inference from Transformers via Speculative Decoding
2023 · Leviathan et al. (Google)
A small draft model proposes tokens that the large model verifies in parallel, producing identical outputs 2-3x faster without retraining or architecture changes.
arXiv:2211.17192
The Method Is Public. The Signal Is Yours.
Every paper on this page is published. The verification signal that makes RLVR work in your domain has to come from your experts, and that is the part we build with you.