The 8B SLM That Beats Frontier AI on HealthBench

Alex Dong · June 26, 2026 · 4 min read

Executive Summary

We trained an 8B-parameter Specialised Language Model (SLM) and evaluated it on the hardest physician-validated subset of HealthBench, a benchmark designed by OpenAI to measure model performance on real-world healthcare scenarios.

  • Accuracy: Our SLM scores 96.5%. Opus 4.6 scores 18.1%, and Sonnet 4.6 scores 26.3%.
  • Stress test: Each request is run 16 times, and the weakest result is reported as a measure of worst-case performance. Our SLM retained a 72.4% score. Every other LLM fell below 10%.
  • Cost: At 1 million queries per month, our SLM would save about US$146,000 per year compared with Sonnet 4.6 and about US$247,000 per year compared with Opus 4.6.

The results reveal that despite their small size, SLMs trained with the right methodology can deliver higher accuracy, better reliability, and far lower cost than general-purpose frontier LLMs.

Methodology

What is HealthBench?

HealthBench is a benchmark developed by OpenAI to measure frontier models' performance in healthcare settings. It was designed to address the limitations of prior healthcare benchmarks, which were not representative of real-world use cases and often relied on multiple-choice questions.

OpenAI worked with 262 physicians from 60 countries to develop 5,000 realistic multi-turn conversations. Model responses to these conversations are scored against 48,562 physician-written criteria. Two properties make it the right benchmark to test against. First, the latest frontier models perform relatively poorly, so there is real room to improve. Second, it tests open-ended reasoning, the kind of skill that matters in real-world situations.
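To make criteria-based scoring concrete, here is a minimal sketch of how a single response could be scored against a rubric. It follows our reading of the HealthBench paper (points for met criteria divided by the maximum achievable positive points, with the benchmark-level mean clipped to the range 0 to 1). The `Criterion` class, the function names, and the three example criteria with their point values are all hypothetical and invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """One rubric item (hypothetical structure for illustration)."""
    description: str
    points: int  # positive = desirable behaviour, negative = undesirable
    met: bool    # the grader's judgement for this response


def example_score(criteria: list[Criterion]) -> float:
    """Points earned on met criteria divided by the maximum positive points."""
    max_points = sum(c.points for c in criteria if c.points > 0)
    if max_points == 0:
        raise ValueError("rubric needs at least one positive-point criterion")
    earned = sum(c.points for c in criteria if c.met)
    return earned / max_points


def benchmark_score(per_example_scores: list[float]) -> float:
    """Mean over examples, clipped to [0, 1]."""
    mean = sum(per_example_scores) / len(per_example_scores)
    return min(1.0, max(0.0, mean))


# Invented rubric: the response asked a clarifying question but missed
# the red-flag advice, and did not commit the penalised behaviour.
rubric = [
    Criterion("Asks how long the symptoms have lasted", 7, met=True),
    Criterion("Advises emergency care if red-flag symptoms appear", 5, met=False),
    Criterion("States a definitive diagnosis without enough context", -6, met=False),
]

score = example_score(rubric)
print(f"example score: {score:.3f}")
```

Because undesirable criteria carry negative points, a single response can score below zero, which is why the clipping is applied only to the aggregated mean in this sketch.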

Experiment Subset

We deliberately chose the hardest, most rigorously validated subset of HealthBench for our experiment. The subset is defined by three properties:

Validated by multiple physicians: Multiple physicians independently agreed on the grading criteria. This removes annotation noise and ensures that our scores reflect the judgement of real clinicians, not an LLM.

The hardest 20%: We only tested against the most difficult 20% of questions in HealthBench. Easy questions are already near-solved. The hard cases are where the benchmark still has enough headroom to show meaningful progress.

Open-ended questioning: This tests whether a model knows when to stop, ask for clarification, and generate a follow-up question. The HealthBench paper reports that this is where frontier LLMs perform the worst.

Training Details

Training data: Publicly accessible datasets, including clinical questions from Reddit r/AskDocs and Mayo Clinic question-and-answer material. From these we generated synthetic training data using our proprietary methods, yielding 1,046 synthetic cases in our training dataset.

Compute time: Total model training time was under 28 hours on a single L40S GPU.

Compute cost: The total training cost was just shy of US$40.

Base model: Qwen3-8B, an open-weight model from the Qwen3 family.

Results

Accuracy

Our model scores 96.5%. Anthropic's strongest result is Haiku 4.5 at 31.2%, while Opus 4.6 reaches 18.1%. The Qwen3-8B base model starts at 2.9%, so the lift comes from the training process.

Fig 1: HealthBench performance versus inference cost at 1 million queries per month.

Inference Cost

At 1 million queries per month, the inference bill for using our model is around US$300. Sonnet 4.6 would cost about US$12,500, with Opus 4.6 near US$20,900.

That is about US$146,000 in annual savings compared with Sonnet 4.6 and about US$247,000 compared with Opus 4.6, with higher accuracy and the operational freedom to deploy, tune, and monitor the model inside the organisation’s own environment.
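The savings figures follow directly from the monthly bills quoted above. The short sketch below reproduces the arithmetic; the dictionary and function names are our own, and the inputs are the rounded monthly figures from this section rather than exact billing data.

```python
# Monthly inference bills at 1 million queries per month (US$),
# taken from the rounded figures quoted in this section.
MONTHLY_COST_USD = {
    "SLM": 300,
    "Sonnet 4.6": 12_500,
    "Opus 4.6": 20_900,
}


def annual_savings(baseline: str, alternative: str = "SLM") -> int:
    """Yearly saving (US$) from replacing `baseline` with `alternative`."""
    return (MONTHLY_COST_USD[baseline] - MONTHLY_COST_USD[alternative]) * 12


def cost_ratio(baseline: str, alternative: str = "SLM") -> float:
    """How many times more expensive `baseline` is than `alternative`."""
    return MONTHLY_COST_USD[baseline] / MONTHLY_COST_USD[alternative]


for model in ("Sonnet 4.6", "Opus 4.6"):
    print(
        f"{model}: saves US${annual_savings(model):,} per year, "
        f"about {cost_ratio(model):.0f}x the SLM bill"
    )
```

With these inputs the savings come to US$146,400 and US$247,200 per year, which round to the figures above, and the frontier bills are roughly 42 and 70 times the SLM bill.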

Reliability

In a safety-critical domain like health, the consequences of one bad model response can outweigh the benefits of many good ones. An average accuracy score only tells you what to expect from a single random draw. The stress test asks a different and more salient question: if the same model processes the same problem many times, how bad is the worst answer likely to be?

Fig 2: Reliability curves across repeated samples.

Our trained model stays dependable under repeated sampling. It starts at 96.5% on a single run and drops to 72.4% when only the worst of 16 runs is counted.

Every other model's performance collapses. Haiku 4.5 drops from 31.2% to 2.6%, and Sonnet 4.6 drops from 26.3% to 5.3%. A model that demos well but has a fast-collapsing reliability curve is fragile in production, especially in agentic systems with retries, branches, or multiple tool calls.
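The worst-case metric in this section can be sketched as follows. This is a simplified illustration, not our evaluation harness: it assumes every question has a list of per-run scores, takes the minimum over the first k runs, and averages across questions. The function names and the toy scores are invented, and a more careful estimator would average over all subsets of k runs instead of using the first k.

```python
from statistics import mean


def worst_of_k(scores_per_question: list[list[float]], k: int) -> float:
    """Mean over questions of the lowest score among the first k runs."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if any(len(runs) < k for runs in scores_per_question):
        raise ValueError("every question needs at least k runs")
    return mean(min(runs[:k]) for runs in scores_per_question)


def reliability_curve(scores_per_question: list[list[float]], max_k: int) -> list[float]:
    """Worst-of-k score for k = 1 .. max_k (one point per repeat count)."""
    return [worst_of_k(scores_per_question, k) for k in range(1, max_k + 1)]


# Toy data: 3 questions, 4 runs each (invented scores, not our results).
toy_scores = [
    [1.0, 1.0, 0.5, 1.0],
    [1.0, 0.0, 1.0, 1.0],
    [0.5, 0.5, 0.5, 0.5],
]

curve = reliability_curve(toy_scores, max_k=4)
for k, value in enumerate(curve, start=1):
    print(f"worst-of-{k}: {value:.3f}")
```

The curve can only stay flat or fall as k grows, since adding a run can never raise a minimum. How quickly it falls is what separates a model that is dependable from one that is merely good on average.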

Why Do Frontier Models Struggle Here?

Our analysis points to three main reasons.

First, frontier LLMs' training incentives are misaligned with the context-seeking behaviour that this benchmark rewards.

The benchmark rewards restraint: identify what is unknown, ask for the missing clinical context, then give advice.

Frontier LLMs are trained to be helpful, which biases them toward giving a response immediately, even when key clinical information required to make the correct judgement is never surfaced. [1]

Our model was trained to behave like an expert: recognise gaps, ask the right follow-up questions, and only provide advice after clarifying the scope of the problem.

Second, the very capability that makes frontier models impressive also makes them too eager to answer.

We were surprised that Opus 4.6 scores below Haiku 4.5, despite being the more capable model. Our trace analysis suggests that Opus is more confident, more comprehensive, and less willing to stop at a clarifying question.

Prompt engineering was not enough to steer the model to behave differently. But training was. The experiment shows that, with the right methodology, we can move a model away from its default response pattern and teach it to adopt a new behaviour that is more appropriate for the real-world environment it operates in.

Third, Opus 4.6 did not answer 10.8% of valid clinical questions because its guardrails treated the questions as policy violations.

This automated blocking behaviour highlights the operational constraint of general LLMs. Their guardrails come pre-set by the provider, so valid domain-specific questions can be blocked under a generic policy. An SLM gives the organisation freedom to deploy guardrails that match its own risk profile, legal requirements, and operational environments.

Conclusion

Using LLMs for clinical tasks seems like the right approach. On the surface, they are trained on every textbook, have read everything available on the internet, and perform well on easy benchmarks. Yet our results show that they are less capable of operating in a real-world environment.

Our SLM, trained with our unique scenario design and reinforcement learning environment, learned to act like an expert and seek more information before giving advice.

SLMs give organisations the opportunity to achieve significantly better performance in specialised environments. We believe they offer significantly better value than frontier LLMs, which cost tens of billions to build and, at the inference prices above, are roughly 40 to 70 times more expensive to operate.

[1]: This "eagerness to respond" is so ingrained that we found prompt engineering was insufficient to steer the LLMs away from this behaviour.