The Mission · Health-AI Watch

Can an AI conduct a health study that survives an audit?

Comparing how frontier AI systems conduct rigorous, reproducible health studies—from the first protocol decision to the final claim.


Design-stage preview. The release evidence ledger is live. The study benchmark is being specified; no model has been formally run or scored by Health-AI Watch.

A solo researcher or small team does not need another medical-trivia champion. The useful question is whether an AI can turn a real health question into correct, checkable work—and how much supervision, correction, time, and money that takes.

The study run

Grade the whole chain, not just the final answer.

A one-prompt answer hides where research fails. Each formal run will move through the same ordered workflow and leave a checkable artifact at every stage.

Stage 01

Question & protocol

Define the decision, estimand, exclusions, validation plan, and stopping rule before seeing the answer.

Locked protocol

Stage 02

Research engineering

Turn the protocol and public data documentation into deterministic, executable analysis code.

Runnable code

Stage 03

Analysis & uncertainty

Produce the required estimates, intervals, diagnostics, and calibrated interpretation.

Results package

Stage 04

Sensitivity

Test whether the decision changes under plausible assumptions, alternate specifications, and planted traps.

Stress tests

Stage 05

Audit & repair

Receive mechanical test failures, diagnose them, and correct the work without being handed the answer.

Repair log

Stage 06

Publication package

Separate finding from speculation and deliver rerunnable code, provenance, limitations, and a concise public explanation.

Auditable study

Research-fidelity scorecard

The result matters. So does how it got there.

The Watch will publish a profile, not one composite rank. Different researchers may value speed, cost, methodological judgment, or error detection differently.

Dimension	The question	How it is observed
Result fidelity	Did the executable analysis reach the authority-backed or true-by-construction answer?	Mechanical comparison with locked reference outputs
Method fidelity	Was the design appropriate for the decision, data, and failure costs?	Predeclared protocol requirements and scored design traps
Calibration	Did confidence track correctness and evidence strength?	Brier score, calibration error, abstention, and Confidently-Wrong Rate
Reproducibility	Can a clean environment rerun the submitted work?	Deterministic build, artifact, and provenance checks
Audit & recovery	Did the system detect, explain, and repair failed checks?	Hidden tests plus a retained repair transcript
Policy calibration	Did it protect genuinely unsafe work without blocking benign public-data research?	Predeclared unsafe and over-refusal cases, scored separately
Human burden	How much steering and correction did success require?	Intervention count, intervention type, elapsed time, and failure state
Cost & repeatability	What did a successful run cost, and did repeated runs agree?	Token and billing provenance with repeated-run intervals

Mechanical checks cover structured decisions, code, artifacts, and locked outputs. Methodological judgments that cannot be reduced to a defensible rule remain separate and require appropriate domain review.
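As an illustration only, a mechanical comparison with locked reference outputs might look like the sketch below. The function name `compare_to_reference`, the result keys, and the tolerances are hypothetical and are not taken from the Watch's harness. The sketch reports which key failed and why, without echoing the expected value, so a failure report does not hand over the answer.

```python
import math


def compare_to_reference(submitted, reference, rel_tol=1e-6, abs_tol=1e-9):
    """Compare submitted results with locked reference outputs.

    Returns a list of (key, reason) failures; an empty list means every
    reference key was matched. Expected values are never included in the
    failure report. Tolerances are illustrative assumptions.
    """
    failures = []
    for key, expected in reference.items():
        if key not in submitted:
            failures.append((key, "missing"))
            continue
        actual = submitted[key]
        if isinstance(expected, float):
            # Floating-point estimates are compared within a tolerance.
            is_number = isinstance(actual, (int, float))
            if not is_number or not math.isclose(
                actual, expected, rel_tol=rel_tol, abs_tol=abs_tol
            ):
                failures.append((key, "value mismatch"))
        elif actual != expected:
            # Counts and structured decisions must match exactly.
            failures.append((key, "value mismatch"))
    return failures


# Hypothetical locked outputs and two hypothetical submissions.
reference = {"risk_ratio": 1.25, "n_included": 4012, "decision": "no_change"}
passing = {"risk_ratio": 1.2500000001, "n_included": 4012, "decision": "no_change"}
failing = {"risk_ratio": 1.31, "decision": "no_change"}

print(compare_to_reference(passing, reference))
print(compare_to_reference(failing, reference))
```

Exact matching for counts and decisions, with a tolerance only for floating-point estimates, is one possible design choice; a real harness would need to predeclare its own rules per output.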

Why this test is different

Knowing medicine is not the same as doing reliable research.

Existing health evaluations provide useful signals. They generally do not test whether a model can produce an end-to-end study that another researcher can rerun and challenge.

Answer vs. workflow

A correct response can hide a broken method.

Question format alone can materially change apparent performance. A study adds design choices, executable work, uncertainty, sensitivity, provenance, and limits that must survive inspection.

CRAFT-MD, Nature Medicine

Output vs. recovery

The first attempt is only half the story.

Research systems encounter failed tests and conflicting evidence. The Watch records whether the model can find and repair its own mistake without being handed the answer.

Model vs. system

Researchers use products and workflows, not abstract weights.

Tools, context, system prompts, reasoning settings, safety layers, interfaces, and human supervision all affect the result. Every comparison must name the full configuration.

The first pilot

Small enough to audit. Hard enough to matter.

The initial comparison will use one fresh, bounded study built from unrestricted public data. It will test a deliberately small model panel deeply rather than every release superficially.

Controlled comparison

Same study, explicit conditions

Three or four decision-relevant models will receive the same study packet inside a controlled harness, with a shared tool policy, predeclared intervention protocol, and resource ceiling. Each configuration will run repeatedly.

Recorded exactly: model and API identifiers, dates, system prompt, reasoning and sampling settings, tool access, context, interventions, elapsed time, tokens, and cost.
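One way to make "recorded exactly" checkable is to store each configuration as a structured record and fingerprint it, so any change to a setting yields a different identifier. The sketch below is an assumption about how that could be done; the class `RunConfig`, its field names, and the example values are hypothetical, not the Watch's schema. Per-run outcomes (interventions, elapsed time, tokens, cost) would be logged separately against the fingerprint.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class RunConfig:
    """Hypothetical record of everything that defines one configuration."""
    model_id: str
    api_id: str
    system_prompt: str
    reasoning_setting: str
    temperature: float
    tools: tuple
    context_note: str


def config_fingerprint(config):
    """Return a SHA-256 hex digest of the canonical JSON form of the config."""
    # Sorted keys and fixed separators keep the serialization stable.
    payload = json.dumps(asdict(config), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Placeholder values for illustration only.
base = RunConfig(
    model_id="example-model",
    api_id="example-api-2026-01",
    system_prompt="You are conducting a bounded public-data study.",
    reasoning_setting="default",
    temperature=0.0,
    tools=("python", "file_read"),
    context_note="study packet v1",
)
changed = replace(base, temperature=0.7)

print(config_fingerprint(base))
print(config_fingerprint(base) == config_fingerprint(changed))
```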

Contamination resistance

Fresh case, hidden checks

The formal case will not ask models to reproduce studies already published on this site. It will use new questions, locked reference code, hidden perturbations, and mechanically checked artifacts.

Published result: a dimension-by-dimension research-fidelity dossier, including failures and human burden—not an overall winner.

No model grades another model, and manual chat sessions are never scored. Personal experience can shape the protocol, but formal results require mechanical keys and a locked, reproducible harness.

Controlled and native-product tracks stay separate. Cross-model comparisons use the controlled harness. Observations from ChatGPT, Claude, or another product reflect that product's routing and interface, so they cannot be merged into the same score.

Component checks

Does it know when to proceed—and when to stop?

End-to-end study performance can hide dangerous local behaviors. Two focused components remain part of the design.

B2 · Calibration

Does confidence track correctness?

Fresh objective items test stated confidence, abstention on false premises, and robustness when a user pushes against a correct answer.

Signature metric: Confidently-Wrong Rate, reported with Brier score, calibration error, abstention, and repeated-run variation.
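A minimal sketch of these metrics follows, under stated assumptions. The Brier score and binned calibration error use their standard definitions. This page does not define the Confidently-Wrong Rate, so the version here (the share of answered items that are wrong despite stated confidence at or above a threshold of 0.8) is an assumed illustration, as are the function names and the example data. Abstentions are assumed to be excluded beforehand and reported separately.

```python
def brier_score(confidences, correct):
    """Mean squared gap between stated confidence and the 0/1 outcome."""
    return sum((c - float(y)) ** 2 for c, y in zip(confidences, correct)) / len(confidences)


def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned calibration error: weighted mean |confidence - accuracy| per bin."""
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        index = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the top bin
        bins[index].append((c, y))
    total = len(confidences)
    error = 0.0
    for items in bins:
        if not items:
            continue
        mean_confidence = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, y in items if y) / len(items)
        error += len(items) / total * abs(mean_confidence - accuracy)
    return error


def confidently_wrong_rate(confidences, correct, threshold=0.8):
    """ASSUMED definition: share of answered items wrong at confidence >= threshold."""
    confident_errors = sum(
        1 for c, y in zip(confidences, correct) if c >= threshold and not y
    )
    return confident_errors / len(confidences)


# Made-up example: stated confidence and whether each answer was correct.
confidences = [0.92, 0.85, 0.65, 0.97]
correct = [True, False, True, False]

print(brier_score(confidences, correct))
print(expected_calibration_error(confidences, correct))
print(confidently_wrong_rate(confidences, correct))
```

Repeated-run variation could then be summarized by computing each metric per run and reporting the spread across runs.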

Research-action calibration

Does it choose the proportionate next step?

Cases test when to proceed, check, clarify, defer to an expert, stop for inadequate evidence, or refuse an unsafe request—without blocking benign biology or public-data research.

Reported separately: unsafe compliance, unsupported continuation, appropriate deferral, and benign over-refusal.

Clinical triage comes later. Patient-facing escalation and treatment scenarios require authoritative answer keys and external clinical review before any ranking can be published.

Capability is not product policy. Base-model behavior remains separate from routing, classifiers, system prompts, tools, and interface restrictions.

The release radar

Track broadly. Test selectively.

The living release ledger identifies plausible candidates, access routes, prices, outside evidence, and unresolved questions. A model enters the formal panel only when it could change a research decision.

31 2026 releases indexed

10 with outside evidence

2 with direct third-party health evidence

0 formal Watch scores

Evidence cutoff July 20, 2026. A missing result means not established by the cutoff—not zero, unsafe, or poor.

Browse the release evidence ledger →

The longer arc

Research first. Clinical claims only after clinical evidence.

This Watch starts where the work is answerable now: bounded studies with public data, locked outputs, and an audit trail. Once a system can reliably reproduce and audit that work, the harder research question opens: can it produce a novel finding that survives independent verification? Dependable research partnership may eventually support larger roles in diagnosis and treatment, but this page does not tell patients which model to trust or establish clinical readiness.

Follow the build log →

Researchers and reviewers: get in touch