The Question
Diabetes risk prediction underpins clinical screening decisions. Published risk scores combine known factors — age, BMI, blood pressure, smoking, physical activity — using relative risks from landmark epidemiological studies. Machine learning can discover nonlinear interactions between these same factors. Which approach better identifies future diabetes cases in 26,224 adults tracked over decades?
ADM Prediction (Made Before Running Models)
Predicted winner: machine learning. Published relative risks capture the direction of each factor (obesity increases risk, activity decreases it) but not the nonlinear interactions between them. Diabetes is a threshold disease: borderline BMI + borderline glucose + borderline BP compounds risk beyond what any additive score captures. With 15 features and >20K samples, gradient boosting has enough data to learn these synergies.
Expected margin: 5–10 AUC points. Actual: +10.7 AUC points, just above the expected range. Directional prediction confirmed.
Results
[Result figures: ROC curves · feature importance (top 8) · confidence intervals · net reclassification]
The domain score uses published relative risks. But before concluding that tree-based ML is needed, we should ask: does logistic regression — the simplest ML model — close most of the gap?
Logistic regression (AUC 0.693) captures 93% of the gain over the domain score. GradientBoosting adds only +0.007 more. The improvement comes from flexible feature weighting, not nonlinear tree interactions.
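The LR-versus-GradientBoosting comparison can be sketched as below. This is a minimal illustration on synthetic data, not the HRS pipeline; the sample size, feature count, and class balance are assumptions chosen only to make the comparison runnable.

```python
# Sketch: how much of the gap does logistic regression close?
# Synthetic stand-in data -- dimensions and class balance are assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=2000, n_features=15, n_informative=8,
                           weights=[0.85], random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

models = {
    "logistic": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "boosting": GradientBoostingClassifier(n_estimators=200, max_depth=4,
                                           random_state=0),
}

results = {}
for name, model in models.items():
    aucs = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    results[name] = aucs.mean()
    print(f"{name}: mean AUC = {aucs.mean():.3f}")
```

If the two mean AUCs land close together, the signal is mostly linear in the features; a large boosting advantage would instead point to genuine nonlinear structure.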
A model can discriminate (rank people correctly) without being calibrated (predicting accurate probabilities). Calibration matters when you need to tell someone “you have a 30% risk” and mean it.
The ML model (ECE 0.017) is well-calibrated — predicted probabilities closely match observed rates. The domain score (ECE 0.175) systematically overestimates risk in the middle range.
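The expected calibration error (ECE) quoted above can be computed with a simple binning scheme. The sketch below uses equal-width bins; the bin count and the toy probabilities are illustrative assumptions, not values from the study.

```python
# Sketch of expected calibration error (ECE) with equal-width bins.
# n_bins=10 and the toy example below are illustrative assumptions.
import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Bin-size-weighted mean of |observed event rate - mean predicted prob|."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # include 1.0 in the last bin
        mask = (y_prob >= lo) & ((y_prob < hi) if hi < 1.0 else (y_prob <= hi))
        if mask.any():
            ece += mask.mean() * abs(y_true[mask].mean() - y_prob[mask].mean())
    return ece

# Perfectly calibrated toy case: predicted 0.2 where 20% are positive,
# predicted 0.8 where 80% are positive -> ECE of 0.
probs = np.array([0.2] * 10 + [0.8] * 10)
labels = np.array([1, 0, 0, 0, 0, 1, 0, 0, 0, 0] + [1] * 8 + [0] * 2)
print(expected_calibration_error(labels, probs))  # -> 0.0
```

An ECE of 0.017 means that, averaged over bins, predicted probabilities deviate from observed incidence by under two percentage points.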
A model that works well on average might fail for specific populations. Does the ML advantage hold across age groups, sexes, and BMI categories — or does it only work for some subgroups?
The ADM Insight
Published diabetes risk factors — BMI thresholds, hypertension, family history — capture the population average. But a 35-year-old woman with BMI 28, normal blood pressure, and high stress has a different trajectory than a 55-year-old man with identical numbers. The ML model reweights and combines these factors in ways that guidelines flatten into averages. This is a question where ML provides genuine predictive value.
The published risk score uses log-relative-risks from six landmark studies: BMI thresholds (RR=2/4/8 at 25/30/35), hypertension (RR=1.6), smoking (RR=1.4), physical inactivity (RR=1.3), poor self-rated health (graded), and depression with CESD > 4 (RR=1.6), plus a BMI-age interaction term.
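An additive log-relative-risk score of this shape can be sketched as below. The RR values follow the text; the variable names, the coefficient on the graded self-rated-health term, and the functional form of the BMI-age interaction are all assumptions for illustration, not the published formulas.

```python
# Sketch of an additive log-RR risk score. RR values come from the text;
# the graded-health coefficient and the BMI-age interaction form are assumed.
import math

def domain_risk_score(bmi, age, hypertensive, smoker, inactive,
                      poor_health_grade, depressed):
    score = 0.0
    # BMI thresholds: RR = 2 / 4 / 8 at BMI 25 / 30 / 35
    if bmi >= 35:
        score += math.log(8)
    elif bmi >= 30:
        score += math.log(4)
    elif bmi >= 25:
        score += math.log(2)
    score += math.log(1.6) * hypertensive       # hypertension, RR = 1.6
    score += math.log(1.4) * smoker             # smoking, RR = 1.4
    score += math.log(1.3) * inactive           # physical inactivity, RR = 1.3
    score += 0.1 * poor_health_grade            # graded term (coefficient assumed)
    score += math.log(1.6) * depressed          # CES-D > 4, RR = 1.6
    # BMI-age interaction (functional form assumed for illustration)
    score += 0.01 * max(bmi - 25, 0) * (age - 50) / 10
    return score

print(domain_risk_score(bmi=32, age=55, hypertensive=1, smoker=0,
                        inactive=1, poor_health_grade=2, depressed=0))
```

Because the score is a sum of log-RRs, exponentiating it recovers a multiplicative relative risk against a reference profile.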
The GradientBoosting model trains on the same 7 variables plus 8 additional features (weight, education years, vigorous/moderate activity, alcohol consumption, depression score, self-rated health). 200 trees, max depth 4.
Both models are evaluated on identical held-out data via 5-fold stratified cross-validation, with bootstrap 95% CIs from 1,000 resamples. Non-overlapping 95% CIs indicate the AUC difference is statistically significant.
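The bootstrap CI procedure can be sketched as below, using 1,000 resamples as in the text. The data here are synthetic stand-ins for out-of-fold predictions; the class prevalence and score distributions are assumptions.

```python
# Sketch: percentile bootstrap 95% CI for AUC from out-of-fold predictions.
# 1,000 resamples per the text; data are synthetic stand-ins.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 2000
y = rng.binomial(1, 0.15, size=n)                       # ~15% incident cases (assumed)
scores = y * rng.normal(0.6, 0.3, n) + (1 - y) * rng.normal(0.4, 0.3, n)

boot = []
for _ in range(1000):
    idx = rng.integers(0, n, size=n)                    # resample with replacement
    if y[idx].min() == y[idx].max():                    # skip degenerate resamples
        continue
    boot.append(roc_auc_score(y[idx], scores[idx]))

lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"AUC 95% CI: [{lo:.3f}, {hi:.3f}]")
```

To compare two models, the same resample indices should be applied to both score vectors so the interval reflects the paired AUC difference, not two independent ones.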
Domain baseline: The published risk score uses relative risks from 6 landmark studies. It does not include family history or kidney function (not available in HRS). A clinical Framingham Diabetes Risk Score with those variables might narrow the gap.
Logistic regression closes the gap: Logistic regression (AUC 0.693) closes 93% of the gap — most of the improvement comes from flexible feature weighting, not nonlinear interactions. GradientBoosting adds only +0.007 AUC above LR.
Published ML benchmarks: Recent diabetes prediction models in the literature achieve AUC 0.72–0.80 on prospective cohorts with richer feature sets (fasting glucose, HbA1c, family history). Our AUC 0.700 is competitive but not state-of-the-art, reflecting HRS's limited biomarker panel.
Temporal features limited: HRS collects data biennially, not annually. BMI change over 2-year intervals captures less signal than continuous monitoring would.
No external validation: All results from within-HRS cross-validation. Replication on NHANES or UK Biobank would strengthen generalizability claims.