Longevity Study → Investigation 7

Will This Person Develop Heart Disease?

26,332 people. Published Framingham-style risk scores vs ML — another clear ML win (p=0.0005), despite a strong domain baseline.

The Question

Cardiovascular disease risk prediction has been the gold standard of clinical scoring since the Framingham Heart Study. We deliberately strengthened the domain baseline — replacing a crude 6-variable score with an 11-variable published relative-risk model. The question is whether ML can still add value when the domain model already encodes decades of cardiology research.

26,332 people tracked · ML AUC 0.681 [0.668–0.698] · permutation test p=0.0005

ADM Prediction (Made Before Running Models)

Predicted winner: ML. CVD has well-established population-level risk factors (Framingham), but HRS includes temporal features (health decline across waves, BP onset timing, depression changes) that static risk scores cannot encode. Non-traditional predictors like activity patterns and depression interact in ways additive log-RR scores miss.

Expected: 5–10 AUC points advantage, driven by temporal features and activity-health interactions. Actual: +8.9 AUC points (0.681 vs 0.592, p<0.001). Prediction confirmed.

Results

Figures: ROC curves · feature importance (top 8) · ablation study · confidence intervals.

Stronger Baselines: Does It Have to Be GradientBoosting?

Before attributing the ML advantage to nonlinear tree interactions, we test whether logistic regression — using the same features — captures most of the gain.

Logistic regression (AUC 0.680) captures nearly all of the gain over the domain score (+0.088 of the +0.089 total); GradientBoosting adds only +0.001 more. The ML advantage comes from using more features with flexible weighting, not from tree-based nonlinearity.
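The comparison can be reproduced in miniature. A minimal scikit-learn sketch on synthetic data (`make_classification` stands in for the HRS features, which are not reproduced here):

```python
# Compare a linear model against gradient boosting on identical features.
# Synthetic stand-in data; the real study uses the HRS variables.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=11,
                           n_informative=8, random_state=0)

models = {
    "logistic": LogisticRegression(max_iter=1000),
    "gbm": GradientBoostingClassifier(n_estimators=200, max_depth=4,
                                      random_state=0),
}
# Mean AUC over 5-fold CV for each model on the same feature matrix.
aucs = {name: cross_val_score(m, X, y, cv=5, scoring="roc_auc").mean()
        for name, m in models.items()}
print(aucs)
```

If the linear AUC sits within a point or two of the boosted AUC, the gain over a simpler baseline is attributable to the feature set rather than to nonlinearity.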

Calibration: Are the Probabilities Trustworthy?

Discrimination (AUC) tells you how well the model ranks people. Calibration tells you whether the predicted probabilities are accurate — critical for clinical decision thresholds.

The ML model (ECE 0.019) is well-calibrated. The domain score (ECE 0.169) systematically overestimates risk across all probability bins — a common problem with additive log-RR scores that ignore correlations between risk factors.
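Expected calibration error (ECE) bins the predictions and compares each bin's mean predicted probability with its observed event rate, weighting gaps by bin size. A minimal NumPy sketch (the equal-width binning scheme is one common choice, not necessarily the study's):

```python
import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """ECE: bin-size-weighted |mean predicted prob - observed event rate|."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Assign each prediction to a bin; clip so prob == 1.0 lands in the last bin.
    idx = np.clip(np.digitize(y_prob, edges) - 1, 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            gap = abs(y_prob[mask].mean() - y_true[mask].mean())
            ece += mask.mean() * gap
    return ece

# Perfectly calibrated toy example: predicted 0.2 with a 20% event rate,
# predicted 0.8 with an 80% event rate.
probs = np.array([0.2] * 10 + [0.8] * 10)
labels = np.array([1, 1] + [0] * 8 + [1] * 8 + [0, 0])
print(expected_calibration_error(labels, probs))  # → 0.0
```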

Subgroup Analysis: Does ML Help Everyone Equally?

Heart disease risk varies dramatically by age and sex. Does the ML advantage hold across subgroups, or is it driven by one demographic? The answer reveals a subtlety: ML wins overall but not uniformly within subgroups.

The ADM Insight

Strengthening the domain baseline, from a crude 6-variable Framingham score to an 11-variable published relative-risk model, barely moved it (0.580 → 0.592), so the ML advantage is not a strawman artifact. Heart disease risk is genuinely heterogeneous: the same risk factors at the same levels produce different outcomes in different people. The ML model's +8.9 AUC-point advantage is statistically robust (p=0.0005).

The subgroup paradox: ML wins overall, but the domain model outperforms in several demographic subgroups. This is a form of Simpson's paradox — the aggregate advantage comes from ML capturing cross-group patterns (e.g., how depression interacts differently with heart risk in men vs. women) rather than uniformly outperforming within each subgroup. It's a useful reminder: aggregate performance ≠ subgroup performance. A deployed system would need subgroup-specific validation, not just an overall AUC.
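The paradox is easy to construct. In this toy NumPy/scikit-learn sketch (illustrative data, not the HRS subgroups), a score that separates two subgroups but is constant within each one achieves a high aggregate AUC while being uninformative inside every subgroup:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Two subgroups; the score encodes group membership but no within-group signal.
y_a = np.array([1] * 10 + [0] * 90)   # group A: 10% events
y_b = np.array([1] * 90 + [0] * 10)   # group B: 90% events
s_a = np.full(100, 0.2)               # constant score within group A
s_b = np.full(100, 0.8)               # constant score within group B

overall = roc_auc_score(np.concatenate([y_a, y_b]),
                        np.concatenate([s_a, s_b]))
within_a = roc_auc_score(y_a, s_a)    # all ties -> 0.5
within_b = roc_auc_score(y_b, s_b)    # all ties -> 0.5
print(overall, within_a, within_b)
```

Here the aggregate AUC is 0.9 while both within-group AUCs are 0.5, which is exactly why subgroup-specific validation matters.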

Data from the HRS RAND Longitudinal Study (1992–2022). 26,332 adults tracked for incident heart disease (6,282 developed CVD, 23.9% prevalence).

The crude baseline uses 6 Framingham variables (age, sex, BMI, smoking, blood pressure, diabetes). The published-RR domain model extends this with 5 additional variables using published relative risks.
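An additive log-RR score multiplies a baseline risk by each present factor's relative risk, i.e. it adds log-RRs. A sketch with hypothetical RR values (the study's published coefficients are not reproduced here):

```python
import math

# Hypothetical relative risks, for illustration only.
LOG_RR = {
    "age_per_decade": math.log(1.5),
    "male": math.log(1.4),
    "smoker": math.log(1.8),
    "diabetes": math.log(2.0),
    "hypertension": math.log(1.6),
}

def domain_score(profile, baseline_risk=0.10):
    """Additive log-RR score: baseline risk scaled by each factor's RR."""
    log_rr = sum(LOG_RR[k] * v for k, v in profile.items())
    return baseline_risk * math.exp(log_rr)

# A 2-decades-older male smoker relative to the reference profile.
risk = domain_score({"age_per_decade": 2, "male": 1, "smoker": 1})
```

Because the log-RRs simply add, the score treats risk factors as acting independently, which is the correlation blindness noted in the calibration section.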

The GradientBoosting model (200 trees, max depth 4) trains on the same features plus activity levels, self-rated health, BMI change, and interaction terms. Evaluated via 5-fold stratified CV. Permutation test (2,000 permutations) yields p=0.0005.
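One common paired design for such a permutation test randomly swaps the two models' scores per subject and recomputes the AUC difference to build the null distribution. A sketch on synthetic data (the study's exact test statistic may differ):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def paired_permutation_pvalue(y, scores_a, scores_b, n_perm=500):
    """Two-sided paired permutation test on the AUC difference:
    under the null, each subject's two scores are exchangeable."""
    observed = roc_auc_score(y, scores_a) - roc_auc_score(y, scores_b)
    hits = 0
    for _ in range(n_perm):
        swap = rng.random(len(y)) < 0.5          # swap scores per subject
        a = np.where(swap, scores_b, scores_a)
        b = np.where(swap, scores_a, scores_b)
        diff = roc_auc_score(y, a) - roc_auc_score(y, b)
        if abs(diff) >= abs(observed):
            hits += 1
    return (hits + 1) / (n_perm + 1)             # add-one smoothing

# Toy data: model A tracks the outcome, model B is pure noise.
y = rng.integers(0, 2, 300)
a_scores = y + rng.normal(0, 0.7, 300)
b_scores = rng.normal(0, 1, 300)
p = paired_permutation_pvalue(y, a_scores, b_scores)
```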

Ablation: temporal features (+BMI change over time) add +0.002 AUC. Interaction terms (+BMI*Activity) add +0.001. The bulk of the ML advantage comes from the base nonlinear model.
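The ablation loop itself is simple: drop a feature group, re-run the same cross-validation, and compare AUCs. A sketch with synthetic feature groups (group names and data are illustrative, not the HRS variables):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 1500
# Illustrative feature groups (hypothetical stand-ins, not the HRS columns).
base = rng.normal(size=(n, 6))                 # e.g. age, sex, BMI, smoking, BP, diabetes
temporal = rng.normal(size=(n, 2))             # e.g. BMI change across waves
interact = base[:, [2]] * temporal[:, [0]]     # e.g. a BMI * activity term
y = (base[:, 0] + 0.3 * temporal[:, 0] + rng.normal(size=n) > 0).astype(int)

ablations = {
    "full": np.hstack([base, temporal, interact]),
    "-temporal": np.hstack([base, interact]),
    "-interactions": np.hstack([base, temporal]),
}
model = GradientBoostingClassifier(n_estimators=100, max_depth=3,
                                   random_state=0)
aucs = {name: cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
        for name, X in ablations.items()}
print(aucs)
```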

Strengthened baseline: upgrading from a 6-variable Framingham score (AUC 0.580) to an 11-variable published-RR model (AUC 0.592) added only +0.012 AUC, suggesting the published risk factors have saturated their predictive value in this population.

No comparison to Pooled Cohort Equations: The ACC/AHA Pooled Cohort Equations (2013) are the current clinical standard for CVD risk. HRS lacks some required inputs (total cholesterol, HDL), preventing direct comparison. Our domain baseline is a reasonable proxy but not identical.

Moderate AUCs overall: AUC 0.681 is significant but moderate. This reflects the difficulty of predicting heart disease from survey data alone — clinical risk models using blood lipids and ECG data typically achieve AUC 0.75–0.85.

Long follow-up window: "Developing heart disease" is measured over up to 30 years. Competing risks (death from other causes, dropout) affect the outcome. A time-to-event (Cox) framework might be more appropriate than binary classification.