Longevity Study → Investigation 13

Can We Predict Who Will Die?

The hardest endpoint in health prediction. Not self-reported health, not survey responses — actual mortality. Thousands of Americans aged 50+, tracked for a decade. Published risk factors vs machine learning on the outcome that matters most.

The Question

Every other investigation in this study uses self-reported outcomes — self-rated health, survey-reported diagnoses, functional limitations. Those are useful but subjective. Mortality is binary, objective, and unambiguous. If a model can predict 10-year mortality from baseline health data, it passes the hardest validation test in health analytics.

[Summary stats: People Aged 50+ · Died Within 10 Years · ML AUC · Domain AUC]

ADM Prediction (Made Before Running Models)

Predicted winner: ML, but domain should be strong. Mortality is dominated by age (Gompertz law — risk doubles every 8 years after 50), so a simple age + disease count score will have decent discrimination. But individual interactions — how smoking compounds with diabetes, how depression accelerates decline in the presence of heart disease — should give ML an edge. The question is how much.
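Under the Gompertz assumption stated above (risk doubles every 8 years after 50), the relative risk at any age follows directly; a minimal sketch of that arithmetic (illustrative only, not the study's code):

```python
import math

DOUBLING_YEARS = 8  # Gompertz doubling time cited in the text

def gompertz_rr(age: float, ref_age: float = 50.0) -> float:
    """Relative mortality risk at `age` versus `ref_age`,
    assuming risk doubles every DOUBLING_YEARS after 50."""
    return 2 ** ((age - ref_age) / DOUBLING_YEARS)

# Two doublings between 50 and 66: risk is 4x the age-50 baseline
print(gompertz_rr(66))  # 4.0
```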

Results

ROC Curves

Feature Importance (Top 8)

Multi-Model Comparison

Calibration: Predicted vs Observed

Subgroup Analysis: Does ML Help Everyone Equally?

Mortality risk differs dramatically by age and sex. Does the ML advantage hold for younger vs older adults, men vs women?

The ADM Insight

Mortality is dominated by age — the Gompertz law says risk doubles every 8 years after 50, and that alone gets you to AUC 0.84. Published risk factors encode this well. ML adds +0.011 AUC — real but modest. Logistic regression (0.850) nearly matches GradientBoosting (0.854). This is the investigation where domain knowledge is strongest, precisely because the underlying biology is well-characterized. The real value of ML here isn’t discrimination — it’s calibration (ECE 0.011 vs 0.063 for domain) and temporal trajectory features.

Clinical Context: What Does +0.011 AUC Mean?

AUC improvements are easy to report and hard to interpret. The Cook et al. CRP study and the JUPITER trial are useful anchors; both led to guideline changes.

| Study | Biomarker Added | AUC Gain | Clinical Outcome |
|---|---|---|---|
| Cook et al., NEJM 2006 | CRP added to Framingham | +0.004 | FDA-approved hs-CRP test for CV risk |
| Ridker et al. (JUPITER), 2008 | Rosuvastatin trial population | +0.013 | Basis for expanded statin eligibility guidelines |
| This study (ML model) | Full feature set vs. domain model | +0.011 | Between the CRP landmark and JUPITER, a clinically meaningful range |

At a 20% 10-year risk threshold, a +0.011 AUC improvement reclassifies roughly 2–4% of patients across the decision line — people who get a preventive intervention recommendation with one model and don’t with the other.
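To see how a reclassification rate like this arises, here is a toy sketch counting patients who cross the 20% decision line under two models; the risk distributions are invented for illustration and are not the study's data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical predicted 10-year risks from two models for the same patients
domain_risk = rng.beta(2, 8, size=1000)
ml_risk = np.clip(domain_risk + rng.normal(0.0, 0.03, size=1000), 0.0, 1.0)

THRESHOLD = 0.20  # 20% 10-year risk cutoff for a preventive recommendation
reclassified = np.mean((domain_risk >= THRESHOLD) != (ml_risk >= THRESHOLD))
print(f"{reclassified:.1%} of patients cross the 20% decision line")
```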

Calibration matters more than AUC when you’re talking to a patient.

The domain model’s ECE (Expected Calibration Error) is 0.063. The ML model’s ECE is 0.011. ECE 0.063 means the model is telling a patient they have a 20% chance of an event when the real probability is closer to 26% — or vice versa. AUC measures ranking. ECE measures whether the number you give the patient means what you say it means. That’s the metric that matters for a clinical conversation.
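ECE can be computed by binning predictions and taking the weighted average gap between observed event rate and mean predicted probability per bin. A minimal sketch; the bin count and equal-width binning are common defaults, not necessarily the study's exact recipe:

```python
import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Weighted mean |observed event rate - mean predicted prob|
    over equal-width probability bins."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.digitize(y_prob, edges[1:-1])  # bin index 0..n_bins-1
    ece = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            gap = abs(y_true[mask].mean() - y_prob[mask].mean())
            ece += mask.mean() * gap  # weight by bin occupancy
    return ece

# Perfectly calibrated toy case: predictions match outcomes exactly
print(expected_calibration_error([0, 0, 1, 1], [0.0, 0.0, 1.0, 1.0]))  # 0.0
```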

Cohort: HRS RAND respondents aged 50–90 at their earliest wave with complete baseline data. Binary outcome: died within 10 years of baseline (computed from RADYEAR — year of death — and interview year).
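Deriving the binary outcome from RADYEAR and the interview year might look like the following; `iw_year` and the toy values are illustrative stand-ins, only RADYEAR is a real HRS RAND variable:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the HRS RAND file; `iw_year` is an
# illustrative name for the baseline interview-year column.
df = pd.DataFrame({
    "iw_year": [1994, 1994, 1996],
    "RADYEAR": [1999, np.nan, 2010],  # NaN = no recorded death
})

# Event = died within 10 years of the baseline interview
df["died_10yr"] = (
    df["RADYEAR"].notna() & (df["RADYEAR"] - df["iw_year"] <= 10)
).astype(int)
print(df["died_10yr"].tolist())  # [1, 0, 0]
```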

Domain baseline: Charlson-style mortality risk score using published relative risks. Age (Gompertz: RR doubles per decade), heart disease (RR 2.0), stroke (RR 2.5), cancer (RR 3.0), diabetes (RR 1.8), lung disease (RR 2.0), current smoking (RR 2.0), underweight BMI (RR 1.8), obesity (RR 1.5), poor self-rated health (RR 2.0), walking difficulty (RR 1.5).
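The score described above can be assembled as a log relative risk: a Gompertz age term (doubling per decade, per the score definition) plus log(RR) for each present condition. A sketch under those assumptions, with hypothetical feature flags:

```python
import math

# Published relative risks listed in the text
RR = {
    "heart_disease": 2.0, "stroke": 2.5, "cancer": 3.0, "diabetes": 1.8,
    "lung_disease": 2.0, "current_smoker": 2.0, "underweight": 1.8,
    "obese": 1.5, "poor_srh": 2.0, "walking_difficulty": 1.5,
}

def domain_risk_score(age: float, conditions: set[str]) -> float:
    """Log relative risk vs. a healthy 50-year-old: Gompertz age term
    plus log(RR) for each present risk factor."""
    score = (age - 50) / 10 * math.log(2)  # RR doubles per decade of age
    score += sum(math.log(RR[c]) for c in conditions if c in RR)
    return score

# Example: 70-year-old current smoker with diabetes
print(domain_risk_score(70, {"current_smoker", "diabetes"}))
```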

ML model: GradientBoostingClassifier (200 trees, max depth 4, learning rate 0.1). Trained on 15+ baseline features: demographics, chronic conditions, health behaviors, functional status, depression.
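The stated hyperparameters translate directly to scikit-learn; the feature matrix below is a synthetic stand-in for the 15+ HRS baseline features, and `random_state` is added for reproducibility:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in for the HRS baseline feature matrix
X, y = make_classification(n_samples=300, n_features=15, random_state=0)

model = GradientBoostingClassifier(
    n_estimators=200,   # 200 trees
    max_depth=4,
    learning_rate=0.1,
    random_state=0,
)
model.fit(X, y)
risk = model.predict_proba(X)[:, 1]  # predicted 10-year mortality risk
```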

Evaluation: 5-fold stratified cross-validation. Bootstrap 95% CIs from 1,000 resamples. Calibration analysis (Brier score, ECE). Multi-model comparison (LogisticRegression, RandomForest, GradientBoosting).
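The evaluation loop (5-fold stratified CV plus a 1,000-resample bootstrap CI for AUC) can be sketched as follows; the data are synthetic and the pairing of `cross_val_predict` with a pooled bootstrap is one reasonable reading of the setup, not the study's verbatim pipeline:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic stand-in for the HRS cohort
X, y = make_classification(n_samples=500, n_features=15, random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
probs = cross_val_predict(
    GradientBoostingClassifier(n_estimators=200, max_depth=4, learning_rate=0.1),
    X, y, cv=cv, method="predict_proba",
)[:, 1]  # out-of-fold predicted probabilities

# Bootstrap 95% CI for AUC from 1,000 resamples
rng = np.random.default_rng(0)
aucs = []
for _ in range(1000):
    idx = rng.integers(0, len(y), len(y))
    if len(np.unique(y[idx])) == 2:  # AUC needs both classes present
        aucs.append(roc_auc_score(y[idx], probs[idx]))
lo, hi = np.percentile(aucs, [2.5, 97.5])
print(f"AUC {roc_auc_score(y, probs):.3f} (95% CI {lo:.3f}-{hi:.3f})")
```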

No biomarkers: HRS does not include blood biomarkers (lipids, glucose, HbA1c). Models with blood tests achieve higher AUC in the literature.

10-year window is broad: A person who dies in year 1 and a person who dies in year 9 are both coded as events. Shorter prediction windows (1-year, 5-year) would test different aspects of the model.

Competing risks: Over 10 years, causes of death shift (cancer vs cardiovascular vs neurodegenerative). A single binary model conflates these.

Self-reported conditions: Chronic disease diagnoses in HRS are self-reported ("Has a doctor ever told you..."), which may miss undiagnosed conditions.

No external validation: All results from within-HRS cross-validation. Replication on ELSA (English Longitudinal Study of Ageing) would strengthen claims.