Longevity Study → Investigation 11

Who Will Experience Functional Decline?

13,998 people tracked over 30 years. Domain knowledge identifies most at-risk groups well. ML adds 3.8 AUC points — the question is whether the margin justifies the complexity.

People tracked: 13,998 | ML AUC: 0.777 | Domain AUC: 0.739

ADM Prediction (Made Before Running Models)

Predicted winner: Domain sufficient. Functional decline in older adults is well-characterized by published geriatric risk factors: age is the dominant predictor, followed by arthritis, obesity, depression, and inactivity. These explain most variance because decline is driven by cumulative physiological load rather than subtle interactions.

Expected: Domain captures the essential signal; ML may improve marginally (+2–4 AUC points). Actual: Domain AUC 0.739, ML AUC 0.777, a marginal ML gain of 3.8 points that doesn’t justify the complexity. Prediction confirmed.

Model Performance Comparison

Domain knowledge — age, disease count, and activity level — already predicts functional decline well (AUC 0.739). The ML model adds 3.8 AUC points (0.777), and the hybrid reaches 0.782. The improvement is statistically significant but modest — and the cost-benefit question is real. Where do those extra points come from?
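
To make the "AUC points" comparison concrete, here is a minimal sketch of AUC computed as the fraction of (declined, did not decline) pairs in which the person who declined received the higher risk score, with ties counted as half. The function names and the toy data are illustrative assumptions, not the study's code or data; only the two reported AUCs (0.739 and 0.777) come from the text above.

```python
def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Runs in O(n_pos * n_neg), which is
    fine for a sketch but slow for large cohorts.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def auc_points_gain(auc_a, auc_b):
    """Difference between two AUCs, in points on a 0-100 scale."""
    return round((auc_b - auc_a) * 100, 1)


# Toy data (hypothetical): 1 = declined, 0 = did not decline.
toy_labels = [1, 1, 0, 0]
toy_scores = [0.8, 0.4, 0.5, 0.1]
print(auc(toy_labels, toy_scores))      # 3 of 4 pairs ranked correctly: 0.75

# Reported domain vs ML AUC from the comparison above.
print(auc_points_gain(0.739, 0.777))    # 3.8
```

Read this way, a gain of 3.8 points means the ML model orders roughly 4 more pairs out of every 100 correctly than the domain score does.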

[Figures: AUC Comparison (higher is better); Risk Group Stratification; Atypical Case Detection]

What Drives the ML Prediction?

Walking difficulty is the #1 feature — a functional marker that traditional domain models often overlook. Self-rated health, depression, and hospital stays capture the subjective and episodic signals that age + disease count miss.

[Figure: Top 8 Feature Importances (ML Model)]

Stronger Baselines: Does It Have to Be GradientBoosting?

Before deploying a complex ML pipeline, we should ask: does logistic regression — the simplest supervised model — capture the same signal?

Logistic regression slightly outperforms GradientBoosting on this problem (AUC 0.777 vs 0.772). When the signal is largely linear (age + disease count → decline), simpler models can match or beat complex ones.

Calibration: Are the Probabilities Trustworthy?

With a 57% base rate, accurate probability estimates are critical for triage decisions — distinguishing “likely to decline” from “certain to decline” determines intervention urgency.

The domain score has an expected calibration error (ECE) of 0.208 and substantially underestimates risk at high predicted probabilities. The ML model (ECE 0.028) and hybrid (ECE 0.024) are both well calibrated, closely tracking the diagonal of the calibration plot.
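
As a rough illustration of what the ECE numbers measure, the sketch below groups predictions into equal-width probability bins and averages the gap between the mean predicted probability and the observed decline rate in each bin, weighted by bin size. The 10-bin equal-width scheme, the function name, and the toy data are assumptions; the study does not state how its ECE was computed.

```python
def expected_calibration_error(labels, probs, n_bins=10):
    """ECE with equal-width bins.

    For each bin: |observed event rate - mean predicted probability|,
    weighted by the share of cases that fall in the bin.
    """
    if len(labels) != len(probs) or not labels:
        raise ValueError("labels and probs must be non-empty and equal length")
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(labels, probs):
        # A probability of exactly 1.0 is placed in the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((y, p))
    total = len(labels)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        observed = sum(y for y, _ in members) / len(members)
        predicted = sum(p for _, p in members) / len(members)
        ece += (len(members) / total) * abs(observed - predicted)
    return ece


# Toy data (hypothetical): 3 of 4 people declined.
toy_labels = [1, 1, 1, 0]
well_calibrated = [0.75, 0.75, 0.75, 0.75]   # predicted rate matches observed
underestimating = [0.55, 0.55, 0.55, 0.55]   # predicted rate is too low

print(round(expected_calibration_error(toy_labels, well_calibrated), 3))   # 0.0
print(round(expected_calibration_error(toy_labels, underestimating), 3))   # 0.2
```

The second toy case mirrors the pattern described for the domain score: predicted probabilities sit below the observed decline rate, so the gap shows up directly in the ECE.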

Functional decline is defined as losing the ability to perform at least one Activity of Daily Living (ADL) — bathing, dressing, eating, transferring, or toileting — within 4 years.
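
The outcome definition above can be sketched as a small labeling function. This is an illustration under stated assumptions, not the study's code: the dictionary layout, the function name, and the `min_adls_lost` parameter are hypothetical, and the follow-up record is assumed to fall within the 4-year window.

```python
# The five ADLs named in the definition above.
ADLS = ("bathing", "dressing", "eating", "transferring", "toileting")


def functional_decline(baseline, follow_up, min_adls_lost=1):
    """Return True if at least `min_adls_lost` ADLs were lost.

    `baseline` and `follow_up` map each ADL name to True (can perform)
    or False (cannot perform). An ADL counts as lost only when it was
    possible at baseline and is no longer possible at follow-up.
    """
    lost = sum(1 for adl in ADLS if baseline[adl] and not follow_up[adl])
    return lost >= min_adls_lost


# Hypothetical respondent: independent at baseline, cannot bathe at follow-up.
baseline = {adl: True for adl in ADLS}
follow_up = dict(baseline, bathing=False)

print(functional_decline(baseline, follow_up))                    # True
print(functional_decline(baseline, follow_up, min_adls_lost=2))   # False
```

Exposing the threshold as a parameter makes it easy to rerun the labeling with a stricter definition, such as two or more lost ADLs.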

The domain model uses age, number of chronic diseases, and vigorous activity frequency — the three strongest epidemiological predictors.

The ML model uses gradient-boosted trees trained on all available features including walking difficulty, self-rated health, depression scores, BMI, and hospitalization history.

The hybrid model uses the domain risk score as a feature alongside the full feature set, allowing it to build on rather than replace domain knowledge.
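
One way to picture the hybrid setup is as a feature-building step that appends the domain risk score to the full feature vector before training. The sketch below is an assumption-laden illustration: `toy_domain_score`, its weights and cut-points, the field names, and the example record are all made up for demonstration and are not the study's domain model or data.

```python
def toy_domain_score(age, n_chronic_diseases, vigorous_activity_per_week):
    """Hypothetical 3-variable risk score in [0, 1].

    The weights (0.5, 0.3, 0.2) and scaling ranges are invented for
    illustration only.
    """
    age_part = min(max((age - 50) / 40, 0.0), 1.0)        # age 50 -> 0, 90+ -> 1
    disease_part = min(n_chronic_diseases / 5, 1.0)        # capped at 5 conditions
    inactivity_part = 1.0 - min(vigorous_activity_per_week / 3, 1.0)
    return 0.5 * age_part + 0.3 * disease_part + 0.2 * inactivity_part


def hybrid_features(record, feature_names):
    """Full feature vector with the domain score appended as one extra column."""
    score = toy_domain_score(
        record["age"],
        record["n_chronic_diseases"],
        record["vigorous_activity_per_week"],
    )
    return [record[name] for name in feature_names] + [score]


# Hypothetical respondent and feature list.
record = {
    "age": 70,
    "n_chronic_diseases": 2,
    "vigorous_activity_per_week": 0,
    "walking_difficulty": 1,
    "bmi": 31.0,
}
feature_names = ["age", "n_chronic_diseases", "vigorous_activity_per_week",
                 "walking_difficulty", "bmi"]

row = hybrid_features(record, feature_names)
print(len(row))            # 6: five raw features plus the domain score
print(round(row[-1], 2))   # 0.57
```

Because the raw features stay in the vector, a downstream model can lean on the domain score where it is informative and on the other features where it is not.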

ADM Insight

This is the closest call in the study, and the most instructive. Published frailty criteria (age, disease count, activity level) already predict functional decline well (AUC 0.739). The ML model adds 3.8 AUC points (0.777), and the hybrid adds only a fraction of a point more. Statistically significant? Yes. Clinically transformative? Questionable. The 850 reclassified cases matter: these are younger individuals with subtle early-warning combinations that age-based screening misses. But deploying a 15-feature ML pipeline to gain 3.8 AUC points over a 3-variable screening tool requires honest cost-benefit analysis. Sometimes the right fidelity is the simplest model that captures the essential signal. A hospital system would need to weigh the infrastructure cost of real-time ML scoring against the clinical value of catching those 850 additional cases early.

High decline rate: 57% of the 50+ cohort experienced functional decline over 30 years. This high base rate makes prediction easier and AUC differences smaller than in rare-event prediction.

ADL definition sensitivity: Functional decline is defined as losing ≥1 ADL (bathing, dressing, eating, transferring, toileting). Different thresholds (≥2 ADLs, IADL-based) might show different domain-vs-ML gaps.

4-year prediction window: Results are for 4-year decline prediction. Shorter windows (1–2 years) would likely favor domain models more; longer windows might favor ML more as individual trajectories diverge.

No cost-effectiveness analysis: The marginal AUC gain of 3.8 points is presented without cost context. A full analysis would compare ML infrastructure costs against the clinical value of earlier intervention for the 850 reclassified cases.

Self-reported outcomes: ADL limitations in the Health and Retirement Study (HRS) are self-reported and therefore subject to reporting bias, cultural differences, and mood effects.