
Can You Predict How Someone Will Age?

Everyone assumes AI will change health prediction. Most clinicians say published guidelines already work. We took 64,037 real people from two countries, tracked them for up to 30 years, and asked thirteen questions, from diabetes risk to cognitive decline to the hardest endpoint: actual mortality. The split: two questions best answered by published medicine, seven by machine learning, two by hybrids of both, and two surprises. Wealth and sleep add nothing once health conditions are known.

64,037
Real People
3
Datasets
30 yr
Longitudinal Tracking
2–7–2–2
Domain–ML–Hybrid–Null

The Data

Three public health studies spanning two countries across three decades. NHANES (USA) and CHNS (China) provide detailed biomarker snapshots. The HRS follows 45,234 Americans every two years for up to 30 years — so every prediction is compared to what actually happened.

Datasets
3 public health studies
People
64,037 individuals
Countries
USA + China
Timespan
30 years (1992–2022)
Biomarkers
26 blood measures
Questions
13 investigations
How to Read This Study

For each question, we predict which method should win, and why, before running any models. Every prediction is then checked against what actually happened to real people tracked for up to 30 years, including one investigation validated against actual mortality. Our pre-analysis predictions held in all thirteen cases. The result: a 2–7–2–2 split across domain knowledge, machine learning, hybrid methods, and “more data doesn’t help.”

Three publicly available datasets form the foundation for all thirteen investigations. NHANES gives biomarker depth. CHNS lets us test whether American patterns hold in China. HRS tracks people long enough to see if predictions came true.

NHANES 2017–2018 (CDC)
9,254 Americans. Full biomarker panels + lifestyle surveys. Cross-sectional — one comprehensive snapshot per person. The gold standard for population-level health assessment in the United States.
CHNS 2009 (UNC Chapel Hill)
9,549 Chinese adults. 26 fasting blood biomarkers — the same measures, different population. Enables cross-population transfer validation: do American health norms apply in China?
HRS RAND 1992–2022 (University of Michigan)
45,234 Americans tracked every 2 years for up to 30 years. Self-rated health, BMI, disease onset, depression, physical function, lifestyle — the longitudinal dataset that lets us check predictions against what actually happened.
The Study

Thirteen Questions, Four Methods

We took 64,037 real people and asked thirteen questions about aging, health prediction, and intervention — including the hardest endpoint: actual mortality. Each demanded a different analytical method and a different fidelity level. The right method depends on what you’re trying to predict.

 #  | Question                 | Decision Type         | Best Method           | Data

Domain Knowledge Sufficient
 4  | US norms in China?       | Transfer Validation   | Published adjustments | NHANES+CHNS
 11 | Physical independence?   | Decline Prediction    | Frailty screening     | HRS 30yr

ML Adds Genuine Value
 2  | Diabetes risk?           | Risk Screening        | Gradient Boosting     | HRS 30yr
 5  | Individual inflammation? | Biomarker Prediction  | Gradient Boosting     | NHANES
 6  | Health in 10 years?      | Trajectory Forecast   | ML on early waves     | HRS 30yr
 7  | Heart disease risk?      | Disease Prediction    | ML vs published RR    | HRS 30yr
 8  | Lifestyle interactions?  | Interaction Discovery | Neural net + SHAP     | NHANES
 13 | 10-year mortality?       | Survival Prediction   | ML vs Charlson-style  | HRS 30yr
 14 | Cognitive decline?       | Regression Forecast   | Gradient Boosting     | HRS 30yr

More Data Doesn’t Always Help
 15 | Wealth beyond health?    | Feature Evaluation    | Health+SDOH ML        | HRS 30yr
 16 | Sleep predicts decline?  | Feature Evaluation    | Health+Sleep ML       | HRS 30yr

Hybrid: Encode + Learn
 9  | Weight trajectory?       | Long-Horizon Forecast | Physics + residual    | HRS 30yr
 10 | Biological age?          | Health Assessment     | KDM + ML              | NHANES

Methodology note: All ML models are GradientBoosting ensembles (scikit-learn) evaluated via 5-fold stratified cross-validation with logistic regression as an additional baseline. Confidence intervals from 1,000 bootstrap resamples. Domain baselines use published clinical scoring methods with cited relative risks. Calibration curves (predicted vs observed) reported for all classification investigations. Sample sizes vary by investigation (1,907–32,230) depending on dataset and inclusion criteria. See individual investigation pages for full methodology.
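The evaluation protocol above can be sketched in a few lines of scikit-learn. This is a minimal illustration on synthetic data, not the study’s code: the dataset, model settings, and class balance are invented for the demo, and the bootstrap is applied to out-of-fold scores.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic stand-in for an HRS-style binary outcome (~20% event rate).
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.8, 0.2], random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

models = {
    "logistic": LogisticRegression(max_iter=1000),
    "gbm": GradientBoostingClassifier(random_state=0),
}
for name, model in models.items():
    # Out-of-fold probabilities: each subject is scored by a model
    # that never saw them during training.
    proba = cross_val_predict(model, X, y, cv=cv, method="predict_proba")[:, 1]
    auc = roc_auc_score(y, proba)

    # 95% CI from 1,000 bootstrap resamples of the out-of-fold scores.
    rng = np.random.default_rng(0)
    scores = []
    for _ in range(1000):
        idx = rng.integers(0, len(y), len(y))
        if len(np.unique(y[idx])) < 2:
            continue  # AUC needs both classes present in the resample
        scores.append(roc_auc_score(y[idx], proba[idx]))
    lo, hi = np.percentile(scores, [2.5, 97.5])
    print(f"{name}: AUC {auc:.3f} (95% CI {lo:.3f}-{hi:.3f})")
```

Bootstrapping the out-of-fold scores, rather than refitting per resample, keeps the interval cheap to compute while still reflecting sampling variability in the test subjects.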

The Investigations

Each Question Gets the Method It Needs

Each tier groups questions by which method won.

ML Adds Genuine Value
Investigation 2

Diabetes Risk

“Will this person develop diabetes in 10 years?”
26,224 HRS participants tracked 30 years; 4,657 developed diabetes. Published risk scoring (AUC 0.593) → logistic regression (0.693) → GradientBoosting (0.700). Most of the gain comes from flexible feature weighting, not nonlinear trees. All ML models well-calibrated (ECE 0.017).
Domain vs LR vs GradientBoosting ML Wins p<0.001
Investigation 5

Individual Inflammation

“What are this person’s inflammatory markers given their lifestyle?”
CRP varies 100-fold across individuals. GradientBoosting doubles explained variance (R²=0.259 vs 0.114). Population averages are useless when individual variation is this large.
Age-Sex Norms vs Gradient Boosting ML Wins 2x R²
Investigation 6

Health Trajectory

“How will this person’s health change over 10 years?”
7,864 people tracked 8+ waves. ML reduces RMSE by 15–19% relative to population curves, with gains growing at longer horizons. The trajectory IS the signal.
Population Curve vs ML on Early Waves ML Wins +18%
Investigation 7

Heart Disease Prediction

“Will this person develop heart disease?”
6,282 people developed heart disease. Published RR (AUC 0.592) → logistic regression (0.680) → GradientBoosting (0.681). Same pattern as diabetes: LR captures nearly all of the gain over the domain baseline. Non-traditional predictors — depression, functional status — add genuine value. ML well-calibrated (ECE 0.019).
Domain vs LR vs GradientBoosting ML Wins p=0.0005
Investigation 8

Lifestyle Interactions

“How do lifestyle factors interact to affect health?”
Three-tier comparison: additive R²=0.208, +published interactions R²=0.218 (+5%), GBM R²=0.245 (+12%). The real finding: obesity×diabetes adds +2.1 mg/L CRP beyond additive (57% synergistic), sedentary×diabetes +1.8 mg/L. Magnitudes not in standard clinical models.
Three-Tier: Additive → +Interactions → GBM GBM Discovers Interactions
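The additive-vs-GBM comparison can be illustrated on toy data with a built-in obesity×diabetes synergy. All effect sizes and prevalences below are invented for the demo; only the logic mirrors the investigation — a linear model without an explicit interaction term cannot represent the synergy, while trees recover it.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 5000
obese = rng.binomial(1, 0.35, n)
diabetic = rng.binomial(1, 0.15, n)
sedentary = rng.binomial(1, 0.40, n)
# Additive effects plus an obesity×diabetes synergy that a purely
# additive model cannot represent.
crp = (1.0 + 1.5 * obese + 1.2 * diabetic + 0.8 * sedentary
       + 2.1 * obese * diabetic + rng.normal(0, 1.0, n))
X = np.column_stack([obese, diabetic, sedentary])

# Default scoring for regressors is R^2.
r2_additive = cross_val_score(LinearRegression(), X, crp, cv=5).mean()
r2_gbm = cross_val_score(GradientBoostingRegressor(random_state=0),
                         X, crp, cv=5).mean()
print(f"additive R2 {r2_additive:.3f} vs GBM R2 {r2_gbm:.3f}")
```

The GBM’s cross-validated R² exceeds the additive model’s by roughly the variance the synergy term contributes beyond its main effects — the same mechanism behind the 0.208 → 0.245 jump reported above.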
Investigation 13

10-Year Mortality

“Can we predict who will die within a decade?”
32,230 people aged 50+, 8,321 deaths. The hardest endpoint — actual mortality, not self-report. Domain (AUC 0.843) is strong because age dominates. ML adds +0.011 AUC — real but modest, and in the range of published clinical landmarks (Cook NEJM 2006: +0.004; JUPITER: +0.013). More importantly, the ML model’s calibration error is 6× lower (ECE 0.011 vs 0.063) — the domain model systematically misestimates absolute risk. For shared decision-making, calibration is the metric that matters.
Domain vs LR vs GradientBoosting ML Wins Modest (+0.011)
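The calibration metric here is expected calibration error (ECE). A minimal sketch of the computation on simulated predictions, using 10 equal-width bins — an assumption; the study’s exact binning may differ:

```python
import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """ECE: bin-size-weighted gap between predicted risk and observed rate."""
    bin_ids = np.minimum((y_prob * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            ece += mask.mean() * abs(y_prob[mask].mean() - y_true[mask].mean())
    return ece

rng = np.random.default_rng(0)
p = rng.uniform(0, 1, 10_000)
y_calibrated = rng.binomial(1, p)    # outcomes match predicted risk
y_miscal = rng.binomial(1, 0.5 * p)  # model overstates risk 2x
print(expected_calibration_error(y_calibrated, p))  # close to zero
print(expected_calibration_error(y_miscal, p))      # large systematic gap
```

A model can rank patients well (high AUC) while misstating absolute risk, which is exactly the failure mode ECE catches and AUC does not.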
Investigation 14

Cognitive Decline

“Can we predict who will experience cognitive decline?”
5,212 people tracked across 11 waves of cognitive testing. ML beats age-education curves by 18% at 10 years (RMSE 4.38 vs 5.33). Individual cognitive slope, depression, and cardiovascular risk explain decline far better than population averages.
Age-Education Curve vs GradientBoosting ML Wins +18% RMSE

The Honest Finding

Wealth and sleep are the surprises. The mortality gradient across wealth quintiles is 2×—but it’s almost entirely mediated through health conditions. Once you know someone’s diagnoses, knowing their income changes almost nothing. Same for sleep.

The Fidelity Lesson

No Single Method Works

Each axis represents one of the thirteen questions. The radius shows where the optimal fidelity landed. Cross-population transfer needs almost no model complexity — published norms are fine. Biological age demands the highest fidelity — neither domain knowledge nor machine learning alone comes close. Wealth and sleep (rose) sit at the bottom — adding them to health models yields nothing. The shape is irregular. That’s the point.

The radar shape is irregular because different questions need different methods.

Thirteen axes, four tiers. Teal = domain knowledge. Blue = ML needed. Rose = more data doesn’t help. Gold = hybrid.

Limitations

What We Measured and What We Didn’t

Self-reported data in HRS. Nine of thirteen investigations use the Health and Retirement Study, where health status (1–5 scale), disease onset, and BMI are self-reported. Self-reported BMI is typically underestimated by 1–2 kg/m², and disease onset dates reflect when a diagnosis was received, not necessarily when the condition began. NHANES provides measured biomarkers but is cross-sectional, so it cannot track outcomes over time. This asymmetry means our longitudinal predictions carry more measurement noise than the cross-sectional investigations.

No external validation cohort. All models are evaluated via cross-validation within their training datasets. A true external validation — testing HRS-trained models on ELSA (English Longitudinal Study of Ageing) or SHARE (European equivalent) — would strengthen generalizability claims. This is deferred pending data access agreements.

Cross-sectional vs. longitudinal design. Investigations using NHANES (Q4, Q5, Q8, Q10) capture associations at a single point in time. We cannot confirm that the biomarker patterns identified actually predict future outcomes without longitudinal follow-up. The HRS investigations (Q2, Q6, Q7, Q9, Q11, Q13, Q14, Q15, Q16) do have longitudinal validation.

Survivorship bias in HRS. Participants must survive to each follow-up wave to contribute data. People who die between waves are lost, and those who drop out tend to be sicker. This biases longitudinal analyses toward healthier survivors, potentially underestimating the true predictive power of health decline indicators.

Model complexity ceiling. All ML models are GradientBoosting ensembles or logistic regression. Deep learning, survival models (Cox), or time-series architectures might perform differently. We deliberately chose interpretable models to keep the focus on whether ML helps, not on squeezing marginal gains from architecture search.

Thirteen Questions. Real Data. The Right Method.

64,037 people, two countries, thirty years. Thirteen questions, each matched to the method it needed.

Data Sources

Reproducible with Public Data

All three datasets are publicly available and freely accessible for research use. Every analysis is reproducible with open-source Python tools (scikit-learn, scipy, pandas).

  • Primary Dataset — USA National Health and Nutrition Examination Survey (NHANES) 2017–2018. Centers for Disease Control and Prevention (CDC). 9,254 Americans with full biomarker panels + lifestyle surveys. Cross-sectional. CDC NHANES →
  • Cross-Population Dataset — China China Health and Nutrition Survey (CHNS) 2009. University of North Carolina at Chapel Hill. 9,549 Chinese adults with 26 fasting blood biomarkers. Enables cross-population transfer validation. UNC CHNS →
  • Longitudinal Dataset — 30 Years Health and Retirement Study (HRS) RAND Longitudinal File, waves 1992–2022. University of Michigan. 45,234 Americans tracked every 2 years for up to 30 years. The dataset that lets us check predictions against what actually happened. HRS →
  • Biological Age Reference Klemera, P. & Doubal, S. (2006). “A new approach to the concept and computation of biological age.” Mechanisms of Ageing and Development, 127(3), 240–248. Defines the KDM biological age estimator used in Investigation 10.
  • Frailty Criteria Fried, L.P. et al. (2001). “Frailty in older adults: Evidence for a phenotype.” J. Gerontology, 56A(3), M146–M156. Defines the five-criterion frailty index used as domain baseline in Investigation 11.