
Can You Predict How Someone Will Age?

Everyone assumes AI will change health prediction. Most clinicians say published guidelines already work. We took 64,037 real people from two countries, tracked them for up to 30 years, and asked thirteen questions, from diabetes risk to cognitive decline to the hardest endpoint: actual mortality. The split: two questions best answered by published medicine, seven by machine learning, two by hybrids of both, and two surprises. Wealth and sleep add nothing once health conditions are known.

64,037
Real People
3
Datasets
30 yr
Longitudinal Tracking
2–7–2–2
Domain–ML–Hybrid–Null

The Data

Three public health studies spanning two countries across three decades. NHANES (USA) and CHNS (China) provide detailed biomarker snapshots. The HRS follows 45,234 Americans every two years for up to 30 years — so every prediction is compared to what actually happened.

Datasets
3 public health studies
People
64,037 individuals
Countries
USA + China
Timespan
30 years (1992–2022)
Biomarkers
26 blood measures
Questions
13 investigations
How to Read This Study

For each question, we predict which method should win, and why, before running any models. Every prediction is then checked against what actually happened to real people tracked for up to 30 years, including one investigation validated against actual mortality. Our pre-analysis predictions held in all thirteen cases. The result: a 2–7–2–2 split across domain knowledge, machine learning, hybrid methods, and “more data doesn’t help.”

Three publicly available datasets form the foundation for all thirteen investigations. NHANES gives biomarker depth. CHNS lets us test whether American patterns hold in China. HRS tracks people long enough to see if predictions came true.

NHANES 2017–2018 (CDC)
9,254 Americans. Full biomarker panels + lifestyle surveys. Cross-sectional — one comprehensive snapshot per person. The gold standard for population-level health assessment in the United States.
CHNS 2009 (UNC Chapel Hill)
9,549 Chinese adults. 26 fasting blood biomarkers — the same measures, different population. Enables cross-population transfer validation: do American health norms apply in China?
HRS RAND 1992–2022 (University of Michigan)
45,234 Americans tracked every 2 years for up to 30 years. Self-rated health, BMI, disease onset, depression, physical function, lifestyle — the longitudinal dataset that lets us check predictions against what actually happened.
The Study

Thirteen Questions, Four Methods

We took 64,037 real people and asked thirteen questions about aging, health prediction, and intervention — including the hardest endpoint: actual mortality. Each demanded a different analytical method and a different fidelity level. The right method depends on what you’re trying to predict.

 #  | Question                 | Decision Type         | Best Method           | Data

Domain Knowledge Sufficient
 4  | US norms in China?       | Transfer Validation   | Published adjustments | NHANES+CHNS
 11 | Physical independence?   | Decline Prediction    | Frailty screening     | HRS 30yr

ML Adds Genuine Value
 2  | Diabetes risk?           | Risk Screening        | Gradient Boosting     | HRS 30yr
 5  | Individual inflammation? | Biomarker Prediction  | Gradient Boosting     | NHANES
 6  | Health in 10 years?      | Trajectory Forecast   | ML on early waves     | HRS 30yr
 7  | Heart disease risk?      | Disease Prediction    | ML vs published RR    | HRS 30yr
 8  | Lifestyle interactions?  | Interaction Discovery | Neural net + SHAP     | NHANES
 13 | 10-year mortality?       | Survival Prediction   | ML vs Charlson-style  | HRS 30yr
 14 | Cognitive decline?       | Regression Forecast   | Gradient Boosting     | HRS 30yr

More Data Doesn’t Always Help
 15 | Wealth beyond health?    | Feature Evaluation    | Health+SDOH ML        | HRS 30yr
 16 | Sleep predicts decline?  | Feature Evaluation    | Health+Sleep ML       | HRS 30yr

Hybrid: Encode + Learn
 9  | Weight trajectory?       | Long-Horizon Forecast | Physics + residual    | HRS 30yr
 10 | Biological age?          | Health Assessment     | KDM + ML              | NHANES

Methodology note: All ML models are GradientBoosting ensembles (scikit-learn) evaluated via 5-fold stratified cross-validation with logistic regression as an additional baseline. Confidence intervals from 1,000 bootstrap resamples. Domain baselines use published clinical scoring methods with cited relative risks. Calibration curves (predicted vs observed) reported for all classification investigations. Sample sizes vary by investigation (1,907–32,230) depending on dataset and inclusion criteria. See individual investigation pages for full methodology.
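The evaluation protocol above can be sketched in a few lines of scikit-learn. This is a minimal illustration on synthetic data, not the study’s code: the dataset, model settings, and class balance are invented for the demo, and the bootstrap is applied to out-of-fold scores.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic stand-in for an HRS-style binary outcome (~20% event rate).
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.8, 0.2], random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

models = {
    "logistic": LogisticRegression(max_iter=1000),
    "gbm": GradientBoostingClassifier(random_state=0),
}
for name, model in models.items():
    # Out-of-fold probabilities: each subject is scored by a model
    # that never saw them during training.
    proba = cross_val_predict(model, X, y, cv=cv, method="predict_proba")[:, 1]
    auc = roc_auc_score(y, proba)

    # 95% CI from 1,000 bootstrap resamples of the out-of-fold scores.
    rng = np.random.default_rng(0)
    scores = []
    for _ in range(1000):
        idx = rng.integers(0, len(y), len(y))
        if len(np.unique(y[idx])) < 2:
            continue  # AUC needs both classes present in the resample
        scores.append(roc_auc_score(y[idx], proba[idx]))
    lo, hi = np.percentile(scores, [2.5, 97.5])
    print(f"{name}: AUC {auc:.3f} (95% CI {lo:.3f}-{hi:.3f})")
```

Bootstrapping the out-of-fold scores, rather than refitting per resample, keeps the interval cheap to compute while still reflecting sampling variability in the test subjects.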

The Investigations

Each Question Gets the Method It Needs

Each tier groups questions by which method won.

ML Adds Genuine Value
Investigation 2

Diabetes Risk

“Will this person develop diabetes in 10 years?”
26,224 HRS participants tracked 30 years; 4,657 developed diabetes. Published risk scoring (AUC 0.593) → logistic regression (0.693) → GradientBoosting (0.700). Most of the gain comes from flexible feature weighting, not nonlinear trees. All ML models well-calibrated (ECE 0.017).
Domain vs LR vs GradientBoosting ML Wins p<0.001
Investigation 5

Individual Inflammation

“What are this person’s inflammatory markers given their lifestyle?”
CRP varies 100-fold across individuals. GradientBoosting doubles explained variance (R²=0.259 vs 0.114). Population averages are useless when individual variation is this large.
Age-Sex Norms vs Gradient Boosting ML Wins 2x R²
Investigation 6

Health Trajectory

“How will this person’s health change over 10 years?”
7,864 people tracked 8+ waves. ML reduces RMSE by 15–19% relative to population curves, with gains growing at longer horizons. The trajectory IS the signal.
Population Curve vs ML on Early Waves ML Wins +18%
Investigation 7

Heart Disease Prediction

“Will this person develop heart disease?”
6,282 people developed heart disease. Published RR (AUC 0.592) → logistic regression (0.680) → GradientBoosting (0.681). Same pattern as diabetes: LR captures nearly all of the gain over the domain baseline. Non-traditional predictors — depression, functional status — add genuine value. ML well-calibrated (ECE 0.019).
Domain vs LR vs GradientBoosting ML Wins p=0.0005
Investigation 8

Lifestyle Interactions

“How do lifestyle factors interact to affect health?”
Three-tier comparison: additive R²=0.208, +published interactions R²=0.218 (+5%), GBM R²=0.245 (+12%). The real finding: obesity×diabetes adds +2.1 mg/L CRP beyond additive (57% synergistic), sedentary×diabetes +1.8 mg/L. Magnitudes not in standard clinical models.
Three-Tier: Additive → +Interactions → GBM GBM Discovers Interactions
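The additive-vs-GBM comparison can be illustrated on toy data with a built-in obesity×diabetes synergy. All effect sizes and prevalences below are invented for the demo; only the logic mirrors the investigation — a linear model without an explicit interaction term cannot represent the synergy, while trees recover it.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 5000
obese = rng.binomial(1, 0.35, n)
diabetic = rng.binomial(1, 0.15, n)
sedentary = rng.binomial(1, 0.40, n)
# Additive effects plus an obesity×diabetes synergy that a purely
# additive model cannot represent.
crp = (1.0 + 1.5 * obese + 1.2 * diabetic + 0.8 * sedentary
       + 2.1 * obese * diabetic + rng.normal(0, 1.0, n))
X = np.column_stack([obese, diabetic, sedentary])

# Default scoring for regressors is R^2.
r2_additive = cross_val_score(LinearRegression(), X, crp, cv=5).mean()
r2_gbm = cross_val_score(GradientBoostingRegressor(random_state=0),
                         X, crp, cv=5).mean()
print(f"additive R2 {r2_additive:.3f} vs GBM R2 {r2_gbm:.3f}")
```

The GBM’s cross-validated R² exceeds the additive model’s by roughly the variance the synergy term contributes beyond its main effects — the same mechanism behind the 0.208 → 0.245 jump reported above.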
Investigation 13

10-Year Mortality

“Can we predict who will die within a decade?”
32,230 people aged 50+, 8,321 deaths. The hardest endpoint — actual mortality, not self-report. Domain (AUC 0.843) is strong because age dominates. ML adds +0.011 AUC — real but modest, and in the range of published clinical landmarks (Cook NEJM 2006: +0.004; JUPITER: +0.013). More importantly, the ML model’s calibration error is 6× lower (ECE 0.011 vs 0.063) — the domain model systematically misestimates absolute risk. For shared decision-making, calibration is the metric that matters.
Domain vs LR vs GradientBoosting ML Wins Modest (+0.011)
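The calibration metric here is expected calibration error (ECE). A minimal sketch of the computation on simulated predictions, using 10 equal-width bins — an assumption; the study’s exact binning may differ:

```python
import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """ECE: bin-size-weighted gap between predicted risk and observed rate."""
    bin_ids = np.minimum((y_prob * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            ece += mask.mean() * abs(y_prob[mask].mean() - y_true[mask].mean())
    return ece

rng = np.random.default_rng(0)
p = rng.uniform(0, 1, 10_000)
y_calibrated = rng.binomial(1, p)    # outcomes match predicted risk
y_miscal = rng.binomial(1, 0.5 * p)  # model overstates risk 2x
print(expected_calibration_error(y_calibrated, p))  # close to zero
print(expected_calibration_error(y_miscal, p))      # large systematic gap
```

A model can rank patients well (high AUC) while misstating absolute risk, which is exactly the failure mode ECE catches and AUC does not.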
Investigation 14

Cognitive Decline

“Can we predict who will experience cognitive decline?”
5,212 people tracked across 11 waves of cognitive testing. ML beats age-education curves by 18% at 10 years (RMSE 4.38 vs 5.33). Individual cognitive slope, depression, and cardiovascular risk explain decline far better than population averages.
Age-Education Curve vs GradientBoosting ML Wins +18% RMSE

The Honest Finding

Wealth and sleep are the surprises. The mortality gradient across wealth quintiles is 2×—but it’s almost entirely mediated through health conditions. Once you know someone’s diagnoses, knowing their income changes almost nothing. Same for sleep.

The Fidelity Lesson

No Single Method Works

Each axis represents one of the thirteen questions. The radius shows where the optimal fidelity landed. Cross-population transfer needs almost no model complexity — published norms are fine. Biological age demands the highest fidelity — neither domain knowledge nor machine learning alone comes close. Wealth and sleep (rose) sit at the bottom — adding them to health models yields nothing. The shape is irregular. That’s the point.

The radar shape is irregular because different questions need different methods.

Thirteen axes, four tiers. Teal = domain knowledge. Blue = ML needed. Rose = more data doesn’t help. Gold = hybrid.

Limitations

What We Measured and What We Didn’t

Self-reported data in HRS. Nine of thirteen investigations use the Health and Retirement Study, where health status (1–5 scale), disease onset, and BMI are self-reported. Self-reported BMI is typically underestimated by 1–2 kg/m², and disease onset dates reflect when a diagnosis was received, not necessarily when the condition began. NHANES provides measured biomarkers but is cross-sectional, so it cannot track outcomes over time. This asymmetry means our longitudinal predictions carry more measurement noise than the cross-sectional investigations.

No external validation cohort. All models are evaluated via cross-validation within their training datasets. A true external validation — testing HRS-trained models on ELSA (English Longitudinal Study of Ageing) or SHARE (European equivalent) — would strengthen generalizability claims. This is deferred pending data access agreements.

Cross-sectional vs. longitudinal design. Investigations using NHANES (Q4, Q5, Q8, Q10) capture associations at a single point in time. We cannot confirm that the biomarker patterns identified actually predict future outcomes without longitudinal follow-up. The HRS investigations (Q2, Q6, Q7, Q9, Q11, Q13, Q14, Q15, Q16) do have longitudinal validation.

Survivorship bias in HRS. Participants must survive to each follow-up wave to contribute data. People who die between waves are lost, and those who drop out tend to be sicker. This biases longitudinal analyses toward healthier survivors, potentially underestimating the true predictive power of health decline indicators.

Model complexity ceiling. All ML models are GradientBoosting ensembles or logistic regression. Deep learning, survival models (Cox), or time-series architectures might perform differently. We deliberately chose interpretable models to keep the focus on whether ML helps, not on squeezing marginal gains from architecture search.

Thirteen Questions. Real Data. The Right Method.

64,037 people, two countries, thirty years. Thirteen questions, each matched to the method it needed.

Data Sources

Reproducible with Public Data

All three datasets are publicly available and freely accessible for research use. Every analysis is reproducible with open-source Python tools (scikit-learn, scipy, pandas).

  • Primary Dataset — USA National Health and Nutrition Examination Survey (NHANES) 2017–2018. Centers for Disease Control and Prevention (CDC). 9,254 Americans with full biomarker panels + lifestyle surveys. Cross-sectional. CDC NHANES →
  • Cross-Population Dataset — China China Health and Nutrition Survey (CHNS) 2009. University of North Carolina at Chapel Hill. 9,549 Chinese adults with 26 fasting blood biomarkers. Enables cross-population transfer validation. UNC CHNS →
  • Longitudinal Dataset — 30 Years Health and Retirement Study (HRS) RAND Longitudinal File, waves 1992–2022. University of Michigan. 45,234 Americans tracked every 2 years for up to 30 years. The dataset that lets us check predictions against what actually happened. HRS →
  • Biological Age Reference Klemera, P. & Doubal, S. (2006). “A new approach to the concept and computation of biological age.” Mechanisms of Ageing and Development, 127(3), 240–248. Defines the KDM biological age estimator used in Investigation 10.
  • Frailty Criteria Fried, L.P. et al. (2001). “Frailty in older adults: Evidence for a phenotype.” J. Gerontology, 56A(3), M146–M156. Defines the five-criterion frailty index used as domain baseline in Investigation 11.