Can You Predict How Someone Will Age?
Everyone assumes AI will change health prediction. Most clinicians counter that published guidelines already work. We took 64,037 real people from two countries, tracked for up to 30 years, and asked thirteen questions, from diabetes risk to cognitive decline to the hardest endpoint: actual mortality. The split: two questions answered best by published medicine, seven by machine learning, two by a hybrid of both, and two surprises, where wealth and sleep add nothing once health conditions are known.
The Data
Three public health studies spanning two countries and three decades. NHANES (USA) and CHNS (China) provide detailed biomarker snapshots; the HRS follows 45,234 Americans every two years for up to 30 years, so every prediction can be checked against what actually happened.
For each question, we predicted which method should win, and why, before running any models. Those pre-analysis predictions matched the outcome in all thirteen cases, including the investigation validated against actual mortality. The result: a 2–7–2–2 split across domain knowledge, machine learning, hybrid methods, and “more data doesn’t help.”
These three publicly available datasets form the foundation for all thirteen investigations: NHANES gives biomarker depth, CHNS lets us test whether American patterns hold in China, and HRS tracks people long enough to see if predictions came true.
Thirteen Questions, Four Methods
We took 64,037 real people and asked thirteen questions about aging, health prediction, and intervention, including the hardest endpoint: actual mortality. Each demanded a different analytical method and a different level of model fidelity. The right method depends on what you’re trying to predict.
| # | Question | Decision Type | Best Method | Data |
|---|---|---|---|---|
| **Domain Knowledge Sufficient** | | | | |
| 4 | US norms in China? | Transfer Validation | Published adjustments | NHANES+CHNS |
| 11 | Physical independence? | Decline Prediction | Frailty screening | HRS 30yr |
| **ML Adds Genuine Value** | | | | |
| 2 | Diabetes risk? | Risk Screening | Gradient Boosting | HRS 30yr |
| 5 | Individual inflammation? | Biomarker Prediction | Gradient Boosting | NHANES |
| 6 | Health in 10 years? | Trajectory Forecast | ML on early waves | HRS 30yr |
| 7 | Heart disease risk? | Disease Prediction | ML vs published RR | HRS 30yr |
| 8 | Lifestyle interactions? | Interaction Discovery | Neural net + SHAP | NHANES |
| 13 | 10-year mortality? | Survival Prediction | ML vs Charlson-style | HRS 30yr |
| 14 | Cognitive decline? | Regression Forecast | Gradient Boosting | HRS 30yr |
| **More Data Doesn’t Always Help** | | | | |
| 15 | Wealth beyond health? | Feature Evaluation | Health+SDOH ML | HRS 30yr |
| 16 | Sleep predicts decline? | Feature Evaluation | Health+Sleep ML | HRS 30yr |
| **Hybrid: Encode + Learn** | | | | |
| 9 | Weight trajectory? | Long-Horizon Forecast | Physics + residual | HRS 30yr |
| 10 | Biological age? | Health Assessment | KDM + ML | NHANES |
Methodology note: All ML models are GradientBoosting ensembles (scikit-learn) evaluated via 5-fold stratified cross-validation with logistic regression as an additional baseline. Confidence intervals from 1,000 bootstrap resamples. Domain baselines use published clinical scoring methods with cited relative risks. Calibration curves (predicted vs observed) reported for all classification investigations. Sample sizes vary by investigation (1,907–32,230) depending on dataset and inclusion criteria. See individual investigation pages for full methodology.
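The evaluation protocol in the methodology note can be sketched with scikit-learn. The synthetic dataset and model settings here are illustrative stand-ins for one investigation, not the actual HRS or NHANES data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for one investigation's feature matrix.
X, y = make_classification(n_samples=2000, n_features=15, random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
models = {
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
    "logistic_baseline": LogisticRegression(max_iter=1000),
}

rng = np.random.default_rng(0)
results = {}
for name, model in models.items():
    # Out-of-fold predicted probabilities for the positive class.
    proba = cross_val_predict(model, X, y, cv=cv, method="predict_proba")[:, 1]
    auc = roc_auc_score(y, proba)

    # 95% CI from 1,000 bootstrap resamples of the out-of-fold predictions.
    boot = []
    for _ in range(1000):
        idx = rng.integers(0, len(y), len(y))
        if len(np.unique(y[idx])) < 2:
            continue  # AUC needs both classes present in the resample
        boot.append(roc_auc_score(y[idx], proba[idx]))
    lo, hi = np.percentile(boot, [2.5, 97.5])
    results[name] = (auc, lo, hi)
    print(f"{name}: AUC {auc:.3f} (95% CI {lo:.3f}-{hi:.3f})")
```

Out-of-fold predictions keep the comparison honest: every person is scored by a model that never saw them during training.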
Each Question Gets the Method It Needs
Each tier groups questions by which method won.
- Domain knowledge sufficient: Cross-Population Transfer, Functional Decline
- ML adds genuine value: Diabetes Risk, Individual Inflammation, Health Trajectory, Heart Disease Prediction, Lifestyle Interactions, 10-Year Mortality, Cognitive Decline
- More data doesn’t always help: Wealth & Longevity, Sleep & Health
- Hybrid: BMI Trajectory, Biological Age
No Single Method Works
Each axis represents one of the thirteen questions; the radius shows where the optimal fidelity landed. Cross-population transfer needs almost no model complexity: published norms are fine. Biological age demands the most; neither domain knowledge nor machine learning alone comes close. Wealth and sleep sit at the bottom: adding them to health models yields nothing. The shape is irregular because different questions need different methods. That’s the point.
Thirteen axes, four tiers. Teal = domain knowledge. Blue = ML needed. Rose = more data doesn’t help. Gold = hybrid.
What We Measured and What We Didn't
Self-reported data in HRS. Nine of thirteen investigations use the Health and Retirement Study, where health status (1–5 scale), disease onset, and BMI are self-reported. Self-reported BMI is typically underestimated by 1–2 kg/m², and disease onset dates reflect when a diagnosis was received, not necessarily when the condition began. NHANES provides measured biomarkers but is cross-sectional, so it cannot track outcomes over time. This asymmetry means our longitudinal predictions carry more measurement noise than the cross-sectional investigations.
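A simple sensitivity check for the self-report bias is to shift BMI by the cited underestimate and see how many people cross a clinical threshold. The values below and the +1.5 kg/m² offset (the midpoint of the 1–2 kg/m² range) are hypothetical illustrations, not something the investigations apply:

```python
import pandas as pd

# Illustrative self-reported BMI values (hypothetical, not HRS data).
df = pd.DataFrame({"bmi_self_report": [24.1, 27.8, 29.2, 31.2, 22.5, 28.9]})

OFFSET = 1.5  # assumed mean underestimate, kg/m^2
df["bmi_adjusted"] = df["bmi_self_report"] + OFFSET

# Count how many people cross the obesity threshold (BMI >= 30)
# only after the reporting bias is corrected.
reclassified = ((df["bmi_adjusted"] >= 30) & (df["bmi_self_report"] < 30)).sum()
print(f"{reclassified} of {len(df)} reclassified as obese after adjustment")
```

Re-running a model on both columns shows whether its conclusions are sensitive to the reporting bias.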
No external validation cohort. All models are evaluated via cross-validation within their training datasets. A true external validation — testing HRS-trained models on ELSA (English Longitudinal Study of Ageing) or SHARE (European equivalent) — would strengthen generalizability claims. This is deferred pending data access agreements.
Cross-sectional vs. longitudinal design. Investigations using NHANES (Q4, Q5, Q8, Q10) capture associations at a single point in time. We cannot confirm that the biomarker patterns identified actually predict future outcomes without longitudinal follow-up. The HRS investigations (Q2, Q6, Q7, Q9, Q11, Q13, Q14, Q15, Q16) do have longitudinal validation.
Survivorship bias in HRS. Participants must survive to each follow-up wave to contribute data. People who die between waves are lost, and those who drop out tend to be sicker. This biases longitudinal analyses toward healthier survivors, potentially underestimating the true predictive power of health decline indicators.
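One standard partial correction for panel attrition, not applied in the investigations above, is inverse-probability-of-attrition weighting: model who stays in the panel from baseline covariates, then upweight the survivors who resemble dropouts. A minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000

# Synthetic wave-1 covariates: worse baseline health -> more likely to drop out.
health = rng.normal(size=n)            # higher = healthier (hypothetical score)
age = rng.uniform(50, 80, size=n)
p_stay = 1 / (1 + np.exp(-(0.8 * health - 0.03 * (age - 65))))
stayed = rng.random(n) < p_stay

# Model the probability of remaining in the panel from baseline covariates...
X = np.column_stack([health, age])
attrition_model = LogisticRegression().fit(X, stayed)
p_hat = attrition_model.predict_proba(X)[:, 1]

# ...and weight retained participants by the inverse of that probability,
# so survivors who "look like" dropouts count for more. Clipping caps the
# influence of near-certain dropouts.
weights = np.where(stayed, 1.0 / np.clip(p_hat, 0.05, None), 0.0)
print(f"retained: {stayed.mean():.1%}, weight range: "
      f"{weights[stayed].min():.2f}-{weights[stayed].max():.2f}")
```

Downstream models then take `weights` as their sample weights, partially restoring the dropouts' influence on the estimates.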
Model complexity ceiling. All ML models are GradientBoosting ensembles or logistic regression. Deep learning, survival models (Cox), or time-series architectures might perform differently. We deliberately chose interpretable models to keep the focus on whether ML helps, not on squeezing marginal gains from architecture search.
Thirteen Questions. Real Data. The Right Method.
64,037 people, two countries, thirty years. Thirteen questions, each matched to the method it needed.
Explore the Data
Dig into the data behind the findings.
The Biology Paradox
Three methods predict biological age. Step through to see which one loses — and why more computation made it worse.
Bio-Age Calculator
Enter your biomarkers and see your estimated biological age. Compare your position against 1,907 NHANES participants.
Prediction Duel
Four investigations with direct domain-vs-ML comparisons. ROC curves, calibration, reclassification, and feature importance heatmap.
The Data Paradox
Wealth predicts a 4.5× mortality gap. Sleep separates healthy from declining. But adding this data to models changes almost nothing. Why?
Reproducible with Public Data
All three datasets are publicly available and freely accessible for research use. Every analysis is reproducible with open-source Python tools (scikit-learn, scipy, pandas).
- Primary Dataset (USA) — National Health and Nutrition Examination Survey (NHANES) 2017–2018, Centers for Disease Control and Prevention (CDC). 9,254 Americans with full biomarker panels and lifestyle surveys. Cross-sectional. CDC NHANES
- Cross-Population Dataset (China) — China Health and Nutrition Survey (CHNS) 2009, University of North Carolina at Chapel Hill. 9,549 Chinese adults with 26 fasting blood biomarkers. Enables cross-population transfer validation. UNC CHNS
- Longitudinal Dataset (30 years) — Health and Retirement Study (HRS) RAND Longitudinal File, waves 1992–2022, University of Michigan. 45,234 Americans tracked every two years for up to 30 years. The dataset that lets us check predictions against what actually happened. HRS
- Biological Age Reference — Klemera, P. & Doubal, S. (2006). “A new approach to the concept and computation of biological age.” Mechanisms of Ageing and Development, 127(3), 240–248. Defines the KDM biological age estimator used in Investigation 10.
- Frailty Criteria — Fried, L.P. et al. (2001). “Frailty in older adults: Evidence for a phenotype.” Journals of Gerontology, 56A(3), M146–M156. Defines the five-criterion frailty index used as the domain baseline in Investigation 11.
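The KDM estimator from Klemera & Doubal can be sketched on synthetic data: regress each biomarker on chronological age, then take a precision-weighted average of the age each biomarker implies. The biomarkers, slopes, and noise levels below are made up for illustration, and the sketch omits the full method's chronological-age correction term:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
age = rng.uniform(30, 80, size=n)

# Synthetic biomarkers that drift linearly with age (hypothetical slopes/noise).
markers = np.column_stack([
    110 + 0.45 * age + rng.normal(0, 8, n),    # e.g. systolic BP
    180 + 0.30 * age + rng.normal(0, 20, n),   # e.g. total cholesterol
    85 + 0.20 * age + rng.normal(0, 10, n),    # e.g. fasting glucose
])

# Per-biomarker regression on chronological age: x_j = q_j + k_j * CA.
q, k, s = [], [], []
for j in range(markers.shape[1]):
    kj, qj = np.polyfit(age, markers[:, j], 1)
    resid = markers[:, j] - (qj + kj * age)
    q.append(qj); k.append(kj); s.append(resid.std())
q, k, s = map(np.array, (q, k, s))

def kdm_bioage(x):
    # Weighted average of the age each biomarker implies, (x - q) / k,
    # with weights (k_j / s_j)^2 favoring tight, steep markers.
    num = ((x - q) * k / s**2).sum(axis=1)
    den = ((k / s) ** 2).sum()
    return num / den

ba = kdm_bioage(markers)
print(f"corr(BA, CA) = {np.corrcoef(ba, age)[0, 1]:.2f}")
```

With strongly age-linked markers, biological age tracks chronological age closely; the interesting signal is the residual, who runs older or younger than their years.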