Inflammatory Biomarker Prediction

The Question

C-reactive protein (CRP) is the primary clinical marker of systemic inflammation: elevated CRP predicts cardiovascular events, diabetes progression, and all-cause mortality. Published norms provide age- and sex-specific reference ranges. But CRP varies enormously between individuals: a healthy 60-year-old might have a CRP of 0.5 mg/L while an unhealthy 40-year-old has a CRP of 8.0 mg/L. Can machine learning capture the individual-level drivers that population norms miss?

Domain R² (log CRP): 0.114

ML R² (log CRP): 0.259

More Variance Explained: 2.26x

ADM Prediction (Made Before Running Models)

Predicted winner: ML. CRP is driven by complex, nonlinear interactions between obesity, smoking, activity, and metabolic status. Published norms give age/sex baselines but miss the exponential BMI-CRP relationship and lifestyle factor synergies. However, even ML will have modest R² because CRP has enormous intra-individual variation.

Expected: ML wins on R² by ~2x, but both models remain weak in absolute terms. Actual: R² 0.259 vs 0.114 (2.26x improvement, +126%). Both still explain <26% of variance. Prediction confirmed.

Results

[Figures: R² Comparison; Feature Importance (Top 8); CRP by Lifestyle; Prediction Scatter]

The ADM Insight

CRP varies enormously between individuals: a healthy 60-year-old might have a CRP of 0.5 mg/L while an unhealthy 40-year-old has a CRP of 8.0 mg/L. The domain model, built on age/sex norms with fixed risk-factor adjustments, explains only about 11% of the variation in log CRP. The machine learning model captures BMI, metabolic status, and lifestyle patterns that the norms miss, explaining 2.26x the variance (R² 0.259 vs 0.114). Interactions among risk factors (obesity + diabetes + smoking) appear to compound inflammation in ways that fixed adjustments do not capture.

Data from NHANES 2017–2018. 5,059 adults with high-sensitivity CRP measurements. Nationally representative survey with examination and laboratory components.

The domain model uses published CRP norms from age/sex groups plus known risk factor adjustments: obesity multiplier, smoking effect, diabetes effect. Evaluated on log-transformed CRP.
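As a rough sketch of how a norm-plus-adjustment model of this kind could be structured, the Python below multiplies an age/sex baseline by fixed risk-factor multipliers and returns the prediction on the log(CRP+0.1) scale. Every number, age band, and name here (`BASELINE_CRP_MG_L`, `MULTIPLIERS`, `domain_predict_log_crp`) is a hypothetical placeholder; the source does not list its reference values or adjustment factors.

```python
import math

# Placeholder values for illustration only. These are NOT published norms
# and NOT the adjustment factors used in the analysis.
BASELINE_CRP_MG_L = {
    ("F", "under_50"): 1.6,
    ("F", "50_plus"): 2.0,
    ("M", "under_50"): 1.2,
    ("M", "50_plus"): 1.7,
}
MULTIPLIERS = {"obese": 2.0, "smoker": 1.3, "diabetic": 1.4}


def domain_predict_log_crp(age, sex, obese=False, smoker=False, diabetic=False):
    """Age/sex baseline times fixed multipliers, returned as log(CRP + 0.1)."""
    band = "under_50" if age < 50 else "50_plus"
    crp = BASELINE_CRP_MG_L[(sex, band)]
    flags = {"obese": obese, "smoker": smoker, "diabetic": diabetic}
    for name, present in flags.items():
        if present:
            crp *= MULTIPLIERS[name]
    return math.log(crp + 0.1)


healthy = domain_predict_log_crp(60, "F")
high_risk = domain_predict_log_crp(40, "M", obese=True, smoker=True, diabetic=True)
print(f"healthy 60-year-old woman (log scale): {healthy:.2f}")
print(f"high-risk 40-year-old man (log scale): {high_risk:.2f}")
```

In this sketch each risk factor applies the same fixed factor to every person, whatever their other characteristics, which is the kind of rigidity a learned model can relax.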

The GradientBoosting regressor (300 trees, depth 5) trains on log(CRP+0.1). It uses 15 features, including BMI, waist circumference, blood pressure, HbA1c, lipids, activity levels, and alcohol consumption.
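To make the training setup concrete, here is a toy pure-Python sketch of gradient boosting for squared error on a log(CRP+0.1) target. It uses single-split trees (stumps), 50 rounds, and two features with made-up values, so it is far smaller than the 300-tree, depth-5 model described above. It is not the source's implementation; in practice a library regressor would be used. The names `fit_stump`, `stump_value`, `fit_boosting`, and `predict` are illustrative.

```python
import math


def fit_stump(rows, residuals):
    """Find the one-feature threshold split that minimizes squared error."""
    best = None
    for j in range(len(rows[0])):
        values = sorted({row[j] for row in rows})
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [r for row, r in zip(rows, residuals) if row[j] <= thr]
            right = [r for row, r in zip(rows, residuals) if row[j] > thr]
            left_mean = sum(left) / len(left)
            right_mean = sum(right) / len(right)
            sse = (sum((r - left_mean) ** 2 for r in left)
                   + sum((r - right_mean) ** 2 for r in right))
            if best is None or sse < best[0]:
                best = (sse, j, thr, left_mean, right_mean)
    return best[1:]


def stump_value(stump, row):
    """Leaf mean for the side of the split that `row` falls on."""
    j, thr, left_mean, right_mean = stump
    return left_mean if row[j] <= thr else right_mean


def fit_boosting(rows, y, n_trees=50, learning_rate=0.1):
    """Start from the mean, then repeatedly fit a stump to the residuals."""
    base = sum(y) / len(y)
    preds = [base] * len(y)
    stumps = []
    for _ in range(n_trees):
        residuals = [t - p for t, p in zip(y, preds)]
        stump = fit_stump(rows, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump_value(stump, row)
                 for row, p in zip(rows, preds)]
    return base, stumps, learning_rate


def predict(model, row):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(stump_value(s, row) for s in stumps)


# Made-up (BMI, HbA1c) rows and CRP values in mg/L, for illustration only.
rows = [(21.0, 5.1), (23.5, 5.3), (26.0, 5.4), (28.5, 5.6),
        (31.0, 5.9), (34.0, 6.4), (37.5, 6.8), (41.0, 7.2)]
crp_mg_l = [0.4, 0.6, 1.1, 1.8, 3.0, 4.5, 7.0, 9.5]
y = [math.log(c + 0.1) for c in crp_mg_l]  # target: log(CRP + 0.1)

model = fit_boosting(rows, y)
mean_y = sum(y) / len(y)
sse_baseline = sum((t - mean_y) ** 2 for t in y)
sse_model = sum((t - predict(model, row)) ** 2 for row, t in zip(rows, y))
print(f"training SSE: mean-only {sse_baseline:.3f}, boosted {sse_model:.3f}")
```

The comparison here is on training data only, so it shows the mechanics of boosting rather than out-of-sample accuracy.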

Both models are evaluated on log-transformed CRP because raw CRP is heavily right-skewed. R² on the log scale better reflects prediction accuracy across the clinically relevant range.
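The effect of the scale choice can be illustrated with a small sketch that computes R² for the same predictions on raw CRP and on log(CRP+0.1). The six observed and predicted values are made up for illustration, with one extreme reading above 100 mg/L; the helper names `log_crp` and `r_squared` are assumptions, not names from the analysis.

```python
import math


def log_crp(crp_mg_l, offset=0.1):
    """Transform used for the outcome: log(CRP + 0.1)."""
    return math.log(crp_mg_l + offset)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up CRP values in mg/L; the last person has an extreme reading.
observed = [0.4, 0.9, 1.9, 3.9, 7.9, 120.0]
predicted = [0.5, 1.0, 2.0, 3.0, 6.0, 40.0]

r2_raw = r_squared(observed, predicted)
r2_log = r_squared([log_crp(v) for v in observed],
                   [log_crp(v) for v in predicted])
print(f"R² on raw CRP: {r2_raw:.3f}")
print(f"R² on log(CRP+0.1): {r2_log:.3f}")
```

On the raw scale the single extreme value dominates both the total and the residual sum of squares, so one large miss drags R² down; on the log scale the same miss carries much less weight.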

Log-transformed outcome: R² values are on log(CRP+0.1), not raw CRP. The log transform compresses the extreme right tail where CRP can exceed 100 mg/L. Raw CRP R² is substantially lower for both models.

Cross-sectional snapshot: NHANES provides one CRP measurement per person. CRP fluctuates with acute illness, recent exercise, and menstrual cycle. Repeat measurements would provide more stable estimates and likely improve both models.

R² of 0.259 means 74% unexplained: Even the ML model leaves most CRP variance unexplained. Genetics, acute infections, medication use, and measurement noise likely account for much of the residual. The roughly 2x improvement over the domain model is real, but both models have limited clinical utility for individual prediction.

No causal claims: Feature importance shows BMI as the top predictor, but this is associational. Weight loss interventions may or may not reduce CRP proportionally — that requires randomized trial data.
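The "74% unexplained" figure in the caveats above follows directly from the reported R² values. A minimal worked calculation, using only the two rounded R² values stated in this write-up (the function name `unexplained_share` is illustrative):

```python
R2_DOMAIN = 0.114  # domain model, log(CRP + 0.1) scale
R2_ML = 0.259      # gradient boosting model, same scale


def unexplained_share(r2):
    """Fraction of outcome variance the model does not explain."""
    return 1.0 - r2


ml_unexplained = unexplained_share(R2_ML)
domain_unexplained = unexplained_share(R2_DOMAIN)
extra_explained = R2_ML - R2_DOMAIN

print(f"ML unexplained: {ml_unexplained:.1%}")          # 74.1%
print(f"Domain unexplained: {domain_unexplained:.1%}")  # 88.6%
print(f"Extra variance explained by ML: {extra_explained * 100:.1f} percentage points")
```

Framed this way, the ML model's advantage is about 14.5 percentage points of variance on the log scale, which is why both models remain weak in absolute terms.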


What Drives This Person's Inflammation?