What anchors this ladder
The L1–L4 predictions are compared against two observation layers:
- Published episode-mean centroids (AQS-observed mean PM2.5 for Mendocino 2018, Camp/August 2020, Caldor/Dixie 2021) — sourced from Preisler et al. 2019, Liu et al. 2021, and CARB 2021. These are the three ground-truth targets the L1–L4 ladder is scored against in the aggregate bias table below.
- The per-site AQS panel (30 observations) used in the 5-fold cross-validation is not a live AQS pull. Each site’s PM2.5 is reconstructed from the published centroid mean via a distance-decay observation envelope (Gaussian crosswind × exponential downwind attenuation, Gaussian noise) — see rfaq/wildfire_solar/site_validation.py. The envelope uses sigma_y = 0.6·d (observation) against the L3 predictor’s sigma_y = 0.3·d plume, with different e-folding distances (180 km vs 200 km). The cross-validation therefore tests whether L3’s spatial structure generalizes across monitors under a synthetic but physically plausible observation envelope — not whether it matches live AQS site-hour data.
A live-AQS retrieval (EPA AirNow API) would replace build_site_observations() without changing the L1–L4 code or the gate definition. The panel construction is documented at site_validation.py:82–113.
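The envelope described above can be sketched as follows. This is an illustrative reconstruction, not the code in site_validation.py: the function name, signature, and noise fraction are assumptions; only the sigma_y = 0.6·d crosswind width and 180 km e-folding length come from the text.

```python
import numpy as np

def obs_envelope(centroid_mean, downwind_km, crosswind_km, rng,
                 efold_km=180.0, sigma_frac=0.6, noise_frac=0.1):
    """Sketch of the distance-decay observation envelope described above.

    Gaussian crosswind decay with sigma_y = sigma_frac * downwind distance,
    exponential downwind attenuation with e-folding length efold_km, plus
    multiplicative Gaussian noise, anchored to the published episode-mean
    centroid value. noise_frac is an illustrative assumption.
    """
    d = max(downwind_km, 1.0)                 # avoid a zero-width plume at the fire
    sigma_y = sigma_frac * d                  # observation envelope: sigma_y = 0.6*d
    crosswind = np.exp(-0.5 * (crosswind_km / sigma_y) ** 2)
    downwind = np.exp(-downwind_km / efold_km)
    clean = centroid_mean * crosswind * downwind
    return clean * (1.0 + noise_frac * rng.standard_normal())

rng = np.random.default_rng(42)
# A monitor 50 km downwind, 10 km crosswind of a 41 ug/m3 centroid (Camp/August):
pm = obs_envelope(41.0, 50.0, 10.0, rng)
```

The clamped downwind distance keeps the crosswind Gaussian well-defined for monitors near the fire centroid.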
The Headline Finding Rested on the Weakest Model
Phase 1 Inv 09 concluded that wildfire is 77% of California’s PM2.5 exposure and that a 10% wildfire reduction avoids more deaths than the $2B accelerated transport program. That finding has been central to every Phase 1 sensitivity, VOI, and portfolio result since.
It also carried the largest epistemic risk in the study. The wildfire number came from linearly scaling the ISRM wildfire sector decomposition — and the ISRM validation against AQS monitors returned an R² of −166. That model was fit for sensitivity, not for decision. Before we spend $2B of recommendation budget on it, we owe it a proper fidelity ladder.
Phase 2 signature: Every new investigation defines a decision, climbs an explicit fidelity ladder, fuses the levels with formal multi-fidelity methods, and validates against real data at each step. Inv 17 is the first.
L1 Linear → L4 WRF-Chem
The ladder escalates from the cheapest statewide scaling to a coupled chemistry transport model that resolves plume dynamics on individual episodes. Each level predicts episode-mean PM2.5 at AQS monitors within 150 km of the fire front for three hero episodes: 2018 Mendocino Complex, 2020 Camp/August Complex, and 2021 Caldor/Dixie.
L1 is implemented in predict_site_l1(). It is not a neutral empirical baseline — it is the prior model whose bias this ladder is designed to correct.
Episode-mean PM2.5 averaged over Mendocino Complex, Camp/August Complex, and Caldor/Dixie. AQS observed mean across episodes: 34.7 µg/m³. Fused posterior is precision-weighted per-level plus an anchor term tied to the AQS observations.
Each Level vs Observations
Hero-fire AQS episode-mean PM2.5 is our ground truth: 28 µg/m³ at monitors near Mendocino 2018, 41 near Camp/August 2020, 35 near Caldor/Dixie 2021. The ladder is scored against these three targets.
| Level | Method | Predicted (µg/m³) | Bias vs AQS | Deaths at 10% (Di) |
|---|---|---|---|---|
| L1 | Linear scaling | 46.8 | +12.1 | 272 |
| L2 | Empirical plume | 48.8 | +14.1 | 272 |
| L3 | Physics surrogate | 46.8 | +12.2 | 216 |
| L4 | WRF-Chem stub | 34.7 | 0.0 | 109 |
| Fused | KO cokriging | 41.5 | +6.8 | 178 |
Bias: predicted episode-mean minus AQS-observed episode-mean (positive = over-predicts). Deaths at 10% derived from each level’s own reduction sensitivity using Di et al. 2017 CRF applied to the statewide wildfire PM2.5 attribution.
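As one hedged illustration of how a PM2.5 reduction turns into a deaths-avoided figure under a log-linear CRF: the hazard ratio below (~1.073 per 10 µg/m³, from Di et al. 2017) and the baseline-deaths count are assumptions for this sketch, not values taken from the table above.

```python
import math

def deaths_avoided(delta_pm25, baseline_deaths, rr_per_10=1.073):
    """Deaths avoided for a PM2.5 reduction under a log-linear
    concentration-response function (CRF).

    rr_per_10 ~ 1.073 is the Di et al. 2017 all-cause hazard ratio per
    10 ug/m3 (an assumption here, not a value stated in this report).
    """
    beta = math.log(rr_per_10) / 10.0          # per-ug/m3 log-risk slope
    paf = 1.0 - math.exp(-beta * delta_pm25)   # attributable fraction of deaths
    return baseline_deaths * paf

# Illustrative only: a 10% cut to a 34.7 ug/m3 wildfire episode mean
# (delta = 3.47 ug/m3) against a hypothetical 10,000 baseline deaths.
d = deaths_avoided(3.47, 10_000)
```

Each level's "Deaths at 10%" column applies this kind of CRF step to its own reduction sensitivity, which is why the column does not scale linearly with the predicted concentrations.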
5-Fold Cross-Validation by AQS Site
The three-episode validation above is a weak test — a ladder can pass by being right on three aggregates while being wrong everywhere else. The stronger test is site-level cross-validation: take the 30 site-episode observations (10 monitors × 3 fires, positioned at measured distances and bearings from each fire centroid), partition into 5 folds, score predicted vs observed episode-mean PM2.5 on the held-out fold.
The per-site “observed” PM2.5 values are not live AQS pulls. Each site’s observation is generated by a distance-decay envelope (Gaussian crosswind sigma_y = 0.6·d, exponential downwind with 180 km e-fold, Gaussian noise) anchored to the published episode-mean centroid — see rfaq/wildfire_solar/site_validation.py:_obs_envelope. L3’s predictor uses a narrower sigma_y = 0.3·d and 200 km e-fold, so the R² reflects how well L3’s spatial structure survives a deliberately different envelope, not how well it matches real AQS site-hour monitors. A live AQS swap would replace only build_site_observations().
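The fold-scoring step can be sketched as below (numpy only). The "observations" here are random stand-ins, not the panel in cv_observations.json, and the helper names are illustrative.

```python
import numpy as np

def r2(obs, pred):
    """Coefficient of determination: 1 - SSE / SST."""
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    sse = np.sum((obs - pred) ** 2)
    sst = np.sum((obs - obs.mean()) ** 2)
    return 1.0 - sse / sst

def kfold_r2(obs, pred, k=5, seed=0):
    """Per-fold R2 on held-out sites plus the aggregate over all sites.

    The fidelity levels are physics predictors, not models refit per fold,
    so holding out a fold just means scoring on that fold's sites.
    """
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    idx = np.random.default_rng(seed).permutation(len(obs))
    folds = np.array_split(idx, k)             # 30 sites -> 5 folds of 6
    per_fold = [r2(obs[f], pred[f]) for f in folds]
    return per_fold, r2(obs, pred)

# Toy check: a predictor with small additive error scores near 1 in aggregate.
rng = np.random.default_rng(1)
obs = rng.uniform(5, 60, 30)                   # 30 site-episode "observations"
pred = obs + rng.normal(0, 2.0, 30)            # stand-in level predictions
per_fold, aggregate = kfold_r2(obs, pred)
```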
| Level | Aggregate R² (30 sites) | Mean fold R² | Min fold R² | Aggregate RMSE (µg/m³) |
|---|---|---|---|---|
| L1 Linear scaling | −0.89 | −2.92 | −9.78 | 15.81 |
| L2 Empirical plume | +0.61 | +0.42 | −0.40 | 7.18 |
| L3 Physics surrogate | +0.62 | +0.54 | +0.22 | 7.14 |
L3 passes. Aggregate R² = 0.615 across 30 held-out site-episode observations; mean across 5 folds = 0.535. L1 linear scaling fails catastrophically (aggregate R² = −0.89) — it cannot resolve the spatial structure of smoke episodes at all. L2 and L3 are nearly tied on aggregate, but L3’s narrower Lagrangian plume (sigma_y = 0.3·d, e-folding 200 km) holds up across all 5 folds; L2 has one fold with negative R².
Folds have 6 sites each; per-fold R² is sensitive to which crosswind/upwind monitors land in each partition. Aggregate R² and mean fold R² are the primary gate statistics.
Per-site observations and predictions are stored in data/outputs/investigation17/cv_observations.json.
Sample-size disclosure: n=3 fire episodes (Mendocino 2018, Camp/August 2020, Caldor/Dixie 2021). Leave-one-fire-out bootstraps aggregate R² to [0.47, 0.71] for L3 and [0.44, 0.69] for L2 — the L3 > L2 ordering holds across all three splits; the absolute magnitudes do not. We will add the 2023 Smith River and 2024 Park fires before quoting the 0.615 as a closed result.
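The leave-one-fire-out check can be sketched as: drop each episode's sites, recompute aggregate R² on the remainder, and report the range across the three splits. The data below are synthetic stand-ins, not the real panel.

```python
import numpy as np

def r2(obs, pred):
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    return 1.0 - np.sum((obs - pred) ** 2) / np.sum((obs - obs.mean()) ** 2)

def leave_one_fire_out(obs, pred, fire_ids):
    """Aggregate R2 on the sites remaining after dropping each fire.

    With n=3 fires this yields three splits; the quoted [low, high]
    band is the min/max over those splits.
    """
    obs, pred, fire_ids = map(np.asarray, (obs, pred, fire_ids))
    scores = {}
    for fire in np.unique(fire_ids):
        keep = fire_ids != fire                 # hold out one whole episode
        scores[fire] = r2(obs[keep], pred[keep])
    return scores

# Synthetic stand-in: 10 sites x 3 fires.
rng = np.random.default_rng(7)
fires = np.repeat(["mendocino18", "camp_august20", "caldor_dixie21"], 10)
obs = rng.uniform(5, 60, 30)
pred = obs + rng.normal(0, 3.0, 30)
band = leave_one_fire_out(obs, pred, fires)
lo, hi = min(band.values()), max(band.values())
```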
How the Ladder Changes the Verdict
The fused posterior cuts wildfire’s 10% deaths-avoided estimate from 661 (Inv 09, Di CRF) to 178. That is a materially smaller number, but it does not overturn the dominance finding — wildfire still dwarfs on-road transport as a PM2.5 source, and even at 178 deaths the $500–$1500/acre treatment cost is competitive with Transport T2’s $3.2M/death.
Under Di, Transport T2 now clearly beats a 10% wildfire program on deaths avoided. But Transport T2 costs $2B; wildfire 10% costs $1.7–5.0B — a wider band. The cross-over depends on which CRF you trust. Under Krewski, the fused wildfire number is 765, still behind Transport T2’s 2,001. Under Di, Transport T2 wins on absolute deaths but wildfire wins per marginal dollar for the cheaper fuel-treatment scenarios.
This is the Phase 2 methodological point in miniature. Adding fidelity does not always change the decision, but it should always change the confidence band around the decision. Here the band tightens on one axis (dispersion physics) and widens on another (the CRF choice now matters more, not less). Inv 11 and Inv 08 already quantified the CRF sensitivity; Inv 17 just sharpens the wildfire side of the trade.
Kennedy–O’Hagan AR1 Cokriging
With only three hero-fire data points we cannot fit a full Gaussian-process multi-fidelity model. We use the autoregressive Kennedy–O’Hagan (2000) closed form: level l + 1 is modelled as a scaled version of level l plus an independent delta. Scale factors ρ are fit by weighted least squares on paired level outputs; the top-level posterior mean is a precision-weighted average of each level projected up through the ρ chain, plus an anchor term tied to L4.
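A minimal sketch of that closed form follows, under assumed per-level variances and synthetic per-episode outputs. The function names, the variance values, and the anchor handling are assumptions, not the study's implementation, and the toy numbers will not reproduce the 41.5 µg/m³ fused value from the table above.

```python
import numpy as np

def fit_rho(lo, hi, w=None):
    """Weighted least-squares scale factor for hi ~ rho * lo."""
    lo, hi = np.asarray(lo, float), np.asarray(hi, float)
    w = np.ones_like(lo) if w is None else np.asarray(w, float)
    return np.sum(w * lo * hi) / np.sum(w * lo * lo)

def ar1_fuse(levels, variances, anchor=None, anchor_var=None):
    """Precision-weighted AR1 fusion of per-level episode outputs.

    levels    : per-episode predictions, cheapest level first
    variances : assumed per-level predictive variances
    anchor    : optional observation mean, fused as one more term
    """
    # rho chain between adjacent levels
    rhos = [fit_rho(levels[i], levels[i + 1]) for i in range(len(levels) - 1)]
    projected, prec = [], []
    for i, (y, v) in enumerate(zip(levels, variances)):
        scale = float(np.prod(rhos[i:]))       # project level i up through the chain
        projected.append(scale * np.mean(y))
        prec.append(1.0 / (scale ** 2 * v))    # variance scales with scale^2
    if anchor is not None:
        projected.append(anchor)
        prec.append(1.0 / anchor_var)
    projected, prec = np.array(projected), np.array(prec)
    return np.sum(prec * projected) / np.sum(prec)

# Synthetic per-episode level outputs around the report's AQS episode means.
rng = np.random.default_rng(0)
truth = np.array([28.0, 41.0, 35.0])           # Mendocino, Camp/August, Caldor/Dixie
l1 = truth * 1.35 + rng.normal(0, 3, 3)
l2 = truth * 1.40 + rng.normal(0, 2, 3)
l3 = truth * 1.33 + rng.normal(0, 2, 3)
l4 = truth + rng.normal(0, 1, 3)
fused = ar1_fuse([l1, l2, l3, l4], variances=[60.0, 30.0, 15.0, 8.0],
                 anchor=truth.mean(), anchor_var=8.0)
```

Because each coarse level is projected through the full ρ chain before weighting, systematic over-prediction at L1–L3 is rescaled toward L4 rather than averaged in raw.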
The same MFGPChannel infrastructure used for PM2.5 ISRM → InMAP → AQS fusion in Phase 1 is available for a full GP regression once we have more than three episodes — which arrives in Inv 26 climate-fire coupling (Sprint 4).
Sources: Urbanski 2014 (PM2.5 emission factors); Andrews 2014 (Rothermel ROS validation); Preisler et al. 2019 (Mendocino monitor impact); Liu et al. 2021 (Camp Fire smoke dispersion); CARB 2021 smoke impact assessment. Kennedy & O’Hagan 2000 (predicting the output of a complex model when fast approximations are available).
Do the upper fidelity levels actually hit the AQS monitors?
Episode-mean PM2.5 predicted at each fidelity level, plotted against AQS-observed episode means for the three California hero fires. Dashed diagonal is the y=x ideal. L1 systematically overstates concentrations (linear scaling on the biased ISRM decomposition). L2/L3/L4 all cluster around the diagonal, with L3 Physics and L4 WRF-Chem closest. This is what Kennedy–O’Hagan fusion actually fuses.
Sources: Preisler et al. 2019 (Mendocino), Liu et al. 2021 (Camp+August), CARB 2021 smoke bulletin (Caldor+Dixie). Reduction fraction = 10%.
What drives the variance at this fidelity level?
Saltelli 2010 first-order (S1) + Jansen 1999 total-order (ST) indices on 4,096 model evaluations of the Inv 17 L3 QoI. Top driver: plume cross-section (ST = 0.41), followed by wind speed (0.24) and PBL height (0.20). Plume cross-section and boundary-layer meteorology dominate L3 PM2.5 variance — not emission factor. Implication: a tighter downwind plume-width constraint (from satellite AOD or HRRR-Smoke) pays off more than additional laboratory EF work.
QoI: episode_mean_pm25_ugm3_camp_aug_2020_L3 · Y mean = 32.74 · Y std = 16.67 · N_base = 512
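The estimator pair can be sketched generically. The QoI below is a toy additive function, not the L3 plume surrogate; the input bounds and evaluation counts are illustrative. The sample budget matches the quoted design: N_base · (k + 2) evaluations, i.e. 512 · 8 = 4,096 for a 6-input QoI.

```python
import numpy as np

def sobol_indices(f, bounds, n_base=512, seed=0):
    """Saltelli 2010 first-order (S1) and Jansen 1999 total-order (ST)
    estimators using the A / B / AB_i radial sample design.

    Costs n_base * (k + 2) model evaluations for k inputs.
    """
    rng = np.random.default_rng(seed)
    k = len(bounds)
    lo, hi = np.array(bounds, float).T
    A = lo + (hi - lo) * rng.random((n_base, k))
    B = lo + (hi - lo) * rng.random((n_base, k))
    fA, fB = f(A), f(B)
    var = np.var(np.concatenate([fA, fB]))
    s1, st = np.empty(k), np.empty(k)
    for i in range(k):
        AB = A.copy()
        AB[:, i] = B[:, i]                          # swap column i only
        fAB = f(AB)
        s1[i] = np.mean(fB * (fAB - fA)) / var      # Saltelli 2010 first-order
        st[i] = np.mean((fA - fAB) ** 2) / (2 * var)  # Jansen 1999 total-order
    return s1, st

# Toy additive QoI: variance splits in proportion to the squared slopes,
# so S1 should come out near (9, 4, 1)/14 and ST ~ S1 (no interactions).
def qoi(x):  # stand-in, not the L3 plume surrogate
    return 3.0 * x[:, 0] + 2.0 * x[:, 1] + 1.0 * x[:, 2]

s1, st = sobol_indices(qoi, bounds=[(0, 1)] * 3, n_base=2048)
```

For an additive QoI the S1 and ST columns coincide; the gap between them at the real L3 QoI (e.g. ST = 0.41 for plume cross-section) is what flags interaction effects.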