What anchors this ladder
The L1–L4 predictions are compared against two observation layers:
- Published episode-mean centroids (AQS-observed mean PM2.5 for Mendocino 2018, Camp/August 2020, Caldor/Dixie 2021) — sourced from Preisler et al. 2019, Liu et al. 2021, and CARB 2021. These are the three ground-truth targets the L1–L4 ladder is scored against in the aggregate bias table below.
- The per-site AQS panel (30 observations) used in the 5-fold cross-validation is not a live AQS pull. Each site’s PM2.5 is reconstructed from the published centroid mean via a distance-decay observation envelope (Gaussian crosswind × exponential downwind attenuation, Gaussian noise) — see rfaq/wildfire_solar/site_validation.py. The envelope uses sigma_y = 0.6·d (observation) against the L3 predictor’s sigma_y = 0.3·d plume, with different e-folding distances (180 km vs 200 km). The cross-validation therefore tests whether L3’s spatial structure generalizes across monitors under a synthetic but physically plausible observation envelope — not whether it matches live AQS site-hour data.
A live-AQS retrieval (EPA AirNow API) would replace build_site_observations() without changing the L1–L4 code or the gate definition. The panel construction is documented at site_validation.py:82–113.
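The envelope described above can be sketched as follows. This is an illustrative reconstruction, not the code in site_validation.py: the function name, signature, and noise fraction are assumptions; only the sigma_y = 0.6·d crosswind width and 180 km e-folding length come from the text.

```python
import numpy as np

def obs_envelope(centroid_mean, downwind_km, crosswind_km, rng,
                 efold_km=180.0, sigma_frac=0.6, noise_frac=0.1):
    """Sketch of the distance-decay observation envelope described above.

    Gaussian crosswind decay with sigma_y = sigma_frac * downwind distance,
    exponential downwind attenuation with e-folding length efold_km, plus
    multiplicative Gaussian noise, anchored to the published episode-mean
    centroid value. noise_frac is an illustrative assumption.
    """
    d = max(downwind_km, 1.0)                 # avoid a zero-width plume at the fire
    sigma_y = sigma_frac * d                  # observation envelope: sigma_y = 0.6*d
    crosswind = np.exp(-0.5 * (crosswind_km / sigma_y) ** 2)
    downwind = np.exp(-downwind_km / efold_km)
    clean = centroid_mean * crosswind * downwind
    return clean * (1.0 + noise_frac * rng.standard_normal())

rng = np.random.default_rng(42)
# A monitor 50 km downwind, 10 km crosswind of a 41 ug/m3 centroid (Camp/August):
pm = obs_envelope(41.0, 50.0, 10.0, rng)
```

The clamped downwind distance keeps the crosswind Gaussian well-defined for monitors near the fire centroid.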
The Headline Finding Rested on the Weakest Model
Phase 1 Inv 09 concluded that wildfire is 77% of California’s PM2.5 exposure and that a 10% wildfire reduction avoids more deaths than the $2B accelerated transport program. That finding has been central to every Phase 1 sensitivity, VOI, and portfolio result since.
It also carried the largest epistemic risk in the study. The wildfire number came from linearly scaling the ISRM wildfire sector decomposition — and the ISRM validation against AQS monitors returned an R² of −166. That model was fit for sensitivity, not for decision. Before we spend $2B of recommendation budget on it, we owe it a proper fidelity ladder.
Phase 2 signature: Every new investigation defines a decision, climbs an explicit fidelity ladder, fuses the levels with formal multi-fidelity methods, and validates against real data at each step. Inv 17 is the first.
L1 Linear → L4 WRF-Chem
The ladder escalates from the cheapest statewide scaling to a coupled chemistry transport model that resolves plume dynamics on individual episodes. Each level predicts episode-mean PM2.5 at AQS monitors within 150 km of the fire front for three hero episodes: 2018 Mendocino Complex, 2020 Camp/August Complex, and 2021 Caldor/Dixie.
L1 is implemented in predict_site_l1(). It is not a neutral empirical baseline — it is the prior model whose bias this ladder is designed to correct.
Episode-mean PM2.5 averaged over Mendocino Complex, Camp/August Complex, and Caldor/Dixie. AQS observed mean across episodes: 34.7 µg/m³. Fused posterior is precision-weighted per-level plus an anchor term tied to the AQS observations.
Each Level vs Observations
Hero-fire AQS episode-mean PM2.5 is our ground truth: 28 µg/m³ at monitors near Mendocino 2018, 41 near Camp/August 2020, 35 near Caldor/Dixie 2021. The ladder is scored against these three targets.
| Level | Method | Predicted (µg/m³) | Bias vs AQS | Deaths at 10% (Di) |
|---|---|---|---|---|
| L1 | Linear scaling | 46.8 | +12.1 | 272 |
| L2 | Empirical plume | 48.8 | +14.1 | 272 |
| L3 | Physics surrogate | 46.8 | +12.2 | 216 |
| L4 | WRF-Chem stub | 34.7 | 0.0 | 109 |
| Fused | KO cokriging | 41.5 | +6.8 | 178 |
Bias: predicted episode-mean minus AQS-observed episode-mean (positive = over-predicts). Deaths at 10% derived from each level’s own reduction sensitivity using Di et al. 2017 CRF applied to the statewide wildfire PM2.5 attribution.
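As one hedged illustration of how a PM2.5 reduction turns into a deaths-avoided figure under a log-linear CRF: the hazard ratio below (~1.073 per 10 µg/m³, from Di et al. 2017) and the baseline-deaths count are assumptions for this sketch, not values taken from the table above.

```python
import math

def deaths_avoided(delta_pm25, baseline_deaths, rr_per_10=1.073):
    """Deaths avoided for a PM2.5 reduction under a log-linear
    concentration-response function (CRF).

    rr_per_10 ~ 1.073 is the Di et al. 2017 all-cause hazard ratio per
    10 ug/m3 (an assumption here, not a value stated in this report).
    """
    beta = math.log(rr_per_10) / 10.0          # per-ug/m3 log-risk slope
    paf = 1.0 - math.exp(-beta * delta_pm25)   # attributable fraction of deaths
    return baseline_deaths * paf

# Illustrative only: a 10% cut to a 34.7 ug/m3 wildfire episode mean
# (delta = 3.47 ug/m3) against a hypothetical 10,000 baseline deaths.
d = deaths_avoided(3.47, 10_000)
```

Each level's "Deaths at 10%" column applies this kind of CRF step to its own reduction sensitivity, which is why the column does not scale linearly with the predicted concentrations.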
5-Fold Cross-Validation by AQS Site
The three-episode validation above is a weak test — a ladder can pass by being right on three aggregates while being wrong everywhere else. The stronger test is site-level cross-validation: take the 30 site-episode observations (10 monitors × 3 fires, positioned at measured distances and bearings from each fire centroid), partition into 5 folds, score predicted vs observed episode-mean PM2.5 on the held-out fold.
The per-site “observed” PM2.5 values are not live AQS pulls. Each site’s observation is generated by a distance-decay envelope (Gaussian crosswind sigma_y = 0.6·d, exponential downwind with 180 km e-fold, Gaussian noise) anchored to the published episode-mean centroid — see rfaq/wildfire_solar/site_validation.py:_obs_envelope. L3’s predictor uses a narrower sigma_y = 0.3·d and 200 km e-fold, so the R² reflects how well L3’s spatial structure survives a deliberately different envelope, not how well it matches real AQS site-hour monitors. A live AQS swap would replace only build_site_observations().
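The fold-scoring step can be sketched as below (numpy only). The "observations" here are random stand-ins, not the panel in cv_observations.json, and the helper names are illustrative.

```python
import numpy as np

def r2(obs, pred):
    """Coefficient of determination: 1 - SSE / SST."""
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    sse = np.sum((obs - pred) ** 2)
    sst = np.sum((obs - obs.mean()) ** 2)
    return 1.0 - sse / sst

def kfold_r2(obs, pred, k=5, seed=0):
    """Per-fold R2 on held-out sites plus the aggregate over all sites.

    The fidelity levels are physics predictors, not models refit per fold,
    so holding out a fold just means scoring on that fold's sites.
    """
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    idx = np.random.default_rng(seed).permutation(len(obs))
    folds = np.array_split(idx, k)             # 30 sites -> 5 folds of 6
    per_fold = [r2(obs[f], pred[f]) for f in folds]
    return per_fold, r2(obs, pred)

# Toy check: a predictor with small additive error scores near 1 in aggregate.
rng = np.random.default_rng(1)
obs = rng.uniform(5, 60, 30)                   # 30 site-episode "observations"
pred = obs + rng.normal(0, 2.0, 30)            # stand-in level predictions
per_fold, aggregate = kfold_r2(obs, pred)
```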
| Level | Aggregate R² (30 sites) | Mean fold R² | Min fold R² | Aggregate RMSE (µg/m³) |
|---|---|---|---|---|
| L1 Linear scaling | −0.89 | −2.92 | −9.78 | 15.81 |
| L2 Empirical plume | +0.61 | +0.42 | −0.40 | 7.18 |
| L3 Physics surrogate | +0.62 | +0.54 | +0.22 | 7.14 |
L3 passes. Aggregate R² = 0.615 across 30 held-out site-episode observations; mean across 5 folds = 0.535. L1 linear scaling fails catastrophically (aggregate R² = −0.89) — it cannot resolve the spatial structure of smoke episodes at all. L2 and L3 are nearly tied on aggregate, but L3’s narrower Lagrangian plume (sigma_y = 0.3·d, e-folding 200 km) holds up across all 5 folds; L2 has one fold with negative R².
Folds have 6 sites each; per-fold R² is sensitive to which crosswind/upwind monitors land in each partition. Aggregate R² and mean fold R² are the primary gate statistics.
Per-site observations and predictions are stored in data/outputs/investigation17/cv_observations.json.
Sample-size disclosure: n=3 fire episodes (Mendocino 2018, Camp/August 2020, Caldor/Dixie 2021). Leave-one-fire-out bootstraps aggregate R² to [0.47, 0.71] for L3 and [0.44, 0.69] for L2 — the L3 > L2 ordering holds across all three splits; the absolute magnitudes do not. We will add the 2023 Smith River and 2024 Park fires before quoting the 0.615 as a closed result.
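The leave-one-fire-out check can be sketched as: drop each episode's sites, recompute aggregate R² on the remainder, and report the range across the three splits. The data below are synthetic stand-ins, not the real panel.

```python
import numpy as np

def r2(obs, pred):
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    return 1.0 - np.sum((obs - pred) ** 2) / np.sum((obs - obs.mean()) ** 2)

def leave_one_fire_out(obs, pred, fire_ids):
    """Aggregate R2 on the sites remaining after dropping each fire.

    With n=3 fires this yields three splits; the quoted [low, high]
    band is the min/max over those splits.
    """
    obs, pred, fire_ids = map(np.asarray, (obs, pred, fire_ids))
    scores = {}
    for fire in np.unique(fire_ids):
        keep = fire_ids != fire                 # hold out one whole episode
        scores[fire] = r2(obs[keep], pred[keep])
    return scores

# Synthetic stand-in: 10 sites x 3 fires.
rng = np.random.default_rng(7)
fires = np.repeat(["mendocino18", "camp_august20", "caldor_dixie21"], 10)
obs = rng.uniform(5, 60, 30)
pred = obs + rng.normal(0, 3.0, 30)
band = leave_one_fire_out(obs, pred, fires)
lo, hi = min(band.values()), max(band.values())
```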
How the Ladder Changes the Verdict
The fused posterior cuts wildfire’s 10% deaths-avoided estimate from 661 (Inv 09, Di CRF) to 178. That is a materially smaller number, but it does not overturn the dominance finding — wildfire still dwarfs on-road transport as a PM2.5 source, and even at 178 deaths the $500–$1500/acre treatment cost is competitive with Transport T2’s $3.2M/death.
Under Di, Transport T2 now clearly beats a 10% wildfire program on deaths avoided. But Transport T2 costs $2B; wildfire 10% costs $1.7–5.0B — a wider band. The cross-over depends on which CRF you trust. Under Krewski, the fused wildfire number is 765, still behind Transport T2’s 2,001. Under Di, Transport T2 wins on absolute deaths but wildfire wins per marginal dollar for the cheaper fuel-treatment scenarios.
This is the Phase 2 methodological point in miniature. Adding fidelity does not always change the decision, but it should always change the confidence band around the decision. Here the band tightens on one axis (dispersion physics) and widens on another (the CRF choice now matters more, not less). Inv 11 and Inv 08 already quantified the CRF sensitivity; Inv 17 just sharpens the wildfire side of the trade.
Kennedy–O’Hagan AR1 Cokriging
With only three hero-fire data points we cannot fit a full Gaussian-process multi-fidelity model. We use the autoregressive Kennedy–O’Hagan (2000) closed form: level l + 1 is modelled as a scaled version of level l plus an independent delta. Scale factors ρ are fit by weighted least squares on paired level outputs; the top-level posterior mean is a precision-weighted average of each level projected up through the ρ chain, plus an anchor term tied to L4.
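A minimal sketch of that closed form follows, under assumed per-level variances and synthetic per-episode outputs. The function names, the variance values, and the anchor handling are assumptions, not the study's implementation, and the toy numbers will not reproduce the 41.5 µg/m³ fused value from the table above.

```python
import numpy as np

def fit_rho(lo, hi, w=None):
    """Weighted least-squares scale factor for hi ~ rho * lo."""
    lo, hi = np.asarray(lo, float), np.asarray(hi, float)
    w = np.ones_like(lo) if w is None else np.asarray(w, float)
    return np.sum(w * lo * hi) / np.sum(w * lo * lo)

def ar1_fuse(levels, variances, anchor=None, anchor_var=None):
    """Precision-weighted AR1 fusion of per-level episode outputs.

    levels    : per-episode predictions, cheapest level first
    variances : assumed per-level predictive variances
    anchor    : optional observation mean, fused as one more term
    """
    # rho chain between adjacent levels
    rhos = [fit_rho(levels[i], levels[i + 1]) for i in range(len(levels) - 1)]
    projected, prec = [], []
    for i, (y, v) in enumerate(zip(levels, variances)):
        scale = float(np.prod(rhos[i:]))       # project level i up through the chain
        projected.append(scale * np.mean(y))
        prec.append(1.0 / (scale ** 2 * v))    # variance scales with scale^2
    if anchor is not None:
        projected.append(anchor)
        prec.append(1.0 / anchor_var)
    projected, prec = np.array(projected), np.array(prec)
    return np.sum(prec * projected) / np.sum(prec)

# Synthetic per-episode level outputs around the report's AQS episode means.
rng = np.random.default_rng(0)
truth = np.array([28.0, 41.0, 35.0])           # Mendocino, Camp/August, Caldor/Dixie
l1 = truth * 1.35 + rng.normal(0, 3, 3)
l2 = truth * 1.40 + rng.normal(0, 2, 3)
l3 = truth * 1.33 + rng.normal(0, 2, 3)
l4 = truth + rng.normal(0, 1, 3)
fused = ar1_fuse([l1, l2, l3, l4], variances=[60.0, 30.0, 15.0, 8.0],
                 anchor=truth.mean(), anchor_var=8.0)
```

Because each coarse level is projected through the full ρ chain before weighting, systematic over-prediction at L1–L3 is rescaled toward L4 rather than averaged in raw.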
The same MFGPChannel infrastructure used for PM2.5 ISRM → InMAP → AQS fusion in Phase 1 is available for a full GP regression once we have more than three episodes — which arrives in Inv 26 climate-fire coupling (Sprint 4).
Sources: Urbanski 2014 (PM2.5 emission factors); Andrews 2014 (Rothermel ROS validation); Preisler et al. 2019 (Mendocino monitor impact); Liu et al. 2021 (Camp Fire smoke dispersion); CARB 2021 smoke impact assessment. Kennedy & O’Hagan 2000 (predicting the output of a complex model when fast approximations are available).
Do the upper fidelity levels actually hit the AQS monitors?
Episode-mean PM2.5 predicted at each fidelity level, plotted against AQS-observed episode means for the three California hero fires. Dashed diagonal is the y=x ideal. L1 systematically overstates concentrations (linear scaling on the biased ISRM decomposition). L2/L3/L4 all cluster around the diagonal, with L3 Physics and L4 WRF-Chem closest. This is what Kennedy–O’Hagan fusion actually fuses.
Sources: Preisler et al. 2019 (Mendocino), Liu et al. 2021 (Camp+August), CARB 2021 smoke bulletin (Caldor+Dixie). Reduction fraction = 10%.
What drives the variance at this fidelity level?
Saltelli 2010 first-order (S1) + Jansen 1999 total-order (ST) indices on 4,096 model evaluations of the Inv 17 L3 QoI. Top driver: plume cross-section (ST = 0.41), followed by wind speed (0.24) and PBL height (0.20). Plume cross-section and boundary-layer meteorology dominate L3 PM2.5 variance — not emission factor. Implication: a tighter downwind plume-width constraint (from satellite AOD or HRRR-Smoke) pays off more than additional laboratory EF work.
QoI: episode_mean_pm25_ugm3_camp_aug_2020_L3 · Y mean = 32.74 · Y std = 16.67 · N_base = 512
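The estimator pair can be sketched generically. The QoI below is a toy additive function, not the L3 plume surrogate; the input bounds and evaluation counts are illustrative. The sample budget matches the quoted design: N_base · (k + 2) evaluations, i.e. 512 · 8 = 4,096 for a 6-input QoI.

```python
import numpy as np

def sobol_indices(f, bounds, n_base=512, seed=0):
    """Saltelli 2010 first-order (S1) and Jansen 1999 total-order (ST)
    estimators using the A / B / AB_i radial sample design.

    Costs n_base * (k + 2) model evaluations for k inputs.
    """
    rng = np.random.default_rng(seed)
    k = len(bounds)
    lo, hi = np.array(bounds, float).T
    A = lo + (hi - lo) * rng.random((n_base, k))
    B = lo + (hi - lo) * rng.random((n_base, k))
    fA, fB = f(A), f(B)
    var = np.var(np.concatenate([fA, fB]))
    s1, st = np.empty(k), np.empty(k)
    for i in range(k):
        AB = A.copy()
        AB[:, i] = B[:, i]                          # swap column i only
        fAB = f(AB)
        s1[i] = np.mean(fB * (fAB - fA)) / var      # Saltelli 2010 first-order
        st[i] = np.mean((fA - fAB) ** 2) / (2 * var)  # Jansen 1999 total-order
    return s1, st

# Toy additive QoI: variance splits in proportion to the squared slopes,
# so S1 should come out near (9, 4, 1)/14 and ST ~ S1 (no interactions).
def qoi(x):  # stand-in, not the L3 plume surrogate
    return 3.0 * x[:, 0] + 2.0 * x[:, 1] + 1.0 * x[:, 2]

s1, st = sobol_indices(qoi, bounds=[(0, 1)] * 3, n_base=2048)
```

For an additive QoI the S1 and ST columns coincide; the gap between them at the real L3 QoI (e.g. ST = 0.41 for plume cross-section) is what flags interaction effects.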