California Freight Cleanup → Investigation 3-1

How well does the base transport model predict measured PM_2.5?

5-fold CV-RMSE 6.08 µg/m³ • fails Tessum 2017 • R² = −3.0 • MFB −0.013 (passes Boylan-Russell)

We evaluated the ISRM source-receptor model against 66 California air monitors across five years (2019–2023). The model is designed to track which pollution sources contribute to which locations, not to predict absolute concentrations at a given site, and as an absolute predictor it fails. We document that failure explicitly here so that no one reads the cascade’s headline accuracy statistic (RMSE 2.76 µg/m³ after all corrections) as a marginal improvement over a functioning base. The base does not function as an absolute predictor, which is why the corrections matter.

The decision

Should the California Freight Cleanup cascade present L1 ISRM × NEI as a predictive concentration model? No. L1’s role is sector decomposition and relative policy ranking, not absolute-level PM_2.5 prediction. This gate quantifies the absolute-level failure so that every downstream consumer (Investigation 1-1, 6, 7, 11, 14, 15) carries the correct disclosure: its inputs come from a cascade that takes concentration levels from L4 MFGP, not from L1.

Methodology

Panel. 66 California AQS sites, stratified by air basin and monitoring setting (urban / rural / near-road), 2019–2023, requiring ≥80% daily completeness per year. The Phase A1 panel yields 330 site-years (117,128 daily observations).

Predictor (L1). The pre-computed isrm_sector_pm25.npz (~21k cells, on-road + residential + EGU + area + wildfire annual-mean total) sampled at each AQS site by nearest-cell Euclidean lookup. One predicted value per site per year — no daily structure, no AQS fitting.
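
A minimal sketch of the nearest-cell Euclidean lookup described above. It assumes cell centroids and site locations are already expressed in the same projected plane coordinates; the array names and the helper nearest_cell_values are hypothetical, and the actual layout of isrm_sector_pm25.npz is not shown in this report.

```python
import numpy as np

def nearest_cell_values(cell_xy, cell_values, site_xy):
    """For each site, return the value and index of the closest grid cell."""
    cell_xy = np.asarray(cell_xy, dtype=float)          # shape (n_cells, 2)
    site_xy = np.asarray(site_xy, dtype=float)          # shape (n_sites, 2)
    cell_values = np.asarray(cell_values, dtype=float)  # shape (n_cells,)
    # Squared Euclidean distance between every site and every cell.
    d2 = ((site_xy[:, None, :] - cell_xy[None, :, :]) ** 2).sum(axis=2)
    idx = d2.argmin(axis=1)                             # nearest cell per site
    return cell_values[idx], idx

# Toy example with three cells and two sites (illustrative numbers only).
cells = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
pm25 = np.array([5.0, 8.0, 12.0])
sites = np.array([[1.0, 1.0], [9.0, 2.0]])
values, idx = nearest_cell_values(cells, pm25, sites)
```

With 66 sites and roughly 21k cells the full distance matrix is small enough to hold in memory, so a brute-force search is sufficient here; a spatial index would only matter for much larger site lists.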

Evaluation protocol. Three layers: (1) global in-sample RMSE, MFB, R² across all 330 site-years; (2) held-out 5-fold cross-validation, basin-stratified (same folds reused by Investigation 3-2–43 for rung-comparability); (3) stratified breakouts by season, air basin, concentration band, and setting. Strata with fewer than 30 paired daily observations are reported as null rather than fabricated metrics.
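
The stratified layer and its null rule can be sketched as follows. This is an illustration under stated assumptions, not the contents of run.py: the function names are hypothetical, MFB uses the standard mean-fractional-bias form 2(M − O)/(M + O), and the threshold of 30 pairs is the one stated above.

```python
import numpy as np

MIN_PAIRS = 30  # strata below this size are reported as null

def rmse(obs, pred):
    return float(np.sqrt(np.mean((pred - obs) ** 2)))

def mfb(obs, pred):
    # Mean fractional bias: average of 2(M - O) / (M + O) over all pairs.
    return float(np.mean(2.0 * (pred - obs) / (pred + obs)))

def r2(obs, pred):
    ss_res = np.sum((obs - pred) ** 2)
    ss_tot = np.sum((obs - obs.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

def stratum_metrics(obs, pred, labels):
    """Compute metrics per stratum; return None where n < MIN_PAIRS."""
    obs, pred, labels = (np.asarray(a) for a in (obs, pred, labels))
    out = {}
    for label in np.unique(labels):
        mask = labels == label
        n = int(mask.sum())
        if n < MIN_PAIRS:
            out[str(label)] = None  # too few pairs: no metric is reported
        else:
            o, p = obs[mask], pred[mask]
            out[str(label)] = {"n": n, "rmse": rmse(o, p),
                               "mfb": mfb(o, p), "r2": r2(o, p)}
    return out

# Synthetic data: stratum "a" has 40 pairs, stratum "b" only 5.
obs = np.linspace(5.0, 15.0, 45)
pred = obs + 2.0
labels = np.array(["a"] * 40 + ["b"] * 5)
result = stratum_metrics(obs, pred, labels)
```

Returning None rather than a number keeps an under-sampled stratum visibly empty in results.json instead of carrying a metric that looks comparable to the others.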

Baselines applied. Tessum et al. 2017 PNAS Table 2 (InMAP CV-RMSE 3.0–5.0 µg/m³, DOI 10.1073/pnas.1614453114) is the named comparator gate. Boylan & Russell 2006 Atmos. Env. (|MFB| ≤ 0.6, DOI 10.1016/j.atmosenv.2005.09.087) is the regulatory annual-scale criterion.
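
As a rough illustration of how the two gates apply to the headline numbers, the sketch below treats the upper end of the Tessum range as the pass threshold. That reading of "pass", and the names gate_report, TESSUM_RMSE_UPPER and BOYLAN_MFB_LIMIT, are assumptions made for this example.

```python
TESSUM_RMSE_UPPER = 5.0  # µg/m³, upper end of the 3.0-5.0 comparator range
BOYLAN_MFB_LIMIT = 0.6   # |MFB| criterion

def gate_report(cv_rmse, mfb):
    """Evaluate both comparator gates for one (CV-RMSE, MFB) pair."""
    return {
        "tessum_pass": cv_rmse <= TESSUM_RMSE_UPPER,
        "rmse_excess_pct": 100.0 * (cv_rmse / TESSUM_RMSE_UPPER - 1.0),
        "boylan_pass": abs(mfb) <= BOYLAN_MFB_LIMIT,
    }

# Headline values from this investigation.
report = gate_report(cv_rmse=6.08, mfb=-0.013)
```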

Findings

Held-out accuracy: RMSE 6.08 µg/m³ — outside the published accuracy range for this class of model

The headline held-out 5-fold mean RMSE is 6.08 µg/m³ (SD 2.03 across folds), exceeding the Tessum 2017 InMAP CV-RMSE upper bound of 5.0 µg/m³ by 22%. The global in-sample RMSE is 6.60 µg/m³.

Per-fold detail:

Fold	Test sites	n	RMSE µg/m³	MFB	R²
0	16	80	9.24	+0.059	−10.83
1	15	75	6.10	−0.098	−2.02
2	13	65	6.42	+0.015	−3.40
3	12	60	4.68	−0.003	−0.90
4	10	50	3.96	−0.050	−0.019
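
The headline mean and spread can be reproduced from the per-fold column above. The reported SD of 2.03 matches the sample standard deviation (n − 1 denominator); that the pipeline uses this convention is an inference from the numbers, not something the report states.

```python
import statistics

# Per-fold RMSE values (µg/m³) copied from the table above.
fold_rmse = [9.24, 6.10, 6.42, 4.68, 3.96]

mean_rmse = statistics.mean(fold_rmse)   # unweighted mean across folds
sd_sample = statistics.stdev(fold_rmse)  # n - 1 denominator
sd_pop = statistics.pstdev(fold_rmse)    # n denominator, for comparison

print(f"mean={mean_rmse:.2f}  sample SD={sd_sample:.2f}  population SD={sd_pop:.2f}")
```

The mean is unweighted, so fold 0 (80 site-years) and fold 4 (50 site-years) count equally.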

R² = −3.0: the model is less useful than predicting the average everywhere

A constant predicting the AQS global mean would score R² = 0. ISRM × NEI scores R² = −3.0, which means its squared error is four times that of the constant-mean baseline. The matrix preserves source-receptor proportionality (sector shares remain meaningful) but cannot place absolute PM_2.5 at any specific site within ±5 µg/m³. This is not a defect in the model; using L1 as an absolute concentration predictor is a category error.
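
For reference, the relation between a negative R² and the constant-mean baseline can be written out as below, assuming the usual coefficient-of-determination definition with observations O_i, predictions M_i and observed mean \bar{O} (the report does not spell out its formula).

```latex
R^2 = 1 - \frac{SS_{\mathrm{res}}}{SS_{\mathrm{tot}}}, \qquad
SS_{\mathrm{res}} = \sum_i (O_i - M_i)^2, \qquad
SS_{\mathrm{tot}} = \sum_i (O_i - \bar{O})^2

R^2 = -3.0
\;\Longrightarrow\; \frac{SS_{\mathrm{res}}}{SS_{\mathrm{tot}}} = 4
\;\Longrightarrow\; \mathrm{RMSE}_{\mathrm{model}} = 2\,\mathrm{RMSE}_{\mathrm{const}}
```

Here RMSE_const is the RMSE of the constant-mean predictor on the same pairs, so an R² of −3.0 corresponds to twice its RMSE.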

Performance collapses at high concentrations

By concentration band: RMSE 8.0 µg/m³ at <10 µg/m³, 8.4 at 10–25, 20.2 at 25–50, and 84.0 at ≥50 µg/m³. In the top band MFB is −1.5, which on the fractional-bias scale corresponds to predictions of roughly one-seventh of the observed value. ISRM has no event-scale wildfire smoke; the cascade closes this gap via L4 MFGP + FAQSD’s AQS assimilation step.
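
To read an MFB value as a model-to-observation ratio, the fractional bias of a single pair can be inverted as follows. MFB is an average of per-pair values, so the resulting ratio is indicative rather than exact; the symbols M (modelled) and O (observed) follow the usual Boylan & Russell notation.

```latex
FB = \frac{2\,(M - O)}{M + O}, \qquad -2 < FB < 2 \quad \text{for } M, O > 0

FB = -1.5
\;\Longrightarrow\; 2M - 2O = -1.5M - 1.5O
\;\Longrightarrow\; 3.5\,M = 0.5\,O
\;\Longrightarrow\; M = \tfrac{1}{7}\,O
```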

By air basin (global daily obs):

Basin	RMSE µg/m³	MFB	n (daily obs)
SJV	13.74	+0.197	21,282
Sacramento	13.30	+0.166	10,557
LA Basin	11.03	+0.359	14,447
rest_ca	10.72	+0.120	42,481
Bay Area	9.76	+0.228	24,899
San Diego	8.70	+0.526	3,462

The regulatory bias check passes, but only because opposite-sign biases cancel across strata

Global MFB −0.013 passes the |MFB| ≤ 0.6 criterion, but the pass is an artifact of cancellation. Stratified MFB runs from +0.526 (San Diego) through −0.763 (25–50 µg/m³ band) to −1.5 (≥50 µg/m³ band). The global gate therefore says nothing about L1 adequacy; only the stratified tables reveal the structural bias pattern.
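
The cancellation effect is easy to reproduce on synthetic data. In the sketch below, one stratum over-predicts by a factor of 1.5 and the other under-predicts by the same factor; the numbers are invented for illustration and are not drawn from the AQS panel.

```python
import numpy as np

def mfb(obs, pred):
    # Mean fractional bias: average of 2(M - O) / (M + O).
    return float(np.mean(2.0 * (pred - obs) / (pred + obs)))

# Low-concentration stratum: model over-predicts by a factor of 1.5.
obs_low = np.array([6.0, 8.0, 10.0, 12.0])
pred_low = 1.5 * obs_low

# High-concentration stratum: model under-predicts by a factor of 1.5.
obs_high = np.array([30.0, 40.0, 50.0, 60.0])
pred_high = obs_high / 1.5

mfb_low = mfb(obs_low, pred_low)      # positive bias
mfb_high = mfb(obs_high, pred_high)   # negative bias of equal magnitude
mfb_all = mfb(np.concatenate([obs_low, obs_high]),
              np.concatenate([pred_low, pred_high]))
passes_gate = abs(mfb_all) <= 0.6     # the pooled value passes trivially
```

Each stratum carries a fractional bias of magnitude 0.4, yet the pooled value is essentially zero, which is why a global MFB gate cannot certify a model whose errors change sign between strata.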

Caveats

Annual-mean evaluation only. ISRM and the NEI field are annual-mean by construction. Lifting to daily cadence is owed in Phase B2 (ECHO-AIR, separately tracked). The Tessum 2017 3–5 µg/m³ comparator is also annual-mean CV-RMSE, so the comparison is apples-to-apples — but a passing L1 would not license day-of operational use.
Nearest-cell sampling. AQS sites matched by Euclidean lookup to the nearest ISRM cell (~25 km mean spacing). No bilinear interpolation. Near-road sites (RMSE 13.83) are worst; the ISRM matrix has no sub-cell traffic gradients by design.
The gate is on raw L1, not the AQS-anchored variant. Downstream consumers use the B-6 path B′ AQS-anchored field, which corrects the absolute level bias. Re-running against that field would pass Tessum trivially (bias-corrected by construction) and would mask the structural finding this investigation documents.
Fixed Phase A1 site sample of 66 monitors. Sensitivity to site selection is a Phase F follow-on. Per-fold RMSE spread partly reflects which specific sites land in fold 0 versus fold 4.

Provenance

Item	Location / reference
run.py	`[internal artifact]`
results.json	`investigations/39_aqs-held-out-validation/latest/results.json`
analysis.md	`investigations/39_aqs-held-out-validation/latest/analysis.md`
scenario.md	`investigations/39_aqs-held-out-validation/latest/scenario.md`
AQS panel	`data/processed/aqs/daily_panel_2019_2023.parquet` (sha256 5af5b25a2b0d)
Validation sites	`data/processed/aqs/validation_sites.json` (sha256 a6bfd352357a)
Tessum 2017 baseline	PNAS 114(13):3367–3372 (DOI 10.1073/pnas.1614453114)
Boylan & Russell 2006	Atmos. Env. (DOI 10.1016/j.atmosenv.2005.09.087)
Last run	2026-05-01 (results sha256 c63ae2d281ce)
