California Freight Cleanup → Investigation 3-4

Can a multi-fidelity emulator chain three data sources into a validated PM_2.5 surrogate?

2019 RMSE 2.76 µg/m³ • passes Tessum + Boylan-Russell • ρ_top range 0.016 across 4 years • FAQSD rung null result

We used a multi-fidelity Gaussian process (MFGP) to chain three data sources (the base transport model, an EPA-published fused product, and direct monitor readings) into concentration estimates that pass both standard regulatory accuracy tests in 2019. Two findings: the corrected surrogate reaches a 2019 cross-validated RMSE of 2.76 µg/m³, and adding the EPA fused product as a middle rung made no measurable difference, a clean lesson about when more fidelity rungs actually help.

The decision

Two questions are answered here. Does the Le Gratiet linear MFGP chain deliver regulatory-grade PM_2.5 estimates (RMSE ≤ 5.0 µg/m³, |MFB| ≤ 0.6) at held-out AQS sites? Yes in 2019, 2021, and 2022; 2020 misses the RMSE threshold at 5.09 µg/m³ because of wildfire smoke. Does adding FAQSD as an intermediate rung between CMAQ and AQS improve accuracy over the simpler 3-level chain? No.

Methodology

Algorithm. Le Gratiet & Garnier 2014 recursive linear MFGP. Each level fits a scaling coefficient ρ_k by OLS through the origin, then a GP residual δ_k(lat, lon) under a ConstantKernel × Matern(nu=1.5) + WhiteKernel kernel. At test time, the recursive predict reconstructs higher-rung outputs from L1 only: CMAQ and FAQSD values at held-out test sites are never queried. This structurally sidesteps FAQSD’s AQS-leakage problem.
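A minimal sketch of one way the recursion described above could look, in standard-library Python. This is not the code in run.py: the function names (fit_rho, matern32, solve, fit_level, predict_chain) are hypothetical, the kernel hyperparameters are fixed by hand rather than fitted, only the GP posterior mean is computed, and the four sites and their values are synthetic.

```python
import math

def fit_rho(lower, upper):
    """OLS through the origin: rho = sum(l*u) / sum(l*l)."""
    return sum(l * u for l, u in zip(lower, upper)) / sum(l * l for l in lower)

def matern32(a, b, length=1.0, variance=1.0):
    """Matern nu=1.5 covariance between two (lat, lon) points.
    length and variance are assumed values, not fitted ones."""
    r = math.sqrt(3.0) * math.dist(a, b) / length
    return variance * (1.0 + r) * math.exp(-r)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def fit_level(coords, lower, upper, noise=1e-3):
    """Fit one rung: rho_k, then GP mean weights for the residual delta_k.
    The noise term on the diagonal plays the role of the white-noise kernel."""
    rho = fit_rho(lower, upper)
    resid = [u - rho * l for l, u in zip(lower, upper)]
    K = [[matern32(a, b) + (noise if i == j else 0.0)
          for j, b in enumerate(coords)] for i, a in enumerate(coords)]
    return rho, coords, solve(K, resid)

def predict_chain(levels, x, l1_value):
    """Recursive predict: start from the L1 value at x and apply each rung.
    No higher-rung data at x is used, only the fitted rho and residual."""
    value = l1_value
    for rho, coords, alpha in levels:
        delta = sum(a * matern32(x, c) for a, c in zip(alpha, coords))
        value = rho * value + delta
    return value

# Synthetic training data: four sites, three rungs (L1 -> mid -> AQS).
coords = [(34.0, -118.2), (36.7, -119.8), (37.8, -122.3), (38.6, -121.5)]
l1 = [14.0, 16.0, 9.0, 10.0]
mid = [5.5, 6.0, 3.2, 4.1]
aqs = [11.0, 13.5, 8.0, 9.2]

levels = [fit_level(coords, l1, mid), fit_level(coords, mid, aqs)]
pred = predict_chain(levels, (35.4, -119.0), l1_value=15.0)
print(f"predicted top-rung value at the unseen site: {pred:.2f}")
```

The design point the sketch tries to capture is that predict_chain takes only the L1 value at the target location, which is how the mid-rung products stay out of the test-site prediction.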

Multi-year hybrid chain (Phase 6c.3). 2019 uses the full 4-level chain (L1→CMAQ EQUATES→FAQSD→AQS). 2020–2022 use a 3-level chain (L1→FAQSD→AQS) because CMAQ EQUATES v5.3.2 ATOTIJ is frozen at 2019 (no EPA/CMAS/AWS Open Data release of 2020+ as of May 2026). FAQSD substitutes as the mid rung for 2020–2022.
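The year-to-chain mapping above can be written down as a small lookup. This is an illustrative sketch; CHAINS and chain_for_year are hypothetical names, not identifiers from run.py.

```python
# Rung order per chain, lowest fidelity first, as described in the text.
CHAINS = {
    "4-level": ["L1", "CMAQ EQUATES", "FAQSD", "AQS"],
    "3-level": ["L1", "FAQSD", "AQS"],
}

def chain_for_year(year):
    """Return (label, rungs) for a study year.
    2019 is the only year with a CMAQ EQUATES rung available."""
    if year == 2019:
        return "4-level", CHAINS["4-level"]
    if 2020 <= year <= 2022:
        return "3-level", CHAINS["3-level"]
    raise ValueError(f"no chain defined for {year}")

for y in range(2019, 2023):
    label, rungs = chain_for_year(y)
    print(y, label, " -> ".join(rungs))
```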

3-level vs 4-level A/B. The 2019 run simultaneously fits both the 4-level chain and a 3-level baseline (L1→CMAQ→AQS, same sites/folds, same kernel) to isolate the value of the FAQSD rung.

Splits. Investigation 3-1’s 5-fold basin-stratified site groupings are reused across all years to make rung-to-rung RMSE comparisons apples-to-apples. 5-fold spatial CV on 64 CA sites (2 dropped for missing FAQSD data).

Findings

Figure: L4 MFGP cross-validation predicted vs observed PM_2.5 at AQS sites, CA 2019. 5-fold spatial CV, 64 sites. RMSE 2.76 µg/m³; OLS slope and R² shown. Points colored by fold; dashed line is 1:1.

2019 anchor: RMSE 2.76 µg/m³ — passes both accuracy standards

5-fold CV-RMSE 2.762 µg/m³ (SD 0.892), MFB −0.055. Passes Tessum 2017 (≤ 5.0) and Boylan-Russell 2006 (|MFB| ≤ 0.6). Per-fold range 1.68–4.13 µg/m³.
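For reference, a short sketch of the two pass/fail checks, assuming the usual definitions: RMSE as the root of the mean squared error, and mean fractional bias as the average of 2(M − O)/(M + O) over model/observation pairs. The function names and the toy numbers are illustrative; whether run.py computes the metrics in exactly this form is an assumption.

```python
import math

def rmse(pred, obs):
    """Root-mean-square error, in the units of the inputs."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs))

def mean_fractional_bias(pred, obs):
    """Mean of 2*(M - O)/(M + O); bounded between -2 and +2."""
    return sum(2.0 * (p - o) / (p + o) for p, o in zip(pred, obs)) / len(obs)

def passes_both(pred, obs, rmse_max=5.0, mfb_max=0.6):
    """Thresholds as quoted in the text: RMSE <= 5.0 and |MFB| <= 0.6."""
    return rmse(pred, obs) <= rmse_max and abs(mean_fractional_bias(pred, obs)) <= mfb_max

# Toy values (not study data).
pred = [10.0, 12.0]
obs = [8.0, 12.0]
print(rmse(pred, obs), mean_fractional_bias(pred, obs), passes_both(pred, obs))
```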

Year	Chain	Sites	RMSE µg/m³	MFB	Tessum
2019	4-level	64	2.762	−0.055	Pass
2020	3-level	64	5.088	−0.080	Fail (wildfire)
2021	3-level	64	4.003	−0.076	Pass
2022	3-level	64	3.184	−0.054	Pass

Multi-year mean RMSE 3.759 µg/m³ (SD 1.025 across 4 years).
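The multi-year summary can be reproduced from the four yearly RMSE values in the table. The sketch below uses the standard library; it suggests the quoted SD is the sample standard deviation (n − 1 denominator), which is an inference from the arithmetic rather than something the report states.

```python
import statistics

# Yearly CV-RMSE values from the table, in µg/m³.
rmse_by_year = {2019: 2.762, 2020: 5.088, 2021: 4.003, 2022: 3.184}
values = list(rmse_by_year.values())

mean_rmse = statistics.mean(values)
sample_sd = statistics.stdev(values)       # n - 1 denominator
population_sd = statistics.pstdev(values)  # n denominator, for comparison

print(f"mean = {mean_rmse:.3f}, sample SD = {sample_sd:.3f}, "
      f"population SD = {population_sd:.3f}")
```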

Adding the EPA fused product as a middle rung adds nothing — a clean example of redundant fidelity

The 4-level MFGP (2019, with FAQSD as L3) scores RMSE 2.762 µg/m³. The 3-level baseline (L1→CMAQ→AQS) scores 2.715 µg/m³. The 4-level chain is worse by 0.047 µg/m³, well inside fold noise (SD 0.892). The diagnostic is ρ₃ ≈ 0.995 across all five folds: at training sites, FAQSD and AQS are essentially identical (FAQSD is fit on AQS, so they share information). Once the leakage is structurally blocked at test sites, FAQSD adds GP variance without adding bias correction.
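To put the A/B gap in proportion, the difference can be set against the per-fold spread. A minimal sketch using only the numbers quoted above; the variable names are illustrative.

```python
rmse_4_level = 2.762   # µg/m³, 4-level chain, 2019
rmse_3_level = 2.715   # µg/m³, 3-level baseline, 2019
fold_sd = 0.892        # SD of per-fold RMSE reported for the 4-level chain

delta = rmse_4_level - rmse_3_level
ratio = delta / fold_sd

print(f"delta = {delta:.3f} µg/m³, {ratio:.1%} of one fold SD")
```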

This is the cleanest ADM lesson in the study: more fidelity rungs only help when each rung carries information that is not redundant with what follows it. FAQSD is redundant with AQS at the training-site level. The 3-level chain is the production surrogate.
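A toy illustration of the redundancy argument, on synthetic numbers rather than study data. When a rung is a near copy of the one above it, the through-origin scaling coefficient lands at about 1 and the residual has almost nothing left to explain; a genuinely coarser rung leaves a residual worth modelling. fit_rho and the three series are hypothetical.

```python
def fit_rho(lower, upper):
    """OLS through the origin: rho = sum(l*u) / sum(l*l)."""
    return sum(l * u for l, u in zip(lower, upper)) / sum(l * l for l in lower)

# Synthetic top-rung values and two candidate rungs beneath them.
aqs = [11.0, 13.5, 8.0, 9.2, 12.1]
fused = [a + e for a, e in zip(aqs, [0.05, -0.04, 0.02, -0.03, 0.01])]  # near copy
coarse = [5.5, 6.0, 3.2, 4.1, 6.3]                                       # biased model

summary = {}
for name, lower in [("fused", fused), ("coarse", coarse)]:
    rho = fit_rho(lower, aqs)
    max_resid = max(abs(u - rho * l) for l, u in zip(lower, aqs))
    summary[name] = (rho, max_resid)
    print(f"{name}: rho = {rho:.4f}, largest |residual| = {max_resid:.3f}")
```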

The coupling between the corrected surrogate and monitors is stable year to year (range 0.016)

Per-year ρ_top (the top-rung→AQS coupling):

Year	Chain	ρ₁ (L1→mid)	ρ_top
2019	4-level (CMAQ mid)	0.368	0.9954
2020	3-level (FAQSD mid)	0.811	0.9929
2021	3-level (FAQSD mid)	0.664	1.0023
2022	3-level (FAQSD mid)	0.588	1.0093

ρ_top range 0.016 across four years. The linear MFGP chain generalizes without year-fixed effects — the constant-ρ assumption holds at the top rung. ρ₁ varies by 0.44 but this is partly a definition-shift artifact (mid rung changes from CMAQ in 2019 to FAQSD in 2020–2022).
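The two spreads quoted above are max minus min over the four years and can be rechecked from the table. A small sketch; the dictionary names are illustrative.

```python
# Per-year coefficients from the table.
rho_top = {2019: 0.9954, 2020: 0.9929, 2021: 1.0023, 2022: 1.0093}
rho_1 = {2019: 0.368, 2020: 0.811, 2021: 0.664, 2022: 0.588}

def spread(values):
    """Full range: largest value minus smallest value."""
    return max(values) - min(values)

rho_top_range = spread(rho_top.values())
rho_1_range = spread(rho_1.values())

print(f"rho_top range = {rho_top_range:.4f} "
      f"(half-range {rho_top_range / 2:.4f})")
print(f"rho_1 range   = {rho_1_range:.3f}")
```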

2020 wildfire year: RMSE 5.09 — the base model has no wildfire smoke; this is an input problem, not a chain failure

The 2020 RMSE (5.09, failing Tessum) reflects the LNU/SCU/August Complex fire season: ISRM × NEI does not include wildfire smoke (NEI smoke kernels are not in the ISRM matrix). FAQSD partially recovers fire-season PM via AQS data assimilation in the 3-level chain. The 2021 (4.00) and 2022 (3.18) results confirm this is a wildfire-year anomaly, not chain degradation.

Year-by-year: the corrected surrogate beats the base model every year; the satellite reference struggles in the 2020 wildfire season

Year	L1 RMSE µg/m³	L3 vD RMSE µg/m³	L4 RMSE µg/m³	Δ(L4−L1) µg/m³
2019	6.443	1.081	2.762	−3.68
2020	5.975	8.423	5.088	−0.89
2021	5.725	3.494	4.003	−1.72
2022	5.824	1.770	3.184	−2.64

Note: The L3 vD 2019 value (1.08 µg/m³) above is evaluated at the 64 training sites used by this investigation’s 2019 MFGP chain; Investigation 3-3’s headline 4.34 µg/m³ is evaluated across all 5 years and 66 sites (330 obs) against pooled AQS annual means. These are different evaluation scopes and should not be compared directly.

L3 van Donkelaar scores an anomalously high RMSE in 2020 (8.42 µg/m³) because the satellite-fused annual-mean product cannot resolve the extreme wildfire smoke episodes. L4 MFGP still beats L1 that year by 0.89 µg/m³.
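The Δ(L4−L1) column follows directly from the L1 and L4 columns of the year-by-year table. A short recomputation, with illustrative names:

```python
# (L1 RMSE, L4 RMSE) per year from the table, in µg/m³.
rmse_l1_l4 = {
    2019: (6.443, 2.762),
    2020: (5.975, 5.088),
    2021: (5.725, 4.003),
    2022: (5.824, 3.184),
}

# Negative values mean the corrected surrogate (L4) has lower error than L1.
delta = {year: round(l4 - l1, 2) for year, (l1, l4) in rmse_l1_l4.items()}
l4_wins_every_year = all(d < 0 for d in delta.values())

print(delta, l4_wins_every_year)
```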

Caveats

AQS-leakage is bounded, not eliminated. The canonical recursive predict blocks direct FAQSD use at test sites, but training-fold FAQSD values still encode neighbor AQS information. ρ_top ≈ 1.0 is consistent with this: the FAQSD–AQS coupling at training sites is effectively the identity, so ρ_top carries no bias correction; the spatial residual δ carries what remains.
Hybrid chain: 2019 is 4-level, 2020–2022 are 3-level. CMAQ EQUATES is frozen at 2019. The mid-rung identity changes between years. Per-year ρ₁ comparisons are not apples-to-apples; only ρ_top is a clean cross-year stability metric.
CMAQ EQUATES ATOTIJ at site-locations is pre-extracted. The input is a CSV of per-site annual means, not a re-run of CMAQ at AQS coordinates. EPA’s spatial-sampling choices are inherited untested.
5-fold spatial CV is leave-cluster-out, not leave-region-out. Out-of-CA generalization is not tested here. The headline RMSE is a California-only number. See Investigation 3-8 for the 7-state portability test.
Linear MFGP, not NARGP. If the true chain has multiplicative or regime-dependent structure (NH3-limited vs VOC-limited aerosol regimes), the linear chain will saturate. The ρ_top stability finding (range 0.016) suggests the linear form is not currently the binding constraint for annual-mean CA evaluation.
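In symbols, the last caveat contrasts a constant-scaling recursion with a nonlinear one. The notation below (f_k for the rung-k field at location x, g_k for the nonlinear mapping) is assumed for illustration and is not taken from run.py.

```latex
% Linear recursive form used in this investigation: one constant rho_k per rung,
% plus a GP residual over (lat, lon).
f_k(x) = \rho_k \, f_{k-1}(x) + \delta_k(x)

% NARGP-style form: the previous rung enters as an input to a GP-distributed
% function g_k, so the coupling can vary with location and with the level of f_{k-1}.
f_k(x) = g_k\bigl(x,\; f_{k-1}(x)\bigr)
```

The linear form is the special case g_k(x, f) = ρ_k f + δ_k(x), which is why a stable ρ_top is evidence that the extra flexibility is not yet needed at the top rung.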

Provenance

Item	Value
run.py	`[internal artifact]`
results.json	`investigations/42_l4-mfgp-corrected/latest/results.json`
Method label	`le_gratiet_2014_multi_year_linear_mfgp`
Kernel	ConstantKernel × Matern(nu=1.5) + WhiteKernel
CMAQ input	`data/raw/cmaq_equates/ca_site_cmaq_pm25_2019.csv` (sha256 35d812f9ba52)
FAQSD inputs	2019–2022 daily .txt.gz (sha256s in results.json inputs_from)
Upstream: Investigation 3-1 folds + L1	sha256 c63ae2d281ce
Upstream: Investigation 3-3 L3 predictions	artifact sha256 621a2d74fe13
Upstream: Investigation 3-5 L5 2019 RMSE	0.857 µg/m³ (sha256 278e28fe52db)
Last run	2026-05-02 (results sha256 b89d8204eb15)
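A sketch of how the short hashes in the table could be checked against a local file, assuming they are the leading 12 hex digits of the full SHA-256 digest (an inference from their length, not something the report states). sha256_prefix and matches_prefix are hypothetical helper names.

```python
import hashlib

def sha256_prefix(path, n=12):
    """Return the first n hex digits of the file's SHA-256 digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read in 1 MiB chunks so large inputs do not need to fit in memory.
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()[:n]

def matches_prefix(path, expected_prefix):
    """True when the file's digest starts with the recorded prefix."""
    expected = expected_prefix.lower()
    return sha256_prefix(path, len(expected)) == expected
```

For example, matches_prefix("data/raw/cmaq_equates/ca_site_cmaq_pm25_2019.csv", "35d812f9ba52") would confirm the CMAQ input listed above, provided the file is present at that path.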
