California Freight Cleanup → Investigation 3-4
Can a multi-fidelity emulator chain three data sources into a validated PM2.5 surrogate?
2019 RMSE 2.76 µg/m³ • passes Tessum + Boylan-Russell • ρtop range 0.016 across 4 years • FAQSD rung null result
We used a multi-fidelity Gaussian process to chain together three data sources (the base transport model, an EPA-published fused product, and direct monitor readings) and produce concentration estimates that pass both standard regulatory accuracy tests. Two findings: the corrected surrogate reaches a 2019 cross-validated RMSE of 2.76 µg/m³, and adding the EPA fused product as a middle rung made no measurable difference, a clean lesson about when more fidelity rungs actually help.
The decision
Two questions answered here. Does the Le Gratiet linear MFGP chain deliver regulatory-grade PM2.5 estimates (RMSE ≤ 5.0 µg/m³, |MFB| ≤ 0.6) at held-out AQS sites? Yes. Does adding FAQSD as an intermediate rung between CMAQ and AQS improve accuracy over the simpler 3-level chain? No.
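The two pass/fail checks above can be sketched in a few lines. This is a minimal illustration, not the investigation's run.py: the function names and the toy values are hypothetical, and the MFB formula used here (2/N × Σ (M − O)/(M + O)) is the conventional mean fractional bias definition, which the text itself does not spell out.

```python
import math

# Thresholds as quoted in the text above.
TESSUM_RMSE_MAX = 5.0          # µg/m³
BOYLAN_RUSSELL_MFB_MAX = 0.6   # absolute mean fractional bias


def rmse(pred, obs):
    """Root-mean-square error between predictions and observations."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs))


def mean_fractional_bias(pred, obs):
    """Assumed definition: MFB = (2/N) * sum((M - O) / (M + O))."""
    return 2.0 * sum((p - o) / (p + o) for p, o in zip(pred, obs)) / len(obs)


def evaluate(pred, obs):
    """Return both metrics and whether each threshold is met."""
    r = rmse(pred, obs)
    m = mean_fractional_bias(pred, obs)
    return {
        "rmse": r,
        "mfb": m,
        "passes_tessum": r <= TESSUM_RMSE_MAX,
        "passes_boylan_russell": abs(m) <= BOYLAN_RUSSELL_MFB_MAX,
    }


# Toy held-out annual means in µg/m³ (hypothetical, not study data).
result = evaluate(pred=[8.0, 10.0, 12.0], obs=[9.0, 10.0, 11.0])
print(result)
```

Both metrics are computed on held-out sites only, so a pass here says nothing about fit quality at training sites.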
Methodology
Algorithm. Le Gratiet & Garnier 2014 recursive linear MFGP.
Each level fits a scaling coefficient ρk by OLS through origin,
then a GP residual δk(lat, lon) under a
ConstantKernel × Matern(nu=1.5) + WhiteKernel kernel.
At test time, the recursive predict reconstructs higher-rung
outputs from L1 only — CMAQ and FAQSD values at held-out test sites
are never queried. This is the structural sidestep around FAQSD’s
AQS-leakage problem.
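The fit-then-recurse structure described above can be sketched with scikit-learn, whose kernel class names match the ones quoted. This is a simplified illustration under stated assumptions, not the investigation's code: all levels are assumed to be observed at the same training sites, the residual GP is fit on observed lower-level values, the helper names (`fit_level`, `fit_chain`, `predict_chain`) and the synthetic data are hypothetical, and the kernel hyperparameter starting values are arbitrary.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, Matern, WhiteKernel


def fit_level(X, y_lower, y_higher, seed=0):
    """Fit one rung: y_higher ~ rho * y_lower + delta(lat, lon)."""
    # OLS through the origin: rho = <lower, higher> / <lower, lower>.
    rho = float(y_lower @ y_higher / (y_lower @ y_lower))
    # Kernel form quoted in the text; initial values here are arbitrary.
    kernel = ConstantKernel(1.0) * Matern(length_scale=1.0, nu=1.5) + WhiteKernel(0.1)
    gp = GaussianProcessRegressor(kernel=kernel, random_state=seed)
    gp.fit(X, y_higher - rho * y_lower)  # GP on the spatial residual
    return rho, gp


def fit_chain(X, levels):
    """levels: arrays at the same training sites, lowest fidelity first."""
    return [fit_level(X, lo, hi) for lo, hi in zip(levels[:-1], levels[1:])]


def predict_chain(chain, X_new, y_l1_new):
    """Recursive predict: only L1 values are read at the new sites."""
    y = np.asarray(y_l1_new, dtype=float)
    for rho, gp in chain:
        y = rho * y + gp.predict(X_new)  # rebuild the next rung from the last
    return y


# Synthetic stand-in data (hypothetical): 40 sites, 30 train / 10 held out.
rng = np.random.default_rng(0)
X = rng.uniform([32.5, -124.0], [42.0, -114.0], size=(40, 2))  # lat, lon
l1 = 5.0 + rng.gamma(2.0, 2.0, size=40)
mid = 0.4 * l1 + 3.0 + 0.5 * np.sin(X[:, 0])
top = 1.0 * mid + 0.2 * np.cos(X[:, 1])

chain = fit_chain(X[:30], [l1[:30], mid[:30], top[:30]])
pred = predict_chain(chain, X[30:], l1[30:])
print([round(rho, 3) for rho, _ in chain], pred.shape)
```

Note that `predict_chain` never receives mid-rung values for the held-out sites. That is the property the text relies on to keep FAQSD's AQS-derived information out of test-site predictions.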
Multi-year hybrid chain (Phase 6c.3). 2019 uses the full 4-level chain (L1→CMAQ EQUATES→FAQSD→AQS). 2020–2022 use a 3-level chain (L1→FAQSD→AQS) because CMAQ EQUATES v5.3.2 ATOTIJ is frozen at 2019 (no EPA/CMAS/AWS Open Data release of 2020+ as of May 2026). FAQSD substitutes as the mid rung for 2020–2022.
3-level vs 4-level A/B. The 2019 run simultaneously fits both the 4-level chain and a 3-level baseline (L1→CMAQ→AQS, same sites/folds, same kernel) to isolate the value of the FAQSD rung.
Splits. Cross-validation is 5-fold spatial CV on 64 CA sites (2 dropped for missing FAQSD data). Investigation 3-1’s 5-fold basin-stratified site groupings are reused across all years so that rung-to-rung RMSE comparisons are like-for-like.
Findings
2019 anchor: RMSE 2.76 µg/m³ — passes both accuracy standards
5-fold CV-RMSE 2.762 µg/m³ (SD 0.892), MFB −0.055. Passes Tessum 2017 (≤ 5.0) and Boylan-Russell 2006 (|MFB| ≤ 0.6). Per-fold range 1.68–4.13 µg/m³.
| Year | Chain | Sites | RMSE µg/m³ | MFB | Tessum |
|---|---|---|---|---|---|
| 2019 | 4-level | 64 | 2.762 | −0.055 | Pass |
| 2020 | 3-level | 64 | 5.088 | −0.080 | Fail (wildfire) |
| 2021 | 3-level | 64 | 4.003 | −0.076 | Pass |
| 2022 | 3-level | 64 | 3.184 | −0.054 | Pass |
Multi-year mean RMSE 3.759 µg/m³ (SD 1.025 across 4 years).
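As a quick arithmetic check, the multi-year summary can be reproduced from the four table rows with the standard library. The reported SD of 1.025 matches the sample standard deviation (n − 1 denominator); the variable names below are illustrative only.

```python
from statistics import mean, stdev

# Per-year CV-RMSE values copied from the table above (µg/m³).
rmse_by_year = {2019: 2.762, 2020: 5.088, 2021: 4.003, 2022: 3.184}

values = list(rmse_by_year.values())
multi_year_mean = mean(values)
multi_year_sd = stdev(values)  # sample SD, n - 1 denominator

# Years exceeding the RMSE <= 5.0 µg/m³ threshold quoted in the text.
failing_years = [year for year, r in rmse_by_year.items() if r > 5.0]

print(round(multi_year_mean, 3), round(multi_year_sd, 3), failing_years)
```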
Adding the EPA fused product as a middle rung adds nothing — a clean example of redundant fidelity
The 4-level MFGP (2019, with FAQSD as L3) scores RMSE 2.762 µg/m³. The 3-level baseline (L1→CMAQ→AQS) scores 2.715 µg/m³. The 4-level chain is worse by 0.047 µg/m³, well inside fold noise (SD 0.892). The diagnostic is ρ3 ≈ 0.995 across all five folds: at training sites, FAQSD and AQS are essentially identical (FAQSD is fit on AQS, so they share information). Once the leakage is structurally blocked at test sites, FAQSD adds GP variance without adding bias correction.
This is the cleanest ADM lesson in the study: more fidelity rungs only help when each rung carries information that is not redundant with what follows it. FAQSD is redundant with AQS at the training-site level. The 3-level chain is the production surrogate.
The coupling between the corrected surrogate and monitors is stable year to year (range 0.016)
Per-year ρtop (the top-rung→AQS coupling):
| Year | Chain | ρ1 (L1→mid) | ρtop |
|---|---|---|---|
| 2019 | 4-level (CMAQ mid) | 0.368 | 0.9954 |
| 2020 | 3-level (FAQSD mid) | 0.811 | 0.9929 |
| 2021 | 3-level (FAQSD mid) | 0.664 | 1.0023 |
| 2022 | 3-level (FAQSD mid) | 0.588 | 1.0093 |
ρtop range 0.016 across four years. The linear MFGP chain generalizes without year-fixed effects — the constant-ρ assumption holds at the top rung. ρ1 varies by 0.44 but this is partly a definition-shift artifact (mid rung changes from CMAQ in 2019 to FAQSD in 2020–2022).
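The ranges quoted above follow directly from the table. The short check below also restricts ρ1 to the three years that share FAQSD as the mid rung, which gives a rough sense of how much of the 0.44 spread comes from the 2019 definition shift. The helper name is illustrative, and the restricted range is a derived figure, not one reported by the investigation.

```python
# Per-year couplings copied from the table above.
rho_top = {2019: 0.9954, 2020: 0.9929, 2021: 1.0023, 2022: 1.0093}
rho_1 = {2019: 0.368, 2020: 0.811, 2021: 0.664, 2022: 0.588}


def value_range(values):
    """Max minus min of a collection of numbers."""
    values = list(values)
    return max(values) - min(values)


rho_top_range = value_range(rho_top.values())
rho_1_range = value_range(rho_1.values())

# Same mid rung (FAQSD) in 2020-2022, so these rho_1 values are comparable.
rho_1_faqsd_range = value_range(v for year, v in rho_1.items() if year >= 2020)

print(round(rho_top_range, 4), round(rho_1_range, 3), round(rho_1_faqsd_range, 3))
```

On these numbers, roughly half of the ρ1 spread remains when only the FAQSD-mid years are compared, which is consistent with the "partly an artifact" wording.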
2020 wildfire year: RMSE 5.09 — the base model has no wildfire smoke; this is an input problem, not a chain failure
The 2020 RMSE (5.09, failing Tessum) reflects the LNU/SCU/August Complex fire season: ISRM × NEI does not include wildfire smoke (NEI smoke kernels are not in the ISRM matrix). FAQSD partially recovers fire-season PM via AQS data assimilation in the 3-level chain. The 2021 (4.00) and 2022 (3.18) results confirm this is a wildfire-year anomaly, not chain degradation.
Year-by-year: the corrected surrogate beats the base model every year; the satellite reference struggles in the 2020 wildfire season
| Year | L1 RMSE (µg/m³) | L3 van Donkelaar RMSE (µg/m³) | L4 MFGP RMSE (µg/m³) | Δ(L4−L1) (µg/m³) |
|---|---|---|---|---|
| 2019 | 6.443 | 1.081 | 2.762 | −3.68 |
| 2020 | 5.975 | 8.423 | 5.088 | −0.89 |
| 2021 | 5.725 | 3.494 | 4.003 | −1.72 |
| 2022 | 5.824 | 1.770 | 3.184 | −2.64 |
Note: The L3 vD 2019 value (1.08 µg/m³) above is evaluated at the 64 training sites used by this investigation’s 2019 MFGP chain; Investigation 3-3’s headline 4.34 µg/m³ is evaluated across all 5 years and 66 sites (330 obs) against pooled AQS annual means. These are different evaluation scopes and should not be compared directly.
L3 van Donkelaar scores an anomalously high RMSE in 2020 (8.42 µg/m³) because the satellite-fused annual-mean product cannot resolve the extreme wildfire smoke episodes. L4 MFGP still beats L1 that year by 0.89 µg/m³.
Caveats
- AQS-leakage is bounded, not eliminated. The canonical recursive predict blocks direct FAQSD use at test sites but training-fold FAQSD values still encode neighbor AQS information. ρtop ≈ 1.0 is consistent with this: the FAQSD–AQS coupling at training sites is identity, so ρtop carries no bias correction; the spatial residual δ carries what remains.
- Hybrid chain: 2019 is 4-level, 2020–2022 are 3-level. CMAQ EQUATES is frozen at 2019. The mid-rung identity changes between years. Per-year ρ1 comparisons are not apples-to-apples; only ρtop is a clean cross-year stability metric.
- CMAQ EQUATES ATOTIJ at site-locations is pre-extracted. The input is a CSV of per-site annual means, not a re-run of CMAQ at AQS coordinates. EPA’s spatial-sampling choices are inherited untested.
- 5-fold spatial CV is leave-cluster-out, not leave-region-out. Out-of-CA generalization is not tested here. The headline RMSE is a California-only number. See Investigation 3-8 for the 7-state portability test.
- Linear MFGP, not NARGP. If the true chain has multiplicative or regime-dependent structure (NH3-limited vs VOC-limited aerosol regimes), the linear chain will saturate. The ρtop stability finding (range 0.016) suggests the linear form is not currently the binding constraint for annual-mean CA evaluation.
Provenance
| Item | Value |
|---|---|
| run.py | [internal artifact] |
| results.json | investigations/42_l4-mfgp-corrected/latest/results.json |
| Method label | le_gratiet_2014_multi_year_linear_mfgp |
| Kernel | ConstantKernel × Matern(nu=1.5) + WhiteKernel |
| CMAQ input | data/raw/cmaq_equates/ca_site_cmaq_pm25_2019.csv (sha256 35d812f9ba52) |
| FAQSD inputs | 2019–2022 daily .txt.gz (sha256s in results.json inputs_from) |
| Upstream: Investigation 3-1 folds + L1 | sha256 c63ae2d281ce |
| Upstream: Investigation 3-3 L3 predictions | artifact sha256 621a2d74fe13 |
| Upstream: Investigation 3-5 L5 2019 RMSE | 0.857 µg/m³ (sha256 278e28fe52db) |
| Last run | 2026-05-02 (results sha256 b89d8204eb15) |