California Freight Cleanup → Investigation 3-11
Which physics-informed GP kernel modification actually helps?
Kernel C (CMAQ covariate): 2.542 µg/m³ (−0.220 µg/m³, about 8% below the 2.762 baseline on average; three of five folds improve, one is flat, one worsens) • A/B: both worse
The CEC freight solicitation proposal scoring rubric flags a physics-informed GP kernel (advection-diffusion structure in the covariance) as the highest-leverage methodology lever for Criterion 1 (Technical Merit). Investigation 3-11 is the honest empirical test: three candidate kernels, one single-axis change per experiment, materiality threshold set before running. The lever does pull weight, but only via mean-prior augmentation (Kennedy-O’Hagan 2001 style), not via covariance-form physics encoding.
The decision
The CEC proposal scoring rewards physics-informed model improvements. But the honest version of that claim requires an actual experiment: does encoding physical knowledge about atmospheric transport into the model improve accuracy on real California monitoring data, or is it purely a methodological signal with no numerical payoff?
Either verdict is defensible. A real improvement supports the technical-merit narrative. A clean negative — three variants tested, none beat the baseline — is itself a rigor signal. We found a partial positive: one of three variants wins by 8%, but the mechanism is different from what the standard atmospheric physics framing would suggest.
Methodology
All three candidate kernels are tested as drop-in replacements for
the isotropic Matern(ν=1.5) + WhiteKernel residual fit inside
Investigation 3-4’s exact 2019 4-level MFGP chain. Every other chain element is
preserved bit-identical: the 64-site California AQS panel, the Investigation 3-1 5-fold
spatial CV split, the ρk OLS-through-origin layer, and the canonical
recursive predict using only L1 at test. Only the δk GP kernel varies.
The baseline RMSE must reproduce Investigation 3-4’s 2.762 µg/m³ exactly
before any kernel comparison is valid; it does (2.762, Δ = 0.000).
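
The fixed part of each chain level can be sketched as follows: the ρk layer is a least-squares slope with no intercept, and the δk GP (the only element that changes between experiments) is fit to what that slope leaves unexplained. The function names and toy numbers are illustrative assumptions, not the investigation's code.

```python
def rho_through_origin(f_prev, y):
    """OLS slope with no intercept: rho = sum(f * y) / sum(f * f)."""
    num = sum(f * t for f, t in zip(f_prev, y))
    den = sum(f * f for f in f_prev)
    if den == 0.0:
        raise ValueError("previous-level predictions are all zero")
    return num / den


def residual_targets(f_prev, y, rho):
    """Targets for the delta GP: the part of y not explained by rho * f_prev."""
    return [t - rho * f for f, t in zip(f_prev, y)]


# Toy values (hypothetical, not AQS data).
f_prev = [8.0, 10.0, 12.0]   # previous-level predictions at the training sites
y = [9.0, 11.5, 12.5]        # current-level values at the same sites
rho = rho_through_origin(f_prev, y)
delta = residual_targets(f_prev, y, rho)
print(f"rho = {rho:.4f}, residuals = {[round(d, 3) for d in delta]}")
```

Because the slope is fit without an intercept, the residuals are orthogonal to the previous-level predictions, so any remaining spatial structure is left for the δk kernel to model.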
| Kernel | Description | Physics encoding |
|---|---|---|
| Baseline | Isotropic Matern(ν=1.5) + WhiteKernel | None—Investigation 3-4 exact form |
| Kernel A | Input (lat,lon) rotated to align with CA-aggregate prevailing-wind transport axis (75° ENE per NCEP 2019 850 hPa); ARD Matern | Anisotropic advection: correlation falls off slower along-wind than cross-wind |
| Kernel B | ARD Matern with independent (lat, lon) length-scales fit by MLE | Data-driven anisotropy; reality-check for A |
| Kernel C | Input extended to [lat, lon, cmaq_z]; cmaq_z filled with ρ1 × f1_test at predict time | Chemistry-prior covariate: GP encodes spatial proximity AND chemistry-prior similarity |
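
To make the differences between the rows above concrete, here is a minimal pure-Python sketch of a Matérn ν=1.5 covariance with one length scale per input dimension, plus the wind-frame rotation used by Kernel A. All length scales and coordinates are hypothetical, the local (east, north) kilometre frame is an assumption about how the rotation is applied, and the WhiteKernel noise term is omitted.

```python
import math


def wind_frame(east_km, north_km, transport_bearing_deg=75.0):
    """Project an (east, north) offset onto along-wind and cross-wind axes.

    The bearing is a compass angle, measured clockwise from north.
    """
    b = math.radians(transport_bearing_deg)
    along = east_km * math.sin(b) + north_km * math.cos(b)
    cross = east_km * math.cos(b) - north_km * math.sin(b)
    return (along, cross)


def matern32(p, q, length_scales):
    """Matern nu=1.5 covariance with one length scale per input dimension."""
    r = math.sqrt(sum(((pi - qi) / ls) ** 2
                      for pi, qi, ls in zip(p, q, length_scales)))
    s = math.sqrt(3.0) * r
    return (1.0 + s) * math.exp(-s)


bearing = math.radians(75.0)
origin = (0.0, 0.0)
downwind = (50.0 * math.sin(bearing), 50.0 * math.cos(bearing))    # 50 km along the axis
crosswind = (50.0 * math.cos(bearing), -50.0 * math.sin(bearing))  # 50 km across it

iso = [80.0, 80.0]      # baseline: one shared length scale (hypothetical value)
aniso = [100.0, 50.0]   # Kernel A: along-wind, cross-wind (hypothetical values)

# Baseline: direction does not matter, only distance.
k_iso_down = matern32(origin, downwind, iso)
k_iso_cross = matern32(origin, crosswind, iso)

# Kernel A: rotate first, then use separate length scales per axis.
k_along = matern32(wind_frame(*origin), wind_frame(*downwind), aniso)
k_cross = matern32(wind_frame(*origin), wind_frame(*crosswind), aniso)

# Kernel C: same form with a third input, the chemistry-prior value cmaq_z.
k_c = matern32((0.0, 0.0, 9.0), (30.0, 10.0, 11.0), [80.0, 80.0, 4.0])

print(k_iso_down, k_iso_cross, k_along, k_cross, k_c)
```

Kernel B corresponds to calling `matern32` on unrotated coordinates with two independent length scales, leaving the fit to choose the stretch instead of fixing the axis in advance.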
The materiality threshold is set at 0.140 µg/m³ (≈ 5% of the 2.762 baseline, and larger than typical fold-to-fold noise). A kernel that wins by less than this is reported as “no improvement below materiality”, not as a win. The CA-aggregate wind direction (255° from-direction, hence a 75° transport axis) is taken from the NCEP/NCAR Reanalysis 2019 annual-mean 850 hPa composite.
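
The materiality rule can be written as a small decision function. This is a sketch under stated assumptions: the RMSE values are the ones reported in the findings, the label for a below-threshold win is quoted from the rule above, and the other two labels are illustrative wording rather than the investigation's own.

```python
BASELINE_RMSE = 2.762   # ug/m3, Investigation 3-4 baseline
MATERIALITY = 0.140     # ug/m3, fixed before any kernel was run

# Aggregate RMSE per candidate kernel, as reported in the findings.
KERNEL_RMSE = {"A": 3.012, "B": 2.871, "C": 2.542}


def verdict(rmse, baseline=BASELINE_RMSE, threshold=MATERIALITY):
    """Classify a candidate against the pre-registered threshold."""
    delta = rmse - baseline
    if delta <= -threshold:
        return "material improvement"
    if delta < 0:
        return "no improvement below materiality"
    return "no improvement"


for name, rmse in KERNEL_RMSE.items():
    delta = rmse - BASELINE_RMSE
    pct = 100.0 * delta / BASELINE_RMSE
    print(f"Kernel {name}: delta = {delta:+.3f} ug/m3 ({pct:+.1f}%) -> {verdict(rmse)}")
```

Fixing the threshold as a constant before the loop runs mirrors the design choice in the text: the verdict cannot be tuned after the results are seen.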
Findings
Adding chemistry-model output as an input feature wins: 8% accuracy improvement
RMSE 2.542 µg/m³ versus baseline 2.762 µg/m³, a reduction of 0.220 µg/m³ (about 8%). The improvement clears the 0.140 µg/m³ materiality threshold with margin. 3 of 5 folds improve substantially; 1 fold is flat; 1 fold worsens. The kernel is not uniformly better: it improves the mean at the cost of a worse worst case, with fold-to-fold standard deviation rising from 0.892 to 1.045.
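
How uneven the win is can be checked with simple arithmetic on two numbers the report gives: the mean change (−0.220) and the fold 2 change (−0.929, from the caveats). The sketch assumes the mean is an unweighted average of the five per-fold changes; the per-fold values for the other four folds are not reported here, so only their combined contribution is derived.

```python
N_FOLDS = 5
MEAN_DELTA = -0.220    # ug/m3, mean per-fold change for Kernel C
FOLD2_DELTA = -0.929   # ug/m3, change on fold 2

total = N_FOLDS * MEAN_DELTA             # sum of the five per-fold changes
rest_total = total - FOLD2_DELTA         # combined change on the other four folds
rest_mean = rest_total / (N_FOLDS - 1)   # their average
fold2_share = FOLD2_DELTA / total        # fraction of the total owed to fold 2

print(f"other four folds: total {rest_total:+.3f}, mean {rest_mean:+.3f} ug/m3")
print(f"fold 2 share of the total change: {100 * fold2_share:.0f}%")
```

Under that assumption the other four folds together contribute well under the materiality threshold, which is why a second validation design is needed before the result is relied on.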
Kernel C’s win is mechanistically interpretable, not an AQS leakage artifact
The concern with adding the CMAQ value as a kernel-input covariate is the NARGP-style collapse mode documented in Investigation 3-4 (adding the prior-rung value as input lets the GP collapse to plain spatial kriging, producing a spurious −1.84 µg/m³ “improvement”). Investigation 3-11 guards against this by using ρ1 × f1_test (the surrogate-reconstructed CMAQ, not held-out CMAQ truth) at predict time. The −0.22 µg/m³ improvement, compared to the spurious −1.84 µg/m³ from the NARGP-collapse mode, is consistent with a real modest signal rather than a leakage artifact.
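
The leakage guard amounts to a constraint on where the third input column comes from. A minimal sketch, with a hypothetical helper name and toy values: the function only ever receives level-1 surrogate predictions, so held-out CMAQ truth cannot reach the kernel input.

```python
def kernel_c_inputs(lats, lons, f1_pred, rho1):
    """Build rows of [lat, lon, cmaq_z] with cmaq_z = rho1 * f1 at each site.

    f1_pred must be the level-1 surrogate prediction at the site,
    never the held-out CMAQ value itself.
    """
    if not (len(lats) == len(lons) == len(f1_pred)):
        raise ValueError("inputs must have one entry per site")
    return [[lat, lon, rho1 * f1] for lat, lon, f1 in zip(lats, lons, f1_pred)]


# Toy test-time call (hypothetical coordinates and values).
X_test = kernel_c_inputs(
    lats=[34.05, 36.75],
    lons=[-118.25, -119.77],
    f1_pred=[10.0, 14.0],
    rho1=0.9,
)
print(X_test)
```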
Wind-alignment and data-driven stretch variants both hurt accuracy
Wind-aligned anisotropy (Kernel A, RMSE 3.012) worsens RMSE on all 5 folds. Data-driven ARD anisotropy (Kernel B, RMSE 2.871) is numerically worse but within fold noise of the baseline (Δ = +0.109 µg/m³, below the 0.140 materiality threshold). A hand-coded physics anisotropy at the CA-aggregate scale is the wrong granularity: California’s wind field is basin-segmented (Bay Area NW marine, San Joaquin Valley stagnation, LA WSW onshore, Mojave N synoptic), so a single 75° transport axis averages across regimes with nearly orthogonal preferred directions. With only 64 AQS sites in the panel (about 13 per fold), neither kernel has enough data to find a better anisotropy than the isotropic baseline.
The winning mechanism is not the one the physics framing implies — that matters for how we describe it
Kernel C is closer to a Kennedy-O’Hagan 2001 model-bias-correction in 3-D input space than to a true PDE-derived kernel (advection-diffusion structure in the covariance function). A reviewer who reads carefully will notice this. The defensible proposal claim: “three candidate physics-informed kernel modifications tested; the chemistry-prior covariate variant reduced RMSE by 8%; the pure-anisotropy variants did not. The winning mechanism is mean-prior augmentation, not covariance-form physics encoding.” This is more credible than claiming a clean advection-diffusion kernel win.
Caveats
- Kernel C win is uneven across folds (3 improve, 1 flat, 1 worsens). The mean Δ = −0.220 is dominated by fold 2 (Δ = −0.929). A leave-region-out CV at the air-basin level is the required next validation step before adopting Kernel C in the production chain.
- Single year (2019), 64 sites, 5 folds. At this data scale, ARD and physics-informed modifications have limited signal to exploit beyond isotropic Matérn. 200–500 training sites (e.g., extending Investigation 3-8’s out-of-CA portability work) and per-grid-cell wind from HRRR reanalysis would be needed to meaningfully test covariance-form physics encoding.
- CA-aggregate wind axis (75° ENE) is coarse. Per-basin local rotation is the natural follow-on if Kernel A is to be retested; but with 8–14 sites per basin, per-basin fitting would be over-parameterised.
- Spectral, Wilson-Adams, and PINN kernels not tested. Those are 1–2 week research projects; out of scope for a single lever-test investigation. The candidates here are the cheapest physics-informed modifications a reviewer would expect to see.
- Do not adopt Kernel C in the production chain without leave-region-out CV. Even at an 8% RMSE improvement, adoption requires a full cascade re-run (Investigation 1-1/6/11/14/15/23/44/52). The validation surface is too thin for that commitment.
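
The leave-region-out CV named in the caveats above could be set up along these lines. This is a sketch only: the basin labels are hypothetical placeholders, and the real assignment of the 64 sites to air basins is not given in this report.

```python
from collections import defaultdict


def leave_region_out(regions):
    """Yield (held_out_region, train_idx, test_idx), one fold per region."""
    by_region = defaultdict(list)
    for i, region in enumerate(regions):
        by_region[region].append(i)
    for region in sorted(by_region):
        test_idx = by_region[region]
        held_out = set(test_idx)
        train_idx = [i for i in range(len(regions)) if i not in held_out]
        yield region, train_idx, test_idx


# Hypothetical basin labels for six sites.
basins = ["SJV", "SJV", "SoCAB", "SoCAB", "SFBA", "Mojave"]
folds = list(leave_region_out(basins))
for region, train_idx, test_idx in folds:
    print(f"hold out {region}: train on {train_idx}, test on {test_idx}")
```

Holding out whole basins tests whether the Kernel C gain transfers to a wind and chemistry regime the model has not seen, which the 5-fold spatial split does not guarantee.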
Provenance
| Item | SHA-256 (12-char) or source path | Upstream SHA-256 (12-char) |
|---|---|---|
| results.json | 91c8c82d18ce | |
| analysis.md | — | |
| scenario.md | — | |
| Upstream: Investigation 3-4 (4-level MFGP baseline) | investigations/42_l4-mfgp-corrected/latest/results.json | b89d8204eb15 |
| Upstream: Investigation 3-1 (5-fold CV splits) | investigations/39_aqs-held-out-validation/latest/results.json | c63ae2d281ce |
| Run timestamp | 2026-05-03T22:37:04; 64 CA AQS sites; 5-fold spatial CV; n_restarts = 6 | |