California Freight Cleanup → Investigation 3-11

Which physics-informed GP kernel modification actually helps?

Kernel C (CMAQ covariate): 2.542 µg/m³ (Δ = −0.220 µg/m³, about −8.0% mean vs baseline 2.762; three of five folds improve, one is flat, one worsens) • Kernels A and B: both worse than baseline

The CEC freight solicitation proposal scoring rubric flags a physics-informed GP kernel (advection-diffusion structure in the covariance) as the highest-leverage methodology lever for Criterion 1 (Technical Merit). Investigation 3-11 is the honest empirical test: three candidate kernels, one single-axis change per experiment, materiality threshold set before running. The lever does pull weight, but only via mean-prior augmentation (Kennedy-O’Hagan 2001 style), not via covariance-form physics encoding.

The decision

The CEC proposal scoring rewards physics-informed model improvements. But the honest version of that claim requires an actual experiment: does encoding physical knowledge about atmospheric transport into the model improve accuracy on real California monitoring data, or is it purely a methodological signal with no numerical payoff?

Either verdict is defensible. A real improvement supports the technical-merit narrative. A clean negative — three variants tested, none beat the baseline — is itself a rigor signal. We found a partial positive: one of three variants wins by 8%, but the mechanism is different from what the standard atmospheric physics framing would suggest.

Methodology

All three candidate kernels are tested as drop-in replacements for the isotropic Matern(ν=1.5) + WhiteKernel residual fit inside Investigation 3-4’s exact 2019 4-level MFGP chain. Every other chain element is preserved bit-identical: the 64-site California AQS panel, the Investigation 3-1 5-fold spatial CV split, the ρ_k OLS-through-origin layer, and the canonical recursive predict using only L1 at test. Only the δ_k GP kernel varies. The baseline RMSE must reproduce Investigation 3-4’s 2.762 µg/m³ exactly before any kernel comparison is valid; it does (2.762, Δ = 0.000).
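A minimal sketch of one level of the chain described above, assuming scikit-learn is the GP library (the kernel names Matern and WhiteKernel match its API, but the source does not confirm the implementation). The function names and the synthetic data are hypothetical stand-ins, not the AQS panel; n_restarts = 6 is taken from the run metadata.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern, WhiteKernel


def fit_rho_through_origin(f_lower, y_upper):
    """OLS slope with no intercept: rho = <f, y> / <f, f>."""
    return float(np.dot(f_lower, y_upper) / np.dot(f_lower, f_lower))


def fit_delta_gp(coords, residual, n_restarts=6, seed=0):
    """Fit the residual GP with the baseline isotropic kernel."""
    kernel = Matern(length_scale=1.0, nu=1.5) + WhiteKernel(noise_level=1.0)
    gp = GaussianProcessRegressor(
        kernel=kernel, n_restarts_optimizer=n_restarts, random_state=seed
    )
    return gp.fit(coords, residual)


# Synthetic stand-in data (hypothetical; NOT the 64-site AQS panel).
rng = np.random.default_rng(0)
coords = rng.uniform([32.5, -124.0], [42.0, -114.0], size=(51, 2))
f_lower = rng.uniform(5.0, 15.0, size=51)
y_upper = 0.9 * f_lower + np.sin(coords[:, 0]) + rng.normal(0.0, 0.3, size=51)

rho = fit_rho_through_origin(f_lower, y_upper)
delta_gp = fit_delta_gp(coords, y_upper - rho * f_lower)

# One level of the recursion: scaled lower level plus the GP residual.
pred = rho * f_lower[:5] + delta_gp.predict(coords[:5])
```

Swapping only the `kernel` line while holding everything else fixed is what makes each experiment a single-axis change.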

Kernel	Description	Physics encoding
Baseline	Isotropic Matern(ν=1.5) + WhiteKernel	None—Investigation 3-4 exact form
Kernel A	Input (lat,lon) rotated to align with CA-aggregate prevailing-wind transport axis (75° ENE per NCEP 2019 850 hPa); ARD Matern	Anisotropic advection: correlation falls off slower along-wind than cross-wind
Kernel B	ARD Matern with independent (lat, lon) length-scales fit by MLE	Data-driven anisotropy; reality-check for A
Kernel C	Input extended to [lat, lon, cmaq_z]; cmaq_z filled with ρ₁ × f1_test at predict time	Chemistry-prior covariate: GP encodes spatial proximity AND chemistry-prior similarity
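The three input treatments in the table can be sketched as follows, again assuming scikit-learn kernels. This is illustrative only: the rotation treats raw degrees of latitude and longitude as a flat plane with the bearing measured clockwise from north, the WhiteKernel term is assumed to be retained in every variant, and the helper names and test-time values are hypothetical.

```python
import numpy as np
from sklearn.gaussian_process.kernels import Matern, WhiteKernel

TRANSPORT_BEARING_DEG = 75.0  # transport axis from the table (ENE)


def rotate_to_wind_axes(latlon, bearing_deg=TRANSPORT_BEARING_DEG):
    """Kernel A input: rotate (lat, lon) into (along-wind, cross-wind).

    Assumption: flat-plane rotation in degree units, bearing clockwise
    from north.
    """
    b = np.deg2rad(bearing_deg)
    north, east = latlon[:, 0], latlon[:, 1]
    along = east * np.sin(b) + north * np.cos(b)
    cross = east * np.cos(b) - north * np.sin(b)
    return np.column_stack([along, cross])


def kernel_c_inputs(latlon, cmaq_column):
    """Kernel C input: append the chemistry-prior covariate as a third column."""
    return np.column_stack([latlon, cmaq_column])


# Kernels A and B share the ARD form (one length-scale per input dimension);
# A applies it in the rotated frame, B in the raw (lat, lon) frame.
ard_2d = Matern(length_scale=[1.0, 1.0], nu=1.5) + WhiteKernel()
# Kernel C uses three length-scales: lat, lon, and the CMAQ covariate.
ard_3d = Matern(length_scale=[1.0, 1.0, 1.0], nu=1.5) + WhiteKernel()

# Hypothetical test-time values, for shape illustration only.
latlon_test = np.array([[34.05, -118.25], [37.77, -122.42]])
rho_1 = 0.9
f1_test = np.array([11.0, 8.0])

X_a = rotate_to_wind_axes(latlon_test)
# At predict time the covariate is the surrogate rho_1 * f1_test.
X_c = kernel_c_inputs(latlon_test, rho_1 * f1_test)
```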

The materiality threshold is set at 0.140 µg/m³ (≈ 5% of baseline 2.762, larger than typical fold-to-fold noise). A kernel winning by less than this is reported as “no improvement below materiality”—not as a win. CA aggregate wind direction (255° from-direction, transport axis 75°) is from the NCEP/NCAR Reanalysis 2019 annual-mean 850 hPa composite.

Figure: 5-fold spatial CV-RMSE (California 2019, 64 AQS sites) for the baseline (2.762 µg/m³) and the three candidate kernels: Kernel A (3.012), Kernel B (2.871), and Kernel C (2.542). Kernel C is the only variant below the baseline; A and B are both above. The horizontal line marks the materiality threshold (0.140 µg/m³ improvement). Only Kernel C clears it.
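The materiality rule can be checked against the reported RMSE values with a few lines of arithmetic. The improvement side of the rule is from the text; treating a degradation of the same size as "material" is an assumption made here for symmetry, and the label strings are illustrative.

```python
BASELINE_RMSE = 2.762   # µg/m³, Investigation 3-4 baseline
MATERIALITY = 0.140     # µg/m³, set before running

cv_rmse = {"Kernel A": 3.012, "Kernel B": 2.871, "Kernel C": 2.542}


def classify(rmse, baseline=BASELINE_RMSE, threshold=MATERIALITY):
    """Label a kernel by its RMSE change relative to the baseline."""
    delta = rmse - baseline
    if delta <= -threshold:
        return "material improvement"
    if delta >= threshold:
        # Assumption: the threshold is applied symmetrically to degradations.
        return "material degradation"
    return "below materiality"


verdicts = {name: classify(r) for name, r in cv_rmse.items()}
pct_change = {
    name: 100.0 * (r - BASELINE_RMSE) / BASELINE_RMSE
    for name, r in cv_rmse.items()
}

for name in cv_rmse:
    delta = cv_rmse[name] - BASELINE_RMSE
    print(f"{name}: delta = {delta:+.3f} ({pct_change[name]:+.1f}%) -> {verdicts[name]}")
```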

Findings

Adding chemistry-model output as an input feature wins: 8% RMSE reduction

RMSE is 2.542 µg/m³ versus the baseline 2.762 µg/m³ (Δ = −0.220 µg/m³). The improvement clears the 0.140 µg/m³ materiality threshold with margin. Three of five folds improve substantially, one is flat, and one worsens. The kernel is not uniformly better: it trades a better mean for a worse worst case, with fold-to-fold standard deviation rising (0.892 → 1.045).

Kernel C’s win is mechanistically interpretable, not an AQS leakage artifact

The concern with adding the CMAQ value as a kernel-input covariate is the NARGP-style collapse mode documented in Investigation 3-4 (adding the prior-rung value as input lets the GP collapse to plain spatial kriging, producing a spurious −1.84 µg/m³ “improvement”). Investigation 3-11 guards against this by using ρ₁ × f1_test (the surrogate-reconstructed CMAQ, not held-out CMAQ truth) at predict time. The −0.22 µg/m³ improvement, compared to the spurious −1.84 µg/m³ from the NARGP-collapse mode, is consistent with a real modest signal rather than a leakage artifact.

Wind-alignment and data-driven stretch variants both hurt accuracy

Wind-aligned anisotropy (Kernel A, RMSE 3.012, Δ = +0.250) worsens RMSE on all 5 folds. Data-driven ARD anisotropy (Kernel B, RMSE 2.871, Δ = +0.109) is nominally worse but within fold noise of the baseline and below the 0.140 µg/m³ materiality threshold. A hand-coded physics anisotropy at the CA-aggregate scale is the wrong granularity: California’s wind field is basin-segmented (Bay Area NW marine, San Joaquin Valley stagnation, LA WSW onshore, Mojave N synoptic), so a single 75° transport axis averages across regimes with nearly orthogonal preferred directions. With only 64 AQS sites (about 13 per fold), neither kernel has enough data to find a better anisotropy than the isotropic baseline.

The winning mechanism is not the one the physics framing implies — that matters for how we describe it

Kernel C is closer to a Kennedy-O’Hagan 2001 model-bias-correction in 3-D input space than to a true PDE-derived kernel (advection-diffusion structure in the covariance function). A reviewer who reads carefully will notice this. The defensible proposal claim: “three candidate physics-informed kernel modifications tested; the chemistry-prior covariate variant reduced RMSE by 8%; the pure-anisotropy variants did not. The winning mechanism is mean-prior augmentation, not covariance-form physics encoding.” This is more credible than claiming a clean advection-diffusion kernel win.
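The distinction can be written out schematically. The notation below is an assumption introduced for illustration (η for the simulator output, c for the chemistry-prior covariate, s for site location); it is not taken from the investigation's code.

```latex
% Kennedy-O'Hagan (2001) style: scaled simulator plus a GP discrepancy over space
y(\mathbf{s}) = \rho\,\eta(\mathbf{s}) + \delta(\mathbf{s}),
\qquad \delta \sim \mathcal{GP}\bigl(0,\; k(\mathbf{s}, \mathbf{s}')\bigr)

% Kernel C: the discrepancy GP also takes the chemistry prior c as an input
\delta_k(\mathbf{x}) \sim \mathcal{GP}\bigl(0,\; k(\mathbf{x}, \mathbf{x}')\bigr),
\qquad \mathbf{x} = (\mathrm{lat},\, \mathrm{lon},\, c),
\quad c = \rho_1 f_1(\mathbf{s})
```

In both forms the physics enters through the inputs and the mean structure. A covariance-form encoding would instead derive the functional form of k itself from the transport equation, which none of the three candidates does.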

Caveats

Kernel C win is uneven across folds (3 improve, 1 flat, 1 worsens). The mean Δ = −0.220 is dominated by fold 2 (Δ = −0.929). A leave-region-out CV at the air-basin level is the required next validation step before adopting Kernel C in the production chain.
Single year (2019), 64 sites, 5 folds. At this data scale, ARD and physics-informed modifications have limited signal to exploit beyond isotropic Matérn. 200–500 training sites (e.g., extending Investigation 3-8’s out-of-CA portability work) and per-grid-cell wind from HRRR reanalysis would be needed to meaningfully test covariance-form physics encoding.
CA-aggregate wind axis (75° ENE) is coarse. Per-basin local rotation is the natural follow-on if Kernel A is to be retested; but with 8–14 sites per basin, per-basin fitting would be over-parameterised.
Spectral, Wilson-Adams, and PINN kernels not tested. Those are 1–2 week research projects; out of scope for a single lever-test investigation. The candidates here are the cheapest physics-informed modifications a reviewer would expect to see.
Do not adopt Kernel C in the production chain without leave-region-out CV. Even at an 8% RMSE improvement, adoption requires a full cascade re-run (Investigation 1-1/6/11/14/15/23/44/52). The validation surface is too thin for that commitment.
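The leave-region-out validation called for above could be set up along these lines. This is a sketch under stated assumptions: scikit-learn's LeaveOneGroupOut is used as the splitter, the basin labels are hypothetical, and the data are synthetic (the synthetic sites are not geographically clustered by basin, so the scores carry no meaning).

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern, WhiteKernel
from sklearn.model_selection import LeaveOneGroupOut


def make_baseline_gp():
    """Fresh GP with the baseline isotropic kernel."""
    kernel = Matern(length_scale=1.0, nu=1.5) + WhiteKernel(noise_level=1.0)
    return GaussianProcessRegressor(kernel=kernel, normalize_y=True, random_state=0)


def leave_region_out_rmse(X, y, basins, make_model=make_baseline_gp):
    """Hold out one air basin at a time and return RMSE per held-out basin."""
    scores = {}
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=basins):
        model = make_model().fit(X[train_idx], y[train_idx])
        err = model.predict(X[test_idx]) - y[test_idx]
        scores[str(basins[test_idx][0])] = float(np.sqrt(np.mean(err ** 2)))
    return scores


# Synthetic stand-in data with hypothetical basin labels.
rng = np.random.default_rng(0)
basin_names = np.array(["bay_area", "san_joaquin", "south_coast", "mojave"])
basins = np.repeat(basin_names, 10)
X = rng.uniform([32.5, -124.0], [42.0, -114.0], size=(40, 2))
y = 8.0 + np.sin(X[:, 0]) + rng.normal(0.0, 0.5, size=40)

scores = leave_region_out_rmse(X, y, basins)
```

Passing a different `make_model` factory (for example one that builds the Kernel C variant) would let the same splits compare kernels basin by basin.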

Provenance

Item	SHA-256 (12-char)
results.json	`91c8c82d18ce`
analysis.md	—
scenario.md	—
Upstream: Investigation 3-4 (4-level MFGP baseline)	investigations/42_l4-mfgp-corrected/latest/results.json	`b89d8204eb15`
Upstream: Investigation 3-1 (5-fold CV splits)	investigations/39_aqs-held-out-validation/latest/results.json	`c63ae2d281ce`
Run timestamp	2026-05-03T22:37:04 • 64 CA AQS sites • 5-fold spatial CV • n_restarts = 6
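A short digest like those in the table can be recomputed to verify an artifact. The sketch below assumes "12-char" means the first 12 hexadecimal characters of the SHA-256 digest of the file bytes; the function name is hypothetical.

```python
import hashlib
from pathlib import Path


def short_sha256(path, n_chars=12, chunk_size=1 << 20):
    """Return the first n_chars hex characters of a file's SHA-256 digest."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        # Read in chunks so large result files need not fit in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()[:n_chars]
```

Comparing `short_sha256("results.json")` against the value in the table would confirm that the file on disk is the one the report was generated from.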