Can a GP kernel itself know the physics?
Inv 34 showed that a small MLP with a PDE-residual penalty beats a data-only MLP on sparse monitor data. But the PINN gives point predictions only — no calibrated uncertainty. For decision-making under uncertainty (e.g., CEC's question of which scenarios to run at high fidelity), uncertainty is the product. This investigation asks the next natural question: can we swap the MLP for a Gaussian process and encode the PDE directly into the kernel, so the posterior is physics-consistent by construction and carries honest variance?
The governing PDE is the same as in Inv 34:
∂c/∂t + u ∂c/∂x − D ∂²c/∂x² + k c = S(x)
with u = 5.0 km/h, D = 8.0 km²/h, k = 0.12/h, source centered at x = 200 km (width = 80 km).
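For concreteness, here is a minimal encoding of these constants and the source term. The Gaussian profile, the use of the 80 km width as the Gaussian scale, and the unit amplitude S0 are illustrative assumptions, not necessarily the demo's exact S(x):

```python
import numpy as np

# ADR parameters from the setup above
U_KMH   = 5.0    # advection speed u [km/h]
D_KM2H  = 8.0    # diffusivity D [km^2/h]
K_REACT = 0.12   # first-order loss rate k [1/h]

# Source term: assumed Gaussian profile centered at 200 km. Treating the
# 80 km "width" as the Gaussian scale and using unit amplitude S0 are
# assumptions for illustration only.
X_SRC, WIDTH, S0 = 200.0, 80.0, 1.0

def source(x):
    """Hypothetical S(x): Gaussian emission profile."""
    return S0 * np.exp(-((x - X_SRC) / WIDTH) ** 2)
```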
If c ~ GP, so does Lc
The core observation (Sarkka 2011; Alvarez, Luengo & Lawrence 2013) is that a GP is closed under linear operators: applying a linear operator to a GP yields another GP, with covariances obtained by pushing the operator through the kernel. If we place a squared-exponential prior c ∼ GP(0, k) and let L = ∂t + u ∂x − D ∂x² + k_r be the ADR operator (k_r is the reaction rate, written with a subscript to distinguish it from the kernel k), then:
- Lc ∼ GP(0, L₁L₂k). The operator pushes through to the kernel by applying it to each argument (L₁ acts on the first argument, L₂ on the second).
- cov(c, Lc)(x, x′) = L₂k(x, x′). The cross-covariance is a one-sided application of the operator to the base kernel.
- Since Lc = S (the known source), observing S at collocation points is mathematically identical to observing a linear transform of c. Conditioning the GP on both types of observations gives a posterior that respects the PDE in expectation and variance.
All 1D kernel derivatives through order four are available in closed form for the SE kernel. The ADR operator applied to both arguments produces a sum of 16 analytic terms (enumerated in rfaq/atmospheric/physics_gp.py) — no finite differencing needed.
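As a concrete sketch of the one-sided term, here is a closed-form L₂k for the space-time SE kernel in NumPy. Function names are illustrative, not the physics_gp.py API; the full two-sided L₁L₂k adds the remaining analytic terms enumerated there:

```python
import numpy as np

def se_kernel(x, t, xp, tp, sigma2, lx, lt):
    """Space-time SE kernel k((x,t),(x',t'))."""
    r, s = x - xp, t - tp
    return sigma2 * np.exp(-0.5 * (r / lx) ** 2 - 0.5 * (s / lt) ** 2)

def L2_se_kernel(x, t, xp, tp, sigma2, lx, lt, u=5.0, D=8.0, kr=0.12):
    """Apply L = d/dt' + u d/dx' - D d2/dx'2 + kr to the SECOND argument
    of the SE kernel, in closed form (no finite differencing).
    Returns cov(c(x,t), Lc(x',t')) under the GP prior."""
    r, s = x - xp, t - tp
    k = se_kernel(x, t, xp, tp, sigma2, lx, lt)
    dk_dtp   = (s / lt**2) * k                    # dk/dt'
    dk_dxp   = (r / lx**2) * k                    # dk/dx'
    d2k_dxp2 = (r**2 / lx**4 - 1.0 / lx**2) * k   # d2k/dx'2
    return dk_dtp + u * dk_dxp - D * d2k_dxp2 + kr * k
```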
Why this beats a PINN on uncertainty. A PINN gives a deterministic function. A physics-informed GP gives a full posterior — mean and variance — with the PDE baked into both. For downstream VOI / EVPPI / active learning computations (Inv 36 uses this structure), the variance is what you need.
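A minimal sketch of that posterior, conditioning jointly on direct concentration observations and on source (collocation) observations. K_cc, K_cL, K_LL are assumed block-builders for k, L₂k, and L₁L₂k between point sets; they are hypothetical stand-ins, not the actual physics_gp.py interface:

```python
import numpy as np

def physics_gp_posterior(Xstar, Xc, y, XL, S, K_cc, K_cL, K_LL, noise=1e-2):
    """Posterior mean/variance of c at test points Xstar, given noisy
    observations y = c(Xc) and source values S = (Lc)(XL).
    K_cc, K_cL, K_LL are assumed block-builders for k, L2 k, L1 L2 k."""
    A = K_cc(Xc, Xc)                      # cov(c, c)
    B = K_cL(Xc, XL)                      # cov(c, Lc)
    C = K_LL(XL, XL)                      # cov(Lc, Lc)
    # Small noise/jitter keeps the joint Cholesky numerically stable.
    K = np.block([[A, B], [B.T, C]]) + noise * np.eye(len(Xc) + len(XL))
    obs = np.concatenate([y, S])

    Ks = np.hstack([K_cc(Xstar, Xc), K_cL(Xstar, XL)])  # test-vs-joint blocks
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, obs))
    mean = Ks @ alpha
    v = np.linalg.solve(L, Ks.T)
    var = np.diag(K_cc(Xstar, Xstar)) - np.sum(v**2, axis=0)
    return mean, var
```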
From no-physics to physics-kerneled
What each surrogate predicts at t = 8.0h
Trained on 9 noisy observations at 3 downstream stations plus IC anchors, the physics-GP (green, with ±2σ band) tracks the reference plume to within a few tenths of a µg/m³ across the entire domain. The plain-GP (gold) fits the data points but extrapolates wildly upwind of the nearest station — the SE kernel has no mechanism for propagating information across a dynamical system. The green band narrows where the PDE + collocation data constrain the solution tightly, and widens a little upstream where only IC anchors pin it.
What marginal-likelihood maximization tells us
| Hyperparameter | Physics-GP | Plain-GP | Interpretation |
|---|---|---|---|
| σ² (prior variance) | 62.7 | 90.8 | Signal scale. Both recover roughly the same process variance. |
| ℓx (spatial length-scale, km) | 125.4 | 309.9 | Plain-GP inflates ℓx toward the domain width; the physics kernel prefers a tighter scale because the operator already supplies long-range structure. |
| ℓt (temporal length-scale, h) | 10.9 | 10.8 | Similar values. Physics-GP focuses on residual temporal correlation beyond what advection explains. |
| Negative log marginal likelihood | -150.5 | 41.9 | Lower is better, after accounting for the physics-GP's larger observation set. |
Hyperparameters were learned with SciPy's L-BFGS-B on log-transformed variables (max 100 iterations). Training took under 2 seconds on a laptop for the 229-point Cholesky factorization (9 station observations + 20 IC anchors + 200 collocation points).
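For reference, the optimization itself is the textbook GP marginal-likelihood setup. A minimal sketch, assuming a hypothetical build_K that assembles the joint covariance (stations + IC anchors + collocation) from (σ², ℓx, ℓt):

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_marginal_likelihood(log_params, X, y, build_K):
    """Standard GP NLML on log-transformed hyperparameters. build_K is an
    assumed helper returning the joint covariance for these hyperparameters."""
    sigma2, lx, lt = np.exp(log_params)                       # positivity via log
    K = build_K(X, sigma2, lx, lt) + 1e-6 * np.eye(len(y))    # jitter
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (0.5 * y @ alpha
            + np.sum(np.log(np.diag(L)))
            + 0.5 * len(y) * np.log(2.0 * np.pi))

# Illustrative call; the starting values are placeholders, not the demo's.
# res = minimize(neg_log_marginal_likelihood, x0=np.log([50.0, 150.0, 10.0]),
#                args=(X_train, y_train, build_K),
#                method="L-BFGS-B", options={"maxiter": 100})
```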
The case for physics in the kernel, not the loss
In one sense this is the same finding as Inv 34: encoding the PDE is worth 3–4× more training data. What a physics-GP provides beyond a PINN shows up in three places.
- Calibrated uncertainty. Every query returns mean + variance. The variance is the posterior uncertainty after the PDE constraint has propagated information through the domain — not a post-hoc bootstrap.
- Active learning built in. The posterior variance tells you where additional monitors would reduce uncertainty the most (a minimal sketch follows this list). This feeds directly into Inv 27 (adaptive monitor placement) and Inv 36 (EVPPI).
- Convex-ish training. Hyperparameter NLL is smooth and has ≤ 5 parameters for this problem. No stochastic gradient descent, no learning-rate tuning, no loss-term weighting.
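A minimal greedy-selection sketch for that active-learning loop, assuming a hypothetical posterior_var helper that returns the physics-GP posterior variance of c at a fixed analysis time:

```python
import numpy as np

def next_monitor_site(candidate_x, posterior_var):
    """Greedy rule: place the next monitor where the physics-GP posterior
    variance is largest. posterior_var(x) is an assumed helper returning
    Var[c(x, t_analysis)] from the conditioned GP."""
    variances = np.array([posterior_var(x) for x in candidate_x])
    return candidate_x[int(np.argmax(variances))]
```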
The trade-off is O(n³) scaling with training-set size. For 10⁴+ observations (e.g., a year of hourly AQS data statewide), the standard scaling fixes apply: SVGP (sparse variational GPs, which fit a small set of inducing points instead of the full kernel matrix) or KISS-GP (structured kernel interpolation, which exploits grid structure for fast matrix-vector products) bring this down to roughly O(n log n) with modest accuracy loss.
Pair physics-kerneled GPs with PINNs, don't choose between them
Recommendation: For regional PM2.5 emulation where CEC needs uncertainty-aware surrogate evaluations (VOI, active learning, calibration targets), a physics-kerneled GP is the right primitive. For high-throughput deterministic inference (operational nowcasting, optimization inner loops), a PINN is faster post-training. An integrated system would use the physics-GP during experimental-design phases and distill it into a PINN for production.
What this pedagogical demo does not prove
- Squared-exponential base kernel — a Matern-5/2 would be more realistic for plume tails but has more complex derivatives.
- Dense Cholesky O(n^3); with 229 training points this is trivial (~0.5s), but production systems with 10^4+ observations would need sparse GPs.
- 1D PDE is a pedagogical simplification. Extending to 2D/3D multiplies kernel derivative complexity but the method is identical.
- Source term S(x) assumed known; joint inference of source + state is the natural next step (e.g. 4D-Var-like approach — see Inv 33).
- The operator L is linear in c. For nonlinear chemistry (O3 + NOx coupling), the method generalizes via iterative linearization (numerical GPs; Raissi 2017) but is no longer closed-form.
- Plain-GP baseline has no PDE structure — its large RMSE reflects SE-kernel inability to propagate information across a dynamical system from sparse data. With richer kernels (periodic, neural-kernel) the gap would narrow but still favor the physics kernel.