Studies · CA Air Quality · Investigation 35 · Phase 3

A linear-operator GP bakes the PDE into its kernel

Companion to Inv 34 (PINN): same 1D advection-diffusion-reaction PDE, same 9 noisy monitor observations, same 200 collocation points. But instead of a neural net with a physics loss term, we place a Gaussian-process prior on the concentration field and let the PDE operator L act directly on the kernel. The posterior is physics-consistent by construction — with calibrated uncertainty as a first-class output.

0.123
Physics-GP RMSE (µg/m³)
97%
RMSE cut vs plain GP
9
Station obs
200
Collocation points
The Question

Can a GP kernel itself know the physics?

Inv 34 showed that a small MLP with a PDE-residual penalty beats a data-only MLP on sparse monitor data. But the PINN gives point predictions only — no calibrated uncertainty. For decision-making under uncertainty (e.g., CEC's question of which scenarios to run at high fidelity), uncertainty is the product. This investigation asks the next natural question: can we swap the MLP for a Gaussian process and encode the PDE directly into the kernel, so the posterior is physics-consistent by construction and carries honest variance?

The governing PDE is the same as Inv 34:

∂c/∂t + u ∂c/∂x − D ∂²c/∂x² + k c = S(x)

with u = 5.0 km/h, D = 8.0 km²/h, k = 0.12/h, source centered at x = 200 km (width = 80 km).
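The reference dynamics are easy to reproduce. Below is a minimal explicit-upwind integration of the ADR equation with the stated coefficients; the grid resolution, boundary handling, and source amplitude are illustrative assumptions, not the study's actual settings.

```python
import numpy as np

# Coefficients from the text; grid and source amplitude are assumptions.
u, D, k_r = 5.0, 8.0, 0.12                     # km/h, km^2/h, 1/h
x = np.linspace(0.0, 800.0, 401)               # assumed 0-800 km domain, 2 km grid
dx = x[1] - x[0]
S = 10.0 * np.exp(-((x - 200.0) / 80.0) ** 2)  # Gaussian source at 200 km, width 80 km

def step(c, dt):
    """One explicit step: upwind advection, central diffusion, decay, source."""
    adv = -u * (c - np.roll(c, 1)) / dx        # upwind difference, valid for u > 0
    dif = D * (np.roll(c, -1) - 2 * c + np.roll(c, 1)) / dx ** 2
    return c + dt * (adv + dif - k_r * c + S)

c = np.zeros_like(x)
dt = 0.02                                      # well inside dt < dx^2 / (2 D) = 0.25 h
for _ in range(int(8.0 / dt)):                 # integrate to the t = 8 h snapshot
    c = step(c, dt)
```

`np.roll` imposes periodic boundaries, a simplification that is harmless here because the plume stays well inside the domain over 8 h.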

The linear-operator kernel trick

If c ~ GP, so does Lc

The core observation (Särkkä 2011; Álvarez, Luengo & Lawrence 2013) is that Gaussian processes are closed under linear operators. If we place a squared-exponential prior c ∼ GP(0, k) and let L = ∂t + u ∂x − D ∂x² + kᵣ be the ADR operator (writing kᵣ for the reaction rate, to avoid a clash with the kernel k), then:

  • Lc ∼ GP(0, L₁L₂ k), where Lᵢ applies the operator to the i-th kernel argument. The operator pushes through to the kernel by acting on each argument in turn.
  • cov(c, Lc)(x, x′) = L₂ k(x, x′). The cross-covariance is a one-sided application of the operator to the base kernel.
  • Since Lc = S (the known source), observing S at collocation points is mathematically identical to observing a linear transform of c. Conditioning the GP on both types of observations gives a posterior that respects the PDE in expectation and variance.
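The conditioning step is easiest to see on a toy operator. The sketch below (illustrative, not the investigation's code) takes L = d/dx with an SE kernel: observations of f′ enter the joint Gram matrix exactly like observations of a linear transform of f, mirroring how collocation observations of S enter alongside station observations of c.

```python
import numpy as np

ell = 1.0  # SE length scale (toy setting)

def k(a, b):     # cov(f(a), f(b)) for an SE prior on f
    d = a[:, None] - b[None, :]
    return np.exp(-d**2 / (2 * ell**2))

def k_fL(a, b):  # cov(f(a), (Lf)(b)) = d/db k(a, b), with L = d/dx
    d = a[:, None] - b[None, :]
    return (d / ell**2) * np.exp(-d**2 / (2 * ell**2))

def k_LL(a, b):  # cov((Lf)(a), (Lf)(b)) = d2/(da db) k(a, b)
    d = a[:, None] - b[None, :]
    return (1 / ell**2 - d**2 / ell**4) * np.exp(-d**2 / (2 * ell**2))

# Two direct observations of f = sin ("stations") and fifteen
# observations of Lf = f' = cos ("collocation points").
xf = np.array([0.0, 3.0])
xg = np.linspace(0.0, 3.0, 15)
y = np.concatenate([np.sin(xf), np.cos(xg)])

# Joint Gram matrix over the stacked [f-obs, Lf-obs] vector.
K = np.block([[k(xf, xf),      k_fL(xf, xg)],
              [k_fL(xf, xg).T, k_LL(xg, xg)]]) + 1e-6 * np.eye(xf.size + xg.size)

xs = np.linspace(0.0, 3.0, 50)             # query points
Ks = np.hstack([k(xs, xf), k_fL(xs, xg)])  # cov(f(xs), joint observations)
mean = Ks @ np.linalg.solve(K, y)          # posterior mean of f given both types
```

Swapping d/dx for the ADR operator changes only the closed-form kernel derivatives; the conditioning algebra is identical.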

All 1D kernel derivatives through order four are available in closed form for the SE kernel. The ADR operator applied to both arguments produces a sum of 16 analytic terms (enumerated in rfaq/atmospheric/physics_gp.py) — no finite differencing needed.
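The 16-term expansion need not be derived by hand: applying the operator symbolically to a separable space-time SE kernel reproduces it. A sympy sketch (illustrative; not how physics_gp.py computes its closed forms):

```python
import sympy as sp

x, t, xp, tp = sp.symbols("x t xp tp")                 # (x, t) and primed arguments
s2, lx, lt = sp.symbols("s2 lx lt", positive=True)
u, D, kr = 5.0, 8.0, 0.12                              # PDE coefficients from the text

# Separable space-time SE base kernel k((x, t), (xp, tp)).
k = s2 * sp.exp(-(x - xp)**2 / (2 * lx**2) - (t - tp)**2 / (2 * lt**2))

def L(expr, xv, tv):
    """ADR operator d/dt + u d/dx - D d2/dx2 + kr, acting in the (xv, tv) pair."""
    return sp.diff(expr, tv) + u * sp.diff(expr, xv) - D * sp.diff(expr, xv, 2) + kr * expr

L2k = L(k, xp, tp)   # cov(c, Lc): operator on the second argument only
LLk = L(L2k, x, t)   # cov(Lc, Lc): operator on both arguments -> 4 x 4 = 16 terms

LLk_fn = sp.lambdify((x, t, xp, tp, s2, lx, lt), LLk)
```

A quick sanity check is that L₁L₂k is symmetric under swapping the two space-time points, as any valid covariance must be.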

Why this beats a PINN on uncertainty. A PINN gives a deterministic function. A physics-informed GP gives a full posterior — mean and variance — with the PDE baked into both. For downstream VOI / EVPPI / active learning computations (Inv 36 uses this structure), the variance is what you need.

Surrogate ladder

From no-physics to physics-kerneled

L1
Plain SE-kernel GP (station obs only) No physics. The squared-exponential kernel just says "nearby points have similar values." 9 noisy observations + 20 IC anchors.
4.26
RMSE µg/m³
L2
PINN (Inv 34) 24-unit tanh MLP + PDE residual loss at 200 collocation points. Point predictions only.
5.18
RMSE µg/m³
L3
Physics-GP (this investigation) Same data. SE base kernel + ADR operator baked into covariance. 3 hyperparameters fit by marginal likelihood. Posterior carries calibrated ±σ.
0.123
RMSE µg/m³
L4
Finite-difference reference solver Explicit upwind FD integration at fine resolution. Ground truth — but orders of magnitude slower than either surrogate.
truth
solver
Snapshot comparison

What each surrogate predicts at t = 8.0h

[Figure: snapshot at t = 8.0 h (held-out time). x-axis: distance along wind axis (km), 0–800; y-axis: PM2.5 (µg/m³), 0–20. Curves: Truth (FD solver), Physics-GP with ±2σ band, Plain-GP.]

Trained on 9 noisy observations at 3 downstream stations plus IC anchors, the physics-GP (green, with ±2σ band) tracks the reference plume to within a few tenths of a µg/m³ across the entire domain. The plain-GP (gold) fits the data points but extrapolates wildly upwind of the nearest station — the SE kernel has no mechanism for propagating information across a dynamical system. The green band narrows where the PDE + collocation data constrain the solution tightly, and widens a little upstream where only IC anchors pin it.

4.26
Plain-GP RMSE (µg/m³)
0.123
Physics-GP RMSE (µg/m³)
97%
Full-domain improvement
97%
Held-out time improvement
Fitted hyperparameters

What marginal-likelihood maximization tells us

| Hyperparameter | Physics-GP | Plain-GP | Interpretation |
|---|---|---|---|
| σ² (prior variance) | 62.7 | 90.8 | Signal scale. Both recover roughly the same process variance. |
| lx (spatial scale, km) | 125.4 | 309.9 | Plain-GP inflates lx toward the domain width; the physics kernel prefers a tighter scale because the operator already supplies long-range structure. |
| lt (temporal scale, h) | 10.9 | 10.8 | Similar story. The physics-GP focuses on residual temporal correlation beyond what advection explains. |
| Neg. log marginal likelihood | −150.5 | 41.9 | Lower is better, after accounting for the physics-GP's larger observation set. |

Hyperparameters learned via scipy L-BFGS-B on log-transformed variables. 100 max iterations. Training time < 2 seconds for 229-point Cholesky (9 stations + 20 IC anchors + 200 collocation) on a laptop.
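The fitting loop described above is standard log-marginal-likelihood minimization. A plain-GP version on synthetic stand-in data (illustrative; the study's 229-point set and physics blocks are not reproduced here):

```python
import numpy as np
from scipy.optimize import minimize

# Synthetic stand-in observations along a 1D transect.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 800.0, size=(29, 1))       # observation locations, km
y = np.sin(X[:, 0] / 120.0) + 0.05 * rng.standard_normal(29)

def nll(log_theta):
    """Negative log marginal likelihood of a 1D SE-kernel GP."""
    s2, lx = np.exp(log_theta)                  # optimize in log space
    K = s2 * np.exp(-(X - X.T) ** 2 / (2 * lx ** 2)) + 1e-4 * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return 0.5 * y @ alpha + np.log(np.diag(L)).sum() + 0.5 * len(y) * np.log(2 * np.pi)

res = minimize(nll, x0=np.log([1.0, 100.0]), method="L-BFGS-B",
               bounds=[(-8.0, 12.0)] * 2, options={"maxiter": 100})
sigma2, ell_x = np.exp(res.x)
```

Optimizing in log space keeps the parameters positive without constrained solvers, which is also why the table's scales come out strictly positive.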

Why this matters

The case for physics in the kernel, not the loss

In one sense this is the same finding as Inv 34: encoding the PDE is worth 3–4× more training data. But what a physics-GP gives you, and a PINN cannot, shows up in three places.

  • Calibrated uncertainty. Every query returns mean + variance. The variance is the posterior uncertainty after the PDE constraint has propagated information through the domain — not a post-hoc bootstrap.
  • Active learning built in. The posterior variance tells you where additional monitors would reduce uncertainty the most. This feeds directly into Inv 27 (adaptive monitor placement) and Inv 36 (EVPPI).
  • Convex-ish training. Hyperparameter NLL is smooth and has ≤ 5 parameters for this problem. No stochastic gradient descent, no learning-rate tuning, no loss-term weighting.

The trade-off is O(n³) scaling with training-set size. For 10⁴ observations (e.g., a year of hourly AQS data statewide), the standard scaling fixes apply: SVGP (sparse variational GPs, which fit a small set of inducing points instead of the full kernel matrix) or KISS-GP (structured kernel interpolation, which exploits grid structure for fast matrix-vector products) bring the cost down to near-linear in n with modest accuracy loss.
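The flavor of these fixes shows up already in the simplest member of the family, a rank-m Nyström factorization over inducing points (illustrative sketch; SVGP and KISS-GP are more sophisticated but share the low-rank idea). The kernel scales below are borrowed from the fitted-hyperparameter table.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 2000, 50                                  # full set vs. inducing set
X = np.sort(rng.uniform(0.0, 800.0, n))
Z = np.linspace(0.0, 800.0, m)                   # inducing-point locations

def se(a, b, s2=60.0, lx=125.0):                 # SE kernel, scales as in the table
    return s2 * np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * lx ** 2))

Knm = se(X, Z)                                   # n x m cross-covariance
Kmm = se(Z, Z) + 1e-8 * np.eye(m)
K_lowrank = Knm @ np.linalg.solve(Kmm, Knm.T)    # rank-m Nystrom reconstruction

# Relative error of the low-rank reconstruction of the full n x n kernel.
err = np.linalg.norm(se(X, X) - K_lowrank) / np.linalg.norm(se(X, X))
```

Because the SE kernel's spectrum decays fast relative to a 125 km length scale on an 800 km domain, 50 inducing points already reconstruct the 2000 × 2000 Gram matrix almost exactly.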

Decision implication

Pair physics-kerneled GPs with PINNs, don't choose between them

Recommendation: For regional PM2.5 emulation where CEC needs uncertainty-aware surrogate evaluations (VOI, active learning, calibration targets), a physics-kerneled GP is the right primitive. For high-throughput deterministic inference (operational nowcasting, optimization inner loops), a PINN is faster post-training. An integrated system would use the physics-GP during experimental-design phases and distill it into a PINN for production.

Caveats

What this pedagogical demo does not prove

  • Squared-exponential base kernel — a Matérn-5/2 would be more realistic for plume tails but has more complex derivatives.
  • Dense Cholesky is O(n³); with 229 training points this is trivial (~0.5 s), but production systems with 10⁴+ observations would need sparse GPs.
  • 1D PDE is a pedagogical simplification. Extending to 2D/3D multiplies kernel derivative complexity but the method is identical.
  • Source term S(x) assumed known; joint inference of source + state is the natural next step (e.g. 4D-Var-like approach — see Inv 33).
  • The operator L is linear in c. For nonlinear chemistry (O3 + NOx coupling), the method generalizes via iterative linearization (numerical Gaussian processes; Raissi et al. 2017) but is no longer closed-form.
  • Plain-GP baseline has no PDE structure — its large RMSE reflects SE-kernel inability to propagate information across a dynamical system from sparse data. With richer kernels (periodic, neural-kernel) the gap would narrow but still favor the physics kernel.