Can a GP kernel itself know the physics?
Inv 34 showed that a small MLP with a PDE-residual penalty beats a data-only MLP on sparse monitor data. But the PINN gives point predictions only — no calibrated uncertainty. For decision-making under uncertainty (e.g., CEC's question of which scenarios to run at high fidelity), uncertainty is the product. This investigation asks the next natural question: can we swap the MLP for a Gaussian process and encode the PDE directly into the kernel, so the posterior is physics-consistent by construction and carries honest variance?
The governing PDE is the same as in Inv 34:
∂c/∂t + u ∂c/∂x − D ∂²c/∂x² + k c = S(x)
with u = 5.0 km/h, D = 8.0 km²/h, k = 0.12/h, source centered at x = 200 km (width = 80 km).
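For concreteness, here is a minimal encoding of these constants and the source term. The Gaussian profile, the use of the 80 km width as the Gaussian scale, and the unit amplitude S0 are illustrative assumptions, not necessarily the demo's exact S(x):

```python
import numpy as np

# ADR parameters from the setup above
U_KMH   = 5.0    # advection speed u [km/h]
D_KM2H  = 8.0    # diffusivity D [km^2/h]
K_REACT = 0.12   # first-order loss rate k [1/h]

# Source term: assumed Gaussian profile centered at 200 km. Treating the
# 80 km "width" as the Gaussian scale and using unit amplitude S0 are
# assumptions for illustration only.
X_SRC, WIDTH, S0 = 200.0, 80.0, 1.0

def source(x):
    """Hypothetical S(x): Gaussian emission profile."""
    return S0 * np.exp(-((x - X_SRC) / WIDTH) ** 2)
```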
If c ~ GP, so does Lc
The core observation (Sarkka 2011; Alvarez, Luengo & Lawrence 2013) is that a GP is closed under linear operators: applying a linear operator to a GP yields another GP, with covariances obtained by pushing the operator through the kernel. If we place a squared-exponential prior c ∼ GP(0, k) and let L = ∂t + u ∂x − D ∂x² + k_r be the ADR operator (k_r is the reaction rate, written with a subscript to distinguish it from the kernel k), then:
- Lc ∼ GP(0, L₁L₂k). The operator pushes through to the kernel by applying it to each argument (L₁ acts on the first argument, L₂ on the second).
- cov(c, Lc)(x, x′) = L₂k(x, x′). The cross-covariance is a one-sided application of the operator to the base kernel.
- Since Lc = S (the known source), observing S at collocation points is mathematically identical to observing a linear transform of c. Conditioning the GP on both types of observations gives a posterior that respects the PDE in expectation and variance.
All 1D kernel derivatives through order four are available in closed form for the SE kernel. The ADR operator applied to both arguments produces a sum of 16 analytic terms (enumerated in rfaq/atmospheric/physics_gp.py) — no finite differencing needed.
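As a concrete sketch of the one-sided term, here is a closed-form L₂k for the space-time SE kernel in NumPy. Function names are illustrative, not the physics_gp.py API; the full two-sided L₁L₂k adds the remaining analytic terms enumerated there:

```python
import numpy as np

def se_kernel(x, t, xp, tp, sigma2, lx, lt):
    """Space-time SE kernel k((x,t),(x',t'))."""
    r, s = x - xp, t - tp
    return sigma2 * np.exp(-0.5 * (r / lx) ** 2 - 0.5 * (s / lt) ** 2)

def L2_se_kernel(x, t, xp, tp, sigma2, lx, lt, u=5.0, D=8.0, kr=0.12):
    """Apply L = d/dt' + u d/dx' - D d2/dx'2 + kr to the SECOND argument
    of the SE kernel, in closed form (no finite differencing).
    Returns cov(c(x,t), Lc(x',t')) under the GP prior."""
    r, s = x - xp, t - tp
    k = se_kernel(x, t, xp, tp, sigma2, lx, lt)
    dk_dtp   = (s / lt**2) * k                    # dk/dt'
    dk_dxp   = (r / lx**2) * k                    # dk/dx'
    d2k_dxp2 = (r**2 / lx**4 - 1.0 / lx**2) * k   # d2k/dx'2
    return dk_dtp + u * dk_dxp - D * d2k_dxp2 + kr * k
```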
Why this beats a PINN on uncertainty. A PINN gives a deterministic function. A physics-informed GP gives a full posterior — mean and variance — with the PDE baked into both. For downstream VOI / EVPPI / active learning computations (Inv 36 uses this structure), the variance is what you need.
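A minimal sketch of that posterior, conditioning jointly on direct concentration observations and on source (collocation) observations. K_cc, K_cL, K_LL are assumed block-builders for k, L₂k, and L₁L₂k between point sets; they are hypothetical stand-ins, not the actual physics_gp.py interface:

```python
import numpy as np

def physics_gp_posterior(Xstar, Xc, y, XL, S, K_cc, K_cL, K_LL, noise=1e-2):
    """Posterior mean/variance of c at test points Xstar, given noisy
    observations y = c(Xc) and source values S = (Lc)(XL).
    K_cc, K_cL, K_LL are assumed block-builders for k, L2 k, L1 L2 k."""
    A = K_cc(Xc, Xc)                      # cov(c, c)
    B = K_cL(Xc, XL)                      # cov(c, Lc)
    C = K_LL(XL, XL)                      # cov(Lc, Lc)
    # Small noise/jitter keeps the joint Cholesky numerically stable.
    K = np.block([[A, B], [B.T, C]]) + noise * np.eye(len(Xc) + len(XL))
    obs = np.concatenate([y, S])

    Ks = np.hstack([K_cc(Xstar, Xc), K_cL(Xstar, XL)])  # test-vs-joint blocks
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, obs))
    mean = Ks @ alpha
    v = np.linalg.solve(L, Ks.T)
    var = np.diag(K_cc(Xstar, Xstar)) - np.sum(v**2, axis=0)
    return mean, var
```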
From no-physics to physics-kerneled
What each surrogate predicts at t = 8.0h
Trained on 9 noisy observations at 3 downstream stations plus IC anchors, the physics-GP (green, with ±2σ band) tracks the reference plume to within a few tenths of a µg/m³ across the entire domain. The plain-GP (gold) fits the data points but extrapolates wildly upwind of the nearest station — the SE kernel has no mechanism for propagating information across a dynamical system. The green band narrows where the PDE + collocation data constrain the solution tightly, and widens a little upstream where only IC anchors pin it.
What marginal-likelihood maximization tells us
| Hyperparameter | Physics-GP | Plain-GP | Interpretation |
|---|---|---|---|
| σ² (prior variance) | 62.7 | 90.8 | Signal scale. Both recover roughly the same process variance. |
| ℓx (spatial length-scale, km) | 125.4 | 309.9 | Plain-GP inflates ℓx toward the domain width; the physics kernel prefers a tighter scale because the operator already supplies long-range structure. |
| ℓt (temporal length-scale, h) | 10.9 | 10.8 | Similar values. Physics-GP focuses on residual temporal correlation beyond what advection explains. |
| Negative log marginal likelihood | -150.5 | 41.9 | Lower is better, after accounting for the physics-GP's larger observation set. |
Hyperparameters were learned with SciPy's L-BFGS-B on log-transformed variables (max 100 iterations). Training took under 2 seconds on a laptop for the 229-point Cholesky factorization (9 station observations + 20 IC anchors + 200 collocation points).
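For reference, the optimization itself is the textbook GP marginal-likelihood setup. A minimal sketch, assuming a hypothetical build_K that assembles the joint covariance (stations + IC anchors + collocation) from (σ², ℓx, ℓt):

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_marginal_likelihood(log_params, X, y, build_K):
    """Standard GP NLML on log-transformed hyperparameters. build_K is an
    assumed helper returning the joint covariance for these hyperparameters."""
    sigma2, lx, lt = np.exp(log_params)                       # positivity via log
    K = build_K(X, sigma2, lx, lt) + 1e-6 * np.eye(len(y))    # jitter
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (0.5 * y @ alpha
            + np.sum(np.log(np.diag(L)))
            + 0.5 * len(y) * np.log(2.0 * np.pi))

# Illustrative call; the starting values are placeholders, not the demo's.
# res = minimize(neg_log_marginal_likelihood, x0=np.log([50.0, 150.0, 10.0]),
#                args=(X_train, y_train, build_K),
#                method="L-BFGS-B", options={"maxiter": 100})
```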
The case for physics in the kernel, not the loss
In one sense this is the same finding as Inv 34: encoding the PDE is worth 3–4× more training data. What a physics-GP provides beyond a PINN shows up in three places.
- Calibrated uncertainty. Every query returns mean + variance. The variance is the posterior uncertainty after the PDE constraint has propagated information through the domain — not a post-hoc bootstrap.
- Active learning built in. The posterior variance tells you where additional monitors would reduce uncertainty the most (a minimal sketch follows this list). This feeds directly into Inv 27 (adaptive monitor placement) and Inv 36 (EVPPI).
- Convex-ish training. Hyperparameter NLL is smooth and has ≤ 5 parameters for this problem. No stochastic gradient descent, no learning-rate tuning, no loss-term weighting.
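A minimal greedy-selection sketch for that active-learning loop, assuming a hypothetical posterior_var helper that returns the physics-GP posterior variance of c at a fixed analysis time:

```python
import numpy as np

def next_monitor_site(candidate_x, posterior_var):
    """Greedy rule: place the next monitor where the physics-GP posterior
    variance is largest. posterior_var(x) is an assumed helper returning
    Var[c(x, t_analysis)] from the conditioned GP."""
    variances = np.array([posterior_var(x) for x in candidate_x])
    return candidate_x[int(np.argmax(variances))]
```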
The trade-off is O(n³) scaling with training-set size. For 10⁴+ observations (e.g., a year of hourly AQS data statewide), the standard scaling fixes apply: SVGP (sparse variational GPs, which fit a small set of inducing points instead of the full kernel matrix) or KISS-GP (structured kernel interpolation, which exploits grid structure for fast matrix-vector products) bring this down to roughly O(n log n) with modest accuracy loss.
Pair physics-kerneled GPs with PINNs, don't choose between them
Recommendation: For regional PM2.5 emulation where CEC needs uncertainty-aware surrogate evaluations (VOI, active learning, calibration targets), a physics-kerneled GP is the right primitive. For high-throughput deterministic inference (operational nowcasting, optimization inner loops), a PINN is faster post-training. An integrated system would use the physics-GP during experimental-design phases and distill it into a PINN for production.
What this pedagogical demo does not prove
- Squared-exponential base kernel — a Matern-5/2 would be more realistic for plume tails but has more complex derivatives.
- Dense Cholesky O(n^3); with 229 training points this is trivial (~0.5s), but production systems with 10^4+ observations would need sparse GPs.
- 1D PDE is a pedagogical simplification. Extending to 2D/3D multiplies kernel derivative complexity but the method is identical.
- Source term S(x) assumed known; joint inference of source + state is the natural next step (e.g. 4D-Var-like approach — see Inv 33).
- The operator L is linear in c. For nonlinear chemistry (O3 + NOx coupling), the method generalizes via iterative linearization (numerical GPs; Raissi 2017) but is no longer closed-form.
- Plain-GP baseline has no PDE structure — its large RMSE reflects SE-kernel inability to propagate information across a dynamical system from sparse data. With richer kernels (periodic, neural-kernel) the gap would narrow but still favor the physics kernel.