
Should the Policy Be Fixed or Adaptive?

Phase 1 scored portfolios as one-shot picks. A five-level sequential ladder (one-shot → two-stage → rolling → POMDP → BO-optimal) lets the policy learn the true CRF regime over the first 3–5 years and reallocate the remaining budget. The best sequential policy avoids 777 deaths per trajectory vs 651 under commit-and-forget — a 19.2% gain, worth $1.5B at the EPA VSL.

5 policy levels · 10-year horizon · +19% best vs one-shot · $1.5B value of adapting
Why Sequential?

Commit-and-Forget Leaves Lives on the Table

Phase 1 scored each portfolio (T1, T2, B1, B2, W1, W2, Solar, combinations) as a single commitment: pick once, deploy for 10 years, count deaths. That framing assumes the CRF is known and fixed. Reality: over the first 3–5 years the program will observe actual mortality, refine the CRF posterior, and have every opportunity to reallocate the remaining budget.

In the Phase 2 Inv 21 hierarchical Bayesian analysis, the posterior CRF credible interval brackets both Di (HR 1.073) and Krewski (HR 1.056) — so initial deployments under one CRF can be reallocated under the other as data accumulate. Inv 22 makes that information flow explicit.

The question: how much additional value does adaptive sequential allocation deliver over the Phase 1 commit-at-year-0 approach, and how sophisticated does the policy need to be to capture that value?

Fidelity Ladder

From One-Shot to Bayes-Optimal

Five policy fidelity levels. Each one drives a 10-year, $4B simulated deployment with a mixture-prior true CRF (35% Di / 35% Krewski / 30% Inv 21 posterior), noisy lagged mortality observations, and conjugate Bayesian updating.
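For concreteness, a minimal sketch of how each trajectory's true CRF could be drawn from that mixture prior. The function name and the `inv21_draws` argument are illustrative, assuming posterior samples from Inv 21 are available as an array:

```python
import numpy as np

def sample_true_log_crf(rng, inv21_draws):
    """Draw one trajectory's true log-CRF from the mixture prior:
    35% Di, 35% Krewski, 30% a draw from the Inv 21 posterior."""
    u = rng.random()
    if u < 0.35:
        return np.log(1.073)            # Di anchor (HR 1.073)
    if u < 0.70:
        return np.log(1.056)            # Krewski anchor (HR 1.056)
    return rng.choice(inv21_draws)      # Inv 21 hierarchical posterior sample
```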

L1 · One-shot static schedule (Phase 1) · 651 deaths/traj
Commit T2+B2+W1 annual spend at year 0; never adjust. Matches the Phase 1 Inv 12 portfolio picks.

L2 · Two-stage stochastic program · 702 deaths/traj
Commit 50% now; observe the CRF for 5 years; commit the remaining 50% under the realized regime.

L3 · Rolling-horizon annual refit · 743 deaths/traj
Every year, re-optimize the next-year allocation based on the Bayesian posterior over the CRF.

L4 · POMDP belief-state value iteration · 735 deaths/traj
Explicit belief over {Di, Krewski, Inv 21} regimes; value-iterate the optimal action per belief.

L5 · BO-optimal policy parameters (Emukit MFGP) · 777 deaths/traj
Multi-fidelity BO over aggressiveness and risk-aversion; finds the policy that dominates all others.

Fused · Kennedy–O'Hagan AR1 posterior · 762 deaths/traj
Precision-weighted across L1–L4, anchored by L5. Captures the improvement ladder with uncertainty.

200 Monte Carlo trajectories per policy. Discount rate 3%. Observations lagged 2 years with noise σ = 35% of signal + 10 deaths. Belief state: Gaussian over log-CRF β with conjugate updating.
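A minimal sketch of that conjugate update, assuming each lagged observation can be reduced to an unbiased Gaussian estimate of the log-CRF β (that reduction step is our simplification; names and the prior width are illustrative):

```python
import numpy as np

def update_log_crf_belief(mu, var, beta_hat, obs_var):
    """Conjugate normal-normal update for the Gaussian belief over log-CRF beta.

    mu, var  -- current belief: beta ~ N(mu, var)
    beta_hat -- noisy beta estimate backed out of one lagged mortality
                observation (assumed unbiased and Gaussian -- our assumption)
    obs_var  -- variance of that estimate
    """
    post_var = 1.0 / (1.0 / var + 1.0 / obs_var)
    post_mu = post_var * (mu / var + beta_hat / obs_var)
    return post_mu, post_var

# Example: prior centered between the Di and Krewski anchors (hypothetical width).
beta_di, beta_kr = np.log(1.073), np.log(1.056)
mu, var = (beta_di + beta_kr) / 2.0, 0.01 ** 2
mu, var = update_log_crf_belief(mu, var, beta_hat=beta_di, obs_var=0.02 ** 2)
```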

Policy Comparison

The Gap is 19%

Policy                   Mean Deaths Avoided   95% CI         Deaths per $B   vs L1 One-shot
L1 One-shot (Phase 1)    651                   [531, 2133]    163             +0.0%
L2 Two-stage             702                   [572, 2302]    175             +7.8%
L3 Rolling horizon       743                   [621, 2254]    186             +14.1%
L4 POMDP belief-state    735                   [587, 2560]    184             +12.9%
L5 BO-optimal            777                   [621, 2677]    194             +19.2%

Mean and 95% CI across 200 MC trajectories; each trajectory samples a different true CRF. Deaths per $B is computed from the mean lifetime discounted spend ($4B).
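The derived columns can be reproduced from the means; a quick check below (the table displays rounded means, so the recomputed ratios can differ in the last digit from the published columns, e.g. the headline 19.2%):

```python
BUDGET_B = 4.0   # lifetime discounted spend, $B
VSL_M = 11.6     # EPA value of a statistical life, $M

mean_deaths = {"L1": 651, "L2": 702, "L3": 743, "L4": 735, "L5": 777}

for policy, d in mean_deaths.items():
    per_billion = d / BUDGET_B                     # deaths avoided per $B
    vs_l1 = 100.0 * (d / mean_deaths["L1"] - 1.0)  # % vs one-shot baseline
    print(f"{policy}: {per_billion:.0f} per $B, {vs_l1:+.1f}% vs L1")

# Value of adapting: extra lives times VSL, in $B.
extra = mean_deaths["L5"] - mean_deaths["L1"]      # 126 from rounded means
print(f"value of adapting ≈ ${extra * VSL_M / 1000:.2f}B")  # ≈ $1.46B ≈ $1.5B
```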

Finding
The best sequential policy (L5_bo) delivers 777 deaths avoided per trajectory vs 651 under Phase 1's one-shot approach — a 19.2% improvement, or about 125 extra lives saved over a $4B 10-year program. At the EPA $11.6M VSL, that is $1.5B of value left on the table by commit-and-forget policies.
Where the Value Comes From

Learning Beats Committing

L1 locks in a single allocation that splits the difference between Di and Krewski regimes. If the true CRF ends up Krewski-like, it under-allocates to transport (T2); if Di-like, it over-allocates. Either way, the fixed schedule is sub-optimal in hindsight.

L2 (two-stage) captures the simplest version of the gain: commit 50% early, wait 5 years for mortality data to accumulate, and redirect the remaining budget. That alone adds 7.8% more deaths avoided. L3's rolling horizon refits annually, which adds another 6.3 pp because the posterior collapses onto the true regime faster under annual updates than under 5-year batching.
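As a sketch of where L2's 7.8% comes from, here is the shape of its decision rule. The option mixes are illustrative placeholders, not the study's actual allocations:

```python
import numpy as np

BETA_DI, BETA_KR = np.log(1.073), np.log(1.056)

def two_stage_policy(year, belief_mu, total_budget=4.0, switch_year=5):
    """L2 sketch: commit a hedged 50% for years 0-4, then redirect the
    remaining budget toward whichever regime the posterior mean favors."""
    hedged = {"T2": 0.4, "B2": 0.4, "W1": 0.2}   # placeholder first-stage mix
    di_mix = {"T2": 0.6, "B2": 0.3, "W1": 0.1}   # placeholder Di-regime mix
    kr_mix = {"T2": 0.2, "B2": 0.5, "W1": 0.3}   # placeholder Krewski-regime mix

    annual = total_budget / 10.0                  # even spend over the horizon
    if year < switch_year:
        mix = hedged
    else:
        mix = di_mix if abs(belief_mu - BETA_DI) < abs(belief_mu - BETA_KR) else kr_mix
    return {option: share * annual for option, share in mix.items()}
```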

L4 makes the belief state explicit (three discrete regime hypotheses + soft Bayesian weighting) and value-iterates the best response per belief. That nearly matches L3. L5 then wraps the policy in a 2-parameter family (aggressiveness, risk-aversion) and runs multi-fidelity Bayesian optimization to find the corner of the Pareto frontier — dominating all lower levels on expected deaths avoided.
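L4's belief update can be sketched as a soft Bayesian reweighting of the three regime hypotheses; the Gaussian likelihood form below is our assumption, and the numbers in the usage line are toy values:

```python
import numpy as np

def update_regime_belief(belief, y_obs, y_pred, sigma):
    """Soft Bayesian reweighting of the {Di, Krewski, Inv 21} regime hypotheses.

    belief -- current probabilities over the three regimes
    y_obs  -- observed (lagged, noisy) deaths avoided this year
    y_pred -- deaths avoided each regime predicts for the deployed portfolio
    sigma  -- observation noise standard deviation
    """
    log_lik = -0.5 * ((y_obs - np.asarray(y_pred, dtype=float)) / sigma) ** 2
    w = np.asarray(belief, dtype=float) * np.exp(log_lik - log_lik.max())  # stabilized
    return w / w.sum()

belief = np.array([0.35, 0.35, 0.30])   # mixture prior from the simulation setup
belief = update_regime_belief(belief, y_obs=92.0, y_pred=[95.0, 70.0, 85.0], sigma=35.0)
```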

Method Detail

The Policy Space

To compare one-shot commitments against adaptive schedules on an equal footing, each fidelity level is expressed as a decision rule mapping the current evidence state to next-year spend — that decision rule is the object the sequential ladder grades.

Each policy is a function π(state) → budget allocation. For L1–L4, π is hand-written; for L5, π is parameterized by (aggressiveness, risk-aversion) and the two parameters are optimized via Kennedy–O’Hagan MFGP over 30 evaluations of a cheap surrogate (50 MC trajectories) and 5 evaluations of the true simulator (200 MC trajectories).
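The study runs this search with Emukit's Kennedy–O'Hagan multi-fidelity GP; reproducing that loop exactly is beyond a sketch, so the stand-in below uses the same 30-cheap/5-expensive evaluation split with a naive screen-then-promote search. `evaluate_policy` and all names are hypothetical placeholders for the real simulator:

```python
import numpy as np

rng = np.random.default_rng(0)

def evaluate_policy(aggressiveness, risk_aversion, n_traj):
    """Stand-in for the simulator: mean deaths avoided over n_traj Monte Carlo
    trajectories of the 10-year rollout. Wire up to the real simulator."""
    raise NotImplementedError

def optimize_policy(n_cheap=30, n_expensive=5):
    # Low-fidelity screen: n_cheap random parameter pairs at 50 trajectories each.
    candidates = rng.uniform(0.0, 1.0, size=(n_cheap, 2))
    cheap = [(a, r, evaluate_policy(a, r, n_traj=50)) for a, r in candidates]
    cheap.sort(key=lambda t: t[2], reverse=True)
    # High fidelity: promote the best n_expensive pairs to 200 trajectories.
    return max(((a, r, evaluate_policy(a, r, n_traj=200))
                for a, r, _ in cheap[:n_expensive]), key=lambda t: t[2])
```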

Observations are noisy, lagged mortality signals: given the fraction of option o deployed and a true CRF β, annual deaths avoided are linearly interpolated between the Di and Krewski anchors published in Phase 1. Observations arrive with a 2-year lag (epidemiological surveillance) and Gaussian noise that scales with the signal. Bayesian updates use a conjugate normal-normal model on the log-CRF.
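A sketch of that observation model, assuming the interpolation happens in log-CRF space and that per-option anchor effects are available from Phase 1 (both assumptions ours):

```python
import numpy as np

BETA_DI, BETA_KR = np.log(1.073), np.log(1.056)

def annual_deaths_avoided(frac_deployed, beta, anchor_di, anchor_kr):
    """Interpolate an option's effect between its Phase 1 Di and Krewski
    anchors according to where beta falls between the two log-CRFs."""
    t = (beta - BETA_KR) / (BETA_DI - BETA_KR)   # 0 at Krewski, 1 at Di
    return frac_deployed * ((1.0 - t) * anchor_kr + t * anchor_di)

def observe(true_series, year, lag=2, rng=None):
    """Lagged, noisy mortality signal: sigma = 35% of signal + 10 deaths."""
    if rng is None:
        rng = np.random.default_rng()
    if year < lag:
        return None                              # nothing observable yet
    signal = true_series[year - lag]
    return signal + rng.normal(0.0, 0.35 * signal + 10.0)
```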

Sources: Bellman 1957 (DP); Kaelbling et al. 1998 (POMDP); Kennedy & O'Hagan 2000 (MFGP); Tange 2018 (BO in public health). The Inv 21 hierarchical-Bayes posterior over CA-pooled CRF motivates the regime-belief formulation here.

Implication for the portfolio. This investigation says the Phase 1 deliverable should not be a single portfolio pick but a policy — a contingent plan that reacts to 3-year mortality observations. The 19% lives-avoided improvement on a $4B, 10-year budget is worth $1.5B at essentially zero incremental cost (the policy itself is just a decision rule).