Can ML Beat the Stress Score?

We pitted a 3-variable rule-based stress score against logistic regression and gradient boosting on 4,344 institutions with 244 closures. The ML wins on paper, but that's not the whole story.

Model Performance: Recall, Precision, F1 Score
Finding
Gradient boosting: 92% recall, 98% precision, 0.95 F1 vs. 93% recall, 33% precision, 0.49 F1 for the stress score. Better numbers, but 244 closures is a small training set. Don't read too much into the gap.

The stress score catches 93% of closures, but at 33% precision that means roughly two false alarms for every real closure it flags — about 450 false positives against 227 true positives. Logistic regression catches 98% with far fewer false alarms. Gradient boosting achieves the best balance — 92% recall with 98% precision — but it's a black box that requires 17 features and careful tuning.
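The false-alarm counts implied by the table can be back-calculated from recall and precision alone. A minimal sketch, assuming all 244 closures are in the evaluated set (the helper name `confusion_counts` is illustrative, not from the original analysis):

```python
def confusion_counts(recall, precision, n_positives):
    """Back out true/false positive counts from recall, precision,
    and the number of actual positives (here, closures)."""
    tp = recall * n_positives            # closures caught
    fp = tp * (1 - precision) / precision  # false alarms implied by precision
    return round(tp), round(fp)

# Reported metrics on 244 closures (from the comparison table)
stress_tp, stress_fp = confusion_counts(0.930, 0.334, 244)  # ~227 TP, ~452 FP
logit_tp, logit_fp = confusion_counts(0.984, 0.677, 244)    # ~240 TP, ~115 FP
```

The same arithmetic shows why logistic regression's precision gain matters operationally: at similar recall, it cuts the false-alarm pile by roughly a factor of four.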

The real question isn't "which model has the best F1?" It's "which model would you trust a policy decision to?" A simple rule that anyone can audit beats a complex model that nobody can explain — especially when the stakes involve accreditation and financial aid decisions that affect hundreds of thousands of students.

Model Comparison

Side by Side

Model                    Recall   Precision   F1      AUC
Stress Score (3 rules)   93.0%    33.4%       0.491   n/a
Logistic Regression      98.4%    67.7%       0.800   0.996
Gradient Boosting        91.8%    98.2%       0.949   0.999

Gradient boosting dominates on every metric except recall, where logistic regression edges it out (98.4% vs. 91.8%). For a closure early-warning system, missing fewer closures matters more than fewer false alarms — which argues for logistic regression as the deployment model. But the stress score is still the right first step: it's transparent, auditable, and catches 93% of closures with three variables anyone can check.
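The F1 column follows directly from the recall and precision columns — F1 is their harmonic mean, so a sketch can reproduce the table's values (small differences are rounding from the unrounded inputs):

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Reproduce the F1 column from the precision/recall columns
print(round(f1(0.334, 0.930), 3))  # stress score   -> 0.491
print(round(f1(0.677, 0.984), 3))  # logistic reg.  -> 0.802 (table: 0.800, rounding)
print(round(f1(0.982, 0.918), 3))  # grad. boosting -> 0.949
```

This also makes the stress score's 0.491 F1 legible: very high recall pulled down hard by 33% precision, since the harmonic mean punishes the weaker of the two numbers.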

Feature Importance

What the Model Learned

Top 10 Feature Importances (Gradient Boosting)

The gradient boosting model confirms what the stress score already knew: cash reserves (30.4%) and default rate (28.0%) dominate. Tuition dependency (12.4%) rounds out the top three — exactly the three variables in the stress score.

The remaining 14 features contribute a combined 29% of importance. Enrollment (7.5%), Pell rate (6.5%), and tuition level (3.9%) add some predictive power, but the marginal gains from features 7–17 are negligible. The model complexity buys precision, not recall.
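The concentration of importance in the top three features can be checked with a quick tally of the reported values (only the six named importances are reproduced here; the remaining 11 features are lumped together):

```python
# Reported gradient-boosting feature importances (top 6 of 17)
importances = {
    "cash_reserves": 0.304,
    "default_rate": 0.280,
    "tuition_dependency": 0.124,
    "enrollment": 0.075,
    "pell_rate": 0.065,
    "tuition_level": 0.039,
}

ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
top3 = sum(v for _, v in ranked[:3])
print(f"top 3 features carry {top3:.1%} of importance")  # -> 70.8%
print(f"remaining 14 features share {1 - top3:.1%}")     # -> 29.2%
```

The top three are exactly the stress score's three variables, which is why the simple rule gets most of the way to the model's recall.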

For a first deployment, use the stress score. The gradient boosting model is a better predictor, but the stress score is transparent, and its three thresholds are easy to audit and recalibrate when the data shifts.

5-fold stratified cross-validation on 4,344 institutions (244 closures). Gradient boosting: 200 trees, max_depth=4, learning_rate=0.05. Stress score: tuition_dependency > 85% AND enrollment < 1000 AND default_rate > 15%.
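The stress score in that note is simple enough to write in a few lines. A sketch of the rule as stated — an institution is flagged only when all three thresholds are breached; the function name is illustrative and rates are expressed as fractions (0.85 = 85%):

```python
def stress_flag(tuition_dependency, enrollment, default_rate):
    """Rule-based stress score from the methodology note:
    flag when all three thresholds are breached."""
    return (
        tuition_dependency > 0.85  # tuition dependency above 85%
        and enrollment < 1000      # enrollment below 1,000
        and default_rate > 0.15    # default rate above 15%
    )

print(stress_flag(0.90, 600, 0.20))   # True  (all three thresholds breached)
print(stress_flag(0.90, 2500, 0.20))  # False (enrollment above the cutoff)
```

Anyone with three numbers from an institution's filings can evaluate it by hand, which is the auditability argument in a nutshell.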