Q6

How Well Did Our Assumptions Match Reality?

We built this study first on synthetic data calibrated to published NCES/IPEDS statistics, then re-ran it on real College Scorecard data. Here's what our assumptions got right — and what they got wrong.

Why This Matters

Every Model Starts with Assumptions

We generated 4,000+ synthetic institutions before we had real data. The synthetic data was calibrated to published NCES/IPEDS aggregate statistics — sector distributions, enrollment ranges, closure rates by institution type. But calibration to aggregates doesn't guarantee the distributions match at the institution level.
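The method note below states the generation used numpy random distributions. As a minimal sketch of what that kind of generator looks like — the sector shares, closure rates, and lognormal parameters here are hypothetical placeholders, not the study's calibrated NCES/IPEDS values:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical calibration targets (NOT the study's actual NCES/IPEDS numbers).
SECTOR_SHARES = {
    "Public 4-year": 0.18,
    "Private NP 4-year": 0.38,
    "For-Profit 4-year": 0.22,
    "Public 2-year": 0.22,
}
CLOSURE_RATES = {
    "Public 4-year": 0.005,
    "Private NP 4-year": 0.03,
    "For-Profit 4-year": 0.10,
    "Public 2-year": 0.01,
}

def generate_institutions(n):
    # Sample sectors according to the target aggregate shares.
    sectors = rng.choice(list(SECTOR_SHARES), size=n, p=list(SECTOR_SHARES.values()))
    # Lognormal enrollment: most institutions small, with a long right tail.
    enrollment = np.round(rng.lognormal(mean=7.0, sigma=1.2, size=n)).astype(int)
    # Bernoulli closure draw at each institution's sector-level rate.
    closed = rng.random(n) < np.array([CLOSURE_RATES[s] for s in sectors])
    return sectors, enrollment, closed

sectors, enrollment, closed = generate_institutions(4000)
```

Matching aggregates this way is exactly the limitation the text describes: the marginal distributions are calibrated, but nothing forces the joint, institution-level structure to look real.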

Now that we have real College Scorecard and IPEDS data, we can check those assumptions directly. This check is itself an ADM demonstration: if the synthetic data was close enough to produce the same findings as real data, then the lower-fidelity synthetic model was sufficient for the decision it supported — initial exploration and methodology development.

Summary Statistics

Synthetic vs. Real: The Numbers

High-level comparison of the two datasets. These summary statistics tell us whether our synthetic generation process produced a dataset of similar scale and closure prevalence.

Metric                  Synthetic   Real
Total Institutions      --          --
Total Closures          --          --
Overall Closure Rate    --          --
Sector Distribution

Did We Get the Mix Right?

The most basic structural check: does the synthetic dataset have roughly the same proportion of institutions in each sector (Public 4-year, Private Nonprofit 4-year, For-Profit, etc.) as the real data? Getting this wrong would cascade into every downstream analysis.

Institution Count by Sector: Synthetic vs. Real
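The proportion check behind this chart can be sketched with pandas. The sector labels and counts below are toy stand-ins, not the study's data:

```python
import pandas as pd

# Toy stand-in labels for the two datasets (hypothetical counts).
synthetic = pd.Series(["Private NP 4-year"] * 38 + ["For-Profit 4-year"] * 22
                      + ["Public 4-year"] * 18 + ["Public 2-year"] * 22)
real = pd.Series(["Private NP 4-year"] * 40 + ["For-Profit 4-year"] * 20
                 + ["Public 4-year"] * 19 + ["Public 2-year"] * 21)

# Normalized shares align by sector label; fillna covers sectors
# present in only one dataset.
shares = pd.DataFrame({
    "synthetic": synthetic.value_counts(normalize=True),
    "real": real.value_counts(normalize=True),
}).fillna(0.0)
shares["gap_pp"] = (shares["synthetic"] - shares["real"]).abs() * 100
print(shares.round(3))
```

The `gap_pp` column is the per-sector absolute gap in percentage points — the "within a few percentage points" claim in the findings is a statement about the maximum of this column.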
Closure Rates by Sector

Where Did Closures Actually Happen?

Sector-level closure rates are the core signal in this study. If the synthetic data nailed the overall rate but assigned closures to the wrong sectors, every investigation built on that data would inherit the bias.

Closure Rate by Sector: Synthetic vs. Real
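The sector-level rate comparison reduces to a grouped mean over a closure indicator. A minimal sketch with hypothetical records (the real comparison ran over the full synthetic and College Scorecard/IPEDS datasets):

```python
import pandas as pd

# Hypothetical (source, sector, closed) records: 2/20 and 1/20 closures.
records = (
    [("synthetic", "For-Profit 4-year", c) for c in [1] * 2 + [0] * 18]
    + [("synthetic", "Private NP 4-year", c) for c in [1] * 1 + [0] * 19]
    + [("real", "For-Profit 4-year", c) for c in [1] * 2 + [0] * 18]
    + [("real", "Private NP 4-year", c) for c in [1] * 1 + [0] * 19]
)
df = pd.DataFrame(records, columns=["source", "sector", "closed"])

# Mean of a 0/1 indicator is the closure rate; unstack puts the two
# sources side by side for direct comparison.
rates = df.groupby(["sector", "source"])["closed"].mean().unstack("source")
print(rates)
```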
Closure Rates by Region

Did Geography Hold Up?

Regional patterns in college closure are driven by demographics, state funding, and local economic conditions. Our synthetic data used regional closure rate multipliers from published NCES data. Here's how those calibrated rates compared to what we found in the real data.

Closure Rate by Region: Synthetic vs. Real
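The multiplier scheme described above scales a base closure rate per region. A sketch with hypothetical numbers — the study's actual base rate and multipliers came from published NCES data and are not reproduced here:

```python
# Hypothetical base rate and regional multipliers (NOT the study's values).
BASE_CLOSURE_RATE = 0.02
REGION_MULTIPLIERS = {
    "New England": 1.4,
    "Midwest": 1.2,
    "South Atlantic": 1.0,
    "Pacific": 0.8,
}

# Each region's synthetic closure rate is the base rate times its multiplier.
regional_rates = {r: BASE_CLOSURE_RATE * m for r, m in REGION_MULTIPLIERS.items()}
```

Because every region's rate is a fixed multiple of one national base, the scheme compresses regional variance toward the mean — which is consistent with the finding below that real regional differences were more extreme than the calibrated multipliers produced.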
Enrollment Distribution

Did Synthetic Institutions Look Real?

Beyond counts and rates, did the synthetic institutions have realistic enrollment sizes? This matters because small institutions close at much higher rates, so getting the enrollment distribution wrong would distort the risk profile. We compare the enrollment distribution for Private Nonprofit 4-year institutions — the sector with the most closures.

Enrollment Distribution: Private NP 4-Year (Synthetic vs. Real)
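The method note specifies consistent bin edges across both datasets; that is what makes the two histograms directly comparable. A sketch with hypothetical enrollment draws and hypothetical bin edges:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical enrollment draws standing in for the two datasets.
synthetic = rng.lognormal(mean=7.0, sigma=1.0, size=1000)
real = rng.lognormal(mean=6.8, sigma=1.2, size=1000)

# One shared set of edges for both datasets; the first bin isolates
# the under-200 institutions the findings flag as underrepresented.
edges = np.array([0, 200, 500, 1000, 2500, 5000, 10000, 10**9])
syn_counts, _ = np.histogram(synthetic, bins=edges)
real_counts, _ = np.histogram(real, bins=edges)

# Compare shares, not raw counts, since dataset sizes may differ.
syn_share = syn_counts / syn_counts.sum()
real_share = real_counts / real_counts.sum()
```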
Key Findings

What Matched and What Didn't

Finding 1: Synthetic sector distribution was within a few percentage points of reality for most sectors. The generation process correctly captured that Private Nonprofit 4-year and For-Profit institutions dominate the landscape.

Finding 2: For-Profit closure rates in the synthetic data closely tracked real data. This is the highest-risk sector, and getting it right meant the stress score and ML models trained on synthetic data transferred well to real data.

Finding 3: Regional closure rate variation was underestimated in the synthetic data. The real data shows more extreme regional differences than our calibrated multipliers produced — a known limitation of calibrating to national-level aggregates.

Finding 4: Enrollment distributions were broadly similar, but the synthetic data underrepresented the very smallest institutions (under 200 students) — precisely the ones most likely to close. This means synthetic-trained models may slightly underestimate risk for the smallest schools.

The ADM takeaway: synthetic data was good enough for methodology development and initial exploration, but not good enough for institution-level predictions. That's exactly what the fidelity ladder predicts — start cheap, validate, then escalate where it matters. The real data replaced the synthetic data for all final results in this study.

Method Note

How We Built the Comparison

Synthetic data was generated using numpy random distributions calibrated to published NCES/IPEDS aggregate statistics. Real data comes from the College Scorecard API and IPEDS direct downloads. Comparison metrics were computed on matching sector and region definitions, and enrollment distributions were binned using consistent bin edges across both datasets.

Sectors follow the IPEDS classification (Public 4-year, Private NP 4-year, For-Profit 4-year, Public 2-year, Private NP 2-year, For-Profit 2-year, For-Profit <2-year). Regions follow Census Bureau divisions. Closure is defined as an institution no longer operating as of 2023.