Every Model Starts with Assumptions
We generated 4,000+ synthetic institutions before we had real data. The synthetic data was calibrated to published NCES/IPEDS aggregate statistics: sector distributions, enrollment ranges, and closure rates by institution type. But calibration to aggregates doesn't guarantee that the distributions match at the institution level.
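A minimal sketch of that generation step is below. The sector weights and closure rates shown are illustrative placeholders, not the actual published NCES/IPEDS values, and only three of the seven sectors are shown for brevity:

```python
import numpy as np

rng = np.random.default_rng(42)
N = 4000

# Illustrative calibration targets: stand-ins for the published
# NCES/IPEDS aggregates, not the values actually used in the study.
SECTORS = ["Public 4-year", "Private NP 4-year", "For-Profit 4-year"]
SECTOR_WEIGHTS = [0.40, 0.35, 0.25]            # aggregate sector mix
CLOSURE_RATE = {"Public 4-year": 0.01,         # closure rate by sector
                "Private NP 4-year": 0.04,
                "For-Profit 4-year": 0.12}

sector = rng.choice(SECTORS, size=N, p=SECTOR_WEIGHTS)
# A log-normal reproduces the aggregate enrollment range, but nothing
# guarantees the institution-level shape; that's the assumption we test.
enrollment = rng.lognormal(mean=7.5, sigma=1.2, size=N).astype(int)
closed = rng.random(N) < np.vectorize(CLOSURE_RATE.get)(sector)
```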
Now that we have real College Scorecard and IPEDS data, we can check those assumptions directly. This check is itself an ADM demonstration: if the synthetic data was close enough to produce the same findings as the real data, then the lower-fidelity synthetic model was sufficient for the decision it supported, namely initial exploration and methodology development.
Synthetic vs. Real: The Numbers
We start with a high-level comparison of the two datasets. These summary statistics tell us whether our synthetic generation process produced a dataset of similar scale and closure prevalence.
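A sketch of that comparison, assuming two pandas DataFrames (`synthetic` and `real`, hypothetical names) with one row per institution and a boolean `closed` column:

```python
import pandas as pd

def summarize(df: pd.DataFrame, label: str) -> pd.Series:
    """Scale and closure-prevalence summary for one dataset."""
    return pd.Series(
        {"institutions": len(df),
         "closures": int(df["closed"].sum()),
         "closure_rate": round(df["closed"].mean(), 4)},
        name=label,
    )

# One column per dataset, one row per summary statistic.
print(pd.concat([summarize(synthetic, "synthetic"),
                 summarize(real, "real")], axis=1))
```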
Did We Get the Mix Right?
The most basic structural check: does the synthetic dataset have roughly the same proportion of institutions in each sector (Public 4-year, Private Nonprofit 4-year, For-Profit, etc.) as the real data? An error here would cascade into every downstream analysis.
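The check itself is a side-by-side comparison of normalized sector counts; the `sector` column name is assumed, as in the earlier sketch:

```python
import pandas as pd

# Sector proportions in each dataset, aligned on sector label.
mix = pd.concat(
    {"synthetic": synthetic["sector"].value_counts(normalize=True),
     "real": real["sector"].value_counts(normalize=True)},
    axis=1,
).fillna(0.0)
mix["abs_diff"] = (mix["synthetic"] - mix["real"]).abs()
print(mix.sort_values("abs_diff", ascending=False))
```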
Where Did Closures Actually Happen?
Sector-level closure rates are the core signal in this study. If the synthetic data nailed the overall rate but assigned closures to the wrong sectors, every investigation built on that data would inherit the bias.
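A sketch of the per-sector rate comparison, again against the hypothetical `synthetic`/`real` DataFrames. The helper is written generically so the regional check below can reuse it:

```python
import pandas as pd

def rate_comparison(syn: pd.DataFrame, real: pd.DataFrame,
                    col: str) -> pd.DataFrame:
    """Closure rate per group in each dataset, plus the gap between them."""
    out = pd.concat({"synthetic": syn.groupby(col)["closed"].mean(),
                     "real": real.groupby(col)["closed"].mean()}, axis=1)
    out["gap"] = out["synthetic"] - out["real"]
    return out

# Largest miscalibrations first.
print(rate_comparison(synthetic, real, "sector")
      .sort_values("gap", key=abs, ascending=False))
```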
Did Geography Hold Up?
Regional patterns in college closure are driven by demographics, state funding, and local economic conditions. Our synthetic data used regional closure rate multipliers from published NCES data. Here's how those calibrated rates compared to what we found in the real data.
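The same helper from the sector check handles the regional cut; `region` is an assumed column holding each institution's Census Bureau division:

```python
# Reuses rate_comparison() defined in the sector check above.
print(rate_comparison(synthetic, real, "region"))
```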
Did Synthetic Institutions Look Real?
Beyond counts and rates, did the synthetic institutions have realistic enrollment sizes? This matters because small institutions close at much higher rates, so getting the enrollment distribution wrong would distort the risk profile. We compare the enrollment distribution for Private Nonprofit 4-year institutions — the sector with the most closures.
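A sketch of that distribution check, using shared bin edges (per the methodology notes below) plus a two-sample Kolmogorov-Smirnov test as a single-number summary. Column names, the sector label, and the specific bin edges are assumptions for illustration:

```python
import numpy as np
from scipy.stats import ks_2samp

# Restrict both datasets to the sector with the most closures.
syn_enr = synthetic.loc[synthetic["sector"] == "Private NP 4-year",
                        "enrollment"].dropna()
real_enr = real.loc[real["sector"] == "Private NP 4-year",
                    "enrollment"].dropna()

# Shared log-spaced bin edges keep the two histograms directly comparable.
edges = np.logspace(np.log10(50), np.log10(50_000), num=12)
syn_hist, _ = np.histogram(syn_enr, bins=edges, density=True)
real_hist, _ = np.histogram(real_enr, bins=edges, density=True)

# KS test compares the full empirical distributions, not just the bins.
stat, pvalue = ks_2samp(syn_enr, real_enr)
print(f"KS statistic {stat:.3f}, p = {pvalue:.3g}")
```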
What Matched and What Didn't
The ADM takeaway: synthetic data was good enough for methodology development and initial exploration, but not good enough for institution-level predictions. That's exactly what the fidelity ladder predicts — start cheap, validate, then escalate where it matters. The real data replaced the synthetic data for all final results in this study.
How We Built the Comparison
Synthetic data generated using numpy random distributions calibrated to published NCES/IPEDS aggregate statistics. Real data from College Scorecard API and IPEDS direct downloads. Comparison metrics computed on matching sector and region definitions. Enrollment distributions binned using consistent bin edges across both datasets.
Sectors follow IPEDS classification (Public 4-year, Private NP 4-year, For-Profit 4-year, Public 2-year, Private NP 2-year, For-Profit 2-year, For-Profit <2-year). Regions follow Census Bureau divisions. Closure defined as institution no longer operating as of 2023.
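For reference, a minimal sketch of the Scorecard pull described above. The endpoint and query parameters match the public College Scorecard API; the specific field names follow our reading of the Scorecard data dictionary and should be verified against the current version:

```python
import requests

BASE = "https://api.data.gov/ed/collegescorecard/v1/schools"
# Field names per the Scorecard data dictionary (verify before relying
# on them); school.operating distinguishes open from closed institutions.
FIELDS = "id,school.name,school.operating,latest.student.size"

def fetch_page(api_key: str, page: int = 0) -> dict:
    """Fetch one page of Scorecard results (max 100 records per page)."""
    resp = requests.get(
        BASE,
        params={"api_key": api_key, "fields": FIELDS,
                "per_page": 100, "page": page},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()
```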