Study 01 · Molecular Track · Closed 2026-06-05

Molecular Aggregation & Variant Effects.

A rigorous, calibrated reproduction and extension of amyloid-β variant-effect modeling. Every number script-generated from checksum-locked public data. Every caveat kept in the record.

By Michael Key · ORCID

Whose foundation this stands on

The data, the assay, and the comprehensive scientific picture here belong primarily to the Lehner / Bolognesi lab — the Aβ and TDP-43 deep-mutational-scanning atlases (GSE151147 / GSE128165), the published CANYA aggregation model, and the energetic / thermodynamic analysis of Aβ aggregation with hundreds of measured couplings (Seuma et al., Science Advances 2025). That the nucleation assay accurately discriminates the known familial Alzheimer’s-disease mutations is itself the source lab’s published finding (Seuma et al., “The genetic landscape for amyloid-β fibril nucleation accurately discriminates familial Alzheimer’s-disease mutations,” eLife 2021); the familial-discrimination result below is reproduced on that foundation, not discovered here. They have already published the comprehensive answer. This study’s contribution is not a new measurement or a standalone discovery. It is a rigorous, calibrated reproduction and extension on their foundation, whose distinctive value is honest, calibrated uncertainty — a model that says how sure it is, and is right about that. Everything here is mechanistic understanding of in-vitro effects: not therapy, not a clinical claim, not a statement about any individual patient.

Study at a glance · the fidelity path

Reproduce 0.82 Held-out AUROC for the published CANYA baseline; the paper reports 0.809.

Calibrate 0.068 → 0.011 Calibration error fell roughly 6x without changing the ranking.

Specialize 0.62 vs 0.20 Spearman correlation on 468 held-out Aβ single mutations.

Expose the boundary 8 positives The familial-mutation AUROC is illustrative; the large-n correlation is the result to trust.

Contribution 1

A calibrated Aβ model that knows its own uncertainty.

The foundation: reproducing the field’s model

Before building anything, I reproduced CANYA — a leading published aggregation model — on its own held-out data: AUROC 0.82 on 7,626 random peptides (paper reports 0.809). The pipeline is faithful and there’s a concrete baseline to improve on.

That faithful reproduction also revealed something important: CANYA ranks well but is materially overconfident. Its raw calibration error was 0.068. Isotonic recalibration cut that roughly 6x to 0.011, with ranking unchanged. This is a standard, competent technique, not a breakthrough, but it makes the reported uncertainty more reliable.
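A minimal sketch of what isotonic recalibration does, assuming an equal-width-bin calibration error and a hand-rolled pool-adjacent-violators fit. The function names, the 10-bin choice, and the toy data are illustrative assumptions, not the study's pipeline or its numbers.

```python
from bisect import bisect_left

def expected_calibration_error(probs, labels, n_bins=10):
    """Equal-width-bin ECE: size-weighted mean of |mean confidence - hit rate|."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    ece = 0.0
    for members in bins:
        if members:
            mean_p = sum(p for p, _ in members) / len(members)
            hit_rate = sum(y for _, y in members) / len(members)
            ece += len(members) / len(probs) * abs(mean_p - hit_rate)
    return ece

def fit_isotonic(scores, labels):
    """Pool-adjacent-violators fit of a non-decreasing map from score to P(y=1)."""
    totals = {}
    for s, y in zip(scores, labels):  # pool tied scores first
        tot, cnt = totals.get(s, (0.0, 0))
        totals[s] = (tot + y, cnt + 1)
    blocks = []  # each block is [label_sum, count, highest_score_covered]
    for s in sorted(totals):
        blocks.append([totals[s][0], totals[s][1], s])
        # Merge backwards while an earlier block has a higher mean than a later one.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            label_sum, count, top = blocks.pop()
            blocks[-1][0] += label_sum
            blocks[-1][1] += count
            blocks[-1][2] = top
    return [b[2] for b in blocks], [b[0] / b[1] for b in blocks]

def apply_isotonic(model, score):
    """Look up the calibrated value of the first block that covers this score."""
    uppers, values = model
    return values[min(bisect_left(uppers, score), len(values) - 1)]

# Toy, made-up data: a scorer that says 0.9 / 0.1 when the truth is 0.6 / 0.4.
scores = [0.9] * 100 + [0.1] * 100
labels = [1] * 60 + [0] * 40 + [1] * 40 + [0] * 60
model = fit_isotonic(scores, labels)
calibrated = [apply_isotonic(model, s) for s in scores]
ece_before = expected_calibration_error(scores, labels)
ece_after = expected_calibration_error(calibrated, labels)
```

Because the fitted map is non-decreasing, it can merge scores into ties but never reorders them, which is why ranking metrics such as AUROC are left essentially unchanged. The toy fit is evaluated on its own training points; a real check would fit on a calibration split and score a separate held-out split.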

The disease model

Trained on 14,015 Aβ double-mutants and tested on 468 held-out single-mutants (including all 8 dominant familial-Alzheimer’s mutations, never seen in training), a from-scratch Aβ-specific additive model predicts measured nucleation at Spearman 0.62, versus 0.20 for zero-shot CANYA on the identical points. The disease-relevant metric is discriminating the familial mutations from non-familial ones, which is the source lab’s own published result (Seuma et al., eLife 2021), reproduced here rather than discovered. On that metric the model reaches AUROC 0.92, matching the measured assay’s own ceiling (0.90) and well above zero-shot CANYA (0.76). That AUROC rests on 8 familial positives; see the caveats.
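For readers less familiar with the two metrics quoted above, here is a small dependency-free sketch of how each is computed. The helper names and the four-point toy inputs are assumptions for illustration; they are not the study's data or code.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson(x, y):
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(predicted, measured):
    """Spearman correlation is Pearson correlation computed on the ranks."""
    return pearson(average_ranks(predicted), average_ranks(measured))

def auroc(scores, labels):
    """AUROC via the rank-sum (Mann-Whitney U) identity; labels are 0 or 1."""
    ranks = average_ranks(scores)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Toy inputs (made up): predicted scores, measured scores, familial flags.
toy_predicted = [0.1, 0.4, 0.35, 0.8]
toy_measured = [0.0, 0.5, 0.3, 0.9]
toy_familial = [0, 0, 1, 1]
toy_spearman = spearman(toy_predicted, toy_measured)
toy_auroc = auroc(toy_predicted, toy_familial)
```

Spearman uses every held-out point, while AUROC only compares positives against negatives, so with very few positives it moves in large steps.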

Load-bearing caveats on the disease result

The familial AUROC has a training-advantage component. My model trained on Aβ data; CANYA was applied cold. The honest claim is “Aβ-specific modeling greatly outperforms a transferred general model,” not “my architecture is better.” The honest estimate controlling for which variants the model can actually learn is lower — the load-bearing result is the large-n Spearman across 468 points, not the 8-positive AUROC.
The model’s 0.92 against the assay’s ~0.90 is within noise, so the claim is “matched the ceiling,” not “surpassed it.” That ceiling, the assay’s own ability to discriminate the familial mutations, is the source lab’s published result (Seuma et al., eLife 2021), reproduced here.
The familial AUROC rests on 8 positive examples. A small-n result. The Spearman across 468 is the number to trust.
No model cures anything. Cures need labs and trials. This is a contribution to mechanistic understanding of in-vitro nucleation.
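To make the small-n caveat concrete, the rough width of the uncertainty on an 8-positive AUROC can be estimated with the Hanley-McNeil approximation. This is an illustrative back-of-envelope check, not the study's own uncertainty analysis; the negative count of 460 is an assumption (468 held-out singles minus the 8 familial ones).

```python
import math

def hanley_mcneil_se(auc, n_pos, n_neg):
    """Approximate standard error of an AUROC (Hanley & McNeil, 1982)."""
    q1 = auc / (2 - auc)
    q2 = 2 * auc ** 2 / (1 + auc)
    variance = (
        auc * (1 - auc)
        + (n_pos - 1) * (q1 - auc ** 2)
        + (n_neg - 1) * (q2 - auc ** 2)
    ) / (n_pos * n_neg)
    return math.sqrt(variance)

N_POS = 8
N_NEG = 468 - 8  # assumption: every non-familial held-out single is a negative
se = hanley_mcneil_se(0.92, N_POS, N_NEG)
gap = 0.92 - 0.90  # model AUROC minus the assay ceiling quoted above
print(f"approx. SE = {se:.3f}, gap to ceiling = {gap:.2f}")
```

Under these assumptions the standard error is several hundredths, larger than the 0.02 gap between the model and the assay ceiling, which is consistent with reading the result as "matched" rather than "surpassed".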

Honest calibrated uncertainty on the disease model itself

A point prediction is exactly the confidently-wrong number this mission refuses. Wrapping the disease model in distribution-free (conformal) prediction intervals exposed something accuracy metrics cannot see: a structural blind spot affecting nearly half the test variants — positions never seen among the training doubles, where a position-additive model is structurally uninformative. The naïve way of reporting confidence would have been badly wrong.

Group-conditional (Mondrian) calibration fixed the conditional honesty: a single marginal interval over-covered the variants the model knows and under-covered the ones it doesn’t, whereas per-learnability-group intervals bring both to nominal target coverage simultaneously — sharp where the model is informed, honestly wide where it isn’t.
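The group-conditional idea can be sketched as split conformal prediction with one residual quantile per group. This is a simplified illustration under assumed names (`learnable`, `cold_start`) and made-up calibration residuals; the study's actual grouping rule and score function are not shown here.

```python
import math

def conformal_quantile(scores, alpha):
    """Finite-sample split-conformal quantile: the ceil((n+1)(1-alpha))-th smallest score."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float("inf")  # too few calibration points for this coverage level
    return sorted(scores)[k - 1]

def mondrian_halfwidths(cal_preds, cal_truths, cal_groups, alpha=0.1):
    """One absolute-residual quantile per group, instead of a single marginal one."""
    residuals_by_group = {}
    for pred, truth, group in zip(cal_preds, cal_truths, cal_groups):
        residuals_by_group.setdefault(group, []).append(abs(truth - pred))
    return {g: conformal_quantile(r, alpha) for g, r in residuals_by_group.items()}

def interval(pred, group, halfwidths):
    q = halfwidths[group]
    return pred - q, pred + q

# Made-up calibration set: small errors where the model is informed, large where it is blind.
cal_preds = [0.0] * 38
cal_truths = [0.01 * i for i in range(1, 20)] + [0.1 * i for i in range(1, 20)]
cal_groups = ["learnable"] * 19 + ["cold_start"] * 19

halfwidths = mondrian_halfwidths(cal_preds, cal_truths, cal_groups, alpha=0.1)
marginal = conformal_quantile([abs(t - p) for p, t in zip(cal_preds, cal_truths)], 0.1)
```

In this toy case the single marginal half-width lands between the two group half-widths: too wide for the `learnable` group and too narrow for `cold_start`, which is the over-coverage and under-coverage pattern described above.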

One load-bearing exception remains — and it is the most disease-relevant one. All familial mutations fall in the “learnable” band, but their effects run hotter than the band expects. The calibrated interval under-covers them, with both misses above the interval — the worst direction for a pathogenic risk call. The right-fidelity deliverable is a sharp interval for general screening plus an explicit wider-upper buffer for the pathogenic familial end. Not one interval everywhere. That nuance is the right-fidelity edge made concrete: a model that states its uncertainty correctly, including admitting where — and in which direction — it is least safe.
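One way to build an interval with a deliberately wider upper side is to calibrate the two sides separately on signed residuals and give the upper side a stricter miss budget. The sketch below is an assumption about how such a buffer could be implemented, not the study's method; the function names and toy residuals are hypothetical.

```python
import math

def conformal_upper_quantile(values, level):
    """The ceil((n+1)*level)-th smallest value, or infinity if n is too small."""
    n = len(values)
    k = math.ceil((n + 1) * level)
    return float("inf") if k > n else sorted(values)[k - 1]

def asymmetric_interval(pred, cal_preds, cal_truths, alpha_lo=0.05, alpha_hi=0.05):
    """Two one-sided bounds; by a union bound they jointly miss at most alpha_lo + alpha_hi."""
    truth_above = [t - p for p, t in zip(cal_preds, cal_truths)]  # under-prediction
    truth_below = [p - t for p, t in zip(cal_preds, cal_truths)]  # over-prediction
    upper = pred + conformal_upper_quantile(truth_above, 1 - alpha_hi)
    lower = pred - conformal_upper_quantile(truth_below, 1 - alpha_lo)
    return lower, upper

# Made-up calibration residuals that run "hot": truths sit above predictions more often.
cal_preds = [0.0] * 39
cal_truths = [0.02 * i for i in range(-9, 30)]
lower, upper = asymmetric_interval(1.0, cal_preds, cal_truths)
```

Lowering `alpha_hi` relative to `alpha_lo` widens only the upper side, which matches the stated goal of guarding against misses above the interval while keeping the lower bound sharp.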

Contribution 2

The disease signal is protein-specific — a biological barrier, not a modeling one.

A natural hope: one model spanning the misfolding diseases. I tested it directly. Naïve cross-protein transfer between Aβ (Alzheimer’s) and TDP-43 (ALS) fails — backwards. Transfer Spearman: -0.19. The aggregation→disease relationship inverts sign between the two proteins: a hydrophobic substitution correlates with harm in Aβ (+0.30) but with protection in TDP-43 (-0.45). This is consistent with known biology — it emerged from the pipeline rather than being put in by hand. The method is surfacing something real.

The sharper question: is that barrier a weakness of crude features, or is it biological? I re-ran the transfer with a strong protein-language-model representation. The pLM is a substantially better within-protein model (for comparison, the crude biophysical baseline reaches within-Aβ Spearman 0.37 and within-TDP-43 Spearman 0.61). Yet the pLM transfers no better: a statistical zero for Aβ→TDP-43, strongly negative for TDP-43→Aβ. A strong representation captures each protein’s signal but does not transfer at all, so the barrier is biological, not representational.

The study used 1,196 TDP-43 single-mutant measurements (GSE128165) as the transfer target.

Honest caveat (load-bearing)

The two assays measure different phenotypes: Aβ nucleation vs TDP-43 toxicity. I cannot isolate the protein sign-flip from the phenotype mismatch without a shared-assay control — data I do not have. I can say the failure is not representational and is biological in the broad sense. A clean test would need two proteins assayed for the same phenotype.

Contribution 3

The suppressor question — a go/no-go.

The boldest disease-framed question the atlas can answer directly: for each familial mutation, does a second mutation make it aggregate less than expected? Such a second mutation is a suppressor: not a therapy (you cannot add a second mutation to a person), but a mechanistic signal. The approach needs no model: epistasis is the measured double-mutant score minus the sum of the two single-mutant effects, computed straight from the published scores. This follows the established epistasis method from the Lehner/Bolognesi lab’s own published library.

Verdict: PARTIAL — at its thinnest, one replicate away from a clean negative. With two real confounds handled — a global sub-additive baseline (suppression is the atlas-wide norm, so every value must be offset-corrected) and a measurement ceiling sitting on the most aggregating familial mutations — the signal is enhancer-dominated. Four of the six well-covered familial mutations show no suppression at all. One non-ceiling lead emerged: fragile, resting on two discordant replicates, losing significance on either alone.
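The epistasis arithmetic and the offset correction can be sketched in a few lines. Using the atlas-wide median as the sub-additive baseline is an assumption made for illustration (the text says only that values must be offset-corrected); the variant labels and scores below are placeholders, not atlas data.

```python
from statistics import median

def raw_epistasis(double_score, single_a, single_b):
    """Measured double-mutant score minus the additive expectation from the singles."""
    return double_score - (single_a + single_b)

def offset_corrected(raw_by_pair):
    """Subtract a global baseline so that 'typical' sub-additivity reads as zero."""
    baseline = median(raw_by_pair.values())  # assumed baseline: atlas-wide median
    return {pair: value - baseline for pair, value in raw_by_pair.items()}

# Placeholder scores (made up), relative to wild type.
singles = {"mutA": 1.0, "mutB": 0.5, "mutC": -0.5}
doubles = {("mutA", "mutB"): 1.0, ("mutA", "mutC"): 0.0, ("mutB", "mutC"): -1.0}

raw = {
    pair: raw_epistasis(score, singles[pair[0]], singles[pair[1]])
    for pair, score in doubles.items()
}
corrected = offset_corrected(raw)
# A pair counts as a candidate suppressor only if it is negative AFTER correction.
candidates = [pair for pair, value in corrected.items() if value < 0]
```

In the toy data every raw value is negative, yet only one pair stays negative after the baseline is removed, which illustrates why uncorrected values would overstate suppression. The measurement-ceiling confound is not handled in this sketch.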

The honest kept-in-the-record result: I checked rigorously, and the public atlas does not support a familial-suppressor map. Doing it properly would require the lab’s full thermodynamic modeling, which sits above this study’s scope and data.

What I ruled out

Bigger and cleverer doesn’t fix the tail.

Between the model and the suppressor check, I tested whether bigger or cleverer machinery closes the disease model’s remaining weakness: systematic under-prediction of the most pathogenic (extreme-nucleating) variants. It does not, from two directions.

Protein-language-model head. A pLM-based model (ESM-2, 8M parameters) scores higher than the additive model on overall correlation — but primarily by gaining useful predictions for the cold-start variants the additive model is blind to. Where the additive model can learn, it remains competitive. And the pLM’s cold-start gain is the result most exposed to the model’s pretraining familiarity with this exact protein sequence.

Scaling the pLM roughly four-fold (8M → 35M parameters) reduces tail under-prediction but regresses cold-start performance, so there is no clean win on both simultaneously.

Asymmetric (tail-aware) training loss. Reduces under-prediction, but turns out to be global debiasing, not a tail fix — it lifts bulk residuals as much as tail residuals, and a trivial one-number offset matches it.

The conclusion that holds across all of it: the pathogenic-tail under-prediction is a signal limit of a shallow single-protein assay, not something a bigger representation or a cleverer objective resolves. Resolving it would require structure-based data, or more coverage of the extreme end of the assay distribution. The honest cheap fix for the model’s global low-bias is a one-number debias folded into the calibration layer.
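A one-number debias is simply a constant shift fitted on calibration residuals. The sketch below, with made-up numbers and assumed function names, shows why such a shift removes global bias but cannot by itself fix the tail: every prediction moves by the same amount.

```python
def fit_offset(cal_preds, cal_truths):
    """Mean signed residual on a calibration set: the single debias number."""
    return sum(t - p for p, t in zip(cal_preds, cal_truths)) / len(cal_preds)

def apply_offset(preds, offset):
    return [p + offset for p in preds]

# Made-up calibration set: three bulk variants and one extreme-nucleating variant.
cal_preds = [0.0, 0.0, 0.0, 0.0]
cal_truths = [0.25, 0.25, 0.25, 1.25]

offset = fit_offset(cal_preds, cal_truths)
debiased = apply_offset(cal_preds, offset)
residuals_after = [t - p for p, t in zip(debiased, cal_truths)]
```

After the shift the mean residual is zero, but the extreme variant is still under-predicted and the bulk is now slightly over-predicted: the same "lifts bulk as much as tail" behaviour described above for the asymmetric loss.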

Honest bottom line

What this contributes, honestly stated.

What this study contributes

A rigorously reproduced, honestly calibrated Aβ variant-effect model whose differentiator is trustworthy uncertainty: sharp where it is informed, explicitly wide where it is blind, and honest about where its intervals fail on the disease-relevant end.
Two clean, well-controlled negatives: cross-disease generalization is blocked by biology (not representation), and a bigger model or cleverer loss does not fix the pathogenic tail.
A PARTIAL on the suppressor question: one fragile lead, not a map. The public atlas does not support a familial-suppressor claim.
Every claim carries its caveat, every number reproduces, and the negatives are kept in the record — exactly the kind of honest, verifiable contribution this mission exists to make.

The molecular track closes here. The Lehner/Bolognesi lab has the comprehensive answer on molecular mechanisms. My most defensible contribution is the calibration work — a model that refuses to be confidently wrong — and the cross-protein finding: two proteins assayed by different methods, a sign inversion, a biological barrier. These are honest inputs to the field. The next study is calibrated clinical prognosis.


Audit trail

Every number is checkable.

Data. All raw datasets are public and locked by sha256 checksum: GSE268261 (CANYA random-peptide evaluation set), GSE151147 (Aβ deep-mutational-scanning atlas, Seuma et al.), GSE128165 (TDP-43 toxicity atlas). No data from outside these accessions enters any result.

Reproduction. Each investigation regenerates its results from scratch on every run: a reproduction harness re-runs each twice and asserts byte-identical output, so if any number here drifts from what the scripts produce, it is immediately visible. Full code, the research protocol, and per-investigation provenance are available on request; the public repository opens at launch.
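The two checks described above (checksum-locked inputs and run-twice byte identity) can be sketched with the standard library alone. The function names are hypothetical and the harness shape is an assumption; the actual repository may organise this differently.

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large datasets need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def assert_locked(path, expected_hex):
    """Refuse to proceed if an input file differs from its recorded checksum."""
    actual = sha256_of(path)
    if actual != expected_hex:
        raise ValueError(f"{path}: checksum {actual} does not match the lock")

def assert_reproducible(run):
    """Call `run` twice and require byte-identical output; `run` must return bytes."""
    first, second = run(), run()
    if first != second:
        raise AssertionError("output differs between two runs")
    return first
```

A harness along these lines only catches nondeterminism that shows up in the output bytes, so in practice it relies on fixed random seeds and stable iteration order inside each investigation's script.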

Method. Full method and per-investigation caveats — the research protocol and each investigation’s analysis, scenario, and reproduction notes — are available on request now and will be published with the public repository at launch.
