Molecular Aggregation & Variant Effects.
A rigorous, calibrated reproduction and extension of amyloid-β variant-effect modeling. Every number script-generated from checksum-locked public data. Every caveat kept in the record.
By Michael Key · ORCID
The data, the assay, and the comprehensive scientific picture here belong primarily to the Lehner / Bolognesi lab — the Aβ and TDP-43 deep-mutational-scanning atlases (GSE151147 / GSE128165), the published CANYA aggregation model, and the energetic / thermodynamic analysis of Aβ aggregation with hundreds of measured couplings (Seuma et al., Science Advances 2025). That the nucleation assay accurately discriminates the known familial Alzheimer’s-disease mutations is itself the source lab’s published finding (Seuma et al., “The genetic landscape for amyloid-β fibril nucleation accurately discriminates familial Alzheimer’s-disease mutations,” eLife 2021); the familial-discrimination result below is reproduced on that foundation, not discovered here. They have already published the comprehensive answer. This study’s contribution is not a new measurement or a standalone discovery. It is a rigorous, calibrated reproduction and extension on their foundation, whose distinctive value is honest, calibrated uncertainty — a model that says how sure it is, and is right about that. Everything here is mechanistic understanding of in-vitro effects: not therapy, not a clinical claim, not a statement about any individual patient.
A calibrated Aβ model that knows its own uncertainty.
The foundation: reproducing the field’s model
Before building anything, I reproduced CANYA — a leading published aggregation model — on its own held-out data: AUROC 0.82 on 7,626 random peptides (paper reports 0.809). The pipeline is faithful and there’s a concrete baseline to improve on.
That faithful reproduction also revealed something important: CANYA ranks well but lies about how sure it is. Its raw calibration error was 0.068, reflecting systematic overconfidence. Isotonic recalibration cut that roughly sixfold, to 0.011, with ranking unchanged. This is a standard, competent technique, not a breakthrough, but it makes the model’s uncertainty trustworthy. That matters for the mission: never a confidently wrong number.
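As a minimal sketch of the recalibration step described above, the block below computes an equal-width-bin expected calibration error and fits an isotonic (non-decreasing) map from raw score to probability using pool-adjacent-violators. It is an illustration, not the study's code: the function names, the 10-bin choice, and the toy scores are all assumptions made here.

```python
from bisect import bisect_left

def expected_calibration_error(probs, labels, n_bins=10):
    """Equal-width-bin ECE: size-weighted mean of |accuracy - confidence|."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    total = len(probs)
    ece = 0.0
    for members in bins:
        if members:
            conf = sum(p for p, _ in members) / len(members)
            acc = sum(y for _, y in members) / len(members)
            ece += len(members) / total * abs(acc - conf)
    return ece

def fit_isotonic(scores, labels):
    """Pool-adjacent-violators fit; returns (upper score bounds, block means)."""
    pooled = {}  # pool tied scores first so each score maps to one value
    for s, y in zip(scores, labels):
        tot, cnt = pooled.get(s, (0.0, 0))
        pooled[s] = (tot + y, cnt + 1)
    blocks = []  # each block: [label_sum, count, highest_score_in_block]
    for s in sorted(pooled):
        tot, cnt = pooled[s]
        blocks.append([tot, cnt, s])
        # Merge backwards while an earlier block has a higher mean.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            tot2, cnt2, hi2 = blocks.pop()
            blocks[-1][0] += tot2
            blocks[-1][1] += cnt2
            blocks[-1][2] = hi2
    return [b[2] for b in blocks], [b[0] / b[1] for b in blocks]

def apply_isotonic(model, score):
    """Look up the calibrated probability for one raw score."""
    uppers, values = model
    return values[min(bisect_left(uppers, score), len(values) - 1)]

# Toy data, invented for illustration: an overconfident scorer.
scores = [0.9] * 10 + [0.1] * 10
labels = [1] * 6 + [0] * 4 + [1] * 4 + [0] * 6
model = fit_isotonic(scores, labels)
recalibrated = [apply_isotonic(model, s) for s in scores]
raw_ece = expected_calibration_error(scores, labels)
cal_ece = expected_calibration_error(recalibrated, labels)
```

Because the fitted map is non-decreasing, it can merge scores into ties but never reverses their order, which is why ranking metrics are unaffected. The toy example fits and evaluates on the same points only to stay short; a real pipeline would fit the map on a calibration split and report the error on separate data.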
The disease model
Trained on 14,015 Aβ double-mutants and tested on 468 held-out single-mutants (including all 8 dominant familial-Alzheimer’s mutations, never seen in training), a from-scratch Aβ-specific additive model predicts measured nucleation at Spearman 0.62, versus 0.20 for zero-shot CANYA on the identical points. On the disease-relevant metric, discriminating the familial mutations from non-familial ones, it reaches AUROC 0.92. That matches the measured assay’s own ceiling (0.90) and sits well above zero-shot CANYA (0.76). The discrimination result itself is the source lab’s own published finding (Seuma et al., eLife 2021), reproduced here rather than discovered, and it rests on only 8 familial positives; see the caveats below.
- The familial AUROC has a training-advantage component. My model trained on Aβ data; CANYA was applied cold. The honest claim is “Aβ-specific modeling greatly outperforms a transferred general model,” not “my architecture is better.” The honest estimate controlling for which variants the model can actually learn is lower — the load-bearing result is the large-n Spearman across 468 points, not the 8-positive AUROC.
- Matching ~0.90 is within noise. The gap between the model’s 0.92 and the assay’s ~0.90 is inside the noise of the measurement: “matched the ceiling,” not “surpassed it.” That ceiling, the assay’s own ability to discriminate the familial mutations, is the source lab’s published result (Seuma et al., eLife 2021), reproduced here.
- The familial AUROC rests on 8 positive examples. A small-n result. The Spearman across 468 is the number to trust.
- No model cures anything. Cures need labs and trials. This is a contribution to mechanistic understanding of in-vitro nucleation.
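The two metrics quoted above, Spearman correlation and AUROC, both reduce to rank arithmetic. The following is a minimal stdlib sketch of how they can be computed; the function names are chosen here for illustration and the numbers in the usage lines are toy values, not the study's data.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

def auroc(scores, is_positive):
    """Rank-sum (Mann-Whitney) form of the area under the ROC curve."""
    ranks = average_ranks(scores)
    n_pos = sum(is_positive)
    n_neg = len(scores) - n_pos
    pos_rank_sum = sum(r for r, p in zip(ranks, is_positive) if p)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Toy inputs, invented for illustration.
rho = spearman([1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0])
area = auroc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
```

The rank-sum form makes the small-n caveat concrete: the AUROC depends on the positives only through the sum of their ranks, so with 8 positives a handful of rank swaps can move it noticeably.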
Honest calibrated uncertainty on the disease model itself
A point prediction is exactly the confidently-wrong number this mission refuses. Wrapping the disease model in distribution-free (conformal) prediction intervals exposed something accuracy metrics cannot see: a structural blind spot affecting nearly half the test variants — positions never seen among the training doubles, where a position-additive model is structurally uninformative. The naïve way of reporting confidence would have been badly wrong.
Group-conditional (Mondrian) calibration fixed the conditional honesty: a single marginal interval over-covered the variants the model knows and under-covered the ones it doesn’t, whereas per-learnability-group intervals bring both to nominal target coverage simultaneously — sharp where the model is informed, honestly wide where it isn’t.
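A minimal sketch of group-conditional (Mondrian) split conformal intervals follows, assuming absolute residuals as the nonconformity score and a held-out calibration set. The group labels `"learnable"` and `"cold_start"`, the function names, and the toy values are hypothetical; the study's actual score function and grouping rule are not given in the text.

```python
import math

def conformal_radius(abs_residuals, alpha):
    """Finite-sample split-conformal quantile of calibration |residuals|."""
    n = len(abs_residuals)
    k = math.ceil((n + 1) * (1 - alpha))  # 1-based order statistic
    if k > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(abs_residuals)[k - 1]

def mondrian_radii(cal_pred, cal_true, cal_group, alpha=0.1):
    """One interval half-width per group, each calibrated on its own members."""
    by_group = {}
    for p, t, g in zip(cal_pred, cal_true, cal_group):
        by_group.setdefault(g, []).append(abs(t - p))
    return {g: conformal_radius(r, alpha) for g, r in by_group.items()}

def predict_interval(pred, group, radii):
    """Symmetric interval around a point prediction, using the group's radius."""
    r = radii[group]
    return pred - r, pred + r

# Toy calibration set, invented for illustration: small errors where the
# model is informed, ten times larger errors where it is not.
cal_pred = [0] * 38
cal_true = list(range(1, 20)) + [10 * i for i in range(1, 20)]
cal_group = ["learnable"] * 19 + ["cold_start"] * 19
radii = mondrian_radii(cal_pred, cal_true, cal_group, alpha=0.1)
sharp = predict_interval(5, "learnable", radii)
wide = predict_interval(5, "cold_start", radii)
```

A single radius computed over all 38 toy points would land between the two group radii, which is the over-cover and under-cover pattern the paragraph describes. Calibrating each group separately costs calibration points per group, so very small groups get infinite (uninformative) intervals rather than falsely tight ones.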
One load-bearing exception remains — and it is the most disease-relevant one. All familial mutations fall in the “learnable” band, but their effects run hotter than the band expects. The calibrated interval under-covers them, with both misses above the interval — the worst direction for a pathogenic risk call. The right-fidelity deliverable is a sharp interval for general screening plus an explicit wider-upper buffer for the pathogenic familial end. Not one interval everywhere. That nuance is the right-fidelity edge made concrete: a model that states its uncertainty correctly, including admitting where — and in which direction — it is least safe.
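The directional failure described above can be audited with a few lines: count how many observations fall above versus below their interval, then widen only the upper bound. This is a sketch under assumptions. The helper names are hypothetical, the intervals and observations are toy values, and how the upper buffer should be sized is not specified in the text, so it is left as a free parameter.

```python
def miss_directions(intervals, observed):
    """Count covered points and misses, split by direction."""
    above = sum(1 for (lo, hi), y in zip(intervals, observed) if y > hi)
    below = sum(1 for (lo, hi), y in zip(intervals, observed) if y < lo)
    return {
        "covered": len(observed) - above - below,
        "above": above,
        "below": below,
    }

def widen_upper(interval, buffer):
    """Asymmetric widening: extend only the upper bound."""
    lo, hi = interval
    return lo, hi + buffer

# Toy values, invented for illustration.
intervals = [(0.0, 1.0), (0.0, 1.0), (0.0, 1.0)]
observed = [0.5, 1.5, 2.0]
before = miss_directions(intervals, observed)
buffered = [widen_upper(iv, 1.0) for iv in intervals]
after = miss_directions(buffered, observed)
```

Reporting the direction of each miss, rather than coverage alone, is what distinguishes an interval that errs toward caution from one that errs toward understating a pathogenic effect.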
The disease signal is protein-specific — a biological barrier, not a modeling one.
A natural hope: one model spanning the misfolding diseases. I tested it directly. Naïve cross-protein transfer between Aβ (Alzheimer’s) and TDP-43 (ALS) fails — backwards. Transfer Spearman: -0.19. The aggregation→disease relationship inverts sign between the two proteins: a hydrophobic substitution correlates with harm in Aβ (+0.30) but with protection in TDP-43 (-0.45). This is consistent with known biology — it emerged from the pipeline rather than being put in by hand. The method is surfacing something real.
The sharper question: is that barrier a weakness of crude features, or is it biological? I re-ran the transfer with a strong protein-language-model (pLM) representation. The pLM is a substantially better within-protein model (for comparison, the crude biophysical baseline reaches within-Aβ Spearman 0.37 and within-TDP-43 Spearman 0.61). Yet the pLM transfers no better: a statistical zero for Aβ→TDP-43, strongly negative for TDP-43→Aβ. A strong representation captures each protein’s signal but does not transfer at all, so the barrier is biological, not representational.
The study used 1,196 TDP-43 single-mutant measurements (GSE128165) as the transfer target.
The two assays measure different phenotypes: Aβ nucleation vs TDP-43 toxicity. I cannot isolate the protein sign-flip from the phenotype mismatch without a shared-assay control — data I do not have. I can say the failure is not representational and is biological in the broad sense. A clean test would need two proteins assayed for the same phenotype.
The suppressor question — a go/no-go.
The boldest disease-framed question the atlas can answer directly: for each familial mutation, does a second mutation make it aggregate less than expected? Such a second mutation is a suppressor: not a therapy (you cannot add a second mutation to a person), but a mechanistic signal. The approach needs no model: epistasis is the measured double-mutant score minus the sum of the two single-mutant effects, computed straight from the published scores. This uses the established epistasis method from the Lehner/Bolognesi lab’s own published library.
Verdict: PARTIAL — at its thinnest, one replicate away from a clean negative. With two real confounds handled — a global sub-additive baseline (suppression is the atlas-wide norm, so every value must be offset-corrected) and a measurement ceiling sitting on the most aggregating familial mutations — the signal is enhancer-dominated. Four of the six well-covered familial mutations show no suppression at all. One non-ceiling lead emerged: fragile, resting on two discordant replicates, losing significance on either alone.
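The epistasis calculation and the baseline correction can be sketched as below. Two things are assumptions made for illustration: that scores are expressed relative to wild type on an additive scale, and that the atlas-wide offset is estimated as the median raw epistasis. The text says only that values must be offset-corrected, not how. The pair labels and numbers are placeholders, not atlas values.

```python
from statistics import median

def raw_epistasis(double_score, single_a, single_b):
    """Measured double-mutant score minus the additive expectation."""
    return double_score - (single_a + single_b)

def offset_corrected(epistasis_by_pair):
    """Subtract a global baseline (assumed here: the median over all pairs)."""
    baseline = median(epistasis_by_pair.values())
    return {pair: e - baseline for pair, e in epistasis_by_pair.items()}

# Placeholder values, invented for illustration.
raw = {
    "pair_1": raw_epistasis(0.75, 0.75, 0.5),   # -0.5
    "pair_2": raw_epistasis(1.0, 0.75, 0.5),    # -0.25
    "pair_3": raw_epistasis(1.25, 0.75, 0.5),   #  0.0
}
corrected = offset_corrected(raw)
```

After correction, a negative value marks a pair that nucleates less than the atlas-wide norm would predict (a suppressor candidate) and a positive value marks an enhancer. Without the correction every pair in this toy set looks suppressive or neutral, which is the confound the verdict refers to. The measurement ceiling is a separate issue that this sketch does not address.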
The honest kept-in-the-record result: I checked rigorously, and the public atlas does not support a familial-suppressor map. Doing it properly would require the lab’s full thermodynamic modeling, which sits above this study’s scope and data.
Bigger and cleverer doesn’t fix the tail.
Between the model and the suppressor check, I tested whether bigger or cleverer machinery closes the disease model’s remaining weakness: systematic under-prediction of the most pathogenic (extreme-nucleating) variants. It does not, from two directions.
Protein-language-model head. A pLM-based model (ESM-2, 8M parameters) scores higher than the additive model on overall correlation — but primarily by gaining useful predictions for the cold-start variants the additive model is blind to. Where the additive model can learn, it remains competitive. And the pLM’s cold-start gain is the result most exposed to the model’s pretraining familiarity with this exact protein sequence.
Scaling the pLM roughly four-fold (8M → 35M parameters) reduces tail under-prediction but regresses cold-start performance, with no clean win on both simultaneously.
Asymmetric (tail-aware) training loss. Reduces under-prediction, but turns out to be global debiasing, not a tail fix — it lifts bulk residuals as much as tail residuals, and a trivial one-number offset matches it.
The conclusion that holds across all of it: the pathogenic-tail under-prediction is a signal limit of a shallow single-protein assay, not something a bigger representation or a cleverer objective resolves. Resolving it would require structure-based data, or more coverage of the extreme end of the assay distribution. The honest cheap fix for the model’s global low-bias is a one-number debias folded into the calibration layer.
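To make the "global debiasing, not a tail fix" point concrete, the sketch below fits a one-number offset as the mean calibration residual and measures the gap between tail and bulk residuals before and after applying it. The function names, the `tail_cutoff` parameter, and the toy values are assumptions for illustration only.

```python
from statistics import mean

def fit_offset(cal_pred, cal_true):
    """One-number debias: the mean residual on a calibration set."""
    return mean([t - p for p, t in zip(cal_pred, cal_true)])

def tail_minus_bulk_gap(pred, true, tail_cutoff):
    """Mean residual in the extreme tail minus mean residual in the bulk."""
    tail = [t - p for p, t in zip(pred, true) if t >= tail_cutoff]
    bulk = [t - p for p, t in zip(pred, true) if t < tail_cutoff]
    return mean(tail) - mean(bulk)

# Toy values, invented for illustration: a model that under-predicts
# everything a little and the single extreme variant a lot.
pred = [0.0, 0.0, 0.0, 0.0]
true = [1.0, 1.0, 1.0, 5.0]
offset = fit_offset(pred, true)
debiased = [p + offset for p in pred]
gap_before = tail_minus_bulk_gap(pred, true, tail_cutoff=4.0)
gap_after = tail_minus_bulk_gap(debiased, true, tail_cutoff=4.0)
```

A constant shift moves every residual by the same amount, so the tail-minus-bulk gap is unchanged by construction. Any method whose effect is matched by such an offset has removed the global bias and left the tail-specific error where it was.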
What this contributes, honestly stated.
- A rigorously reproduced, honestly calibrated Aβ variant-effect model whose differentiator is trustworthy uncertainty: sharp where it is informed, explicitly wide where it is blind, and honest about where its intervals fail on the disease-relevant end.
- Two clean, well-controlled negatives: cross-disease generalization is blocked by biology (not representation), and a bigger model or cleverer loss does not fix the pathogenic tail.
- A PARTIAL on the suppressor question: one fragile lead, not a map. The public atlas does not support a familial-suppressor claim.
- Every claim carries its caveat, every number reproduces, and the negatives are kept in the record — exactly the kind of honest, verifiable contribution this mission exists to make.
The molecular track closes here. The Lehner/Bolognesi lab has the comprehensive answer on molecular mechanisms. My most defensible contribution is the calibration work — a model that refuses to be confidently wrong — and the cross-protein finding: two proteins assayed by different methods, a sign inversion, a biological barrier. These are honest inputs to the field. The next study is calibrated clinical prognosis.
Every number is checkable.
Data. All raw datasets are public and locked by sha256 checksum: GSE268261 (CANYA random-peptide evaluation set), GSE151147 (Aβ deep-mutational-scanning atlas, Seuma et al.), GSE128165 (TDP-43 toxicity atlas). No data from outside these accessions enters any result.
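A checksum lock of this kind can be sketched with the standard library alone. The function names are hypothetical, and the expected digest would come from a manifest the project maintains; no real digests for these accessions are given in the text, so none are shown here.

```python
import hashlib

def sha256_of_file(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 without loading it all into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_locked(path, expected_hex):
    """Raise if the file on disk does not match its recorded digest."""
    actual = sha256_of_file(path)
    if actual != expected_hex:
        raise ValueError(f"checksum mismatch for {path}: got {actual}")
    return True
```

Calling `verify_locked` on every input before any analysis runs means a silently updated or corrupted download stops the pipeline instead of changing a reported number.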
Reproduction. Each investigation regenerates its results from scratch on every run: a reproduction harness re-runs each twice and asserts byte-identical output, so if any number here drifts from what the scripts produce, it is immediately visible. Full code, the research protocol, and per-investigation provenance are available on request; the public repository opens at launch.
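The run-twice check can be sketched as follows, assuming each investigation is a callable that returns its full output as bytes. The harness name and the example investigation are hypothetical stand-ins, not the study's actual scripts.

```python
import hashlib
import json
import random

def run_twice_and_compare(investigation):
    """Run an investigation twice; fail unless the outputs are byte-identical."""
    first = investigation()
    second = investigation()
    if first != second:
        raise AssertionError("output differs between runs: not reproducible")
    return hashlib.sha256(first).hexdigest()

def example_investigation():
    """Stand-in analysis: seeded randomness and sorted keys keep bytes stable."""
    rng = random.Random(0)
    values = [rng.random() for _ in range(5)]
    result = {"n": len(values), "mean": sum(values) / len(values)}
    return json.dumps(result, sort_keys=True).encode("utf-8")

fingerprint = run_twice_and_compare(example_investigation)
```

Byte-identical output is a stricter test than numerically close output: it also catches unseeded randomness, unordered dictionary or set iteration leaking into results, and timestamps written into reports.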
Method. Full method and per-investigation caveats — the research protocol and each investigation’s analysis, scenario, and reproduction notes — are available on request now and will be published with the public repository at launch.