Data Stewardship — How Research Data Is Handled

This page exists so a reviewer never has to take a handling claim on faith. It states exactly what happens to data the moment it arrives, who can reach it, and how it is destroyed when a study ends. Where this is currently a one-person practice rather than an institutional control, it says so plainly.

1. Where the data lives

Research data is stored on a single local machine and is encrypted at rest with full-disk encryption on that device. It is not synced to any cloud service. There is no Dropbox, Google Drive, iCloud, or comparable background sync touching the data directories; the working folders are deliberately kept out of any sync path. Datasets governed by a Data Use Agreement or a restricted-access license stay on that one machine and are not copied to shared drives, personal cloud storage, or collaborator devices. The only copies made off the machine are the local encrypted backups described below, and only where a license permits them.

Backups, where a license permits them, are made to local encrypted media held under the same single-researcher control — never to a third-party cloud backup service.
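As an illustration of keeping working folders out of any sync path, here is a minimal sketch of a pre-flight check. It is not the tooling used on this machine: the function name and the marker list are hypothetical, and the folder names are only common defaults that would need adjusting per system.

```python
from pathlib import Path

# Hypothetical list of folder names that commonly mark a sync root.
# These are assumptions for illustration; adjust them for the actual machine.
SYNC_MARKERS = ("Dropbox", "Google Drive", "iCloud Drive", "OneDrive")


def sync_markers_in_path(data_dir, markers=SYNC_MARKERS):
    """Return the sync-folder names that appear among the path's components.

    An empty list means no known marker was found in the literal path.
    """
    parts = {part.lower() for part in Path(data_dir).parts}
    return [marker for marker in markers if marker.lower() in parts]
```

A non-empty result is a reason to stop before any data is written there. The check only inspects the literal path, so it would not catch a symlinked folder or a sync client configured with a custom root.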

2. Raw data never enters an AI or chat context

This is already the standing practice, not an aspiration: raw patient-level data never enters a chat window, an LLM prompt, or any AI model context. AI tools assist with writing analysis code and reasoning about method — but the data itself is processed only by code running locally, against the files on that machine. Only aggregate outputs — summary statistics, model coefficients, calibration curves, performance metrics computed across many records — ever leave that local boundary. No individual record, identifier, or row of source data is pasted into, uploaded to, or transmitted through any external service, AI or otherwise.

The numbers that appear in the research write-ups are of this aggregate kind: they are computed by scripts over the protected data and reported as summaries, with the underlying records remaining local.
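The local boundary described above can be sketched as a function that reads records on the machine and returns nothing but summaries. This is an assumed illustration, not the actual analysis code: the function name, the CSV layout, and the choice of statistics are all hypothetical.

```python
import csv
import statistics


def summarize_column(csv_path, column):
    """Read a local CSV and return only aggregate statistics for one numeric column.

    Individual rows are held in memory for the computation and are never returned.
    """
    with open(csv_path, newline="") as handle:
        values = [
            float(row[column])
            for row in csv.DictReader(handle)
            if row[column] != ""  # skip missing values
        ]
    return {
        "n": len(values),
        "mean": statistics.fmean(values) if values else None,
        "stdev": statistics.stdev(values) if len(values) > 1 else None,
    }
```

Only the returned dictionary would ever be copied into a write-up or shown to an AI tool; the file at `csv_path` stays where it is.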

3. Who can access it

Access is held by one person — the sole researcher running this mission. No collaborators, contractors, or services currently hold credentials to the protected datasets. There is no shared account and no team with standing access. When that changes — as the work grows toward collaborators and a governing board — this page will be updated to describe the access controls that come with it.

4. Retention and destruction

Data is retained only as long as the active study and any required reproducibility window need it — and never longer than the governing agreement or license permits. Each restricted dataset is tracked to the terms it arrived under, including any retention limit or end-of-project destruction clause.

When a study concludes, or when an agreement requires it, the raw data is destroyed: the encrypted source files and any local working copies and intermediate caches derived from the protected records are deleted, and encrypted backup media are wiped. What is kept afterward is the analysis code and the aggregate results — not the underlying patient-level data. If an agreement specifies a particular destruction method or requires written confirmation of destruction, that requirement is followed and recorded.
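Tracking each dataset to the terms it arrived under could look something like the sketch below. The record fields and function name are hypothetical, chosen only to mirror the retention limit and written-confirmation requirements mentioned above; the real tracking may be as simple as a spreadsheet.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class DatasetTerms:
    """Hypothetical record of the terms one restricted dataset arrived under."""
    name: str
    governing_terms: str                 # e.g. "DUA" or "CC BY-NC-SA"
    destroy_by: Optional[date]           # None when the terms set no fixed limit
    written_confirmation: bool = False   # True if the custodian wants proof of destruction


def due_for_destruction(records, today):
    """Return the names of datasets whose retention limit has been reached."""
    return [
        record.name
        for record in records
        if record.destroy_by is not None and record.destroy_by <= today
    ]
```

Running `due_for_destruction(records, date.today())` on a schedule would turn a retention clause into a reminder rather than something to remember.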

5. Abiding by each repository’s terms

Every dataset is used strictly within the terms it is released under. That includes:

Non-commercial and share-alike licenses (for example CC BY-NC-SA): the work is non-commercial, attribution is preserved, and derived material is handled consistent with the license.
Data Use Agreements (DUAs) for restricted clinical repositories: access scope, permitted uses, redistribution limits, and retention/destruction clauses are all honored as written.
Non-commercial intent overall: this is a research mission, not a product. There is no sale, no commercial redistribution of source data, and no use outside the stated research purpose.

If a repository’s terms conflict with anything described on this page, the repository’s terms govern.

6. Human-subjects & ethics oversight

The sections above cover how data is technically protected. This one is about the harder question a data-access reviewer is right to ask: under whose ethical oversight is the work done? The honest answer: this is an independent solo researcher without an institutional IRB of record. There is no university or hospital review board that this mission currently sits under by default. Rather than paper over that, this page names it.

What that means in practice:

Strict operation under each program’s terms. Every dataset is used only within its governing Data Use Agreement, license, and access terms. This is the same commitment made in section 5, applied here to the ethics conditions a program attaches to access.
Deference to required review. Where a program or dataset requires IRB review, a determination, or a documented exemption as a condition of access, that requirement is honored before the data is used. If access is contingent on an IRB pathway, an institutional or independent-IRB review will be arranged to satisfy it — rather than seeking a workaround.
Honest classification of the work. Analyses on public, fully de-identified data — for example NHANES, HRS, and public DMS (deep mutational scanning) atlases — are not human-subjects research and are not represented as carrying approvals they do not need. Restricted patient-level cohorts are a different matter: they are handled under their own governing approvals, agreements, and any review their custodian requires.
Human-subjects protections, even as one person. The commitment to the people behind the data does not depend on having an institution watching. There are no attempts to re-identify individuals in any dataset; data collection and retention are minimized to what an analysis actually needs; and the local-only, never-in-an-LLM-context handling described elsewhere on this page is itself a human-subjects protection, not just a security measure.

This is an area where a solo operator can show intent and discipline but cannot, alone, substitute for institutional review. The position here is deference, not claim: no IRB approval, determination, or exemption is asserted that has not been granted, and wherever a custodian’s process is required, that process — not this page — governs.

7. Provenance and reproducibility

A reported number is only trustworthy if it can be traced back to its source. Every analysis is built so that the chain from input to reported figure is auditable:

sha256-locked inputs. Each source dataset is fingerprinted with a sha256 hash, so the exact bytes an analysis ran against are pinned and any later change to the input is detectable.
Script-generated numbers. The figures in the research write-ups are produced by code, not typed in by hand. Each reported value is emitted by a script and bound to the page through a managed figures manifest, so a number on the site and the number a script produced cannot silently drift apart.
An audit chain. Locked input → analysis code → generated result → the value shown on the page. Given the hashed inputs and the code, the aggregate results are reproducible — without ever exposing the underlying records.
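The first two links of that chain can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the site's actual tooling: the lock-file format (a JSON object mapping file path to sha256 hex digest), the manifest shape (a mapping of figure name to value), and all function names are hypothetical.

```python
import hashlib
import json
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large datasets need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def changed_inputs(lock_path):
    """Return the locked input paths whose current hash no longer matches.

    Assumes a hypothetical lock file: a JSON object of {path: sha256 hex digest}.
    """
    locked = json.loads(Path(lock_path).read_text())
    return [path for path, expected in locked.items()
            if sha256_of_file(path) != expected]


def drifted_figures(manifest, page_values):
    """Return names of figures whose value on the page differs from the manifest.

    Both arguments are assumed to be plain {figure name: value} mappings.
    """
    return sorted(name for name, shown in page_values.items()
                  if manifest.get(name) != shown)
```

An analysis run would refuse to proceed if `changed_inputs` returns anything, and a publishing step would refuse to build if `drifted_figures` does. Neither check needs to expose a single underlying record.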

Where this stands today

Honest framing matters here too: this is currently a solo researcher operating these controls personally, not an institution with separation of duties and formal audit. The practices above are real and in force, but they are one person’s discipline. I am building toward a governance structure — a board and named collaborators — that will hold this work to standards beyond a single person’s word. As that forms, the access, oversight, and destruction processes described here will be tightened and made answerable to more than me, and this page will be revised to match. Until then, this is the honest state of the practice.

Questions from a clinician, a data-access reviewer, or a prospective collaborator are welcome. Reach out at michael@rightfidelity.ai, or see the ways to get involved.

Michael
