Making biological data reusable and cumulative
From fragmented experiments to models of biological systems.
I have about 100 terabytes of microscopy data from four years of my PhD, collected at a cost of hundreds of thousands of dollars in grant money. The result is a rich dataset on stroke recovery in mice: brain structure and function tracked from individual capillaries to the entire cortical surface, spanning multiple imaging modalities across weeks of recovery. It is the kind of dataset that could feed into comprehensive models of stroke recovery. Yet only about a third of it went into publications. The rest has been sitting on hard drives in a cardboard box in my closet for the last five years, waiting to be analyzed and published.
My story is not unusual. I’ve talked to dozens of students, postdocs, and professors across biology, and they all say the same thing: a majority of their resource-intensive experimental data never sees the light of day. Even when data does make it into public archives, less than 1% is ever reused. The open science movement has made good strides by advocating for data archives, the FAIR principles[1], and the creation of new institutes with strong data practices, such as the Allen Institute. But the problem of fragmented data still persists across most of the research ecosystem.
A few structural reasons explain this problem. First, data collected to serve a larger scientific goal, like connectomics or foundation models, is still viewed with suspicion if it doesn’t come attached to an immediate, testable hypothesis. The scientific system values publications that tell a story, and large-scale data collection efforts that can’t point to a specific, publication-sized question are often criticized for not practicing the scientific method rigorously. Second, documenting the tacit knowledge and metadata that go into an experiment is time-consuming, and it falls entirely on the scientist who generated the data, typically a grad student or postdoc with no incentive to spend hours on paperwork. Third, until recently, there was no clear downstream demand for reusable data. Without the computational tools to analyze data at scale, reuse added little to our collective knowledge.
This has changed recently. The Protein Data Bank (PDB) and AlphaFold form the clearest example of what reusable biological data can unlock. Fifty years of carefully curated protein structures, collected without any idea of what they’d eventually enable, became the foundation for a model that can now predict the structure of millions of proteins it has never seen. We are at a similar moment for biology more broadly. The push toward virtual cells, organs, and organisms has created a huge demand for exactly the kind of rich, multi-modal experimental data that labs are already generating every day.
Most of the ongoing debates about biological data and virtual cells concern which new data to collect at scale. I want to make the case that a tremendous amount of that data is already being generated but is distributed across thousands of labs worldwide. The problem is that it lives in private archives and was never built to be reused. Fixing that requires intervening at the source, before the knowledge walks out the door with the graduate student who generated it.
Data reuse is valuable for making new discoveries and informing new experiments
In addition to the PDB, another example of well-coordinated datasets enabling discoveries is the Allen Institute’s Brain Observatory. Within a few years of the institute publicly archiving electrophysiology and calcium imaging data from the mouse visual cortex, over 100 new papers reusing that data have been published. When data is packaged well, reuse rates are high. The Allen Institute’s own theory group took this further, building a model of the mouse primary visual cortex by combining electrophysiology, calcium imaging, patch-clamp recordings, connectomics, and cell-type classification from the same brain region. This was only possible because the institute spent over a decade developing shared instrumentation, workflows, and data standards across its teams so that data from different groups could actually be combined.
The common theme in these examples is a long period of careful, coordinated data generation followed by a downstream application that would have been impossible without it. The structural biologists in 1971 did not anticipate AlphaFold or even know which parts of their dataset would be most impactful. Similarly, building foundation models of biology will require large, biologically grounded datasets, but there is rarely agreement on exactly what these datasets should be. Even if researchers agreed, the amount of data needed is larger than any single institution can generate alone. Even simulating a single cell requires data spanning many orders of magnitude in space and time.
The data being generated right now, distributed across thousands of labs, spans exactly those scales: from atomic resolution structures to cellular dynamics to whole organisms. Every day, graduate students and postdocs are collecting datasets across perturbation experiments, genetic knockouts, drug treatments, and imaging across scales and modalities. These datasets are small individually, but they are diverse and information-rich. If combined, this distributed output could serve as the training ground for foundation models and provide empirical evidence of what data is actually useful. Instead of debating in the abstract which modalities and perturbations are most important, we could let the data reveal the gaps.
Previous attempts at better data practices in academia are still insufficient
When the data-sharing community talks about improving practices, the conversation almost always comes down to incentives: Make it easier to get citations for datasets. Fund departments to hire data stewards. Mandate data management plans. These are reasonable ideas on the surface, but they all treat “researchers” as a single group, when in practice, the people who generate data and the people who benefit from sharing it are almost never the same people.
The grad student running experiments is trying to graduate. The postdoc is trying to publish a paper that will get them a faculty position in a reasonable amount of time. The professor is optimizing for citations, grants, and prestige. Time is the scarce resource for trainees, while recognition is the scarce resource for faculty. These are distinct problems that require distinct solutions, and conflating them produces interventions that don’t fix the underlying problem.
Take the data steward model. When funding agencies respond to the data problem by giving labs more money to hire data stewards, the money flows to the professor’s research budget, not to the postdoc who still has to spend many hours working with the steward instead of running experiments. The baseline for that trainee is zero time spent on documentation; a data steward moves it to some time spent coordinating, reviewing, and providing the experimental context that only the trainee has. That delta, even if smaller than doing it all on their own, still falls entirely on the person who receives nothing in return.
Hiring data stewards also does nothing about fragmentation. Each lab develops documentation practices around its own needs. This makes sense, but the result is metadata that is incompatible across labs. Even well-resourced institutes with genuinely robust internal practices find it difficult to combine their data with that of other institutes without a further round of coordination. The steward model improves documentation within labs while leaving the cross-lab problem completely unaddressed.
The publication incentive argument has been made for decades with modest results. Data citation practices are slowly improving, but waiting for cultural change in academic publishing is the wrong intervention for a problem that can be solved now. Trainees don’t want to do paperwork; they want to do experiments, and no citation metric is going to change that psychology at the scale the problem requires, since citations only come years down the line.
The bottleneck is at data generation
Many efforts have focused on tackling experimental metadata generation at the public archive layer. Most recently, AI is being thrown at existing datasets to generate metadata where it’s missing. But this approach has a fundamental limitation: most metadata is tacit knowledge that lives inside the heads of experimentalists who have long since moved on from their data.
In my own stroke study, the critical context (which artery I targeted for occlusion in each mouse, which hemisphere, how I registered macroscopic and two-photon imaging sessions to the same region, where those regions were located in each mouse, the behavioral analyses) all lived in screenshots and handwritten notes that never made it onto the archive. Only the processed imaging data did. An AI looking at only the image files would have no way to reconstruct where they came from or how they relate to one another.
Even setting the metadata aside, the vast majority of data generated across labs never makes it onto public archives at all. In my stroke study, I collected three distinct imaging modalities, but the publication ended up focusing on only one of them: cortex-wide macroscopic imaging of recovery. The high-resolution two-photon data on capillary structure and function stayed on the lab server with the hope that a future student would analyze and publish it. Five years later, it has not been looked at once, and probably never will be.
Data that doesn’t support a published figure rarely gets deposited, a serious flaw in how results are shared. The current scientific structure rewards publications that push the knowledge story forward; it is not built around sharing data or half-stories. The result is that archives capture only the fragment of each project that made for a nice story.
The solution needs to tackle both the data source and data integration
My hypothesis is that if you capture experimental metadata at the bench in real time, with zero added burden on the experimentalist, you can dramatically increase both the volume and quality of data that makes it into public archives, regardless of what gets published. If successful, the resulting datasets would actually be combinable across labs, because the metadata would be structured and machine-readable from the start rather than reconstructed after the fact.
A pilot experiment to test this would equip a small number of diverse labs with real-time, automated capture tools, then measure whether metadata completeness and archive deposition rates improve, and whether the resulting datasets can be combined across labs and modalities. Tools like PRISM, developed by the FRO Cultivarium, demonstrate that the capture layer is technically feasible: PRISM uses LLMs with smart glasses to generate rich metadata from wet-lab work without any additional paperwork. A similar approach could work for in vivo imaging. I could put on my smart glasses at the start of my day and the system would automatically track mouse IDs, document behavioral parameters, capture imaging settings across modalities, and produce a structured, machine-readable record, without filling out a single form.
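To make “structured and machine-readable” concrete, here is a minimal sketch of what an automatically captured session record might look like. The schema, field names, and example values are hypothetical, loosely modeled on the context from my stroke study; a real capture tool like PRISM would define its own format.

```python
# Minimal sketch of a machine-readable record for one in vivo imaging session.
# All field names and values are hypothetical illustrations, not an existing standard.
import json
from dataclasses import dataclass, field, asdict

@dataclass
class ImagingSessionRecord:
    mouse_id: str
    session_date: str                  # ISO 8601 date
    modality: str                      # e.g. "two-photon", "widefield"
    target_artery: str                 # tacit context that rarely reaches archives
    hemisphere: str
    imaging_region: str                # named region shared across modalities
    registration_reference: str        # how this session maps onto other modalities
    acquisition_settings: dict = field(default_factory=dict)
    behavioral_notes: str = ""

record = ImagingSessionRecord(
    mouse_id="M042",
    session_date="2019-03-14",
    modality="two-photon",
    target_artery="distal MCA branch",
    hemisphere="left",
    imaging_region="peri-infarct cortex, ROI-3",
    registration_reference="aligned to widefield session 2019-03-14 via surface vasculature",
    acquisition_settings={"objective": "16x", "wavelength_nm": 920, "frame_rate_hz": 30},
    behavioral_notes="mouse ran on treadmill during stimulation blocks",
)

# Serialized records like this could flow straight into an archive,
# independent of whether the session ever supports a published figure.
print(json.dumps(asdict(record), indent=2))
```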
Metadata capture is only the first layer. From there, the data needs to flow automatically into archives regardless of what goes into a final publication. Many of these archives already exist, but their contents need to be standardized across datasets so they can be combined and searched. The final layer is query-based aggregation. Imagine seeking a deeper understanding of how memory works and being able to ask, in plain language: show me all available data from the mouse hippocampus across modalities, in a format I can combine. A federation layer sitting above the distributed archives, able to pull and integrate data in response to natural language queries, would turn what is currently a fragmented ecosystem into something an AI scientist could use to build models.
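As a toy illustration of that federation layer, the sketch below assumes each archive exposes a searchable index of dataset entries; a language model or parser would translate the plain-language question into structured filters, and a federated query would run those filters against every index and return a combined listing. The archive names, entry fields, and URLs are all made up for the example.

```python
# Toy sketch of a federated query over distributed archive indexes.
# Archive names, entries, and query fields are hypothetical.
from dataclasses import dataclass

@dataclass
class DatasetEntry:
    archive: str        # which public archive holds the data
    species: str
    brain_region: str
    modality: str
    uri: str            # where to retrieve the dataset

# In reality each archive would expose its own searchable index;
# here two tiny in-memory indexes stand in for them.
ARCHIVE_INDEXES = {
    "archive-A": [
        DatasetEntry("archive-A", "mouse", "hippocampus", "two-photon calcium imaging",
                     "https://example.org/archive-A/ds001"),
        DatasetEntry("archive-A", "mouse", "visual cortex", "electrophysiology",
                     "https://example.org/archive-A/ds002"),
    ],
    "archive-B": [
        DatasetEntry("archive-B", "mouse", "hippocampus", "electrophysiology",
                     "https://example.org/archive-B/ds107"),
    ],
}

def federated_query(species: str, brain_region: str) -> list[DatasetEntry]:
    """Return every matching dataset across all archives, regardless of modality."""
    return [
        entry
        for index in ARCHIVE_INDEXES.values()
        for entry in index
        if entry.species == species and entry.brain_region == brain_region
    ]

# "Show me all available data from the mouse hippocampus across modalities."
for entry in federated_query(species="mouse", brain_region="hippocampus"):
    print(entry.archive, entry.modality, entry.uri)
```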
A well-coordinated research program that builds the backend tooling and infrastructure to make fragmented data flow seamlessly from the bench to the model could immediately give us access to exabytes of rich biological data.
So, why now?
For most of the history of this problem, there was no compelling reason to fix the lack of reusability. Data reuse was a nice idea without a clear application. Now, scientists across industry and academia are actively building biological foundation models, and the demand for large, structured, multi-modal datasets is immediate. At the same time, the tools to address the capture problem at the bench at scale have only recently become available. The same AI wave that is driving demand for biological foundation models is also what makes it possible to capture experimental knowledge at the bench without adding burden to the person running the experiment.
We could be collecting data with future AI in mind, where we can expect human scientists to be working alongside autonomous agents. In this scenario, having all the metadata matters even more, because we expect the autonomous researcher to be doing more of the analysis. One could even imagine a biological data marketplace, where AI scientists specify what data they need and decentralized labs supply it through experiments, while the middle layer ensures integration. Now is the right time to run a coordination experiment across academia to leverage the enormous amounts of data being generated every day.
[1] A set of guidelines to make data Findable, Accessible, Interoperable, and Reusable. While most NIH-funded publications are required to have a data management plan in place, enforcement of what constitutes a good data plan is loosely defined.


