In most labs, experiments follow a typical pattern: set up the experiment, collect the data, and then analyze it. Often, it isn’t until after data collection that an experimentalist begins to think about data analysis and whether it’s possible to establish meaningful conclusions. Yet, thinking about statistics and planning the analysis should come in the very beginning – during the experimental design phase itself.
“You could do this giant experiment that could take you years, and if it wasn’t set up in line with the way that it is going to be analyzed, it might actually be very hard to gain insight from that experiment,” says Parag Mallick, co-founder and chief scientist of Nautilus Biotechnology.
In a recent Translating Proteomics podcast episode, Mallick explores the power and potential of combining statistics, AI, and experimental design with Olga Vitek, Raymond Bradford Bradstreet Professor and Director of the Barnett Institute for Chemical and Biological Analysis at Northeastern University. Professor Vitek’s work focuses on the development of statistical, AI, and ML tools for use in proteomics studies. This blog post highlights insights from the podcast that show just how essential it is to consider statistics in experimental design – particularly when studying proteins.
Why statistics is important for experimental design
Vitek believes that statistics should be considered in the beginning of every experiment. “You really need to have a good sense of how the data will be analyzed in order to design the experiment appropriately,” she says. “When you design experiments in a way that accounts for the properties of the data, for the uncertainty and variation, and for the systematic variation that exists, you end up with experiments that are much more informative.” Without considering statistics upfront, you can end up with results that are underpowered or hard to interpret because of batch effects or hidden variables.
To really build in statistical rigor, Vitek recommends talking to a statistician before collecting the data to discuss how the data will be analyzed and design experiments based on that. The design should encompass the principles of replication, randomization, and blocking so that the results you get are trustworthy and not due to a one-off finding or things like handling variables.

For observational studies from settings that aren’t controlled, understanding variables becomes even more challenging, but not impossible. “If we can understand the sources of variation, even if we cannot control them, we can account for them,” says Vitek. For example, causal diagrams can be used to identify and adjust for uncontrolled variability in statistical models
With the rise of computing resources and AI, Vitek notes that it’s becoming easier to analyze data. The hard part goes beyond learning how to run the software packages, but rather, it is understanding how these methods work and how they can “fail silently.” For instance, these methods could give you realistic numbers or small p-values even if the sample size is too small to support a confident conclusion.
Moving towards systems approaches
It can be very difficult to determine the biological importance of protein expression changes reported after proteomics experiments. Commonly reported relative fold changes in protein abundance are difficult to place into other contexts outside of their source experiments, making direct comparisons to existing literature and datasets challenging.
Yet, with advances in AI/ML it’s becoming ever more feasible to glean biological insights from disparate proteomic data. Such data often sits in disconnected repositories or scattered across multiple published papers, but machine learning can help scientists interpret this data, combine it with new data, appropriately label it, and place it within the context of current biological knowledge to develop a unified understanding of the proteome. Moving beyond the proteome, proteomics data can also be combined with transcriptomic or genomic data. While it can be difficult to integrate current proteomic datasets with these other modalities, we believe the single-molecule proteomic data generated on the Nautilus Voyager™ Platform will be easier to integrate for a more holistic understanding of biology.
For protein studies, these integrative approaches are even more important because the reality is not simply “protein A regulates protein B.” Instead, one protein can have many variants, combinations of post-translational modifications, and behaviors depending on context. “You need to somehow harmonize all this information,” says Vitek, referring to scientific abstracts, ontologies, databases, and repositories. When this information is parsed and converted into structured and harmonized knowledge bases, scientists can compute on them. “If we can compute on them, then we can really take the next step forward and ask questions that are not just mechanistic,” Vitek explains. “We can ask more synergetic questions like how do these proteins regulate each other or how do they do it in a particular context?”
Future of AI, machine learning, and statistics in proteomics
As researchers develop means of integrating omics datasets and establishing not just data points but insights, Vitek is excited about the potential to use omics data to create artificial cells that mimic biology. This would involve balancing the wealth of transcriptomics data scientists currently have with an increasing amount of data across other omes, such as the proteome, and at different temporal and spatial resolutions in order to train models of biology. Artificial cells could help scientists predict outcomes to perturbations such as stress or exposure to external stimuli before performing actual experiments, and Vitek says artificial cells could be “incredible” for diagnosing, treating, and preventing disease.
Building towards this future begins with better experimental design that incorporates statistical principles. This can be achieved by bridging the gap between the people who do experiments and people who analyze data. “A solution is really improving the communication between people who are experts in different fields,” she says. “Experimentalists need to understand the mindset of experimental design, but statisticians also need to understand what the data are.”
“Part of my dream involves humans interacting in a more mutually complementary manner and understanding enough of the language of the other field to have a conversation,” says Vitek.
The Nautilus Voyager™ Platform was built through similar complementary efforts of researchers from a wide variety of fields, and the single-molecule data it generates is designed to be easier to integrate into statistical, AI, and multiomic workflows. Ultimately, we hope this data will accelerate efforts to achieve an integrated understanding of biology that makes innovations like the artificial cells that Professor Vitek envisions possible!
MORE ARTICLES