AI Meets Biology: Insights and Innovations from Our AI in Single-Cell Biology Workshop

--

AI in Single-Cell Biology Workshop participants. Photo credit: CZI

Earlier this year, we hosted an AI in Single-Cell Biology Workshop that brought together computational experts, biologists, and AI leaders to foster collaboration and sharing among participants; discuss novel methods, trending topics, and future directions; and disseminate learnings.

Participants focused on state-of-the-art methods, challenges, and opportunities in model evaluation and on translating predictions into biological discoveries. By bringing together diverse perspectives and expertise, the workshop facilitated a dynamic exchange of ideas that sparked innovation and laid the groundwork for future developments.

What We Learned

Evaluation and Validation Frameworks: There is a need for robust frameworks to evaluate and validate the performance of computational models in biology. This includes developing standardized tasks and curated datasets that can serve as benchmarks. Such frameworks are essential for assessing the accuracy, reliability, and utility of models in capturing complex biological phenomena, enabling comparisons across different approaches, and facilitating improvements. One notable existing framework is Open Problems, an open-source, community-driven, extensible platform for continuously updated benchmarking of formalized tasks in single-cell analysis.
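
To make the idea of a standardized task concrete, here is a minimal sketch of a benchmark harness: one fixed dataset, one fixed metric, and several methods scored side by side. It is an illustration only and does not use or mirror the Open Problems API; the function names, the two toy methods, and the marker-gene values are all hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# A "method" maps per-cell feature vectors to predicted labels.
Method = Callable[[Sequence[Sequence[float]]], List[str]]


def accuracy(predicted: Sequence[str], truth: Sequence[str]) -> float:
    """Fraction of cells whose predicted label matches the curated label."""
    if len(predicted) != len(truth):
        raise ValueError("predicted and truth must have the same length")
    if not truth:
        raise ValueError("cannot score an empty dataset")
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)


def run_benchmark(
    methods: Dict[str, Method],
    features: Sequence[Sequence[float]],
    labels: Sequence[str],
) -> List[Tuple[str, float]]:
    """Score every method on the same dataset and metric, best first."""
    scores = [(name, accuracy(fn(features), labels)) for name, fn in methods.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Toy curated dataset: two marker-gene values per cell (made-up numbers).
FEATURES = [(5.0, 0.1), (4.2, 0.3), (0.2, 6.1), (0.4, 3.9)]
LABELS = ["T cell", "T cell", "B cell", "B cell"]


def marker_rule(cells: Sequence[Sequence[float]]) -> List[str]:
    """Label each cell by whichever marker gene is more highly expressed."""
    return ["T cell" if a > b else "B cell" for a, b in cells]


def constant_baseline(cells: Sequence[Sequence[float]]) -> List[str]:
    """Trivial baseline that always predicts the same label."""
    return ["T cell" for _ in cells]


leaderboard = run_benchmark(
    {"marker_rule": marker_rule, "constant_baseline": constant_baseline},
    FEATURES,
    LABELS,
)
```

Because every method sees the same data and is scored with the same metric, the resulting leaderboard is directly comparable across approaches, and a trivial baseline gives the scores a reference point.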

Incorporating Prior Knowledge: Leveraging existing biological knowledge is crucial for informing model development and interpretation. Incorporating prior knowledge about gene functions, cellular pathways, and disease mechanisms can guide the training of models, helping them to make biologically plausible predictions and to identify novel insights that are consistent with established theories. This approach can also help in addressing challenges related to data sparsity and bias, enhancing the models’ ability to generalize from limited or unbalanced data.

Foundation Model of Biology: The concept of a foundation model (a comprehensive and versatile model that can be adapted or fine-tuned for a wide range of tasks) holds particular promise for biology. Such a model could encapsulate a broad spectrum of biological knowledge, from molecular to organismal levels, serving as a powerful tool for exploration, hypothesis generation, and decision-making in diverse biological contexts. Recent advancements in this area highlight the integration of large language models (LLMs) and deep learning techniques to enhance data analysis, representation, and discovery across various biological domains. Approaches like BioBridge and CellPLM demonstrate the potential of combining knowledge graphs and multi-modal data to improve model interoperability and the prediction of cellular interactions and responses to treatment. Models such as DNABERT-2 and DrugGPT leverage LLMs for multi-species genomic analysis and drug discovery, respectively, while other methods focus on augmenting models with domain-specific tools (e.g., for chemistry) and enhancing model capabilities for protein representation and biological image segmentation. Overall, state-of-the-art approaches reflect a shift towards more generalized, efficient, and contextually aware AI models that push the boundaries of current methodologies in biomedical research and applications.

Active Areas of Research: Following the workshop, the group highlighted several areas of active research that present significant opportunities for advancing the field. These include:

  • Perturbation Prediction: Developing models that predict the effects of genetic or environmental perturbations on cells or organisms. This area is crucial for understanding disease mechanisms and drug responses.
  • Spatial Transcriptomics: Combining gene expression data with spatial information within tissues offers insights into the cellular composition and architecture of tissues in health and disease. Models that can analyze and interpret spatial transcriptomics data can uncover new aspects of tissue biology and disease pathology.
  • Multi-modal Models: Integrating data from multiple modalities (e.g., genomics, proteomics, imaging) can provide a more comprehensive understanding of biological systems. Multi-modal models have the potential to capture the complexity of biological systems better than single-modality approaches, leading to more accurate and holistic insights.

Day 1: The Current State of Computational Models

To develop a shared understanding of the state of the field, the first day of the workshop centered on an overview of successful applications of AI to unravel biological complexities, and an acknowledgment of the limitations and gaps in current approaches. A significant portion of the day was devoted to discussing the importance of robust evaluation frameworks to assess the effectiveness of computational models.

Here’s a recap of day one sessions:

  • Michelle Li, a biomedical informatics researcher from Harvard Medical School, presented Pinnacle, a multi-scale graph neural network designed to enhance protein representation learning by incorporating cell type- or state-specific contexts. She described Pinnacle’s ability to adjust protein representations based on biological context, which helps overcome challenges such as the lack of uniform data processing and the need for transparency and interpretability in models. The system has been applied to therapeutic target identification, showing improved prediction capabilities over context-free models, and has the potential to inform precision medicine by making contextualized predictions relevant to specific therapeutic areas. Li also highlighted the importance of considering context in biological models, suggesting that context-aware models like Pinnacle can significantly advance our understanding of biological processes and disease mechanisms.
  • Building on the need for context in biological models, Smita Krishnaswamy, an Associate Professor of Genetics and of Computer Science at the Yale School of Medicine, highlighted the opportunities and challenges of applying large generative models to drive discovery in biology. Generative models can produce biological entities such as molecules, networks, and cells, but these applications require consideration of the entities’ complex structures and dynamics. Integrating deep learning with geometry and topology, Krishnaswamy discussed the development of models that can accurately represent and generate complex biological phenomena, paving the way for innovations in understanding and manipulating biological systems.
  • Subsequent sessions focused on enhancing model performance and on the limitations of large language models (LLMs) and foundation models. Participants discussed the advantages of incorporating prior biological knowledge into models over purely data-driven approaches, and how this hybrid approach enhances the models’ interpretability and predictive power. Integrating prior biological knowledge requires a careful balance between enhancing model performance and avoiding bias, particularly in data-limited situations.
  • The second small-group discussion focused on the advantages and limitations of large language models (LLMs) and foundation models in capturing the complexity and heterogeneity of single-cell biology data, and on how alternative approaches address these limitations. Despite their limitations, LLMs offer advantages such as the ability to learn from vast datasets, generalizability, and the fact that pre-trained models do not require a predefined hypothesis. To address the limitations and leverage these advantages, attendees called for more rigorous benchmarking tasks and frameworks; improved data splits for training, validation, and testing; and the exploration of pretraining and masking techniques.
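
Two of the ideas raised in that discussion, data splits and masking, can be sketched in a few lines. The example below is a hedged illustration, not a method presented at the workshop: it assumes that cells come with a group label such as a donor ID (an assumption made here so that whole groups stay inside one split), and it shows one simple masking scheme in which a random subset of a cell’s expression values is hidden for a model to reconstruct. The function names, fractions, and numbers are hypothetical.

```python
import random
from typing import Dict, List, Optional, Sequence, Tuple


def split_by_group(
    groups: Sequence[str],
    fractions: Tuple[float, float, float] = (0.6, 0.2, 0.2),
    seed: int = 0,
) -> Dict[str, List[int]]:
    """Assign whole groups (e.g. donors) to train/validation/test.

    Keeping all cells from one group in a single split avoids leaking
    group-specific signal from training into evaluation.
    """
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    unique = sorted(set(groups))
    random.Random(seed).shuffle(unique)
    n_train = round(fractions[0] * len(unique))
    n_val = round(fractions[1] * len(unique))
    assignment = {}
    for rank, group in enumerate(unique):
        if rank < n_train:
            assignment[group] = "train"
        elif rank < n_train + n_val:
            assignment[group] = "validation"
        else:
            assignment[group] = "test"
    splits: Dict[str, List[int]] = {"train": [], "validation": [], "test": []}
    for index, group in enumerate(groups):
        splits[assignment[group]].append(index)
    return splits


def mask_expression(
    values: Sequence[float], mask_fraction: float, rng: random.Random
) -> Tuple[List[Optional[float]], Dict[int, float]]:
    """Hide a random subset of genes; return (masked input, hidden targets)."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0.0 < mask_fraction < 1.0:
        raise ValueError("mask_fraction must be strictly between 0 and 1")
    n_masked = max(1, round(mask_fraction * len(values)))
    hidden = rng.sample(range(len(values)), n_masked)
    masked: List[Optional[float]] = list(values)
    targets: Dict[int, float] = {}
    for gene_index in hidden:
        targets[gene_index] = values[gene_index]  # what the model must recover
        masked[gene_index] = None                 # what the model is shown
    return masked, targets


# Toy data: 10 hypothetical donors with 3 cells each.
DONORS = [f"donor{d}" for d in range(10) for _ in range(3)]
splits = split_by_group(DONORS)

# Toy expression vector for one cell (made-up values).
CELL = [0.0, 3.2, 1.1, 0.0, 7.5, 0.4, 2.2, 0.0, 5.0, 1.8]
masked, targets = mask_expression(CELL, 0.3, random.Random(0))
```

Splitting by group rather than by individual cell is only one possible reading of "improved data splits"; the right grouping variable (donor, batch, tissue, study) depends on what kind of generalization a benchmark is meant to measure.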

Day 2: Evaluation and Applications of AI Methods in Biology

Following the table-setting conversations from day one, the second day of the convening centered on the evaluation and practical applications of AI methods in biology, emphasizing specific use cases that demonstrate the impact of computational methods on biological research.

Here’s a recap of day two sessions:

  • Tianyu Liu, a PhD student at Yale University, presented on the transformative potential of foundation models in biology while acknowledging challenges related to computational demands, data quality, and model interpretability. Evaluation frameworks are crucial for benchmarking these models against specialized methods, and future directions include enhancing training strategies, incorporating multi-modal datasets, and exploring novel biological applications. Through collaborative efforts and continuous innovation, foundation models hold the promise of significantly advancing our understanding of biological systems and improving our ability to address complex biological questions.
  • David Kelley, a principal investigator from Calico Labs, presented an approach that uses neural networks, particularly convolution and self-attention mechanisms, to predict regulatory activities such as DNase hypersensitivity, transcription, and gene expression from long DNA sequences. His project aims to comprehensively annotate how each nucleotide in the genome influences cell type- and state-specific gene regulation, and to predict how modifications to the DNA sequence change regulatory activity. This involves understanding the relationship between sequence and activity and how regulatory elements across the genome interact, regardless of spatial distance. Recent advancements have incorporated RNA sequencing data into the model, improving the prediction of gene expression and enabling finer resolution in modeling regulatory activities. Validation efforts focus on understanding enhancer-promoter connections and exploring the effects of genomic variants on gene regulation.
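
The idea of predicting how a change in DNA sequence alters regulatory activity can be illustrated with in silico mutagenesis: score the reference sequence, score each single-nucleotide variant, and take the difference. The sketch below is a toy stand-in and not the model Kelley presented; `predict_activity` is a hypothetical function that applies one hand-written convolution-style filter with global max pooling, whereas a real sequence-to-activity network learns many filters and further layers from data. The filter weights and the example sequence are made up.

```python
from typing import Dict, List, Tuple

BASES = "ACGT"


def one_hot(seq: str) -> List[List[int]]:
    """Encode a DNA string as a len(seq) x 4 matrix of 0/1 values (A, C, G, T)."""
    return [[int(base == b) for b in BASES] for base in seq.upper()]


# Hypothetical filter: rows are positions, columns are A, C, G, T.
# The made-up weights respond most strongly to the motif "TATA".
FILTER = [
    [-1.0, -1.0, -1.0, 2.0],  # prefers T
    [2.0, -1.0, -1.0, -1.0],  # prefers A
    [-1.0, -1.0, -1.0, 2.0],  # prefers T
    [2.0, -1.0, -1.0, -1.0],  # prefers A
]


def predict_activity(seq: str) -> float:
    """Toy stand-in for a trained sequence-to-activity model.

    Slides one filter along the one-hot sequence (a 1D convolution) and
    returns the strongest response (global max pooling).
    """
    x = one_hot(seq)
    width = len(FILTER)
    if len(x) < width:
        raise ValueError("sequence is shorter than the filter")
    responses = [
        sum(FILTER[k][j] * x[start + k][j] for k in range(width) for j in range(4))
        for start in range(len(x) - width + 1)
    ]
    return max(responses)


def in_silico_mutagenesis(seq: str) -> Dict[Tuple[int, str], float]:
    """Predicted change in activity for every single-nucleotide substitution."""
    seq = seq.upper()
    reference = predict_activity(seq)
    effects: Dict[Tuple[int, str], float] = {}
    for pos, ref_base in enumerate(seq):
        for alt in BASES:
            if alt == ref_base:
                continue
            mutant = seq[:pos] + alt + seq[pos + 1:]
            effects[(pos, alt)] = predict_activity(mutant) - reference
    return effects


SEQUENCE = "GGTATACC"
effects = in_silico_mutagenesis(SEQUENCE)
```

With this toy filter, substitutions inside the TATA motif lower the predicted activity (for example, changing the T at position 2 to A gives a change of -3.0), while substitutions in the flanking bases leave it unchanged. Applied with a trained model instead of the hand-written filter, the same loop yields a per-nucleotide map of predicted variant effects.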

Day 3: The Potential of Multi-modal Models

Connecting and modeling data across scales and modalities. From left to right: Ivana Jelic, Senior Program Manager for Cell Science at CZI; Fabian Theis, Director of the Computational Health Center and Director of the Institute for Computational Biology at Helmholtz Munich; Theofanis Karaletsos, Head of AI for science at CZI; and James Zou, Associate Professor of Biomedical Data Science at Stanford University.

The final day was forward-looking, exploring the future directions of AI in biology. In a panel discussion, Ivana Jelic, Senior Program Manager for Cell Science at CZI; Fabian Theis, Director of the Computational Health Center and Director of the Institute for Computational Biology at Helmholtz Munich; Theofanis Karaletsos, Head of AI for science at CZI; and James Zou, Associate Professor of Biomedical Data Science at Stanford University, shared their perspectives on how the field could integrate data from multiple modalities and scales, such as genomic, transcriptomic, epigenomic, and proteomic data, to gain a comprehensive understanding of cellular processes at the single-cell level. They emphasized the potential of multi-modal data to provide a deeper mechanistic understanding of cellular responses to stimuli, as well as the importance of considering causality in the context of disease. The panelists also acknowledged current challenges, including technical complexities, biases from dataset selection, slow adoption of uncertainty-aware predictions, and the need for scalable and accessible model systems.

Here’s a recap of day three:

  • Attendees were encouraged to push the frontiers of AI applications in biology through new approaches, such as using machine learning, particularly generative models, to better understand how biological systems generate themselves. This came with an emphasis on the need for mechanistic insights to prevent reliance on black-box models.
  • Gary Bader, a professor at The Donnelly Centre at the University of Toronto, presented his group’s goal of bridging the gaps between disciplines focusing on different scales of human biology and utilizing single-cell and spatial genomics to offer a unified perspective on human biology.
  • Loic Royer, Senior Group Leader & Director of Imaging AI at CZ Biohub San Francisco, discussed the intricacies of developing and employing advanced imaging and machine learning technologies to study the dynamics of cell development in zebrafish embryos. His talk extended to leveraging deep learning for image analysis and the potential of conversing with AI systems such as ChatGPT to facilitate bioimaging tasks, illustrating a future where complex biological processes can be understood and manipulated with precision through a blend of imaging, computational, and AI-driven approaches.

This workshop was an exciting opportunity to build momentum for the potential of AI and computational models to transform biological research. By addressing the need for evaluation frameworks, incorporating prior knowledge, aiming for foundation models, and exploring active research areas, the field can unlock new opportunities for discovery and innovation in biology. These learnings will continue to drive CZI’s vision of building an AI-powered platform for biology that could predict the behavior of healthy and diseased cells. Read more about our AI work.

--

Chan Zuckerberg Initiative Science

Supporting the science and technology that will make it possible to cure, prevent, or manage all diseases by the end of the century.