DNA & Disease: Facing Pandemics with Pathogen Genomics


When an influenza pandemic began to sweep the globe in 1918, doctors and scientists had limited tools for investigating the deadly disease. Viruses were too small to see with the microscopes of the day. So people wrongly blamed the illness on what was visible under the lens, a bacterium they called Bacillus influenzae. Humanity had little defense against the real threat, an H1N1 virus, as it killed tens of millions of people.

Today, as we once again face a global pandemic, scientists are better equipped to investigate the infectious pathogen, this time SARS-CoV-2. Over the past century, new technologies have emerged that enable us not only to see the virus under the microscope but also to read the genetic information that underlies how it functions. New tools for making sense of genetic information have made it much easier to identify and monitor emerging contagions, in the service of combating them. Putting these tools into the hands of local health jurisdictions has been an ongoing priority for the Chan Zuckerberg Initiative (CZI) and the Chan Zuckerberg Biohub (CZ Biohub).

“Empowering local communities is a powerful way to stamp out infectious disease,” said Phil Smoot, CZI’s head of science technology and vice president of engineering. “Modern genomic technologies can provide communities with tools for understanding the origin and spread of disease.”

These tools, made possible by advances in DNA sequencing, have had a powerful impact on human health during the ongoing COVID-19 pandemic. And they are continuing to evolve, in preparation for the next global health crisis.

Sequencing: Faster, Cheaper, More Available

Every infectious disease agent, from the common cold to exotic brain parasites, has a genome made of DNA or RNA. This genetic material contains instructions that allow a pathogen to attack, reproduce, and move to a new host. It can also be used to fingerprint a pathogen, thanks to advances in sequencing technology that enable scientists to read genetic “barcodes” within the pathogen’s DNA or RNA.

Technology has come a long way since scientists produced the first complete DNA genome sequence (of a bacteriophage) in the 1970s, using Sanger sequencing. In this technique, enzymes create a new strand of DNA from a single strand of existing DNA by incorporating the building blocks of DNA, nucleotides, one at a time. Special fluorescent building blocks interrupt this process, creating DNA strands truncated at every base, or letter, in the sequence. After the reaction is complete, scientists separate the truncated bits on a gel and scan them with a laser to detect each one’s final, fluorescent base, yielding a sequence. This technique made it possible to sequence the human genome, a project that lasted 13 years and cost $2.7 billion.
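To make the logic of that readout concrete, here is a toy sketch in Python. It is an illustration only, not laboratory software: the function names are invented for this example, and it assumes an idealized reaction that yields exactly one truncated strand ending at every position.

```python
# Toy model of chain-termination (Sanger) sequencing.
# Assumption: an idealized reaction with one truncated strand per position.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def terminated_fragments(template):
    """Return every truncated copy of the newly built strand.

    The new strand is the reverse complement of the template, and one
    fragment ends at each position along it.
    """
    new_strand = "".join(COMPLEMENT[base] for base in reversed(template))
    return [new_strand[:length] for length in range(1, len(new_strand) + 1)]


def read_sequence(fragments):
    """Sort fragments by length, as a gel does, and read each one's final base."""
    return "".join(fragment[-1] for fragment in sorted(fragments, key=len))


# The fragments come out of the reaction in no particular order.
fragments = list(reversed(terminated_fragments("ATGC")))
print(read_sequence(fragments))  # GCAT
```

Sorting by length stands in for the gel, and the last letter of each fragment stands in for the fluorescent label detected by the laser.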

Next-generation sequencing, which emerged around the turn of the 21st century, now allows an entire genome to be read in a day for less than $1,000. Instead of producing one sequence per reaction, this technique can generate millions or billions at once. In next-generation sequencing, DNA is chopped into lots of small pieces that are affixed to a surface and sequenced in large batches. Enzymes create new strands of DNA from each fragment, as in Sanger sequencing, but the fluorescent molecules they incorporate can be detected during the reaction, instead of at the end. The result is a long list of short DNA sequences. These genetic puzzle pieces are put together by a computer algorithm that looks for overlaps in the sequences. The process is akin to cutting up pages in a book, photocopying them, and then putting them back together by looking for overlapping sentence fragments.
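The overlap-matching step can be sketched in a few lines of Python. This is a minimal, greedy illustration under simplifying assumptions (error-free reads, a single strand, no repeated regions); the function names and the example reads are made up, and production assemblers use far more robust methods.

```python
def overlap(left, right, min_len=3):
    """Length of the longest suffix of `left` that matches a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the two reads with the largest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_size, best_i, best_j = 0, None, None
        for i, left in enumerate(reads):
            for j, right in enumerate(reads):
                if i != j:
                    size = overlap(left, right, min_len)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # nothing overlaps any more; keep the pieces separate
        merged = reads[best_i] + reads[best_j][best_size:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Three hypothetical short reads taken from one 15-letter sequence.
reads = ["CGTTAGC", "ATGGCGTA", "CGTACGTT"]
print(greedy_assemble(reads))  # ['ATGGCGTACGTTAGC']
```

Like the cut-up book, the reads carry no page numbers; only the shared letters at their edges say which piece follows which.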

With sequencing becoming cheaper and more widely available, the real challenge now lies in putting it to work against diseases in the field and in handling and analyzing the data it provides. That’s where CZI is focused.

Metagenomics: A Tool for Identification

Consider the task of figuring out what is making someone sick when you don’t have a prior idea of what the cause might be and running tests for every possible pathogen would take too long. This is where sequencing can help by looking at all the different pathogens in a person’s sample, for example their saliva. The fluid likely contains the pathogen behind the illness, with a genomic barcode that can reveal its identity. However, this barcode is also mixed in with lots of other DNA: e.g., the infected person’s own DNA (from cells shed from the throat) and DNA from any other organisms currently infecting that person, as well as from the microbes naturally found in their mouth, known as commensals. In short: The saliva sample contains a cornucopia of genetic material from different sources.

Metagenomics provides a genetic inventory of all the organisms contained within a biological sample, such as that saliva. An extension of genomics, which studies a single organism’s chromosomes, metagenomics deals with genetic material from many sources.

To put metagenomics to work on that saliva, you might start by performing a series of laboratory steps that chop up the jumble of DNA from many different sources within the sample and sequence the resulting bits. Then comes the task of piecing together those fragments, which is more complicated than piecing together DNA from a single source. Two sequences from the genomes of two different organisms — an infectious microbe and a non-pathogenic commensal microbe from a person’s mouth, for instance — can look similar.

Sorting all this out requires a reference database of known genomes. Sequences from the sample are compared to that database, producing matches at varying probabilities. This offers a way to identify the organisms in a sample with no prior knowledge of what they are.
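As a rough illustration of that comparison, the Python sketch below scores a sequenced fragment against a tiny reference database by counting shared short "words" (k-mers). Everything here is hypothetical: the organism names, the sequences, and the scoring rule are invented for the example, and this is not a description of how CZ ID or any particular pipeline works.

```python
def kmers(sequence, k=4):
    """Return the set of all length-k substrings of a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def classify(read, references, k=4):
    """Rank reference genomes by the fraction of the read's k-mers they contain."""
    read_kmers = kmers(read, k)
    scores = {}
    for name, genome in references.items():
        shared = read_kmers & kmers(genome, k)
        scores[name] = len(shared) / len(read_kmers) if read_kmers else 0.0
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical, very short "genomes" standing in for a reference database.
references = {
    "pathogen_x": "ATGGCGTACGTTAGCCTA",
    "commensal_y": "TTGACCGATAGGCTAACC",
    "human": "GGGCCCAAATTTGGGCCC",
}

read = "GCGTACGTTA"  # one sequenced fragment from the sample
print(classify(read, references)[0])  # ('pathogen_x', 1.0)
```

A score near 1 means nearly every word in the fragment appears in that reference; fragments that score well against several references are the ambiguous cases that make metagenomic analysis hard.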

Given the complexity of the data analysis involved, metagenomics has been largely confined to specialized laboratories, which limits access to these important tools. Chan Zuckerberg ID (CZ ID), created by CZI in partnership with CZ Biohub, hopes to change that. CZ ID is an open-source metagenomic analysis platform for researchers that is designed to be easy to use. It enables timely identification and discovery of pathogens for outbreak detection and surveillance, as well as microbiome characterization, for any lab, regardless of its bioinformatics resources.

“We didn’t reinvent the wheel with CZ ID,” said Cristina Tato, Director of the Rapid Response team at CZ Biohub. “Our goal was to create a platform that makes existing tools so easy that any scientist in the world could use them to help identify disease and improve human health.”

Hoping to better understand how viruses move from other animals to human beings, researchers recently used CZ ID (formerly known as IDseq) to investigate fruit bats in Madagascar, which can be infected by coronaviruses. After analyzing samples from urine, feces, and the throat, they sequenced several viruses, including two new coronaviruses. This data helps scientists explore how viruses from different populations of bats, in Asia and Africa, mix and exchange genetic material. It adds to our storehouse of knowledge for tracking potential sources of genetic material that could modify human coronaviruses — and provides a starting point for investigating how such swapping can yield a virus in a bat able to infect a person.

A researcher from Madagascar at the Chan Zuckerberg Biohub, which provides researchers free access, training, and compute on the Chan Zuckerberg ID platform, and the necessary equipment and supplies to begin work in their own countries through the Bill & Melinda Gates Foundation Grand Challenges Explorations Grants.

Genomic Epidemiology: Tracking Disease

Identifying the culprit behind an emerging disease is the first step in tackling an infectious disease outbreak. But as the COVID-19 pandemic has highlighted, being able to monitor the spread of an infectious disease is critical for public health. Epidemiologists use contact tracing to track the transmission of a pathogen in a population. But there are limitations to that approach, namely the reliance on interviewing individuals and being able to gather an accurate history of where they went, who they came into contact with, and when. Enter genomic epidemiology, which helps supplement standard contact tracing by leveraging sequencing data.

As we have seen with COVID-19, viruses evolve over time. The vast majority of mutations have little effect; many actually hamper the pathogen. Only a very small fraction increase its ability to spread or cause serious disease. But all of these mutations add to an ever-growing barcode that helps us track pathogens and determine whether different cases of a disease are related.

For example, at a prison in California, local authorities had developed a COVID screening process at entry. New arrivals who tested positive were quarantined for the protection of existing inmates. But when an outbreak occurred in the prison, in spite of these precautionary measures, those in charge were left with questions. Had their screening protocol failed, allowing the virus to be brought in from the outside? Or was what they were seeing the result of transmission within the facility?

Local public health investigators obtained samples from individuals who tested positive and sequenced the virus in each sample. What they needed next was a way to compare the results from different people, to better understand whether those people were infected by the same strain of virus, or different strains.

That’s where Chan Zuckerberg GEN EPI (CZ GEN EPI), new software developed by CZI in partnership with CZ Biohub, comes in. CZ GEN EPI (formerly known as Aspen) is an open-source, no-code genomic epidemiology analysis platform for public health. It helps public health departments translate pathogen data into public health insights by revealing potential outbreak sources and variants and by showing an overall picture of how microorganisms are spreading in the community. By eliminating the need for in-house bioinformatics or programming expertise, CZ GEN EPI enables broader access to key analytical tools developed by the scientific community.

“There’s a lack of familiarity with that scale of data and analysis in public health departments,” said Patrick Ayscue, senior biosecurity fellow at CZ Biohub. “We wanted to give them a place to easily analyze those data and interpret the results.”

Jeremy Corrigan, DrPH, at the Humboldt County Public Health Laboratory. Teams at CZI and the CZ Biohub collaborated to launch Chan Zuckerberg GEN EPI (formerly known as COVID Tracker) — a free, open source software tool to help public health officials analyze their COVID-19 data, track the virus and prepare for future crises.

When the local public health investigators used CZ GEN EPI to analyze the prison samples, the COVID strains identified in screened individuals from outside the prison were found to be different from strains identified from inmates who tested positive. This finding suggested that the intake screening and quarantine program had worked, and that efforts to combat the pandemic going forward should focus on preventing spread within the prison. In similar cases, CZ GEN EPI has been used to analyze samples from nursing homes, schools, workplaces, and hospitals experiencing COVID-19 outbreaks to understand how the disease was spreading and inform the public health response.
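The kind of comparison behind that conclusion can be sketched in Python by counting the positions at which two viral genomes differ. This is a simplified illustration, not CZ GEN EPI's method: the sample names, sequences, and one-mutation threshold are invented, and the sketch assumes the genomes have already been aligned to the same length.

```python
from itertools import combinations


def mutation_distance(seq_a, seq_b):
    """Count differing positions in two aligned sequences, skipping unknown bases (N)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b and "N" not in (a, b))


def linked_pairs(samples, max_distance=1):
    """Return pairs of samples whose genomes differ at no more than `max_distance` positions."""
    return [
        (name_a, name_b)
        for (name_a, seq_a), (name_b, seq_b) in combinations(samples.items(), 2)
        if mutation_distance(seq_a, seq_b) <= max_distance
    ]


# Hypothetical aligned genome snippets from four positive tests.
samples = {
    "intake_1": "ACGTTGCAAC",
    "inmate_1": "ACGTAGCATC",
    "inmate_2": "ACGTAGCATC",
    "inmate_3": "ACGTAGCATT",
}

print(linked_pairs(samples))
# [('inmate_1', 'inmate_2'), ('inmate_1', 'inmate_3'), ('inmate_2', 'inmate_3')]
```

In this made-up data the three inmate samples sit within one mutation of each other while the intake sample stands apart, the same pattern that pointed investigators toward spread inside the facility. Real analyses work with whole genomes and typically place samples on a family tree rather than applying a fixed cutoff.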

Metagenomics and genomic epidemiology are two sides of the same coin: two tools that put sequencing to work for the purpose of identifying and monitoring disease.

During the current pandemic, CZI has supported getting these tools into the hands of more people researching and combating the spread of disease. Now the work is to evolve these efforts to prepare for the next one, by making these tools more powerful and available.



Chan Zuckerberg Initiative Science

Supporting the science and technology that will make it possible to cure, prevent, or manage all diseases by the end of the century.