by Sam Berry

In 1918, a new influenza (flu) strain infected nearly a third of the world’s population, leaving tens of millions dead. At the time, relatively little was known about this strain, later called the Spanish Flu—why it was so dangerous, how it spread, even what it was made up of. In the past 100 years, we’ve unveiled the structure of the double-helical DNA molecule that encodes life, how that molecule’s alphabet translates to particular molecular structures, and how combinations of those molecules lead to viral function.

Despite this progress, a century later, no one was prepared for the changes to bat coronaviruses that allowed them to infect humans, causing the most economically disruptive pandemic since 1918. Furthermore, influenza continues to take thousands of lives annually. With what we’ve learned about biology, can we be better prepared for the next pandemic?

Anticipating evolution

Viruses, like any other biological entity, have a genetic sequence of DNA or RNA that determines what they look like and what they’re able to do. DNA and RNA are like molecular blueprints. Groups of letters in these sequences each code for unique molecules called proteins, the shape of which is determined by their particular genetic sequence (Figure 1). In turn, this shape affords the protein particular functions, such as holding the virus together or allowing it to invade host cells.

Figure 1: A molecular code. Within every virus (left) is a genetic code, contained within a pattern of letters in the molecule DNA or RNA. Specific combinations of these letters, called genes (center), give the instructions for building particular proteins (right) that make up the virus. For example, the protein on the right is a spike protein that can be seen on the outside of the virus.

Each time a virus replicates itself, it will copy over its entire genetic sequence, and sometimes it will make tiny mistakes. These mistakes, typically a change in a single one of the thousands of letters in the virus’s sequence, might change the properties of one of the virus’s proteins and therefore change the virus’s capabilities. If this change (or “mutation”) gives the virus a new ability that promotes reproduction, the mutation could become widespread in the viral population over successive generations. Over time, various mutations will build up in a viral strain and fundamentally change its properties. For example, it might be able to infect a new organism or cause a new symptom.

Studying this mutational process in detail has been challenging, but advances have already been made in understanding the ways viruses evolve. One popular but controversial approach, termed “gain-of-function research,” attempts to analyze viral evolution by engineering new viruses in the lab environment.

Gain-of-function research involves making different mutations to natural viruses in a laboratory to understand how these viruses might attain new functions. In theory, this kind of research could be very useful for predicting which animal viruses have the potential to jump to humans, and which existing viruses could easily evolve new functions such as airborne transmission. By finding these possibilities, we could theoretically better prepare for them, by readying vaccines for new strains of virus before they even emerge in nature. 

The risks in this endeavor are enormous, particularly the possibility that a virus engineered to be more deadly could escape from the laboratory. The method has received no shortage of bad press in recent months, particularly in the wake of the revelation that the Wuhan Institute of Virology had been conducting this kind of research on bat coronaviruses prior to the outbreak (although it’s worth noting that there is no evidence that COVID-19 originated from this work). The risks seem particularly untenable given that there is no guarantee that a mutation engineered in the lab would actually be the same as one that would occur naturally, or that the “trial-and-error” approach used in these experiments is actually exhaustive. However, even as the pandemic highlights the potential consequences if gain-of-function research were to go wrong, it also shows the incredible riskiness of doing nothing.

Not always mentioned in this dichotomy between dangerous research and inaction is a broader question: why do we have to blindly engineer new strains in the lab? Why can’t we use our current scientific understanding to predict the evolutionary path the virus will take?

Word games

A classic analogy to the process of molecular evolution is a game proposed by evolutionary biologist John Maynard Smith, in which you start with one word and try to change it one letter at a time into an entirely different word. However, each time you change a letter, the result has to be a legitimate word. For example, if you started with “WORD” and wanted to convert it to “GENE,” the shortest path you could take would be “WORD → WORE → GORE → GONE → GENE.” Other paths (say, moving from “WORD” to “WERD”) would require passing through combinations of letters that do not form a word. This is analogous to trying to evolve from one viral sequence to another by passing through a sequence that does not allow the virus to survive or reproduce. Because of the strict constraint that each intermediate sequence must still be functional, there are only very particular paths that the virus can take from one sequence (and function) to another in order to acquire new properties.

In the case of Maynard Smith’s game, there are 26 letters and four positions in the word. Starting from “WORD,” each letter in the word can be changed to any of 25 different ones; this means that there are 100 letter combinations that are one step away from “WORD” (the possible second words in the game). How many of these four-letter sequences are actually words? Well, you could just make every one of those 100 possible changes, search for each result in the dictionary, and, if you find it, write it down and continue again from there. Repeating this many times, we could form a “network” of all possible words that can be reached starting from “WORD,” and with that network in hand we have solved the game (Figure 2).

Figure 2: Solving the word game. When trying to convert between one word and another one letter at a time, only particular paths (blue) allow you to do so while maintaining a valid word at each step. This is analogous to the process of viral evolution, where mutations can only be accepted if they code for a functional protein and thus functional viruses.
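The dictionary-lookup strategy described above amounts to a breadth-first search through the network of valid words. The sketch below is only an illustration: the tiny `WORDS` set is a hypothetical stand-in for a real dictionary, and the function names `neighbors` and `shortest_path` are invented for this example.

```python
from collections import deque
from string import ascii_uppercase

# Hypothetical toy dictionary; a real solver would load a full word list.
WORDS = {"WORD", "WORE", "GORE", "GONE", "GENE",
         "WARD", "CORD", "CORE", "BORE", "BONE"}


def neighbors(word, dictionary):
    """Yield every dictionary word that differs from `word` in exactly one letter."""
    for i, original in enumerate(word):
        for letter in ascii_uppercase:
            if letter == original:
                continue  # skip the unchanged letter: 25 alternatives per position
            candidate = word[:i] + letter + word[i + 1:]
            if candidate in dictionary:
                yield candidate


def shortest_path(start, goal, dictionary):
    """Breadth-first search; return a shortest list of words from start to goal, or None."""
    previous = {start: None}  # remembers how each word was first reached
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if current == goal:
            path = []
            while current is not None:
                path.append(current)
                current = previous[current]
            return path[::-1]
        for nxt in neighbors(current, dictionary):
            if nxt not in previous:
                previous[nxt] = current
                queue.append(nxt)
    return None  # the goal cannot be reached through valid words


print(" → ".join(shortest_path("WORD", "GENE", WORDS)))
# WORD → WORE → GORE → GONE → GENE
```

For a virus, the membership test `candidate in dictionary` is the hard part: it would have to be replaced by some way of deciding whether a mutated sequence still produces a functional virus.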

Unfortunately, this problem is a bit harder for actual genetic sequences. For one thing, each of the several dozen genes in the virus does not just have four letters; on average, they have around 1,000. Counting moves the same way as in the word game, this changes the number of possible starting moves from 100 to about 3,000 (three alternative letters at each of roughly 1,000 positions) and makes the total space vastly larger. While there are under 500,000 possible four-letter combinations, the number of possible 1,000-letter genes is larger than the number of atoms in the universe. Moreover, using a computer to determine whether each change will lead to a functional protein is not nearly as trivial as looking up a word in the dictionary. Working out the physical laws that determine exactly how a sequence codes for a particular protein shape has been one of the most notoriously difficult problems in modern science. Even if you could predict whether a change to the sequence would alter the shape of the protein in a meaningful way, that doesn’t necessarily tell you what kind of effect it would have on the virus’s function and how it might impact transmission.
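The size comparison can be checked with a few lines of exact integer arithmetic. This is a rough sketch: the figure of about 10^80 atoms in the observable universe is a commonly quoted order-of-magnitude estimate and is an assumption here, not a number from this article. Counting as in the word game (only letters that differ from the current one), a 1,000-letter gene has 3,000 one-step neighbors; counting all four letters at every position gives 4,000 letter choices.

```python
# Search-space sizes for the word game versus a single gene.
word_neighbors = 4 * 25            # 4 positions, 25 alternative letters each
word_space = 26 ** 4               # every possible four-letter string

gene_length = 1000                 # approximate gene length used in the text
gene_neighbors = gene_length * 3   # 3 alternative letters per position (4-letter alphabet)
gene_letter_choices = gene_length * 4  # if the unchanged letter is counted too
gene_space = 4 ** gene_length      # every possible 1,000-letter sequence

# Assumption: ~10**80 atoms in the observable universe (rough, commonly quoted estimate).
atoms_estimate = 10 ** 80

print(word_neighbors)              # 100
print(word_space)                  # 456976
print(gene_neighbors)              # 3000
print(len(str(gene_space)))        # 603 decimal digits
print(gene_space > atoms_estimate) # True
```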

Nature’s logbook

Scientists do have one asset in the game of predicting viral mutations: a logbook of the past billion years that nature has spent playing the sequence-changing game. Over the past two decades, scientists have built up a database of millions of biological sequences, from humans to chickens to microbes. Many of these sequences represent different variants of the same proteins that diverged at different times throughout evolutionary history, forming the leaves of a vast, billion-year evolutionary tree (Figure 3). Considered in the context of everything else we know about these particular proteins, this data contains rich information on the rules of the game and the tricks to beating it – if we know how to look. 

Figure 3: The molecular tree of life. A revolution in genomic sequencing has given us the sequence of millions of different protein variants that together form an evolutionary tree. Shown above are four different structures of the same protein from related viruses. While these proteins evolved from a common ancestor, they have changed over evolutionary history to have a somewhat different shape, and therefore modified functionality.

Fortunately, scientists can now use a variety of methods to extract the rules of the game from this vast logbook. They can infer the sequences of ancient proteins and bring them back to life in the lab, then reconstruct all of the evolutionary steps taken to change function through evolution. They can exploit patterns in which positions in proteins tend to change in tandem with one another in order to predict which mutations can occur, and which will have a significant effect. New machine learning algorithms are increasingly being applied to find patterns in the complex natural history of viral sequences and predict future ones. And, perhaps most importantly, scientists are developing an ability to understand exactly how effects on particular viral proteins translate into effects on the human body by building a better understanding of our own cells and immune systems.
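To give a flavor of what “positions that change in tandem” means, the sketch below scores pairs of positions in a set of aligned sequences by their mutual information. This is one simple, textbook measure chosen for illustration, not necessarily what any particular research group uses; practical methods also have to correct for shared ancestry and limited sampling. The `ALIGNMENT` data and the function name `mutual_information` are hypothetical.

```python
from collections import Counter
from math import log2

# Hypothetical toy alignment (not real data): the same short protein region
# from six variants, one letter per amino acid.
ALIGNMENT = [
    "AKLE",
    "AKLE",
    "ARLD",
    "ARLD",
    "AKLE",
    "ARLD",
]


def mutual_information(alignment, i, j):
    """Mutual information, in bits, between alignment columns i and j."""
    n = len(alignment)
    col_i = Counter(seq[i] for seq in alignment)            # letter counts at position i
    col_j = Counter(seq[j] for seq in alignment)            # letter counts at position j
    pairs = Counter((seq[i], seq[j]) for seq in alignment)  # joint counts
    return sum(
        (count / n) * log2((count / n) / ((col_i[a] / n) * (col_j[b] / n)))
        for (a, b), count in pairs.items()
    )


# Positions 1 and 3 always change together (K with E, R with D).
print(mutual_information(ALIGNMENT, 1, 3))  # 1.0
# Position 0 never changes, so it carries no information about position 1.
print(mutual_information(ALIGNMENT, 0, 1))  # 0.0
```

A high score between two positions hints that a mutation at one may only be tolerated alongside a matching mutation at the other, which narrows down the paths evolution can take.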

Can we prevent the next crisis?

So far, none of these techniques that are more deeply rooted in a basic understanding of evolution and life’s history have been used to successfully predict a viral pandemic or to design vaccines in anticipation of an outbreak; neither has gain-of-function research. While viral forecasting has become more mainstream (the Centers for Disease Control and Prevention, or CDC, now has a full flu forecasting program, for example), these techniques mostly remain constrained to predicting the spread of current strains, not the emergence of new ones. However, recent research with new models that incorporate knowledge about the flu’s evolutionary history could begin to change this, and this is only the tip of the iceberg of what could theoretically be possible with a deeper understanding of viral evolution.

Beyond simply being less risky than gain-of-function approaches, an approach rooted in fundamentally understanding the evolutionary game that viruses are playing offers much more power and applicability. While gain-of-function studies are narrowly confined to the virus they are studying and do not provide the whole story even for that single virus, a stronger basic understanding could be applied to any new virus as it emerges, or even to non-viral threats. For example, bacteria and viruses evolve by the same mechanisms, and the same research that would allow us to understand the emergence of new viruses could also help us combat the evolution of antibiotic resistance among bacteria.

The evolutionary approach is harder and more expensive than traditional gain-of-function research because it requires research into the basic principles underlying evolution, rather than simply trying to predict through trial and error. However, doing nothing has proven more costly: COVID-19 has cost the global economy trillions of dollars, without accounting for the monumental cost in human lives. Since viruses will keep playing this evolutionary game, maybe we should join them and figure out the rules they’re playing by.


Sam Berry is a second-year Ph.D. student in Biophysics at Harvard University.
